Analyzing Transformers and Implementing Multi-Head Attention

Overview

  • When I was doing my semester abroad at the University of Edinburgh, I worked on a course project about machine translation.
  • The task was to improve an existing baseline machine translation model, raising its BLEU score and other evaluation metrics.
  • To do so, I implemented and tested two approaches from the research literature: the lexical attention model described by Nguyen and Chiang (2017) and the multi-head attention mechanism presented by Vaswani et al. (2017). Both were implemented in PyTorch (a simplified sketch of multi-head attention follows this list), and I also spent considerable time analyzing and optimizing the training data.
  • As a result, the performance of the machine translation model improved considerably. I also learned how different model architectures affect training and inference time.
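
As an illustration of the second approach, below is a minimal PyTorch sketch of multi-head attention in the spirit of Vaswani et al. (2017). The class, parameter, and variable names are my own choices for this example and are not taken from the project's codebase.

```python
# Minimal multi-head attention sketch (illustrative names, not the project's code).
import math
import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int, dropout: float = 0.1):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_head = d_model // num_heads
        self.num_heads = num_heads
        # Separate linear projections for queries, keys, values, and the output.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, query, key, value, mask=None):
        # query/key/value: (batch, seq_len, d_model)
        batch_size = query.size(0)

        def split_heads(x):
            # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_head)
            return x.view(batch_size, -1, self.num_heads, self.d_head).transpose(1, 2)

        q = split_heads(self.w_q(query))
        k = split_heads(self.w_k(key))
        v = split_heads(self.w_v(value))

        # Scaled dot-product attention: softmax(QK^T / sqrt(d_head)) V
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_head)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = self.dropout(torch.softmax(scores, dim=-1))
        context = torch.matmul(attn, v)

        # Concatenate the heads and apply the output projection.
        context = context.transpose(1, 2).contiguous()
        context = context.view(batch_size, -1, self.num_heads * self.d_head)
        return self.w_o(context)
```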

Context

  • 🗓️ Timeline: 01/2022 — 05/2022
  • 🛠️ Project type: Course project in Natural Language Understanding, Generation and Machine Translation @ University of Edinburgh, UK

Technologies/Keywords

  • Python
  • PyTorch
  • Machine Translation (MT)
  • Transformers
  • Lexical attention
  • Multi-head attention
  • Paper implementation

Main Learnings

  • I learned to turn machine-learning papers into working implementations. This requires understanding the papers in detail and translating their formulas and diagrams into code.
  • In addition, I learned to explore and extend an existing codebase and gained a deep understanding of transformers, multi-head attention, and several related topics (e.g., beam search; a small illustrative sketch follows this list).
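
Beam search was one of those related topics; the following is a minimal, framework-agnostic sketch of the idea. The `step_fn`, `bos_id`, and `eos_id` names are hypothetical placeholders introduced for this example, not identifiers from the project's codebase.

```python
# Minimal beam search sketch: step_fn maps a token prefix to log-probabilities
# of the next token (assumed non-empty). Names here are illustrative only.
from typing import Callable, Dict, List, Tuple


def beam_search(
    step_fn: Callable[[List[int]], Dict[int, float]],
    bos_id: int,
    eos_id: int,
    beam_size: int = 5,
    max_len: int = 50,
) -> List[int]:
    """Return the highest-scoring token sequence found by beam search."""
    # Each hypothesis is a (tokens, cumulative log-probability) pair.
    beams: List[Tuple[List[int], float]] = [([bos_id], 0.0)]
    finished: List[Tuple[List[int], float]] = []

    for _ in range(max_len):
        candidates: List[Tuple[List[int], float]] = []
        for tokens, score in beams:
            # Expand every live hypothesis by every possible next token.
            for token_id, log_prob in step_fn(tokens).items():
                candidates.append((tokens + [token_id], score + log_prob))
        # Keep only the beam_size best partial hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == eos_id:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break

    finished.extend(beams)
    # Length-normalize so shorter hypotheses are not unfairly favored.
    best, _ = max(finished, key=lambda c: c[1] / max(len(c[0]) - 1, 1))
    return best
```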