Transformer Model Package

There is currently a revolution taking place in the Natural Language Processing (NLP) Deep Learning space. Ever since the "Attention Is All You Need" paper (Vaswani et al., 2017) [1] highlighted the advantages of Attention Mechanisms over LSTMs, the field of NLP has been changing rapidly.

Many State of the Art (SOTA) models have conquered the NLP space since then, each better than the last. Models such as BERT, GPT, GPT-2, and XLNet have shown phenomenal success at NLP tasks such as Text Summarization, Q&A, Named-Entity Recognition (NER), and even Language Translation.

But it all started with the Transformer Model. In this project, I discuss the Transformer Model package [2] that I recently built in TensorFlow 2.0. It is written in an extensible format so that its code base can be reused to build the more complex models listed above.

The first task that the Transformer was popularized for was language translation. Here you can see an example of just that. To accomplish this, you set up the model with the input as the language you want to translate from and the output as the language you want to translate to.
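To make the input/output setup concrete, here is a minimal sketch of how one translation training pair might be prepared. It is not taken from the transformer-model package: the function name, the dictionary keys, the `<start>`/`<end>` markers, and the whitespace tokenizer are all assumptions for illustration. The one idea it shows is the common convention of feeding the decoder the target sentence shifted right by one token, so that it learns to predict the next word.

```python
# Hypothetical helper, not part of the transformer-model package.
START, END = "<start>", "<end>"  # assumed marker tokens


def make_training_example(source_sentence, target_sentence):
    """Build encoder/decoder sequences for one translation pair."""
    # Naive whitespace tokenization, for illustration only.
    source_tokens = source_sentence.lower().split()
    target_tokens = [START] + target_sentence.lower().split() + [END]
    return {
        # The language we translate from goes into the Encoder.
        "encoder_input": source_tokens,
        # The Decoder sees the target shifted right by one position...
        "decoder_input": target_tokens[:-1],
        # ...and is trained to predict the next target token at each step.
        "decoder_target": target_tokens[1:],
    }


example = make_training_example("I am tired", "Estoy cansado")
print(example["encoder_input"])   # ['i', 'am', 'tired']
print(example["decoder_input"])   # ['<start>', 'estoy', 'cansado']
print(example["decoder_target"])  # ['estoy', 'cansado', '<end>']
```

A real pipeline would map these tokens to integer IDs with a subword tokenizer and pad them to a fixed length before handing them to the model.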

While this sort of task was previously accomplished using Bi-Directional LSTMs, the Transformer model improved on both convergence and training time using its Attention Mechanism, specifically its Self-Attention Mechanism.

Traditionally, with Bi-Directional LSTMs, the network would need to loop through each input sequence twice (once forwards, once backwards), one word at a time, in order to optimize its weights via backpropagation. The Transformer model captures these relationships all at once, relative to each word. Think of this as calculating feature importance: each Self-Attention layer calculates the importance of each word relative to every other word in the input.
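The "importance of each word relative to every other word" can be sketched as scaled dot-product attention, the formulation from [1]. The sketch below is plain Python rather than TensorFlow and is not the package's code. It also makes a simplifying assumption: the queries, keys, and values are the raw token vectors, whereas a real layer first passes them through learned projection matrices. The three 2-dimensional embeddings are made-up numbers.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this word to every word, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        row = softmax(scores)  # importance of each word for this query
        weights.append(row)
        # Output is the importance-weighted average of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(row, values)) for j in range(len(values[0]))]
        )
    return outputs, weights


# Toy embeddings (assumed values): "it" is placed close to "animal".
tokens = ["animal", "street", "it"]
x = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
outputs, weights = self_attention(x, x, x)

for token, row in zip(tokens, weights):
    print(token, [round(w, 3) for w in row])
```

Each row of `weights` sums to 1. With these toy vectors, the row for "it" puts more weight on "animal" than on "street", which is the kind of pattern the attention heatmap visualizes.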

You can see an example here, where the darker the box, the more important the word is in relation to the word "it" in the sentence "The animal didn't cross the street because it was too tired".

The Transformer model uses the attention layer outputs calculated in the Encoder block to decode (translate) from one language to another in the Decoder block.

My goal with this project was to provide an extensible code base that could be re-purposed to build more complex models such as BERT, GPT-2, or even XLNet. The difference between these models and the Transformer is how they use the Encoder or Decoder structures and the type of input Masking that they utilize. For more information on their similarities, check out this fantastic article from Jay Alammar! [3]
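The masking difference can be illustrated with two small helpers. This is a hedged sketch in plain Python, not the package's implementation, and the function names and the `pad_id=0` convention are assumptions. A padding mask hides filler tokens. A look-ahead (causal) mask stops a position from attending to later positions, which is what decoder-style models such as GPT-2 rely on. Encoder-style models such as BERT let every position attend to every other non-padding position.

```python
# Hypothetical helpers for illustration; names are not from the package.

def padding_mask(token_ids, pad_id=0):
    """True for real tokens, False for padding (assumes pad_id marks padding)."""
    return [tok != pad_id for tok in token_ids]


def look_ahead_mask(size):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(size)] for i in range(size)]


def apply_mask(scores, mask, fill=-1e9):
    """Replace disallowed scores with a large negative value before softmax."""
    return [
        [s if keep else fill for s, keep in zip(score_row, mask_row)]
        for score_row, mask_row in zip(scores, mask)
    ]


print(padding_mask([5, 7, 0, 0]))  # [True, True, False, False]
for row in look_ahead_mask(3):
    print(row)
# [True, False, False]
# [True, True, False]
# [True, True, True]

masked = apply_mask([[1.0, 2.0], [3.0, 4.0]], look_ahead_mask(2))
print(masked)  # [[1.0, -1000000000.0], [3.0, 4.0]]
```

Filling masked scores with a large negative number means they receive a weight of effectively zero after the softmax, so the masked positions contribute nothing to the attention output.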


Thanks for checking out this project. Make sure to download the transformer-model package that I built [2]:

pip install transformer-model

Images taken from




A special thank you goes out to Jay Alammar for his fantastic article on the transformer model.