Transformer Architecture explained in depth

Mert Oğuzhan
7 min read · Nov 29, 2024


What is a Transformer?

The Transformer is an architecture used in deep learning, achieving significant success particularly in fields such as Natural Language Processing (NLP) and Computer Vision. Introduced by Google in the 2017 paper titled “Attention is All You Need,” this architecture diverges from traditional sequential models by learning relationships between data points more efficiently.

The Transformer architecture utilizes an attention mechanism to model the relationship of each input element with every other element. This structure eliminates the need for sequential processing of data, enabling simultaneous computation across the entire input. It offers notable advantages, especially when working with long sequences (lengthy sentences or large images).

One of the key components of the Transformer is the self-attention mechanism, which evaluates the dependencies between each element and all others, determining which elements are more important. In language models, this mechanism enhances the understanding of how words within a sentence influence one another, while in vision tasks, it helps comprehend the relationships among pixels in an image.
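To make this concrete, here is a minimal NumPy sketch of scaled dot-product self-attention, the core operation behind the mechanism described above. The tiny dimensions, the random weights, and the softmax helper are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence X of shape (seq_len, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # project inputs to queries, keys, values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # similarity of every token with every other token
    weights = softmax(scores, axis=-1)           # attention weights, one row per query token
    return weights @ V                           # weighted sum of value vectors

# Toy example: 5 tokens ("Once upon a time in"), d_model = 8 (the paper uses 512)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)    # (5, 8)
```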

Architecture Explanation

Transformer Architecture (1.1)

Input Embedding

It is necessary to encode words in a way that computers can understand. Therefore, we create vector representations of the words with a fixed size (512 as stated in the published paper ‘Attention is all you need’) containing numerical values.

Random Input Embedding (1.2)

In the image above, the words of the example sentence ‘Once upon a time in’ are represented by vectors that are random and unique for each word (a smaller vector size is used here for simplicity since 512 is a large number). These values are updated as the transformer architecture learns the context and relationships between the words.
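A minimal sketch of this lookup, assuming a toy vocabulary and a 6-value embedding for readability; in a real model the table is a learned layer of width 512 whose values are updated during training.

```python
import numpy as np

# Hypothetical toy vocabulary for the example sentence
vocab = {"Once": 0, "upon": 1, "a": 2, "time": 3, "in": 4}
d_model = 6                      # the paper uses 512; 6 keeps the example readable

rng = np.random.default_rng(42)
# Randomly initialised embedding table; in practice these values are learned
embedding_table = rng.normal(size=(len(vocab), d_model))

sentence = "Once upon a time in".split()
input_ids = [vocab[w] for w in sentence]
input_embeddings = embedding_table[input_ids]    # shape: (5, 6), one row per word
print(input_embeddings.shape)
```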

2D vector representation (https://medium.com/machine-learning-t%C3%BCrkiye/transformer-encoder-yap%C4%B1s%C4%B1-self-attention-c44564d0b74b) (1.3)

In the image above, the positions and clustering of the word vectors in space (shown here in 2D) are illustrated.

Positional Encoding

In the RNN architecture, the words in the input sentence are processed sequentially, one by one. However, this mechanism does not exist in the transformer architecture. Instead of feeding the sentence word by word, all the words in the sentence are fed to the network in parallel. Feeding words in parallel helps reduce training time and also aids in learning long-term dependencies. For this reason, we define the order of words in the sentence using positional embeddings.

For example, consider the sentence ‘Take care of yourself.’ Let’s assign positional vectors to each word:
For ‘Take’: [20, 0, 0, 0, 0, 0, 0]
For ‘care’: [300, 0, 0, 0, 0, 0, 0]
For ‘of’: [50, 0, 0, 0, 0, 0, 0]
For ‘yourself’: [1, 0, 0, 0, 0, 0, 0]

The problem here is that these positional values are spread very far apart, which pushes words that are semantically or contextually related far away from one another in the vector space. To address this issue, the Attention Is All You Need paper proposed positional embeddings built from sine and cosine functions, whose values stay between -1 and 1 across a spectrum of frequencies. This keeps words that are contextually and semantically close also close in the vector space, and it lets the model know the position of each word even though the sentence is processed in parallel.

Equation of positional encoding (1.3)
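For reference, the positional encoding formula from the paper (the content of figure 1.3) can be written as:

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$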

The working principle is as follows: first, for each word in the input sentence — we will use the sentence ‘once upon a time’ here — two-dimensional embedding vectors are created. These word embedding vectors represent a point in space.

Note !!:

The Transformer architecture does not use time steps. In RNN structures there are time steps, and the sentence is processed sequentially. When a Transformer processes the entire sentence at once to speed things up, the notion of order is lost. Moreover, a single embedding vector does not capture all of a word's meanings. For example, the Turkish word 'yüz' (meaning 'hundred' or 'face') in the sentence 'After a hundred kilometers, the only thing he would see would be a pale face' would have the same word vector in both roles, even though its position and meaning in the sentence are different. This is where positional embedding becomes important.

Positional Embedding Calculation

When calculating the positional embedding, each word is considered individually. We will create a positional embedding vector using the numerical values from the input embedding. In this example, each word's embedding vector has 6 values (the published paper uses 512; for simplicity, we use 6). The values in this vector are indexed 0, 1, 2, 3, 4, 5. If the index is even, the sine function is used; if the index is odd, the cosine function is used. For example, let's calculate the positional embedding for the word 'Once'.

Since the word ‘Once’ is at the beginning of the sentence, the ‘1’ in PE(1,0) represents the position of the word in the sentence.

index 0 => PE(1,0);

(1.4)

index 1 => PE(1,1);

(1.5)

i: represents the index of the numerical value within that vector.
dim: represents the dimension of that vector (we use a 6-dimensional vector here, while the paper used 512 as mentioned earlier).
pos: represents the position of the calculated word in the sentence.
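A short Python sketch of this calculation, keeping the text's conventions (6 values per vector, sine at even indices, cosine at odd indices, positions starting at 1); this is a simplification of the paper's exact setup.

```python
import numpy as np

def positional_encoding(pos, d_model):
    """Sinusoidal positional encoding for one position: sine at even indices, cosine at odd indices."""
    pe = np.zeros(d_model)
    for idx in range(d_model):
        i = idx // 2                                   # pair index used in the exponent
        angle = pos / (10000 ** (2 * i / d_model))
        pe[idx] = np.sin(angle) if idx % 2 == 0 else np.cos(angle)
    return pe

# Worked example for 'Once' at position 1 with a 6-dimensional vector, as in the text
print(positional_encoding(pos=1, d_model=6).round(3))
# index 0 -> sin(1) ≈ 0.841, index 1 -> cos(1) ≈ 0.540, later indices change more slowly
```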

(1.6)

The created positional embedding is added to the input embedding (word embedding) matrix.

(1.7)

The positional encoding values are added to the input embedding values.

The matrix on the right side of the image above is the result of adding the embeddings (1.7) (input embedding + positional embedding). This matrix will enter the encoder.
NOTE: The subsequent calculations will be done with a word vector dimension of 512 for the sentence ‘Once upon a time in’.

(1.8)

Multiple heads are created here. Each head captures a different relation or context within the sentence. Similar to the feature maps in CNN architectures, each head extracts a different kind of information from the sentence.
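A rough, self-contained sketch of this head-splitting idea; the eight heads and the 512-wide vectors follow the paper, but the random projection weights are purely illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, num_heads=8):
    """Split d_model into num_heads sub-spaces, attend in each, then concatenate and project."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    rng = np.random.default_rng(0)
    heads = []
    for _ in range(num_heads):
        # Each head has its own projections, so it can focus on a different relation in the sentence
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(d_head))
        heads.append(weights @ V)
    W_o = rng.normal(size=(d_model, d_model))          # output projection after concatenation
    return np.concatenate(heads, axis=-1) @ W_o

X = np.random.default_rng(1).normal(size=(5, 512))     # 'Once upon a time in' with d_model = 512
print(multi_head_attention(X).shape)                   # (5, 512)
```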

NOTE: Normalization is applied to the multi-head attention (MH-A) output so that there are no large differences between the numerical values.
In the image below, the MH-A output from the encoder is multiplied by weight matrices and enters the Decoder section as the Key and Value.
The next stages are the training process and the inference process.

(1.9)

Training

(2.0)

In the image above (2.0), the training process is shown from a broad perspective for an encoder-decoder Transformer translating the sentence 'Hello my dog is cute' into French ('salut mon chien est mignon'). Step by step:

  • In the training process of the Encoder-Decoder Transformer architecture, unique input_id values are assigned to the words in the input sentence.
  • As explained above, word embedding and positional embedding are applied.
  • Then, after the multi-head attention computation described above, a series of normalizations and fully connected layers are applied.
  • The output of the encoder is a matrix of shape (number of words, 512).
  • From this matrix, Key and Value matrices are created and prepared to be fed into the Decoder.
  • The sentence '<sos> salut mon chien est mignon' entering the Decoder is processed similarly: input_ids, word embeddings, and positional embeddings are applied to form a matrix.
  • Unlike in the encoder, the self-attention over this matrix is masked, so each position can only attend to earlier positions.
  • The multi-head attention mechanism then operates between the Query matrix produced from this decoder matrix and the Key and Value matrices from the encoder.
  • After multi-head attention, a series of normalizations and fully connected layers are applied.
  • The Decoder output passes through a final normalization and fully connected layer, producing an output matrix.
  • This output matrix is flattened and converted into logits over the vocabulary, and the loss between the predicted tokens and the ground-truth ids (cross-entropy in practice) is computed, as shown in the sketch after this list.
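A hedged PyTorch sketch of one such teacher-forcing training step. The toy vocabulary, the token ids for the two sentences, and the use of the built-in nn.Transformer as a stand-in for the encoder-decoder stack described above are all assumptions for illustration; positional encoding is omitted for brevity.

```python
import torch
import torch.nn as nn

# Toy setup: the ids below are made up; a real tokenizer would supply them
vocab_size, d_model = 100, 512
src_ids = torch.tensor([[5, 23, 7, 41, 16]])           # "Hello my dog is cute"                 (batch, src_len)
tgt_ids = torch.tensor([[1, 52, 8, 61, 33, 70, 2]])    # "<sos> salut mon chien est mignon <eos>"

embed = nn.Embedding(vocab_size, d_model)              # input embedding (positional encoding omitted here)
model = nn.Transformer(d_model=d_model, nhead=8, batch_first=True)
to_logits = nn.Linear(d_model, vocab_size)

tgt_in, tgt_out = tgt_ids[:, :-1], tgt_ids[:, 1:]      # teacher forcing: shift the target by one token
mask = model.generate_square_subsequent_mask(tgt_in.size(1))   # causal (masked) attention in the decoder

out = model(embed(src_ids), embed(tgt_in), tgt_mask=mask)      # encoder-decoder forward pass
logits = to_logits(out)                                        # (batch, tgt_len, vocab_size)
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), tgt_out.reshape(-1))
loss.backward()                                                # gradients for one training step
print(float(loss))
```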

Inference

  • For the inference part of the Transformer architecture, a given input sentence is translated into French word by word.
  • The encoder computation performed during training is the same at inference. In subsequent time steps, the encoder output is retrieved from the cache rather than recomputed.
  • The resulting Key and Value matrices enter the decoder.
  • After normalization, linear, and softmax operations, a word is predicted from the vocabulary, and this word is then fed back as input to the decoder; the same mathematical operations are repeated with this word's vector (see the sketch after this list).
  • When the model predicts the <eos> token, the sentence is considered complete, and the prediction is finished.
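Continuing the same assumed PyTorch setup, a minimal greedy decoding loop could look like the following; since the weights here are untrained, the generated ids are meaningless, and the point is only the control flow (encode once, decode token by token, stop at <eos>).

```python
import torch
import torch.nn as nn

vocab_size, d_model, sos_id, eos_id = 100, 512, 1, 2    # assumed ids, matching the training sketch
embed = nn.Embedding(vocab_size, d_model)
model = nn.Transformer(d_model=d_model, nhead=8, batch_first=True)
to_logits = nn.Linear(d_model, vocab_size)

src_ids = torch.tensor([[5, 23, 7, 41, 16]])            # "Hello my dog is cute"
generated = [sos_id]

model.eval()
with torch.no_grad():
    memory = model.encoder(embed(src_ids))              # encoder runs once; its output is reused each step
    for _ in range(20):                                 # hard cap so the loop always terminates
        tgt = embed(torch.tensor([generated]))
        tgt_mask = model.generate_square_subsequent_mask(tgt.size(1))
        out = model.decoder(tgt, memory, tgt_mask=tgt_mask)   # attend to the prefix and the encoder output
        next_id = int(to_logits(out[:, -1]).argmax())   # pick the most likely next word (greedy)
        generated.append(next_id)
        if next_id == eos_id:                           # stop when <eos> is predicted
            break

print(generated)
```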
(2.1)
(2.2)
(2.3)

In this way, the inference process continues word by word and is completed when the <eos> token is predicted.

I hope I have been able to explain the mathematics behind the Transformer architecture well.
