This post is about the famous **Transformer**, which has driven remarkable progress in NLP research.

This model was first introduced in *Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008)* and has since been adapted into various successful models.

Let’s see how this model came about and how it is designed in detail.

The original seq2seq model is RNN-based, so it has several shortcomings, such as the vanishing gradient problem and the loss of overall context.

Therefore the attention mechanism was introduced to overcome those limitations, as we saw in the previous posts.

But what if we do not need RNNs at all and can build a model with the attention mechanism only?

This is why the title of the paper which first introduced this concept is “Attention is all you need”.

By maximizing the advantages of attention, we can boost the performance of the seq2seq architecture.

Also, basic RNN models take each input recursively, which leads to a relatively long processing time bounded by $O(L)$ (where $L$ is the maximum length of each sequence). The Transformer instead conducts these calculations more efficiently with matrix multiplication.

Ok, then let’s see the details of this Transformer more specifically.

First, here is an overall description of the Transformer’s architecture.

Like a seq2seq model, a transformer also comprises two main parts: an encoder and a decoder.

But as we can see, the internal designs of the encoder and decoder differ in many ways from those of the original seq2seq model.

Basically, the encoder takes an input, processes it, and passes it to the decoder, which produces the final output.

And the encoder and decoder each consist of $N$ layers, each of which contains several modules such as **Multi-Head Attention**, layer normalization and a Feed-Forward module.

To describe it more simply, a transformer looks like the image below.

In the paper, the encoder and decoder have $6$ identical layers each, but let us just say that a variable $N$ is the number of internal layers in the encoder or decoder.

**Positional Encoding**

When using RNN-based models, we already take the tokens’ order into account, since an RNN is basically specialized in dealing with sequential data.

But as I mentioned, we are going to use attention only, so the model cannot capture the sequential relations by itself.

So we need to add a “positional encoding” to the embedded input before we put it into the encoder and the decoder.

First, we introduce a new variable, $d_{model}$, which is the embedding size of each token.

If we assume that the batch size is $B$ and the sequence length is $L$, then initial input size is $(B, L)$.

Then after embedding layer, this becomes $(B, L, d_{model})$.

Positional encoding is a process that adds to the input another tensor of size $(B, L, d_{model})$, which carries positional information.

And that information can be calculated with the expressions below.

\[PE_{(pos,2i)}=\sin(pos/10000^{2i/d_{model}})\] \[PE_{(pos,2i+1)}=\cos(pos/10000^{2i/d_{model}})\]

$pos$ literally means the “position” of each token in the input sequence and $i$ represents the index of each dimension in an embedded vector.

As we can see, the result differs according to the index: $\sin$ is used if the index is an even number and $\cos$ is used if the index is an odd number.

For a more detailed understanding, let’s see the description.

After making the positional encoding tensor, whose size is the same as that of the input, the two tensors are added and a new input tensor is made with the same size of $(B, L, d_{model})$.

So even if two words are the same, their embeddings can differ according to their relative locations, unlike the original input embedding results.

This makes the model able to detect the sequential information of each token.

We should note that the positional encoding is not a parameter changed by the learning process, but a constant result of fixed calculations.

The reason why a learnable parameter is not used and fixed $\sin$/$\cos$ functions are adopted is twofold. First, we do not have to update the encoding values during training. Second, if we added simply increasing values to the input, the encoding values would grow bigger and bigger along with the sequence length, and this would make the differences between tokens extensive, which leads to difficulty in training and normalization.

Since $sin$ and $cos$ have a range from $-1$ to $1$, we can restrict the max/min value of the positional encoding.

Also, some people might worry that this addition could corrupt the overall embedding of the input.

But enough experiments have shown that this positional encoding not only preserves the important embedded information but also gives useful insights through the different embedded values, such as how two identical words can be represented differently by their relative positions.
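To make the formulas concrete, here is a minimal NumPy sketch of the sinusoidal encoding (the batch dimension is omitted for simplicity, and the function name is mine, not from the paper):

```python
import numpy as np

def positional_encoding(L, d_model):
    # Build the (L, d_model) sinusoidal positional encoding matrix:
    # even dimensions use sin, odd dimensions use cos.
    pe = np.zeros((L, d_model))
    pos = np.arange(L)[:, np.newaxis]             # (L, 1) token positions
    dim = np.arange(0, d_model, 2)                # even dimension indices 2i
    angle = pos / np.power(10000, dim / d_model)  # (L, d_model/2)
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = positional_encoding(L=50, d_model=6)
# every value stays in [-1, 1], and position 0 encodes to [0, 1, 0, 1, 0, 1]
```

In a real model this matrix would simply be broadcast over the batch and added to the $(B, L, d_{model})$ embedding tensor.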

**Self attention**

We talked about the attention mechanism, which consists of three parts: “Query”, “Key” and “Value”.

Basically, in a seq2seq model, this attention is used when making the decoder output by referencing each encoder cell’s hidden state.

But in a transformer, attention is conducted on the encoder/decoder itself internally, so it is called “self attention”.

In self attention for a transformer, $Q$, $K$ and $V$ all come from the word vectors themselves, which means we do not consider a certain time $t$ since this is not a recurrent procedure.

The embedding size is $d_{model}$, but for “Multi-head attention” we should consider an additional variable $num\_heads$ which represents the number of heads.

I will explain what multi-head attention is later, so just keep in mind that each word vector is split into $num\_heads$ smaller vectors.

And this makes the size of each vector $d_{model}/num\_heads$, so this self attention is conducted $num\_heads$ times.

For convenience, let’s say that ${d_{model}}/{num\_heads} = d_k$.

In the paper, $d_{model}=512, num\_heads = 8$.

But for simplicity I’m gonna set $d_{model} = 6, num\_heads = 3$.

Let’s see below figure.

First, we have to make $Q$, $K$, $V$ vectors which are used in self attention mechanism.

So additional parameters $W^Q, W^K, W^V$ are introduced, which are depicted as weight matrices in the picture above.

The size of the input sequence is $(L, d_{model})$ (we are not going to think about the batch size here) and each weight has a size of $(d_{model}, d_k)$.

So by matrix multiplication, $Q$, $K$ and $V$ each have a size of $(L, d_k)$.
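To make these shapes concrete, here is a small NumPy sketch of one head’s projections, using the toy sizes above ($d_{model}=6$, $num\_heads=3$); the matrices are randomly initialized here, whereas in a real model they would be learned:

```python
import numpy as np

L, d_model, num_heads = 4, 6, 3
d_k = d_model // num_heads                  # 2 in this toy setting

rng = np.random.default_rng(0)
x = rng.standard_normal((L, d_model))       # embedded input (plus positional encoding)

# one head's projection matrices, each of size (d_model, d_k)
W_Q = rng.standard_normal((d_model, d_k))
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_k))

Q, K, V = x @ W_Q, x @ W_K, x @ W_V         # each of size (L, d_k)
```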

Some might wonder why this process is necessary: why don’t we just conduct self attention with the original input sequence?

Let’s recall that in a seq2seq model, we use the hidden states as the query, key and value for attention instead of directly putting in the embedded word vectors.

Each hidden state has sequential context of an input, so we can reflect this context information into attention.

Likewise, to make use of this context information, we train additional weight parameters to make $Q$, $K$ and $V$ more informative.

Now, it is time to do actual attention.

In the transformer, “Scaled Dot-Product Attention” is used, which is similar to dot-product attention but has an additional scaling factor $\sqrt{d_k}$.

The overall calculation is as follows.

\[Attention(Q,K,V) = softmax(\frac{QK^T}{\sqrt{d_k}})V\]

First, $Q$ and $K^T$ are multiplied to produce the attention scores.

Since $Q$ and $K$ each have a size of $(L, d_k)$, the size of $QK^T$ is $(L, L)$.

That is, this works like an attention score matrix that reflects how much each token should be focused on when looking at each word.

Then each element in this matrix is divided by $\sqrt{d_k}$.

This is because we need to scale the scores down to prevent the values from getting extensively large; otherwise the differences between values become large, and too small gradients are passed during back-propagation.

This scaled matrix becomes the actual attention score, and as in the attention mechanism for seq2seq models, the softmax function is applied to get the attention distribution.

Finally, this distribution matrix, which has size of $(L, L)$, is multiplied with $V$.

Then we can get the attention value, whose size is $(L, d_k)$.
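The whole scaled dot-product attention computation can be sketched in a few lines of NumPy (a rough sketch; the max subtraction is only the usual numerical-stability trick for softmax and does not change the result):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (L, L) attention scores
    scores -= scores.max(axis=-1, keepdims=True)     # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # (L, L) attention distribution
    return weights @ V                               # (L, d_k) attention values

rng = np.random.default_rng(0)
L, d_k = 4, 2
Q, K, V = (rng.standard_normal((L, d_k)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)          # size (L, d_k), as expected
```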

**Multi-head attention**

Now, let’s talk about what the multi-head attention is.

If a person examines an object, even if one tries to look at it as carefully as possible, he/she can only see a part of it.

But if there are multiple people, they can catch various aspects of the object.

Multi-head attention literally means that multiple heads (people) conduct their own attention on the input sequence.

So by doing attention several times, we can get more informative outputs.

The figure above shows us how the final output of the self attention module is made.

After conducting $num\_heads$ self attentions, we have $num\_heads$ attention values.

By concatenating them, we eventually obtain an $(L, d_{model})$ matrix.

Then an additional weight matrix $W_0$ of size $(d_{model}, d_{model})$ is multiplied, and finally we have the final output matrix.

The point here is that the size of the input and that of the output are the same.

In other words, the size of the input is maintained after going through the multi-head attention layer.
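Putting the pieces together, here is a NumPy sketch of the full multi-head attention flow (split into heads, attend, concatenate, project); the helper name and the flat list of per-head weights are my own simplification:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, head_weights, W_0):
    # head_weights: one (W_Q, W_K, W_V) tuple per head
    heads = []
    for W_Q, W_K, W_V in head_weights:
        Q, K, V = x @ W_Q, x @ W_K, x @ W_V
        d_k = Q.shape[-1]
        heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)  # (L, d_k) per head
    concat = np.concatenate(heads, axis=-1)                # (L, d_model)
    return concat @ W_0                                    # (L, d_model): input size kept

L, d_model, num_heads = 4, 6, 3
d_k = d_model // num_heads
rng = np.random.default_rng(0)
x = rng.standard_normal((L, d_model))
head_weights = [tuple(rng.standard_normal((d_model, d_k)) for _ in range(3))
                for _ in range(num_heads)]
W_0 = rng.standard_normal((d_model, d_model))
out = multi_head_attention(x, head_weights, W_0)           # same size as x
```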

Let’s get to one more point.

Why are those whole procedures able to make use of the advantages of multi-head attention?

Why are the split and concatenation steps necessary?

Can’t we just conduct self-attention on the whole input, which seems to be the same as the above method?

Obviously, the answer is that it is different.

Each word is represented as an embedded vector, and each dimension in that vector tells us various semantic features of that word.

That is why we split the vector into equally sized segments and apply attention to each segment, so that each parameter concentrates on the specific features in its charge.

And the softmax function normalizes the values under the closed assumption that the dimensions it is seeing right now are all the dimensions it has to look at.

So by restricting the features in each word, the attention mechanism can make more specific, feature-focused attention scores and attention values.

So far, we have discussed important concepts that can be found in the architecture of the transformer.

Now let’s see the encoder & decoder to study how these are constructed and how they are different.

**Encoder**

As I mentioned before, the encoder and decoder each have their own $N$ layers, and each layer is constructed like the figure above.

We already looked at what positional encoding and multi-head attention are, so I’m gonna skip them.

After the multi-head attention, we have “Residual connection” and “Layer normalization”.

To put it simply, the residual connection, which is often used in Computer Vision models, adds the original data to the output of a layer.

If we represent the layer as a function $F$, then the residual connection $H(x)$ is calculated by $H(x)=x+F(x)$.

Since the size remains the same after multi-head attention, we can perform addition.

This makes the training much better by helping the model overcome the vanishing gradient problem and optimization difficulties.

Layer normalization is similar to batch normalization in that it normalizes the output of a layer to have zero mean and unit variance.

While batch normalization normalizes over each batch, layer normalization is calculated over the feature dimensions of each sample.

This helps reduce gradient vanishing/explosion by controlling the overall scale of values in the model.
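The two steps can be sketched as follows (a simplification: real layer normalization also has learnable gain and bias parameters, which are omitted here):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalize each token vector over its feature dimension
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer_out):
    # residual connection followed by layer normalization: LayerNorm(x + F(x))
    return layer_norm(x + sublayer_out)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 6))                      # sub-layer input
out = add_and_norm(x, rng.standard_normal((4, 6)))   # same shape, normalized rows
```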

Finally, we should see Position-wise Feed Forward layer.

This layer is quite simple since it consists of two linear transformations and a ReLU.

\[FFNN(x)=\max(0, xW_1+b_1)W_2+b_2\]

Here, we introduce another hyperparameter $d_{ff}$, which is $2048$ in the paper.

The first linear transformation maps the input of size $(L, d_{model})$ to a matrix of size $(L, d_{ff})$.

Then it goes through ReLU for activation, and another linear transformation makes this an output of size $(L, d_{model})$, which is identical to the original input shape.

So this layer is just the kind of combination of linear layers that we usually see in various models to give them non-linearity.
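As a sketch, the position-wise feed-forward layer is just this (toy sizes instead of the paper’s $d_{model}=512$, $d_{ff}=2048$; the weights are random here rather than learned):

```python
import numpy as np

L, d_model, d_ff = 4, 6, 24

rng = np.random.default_rng(0)
W_1, b_1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W_2, b_2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)

def position_wise_ffn(x):
    # FFNN(x) = max(0, x W_1 + b_1) W_2 + b_2, applied to every position identically
    hidden = np.maximum(0.0, x @ W_1 + b_1)   # (L, d_ff), ReLU activation
    return hidden @ W_2 + b_2                 # back to (L, d_model)

out = position_wise_ffn(rng.standard_normal((L, d_model)))  # input shape preserved
```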

Eventually, after going through all these layers, the size of the sequence is still maintained.

This same-size tensor is passed to the next layer over and over again.

After the $N$th encoder layer, finally the output is passed to the decoder.

**Decoder**

Basically, the decoder layer has an additional module named “Masked Multi-Head Attention”.

As with the decoder in a seq2seq model, the desired output sequence should be fed in during the training process.

Of course, in the testing phase we do not know what the correct answer is, so as in a seq2seq model we just put in the starting token.

In the decoder, the attentions are implemented a little bit differently.

First, masked multi-head attention is similar to the encoder’s multi-head attention but has additional masking.

This is because the decoder should not know the entire answer.

When using an RNN for a decoder, a sequence is processed recursively, so even if the model wants to see other words at the back in advance, it cannot.

But in the transformer, the decoder can see the whole sequence, which is not right, since the decoder should basically make the output in a sequential manner.

If the decoder cheats an answer in advance, training cannot be done properly.

That’s why additional masking is implemented to prevent attention to rightward information.

More specifically, inside the scaled dot-product attention, we put $-\infty$ into certain spots of the attention score matrix before the softmax so that those spots are not focused on.

As we can see, the $i$th token can only pay attention to the $1$st~$i$th tokens, since the rest of them are masked with $-\infty$.
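A small NumPy sketch of this masking (the scores matrix is random here just to show the mechanics):

```python
import numpy as np

L = 4
rng = np.random.default_rng(0)
scores = rng.standard_normal((L, L))               # raw scores, i.e. QK^T / sqrt(d_k)

mask = np.triu(np.ones((L, L), dtype=bool), k=1)   # True strictly above the diagonal
scores[mask] = -np.inf                             # block attention to future tokens

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)     # softmax over each row
# row i now puts exactly zero weight on tokens j > i
```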

And when testing, we do not have any information about the answer in the first place, so masking is not necessary.

And in the second attention in the decoder, we use the encoder’s output.

This is the same as in the original seq2seq model, where attention uses the decoder hidden states as $Q$ and the encoder hidden states as $K$ and $V$.

So we use the output from the encoder as $K$ and $V$, and calculate the attention scores with $Q$, which comes from the output of the masked multi-head attention in the decoder.

The rest of the procedure is the same as that of the encoder.

So this is the end of this post about “Attention Is All You Need”.

We have discussed the limitations of the original seq2seq model, the basic ideas implemented in a transformer model, such as scaled dot-product attention and multi-head attention, and its overall architecture.

It is obviously difficult for beginners, and it took me quite a long time to understand it too.

For further study, in the next post we will actually build a transformer to train and test it.