BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

BERT stands for “Bidirectional Encoder Representations from Transformers” and is one of the most notable NLP models today.

When it first came out in late 2018, BERT achieved state-of-the-art results on $11$ NLU (Natural Language Understanding) tasks, and The New York Times introduced it under the headline “Finally, a Machine That Can Finish Your Sentence”.

The ideas behind it and its remarkable performance have inspired researchers around the world.

Since then, many people have applied and improved this model, which has resulted in several derivative models such as ALBERT, RoBERTa, and DistilBERT.

BERT is pre-trained with general-purpose language modeling objectives and then fine-tuned on various downstream tasks.

Its architecture is based on the transformer, but it consists only of stacked transformer encoder layers.

So it is not designed for generation tasks, since unlike the original transformer it has no decoder.

But with more encoder layers and unique pre-training techniques, it achieves much better results on various language understanding tasks.

A simple description of BERT is as follows.

First, the input sequence is tokenized, embedded, and fed into the model.

After passing through the stacked encoder layers, the hidden states of the last encoder layer come out as the output, just like the encoder output of the original transformer.

This is the output of the basic structure, and these hidden vectors are then used in different ways by additional layers, depending on the purpose of fine-tuning.

We are gonna look at these in more detail later.

Comparison with other models

But before looking at the details, let’s first check how BERT differs from other models.

In the paper, BERT is compared with two models.

ELMo (Embeddings from Language Models)

ELMo is a “Contextualized Word Embedding” model built on a pre-trained language model.

It was introduced to overcome the limitations of earlier embedding models such as Word2Vec or GloVe, which cannot reflect the different meanings of the same word when it is used in different contexts or positions.

ELMo uses bidirectional language modeling: a forward and a backward language model are pre-trained separately, and the embeddings from the two directions are concatenated.

But this is not deeply bidirectional, since the embedding is simply a concatenation of the outputs of two unidirectional language models.

In other words, the embedding vector reflects the context in each direction separately but not in both directions simultaneously, and this can limit performance on token-level tasks such as NER (Named Entity Recognition), which require a deeper understanding of each token.

BERT, on the other hand, is deeply bidirectional since it is pre-trained with “Masked Language Modeling”, in which the model must predict masked words by referring to the context on both sides at the same time.

GPT (Generative Pre-trained Transformer)

GPT is one of the most powerful generative models and is also based on the transformer architecture.

GPT is pre-trained only with basic left-to-right language modeling, since it is more focused on text generation.

In fine-tuning, only a small task-specific output layer is added, and the pre-trained parameters are tuned together with it.

BERT, on the other hand, is pre-trained with deeply bidirectional language modeling since it is more focused on language understanding, not generation.

BERT is fine-tuned in the same way: all pre-trained parameters are updated along with the added output layer, so the main difference between the two models lies in the pre-training objective rather than in the fine-tuning scheme.

Updating the whole model makes fine-tuning a little heavier than training only an output layer, but it helps to get better performance on NLU tasks.
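The difference in available context can be pictured as a visibility mask over positions. The sketch below is only an illustration, and the function names are hypothetical rather than taken from the paper or any library: in a left-to-right model, position $i$ may attend to positions $0$ to $i$, while an encoder such as BERT lets every position attend to the whole sequence.

```python
def left_to_right_mask(n):
    """Row i marks which positions token i may attend to in a left-to-right LM."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every token may attend to every position, as in a transformer encoder."""
    return [[1] * n for _ in range(n)]


# Print both masks for a toy sequence of 4 tokens.
for row in left_to_right_mask(4):
    print(row)
print()
for row in bidirectional_mask(4):
    print(row)
```

The lower-triangular pattern of the first mask is what restricts a left-to-right model to the left context, while the second mask has no such restriction.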

Input representation

As we know, positional encoding is applied to the input so that the transformer can use the position of each token in the given input.

Since BERT is also based on the transformer, its input sequence needs similar additional processing.

The input embeddings consist of three parts: token embeddings, segment embeddings, and position embeddings.

These three embeddings are added together element-wise before the input is fed into the model.

Token embeddings are the usual word embeddings for each token (BERT uses WordPiece tokens), as in most NLP tasks.

Position embeddings play the same role as the positional encodings in the transformer that we reviewed in previous posts, although BERT learns them instead of using fixed sinusoidal functions, so I will omit the details.

The new part is the segment embedding, which is related to BERT’s unique pre-training method.

As I will elaborate later, one of BERT’s pre-training tasks is “Next Sentence Prediction”.

In other words, it learns the relation between two sentences by detecting whether two given sequences are consecutive or not.

To do that, the two segments need to be distinguished, and this is why segment embeddings exist.

So if sentences $0$ and $1$ are given, every position in a sentence receives the learned embedding of its segment, $0$ or $1$, which is added to the token embedding at that position.

Additionally, the special tokens $[CLS]$ and $[SEP]$ are inserted.

The $[SEP]$ token separates the two segments: it is inserted at the end of each sentence to mark its last position.

The $[CLS]$ token is placed at the very beginning of the input, and I will explain its use later.
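The input construction described above can be sketched in a few lines. This is a minimal illustration under assumptions, not BERT's actual implementation: the helper names (`build_inputs`, `make_table`, `embed`) are hypothetical, the embedding tables are filled with random numbers in place of learned weights, and the embedding size of 4 is chosen only to keep the output small.

```python
import random


def build_inputs(tokens_a, tokens_b):
    """Return tokens, segment ids and position ids for a sentence pair."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


def make_table(keys, dim, rng):
    """Toy embedding table: one random vector per key (stand-in for learned weights)."""
    return {k: [rng.uniform(-1, 1) for _ in range(dim)] for k in keys}


def embed(tokens, segment_ids, position_ids, tok_table, seg_table, pos_table):
    """Element-wise sum of token, segment and position embeddings."""
    return [
        [a + b + c for a, b, c in zip(tok_table[t], seg_table[s], pos_table[p])]
        for t, s, p in zip(tokens, segment_ids, position_ids)
    ]


rng = random.Random(0)
dim = 4  # toy embedding size, an assumption for readability
tokens, segment_ids, position_ids = build_inputs(
    ["my", "dog", "is", "cute"], ["he", "likes", "play", "##ing"]
)
tok_table = make_table(sorted(set(tokens)), dim, rng)
seg_table = make_table([0, 1], dim, rng)
pos_table = make_table(position_ids, dim, rng)
inputs = embed(tokens, segment_ids, position_ids, tok_table, seg_table, pos_table)

print(tokens)
print(segment_ids)
print(len(inputs), "vectors of size", dim)
```

Each position ends up with a single vector, so the encoder never sees the three embeddings separately.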

Pre-training

BERT is pre-trained with two unique tasks, “Masked Language Modeling” and “Next Sentence Prediction”.

Masked Language Modeling

Basically, Language Modeling (LM) is the task of predicting the next word given the previous words.

Previously, language modeling was conducted in one direction, or, even when conducted in both directions, the two contexts were simply concatenated as in ELMo.

Masked LM works in a deeper setting: it masks random words and makes the model predict them by referring to the entire sequence, with no particular direction.

About $15$% of the tokens are selected for prediction, and the basic idea is to replace each selected token with the $[MASK]$ token.

But we do not simply turn all of the selected tokens into $[MASK]$. Instead, they are handled as follows.

- $80$% of the selected tokens are replaced with $[MASK]$.
- $10$% of the selected tokens are replaced with a random token.
- $10$% of the selected tokens are left unchanged.

The reason why these variations are necessary is that we need to reduce the gap between pre-training and fine-tuning.

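Specifically, the $[MASK]$ token never appears during fine-tuning, so a model that only ever saw $[MASK]$ at the positions it had to predict would learn representations that do not match what it sees later.

So some of the selected positions keep a real token, either the original one or a random one, and since the model cannot tell which positions it will be asked to predict, it has to maintain a useful contextual representation of every input token.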
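The selection and replacement rule described above can be sketched as follows. This is a minimal sketch under assumptions: `mask_tokens` is a hypothetical helper, each token is selected independently with probability `select_prob` (which only approximates selecting 15% of the positions), and the toy vocabulary is made up. A random replacement may also happen to pick the original token.

```python
import random


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Apply the 80/10/10 rule; return corrupted tokens and prediction targets."""
    corrupted = list(tokens)
    targets = {}  # position -> original token the model must predict
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]"):
            continue  # special tokens are never selected in this sketch
        if rng.random() >= select_prob:
            continue  # position not selected for prediction
        targets[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"  # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, targets


rng = random.Random(0)
vocab = ["the", "man", "went", "to", "store", "dog", "milk"]
tokens = ["[CLS]", "the", "man", "went", "to", "the", "store", "[SEP]"]
# A high select_prob is used here only so the toy example shows some changes.
corrupted, targets = mask_tokens(tokens, vocab, rng, select_prob=0.5)
print(corrupted)
print(targets)
```

Note that the prediction targets are recorded for every selected position, including those left unchanged, so the loss is still computed there.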

Occasionally showing the model the original word is loosely similar to teacher forcing in seq2seq models, as we saw before.

Next Sentence Prediction

BERT is also pre-trained with a large number of sentence pairs to learn the relation between two sequences.

It predicts whether the second sentence of a given pair actually follows the first one in the corpus.

So $50$% of the pairs are correct pairs, extracted from actual consecutive sentences in the same document, and $50$% are wrong pairs, in which the second sentence is taken from a different document.

This improves BERT’s performance on sentence pair classification tasks such as NLI (Natural Language Inference) or sentence completion.
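The pair construction can be sketched as below. This is only an illustration with a hypothetical helper `make_nsp_example` and a made-up toy corpus; it assumes at least two documents, each with at least two sentences.

```python
import random


def make_nsp_example(documents, rng):
    """Return (sentence_a, sentence_b, is_next) from a list of documents.

    Each document is a list of sentences in their original order.
    """
    doc_idx = rng.randrange(len(documents))
    doc = documents[doc_idx]
    i = rng.randrange(len(doc) - 1)  # leave room for a following sentence
    sentence_a = doc[i]
    if rng.random() < 0.5:
        # Positive pair: the sentence that actually follows sentence_a.
        return sentence_a, doc[i + 1], True
    # Negative pair: a random sentence from a different document.
    other_idx = rng.choice([k for k in range(len(documents)) if k != doc_idx])
    return sentence_a, rng.choice(documents[other_idx]), False


documents = [
    ["the man went to the store", "he bought some milk", "then he went home"],
    ["penguins are birds", "they cannot fly", "they live in the south"],
]
rng = random.Random(0)
examples = [make_nsp_example(documents, rng) for _ in range(1000)]
positive_rate = sum(is_next for _, _, is_next in examples) / len(examples)
print(examples[0])
print(round(positive_rate, 2))
```

With enough samples, the share of positive pairs stays close to one half, which matches the $50$%/$50$% split described above.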

The paper illustrates this with a concrete pair of pre-training inputs, one labeled IsNext and one labeled NotNext.

Fine-tuning

Now we’re gonna see a few fine-tuning methods the authors tested with a pre-trained BERT model.

The paper summarizes these fine-tuning tasks in a well-known figure with four panels, (a) to (d), and the list below follows the same order.

Note that the $[CLS]$ token finally comes into play in single-sentence and sentence-pair classification tasks: its output hidden vector, which represents the overall context of the input, is fed into an additional classification layer.

(a) Sentence Pair Classification Tasks
- Two sequences are given, and the first output hidden vector, which comes from the $[CLS]$ token, is fed into the final classification layer to predict the category.
- Tasks/Datasets
  - MNLI (Multi-Genre Natural Language Inference): Given sentence pairs as inputs, BERT determines which category the relation belongs to: “entailment”, “contradiction”, or “neutral”.
  - QQP (Quora Question Pairs): Given question pairs as inputs, the model conducts binary classification to determine whether the two questions are equivalent or not.
  - QNLI (Question Natural Language Inference): This is a version of the Stanford Question Answering Dataset converted into binary classification. Given pairs of a question and a sentence from a paragraph, BERT determines whether the sentence contains the answer to the question.
  - STS-B (Semantic Textual Similarity Benchmark): Given pairs of sentences, the model predicts a similarity score between the two sequences from $1$ to $5$.
  - MRPC (Microsoft Research Paraphrase Corpus): Given pairs of sentences, BERT conducts binary classification to determine whether the two sentences are semantically equivalent or not.
  - RTE (Recognizing Textual Entailment): This is similar to MNLI, but the dataset is much smaller, and it only requires binary classification of whether the pair is an entailment or not.
  - SWAG (Situations With Adversarial Generations): Given one sentence, the model chooses the most plausible next sentence among $4$ options.
    - First, $4$ inputs are made by concatenating Sentence $A$ with each of Sentences $B_1$ ~ $B_4$.
    - Then an additional task-specific parameter vector $V \in \mathbb{R}^H$ is trained. We compute the dot product between each of the $4$ first output hidden states and $V$ to get $4$ scalar scores, one per option.
    - Finally, a softmax over these scores gives a distribution over the $4$ choices, and the option with the highest probability is selected.
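The SWAG scoring steps above can be sketched as follows. This is a toy illustration, not the authors' code: the four $[CLS]$ vectors and the vector `v` are made-up numbers with $H = 3$, and `score_choices` is a hypothetical helper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def score_choices(cls_vectors, v):
    """Dot each option's [CLS] vector with V, then normalize over the options."""
    scores = [sum(c * w for c, w in zip(cls_vec, v)) for cls_vec in cls_vectors]
    return softmax(scores)


# One [CLS] output vector per (Sentence A, Sentence B_k) input, with toy values.
cls_vectors = [
    [0.1, 0.2, 0.3],
    [0.9, 0.1, 0.4],
    [0.0, -0.5, 0.2],
    [0.3, 0.3, 0.3],
]
v = [1.0, 0.5, -0.2]  # stands in for the trained task-specific vector V

probs = score_choices(cls_vectors, v)
best = max(range(len(probs)), key=probs.__getitem__)
print([round(p, 3) for p in probs])
print("chosen option index:", best)
```

Because the same vector `v` scores all four inputs, the only task-specific parameters in this sketch are the $H$ entries of `v`.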

(b) Single Sentence Classification Tasks
- As in the preceding method, an input sequence is given and the first hidden state of the BERT output is used for classification.
- Tasks/Datasets
  - SST-2 (Stanford Sentiment Treebank): This is a simple binary sentiment classification dataset.
  - CoLA (Corpus of Linguistic Acceptability): The model determines whether the given sentence is linguistically acceptable or not.
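The classification layer on top of the $[CLS]$ vector can be sketched as a single weight matrix with one row per label, followed by a softmax and a cross-entropy loss. The numbers below are made up, and `classify` and `cross_entropy` are hypothetical helper names, so this is only a sketch of the idea.

```python
import math


def classify(cls_vector, weights):
    """Return label probabilities; weights holds one row of length H per label."""
    logits = [sum(c * w for c, w in zip(cls_vector, row)) for row in weights]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-likelihood of the correct label."""
    return -math.log(probs[label])


cls_vector = [0.5, -1.0, 2.0]  # toy [CLS] output with H = 3
weights = [
    [1.0, 0.0, 0.0],  # row for label 0, e.g. negative sentiment
    [0.0, 0.0, 1.0],  # row for label 1, e.g. positive sentiment
]
probs = classify(cls_vector, weights)
print([round(p, 3) for p in probs])
print(round(cross_entropy(probs, 1), 3))
```

During fine-tuning, the gradient of this loss would update both the new weight matrix and the pre-trained encoder that produced the $[CLS]$ vector.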

(c) Question Answering Tasks
- For Q&A tasks, BERT takes a question as the first sequence and a paragraph, which might contain the answer to the question, as the second sequence. The task is to find the correct answer span in the paragraph.
- The specific method differs slightly depending on the dataset.
- Tasks/Datasets
  - SQuAD (Stanford Question Answering Dataset) 1.1: This version assumes that the answer is always in the paragraph.
    - A start vector $S$ and an end vector $E$ are learned during fine-tuning, and we compute the dot product between $S$ (or $E$) and each output token vector $T_i$.
    - A softmax over all tokens in the paragraph converts these scalar scores into probabilities, so we get the probability of each token being the starting point or the ending point of the answer span.
    - For positions $i$ and $j$ with $j \ge i$, we pick the span with the largest score $S \cdot T_i + E \cdot T_j$. This span becomes the predicted answer.
  - SQuAD 2.0: This version includes cases in which the answer does not exist in the paragraph at all.
    - When the paragraph does not contain the answer, the start and end of the answer span both point to the $[CLS]$ token. With $C$ denoting the output vector of $[CLS]$, the score of this case is $s_{null} = S \cdot C + E \cdot C$.
    - The rest of the procedure is the same, but if the highest span score $\hat{s}_{i,j} = \max_{j \ge i} S \cdot T_i + E \cdot T_j$ does not satisfy $\hat{s}_{i,j} > s_{null} + \tau$ ($\tau$: threshold), we conclude that there is no answer.
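The span selection for both SQuAD versions can be sketched as below. All vectors are toy values with $H = 2$, the helper names are hypothetical, and the `max_len` cap on the span length is an extra assumption added here for practicality rather than something described above.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def best_span(token_vectors, start_vec, end_vec, max_len=30):
    """Return (score, i, j) maximizing S.T_i + E.T_j subject to i <= j."""
    start_scores = [dot(start_vec, t) for t in token_vectors]
    end_scores = [dot(end_vec, t) for t in token_vectors]
    best = None
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(token_vectors))):
            score = s + end_scores[j]
            if best is None or score > best[0]:
                best = (score, i, j)
    return best


def answer_or_null(token_vectors, cls_vector, start_vec, end_vec, tau=0.0):
    """SQuAD 2.0 rule: answer only if the best span beats s_null by more than tau."""
    s_null = dot(start_vec, cls_vector) + dot(end_vec, cls_vector)
    score, i, j = best_span(token_vectors, start_vec, end_vec)
    return (i, j) if score > s_null + tau else None


start_vec = [1.0, 0.0]  # stands in for the learned start vector S
end_vec = [0.0, 1.0]    # stands in for the learned end vector E
token_vectors = [[0.1, 0.0], [2.0, 0.1], [0.3, 1.5], [0.0, 0.2]]  # paragraph tokens T_i

print(best_span(token_vectors, start_vec, end_vec))
print(answer_or_null(token_vectors, [0.5, 0.5], start_vec, end_vec))
print(answer_or_null(token_vectors, [3.0, 3.0], start_vec, end_vec))
```

In the second call the null score is low, so the span from token 1 to token 2 is returned, while in the third call the null score wins and the function returns `None`, meaning "no answer".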

(d) Single Sentence Tagging Tasks
- The paper does not explain sentence tagging tasks in much detail, but the method is clear from the given description.
- Tagging is a token-level classification task, so after getting the output hidden vectors of all positions, we feed each vector into an additional classification layer to determine the correct tag of each token.
- Tasks/Datasets
  - CoNLL-2003: Given a single input sequence, BERT conducts token-level classification to determine the correct entity tag of each word.

Ablation studies

There are also several ablation studies that help us understand the effects of the various strategies used in BERT more thoroughly.

Effect of Pre-training Tasks

In this study, the authors changed the pre-training methods as follows.

- No NSP: Next Sentence Prediction was removed.
- LTR & No NSP: Next Sentence Prediction was removed and the language modeling was done only in the left-to-right direction. This is actually similar to the pre-training method of GPT, except for the size of the dataset, the input representation, and the fine-tuning scheme.

The results in the paper show that without NSP, performance dropped notably on QNLI, MNLI, and SQuAD 1.1.

Replacing Masked LM with LTR LM degraded the accuracy on all tasks even more seriously, especially on MRPC and SQuAD.

Since LTR LM is not deeply bidirectional, token-level tasks such as question answering cannot be performed as well as with a model pre-trained with Masked LM, as I mentioned above.

In the paper, a BiLSTM was added on top of the LTR model to make up for this. It improved the SQuAD results, which still stayed far below those of the bidirectional model, and it hurt performance on the GLUE tasks.

Additionally, some might say that we could just train separate LTR and RTL models and concatenate their contexts as ELMo does, but the paper argues against this.

The reasons are that it is twice as expensive as a single bidirectional model, that it is not intuitive for tasks like Q&A since the RTL model cannot condition the answer on the question, and that simple concatenation is not deeply bidirectional, so it cannot use the left and right context at the same time in every layer.

Effect of Model Size

They conducted experiments to find out the effect of the number of layers, the hidden size, and the number of attention heads on the performance of various tasks.

As the paper describes, the bigger the model, the better the result.

The point here is that the improvement shows up even when the dataset for a downstream task such as MRPC is much smaller than, and substantially different from, the pre-training data.

And a large-size BERT is even bigger than existing models that were already considered quite large, such as the largest transformer in the original transformer paper.

Before BERT, several studies argued that simply making the model bigger does not always guarantee a performance improvement.

This is because even if the model gets bigger, the size of the dataset is limited, which can become an obstacle to parameter tuning since there is not enough data to tune the huge number of model parameters.

But this study shows that if the model has been sufficiently pre-trained, we can expect a performance improvement from scaling up the model even with small fine-tuning datasets.

And as the paper says, this is a benefit of the fine-tuning approach: with only a small number of additional parameters, the task-specific model can take advantage of larger and more expressive pre-trained representations.

Feature-based Approach with BERT

Until now, we have only discussed the fine-tuning approach of BERT and seen how it benefits from pre-trained models.

But BERT can also be used in a feature-based approach, and the authors conducted several experiments on that.

The difference between the feature-based approach and the fine-tuning approach is as follows.

- Feature-based: The pre-trained language representations are given as features, and an additional model for a specific task uses them. It can be seen as two different models cooperating, and the pre-trained model is not tuned. (ex. ELMo etc.)
- Fine-tuning: Minimal task-specific parameters are added, and the pre-trained parameters are adjusted slightly for each downstream task. The pre-trained parameters are usually all updated, but the point here is that the additional task-specific parameters should be minimized. (ex. GPT, BERT etc.)

By applying the feature-based approach to BERT, two advantages can be obtained.

First, some tasks cannot be easily handled with just the transformer encoder architecture, so an additional task-specific model can make up for this limitation, such as a CRF (Conditional Random Field) in NER tasks.

Second, we get computational benefits, since we can pre-compute the expensive representations of the training data once and then run many experiments with cheaper models on top of them.
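The second advantage can be illustrated with a small caching sketch. Everything here is hypothetical: `frozen_encoder` is a toy stand-in for a frozen pre-trained model (it just counts characters and words), and `run_experiment` stands in for a cheap task-specific model. The point is only that the expensive step runs once per sentence, no matter how many experiments follow.

```python
encoder_calls = 0


def frozen_encoder(sentence):
    """Hypothetical stand-in for a frozen pre-trained model (toy features only)."""
    global encoder_calls
    encoder_calls += 1
    # Toy features: number of characters and number of words.
    return [len(sentence), sentence.count(" ") + 1]


def precompute_features(sentences):
    """Run the expensive encoder once per sentence and cache the result."""
    return {s: frozen_encoder(s) for s in sentences}


def run_experiment(features, weights):
    """A cheap task-specific model on top of cached features (a linear score)."""
    return {s: sum(f * w for f, w in zip(vec, weights)) for s, vec in features.items()}


sentences = ["BERT is an encoder", "ELMo uses LSTMs"]
features = precompute_features(sentences)

# Three experiments with different task-specific weights reuse the same cache.
results = [run_experiment(features, w) for w in ([1.0, 0.0], [0.0, 1.0], [0.5, 0.5])]
print("encoder calls:", encoder_calls)
print(results[1])
```

With fine-tuning, by contrast, the encoder itself changes during every experiment, so its outputs cannot be cached in this way.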

In the paper, to ablate the fine-tuning approach, the pre-trained parameters were fixed and an additional BiLSTM was added on top of BERT to conduct the NER task.

The authors report that the best-performing feature-based setting was only $0.3$ F1 behind the fine-tuning approach.

So we can conclude that BERT performs quite well as a feature-based model too.

Additional studies
- To achieve BERT’s remarkable performance, a large amount of pre-training was necessary ($128,000$ words/batch * $1,000,000$ steps).
- Compared to LTR pre-training, Masked LM converges a little more slowly, but in terms of absolute accuracy it starts to outperform the LTR model almost immediately.
- We have seen that in Masked LM, the proportions of replacing a selected token with $[MASK]$, keeping the same token, and replacing it with a randomly chosen one are set to $80$%, $10$%, and $10$%, respectively. The paper reports that changing these rates can cause slight performance degradation on MNLI and NER. In particular, the fine-tuning approach is quite robust, so its degradation is not that serious, while the feature-based approach seems to be more sensitive to the masking rates.
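The pre-training budget in the first bullet can be put into perspective with quick arithmetic. The corpus size of 3.3 billion words is taken from the paper (BooksCorpus plus English Wikipedia) and is not stated in this post, so treat it as an outside figure.

```python
words_per_batch = 128_000
steps = 1_000_000
corpus_words = 3_300_000_000  # corpus size reported in the paper, not in this post

total_words = words_per_batch * steps
epochs = total_words / corpus_words

print(f"{total_words:.3e} words processed")
print(f"about {epochs:.1f} passes over the corpus")
```

This gives roughly 39 passes over the data, in line with the paper's statement of approximately 40 epochs.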

So this was the post about the famous BERT.

This model is so important that, if you are interested in NLP research, you should study it and also run actual experiments with it.

I summarized the contents of the paper as compactly as possible so that you can understand the model without much difficulty or time.

I hope this post helps many people get used to one of the most important deep learning models more easily.

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. https://arxiv.org/abs/1810.04805

BERT 논문정리 [BERT paper summary]. (2019, Feb 23). https://tmaxai.github.io/post/BERT/
