DialogueSentenceBERT: SentenceBERT for More Representative Utterance Embedding via Pre-training on Dialogue Corpus (1)

In the second quarter of this year, I worked on a contextualized sentence embedding model that is better suited to utterance embedding for dialogue tasks.

This model, which I call DialogueSentenceBERT, started from the idea that a more effective sentence embedding model for dialogues might be obtained by training on a large dialogue corpus with the Siamese networks[1] structure, in the same way SentenceBERT[2] was trained.

Although the first attempt ended in failure, a few months later I came up with a new method and decided to get back to work.

Over two posts, including this one, I will describe how the new approach differs from the original one and what the results are.

Let’s get started.

Introduction

First, let’s recall how SentenceBERT, which was mentioned above, is trained.

SentenceBERT was trained with the Siamese networks structure to produce a more representative embedding vector for each natural language sentence.

The Siamese networks structure is a way of training a neural network in which a single model is used as two identical twins: it encodes two different inputs with the same weights and learns from the relation between the two outputs.

As we can see in the description, SentenceBERT either concatenates the two sentence vectors with their element-wise difference and feeds the result to a classifier, or computes the similarity of the two vectors directly, in order to learn how close the sentence embeddings obtained from the original BERT[3] should be.

For this, each sentence vector can be computed by mean or max pooling over the token outputs, or by simply taking the contextualized embedding at the position of the "[CLS]" token.
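To make the three pooling options concrete, here is a minimal sketch in plain Python. The token vectors and the attention mask are made-up toy values and the function names are hypothetical; a real implementation would apply the same operations to BERT's output tensors.

```python
from typing import List

Vector = List[float]


def cls_pool(token_vectors: List[Vector]) -> Vector:
    """Take the vector at position 0, where BERT places the [CLS] token."""
    return list(token_vectors[0])


def mean_pool(token_vectors: List[Vector], mask: List[int]) -> Vector:
    """Average the vectors of real tokens (mask == 1), ignoring padding."""
    kept = [vec for vec, m in zip(token_vectors, mask) if m == 1]
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def max_pool(token_vectors: List[Vector], mask: List[int]) -> Vector:
    """Take the element-wise maximum over real tokens, ignoring padding."""
    kept = [vec for vec, m in zip(token_vectors, mask) if m == 1]
    dim = len(token_vectors[0])
    return [max(vec[i] for vec in kept) for i in range(dim)]


# Toy "BERT output": 4 token positions, hidden size 2 (the last one is padding).
tokens = [[1.0, 0.0], [3.0, 4.0], [5.0, -2.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

print(cls_pool(tokens))         # vector of the first position
print(mean_pool(tokens, mask))  # average of the three unmasked positions
print(max_pool(tokens, mask))   # element-wise maximum of the unmasked positions
```

Note that the padded position is excluded from both the mean and the max, which is why the mask is passed explicitly.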

By training in this way, the model is said to capture the semantic relation between two sentences more effectively and to use this knowledge to solve natural language inference (NLI) or semantic textual similarity (STS) tasks, where the relation between the meanings of the two inputs is crucial, better than the raw embeddings from BERT do.

While working on intent classification, I came to think that training a dialogue model in a similar way could give better results, especially in tasks that rely on sentence embeddings, such as intent classification and action prediction.

Moreover, since BERT was pre-trained on text such as Wikipedia[4] and BooksCorpus[5], which is refined to a certain extent, it may perform poorly on utterances from dialogues, because human conversations are more diverse, unpredictable, and sometimes severely ungrammatical.

So I expected that letting the model see the contents of a dialogue corpus would bring a meaningful performance increase.

Additionally, models trained on dialogue data, such as ToD-BERT[6] and ConvBERT[7], showed much better performance than the original BERT (the results will be provided in the next post), so I started this project expecting that adding Siamese networks training would bring a further improvement.

Lastly, according to the paper, SentenceBERT was not originally intended to be used for transfer learning; it is simply a training strategy suggested for certain tasks.

However, I wanted to try it for pre-training to obtain a more adaptable and useful sentence encoder for various tasks, and concluded that it was worth attempting with a proper training approach and a sufficient amount of data.

Previous approach

First, I should explain the approach I tried previously.

This is a picture of the 3-way classification approach which I implemented.

In fact, it is no different from SentenceBERT’s NLI training.

I trained BERT to classify the relation between two utterances as one of “same”, “neutral”, and “different”: the two inputs are fed to the model, the outputs are pooled into a pair of sentence embedding vectors, the two vectors are concatenated with their element-wise difference, and the concatenated vector is passed to a 3-way classification layer.

Here, the pooling method can be mean or max pooling, or the contextualized embedding of the "[CLS]" token, just as in SentenceBERT.
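The classification head described above can be sketched as follows. This is only an illustration under stated assumptions: the absolute difference follows the (u, v, |u - v|) form used in the SentenceBERT paper, while the embeddings, the untrained weights, and the function names are hypothetical toy values rather than the actual implementation.

```python
import math
from typing import List

LABELS = ["same", "neutral", "different"]


def pair_features(u: List[float], v: List[float]) -> List[float]:
    """Concatenate u, v and their element-wise absolute difference."""
    return list(u) + list(v) + [abs(a - b) for a, b in zip(u, v)]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(features: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """A single linear layer (one weight row per class) followed by softmax."""
    logits = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


# Hypothetical pooled embeddings of two utterances (hidden size 2).
u = [0.2, -0.5]
v = [0.1, 0.4]
features = pair_features(u, v)  # length 3 * hidden size = 6

# Hypothetical, untrained weights: 3 classes x 6 features.
weights = [[0.1] * 6, [0.0] * 6, [-0.1] * 6]
bias = [0.0, 0.0, 0.0]

probs = classify(features, weights, bias)
print(dict(zip(LABELS, probs)))
```

The input of the classification layer is three times the hidden size, because both embeddings and their difference are kept.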

In addition, I applied masked language modeling (MLM), the well-known pre-training objective of BERT, to improve the model's understanding of dialogue corpora, and added the MLM loss to the classification loss.

The problem was how to define the relevance between two utterances, and I decided to use the action labels tagged on each utterance in the datasets for task-oriented dialogue systems (TODS).

Generally, each utterance carries some intent or purpose that the speaker wants to deliver, and most of these are tagged with labels called actions.

I chose the number of overlapping actions as the criterion for the relation between two utterances, since the more actions they share, the more similar their meanings are.

Here, I used 3 datasets: the Schema-Guided Dialogue Dataset[8], the Frames dataset[9], and the End-to-end dataset[10].

They all have action tags for each turn, are easy to obtain, and do not overlap with the other datasets I used for fine-tuning.

However, there was another problem: each dataset has its own standard for action tagging.

In other words, in order to use these datasets together, I first had to unify the different tagging standards.

To do that, I defined 6 new actions by grouping actions with similar purposes, and mapped each original action to one of these new action labels.

This is a part of the spreadsheet I wrote to redefine action labels as mentioned above.

I will explain the process briefly, since the spreadsheet is written in Korean and the capture shows only part of the page, which makes the picture hard to understand.

First, I defined 6 action groups: “GENERAL”, “INFORM”, “POSITIVE”, “NEGATIVE”, “ASK”, and “OFFER”. Then I mapped each original action to one of these 6 groups by judging which group it is most similar to.

For example, actions such as “REQUEST”, “CONFIRM”, and “REQ_MORE” in the schema dataset, and “request” and “moreinfo” in the frames dataset, ask for or request additional information or a response, so they can be classified as “ASK”.

Likewise, “AFFIRM” and “NOTIFY_SUCCESS” in the schema dataset and “confirm_answer” in the e2e dataset carry a positive meaning or notify that a certain process finished successfully, so these actions map to the new action “POSITIVE”.

Actions with no special function or meaning, or that do not fit any of the groups, become “GENERAL”.

In this way, I unified all actions and defined a rule: a pair is “same” if the actions of the two utterances match exactly, “different” if none of them overlap, and “neutral” if only some of them match.
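The labeling rule can be sketched in a few lines of Python. The mapping table below contains only the examples mentioned in this post, not the full spreadsheet, and the fallback to “GENERAL” for every unlisted action is a simplification made for this sketch (in the real mapping, other actions belong to groups such as “INFORM” or “OFFER”). The function names are hypothetical.

```python
from typing import Iterable, Set

# Partial mapping: (dataset, original action) -> unified action group.
ACTION_TO_GROUP = {
    ("schema", "REQUEST"): "ASK",
    ("schema", "CONFIRM"): "ASK",
    ("schema", "REQ_MORE"): "ASK",
    ("frames", "request"): "ASK",
    ("frames", "moreinfo"): "ASK",
    ("schema", "AFFIRM"): "POSITIVE",
    ("schema", "NOTIFY_SUCCESS"): "POSITIVE",
    ("e2e", "confirm_answer"): "POSITIVE",
}


def to_groups(dataset: str, actions: Iterable[str]) -> Set[str]:
    """Convert the original actions of one utterance into unified groups."""
    # Assumption for this sketch: anything not listed falls back to GENERAL.
    return {ACTION_TO_GROUP.get((dataset, action), "GENERAL") for action in actions}


def relation(groups_a: Set[str], groups_b: Set[str]) -> str:
    """Label a pair of utterances by how much their action groups overlap."""
    if groups_a == groups_b:
        return "same"       # all actions match exactly
    if groups_a & groups_b:
        return "neutral"    # only some actions match
    return "different"      # no action in common


a = to_groups("schema", ["REQUEST", "AFFIRM"])  # {"ASK", "POSITIVE"}
b = to_groups("frames", ["request"])            # {"ASK"}
c = to_groups("e2e", ["confirm_answer"])        # {"POSITIVE"}

print(relation(a, b))  # neutral
print(relation(b, c))  # different
print(relation(b, to_groups("schema", ["REQ_MORE"])))  # same
```

Because the comparison is made on sets of unified groups, two utterances with different original tags can still be labeled “same”, which is exactly where fine-grained distinctions get lost.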

After that, I pre-trained the model and compared it with the baselines: BERT, SentenceBERT, ToD-BERT, and ConvBERT.

In particular, for SentenceBERT I used models with different pooling strategies that had been trained on the NLI dataset.

The tasks were intent detection and system action prediction, both of which are classification tasks using a sentence embedding of an utterance.

The details of these fine-tuning tasks will be covered in the next post.

As I mentioned earlier, the model pre-trained with this approach eventually failed.

In most cases, its performance was lower than that of the original BERT, let alone ConvBERT and ToD-BERT, which showed remarkable results.

It seems that the pre-training instead made BERT forget its linguistic knowledge.

I think the reasons for this failure can be summarized as follows.

First of all, the training objective was not appropriate. I think the idea of using the tagged actions was not bad, but converting them into 3 categories blurred the meaning of each action. In other words, grouping the actions might have destroyed the detailed semantic differences between two utterances, which eventually produced misleading signals during training.
The second is a lack of diversity in the training data. The TODS datasets I used cover limited domains and topics, which might have narrowed the broad linguistic knowledge that BERT originally has. In fact, ConvBERT, which was trained on an open-domain corpus, generalized reasonably well to different tasks. Although ToD-BERT also used TODS datasets, they were much larger and more diverse than the ones I used in this project.

I stopped the project at this point and concluded that I should rethink all of the settings, from the pre-training method to the corpus.

In the next section, I will introduce the modified training strategy.

New training objective

I came to think that setting a classification task as the training objective might make the model over-fit to this classification objective during pre-training.

In other words, the model is updated not to produce semantically representative sentence embedding vectors but to solve one specific classification task on limited datasets, which can lead to the loss of the knowledge the model originally had.

This is no different from transfer learning and clearly defeats my purpose of creating a general language model.

Therefore, it is necessary to apply another objective which is more likely to detect the semantic similarities/differences between different inputs.

Here, I decided to use CosineEmbeddingLoss[11], which is based on the cosine similarity between two vectors, instead of a basic classification loss function.

This loss trains the model by pulling two vectors that should be similar closer together and pushing two vectors that should be different farther apart, according to a given label.

The calculation of the loss is as follows.

\[\text{Loss}(x_1, x_2, y) = \left\{ \begin{array}{ll} 1 - \cos(x_1, x_2) & \mbox{if } y = 1 \\ \text{max}(0, \cos(x_1, x_2)-\text{margin}) & \mbox{if } y = -1 \end{array} \right.\]

Assume that $y=1$ means the vectors $x_1$ and $x_2$ should be similar, and $y=-1$ means the opposite.

Since $-1 \le \cos(x_1, x_2) \le 1$, we have $0 \le 1 - \cos(x_1, x_2) \le 2$ when $y=1$.

Therefore, the loss is never negative, and the less similar the two vectors currently are, the higher the loss becomes.

On the other hand, if $y=-1$, then $-1-\text{margin} \le \cos(x_1, x_2) - \text{margin} \le 1-\text{margin}$.

Subtracting the margin lowers the loss value, which keeps the model from being updated more than necessary.

In addition, since the loss should not be negative, $0$ becomes the lower bound, so a pair whose cosine similarity is already at or below the margin contributes no loss.
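To check the two cases numerically, here is a small, dependency-free sketch of the same formula for a single pair of vectors. It only mirrors the definition above; in actual training one would use the PyTorch module[11] on batches of embeddings. The margin of 0.5 in the last example is an arbitrary value chosen for illustration, not the one used in this project.

```python
import math
from typing import List


def cosine(x1: List[float], x2: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x1, x2))
    norm1 = math.sqrt(sum(a * a for a in x1))
    norm2 = math.sqrt(sum(b * b for b in x2))
    return dot / (norm1 * norm2)


def cosine_embedding_loss(x1: List[float], x2: List[float], y: int, margin: float = 0.0) -> float:
    """Loss for one pair: y = 1 for a positive pair, y = -1 for a negative pair."""
    cos = cosine(x1, x2)
    if y == 1:
        return 1.0 - cos           # ranges from 0 (identical direction) to 2 (opposite)
    return max(0.0, cos - margin)  # zero once the pair is at or below the margin


# Positive pairs: identical direction costs nothing, opposite direction costs the most.
print(cosine_embedding_loss([1.0, 0.0], [1.0, 0.0], y=1))   # 0.0
print(cosine_embedding_loss([1.0, 0.0], [-1.0, 0.0], y=1))  # 2.0

# Negative pairs: orthogonal vectors are already "far enough" with margin 0.
print(cosine_embedding_loss([1.0, 0.0], [0.0, 1.0], y=-1))  # 0.0

# Negative pair that is still too similar (cos is about 0.707) with a margin of 0.5.
print(cosine_embedding_loss([1.0, 0.0], [1.0, 1.0], y=-1, margin=0.5))
```

A larger margin makes the objective more tolerant of negative pairs that remain somewhat similar, since fewer of them produce a non-zero loss.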

This objective is actually very similar to SentenceBERT’s training method for the STS task: without an additional classification layer, the model is optimized directly to compare two utterance vectors, which is intuitively closer to our original purpose.

However, there is still a risk that this becomes just another form of transfer learning if sufficiently large and diverse data is not prepared.

In the next section, I will discuss the pre-training data and its pre-processing.

Data pre-processing

I decided to use OpenSubtitles[12], a large open-domain dialogue corpus, for pre-training, so that the model sees more diverse and abundant utterances from daily life.

This corpus consists of about 3.36M subtitle files from various movies and covers far more varied and general situations than the previous TODS datasets, which should help the model understand human conversations better.

In addition, since ConvBERT was also trained on this corpus and showed remarkable results, I expected a performance improvement.

Since each subtitle file is extracted from one movie, the utterances in a file are likely to share similar contexts or topics.

Therefore, I came up with the idea of embedding two utterances from the same file close together and mapping utterances from different files farther apart.

In a way, this is very similar to the negative sampling method, which randomly samples an utterance from other dialogues and uses it as a negative class.

However, since a movie might be very long, it is not guaranteed that two utterances from the identical file carry the same context.

That is, if two utterances are located far away from each other in a script, sentences that are not quite related might be grouped close to each other.

To alleviate this, I did not pair the positive utterances entirely at random: after sampling a proportion of the utterances in one file for the positive class, I paired them while maintaining their original order.

As a result, a sample is sometimes paired with the utterance right next to it (in fact, in most cases), and sometimes with one at a distance.

This allows the model to see more diverse examples of positive pairs.

Let’s take a look at the picture below.

As you can see, the positive pairs are sampled in order, and the utterances that are not chosen as positive samples are randomly paired with utterances from different files.
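The pairing procedure can be sketched roughly as follows. This is one possible reading of the description rather than the exact pre-processing code: the `positive_ratio` parameter, the choice to pair sampled utterances two by two, and the function name are all assumptions made for illustration.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str, int]  # (utterance 1, utterance 2, label)


def make_pairs(files: List[List[str]], positive_ratio: float,
               rng: random.Random) -> Tuple[List[Pair], List[Pair]]:
    """Build positive pairs inside each file and negative pairs across files."""
    positives: List[Pair] = []
    negatives: List[Pair] = []
    for file_id, utterances in enumerate(files):
        # Sample a proportion of this file for the positive class (even count).
        n_pos = int(len(utterances) * positive_ratio)
        n_pos -= n_pos % 2
        chosen = sorted(rng.sample(range(len(utterances)), n_pos))
        # Pair the sampled utterances two by two, keeping the script order.
        for first, second in zip(chosen[0::2], chosen[1::2]):
            positives.append((utterances[first], utterances[second], 1))
        # Every remaining utterance gets a random partner from another file.
        chosen_set = set(chosen)
        other_ids = [i for i in range(len(files)) if i != file_id and files[i]]
        for index, utterance in enumerate(utterances):
            if index in chosen_set or not other_ids:
                continue
            other_file = files[rng.choice(other_ids)]
            negatives.append((utterance, rng.choice(other_file), -1))
    return positives, negatives


# Two toy "subtitle files" with four utterances each.
files = [["a0", "a1", "a2", "a3"], ["b0", "b1", "b2", "b3"]]
positives, negatives = make_pairs(files, positive_ratio=0.5, rng=random.Random(0))
print(positives)
print(negatives)
```

Sorting the sampled indices before pairing is what keeps the order: neighbours in the sorted list are often, but not always, adjacent in the script.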

In addition, to give the model the ability to handle multi-turn context along with the target utterance, some samples are pre-processed to include the preceding utterances as context.

The number of utterances to include as context is sampled from a Poisson distribution with $\lambda=2$.

The context utterances are separated by BERT’s [SEP] token, as described in the illustration.
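A minimal sketch of this context construction is shown below. It assumes that the context precedes the target utterance in the input and that the number of context turns is capped by how many earlier utterances exist; both points, as well as the function names, are assumptions of this sketch. The Poisson sampler uses Knuth's multiplication method so that the example needs only the standard library.

```python
import math
import random
from typing import List


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Draw one sample from Poisson(lam) with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return count
        count += 1


def build_input(utterances: List[str], index: int, n_context: int,
                sep: str = " [SEP] ") -> str:
    """Join up to n_context preceding utterances and the target with [SEP]."""
    n_context = min(n_context, index)  # cannot look back past the first line
    return sep.join(utterances[index - n_context: index + 1])


def with_sampled_context(utterances: List[str], index: int,
                         rng: random.Random, lam: float = 2.0) -> str:
    """Build the input for one target utterance with a Poisson-sampled context size."""
    return build_input(utterances, index, sample_poisson(lam, rng))


script = ["Where are you going?", "To the station.", "I'll drive you."]
print(build_input(script, index=2, n_context=1))
# To the station. [SEP] I'll drive you.
print(with_sampled_context(script, index=2, rng=random.Random(0)))
```

With $\lambda=2$, a sample of zero is possible, in which case the target utterance is used on its own.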

After pre-processing, I was able to make $34,828,814$ positive pairs and $35,875,196$ negative pairs.

Pre-training details

After pre-processing, I was finally able to train the model as I designed.

I set the batch size to $128$ and the number of epochs to $1$, clipped the gradient norm to $1.0$, and used the AdamW[13] optimizer.

Additionally, I used the get_polynomial_decay_schedule_with_warmup[14] scheduler, which is provided by Hugging Face’s Transformers.

This increases the learning rate linearly during the warm-up steps and then decays it polynomially (linearly, if the scheduler's power parameter is left at its default of $1.0$) until the training finishes.

I set the maximum learning rate before decaying to $2 \times 10^{-5}$.
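The shape of this schedule can be illustrated with a standalone function that follows the formula of the Transformers scheduler as I understand it from its documentation. The warm-up length and total number of steps below are hypothetical placeholders, and `lr_end=1e-7` and `power=1.0` are the library's documented defaults, not values confirmed for this project.

```python
def polynomial_decay_lr(step: int, *, max_lr: float, warmup_steps: int,
                        total_steps: int, lr_end: float = 1e-7,
                        power: float = 1.0) -> float:
    """Learning rate at a given step: linear warm-up, then polynomial decay."""
    if step < warmup_steps:
        # Linear warm-up from 0 to max_lr.
        return max_lr * step / max(1, warmup_steps)
    if step > total_steps:
        # After the schedule ends, stay at the final learning rate.
        return lr_end
    # Fraction of the decay phase that is still ahead (1.0 down to 0.0).
    remaining = 1.0 - (step - warmup_steps) / (total_steps - warmup_steps)
    return (max_lr - lr_end) * remaining ** power + lr_end


# Hypothetical schedule: 100 warm-up steps out of 1000 training steps.
for step in (0, 50, 100, 550, 1000):
    lr = polynomial_decay_lr(step, max_lr=2e-5, warmup_steps=100, total_steps=1000)
    print(step, lr)
```

With `power=1.0` the decay is a straight line from the maximum learning rate down to `lr_end`; larger powers make the rate drop faster at the beginning of the decay phase.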

Lastly, I pre-trained the model with PyTorch Lightning[15], which I have preferred since last year because it makes it easy to use multiple GPUs efficiently.

This is the end of part 1.

In this post, we discussed the introduction of the project, the previous approach and the possible factors of failure, the new training strategy, and the data pre-processing procedure.

In the next post, I will share the details of fine-tuning tasks, and the evaluation results from the baseline models and pre-trained DialogueSentenceBERT.

Thank you very much.

[1] Koch, G., Zemel, R., & Salakhutdinov, R. (2015, July). Siamese neural networks for one-shot image recognition. In ICML deep learning workshop (Vol. 2). http://www.cs.toronto.edu/~gkoch/files/msc-thesis.pdf.

[2] Reimers, N., & Gurevych, I. (2019). Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084. https://arxiv.org/pdf/1908.10084.pdf.

[3] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. https://arxiv.org/pdf/1810.04805.pdf.

[4] Wikipedia. https://www.wikipedia.org.

[5] Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., & Fidler, S. (2015). Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision (pp. 19-27). https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Zhu_Aligning_Books_and_ICCV_2015_paper.pdf.

[6] Wu, C. S., Hoi, S., Socher, R., & Xiong, C. (2020). TOD-BERT: pre-trained natural language understanding for task-oriented dialogue. arXiv preprint arXiv:2004.06871. https://arxiv.org/pdf/2004.06871.pdf.

[7] Mehri, S., Eric, M., & Hakkani-Tur, D. (2020). Dialoglue: A natural language understanding benchmark for task-oriented dialogue. arXiv preprint arXiv:2009.13570. https://arxiv.org/pdf/2009.13570.pdf.

[8] Rastogi, A., Zang, X., Sunkara, S., Gupta, R., & Khaitan, P. (2020, April). Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 05, pp. 8689-8696). https://ojs.aaai.org/index.php/AAAI/article/view/6394/6250.

[9] Schulz, H., Zumer, J., Asri, L. E., & Sharma, S. (2017). A frame tracking model for memory-enhanced dialogue systems. arXiv preprint arXiv:1706.01690. https://arxiv.org/pdf/1706.01690.pdf.

[10] Li, X., Wang, Y., Sun, S., Panda, S., Liu, J., & Gao, J. (2018). Microsoft dialogue challenge: Building end-to-end task-completion dialogue systems. arXiv preprint arXiv:1807.11125. https://arxiv.org/pdf/1807.11125.pdf.

[11] CosineEmbeddingLoss. https://pytorch.org/docs/stable/generated/torch.nn.CosineEmbeddingLoss.html.

[12] Lison, P., & Tiedemann, J. (2016). Opensubtitles2016: Extracting large parallel corpora from movie and tv subtitles. http://www.lrec-conf.org/proceedings/lrec2016/pdf/947_Paper.pdf.

[13] Loshchilov, I., & Hutter, F. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101. https://arxiv.org/pdf/1711.05101.pdf.

[14] Optimization. https://huggingface.co/docs/transformers/main_classes/optimizer_schedules#transformers.get_polynomial_decay_schedule_with_warmup.

[15] PyTorch Lightning. https://www.pytorchlightning.ai.
