Jaewoo Song



Previous post: https://songstudio.info/tech/tech-38

In the last post, we discussed the training objective and data pre-processing procedure for pre-training the DialogueSentenceBERT.

This time, I will describe the fine-tuning tasks used for model comparison and report how each model performed in the evaluations.

Let us begin.

Fine-tuning tasks

In this project, two tasks are implemented to evaluate each sentence embedding model’s performance.

These are intent detection and system action prediction. Both are text classification problems: the model produces an embedding vector for each utterance, and an additional classification layer predicts one or more suitable classes from that vector.

As mentioned in the previous post, sentence vectors can be pooled from the "[CLS]" position or by mean/max pooling.
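
To make the three pooling options concrete, here is a minimal sketch in plain Python. It assumes the encoder output is a list of per-token vectors with "[CLS]" at position 0, and that an attention mask marks padding with 0. The function names are my own for illustration, and averaging over every non-padding token (including the special tokens) is an assumption rather than a detail confirmed in this post.

```python
def cls_pool(hidden, mask):
    """Return the vector at position 0, where "[CLS]" sits."""
    return list(hidden[0])


def mean_pool(hidden, mask):
    """Average the token vectors whose mask value is 1 (non-padding)."""
    kept = [vec for vec, m in zip(hidden, mask) if m == 1]
    dim = len(hidden[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def max_pool(hidden, mask):
    """Take the element-wise maximum over the non-padding token vectors."""
    kept = [vec for vec, m in zip(hidden, mask) if m == 1]
    dim = len(hidden[0])
    return [max(vec[d] for vec in kept) for d in range(dim)]


# Toy input: 4 positions ("[CLS]", two tokens, one padding), hidden size 2.
hidden = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

print(cls_pool(hidden, mask))   # [1.0, 0.0]
print(mean_pool(hidden, mask))  # [2.0, 2.0]
print(max_pool(hidden, mask))   # [3.0, 4.0]
```

The padding vector is ignored by mean and max pooling, which is why the mask has to be passed alongside the hidden states.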

Intent detection is a multi-class classification task that identifies the user’s intent or purpose from an utterance.

System action prediction asks the model to predict the proper system actions to take, given a dialogue context as input. It is a multi-label classification task, since more than one action may be correct for the same context.
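
The difference between the two settings shows up in how the classification layer's logits are turned into predictions. The sketch below is only an illustration: the function names and the $0.5$ threshold are assumptions, not values taken from my experiments.

```python
import math


def predict_multi_class(logits):
    """Multi-class: pick the single class with the highest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


def predict_multi_label(logits, threshold=0.5):
    """Multi-label: keep every class whose sigmoid probability reaches the threshold."""
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [i for i, p in enumerate(probs) if p >= threshold]


logits = [2.0, -1.0, 0.5]
print(predict_multi_class(logits))  # 0
print(predict_multi_label(logits))  # [0, 2]
```

With the same logits, intent detection returns exactly one class, while action prediction can return several classes or none.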

Moreover, since the datasets for action prediction consist of multi-turn dialogues, I conducted experiments by increasing the number of previous utterances in order to see the difference in model performance when handling multi-turn histories.

The following illustration should help you understand how each fine-tuning task is built.

The description of fine-tuning tasks.

For fine-tuning, I used the AdamW[1] optimizer and the get_linear_schedule_with_warmup[2] scheduler provided by Hugging Face’s Transformers.

This increases the learning rate during the warm-up steps linearly and decays it linearly until the training finishes.
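
The shape of this schedule can be written as a multiplier on the base learning rate. The following is a minimal sketch of that rule, not the library code itself; the function name, the base learning rate, and the step counts are illustrative assumptions.

```python
def linear_warmup_decay(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < warmup_steps:
        # Linear increase from 0 to 1 during warm-up.
        return step / max(1, warmup_steps)
    # Linear decrease from 1 to 0 over the remaining steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 2e-5  # hypothetical value, not the one used in my experiments
for step in (0, 50, 100, 550, 1000):
    print(step, base_lr * linear_warmup_decay(step, 100, 1000))
```

With $100$ warm-up steps out of $1000$, the multiplier is $0.5$ at step $50$, peaks at $1.0$ at step $100$, falls back to $0.5$ at step $550$, and reaches $0$ at the final step.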

Also, the batch size is $16$, and the gradient clipping is set to $1.0$.
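
For readers unfamiliar with gradient clipping, here is a small sketch of one common variant, clipping by the global L2 norm. That the value $1.0$ refers to a norm bound, rather than a per-element bound, is an assumption on my part here, and the function name is illustrative.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale gradients so that their joint L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)  # already within the bound, leave unchanged
    scale = max_norm / total_norm
    return [g * scale for g in grads]


print(clip_by_global_norm([3.0, 4.0]))  # norm 5, rescaled to norm 1
print(clip_by_global_norm([0.3, 0.4]))  # norm 0.5, unchanged
```

The direction of the gradient is preserved; only its length is capped.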

The remaining hyper-parameters differ slightly from dataset to dataset. They are omitted from this post and included in the file attached at the end of the post.

Intent detection

I used three intent classification datasets: the Clinc’s OOS dataset[3], the Banking77 dataset[4], and the ATIS dataset[5].

Since Banking77 and ATIS do not have separate validation sets, I sampled 10% of the train set as the validation set.

The statistics of intent detection datasets.

Since these intent detection datasets have single-turn utterances, there is no need for multi-turn consideration when pre-processing.

For intent detection, all samples are processed by attaching the "[CLS]" token to the front, and the "[SEP]" token to the back.

For evaluation, I used the simple accuracy score for these datasets.

However, since the OOS dataset has a special out-of-scope intent, which is one of the main focuses of this dataset, additional metrics were also calculated, including in-scope accuracy, out-of-scope accuracy, and out-of-scope recall.

The scripts for these metrics were taken from ToD-BERT’s official repository[6].
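
As a rough illustration, two of these extra metrics can be sketched as follows. The definitions below are the commonly used ones (accuracy restricted to in-scope samples, and the fraction of out-of-scope samples that are recognized as such); the scripts I actually used may compute them differently, and the out-of-scope accuracy is left out because its exact definition depends on those scripts. The function name and the label string "oos" are assumptions.

```python
def oos_metrics(gold, pred, oos_label="oos"):
    """Compute overall accuracy, in-scope accuracy, and out-of-scope recall."""
    pairs = list(zip(gold, pred))
    in_scope = [(g, p) for g, p in pairs if g != oos_label]
    out_scope = [(g, p) for g, p in pairs if g == oos_label]

    def accuracy(items):
        # Fraction of (gold, pred) pairs that match; 0.0 for an empty group.
        return sum(g == p for g, p in items) / len(items) if items else 0.0

    return {
        "accuracy": accuracy(pairs),
        "in_scope_accuracy": accuracy(in_scope),
        "oos_recall": accuracy(out_scope),
    }


gold = ["balance", "transfer", "oos", "oos", "balance"]
pred = ["balance", "oos", "oos", "transfer", "balance"]
print(oos_metrics(gold, pred))
```

In this toy example the overall accuracy is $3/5$, the in-scope accuracy is $2/3$, and the out-of-scope recall is $1/2$.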

System action prediction

As I mentioned before, since there are multiple actions which a system can take for each user input, this task is a multi-label classification.

Here, I used MultiWoZ 2.3[7], DSTC2[8], and Simulated Dialogue[9].

In addition, I made each input sequence by concatenating a certain number of recent utterances starting from the query utterance from a user in order to include the multi-turn context.

The maximum number of utterances included can be $1$, $3$, or $5$, which is a hyper-parameter.

The concatenation details for this are identical to those of pre-training data.
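
The idea of limiting the context to the most recent turns can be sketched as below. The function name is illustrative, and the exact layout (joining utterances with "[SEP]" between "[CLS]" and a final "[SEP]") is an assumption made for this example; the actual format follows the pre-training data described in the previous post.

```python
def build_context(utterances, max_turns):
    """Join the last `max_turns` utterances, ending with the user query.

    `utterances` is ordered oldest first, and its last item is the current
    user utterance. The "[SEP]"-joined layout is an assumption.
    """
    recent = utterances[-max_turns:]
    return "[CLS] " + " [SEP] ".join(recent) + " [SEP]"


dialogue = [
    "i need a cheap restaurant",
    "what area do you prefer",
    "the north part of town",
]
for max_turns in (1, 3, 5):
    print(max_turns, build_context(dialogue, max_turns))
```

With a maximum of $1$ turn only the user query is kept, and when the dialogue has fewer utterances than the maximum, all of them are used.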

Moreover, since MultiWoZ 2.3 does not have separate train, validation, and test sets, I split the dialogues randomly with a ratio of $8$:$1$:$1$.
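
A dialogue-level random split of this kind could look like the following minimal sketch. The function name, the seed, and the integer dialogue IDs are illustrative assumptions. Splitting by dialogue rather than by utterance keeps all turns of one dialogue in the same set.

```python
import random


def split_dialogues(dialogue_ids, seed=0):
    """Shuffle dialogue IDs and split them 8:1:1 into train/valid/test."""
    ids = list(dialogue_ids)
    random.Random(seed).shuffle(ids)
    n_train = len(ids) * 8 // 10
    n_valid = len(ids) // 10
    train = ids[:n_train]
    valid = ids[n_train:n_train + n_valid]
    test = ids[n_train + n_valid:]  # whatever remains goes to the test set
    return train, valid, test


train, valid, test = split_dialogues(range(100))
print(len(train), len(valid), len(test))  # 80 10 10
```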

The statistics of system action prediction datasets.

For evaluation, I used $4$ evaluation metrics, which are sample-averaged F1 score, micro-averaged F1 score, macro-averaged F1 score, and exact accuracy.

Each metric is explained in my previous post, “Averaging methods for F1 score calculation in multi-label classification”.

You can check the details here.
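
For quick reference, the four metrics can be sketched in plain Python as follows. Each sample's gold and predicted actions are represented as sets of label indices. The function names are my own, and scoring an empty case as $0$ is an assumption about how zero division is handled.

```python
def f1(tp, fp, fn):
    """F1 from counts; defined as 0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_scores(gold, pred, num_labels):
    """gold and pred are lists of label-index sets, one set per sample."""
    # Sample-averaged F1: score each sample, then average over samples.
    sample_f1 = sum(
        f1(len(g & p), len(p - g), len(g - p)) for g, p in zip(gold, pred)
    ) / len(gold)

    # Per-label counts feed both the micro and the macro average.
    counts = []
    for label in range(num_labels):
        tp = sum(label in g and label in p for g, p in zip(gold, pred))
        fp = sum(label not in g and label in p for g, p in zip(gold, pred))
        fn = sum(label in g and label not in p for g, p in zip(gold, pred))
        counts.append((tp, fp, fn))

    # Micro: pool the counts over labels, then compute one F1.
    micro_f1 = f1(*(sum(c[i] for c in counts) for i in range(3)))
    # Macro: compute F1 per label, then average over labels.
    macro_f1 = sum(f1(*c) for c in counts) / num_labels
    # Exact accuracy: the predicted set must equal the gold set.
    exact = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return {"sample_f1": sample_f1, "micro_f1": micro_f1,
            "macro_f1": macro_f1, "exact_accuracy": exact}


gold = [{0, 1}, {2}, {0}]
pred = [{0}, {2}, {0, 1}]
print(multilabel_scores(gold, pred, num_labels=3))
```

On this toy input the sample-averaged F1 is $7/9$, the micro-averaged F1 is $0.75$, the macro-averaged F1 is $2/3$, and the exact accuracy is $1/3$, which shows how differently the four metrics can judge the same predictions.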


Results

In this section, I report the results of each fine-tuning task for the baseline models, BERT[10], ConvBERT[11], ToD-BERT[12], and SentenceBERT[13], and for DialogueSentenceBERT, which I trained.

Although I experimented with all three pooling strategies on the baseline models, I only report the results for “[CLS]” and mean pooling. Max pooling did not perform well compared with the others, and due to limited resources and time I was only able to pre-train my model with “[CLS]” and mean pooling.

In addition, I marked a score in bold if it is the best, and in red only if it comes from the pre-trained DialogueSentenceBERT.

Also, I wrote the rank of my model next to its score to show how well it performs compared with the other models.

First, the results of intent detection are as follows.

The intent detection results.

First of all, ConvBERT tops almost all metrics by a wide margin, which shows how effective pre-training with masked language modeling (MLM) on large amounts of conversational data is.

ToD-BERT’s scores are lower than those of the original BERT in “[CLS]” pooling, but overall, it achieves higher scores in mean pooling.

Also, it is notable that although SentenceBERT is not intended for transfer learning and is trained on an NLI task, it performs quite well.

DialogueSentenceBERT, which I have focused on, does not perform well on most of the metrics, ranking $4$th or $5$th, and its results on the Banking77 data are the worst.

Although its scores on the ATIS data look fine, it is hard to call them meaningful, since most of the models' scores there are not significantly different.

In addition, it is worth looking at how performance changes with the pooling method for each metric, dataset, and model.

Overall, “[CLS]” pooling seems better than mean pooling, but in some cases the scores are almost the same, and in the case of ToD-BERT in particular, mean pooling scored far higher on the OOS data.

Next, I report the results of system action prediction datasets as follows.

The first image is the results from "[CLS]" pooling, and the second one is from mean pooling.

The system action prediction (CLS pooling) results.

The system action prediction (Mean pooling) results.

The results are more complicated than before.

My model is still not satisfactory, but is partially effective in this task, resulting in the best score in several metrics.

Its performance is particularly notable on the Simulated data with "[CLS]" pooling, and on DSTC2 with mean pooling.

Unfortunately, DialogueSentenceBERT is worse than I expected on MultiWoZ, not only compared with the other competing models but also with the original BERT.

In addition, ToD-BERT, which showed modest results on the intent detection task, performs well enough on this task to match ConvBERT.

This can be attributed to the pre-training objectives of ToD-BERT, which are MLM and response selection on multi-turn TODS datasets. These objectives seem more similar to action prediction with multi-turn context, and the pre-training data is also more closely related to the action prediction data than to the intent detection data.

Of course, the performances of ConvBERT are still amazing.

Also, in this task, SentenceBERT produces good results compared to the basic BERT.

However, it is inferior to BERT on a few metrics. This might be because the limitation of SentenceBERT, which is trained on an NLI task, stands out more here, since action prediction is more complicated and difficult than intent detection.

Another thing we should focus on is the increase in scores as the maximum number of turns grows.

Although this trend is expected, in MultiWoZ the gains from adding turns are not as large as in the other datasets. In fact, some scores decrease, especially when the number of turns changes from $3$ to $5$.

On the other hand, in DSTC2, the effects of multi-turn contexts start to grow and they are very large in Simulated data.

Considering that MultiWoZ data generally has much longer utterances than other data, excessive information seems to cause deterioration.

In other words, for DSTC2 and Simulated, which consist of short and simple utterances, concatenating more previous utterances has a positive effect. In MultiWoZ, however, including too many tokens might spread the attention thinly across tokens, so that the model focuses less on the important information.

Further discussion

In conclusion, the pre-trained model showed a lot of progress compared to the last attempt.

Unlike the previous approach, which was inferior to the baselines in almost all scores, this one reached a level close to the other models and even took the lead in some scores.

However, the results were still below expectations, and the main reason might be the ambiguous decision boundary between the positive and the negative samples.

As I mentioned in the last post, I trained the model to detect the similarity between two utterances or contexts by considering the samples from the same script file as a similar pair and those from the different files as a distant pair.
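
The labeling rule described above can be sketched as follows. This is only an illustration of the rule, not my actual pre-processing code: the function name, the file names, the alternating 1:1 ratio of positive and negative pairs, and the sampling procedure are all assumptions.

```python
import random


def sample_pairs(utterances_by_file, num_pairs, seed=0):
    """Sample (utterance_a, utterance_b, label) triples.

    label is 1 when both utterances come from the same file, 0 otherwise.
    """
    rng = random.Random(seed)
    files = sorted(utterances_by_file)
    pairs = []
    for i in range(num_pairs):
        file_a = rng.choice(files)
        if i % 2 == 0:
            file_b = file_a  # positive pair: stay in the same file
        else:
            # negative pair: pick a different file
            file_b = rng.choice([f for f in files if f != file_a])
        a = rng.choice(utterances_by_file[file_a])
        b = rng.choice(utterances_by_file[file_b])
        pairs.append((a, b, int(file_a == file_b)))
    return pairs


# Hypothetical subtitle files with a few utterances each.
subtitles = {
    "movie_a.srt": ["Where is the ship?", "It left an hour ago.", "Happy birthday!"],
    "movie_b.srt": ["I love this song.", "Turn it up."],
}
for pair in sample_pairs(subtitles, 4):
    print(pair)
```

Note that nothing in this rule looks at the content of the utterances: only the file of origin decides the label.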

However, I think this approach was too naive.

In more detail, even if two utterances are sampled from the same subtitle file, they might represent entirely different subjects or contexts depending on the length or the flow of the movie.

Likewise, we cannot clearly say that the utterances from different subtitles are necessarily negative, since both movies might contain similar genres or contents.

Therefore, there might be quite a large number of samples that work against my purpose, which is to map semantically similar utterances closer together and dissimilar ones farther apart.

It would have been better to build positive pairs only from two consecutive utterances, and to build negative samples by classifying the genre or theme of each movie more explicitly.

So far, I have described my latest attempt at improving DialogueSentenceBERT.

In the end, I still didn’t get satisfactory results, but many improvements were made, and I think it was a good opportunity to gain more knowledge and know-how.

It is not clear whether I will work on this project again, but if I come up with a new idea or get good feedback, I will try to improve it further.

Thank you for visiting this page and I always welcome your feedback and opinions.

The link to the GitHub repository of this project is as follows.


[1] Loshchilov, I., & Hutter, F. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101. https://arxiv.org/pdf/1711.05101.pdf.
[3] Larson, S., Mahendran, A., Peper, J. J., Clarke, C., Lee, A., Hill, P., ... & Mars, J. (2019). An evaluation dataset for intent classification and out-of-scope prediction. arXiv preprint arXiv:1909.02027. https://aclanthology.org/D19-1131.pdf.
[4] Casanueva, I., Temčinas, T., Gerz, D., Henderson, M., & Vulić, I. (2020). Efficient intent detection with dual sentence encoders. arXiv preprint arXiv:2003.04807. https://arxiv.org/pdf/2003.04807.pdf.
[7] Han, T., Liu, X., Takanobu, R., Lian, Y., Huang, C., Wan, D., ... & Huang, M. (2020). MultiWOZ 2.3: A multi-domain task-oriented dialogue dataset enhanced with annotation corrections and co-reference annotation. arXiv preprint arXiv:2010.05594. https://arxiv.org/pdf/2010.05594.pdf.
[8] Henderson, M., Thomson, B., & Williams, J. D. (2014, June). The second dialog state tracking challenge. In Proceedings of the 15th annual meeting of the special interest group on discourse and dialogue (SIGDIAL) (pp. 263-272). https://aclanthology.org/W14-4337.pdf.
[9] Shah, P., Hakkani-Tür, D., Tür, G., Rastogi, A., Bapna, A., Nayak, N., & Heck, L. (2018). Building a conversational agent overnight with dialogue self-play. arXiv preprint arXiv:1801.04871. https://arxiv.org/pdf/1801.04871.pdf.
[10] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. https://arxiv.org/pdf/1810.04805.pdf.
[11] Mehri, S., Eric, M., & Hakkani-Tur, D. (2020). Dialoglue: A natural language understanding benchmark for task-oriented dialogue. arXiv preprint arXiv:2009.13570. https://arxiv.org/pdf/2009.13570.pdf.
[12] Wu, C. S., Hoi, S., Socher, R., & Xiong, C. (2020). TOD-BERT: pre-trained natural language understanding for task-oriented dialogue. arXiv preprint arXiv:2004.06871. https://arxiv.org/pdf/2004.06871.pdf.
[13] Reimers, N., & Gurevych, I. (2019). Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084. https://arxiv.org/pdf/1908.10084.pdf.