Jaewoo Song



It’s been a long time since I last posted a review of an NLP paper.

So, today’s topic is not a deep, complicated new model architecture or approach, but a decent new dialogue-oriented benchmark, which makes for a fresh restart for me.

The paper’s name is DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue[1].

Let us begin.


DialoGLUE is a benchmark for evaluating the various tasks needed in task-oriented dialogue systems, motivated by the famous GLUE[2] & SuperGLUE[3] benchmarks.

It consists of $7$ datasets covering $4$ distinct tasks, and it lets us test an encoder-based model’s performance from multiple angles by providing the code and evaluation standards.

The authors released a baseline model called ConvBERT, which is the pre-trained Bidirectional Encoder Representations from Transformers (BERT)[4] further trained on a large open-domain conversation dataset, and they let other researchers and developers share their results on a public leaderboard.

ConvBERT matches or exceeds SOTA results on $5$ of the $7$ datasets, most notably with a joint goal accuracy on dialogue state tracking that is $2.98$ higher.

To sum up, the authors list the following $4$ main contributions of the paper.

  1. A challenging task-oriented dialogue benchmark consisting of $7$ distinct datasets across $4$ domains.
  2. Standardized evaluation measures for reporting results.
  3. Competitive baselines across the tasks.
  4. A public leaderboard for tracking scientific progress on the benchmark.


These are the detailed specs of the datasets & tasks provided by the DialoGLUE benchmark.

The detailed specs of datasets & tasks in the DialoGLUE benchmark.

Specifically, for intent prediction, we take a pooled representation from an encoder model, such as BERT, and feed it into a linear layer to predict the intent class index.
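To make the intent head concrete, here is a minimal pure-Python sketch of a linear layer applied to a pooled vector. The function names, the toy 3-dimensional vector and the weights are all made up for illustration; a real setup would take the pooled output of an actual encoder and learn the weights during finetuning.

```python
import math

def linear(vec, weights, bias):
    """Linear layer: one weight row (and one bias) per intent class."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_intent(pooled, weights, bias):
    """Return (class index, class probabilities) for one pooled vector."""
    probs = softmax(linear(pooled, weights, bias))
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs

# Toy pooled vector and a 2-class head (hypothetical numbers).
pooled = [0.5, -1.0, 2.0]
weights = [[1.0, 0.0, 0.0],   # class 0 only looks at dimension 0
           [0.0, 0.0, 1.0]]   # class 1 only looks at dimension 2
bias = [0.0, 0.0]

intent_index, intent_probs = predict_intent(pooled, weights, bias)
print(intent_index)
```

The only task-specific parameters here are `weights` and `bias`, so the quality of the prediction rests almost entirely on the pooled vector that the encoder produces.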

For slot filling, the problem is cast as IOB tagging, so it can be handled as token-level classification with an additional linear classification layer on top of each token representation.
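As a small illustration of what IOB tagging means in practice, the sketch below turns per-token tags into slot values. The slot names `people` and `time` and the example utterance are invented for this post and are not taken from any DialoGLUE dataset.

```python
def iob_to_spans(tokens, tags):
    """Group IOB tags into (slot_name, text) pairs.

    "B-x" opens a span for slot x, "I-x" continues it, "O" is outside.
    An "I-x" tag that does not continue an open span of x is ignored.
    """
    spans = []
    slot, words = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if slot is not None:
                spans.append((slot, " ".join(words)))
            slot, words = tag[2:], [token]
        elif tag.startswith("I-") and slot == tag[2:]:
            words.append(token)
        else:
            if slot is not None:
                spans.append((slot, " ".join(words)))
            slot, words = None, []
    if slot is not None:
        spans.append((slot, " ".join(words)))
    return spans

tokens = ["book", "a", "table", "for", "two", "at", "seven", "pm"]
tags = ["O", "O", "O", "O", "B-people", "O", "B-time", "I-time"]
slots = iob_to_spans(tokens, tags)
print(slots)
```

The model itself only has to emit one tag per token; the grouping into slot values is plain post-processing like this.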

For semantic parsing, each utterance should be converted into a hierarchical tree whose root is a top-level intent and whose leaves are per-word labels.

Therefore, a pooled representation can again be used for intent prediction, while each word-level latent representation is used to classify the corresponding word.
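The two-level prediction can be sketched as follows. The logits are toy numbers, the label names are only illustrative, and checking the parse with an all-or-nothing exact match is an assumption of this sketch, since the metric table is not reproduced in this post.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)

def predict_parse(intent_logits, word_logits, intent_names, label_names):
    """One intent from the pooled logits, one label per word."""
    intent = intent_names[argmax(intent_logits)]
    labels = [label_names[argmax(row)] for row in word_logits]
    return intent, labels

def is_exact_match(predicted, gold):
    """True only if the intent and every word label agree."""
    return predicted[0] == gold[0] and predicted[1] == gold[1]

# Illustrative label sets and toy logits (not real model output).
intent_names = ["IN:GET_DIRECTIONS", "IN:GET_EVENT"]
label_names = ["O", "SL:DESTINATION"]
words = ["directions", "to", "the", "airport"]
intent_logits = [2.1, 0.3]
word_logits = [[1.5, 0.2], [1.1, 0.4], [0.3, 0.9], [0.1, 2.0]]

parse = predict_parse(intent_logits, word_logits, intent_names, label_names)
gold = ("IN:GET_DIRECTIONS", ["O", "O", "SL:DESTINATION", "SL:DESTINATION"])
print(parse, is_exact_match(parse, gold))
```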

Finally, for dialogue state tracking (DST), the authors applied TripPy[5], which combines an underlying BERT model with a triple copy strategy to perform the DST task.

In more detail, TripPy relies on $3$ copy mechanisms: (1) span prediction, which extracts values from the user utterance, (2) a copy mechanism over concepts mentioned in the system utterance, and (3) a copy mechanism over the DS memory, that is, the existing dialogue state.
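To see how the three sources fit together, here is a rough sketch of a per-slot state update. It is not TripPy's actual code: the function `update_slot`, the gate labels (`span`, `inform`, `refer`, `none`) and the slot names are hypothetical names chosen for this illustration, and in a real system a classifier on top of BERT would choose the gate.

```python
def update_slot(slot, gate, state, span_value=None,
                system_informed=None, referred_slot=None):
    """Return the new value of `slot` for the chosen copy mechanism."""
    if gate == "span":
        # Copy 1: value extracted as a span of the user utterance.
        return span_value
    if gate == "inform":
        # Copy 2: value the system mentioned in its last utterance.
        return (system_informed or {}).get(slot, state.get(slot))
    if gate == "refer":
        # Copy 3: value already stored in the dialogue state (DS memory).
        return state.get(referred_slot, state.get(slot))
    # "none": the turn says nothing new about this slot.
    return state.get(slot)

# Existing dialogue state before the current turn (toy example).
state = {"hotel-area": "centre"}

state["restaurant-people"] = update_slot(
    "restaurant-people", "span", state, span_value="4")
state["restaurant-name"] = update_slot(
    "restaurant-name", "inform", state,
    system_informed={"restaurant-name": "curry garden"})
state["restaurant-area"] = update_slot(
    "restaurant-area", "refer", state, referred_slot="hotel-area")
print(state)
```

The third call is the interesting one: "the same area as the hotel" never names a value, so it can only be resolved by copying from the state that was built in earlier turns.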

All of the above settings can be applied to evaluate various types of models based on the transformer encoder structure including BERT.

Since the results depend mainly on the representational power of the underlying encoder, DialoGLUE is helpful for checking the quality of our own encoder models.

ConvBERT & Task-Adaptive Training

ConvBERT is an extended version of the original BERT, further trained on the Masked Language Modeling task using $700$M open-domain conversations.

The authors argue that plain BERT is not sufficient for dialogue understanding, and that training it in a self-supervised manner on a large dialogue corpus lets the model produce much more meaningful semantic representations of utterances in various dialogues.

In addition, they constructed the training sequences from multi-turn utterances, which is natural since a conversation usually consists of multiple turns and dialogue state tracking cannot be done without multi-turn context.

We should also look at another useful training strategy named Task-Adaptive Training[6], which first runs self-supervised learning on the downstream task’s dataset and only then finetunes on it.

This approach is known to boost performance by adapting the model to the domain beforehand.

The authors also included Task-Adaptive Training when developing the baselines by performing self-supervised masked language modeling (MLM) on each DialoGLUE dataset.
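For readers who have not implemented MLM before, the data side is simple enough to sketch in a few lines. This is a simplified version: it always substitutes `[MASK]`, whereas the original BERT recipe sometimes keeps the token or swaps in a random one, and the 15% masking rate is borrowed from BERT as an assumption rather than taken from the DialoGLUE paper.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Build (inputs, labels) for masked language modeling.

    Masked positions carry the original token as the label;
    all other positions get None and are ignored by the loss.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(token)
        else:
            inputs.append(token)
            labels.append(None)
    return inputs, labels

utterance = "i would like to book a cheap hotel in the centre".split()
inputs, labels = mask_tokens(utterance, mask_prob=0.15, seed=0)
print(inputs)
```

Because the labels come from the text itself, this step needs no annotation, which is why it can be run on each downstream dataset before (or alongside) supervised finetuning.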

In other words, a model can be finetuned on the DialoGLUE benchmark in the following $4$ ways.

  1. Without any Task-Adaptive Training
  2. Self-supervised MLM before finetuning (Pre)
  3. Self-supervised MLM jointly with finetuning, as multi-task training (Multi)
  4. MLM before finetuning and also multi-task training with MLM during finetuning (Pre + Multi)

Experiments & Results

The authors used $4$ BERT-based models to get the results on the DialoGLUE benchmark: (1) BERT-base, (2) ConvBERT, (3) BERT-DG, which is BERT trained on the full DialoGLUE data in a self-supervised MLM manner, and (4) ConvBERT-DG, which is ConvBERT trained on DialoGLUE in the same way as BERT-DG.

These models were combined with the $4$ training strategies mentioned above, giving $16$ training configurations in total.
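The grid of runs is easy to lose track of, so here is a small sketch that enumerates it. The phase names such as `mlm_on_task_data` are labels invented for this post, not identifiers from the authors' code.

```python
from itertools import product

MODELS = ["BERT-base", "ConvBERT", "BERT-DG", "ConvBERT-DG"]

# Each strategy is described by two switches.
STRATEGIES = {
    "None": {"mlm_before": False, "mlm_during": False},
    "Pre": {"mlm_before": True, "mlm_during": False},
    "Multi": {"mlm_before": False, "mlm_during": True},
    "Pre + Multi": {"mlm_before": True, "mlm_during": True},
}

def training_phases(strategy):
    """List the training phases a strategy goes through, in order."""
    flags = STRATEGIES[strategy]
    phases = []
    if flags["mlm_before"]:
        phases.append("mlm_on_task_data")
    phases.append("finetune_with_mlm" if flags["mlm_during"] else "finetune")
    return phases

runs = [(model, strategy, training_phases(strategy))
        for model, strategy in product(MODELS, STRATEGIES)]

print(len(runs))
for model, strategy, phases in runs[:4]:
    print(model, "|", strategy, "|", " -> ".join(phases))
```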

Additionally, the authors tested a few-shot setting, using only $10$% of the training data, to assess the effectiveness of the pre-trained models. Each few-shot experiment was run $5$ times with different random seeds.

The evaluation metric for each downstream task is listed in the table above.

The first two tables show the overall performance in the full-data setting, and the next table shows the few-shot results.
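A few-shot split like this can be mimicked with a seeded subsample. One caveat: this post does not say whether the seed changes the $10$% subset or only the training run, so the sketch below simply assumes the former, and the function name and seed values are my own.

```python
import random

def few_shot_subsets(examples, fraction=0.10, seeds=(0, 1, 2, 3, 4)):
    """Draw one reproducible subset of the training data per seed."""
    size = max(1, round(len(examples) * fraction))
    return {seed: random.Random(seed).sample(examples, size)
            for seed in seeds}

# Stand-in for a training set of 100 examples.
train_examples = list(range(100))
subsets = few_shot_subsets(train_examples)

for seed, subset in subsets.items():
    print(seed, len(subset))
```

Using a dedicated `random.Random(seed)` per run keeps each subset reproducible without touching the global random state.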
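Of these metrics, joint goal accuracy, the DST metric mentioned at the beginning, is the least forgiving: a turn counts as correct only when the entire predicted dialogue state equals the gold state. A minimal sketch with toy states (the slot names and values are invented):

```python
def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of turns whose full predicted state equals the gold state."""
    correct = sum(pred == gold
                  for pred, gold in zip(predicted_states, gold_states))
    return correct / len(gold_states)

gold_states = [
    {"hotel-area": "centre"},
    {"hotel-area": "centre", "hotel-stars": "4"},
    {"hotel-area": "centre", "hotel-stars": "4", "hotel-parking": "yes"},
]
predicted_states = [
    {"hotel-area": "centre"},
    {"hotel-area": "centre", "hotel-stars": "4"},
    # One wrong slot value makes the whole turn count as wrong.
    {"hotel-area": "centre", "hotel-stars": "4", "hotel-parking": "no"},
]

jga = joint_goal_accuracy(predicted_states, gold_states)
print(jga)
```

In the third turn two of three slots are right, yet the turn scores zero, which is why improvements on this metric are hard to come by.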

The results of full-data setting.

The results of few-shot setting.


First, let’s consider several points from the full-data results.

Surprisingly, ConvBERT alone is not as effective, compared with the original BERT, as one would expect.

However, with Task-Adaptive self-supervised learning, it outperformed BERT on several tasks, most remarkably on dialogue state tracking.

This shows that large-scale pre-training on an open-domain dialogue corpus transfers effectively to task-oriented dialogue system (TODS) tasks only through task-adaptive training.

Also, the scores of BERT-base tend to decrease even after task-adaptive training, which is ironic but suggests that BERT by itself is not well suited to handling dialogue tasks.

On the other hand, after the prior MLM training on DialoGLUE, this tendency of BERT appears less often, indicating that pre-training on dialogue data did improve BERT’s understanding capability.

And interestingly, in the task-adaptive setting BERT sometimes outperforms ConvBERT, which suggests that even pre-training on a huge amount of conversation data cannot beat task-specific self-supervised training under certain conditions.

By checking the results from ConvBERT-DG, we can see that pre-training with DialoGLUE results in mixed outcomes.

Without task-adaptive training, ConvBERT-DG improved on some datasets such as BANKING77, HWU64 and MultiWOZ, but it also produced drastically worse results, especially on DSTC8 and TOP.

Only with the task-adaptive setting did its performance recover, and even then only to some extent.

The authors say that ConvBERT-DG might have lost some of its language understanding capability due to the DialoGLUE pre-training, and that part of it can be recovered by task-specific training.

In conclusion, too much pre-training on a specific dataset is not always the answer.

Now, let’s look at the few-shot results.

We can see that ConvBERT and ConvBERT-DG outperform other models in general, especially with task-adaptive training.

Except on the DSTC8 and TOP datasets, pre-training on DialoGLUE seems significantly more effective than plain BERT & ConvBERT.

But this is expected, since these models have already seen the data inside the DialoGLUE benchmark and can take advantage of that knowledge in few-shot settings.

Also, these gaps are huge on MultiWOZ, suggesting that dialogue state tracking depends more heavily on semantically improved representations and benefits from seeing additional data intensively.

But it is also interesting that, even though these models were pre-trained on the full data, their few-shot results on DSTC8 and TOP dropped, which runs against our intuition.

We looked through DialoGLUE, a new benchmark of natural language understanding tasks for task-oriented dialogue, and its competitive baseline models.

We can access the public leaderboard at this link hosted on EvalAI.

And the authors also provide the pre-trained checkpoints including ConvBERT, ConvBERT-DG and BERT-DG here.

This will be a great benchmark for anyone who develops or researches dialogue models.

[1] Mehri, S., Eric, M., & Hakkani-Tur, D. (2020). DialoGLUE: A natural language understanding benchmark for task-oriented dialogue. arXiv preprint arXiv:2009.13570. https://arxiv.org/abs/2009.13570
[2] Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. R. (2018). GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461. https://arxiv.org/abs/1804.07461
[3] Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., ... & Bowman, S. R. (2019). SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537. https://arxiv.org/abs/1905.00537
[4] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. https://arxiv.org/pdf/1810.04805.pdf
[5] Heck, M., van Niekerk, C., Lubis, N., Geishauser, C., Lin, H. C., Moresi, M., & Gašić, M. (2020). TripPy: A triple copy strategy for value independent neural dialog state tracking. arXiv preprint arXiv:2005.02877. https://arxiv.org/abs/2005.02877
[6] Gururangan, S., Marasović, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey, D., & Smith, N. A. (2020). Don't stop pretraining: Adapt language models to domains and tasks. arXiv preprint arXiv:2004.10964. https://arxiv.org/pdf/2004.10964.pdf