Jaewoo Song



Let me introduce an interesting paper I have recently read regarding dialogue modeling and evaluation.

This paper proposes a new dialogue model called DialoFlow[1], which I personally see as a compromise between two multi-turn context encoding methods: concatenation and hierarchical encoding.

Moreover, the pre-trained DialoFlow model works not only for response generation but also for open-domain dialogue evaluation, as an evaluation metric called FlowScore.

Since I have been especially interested in both dialogue generation and open-domain dialogue evaluation, I thought I could get several useful insights from this work and decided to review it.

Ok, let’s start.


DialoFlow is proposed as a dialogue modeling approach that better captures the change between turns, since current dialogue encoding methods cannot properly capture the dynamic flows in real-world conversations.

Let’s take a look at the picture.

The description of the overall idea of DialoFlow.

This shows that although the dialogues in our everyday life seem inconsistent and diverse, they are actually driven by specific topics (or goals) and by the semantic influence that each utterance induces.

In other words, the paper contends that there are “flows” to make a certain context move to the next context and it is important to model these patterns properly.

Here, the author calls the state of the dialogue so far the “context” and the semantic transition between consecutive contexts the “semantic influence”, which can be interpreted as the effect that each speaker’s utterance triggers at each turn.

Then do dialogue models catch these flows well enough now?

At least the author says no.

It is argued that the current “flat” concatenation approach cannot separate each turn clearly and fails to detect the change between each context. (I will discuss this more specifically in the next section.)

Therefore, DialoFlow can address this limitation to generate better responses. Moreover, after being trained to identify ideal and universal dialogue flows, it can even evaluate other models’ outputs by comparing the semantic influence of their utterances with the semantic influence it expects.

Previous works

Usually, there are two ways to encode the multi-turn context.

The first one is the “flat pattern” method, which concatenates previous utterances into a single sequence and feeds it as an input, and the second is “hierarchical modeling”, which first encodes each utterance at the word level and then encodes the sequence of turns at the utterance level.

The example of the flat pattern.
ToD-BERT[2]: Flat pattern.
The example of the hierarchical modeling.
VHRED[3]: Hierarchical modeling.

The flat pattern approach is the most widely used method, and it has been adopted especially often since various pre-trained language models based on the transformer[4] came out.

By exploiting the strength of multi-head attention, the model encodes the input based on the attention score of each token, and it is known to produce more fluent and natural responses than hierarchical encoding.

However, as mentioned before, its weakness is that it is hard to catch the changes between turns, since it applies the attention mechanism at the token level, not at the turn level.

To compensate for this, special tokens such as a speaker token or a separation token are included, but I have seen that they are sometimes not as effective as expected, especially in real-world situations.

Due to this shortcoming, some irregular cases might emerge. For example, the model generates outputs that are not consistent with its previous utterances, or it directly answers a question that it asked itself.

The flat pattern method was used in various models such as DialoGPT[5], BST[6], and PLATO[7].

Hierarchical modeling has been commonly used in RNN[8]-based dialogue models and has recently been implemented in transformer-based models like ReCoSa[9], which I previously reviewed.

This method might lose information over several rounds of compression, since it first encodes each turn separately and then combines all the vectors.

Therefore, it is known to perform worse than the concatenation approach.

The examples of hierarchical modeling can be seen in HRED[10], VHRED, and DialogBERT[11].


In the end, I think the main insight of DialoFlow is a compromise between the flat pattern and hierarchical modeling that takes advantage of the strengths of both.

The overall architecture of DialoFlow is as follows.

The overall architecture of DialoFlow.

The model consists of three main parts, the transformer block, flow module, and response generator.

The transformer block is based on the uni-directional transformer structure, which is identical to GPT[12], and the flow module models context flows hierarchically by extracting the encoded vector of each turn from the preceding stage.

Then the response generator generates the actual utterances by calculating semantic influences based on these contexts.

Before looking into each part specifically, let’s check the basic definitions from the paper.

Fundamentally, the principle of DialoFlow can be defined as follows.

\[I_k=C_{k+1}-C_k\] \[\displaylines{ \text{$C_k$: The context from the dialogue history from $1$ to $(k-1)$ at the current turn $k$.} \\ \text{$I_k$: The semantic influence which would be brought by the $k$th utterance.} }\]

To sum up, to make a transition from the context $C_k$, which contains the utterances from $1$ to $(k-1)$, to the next context $C_{k+1}$, the semantic influence $I_k$ should occur, which is produced by the $k$th utterance.

That is, if we have a well-extracted $I_k$, it becomes possible to generate a response that appropriately changes $C_k$ into $C_{k+1}$.

Thus, the ultimate purpose of DialoFlow is to predict the future context $C_{k+1}^{\prime}$ from the context so far $C_k$, get $I_k^{\prime}$, which is the difference between the two contexts, and make a final response utilizing this predicted semantic influence.

If the obtained $C_{k+1}^{\prime}$ is close to the ideal context we want, $C_{k+1}$, we can extract the semantic influence accurately as well, and this eventually leads to a high-quality response.
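
To make the definition $I_k = C_{k+1} - C_k$ concrete, here is a tiny sketch. The 3-dimensional vectors are numbers I made up for illustration; in the real model the context vectors come from the transformer and are much larger.

```python
def semantic_influence(context_k, context_next):
    """Return I_k = C_{k+1} - C_k as an element-wise difference."""
    return [nxt - cur for cur, nxt in zip(context_k, context_next)]


# Hypothetical 3-dimensional context vectors, invented for illustration.
C_k = [0.2, 0.5, -0.1]
C_k_plus_1 = [0.6, 0.4, 0.3]

I_k = semantic_influence(C_k, C_k_plus_1)

# Adding the influence back to C_k recovers C_{k+1} (up to float rounding).
recovered = [c + i for c, i in zip(C_k, I_k)]
print(I_k, recovered)
```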

Ok, now we are going to look at each part specifically.

First, the transformer block is a uni-directional transformer, which takes a concatenated dialogue context as an input.

As you can see, this is not different from the GPT which we already know.

Here, the author added speaker tokens in front of utterances to separate each turn and another special token $[C]$ at the end of each utterance.

This $[C]$ token is used later in the flow module, where its hidden state is extracted as the context vector for each turn. It has a similar function to the $[CLS]$ token in BERT[13], which is prepended to an input.

The reason why this token is appended to the end of an utterance is that, unlike BERT, GPT conducts the attention from left to right, so the whole information from the attention mechanism is accumulated at the very end.
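
The input layout described above can be sketched roughly as follows. This is only my illustration: the speaker token strings (`<speaker1>`, `<speaker2>`), the helper name `build_flat_input`, and the whitespace tokenization are assumptions, not the tokens or tokenizer used in the paper.

```python
def build_flat_input(utterances, speaker_tokens=("<speaker1>", "<speaker2>"),
                     context_token="[C]"):
    """Concatenate turns into one sequence and record where each [C] lands.

    The speaker token strings and whitespace tokenization are placeholders,
    not the tokens or tokenizer used in the paper.
    """
    tokens, context_positions = [], []
    for turn_index, utterance in enumerate(utterances):
        # Speaker token in front of the utterance.
        tokens.append(speaker_tokens[turn_index % len(speaker_tokens)])
        tokens.extend(utterance.split())
        # [C] at the end, where left-to-right attention has seen the whole turn.
        tokens.append(context_token)
        context_positions.append(len(tokens) - 1)
    return tokens, context_positions


tokens, context_positions = build_flat_input(["hi there", "hello"])
print(tokens)
print(context_positions)
```

The recorded positions are the spots from which the flow module would later pick up one hidden state per turn.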

In addition, the author comments that the pre-normalization used in GPT-2 is adopted instead of the post-normalization from BERT because the latter is known to be less stable than the former, which can lead to a decrease in performance when the model size increases.

Lastly, since this part is no different from the original GPT structure, we can initialize it with the pre-trained GPT-2[14] or DialoGPT checkpoint instead of training the model from the beginning. (We will see this later.)

Next is the flow module.

This is actually an additional uni-directional attention layer, so there is nothing special.

However, unlike the previous part, this flow module conducts the attention only with the vectors extracted from the spots of $[C]$ tokens at the end of all utterances, not with all token representations.

This can be seen as a hierarchical modeling phase, which catches the flow between each turn.

The difference from the original hierarchical encoding is that it uses both encoding approaches: before the context flows are modeled, each utterance vector is generated by token-level attention that takes all tokens from the start into consideration, not by token-level encoding conducted independently for each utterance.

Anyway, the future context $C_{k+1}^{\prime}=\text{Flow}(C_1, C_2, \ldots , C_k)$ is predicted through this module and the semantic influence $I_k^{\prime} = C_{k+1}^{\prime} - C_k$ is calculated accordingly.
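
To get a feeling for what $\text{Flow}(C_1, \ldots, C_k)$ does, here is a heavily simplified sketch. I assume a single attention head with identity query/key/value projections, no feed-forward part, and toy 2-dimensional context vectors; the actual module has learned weights, so this only mimics the data flow, not the real computation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def flow_predict(contexts):
    """Toy Flow(C_1..C_k): the last context attends over all contexts so far.

    Identity projections are an assumption made for illustration only.
    """
    query = contexts[-1]
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, c) / scale for c in contexts])
    return [sum(w * c[d] for w, c in zip(weights, contexts))
            for d in range(len(query))]


def predicted_influence(contexts):
    """I'_k = C'_{k+1} - C_k, using the toy flow prediction."""
    predicted_next = flow_predict(contexts)
    return [p - c for p, c in zip(predicted_next, contexts[-1])]


# Hypothetical [C] vectors for three turns.
contexts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
predicted_next = flow_predict(contexts)
influence = predicted_influence(contexts)
print(predicted_next, influence)
```

Only the vectors at the $[C]$ positions enter this step, which is what makes it a turn-level (hierarchical) pass on top of the token-level one.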

Finally, the response generator makes a response based on the predicted semantic influence $I_k^{\prime}$.

This module operates in the next word prediction fashion using a feed-forward layer and a softmax function.

To predict the next token, the vector representation of each token generated so far, which is an output from the transformer block, is concatenated with $I_k^{\prime}$.

Therefore, the predicted semantic influence helps hidden states from previous tokens become more representative not only of the tokens themselves but also of the change of current context flow.
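
The concatenate-then-project step can be sketched like this. The dimensions, the weight values, and the function name `next_token_distribution` are all made up by me for illustration; the paper's generator works with the model's real hidden size and vocabulary.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_distribution(hidden_state, influence, weight, bias):
    """Concatenate a token's hidden state with I'_k, then apply linear + softmax."""
    features = list(hidden_state) + list(influence)
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weight, bias)]
    return softmax(logits)


# Toy sizes: hidden size 2, influence size 2, vocabulary size 3 (all assumed).
hidden_state = [0.5, -0.2]
influence = [0.1, 0.3]
weight = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 2.0],
]
bias = [0.0, 0.0, 0.0]

probs = next_token_distribution(hidden_state, influence, weight, bias)
print(probs)
```

Note that the last row of `weight` only looks at the influence part of the features, so changing $I_k^{\prime}$ alone already shifts the next-token distribution.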

Training objectives

Now, let’s talk about the specific training procedure for the DialoFlow.

There are three objectives to train this model, Context Flow Modeling (CFM), Semantic Influence Modeling (SIM), and Response Generation Modeling (RGM).

First, the Context Flow Modeling is to train the model to predict the future context properly.

This is done by minimizing the squared L2 norm of the difference between the actual context vector and the predicted context vector.

\[L_{CFM} = \sum_{k=1}^{N} ||C_k - C_k^{\prime}||_2^2\] \[\displaylines{ \text{$N$: The number of turns (contexts).} }\]
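
A minimal sketch of this loss, assuming the actual and predicted context vectors are already available as plain lists (the toy values below are mine):

```python
def cfm_loss(actual_contexts, predicted_contexts):
    """Sum over turns of the squared L2 distance between C_k and C'_k."""
    return sum(
        sum((a - p) ** 2 for a, p in zip(actual, predicted))
        for actual, predicted in zip(actual_contexts, predicted_contexts)
    )


# Hypothetical context vectors for N = 2 turns.
actual_contexts = [[1.0, 0.0], [0.0, 2.0]]
predicted_contexts = [[0.0, 0.0], [0.0, 0.0]]

loss = cfm_loss(actual_contexts, predicted_contexts)
print(loss)
```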

Second, the Semantic Influence Modeling uses a Bag-of-Words (BoW) loss to bring the calculated semantic influence close to the actual content of the response.

It seems that the BoW loss is used here because the purpose of this objective is not to fit the influence auto-regressively to the whole utterance, which might cause severe over-fitting, but to make it represent the overall meaning of the response.

\[\displaylines{ L_{SIM} = -\sum_{k=1}^{N} \sum_{t=1}^{T} \log{p(u_k^t | I_k^{\prime})}=-\sum_{k=1}^{N} \sum_{t=1}^{T} \log{f_{u_k^t}} \\ f=\text{softmax}(W_2I_k^{\prime} + b_2) \in \mathbb{R}^{|V|} }\] \[\displaylines{ \text{$N$: The number of turns.} \\ \text{$T$: The number of tokens in the $k$-th utterance.} \\ \text{$u_k^t$: The $t$-th token.} \\ \text{$W_2$: An additional weight matrix to train.} \\ \text{$b_2$: An additional bias.} }\]
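
For a single turn, the BoW loss can be sketched as below. The vocabulary size, the all-zero `W2` and `b2`, and the token ids are toy assumptions of mine; the point is only that the same distribution $f$ is reused for every token of the utterance, regardless of word order.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bow_loss(influence, token_ids, W2, b2):
    """-sum_t log f[u_t], where f = softmax(W2 @ I'_k + b2) is shared by all tokens."""
    logits = [sum(w * x for w, x in zip(row, influence)) + b
              for row, b in zip(W2, b2)]
    f = softmax(logits)
    return -sum(math.log(f[token_id]) for token_id in token_ids)


# Toy setup: influence size 2, vocabulary size 4, utterance of 3 token ids.
influence = [0.3, -0.7]
W2 = [[0.0, 0.0] for _ in range(4)]
b2 = [0.0] * 4
token_ids = [0, 2, 2]

loss = bow_loss(influence, token_ids, W2, b2)
print(loss)
```

With zero weights $f$ is uniform, so the loss is simply the number of tokens times $\log{|V|}$.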

One interesting thing here is that this loss introduces extra parameters ($W_2$ and $b_2$) that exist only to compute the loss.

So the order seems to be a little bit reversed, considering that a loss function originally exists to train the model’s parameters.

In my opinion, this objective might not be necessary, since if $L_{CFM}$ is modeled appropriately, the semantic influence should also be calculated properly.

I think the author assumed that there is another layer that converts the semantic influence into the contents of the corresponding response, and that tuning this layer can help model context flows better, even if this layer is not used in the actual inference.

Lastly, the Response Generation Modeling is the same as the standard cross-entropy loss, which is broadly used in conditional language modeling.

The only difference is that the semantic influence is included.

\[L_{RGM} = -\sum_{k=1}^{N} \log{p(u_k|I_k^{\prime}, u_{<k})}=-\sum_{k=1}^{N} \sum_{t=1}^{T} \log{p(u_k^t|I_k^{\prime}, u_{<k},u_k^{<t})}\] \[\displaylines{ \text{$N$: The number of turns.} \\ \text{$T$: The number of tokens in the $k$-th utterance.} \\ \text{$u_k^t$: The $t$-th token.} }\]

And the final loss is the sum of all loss values, which is $L = L_{CFM} + L_{SIM} + L_{RGM}$.
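
As a rough sketch, assuming the model has already produced the probability of each gold token (conditioned on $I_k^{\prime}$ and the history), the RGM loss and the total loss look like this. The probabilities and the CFM/SIM values are placeholders I chose.

```python
import math


def rgm_loss(gold_token_probs_per_turn):
    """Negative log-likelihood of the gold tokens, summed over turns and tokens."""
    return -sum(
        math.log(p)
        for turn_probs in gold_token_probs_per_turn
        for p in turn_probs
    )


def total_loss(l_cfm, l_sim, l_rgm):
    """L = L_CFM + L_SIM + L_RGM, an unweighted sum as written above."""
    return l_cfm + l_sim + l_rgm


# Hypothetical gold-token probabilities for two turns.
gold_token_probs_per_turn = [[0.5, 0.25], [0.5]]
l_rgm = rgm_loss(gold_token_probs_per_turn)

# Placeholder values standing in for the other two objectives.
loss = total_loss(l_cfm=5.0, l_sim=1.0, l_rgm=l_rgm)
print(l_rgm, loss)
```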


As mentioned above, a well-trained DialoFlow can also be a useful dialogue evaluation model.

This is because DialoFlow not only generates responses according to given contexts, but also predicts the next appropriate dialogue flow just as people usually do.

Therefore, by comparing the semantic influence of a generated response with the semantic influence that DialoFlow expects for the next turn, it is possible to assess the quality of dialogues generated by other models.

The turn-level score is calculated as the similarity between two semantic influences: one from an utterance generated by another model ($I_k$) and the other predicted by DialoFlow ($I_k^{\prime}$).

\[s_k = \cos(I_k^{\prime}, I_k) \cdot length(I_k^{\prime}, I_k) = \frac{I_k^{\prime} \cdot I_k}{||I_k^{\prime}|| \ ||I_k||} \cdot \frac{\text{min}(||I_k^{\prime}||, \ ||I_k||)}{\text{max}(||I_k^{\prime}||, \ ||I_k||)}\]

Here, the semantic similarity is obtained via the cosine similarity ($\cos(I_k^{\prime}, I_k)$), and the length penalty ($length(I_k^{\prime}, I_k)$) lowers the score when the two vectors differ greatly in norm.

I think that since the cosine similarity is based only on the angle between two vectors, the length penalty is adopted to also reflect the difference in their magnitudes.
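
The turn-level score translates almost directly into code. This sketch assumes both influence vectors are non-zero (otherwise the norms in the denominators would be zero); the example vectors are mine.

```python
import math


def turn_score(predicted_influence, actual_influence):
    """s_k = cosine similarity times the min/max norm ratio (length penalty)."""
    norm_pred = math.sqrt(sum(x * x for x in predicted_influence))
    norm_act = math.sqrt(sum(x * x for x in actual_influence))
    dot_product = sum(p * a for p, a in zip(predicted_influence, actual_influence))
    cosine = dot_product / (norm_pred * norm_act)
    length_penalty = min(norm_pred, norm_act) / max(norm_pred, norm_act)
    return cosine * length_penalty


same = turn_score([1.0, 2.0], [1.0, 2.0])          # same direction, same norm
half = turn_score([1.0, 0.0], [2.0, 0.0])          # same direction, norm ratio 1:2
opposite = turn_score([1.0, 0.0], [-1.0, 0.0])     # opposite direction
print(same, half, opposite)
```

The second case shows what the length penalty adds: the cosine similarity alone would be $1$, but the mismatch in norms halves the score.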

Additionally, by combining each turn score, it is also possible to calculate the dialogue-level score.

Here, the dialogue-level perplexity is used, and this final dialogue-level score is called FlowScore, which was mentioned earlier.

\[\text{Flow score} = 2^{-\frac{1}{M} \sum_k^M \log{\frac{s_k+1}{2}}}\] \[\displaylines{ \text{$M$: The number of turns.} \\ \text{$s_k$: The turn score at the $k$-th turn.} }\]

This calculation is similar to that of the token-level perplexity we already know.

The difference is that the normalized turn-level score $\frac{s_k+1}{2}$, which maps $s_k$ from $[-1, 1]$ into $[0, 1]$, is used instead of the conditional probability of each token.
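
Here is a small sketch of the dialogue-level score. I assume the logarithm is base 2 so that it matches the base of the exponent, and that every $s_k$ is strictly greater than $-1$ (otherwise the logarithm is undefined).

```python
import math


def flow_score(turn_scores):
    """Perplexity-style aggregate over turn scores s_k in (-1, 1]. Lower is better."""
    mean_log = sum(math.log2((s + 1) / 2) for s in turn_scores) / len(turn_scores)
    return 2 ** (-mean_log)


perfect = flow_score([1.0, 1.0, 1.0])   # every turn matches the expected flow
neutral = flow_score([0.0, 0.0])        # every turn is orthogonal to it
print(perfect, neutral)
```

So a dialogue that follows the expected flow perfectly gets the minimum value of $1$, and the score grows as the turn scores get worse.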


In this paper, DialoFlow was pre-trained in the same way that DialoGPT was trained.

As mentioned before, the transformer block is no different from transformer decoder models such as GPT-2 or DialoGPT, so the pre-training started from the GPT-2 checkpoints (GPT2-base, GPT2-medium, GPT2-large) instead of training from scratch.

The pre-training data is the Reddit comments that are publicly available on pushshift.io[15], cleaned following the procedure used in DialoGPT.

After the pre-training, the model was tested on two dialogue datasets. The first is the Reddit test set with multiple references.

The second is DailyDialog[16], another popular multi-turn open-domain dialogue dataset, on which DialoFlow was fine-tuned for response generation and evaluated with the multi-reference version of the test set[17].

Then the pre-trained and fine-tuned DialoFlow models were compared with the baseline model DialoGPT, the version pre-trained from the GPT-2 checkpoint.

Also, two decoding methods were used: greedy decoding (Reddit) and beam search (Reddit: beam size 10, DailyDialog: beam size 5).

I will omit the rest of the training details since they are easily found in the paper.

Also, to evaluate FlowScore as an open-domain evaluation metric, the author used the data collected from the Interactive Evaluation of Dialog Track @ The Ninth Dialog System Technology Challenge (DSTC9)[18].

This contains $2,200$ human-bot conversations from $11$ bots, and each dialogue is annotated with an overall quality rating from $0$ to $5$.

To test the model on human-human dialogues, an additional $200$ dialogues were sampled from the BST[19] dataset.

The baseline evaluation scores were the FED[20] and perplexity.

The FED is another dialogue evaluation model based on the DialoGPT using the likelihood.

For response generation, the BLEU[21], METEOR[22], and NIST[23] were used.

The NIST is a variant of BLEU, which penalizes uninformative n-grams to concentrate on the information gain.

The Entropy[24] was also calculated to check the lexical diversity of the responses.
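
As I understand it, this kind of metric is the entropy of the n-gram frequency distribution over all generated responses. The sketch below follows that reading, with whitespace tokenization and the natural logarithm as my own assumptions, so the exact numbers would not match the paper's evaluation script.

```python
import math
from collections import Counter


def ngram_entropy(responses, n):
    """Entropy of the empirical n-gram distribution over a list of responses."""
    counts = Counter()
    for response in responses:
        tokens = response.split()
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())


generic = ngram_entropy(["i do not know", "i do not know"], 1)
diverse = ngram_entropy(["i like jazz a lot", "the weather is nice today"], 1)
print(generic, diverse)
```

Responses that keep reusing the same few words concentrate the distribution and get a lower entropy, which is why the metric is read as lexical diversity.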

For the dialogue evaluation task, the Pearson and Spearman correlations between the model results and the human evaluations were computed.
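
For readers who want to see what these two correlations measure, here is a dependency-free sketch. The metric values and human ratings are invented numbers, not results from the paper, and in practice a statistics library would normally be used instead.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Invented example: a metric that is monotonic but not linear in the human rating.
human_ratings = [1.0, 2.0, 3.0, 4.0]
metric_values = [1.0, 4.0, 9.0, 16.0]
print(pearson(human_ratings, metric_values), spearman(human_ratings, metric_values))
```

Spearman only looks at the ordering, so it is perfect here, while Pearson is slightly lower because the relationship is not linear.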

Results & Analysis

First, we are going to look into the response generation results.

The response generation results.

On the multi-reference Reddit dataset, DialoFlow achieved higher NIST and METEOR scores, while DialoGPT achieved a higher BLEU score in some cases.

Given that NIST improves on BLEU by penalizing uninformative n-grams, we can see that DialoFlow generated responses that are more informative and less generic.

On the multi-reference DailyDialog dataset, DialoFlow outperformed DialoGPT more clearly.

The author states that DailyDialog has more dialogue turns and focuses more on real-life human conversations than on an online forum that deals with specific topics, which boosts the effectiveness of context flow modeling.

Judging by the Entropy, the lexical diversity does not seem to differ much.

Also, the larger the model size, the better the performance, which is already well known, but it is a little bit surprising that DialoGPT with the medium size works better than the large one.

Additionally, Figure 3 shows that DialoFlow outperformed DialoGPT at all history lengths, even though the NIST scores of both drop as the length increases.

It is interesting that DialoFlow works better even with $1$ turn, which the paper attributes to the semantic influence.

There is also an interesting ablation study, in which the model was evaluated without SIM and without both SIM and CFM to check the contribution of each training objective.

As we can see from Table 1, the difference caused by the absence of SIM is marginal, but when both objectives are excluded the performance drops somewhat more.

As I mentioned in the Training objectives section, I think the SIM loss does not have a large effect by itself but is a supportive means for modeling the context flows.

Of course, the main objective would be the RGM, since it models the ultimate purpose of the model, which is the response generation.

We can also see the superiority of DialoFlow over DialoGPT in the human evaluations, as follows.

The human evaluations for the response generation.

Ok, next let’s check the dialogue evaluation results.

The dialogue evaluation results.

Looking at the two correlation results, we can see that FlowScore outperformed the other baselines, even beating by a huge gap the FED, which is based on the widely used pre-trained dialogue model DialoGPT.

In particular, perplexity does not correlate well with human evaluations, since it only measures the language modeling quality and not other factors of generated responses, such as informativeness, diversity, or human-likeness.

The details on the evaluation results are as follows.

The evaluation results on the DSTC9 dataset and BST dataset.

B1~11 represent the chat-bots that conducted dialogues in the DSTC9 Interactive Dialogue Evaluation Track, and “Human” refers to the human-human dialogues sampled from the BST dataset.

The human and FED scores go up when the quality of dialogue is better, while FlowScore is the opposite since it is based on perplexity, which is lower when the quality is better.

The FED didn’t perform well on the human-human dialogues, since it gave them a lower score than any of the dialogues in which bots participated.

On the other hand, FlowScore gave them the best score, which means that it identified the human dialogues themselves as the most human-like.

Additionally, the gap between human-human dialogues and the best performing bot is quite similar to that from the human evaluations.

Lastly, it is also stated in the paper that the FlowScore can be seen as a kind of utterance-level perplexity.

As shown in the FlowScore section, it is calculated as a perplexity over the normalized turn-level scores.

The original (token-level) perplexity is unstable, since a single token that is a serious outlier can make the value explode.

We can see from Table 3 that the perplexity differs significantly across chat-bots, which suggests that some words had an unusually large effect on the overall score.

On the other hand, the utterance-level perplexity is much more stable since the effect of perturbations is diminished while getting turn-level scores based on semantic influences.
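
The instability of token-level perplexity is easy to reproduce with a toy example. The token probabilities below are invented; the only point is how much one outlier token changes the result.

```python
import math


def perplexity(token_probs):
    """exp of the average negative log-probability of the tokens."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


stable = perplexity([0.5, 0.5, 0.5, 0.5])
with_outlier = perplexity([0.5, 0.5, 0.5, 1e-8])  # one very unlikely token
print(stable, with_outlier)
```

A single token with a tiny probability moves the perplexity from $2$ to well over $100$, even though the other three tokens are unchanged.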

Ok, this is it for the review on this paper.

At first, I was searching for an effective method to evaluate an open-domain generation model, but I also got a useful insight into dialogue modeling itself.

I am going to look into the code publicly released by the author and analyze it so that I can use the model in various ways.

Thank you for reading this post, and you’re always welcome to leave a comment.

[1] Li, Z., Zhang, J., Fei, Z., Feng, Y., & Zhou, J. (2021). Conversations are not flat: Modeling the dynamic information flow across dialogue utterances. arXiv preprint arXiv:2106.02227. https://arxiv.org/pdf/2106.02227.pdf
[2] Wu, C. S., Hoi, S., Socher, R., & Xiong, C. (2020). TOD-BERT: Pre-trained natural language understanding for task-oriented dialogue. arXiv preprint arXiv:2004.06871. https://arxiv.org/pdf/2004.06871.pdf
[3] Serban, I., Sordoni, A., Lowe, R., Charlin, L., Pineau, J., Courville, A., & Bengio, Y. (2017, February). A hierarchical latent variable encoder-decoder model for generating dialogues. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 31, No. 1). https://arxiv.org/pdf/1605.06069.pdf
[4] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30. https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
[5] Zhang, Y., Sun, S., Galley, M., Chen, Y. C., Brockett, C., Gao, X., ... & Dolan, B. (2019). Dialogpt: Large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:1911.00536. https://arxiv.org/pdf/1911.00536.pdf
[6] Smith, E. M., Williamson, M., Shuster, K., Weston, J., & Boureau, Y. L. (2020). Can you put it all together: Evaluating conversational agents' ability to blend skills. arXiv preprint arXiv:2004.08449. https://arxiv.org/pdf/2004.08449.pdf
[7] Bao, S., He, H., Wang, F., Wu, H., Wang, H., Wu, W., ... & Xu, X. (2020). Plato-2: Towards building an open-domain chatbot via curriculum learning. arXiv preprint arXiv:2006.16779. https://arxiv.org/pdf/2006.16779.pdf
[8] Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. nature, 323(6088), 533-536. https://doi.org/10.1038/323533a0
[9] Zhang, H., Lan, Y., Pang, L., Guo, J., & Cheng, X. (2019). Recosa: Detecting the relevant contexts with self-attention for multi-turn dialogue generation. arXiv preprint arXiv:1907.05339. https://arxiv.org/pdf/1907.05339.pdf
[10] Serban, I., Sordoni, A., Bengio, Y., Courville, A., & Pineau, J. (2016, March). Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 30, No. 1). https://ojs.aaai.org/index.php/AAAI/article/view/9883/9742
[11] Gu, X., Yoo, K. M., & Ha, J. W. (2020). Dialogbert: Discourse-aware response generation via learning to recover and rank utterances. arXiv preprint arXiv:2012.01775. https://arxiv.org/pdf/2012.01775.pdf
[12] Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training. https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf
[13] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. https://arxiv.org/pdf/1810.04805.pdf
[14] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9. http://www.persagen.com/files/misc/radford2019language.pdf
[15] Baumgartner, J., Zannettou, S., Keegan, B., Squire, M., & Blackburn, J. (2020, May). The pushshift reddit dataset. In Proceedings of the international AAAI conference on web and social media (Vol. 14, pp. 830-839). https://ojs.aaai.org/index.php/ICWSM/article/view/7347/7201
[16] Li, Y., Su, H., Shen, X., Li, W., Cao, Z., & Niu, S. (2017). Dailydialog: A manually labelled multi-turn dialogue dataset. arXiv preprint arXiv:1710.03957. https://arxiv.org/pdf/1710.03957.pdf
[17] Gupta, P., Mehri, S., Zhao, T., Pavel, A., Eskenazi, M., & Bigham, J. P. (2019). Investigating evaluation of open-domain dialogue systems with human generated multiple references. arXiv preprint arXiv:1907.10568. https://arxiv.org/pdf/1907.10568.pdf
[18] Gunasekara, C., Kim, S., D'Haro, L. F., Rastogi, A., Chen, Y. N., Eric, M., ... & Subba, R. (2020). Overview of the ninth dialog system technology challenge: Dstc9. arXiv preprint arXiv:2011.06486. https://arxiv.org/pdf/2011.06486.pdf
[19] Smith, E. M., Williamson, M., Shuster, K., Weston, J., & Boureau, Y. L. (2020). Can you put it all together: Evaluating conversational agents' ability to blend skills. arXiv preprint arXiv:2004.08449. https://arxiv.org/pdf/2004.08449.pdf
[20] Mehri, S., & Eskenazi, M. (2020). Unsupervised evaluation of interactive dialog with dialogpt. arXiv preprint arXiv:2006.12719. https://arxiv.org/pdf/2006.12719.pdf
[21] Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002, July). Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics (pp. 311-318). https://aclanthology.org/P02-1040.pdf
[22] Lavie, A., & Agarwal, A. (2007, June). METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments. In Proceedings of the second workshop on statistical machine translation (pp. 228-231). https://aclanthology.org/W07-0734.pdf
[23] Lin, C. Y., & Och, F. J. (2004, July). Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04) (pp. 605-612). https://aclanthology.org/P04-1077.pdf
[24] Zhang, Y., Galley, M., Gao, J., Gan, Z., Li, X., Brockett, C., & Dolan, B. (2018). Generating informative and diverse conversational responses via adversarial information maximization. Advances in Neural Information Processing Systems, 31. https://proceedings.neurips.cc/paper/2018/file/23ce1851341ec1fa9e0c259de10bf87c-Paper.pdf