This is the first post about my personal project: developing a multi-turn chatbot.
I actually started the project about two months ago and meant to write about it earlier, but the code went through so much trial and error that it kept changing under me.
The model is still training and some parts are not implemented yet, but I don't want to delay this post any longer, so I'm writing it as an introduction that outlines the project.
Today I will cover the purpose of the project and give rough descriptions of the data and models used.
My goal is to develop an open-domain chatbot, trained on English dialogue datasets, that can generate responses reflecting the multi-turn context.
Everyone probably knows what a chatbot is, but some people may not be familiar with the term "multi-turn".
Let's say that when two speakers, speaker 1 and speaker 2, exchange one utterance each, we call that one turn.
Then, loosely speaking, multi-turn simply means several such turns.
But this simple definition also counts a sequence of independent, mutually irrelevant turns as multi-turn, which is not the case I want to include.
To me, a multi-turn dialogue is one in which a speaker has to understand the previous history, in other words the overall context of the conversation, in order to process the utterance at a given time step.
Here are examples of a single-turn dialogue and a multi-turn dialogue.
In a single-turn conversation, where one main topic is finished within one turn, we don't need any additional information since we just have to respond to the current input.
However, a multi-turn case like the one above requires the speakers to consider the previous context in order to act properly in the current situation.
If the bot cannot remember from the dialogue history that a pet was mentioned, and only sees the input at the current time step, an error like the one below occurs.
Therefore, to build a decent multi-turn chatbot, we need to think not only about how to generate a proper response to the current input, but also about how to refer to the overall context of the dialogue.
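Before getting into models, the bookkeeping side of "referring to the context" can be sketched in a few lines. This is just an illustration I wrote for this post, not part of the project code: a fixed-size history buffer, where the `max_turns` cutoff decides whether older facts (like the pet) survive into the context the model sees.

```python
from collections import deque

class DialogueHistory:
    """Fixed-size buffer of past utterances the bot can condition on."""

    def __init__(self, max_turns: int = 5):
        # Keep only the most recent `max_turns` utterances.
        self.buffer = deque(maxlen=max_turns)

    def add(self, speaker: str, utterance: str) -> None:
        self.buffer.append((speaker, utterance))

    def context(self) -> list:
        """Return the utterances the response model should attend to."""
        return list(self.buffer)

history = DialogueHistory(max_turns=3)
history.add("user", "I have a pet dog.")
history.add("bot", "What's his name?")
history.add("user", "Rex. He loves walks.")
history.add("user", "Do you remember what animal I have?")
# The oldest utterance (the pet fact!) has already been dropped,
# so a model seeing only this buffer cannot answer correctly.
print(history.context())
```

This is exactly the trade-off the rest of the post is about: a longer window keeps more facts but costs more compute and adds more noise.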
Next, let’s talk about the datasets.
I decided to combine four multi-turn dialogue datasets to make the training data larger.
- DailyDialog: https://arxiv.org/abs/1710.03957
- EmpatheticDialogues: https://arxiv.org/abs/1811.00207
- Persona-Chat: https://arxiv.org/abs/1801.07243
- BlendedSkillTalk: https://arxiv.org/abs/2004.08449
Each dataset has slightly different features and purposes, but the common ground is that they are all intended for dialogue training.
More specific information is available from each paper.
Now it is time to analyze the size of each dataset.
I counted the number of utterances and dialogues, then split the data into train and validation sets at a ratio of 0.85:0.15 based on the number of conversations.
The results are as follows.
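The split itself is simple, but one detail matters: it is done per dialogue, not per utterance, so no conversation ends up half in train and half in validation. A minimal sketch (the toy corpus and function name are my own, not the project's code):

```python
import random

def split_dialogues(dialogues, train_frac=0.85, seed=0):
    """Split a list of dialogues (each a list of utterances) into
    train/validation sets by dialogue count, not utterance count,
    so no conversation is cut in half across the two sets."""
    rng = random.Random(seed)
    indices = list(range(len(dialogues)))
    rng.shuffle(indices)
    cut = round(len(dialogues) * train_frac)
    train = [dialogues[i] for i in indices[:cut]]
    valid = [dialogues[i] for i in indices[cut:]]
    return train, valid

# Toy corpus: 20 dialogues of varying length.
corpus = [[f"utt {d}-{u}" for u in range(2 + d % 3)] for d in range(20)]
train, valid = split_dialogues(corpus)
print(len(train), len(valid))  # 17 3
```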
Next, in order to set the maximum sequence length and maximum number of histories, I analyzed the distributions of dialogue lengths and utterance lengths.
The dialogue length was calculated by counting the number of tokens after tokenizing with the GPT-2 tokenizer.
Let's look at the charts below.
Based on the above results, we can set the maximum utterance length and the maximum context length (maximum number of previous utterances to consider).
This is a matter of hyperparameters, so it will be handled in the next post.
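As a sketch of how such a cutoff can be read off a length distribution: pick a percentile that covers most utterances without padding everything to the longest outlier. The token counts below are made-up placeholders, not the actual dataset statistics (which in the project come from the GPT-2 tokenizer).

```python
import math

def percentile(values, q):
    """Nearest-rank q-th percentile (q in 0-100)."""
    s = sorted(values)
    k = max(0, math.ceil(q / 100 * len(s)) - 1)
    return s[k]

# Hypothetical token counts per utterance after tokenization.
utt_lens = [5, 8, 12, 14, 15, 18, 22, 25, 31, 60]
max_len = percentile(utt_lens, 90)
print(max_len)  # 31: covers 90% of utterances without padding to the 60-token outlier
```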
There are several ways to implement the multi-turn dialogue models.
A traditional method is to use a recurrence-based model, such as an RNN, to store the overall context of the dialogue.
Examples of research using this method include Olabiyi et al., 2019, Mensio et al., 2018, Chen et al., 2018 (HVMN), and Serban et al., 2016 (HRED), which are still presented in most research papers as baselines.
But these inherit the chronic problems of RNNs: the long-term dependency problem, where information is lost as the dialogue grows long, and unwanted noise, since an RNN consumes the entire conversation history.
Due to the shortcomings above, the methods using multi-head attention have been proposed these days.
As anyone familiar with the Transformer knows, multi-head attention greatly alleviates these RNN problems, because it can refer to all positions at once through matrix multiplication and extract only the necessary information through the attention scores.
That is, we can obtain context vectors through multi-head attention at the utterance or history level, and make better use of this information by attending to it more efficiently.
These methods include Vlasov et al., 2020 (TED policy) and Zhang et al., 2019 (ReCoSa), etc.
The former, published by Rasa, is a multi-turn action retrieval model aimed especially at task-oriented systems; the latter, which stands for "Relevant Contexts with Self-attention", is the response generation method I actually implemented.
Let’s see the details of ReCoSa structure.
It is almost the same as the original Transformer; the only difference is the encoder part.
In the above figure, the processes in red happen in the encoder and the blue parts are conducted in the decoder.
The decoder is actually the same as the original: it applies masked multi-head attention to the sequence generated so far, followed by attention over the encoder output.
On the other hand, in the encoder, the word-level encoding is conducted by an LSTM.
The last hidden state from this becomes the utterance embedding.
So a turn-level positional encoding is required to reflect the temporal order of the turns.
The rest consists of multi-head attention across turns, similar to the original encoder, which yields an encoder output reflecting the importance of each history utterance.
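To make the encoder flow concrete, here is a minimal NumPy sketch of the three steps. It is not the trained model: mean pooling stands in for the word-level LSTM's last hidden state, the attention is single-head with no learned projections, and all word vectors are random.

```python
import numpy as np

def sinusoid_positions(n_turns, d):
    """Turn-level positional encoding: one sinusoid vector per turn."""
    pos = np.arange(n_turns)[:, None]
    i = np.arange(d)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def self_attention(x):
    """Single-head self-attention over turns (projections omitted)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

def recosa_encoder(history):
    """history: list of turns, each an array of word vectors (n_words, d).
    1) word level: summarize each turn into one utterance embedding
       (mean pooling stands in for the LSTM's last hidden state);
    2) add the turn-level positional encoding;
    3) self-attention across turns yields the encoder output."""
    utt = np.stack([turn.mean(axis=0) for turn in history])   # (n_turns, d)
    utt = utt + sinusoid_positions(len(history), utt.shape[-1])
    return self_attention(utt)

rng = np.random.default_rng(0)
history = [rng.standard_normal((n, 16)) for n in (4, 7, 5)]   # 3 turns
enc_out = recosa_encoder(history)
print(enc_out.shape)  # (3, 16): one context vector per history turn
```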
The next model I'm going to implement is a multi-turn dialogue generation structure using the pre-trained GPT-2 (Generative Pre-trained Transformer 2).
At a time when GPT-3 is becoming a hot topic, you might think that GPT-2 is a little behind the trend, but I think it is worth trying since many studies using this model have been conducted over the past year or two.
As GPT-2 specializes in language modeling, the method is quite intuitive: concatenate all previous history utterances and have the model generate the next sentence.
That is, unlike the approaches above that perform context encoding and decoding separately, this approach produces the next output by looking at the context and the current input at the same time through self-attention.
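The input construction can be sketched as follows. The special-token names (`<bos>`, `<sp1>`, `<sp2>`) are illustrative placeholders I chose for this post; in practice they would be added to the tokenizer's vocabulary before fine-tuning, and real tokenization would replace the whitespace split.

```python
def build_input(history, speaker1="<sp1>", speaker2="<sp2>", bos="<bos>"):
    """Flatten the dialogue history into one token stream for GPT-2.
    Each utterance is prefixed with its speaker's token; the model
    then continues generation after the final speaker token."""
    tokens = [bos]
    for i, utt in enumerate(history):
        speaker = speaker1 if i % 2 == 0 else speaker2
        tokens.append(speaker)
        tokens.extend(utt.split())  # stand-in for real subword tokenization
    # Whoever did not speak last replies next.
    tokens.append(speaker2 if len(history) % 2 == 1 else speaker1)
    return tokens

history = ["I have a pet .", "What kind of pet ?", "A dog named Rex ."]
print(" ".join(build_input(history)))
```

This is what "seeing the contexts and the current input at the same time" means in practice: the whole flattened sequence goes through self-attention together.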
Related research includes Zhang et al., 2020, Olabiyi & Mueller, 2019, and Huggingface's Conversational AI with Transfer Learning.
In particular, I will try to implement Huggingface's fine-tuning method myself.
Huggingface won first prize in the automatic metrics category at ConvAI2 (the Conversational Intelligence Challenge 2), and since they made the details public through their blog and GitHub repository, I think this approach will be capable.
The description is as follows.
The task is to build a model that generates dialogue considering both persona information and previous history, but since I am only interested in the multi-turn aspect, I will exclude the persona part.
The notable point is that they trained the model not only on language modeling but also on next-sentence prediction, which classifies whether the candidate sentence concatenated at the end is the appropriate reply or a distractor.
So the GPT-2 double-heads model, which has two different heads to conduct each task separately, was used.
By reducing both loss values, the model learns not only to generate proper responses but also to judge what a natural response should be.
And this is very similar to BERT's pre-training setup.
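As a toy illustration of the two-loss objective: one cross-entropy over the vocabulary for the language-modeling head, one over the candidate replies for the classification head, summed with a weighting coefficient. The logits and the coefficient below are made up for the example; they are not the actual training configuration.

```python
import numpy as np

def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target]

# Toy logits: next-token prediction over a 6-word vocab, and a
# 2-way choice between the gold reply and a distractor.
lm_logits = np.array([0.1, 2.0, -1.0, 0.3, 0.0, -0.5])   # LM head, one step
mc_logits = np.array([1.5, -0.7])                        # classification head

lm_loss = cross_entropy(lm_logits, target=1)   # gold next token: id 1
mc_loss = cross_entropy(mc_logits, target=0)   # gold reply is candidate 0
total = lm_loss + 2.0 * mc_loss                # the 2.0 coefficient is illustrative
print(round(float(total), 3))
```

Minimizing `total` pushes the shared GPT-2 body to serve both heads at once, which is what gives the model its "what would a natural reply look like" judgment on top of pure generation.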
This is it for today’s post.
We have checked the analysis of data and the brief details of the models.
Next time, I'm going to talk about the actual implementation code for the ReCoSa structure and the experiment results.