Jaewoo Song



This project builds a multi-turn open-domain dialogue generation model by fine-tuning the pre-trained GPT-2 (Generative Pre-Training-2).

In the last post, we found several limitations in the results from ReCoSa (the Relevant Contexts with Self-attention).

This time, I expect better outputs since GPT-2, which is already well trained on the Language Modeling task, is applied.

So, let’s start.

LM Head vs Double Heads

In the introduction post, I described the fine-tuning method which Huggingface applied.

GPT-2 is fine-tuned with multi-task learning: it is trained not only on the original Language Modeling task but also on a binary classification task, which determines whether a given response is the proper one or not.

This training setting is illustrated below.

The description of Huggingface's transfer learning structure using GPT-2 for ConvAI2.

As we can see, the model takes two inputs, the golden reply and a distractor which is not the proper response, and classifies which one is the correct target.

With this multi-task setting, the model learns not only how to generate an answer but also how to produce a proper response on a relevant topic by considering the dialogue contexts.
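As a rough illustration of the multi-task objective described above, the sketch below combines a language-modeling loss with a two-way classification loss over the candidate replies. The function names, the example scores, and the weights `lm_coef` and `mc_coef` are assumptions made for illustration; they are not values taken from the Huggingface code.

```python
import math


def choice_loss(scores, gold_index):
    """Cross-entropy over candidate scores (one score per candidate reply)."""
    max_score = max(scores)  # subtract the max for numerical stability
    log_sum = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return log_sum - scores[gold_index]


def multitask_loss(lm_loss, scores, gold_index, lm_coef=1.0, mc_coef=1.0):
    """Weighted sum of the LM loss and the candidate-classification loss."""
    return lm_coef * lm_loss + mc_coef * choice_loss(scores, gold_index)


# Hypothetical numbers: the gold reply is candidate 0, the distractor is candidate 1.
total = multitask_loss(lm_loss=3.2, scores=[2.0, 0.5], gold_index=0)
```

When the classifier cannot tell the two candidates apart, their scores are equal and the classification term stays at $\log 2$, however well the LM part is trained.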

I adopted this method at first, but after an experiment I gave up the classification task and switched to Language Modeling only.

The reasons why I changed my mind are as follows.

  1. Increase in training time

    Including a distractor when processing the training data makes each batch larger. As a result, I could not make the batch size sufficiently large. Even with only one distractor, I had to set the batch size to $2$ in my resource environment, and one epoch took about $32$ hours.

  2. Less meaningful classification training

    The Huggingface team used the PersonaChat data and extracted each distractor from the candidates included in the dataset itself. In my case, I combined various datasets, and it was difficult to build such additional candidate sets from them. So I randomly sampled an utterance from an entirely different dialogue and used it as the distractor. However, I noticed that the loss for the multiple-choice classification hardly decreased during training. In my opinion, most sampled distractors are generic, which means that many context + distractor pairs still make reasonable sense. Of course, I could have searched for another solution, but because of reason #$1$ mentioned above, I stopped.
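The random distractor sampling mentioned in the second reason can be sketched as follows. This is a minimal sketch under assumed names and toy data (`sample_distractor`, the `dialogues` list, and the `rng` argument are all hypothetical), not the code used in the experiment.

```python
import random


def sample_distractor(dialogues, current_index, rng=random):
    """Pick a random utterance from any dialogue other than the current one."""
    other_indices = [i for i in range(len(dialogues)) if i != current_index]
    other_dialogue = dialogues[rng.choice(other_indices)]
    return rng.choice(other_dialogue)


# Toy data (hypothetical): each dialogue is a list of utterance strings.
dialogues = [
    ["Hi, how are you?", "I'm fine, thanks."],
    ["Did you watch the game?", "Yes, it was great."],
    ["What do you do for work?", "I'm a teacher."],
]
distractor = sample_distractor(dialogues, current_index=0, rng=random.Random(0))
```

Even in this toy data, a sampled reply such as "Yes, it was great." could plausibly follow many contexts, which is in line with the observation that the classification loss hardly decreased.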

So I decided to use the GPT-2 LM Head Model instead of the GPT-2 Double Heads Model. It has only a single additional LM layer, so the fine-tuning focuses only on the response generation task.

Data processing

Next, let’s talk about data processing.

Unlike the ReCoSa structure from last time, in the GPT-2 method the entire dialogue history is concatenated and given to the model so that it generates the proper response.

This is because, as I stated before, GPT-2 is a model which was pre-trained to conduct Language Modeling with the Transformer’s decoder layers.

That is, this fine-tuning approach is natural in that the model considers the overall context and generates a reply through next word prediction, which is essentially the same objective as GPT-2’s pre-training.

First, following Huggingface’s idea, I added $5$ special tokens which are not included in the original GPT-2 vocabulary.

They help the model notice the beginning, the end, and the padded parts of the sequences and differentiate each speaker’s utterances.

These are the special tokens I added.

  • bos: "<bos>" (the beginning of sentence token)
  • eos: "<eos>" (the end of sentence token)
  • pad: "<pad>" (the pad token)
  • speaker1: "<speaker1>" (the first speaker token)
  • speaker2: "<speaker2>" (the second speaker token)

Huggingface lets users add new tokens to the tokenizer’s vocabulary and increase the size of the model’s embedding layer accordingly.

If the new size is larger than the original vocabulary size, newly initialized vectors fill the remaining rows of the embedding lookup table.

This can be easily implemented as follows.

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Tokenizer & Model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# The dictionary for defining special tokens
special_tokens = {
    'bos_token': "<bos>",
    'eos_token': "<eos>",
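    'pad_token': "<pad>",
    'additional_special_tokens': ["<speaker1>", "<speaker2>"]
}

num_new_tokens = tokenizer.add_special_tokens(special_tokens)
vocab = tokenizer.get_vocab()

# Increase the embedding layer size to cover the newly added tokens
model.resize_token_embeddings(len(tokenizer))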

Next, we’re gonna look at the composition of the inputs and outputs included in each batch.

We need $3$ components: input_ids, token_type_ids, and labels.

  • input_ids: This is the main input and consists of token ids. During training, it also includes the golden reply, and at the inference phase, the partially generated reply so far is concatenated after the dialogue contexts.
  • token_type_ids: This is the additional input which specifies the speaker of each segment in input_ids and thereby differentiates the utterance of each turn. It only comprises the ids of the speaker1 and speaker2 tokens.
  • labels: This is the actual golden reply to be generated. It is built by setting every position except the response part to the mask value, $-100$. You might think that this should be shifted right, but that is not necessary since the model itself shifts the labels internally. Obviously, we don’t have this at inference time.

It is much simpler than we thought.

You can see the details in the figure below.

The details of data composition in GPT-2 fine-tuning for multi-turn dialogue generation.
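To make the three components concrete, here is a small worked example on a toy dialogue. The token ids are hypothetical placeholders, not real GPT-2 ids, and the last list only illustrates the internal shift: the output at position t is scored against the label at position t + 1, and positions labeled $-100$ are ignored.

```python
# Hypothetical token ids for illustration only.
BOS, EOS, SP1, SP2 = 0, 1, 2, 3

# Each utterance starts with its speaker token; <bos>/<eos> wrap the dialogue.
dialogue = [
    [BOS, SP1, 10, 11],  # speaker1: context
    [SP2, 20, 21, 22],   # speaker2: context
    [SP1, 30, 31, EOS],  # speaker1: golden reply
]

input_ids, token_type_ids, labels = [], [], []
for turn, utter in enumerate(dialogue):
    # The first utterance has <bos> in front of its speaker token.
    speaker = utter[1] if turn == 0 else utter[0]
    is_reply = turn == len(dialogue) - 1
    input_ids += utter
    token_type_ids += [speaker] * len(utter)
    labels += utter if is_reply else [-100] * len(utter)

# (position whose output is used, target token id) pairs that contribute to the loss.
scored_pairs = [
    (t, labels[t + 1]) for t in range(len(labels) - 1) if labels[t + 1] != -100
]
```

Here only the four reply positions contribute to the loss, while the two context utterances serve purely as conditioning.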

But one thing we should think about is the maximum length.

The maximum length the model can take is limited, and the input is produced by putting all utterances together, so depending on the length of each utterance, in some cases we cannot include the whole history of the specified number of turns.

To handle this, I pre-defined the maximum utterance length by considering the total sequence length and the maximum number of contexts.

If the total length of all utterances exceeded the maximum input length, I truncated each utterance to at most the maximum utterance length.

Of course, if the concatenated sequence is shorter than the maximum length, pad token ids are appended to it.

The data processing code is as follows.

from itertools import chain

def make_padding(input_id, token_type_id, lm_label, pad_id, max_len):
    left = max_len - len(input_id)

    input_id += [pad_id] * left
    token_type_id += [pad_id] * left
    lm_label += [-100] * left

    return input_id, token_type_id, lm_label

input_ids = []
token_type_ids = []
labels = []

# dialogues: the list of dialogues which have utterance histories.
# max_len: total maximum length of input
# utter_len: maximum length of one utterance
for dialogue in dialogues:
    if len(dialogue) > 1:
        dialogue[0] = [bos_id] + dialogue[0]
        dialogue[-1] = dialogue[-1] + [eos_id]

        total_len = sum(len(utter) for utter in dialogue)

        # Truncate every utterance if the concatenated dialogue is too long
        if total_len > max_len:
            dialogue = [utter[:utter_len] for utter in dialogue]
            dialogue[-1][-1] = eos_id

        # Each utterance starts with its speaker token (after <bos> in the first one)
        token_type_id = [[utter[0]] * len(utter) if u != 0 else [utter[1]] * len(utter)
                         for u, utter in enumerate(dialogue)]
        # Only the last utterance (the response) contributes to the loss
        lm_label = [[-100] * len(utter) if u != len(dialogue) - 1 else utter
                    for u, utter in enumerate(dialogue)]
        input_id = list(chain.from_iterable(dialogue))
        token_type_id = list(chain.from_iterable(token_type_id))
        lm_label = list(chain.from_iterable(lm_label))

        input_id, token_type_id, lm_label = make_padding(
            input_id, token_type_id, lm_label, pad_id, max_len
        )

        input_ids.append(input_id)
        token_type_ids.append(token_type_id)
        labels.append(lm_label)


Actually, there is nothing difficult in training.

It is not that different from the previous implementations: we just put the inputs into the GPT-2 LM Head model after pre-processing the data properly as described above.

One thing I added this time is “perplexity”, in addition to the train/validation losses, to evaluate the model during training.

The calculation is simple: it is obtained by applying the exponential function to the loss.

Considering the formula of the perplexity, this is quite obvious.

We checked it before, but let’s see the definition and the formula of perplexity again.

Perplexity is an evaluation metric for Language Modeling which indicates how high a probability the model assigns to the correct next tokens.

It is calculated by taking the reciprocal of the joint probability of the sequence and normalizing it by the sequence length, that is, taking its $n$-th root.

\[PPL = \sqrt[n]{\frac{1}{P(w_1, w_2, ... , w_n)}} = \sqrt[n]{\frac{1}{\prod_{i=1}^{n}P(w_i \mid w_1, w_2, ... ,w_{i-1})}}\]

As we can see, the higher the probability is, the lower the perplexity becomes, which means better LM performance.

Then what is the relation between the perplexity and the loss function?

We can easily derive it considering that the loss function normally used in the next word prediction task is the Cross-entropy loss.

By feeding the input sequences and the labels, we get the negative log-likelihood normalized by the sequence length.

The value inside this negative log is the joint probability of the sequence, composed of the token probabilities produced by the softmax function.

So by using this loss as the exponent, the procedure becomes as follows.
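Written out with the notation of the perplexity formula, and assuming the loss $L$ is the token-level cross-entropy averaged over the $n$ tokens of the sequence, the step is:

\[\exp(L) = \exp\left(-\frac{1}{n}\sum_{i=1}^{n}\log P(w_i \mid w_1, w_2, ... ,w_{i-1})\right) = \left(\prod_{i=1}^{n}P(w_i \mid w_1, w_2, ... ,w_{i-1})\right)^{-\frac{1}{n}} = \sqrt[n]{\frac{1}{\prod_{i=1}^{n}P(w_i \mid w_1, w_2, ... ,w_{i-1})}} = PPL\]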


Therefore, with the torch.exp() function, we can get the perplexity.
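The relation can be checked numerically with a few lines of plain Python. The token probabilities below are hypothetical, and `math.exp` plays the role that torch.exp() plays in the training code.

```python
import math

# Hypothetical probabilities the model assigned to each gold token of one sequence.
token_probs = [0.25, 0.5, 0.125, 0.5]
n = len(token_probs)

# Definition: n-th root of the reciprocal of the joint probability.
ppl_from_definition = (1.0 / math.prod(token_probs)) ** (1.0 / n)

# Cross-entropy loss: mean negative log-likelihood over the n tokens.
loss = -sum(math.log(p) for p in token_probs) / n

# Exponentiating the loss recovers the same perplexity.
ppl_from_loss = math.exp(loss)
```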

When training, the inputs given to the model are input_ids, token_type_ids, and labels.

The GPT-2 LM Head Model returns an output tuple which contains the loss at index $0$ and the actual result logits tensor at index $1$.

You can see the logs below after running the training command.

The screenshot of training logs.

You might wonder why the training perplexity at the first epoch is $\inf$.

This is because at the beginning, the losses from some of the training batches are so high that their perplexities become $\inf$, which makes the average of all train perplexities $\inf$ as well.

And you can see that after one epoch of training, the train perplexity became normal.
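The effect is easy to reproduce. The sketch below assumes the perplexities are computed in float32 and averaged per batch; the loss values and the helper `exp_float32` are hypothetical. It also shows one possible alternative, exponentiating the average loss instead, which is a different quantity from the one reported in this post but stays finite in this example.

```python
import math

FLOAT32_MAX = 3.4028235e38  # largest finite float32 value


def exp_float32(x):
    """Mimic float32 exp: return inf once the result exceeds the float32 range."""
    try:
        value = math.exp(x)
    except OverflowError:
        return math.inf
    return value if value <= FLOAT32_MAX else math.inf


# Hypothetical per-batch losses: one early batch has a very large loss.
batch_losses = [95.0, 6.0, 5.0]

# Averaging per-batch perplexities: a single inf makes the whole average inf.
mean_of_ppls = sum(exp_float32(loss) for loss in batch_losses) / len(batch_losses)

# Alternative: exponentiate the average loss, which stays finite here.
ppl_of_mean_loss = exp_float32(sum(batch_losses) / len(batch_losses))
```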

Next, we will see how the train/validation losses and the validation perplexity change over the epochs.

The train perplexities were also recorded, but some values are too high (as we can see from the log captured above), so I present only the validation perplexity here.

As I mentioned, underfitting of the model or a relatively high perplexity on a certain sequence corrupts the entire average, so I concluded that presenting the train perplexities is less meaningful.

So, let’s see the charts below.

The changes of training/validation loss values & validation perplexities per epoch.

The most noticeable difference is that the loss values drop substantially faster than in the previous ReCoSa experiment.

With ReCoSa, even after training for $40$ epochs, the loss values were much higher than those of GPT-2 after only $10$ epochs. (Even with additional training, the validation loss of ReCoSa was still much higher than that of GPT-2…)

This is quite natural since GPT-2 has already been heavily pre-trained on the Language Modeling task and already possesses generation ability above a certain level.

Therefore it starts from a much more advantageous position.


I used Nucleus Sampling (Top-$p$ Sampling) as the decoding algorithm like before.

At inference, the labels parameter is not included, so only input_ids and token_type_ids are given to the model.

The output from the model is also different: without the loss, the result logits are at index $0$.

After applying the softmax to this output, I made the model predict the next word at the target position with Nucleus Sampling.

I didn’t post the sampling codes this time because they are already in the last post.
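For readers who have not seen the last post, the idea of Nucleus Sampling can be sketched in plain Python as follows. This is a minimal illustration, not the code from the last post; the function names and the toy distribution are assumptions.

```python
import random


def top_p_filter(probs, top_p):
    """Return (token_id, renormalized_prob) pairs for the nucleus of `probs`."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token_id, prob in ranked:
        nucleus.append((token_id, prob))
        cumulative += prob
        if cumulative >= top_p:  # smallest set whose mass reaches top_p
            break
    return [(token_id, prob / cumulative) for token_id, prob in nucleus]


def sample_top_p(probs, top_p, rng=random):
    """Sample one token id from the renormalized nucleus."""
    nucleus = top_p_filter(probs, top_p)
    token_ids = [token_id for token_id, _ in nucleus]
    weights = [prob for _, prob in nucleus]
    return rng.choices(token_ids, weights=weights, k=1)[0]


# Hypothetical next-token distribution over a 5-token vocabulary.
probs = [0.05, 0.5, 0.25, 0.125, 0.075]
nucleus = top_p_filter(probs, top_p=0.8)
next_token = sample_top_p(probs, top_p=0.8, rng=random.Random(0))
```

With `top_p=0.8`, only the three most likely tokens remain candidates here, so the low-probability tail can never be sampled.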

And I picked out the $5$ most satisfactory conversations as follows.

The results of the conversations with the trained chatbot.

First of all, I could sense that it was relatively easier to pick out good results than last time.

And you can see that although they are not perfect, they are certainly more stable and human-like conversations than the previous ones.

Of course, if the number of turns exceeds a certain range, it still produces long and repetitive answers, and in particular the speaker token appears quite often, making the utterance awkward.

But compared with the last results, the repetitive spans and incoherent contents have certainly been reduced.

Let’s have more detailed discussions in the next section.


As we saw, the pre-training makes the model optimize much faster and produce better results.

However, there were several limitations as follows.

  1. The limitation of the max timesteps and max utterance length

    Putting the whole dialogue history into one sequence makes the input too long. In particular, pre-trained language models like GPT-2 and BERT have a limited maximum input length, which forces us to truncate either contexts or utterances, inevitably causing information loss. Even if we could set the sequence length as we want, it would increase the data size and the memory usage would become enormous.

  2. The speaker tokens sampled

    As we can see from the outputs, in some cases the speaker tokens suddenly emerge and even the speaker itself seems to change after that. This becomes unexpected noise which makes the quality of the next responses worse. The model might have been trained to generate not only the reply but also the next turn. This means that the speaker tokens are chosen as the proper next word, instead of the end token, even when the generation is finished. I think there are several reasons for this.

    • First, the input contexts, which have a speaker token after each utterance, make the model treat these speaker tokens as likely words after a sentence during the actual sampling.
    • Second, in Huggingface’s implementation, the persona information is given to the model as a prefix. This would help the model keep its identity and prevent it from taking the next speaker’s turn.
    • Third, Huggingface trained GPT-2 to generate only one speaker’s utterances, which is quite different from my version where the model can be either of the two. So their model is directed to become a certain speaker and is not confused by different speaker information.

In addition, I found that this approach, in which all dialogue histories are concatenated and additional segment embeddings are added, is used quite often in many multi-turn settings.

It appears not only in open-domain generation tasks like this one: models for TODS (Task-Oriented Dialogue Systems), such as TOD-BERT (Wu et al, 2020), also take this input format to conduct various tasks like NLU, DST (Dialogue State Tracking), Action Prediction, etc. with multi-turn consideration.

Also, several recent retrieval-based models (Xu et al, 2020, Whang et al, 2020, Gu et al, 2020) select the response by calculating scores between these cross-encoded dialogue contexts and each candidate.

That is, in constructing a multi-turn dialogue system, this data format works well with the multi-head attention structure, and we can try various research directions with several variations, such as different speaker embeddings and different splitting methods for utterances or turns.

This is it for the open-domain multi-turn chatbot project using the GPT-2.

Over months of development and $3$ posts, I have learned a lot.

But it is also a shame that there were no bold, new, and creative attempts which could strictly be called “research”, since I just implemented a paper’s or others’ designs myself.

Although I’ll focus on other research and TA work for the time being, I will try various improvements such as different decoding structures, adding user information, and proper negative sampling if I have a chance.

I always welcome any feedback about what I need to improve or what I should fix.

Thank you.

(Updated at 2020-12-16)

This is an update with additional results after fixing the problem mentioned above.

I made the model generate only the response of the second speaker and trained it with the same hyperparameters.

Luckily, the model was optimized much faster than before: the training loss became $2.3345$, which is far lower than the previous outcome, and the validation loss dropped to $2.9874$.

And the results are as follows.

The results of the conversations with the trained chatbot after additional training.

Although the quality of the responses did not improve a lot, we can see that the speaker token emergence problem has certainly been fixed.

Of course, the model still generates awkward utterances when the context becomes longer or the input is complex.

Maybe I can improve it further with several additional components, such as proper negative sampling and persona information injection.

How to build a State-of-the-Art Conversational AI with Transfer Learning. (2019, May 9). https://medium.com/huggingface/how-to-build-a-state-of-the-art-conversational-ai-with-transfer-learning-2d818ac26313.
huggingface/transfer-learning-conv-ai. https://github.com/huggingface/transfer-learning-conv-ai.
Wu, C. S., Hoi, S., Socher, R., & Xiong, C. (2020). Tod-bert: Pre-trained natural language understanding for task-oriented dialogues. arXiv preprint arXiv:2004.06871. https://arxiv.org/abs/2004.06871.
Xu, R., Tao, C., Jiang, D., Zhao, X., Zhao, D., & Yan, R. (2020). Learning an Effective Context-Response Matching Model with Self-Supervised Tasks for Retrieval-based Dialogues. arXiv preprint arXiv:2009.06265. https://arxiv.org/abs/2009.06265.
Whang, T., Lee, D., Oh, D., Lee, C., Han, K., Lee, D. H., & Lee, S. (2020). Do Response Selection Models Really Know What's Next? Utterance Manipulation Strategies for Multi-turn Response Selection. arXiv preprint arXiv:2009.04703. https://arxiv.org/abs/2009.04703.
Gu, J. C., Li, T., Liu, Q., Ling, Z. H., Su, Z., Wei, S., & Zhu, X. (2020, October). Speaker-aware bert for multi-turn response selection in retrieval-based chatbots. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management (pp. 2041-2044). https://dl.acm.org/doi/abs/10.1145/3340531.3412330.