Jaewoo Song

### Categories

• Tech

This post is a review of the publication: Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2019). The curious case of neural text degeneration.

This paper compares various decoding algorithms used in NLG (Natural Language Generation) and discusses what it takes to produce the best outputs in terms of both quality and diversity.

The authors propose a new decoding method called Nucleus Sampling.

Currently, it seems to be the best available decoding method in terms of both quality and diversity.

What is “degeneration”?

First, let’s specify what the word “degeneration” means in this context.

In NLG, we need to consider two properties of a generated text, “quality” and “diversity”.

Quality means how “well-constructed” the output is, considering the overall context or conditions.

In other words, a good quality text should be relevant and coherent to the context.

On the other hand, diversity means how varied the expressions and vocabulary of the generated result are.

If the model just imitates results memorized from the training set, we cannot say the generation is successful, especially in open-ended tasks such as story generation or text continuation.

So achieving both quality and diversity is the ultimate goal of all generation tasks.

The paper describes degeneration as the case in which such quality or diversity is not achieved: “output text that is bland, incoherent, or gets stuck in repetitive loops”.

A failure in quality produces results that are incoherent with the meaning or context, while a failure in diversity produces outputs that copy exact sentences from the dataset or, at worst, repeat the same phrases endlessly.

Here is the example of a text degeneration situation given in the paper.

As we can see, the left example fell into infinite repetition, which is a drawback of high-probability decoding and a failure in diversity.

And the right case shows that too much focus on diversity produces gibberish severely irrelevant to the previous content, which is a failure in quality.

I will discuss the differences among various decoding algorithms later, in terms of each one’s strengths and weaknesses in quality and diversity.

Anyway, this paper proposes a new strategy to overcome these degeneration problems.

Decoding algorithms

There are various decoding strategies in NLG, so before going into Nucleus Sampling, let’s see what algorithms can be used in decoding.

There are two representative approaches, maximization-based and sampling-based.

Of course, these two categories are not completely separate, since both approaches eventually take partially optimal samples probabilistically, but I will explain them separately for convenience.

• Maximization-based approach

This is the most commonly used decoding method.

The model chooses the next word with the highest probability in order to maximize the probability of the final output.

That is, assuming that $y$ is the output sentence and $x$ is the condition, it makes $\prod_{t=1}^{N} P(y_t \mid y_1, …, y_{t-1}, x)$ maximized.

Greedy search and beam search belong to this category, as we saw in the last post.

These algorithms might produce a text with good quality but fail to make diverse results.
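As a toy illustration, greedy search, the simplest maximization-based method, can be sketched as below. Note that `step_probs` is a hypothetical stand-in for a model’s next-word distribution, not anything from the paper.

```python
def greedy_decode(step_probs, start, max_len=10, eos="<eos>"):
    """Greedy decoding: at each step, pick the single most probable next word.

    `step_probs(context)` is an assumed callback that returns a dict mapping
    each candidate next word to its probability given the context so far.
    """
    output = list(start)
    for _ in range(max_len):
        probs = step_probs(output)
        next_word = max(probs, key=probs.get)  # argmax over the distribution
        if next_word == eos:
            break
        output.append(next_word)
    return output
```

Beam search generalizes this by keeping the top-$b$ partial sequences at each step instead of a single one, but the argmax spirit is the same.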

See the picture below.

Beam search tries to select the next word with the highest probability, but humans actually do not.

The word choices by humans are very varied, diverse and uncertain.

So maximization-based methods have weaknesses in making human-like diverse texts.

On the other hand, maximizing probability is a strength in directed generation tasks, in which the generated output is constrained to be a transformation of the input, such as machine translation, data-to-text generation, and summarization.

These tasks produce outputs tightly scoped by the input, so repetition and genericness are not serious problems there, even though such methods are vulnerable to reduced diversity.

• Sampling-based approach

The sampling-based approach literally samples outputs from a certain restricted subset of the distribution.

This can be pure sampling, which samples the next word randomly from the whole probability distribution at each time step, or top-$k$ sampling, which samples from the restricted set of the top-$k$ most probable words.

More specifically, in top-$k$ sampling, given a distribution $P(x \mid x_{1:i-1})$, the top-$k$ vocabulary $V^{(k)} \subset V$ is defined as the set of size $k$ that maximizes $\sum_{x \in V^{(k)}} P(x \mid x_{1:i-1})$.

Random sampling is conducted based on this new distribution.
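The two steps above, truncating to the $k$ most probable words and then sampling from the renormalized distribution, can be sketched as follows. Here `probs` is a toy dict standing in for a model’s next-token distribution; this is my illustration, not the paper’s code.

```python
import random

def top_k_sample(probs, k, rng=random):
    """Top-k sampling: keep the k most probable words, renormalize
    their probabilities, and sample one word from that set."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)          # mass of the top-k set
    words = [w for w, _ in top]
    weights = [p / total for _, p in top]   # renormalized distribution
    return rng.choices(words, weights=weights, k=1)[0]
```

With `k=2` on a vocabulary like `{"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}`, only “the” and “a” can ever be sampled.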

The problem is how to determine the optimal $k$ value.

Take a look at the below description.

From the picture above, we should notice that a constant $k$ is not a good idea, considering the various contexts encountered during top-$k$ decoding.

If the probability distribution is flat, then a small $k$ will cause some problems since it cannot reflect nearly equally distributed word probabilities and fails to improve the diversity of the output.

In contrast, a large $k$ with a peaked distribution is also problematic because it considers extra words with very low probability and this leads to a decrease in quality.

Therefore, a varying $k$ value should be implemented in order to handle these unpredictable distributions.

Additionally, there is sampling with temperature, where the temperature is a hyperparameter for softmax normalization.

Given the logits $u_{1 \colon \lvert V \rvert}$ and temperature $t$, the softmax is re-calculated as $p(x=V_l \mid x_{1:i-1})=\frac{\exp(u_l/t)}{\sum_{l^{\prime}}\exp(u_{l^{\prime}}/t)}$.

By dividing the logits by $t$, the overall probability distribution is re-normalized, modifying the relative gaps between words.

Usually $t \in (0,1)$, so the scaled logits grow in magnitude, which emphasizes the differences between the probabilities and makes the distribution a little more “peaked”.

By controlling the value of $t$, we can make this skewing stronger or weaker.

With a small $t$, the emphasis becomes stronger and this makes the model focus on the quality of the output text.

On the other hand, with a large $t$ the distribution becomes more uniform, which leads to more diverse results.

Nucleus Sampling

The paper proposes a new stochastic method called Nucleus Sampling.

This is also called top-$p$ sampling, which looks similar to top-$k$ sampling but is more flexible.

When using Nucleus Sampling, we use the shape of the probability distribution itself to determine the set of words to sample from at each time step.

Given a distribution $P(x \mid x_{1:i-1})$, the top-$p$ vocabulary $V^{(p)} \subset V$ is defined as the smallest set such that

$\sum_{x \in V^{(p)}} P(x \mid x_{1:i-1}) \geq p$

Now let $p^{\prime} = \sum_{x \in V^{(p)}} P(x \mid x_{1\colon i-1})$.

Then we define a new re-scaled probability distribution from which the next word will be sampled.

$P^{\prime}(x \mid x_{1:i-1}) = \begin{cases} P(x \mid x_{1:i-1})/p^{\prime} & \text{if } x \in V^{(p)} \\ 0 & \text{otherwise.} \end{cases}$

As we can see, this is again a sampling scheme that favors the more probable next words, but it determines the candidate set by checking whether the cumulative probability mass exceeds the pre-chosen threshold $p$.

So unlike top-$k$ sampling, the size of sampling set changes dynamically.

Even with a high $p$, the sample set is usually a small subset of the vocabulary that takes up the vast majority of the entire probability mass.

That’s why this method is called “Nucleus” Sampling.
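Putting the two equations together, nucleus sampling can be sketched as below. As with the earlier snippets, `probs` is a toy dict standing in for a model’s next-token distribution; this is my illustration of the paper’s method, not its official code.

```python
import random

def nucleus_sample(probs, p, rng=random):
    """Nucleus (top-p) sampling: take the smallest set of words whose
    cumulative probability mass reaches p, renormalize by that mass p',
    and sample one word from the resulting distribution."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cum = [], 0.0
    for word, prob in ranked:
        nucleus.append((word, prob))
        cum += prob
        if cum >= p:                    # smallest set reaching the threshold
            break
    words = [w for w, _ in nucleus]
    weights = [pr / cum for _, pr in nucleus]   # divide by p' = cum
    return rng.choices(words, weights=weights, k=1)[0]
```

Unlike `top_k_sample`, the size of the candidate set here adapts to the shape of the distribution: a peaked distribution yields a tiny nucleus, a flat one a large nucleus.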

Experiments & Evaluation metrics

In the paper, the Generative Pre-trained Transformer, version 2 (GPT-2), was used for the experiments.

This model was trained on WebText, a $40$GB collection of text scraped from the web.

The authors conducted several NLG tasks using different decoding algorithms above and evaluated each method with various evaluation metrics.

The metrics used are as follows.

• Perplexity

This indicates the quality of language modeling by a language model (LM).

Perplexity is the reciprocal of the normalized probability of the test data, calculated as below.

$PPL(W) = \sqrt[N]{\frac{1}{P(w_1, w_2, \ldots , w_N)}} = \sqrt[N]{\frac{1}{\prod_{i=1}^{N}P(w_i \mid w_1, w_2, \ldots ,w_{i-1})}}$

We should notice that the lower the perplexity is, the better the quality of language modeling is.
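In practice, the formula above is computed in log space to avoid numerical underflow from multiplying many small probabilities. A minimal sketch, where `token_probs` is an assumed list of per-token conditional probabilities $P(w_i \mid w_{1:i-1})$:

```python
import math

def perplexity(token_probs):
    """PPL = (prod_i p_i)^(-1/N), computed as exp(-mean log p_i)."""
    n = len(token_probs)
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_sum / n)
```

For instance, a model that is uniformly unsure among four words at every step has perplexity 4, matching the intuition that perplexity is the effective number of choices per step.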

In other words, lower perplexity shows us that the LM was trained to model the sequences in the training data well.

But that doesn’t mean this model has a good decoding capability.

Lower perplexity also means that the model chooses each next word from relatively few options.

So this does not guarantee the diversity of the decoded outputs.

The paper argues that it is important to produce perplexity similar to that of human-produced sentences, and that lower perplexity is not unconditionally better.

Take a look at the picture below.

We can see that humans do not produce languages with extremely low perplexity as we discussed before.

And algorithms that focus on words with relatively high probabilities, such as beam search and top-$k$ with low temperature, have unnaturally low perplexities compared to the others.

The rest of the methods, such as random sampling, top-$k$ with higher temperature, and Nucleus sampling, approach the perplexity of human-produced text.

• Zipf Coefficient

Zipf’s law says that there is a power-law relationship between the rank of a word and its frequency in a text.

The Zipf coefficient $s$ is used to evaluate this property, and $s=1$ means a theoretically perfect Zipfian (power-law) distribution.

That is, this value indicates how similar the word usage is to the human tendency.
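Under Zipf’s law, $\text{freq}(r) \propto r^{-s}$, so $\log \text{freq}$ is linear in $\log \text{rank}$ with slope $-s$. A minimal sketch of estimating $s$ by least-squares on the log-log frequencies (one common way to fit it; the paper does not prescribe this exact procedure):

```python
import math
from collections import Counter

def zipf_coefficient(tokens):
    """Estimate the Zipf coefficient s by fitting
    log(frequency) = c - s * log(rank) with least squares."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]  # log rank
    ys = [math.log(f) for f in freqs]                      # log frequency
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope  # Zipf's law predicts slope = -s
```

A corpus whose frequencies are exactly proportional to 1/rank recovers $s = 1$.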

The vocabulary distribution and Zipf’s Coefficient of each decoding algorithm is as follows.

The figure shows us that high-probability decoding methods such as beam search are somewhat different from the human tendency in choosing words.

And sampling-based methods especially pure sampling and Nucleus sampling are closer to the human distribution as we can see.

But pure sampling overestimates the use of rare words a little bit more, which is quite obvious.

And lower temperature avoids sampling rare words, which is also obvious since it makes the overall probability distribution peakier.

• Self-BLEU

Self-BLEU is another evaluation metric, for diversity; lower Self-BLEU implies higher diversity.

Self-BLEU is calculated by computing the BLEU score of each generated output using all other generations in the evaluation set as references.

Since the original BLEU score is based on n-gram matching, we can intuitively see that a lower matching score implies more diversity.
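To make the idea concrete, here is a simplified sketch of Self-BLEU using clipped n-gram precision only (real BLEU also combines multiple n-gram orders with a brevity penalty, which I omit here):

```python
from collections import Counter

def ngram_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against reference texts."""
    cand = Counter(tuple(candidate[i:i + n])
                   for i in range(len(candidate) - n + 1))
    max_ref = Counter()
    for ref in references:
        counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        for g, c in counts.items():
            max_ref[g] = max(max_ref[g], c)
    overlap = sum(min(c, max_ref[g]) for g, c in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

def self_bleu(generations, n=2):
    """Average n-gram precision of each generation against all the others;
    lower values indicate higher diversity."""
    scores = [ngram_precision(g, generations[:i] + generations[i + 1:], n)
              for i, g in enumerate(generations)]
    return sum(scores) / len(scores)
```

Identical generations score 1.0, while generations with no shared bigrams score 0.0.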

Let’s look at the description below.

Note that higher $k$, $t$, and $p$ make the score lower, which means an increase in diversity.

And this also means that each algorithm makes output with diversity closer to that of humans.

But this also leads to higher perplexity, although these high parameter values are needed to get close to the reference distribution.

• Repetition

We can also evaluate results by checking repetition to determine the degree of degeneration.
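One simple way to quantify repetition (my own illustration, not the paper’s exact metric) is the fraction of n-grams that repeat an earlier n-gram in the same text:

```python
def repetition_ratio(tokens, n=3):
    """Fraction of n-grams that repeat an earlier n-gram in the same text;
    a crude indicator of degenerate repetitive loops."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    seen, repeats = set(), 0
    for g in grams:
        if g in seen:
            repeats += 1
        seen.add(g)
    return repeats / len(grams) if grams else 0.0
```

A text stuck in a loop scores high, while fresh text with no repeated trigrams scores 0.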

Again, let’s see the figure below.

As we can see, the gold standard (human text) obviously has no repetition, and sampling methods with higher $p$ or $t$ follow this trend relatively well.

But with lower hyperparameters, these methods are similar to the greedy method and lead to severe repetition.

This indicates that the methods emphasizing language modeling quality tend to produce repetitive output compared to those focusing on diversity.

• Human Unified with Statistical Evaluation (HUSE)

Human evaluation can measure the quality of generated text well but is weak at detecting diversity defects.

On the other hand, the statistical evaluations are unable to measure the coherence of generated output properly while they can evaluate the diversity.

HUSE is introduced in Hashimoto, T. B., Zhang, H., & Liang, P. (2019). Unifying human and statistical evaluation for natural language generation. arXiv preprint arXiv:1904.02792, which is a combined evaluation metric of human evaluation and statistical evaluation.

HUSE is computed by training a discriminator to distinguish human-written text from model-generated text, based on two features: the probability assigned by the language model and human judgments of the typicality of generations.

To put it simply, well-produced text that is close to the human distribution should perform well on both the likelihood-based and the human evaluation.

So the overall evaluation results of each algorithm with the above metrics are as follows.

Note that it is ideal for a score to be similar to that of humans, rather than simply low or high.

And Nucleus sampling performed quite well, although on some metrics other methods scored better.

In addition, it is important to note that Nucleus sampling has the highest HUSE score, indicating that it satisfies both human evaluation and statistical evaluation.

This is one of the generated examples by experimented decoding algorithms.

Given the initial segment of web text, each method generated a continuation to complete the text.

You can check other interesting examples in the paper also.

To conclude, this was a very informative and valuable paper.

I think it was of great help to me not only because it presented a new methodology but also because it delivered comprehensive knowledge of various algorithms and evaluation methods for NLG.

I also found that simply extracting high-probability words is not the most important thing in a generation task; diversity is also a very important consideration.

I hope this post is helpful to understand this paper well.

Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2019). The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751. https://arxiv.org/abs/1904.09751.