Revisiting the transformer NMT project

It’s been a while.

I have been concentrating on a personal project, building a multi-turn chatbot with OpenAI’s pre-trained GPT-2 model, so I have not been able to spare time for another post.

My chatbot model is currently being trained, and I will post about the project when it is finished.

Today I’m going to talk about another topic: how I improved the English-French translation model I posted about before.

You can see the original post here.

There are several improvements I should point out.

Training with a lot larger train set
Adding the validation process
Decoding with the beam search
Different training strategy for the sentencepiece tokenizer

Then let’s get started.

Training with a lot larger train set

Back then, I trained the transformer with $150,000$ English-French pairs due to a lack of time and resources.

This time, I retrained it with about $2,000,000$ pairs, which is the full dataset from the “European Parliament Proceedings Parallel Corpus 1996-2011”.

Since the number of pairs increased so much, I reduced the total number of epochs from $50$ to $10$.

Unfortunately, I didn’t keep screenshots of the logs, but I can say that even though the total number of epochs was reduced, the model was actually optimized at a much faster rate.

From this, we can see that with a lot of training samples, the generalization capability of the model can be improved in fewer epochs.

This makes sense, since the model can learn more reliable features from more samples and is also less prone to over-fitting because each sample is repeated fewer times.

This somewhat explains why I got poor results last time.
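To put the two setups side by side, here is a quick back-of-the-envelope calculation. It only uses the pair and epoch counts quoted in this section (the larger pair count is approximate), and the variable names are made up for this sketch.

```python
# Pair and epoch counts quoted in this post (the second pair count is approximate).
old_pairs, old_epochs = 150_000, 50
new_pairs, new_epochs = 2_000_000, 10

# Total number of sentence pairs the model sees over the whole run.
old_seen = old_pairs * old_epochs
new_seen = new_pairs * new_epochs

print(f"old run: {old_seen:,} pairs seen, each pair repeated {old_epochs} times")
print(f"new run: {new_seen:,} pairs seen, each pair repeated {new_epochs} times")
```

So the new run sees more pairs in total, while each individual pair is repeated only one fifth as often.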

Adding the validation process

Second, I added a validation function to check the model’s performance after each epoch. (Thank you, kr-sundaram, for pointing this out and helping me with it.)

Validation is quite important for detecting over-fitting on the training samples and for regularly checking the model’s performance during training, but I didn’t add it at first.

There are two reasons why I didn’t care about validation before.

First, to my shame, I had no idea how to evaluate the model for validation at that time.

Second, as I said, the training took so long that I wanted to avoid any extra delay.

Now I have added validation by computing the loss on a separately prepared validation set.

Of course, there are other possible ways to do this, for example calculating BLEU scores or perplexity.

Since the repository I referred to simply implemented loss checking, I also used that method.
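As a side note, perplexity can be derived directly from the loss, assuming the criterion is a mean per-token cross-entropy measured in nats. That is an assumption on my side, and the helper below is hypothetical, not something from the repository.

```python
import math


def perplexity_from_loss(mean_cross_entropy):
    """Convert a mean per-token cross-entropy (in nats) into perplexity."""
    return math.exp(mean_cross_entropy)


# Illustrative values only, not results from this project.
for loss in (0.0, 1.0, 2.5):
    print(f"loss={loss:.2f} -> perplexity={perplexity_from_loss(loss):.2f}")
```

Note that averaging per-batch losses, as a simple validation loop does, only approximates the true per-token mean when batches contain different numbers of non-padding tokens.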

The code for validation is as follows.

def validation(self):
    print("Validation processing...")
    self.model.eval()  # switch off dropout and other training-only behavior

    valid_losses = []
    with torch.no_grad():  # gradients are not needed for evaluation
        for i, batch in tqdm(enumerate(self.valid_loader)):
            src_input, trg_input, trg_output = batch
            src_input, trg_input, trg_output = \
                src_input.to(device), trg_input.to(device), trg_output.to(device)

            e_mask, d_mask = self.make_mask(src_input, trg_input)

            # (B, L, vocab_size)
            output = self.model(src_input, trg_input, e_mask, d_mask)

            # Flatten the batch and length dimensions before computing the loss.
            trg_output_shape = trg_output.shape
            loss = self.criterion(
                output.view(-1, sp_vocab_size),
                trg_output.view(trg_output_shape[0] * trg_output_shape[1])
            )

            valid_losses.append(loss.item())

            # Free the tensors of this batch before loading the next one.
            del src_input, trg_input, trg_output, e_mask, d_mask, output
            torch.cuda.empty_cache()

    mean_valid_loss = np.mean(valid_losses)

    return mean_valid_loss

The implementation is simple since it is almost the same as the training loop.

Additionally, with torch.no_grad(), del and torch.cuda.empty_cache() are there for GPU memory management.

From what I heard, torch.no_grad() disables gradient tracking, so the intermediate results needed for the backward pass are not stored and less memory is used. del removes the references to the tensors of the current batch so that their memory can be freed, and torch.cuda.empty_cache() releases the unused memory that PyTorch keeps cached on the GPU.

Decoding with the beam search

Another difference is that the beam search algorithm was added as a decoding strategy. (You can see the post about it here.)

When the argument --decode='beam' is passed, the model translates the input using beam search instead of the original greedy decoding.
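A flag like this could be wired up with argparse roughly as follows. The parser layout, the help text and the choice of greedy as the default are assumptions made for this sketch, not necessarily what the repository does.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Translate with a trained model")
    # Using 'greedy' as the default is an assumption made for this sketch.
    parser.add_argument(
        "--decode",
        choices=["greedy", "beam"],
        default="greedy",
        help="decoding strategy used at inference time",
    )
    return parser


# Simulate a command line that passes --decode=beam.
args = build_parser().parse_args(["--decode=beam"])
print(args.decode)
```

Restricting the flag with choices means a typo such as --decode=bean is rejected up front instead of silently falling back to another strategy.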

Let’s see what it looks like together.

First, I built two additional classes, BeamNode and PriorityQueue.

The former is a node object which contains the necessary information about each constructed sequence at a certain time step, such as the decoded result, the log probability, etc.

The latter is a data structure that keeps the nodes ordered by their log probabilities. (You can see the post about the priority queue here.)

import heapq


class BeamNode():
    def __init__(self, cur_idx, prob, decoded):
        self.cur_idx = cur_idx
        self.prob = prob
        self.decoded = decoded
        self.is_finished = False
        
    def __gt__(self, other):
        return self.prob > other.prob
    
    def __ge__(self, other):
        return self.prob >= other.prob
    
    def __lt__(self, other):
        return self.prob < other.prob
    
    def __le__(self, other):
        return self.prob <= other.prob
    
    def __eq__(self, other):
        return self.prob == other.prob
    
    def __ne__(self, other):
        return self.prob != other.prob
    
    def print_spec(self):
        # Adjacent f-strings are joined into a single line of output.
        print(f"ID: {self} "
              f"|| cur_idx: {self.cur_idx} "
              f"|| prob: {self.prob} "
              f"|| decoded: {self.decoded}")
    

class PriorityQueue():
    def __init__(self):
        self.queue = []
        
    def put(self, obj):
        heapq.heappush(self.queue, (obj.prob, obj))
        
    def get(self):
        return heapq.heappop(self.queue)[1]
    
    def qsize(self):
        return len(self.queue)
    
    def print_scores(self):
        scores = [t[0] for t in self.queue]
        print(scores)
        
    def print_objs(self):
        objs = [t[1] for t in self.queue]
        print(objs)

With them, we can now implement the actual beam search.

def beam_search(self, e_output, e_mask, trg_sp):
    cur_queue = PriorityQueue()
    for k in range(beam_size):
        cur_queue.put(BeamNode(sos_id, -0.0, [sos_id]))
        
    finished_count = 0
        
    for pos in range(seq_len):
        new_queue = PriorityQueue()
        for k in range(beam_size):
            node = cur_queue.get()
            if node.is_finished:
                new_queue.put(node)
            else:
                trg_input = torch.LongTensor(
                    node.decoded + [pad_id] * (seq_len - len(node.decoded))
                ).to(device) # (L)
                d_mask = (trg_input.unsqueeze(0) != pad_id).unsqueeze(1).to(device) # (1, 1, L)
                nopeak_mask = torch.ones([1, seq_len, seq_len], dtype=torch.bool).to(device)
                nopeak_mask = torch.tril(nopeak_mask) # (1, L, L) to triangular shape
                d_mask = d_mask & nopeak_mask # (1, L, L) padding false
                    
                trg_embedded = self.model.trg_embedding(trg_input.unsqueeze(0))
                trg_positional_encoded = self.model.positional_encoder(trg_embedded)
                decoder_output = self.model.decoder(
                    trg_positional_encoded,
                    e_output,
                    e_mask,
                    d_mask
                ) # (1, L, d_model)

                output = self.model.softmax(
                    self.model.output_linear(decoder_output)
                ) # (1, L, trg_vocab_size)
                    
                output = torch.topk(output[0][pos], dim=-1, k=beam_size)
                last_word_ids = output.indices.tolist() # (k)
                last_word_prob = output.values.tolist() # (k)
                    
                for i, idx in enumerate(last_word_ids):
                    new_node = BeamNode(
                        idx, 
                        -(-node.prob + last_word_prob[i]), 
                        node.decoded + [idx]
                    )
                    if idx == eos_id:
                        new_node.prob = new_node.prob / float(len(new_node.decoded))
                        new_node.is_finished = True
                        finished_count += 1
                    new_queue.put(new_node)
            
        cur_queue = copy.deepcopy(new_queue)
            
        if finished_count == beam_size:
            break
        
    decoded_output = cur_queue.get().decoded
        
    if decoded_output[-1] == eos_id:
        decoded_output = decoded_output[1:-1]
    else:
        decoded_output = decoded_output[1:]
            
    return trg_sp.decode_ids(decoded_output)

The idea is quite simple.

We just create a node, save its attributes and put it into the queue.

Then we take only the top $k$ (beam size) nodes from the queue and continue decoding from them.

After each decoding step, we create a new node for every candidate, save the current sequence, score and index in it, and put it into the priority queue for the next step.

If a node reaches the end token, its sequence is finished and it is not expanded any further.
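The loop described above can be sketched on a toy problem. This is not the implementation from this post: it uses plain tuples instead of BeamNode, a hand-written lookup table instead of the transformer, and no length normalization. Every name and number in it is invented for illustration.

```python
import heapq
import math

# Toy next-token model: maps a prefix (tuple of tokens) to {token: probability}.
TOY_MODEL = {
    ("<s>",): {"a": 0.6, "b": 0.4},
    ("<s>", "a"): {"x": 0.5, "y": 0.5},
    ("<s>", "b"): {"x": 0.9, "y": 0.1},
}
EOS = "</s>"


def next_probs(prefix):
    # Any prefix not listed above ends the sentence with probability 1.
    return TOY_MODEL.get(prefix, {EOS: 1.0})


def beam_search(beam_size, max_len=4):
    # Each beam entry is (negated log probability, token sequence).
    beams = [(0.0, ("<s>",))]
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == EOS:
                candidates.append((score, seq))  # finished: carry over unchanged
                continue
            for token, p in next_probs(seq).items():
                candidates.append((score - math.log(p), seq + (token,)))
        # Keep the beam_size candidates with the smallest negated log probability.
        beams = heapq.nsmallest(beam_size, candidates)
        if all(seq[-1] == EOS for _, seq in beams):
            break
    return beams[0][1]


greedy = beam_search(beam_size=1)  # beam size 1 behaves like greedy decoding
beam = beam_search(beam_size=2)
print(greedy)
print(beam)
```

With a beam size of 1 the search commits to "a" because it is the most likely first token, while a beam size of 2 also keeps "b" alive and ends up with a sequence whose overall probability is higher (0.36 versus 0.30 in this toy table).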

Additionally, I normalize the score by the sequence length once a sentence is finished, so that longer sequences are not penalized simply for being long.
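A small numeric example may make the effect of this normalization clearer. The two hypotheses and their scores below are made up, and the dictionary layout is only for illustration.

```python
# Two finished hypotheses with made-up negated log probabilities
# (smaller means more probable), stored the same way as in the queue.
short_hyp = {"tokens": ["<s>", "Je", "</s>"], "neg_log_prob": 2.4}
long_hyp = {
    "tokens": ["<s>", "Je", "suis", "ici", ".", "</s>"],
    "neg_log_prob": 3.0,
}


def normalized_score(hyp):
    # Same idea as dividing the score by len(decoded) in beam_search.
    return hyp["neg_log_prob"] / float(len(hyp["tokens"]))


best_raw = min([short_hyp, long_hyp], key=lambda h: h["neg_log_prob"])
best_normalized = min([short_hyp, long_hyp], key=normalized_score)

print("without normalization:", best_raw["tokens"])
print("with normalization:   ", best_normalized["tokens"])
```

Every extra token adds to the negated log probability, so without normalization the short hypothesis wins even though the longer one is more probable per token.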

Some might wonder why the log probabilities are multiplied by $-1$ before being stored.

This is because Python’s heapq library implements a min-heap, which always pops the smallest item first, so I made the scores positive so that popping from the heap gives the nodes with the “actual” highest scores.
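Here is a minimal standalone sketch of that trick using heapq directly. The candidate tokens and probabilities are invented for the example.

```python
import heapq
import math

# Hypothetical candidates with made-up probabilities.
candidates = {"le": 0.6, "la": 0.3, "un": 0.1}

heap = []
for token, p in candidates.items():
    neg_log_prob = -math.log(p)  # a small value means a likely candidate
    heapq.heappush(heap, (neg_log_prob, token))

# heapq always pops the smallest item, i.e. the most probable token here.
best_score, best_token = heapq.heappop(heap)
print(best_token, round(best_score, 3))
```

Pushing the raw log probabilities instead would pop the least likely candidate first, which is the opposite of what beam search needs.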

At the end of this post, I will show the differences between the results with greedy decoding and those with beam search.

Different training strategy for the sentencepiece tokenizer

In the last post about this project, I said that sentencepiece is based on BPE (Byte Pair Encoding), and therefore I trained the tokenizer with BPE to avoid the OOV problem.

But actually I was wrong, since BPE is only one of several algorithms available for training a sentencepiece tokenizer.

The default setting is unigram, which is based on the unigram language model.

I don’t know much about it yet and it deserves a separate study, but as the term “language model” suggests, we can infer that it is a probability-based method that prefers high-probability segmentations.

This differs from the BPE method I introduced previously, because BPE is based on frequency counts.

In other words, BPE splits the text into the smallest subword units and then repeatedly merges the most frequent adjacent pairs into larger subwords to build the vocabulary.
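A single BPE merge step can be sketched in a few lines of plain Python. This is a simplified illustration with a made-up toy corpus and hypothetical helper names, not how sentencepiece implements it internally.

```python
from collections import Counter


def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


# Toy corpus: each word split into characters, with a made-up frequency.
vocab = {
    ("l", "o", "w"): 5,
    ("l", "o", "t"): 3,
    ("n", "e", "w"): 6,
}

best_pair = count_pairs(vocab).most_common(1)[0][0]
vocab = merge_pair(best_pair, vocab)
print(best_pair)
print(vocab)
```

Here the pair ("l", "o") occurs 8 times, more than any other pair, so it becomes the new subword "lo". Repeating this step until the vocabulary reaches the desired size gives the final BPE vocabulary.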

According to one of my colleagues, this might cause unusual combinations of words in NLG tasks, and the unigram method is more beneficial there since it is very similar to the training strategy of generation tasks, which is mostly based on language modeling.

In the end, I re-trained the sentencepiece tokenizer with the default unigram setting.
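Switching between the two algorithms comes down to the --model_type flag of the sentencepiece trainer. The sketch below only builds the flag string; the file name, model prefix and vocabulary size are placeholders I made up, and the helper names are hypothetical.

```python
def build_spm_args(input_path, model_prefix, vocab_size, model_type="unigram"):
    """Build the flag string accepted by SentencePieceTrainer.Train."""
    return (
        f"--input={input_path} "
        f"--model_prefix={model_prefix} "
        f"--vocab_size={vocab_size} "
        f"--model_type={model_type}"
    )


def train_tokenizer(args):
    # Imported lazily so the helper above can be used without the package.
    import sentencepiece as spm
    spm.SentencePieceTrainer.Train(args)


# Placeholder file name, prefix and vocabulary size.
unigram_args = build_spm_args("train.en", "en_sp", 16000)
bpe_args = build_spm_args("train.en", "en_sp_bpe", 16000, model_type="bpe")
print(unigram_args)
print(bpe_args)
```

Calling train_tokenizer(unigram_args) would then train the tokenizer, provided the sentencepiece package is installed and the input file exists.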

The results

Finally it’s time for us to check the results.

We will look at two cases: the first uses the original greedy method and the second uses beam search.

First, these are the results using greedy decoding, just like in the previous post.

We can clearly see that even with greedy decoding, the translated outputs are quite decent, especially in the second and third cases.

Before the improvements, the model could not translate complicated sentences like these, but this time the results are pretty understandable.

So we can conclude that the above improvements did make progress!

But still, in the first case, the greedy method produces the same phrases repeatedly, even though the overall context is right.

And we can see that greedy decoding produces too many periods, which are also unnecessary.

Next, these are the results from beam search with the beam size $k=8$.

As you can see, the results look better, especially considering the first case.

Of course, there are still some unnecessary phrases, but the degeneration by repetition is clearly reduced compared with the result of the greedy method.

This is because beam search tends to avoid long sequences even when we try to remove the length penalty.

Actually, it is a double-edged sword in some ways, since the model sometimes finished the sentence with a single word like “Je” when I increased the beam size.

Therefore, it is necessary to choose the appropriate beam size.

The sentences are also much more natural, with fewer repeated periods.

So this is the end of the transformer NMT project.

Although there are still some improvements to be made, I will stop here to concentrate on my next project.

The link to the final repo is here.

I hope this will be beneficial to anyone who needs help working on a similar task.

I also appreciate any suggestions or comments on this project anytime.

bentrevett/pytorch-seq2seq. (2020, Jun 02). https://github.com/bentrevett/pytorch-seq2seq/blob/master/6%20-%20Attention%20is%20All%20You%20Need.ipynb.
