Language Model Evaluation in Open-ended Text Generation
An Nguyen
2021-08-08

Abstract: Although current state-of-the-art language models have achieved impressive results on numerous natural language processing tasks, they still produce repetitive, dull and sometimes inconsistent text in open-ended text generation. Studies often attribute this problem to the maximum likelihood training objective, and propose alternative approaches that use stochastic decoding methods or alter the training objective. However, there is still a lack of consistent evaluation metrics to directly compare the efficacy of these solutions. In this work, we study different evaluation metrics that have been proposed to evaluate the quality, diversity and consistency of machine-generated text. From there, we propose a practical pipeline to evaluate language models on the open-ended generation task, and investigate how to improve the model's performance in all dimensions by leveraging different auxiliary training objectives.

Introduction

Language modeling is the task of computing a probability distribution over word sequences, with the goal of generating fluent text in a given language, where sequences with higher probability are considered more fluent. In recent years, neural language modeling has become the powerhouse for an array of natural language processing tasks, from machine translation (Bahdanau et al., 2014; Luong et al., 2015) and summarization (Zhang et al., 2020) to story generation (Fan et al., 2018). The introduction of the Transformer architecture (Vaswani et al., 2017) has allowed language models to be trained on even more massive text datasets in significantly less time thanks to its ability to parallelize computation. Today, large pre-trained language models like GPT-2, or the latest GPT-3 (Brown et al., 2020) with 175 billion parameters, have achieved state-of-the-art results on numerous tasks in zero-shot and few-shot settings.

Despite their superiority on multiple NLP tasks, these language models still fall short in open-ended text generation tasks such as story generation and dialogue modelling, where the model is required to produce a long continuation of text. Based on empirical observation with standard deterministic decoding methods, the generated text is often found to be dull and repetitive (Holtzman et al., 2019; Welleck et al., 2019a; Shao et al., 2017; Fan et al., 2018), and sometimes logically inconsistent and factually incorrect despite being fluent and coherent (Li et al., 2020; Welleck et al., 2019b; Hayashi et al., 2020; Petroni et al., 2019). A number of solutions to this problem exist, such as using stochastic decoding methods when generating continuations to keep the model from repeating itself (Fan et al., 2018; Holtzman et al., 2019), or altering the training objective to penalize the model for being repetitive (Welleck et al., 2019a; Bengio et al., 2015). However, to the best of our knowledge, there has not been any work that quantitatively compares the performance of these solutions, due to the lack of a consistent evaluation metric. Therefore, it is hard to decide which solution is better than the others.

Traditionally, language models have been evaluated based on perplexity, which is concerned with the probability of a sentence being produced by the model.
We also have a number of other metrics for different tasks, such as BLEU for machine translation (Papineni et al., 2002) or ROUGE for text summarization (Lin, 2004), which essentially compare the similarity between human- and machine-generated text. This works when we want to judge the quality of the generated text: we want our models to produce natural, human-like and grammatically correct sentences. However, in open-ended generation tasks such as story telling or dialogue generation, not only do we want our model to produce high quality text, but we also want it to be creative and diverse. For example, given the prompt "Once upon a time", we expect our story generation model to generate a diverse range of continuations rather than repeating the most probable story again and again. This is where traditional metrics like perplexity can be problematic, since they place heavy emphasis on the probability of individual words. In terms of creativity or diversity, we would probably prefer seeing "she takes the yacht to school" over "she takes the bus to school" in a story. However, these two sentences can have significantly different probabilities, because the word "yacht" is much less probable than "bus". Thus, the first sentence might result in a much higher perplexity, which can be perceived as lower quality.

Another important aspect of open-ended text generation is commonsense reasoning, which we will refer to as consistency. Since the models are expected to produce much longer text, they are more prone to generating illogical or factually incorrect sentences (Welleck et al., 2019c; Li et al., 2020). For example, when generating an answer to a customer in a dialogue, a chatbot can say that "our store remains open on the weekend" and then immediately say "our store will be closed on Saturday". Perplexity simply cannot capture how consistent or logical a generative model is.

Question: What is the best metric to use when evaluating language models on open-ended text generation in each dimension: quality, diversity and consistency? When evaluating language models on open-ended generation, we need to look at their performance in all three dimensions: quality, diversity and consistency. However, in each dimension, a number of different metrics have been proposed in the literature. For example, to evaluate the quality of machine-generated text, one can use either Corpus-BLEU (Yu et al., 2017) or forward perplexity. If two studies use different metrics to evaluate their models' performance, it becomes impossible to directly compare the two. In this work, we study different evaluation metrics that have been proposed to evaluate the quality, diversity and consistency of machine-generated text, and aim to find the best metric to use in each dimension.

Question: Which technique leads to better performance of language models in open-ended text generation, stochastic decoding methods or tweaking the training objective? Although stochastic decoding methods significantly reduce the repetition issue thanks to randomization (Holtzman et al., 2019; Fan et al., 2018), they do not solve the underlying problem with maximum likelihood training. Because of this, altering the training objective might sound like a more viable approach.
Much work has been carried out in this regard, such as scheduled sampling (Bengio et al., 2015), Generative Adversarial Nets (Goodfellow et al., 2014; Yu et al., 2017), or most recently unlikelihood training (Welleck et al., 2019a; Li et al., 2020). However, since there has not been any direct comparison between the two techniques, it is hard to decide which one is superior. We argue that if both lead to the same performance, we would be better off using stochastic decoding methods, since they are cheaper in terms of training. Using the new evaluation pipeline that we propose in this work, we aim to find out which technique is superior for language models in open-ended text generation: stochastic decoding methods or tweaking the language model training objective.

Question: Can multi-task learning help language models generate higher quality, more diverse and more consistent text? With the success of BERT on a plethora of different NLP tasks (Devlin et al., 2019), there has been a surge of interest in multi-task learning for language models. Besides the traditional maximum likelihood estimation objective, many different auxiliary training objectives have been introduced, such as masked language modeling (Devlin et al., 2019; Yang et al., 2019), next sentence prediction (Devlin et al., 2019), or word/sentence order prediction (Wang et al., 2019). All of these additional objectives share a common goal: to make the language model better at understanding the language. In this work, we are curious to find out whether multi-task learning can help language models become better at a specific task: open-ended text generation. We only focus on auxiliary training objectives in an unsupervised setting, i.e. where training labels can be obtained automatically.

In Chapter 2, we first provide the necessary background to understand language models and their evolution, from simple probabilistic n-gram models to state-of-the-art Transformer-based models. We then illustrate the neural text degeneration problem and compare different solutions that have been proposed in the literature. We end the chapter by looking at different evaluation metrics for language models on open-ended text generation in all dimensions: quality, diversity and consistency. In Chapter 3, we present our experiment to compare different evaluation metrics for language models on open-ended text generation. We decide on the best metric to use for each dimension, then propose an evaluation pipeline using these metrics and apply it to compare the performance of different stochastic decoding methods and unlikelihood training. We also carry out an experiment to see how the choice of domain for the training corpus can affect the consistency of continuations. In Chapter 4, we give an overview of multi-task learning, then present our experiment in which we fine-tune language models with different auxiliary training objectives. Using the pipeline proposed in Chapter 3, we evaluate the efficacy of these auxiliary training objectives on the open-ended text generation task. In Chapter 5, we conclude the project, summarize its contributions, and provide several suggestions for future work.
Chapter 2 Background

2.1 Language Models

In recent years, language models have become the powerhouse for an array of natural language processing (NLP) tasks, from machine translation (Bahdanau et al., 2014; Luong et al., 2015) and summarization (Zhang et al., 2020) to story generation (Fan et al., 2018) and dialogue generation (Li et al., 2020; Welleck et al., 2019b). So what exactly is a language model? Formally, for a text that contains m tokens {w_1, ..., w_m}, a language model computes the probability of that text, $p(w_1, \ldots, w_m)$, where each token $w_i$ belongs to the vocabulary $V$. The goal of computing this probability is to produce more fluent text: fluent sequences should have a much higher probability than odd ones. For example, if we want to translate the following sentence from Vietnamese to English, "Tôi thích xe màu cam và màu xanh", the literal word-by-word translation would be "I like car orange and blue". The language model should give this translation a lower probability than a natural translation like "I like orange and blue cars". To generate text, a language model can produce the next word given a context by using the conditional probability of the next word given the previous words, $p(w_t \mid w_1, \ldots, w_{t-1})$. In the rest of this section, we explore the history of language models and show how we have arrived at the present state-of-the-art models, which stand behind recent remarkable achievements in multiple NLP tasks.

To compute the probability of a sequence of text, we can factorize the joint probability using the chain rule:

$$p(w_1, \ldots, w_m) = \prod_{t=1}^{m} p(w_t \mid w_1, \ldots, w_{t-1})$$

This equation suggests that at every step, we have to calculate the probability of a word given all of its predecessors. In principle, this can be done by simply counting the occurrences of the word combination with and without the last word in a sufficiently large body of text. For example, for the sentence "I like orange and blue cars", the probability of the word "blue" given its predecessors is:

$$p(\text{blue} \mid \text{I like orange and}) = \frac{\text{count}(\text{I like orange and blue})}{\text{count}(\text{I like orange and})}$$

In practice, calculating probabilities by counting every existing word combination is impossible, as we would never have a large enough dataset to produce a correct estimate, especially for long sequences. To make the calculation tractable, an n-gram language model simplifies this and calculates the probability of a word using only the n − 1 previous words:

$$p(w_t \mid w_1, \ldots, w_{t-1}) \approx p(w_t \mid w_{t-n+1}, \ldots, w_{t-1})$$

For example, in a uni-gram (1-gram) language model, the probability of the sentence "I like orange and blue cars" is computed as:

$$p(\text{I like orange and blue cars}) = p(\text{I})\, p(\text{like})\, p(\text{orange})\, p(\text{and})\, p(\text{blue})\, p(\text{cars})$$

An obvious problem with n-gram language models is sparsity, i.e. yielding zero probability for unknown word combinations. This is very likely to happen in practice, as language is always evolving and new combinations are created every day. There exists a number of solutions to this problem, such as back-off (Kneser and Ney, 1995) or smoothing (Chen and Goodman, 1999). The general idea of these solutions is to give unknown combinations a tiny amount of the probability mass, so that we never encounter zero values in our calculation. Another problem with n-gram language models is that they are very limited in modeling long-range dependencies between words in a sentence. For example, the sentence "The cat at the end of the street comes here everyday" requires at least an 8-gram language model in order to know that "comes" should be a singular verb because of the singular noun "cat". Increasing n is not a viable solution, as n-gram language models become computationally expensive when n is large.
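To make the counting estimate above concrete, here is a minimal Python sketch (our illustration, not code from the thesis) that estimates bigram probabilities by counting and scores a sentence with the chain rule; the tiny corpus and the function names are hypothetical, and add-alpha smoothing stands in for the smoothing methods cited above.

```python
from collections import Counter

def bigram_model(corpus_tokens):
    """Estimate p(w_t | w_{t-1}) by counting bigrams and unigrams."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))

    def prob(prev, word, alpha=1.0):
        # add-alpha smoothing so unseen bigrams do not get zero probability
        vocab_size = len(unigrams)
        return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab_size)

    return prob

corpus = "i like orange and blue cars i like blue cars".split()
p = bigram_model(corpus)

# chain-rule score of a sentence under the bigram approximation
sentence = "i like blue cars".split()
score = 1.0
for prev, word in zip(sentence, sentence[1:]):
    score *= p(prev, word)
print(score)
```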
For the example above, an 8-gram language model with a vocabulary of 100,000 words has up to $100{,}000^8 = 10^{40}$ possible sequences. Lastly, n-gram language models are incapable of informing us about the linguistic and semantic properties of the language, as everything in an n-gram language model is just plain statistics. Given the sentence "a man is eating an apple", an n-gram language model has no way to recognize that it is actually very similar to the sentence "a woman is eating an orange" in terms of semantics. This is because n-gram language models only look at the surface forms of words; thus, in the view of an n-gram language model, the difference between the words "man" and "woman" is the same as the difference between "man" and "bread". Therefore, n-gram language models cannot generalize their knowledge to sequences that they have not encountered during training.

A neural network is a collection of computing units whose goal is to learn a mapping function between a set of inputs and desired outputs. For example, we might want to build a neural network to predict whether an image is a picture of a cat or a dog. In this case, the inputs can be the individual pixels of the image, and the outputs can be the probabilities of this image being a picture of a dog or a cat (Figure 2.1). Each computing unit has an associated weight, which can be learned with the help of a loss function that penalizes the network when it makes incorrect predictions. The end goal is to learn a set of weights that maximizes the likelihood of the training data. The neural network in Figure 2.1 is called feed-forward because of its architecture: the computation proceeds forward with no cycle. A neural network can have multiple hidden layers between the input layer and the output layer, in which case it is often known as a deep neural network.

The first feed-forward neural language model was proposed by Bengio et al. (2003), whose architecture is shown in Figure 2.2. Similar to an n-gram language model, the feed-forward neural language model uses the n previous words as context to compute the probability of the next word in the sequence. What differs here is that each context word has a vector representation, which can be looked up in a table C. These vectors are concatenated and passed to a hidden layer, followed by a final softmax layer to obtain the probability distribution of the next word over the entire vocabulary. This model has the same downside as n-gram language models, since it only has access to the n previous words to predict the next word. However, feed-forward neural language models do not require smoothing (technically, the softmax layer never produces zero values due to the exponential operation), and they can generalize much better over similar contexts. Recall the example from the last section, where the n-gram model cannot know that the sentence "a man is eating an apple" is actually very similar to the sentence "a woman is eating an orange". The feed-forward neural language model solves this problem by having access to a dense vector representation for each word, such that similar words are expected to have similar feature vectors. This way, when the model updates its parameters for a specific word, the changes carry over to similar words as well. This dense representation is known as a word embedding: a real-valued vector used to represent a word, typically ranging from ten to a few hundred dimensions.
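As an illustration of the architecture just described (a sketch under our own assumptions, not the thesis's code, with arbitrary sizes), the following PyTorch module embeds the context words, concatenates their vectors, passes them through a hidden layer and produces a distribution over the vocabulary, in the spirit of Bengio et al. (2003).

```python
import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    """Bengio-style feed-forward language model (illustrative sizes)."""
    def __init__(self, vocab_size=10_000, context_size=4, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)       # the lookup table C
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)            # scores over the vocabulary

    def forward(self, context_ids):                             # (batch, context_size)
        vectors = self.embed(context_ids).flatten(start_dim=1)  # concatenate the context embeddings
        hidden = torch.tanh(self.hidden(vectors))
        return torch.log_softmax(self.out(hidden), dim=-1)      # log p(next word | context)

model = FeedForwardLM()
context = torch.randint(0, 10_000, (2, 4))                      # two dummy contexts of 4 word ids
log_probs = model(context)                                      # shape (2, 10000)
```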
There exists a number of pre-trained word embeddings, among which the most notable ones are GloVe (Pennington et al., 2014) and word2vec (Mikolov et al., 2013a,b). Compared to Bengio et al. (2003), GloVe's and word2vec's training objectives are much simpler, allowing them to be trained at a much larger scale. Thus, using these pre-trained word embeddings directly on downstream tasks is much more effective than training a word embedding layer from scratch, in terms of both training speed and task performance (Wang et al., 2020). Interestingly, word embeddings have been shown to be capable of capturing linguistic relationships between words (Figure 2.3). This suggests that word embeddings can in fact represent the meaning of a word, which is a huge step towards natural language understanding.

Figure 2.3: Example of relations captured by word2vec word embeddings when projected to a low-dimensional space (Mikolov et al., 2013a,b). Image credit: Ruder (2018).

A recurrent neural network (RNN) is a special type of neural network that contains cycles in its structure. It is typically used for sequential data, such as audio signals, time series or language. An RNN processes the input sequence one element at a time, and maintains a state vector (known as the hidden state) to store the information of previously processed context. An example of an RNN is given in Figure 2.4 (Olah, 2015). We refer to a specific position in the sequence as a time step. At every time step t, the input x_t is passed into the network. Using this input, along with the hidden state from the last time step h_{t−1}, the model computes the current hidden state h_t. This process is repeated until we hit the end of the sequence. Depending on the type of recurrent neural network, the model can choose to yield an output y_t at every time step (one-to-one, e.g. part-of-speech tagging), yield a single output y_T at the final step (many-to-one, e.g. text classification), or produce an output sequence {y_1, ..., y_{T′}} of a different length T′ (many-to-many, e.g. machine translation).

The use of RNNs in language modeling was first introduced by Mikolov et al. (2010). For each sequence used for training, we can obtain the target labels by shifting the whole sequence one step to the left. At every time step, we pass the word embedding of the current word to the RNN layers, apply the softmax operation to obtain the probability distribution of the next word, then calculate the loss for back-propagation using the target word. RNN language models were shown to outperform all state-of-the-art n-gram models at the time, even when the n-gram models were given much more training data (Mikolov et al., 2010). Thanks to their recurrent structure, RNN language models do not rely on a fixed-length context, but can process a sequence of arbitrary length. Every word has access to the latest hidden state, which carries the information of every previous word in the sequence. In practice, however, RNNs still struggle to capture long-range dependencies, as information from early time steps can diminish at later ones, especially in long sequences - a problem known as vanishing gradients. In addition, because of their sequential nature, RNNs are much slower to train than feed-forward neural networks.
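The shift-by-one training scheme described above can be sketched in a few lines of PyTorch (our illustration with arbitrary sizes, not the thesis's code): the model embeds the tokens, runs them through a recurrent layer, and is trained to predict each next token from the hidden state at the previous position.

```python
import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    """Minimal RNN language model: embed, recur, predict the next token."""
    def __init__(self, vocab_size=10_000, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):                       # (batch, seq_len)
        hidden_states, _ = self.rnn(self.embed(token_ids))
        return self.out(hidden_states)                  # next-token logits at every time step

model = RNNLanguageModel()
tokens = torch.randint(0, 10_000, (2, 12))
logits = model(tokens)
# the targets are the input sequence shifted one step to the left, as described above
loss = nn.CrossEntropyLoss()(logits[:, :-1].reshape(-1, 10_000), tokens[:, 1:].reshape(-1))
```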
To solve the vanishing gradient problem, Hochreiter and Schmidhuber (1997) proposed the long short-term memory (LSTM) block as a replacement for the traditional RNN block, with the addition of memory cells to preserve gradients throughout the sequence and multiple gates to control access to these memory cells. A comparison between the traditional RNN block and the LSTM block is given in Figure 2.6 (Olah, 2015). The three types of gates in the LSTM block are the input gate, the forget gate and the output gate. At a high level, the forget gate determines how much of the current memory cell's content should be forgotten, the input gate decides how much new knowledge should be added to the memory cell, and the output gate controls how much of the memory cell's content should be transferred to the next hidden state. For example, when processing the word "he" in the sentence "The two cars that he loves", the forget gate can choose to let go of the information about the plural noun "cars", and the input gate can store the new information about the singular pronoun "he" in order to correctly predict the singular verb "loves".

Another variant of recurrent neural network language models is the sequence-to-sequence (also known as encoder-decoder) model, which deals with generation tasks where the length of the output sequence differs from that of the input sequence. The model was first introduced by Sutskever et al. (2014), and it has been widely used in a variety of natural language processing tasks, such as neural machine translation or image caption generation. It works by first processing and compressing the input sequence into a fixed-size vector using the encoder network, then using that vector as the first hidden state to start decoding the output sequence with the decoder network. An example of this architecture is given in Figure 2.7.

Figure 2.7: Example of the sequence-to-sequence (encoder-decoder) architecture, where the model encodes the sentence ABC and outputs the sentence WXYZ (Sutskever et al., 2014).

This compression step can become a bottleneck, as the source sequence has to be squeezed into a fixed-size vector before being passed to the decoder network. If we have a long source sequence, it becomes difficult for the model to transfer all the information captured by the encoder network to the decoder network. Unlike the sequential process in seq2seq models, when translating a sentence from one language to another, humans do not necessarily take just one single look at the source sentence and immediately translate it into the target language. What we tend to do is rather an iterative process, where we keep working back and forth between the source sentence and the target sentence to find the information needed at a specific time step and ignore the rest. This behaviour in language translation is exactly the motivation for the attention mechanism, which was first proposed by Bahdanau et al. (2014). In this model, the attention mechanism allows the decoder network to look at every hidden state from the encoder network to find the information it needs: in addition to the last hidden state s_{t−1} and the last output y_{t−1}, the decoder uses a context vector computed as a weighted sum of the encoder hidden states to produce the next hidden state s_t. Despite the significant improvements in task performance, the slow computation time of seq2seq models remains an unsolved problem, preventing them from being trained at a larger scale. In the seminal paper Attention is all you need, Vaswani et al.
(2017) revolutionized the NLP field with the Transformer block (Figure 2.9), which completely eliminates the need for the sequential processing found in RNN and seq2seq models. The main innovation in the Transformer block comes from the multi-head attention layer. To understand multi-head attention, we first need to understand what self-attention is. Whereas an RNN maintains the current hidden state by looking at the previous hidden state/output vector, self-attention allows the model to construct a hidden state at each time step by attending to every other input in the sequence. Borrowing an example from Alammar (2018b), suppose we want to process the sentence "The animal didn't cross the street because it was too tired" using a Transformer network. Figure 2.10a illustrates how the model constructs the hidden state for the word "it" by looking at other positions in the input sequence. Note that in this example, the model gives most of the weight to the phrase "the animal", which is exactly what the word "it" refers to. A word in a sentence can relate to other words in various ways, e.g. being the nominal subject of one word while being the direct object of another. This is exactly what multi-head attention can be used for: each head in the multi-head attention layer is a self-attention layer with its own parameters, so it can learn about different relationships between the source word and the others simultaneously. Figure 2.10b gives an example of this phenomenon, where the word "it" attends to the phrase "the animal" in one head, but to the word "tired" in the other.

Similar to an RNN language model, a Transformer-based language model can be trained on an unlabeled text dataset with the next word prediction objective. In every training sequence, each word is passed to the self-attention layer, which allows it to construct its hidden state by attending to all of the previous words (Figure 2.11). Unlike in an RNN, this computation is independent for all words in the input sequence, meaning that it can be performed in parallel. To prevent a word from attending to its successors, the model makes use of an attention mask - a matrix of 0s and 1s that zeroes out the successors' softmax scores (Figure 2.12). This has enabled Transformer-based language models to be trained at a much bigger scale than RNN and seq2seq models.

Since Transformer models can take advantage of parallel computing resources, language models can now be trained at a massive scale. State-of-the-art pre-trained language models nowadays typically hold up to billions of parameters and are trained on terabytes of unlabeled data (Brown et al., 2020). Figure 2.13 shows popular pre-trained language models and their respective sizes. These language models continue to achieve state-of-the-art performance across a plethora of NLP tasks, such as language generation, machine translation or question answering (Radford et al., 2015; Peters et al., 2018; Yang et al., 2019; Devlin et al., 2019; Brown et al., 2020). With the advanced performance of pre-trained Transformer language models on various NLP tasks, fine-tuning them has become the go-to approach for transfer learning in NLP (Howard and Ruder, 2018; Chen et al., 2020; Dodge et al., 2020). With access to state-of-the-art language models through libraries like Huggingface, we can simply download a pre-trained model and fine-tune it on our own data. The following section gives a brief overview of GPT-2, which is the model we fine-tune in all of our experiments in this project.
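Before moving on to GPT-2, here is a minimal PyTorch sketch (our illustration, not code from the thesis) of the causal attention mask described above: a lower-triangular matrix of 0s and 1s is applied to the attention scores before the softmax, so each position can only attend to itself and its predecessors.

```python
import torch
import torch.nn.functional as F

def causal_self_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask."""
    seq_len, d_k = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5           # (seq_len, seq_len) attention scores
    mask = torch.tril(torch.ones(seq_len, seq_len))         # 1s on and below the diagonal
    scores = scores.masked_fill(mask == 0, float("-inf"))   # successors get -inf, i.e. zero weight after softmax
    weights = F.softmax(scores, dim=-1)
    return weights @ v                                       # each position is a weighted sum of its past

x = torch.randn(6, 16)             # 6 token representations of dimension 16
out = causal_self_attention(x, x, x)
```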
GPT-2 is a Transformer-based language model, with the largest variant containing around 1.5 billion parameters. Similar to the causal language model shown in Figure 2.11, GPT-2's objective is to predict the next word given the previous context. GPT-2 was trained on the WebText dataset, which contains 8 million web documents from a variety of domains. Thanks to its huge number of parameters and the diversity of content in its training dataset, GPT-2 is able to achieve impressive performance on an array of language tasks without any supervised training data, including machine translation, summarization, reading comprehension and question answering. Fine-tuning GPT-2 models on specific tasks has also proven to be an effective approach, as they can achieve competitive results.

As discussed, despite their superiority on multiple NLP tasks, language models still fall short in open-ended text generation. With traditional deterministic decoding methods, machine-generated text is often found to be dull and repetitive (Holtzman et al., 2019; Welleck et al., 2019a; Shao et al., 2017; Fan et al., 2018), and sometimes inconsistent and factually incorrect despite being fluent and coherent (Li et al., 2020; Welleck et al., 2019b; Hayashi et al., 2020; Petroni et al., 2019) - a problem known as neural text degeneration. Examples of neural text degeneration with greedy decoding and beam search can be seen in Table 2.1. Many studies suspect that the root of this problem is the maximum likelihood training objective (Holtzman et al., 2019; Welleck et al., 2019a). Multiple alternatives have been suggested, which fall into one of two categories: using stochastic decoding or tweaking the training objective.

All stochastic decoding methods share a common goal, which is to introduce a degree of randomness to the generation process so the model has less chance of repeating itself. In this section, we give an overview of popular stochastic decoding methods in the literature, with example continuations in Table 2.1.

Sampling with temperature One can generate more diverse text simply by sampling the next word from the learned softmax distribution. This is often done with a Boltzmann temperature parameter (Ackley et al., 1985) to control the randomness of the sampling process, where zero temperature is equivalent to the argmax operation and infinite temperature corresponds to uniform sampling. However, choosing the right temperature value is not straightforward.

The following example continuations, reproduced from Table 2.1, illustrate the behaviour of different decoding methods:

This virus is transmitted from person to person, through the Internet. This infection is characterized by a wide variety of symptoms including fever, headache, fatigue, fatigue, nausea, vomiting, muscle aches, and pain in the chest. The symptoms of COVID-19 include: Decreased heart rate Harmful bowel movements Decreased appetite Increased

top-p (p = 0.9): The outbreak began in October 2008, with 15 cases being confirmed between November 2008 and June 2009. The cases included 11 women aged between 18 and 43 years old. Of these, 1 had died. There is still no evidence of the virus spreading into humans, but it is thought to be spread by people who are infected by infected persons, or infected in public.

top-k (k = 10): The coronavirus is present in about half of all patients and can cause a wide range of health problems including: eye and skin cancers, gastrointestinal problems, and liver and kidney diseases. It is the most widely distributed virus in the United States today, and is transmitted by a number of viruses.
It is also the most prevalent and highly contagious virus, causing the death of

penalized sampling (θ = 1.2): The virus causes severe respiratory and cardiovascular problems, including death from pneumonia or heart failure in some people with the infection. : A new type of viral that can cause serious health issues such as cancer, hepatitis C, HIV/AIDS, diabetes mellitus, stroke. : An emerging class called "viruses" are known to be responsible for many diseases like AIDS, malaria, tuberculosis

greedy (with unlikelihood training): It is a common and deadly disease that has been linked to the deaths of hundreds of thousands of people. The first known case of COVID-19 was reported in the late 1960s. The first person to be infected with the virus was a woman named Mary Ann Crouch, who had been living in a house in the village of Crouch.

In some contexts, like question answering, sampling the next tokens can lead to a wrong answer, since implausible tokens can still receive non-zero probability mass. To address this problem, Keskar et al. (2019) proposed penalized sampling, which samples words in a near-greedy fashion but prevents repetitions by discounting the scores of previously generated tokens.

Stochastic decoding methods have one downside: they do not solve the underlying problem with maximum likelihood training. In this section, we examine different strategies to alter the language model training objective to cope with the neural text degeneration problem.

Entmax loss/sampling There exists a mismatch between training and testing conditions in stochastic decoding methods, where the model generates text based on a truncated softmax distribution but is evaluated based on the full softmax distribution (Martins et al., 2020). The authors thus proposed to use an entmax loss function during training instead of softmax, which transforms a vector of scores into a sparse probability distribution to avoid giving any probability mass to implausible words. At inference time, entmax sampling is used, thus making sure that training and testing conditions are similar. The downside of this approach is that we can no longer use perplexity as an evaluation metric when training the language model, since the distribution can contain many zero probability values. To compensate for this drawback, Martins et al. (2020) proposed ε-perplexity and the sparsemax score as alternative evaluation metrics.

Generative Adversarial Nets (GAN) Since its introduction (Goodfellow et al., 2014), GAN has taken over the computer vision community and achieved state-of-the-art results for a plethora of computer vision tasks (Radford et al., 2015; Karras et al., 2017; Brock et al., 2018). However, applying GAN to natural language processing is not straightforward, since the sampling process cannot be described as a differentiable operation in discrete probabilistic models (Huszár, 2015). Several GAN variants have been proposed (Yu et al., 2017); however, to this date, standard MLE models still outperform GAN models in terms of the quality and diversity of the generated text (Caccia et al., 2018).

Unlikelihood training Another approach is to penalize the model for assigning probability mass to a set of negative candidates when calculating the loss; these candidates can be the previous tokens, to avoid repetition (Welleck et al., 2019a), or a list of tokens that appear too often (Li et al., 2020).
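Before turning to unlikelihood training in more detail, here is a minimal PyTorch sketch of the stochastic decoding methods surveyed above (our illustration with arbitrary parameter values, not the thesis's code): temperature scaling, top-k filtering and top-p (nucleus) filtering are applied to a vector of next-token logits before sampling.

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Filter next-token logits with temperature / top-k / top-p, then sample."""
    logits = logits / temperature
    if top_k > 0:
        kth_best = torch.topk(logits, top_k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth_best, float("-inf"))     # keep only the k best tokens
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        probs = F.softmax(sorted_logits, dim=-1)
        cumulative = torch.cumsum(probs, dim=-1)
        to_remove = cumulative - probs > top_p                            # keep the smallest set whose mass exceeds p
        logits[sorted_idx[to_remove]] = float("-inf")
    return torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)

vocab_logits = torch.randn(50257)                  # dummy logits over a GPT-2-sized vocabulary
token_id = sample_next_token(vocab_logits, temperature=0.8, top_k=10)
```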
In sequence-level unlikelihood training, the model is given a list of prefixes to generate text from, then gets penalized for repeated n-grams in its own generations, to account for the distribution mismatch between training sequences and generated sequences (Welleck et al., 2019a). To address inconsistency issues, Li et al. (2020) use unlikelihood training on existing natural language inference datasets to penalize contradicting sentence pairs, thus pushing down the probabilities of contradicting utterances.

There are several aspects that we want to take a closer look at when evaluating language models on open-ended generation, namely quality, diversity and consistency. In this section, we study which metrics have been proposed in the literature and aim to compare them.

Corpus-BLEU BLEU was originally proposed to evaluate models on machine translation by comparing the similarity between a machine-generated translation and human references (Papineni et al., 2002). In open-ended text generation, since we want our models to produce natural and human-like text, BLEU seems like an intuitive metric to use. Yu et al. (2017) proposed the use of the BLEU score to judge the quality of machine-generated text by comparing it to a large corpus of human text, which is now referred to in many studies as Corpus-BLEU (Caccia et al., 2018; Nadeem et al., 2020). Formally, Corpus-BLEU returns the mean BLEU score of every sample from the set of machine-generated text S_gen against the whole human reference set S_ref:

$$\text{Corpus-BLEU}(S_{gen}, S_{ref}) = \frac{1}{|S_{gen}|} \sum_{s \in S_{gen}} \text{BLEU}(s, S_{ref})$$

Note that a higher Corpus-BLEU score implies better generation quality, since the generated text has more n-gram overlap with the human reference data. The downside of this evaluation metric is its quadratic runtime complexity: for each sample we need to calculate a BLEU score between that sample and the whole reference corpus.

Forward perplexity Because natural, high quality and grammatically correct sentences tend to have higher probabilities than gibberish, we can use the likelihood of a sentence as a proxy for its quality. Prior work proposes training an RNN language model on real text data and using it to compute the perplexity of a model's samples, which is referred to as forward perplexity:

$$\text{Forward ppl} = \text{ppl}_{S_{ref}}(S_{gen})$$

The RNN language model is used here to estimate the true distribution of the entire language. This metric helps to measure the fluency of machine-generated text, which is, in essence, its quality (Cífka et al., 2018). To remove the need for training an RNN language model on human data, one can leverage available pre-trained language models that have been trained on massive datasets, such as GPT-2.

Acceptability Another way to think about the quality of machine-generated text is its acceptability - how natural it feels to native speakers of the language. Acceptability can be influenced by context: sentences that sound strange when standing alone can appear natural in specific contexts, while those which seem perfectly fine by themselves may sound odd when surrounded by other sentences (Lau et al., 2020; Bizzoni and Lappin, 2019; Bernardy et al., 2018). This is crucial in open-ended text generation, since models are usually conditioned on specific contexts before being asked to generate continuations. Lau et al. (2020) propose the use of pre-trained language models to calculate sentence probability as a proxy for its acceptability within a given context.
According to the authors' findings, using a BERT model (Devlin et al., 2019) with PenLP (Vaswani et al., 2017) to normalize the sentence probability produces acceptability scores that match human intuition.

Self-BLEU One way to think about diversity is how different the generated samples in a collection are from each other. Using the same intuition as Corpus-BLEU, Zhu et al. (2018) introduce the Self-BLEU score to assess the similarity between every document and the rest of the generated collection. Formally, Self-BLEU returns the mean BLEU score of every sample from the set of machine-generated text S_gen against every other sample in the same set:

$$\text{Self-BLEU}(S_{gen}) = \frac{1}{|S_{gen}|} \sum_{s \in S_{gen}} \text{BLEU}(s, S_{gen} \setminus \{s\})$$

A lower Self-BLEU score implies a higher diversity of the collection, as the documents in the collection are more different from each other. Similar to Corpus-BLEU, this metric suffers from its quadratic runtime complexity, which makes it intractable for large collections of documents.

Reverse perplexity/Cross entropy Inspired by a similar metric in image generation, one can train an RNN language model on the generated samples of a model and calculate its perplexity on human data, which is referred to as reverse perplexity:

$$\text{Reverse ppl} = \text{ppl}_{S_{gen}}(S_{ref})$$

Here, the RNN language model approximates the distribution of the generated collection. Similar to how forward perplexity judges the quality of the generated collection using human data, reverse perplexity measures the quality of human data based on the generated collection. If the generated collection is diverse enough to represent different writing styles or topics, the RNN language model should perceive human text as natural and fluent, and therefore give low perplexity to human data.

Sequence repetition While humans rarely repeat themselves when writing, machine-generated text is often found to be repetitive, especially when produced by deterministic decoding methods (Holtzman et al., 2019; Welleck et al., 2019a; Shao et al., 2017; Fan et al., 2018). While not being diverse does not necessarily mean being repetitive, being repetitive prevents the model from generating diverse continuations. Therefore, using repetition as a metric can give us an idea of how diverse a document collection is. Welleck et al. (2019a) use a metric called seq-rep-n to measure sequence repetition by calculating the portion of duplicate n-grams in a generated sequence S:

$$\text{seq-rep-n} = 1.0 - \frac{|\text{unique } n\text{-grams}(S)|}{|n\text{-grams}(S)|}$$

Selection accuracy Finally, to evaluate consistency, Li et al. (2020) present the model with pairs of candidate sentences and measure whether it prefers the consistent one, asking it to demonstrate its commonsense/logical reasoning ability. This is similar to the task of natural language inference (NLI), in which the model is given pairs of sentences and must decide whether the relationship between them is neutral, entailment or contradiction (Fyodorov et al., 2000; Condoravdi et al., 2003; Bos and Markert, 2005).

Chapter 3 How to Evaluate Language Models in Open-ended Text Generation?

It is clear that using only traditional metrics such as perplexity is not enough to evaluate language models on open-ended text generation. Instead, we need to look at their performance along all three dimensions: quality, diversity and consistency of the generated text. In this chapter, we first present our experiment to compare different evaluation metrics for each of the dimensions when evaluating language models on open-ended text generation. We then decide on the best metric to use for each dimension, and use those metrics to assess the two common techniques for addressing neural text degeneration: stochastic decoding methods and tweaking the training objective.
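Since the experiments in this chapter rely on the metrics defined above, here is a small sketch of how Corpus-BLEU, Self-BLEU and seq-rep-n could be computed with NLTK's sentence_bleu (our illustration; the toy sentences are made up, the smoothing choice is an assumption, and a real evaluation would use far larger sample sets).

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1  # avoids zero scores on very short toy sentences

def corpus_bleu_metric(generated, references):
    """Mean BLEU of each generated sample against the whole human reference set."""
    return sum(sentence_bleu(references, s, smoothing_function=smooth) for s in generated) / len(generated)

def self_bleu_metric(generated):
    """Mean BLEU of each sample against every other sample in the generated set."""
    scores = [sentence_bleu(generated[:i] + generated[i + 1:], s, smoothing_function=smooth)
              for i, s in enumerate(generated)]
    return sum(scores) / len(scores)

def seq_rep_n(tokens, n=4):
    """Portion of duplicate n-grams in a single generated sequence."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)

generated = [s.split() for s in ["the cat sat on the mat", "a dog ran in the park"]]
references = [s.split() for s in ["the cat lay on the rug", "the dog played in the garden"]]
print(corpus_bleu_metric(generated, references),
      self_bleu_metric(generated),
      seq_rep_n("the cat sat on the mat the cat sat".split()))
```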
In this thesis, we use unlikelihood training (Welleck et al., 2019a) as a representative of the training-objective strategy, and we focus on story generation, an instance of open-ended generation. For this reason, we decide to use the Harry Potter series by J.K. Rowling as our training corpus. We use a version of the whole series from https://github.com/joycex99/hp-wordmodel, which has been stripped of page numbers and headings.

The main goal of unlikelihood training is to push down the probability mass of negative candidates. Given a sequence of tokens (x_1, ..., x_T) and a set of negative candidates C^t at each time step t, the unlikelihood loss penalizes the probability the model assigns to each candidate:

$$\mathcal{L}_{UL}^{t} = -\sum_{c \in C^{t}} \log\left(1 - p_\theta(c \mid x_{<t})\right)$$

The unlikelihood loss can be used alongside the usual cross-entropy loss when training/fine-tuning the language model, and it can be applied at two levels: token-level and sequence-level.

Token-level loss For the token-level loss, the list of negative candidates at each time step consists of all the tokens from previous time steps, so the model learns to avoid repeating tokens that it has seen before (Welleck et al., 2019a). The token-level loss is restricted to negative candidates selected from the training distribution. To account for the difference between training and decoding conditions, Welleck et al. (2019a) proposed a sequence-level loss, where we ask the model to generate continuations given some contexts, then take the repeated n-grams from those continuations as negative candidates.

We use the pre-trained GPT-2 Small from HuggingFace as our base model. We fine-tune it on the Harry Potter books dataset using two different training methods: maximum likelihood estimation (MLE) training and unlikelihood (UL) training (Welleck et al., 2019a). With unlikelihood training, we follow what the authors have suggested: with probability 0.5 use the sequence-level loss, otherwise use the token-level loss. To compute the sequence-level loss (with n-grams of size 4), we use the prefix of size 50 of each training sequence in the current training batch and greedily decode continuations of length 100. All models are trained for 4 epochs using the Adam optimizer with a batch size of 12 and a learning rate of 0.001. To find out how sensitive unlikelihood training is to the number of training epochs, we also train another model using unlikelihood training for 1 epoch only. We are also interested in what effect larger models have on text generation. Whenever possible, we repeat the experiments with a GPT-2 Medium base model using a similar training regime.

In this section, we want to investigate how different the token distributions of human text are from those of machine-generated text. Our hypothesis is that if a model can produce text whose token distribution is similar to that of human text, the text would read as natural and human-like. Below is a paragraph that we have taken from the Harry Potter series:

Harry, who was shaking all over, thought for a moment that Dumbledore might not be able to climb into the boat; he staggered a little as he attempted it; all his efforts seemed to be going into maintaining the ring of protective flame around them. Harry seized him and helped him back to his seat. Once they were both safely jammed inside again, the boat began to move back across the black water, away from the rock, still encircled by that ring of fire, and it seemed that the Inferi swarming below them did not dare resurface. "Sir," panted Harry, "sir, I forgot - about fire - they were coming at me and I panicked -" "Quite understandable," murmured Dumbledore.
Harry was alarmed to hear how faint his voice was. They reached the bank with a little bump and Harry leapt out, then turned quickly to help Dumbledore. The moment that Dumbledore reached the bank he let his wand hand fall

We then calculate the token distribution of each paragraph using the same model that generated it. For continuations that are generated using stochastic decoding, we adjust the distribution to match the behavior of that decoding method: at each time step we only consider the top 90% of the tokens for top-p decoding and the top 10 tokens for top-k decoding, then rescale the probabilities so that they sum to 1. The plots of the distributions are given in Figure 3.2. The MLE GPT-2 with greedy decoding (Figure 3.2a) exhibits a distinct phenomenon; we also observe that unlikelihood training can be rather difficult to train, as it is quite sensitive to the number of training epochs.

In this section, our main objectives are (i) finding the best metrics to evaluate the quality and diversity of machine-generated text and (ii) comparing the models based on their quality/diversity trade-off. We use all of the evaluation metrics discussed in Chapter 2. For Corpus-BLEU and Self-BLEU, we only select 500 continuations from each sample set to calculate the scores, due to their runtime complexity. For reverse perplexity, instead of an RNN language model as the authors have suggested, we use a GPT-2 Small model to speed up the training process. We fine-tune it on the 6,220 generated samples of each model/parameter pair, then calculate its perplexity on our Harry Potter corpus test set. For forward perplexity, we use an off-the-shelf GPT-2 Small model to calculate the perplexity of the generated sample set of each model/parameter pair.

We argue that Corpus-BLEU is the best metric to evaluate the quality of generated text, since it behaves more like human judgement and is able to detect mode collapse. seq-rep-4 is able to detect highly repetitive samples; however, it is harder to see how the models' performance differs, since the differences are marginal in magnitude. As for complexity, using reverse perplexity can be much slower compared to the other metrics, as we have to train a language model on the generated samples. Therefore, we choose to use Self-BLEU for the rest of the experiments.

Using Corpus-BLEU and Self-BLEU as the two evaluation metrics for quality and diversity, we plot the quality/diversity trade-off between the models in Figure 3.5 and use a log function to fit the data points. Note that we use negative Corpus-BLEU instead of Corpus-BLEU, so that lower is better for both metrics. Looking at the graph, it is unclear which model is best in the quality-diversity trade-off space, since they all lie on similar diagonal lines. The trade-off between quality and diversity is clear: increasing the degree of randomness (by decreasing p in top-p decoding or increasing k in top-k decoding) improves diversity but worsens quality, and vice versa. Even though the samples generated by MLE GPT-2 with greedy decoding are much more repetitive (with a seq-rep-4 of 0.584 and a Self-BLEU of 0.65), UL GPT-2 (1 epoch) with greedy decoding appears much worse in terms of diversity, with a Self-BLEU score of 0.78 and a seq-rep-4 of 0.004 (Table 3.1 and Figure 3.5). Taking a closer look at the generated samples, we found that no matter what context was given to the model, UL GPT-2 still produces texts that are rather similar. Below are several examples:

"But the task's not till tonight!"
said Harry, accidentally spilling scrambled eggs down his front, afraid he had mistaken the time. "I'm aware of that, Potter," she said. "But I'm not going to tell anyone. I've got to go and find the Ministry of Magic, and I'm not going to be in the Ministry's office for the rest of the year. I've got a lot more to do with the Ministry's Head of Magical Law Enforcement than I could have ever expected." Harry looked around the room, and then back to the Ministry's office. He had not seen the Ministry's office for a long time, but he had to admit

"Well, isn't it obvious?" said Hermione, with a look of maddening superiority. "If you don't want to tell us, don't," snapped Ron. "Fine," said Hermione haughtily. "I'll just tell you what, then. I'll be back in a few minutes." The train was a long way from the station, and Harry was still not sure whether he was going to be able to get back to the castle. He had to wait until the next morning, when he had a chance to ask the old man if he wanted to go back to Hogwarts. "I'll see you at the Ministry," he said, and he walked back to the

"COME BACK, YOU THIEVING -!" "There's no point, Harry." Tonks had appeared out of nowhere, her mousy hair wet with sweat. "You're going to have to get back to Hogwarts, you know that, if you're going to be there, you'll have to be in the Ministry's office. You've got to get your wand back, and I'll be there, and I'll be there for you, Harry Potter, and you'll have a very special place in the school." Harry looked around the room, his eyes wide open, and saw the Professor's face. He was wearing a very

We can see that even though the contexts are different, the continuations always drift to the Ministry of Magic and getting back to Hogwarts. This shows that looking at repetition alone is not enough to judge the diversity of generated text, and that although unlikelihood training might solve the problem of generating repetitive text, it does not help the model produce more diverse text. To see if bigger models are better in the quality/diversity trade-off space, we repeat the experiment using GPT-2 Medium as the base model. Figure 3.6 shows the comparison between Small and Medium, with a log function fitted to the data points. Overall, the Medium models only perform better than the Small models by a narrow margin, as the orange points are slightly closer to the origin of the graph.

As discussed in Section 2.3.3, we use selection accuracy (Li et al., 2020) as a metric to evaluate a model's ability to demonstrate commonsense reasoning. As we are focusing on story generation, we choose to use the MultiNLI dataset (Williams et al., 2018) and the StoryCloze dataset (Mostafazadeh et al., 2016). Following Li et al. (2020), we assume the model will select the sentence with the lower perplexity, and we use that to calculate its selection accuracy. Note that this is an unsupervised setup, i.e. we do not fine-tune our models on the MultiNLI dataset; we only use it to compute the perplexity scores. For StoryCloze, we use its development set, which consists of 1,570 stories. We concatenate the first four sentences of each story as the context, and ask the model to select between two different endings. As above, we calculate the perplexity of the two endings given the context to determine the model's selection. Results are given in Table 3.2. It is clear that fine-tuning a model on the Harry Potter corpus decreases its selection accuracy on both datasets; however, the drop is worse on MultiNLI than on StoryCloze.
One possible explanation is that StoryCloze provides a much longer context, which may help the model find the correct ending regardless of the fine-tuning process. Overall, UL models perform worse than their MLE counterparts on both datasets. On the MultiNLI dataset, fine-tuning both the GPT-2 Small and Medium models using unlikelihood training for 4 epochs decreases their selection accuracy by around 10%, while the decrease is only around 4% for maximum likelihood estimation training. This suggests that unlikelihood training might have a detrimental effect on a model's logical reasoning ability. Among all of our trained small models, MLE GPT-2 achieves the best result, with a selection accuracy of 0.61 on MultiNLI and 0.59 on StoryCloze. However, this is just slightly better than chance, which might explain why the text produced by the model in Section 3.3 is not consistent. In all cases, the models based on GPT-2 Medium perform considerably better than those based on GPT-2 Small. This suggests larger models can be superior to smaller models at language understanding.

To see whether the domain of the fine-tuning corpus matters, we repeat the experiment with models trained on the WikiText-2 dataset; the final results are given in Table 3.3. Overall, the new models trained on WikiText-2 performed better than the old ones in selection accuracy on both datasets. However, the trend is still the same: in the MultiNLI selection task, unlikelihood training leads to a lower selection accuracy than the usual maximum likelihood estimation training. This suggests that the drop in selection accuracy when using unlikelihood training has little to do with domain difference. An interesting thing to note is that when trained on the WikiText-2 dataset, using more epochs of unlikelihood training actually leads to a slight increase in selection accuracy on both tasks. This further suggests that unlikelihood training is sensitive to the number of training epochs and rather difficult to get right.

Chapter 4 Multi-task Learning

In machine learning, we typically build a model to solve a single well-defined task. We do this by first selecting a metric to measure the model's performance on that task, then training the model with the help of a loss function to optimize for that metric. Coming back to the example where we would like to build a machine learning model to categorize images of dogs and cats, the metric here can be prediction accuracy, and we can train this model using a binary classification loss to optimize its parameters. However, instead of focusing on that one single task, we can train our machine learning models to solve different tasks simultaneously while sharing the model's parameters across all tasks. This strategy is known as multi-task learning (Caruana, 1997), and it has been shown to help machine learning models generalize better on their original task, since they can learn valuable information from a variety of tasks. Even though multi-task learning has been applied to language models since an early stage of the NLP field (Collobert and Weston, 2008), it has grown even more popular with the introduction of BERT (Devlin et al., 2019). BERT, or Bidirectional Encoder Representations from Transformers, is a Transformer-based language model, with the largest variant containing around 340 million parameters.
As opposed to GPT-2 - a unidirectional language model, where at every time step the model can only attend to previous words to generate the hidden state for the current word - BERT is bidirectional, meaning that each word can access both its left (previous) and right (following) context. This bidirectional architecture has been shown to produce a much deeper representation of the language context, which allows BERT to obtain new state-of-the-art results on a variety of natural language processing tasks, including question answering and natural language inference (Devlin et al., 2019). BERT obtains its powerful bidirectional representation of words by using two novel training objectives: masked language modeling and next sentence prediction (Figure 4.1). With the masked language modeling objective, the authors randomly mask 15% of the words in a training sequence and ask BERT to correctly identify the masked words (Figure 4.1a). With the next sentence prediction objective, the authors first extract pairs of connected sentences from the training corpus, then randomly sample pairs of unrelated sentences. They then give these pairs of sentences to BERT and ask it to predict whether the first sentence is followed by the second sentence in each pair (Figure 4.1b). One downside of BERT is that, since each word has access to both its left and right context, BERT loses the ability of a causal language model to generate continuations, where the model is only allowed to condition on the context it has seen so far. However, given the success of BERT with multi-task learning, we believe multi-task learning has the potential to bring causal language models to another level of natural language understanding. In this chapter, we explore different strategies for applying multi-task learning to the training of causal language models.

In this experiment, we use the same Harry Potter series from the previous chapter as our training corpus. For each auxiliary objective, we train a new model by fine-tuning the GPT-2 model on the Harry Potter dataset using the original next word prediction (MLE) task together with that auxiliary training objective. We end up with 5 different models, which we call NSP GPT-2 (Next Sentence Prediction), SOP GPT-2 (Sentence Order Prediction), TFIDF GPT-2 (Term Frequency - Inverse Document Frequency), POS GPT-2 (Part-of-speech Tags) and DP GPT-2 (Dependency Parsing). All models are trained for 4 epochs using the Adam optimizer with a batch size of 5-20 (depending on the memory demand of the training objective) and a learning rate of 0.001. In each training batch, we add the losses from all objectives together for back-propagation. The implementation details of each auxiliary objective are given below. To obtain the sentence data for these objectives, we use the nltk sentence segmenter to split the Harry Potter dataset into sentences. This results in 92,658 unique sentences for our experiment.

Next Sentence Prediction (NSP) Inspired by BERT (Devlin et al., 2019), in this task we present our model with two pairs of sentences - one consisting of two connected sentences (the positive pair), the other containing two unrelated sentences (the negative pair) - and ask the model which one it prefers. The preference is calculated using the perplexity score (ppl) of each sentence pair.
We use a margin ranking loss function to incentivize the model to give lower perplexity to the positive sentence pair:

L_MR(pos, neg) = max(0, ppl(pos) − ppl(neg) + margin)

Using a number of starting sentences, we obtain the positive training set by simply taking the sentence that follows each starting sentence, and the negative training set by randomly sampling a sentence from the training corpus. To make training easier, we only select 6,220 pairs from each set, which matches the number of training sequences for the original next word prediction task. Note that this is done when constructing the training data, so the same positive and negative sets are used for every training epoch.

Sentence Order Prediction (SOP)

Inspired by StructBERT (Wang et al., 2019), in this task we give our model a pair of sentences and ask it to predict whether the two sentences are in the correct order. The implementation is similar to the NSP task, except that we obtain the negative training set by simply switching the order of the two sentences in the positive training set.

TF-IDF

In this task, we ask our model to predict the TF-IDF score of each token in the training dataset. To obtain the true TF-IDF scores as labels for training, we first split our training corpus into 243 documents of 6,400 tokens each, and then calculate the TF-IDF scores based on these documents. To produce a TF-IDF score, we add a linear layer on top of the original GPT-2 model, and use a smooth L1 loss function for the regression task.

Part-of-speech Tags (POS)

In this task, we ask our model to predict the correct part-of-speech tag for each token in the training dataset. We obtain weak labels for training using the spacy part-of-speech tagger 2. To make the spacy tokenizer compatible with the GPT-2 tokenizer, we use a similar approach to Devlin et al. (2019) to align the two tokenizations: the first GPT-2 sub-token of each spacy token keeps that token's tag, while the remaining sub-tokens are assigned a special label of X (meaning Other). An example is given in Table 4.1.

Table 4.1: Example of aligning labels with the GPT-2 tokenization.
spacy tokenizer: Jane | Doe | is | a | musketeer
GPT-2 tokenizer: Jane | Do | #e | is | a | musket | #eer
POS tags:        I-PER | I-PER | X | O | O | O | X

We add a classification head on top of the original GPT-2 model, which uses the last hidden state of each token to predict its POS tag, and we use a cross-entropy loss function for the classification task.

Dependency Parsing (DP)

In this task, we ask our model to correctly identify the grammatical relationship between certain pairs of tokens in the training dataset (an illustrative example of such dependency relations can be found in Jurafsky and Martin, 2009). Similar to the POS task, we extract all dependency pairs in the training sequences using the spacy dependency parser 3, then use the same alignment strategy and cross-entropy loss function for the classification task. Note that we do not perform this classification task for every pair of tokens, but only for the pairs obtained from the spacy dependency parser.

The selection accuracy scores on the MultiNLI and StoryCloze datasets for each model are given in Table 4.2. Overall, all models perform either similarly to or better than the MLE model, which has been trained using only the original next word prediction task. Among the five auxiliary training objectives in this experiment, the sentence order prediction task proves to be the most effective one: the SOP model achieves the best score of every model, coming close to the off-the-shelf GPT-2 selection accuracy even though it has been fine-tuned on a domain-specific dataset like the Harry Potter series.
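To make the token-level auxiliary objectives above more concrete, here is a minimal sketch of how a classification head can be placed on top of GPT-2's last hidden states and its loss added to the language-modelling loss. The tag set size, the dummy labels and the HuggingFace transformers calls are illustrative assumptions rather than our exact implementation; in practice the labels come from the aligned spacy annotations described above.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

NUM_TAGS = 18  # illustrative size for a coarse POS tag set

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
tag_head = nn.Linear(model.config.n_embd, NUM_TAGS)   # auxiliary classification head
tag_loss_fn = nn.CrossEntropyLoss(ignore_index=-100)  # -100 can mark the special label X

enc = tokenizer("Harry picked up his wand.", return_tensors="pt")
input_ids = enc.input_ids
# In the real setup these come from the spacy tagger aligned to the GPT-2
# sub-tokens; here we use dummy labels of the right shape.
tag_labels = torch.zeros_like(input_ids)

outputs = model(input_ids, labels=input_ids, output_hidden_states=True)
lm_loss = outputs.loss                   # original next word prediction loss
last_hidden = outputs.hidden_states[-1]  # (batch, seq_len, n_embd)
tag_logits = tag_head(last_hidden)       # (batch, seq_len, NUM_TAGS)
tag_loss = tag_loss_fn(tag_logits.view(-1, NUM_TAGS), tag_labels.view(-1))

total_loss = lm_loss + tag_loss          # losses are added before back-propagation
total_loss.backward()
```

A regression objective such as TF-IDF would follow the same pattern, with a single-output linear layer and a smooth L1 loss in place of the cross-entropy.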
We also experiment with another model in which we combine the two most effective tasks according to the selection accuracy scores given in Table 4.2 (TF-IDF and SOP); however, this does not further improve the model's performance. To our disappointment, incorporating syntactic features like part-of-speech tags and word relations does not help the model with the commonsense reasoning task, as the POS and DP models achieve no improvement in selection accuracy over the MLE model. This suggests that the GPT-2 model has already learned the necessary syntactic information from its original next word prediction task, and it does not benefit from the help of these weakly supervised signals. Using Corpus-BLEU and Self-BLEU as the two evaluation metrics for quality and diversity, we plot the quality/diversity trade-off between the models in Figure 4.3 and use a log function to fit the data points. As in the experiment in Chapter 3, we use Negative Corpus-BLEU instead of Corpus-BLEU, so that lower is better for both metrics. Even though NSP, SOP and TF-IDF + SOP achieve the highest selection accuracy scores in the commonsense reasoning task, they do slightly worse in the quality-diversity trade-off space according to the fitted lines in Figure 4.3, which suggests there might be a further trade-off between consistency and quality-diversity. For the other models, it is hard to decide which one is best in the quality-diversity trade-off space, since they all perform similarly. Curiously, even though the DP curve starts out very similar to MLE's, as quality drops it produces the best diversity score (lower right corner, Figure 4.3). We also give the sequence repetition scores for each model in Table 4.3. Overall, the scores are consistent across all models without any outliers, which suggests that adding these auxiliary training objectives does not affect the amount of n-gram repetition in the model's generations. Overall, adding these auxiliary objectives seems to help improve logical consistency, while not harming the model in terms of the quality-diversity trade-off or sequence repetition. This suggests that there may be an incentive to include such auxiliary objectives when training or fine-tuning a language model.

Table 4.3: Sequence repetition scores between different learning objectives and decoding methods. *Human text sequence repetition is computed using the training dataset.

In this work, we have provided the necessary background to understand language models and the problems they face in the open-ended text generation task, namely (i) neural language degeneration and (ii) the lack of consistent evaluation metrics to measure their performance. It is important that we evaluate language models along all dimensions of open-ended text generation - quality, diversity and consistency. We have conducted an experiment to search for the best metric to use when evaluating language models on the open-ended text generation task. We have proposed an evaluation pipeline using these metrics and applied it to compare the performance of different stochastic decoding methods and unlikelihood training. Finally, we have carried out experiments with multi-task learning to see if it can help language models get better at open-ended text generation. Using the evaluation pipeline above, we evaluated the efficacy of different auxiliary training objectives in the open-ended text generation task.
A practical pipeline to evaluate language models on the open-ended text generation task

When evaluating language models on the open-ended text generation task, we have found that Corpus-BLEU is the best metric to evaluate the quality of generated text due to its agreement with human judgement. As for diversity, Self-BLEU appears to be the best metric to use thanks to its simplicity. To evaluate the consistency of the generated text, selection accuracy on the MultiNLI dataset is good enough for most cases; for specific tasks such as story generation, other datasets can be considered (e.g. StoryCloze).

A direct comparison between unlikelihood training and stochastic decoding methods

To the best of our knowledge, there has been no work that quantitatively compares unlikelihood training with stochastic decoding methods. Using our proposed evaluation pipeline, we found that there was no clear difference between unlikelihood training and maximum likelihood estimate training with stochastic decoding methods in the quality-diversity trade-off space. However, unlikelihood training might have a negative effect on the ability of a language model to truly grasp the language, as it worsens the model's performance on the commonsense reasoning task.

An insight on how multi-task learning can lead to better machine generation

We found that by adding certain auxiliary training objectives alongside the maximum likelihood estimate objective when fine-tuning GPT-2 models, the models achieved a much better score on the commonsense reasoning task while still maintaining their performance in the quality-diversity trade-off space. This suggests that multi-task learning might help a language model to better understand the language, which in turn leads to better generation.

Incorporate human evaluation to verify the correctness of the evaluation metrics

Even though our experiments with evaluation metrics agree with Zhang et al. (2021)'s finding of the likelihood trap, further experiments with the help of human evaluation should be carried out to verify that the metrics work as expected. To the best of our knowledge, human judgement has only been used to evaluate the quality of generation, since it might be hard for one to assess the diversity of machine-generated text (Hashimoto et al., 2019). Therefore, it might be worth investigating how we can leverage human evaluation to confirm the correctness of diversity metrics as well.

Further experiments on different ways to tweak language models' training objectives

In this work, we only selected unlikelihood training (Welleck et al., 2019a) as a representative of the training objective strategy when comparing against stochastic decoding methods. For future work, it would be useful to look at other strategies to tweak language models' training objectives as well, such as scheduled sampling (Bengio et al., 2015), GAN-based training (Yu et al., 2017) and entmax loss (Martins et al., 2020). We also only focused on syntactic supervision objectives; it might be worth carrying out experiments with training objectives that contain more semantics-oriented information, such as word senses or topic classification. Finally, due to limitations in available computing resources, we could only perform our experiments on GPT-2 Small and Medium models. If one has access to larger language models, it might be worth repeating the same experiments to see how much better larger language models get at open-ended text generation.
A learning algorithm for Boltzmann machines
The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)
The Illustrated Transformer
Neural machine translation by jointly learning to align and translate
Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks
A neural probabilistic language model. The Journal of Machine Learning Research
The Influence of Context on Sentence Acceptability Judgements
The Effect of Context on Metaphor Paraphrase Aptness Judgments
Recognising textual entailment with logical inference
Large scale GAN training for high fidelity natural image synthesis
Multitask learning
Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting
An empirical study of smoothing techniques for language modeling
Eval all, trust a few, do wrong to none: Comparing sentence generation models
A unified architecture for natural language processing: Deep neural networks with multitask learning
Entailment, intensionality and text understanding
The PASCAL recognising textual entailment challenge
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. Association for Computational Linguistics
A natural logic inference system
Generative adversarial nets
Differentiable scheduled sampling for credit assignment
Long text generation via adversarial training with leaked information
Unifying human and statistical evaluation for natural language generation
Latent relation language models
Long short-term memory
The curious case of neural text degeneration
Universal Language Model Fine-tuning for Text Classification
How (not) to train your generative model: Scheduled sampling, likelihood, adversary?
Attention is not explanation
Why are Sequence-to-Sequence Models So Dull? Understanding the Low-Diversity Problem of Chatbots
Speech and Language Processing
Progressive growing of gans for improved quality, stability, and variation
CTRL: A conditional transformer language model for controllable generation
Structured attention networks
Improved backing-off for m-gram language modeling
How Furiously Can Colorless Green Ideas Sleep? Sentence Acceptability in Context
Importance of self-attention for sentiment analysis
Don't Say That! Making Inconsistent Dialogue Unlikely with Unlikelihood Training
ROUGE: A Package for Automatic Evaluation of Summaries
Effective Approaches to Attention-based Neural Machine Translation
An extended model of natural logic
Sparse Text Generation
Pointer sentinel mixture models
Efficient estimation of word representations in vector space
Recurrent neural network based language model
Distributed representations of words and phrases and their compositionality
A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories
A Systematic Characterization of Sampling Algorithms for Open-ended Language Generation
Understanding LSTM Networks
BLEU: A Method for Automatic Evaluation of Machine Translation
GloVe: Global Vectors for Word Representation
Deep Contextualized Word Representations
Language Models as Knowledge Bases?
Unsupervised representation learning with deep convolutional generative adversarial networks
Language models are unsupervised multitask learners
A Review of the Neural History of Natural Language Processing
A neural attention model for abstractive sentence summarization
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
Do Massively Pretrained Language Models Make Better Storytellers?
Generating High-Quality and Informative Conversation Responses with Sequence-to-Sequence Models
End-To-End Memory Networks
Sequence to sequence learning with neural networks
Attention is all you need
A survey of word embeddings based on deep learning
StructBERT: Incorporating language structures into pre-training for deep language understanding
Neural text generation with unlikelihood training
Dialogue Natural Language Inference
Dialogue Natural Language Inference
A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference
Word attention for sequence to sequence text understanding
XLNet: Generalized Autoregressive Pretraining for Language Understanding
Seqgan: Sequence generative adversarial nets with policy gradient
Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks
Trading Off Diversity and Quality in Natural Language Generation
Pegasus: Pretraining with extracted gap-sentences for abstractive summarization
Personalizing Dialogue Agents: I have a dog, do you have pets too?
Adversarially regularized autoencoders
Texygen: A benchmarking platform for text generation models
Fine-tuning language models from human preferences

First and foremost, I would like to express my sincere appreciation to my supervisor, Dr. Jey Han Lau, who has constantly given me thoughtful guidance and valuable feedback throughout my thesis. I'm incredibly fortunate to have you as my supervisor - thank you so much for all of the encouragement and support during this global pandemic.