key: cord-026949-nu46ok9w
authors: Varshney, Deeksha; Ekbal, Asif; Nagaraja, Ganesh Prasad; Tiwari, Mrigank; Gopinath, Abhijith Athreya Mysore; Bhattacharyya, Pushpak
title: Natural Language Generation Using Transformer Network in an Open-Domain Setting
date: 2020-05-26
journal: Natural Language Processing and Information Systems
DOI: 10.1007/978-3-030-51310-8_8
doc_id: 26949
cord_uid: nu46ok9w

Prior works on dialog generation focus on the task-oriented setting and utilize multi-turn conversational utterance-response pairs. However, natural language generation (NLG) in the open-domain environment is more challenging. The conversations in an open-domain chit-chat model are mostly single-turn in nature. Current methods used for modeling single-turn conversations often fail to generate contextually relevant responses for a large dataset. In our work, we develop a transformer-based method for NLG in an open-domain setting. Experiments on the utterance-response pairs show improvement over the baselines, both in terms of quantitative measures like BLEU and ROUGE and human evaluation metrics like fluency and adequacy.

Conversational systems are some of the most important advancements in the area of Artificial Intelligence (AI). In conversational AI, dialogue systems can be either an open-domain chit-chat model or a task-specific goal-oriented model. Task-specific systems focus on particular tasks such as flight or hotel booking, providing technical support to users, and answering non-creative queries. These systems try to generate a response by maximizing an expected reward. In contrast, an open-domain dialog system operates in a non-goal-driven casual environment and responds to all kinds of questions. The realization of rewards is not straightforward in these cases, as there are many factors to model.
Aspects such as understanding the dialog context, acknowledging the user's personal preferences, and other external factors such as time, weather, and current events need consideration at each dialog step. In recent times, there has been a trend towards building end-to-end dialog systems such as chat-bots which can easily mimic human conversations. Prior works [19, 22, 25] developed systems using deep neural networks by training them on a large amount of multi-turn conversational data. Virtual assistants in open-domain settings usually utilize single-turn conversations for training the models. Chit-chat bots in such situations can help humans to interact with machines using natural language, thereby allowing humans to express their emotional states. In dialogue systems, generating relevant, diverse, and coherent responses is essential for robustness and practical use. Generative models tend to generate shorter, inappropriate responses to some questions. The responses range from invalid sentences to generic ones like "I don't know". The reasons for these issues include the inefficiency of models in capturing long-range dependencies, the generation of a large number of out-of-vocabulary (OOV) words, and the limitations of the maximum likelihood objective functions used for training these models.

Transformer models have become an essential part of most of the state-of-the-art architectures in several natural language processing (NLP) applications. Results show that these models capture long-range dependencies efficiently, replacing gated recurrent neural network models in many situations. In this paper, we propose an efficient end-to-end architecture based on the transformer network for natural language generation (NLG) in an open-domain dialogue system. The proposed model can maximize contextual relevancy and diversity in generated responses.
Our research reported here contributes in three ways: (i) we build an efficient end-to-end neural architecture for a chit-chat dialogue system, capable of generating contextually consistent and diverse responses; (ii) we create a single-turn conversational dataset with chit-chat type conversations on several topics between a human and a virtual assistant; and (iii) empirical analysis shows that our proposed model can improve the generation process, when trained with enough data, in comparison to traditional retrieval-based and neural translation-based methods.

Conversational Artificial Intelligence (AI) is currently one of the most challenging problems of Artificial Intelligence. Developing dialog systems that can interact with humans logically and can engage them in long-term conversations has captured the attention of many AI researchers. In general, dialog systems are mainly of two types: task-oriented dialog systems and open-domain dialog systems. Task-oriented dialog systems converse with the users to complete a specific task such as assisting customers to book a ticket or shop online. On the other hand, an open-domain dialog system can help users to share information, ask questions, and develop social etiquette through a series of conversations.

Early works in this area were typically rule-based or learning-based methods [12, 13, 17, 28]. Rule-based methods often require human experts to form rules for training the system, whereas learning-based methods learn from a specific algorithm, which makes them less flexible in adapting to other domains. Data from various social media platforms like Twitter, Reddit, and other community question-answering (CQA) platforms have provided us with a large number of human-to-human conversations. Data-driven approaches developed by [6, 16] can be used to handle such problems.
Retrieval-based methods [6] select a suitable response from a predefined set of candidate responses by ranking them in the order of similarity (e.g., by matching the number of common words) against the input sentence. The selection of a response from a set of predefined responses makes them static and repetitive. The work in [16] builds a system based on phrase-based statistical machine translation to exploit single-turn conversations. A deep learning-based method for retrieval-based systems was presented in [30]. A brief review of these methods is presented by [2].

Lately, generation-based models have become quite popular. Several generative models based on neural networks have been presented in [19, 22, 23, 25] for building efficient conversational dialog systems. Moreover, several other techniques, for instance generative adversarial networks (GAN) [10, 29] and conditional variational autoencoders (CVAE) [3, 7, 18, 20, 32, 33], have also been implemented for dialog generation.

Conversations generated by retrieval-based methods are highly fluent, grammatically correct, and of good quality compared to dialogues generated by generative methods. Their high-quality performance is subject to the availability of an extensive repository of human-human interactions. In contrast, responses generated by neural generative models are random in nature and often lack grammatical correctness. Techniques that combine the power of both retrieval-based and generative methods can be adapted in such situations. On the whole, hybrid methods [21, 27, 31, 34] first find some relevant responses using retrieval techniques and then leverage them to generate contextually relevant responses in the next stage.

In this paper, we propose a novel method for building an efficient virtual assistant using single-turn open-domain conversational data. We use a self-attention based transformer model, instead of RNN-based models, to get the representation of our input sequences.
We observe that our method can generate more diverse and relevant responses.

Our goal is to generate contextually relevant responses for single-turn conversations. Given an input utterance $U = u_1, u_2, \ldots, u_n$ composed of $n$ words, we try to generate a target response $Y = y_1, y_2, \ldots, y_m$.

We use pre-trained GloVe [15] embeddings to initialize the word vectors. GloVe utilizes two main methods from the literature to build its vectors: global matrix factorization and local context window methods. The GloVe model is trained on the non-zero entries of a global word-to-word co-occurrence matrix, which records how frequently two words occur together in a given corpus. The embeddings used in our model are trained on the Common Crawl dataset with 840B tokens and a 2.2M-word vocabulary. We use 300-dimensional vectors.

We formulate our task of response generation as a machine translation problem. We define two baseline models based on deep learning techniques to conduct our experiments. First, we build a neural sequence-to-sequence model [23] based on Bi-Directional Long Short-Term Memory (Bi-LSTM) [5] cells. The second model utilizes the attention mechanism [1] to align input and output sequences. We train these models using the GloVe word embeddings as input features.

To build our first baseline, we use a neural encoder-decoder [23] model. The encoder, which contains RNN cells, converts the input sequence into a context vector. The context vector is an abstract representation of the entire input sequence. The context vector forms the input for a second, RNN-based decoder, which learns to output the target sequence one word at a time. Our second baseline uses an attention layer [1] between the encoder and decoder, which helps in deciding which words of the input sequence to focus on in order to predict the next word correctly. The third model, which is our proposed method, is based on the transformer network architecture [24].
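The attention step of the second baseline can be illustrated with a small sketch. It assumes additive (Bahdanau-style) scoring in the spirit of [1], with random stand-in parameters and toy dimensions. The names `additive_attention`, `W_enc`, `W_dec`, and `v` are hypothetical and do not come from the OpenNMT configuration used in the experiments.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(vec, mat):
    """Row vector (length r) times matrix (r x c) -> list of length c."""
    return [sum(vec[i] * mat[i][j] for i in range(len(vec)))
            for j in range(len(mat[0]))]

def additive_attention(enc_states, dec_state, W_enc, W_dec, v):
    """Return (context, weights) for a single decoder step.

    enc_states: n encoder hidden states, each of size d_enc
    dec_state:  current decoder hidden state of size d_dec
    W_enc (d_enc x d_att), W_dec (d_dec x d_att), v (d_att): parameters
    """
    dec_proj = matvec(dec_state, W_dec)
    scores = []
    for h in enc_states:
        enc_proj = matvec(h, W_enc)
        hidden = [math.tanh(a + b) for a, b in zip(enc_proj, dec_proj)]
        scores.append(sum(x * w for x, w in zip(hidden, v)))
    weights = softmax(scores)  # one weight per input word, summing to 1
    d_enc = len(enc_states[0])
    # Context vector: attention-weighted sum of the encoder states.
    context = [sum(w * h[k] for w, h in zip(weights, enc_states))
               for k in range(d_enc)]
    return context, weights

def rand_matrix(rows, cols):
    """Random stand-in for learned parameters (illustration only)."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
n, d_enc, d_dec, d_att = 5, 8, 8, 4  # toy sizes, not the paper's settings
enc_states = rand_matrix(n, d_enc)
dec_state = rand_matrix(1, d_dec)[0]
W_enc, W_dec = rand_matrix(d_enc, d_att), rand_matrix(d_dec, d_att)
v = rand_matrix(1, d_att)[0]
context, weights = additive_attention(enc_states, dec_state, W_enc, W_dec, v)
```

The weights form a probability distribution over the input words, so the decoder receives a different context vector at each step instead of the single fixed context vector of the first baseline.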
We use GloVe word embeddings as input features for our proposed model. We develop the transformer encoder as described in [24] to obtain the representation of the input sequence and the transformer decoder to generate the target response. Figure 1 shows the proposed architecture. The input to the transformer encoder is the sum of the embedding $e(u_n)$ of the current word and the positional encoding $PE(n)$ of the $n$-th word:

$$R_u = [u_1, \ldots, u_n]$$
$$u_n = e(u_n) + PE(n)$$

There are a total of $N_x$ identical layers in the transformer encoder. Each layer contains two sub-layers: a multi-head attention layer and a position-wise feed-forward layer. We encode the input utterances and target responses of our dataset using multi-head self-attention. The second sub-layer performs a linear transformation over the outputs of the first sub-layer. A residual connection is applied to each of the two sub-layers, followed by layer normalization. The following equations represent the layers:

$$M^{1} = \mathrm{MultiHead}(R_u, R_u, R_u)$$
$$F^{1} = \mathrm{FFN}(M^{1})$$

where $M^{1}$ is the hidden state returned by the first layer of multi-head attention and $F^{1}$ is the representation of the input utterance obtained after the first feed-forward layer. The above steps are repeated for the remaining layers:

$$M^{n} = \mathrm{MultiHead}(F^{n-1}, F^{n-1}, F^{n-1})$$
$$F^{n} = \mathrm{FFN}(M^{n})$$

where $n = 1, \ldots, N_x$. We use $c$ to denote the final representation of the input utterance obtained at the $N_x$-th layer:

$$c = F^{N_x}$$

Similarly, for decoding the responses, we use the transformer decoder. There are $N_y$ identical layers in the decoder as well. The encoder and decoder layers are quite similar to each other, except that the decoder layer has two multi-head attention layers, which perform self-attention and encoder-decoder attention, respectively:

$$R_y = [y_1, \ldots, y_m]$$
$$y_m = e(y_m) + PE(m)$$

To predict the next word, we apply a softmax to the decoder output to obtain the word probabilities.

In this section, we present the details of the datasets used in our experiments, along with a detailed overview of the experimental settings.
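For reference, the encoder input from the model description (word embedding plus positional encoding) can be sketched as follows. This is a minimal illustration that assumes the sinusoidal positional encoding of [24]; the text does not state which form of $PE$ is used. The function names and the 4-dimensional `toy_embeddings` are hypothetical stand-ins for the pre-trained vectors.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding of one position (assumed form, following [24])."""
    pe = []
    for i in range(d_model):
        # Dimensions 2k and 2k+1 share the frequency 1 / 10000^(2k / d_model).
        angle = position / (10000 ** ((2 * (i // 2)) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

def encoder_input(tokens, embeddings, d_model):
    """Build R_u: one vector e(u_n) + PE(n) per word of the utterance."""
    rows = []
    # Positions are counted from 1 here to mirror "the n-th word";
    # many implementations count from 0 instead.
    for n, token in enumerate(tokens, start=1):
        e = embeddings[token]
        pe = positional_encoding(n, d_model)
        rows.append([a + b for a, b in zip(e, pe)])
    return rows

# Hypothetical toy embeddings standing in for pre-trained word vectors.
toy_embeddings = {
    "how": [0.1, 0.2, 0.3, 0.4],
    "are": [0.0, -0.1, 0.2, 0.1],
    "you": [0.5, 0.4, -0.3, 0.2],
}
R_u = encoder_input(["how", "are", "you"], toy_embeddings, d_model=4)
```

Because the encoding depends only on the position, the same word receives a different input vector at different positions, which is how the otherwise order-agnostic self-attention layers obtain word-order information.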
Our dataset comprises single-turn conversations from ten different domains: Data About User, Competitors, Emotion, Emergency, Greetings, About Bixby, Entertainment, Sensitive, Device, and Event. Professional annotators with a linguistics background and relevant expertise created this dataset. The total dataset comprises 184,849 utterance-response pairs with an average of 7.31 and 14.44 words per utterance and response, respectively. We first split the data into a train and a test set in a 95:5 ratio. We then use 5% of the training data to prepare the validation set. The dataset details are given in Table 2. Some examples from the dataset are shown in Table 1.

We use two different types of models for our experiments: recurrent and transformer-based sequence-to-sequence generative models. All data loading, model implementations, and evaluation were done using OpenNMT [9] as the code framework.

We train a seq2seq model where the encoder and decoder are parameterized as LSTMs [5]. We also experiment with the seq2seq model with an attention mechanism [1] between the decoder and the encoder outputs. The encoder and decoder LSTMs have 2 layers with 512-dimensional hidden states and a dropout rate of 0.1.

For the transformer model, the number of layers in both the encoder and the decoder is set to 6, with 512-dimensional hidden states and a dropout rate of 0.1. There are 8 multi-head attention heads and 2048 nodes in the feed-forward hidden layers. The dimension of the word embedding is empirically set to 512. We use Adam [8] for optimization. When decoding the responses, the beam size is set to 5.

Automatic Evaluation: We use standard metrics, namely BLEU [14], ROUGE [11], and perplexity, for the automatic evaluation of our models. Perplexity is reported on the generated responses from the validation set. Lower perplexity indicates better performance of the models. BLEU and ROUGE measure the n-gram overlap between a generated response and a gold response.
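The n-gram overlap underlying both metrics can be sketched as below. This is a simplified illustration for a single response pair, not the full BLEU or ROUGE definition: it computes clipped n-gram precision (the core of BLEU) and n-gram recall (the core of ROUGE-N) and omits the brevity penalty, smoothing, and corpus-level aggregation. The function names and the example sentences are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_matches(generated, gold, n):
    """Count n-grams of the generated response that also occur in the gold
    response, clipping each count at its frequency in the gold response."""
    gen, ref = ngrams(generated, n), ngrams(gold, n)
    return sum(min(count, ref[gram]) for gram, count in gen.items())

def ngram_precision(generated, gold, n):
    """BLEU-style: matched n-grams / n-grams in the generated response."""
    total = sum(ngrams(generated, n).values())
    return clipped_matches(generated, gold, n) / total if total else 0.0

def ngram_recall(generated, gold, n):
    """ROUGE-N-style: matched n-grams / n-grams in the gold response."""
    total = sum(ngrams(gold, n).values())
    return clipped_matches(generated, gold, n) / total if total else 0.0

generated = "i am doing fine thank you".split()
gold = "i am fine thank you".split()

p1 = ngram_precision(generated, gold, 1)  # 5 of 6 generated unigrams match
r1 = ngram_recall(generated, gold, 1)     # 5 of 5 gold unigrams are covered
p2 = ngram_precision(generated, gold, 2)  # 3 of 5 generated bigrams match
r2 = ngram_recall(generated, gold, 2)     # 3 of 4 gold bigrams are covered
```

Precision penalizes extra words in the generated response, while recall penalizes gold content that the response misses, which is why the two metrics are usually reported together.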
Higher BLEU and ROUGE scores indicate better performance.

To qualitatively evaluate our models, we perform human evaluation on the generated responses. We sample 200 random responses from our test set for the human evaluation. Given an input utterance, target response, and predicted response triplet, two experts with post-graduate exposure were asked to evaluate the predicted responses on the following two criteria:

1. Fluency: The predicted response is fluent in terms of grammar.
2. Adequacy: The predicted response is contextually relevant to the given utterance.

We measure fluency and adequacy on a 0-2 scale, with '0' indicating an incomplete or incorrect response, '1' indicating an acceptable response, and '2' indicating a perfect response. To measure the inter-annotator agreement, we compute the Fleiss kappa [4] score. We obtained a kappa score of 0.99 for fluency and a score of 0.98 for adequacy, denoting "good agreement".

In this section we report the results for all our experiments. The first two experiments (seq2seq & seq2seq attn) are conducted with our baseline models. Our third experiment (cf. Fig. 1) is carried out on our proposed model using word embeddings as the input sequences. Table 3 and Table 4 show the automatic and manual evaluation results for both the baseline and the proposed model. Our proposed model has lower perplexity and higher BLEU and ROUGE scores than the baselines. The improvement in each model is statistically significant compared to the other models. For all the evaluation metrics, seq2seq attn has the highest score among the baselines, and our model outperforms those scores by a decent margin. For Adequacy, we find that our seq2seq model achieves the highest score of 73.70 among the baseline models. Our proposed model outperforms the baselines with a score of 81.75. For Fluency, we observe that the responses generated by all the models are quite fluent in general.
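The inter-annotator agreement reported for the human evaluation is the Fleiss kappa [4]. A minimal sketch of its computation is given below. The rating counts are invented toy data for two raters and the three scale values 0, 1, and 2, since the annotation data itself is not part of this paper; the function name `fleiss_kappa` is hypothetical.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table where ratings[i][j] is the number of raters
    who assigned item i to category j (same number of raters per item)."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])

    # Observed agreement: fraction of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall share of each category.
    shares = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(s * s for s in shares)

    if p_expected == 1.0:
        # Every rating falls in one category, so kappa is undefined;
        # returning 1.0 is a convention chosen for this sketch only.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)

# Toy data: 4 responses, 2 raters, columns are the scores 0, 1, and 2.
toy_ratings = [
    [0, 0, 2],  # both raters gave 2
    [0, 2, 0],  # both raters gave 1
    [2, 0, 0],  # both raters gave 0
    [0, 1, 1],  # raters disagreed (1 vs. 2)
]
kappa = fleiss_kappa(toy_ratings)  # (0.75 - 0.34375) / (1 - 0.34375) = 13/21
```

A value of 1 means the raters agree on every item, while a value near 0 means agreement is no better than expected by chance given how often each score is used.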
To examine our results in more detail, we perform an error analysis on the predicted responses. An example is given in Table 5. As seen in the example, the predicted response would not be the best-fit reply to the utterance "You are online", as the response falls out of context for the given utterance.

In this paper, we propose an effective model for response generation using single-turn conversations. Firstly, we created a large single-turn conversational dataset, and then built a transformer-based framework to model the short-turn conversations effectively. Empirical evaluation, in terms of both automatic and human-based metrics, shows encouraging performance. In qualitative and quantitative analyses of the generated responses, we observed the predicted responses to be highly relevant in terms of context, but also observed some errors, as discussed in our results and analysis section. Overall, we observed that our proposed model attains improved performance when compared with the baseline results. In the future, apart from improving the architectural designs and training methodologies, we look forward to evaluating our models on a much larger dataset of single-turn conversations.
[1] Neural machine translation by jointly learning to align and translate
[2] Deep retrieval-based dialogue systems: a short review
[3] Variational autoregressive decoder for neural response generation
[4] Measuring nominal scale agreement among many raters
[5] Long short-term memory
[6] An information retrieval approach to short text conversation
[7] Generating informative responses with controlled sentence function
[8] Adam: a method for stochastic optimization
[9] OpenNMT: open-source toolkit for neural machine translation
[10] Adversarial learning for neural dialogue generation
[11] ROUGE: a package for automatic evaluation of summaries
[12] NJFun: a reinforcement learning spoken dialogue system
[13] Reinforcement learning of question-answering dialogue policies for virtual museum guides
[14] BLEU: a method for automatic evaluation of machine translation
[15] GloVe: global vectors for word representation
[16] Data-driven response generation in social media
[17] A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies
[18] A hierarchical latent variable encoder-decoder model for generating dialogues
[19] Neural responding machine for short-text conversation
[20] Improving variational encoder-decoders in dialogue generation
[21] An ensemble of retrieval-based and generation-based human-computer conversation systems
[22] A neural network approach to context-sensitive generation of conversational responses
[23] Sequence to sequence learning with neural networks
[24] Attention is all you need
[25] A neural conversational model
[26] The generalization of "Student's" problem when several different population variances are involved
[27] Retrieve and refine: improved sequence generation models for dialogue
[28] Partially observable Markov decision processes for spoken dialog systems
[29] Diversity-promoting GAN: a cross-entropy based generative adversarial network for diversified text generation
[30] Learning to respond with deep neural networks for retrieval-based human-computer conversation system
[31] A hybrid retrieval-generation neural conversation model
[32] Unsupervised discrete sentence representation learning for interpretable neural dialog generation
[33] Learning discourse-level diversity for neural dialog models using conditional variational autoencoders
[34] The design and implementation of XiaoIce, an empathetic social chatbot

Acknowledgement. The research reported in this paper is an outcome of the project "Dynamic Natural Language Response to Task-Oriented User Utterances", supported by Samsung Research India, Bangalore.