title: Lexical Knowledge Internalization for Neural Dialog Generation
authors: Wu, Zhiyong; Bi, Wei; Li, Xiang; Kong, Lingpeng; Kao, Ben
date: 2022-05-04

We propose knowledge internalization (KI), which aims to complement neural dialog models with lexical knowledge. Instead of further conditioning knowledge-grounded dialog (KGD) models on externally retrieved knowledge, we seek to integrate knowledge about each input token internally into the model's parameters. To tackle the challenge posed by the large scale of lexical knowledge, we adopt a contrastive learning approach and create an effective token-level lexical knowledge retriever that requires only weak supervision mined from Wikipedia. We demonstrate the effectiveness and general applicability of our approach on various datasets and diversified model structures.

Vacuous responses (Li et al., 2016; Ghazvininejad et al., 2018), such as "I don't know", are commonly observed in end-to-end neural dialog models (Shang et al., 2015; Sordoni et al., 2015). This is mostly because these models ignore the knowledge that resides in people's minds during a conversation. To bridge the gap, many existing works (Moghe et al., 2018; Dinan et al., 2018) have attempted to condition the dialog model on external knowledge, either a sentence or a paragraph, retrieved based on the utterance and/or previous context. This has led to curated datasets of utterance-response-knowledge triples (see Fig 1(a)).

[Figure 1: (a) An exemplary KGD data sample with an utterance (top), a response (bottom), and sentence-level knowledge (middle). (b) A list of lexical knowledge (in grey rectangles) related to words from the utterance in (a), and the potential responses (in white speech balloons) people would make given that knowledge.]

These knowledge-grounded dialog (KGD) models, despite their demonstrated effectiveness, suffer from two major problems. First, equipping models with sentence-level knowledge alone limits the informativeness and diversity of responses. As shown in Fig 1(a), with the knowledge retrieved given the utterance, a KGD model can relate J.K. Rowling to Khalsa Aid. However, retrieval based solely on sentence embeddings ignores the lexical knowledge associated with individual tokens. In this example, the knowledge about J.K. Rowling, COVID-19, donates, and India is ignored during retrieval, due to the semantic gaps between those lexical knowledge sentences (see Fig 1(b)) and the utterance. This makes it rather difficult (if not impossible) for the model to generate responses carrying relevant information, as shown in Fig 1(b).

Second, retrieving knowledge for open-domain dialogs during inference incurs heavy computation, often involving similarity search over tens of millions of passages (Petroni et al., 2021). Existing systems (Zhao et al., 2020; Zheng et al., 2020) alleviate this problem by pre-selecting a small candidate set based on TF-IDF (Schütze et al., 2008), sacrificing the diversity and accuracy of the retriever. Because they condition the dialog model directly on the retrieved text, these models are easily affected by the quality of the constructed candidate set and are thus prone to errors (Dinan et al., 2018; Kim et al., 2020; Zhao et al., 2020).
In this work, we propose to complement neural dialog models with lexical knowledge via Knowledge Internalization (KI), a training approach based on contrastive learning (Hadsell et al., 2006). The central idea of KI is to integrate fine-grained lexical knowledge about each input token internally into model parameters (e.g., word embeddings), rather than further conditioning the model on externally retrieved knowledge (e.g., directly copying and/or modifying tokens from external knowledge when decoding). Our research contributions include:

• a novel training objective (KI; §3.2) that infuses lexical semantics into word representations. With the knowledge internalized into the contextualized representation of every token, a dialog model can generate informative and diverse responses without engaging an external knowledge retrieval module at inference time, thus making inference more efficient (§6.1);

• an effective token-level lexical knowledge retriever (§4) trained with weak supervision to contextually align tokens in dialog corpora with their related, and possibly different, knowledge (Appendix C);

• a demonstration of the effectiveness and general applicability of KI with extensive experiments on diversified dialog models and on three benchmark datasets: DailyDialog (Li et al., 2017), Wizard of Wikipedia (Dinan et al., 2018), and the Commonsense Reddit Dataset. The implementation of our model can be found at https://github.com/LividWo/KI.

To address the vacuous response problem in neural dialog models, researchers propose to ground dialogs in real-world knowledge and construct new corpora that contain utterance-response-knowledge triples. Specifically, responses are grounded in external knowledge derived from different knowledge sources (Liu et al., 2018; Dinan et al., 2018; Moghe et al., 2018; Ghazvininejad et al., 2018; Mostafazadeh et al., 2017; Meng et al., 2020). Among these sources, textual knowledge (Dinan et al., 2018; Parthasarathi and Pineau, 2018; Qin et al., 2019) receives the most attention as it is easy to obtain and scale. However, the construction of knowledge-grounded datasets is costly and time-consuming. To build a more practical system that does not assume given knowledge, recent studies enhance KGD models with an extra knowledge selection component (Dinan et al., 2018; Kim et al., 2020; Zheng et al., 2020; Zhao et al., 2020).

Most existing KGD models can be viewed as models with externalized knowledge, where knowledge is explicitly used as part of the model input. The principle behind these models is to copy words and/or modify sentences from external knowledge when generating responses (Zhu et al., 2017; Zhao et al., 2019). Our KI, on the other hand, does not explicitly present knowledge to dialog models for reading and/or copying. Instead, we inject and store external knowledge in the model's parameters and encourage the model to elicit the encoded knowledge during generation.

The idea of knowledge internalization has also been explored in language modeling. Factual knowledge, visual knowledge (Tan and Bansal, 2020), and syntactic knowledge (Kuncoro et al., 2020) have been injected into language models (LMs) and have shown great promise in improving the performance of downstream tasks. KI differs from these knowledge-enhanced LMs in two aspects: (i) KI can be trained end-to-end with dialog models, while applying LMs to dialog generation often requires multiple rounds of pre-training and fine-tuning; (ii) KI is lightweight and barely introduces extra parameters to the dialog model, while applying LMs usually introduces hundreds of millions of extra parameters.
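As a rough, self-contained illustration of what such a token-level internalization objective could look like, the sketch below uses an InfoNCE-style contrastive loss that pulls each token's contextualized representation toward the embedding of its aligned knowledge sentence and treats the other knowledge sentences in the batch as negatives. The function name, tensor shapes, and temperature are illustrative assumptions, not the exact objective defined in §3.2 or in the released code.

```python
import torch
import torch.nn.functional as F

def ki_contrastive_loss(token_reprs, knowledge_reprs, temperature=0.1):
    """Hypothetical InfoNCE-style knowledge-internalization term.

    token_reprs:     (N, d) contextualized representations of N tokens.
    knowledge_reprs: (N, d) embeddings of the knowledge sentence aligned
                     with each token; row i is the positive for token i,
                     every other row serves as an in-batch negative.
    """
    # Cosine similarities between all token/knowledge pairs.
    token_reprs = F.normalize(token_reprs, dim=-1)
    knowledge_reprs = F.normalize(knowledge_reprs, dim=-1)
    logits = token_reprs @ knowledge_reprs.t() / temperature  # (N, N)

    # The aligned knowledge sentence for token i sits on the diagonal.
    targets = torch.arange(token_reprs.size(0), device=token_reprs.device)
    return F.cross_entropy(logits, targets)
```

Using in-batch negatives keeps such an objective cheap relative to scoring the full lexical knowledge collection, which matters given the scale of lexical knowledge the approach targets.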
In this section, we illustrate how to train a dialog model with knowledge internalization. To infuse more fine-grained lexical knowledge into a neural dialog model, we assume a dialog corpus where each token is aligned with relevant knowledge (we will discuss the construction of such a corpus in §4). In particular, for an input sentence $X$ in the corpus, we assume each token $x_i \in X$ is associated with a corresponding descriptive sentence $K_i$. Given an utterance-response pair $(X, Y)$, where $X = \{x_1, x_2, \ldots, x_n\}$ and $Y = \{y_1, y_2, \ldots, y_m\}$, neural dialog models generally minimize the negative log-likelihood loss $\mathcal{L}_{\mathrm{NLL}} = -\sum_{i=1}^{m} \log P(y_i)$, where $P(y_i) = P(y_i \mid y_{<i}, X)$.
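As a concrete reading of this loss, the minimal sketch below computes the summed negative log-likelihood of a response under teacher forcing and shows one plausible way a KI term could be weighted into the final training objective; the `ki_weight` coefficient and the exact combination are assumptions for illustration, not the paper's stated formulation.

```python
import torch.nn.functional as F

def response_nll(decoder_logits, response_ids, pad_id=0):
    """Summed negative log-likelihood: -sum_i log P(y_i | y_<i, X).

    decoder_logits: (m, V) decoder scores for the m response positions
                    obtained with teacher forcing.
    response_ids:   (m,)   gold token ids y_1 ... y_m.
    """
    return F.cross_entropy(decoder_logits, response_ids,
                           ignore_index=pad_id, reduction="sum")

def total_loss(decoder_logits, response_ids, ki_term, ki_weight=1.0):
    # Illustrative joint objective: generation loss plus a weighted KI term.
    return response_nll(decoder_logits, response_ids) + ki_weight * ki_term
```

In this view, the standard generation loss is left untouched and KI only contributes an auxiliary training term, consistent with the claim that KI barely introduces extra parameters.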