title: Lexical Knowledge Internalization for Neural Dialog Generation
authors: Wu, Zhiyong; Bi, Wei; Li, Xiang; Kong, Lingpeng; Kao, Ben
date: 2022-05-04

We propose knowledge internalization (KI), which aims to complement neural dialog models with lexical knowledge. Instead of further conditioning knowledge-grounded dialog (KGD) models on externally retrieved knowledge, we seek to integrate knowledge about each input token internally into the model's parameters. To tackle the challenge posed by the large scale of lexical knowledge, we adopt a contrastive learning approach and create an effective token-level lexical knowledge retriever that requires only weak supervision mined from Wikipedia. We demonstrate the effectiveness and general applicability of our approach on various datasets and diversified model structures.

Vacuous responses (Li et al., 2016; Ghazvininejad et al., 2018), such as "I don't know", are commonly observed in end-to-end neural dialog models (Shang et al., 2015; Sordoni et al., 2015). This is mostly because these models ignore the knowledge that resides in people's minds during a conversation. To bridge the gap, many existing works (Moghe et al., 2018; Dinan et al., 2018) have attempted to condition the dialog model on external knowledge, either a sentence or a paragraph, retrieved based on the utterance and/or previous context. This has led to curated datasets of utterance-response-knowledge triples (see Fig 1(a)).

[Figure 1: (a) An exemplary KGD data sample with an utterance (top), a response (bottom), and sentence-level knowledge (middle). (b) A list of lexical knowledge (in grey rectangles) related to words from the utterance in (a), and the potential responses (in white speech balloons) people would make given that knowledge.]

These knowledge-grounded dialog (KGD) models, despite their demonstrated effectiveness, suffer from two major problems. First, equipping models with sentence-level knowledge alone limits the informativeness and diversity of responses. As shown in Fig 1(a), with the knowledge retrieved given the utterance, a KGD model can relate J.K. Rowling to Khalsa Aid. However, retrieval based solely on sentence embeddings ignores the lexical knowledge associated with individual tokens. In this example, the knowledge about J.K. Rowling, COVID-19, donates, and India is ignored during retrieval, due to the semantic gaps between those lexical knowledge sentences (see Fig 1(b)) and the utterance. This makes it rather difficult (if not impossible) for the model to generate responses carrying relevant information, as shown in Fig 1(b).

Second, retrieving knowledge for open-domain dialogs during inference incurs heavy computation, often involving similarity search over tens of millions of passages (Petroni et al., 2021). Existing systems (Zhao et al., 2020; Zheng et al., 2020) alleviate this problem by pre-selecting a small candidate set based on TF-IDF (Schütze et al., 2008), sacrificing the diversity and accuracy of the retriever. Because they condition the dialog model directly on the retrieved text, these models are easily affected by the quality of the constructed candidate set and are thus prone to errors (Dinan et al., 2018; Kim et al., 2020; Zhao et al., 2020).
In this work, we propose to complement neural dialog models with lexical knowledge via Knowledge Internalization (KI), a training approach based on contrastive learning (Hadsell et al., 2006). The central idea of KI is to integrate fine-grained lexical knowledge about each input token internally into model parameters (e.g., word embeddings), rather than further conditioning the model on externally retrieved knowledge (e.g., directly copying and/or modifying tokens from external knowledge when decoding). Our research contributions include:

• a novel training objective (KI; §3.2) that infuses lexical semantics into word representations. With the knowledge internalized into the contextualized representation of every token, a dialog model can generate informative and diverse responses without engaging an external knowledge retrieval module at inference time, thus making inference more efficient (§6.1);

• an effective token-level lexical knowledge retriever (§4) trained with weak supervision to contextually align tokens in dialog corpora with their related, and possibly different, knowledge (Appendix C);

• a demonstration of the effectiveness and general applicability of KI with extensive experiments on diversified dialog models and on three benchmark datasets: DailyDialog (Li et al., 2017), Wizard of Wikipedia (Dinan et al., 2018), and the Commonsense Reddit Dataset. The implementation of our model can be found at https://github.com/LividWo/KI.

To address the vacuous response problem in neural dialog models, researchers propose to ground dialogs in real-world knowledge and construct new corpora that contain utterance-response-knowledge triples. Specifically, responses are grounded in external knowledge derived from different knowledge sources (Liu et al., 2018; Dinan et al., 2018; Moghe et al., 2018; Ghazvininejad et al., 2018; Mostafazadeh et al., 2017; Meng et al., 2020). Among these sources, textual knowledge (Dinan et al., 2018; Parthasarathi and Pineau, 2018; Qin et al., 2019) receives the most attention as it is easy to obtain and scale. However, the construction of knowledge-grounded datasets is costly and time-consuming. To build a more practical system that does not assume given knowledge, recent studies enhance KGD models with an extra knowledge selection component (Dinan et al., 2018; Kim et al., 2020; Zheng et al., 2020; Zhao et al., 2020).

Most existing KGD models can be viewed as models with externalized knowledge, where knowledge is explicitly used as part of the model input. The principle behind these models is to copy words and/or modify sentences from external knowledge when generating responses (Zhu et al., 2017; Zhao et al., 2019). Our KI, on the other hand, does not explicitly present knowledge to dialog models for reading and/or copying. Instead, we inject and store external knowledge in the model's parameters and encourage the model to elicit the encoded knowledge during generation.

The idea of knowledge internalization has also been explored in language modeling. Factual knowledge, visual knowledge (Tan and Bansal, 2020), and syntactic knowledge (Kuncoro et al., 2020) have been injected into language models (LMs) and have shown great promise in improving the performance of downstream tasks. KI differs from these knowledge-enhanced LMs in two aspects: (i) KI can be trained end-to-end with dialog models, while applying LMs to dialog generation often requires multiple rounds of pre-training and fine-tuning; (ii) KI is lightweight and barely introduces extra parameters to the dialog model, while applying LMs usually introduces hundreds of millions of extra parameters.
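As a rough, self-contained illustration of what such a token-level internalization objective could look like, the sketch below uses an InfoNCE-style contrastive loss that pulls each token's contextualized representation toward the embedding of its aligned knowledge sentence and treats the other knowledge sentences in the batch as negatives. The function name, tensor shapes, and temperature are illustrative assumptions, not the exact objective defined in §3.2 or in the released code.

```python
import torch
import torch.nn.functional as F

def ki_contrastive_loss(token_reprs, knowledge_reprs, temperature=0.1):
    """Hypothetical InfoNCE-style knowledge-internalization term.

    token_reprs:     (N, d) contextualized representations of N tokens.
    knowledge_reprs: (N, d) embeddings of the knowledge sentence aligned
                     with each token; row i is the positive for token i,
                     every other row serves as an in-batch negative.
    """
    # Cosine similarities between all token/knowledge pairs.
    token_reprs = F.normalize(token_reprs, dim=-1)
    knowledge_reprs = F.normalize(knowledge_reprs, dim=-1)
    logits = token_reprs @ knowledge_reprs.t() / temperature  # (N, N)

    # The aligned knowledge sentence for token i sits on the diagonal.
    targets = torch.arange(token_reprs.size(0), device=token_reprs.device)
    return F.cross_entropy(logits, targets)
```

Using in-batch negatives keeps such an objective cheap relative to scoring the full lexical knowledge collection, which matters given the scale of lexical knowledge the approach targets.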
In this section, we illustrate how to train a dialog model with knowledge internalization. To infuse more fine-grained lexical knowledge into a neural dialog model, we assume a dialog corpus where each token is aligned with relevant knowledge (we will discuss the construction of such a corpus in §4). In particular, for an input sentence $X$ in the corpus, we assume each token $x_i \in X$ is associated with a corresponding descriptive sentence $K_i$. Given an utterance-response pair $(X, Y)$, where $X = \{x_1, x_2, \ldots, x_n\}$ and $Y = \{y_1, y_2, \ldots, y_m\}$, neural dialog models generally minimize the negative log-likelihood loss $\mathcal{L}_{\mathrm{NLL}} = -\sum_{i=1}^{m} \log P(y_i)$, where $P(y_i) = P(y_i \mid y_{<i}, X)$.
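As a concrete reading of this loss, the minimal sketch below computes the summed negative log-likelihood of a response under teacher forcing and shows one plausible way a KI term could be weighted into the final training objective; the `ki_weight` coefficient and the exact combination are assumptions for illustration, not the paper's stated formulation.

```python
import torch.nn.functional as F

def response_nll(decoder_logits, response_ids, pad_id=0):
    """Summed negative log-likelihood: -sum_i log P(y_i | y_<i, X).

    decoder_logits: (m, V) decoder scores for the m response positions
                    obtained with teacher forcing.
    response_ids:   (m,)   gold token ids y_1 ... y_m.
    """
    return F.cross_entropy(decoder_logits, response_ids,
                           ignore_index=pad_id, reduction="sum")

def total_loss(decoder_logits, response_ids, ki_term, ki_weight=1.0):
    # Illustrative joint objective: generation loss plus a weighted KI term.
    return response_nll(decoder_logits, response_ids) + ki_weight * ki_term
```

In this view, the standard generation loss is left untouched and KI only contributes an auxiliary training term, consistent with the claim that KI barely introduces extra parameters.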