Incorporating Commonsense Knowledge into Abstractive Dialogue Summarization via Heterogeneous Graph Networks

Xiachong Feng, Xiaocheng Feng, Bing Qin, Ting Liu

2020-10-20

Abstractive dialogue summarization is the task of capturing the highlights of a dialogue and rewriting them into a concise version. In this paper, we present a novel multi-speaker dialogue summarizer to demonstrate how large-scale commonsense knowledge can facilitate dialogue understanding and summary generation. In detail, we consider utterances and commonsense knowledge as two different types of data and design a Dialogue Heterogeneous Graph Network (D-HGN) to model both. We also add speakers as heterogeneous nodes to facilitate information flow. Experimental results on the SAMSum dataset show that our model outperforms various methods. We also conduct zero-shot experiments on the Argumentative Dialogue Summary Corpus; the results show that our model generalizes better to the new domain.

Automatic summarization is a fundamental task in Natural Language Processing that aims to condense the original input into a shorter version covering the salient information; it has been studied continuously for decades (Paice, 1990; Kupiec et al., 1999). Recently, online multi-speaker dialogues and meetings have become one of the most important ways for people to communicate in their daily work. Especially due to the worldwide spread of COVID-19, people have become more dependent on online communication. In this paper, we focus on dialogue summarization, which can help people quickly grasp the core content of a dialogue without reviewing the complex dialogue context.

Recent works that incorporate additional commonsense knowledge into dialogue generation (Zhou et al., 2018) and dialogue context representation learning show that even though neural models have strong learning capabilities, explicit knowledge can still improve response generation quality. This is because a dialogue system can understand conversations better, and thus respond more appropriately, if it can access and make full use of large-scale commonsense knowledge. However, current dialogue summarization systems (Ganesh and Dingliwal, 2019; Li et al., 2019; Liu et al., 2019a; Zhu et al., 2020) ignore the exploration of commonsense knowledge, which may limit their performance. In this work, we examine the benefit of incorporating commonsense knowledge into the dialogue summarization task and also address the question of how best to incorporate this information.

[Figure 1: An example of a dialogue-summary pair. Green marks speakers, blue utterances, and pink commonsense knowledge. The reference summary reads: "Bob's car has broken down. In 10 minutes Tom will give him a lift to work." To generate "give a lift" in the reference summary, the summarization model needs to understand the commonsense knowledge behind "pick up" and "car broke down".]

Figure 1 shows a positive example that illustrates the effectiveness of commonsense knowledge in the dialogue summarization task. Bob asks Tom for help because his car has broken down. On the one hand, by introducing commonsense knowledge associated with pick up and car broke down, we can infer that Bob expects Tom to give him a lift.
On the other hand, commonsense knowledge can serve as a bridge between non-adjacent utterances, which helps the model better understand the dialogue.

In this paper, we follow the previous setting (Zhou et al., 2018) and also use ConceptNet (Speer and Havasi, 2012) as a large-scale commonsense knowledge base; the difference is that we regard knowledge and text (utterances) as heterogeneous data in a real multi-speaker dialogue. We propose a model named Dialogue Heterogeneous Graph Network (D-HGN) that incorporates commonsense knowledge by constructing a graph containing both utterance and knowledge nodes. Our heterogeneous graph also contains speaker nodes, which have been shown to be a useful feature in dialogue modeling. In particular, we equip our heterogeneous graph network with two additionally designed modules. One is called message fusion, which is specially designed for utterance nodes to better aggregate information from both speakers and knowledge. The other is called node embedding, which helps utterance nodes become aware of position information. Compared to the homogeneous graph networks in related works (Ganesh and Dingliwal, 2019; Li et al., 2019; Liu et al., 2019a; Zhu et al., 2020), we claim that the heterogeneous graph network can effectively fuse information and capture rich semantics in nodes and links, and thus encode the dialogue representation more accurately.

We conduct experiments on the SAMSum corpus (Gliwa et al., 2019), a large-scale chat summarization corpus, and analyze the effectiveness of knowledge integration and heterogeneity modeling. A human evaluation also shows that our approach generates more abstractive and correct summaries. To evaluate whether commonsense knowledge helps our model generalize better to a new domain, we also perform zero-shot experiments on the Argumentative Dialogue Summary Corpus (Misra et al., 2015), a debate summarization corpus. In the end, we give a brief summary of our contributions: (1) we are the first to incorporate commonsense knowledge into the dialogue summarization task; (2) we propose a D-HGN model that encodes the dialogue by viewing utterances, knowledge and speakers as heterogeneous data; (3) our model outperforms various methods.

In this section, we describe the graph notation and the graph construction process, which consists of three steps: (1) utterance-knowledge bipartite graph construction, (2) speaker-utterance bipartite graph construction and (3) heterogeneous dialogue graph construction. Our heterogeneous dialogue graph (HDG) is defined as a directed graph $G = (V, E, A, R)$, where each node $v \in V$ and each edge $e \in E$. Different types of nodes and edges are associated with their type mapping functions $\tau(v): V \rightarrow A$ and $\phi(e): E \rightarrow R$.

Current dialogue summarization corpora have no knowledge annotations. To ground each dialogue in commonsense knowledge, we make use of ConceptNet (Speer and Havasi, 2012). ConceptNet is a semantic network that contains 34 relations in total and represents each knowledge tuple as $(h, r, t, w)$, meaning that head concept $h$ and tail concept $t$ are connected by relation $r$ with weight $w$. It contains not only world facts such as "Paris is the capital of France" that are constantly true, but also informal relations that are part of daily knowledge such as "Call is used for Contact". We use each word in an utterance as a query to retrieve a one-hop graph from ConceptNet, as done by Guan et al. (2018), and only consider nouns, verbs, adjectives, and adverbs. We filter out tuples where (1) $r$ is in a pre-defined list of useless relations (e.g. "number" is the antonym of "letter"), or (2) the weight of $r$ is less than 1 (e.g. "text" is related to "refactor" with weight 0.9). In this way, we obtain related concepts for the dialogue, as shown in Figure 2(b).
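To make the retrieval and filtering step concrete, here is a minimal Python sketch. It assumes the public ConceptNet 5 REST API at api.conceptnet.io, and the relation blacklist shown is an illustrative stand-in rather than the paper's actual list; only the weight threshold of 1 comes directly from the rule above.

```python
import requests

# Illustrative blacklist: the paper uses a pre-defined list of useless
# relations that is not reproduced in the text.
USELESS_RELATIONS = {"Antonym", "DistinctFrom", "EtymologicallyRelatedTo"}

def retrieve_one_hop(word, limit=50):
    """Retrieve one-hop (h, r, t, w) tuples for `word` from ConceptNet,
    dropping useless relations and tuples with weight below 1."""
    resp = requests.get(f"https://api.conceptnet.io/c/en/{word}",
                        params={"limit": limit}).json()
    tuples = []
    for edge in resp.get("edges", []):
        h, r = edge["start"]["label"], edge["rel"]["label"]
        t, w = edge["end"]["label"], edge["weight"]
        if r in USELESS_RELATIONS or w < 1.0:  # rules (1) and (2)
            continue
        tuples.append((h, r, t, w))
    return tuples
```

In the paper's pipeline, only nouns, verbs, adjectives and adverbs serve as queries, so a POS tagger would gate the calls to retrieve_one_hop.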
We construct the utterance-knowledge bipartite graph by viewing utterances and knowledge as different types of nodes. As shown in Figure 2(c), we connect two utterances to one tail concept $t$ using know-by edges if they both have the same tail concept $t$. Note that two utterances may connect to multiple tail concepts; in this case, we choose the one with the highest average relation weight (e.g. "phone book" is better than "date"). If there are multiple identical knowledge nodes, we also merge them into a single one (e.g. two "contact" nodes are combined into one node).

Given multiple speakers and their corresponding utterances in a dialogue, we construct the speaker-utterance bipartite graph by viewing speakers and utterances as different types of nodes. As shown in Figure 2(d), we construct speak-by edges from speakers to utterances based on who said each utterance.

We combine the utterance-knowledge bipartite graph and the speaker-utterance bipartite graph into our heterogeneous dialogue graph, as shown in Figure 2(e). Additionally, we add reverse edges rev-know-by and rev-speak-by to facilitate information flow over the graph. Finally, there are three types of nodes, so $A$ becomes {speaker, utterance, knowledge}, and four types of edges, so $R$ becomes {speak-by, know-by, rev-speak-by, rev-know-by}.
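The resulting HDG can be stored as a plain typed-edge list; the following is a minimal sketch in which the class and field names are our own illustrative choices, and the direction of know-by edges is an assumption (the reverse edges make the graph effectively bidirectional either way).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

NODE_TYPES = {"speaker", "utterance", "knowledge"}                   # A
EDGE_TYPES = {"speak-by", "know-by", "rev-speak-by", "rev-know-by"}  # R

@dataclass
class HDG:
    """Directed heterogeneous dialogue graph G = (V, E, A, R)."""
    node_type: Dict[str, str] = field(default_factory=dict)          # tau(v)
    edges: List[Tuple[str, str, str]] = field(default_factory=list)  # (s, phi(e), t)

    def add_node(self, node_id: str, ntype: str) -> None:
        assert ntype in NODE_TYPES
        self.node_type[node_id] = ntype

    def add_edge(self, src: str, etype: str, tgt: str) -> None:
        # Each edge is stored together with its reverse so that information
        # can flow in both directions over the graph.
        assert etype in {"speak-by", "know-by"}
        self.edges.append((src, etype, tgt))
        self.edges.append((tgt, "rev-" + etype, src))

# Toy usage: Bob utters u1; u1 and u2 share the tail concept "lift".
g = HDG()
for node_id, ntype in [("Bob", "speaker"), ("u1", "utterance"),
                       ("u2", "utterance"), ("lift", "knowledge")]:
    g.add_node(node_id, ntype)
g.add_edge("Bob", "speak-by", "u1")
g.add_edge("u1", "know-by", "lift")
g.add_edge("u2", "know-by", "lift")
```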
In this section, we describe the details of our dialogue heterogeneous graph network (D-HGN), which consists of three components: a node encoder, a graph encoder and a pointer decoder. The model is shown in Figure 3.

The role of the node encoder is to give each node an initial representation. Note that speaker and knowledge nodes may contain multiple words. We employ a Bi-LSTM as the node encoder, which encodes the words of an input node forwardly and backwardly to generate two sequences of hidden states; these are concatenated position-wise as the initial word representations $h^0_{v_i,n}$, and the last hidden states of the two directions are concatenated as the initial node representation $h^0_{v_i}$. $h^0_{v_i}$ will be passed to the graph encoder to learn high-level representations, and $h^0_{v_i,n}$ will be concatenated with the updated node representations to get the final word representations.

[Figure 3: Illustration of our D-HGN model. (a) Graph construction receives a dialogue and ConceptNet and outputs a heterogeneous dialogue graph (HDG). (b) The node encoder receives a sequence of words for a node and produces initial node and word representations. (c) The graph encoder first conducts graph operations on the initial node representations; then a node embedding module is added after the graph layers to make nodes aware of position information; finally, the initial word representations and the corresponding updated node representations are concatenated as final word representations. (d) The pointer decoder can either generate summary words from the vocabulary or copy from the input words.]

The graph encoder is used to digest the structural information and produce updated node representations. We employ the Heterogeneous Graph Transformer (Hu et al., 2020) as our graph encoder, which models heterogeneity through type-dependent parameters and can be easily applied to our graph. It includes: (a) heterogeneous mutual attention, which calculates attention scores $\mathrm{Attn}(s, e, t)$ between the source nodes and the target node; (b) heterogeneous message passing, which prepares the message vector $\mathrm{Msg}(s, e, t)$ for each source node; and (c) target-specific aggregation, which aggregates messages from the source nodes to the target node using the attention scores as weights. On top of this, we design two modules named message fusion and node embedding to make the learning process more effective for our graph.

Heterogeneous Mutual Attention. Given an edge $e = (s, t)$ with node and edge type mapping functions $\tau$ and $\phi$, we first project the source and target node representations $h^{(l-1)}_s$ and $h^{(l-1)}_t$ from the $(l{-}1)$-th layer with type-dependent linear projections. Next, to integrate edge type information, we calculate an unnormalized score $\alpha(s, e, t)$ between $t$ and $s$ by incorporating an edge-based matrix $W^{(l),\phi(e)}_{ATT}$. Finally, for each target node $t$, we apply a Softmax over all $s \in N(t)$ to obtain the final normalized attention scores $\mathrm{Attn}^{(l)}(s, e, t)$, where $N(t)$ denotes the neighbors of target node $t$. Note that if the target node is of utterance type and the source node is of speaker type, we do not calculate an attention score between these two nodes; see the message fusion module for details. The process is shown in Figure 4(a).

Heterogeneous Message Passing. We project the source node representation $h^{(l-1)}_s$ with a type-dependent linear projection to prepare the message vector $\mathrm{Msg}^{(l)}(s, e, t)$ for each source node, as shown in Figure 4(b).

[Figure 4: Illustration of one graph layer, given a target node of utterance type and source nodes of knowledge and speaker types. Firstly, we use (a) heterogeneous mutual attention to calculate the attention scores via type-dependent linear projections. Secondly, we use (b) heterogeneous message passing to prepare the message vector for each source node. Thirdly, we use (c) target-specific aggregation to aggregate messages to the target node. Specifically, we propose (d) a message fusion module that uses the attention scores as weights to average the knowledge vectors and adds the speaker information separately.]

Target-Specific Aggregation. We divide this process into two cases based on the type of the target node: (1) $\tau(t) \neq$ utterance and (2) $\tau(t) =$ utterance. For the first case, we use the attention scores as weights to average the messages: $\tilde{h}^{(l)}_t = \sum_{s \in N(t)} \mathrm{Attn}^{(l)}(s, e, t) \cdot \mathrm{Msg}^{(l)}(s, e, t)$. For the second case, we design a message fusion module to aggregate messages to the utterance node more effectively. After getting the aggregated message vector $\tilde{h}^{(l)}_t$, we map it back to the $\tau(t)$-type distribution with a linear projection followed by a residual connection: $h^{(l)}_t = F^{\tau(t)}_{Linear}(\tilde{h}^{(l)}_t) + h^{(l-1)}_t$.

Message Fusion. Dialogue summaries often describe "who did what", so speaker information is required for utterances. However, if a target node of utterance type aggregated messages from source nodes of both knowledge and speaker types, it would attend mostly to the speaker node while largely ignoring the knowledge nodes, since attention is a normalized distribution. Therefore, in our message fusion module, we use the attention weights over the knowledge nodes to average the corresponding messages, and we add the speaker information separately. The process is shown in Figure 4(d).
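To make the layer concrete, below is a simplified single-head PyTorch sketch of target-specific aggregation with message fusion for an utterance target node. The class, its dimensions and the dot-product scoring are our own simplifications; the actual HGT layer (Hu et al., 2020) uses multi-head attention with per-type key/query/value projections and the edge-based matrices described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UtteranceAggregation(nn.Module):
    """Sketch: attend over knowledge messages only, then add the speaker
    message separately (message fusion), followed by a residual connection."""
    def __init__(self, d):
        super().__init__()
        self.q_utter = nn.Linear(d, d)   # type-dependent query projection
        self.k_know = nn.Linear(d, d)    # type-dependent key projection
        self.msg_know = nn.Linear(d, d)  # message projection for knowledge nodes
        self.msg_spk = nn.Linear(d, d)   # message projection for the speaker node
        self.out = nn.Linear(d, d)       # maps the aggregated message back (F_Linear)

    def forward(self, h_utter, h_know, h_spk):
        # h_utter: (d,), h_know: (num_knowledge, d), h_spk: (d,)
        q = self.q_utter(h_utter)                            # query from the target
        k = self.k_know(h_know)                              # keys from knowledge sources
        attn = F.softmax(k @ q / q.shape[-1] ** 0.5, dim=0)  # Attn over knowledge only
        msg = attn @ self.msg_know(h_know)                   # weighted average of messages
        fused = msg + self.msg_spk(h_spk)                    # add speaker info separately
        return self.out(fused) + h_utter                     # residual connection

# Toy usage: one utterance node, three knowledge neighbors, one speaker.
d = 8
layer = UtteranceAggregation(d)
h_new = layer(torch.randn(d), torch.randn(3, d), torch.randn(d))
```

Because the Softmax runs over knowledge neighbors only, the speaker message cannot crowd out the knowledge messages, which is exactly the imbalance the message fusion module is meant to avoid.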
Node Embedding. This module is designed to make utterance nodes aware of their position in the source dialogue: the original heterogeneous graph cannot directly model the chronological order of the utterances, while an ideal dialogue summary needs to follow the order of the corresponding utterances. In detail, for speaker and knowledge nodes we fix the position to 0, and each utterance node $v_i$ is associated with a position $p_{v_i}$, its rank among the utterances of the original dialogue. As shown in Figure 3(c), we add position information to each node: $\hat{h}^{(l)}_{v_i} = h^{(l)}_{v_i} + W_{pos}[p_{v_i}]$, where $W_{pos}$ denotes a learnable node embedding matrix.

After getting the output representation $\hat{h}^{(l)}_{v_i}$ for each node, we concatenate the updated node representation with the corresponding initial word representations $h^0_{v_i,n}$, followed by a linear projection $F_{Linear}$, to get the final word representations: $h_{v_i,n} = F_{Linear}([\hat{h}^{(l)}_{v_i}; h^0_{v_i,n}])$.

We employ an LSTM with attention and a copy mechanism to generate summaries. At each decoding time step $t$, the LSTM reads the previous word embedding $x_{t-1}$ and the previous context vector $c_{t-1}$ as inputs to compute the new hidden state $s_t = \mathrm{LSTM}(x_{t-1}, c_{t-1}, s_{t-1})$. We use the average of all word representations in the graph, $s_0 = \mathrm{Average}(\{h_{v_i,n} : v_i \in G,\ n \in [1, |v_i|]\})$, to initialize the decoder. The context vector $c_t$ is computed as in Bahdanau et al. (2014) and is then used to calculate the generation probability $p_{gen}$ and the final probability distribution $P(w)$, as done by See et al. (2017). For each heterogeneous dialogue graph $G$ paired with a ground-truth summary $Y^* = [y^*_1, y^*_2, \ldots, y^*_{|Y^*|}]$, we minimize the negative log-likelihood of the target word sequence.
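The $p_{gen}$ mixing step of See et al. (2017) can be sketched in a few lines. This version is simplified to assume every source word is already in the vocabulary (the original extends the vocabulary with out-of-vocabulary source words), and all variable names are illustrative.

```python
import torch

def final_distribution(p_vocab, attn, src_ids, p_gen):
    """Mix generation and copy distributions as in pointer-generator networks.

    p_vocab: (V,) softmax distribution over the vocabulary
    attn:    (S,) attention over source words, reused as the copy distribution
    src_ids: (S,) vocabulary ids of the source words
    p_gen:   scalar in (0, 1), probability of generating instead of copying
    """
    p = p_gen * p_vocab
    # Scatter-add the copy probabilities onto the ids of the source words.
    return p.scatter_add(0, src_ids, (1.0 - p_gen) * attn)

# Toy usage: a 10-word vocabulary and 4 source positions.
p = final_distribution(torch.softmax(torch.randn(10), 0),
                       torch.softmax(torch.randn(4), 0),
                       torch.tensor([2, 5, 5, 7]),
                       torch.tensor(0.8))
assert torch.isclose(p.sum(), torch.tensor(1.0))  # still a valid distribution
```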
Dataset. Following the latest works (Gliwa et al., 2019; Ganesh and Dingliwal, 2019), we conduct experiments in two different settings. Firstly, we train and evaluate our model on the SAMSum corpus (Gliwa et al., 2019), which contains dialogues around chit-chat topics. Secondly, we train on the SAMSum corpus and use the Argumentative Dialogue Summary Corpus (ADSC) (Misra et al., 2015) as the test set to perform zero-shot experiments. Each dialogue in the ADSC dataset has 5 different summaries and mainly revolves around debate topics.

The word embedding size is set to 100. The dimension of the node encoder and pointer decoder is set to 300. The dimension of the graph encoder is set to 200. The number of graph layers is set to 1. Dropout is set to 0.5. We use Adam (Kingma and Ba, 2014) with a learning rate of 0.001 and gradient clipping with a maximum gradient norm of 2. At test time, the beam size is set to 10.

Evaluation Metrics. We employ the standard F1 scores of the ROUGE-1, ROUGE-2, and ROUGE-L metrics (Lin, 2004) to measure summary quality. These three metrics evaluate the accuracy of the unigrams, bigrams, and longest common subsequence between the ground truth and the generated summary.

Baseline Models. We compare our model with several baselines. LONGEST-3 chooses the longest three utterances as the summary. TextRank (Mihalcea and Tarau, 2004) is a graph-based extractive method. SummaRuNNer (Nallapati et al., 2017) extracts utterances based on a hierarchical RNN model. Transformer (Vaswani et al., 2017) is a Seq2Seq model that utilizes self-attention operations. PGN (See et al., 2017) is a Seq2Seq model equipped with a copy mechanism. HRED (Serban et al., 2016) is a hierarchical Seq2Seq model. Abs RL (Chen and Bansal, 2018) is a pipeline model that first selects salient utterances with an extractive model and then produces the summary with an abstractive model using diversity beam search; the extractive model is trained with utterance-level extraction labels, and the overall model is jointly trained with reinforcement learning. Based on Abs RL, Abs RL Enhance (Gliwa et al., 2019) appends all speakers after each utterance, because the original model may select the utterances of a single speaker, leaving no information about the other speakers. D-GAT, D-GCN and D-RGCN are variants of our model that replace the heterogeneous graph layers with homogeneous graph layers, namely GAT (Veličković et al., 2017), GCN (Kipf and Welling, 2016) and RGCN (Schlichtkrull et al., 2018). Note that D-GAT also uses the message fusion module to update the representations of utterance nodes.

[Table 2: Test set results on the SAMSum dataset, where "R-1" is short for "ROUGE-1", "R-2" for "ROUGE-2", and "R-L" for "ROUGE-L". "Know.", "Heter.", "Utter." and "RL" indicate whether knowledge, heterogeneity modeling, utterance-level extraction labels and reinforcement learning are used.]

Table 2 shows the results on the SAMSum corpus. D-HGN stands for our full model, which outperforms various baselines. Compared with HRED, which uses no additional auxiliary information such as commonsense knowledge or utterance-level extraction labels, D-RGCN, which uses commonsense knowledge, achieves improvements of 0.97% on ROUGE-1, 0.94% on ROUGE-2 and 1.28% on ROUGE-L, which shows the effectiveness of knowledge integration. Compared with homogeneous networks like D-RGCN, D-HGN, which is based on heterogeneous graph networks, achieves improvements of 0.67% on ROUGE-1, 1.00% on ROUGE-2 and 0.63% on ROUGE-L, which verifies the effectiveness of heterogeneity modeling.

We conduct two types of ablation studies to verify the effectiveness of the different types of nodes and of the two modules we propose. As shown in Table 3(a), without knowledge integration (w/o knowledge), the model suffers a performance drop, which shows that incorporating knowledge helps our model better capture the dialogue context. For speaker nodes, directly removing them from the graph would leave no speakers in the final summary; instead, we prepend the speakers to their utterances (w/o speaker). The results show that modeling speakers as heterogeneous data benefits the final summary generation. As shown in Table 3(b), removing the message fusion module (w/o message fusion) hurts the results, which shows that it is worthwhile to design a specific message fusion method for the different types of nodes. Besides, without taking position information into account (w/o node embedding), our model also loses some performance.

We conduct a human evaluation to verify the quality of the generated summaries along three dimensions: abstractiveness (contains higher-level conceptual words), informativeness (covers adequate information) and correctness (associates the right names with actions). We hired five graduate students to perform the evaluation; for each metric, the score ranges from 1 (worst) to 5 (best). The results are shown in Table 4. On abstractiveness, our model achieves higher scores; compared with D-HGN, D-HGN (w/o knowledge) gets a lower score, which indicates that knowledge incorporation helps our model express deeper meanings. D-HGN (w/o speaker) performs worse than D-HGN on correctness, which shows the effectiveness of heterogeneity modeling by viewing speakers as heterogeneous data. Abs RL Enhance performs worst on correctness, which may be because utterance extraction breaks the coherence of dialogue contexts.

To verify whether knowledge can help our model generalize better to a new domain, we directly test the models on the ADSC corpus. The results are shown in Table 5.
The homogeneous model D-GAT, which uses knowledge, achieves better results than the other baselines, and D-HGN gets the best score. We attribute this to the fact that knowledge helps our models better understand dialogues in the new domain.

To examine whether our D-HGN learns easily distinguishable representations, we extract node representations from the last graph layer for the SAMSum test set and apply t-SNE (van der Maaten, 2014) to these vectors. The results are shown in Figure 5. We find that our model generates more discrete and easily distinguishable representations. Besides, D-GAT also tends to separate the representations of different node types, which indicates that explicit heterogeneity modeling is the more reasonable approach.

Figure 6 shows summaries generated by different models for one dialogue, together with a visualization of the knowledge-to-utterance attention weights learned by our D-HGN model; the darker the color, the higher the weight. Our model incorporates two knowledge nodes: birthday party, derived from "bday party", "happy" and "cake", and some people, derived from "Tom" and "boyfriend". We can see that our D-HGN model pays more attention to birthday party than to some people. On the one hand, incorporating birthday party helps our model generate a more formal summary (using birthday rather than bday). On the other hand, birthday party connects non-adjacent utterances around the birthday topic, which helps our model generate a more informative and detailed summary (including cake).

[Figure 6: Example summaries generated by different models (PGN, Abs RL Enhance and D-HGN) for one dialogue, alongside the reference summary; the D-HGN output reads "Gary and Lara are going to Tom's birthday party at 5 pm. Lara will pick up the cake."]
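For reference, the representation analysis behind Figure 5 can be reproduced along the following lines; this sketch assumes scikit-learn's t-SNE (which implements the tree-based acceleration of van der Maaten, 2014) and uses random placeholder data in place of the actual node vectors extracted from the last graph layer.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholder stand-ins: node_vecs would be the (num_nodes, d) representations
# from the last graph layer, node_types the parallel list of type labels.
rng = np.random.default_rng(0)
node_vecs = rng.normal(size=(300, 200))
node_types = rng.choice(["speaker", "utterance", "knowledge"], size=300)

# Project to 2-D and color the points by node type.
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(node_vecs)
for ntype in ["speaker", "utterance", "knowledge"]:
    mask = node_types == ntype
    plt.scatter(coords[mask, 0], coords[mask, 1], s=8, label=ntype)
plt.legend()
plt.title("Node representations from the last graph layer (t-SNE)")
plt.show()
```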
Previous works used feature engineering (Xie et al., 2008), template-based (Oya et al., 2014) and graph-based (Bui et al., 2009) methods for extractive dialogue summarization. Although extractive methods are widely used, their results tend to be incoherent and poorly readable. Therefore, current works mainly focus on abstractive methods, which can produce more readable and fluent summaries, and tend to incorporate additional auxiliary information to help better model the dialogue. Goo and Chen (2018) incorporated dialogue acts to model the interactive status of a meeting. Liu et al. (2019a) tackled the problem of customer service summarization by first producing a sequence of pre-defined keywords and then generating the summary. Liu et al. (2019b) generated summaries for nurse-patient conversations by incorporating topic information. Ganesh and Dingliwal (2019) first removed useless utterances by utilizing discourse labels and then generated summaries. Li et al. (2019) combined visual and textual features in a unified hierarchical attention framework to generate meeting summaries. Zhu et al. (2020) employed a hierarchical transformer framework and incorporated part-of-speech and entity information for meeting summarization. In this paper, we facilitate the dialogue summarization task by incorporating commonsense knowledge, further modeling utterances, commonsense knowledge and speakers as heterogeneous data.

In this paper, we improve abstractive dialogue summarization by incorporating commonsense knowledge. We first construct a heterogeneous dialogue graph by introducing knowledge from a large-scale commonsense knowledge base. We then present a Dialogue Heterogeneous Graph Network (D-HGN) for this task, which views the utterances, knowledge and speakers in the graph as heterogeneous nodes, and we additionally design two modules named message fusion and node embedding to facilitate information flow. Experiments on the SAMSum dataset show the effectiveness of our model, which outperforms various methods, and zero-shot experiments on the Argumentative Dialogue Summary Corpus show that our model generalizes better to the new domain.

References:
- Neural machine translation by jointly learning to align and translate
- Extracting decisions from multiparty dialogue using directed graphical models and semantic similarity
- Fast abstractive summarization with reinforce-selected sentence rewriting
- Abstractive summarization of spoken and written conversation
- SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization
- Abstractive dialogue summarization with sentence-gated modeling optimized by dialogue acts
- Story ending generation with incremental encoding and commonsense knowledge
- Heterogeneous graph transformer
- Adam: A method for stochastic optimization
- Semi-supervised classification with graph convolutional networks
- A trainable document summarizer
- Keep meeting summaries on topic: Abstractive multi-modal meeting summarization
- ROUGE: A package for automatic evaluation of summaries
- Automatic dialogue summary generation for customer service
- Topic-aware pointer-generator networks for summarizing spoken conversations
- TextRank: Bringing order into text
- Using summarization to discover argument facets in online idealogical dialog
- SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents
- A template-based abstractive meeting summarization: Leveraging summary and source text relationships
- Constructing literature abstracts by computer: Techniques and prospects (Information Processing & Management)
- Modeling relational data with graph convolutional networks
- Get to the point: Summarization with pointer-generator networks
- Building end-to-end dialogue systems using generative hierarchical neural network models
- Representing general relational knowledge in ConceptNet 5
- Accelerating t-SNE using tree-based algorithms
- Attention is all you need
- Masking orchestration: Multi-task pretraining for multi-role dialogue representation learning
- Evaluating the effectiveness of features and sampling in extractive meeting summarization
- Commonsense knowledge aware conversation generation with graph attention
- End-to-end abstractive summarization for meetings