title: Neural Query-Biased Abstractive Summarization Using Copying Mechanism
authors: Ishigaki, Tatsuya; Huang, Hen-Hsen; Takamura, Hiroya; Chen, Hsin-Hsi; Okumura, Manabu
date: 2020-03-24
journal: Advances in Information Retrieval
DOI: 10.1007/978-3-030-45442-5_22

This paper deals with the query-biased summarization task. Conventional non-neural approaches have performed well by primarily including in the summary the words that overlap between the source and the query. However, recurrent neural network (RNN)-based approaches do not explicitly model this phenomenon. We therefore design an RNN-based query-biased summarizer that primarily includes the overlapping words in the summary, using a copying mechanism. Experimental results, in terms of both automatic evaluation with ROUGE and manual evaluation, show that the strategy of including the overlapping words also works well for neural query-biased summarizers.

A query-biased summarizer takes a query in addition to a source document as input and outputs a summary with respect to the query, as in Table 1. The generated summaries are intended to be used, for example, as snippets in search engine results. Query-biased summarization has been studied for decades [3, 4, 16, 18]. Conventional approaches are mostly extractive and often use the overlapping words as cues to calculate the salience score of a sentence [14, 16, 18]. On the other hand, recurrent neural network (RNN)-based approaches have enabled summarizers to generate fluent abstractive summaries [1, 7, 13], but they do not explicitly model the strategy of primarily including the overlapping words. In this paper, therefore, we incorporate this strategy into RNN-based summarizers using copying mechanisms.

Table 1. Example of a source document, a query, and the gold summary. The words overlapping between the source and the query are shown in bold.
  Source document: Vigilanteism simply causes more problems and will not fix the original problem.
  Query: Will vigilanteism restore law and order?
  Gold summary: one should restore law and order rather than implementing disorder

A copying mechanism is a network component that primarily includes the words of the source document in the summary [5, 6, 17] by increasing the probability of outputting them. A copying mechanism can be seen as an extension of the pointer-network [19], which only copies words from the input and cannot output any other words. Gu et al. [5], Gulcehre et al. [6], and Miao et al. [12] extended the pointer-network to copying mechanisms by using a function to balance copying and generation. See et al. [17] and Cheng and Lapata [2] applied the copying mechanism to single-document summarization tasks without a query. Our idea is to use copying mechanisms to include the overlapping words in the summary. However, copying mechanisms were originally designed for settings without query information, and it is not obvious how to integrate them into a query-biased summarizer.

Encoder-decoders for the query-biased setting have also been proposed. Hasselqvist et al. [7] proposed an architecture able to copy words from the source document, whereas our copying mechanisms explicitly copy the overlapping words and their surroundings. Nema et al. [13] presented a dataset extracted from Debatepedia.
They also proposed a method to increase the diversity of the summary, whereas we focus on copying mechanisms.

We propose three copying mechanisms designed for query-biased summarizers: copying from the source, copying the overlapping words, and copying the overlapping words and their surroundings. We empirically show that the models copying the overlapping words perform better. These results support the finding that the strategy of including the overlapping words, which has been shown useful for conventional query-biased summarizers, also works well for neural network-based query-biased summarizers.

We first explain a base query-biased neural abstractive summarizer proposed by Nema et al. [13], into which we integrate our copying mechanisms in the next section.

Encoders: The base model has two bi-directional Long Short-Term Memory (LSTM) [8] encoders: one for the query $q = \{q_1, \dots, q_{|q|}\}$ and another for the source document $d = \{w_1, \dots, w_{|d|}\}$. In each encoder, the outputs of the forward and the backward LSTM are concatenated into a vector. We refer to the resulting vector for the $i$-th word in the query as $h^q_i$, and that for the $j$-th word in the source document as $h^d_j$.

Decoder with Query- and Source Document-Attentions: The decoder outputs the summary. The final state $h^d_{|d|}$ is used to initialize the first state of the LSTM in the decoder. At each time step $t$, the decoder calculates the attention weights
$$a^q_{t,i} = v_q \cdot \tanh(W_q s_t + U_q h^q_i)$$
for every word in the query, where $W_q, U_q \in \mathbb{R}^{l \times l}$ are weight matrices and $v_q \in \mathbb{R}^l$ is an $l$-dimensional weight vector, each element of which is automatically learned from the training data. Here, $s_t$ is the output of the LSTM in the decoder, whose input involves the embedding vector $y_{t-1}$ of the previously generated word and a document representation $d_t$ explained later. The weights are converted into probabilities $\alpha^q_{t,i} = \exp(a^q_{t,i}) / \sum_{i'} \exp(a^q_{t,i'})$, and we obtain a query vector as the weighted average $q_t = \sum_i \alpha^q_{t,i} h^q_i$.

The source document attention mechanism further calculates the attention weights $a^d_{t,j}$ for every word in the source document and converts them into probabilities $\alpha^d_{t,j}$:
$$a^d_{t,j} = v_d \cdot \tanh(W_d s_t + U_d h^d_j + Z q_t), \qquad \alpha^d_{t,j} = \frac{\exp(a^d_{t,j})}{\sum_{j'} \exp(a^d_{t,j'})}. \quad (1)$$
Equation (1) contains $q_t$, which means that the weights are calculated by taking the query into account. We then take the weighted average to obtain a document vector: $d_t = \sum_j \alpha^d_{t,j} h^d_j$.

Finally, the score of generating the word $n$ in the pre-defined dictionary $N$ is calculated as $a_{t,gen}(n) = \delta_n \cdot W_o (W_{dec} s_t + V_{dec} d_t)$. The scores are converted into a probability distribution:
$$p_{t,gen}(n) = \frac{\exp(a_{t,gen}(n))}{\sum_{n' \in N} \exp(a_{t,gen}(n'))}, \quad (2)$$
where $W_o \in \mathbb{R}^{|N| \times l}$, $W_{dec} \in \mathbb{R}^{l \times l}$, and $V_{dec} \in \mathbb{R}^{l \times l}$ are learnable parameter matrices, and $\delta_n \in \{0,1\}^{|N|}$ is a one-hot vector whose element corresponding to the word $n$ is 1 and whose other elements are 0. Thus, the dot product of $\delta_n$ and $W_o (W_{dec} s_t + V_{dec} d_t)$ gives the score of generating the word $n$, and Eq. (2) converts the scores into a probability distribution.

Objective Function: All learnable matrices are tuned to minimize the negative log-likelihood of the reference summaries $y$ in the training data $D$:
$$L = -\sum_{(d, q, y) \in D} \sum_{t} \log p_t(y_t).$$

We now discuss the copying mechanisms for query-biased summarizers. Figure 1 shows the overview of a query-biased summarizer with a copying mechanism. In the following subsections, we present three mechanisms: SOURCE, OVERLAP, and OVERLAP-WIND.

We explain SOURCE, which copies words from the source document. The strategy is a straightforward extension of the existing copying mechanisms [5, 6, 17]. The neural query-biased summarizer proposed by Hasselqvist et al. [7] also adopted this strategy, but they did not report its impact. We further extend this mechanism in the following subsections.
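To make one decoding step of the base model concrete, the following is a minimal NumPy sketch of the query attention, the query-conditioned document attention (Eq. 1), and the generation distribution (Eq. 2). It is an illustration rather than the authors' implementation: the dimension sizes, the random toy inputs, and the exact projection of $q_t$ inside the document attention (the matrix Z below) are assumptions.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def decoder_step(s_t, H_q, H_d, params):
    """One attention/generation step of the base model, sketched.

    s_t : (l,)      decoder LSTM output at step t
    H_q : (|q|, l)  encoder outputs h^q_i for the query
    H_d : (|d|, l)  encoder outputs h^d_j for the source document
    """
    p = params
    # Query attention: a^q_{t,i} = v_q . tanh(W_q s_t + U_q h^q_i)
    a_q = np.tanh(s_t @ p["W_q"].T + H_q @ p["U_q"].T) @ p["v_q"]
    alpha_q = softmax(a_q)
    q_t = alpha_q @ H_q                      # query vector (weighted average)

    # Document attention conditioned on the query (Eq. 1); the extra
    # projection Z of q_t is an assumed parametrization.
    a_d = np.tanh(s_t @ p["W_d"].T + H_d @ p["U_d"].T + q_t @ p["Z"].T) @ p["v_d"]
    alpha_d = softmax(a_d)
    d_t = alpha_d @ H_d                      # document vector

    # Generation scores over the dictionary N and their softmax (Eq. 2)
    a_gen = p["W_o"] @ (p["W_dec"] @ s_t + p["V_dec"] @ d_t)
    return softmax(a_gen), alpha_d, d_t

# Toy usage with random parameters (l = 4, |N| = 10)
rng = np.random.default_rng(0)
l, vocab = 4, 10
params = {k: rng.normal(size=(l, l)) for k in ["W_q", "U_q", "W_d", "U_d", "Z", "W_dec", "V_dec"]}
params.update(v_q=rng.normal(size=l), v_d=rng.normal(size=l), W_o=rng.normal(size=(vocab, l)))
p_gen, alpha_d, d_t = decoder_step(rng.normal(size=l), rng.normal(size=(3, l)), rng.normal(size=(6, l)), params)
```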
In the SOURCE strategy, the output layer calculates the probability distribution over the set $N \cup M = \{n_1, \dots, n_{|N|+|M|}\}$, where $N$ is the set of words in the pre-defined dictionary and $M$ is the set of words in the source document. Thus, $p_{t,gen}$ in Eq. (2) is modified to cover the extended vocabulary as follows:
$$p_{t,gen}(n) = \begin{cases} p_{t,gen}(n) & (n \in N) \\ 0 & (n \notin N). \end{cases}$$
In the copying mechanism, we consider two different probabilities for the word $n$ in the vocabulary: the generation probability $p_{t,gen}$ and the copying probability $p_{t,copy}$. The switching probability $sw_t = \sigma(z_d \cdot d_t + z_s \cdot s_t + z_y \cdot y_{t-1})$ balances these probabilities as
$$p_t(n) = sw_t\, p_{t,gen}(n) + (1 - sw_t)\, p_{t,copy}(n).$$
Here, $z_d, z_s \in \mathbb{R}^l$ and $z_y \in \mathbb{R}^{l_{emb}}$, $\sigma$ is the sigmoid function, and $l_{emb}$ is the dimension size of a word embedding. In SOURCE, $p_{t,copy}$ is calculated as
$$p_{t,copy}(n) = \alpha^d_{t,\, idx_s(n)},$$
where $idx_s(n)$ is a function that returns the position of the word $n$ in the source document. That is, the attention weight $\alpha^d_{t,\, idx_s(n)}$ provided by the source document attention module is used as the score for outputting the word $n$ in the source document.

We next propose OVERLAP, the model that copies the overlapping words. This model calculates the probability distribution over the set $N \cup M = \{n_1, \dots, n_{|N|+|M|}\}$ in the same way as SOURCE, but increases the copy scores of the overlapping words:
$$a_{t,copy}(n) = \begin{cases} \alpha^d_{t,\, idx_s(n)} + \lambda_d & (n \in Q \cap M) \\ \alpha^d_{t,\, idx_s(n)} & (\text{otherwise}). \end{cases}$$
The scores are converted into a probability distribution:
$$p_{t,copy}(n) = \frac{\exp(a_{t,copy}(n))}{\sum_{n' \in M} \exp(a_{t,copy}(n'))}.$$
In the equations above, $Q$ refers to the set of content words in the query; thus, $Q \cap M$ represents the overlapping words. $\lambda_d \in \mathbb{R}_{>0}$ is a hyperparameter that controls the importance of the overlapping words. By using $\lambda_d$, this model can assign relatively high probabilities to the overlapping words, so they are more likely to be included in the summary. $\lambda_d$ is tuned on validation data.

We finally explain OVERLAP-WIND, the model that copies the overlapping words and their surrounding words, under the assumption that the surroundings of overlapping words might also be important. This model calculates the scores for the overlapping words and their surroundings as follows:
$$a_{t,copy}(n) = \begin{cases} \alpha^d_{t,\, idx_s(n)} + \lambda_d & (n \in M_{L_d}) \\ \alpha^d_{t,\, idx_s(n)} & (\text{otherwise}), \end{cases}$$
where $M_{L_d}$ is the set that contains the $L_d (\in \mathbb{N})$ words around each overlapping word in addition to the overlapping word itself. Then, the scores are converted into a probability distribution by using the softmax function: $p_{t,copy}(n) = \exp(a_{t,copy}(n)) / \sum_{n' \in M} \exp(a_{t,copy}(n'))$.

We used the publicly available dataset provided by Nema et al. [13]. The data contains tuples of a source document, a query, and a summary, extracted from Debatepedia. We used 80% of the data for training, and the remainder was split equally for parameter tuning and testing. We used Adam [9] as the optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.999$, and the initial learning rate was set to 0.0004. The word embeddings for both the query and the source document were initialized with GloVe [15] and further tuned during training. We selected the best-performing LSTM dimension size from 200, 300, and 400 on the tuning data. The batch size was 32. We used the entire vocabulary of the training data as the pre-defined dictionary.

We prepared three baselines without any copying mechanism. The first, ENC-DEC, was a simple encoder-decoder-based summarizer without the query encoder; this model uses Eq. (1) without $q_t$. The second, ENC-DEC QUERY, was the query-aware encoder-decoder explained in Sect. 2. The third, DIVERSE, was the state-of-the-art model proposed by Nema et al. [13]. We adopted full-length ROUGE [11] as the automatic evaluation metric.
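For concreteness, the following minimal Python sketch recaps how the three mechanisms compared in the experiments (SOURCE, OVERLAP, and OVERLAP-WIND from Sect. 3) turn the document attention weights into a copy distribution, and how that distribution is mixed with the generation distribution via the switch $sw_t$. Function names, the toy inputs, and the additive $\lambda_d$ bonus followed by a softmax reflect our reading of the description above; this is an illustration, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def copy_distribution(alpha_d, source_words, query_content_words,
                      mode="OVERLAP-WIND", lam=2.0, L_d=1):
    """Copy distribution over the source positions, sketched.

    alpha_d[j] is the document attention weight of the j-th source word.
    SOURCE reuses alpha_d directly; OVERLAP adds the bonus lam (lambda_d)
    to every overlapping word; OVERLAP-WIND also boosts the L_d words
    around each overlapping word.
    """
    scores = np.asarray(alpha_d, dtype=float).copy()
    if mode == "SOURCE":
        return scores                          # alpha_d is already a distribution
    overlap_pos = [j for j, w in enumerate(source_words)
                   if w in query_content_words]            # positions of Q ∩ M
    boosted = set(overlap_pos)
    if mode == "OVERLAP-WIND":
        for j in overlap_pos:                  # window of L_d words around each overlap
            boosted.update(range(max(0, j - L_d),
                                 min(len(source_words), j + L_d + 1)))
    scores[list(boosted)] += lam               # assumed additive lambda_d bonus
    return softmax(scores)                     # renormalize into p_copy

def mix(p_gen, p_copy, sw_t):
    """p_t(n) = sw_t * p_gen(n) + (1 - sw_t) * p_copy(n) over N ∪ M."""
    return sw_t * p_gen + (1.0 - sw_t) * p_copy

# Toy usage
src = ["vigilanteism", "will", "not", "restore", "law", "and", "order"]
alpha = softmax(np.random.default_rng(0).normal(size=len(src)))
p_copy = copy_distribution(alpha, src, {"restore", "law", "order"})
```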
In addition to the automatic evaluation, we conducted a manual evaluation with human judges. Fifty-five randomly selected document/query pairs and their summaries generated by DIVERSE, SOURCE, and OVERLAP-WIND were shown to crowdworkers on Amazon Mechanical Turk. We assigned 10 workers to each set of document, query, and summaries and asked them to rank the summaries. Following the evaluation metrics of DUC 2007, we adopted readability and responsiveness as the manual evaluation criteria. The workers were allowed to give the same rank to multiple summaries. We show the ROUGE scores and the averaged rankings from the human judges in Table 2.

Table 2. The full-length ROUGE-1, ROUGE-2, and ROUGE-L scores (higher is better) and the averaged rankings from human judges (lower is better). The best-performing model is shown in bold.

ROUGE Scores: ENC-DEC, which uses no query information, achieved very low performance. Adding the query encoder (ENC-DEC QUERY) improved the scores. Furthermore, copying the words in the source document (SOURCE) achieved better scores than the best-performing model without a copying mechanism (DIVERSE). Thus, integrating the copying mechanism improved the performance even in the query-biased setting. Among our models, OVERLAP and OVERLAP-WIND ($L_d = 1$) achieved better performance than SOURCE. The dagger (†) in Table 2 indicates that the differences between the scores of SOURCE or OVERLAP and those of our best-performing model, OVERLAP-WIND ($L_d = 1$), are statistically significant according to the paired bootstrap resampling test of Koehn [10] (p < 0.05). This supports our assumption that the strategy of copying the overlapping words is effective even for RNN-based summarizers.

In the manual evaluation, our best model, OVERLAP-WIND ($L_d = 1$), was ranked higher than the state-of-the-art DIVERSE and than SOURCE. The differences between OVERLAP-WIND ($L_d = 1$) and SOURCE are statistically significant (p < 0.05) according to the paired bootstrap resampling test [10]. The results of the manual evaluation thus also support our assumption.

We proposed copying mechanisms designed for query-biased summarizers to primarily include the words overlapping between the source document and the query. Our experimental results showed that the mechanisms that primarily copy these overlapping words achieved better performance in terms of both ROUGE and the rankings by human judges. The results suggest that the strategy of including the overlapping words, which has been shown useful for conventional non-neural summarizers, also works well for RNN-based summarizers.
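As a supplementary illustration of the significance testing used above, here is a minimal Python sketch of the paired bootstrap resampling test [10]. The per-example scores, the number of resamples, and the one-sided formulation are assumptions made for the example; it is not the authors' evaluation code.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=0):
    """Paired bootstrap resampling (Koehn, 2004), sketched.

    scores_a[i] and scores_b[i] are per-example scores (e.g., ROUGE) of
    two systems on the same i-th test example.  Returns an estimate of
    the one-sided p-value for "system A outperforms system B".
    """
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n, wins = len(scores_a), 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]   # resample the test set with replacement
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return 1.0 - wins / n_samples     # fraction of resamples on which A does not win

# Toy usage with made-up per-example scores
p_value = paired_bootstrap([0.42, 0.37, 0.51, 0.44], [0.40, 0.35, 0.47, 0.46])
```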
References
1. Query focused abstractive summarization: incorporating query relevance, multi-document coverage, and summary length constraints into seq2seq models
2. Neural summarization by extracting sentences and words
3. Overview of DUC
4. Bayesian query-focused summarization
5. Incorporating copying mechanism in sequence-to-sequence learning
6. Pointing the unknown words
7. Query-based abstractive summarization using neural networks
8. Long short-term memory
9. Adam: a method for stochastic optimization
10. Statistical significance tests for machine translation evaluation
11. ROUGE: a package for automatic evaluation of summaries
12. Language as a latent variable: discrete generative models for sentence compression
13. Diversity driven attention model for query-based abstractive summarization
14. Biased LexRank: passage retrieval using random walks with question-based priors
15. GloVe: global vectors for word representation
16. FastSum: fast and accurate query-based multi-document summarization
17. Get to the point: summarization with pointer-generator networks
18. Advantages of query biased summaries in information retrieval
19. Pointer networks