key: cord-0194984-sa5i0bq0 authors: Ju, Jiaxin; Liu, Ming; Koh, Huan Yee; Jin, Yuan; Du, Lan; Pan, Shirui title: Leveraging Information Bottleneck for Scientific Document Summarization date: 2021-10-04 journal: nan DOI: nan sha: 882a7756d91e98a498af294c6dd7d54cbe4b490b doc_id: 194984 cord_uid: sa5i0bq0

This paper presents an unsupervised extractive approach to summarizing long scientific documents based on the Information Bottleneck principle. Inspired by previous work that uses the Information Bottleneck principle for sentence compression, we extend it to document-level summarization with two separate steps. In the first step, we use signal(s) as queries to retrieve the key content from the source document. Then, a pre-trained language model conducts further sentence search and editing to return the final extracted summaries. Importantly, our work can be flexibly extended to a multi-view framework with different signals. Automatic evaluation on three scientific document datasets verifies the effectiveness of the proposed framework. A further human evaluation suggests that the extracted summaries cover more content aspects than those of previous systems.

Automatic text summarization is the challenging task of condensing the salient information from a source document into a shorter form. Text summarization approaches typically fall into two categories: the extractive approach (Cheng and Lapata, 2016; Nallapati et al., 2016a; Xiao and Carenini, 2019; Cui et al., 2020), which directly extracts salient sentences from the input text as the summary, and the abstractive approach (Sutskever et al., 2014; See et al., 2017; Cohan et al., 2018; Sharma et al., 2019; Zhao et al., 2020), which imitates human behaviour by producing new sentences based on the information extracted from the source document.

Traditional extractive summarization methods are mostly unsupervised, extracting sentences based on n-gram overlap (Nenkova and Vanderwende, 2005), relying on graph-based methods for sentence ranking (Mihalcea and Tarau, 2004; Erkan and Radev, 2004), or identifying important sentences with a latent semantic analysis technique (Steinberger and Jezek, 2004). These unsupervised systems have been surpassed in both performance and popularity by neural models (Zaheer et al., 2020; Huang et al., 2021), whose encoder-decoder structures use either recurrent neural networks (Cheng and Lapata, 2016; Nallapati et al., 2016b) or Transformers (Zhang et al., 2019; Khandelwal et al., 2019). Chu and Liu (2019) developed an unsupervised auto-encoder model that attempts to encode and then reconstruct the documents with a properly designed reconstruction loss. However, because it tries to preserve every detail that helps to reconstruct the original documents, it is not applicable to long-document summarization. Recently, Ju et al. (2020) proposed an unsupervised non-neural approach for long documents that builds graphs to blend sentences from different text spans and leverages the correlations among them. Nevertheless, none of the aforementioned works utilize explicit guidance to aid the model in summarizing a source text. To this end, some works (Li et al., 2018; Liu et al., 2018; Zhu et al., 2020; Saito et al., 2020; Dou et al., 2021) explore the use of guiding signals extracted from the source document, such as keywords and highlighted sentences, to aid the model in summarizing the input document. These works only utilize a single signal, and Dou et al.
(2021) empirically showed that, if multiple guiding signals can be optimally exploited, a model can achieve even greater improvements in its summary outputs in the supervised neural summarization setting. Based on this finding, we propose a multi-view information bottleneck framework that can effectively incorporate multiple guiding signals for the scientific document summarization task.

The original idea of the information bottleneck (IB) principle (Tishby et al., 2000) in information theory is to compress a signal under the guidance of another, correlated signal.

Figure 1: Our proposed multi-view information bottleneck framework. I(s; Y) denotes the mutual information between sentence s and the correlated signal Y, and NSP is short for the Next Sentence Prediction task.

BottleSum (West et al., 2019) successfully applied IB to the summarization of short documents. Their model generates a summary merely by removing words from each sentence while keeping every sentence, without considering sentence importance at the document level. It is not suitable for long scientific document summarization, as preserving all sentences results in significant redundancy. In contrast, our framework applies the IB principle at the document level rather than the sentence level, so pruning unrelated information operates only on the selected important sentences. In particular, at the content selection stage shown in Figure 1, the signal that we seek to compress is the source document, and the correlated signals are extracted from the source document using state-of-the-art language models. This is followed by the text realization step, where our architecture conducts a fluency-based sentence search to return the final extracted summaries. The framework can be flexibly extended to a multi-view architecture by incorporating more self-defined correlated signals.

Our experiments on arXiv, PubMed (Cohan et al., 2018) and COVID-19 (Wang et al., 2020) show that our framework yields competitive performance compared to previous extractive summarizers. Despite the less satisfactory results of the multi-view framework in our experiments, we believe it has fruitful potential for further study, since the experiments of Dou et al. (2021) have empirically shown that summarization guided by multiple signals can achieve significant improvements over a system with a single signal.

Unsupervised Summarization The information bottleneck (IB) principle (Tishby et al., 2000) naturally incorporates selection and pruning into compressing information. It compresses the input source $S$ into $\tilde{S}$, which only preserves information related to a signal $Y$, by minimizing:

$$I(\tilde{S}; S) - \alpha\, I(\tilde{S}; Y) \qquad (1)$$

where $I$ denotes the mutual information between two variables and the trade-off coefficient $\alpha$ balances the pruning term and the relevance term. The term $I(\tilde{S}; S)$ prunes irrelevant information, while $I(\tilde{S}; Y)$ enforces the model to retain more information correlated with the signal $Y$. For the summarization task, we define $S$ to be the source document, $\tilde{S}$ the output summary, and $Y$ the correlated signal. To leverage the benefits of multiple guiding signals, we seek to extend the IB principle to effectively incorporate them. Recent work (Federici et al., 2020; Wu and Fischer, 2020) leverages the benefit of multi-view IB in other domains, so we extend this framework to multiple views by minimizing:

$$I(\tilde{S}; S) - \alpha \sum_{s \in \tilde{S}} I(s; Y_1) - \beta \sum_{s \in \tilde{S}} I(s; Y_2) \qquad (2)$$

where $Y_1$ and $Y_2$ refer to two different views on the document content.
In this equation, we consider the mutual information between each sentence and the guiding signal individually. The term $I(\tilde{S}; S)$ still serves to remove redundant information, while $\sum_{s\in\tilde{S}} I(s; Y_1)$ and $\sum_{s\in\tilde{S}} I(s; Y_2)$ retain correlated information. The trade-off parameters $\alpha$ and $\beta$ control the weights of the two views $Y_1$ and $Y_2$ relative to the pruning term $I(\tilde{S}; S)$. However, without a supervised validation set there is no clear way to directly optimize values for them, so we cannot directly compare the importance of the two terms. Instead, we formalise the process as a content selection step and a text editing step.

Following BottleSum (West et al., 2019), equation (2) can be posed formally as learning a conditional distribution. As we extend their work to the document level, the probability of a sentence selected by the system, $P(s)$, should be 1. The equation can then be formulated as:

$$-\log P(\tilde{S}) - \sum_{s\in\tilde{S}} \big[\alpha\, P(Y_1\mid s)\log P(Y_1\mid s) + \beta\, P(Y_2\mid s)\log P(Y_2\mid s)\big] \qquad (3)$$

Thus, content selection keeps relevant information by maximizing $P(Y_1\mid s)$ and $P(Y_2\mid s)$, while the text editing step prunes irrelevant sentences by optimizing $P(\tilde{S})$. In our framework, we define $Y_1$ to be the document category (e.g. cs, math) and $Y_2$ to be a keyphrase list of the specific article. The equation can eventually be rewritten as (the full derivation appears at the end of this paper):

$$-\log P(\tilde{S}) - \sum_{s\in\tilde{S}} \Big[\alpha\, P(Y_1\mid s)\log P(Y_1\mid s) + \beta \sum_{y\in Y_2} P(y\mid s)\log P(y\mid s)\Big] \qquad (4)$$

where $y$ is a keyword in the extracted keyword list $Y_2$. Hence, our goal is to maximize $P(Y_1\mid s)$ and $P(y\mid s)$ while optimizing $P(\tilde{S})$.

To illustrate how our framework operates under the IB principle, we divide Equation (4) into two parts and develop an algorithm for each part. The content selection algorithm corresponds to the second term. The algorithm below shows a generalized framework that can be extended to include more than two signals, $Y = \{Y_1, Y_2, ..., Y_n\}$. The implementation details of $Y_i(s)$ are explained in Section 4.2. The higher the score a sentence gains, the stronger its correlation with the guiding signal(s) and the higher the probability that it will be included in the output summary.

Algorithm: Content Selection & Text Realization
Require: document D, signal set Y = {Y1, Y2, ..., Yn}, position information Pos, and a language model LM
Content Selection:
1: SD ← split doc D            ▷ full sentence set
2: for each s in SD do
3:   if len(s) in length constraint then
4:     for each Yi in Y do
5:       P(Yi|s) ← Yi(s)
6:       Score(Yi) = P(Yi|s) × log(P(Yi|s))
7:     Score(s) ← Σi Score(Yi)
Text Realization:
8: best sent path ← Search(s, M)
9: best summary path ← k best sent paths
10: Ssum ← best summary path
11: return Ssum

For the text realization algorithm, the candidate sentence set selected in the content selection step is first reordered according to the sentences' original positions in the source document. Then we use SciBERT (Beltagy et al., 2019) to apply the next sentence prediction (NSP) task: each sentence is evaluated against the sentence that appears before it to determine the likelihood that the two sentences are consecutive. A similar idea based on the BERT NSP task has been proposed by Bommasani and Cardie (2020) to measure a summary's semantic coherence and fluency. Taking the fluency of the summary into account, the search algorithm aims to find the most likely sentence combinations as candidate summaries, and the best sentence combination is selected from these k candidates. We implemented greedy search and beam search respectively for model performance comparison.
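To make the NSP-based fluency check concrete, the pairwise score between two candidate sentences can be computed with a BERT-style NSP head. The following is a minimal sketch using the HuggingFace transformers library, not the authors' released code; whether the public SciBERT checkpoint ships usable NSP head weights is our assumption.

```python
# Minimal sketch (not the authors' code): score how likely sentence_b follows
# sentence_a with a BERT-style next-sentence-prediction (NSP) head.
# Assumption: the public "allenai/scibert_scivocab_uncased" checkpoint retains
# trained NSP head weights; if not, transformers initialises the head randomly,
# prints a warning, and the scores are meaningless.
import torch
from transformers import AutoTokenizer, BertForNextSentencePrediction

MODEL_NAME = "allenai/scibert_scivocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = BertForNextSentencePrediction.from_pretrained(MODEL_NAME).eval()

def nsp_score(sentence_a: str, sentence_b: str) -> float:
    """Probability that sentence_b is a natural continuation of sentence_a."""
    inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits  # shape (1, 2); index 0 = "is next"
    return torch.softmax(logits, dim=-1)[0, 0].item()

# Example: a higher score suggests the pair reads as consecutive sentences.
print(nsp_score("We propose an unsupervised extractive summarizer.",
                "It selects sentences with an information bottleneck objective."))
```

A score function of this form can then be plugged into the greedy or beam search described next.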
The greedy search algorithm starts from the first sentence; then, within each window, we find the sentence combination with the highest next-sentence probability. For beam search, since the best sentence combination may not start from the first sentence, we perform it for the first k sentences of the candidate sentence set.

In addition to the widely used arXiv and PubMed datasets (Cohan et al., 2018), we also make use of the COVID-19 scientific paper dataset (Wang et al., 2020). The dataset statistics are shown in Table 2.

Content selection We define a list of keyphrases extracted by RAKE (Rose et al., 2010) as the correlated signal for the single view, while the multi-view framework incorporates the document category as another view. The top 10 keyphrases are extracted, and sentences and keyphrases are then mapped into a high-dimensional space by averaging the output of SciBERT (Beltagy et al., 2019). We assume that sentences with higher similarity to the keyphrases are more likely to be associated with the defined signal, and the score is the sum of the cosine similarities between the sentence and each keyphrase. For the multi-view framework, we use a Longformer (Beltagy et al., 2020) pre-trained with 100 classes on the Kaggle arXiv dataset (https://www.kaggle.com/Cornell-University/arxiv) to obtain P(Y1|s) for each sentence. In this pre-training process, we utilize the large Longformer model and set the learning rate to 1e-5, the batch size to 4, the number of epochs to 4, the hidden dropout to 0.05, and the hidden size to 1024. The 50 highest-scoring sentences are selected for the next step.

Text Realization For the NSP task, we continue to use SciBERT (Beltagy et al., 2019) to obtain the likelihood of two adjacent sentences. We implement greedy search and beam search respectively for model performance comparison. The greedy search algorithm starts from the first sentence (k=1); then, within each window, we find the sentence combination with the highest next-sentence probability. We set the window size to 3 and slide the window by one sentence. For beam search, since the best sentence combination may not start from the first sentence, we perform it for the first 5 (k=5) sentences of the candidate sentence set and set the beam size to 5. The number of sentences in the generated summary is 10.

Results on scientific datasets We compare our framework with unsupervised summarization models, as shown in Table 1. We rerun these models, and the number of sentences in the generated summaries from all models is 10. Our models achieve the highest R-1 on arXiv and the highest R-2 on PubMed. On COVID-19, the keywords+beam search setting achieves the highest score. SciSummPip (Ju et al., 2020) is a hybrid method that compresses and rewrites extracted sentences by building a word-relational graph, so it is likely to produce more bigrams that match the reference summary. SumBasic (Vanderwende et al., 2007) tends to extract sentences that contain more high-frequency words, which is why it achieves a higher R-1 on PubMed. The comparison among our frameworks shows that the single-view settings perform better than the multi-view setting, and that the beam search algorithm outperforms the greedy search algorithm. While we achieve better scores than the baselines, the performance differences are not significant. Thus, to investigate the effectiveness of our proposed framework, we further conduct a sentence position analysis and a human evaluation on the single-view settings.
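As a rough illustration of the single-view content selection described above (top-10 RAKE keyphrases, sentences and keyphrases embedded by averaging SciBERT outputs, a sentence score equal to the summed cosine similarities, and the top 50 sentences kept), the following Python sketch shows one way this could be implemented. The rake_nltk package, the mean-pooling choice, and the truncation length are our assumptions, not details confirmed by the paper.

```python
# Rough sketch of single-view content selection: score each sentence by its
# summed cosine similarity to the top RAKE keyphrases, keep the top candidates.
import torch
from rake_nltk import Rake                      # assumed RAKE implementation
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
enc = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased").eval()

def embed(text: str) -> torch.Tensor:
    """Average the last-hidden-state token vectors as a crude text embedding."""
    inputs = tok(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        hidden = enc(**inputs).last_hidden_state  # (1, seq_len, hidden_size)
    return hidden.mean(dim=1).squeeze(0)

def select_candidates(sentences, document, top_k_phrases=10, n_candidates=50):
    """Return the n_candidates sentences most similar to the RAKE keyphrases."""
    rake = Rake()
    rake.extract_keywords_from_text(document)
    phrases = rake.get_ranked_phrases()[:top_k_phrases]
    phrase_vecs = [embed(p) for p in phrases]
    scored = []
    for s in sentences:
        v = embed(s)
        score = sum(torch.cosine_similarity(v, pv, dim=0).item()
                    for pv in phrase_vecs)
        scored.append((score, s))
    scored.sort(key=lambda x: x[0], reverse=True)
    return [s for _, s in scored[:n_candidates]]
```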
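The greedy, window-based variant of the text realization step (windows of 3 sentences, sliding by one, 10-sentence summaries) could be approximated as below, reusing an nsp_score function like the one sketched earlier. The window-advancement and tie-breaking details are our guesses; the authors' implementation may differ.

```python
# Rough sketch of the greedy windowed search: starting from the first candidate
# sentence, repeatedly pick, within a window of the next few sentences, the one
# with the highest NSP probability of following the last chosen sentence.
def greedy_search(candidates, nsp_score, window=3, summary_len=10):
    """candidates: sentences ordered by their original document position."""
    summary = [candidates[0]]
    idx = 1
    while len(summary) < summary_len and idx < len(candidates):
        window_sents = candidates[idx:idx + window]
        if not window_sents:
            break
        # choose the sentence in the window most likely to follow the last pick
        best_offset, best = max(
            enumerate(window_sents),
            key=lambda pair: nsp_score(summary[-1], pair[1]),
        )
        summary.append(best)
        idx += best_offset + 1   # advance the window past the chosen sentence
    return summary
```

A beam-search variant would instead keep the k=5 highest-scoring partial sentence paths at each step and return the best complete combination.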
Sentence position analysis Our position analysis is shown in Figure 2. The oracle summaries are mostly extracted from the beginning of the source document, while the summaries extracted by our models come from all sections of the source document. A higher ROUGE score shows that a model captures unigrams and bigrams that appear in the reference summary, but it is more important that the extracted summary concisely covers most or all of the key information in each section of the original article, and our model appears to achieve this significantly better than the oracle summary. To test this hypothesis, we conduct a thorough human analysis.

Human analysis We randomly sample 50 documents from the COVID-19 dataset and conduct a human evaluation against four criteria: fluency, faithfulness, coverage and conciseness. For each article, we compare the summaries generated by the four frameworks with the true summary, and the human annotators are asked to blindly rate these summaries on a 1-5 scale (1 is the worst and 5 is the best). The average performance of each model is shown in Table 3. Even though the keywords+beam search setting does not perform significantly better than the others in terms of ROUGE score, it receives higher human ratings. In addition, the oracle summary performs better on fluency and faithfulness, but it contains more unnecessary sentences. Figure 3 shows an example of the abstract and the system summaries.

In this paper, we proposed an unsupervised framework based on the IB principle for long document summarization. Our framework employs a two-step system in which content selection is guided by defined signal(s), followed by a text realization step where a pre-trained language model conducts sentence search to return the final summaries. Experiments on three scientific datasets show the effectiveness of our framework. A further human analysis suggests that the extracted summaries exhibit greater coverage. Despite the less satisfactory results of the multi-view framework in our experiments, we believe it has fruitful potential for further study.

Appendix: derivation of the rewritten objective. Writing each mutual information term in equation (2) as an expectation of pointwise mutual information, where $\mathrm{pmi}(x, y) = \log\frac{p(x,y)}{p(x)p(y)}$ denotes the pointwise mutual information, the objective expands as:

$$P(\tilde S, S)\log\frac{P(\tilde S, S)}{P(\tilde S)P(S)} - \alpha \sum_{s\in\tilde S} P(Y_1, s)\log\frac{P(Y_1, s)}{P(Y_1)P(s)} - \beta \sum_{s\in\tilde S} P(Y_2, s)\log\frac{P(Y_2, s)}{P(Y_2)P(s)}$$

$$= P(\tilde S\mid S)P(S)\log\frac{P(\tilde S\mid S)}{P(\tilde S)} - \alpha \sum_{s\in\tilde S} P(Y_1\mid s)P(s)\log\frac{P(Y_1\mid s)}{P(Y_1)} - \beta \sum_{s\in\tilde S} P(Y_2\mid s)P(s)\log\frac{P(Y_2\mid s)}{P(Y_2)}$$

Since $P(\tilde S\mid S)=1$ for the chosen summary, and $P(S)$, $P(s)$, $P(Y_1)$ and $P(Y_2)$ are constant,

$$= P(\tilde S\mid S)P(S)\log\frac{P(\tilde S\mid S)}{P(\tilde S)} - \sum_{s\in\tilde S} P(s)\Big[\alpha P(Y_1\mid s)\log\frac{P(Y_1\mid s)}{P(Y_1)} + \beta P(Y_2\mid s)\log\frac{P(Y_2\mid s)}{P(Y_2)}\Big]$$

$$= C_1 \log\frac{1}{P(\tilde S)} - \sum_{s\in\tilde S} C_2\Big[\alpha P(Y_1\mid s)\log P(Y_1\mid s) + \beta P(Y_2\mid s)\log P(Y_2\mid s) - C_3\Big]$$

$-\log(p_G)$ is constant, so $P(Y_1\mid s)\times\log P(Y_1)$ is constant. For each sentence, $P(Y_1\mid s)$ will be scaled up or down in the same proportion, because $P(Y_1)$ and $P(Y_2)$ are constant and log is a monotonically increasing function.
$$= -\log P(\tilde S) - \sum_{s\in\tilde S}\big[\alpha\, P(Y_1\mid s)\log P(Y_1\mid s) + \beta\, P(Y_2\mid s)\log P(Y_2\mid s)\big]$$

References
Variations of the similarity function of TextRank for automated summarization.
SciBERT: Pretrained contextualized embeddings for scientific text.
Longformer: The long-document transformer.
Intrinsic evaluation of summarization datasets.
Neural summarization by extracting sentences and words.
MeanSum: A neural model for unsupervised multi-document abstractive summarization.
A discourse-aware attention model for abstractive summarization of long documents.
Enhancing extractive text summarization with topic-aware graph neural networks.
GSum: A general framework for guided neural abstractive summarization.
LexRank: Graph-based lexical centrality as salience in text summarization.
Nate Kushman, and Zeynep Akata. 2020. Learning robust representations via multi-view information bottleneck.
Efficient attentions for long document summarization.
Monash-Summ@LongSumm 20 SciSummPip: An unsupervised scientific paper summarization pipeline.
Sample efficient text summarization using a single pre-trained transformer.
Guiding generation for abstractive text summarization based on key information guide network.
Generating Wikipedia by summarizing long sequences.
TextRank: Bringing order into text.
SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents.
Abstractive text summarization using sequence-to-sequence RNNs and beyond.
The impact of frequency on summarization. Microsoft Research.
Automatic keyword extraction from individual documents.
Abstractive summarization with combination of pre-trained sequence-to-sequence and saliency models.
Get to the point: Summarization with pointer-generator networks.
An entity-driven framework for abstractive summarization.
Using latent semantic analysis in text summarization and summary evaluation.
Sequence to sequence learning with neural networks.
The information bottleneck method.
Beyond SumBasic: Task-focused summarization with sentence simplification and lexical expansion.
BottleSum: Unsupervised and self-supervised sentence summarization using the information bottleneck principle.
Phase transitions for the information bottleneck in representation learning.
Extractive summarization of long documents by combining global and local context.
Big Bird: Transformers for longer sequences.
HIBERT: Document level pre-training of hierarchical bidirectional transformers for document summarization.
SummPip: Unsupervised multi-document summarization with sentence graph compression.
Xuedong Huang, and Meng Jiang. 2020. Boosting factual correctness of abstractive summarization with knowledge graph. arXiv e-prints.