Efficient Attentions for Long Document Summarization
Huang, Luyang; Cao, Shuyang; Parulian, Nikolaus; Ji, Heng; Wang, Lu
2021-04-05

The quadratic computational and memory complexities of large Transformers have limited their scalability for long document summarization. In this paper, we propose Hepos, a novel efficient encoder-decoder attention with head-wise positional strides to effectively pinpoint salient information from the source. We further conduct a systematic study of existing efficient self-attentions. Combined with Hepos, we are able to process ten times more tokens than existing models that use full attentions. For evaluation, we present a new dataset, GovReport, with significantly longer documents and summaries. Results show that our models produce significantly higher ROUGE scores than competitive comparisons, including new state-of-the-art results on PubMed. Human evaluation also shows that our models generate more informative summaries with fewer unfaithful errors.

Long documents, such as scientific papers and government reports, often discuss substantial issues at length, and thus are time-consuming to read, let alone to comprehend. Generating abstractive summaries can help readers quickly grasp the main topics, yet prior work has mostly focused on short texts (containing hundreds of words), e.g., news articles (Gehrmann et al., 2018; Liu and Lapata, 2019; Zhang et al., 2019).

Model training efficiency and summary quality present a pair of challenges for long document summarization. State-of-the-art systems (Lewis et al., 2020; Zhang et al., 2019) are built upon the Transformer (Vaswani et al., 2017), which uses attentions to compute pairwise relations between tokens. Such a framework has quadratic time and memory complexities and is too costly for long documents. Solutions have been proposed to reduce the calculation of encoder self-attentions (Wang et al., 2020c; Zaheer et al., 2020) by selectively attending to neighboring tokens (Beltagy et al., 2020; Child et al., 2019) or relevant words (Kitaev et al., 2020; Tay et al., 2020a). Yet, these methods do not apply to encoder-decoder attentions in summarization models, which dynamically pinpoint salient content in the source as the summary is decoded. Truncation is commonly used to circumvent the issue. However, training on curtailed content further aggravates "hallucination" in existing abstractive models (Maynez et al., 2020).

We argue that summarizing long documents (e.g., with thousands of words or more) requires efficient handling of both types of attentions. To this end, we propose an efficient encoder-decoder attention with head-wise positional strides (HEPOS), where the attention heads follow a strided pattern and have varying starting positions. HEPOS reduces computational and memory costs while (1) maintaining the power of emphasizing important tokens, and (2) preserving the global context per head. HEPOS successfully doubles the processed input sequence size when combined with any encoder. To the best of our knowledge, we are the first to study efficient encoder-decoder attentions and provide a systematic comparison of diverse encoder attentions for the task of summarization.
For evaluation, we collect a new large-scale dataset, GOVREPORT, consisting of about 19.5k U.S. government reports with expert-written abstractive summaries. GOVREPORT has two important features: (1) It contains significantly longer documents (9.4k words) and summaries (553 words) than existing datasets, such as PubMed and arXiv (Cohan et al., 2018) (see Table 2); (2) Salient content is spread throughout the documents, as opposed to cases where summary-worthy words are more heavily concentrated in specific parts of the document. These properties make GOVREPORT an important benchmark for producing long document summaries with multiple paragraphs.

We conduct experiments on GOVREPORT and scientific papers in PubMed and arXiv. First, when summarizing documents of the same length, HEPOS attention yields significantly better ROUGE scores than a non-trivial comparison that projects attentions into low-rank space (Wang et al., 2020c). Second, when trained on the same GPU, HEPOS attention, combined with sparse encoder attentions, is able to read more than 10K words and obtains significantly higher ROUGE scores on GOVREPORT and new state-of-the-art results on PubMed, compared with full encoder-decoder attention models, which can process at most 5K input words. Human judges further rate the summaries generated by our models to be more informative and faithful.

We further propose a new evaluation metric for faithfulness, inspired by APES (Eyal et al., 2019), a fill-in-the-blank QA metric for summary evaluation. With questions generated from references, our metric, APES_src, compares QA answers obtained by reading the source and the system summary. It is shown to be better correlated with human judgment than the original metric and an entailment-based scorer (Kryscinski et al., 2020).

The rest of the paper is organized as follows. We describe efficient encoder attentions in prior work in § 2, and formulate our proposed encoder-decoder attention in § 3. The GOVREPORT data is presented in § 4. We then share details on evaluation metrics (§ 5) and experimental results (§ 6). Additional related work is listed in § 7, with the conclusion in § 8.

Transformer models are built upon multi-head attentions in multiple layers. The attention is calculated as Attention(Q, K, V) = softmax(QK^T / √d)V, where Q, K, and V are query, key, and value matrices, each consisting of n vectors for a document with n tokens, and d is the dimension of the key vectors; hence the quadratic memory footprint. Here, we present an overview of representative methods for efficient encoder self-attentions (henceforth "encoder attentions") that can be built upon large pre-trained seq2seq models, e.g., BART (Lewis et al., 2020). We follow the naming convention of Tay et al. (2020b), and summarize their memory complexities and numbers of newly learned parameters in Table 1.

Fixed patterns are used to limit the scope of attentions. In our experiments, in addition to window-based attentions, we also combine them with global tokens, stride patterns, or random attentions.

Sliding window attentions (Beltagy et al., 2020) aim to capture the local context, which is critical for language understanding (Liu et al., 2018; Child et al., 2019). Concretely, each query token attends to w/2 neighboring tokens on both the left and the right, yielding a memory complexity of O(nw).

Adaptive span is proposed by Sukhbaatar et al. (2019) to learn attention windows at different layers. This is implemented by learning a masking function for each head independently. In practice, the adaptive span attention has a complexity of O(nŵ), where ŵ is the maximum value of the predicted spans over all heads. It also introduces O(1) new parameters for learning spans.
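To make the memory contrast concrete, the sketch below (ours, not from the paper) computes standard full attention and a sliding-window variant in plain NumPy on toy sizes. Full attention materializes an n × n score matrix, while the windowed version only needs the O(nw) scores allowed by the mask; here the window is applied as a mask over the full matrix purely for readability, whereas an efficient implementation would compute only the banded scores.

```python
# Illustrative sketch only (not the authors' code): full self-attention vs. a
# sliding-window variant, matching the O(n^2) vs. O(n*w) discussion above.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(Q, K, V):
    # Scores form an n x n matrix: quadratic in the input length n.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

def sliding_window_attention(Q, K, V, w=4):
    # Each query attends to w/2 neighbors on each side; only O(n*w) scores matter.
    n = Q.shape[0]
    idx = np.arange(n)
    mask = np.abs(idx[:, None] - idx[None, :]) <= w // 2
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = np.where(mask, scores, -np.inf)   # drop out-of-window positions
    return softmax(scores) @ V

n, d = 16, 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d))
out_full = full_attention(Q, K, V)
out_windowed = sliding_window_attention(Q, K, V, w=4)
```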
Global tokens (Beltagy et al., 2020) are often added to sliding windows to let pre-selected tokens attend to the full sequence in order to build global representations. Importantly, global attention operations are symmetric, i.e., a global token is also attendable by all tokens in the sequence. We select the first g tokens as global tokens, as leading sentences are often important for summarization. The memory complexity is O(2ng) due to the symmetric attentions.

Stride patterns are proposed by Child et al. (2019) to capture long-term interactions, where each query attends to every s-th token, with s as the stride size. This yields a complexity of O(n^2/s).

Random attention is motivated by the fact that randomly constructed graphs with Θ(n) edges can approximate complete graphs spectrally (Zaheer et al., 2020). Zaheer et al. (2020) propose to allow each query to attend to r random keys, resulting in a complexity of O(nr). For efficient implementations, input tokens are first segmented into blocks, and tokens in one block attend to tokens in another randomly selected block.

Low-rank methods. Wang et al. (2020c) show that self-attention matrices are low-rank. They propose Linformer, which linearly projects key and value matrices into a low-dimensional space, e.g., from n to k, to achieve an O(nk) complexity. It also introduces O(n) new parameters for learning the projection matrices.

Recently, learnable sparse attentions have been proposed to better capture both local and global contexts than attentions based on fixed patterns.

Locality-sensitive hashing (LSH) attentions use a random-projection hashing function to hash similar queries and keys into the same buckets in l rounds (Kitaev et al., 2020). Attentions are then computed among tokens within each bucket. For bucket size b_l, the complexity of LSH attention is O(l · n · b_l).

Sinkhorn attentions first segment a sequence into blocks, which are then arranged by a learned Sinkhorn sorting network (Tay et al., 2020a). Given the new permutation, each query attends to b_s tokens within the same block to maintain the local context and another b_s tokens in a neighboring block to capture global interactions. Its complexity is O(2n · b_s).

We also describe several notable methods that are not suitable for our experiments and are excluded from this study: recurrence over input segments is tailored for an autoregressive decoder only (Dai et al., 2019); memory methods use a separate memory module to attend to full sequences (Lee et al., 2019), which share a similar theoretical foundation with global tokens; and kernel methods over attentions require training models from scratch (Choromanski et al., 2020; Katharopoulos et al., 2020).

Figure 1: A toy example of our HEPOS attention, with a stride of 2 and four attention heads. Dark colors indicate that heads 1 and 3 attend to the first and third tokens ("Job" and "home") in the input, while heads 2 and 4 look at the second and fourth words ("in" and "care").

The efficient design of encoder-decoder attentions with head-wise positional strides (HEPOS) allows models to consume longer sequences. Concretely, our design is motivated by two observations: (1) attention heads are redundant (Voita et al., 2019), and (2) any individual head rarely attends to several tokens in a row (Clark et al., 2019). Therefore, as illustrated in Fig. 1, HEPOS uses separate encoder-decoder heads on the same layer to cover different subsets of source tokens at fixed intervals. Each head starts at a different position, and all heads collectively attend to the full sequence. Given a stride size of s_h, for the h-th head, its attention value between decoder query q_j (at step j) and encoder key vector k_i (for the i-th input token) can be formulated as a_{jih} = softmax(q_j k_i^T) if (i - h) mod s_h = 0, and a_{jih} = 0 otherwise, with the softmax taken over the positions attended by head h.

In HEPOS attention, each query token attends to n/s_h tokens per head, yielding a memory complexity of O(mn/s_h), where m is the output length. For comparison, Linformer (§ 2.2) can be straightforwardly adapted for encoder-decoder attentions by using decoder queries for the attention calculation instead. We do not adapt pattern-based attentions (§ 2.1 and § 2.3), since they rely on local token grouping, which makes it difficult to pinpoint salient content.
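The head-wise strided pattern can be made concrete with a small sketch (our illustration, not the authors' released implementation): with stride s_h and 0-based indices, head h is allowed to attend to source positions i with (i - h) mod s_h = 0, so the heads jointly cover the whole input while each query keeps only about n/s_h scores per head. For clarity the sketch builds an explicit mask; an efficient implementation would instead gather every s_h-th key and value per head.

```python
# Sketch of HEPOS head-wise positional strides for encoder-decoder attention
# (our illustration, not the authors' code). Head h attends to every s_h-th
# source position starting at offset h, so all heads together cover the input.
import numpy as np

def hepos_mask(num_heads: int, src_len: int, stride: int) -> np.ndarray:
    """Boolean mask of shape (num_heads, src_len); True = head may attend."""
    heads = np.arange(num_heads)[:, None]      # head index h
    positions = np.arange(src_len)[None, :]    # source position i
    return (positions - heads) % stride == 0

def hepos_cross_attention(Q, K, V, stride):
    """Q: (heads, m, d) decoder queries; K, V: (heads, n, d) encoder states."""
    num_heads, n = K.shape[0], K.shape[1]
    mask = hepos_mask(num_heads, n, stride)                   # (heads, n)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(Q.shape[-1])  # (heads, m, n)
    scores = np.where(mask[:, None, :], scores, -np.inf)      # keep strided keys only
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Stride 2 with four heads reproduces the pattern of Figure 1 (0-based indexing):
# heads 0 and 2 cover positions 0 and 2, heads 1 and 3 cover positions 1 and 3.
print(hepos_mask(num_heads=4, src_len=4, stride=2).astype(int))
```

In a full model, this mask (or the equivalent gathering of every s_h-th key and value per head) would be applied inside each decoder layer's cross-attention, leaving the self-attentions unchanged.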
We introduce a new large-scale dataset, GOVREPORT, containing 19,466 long reports published by the U.S. Government Accountability Office (GAO) to fulfill requests by congressional members, and by the Congressional Research Service (CRS), covering research on a broad range of national policy issues. A human-written summary is provided along with each report. During data collection, we remove boilerplate from crawled files and keep the section and paragraph structure of the documents and summaries. Additional data cleaning and processing details are in Appendix A.

We obtain 12,228 GAO reports and 7,238 CRS reports of high quality, as evidenced by human inspection of 200 parsed reports. The collected GAO reports and CRS reports have on average 6.9 and 4.6 sections, respectively. We split the train, validation, and test sets by publication date on each dataset, and end up with 17,519 training samples, 974 validation documents, and 973 test samples.

Notably, summaries of GAO reports are written by experts, and are often structured into three aspects in order: "Why GAO did this study"-motivation and problem(s) under discussion, "What GAO found"-findings of the report, and "What GAO recommends"-suggestions and solutions to the problem(s). All but three GAO summaries include "What GAO Found". The percentages of GAO summaries that contain "Why GAO did this study" and "What GAO recommends" are 94.8% and 29.0%, respectively. For comparison, structured summaries are also observed in PUBMED (Cohan et al., 2018) samples. Though they do not contain explicit aspect labels, the summaries can often be broken down into "Introduction", "Methods", "Results", and "Conclusion" via keyword matching. Details about keyword choices for each aspect are provided in Table 11 in Appendix D.

Comparison with Existing Long Document Summarization Datasets. In Table 2, we compare GOVREPORT with existing long document summarization datasets, including PUBMED and ARXIV (Cohan et al., 2018), BILLSUM, and BIGPATENT, which consists of U.S. patent documents.

Table 2: Statistics of GOVREPORT and existing long document summarization datasets. Comp.: compression ratio; Den.: extractive fragment density (Grusky et al., 2018). All values are means over the whole dataset except for the "# Doc" column. Documents and summaries in GOVREPORT are significantly longer.

First, documents and summaries in GovReport are significantly longer than in prior datasets. Next, we inspect the distribution of summary-worthy bigrams in the source by dividing each document into ten equally sized partitions. For each partition, we count the occurrences of unique bigrams that also appear in the reference, accumulated from the start of the document to the end of the partition. Fig. 2 shows that key information is spread throughout documents in GOVREPORT, with new salient bigrams being steadily added as more content is consumed. For ARXIV and BIGPATENT, only about 10% of new salient bigrams are accumulated in the second half of the documents, reflecting the heavy positional bias in these two datasets. In contrast, in GovReport and BILLSUM, more than 18% of new summary-worthy bigrams appear in the latter half of the articles, showing a more even distribution. A similar trend is observed for unigrams. However, BILLSUM has the shortest documents among the five datasets.
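The positional-bias analysis just described is straightforward to reproduce; the sketch below is our illustration of the stated procedure (not the authors' script, and with whitespace tokenization as a simplifying assumption). It splits a tokenized document into ten partitions and reports, for each partition, the cumulative fraction of unique reference bigrams that have appeared so far.

```python
# Sketch of the salient-bigram coverage analysis described above (ours, not the
# authors' script): split a document into ten equal partitions and track how many
# unique reference bigrams have appeared up to the end of each partition.
from typing import List, Set, Tuple

def bigrams(tokens: List[str]) -> Set[Tuple[str, str]]:
    return set(zip(tokens, tokens[1:]))

def cumulative_bigram_coverage(doc_tokens: List[str], ref_tokens: List[str],
                               num_parts: int = 10) -> List[float]:
    ref_bigrams = bigrams(ref_tokens)
    if not ref_bigrams:
        return [0.0] * num_parts
    part_len = max(1, len(doc_tokens) // num_parts)
    coverage, seen = [], set()
    for p in range(num_parts):
        end = len(doc_tokens) if p == num_parts - 1 else (p + 1) * part_len
        seen |= bigrams(doc_tokens[:end]) & ref_bigrams
        coverage.append(len(seen) / len(ref_bigrams))
    return coverage

doc = "the agency reviewed the program and issued twelve recommendations".split()
ref = "the agency issued twelve recommendations".split()
print(cumulative_bigram_coverage(doc, ref))
```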
This work aims to evaluate whether processing more text improves both the informativeness and the faithfulness of abstractive summaries. In addition to ROUGE (Lin, 2004) and human evaluation, we extend an existing QA-based metric (Eyal et al., 2019) and consider an entailment-based scorer.

QA-based Evaluation. We present a new faithfulness evaluation metric by extending the APES score (Eyal et al., 2019). We follow APES to construct a set of cloze questions, {q}, from each reference summary by masking entities. Events, dates, and numbers are also masked, as they are prevalent in our data. Each masked phrase becomes the gold-standard answer a_ref for a question q. We do not generate natural language questions (Durmus et al., 2020; Wang et al., 2020a), due to the lack of accurate question generation models for the domains of government reports and scientific papers.

QA models are trained by reading a question and a context to label the answer span in the context. We construct the context by greedily selecting sentences that maximize the improvement of ROUGE-2 recall when compared with the reference summary. If the answer a_ref cannot be found in the context, the sample is excluded from training. We train all QA models by fine-tuning BERT (Devlin et al., 2019) to predict the answer span.

To evaluate the faithfulness of a system summary, APES uses the QA model to read the summary and a question q to label an answer a_sys. It calculates a unigram F1 score by comparing a_sys and a_ref. Different from APES, we further use the QA model to read the context (sentences selected from the source) and give an answer a_cxt to the question q. We compute a unigram F1 by comparing a_sys and a_cxt, denoted as APES_src. Given that existing summarization models rarely rewrite names or numbers correctly, our metric can better capture faithfulness by using a gold-standard answer constructed from the source article rather than from the human-written abstract.

To extract entities and events, we deploy a state-of-the-art IE framework, OneIE (Lin et al., 2020), on GOVREPORT. On PubMed, we retrain OneIE on the Genia 2011 (BioNLP, 2011), Genia 2013 (BioNLP, 2013), and PubMed (Wei et al., 2019) datasets to extract domain-specific entities and events, such as Gene and Disease entities. We additionally include numbers and dates extracted by spaCy (Honnibal and Montani, 2017).

Entailment-based Evaluation. We further consider FactCC (Kryscinski et al., 2020), which evaluates the factual consistency of a system summary by predicting an entailment score between the source and the summary. We reproduce their method on our datasets. Additional details for implementing the evaluation models and the entity extraction models are given in Appendix B.
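To illustrate how the two QA-based scores relate, here is a minimal sketch (ours; the fine-tuned span-prediction QA model is abstracted behind a hypothetical answer_question callable). APES compares the summary-based answer to the reference answer, while APES_src compares it to the answer read from the selected source context.

```python
# Sketch of APES vs. APES_src scoring (our illustration; `answer_question` stands
# in for the fine-tuned span-prediction QA model and is hypothetical here).
from collections import Counter
from typing import Callable, List, Tuple

def unigram_f1(pred: str, gold: str) -> float:
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def qa_scores(questions: List[Tuple[str, str, str]],   # (question q, a_ref, context)
              system_summary: str,
              answer_question: Callable[[str, str], str]):
    apes, apes_src = [], []
    for q, a_ref, context in questions:
        a_sys = answer_question(q, system_summary)  # answer read from the summary
        a_cxt = answer_question(q, context)         # answer read from the source context
        apes.append(unigram_f1(a_sys, a_ref))       # original APES: compare to reference
        apes_src.append(unigram_f1(a_sys, a_cxt))   # APES_src: compare to source answer
    n = max(1, len(questions))
    return sum(apes) / n, sum(apes_src) / n
```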
In this section, we start by describing training details in § 6.1. We then compare attention variants on documents of the same length (§ 6.2) and study whether reading more text can generate more informative summaries (§ 6.3). We further report human evaluation on summary informativeness and faithfulness as well as automatic faithfulness scores (§ 6.4). Finally, we investigate whether automatic metrics correlate with human judgment (§ 6.5).

We fine-tune BART (Lewis et al., 2020) for all experiments. We implement our models with PyTorch (Paszke et al., 2019) and Fairseq (Ott et al., 2019). Additional position embeddings are initialized randomly for models that handle longer inputs. The learning rate is set to 1 × 10^-4, and learning rate warm-up is applied for the first 10,000 steps. The Adafactor optimizer (Shazeer and Stern, 2018) is used with gradient clipping of 0.1. All models are trained on two Quadro RTX 6000 GPUs with 24GB of memory or one Quadro RTX 8000 with 48GB of memory. We set a batch size of 2 per step and accumulate gradients every 32 steps. During testing, we adopt a beam size of 4 and a length penalty of 2 (Wu et al., 2016) on all datasets.

Comparisons. We first experiment with articles that are all truncated at 1024 tokens. For encoder attentions, we consider the following variants: (1) sliding WINDOW; (2) adaptive span (ADASPAN); (3) GLOBAL tokens; (4) STRIDE; (5) RANDOM tokens; (6) Linformer (LIN.); (7) locality-sensitive hashing (LSH); and (8) SINKHORN. We ensure models are comparable by setting hyperparameters to satisfy w = ŵ = k = l·b_l = 2·b_s = 256, so that models have similar memory complexity. For LSH attentions, we select l = 4 rounds of hashing. Following prior work (Zaheer et al., 2020), we combine GLOBAL, STRIDE, and RANDOM with WINDOW and ADASPAN, where we set g = n/s = r = 128 for a fair comparison. We adapt Linformer to encoder-decoder attentions to compare with HEPOS, where we use s_h = n/k = 4 for all experiments. Finally, we report results using FULL, i.e., the original encoder and encoder-decoder attentions.

Table 3: Results on evaluating encoder and encoder-decoder attentions on input of the same length. Best ROUGE scores of fixed patterns, learnable patterns, and enc-dec attentions are in red, orange, and purple, respectively. *: significantly better than comparison(s) using the same encoder or enc-dec attention (approximate randomization test, p < 0.0005).

Results. Among all encoder variants, learnable patterns perform the best, approaching the performance of full attentions on both GovReport and PubMed, as shown in Table 3. Within learnable patterns, Sinkhorn attention consistently obtains better ROUGE scores. Moreover, combining techniques in fixed patterns is more effective than simply using window-based sparse attentions, though with an increased memory cost. For encoder-decoder attentions, HEPOS consistently yields higher ROUGE scores than Linformer on both datasets, using either a full or a Sinkhorn encoder. Notably, coupled with a Sinkhorn attention, our model's performance matches the variant using full encoder attention, implying the effectiveness of HEPOS at both identifying the salient content and capturing the global context.

We investigate whether processing more words generates more informative summaries. Comparisons include recent top-performing abstractive models: PEGASUS (Zhang et al., 2019), a large pre-trained summarization model with truncated inputs; TLM (Pilault et al., 2020), DANCER (Gidiotis and Tsoumakas, 2020), and SEAL (Zhao et al., 2020), all of which use hybrid extract-then-abstract methods; and BIGBIRD (Zaheer et al., 2020), which combines sliding window, global, and random token attentions in the encoder.
For encoder variants, we pick the best-performing model from fixed patterns, i.e., sliding window with stride (STRIDE), the low-rank method (LIN.), and learnable patterns (LSH and SINKHORN), to be combined with full encoder-decoder attention. We then combine learnable patterns with HEPOS to support processing more text. All models consume as long an input as the memory allows.

Results. Overall, models that read more text obtain higher ROUGE scores, according to the results on GovReport and PubMed in Table 4. First, different encoder variants with full encoder-decoder attentions attain better results than the full-attention baseline, except for Linformer. Second, adding HEPOS encoder-decoder attention almost doubles the number of words that can be processed and further improves the performance. This highlights the importance of handling both encoder attentions and encoder-decoder attentions efficiently. Notably, HEPOS with an LSH encoder achieves new state-of-the-art results on PubMed, outperforming BIGBIRD, which only uses sparse attentions in the encoder. We also report the performance of our two best models with HEPOS on arXiv in Table 5, and they outperform all competitive abstractive models.

As can be seen from the sample summaries in Fig. 3, our model that reads in 10k tokens generates a more informative summary than the full attention model that only processes 1k tokens. Fig. 4 further shows that ROUGE-2 scores are consistently lifted when reading more input, with similar trends observed for ROUGE-1 and ROUGE-L. More sample outputs are presented in Appendix C.

Here we first show human evaluation results on informativeness and unfaithful errors in the generated summaries. We sample 100 documents from GovReport and PubMed (50 each) with structured references that are labeled with aspects as described in § 4 and Appendix D. Each sample is evaluated by two fluent English speakers, who have cumulatively annotated tens of thousands of sentences for the same tasks before this work. Annotators are asked to label each summary sentence with an aspect and then decide whether it contains any type of error. Three types of unfaithful errors are considered: (i) hallucination-fabricating content not present in the input, (ii) deletion-incorrectly deleting crucial entities, events, or clauses, and (iii) false concatenation-inappropriately concatenating components from different sentences. A score of 1 is given if any judge determines that a certain type of error exists in the sentence, and 0 otherwise. After reading the full summaries, each judge also scores aspect-level informativeness-whether the summary covers important information of an aspect when compared with the reference. All system summaries and references are presented in a random order. Human evaluation guidelines and sample summaries for different aspects are included in Appendix D.

Results. Overall, reading more text significantly improves informativeness and reduces fabricated content. From Table 6, we observe that HEPOS attention, combined with a SINKHORN encoder, obtains better informativeness scores than comparisons that read in less text on both datasets. This echoes the results from automatic evaluation in the previous section. Moreover, both models that use efficient attentions reduce unfaithfulness, especially hallucination errors, when compared with the full attention model, which only reads 1024 tokens. As the models read more content, they learn to surface more factual and richer content in the summaries, as seen in Fig. 3.

Next, we explore whether reading more helps correctly reflect the content in documents' later sections. We plot aspect-level human ratings of informativeness and unfaithful errors on PubMed and GovReport in Fig. 5 and Fig. 6. We report percentages of sentences with unfaithful errors by majority voting (i.e., at least one error is found by both annotators in the sentence). As can be seen, our models consistently improve informativeness and reduce errors across sections, especially for "Results" and "Conclusions" on PubMed and "What GAO recommends" on GovReport-these sections often appear in the later part of the source documents. In particular, we find that the full attention model tends to produce fabricated numbers in the resulting summaries, whereas our models are able to correct them.

Lastly, we report the entailment-based FactCC score and the QA scores APES and APES_src for top-performing models in Table 7. The results again show that consuming longer input leads to more faithful summaries, though the differences are less pronounced.

Finally, we study whether the faithfulness evaluation metrics correlate with human judgment. As shown in Table 8, on both government reports and scientific papers, QA metrics are better correlated with human ratings, with our newly proposed APES_src being the stronger of the two. After inspection, we find that human-written summaries contain paraphrases or acronyms that APES cannot capture via strict lexical matching. For instance, for the question "Diabetes may worsen in patients", the reference answer is "death rate", whereas the answers from the source and the system summary are both "mortality". APES_src captures this, but APES does not.

Summarizing long inputs has been investigated in many domains, including books (Mihalcea and Ceylan, 2007), patents (Trappey et al., 2009), movie scripts (Gorinski and Lapata, 2015), and scientific publications (Qazvinian and Radev, 2008). However, the datasets are often too small to train neural models. Cohan et al. (2018) publish two large-scale datasets by collecting articles from ARXIV and PUBMED. Popular methods rely on extractive summarizers that identify salient sentences based on positional information (Dong et al., 2020) or combined global and local contexts (Xiao and Carenini, 2019), where each sentence is represented as aggregated word embeddings. However, extractive summaries are often redundant and incoherent, highlighting the need for handling long documents via abstractive summarization. To that end, extract-then-abstract methods have been proposed. For example, Pilault et al. (2020) first extract relevant sentences and then rewrite them into paper abstracts.

Our work is in line with building end-to-end abstractive summarization models for long input. Cohan et al. (2018) design a hierarchical encoder to read different sections separately, and then use combined attentions over words and sections to generate the summary. Multiple agents are created to read segments separately and then collaboratively write an abstract (Celikyilmaz et al., 2018). However, both works truncate articles to 2K words.
Although efficient encoder attentions have been studied by Zaheer et al. (2020) for abstractive summarization, at most 3K tokens can be consumed by their models. Our HEPOS encoder-decoder attention is able to process more than 10K tokens, significantly improving summary informativeness and faithfulness.

We investigate efficient attentions for long document summarization. We propose a novel encoder-decoder attention, HEPOS, based on head-wise positional strides that can effectively identify salient content. Models based on HEPOS attention can process at least twice as many words and produce more informative summaries with fewer unfaithful errors, according to both automatic and human evaluation. We further show that our new cloze QA metric correlates better with human judgment than prior faithfulness evaluation metrics.

We collect CRS reports published before May 20, 2020 from EveryCRSReport, where the original PDF files are already parsed into HTML. We only keep documents with expert-written summaries. We then gather the text from the HTML files.

FactCC Training Data Construction. Kryscinski et al. (2020) generate training data by applying rule-based transformations to sentences from source documents. We leverage reference summaries instead: we train a FactCC model by reading a summary sentence (i.e., the claim) and a context to predict the corresponding label. A context is constructed by greedily selecting sentences that maximize the improvement of ROUGE-2 when compared against the reference summary sentence. Following FactCC, we apply sentence negation, entity swap, and number swap to summary sentences to construct negative claims, and use the original sentences as positive claims. During testing, we first find the context for each system summary sentence. The model then predicts a sentence-level faithfulness score by reading the system summary sentence and the context.

We fine-tune BERT (Devlin et al., 2019) for both the FactCC and QA models. We include an additional classification head to predict the entailment label or answer spans based on the [CLS] token. For the GovReport dataset, we use the uncased base version of BERT. For PubMed, we use a BERT model fine-tuned on PubMed abstracts to obtain better performance.

Entity Extraction Model. We use OneIE to extract entities from the reference summary (Lin et al., 2020). OneIE is a unified framework that combines entity, relation, and event extraction in one model. The model leverages pre-trained BERT weights for sentence embeddings to produce entities, relations, and events from a sentence. Two OneIE models are built. The first model, for government reports, is trained on the Automatic Content Extraction (ACE) 2005 dataset (Walker et al., 2006); this model can extract entities from general conversation contexts. The second model extracts the domain-specific entity types listed in Table 9. To train this model, we fine-tune the BioBERT pre-trained model (Lee et al., 2020) on the COVID-19 Open Research (CORD-19) dataset (Wang et al., 2020b); this model is applied to the PubMed data.

We include two samples from GovReport and PubMed to further illustrate that our model with HEPOS attention generates more faithful and informative summaries, in Fig. 7 and Fig. 8.
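The greedy context selection used above for both the QA and FactCC models (and in § 5) can be sketched as follows. This is our illustration rather than the authors' code; rouge2_recall is a simplified stand-in for a standard ROUGE implementation, and the sentence cap is an assumption added for the example.

```python
# Sketch of greedy context selection (ours, not the authors' code): repeatedly add
# the source sentence that most improves ROUGE-2 recall against the target
# (reference or claim) sentence, stopping when no sentence helps.
from collections import Counter
from typing import List

def rouge2_recall(candidate: str, reference: str) -> float:
    def bigram_counts(text: str) -> Counter:
        toks = text.lower().split()
        return Counter(zip(toks, toks[1:]))
    ref, cand = bigram_counts(reference), bigram_counts(candidate)
    if not ref:
        return 0.0
    return sum((ref & cand).values()) / sum(ref.values())

def select_context(source_sentences: List[str], target_sentence: str,
                   max_sentences: int = 5) -> List[str]:
    selected, remaining = [], list(source_sentences)
    best_score = 0.0
    while remaining and len(selected) < max_sentences:
        scored = [(rouge2_recall(" ".join(selected + [s]), target_sentence), s)
                  for s in remaining]
        score, sent = max(scored, key=lambda x: x[0])
        if score <= best_score:          # no further improvement
            break
        best_score = score
        selected.append(sent)
        remaining.remove(sent)
    return selected
```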
In human evaluation, annotators are asked to evaluate the system summaries generated for a report or a paper. In addition to the summaries, annotators are provided with the report or the paper to be summarized and a corresponding human-written reference. Human judges evaluate each system summary sentence by sentence. The annotation consists of three tasks, which are described below.

Task 1: Aspect Labeling. First, annotators are asked to decide which aspect each sentence belongs to. For government reports, each sentence should be categorized into three aspects: (1) Why GAO did this study, (2) What GAO found, and (3) What GAO recommends. For scientific papers, summaries have four aspects: (1) Introduction and Literature, (2) Methods, (3) Results, and (4) Discussion and Conclusion. Table 10 and Table 11 contain example reference summaries with labeled aspects.

Task 2: Sentence-level Faithfulness Error Labeling. Next, annotators judge whether each sentence contains any unfaithful content. Unfaithful content is categorized into three types. A "0" or "1" label is given for each type, where "0" indicates the sentence is free of that type of error, and "1" otherwise. Concretely, unfaithful content is fabricated or contradictory content that is not present in, or contradicts, the facts in the source article. It can also be an ambiguous expression that distorts the meaning. Here are detailed descriptions of the three types of errors:

• Hallucination error refers to fabricated content that cannot be found or inferred from the source.
• Misconstruction error due to deletion of entities, events, or clauses, resulting in sentences that are incomplete, missing context, or ungrammatical.
• Misconstruction error caused by false concatenation of content from different places in the source.

Task 3: Aspect-level Summary Quality Rating. After reading the full summary, annotators evaluate the informativeness of the summary for each aspect-whether the summary provides necessary and sufficient coverage of the information in the reference, for instance, whether the summary covers all the salient points in "Why GAO did this study". Here are detailed descriptions of the informativeness ratings:

• 5: Summary covers enough key points in the reference (only misses minor topics), and is free of unfaithful errors.
• 4: Summary covers major key points (e.g., 80 percent) and may miss one or two key points in the reference. Summary can contain one unfaithful error.
• 3: Summary covers roughly half of the key points in the reference, or contains 2 or 3 unfaithful errors.
• 2: Summary only covers 1 or 2 key points and misses many important topics (e.g., > 80 percent) in the reference, or contains more than 3 major unfaithful errors, e.g., the summary fabricates or distorts some facts.
• 1: Summary is irrelevant and does not cover any content in the reference.

What GAO found: The Food and Drug Administration (FDA) and U.S. Department of Agriculture (USDA) have begun collaborating on regulatory oversight of cell-cultured meat. For example, in 2019, the agencies signed an interagency agreement and created three working groups to carry out the terms of the agreement. However, the agreement and working groups could more fully incorporate practices to enhance and sustain collaboration, such as defining outcomes. For example, the agreement identifies the development of labeling principles as an outcome, but does not describe how the agencies will track and monitor progress toward this outcome, and the working groups identify a lead agency but not members' roles. (...) Developing and updating written guidance and agreements is also a leading practice for interagency collaboration.
By fully incorporating leading practices into their efforts to collaborate, the agencies could minimize potential overlap and fragmentation, use resources in a more efficient manner, and better ensure the public and other key stakeholders have clarity about the agencies' oversight responsibilities. (...)

Model w/ full attn.: The frequency of inspections the agencies conduct varies, as follows: FDA. FDA conducts risk-based inspections of high-risk and non-high-risk food facilities. FDA's authority requires a risk-based approach, in which inspection rates vary depending on the level of risk associated with a food product. (...) In 2010, FDA began conducting inspections of meat slaughter facilities to determine whether they were complying with the agency's longstanding interpretation of its statutes requiring it to do so. In addition, USDA conducted inspections of slaughter facilities that process catfish to determine if they are complying with USDA's requirement to inspect catfish slaughter facilities. USDA also conducts inspections of processing facilities to verify that they are following USDA's requirements to inspect these facilities. In 2011, USDA began inspecting processing facilities for catfish for the first time. (...)

Model w/ HEPOS enc-dec attn.: What GAO found The Departments of Health and Human Services (HHS) and Agriculture have taken several steps to coordinate their oversight of the development, commercialization, and regulation of cell cultured meat and seafood, but they have not fully incorporated leading practices for interagency collaboration. For example, HHS and USDA have established interagency working groups that have met to discuss issues such as coordination and coordination. However, the agencies have not yet fully incorporated some leading practices, such as establishing clear roles and responsibilities, into their interagency agreements. (...)

Figure 7: Sample summaries for a government report. The model with truncated input generates unfaithful content. Our HEPOS encoder-decoder attention with Sinkhorn encoder attention covers more salient information in the "What GAO found" aspect.

Table 11: Aspects, keyword choices, and example reference sentences for PubMed summaries (Appendix D).

Introduction and Literature. Keywords: introduction, case, objectives, purposes, objective, purpose, background, literature, related work. Examples: "background : the present study was carried out to assess the effects of community nutrition intervention based on advocacy approach on malnutrition status among school-aged children in shiraz , iran ." / "introduction . low serum vitamin d levels are associated with increased postural sway . vitamin d varies seasonally . this study investigates whether postural sway varies seasonally and is associated with serum vitamin d and falls ."

Methods. Keywords: materials and methods, techniques, methodology, materials, research design, study design. Example: "materials and methods : this case-control nutritional intervention has been done between 2008 and 2009 on 2897 primary and secondary school boys and girls ( 7-13 years old ) based on advocacy approach in shiraz , iran . the project provided nutritious snacks in public schools over a 2-year period along with advocacy oriented actions in order to implement and promote nutritional intervention . for evaluation of effectiveness of the intervention growth monitoring indices of pre- and post-intervention were statistically compared ."

Results. Keywords: results, experiments, observations. Example: "results : the frequency of subjects with body mass index lower than 5% decreased significantly after intervention among girls ( p = 0.02 ) . however , there were no significant changes among boys or total population . (...)"

Discussion and Conclusion. Keywords: discussion, limitation, conclusions, concluding. Example: "conclusion : this study demonstrates the potential success and scalability of school feeding programs in iran . community nutrition intervention based on the advocacy process model is effective on reducing the prevalence of underweight specifically among female school aged children ."
References

Longformer: The long-document transformer
BillSum: A corpus for automatic summarization of US legislation
Evaluating the factual consistency of abstractive text summarization
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
Set Transformer: A framework for attention-based permutation-invariant neural networks
BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension
ROUGE: A package for automatic evaluation of summaries
A joint neural model for information extraction with global features
Generating Wikipedia by summarizing long sequences
Text summarization with pretrained encoders
On faithfulness and factuality in abstractive summarization
Explorations in automatic book summarization
fairseq: A fast, extensible toolkit for sequence modeling
PyTorch: An imperative style, high-performance deep learning library
On extractive and abstractive neural document summarization with transformer language models
Scientific paper summarization using citation summary networks
BIGPATENT: A large-scale dataset for abstractive and coherent summarization
Adafactor: Adaptive learning rates with sublinear memory cost
Adaptive attention span in transformers
Efficient transformers: A survey
Automatic patent document summarization for collaborative knowledge systems and services
Attention is all you need
Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned
ACE 2005 multilingual training corpus
Asking and answering questions to evaluate the factual consistency of summaries
Linformer: Self-attention with linear complexity
PubTator Central: automated concept annotation for biomedical full text articles
Regression Analysis (Wiley Series in Probability and Statistics: Applied Probability and Statistics Section)
Google's neural machine translation system: Bridging the gap between human and machine translation
Extractive summarization of long documents by combining global and local context
Big Bird: Transformers for longer sequences
PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization

Original Paper: (...) anatomical abnormalities included complex pelvicalyceal anatomy in 98, pelviureteric junction obstruction (pujo) in 35, ectopic, horseshoe, cross-fused ectopic kidneys in 40, and duplex system with nonfunctioning upper or lower moiety in 14 patients. (...) of the 1404 patients from rural area, 547 (39%) were managed by open surgery as compared to 362 (22%) of the patients from urban areas. (...) the mean size of the kidney stone was 5.05 5.88 cm, ureter 1.95 1.33 cm, and bladder 7.6 3.2 cm. (...) overall success rate was 90% in pcnl, 96% in urs, 100% in cystolithotripsy, and 81% in swl.

Summary by Human (Results): of 3969 surgeries, 2794 (70%) were minimally invasive surgery (mis) techniques to include swl 19%, pcnl 16%, urs 18.9%, and pucl + pccl 16%, and 1175 (30%) were open surgeries.
the main factors necessitating open surgery were large stone burden (37%). Stone-free rates by open surgeries were pyelolithotomy 91%, ureterolithotomy 100%, and cystolithotomy 100%, with a complication rate of up to 3%.

Model w/ full attn. (Results): results: the mean age of the patients was 6.5 years. the mean follow-up period was 12.5 months. mean stone size was 1.5cm. the majority of the stones were lager stones (n=454). the stone free rate was 100%. open surgery was performed in children with large bulk of stones, anatomical abnormalities 7 years with a male-to-female ratio of 2.8:1, and a range of 25 days (<1 month) to 15 years. anatomical abnormalities included complex pelvicalyceal anatomy in 98, pelviureteric junction obstruction (pujo) in 35, ectopic, horseshoe, cross-fused ectopic kidneys in 40, and duplex system with nonfunctioning upper or lower moiety in 14 patients. the mean size of the kidney stone was 5.05 5.88 cm3. of the 1404 patients from rural areas, 547 (39%) were managed by surgery as compared to 362 (22%) patients from urban areas. overall success rate was 90% in pcnl

Figure 8: The model with truncated input generates fabricated facts. Our HEPOS encoder-decoder attention with LSH encoder attention is more faithful for the "Results" aspect.

This research is supported in part by Oracle for Research Cloud Credits, the National Science Foundation through Grant IIS-1813341, and by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via contract # FA8650-17-C-9116. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein. We thank three anonymous reviewers for their valuable suggestions and comments.

For GAO reports, the summaries are organized as highlights. We collect GAO reports that include corresponding highlights and were published before Jul 7, 2020. The reports and highlights are published as PDF files. Most of the highlights are also reorganized and shown on the web page as HTML. Since PDF parsing is more prone to errors than web parsing, we only keep the reports whose highlights can be obtained from the corresponding web page, to ensure the quality of the extracted gold-standard summaries. For the reports, we first convert the PDF files to HTML using PDFMiner. We then parse the HTML text into sections and paragraphs with handcrafted parsing rules. We remove the reports that do not have cover pages, as our rules are constructed for documents with cover pages. We further remove parsed documents with empty sections, non-capitalized section titles, or a single section, since these are common patterns of incorrectly parsed documents. Failed parsing would also result in short documents. Therefore, we examine the reports with shorter lengths and then filter out the 10% shortest reports.

Table 10: Example reference summary for a GAO report with labeled aspects (Appendix D).

Why GAO Did This Study: To protect data that are shared with state government agencies, federal agencies have established cybersecurity requirements and related compliance assessment programs. Specifically, they have numerous cybersecurity requirements for states to follow when accessing, storing, and transmitting federal data. GAO was asked to evaluate federal agencies' cybersecurity requirements and related assessment programs for state agencies.
The objectives were to determine the extent to which (...)

What GAO Found: Although the Centers for Medicare and Medicaid Services (CMS), Federal Bureau of Investigation (FBI), Internal Revenue Service (IRS), and Social Security Administration (SSA) each established requirements to secure data that states receive, these requirements often had conflicting parameters. Such parameters involve agencies defining specific values like the number of consecutive unsuccessful logon attempts prior to locking out the user. Among the four federal agencies, the percentage of total requirements with conflicting parameters ranged from 49 percent to 79 percent. Regarding variance with National Institute of Standards and Technology guidance, GAO found that the extent to which the four agencies did not fully address guidance varied from 9 percent to 53 percent of total requirements. The variances were due in part to the federal agencies' insufficient coordination in establishing requirements. (...)

What GAO Recommends: GAO is making 12 recommendations to the four selected agencies and to OMB. Three agencies agreed with the recommendations and one agency (IRS) partially agreed or disagreed with them. OMB did not provide comments. GAO continues to believe all recommendations are warranted.