key: cord-0461129-965bqcmr
authors: Espejel, Jessica López; Chalendar, Gaël de; Flores, Jorge García; Charnois, Thierry; Ruiz, Ivan Vladimir Meza
title: GeSERA: General-domain Summary Evaluation by Relevance Analysis
date: 2021-10-07
sha: 8754d4f18037150f2ea1c025ce2b8ec232bdd687
doc_id: 461129
cord_uid: 965bqcmr

Abstract

We present GeSERA, an open-source improved version of SERA for evaluating automatic extractive and abstractive summaries from the general domain. SERA is based on a search engine that compares candidate and reference summaries (called queries) against an information retrieval document base (called index). SERA was originally designed for the biomedical domain only, where it showed a better correlation with manual methods than the widely used lexical-based ROUGE method. In this paper, we take SERA out of the biomedical domain into the general one by adapting its content-based method to successfully evaluate summaries from the general domain. First, we improve the query reformulation strategy with a POS-tag analysis of general-domain corpora. Second, we replace the biomedical index used in SERA with two article collections from AQUAINT-2 and Wikipedia. We conduct experiments with the TAC2008, TAC2009, and CNNDM datasets. Results show that, in most cases, GeSERA achieves higher correlations with manual evaluation methods than SERA, while it reduces its gap with ROUGE for general-domain summary evaluation. GeSERA even surpasses ROUGE in two cases on TAC2009. Finally, we conduct extensive experiments and provide a comprehensive study of the impact of human annotators and of the index size on summary evaluation with SERA and GeSERA.

1 Introduction

Automatic summary evaluation is a challenging task in Natural Language Processing (NLP). Evaluation is usually done by humans, but manual evaluation is subjective, costly, and time-consuming (Lin and Hovy, 2002). Automatic evaluation methods (Lin, 2004a; Torres-Moreno et al., 2010; Zhao et al., 2019; Zhang et al., 2020) are an alternative that saves time for users who extract the most relevant content from the web using Automatic Text Summarization (ATS) systems. There are two types of evaluation approaches: (1) manual evaluation methods such as Pyramid (Nenkova and Passonneau, 2004) and Responsiveness, where human intervention is mandatory, and (2) automatic evaluation methods, where human intervention may be needed as a ground-truth reference (Lin, 2004a; Cohan and Goharian, 2016) or not (Torres-Moreno et al., 2010; Cabrera-Diego and Torres-Moreno, 2018).

Summary Evaluation by Relevance Analysis (SERA) (Cohan and Goharian, 2016) is an automatic evaluation method that partially relies on human references to evaluate abstractive summaries from the biomedical domain. It was proposed as an alternative to ROUGE (Lin, 2004a), the widely used automatic metric, which is based on lexical overlaps between candidate and reference summaries. ROUGE is unfair when evaluating abstractive summaries, where the ATS paraphrases the text instead of just copying and pasting chunks of it (Cohan and Goharian, 2016). SERA is based on content relevance and is fairer to abstractive summaries because it attributes high scores to summaries that are lexically different but semantically related. However, it surpasses ROUGE in the biomedical domain only.

In this paper, we modify SERA to make it usable for generic collections. We propose the following contributions:

1. Implement an open-source version of SERA from scratch.
2. Propose GeSERA (General-domain SERA), an improved version of SERA that is domain-independent.

3. Conduct extensive experiments with two large indexes (AQUAINT-2 (Graff, 2002) and Wikipedia) and three summarization datasets (TAC2008, TAC2009, and CNN-Daily Mail (Bhandari et al., 2020)). These datasets are well-suited for general-domain and news abstractive summary evaluation. GeSERA achieves competitive results compared to a range of state-of-the-art evaluation approaches on both abstractive and extractive summaries.

4. Make the code and our Wikipedia dataset publicly available to help future researchers.

5. Conduct extensive experiments on the impact of the index size and of human annotators on summary evaluation with SERA and GeSERA.

2 Proposed approach

2.1 Baseline: SERA

SERA (Cohan and Goharian, 2016) is based on content relevance between a candidate summary and its corresponding human-written reference summaries, using information retrieval. SERA compares these summaries (called queries) against a set of documents from the same domain (called index), and compares the overlap of the retrieved results. SERA refines queries in three different manners: (1) Raw text, where only stop words and numbers are removed; (2) Noun phrases (NP), where only noun phrases are kept; and (3) Keywords (KW), where only unigrams, bigrams, and trigrams are kept. SERA is defined in Equation 1:

SERA = \frac{1}{M} \sum_{i=1}^{M} \frac{|R_C \cap R_{G_i}|}{|R_C|}    (1)

where R_C is the list of retrieved documents for the candidate summary C, R_{G_i} is the ranked list of retrieved documents for the reference summary G_i, and M is the number of reference summaries. Another variant of SERA, called SERA-DIS, takes the order of the retrieved documents into consideration (Equation 2 in Cohan and Goharian (2016)): matches are discounted according to their position in the ranked lists, where R_C^{(j)} denotes the j-th result in the ranked list R_C, and D_max is the maximum achievable score, used for normalization. In both SERA variants, retrieved results are truncated at 5 and 10 documents (hence the notations SERA-5 and SERA-10 in Section 3).

Cohan and Goharian (2016) used articles from PubMed as the index, and summaries from TAC 2014 as queries. The intuition behind SERA is that a summary's context is represented by its most related articles. Thus, two summaries related to the same documents are semantically related, even if they are lexically different. Consequently, SERA is fairer than the lexical-based ROUGE for evaluating abstractive summaries. However, SERA suffers from a series of limitations: (1) the code is not open-source, (2) no information was provided concerning the subset of PubMed used as the index, and (3) PubMed is specialized in the biomedical domain only. The first two drawbacks make SERA unusable by the community, while the third restricts its usage to the biomedical domain.

We build on SERA's merits and limitations to propose GeSERA, an open-source version of SERA that evaluates summaries from the general domain. The novelties of GeSERA are the index pool and the query reformulation, both adapted to the evaluation of general-domain summaries.

Index documents pool - Differently from SERA, GeSERA enables general-domain summary evaluation. It is thus necessary to replace the biomedical index used by Cohan and Goharian (2016) with a set of documents related to the general domain. We build several indexes using a varying number of articles from Wikipedia and AQUAINT-2 (see Subsection 3.1 for more details).
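To make the retrieval-overlap computation concrete, the snippet below is a minimal Python sketch of the core SERA/GeSERA score, following the overlap formulation in Equation 1. It assumes an index has already been built with the Whoosh search engine (introduced in the "Search engine" paragraph below) over the document collection, with a stored doc_id field and a content field; the function and field names are ours for illustration, not taken from the original implementation.

```python
from whoosh import index, scoring
from whoosh.qparser import OrGroup, QueryParser


def retrieve_ids(searcher, schema, query_text, k):
    """Return the IDs of the top-k documents retrieved for a (reformulated) query."""
    parser = QueryParser("content", schema, group=OrGroup)  # OR semantics: any query term may match
    results = searcher.search(parser.parse(query_text), limit=k)
    return [hit["doc_id"] for hit in results]


def sera_score(index_dir, candidate, references, k=10):
    """Average overlap between the result lists of the candidate and of each reference summary."""
    ix = index.open_dir(index_dir)
    with ix.searcher(weighting=scoring.BM25F()) as searcher:
        r_c = retrieve_ids(searcher, ix.schema, candidate, k)
        scores = []
        for ref in references:
            r_g = retrieve_ids(searcher, ix.schema, ref, k)
            scores.append(len(set(r_c) & set(r_g)) / max(len(r_c), 1))
    return sum(scores) / len(scores)  # average over the M reference summaries
```

Truncating k at 5 or 10 corresponds to the -5 and -10 variants (SERA-5, SERA-10, GeSERA-5, GeSERA-10) reported in Section 3.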
Query reformulation (QR) - Query reformulation improves the retrieval process by removing unnecessary terms from the text. We therefore propose a different approach to refine queries in GeSERA, one that is better suited to general-domain summaries. According to Kieuvongngam et al. (2020), nouns in generated summaries represent the information conveyed by the original abstracts more accurately than other POS tags. That study was conducted on COVID-19 medical texts, and it can explain why SERA achieved a higher correlation than ROUGE on the TAC 2014 biomedical dataset. We conducted an analysis of the Part-Of-Speech (POS) tag distributions of PubMed (a biomedical dataset built by Cohan et al. (2018)), AQUAINT-2 (a news corpus), and Wikipedia (a general-domain encyclopedia). Figure 1 shows bar plots of the percentages of nouns, verbs, adjectives (Adj.), prepositions (Prep.), and the total percentage of other tags (Others). Our analysis of three datasets describing different domains confirms the observation of Kieuvongngam et al. (2020) for PubMed. However, it shows that the percentages of verbs and adjectives are higher in AQUAINT-2 and Wikipedia than in PubMed. Equally important, there is a remarkable absence of prepositions in Wikipedia and AQUAINT-2. Based on our analysis, we propose to reformulate queries in GeSERA by keeping only the tokens tagged as nouns, verbs, and adjectives, the three most frequent tags in the news and general-domain corpora (see the sketch at the end of this section).

Search engine - A semantic-based retrieval approach is crucial when handling abstractive summaries. In order to compare the queries against the index, a search engine is needed for information retrieval and scoring. We use the Whoosh search engine with the BM25F (Best Match 25 Model with Extension to Multiple Weighted Fields) ranking function (Zaragoza et al., 2004). This model is widely used for semantic search (Pérez-Agüera et al., 2010; Robertson and Zaragoza, 2009). It weights terms according to the importance of their fields, combines them, and uses the resulting pseudo-frequencies for ranking.

SERA was developed in the context of scientific biomedical article summarization, with the idea that its semantic specificity is particularly useful for this domain. We hypothesize that if we reformulate queries properly and change the index pool, SERA can assess summaries from other domains, for both abstractive and extractive summarization. This hypothesis is based on the fact that SERA considers terms that are not lexically equivalent but are semantically related. We conduct extensive experiments on SERA and GeSERA to test this hypothesis.
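As a minimal sketch of the POS-based query reformulation described above, the snippet below keeps only tokens tagged as nouns, verbs, or adjectives. The paper does not state which tagger was used; here we assume NLTK's default Penn-Treebank tagger, and the function name reformulate_query is ours.

```python
import nltk
from nltk import pos_tag, word_tokenize

# One-time downloads for the tokenizer and tagger models.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

# Penn-Treebank tag prefixes for nouns, verbs, and adjectives.
KEPT_PREFIXES = ("NN", "VB", "JJ")


def reformulate_query(text):
    """Keep only nouns, verbs, and adjectives, as in GeSERA's query reformulation."""
    tagged = pos_tag(word_tokenize(text))
    return " ".join(tok for tok, tag in tagged if tag.startswith(KEPT_PREFIXES))


print(reformulate_query("The committee approved a new budget for regional schools."))
# e.g. -> "committee approved new budget regional schools"
```

The reformulated string is what is handed to the search engine as the query in the retrieval sketch shown earlier.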
The index is a key component of the GeSERA approach insofar as it should describe the same domain as the queries. The number of documents in the index is also decisive, as we will show in Subsection 4.1. We briefly describe the query and index datasets hereafter and provide more information in the supplementary material.

• AQUAINT-2 (Graff, 2002) is a news corpus built from various sources. We vary the size of the index to include I = {10000, 15000, 30000, 60000, 89760, 179520, 825148} randomly selected documents.

• Wikipedia is a free encyclopedia that contains various information from the general domain. We vary the size of the index to include I = {10000, 15000, 30000, 1778742} randomly selected documents.

Candidate and reference summaries from the news datasets TAC2008 and TAC2009, and from the CNN Daily Mail (CNNDM) version published by Bhandari et al. (2020), are used as queries. TAC2008 and TAC2009 are subsets of AQUAINT-2. They contain 5568 and 4840 candidate summaries, proposed by 58 and 55 participants, and 384 and 352 reference summaries, respectively. These datasets are designed for multi-document extractive summarization (one summary is shared by a set of documents). CNN Daily Mail (Bhandari et al., 2020) is a news dataset and is of great interest to us because it contains candidate summaries obtained from both extractive and abstractive systems. It consists of 100 reference summaries, each having 25 candidate summaries generated by 11 extractive and 14 abstractive systems. It is designed for mono-document abstractive and extractive summarization (one summary for each document).

We compare GeSERA with some of the most influential evaluation metrics from the literature:

• ROUGE and SERA - two automatic evaluation approaches that rely on human intervention. ROUGE has many variants, but we only report the most popular ones: ROUGE-N (N = {1, 2} is the n-gram size) and ROUGE-L (Longest Common Subsequence). For each variant, we report the F-score, Recall, and Precision.

• MoverScore (Zhao et al., 2019) and BERTScore (Zhang et al., 2020). More information is in Section 5. Implementation details are in the supplementary material.

To compare GeSERA with other state-of-the-art methods, we measure the correlation between the scores provided by the automatic evaluation methods and those provided by the manual ones. The manual evaluation approaches used here are:

• Pyramid (Nenkova and Passonneau, 2004) exploits the content distribution in human summaries using Summary Content Units (SCUs), based on their frequency in the summary corpus.

In Subsection 4.1, we vary the index size in SERA and GeSERA and study the variation of their performance on the TAC datasets by averaging the scores of the four manual annotators A1, A2, A3, and A4. Once we determine the best index size, we present in Subsection 4.2 the correlations obtained with the best index size of each method. We first present the impact of the index sizes built from Wikipedia on SERA and GeSERA, then of the ones built from AQUAINT-2.

Wikipedia Index - We first report the Pearson correlations of SERA and GeSERA with Pyramid when indexing different values of I from the Wikipedia dataset, using TAC2008 and TAC2009 as query datasets, respectively. The figures show that the best score (0.902) is obtained with SERA-DIS-NP-10 using I = 30,000 for TAC2008, while the best one (0.957) for TAC2009 is obtained with GeSERA-DIS-10 with I = 10,000. We thus use in Subsection 4.2 an index size of I = 30,000 for TAC2008 and I = 10,000 for TAC2009. Surprisingly, the worst scores are obtained with I = 1,778,742, the largest and most diversified index size, corresponding to all documents from our Wikipedia corpus.

AQUAINT-2 Index - Figures 2-a and 2-c show the Pearson correlation coefficients of SERA and GeSERA with Pyramid when indexing different values of I from the AQUAINT-2 dataset, using TAC2008 and TAC2009 as query datasets. Similarly to Wikipedia, the figures show that, overall, the best results are obtained with small index sizes. The best score (0.928) is obtained with GeSERA-5 using I = 15,000 for TAC2008, while the best one (0.947) for TAC2009 is obtained with both SERA-DIS-NP-10 and GeSERA-DIS-5 using I = 179,520. Note that the results with I = 10,000 are comparable with those obtained with I = 179,520 for TAC2009. We thus use in Subsection 4.2 an index size of I = 15,000 for TAC2008 and I = 179,520 for TAC2009. Once again, the lowest results are obtained with the full AQUAINT-2 corpus, corresponding to I = 825,148.
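The numbers reported here are correlations between the scores of an automatic metric and those of a manual method over the same set of summaries. Below is a minimal sketch of that computation; the score arrays are illustrative placeholders, not values from the paper.

```python
from scipy.stats import kendalltau, pearsonr, spearmanr

# Illustrative per-system scores; in the paper these would be, e.g., GeSERA-5
# scores paired with averaged Pyramid scores for the same systems.
metric_scores = [0.41, 0.37, 0.52, 0.48, 0.33]
pyramid_scores = [0.55, 0.49, 0.68, 0.61, 0.44]

print("Pearson:  %.3f" % pearsonr(metric_scores, pyramid_scores)[0])
print("Spearman: %.3f" % spearmanr(metric_scores, pyramid_scores)[0])
print("Kendall:  %.3f" % kendalltau(metric_scores, pyramid_scores)[0])
```

In the paper, scores are first averaged per participant (see the implementation details in the supplementary material) before the correlations with the manual methods are computed.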
We use the best index sizes from Subsection 4.1 to report detailed results with the best variants of each method from Subsection 3.3. More results are reported in the supplementary material.

While ROUGE-2-R and ROUGE-3-F provide the best results for all correlation measures on TAC2008, GeSERA and SERA largely surpass the scores of SummTriver and FRESA with both Pyramid and Responsiveness. GeSERA-5, for instance, achieves higher correlations than ST-JS-T_m, the best variant of SummTriver, by 0.039, 0.097, and 0.117 for Pyramid, and by 0.049, 0.053, and 0.055 for Responsiveness. Finally, the FRESA baseline achieves the lowest correlation scores in all configurations. The performance of SummTriver and FRESA is not surprising insofar as they do not rely on any human reference.

Table 1 (bottom) shows the correlation coefficients of SERA, GeSERA, ROUGE, SummTriver, and FRESA with two manual evaluation approaches, Pyramid and Responsiveness. Once again, we fix the query dataset to TAC2009, while we vary the index between AQUAINT-2 and Wikipedia.

AQUAINT-2 Index - ROUGE provides the highest scores against SERA and GeSERA when we index documents from AQUAINT-2. Importantly, GeSERA-DIS-5 and GeSERA-5 achieve higher correlations than SERA with Pyramid and Responsiveness, respectively. Note that the scores of SERA vary more across its variants, while the results of GeSERA are more stable and the best ones are obtained with only two of its variants. This finding highlights the robustness of our approach against variations of configuration.

Wikipedia Index - SERA and GeSERA surpass ROUGE against the Pyramid manual method in terms of Pearson correlation when indexing documents from Wikipedia. The best correlations are obtained by GeSERA-DIS-10 and SERA-10 against Pyramid and Responsiveness, respectively. Interestingly, for TAC2009, GeSERA-DIS-10 achieves a better Pearson correlation than ROUGE with Pyramid, as does GeSERA-10 with Responsiveness. This finding proves the effectiveness of GeSERA for evaluating summaries from the general domain. Equally, GeSERA reduces the gap between SERA and ROUGE in most of the other cases.

SummTriver achieves reasonably good results in Table 1 even without the use of any human reference. This baseline is useful when human summaries are costly or hard to find. However, when such references are available, SummTriver does not take advantage of them, which leaves its correlation low compared to human-based evaluation approaches such as ROUGE and SERA. FRESA shows the lowest scores among the evaluation approaches tested here. It drops by approximately 0.1 to 0.3 points compared to the lowest results obtained by SERA. This is mainly because FRESA is based only on the divergence between the evaluated summary and its source documents, without including any comparison with summaries generated by other participants, as SummTriver does. Thus, FRESA is barely correlated with manual evaluation, and in many cases the correlation gets very close to zero (for instance, FRESA-2 with TAC2009 using Kendall correlation). Note that SummTriver and FRESA have mostly negative correlations because they are based on a divergence measure, which increases when the summary's quality is low and decreases when its quality is high.

Based on the results obtained in Subsection 4.1 regarding the effectiveness of SERA and GeSERA with small index sizes, we decided to index I = 10,000 documents from Wikipedia to run the two methods on CNNDM.
Results are in Table 2. It shows that the highest correlations of ROUGE are obtained with ROUGE-2-R, followed by ROUGE-1-R. Globally, the highest correlations of ROUGE are obtained with the recall metric (ROUGE-R), followed by ROUGE-F, and finally by ROUGE-P. The next highest correlations are obtained with GeSERA-10. Once again, GeSERA surpasses all the SERA variants, as well as the state-of-the-art methods presented in this table. Although SERA-KW-10 has the best score among the SERA variants in terms of Pearson and Kendall, all SERA variants present very similar scores. Behind the SERA method, the BERTScore and JS-2 measures present very similar scores, while MoverScore shows the lowest correlations. These results show the effectiveness of GeSERA for evaluating extractive and abstractive summaries, since CNNDM contains both approaches.

For the sake of comparability with state-of-the-art approaches, we presented in the previous section correlations computed with the four manual annotators. According to Lin and Hovy (2002), human evaluation is subjective. We confirm this finding experimentally and highlight that human annotators affect the performance of automatic evaluation approaches. To know how much each annotator can affect the correlation with human evaluations, and which annotator yields the lowest and highest correlations, we compute scores using different combinations of human annotators: with each human annotator individually, with combinations of three human annotators ((A1, A2, A3), (A1, A2, A4), and (A2, A3, A4)), and finally with all four human annotators (A1, A2, A3, A4). Note that we only report here the results on the TAC2009 dataset, insofar as the results on TAC2008 are not conclusive: the best scores vary considerably from one annotator to another depending on the configuration. We present the results on TAC2008 in the supplementary material.

Figure 3 provides the SERA and GeSERA correlations with Pyramid using TAC2009 as the query dataset and AQUAINT-2 and Wikipedia as indexes. The results show that the best human annotator is always A1, as the summaries he provides yield the best correlations of SERA and GeSERA in terms of Pearson, Spearman, and Kendall. Inversely, the worst human annotator for SERA is always A3, as he yields the worst scores in terms of all the correlation metrics used here. For GeSERA, the worst human annotator is A3 for Wikipedia in terms of all correlation metrics, while it is A4 for AQUAINT-2 in terms of Spearman and Kendall, and A3 in terms of Pearson.

In Table 3, we compare the results obtained with the four manual annotators with those obtained with the best three annotators (A1, A2, A4) for TAC2009. The results show a clear gain when discarding the most unreliable annotator. We conclude that human annotators partly condition the quality of automatic summary evaluation. This bias is caused by the quality of their manually written summaries.
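A minimal sketch of the annotator-combination analysis above is given below: for each subset of annotators, the manual scores are averaged and then correlated with the automatic scores. Variable names and the toy score values are ours, and the sketch enumerates all triples, while the paper reports three of the four possible ones.

```python
from itertools import combinations
from statistics import mean

from scipy.stats import pearsonr

# Toy data: automatic scores for a handful of systems, and Pyramid scores
# given by four annotators (A1..A4) for the same systems.
auto_scores = [0.41, 0.37, 0.52, 0.48, 0.33]
annotator_scores = {
    "A1": [0.56, 0.48, 0.70, 0.62, 0.40],
    "A2": [0.52, 0.50, 0.66, 0.60, 0.45],
    "A3": [0.60, 0.42, 0.58, 0.66, 0.50],
    "A4": [0.50, 0.47, 0.69, 0.58, 0.43],
}

for size in (1, 3, 4):  # individual annotators, triples, and all four
    for subset in combinations(sorted(annotator_scores), size):
        averaged = [mean(vals) for vals in zip(*(annotator_scores[a] for a in subset))]
        r, _ = pearsonr(auto_scores, averaged)
        print(f"{'+'.join(subset)}: Pearson = {r:.3f}")
```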
Evaluation methods are fundamental techniques for assessing whether the summaries generated by an automatic system capture the original document's idea. Different evaluation methods have been developed in the last decade for the evaluation of automatically generated summaries. There are two types of evaluation methods: (1) manual evaluation methods and (2) automatic evaluation methods. The first group of methods requires human intervention in the form of ground-truth references; Pyramid (Nenkova and Passonneau, 2004) and Responsiveness are the most popular such methods.

The second group of methods is itself divided into two subsets: (1) methods that need human intervention, like ROUGE (Lin, 2004a) and SERA (Cohan and Goharian, 2016), and (2) methods that do not need any human reference, like SummTriver (Cabrera-Diego and Torres-Moreno, 2018) and FRESA (Torres-Moreno et al., 2010). The most popular automatic metric used by the community is ROUGE (Lin, 2004a). It needs reference summaries and is based on their lexical overlaps with the candidate summaries. That is why it is most useful for evaluating extractive summaries, where chunks of the text are copied and pasted to form the summary. However, in the case of abstractive summaries, where the ATS paraphrases the text with possibly new vocabulary, the ROUGE metric becomes unfair. To overcome this issue, researchers have proposed in the last few years other automatic metrics to fairly evaluate both extractive and abstractive summaries.

The first type of automatic evaluation methods relies partially on human judgment, as ROUGE does. The simplest method is based on the Jensen-Shannon (JS-2) divergence (Lin et al., 2006) between the bigram distributions of candidate and reference summaries. More sophisticated systems include MoverScore (Zhao et al., 2019), which is based on fine-tuning the BERT model and combining contextualized representations with the Earth Mover's Distance (EMD) of Rubner et al. (2000). BERTScore (Zhang et al., 2020) is also based on the BERT model. Unlike ROUGE (Lin, 2004b), BERTScore makes use of contextual embeddings, which are effective for paraphrase detection. Similarly to BERTScore, Semantic Similarity for Abstractive Summarization (SSAS) (Vadapalli et al., 2017) is based on semantic matching between candidate and reference summaries.

The second type of automatic evaluation methods does not need any human intervention. For instance, the FRamework for Evaluating Summaries Automatically (FRESA) (Torres-Moreno et al., 2010) is based on divergences between the probability distributions of the summary to evaluate and of its source document. Another well-known metric is SummTriver (Cabrera-Diego and Torres-Moreno, 2018). It is based on trivergences between the summary to evaluate, its source document(s), and a set of summaries related to the same source document(s) but generated by other ATS systems.

We introduced GeSERA, an open-source system for general-domain summary evaluation. We redefine the query reformulation of SERA based on a POS-tag analysis of datasets from different domains, and replace the biomedical index with documents from AQUAINT-2 and Wikipedia. GeSERA achieves competitive results compared to state-of-the-art approaches. Overall, GeSERA surpasses SERA and reduces its gap with ROUGE, and in two cases it even surpasses ROUGE, the lexical-based method. Unsurprisingly, the comparison with evaluation methods that do not rely on human references reveals a large gap in favor of GeSERA, since it relies on human references while the others do not. Extensive experiments show that the index size has a considerable effect on the performance of SERA and GeSERA, which tend to perform better with small indexes. Finally, the study of human annotators shows their impact on the performance of automatic evaluation methods that rely on human intervention. Our code is publicly available to facilitate reproducibility.
We will: (1) explore other variants of the search engine to understand its impact on GeSERA, (2) propose a new version of GeSERA that does not rely on human intervention by exploiting information from the source text, (3) apply preprocessing to the index and search for other solutions to improve query reformulation, and (4) explore larger query datasets such as Multi-News (Fabbri et al., 2019).

In this supplementary material, we provide: (1) more details about the evaluation datasets (Section A), (2) implementation details (Section B), (3) detailed results for all tested approaches (Section C), and (4) additional material (Section D).

SERA and GeSERA were implemented in Python. For information retrieval, we used the Okapi BM25F ranking function from Whoosh, a flexible, pure-Python search engine framework. We used the authors' public implementations to run ROUGE, SummTriver, and FRESA. The latter was originally designed for mono-document evaluation; we therefore concatenated all the articles of the same topic to be able to run it on TAC. To compute the correlations of ROUGE, BERTScore, MoverScore, and JS-2 with LitePyramid on the CNN Daily Mail dataset, we used the scores provided by Bhandari et al. (2020) in their GitHub repository (https://github.com/neulab/REALSumm/). Based on the experiments on the TAC datasets in Subsection 4.3, we use an index size of I = 10,000 in SERA and GeSERA. For the sake of comparability, scores are averaged for each participant before computing the correlations with the manual methods.

C More results

Table 4 and Table 5 provide more results on the CNNDM, TAC2008, and TAC2009 datasets. Both tables present variants of the evaluation metrics that we did not report in the main paper.

Figure 4 provides the SERA and GeSERA correlations with Pyramid using TAC2008 as the query dataset and AQUAINT-2 and Wikipedia as indexes. Contrary to TAC2009, it is hard to determine for TAC2008 the impact of human annotators on the evaluation with SERA and GeSERA, as the best scores change from one case to another. For instance, A1 is the best human annotator for SERA with both the AQUAINT-2 and Wikipedia corpora in terms of Spearman and Kendall. However, in terms of Pearson correlation, the best annotator is A4 for AQUAINT-2 and A2 for Wikipedia. Alternatively, the best annotator for GeSERA is always A2 for Wikipedia, while the same annotator provides the worst results with AQUAINT-2.

Figure 4: Correlation coefficients obtained by each annotator A_i using the TAC2008 dataset for queries, and AQUAINT-2 and Wikipedia as indexes.
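To complement the implementation details above, here is a minimal sketch of how the document index assumed by the retrieval sketch shown earlier could be built with Whoosh. The schema (a stored doc_id plus an indexed content field) and the directory name are our assumptions, not a description of the authors' actual setup.

```python
import os

from whoosh import index
from whoosh.analysis import StemmingAnalyzer
from whoosh.fields import ID, TEXT, Schema


def build_index(docs, index_dir="indexdir"):
    """Index a collection of (doc_id, text) pairs, e.g. Wikipedia or AQUAINT-2 articles."""
    schema = Schema(
        doc_id=ID(stored=True, unique=True),
        content=TEXT(analyzer=StemmingAnalyzer()),
    )
    os.makedirs(index_dir, exist_ok=True)
    ix = index.create_in(index_dir, schema)
    writer = ix.writer()
    for doc_id, text in docs:
        writer.add_document(doc_id=doc_id, content=text)
    writer.commit()
    return ix


# Toy usage with two illustrative documents.
build_index([("d1", "A news article about regional budgets."),
             ("d2", "An encyclopedia entry about information retrieval.")])
```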
References

Pearson correlation coefficient.
Bhandari et al. (2020). Reevaluating evaluation in text summarization.
Cabrera-Diego and Torres-Moreno (2018). SummTriver: A new trivergent model to evaluate summaries automatically without human references.
Cohan et al. (2018). A discourse-aware attention model for abstractive summarization of long documents.
Cohan and Goharian (2016). Revisiting summarization evaluation for scientific articles.
Fabbri et al. (2019). Multi-News: A large-scale multi-document summarization dataset and abstractive hierarchical model.
Graff (2002). The AQUAINT corpus of English news text.
Kendall (1945). The treatment of ties in ranking problems.
Kieuvongngam et al. (2020). Automatic text summarization of COVID-19 medical research articles using BERT and GPT-2.
Zwillinger and Kokoska (2000). CRC standard probability and statistics tables and formulae.
Lin (2004). ROUGE: A package for automatic evaluation of summaries.
Lin et al. (2006). An information-theoretic approach to automatic evaluation of summaries.
Lin and Hovy (2002). Manual and automatic evaluation of summaries.
Nenkova and Passonneau (2004). Evaluating content selection in summarization: The Pyramid method.
Pérez-Agüera, Arroyo, Greenberg, Pérez-Iglesias, and Fresno (2010). Using BM25 and BM25F for semantic search.
Robertson and Zaragoza (2009). The probabilistic relevance framework: BM25 and beyond.
Rubner et al. (2000). The earth mover's distance as a metric for image retrieval.
Shapira et al. (2019). Crowdsourcing lightweight pyramids for manual summary evaluation.
Torres-Moreno et al. (2010). Summary evaluation with and without references.
Vadapalli et al. (2017). SSAS: Semantic similarity for abstractive summarization. Asian Federation of Natural Language Processing.
Zaragoza et al. (2004). Microsoft Cambridge at TREC 13: Web and hard tracks.
Zhang et al. (2020). BERTScore: Evaluating text generation with BERT.
Zhao et al. (2019). MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance.