Extractive and Abstractive Explanations for Fact-Checking and Evaluation of News
Ashkan Kazemi, Zehua Li, Verónica Pérez-Rosas, Rada Mihalcea
2021-04-27

In this paper, we explore the construction of natural language explanations for news claims, with the goal of assisting fact-checking and news evaluation applications. We experiment with two methods: (1) an extractive method based on Biased TextRank, a resource-effective unsupervised graph-based algorithm for content extraction; and (2) an abstractive method based on the GPT-2 language model. We perform comparative evaluations on two misinformation datasets in the political and health news domains, and find that the extractive method shows the most promise.

Navigating the media landscape is becoming increasingly challenging given the abundance of misinformation, which reinforces the importance of keeping our news consumption focused and informed. While fake news and misinformation have been a recent focus of research (Pérez-Rosas et al., 2018; Thorne and Vlachos, 2018; Lu and Li, 2020), the majority of this work aims to categorize claims rather than generate explanations that support or deny them. This is a challenging problem that has mainly been tackled by expert journalists, who manually verify the information surrounding a given claim and provide a detailed verdict based on supporting or refuting evidence. More recently, there has been growing interest in computational tools that can assist in this process by providing supporting explanations for a given claim based on the news content and its context (Atanasova et al., 2020; Fan et al., 2020). A true or false veracity label alone does not provide enough information, and a detailed fact-checking report or news article can take a long time to read; bite-sized explanations can bridge this gap and improve the transparency of automated news evaluation systems.

To contribute to this line of work, our paper explores two approaches to generating supporting explanations to assist with the evaluation of news. First, we investigate how an extractive method based on Biased TextRank (Kazemi et al., 2020) can be used to generate explanations. Second, we explore an abstractive method based on GPT-2, a large generative language model (Radford et al., 2019). Our methods take as input a news article and a claim and produce a claim-focused explanation by extracting or generating information from the original article that is relevant to the claim. We evaluate our proposed methods in the health care and political domains, where misinformation is abundant. As news on the COVID-19 pandemic and the elections is currently overloading social media outlets, we find these domains to be of timely importance. Through comparative experiments, we find that both methods are effective at generating explanations for news claims, with the extractive approach showing the most promise for this task.

While explainability in AI has been a central subject of research in recent years (Poursabzi-Sangdeh et al., 2018; Lundberg and Lee, 2017; Core et al., 2006), the generation of natural language explanations is still relatively understudied. Camburu et al. (2018) propose e-SNLI, a natural language (NL) inference dataset augmented with human-annotated NL explanations. In the same work, they also
generate NL explanations for premise and hypothesis pairs in an inference task using the InferSent (Conneau et al., 2017) architecture. Kumar and Talukdar (2020) propose the task of generating "faithful" (i.e., aligned with the model's internal decision making) NL explanations and introduce NILE, a method that jointly produces NLI labels and faithful NL explanations.

Generating explanations in the context of news and fact-checking is a timely and novel topic in the NLP community (Atanasova et al., 2020; Fan et al., 2020; Kotonya and Toni, 2020). Atanasova et al. (2020) proposed a supervised BERT-based (Devlin et al., 2019) model for jointly predicting the veracity of a claim and extracting supporting explanations from fact-checked claims in the LIAR-PLUS (Alhindi et al., 2018) dataset. Kotonya and Toni (2020) constructed a dataset for a similar task in the public health domain and provided baseline models for explainable fact verification using this dataset. Fan et al. (2020) used explanations about a claim to assist fact-checkers and showed that explanations improved both the efficiency and the accuracy of the fact-checking process.

An example claim, its fact-check report, the ground-truth explanation, and the explanations produced by the two methods explored in this paper is shown below:

Claim: Nearly half of Oregon's children are poor.

Fact-Check Report: ...Jim Francesconi...said..."Nearly half of Oregon's children are poor." He said the information came from a 2012 report...According to that report, "nearly 50% of children are either poor or low-income." Francesconi almost immediately realized his mistake. "In retrospect, I wish I would have said poor or low income."...there is a distinction between poor and low income as far as the U.S. government is concerned... If you check the...Census information, you'll find that...23 percent of children in Oregon live in...below...poverty level while another 21 percent live in low-income families. As far as the U.S. government is concerned, about a quarter of the state's children are poor, not half... (redacted)

Supporting Explanation (Ground Truth): So where does this leave us? Francesconi said in an opinion piece that "nearly half of Oregon's children are poor." In fact, if you use federal definitions for poverty, about a quarter are poor and another quarter are low-income. But experts tell us that families that are described as low-income still struggle to meet their basic needs and, for all intents and purposes, qualify as poor. Be that as it may, Francesconi was referencing a report that used the federal definitions.

Biased TextRank (Extractive): "Nearly half of Oregon's children are poor." According to that report, "nearly 50% of children are either poor or low-income." Low income refers to families between 100 and 200 percent of the federal poverty level. As far as the U.S. government is concerned, about a quarter of the state's children are poor, not half.

GPT-2 Based (Abstractive): That's still below the federal poverty level. But that's not half. About 47 percent of Oregon's children are not poor, according to the Census data. So the percentage of children in the state who are poor is not half yet. It's actually closer to half.

We explore two methods for producing natural language explanations: an extractive unsupervised method based on Biased TextRank, and an abstractive method based on GPT-2. Introduced by Kazemi et al. (2020) and based on the TextRank algorithm (Mihalcea and Tarau, 2004), Biased TextRank is a targeted content extraction algorithm with a range of applications in keyword and sentence extraction.
The TextRank algorithm ranks text segments by importance by running a random walk over a graph that contains a node for each text segment (e.g., sentence), with weighted edges between segments given by a measure of text similarity. Biased TextRank takes an additional "bias" input and ranks the text segments considering both their own importance and their relevance to the bias term. The bias query is incorporated into Biased TextRank following an idea similar to the one introduced by Haveliwala (2002) for topic-sensitive PageRank: the similarity between each text segment and the bias query is used to set the restart probabilities of the random walker in a run of PageRank over the text graph. The more similar a segment is to the bias query, the more likely its node is to be visited on each restart, and therefore the better its chance of ranking above less similar nodes. In our experiments, we use SBERT (Reimers and Gurevych, 2019) contextual embeddings to transform text into sentence vectors, and cosine similarity as the similarity measure.

We implement an abstractive explanation generation method based on GPT-2, a transformer-based language model introduced by Radford et al. (2019) and trained on 8 million web pages containing 40 GB of text. Aside from its success in language generation tasks (Budzianowski and Vulić, 2019; Ham et al., 2020), the pretrained GPT-2 model enables us to generate abstractive explanations from a relatively small dataset through transfer learning. In order to generate explanations that are closer in domain and style to the reference explanations, we conduct an initial fine-tuning step. During fine-tuning, we provide the news article, the claim, and its corresponding explanation as input to the model, explicitly marking the beginning and end of each input field with bespoke tokens. At test time, we provide the article and the query in the same format but leave the explanation field to be completed by the model. We use top-k sampling to generate explanations and stop generation once the model outputs the explicit end-of-text token introduced during fine-tuning.

Overall, this fine-tuning strategy generates explanations that follow a style similar to the reference explanations. However, we identify cases where the model generates gibberish and/or repetitive text, problems previously reported in the literature for GPT-2 (Holtzman et al., 2019; Welleck et al., 2020). To address these issues, we devise a strategy to remove unimportant sentences that could introduce noise into the generation process. We first use Biased TextRank to rank the importance of the article sentences with respect to the question/claim. Then, we repeatedly remove the least important sentence (up to 5 times) and feed the modified text to the GPT-2 generator. This approach keeps the text generation time complexity in the same order of magnitude as before and reduces the generation noise rate to close to zero. We use the medium GPT-2 model (355M parameters) (Radford et al., 2019) as implemented in the Huggingface transformers library (Wolf et al., 2019).
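To make the extractive method concrete, below is a minimal sketch of a Biased TextRank-style ranker. It is not the authors' implementation: the SBERT model name, the use of networkx's personalized PageRank as the biased random walk, and the assumption of pre-split sentences are all illustrative choices.

```python
# Minimal sketch of a Biased TextRank-style ranker (illustrative, not the
# authors' implementation). Assumed pieces: the SBERT model name, networkx's
# personalized PageRank as the biased random walk, and pre-split sentences.
import networkx as nx
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity


def biased_textrank(sentences, bias_query, model_name="all-MiniLM-L6-v2", top_k=5):
    model = SentenceTransformer(model_name)
    sent_vecs = model.encode(sentences)      # one vector per sentence
    bias_vec = model.encode([bias_query])    # vector for the claim/question

    # Edge weights: pairwise sentence similarity (clipped to stay non-negative).
    sim = np.clip(cosine_similarity(sent_vecs), 0.0, None)
    np.fill_diagonal(sim, 0.0)
    graph = nx.from_numpy_array(sim)

    # Restart (teleport) probabilities proportional to similarity to the bias
    # query, mirroring the topic-sensitive PageRank idea described above.
    bias_sim = np.clip(cosine_similarity(sent_vecs, bias_vec).flatten(), 1e-6, None)
    personalization = {i: float(s) for i, s in enumerate(bias_sim / bias_sim.sum())}

    scores = nx.pagerank(graph, personalization=personalization)
    top = sorted(sorted(scores, key=scores.get, reverse=True)[:top_k])
    return [sentences[i] for i in top]       # top-ranked sentences, document order
```

The only difference from plain TextRank is the non-uniform restart distribution derived from the bias query; with a uniform personalization vector, the same code reduces to standard TextRank.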
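Similarly, the following sketches the test-time prompt format and top-k decoding described above, using the Huggingface transformers API the paper relies on. The marker tokens, the value of k, and the length limits are hypothetical, since the paper does not publish them, and the fine-tuning loop itself is omitted.

```python
# Sketch of the test-time prompt format and top-k decoding described above,
# using the Huggingface transformers API. The marker tokens, k, and length
# limits are assumptions; the fine-tuning loop itself is omitted.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

MARKERS = ["<|article|>", "<|claim|>", "<|explanation|>", "<|endofexp|>"]

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-medium")
tokenizer.add_special_tokens({"additional_special_tokens": MARKERS})
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")  # fine-tuned weights would be loaded here
model.resize_token_embeddings(len(tokenizer))


def generate_explanation(article, claim, max_new_tokens=150, k=40):
    # During fine-tuning the reference explanation would follow "<|explanation|>";
    # at test time the prompt stops there and the model fills in the rest.
    prompt = f"<|article|> {article} <|claim|> {claim} <|explanation|>"
    inputs = tokenizer(prompt, return_tensors="pt",
                       truncation=True, max_length=1024 - max_new_tokens)
    output = model.generate(
        **inputs,
        do_sample=True,
        top_k=k,                      # top-k sampling; k is an assumed value
        max_new_tokens=max_new_tokens,
        eos_token_id=tokenizer.convert_tokens_to_ids("<|endofexp|>"),
        pad_token_id=tokenizer.eos_token_id,
    )
    generated = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(generated, skip_special_tokens=True)
```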
We use ROUGE (Lin, 2004), a common measure for language generation assessment, as our main evaluation metric for the generated explanations, and report F scores for three variants: ROUGE-1, ROUGE-2, and ROUGE-L. We compare our methods against two baselines. The first is an explanation obtained by applying TextRank to the input text. The second, called "embedding similarity", ranks the input sentences by the cosine similarity of their embeddings to the question and takes the top five sentences as the explanation.

LIAR-PLUS. The LIAR-PLUS (Alhindi et al., 2018) dataset contains 10,146 training, 1,278 validation, and 1,255 test data points collected from PolitiFact.com, a political fact-checking website in the U.S. A data point in this dataset contains a claim, its verdict, a news-length fact-check report justifying the verdict, and a short explanation called "Our ruling" that summarizes the fact-check report and the verdict on the claim. General statistics on this dataset are presented in Table 2.

Health News Reviews (HNR). We collect health news reviews along with ratings and explanations from healthnewsreview.org, a website dedicated to evaluating healthcare journalism in the U.S. The news articles are rated on a 1-to-5 star scale, and the explanations, which justify the news rating, consist of short answers to 10 evaluative questions on the quality of the information reported in the article. The questions cover informative aspects that should be included in the news, such as intervention costs, treatment benefits, discussion of harms and benefits, clinical evidence, and availability of treatment, among others. Answers to these questions are further evaluated as satisfactory, not satisfactory, or not applicable to the given news item. For our experiments, we select 1,650 reviews that include both the original article and the accompanying metadata as well as the explanations. Statistics on the explanations are presented in Table 2. To further study explanations in this dataset, we randomly select 50 articles along with their corresponding questions and explanations, and manually label the sentences in the original article that are relevant to the quality aspect being measured. During this process we only include explanations that are deemed "satisfactory", which means that the relevant information is included in the original article.

We use the Biased TextRank and GPT-2 based models to automatically generate explanations for each dataset. For LIAR-PLUS, we seek to generate the explanation provided in the "Our ruling" section. For HNR, we aim to generate the explanations provided for the evaluative questions described in Section 4.2. We use the provided training, validation, and test splits for the LIAR-PLUS dataset. For HNR, we use 20% of the data as the test set; we study only the first nine questions for each article and exclude question #10, as answering it requires information beyond the news article. We use the explanations and the question-related article sentences as our references in the ROUGE evaluations on the HNR dataset, and the section labeled "Our ruling" as the ground truth for LIAR-PLUS.

Extractive Explanations. To generate extractive explanations for the LIAR-PLUS dataset, we apply Biased TextRank to the original article and its corresponding claim and pick the top 5 ranked sentences as the explanation (based on the average length of explanations in the dataset). To generate explanations for the HNR dataset, we apply Biased TextRank to each news article and question pair for nine of the evaluative questions and select the top 5 ranked sentences as the extracted explanation (matching the dataset's average explanation length).
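Both the extracted explanations above and the abstractive ones described next are scored against their references with ROUGE F-measures. A minimal sketch of that scoring step, assuming Google's rouge-score package as the implementation (the paper cites Lin, 2004, but does not name a toolkit):

```python
# Minimal sketch of the ROUGE-1/2/L F-score evaluation, assuming Google's
# rouge-score package (the paper cites Lin, 2004, but does not name a toolkit).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)


def rouge_f(reference, generated):
    scores = scorer.score(reference, generated)   # signature: score(target, prediction)
    return {name: s.fmeasure for name, s in scores.items()}


def average_rouge(pairs):
    """Average F scores over (reference, generated) explanation pairs."""
    totals = {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0}
    for ref, gen in pairs:
        for name, f in rouge_f(ref, gen).items():
            totals[name] += f
    return {name: total / len(pairs) for name, total in totals.items()}
```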
Abstractive Explanations. We apply the GPT-2 based model to generate abstractive explanations for each dataset, using the original article and the corresponding claim or question as input. We apply this method directly on the LIAR-PLUS dataset. For the HNR dataset, since there are several questions, we train a separate GPT-2 based model per question. In addition, each model is trained using only the articles corresponding to questions labeled as "satisfactory", since the "not satisfactory" and "not applicable" questions do not contain information within the scope of the original article.

We also conduct a set of experiments to evaluate to what extent the generated explanations can answer the evaluative questions in the HNR dataset. For each question, we assign binary labels to the articles (1 for satisfactory answers, 0 for not satisfactory and not applicable answers) and train individual classifiers that discriminate between the two labels. In these experiments, each classifier is trained and evaluated ten times on the test set, and the results are averaged over the ten runs.

As the results in Table 3 suggest, while our abstractive GPT-2 based model fails to surpass the extractive baselines on the LIAR-PLUS dataset, Biased TextRank outperforms the unsupervised TextRank baseline. Biased TextRank's improvements over TextRank suggest that a claim-focused summary of the article provides better supporting explanations than a regular summary produced by TextRank. Note that the current state-of-the-art results for this dataset, presented in (Atanasova et al., 2020), are 35.70, 13.51, and 31.58 in ROUGE-1, 2, and L scores, respectively. However, a direct comparison with their method would not be accurate, as it is supervised (versus the unsupervised Biased TextRank) and extractive (versus the abstractive GPT-2 based model).

Table 4 presents the results of the automatic evaluation of the generated explanations on the HNR dataset: the GPT-2 based model outperforms Biased TextRank when evaluated against the actual explanations, while Biased TextRank scores higher when evaluated against the extractive (manually labeled sentence) references. This indicates that the GPT-2 based method is more effective on this dataset and performs comparably with Biased TextRank. Results for the downstream task using both methods are shown in Table 5. The differences are significant: Biased TextRank outperforms the GPT-2 based abstractive method (t-test, p = 0.05), suggesting that Biased TextRank generates good quality explanations for the HNR dataset.

Our evaluations indicate that Biased TextRank shows the most promise, while the GPT-2 based model mostly follows in performance. Keeping in mind that the GPT-2 based model is solving the harder problem of generating language, it is worth noting how little supervision it receives on both datasets, especially on HNR, where the average size of the training data is 849 examples. In terms of resource efficiency and speed, Biased TextRank is faster and lighter than the GPT-2 based model. Excluding the time needed to fine-tune the GPT-2 model, generating a coherent abstractive explanation for the LIAR-PLUS dataset takes approximately 60 seconds on a GPU on average, while Biased TextRank extracts explanations in the order of milliseconds and can run without a GPU in a few seconds. We consider Biased TextRank's efficiency another advantage of the unsupervised algorithm over the GPT-2 based model.
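For reference, the downstream question-answering evaluation described above could be set up as follows. The paper does not specify the classifier or its input features, so the choice of logistic regression over SBERT embeddings of the generated explanations, as well as the F1 metric, are assumptions made purely for illustration.

```python
# Illustrative sketch of the per-question downstream classification set-up.
# Assumptions (not specified in the paper): logistic regression as the
# classifier, SBERT embeddings of the generated explanations as features,
# and F1 as the reported metric.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed sentence encoder


def train_and_eval(train_expls, train_labels, test_expls, test_labels):
    """Labels: 1 = satisfactory, 0 = not satisfactory / not applicable.
    The paper trains and evaluates each classifier ten times and averages."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(encoder.encode(train_expls), train_labels)
    preds = clf.predict(encoder.encode(test_expls))
    return f1_score(test_labels, preds)
```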
In this paper, we presented extractive and abstractive methods for generating supporting explanations for more convenient and transparent human consumption of news. We evaluated our methods on two domains and found promising results for producing explanations. In particular, Biased TextRank (an extractive method) outperformed the unsupervised baselines on the LIAR-PLUS dataset and performed reasonably close to the extractive ground truth on the HNR dataset. For future work, we believe generating abstractive explanations should be a priority, since intuitively an increase in the readability and coherence of the supporting explanations will result in improvements in the delivery and perception of news.

Acknowledgments. We are grateful to Dr. Stacy Loeb, Professor of Urology and Population Health at New York University, for her expert feedback, which was instrumental for this work. This material is based in part upon work supported by the Precision Health initiative at the University of Michigan, by the National Science Foundation (grant #1815291), and by the John Templeton Foundation (grant #61156). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Precision Health initiative, the National Science Foundation, or the John Templeton Foundation.

References:
Where is your evidence: Improving fact-checking by justification modeling
Generating fact checking explanations
Hello, it's GPT-2 - how can I help you? Towards the use of pretrained language models for task-oriented dialogue systems
e-SNLI: Natural language inference with natural language explanations
Supervised learning of universal sentence representations from natural language inference data
Building explainable artificial intelligence systems
BERT: Pre-training of deep bidirectional transformers for language understanding
Generating fact checking briefs
End-to-end neural pipeline for goal-oriented dialogue systems using GPT-2
Topic-sensitive PageRank
The curious case of neural text degeneration
Biased TextRank: Unsupervised graph-based content extraction
Explainable automated fact-checking for public health claims
NILE: Natural language inference with faithful natural language explanations
ROUGE: A package for automatic evaluation of summaries
GCAN: Graph-aware co-attention networks for explainable fake news detection on social media
A unified approach to interpreting model predictions
TextRank: Bringing order into text
Automatic detection of fake news
Manipulating and measuring model interpretability
Language models are unsupervised multitask learners
Sentence-BERT: Sentence embeddings using Siamese BERT-networks
Automated fact checking: Task formulations, methods and future directions
Consistency of a recurrent language model with respect to incomplete decoding
Huggingface's transformers: State-of-the-art natural language processing