key: cord-0201495-0ejugcab
authors: Mishra, Rahul; Gupta, Dhruv; Leippold, Markus
title: Generating Fact Checking Summaries for Web Claims
date: 2020-10-16
journal: nan
DOI: nan
sha: e2c7cc4fcefebc32dd570095c38ccfd0f52425b0
doc_id: 201495
cord_uid: 0ejugcab

We present SUMO, a neural attention-based approach that learns to establish the correctness of textual claims based on evidence in the form of text documents (e.g., news articles or Web documents). SUMO further generates an extractive summary by presenting a diversified set of sentences from the documents that explain its decision on the correctness of the textual claim. Prior approaches to address the problem of fact checking and evidence extraction have relied on simple concatenation of claim and document word embeddings as an input to claim driven attention weight computation. This is done so as to extract salient words and sentences from the documents that help establish the correctness of the claim. However, this design of claim-driven attention does not capture the contextual information in documents properly. We improve on the prior art by using improved claim and title guided hierarchical attention to model effective contextual cues. We show the efficacy of our approach on datasets concerning political, healthcare, and environmental issues.

Most of the information consumed by the world is in the form of digital news, blogs, and social media posts available on the Web. However, most of this information is written without supporting facts or evidence. Our ever-increasing reliance on information from the Web is becoming a severe problem, as we base our personal decisions relating to politics, environment, and health on unverified information available online. For example, consider the following unverified claim on the Web: "Smoking may protect against COVID-19."

A user attempting to verify the correctness of the above claim will often take the following steps: issuing keyword queries to search engines for the claim; going through the top reliable news articles; and finally making an informed decision based on the gathered information. Clearly, this approach is laborious, time-consuming, and error-prone. In this work, we present SUMO, a neural approach that assists the user in establishing the correctness of claims by automatically generating explainable summaries for fact checking. Example summaries generated by SUMO for a couple of Web claims are given in Figure 1.

Prior approaches to automatic fact checking rely on predicting the credibility of facts [20], instance detection [14, 31], and fact entailment in supporting documents [18]. The majority of these methods rely on linguistic features [20, 22, 23], social contexts, or user responses [13] and comments. However, these approaches do not help explain the decisions generated by the machine learning models. Recent works such as [2, 16, 21] overcome the explainability gap by extracting snippets from text documents that support or refute the claim. [16, 21] apply claim-based and latent aspect-based attention to model the context of text documents. [16] model latent aspects such as the speaker or author of the claim, the topic of the claim, and the domains of the Web documents retrieved for the claim. We observe in our experiments that the design of claim-guided attention in these prior works [16, 21] is not effective, and that latent aspects such as the topic and speaker of claims are not always available. Moreover, the snippets extracted by such models are not comprehensive or topically diverse. To overcome these limitations, we propose a novel design of claim and document title driven attention, which better captures the contextual cues in relation to the claim. In addition, we propose an approach for generating summaries for fact checking that are non-redundant and topically diverse.

Contributions. Contributions made in this work are as follows. First, we introduce SUMO, a method that improves upon the previously used claim guided attention to model effective contextual representation. Second, we propose a novel attention on top of attention (Atop) method to improve the overall attention effectiveness. Third, we present an approach to generate topically diverse multi-document summaries, which help in explaining the decision SUMO makes for establishing the correctness of claims. Fourth, we provide a novel testbed for the task of fact checking in the domain of climate change and health care.

Outline. The outline for the rest of the article is as follows. In Section 2, we describe prior work in relation to our problem setting. In Section 3, we formalize the problem definition and describe our approach, SUMO, to generate explainable summaries for fact checking of textual claims. In Sections 4 and 5, we describe the experimental setup that includes a description of the novel datasets that we make available to the research community and an analysis of the results we have obtained. In Section 6, we present the concluding remarks of our study.

We now describe prior work related to our problem setting. First, we describe works that rely only on features derived from documents that support the input textual claim. Second, we describe works that additionally include features derived from social media posts in connection to the claim. Third and finally, we describe works that rely on extracting textual snippets from text documents to explain a model's decision on the claim's correctness.

arXiv:2010.08570v1 [cs.CL] 16 Oct 2020

[Figure 1: Example summaries generated by SUMO for a couple of Web claims.]

The current evidence suggests that the severity of COVID is higher among smokers, prevent the health risk linked to the excessive consumption or misuse" of nicotine products by people hoping to protect themselves from COVID-19. Evidence from China, where COVID-19 originated, shows that people who have cardiovascular and respiratory conditions caused by tobacco use, or otherwise, are at higher risk of developing severe COVID-19 symptoms. WHO urges researchers, scientists and the media to be cautious about amplifying unproven claims that tobacco or nicotine could reduce the risk of COVID-19. Smoking is also associated with increased development of acute respiratory distress syndrome, a key complication for severe cases of COVID-19.

Claim: Deforestation has made humans more vulnerable to pandemics

Deforestation can directly increase the likelihood that a pathogen will be transferred from wildlife species to humans through the creation of suitable habitats for vector species. Climate change, including deforestation which drives it, is a key driver of cross-species transmission which is where zoonotic emerging diseases come from. There is a correlation between deforestation and the rise in the spread of infectious diseases affecting humans. Deforestation forces various species into smaller, shared habitats and increases encounters between wildlife and humans. Habitat destruction and fragmentation due to deforestation can also increase the frequency of contact between humans, wildlife species, and the pathogens they carry. This can occur through direct transfer of pathogens from animals to humans or indirectly through cross-species transfer of pathogens from wildlife to domesticated species. Deforestation could be to blame for the rise of infectious diseases like the novel coronavirus.

Prior approaches for fact checking vary from simple machine learning methods such as SVM and decision trees to highly sophisticated deep learning methods. These works largely utilize features that model the linguistic and stylistic content of the facts to learn a classifier [4, 12, 23, 25]. The key shortcomings of these approaches are as follows. First, classifiers trained on linguistic and stylistic features perform poorly, as they can be misled by the writing style of false claims, which are deliberately made to look similar to true claims but are factually false. Second, these methods do not use the user responses and social context pertaining to the claims, which are very helpful in establishing the correctness of facts.

Works such as [24, 26, 32] address the lack of user feedback by using a combination of content-based and context-based features derived from related social media posts. Specifically, the features derived from social media include propagation patterns of claim-related posts and user responses in the form of replies, likes, sentiments, and shares. These methods outperform content-based methods significantly. In [32], the authors propose a probabilistic graphical model for causal mappings among the post's credibility, the users' opinions, and the users' credibility. In [24], the authors introduce a user response generator based on a deep neural network that leverages the user's past actions, such as comments, replies, and posts, to generate a synthetic response for new social media posts.

Explaining a machine learning model's decision is becoming an important problem. This is because modern neural network based methods are increasingly being used as black-boxes. There exist few machine learning models for fact checking that explain this decision via summaries. Related works [16, 21] achieve significant improvement in establishing the credibility of textual claims by using external evidences from the Web. They additionally extract snippets from evidences that explain their model's decision. However, we find that the claim-driven attention design used in these methods is inadequate, and does not capture sufficient context of the documents in relation to the input claim. The snippets extracted by these methods are often redundant and lack topical diversity offered by Web evidences. In contrast, our method enhances the claim-driven attention mechanism and generates a topically diverse, coherent multi-document summary for explaining the correctness of claims.

We now formally describe the task of fact checking and explain SUMO in detail. SUMO works in two stages. In the first stage, it predicts the correctness of the claim. In the second stage, it generates a topically diverse summary for the claim. As input, we are provided with a Web claim $c \in C$, where $C$ is a collection of Web claims, and a pseudo-relevant set of documents $D = \{d_1, d_2, \ldots, d_m\}$, where $m$ is the number of results retrieved for claim $c$. The documents $d \in D$ are retrieved from the Web as potential evidence, using claim $c$ as a query. Each retrieved document $d$ is accompanied by its title $t$ and text body $b_d$, i.e., $d = \langle t, b_d \rangle$. We define the representation of each document's body as a collection of $k$ sentences, $b_d = \{s_1, s_2, \ldots, s_k\}$, and each sentence as a collection of $l$ words $\{w_1, w_2, \ldots, w_l\}$ drawn from $W$, where $W$ is the overall word vocabulary of the corpus. By $k$ and $l$, we denote the maximum number of sentences in a document and the maximum number of words in a sentence, respectively. We use both word2vec and pre-trained GloVe embeddings to obtain the vector representations for each claim, title, and document body. The objective is to classify the claim as either true or false and to automatically generate a topically diverse summary, pieced together from $D$, for establishing the correctness of the claim.

We now describe SUMO's neural architecture (see Figure 2), which predicts the correctness of the input claim given its pseudo-relevant set of documents. The model additionally learns weights for the words and sentences in the document's body that help ascertain the claim's correctness. First, we need to encode the pseudo-relevant documents that support a claim. To this end, we use a Gated Recurrent Unit (GRU) as a sequence encoder for the document's body content. The claim and the document's title are not encoded using the sequence encoder; we explain how they are represented in the upcoming sections.

Claim-driven Hierarchical Attention. Claim-driven attention at the word level aims to attend to salient words that are significant and relevant to the content of the claim. Similarly, sentence-level attention aims to attend to the salient sentences. Recent works have used claim-guided attention to model the contextual representation of the documents retrieved from the Web. These approaches provide claim-guided attention by first concatenating the $i$th claim word embedding $c_i$ with the $i$th document word embedding $d_i$ and then applying a dense softmax layer, with weight matrix $W_a$ and bias $b_a$, to learn the attention weight $\alpha$. However, during experiments, we observe that this form of claim-based attention yields an inferior overall document representation. Therefore, we do not concatenate the claim and document embeddings before attention weight computation.
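The concatenation-based scheme used by prior work, as described above, can be sketched as follows. This is a minimal illustration and not the implementation of any cited system: the function names, the single scoring unit, and the toy vectors are assumptions made for the example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def concat_attention(claim_vec, word_vecs, w_a, b_a):
    """Score each word from the concatenation [claim ; word], then softmax."""
    scores = []
    for word_vec in word_vecs:
        joined = claim_vec + word_vec  # list concatenation of the two embeddings
        scores.append(sum(w * x for w, x in zip(w_a, joined)) + b_a)
    return softmax(scores)

# Toy 2-dimensional embeddings (hypothetical values).
weights = concat_attention(
    claim_vec=[1.0, 0.0],
    word_vecs=[[1.0, 0.0], [0.0, 1.0]],
    w_a=[0.5, 0.5, 1.0, -1.0],
    b_a=0.0,
)
```

Note that the claim embedding contributes the same offset to every word's score in this sketch, so it cannot change the softmax ranking of the words; this is one way to see why plain concatenation gives the claim little influence over the attention.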

Each claim $c_i$ consists of at most $l$ words $\{w_1, w_2, \ldots, w_l\}$. We represent each claim $c_i$ as the sum of the embeddings of all the words it contains:

$$Cl_i = \sum_{j=1}^{l} f(w_j),$$

where $f(w_j)$ is the word embedding of the $j$th word of claim $c_i$. The claim representation $Cl_i$ and the hidden states $h_j$ from the GRU are used to compute the word-level claim-driven attention weights. Here $W_{j,i}$ and $b_{j,i}$ are the weight matrix and bias, $\alpha^{C}_{j,i}$ is the word-level claim-driven attention weight vector, and $h_j = (h_{j,1}, h_{j,2}, \ldots, h_{j,l})$ is the tuple of all the hidden states of the words contained in the $j$th sentence.

To compute the sentence-level claim-driven attention weights, we use the claim representation $Cl_i$ and the hidden states $h^{S}_{j}$ from the sentence-level GRU units, formed as concatenations of the forward and backward hidden states. Here $W_j$ and $b_j$ are the weight matrix and bias, the attention is computed over the combination of all hidden states from the sentences, and $\alpha^{C}_{j} = (\alpha_{j,1}, \alpha_{j,2}, \ldots, \alpha_{j,k})$ is the sentence-level claim-driven attention weight vector for the $j$th document.
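To make the claim-driven attention concrete, the sketch below builds the claim representation as a sum of word embeddings and uses it to weight encoder hidden states. The scoring function (a tanh projection of each hidden state followed by a dot product with the claim representation) is an assumption chosen for illustration; the exact form used by SUMO is not recoverable from the text here, and all names and values are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def claim_representation(word_embeddings):
    """Element-wise sum of the claim's word embeddings."""
    return [sum(dim) for dim in zip(*word_embeddings)]

def claim_driven_attention(hidden_states, claim_rep, weight, bias):
    """Assumed scoring: tanh(W h + b) dotted with the claim representation."""
    scores = []
    for h in hidden_states:
        projected = [
            math.tanh(sum(w * x for w, x in zip(row, h)) + b)
            for row, b in zip(weight, bias)
        ]
        scores.append(sum(p * c for p, c in zip(projected, claim_rep)))
    return softmax(scores)

claim = claim_representation([[1.0, 0.0], [0.5, 0.5]])
hidden = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]   # toy hidden states
identity = [[1.0, 0.0], [0.0, 1.0]]              # toy weight matrix
alpha = claim_driven_attention(hidden, claim, identity, [0.0, 0.0])
```

In this toy setting, the hidden state most aligned with the claim representation receives the largest weight, which is the behaviour the claim-driven attention is meant to provide.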

Title-driven Hierarchical Attention. The objective of using the document title is to guide the attention towards sections of the document that are more critical and relevant to the title. Articles convey multiple perspectives, often reflected in their titles. With title-driven attention, we attend to those words and sentences that are not covered by claim-driven attention. Each title $t_i$ consists of at most $l$ words $\{w_1, w_2, \ldots, w_l\}$. We represent each title $t_i$ as the sum of the embeddings of all the words it contains: $T_i = \sum_{j=1}^{l} f(w_j)$. Title-driven attention weights at both the word and sentence levels are computed in a similar fashion to the claim-driven attention weights, with the title representation $T_i$ taking the place of the claim representation.

Hierarchical Self-Attention. Self-attention is a simple form of attention. It attends to salient words in a sequence of words, and to salient sentences in a collection of sentences, based only on the context of that sequence or collection itself. In addition to claim-driven and title-driven attention, we apply self-attention to capture words and sentences that are not directly related to the claim or title but are very useful for classification and summarization. Self-attention weights are computed at both the word and sentence levels; we denote the resulting weight vectors by $\alpha^{Sl}_{j,i}$ and $\alpha^{Sl}_{j}$, respectively.

Fusion of Attention Weights. We combine the attention weights from the three kinds of attention mechanisms (claim-driven, title-driven, and self-attention) at both the word and sentence levels by averaging them. At the word level, we fuse $\alpha^{C}_{j,i}$, $\alpha^{T}_{j,i}$, and $\alpha^{Sl}_{j,i}$, the attention weight vectors from claim, title, and self-attention, and obtain $S_j$, the sentence representation formed after overall attention for the $j$th sentence. At the sentence level, we fuse $\alpha^{C}_{j}$, $\alpha^{T}_{j}$, and $\alpha^{Sl}_{j}$, the attention weight vectors from claim, title, and self-attention, and obtain $doc$, the document representation formed after overall attention.
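The fusion step can be illustrated with a short sketch. The averaging follows the description in the text; forming the sentence representation as an attention-weighted sum of hidden states is an assumption borrowed from standard hierarchical attention networks, and the function names and numbers are hypothetical.

```python
def fuse_by_average(alpha_claim, alpha_title, alpha_self):
    """Average the three attention weight vectors position by position."""
    return [
        (c + t + s) / 3.0
        for c, t, s in zip(alpha_claim, alpha_title, alpha_self)
    ]

def weighted_sum(weights, vectors):
    """Attention-weighted sum of hidden-state vectors (assumed pooling)."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

# Three words in a sentence, each with claim, title, and self-attention weights.
alpha = fuse_by_average([0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.1, 0.8])
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sentence_rep = weighted_sum(alpha, hidden_states)
```

Because each input vector sums to one, the averaged weights also sum to one, so the fused vector remains a valid attention distribution.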

Attention on top of Attention (Atop). Although fusing the three kinds of attention weights by averaging works well, we lose some context by averaging. To deal with this issue, we use a novel attention on top of attention (Atop) method. We concatenate all three kinds of attention weights into $\alpha_{con}$ and $\alpha^{S}_{con}$ at the word and sentence levels, respectively. We apply a tanh-activated dense layer as a scoring function and, subsequently, a softmax layer to compute an attention weight for each of the three kinds of attention.

At the word level:

$$\alpha_{con} = (\alpha^{C}_{j,i} \oplus \alpha^{T}_{j,i} \oplus \alpha^{Sl}_{j,i}), \qquad u_{wa} = \tanh(W_{wa}\,\alpha_{con} + b_{wa}), \qquad \beta_w = \mathrm{softmax}(u_{wa})$$

The sentence level is handled correspondingly using $\alpha^{S}_{con}$. Here $\beta_w$ and $\beta_s$ are the learned attention weight vectors for the three kinds of attention at the word and sentence levels, and $doc$ is the document representation formed after Atop attention.
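A minimal sketch of the Atop idea at the word level follows. The concatenation, tanh scoring layer, and softmax mirror the description in the text; recombining the three attention vectors as a weighted sum with the learned weights is an assumption, as are the function name and the toy parameter values.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def atop(alpha_claim, alpha_title, alpha_self, weight, bias):
    """Learn one weight per attention type, then recombine (assumed step)."""
    alpha_con = alpha_claim + alpha_title + alpha_self  # concatenation
    u = [
        math.tanh(sum(w * a for w, a in zip(row, alpha_con)) + b)
        for row, b in zip(weight, bias)
    ]
    beta = softmax(u)  # three entries: claim, title, self
    fused = [
        beta[0] * c + beta[1] * t + beta[2] * s
        for c, t, s in zip(alpha_claim, alpha_title, alpha_self)
    ]
    return beta, fused

a_claim, a_title, a_self = [0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.1, 0.8]

# With an all-zero scoring layer, Atop falls back to plain averaging.
zero_w = [[0.0] * 9 for _ in range(3)]
beta_uniform, fused_uniform = atop(a_claim, a_title, a_self, zero_w, [0.0] * 3)

# A scoring layer that favours the first row puts more mass on claim attention.
claim_w = [[1.0] * 9, [0.0] * 9, [0.0] * 9]
beta_claim, fused_claim = atop(a_claim, a_title, a_self, claim_w, [0.0] * 3)
```

The first call shows that simple averaging is a special case of this scheme, while the second shows how the scoring layer can shift weight towards one attention type instead of treating all three equally.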

Prediction and Optimization. We use the overall document representation $doc$ in a softmax layer for the classification. To train the model, we use standard softmax cross-entropy with logits as the loss function. The predicted label $\hat{y}$ is computed by applying this softmax layer to $doc$.
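The prediction step can be sketched as a single dense layer followed by a softmax, with cross-entropy as the training loss. The weight values, the two-class label order (index 0 for 'true', index 1 for 'false'), and the function names are assumptions made for this example.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def predict(doc, weight, bias):
    """Class probabilities from the document representation."""
    logits = [
        sum(w * x for w, x in zip(row, doc)) + b
        for row, b in zip(weight, bias)
    ]
    return softmax(logits)

def cross_entropy(probs, true_label):
    """Negative log-likelihood of the correct class."""
    return -math.log(probs[true_label])

probs = predict([0.7, 0.7], [[1.0, 0.0], [0.0, -1.0]], [0.0, 0.0])
predicted_label = max(range(len(probs)), key=probs.__getitem__)
loss_if_true = cross_entropy(probs, 0)
loss_if_false = cross_entropy(probs, 1)
```

The loss is small when the predicted distribution puts most of its mass on the correct label and grows as that mass shrinks, which is what drives the attention weights during back-propagation.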

Recent works retrieve documents from the Web as external evidence to support or refute the claims and thereafter extract snippets as explanations of the model's decision [16, 21]. However, the snippets extracted by these methods are often redundant and lack topical diversity. The objective of our summarization algorithm is to provide a ranked list of sentences that are novel, non-redundant, and diverse across the topics identified from the text of the documents. In this section, we outline the method we use to achieve this objective.

Multi-topic Sentence Model. Each sentence in a document retrieved for the claim is modeled as a collection of topics: $s = \{a^{(1)}, a^{(2)}, \ldots, a^{(k)}\}$. Let $A$ be the set of all topics $a_i \in A$ across all candidate sentences from the pseudo-relevant set of documents $D$ for the claim.

Objective. We formulate the summarization task as a diversification objective. Given a set of relevant sentences $R$, which are attended by Atop attention in SUMO while establishing the claim's correctness, we have to find the smallest subset of sentences $S \subseteq R$ such that all topics $a_i \in A$ are covered. This is a variation of the Set Cover problem [1, 5, 8, 10, 11, 29, 30]. However, unlike IA-Select [1], we do not use the Max Coverage variation of the Set Cover problem. Instead, we formulate it as Set Cover itself [10, 29]. That is, given a set of topics $A$, find a minimal set of sentences $S \subseteq R$ that covers those topics [29]. Additionally, the inclusion of each sentence $s$ in the subset $S$ has an associated cost, which is defined in terms of the following quantities: $\theta_s$, the topic distribution score for sentence $s$ computed using a topic model (e.g., Latent Dirichlet Allocation [3]); $W_{wa} = \frac{1}{l}\sum_{i=1}^{l} W_{wa}(i)$, the average of the attention weights of the words contained in sentence $s$; $W_{sa}$, the attention weight of the sentence $s$; and $\lambda$, a parameter to be tuned.

We briefly describe our adaptation of the Greedy algorithm, which provides an approximate solution to the Set Cover problem, based on the discussion in [5, 8, 10, 11, 29, 30].

Algorithm 1: Adaptation of the approximate Greedy algorithm for the Set Cover problem from [5, 8, 10, 11, 29, 30] to our topical diversification problem setting. At each iteration, a sentence is chosen that covers the largest number of topics, as reflected by its topic distribution score, and has the highest attention weights. As output, we are assured a non-redundant, novel, and diversified set of sentences.

Datasets. We use two publicly available datasets, namely the PolitiFact political claims dataset and the Snopes political claims dataset [21], for evaluating SUMO's capability for fact checking. Statistics for both datasets are shown in Table 1. In the case of PolitiFact, claims have one of the following labels: 'true', 'mostly true', 'half true', 'mostly false', 'false', and 'pants-on-fire'. We convert the 'true', 'mostly true', and 'half true' labels to 'true' and the rest of them to 'false'. For the Snopes dataset, each claim has either 'true' or 'false' as a label.
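The label conversion for PolitiFact can be written as a small mapping. The function name and the normalization of case and whitespace are assumptions added for the example; the grouping of labels follows the text.

```python
TRUE_LABELS = {"true", "mostly true", "half true"}
FALSE_LABELS = {"mostly false", "false", "pants-on-fire"}

def binarize(label):
    """Map a six-way PolitiFact label to the binary 'true'/'false' scheme."""
    normalized = label.strip().lower()
    if normalized in TRUE_LABELS:
        return "true"
    if normalized in FALSE_LABELS:
        return "false"
    raise ValueError(f"unknown label: {label!r}")

binary_labels = [binarize(x) for x in ["Half True", "pants-on-fire", "mostly false"]]
```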

We evaluate SUMO for the task of summarization on the PolitiFact, Snopes, Climate, and Health datasets. The two new datasets, Climate and Health, concern climate change and health care, respectively. We test SUMO only on the PolitiFact and Snopes datasets for the task of fact checking, as they are orders of magnitude larger than the new datasets that we release. The climate change dataset contains claims broadly related to climate change and global warming from climatefeedback.org. We use each claim as a query to the Google API to search the Web and retrieve external evidence in the form of search results. Similarly, we create a dataset related to health care that additionally contains claims pertaining to the current global COVID-19 pandemic from healthfeedback.org. Examples of claims from these two datasets are shown in Figure 3. We make the new datasets publicly available to the research community at the following URL: https://github.com/rahulOmishra/SUMO/.

SUMO Implementation. We use TensorFlow to implement SUMO. We use per-class accuracy and macro F1 scores as performance metrics for evaluation. We use a bi-directional Gated Recurrent Unit (GRU) with a hidden size of 200, word2vec [15] and GloVe [19] embeddings with an embedding size of 200, and softmax cross-entropy with logits as the loss function. We set the learning rate to 0.001, the batch size to 64, and gradient clipping to 5. All the parameters are tuned using a grid search. We train each model for 50 epochs and apply early stopping if the validation loss does not change for more than 5 epochs. We set the maximum sentence length to 45 and the maximum number of sentences in a document to 35. For the task of summarization, we use Latent Dirichlet Allocation (LDA) [3] as a topic model to compute topic distribution scores and the dominant topic for each candidate sentence.
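The early-stopping rule from the implementation details can be sketched independently of any training framework. The function below interprets "does not change" as "does not improve on the best value seen so far", which is an assumption; the function name and the `min_delta` parameter are also hypothetical.

```python
def stopping_epoch(val_losses, max_epochs=50, patience=5, min_delta=0.0):
    """Return the 1-based epoch at which training would stop.

    Training stops once the validation loss has failed to improve on the
    best value seen so far for more than `patience` consecutive epochs,
    or when `max_epochs` is reached.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best - min_delta:
            best, stale = loss, 0
        else:
            stale += 1
            if stale > patience:
                return epoch
    return min(len(val_losses), max_epochs)

# Loss plateaus after epoch 3, so training stops at epoch 9.
plateau = [1.0, 0.8, 0.7] + [0.7] * 10
stopped_at = stopping_epoch(plateau)

# Loss keeps improving, so training runs for the full 50 epochs.
improving = [1.0 / (i + 1) for i in range(60)]
ran_until = stopping_epoch(improving)
```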

We experiment with five variants of our proposed SUMO model and compare them with six state-of-the-art methods. The six state-of-the-art methods are as follows. First, we have the basic Long Short Term Memory (LSTM) [7] unit, which is used with claim and document contents for classification. Second, we have a convolutional neural network (CNN) [9] for document classification. Third, we compare against the model proposed in [27], which uses a hierarchical representation of the documents based on hierarchical LSTM units (Hi-LSTM). Fourth, we compare against the model proposed in [33], which uses hierarchical neural attention on top of hierarchical LSTMs (HAN) to learn better representations of documents for classification. Fifth, we compare against the model proposed in [21], which uses a claim-guided attention method (DeClarE) for predicting the correctness of claims in the presence of external evidence. Sixth and finally, we compare against the recent work [16], which improves on the DeClarE method by using attention based on latent aspects (speaker, topic, or domain).

[Figure 3: Examples of claims from the Climate and Health datasets.]

Climate:

• Global warming slowing down? 'Ironic' study finds more CO2 has slightly cooled the planet.

• The ozone layer is healing.

• Deforestation has made humans more vulnerable to pandemics.

• Historical data of temperature in the U.S. destroys global warming myth.

Health:

• New evidence shows wearing face mask can help coronavirus enter the brain and pose more health risk, warn expert.

• Boil weed and ginger for Covid-19 victims, the virus will vanish.

• Smoking may protect against COVID-19.

• Wearing face masks can cause carbon dioxide toxicity; can weaken immune system.

The results for establishing claim correctness are shown in Ta 

For the evaluation of the summarization capability of SUMO, we create gold reference summaries for claims. A gold reference summary includes all the facts related to the claim that are important for predicting the claim's correctness, non-redundant, and topically diverse. We find that the descriptions provided for a claim on fact-checking websites such as snopes.com and politifact.com are suitable for this purpose. We use a cosine similarity threshold of 0.4 between claims and sentences of the description to filter out irrelevant or noisy sentences. As evaluation metrics, we use ROUGE-1, ROUGE-2, and ROUGE-L scores. The ROUGE-1 score represents the overlap of unigrams, while the ROUGE-2 score represents the overlap of bigrams, between the summaries generated by SUMO and the gold reference summaries. The ROUGE-L score measures the longest matching sequence of words using the Longest Common Subsequence algorithm.
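The three ROUGE scores can be illustrated with a compact sketch. It computes recall-oriented variants on whitespace tokens; whether the reported numbers are recall, precision, or F-scores is not stated in the text, so the recall choice, the function names, and the toy sentences are assumptions.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n):
    """Fraction of reference n-grams that also appear in the candidate."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total

def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def rouge_l_recall(candidate, reference):
    return lcs_length(candidate, reference) / len(reference) if reference else 0.0

reference = "smoking increases the risk of severe covid".split()
candidate = "smoking raises the risk of covid".split()
r1 = rouge_n_recall(candidate, reference, 1)
r2 = rouge_n_recall(candidate, reference, 2)
rl = rouge_l_recall(candidate, reference)
```

For this pair, five of the seven reference unigrams and two of the six reference bigrams are matched, and the longest common subsequence has five tokens.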

Standard summarization techniques are not useful in such a scenario as the objective of summarization with standard techniques is usually not fact-checking. Hence, we compare the SUMO results with an information retrieval (BM25) and a natural language processing based method (QuerySum). BM25 is a ranking function, which uses a probabilistic retrieval framework and ranks the documents based on their relevance to a given search query. We use Web claims as a query and apply BM25 to get the most relevant sentences from all the documents retrieved for the claim. We also compare the results with the query-driven attention based abstractive summarization method QuerySum [17] , which also uses a diversity objective to create a diverse summary. We use ROUGE metrics with a gold reference summary to evaluate the generated summaries.
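The BM25 baseline can be sketched as a sentence-level ranking function with the claim as the query. The parameter values `k1 = 1.5` and `b = 0.75` are common defaults rather than values taken from the paper, the smoothed IDF variant is one of several in use, and the tokenized sentences are toy examples.

```python
import math
from collections import Counter

def bm25_scores(query, sentences, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized sentence for a tokenized query."""
    n = len(sentences)
    avg_len = sum(len(s) for s in sentences) / n
    doc_freq = Counter(term for s in sentences for term in set(s))
    scores = []
    for s in sentences:
        tf = Counter(s)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(s) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

claim = "smoking covid".split()
sentences = [
    "smoking worsens covid outcomes".split(),
    "the ozone layer is healing".split(),
    "covid spreads quickly".split(),
]
scores = bm25_scores(claim, sentences)
ranking = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
```

The ranking depends only on term overlap with the claim, which is why such a baseline can surface sentences that mention the claim's words without helping to verify it.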

Results for the task of summarization are shown in Table 3. The QuerySum method performs significantly better than BM25, with a ROUGE-L score of 30.16, as it uses query-driven attention and a diversity objective, which result in a diverse and query-oriented summary. The proposed model SUMO outperforms QuerySum with a ROUGE-L score of 35.92. We attribute this gain to the use of word-level and sentence-level weights, which are trained using back-propagation with the correctness label. We also notice that, in QuerySum, some sentences are related to the claim but are not useful for fact checking and are therefore absent from the gold reference summary. The results for SUMO are statistically significant (p-value = $1.39 \times 10^{-4}$) using a pairwise Student's t-test.

We presented SUMO, a neural network based approach to generate explainable and topically diverse summaries for verifying Web claims. SUMO uses an improved version of hierarchical claim-driven attention along with title-driven and self-attention to learn an effective representation of the external evidences retrieved from the Web. Learning this effective representation in turn assists us in establishing the correctness of textual claims. Using the overall attention weights from the novel Atop attention method and topical distributions of the sentences, we generate extractive summaries for the claims. In addition to this, we release two important datasets pertaining to climate change and healthcare claims.

In the future, we plan to investigate BERT [6] and other Transformer-based [28] embedding methods in place of GloVe [19] embeddings for better contextual representation of words.

[1] Diversifying search results
[2] Generating Fact Checking Explanations
[3] Latent dirichlet allocation
[4] Information credibility on twitter
[5] A greedy heuristic for the set-covering problem
[6] BERT: Pre-training of deep bidirectional transformers for language understanding
[7] Long short-term memory
[8] Approximation algorithms for combinatorial problems
[9] Convolutional neural networks for sentence classification
[10] Approximation algorithms
[11] On the ratio of optimal integral and fractional covers
[12] Detecting rumors from microblogs with recurrent neural networks
[13] Detect rumors using time series of social context information on microblogging websites
[14] Detect rumor and stance jointly by neural multi-task learning. WWW '18, pages 585-593, Republic and Canton of Geneva
[15] Distributed representations of words and phrases and their compositionality
[16] Sadhan: Hierarchical attention networks to learn latent aspect embeddings for fake news detection
[17] Diversity driven attention model for query-based abstractive summarization
[18] A Decomposable Attention Model for Natural Language Inference
[19] Glove: Global vectors for word representation
[20] Where the truth lies: Explaining the credibility of emerging claims on the web and social media
[21] Declare: Debunking fake news and false claims using evidence-aware deep learning
[22] A stylometric inquiry into hyperpartisan and fake news
[23] Rumor has it: Identifying misinformation in microblogs
[24] Neural user response generator: Fake news detection with collective user intelligence
[25] Truth of varying shades: Analyzing language in fake news and political fact-checking
[26] Beyond news contents: The role of social context for fake news detection
[27] Document modeling with gated recurrent neural network for sentiment classification
[29] Approximation algorithms
[30] The design of approximation algorithms
[31] Adversarial Domain Adaptation for Stance Detection
[32] Unsupervised fake news detection on social media: A generative approach
[33] Hierarchical attention networks for document classification