title: The Case for Claim Difficulty Assessment in Automatic Fact Checking
authors: Singh, Prakhar; Das, Anubrata; Li, Junyi Jessy; Lease, Matthew
date: 2021-09-20

Fact-checking is the process of evaluating the veracity of claims (i.e., purported facts). In this opinion piece, we raise an issue that has received little attention in prior work: that some claims are far more difficult to fact-check than others. We discuss the implications this has for both practical fact-checking and research on automated fact-checking, including task formulation and dataset design. We report a manual analysis undertaken to explore factors underlying varying claim difficulty and identify several distinct types of difficulty. We motivate this new claim difficulty prediction task as beneficial to both automated fact-checking and practical fact-checking organizations.

Misinformation (and the closely related disinformation, fake news, and propaganda) can cause serious societal harm, such as swaying election outcomes, undermining public health during a pandemic, or disrupting financial markets (Reuters, 2021). Such concerns have motivated NLP work on automatic fact-checking (Thorne et al., 2018; Popat et al., 2018; Augenstein et al., 2019; Hanselowski et al., 2019; Wadden et al., 2020; Gupta and Srikumar, 2021; Saakyan et al., 2021). The task remains quite challenging; recent performance on key datasets of natural claims is below 50% F1 (Atanasova et al., 2020; Augenstein et al., 2019; Gupta and Srikumar, 2021).

In this work, we analyze Politifact fact-checks and several NLP datasets based on them (Wang, 2017; Alhindi et al., 2018; Augenstein et al., 2019; Shu et al., 2018). Our analysis shows that some claims are far more difficult to fact-check than others, an issue that has received little attention. To illustrate this, Figure 1 shows several claims that were fact-checked by Politifact. For example, Claim 3 can be directly verified by a simple search leading to a major media news article. In contrast, Claim 5 appears to lack any direct sources (i.e., written evidence, online or otherwise), and historians were eventually consulted. These examples illustrate the vast range of difficulty across claims.

Incorporating claim difficulty into fact-checking has conceptual, research, and practical implications. Conceptually, while prior work has modeled claim check-worthiness (to triage potential fact-checks) (Barrón-Cedeño et al., 2020; Nakov et al., 2021), a claim may be check-worthy yet extremely difficult or impossible to fact-check. For research, designing benchmarks to incorporate claims of varying difficulty could provide greater insight when benchmarking and assessing state-of-the-art capabilities. In practice, because the number of new claims each day far exceeds the capacity of human fact-checkers, prioritizing which few claims to fact-check is essential. Claim difficulty prediction could help prioritize claims that are not only check-worthy but also easy or fast to check, so that more claims could be checked. Easier claims might be delegated to less-skilled fact-checkers or, with human-in-the-loop architectures, to an automated model, so that fact-checkers can focus on more challenging cases. We know of no prior work motivating this claim difficulty prediction task or, more fundamentally, investigating the underlying factors for why some claims are more difficult than others.
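To make the prioritization idea above concrete, here is a minimal sketch of how a triage queue might combine check-worthiness with predicted difficulty. Everything in it (the Claim fields, the triage function, and the linear scoring rule) is a hypothetical illustration of this proposal, not an existing system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Claim:
    text: str
    check_worthiness: float      # score in [0, 1] from an existing check-worthiness model
    predicted_difficulty: float  # hypothetical difficulty score in [0, 1]; 1 = hardest


def triage(claims: List[Claim], difficulty_weight: float = 0.5) -> List[Claim]:
    """Order claims so that check-worthy but easier claims are attempted first.

    The linear scoring rule is purely illustrative; a real deployment would
    need calibrated scores and organization-specific policies.
    """
    def priority(c: Claim) -> float:
        return c.check_worthiness - difficulty_weight * c.predicted_difficulty

    return sorted(claims, key=priority, reverse=True)


# Example with made-up scores: the second claim is nearly as check-worthy
# but far easier, so it is ranked first.
queue = triage([
    Claim("Claim A", check_worthiness=0.9, predicted_difficulty=0.8),
    Claim("Claim B", check_worthiness=0.8, predicted_difficulty=0.1),
])
print([c.text for c in queue])  # ['Claim B', 'Claim A']
```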
In this opinion piece, we motivate this task and the opportunity it presents for NLP research to impact fact-checking in a novel and important way. The analysis of claims we present leads us to identify five distinct factors contributing to claim difficulty. Modeling and predicting the difficulty of individual task instances is already recognized in other areas, such as machine translation (Mishra et al., 2013; Li and Nenkova, 2015), syntactic parsing (Garrette and Baldridge, 2013), and search engine switching behavior (White and Dumais, 2009), among others. We believe that instance difficulty prediction could yield similar benefits for fact-checking. We analyze several NLP fact-checking datasets that use (real) Politifact data (Wang, 2017; Alhindi et al., 2018; Augenstein et al., 2019; Shu et al., 2018), noting that the accuracy of fact-checking models tends to be lower on these datasets than on others, indicating the difficulty of these real-world claims.

1. Claim ambiguity. Fact-checking a claim requires understanding what it is about, and claim ambiguity (under-specification) clearly undermines this. For example, in Figure 1's first claim, the entity "Midtown" is ambiguous; search results for this claim text tend to retrieve information about Manhattan, NY. Even given additional contextual metadata ("Florida"), ambiguity remains among Florida cities. Similarly, the relative temporal expression "last quarter" requires temporal grounding. For Claim 5, the phrase "South of the Union" is ambiguous (states in the south but inside the Union vs. states to the south and outside of the Union). Ambiguity arises in many forms, including ambiguous pronouns (Kocijan et al., 2020), ill-defined terms (Vlachos and Riedel, 2014), etc. Vlachos and Riedel (2014) simply side-step the problem by excluding ambiguous claims, Thorne et al. (2018) resolve named entities, and other datasets provide contextual metadata (Wang, 2017; Augenstein et al., 2019), yet targeted decontextualization (Choi et al., 2021) may be needed, as Claim 1 shows. Such ambiguity can complicate all stages of fact-checking: claim identification, de-duplication, and check-worthiness detection; evidence retrieval; and inference. Retrieval based on ambiguous claims can naturally yield poor search results, making verification difficult or impossible. This does not appear to have been considered in constructing most existing datasets (Wang, 2017; Alhindi et al., 2018; Popat et al., 2016; Shu et al., 2018; Augenstein et al., 2019). Even if evidence retrieval is perfect, ambiguity can also undermine inference.

2. Poor ranking. Evidence-based fact-checking assumes retrieval of sufficient, high-quality evidence. Whereas Vlachos and Riedel (2014) assume evidence manually retrieved by experts, automated fact-checking often relies on automated information retrieval (IR) search engines to obtain evidence. Fact-checking systems (Samarinas et al., 2021) and datasets (Popat et al., 2016; Baly et al., 2018; Augenstein et al., 2019; Saakyan et al., 2021) have typically used only the top 5-30 search results, and while Web search engines may perform well on retrieval for easy claims (e.g., Claim 3 in Figure 1), difficult claims (e.g., Claim 2) are another story. Overall, we find a vast gulf between the sources of evidence consulted by professional fact-checkers and Web search engine results (Figure 2). This gap will carry over into datasets built on such search results (Augenstein et al., 2019; Shu et al., 2018; Popat et al., 2016) unless we also triage claim difficulty based on retrieval results to filter out or tag difficult claims.
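As one illustration of how such retrieval-based triage might be quantified, the sketch below computes the overlap between the source domains cited in a professional fact-check and those appearing among the top-k Web search results for the claim; a low overlap is one possible (unvalidated) difficulty signal. The helper functions and example URLs are our own assumptions.

```python
from typing import Iterable, List
from urllib.parse import urlparse


def domains(urls: Iterable[str]) -> set:
    """Extract host names (e.g., 'www.nytimes.com') from a list of URLs."""
    return {urlparse(u).netloc.lower() for u in urls if u}


def source_overlap(fact_check_citations: List[str],
                   search_results: List[str], k: int = 10) -> float:
    """Fraction of domains cited in a professional fact-check that also appear
    among the top-k Web search results for the claim.

    A low value is one possible retrieval-based difficulty signal; this is an
    assumption of ours, not a validated metric.
    """
    cited = domains(fact_check_citations)
    retrieved = domains(search_results[:k])
    return len(cited & retrieved) / len(cited) if cited else 0.0


# Hypothetical example: only one of the two cited domains surfaces in search.
cited = ["https://www.census.gov/some-report", "https://www.nytimes.com/some-article"]
retrieved = ["https://exampleblog.com/post", "https://www.nytimes.com/other-article"]
print(source_overlap(cited, retrieved))  # 0.5
```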
3. Limited access. Worse than poor ranking of evidence is when it cannot be accessed at all. For example, the deep web (Madhavan et al., 2008) contains far more information than the indexed Web yet is largely inaccessible to search engines. When information can be accessed, can it be read? Prior work has reported parsing difficulty even with HTML files (Augenstein et al., 2019; Vlachos and Riedel, 2014), whereas the source domains used by Politifact (i.e., the websites that host the retrieved sources, e.g., www.nytimes.com) often use many other formats. Table 2 shows our finding that roughly 40% of Politifact's sources are difficult to parse due to issues such as dynamically rendered JavaScript, links to pages without the main content (i.e., further link-following is needed), paywalls, media, etc. Many sources (7%) require tabular data extraction (Chen et al., 2019; Schlichtkrull et al., 2021). Relevant evidence is sometimes buried in a 500-page government report that is too long for existing systems to process well (Wan et al., 2019), or in non-textual formats (e.g., image or video). Such varying access and formats are another contributor to varying claim difficulty.

4. Source reliability. We find that IR search results lack high-quality sources (e.g., reputable news outlets and scientific venues) for many claims. Consider Figure 1's Claim 4. The best two sources retrieved are comments from a transcript of a news show and a blog. The news transcript lends credence to the user comment but is insufficient alone because it does not specify that its context is "US law", whereas the claim is about South Africa. In fact, this was a key reason for Politifact's verdict of "mostly false". While such sources can be useful, gauging the relative degree of trust to ascribe to them can be quite challenging. For example, Popat et al. (2016) found their per-domain strategy to be extremely limiting. In general, such varying reliability of retrieved evidence, and the difficulty of estimating reliability for less-established sources, is another important contributor to varying claim difficulty.

5. Inference. Claims also vary greatly in the degree of complex or multi-hop (Jiang et al., 2020) inference required for verification. For (easy) Claim 3 of Figure 1, source b directly states that the absence of ICE would not affect border security, refuting the claim. In contrast, for (difficult) Claim 2, none of the sources discuss "WI families" and the "GOP estate tax break" together; a model would need to synthesize information from multiple sources. With Claim 5, one would first need to unpack the presuppositions within the claim (e.g., how many slaveholders were on each "side", and whether the number is plausible). Politifact cross-referenced historical data, which requires a sufficient grasp of pragmatics and common sense and strains the limits of current NLP (Geva et al., 2021; Pandia et al., 2021). We also found that Politifact frequently cites documents, such as Senate bills, that contain a variety of domain-specific vocabulary.
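To make these factors more concrete, the sketch below shows how they might be turned into crude features for a downstream difficulty predictor (factor 2 could be captured by the retrieval-overlap signal sketched earlier). The regular expressions, the domain allowlist, and the feature definitions are all illustrative assumptions of ours, not validated heuristics.

```python
import re
from typing import Dict, List
from urllib.parse import urlparse

# Hypothetical allowlist; a real system would need a maintained source-quality resource.
REPUTABLE_DOMAINS = {"www.census.gov", "www.nytimes.com", "www.nature.com"}

RELATIVE_TIME = re.compile(r"\b(last|next|this)\s+(week|month|quarter|year)\b", re.I)
PRONOUNS = re.compile(r"\b(he|she|they|it|this|that|these|those)\b", re.I)


def difficulty_features(claim: str, evidence_urls: List[str]) -> Dict[str, float]:
    """Crude, illustrative signals for the factors discussed above; none of
    these heuristics is validated. They only show how the factors might be
    turned into features for a downstream difficulty predictor."""
    hosts = [urlparse(u).netloc.lower() for u in evidence_urls]
    n = max(len(evidence_urls), 1)
    return {
        # Factor 1: surface cues of ambiguity (relative time expressions, bare pronouns).
        "has_relative_time": float(bool(RELATIVE_TIME.search(claim))),
        "has_pronoun": float(bool(PRONOUNS.search(claim))),
        # Factor 3: access/format issues (e.g., PDF evidence that is harder to parse).
        "frac_pdf_evidence": sum(u.lower().endswith(".pdf") for u in evidence_urls) / n,
        # Factor 4: share of evidence from (assumed) reputable source domains.
        "frac_reputable": sum(h in REPUTABLE_DOMAINS for h in hosts) / n,
        # Factor 5 proxy: distinct sources that may need to be combined (multi-hop hint).
        "num_distinct_sources": float(len(set(hosts))),
    }


print(difficulty_features(
    "Our property taxes went up 15 percent last quarter.",
    ["https://www.census.gov/data.pdf", "https://exampleblog.com/post"],
))
```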
Where do we go from here?

Improving benchmarks. Benchmarks define the yardsticks by which empirical success and field progress are often measured. By designing fact-checking datasets not only to incorporate claims of varying difficulty, but also to cover different aspects of claim difficulty, evaluation could yield more insightful comparisons between alternative approaches and better identify which aspects of modeling are working well vs. where future work is most needed.

Our analysis of claim ambiguity (#1) calls for greater attention to decontextualizing (Choi et al., 2021) ambiguous or underspecified language, in both the claim itself and in evidence retrieved based on ambiguous claim text. As discussed above, such ambiguity can complicate all stages of fact-checking: claim identification, de-duplication, and check-worthiness detection; evidence retrieval; and inference. Moreover, if important metadata or additional context used by fact-checkers is omitted from a derived dataset, this can change not only the difficulty but also the label. Specifically, when veracity labels go beyond true/false determinations to include indeterminate outcomes such as "not enough info" (Thorne et al., 2018), lack of sufficient context can actually change the correct label from a concrete determination to being unresolvable. However, if annotations are taken directly from fact-checking websites (Popat et al., 2016; Wang, 2017; Alhindi et al., 2018; Baly et al., 2018; Augenstein et al., 2019; Atanasova et al., 2020), simply ingesting the official fact-check answer would not capture such a change in verdict from website to dataset.

Whereas some fact-checking datasets have assumed access to the same evidence manually identified by experts (Vlachos and Riedel, 2014; Alhindi et al., 2018; Atanasova et al., 2020), enabling researchers to focus purely on the inference step, automated fact-checking in the wild typically assumes the use of automated (noisy) IR (Popat et al., 2016; Baly et al., 2018; Shu et al., 2018; Augenstein et al., 2019). A claim that might be easily checked from expert-found evidence may be difficult or impossible to verify using IR search results. Benchmark evaluations might also include controlled testing of varying difficulty conditions, such as the degree of claim ambiguity and expert-found vs. IR search results. Other evidence conditions might be varied as well, such as the proportion of relevant vs. irrelevant evidence, reliable vs. unreliable evidence, whether or not sufficient evidence exists, and what formats must be parsed to obtain sufficient evidence. Direct IR evaluations could also benchmark evidence retrieval in realistic settings, such as searching the full Web on natural claims.
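As a minimal sketch of how such controlled evidence conditions might be assembled for a benchmark instance, the following purely illustrative helper mixes relevant evidence, drawn from either an expert-found or an IR-retrieved pool, with distractor passages in a chosen proportion. The function name, parameters, and mixing policy are our own assumptions rather than a proposed standard.

```python
import random
from typing import List


def make_evidence_condition(expert_evidence: List[str],
                            ir_evidence: List[str],
                            distractors: List[str],
                            use_expert: bool = True,
                            frac_relevant: float = 1.0,
                            seed: int = 0) -> List[str]:
    """Assemble one controlled evidence condition for a benchmark instance.

    The evidence pool has the same size as the chosen relevant set, but only a
    `frac_relevant` share of it is actually relevant; the remainder is filled
    with distractor passages. This lets a benchmark vary the expert-found vs.
    IR-retrieved condition and the evidence-quality condition independently.
    The mixing policy is illustrative only.
    """
    rng = random.Random(seed)
    relevant = list(expert_evidence if use_expert else ir_evidence)
    n_relevant = round(frac_relevant * len(relevant))
    n_distract = min(len(relevant) - n_relevant, len(distractors))
    pool = rng.sample(relevant, n_relevant) + rng.sample(distractors, n_distract)
    rng.shuffle(pool)
    return pool


# Example: expert evidence, but only half of the pool is relevant.
condition = make_evidence_condition(
    expert_evidence=["passage A", "passage B"],
    ir_evidence=["snippet 1", "snippet 2"],
    distractors=["off-topic X", "off-topic Y"],
    frac_relevant=0.5,
)
print(condition)
```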
Predicting claim difficulty. Predicting claim difficulty, whether overall difficulty or its different aspects, promises many potential benefits.

I. As discussed above, the ability to predict claim difficulty could yield better fact-checking datasets and benchmarks.

II. Claim difficulty prediction could inform or complement model confidence estimates. Various works (Guo et al., 2017; Dong et al., 2018) have found that the posterior probabilities of deep models are not well calibrated to predictive accuracy. Since model errors likely correlate with claim difficulty, difficulty prediction could provide orthogonal information.

III. An explainable difficulty prediction model could lead to a more flexible approach to fact-checking, for example, by developing model architectures with inductive biases that exploit the specific aspects of the claim difficulty task that we have identified. Model explanations might also be tailored to claim difficulty, providing more thorough explanations for more difficult claims. For claims predicted to be too difficult to check automatically, a robust system might change tactics and provide a different user experience (i.e., fail gracefully), offering the user key evidence to consider instead of a full fact-check.

IV. Claim difficulty prediction has real, practical value. Fact-checking organizations could prioritize claims that are not only check-worthy but also easy or fast to check, improving throughput so that human fact-checkers cover more claims in the same time. Fact-checking latency is also crucial: the impact of a fact-check on public opinion is far reduced if it is not released within an hour of a breaking claim (Jain, 2021). Moreover, easier claims might be automatically verified or delegated to citizen fact-checkers, improving the scale, latency, and throughput of fact-checking organizations while enabling expert fact-checkers to focus on more difficult claims. Difficulty-based pricing of claims might also help fact-checking organizations incentivize workers to undertake more difficult fact-checks (Hale, 2021).

References

Where is your evidence: Improving fact-checking by justification modeling.
Generating fact checking explanations.
MultiFC: A real-world multi-domain dataset for evidence-based fact checking of claims.
Integrating stance detection and fact checking in a unified corpus.
CheckThat! at CLEF 2020: Enabling the automatic identification and verification of claims in social media.
TabFact: A large-scale dataset for table-based fact verification.
Decontextualization: Making sentences stand-alone.
Confidence modeling for neural semantic parsing.
Learning a part-of-speech tagger from two hours of annotation.
Automatically predicting sentence translation difficulty.
The CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news.
Pragmatic competence of pre-trained language models through the lens of discourse connectives.
Credibility assessment of textual claims on the web.
DeClarE: Debunking fake news and false claims using evidence-aware deep learning.
Fact check: CNN headline about GameStop trade phenomenon has been digitally altered.
COVID-Fact: Fact extraction and verification of real-world claims on COVID-19 pandemic.
Improving evidence retrieval for automated explainable fact-checking.
Joint verification and reranking for open fact checking over tables.
FakeNewsNet: A data repository with news content, social context and dynamic information for studying fake news on social media.
FEVER: A large-scale dataset for fact extraction and verification. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
Fact checking: Task definition and dataset construction.
Fact or fiction: Verifying scientific claims.
Long-length legal document classification.
"Liar, liar pants on fire": A new benchmark dataset for fake news detection.
Characterizing and predicting search engine switching behavior.