Misinformation Has High Perplexity

Nayeon Lee, Yejin Bang, Andrea Madotto, Pascale Fung
2020-06-08

Debunking misinformation is an important and time-critical task, as there could be adverse consequences when misinformation is not quashed promptly. However, the usual supervised approach to debunking via misinformation classification requires human-annotated data and is not suited to the fast time-frame of newly emerging events such as the COVID-19 outbreak. In this paper, we postulate that misinformation itself has higher perplexity than truthful statements, and propose to leverage perplexity to debunk false claims in an unsupervised manner. First, we extract reliable evidence from scientific and news sources according to sentence similarity to the claims. Second, we prime a language model with the extracted evidence and finally evaluate the correctness of given claims based on their perplexity scores at debunking time. We construct two new COVID-19-related test sets, one scientific and one political in content, and empirically verify that our system performs favorably compared to existing systems. We are releasing these datasets publicly to encourage more research in debunking misinformation on COVID-19 and other topics.

Debunking misinformation is the process of exposing the falseness of given claims based on relevant evidence. The failure to debunk misinformation in a timely manner can result in catastrophic consequences, as illustrated by the recent death of a man who tried to self-medicate with chloroquine phosphate to prevent COVID-19 (Vigdor, 2020). Amid the COVID-19 pandemic and infodemic, the need for an automatic debunking system has never been more dire.

* These two authors contributed equally.
1 https://github.com/HLTCHKUST/covid19-misinfo-data

Figure 1: Proposed approach of using the language model (LM) as a debunker. We prime an LM with extracted evidence relevant to the whole set of claims, and then compute the perplexities during the debunking stage.

A lack of data is the major obstacle to debunking misinformation related to newly emerged events like the COVID-19 pandemic. There are no labeled data available to adopt supervised deep-learning approaches (Etzioni et al., 2008; Wu et al., 2014; Ciampaglia et al., 2015; Popat et al., 2018; Alhindi et al., 2018), and an insufficient amount of social- or meta-data to apply feature-based approaches (Long et al., 2017; Karimi et al., 2018; Kirilin and Strube, 2018). To overcome this bottleneck, we propose an unsupervised approach that uses a large language model (LM) as a debunker.

Misinformation is a piece of text that contains false information about its subject, resulting in a rare co-occurrence of the subject and its neighboring words in a truthful corpus. When a language model primed with truthful knowledge is asked to reconstruct the missing words in a claim such as "5G communication transmits ...", it predicts the word "information" with the highest probability (7 × 10⁻¹). In contrast, it predicts the word "COVID-19" in a false claim such as "5G communication transmits COVID-19" with very low probability (2 × 10⁻⁴). It follows that truthful statements tend to receive low perplexity, whereas false claims tend to have high perplexity, when scored by a truth-grounded language model.
Since perplexity quantifies the likelihood of a given sentence with respect to a previously encountered distribution, we propose a novel interpretation of perplexity as a degree of falseness. To further address the problem of data scarcity, we leverage large pre-trained LMs such as GPT-2 (Radford et al., 2019), which have been shown to be helpful in low-resource settings by allowing the transfer of features learned from their huge training corpora (Devlin et al., 2018; Radford et al., 2018). LMs have also been shown to learn useful knowledge without any explicit task-specific supervision and to perform well on tasks such as question answering and summarization (Radford et al., 2019; Petroni et al., 2019).

Moreover, it is crucial to ensure that the LM is populated with "relevant and clean evidence" before assessing the claims, especially when these are related to newly emerging events. There are two main ways of obtaining evidence in fact-checking tasks. One is to rely on evidence from structured knowledge bases such as Wikipedia or knowledge graphs (Wu et al., 2014; Ciampaglia et al., 2015; Thorne et al., 2018; Yoneda et al., 2018a; Nie et al., 2019). The other is to obtain evidence directly from unstructured data online (Etzioni et al., 2008; Magdy and Wanas, 2010). However, the former approach faces a challenge in maintaining up-to-date knowledge, making it vulnerable to unprecedented outbreaks, while the latter suffers from credibility issues and noisy evidence. In our work, we attempt to combine the best of both worlds in the evidence selection step by extracting evidence from unstructured data and ensuring quality by filtering out noise.

The contribution of our work is threefold. First, we propose a novel dimension of using language model perplexity to debunk false claims, as illustrated in Figure 1. It is not only data-efficient but also achieves promising results comparable to supervised baseline models. We also carry out qualitative analysis to understand the optimal ways of exploiting perplexity as a debunker. Second, we present an additional evidence-filtering step to improve the evidence quality, which consequently improves the overall debunking performance. Finally, we construct and release two new COVID-19-pandemic-related debunking test sets.

Language Modeling Given a sequence of tokens X = {x_0, · · · , x_n}, language models are trained to compute the probability of the sequence p(X). This probability is factorized as a product of conditional probabilities by applying the chain rule (Manning et al., 1999; Bengio et al., 2003):

$p(X) = \prod_{i=1}^{n} p(x_i \mid x_0, \cdots, x_{i-1})$

In recent years, large transformer-based (Vaswani et al., 2017) language models have been trained to minimize the negative log-likelihood over a large collection of documents.

Leveraging Perplexity Perplexity, a commonly used metric for measuring the performance of an LM, is defined as the inverse of the probability of the test set normalized by the number of words:

$PPL(X) = p(x_0, \cdots, x_n)^{-\frac{1}{n}}$

Another way of interpreting perplexity is as a measure of the likelihood of a given test sentence with reference to the training corpus. Based on this intuition, we hypothesize the following: "When a language model is primed with a collection of relevant evidence about given claims, the perplexity can serve as an indicator of falseness." The rationale behind this is that the extracted evidence sentences for a True claim share more similarities (e.g., common terms or synonyms) with the associated claim. This leads True claims to have lower perplexity, while False claims remain at higher perplexity.
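To make the scoring concrete, below is a minimal sketch of how a claim's perplexity could be computed with a pre-trained GPT-2 model. The use of the Hugging Face transformers library and the example claim strings are assumptions for illustration; this is not the authors' released code.

```python
# Minimal sketch: score a claim's perplexity with a pre-trained GPT-2
# (illustrative only; not the released implementation of this paper).
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def claim_perplexity(claim: str) -> float:
    """Perplexity = exp(mean token-level negative log-likelihood)."""
    enc = tokenizer(claim, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy loss.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

print(claim_perplexity("5G helps COVID-19 spread."))                # expected: higher
print(claim_perplexity("COVID-19 spreads from person to person."))  # expected: lower
```

In our pipeline, the same scoring is applied only after the model has been primed with relevant evidence, as described in the following sections.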
Task Definition Debunking is the task of exposing the falseness of a given claim by extracting relevant evidence and verifying the claim against it. More formally, given a claim c with its corresponding source document D, the task of debunking is to assign a label from {True, False} by retrieving and utilizing a relevant set of evidence E = {e_1, e_2, e_3} from D.

Table 1: Examples of False and True claims with their perplexity scores.
False Claims | Perplexity
Ordering or buying products shipped from overseas will make a person get | 556.2
Sunlight actually can kill the novel COVID-19. | 385.0
5G helps COVID-19 spread. | 178.2
Home remedies can cure or prevent the COVID-19. | 146.2
True Claims | Perplexity
The main way the COVID-19 spreads is through respiratory droplets. | 5.8
The most common symptoms of COVID-19 are fever, tiredness, and dry cough. | 6.0
The source of SARS-CoV-2, the coronavirus (CoV) causing COVID-19 is unknown. | 8.1
Currently, there is no vaccine and no specific antiviral medicine to prevent or treat COVID-19. | 8.4

Our approach involves three steps in the inference phase: 1) an evidence selection step to retrieve the most relevant evidence from D; 2) an evidence grounding step to obtain our evidence-primed language model (LM Debunker); 3) a debunking step to obtain perplexity scores from the evidence-primed LM Debunker and the debunking labels.

Given a claim c, our Evidence Selector retrieves the top-3 relevant evidence E = {e_1, e_2, e_3} in the following two steps.

Evidence Candidate Selection Given the source documents D, we select the top-10 most relevant evidence sentences for each claim. Depending on the domain of the claim and source documents, we rely on a generic TF-IDF method to select the tuples of evidence candidates with their corresponding relevancy scores {(e_1, s_1), · · · , (e_10, s_10)}. Note that any evidence extractor can be used here.

Evidence Filtering After selecting the top candidate tuples for the claim c, we attempt to filter out noisy and unreliable evidence based on the following rules: 1) When an evidence candidate is a quote from a low-credibility speaker such as an Internet meme or a social-media post, we discard it (e.g., "quote according to a social media post"). Note that this approach leverages the speaker information inherent in the extracted evidence. 2) If a speaker of the claim is available, any quote or statement by the speaker him/herself is inadmissible to the evidence candidate pool. 3) Any evidence identical to the given claim is considered "invalid evidence" (i.e., a direct quotation of the true/false claim). 4) We filter out reciprocal questions, which from our observation only add noise and carry no supporting or contradicting information for the claim. Examples of evidence before and after this filtering are shown in the Appendix. The final top-3 evidence E is selected after the filtering, based on the provided extractor score s. An example of a claim and its corresponding extracted evidence is shown in Table 2.

For the purpose of priming the language model, all the extracted evidence for a batch of claims C = {c_1, · · · , c_k} is aggregated as E = {e^1_1, e^1_2, e^1_3, · · · , e^k_1, e^k_2, e^k_3}. We obtain our evidence-grounded language model (LM Debunker) by minimizing the following loss L:

$L = -\sum_{e^j \in E} \sum_{i} \log p_\theta(x^j_i \mid x^j_0, \cdots, x^j_{i-1})$

where x^j_i denotes a token in the evidence e^j, and θ the parameters of the language model.
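As an illustration of this evidence-grounding step, the sketch below fine-tunes GPT-2 on the aggregated evidence with the standard causal LM loss L defined above. The optimizer, batching, epoch count, and example evidence strings are assumptions made for the sketch, not the exact training configuration.

```python
# Sketch of evidence grounding: minimize the causal LM loss over the aggregated
# evidence E (assumed optimizer/batching; not the exact configuration of the paper).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# In practice: the top-3 filtered evidence sentences for every claim in the batch.
aggregated_evidence = [
    "The main mode of COVID-19 transmission is via respiratory droplets.",
    "Influenza viruses are spread from person to person via respiratory droplets.",
]

model.train()
for epoch in range(5):  # the number of grounding epochs is a tunable hyperparameter
    for sentence in aggregated_evidence:
        enc = tokenizer(sentence, return_tensors="pt")
        loss = model(**enc, labels=enc["input_ids"]).loss  # negative log-likelihood
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
# The resulting evidence-primed model (the LM Debunker) is then used to score claims.
```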
It is important to highlight that none of the debunking labels or claims are involved in this evidence grounding step, and that our proposed methodology is model-agnostic.

The last step is to obtain debunking labels based on the perplexity values from the LM Debunker. As shown in Table 1, the perplexity values reveal a pattern that aligns with our hypothesis regarding the association between perplexity and falseness: the false claims have higher perplexity than the true claims (for more examples of perplexity values, refer to the Appendix).

Table 2: Illustration of evidence extracted using our Evidence Selector.
Claim: The main way the COVID-19 spreads is through respiratory droplets.
Evidence 1: The main mode of COVID-19 transmission is via respiratory droplets, although the potential of transmission by opportunistic airborne routes via aerosol-generating procedures in health care facilities, and environmental factors, as in the case of Amoy Gardens, is known.
Evidence 2: The main way that influenza viruses are spread is from person to person via virus-laden respiratory droplets (particles with size ranging from 0.1 to 100 µm in diameter) that are generated when infected persons cough or sneeze.
Evidence 3: The respiratory droplets spread can occur only through direct person-to-person contact or at a close distance.

Perplexity scores can be translated into debunking labels by comparison to a perplexity threshold th that defines the False boundary in the perplexity space. Any claim with a perplexity score higher than the threshold is classified as False, and vice versa for True. The optimal method of selecting the threshold hyperparameter th is an open research question. From an application perspective, any value can serve as a threshold depending on the desired level of "strictness" towards false claims. We define "strictness" as the degree of precaution against false negative errors, the most undesirable form of error in debunking (refer to Section 7 for details). From an experimental analysis perspective, a small validation set can be leveraged for hyperparameter tuning of the threshold th. In this paper, since we have small test sets, we perform k-fold cross-validation (k = 4) to obtain the average performance reported in Section 6.

Covid19-scientific A new test set is constructed by collecting COVID-19-related myths and scientific truths labeled by reliable sources such as MedicalNewsToday, the Centers for Disease Control and Prevention (CDC), and the World Health Organization (WHO). It consists of the most common scientific or medical myths about COVID-19, which must be debunked correctly to ensure the safety of the public (e.g., "drinking a bleach solution will prevent you from getting the COVID-19."). There are 142 claims (Table 3) with labels obtained from the aforementioned reliable sources. According to the WHO and CDC, some myths are unverifiable from current findings, and we assigned False labels to them (e.g., "The coronavirus will die off when ...").

Covid19-politifact Another test set is constructed by crawling COVID-19-related claims fact-checked by journalists on the Politifact website. Unlike the Covid19-scientific test set, it contains non-scientific and political claims such as "For the coronavirus, the death rate in Texas, per capita of 29 million people, we're one of the lowest in the country". Such political claims may not be life-and-death matters, but they still have the potential to bring negative sociopolitical effects.
Originally, these claims are labeled into six classes {pants-fire, false, barely-true, half-true, mostly-true, true}, which represent a decreasing degree of fakeness. We use a binary setup for consistency with our setup for Covid19-scientific, assigning the first three classes as False and the rest as True. For detailed data statistics, refer to Table 3.

For Covid19-politifact, we leverage the resources of the Politifact website. When journalists verify claims on Politifact, they provide pieces of text that contain: i) snippets of relevant information from various sources, and ii) a paragraph of their justification for the verification decision. We only take the first part (i) as our gold source documents, to avoid using explicit verdicts on test claims as evidence.

Although unrelated to COVID-19 misinformation, there are notable state-of-the-art (SOTA) models and their associated datasets in the fact-based misinformation field.

FEVER (Thorne et al., 2018) Fact Extraction and Verification (FEVER) is a publicly released large-scale dataset generated by altering sentences extracted from Wikipedia to promote research in fact-checking systems. We use one of the winning systems from the FEVER workshop (https://fever.ai/2018/task.html) as our first baseline model; we use the 2nd-place team's system because we had problems running the 1st-place team's codebase, and the accuracy difference between the two is minimal.

LIAR-Politifact (Wang, 2017) LIAR is a publicly available dataset collected from the Politifact website, which consists of 12.8k claims. The label setup is the same as for our Covid19-politifact test set, and the data characteristics are very similar, but LIAR does not contain any claims related to COVID-19. We also report three strong BERT-based (Devlin et al., 2018) baseline models trained on LIAR data:

• LiarPlusMeta: Our BERT-large-based replication of the SOTA model from Alhindi et al. It uses meta-information and "justification" (the human-written reasoning for the verification decision in the Politifact article) as evidence for the claim. Our replication is a more robust baseline, outperforming the reported SOTA accuracy by an absolute 9% (refer to the Appendix for the detailed result table).

• LiarPlus: Similar to the LiarPlusMeta model, but without meta-information. Our replication also outperforms the SOTA by an absolute 8% in accuracy. This baseline is important because the focus of this paper is to explore debunking ability in a data-scarce setting, where meta-information may not exist.

• LiarOurs: Another BERT-large model fine-tuned on LIAR-Politifact claims with evidence from our Evidence Selector, instead of the human-written "justification."

Models tested in out-of-distribution (OOD) settings are indicated in Table 4.

Evidence Input for Testing Recalling the task definition explained in Section 3, we test a claim with its relevant evidence. To make fair comparisons among all baseline models and our LM Debunker, we use the same evidence extracted in the Evidence Selector step when evaluating the models on the COVID-19-related test sets.

Language Model Setting For our experiments, the GPT-2 (Wolf et al., 2019) model is selected as the base language model for building the LM Debunker. We use the pre-trained GPT-2 (base) model with 117 million parameters. Since the COVID-19 pandemic is a recent event, it is guaranteed that GPT-2 has not seen any COVID-19-related information during its pre-training, making it a clean and unbiased language model for testing our hypothesis.
We evaluate the performance of the LM Debunker by comparing it to the other baselines on two commonly used metrics: accuracy and F1-Macro score. Since identifying False claims is important in debunking, we also report the F1 of the False class (F1-Binary). Recall that we report average results obtained via k-fold cross-validation. The thresholds used in each fold are {18, 19, 17, 22} for Covid-politifact and {15, 24, 17, 20} for Covid-scientific. Note that we use k-fold cross-validation to obtain the average performance, not the average optimal threshold. For the evidence grounding step, a learning rate of 5e-5 was adopted, and different epoch sizes {1, 2, 3, 5, 10, 20} were explored. We report the choice with the highest performance in both accuracy and F1-Macro. Each trial was run on an Nvidia GTX 1080 Ti, taking 39 seconds per epoch for Covid-scientific and 113 seconds per epoch for Covid-politifact.

6 Experimental Results

From Table 4, we can observe that our unsupervised LM Debunker shows notable strength in the out-of-distribution setting (highlighted in blue) compared to the other supervised baselines. For the Covid19-scientific test set, it achieved state-of-the-art results across all metrics, improving accuracy and F1-Binary by an absolute 13.9% ∼ 30.3% and 10% ∼ 28.7%, respectively. Considering the severe consequences Covid19-scientific myths could potentially bring, this result is valuable. For the Covid19-politifact test set, our LM Debunker also outperformed the Fever-HexaF model and LiarPlus by a significant margin, but it underperformed the LiarOurs and LiarPlusMeta models. Nonetheless, this is still encouraging considering that these two models were trained with task-specific supervision on a Politifact dataset (LIAR-Politifact) that is similar to the Covid19-politifact test set.

The results of LiarPlus and LiarPlusMeta clearly show the incongruity of the meta-based approach for cases in which meta-information is not guaranteed. LiarPlusMeta struggled to debunk the claims from the Covid19-scientific test set, in contrast to achieving SOTA on the Covid19-politifact test set. This is because the absence of meta-information for the Covid19-scientific test set hindered LiarPlusMeta from performing to its maximum capacity. Going further, the fact that LiarPlusMeta performed ...

The FEVER dataset and many FEVER-trained systems have successfully showcased the advancement of systematic fact-checking. Nevertheless, the Fever-HexaF model exhibited rather low performance on the COVID-19 test sets (10% ∼ 30% behind the LM Debunker in accuracy). One possible explanation is the way the FEVER data was constructed. FEVER claims were generated by altering sentences from Wikipedia (e.g., "Hearts is a musical composition by Minogue", label: SUPPORTS). This makes the nature of FEVER claims diverge from the naturally occurring myths and false claims flooding the internet, implying that FEVER might not be the most suitable dataset for training non-Wikipedia-based fact-checkers.

In Table 5, we report the performance of the LiarOurs and LiarPlusMeta classifiers trained on randomly sampled training sets of differing sizes {10, 100, 500, 1000, 10000}. As shown by the gray highlights in Table 5, both classifiers overtake our debunker in F1-Macro score with 500 labeled training examples, but they require 10,000 to outperform it on the rest of the evaluation metrics. Considering the scarcity of labeled misinformation data for newly emerged events, a data-efficient debunker is extremely meaningful.
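For concreteness, the threshold-selection scheme described in the evaluation setup above (tuning th on the training folds and evaluating on the held-out fold, k = 4) could be sketched as follows. The candidate grid, fold handling, and function names are illustrative assumptions, not the authors' released code.

```python
# Sketch of k-fold threshold selection (k=4): tune th on the training folds by
# accuracy and evaluate on the held-out fold. Grid and fold handling are illustrative.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

def debunk(ppl, th):
    # Perplexity above the threshold -> predicted False (1); otherwise True (0).
    return (np.asarray(ppl) > th).astype(int)

def cross_validated_accuracy(ppl, gold, candidate_ths, k=4):
    ppl, gold = np.asarray(ppl), np.asarray(gold)  # gold: 1 = False claim, 0 = True claim
    scores = []
    for fit_idx, eval_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(ppl):
        # Pick the threshold that works best on the fitting folds ...
        best_th = max(candidate_ths,
                      key=lambda th: accuracy_score(gold[fit_idx], debunk(ppl[fit_idx], th)))
        # ... and report accuracy on the held-out fold.
        scores.append(accuracy_score(gold[eval_idx], debunk(ppl[eval_idx], best_th)))
    return float(np.mean(scores))
```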
The relationship between the number of epochs in the evidence grounding step and the debunking performance is also explored. The best performance is obtained with epoch = 5, as shown in Figure 2. We believe this is because a low number of epochs does not allow enough updates to sufficiently encode the content of the evidence into the language model. On the contrary, a higher number of epochs over-fits the language model to the given evidence and harms its generalizability.

Threshold Perplexity Selection As mentioned earlier, the threshold th is controllable to reflect the desired "strictness" of the debunking system. Figure 3 shows that decreasing the threshold helps to reduce false negative (FN) errors, the most dangerous form of error. Such controllability over strictness would be beneficial in real-world applications, where the required level of "strictness" depends greatly on the purpose of the application. Meanwhile, FN reduction comes with a trade-off of increased false positive (FP) errors. For a more balanced debunker, an alternative threshold choice could be the intersection point of the FN and FP frequencies (see the sketch at the end of this section).

Table 6: Performance comparison between the "before" and "after" filtering steps in the Evidence Selector.

Our ablation study of the evidence filtering and cleaning steps (Table 6) shows that improved evidence quality brings big gains in F1-Macro scores (13.5% and 8.3%) with only a 1% loss in accuracy. Moreover, comparing the performance of the LM Debunker on the two test sets, the Covid19-scientific scores surpass the Covid19-politifact scores, especially in F1-Macro, by 11.1%. This is due to the disparate natures of the gold source documents used in evidence selection; the Covid19-scientific claims obtain evidence from scholarly articles, whereas the Covid19-politifact claims extract evidence from news articles and other unverified internet sources. Consequently, this resulted in a different quality of extracted evidence. An important insight, therefore, is that evidence quality is crucial to our approach, and additional performance gains could come from further improvements in evidence quality.

We identified areas for improvement in future work through qualitative analysis of wrongly predicted samples from the LM Debunker. First, since perplexity originally serves as a measure of sentence likelihood, when a true claim has an abnormal sentence structure, our LM Debunker makes a mistake by assigning it high perplexity. For example, the true claim "So Oscar Helth, the company tapped by Trump to profit from covid tests, is a Kushner company. Imagine that, profits over national safety" has extremely high perplexity. One interesting future direction would be to explore ways of disentangling "perplexity as a sentence quality measure" from "perplexity as a falseness indicator". Second, our LM Debunker makes mistakes when the selected evidence refutes a False claim by simply negating the content of the paraphrased claim. For instance, for the false claim "Taking ibuprofen worsens symptoms of COVID-19," the most relevant evidence from the scholarly articles is "there is no current evidence indicating that ibuprofen worsens the clinical course of COVID-19." Another future direction would be to learn a better way of assigning additional weight/emphasis to special linguistic features such as negation.
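Referring back to the threshold analysis above, a minimal sketch of the FN/FP trade-off and an intersection-style balanced threshold might look as follows; the array formats and helper names are assumptions for illustration only.

```python
# Sketch: false-negative / false-positive counts across candidate thresholds,
# and a balanced threshold chosen near the FN/FP intersection point.
import numpy as np

def fn_fp_counts(ppl, gold, th):
    pred = (np.asarray(ppl) > th).astype(int)    # 1 = predicted False claim
    gold = np.asarray(gold)                      # 1 = actually False claim
    fn = int(np.sum((gold == 1) & (pred == 0)))  # false claim that slips through
    fp = int(np.sum((gold == 0) & (pred == 1)))  # true claim flagged as false
    return fn, fp

def balanced_threshold(ppl, gold, candidate_ths):
    # Threshold whose FN and FP counts are closest to each other.
    def gap(th):
        fn, fp = fn_fp_counts(ppl, gold, th)
        return abs(fn - fp)
    return min(candidate_ths, key=gap)
```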
8 Related Work

Previous approaches (Long et al., 2017; Karimi et al., 2018; Kirilin and Strube, 2018; Shu et al., 2018; Monti et al., 2019) show that using meta-information (e.g., the credibility score of the speaker) together with text input helps improve the performance of misinformation detection. However, considering that the availability of meta-information is not always guaranteed, building a model independent of it is crucial for detecting misinformation. There also exist works with fact-based approaches, which use evidence from external sources to assess the truthfulness of information (Etzioni et al., 2008; Wu et al., 2014; Ciampaglia et al., 2015; Popat et al., 2018; Alhindi et al., 2018; Baly et al., 2018; Hanselowski et al., 2018). These approaches are based on the logic that "the information is correct if evidence from credible sources or a group of online sources supports it." Furthermore, some works focus on reasoning and evidence-selection ability by restricting the scope of facts to those from Wikipedia (Thorne et al., 2018; Nie et al., 2019; Yoneda et al., 2018a).

8.2 Language Model Applications

Large pre-trained language models have led to significant advancements in a wide variety of NLP tasks, including question answering, commonsense reasoning, and semantic relatedness (Devlin et al., 2018; Radford et al., 2019; Peters et al., 2018; Radford et al., 2018). These models are typically trained on documents mined from Wikipedia (among other websites). Recently, a number of works have found that LMs store a surprising amount of world knowledge, focusing particularly on the task of open-domain question answering (Petroni et al., 2019; Roberts et al., 2020). Going further, Guu et al. and Roberts et al. show that task-specific fine-tuning of LMs can achieve impressive results, proving the power of LMs. In this paper, we explore whether a large pre-trained LM can also be helpful in the field of debunking.

9 Conclusion

In this paper, we show that misinformation has high perplexity under a language model primed with relevant evidence. By proposing this new application of perplexity, we build an unsupervised debunker that shows promising results, especially in the absence of labeled data. Moreover, we emphasize the importance of evidence quality in our methodology by showing the improvement in final performance brought by the addition of a filtering step in evidence selection. We are also releasing two new COVID-19-related test sets publicly to promote transparency and prevent the spread of misinformation. Based on this successful leverage of language model perplexity for debunking, we hope to foster more research in this new direction.

Appendix

Additional examples of claims and their perplexity values:
Claim | Perplexity
Ordering or buying products shipped from overseas will make a person get |
Avoid touching eyes, nose and mouth help prevent COVID-19. | 26.5
For COVID-19, other flu-like symptoms such as aches and pains, nasal congestion, runny nose, sore throat or diarrhoea are also common. | 24.2
SARS was more deadly but much less infectious than COVID-19. | 21.5
Everyone is at risk of getting COVID-19. | 18.0
COVID-19 is different to SARS. | 17.8
Antibiotics does not work to kill COVID-19 virus. | 17.2
Some people become infected by COVID-19 but don't develop any symptoms and don't feel unwell. | 10.4
Currently, there is no vaccine and no specific antiviral medicine to prevent or treat COVID-19. | 8.4
The source of SARS-CoV-2, the coronavirus (CoV) causing COVID-19 is unknown. | 8.1
The incubation period for COVID-19 range from 1 to 14 days. | 7.5
People with other health conditions, such as asthma, heart diseases and diabetes, are at higher risk of getting seriously ill from COVID-19. | 6.3
The most common symptoms of COVID-19 are fever, tiredness, and dry cough. | 6.0
The main way the COVID-19 spreads is through respiratory droplets. | 5.8
COVID-19 spreads from person to person. | 3.4
The following examples illustrate the different evidence selected as top-3 before and after the filtering step of our Evidence Selector. Filtered Evidence is evidence detected by one of the filtering rules mentioned in the Evidence Selection section (3.1), and Replacing Evidence is evidence that enters the top-3 after the filtered evidence is discarded from the evidence candidate pool.

Claim: "The coronavirus has made it to Mississippi and the lady that caught it wasn't around nobody with it which means it is airborne. That means if the wind blows it your direction you'll have it also."
Label: False
Filtered Evidence: More than 5,000 people have shared a Feb. 28 Facebook post, for example, that warns "the coronavirus has made it to Mississippi."
Replacing Evidence: First of all, as of March 3, it doesn't appear that there are any confirmed cases in Mississippi of COVID-19, the disease caused by the coronavirus.

Claim: Says for the coronavirus, "the death rate in Texas, per capita of 29 million people, we're one of the lowest in the country."
Label: True
Filtered Evidence: Patrick's statement also referenced deaths "per capita of 29 million people" in Texas.
Replacing Evidence: Looking at all 50 states, the District of Columbia and Puerto Rico, Texas is among the areas with the lowest coronavirus death rate.

Claim: Smokers are likely to be more vulnerable to COVID-19 as the act of smoking means that fingers are in contact with lips which increases the possibility of transmission of virus from hand to mouth.
Label: True
Filtered Evidence: Are demographics with high smoking rates more vulnerable to Covid-19 outbreaks?
Replacing Evidence: However, from their published data we can calculate that the smokers were 1.4 times more likely (rr=1.4, 95% ci: 0.98-2.00) to have severe symptoms of COVID-19 and approximately 2.4 times more likely to be admitted to an icu, need mechanical ventilation or die compared to non-smokers (rr=2.4, 95% ci: 1.43-4.04).
Acknowledgments

We would like to thank Madian Khabsa for the helpful discussion and inspiration.

References

Where is your evidence: Improving fact-checking by justification modeling
Integrating stance detection and fact checking in a unified corpus
A neural probabilistic language model
Computational fact checking from knowledge networks
BERT: Pre-training of deep bidirectional transformers for language understanding
Open information extraction from the web
REALM: Retrieval-augmented language model pre-training
UKP-Athene: Multi-sentence textual entailment for claim verification
Multi-source multi-class fake news detection
Exploiting a speaker's credibility to detect fake news
Improving large-scale fact-checking using decomposable attention models and lexical tagging
BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension
RoBERTa: A robustly optimized BERT pretraining approach
Fake news detection through multi-perspective speaker profiles
Web-based statistical fact checking of textual documents
Foundations of statistical natural language processing
Fake news detection on social media using geometric deep learning
Combining fact extraction and verification with neural semantic matching networks
Deep contextualized word representations
Language models as knowledge bases?
DeClarE: Debunking fake news and false claims using evidence-aware deep learning
Improving language understanding by generative pre-training
Language models are unsupervised multitask learners
How much knowledge can you pack into the parameters of a language model?
Understanding user profiles on social media for fake news detection
FEVER: a large-scale dataset for fact extraction and verification
Attention is all you need
Man fatally poisons himself while self-medicating for coronavirus, doctor says
"Liar, liar pants on fire": A new benchmark dataset for fake news detection
HuggingFace's Transformers: State-of-the-art natural language processing
Toward computational fact-checking
UCL machine reading group: Four factor framework for fact finding (HexaF)