key: cord-0483835-m0vjad09 authors: Pradeep, Ronak; Ma, Xueguang; Nogueira, Rodrigo; Lin, Jimmy title: Scientific Claim Verification with VERT5ERINI date: 2020-10-22 journal: nan DOI: nan sha: e4cb6bfe88a8ed729d34d5a9ff74a992932b70ce doc_id: 483835 cord_uid: m0vjad09

This work describes the adaptation of a pretrained sequence-to-sequence model to the task of scientific claim verification in the biomedical domain. We propose VERT5ERINI, which exploits T5 for abstract retrieval, sentence selection, and label prediction, the three critical sub-tasks of claim verification. We evaluate our pipeline on SCIFACT, a newly curated dataset that requires models not just to predict the veracity of claims but also to provide relevant sentences from a corpus of scientific literature that support this decision. Empirically, our pipeline outperforms a strong baseline in each of the three steps. Finally, we show VERT5ERINI's ability to generalize to two new datasets of COVID-19 claims using evidence from the ever-expanding CORD-19 corpus.

The popularity of social media and other means of disseminating content, combined with automated algorithms that amplify signals, has increased the proliferation of misinformation. This has drawn increased attention in the community toward building better fact verification systems. Until recently, most fact verification datasets were constrained to domains such as Wikipedia, discussion blogs, and social media (Thorne et al., 2018; Hanselowski et al., 2019). In the current environment, amidst the COVID-19 pandemic and the unease that comes perhaps with insufficient insight about the virus, there has been a sharp increase in scientific curiosity among the general public. While such curiosity is always appreciated, it has inadvertently resulted in a large spike of scientific facts being misrepresented, often to push personal or political agendas, inducing ineffective and often even harmful policies and behaviours.

To mitigate this issue, Wadden et al. (2020) introduced the task of scientific claim verification, in which systems must evaluate the veracity of a claim against a scientific corpus. To facilitate this, they introduced the SCIFACT dataset, which consists of scientific claims accompanied by abstracts that either support or refute each claim. The dataset also provides a set of rationale sentences for each claim that are often both necessary and sufficient to conclude its veracity. In addition, they provide VERISCI, a baseline for this task that takes inspiration from previous state-of-the-art systems (DeYoung et al., 2020) for the FEVER claim verification dataset (Thorne et al., 2018). This pipeline retrieves relevant abstracts by TF-IDF similarity, uses a BERT-based model (Devlin et al., 2019) to select rationale sentences, and finally labels each abstract as SUPPORTS, NOINFO, or REFUTES with respect to the claim.

Despite the success of BERT for tasks like passage-level (Nogueira et al., 2019), document-level (Dai and Callan, 2019; MacAvaney et al., 2019; Yilmaz et al., 2019), and sentence-level (Soleimani et al., 2019) retrieval, there is evidence that ranking with sequence-to-sequence models can achieve even better effectiveness, particularly in zero-shot scenarios or with limited training data. This was further demonstrated in the TREC-COVID challenge (Roberts et al., 2020), where one of the best performing systems used sequence-to-sequence models for retrieval (Zhang et al., 2020).
Similar trends are noted in CovidQA (Tang et al., 2020), a question answering dataset for COVID-19, where zero-shot sequence-to-sequence models outperformed other baselines. Hence, we propose VERT5ERINI, in which all three steps (abstract retrieval, sentence selection, and label prediction) exploit T5 (Raffel et al., 2019), a powerful sequence-to-sequence language model. VERT5ERINI significantly outperforms the VERISCI baseline on the SCIFACT tasks and hence qualifies as a strong pipeline for scientific claim verification. We also demonstrate the efficacy of our system in verifying two different sets of COVID-19 claims with no additional training or hyperparameter tuning.

In the SCIFACT task, systems are provided with a scientific claim q and a corpus of abstracts C and are tasked to return:
• A set of evidence abstracts Ê(q).
• A label ŷ(q, a) that maps claim q and abstract a to one of {SUPPORTS, REFUTES, NOINFO}.
• A set of rationale sentences Ŝ(q, a) when ŷ(q, a) ∈ {SUPPORTS, REFUTES}.

Given the ground truth label y(q, a), the set of gold abstracts E(q), and the set of gold rationales R(q, a) (each gold rationale is a set of sentences), the predictions are evaluated in two ways:
• Abstract-level evaluation, where systems are judged on whether they can identify abstracts that support or refute the claim. First, a ∈ Ê(q) is correctly labelled if both a ∈ E(q) and ŷ(q, a) = y(q, a). Second, it is correctly rationalized if, in addition, ∃R ∈ R(q, a) such that R ⊆ Ŝ(q, a). These evaluations are referred to as Abstract Label-Only and Abstract Label+Rationale, respectively.
• Sentence-level evaluation, where systems are evaluated on whether they can identify sentences sufficient to justify the abstract-level predictions. First, ŝ ∈ Ŝ(q, a) is correctly selected if ∃R ∈ R(q, a) such that both ŝ ∈ R and R ⊆ Ŝ(q, a). Second, it is correctly labelled if, in addition, ŷ(q, a) = y(q, a). These evaluations are referred to as Sentence Selection-Only and Sentence Selection+Label, respectively.

Our proposed framework, VERT5ERINI (see Figure 1), has three major components:
1. H0: Abstract Retrieval, which, given claim q, retrieves the top-k abstracts from corpus C.
2. H1: Sentence Selection, which, given claim q and one of the top-k abstracts a, selects the sentences from a that form Ŝ(q, a).
3. H2: Label Prediction, which, given claim q and the rationale sentences Ŝ(q, a), predicts the final label ŷ(q, a).

Given a scientific claim q and a corpus C of scientific abstracts, H0 is tasked with retrieving the top-k abstracts from C. We propose both a single-stage and a two-stage abstract retrieval pipeline. In both cases, the first stage, H0,0, treats the query as a "bag of words" for ranking abstracts from the corpus with a BM25 scoring function (Robertson et al., 1994). Our implementation uses the Anserini IR toolkit (Yang et al., 2017, 2018), which is built on the popular open-source Lucene search engine. The output of this stage is a list of k0 candidate abstracts.

The second, abstract reranking stage, H0,1, is tasked with estimating a score p quantifying how relevant a candidate abstract a is to a query q. In this stage, the abstracts retrieved by H0,0 are reranked by a pointwise reranker, which we call monoT5. Our reranker follows the monoT5 approach, which uses T5 (Raffel et al., 2019), a sequence-to-sequence model pretrained with a masked language modeling objective similar to BERT's. In this model, all target tasks are cast as sequence-to-sequence tasks.
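For concreteness, the following is a minimal sketch of what the first-stage BM25 retrieval (H0,0) might look like using Pyserini, the Python front end to Anserini. The index path, the value of k0, and the example claim string are illustrative assumptions rather than details from the paper, and the import path may differ across Pyserini versions.

```python
# Minimal sketch of first-stage BM25 abstract retrieval (H0,0) with Pyserini.
# The index path and k0 value are assumptions for illustration.
from pyserini.search.lucene import LuceneSearcher  # older versions: pyserini.search.SimpleSearcher

K0 = 20  # number of BM25 candidates handed to the monoT5 reranker (assumed)

def retrieve_candidates(claim: str, index_dir: str = "indexes/scifact-abstracts"):
    """Return (docid, BM25 score) pairs for the top-k0 candidate abstracts."""
    searcher = LuceneSearcher(index_dir)  # BM25 is the default scoring function
    hits = searcher.search(claim, k=K0)
    return [(hit.docid, hit.score) for hit in hits]

# Example usage (assuming the SCIFACT abstracts were indexed beforehand):
# candidates = retrieve_candidates(
#     "Hypertension and Diabetes are the most common comorbidities for COVID-19.")
# The candidates would then be passed to the second-stage monoT5 reranker (H0,1).
```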
We adapt this approach to abstract reranking with the monoT5 input sequence "Query: q Document: a Relevant:". The model is fine-tuned to produce the word "true" or "false" depending on whether or not the abstract is relevant to the query. That is, "true" and "false" are the "target words" (i.e., ground truth predictions in the sequence-to-sequence transformation). Since SCIFACT abstracts tend to be longer than the context limit of T5, we first segment each abstract into spans by applying a sliding window of 6 sentences with a stride of 3.

To fine-tune monoT5 on abstract reranking in SCIFACT, we use all cited abstracts in the train set as positive examples. For each claim, we select negative examples by randomly sampling a non-ground-truth abstract from the top-10 BM25-ranked candidates. We train on this set with a batch size of 128 for 200 steps, which corresponds to approximately 5 epochs. At inference time, to compute a probability for each query-segment pair (in a reranking setting), we apply a softmax only to the logits of the "true" and "false" tokens. We then take the relevance score of an abstract to be the highest probability assigned to the "true" token among all of its segments. The top-k0 abstracts with respect to these scores, R0, are then selected.

We run inference with three different monoT5 settings (all models are T5-3B) for abstract reranking: (1) fine-tuned on the MS MARCO passage dataset (Bajaj et al., 2016); (2) further fine-tuned on MS MARCO MED; and (3) further fine-tuned on SCIFACT. We choose to pretrain relevance classifiers on MS MARCO passage as it has been shown to help in various other tasks (Zhang et al., 2020; Yilmaz et al., 2019). Similarly, MacAvaney et al. (2020) demonstrate that fine-tuning the classifiers on MS MARCO MED helps with biomedical-domain relevance ranking.

In the sentence selection stage, the goal is to select rationale sentences Ŝ(q, a) from each abstract a among the top-k retrieved abstracts Ê(q). We use T5 for this task too, with the same input format as in abstract reranking, "Query: q Document: s Relevant:", where s is a sentence in the abstract a. We fine-tune a monoT5 (trained on MS MARCO passage) on SCIFACT's gold rationales as positive examples and on sentences randomly sampled from E(q) as negatives. We train on this set of sentences with a batch size of 128 for 2500 steps. During inference, similar to abstract reranking, we compute the probability of a sentence being relevant from the logits of the "true" and "false" tokens. Finally, we filter out all sentences whose "true" probability is below a threshold of 0.999 to obtain Ŝ(q, a).

Table 2: SCIFACT label distribution (SUPPORTS / NOINFO / REFUTES / Total).
Train: 332 / 304 / 173 / 809
Dev: 124 / 112 / 64 / 300
Test: 100 / 100 / 100 / 300

Given the claim q, an abstract a, and the corresponding set of rationale sentences Ŝ(q, a), H2 is tasked with predicting a label ŷ(q, a) ∈ {SUPPORTS, NOINFO, REFUTES}. Yet again, we use T5 for this task, with the input sequence "hypothesis: q sentence1: s1 ... sentencez: sz", where s1, ..., sz are the rationale sentences in Ŝ(q, a). The target sequence is one of the "true", "weak", or "false" tokens, corresponding to the labels SUPPORTS, NOINFO, and REFUTES, respectively. SUPPORTS and REFUTES training examples are selected from the evidence sets of the cited abstracts for each claim. The sentences in each evidence set are concatenated with the claim in the above input format to form a single example for the corresponding label. The NOINFO examples are created by concatenating 1 or 2 randomly-selected non-rationale sentences from each of the cited abstracts across all labels.
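To make the scoring mechanics concrete, the sketch below shows one way the "true"/"false" probability computation and the max-over-segments aggregation described above could be implemented with Hugging Face Transformers. The checkpoint name, the exact prompt template, and the window boundary handling are assumptions (the paper fine-tunes T5-3B variants); the same scoring routine, with the 0.999 threshold applied per sentence, also serves sentence selection.

```python
# Sketch of monoT5-style "true"/"false" scoring with max-over-segments
# aggregation, assuming a Hugging Face T5 checkpoint ("t5-base" is only a
# stand-in for the T5-3B models used in the paper).
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base").eval()

TRUE_ID = tokenizer.encode("true")[0]    # first sentencepiece of "true"
FALSE_ID = tokenizer.encode("false")[0]  # first sentencepiece of "false"

def windows(sentences, size=6, stride=3):
    """Sliding window of 6 sentences with stride 3 (boundary handling assumed)."""
    for start in range(0, max(1, len(sentences) - size + 1), stride):
        yield " ".join(sentences[start:start + size])

@torch.no_grad()
def true_probability(claim: str, text: str) -> float:
    """Softmax over only the 'true'/'false' logits of the first decoded token."""
    # Assumed monoT5-style prompt template.
    enc = tokenizer(f"Query: {claim} Document: {text} Relevant:",
                    return_tensors="pt", truncation=True, max_length=512)
    start = torch.tensor([[model.config.decoder_start_token_id]])
    logits = model(**enc, decoder_input_ids=start).logits[0, 0]
    p_true, _ = torch.softmax(logits[[TRUE_ID, FALSE_ID]], dim=0)
    return p_true.item()

def abstract_score(claim: str, abstract_sentences: list) -> float:
    """Abstract relevance = highest 'true' probability among its segments."""
    return max(true_probability(claim, span) for span in windows(abstract_sentences))

def select_rationales(claim: str, abstract_sentences: list, threshold=0.999):
    """Keep sentences whose 'true' probability clears the 0.999 threshold."""
    return [s for s in abstract_sentences if true_probability(claim, s) >= threshold]
```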
Here, we fine-tune a fresh T5-3B (pretrained only on the multi-task mixture) rather than a fine-tuned monoT5, since there is no natural transfer from the relevance ranking task. We use a batch size of 128 and pick the best checkpoint among those saved at 200, 400, 600, 800, and 1000 steps based on the development set scores. During inference, the token with the highest probability determines the label ŷ(q, a) for abstract a ∈ Ê(q).

In SCIFACT, the abstracts that support or refute each claim are annotated with rationale sentences (see Table 1 for examples). The label distribution is provided in Table 2.

To show that our system is able to verify claims related to COVID-19 by identifying evidence from the much larger CORD-19 corpus, we evaluate VERT5ERINI in a zero-shot setting on two other datasets:

COVID-19 SCIFACT (Wadden et al., 2020) is a set of 36 COVID-related claims curated by a medical student. In this set, the same claim can sometimes be both supported and refuted by different abstracts, a scenario not observed in the main SCIFACT task. Two examples from this set are shown in Table 4.

COVID-19 Scientific (Lee et al., 2020) contains 142 claims (label distribution in Table 3) gathered by collecting COVID-related scientific truths and myths from sources like the Centers for Disease Control and Prevention (CDC), MedicalNewsToday, and the World Health Organization (WHO). Three examples from this set are shown in Table 5. Unlike the other two datasets, COVID-19 Scientific provides only a single label y(q) ∈ {SUPPORTS, REFUTES} for each claim. The authors note that, in constructing the dataset, claims that were unverifiable according to the CDC or the WHO were mapped to REFUTES. Hence, we make the following modifications to VERT5ERINI (sketched in code below):
• First, if ŷ(q, a) = NOINFO, then ŷ(q, a) is mapped to REFUTES.
• Second, ŷ(q) = max over a ∈ Ê(q) of ŷ(q, a).
• Third, if the set of all selected rationale sentences, ∪_{a ∈ Ê(q)} Ŝ(q, a), is empty, then ŷ(q) = REFUTES.

[Table 4: Example COVID-19 SCIFACT claims with retrieved evidence, including "Hypertension and Diabetes are the most common comorbidities for COVID-19.", paired with rationale sentences from a review reporting that hypertension, diabetes, and cardiovascular diseases were the most prevalent comorbidities among patients with coronavirus disease 2019 (COVID-19), and "The Secondary Attack rate of COVID-19 is 10.5% for household members/close contacts.", paired with rationale sentences from a study estimating the household secondary attack rate at 13.8% (95% CI: 11.1-17.0%) when household contacts are defined as all close relatives and 19.3% (95% CI: 15.5-23.9%) when household contacts include only those at the same residential address, assuming a mean incubation period of 4 days and a maximum infectious period of 13 days.]

[Table 5: Example COVID-19 Scientific claims, including "Bill Gates caused the infection of COVID-19."]

As one can imagine, it would be impossible to find any discussion of outlandish claims like "Bill Gates caused the infection of COVID-19" in a corpus of biomedical literature, and hence VERT5ERINI maps it to REFUTES.

For the SCIFACT and COVID-19 SCIFACT end-to-end tasks, the baseline system is VERISCI (Wadden et al., 2020). It has an abstract retrieval module that uses TF-IDF, a sentence selection module trained on SCIFACT, and a label prediction module trained on FEVER + SCIFACT. For the abstract retrieval module, the authors report the best full-pipeline development set scores when retrieving the top three documents.
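As referenced above, the following is a minimal sketch of the claim-level aggregation for COVID-19 Scientific, assuming abstract-level labels and rationale sets have already been produced by the earlier stages; the numeric label ordering used by the max and all function and variable names are assumptions.

```python
# Sketch of the claim-level label aggregation for COVID-19 Scientific.
# Assumes per-abstract labels and rationale sentences were already predicted.
# The numeric ordering (SUPPORTS above REFUTES) used by max() is an assumption.
REFUTES, SUPPORTS = 0, 1

def claim_label(abstract_labels: dict, abstract_rationales: dict) -> int:
    """Collapse abstract-level predictions into one SUPPORTS/REFUTES claim label."""
    # Modification 3: if no rationale sentences were selected from any retrieved
    # abstract, the claim is treated as unverifiable and labeled REFUTES.
    if not any(abstract_rationales.values()):
        return REFUTES
    # Modification 1: NOINFO is mapped to REFUTES, so anything that is not
    # SUPPORTS is treated as REFUTES here.
    mapped = [SUPPORTS if label == "SUPPORTS" else REFUTES
              for label in abstract_labels.values()]
    # Modification 2: take the maximum over the retrieved abstracts.
    return max(mapped)

# Example with hypothetical predictions for a single claim:
labels = {"doc1": "NOINFO", "doc2": "SUPPORTS"}
rationales = {"doc1": [], "doc2": ["...selected rationale sentence..."]}
assert claim_label(labels, rationales) == SUPPORTS
```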
For the COVID-19 Scientific task, we compare against the following two baselines established by Lee et al. (2020):
• LiarMisinfo (Lee et al., 2020) uses a BERT-large (Devlin et al., 2019) label prediction model fine-tuned on LIAR-PolitiFact (Wang, 2017), a set of 12.8k claims collected from PolitiFact. It is worth noting that LIAR-PolitiFact does not contain any claims related to COVID-19.
• LM Debunker (Lee et al., 2020) uses GPT-2 (Radford et al., 2019) to determine the perplexity of the claim given evidence sentences. Claims with a perplexity score higher than a threshold are labeled REFUTES, while the others are labeled SUPPORTS.
The sentence selection module in both baselines employs TF-IDF followed by rule-based evidence filtering to select the top three sentences.

Table 6 reports R@3 and R@5 for abstract retrieval. The oracle (first row) shows that most claims from the development set have fewer than 3 relevant abstracts and all have fewer than 5. For comparison, we show the effectiveness of the TF-IDF retrieval used by Wadden et al. (2020). We find that BM25 yields an effectiveness improvement of around 10 points over the TF-IDF baseline. Using T5 to rerank the top-20 abstracts retrieved by BM25 results in a 17-point improvement over the baseline. However, there is almost no difference in efficacy whether T5 was fine-tuned on SCIFACT or on MS MARCO MED. This is potentially due to the relatively small size of the SCIFACT dataset and the fact that the MS MARCO MED data is not entirely relevant to the target task. Hence, we use T5 trained on MS MARCO in the full-pipeline experiments (Section 5.4).

Table 7 reports the precision, recall, and F1 scores for the sentence selection task. We find that T5 (MS MARCO) fine-tuned on SCIFACT outperforms the RoBERTa-large baseline fine-tuned on SCIFACT used by Wadden et al. (2020). This result, together with those from Table 6, demonstrates the effectiveness of the T5 model at selecting evidence at various levels of granularity.

In Table 8, we present label-wise precision, recall, and F1 scores for the label prediction task. For the SUPPORTS and REFUTES labels, the inputs to the model are gold rationales from cited abstracts. The exception is the NOINFO label, for which cited abstracts are available but no gold rationales exist; in this case, we pick the two most similar sentences according to TF-IDF from each of these abstracts. The results across all labels show that T5 fine-tuned on SCIFACT's label prediction task yields significant improvements over the baseline RoBERTa-large, which was fine-tuned on FEVER followed by SCIFACT's label prediction task. We believe some of this can be credited to T5's pretraining on a mixture of multiple tasks. Although this mixture does not include FEVER, it contains various other NLI datasets, including MNLI (Williams et al., 2018) and QNLI (Rajpurkar et al., 2016).

In Tables 9 and 10, we report the precision, recall, and F1 scores of the abstract-level evaluation and the sentence-level evaluation, respectively, for full-pipeline systems. Rows 1, 2, 6, and 7 present scores in the oracle abstract retrieval setting, where gold evidence abstracts are provided to the systems. We see that our pipeline outperforms VERISCI by around 10 F1 points at both the abstract and sentence level.
The improvements are even more significant in the Abstract Label+Rationale and Sentence Selection+Label evaluation settings (rows 6 and 7 in Tables 9 and 10, respectively), which demand more from systems in terms of sentence selection and label prediction. In rows 3-5 and 8-10, we report scores in the full-pipeline setting, where systems are also required to retrieve relevant abstracts. We evaluate two full-pipeline systems: one that uses BM25 alone and another that uses BM25 followed by T5 (MARCO) for abstract retrieval. Both systems outperform the baseline VERISCI by about 14 F1 points. This comes as no surprise, seeing that our models displayed significant improvements along each of the three steps.

Notice that in Table 6, using T5 (MARCO) brings large gains in R@3 over the BM25 baseline. Yet, in the full pipeline, these two abstract retrieval methods achieve only comparable efficacy on the development set. We believe this might be linked to the relatively small size of the development set, and we therefore probe the SCIFACT hidden test set with both systems. From Table 11, it is clear that on the hidden test set both of our systems outperform the baseline VERISCI, with evaluation aspects like Sentence+Label (rows 10-12) showing relative improvements of around 50%. Comparing with the respective scores in Tables 9 and 10, we also see no indication of overfitting. We further note that, unlike on the development set, abstract retrieval using the two-stage approach brings significant gains here (rows 5 and 11 vs. 6 and 12). This shows that neural reranking, even when used in a zero-shot formulation, is critical to retrieving higher-quality abstracts from the corpus C, thereby also improving effectiveness in later stages.

Finally, we evaluate our most effective pipeline, VERT5ERINI (T5), on the two sets of COVID-related claims. We do this in a zero-shot paradigm, in that we do not fine-tune our model on either of these sets. In the COVID-19 SCIFACT set, for each claim q, we use VERT5ERINI (T5) to predict evidence abstracts Ê(q). A (q, Ê(q)) pair is considered plausible if at least half of the evidence abstracts in Ê(q) are found to have reasonable rationales and labels. For 30/36 claims, we find that VERT5ERINI (T5) provides plausible evidence abstracts; these claims have reasonable labels and evidence rationales successfully selected from the evidence abstracts. This compares to the 23/36 claims for which VERISCI provides plausible evidence, demonstrating the effectiveness of our system in the zero-shot case.

In the COVID-19 Scientific set, we compare the effectiveness of VERT5ERINI with that of the two baselines considered by Lee et al. (2020). Table 12 reports the accuracy, F1-Macro, and F1-Binary scores on the test set. The F1-Binary score corresponds to the F1 score of the REFUTES label, since debunking misinformation is critical. Note that the LM Debunker baseline reports average scores across four-fold cross-validation on the test set, unlike VERT5ERINI and LiarMisinfo. We observe that VERT5ERINI outperforms both baselines in a zero-shot setting, without any in-task tuning of the kind used by the LM Debunker. The adaptability of VERT5ERINI to both of these new tasks with no additional training makes a strong case for the strength of our pipeline.

We introduced VERT5ERINI, a novel pipeline for scientific claim verification that exploits a generation-based approach to abstract ranking, sentence selection, and claim verification.
Such systems are of significance in this age of misinformation, amplified by the COVID-19 pandemic. We find that our system outperforms the state of the art in the end-to-end task. We note improvements in each of the three steps, demonstrating the importance of this generative approach as well as its zero-shot and few-shot transfer capabilities. Finally, we find that VERT5ERINI generalizes to two new COVID-19-related claim sets with no tuning of parameters while maintaining high efficacy. Yet, there is still a large gap between our system and an oracle. Ideally, a system that performs scientific claim verification should possess additional attributes such as:
• Numerical reasoning: the ability to interpret statistical and numerical findings and ranges.
• Biomedical background: the ability to leverage knowledge about domain-specific lexical relationships.
Future work that incorporates such attributes may be critical to building higher-quality scientific fact verification systems.

References
Bajaj et al. 2016. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset.
Dai and Callan. 2019. Context-Aware Sentence/Passage Term Importance Estimation for First Stage Retrieval.
Devlin et al. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
DeYoung et al. 2020. ERASER: A Benchmark to Evaluate Rationalized NLP Models.
Hanselowski et al. 2019. A Richly Annotated Corpus for Different Tasks in Automated Fact-Checking.
Lee et al. 2020. Misinformation Has High Perplexity.
MacAvaney et al. 2019. CEDR: Contextualized Embeddings for Document Ranking.
MacAvaney et al. 2020. SLEDGE: A Simple Yet Effective Baseline for COVID-19 Scientific Knowledge Search.
Nogueira et al. 2019. Document Expansion by Query Prediction.
Nogueira et al. 2020. Document Ranking with a Pretrained Sequence-to-Sequence Model.
Radford et al. 2019. Language Models are Unsupervised Multitask Learners. OpenAI Blog.
Raffel et al. 2019. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.
Rajpurkar et al. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text.
Roberts et al. 2020. TREC-COVID: Rationale and Structure of an Information Retrieval Shared Task for COVID-19.
Robertson et al. 1994. Okapi at TREC-3.
Soleimani et al. 2019. BERT for Evidence Retrieval and Claim Verification.
Tang et al. 2020. Rapidly Bootstrapping a Question Answering Dataset for COVID-19.
Thorne et al. 2018. FEVER: A Large-Scale Dataset for Fact Extraction and VERification.
Wadden et al. 2020. Fact or Fiction: Verifying Scientific Claims.
Wang. 2017. "Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection.
Williams et al. 2018. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference.
Yang et al. 2017. Anserini: Enabling the Use of Lucene for Information Retrieval Research.
Yang et al. 2018. Anserini: Reproducible Ranking Baselines Using Lucene. Journal of Data and Information Quality.
Yilmaz et al. 2019. Cross-Domain Modeling of Sentence-Level Evidence for Document Retrieval.

This research was supported in part by the Canada First Research Excellence Fund and the Natural Sciences and Engineering Research Council (NSERC) of Canada. Additionally, we would like to thank Google for computational resources in the form of Google Cloud credits.