DialFact: A Benchmark for Fact-Checking in Dialogue
Prakhar Gupta, Chien-Sheng Wu, Wenhao Liu, Caiming Xiong
2021-10-15

Fact-checking is an essential tool to mitigate the spread of misinformation and disinformation. We introduce the task of fact-checking in dialogue, which is a relatively unexplored area. We construct DialFact, a testing benchmark dataset of 22,245 annotated conversational claims, paired with pieces of evidence from Wikipedia. There are three sub-tasks in DialFact: 1) the verifiable claim detection task distinguishes whether a response carries verifiable factual information; 2) the evidence retrieval task retrieves the most relevant Wikipedia snippets as evidence; 3) the claim verification task predicts whether a dialogue response is supported, refuted, or has not enough information. We find that existing fact-checking models trained on non-dialogue data like FEVER fail to perform well on our task, and thus we propose a simple yet data-efficient solution to effectively improve fact-checking performance in dialogue. In the error analysis, we point out unique challenges in DialFact, such as handling colloquialisms, coreferences, and retrieval ambiguities, to shed light on future research in this direction.

Misinformation online can have deleterious consequences for our society, especially during public health crises like the COVID-19 pandemic. False and outdated information can be spread not only by humans but also by automatic agents, as generative models have shown remarkable progress recently (Adiwardana et al., 2020; Xu et al., 2021). These systems are not perfect: they can generate hallucinated and incorrect information, or they can be abused to automatically generate false claims and spread misinformation at massive scale. Fact verification tools are thus necessary in the current information age to tackle the spread of misinformation.

Dialogue Context: I have family in Ireland! Have you ever been there?
Evidence: Ireland is an island in the North Atlantic.
Non-Verifiable Response: I haven't been but want to!
Verifiable Supported Response: I haven't. It is an island in the north Atlantic right?
Verifiable Refuted Response: I haven't been. Isn't it somewhere in north Pacific?
Verifiable NEI Response: I haven't been. I heard it's the most popular tourist location in Europe!
Figure 1: Dialogue fact-checking involves predicting whether a response is a Verifiable claim, followed by finding relevant evidence, and finally predicting whether it is SUPPORTED, REFUTED, or NEI.

(2017); Thorne et al. (2018), and since then a growing body of research has explored and suggested various tasks and resources to address the challenges in this area. Fact-checking has been explored in media such as Wikipedia passages, tables, social media, and news articles (Guo et al., 2021; Bekoulis et al., 2021). In the dialogue domain, related work focuses either on evaluating factual consistency (Honovich et al., 2021; Qin et al., 2021) or on consistent response generation (Rashkin et al., 2021; Shuster et al., 2021). However, due to the lack of publicly available benchmarks, fact-checking is still underexplored in dialogue. Verifying the factual correctness of claims in dialogue poses new challenges to both dataset construction and modeling.
Claims in existing datasets come from formal sources such as news articles, and they are generally succinct and formal. In contrast, claims in dialogue are often informal and sparse in factual content. Furthermore, dialogue utterances often include personal opinions, slang, and colloquialisms, which need to be distinguished from factual information. Another challenge in dialogue fact-checking is that ellipsis and coreference occur frequently, which makes utterances incomplete and ambiguous (DeVault and Stone, 2007). Although humans can easily understand utterances with references or absent information based on the dialogue context and their reasoning skills, a fact-checking system may need to model this behavior explicitly.

We introduce the task of fact-checking in dialogue and propose an evaluation dataset, DIALFACT. An example is shown in Figure 1. DIALFACT has three sub-tasks: 1) Verifiable claim detection aims to distinguish responses that do not contain verifiable factual information, such as "I haven't been but want to!" in Figure 1. 2) Evidence retrieval involves selecting the most relevant knowledge snippets from Wikipedia which can verify the response. 3) Claim verification aims to classify whether a response is supported, refuted, or does not have enough information to be verified, given the dialogue history and the retrieved evidence.

DIALFACT consists of both human-written and machine-generated claims based on the Wizard of Wikipedia (Dinan et al., 2019) dialogue dataset. Each response claim and its evidence sentences from Wikipedia are annotated by crowd workers, and we perform rigorous quality checks on the annotations. For fact verification, we propose creating weakly supervised training data by leveraging techniques such as negation, entity swapping, language model mask-and-fill, and knowledge-grounded generation. We establish baseline model performance on this task and point out the weaknesses of fact-checking models. Our analysis shows that this is a non-trivial task with challenges remaining for future work. We hope that future work can leverage this dataset as a fact-checking benchmark or for the development of automatic consistency metrics, and advance the state of the art in knowledge-grounded dialogue generation and evaluation.

Fact Verification The spread of false information online has led to a growing body of research exploring automatic fact-checking. Thorne et al. (2018) and subsequent works (Wenhu Chen et al., 2020; Jiang et al., 2020; Nørregaard and Derczynski, 2021; Aly et al., 2021) introduced fact extraction and verification datasets whose claims are verifiable against pieces of evidence from Wikipedia articles. Fact-checking has been explored in a variety of settings such as Wikipedia-based claims (Schuster et al., 2021), claims over tables (Aly et al., 2021), scientific claims (Wadden et al., 2020), and social media claims (Nakov et al., 2021). However, fact-checking in dialogue is still an underexplored area. Kim et al. (2021) explored fact-checking for colloquial claims, curated by converting FEVER claims into colloquial style. Although closely related to our work, Colloquial Claims is not a dialogue dataset: it contains only verifiable claims and does not provide dialogue contexts for the claims. In DIALFACT, on the other hand, both evidence retrieval and claim verification are more challenging, as they require resolving ambiguities and coreferences from the dialogue context.
Neural dialogue systems grounded on knowledge sources such as Wikipedia (Dinan et al., 2019), knowledge graphs (Wu et al., 2019), or snippets from the internet (Komeili et al., 2021) have garnered interest in recent years. Despite generating plausible and engaging responses, existing models still hallucinate invalid information (Roller et al., 2021). Ensuring safety and consistency in dialogue response generation is thus an actively explored area (Rashkin et al., 2021; Shuster et al., 2021). Some recent works have proposed evaluation metrics and benchmarks for factual consistency in knowledge-grounded response generation (Honovich et al., 2021; Dziri et al., 2021). Our work instead focuses on fact-checking in dialogue for both human- and machine-generated responses, and involves the additional tasks of verifiable claim detection and evidence retrieval.

Synthetic datasets Synthetic dataset construction has been shown to improve the robustness of evaluation models (Gupta et al., 2021; Ghazarian et al., 2021) and the complexity of test sets (Sakaguchi et al., 2021; Feng et al., 2021). Synthetic claims have been explored in fact-checking to create hard test sets. Several participants in the FEVER 2.0 breakers phase (Niewinski et al., 2019; Hidey et al., 2020; Atanasova et al., 2020) proposed approaches for automatically generating adversarial claims. Recently, Jiang et al. (2020)

Let a conversation context consist of a list of utterances C = {u_1, u_2, ..., u_n}. The task is to perform fact-checking on the last utterance of the conversation, u_n, henceforth called the claim c. Fact-checking claims in conversation is a pipeline that consists of several steps. First, the system needs to decide whether a response is VERIFIABLE or NON-VERIFIABLE. We define them as follows. NON-VERIFIABLE: the claim contains no verifiable factual information; it includes claims with personal opinions or personal information. VERIFIABLE: the claim contains at least one piece of factual information verifiable against a background corpus (Wikipedia in this task). Next, the system should retrieve documents from the background corpus and select relevant evidence sentences from the documents. Finally, the system should predict whether the claim belongs to one of the following three categories. SUPPORTED: the response contains factual information which is valid in light of the evidence. REFUTED: the response contains factual information which is invalid in light of the evidence. NOTENOUGHINFORMATION (NEI): the response contains factual information which can not be validated (supported or refuted) with the evidence. VERIFIABLE claims can be SUPPORTED, REFUTED, or NEI, and NON-VERIFIABLE claims are always NEI.

We leverage the Wizard of Wikipedia (WoW) dataset (Dinan et al., 2019) as the base to build this task. WoW is a knowledge-grounded open-domain dialogue dataset with conversations between two speakers: a wizard who has access to background Wikipedia documents to deliver knowledge-carrying responses, and an apprentice who plays the role of a curious learner. For each turn u_i, the wizard is shown a set of articles K_i retrieved from Wikipedia. The wizard either chooses a relevant knowledge sentence k_i from the set K_i, or chooses a "no sentence used" option to construct a response. For our fact-checking task, we additionally need claims which belong to the REFUTED and NEI categories.
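To make the task definition above concrete, the following is a minimal sketch of the three-stage pipeline in Python. The data structure and the detector, retriever, and verifier callables are illustrative placeholders, not part of the released benchmark code.

```python
# A minimal sketch of the three-stage DialFact pipeline described above.
# `detector`, `retriever`, and `verifier` are placeholders for concrete models.
from dataclasses import dataclass, field
from typing import List

LABELS = ("SUPPORTED", "REFUTED", "NEI")

@dataclass
class Example:
    context: List[str]                      # u_1 ... u_{n-1}
    claim: str                              # u_n, the response to fact-check
    evidence: List[str] = field(default_factory=list)
    label: str = "NEI"

def fact_check(example: Example, detector, retriever, verifier) -> Example:
    """Run the pipeline: claim detection -> evidence retrieval -> verification."""
    # 1) Non-verifiable responses (opinions, personal info) are always NEI.
    if not detector(example.context, example.claim):
        example.label = "NEI"
        return example
    # 2) Retrieve Wikipedia snippets relevant to the claim (and its context).
    example.evidence = retriever(example.context, example.claim)
    # 3) Classify the claim against the retrieved evidence.
    example.label = verifier(example.context, example.claim, example.evidence)
    assert example.label in LABELS
    return example
```

Any concrete system plugs its own models into the three callables; the early NEI return mirrors the rule that NON-VERIFIABLE claims are always NEI.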
We next describe the methodologies used to create claims from the validation and test splits of the WoW dataset. We use two approaches to create claim responses for DIALFACT: 1) automatically generated claims and 2) human-written claims, to emulate claims created by dialogue systems and by humans, respectively. All claims are further annotated by crowd workers on Amazon Mechanical Turk (MTurk).

In the automatic approach, we use the following methods to create claims for all categories, either from scratch or by mutating the responses in the WoW dataset.

Negation We use the 42 rule-based transformations from Thorne et al. (2019), which apply to verb phrases of the claims and convert them to their negated versions by adding words like "not" or "no". This typically creates REFUTED claims.

Substitution We perform three types of substitutions. For 1) context- and knowledge-based entity substitution, we first run SpaCy NER tagging (Honnibal and Montani, 2017) on a response u_i from WoW. We then swap an entity in the response u_i with an entity from either its conversation context C or its background knowledge article set K_i. An entity is only swapped if it is present in k_i, the original knowledge sentence, to avoid swaps which do not change the facts. Entities are swapped within their types (a simplified sketch of this substitution appears below, after the Generation method). For 2) sense-based substitution, we swap an entity in u_i with an entity with a similar "sense" returned from the sense2vec (Trask et al., 2015) library. For 3) adjective substitution, we substitute adjectives in a claim (ignoring adjectives related to emotions, such as "happy") with their WordNet (Miller, 1998) antonyms (for example, best is replaced with worst). These operations typically create REFUTED claims.

Mask-and-Fill This method generates claims in two stages: 1) mask salient words in the original claims, and 2) substitute those words with alternates using a language model. For masking salient words in the original response claims, we follow the procedure from Thorne and Vlachos (2021) and use the Neutrality Masker model from Shah et al. (2020). It predicts the tokens which, upon masking, are likely to cause a label flip from SUPPORTED to NEI. For step 2), we first train a T5-base model (Raffel et al., 2020) on the WoW dataset on the task of infilling masked tokens conditioned on evidence sentences. For training, the input sequence consists of the concatenated evidence sentence k_i, the dialogue context C, and the gold response with masked spans at random positions, and the output is the gold response. The model is thus trained to infill a masked response based on the provided evidence and the dialogue context. For generating response claims which belong to the REFUTED or NEI categories, we use the following types of evidence sentences to condition the infilling: a) empty evidence, b) evidence sentences selected randomly from the knowledge article set K_i belonging to the original response, and c) evidence sentences from a Wikipedia article of an entity retrieved using sense2vec based on its similarity with the entities in the original response. Conditioning on such evidence leads to the generation of claims whose factual details are inconsistent with the original evidence.

Generation We fine-tune one of the best chit-chat dialogue systems, the Blenderbot model (Roller et al., 2021), on the WoW dataset. The model takes the concatenation of the knowledge sentence k_i and the dialogue context C as input and is trained to predict the tokens of the gold response. To generate new response claims, we condition the model on the three types of evidence described in the Mask-and-Fill approach. We use a high temperature (1.5) and nucleus sampling (Holtzman et al., 2020) with p = 0.9 during decoding to encourage the model to generate unexpected and non-contextual entities in the responses.
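The context- and knowledge-based entity substitution described under Substitution can be sketched roughly as follows, assuming spaCy for NER. This is only an illustration: the paper's exact filtering rules and entity-type handling may differ.

```python
# Rough sketch of context/knowledge-based entity substitution (illustrative,
# not the authors' code). An entity in the response is swapped with a
# same-type entity from the dialogue context or knowledge articles, and only
# if it also appears in the grounding knowledge sentence k_i, so that the
# swap actually changes a fact.
import random
from typing import List, Optional

import spacy

nlp = spacy.load("en_core_web_sm")

def entity_swap(response: str, knowledge_sent: str, source_texts: List[str]) -> Optional[str]:
    # Only entities grounded in the knowledge sentence are candidates to swap.
    swappable = [ent for ent in nlp(response).ents if ent.text in knowledge_sent]
    if not swappable:
        return None
    # Pool of replacement entities, grouped by entity type (label_), drawn
    # from the dialogue context turns and knowledge article sentences.
    pool = {}
    for text in source_texts:
        for ent in nlp(text).ents:
            pool.setdefault(ent.label_, set()).add(ent.text)
    target = random.choice(swappable)
    options = [t for t in pool.get(target.label_, set()) if t != target.text]
    if not options:
        return None  # no same-type replacement available
    return response.replace(target.text, random.choice(options), 1)
```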
Final claim set creation Our target is to create a challenging and diverse test set for dialogue fact-checking. Using the aforementioned claim generation methods, we get a set R_c = {r_1, r_2, ..., r_k} of response claims for a dialogue context C. To select a final set of claims, we first remove any responses which do not have at least 3 words different from the other responses in R_c, and then filter out less fluent claims whose GPT-2 (Radford et al., 2019) perplexity scores are higher than 1.1 times the average perplexity score of the responses in R_c. We then score the response claims using existing state-of-the-art models related to our task, namely Dialogue NLI (Welleck et al., 2019), dialogue contradiction detection (Nie et al., 2021), FEVER-based fact verification (Schuster et al., 2021), and fact-checking on colloquial claims (Kim et al., 2021). For each model, we calculate the entropy of the scores predicted for each label and rank the claims in R_c by the sum of the entropies across all the models, which gives an estimate of the confusion or difficulty in classifying the claims. The top 4 responses from the ranked list are chosen as the final set of response claims for that context.

For each claim, a set of evidence sentences is first automatically created and then labelled by crowd workers. We first extract a set of named entities and noun phrases n_k from the following sources: the claim c, the dialogue context C, the original response u_i for the dialogue context in WoW, and the titles of the knowledge articles K_i shown to the wizard for u_i. We use the MediaWiki API to find a set of relevant Wikipedia pages P_c for n_k. We then create a set of candidate sentences from the first 10 sentences of each page in P_c. Finally, we use two methods, SpaCy's word2vec similarity and BM25 similarity, and rank the top 10 evidence sentences with each method. We then combine the non-overlapping evidence from both methods to create the final evidence set e_c for each claim c. We add the knowledge sentence k_i associated with the original response in the WoW dataset if it is not already present in e_c.
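Looking back at the final claim set creation step above, the fluency filter and entropy-based difficulty ranking could be sketched as below. The `gpt2_perplexity` function and the `classifiers` list stand in for the actual scoring models, and the near-duplicate (3-word difference) filter is omitted for brevity.

```python
# Sketch of final claim selection: drop disfluent candidates, then rank the
# rest by summed prediction entropy across several off-the-shelf classifiers
# (higher entropy = harder to classify). The model wrappers are placeholders.
import math

def entropy(probs):
    return -sum(p * math.log(p + 1e-12) for p in probs)

def select_claims(candidates, gpt2_perplexity, classifiers, top_k=4):
    # Fluency filter: keep claims whose perplexity is at most 1.1x the average.
    ppl = {c: gpt2_perplexity(c) for c in candidates}
    avg_ppl = sum(ppl.values()) / max(len(ppl), 1)
    fluent = [c for c in candidates if ppl[c] <= 1.1 * avg_ppl]
    # Difficulty ranking: sum of per-model prediction entropies.
    ranked = sorted(
        fluent,
        key=lambda c: sum(entropy(model(c)) for model in classifiers),
        reverse=True,
    )
    return ranked[:top_k]
```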
We carry out the annotations of the claims and evidence on the MTurk platform in 3 rounds. A screenshot of the annotation UI is shown in Figure 3 in the Appendix. In each round a worker sees the claim c, its dialogue context C, and its associated evidence sentences e_c. Workers have to perform 3 tasks. First, they select whether the claim is VERIFIABLE or NON-VERIFIABLE. Second, they select one or more evidence sentences related to the response claim. In case the set of evidence shown is not enough to decide the label of the response, or if they choose NEI, they are instructed to search Wikipedia and add relevant additional evidence sentences in the interface. For NEI claims they are instructed to add the evidence sentences which are most related to the claim. Third, they choose the category of the response: SUPPORTED, REFUTED, or NEI. For NON-VERIFIABLE claims, NEI is auto-selected.

Since automatically created responses can have grammar- or coherence-related issues, in the first round of labeling annotators are asked to edit a response to make it appropriate to the context if needed, or to mark a response as incoherent, in which case it is removed from further rounds (we dropped 5% of incoherent claims). In the second and third rounds we gather 2 additional annotations for each claim. We select the label which has the majority vote among the set of 3 annotations across all rounds. The evidence set for each claim is the union of the evidence annotated in any of the rounds. Note that this mechanism can sometimes miss relevant evidence due to retrieval errors during evidence set creation, insufficient evidence search, or incorrect evidence annotation by workers.

Our dataset also consists of human-written claims to cover lexical and stylistic patterns present in human-human conversations. This annotation is carried out in 3 rounds. In the first round, we instruct crowd workers to write VERIFIABLE factual responses conditioned on the dialogue context and a set of evidence sentences for a pre-specified label l_c, one of SUPPORTED, REFUTED, or NEI. Workers were provided detailed examples and instructions for the task, such as "Avoid using negation words such as do not, no for Refuted claims" (Appendix C). The evidence set for each claim is constructed using the method described in section 4.1.2. In the second round, we use the claim labeling interface from section 4.1.3 to gather labels for the claims collected in the first round. For any claim which is not labeled in the second round with the original label l_c, we gather a third round of annotations. If the label in the third round does not match l_c, we drop that claim from the dataset. We drop about 7% of the human-written claims.

We present the dataset statistics in Table 1. The dataset consists of balanced SUPPORTED and REFUTED claims. The test set contains claims for 3,760 dialogue contexts with an average of 3.1 claims per context, and the validation set contains claims for 3,738 contexts with an average of 2.8 claims per context. The average number of tokens per claim is 22.0 in the test set and 20.0 in the validation set. The average number of evidence sentences per claim is 1.3 in the test set and 1.1 in the validation set. We show some sample instances in Table 13 in the Appendix.

Annotators: We hire workers on MTurk with at least 5,000 completed HITs and an acceptance rate of 95% or above. Workers have to first pass a qualification test where they are shown the task instructions, label definitions, and multiple examples with explanations for each label. Then they are asked to label or write 12 claims. Using these qualification tests, we get a final set of 87 workers for the main data collection stage (Appendix C).

Quality checks Annotations were carried out in batches over multiple weeks. We examined random samples to provide feedback to workers. Workers with poor annotations were either asked to retake a new qualification test or removed from further batches. We recollected annotations for data annotated by removed workers. We provide tooltips and examples during annotation, and we also added automatic checks to alert workers about issues such as too-short responses, no evidence selected, and evidence sentences copy-pasted as claims.
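As a small illustration of how the three rounds of annotations described above are combined (majority vote on the category label, union of the annotated evidence), here is a minimal sketch. Treating a hypothetical no-majority case as NEI is our assumption for this illustration, since it is not specified in the text.

```python
# Sketch of aggregating the three annotation rounds: majority-vote label,
# union of evidence. Falling back to NEI when no label wins a majority is an
# assumption made for this illustration only.
from collections import Counter

def aggregate_rounds(rounds):
    """rounds: list of (label, evidence_sentences) pairs, one per round."""
    labels = [label for label, _ in rounds]
    top_label, count = Counter(labels).most_common(1)[0]
    final_label = top_label if count >= 2 else "NEI"
    evidence = sorted({sent for _, sents in rounds for sent in sents})
    return final_label, evidence
```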
Data validation To evaluate inter-annotator agreement, we collected 2 extra rounds of annotations for 1,200 claims covering both automatically generated and human-written claims, which is 10% of the data. Krippendorff's alpha for the category labels was 0.68 for human-written claims and 0.58 for automatically generated claims, denoting moderate agreement. Krippendorff's alpha for VERIFIABLE versus NON-VERIFIABLE was 0.49, a low-to-moderate agreement. The lower agreement is due to some claims like "Guns N' Roses was the greatest rock band of all time.", where it is difficult to judge whether this is a personal opinion or a verifiable fact. In such conflicts, workers would still typically label such ambiguous claims correctly as NEI.

Lexical Biases Following Schuster et al. (2019), we use Local Mutual Information (LMI) to measure the correlation between bigrams in the claims (w) and the categories l, defined as LMI(w, l) = p(w, l) · log(p(l|w) / p(l)). We present the top bigrams in REFUTED claims and their LMI values in Table 2. The top bigrams in DIALFACT do not include obvious negations such as "do not" or "is not", are mostly topical in nature, and have low p(l|w) values for the REFUTED label. Investigating generated and written claims separately, we found that bigrams such as "does not", "only one", "did not", and "are not" had higher p(l|w) in written claims compared to generated claims.

We propose new baselines and compare with existing models for the three sub-tasks in dialogue fact-checking: 1) verifiable claim detection, 2) evidence retrieval, and 3) claim verification.

We propose three simple baselines for verifiable claim detection. 1) Lexical overlap calculates the maximum word overlap between a claim and all evidence sentences after removing punctuation and stopwords using SpaCy. 2) DNLI uses the probability of the neutral class from the Dialogue Natural Language Inference model (Welleck et al., 2019). 3) Lexical+DNLI uses the sum of the scores of both baselines, and Random predicts each class with 50% probability. For all baselines, we mark a response as VERIFIABLE or NON-VERIFIABLE based on a threshold value selected using validation data. We present the accuracy and per-class F1 scores in Table 3. Lexical+DNLI performs the best, and all baselines have low F1 scores for NON-VERIFIABLE claims.

Evidence retrieval consists of two steps: 1) document retrieval and 2) evidence sentence selection. We test two methods for document retrieval. The first one is WikiAPI, which retrieves Wikipedia pages and has been used in past fact-checking work (Hanselowski et al., 2018; Stammbach and Neumann, 2019; Liu et al., 2020). It uses the AllenNLP constituency parser (Gardner et al., 2018) to extract potential entities from the claims. It then feeds the entities as queries to the MediaWiki API and returns up to three Wikipedia pages per query. For each Wikipedia page, we query the KILT (Petroni et al., 2021) knowledge source to get the first 5 paragraphs of the page. We create two versions of this method: a) Wiki-ctx, which concatenates the last two turns of the dialogue context with the response claim before document retrieval, and b) Wiki-claimonly, which uses just the claim.
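A rough sketch of WikiAPI-style document retrieval is shown below. It follows the described recipe (extract candidate mentions, query the MediaWiki search API, keep up to three pages per query), but it is not the authors' code: the constituency-parser-based entity extraction is replaced with spaCy noun chunks for brevity, and the standard MediaWiki search endpoint is assumed rather than the exact wrapper used in the paper.

```python
# Rough sketch of WikiAPI-style document retrieval: extract candidate entity
# mentions from the claim (and optionally the last two context turns), query
# the MediaWiki search API for each, and keep up to three page titles per
# query. Illustrative only; entity extraction here uses spaCy noun chunks.
import requests
import spacy

nlp = spacy.load("en_core_web_sm")
SEARCH_URL = "https://en.wikipedia.org/w/api.php"

def retrieve_pages(claim, context=None, pages_per_query=3):
    query_text = (" ".join(context[-2:]) + " " if context else "") + claim
    queries = {chunk.text for chunk in nlp(query_text).noun_chunks}
    titles = []
    for q in queries:
        params = {"action": "query", "list": "search", "srsearch": q,
                  "srlimit": pages_per_query, "format": "json"}
        hits = requests.get(SEARCH_URL, params=params).json()
        titles.extend(h["title"] for h in hits.get("query", {}).get("search", []))
    return list(dict.fromkeys(titles))  # de-duplicate while preserving order
```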
The second method is Dense Passage Retrieval (DPR) (Karpukhin et al., 2020), a dual-encoder model which retrieves documents using BERT (Devlin et al., 2019) encoders trained with metric learning. We create three versions of this method: a) DPR-original, which uses the original DPR trained on question-answering tasks; b) DPR-WoWft-claimonly, which is fine-tuned on the WoW dataset to retrieve documents relevant to a query composed only of a response claim; and c) DPR-WoWft-ctx, which is also fine-tuned on the WoW dataset but uses both the context and the response as the query (training details are provided in Appendix B). For DPR-based methods we retrieve the top 100 documents. A document is relevant if it contains a gold evidence sentence. We present the document recall results in Table 4. WikiAPI methods outperform DPR-based methods. Both methods show better performance when the dialogue context is used in retrieval. DPR is typically able to retrieve documents with the correct topic but often fails to retrieve a document containing a relevant evidence sentence. Entity linking is crucial for fact-checking in dialogue, and WikiAPI is able to leverage that capability for better performance.

In evidence sentence selection, a final set of top-k evidence sentences is chosen from the set of documents D_c retrieved in the previous step for claim c. First, we create a candidate evidence sentence set S_c by taking the union of all sentences in D_c. We fine-tune a BERT-base model to rank the candidate sentences in S_c. The model is trained to predict -1 for irrelevant evidence and 1 for relevant evidence for a given claim. We use the context-response pairs from the WoW dataset to train the model. Besides using randomly selected evidence sentences, to create hard negative examples for training we also choose sentences from the set of articles K_i shown to the wizard during WoW data collection. These sentences are close in content and topic to the gold evidence sentence and form hard negative candidates for the model. At test time, we keep the top-k ranked evidence sentences with a score above 0. Similar to document retrieval, we create two versions of the model, 1) Ret-with-context and 2) Ret-only-claim, based on whether the last two utterances of the dialogue context are included in the input to the BERT model. We present the performance of the models in Table 5 for the two best-performing document retrieval models, Wiki-ctx and DPR-WoWft-ctx. We find that recall@5 for both models is higher when the dialogue context is added as input together with the claim.

In claim verification, a claim c is classified as SUPPORTED, REFUTED, or NEI given a context C and an evidence sentence set S_c. To create NEI claims for weakly supervised training data, we use two methods: 1) for every context-claim-evidence triplet, we substitute the evidence with random unrelated evidence, and 2) we use the Generation approach from section 4.1.1 to condition the generation on random evidence. We select a subset of 40,000 NEI claims from the two approaches. We fine-tune the Colloquial baseline model on this synthetic dataset. The input to the model is the sequence of the last 2 context utterances separated by an [EOT] token, followed by the claim. For all BERT-based models, all evidence sentences are concatenated together. More details about training the baselines are provided in Appendix B.
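The evidence sentence selection and the claim-verification input format described above can be summarized in a short sketch. The `scorer` callable stands in for the fine-tuned BERT ranker, and the exact special tokens, truncation, and text pairing used by the released models may differ.

```python
# Sketch of evidence selection (keep top-k candidates with positive relevance
# scores) and of assembling the verifier input (last two context turns and the
# claim joined with [EOT]; evidence sentences concatenated). Illustrative only.
def select_evidence(claim, context, candidates, scorer, k=5, use_context=True):
    query = (" ".join(context[-2:]) + " " if use_context and context else "") + claim
    ranked = sorted(((scorer(query, sent), sent) for sent in candidates), reverse=True)
    return [sent for score, sent in ranked[:k] if score > 0]

def build_verifier_input(context, claim, evidence_sentences, max_evidence=5):
    claim_part = " [EOT] ".join(context[-2:] + [claim]) if context else claim
    evidence_part = " ".join(evidence_sentences[:max_evidence])
    return claim_part, evidence_part  # fed as a text pair to the BERT-style classifier
```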
Table 6 summarizes the results for claim verification on the test set; NON-VERIFIABLE claims are included in the NEI category. We experiment with three evidence retrieval settings: 1) Oracle Evidence, where we use gold evidence; 2) Wiki-Evidence, where we use Wiki-ctx for document retrieval and Ret-with-context for evidence selection; and 3) DPR-Evidence, where we use DPR-WoWft-ctx for document retrieval and Ret-with-context for evidence selection. We set the maximum number of evidence sentences to 5. In all three settings, Aug-WoW outperforms the baselines, and the performance of all baselines drops when retrieved evidence is used compared to when oracle evidence is used. This indicates that evidence retrieval is an important step for this task. Even with oracle evidence, none of the models achieves an accuracy higher than 70%, which leaves abundant opportunity for future improvements. The Colloquial baseline is the closest to Aug-WoW since it has been trained on conversation-like colloquial claims. Although Colloquial and CorefBert-Colloquial perform better than VitaminC with oracle evidence, the contrastive nature of VitaminC helps it perform better with retrieved evidence.

In Table 8, we present the claim verification results on the test set using oracle evidence on Generated and Written claims separately. The performance of all models is lower on Generated claims than on Written claims. This is expected: as mentioned in "Final claim set creation" in section 4.1.1, the Generated claims were chosen from a larger candidate set based on how difficult existing models find them to classify. Thus, Generated claims in DIALFACT are more challenging. Furthermore, Aug-WoW's performance is high on both types of claims, although its gain is larger on Written claims than on Generated claims.

In Table 7, we present the claim verification results on the test set with Aug-WoW model ablations. In Aug-WoW-noctx we do not concatenate the dialogue context, and in Aug-WoW-BertLarge we use the BERT-Large model as the base architecture. Aug-WoW-noctx is comparable to Aug-WoW, with slightly lower performance with oracle evidence. Although Aug-WoW-BertLarge performs better with oracle evidence, it is more sensitive to evidence quality and performs poorly with retrieved evidence. To test whether a model that relies solely on claims and no evidence can leverage lexical biases in the claims to obtain good performance on DIALFACT, we train a model, Aug-WoW-claimonly, with no evidence included during training and testing. Aug-WoW-claimonly achieves 33.2% accuracy and a 28.9% macro F1 score on the DIALFACT test set. Thus, a model cannot exploit lexical cues in the claims of DIALFACT to obtain good performance. We report performance on a two-way classification experiment in Appendix A (Table 12), where we combine REFUTED and NEI into a single class named NOT-SUPPORTED.

We present sample dialogue contexts, claims, and oracle evidence along with model predictions in Table 9. We find that models tend to incorrectly predict a REFUTED or NEI response as SUPPORTED when there is significant overlap between the evidence and the claim, ignoring the semantics. The first example illustrates this point, where the presence of the terms "biathlon" and "cross country skiing" misleads some models into incorrectly predicting SUPPORTED. Similarly, models predict SUPPORTED or REFUTED for an NEI claim due to word overlap between claim and evidence, as shown in the second example. Models also often fail to perform complex and commonsense-based reasoning during verification.
In the third example, although humans can reason that the claim is REFUTED by the evidence, all models fail to classify the claim correctly. Finally, models struggle with lexical biases and with separating the colloquial part of a claim from its factual parts. In the fourth example, although there is significant overlap between the claim and the evidence, models are fooled by the presence of the phrase "not one of" and predict a SUPPORTED claim as REFUTED.

We propose a new benchmark, DIALFACT, for fact-checking in dialogue, created from grounded dialogues in the Wizard of Wikipedia dataset. Besides human-written response claims, we also create synthetic claims with operations such as contradiction, infilling, and substitution. We hire qualified crowd workers to annotate responses into NON-VERIFIABLE, SUPPORTED, REFUTED, or NOTENOUGHINFORMATION categories along with corresponding evidence. We point out empirically that existing fact-checking models trained on non-dialogue data fail to perform well on our task. We demonstrate how to leverage automatically generated responses as weak supervision signals to improve performance. We hope that DIALFACT can facilitate fact-checking, consistency modeling, and evaluation research in the dialogue community.

In this paper, we study the problem of fact-checking in dialogue. The DIALFACT benchmark dataset proposed in this work could help in the creation of more accurate automatic fact-checking systems and metrics, and ultimately of dialogue systems which are more faithful to factual knowledge and thus more trustworthy. Automatic fact-checking of dialogue could be useful in many real-life scenarios where conversations need to be properly monitored to avoid the spread of misinformation and disinformation, and where conversation participants need to be given accurate information. However, the DIALFACT benchmark only covers a specific domain with Wikipedia as background knowledge. Furthermore, even with our best efforts to ensure high quality and accuracy, the dataset might still contain incorrect labels and biases in some instances. This could pose a risk if models that are evaluated or built using this benchmark are used in domains not covered by the dataset, or if they leverage evidence from unreliable or biased sources. Thus the proposed benchmark should not be treated as a universal tool for all domains and scenarios. In our work, we mitigate this risk by using the trusted source of Wikipedia for evidence and by curating hard training and testing instances using automated generation approaches. Considerable additional work is needed to improve the scope, coverage, and validity of fact-checking systems and metrics, but our work provides a cautious yet concrete step towards developing fact-checking systems for dialogue.

We present the claim verification results on the validation set in Table 10. The trend in performance is similar to that observed on the test set reported in Table 6. In our human studies discussed in the Data validation subsection of section 4.4, we observe that workers sometimes confuse the REFUTED and NEI labels. Furthermore, there are cases where workers miss finding evidence on Wikipedia which refutes a claim and label the claim as NEI, even though they are instructed to find and verify the claim by visiting Wikipedia. Similar findings were reported in other fact-checking tasks (Jiang et al., 2020).
Hence we perform another experiment where we combine REFUTED and NEI into a single class named NOT-SUPPORTED. We present the claim verification results on the test set for this setting in Table 12. The performance of all baselines is higher since the task is transformed from a 3-way into a 2-way classification task. Aug-WoW performs the best in this setting.

In Section 5.3.2, we discuss results where NON-VERIFIABLE claims are included in the NEI category. In Table 11, we present the results for 3-way classification on the test set where NON-VERIFIABLE claims with NEI-PERSONAL labels are removed, that is, only VERIFIABLE claims are kept among the NEI-labelled claims. The trends in the results are similar to the ones observed in Table 6. We show the confusion matrix of our Aug-WoW model in Figure 2. Aug-WoW has the lowest performance on NEI claims and the highest confusion between the NEI and REFUTED classes.

First, we discuss the implementation details for the claim generation techniques in section 4.1.1. For Negation, we use the implementation from the fever-2 baseline (Thorne et al., 2019). For the T5 model in Mask-and-Fill and the Blenderbot model in the Generation approach, we use the models and training scripts available in Hugging Face's Transformers repository. Blenderbot was fine-tuned on the full WoW training set with a batch size of 40.

We next discuss the implementation details for the document retrieval methods. For the WikiAPI method, Kim et al. (2021) pointed out that it frequently and naively retrieves documents related to filler words such as "I", "Yes", and "They". In our implementation of WikiAPI, we mitigate this issue by filtering out such colloquial phrases using a manually created stopword list. We remove the stopwords from the candidate set of entities on which the MediaWiki API is called. Our experiments showed a significant improvement in the quality of the returned documents. For DPR, we use the wiki_dpr dataset available in Hugging Face. Since the Wikipedia versions behind the WoW dataset (the source of DialFact evidence) and Hugging Face's wiki_dpr (used for document retrieval in our experiments) are different, even if the WikiAPI and DPR methods retrieve a correct document, it might not exactly match the evidence picked from the WoW dataset due to wording changes and edits between the two versions of Wikipedia pages. Therefore we relax the requirement from exact document matching to partial matching. That is, we assume a retrieved document matches a gold document if either the initial half or the final half of the retrieved document matches the corresponding half of the gold evidence document.

We next discuss the implementation details for the claim verification models in section 5.3. For VitaminC, we use the tals/albert-base-vitaminc-fever model available in their repository. We fine-tune CorefBERT-base for CorefBERT and use the official code from the authors. We train the Aug-WoW and Colloquial models using the code from the VitaminC repository on a machine with 4 NVIDIA A100 GPUs and a training batch size of 100. We use the validation set performance for model selection.

A screenshot of the annotation interface is shown in Figure 3. Workers were paid an average of $
Table 12: Results for claim verification on the test set for 2-way classification into SUPPORTED and NOT-SUPPORTED. We combine REFUTED and NEI into NOT-SUPPORTED. We report Accuracy and Macro F1 scores in percentage.

Context A: I prefer to eat fish that is not farm raised due to the pesticides in the food.
B: Yes the two most common are atlantic cod and pacific cod
A: Most cod sold in stores is farm raised, and also the cod you eat in restaurants.
Response 1: There are other varieties of cod as well, like the black, red, white, and yellow
Evidence: Cod flesh is moist and flaky when cooked and is white in colour. It changes colour at certain water depths. It has two distinct colour phases: gray-green and reddish brown
Labels: Factual, Refuted
Response 2: I read that it is a popular food with a mild flavor and a dense flaky flesh
Evidence: Cod is popular as a food with a mild flavour and a dense, flaky white flesh.
Labels: Factual, Supported

For the claim labelling task, workers were told that they would be shown a conversation between two speakers, some previously created responses to the conversation, and some Wikipedia knowledge snippets related to the response (which we call evidence henceforth). They would label dialogue responses which could belong to one of the 3 categories mentioned below.

Supported: The response should exclusively use factual information which can be verified by the given evidence sentences and is correct or true in light of the evidence. A response is verifiable if evidence could be retrieved from Wikipedia which decreases the uncertainty about the truthfulness (or falsehood) of the statement.
Example 1:
• Context: I think Jazz is an American creation!
• Evidence: Jazz has roots in West African cultural and musical expression, and in African-American music traditions including blues and ragtime, as well as European military band music.
• Response: Its roots include African-American music traditions including blues and ragtime
• Explanation: The response is natural and can be verified from the evidence.
Example 2:
• Context: What are the three different waterfalls Niagra is made from? Can you please share with me?
• Evidence: From largest to smallest, the three waterfalls are the Horseshoe Falls, the American Falls, and the Bridal Veil Falls.
• Response: The three waterfalls are the Horseshoe Falls, the American Falls and the Bridal Veil Falls.
• Explanation: The response is natural and can be verified from the evidence, as all facts mentioned are correct.

Refuted: The response contains factual information which is "incorrect" or "false" in light of the evidence, that is, it contradicts the evidence. The response should be marked refuted if even a small part of the response is incorrect.
Example 1:
• Context: I think Jazz is an American creation!
• Evidence: Jazz has roots in West African cultural and musical expression, and in African-American music traditions including blues and ragtime, as well as European military band music.
• Response: Its roots include American music traditions including blues and ragtime
• Explanation: The roots are African-American, not American.
Example 2:
• Context: What are the three different waterfalls Niagra is made from? Can you please share with me?
• Evidence: From largest to smallest, the three waterfalls are the Horseshoe Falls, the American Falls and the Bridal Veil Falls.
• Response: The three waterfalls are the Horseshoe Falls, the American Falls and the Sommer Falls.
• Explanation: One of the falls is incorrect based on the evidence.

Not Enough Information (NEI): The response can not be verified (supported or refuted) with Wikipedia evidence. Moreover, for this response, it is allowed to use information/knowledge that might not be available in Wikipedia but that you assume to be general knowledge, e.g. that 90s refers to the time span from 1990 to 1999.
Example 1:
• Context: I think Jazz is an American creation!
• Evidence: Jazz has roots in West African cultural and musical expression, and in African-American music traditions including blues and ragtime, as well as European military band music.
• Response: Jazz is now played in all parts of the world except Russia.
• Explanation: The response is not a personal opinion and the provided evidence can't be used to verify the stated fact.
Example 2:
• Context: What are the three different waterfalls Niagra is made from? Can you please share with me?
• Evidence: From largest to smallest, the three waterfalls are the Horseshoe Falls, the American Falls and the Bridal Veil Falls.
• Response: I think three waterfalls all intersect multiple times. I am trying to remember the names.
• Explanation: The stated fact can not be verified from the evidence.

We ask workers to do the following:
• Read the context carefully and, if writing or editing a response, write a minimum of 9 words.
• Base the label exclusively on the response and the selected evidence sentences.

We ask workers NOT to do the following:
• While writing or editing a response, please avoid typos and misspellings as much as possible.
• While writing or editing a response, do not use "know-it-all" phrases such as "did you know" in your responses; e.g., the response "did you know that the Berlin Wall was demolished in 1989" will not be accepted.

Personal/generic response: We give workers some examples of personal responses. The response should not make any factual claim that could be verified using Wikipedia or any knowledge source. It can contain facts that are personal opinions or background of the speaker, but no fact pertinent to encyclopedic knowledge. The response should be a good follow-up to the conversation.
Example 1:
• Context: I do not understand why some people enjoy hunting.
• Evidence: Hunting is the practice of killing or trapping animals.
• Response 1: I enjoy going out in the woods to hunt animals.
• Response 2: Wow interesting. I have mostly used hunting as a means of pest control.
• Explanation: Even if hunting can be used as pest control, it is a personal detail or opinion here.
Example 2:
• Context: It would be perfect to have a family member involved in choosing foster care.
• Evidence: Usually children are taken care of by their parents, legal guardians or siblings.
• Response: Very true, that is why I think it is best when parents or legal guardians take care of their children, because they are the only ones that love the children.
• Explanation: Although part of the response is present in the evidence, this is a subjective opinion of the speaker.

To start the final task, we ask workers to read the dialogue, the corresponding responses, and the Wikipedia knowledge provided (links and pieces of evidence).
• For each provided response, mark it as SUPPORTED, REFUTED, or NOT ENOUGH INFORMATION.
• If the response consists of only personal opinions or personal information with no verifiable factual information, please mark the corresponding checkbox.
• Please read the instructions and examples in the link above carefully.
• If you select the SUPPORTED or REFUTED option, you must click at least one checkbox as evidence or copy and paste sentences from the Wikipedia links.
• For NEI, you would generally need to verify the facts in the responses by visiting and searching Wikipedia pages and pasting any related evidence.
• Please edit and correct the responses if they contain any grammatical or spelling mistakes.
Towards a human-like open-domain chatbot
Andreas Vlachos, Christos Christodoulopoulos, Oana Cocarascu, and Arpit Mittal. 2021. FEVEROUS: Fact extraction and verification over unstructured and structured information
Generating label cohesive and well-formed adversarial claims
A review on fact extraction and verification
Managing ambiguities across utterances in dialogue
BERT: Pre-training of deep bidirectional transformers for language understanding
Wizard of Wikipedia: Knowledge-powered conversational agents
Evaluating groundedness in dialogue systems
Teruko Mitamura, and Eduard Hovy. 2021. A survey of data augmentation approaches for NLP
AllenNLP: A deep semantic natural language processing platform
Plot-guided adversarial example construction for evaluating open-domain story generation
Michael Schlichtkrull, and Andreas Vlachos. 2021. A survey on automated fact-checking
FEVER: a large-scale dataset for fact extraction and VERification
Evaluating adversarial attacks against multiple fact verification systems
2015. sense2vec: a fast and accurate method for word sense disambiguation in neural word embeddings
Fact or fiction: Verifying scientific claims
"Liar, liar pants on fire": A new benchmark dataset for fake news detection
Dialogue natural language inference
TabFact: A large-scale dataset for table-based fact verification
Proactive human-machine conversation with explicit conversation goal
Beyond goldfish memory: Long-term open-domain conversation
Coreferential Reasoning Learning for Language Representation