title: Multilingual Evidence Retrieval and Fact Verification to Combat Global Disinformation: The Power of Polyglotism
authors: Roberts, Denisa A.O.
date: 2020-12-16

Abstract. This article investigates multilingual evidence retrieval and fact verification as a step to combat global disinformation, a first effort of this kind, to the best of our knowledge. A 400-example mixed-language English-Romanian dataset is created for cross-lingual transfer learning evaluation. We make code, datasets, and trained models available upon publication.

Introduction. The recent COVID-19 pandemic broke down geographical boundaries and led to an infodemic of fake news and conspiracy theories [40]. Evidence-based claim verification (English only) has been studied as a weapon against fake news and disinformation [34]. However, conspiracy theories and disinformation can propagate from one language to another. Polyglotism is not uncommon: according to a 2017 Pew Research study, 91% of European students learn English in school. Furthermore, recent machine translation advances are increasingly bringing down language barriers [15, 20]. Disinformation can be defined as intentionally misleading information [25]. A multilingual approach to evidence retrieval for claim verification aims to combat global disinformation during globally significant events.

The "good cop" of the Internet [8], Wikipedia has become a source of ground truth, as seen in the recent literature on evidence-based claim verification. There are more than 6 million English Wikipedia articles, but other language editions have fewer resources, such as Romanian (about 400K articles), which motivates retrieving evidence across multiple languages. As a case study (Fig. 1), we evaluate a claim about Ion Mihai Pacepa, a former agent of the Romanian secret police during communism and author of books on disinformation [23, 24]. Related conspiracy theories can be found on internet platforms. For example, a Romanian online publication claimed that he was deceased [1]. Twitter posts with strong for and against language exist in multiple language pairs, such as English-Portuguese and English-Polish. Strong-language claim examples are "We were tricked by Pacepa" (against) vs. "Red Horizons is one of the best political books of the 20th century ..." (for). Strong language has been associated with propaganda and fake news [41]. In the following sections we review the relevant literature, present our methodology and experimental results, and conclude with final notes.

Literature Review. The literature review touches on three topics: online disinformation, multilingual NLP, and evidence-based claim verification.

Online Disinformation. Previous disinformation studies focused on election-related activity on social media platforms like Twitter, botnet-generated hyperpartisan news, and the 2016 US presidential election [3-5, 13]. To combat online disinformation via claim verification, one must retrieve reliable evidence at scale, since fake news tends to be more viral and to spread faster than real news [27, 30, 35, 41].

Multilingual Natural Language Processing Advances. Recent multilingual applications leverage pre-training of massive language models that can be fine-tuned for multiple tasks. For example, the cased multilingual BERT (mBERT) [11] is pre-trained on a corpus of the top 104 Wikipedia languages. It has 12 layers, 768 hidden units, 12 attention heads, and 110M parameters.
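For concreteness, the following is a minimal sketch of loading this encoder and encoding a cross-lingual claim-evidence pair. It assumes the HuggingFace transformers library, which the paper does not name as its tooling, and the Romanian evidence sentence is purely illustrative.

```python
# Minimal sketch: loading the cased multilingual BERT encoder.
# Assumes HuggingFace `transformers`; illustrative only.
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
encoder = BertModel.from_pretrained("bert-base-multilingual-cased")

# Tokenize a claim-evidence pair in two languages; mBERT shares one
# vocabulary across its 104 pretraining languages.
inputs = tokenizer(
    "Ion Mihai Pacepa is deceased.",           # English claim
    "Ion Mihai Pacepa este un fost general.",  # Romanian evidence (illustrative)
    return_tensors="pt", truncation=True,
)
outputs = encoder(**inputs)
cls_embedding = outputs.last_hidden_state[:, 0]  # <CLS> encoding, shape (1, 768)
```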
Cross-lingual transfer learning has been evaluated for tasks such as natural language inference [9, 2], document classification [28], question answering [7], and fake Indic-language tweet detection [16].

English-Only Evidence Retrieval and Claim Verification. Fact-based claim verification is framed as a textual entailment task over retrieved evidence. An annotated dataset was shared [33] and a shared task [34] was set up to retrieve evidence from Wikipedia documents and predict claim verification status. Recently published SotA results rely on pre-trained BERT flavors or XLNet [36]. DREAM [38], GEAR [39] and KGAT [21] achieved SotA results with graph-based architectures. Dense Passage Retrieval [17] is used in RAG [19] in an end-to-end approach to claim verification.

Methodology. The system depicted in Fig. 2 is a pipeline with a multilingual evidence retrieval component and a multilingual claim verification component. Based on an input claim c_{l_i} in language l_i, the system retrieves evidence E_{l_j} from the Wikipedia edition in language l_j and supports, refutes, or abstains (not enough info). We employ English and Romanian as sample languages.

Multilingual Document Retrieval. To retrieve the top n_l Wikipedia documents D_{c,n_l} for each language l, we employ an ad-hoc entity linking system similar to [14], based on the named entity recognition of [10]. Entities are parsed from the (English) claim c using the AllenNLP [12] constituency parser. We search for the entities and retrieve 7 English and 1 Romanian Wikipedia pages using the MediaWiki API, relying on the internationally recognized nature of the claim entities (144.9K out of 145.5K training claims have Romanian Wikipedia search results).

Multilingual Sentence Selection. All sentences ∪_l {S_{D_{c,n_l}}} from the retrieved documents are supplied as input to the sentence selection model. For Romanian sentences we removed diacritics [29]. We prepend evidence sentences with the page title to compensate for unresolved co-referring pronouns [31, 37]. We frame multilingual sentence selection as a two-way classification task [14, 26]. Our architecture includes an mBERT encoder E_r(·) and an MLP classification layer with softmax output φ(·). During training, all encoder parameters are fine-tuned and the MLP weights are trained from scratch. Each input example is a pair of one evidence sentence and the claim [37, 39]. The encoding of the first token, <CLS>, is supplied to the MLP classification layer. The model estimates

P(y|x) = φ(E_r(x)), (1)

where x is a sentence-claim pair. We only include the 110K verifiable claims in training. The annotated evidence sentences form positive examples, and we randomly sample 32 negative example sentences from the retrieved documents. We train two flavors of the fine-tuned model: EnmBERT selects only English negative sentences, while EnRomBERT selects English (5) and Romanian (27) negative sentences. Claims are in English. We optimize the cross-entropy loss:

L = CrossEntropy(y*, P(y|x)). (2)

Multilingual Claim Verification. The claim verification step takes as input the top 5 sentence-claim pairs ranked by the sentence selection model (pointwise ranking [6]). The architecture includes an EnmBERT or EnRomBERT encoder E_v(·) and an MLP. We fine-tune the natural language inference model in a three-way classification task. A prediction is made for each of the 5 pairs and we aggregate the predictions based on logic rules [22]. For both models, we train with the Adam optimizer [18], a batch size of 32, a learning rate of 2e-5, and the cross-entropy loss, for 1 and 2 epochs respectively.
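To make these two pipeline heads concrete, the sketch below shows the shared encoder-plus-MLP pattern used for both sentence selection (two-way) and claim verification (three-way), together with one plausible reading of the label aggregation rule. It assumes PyTorch and the HuggingFace transformers library, and the aggregation logic is our interpretation of [22], not the paper's published code.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class PairClassifier(nn.Module):
    """mBERT encoder E(.) with an MLP head phi(.) over the <CLS> encoding.

    num_classes = 2 for sentence selection, 3 for claim verification.
    """
    def __init__(self, num_classes: int):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-multilingual-cased")
        self.mlp = nn.Linear(self.encoder.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        return self.mlp(hidden[:, 0])  # logits from the <CLS> position

# Training setup matching the text: Adam, lr 2e-5, cross-entropy (Eq. 2).
model = PairClassifier(num_classes=2)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()

def aggregate(pair_labels):
    """Combine the 5 per-pair verdicts into one claim label.

    One plausible reading of the logic rules in [22], not necessarily
    the exact rules the paper uses: any positive entailment signal wins,
    then any contradiction, else abstain.
    """
    if "SUPPORTS" in pair_labels:
        return "SUPPORTS"
    if "REFUTES" in pair_labels:
        return "REFUTES"
    return "NOT ENOUGH INFO"
```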
Conceptual End-to-End Multilingual Retrieve-Verify System. There are limitations to the ad-hoc entity linking document retrieval step for non-English languages, multilingual annotation is expensive, and including retrieved Romanian sentences only as negative examples in the supervised sentence selection step introduces bias. We propose a novel end-to-end multilingual evidence retrieval and claim verification approach, similar to the English-only RAG [19], that automatically retrieves relevant evidence passages in language l_j from a multilingual corpus for a claim in language l_i. In Fig. 2, the two-step multilingual evidence retrieval is replaced with a multilingual version of dense passage retrieval (DPR) [17] with an mBERT backbone. The DPR-retrieved documents form a latent probability distribution. The claim verification model conditions on the claim x_{l_i} and the latent retrieved documents z to generate the label y. The probabilistic model is

p(y|x_{l_i}) = Σ_z p(z|x_{l_i}) p(y|x_{l_i}, z),

marginalizing over the latent retrieved documents z. The multilingual retrieve-verify system is jointly trained, and the only supervision is at the claim verification level. We leave this promising avenue for future experimental evaluation.

Experimental Results. There are no equivalent multilingual claim verification baselines, so we calibrate the model results by calculating the official FEVER score [33]. To evaluate the zero-shot transfer learning ability of the trained models, we translate 10 supported and 10 refuted claims, each with 5 evidence sentences, and combine them into a mix-and-match development set of 400 examples.

Calibration Results on Development and Test Sets. In Table 1 and Fig. 3 we compare EnmBERT and EnRomBERT label accuracy and evidence recall on a fair dev set, the test set, and a golden-forcing dev set. The golden-forcing dev set adds all golden evidence to the sentence selection input, effectively forcing perfect document retrieval recall [21]. Note that any available English-only system with a BERT backbone, such as KGAT [21] or GEAR [39], could be given an mBERT backbone to lift the multilingual system's performance. We come within 5% of comparable BERT-based English-only systems such as [31], though our training setup differs, so the gap is not directly attributable to the multilingual nature of our system. Our evidence recall is within 5% of the English-only KGAT [21] and better than that of [31].

Table 1. Calibration of the models using the official FEVER scores (%) from [33].

To better understand strengths and weaknesses and the impact of including Romanian evidence, we perform a per-class performance analysis and also calculate FEVER-2 (the score over only SUPPORTS and REFUTES claims). The SotA on FEVER-2 is likely that of RAG [19] at 89.5% without golden evidence (fair dev set); our EnRomBERT model comes within 5% of it. Including the Romanian sentences improves the FEVER-2 score (see Fig. 3), coming within 5% of the English-only FEVER-2 SotA of 92.2% [32] in the golden-forcing setting. On the SUPPORTS and REFUTES classes, EnRomBERT outperforms EnmBERT on both the fair and golden-forcing datasets. Likely, the additional noise from including the second language improves EnRomBERT's generalization on English-language claims. Both models struggle on the NEI class, which is unsurprising since no NEI claims were included in the training set.
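For reference, the following is a hedged sketch of the strict FEVER scoring logic [33] used for calibration throughout this section: a claim counts as correct only if the predicted label matches and, for SUPPORTS/REFUTES claims, at least one complete golden evidence group is among the retrieved sentences. The field names are illustrative, not taken from the official scorer.

```python
def fever_score(predictions):
    """Strict FEVER score over a list of per-claim prediction dicts.

    Each dict is assumed (illustratively) to carry the predicted and
    gold labels, the predicted evidence sentence IDs, and the gold
    evidence groups, where each group is a complete evidence set.
    """
    strict = 0
    for p in predictions:
        label_ok = p["predicted_label"] == p["gold_label"]
        if p["gold_label"] == "NOT ENOUGH INFO":
            evidence_ok = True  # NEI claims need only the correct label
        else:
            retrieved = set(p["predicted_evidence"])
            # At least one full golden group must be fully retrieved.
            evidence_ok = any(set(group) <= retrieved
                              for group in p["gold_evidence_groups"])
        strict += label_ok and evidence_ok
    return strict / len(predictions)
```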
Transfer Learning Performance. Table 2 shows the zero-shot transfer learning ability of EnmBERT and EnRomBERT. We evaluate the two models on the mixed 400 examples (mixed column) and on the En-En, En-Ro (English evidence, Romanian claims), Ro-En, and Ro-Ro subsets, directly evaluating the claim verification step. It is interesting to see the differences in cross-lingual transfer learning ability across the Ro-En, En-Ro, and Ro-Ro scenarios. EnmBERT's label accuracy on Ro-Ro is 85%, compared to 95% on En-En, and better than EnRomBERT's; the pattern is similar for Ro-En and En-Ro. That EnmBERT outperforms EnRomBERT is not surprising: EnRomBERT learned to treat Romanian evidence sentences as negatives (they were included only as negative examples in sentence selection training), which biased it against Romanian evidence.

Disinformation Case Study. We now evaluate the EnRomBERT system on the case study in Fig. 1. We retrieve supporting evidence in English, Romanian, and Portuguese. The page titles and summaries are retrieved directly using the MediaWiki API. Based on the top predicted evidence (in three languages), the system predicts that the claim is supported. The system will be exposed as a demo service, with limits on the number of requests and on latency.
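As an illustration of this retrieval step, here is a minimal sketch against the public Wikimedia REST API. The paper's exact MediaWiki API calls are not published, and the availability of the page in each language edition is assumed.

```python
# Hedged sketch: fetch page summaries from several Wikipedia language
# editions via the public Wikimedia REST API (endpoint and "extract"
# field are part of that API; the paper's own calls may differ).
import requests

def wiki_summary(title: str, lang: str = "en") -> str:
    url = f"https://{lang}.wikipedia.org/api/rest_v1/page/summary/{title}"
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.json().get("extract", "")

# Evidence for the Fig. 1 case study, in three language editions.
for lang in ("en", "ro", "pt"):
    print(lang, wiki_summary("Ion_Mihai_Pacepa", lang)[:80])
```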
Conclusion. We present a first approach to multilingual evidence retrieval and claim verification to combat global disinformation. We evaluate two systems, EnmBERT and EnRomBERT, and their cross-lingual transfer learning ability for claim verification. We make available a translated, mixed English-Romanian claim and evidence dataset for future multilingual research evaluation.

References

[1] impact.ro homepage.
[2] Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond.
[3] The Brexit botnet and user-generated hyperpartisan news.
[4] Social bots distort the 2016 US presidential election online discussion.
[5] Strategies and influence of social bots in a 2017 German state election: a case study on Twitter.
[6] Learning to rank: from pairwise approach to listwise approach.
[7] TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages.
[8] Conspiracy videos? Fake news? Enter Wikipedia, the 'good cop' of the Internet.
[9] XNLI: Evaluating cross-lingual sentence representations.
[10] Large-scale named entity disambiguation based on Wikipedia data.
[11] BERT: Pre-training of deep bidirectional transformers for language understanding.
[12] AllenNLP: A deep semantic natural language processing platform.
[13] Fake news on Twitter during the 2016 US presidential election.
[14] UKP-Athene: Multi-sentence textual entailment for claim verification.
[15] Google's multilingual neural machine translation system: Enabling zero-shot translation.
[16] No rumours please! A multi-Indic-lingual approach for COVID fake-tweet detection.
[17] Dense passage retrieval for open-domain question answering.
[18] Adam: A method for stochastic optimization.
[19] Retrieval-augmented generation for knowledge-intensive NLP tasks.
[20] Multilingual denoising pre-training for neural machine translation.
[21] Fine-grained fact verification with kernel graph attention network.
[22] Team Papelo: Transformer networks at FEVER.
[23] Red Horizons: Chronicles of a Communist Spy Chief. Gateway Books.
[24] Disinformation: Former Spy Chief Reveals Secret Strategy for Undermining Freedom, Attacking Religion, and Promoting Terrorism.
[25] Online disinformation and the role of Wikipedia.
[26] FAQ retrieval using query-question similarity and BERT-based query-answer relevance.
[27] Creating a data set and a challenge for deepfakes.
[28] A corpus for multilingual document classification in eight languages.
[29] Edinburgh neural machine translation systems for WMT 16.
[30] This Analysis Shows How Viral Fake Election News Stories Outperformed Real News On Facebook.
[31] BERT for evidence retrieval and claim verification.
[32] Avoiding catastrophic forgetting in mitigating model biases in sentence-pair classification with elastic weight consolidation.
[33] FEVER: A large-scale dataset for fact extraction and verification.
[34] The fact extraction and verification (FEVER) shared task.
[35] The spread of true and false news online.
[36] XLNet: Generalized autoregressive pretraining for language understanding.
[37] UCL Machine Reading Group: Four factor framework for fact finding (HexaF).
[38] Reasoning over semantic-level graph for fact checking.
[39] GEAR: Graph-based evidence aggregating and reasoning for fact verification.
[40] ReCOVery: A multimodal repository for COVID-19 news credibility research.
[41] A survey of fake news: Fundamental theories, detection methods, and opportunities.