key: cord-0219487-98nmccuo
authors: Lee, Nayeon; Li, Belinda Z.; Wang, Sinong; Fung, Pascale; Ma, Hao; Yih, Wen-tau; Khabsa, Madian
title: On Unifying Misinformation Detection
date: 2021-04-12
journal: nan
DOI: nan
sha: 8de8010a09edced1866a3aa82bec2e01581b76a3
doc_id: 219487
cord_uid: 98nmccuo

* Work partially done while interning at Facebook AI.
† Work partially done while working at Facebook AI.

In this paper, we introduce UnifiedM2, a general-purpose misinformation model that jointly models multiple domains of misinformation with a single, unified setup. The model is trained to handle four tasks: detecting news bias, clickbait, fake news, and verifying rumors. By grouping these tasks together, UnifiedM2 learns a richer representation of misinformation, which leads to state-of-the-art or comparable performance across all tasks. Furthermore, we demonstrate that UnifiedM2's learned representation is helpful for few-shot learning of unseen misinformation tasks/datasets, and that the model generalizes to unseen events.

On any given day, 2.5 quintillion bytes of information are created on the Internet, a figure that is only expected to increase in the coming years (Marr, 2018). The Internet has allowed information to spread rapidly, and studies have found that misinformation spreads faster and more broadly than true information (Vosoughi et al., 2018). It is thus paramount for misinformation detection approaches to be able to adapt to new, emerging problems in real time, without waiting for thousands of training examples to be collected. In other words, the generalizability of such systems is essential.

Misinformation detection is not well studied from a generalizability standpoint. Misinformation can manifest in different forms and domains, i.e., fake news, clickbait, and false rumors, and previous literature has mostly focused on building specialized models for a single domain (Rubin et al., 2016; Omidvar et al., 2018; Ma et al., 2018). (Even prior literature on multi-tasking for misinformation (Kochkina et al., 2018) focuses more on using auxiliary tasks to boost performance on a single task, rather than on all tasks.) However, though these domains may differ in format (long articles vs. short headlines and tweets) and exact objective ("is this fake" vs. "is this clickbait"), they share the same ultimate goal of deceiving their readers. As a result, their content often exhibits similar linguistic characteristics, such as using a sensational style to incite curiosity or strong emotional responses from readers. Furthermore, models trained on multiple tasks are more robust and less prone to overfitting to spurious domain-specific correlations. Thus, unifying various domains of misinformation allows us to build a generalizable model that performs well across multiple domains/formats of misinformation.

In this work, we propose the Unified Misinfo Model (UNIFIEDM2), a misinformation detection model that uses multi-task learning (Caruana, 1997; Maurer et al., 2016; Zhang and Yang, 2017) to train on different domains of misinformation. Through a comprehensive series of empirical evaluations, we demonstrate that our approach is effective on all tasks that we train on, improving F1 in some cases by an absolute ~8%. Moreover, we conduct ablation studies to more precisely characterize how such positive transfer is attained. Beyond improvements on seen datasets, we examine the generalizability of our proposed approach to unseen tasks/datasets and events.
This is highly applicable to real-world use cases, where obtaining new misinformation labels is costly and misinformation must often be taken down in real time. Our experimental results indicate that our unified representation generalizes better than the other baselines.

In this section, we describe the architecture and the training details of our proposed UNIFIEDM2 model. Our model architecture is a hard-parameter-sharing multi-task learning model (Ruder, 2017), in which a single shared RoBERTa (Liu et al., 2019b) encoder is used across all tasks. RoBERTa is a Transformer encoder pretrained with a masked-language-modeling objective on English Wikipedia and news articles (CC-NEWS), among other data. We additionally append task-specific multi-layer perceptron (MLP) classification heads on top of the shared encoder. During multi-task training, the model sees examples from all datasets, and we jointly train the shared encoder with all task-specific heads. At inference time, we use only the classification head relevant to the task at hand. The overall architecture of the model is shown in Figure 1.

[Figure 1: Architecture of our UNIFIEDM2 model.]

Our model training process consists of two steps. The first step is multi-task training of the shared UNIFIEDM2 encoder to learn a general misinformation representation. We jointly optimize for all tasks $t_1, \dots, t_T$ by minimizing the sum of their task-specific losses $\mathcal{L}_t$, where $\mathcal{L}_t$ is the cross-entropy loss of the task-specific MLP classifier. The overall loss is $\mathcal{L}_{\text{multi}} = \sum_{t=t_1}^{t_T} \mathcal{L}_t$. Note that since the dataset sizes differ, we over-sample from the smaller datasets to make the number of training examples per task roughly equal. The second step is to fine-tune each task-specific head again, similarly to MT-DNN (Liu et al., 2019a), to obtain the results reported in Table 2 and Table 4.

Here, we provide experimental details (datasets, baselines, experimental setups) and results that empirically demonstrate the success of the proposed UNIFIEDM2 model. Table 1 lists the four misinformation tasks/datasets we use to train UNIFIEDM2. They span various granularities and domains (articles, sentences, headlines, and tweets) as well as various objectives (classifying veracity, bias, and clickbaity-ness).

NEWSBIAS: A task to classify whether a given sentence from a news article contains political bias. We adapt the BASIL (Fan et al., 2019) dataset, which has bias-span annotations for lexical and informational bias within news articles. Using this dataset, we also include two auxiliary tasks related to political-bias detection: 1) bias-type classification: given a biased sentence, classify the type of bias (lexical vs. informational); and 2) polarity detection: given a biased sentence, determine its polarity (positive, negative, neutral).

FAKENEWS: An article-level fake news detection task that leverages the Webis (Potthast et al., 2018) dataset, annotated by professional journalists.

RUMOR: A task to verify the veracity of a rumor tweet. The PHEME dataset (Zubiaga et al., 2016), which contains rumor tweets with their corresponding reply tweets (social engagement data), is used for this task. We use only the text of the source rumor tweet, since we focus on learning a good representation of misinformation text. Originally, there were three class labels (true, false, unverified); however, following other literature (Derczynski et al., 2017; Wu et al., 2019), we report on the binary version, excluding the unverified label.

CLICKBAIT: A task to detect the clickbaity-ness of news headlines, i.e., sensational headlines that may deceive and mislead readers. For this task, we use the dataset from the Clickbait Challenge.[1]

[1] https://www.clickbait-challenge.org/. There are two versions of the labeled dataset; we use only the larger one.
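To make the multi-task setup described above concrete, the following is a minimal PyTorch sketch of a hard-parameter-sharing model with a shared RoBERTa encoder, per-task MLP heads, and the summed cross-entropy loss, assuming the Hugging Face transformers API. The task names, label counts, and head sizes are illustrative assumptions, not the paper's exact configuration.

```python
import torch.nn as nn
from transformers import RobertaModel

# Illustrative task -> number-of-labels map (binary heads assumed here;
# the NEWSBIAS auxiliary heads would be added the same way).
TASKS = {"newsbias": 2, "clickbait": 2, "fakenews": 2, "rumor": 2}

class UnifiedM2(nn.Module):
    def __init__(self, tasks=TASKS, hidden=1024):  # 1024 = roberta-large hidden size
        super().__init__()
        self.encoder = RobertaModel.from_pretrained("roberta-large")  # shared across tasks
        self.heads = nn.ModuleDict({                                  # per-task MLP heads
            name: nn.Sequential(nn.Linear(hidden, hidden), nn.Tanh(),
                                nn.Linear(hidden, n_labels))
            for name, n_labels in tasks.items()
        })

    def forward(self, task, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]   # representation of the <s> token
        return self.heads[task](cls)        # logits from the relevant head only

def multitask_loss(model, task_batches):
    """L_multi = sum_t L_t over one batch per task (smaller datasets are
    over-sampled upstream so tasks contribute roughly equally)."""
    ce = nn.CrossEntropyLoss()
    return sum(ce(model(t, b["input_ids"], b["attention_mask"]), b["labels"])
               for t, b in task_batches.items())
```

At inference time, only the head matching the query task is called, mirroring the description above.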
State-of-the-Art Models: For each misinformation task, we report and compare our approach to the published SoTA model from the prior literature on that task.[2]

[2] The NEWSBIAS SoTA reports bias-detection performance separately in the "lexical-bias vs. no-bias" setting and the "informational-bias vs. no-bias" setting. In our experiments, we treat both lexical bias and informational bias as the "contains-bias" class and conduct one unified experiment.

RoBERTa-based Baselines: In addition to each task's published SoTA model, we create RoBERTa-based baselines by fine-tuning RoBERTa on each individual task.

Training Details: We ran all experiments three times with different random samples and report the average. Our UNIFIEDM2 model is based on the RoBERTa-large model, which has 355M parameters. We used the Adam optimizer (Kingma and Ba, 2014) with a mini-batch size of 32. The learning rate was set to 5e-6 with linear learning-rate decay. The maximum epoch count was 15, with early-stopping patience set to 5. The maximum input sequence length was set to 128. These parameters were obtained by performing grid search over the validation loss, within the following hyper-parameter bounds: learning rate ∈ {5e-5, 5e-6, 5e-7}, batch size ∈ {16, 32}.

Training Details for Few-shot Experiments: We did not perform any additional parameter search for the few-shot experiments; we kept all training details and parameters the same as those stated above.

Computing Infrastructure: We ran all experiments on one NVIDIA Tesla V100 GPU with 32 GB of memory.

Table 2 presents the results of our proposed unified model, UNIFIEDM2, along with the two groups of baseline models. UNIFIEDM2 achieves better or comparable results than both baselines on all four misinformation tasks. The improvement is especially prominent on the NEWSBIAS and RUMOR tasks, where we see 8% and 5% improvements in accuracy, respectively.

We conduct an ablation study to better understand how other tasks help in our multi-task framework. One question we ask is what kinds of tasks benefit the most from being trained together; namely, how well do more "similar" vs. more "different" kinds of tasks transfer to each other? Specifically, we use the RUMOR dataset as a case study.[3] We train on multiple task combinations and evaluate their performance on RUMOR. Results are shown in Table 3. Note that adding FAKENEWS alone, or NEWSBIAS alone, to single-task RoBERTa actually hurts performance, indicating that multi-task learning is not simply a matter of data augmentation. We hypothesize that the drop is due to FAKENEWS being the least similar in format and style to RUMOR. Qualitatively, we compare examples from FAKENEWS and CLICKBAIT (the most helpful dataset) to RUMOR: examples from FAKENEWS are long documents with a mix of formal and sensational styles, whereas CLICKBAIT contains short, sensational sentences. However, as the model is trained on more datasets, adding the less similar FAKENEWS task actually improves overall performance (90.5 → 92.5 F1 with three datasets), despite hurting the model trained on RUMOR only (86.9 → 78.7 F1). We hypothesize that this is due, in part, to including more diverse sources of data, which improves the robustness of the model to different types of misinformation.

[3] Other datasets show similar findings.
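As a sketch of how the ablation above can be organized, the following enumerates every task combination that includes RUMOR and records the per-combination RUMOR score. The `train_multitask` and `evaluate` callbacks are hypothetical placeholders for the training and evaluation code, and the task names are illustrative.

```python
from itertools import combinations

AUX_TASKS = ["newsbias", "clickbait", "fakenews"]  # illustrative task names

def run_ablation(train_multitask, evaluate):
    """Train on every task combination that includes RUMOR; score on RUMOR."""
    results = {}
    for r in range(len(AUX_TASKS) + 1):
        for combo in combinations(AUX_TASKS, r):
            tasks = ("rumor", *combo)             # RUMOR is always included
            model = train_multitask(list(tasks))  # hypothetical training callback
            results[tasks] = evaluate(model, task="rumor")
    return results
```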
New types, domains, and subjects of misinformation arise frequently. Responding promptly to these new sources is challenging, as they can spread widely before there is time to collect sufficient task-specific training examples. For instance, the rapid spread of COVID-19 was accompanied by an equally fast spread of large quantities of misinformation (Joszt, 2020; Kouzy et al., 2020). Therefore, we carry out experiments to evaluate the generalization ability of the UNIFIEDM2 representation to unseen misinformation (i) tasks/datasets and (ii) events. The first experiment concerns fast adaptation (few-shot training) to a new task/dataset, whereas the second concerns the model's ability to perform well on events unseen during training.

Dataset: We evaluate on the following four unseen datasets: PROPAGANDA (Da San Martino et al., 2019), which contains 21,230 propaganda and non-propaganda sentences, with the propaganda sentences annotated with fine-grained propaganda-technique labels, such as "Name calling" and "Appeal to fear"; POLITIFACT (Shu et al., 2019), which contains 91 true and 91 fake news articles collected from PolitiFact's fact-checking platform; BUZZFEED (Shu et al., 2019), which contains 120 true and 120 fake news headlines collected from BuzzFeed's fact-checking platform; and COVIDTWITTER (Alam et al., 2020), which contains 504 COVID-19-related tweets. For our experiment, we use two of its annotations: 1) Twitter Check-worthiness: does the tweet contain a verifiable factual claim? 2) Twitter False Claim: does the tweet contain false information?

We compare the few-shot performance of UNIFIEDM2 against off-the-shelf RoBERTa and single-task RoBERTa. For each unseen dataset, a new MLP classification head is trained on top of the RoBERTa encoder in a few-shot manner. Given $N_d$, the size of dataset $d$, we train the few-shot classifiers with $k$ randomly selected samples and evaluate on the remaining $N_d - k$ samples. We test with k = 10, 25, 50. Note that for single-task RoBERTa, we report the average performance across the four task-specific models (ST average).

As shown in Table 4, our UNIFIEDM2 encoder can quickly adapt to new tasks, even with very little in-domain data. While both the single-task models and UNIFIEDM2 significantly outperform vanilla RoBERTa, UNIFIEDM2 further outperforms the single-task models, indicating that multi-task learning can aid task generalizability.

Dataset: For this experiment, we use the previously introduced RUMOR dataset, which includes nine separate events. Several prior works (Kochkina et al., 2018; Li et al., 2019; Yu et al., 2020) have used this dataset in a leave-one-event-out cross-validation setup (eight events for training and one event for testing) to take event generalizability into consideration in their model evaluation. We conduct a supplementary experiment following this evaluation setup for the completeness of our analysis.

Experiment: First, we train the UNIFIEDM2 encoder without RUMOR data, and then fine-tune and evaluate in the leave-one-event-out cross-validation setup. Note that we re-train the UNIFIEDM2 encoder to ensure that it has no knowledge of the left-out-event test set.
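The two evaluation protocols in this section, k-shot adaptation with k ∈ {10, 25, 50} and leave-one-event-out cross-validation, can be sketched as follows. The `train_fn` and `eval_fn` callbacks are hypothetical placeholders for the fine-tuning and scoring code.

```python
import random

def k_shot_split(dataset, k, seed=0):
    """k randomly chosen training samples; the remaining N_d - k form the test set."""
    rng = random.Random(seed)
    idx = list(range(len(dataset)))
    rng.shuffle(idx)
    return [dataset[i] for i in idx[:k]], [dataset[i] for i in idx[k:]]

def leave_one_event_out(examples_by_event, train_fn, eval_fn):
    """Train on eight events, test on the held-out ninth; average the scores."""
    scores = []
    for held_out in examples_by_event:
        train = [ex for event, exs in examples_by_event.items()
                 if event != held_out for ex in exs]
        model = train_fn(train)                        # fine-tune on eight events
        scores.append(eval_fn(model, examples_by_event[held_out]))
    return sum(scores) / len(scores)
```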
[Table 5: Average accuracy and macro-F1 scores from the leave-one-event-out cross-validation setup for the RUMOR task; SoTA'19 (Li et al., 2019) scores 48.30% Acc and 41.80% macro-F1.]

Results in Table 5 show that our proposed method outperforms two recent SoTA models (Li et al., 2019; Yu et al., 2020) by an absolute 16.44% and 25.14% in accuracy. This indicates that unified misinformation representations are helpful for event generalizability as well.

Existing misinformation work takes three main approaches. Content-based approaches examine only the language of a document; prior works have looked at linguistic features such as hedging words and emotional words (Rubin et al., 2016; Potthast et al., 2018; Rashkin et al., 2017; Wang, 2017). Fact-based approaches leverage evidence from external sources (e.g., Wikipedia, the Web) to determine the truthfulness of the information (Etzioni et al., 2008; Wu et al., 2014; Ciampaglia et al., 2015; Popat et al., 2018; Thorne et al., 2018; Nie et al., 2019). Finally, social-data-based approaches use the surrounding social data, such as the credibility of the authors of the information (Long et al., 2017; Kirilin and Strube, 2018; Li et al., 2019) or social engagement data (Derczynski et al., 2017; Ma et al., 2018; Kwon et al., 2013).

Though prior works have explored multi-task learning within misinformation, they have focused exclusively on one domain. These works try to predict two different labels on the same set of examples from a single dataset (Kochkina et al., 2018) or two closely related datasets (Wu et al., 2019). In contrast, our proposed approach crosses not just task or dataset boundaries, but also format and domain boundaries. Furthermore, prior works focus on using an auxiliary task to boost the performance of a main task, while we focus on using multi-tasking to generalize across many domains. Thus, the focus of this work is not the multi-task paradigm itself, but rather the unification of the various domains via multi-tasking.

In this paper, we introduced UNIFIEDM2, which unifies multiple domains of misinformation within a single multi-task learning setup. We empirically showed that such unification improves the model's performance against strong baselines and achieves new state-of-the-art results. Furthermore, we showed that UNIFIEDM2 can generalize to out-of-domain misinformation tasks and events, and thus can serve as a good starting point for others working on misinformation.

References

Alam et al. (2020). Fighting the COVID-19 infodemic in social media: A holistic perspective and a call to arms.
Caruana (1997). Multitask learning.
Ciampaglia et al. (2015). Computational fact checking from knowledge networks.
Da San Martino et al. (2019). Fine-grained analysis of propaganda in news article.
Derczynski et al. (2017). SemEval-2017 Task 8: RumourEval: Determining rumour veracity and support for rumours.
Etzioni et al. (2008). Open information extraction from the Web.
Fan et al. (2019). In plain sight: Media bias through the lens of factual reporting.
Joszt (2020). Combatting COVID-19 misinformation.
Kingma and Ba (2014). Adam: A method for stochastic optimization.
Kirilin and Strube (2018). Exploiting a speaker's credibility to detect fake news.
Kochkina et al. (2018). All-in-one: Multi-task learning for rumour verification.
Kouzy et al. (2020). Coronavirus goes viral: Quantifying the COVID-19 misinformation epidemic on Twitter.
Kwon et al. (2013). Prominent features of rumor propagation in online social media.
Li et al. (2019). Rumor detection by exploiting user credibility information, attention and multi-task learning.
Liu et al. (2019a). Multi-task deep neural networks for natural language understanding.
Liu et al. (2019b). RoBERTa: A robustly optimized BERT pretraining approach.
Long et al. (2017). Fake news detection through multi-perspective speaker profiles.
Ma et al. (2018). Rumor detection on Twitter with tree-structured recursive neural networks.
Marr (2018). How much data do we create every day? The mind-blowing stats everyone should read.
Maurer et al. (2016). The benefit of multitask representation learning.
Nie et al. (2019). Combining fact extraction and verification with neural semantic matching networks.
Omidvar et al. (2018). Using neural network for identifying clickbaits in online news media.
Popat et al. (2018). DeClarE: Debunking fake news and false claims using evidence-aware deep learning.
Potthast et al. (2018). A stylometric inquiry into hyperpartisan and fake news.
Rashkin et al. (2017). Truth of varying shades: Analyzing language in fake news and political fact-checking.
Rubin et al. (2016). Fake news or truth? Using satirical cues to detect potentially misleading news.
Ruder (2017). An overview of multi-task learning in deep neural networks.
Shu et al. (2019). Beyond news contents: The role of social context for fake news detection.
Thorne et al. (2018). FEVER: A large-scale dataset for fact extraction and verification.
Volkova et al. (2017). Separating facts from fiction: Linguistic models to classify suspicious and trusted news posts on Twitter.
Vosoughi et al. (2018). The spread of true and false news online.
Wang (2017). "Liar, Liar Pants on Fire": A new benchmark dataset for fake news detection.
Wu et al. (2019). Different absorption from the same sharing: Sifted multi-task learning for fake news detection.
Wu et al. (2014). Toward computational fact-checking.
Yu et al. (2020). Coupled hierarchical transformer for stance-aware rumor verification in social media conversations.
Zhang and Yang (2017). A survey on multi-task learning.
Zubiaga et al. (2016). Analysing how people orient to and spread rumours in social media by looking at conversational threads.