key: cord-0231995-syi8p6py authors: Bang, Yejin; Ishii, Etsuko; Cahyawijaya, Samuel; Ji, Ziwei; Fung, Pascale title: Model Generalization on COVID-19 Fake News Detection date: 2021-01-11 journal: nan DOI: nan sha: 4cdce3aeac11b3f98af17e5d0f3d7941212a11b1 doc_id: 231995 cord_uid: syi8p6py

Amid the COVID-19 pandemic, the world is facing an unprecedented infodemic with the proliferation of both fake and real information. Considering the problematic consequences that COVID-19 fake news has brought, the scientific community has put effort into tackling it. To contribute to this fight against the infodemic, we aim to achieve a robust model for the COVID-19 fake-news detection task proposed at CONSTRAINT 2021 (FakeNews-19) by taking two separate approaches: 1) fine-tuning transformer-based language models with robust loss functions and 2) removing harmful training instances through influence calculation. We further evaluate the robustness of our models on a different COVID-19 misinformation test set (Tweets-19) to understand their generalization ability. With the first approach, we achieve 98.13% weighted F1 score (W-F1) on the shared task, but at most 38.18% W-F1 on Tweets-19. In contrast, by performing influence-based data cleansing, our model with a 99% cleansing percentage achieves 54.33% W-F1 on Tweets-19 with a trade-off on FakeNews-19. By evaluating our models on two COVID-19 fake-news test sets, we highlight the importance of model generalization ability in this task as a step toward tackling the COVID-19 fake-news problem in online social media platforms.

As the whole world goes through a tough time due to the COVID-19 pandemic, information about COVID-19 online has grown exponentially. It is the first global pandemic of the fourth industrial revolution era, which has led to the rapid spread of information through various online platforms, and it came along with an infodemic. The infodemic causes serious problems that affect people's lives; for instance, a piece of fake news claiming "drinking bleach can cure coronavirus disease" led people to death. Not only is physical health threatened by fake news, but easily spread fake news also affects the mental health of the public through restless anxiety or fear induced by misinformation [38].

Table 1: Dataset statistics for FakeNews-19 (Train/Valid/Test) and Tweets-19 (Valid/Test).

              FakeNews-19              Tweets-19
Label    Train   Valid   Test      Valid   Test
Real      3360    1120   1120         51    172
Fake      3060    1020   1020          9     28
Total     6420    2140   2140         60    200

With the urgent calls to combat the infodemic, the scientific community has produced intensive research and applications for analyzing the content, sources, propagators, and propagation of misinformation [26, 22, 11, 2, 14] and for providing accurate information through various user-friendly platforms [16, 30]. An early fact sheet about COVID-19 misinformation suggested that 59% of sampled pandemic-related Twitter posts could be evaluated as fake news [2]. To address this, large collections of tweets have been gathered to study the dissemination of misinformation [21, 23, 27, 1]. Understanding the problematic consequences of fake news, online platform providers have started flagging COVID-19 related information with an "alert" so that audiences can be aware of the content. However, the massive amount of information flooding the internet on a daily basis makes it challenging for human fact-checkers to keep up with the speed of information proliferation [28]. An automatic way to aid human fact-checkers is needed, not just for COVID-19 but for any infodemic that could happen unexpectedly in the future.
In this work, we aim to achieve a robust model for the COVID-19 fake-news detection shared task proposed by Patwa et al. [25] with two approaches: 1) fine-tuning classifiers with robust loss functions and 2) removing harmful training instances through influence calculation. We further evaluate the adaptability of our method outside the shared-task domain through evaluation on a different COVID-19 misinformation tweet test set [1]. We show a robust model with high performance over two different test sets as a step toward tackling the COVID-19 fake-news problem in social media platforms.

Fake-News COVID-19 (FakeNews-19) is a dataset released for the shared task of the CONSTRAINT 2021 workshop [24], which aims to combat the infodemic regarding COVID-19 across social media platforms such as Twitter, Facebook, Instagram, and other popular press releases. The dataset consists of 10,700 social media posts and articles of real and fake news, all in English. Detailed statistics are listed in Table 1. Each social media post is manually annotated as either "Fake" or "Real", depending on its veracity.

To evaluate the generalizability of trained models beyond the shared-task setting, we take the test set from [1] (Tweets-19), which was also released to fight the COVID-19 infodemic on Twitter. The tweets are annotated with fine-grained labels related to disinformation about COVID-19, depending on the interests of the different parties involved in the infodemic. We take the second question, "To what extent does the tweet appear to contain false information?", to fit our binary setting. Originally, it is answered with five labels based on the degree of falseness of the tweet. Instead of using the multi-label annotation, we follow the binary mapping of the data releasers to obtain "Real" and "Fake" labels for our experiments. For our cleansing experiment, we split the dataset into validation and test sets with equal label distribution; details are listed in Table 1. The most frequent words on each dataset, after removing stopwords, are listed in Table 2.

The main task is binary classification: determine the veracity of a given piece of text from a social media platform and assign it the label "Fake" or "Real". We aim to achieve a robust model for this task, considering both high performance in predicting labels on the FakeNews-19 shared task and generalization ability as measured by performance on Tweets-19, with two separate approaches described in Sections 3.2 and 3.3. Note that models are trained only on the FakeNews-19 train set.

When handling text data, Transformer-based [31] language models (LMs) are commonly used as feature extractors [4, 13, 17] thanks to publicly released large-scale pre-trained LMs. We adopt different Transformer LMs with a feed-forward classifier trained on top of each model. The list and details of models are described in Section 4.1. As reported in [37, 9, 12], robust loss functions help to improve deep neural network performance, especially on noisy datasets constructed from social media. In addition to the standard cross-entropy loss (CE), we explore the following robust loss functions: symmetric cross-entropy (SCE) [33], generalized cross-entropy (GCE) [39], and curriculum loss (CL) [19]. Inspired by the symmetric Kullback-Leibler divergence, SCE adds a term called reverse cross-entropy to make CE symmetric. GCE combines the advantages of the mean absolute error, which is noise-robust, and of CE, which performs well on challenging datasets. CL is a recently proposed 0-1 loss function that is a tighter upper bound than conventional summation-based surrogate losses, following the finding that the 0-1 loss is robust [7].
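To make these loss functions concrete, below is a minimal PyTorch sketch of SCE and GCE as they are typically implemented; the hyperparameter defaults (alpha, beta, q, and the log-clipping constant) follow common choices from [33, 39] and are illustrative, not our tuned settings. CL is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def sce_loss(logits, labels, alpha=1.0, beta=1.0, log_clip=-4.0):
    """Symmetric cross-entropy [33]: CE plus a reverse cross-entropy term."""
    ce = F.cross_entropy(logits, labels)
    probs = F.softmax(logits, dim=-1)
    one_hot = F.one_hot(labels, num_classes=logits.size(-1)).float()
    # log(0) on the zero entries of the one-hot labels is clipped to a constant A (= log_clip)
    rce = -(probs * torch.clamp(torch.log(one_hot), min=log_clip)).sum(dim=-1).mean()
    return alpha * ce + beta * rce

def gce_loss(logits, labels, q=0.7):
    """Generalized cross-entropy [39]: (1 - p_y^q) / q interpolates between CE (q -> 0) and MAE (q = 1)."""
    probs = F.softmax(logits, dim=-1)
    p_y = probs.gather(1, labels.unsqueeze(1)).squeeze(1)  # probability assigned to the true label
    return ((1.0 - p_y.pow(q)) / q).mean()
```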
Our second approach, influence-based data cleansing, is inspired by the work of Kobayashi et al. [10], which proposes an efficient method to estimate the influence of training instances on a target instance by introducing a turn-over dropout mechanism. We define $D_{trn} = \{z_{trn}^1, z_{trn}^2, \ldots, z_{trn}^n\}$ as a training dataset with $n$ training samples and $L(\theta, z)$ as a loss function calculated from a model $\theta$ and a labelled sample $z$. In turn-over dropout, a specific dropout mask $m_i \in \{0, 1\}^d$ with dropout probability $p$ is applied during training to zero out a set of parameters $\theta \in \mathbb{R}^d$ of the model for each training instance $z_{trn}^i$. With this approach, every single sample in the training set is trained on a unique sub-network of the model. We define $h(z_{trn}^i)$ as a function mapping a training instance $z_{trn}^i$ to its specific mask $m_i$. The influence score $I(z_{tgt}, z_{trn}^i, \theta)$ of $z_{trn}^i$ on a target sample $z_{tgt}$ is defined as follows:

$$I(z_{tgt}, z_{trn}^i, \theta) = L(\theta^{\widetilde{m}_i}, z_{tgt}) - L(\theta^{m_i}, z_{tgt}),$$

where $\widetilde{m}_i$ is the flipped mask of the original mask $m_i$, i.e., $\widetilde{m}_i = 1 - m_i$, and $\theta^{m}$ is the sub-network of the model with the mask $m$ applied. Intuitively, the influence score indicates the contribution of a training instance $z_{trn}^i$ to the target instance $z_{tgt}$: a positive influence score indicates that $z_{trn}^i$ reduces the loss on $z_{tgt}$, a negative influence score indicates that $z_{trn}^i$ increases the loss on $z_{tgt}$, and the magnitude of the score indicates how strong the influence is. To calculate the total influence score of a training instance $z_{trn}^i$ over multiple samples from a given target set $D_{tgt}$, we sum the scores over its samples:

$$I_{tot}(z_{trn}^i, \theta) = \sum_{z_{tgt} \in D_{tgt}} I(z_{tgt}, z_{trn}^i, \theta).$$

The total influence score $I_{tot}$ can be used to remove harmful instances, which only add noise or hinder the generalization of the model, by removing the top $k\%$ of training instances with the smallest total influence score from the training data. We refer to our data cleansing method as influence-based cleansing; it can remove noisy data and further improve model robustness and adaptability.
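The following is a minimal sketch of turn-over dropout influence scoring and the influence-based cleansing step described above. The deterministic per-instance mask (seeded by the instance index), the `model_loss` helper, and masking only a single parameter tensor are illustrative assumptions, not the exact implementation of [10].

```python
import numpy as np
import torch

def instance_mask(idx: int, dim: int, p: float = 0.5) -> torch.Tensor:
    """Deterministic dropout mask m_i = h(z_trn^i) for training instance i (assumed seeding scheme)."""
    rng = np.random.default_rng(seed=idx)
    return torch.from_numpy((rng.random(dim) >= p).astype(np.float32))

def influence(model_loss, params, mask, target_batch) -> float:
    """I(z_tgt, z_trn^i) = L(theta^{~m_i}, z_tgt) - L(theta^{m_i}, z_tgt)."""
    loss_flipped = model_loss(params * (1.0 - mask), target_batch)  # sub-network that never saw z_trn^i
    loss_kept = model_loss(params * mask, target_batch)             # sub-network trained on z_trn^i
    return float(loss_flipped - loss_kept)

def cleanse(total_scores: np.ndarray, k_percent: float) -> np.ndarray:
    """Return training indices kept after dropping the k% with the smallest I_tot."""
    n_remove = int(len(total_scores) * k_percent / 100)
    order = np.argsort(total_scores)  # ascending: most harmful first
    return np.sort(order[n_remove:])

# Total influence of instance i over a validation set D_tgt:
# total_scores[i] = sum(influence(model_loss, params, instance_mask(i, dim), b)
#                       for b in target_batches)
```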
We set up the baseline of our experiment following [25]: an SVM model trained on TF-IDF features. We try five different pre-trained BERT-based models: ALBERT-base [13], BERT-base, BERT-large [4], RoBERTa-base, and RoBERTa-large [17]. We fine-tune the models on the FakeNews-19 train set with classification layers on top, exploiting the pre-trained models provided by [36]. We train each model with four different loss functions: CE, SCE, GCE, and CL. The hyperparameters are searched over learning rates of 1e-6, 3e-6, and 5e-6 and epochs of 1, 3, 5, and 10, and the best combination is chosen based on performance on the FakeNews-19 validation set. The robustness of the fine-tuned models is then evaluated on both the FakeNews-19 and Tweets-19 test sets. In this experiment, we mainly focus our evaluation on the weighted F1 (W-F1) score.

Table 3 reports the results on the FakeNews-19 task. Across all settings, RoBERTa-large trained with the CE loss function achieved the highest W-F1 score, 98.13%, a gain of 4.81% W-F1 over the TF-IDF SVM baseline. Except for BERT-large, all models achieved their best performance when fine-tuned with the CE loss function; the robust loss functions did not improve label-prediction performance. In other words, the large-scale LMs could extract features of high enough quality that there was barely any noise in FakeNews-19 for the robust loss functions to exploit.

In Table 4, we show the inference results on Tweets-19. Unlike the successful result on FakeNews-19, RoBERTa-large with CE scores only 33.65% W-F1 on Tweets-19, showing that the model does not generalize successfully. Instead, the highest performance is achieved by BERT-large with SCE at 38.18%, a 4.53% gain over RoBERTa-large with CE. Interestingly, across all models, the highest Tweets-19 performance is achieved when fine-tuning with one of the robust loss functions (SCE, GCE, or CL), which shows that the robust loss functions help to improve the generalization ability of the models. For instance, RoBERTa-large gains 3.85% with the CL loss function compared to its performance with CE. Considering that RoBERTa-large with CL achieves 97.47% on FakeNews-19, only 0.66% below the highest performance, this is a fair trade-off, and RoBERTa-large with CL can be selected as a robust model that achieves high performance on FakeNews-19 while generalizing better to Tweets-19. Overall, while the LMs achieve between 96.44% and 98.13% W-F1 on FakeNews-19, performance on Tweets-19 is comparatively poor: below 40% W-F1, and as low as 22.85%. It can be inferred that the two test set distributions are distinct, although both are related to the COVID-19 infodemic and share the same data source, Twitter.

For the influence-based cleansing experiments, we first fine-tune a pre-trained RoBERTa-large model on the FakeNews-19 train set while applying turn-over dropout to the weight matrix of the last affine transformation layer of the model with dropout probability $p = 0.5$. We calculate the total influence scores of the resulting model against the validation sets of FakeNews-19 and Tweets-19. We investigate the effectiveness of our data cleansing approach by removing the $k\%$ of training instances with the smallest total influence score, with $k \in \{1, 25, 50, 75, 99\}$. We then retrain models on the remaining training data and evaluate the retrained models. All models are trained with the cross-entropy loss function with a fixed learning rate of 3e-6; we run each model for 15 epochs with early stopping with patience 3. As baselines, we compare our method with three different approaches: 1) the pre-trained RoBERTa-large model without additional fine-tuning, 2) the RoBERTa-large model fine-tuned on all training data without any data cleansing, and 3) models trained with random cleansing at the same cleansing percentage. We run each experiment five times with different random seeds to measure the statistics of the evaluation performance.

Based on the experiment results in Table 5, our influence-based cleansing method performs best on Tweets-19 when the cleansing percentage is 99%, using only the 64 most influential training instances. When the cleansing percentage is at least 25%, our influence-cleansed model outperforms both the model without cleansing and the model with random cleansing in terms of both accuracy and W-F1. The pre-trained model without fine-tuning (i.e., zero training instances) achieves 34.36% and 46.24% W-F1 on FakeNews-19 and Tweets-19 respectively; our best model outperforms it by a large margin.
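For reference, the evaluation metrics used here, accuracy, W-F1, and the per-class binary scores reported in the next section, can be computed with scikit-learn. A minimal sketch, assuming string labels and treating "fake" as the positive class for the binary scores (a convention we adopt for illustration):

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def report(y_true, y_pred):
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        # Weighted-F1 (W-F1): per-class F1 averaged with class-support weights
        "w_f1": f1_score(y_true, y_pred, average="weighted"),
        # Per-class binary scores (B-Prec / B-Rec / B-F1) for the "fake" class
        "fake_b_prec": precision_score(y_true, y_pred, pos_label="fake"),
        "fake_b_rec": recall_score(y_true, y_pred, pos_label="fake"),
        "fake_b_f1": f1_score(y_true, y_pred, pos_label="fake"),
    }
```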
Although both datasets were built to address COVID-19 fake news and share the same data collection source, Twitter, the results show that models trained on FakeNews-19 achieve relatively low performance on the Tweets-19 test set. (Note that Tweets-19 consists of a test set only, at a smaller scale than FakeNews-19.) For further understanding, we visualize the features extracted by the best-performing model right before the classification layers with t-SNE. As shown in Figure 1, even though the features of the FakeNews-19 test set distinguish the "Fake" and "Real" labels, the features of Tweets-19 do not separate the two labels well. As mentioned in Subsection 5.2, a higher cleansing percentage tends to lead to a higher evaluation F1 score. Using the model trained with the top 1% of influential instances, we extract sentence representations as depicted in Figure 2. As in Figure 1, the same number of instances from the test set are randomly selected for better visualization. The top 1% of influential instances are fairly evenly sampled from the whole training set, and this small subset of the training set is enough to produce a distribution that separates the test features, which supports the effectiveness of the influence score. Moreover, since the top 1% of samples are sparser, the trained model can deal more flexibly with samples from unseen distributions, resulting in extracted features of higher quality.

For the performance on the Tweets-19 test set, we additionally consider binary recall (B-Rec.), binary precision (B-Prec.), and binary F1 (B-F1) scores to further analyze the generalization ability of the model. As shown in Table 6, the model with 99% data cleansing achieves the best per-class F1 scores, with 37.17% B-F1 on the fake label and 71.50% on the real label. In general, the "Fake" B-Prec. and "Real" B-Rec. scores increase as the cleansing percentage increases, while "Real" B-Prec. and "Fake" B-Rec. behave the other way around, which means that models with higher cleansing percentages capture more real news and reduce the number of false "Fake" labels.
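As an illustration of the visualization procedure described above, the following sketch extracts the representation fed to the classification layer and projects it with t-SNE. The checkpoint name and the `test_texts` variable are placeholders; in our setting, the encoder would carry the fine-tuned weights.

```python
import torch
from sklearn.manifold import TSNE
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
encoder = AutoModel.from_pretrained("roberta-large")  # placeholder: load fine-tuned weights here
encoder.eval()

@torch.no_grad()
def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    # First-token (<s>) hidden state: the feature vector right before the classifier
    return encoder(**batch).last_hidden_state[:, 0, :].numpy()

features = embed(test_texts)  # test_texts: a list of post strings (assumed defined)
coords = TSNE(n_components=2, perplexity=30).fit_transform(features)  # 2-D points to plot
```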
[29] analyzed the global trend of tweets at the first emergence of COVID-19. To understand the diffusion of information, [3, 27] analyze the patterns of spreading COVID-19 related information and quantify rumor amplification across different social media platforms. Alam et al. [1] focus on fine-grained disinformation analysis of both English and Arabic tweets for the interests of multiple stakeholders such as journalists, fact-checkers, and policymakers. Kar et al. [8] propose a multilingual approach to detect fake news about COVID-19 from Twitter posts.

Generalization ability of models. As described in the previous section, several NLP studies address the emerging COVID-19 infodemic, yet the generalization aspect is neglected, although it is essential for accelerating industrial application development. In recent years, along with the introduction of numerous tasks in various domains, the importance of model generalization ability with a tiny amount of, or even without, additional training data has been intensely discussed. In general, recent works on model generalizability can be divided into two directions: 1) adaptive training and 2) robust loss functions. In adaptive training, meta-learning [5] and fast adaptation [20, 35, 18] approaches have been developed and show promising results for improving the generalization of models over different domains. Another meta-learning approach, called meta transfer learning [34], improves the generalization ability for a low-resource domain by leveraging a high-resource domain dataset. Among robust loss functions, symmetric cross-entropy [33], generalized cross-entropy [39], and curriculum loss [19] have been shown to produce more generalized models than the cross-entropy loss thanks to their robustness to noisy-labeled instances, i.e., outliers in the training data. In addition to these approaches, data de-noising can improve model performance [15]; thus, data cleansing techniques that identify influential instances in the training dataset have been proposed to further improve the evaluation performance and generalization ability of models [6, 10].

We investigated the COVID-19 fake-news detection task with the aim of achieving a robust model that performs well on the CONSTRAINT shared task and also has high generalization ability, using two separate approaches. The robust loss functions, compared to the traditional cross-entropy loss, do not help much in improving the F1 score on FakeNews-19 but show better generalization ability on Tweets-19 with a fair trade-off, as shown by the comparison between RoBERTa-large with CE and with CL. By performing influence-based data cleansing with a high cleansing percentage (≥ 25%), we can achieve a better F1 score over multiple test sets. Our best model, with a 99% cleansing percentage, achieves the best evaluation performance on Tweets-19 with a 61.10% accuracy score and a 54.33% W-F1 score while still maintaining high test performance on FakeNews-19. This suggests how labeled data can be used to solve the fake-news detection problem while also taking model generalization ability into account. For future work, we would like to combine adaptive training and robust loss functions with the influence-score data cleansing method such that the resulting influence score can be made more robust for handling unseen or noisy data.

References
[1] Fighting the COVID-19 infodemic in social media: A holistic perspective and a call to arms
[2] Types, sources, and claims of COVID-19 misinformation
[3] The COVID-19 social media infodemic
[4] BERT: Pre-training of deep bidirectional transformers for language understanding
[5] Model-agnostic meta-learning for fast adaptation of deep networks
[6] Data cleansing for models trained with SGD
[7] Does distributionally robust supervised learning give robust classifiers
[8] No rumours please! A multi-indic-lingual approach for COVID fake-tweet detection
[9] Deep learning with noisy labels: Exploring techniques and remedies in medical image analysis
[10] Efficient estimation of influence of a training instance
[11] Coronavirus goes viral: Quantifying the COVID-19 misinformation epidemic on Twitter
[12] Robust loss functions for learning multi-class classifiers
[13] ALBERT: A lite BERT for self-supervised learning of language representations
[14] Misinformation has high perplexity
[15] Team yeon-zi at SemEval-2019 task 4: Hyperpartisan news detection by de-noising weakly-labeled data
[16] Jennifer for COVID-19: An NLP-powered chatbot built for the people and by the people to combat misinformation
[17] RoBERTa: A robustly optimized BERT pretraining approach
[18] CrossNER: Evaluating cross-domain named entity recognition
[19] Curriculum loss: Robust learning and generalization against label corruption
[20] The Adapter-Bot: All-in-one controllable conversational model
[21] An "infodemic": Leveraging high-volume Twitter data to understand public sentiment for the COVID-19 outbreak
[22] Coronavirus: The spread of misinformation
[23] Critical impact of social networks infodemic on defeating coronavirus COVID-19 pandemic: Twitter-based study and research directions
[24] Overview of CONSTRAINT 2021 shared tasks: Detecting English COVID-19 fake news and Hindi hostile posts
[25] Fighting an infodemic: COVID-19 fake news dataset
[26] Fighting COVID-19 misinformation on social media: Experimental evidence for a scalable accuracy-nudge intervention
[27] An exploratory study of COVID-19 misinformation on Twitter
[28] Anatomy of an online misinformation network
[29] A first look at COVID-19 information and misinformation sharing on Twitter
[30] CAiRE-COVID: A question answering and multi-document summarization system for COVID-19 research
[31] Attention is all you need
[33] Symmetric cross entropy for robust learning with noisy labels
[34] Meta-transfer learning for code-switched speech recognition
[35] Learning fast adaptation on cross-accented speech recognition
[36] Transformers: State-of-the-art natural language processing
[37] Part-dependent label noise: Towards instance-dependent label noise
[38] Impact of COVID-19 pandemic on mental health in the general population: A systematic review
[39] Generalized cross entropy loss for training deep neural networks with noisy labels