key: cord-0746991-9i4hrdfh
authors: Wahle, Jan Philip; Ashok, Nischal; Ruas, Terry; Meuschke, Norman; Ghosal, Tirthankar; Gipp, Bela
title: Testing the Generalization of Neural Language Models for COVID-19 Misinformation Detection
date: 2021-11-15
journal: 17th International Conference on Information for a Better World: Shaping the Global Future, iConference 2022
DOI: 10.1007/978-3-030-96957-8_33
sha: e25b0d6d9547b5bd4e2d6d653e08efef37ff07b3
doc_id: 746991
cord_uid: 9i4hrdfh

A drastic rise in potentially life-threatening misinformation has been a by-product of the COVID-19 pandemic. Computational support to identify false information within the massive body of data on the topic is crucial to prevent harm. Researchers have proposed many methods for flagging online misinformation related to COVID-19. However, these methods predominantly target specific content types (e.g., news) or platforms (e.g., Twitter), and their ability to generalize has so far remained largely unclear. To fill this gap, we evaluate fifteen Transformer-based models on five COVID-19 misinformation datasets that include social media posts, news articles, and scientific papers. We show that tokenizers and models tailored to COVID-19 data do not provide a significant advantage over general-purpose ones. Our study provides a realistic assessment of models for detecting COVID-19 misinformation. We expect that evaluating a broad spectrum of datasets and models will benefit future research in developing misinformation detection systems.

At the time of writing, the COVID-19 pandemic has claimed more than four million lives, and the number of infections remains high. The behavior of individuals strongly affects the risk of infection. In turn, the quality of information individuals receive strongly influences their actions [23, 10]. The novelty and rapid global spread of the SARS-CoV-2 virus have also led to countless life-threatening incidents of misinformation on the topic. Controlling COVID-19 and combating possible future pandemics early requires reducing misinformation and increasing the distribution of facts on the subject [5, 35]. Researchers worldwide collaborate on automating the detection of false information on COVID-19. These initiatives build collections of scientific papers, social media posts, and news articles to analyze their content, spread, sources, and propagators [33, 7, 19].

Natural Language Processing (NLP) research has extensively studied options to automate the identification of fake news [27], primarily by applying recent language models. Researchers have proposed adaptations of well-known Transformer models, such as COVID-Twitter-BERT [20], to identify false information on COVID-19 from specific sources. However, most prior studies analyze specific content types (e.g., news) or platforms (e.g., Twitter). These limitations prevent reliable conclusions regarding the generalization of the proposed language models. To fill this gap, we apply 15 Transformer models to five COVID-19 misinformation tasks. We compare Transformer models optimized on COVID-19 datasets to state-of-the-art neural language models. We exhaustively apply models to different tasks to test their generalization on unknown sources. The code to reproduce our experiments and the datasets we used are publicly available.

In the same way that word2vec [18] inspired many models in NLP [4, 26, 25], the excellent performance of BERT [8], a Transformer-based model [28], has led to numerous adaptations for language tasks [34, 6, 29, 30].
Domain-specific models built on top of Transformers typically outperform their baselines on related tasks [12]. For example, SciBERT [2] was pre-trained on scientific documents and typically outperforms BERT for scientific NLP tasks, such as determining document similarity [22]. Many models for COVID-19 misinformation detection employ domain-specific pre-training to improve their representations. COVID-Twitter-BERT [20] was pre-trained on 160M tweets and evaluated for sentiment analysis of tweets, e.g., about COVID vaccines. BioClinicalBERT [1] was trained on clinical narratives to incorporate linguistic characteristics from the biomedical and clinical domains.

Cui et al. [7] investigated the misinformation detection task by comparing traditional machine learning and deep learning techniques. Similarly, Zhou et al. [36] explored statistical learners, such as SVM, and neural networks to classify news as credible or not. The results of both studies show that deep learning architectures are the strongest alternatives for the respective datasets. As papers on COVID-19 are recent, some contributions are only available as pre-trained models. COVID-BERT and COVID-SciBERT are pre-trained on the CORD-19 dataset and only available via the Hugging Face API. Others, such as COVID-CQ [19] and CMU-MisCov19 [17], are used either to investigate intrinsic details (e.g., how dense misinformed communities are) or to explore the applicability of statistical techniques.

Although related works provide promising approaches to counter misinformation related to COVID-19, none of them explore multiple datasets. Research in many NLP areas already uses diverse benchmarks to compare models [32, 31]. To the best of our knowledge, our study is the first to systematically test Transformer-based methods on different data sources related to COVID-19.

Models. Our study includes 15 Transformer-based models, which are detailed in Appendix A.2. We categorize the models into the following three groups:

General-Purpose Baselines. The first group consists of general-purpose Transformer models without domain-specific training, i.e., BERT [8], RoBERTa [16], BART [15], and DeBERTa [11]. These baselines show how vanilla Transformer-based models perform on the COVID-19 misinformation detection task.

Intermediate Pre-Training. The second group contains models trained on specific content types and domains, i.e., SciBERT [2], BERTweet [21], and BioClinicalBERT [1]. For example, SciBERT adapts BERT for scientific articles. These models show the effect of intermediate pre-training on specific sources compared to general-purpose training (e.g., whether BERTweet is superior to BERT for misinformation on Twitter). Moreover, we compare the models in this group to language models optimized using intermediate pre-training on COVID-19 data (third group).

COVID-19 Intermediate Pre-Training. The third group comprises models employing an intermediate pre-training stage on COVID-19 data. Due to task-specific pre-training, we expect these models to achieve better results than the models in groups one and two. We include a model that optimizes the pre-training objective on a large Twitter corpus, i.e., CT-BERT [20], two models trained on the CORD-19 dataset (COVID-BERT and COVID-SciBERT), and two popular models from the Hugging Face API for which the intermediate pre-training sources have not been released yet (ClinicalCOVID-BERT and BioCOVID-BERT). We additionally pre-train RoBERTa, BART, and DeBERTa on the CORD-19 dataset to compare them to the models we used as general-purpose baselines.
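For orientation, the sketch below shows how one representative of each group could be instantiated through the Hugging Face API. The hub identifiers are illustrative examples of each group rather than a statement of the exact checkpoints used in this study, and the number of labels depends on the target dataset.

```python
# Illustrative sketch: the hub identifiers exemplify each model group and are not
# necessarily the exact checkpoints evaluated in the paper.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_GROUPS = {
    "general_purpose": ["bert-base-uncased", "roberta-base", "facebook/bart-base"],
    "intermediate_pretrained": ["allenai/scibert_scivocab_uncased",
                                "vinai/bertweet-base",
                                "emilyalsentzer/Bio_ClinicalBERT"],
    "covid_intermediate_pretrained": ["digitalepidemiologylab/covid-twitter-bert-v2"],
}

def load_classifier(checkpoint: str, num_labels: int):
    """Pair a checkpoint with a freshly initialized sequence-classification head."""
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint,
                                                               num_labels=num_labels)
    return tokenizer, model

# Example: a three-class head, e.g., for the COVID-CQ stance labels (neutral, against, favor).
tokenizer, model = load_classifier("roberta-base", num_labels=3)
```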
Data. We compile an evaluation set from six popular datasets for detecting COVID-19 misinformation in social media, news articles, and scientific publications, i.e., CORD-19 [33], CoAID [7], COVID-CQ [19], ReCOVery [36], CMU-MisCov19 [17], and COVID19FN. Table 1 gives an overview of the datasets, and Appendix A.1 presents more details. For CORD-19, we only use the abstracts in the dataset, as they provide an adequate trade-off between size and information density. Additionally, less than 50% of the articles in CORD-19 are available as full texts. For CoAID and ReCOVery, we only extract news articles to reduce a bias towards Twitter posts in our evaluation. All remaining datasets are used in their original composition.

We use CORD-19 to extend the pre-training of general-purpose models and all other datasets to evaluate the models on a downstream task. CORD-19 consists of scientific articles, while the other datasets primarily contain news articles and Twitter content. We chose different domains for training and evaluation to test the models' generalization capabilities and to avoid overlaps between the datasets.

Table 1. An overview of the COVID-related datasets. CORD-19 has no specific task or label, as it provides a general collection of documents. † Details on the labels are given in Memon et al. [17].

Overview. Our study includes three experiments. The first experiment tests how static word embeddings and frozen contextual embeddings perform compared to fine-tuned language models. The second experiment studies whether tokenizers specifically tailored to a COVID-19 vocabulary are superior to general-purpose ones. The third experiment evaluates and compares all 15 Transformer models on the five evaluation datasets.

Training & Evaluation. To compare general-purpose baselines and COVID-19 intermediate pre-trained models, we perform pre-training on the CORD-19 dataset for three models (RoBERTa, BART, and DeBERTa) and use pre-trained configurations for the remaining models (BERT, SciBERT, BioCOVID-BERT, ClinicalCOVID-BERT). We then fine-tune all models for each of the five test tasks (COVID-CQ, CoAID, ReCOVery, CMU-MisCov19, and COVID19FN). We use a split of 80% and 20% of the documents in a dataset for training and testing, respectively. This split is the most common configuration for the tested datasets [36, 19] and is comparable to other studies [7]. We use 10% of the training data as a hold-out validation set.

Static and Frozen Embeddings. Table 2 compares the classification results of a baseline composed of a BiLSTM and GlobalVectors (GloVe) [4] to the frozen embeddings of three Transformer models for the COVID-CQ dataset. The results show no significant difference between GloVe and the frozen models. However, fine-tuning the same three models end-to-end generally increases their performance. Therefore, we choose to fine-tune neural language models for the classification of COVID-19 misinformation.

Tokenizer Ablation. Table 3 shows the results on COVID-CQ for the best configuration of the models using a standard tokenizer for pre-training and fine-tuning. We expected that adjusting the tokenizer to the CORD-19 dataset would improve the results, as it adds valuable tokens to the vocabulary that are often not present in standard tokenizers. However, using specialized tokenizers decreased the performance. The content in CORD-19 originates from the scientific domain; we hypothesize that tweets lack similar token relations, which causes the performance drop on the COVID-CQ dataset. Therefore, we use the standard tokenizer for our full evaluation experiments.
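For concreteness, a minimal sketch of one way to build such a COVID-specific tokenizer is given below; it re-trains a standard tokenizer's algorithm on CORD-19 text using the Hugging Face `train_new_from_iterator` utility. The precise procedure used for the ablation is not specified above, so the corpus variable, base checkpoint, and vocabulary size are illustrative assumptions.

```python
# Sketch of building a COVID-19-specific tokenizer from CORD-19 abstracts.
# `cord19_abstracts`, the base checkpoint, and the vocabulary size are illustrative.
from transformers import AutoTokenizer

cord19_abstracts = [
    "The SARS-CoV-2 spike protein mediates cell entry ...",
    "Hydroxychloroquine showed no benefit in randomized trials ...",
]  # placeholder corpus; in practice, the cleaned CORD-19 abstracts

base_tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def corpus_batches(texts, batch_size=1000):
    """Yield the corpus in chunks, as expected by train_new_from_iterator."""
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]

# Re-train the same tokenization algorithm (byte-level BPE for RoBERTa) on COVID-19 text,
# so that domain terms (e.g., 'hydroxychloroquine') can be kept as single tokens.
covid_tokenizer = base_tokenizer.train_new_from_iterator(
    corpus_batches(cord19_abstracts), vocab_size=50_000)

print(base_tokenizer.tokenize("hydroxychloroquine"))   # many sub-word pieces
print(covid_tokenizer.tokenize("hydroxychloroquine"))  # ideally a single domain token
```

Note that a re-trained vocabulary also requires resizing and largely re-learning the model's input embedding matrix (e.g., via `model.resize_token_embeddings`) before continued pre-training.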
Full Evaluation. Table 4 reports the results of our full evaluation. All results are statistically significant according to bootstrap and permutation tests (p < .05) [9]. General-purpose baselines achieved the best result for two of the five datasets (BART on CoAID and BART on COVID19FN). For two datasets (ReCOVery and COVID-CQ), a model we pre-trained on CORD-19 data (COVID-RoBERTa) performed best. CT-BERT achieved the best result on CMU-MisCov19, an expected outcome as the dataset consists only of Twitter content. BERTweet, which was also trained on Twitter data, does not achieve better results than the general-purpose baselines. We expected a minor drop in performance for BERTweet compared to CT-BERT, as the former was not trained on COVID-19 vocabulary, but a better performance than general-purpose models, as BERTweet was trained mainly on Twitter data.

Table 2. F1-macro scores of neural language models and a baseline (BiLSTM+GloVe) for the COVID-CQ dataset. The static and frozen models use a stacked BiLSTM; fine-tuned models were pre-trained on the CORD-19 dataset and fine-tuned for the task.

All models achieved low scores for CMU-MisCov19, making it the most challenging dataset in our evaluation. The best results were obtained for CoAID. Overall, general-purpose baselines achieved results comparable to COVID-19 intermediate pre-trained models for all datasets. For example, the best mean result for the ReCOVery dataset was achieved by COVID-RoBERTa (F1=.91, std=.03), while the general-purpose model BART (F1=.90, std=.01) was only .01 points worse. We observe similar results for COVID-CQ, where the best model, COVID-RoBERTa (F1=.78, std=.03), has a score difference of .02 to the second-best model, BERT (F1=.76, std=.01). We conclude that pre-training language models on COVID-19 data before fine-tuning on a misinformation task did not generally provide an advantage for the datasets tested in this paper.

This study empirically evaluated 15 Transformer models on five COVID-19 misinformation tasks. Our analysis shows that domain-specific models and tokenizers do not generally perform better in the classification of misinformation. We conclude that the vocabulary related to COVID-19, and possibly text patterns about COVID-19, do not have a significant effect on the models' ability to classify misinformation. The main limitation of our study is the non-standardized pre-training of the models due to their diversity.

To reliably detect misinformation across content types and platforms, researchers need access to diverse data. We see this study as an initial step towards compiling a benchmark for COVID-19 data similar to widely adopted natural language understanding benchmarks (e.g., GLUE, SuperGLUE), which would enable an evaluation across diverse sets of misinformation domains, sources, and tasks. Controlling the current and future pandemics requires reliable detection of false information propagated through many channels, each with unique features. This study is a first step for researchers and policymakers to devise and deploy systems that reliably flag misinformation related to COVID-19 from a broad spectrum of sources. The usefulness of NLP models increases significantly if they are applicable to multiple tasks [31]. We anticipate that future NLP technologies for detecting misinformation will need to adopt the trend of evaluating on several benchmark datasets. This work provides a first milestone in evaluating general model capabilities and questioning the advantage of domain-specific model pre-training.
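As a concrete illustration of such a multi-dataset evaluation protocol, the sketch below scores a classifier on each dataset over several seeds, reports the mean and standard deviation of F1-macro, and includes a simple paired bootstrap test. It is a minimal, self-contained example: the TF-IDF plus logistic-regression pipeline merely stands in for a fine-tuned Transformer, and the bootstrap variant shown is illustrative rather than the exact procedure of [9].

```python
# Minimal sketch of a multi-dataset evaluation loop with per-seed F1-macro scores and a
# paired bootstrap test. The TF-IDF + logistic-regression pipeline is a stand-in for a
# fine-tuned Transformer; the bootstrap variant is illustrative, not the procedure of [9].
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def fit_predict(train_texts, train_labels, test_texts):
    """Stand-in classifier; replace with Transformer fine-tuning in practice."""
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(train_texts, train_labels)
    return model.predict(test_texts)


def evaluate_datasets(datasets, seeds=(0, 1, 2)):
    """Return {dataset_name: (mean, std)} of F1-macro over seeds with an 80/20 split."""
    results = {}
    for name, (texts, labels) in datasets.items():
        scores = []
        for seed in seeds:
            x_tr, x_te, y_tr, y_te = train_test_split(
                texts, labels, test_size=0.2, random_state=seed, stratify=labels)
            scores.append(f1_score(y_te, fit_predict(x_tr, y_tr, x_te), average="macro"))
        results[name] = (float(np.mean(scores)), float(np.std(scores)))
    return results


def paired_bootstrap_p(y_true, preds_a, preds_b, n_resamples=10_000, seed=0):
    """One-sided p-value: how often model A's F1-macro advantage over B vanishes when
    the test set is resampled with replacement."""
    rng = np.random.default_rng(seed)
    y_true, preds_a, preds_b = map(np.asarray, (y_true, preds_a, preds_b))
    flips = 0
    for _ in range(n_resamples):
        idx = rng.integers(0, len(y_true), len(y_true))
        delta = (f1_score(y_true[idx], preds_a[idx], average="macro")
                 - f1_score(y_true[idx], preds_b[idx], average="macro"))
        flips += delta <= 0
    return flips / n_resamples
```

Given per-dataset (texts, labels) pairs, `evaluate_datasets` yields scores in the mean/std style reported above, and `paired_bootstrap_p` compares two models' predictions on the same test set.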
Although COVID-19 accelerated the propagation of misinformation and disinformation, these problems are not unique to the current pandemic. The effects of misinformation and disinformation on elections, ethnic biases, and the portrayal of ethnic groups [3] can have similar or even more severe consequences for society than misinformation related to COVID-19. Therefore, identifying false information streams across domains will remain a challenging problem, and identifying which models can generalize across many sources is crucial.

A.1 Dataset Details

COVID-19 Open Research Dataset (CORD-19) [33] is the largest open-source dataset about COVID-19 and coronavirus-related research (e.g., SARS, MERS). CORD-19 is composed of more than 280K scholarly articles from PubMed, bioRxiv, medRxiv, and other resources maintained by the WHO. We use this dataset to extend the general pre-training of selected neural language models (cf. Section 3) towards COVID-specific vocabulary and features.

COVID-19 heAlthcare mIsinformation Dataset (CoAID) [7] focuses on healthcare misinformation, including fake news on websites, user engagement, and social media. CoAID is composed of 5,216 news articles, 296,752 related user engagements, and 958 posts about COVID-19, which are broadly categorized under the labels true and false.

Twitter Stance Dataset (COVID-CQ) [19] is a dataset of user-generated Twitter content in the context of COVID-19. More than 14K tweets were processed and annotated regarding the use of Chloroquine and Hydroxychloroquine as a valid treatment or prevention against the coronavirus. COVID-CQ is composed of 14,374 tweets from 11,552 unique users, labeled as neutral, against, or favor.

ReCOVery [36] explores the low credibility of information on COVID-19 (e.g., the claim that bleach can prevent COVID-19) by allowing a multimodal investigation of news and their spread on social media. The dataset is composed of 2,029 news articles on the coronavirus and 140,820 related tweets, labeled as reliable or unreliable.

CMU-MisCov19 [17] is a Twitter dataset created by collecting posts from unknowingly misinformed users, users who actively spread misinformation, and users who disseminate facts or call out misinformation. CMU-MisCov19 is composed of 4,573 annotated tweets divided into 17 classes (e.g., conspiracy, fake cure, news, sarcasm). The high number of classes and their imbalanced distribution make CMU-MisCov19 a challenging dataset.

COVID19FN is composed of approximately 2,800 news articles, extracted mainly from Poynter, categorized as either real or fake.

A.2 Model Details

General-Purpose Baselines. BERT [8] mainly captures general language characteristics using a bidirectional Masked Language Model (MLM) and a Next Sentence Prediction (NSP) task. RoBERTa [16] improves BERT with additional data, compute budgets, and hyperparameter optimizations. RoBERTa also drops NSP, as it contributes little to the model representation. BART [15] optimizes an auto-regressive forward-product and an auto-encoding MLM objective simultaneously. DeBERTa [11] improves the attention mechanism using a disentanglement of content and position.

Intermediate Pre-Trained. SciBERT [2] optimizes the MLM on 1.14M randomly selected papers from Semantic Scholar.
BioClinicalBERT [1] is specialized on 2M notes from the MIMIC-III database [13], a collection of de-identified clinical data. BERTweet [21] optimizes BERT on 850M tweets, each containing between 10 and 64 tokens.

COVID-19 Intermediate Pre-Trained. COVID-Twitter-BERT (CT-BERT) [20] uses a corpus of 160M tweets for domain-specific pre-training and evaluates the resulting model's capabilities in sentiment analysis, e.g., for tweets about vaccines. BioClinicalBERT [1] fine-tunes BioBERT [14] on clinical narratives to incorporate linguistic characteristics from both the clinical and biomedical domains. Cui et al. [7] propose CoAID and investigate the misinformation detection task by comparing traditional machine learning (e.g., logistic regression, random forest) and deep learning techniques (e.g., GRU). In a similar setup, Zhou et al. [36] compare traditional statistical learners, such as SVM, and neural networks (e.g., CNN) to classify news as credible or not. In both studies, the results show deep learning architectures to be the strongest options.

Pre-Training. We use the data from the abstracts of the CORD-19 dataset for pre-training. For pre-processing the CORD-19 abstract data, we consider only alphanumeric characters. We use a sequence length of 128 tokens, which reduces training time while remaining competitive with longer sequence lengths when fine-tuning [24]. We mask words randomly with a probability of .15, a common configuration for Transformers [8, 11], and perform the MLM with the following remaining parameters: a batch size of 16 for the base models and 8 for the large models, the Adam optimizer (α = 2e-5, β1 = .9, β2 = .999, ε = 1e-8), and a maximum of five epochs. All experiments were performed on a single NVIDIA GeForce GTX 1080 Ti GPU with 11 GB of memory.

Fine-Tuning. The classification model applies a randomly initialized fully-connected layer to the aggregate representation of the underlying Transformer (e.g., [CLS] for BERT) with dropout (p = .1) to learn the annotated target classes with a cross-entropy loss for five epochs and a sequence length of 200 tokens. We use the same optimizer configuration as in pre-training.
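The two stages above can be sketched with the Hugging Face Trainer API as follows. This is a minimal reconstruction from the reported hyperparameters, not the authors' released code; the base checkpoint, placeholder texts, and label count are assumptions.

```python
# Sketch of the reported two-stage setup: continued MLM pre-training on CORD-19 abstracts,
# then fine-tuning for misinformation classification. Placeholder data, the roberta-base
# checkpoint, and num_labels are illustrative; hyperparameters follow the reported values.
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoModelForSequenceClassification,
                          AutoTokenizer, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# Stage 1: masked language modeling on CORD-19 abstracts (sequence length 128, 15% masking).
abstracts = ["SARS-CoV-2 transmission dynamics ..."]  # placeholder for the cleaned abstracts
mlm_data = Dataset.from_dict({"text": abstracts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128,
                            padding="max_length"),
    batched=True, remove_columns=["text"])
mlm_args = TrainingArguments(
    output_dir="covid-roberta", num_train_epochs=5,
    per_device_train_batch_size=16,  # 8 for the large models
    learning_rate=2e-5, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-8)
mlm_trainer = Trainer(
    model=AutoModelForMaskedLM.from_pretrained("roberta-base"), args=mlm_args,
    train_dataset=mlm_data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15))
mlm_trainer.train()
mlm_trainer.save_model("covid-roberta")  # reused as the starting point below

# Stage 2: fine-tuning with a randomly initialized classification head (sequence length 200,
# cross-entropy loss; the head's default dropout of 0.1 matches the reported configuration).
texts, labels = ["hydroxychloroquine cures COVID-19"], [1]  # placeholder task data
clf_data = Dataset.from_dict({"text": texts, "label": labels}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=200,
                            padding="max_length"),
    batched=True, remove_columns=["text"])
clf_args = TrainingArguments(
    output_dir="covid-roberta-finetuned", num_train_epochs=5,
    per_device_train_batch_size=16,
    learning_rate=2e-5, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-8)
clf_model = AutoModelForSequenceClassification.from_pretrained("covid-roberta", num_labels=3)
Trainer(model=clf_model, args=clf_args, train_dataset=clf_data).train()
```

The only difference between the two stages, apart from the objective, is the longer sequence length and the classification head; the optimizer configuration is reused as described.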
References

[1] Publicly Available Clinical BERT Embeddings
[2] SciBERT: A Pretrained Language Model for Scientific Text
[3] Network Propaganda
[4] Enriching Word Vectors with Subword Information
[5] The COVID-19 social media infodemic
[6] ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
[7] CoAID: COVID-19 Healthcare Misinformation Dataset
[8] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
[9] The Hitchhiker's Guide to Testing Statistical Significance in Natural Language Processing
[10] A Global Panel Database of Pandemic Policies (Oxford COVID-19 Government Response Tracker)
[11] DeBERTa: Decoding-enhanced BERT with Disentangled Attention
[12] Universal Language Model Fine-tuning for Text Classification
[13] MIMIC-III, a freely accessible critical care database
[14] BioBERT: a pre-trained biomedical language representation model for biomedical text mining
[15] BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
[16] RoBERTa: A Robustly Optimized BERT Pretraining Approach
[17] Characterizing COVID-19 Misinformation Communities Using a Novel Twitter Dataset
[18] Distributed Representations of Words and Phrases and their Compositionality
[19] A stance data set on polarized conversations on Twitter about the efficacy of hydroxychloroquine as a treatment for COVID-19
[20] COVID-Twitter-BERT: A Natural Language Processing Model to Analyse COVID-19 Content on Twitter
[21] BERTweet: A pre-trained language model for English Tweets
[22] Aspect-based Document Similarity for Research Papers
[23] Fighting COVID-19 Misinformation on Social Media: Experimental Evidence for a Scalable Accuracy-Nudge Intervention
[24] Shortformer: Better Language Modeling using Shorter Inputs
[25] Enhanced Word Embeddings Using Multi-Semantic Representation through Lexical Chains
[26] Multi-sense embeddings through a word sense disambiguation process
[27] Fake News Detection on Social Media: A Data Mining Perspective
[28] Attention is all you need
[29] Identifying Machine-Paraphrased Plagiarism
[30] Are Neural Language Models Good Plagiarists? A Benchmark for Neural Paraphrase Detection
[31] SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
[32] GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
[33] CORD-19: The COVID-19 Open Research Dataset
[34] XLNet: Generalized Autoregressive Pretraining for Language Understanding
[35] How to fight an infodemic
[36] ReCOVery: A Multimodal Repository for COVID-19 News Credibility Research