Transformer based Automatic COVID-19 Fake News Detection System

Sunil Gundapu, Radhika Mamidi

Abstract: Recent rapid technological advancements in online social networks such as Twitter have led to a sharp increase in the spread of false information and fake news. Misinformation is especially prevalent in the ongoing coronavirus disease (COVID-19) pandemic, leading individuals to accept bogus and potentially harmful claims and articles. Quick detection of fake news can reduce the spread of panic and confusion among the public. In this paper, we report a methodology to analyze the reliability of information shared on social media pertaining to the COVID-19 pandemic. Our best approach is based on an ensemble of three transformer models (BERT, ALBERT, and XLNet) for detecting fake news. This model was trained and evaluated in the context of the ConstraintAI 2021 shared task "COVID19 Fake News Detection in English". Our system obtained an F1 score of 0.9855 on the test set and ranked 5th among 160 teams.

I. Introduction

The COVID-19 pandemic is considered the biggest global public health crisis the world has faced since World War II. COVID-19, a contagious disease caused by a coronavirus, had caused more than 75 million confirmed cases and 1.7 million deaths across the world as of December 2020. Unfortunately, misinformation about COVID-19 has fueled both the spread of the disease and chaos among people. At the Munich Security Conference held on February 15, 2020, World Health Organization (WHO) Director-General Tedros Adhanom Ghebreyesus [2] stated that the world was in a war to fight not only a pandemic but also an infodemic. We should therefore address the challenge of fake news detection to stop the spread of COVID-19 misinformation.

Because the pandemic affects everyone, a broad public is seeking information about COVID-19, and their safety is threatened by adversarial agents invested in spreading fake news for economic and political reasons. Moreover, medical and public health information is hard to keep fully valid and factual, and these uncertainties are worsened by fake news. The difficulty is compounded by the rapid advancement of knowledge about the disease: as researchers gain more knowledge about the virus, claims that looked right may turn out to be false, and vice versa. Detecting the spread of COVID-19 related fake news has thus become a pivotal problem, gaining notable attention from government and global health organizations (WHO, 2020), online social networks (TechCrunch, 2020), and news organizations (BBC, 2020; CNN, 2020; New York Times, 2020).

In response to this disinformation, this paper develops an efficient fake news detection architecture for COVID-19. Initially, we developed machine learning (ML) algorithms with Term Frequency-Inverse Document Frequency (TF-IDF) feature vectors to detect misinformation on the provided dataset. These supervised TF-IDF methods are still relevant for many classification tasks and performed quite well for fake news detection. We then developed an effective ensemble model that integrates three transformer models for detecting fake news on social media platforms, which resulted in higher accuracy and a more generalized model.
The rest of this paper is organized as follows. Section II presents prior work related to fake news and its spread on social media platforms. Section III describes the dataset provided in the Constraint AI-2021 shared task. Section IV presents the implemented models and framework for misinformation detection. Section V discusses the results. Finally, we conclude the paper in Section VI.

II. Related Work

Fake News Detection: Fake news can be defined as inaccurate and misleading information that is spread knowingly or unknowingly [3]. Recognizing the spread of false information such as rumors, fake news, propaganda, hoaxes, spear phishing, and conspiracy theories is an essential task for natural language processing [4]. Gartner's research [5] predicted that by 2022 most people in advanced economies would consume more false information than truthful information. To date, many automated misinformation detection architectures have been developed. Rohit et al. [6] provided an extensive survey of fake news detection on various online social networks. Ghorbani et al. [7] presented an inclusive overview of recent studies on misinformation; they described the impact of misleading information, reviewed state-of-the-art fake news detection systems, and explored disinformation detection datasets. The majority of fake news detection models are developed using supervised machine learning algorithms that classify data as misleading or not [8]. This supervised classification is performed by comparing the input text against previously assembled corpora containing genuine and misleading information [9]. Aswini et al. [10] proposed a deep learning architecture with various word embeddings for the Fake News Challenge (FNC-1) dataset, designed to accurately predict the stance between a given pair of news headline and the corresponding article body. On the same FNC-1 dataset, Sean et al. [11] developed TalosComb, a weighted average of TalosCNN and TalosTree: TalosCNN is a convolutional neural network with pre-trained word2vec embeddings, and TalosTree is a gradient-boosted decision tree model with SVD, word-count, and TF-IDF features. By analyzing the relationship between a news headline and the corresponding article, Heejung et al. [12] applied the Bidirectional Encoder Representations from Transformers (BERT) model to detect misleading news articles.

In the case of COVID-19 fake news, a large amount of misleading content remains online on social media platforms. NLP researchers have been working on algorithms for detecting online COVID-19 related disinformation. Any such algorithm requires a corpus, so members of the NLP community have created various fake news datasets: FakeCovid [13], ReCOVery [14], CoAID [15], and CMU-MisCOV19 [16]. Yichuan Li et al. [17] developed the multi-dimensional and multilingual MM-COVID corpus, which covers six languages. Mabrook et al. [18] created a large Twitter dataset related to COVID-19 misinformation and developed an ensemble-stacking model with six machine learning algorithms on it for detecting misinformation. Elhadad et al. [22] constructed a voting ensemble machine learning classifier for fake news detection that uses seven feature extraction techniques and ten machine learning models. Tamanna et al. [20] used the COVIDLIES dataset to detect misinformation by retrieving the misconceptions relevant to Twitter posts.
For COVID-19 fake news detection and fact-checking, Rutvik et al. [19] proposed a two-stage transformer model: the first stage retrieves the most relevant facts about COVID-19 using a novel fact-checking algorithm, and the second stage verifies the level of truth by computing textual entailment. Adapting these classical and hybrid techniques from related work, we developed the COVID-19 fake news detection system presented in this paper.

III. Dataset

The ConstraintAI'21 shared task organizers developed the COVID-19 Fake News Detection in English dataset [21], containing 10,700 data points collected from various online social networks such as Twitter, Facebook, and Instagram. Of the total dataset, 6,420 data points are reserved for training, 2,140 are used for hyperparameter tuning in the validation phase, and the remaining 2,140 social media posts are kept aside for testing. Each split except the test set contains social media posts and their corresponding labels, either real or fake.

Table 1: Fake news dataset information

Table 1 shows the corpus size and label distribution; the labels in each split are roughly balanced. Table 2 shows some examples from the COVID-19 Fake News Detection in English dataset. Figures 1(a) and 1(b) illustrate word clouds of the most frequent words in the real and fake data points after stop word removal. In Figure 1(a), we can see words unique to the real-labeled data points that rarely occur in Figure 1(b), such as "covid19", "discharged", "confirmed", "testing", "indiafightscorona", and "indiawin"; meanwhile, Figure 1(b) shows words that frequently appear in the fake articles but not in the real-labeled data points, including "coronavirus", "kill", "muslim", "hydroxychloroquine", "china", and "facebook post". These frequent words provide useful signals for differentiating real data points from fake ones.

IV. System Description

In this section, we present our transformer-based ensemble model, trained and tuned on the dataset described in the previous section, and compare it with various machine learning (ML) and deep learning (DL) models using different word embeddings. The full code of the system architecture can be found on GitHub.

Data Preprocessing: The aim of this step is to preprocess the input tweet data with NLP techniques and prepare it for feature extraction. Figure 2 shows the detailed data preprocessing pipeline with examples. In the preprocessing step, we forward the tokenized tweet through the pipeline to eliminate noise in the fake news dataset by removing or normalizing unnecessary tokens. The pipeline includes the following steps (a minimal code sketch follows the list):

1. Emoticon conversion: We convert each emoticon in the tweet to text. Example: 😷 → Face with medical mask emoji.
2. Handling of hashtags: We identify hashtag tokens by the pound (#) sign and split them on digits or capital letters. Example: #IndiaFightsCorona → India Fights Corona.
3. Stemming: We remove inflectional morphemes like "ed", "est", "s", and "ing" from the token stem. Example: confirmed → "confirm" + "-ed".
4. Text cleaning: We remove irrelevant data: punctuation marks, digits, and non-ASCII glyphs.
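As referenced above, the following is a minimal sketch of the four-step preprocessing pipeline. It assumes the third-party `emoji` and `nltk` packages; the function name and the exact regular expressions are illustrative choices of ours, not taken from the authors' released code.

```python
import re

import emoji                          # pip install emoji
from nltk.stem import PorterStemmer   # pip install nltk

stemmer = PorterStemmer()

def preprocess_tweet(tweet):
    # 1. Emoticon conversion: replace each emoji with its textual name,
    #    e.g. the face-with-medical-mask emoji -> "face_with_medical_mask".
    text = emoji.demojize(tweet, delimiters=(" ", " "))

    # 2. Hashtag handling: drop '#' and split the tag on capital letters
    #    or digits, e.g. "#IndiaFightsCorona" -> "India Fights Corona".
    def split_hashtag(match):
        return " ".join(re.findall(r"[A-Z][a-z]*|[a-z]+|\d+", match.group(1)))

    text = re.sub(r"#(\w+)", split_hashtag, text)

    # 3. Stemming: reduce inflected tokens to their stems,
    #    e.g. "confirmed" -> "confirm".
    text = " ".join(stemmer.stem(tok) for tok in text.split())

    # 4. Text cleaning: remove punctuation, digits, and non-ASCII glyphs.
    text = text.encode("ascii", errors="ignore").decode()
    return re.sub(r"[^a-zA-Z\s]", " ", text).strip()

print(preprocess_tweet("#IndiaFightsCorona 12 new cases confirmed!"))
```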
Machine Learning Models: To build the best possible system for fake news detection, we began our investigation with traditional NLP approaches: Logistic Regression (LR), Support Vector Machines (SVM), the Passive Aggressive Classifier (PAC), XGBoost, and the Multi-Layer Perceptron (MLP). We studied the results of the above supervised models in combination with three types of word vectors:

1. Word-level, n-gram-level, and character-level TF-IDF vectors with a feature matrix size of 100,000.
2. English GloVe [23] word embeddings of dimension 300.
3. TF-IDF weighted averaging of GloVe embeddings.

The fake news vector for an input tweet is constructed as the TF-IDF weighted average of its word embeddings:

\[ v_{tweet} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{tfidf}(token_i) \cdot \mathrm{GloVe}(token_i) \]

In the above formula, N is the total number of words in the input fake news tweet, and token_i is the i-th token in the input text. After analyzing the results, TF-IDF weighted averaging gave better results than standard TF-IDF, and the supervised machine learning algorithms performed very well on the provided dataset.

Deep Learning Models: In this section, we experiment with deep learning models, which give better results than the traditional classification algorithms.

LSTM: We used the Long Short-Term Memory (LSTM) [24] architecture with two different pre-trained word embeddings, GloVe and FastText [25]. LSTM is a type of Recurrent Neural Network (RNN) that can solve the long-term dependency problem and is well suited to sequence classification. We converted the input data points into word vectors using the pre-trained embeddings and passed these vectors to the LSTM layer. We stacked two LSTM layers one after another with a dropout of 0.25; the LSTM hidden size is 128, and the output of the last time step, treated as the representation of the input data point, is passed to a dense layer for fake news detection.

BiLSTM + Attention: Not all tokens in the input text contribute equally to its representation, so we leverage a word attention mechanism [26] to capture each token's influence on the input data point. We built this attention mechanism on top of BiLSTM layers: the sequence of word vectors is passed through a BiLSTM layer containing one forward and one backward LSTM, the attention mechanism is applied to the BiLSTM output to produce a dense vector, and this dense vector is forwarded to a fully connected network.

CNN: We explored a Convolutional Neural Network (CNN) [27] model for misinformation detection. The model consists of an embedding layer, a convolution layer with three convolutions, a max-pooling layer, and a fully connected network. In the embedding layer, the input texts are converted into an n×d sequence matrix, where n is the length of the input data point and d is the word embedding dimension. In the convolution layer, the sequence matrix is fed through three 1D convolutions with kernel sizes 3, 4, and 5, each with 128 filters. The convolution outputs are max-pooled over time and concatenated to obtain the input data point representation, which is passed to a fully connected network with a softmax output layer.

CNN + BiLSTM: This architecture combines the CNN and bidirectional LSTM models with FastText/GloVe word embeddings. The CNN extracts as much feature information as possible from the input text using convolution layers, and its output becomes the input to the BiLSTM, which processes the sequence in both directions. The sequence of word vectors is forwarded through a convolution of kernel size 3 with 128 filters, the convolution output is passed through a BiLSTM, and the BiLSTM outcome is max-pooled over time and followed by one dense layer and a softmax layer. A sketch of this hybrid model is given below.
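Below is a minimal PyTorch sketch of the CNN + BiLSTM architecture under the hyperparameters stated in the text (kernel size 3, 128 filters, LSTM hidden size 128). The vocabulary size is a placeholder, and the embedding layer is randomly initialized here; in practice it would be loaded with the pre-trained GloVe or FastText vectors.

```python
import torch
import torch.nn as nn

class CNNBiLSTM(nn.Module):
    def __init__(self, vocab_size=30000, emb_dim=300, n_classes=2):
        super().__init__()
        # In practice, initialize from pre-trained GloVe/FastText vectors.
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Convolution of kernel size 3 with 128 filters.
        self.conv = nn.Conv1d(emb_dim, 128, kernel_size=3, padding=1)
        # Bidirectional LSTM over the convolutional feature sequence.
        self.bilstm = nn.LSTM(128, 128, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * 128, n_classes)

    def forward(self, token_ids):                      # (batch, seq_len)
        x = self.embed(token_ids)                      # (batch, seq_len, emb_dim)
        x = torch.relu(self.conv(x.transpose(1, 2)))   # (batch, 128, seq_len)
        out, _ = self.bilstm(x.transpose(1, 2))        # (batch, seq_len, 256)
        pooled, _ = out.max(dim=1)                     # max-pool over time
        return self.fc(pooled)                         # logits over {real, fake}

logits = CNNBiLSTM()(torch.randint(0, 30000, (4, 64)))
print(logits.shape)  # torch.Size([4, 2])
```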
Transformer Models: This section explores both the individual and the ensembled versions of the three transformer models BERT, ALBERT, and XLNet, which outperformed the other ML and DL algorithms. We implemented these models using HuggingFace's PyTorch transformers library, and the hyperparameters of the three models are described in Table 1.

BERT: The BERT implementation has two steps, pre-training and fine-tuning. In the first step, the model is trained on unlabeled data over various pre-training tasks, using a corpus in a particular language or data covering multiple languages. In the second step, all the initialized parameters are fine-tuned using labeled data from specific tasks. We fine-tuned the pre-trained BERT (Base) model for our COVID-19 fake news detection task. The BERT base model contains 12 encoder blocks with 12 bidirectional self-attention heads each, accepts sequences of up to 512 tokens, and emits a sequence of hidden-vector representations. We added one output layer on top of the BERT model to compute the conditional probability over the output classes, fake or real; see Figure 3 for the fine-tuned BERT model.

XLNet: XLNet [29] is an enhanced version of BERT. To understand language context more deeply, XLNet uses Transformer-XL [30] as its feature extraction backbone, which is itself an improvement on the native Transformer. Transformer-XL adds two components to the Transformer used in BERT, a recurrence mechanism and relative positional encoding (RPE), in order to handle long-term dependencies in texts longer than the maximum allowed input length: the recurrence mechanism propagates context between consecutive segments, while RPE carries relative position information between pairs of tokens. The XLNet model was trained on a huge dataset using permutation language modeling, one of the main differences between BERT and XLNet, which uses permutations to capture information from the forward and backward directions at the same time. We used the pre-trained XLNet model from HuggingFace and fine-tuned it with a maximum sequence length of 128 to fit our fake news detection dataset.

ALBERT: Modern language models keep increasing the model size and the number of parameters when pre-training natural language representations. This often yields improvements on many downstream tasks, but in some cases it becomes impractical due to memory limitations and long training times. To address these problems, ALBERT (A Lite BERT) [31], a self-supervised learning model, uses parameter reduction techniques to increase training speed and lower memory consumption. We used ALBERT for our misinformation detection problem, where it achieved better performance than the DL models.

Ensemble Model: We ensembled the three transformer models BERT, ALBERT, and XLNet for better prediction; see Figure 4 for the ensemble model. Our ensemble extracts the softmax probabilities from each of the three transformer models and computes their average. This model performed better than the other models. A minimal inference sketch of the ensemble is given below.
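The following is a minimal inference-time sketch of the softmax-averaging ensemble using the HuggingFace transformers API. The checkpoint names below are the generic pre-trained bases with untrained classification heads, not the authors' fine-tuned weights, and the label order (real, fake) is an assumption; in practice one would load the three checkpoints after fine-tuning them on the shared-task training set.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder checkpoints; swap in the fine-tuned model directories.
CHECKPOINTS = ["bert-base-uncased", "albert-base-v2", "xlnet-base-cased"]
LABELS = ["real", "fake"]  # assumed label order

@torch.no_grad()
def ensemble_predict(text):
    probs = []
    for name in CHECKPOINTS:
        tokenizer = AutoTokenizer.from_pretrained(name)
        model = AutoModelForSequenceClassification.from_pretrained(
            name, num_labels=2).eval()
        inputs = tokenizer(text, truncation=True, max_length=128,
                           return_tensors="pt")
        # Softmax probabilities of this individual model.
        probs.append(torch.softmax(model(**inputs).logits, dim=-1))
    # Average the three distributions and pick the most likely class.
    return LABELS[torch.stack(probs).mean(dim=0).argmax(dim=-1).item()]

print(ensemble_predict("Vaccine shipments arrived in all states today."))
```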
V. Results and Discussion

In this section, we compare the performance of the various machine learning, deep learning, and transformer-based models using several evaluation metrics: precision, recall, weighted F1 score, and accuracy. The results of the experiments on the test set are reported in Table 3. While running the experiments, we observed that some models are good at retrieving prominent features while others have better classification performance. Classical machine learning models with various TF-IDF feature vectors gave approximately baseline results, and the TF-IDF weighted average performed better than the plain TF-IDF vectors. The F1 score of the bidirectional LSTM with attention mechanism comes very close to that of the transformer models. BERT, XLNet, and ALBERT demonstrate better performance than the deep learning models, and the ensemble of the transformer-based models produces the best F1 score of 0.9855 on the test set. Our transformer-based model ranked 5th among 160 teams.

[Table 4: two misclassified test samples with the predictions of BERT, ALBERT, XLNet, and the ensemble. The sample texts are "#BillGates is shocked that America's pandemic response is among the worst in the world." and "We will all come out stronger from this #COVID #pandemic. Just #StaySafeStayHealthy".]

For some problems, ensembling transformer models is very difficult, and sometimes this approach does not perform well. However, the results of the individual transformer models on our dataset are very close, meaning that any one of them could be used for our fake news detection task; this is the major reason behind ensembling the transformer models. Table 4 shows two misclassified test samples. The first test sample's actual label is "real", but only BERT and the ensemble predicted it correctly, while the remaining two models predicted it wrongly. The second sample's true label is "fake", and only XLNet and the ensemble predicted it correctly. The ensemble model predicted correctly in both cases because it averages the BERT, ALBERT, and XLNet softmax probabilities; this is a principal observation in favor of ensembling the transformer models. A small worked illustration of this averaging effect follows.
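As a worked illustration of the point above (with made-up softmax probabilities, not the models' actual outputs): even when two of the three models lean toward the wrong class, a single confident correct model can dominate the average.

```python
import numpy as np

# Hypothetical softmax outputs over (real, fake) for a tweet whose
# true label is "real". Two models vote wrong, one votes right.
bert   = np.array([0.90, 0.10])   # correct and confident
albert = np.array([0.40, 0.60])   # wrong
xlnet  = np.array([0.45, 0.55])   # wrong

avg = (bert + albert + xlnet) / 3
print(avg)                             # ~[0.583, 0.417]
print(["real", "fake"][avg.argmax()])  # "real": the average recovers the label
```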
VI. Conclusion

In this paper, we presented various algorithms to combat the global infodemic, among which the transformer-based algorithms performed better than the others, and we submitted these models to the shared task on COVID-19 fake news detection in English at the ConstraintAI-2021 workshop. Fake news is an increasingly significant and tricky problem to solve, particularly in an unanticipated situation like the COVID-19 pandemic. Leveraging state-of-the-art classical and advanced NLP models can help address the problem of COVID-19 fake news detection and other global health emergencies. In future work, we intend to explore other contextualized embeddings, such as FLAIR and ELMo, for a better fake news detection system.

References

[1] Overview of CONSTRAINT 2021 Shared Tasks: Detecting English COVID-19 Fake News and Hindi Hostile Posts.
[2] The infodemics of COVID-19 amongst healthcare professionals in India.
[3] 'Misinformation? What of it?' Motivations and individual differences in misinformation sharing on social media.
[4] Automated Fact Checking: Task formulations, methods and future directions.
[5] Fake news: What exactly is it, and how can you spot it?
[6] Misinformation Detection on Online Social Media: A Survey.
[7] An overview of online fake news: Characterization, detection, and discussion.
[8] A Benchmark Study on Machine Learning Methods for Fake News Detection.
[9] A Novel Approach for Selecting Hybrid Features from Online News Textual Metadata for Fake News Detection.
[11] Talos Targets Disinformation with Fake News Challenge Victory.
[12] exBAKE: Automatic Fake News Detection Model Based on Bidirectional Encoder Representations from Transformers (BERT).
[13] FakeCovid: A Multilingual Cross-domain Fact Check News Dataset for COVID-19.
[14] ReCOVery: A Multimodal Repository for COVID-19 News Credibility Research.
[15] CoAID: COVID-19 Healthcare Misinformation Dataset.
[16] Characterizing COVID-19 Misinformation Communities Using a Novel Twitter Dataset.
[17] MM-COVID: A Multilingual and Multimodal Data Repository for Combating COVID-19 Disinformation.
[18] Lies Kill, Facts Save: Detecting COVID-19 Misinformation in Twitter.
[19] Two Stage Transformer Model for COVID-19 Fake News Detection and Fact Checking.
[20] COVIDLies: Detecting COVID-19 Misinformation on Social Media.
[21] Fighting an Infodemic: COVID-19 Fake News Dataset.
[22] Detecting Misleading Information on COVID-19.
[23] GloVe: Global Vectors for Word Representation.
[24] Long Short-Term Memory.
[25] Enriching Word Vectors with Subword Information.
[26] Attention Is All You Need.
[27] Deep Learning.
[28] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
[29] XLNet: Generalized Autoregressive Pretraining for Language Understanding.
[30] Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context.
[31] ALBERT: A Lite BERT for Self-supervised Learning of Language Representations.