key: cord-0433478-iqzn0pzw
authors: Das, Sourya Dipta; Basak, Ayan; Dutta, Saikat
title: A Heuristic-driven Ensemble Framework for COVID-19 Fake News Detection
date: 2021-01-10
journal: nan
DOI: nan
sha: ae370b4446d340ba72427d5756ffc6dda9c3e5d8
doc_id: 433478
cord_uid: iqzn0pzw

The significance of social media has increased manifold in the past few decades, as it helps people from even the most remote corners of the world stay connected. With the COVID-19 pandemic raging, social media has become more relevant and widely used than ever before, and along with this there has been a resurgence in the circulation of fake news and tweets that demand immediate attention. In this paper, we describe our Fake News Detection system, which automatically identifies whether a tweet related to COVID-19 is "real" or "fake", as part of the CONSTRAINT COVID-19 Fake News Detection in English challenge. We have used an ensemble model consisting of pre-trained models, which helped us achieve a joint 8th position on the leaderboard with an F1-score of 0.9831 against a top score of 0.9869. After the completion of the competition, we were able to drastically improve our system by incorporating a novel heuristic algorithm based on username handles and link domains in tweets, fetching an F1-score of 0.9883 and achieving state-of-the-art results on the given dataset.

Fake news refers to press that spreads false information and hoaxes through conventional platforms as well as online ones, mainly social media. There has been an increasing interest in fake news on social media due to the political climate prevailing in the modern world [1, 2, 10], as well as several other factors. Detecting misinformation on social media is as important as it is technically challenging. The difficulty is partly due to the fact that even humans cannot accurately distinguish false from true news, mainly because doing so involves tedious evidence collection as well as careful fact-checking. With the advent of technology and the ever-increasing propagation of fake articles on social media, it has become really important to come up with automated frameworks for fake news identification.

In this paper, we describe our system, which performs binary classification on tweets from social media, labeling each as "real" or "fake". We have used transfer learning in our approach, as it has proven to be extremely effective in text classification tasks while reducing training time, since we do not need to train each model from scratch. The primary steps of our approach include text preprocessing, tokenization, model prediction, and ensemble creation using a soft-voting schema. Post evaluation, we drastically improved our fake news detection framework with a heuristic post-processing technique that takes into account important aspects of tweets such as username handles and URL domains. This approach has allowed us to produce much superior results when compared to the top entry on the official leaderboard. We have performed an ablation study of the various attributes used in our post-processing approach, and we also provide examples of tweets where the post-processing approach predicts correctly where the initial classification output did not.

Traditional machine learning approaches have been quite successful in fake news identification problems.
Reis et al. [5] have used feature engineering to generate hand-crafted features such as syntactic and semantic features. The problem was then approached as a binary classification task, where these features were fed into conventional machine learning classifiers such as K-Nearest Neighbors (KNN), Random Forest (RF), Naive Bayes, Support Vector Machine (SVM) and XGBoost (XGB), with RF and XGB yielding quite favourable results. Shu et al. [6] have proposed a novel framework, TriFN, which provides a principled way to model the tri-relationship among publishers, news pieces, and users simultaneously. This framework significantly outperformed the baseline machine learning models as well as erstwhile state-of-the-art frameworks.

With the advent of deep learning, there has been a significant revolution in the field of text classification, and thereby in fake news detection. Karimi et al. [7] have proposed a Multi-Source Multi-class Fake News Detection framework that performs automatic feature extraction using Convolutional Neural Network (CNN) based models and combines the features coming from multiple sources using an attention mechanism, producing much better results than previous approaches that relied on hand-crafted features. Zhang et al. [8] introduced a new diffusive unit model, the Gated Diffusive Unit (GDU), which has been used to build a deep diffusive network model that learns representations of news articles, creators and subjects simultaneously. Ruchansky et al. [9] have proposed a novel CSI (Capture-Score-Integrate) framework that uses a Long Short-Term Memory (LSTM) network to capture the temporal pattern of user activity and a doc2vec [21] representation of a tweet, along with a neural network based user-scoring module, to classify the tweet as real or fake. It emphasizes the value of incorporating three powerful characteristics in the detection of fake news: the tweet content, the user source, and the article response. Monti et al. [10] have shown that social network structure and propagation are important features for fake news detection, implementing a geometric deep learning framework using Graph Convolutional Networks.

Language models: Most of the current state-of-the-art language models are based on the Transformer [12], and they have proven to be highly effective in text classification problems, providing superior results compared to previous state-of-the-art approaches such as Bi-directional LSTM and Gated Recurrent Unit (GRU) based models. These models are trained on huge corpora of data. The introduction of the BERT [13] architecture has transformed the capability of transfer learning in Natural Language Processing, achieving state-of-the-art results on downstream tasks like text classification. RoBERTa [15] is an improved version of BERT: it retains BERT's language-masking strategy but modifies key hyperparameters, removing the next-sentence pre-training objective and training with much larger mini-batches and learning rates, leading to improved performance on downstream tasks. XLNet [16] is a generalized auto-regressive language model. It calculates the joint probability of a sequence of tokens using a transformer architecture with recurrence. Its training objective is to calculate the probability of a word token conditioned on all permutations of word tokens in a sentence, hence capturing a bidirectional context.
XLM-RoBERTa [14] is a transformer-based [12] language model relying on the masked language model objective. DeBERTa [17] improves over the BERT and RoBERTa models using two novel techniques: first, a disentangled attention mechanism, where each word is represented using two vectors that encode its content and position respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions; second, an enhanced mask decoder replaces the output softmax layer to predict the masked tokens during pre-training. ELECTRA [18] is a method for self-supervised language representation learning that can pre-train transformer networks using very low compute; it is trained to distinguish "real" input tokens from "fake" input tokens, such as tokens produced by a small generator network. ERNIE 2.0 [19] is a continual pre-training framework that continuously improves knowledge integration through multi-task learning, enabling the model to learn lexical, syntactic and semantic information from massive data much more effectively.

The dataset [11] for the CONSTRAINT COVID-19 Fake News Detection in English challenge was provided by the organizers on the competition website. It consists of data collected from various social media and fact-checking websites, and the veracity of each post has been verified manually. The "real" news items were collected from verified sources that give useful information about COVID-19, while the "fake" ones were collected from tweets, posts and articles that make speculations about COVID-19 which are verified to be false. The original dataset contains 10,700 social media news items, with a vocabulary size (i.e., unique words) of 37,505, of which 5,141 words are common to both fake and real news. It is class-wise balanced, with 52.34% of the samples being real news and 47.66% fake. There are 880 unique username handles and 210 unique URL domains in the data.

We have approached this task as a text classification problem, where each news item needs to be classified into one of two distinct categories: "real" or "fake". Our proposed method consists of five main parts: (a) Text Preprocessing, (b) Tokenization, (c) Backbone Model Architectures, (d) Ensemble, and (e) Heuristic Post-Processing. The overall architecture of our system is shown in Figure 1. A more detailed description is given in the following subsections.

Some social media items, like tweets, are mostly written in colloquial language. They also contain various other information such as usernames, URLs, emojis, etc. We have filtered out such attributes from the given data as a basic preprocessing step before feeding it into the ensemble model, using the tweet-preprocessor library in Python to remove this noisy information from tweets.

During tokenization, each sentence is broken down into tokens before being fed into a model. We have used a variety of tokenization approaches depending upon the pre-trained model in use, as each model expects tokens to be structured in a particular manner, including the presence of model-specific special tokens. Each model also has a corresponding vocabulary associated with its tokenizer, trained on large corpora such as GLUE, WikiText-103 and CommonCrawl. During training, each model applies its tokenization technique, with its corresponding vocabulary, to our tweet data.
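For illustration, a minimal sketch of this preprocessing and tokenization step, assuming the tweet-preprocessor package and a Hugging Face tokenizer (the option set, checkpoint name and sample tweet are our own illustrative choices, not the authors' released configuration):

```python
import preprocessor as p  # the tweet-preprocessor package
from transformers import AutoTokenizer

# Strip URLs, mentions, emojis, etc. from raw tweets.
# This option set is an assumption, not the authors' verified config.
p.set_options(p.OPT.URL, p.OPT.MENTION, p.OPT.EMOJI, p.OPT.SMILEY)

def preprocess(tweet: str) -> str:
    """Remove noisy attributes before feeding text to the models."""
    return p.clean(tweet)

# Each backbone model has its own tokenizer and vocabulary;
# "roberta-base" is one of the base checkpoints used in the paper.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

encoded = tokenizer(
    preprocess("Check this! https://t.co/xyz @user COVID-19 update"),
    truncation=True,
    max_length=128,        # maximum sequence length used in the experiments
    padding="max_length",
    return_tensors="pt",
)
```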
We have used a combination of XLNet [16], RoBERTa [15], XLM-RoBERTa [14], DeBERTa [17], ERNIE 2.0 [19] and ELECTRA [18] models, and have accordingly used the corresponding tokenizers from the base versions of their pre-trained models.

We have used a variety of pre-trained language models as backbone models for text classification. For each model, an additional fully connected layer is added to its encoder sub-network to obtain prediction probabilities for each class ("real" and "fake") as a prediction vector. We have used transfer learning throughout: each model is initialized with pre-trained weights and then fine-tunes those weights on the tokenized training data. The same tokenizer is used to tokenize the test data, and the fine-tuned model checkpoint is used to obtain predictions during inference.

Ensemble: In this method, we combine the prediction vectors from the different models to obtain our final classification result, i.e. "real" or "fake". To balance out an individual model's limitations, an ensemble method can be useful for a collection of similarly well-performing models. We have experimented with two approaches, soft voting and hard voting, described below.

Soft Voting: In this approach, we calculate a "soft probability score" for each class by averaging the prediction probabilities of the various models for that class. The class with the higher average probability is selected as the final prediction. The probability for the "real" class, $P_r(x)$, and the probability for the "fake" class, $P_f(x)$, for a tweet $x$ are given by

$$P_r(x) = \frac{1}{n}\sum_{i=1}^{n} P_{r_i}(x), \qquad P_f(x) = \frac{1}{n}\sum_{i=1}^{n} P_{f_i}(x),$$

where $P_{r_i}(x)$ and $P_{f_i}(x)$ are the "real" and "fake" probabilities from the $i$-th model and $n$ is the total number of models.

Hard Voting: In this approach, the predicted class label for a news item is the class label that represents the majority of the class labels predicted by the individual models. In other words, the class with the most votes is selected as the final prediction. The votes for the "real" class, $V_r(x)$, and the "fake" class, $V_f(x)$, for a tweet $x$ are given by

$$V_r(x) = \sum_{i=1}^{n} I\big(P_{r_i}(x) > P_{f_i}(x)\big), \qquad V_f(x) = \sum_{i=1}^{n} I\big(P_{f_i}(x) > P_{r_i}(x)\big),$$

where $I(a)$ is 1 if condition $a$ is satisfied and 0 otherwise.
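A minimal sketch of the two voting schemes over per-model prediction vectors (NumPy-based; the array layout, example probabilities and model attributions in the comments are our own assumptions):

```python
import numpy as np

# probs[i] = [P_real, P_fake] predicted by the i-th fine-tuned model for one tweet.
probs = np.array([
    [0.91, 0.09],   # e.g. RoBERTa
    [0.78, 0.22],   # e.g. XLNet
    [0.45, 0.55],   # e.g. DeBERTa
])

def soft_vote(probs: np.ndarray) -> str:
    """Average class probabilities across models; pick the larger mean."""
    mean = probs.mean(axis=0)            # [P_r(x), P_f(x)]
    return "real" if mean[0] > mean[1] else "fake"

def hard_vote(probs: np.ndarray) -> str:
    """Each model casts one vote for its argmax class; the majority wins."""
    votes_real = (probs[:, 0] > probs[:, 1]).sum()   # V_r(x)
    return "real" if votes_real > len(probs) / 2 else "fake"

print(soft_vote(probs), hard_vote(probs))  # -> real real
```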
Heuristic Post-Processing: In this approach, we augment our original framework with a heuristic that takes into account the effect of username handles and URL domains present in some data, like tweets. The heuristic applies to items containing URL domains or username handles; for texts lacking these attributes, we rely only on the ensemble model predictions. We create a new feature-set using these attributes. Our basic intuition is that username handles and URL domains are very important aspects of a tweet and can convey reliable information regarding its genuineness. We incorporate the effect of these attributes alongside our original ensemble model predictions by calculating probability vectors corresponding to both of them, using the frequency of each class for each attribute in the training set. In our experiments, we observed that soft voting works better than hard voting; hence, our post-processing step takes the soft-voting prediction vectors into account. The steps taken in this approach are as follows:

- First, we obtain the class-wise probabilities from the best performing ensemble model. These probability values form two features of our new feature-set.
- We collect username handles from all the news items in our training data and calculate how many times the ground truth is "real" or "fake" for each username.
- We calculate the conditional probability of a particular username indicating a real news item as

$$P_r(x|\text{username}) = \frac{n(A)}{n(A) + n(B)},$$

where $n(A)$ is the number of "real" news items containing the username and $n(B)$ is the number of "fake" news items containing the username. Similarly, the conditional probability of a particular username indicating a fake news item is given by

$$P_f(x|\text{username}) = \frac{n(B)}{n(A) + n(B)}.$$

We obtain two probability vectors that form four additional features of our new dataset.
- We collect URL domains from all the news items in our training data, obtained by expanding the shortened URLs associated with the tweets, and calculate how many times the ground truth is "real" or "fake" for each domain.
- We calculate the conditional probability of a particular URL domain indicating a real news item as

$$P_r(x|\text{domain}) = \frac{n(P)}{n(P) + n(Q)},$$

where $n(P)$ is the number of "real" news items containing the domain and $n(Q)$ is the number of "fake" news items containing the domain. Similarly, the conditional probability of a particular domain indicating a fake news item is given by

$$P_f(x|\text{domain}) = \frac{n(Q)}{n(P) + n(Q)}.$$

We obtain two probability vectors that form the final two additional features of our new dataset.
- In case there are multiple username handles or URL domains in a sentence, the final probability vectors are obtained by averaging the vectors of the individual attributes.
- At this point, we have new training, validation and test feature-sets built from the class-wise probability vectors of the ensemble model outputs as well as the probability values obtained from the username handles and URL domains in the training data. We apply a novel heuristic algorithm to this resulting feature-set to obtain our final class predictions.

Table 1 shows some samples of the conditional probability values of each label class given each of the two attributes, URL domain and username handle, along with the frequency of those attributes in the training data. The details of the heuristic algorithm are given in the following pseudocode (Algorithm 1); in our experiments, the threshold value used is 0.88. The post-processing architecture is shown in Figure 2.

Algorithm 1 — Result: label ("real" or "fake")

    if P_r(x|username) > threshold AND P_r(x|username) > P_f(x|username) then
        label = "real"
    else if P_f(x|username) > threshold AND P_r(x|username) < P_f(x|username) then
        label = "fake"
    else if P_r(x|domain) > threshold AND P_r(x|domain) > P_f(x|domain) then
        label = "real"
    else if P_f(x|domain) > threshold AND P_r(x|domain) < P_f(x|domain) then
        label = "fake"
    else if P_r(x) > P_f(x) then
        label = "real"
    else
        label = "fake"
    end if

We have fine-tuned our pre-trained models using the AdamW [20] optimizer and cross-entropy loss after label-encoding the target values, and we apply softmax on the logits produced by each model to obtain the prediction probability vectors. The experiments were performed on a system with 16 GB RAM and a 2.2 GHz Quad-Core Intel Core i7 processor, along with a Tesla T4 GPU, using a batch size of 32. The maximum input sequence length was fixed at 128, the initial learning rate was set to 2e-5, and the number of epochs varied from 6 to 15 depending on the model.
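A minimal sketch of this fine-tuning setup for a single backbone (PyTorch and Hugging Face transformers; the checkpoint name, label encoding and loop structure are illustrative assumptions rather than the authors' exact code):

```python
import torch
from torch.nn import CrossEntropyLoss
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification

# Binary classification head on top of a pre-trained encoder ("real" vs "fake").
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
optimizer = AdamW(model.parameters(), lr=2e-5)  # initial learning rate from the paper
loss_fn = CrossEntropyLoss()

def train_step(input_ids, attention_mask, labels):
    """One optimization step on a batch of 32 tokenized tweets."""
    model.train()
    optimizer.zero_grad()
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    loss = loss_fn(logits, labels)  # labels are label-encoded; 0 = real, 1 = fake (our choice)
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def predict_probs(input_ids, attention_mask):
    """Softmax over the logits yields the per-class prediction vector."""
    model.eval()
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    return torch.softmax(logits, dim=-1)  # columns: [P_real, P_fake] under our encoding
```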
We have used each fine-tuned model individually to perform "real" vs "fake" classification; quantitative results are tabulated in Table 2. We can see that XLM-RoBERTa, RoBERTa, XLNet and ERNIE 2.0 perform really well on the validation set; however, RoBERTa produces the best classification results on the test set.

We tried out different combinations of pre-trained models with both ensemble techniques, soft voting and hard voting. Performance figures for the different ensembles are shown in Tables 3 and 4. From the results, we can infer that the ensemble models significantly outperform the individual models, and that the soft-voting ensemble method performed better overall than the hard-voting one. Among the hard-voting ensembles, the model consisting of RoBERTa, XLM-RoBERTa, XLNet, ERNIE 2.0 and DeBERTa performed best on both the validation and test sets. Among the soft-voting ensembles, the one consisting of RoBERTa, XLM-RoBERTa, XLNet, ERNIE 2.0 and ELECTRA achieved the best overall accuracy on the validation set, while the combination of XLNet, RoBERTa, XLM-RoBERTa and DeBERTa produced the best overall classification result on the test set. Our system achieved an overall F1-score of 0.9831, securing a joint 8th rank on the leaderboard against a top score of 0.9869. We then augmented our Fake News Detection System with the additional heuristic algorithm and achieved an overall F1-score of 0.9883, making our approach state-of-the-art on the given fake news dataset [11]. We used the best performing ensemble model, consisting of RoBERTa, XLM-RoBERTa, XLNet and DeBERTa, for this approach. Table 5 compares the test-set results obtained by our model before and after applying the post-processing technique against the top 3 teams on the leaderboard. Table 6 shows a few examples where the post-processing algorithm corrects the initial prediction: the first example is corrected because of the extracted domain, "news.sky", and the second because of the presence of the username handle "@drsanjaygupta".

We have performed an ablation study by assigning various levels of priority to each of the features (username and domain) and checking which class's probability value for that feature is maximum for a particular tweet, so that the corresponding "real" or "fake" label can be assigned to that tweet. For example, in one iteration we gave URL domains a higher priority than username handles when selecting the label class. We have also experimented with each attribute in isolation. Results for the different priorities and feature sets are shown in Table 7. Another important parameter we have introduced is a threshold on the class-wise probability values for the features: for example, if the probability that a particular username in a tweet belongs to the "real" class is greater than that of it belonging to the "fake" class, and is also greater than a specific threshold, we assign a "real" label to the tweet. The value of this threshold is a hyperparameter tuned on classification accuracy on the validation set. We have summarized the results of our study, with and without the threshold parameter, in Table 7. As we can observe from the results, the domain attribute plays a significant role in ensuring a better classification result when the threshold parameter is taken into account. The best results are obtained when we consider the threshold parameter together with both the username and domain attributes, with higher importance given to the username.
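For concreteness, the following sketch shows the thresholded decision rule used in these ablations with username handles given priority over URL domains, mirroring Algorithm 1 (the function and variable names are ours; 0.88 is the threshold value reported above):

```python
def heuristic_label(p_user, p_dom, p_ens, threshold=0.88):
    """Decide "real"/"fake" from (P_real, P_fake) pairs for the username prior,
    the domain prior, and the ensemble prediction, in that priority order.
    A prior may be None when the tweet lacks that attribute."""
    for prior in (p_user, p_dom):          # username is checked before domain
        if prior is None:
            continue
        p_real, p_fake = prior
        if p_real > threshold and p_real > p_fake:
            return "real"
        if p_fake > threshold and p_fake > p_real:
            return "fake"
    # Fall back to the soft-voting ensemble prediction.
    return "real" if p_ens[0] > p_ens[1] else "fake"

# Example: a trusted username overrides a borderline ensemble prediction.
print(heuristic_label(p_user=(0.97, 0.03), p_dom=None, p_ens=(0.48, 0.52)))  # -> real
```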
In this paper, we have proposed a robust framework for the identification of fake tweets related to COVID-19, which can go a long way in eliminating the spread of misinformation on such a sensitive topic. In our initial approach, we tried out various pre-trained language models. Our results improved significantly when we implemented a soft-voting ensemble using the prediction vectors from various combinations of these models. Furthermore, we augmented our system with a novel heuristics-based post-processing algorithm that drastically improved fake tweet detection accuracy, making our system state-of-the-art on the given dataset. Our heuristic approach shows that username handles and URL domains are highly informative features of tweets, and that analyzing them carefully is key to building a robust framework for fake news detection. Finally, we would like to pursue further research into how other pre-trained models and their combinations perform on this dataset. It would also be interesting to evaluate how our system performs on other generic fake news datasets, and how different values of the threshold parameter in our post-processing step affect its overall performance.

References:
[1] Social media, political polarization, and political disinformation: A review of the scientific literature.
[2] Political ideology predicts perceptions of the threat of COVID-19 (and susceptibility to fake news about it).
[3] Detecting deceptive opinions with profile compatibility.
[4] Linguistic Traces of a Scientific Fraud: The Case of Diederik Stapel.
[5] Supervised learning for fake news detection.
[6] Beyond news contents: The role of social context for fake news detection.
[7] Multi-source multi-class fake news detection.
[8] FakeDetector: Effective fake news detection with deep diffusive neural network.
[9] CSI: A hybrid deep model for fake news detection.
[10] Fake news detection on social media using geometric deep learning.
[11] Fighting an Infodemic: COVID-19 Fake News Dataset.
[12] Advances in Neural Information Processing Systems.
[13] BERT: Pre-training of deep bidirectional transformers for language understanding.
[14] Unsupervised cross-lingual representation learning at scale.
[15] RoBERTa: A robustly optimized BERT pretraining approach.
[16] XLNet: Generalized autoregressive pretraining for language understanding.
[17] DeBERTa: Decoding-enhanced BERT with Disentangled Attention.
[18] ELECTRA: Pre-training text encoders as discriminators rather than generators.
[19] ERNIE 2.0: A Continual Pre-Training Framework for Language Understanding.
[20] Decoupled weight decay regularization.
[21] Distributed representations of sentences and documents.
[22] Overview of CONSTRAINT 2021 Shared Tasks: Detecting English COVID-19 Fake News and Hindi Hostile Posts.