authors: Das, Sourya Dipta; Basak, Ayan; Dutta, Saikat
title: A Heuristic-driven Uncertainty based Ensemble Framework for Fake News Detection in Tweets and News Articles
date: 2021-04-05

The significance of social media has increased manifold in the past few decades, as it helps people from even the most remote corners of the world to stay connected. With the advent of technology, digital media has become more relevant and widely used than ever before, and along with this there has been a resurgence in the circulation of fake news and tweets that demands immediate attention. In this paper, we describe a novel Fake News Detection system that automatically identifies whether a news item is "real" or "fake", as an extension of our work in the CONSTRAINT COVID-19 Fake News Detection in English challenge. We have used an ensemble model consisting of pre-trained models followed by a statistical feature fusion network, along with a novel heuristic algorithm that incorporates various attributes present in news items or tweets, such as source, username handles, URL domains and authors, as statistical features. Our proposed framework also quantifies reliable predictive uncertainty, along with proper class output confidence levels, for the classification task. We have evaluated our results on the COVID-19 Fake News dataset and the FakeNewsNet dataset to show the effectiveness of the proposed algorithm in detecting fake news in short news content as well as in news articles. We obtained a best F1-score of 0.9892 on the COVID-19 dataset, and an F1-score of 0.9073 on the FakeNewsNet dataset.

Fake news refers to media content that is used to spread false information and hoaxes through conventional platforms as well as online ones, mainly social media. There has been increasing interest in fake news on social media due to the political climate prevailing in the modern world [1, 2, 3], as well as several other factors. Detecting misinformation on social media is as important as it is technically challenging. The difficulty is partly due to the fact that even humans cannot accurately distinguish false from true news, mainly because doing so involves tedious evidence collection as well as careful fact-checking. With the advent of technology and the ever-increasing propagation of fake articles on social media, it has become really important to come up with automated frameworks for fake news identification. In this paper, we describe our system, which performs binary classification on news items from social media and classifies them as "real" or "fake". We have used transfer learning in our approach, as it has proven to be extremely effective in text classification tasks, and it reduces training time since we do not need to train each model from scratch. The primary steps of our approach include text preprocessing, tokenization, model prediction, and ensemble creation using a soft-voting schema. After completion of the competition, we have drastically improved our fake news detection framework with a Statistical Feature Fusion Network (SFFN) with uncertainty estimation, followed by a heuristic post-processing technique, where both components take into account the effect of important attributes of news items or tweets, like username handles, URL domains, news sources, authors, etc., as statistical features.
This approach has allowed us to produce results that are much superior to those of other models on their respective datasets. We have also provided a performance analysis of predictive uncertainty quality with proper metrics, and have shown the improvement in overall performance and robustness of the SFFN through an ablation study. We have additionally performed an ablation study of the various attributes used in our post-processing approach.

Our algorithm is also applicable to the detection of fake news in long news articles. In this context, we have evaluated the performance of our approach on the FakeNewsNet dataset [4]. Along with the news titles, we have also utilized the actual news body (document) in this case. We have used a BERT-inspired Longformer [5] network, which we trained on news articles for the classification task. We denote this model as NewsBERT in this paper. NewsBERT is used on the news articles to obtain prediction vectors, which can be used as additional features for our model. After that, we have applied the same pipeline, consisting of the SFFN and the heuristic post-processing module, to boost our performance on the FakeNewsNet dataset. Using these additional features and modules, we have observed an absolute improvement of 9.56% in overall accuracy and F1-score over the current state-of-the-art model on the FakeNewsNet dataset. We have also quantified the model uncertainty in the fake news classification task for both datasets.

Traditional machine learning approaches have been quite successful in the fake news identification problem. Reis et al. [6] have used feature engineering to generate handcrafted features such as syntactic and semantic features. The problem was then approached as a binary classification problem in which these features were fed into conventional machine learning classifiers like K-Nearest Neighbors (KNN) [7], Random Forest (RF) [8], Naive Bayes [9], Support Vector Machine (SVM) [10] and XGBoost (XGB) [11], of which RF and XGB yielded quite favourable results. Shu et al. [12] have proposed a novel framework, TriFN, which provides a principled way to model the tri-relationship among publishers, news pieces, and users simultaneously. This framework significantly outperformed the baseline machine learning models as well as erstwhile state-of-the-art frameworks on an early version of the FakeNewsNet dataset [13].

With the advent of deep learning, there has been a significant revolution in the field of text classification, and thereby in fake news detection. Karimi et al. [14] have proposed a Multi-Source Multi-class Fake News Detection framework that can perform automatic feature extraction using Convolutional Neural Network (CNN) based models and combine features coming from multiple sources using an attention mechanism, which has produced much better results than previous approaches involving hand-crafted features. Zhang et al. [15] introduced a new diffusive unit model, namely the Gated Diffusive Unit (GDU), which has been used to build a deep diffusive network model to learn the representations of news articles, creators and subjects simultaneously. Ruchansky et al. [16] have proposed a novel Capture-Score-Integrate (CSI) framework that uses a Long Short-Term Memory (LSTM) network to capture the temporal spacing of user activity and a doc2vec [17] representation of a tweet, along with a neural network based user scoring module, to classify the tweet as real or fake.
The CSI framework emphasizes the value of incorporating three powerful characteristics in the detection of fake news: the tweet content, the user source, and the article response. Monti et al. [3] have shown that social network structure and propagation are important features for fake news detection, by implementing a geometric deep learning framework based on Graph Convolutional Networks. Julio et al. [18] have used a supervised approach for fake news classification using hand-crafted features like linguistic, lexical, psycholinguistic and semantic features, as well as news source and environmental features. They have applied traditional machine learning models such as KNN, Naive Bayes, Random Forest, SVM and XGBoost to this data, of which Random Forest and XGBoost have achieved the best results. Zellers et al. [19] have introduced a novel fake news generation model, GROVER, which has a GPT-like architecture. It has the capability to generate very realistic fake news items in a controlled manner, including various associated meta-information like title, news source, publication date, author list, etc. GROVER also outperforms other deep pre-trained models in discriminating between real and fake news articles; hence, it is a powerful model for both fake news generation and detection.

Bang et al. [20] have tried to develop a robust model for fake news detection that can generalize across different test sets. They have shown their results by performing experiments on two different test sets: FakeNews-19 and Tweets-19. In one approach, they have fine-tuned transformer-based language models using robust loss functions, which did not improve the F1-score on the FakeNews-19 dataset by much compared to the traditional cross-entropy loss; however, it showed better generalization on the Tweets-19 dataset. They have also performed influence-based data cleansing, which has improved model robustness and adaptability. Shu et al. [21] have proposed an automated fake news detection framework, dEFEND, which uses a deep hierarchical co-attention network that takes into account the news items and user comments, and provides a classification output along with viable explanations. Felber [22] has analyzed the performance of some classical machine learning models using several linguistic features such as n-grams, readability, emotional tone and punctuation, along with various preprocessing techniques like stop word removal, stemming/lemmatization and link removal. Shushkevich et al. [23] have used an ensemble technique consisting of Bidirectional LSTM (Bi-LSTM), SVM, Logistic Regression and Naive Bayes models. Their combination of Logistic Regression and Naive Bayes has produced results that are within 5% of state-of-the-art results on the given dataset. Sharif et al. [24] have tried out various techniques like SVM, CNN, Bi-LSTM, and CNN+BiLSTM with tf-idf and Word2Vec embedding techniques, of which SVM with tf-idf features has produced the best results. Gautam et al. [25] have proposed a solution in which they combine topical distributions obtained using Latent Dirichlet Allocation (LDA) with contextualized representations obtained using XLNet. These features are then passed through a 2-layer feed-forward neural network to obtain the final classification output. Li et al. [26] have proposed an ensemble model consisting of various pre-trained models like BERT, RoBERTa, ERNIE, etc., using five-fold five-model cross-validation. Their pseudo-label algorithm has also been able to improve overall model performance. Bilal et al. [27]
have tried to model the flow of affective information in longer news articles using their framework, FakeFlow. They have evaluated their framework on four real-world datasets and have achieved state-of-the-art results, thereby underscoring the importance of affective information in texts.

Most of the current state-of-the-art language models are based on the Transformer [28], and they have proven to be highly effective in text classification problems. They provide superior results compared to previous state-of-the-art approaches using techniques like Bi-directional LSTM and Gated Recurrent Unit (GRU) based models. Hence, we discuss a few state-of-the-art transformer-based language models in this section. The introduction of the BERT [29] architecture has transformed the capability of transfer learning in Natural Language Processing. It has been able to achieve state-of-the-art results on downstream tasks like text classification. RoBERTa [30] is an improved version of the BERT model. It builds on BERT's language-masking strategy while modifying key hyperparameters, including removing BERT's next-sentence pre-training objective and training with much larger mini-batches and learning rates, leading to improved performance on downstream tasks. XLNet [31] is a generalized auto-regressive language model. It calculates the joint probability of a sequence of tokens based on a transformer architecture with recurrence. Its training objective is to calculate the probability of a word token conditioned on all permutations of word tokens in a sentence, hence capturing a bidirectional context. XLM-RoBERTa [32] is a transformer-based [28] language model relying on the Masked Language Model objective. DeBERTa [33] provides an improvement over the BERT and RoBERTa models using two novel techniques: first, a disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions; and second, an enhanced mask decoder that replaces the output softmax layer to predict the masked tokens during pre-training. ELECTRA [34] is used for self-supervised language representation learning. It can be used to pre-train transformer networks using comparatively little compute, and is trained to distinguish "real" input tokens from "fake" input tokens, such as tokens produced by artificial neural networks. ERNIE 2.0 [35] is a continual pre-training framework that continuously improves knowledge integration through multi-task learning, enabling it to learn various lexical, syntactic and semantic information from massive data much more effectively.

Model uncertainty is a very important concept that is related to the model parameters. In order to capture model uncertainty, a prior distribution needs to be assigned over each weight in a neural network. Gal et al. [36] have developed a new theoretical framework casting dropout training in deep neural networks (NNs) as approximate Bayesian inference in deep Gaussian processes. They have shown that a neural network with arbitrary depth and non-linearities can be analogous to a probabilistic deep Gaussian process when dropout is applied before every weight layer. This theory provides tools to model uncertainty with dropout NNs, and shows a considerable improvement in predictive log-likelihood and Root Mean Squared Error (RMSE) compared to existing state-of-the-art methods. Lakshminarayanan et al. [37]
have proposed a novel approach to estimate predictive uncertainty using ensembles of deep neural networks. This approach produced superior results compared to traditional Bayesian Neural Networks, with the added advantages of being readily parallelizable and requiring less hyperparameter tuning. It also takes data uncertainty into account, as it produces higher uncertainty values for out-of-distribution examples.

In the fake news detection task, uncertainty estimation is a very important aspect, since it improves the reliability and safety of the system. It gives us an estimate of how far we can trust a system, and thus increases the interpretability of the system's output. In the case of fake news detection, it is extremely important to have a system that is both robust and reliable. If visibly benign texts are constantly flagged as fake, the credibility of the system is reduced. Similarly, if the system fails to identify a lot of fake news items, the scenario becomes dangerous. Hence, uncertainty estimation can give the user some idea of the system's fault tolerance level.

We have used two datasets to train and evaluate our approach; both have the attributes that we require to extract statistical features. The dataset [38] for the CONSTRAINT COVID-19 Fake News Detection in English challenge was provided by the organizers on the competition website. It consists of data collected from various social media and fact-checking websites, and the veracity of each post has been verified manually. The "real" news items were collected from verified sources which give useful information about COVID-19, while the "fake" ones were collected from tweets, posts and articles which make speculations about COVID-19 that are verified to be false.

We have also evaluated the performance of our fake news detection system on the FakeNewsNet dataset [4], which consists of two datasets with news content, social context, and spatiotemporal information: PolitiFact and GossipCop. In PolitiFact, political news items are reviewed by journalists and domain experts, who provide fact-checking evaluation results to claim news articles as fake or real. GossipCop is a website for fact-checking entertainment stories aggregated from various media outlets. GossipCop provides rating scores on a scale of 0 to 10 to classify a news story by its degree from fake to real. Most news items on GossipCop have a rating of less than 5, which aligns with its purpose of showcasing more fake stories. In order to collect real entertainment news items, the E! Online website, a well-known and trusted media outlet for publishing entertainment news, is crawled. The articles from E! Online are considered real news articles, while the ones from GossipCop are considered fake. The original dataset consists of 16817 real and 5323 fake news items from GossipCop, and 624 real and 432 fake news items from PolitiFact. However, Twitter's policy of removing certain fake news items from time to time has prevented us from obtaining the entire dataset. We were able to crawl 15151 real and 5323 fake news items from GossipCop, and 610 real and 401 fake news items from PolitiFact. The total number of unique news websites/news sources available in this dataset is 2244, and the total number of unique authors available is 4616. The number of unique keywords used in the news articles is 6882.
We have done an 80-10-10 split of the data into training, validation and test sets. Our goal in this paper is to design a common fake news classification pipeline for both tweets and news articles. For this method, we have used some easily available meta-data of tweets or news items to boost the performance of the framework. We also provide an uncertainty value along with each prediction, to make this framework suitable for active learning as well as for solving domain adaptation related problems.

Some social media items, like tweets, are mostly written in colloquial language. Also, they contain various other information like usernames, URLs, emojis, etc. We have filtered out such attributes from the given data as a basic preprocessing step, before feeding it into the ensemble model. For tweets, we have used the tweet-preprocessor library from Python to filter out such noisy information. For news articles, we have removed any usernames and URLs from Instagram, Facebook, Twitter, etc.

During tokenization, each sentence is broken down into tokens before being fed into a model. We have used a variety of tokenization approaches depending upon the pre-trained model used, as each model expects tokens to be structured in a particular manner, including the presence of model-specific special tokens. Each model also has a corresponding vocabulary associated with its tokenizer, trained on large corpora such as GLUE, WikiText-103 and CommonCrawl. During training, each model applies its tokenization technique, with the corresponding vocabulary, to our news data. We have used a combination of BERT [29], XLNet [31], RoBERTa [30], XLM-RoBERTa [32], DeBERTa [33], ERNIE 2.0 [35] and ELECTRA [34] models, and have accordingly used the corresponding tokenizers from the base versions of their pre-trained models.

We have used a variety of pre-trained language models as backbone models for text classification. For each model, an additional fully connected layer is added to its respective encoder sub-network to obtain prediction probabilities for each class, "real" and "fake", as a prediction vector. We have used transfer learning for this problem: each model is initialized with pre-trained weights and then fine-tuned on the tokenized training data. The same tokenizer is used to tokenize the test data, and the fine-tuned model checkpoint is used to obtain predictions during inference.

In this method, we use the model prediction vectors obtained from inference on the news titles with the different models to obtain our final classification result, i.e. "real" or "fake". Our main motivation behind using an ensemble of various fine-tuned pre-trained language models is to utilize the knowledge extracted by the respective models.

In the soft-voting approach, we calculate a "soft probability score" for each class by averaging out the prediction probabilities of the various models for that class. The class with the higher average probability value is selected as the final prediction class. The probability of the "real" class, $P_r(x)$, and the probability of the "fake" class, $P_f(x)$, for a tweet $x$ are given by

$$P_r(x) = \frac{1}{n} \sum_{i=1}^{n} P_{r_i}(x), \qquad P_f(x) = \frac{1}{n} \sum_{i=1}^{n} P_{f_i}(x)$$

where $P_{r_i}(x)$ and $P_{f_i}(x)$ are the "real" and "fake" probabilities from the $i$-th model, and $n$ is the total number of models.

In the hard-voting approach, the predicted class label for a news item is the class label that represents the majority of the class labels predicted by the individual models. In other words, the class with the highest number of votes is selected as the final prediction class. The votes for the "real" class, $V_r(x)$, and the "fake" class, $V_f(x)$, for a tweet $x$ are given by

$$V_r(x) = \sum_{i=1}^{n} I\big(P_{r_i}(x) > P_{f_i}(x)\big), \qquad V_f(x) = \sum_{i=1}^{n} I\big(P_{f_i}(x) > P_{r_i}(x)\big)$$

where $I(a)$ is 1 if condition $a$ is satisfied and 0 otherwise.
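For concreteness, a minimal NumPy sketch of both voting schemes (the array shapes and toy probabilities are illustrative assumptions, not values from the paper):

```python
# Minimal sketch of the soft- and hard-voting schemes described above.
# `probs` holds per-model class probabilities; shapes/values are illustrative.
import numpy as np

def soft_vote(probs: np.ndarray) -> int:
    """probs: (n_models, 2) array of [P_real, P_fake] rows.
    Returns 0 for "real", 1 for "fake"."""
    avg = probs.mean(axis=0)          # average probability per class
    return int(np.argmax(avg))

def hard_vote(probs: np.ndarray) -> int:
    """Each model votes for its own argmax class; the majority wins."""
    votes = np.argmax(probs, axis=1)  # per-model predicted class
    return int(np.bincount(votes, minlength=2).argmax())

# Example: three models scoring one tweet
probs = np.array([[0.90, 0.10], [0.60, 0.40], [0.45, 0.55]])
print(soft_vote(probs), hard_vote(probs))  # -> 0 (real), 0 (real)
```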
Our basic intuition behind using statistical features is that meta-attributes like username handles, URL domains, news source, news author, etc. are very important aspects of a news item, and they can convey reliable information regarding the genuineness of such items. We have tried to incorporate the effect of these attributes along with our original ensemble model predictions. We have calculated probability values corresponding to each of the attributes, for example the probability of a username handle or URL domain indicating a fake news item, and added them to our feature set. We have used information about the frequency of each class for each of these attributes in the training set to compute these probability values. In our experiments, we observed that soft voting works better than hard voting; hence our post-processing step takes the soft-voting prediction vectors into account. The steps taken in this approach are described as follows:
• First, we obtain the class-wise probabilities from the best performing ensemble model. These probability values form two features of our new feature set.
• We collect all distinct values of a particular attribute from all the news items in our training data, and calculate how many times the ground truth is "real" or "fake" for this attribute.
• We calculate the conditional probability of this particular attribute indicating a real news item, which is represented as follows:

$$P(\text{real} \mid \text{attribute}_k) = \frac{n(A)}{n(A) + n(B)}$$

where n(A) = number of "real" news items containing the attribute k, n(B) = number of "fake" news items containing the attribute k, and k = 1, 2, ..., n. In our case, attribute 1 is the URL domain for the COVID-19 Fake News dataset and the news author for FakeNewsNet, and attribute 2 is the username handle for the COVID-19 Fake News dataset and the news source for FakeNewsNet. Similarly, the conditional probability of the particular attribute indicating a fake news item is given by

$$P(\text{fake} \mid \text{attribute}_k) = \frac{n(B)}{n(A) + n(B)}$$

We thus obtain a probability vector that forms two additional features of our new dataset.
• Similarly, we collect all other relevant attributes from all the news items in our training data, and calculate how many times the ground truth is "real" or "fake" for each one. This enables us to compute a two-dimensional prediction vector for each new attribute, which can be appended to our current feature set.

This approach enables us to create two types of feature sets: one using the 10 raw prediction values from the individual models of the ensemble together with the statistical features, and one using the soft-voting ensemble prediction vector together with the same statistical features. To estimate uncertainty, we apply Monte Carlo dropout in the SFFN, keeping dropout active at inference time and performing multiple stochastic forward passes for each input to obtain a sample of predictions. We get an ensemble prediction by calculating the mean ($\mu_x$) and variance ($\sigma^2_x$) of this sample, which would be the mean of the model's posterior distribution for this sample and an approximation of the model's uncertainty:

$$v_p = \mu_x, \qquad c_u = \sigma^2_x$$

Here, $v_p$ is the predictive posterior mean and $c_u$ is the model uncertainty.
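A minimal PyTorch sketch of this Monte Carlo dropout estimate follows (the stand-in network, its 14-dimensional input, the dropout rate and the number of passes are illustrative assumptions, not the authors' exact SFFN configuration):

```python
# Minimal sketch: MC dropout over a feature-fusion classifier.
# Network shape and number of passes (T) are illustrative assumptions.
import torch
import torch.nn as nn

sffn = nn.Sequential(            # stand-in for the SFFN
    nn.Linear(14, 64), nn.ReLU(), nn.Dropout(p=0.3),
    nn.Linear(64, 2), nn.Softmax(dim=-1),
)

def mc_dropout_predict(model: nn.Module, x: torch.Tensor, T: int = 50):
    """Return predictive mean (v_p) and variance (c_u) over T stochastic passes."""
    model.train()                # keep dropout active at inference time
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(T)])  # (T, batch, 2)
    return samples.mean(dim=0), samples.var(dim=0)           # mu_x, sigma^2_x

# 14 fused features, e.g. 10 raw prediction values + 4 attribute probabilities
features = torch.randn(1, 14)    # illustrative input
v_p, c_u = mc_dropout_predict(sffn, features)
```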
In addition, we have augmented our original framework with a heuristic approach that takes into account the effect of the statistical attributes mentioned in Section 4.5. This approach works well for data whose attributes, like URL domains and username handles, also occur in the training data. We use a novel heuristic algorithm on this resulting feature set to obtain our final class predictions. The intuition behind using a heuristic approach that takes the statistical features into account is that if a particular feature can by itself be a strong predictor for a particular class, and that class is predicted whenever the value of the feature exceeds a particular threshold, then a significant number of incorrect predictions obtained in the previous steps can be "corrected". In order to extract the statistical features mentioned in Section 4.5, we have considered the username handles and URL domains from the COVID-19 Fake News dataset, and the news author and news source from the FakeNewsNet dataset. Such attributes provide a significant lift in the final classification task, since they contribute meaningful information regarding the origin of news items.

We have fine-tuned our pre-trained models using the AdamW [39] optimizer and cross-entropy loss, after label-encoding the target values. We have applied softmax to the logits produced by each model in order to obtain the prediction probability vectors. The experiments were performed on a system with 16 GB RAM and a 2.2 GHz Quad-Core Intel Core i7 processor, along with a Tesla T4 GPU, with a batch size of 32. The maximum input sequence length was fixed at 128. The initial learning rate was set to 2e-5. The number of epochs varied from 6 to 15, depending on the model.

For the evaluation of fake news classification, we have used precision, recall, accuracy and F1-score to measure the performance of the models. We have additionally used two metrics, negative log-likelihood (NLL) loss and the Brier score, for evaluating the predictive uncertainty of the models. More details on these metrics follow.

Negative Log-Likelihood: The negative log-likelihood function produces a high value when all the values in a prediction vector are evenly distributed, i.e. when the classification is unclear. It also produces relatively high values in the case of wrong classification. However, its value is very small when the output matches the expected value. It is given by

$$\mathrm{NLL} = -\sum_{i} y_i \log(p_i)$$

where $p$ is the prediction vector and $y$ contains the true labels.

Brier Score: The Brier score is a metric that is applied to prediction probabilities. It calculates the mean squared error between the predicted probabilities and the actual values. It is quite similar in spirit to the log-loss metric, with the major difference being that it is gentler in penalizing inaccurate predictions. It is given by

$$BS = \frac{1}{N} \sum_{i=1}^{N} (t_i - p_i)^2$$

where $t_i$ is the predicted probability and $p_i$ is the actual outcome.

We have used XLNet, RoBERTa, XLM-RoBERTa, DeBERTa, ELECTRA and ERNIE 2.0 models. For the Statistical Feature Fusion Network models, we used feature samples created from the prediction vectors of the ensemble of fine-tuned language models, as well as the statistical features. The FakeNewsNet dataset is highly imbalanced, with 75% of the samples belonging to the "real" class and 25% belonging to the "fake" class. In order to handle this problem of an imbalanced dataset, we have used the KMeans-SMOTE [40] algorithm, a variation of the Synthetic Minority Oversampling Technique (SMOTE) [41]. We synthesize new feature samples from the minority-class data points, using the feature set obtained from the individual model prediction vectors and the statistical features, in order to balance out the class distribution without providing any additional information to the model.
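As a reference, this oversampling step can be reproduced with the imbalanced-learn implementation of KMeans-SMOTE; a minimal sketch on synthetic data (the feature dimension, class ratio and cluster balance threshold are illustrative assumptions):

```python
# Minimal sketch: balancing the fused feature set with KMeans-SMOTE via
# imbalanced-learn. Feature dimension, class ratio and the cluster balance
# threshold below are illustrative assumptions.
import numpy as np
from imblearn.over_sampling import KMeansSMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 14))              # fused feature samples (illustrative)
y = np.array([0] * 750 + [1] * 250)          # ~75% "real" vs 25% "fake"

sampler = KMeansSMOTE(cluster_balance_threshold=0.1, random_state=0)
X_res, y_res = sampler.fit_resample(X, y)    # synthesizes minority-class samples
print(np.bincount(y_res))                    # roughly balanced class counts
```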
We have used each fine-tuned model individually to perform "real" vs "fake" classification. Quantitative results for the COVID-19 Fake News dataset are tabulated in Table 3. We can see that XLM-RoBERTa, RoBERTa, XLNet and ERNIE 2.0 perform really well on the validation set. However, RoBERTa has been able to produce the best classification results when evaluated on the test set. We have also evaluated the performance of XLM-RoBERTa, RoBERTa, XLNet, DeBERTa and NewsBERT on the FakeNewsNet dataset. The corresponding quantitative results are shown in Table 4. NewsBERT has been able to achieve the best results on the validation set, while RoBERTa produces the best results on the test set.

We have also evaluated our best ensemble model combination from the above approach, consisting of XLM-RoBERTa, RoBERTa, XLNet and DeBERTa, as well as a combination of these models along with NewsBERT, on the FakeNewsNet dataset, in Tables 7 and 8. We have tried out both soft-voting and hard-voting ensembling techniques, and have observed that the addition of the features obtained from the NewsBERT prediction vectors provides a boost to the final F1-score. Also, soft voting performs slightly better than hard voting on the test set.

In this section, we have measured the performance of our Statistical Feature Fusion Network (SFFN) with MCDropout relative to the plain SFFN. We have also compared the performance of various classical models like Logistic Regression, SVM, Decision Tree and Random Forest. As feature input to the SFFN, we have studied two different feature input types. The first type of feature set is created using the individual prediction vectors from the various language models of the ensemble (soft voting), together with the conditional probability values of various attributes as statistical features; the second type is created using the prediction vector from the ensemble (soft voting) of the language models, with the same conditional probability features as the first.

In Table 9, we have experimented with some classical machine learning models on a new feature set created using the individual predictions from the language models of the best ensemble mentioned in Table 5, and the conditional probability values of URL domains and username handles, for the COVID-19 Fake News dataset. In Table 10, we have tabulated the results of the same experiment, with the best ensemble from Table 7, on the FakeNewsNet dataset, using the conditional probability values of news author and news source. We have then evaluated the performance of the same models on the other type of feature set for the COVID-19 Fake News and FakeNewsNet datasets in Tables 11 and 12, respectively. From these studies, we can conclude that the SFFN with MCDropout achieves better accuracy than the other classical models, and that the feature set using the averaged ensemble prediction vector gives better results.

While comparing the individual models to both the soft-voting and SFFN approaches, we observe in Tables 13-16 that the p-values obtained are always less than the predefined significance level. We can thus conclude that the error rates of these two ensemble approaches are indeed different from that of using just a single model. Among the oversampling techniques compared in Table 17, the KMeans-SMOTE algorithm has been able to achieve the best results.
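One standard test for comparing correlated error rates of two classifiers on the same test set is McNemar's test on correlated proportions; a minimal sketch follows (the choice of statsmodels and the toy data are our assumptions, as the exact test used for Tables 13-16 is not spelled out here):

```python
# Illustrative sketch: paired significance test between two classifiers'
# predictions on the same test set (McNemar's test on correlated proportions).
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])   # toy labels
pred_a = np.array([1, 0, 1, 0, 0, 1, 0, 1, 1, 1])   # e.g. single model
pred_b = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])   # e.g. ensemble / SFFN

a_ok, b_ok = pred_a == y_true, pred_b == y_true
table = [[np.sum(a_ok & b_ok),  np.sum(a_ok & ~b_ok)],
         [np.sum(~a_ok & b_ok), np.sum(~a_ok & ~b_ok)]]
result = mcnemar(table, exact=True)   # exact binomial test for small samples
print(result.pvalue)                  # compare against the significance level
```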
We augmented our Fake News Detection system with an additional heuristic algorithm to boost the accuracy of the model further. We have used the best performing ensemble model, consisting of RoBERTa, XLM-RoBERTa, XLNet and DeBERTa, for this approach. We have performed an ablation study by assigning various levels of priority to each of the features (for example, username > domain, or author > source) and then checking which class's probability value for that feature is maximal for a particular news item, so that we can assign the corresponding "real" or "fake" class label to that item. For example, in one iteration, we have given URL domains a higher priority than username handles when selecting the class label. Results for the different priorities and feature sets are shown in Tables 18 and 19.

Another important parameter that we have introduced for our experiments is a threshold on the class-wise probability values of the features. For example, if the probability that a particular attribute present in a news item belongs to the "real" class is greater than the probability of it belonging to the "fake" class, and the probability of it belonging to the "real" class is greater than a specific threshold, we assign a "real" label to the item. The value of this threshold is a hyperparameter that has been tuned based on the classification accuracy on the validation set. We have summarized the results of our study with and without the threshold parameter in Tables 18 and 19.

As we can observe from the results, the URL domain plays a significant role in ensuring a better classification result when the threshold parameter is taken into account in the case of the COVID-19 Fake News dataset, while the news author plays a significant role in the analogous scenario for the FakeNewsNet dataset. The best results are obtained when we consider the threshold parameter and both the username and domain attributes in the case of the COVID-19 Fake News dataset, and the news author and news source along with the threshold in the case of the FakeNewsNet dataset, with higher importance given to the username and the news author. We have also performed a similar ablation study on the FakeNewsNet dataset using the author and source attributes.

We qualitatively evaluate the performance of the proposed method on the two datasets mentioned above. In Table 21, we show that, with the addition of the feature fusion network, the performance of the framework has improved compared to other models and achieved state-of-the-art results on both datasets. In our earlier work [47], we had shown that the heuristic post-processing approach improves the classification accuracy on the test set significantly. However, the incorporation of uncertainty estimation improves the model performance even more. We have shown a comparison of the results on the test set obtained by our model before and after applying the post-processing technique against the top 3 teams on the leaderboard for the COVID-19 Fake News dataset in Table 22. We have also compared the same with state-of-the-art approaches on the FakeNewsNet dataset in Table 23.

Table 24 shows two examples where the post-processing algorithm corrects the initial prediction in the case of the COVID-19 Fake News dataset. The first example is corrected due to the extracted domain, "news.sky", and the second one is corrected because of the presence of the username handle "@drsanjaygupta". The last two examples remain incorrect even after application of the post-processing algorithm. This is mainly due to the fact that the frequency of these particular authors and sources in the overall dataset is very low; hence, the statistical information conveyed by them regarding the genuineness of the news items is unreliable.
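A minimal sketch of this threshold-based heuristic (the attribute names, priority order and threshold value are illustrative assumptions, not the tuned values from the paper):

```python
# Minimal sketch of the threshold-based heuristic post-processing step.
# Attribute probabilities follow P(real | attribute) = n(A) / (n(A) + n(B)),
# computed from training-set frequencies; names, priority order and
# threshold value are illustrative assumptions.
from collections import Counter

def attribute_probs(train_items, key):
    """Map each attribute value to (P_real, P_fake) from training frequencies."""
    counts = Counter((item[key], item["label"]) for item in train_items)
    probs = {}
    for value in {v for v, _ in counts}:
        n_real, n_fake = counts[(value, "real")], counts[(value, "fake")]
        total = n_real + n_fake
        probs[value] = (n_real / total, n_fake / total)
    return probs

def post_process(item, model_pred, prob_tables, priority, threshold=0.8):
    """Override the ensemble prediction when a higher-priority attribute is a
    sufficiently confident predictor of one class."""
    for key in priority:                     # e.g. ["username", "domain"]
        table = prob_tables[key]
        if item.get(key) not in table:
            continue
        p_real, p_fake = table[item[key]]
        if p_real > p_fake and p_real > threshold:
            return "real"
        if p_fake > p_real and p_fake > threshold:
            return "fake"
    return model_pred                        # keep the ensemble's prediction
```

The fall-through to `model_pred` ensures the heuristic only overrides the ensemble when a high-priority attribute is a sufficiently confident predictor of one class; otherwise the original prediction is kept.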
In this paper, we have proposed a robust framework for the identification of fake news items, which can go a long way towards eliminating the spread of misinformation on sensitive topics. In our initial approach, we tried out various pre-trained language models. Our results improved significantly when we implemented an ensemble mechanism with soft voting, using the prediction vectors from various combinations of these models. Furthermore, we have been able to augment our system with a statistical feature fusion network and a novel heuristics-based post-processing algorithm that incorporates statistical features, which has drastically improved the fake tweet detection accuracy. Our novel heuristic approach shows that meta-attributes like username handle, URL domain, news author, news source, etc. form very important features of news items, and analyzing them accurately can go a long way towards creating a robust framework for fake news detection. We have also quantified the model uncertainty in the task of fake news detection by applying Monte Carlo dropout as a Bayesian approximation in the statistical feature fusion network. Through empirical experiments, we have shown the overall performance increase after including uncertainty in the model. Finally, we would like to pursue more research into how to extend our framework to an active learning based approach by utilizing the uncertainty values, and into how other combinations of meta-attributes in our model perform on the given datasets. It would also be interesting to evaluate how our system performs on other generic fake news datasets, and whether different values of the threshold parameter for our post-processing system would impact its overall performance.
Social media, political polarization, and political disinformation: A review of the scientific literature
Political ideology predicts perceptions of the threat of covid-19 (and susceptibility to fake news about it)
Fake news detection on social media using geometric deep learning
Fakenewsnet: A data repository with news content, social context, and spatiotemporal information for studying fake news on social media
The long-document transformer
Supervised learning for fake news detection
Random forests
An empirical study of the naive bayes classifier
Support vector machines
Xgboost: A scalable tree boosting system
Beyond news contents: The role of social context for fake news detection
Exploiting tri-relationship for fake news detection
Multi-source multi-class fake news detection
Fakedetector: Effective fake news detection with deep diffusive neural network
Csi: A hybrid deep model for fake news detection
Distributed representations of sentences and documents
Supervised learning for fake news detection
Defending against neural fake news
Model generalization on covid-19 fake news detection
Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
Constraint 2021: Machine learning models for covid-19 fake news detection shared task
Tudublin team at constraint@aaai2021 - covid19 fake news detection
Combating hostility: Covid-19 fake news and hostile post detection in social media
Fake news detection system using xlnet model with topic distributions: Constraint@aaai2021 shared task
Exploring text-transformers in aaai 2021 shared task: Covid-19 fake news detection in english
Fakeflow: Fake news detection by modeling the flow of affective information
Attention is all you need
Pre-training of deep bidirectional transformers for language understanding
A robustly optimized bert pretraining approach
Generalized autoregressive pretraining for language understanding
Unsupervised cross-lingual representation learning at scale
Deberta: Decoding-enhanced bert with disentangled attention
Pre-training text encoders as discriminators rather than generators
Ernie 2.0: A continual pre-training framework for language understanding
Dropout as a bayesian approximation: Representing model uncertainty in deep learning, in: International Conference on Machine Learning
Simple and scalable predictive uncertainty estimation using deep ensembles
Fighting an infodemic: Covid-19 fake news dataset
Decoupled weight decay regularization
Oversampling for imbalanced learning based on k-means and smote
Smote: Synthetic minority over-sampling technique
Note on the sampling error of the difference between correlated proportions or percentages
Smote: Synthetic minority over-sampling technique
Adasyn: Adaptive synthetic sampling approach for imbalanced learning
Borderline-smote: A new over-sampling method in imbalanced data sets learning
Borderline over-sampling for imbalanced data classification
A heuristic-driven ensemble framework for covid-19 fake news detection
Overview of constraint 2021 shared tasks: Detecting english covid-19 fake news and hindi hostile posts
g2tmn at constraint@aaai2021: Exploiting ct-bert and ensembling learning for covid-19 fake news detection
Fakenewstracker: A tool for fake news collection, detection, and visualization