key: cord-1004918-oywj2ugc authors: Malla, SreeJagadeesh; Alphonse, P. J. A. title: Fake or real news about COVID-19? Pretrained transformer model to detect potential misleading news date: 2022-01-13 journal: Eur Phys J Spec Top DOI: 10.1140/epjs/s11734-022-00436-6 sha: ba3d295363bfab17038d8fe1465db68bd037a0d3 doc_id: 1004918 cord_uid: oywj2ugc e-mail: malla.sree@gmail.com (corresponding author); alphonse@nitt.edu

The World Health Organization declared the novel coronavirus disease 2019 a pandemic on March 11, 2020. Along with the coronavirus pandemic, a new crisis has emerged, characterized by widespread fear and panic caused by a lack of information or, in some cases, outright fake messages. In these circumstances, Twitter is one of the most eminent and trusted social media platforms. Fake tweets, however, are challenging to detect and differentiate. The primary goal of this paper is to educate society about the importance of accurate information and to prevent the spread of fake information. This paper investigates COVID-19 fake data from various social media platforms such as Twitter, Facebook, and Instagram. The objective is to categorize given tweets as either fake or real news. The authors have tested various deep learning models on the COVID-19 fake news dataset. The CT-BERT and RoBERTa deep learning models outperformed other deep learning models such as BERT, BERTweet, ALBERT, and DistilBERT. The proposed ensemble deep learning architecture, in turn, outperformed CT-BERT and RoBERTa on the COVID-19 fake news dataset using the multiplicative fusion technique, in which the proposed model's prediction is determined by the multiplicative product of the final predictive values of CT-BERT and RoBERTa. This technique compensates for cases where CT-BERT or RoBERTa alone makes an incorrect prediction. The proposed architecture outperforms both well-known ML and DL models, with 98.88% accuracy and a 98.93% F1-score.

Reports of the Wuhan Municipal Health Commission, China, mentioned the emergence of the coronavirus on Dec 31st, 2019. It was initially named SARS-CoV-2. Later, on Jan 12th, 2020, the World Health Organization (WHO) renamed the disease as the 2019 novel coronavirus (2019-nCoV). On Jan 30th, 2020, a health emergency was declared by the WHO. Upon subsequent discussions on this disease outbreak, it was renamed coronavirus disease 2019 (COVID-19) on Feb 11th, 2020. The COVID-19 pandemic has had a tremendous impact worldwide, and it poses an enormous threat to public health, food systems, psychology, and workplace safety. According to surveys, COVID-19 is caused by the SARS-CoV-2 virus, which spreads from person to person, especially when they are in immediate contact. Furthermore, when people cough, sneeze, speak, sing, or breathe loudly, the virus can spread from an infected person to people in close contact. To deal with these critical pandemic situations, the government has promoted physical distancing by limiting close face-to-face contact with others. Further, to reduce the spread of the disease, the government has established containment zones where positive cases have increased considerably. Hence it is highly essential to alert social and government organizations so that the disease does not spread to regions that are not yet affected. Social media has played an active role in developing contact with and among various sectors of people across the globe.
Especially in critical times, Twitter content lets individuals interact with each other during the lockdown period, update their knowledge about the disease, and take the necessary steps to get rid of the disease outbreak. During the lockdown era, precautions such as physical separation, wearing a mask, keeping rooms adequately aired, avoiding crowds, washing hands, and coughing into a tissue or bent elbow were adopted. This information was consistently communicated to the public through Twitter posts. The COVID-19 pandemic has had a negative impact on the world in a variety of areas, including public health, tourism, business, economics, politics, education, and people's lifestyles. In the last two years, researchers have paid more attention to COVID-19. Some researchers have concentrated on Natural Language Processing [1] [2] [3], which includes the analysis of disease symptoms reported in text, while others have focused on medical image analysis [4] [5] [6], which includes patient X-ray analysis to confirm whether COVID-19 is positive or negative. During the COVID-19 outbreak, respiratory analysis research became popular [7] [8] [9] [10]; deep learning models were used to categorize the respiratory sounds of patients in these studies, yielding better results. Mathematical researchers have focused more on COVID-19 statistical reports [11] [12] [13] [14], such as the number of cases identified, the number of deaths, and the number of patients recovered, among other things. Twitter posts contain both fake and real news (source: COVID-19 FakeNews dataset), as shown in Table 1. In reality, not all real news is informative. For example, consider an accurate report containing some predictable content along with COVID-19 disease information: only the COVID-19 related content brings much attention to the tweet posted in public, and hence it is considered informative. In our proposed work, the objective is to highlight such informative content from the tweets and to predict the severity of the disease in a particular location based on geolocation, age, gender, and time; in other words, to identify for which gender and in which locations the outbreak tends to be serious within a particular period. The following are the highlights of this paper:
1. An ensemble transformer model with a fusion vector multiplication technique is addressed.
2. The CT-BERT and RoBERTa transformers are utilised in combination.
3. The proposed fusion-based ensemble deep learning (FBEDL) paradigm produces significant outcomes.
4. The dataset is based on the most recent collection of labelled English COVID-19 fake tweets.
5. The model achieves a 98.93% F1-score and 98.88% accuracy in identifying fake tweets.
Related work is discussed in Sect. 2, Sect. 3 delves into the methodology and data, Sect. 4 presents the experimental results and examines the errors, and Sect. 5 brings the paper to a conclusion. Easwaramoorthy et al. [15] illustrated the transmission rate by comparing and predicting the epidemic curves of the first and second waves of the COVID-19 pandemic. Kavitha et al. [16] have investigated the duration of the second and third waves in India and forecast the outbreak's future trend using SIR and fractal models. Gowrisankar et al. [17] have explained multifractal formalism on COVID-19 data, with the assumption that country-specific infection rates exhibit power-law growth. Minaee et al. [18] present a detailed quantitative analysis of over 100 DL models evaluated on over 16 popular text classification datasets.
Kadhim [19] automatically classified a collection of documents into one or more known categories and discussed weighting methods and comparisons of different classification techniques. Aggarwal and Zhai [20] have presented a survey of a broad range of text classification algorithms and have discussed classification in the database, machine learning, data mining, and information retrieval communities, as well as target marketing, medical diagnosis, newsgroup filtering, and document organisation. Kowsari et al. [21] have discussed different text feature extractions, dimensionality reduction methods, existing algorithms and techniques, and evaluation methods, along with real-world problems. De Beer and Matthee [22] have pointed out various approaches such as topic-agnostic, machine learning, and knowledge-based methods. Uysal and Gunal [23] have discussed the impact of preprocessing on text classification in terms of classification accuracy, text domain, dimension reduction, and text language. Wen et al. [24] employ a clarity map using a two-channel convolutional network and morphological filtering; the fused image is created by combining the clear parts of the source images. Castillo Ossa et al. [25] have developed a hybrid model that combines the population dynamics of the SIR model of differential equations with recurrent neural network extrapolations. Wiysobunri et al. [26] have presented an ensemble deep learning system based on the max (majority) voting scheme with VGG-19, DenseNet201, MobileNet-V2, ResNet34, and ResNet50 for the automatic detection of COVID-19 disease using chest X-ray images. Table 2 summarises work on the identification and classification of tweets related to disasters and the current COVID-19 disease pandemic. Table 3 summarises papers on COVID-19 fake news detection. As shown in the table, the authors discuss automatic fake news detection AI models (on different datasets) with F1-score and accuracy as performance metrics. Transformer-model papers have achieved better results than other Artificial Intelligence models. During the COVID-19 epidemic, the FBEDL model detects fake COVID-19 tweets with an accuracy of 98.88% and an F1-score of 98.93%. Figure 1 depicts a high-level overview of the FBEDL model. The following subsections go over the FBEDL model in greater depth: the FBEDL model's data collection and its preprocessing are described in Sections A and B, Sections C and D describe the pre-trained deep learning classifiers, and Section E discusses the fusion multiplication technique. In the COVID-19 pandemic (2020), the organizers provided the COVID-19 fake news English dataset [38] with the id, tweet, and label ("Fake" and "Real") in tsv format. The data were collected from the organizers of the Constraint@AAAI2021 workshop [39]. The organisers considered only textual English content and captured a generic corpus linked to the coronavirus epidemic using a predetermined list of ten keywords including: COVID-19, cases, coronavirus, deaths, tests, new, people, number and total. The collected tweets are preprocessed using the methods described below. Twitter data contains a lot of noise, so pre-trained models may benefit from data preparation. The data preprocessing steps were inspired primarily by [40]. Because the user handles in the tweets had already been replaced by @USER, no further processing of mentions was required.
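The remaining cleaning steps are not reproduced verbatim in this extract; the following is a minimal sketch of CT-BERT-style tweet preprocessing of the kind described in [40] (placeholder tokens for user handles and URLs, whitespace normalisation). The function name and regular expressions are illustrative assumptions, not the authors' exact code.

```python
import re

def preprocess_tweet(text: str) -> str:
    """Illustrative tweet cleaning: placeholder tokens for handles and URLs."""
    text = re.sub(r"@\w+", "@USER", text)                      # user mentions -> generic token (already done in the shared dataset)
    text = re.sub(r"https?://\S+|www\.\S+", "HTTPURL", text)   # links -> generic token
    text = re.sub(r"\s+", " ", text).strip()                   # normalise whitespace
    return text

# Example
print(preprocess_tweet("Stay safe! Latest update from @MoHFW_INDIA https://covid19.who.int"))
# -> "Stay safe! Latest update from @USER HTTPURL"
```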
RoBERTa [41] improves on BERT by removing the next-sentence pretraining objective, training with considerably larger learning rates and mini-batches, and modifying other important hyperparameters. Google introduced BERT, a transformer-based method that improved NLP (Natural Language Processing) systems using encoder representations. RoBERTa enhanced efficiency over BERT and increased the benefit of the masked language modelling objective. Furthermore, compared to the base BERT model, RoBERTa is trained on an order of magnitude more data. RoBERTa is essentially a retraining of BERT with an improved training methodology, roughly 1000% more data, and more compute power, so it outperforms both BERT and XLNet. However, its pretraining text is drawn from general sources (not only tweets). For the given COVID-19 fake dataset, the model was trained using various hyperparameter combinations (learning rate and batch size). The four metrics used to evaluate the results obtained for each combination are accuracy, recall, precision, and F1-score. The model was trained on the COVID-19 English fake dataset with batch sizes of 8, 16, and 32; it performs best when the batch size is 8 and the learning rate is 1.12e−05, as shown in Table 4. These results may vary from dataset to dataset. Finally, RoBERTa's performance measures are an accuracy of 98.55%, an F1-score of 98.62%, a recall of 98.84%, and a precision of 98.40%, all of which improve the proposed FBEDL model's performance. CT-BERT (COVID-Twitter-BERT) [40] is a recent transformer-based model that was trained on a massive corpus of tweets about the ongoing COVID-19 outbreak. This model shows an improvement of 5-10% compared with its base model, BERT-LARGE, with the most substantial improvements on data from the target domain. CT-BERT, like other pretrained transformer models, is trained on a specific target domain and can be used for a variety of NLP tasks, such as mining and analysis. CT-BERT was designed with COVID-19 content in mind; it incorporates domain-specific (COVID-19) information and can better handle noisy texts such as tweets. CT-BERT performs similarly well on other classification problems on COVID-19-related data sources, particularly on text derived from social media platforms. For the given COVID-19 fake dataset, this model was trained using various hyperparameter combinations (batch size and learning rate). As indicated in Table 5, the best results were obtained when the batch size was equal to 8 and the learning rate was equal to 1.02e−06. The CT-BERT model's results may also vary from dataset to dataset. Finally, CT-BERT's performance metrics, reported in Table 5, likewise contribute to the proposed FBEDL model's performance. To overcome the disadvantages of the individual CT-BERT and RoBERTa models, an ensemble model is introduced. Fusion techniques are popular for combining the outputs of the internal models; these techniques include taking the max, min, mean (average), sum, difference, or product of the probability values. The probability vector of a tweet is calculated using the fine-tuned RoBERTa model and the CT-BERT model. The multiplicative fusion technique [42] performs element-wise multiplication to combine both probability vectors (the arrays of the last layer) into a single vector [27]. The predicted tweet label is based on the generated vector.
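As an illustration of the multiplicative fusion step just described, a minimal sketch is given below; the softmax outputs are hypothetical placeholders for the fine-tuned CT-BERT and RoBERTa predictions, not values taken from the paper.

```python
import numpy as np

def multiplicative_fusion(p_ctbert: np.ndarray, p_roberta: np.ndarray) -> int:
    """Combine two class-probability vectors by element-wise multiplication
    and return the index of the winning class."""
    fused = p_ctbert * p_roberta        # element-wise product of the two last-layer probability vectors
    fused = fused / fused.sum()         # renormalise so the fused scores form a distribution
    return int(np.argmax(fused))        # predicted label, e.g. 0 = "fake", 1 = "real"

# Hypothetical outputs for one tweet: CT-BERT is confident the tweet is fake, RoBERTa is unsure.
p_ctbert = np.array([0.90, 0.10])       # [P(fake), P(real)]
p_roberta = np.array([0.55, 0.45])
print(multiplicative_fusion(p_ctbert, p_roberta))   # -> 0 ("fake")
```

Because the product is large only when both models assign a class a high probability, a confident mistake by one model is damped by the other model's disagreement, which is the intuition behind using multiplicative fusion to offset the individual models' incorrect predictions.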
All of the experiments in this paper were run using the Google Colaboratory (Colab) interface in the Chrome browser. This section covers the data set, model parameter explanations, and performance evaluations. Furthermore, the proposed solution is evaluated in comparison to existing methods. The Huggingface package [43] was used in the implementation through Python, and the "ktrain" package [44] was used to fine-tune our baseline models. In the COVID-19 outbreak (2020), the Constraint@AAAI2021 workshop organizers provided the COVID-19 fake news English dataset [38] with the id, tweet, and label ("Fake" and "Real") in tsv form. The dataset, which contains fake news collected from tweets, Instagram posts, Facebook posts, press releases, or any other popular media content, has a size of 10,700 records. Using the Twitter API, real news was gathered from potentially real tweets. Real tweets may come from official accounts such as the Indian Council of Medical Research (ICMR), the World Health Organization (WHO), the Centers for Disease Control and Prevention (CDC), Covid India Seva, and others; they give valuable COVID-19 information such as vaccine progress, dates, hotspots, government policies, and so on. The dataset is divided into three sections: 60% for training, 20% for validation, and 20% for testing. Table 6 illustrates the distribution of all data splits by class: 52.34% of the samples contain legitimate news and 47.66% contain fraudulent news. The outcome of the model depends on the classifier used; as a result, the following classifiers are used to conduct various tests. The model performance is evaluated using the following parameters: Precision, F1-score, Accuracy, and Recall. These metrics are derived from the confusion matrix. The performance of a classification model is evaluated using an N × N matrix, where N indicates the number of target classes. For binary classification N equals 2, so the confusion matrix contains four values: True Positive (TP), a positive sample correctly predicted as positive; True Negative (TN), a negative sample correctly predicted as negative; False Positive (FP), a negative sample incorrectly predicted as positive; and False Negative (FN), where the expected value is incorrectly predicted, i.e. the actual label was positive but the model predicted negative. There are three subsections in this section. The performance of the ML (machine learning) models is compared in the first subsection, the performance of the deep learning models is compared in the second, and the proposed model's performance is compared to existing approaches in the third. The Constraint@AAAI2021 workshop organisers have provided baseline results for the English COVID-19 fake dataset. Logistic Regression, Decision Tree, Gradient Boost, and SVM were considered as baselines for predicting fake news tweets. The Support Vector Machines (SVM) classifier achieved an accuracy of 93.32%, an F1-score of 93.32%, a precision of 93.33%, and a recall of 93.32%; as a result, the SVM classifier outperformed the other baselines on all metrics, as shown in Table 7. The pretrained transformer deep learning models DistilBERT, ALBERT, BERT, BERTweet, RoBERTa, and CT-BERT are considered in this subsection. The MAX_LENGTH (tweet) was fixed to 143 in order to train the models better on the English-language corpus; the tweets being tested are in English. The models were trained with learning rates of 1e−4, 1e−5, 1e−6, 1e−7, and 1e−8, and tested with batch sizes of 8, 16, and 32. CT-BERT and RoBERTa occupied the first two places, as shown in Table 8, ahead of the BERTweet, BERT, DistilBERT, and ALBERT models, as exhibited in Fig. 2a-d. According to the experimental results, they outperformed the other competitors because they had higher TP (true positive) and lower FN (false negative) values.
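The fine-tuning setup behind these transformer results (MAX_LENGTH of 143, batch size 8, and the learning rates listed above) can be sketched with the "ktrain" package [44] mentioned earlier. The Hugging Face model identifier, file names, column names, label strings, and epoch count below are assumptions for illustration, not the authors' released code.

```python
import pandas as pd
import ktrain
from ktrain import text

# Assumed tsv layout of the Constraint@AAAI2021 files: columns "tweet" and "label".
train_df = pd.read_csv("train.tsv", sep="\t")
val_df = pd.read_csv("val.tsv", sep="\t")

label_map = {"fake": 0, "real": 1}                          # assumed label strings
y_train = [label_map[l.lower()] for l in train_df["label"]]
y_val = [label_map[l.lower()] for l in val_df["label"]]

# CT-BERT checkpoint on the Hugging Face hub (assumed identifier); maxlen matches MAX_LENGTH = 143.
t = text.Transformer("digitalepidemiologylab/covid-twitter-bert-v2",
                     maxlen=143, class_names=["fake", "real"])
trn = t.preprocess_train(train_df["tweet"].tolist(), y_train)
val = t.preprocess_test(val_df["tweet"].tolist(), y_val)

model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=8)
learner.fit_onecycle(1.02e-6, 3)   # learning rate as in Table 5; the epoch count is an assumption

predictor = ktrain.get_predictor(learner.model, preproc=t)
print(predictor.predict(["New vaccine cures COVID-19 in one day, doctors say"]))
```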
CT-BERT performed well because it was pre-trained on a large corpus of COVID-19-related Twitter messages. The proposed model (FBEDL) is compared in terms of accuracy and F1-score with the machine learning models, deep learning models, and ensemble models. In comparison to existing models, our FBEDL model attained an F1-score of 98.93% and an accuracy of 98.88%, as shown in Tables 9 and 10 as well as Fig. 3. This indicates that the model was successful in distinguishing fake tweets/news about the COVID-19 disease outbreak. (Comparison values from Tables 9 and 10: Baseline [38] 93.32, 93.32; XLNet + LDA [31] 96.70, 96.60; Ensemble [32] 94.00, 93.90; CT-BERT + hard voting [33] 98.) The principal goal of this work is to demonstrate how to use a novel NLP application to detect real or fake COVID-19 tweets. The conclusions of the paper assist individuals in avoiding hysteria about COVID-19 tweets. Our findings may also aid in the improvement of COVID-19 therapies and public health measures. In this study, a fusion technique-based ensemble deep learning model is used to detect fraudulent tweets in the ongoing COVID-19 epidemic. The use of fusion vector multiplication is designed to make our model more robust. We tried various deep learning model combinations to improve performance, and the COVID-Twitter-BERT and RoBERTa deep learning models achieved state-of-the-art performance. With 98.88% accuracy and a 98.93% F1-score, the proposed model outperforms traditional machine learning and deep learning models. One disadvantage of our proposed model is that RoBERTa and CT-BERT are pre-trained models that require a large amount of memory (657 MB and 1.47 GB, respectively). Compared to machine learning models, the models' time complexity is likewise relatively high. To boost model performance, we plan to apply data compression techniques. This research focuses on English COVID-19 pandemic fake tweets for the time being; our method may be able to predict fake tweets about similar diseases in the future. We can also improve our results in the future by training other combinations on a sizeable COVID-19 dataset using alternative transformer-based models.

References
1. NIT_COVID-19 at WNUT-2020 task 2: deep learning model RoBERTa for identify informative COVID-19 English tweets
2. CIA_NITT at WNUT-2020 task 2: classification of COVID-19 tweets using pre-trained language models
3. Artificial intelligence and Covid-19: deep learning approaches for diagnosis and treatment
4. Deep-Covid: predicting Covid-19 from chest X-ray images using deep transfer learning
5. Which role for chest X-ray score in predicting the outcome in Covid-19 pneumonia?
6. Fast deep learning computer-aided diagnosis of Covid-19 based on digital chest X-ray images
7. A literature review on Covid-19 disease diagnosis from respiratory sound data
8. Automatic Covid-19 disease diagnosis using 1D convolutional neural network and augmentation with human respiratory sound based on parameters: cough, breath, and voice
9. COVID-19 Pneumonia: Different Respiratory Treatments for Different Phenotypes?
10. Management of Covid-19 respiratory distress
11. A model based study on the dynamics of Covid-19: prediction and control
12. Covid-19: challenges and its consequences for rural health care in India. Public Health Pract
13. Remdesivir for Covid-19: challenges of underpowered studies
14. How India is dealing with Covid-19 pandemic
15. An exploration of fractal-based prognostic model and comparative analysis for second wave of COVID-19 diffusion
16. The second and third waves in India: when will the pandemic be culminated?
17. Can India develop herd immunity against Covid-19?
18. Deep learning-based text classification: a comprehensive review
19. Survey on supervised machine learning techniques for automatic text classification
20. A survey of text classification algorithms
21. Text classification algorithms: a survey
22. Approaches to identify fake news: a systematic literature review
23. The impact of preprocessing on text classification
24. Multifocus image fusion using convolutional neural network
25. A hybrid model for Covid-19 monitoring and prediction
26. An ensemble deep learning system for the automatic detection of COVID-19 in X-ray images
27. A neural-based approach for detecting the situational information from Twitter during disaster
28. COVID-19 outbreak: an ensemble pre-trained deep learning model for detecting informative tweets
29. Improve text classification accuracy based on classifier fusion methods
30. Automatic diagnosis of Covid-19 disease using deep convolutional neural network with multi-feature channel from respiratory
31. Fake news detection system using XLNet model with topic distributions: Constraint@AAAI2021 shared task
32. TUDublin team at Constraint@AAAI2021: COVID-19 fake news detection
33. g2tmn at Constraint@AAAI2021: exploiting CT-BERT and ensembling learning for COVID-19 fake news detection
34. Cross-SEAN: a cross-stitch semi-supervised neural attention model for COVID-19 fake news detection
35. A novel self-learning semi-supervised deep learning network to detect fake news on social media
36. 3HAN: a deep neural network for fake news detection
37. Detection of online fake news using n-gram analysis and machine learning techniques
38. Fighting an infodemic: COVID-19 fake news dataset
39. Overview of CONSTRAINT 2021 shared tasks: detecting English COVID-19 fake news and Hindi hostile posts
40. COVID-Twitter-BERT: a natural language processing model to analyse COVID-19 content on Twitter
41. RoBERTa: a robustly optimized BERT pretraining approach
42. Multi-modal circulant fusion for video-to-language and backward
43. Scikit-learn: machine learning in Python
44. ktrain: a low-code library for augmented machine learning

The authors declare no competing interests.