title: COVID-19 Fake News Detection Using Bidirectional Encoder Representations from Transformers Based Models
authors: Wang, Yuxiang; Zhang, Yongheng; Li, Xuebo; Yu, Xinyao
date: 2021-09-30

Nowadays, the development of social media allows people to access the latest news easily. During the COVID-19 pandemic, it is important for people to access the news so that they can take corresponding protective measures. However, fake news is flooding social media and is a serious issue, especially during a global pandemic. Misleading fake news can cause significant harm to both individuals and society. COVID-19 fake news detection has therefore become a novel and important task in the NLP field. However, fake news often mixes correct and incorrect content, which increases the difficulty of the classification task. In this paper, we fine-tune the pre-trained Bidirectional Encoder Representations from Transformers (BERT) model as our base model. We add BiLSTM layers and CNN layers on top of the fine-tuned BERT model, with the BERT parameters either frozen or not frozen. The evaluation results show that our best model (the fine-tuned BERT model with frozen parameters plus BiLSTM layers) achieves state-of-the-art results on the COVID-19 fake news detection task. We also explore keyword evaluation methods using our best model and evaluate the model performance after removing keywords.

Over the past year, the whole world has gone through the COVID-19 pandemic. Twitter, Facebook, Instagram, and many other social platforms update news on the pandemic every day. In this project, we fine-tune pre-trained Bidirectional Encoder Representations from Transformers (BERT) based models on social media news posts whose veracity is already known, in order to better recognize false news that may appear in the future.

Adhikari [1] presents the first application of BERT to document classification. Aggarwal [2] demonstrates classification of fake news by fine-tuning a deep bidirectional transformer-based language model. Devlin [3] introduces a new language representation model called BERT. Gundapu [4] introduces an ensemble of three transformer models (BERT, ALBERT, and XLNet) to detect fake news. Gupta [5] builds a model that uses an abusive-language detector coupled with features extracted via Hindi BERT and Hindi FastText models and metadata. Kaliyar [6] proposes a BERT-based deep learning approach that combines different parallel blocks of single-layer deep convolutional neural networks, with different kernel sizes and filters, with BERT. Kula [7] presents a hybrid architecture connecting BERT with an RNN. Liu [8] treats fake news detection as a fine-grained multi-class classification task and uses two similar sub-models to identify labels of different granularity separately. Pham [9] explores encoding news title pairs and transforming them into a new representation space. Pham-Hong [10] uses a stack of BERT and LSTM layers to evaluate multilingual offensive language identification in social media. Safaya [11] describes an approach that combines pre-trained BERT models with convolutional neural networks for a sub-task of the Multilingual Offensive Language Identification shared task.
Sun [12] investigates different fine-tuning methods of BERT for text classification and provides a general solution for BERT fine-tuning. Tang [13] performs keyword extraction using attention-based deep learning models with BERT. Vijjali [14] leverages a novel fact-checking algorithm that retrieves the most relevant facts concerning user claims about particular COVID-19 topics and verifies the level of "truth" in each claim by computing the textual entailment between the claim and the true facts.

The COVID-19 Fake News Detection dataset comes from the Kaggle website. The balanced training data contain 6,420 entries with the variables id, tweet, and label, and the balanced testing data contain 2,140 entries with the variables id and tweet. There are three main variables in our training dataset: 'id' indicates the id number of the tweet; 'tweet' is the actual content of the tweet/post; and 'label' describes whether the news is real or fake. We combine the two datasets and randomly split the data into a training set (90%) and a test set (10%) using a fixed random seed.

Data pre-processing is essential for feeding the data into BERT and can be summarized in the following steps. First, we load the dataset. Second, we perform tokenization and encoding: we use BertTokenizer and our own tokenizer to tokenize the tweets, add a [CLS] token at the beginning and a [SEP] token at the end of every sentence, and then map tokens to ids. Third, we apply padding and truncation. BERT requires that all input sequences have the same fixed length, with a maximum of 512 tokens per sequence. We found that only 10 out of 8,560 rows exceed 512 tokens, so we set our maximum length to 512. Last, we use attention masks, which prevent the padded tokens from being incorporated into the interpretation of the sentences.

Our training and evaluation procedure can be summarized in the following steps. First, we apply the BertForSequenceClassification model. Second, we fine-tune the BERT model. Third, we add additional layers on top of the fine-tuned model, including CNN and bidirectional LSTM layers, both with and without freezing the parameters of the fine-tuned model. Then, we perform training, hyperparameter tuning, and testing. Last, we investigate keywords that affect the authenticity of the news. The procedure is visualized in Figure 1.

We build five different models to evaluate and compare fake news classification performance. We define Model 1 as the BERT fine-tuned model. Model 2 is the BERT fine-tuned model with frozen parameters plus CNN layer(s). Model 3 is the BERT fine-tuned model without frozen parameters plus CNN layer(s). Model 4 is the BERT fine-tuned model with frozen parameters plus BiLSTM layer(s); a minimal sketch of this architecture is given below. Model 5 is the BERT fine-tuned model without frozen parameters plus BiLSTM layer(s).
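To make the preprocessing and Model 4 concrete, the following is a minimal PyTorch/Hugging Face sketch of a frozen BERT encoder followed by a BiLSTM classification head. The class name FakeNewsBiLSTM, the BiLSTM hidden size, and loading bert-base-uncased in place of the fine-tuned checkpoint are illustrative assumptions rather than the exact implementation used in our experiments.

import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel

# Tokenization and encoding: BertTokenizer adds [CLS]/[SEP], maps tokens to ids,
# pads/truncates to a fixed length, and builds the attention masks.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def encode_tweets(tweets, max_len=512):
    return tokenizer(
        tweets,
        padding="max_length",
        truncation=True,
        max_length=max_len,
        return_tensors="pt",
    )

class FakeNewsBiLSTM(nn.Module):
    """Frozen BERT encoder plus BiLSTM layer(s) and a binary classifier (Model 4)."""

    def __init__(self, lstm_hidden=256, num_labels=2):
        super().__init__()
        # In practice the fine-tuned BERT checkpoint would be loaded here;
        # the base model is used only to keep the sketch self-contained.
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        for p in self.bert.parameters():  # freeze the encoder parameters
            p.requires_grad = False
        self.bilstm = nn.LSTM(
            input_size=self.bert.config.hidden_size,
            hidden_size=lstm_hidden,
            batch_first=True,
            bidirectional=True,
        )
        self.classifier = nn.Linear(2 * lstm_hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        # Token-level BERT representations, then a BiLSTM over the sequence.
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.bilstm(hidden)
        # Classify from the final BiLSTM time step (both directions).
        return self.classifier(lstm_out[:, -1, :])

# Example usage: encode a tweet, run the model, and compute the
# cross-entropy training loss against a real/fake label.
batch = encode_tweets(["Example tweet about COVID-19 news."])
model = FakeNewsBiLSTM()
logits = model(batch["input_ids"], batch["attention_mask"])
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0]))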
To test our classifiers' prediction results on the fake news dataset, we use the following metrics. First, we use test accuracy as our primary metric. In our task, the test accuracy is defined as

\text{Accuracy} = \frac{\text{number of correctly classified news}}{\text{total number of news}}. \quad (1)

The second metric we use is the training loss. We use the cross-entropy loss, which can be defined as

\text{CE} = -\sum_{i} y_i \log(\hat{y}_i). \quad (2)

The cross-entropy compares the model's prediction with the label, which is the true probability distribution. The third metric we use is the ROC AUC score, where ROC AUC stands for the area under the ROC curve. The ROC AUC score ranges from 0 to 1, and a larger AUC value indicates better prediction performance. The last metric we use is the F1 score. It ranges from 0 to 1 and is calculated from the precision and the recall of our test results. The precision is the number of true positive results divided by the number of all positive results, and the recall is the number of true positive results divided by the number of all samples that should have been identified as positive. The higher the score, the better the performance. The F1 score can be written as

F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}. \quad (3)

From the results, Model 4 has the highest test accuracy, ROC AUC, and F1 score. The performance of Model 2 is also better than that of the BERT fine-tuned model alone. It matches our expectation that Model 4 performs best, since BiLSTM considers the context both before and after the target words.

We count and sort the words in sentences that are classified as fake news by our best model to obtain keywords (a minimal counting sketch is given at the end of this section). For example, excluding some commonly used prepositions, some of the keywords can be visualized in Figure 4. We delete those top frequent words from our inputs and check whether the model performance changes after removing them. As a result, the model performance does not change. This indicates that the most frequent words alone do not usually drive the overall performance.

In terms of the size of the dataset, we could collect more fake news data so that the model can be better trained. We could also try different ways of splitting the dataset to find a more reasonable one. In the future, there is still room to try different hyperparameter values, such as the learning rate and the number of additional layers. Combining a model that is not pre-trained with additional layers could possibly improve the performance, since a pre-trained model is representative for general tasks but might not be for this specific case. In addition, we could try additional layers other than BiLSTM and CNN; one example could be GRU.

We observe that the performance of models with frozen parameters in the fine-tuned model improves, while the performance of models without frozen parameters does not. The reason could be that the dataset is not large enough to support learning those architectures without frozen parameters. We also find that adding additional layers improves the accuracy because the new model can capture more information from the dataset. Overall, adding BiLSTM layer(s) performs better than adding CNN layers under the same conditions. We believe this result is reasonable because BiLSTM considers the context of the sentence whereas CNN does not. In addition, the performances of the five models are relatively similar; in order to draw a stronger conclusion, we would need to use other methods such as cross-validation. There are plenty of ways to find keywords that contribute to fake news detection. For example, we can find keywords by analyzing the attention layers in BERT, or by tracing gradient values during backpropagation.
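As a supplement to the keyword analysis above, the following minimal sketch counts word frequencies over tweets that the best model labels as fake. The whitespace tokenization and the small stop-word list are illustrative assumptions; the exact preprocessing behind the word counts in Figure 4 is not specified here.

from collections import Counter

# Illustrative stop-word list standing in for the excluded common prepositions.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "on", "for", "and", "is", "at", "with"}

def top_keywords(predicted_fake_tweets, k=20):
    # Count and sort words over tweets classified as fake by the best model.
    counts = Counter()
    for tweet in predicted_fake_tweets:
        for word in tweet.lower().split():
            if word.isalpha() and word not in STOP_WORDS:
                counts[word] += 1
    return counts.most_common(k)

# Example usage with hypothetical predictions from the best model:
# fake_tweets = [t for t, pred in zip(test_tweets, predictions) if pred == FAKE]
# print(top_keywords(fake_tweets))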
References

[1] DocBERT: BERT for document classification.
[2] Classification of fake news by fine-tuning deep bidirectional transformers based language model.
[3] BERT: Pre-training of deep bidirectional transformers for language understanding.
[4] Transformer based automatic COVID-19 fake news detection system.
[5] Hostility detection and COVID-19 fake news detection in social media.
[6] FakeBERT: Fake news detection in social media with a BERT-based deep learning approach.
[7] Application of the BERT-based architecture in fake news detection.
[8] A two-stage model based on BERT for short fake news detection.
[9] Transferring, transforming, ensembling: The novel formula of identifying fake news.
[10] PGSG at SemEval-2020 Task 12: BERT-LSTM with tweets' pretrained model and noisy student training method.
[11] KUISAIL at SemEval-2020 Task 12: BERT-CNN for offensive speech identification in social media.
[12] How to fine-tune BERT for text classification? CoRR.
[13] Progress notes classification and keyword extraction using attention-based deep learning models with BERT. CoRR.
[14] Two stage transformer model for COVID-19 fake news detection and fact checking.