key: cord-0950533-k31tlozm authors: Hayawi, Kadhim; Shahriar, Sakib; Serhani, Mohamed Adel; Taleb, Ikbal; Mathew, Sujith Samuel title: ANTi-Vax: A Novel Twitter Dataset for COVID-19 Vaccine Misinformation Detection date: 2021-12-07 journal: Public Health DOI: 10.1016/j.puhe.2021.11.022 sha: d5b9310338b47b03d108f3910ab8920f6275d771 doc_id: 950533 cord_uid: k31tlozm

Objectives The COVID-19 (SARS-CoV-2) pandemic has infected hundreds of millions and inflicted millions of deaths around the globe. Fortunately, the introduction of COVID vaccines provided a glimmer of hope and a pathway to recovery. However, due to misinformation being spread on social media and other platforms, there has been a rise in vaccine hesitancy, which can reduce vaccine uptake in the population. The goal of this research is to introduce a novel machine learning-based COVID-19 vaccine misinformation detection framework. Study Design We collected and annotated COVID-19 vaccine tweets and trained machine learning algorithms to classify vaccine misinformation. Methods More than 15,000 tweets were annotated as misinformation or general vaccine tweets using reliable sources and validated by medical experts. The classification models explored were XGBoost, LSTM, and the BERT transformer model. Results The best classification performance was obtained using BERT, resulting in an F1-score of 0.98 on the test set. The precision and recall scores were 0.97 and 0.98, respectively. Conclusion Machine learning-based models are effective in detecting misinformation regarding COVID-19 vaccines on social media platforms.

As of July 26, 2021, more than 194 million infections and more than 4 million deaths have been attributed to the SARS-CoV-2 outbreak, commonly referred to as the COVID-19 pandemic (1). Since the outbreak emerged in Wuhan, in China's Hubei province, and spread worldwide, lockdown measures and social distancing methods were introduced in most parts of the globe.
The impacts were significant on various sectors including the economy (2), education (3), and the mental health of the population (4). The emergence of various safe and effective vaccines (5) provided a potential solution by increasing population immunity and emerging as an effective method to control the outbreak. Most vaccine authorization and distribution began in December 2020 (6). Despite the introduction of vaccines, increasing hesitancy toward vaccine uptake can be observed among significant parts of the population in various countries (7). This hesitancy can be explained in part by vaccine misinformation that is spread in person (8). However, with wide social media access and usage, the spread of vaccine misinformation can increase significantly, potentially leading to a further decline in vaccine uptake. Misinformation can be spread on social media by human users as well as social bots (9, 10). Social bots are programmed to automatically spread false information in disguise. Therefore, it is essential for algorithms to automatically detect the content of the misinformation regardless of whether the source is a human or a social bot. More specifically, the focus of this research is on Twitter and detecting misinformation in tweets related to vaccines. To the best of our knowledge, there are no existing datasets for detecting vaccine misinformation tweets, and this is the first proposed approach for detecting COVID-19 vaccine misinformation.

Machine learning-based algorithms have been widely and effectively utilized for various COVID-19-related applications including screening, contact tracing, and forecasting (11). The CoAID dataset introduced by (12) contains misinformation related to COVID-19. The authors utilized several machine learning models to classify fake news, with the best performance of 0.58 F1-score being obtained using a hierarchical attention network-based model.
A COVID-19 vaccine misinformation tweets dataset was introduced by (13). This dataset characterizes both users who are actively posting misinformation and those who are calling out misinformation or spreading true information. It was concluded that informed users tend to use more narratives in their tweets than misinformed ones. The ReCOVery dataset proposed by (14) contains over 2000 news articles and their credibility. Furthermore, it also includes over 140,000 tweets that reveal the way these news articles are spread on Twitter. An F1-score of 0.83 was obtained for predicting reliable news and 0.67 was obtained for predicting unreliable news using a neural network model. A billion-scale COVID-19 Twitter dataset covering 268 countries with over 100 languages was collected by (15). Two predictive models were proposed: one for classifying whether a tweet was related to the pandemic (COVID-relevance) and one for detecting whether a tweet was COVID-19 misinformation. The misinformation detection models were trained using the aforementioned CoAID and ReCOVery datasets, and combining them resulted in the best F1-score of 0.92 using a BERT-based model. The authors in (16) combined four existing datasets including CoAID and used several machine learning algorithms to classify COVID-19 misinformation. The best F1-score of 0.985 was obtained using a two-layer LSTM network. The ArCOV19-Rumors dataset was presented by (17) to detect COVID-19 misinformation in Arabic tweets. Two Arabic BERT-based models were used for classification, obtaining a best F1-score of 0.74. A bilingual Arabic and English dataset for detecting COVID-19 misleading tweets was presented in (18). Several machine learning models were used to annotate the unlabeled tweets. However, the authors did not quantify the evaluation of the predictive models. Finally, a Chinese microblogging dataset for detecting COVID-19 fake news was presented by (19).
Various deep learning models were explored and the best F1-score of 0.94 was obtained using the TextCNN model. [...] in the context of the COVID-19 vaccine and therefore, the proposed work, to the best of our knowledge, is the first to perform vaccine misinformation detection. Table 1 summarizes the existing works in COVID-19 misinformation detection and COVID-19 vaccine-related tweet datasets.

This section describes the methodology of the proposed application. The details of the implementation are presented next chronologically.

Twitter is one of the most popular social media platforms, with 353 million active users and more than 500 million tweets posted every day (25). The Twitter API allows the extraction of public tweets including the tweet text, user information, retweets, and mentions in JSON format. A Python library called Twarc was utilized to access the Twitter API. To obtain the relevant tweets about COVID-19 vaccines, we followed the approach in some of the existing works in the literature and collected the tweets using keywords. The following keywords (case insensitive) were used: 'vaccine', 'pfizer', 'moderna', 'astrazeneca', 'sputnik', and 'sinopharm'. Additionally, we only considered tweets in the English language. Replies to tweets, retweets, and quote tweets were not considered. Overall, the vaccine-related tweets from December 1, 2020, until July 31, 2021, were collected. In total, 15,465,687 tweets were collected.

In supervised learning, a labeled dataset is required before model training. Since no [...]

Preprocessing the contents of the tweets is important for efficient model training. Firstly, external links, punctuation, and text in brackets were removed. All text contents were also converted to lower case. Common words such as 'the', 'and', 'in', and 'for' are referred to as stop words. Removing these low-information words [...] that form.
For example, both 'walking' and 'walked' will be converted to the stem 'walk'. In this step, the Snowball stemmer (31) from the NLTK library was used.

Machine learning enables computer systems to learn from experience using data, [...] attention heads, and a total of 340M parameters (41). The Transformers (42) Python library was used to implement this approach.

Overfitting is considered a major obstacle in training machine learning algorithms. When a specific model performs outstandingly well during the training phase, by using unnecessary input features, but fails to make generalized predictions on the test set, it is 'overfitting' to the training dataset. To avoid the overfitting problem for the two deep learning models, the dropout technique was used. Also, training and validation accuracy curves were monitored to ensure no overfitting occurred during training.

The research framework for COVID-19 vaccine misinformation classification is summarized in Figure 4. The COVID-19 vaccine-related tweets were first collected and then annotated as misinformation or regular tweets using reliable sources. After the necessary preprocessing and feature extraction, machine learning and deep learning models were trained to classify vaccine misinformation. Finally, the performance of the models was evaluated on the test set.

Classification algorithms can be evaluated using several metrics including accuracy, precision, recall, and F1-score, as defined in Equations (1-4) (43), where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$

$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$

The results from the XGBoost model as well as the two deep learning models are presented next. All models were first trained and validated on 75% of the dataset and then evaluated on the remaining 25% of the dataset.

As expected, training XGBoost was much quicker than training the two deep learning models. The training accuracy obtained was 96.9% and the accuracy on the test set was 95.6%.
The precision, recall, and F1-score on the test set were 0.96, 0.95, and 0.95, respectively. Figure 5 presents the confusion matrix on the test set using XGBoost. The majority of the errors (84%) resulted from misinformation being classified as non-misinformation, whereas very few of the non-misinformation tweets were wrongly classified.

The LSTM model was trained for six iterations with 20% of the data from the training set used for validation. Figure 6 [...] The maximum training accuracy using LSTM was 99% and the accuracy on the test set was 96%. The precision, recall, and F1-score on the test set were 0.97, 0.96, and 0.96, respectively. Overall, there was a slight improvement compared to XGBoost. The confusion matrix on the test set using this approach is presented in Figure 7. Compared to XGBoost, there was a decrease in misinformation being misclassified (68%). However, more non-misinformation tweets were classified as misinformation.

The maximum training accuracy using BERT was 99% and the accuracy on the test set was 98%. The precision, recall, and F1-score on the test set were 0.97, 0.98, and 0.98, respectively. The performance using BERT was superior to that of the previous two models. Figure 9 displays the confusion matrix on the test set using BERT. Compared to the previous two models, BERT provides the lowest error rate (43%) on misclassifying the misinformation tweets, but it has a higher error rate in misclassifying the non-misinformation tweets.

In the previous section, the effectiveness of all the models in vaccine misinformation detection was discussed. Consistent with the literature, superior performance was obtained using the deep learning models compared to XGBoost for a relatively larger training set. BERT is recommended for this application because it was able to predict most of the misinformation.
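The accuracy, precision, recall, and F1-score values reported above all derive from the four counts in a binary confusion matrix. The following is a minimal sketch of that arithmetic, not the authors' evaluation code. The names `ConfusionCounts` and `classification_metrics` are hypothetical, misinformation is assumed to be the positive class, and the counts in the usage example are illustrative rather than taken from Figures 5, 7, or 9. The extra `missed_share_of_errors` value reflects one possible reading of the percentages quoted with the confusion matrices (the fraction of all test errors that are missed misinformation); that reading is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion-matrix counts (assumption: misinformation is the positive class)."""
    tp: int  # misinformation correctly flagged
    fp: int  # regular tweets wrongly flagged as misinformation
    fn: int  # misinformation missed by the classifier
    tn: int  # regular tweets correctly left unflagged


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Compute the standard metrics from confusion-matrix counts."""
    precision = _ratio(c.tp, c.tp + c.fp)
    recall = _ratio(c.tp, c.tp + c.fn)
    return {
        "accuracy": _ratio(c.tp + c.tn, c.tp + c.tn + c.fp + c.fn),
        "precision": precision,
        "recall": recall,
        "f1": _ratio(2 * precision * recall, precision + recall),
        # Share of all errors that are missed misinformation (false negatives).
        "missed_share_of_errors": _ratio(c.fn, c.fn + c.fp),
    }


if __name__ == "__main__":
    # Illustrative counts only; these are NOT the paper's results.
    example = ConfusionCounts(tp=80, fp=5, fn=20, tn=895)
    for name, value in classification_metrics(example).items():
        print(f"{name}: {value:.3f}")
```

Reporting the share of errors that are false negatives alongside F1 is useful here because the two error types have different costs: a missed misinformation tweet keeps circulating, while a wrongly flagged regular tweet can be cleared on review.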
There are several implications of the proposed application that include, but are not limited to: 1) the dataset and models presented in this work can be used by social media sites to effectively limit the spread of misinformation, 2) it would also facilitate the detection of social bots spreading vaccine misinformation, 3) the dataset can also be thoroughly analyzed to identify patterns of misinformation and their spread over time, and 4) this study will raise awareness regarding the misinformation about vaccines in social media and will also trigger further research in this area. A limitation of this study is that statistical analysis was not presented. As the focus of this study was on detecting [...] vaccines can be performed using the large COVID-19 vaccine dataset. This would reveal the public perception of vaccines and how it evolved over the months. Also, the focus of this study was on English tweets, but researchers are encouraged to extend this study to multilingual tweets related to COVID-19 vaccines. The use of hashtags can provide insights into the general behavior of social media users (44), and this could be utilized for future research. Finally, it is also worth investigating vaccine-related misinformation on social media platforms other than Twitter as well as blog posts.

Author statements

An interactive web-based dashboard to track COVID-19 in real time
The Global Macroeconomic Impacts of COVID-19: Seven Scenarios
Impacts of the COVID-19 Pandemic on Life of Higher Education Students: A Global Perspective
The COVID-19 pandemic and mental health impacts
Safety and Efficacy of the BNT162b2 mRNA Covid-19 Vaccine
Covid-19: Pfizer/BioNTech vaccine judged safe for use in UK. BBC News
COVID-19 Vaccine Hesitancy Worldwide: A Concise Systematic Review of Vaccine Acceptance Rates
Understanding COVID-19 misinformation and vaccine hesitancy in context: Findings from a qualitative study involving citizens in Bradford, UK.
Health Expect
Is This the Era of Misinformation yet: Combining [...] Republic and Canton of Geneva, CHE: International World Wide Web Conferences Steering Committee
Network attacks and defenses: A hands-on approach
Applications of machine learning and artificial intelligence for Covid-19 (SARS-CoV-2) pandemic: A review
Coaid: Covid-19 healthcare misinformation dataset. ArXiv Prepr
Characterizing covid-19 misinformation communities using a novel twitter dataset
ReCOVery: A Multimodal Repository for COVID-19 News Credibility Research
A billion-scale dataset of 100+ languages for covid-19
CoAID-DEEP: An Optimized Intelligent Framework for Automated Detecting COVID-19 Misleading Information on Twitter
ArCOV19-rumors: Arabic COVID-19 twitter dataset for misinformation detection
COVID-19-FAKES: A Twitter (Arabic/English) Dataset for Detecting Misleading Information on COVID-19
CHECKED: Chinese COVID-19 fake news dataset. Soc Netw Anal Min
COVID-19 Vaccine Hesitancy on Social Media: Building a Public Twitter Dataset of Anti-vaccine Content, Vaccine Misinformation and Conspiracies
COVID-19 Vaccines: Characterizing Misinformation Campaigns and Vaccine Hesitancy on Twitter
A collection of English-language Twitter posts about COVID-19 vaccines
Association for Computing Machinery
Revealing public opinion towards COVID-19 vaccines with Twitter Data in the United States: a spatiotemporal perspective. medRxiv
100 Social Media Statistics For 2021
Vaccine Myths Debunked
Doctors Debunk Popular COVID-19 Vaccine Myths and Conspiracy Theories
COVID-19 Vaccine Facts
The COVID-19 Vaccine: Myths Versus Facts
The natural language toolkit. ArXiv Prepr
The english (porter2) stemming algorithm.
Retrieved
Deep learning
XGBoost: A Scalable Tree Boosting System
Using tf-idf to determine word relevance in document queries
Long short-term memory
Classifying Maqams of Qur'anic Recitations using Deep Learning
Glove: Global vectors for word representation
Rectified linear units improve restricted boltzmann machines
Dropout: a simple way to prevent neural networks from overfitting
A method for stochastic optimization
Pre-training of deep bidirectional transformers for language understanding
Transformers: State-of-the-art natural language processing
Machine Learning Approaches for EV Charging Behavior: A Review
Characteristics of Similar-Context Trending Hashtags in Twitter: A Case Study

The authors would also like to thank the medical experts for volunteering their time and effort in validating the annotated dataset.