key: cord-0164498-tul50e2p
title: Case Study on Detecting COVID-19 Health-Related Misinformation in Social Media
authors: Pritom, Mir Mehedi A.; Rodriguez, Rosana Montanez; Khan, Asad Ali; Nugroho, Sebastian A.; Alrashydah, Esra'a; Ruiz, Beatrice N.; Rios, Anthony
date: 2021-06-12

The COVID-19 pandemic has generated what public health officials have called an infodemic of misinformation. As social distancing and stay-at-home orders came into effect, many turned to social media for socializing. This increase in social media usage has made it a prime vehicle for the spread of misinformation. This paper presents a mechanism to detect COVID-19 health-related misinformation in social media following an interdisciplinary approach. Leveraging social psychology as a foundation and existing misinformation frameworks, we defined misinformation themes and associated keywords, which we incorporated into a misinformation detection mechanism using applied machine learning techniques. Next, using a Twitter dataset, we explored the performance of the proposed methodology with multiple state-of-the-art machine learning classifiers. Our method shows promising results, with up to 78% accuracy in classifying health-related misinformation versus true information using uni-gram-based NLP features from tweets and the Decision Tree classifier. We also provide suggestions on alternatives for countering misinformation and discuss the ethical considerations of the study.

The ongoing COVID-19 pandemic has brought an unprecedented health crisis. Along with the physical health-related effects, the pandemic has brought changes to daily social interactions, such as teleworking, social distancing, and stay-at-home orders, leading to high usage of social media (Statista, 2020). These conditions have paved the way for opportunistic bad actors to fish in troubled waters. From the beginning of the COVID-19 pandemic, an excessive amount of misinformation has spread across social and online digital media (Barua et al., 2020; Cinelli et al., 2020). These misinformation campaigns include hoaxes, rumors, propaganda, or conspiracy theories, often promoting themed products or services that supposedly protect against contracting the infection (Pritom et al., 2020; WHO, 2020). Social media became the main conduit for the spread of COVID-19 misinformation. The abundance of health-related misinformation spreading over social media presents a threat to public health, especially in controlling and mitigating the spread of COVID-19 (Islam et al., 2020). Misinformation campaigns also affect public attitudes towards health guidance compliance and hamper efforts to prevent the spread. In some cases, individuals have lost their lives by making decisions based on misinformation (Barua et al., 2020). Therefore, it is imperative to combat COVID-19 health-related misinformation to minimize its adverse impact on public health (Ali, 2020). In the past, researchers have focused on identifying social media misinformation (Yu et al., 2017) using various machine learning (ML) and deep learning-based approaches. However, the effectiveness of those approaches in tackling health-related COVID-19 misinformation on social media is unknown. There is also a lack of ground-truth, validated datasets to verify model accuracy for these previous research efforts.
Moreover, we find the existing literature lacking in coordinated interdisciplinary approaches to the misinformation problem that draw on social psychology, information operations, and data science (i.e., applied machine learning). Motivated by this gap, the primary objective of this study is to leverage interdisciplinary techniques to understand the COVID-19 health-related misinformation problem and to derive insights by detecting misinformation using Natural Language Processing (NLP) methods and Machine Learning (ML) classifiers. Grounded in existing work on misinformation propagation (Schneier, 2019) and source credibility theory (Hovland and Weiss, 1951) in social psychology, our study derives various credible health-related themes for understanding health misinformation and proposes a detection mechanism utilizing state-of-the-art techniques from NLP and applied machine learning. We use Twitter as our social media platform to study the effectiveness of the proposed methodology. We have collected, processed, labeled, and analyzed tweets to train and test the supervised ML classifiers. Finally, we have analyzed the performance of the classifiers under our mechanism using standard performance metrics: accuracy, precision, recall, F1-score, and Macro-F1-score. The major contributions of this paper are:
• Provide a detailed methodology with a prototype for detecting COVID-19 health-related misinformation from social media (i.e., Twitter).
• Propose a fairer social media (e.g., Twitter) annotation process for labeling misinformation.
• Provide a labeled ground-truth dataset for COVID-19 health-related tweets for future model verification.
• Evaluate the efficacy of state-of-the-art classifiers in detecting COVID-19 health-related misinformation tweets, leveraging different NLP text representation methods such as Bag-of-Words (BoW) and n-grams.
Paper outline. Section 2 presents the problem background, motivation, and related work. Section 3 presents the research methodology and experiment details, including the dataset collection, processing, and analysis steps. Section 4 discusses the experiment results, limitations, future research directions, and ethical considerations of the present study. Section 5 concludes the paper.

We find that misinformation research has also been driven by the COVID-19 pandemic, with a great deal of ongoing research on fake news and social media misinformation. In this section, we draw from previous works on how existing misinformation detection research leverages Natural Language Processing (NLP), Machine Learning (ML), and interdisciplinary techniques such as the information operations kill chain (i.e., the steps for the propagation of misinformation) and social psychology. Existing work (Zhang, 2020; Sylvia Chou and Gaysynsky, 2020) highlights how the COVID-19 infodemic has added challenges for the public health community. An infodemic is the product of an overabundance of information that undermines public health efforts to address the pandemic (WHO, 2020). The effects of COVID-19 misinformation also impact law enforcement and public safety entities (Gradoń, 2020). One study finds that social media have a higher prevalence of misinformation than news outlets (Bridgman et al., 2020). Another study highlights that an increase in Twitter conversation about COVID-19 is also a good predictor of COVID-19 regional contagion (Singh et al., 2020).
The authors in (Roozenbeek et al., 2020) and (Bridgman et al., 2020) report that exposure to misinformation increases individuals' misconceptions about COVID-19 and lowers their compliance with public health prevention guidelines. Our approach is also influenced by the Information Operations Kill Chain (Schneier, 2019). The framework is based on the Russian "Operation Infektion" misinformation campaign and provides the basis for our focus on existing grievances. A critical characteristic of misinformation is that it propagates through existing channels by aligning messages with the preexisting grievances and beliefs of a group (Schneier, 2019; Cyber-Digital Task Force, 2018). Using existing media associated with a credible source (i.e., credible from the audience's perspective) makes a message more likely to be accepted by the audience (Hovland and Weiss, 1951). Moreover, Islam et al. (2020) reveal that the oversupply of health-related misinformation, fueled by rumors, stigma, and conspiracy theories on social media platforms, has critical, adverse implications for individuals and communities. Studies comparing the performance of different ML algorithms have been conducted in the literature. For instance, (Choudrie et al., 2021) analyzes how older adults process various kinds of infodemic content about COVID-19 prevention and cure using Decision Tree and Convolutional Neural Network techniques. Although that study focuses on COVID-19 health-related misinformation, the data was collected via online interviews with 20 adults. (Mackey et al., 2021) present an application of unsupervised learning to detect misinformation on Twitter using "hydroxychloroquine" as the keyword. However, that study is scoped only to misinformation related to the word "hydroxychloroquine," one of the many health keywords we have used to filter health-related tweets. Again, (Patwa et al., 2020) present a manually annotated dataset containing 10,700 social media posts and articles from various sources, such as Twitter and Facebook, and analyze the performance of ML methods in detecting fake news related to COVID-19. The ML models explored in that study were Decision Tree, Logistic Regression, Gradient Boosting, and Support Vector Machine. They did not focus on health-related misinformation, which is the scope of the current study. Next, (Gundapu and Mamidi, 2021) used supervised ML and deep learning transformer models (namely BERT, ALBERT, and XLNET) for COVID-19 misinformation detection. Like the previous study, they did not provide insights on any health-related themes or keywords. In (Park et al., 2020), an investigation of information propagation and news sharing behaviors related to COVID-19 in Korea is performed using content analysis on real-time Twitter data shared by top news channels. The results show that the spread of COVID-19 related news articles delivering medical information is more significant than that of non-medical information; hence, medical information dissemination impacts the public health decision-making process. NLP for COVID-19. NLP methods have also been leveraged to detect COVID-19 misinformation on YouTube (Serrano et al., 2020) and Twitter (Al-Rakhami and Al-Amri, 2020). Serrano et al.
(2020) studied catching COVID-19 misinformation videos on YouTube by extracting user conversations from the comments and proposed a multi-label classifier. Next, Al-Rakhami and Al-Amri (2020) use a two-level ensemble-learning-based framework with Naive Bayes, k-Nearest Neighbor, Decision Tree, Random Forest, and SVM to classify misinformation based on the online credibility of the author. They define credibility based on user-level and tweet-level features, leveraging NLP methods. Their findings show that features like account validation (IsV), number of retweets (NoRT), number of hashtags (NoHash), number of mentions (NoMen), and profile follow rate (FlwR) are good predictors of credibility. In our work, however, we have not used any user-level information for classifying misinformation and have relied only on the texts of the corresponding tweets. These user-level features may be added as a complement to our methodology for more accurate detection models. Again, we find a COVID-19 related study that leveraged machine learning algorithms with Bag of Words (BoW) NLP-based features to classify COVID-19 diagnoses from textual clinical reports (Khanday et al., 2020). Other Related Works. One study selected the top 75 most-viewed videos with the keywords 'coronavirus' and 'COVID-19' and analyzed them for reliability scoring, using the proposed novel COVID-19 Specific Score (CSS), modified DISCERN (mDISCERN), and modified JAMA (mJAMA) scores. In (Ahmed et al., 2020), the authors highlight the drivers of misinformation and strategies to mitigate it by considering COVID-19 related conspiracy theories on Twitter, where they observed ordinary citizens to be the most critical drivers of the conspiracy theories. This finding highlights the need for misinformation detection and content removal policies for social media.

In this study, we first try to understand the types of COVID-19 health-related misinformation that have been disseminated during the pandemic. We map them onto the Information Operations Kill Chain (Schneier, 2019) to understand the steps used to conduct misinformation operations. Based on this understanding, we have studied some popular COVID-19 related hoaxes and misinformation articles (Lytvynenko, 2020a,b; Gregory and McDonald, 2020; World Health Organization, 2020) to derive various themes and keywords. Themes describe the pre-existing grievances and beliefs of a group, and keywords are words specific to COVID-19 health-related misinformation that align with a particular theme. These keywords help us collect and filter relevant tweets from the sheer volume of COVID-19 related tweets. Again, we have selected only a few days for collecting Twitter data due to resource and time constraints. However, to increase the coverage of common COVID-19 health-related tweets, we selected dates on which a COVID-19 related event occurred in the United States. This pilot study reveals the efficacy of our proposed methodology. The selected dates cover tweets from the first two months (March-April 2020) through the first four months (June-July 2020) after the global pandemic declaration on March 11, 2020 (Cucinotta and Vanelli, 2020). In this section, we present our methodology in the following modules: (i) Twitter Dataset Collection, (ii) Annotation of Tweets, (iii) Analyzing Tweets, (iv) Classification Tasks for Detection Model, and (v) Performance Evaluation.
We only consider collecting the tweets, or user-generated content, posted by anonymous individuals on Twitter. This study does not infer or reveal the authors of the tweets, as we extract only the textual parts of a tweet. The biggest challenge in collecting quality tweets resembles a "needle in a haystack" problem, as many COVID-19 related tweets are posted on Twitter daily. We have focused only on health-related tweets because they directly impact public health if people trust misguided tweets. Although it is possible to collect the related tweets directly through the Twitter API via Tweepy (Roesslein, 2020), the API imposes rate limitations. As an alternative, we use the COVID-19 Tweets dataset (Lamsal, 2020) from IEEE Dataport because this dataset (i) is publicly available online, (ii) provides COVID-19 related tweet IDs daily from as early as March 2020, and (iii) collects tweets from all over the world (thereby imposing no regional limitation). We have selected the following four days: 04/02/2020 (Stay-at-Home orders in many US states), 04/24/2020 (POTUS comments on the use of disinfectant against COVID-19 go viral), 06/16/2020 (reports published on the use of common drugs against COVID-19 (Mueller and Rabin, 2020)), and 07/02/2020 (face cover mandates in many US states). After selecting the dates, we download the full dataset for each date from IEEE Dataport (Lamsal, 2020). The data contains all the tweet IDs, but not the tweet texts themselves. Next, we use the Hydrator app (Documenting the Now, 2020) to extract the actual tweets. For each selected day, we extract 10,000 tweet IDs and collect those tweets for further processing. We limit our collection to 10,000 tweets per day because of resource and time limitations. We observe that the tweets extracted from the Hydrator app are truncated to a maximum of 280 characters. Next, we identify various themes of ongoing COVID-19 health-related misinformation (as shown in Table 1) and define a glossary of COVID-19 health-related keywords (shown in Table 2) to filter only the interesting and relevant health-related tweets. This filtering resulted in a total of 2,615 unique tweets across all four selected dates. The health keyword glossary combines COVID-19 health-related misinformation (based on the themes) and true information. A minimal sketch of this filtering step is given below.
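As an illustration of the filtering step just described, the sketch below reads hydrated tweets from a CSV file and keeps unique tweets matching a health keyword glossary. This is a minimal sketch under assumed inputs: the keyword set shown is a small illustrative sample (the study's full glossary is in Table 2), and the file layout, column name, and helper names are hypothetical.

```python
# Hypothetical sketch of the keyword-based filtering of hydrated tweets.
import csv

HEALTH_KEYWORDS = {"vaccine", "cure", "hydroxychloroquine", "disinfectant",
                   "mask", "immunity", "drug", "treatment"}  # sample only

def is_health_related(tweet_text: str) -> bool:
    """Return True if the tweet contains any health-related keyword."""
    tokens = tweet_text.lower().split()
    return any(tok.strip('.,!?#') in HEALTH_KEYWORDS for tok in tokens)

def filter_tweets(in_csv: str, out_csv: str, text_col: str = "text") -> int:
    """Keep only unique, health-related tweets from a hydrated-tweet CSV."""
    kept = 0
    seen = set()
    with open(in_csv, newline='', encoding='utf-8') as fin, \
         open(out_csv, 'w', newline='', encoding='utf-8') as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            text = row[text_col]
            if text not in seen and is_health_related(text):
                writer.writerow(row)   # passes both uniqueness and keyword tests
                seen.add(text)
                kept += 1
    return kept
```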
In this study, we apply manual annotation to the filtered tweets to label them. Initially, we defined five class labels for annotating the filtered tweets: (i) true information (T), (ii) misinformation (M), (iii) incomplete (I), (iv) not health-related (N), and (v) unsure (U). Here, true information is COVID-19 related health facts supported by scientific evidence; misinformation is inaccurate COVID-19 health-related information that health organizations like the WHO and CDC have discredited; incomplete information is a truncated tweet that cannot be verified as a complete statement; not health-related is any tweet about COVID-19 that does not directly relate to health information; and the unsure class contains tweets where the annotator is unsure about the exact categorization into any of the first four labels. The same set of tweets is independently labeled by multiple annotators (i.e., our research team members). We rely on majority voting among the tweet labels to finalize the label for each tweet. Moreover, any tweet without a winning label is finalized through open discussion among the group of annotators to reach a unanimous agreement. This process produced 314 tweets labeled as T, 210 labeled as M, 173 labeled as I, and 1,918 labeled as N. To aid future research on COVID-19 health-related misinformation detection in social media, we would be happy to share our annotated ground-truth dataset through valid organizational email requests only. We believe the dataset can serve as a basis for improved future research.

We have only considered the tweets with class labels M and T for the health-related misinformation detection study. First, we tokenize the tweets to build two sets of tokens: (i) true-information tokens and (ii) misinformation tokens. Next, to improve model performance, we remove the default English stop words ($SW_{english}$) listed in (RANKS.NL) and additional trivial COVID-19 words from the true-information and misinformation tweets. These trivial words were identified by analyzing the most frequent tokens in both sets of tweets. Some of the highlighted trivial COVID-19 words are included in the set $SW_{trivial}$ = {covid19, covid, covid-19, coronavirus, corona, covid_19, health}. Moreover, it is important to clean up the tweets for reliable and effective analysis, as many tweets are messy. The cleanup process includes transforming all tweets to lowercase to avoid redundancy, and removing incomplete links, unnecessary punctuation, irrelevant characters (e.g., "...", ",", "@"), non-meaningful single-character tokens, and digit-only tokens (e.g., "100"). In total, we removed 222 stop words (denoted $SW$, where $SW = SW_{english} \cup SW_{trivial}$). Next, we apply the Python NLTK SnowballStemmer (NLTK, 2020) to stem the tweets and extract the root forms of words for a more generalized model. After these steps, we generate features from the tweet texts for the supervised learning classifiers. This study relies only on text features (extracted from the individual tweets) to classify health misinformation versus true information. We use the popular Bag of Words (BoW) (Zhang et al., 2010) and n-gram (Fürnkranz, 1998) NLP methods for feature extraction from the tweets.

Bag of Words (BoW): This method uses raw word frequencies in a sentence as features. With BoW, a tweet is represented as a set of tokens, where each token is a single connected word (i.e., no spaces in between), together with a frequency count for that token in the tweet. Any tweet $tw_i$ containing multiple BoW tokens, where each token $tok_j = w_j \notin SW$ and the class label of the $i$-th tweet is $l_i \in \{M, T\}$, can be represented as a BoW tokenized set $tokenized^B_i = P^B_i \cup A^B_i$. Here, $P^B_i$ is the set of tokens present and $A^B_i$ the set of tokens absent in the $i$-th tweet $tw_i$, derived as in Eqs. 1 and 2, where $Freq(tok_j)$ denotes the frequency count of the $j$-th token within the $i$-th tweet. A sketch of this preprocessing and BoW featurization is given below.
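The sketch below illustrates the cleanup, stop-word removal, stemming, and BoW featurization just described, using the NLTK SnowballStemmer mentioned above. It is a minimal sketch rather than the study's exact pipeline: the stop-word sample stands in for the full 222-word list $SW$, and the function names are ours.

```python
# Hypothetical sketch of tweet cleanup and BoW feature extraction.
import re
from nltk.stem.snowball import SnowballStemmer

SW_TRIVIAL = {"covid19", "covid", "covid-19", "coronavirus", "corona",
              "covid_19", "health"}
SW_ENGLISH = {"the", "a", "an", "is", "are", "of", "to", "and", "in"}  # sample
STOP_WORDS = SW_ENGLISH | SW_TRIVIAL   # stands in for SW = SW_english U SW_trivial
stemmer = SnowballStemmer("english")

def clean_and_tokenize(tweet: str) -> list:
    """Lowercase, strip links/mentions/punctuation/digits, drop stop words, stem."""
    text = tweet.lower()
    text = re.sub(r"http\S+|@\w+", " ", text)   # incomplete links and @mentions
    text = re.sub(r"[^a-z\s]", " ", text)       # punctuation and digit-only noise
    tokens = [t for t in text.split()
              if len(t) > 1 and t not in STOP_WORDS]
    return [stemmer.stem(t) for t in tokens]

def bow_features(tweet: str) -> dict:
    """BoW representation: token -> raw frequency count within the tweet."""
    feats = {}
    for tok in clean_and_tokenize(tweet):
        feats[tok] = feats.get(tok, 0) + 1
    return feats
```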
n-grams: This method uses sequences of $n$ words (where $n \in \{1, 2, 3\}$) as binary features. For 1-grams (or uni-grams), a single-word sequence is considered a token. Unlike the BoW method, the 1-gram method uses binary feature values (1 or 0), stating whether the token is present in a tweet or not, without recording multiple instances of the token. Next, we use 2-grams (or bi-grams) as features, which take all sequences of two words as tokens and store them as binary values. Lastly, the 3-gram (or tri-gram) features use all valid sequences of three words as tokens, with binary value 1 (if the token is present in the tweet) or 0 (if not). Now, any tweet $tw_i$ containing n-gram tokens $tok^n_j$ for all $j$, where $n \in \{1, 2, 3\}$ and the class label of the $i$-th tweet is $l_i \in \{M, T\}$, can be represented as an n-gram tokenized set $tokenized^n_i = P^n_i \cup A^n_i$. Here, $P^n_i$ and $A^n_i$ are the sets of tokens present and absent in tweet $tw_i$, derived by Eq. 3 and Eq. 4, respectively. In Eq. 3, $\forall w \in tok^n_j: w \in tw_i$ denotes the presence of all words of the $j$-th token in tweet $tw_i$. The n-gram tokens are further derived based on the value of $n$ (Eq. 5): the uni-gram method ($n=1$) uses single words $w_j \in tw_i$ as tokens; the bi-gram method ($n=2$) uses pairs of words $(w_j, w_{j+1})$ as tokens, where both $w_j, w_{j+1} \in tw_i$; and the tri-gram method ($n=3$) uses triples of words $(w_j, w_{j+1}, w_{j+2})$ as tokens, where $w_j, w_{j+1}, w_{j+2} \in tw_i$. Moreover, any tweet containing $J$ words has $(J - n + 1)$ tokens under the n-gram method. The selection of the feature extraction method plays a critical role in health-related misinformation detection.
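A corresponding sketch for the binary n-gram features follows. Again this is illustrative: it reuses the hypothetical clean_and_tokenize helper from the previous sketch and encodes only token presence, matching the binary representation of Eqs. 3-5.

```python
# Hypothetical sketch of binary n-gram featurization.
# A tweet of J words yields J - n + 1 tokens, each mapped to 1 if present.

def ngram_features(tweet: str, n: int = 1) -> dict:
    """Binary n-gram features: each n-word sequence maps to 1 (present)."""
    words = clean_and_tokenize(tweet)
    feats = {}
    for j in range(len(words) - n + 1):    # J - n + 1 tokens
        token = " ".join(words[j:j + n])   # w_j ... w_{j+n-1}
        feats[token] = 1                   # presence only, no frequency counts
    return feats

# Example: a 12-word tweet yields 12 uni-grams, 11 bi-grams, and 10 tri-grams.
```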
To prepare the training and testing datasets, the set of tokenized misinformation tweets is denoted $D_M$ and the set of tokenized true-information tweets $D_T$; here, $m \in \{B, n\}$ represents the feature extraction method. Next, we merge these datasets as $D_{merge} = D_M \cup D_T$. Then, $D_{merge}$ is randomly split with an 80:20 ratio into training data ($D_{train}$) and test data ($D_{test}$), along with the respective tweet labels. In this pilot study, our training dataset contains $|D_{train}| = 419$ tweets, of which 164 are misinformation and 255 are true information. The testing dataset contains $|D_{test}| = 105$ tweets, of which 46 are misinformation and 59 are true information. Note that individual tweets shuffle between training and test sets while the training and test sizes remain constant, which causes the number of tokens in training and test to differ between the BoW and 1-gram methods. For the BoW method, we observed 5,268 words (or tokens) in $D_{train}$, giving a vocabulary of 2,301 unique words, and 1,318 words in $D_{test}$, giving a vocabulary of 838 unique words. Moreover, the tweets in $D_{train}$ contain 12.57 words on average, with a maximum of 32 and a minimum of 2 words. Likewise, the tweets in $D_{test}$ contain 12.55 words on average, with a maximum of 28 and a minimum of 2 words, indicating a similar distribution of tweets across the training and testing datasets under this method. For the uni-gram method, we have 5,232 tokens in $D_{train}$, giving 2,286 unique uni-gram tokens, and 1,354 tokens in $D_{test}$, giving 848 unique uni-gram tokens. The average token length for the training set is 12.49 tokens, with a maximum of 32 and a minimum of 2, while the average token length for the test set is 12.90 tokens, with a maximum of 30 and a minimum of 4. Next, for bi-grams, we have 4,813 tokens in $D_{train}$, giving 4,298 unique bi-gram tokens, and 1,249 tokens in $D_{test}$, giving 1,171 unique bi-gram tokens. The average token length for the training set is 11.49 tokens, with a maximum of 31 and a minimum of 1, while the average token length for the test set is 11.90 tokens, with a maximum of 29 and a minimum of 3. Lastly, for tri-grams, we have 4,394 tokens in $D_{train}$, giving a tri-gram vocabulary (unique tokens) of size 4,166, and 1,144 tokens in $D_{test}$, giving a vocabulary of size 1,109. The average token length for the training set is 10.49 tokens, and for the test set 10.90 tokens.

The purpose of the classification task is to separate COVID-19 health-related tweets into true information and misinformation. We have used multiple machine learning classifiers commonly used in the literature for text classification. The present study considers the Decision Tree (DT) (Song and Lu, 2015), Naive Bayes (NB) (Langley and Sage, 2013), Random Forest (RF) (Biau, 2012), Support Vector Machine (SVM) (Evgeniou and Pontil, 2001), and Maximum Entropy Modeling (MEM) (Berger et al., 1996) classifiers. We selected these classifiers because (i) they are popular, (ii) they are readily available under the Python nltk.classify and nltk.classify.scikitlearn API libraries (NLTK, 2020), and (iii) they provide a basis for judging whether the proposed methodology is a viable way of tackling COVID-19 themed health-related misinformation. These classifiers are trained on the input features extracted from the tweets (e.g., tokens) and the corresponding numerical labels assigned through the manual annotation process described in Section 3.2. The steps involved in training the classification models are shown in Fig. 1. We use the following standard evaluation metrics to evaluate our detection (i.e., classification) methodology: class-wise Precision, class-wise Recall, class-wise F1-score, classification Accuracy, and Macro-F1-score:
• $\text{Accuracy} = \frac{1}{2}\sum_{c\in\{M,T\}} \text{Accuracy}_c$, where the individual class accuracy is $\text{Accuracy}_c = \frac{TP_c + TN_c}{TP_c + TN_c + FP_c + FN_c}$.
• $\text{Precision}_c = \frac{TP_c}{TP_c + FP_c}$, for the selected class $c \in \{M, T\}$.
• $\text{Recall}_c = \frac{TP_c}{TP_c + FN_c}$, where $c \in \{M, T\}$ denotes the class.
• $\text{F1-score}_c = \frac{2 \times \text{Recall}_c \times \text{Precision}_c}{\text{Recall}_c + \text{Precision}_c}$, for $c \in \{M, T\}$.
• $\text{Macro-F1-score} = \frac{1}{2}\sum_{c\in\{M,T\}} \text{F1-score}_c$.
Here, $TP_c$ is the total number of true positives, $TN_c$ the total number of true negatives, $FP_c$ the number of false positives, and $FN_c$ the total number of false negatives for class label $c \in \{M, T\}$. A sketch of training and evaluating one such classifier is given below.
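The sketch below shows how one of these classifiers could be trained and scored over an 80:20 split using the nltk.classify.scikitlearn wrapper named above. The `dataset` variable, the fixed random seed, and the use of scikit-learn's classification_report to obtain the class-wise and macro-averaged metrics are our assumptions, not the paper's exact setup.

```python
# Hypothetical sketch: train a Decision Tree over n-gram features and report
# the class-wise precision/recall/F1 and macro-F1 metrics defined above.
# `dataset` is assumed to be a list of (tweet_text, label) pairs, labels in {"M", "T"};
# ngram_features is the helper sketched earlier.
import random
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

def train_and_evaluate(dataset, n=1, seed=42):
    data = [(ngram_features(text, n), label) for text, label in dataset]
    random.Random(seed).shuffle(data)
    split = int(0.8 * len(data))               # 80:20 train/test split
    train, test = data[:split], data[split:]

    clf = SklearnClassifier(DecisionTreeClassifier()).train(train)

    gold = [label for _, label in test]
    pred = clf.classify_many([feats for feats, _ in test])
    print(classification_report(gold, pred, digits=3))  # per-class and macro scores
    return clf
```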
Table 3 shows that the uni-gram text representation method with the Decision Tree classifier achieves the best classification accuracy of 78%. We also observe that the uni-gram method outperforms all other methods for all classifiers except SVM, where the bi-gram method reaches 75% accuracy over the uni-gram method's 70%. Table 3 also highlights that all classifiers achieve at least 74% classification accuracy with at least one of the text representation methods. We also examined the class-wise F1-scores for both the misinformation and true information classes, which further validates that the uni-gram method consistently outperforms the other n-gram and BoW methods for the misinformation class label. The classifiers perform much better at detecting tweets labeled as true information, with F1-scores ranging from 0.783 to 0.833 (i.e., for at least one of the text representation methods) across classifiers. In contrast, the F1-score for misinformation detection is always highest for the uni-gram method, with a maximum of 0.683 for the NB classifier. One reason the true information class scores better than the misinformation class is the imbalance ratio of 1:0.686 for true information to misinformation in the dataset (see the data distribution for both classes in Section 3.3). Moreover, in reality, the number of tweets with true information is usually much higher than that of actual misinformation on social media, which is mirrored in this study. We also observe that the highest Macro-F1-score of 0.755 is achieved by the Decision Tree classifier with the uni-gram method, which further indicates the effectiveness of uni-gram modeling for classifying health-related tweets. Next, regarding precision and recall, we find that some methods (e.g., bi-grams, tri-grams) achieve perfect precision (1.0) or recall (1.0) for one of the classes. However, whenever a classifier achieves perfect precision (very few false positives) for the misinformation class, it has a significantly lower recall of around 0.2 (very many false negatives). Such behavior indicates a classifier biased towards one of the class labels; hence, we prefer classifiers whose precision and recall are more balanced across both class labels. Another insight from the current analysis is that the bi-gram and tri-gram methods do not perform well because most bi-gram and tri-gram tokens are unique and are not repeated in the dataset for either class label. We also note that the small sample size (n = 524), covering both the training and testing phases, is not enough for a rigorous study, but it is certainly a first step towards exploring the significance of the problem. We believe that with a larger dataset, our methodology's performance could be further generalized.

There are still some limitations to the study. First, we do not infer a tweet's meaning. For example, a classifier failure mode can be illustrated with two sentences: sentence-1 = "Hydroxychloroquine is a medicine for COVID-19" and sentence-2 = "Hydroxychloroquine is not a medicine for COVID-19". Though sentence-2 = ¬sentence-1 and sentence-2 is true information, it does not have enough samples in the training dataset and may be classified as misinformation. Thus, to handle such scenarios, we would need additional methods that distinguish negation from affirmation in a tweet to obtain more accurate models. We could also include anti-misinformation (i.e., true information countering existing misinformation) in our training datasets to make the model more proactive in detecting misinformation campaigns. Second, we selected only four dates and 10,000 tweets per day due to the lack of time and resources to invest in the manual annotation process. Moreover, some of the tweet IDs (around 5-10% of the 10,000 tweets each day) yielded empty tweets (i.e., removed by Twitter or the author). Third, the lack of ground-truth datasets forced us to label the data manually in order to train supervised learning-based classifiers. These manual tasks may contain impurities and errors, but we used our best judgment and followed best practices, applying various mechanisms (e.g., group discussion, majority voting) to bring fairness to the annotation process. Fourth, we have only considered the BoW and n-gram methods for this pilot study and have not examined other methods such as TF-IDF and word embeddings.
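For concreteness, the sketch below shows how TF-IDF features, one of the unexamined alternatives just mentioned, could be swapped into a comparable pipeline with scikit-learn. It is not part of the study's method, and all names in it are ours.

```python
# Hypothetical TF-IDF alternative, not part of the study's pipeline.
# `texts` and `labels` are assumed to be parallel lists of tweet strings
# and their {"M", "T"} labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def tfidf_baseline(texts, labels):
    X_tr, X_te, y_tr, y_te = train_test_split(
        texts, labels, test_size=0.2, random_state=42)   # 80:20, as in the study
    vec = TfidfVectorizer(ngram_range=(1, 1), stop_words="english")
    clf = DecisionTreeClassifier().fit(vec.fit_transform(X_tr), y_tr)
    return clf.score(vec.transform(X_te), y_te)          # held-out accuracy
```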
Moreover, the BoW method has its own shortcomings (Swarnkar, 2020), which are also present in the current study. Fifth, misinformation spread through images and videos is not considered in this study. Sixth, we only analyze Twitter as a social media platform, but we believe the methodology is applicable to other platforms (e.g., Facebook, Instagram) with minimal tweaks. However, multimedia-based platforms (e.g., TikTok) would need different approaches.

Although it is essential to understand how to mitigate the effects of misinformation, there are some ethical considerations. Misinformation is commonly encountered in the form of selective information or partial truths in conversations about controversial topics (Schneier, 2019). An example is the statistics on Black-on-Black crime that are used to explain the over-policing of Black communities (Braga and Brunson, 2015). From this perspective, misinformation can be generated when the available information is incomplete (e.g., in developing news stories) or when new findings appear that contradict existing beliefs (Morawska and Cao, 2020). We think misinformation detection mechanisms must consider these factors to avoid being viewed as censorship or a violation of freedom of speech (Kaiser et al., 2020). To ensure freedom of speech, we have opted not to label tweets as misinformation when we feel the author is merely raising questions about the existing health system, therapeutics, or policies for tackling this pandemic. We have also ensured that we do not violate the privacy of any Twitter user by not using any sensitive account information about a tweet's owner (the author of any tweet labeled as T or M). We have also cleaned up our tweets using regular expressions where texts tag other Twitter users with '@TwitterID' mentions. Lastly, we believe a better policy for addressing misinformation would be to provide correct information that does not threaten individuals' existing beliefs (Chan et al., 2017) but deters them from harmful behavior, instead of blocking misinformation content. We recommend that future studies investigate and systematize the solution space in this direction.

In the future, it would be interesting to explore detection mechanisms with explainable classifiers to bring trustworthiness to misinformation detection research. Another avenue of future work would be to study network activities such as retweets, likes, and follower counts for accounts tweeting health misinformation, to determine whether bots or real accounts disseminated it. An analysis of those network elements could validate existing misinformation propagation frameworks (Cyber-Digital Task Force, 2018) for health-related misinformation, which could be leveraged to develop proactive mechanisms that detect unseen misinformation activity in real time. Finally, along with detection, the correction of health-related misinformation within communities needs to be studied. Because of the continued influence effect, whereby misinformation once presented continues to influence later judgments, compelling facts have been proposed as a medium of correction (Lewandowsky et al., 2012); this needs to be further analyzed in the context of COVID-19 health misinformation.

In this paper, we presented a methodology for health misinformation detection on Twitter leveraging state-of-the-art techniques from NLP, ML, misinformation propagation, and existing social psychology research. We find that extracting and annotating quality data for misinformation research is challenging.
We also identify a gap in the availability of ground-truth datasets, which our effort addresses. Our findings highlight that the Decision Tree classifier outperforms all other classifiers, achieving 78% classification accuracy with simple uni-gram features for text representation. We recommend that future studies systematize the understanding of health-related misinformation dissemination in social media and its impact on public health at large.

References

Ahmed et al. 2020. COVID-19 and the "film your hospital" conspiracy theory: Social network analysis of Twitter data.
Al-Rakhami and Al-Amri. 2020. Lies kill, facts save: Detecting COVID-19 misinformation in Twitter.
Ali. 2020. Combatting against COVID-19 & misinformation: A systematic review.
Barua et al. 2020. Effects of misinformation on COVID-19 individual responses and recommendations for resilience of disastrous consequences of misinformation.
Berger et al. 1996. A maximum entropy approach to natural language processing.
Biau. 2012. Analysis of a random forests model.
Braga and Brunson. 2015. The police and public discourse on "Black-on-Black" violence. US Department of Justice.
Bridgman et al. 2020. The causes and consequences of COVID-19 misperceptions: Understanding the role of news and social media.
Chan et al. 2017. Debunking: A meta-analysis of the psychological efficacy of messages countering misinformation.
Choudrie et al. 2021. Machine learning techniques and older adults processing of online information and misinformation: A COVID-19 study.
Cinelli et al. 2020. The COVID-19 social media infodemic.
Cucinotta and Vanelli. 2020. WHO declares COVID-19 a pandemic.
Cyber-Digital Task Force. 2018. Attorney General's Cyber-Digital Task Force.
Evgeniou and Pontil. 2001. Support vector machines: Theory and applications.
Fürnkranz. 1998. A study using n-gram features for text categorization. Austrian Research Institute for Artificial Intelligence.
Gradoń. 2020. Crime in the time of the plague: Fake news pandemic and the challenges to law enforcement and the intelligence community.
Gregory and McDonald. 2020. Trail of deceit: The most popular COVID-19 myths and how they emerged.
Gundapu and Mamidi. 2021. Transformer-based automatic COVID-19 fake news detection system.
Hovland and Weiss. 1951. The influence of source credibility on communication effectiveness.
Islam et al. 2020. COVID-19-related infodemic and its impact on public health: A global social media analysis.
Kaiser et al. 2020. Adapting security warnings to counter misinformation.
Khanday et al. 2020. Machine learning based approaches for detecting COVID-19 using clinical text data.
Lamsal. 2020. Coronavirus (COVID-19) Tweets Dataset. IEEE Dataport.
Langley and Sage. 2013. Induction of selective Bayesian classifiers.
Lewandowsky et al. 2012. Misinformation and its correction: Continued influence and successful debiasing.
YouTube as a source of information on COVID-19: A pandemic of misinformation?
Lytvynenko. 2020a. Here are some of the coronavirus hoaxes that spread in the first few weeks.
Lytvynenko. 2020b. Here's a running list of the latest hoaxes spreading about the coronavirus.
Mackey et al. 2021. Application of unsupervised machine learning to identify and characterise hydroxychloroquine misinformation on Twitter.
Morawska and Cao. 2020. Airborne transmission of SARS-CoV-2: The world should face the reality.
Mueller and Rabin. 2020. Common drug reduces coronavirus deaths.
From fighting COVID-19 pandemic to tackling sustainable development goals: An opportunity for responsible information systems research. 2020.
Park et al. 2020. Conversations and medical news frames on Twitter: Infodemiological study on COVID-19 in South Korea.
Patwa et al. 2020. Fighting an infodemic: COVID-19 fake news dataset.
Pritom et al. 2020. Characterizing the landscape of COVID-19 themed cyberattacks and defenses.
Roesslein. 2020. Tweepy: Twitter for Python.
Roozenbeek et al. 2020. Susceptibility to misinformation about COVID-19 around the world.
Schneier. 2019. Toward an information operations kill chain.
Serrano et al. 2020. NLP-based feature extraction for the detection of COVID-19 misinformation videos on YouTube.
Singh et al. 2020. A first look at COVID-19 information and misinformation sharing on Twitter.
Song and Lu. 2015. Decision tree methods: Applications for classification and prediction. Shanghai Archives of Psychiatry.
Statista. 2020. Estimated U.S. social media usage increase due to coronavirus home isolation.
Swarnkar. 2020. Bag of words: Approach, Python code, limitations. https://blog.quantinsti.com/bag-of-words/#Limitations-of-Bag-of-Words.
Sylvia Chou and Gaysynsky. 2020. A prologue to the special issue: Health misinformation on social media.
Documenting the Now. 2020. Hydrator [computer software].
Yu et al. 2017. A convolutional approach for misinformation identification.
Zhang et al. 2010. Understanding bag-of-words model: A statistical framework.