Characterizing COVID-19 Misinformation Communities Using a Novel Twitter Dataset

Shahan Ali Memon and Kathleen M. Carley

2020-08-03

From conspiracy theories to fake cures and fake treatments, COVID-19 has become a hotbed for the spread of misinformation online. It is more important than ever to identify methods to debunk and correct false information online. In this paper, we present a methodology and analyses to characterize the two competing COVID-19 misinformation communities online: (i) misinformed users, or users who are actively posting misinformation, and (ii) informed users, or users who are actively spreading true information or calling out misinformation. The goals of this study are two-fold: (i) collecting a diverse, annotated COVID-19 Twitter dataset that can be used by the research community to conduct meaningful analysis; and (ii) characterizing the two target communities in terms of their network structure, linguistic patterns, and membership in other communities. Our analyses show that COVID-19 misinformed communities are denser and more organized than informed communities, with a possibility that a high volume of the misinformation is part of disinformation campaigns. Our analyses also suggest that a large majority of misinformed users may be anti-vaxxers. Finally, our sociolinguistic analyses suggest that COVID-19 informed users tend to use more narratives than misinformed users.

With the emergence of the COVID-19 pandemic, political and medical misinformation has escalated to create what is commonly referred to as the global infodemic. False information has hampered proper communication and affected decision-making [BE+20]. This makes debunking false information vitally important. According to one study [TLC15], if left undisputed, misinformation can in fact exacerbate the spread of the epidemic itself. The process of debunking misinformation, however, is complex and not completely understood [CJHJA17]. To conduct any intervention, it is first necessary to identify the misinformation as well as the misinformed communities. Because of the scarcity of data and the diversity of misinformation themes, this is already a challenging task in itself, but it is also not enough. A second, and arguably more important, aspect of an intervention is to correct and change the beliefs of the misinformed communities. To do this, it is important to understand how different communities interact, which communities they belong to, and what their preferences are. In this paper, we characterize the COVID-19 misinformation communities on Twitter in terms of their network structure, linguistic patterns, and membership in other misinformation and disinformation communities. In the process, we also design and collect a large annotated dataset with a comprehensive codebook, which we make available for the community to use for further analysis and for building misinformation detection models.

In a short amount of time, many COVID-19 datasets have been released. Most of these datasets are generic and lack annotations or labels.
Examples include multilingual corpora on a wide variety of COVID-19 topics [CLF20, AMEP+20, HJB+20], a longitudinal Twitter chatter dataset [BTW+20], a multilingual dataset with user location information [QIO20], a Twitter dataset of Arabic tweets [AAA20], a Twitter dataset of popular Arabic tweets [HHSE20], and a dataset for identifying stance, replies, and quotes [VCKBC20]. Most of these datasets either have no annotations at all, employ automated annotations using transfer learning or semi-supervised methods, or are not specifically designed for misinformation. In terms of datasets collected for COVID-19 misinformation analysis and detection, examples include CoAID [CL20], which contains automatic annotations of tweets, replies, and claims for fake news; ReCOVery [ZMFZ20], a multimodal dataset of tweets sharing reliable versus unreliable news, annotated via distant supervision; FakeCovid [SN20], a multilingual cross-domain fake news detection dataset with manual annotations; and [DSW20], a large-scale Twitter dataset also focused on fake news. A survey of the different COVID-19 datasets can be found in [LUM+20] and [SAAA20]. In terms of class diversity and dataset size, the most relevant dataset is by Alam et al. [ASN+20], who, like our study, present a comprehensive codebook for annotating tweets at a finer granularity. Their dataset, however, is limited to a few hundred tweets, and ours is much more diverse in the range of topics covered. Dharawat et al. [DLMZ20] present a similar dataset focused on the severity of misinformation; however, it does not consider the different "types" of misinformation. Finally, Song et al. [SPJ+20] present a dataset with a diverse set of 10 categories, but it is smaller and contains fewer categories than the dataset collected in our study.

A plethora of research has already been conducted to analyze COVID-19 misinformation online. Examples include the categorization and identification of misinformed users based on their home countries, social identities, and political affiliations [HC20, SSM+20], the characterization of different types of conspiracy theories propagated by Twitter bots [Fer20], the characterization of the prevalence of low-credibility information related to COVID-19 [YTLM20], exploratory analysis of the content of COVID-19 tweets [OPR20, SDM20], understanding the types, sources, and claims of COVID-19 misinformation [BSHN20], and a comparison of the credibility of COVID-19 tweets to datasets pertaining to other health issues [BKF+20]. To the best of our knowledge, none of these studies has characterized COVID-19 misinformation communities in terms of their sociolinguistic patterns. In this study, we do not characterize the misinformation content directly. Instead, we conduct a set of analyses to understand and characterize these communities through their content, content-sharing behaviors, and interactions.

To collect our Twitter dataset, we use the Twitter search API with a diverse set of keywords, shown in table 1. We collected our data on three days: 29 March 2020, 15 June 2020, and 24 June 2020. Each of these collections extracted a set of tweets from its corresponding week.
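As an illustration of this keyword-based collection step, the following is a minimal sketch using the tweepy client (assuming tweepy 4.x) against the Twitter standard search API. The tweepy library, the placeholder keyword list, and the credential strings are illustrative assumptions; the paper only states that the Twitter search API was queried with the keywords in table 1.

```python
# Minimal sketch of keyword-based tweet collection via the Twitter
# standard search API (v1.1), here through the tweepy client.
import tweepy

KEYWORDS = ["coronavirus", "covid19", "#COVID19"]  # placeholder; see table 1

auth = tweepy.OAuth1UserHandler(
    "CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET"
)
api = tweepy.API(auth, wait_on_rate_limit=True)

def collect(keywords, per_keyword=1000):
    """Collect recent tweets matching each keyword."""
    tweets = []
    for kw in keywords:
        for status in tweepy.Cursor(
            api.search_tweets, q=kw, lang="en", tweet_mode="extended"
        ).items(per_keyword):
            tweets.append({
                "id": status.id_str,
                "user": status.user.screen_name,
                "text": status.full_text,
                "created_at": status.created_at,
            })
    return tweets

if __name__ == "__main__":
    data = collect(KEYWORDS, per_keyword=100)
    print(f"collected {len(data)} tweets")
```

Because the standard search endpoint only indexes roughly the preceding week of tweets, each collection day yields tweets from its corresponding week, consistent with the description above.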
Our annotation task aims to determine the category to which a given tweet belongs. After many discussions and revisions, we identify 17 categories into which a particular tweet can be classified. These 17 categories are listed in table 2 and described in further detail, with definitions and examples, in our codebook, which we make available for the public to use [1]. Based on these categories, tweets were randomly and uniformly sampled from the data collection to maintain diversity in terms of topics covered. In the first phase, around 4573 tweets were annotated by a single annotator. Table 2 shows the distribution of the data across the different categories as annotated by the first annotator. In the second phase, 651 of these annotated tweets were randomly assigned to 6 other annotators.

Our data collection strategy differs from others in two main aspects: (i) we use a diverse set of categories that take into consideration the different types of information and misinformation online; and (ii) our dataset is one of the very few, if not the only one, with an emphasis on informed communities, with categories such as "True Prevention", "Calling out/correction", "True Public Health Response", and "Sarcasm". We believe this is necessary, as building models requires annotating not only false information but also complementary true-information categories. In the end, we have 4573 annotated tweets comprising 3629 users, with an average of 1.24 tweets per user. Our annotated data not only covers a wide range of categories, as observed in table 2, but also covers a wide range of topics, as can be seen in figure 1. We call this dataset CMU-MisCOV19. It is available to the public for further research and analysis [2].

[1] Our annotation codebook can be found at: tinyurl.com/cmumiscov19-codebook
[2] We will make the data public upon acceptance of our paper. The data will contain the tweet ID of each tweet along with its corresponding annotations and membership information. The tweet IDs can be used to rehydrate the tweets.

Figure 1: This chart shows the frequency of each identified topic across all the tweets. Note: some tweets may have more than one topic.

Conducting analyses for a competing set of communities requires identifying those communities first. Because we have already annotated data across a set of true and false information categories, we identify the membership of the users by assigning a valence of +1 to the categories True Treatment, True Prevention, Correction/Calling Out, Sarcasm/Satire, and True Public Health Response, and a valence of -1 to the categories Conspiracy, Fake Cure, Fake Treatment, False Fact or Prevention, and False Public Health Response. Note that we assign the valence to the categories (or annotations) and not to the tweets themselves, so that we can leverage the annotations from multiple annotators. We then compute the valence of each user as a weighted sum of the valences of the annotations assigned to their tweets, and use this valence to identify their membership: if the valence is greater than 0, the user is assigned to the informed group, and if it is less than 0, the user is assigned to the misinformed group. Out of 3629 users, the community detection process assigns 47% (1697) of the users to the informed group, 29% (1043) to the misinformed group, and 24% (889) to an ambiguous or irrelevant category [3].
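To make this valence-based membership assignment concrete, here is a minimal sketch. It assumes that the annotations are available as (user, category) pairs, one per annotation, and that each annotation carries equal weight; these representational choices, and the treatment of a zero total valence as "ambiguous", are illustrative assumptions rather than the authors' exact implementation.

```python
# Minimal sketch: assign each user to the informed / misinformed group
# based on a weighted sum of annotation valences.
from collections import defaultdict

POSITIVE = {"True Treatment", "True Prevention", "Correction/Calling Out",
            "Sarcasm/Satire", "True Public Health Response"}
NEGATIVE = {"Conspiracy", "Fake Cure", "Fake Treatment",
            "False Fact or Prevention", "False Public Health Response"}

def category_valence(category):
    """+1 for informed categories, -1 for misinformed ones, 0 otherwise."""
    if category in POSITIVE:
        return 1
    if category in NEGATIVE:
        return -1
    return 0  # other categories carry no valence

def assign_membership(annotations):
    """annotations: iterable of (user_id, category) pairs, one per annotation
    (tweets labeled by several annotators contribute several pairs)."""
    valence = defaultdict(int)
    for user_id, category in annotations:
        valence[user_id] += category_valence(category)

    membership = {}
    for user_id, v in valence.items():
        if v > 0:
            membership[user_id] = "informed"
        elif v < 0:
            membership[user_id] = "misinformed"
        else:
            membership[user_id] = "ambiguous"
    return membership

# Example:
# assign_membership([("u1", "Fake Cure"), ("u1", "Conspiracy"),
#                    ("u2", "True Prevention")])
# -> {"u1": "misinformed", "u2": "informed"}
```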
Because our goal is to characterize communities and their behaviors, once we identify the two communities, we collect the timelines of the users in each community to augment our data. Our hypothesis is that these additional posts can be used to mitigate survivorship bias [BGIR92] in our analyses.

To conduct the network analysis, we first extract only the COVID-19-related tweets from the timeline of each user. We do this by filtering all tweets with the case-insensitive keywords "corona" and "covid". We then extract the retweet, mention, and reply networks of the two target communities and combine these networks. We then compute the network density for each of the two groups. As described in [MTMC20], network density is defined as the ratio of actual connections to potential connections. In dense networks, conformity of ideas is highly encouraged and difference of opinion is discouraged. We also use ORA-PRO [CRC, ACR17, ACR18] to plot the network graph shown in figure 2.

Figure 2: Retweet + mention + reply network with informed users (in green) and misinformed users (in red), created using ORA-PRO [ACR18, Car17]. Note: users with unidentified or ambiguous membership have been removed from the graph for simplicity.

We note that both the informed and the misinformed users display echo-chamberness, with the misinformed sub-communities being much denser than the informed sub-communities, as shown in table 3. We do, however, notice some two-way communication from both sides. We also plot the retweet, mention, and reply networks separately, as shown in figure 3. While the retweet and mention networks show little to no two-way communication, the reply network, although small in size, has much more intergroup engagement. We hypothesize that this is likely a consequence of the "corrective" or "calling-out" behavior.

To understand the role of bots within the two competing groups, we used Bot-Hunter [BC18b, BC18a, BCB+18, BC20], which has a precision of 0.957 and a recall of 0.704, to identify potential bot-like accounts. We use a probability greater than or equal to 0.75 as our confidence threshold for identifying bots. We use a two-sample z-test for the difference of proportions (α = 0.05) to test the difference in the proportion of bots between the two competing groups of users. The results of our analyses can be found in table 4. We observe that, of a total of 3629 users, 14% (505) are identified as bots. The percentage of bots among identified misinformed users, however, is much higher (19%) than among identified informed users (11%). We find these results to be statistically significant (p < 0.001; z = −6.23). This indicates that nearly one fifth of the misinformation-related posts in our dataset are potentially the result of disinformation campaigns related to COVID-19.
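For reference, a minimal sketch of this two-sample test for the difference of proportions is shown below. The pooled-variance formula is the standard textbook version rather than any particular library routine used by the authors, and the counts in the usage comment are placeholders, not the values in table 4.

```python
# Minimal sketch of a two-sample z-test for the difference of proportions,
# e.g., for comparing bot prevalence in the informed vs. misinformed groups.
from math import sqrt
from scipy.stats import norm

def two_proportion_ztest(x1, n1, x2, n2):
    """Pooled two-sample z-test for H0: p1 == p2.

    x1, x2 -- number of 'successes' (e.g., bot accounts) in each group
    n1, n2 -- group sizes
    Returns (z, two_sided_p).
    """
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * norm.sf(abs(z))
    return z, p_value

# Hypothetical usage (placeholder counts, not the paper's table 4):
# z, p = two_proportion_ztest(x1=110, n1=1000, x2=60, n2=1500)
# print(f"z = {z:.2f}, p = {p:.4g}")
```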
To understand the linguistic differences between the two competing communities, we conduct a linguistic analysis of the tweets of the two groups using the Linguistic Inquiry and Word Count (LIWC) program [PBJB15]. LIWC is a text analysis tool that scores text along a set of lexical categories, each of which is psychologically meaningful. For a given text, LIWC calculates the percentage of words falling into each category; all of these categories are based on word counts. We run the LIWC program on the timelines of all the members of each of the two competing groups, using only tweets relevant to COVID-19 and removing users identified as bots. Because some users may be more active than others, using the output of the program as is may introduce biases into our analyses. To account for these biases, we first normalize the percentages by the size of the data for each user.

We use the mean of the normalized LIWC indices of the tweets of individual users for a given lexical category as our test statistic, and we use an independent z-test for the difference in means to establish statistical significance. For all our tests, α = 0.05. Our analyses are summarized in table 5. In this part, we focus on investigating three linguistic dimensions, each of which is described below along with its linguistic correlates.

Narratives play a central role in how individuals process information, communicate, and reason [Ves17]. We set out to test the differences in the use of narratives or anecdotes between the two COVID-19 misinformation communities. The LIWC correlates of a narrative discourse structure include high use of function words, pronouns, the analytic summary dimension, and authenticity. High use of function words and pronouns occurs more often when expressing feelings and behaviors, which tends to happen frequently in narratives [Pen11]. Moreover, low analytical thinking also suggests narrative language [PBJB15]. Furthermore, authentic individuals tend to be more personal, humble, and vulnerable [PBJB15]. We therefore use all of these as proxies to identify variation in the use of narratives across communities. Past work [MTMC20] has also suggested that misinformed communities (e.g., anti-vaxxers) tend to use many more pronouns, suggesting a highly narrative discourse structure. In this analysis, however, we find that informed users in the COVID-19 discourse use significantly more pronouns and function words, mention more family-related keywords, are less analytical, and are more authentic and honest than misinformed users. All of this suggests that informed users use many more narratives than misinformed users. This is an interesting finding, as it presents a dichotomy between different misinformation communities (e.g., anti-vaxxers and the COVID-19 misinformed community). In hindsight, it is also an intuitive result, as our informed group is drawn from corrective discourse in which users present stories of family members or friends suffering from COVID-19 to call out conspiracies and false information. Because the two communities still show little two-way communication, this also suggests that the content and framing of the message alone may not be enough, and that there may be a need to connect the two groups by identifying an effective medium.

Tone describes how positive a given text is. By the LIWC definition, the higher the number, the more positive the tone; numbers below 50 typically suggest a more negative tone. While we do not see significant differences in the emotional tone of the competing groups, we find both communities to be highly negative.

Formality of language has often been considered one of the most important dimensions of stylistic variation. In [GMC+14], the authors define linguistic formality as a style of writing that is meant to be precise, coherent, articulate, and convincing to an educated audience, as opposed to informal discourse, which is filled with deictic references (e.g., here, there), pronouns, and narration. The LIWC correlates of this dimension are swear words (swear) and informal language (informal). Informal language in LIWC is computed on the basis of swear words, netspeak (e.g., btw, lol), nonfluencies (e.g., err, hmm), assents (e.g., agree, OK), and fillers (e.g., youknow). From table 5, it can be observed that misinformed users tend to be more informal than informed users, though informed users tend to use more swear words than misinformed users. This is intuitive, as many of our informed users post corrective or sarcastic tweets to call out misinformation. However, these results are not statistically significant and are therefore inconclusive.
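All of the comparisons in table 5 rely on the same per-user aggregation and independent z-test for the difference in means described at the start of this section. A minimal sketch follows; it assumes the per-user LIWC scores have already been normalized as described above (the exact normalization is not spelled out in the text), and the variable and group names are illustrative.

```python
# Minimal sketch of the independent z-test for the difference in group means,
# applied to per-user (normalized) LIWC scores for a single lexical category.
import numpy as np
from scipy.stats import norm

def mean_ztest(scores_a, scores_b):
    """Independent z-test for H0: mean(A) == mean(B).

    scores_a, scores_b -- one normalized LIWC score per user in each group.
    Returns (z, two_sided_p).
    """
    a = np.asarray(scores_a, float)
    b = np.asarray(scores_b, float)
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    z = (a.mean() - b.mean()) / se
    return z, 2 * norm.sf(abs(z))

# Hypothetical usage: liwc_scores maps user -> normalized 'pronoun' score;
# membership maps user -> "informed" / "misinformed".
# informed = [liwc_scores[u] for u, g in membership.items() if g == "informed"]
# misinformed = [liwc_scores[u] for u, g in membership.items() if g == "misinformed"]
# z, p = mean_ztest(informed, misinformed)
```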
To understand the interplay between the different kinds of misinformation themes and communities, we identify the vaccination-related stance of the members of the misinformed sub-community. To do so, we first identify the subset of the misinformed community who have posted at least one vaccine-related tweet in the past. We then collect the user-to-hashtag co-occurrence network. We use the valences of the vaccination hashtags, obtained via the method described in [MTMC20], to identify the stance of each member (pro versus anti) based on the weighted sum of these valences: if the weighted sum is greater than 0, we identify the member as a pro-vaxxer, and if it is less than 0, we identify the member as an anti-vaxxer. The distribution of pro- and anti-vaxxers within the COVID-19 misinformed group is shown in table 6. We observe that, of the 1027 COVID-19 misinformed users in our dataset, 41% are identified as anti-vaxxers, whereas only 22% are identified as pro-vaxxers. The difference between the proportions of the two communities is striking. We also identify the proportion of bots within each of the two groups: misinformed pro-vaxxers and misinformed anti-vaxxers. As shown in table 6, 17% of the misinformed pro-vaxxers are bots, which is significantly lower than the proportion of bots among the misinformed anti-vaxxers. This first suggests that a large portion of the COVID-19 misinformation online may in fact be disinformation and hence intentional. The existence of bots within both the informed and the misinformed communities also suggests that much of the disinformation online may be an organized effort to amplify the COVID-19 debate and create discord in these communities, as seen in the past with Twitter bots and Russian trolls [BJQ+18].
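A minimal sketch of this hashtag-valence stance assignment is given below. The hashtag valences are assumed to come from the method in [MTMC20]; the specific hashtags and valence values shown are placeholders, as the paper does not list them, and the co-occurrence counts serve as the weights in the weighted sum.

```python
# Minimal sketch of the hashtag-valence stance assignment for the
# misinformed sub-community, using the weighted user-to-hashtag network.
hashtag_valence = {"#vaccineswork": +1.0, "#vaccinesharm": -1.0}  # placeholder values

def vaccination_stance(user_hashtag_counts):
    """user_hashtag_counts: dict user -> dict hashtag -> co-occurrence count
    (i.e., the weighted user-to-hashtag network). Returns user -> stance."""
    stance = {}
    for user, counts in user_hashtag_counts.items():
        score = sum(counts[h] * hashtag_valence.get(h, 0.0) for h in counts)
        if score > 0:
            stance[user] = "pro-vax"
        elif score < 0:
            stance[user] = "anti-vax"
        else:
            stance[user] = "unknown"
    return stance

# Example:
# vaccination_stance({"u1": {"#vaccinesharm": 3, "#vaccineswork": 1}})
# -> {"u1": "anti-vax"}
```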
The first important limitation of our work is that most of our analyses are based on data annotated by only one annotator. We try to mitigate this by having more than one seventh of our annotations labeled by a second annotator and by taking all of those annotations into account when computing the membership of each user. Another limitation is that all of our analyses are correlational in nature and do not establish causation. A limitation of our data collection strategy is that we collected our data across three separate weeks, augmented it with the timelines of users, and updated our list of hashtags to account for new themes; we then sampled a subset of this data for the annotation process. Because of the way the data was collected, it cannot be used to assess change over time. Moreover, while this strategy ensures the diversity of misinformation-related topics and agents, it may limit our ability to estimate the actual extent to which the different types of stories are more or less present. Another limitation, related to our bot analysis, is that we rely on a second-level inference from a trained model. We try to mitigate this by only using labels with a probability greater than or equal to 0.75 to ensure high-quality labels.

Finally, unlike the vaccination-related discourse, COVID-19 does not have a clear definition of user "stance". This is because there are many sub-topics associated with COVID-19, each of which could have its own stance. In this work, we categorize users based on misinformation. However, the relationship between misinformation and stance vis-à-vis specific issues is complex and needs to be better understood. In future work, we hope to explore this relationship to create a systematic way of characterizing communities both in terms of misinformation and in terms of the different stances of the users.

In this paper, we present a methodology to characterize the competing COVID-19 misinformation communities by comparing them in terms of their network structure, sociolinguistic variation, and membership in disinformation campaigns and in other health-related misinformation communities such as anti-vaxxers. We find that even though COVID-19 is a recent event, misinformation related to it has created a set of polarized communities with high echo-chamberness. Misinformed communities are observed to be denser than informed communities, which is in line with previous studies such as [MTMC20]. We find that bots exist in both the informed and misinformed groups, but the percentage of bots among misinformed users is significantly higher, suggesting the prevalence of disinformation campaigns. Our sociolinguistic analysis suggests that both target communities exhibit a negative emotional tone in their posts, with signals that informed users use many more narratives than misinformed users. Finally, we discover that many misinformed users may be anti-vaxxers.

Our analyses suggest that misinformation communities are much more complex: they are highly organized and tend to be highly analytical. Unlike previous suggestions [SOC19], they may not be responsive to narrative correctives, and hence a "one size fits all" generic messaging intervention for debunking misinformation may not be a feasible solution. A successful intervention may require identifying and banning disinformation campaigns. It may also be useful to identify the right medium of communication to connect the two groups. This can be achieved by identifying users in misinformed communities who are not rebroadcasting, or who have high betweenness centrality, to act as messengers for disseminating factual information. It may also be useful to further understand the linguistic patterns and preferences of these communities in order to craft effective content and framing for the messaging.

This work was partially supported by a fellowship from Carnegie Mellon University's Center for Machine Learning and Health to Shahan A. Memon. We thank David Beskow for access to his Bot-Hunter model for the bot analysis. We also thank the members of CMU's Center for Computational Analysis of Social and Organizational Systems (CASOS) for insightful comments and discussions related to the data codebook and its revisions.
References

ORA User's Guide 2017. Carnegie Mellon University.
Mega-COV: A billion-scale dataset of 65 languages for COVID-19.
Fighting the COVID-19 infodemic: Modeling the perspective of journalists, fact-checkers, social media platforms, policy makers, and the society.
Bot conversations are different: Leveraging network metrics for bot detection in Twitter.
Bot-Hunter: A tiered approach to detecting & characterizing automated activity on Twitter.
Introducing Bot-Hunter: A tiered approach to detection and characterizing automated activity on Twitter.
Defining misinformation, disinformation and malinformation: An urgent need for clarity during the COVID-19 infodemic. Discussion Papers.
Survivorship bias in performance studies.
Weaponized health communication: Twitter bots and Russian trolls amplify the vaccine debate.
The COVID-19 social media infodemic reflects uncertainty and state-sponsored propaganda.
Types, sources, and claims of COVID-19 misinformation.
A large-scale COVID-19 Twitter chatter dataset for open scientific research: An international collaboration.
ORA: A toolkit for dynamic network analysis and visualization.
Debunking: A meta-analysis of the psychological efficacy of messages countering misinformation.
CoAID: COVID-19 healthcare misinformation dataset.
Tracking social media discourse about the COVID-19 pandemic: Development of a public coronavirus Twitter data set.
Drink bleach or do what now? Covid-HeRA: A dataset for risk-informed health decision making in the presence of COVID-19 misinformation.
Ginger cannot cure cancer: Battling fake health news with a comprehensive data repository.
What types of COVID-19 conspiracies are populated by Twitter bots? First Monday.
Coh-Metrix measures text characteristics at multiple levels of language and discourse.
Disinformation and misinformation on Twitter during the novel coronavirus outbreak.
ArCOV-19: The first Arabic COVID-19 Twitter dataset with propagation networks.
Coronavirus Twitter data: A collection of COVID-19 tweets with automated annotations.
Leveraging data science to combat COVID-19: A comprehensive review.
Characterizing sociolinguistic variation in the competing vaccination communities.
Exploratory analysis of COVID-19 tweets using topic modeling.
The development and psychometric properties of LIWC2015.
The secret life of pronouns.
GeoCoV19: A dataset of hundreds of millions of multilingual COVID-19 tweets with location information.
Waleed Alasmary and Abdulaziz Alashaikh. COVID-19 open source data sets: A comprehensive survey. medRxiv.
An exploratory study of COVID-19 misinformation on Twitter.
FakeCovid: A multilingual cross-domain fact check news dataset for COVID-19.
The potential for narrative correctives to combat misinformation.
Classification aware neural topic model and its application on a new COVID-19 disinformation corpus.
COVID-19 on social media: Analyzing misinformation in Twitter conversations.
Exposure to health (mis)information: Lagged effects on young adults' health behaviors and potential pathways.
Stance in replies and quotes (SRQ): A new dataset for learning stance in Twitter conversations.
Narrative policy framework: Narratives as heuristics in the policy process.
Prevalence of low-credibility information on Twitter during the COVID-19 outbreak.
ReCOVery: A multimodal repository for COVID-19 news credibility research.