key: cord-0137246-0b0vb02w
authors: Broniatowski, David A.; Kerchner, Daniel; Farooq, Fouzia; Huang, Xiaolei; Jamison, Amelia M.; Dredze, Mark; Quinn, Sandra Crouse
title: The COVID-19 Social Media Infodemic Reflects Uncertainty and State-Sponsored Propaganda
date: 2020-07-19
journal: nan
DOI: nan
sha: 327182e27ac733136b2880cd0084140971270d69
doc_id: 137246
cord_uid: 0b0vb02w

Significant attention has been devoted to determining the credibility of online misinformation about the COVID-19 pandemic on social media. Here, we compare the credibility of tweets about COVID-19 to datasets pertaining to other health issues. We find that the quantity of information about COVID-19 is indeed overwhelming, but that the majority of links shared cannot be rated for their credibility. Reasons for this failure to rate include widespread use of social media and news aggregators. The majority of links that could be rated came from credible sources; however, we found a large increase in the proportion of state-sponsored propaganda among non-credible and less credible URLs, suggesting that COVID-19 may be used as a vector to spread misinformation and disinformation for political purposes. Overall, results indicate that COVID-19 is unfolding in a highly uncertain information environment that may not be amenable to fact-checking as scientific understanding of the disease, and appropriate public health measures, evolve. As a consequence, public service announcements must adequately communicate the uncertainty underlying these recommendations, while still encouraging healthy behaviors.

COVID-19, an illness caused by the SARS-CoV-2 virus, is a potentially fatal disease that was declared a pandemic on March 11th, 2020 by the World Health Organization (WHO) [4]. As the pandemic was emerging, the WHO declared a COVID-19 "infodemic": "an overabundance of information, some accurate and some not, that makes it hard for people to find trustworthy sources and reliable guidance when they need it."
[2] The quantity of untrustworthy content shared online can hamper an effective public health response and create confusion and distrust among people (3), ultimately leading to significant loss of life. Significant attention has focused on the malicious aspects of this infodemic; for example, the United Nations has characterized it as an "'infodemic' of misinformation and cybercrime" [12]. Here, we seek to characterize the "infodemic" on Twitter, one of the world's most popular social media platforms. An exploratory study assessing information sharing on Twitter early in the COVID-19 pandemic traced how tweet and retweet volume increased as the epidemic grew, with the volume of related tweets doubling between January and March [26]. Furthermore, a recent Pan American Health Organization publication characterized the infodemic as follows:

"According to a study by the Center for Health Informatics at the University of Illinois, in the month of March around 550 million tweets included the terms coronavirus, corona virus, covid19, covid-19, covid 19 or pandemic. An exponential increase in the volume of tweets occurred around the start of the lockdown in Italy, reaching a plateau around the day the United States declared the pandemic had become a national emergency." [7]

Nevertheless, Brandwatch, a digital consumer intelligence company, estimates that Twitter users generate roughly 500 million tweets per day [3]. Thus, the number of tweets must be contextualized relative to other health topics. Furthermore, as online communities continue to try to make sense of the deluge of information, and accompanying uncertainties, one would expect that all sources would be heavily engaged in this collective sensemaking process, yielding a large amount of content. We therefore seek to compare the number of tweets pertaining to COVID-19 to other health topics, motivating our first research question: 1.
Compared to other health topics, is there "an overabundance of information" pertaining to COVID-19 online?

Widespread media coverage pertaining to the infodemic has focused on the distribution of conspiracy theories and online scams. If the infodemic is indeed characterized by malicious content, one might expect a higher proportion of this content to come primarily from "low-credibility" sources [19, 28] that "lack the news media's editorial norms and processes for ensuring the accuracy and credibility of information" [21] because "...the attribution of 'fakeness' is...not at the level of the story but at that of the publisher" [21]. This should be especially true when comparing COVID-19 social media samples to equivalent samples pertaining to other topics. However, preliminary research regarding the prevalence of misinformation surrounding the COVID-19 pandemic online seems to indicate that significant content may actually come from credible sources, at least on Twitter [26]. Nevertheless, estimates of the prevalence of content from non-credible sources vary widely. For example, Singh et al. [26] found that key "myths" about coronavirus on Twitter represent only roughly 1% of the investigators' total sample, whereas Yang et al. [28] found a similar prevalence of all low-credibility sources on Twitter, with the total volume of tweets from non-credible sources comparable to shares from a single high-credibility or government source. In contrast, Cinelli et al. [15] found that unreliable sources made up roughly 11% of rated content, with both reliable and unreliable sources engaged with, and amplified, at roughly the same rate. Similarly, a content analysis of coronavirus-related tweets from a two-day period in February found that 10.6% of tweets contained false information [24].
Although all of these prior studies found that information from credible sources was more prevalent than information from non-credible sources, the estimates of non-credible content vary widely. Furthermore, none of these studies compared their prevalence estimates to any other Twitter datasets; thus, it is difficult to evaluate whether COVID-19 tweets contain more content from non-credible sources than tweets about other topics. Beyond our ability to assess source credibility, significant content may come from unrated sources. For example, "Plandemic," a viral movie promoting conspiracy theories about COVID-19, was primarily spread via social media, which hosts user-generated content of varying reliability [17]. Similar concerns apply to content linking to online vendors, such as amazon.com, which may sell reliable content, unreliable content, or may even contain questionable material in comments posted to otherwise unrelated products. Indeed, prior work [26] shows that four of the top ten most shared domains in a set of COVID-19 tweets included two social media platforms (youtu.be/youtube.com and instagram.com) and Amazon (amzn.to). This motivates our second research question:

2. How much of the information that is shared comes from untrustworthy, and unrated, sources?

Beyond source credibility, a recent analysis of misinformation and disinformation during the infodemic emphasized the prevalence of racism, xenophobia, conspiracy theories, health misinformation, and malicious geopolitical actors [27]. These findings seem to accord with our previous findings that Russian Twitter trolls have amplified the vaccine debate in order to promote discord in American politics [14]. More recently, news reports have alleged that networks of state-sponsored bot accounts were actively promoting misinformation about COVID-19 [13].
Finally, several outlets have reported that state-sponsored propaganda has employed COVID-19

2 Materials and Methods

We collected a COVID-19-specific Twitter dataset using the Social Feed Manager software [22], which collected English-language tweets from the Twitter API's statuses/filter streaming endpoint [10] that matched the keywords #Coronavirus, #CoronaOutbreak, and #COVID19 between March 3, 2020 and May 2, 2020. We compared this dataset to data from the same time period from two ongoing collections of tweets: one containing keywords pertaining to generalized health topics (the "health stream" [23]; see footnote 1), and one containing keywords pertaining to vaccine-preventable illnesses (the "vaccine stream" [16]; see footnote 2). Although there was significant overlap between the health and vaccine streams (due to the inclusion of the word "vaccine" in both), the COVID-19 dataset was largely distinct from these (among tweets containing URLs, COVID-19 tweets only overlapped 2.4% and 0.5% with the health and vaccine stream tweets, respectively). Additionally, we compared each of these datasets to equivalent data from the same dates in 2019 (data prior to March 7, 2019, was not available due to a server outage).

Footnote 1: Containing the following keywords: ill, sick, cold, body, pain, hurts, sore, nose, hospital, doctor, cancer, killing, stomach, headache, neck, ear, throat, chest, hurting, ouch, massage, burning, flu, exhausted, medicine, surgery, knee, cough, fever, doctors, insomnia, irritated, freezing, intense, emergency, dose, miserable, exercise, cure, eaten, dentist, vision, bedtime, physical, treatment, pills, coma, pounds, dealing, breathing, insurance, feelin, tooth, heal, appointment, ache, ankle, pill, numb, recovery, physically, wrist, depression, hungover, allergies, allergic, nurse, stroke, meds, cramps, woken, muscles, dizzy, clinic, pains, jaw, sneeze, lungs, swollen, puke, anxiety, appt, recover, severe, headaches, thirsty, vomit, tension, sneezing, caffeine, itchy, appetite, resting, coughing, infection, diabetes, migraine, sickness, uncomfortable, pounding, mild, aching, itching, hiccups, forehead, illness, recovering, hurtin, ribs, medication, aches, stuffy, advil, sneezed, symptoms, prescription, nyquil, drained, asthma, lung, anxious, itch, remedy, elbow, infected, sinus, kidney, allergy, torn, rash, chronic, tumor, poisoning, pimples, crutches, diagnosed, tylenol, nauseous, stiff, bladder, splitting, fatigue, lump, bruised, puking, germs, sunburn, relieve, runny, rehab, paracetamol, panadol, stomachache, watering, faint, toothache, icky, blisters, throbbing, veins, dehydrated, spine, heartburn, dental, nausea, needles, watery, puffy, yucky, surgeon, colds, antibiotics, vomiting, skull, shivering, acne, sniffles, healed, throats, painkillers, contagious, vitamins, stomache, strep, tiredness, benadryl, sinuses, congestion, ibuprofen, withdrawal, arthritis, migraines, pneumonia, recovered, cured, cravings, tonsils, ulcer, remedies, limping, fluids,

We calculated the total number of tweets in each dataset. We restricted our analysis to English tweets only. For each dataset, we identified all Twitter posts containing a URL, and extracted these URLs. Specifically, we first unshortened any shortened links (e.g., "bit.ly/x11234b") using the Python "requests" module [11] and then used the "tldextract" Python module [20] to identify the top-level domain for each unshortened link. We removed links to twitter.com, t.co (links that could not be unshortened), and t.me (inadvertently removed when removing t.co). Next, we enumerated the frequency of each top-level domain in each dataset. We obtained a credibility rating for each domain from the MediaBiasFactCheck (MBFC) web site [8], using an automated web scraper. MBFC rates several domains according to their factual accuracy into six categories: "Very Low", "Low", "Mixed", "Mostly Factual", "High", and "Very High".
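The URL-processing steps described above (reducing each link to its domain, dropping Twitter-internal links, and counting domain frequencies) can be illustrated with a minimal sketch. This is not the authors' code: it uses only the Python standard library, and the hypothetical helper `naive_domain` simply keeps the last two labels of the hostname as a simplified stand-in for "tldextract", which handles multi-part public suffixes such as .co.uk correctly. Link unshortening is omitted because it requires network access, and the sample URLs are invented for illustration.

```python
from collections import Counter
from urllib.parse import urlparse

# Domains the paper reports removing before counting.
EXCLUDED = {"twitter.com", "t.co", "t.me"}


def naive_domain(url):
    """Return the last two labels of the hostname.

    Hypothetical, simplified stand-in for tldextract: it does not know
    about multi-part suffixes such as .co.uk or .gc.ca.
    """
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else ""


def domain_frequencies(urls):
    """Count how often each non-excluded domain appears in a list of URLs."""
    counts = Counter()
    for url in urls:
        domain = naive_domain(url)
        if domain and domain not in EXCLUDED:
            counts[domain] += 1
    return counts


# Invented sample links, assumed to be already unshortened.
sample_urls = [
    "https://www.youtube.com/watch?v=abc",
    "https://youtube.com/watch?v=def",
    "https://t.co/xyz",
    "https://www.cdc.gov/coronavirus/",
]
freqs = domain_frequencies(sample_urls)
```

Counting at the domain level rather than the URL level is what allows a single source-level rating to be attached to every link from that source.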
Additionally, some domains were categorized as "Questionable" (indicating propaganda or fake news), or "Satire". We retained these categories. Next, we coded as "government" all domains ending in .gov, .gc.ca, .mil, .nhs.uk, starting with gov., mygov., government., containing .govt. or .gov., or matching who.int, paho.org, un.org, canada.ca, ontario.ca, toronto.ca, or alberta.ca. We also coded as "academic" all domains ending in .edu, or containing .edu., .ac., thelancet.com, sciencedirect.com, medrxiv.org, pnas.org, apa.org, nature.com, sciencemag.org, nejm.org, bmj.com, mayoclinic.org, aaas.org, healthdata.org, researchgate.net, or rand.org. Finally, we coded as "social media" the following domains: youtube.com, instagram.com, facebook.com, blogspot.com, reddit.com, pscp.tv, vimeo.com, linkedin.com, and tumblr.com. Any remaining domains for which we retrieved an MBFC rating were categorized as "news".

We next grouped these categories into high-level indicators of credibility, with academic, government, "very high", and "high" MBFC scores labeled as "more credible"; "mostly factual" and "mixed" MBFC scores labeled as "less credible"; and "low", "very low", and "questionable" MBFC scores labeled as "not credible". The remaining categories did not have MBFC scores, were therefore not given credibility ratings, and were labeled as "unrated". We next used the Webshrinker Category API [5] to assign the most popular of these unrated domains into discrete categories and, where available, subcategories. Due to the highly skewed distribution of domains that were shared, we scored the domains that together accounted for the top 90% of all shares of unrated domains in the dataset. We compared proportions of these categories across each of our datasets. Additionally, we extracted the top domains in each category and subcategory.
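The coding rules above can be expressed as a short sketch. This is an illustration under stated assumptions, not the authors' implementation: the function names, the order in which rules are checked, and the treatment of "Satire" as unrated are assumptions. The mapping of "low", "very low", and "questionable" to "not credible" follows the way the Discussion describes Russian and Iranian state media.

```python
# Rule lists transcribed from the Methods text.
GOV_SUFFIXES = (".gov", ".gc.ca", ".mil", ".nhs.uk")
GOV_PREFIXES = ("gov.", "mygov.", "government.")
GOV_SUBSTRINGS = (".govt.", ".gov.")
GOV_EXACT = {"who.int", "paho.org", "un.org", "canada.ca",
             "ontario.ca", "toronto.ca", "alberta.ca"}

# The text says "containing"; plain substring matching can over-match
# (e.g. "nature.com" is a substring of "signature.com").
ACADEMIC_SUBSTRINGS = (
    ".edu.", ".ac.", "thelancet.com", "sciencedirect.com", "medrxiv.org",
    "pnas.org", "apa.org", "nature.com", "sciencemag.org", "nejm.org",
    "bmj.com", "mayoclinic.org", "aaas.org", "healthdata.org",
    "researchgate.net", "rand.org",
)

SOCIAL_MEDIA = {"youtube.com", "instagram.com", "facebook.com",
                "blogspot.com", "reddit.com", "pscp.tv", "vimeo.com",
                "linkedin.com", "tumblr.com"}

MBFC_TO_CREDIBILITY = {
    "very high": "more credible", "high": "more credible",
    "mostly factual": "less credible", "mixed": "less credible",
    "low": "not credible", "very low": "not credible",
    "questionable": "not credible",
}


def source_type(domain, mbfc_rating=None):
    """Assign a source type; the precedence of the rules is an assumption."""
    d = domain.lower()
    if (d.endswith(GOV_SUFFIXES) or d.startswith(GOV_PREFIXES)
            or any(s in d for s in GOV_SUBSTRINGS) or d in GOV_EXACT):
        return "government"
    if d.endswith(".edu") or any(s in d for s in ACADEMIC_SUBSTRINGS):
        return "academic"
    if d in SOCIAL_MEDIA:
        return "social media"
    return "news" if mbfc_rating is not None else "other"


def credibility(domain, mbfc_rating=None):
    """Map a domain and an optional MBFC rating to a high-level label."""
    if source_type(domain, mbfc_rating) in ("government", "academic"):
        return "more credible"
    if mbfc_rating is None:
        return "unrated"
    # Ratings outside the table (e.g. "Satire") are treated as unrated here.
    return MBFC_TO_CREDIBILITY.get(mbfc_rating.lower(), "unrated")
```

For example, `credibility("cdc.gov")` yields "more credible" without any MBFC lookup, whereas a social media domain such as youtube.com remains "unrated" because no source-level rating applies to user-generated content.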
Finally, in order to determine which domains were more likely to be shared as part of the "infodemic", we calculated the relative likelihood with which each domain appeared in the COVID-19 dataset compared to the other datasets collected in 2020.

A total of 482,479 unique domains were shared 23,530,432 times across all five datasets. Of these, 63,608 (13.2%) were assigned some kind of categorization (either an MBFC rating, social media, government, academic, or a Webshrinker category). These 63,608 domains nevertheless accounted for 22,448,453 (95.4%) of all shares across the five datasets, with no uncategorized domain shared more than 2,032 times, and no more than 2,005 times within a single dataset. Table 1 provides summary statistics of each dataset. The COVID-19 data shows more concentration overall, with a higher number of tweets per user compared to other datasets. Additionally, the COVID-19 dataset contains a higher proportion of English tweets, and tweets with URLs, than do the other two datasets. Examining English tweets, our results show that just three COVID-19-related keywords yielded 2.6 times more English tweets than the largest vaccine-related dataset, collected using 64 keywords, and 3.8 times more tweets overall. Although we collected roughly twice as many tweets in each health dataset as were in the COVID-19 dataset, the former were collected using 269 keywords. The number of tweets per keyword for the COVID-19 dataset was 42.44 times higher than that of the next highest dataset, Health 2019.

Non-Credible Sources than Tweets from other Datasets

Table 2 shows the proportions of domains in our datasets assigned to each of the high-level categories (more credible, less credible, not credible, and unrated). COVID-19 tweets contain the lowest proportion of "not credible" tweets and similar (or higher) proportions of "more credible" tweets compared to the other datasets.
Only the Vaccine 2020 dataset has slightly more trustworthy content, but also roughly 4.7 times as much untrustworthy content. Notably, all datasets contain a majority of unrated tweets, with the proportions of social media data roughly consistent across datasets. Like Singh et al. [26], we find that YouTube and Instagram are among the most popular domains shared (see Table 3); however, both of these domains are shared less often in the COVID-19 dataset than in the comparator datasets. Similarly, the "shopping" category (containing amazon.com and other online marketplaces) is less popular in the COVID-19 dataset than in other datasets. In contrast, Periscope TV (a livestreaming video service) and the "business" category are both more frequent in the COVID-19 dataset than in other datasets, in large part due to several links to paper.li (2.41%), a content curation service that lets users create, and disseminate, online newspapers (often for marketing purposes), and fiverr.com (0.42%), an online freelancer marketplace.

are more likely to contain state-sponsored propaganda

Table 4 shows the top 10 domains, in each high-level category, for which the proportion of COVID-19 content exceeded that of both of the other datasets in 2020. Results show that the first 3 domains categorized as "not credible" in the COVID-19 dataset (rt.com, presstv.com, and sputniknews.com) reflect state-sponsored media. Additionally, Chinese state-sponsored media generated roughly as much content as Russian state media, although these sources were ranked as more credible by MBFC.

Our results contextualize the widespread claims of an "infodemic" surrounding COVID-19 on Twitter. We are the first to compare volumes of tweet content pertaining to COVID-19 to other health topics, namely vaccines and a general health dataset. Like previous studies, we find that there is indeed an overwhelming amount of content pertaining to COVID-19 online.
However, Twitter generates roughly 550-600 million tweets per day. Of these, three hashtags pertaining to COVID-19 alone generated roughly 3.1 million tweets per day, or about 0.5% of all tweets on the platform. Notably, Twitter's rate limit on the statuses/filter endpoint prohibits collecting more than 1% of all daily tweets. To our knowledge, our collection of COVID-19 tweets was not rate limited, as indicated by the absence of error messages returned from Twitter's streaming API. Therefore, we assume that the dataset represented the complete set of matching tweets for this time period. Similarly, the vaccine streams, which were 0.26 and 0.35 times the size of the COVID-19 Twitter dataset, respectively, were not significantly rate limited. In contrast, the health streams, which returned 22-28% more tweets, were occasionally rate limited. Thus, we cannot accurately estimate the size of the COVID-19 dataset relative to the health streams other than to say that it was somewhat smaller. Nevertheless, it is significant that the amount of content pertaining to one specific disease, COVID-19, vastly exceeded that pertaining to all other viral illnesses in the vaccine stream.

Like prior work [15, 24, 26, 28], we find that the proportion of tweets from sources tagged as "not credible" is quite small, with the COVID-19 dataset containing the smallest proportion of these sources compared to other datasets. In contrast, the amount of content from "more credible" sources was roughly one fourth of all content across all datasets, suggesting that credible sources may be more prolific or more frequently shared than less credible sources. In effect, the COVID-19 infodemic is more a function of information volume than information quality. Although there are widespread concerns pertaining to the spread of misinformation online, this issue is complicated in the context of an emerging pandemic in which little is known and uncertainty is enormous.
As the scientific understanding of this emerging disease evolves, changing information will invariably mean that information reported early may, in fact, be incorrect. Tweets that share data that later proves incorrect after further study are distinctly different from deliberate attempts to share misinformation or disinformation for ulterior motives.

State-sponsored propaganda, and especially Russian and Iranian state media that MBFC ranks as "Low" or "Very Low" factual accuracy, were more likely to tweet about COVID-19 than about other health concerns. Additionally, Chinese state media that MBFC rated as "Mixed" factual accuracy displayed significantly more interest in COVID-19 than in other health topics. These findings are consistent with recent testimony from the US State Department indicating that Russia, China, and Iran may be active in spreading disinformation about COVID-19 online [1]. Nevertheless, our findings indicate that these specific state-sponsored sources are only 0.77% of our dataset and are far exceeded in volume by content from more credible sources. Indeed, several privately owned mainstream news sources representing British, Filipino, Indian, Nigerian, and Qatari perspectives also showed increased interest in COVID-19 across a range of different credibility ratings; however, more credible US news sources actually published proportionally less content about COVID-19 than about other health topics.

The majority of content from each of our datasets came from websites that could not be rated. In particular, almost half of all content was grouped into five unrated categories: social media, "News / Weather / Information", "Business", "Technology & Computing", and "Unknown". The top domains within each of these categories reflect significant information dissemination.
For example, the "News / Weather / Information" category contains sources that simply have not been rated by MBFC (e.g., thecable.ng, citinewsroom.com, etc.), perhaps because they reflect audiences with which MBFC's fact checkers are not familiar (e.g., African audiences). The "Unknown" category also contains several unrated news sources (e.g., abc7.com, newsfilter.io, etc.), but also self-published subscription content, such as onlyfans.com. News and other information-seeking behavior also falls into the remaining categories, with "Business" containing links to paper.li, and "Technology & Computing" containing links to apple.news and google.com, all of which can be used as content curation services. Finally, social media platforms can be used to redirect users to news, but also to self-published content of indeterminate credibility. Social media is a major source of online content about COVID-19, but no more or less so than for the other topics we examined. Like paper.li, these sites serve the role of content curation, enabling people to circumvent the fact-checking role of elite journalistic institutions. Notably, some of the largest sources of misinformation were primarily spread using these types of media (e.g., "Plandemic" was spread on Facebook, YouTube, Vimeo, and Twitter [18]).

Our study has several limitations that may be addressed by future research. Primarily, credible sources do not necessarily indicate credible content, and vice versa. Especially during a pandemic where what is known to be truthful is highly uncertain, misinformation may spread through credible channels. Additionally, links to low-credibility sources may sometimes be shared insincerely or sarcastically; retweets do not always indicate endorsement. Furthermore, information may spread on Twitter without containing an external URL.
Nevertheless, measures of the prevalence of these sites are an adequate metric of the scope of the infodemic to the extent that it refers to the volume of information available. Additionally, we cannot, and do not, claim that data from Twitter are representative of the general public, or even of all social media. Indeed, news sources may be more likely to be shared by automated accounts ("bots") than other types of tweets, especially when those bots serve as news aggregators. For example, Yang and colleagues recently showed that there is a higher-than-normal prevalence of bot postings in the coronavirus tweets [28], with even a highly rated news site like nytimes.com tweeted by social bots nearly 5% of the time [28]. As above, measures of URL prevalence are an indicator of information availability rather than information consumption. Lastly, we recognize that relying on MBFC for categorizing news sites as trustworthy or untrustworthy may be a limiting factor. Future work should consider incorporating new and different sources of expert categorization such as the multimodal repository created by Zhou et al. [29] and Newsguardtech.com [9]. Finally, we restricted our analysis to a single platform, Twitter. Future research should apply a similar methodology to other platforms.

In conclusion, our results reflect a complex information environment that cannot be characterized by a simple "false" vs. "true" dichotomy of online content. Although prior work has shown that the proportion of information from low-credibility sources is quite small, our results demonstrate that these proportions are even smaller for COVID-19 tweets. Thus, attempts to combat misinformation that focus on low-credibility sources, such as sites that peddle conspiracy theories or false cures, may be missing the larger body of misinformation that is factually mixed, or even true, yet out of context, or that has changed significantly over time.
Under conditions of deep uncertainty, source credibility may be a false signal of misinformative content. Beyond these considerations, we find that most content is unrated and perhaps cannot be rated at all, concomitant with the deep uncertainty underlying an emerging infectious disease during a pandemic. Even if we were able to rate all sources according to meaningful credibility metrics, news aggregation sites, such as paper.li, or social media platforms that disseminate user-generated content, do not employ fact-checkers at the scale required to provide credibility assessments for all of their content. Furthermore, there is no legitimate expectation that user-generated content is subject to the same fact-checking standards as journalistic, government, or academic sources. Nevertheless, this content is overwhelming in scope and varied in style, making it difficult for users to distinguish between fact, opinion, misinformation, and satire. This further underscores the importance of distinguishing between disinformation (content that is generated for malicious purposes by non-credible actors) and misinformation that may come about due to the search for meaning in a legitimately uncertain environment. It is precisely when society faces these "meaning threats" that misinformation becomes plausible, possibly moving from the fringes (e.g., state-sponsored propaganda and conspiracist sites) into the mainstream [25]. Under these circumstances, efforts to fact-check or otherwise promote "mythbusters" may be missing the larger point that there is a need to communicate the fundamental gist of uncertainty as information about the virus changes from day to day, while managing expectations that information will change as we learn more.
60 incredible and interesting twitter stats and statistics - brandwatch
Announcement new icd 10 code for coronavirus
Media bias/fact check - search and learn the bias of news media
Post statuses/filter - twitter developers
Http for humans TM - requests 2.24.0 documentation
Un tackles 'infodemic' of misinformation and cybercrime in covid-19 crisis - united nations
Researchers: Nearly Half Of Accounts Tweeting About Coronavirus Are Likely Bots
Weaponized health communication: Twitter bots and russian trolls amplify the vaccine debate
The covid-19 social media infodemic
Understanding vaccine refusal: why we need social media now
How the 'Plandemic' Movie and Its Falsehoods Spread Widely Online. The New York Times
YouTube and other platforms are struggling to remove new pandemic conspiracy video
Fake news on twitter during the 2016 us presidential election
john-kurkowski/tldextract
The science of fake news
Discovering health topics in social media using topic models
Covid-19 infodemic: More retweets for science-based information on coronavirus than for false information
A scientific theory of gist communication and misinformation resistance, with implications for health
A first look at covid-19 information and misinformation sharing on twitter
Library Catalog: graphika
Prevalence of low-credibility information on twitter during the covid-19 outbreak
Recovery: A multimodal repository for covid-19 news credibility research