key: cord-1020653-3xk2zi4r authors: Medford, Richard J; Saleh, Sameh N; Sumarsono, Andrew; Perl, Trish M; Lehmann, Christoph U title: An “Infodemic”: Leveraging High-Volume Twitter Data to Understand Early Public Sentiment for the COVID-19 Outbreak date: 2020-06-30 journal: Open Forum Infect Dis DOI: 10.1093/ofid/ofaa258 sha: a3bad73497ddb2f845a2d8b490104c4220c88e5c doc_id: 1020653 cord_uid: 3xk2zi4r
BACKGROUND: Twitter has been used to track trends and disseminate health information during viral epidemics. On January 21, 2020, the CDC activated its Emergency Operations Center and the WHO released its first situation report about Coronavirus Disease 2019 (COVID-19), sparking significant media attention. How Twitter content and sentiment evolved in the early stages of the COVID-19 pandemic has not been described.
METHODS: We extracted tweets matching hashtags related to COVID-19 from January 14 to 28, 2020 using Twitter’s application programming interface. We measured themes and frequency of keywords related to infection prevention practices. We performed a sentiment analysis to identify the sentiment polarity and predominant emotions in tweets and conducted topic modeling to identify and explore discussion topics over time. We compared sentiment, emotion, and topics among the most popular tweets, defined by the number of retweets.
RESULTS: We evaluated 126,049 tweets from 53,196 unique users. The hourly number of COVID-19 related tweets starkly increased from January 21, 2020 onward. Nearly half (49.5%) of all tweets expressed fear and nearly 30% expressed surprise. In the full cohort, the economic and political impact of COVID-19 was the most commonly discussed topic. When focusing on the most retweeted tweets, the incidence of fear decreased and topics focused on quarantine efforts, the outbreak and its transmission, as well as prevention.
CONCLUSION: Twitter is a rich medium that can be leveraged to understand public sentiment in real time and potentially target individualized public health messages based on user interest and emotion.
With over 300 million monthly users, the micro-blogging platform Twitter is increasingly used to disseminate public health information and obtain real-time health data using crowdsourcing methods [1]. Researchers have analyzed Twitter data to project the spread of influenza and other infectious outbreaks in real time [2]. In 2009, investigators measured the evolving interest in an Influenza A outbreak by analyzing tweet keywords and estimating real-time disease activity and disease prevention efforts [3]. During the Ebola virus (EV) outbreak in 2014, Twitter users publicized pertinent health information from media sources, with peak Twitter activity within 24 hours following news events [4]. Tweet content analysis following the EV epidemic found that Ebola-related tweets revolved mainly around risk factors, prevention, disease trends, and compassion [5]. Similarly, during the 2015 Middle East Respiratory Syndrome (MERS) outbreak, disease spread was found to correlate with Twitter activity, supporting Twitter as a potential surveillance tool for emerging infectious diseases [6]. During the Zika virus epidemic, Twitter was used to study significant changes in travel behavior due to mounting public concerns [7].
Recognizing Twitter's potential to inform and educate the public, governmental agencies such as the World Health Organization (WHO) and the Centers for Disease Control and Prevention (CDC) have adopted the use of Twitter and other social media. In the first 12 weeks of the Zika outbreak in late 2015, the WHO Twitter account was retweeted over 20,000 times, demonstrating its widespread impact on disseminating health information [8].
In December 2019, the first diagnosis of a novel, emerging coronavirus, formally named severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), was made in Wuhan City, Hubei Province, China. In subsequent weeks, the coronavirus's rapid spread garnered increasing media coverage and public attention. Press coverage further heightened on January 21, 2020, when the CDC activated its Emergency Operations Center and the WHO began publishing daily situation reports. Subsequent travel limitations, large-scale quarantine of Chinese residents, and numerous international index cases generated significant interest among the general public [9]. However, there is limited insight into the main topics discussed and the sentiment of the general public over time.
We postulate that analysis of the content and sentiments expressed over time on Twitter in the early stages of the Coronavirus Disease 2019 (COVID-19) pandemic can aid understanding of the effect of the outbreak on the emotions, beliefs, and thoughts of the general public. Such understanding would enable large-scale opportunities for education and appropriate information dissemination about public health recommendations.
From January 14 to 28, 2020, a random sample of tweets in the English language was extracted using Twitter's application programming interface (API) and its advanced search tool (https://twitter.com/search-advanced), which generates a relevant subset of tweets [10] that does not include any retweets. The dates were chosen to include one week of data before and after the activation of the Emergency Operations Center by the Centers for Disease Control and Prevention [11] and the release of the first WHO situation report [12].
Hashtags used for identification of COVID-19 related tweets included #2019nCoV, #coronavirus, #nCoV2019, #wuhancoronavirus, and #wuhanvirus (COVID-19 and SARS-CoV-2 were not coined until February 11, 2020), based on the top trending hashtags related to the COVID-19 outbreak during the study period. Nineteen variables were extracted from tweets, of which ten were used in our analysis: tweet text, time stamp, if the tweet had a reply, if the tweet was a reply, if the tweet was a retweet (which does not include quoted tweets), if the tweet included an image, if the tweet included a link, number of tweet likes, number of retweets, and number of replies.
We performed all data processing and analysis using Python software, version 3.6.1 (Python Software Foundation) and RStudio version 1.2.1335 (R Foundation for Statistical Computing). We compared the COVID-19-related tweets per hour with the number of newly confirmed cases worldwide over each 24-hour period and completed descriptive statistics for the collected variables.
To analyze tweets, we extracted the plain text from the original message. For all but the sentiment analysis, we removed commonly used words that are of little analytic value (e.g., "for", "the", "is"), converted text to lowercase, and changed words to their root forms (e.g., "viruses" to "virus" or "went" to "go"). We extracted one-word and two-word terms from tweets. We removed terms present in fewer than five tweets and two terms present in more than 10% of tweets ("case" and "people"), decreasing the dictionary of terms from 626,614 to 38,823. Using a word cloud, we visualized the top three hundred words with larger font size representing greater frequency.
We used a subset of keywords to identify tweets related to three common infection prevention and control (IPC) strategies as well as vaccination. Appendix Section A1 details the keywords used.
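The term extraction and dictionary filtering described above can be illustrated with a minimal Python sketch. It is not the authors' code: the stop-word list is a tiny hypothetical stand-in, root-form conversion (lemmatization) is omitted, forming two-word terms after stop-word removal is an assumption, and the function names `extract_terms` and `filter_vocabulary` are invented for illustration. Only the two thresholds (at least five tweets, at most 10% of tweets) come from the text.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; the study's actual list is not given.
STOP_WORDS = {"for", "the", "is", "a", "of", "in", "to", "and"}


def extract_terms(text):
    """Return the set of one-word and two-word terms found in one tweet."""
    # Lowercase, keep alphabetic tokens only, and drop stop words.
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]
    # Two-word terms are built from adjacent remaining tokens (an assumption).
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return set(tokens) | set(bigrams)


def filter_vocabulary(tweets, min_tweets=5, max_fraction=0.10):
    """Keep terms present in at least min_tweets tweets and in at most
    max_fraction of all tweets."""
    doc_freq = Counter()
    for text in tweets:
        # A set is used so each term counts once per tweet.
        doc_freq.update(extract_terms(text))
    limit = max_fraction * len(tweets)
    return {term for term, n in doc_freq.items() if min_tweets <= n <= limit}
```

Counting each term once per tweet (document frequency) rather than once per occurrence matches the way the thresholds are phrased, in numbers of tweets rather than numbers of words.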
We analyzed the incidence of these IPC- and vaccine-related tweets over time and manually reviewed a random 10% subset to validate content and evaluate narratives.
Sentiment polarity describes emotions that refer to the intrinsic attractiveness or aversiveness of a subject like events, objects, or situations [13]. We analyzed the sentiment polarity of tweets separately using four commonly used methods through the Syuzhet R package [14]. Because each method uses a different scale, we normalized scores to detect the polarity of tweets as positive, negative, or neutral. For the emotion analysis, we used recurrent neural networks to label a primary emotion for a document according to a previously established emotional classification system (i.e., anger, disgust, fear, joy, sadness, or surprise) [15]. We tracked trends by visualizing the daily number of tweets labeled with each sentiment polarity and each emotion over the two-week period and comparing their rates of change in tweets per day.
A Latent Dirichlet Allocation (LDA) [16] model (gensim Python package [17]) automatically generates topics from observations (in our case, from tweets) and assigns similar observations to one or more of these topics using the distribution of words. We iteratively trained multiple LDA models using different numbers of topics to maximize a topic coherence score (which measures the degree of semantic similarity between high-scoring words in the topic). Selecting the highest coherence score resulted in the use of the LDA model with ten topics. Adhering to convention, we presented the top fifteen terms (a common number of terms used in analyzing topics in LDA models) that contributed to each topic group and manually labeled a theme for each topic.
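The sentiment step above normalizes scores from methods with different scales into three polarity classes and then counts labeled tweets per day. The text does not state the normalization rule, so the following Python sketch assumes the simplest one, taking the sign of each score. The function names and the method names `method_a` and `method_b` are hypothetical placeholders, not identifiers from the Syuzhet package.

```python
from collections import Counter


def polarity(score):
    """Map a raw sentiment score on any scale to a polarity label by its sign.

    Assumption: zero means neutral for every method; the study's actual
    normalization rule is not described.
    """
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def label_by_method(scores_by_method):
    """Convert {method name: [score per tweet]} to {method name: [label per tweet]}."""
    return {
        method: [polarity(s) for s in scores]
        for method, scores in scores_by_method.items()
    }


def daily_counts(days, labels):
    """Count tweets for each (day, polarity label) pair, for plotting daily trends."""
    return Counter(zip(days, labels))
```

For example, `daily_counts(["2020-01-21", "2020-01-21"], ["negative", "negative"])` yields a count of 2 for the pair `("2020-01-21", "negative")`. Because only the sign is used, methods with different score ranges become directly comparable, at the cost of discarding intensity.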
We then visualized the topic model using a t-distributed Stochastic Neighbor Embedding (t-SNE) graph [18], which embeds high-dimensional data (i.e., ten dimensions given ten topics) into a graphable two-dimensional space where similar tweets are grouped together. We created an interactive visualization of the t-SNE to qualitatively evaluate the change in topics over time.
A total of 126,049 tweets from 53,196 unique users were collected during the study period (Appendix Table A2). Of these tweets, 123,407 had unique text (i.e., text that was not duplicated in any other tweet in the dataset); there were no retweets in the sample. The most prevalent identification hashtag was #coronavirus, followed by #wuhancoronavirus, present in 82% and 13% of tweets, respectively. The collected tweets accumulated 114,635 replies, 1,248,118 retweets, and 1,680,253 likes.
In the first week of our analysis, the number of COVID-19 related tweets remained stable with fewer than 100 tweets per hour. The number of tweets per hour increased on January 20th, reached as many as 250 tweets per hour by January 21st, and continued to grow, peaking at over 1,700 tweets per hour by January 28th, 2020. This trend closely tracked the number of newly confirmed COVID-19 cases in the study period (Figure 1).
Collected tweets contained 2,877,816 words and 15,955,720 characters. The most common word in our analysis was 'outbreak', appearing 11,549 times (Figure 2). The other top 15 most commonly used words and their frequency in descending order were: 'spread' (11,290), 'health' (9,734), 'confirm' (6,897), 'death' (5,819), 'city' (5,662), 'report' (5,662), 'first' (5,431), 'world' (5,244), 'travel' (5,049), 'hospital' (4,405), 'infect' (4,388), 'SARS' (4,133), 'mask' (3,996), 'patient' (3,981), and 'country' (3,885).
Prior to January 20th, our analysis showed a very small percentage of tweets related to infection prevention and control (IPC), followed by a steady increase starting January 21st (Figure 3). Isolation-related tweets were the most prevalent, followed by mask and hand hygiene. Coinciding with the quarantine of Hubei province, isolation-related tweets disproportionately increased on January 24. All IPC subgroups increased over time but their ranking did not change. IPC-related content was present in 4.8% of tweets. Discussions of prevention techniques, shortage of protective gear, dissemination of health information, and large-scale quarantine were most common. Tweets with reference to vaccinations were found in 1.2% of total tweets and increased at a slower rate than IPC-related tweets overall. The most prevalent vaccine-related tweets were about vaccine availability, vaccine development, and advocacy to receive the influenza vaccine. (Table 1).
We analyzed tweets for positive, neutral, or negative polarity. Tweets with a negative sentiment polarity were more common than neutral and positive tweets and increased at a faster rate over time (Figure 4b).
Topic modeling identified ten themes, which are recorded in Figure 5a. Keywords are listed in order of weight in forming the abstract topics found within the text. A tweet may include multiple topics, but typically has one predominant topic. The most common predominant topic was the economic and political impact, followed by government response to the virus, then discussion of the outbreak and its development and transmission. The least common topics included index cases, the public health response, and healthcare provision. Other topics included the number of cases and death as well as prevention and large-scale quarantine.
An interactive visualization of tweet themes showing their development by day is available at https://ssaleh2.github.io/Early_2019nCoV_Twitter_Analysis/; hovering over a node will show the tweet text and the day it was posted (please note the figure is slow to load and the slider on top allows navigation through time). Figure 5b shows three screenshots from the visualization. Major themes clustered in the center while more obscure tweets displayed in the periphery. As tweets may include multiple topics, there is visible cross-over between topic clusters in the visualization. Topic clusters that included themes of outbreak and its transmission, public health risk, and index cases were discussed from the start of the study period, while discussion of quarantine effects, economic and political impact, and government response increased significantly in the second week of the study period.
When focusing on the top 10%, 1%, and 0.1% most retweeted tweets, discussion of quarantine efforts was the most predominant topic (Table 1). Outbreak transmission as well as prevention were the next most common topics in the top 10% and 1% of tweets. In the top 0.1% of tweets, healthcare provision and index cases by country were the next most common topics.
In this study, we demonstrate a persistent increase in overall Twitter activity as well as tweets with negative sentiment and emotions for the COVID-19 outbreak from January 21, 2020 onward. The frequency of tweets paralleled the number of infected individuals worldwide during the early stages of the COVID-19 outbreak. Tweets predominantly showed negative sentiment and were linked to emotions of fear primarily, as well as surprise and anger. We identified examples of tweets with misinformation, but tweets were also significantly used to disseminate valuable public health information, especially in the more popular retweeted tweets.
These data may help medical experts and public health officials to identify types of communication and messaging that may allay emotion and decrease misinformation. Emotions have been shown to alter how we think, decide, and solve problems, especially in highly charged situations of outbreaks [19].
Surveillance programs for emerging and highly dangerous infections are difficult and labor intensive [22]. Leveraging the knowledge of the crowds by analyzing social media posts offers a
Twitter is the most popular social media platform for healthcare communication; however, skepticism of its utility has been long discussed. Opponents often cite misinformation and the inability to process high volumes of information [23]. We found evidence of misinformation and Twitter have taken on the responsibility of acting as stewards of information related to COVID-19 by removing false information and redirecting web traffic to reputable websites [24]. The account of the user who tweeted the misleading patent information above was subsequently suspended [25]. Twitter Singapore adjusted their search prompt to show links to authoritative
Twitter to obtain and disseminate information offers the opportunity to change the narrative and educate millions of people. Since the outbreak started, the WHO has educated the public with a steady stream of tweets [32]. Some tweets analyzed were related to infection prevention measures (hand washing, mask wearing, self-isolation), but these were still the minority, representing less than 5% of tweets. From a public health perspective, the ability to analyze Twitter feeds in real time (using the Twitter Streaming or PowerTrack API) and the potential to individually target segments of the population with high-impact messages based on their information needs and sentiment could be an extremely powerful tool, potentially more effective than any other communication medium.
To date, bots (autonomous programs able to interact with computer systems or users) have been used on Twitter for advertising or to promulgate malicious or false content [33, 34]. However, public health and governmental organizations like the WHO or the CDC should invest in this new technology. Autonomous tools that identify tweets, for example, by users who are scared of contracting COVID-19, could be deployed to send individually targeted messages that provide reassurance and education on preventive measures such as handwashing and self-quarantine. Tailoring automatic responses to the sentiments and content of tweets has the potential to engage more Twitter users on public health topics and to redirect the discussion to useful, accurate information.
This study had several limitations. First, we used a non-comprehensive list of hashtags that was limited by a subset of trending hashtags at the time and the imagination of the authors. We may have missed alternative terminology or misspellings and may have introduced some selection bias in the tweets we analyzed. For example, #wuhanoutbreak was not included, but arose as a weighted term in our topic modeling. Conversely, #coronavirus may have identified tweets related to other infections such as Severe Acute Respiratory Syndrome. Second, despite the large number of tweets analyzed (>126K), we collected and analyzed only a relevant subset of all tweets, which introduces some selection bias. Third, we targeted tweets in the English language; thus, our conclusions may not be generalizable to other countries where English is not the predominant language. Therefore, this study likely does not inform perception in China, where the majority of cases occurred in the early stages of the outbreak. Lastly, we recognize that ascribing topic themes based on a subset of weighted terms has opportunity for labeling bias.
To mitigate that, two authors designed the topic model and a separate set of authors labeled the topic themes.
We were able to show that the frequency of tweets paralleled the number of newly infected individuals for the early stages of the COVID-19 outbreak. Tweets predominantly showed negative sentiment and were linked to emotions of fear primarily, as well as surprise and anger. While tweets with misinformation were present, tweets were also significantly used to disseminate valuable public health information, especially in the more popular retweeted tweets. Twitter offers novel opportunities to public health and governmental agencies to not only measure outbreaks, but also to target messages of a public health nature based on user interest and emotion.
Availability of data and materials: The data that support the findings of this study are available upon request.
Competing interests: The authors declare that they have no competing interests.
Dissemination of health information through social networks: Twitter and antibiotics
org: Open-source and linked data for epidemiology
The Use of Twitter to Track Levels of Disease Activity and Public Concern in the U.S. during the Influenza A H1N1 Pandemic
Communicating Ebola through social media and electronic news media outlets: A cross-sectional study
What can we learn about the Ebola outbreak from tweets?
High correlation of Middle East respiratory syndrome spread with Google search and Twitter trends in
Identifying Protective Health Behaviors on Twitter: Observational Study of Travel Advisories and Zika Virus
Zika in Twitter: Temporal Variations of Locations, Actors, and Concepts
State Dept. issues highest advisory: 'Do not travel to China' amid coronavirus outbreak Accessed 4
First Travel-related Case of 2019 Novel Coronavirus Detected in United States
Novel Coronavirus (2019-nCoV) SITUATION REPORT -1.
World Health Organization
The emotions. Cambridge Editions de la Maison des sciences de l'homme
Syuzhet: Extract Sentiment and Plot Arcs from Text
Emotion Recognition on Twitter: Comparative Study and Training a Unison Model
Latent dirichlet allocation
parallelized Latent Dirichlet Allocation. gensim
Visualizing Data using t-SNE
How emotions affect logical reasoning: evidence from experiments with mood-manipulated participants, spider phobics, and people with exam anxiety
Perception is Reality, and Reality Drives Perception: No Time to Celebrate Yet
Respirator Use in a Hospital Setting: Establishing Surveillance Metrics
Google and Twitter scramble to stop misinformation about coronavirus. The Washington Post
Misinformation About The Coronavirus Is Spreading Online
Figure 5a. The fifteen terms (in order of weighting) that contributed to each abstract topic with their potential theme labels. The topics are ordered by frequency. Colors for each topic correspond to those in Figure 5b. Topic labels were assigned by the authors.
Words contributing to topic model