key: cord-0790251-ipupxz7e
authors: Xue, Jia; Chen, Junxiang; Chen, Chen; Zheng, Chengda; Li, Sijia; Zhu, Tingshao
title: Public discourse and sentiment during the COVID 19 pandemic: Using Latent Dirichlet Allocation for topic modeling on Twitter
date: 2020-09-25
journal: PLoS One
DOI: 10.1371/journal.pone.0239441
sha: 7a3e01c2076e526ea5676fd42cd90c846fd70f97
doc_id: 790251
cord_uid: ipupxz7e

The study aims to understand Twitter users’ discourse and psychological reactions to COVID-19. We use machine learning techniques to analyze about 1.9 million Tweets (written in English) related to coronavirus collected from January 23 to March 7, 2020. A total of 11 salient topics are identified and then categorized into ten themes, including “updates about confirmed cases,” “COVID-19 related death,” “cases outside China (worldwide),” “COVID-19 outbreak in South Korea,” “early signs of the outbreak in New York,” “Diamond Princess cruise,” “economic impact,” “preventive measures,” “authorities,” and “supply chain.” Results do not reveal treatment- and symptom-related messages as prevalent topics on Twitter. Sentiment analysis shows that fear of the unknown nature of the coronavirus is dominant in all topics. Implications and limitations of the study are also discussed.

The WHO has declared COVID-19 a global pandemic. Social media has played a crucial role since before the outbreak and continues to do so as the virus spreads globally. After China took strict quarantine measures as an intervention (e.g., city lockdowns, school closures, and self-isolation), Chinese social media platforms (e.g., Weibo, WeChat, Toutiao) became the lifeline for almost all isolated people, who had been housebound for 30+ days and relied on these channels to obtain information, exchange opinions, socialize, and order food [1]. Existing studies [2] [3] [4] [5] show that Twitter data can provide useful information about epidemic diseases (e.g., H1N1, Ebola), including tracking rapidly evolving public sentiment, measuring public interests and concerns, estimating real-time disease activity and trends, and tracking reported disease levels. However, these studies are limited because they manually and qualitatively coded only a very small number of Tweets. More advanced techniques are required to improve accuracy and precision when examining public opinions and sentiments. In addition, public reactions to COVID-19 online remain unknown. The vast majority of retrieved articles about COVID-19 and 2019-nCoV focus on epidemic control, such as the transmissibility of the virus [6], clinical characteristics of the infected cases [7], and patient screening [8].

The present study uses a large volume of collected Twitter data to add to our understanding of the pandemic. Aiming to explore the public discourse and psychological reactions during the early stage of COVID-19, we use a machine learning approach to examine (1) What latent topics related to COVID-19 can we identify from these Tweets? (2) What are the themes of these identified topics? (3) How do Twitter users react emotionally to the COVID-19 pandemic? and (4) How do these sentiments change over time?

We used an observational study design and a purposive sampling approach to select all Tweets containing defined hashtags (e.g., #2019nCoV) related to COVID-19 on Twitter. We used natural language processing methods to find salient topics and terms related to COVID-19. Our Twitter data mining approach included data preparation and data analysis. Data preparation consisted of three steps: (1) sampling; (2) data collection; and (3) pre-processing the raw data. After pre-processing the raw dataset, we proceeded to the data analysis stage, which included (1) unsupervised machine learning; (2) qualitative analysis; and (3) sentiment analysis. The unit of analysis was the individual Tweet.

We purposively selected a list of 19 trending hashtags related to COVID-19 as key search terms to collect Tweets on Twitter (S1 Table). We used Twitter's open application programming interface (API) to collect Tweets published between January 23, 2020, and March 7, 2020. We used the Python code provided by Twitter Developer [9] to access the Twitter API. As shown in Fig 1, a total of 20 million (n = 20,370,854) Tweets were collected. After we removed the non-English Tweets (n = 9,694,320) and the duplicates and retweets (n = 7,731,035), 1.9 million (n = 1,963,285) Tweets remained as the dataset for this study. The following were collected for each Tweet: (1) the full text of the message; and (2) the features (a) hashtags; (b) the number of favorites; (c) the number of followers; (d) the number of friends; (e) the number of retweets; (f) user location; and (g) user description. Our data collection method complied with Twitter's Terms of Service and Developer's Agreement and Policy.
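The filtering step described above (dropping non-English Tweets, retweets, and duplicates) could look roughly like the sketch below. The field names `lang`, `retweeted_status`, and `full_text` follow the Twitter API v1.1 Tweet object; using them here, and treating identical text as a duplicate, are assumptions, since the study does not state how language or duplicates were detected. The sample records are invented for illustration.

```python
def filter_tweets(tweets):
    """Keep English, non-retweet, non-duplicate Tweets (first occurrence wins)."""
    seen_texts = set()
    kept = []
    for tweet in tweets:
        # Assumption: language is taken from the API's "lang" field.
        if tweet.get("lang") != "en":
            continue
        # Assumption: retweets are recognised by the "retweeted_status" key.
        if "retweeted_status" in tweet:
            continue
        # Assumption: a duplicate is a Tweet whose text was already seen.
        text = tweet.get("full_text", "").strip()
        if text in seen_texts:
            continue
        seen_texts.add(text)
        kept.append(tweet)
    return kept


# Hypothetical records, not taken from the study's dataset.
sample = [
    {"id": 1, "lang": "en", "full_text": "Stay safe #coronavirus"},
    {"id": 2, "lang": "es", "full_text": "Cuidate #coronavirus"},
    {"id": 3, "lang": "en", "full_text": "RT @a: Stay safe #coronavirus",
     "retweeted_status": {"id": 1}},
    {"id": 4, "lang": "en", "full_text": "Stay safe #coronavirus"},
]
kept = filter_tweets(sample)
print([t["id"] for t in kept])
```

Only the first record survives: the second is not English, the third is a retweet, and the fourth repeats the text of the first.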

We pre-processed the raw data to ensure quality. We used Python, a programming language, to conduct data analysis. The pre-processing plan was as follows:
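The four pre-processing steps reported for this study (removing hashtags, @users, and URLs; removing non-ASCII characters; collapsing repeated letters such as "sooooo" to "so"; and removing special characters, punctuation, and numbers) can be sketched in Python as follows. The regular expressions and the function name `clean_tweet` are illustrative assumptions, not the authors' code.

```python
import re

URL_RE = re.compile(r"https?://\S+")         # URLs
TAG_RE = re.compile(r"[#@]\w+")              # hashtags and @users, symbol and content
REPEAT_RE = re.compile(r"([A-Za-z])\1{2,}")  # a letter repeated three or more times
NON_LETTER_RE = re.compile(r"[^A-Za-z\s]")   # special characters, punctuation, digits


def clean_tweet(text):
    """Apply the four reported cleaning steps to one Tweet (illustrative only)."""
    # Step 1: strip URLs first, since a URL may itself contain '#' or '@'.
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    # Step 2: drop non-ASCII characters.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # Step 3: collapse runs of three or more identical letters to one letter.
    text = REPEAT_RE.sub(r"\1", text)
    # Step 4: remove anything that is not a letter or whitespace.
    text = NON_LETTER_RE.sub(" ", text)
    # Normalise the whitespace left behind by the substitutions.
    return " ".join(text.split())


print(clean_tweet("sooooo terrified of #COVID19 @user https://t.co/abc 2020!!!"))
```

Collapsing a run to a single letter reproduces the "sooooo" to "so" example, but it would also shorten words such as "hmmm" to "hm"; a production pipeline might prefer to collapse to two letters and then spell-check.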

Unsupervised machine learning. We used unsupervised machine learning to examine the data for patterns because this approach is commonly used when researchers have few prior observations of, or insights into, unstructured text data. A purely qualitative approach is difficult to apply to Twitter data at this scale. Unsupervised learning derives a probabilistic clustering from the data itself, allowing us to conduct exploratory analyses of large unstructured texts in social science research. We configured topic modeling, an unsupervised machine learning method, to generate the top latent topic distributions. Latent Dirichlet allocation (LDA) [10] is a probabilistic model of word counts that analyzes a set of documents. We used LDA to identify patterns, themes, and structures in the Tweet texts and to examine how these themes were connected. It enabled us to efficiently categorize the large body of data based on patterns and features. LDA has been used for sentiment analysis of health-related Tweets [11]. Topic modeling has been widely used to gain a descriptive understanding of unstructured Twitter big data in social science research [12].

Qualitative analysis. We triangulated and contextualized findings from unsupervised learning in the study. We employed the qualitative approach to support deeper qualitative dives into the dataset, such as labeling popular words and Tweet topics, assigning meanings and themes to the topics, interpreting the themes and patterns identified from the Tweets [13], and inductively developing themes for the latent topics generated by machine algorithms. The qualitative approach relies on diverse, in-depth interpretations from humans, which allows for inductive, exploratory analysis and the application of theoretical approaches [14].

Sentiment analysis. Sentiment analysis is a computational, natural language processing-based method that analyzes people's sentiments, emotions, and attitudes in given texts [15], and it is an essential method in social media research. The sentiment analysis in the present study was based on a machine learning model for predicting emotions from English Tweets [16]. This model classified each Tweet into one of eight emotions, arranged as four pairs in Plutchik's wheel of emotions [17]: joy-sadness, trust-disgust, fear-anger, and surprise-anticipation. This method returned one emotion from the eight categories for each given Tweet.

After pre-processing and removing duplicates, our final dataset consisted of 1,963,285 Tweets mentioning at least one of the nineteen hashtags from January 23 to March 7, 2020. Fig 2 presented the number of Tweets under the top 9 hashtags by date ("#Coronavirus", n = 1,405,254, "#Wuhan", n = 144,240, "#Wuhancoronavirus", n = 73,393, "#Coronaoutbreak", n = 73,147, "#2019ncov", n = 60,278, "#ChinaCoronavirus", n = 19,188, "#Chinavirus", n = 17,865, "#CoronavirusChina", n = 16,371, "#Wuhanoutbreak", n = 10,548). The number of Tweets using hashtag #coronavirus gradually

The automated LDA approach generated sets of commonly co-occurring words and organized them into different topics. We determined the most appropriate number of topics using the coherence model in gensim [18]. We set the number of topics to 11 because it yielded the highest coherence score. Fig 3 showed the coherence score for each candidate number of topics returned by the LDA model.

We analyzed the document-term matrix with the chosen 11 topics and obtained the distributions of the 11 topics. Table 1 presented the 11 identified salient topics, the most popular pairs of words (bigrams) within each topic, and the number of Tweets under each topic.
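A summary of this kind (Tweets per topic and the most frequent bigrams per topic) could be derived as in the following sketch. Assigning each Tweet to its highest-probability topic is an assumption, as the study does not describe the assignment rule; the tokens and probabilities are invented.

```python
from collections import Counter, defaultdict


def dominant_topic(topic_probs):
    """Index of the topic with the highest probability for one Tweet."""
    return max(range(len(topic_probs)), key=topic_probs.__getitem__)


def summarize(tokenized_tweets, doc_topic_probs, top_n=2):
    """Count Tweets per dominant topic and collect each topic's top bigrams."""
    tweet_counts = Counter()
    bigram_counts = defaultdict(Counter)
    for tokens, probs in zip(tokenized_tweets, doc_topic_probs):
        topic = dominant_topic(probs)
        tweet_counts[topic] += 1
        # Adjacent word pairs within the Tweet.
        bigram_counts[topic].update(zip(tokens, tokens[1:]))
    top_bigrams = {
        topic: [bigram for bigram, _ in counts.most_common(top_n)]
        for topic, counts in bigram_counts.items()
    }
    return tweet_counts, top_bigrams


# Hypothetical inputs: three tokenized Tweets and their two-topic distributions.
tweets = [
    ["confirmed", "cases", "rise"],
    ["total", "confirmed", "cases"],
    ["panic", "buying", "toilet", "paper"],
]
probs = [[0.8, 0.2], [0.7, 0.3], [0.1, 0.9]]

tweet_counts, top_bigrams = summarize(tweets, probs, top_n=1)
print(dict(tweet_counts), top_bigrams[0])
```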

We selected representative Tweets on each topic to explain the themes of these topics. Two authors discussed the bigrams and representative Tweets in each of the 11 topics and then categorized them into ten themes (Table 2). In addition, we computed the topic distance [10] and presented a 2D plane of the intertopic distance [19] in Fig 4. Each circle represented a topic from Topic 1 to Topic 11 in the study. The centers were determined by computing the distance between topics. In the visualization, these circles did not overlap, which cross-validated the classification of the ten themes. Table 2 presented the identified topics and themes, and each row of bigrams represented one topic under the theme. We identified ten themes, such as "updates about the number of COVID-19 cases (confirmed cases, total confirmed, cases reported)," "COVID-19 related death [(new deaths, total deaths) and (people die, death rates)]," and "preventive measures [(toilet paper, self-isolate), (face masks, panic buying), travel bans, and (washing hands, test kits, 20 seconds, soap water, hands soap)]". Table 3 highlighted the representative Tweets within each topic under each theme. To protect the privacy and anonymity of the Twitter users behind these sample Tweets, we either used excerpts of Tweets or paraphrased several terms in the message.
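The study does not state which distance was computed between topics. One common choice in LDAvis-style intertopic plots is the Jensen-Shannon divergence between topic-word distributions, which is then projected to two dimensions. The sketch below computes only the pairwise distances, under that assumption and with invented distributions.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; zero-probability terms contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded in [0, 1] with log base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def distance_matrix(topics):
    """Pairwise divergence between topic-word distributions."""
    return [[js_divergence(a, b) for b in topics] for a in topics]


# Hypothetical topic-word distributions over a four-word vocabulary.
topics = [
    [0.70, 0.20, 0.05, 0.05],
    [0.05, 0.05, 0.20, 0.70],
    [0.25, 0.25, 0.25, 0.25],
]
matrix = distance_matrix(topics)
for row in matrix:
    print([round(value, 3) for value in row])
```

Larger off-diagonal values correspond to circles drawn farther apart once the matrix is projected onto a plane, for example with multidimensional scaling.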

Tweets contained information about people's thoughts and emotions [20]. We presented individuals' emotional reactions to the COVID-19 pandemic in Fig 5, which showed each emotion's share of daily Tweets by date. Fear (yellow line) was consistently the dominant emotion over time, accounting for about 50% of daily Tweets from the Wuhan outbreak to early March. Although proportionally lower than fear, Tweets expressing trust (brown line) slightly increased over time. Table 4 showed the percentage of each emotion within each of the 11 topics. Across all topics, we observed that the feeling of fear was prominent. For example, fear of the unknown nature of COVID-19 accounted for almost 50% of the Tweets in all eleven topics. Approximately 24% of the emotions within Tweets under Topic 1 related to the public's trust in the health authorities.
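The daily proportions plotted in a figure of this kind can be computed from one (date, emotion) pair per Tweet, as in this minimal sketch. The records are invented and the function name is hypothetical.

```python
from collections import Counter, defaultdict


def daily_emotion_share(records):
    """Map each date to {emotion: share of that day's Tweets}."""
    per_day = defaultdict(Counter)
    for date, emotion in records:
        per_day[date][emotion] += 1
    shares = {}
    for date, counts in per_day.items():
        total = sum(counts.values())
        shares[date] = {emotion: n / total for emotion, n in counts.items()}
    return shares


# Hypothetical classifier output: one predicted emotion per Tweet.
records = [
    ("2020-02-10", "fear"), ("2020-02-10", "fear"),
    ("2020-02-10", "trust"), ("2020-02-10", "joy"),
    ("2020-02-11", "fear"), ("2020-02-11", "trust"),
]
shares = daily_emotion_share(records)
print(shares["2020-02-10"])
```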

Since fear was prominent in all eleven topics, we further ran a one-tailed z test to assess whether each of the eight emotions differed significantly across topics. We used a p-value smaller than .001 as the threshold and presented the results in Table 4. For example, fear of the uncertainty about COVID-19 was found to have a higher probability of being prevalent in Topics 1, 4, 9, and 11. Trust expressed in Tweets was significantly more prevalent in Topics 1, 2, and 10. Surprise at the pandemic was significantly more frequent in Topics 1 and 11. Joy was significantly more widespread in Topics 5, 7, 8, and 11.
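One plausible form of such a test is a one-tailed two-proportion z test that compares an emotion's share in one topic with its share in all other topics. The study does not specify the exact formulation, so the pooled-variance version below and the counts fed to it are assumptions.

```python
import math


def one_tailed_z_test(x1, n1, x2, n2):
    """Test H1: p1 > p2 for two proportions x1/n1 and x2/n2 (pooled variance).

    Returns the z statistic and the upper-tail p-value.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Hypothetical counts: fear Tweets in one topic versus all remaining topics.
z, p_value = one_tailed_z_test(x1=600, n1=1000, x2=4500, n2=9000)
print(round(z, 2), p_value < 0.001)
```

With these invented counts the topic's fear share (60%) exceeds the remainder's (50%) by about six standard errors, well past the .001 threshold.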

This study shows Twitter users' discussions of and sentiments toward COVID-19 from January 23 to March 7, 2020. Our findings facilitate a rapid, real-time understanding of public discussions and sentiments about the outbreak of COVID-19, contributing to the surveillance system to understand the evolving situation. The study overcomes the limitations of the traditional social science approach, which relies on time-consuming, retrospective, time-lagged, small-scale surveys and interviews. The identified patterns and emotions of public Tweets could be used to guide targeted intervention programs. First, early recognition of COVID-19 cases and a potential outbreak in New York City was identified among a massive number of Tweets, suggesting that the Twitter community had acknowledged the severity of the disease as early as February. A small peak in Tweet volume is identified between February 10 and 14, and the volume then gradually increases again after February 14. This finding coincides with the CDC's very first warning on Twitter (@CDCgov) on February 10, 2020: "If you've recently from China, know the symptoms of #2019nCoV. These include mild to severe respiratory illness with fever, cough, shortness of breath. See bit.ly/38zjnYo." The increase in Tweets may have followed the CDC's post, suggesting that February offered a good opportunity to guide the public to take preventive measures. Rapidly identifying and utilizing social media messages may help the public and authorities respond to the spread of the disease at the early stages.

Second, discussions of COVID-19 symptoms (e.g., cough, fever, difficulty breathing) and treatments (e.g., vaccine, rest and sleep, drink liquids) were notably missing from our collected Tweets from January 23 to March 7, 2020. One study selects Tweets (n = 35,786) associated with COVID-19 symptoms (e.g., diagnosed, pneumonia, fever, cough) from March 3 to 20, 2020, and finds that the volume of signal Tweets for symptoms increases over time [21]. The inconsistent findings suggest that Twitter is not widely used as a platform for posting symptoms or seeking medical help. Our findings suggest that health authorities or public health communities could post more treatment-related messages on social media as an educational tool for the public.

Third, fear is a dominant emotion in all topics during the early stages of the COVID-19 pandemic. Results are consistent with other studies [22] [23] [24] [25], which show that COVID-19 significantly impacts individuals' psychological conditions. Sentiment analysis of COVID-19 pandemic-related content contributes to our understanding of the dynamics of online users' concerns and feelings during the epidemic. Our findings have implications for health authorities: mental health and psychosocial well-being support is needed during this time [20].

There are several limitations to the study. First, we sampled only 19 trending hashtags as search terms to collect Twitter data. Some new hashtags have become trending terms for Twitter users to group topics over time. For example, #COVID19 has been widely used since it became the official name for the disease. Second, Twitter users are not representative of the whole population, and the results only indicate online users' opinions and reactions about COVID-19. However, the Twitter dataset is a valuable source for understanding real-time, user-generated Twitter content related to COVID-19 disease activities. Third, non-English Tweets are removed from the analysis, and results are limited to a particular population. Future studies are recommended to include Italian, German, and Spanish for COVID-19 analysis.

Table 3 (continued). Representative Tweets within each topic under each theme.

• ". . .South Korean city face shortage of hospital bed as #outbreak expands. . ."
• "#southkorea declares 'war' on #coronavirus . . ."

Early signs of the outbreak in New York City
• ". . .in the news, NYC orders mandatory coronavirus testing for public workers . . ."
• "@homedepot,@lowes, and any respectable hardware store from the bottom of NYC all the way upstate to Rochster is completely sold out of all respiratory masks. . ."

Diamond Princess cruise
• ". . .approx. 100 more people on Princess Diamond showed symptoms like a fever, and will be tested soon. . ."
• ". . .passenger of Diamond Princess ship tested positive for the virus #2019nCoV. . ."
• ". . .61 people now infected on #DiamondPrincess cruise ship off japan #coronavirus. . ."

Economic impact
• ". . .IMF chief says the outbreak could derail global economic growth. . ."
• ". . . https://t.co/OtsbHOZBTW #economicoutlook #markets • #globaleconomy #Coronavirus likely to impact. . ."
• ". . .airline stocks crash, face turbulence amid coronavirus. . .airline stocks fell significantly on Thursday . . ."

Preventive measures
• ". . .a crappy coronavirus shortage toilet paper . . ."
• ". . .my understanding is that the best way to stop the spread of #covid19 is to use hand sanitizer and not touch my face. . ."
• ". . .stay safe wearing masks, avoid outside plans, stay at home as much as you can #coronavirusoutbreak. . ."
• ". . .we've had travel bans for over 4 weeks. . ."
• ". . . Trump lied about #coronavirus, vote him out #voteblue #JoeBiden2020. . ."
• "coronavirus 'likely' to hit UK-professors say public health officials must do more #coronavirus. . ."
• "Mike pence will stop #coronavirus with gender segregated workplaces and don't tell him otherwise. . ."
• ". . .Chinese doctor #LiWenLiang, one of the eight HERO whistleblowers who tried to warn other . . ."
• ". . .is the the figure #WHO told us the coronavirus is under control? Let there be no panic. . ."
• ". . .the PRESIDENT OF THE UNITED STATES said the coronavirus was not a concern anymore #CDC. . ."

Supply chain
• "with #wuhancoronavirus, the supply chain in China will soon collapse, better prepare for the global shortage of supply of everything. . ."
• ". . .@Catalysis3D can help with low cost and fast additive manufactured bridge tooling and part. . .#supplychain. . ."

Conceptualization: Jia Xue, Tingshao Zhu.

Data curation: Jia Xue, Chen Chen.

Formal analysis: Jia Xue, Junxiang Chen.

Funding acquisition: Jia Xue.

Methodology: Jia Xue, Chen Chen, Chengda Zheng.

Supervision: Tingshao Zhu.

[1] The coronavirus and Chinese social media: finger-pointing in the post-truth era

[2] Pandemics in the age of Twitter: content analysis of Tweets during the 2009 H1N1 outbreak

[3] Early assessment of anxiety and behavioral response to novel swine-origin influenza A (H1N1)

[4] Using photos for public health communication: a computational analysis of the Centers for Disease Control and Prevention Instagram photos and public responses

[5] Using Twitter to estimate H1N1 influenza activity

[6] Pathogenicity and transmissibility of 2019-nCoV-A quick overview and comparison with other emerging viruses

[7] Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China: a descriptive study

[8] Effectiveness of airport screening at detecting travellers infected with novel Coronavirus (2019-nCoV). Euro Surveillance

[9] Get Tweet timelines

[10] Latent dirichlet allocation

[11] Discovering health topics in social media using topic models

[12] Personality, gender, and age in the language of social media: The open-vocabulary approach

[13] Using thematic analysis in psychology

[14] The SAGE handbook of social media research methods. London: SAGE publication

[15] An overview of sentiment analysis in social media and its applications in disaster relief

[16] Emotion Recognition on Twitter: Comparative Study and Training a Unison Model

[17] A General Psychoevolutionary Theory of Emotion

[18] Exploring the space of topic coherence measures

[19] Interpretation and trust: designing model-driven visualizations for text analysis

[20] Using Social Media to Track Geographic Variability in Language About Diabetes: Infodemiology Analysis

[21] Machine learning to detect self-reporting of symptoms, testing access, and recovery associated with COVID-19 on Twitter: retrospective big data infoveillance study. JMIR Public Health and Surveillance

[22] The impact of COVID-19 epidemic declaration on psychological consequences: a study on active Weibo users

[23] Global sentiments surrounding the COVID-19 pandemic on Twitter: analysis of Twitter trends. JMIR Public Health and Surveillance

[24] Examining the impact of COVID-19 lockdown in Wuhan and Lombardy: a psycholinguistic analysis on Weibo and Twitter

[25] Twitter discussions and concerns about COVID-19 pandemic: Twitter data analysis using a machine learning approach

1. We removed the hashtag symbol and its content (e.g., #COVID19), @users, and URLs from the messages because the hashtag symbols or the URLs did not contribute to the message analysis.

2. We removed all non-English characters (non-ASCII characters) because the study focused on the analysis of messages in English.

3. We removed repeated characters within words. For example, "sooooo terrified" was converted to "so terrified".

4. We removed special characters, punctuation, and numbers from the dataset as they did not help with detecting the profanity comments.

Validation: Chengda Zheng.

Writing - review & editing: Jia Xue, Sijia Li.