key: cord-0972549-3kjyvyaq
authors: Zheng, Han; Goh, Dion H.‐L.; Lee, Chei S.; Lee, Edmund W. J.; Theng, Yin L.
title: Uncovering temporal differences in COVID‐19 tweets
date: 2020-10-22
journal: Proc Assoc Inf Sci Technol
DOI: 10.1002/pra2.233
sha: 11a62f46b1dff7c2eeb72418ead3e0f1d75be458
doc_id: 972549
cord_uid: 3kjyvyaq

In the fight against the COVID‐19 pandemic, understanding how the public responds to various initiatives is an important step in assessing current and future policy implementations. In this paper, we analyzed tweets using topic modeling to uncover the issues surrounding people's discussion of the disease. Our focus was on temporal differences in topics before and after the declaration of COVID‐19 as a pandemic. Nine topics were identified in our analysis, each of which showed distinct levels of discussion over time. Our results suggest that as the pandemic progresses, the concerns of the public vary as new developments come to light.

In December 2019, pneumonia of unknown cause was reported in Wuhan, China. Initially thought to be a localized problem, this disease has now been declared the COVID-19 pandemic, infecting about 2 million people worldwide and claiming more than 120,000 lives as of April 2020 (World Health Organization [WHO], 2020b). To stem the progress of COVID-19, governments around the world have instituted a variety of measures including lockdowns of various degrees, public health campaigns, work-from-home initiatives and online learning.

A critical prong in the multifaceted fight against this disease is the behavior of the public in conforming to government directives as well as taking various protective measures such as washing of hands and social distancing (World Health Organization [WHO], 2020a). Hence, understanding how the public responds to COVID-19 initiatives is an important step in assessing current policy implementations and guiding future policy development. Here, social media postings have the potential to provide a glimpse into people's responses to the disease as numerous messages urging positive public health behaviors have emerged on various platforms, along with news updates, personal opinions and anecdotes.

The present research aims to uncover the issues surrounding the discussion of COVID-19 on Twitter. We employ topic modeling in our analysis of tweets. This technique facilitates the automated discovery of patterns that reflect the underlying topics in a corpus of documents (Sharma & Sharma, 2017). Of particular interest are temporal differences in topics, notably before the declaration of COVID-19 as a pandemic by the WHO on March 11, 2020, and after this characterization. Although the disease had already reached high levels of spread and severity worldwide before March 11, the pandemic label would presumably have spurred governments and individuals to pay more attention to COVID-19 and adopt measures to curb it.

There are two reasons for using tweets in our research. First, Twitter is currently in active use by governments, organizations and individuals for COVID-19 information sharing. Second, it is arguably an important source of information, and has been used in studies of other disease outbreaks (e.g., Signorini, Segre, & Polgreen, 2011).

User-generated postings such as tweets are excellent sources of public health information (Sinnenberg et al., 2017). Compared to traditional public health surveillance methodologies (e.g., surveys), Twitter data have the advantages of being "naturally occurring", inexpensive to obtain, and granular and high in velocity (Lee & Yee, 2020). The act of tweeting reflects the degree of public attention and collective public sentiment toward certain health issues, and thus could provide useful leading signals for public health researchers to act on (Kuehn, 2015). In the context of infectious diseases, Twitter data have been used to understand and map the spread of malaria (Fung et al., 2017), H1N1 (Chew & Eysenbach, 2010), and Ebola (Liang, 2018), to name a few examples.

By incorporating temporal components when analyzing tweets, one could uncover critical variations in the spread of COVID-19 information down to a granular level, such as the evolution of discussions on specific days as the disease spreads, and monitor the spread of the disease (Chen, Hossain, Butler, Ramakrishnan, & Prakash, 2016). This allows researchers to engage effectively in dissemination science by enabling public health organizations to develop targeted strategic messaging. After all, past research has documented that time matters when examining tweets in public health contexts, and the temporal distribution of COVID-19 information could provide a nuanced understanding of how people communicate, which text alone cannot give (Stefanidis et al., 2017).

The dataset used in this study was from an ongoing project that has actively collected COVID-19 tweets since January 28, 2020 (Chen, Lerman, & Ferrara, 2020), leveraging Twitter's search API with a list of keywords and accounts related to COVID-19 (e.g., "coronavirus", "corona", "Covid-19", "Covid"). As of April 10, 2020, this project had collected around 94.67 million tweets. Since we focused on the tweets before and after the declaration of COVID-19 as a pandemic on March 11, 2020, we selected two weeks of tweets between March 4, 2020 and March 18, 2020. The project only released the Tweet IDs of the collected tweets. Thus, we used the software Hydrator to extract the tweets for this timeframe (Summers, 2017). There were 18.8 million tweets in total during this two-week period, and the number of tweets per day ranged from 913,230 to 3,408,778. Due to the large data size and to facilitate processing, we randomly sampled 5% of the tweets on each day, and the final sample comprised 940,837 tweets. This random sampling approach is consistent with prior research (e.g., Cavazos-Rehg et al., 2016; DiGrazia, McKelvey, Bollen, & Rojas, 2013).
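The per-day 5% sampling step can be sketched as follows. This is a minimal illustration in Python rather than the authors' own code (the analysis in this paper was done in R); the function name `sample_per_day`, the `(day, tweet_id)` input layout, and the fixed seed are assumptions made for the example.

```python
import random
from collections import defaultdict


def sample_per_day(tweets, fraction=0.05, seed=42):
    """Draw a simple random sample of `fraction` of the tweets within each day.

    `tweets` is an iterable of (day, tweet_id) pairs. This layout is
    hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    # Group tweet IDs by the day on which they were posted.
    by_day = defaultdict(list)
    for day, tweet_id in tweets:
        by_day[day].append(tweet_id)

    # Sample without replacement within each day, so that busy days
    # and quiet days are represented in proportion to their volume.
    sampled = {}
    for day, ids in by_day.items():
        k = round(len(ids) * fraction)
        sampled[day] = rng.sample(ids, k)
    return sampled
```

Sampling within each day, rather than across the pooled two weeks, keeps every day's share of the sample equal to its share of the full collection, which matters for the day-by-day comparisons that follow.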

Data were analyzed using R statistical software version 3.5.1. First, we eliminated non-English tweets and duplicate tweets in the dataset. Next, we preprocessed the tweets by removing the "RT" (retweet) text, usernames, URL links, punctuation, and numbers. We tokenized the tweets into single words and converted all words to lower case. Further, we removed a list of standard stopwords such as "the," "is," and "are," plus additional stopwords that frequently appeared in the tweets (e.g., "COVID-19," "coronavirus," "virus," etc.). We also used the Porter stemmer to reduce the words to their root forms. Finally, to reduce the dimensionality of the data, we removed sparse terms that did not appear very often. After preprocessing, 258,290 valid English tweets that consisted of 1,450,595 words and 1,509 unique words were used for further analysis.
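A rough sketch of this preprocessing pipeline is given below, in Python with only the standard library. It is not the authors' R code. The regular expressions, the short stopword list, and the `min_docs` threshold are illustrative assumptions, and Porter stemming is left out because it would need a third-party library.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption). The paper used a standard
# list plus corpus-specific terms such as "COVID-19", "coronavirus", "virus".
# "covid" is listed because digits and hyphens are stripped before matching.
STOPWORDS = {"the", "is", "are", "a", "to", "and", "of",
             "covid", "coronavirus", "virus"}


def preprocess(tweet, stopwords=STOPWORDS):
    """Clean one English tweet and return a list of lower-case tokens."""
    text = re.sub(r"\bRT\b", " ", tweet)       # drop the retweet marker
    text = re.sub(r"@\w+", " ", text)          # drop usernames
    text = re.sub(r"https?://\S+", " ", text)  # drop URL links
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)      # drop punctuation and numbers
    return [tok for tok in text.split() if tok not in stopwords]


def remove_sparse_terms(docs, min_docs=2):
    """Keep only terms that occur in at least `min_docs` documents."""
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return [[t for t in doc if doc_freq[t] >= min_docs] for doc in docs]
```

For example, `preprocess("RT @WHO: Wash your hands! https://t.co/abc 20 seconds #COVID19")` yields the tokens `wash`, `your`, `hands`, `seconds`. The order of the steps matters: usernames and URLs are removed before punctuation is stripped, since otherwise their fragments would survive as ordinary words.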

Latent Dirichlet Allocation (LDA) topic modelling was employed to identify the common COVID-19 topics discussed on Twitter. It is an unsupervised machine learning method that uncovers the hidden semantic structures in a given textual corpus and assigns individual documents to a fixed set of topics (Blei, Ng, & Jordan, 2003). We used the Gibbs sampling algorithm as it allows iterative steps through configurations to estimate optimal model fit (Geman & Geman, 1984). To select the best number of topics for the corpus, we ran models with the number of topics ranging from 2 to 20, in increments of 1. To evaluate the quality of these models, we considered two data-driven metrics (Cao, Xia, Li, Zhang, & Tang, 2009; Deveaud, SanJuan, & Bellot, 2014) and the interpretability of the topics in each model. Cao et al.'s (2009) metric suggests that the LDA model performs best when the average cosine distance of topics reaches its minimum. Deveaud et al.'s (2014) metric posits that the optimal number of topics is the one with the maximum information divergence. The analyses resulted in a decision to run LDA with nine topics for the corpus. Table 1 shows the nine topics derived from our LDA topic modelling. To manually assign topic names, the top 10 terms based on beta values in each topic were taken into account. A beta value refers to the probability of a term belonging to a given topic. Thus, a higher beta value indicates that the term better describes the topic.
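Two of the quantities described above can be illustrated with a short Python sketch: ranking terms by beta within a topic, and an average pairwise cosine similarity between topics, which is a simplified reading of the Cao et al. (2009) criterion (lower values indicate more distinct topics). The function names and the plain list-of-lists layout for the beta matrix are assumptions made here; the paper does not state which implementation was used.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity between two topic-word probability vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_topic_similarity(beta):
    """Mean cosine similarity over all pairs of topics.

    `beta` is a list with one row per topic; each row holds the
    probability of every vocabulary term under that topic.
    """
    pairs = list(combinations(beta, 2))
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)


def top_terms(beta_row, vocab, n=10):
    """Return the `n` terms with the highest beta values for one topic."""
    ranked = sorted(zip(vocab, beta_row), key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in ranked[:n]]
```

Under this reading, one would fit a model for each candidate number of topics, compute `average_topic_similarity` on its beta matrix, and favor the candidates with the lowest values, weighed against the second metric and against how interpretable the resulting topics are.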

In addition, we examined tweets in each topic to help in the labeling. To illustrate, for topic 3, the key words were "hand," "home," and "stay." The focus of this topic might be related to preventive measures in response to COVID-19 such as washing hands and staying home. We thus examined the associated tweets for topic 3. For example, one user on 14 March wrote that "The CDC says you should avoid shaking hands due to coronavirus during a press conference…" Similarly, another user posted "Wearing a face mask when you have a cold or flu should become the norm as it is in Japan." on 4 March. As such, we labelled topic 3 as "Preventive measures." In this way, we assigned names to the other eight topics as presented in Table 1.

Next, we sought to uncover temporal differences in the COVID-19 tweets. First, as the number of tweets varied across the days, we divided the number of tweets in each topic per day by the total number of tweets per day to get a topic weightage score for each day. Second, we visualized the trend of how each topic weightage changed during the 2 weeks (see Figure 1).
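The topic weightage score described above is a per-day proportion, which can be sketched in Python as follows. The sketch assumes that each tweet has already been assigned to a single topic; the function name and the `(day, topic)` input layout are hypothetical.

```python
from collections import Counter, defaultdict


def daily_topic_weightage(assignments):
    """Compute each topic's share of the tweets posted on each day.

    `assignments` is an iterable of (day, topic) pairs, one per tweet.
    """
    # Count tweets per topic within each day.
    counts = defaultdict(Counter)
    for day, topic in assignments:
        counts[day][topic] += 1

    # Divide by the day's total so that days with different tweet
    # volumes become directly comparable.
    weightage = {}
    for day, topic_counts in counts.items():
        total = sum(topic_counts.values())
        weightage[day] = {topic: n / total for topic, n in topic_counts.items()}
    return weightage
```

Because the weightages within a day sum to 1, a rise in one topic's score reflects a shift in relative attention rather than an increase in overall tweet volume.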

Overall, compared to the week prior to the pandemic declaration on March 11, there was more discussion of preventive measures (topic 3), organizing healthcare resources (topic 6), and government help and support (topic 9) in the second week. In contrast, less attention was paid to mortality rates of COVID-19 (topic 1) and reporting of new cases (topic 5) after the declaration. Interestingly, discussions on topic 2 (origin of COVID-19) and topic 4 (Trump's responses to the pandemic) fluctuated before experiencing a sharp increase on March 17. Finally, topic 7 (coping with the pandemic) and topic 8 (reports of lockdowns) increased steadily in the first week and reached their peaks after the declaration, followed by a sharp decrease thereafter.

We found that the topics generated reflect the diversity of the narratives surrounding COVID-19. With the rapidly [...]

TABLE 1 Nine topics generated by the LDA topic modeling (columns: Top 10 words in the topic; Rate %; Example)

One interesting finding is that the discussions of topics reflected the volatility and social effects of COVID-19. For example, one of the topics we found was the origins of the virus (topic 2). One possible reason for the interest is that there were multiple narratives on this topic, and the truth of the origin remains elusive. The volume of discussion on this topic was initially high but waned until March 11, 2020. However, the real-life political exchanges and tensions between US and Chinese officials, as well as the US president labelling the pandemic a "Chinese virus" on March 17, 2020, likely triggered more attention on this topic on Twitter. This suggests that discussions on Twitter are influenced by reports from mainstream media. In particular, as the pandemic evolved during our period of analysis, new issues were reported in the mainstream media, triggering discussions on Twitter.

Our results also demonstrate that people depend on social media platforms (Twitter in this study) to meet various needs during times of uncertainty and crisis. Before the pandemic declaration on March 11, 2020, discussions centered mainly around informational exchanges such as COVID-19 mortality (topic 1) and reports of new cases (topic 5). After 11 March, conversations were not only informational, but were also emotional where people supported each other during the lockdown (topic 8) and helped each other cope with the pandemic (topic 7). This finding is consistent with the notion of audience-media dependency (Ball-Rokeach & DeFleur, 1976; Lee, 2012) in which an audience is impacted not only by media content but also by the society in which they consume the content.

To conclude, our findings suggest that social media platforms such as Twitter play important roles in meeting people's needs during the pandemic. In addition, discussions are influenced by what people read in the mainstream media and possibly other sources (e.g., topics 4 and 5). Hence, it is essential that these platforms put fact-checking mechanisms in place quickly to reduce ambiguity and misinformation. Further, our results show that governments and other decision-makers may use Twitter to uncover ongoing discussions that may help craft official responses to ongoing developments or chart new policy directions.

A limitation of our research is that due to the large volume of data, we were not able to analyze all the tweets. Consequently, the topics uncovered may deviate from the themes that people actually discussed online. Further, our nine topics are a two-week snapshot of Twitter discussions that may not capture new conversation topics as the pandemic develops over time. Other social media platforms may also yield different sets of topics. Hence, it would be worthwhile to analyze new tweets as they become available as well as content from other social media platforms to ascertain the stability of our nine topics. Finally, because there were differences in how countries responded to COVID-19, it would be interesting to examine geographical variations in discussions of the disease.

FIGURE 1 Topic change with time

A dependency model of mass-media effects

Latent Dirichlet allocation

Syndromic surveillance of flu on twitter using weakly supervised temporal topic models

A density-based method for adaptive LDA model selection

Pandemics in the age of twitter: Content analysis of tweets during the 2009 H1N1 outbreak

A content analysis of depression-related tweets

Tracking social media discourse about the COVID-19 pandemic: Development of a public Coronavirus Twitter data set

More tweets, more votes: Social media as a quantitative indicator of political behavior

Accurate and effective latent concept modeling for ad hoc information retrieval

#Globalhealth twitter conversations on #malaria, #HIV, #TB, #NCDs, and #NTDS: A cross-sectional analysis

Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images

Twitter streams fuel big data approaches to health forecasting

Exploring emotional expressions on YouTube through the lens of media system dependency theory

Toward data sense-making in digital health communication research: Why theory matters in the age of big data

Broadcast versus viral spreading: The structure of diffusion cascades and selective sharing on social media

Study and analysis of topic modelling methods and tools -A survey

The use of Twitter to track levels of disease activity and public concern in the U.S. during the influenza A H1N1 pandemic

Twitter as a tool for health research: A systematic review

Zika in Twitter: Temporal variations of locations, actors, and concepts. JMIR Public Health Surveillance, 3(2), e22.

World Health Organization (WHO). (2020a). Coronavirus disease (COVID-19) advice for the public

World Health Organization (WHO). (2020b)
