key: cord-351863-onipxf2z authors: Wang, X.; Zou, C.; Xie, Z.; Li, D. title: Public Opinions towards COVID-19 in California and New York on Twitter date: 2020-07-14 journal: medRxiv : the preprint server for health sciences DOI: 10.1101/2020.07.12.20151936 sha: doc_id: 351863 cord_uid: onipxf2z Background: With the pandemic of COVID-19 and the release of related policies, discussions about the COVID-19 are widespread online. Social media becomes a reliable source for understanding public opinions toward this virus outbreak. Objective: This study aims to explore public opinions toward COVID-19 on social media by comparing the differences in sentiment changes and discussed topics between California and New York in the United States. Methods: A dataset with COVID-19-related Twitter posts was collected from March 5, 2020 to April 2, 2020 using Twitter streaming API. After removing any posts unrelated to COVID-19, as well as posts that contain promotion and commercial information, two individual datasets were created based on the geolocation tags with tweets, one containing tweets from California state and the other from New York state. Sentiment analysis was conducted to obtain the sentiment score for each COVID-19 tweet. Topic modeling was applied to identify top topics related to COVID-19. Results: While the number of COVID-19 cases increased more rapidly in New York than in California in March 2020, the number of tweets posted has a similar trend over time in both states. COVID-19 tweets from California had more negative sentiment scores than New York. There were some fluctuations in sentiment scores in both states over time, which might correlate with the policy changes and the severity of COVID-19 pandemic. The topic modeling results showed that the popular topics in both California and New York states are similar, with "protective measures" as the most prevalent topic associated with COVID-19 in both states. Conclusions: Twitter users from California had more negative sentiment scores towards COVID-19 than Twitter users from New York. The prevalent topics about COVID-19 discussed in both states were similar with some slight differences. In December 2019, a novel and contagious coronavirus disease of 2019 (COVID-19) has been reported in China and continued spreading out massively. The cause of this pandemic is due to a novel coronavirus called SARS-CoV-2. According to the Centers for Disease Control and Prevention (CDC), the coronavirus could transmit between persons through close contact [1] . On January 30, 2020, COVID-19 became a Public Health Emergency of International Concern [2] . By June 22, 2020, the coronavirus has been confirmed in six continents with nearly nine million COVID-19 cases. The number of confirmed cases in the U.S. has been grown rapidly since early March. On March 7, New York declared a state of emergency after 89 confirmed cases, and since then, the confirmed cases continued increasing at an alarming rate. New York state was facing severe challenges in health care, including a shortage of protective gear and medical equipment, hospital overcrowding, and paramedic shortages. Meanwhile, California, as the most populous state in the U.S., was also facing the viral transmission, but less severe than what NY faced. Discussions about the COVID-19 pandemic are widespread online, especially through social media, which offers powerful public platforms where people can share their opinions with large crowds. Twitter is an online microblogging and networking platform, and users can post and All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted July 14, 2020. . https://doi.org/10.1101/2020.07.12.20151936 doi: medRxiv preprint interact messages known as "tweets". In past few years, Twitter data has been increasingly used for research. Previous study had used Twitter data to reveal insights from user-generated tweets about Ebola through textual analytics [3] . The outbreak of coronavirus in the U.S. has created an uproar in discussions about the pandemic on Twitter. A recent research collected a dataset of 9 million Twitter posts using the keywords "Coronavirus" and discussed the dynamics of coronavirus on Twitter [4] . However, little work has been reported on social media data analysis that focused on COVID-19 in the United States, since its outbreak on March 2020. With the growth of the number of confirmed COVID-19 cases and the announcement of COVID-19 related policies, mentions of coronavirus-related topics and sentiments might vary over time. It's important to understand how the perceptions of the pandemic related issues evolve over time, and whether they vary with different geolocations. Thus, in this study, we aimed to investigate the differences in sentiments and public opinions towards the outbreak of COVID-19 for twitter users between New York state and California state in the United States, and how these are related to the number of cases and policy changes. Our study provides valuable information about public attitude changes over time towards the emergence and outbreak of COVID-19, which could guide future research on exploring public concerns towards the pandemic. We obtained COVID-19-related Twitter posts from March 5, 2020 to April 2, 2020, via Twitter streaming API. To generate only COVID-19-related posts, we used a list of keywords for Twitter streaming-"CORONA", "corona", "COVID19", "covid19", "covid", "coronavirus", "Coronavirus", "CoronaVirus", and "NCOV". The dataset was then filtered by the same keyword All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted July 14, 2020. . https://doi.org/10.1101/2020.07.12.20151936 doi: medRxiv preprint set to remove any unrelated posts. Another filtering process was to remove the promotional and commercial posts. Since "Corona" is a beer brand in the market, it's possible that the discussions in the posts about Corona were not related to the coronavirus. The keywords for filtering are "dealer", "deal", "supply", "beer", "drink", "drank", "drunk", "store", "promo", "promotion", etc. To identify tweets from California and New York, the last data filtering process is to separate tweets into two datasets -one dataset contains posts only from California state, another only from New York state, based on the geolocation information associated with each tweet. The keywords used to filter locations, such as "Los Angeles, CA", "Monroe, New York", are scraped from Wikipedia website (List of cities and towns) [5] . To prevent from missing locations while filtering, all the county and city names are considered into the keyword lists. By converting itemset into data frames and dropping the duplicates, we removed any duplicated items that share the same id number. The number of confirmed COVID-19 cases in California and New York states each day from March 5, 2020 to April 2, 2020 were downloaded from the website of 1Point3Acres.com [6]. In order to explore the trend of COVID-19-related tweets posted between March 5, 2020 and April 2, 2020, we calculated the number of tweets by days and hours for each state. To convert the date into the correct time zone, we parsed the date time into readable date format with the corresponding time zone. We calculated the average number of tweets posted by each hour within a 24-hour period. The VADER (Valence Aware Dictionary and sentiment Reasoner) was used for the sentiment analysis, which is a tool that is commonly used for sentiment analysis of social media data [7] . All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted July 14, 2020. . https://doi.org/10.1101/2020.07.12.20151936 doi: medRxiv preprint We applied the vaderSentiment package in python to calculate the sentiment score for each tweet and calculated the average sentiment scores by each day and each hour. For the text in each post, we extracted the compound score, which is normalized between -1 (negative extreme) and +1 (positive extreme). We calculated the average sentiment score within each hour in a 24-hour period for 29 days. In addition, a two-sample t-test was conducted to test whether there is a significant difference between average sentiment scores in California and New York based on the sentiment scores per minute. Topic Modeling is typically used to discover abstract "topics" in documents through statistical modeling [8] . To understand the most frequent topics that Twitter users were discussing about COVID-19, we applied Latent Dirichlet Allocation (LDA) for topic modeling analysis. To build the LDA model, several steps were conducted: • Remove noises -emails, newline, extra spaces, distracting single quote and urls • Tokenization -split texts into sentences and words, lower case the words and remove punctuation • Create bigram and trigram models -convert two or three words frequently occurring together in the document (eg. difficulty_breathing, magic_johnson, catching_feelings) • Remove stop words downloaded from nltk, and make bigrams and trigrams • Lemmatization using spaCy -lemmatize the words to a normal form • Create dictionary and corpus as LDA input To identify the optimal number of topics, we calculated the coherence scores from 2 topics to 15 topics and considered the number of topics with the highest coherence score as the optimal number of topics. After building the LDA model, we applied pyLDAVis to visualize the All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted July 14, 2020. . https://doi.org/10.1101/2020.07.12.20151936 doi: medRxiv preprint information contained in the topic models. It's a python built-in package that helps understand the meaning of each topic, the prevalence of each topic, and the correlation among topics. According to Sievert and Shirley's study result, the optimal value of the weight parameter is 0.6 in the pyLDAVis method [9] . In our analysis, we adjusted the parameter to 0.6 and extracted the top ten frequent words in each topic. The summary of each topic was manually conducted and discussed among all coauthors. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted July 14, 2020. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted July 14, 2020. . https://doi.org/10.1101/2020.07.12.20151936 doi: medRxiv preprint A sentiment analysis was conducted to understand public attitudes toward the number of COVID-19 cases and the policy changes. We calculated average sentiment scores for COVID-19 tweets from California and New York between March 5, 2020 and April 2, 2020. Overall, both states had negative sentiment scores towards COVID-19. The average sentiment score for California was -0.042 (Standard Error = 0.0004), while for New York, it was -0.033 (Standard Error = 0.0003). A two-sample t-test showed that the difference in average sentiment scores between California and New York was significant (p-value < 0.001). The daily average sentiment scores in both New York and California varied over time ( Figure 3 ). In the 24-hour period, New York showed a clear peak on the hourly average sentiment score, with the highest sentiment score of -0.018 at 11:00AM (Supplemental Figure 2 ). In contrast, California did not show a clear peak. In addition, twitter users from both states showed more negative sentiments towards COVID-19 from midnight to early morning. All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted July 14, 2020. . https://doi.org/10.1101/2020.07.12.20151936 doi: medRxiv preprint The LDA topic modeling was applied to understand the general topics in COVID-19 tweets. Based on the coherence scores, we found that with New York data the optimal number of topics was 13 while with California data it was 14. We then summarized each topic with top 10 frequent words (Table 1) . All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted July 14, 2020. . https://doi.org/10.1101/2020.07.12.20151936 doi: medRxiv preprint Table 1 As shown in Table 1 , topics discussed in COVID-19 tweets in both New York and California were very similar, among which the most prevalent topic was "Protective measures". In addition, (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted July 14, 2020. . https://doi.org/10.1101/2020.07.12.20151936 doi: medRxiv preprint concerns about the COVID-19 pandemic and government actions, such as "Public concerns" and "Government and policy", were other popular topics in both states. In this study, we showed that from March 5 to April 9 in 2020, the number of COVID-19 tweets in California was almost twice of that in New York. California had a significantly lower sentiment score towards COVID-19 than New York. In addition, we showed that the sentiment scores towards COVID-19 in both California and New York varied over time, and the policy announcements and number of confirmed cases might be the major drives for these sentiment changes. Twitter users from both states discussed similar topics, with the "protective measures" as the most popular topic, followed by "Pandemic's impact" and "Government policy". While we showed that the sentiment scores varied over time in both California and New York, the highest and lowest sentiment scores correlated with some important policy announcements and the severity of COVID-19. On March 7, Governor Cuomo declared a State of Emergency in New York, and the sentiment score on that day was the lowest (-0.093). One possible explanation is that people expressed serious concerns about the COVID-19 pandemic. The word "emergency", along with "sick", "leave", "business", "worker", suggests that people were worried about their businesses and not being able to work. Meanwhile, on the same day, the average sentiment score in California was the lowest, which coincide with the policies including the largest school district was closed, and San Francisco banned large group gatherings. Another interesting observation is that on March 25, 2020, the average sentiment scores in both states reached the highest point. The major event happened on that day was that a $2 trillion coronavirus relief package was passed to address the economic impacts caused by the pandemic. All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted July 14, 2020. . https://doi.org/10.1101/2020.07. 12.20151936 doi: medRxiv preprint It can be inferred that since the pandemic has forced a lot of people losing their jobs and not being able to maintain their livings, the announcement of relief bill could relieve some financial stresses, which led to less negative sentiment scores in both states. In addition, we found that the number of confirmed cases and deaths might correlate with the public's sentiment. For instance, on March 7, the number of confirmed cases in California reached 100 [10] ; on March 29, New York surpassed 1000 death tolls [11] , and California surpassed 100 death tolls [12] . These numbers might be the drives for more negative sentiment scores in both states. Since New York had much more confirmed cases and deaths than California in March 2020, we would expect that the average sentiment score in New York would be lower than in California. However, our results showed that California had a lower sentiment score (-0.042) than New York (-0.033), and the difference is significant. One explanation for this is that on March 21, 2020, New York began gathering ventilators across the state. The federal disaster assistance might bring people who were in pessimistic situations some hopes. A recent research by WalletHub about the unemployment rate showed that New York had a 1569.69% increase in the unemployment claims since the start of the COVID-19 pandemic, and California had a 1143.54% increase, much less than New York [13] . Thus, the $2 trillion relief bill happened on March 25 could provide people in New York financial support more significantly. Another reasonable speculation could be that as the most populous state in the U.S., California people might be more worried about the potential high infection rate. Our result on topic modeling analysis showed that there was no major difference between the topics discussed in California and New York. However, the popularity of each topic was different between two states. The topic that had the highest token percentage in both states was related to protective measures. They shared the same keywords such as "work", "stop", "home", All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted July 14, 2020. . https://doi.org/10.1101/2020.07.12.20151936 doi: medRxiv preprint "stay", "hand", "mask". It can be inferred that these tweets were talking about staying home, washing hands, and wearing masks. The second prevalent topic in New York was about the impact that pandemic brought, including keywords such as "business", "travel", "worker". Since New York had far more confirmed cases and deaths, the negative impacts were more severe, and people might discuss more about the harm to businesses and traveling restrictions. In contrast, tweets from California talked less about working impacts according to the topic modeling keywords, presumably due to less serious situations. Topics related to government and policy seem to be the next mostly discussed categories. Keywords including "trump", "president", and "crisis" appeared in tweets from both states with high frequencies. Other than the prevention and impacts of COVID-19, people paid significant attention to the government policy and president's actions, which might partially explain the fluctuation of sentiment scores. Hospital environment and medical support were other popular topics in both states. Although social media provides first-hand users' perspectives, it does not provide controlled variables. We don't have the user demographic information, including age, gender, and education, etc. Twitter users can't represent the whole population, since about only 20% of US adults are currently using Twitter. Furthermore, our study only focused on the beginning of COVID-19 epidemic in the US, and the story might evolve later. This study used social media data from Twitter to analyze the sentiment scores and public opinions towards the emergence of COVID-19 in California and New York states in the United States at the beginning of the COVID-19 epidemic. This study showed how COVID-19 affected people's attitudes from different states in a timely manner. The results from this study could provide guidelines for future social media research on the associations between the outbreak of COVID-19 and public opinions, as well as valuable suggestions on future pandemic research. All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted July 14, 2020. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted July 14, 2020. . https://doi.org/10.1101/2020.07.12.20151936 doi: medRxiv preprint Coronavirus (COVID-19) Frequently Asked Questions Coronavirus Disease (COVID-19) -Events as They Happen Detecting Themes of Public Concern: A Text Mining Analysis of the Centers for Disease Control and Prevention's Ebola Live Twitter Chat Dataset on Dynamics of Coronavirus on Twitter List of Cities and Towns in California Global COVID-19 Tracker & Interactive Charts: Real Time Updates & Digestable Information for Everyone VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text Topic Modeling and Latent Dirichlet Allocation (LDA) in Python LDAvis: A Method for Visualizing and Interpreting Topics California Coronavirus Map and Case Count Coronavirus Death Toll in New York Surpasses 1,000, Less than 3 Weeks after 1st Reported Death Coronavirus Timeline: Tracking Major Moments of COVID-19 Pandemic in San Francisco Bay Area All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted The project described in this publication was supported by the University of Rochester Clinical and Translational Science Award UL1 TR002001 from the National Center for Advancing Translational Sciences of the National Institutes of Health (DL).