key: cord-0990818-4gwbronq
authors: Luu, Truong; Follmann, Rosangela
title: The relationship between sentiment score and COVID-19 cases in the United States
date: 2022-01-09
journal: Journal of Information Science
DOI: 10.1177/01655515211068167
sha: a696b3d571e2e7395eb7d806613ee819764e9a3c
doc_id: 990818
cord_uid: 4gwbronq

The coronavirus disease (COVID-19) continues to have devastating effects across the globe. No nation has been free from the uncertainty brought by this pandemic. The health, social and economic tolls associated with it are causing strong emotions and spreading fear in people of all ages, genders, and races. Since the beginning of the COVID-19 pandemic, many have expressed their feelings and opinions related to a wide range of aspects of their lives via Twitter. In this study, we consider a framework for extracting sentiment scores and opinions from COVID-19 related tweets. We connect users' sentiment with COVID-19 cases across the USA and investigate the effect of specific COVID-19 milestones on public sentiment. The results of this work may help with the development of pandemic-related legislation, serve as a guide for scientific work, and inform and educate the public on core issues related to the pandemic.

outbreaks and helping understand public attitudes and behaviours during a crisis 12 . There has been much work on sentiment related to COVID-19 [13] [14] [15] [16] , but not much is known about the relationship between sentiment and both the number of confirmed COVID-19 cases and the death toll in the USA. In this study, we analyze COVID-19 related tweets that were generated in the USA from March 19 to August 31, 2020. We test the correlation between users' sentiment and COVID-19 cases across the USA and investigate the effect of specific COVID-19 milestones on sentiment scores. Our implementation of sentiment analysis shows the existence of a link between sentiment scores, COVID-19 confirmed cases and the death toll.
Additionally, significant events such as new regulations from the government, celebration of important holidays and social conflicts seem to directly affect the public's sentiment.

The COVID-19 pandemic has challenged our way of living. It has limited our social interactions, prompted vast virtualization of our daily routines, and promoted extensive transformations in the workplace. Such drastic changes have affected human behaviour and are having a great impact on people's mental health. Due in part to the severe restrictions on in-person social interactions imposed by the COVID-19 pandemic, about 42% of adults in the USA reported symptoms of anxiety or depression in December 2020, compared to 11% in the previous year 17, 18 . Social interaction plays a crucial role in how people manifest emotions and sentiments, whether through bodily or spoken expression or in writing. Confronted with severe limitations on their in-person communication, people resorted more heavily to social media as a means of expressing emotions and sentiments. In particular, Twitter became very popular as a written form of microblogging. It has more than 150 million users, who gather news as well as share their concerns, feelings, and health-related information. This extensive availability of social media data has prompted much research work in the field of sentiment analysis.

Sentiment analysis is a field of study that uses natural language processing techniques to extract opinions and feelings from written texts and has been incorporated in areas like business, economics, and health 10, 11, 19, 20 . Approaches to sentiment analysis can be lexicon-based 8, 21, 22 or machine learning-based 9, [23] [24] [25] . The lexicon approach uses a sentiment lexicon of words and phrases predefined as positive or negative. The machine learning approach requires training data for automatically classifying the text.
Morente-Molinera et al. 11 used lexicon-based sentiment analysis to extract preferences and build a decision-making process, while Ji et al. 10 developed a two-step approach combining a corpus of personal clues and machine learning to classify Twitter sentiment for addressing public health concerns. Besides providing insights about people's emotions and feelings, sentiment analysis along with text mining can provide much help in creating systematic reviews of literature related to infectious diseases 26, 27 . For example, studies of this kind can help health and medical communities to extract useful information and interrelationships from coronavirus-related studies, along with future directions of research topics 28 .

The COVID-19 pandemic has prompted various studies trying to identify human emotional responses and opinions to the pandemic across the globe [13] [14] [15] [29] [30] [31] [32] [33] [34] . In the early stages of the pandemic, Han et al. 35 explored COVID-19 related public opinion in China from January 9 to February 10, 2020. The authors used the latent Dirichlet allocation model for topic extraction, suggesting a temporal variability of the number of texts for different topics and subtopics corresponding to the different developmental stages of the event. By looking at temporal changes and spatial distribution of COVID-19 related texts, they found a synchronization between frequent daily discussions and the trends in the COVID-19 outbreak. Also early in the pandemic, Samuel et al. 14 studied issues in public sentiment in the United States, reflecting concerns about Coronavirus with growth in fear and negative sentiments. The authors used exploratory and descriptive textual analytics, along with textual data visualization, to provide insights into the progress of fear sentiment over time as COVID-19 approached peak levels.
Additionally, their work contributes to the strategic process, presenting methods with valuable informational and public sentiment insights, which can be used to develop much needed motivational solutions and strategies to counter the rapid spread of fear-panic-despair associated with Coronavirus and COVID-19. Li et al. 36 explored the impact of COVID-19 on people's mental health. Texts from active Weibo users, along with machine-learning predictive models, were used to compute word frequency and scores of emotional and cognitive indicators before and after the declaration of COVID-19 on 20 January, 2020. Their results showed that negative emotions (e.g., anxiety, depression and indignation) and sensitivity to social risks increased, while the scores of positive emotions (e.g., Oxford happiness) and life satisfaction decreased, suggesting a need for clinical practitioners to prepare to deliver corresponding therapy foundations for the risk groups and affected people. In addition, Sarker et al. 16 showed that COVID-19 symptoms self-reported by Twitter users can complement those identified in clinical settings. Barkur 13 used sentiment analysis of tweets from India after the announcement of the lockdown, addressing the population's feelings towards the lockdown, while de Las Heras-Pedrosa et al. 30 addressed the question of how social media has affected risk communication in uncertain contexts, and its impact on the emotions and sentiments derived from the semantic analysis in Spanish society during the COVID-19 pandemic. Similarly, Chakraborty et al. 15

In all, it is not surprising that the COVID-19 pandemic is having a devastating effect economically and emotionally across the globe. In this study, we consider a framework for extracting sentiment scores and opinions from COVID-19 related tweets in the USA and investigate the effect of COVID-19 milestones on people's sentiment.
In this study, we utilize sentiment analysis to identify outputs and trends in attitudes, feelings and opinions based on tweets in the USA during the COVID-19 pandemic. Our approach includes collecting the COVID-19 related Twitter data as well as the corresponding COVID-19 cases and death toll numbers. The Twitter data are then preprocessed, and sentiment analysis is performed using three sentiment lexicons (TextBlob, AFINN and SentimentR). Next, we analyze and report the results. A schematic view of the methodology framework is illustrated in Figure 1 .

The data we use in this work were obtained from a collection of geotagged tweet identifiers related to the COVID-19 pandemic 37 . After extracting the USA tweets, the data were preprocessed. The preprocessing in this study involved converting text to lowercase and removing punctuation, stop words, numeric values, ideograms and links. Removing stop words, ideograms, punctuation and numeric values can improve the performance of sentiment analysis techniques 40 . Each step is briefly explained below.

Converting all data to lowercase helps in the preprocessing and in later stages of natural language processing when parsing through the data. In this study, we used the lower method of Python's built-in string type. For example, when applying the function, the words "COVID" and "Covid" are both converted to "covid".

In natural language processing, stop words are words that, if removed, do not change the context of a sentence, for example the words "the", "a", "an" and "in". Here we used the Natural Language Toolkit (NLTK) library stop word corpus 41, 42 .

Punctuation characters do not contribute to the sentiment analysis. We removed punctuation, as for example ! " # $ % & ' ( * + , -. / : ; < = ? @ [ { | from the COVID-19 tweets.

Numeric values in the tweets do not contribute to the sentiment embedded in the text. Therefore, we removed all numeric values such as 12345, which are not valuable for text analysis.
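The preprocessing steps listed in this section (lowercasing and removal of links, ideograms, numeric values, punctuation and stop words) can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `preprocess`, the tiny `STOP_WORDS` set (standing in for the NLTK stop word corpus), the URL pattern, and the choice to drop every non-ASCII character as a crude proxy for ideogram removal are all hypothetical.

```python
import re
import string

# Assumption: a tiny illustrative stop word set. The study uses the NLTK
# stop word corpus, which is considerably larger.
STOP_WORDS = {"the", "a", "an", "in", "is", "of", "to", "and"}

# Assumption: links are anything starting with http(s):// or www.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
DIGIT_PATTERN = re.compile(r"\d+")
# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(tweet: str) -> str:
    """Return a cleaned, lowercased version of a tweet."""
    text = tweet.lower()                                   # lowercase
    text = URL_PATTERN.sub(" ", text)                      # remove links
    text = text.encode("ascii", "ignore").decode("ascii")  # drop non-ASCII (emoji etc.)
    text = DIGIT_PATTERN.sub(" ", text)                    # remove numeric values
    text = text.translate(PUNCT_TABLE)                     # remove punctuation
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


if __name__ == "__main__":
    sample = "COVID-19 is in the news! Stay safe \U0001F637 https://example.com/x 12345"
    print(preprocess(sample))
```

Links are removed before punctuation so that a URL is dropped as a whole rather than being broken into word-like fragments. With the toy stop word set above, the sample tweet reduces to `covid news stay safe`.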
We also removed ideograms such as smiley faces, flags, etc. Additionally, links in tweets do not contribute to sentiment analysis, hence they were also removed. Three samples of tweets before and after preprocessing are shown in Table I. All tweets were subject to preprocessing and the cleaned data were then used for calculating the sentiment score of each tweet.

A word cloud is a technique for visualizing the level of prominence of frequent words in a text, with larger font sizes used for more frequent words. In addition to the previous preprocessing, we also applied stemming and tokenization to the dataset. We obtained the word cloud from the tweet data using the Python WordCloud library.

Sentiment analysis is a field of study that uses natural language processing techniques to extract opinions and feelings from written texts 24 . We obtain preliminary sentiment analysis results using three different lexicon-based methods: TextBlob, AFINN, and SentimentR. A brief description of each method is as follows:

TextBlob is an open-source Python library for performing various natural language processing (NLP) tasks on textual data 43 . It provides a simple Application Programming Interface (API) for performing common NLP tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, and translation. For sentiment analysis, TextBlob uses the Pattern library and the NLTK toolkit. The sentiment dictionary of TextBlob consists of 2,918 words annotated with polarity, subjectivity and intensity scores. TextBlob determines the polarity (positivity or negativity) of a text along with its subjectivity. A sentiment score between -1 and 1, defined as polarity, is assigned to the text depending on the most commonly occurring positive (good, best, excellent, etc.) and negative (bad, awful, pathetic, etc.) adjectives. In addition to the sentiment score, subjectivity is also determined.
Subjectivity quantifies the amount of personal opinion and factual information contained in the text. The subjectivity value can be a number between 0 and 1. A higher subjectivity means that the text contains more personal opinion and less factual information, and a low subjectivity value means less personal and more factual information.

AFINN is a lexicon-based sentiment analysis approach developed by Finn A. Nielsen 44 . It contains more than 2477 words with a valence (polarity) associated with each word. The words in AFINN's lexicon are scored for valence within the range from -5 (very negative) to 5 (very positive), where a positive score indicates positive sentiment and a negative score indicates negative sentiment. For example, the sentence "Face covering is good and bad" will result in a score of 0 (neutral sentiment), the sentence "Face covering is terrible and bad" will result in a score of -6 (negative sentiment), and the sentence "Face covering is good and beautiful" will result in a score of 6 (positive sentiment).

SentimentR is also a lexicon-based sentiment analysis approach, developed by Tyler Rinker 45 . It is a dictionary lookup approach that tries to incorporate weighting for valence shifters (i.e., negators, amplifiers (intensifiers), de-amplifiers (downtoners), and adversative conjunctions). The lexicon contains 11,709 words, whose individual scores may take values between -2 and 1. The authors in Ref. 46 used SentimentR to analyze sentiments expressed by energy consumers on Twitter.

Our preliminary results indicate that the sentiment scores of the three methods are comparable with each other within their different ranges. These results are summarized in Table II, where the scores from the three methods are shown for a sample of six tweets.
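To make the lexicon lookup concrete, the AFINN-style scoring of the three "Face covering" sentences can be reproduced with a short Python sketch. It is only an illustration under stated assumptions: `TOY_LEXICON` is a hypothetical four-word excerpt whose valences (+3 and -3) are chosen to be consistent with the scores quoted in the text, and `lexicon_score` is a hypothetical helper, not part of the AFINN package. Valence shifters such as negators, which SentimentR weights, are deliberately ignored.

```python
import re

# Assumption: a four-word toy lexicon with valences consistent with the
# worked examples in the text; the real AFINN list is far larger.
TOY_LEXICON = {"good": 3, "beautiful": 3, "bad": -3, "terrible": -3}


def lexicon_score(text: str, lexicon: dict = TOY_LEXICON) -> int:
    """Sum the valence of every word found in the lexicon (0 if absent)."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(lexicon.get(word, 0) for word in words)


if __name__ == "__main__":
    for sentence in (
        "Face covering is good and bad",
        "Face covering is terrible and bad",
        "Face covering is good and beautiful",
    ):
        print(sentence, "->", lexicon_score(sentence))
```

Words that are not in the lexicon contribute nothing, so the score of a sentence is simply the sum of the valences of its sentiment-bearing words: 0, -6 and 6 for the three sentences above.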
Additional results comparing the three methods are shown in Table III , displaying the number of tweets classified as positive,

To investigate the relationship between the tweets' sentiment scores and both the daily new COVID-19 cases and death toll, we apply the Pearson correlation test, which provides a quantitative measure of the linear dependency between two variables 47, 48 . It is given by

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} $$

where x and y are the sample vectors of length n, and \bar{x} and \bar{y} are the corresponding means. Additionally, we used the two-tailed t-test to assess the significance of the variables.

The texts we analyzed were obtained from tweets originating from locations distributed across the USA, as illustrated in Figure 2 . In terms of the lexicon used in the COVID-19 related tweets, those with high polarity scores tend to use positive words (mostly adjectives) such as "greatest," "best," "grateful," "perfect," and "wonderful." In contrast, tweets with low polarity scores tend to use negative words (also mostly adjectives) such as "worst," "terrible," "killed," and "no time to waste." The polarity scores, which range between -1 and 1 across our dataset, were classified as negative (sentiment score < 0), neutral (sentiment score = 0) or positive (sentiment score > 0).

The word cloud for the dataset is shown in Figure 5 . In this figure, the higher the presence of a word in the tweets, the larger the font used to display it in the cloud, providing a visual account of the most-used vocabulary in the day-to-day challenges posed by the coronavirus. Not surprisingly, the most prominent word is "covid," followed by "pandemic," "corona," "today," "social distancing," "quarantine," etc. But we can also see positive words such as "love," "family" and "beautiful," for example.
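The Pearson correlation test mentioned in this section can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the function names `pearson_r` and `t_statistic` and the sample numbers are hypothetical. The t statistic uses the standard relation t = r * sqrt((n - 2) / (1 - r^2)); converting it to a two-tailed p-value requires the Student t distribution with n - 2 degrees of freedom, which is omitted here (in practice `scipy.stats.pearsonr` returns both the coefficient and the two-sided p-value).

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("x and y must have the same length (at least 3)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    if abs(r) >= 1.0:
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / (1.0 - r * r))


if __name__ == "__main__":
    # Hypothetical daily values, for illustration only.
    daily_cases = [1, 2, 3, 4, 5]
    daily_polarity = [2, 1, 4, 3, 5]
    r = pearson_r(daily_cases, daily_polarity)
    print(r, t_statistic(r, len(daily_cases)))
```

A coefficient near 0 indicates a weak linear relationship, and the sign indicates whether the two series move in the same or in opposite directions.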
While polarity gave us an account of how positive or negative people's sentiments were, we also looked at the subjectivity score to check whether people expressed factual information (low score) or opinion (higher score) in their tweet messages. Tweets with low subjectivity scores tend to use more factual vocabulary compared to those with higher subjectivity scores. Sample tweets with their subjectivity and polarity scores, extracted from the dataset, can be found in Table II .

A graphical representation of how polarity and subjectivity are related to each other is displayed in the scatter plot of Figure 6 . We observe a skewed distribution with more points towards the positive polarity scores and higher subjectivity scores (> 0.5), suggesting that the more positive-oriented a tweet is, the more opinion-oriented its meaning will be. This can be quantified by measuring the proportion of tweets with positive polarity and subjectivity greater than 0.5 (Table V) . Among the tweets with subjectivity higher than 0.5, positive tweets comprise 78.3%, accounting for 24% of the total number of tweets. Neutral and negative tweets combined account for 21.7% of the tweets with subjectivity above 0.5. This corresponds to 6.7% of the total number of tweets. These numbers indicate that nationwide, people are more likely to express their positive rather than their negative opinions. 14 humans tend to be more responsive to negative than to positive news 49 .

It is unquestionable that the COVID-19 pandemic is affecting all aspects of our day-to-day lives, including our emotions. In this section, we present results obtained from using sentiment analysis to investigate the extent of the influence that daily new COVID-19 cases and death tolls are having on our emotions. The graphs in Figure 8 show how COVID-19 daily confirmed cases and daily death toll relate to polarity and subjectivity scores.
They indicate a weak correlation between daily confirmed cases and both polarity (graph 8(A)) and subjectivity (graph 8(B)). These numbers suggest that people's sentiment is affected, but not strongly, by the daily increase of COVID-19 confirmed cases. Graphs 8(C) and 8(D) also show a weak negative correlation, but between daily death toll and polarity and subjectivity, respectively. These results, similarly to cases (A) and (B) above, suggest that people's sentiment is also affected, but not strongly, by the increase in the daily death toll due to COVID-19. However, a comparison between statistical coefficients of the combined polarity and subjectivity with respect to daily confirmed cases (graphs 8 (A) and (B)), and the combined polarity and subjectivity with respect to daily death toll (graphs 8 (C) and (D)), indicates that the daily death toll has a larger effect on people's sentiment than new daily confirmed cases. This result can be understood from the perspective that an increase in the number of deaths poses a more threatening challenge compared with an increase in the number of new cases which, even though threatening, still leaves the door open for a possible recovery.

We now analyze the time evolution of the polarity score in connection with the number of confirmed cases and the death toll using a five-day moving average from March 19 to August 30, as shown in Figure 9 . In graph A, the confirmed cases, scaled on the left-hand side y-axis, were users are more likely to express mournful feelings, leading to a low polarity score. On the other hand, when the death toll is lower, people seem to be more optimistic so that more positive tweets are generated, leading to a higher polarity score.

To provide information about the sentiments of people in largely affected areas, we now analyze the number of COVID-19 cases, the death toll and the polarity scores for the four most populous states.
Figure 10 while the other three states show a peak in mid-July, consistent with the fact that New York was affected first by the pandemic. Graph (C) shows a similar temporal evolution for the daily number of deaths, with New York again exhibiting a peak in mid-April followed by a downward trend for the rest of the period. Interestingly, the other three states had a peak in the number of deaths in early August, but not as prominent as was the case in New York. This is probably a consequence of the fact that by August health professionals and hospitals were better prepared and equipped to treat COVID-19 patients.

3. Polarity score and significant events.

While the predominant factor determining the oscillations in polarity shows a direct connection with the pandemic itself, other factors, connected or not with the pandemic, may have an occasional influence on the polarity. Figure 11 shows the time evolution of the polarity, displaying also the dates of two events promoting lower polarity values: (i) The extended stay-

The difficulties we are presently facing with the COVID-19 pandemic are not new. Back in 1918, for example, the human race was plagued by an H1N1 virus pandemic that caused more than 50 million deaths worldwide, of which 675,000 occurred in the United States 54 . At the time, with no vaccine, no medication and no infrastructure to alleviate the symptoms, control measures were limited to isolation, quarantine, personal hygiene, disinfectants and restrictions on gatherings. These are striking similarities with the COVID-19 pandemic, but there are differences as well, including the age groups with higher vulnerability and the ease of travel, which of course plays a role in the spread of the virus. Additionally, today we have more effective communication, which can help disseminate useful information with preventive effects, but also damaging information which might cause people to underestimate the risks of the COVID-19 virus.
Social media has also given rise to plenty of venues for people to express their concerns, fears, joy and happiness in ways not even thought possible some 20 years ago. Microblogging is one such venue, consisting of online posting of short texts containing personal experiences and emotions. Large amounts of data from microblogging services such as Twitter make them a fascinating source for opinion mining and sentiment analysis. We used TextBlob to calculate each tweet's subjectivity and polarity score and to classify tweets as positive, negative, or neutral. We presented a comprehensive investigation of the sentiment distribution in the US as a whole and among the four most populous states. In this work, we also investigated the relationship between the sentiment score and the COVID-19 cases in the United States. Our results indicate that there is a link between sentiment scores and COVID-19 confirmed cases and the death toll in the USA.

Coronavirus disease 2019 (covid-19) -symptoms and causes -mayo clinic
The who just declared coronavirus covid-19 a pandemic -time
Proceedings of the 2010 ACM conference on Computer supported cooperative work
Introduction to wordnet: An on-line lexical database
Thumbs up? sentiment classification using machine learning techniques
Twitter sentiment classification for measuring public health concerns
Carrying out consensual group decision making processes under social networks using sentiment analysis over comparative expressions
Using online social networks to track a pandemic: A systematic review
Sentiment analysis of nationwide lockdown due to covid 19 outbreak: Evidence from india
Covid-19 public sentiment insights and machine learning for tweets classification
Sentiment analysis of covid-19 tweets by deep learning classifiers-a study to show how popularity is affecting accuracy in social media
Self-reported covid-19 symptoms on twitter: an analysis and a research resource
Mental health, substance use, and suicidal ideation during the covid-19 pandemic-united states
Covid's mental-health toll: how scientists are tracking a surge in depression
The impact of social and conventional media on firm equity value: A sentiment analysis approach
Twitter mood predicts the stock market
Affective computing and sentiment analysis, in A practical guide to sentiment analysis
A comprehensive study on lexicon based approaches for sentiment analysis
Opinion mining and sentiment analysis
Sentiment analysis of twitter data
User reviews: Sentiment analysis using lexicon integrated two-channel cnn-lstm family models
Sentiment analysis and its applications in fighting covid-19 and infectious diseases: A systematic review
An overview of literature on covid-19, mers and sars: Using text mining and latent dirichlet allocation
A comparative nlp-based study on the current trends and future directions in covid-19 research
Analysis of spatiotemporal characteristics of big data on social media sentiment with covid-19 epidemic topics
Sentiment analysis and emotion understanding during the covid-19 pandemic in spain and its impact on digital ecosystems
Twitter sentiment analysis during covid-19 outbreak in nepal
Informational flow on twitter-corona virus outbreak-topic modelling approach
Cross-cultural polarity and emotion detection using sentiment analysis and deep learning on covid-19 related tweets
Topic detection and sentiment analysis in twitter content related to covid-19 from brazil and the usa
Using social media to mine and analyze public opinion related to covid-19 in china
The impact of covid-19 epidemic declaration on psychological consequences: a study on active weibo users
Coronavirus (covid-19) geo-tagged tweets dataset
Centers for disease control and prevention
Data download
A performance comparison of supervised machine learning models for covid-19 tweets sentiment analysis
Natural language processing with Python: analyzing text with the natural language toolkit
Python 3 text processing with NLTK 3 cookbook
textblob Documentation
A new anew: Evaluation of a word list for sentiment analysis in microblogs
SentimentR: Calculate Text Polarity Sentiment
Analyzing sentiments expressed on twitter by uk energy company consumers
Multimodal sensory information is represented by a combinatorial code in a sensorimotor system
Scipy: Pearson correlation coefficient
Cross-national evidence of a negativity bias in psychophysiological reactions to news
New york coronavirus: Gov. andrew cuomo extends stay-at-home order until at least may 15 -cnn
Press releases PressReleases/Pages/PR20200423.aspx (2020)
Black lives matter may be the largest movement in u.s. history -the new york times
First coronavirus stimulus checks deposited