title: Text mining and sentiment analysis of COVID-19 tweets
authors: Zhang, Qihuang; Yi, Grace Y.; Chen, Li-Pang; He, Wenqing
date: 2021-06-26

The human severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), causing the COVID-19 disease, has continued to spread all over the world. It severely affects not only public health and the global economy but also mental health and mood. While the impact of the COVID-19 pandemic has been widely studied, relatively few discussions of the sentiment reaction of the population have been available. In this article, we scrape COVID-19 related tweets on the microblogging platform Twitter and examine the tweets from Feb 24, 2020 to Oct 14, 2020 in four Canadian cities (Toronto, Montreal, Vancouver, and Calgary) and four U.S. cities (New York, Los Angeles, Chicago, and Seattle). Applying the Vader and NRC approaches, we evaluate the sentiment intensity scores and visualize the information over different periods of the pandemic. Sentiment scores for the tweets concerning three anti-epidemic measures, masks, vaccine, and lockdown, are computed for comparison. The results for the four Canadian cities are compared with those for the four cities in the United States. We study the causal relationships between the infected cases, the tweet activities, and the sentiment scores of COVID-19 related tweets by integrating the echo state network method with convergent cross mapping. Our analysis shows that public sentiments regarding COVID-19 vary over different time periods and locations. In general, people have a positive mood about COVID-19 and masks, but a negative one on the topics of vaccine and lockdown. The causal inference shows that the sentiment influences people's activities on Twitter, which are also correlated with the daily number of infections.

The COVID-19 disease, caused by the human severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), was declared a pandemic by the World Health Organization (WHO) in March 2020. The disease had caused over sixty-five million infections and a half million deaths all over the world as of December 3, 2020. While extensive studies have been conducted to examine various types of influence of COVID-19 on public health, including studies concerning the number of infected cases and the fatality, investigations of the impact of the pandemic on people's emotions are relatively limited.

The first case of COVID-19 in Canada was reported in Toronto on January 25, 2020. To prevent the spread of COVID-19, the four most populated provinces in Canada consecutively announced a "state of emergency" (displayed in Figure 1), taking measures such as shutting down public businesses, banning social gatherings, encouraging social distancing, and requiring masks to be worn in public areas. As the spread of the disease abated during July and August, the "state of emergency" was relaxed to various extents in different provinces. With the recent surge in the number of newly infected cases, the "state of emergency" has been restored in all four provinces. While social distancing has been advocated and many people have adopted the practice of working from home, online communication tools such as social media have become more active. The COVID-19 pandemic has become one of the most discussed topics on the internet since it was first reported in January 2020 (Bhat et al. 2020).
As opinions and feelings are freely and openly shared on the internet, it is interesting to conduct text mining of the public information on social media to extract useful messages. Text mining is a commonly used technique to explore a corpus, and common strategies for analyzing sentiments can be found in Kwartler (2017, Ch. 4). For an application of text mining to Twitter data, Kumar et al. (2014) provided a comprehensive discussion together with detailed case studies. Aflakparast et al. (2020) used the Bayesian fused graphical lasso to convert textual Twitter data into understandable networks of terms that can signify important events. For COVID-19 data, Tworowski et al. (2020) studied a drug repository and applied text mining methods to putative COVID-19 therapeutics. Khanday et al. (2020) employed text mining methods for preprocessing and extracting relevant features, and then classified textual clinical reports using machine learning algorithms. Saire and Pineda-Briseno (2020) considered a case study to analyze the publications in Mexico and examined people's behavior. Most available work on COVID-19 data, however, focuses on exploratory data analysis such as standard word clouds and histograms of the most posted words.

In this article, we conduct sentiment analysis (e.g., Pak and Paroubek 2010; Agarwal et al. 2011; Kouloumpis et al. 2011) to understand the impact of COVID-19 on the emotion of the public. While there has been some discussion of the public's emotional reaction to the COVID-19 pandemic using text mining techniques on the available platforms, the existing research is subject to several limitations: (1) most studies conduct sentiment analysis based on single words, ignoring the interactions among words (e.g., Xue et al. 2020); for example, a word expresses a completely opposite meaning when it is combined with "not" or "no"; (2) most current studies (e.g., Lwin et al. 2020; Pastor 2020) only present results without discussing their connection to the anti-epidemic measures; (3) emotions of individuals change over time and differ from place to place due to different levels of anti-epidemic measures, but such features are not necessarily incorporated in most available studies; (4) only a few studies (e.g., Zhou et al. 2020) consider the typos, slang, word variations, abbreviations, and emojis that commonly appear in the casual environment of Twitter.

In this article, we mainly focus on two aspects of public sentiments regarding the COVID-19 pandemic. The first question is: how do people react to the spread of COVID-19 over time? Specifically, we are interested in comparing the change of emotion in different time periods based on tweets, and we also study the association between people's reactions to COVID-19 and the number of daily reported infected cases. Second, multiple measures have been implemented by governments to mitigate the virus spread, but it is still unclear how seriously people take those measures. We are therefore also interested in people's reactions, using the keywords "lockdown", "masks", and "vaccine" in the analysis. To alleviate possible confounding effects associated with underlying factors such as culture, government practice, and internet access regulations, here we restrict our analysis to four cities of Canada, Toronto, Montreal, Vancouver, and Calgary, which are hit hardest by COVID-19.
We analyze sentiments expressed in four cities of Canada on one of the most widely used social media platforms, Twitter, a microblogging website, for the period from February 24, 2020 to October 14, 2020.

The remainder of the article is organized as follows. In Section 2, we discuss the study design and the sentiment analysis method, which includes the procedure of text mining using the Twitter data. In Section 3, we present the results of sentiment analysis using Twitter text data of Canada. We compare the results over different periods. Further, we extend the comparisons by contrasting the results with those for the four cities in the United States. In Section 4, statistical models are built to analyze how public emotion is associated with the daily reported infected cases. Finally, we conclude the article in Section 5.

Our Twitter data mining pipeline consists of a data preparation stage and a data analysis stage, shown in Figure 2. The data preparation procedure includes data collection and raw data cleaning. COVID-19 related tweets, published between February 24, 2020 and October 14, 2020, are collected from four Canadian cities, Toronto, Montreal, Vancouver, and Calgary, the representative cities of the four most populated provinces, Ontario, Quebec, British Columbia, and Alberta, respectively. Four cities in the U.S., New York, Los Angeles, Seattle, and Chicago, are considered for comparison. We use the snscrape module in Python 3.8 (https://github.com/JustAnotherArchivist/snscrape) to scrape tweet text online by searching with the keyword "COVID-19". All retrieved tweets published within 50 km of the center of each considered city are included in our analysis, leading to 30,655 tweets in total for the four cities in Canada and 69,742 tweets in total for the four cities in the United States. The data cleaning procedure is described in detail in the following section, and the cleaned data are then analyzed using the pandas package in Python 3.8. The data analysis stage includes the construction of descriptive statistics characterizing the topic popularity and public sentiment. We specifically compare the sentiment scores within different regions dynamically, with temporal effects incorporated. Furthermore, we build a statistical model to study the association among the sentiment scores, Twitter activities, and the number of reported infected cases.

Applying common standards (e.g., Stone et al. 2011; Xue et al. 2020), we first clean the raw data using the pandas module of Python 3.8, following Xue et al. (2020): (1) the URLs, the hashtag symbol "#", and the symbol "@" are removed from the tweets in the data set; (2) tweets written in a non-English language are removed; (3) meaningless characters, punctuation, and stop-words are removed from the dataset as they do not contribute semantic meaning. Some examples are shown in Table 1.

Tweets are primarily connected with moods and sentiments. To reflect this feature, we carry out sentiment analysis, which is the process of identifying the attitude of the author toward the topic being written about. The evaluation of the sentiment effect is conducted by matching the words to existing lexicons (Jagdale et al. 2018) and looking up their sentiment scores in the lexicon. We first break each tweet into individual words, determine the sentiment score of each word in the tweet, and then obtain the sentiment score for each tweet by summing the scores of its words. Different lexicons have been proposed in the field of text mining.
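To make the collection and cleaning steps concrete, the following is a minimal Python sketch assuming the snscrape and NLTK packages. The search-query syntax, the tweet attribute names (content, likeCount, etc.), and the stop-word list are our assumptions; they may differ from the authors' exact pipeline and across package versions.

```python
# A minimal sketch of the scraping and cleaning steps described above.
# Not the authors' exact pipeline: the query operators and the attribute
# names below are assumptions that may change across snscrape versions.
import re
import pandas as pd
import snscrape.modules.twitter as sntwitter
from nltk.corpus import stopwords  # assumes nltk.download("stopwords") has been run

def scrape_city(city, since="2020-02-24", until="2020-10-14", radius="50km"):
    """Collect COVID-19 related tweets posted near a city center."""
    query = f'COVID-19 near:"{city}" within:{radius} since:{since} until:{until} lang:en'
    rows = []
    for tweet in sntwitter.TwitterSearchScraper(query).get_items():
        rows.append({
            "date": tweet.date,
            "text": tweet.content,          # may be .rawContent in newer snscrape versions
            "likes": tweet.likeCount,
            "replies": tweet.replyCount,
            "retweets": tweet.retweetCount,
        })
    return pd.DataFrame(rows)

STOPWORDS = set(stopwords.words("english"))

def clean_text(text):
    """Remove URLs, '#' and '@' symbols, non-alphabetic characters, and stop-words."""
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # URLs
    text = re.sub(r"[#@]", " ", text)               # hashtag and mention symbols
    text = re.sub(r"[^A-Za-z\s]", " ", text)        # punctuation and other characters
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    return " ".join(words)

if __name__ == "__main__":
    df = scrape_city("Toronto")
    df["clean_text"] = df["text"].apply(clean_text)
    print(df.head())
```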
Some of the lexicons are purely polarity-based (e.g., Liu and Zhang 2012), i.e., the sentiment is classified as negative, neutral, or positive. Some lexicons, e.g., AFINN proposed by Nielsen (2011) and the Valence Aware Dictionary and sEntiment Reasoner (Vader) proposed by Gilbert and Hutto (2014), take the intensity of the sentiment into account, where the sentiment intensity score is a scale ranging between a negative and a positive value. The NRC Sentiment and Emotion Lexicons (Mohammad and Turney 2013) consider each word to carry one or a combination of multiple moods, including anticipation, positive, negative, sadness, disgust, joy, anger, surprise, fear, and trust. Table 2 shows examples of the lexicons of different types.

In this article, we use Vader and NRC to characterize the sentiment and emotion of the tweets, respectively. The Vader lexicon quantifies the sentiment into numeric scores which can be used for further analysis, and the NRC lexicon provides detailed categories to describe the mood in a refined fashion. The Vader lexicon is also specifically attuned to sentiments expressed in social media. It contains the utf-8 encoded emojis and emoticons, which are important features frequently used in tweet texts (Gilbert and Hutto 2014).

To conduct sentiment analysis, polarity scores are identified for every single word in each tweet according to the Vader lexicon, and the frequency of each mood in the emotion categories appearing in the text is identified according to the NRC lexicon. For the sentiment score obtained by Vader, we calculate the average sentiment score of each tweet by first summing the scores of the words in the tweet and then dividing by the total number of words in the tweet. Following Gilbert and Hutto (2014), after calculating the sentiment scores, we further adjust the calculated tweet-wise sentiment scores to incorporate the information related to negation words ("not" and "n't"), punctuation that intensifies sentiments (e.g., "Good!!!"), conventional use of word shape to signal emphasis (e.g., using CAPITAL words), words modifying the intensity (e.g., "very", "pretty", etc.), and conventional slang and emojis (e.g., "lol", ":)"). For the case with "??", we treat it as intensified punctuation just like "!!", but for the case with "?", we do not make any adjustment because of the uncertainty about whether it marks a real question or a rhetorical one. After computing the positive, neutral, and negative scores for each tweet, we further calculate the compound score according to the rules in Gilbert and Hutto (2014), and then normalize it to be between -1 (most extreme negative) and +1 (most extreme positive). These steps are implemented via the polarity_scores() function in the Vader module (https://github.com/cjhutto/vaderSentiment).

For the mood categories annotated by the NRC lexicon, the accumulated count of each emotion category for each tweet is also calculated. We further compute the frequency, defined as the ratio of the count of each emotion to the total word count in a tweet, using the NRClex() function in the Python NRCLex module (https://github.com/metalcorebear/NRCLex). For each word, the emotion is represented by a ten-dimensional vector reflecting the 10 different moods specified in Section 2.3, where each element is expressed as the frequency of a mood, ranging from 0 (complete absence of this emotion) to 1 (entirely dominated by this emotion).
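As an illustration of the tweet-level scoring described above, the sketch below uses the vaderSentiment and NRCLex packages cited in the text. The function and attribute names (polarity_scores(), affect_frequencies) follow the public versions of these modules and are not guaranteed to match the authors' exact workflow.

```python
# A minimal sketch of the tweet-level Vader and NRC scoring described above.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from nrclex import NRCLex

analyzer = SentimentIntensityAnalyzer()

def vader_scores(tweet):
    """Return Vader polarity scores; the 'compound' entry already incorporates
    the negation, punctuation, capitalization, intensifier, slang, and emoji
    rules mentioned above and is normalized to [-1, +1]."""
    return analyzer.polarity_scores(tweet)

def nrc_frequencies(tweet):
    """Return the relative frequency of each NRC emotion category in the tweet,
    as computed by the NRCLex module."""
    return NRCLex(tweet).affect_frequencies

if __name__ == "__main__":
    example = "Not happy with the lockdown, but the masks really help!!"
    print(vader_scores(example))     # e.g., {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
    print(nrc_frequencies(example))  # e.g., {'fear': ..., 'trust': ..., 'positive': ..., ...}
```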
As an example, sentiment analysis of the tweets posted on February 24, 2020 in Toronto, Canada is presented in Tables 3 and 4.

Before conducting text mining of the tweets, we summarize descriptive information about the tweet data scraped online. The number of tweets reflects the popularity of a topic on Twitter, and "likes", "replies", and "retweets" are three main activities through which users engage with a tweet, whose counts indicate the impact of a tweet in generating discussions. In Figures 4 and 5, we present the city-level trajectories of the number of tweets and of the total numbers of likes, replies, and retweets for the COVID-19 associated tweets, together with the daily number of reported infected cases for the province or state in which each city is located.

To trace the change of sentiment scores in different cities over time, we summarize the mean and standard deviation of the tweet-wise sentiment scores, stratified by city and by the time periods defined in Figure 1. Table 5 reports the results for the cities in Canada and the U.S. All four Canadian cities and all four U.S. cities have the largest sentiment scores in Period 2 (i.e., during the lockdown), which may indicate confidence and positive feelings about COVID-19 during the lockdown period. In contrast, the smaller mean scores in Period 1 than in Period 2 may be related to the concern and the uncertainty about the disease in the early stage of the pandemic. The sentiment scores in the U.S. are more negative than those in Canada in all periods. While the mean scores differ across cities, the associated standard deviations remain similar for different cities and different periods.

To closely understand how anti-epidemic measures may be related to the daily average sentiment scores (obtained from the Vader approach) of the tweets, we produce heatmaps for three keywords, "mask", "vaccine", and "lockdown", and display them in the corresponding figures. To closely visualize the change of the mood composition of the daily tweets over time, in Figure 10 we present the density plots obtained from the NRC method. Overall, the changes in the Canadian cities seem to be more variable than those in the U.S. cities, and the trend and trajectory vary from city to city and from time to time. For example, from May to September, the moods are slightly intensified in Vancouver, Montreal, and Calgary, whereas an opposite trend is observed for Toronto and New York. Relative to other months, February is the month that incurs a large variation of the word frequency in each mood for most cities, which may be attributed to the uncertainty and the lack of knowledge of COVID-19 in the early stage of the pandemic.
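A sketch of how the Table 5 summary above could be produced with pandas is given below. The data-frame column names ('city', 'date', 'compound') and the period cut points are illustrative placeholders rather than the authors' actual choices.

```python
# A sketch of the Table 5 summary: mean and standard deviation of tweet-wise
# Vader compound scores stratified by city and period. Column names and
# period break points are illustrative placeholders.
import pandas as pd

def summarize_by_city_and_period(df, period_breaks, period_labels):
    """df has one row per tweet with columns 'city', 'date', 'compound'."""
    df = df.copy()
    df["date"] = pd.to_datetime(df["date"])
    df["period"] = pd.cut(df["date"], bins=period_breaks, labels=period_labels)
    return df.groupby(["city", "period"])["compound"].agg(["mean", "std"])

# Example usage with three illustrative periods (not the paper's actual cut points):
# breaks = pd.to_datetime(["2020-02-24", "2020-03-17", "2020-07-01", "2020-10-14"])
# table5 = summarize_by_city_and_period(tweets_df, breaks, ["Period 1", "Period 2", "Period 3"])
```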
With the descriptive analysis reported in Section 3, we are further interested in examining the Vader sentiment scores in terms of potential causal relationships. Specifically, we explore possible causal relationships of the daily number of reported infected cases with tweet activities (e.g., tweet total counts, tweet like counts, tweet reply counts, and retweet counts) and with sentiment scores of COVID-19 related tweets. We implement convergent cross mapping (Ye et al. 2015), integrated with the echo state network approach (Lukoševičius and Jaeger 2009), to explore the causal relationships.

Let X and Y denote the two variables whose causal relationship is of interest. Our goal is to identify whether there is evidence to suggest a causal relationship between X and Y by examining {X_t : t = 1, ..., T} and {Y_t : t = 1, ..., T}, the time series of observations for X and Y, respectively, where X_t and Y_t are the observed values of X and Y at time t. For example, X_t may represent the daily average sentiment score on day t and Y_t the number of tweets on day t. The idea of convergent cross mapping is that if Y is the cause of X, then the time series {Y_t : t = 1, ..., T} of the causal variable Y can be recovered from the time series of X (Tsonis et al. 2018). To facilitate this rationale and accommodate possible lag effects, let τ denote the lag time; we take X_t as the input data and repeatedly fit the leaky echo state network model (Jaeger et al. 2007) to predict Y_{t+τ}, denoted Ŷ_{t+τ}, by varying the value of τ. The details are given in Section 4.2.

We now describe the evaluation of the performance of the prediction of Y_{t+τ} using X_t. For any given τ ≤ T, we compute the Pearson correlation coefficient between the predicted series Ŷ_{t+τ} and the corresponding observations Y_{t+τ} (Ye et al. 2015), denoted ρ_y(τ) and given by the sample correlation ρ_y(τ) = Corr(Ŷ_{t+τ}, Y_{t+τ}), computed over t. Likewise, taking Y_t as the input data and X_{t+τ} as the output labels, we fit another echo state network model to predict the values of X_{t+τ}, denoted X̂_{t+τ}. Then, for any given τ, we compute the Pearson correlation coefficient between X̂_{t+τ} and X_{t+τ}, denoted ρ_x(τ). Following the lines of Ye et al. (2015) and Huang et al. (2020), we determine the causality between X and Y according to the time lags τ at which ρ_x(τ) and ρ_y(τ) reach their peak values. Specifically,

• If X causes Y and not vice versa, we expect the peak value of ρ_x(τ) to be located in the positive domain, i.e., the corresponding τ is positive; meanwhile, we expect the peak value of ρ_y(τ) to occur at a negative τ.
• If X and Y cause each other, both the peak values of ρ_x(τ) and ρ_y(τ) are expected to occur in the negative domain.
• If the coupling of X and Y has a delay effect, the lag positions of the peaks of ρ_x(τ) and ρ_y(τ) are expected to be influenced by the delay time.

To study how ρ_x(τ) and ρ_y(τ) may change with the time lag τ, we consider the range [−30, 30] for τ in the analysis in Section 4.3.
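The lag scan just described can be sketched as follows. The fit_and_predict argument is a placeholder for any predictor (here, the echo state network of Section 4.2); it is not a function from an existing package.

```python
# A sketch of the lag scan described above: for each lag tau in [-30, 30],
# predict the target series from the putative cause and record the Pearson
# correlation between predictions and observations.
import numpy as np
from scipy.stats import pearsonr

def lag_correlations(x, y, fit_and_predict, max_lag=30):
    """Return {tau: rho_y(tau)}, where rho_y(tau) is the correlation between
    observed y_{t+tau} and predictions of y_{t+tau} made from x_t."""
    rho = {}
    T = len(x)
    for tau in range(-max_lag, max_lag + 1):
        if tau >= 0:
            inputs, targets = x[: T - tau], y[tau:]      # pair x_t with y_{t+tau}
        else:
            inputs, targets = x[-tau:], y[: T + tau]     # negative lags
        preds = fit_and_predict(inputs, targets)
        rho[tau] = pearsonr(preds, targets)[0]
    return rho

# Calling the function twice with the roles of the two series swapped yields
# rho_y(tau) and rho_x(tau); their peak locations are compared as in the
# bullet points above, e.g.:
# rho_y = lag_correlations(sentiment, tweet_counts, esn_fit_and_predict)
# rho_x = lag_correlations(tweet_counts, sentiment, esn_fit_and_predict)
# peak_lag_y = max(rho_y, key=rho_y.get)
```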
As discussed in Section 4.1, the identification of the causal relationship between X and Y is carried out by examining the Pearson correlation coefficient between the observed data, say Y_{t+τ}, and their predicted counterparts, say Ŷ_{t+τ}, which are obtained by fitting a prediction model connecting X and Y. While various modeling schemes may be considered for building a prediction model, here we employ the echo state network approach for its good performance in the prediction of non-linear time series data (Lukoševičius and Jaeger 2009).

The echo state network is basically composed of three layers: the input layer, the reservoir layer, and the output layer (Figure 3). The input layer contains the input data and the output layer is made up of the output labels. The reservoir layer consists of the hidden reservoir neuron states, denoted u_t, which form an N × 1 vector with N being a user-specified positive integer. An N × N adjacency matrix, say A, is constructed to describe the connections among the reservoir neurons. The determination of the links among the three types of layers is carried out with two procedures, the input-to-reservoir procedure and the reservoir-to-output procedure, which are described as follows.

• From the input layer to the reservoir layer: For any t = 1, ..., T, we construct the relationship between the input data X_t and the hidden reservoir neuron states. Let u_0 denote the initial reservoir neuron states; then we update the neuron states u_t by

u_t = (1 − ψ) u_{t−1} + ψ tanh(A u_{t−1} + W_in X_t),

where ψ is the leaky parameter to be specified, tanh(·) is applied element-wise, and W_in is an N × 1 matrix representing the weights that transform X_t from the input layer to the reservoir layer.

• From the reservoir layer to the output layer: We compute the predicted output labels

Ŷ_{t+τ} = W_out u_t,

where W_out is an n × N matrix representing the weights that transform u_t into the output layer.

In these steps, only the weights W_out need to be trained, by minimizing the penalized loss

Σ_t || Y_{t+τ} − W_out u_t ||_2^2 + α || W_out ||_2^2,

where ||·||_2 represents the L_2-norm and α is a tuning parameter to be specified.

The adjacency matrix A and the weight matrix W_in are prespecified, which may be done using the following procedure. Let p_s be a user-specified value between 0 and 1; let γ be a user-determined positive scaling parameter determining the degree of nonlinearity in the reservoir dynamics (Goudarzi et al. 2015); and let β be a pre-specified positive scaling parameter. Generate a sequence of values, say v, independently from Uniform[−1, 1] and a sequence of values, say s, independently from Bernoulli(p_s), and then form the weight matrix W_in by letting its elements be given by γsv. The adjacency matrix A is formed similarly, with its elements given by βsv. To ensure that the echo state network works properly, the effects of the initial conditions should vanish as the time series evolves (Jarvis et al. 2010), which is also known as the echo state property. A necessary condition for the echo state property is that the largest eigenvalue of A in modulus, denoted λ_max, is smaller than 1 (Jaeger et al. 2007). Here λ_max, also called the "spectral radius" of A, determines the time range over which the time series data interact with each other non-linearly. Such a property guides us in setting a suitable value for the scale parameter β.

Treating the variable X_t as the training data and Y_{t+τ} as the output labels, we implement the echo state network using the EchoTorch module (Schaetti 2018) in Python 3.8. To determine suitable values of the tuning parameters λ_max, ψ, N, p_s, α, and γ, we take a leave-one-out cross-validation procedure. Specifically, we take the time series of seven cities as the training data and the remaining city as the testing data. The procedure is repeated by leaving each city out as the testing data once, and in each run we record the normalized root-mean-squared error (NRMSE) (Lukoševičius and Jaeger 2009):

NRMSE = sqrt{ Σ_t (Ŷ_{t+τ} − Y_{t+τ})^2 / (T · Var(Y)) }.

We conduct a grid search to find the optimal set of tuning parameters such that the NRMSE is minimized. Similarly, to study the other direction of the causal relationship, the procedure described above is repeated by switching X and Y.

In this study, we examine possible causal relationships among six features pairwise: the COVID-19 daily infected cases; the Vader sentiment scores; the daily number of tweets; and the daily total numbers of likes, retweets, and replies. Assuming that the causal relationships among those features are the same for each city, we merge the data of all cities and consider the two directions of the relationship as explained in Section 4.1. To illustrate the implementation of the procedures described in Sections 4.1-4.2, in Direction 1 let {X_t : t = 1, ..., T} denote the time series of the daily average sentiment scores in a city and let {Y_t : t = 1, ..., T} denote the time series of the daily tweet counts of that city; in Direction 2, we swap X_t and Y_t.
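Since the paper relies on the EchoTorch module, the standalone numpy sketch below is only meant to illustrate the reservoir update, the random construction of W_in and A, the spectral-radius condition, the ridge-regression training of W_out, and the NRMSE criterion described above. The parameter defaults are illustrative, not the tuned values reported below.

```python
# A from-scratch numpy sketch of the leaky echo state network described above.
# The paper uses EchoTorch; this version only illustrates the mechanics, and
# its default parameter values are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def make_weights(N, p_s, gamma, beta):
    """W_in: N x 1 with entries gamma * s * v; A: N x N with entries beta * s * v,
    where v ~ Uniform[-1, 1] and s ~ Bernoulli(p_s). A is rescaled so that its
    spectral radius stays below 1 (the echo state property condition)."""
    w_in = gamma * rng.binomial(1, p_s, (N, 1)) * rng.uniform(-1, 1, (N, 1))
    A = beta * rng.binomial(1, p_s, (N, N)) * rng.uniform(-1, 1, (N, N))
    radius = max(abs(np.linalg.eigvals(A)))
    if radius >= 1:
        A = A * (0.95 / radius)  # enforce spectral radius < 1
    return w_in, A

def run_reservoir(x, w_in, A, psi):
    """Leaky update u_t = (1 - psi) u_{t-1} + psi * tanh(A u_{t-1} + W_in x_t)."""
    N = A.shape[0]
    u = np.zeros((N, 1))
    states = []
    for x_t in x:
        u = (1 - psi) * u + psi * np.tanh(A @ u + w_in * x_t)
        states.append(u.ravel())
    return np.array(states)  # T x N matrix of reservoir states

def esn_fit_and_predict(x, y, N=150, p_s=0.1, gamma=0.9, beta=0.5, psi=0.5, alpha=0.1):
    """Fit W_out by ridge regression and return in-sample predictions of y."""
    U = run_reservoir(np.asarray(x, dtype=float), *make_weights(N, p_s, gamma, beta), psi)
    w_out = np.linalg.solve(U.T @ U + alpha * np.eye(N), U.T @ np.asarray(y, dtype=float))
    return U @ w_out

def nrmse(y_hat, y):
    """Normalized root-mean-squared error used in the cross-validated grid search."""
    return np.sqrt(np.mean((y_hat - y) ** 2) / np.var(y))
```

A function of this form can serve as the fit_and_predict argument in the lag-scan sketch of Section 4.1.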
According to the leave-one-out cross-validation, we choose the tuning parameters λ_max, ψ, N, p_s, α, and γ to be 0.1, 0.5, 150, 0.1, 0.1, and 0.9, respectively, in Direction 1, and 0.1, 0.9, 250, 0.7, 100, and 0.9 in Direction 2. Then, using those identified optimal parameter values, we train the echo state network for W_out on the time series X_t and Y_t to predict Y_t and X_t for the two directions, respectively. The predictions are repeated for all the cities considered in the study. Finally, for each τ ≤ T, we compute the Pearson correlation coefficient between Ŷ_{t+τ} and Y_{t+τ}, where Ŷ_{t+τ} is the predicted value corresponding to Y_{t+τ}; in the other direction, we compute the correlation coefficient between X̂_{t+τ} and X_{t+τ}.

We present the analysis results in Figure 11, which shows that in Direction 1 the Pearson correlation coefficient ρ_x reaches its peak at τ = 8, whereas in Direction 2 the Pearson correlation coefficient ρ_y peaks at τ = −5. This suggests that the sentiment of COVID-19 is likely to cause the changes in the time series of the tweet counts, but not vice versa. On the other hand, the small values of the peaks of the correlations (i.e., ρ_x = 0.006 and ρ_y = 0.169) indicate a weak relationship between the infected case number and the tweet counts. We repeat the analysis for each pair of features of interest and identify the optimal lag that achieves the highest correlation in each scenario.

Figure note: To properly present the trend, the y-axis is presented with a logarithm transformation.

Figure 9: Heatmap of the sentiment scores of "lockdown" related tweets, calculated using the Vader lexicon, over time for the eight cities in North America. The orange color denotes positive sentiment and the green color denotes negative sentiment.

References:
• Analysis of Twitter data with the Bayesian fused graphical lasso
• Sentiment analysis of Twitter data
• Sentiment analysis of social media response on the COVID-19 outbreak
• Vader: A parsimonious rule-based model for sentiment analysis of social media text
• Exploring transfer function nonlinearity in echo state networks
• Detecting causality from time series in a machine learning framework
• Optimization and applications of echo state networks with leaky-integrator neurons
• Review on sentiment lexicons
• Extending stability through hierarchical clusters in echo state networks
• Machine learning based approaches for detecting COVID-19 using clinical text data
• Twitter sentiment analysis: The good the bad and the omg!
• Twitter data analytics
• Text mining in practice with R
• A survey of opinion mining and sentiment analysis
• Reservoir computing approaches to recurrent neural network training
• Global sentiments surrounding the COVID-19 pandemic on Twitter: Analysis of Twitter trends
• NRC emotion lexicon
• AFINN sentiment analysis in Python
• Twitter as a corpus for sentiment analysis and opinion mining
• Sentiment analysis of Filipinos and effects of extreme community quarantine due to coronavirus (COVID-19) pandemic
• Text mining approach to analyze coronavirus impact: Mexico City as case of study

The authors would like to thank the reviewers for their useful comments. The first two authors led the project with equal contributions, including writing the paper; the last two authors participated in the project with equal contributions. The authors declare that they have no competing interests.
All code and data involved in this paper are available upon request from the authors.