key: cord-0455919-utyfdvfh authors: Yang, Qiang; Alamro, Hind; Albaradei, Somayah; Salhi, Adil; Lv, Xiaoting; Ma, Changsheng; Alshehri, Manal; Jaber, Inji; Tifratene, Faroug; Wang, Wei; Gojobori, Takashi; Duarte, Carlos M.; Gao, Xin; Zhang, Xiangliang title: SenWave: Monitoring the Global Sentiments under the COVID-19 Pandemic date: 2020-06-18 journal: nan DOI: nan sha: dd4cd7679adc28ea53e78d3f5f6de42a75f9c1d5 doc_id: 455919 cord_uid: utyfdvfh

Since the first alert launched by the World Health Organization (5 January, 2020), COVID-19 has spread to over 180 countries and territories. As of June 18, 2020, there are in total over 8,400,000 cases and over 450,000 related deaths. This has caused massive losses in the economy and jobs globally and the confinement of about 58% of the global population. In this paper, we introduce SenWave, a novel sentiment analysis work using 105+ million collected tweets and Weibo messages to evaluate the global rise and fall of sentiments during the COVID-19 pandemic. To make a fine-grained analysis of the feelings people have when facing this global health crisis, we annotate 10K tweets in English and 10K tweets in Arabic in 10 categories, including optimistic, thankful, empathetic, pessimistic, anxious, sad, annoyed, denial, official report, and joking. We then utilize an integrated transformer framework, called simpletransformer, to conduct multi-label sentiment classification by fine-tuning the pre-trained language model on the labeled data. Meanwhile, for a more complete analysis, we also translate the annotated English tweets into different languages (Spanish, Italian, and French) to generate training data for building sentiment analysis models for these languages. SenWave thus reveals the sentiment of the global conversation on COVID-19 in six different languages (English, Spanish, French, Italian, Arabic and Chinese), following the spread of the epidemic.
The conversation showed a remarkably similar pattern of rapid rise and slow decline over time across all nations, as well as on special topics like the herd immunity strategies, to which the global conversation reacts strongly negatively. Overall, SenWave shows that optimistic and positive sentiments increased over time, foretelling a desire to seek, together, a reset for an improved COVID-19 world.

arXiv:2006.10842v1 [cs.SI] 18 Jun 2020

Since its outbreak, the coronavirus has affected more than 180 countries, causing massive losses in the economy and jobs globally and confining about 58% of the global population. Many people have been forced to work or study from home during the pandemic. Research on people's feelings is essential for maintaining mental health and keeping people informed about Covid-19. Social media (e.g., Twitter, Weibo) have played a major role in expressing people's feelings and attitudes towards Covid-19. We thus aim to build a system named SenWave to monitor the global sentiments under the COVID-19 pandemic through deep learning powered sentiment analysis. Sentiment analysis has been widely researched in the field of natural language processing [1, 2, 3, 4]. Most current sentiment analysis tasks consider coarse-grained emotion labels such as positive, neutral, and negative for reviews/comments of books/products/movies, or five values indicating the degree of emotion with a ranking score from 1 to 5. However, the feelings of people in a pandemic are much more complicated than the sentiments in movie reviews, product comments, etc. For instance, some people may feel angry and sad since Covid-19 leads to an increasing number of deaths and rising unemployment, while others may feel optimistic because of the medical supplies and medical assistance provided to people in need. Therefore, we need to define fine-grained labels to better understand the impact of the health crisis on sentiment.
However, there are no appropriate and sufficient annotated data to support the building of SenWave by training deep learning sentiment classifiers. One non-Covid-19 tweet sentiment analysis dataset is available in [5], with 7724 labeled English tweets, 2863 Arabic tweets, and 4240 Spanish tweets (14,827 in total). It is a benchmark dataset labeled in 11 categories: anger, anticipation, disgust, fear, joy, love, optimism, pessimism, sadness, surprise, and trust. Initially, we tried to use it as our training data for building a Covid-19 sentiment classifier. However, the sentiment results were not suitable for Covid-19 analysis. For example, very few tweets fall in the categories joy, love and trust. Besides, many tweets containing official reports, as well as tweets making jokes and denying conspiracy theories, were classified into inappropriate categories. According to our observations, these kinds of labels are essential in expressing opinions and attitudes. Due to the lack of appropriate training data, Covid-19 tweet sentiment analysis has been conducted mostly with engineered features or conventional bag-of-words representations rather than deep learning models, either in unsupervised ways or in supervised ways with limited training data (e.g., only 5K in [6]). Yasin et al. built a real-time Covid-19 tweets analyzer using an LDA model on USA data with positive, neutral, and negative labels [7]. Jia et al. used an LDA model and the NRC Lexicon on English tweets to predict one of eight emotions (anger, anticipation, fear, surprise, sadness, joy, disgust and trust) [8]. Mohammed et al. employed a naïve Bayes model to predict Saudis' attitudes towards Covid-19 preventive measures as positive, negative, or neutral. Caleb et al. used a logistic regression classifier with linguistic features, hashtags and tweet embeddings to identify anti-Asian hate and counterhate text [9].
However, these methods are either limited by the sentiment dictionary and its availability, or lack a deep understanding of tweet semantics. Even supervised methods such as naïve Bayes perform only multi-class classification and therefore cannot capture the real situation, in which the emotion expressed about Covid-19 can be a mixture of multiple emotions (multi-label classification). To solve the above problems, on one hand, we collected over 105 million tweets in five languages (English, Spanish, French, Arabic and Italian) from March 1, 2020, with Chinese Weibo posts covering the sixth language. On the other hand, to meet the requirement of fine-grained sentiment analysis for Covid-19, we annotated 10,000 tweets in English and 10,000 tweets in Arabic in 10 categories, including optimistic, thankful, empathetic, pessimistic, anxious, sad, annoyed, denial, official, and joking. Each tweet was annotated by at least three experienced annotators in the corresponding language under strict quality control. We allowed one tweet to be annotated with more than one category, to support multi-label analysis. To analyze over 1 million Chinese Weibo posts on COVID-19, we constructed a set of 21,173 Weibo posts labeled in 7 sentiment categories: anger, disgust, fear, optimism, sadness, surprised, and gratitude. These 41K labeled posts and the over 106 million unlabeled tweets and Weibo posts compose the dataset studied by SenWave. More details about the dataset are given in Sec. 2. We will make the data and the SenWave implementation publicly available to support other social impact analyses of Covid-19 and fine-grained sentiment analysis. To comply with Twitter's Terms of Service, we will only publicly release the Tweet IDs for unlabeled data and a limited number of tweet texts for labeled data (20K in total) for non-commercial research use. To the best of our knowledge, this is the largest labeled Covid-19 sentiment analysis dataset with fine-grained labels.
We summarize our main contributions in this paper as follows:

• We constructed the largest fine-grained annotated Covid-19 tweet dataset so far (10K English tweets and 10K Arabic tweets) in 10 sentiment categories, which helps to facilitate studies of the social impact of Covid-19 and other fine-grained analysis tasks in the research community.
• We share a large set of Covid-19 tweet IDs collected since Mar 1, 2020, in five languages, accumulating over 105 million tweets, which will be continuously updated.
• We report the usability of the labeled Covid-19 tweets by first evaluating the performance of deep learning classifiers trained on them, and then applying the classifiers to the over 105 million unlabeled tweets from March 1 to May 15, 2020, to monitor how global emotions vary on topics of concern and to report other interesting findings. This is the first report of COVID-19 sentiment over 105 million tweets. The largest COVID-19 tweet dataset analyzed before our work contains 1.8M tweets [8], analyzed in an unsupervised way using topic modeling and lexicon features.

We used Twint, an open source Twitter crawler, to collect our tweet dataset. Twint allows users to specify a number of parameters alongside the query, such as tweet language, time period, etc. Requests were formed with the specified parameters, and the resulting responses were scraped into JSON documents. We used a unified query across these languages: "covid-19 OR coronavirus OR covid OR corona OR (corona in English) ". We launched 12 instances on 24 cores to download daily updates and historical data back to March 1. Data rates varied slightly throughout the period, averaging a little over a million tweets a day. Tweets were saved as JSON documents and pooled into a shared medium, to be pre-processed and consumed by the language models for sentiment analysis, running on a GPU server (GTX 1080ti GPU and 20 CPUs).
Due to the limited number of tweets in Chinese that could be found on Twitter, our data collection for the sentiment analysis of COVID-19 in China was conducted on Sina Weibo, which is the largest social media platform in China. The Weibo records were collected through the Sina Weibo API, by first collecting the hashtags about COVID-19 and then extracting the Weibo records that include these hashtags.

We randomly selected 10K English and 10K Arabic tweets for sentiment annotation. These two languages were selected because English and Arabic are among the top-5 most popular languages in the world. In addition, English can be effectively translated into other languages when needed. We thus have a considerably large labeled dataset for tweet sentiment analysis. The sentiment categories were determined by domain experts after reviewing a subset of the collected tweets and several rounds of discussion. The final set of labels reflects the complicated sentiments in a pandemic. These 10 labels and the auxiliary emotions they cover are optimistic (representing hopeful, proud, trusting), thankful for the efforts to combat the virus, empathetic (including praying), pessimistic (hopeless), anxious (scared, fearful), sad, annoyed (angry), denial towards conspiracy theories, official report, and joking (ironical). We recruited over 50 experienced annotators so that every tweet was labeled by at least three annotators. Example tweets with suggested categories were provided to annotators in advance. We allowed each tweet to be assigned multiple labels, which is in line with convention. For example, the tweet "Dear Covid19, Will you please retaliate on our behalf. We're hopeless, helpless, restless, speechless." has a mixture of pessimistic and sad emotions. We allowed multi-label annotation with our fine-grained sentiments to support the analysis of the complicated emotions in the pandemic.
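Because every tweet receives label sets from at least three annotators, those sets have to be merged into one final label set per tweet. The following is a minimal sketch of one way to do this, assuming a per-label strict-majority rule; the function name `majority_vote` and the threshold convention are illustrative assumptions, not the authors' published procedure.

```python
from collections import Counter


def majority_vote(annotations, min_votes=None):
    """Merge several annotators' label sets into one multi-label set.

    annotations: list of iterables of label strings, one per annotator.
    min_votes:   votes a label needs to be kept; defaults to a strict
                 majority of the annotators (an assumption of this sketch).
    """
    if min_votes is None:
        min_votes = len(annotations) // 2 + 1
    # Count each label at most once per annotator.
    counts = Counter(label for labels in annotations for label in set(labels))
    return sorted(label for label, n in counts.items() if n >= min_votes)


if __name__ == "__main__":
    votes = [
        {"pessimistic", "sad"},
        {"sad"},
        {"pessimistic", "sad", "anxious"},
    ]
    print(majority_vote(votes))
```

With three annotators a label survives only if at least two of them chose it, so a label picked by a single annotator is dropped while the tweet can still end up with several labels.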
To measure the reliability of the sentiment annotations, we conducted a verification study on the annotated tweets, where the final labels of each tweet are determined by a majority-voting strategy. Our labeled dataset can also be used for other events with complex emotions, e.g., public opinion analysis and general election analysis. What these events have in common is that their emotions have multiple aspects. For Chinese Weibo, we analyzed the COVID-19 posts and annotated 21,173 Weibo posts in 7 sentiment categories: optimistic, thankful, surprised, fearful, sad, angry and disgusted. The processed data is saved and updated in the git repository using the fetched data.

Table 1 summarizes the collected tweets in the 5 languages, as well as the COVID-19 Weibo posts in Chinese (Zh) from Jan 10 to May 15. The volumes of collected daily data for each language are illustrated in Fig. 1. The statistics show a similar pattern of rapid rise followed by gradual fall in the global citizen conversation around COVID-19, but with the volume of messages in Chinese reaching its maximum on January 22, two months earlier than the peaks in all other languages examined, which occurred on March 12-13 and March 21. This reflects the lag between the development of the epidemic first detected in Wuhan and its spread to pandemic status. The discussion and attention quickly reached a peak when important decisions were made. For example, Italy, Spain, and France announced national lockdowns or the closure of leisure areas, while the UK considered herd immunity and the US announced a state of alarm on March 12-13. Meanwhile, in KSA the peak was reached on March 21 due to the King's speech and the suspension of entry to the two Holy cities. In addition, people's attention cools down as time goes on.
As shown in Table 2, the rise and fall in the volume of messages on COVID-19 was remarkably correlated across the English, French, Italian, Arabic and Spanish languages, although the rise occurred earlier in Italian, the language of the first western nation to suffer the epidemic. The high correlation coefficients indicate that populations speaking different languages are responding in a similar way. The label distributions are shown in Table 3. Note that the percentages do not sum to 1 due to the multi-label annotation in En and Ar. In English, the joking and annoyed emotions took large portions, which is consistent with reality, since Covid-19 causes deaths, high unemployment rates and other problems. However, we also see that optimistic is the third largest category, indicating that people feel confident about combating the virus and about the future. In Arabic, official is significantly higher than the others because, since the outbreak of Covid-19, most Arabic governments have announced many decisions regarding different situations on Twitter. English tweet examples of each category are shown in Table 4, where some tweets have more than one label, some even three. Based on the category statistics, we find that more than 70% of English tweets were assigned more than one label, while about 20% of Arabic tweets were assigned more than one label. We also present the relations between these labels in Fig. 2. While the label co-occurrence in Arabic appears as three blocks (positive, negative and neutral), the label co-occurrence in English is more complicated. These observations imply that multi-label classification on our English dataset is more challenging than on the Arabic one. We also illustrate this fact in the experiments. We pre-processed the raw data to ensure the analysis quality. In detail, we first remove the @users and URLs from the tweet because they do not contribute to the tweet analysis.
Then, we remove emojis and emoticons, even though they can express emotions well, since we focus on the analysis of textual data. Next, we filter out noisy symbols and texts that cannot convey meaningful semantic or lexical information and may even hinder the model from learning, such as the retweet symbol "RT" and special symbols including line breaks, tabs and redundant blank characters. Unlike previous methods, which also removed hashtags from tweets, we kept the hashtags since they have meaningful semantics, as in "Proud to be one of the few people who hasn't texted their ex #Covid-19 #Quarantine #lockdown". Apart from that, we also conducted word tokenization, stemming and tagging with the NLTK tool (https://www.nltk.org/) for English, Spanish, French and Italian, and with Pyarabic for Arabic (https://github.com/linuxscout/pyarabic). We used Jieba for Chinese Weibo segmentation (https://github.com/fxsjy/jieba).

We built our multi-label sentiment classifiers on deep neural network language models due to their success on diverse NLP tasks. An integration framework called simpletransformer (https://simpletransformers.ai/) supports the fine-tuning of these pre-trained models and the training of a customized classifier. We used XLNet [10] for English, AraBert [11] for Arabic, and ERNIE [12] for Chinese (selected because ERNIE performed better than Bert [13] and LSTM). We first evaluated the performance of the English and Arabic sentiment classifiers on the 10K annotated tweets by 5-fold cross validation, and of the Chinese classifier on the 21K labeled Weibo posts. Then all 10K labeled tweets were used to train the final sentiment classifiers, except the Chinese sentiment classifier, which was trained on the 21K Chinese Weibo posts. The trained models were then used to predict the sentiments of millions of Covid-19 tweets (Mar 1 - May 15, 2020 for non-Chinese data and January 20 - May 15, 2020 for Chinese Weibo messages) for our analysis.
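The tweet cleaning steps described above (dropping @users, URLs, the "RT" marker, emojis and redundant whitespace while keeping hashtags) can be sketched with regular expressions. This is an illustrative approximation under stated assumptions rather than the pipeline actually used: the function name `clean_tweet` is hypothetical, the emoji pattern covers only some common Unicode blocks, and emoticon removal, tokenization and stemming are left out.

```python
import re

# Retweet prefix such as "RT @user:" at the start of a tweet.
RT_RE = re.compile(r"^\s*RT\s+@\w+:?\s*")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Rough emoji coverage (an assumption): pictographs plus misc. symbols/dingbats.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text):
    """Remove retweet markers, URLs, mentions and emojis; keep hashtags."""
    text = RT_RE.sub("", text)
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    # Collapse line breaks, tabs and repeated blanks into single spaces.
    return WHITESPACE_RE.sub(" ", text).strip()


if __name__ == "__main__":
    raw = "RT @user: Proud to be home \U0001F600 https://t.co/abc #Covid-19   #lockdown\n"
    print(clean_tweet(raw))
```

URLs are removed before mentions so that an "@" inside a link is not treated as a user name, and hashtags are deliberately left untouched because the text treats them as meaningful content.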
Considering that translation between English and Spanish, French and Italian is well developed, we translated the labeled English tweets into Spanish, French and Italian with Google Translate (https://translate.google.com/) to examine whether our classifiers can work well on these languages. We manually checked a subset of the translated tweets and were surprised by the high quality of the translation. We used Bert [13] for representation learning on Spanish, French and Italian tweets, and then followed the same steps for sentiment analysis. We ran the experiments on a workstation with one GeForce GTX 1080 Ti with 11178MB of memory. The batch size is 16, and the learning rate is 4e-5 with 20 epochs.

We used multi-label accuracy (Jaccard index), F1-macro, and F1-micro, as well as the weak accuracy, as the performance metrics. The accuracy with Jaccard index is defined as:

$$\text{Accuracy} = \frac{1}{D} \sum_{i=1}^{D} \frac{|Y_i \cap \hat{Y}_i|}{|Y_i \cup \hat{Y}_i|}$$

where $Y_i$ is the set of ground truth labels for the i-th testing sample, and $\hat{Y}_i$ is the set of predicted labels. The weak accuracy of multi-label classification is defined as:

$$\text{Weak accuracy} = \frac{1}{D \cdot m} \sum_{i=1}^{D} \sum_{j=1}^{m} \sigma(\hat{y}_{ij} == y_{ij})$$

where $\sigma(\hat{y}_{ij} == y_{ij})$ checks whether the predicted $\hat{y}_{ij}$ is the same as the ground truth $y_{ij}$, which is 1 if the i-th testing sample has the j-th label and 0 if it does not. The total number of correct predictions of $y_{ij}$ is divided by $D \cdot m$, where $D$ is the number of testing samples and $m$ is the number of labels. In addition, we used the label ranking average precision score (LRAP) and Hamming loss, which are specific to multi-label classification.

We present the 5-fold cross validation results of our multi-label classifiers on SenWave in Table 5. Our classifiers reach weak accuracy values above 80%, which supports the effectiveness of our models. The multi-label Jaccard accuracy on English and Arabic data is greater than or equal to 0.5. However, the accuracy on the translated Spanish, French and Italian tweets is not better than on the original data.
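To make the metric definitions concrete, here is a small pure-Python sketch of the Jaccard accuracy, the weak accuracy and the Hamming loss on binary label matrices. The function names and the convention of scoring a sample with no true and no predicted labels as 1.0 are assumptions of this sketch, not details given in the paper.

```python
def jaccard_accuracy(y_true, y_pred):
    """Mean over samples of |Y ∩ Ŷ| / |Y ∪ Ŷ| for 0/1 label rows."""
    total = 0.0
    for truth, pred in zip(y_true, y_pred):
        inter = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
        union = sum(1 for t, p in zip(truth, pred) if t == 1 or p == 1)
        # Assumption: an empty union (no labels on either side) counts as a match.
        total += inter / union if union else 1.0
    return total / len(y_true)


def weak_accuracy(y_true, y_pred):
    """Fraction of the D * m individual label decisions that are correct."""
    n_samples = len(y_true)
    n_labels = len(y_true[0])
    correct = sum(
        1
        for truth, pred in zip(y_true, y_pred)
        for t, p in zip(truth, pred)
        if t == p
    )
    return correct / (n_samples * n_labels)


def hamming_loss(y_true, y_pred):
    """Fraction of wrong label decisions, i.e. 1 - weak accuracy."""
    return 1.0 - weak_accuracy(y_true, y_pred)


if __name__ == "__main__":
    y_true = [[1, 0, 1], [0, 1, 0]]
    y_pred = [[1, 0, 0], [0, 1, 0]]
    print(jaccard_accuracy(y_true, y_pred))
    print(weak_accuracy(y_true, y_pred))
    print(hamming_loss(y_true, y_pred))
```

In the toy example one of six label decisions is wrong, so the weak accuracy is 5/6, while the Jaccard accuracy is lower (0.75) because it only credits overlap among labels that are actually present. This is why weak accuracy can exceed 80% while Jaccard accuracy stays near 0.5 when most labels are absent for most tweets.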
The reasons can be twofold: 1) the use of different pre-trained language models: XLNet used for English tweets and AraBert used for Arabic tweets generally perform better than Bert, used for Spanish, French and Italian, under the same conditions [10, 11]; 2) the difficulty of classifying multi-label English tweets due to the complex label combinations (shown in Fig. 2 (a)). We are working on the improvement of these models. It is worth noting that the F1 values are around 0.5 due to the class imbalance issue, which will be addressed in our future work. However, the high accuracy and LRAP and the low Hamming loss demonstrate that the trained classifiers are usable in practice. The multi-class classification accuracy on Chinese Weibo shown in Table 5 helped us select ERNIE (with accuracy 0.88) for the final analysis due to its better performance than Bert (with accuracy 0.83) and LSTM (with accuracy 0.78).

We present the sentiment variation of the 6 languages from March 1 to May 15, 2020 for non-Chinese data and from January 20 to May 15, 2020 for Chinese Weibo messages in Fig. 3. The statistics of these sentiment results are given in Table 6. The sentiment results of English tweets are shown in Fig. 3 (a). All the positive emotions, including optimistic, thankful and empathetic, showed a similar trend of first rising and then falling. This implies that people first felt positive due to the various decisions made for combating the virus starting from mid-March. However, these emotions declined in late April when a large number of people got infected. Among the negative emotions, anxious and joking declined with slopes of −0.0004 and −0.0007 respectively as time went on, while the others remained stable with slight changes. The anxious emotion may have been reduced by the increase of medical supplies and the fact that people came to know Covid-19 much better than before and got used to living with it.
However, the resulting high unemployment rate and the high number of deaths may be the reason that sad and annoyed have stayed high. The results of Arabic tweets shown in Fig. 3 (b) demonstrate significant variations in all categories of emotions. In particular, optimistic has been rising, while anxious, denial and joking have been falling. The sad emotion keeps rising due to the increasing number of new cases in several Arabic-speaking populations, such as Saudi Arabia, Qatar and the United Arab Emirates (UAE). The rise of optimistic and thankful and the fall of pessimistic and annoyed were also observed for Spanish tweets in Fig. 3 (c). A similar increase in thankful is observed in French tweets, as shown in Fig. 3 (d). However, the other emotions remained stable, except for the decline of joking and the sudden increase of denial towards the conspiracy theory of a lab source of the coronavirus. Italian tweets also showed weak increasing or decreasing trends in most of the emotions, as shown in Fig. 3 (e), except in thankful and empathetic. The Chinese Weibo sentiments show strong variations, but no obvious increasing or decreasing trend, as shown in Fig. 3 (f). The most significant decrease in fearful is observed at the very beginning, on Jan 20, 2020, the day human-to-human transmission was confirmed. The fearful state continued until January 22, when Wuhan was locked down, and until the arrival of Chinese New Year. The significant jump in sad on April 4 was due to the nationwide memorial for victims of Covid-19.

We selected some countries and areas to illustrate how the sentiments vary over days, including USA, Washington D.C., UK, Spain, Argentina and Saudi Arabia, in Fig. 4. The statistics of these sentiment results are given in Table 7. the pie chart at the right hand). On March 13, there was a positive spike due to the declaration of a National Day of Prayer amid the coronavirus pandemic.
On March 29, negative emotions surged as the number of deaths within one day reached 2000, with an especially significant increase in anxious because of the neglect of social distancing, including gatherings. On April 13, many people showed gratitude to the medical professionals and front-line workers (see the pie chart at the right hand), while during April 27 and 28, as well as on May 10, people expressed their dissatisfaction with the government due to the resistance. In Spain (Fig. 4(d)

In Argentina, shown in Fig. 4 (e), the proportion of negative emotions was very close to 0.5, and even much higher on some days. On March 8, discussion focused on the first death from coronavirus and on dengue, leading to an increase of anxious, sad and annoyed (see pie chart at the right hand). On March 21, feelings of stress, anxiety and panic went up because of the long quarantine, which resulted in an increase of anxious and sad. On April 29, more than 2,300 prisoners were released because of the coronavirus, which increased the feelings of pessimistic, anxious and annoyed.

We give the sentiment analysis of 7 topics: stock market, oil price, herd immunity, economic stimulus, drug/medicine/vaccine, employment/job, and working from home. The results are shown in Figure 5. The statistics of these sentiment results are given in Table 8. For the topic stock market, the markets collapsed on March 9, when the discussion reached its peak. On this day, anxious reaches a high value, greater than mean+2*std (above the black dashed line; the black line is the mean and the dotted line is mean-2*std). On March 12, the DJI (Dow Jones Index) had its worst day since 1987, plunging about 10% (triggering circuit breakers for the second time), and the discussion volume reached its second-largest value. The anxious state remained at a high level during these days.
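The mean+2*std criterion used to call a daily sentiment value unusually high can be sketched as follows. The helper name `find_spikes`, the use of the population standard deviation and the toy daily fractions are all assumptions made for illustration; the paper does not state which estimator was used.

```python
from statistics import mean, pstdev


def find_spikes(daily_fraction, k=2.0):
    """Return the days whose value exceeds mean + k * std of the series.

    daily_fraction: dict mapping a day label to the fraction of tweets
                    carrying a given sentiment on that day.
    k:              number of standard deviations above the mean.
    """
    values = list(daily_fraction.values())
    threshold = mean(values) + k * pstdev(values)
    return [day for day, value in daily_fraction.items() if value > threshold]


if __name__ == "__main__":
    # Hypothetical fractions of "anxious" tweets on ten consecutive days.
    anxious = {f"2020-03-{d:02d}": 0.1 for d in range(1, 11)}
    anxious["2020-03-09"] = 0.5
    print(find_spikes(anxious))
```

Because the threshold is computed from the same series it is applied to, a single extreme day inflates both the mean and the standard deviation, so only pronounced spikes are flagged.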
On the weekends of March 20-21 and March 28-29, the spikes of denial are higher than the blue dashed line (mean+2*std), reflecting the continued stock market collapse. The DJI (Dow Jones Industrial Average) is shown on top of the denial curve, and the markers indicate March 9, 20 and 28, when these negative sentiments are reflected in the drop of the DJI, as shown in Fig. 5(a). The topic oil price also showed its peak of discussion on March 9. The drop of the crude oil price resulted in significant anxious on March 9-12. However, this was not the worst. On April 21, the crude oil price reached an 18-year low, which is shown as the marked point on the WTI crude oil curve. Among the triggered discussion, we see that pessimistic was significant.

Figure 5: Sentiments variation on seven topics. We show the sentiment results for these topics when they were intensively discussed (around the peak of the volume curve in the background). Panel (g): Working from home (anxious slope: -0.0008).

March 15-16. The discussion continued with significant annoyed from March 22 to April 7, and caused another rise of denial on April 12-13. The topic economic stimulus peaked on March 26, when the US Senate passed a historic $2tn relief package, with another peak on April 15-16 when the checks were received. Surprisingly, during the discussion on March 23-26, positive was lower than on other days, and denial was significant on March 25. We found that many tweets under this topic say, for example, "this is not enough", "US economy is tanking", and "the pandemic is getting worse". Looking into joking, we see increases on March 24-30 and April 13-18. The topic drug/medicine/vaccine collected the largest amount of discussion among these 7 topics (reaching a daily volume of 20-40K). This topic has been hot since the global outbreak around March 10.
Two events in this topic caused

This paper contributed a Covid-19 sentiment analysis system with annotated datasets, called SenWave, which includes 20K labeled English and Arabic tweets and 21K labeled Chinese Weibo posts, as well as 106M+ Covid-19 tweets and Weibo messages collected since Mar 1, 2020 and January 20 respectively. We trained classifiers for 6 languages based on deep learning language models to monitor the global sentiments under the Covid-19 pandemic. We analyzed how the sentiments of all languages and of hot topics vary over days. On one hand, the emotions in these languages directly reflect the corresponding events on specific days, through changes in volume and significant increases of emotions. On the other hand, the sentiment trends of 7 topics are also analyzed by showing the corresponding emotions. This work helps to provide a rich resource for the community to study and combat COVID-19.

Germany tries to stop U.S. poaching German firm seeking coronavirus vaccine. The second event was on April 6-7, when anti-malaria drugs were hyped as an unproven coronavirus treatment. Overall, from March to May, we see two sections of more anxious and less optimistic, and two other sections of less anxious and optimistic. The topic employment/job covered hot words such as unemployment, income, rent, salary, mortgage, laid off, no job/work, etc., as shown in the table included in Fig. 5(f). We also see that optimistic keeps increasing with a slope of 0.0015, while anxious is decreasing with a slope of −0.0008.
Only on two days, optimistic dropped: April 18 when a UK police chief was criticised on a covid-19 mass gathering

The statistics of daily sentiment fraction in different categories under different topics, presented as mean±std, and the number of tweets from which the statistics were obtained

[1] Survey paper on sentiment analysis: Techniques and challenges
[2] Deep learning for sentiment analysis: A survey
[3] Sentiment analysis of twitter data: a survey of techniques
[4] Sentiment analysis algorithms and applications: A survey
[5] Semeval-2018 Task 1: Affect in tweets
[6] Measuring emotions in the covid-19 real world worry dataset
[7] A real-time covid-19 tweets analyzer
[8] Machine learning on big data from twitter to understand public reactions to covid-19
[9] Racism is a virus: Anti-asian hate and counterhate in social media during the covid-19 crisis
[10] Xlnet: Generalized autoregressive pretraining for language understanding
[11] Arabert: Transformer-based model for arabic language understanding
[12] Enhanced language representation with informative entities
[13] Pre-training of deep bidirectional transformers for language understanding

We present the hot words of the predicted English tweets for each category on March 9, 2020 in Fig. 6. More representations for different languages will be provided in the future. The class optimistic is represented by hand washing and health, meaning that people should wash their hands frequently to stay healthy. The class thankful is represented by Covid-19 testing, while the class empathetic is represented by pray, hope, god and safe. The class pessimistic is reflected by economy market, oil market and the large number of deaths. These hot words are also suitable for the class anxious. People felt sad about the many deaths and confirmed cases and the lockdown of schools. The class annoyed is represented by dont and flu, while the class denial is represented by market and China, since some people didn't believe China's Covid-19 reports.
Overall, these hot words in each category can represent the sentiments to some extent.