key: cord-0726794-2ibfqyyh authors: Okango, Elphas; Mwambi, Henry title: Dictionary Based Global Twitter Sentiment Analysis of Coronavirus (COVID-19) Effects and Response date: 2022-01-20 journal: Ann DOI: 10.1007/s40745-021-00358-5 sha: ff8d8473ba35563b28d2417dfc0fb90f52cd01dc doc_id: 726794 cord_uid: 2ibfqyyh

In December 2019, a new pandemic called the coronavirus began ravaging the world. By May 2020, the pandemic had caused great loss of life and disrupted ways of life in more ways than one. The nature of the disease saw several strategies rolled out to curb its spread. These strategies included closure of businesses and borders, restriction of movement, working from home and mask mandates, among others. With these measures and their effects, many individuals have taken to social media to express their frustrations, their opinions and how the pandemic is affecting them. This study employs a dictionary-based method for sentiment polarization of tweets related to the coronavirus posted on Twitter. We also examine the co-occurrence of words to gain insights into the aspects affecting the masses. The results showed that mental health issues and lack of supplies were some of the direct effects of the pandemic. It was also clear that the COVID-19 prevention guidelines were well understood by those who tweeted. The results from this study may help governments combat the consequences of COVID-19, such as mental health issues and lack of supplies (e.g. food), and also gauge the effectiveness or the reach of their guidelines.

In recent years, data science has emerged as a new and important discipline which can be viewed as an amalgamation of traditional disciplines like statistics, data mining and distributed systems [1]. Data-driven decision making has become ubiquitous in almost all aspects of society.
Elphas Okango (kangphas@gmail.com), School of Mathematics and Computer Science, University of KwaZulu-Natal, Pietermaritzburg, South Africa

With the Internet of Things, huge volumes of a wide variety of data are generated at high velocity. Real-time decision making is central to the Internet of Things [2]. In a world where a lot is bound to happen, what the masses think or feel about these happenings is a concern for governments, businesses and even individuals. Governments want to know how their policies and interventions are received or perceived by the masses, politicians want to know whether they have a favorable rating and how their policies are received and implemented, while businesses want to understand the reputation of their brands. Social media presents great promise for achieving this through the analysis of social media posts, product reviews, customer feedback etc. The advent of social media has made available a platform where individuals can freely express their opinions, feelings or judgments. Careful data mining techniques can help unravel valuable information and draw insights which may be hidden in these expressions.

On 31st December 2019, a cluster of pneumonia cases of unknown etiology was reported in Wuhan, Hubei Province, China [3, 4]. About a week later, on 9th January 2020, the Chinese Center for Disease Control and Prevention (CDC) reported a novel coronavirus as the causative agent of this outbreak, coronavirus disease 2019 (COVID-19). COVID-19 is spread from person to person through respiratory droplets when an infected person sneezes, coughs or talks [5]. One can also contract COVID-19 by touching a surface or object that has the virus on it and then touching one's nose, mouth or eyes. As of 29th April 2020, there were 2,995,758 confirmed cases and 204,987 deaths in 213 countries, areas or territories [6].
The nature of the disease (highly transmissible even when an infected individual is still asymptomatic) has seen many governments put in place a raft of measures in a bid to curb the spread of the disease. These measures include total and partial lockdowns that have closed businesses, curfews, advocacy for staying at home, social distancing, wearing a cloth face covering over the nose and mouth in public places, regular washing of hands for at least 20 s or use of alcohol-based hand sanitizers containing at least 60% alcohol, and quarantine for infected individuals [5]. The measures put in place as a result of the COVID-19 pandemic have affected the way people do things, and it is of interest to know and understand the feelings, opinions or judgments of the masses on various issues.

Several studies have used varied approaches and datasets to try to explain the COVID-19 dynamics. Kumar [7] employed cluster analysis to monitor COVID-19 infections in India. The approach identified areas or clusters that needed more medical facilities (ventilators, testing kits, masks etc.) and those that needed optimization of monitoring techniques (screening, lockdowns, closedowns, curfews etc.). Khakharia et al. [8] used machine learning techniques to predict the outbreak of COVID-19 for 10 densely populated countries. In particular, they compared the performance of nine machine learning models in predicting the outbreak. The highest prediction accuracy was achieved for Ethiopia using the autoregressive moving average model. Social contact based analysis has also been employed to study the underlying disease transmission patterns. Liu et al. [9], using this approach, showed that the age groups involving relatively intensive contacts in households and in public places or communities were dispersedly distributed, which explains why the transmission of COVID-19 in the early stage mainly took place in public places and families in Wuhan.
Other data mining techniques that can be employed to study the dynamics of COVID-19 are available in [10, 11].

Sentiment analysis, or opinion mining, can be defined as the process of identifying and extracting the subjective information that underlies a text [12]. This information can be an opinion, a judgment, or a feeling about a particular topic or subject matter. Sentiment analysis is becoming a field of interest that cannot be ignored. Nguyen et al. [13] employed sentiment analysis on social media to predict stock movement. Their method of incorporating social media data achieved 2.07% better performance than the model using historical prices only. Vincenza et al. demonstrated that Twitter data and sentiment analysis can be used to study disease dynamics [14].

Twitter is a microblogging and social networking service on which users post and interact with messages known as tweets [15]. Twitter's 321 million active users provide a rich source of data through the tweets they post. In this study we seek to mine opinions and sentiments on the COVID-19 pandemic from Twitter users. The study also seeks to provide a framework for real-time social media data analysis for actionable intelligence.

The data used in this study were tweets relating to the COVID-19 pandemic, streamed live from Twitter using the streamR package [16] in two periods: from 16:43:09 on 14th April 2020 to 23:50:53 on 15th April 2020, and from 18:24:25 on 17th April 2020 to 16:41:16 the following day. All times are East Africa Time. Only tweets containing words such as corona, covid-19, sanitizer, virus, lockdown, quarantine and social distance were of interest and thus streamed. The streaming was broken into intervals of two to three hours, with a break of about 2 s between intervals, in order to obtain smaller files of streamed tweets. The tweet files were then parsed and compiled into a single Excel file.
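The keyword criterion described above can be illustrated with a short sketch. The study applied its filter at streaming time through streamR; the Python code below only mirrors the criterion offline, and the matching rule (case-insensitive substring match) as well as the names `KEYWORDS`, `is_relevant` and `filter_tweets` are assumptions made for illustration.

```python
import re

# Keywords named in the text; the exact list used for streaming may differ.
KEYWORDS = [
    "corona", "covid-19", "sanitizer", "virus",
    "lockdown", "quarantine", "social distance",
]

# One case-insensitive pattern that matches any keyword as a substring
# (an assumed matching rule, chosen for simplicity).
PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)


def is_relevant(tweet_text):
    """Return True if the tweet text mentions at least one keyword."""
    return PATTERN.search(tweet_text) is not None


def filter_tweets(tweets):
    """Keep only the tweets that mention at least one keyword."""
    return [t for t in tweets if is_relevant(t)]


if __name__ == "__main__":
    sample = [
        "Stay home during the LOCKDOWN",
        "Nice weather today",
        "Ran out of hand sanitizer again",
    ]
    print(filter_tweets(sample))
```

Substring matching is deliberately permissive: a tweet containing "coronavirus" is kept because it contains "corona".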
We obtained more than 20 million tweets, from which 91,784 geo-tagged tweets from all over the world were derived. Figure 1 was obtained using data from Johns Hopkins University and functions from the tidycovid19 package [17]. The United States of America, some parts of Europe, Asia and Russia had the highest number of active cases per 100,000 inhabitants. [...] Western and Southern Africa. Countries with high numbers of COVID-19 cases posted more tweets. Figure 3 displays tweet locations by language. English, denoted by red dots, was the most dominant language in our data set, with 63,056 English tweets.

There exist several methods of sentiment analysis. Sentiment analysis can be done at three levels, namely document level, sentence level and aspect level [18]. Document-level sentiment analysis considers the whole document as a basic information unit (talking about one topic) and classifies it as expressing a negative, positive or neutral sentiment. Sentence-level sentiment analysis classifies the sentiment expressed in each sentence [18]. Sentiment classification techniques can be divided into the machine learning approach, the lexicon-based approach and a hybrid approach that combines the two [19]. The machine learning approach relies on machine learning algorithms such as naïve Bayes, support vector machines and neural networks, together with linguistic features. The lexicon-based approach uses a sentiment lexicon, a collection of known and precompiled sentiment terms. It can be divided into the dictionary-based approach and the corpus-based approach, the latter employing statistical or semantic methods to find sentiment polarity [18]. In communication, one listens to an entire sentence and derives meaning that is greater than the sum of the individual words.
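The gap between word-level lookup and sentence-level meaning can be illustrated with a small sketch. The Python code below is not the sentimentr implementation used in the study; the toy lexicon, the word lists and the function names (`naive_score`, `context_score`) are assumptions for illustration only. It compares a plain dictionary lookup with a context-aware score that inspects a window of four words before and two words after each polarized word, flips the sign for an odd number of negators, scales the word by amplifiers and de-amplifiers with weight 0.8, and divides the sum by the square root of the word count. Adversative conjunctions and pause punctuation are left out for brevity.

```python
import math
import re

# Toy lexicon and word lists: illustrative assumptions, not the study's dictionary.
POLARITY = {"like": 1.0, "good": 1.0, "safe": 1.0,
            "bad": -1.0, "fear": -1.0, "panic": -1.0}
NEGATORS = {"not", "no", "never"}
AMPLIFIERS = {"really", "very"}
DEAMPLIFIERS = {"hardly", "barely"}


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def naive_score(text):
    """Plain dictionary lookup: sum the polarity of each word, ignoring context."""
    return sum(POLARITY.get(w, 0.0) for w in tokenize(text))


def context_score(text, n_before=4, n_after=2, z=0.8):
    """Simplified context-aware polarity score for one sentence."""
    words = tokenize(text)
    if not words:
        return 0.0
    total = 0.0
    for j, word in enumerate(words):
        if word not in POLARITY:
            continue
        # Context cluster: words around the polarized word.
        window = words[max(0, j - n_before):j] + words[j + 1:j + 1 + n_after]
        n_neg = sum(w in NEGATORS for w in window)
        n_amp = sum(w in AMPLIFIERS for w in window)
        n_deamp = sum(w in DEAMPLIFIERS for w in window)
        odd_negation = n_neg % 2 == 1
        if odd_negation:
            # With an odd number of negators, amplifiers act as de-amplifiers.
            n_deamp += n_amp
            n_amp = 0
        # De-amplification is capped so the weight never drops below zero.
        weight = 1.0 + z * n_amp - min(z * n_deamp, 1.0)
        sign = -1.0 if odd_negation else 1.0
        total += sign * weight * POLARITY[word]
    return total / math.sqrt(len(words))


if __name__ == "__main__":
    for sentence in ["I do not like it", "I really like it", "I hardly like it"]:
        print(sentence, naive_score(sentence), context_score(sentence))
```

In this sketch the plain lookup gives all three sentences the same positive score, whereas the context-aware score is negative for the negated sentence, larger for the amplified one and smaller for the de-amplified one.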
Calculating polarity or sentiment by matching words against a dictionary of words classified as positive, negative or neutral leaves out useful information. In many cases valence shifters (negators, amplifiers or intensifiers, de-amplifiers or downtoners, and adversative conjunctions) are not taken into account. A negator flips the sign of a polarized word (e.g., "I do not like it."). An amplifier (intensifier) increases the impact of a polarized word (e.g., "I really like it."). A de-amplifier (downtoner) reduces the impact of a polarized word (e.g., "I hardly like it."). An adversative conjunction overrules the previous clause containing a polarized word (e.g., "I like it but it's not worth it.") [20]. Valence shifters affect polarized words, and if they occur frequently, a simple dictionary lookup may not model the sentiment appropriately. The sentiment of the entire sentence may be reversed or overruled in the case of negators and adversative conjunctions [20].

Following Rinker's methodology [20], each tweet $s_i$ is treated as a sentence composed of words $w_{i1}, w_{i2}, \dots, w_{in_i}$, where $w_{ij}$ denotes the $j$th word in the $i$th tweet and $n_i$ is the number of words in the tweet. Each tweet is broken down into an ordered bag of words. Punctuation is removed, with the exception of pause punctuation (commas, colons and semicolons), which is treated as a word within the sentence. The words in each tweet are searched for in a dictionary of polarized words; positive words $w_{ij}^{+}$ are assigned $+1$ and negative words $w_{ij}^{-}$ are assigned $-1$, or other positive and negative weights depending on the sentiment dictionary used. Denote polarized words by $pw$. Each polarized word forms a polar cluster $c_{il}$, which is a subset of the tweet, i.e. $c_{il} \subseteq s_i$. The polarized context cluster is pulled from around the polarized word $pw$ and defaults to four words before and two words after $pw$, which are considered as valence shifters. The cluster is represented as

$$c_{il} = \{pw_{i,j-nb}, \dots, pw_{i,j}, \dots, pw_{i,j+na}\},$$

where $nb$ and $na$ are the parameters n-before and n-after set by the user. The words in $c_{il}$ are labeled neutral $w_{ij}^{0}$, negator $w_{ij}^{n}$, amplifier (intensifier) $w_{ij}^{a}$ or de-amplifier (downtoner) $w_{ij}^{d}$. Neutral words only contribute to the word count in the equation. Each polarized word is then weighted according to the type and number of valence shifters surrounding it. Pause locations, denoted by $cw$ (punctuation that denotes a pause, including commas, colons and semicolons), are indexed and used in calculating the upper and lower bounds of the polarized context cluster.

The polarized word in the cluster is acted upon by the valence shifters. Amplifiers scale the polarity by 1.8, i.e. $1 + z$ with default weight $z = 0.8$, and they become de-amplifiers if the context cluster contains an odd number of negators (two negatives equal a positive and three negatives equal a negative). De-amplifiers decrease polarity. Adversative conjunctions (AC), e.g. but, however and although, also weight the cluster. An AC before a polarized word up-weights the cluster by $1 + z_2\, n^{B}_{AC}$, where $z_2$ is the AC weight (default 0.85) and $n^{B}_{AC}$ is the number of ACs before the polarized word. An AC after the polarized word down-weights the cluster by $1 + \{n^{A}_{AC} \times (-1)\}\, z_2$, where $n^{A}_{AC}$ is the number of ACs after the polarized word. The weight $z$ may be provided by the user, with the default being 0.8. Lastly, the weighted context clusters $c_{il}$ are summed to give $c'_i = \sum_l c_{il}$ and divided by the square root of the word count $n_i$, yielding the polarity score $\delta_i$ for each tweet:

$$\delta_i = \frac{c'_i}{\sqrt{n_i}}.$$

For the co-occurrence of words the study used the udpipe package [21].

The study considered only the 63,056 tweets that were written in English. Figure 4 shows the spatial location of all positive tweets. The USA, Europe, Western and Southern Africa, and India had high numbers of positive tweets. Regions with high numbers of positive tweets also posted high numbers of negative tweets. This could indicate that individuals had opposing views on different issues (Fig.
5, 6). Figure 6 shows the location of positive, negative and neutral tweets. The spatial distribution of negative, positive and neutral tweets alone is not very informative. A look at word co-occurrence may supply more insight into why tweets were negative, positive or neutral.

Figure 7 shows which words co-occurred with negative words. The thicker the path, the more frequent the co-occurrence. The blue dots depict the negative words, while the red ones depict the words they occurred with. The strongest co-occurrence was mental and health. With many individuals' routine lives altered, there is a risk of mental health problems. Qiu et al. [22], in their nationwide survey of psychological distress among Chinese people in the COVID-19 epidemic, capture the aftermath of the COVID-19 pandemic: "The implementation of unprecedented strict quarantine measures in China has kept a large number of people in isolation and affected many aspects of people's lives. It has also triggered a wide variety of psychological problems, such as panic disorder, anxiety and depression". These aspects are supported by Fig. 7. Another strong co-occurrence was small and business; small businesses have been adversely affected by the lockdown. In their survey on the effects of COVID-19 on small businesses, Bartik et al. [23] report that 43 percent of businesses are temporarily closed, and businesses have, on average, reduced their employee counts by 40 percent relative to January 2020.

In Fig. 8, the strongest co-occurrence was face and mask, followed by hand and sanitizer. These are some of the WHO-recommended steps to curb the spread of COVID-19 [24], which indicates an effective campaign to help curb the spread. Other concerns were food supplies, insecurity, school etc. From Fig. 9, most tweets conveyed negative sentiment, understandably so because of the pandemic.

This paper has demonstrated the wealth of information that is contained in sentiments expressed on social media, in this case Twitter.
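The co-occurrence analysis behind Figs. 7 and 8 can be approximated with a simple counting sketch. The study used the udpipe package; the Python code below is not that implementation. It assumes that two words co-occur when both appear in the same tweet, and the function name `cooccurrence_counts` and its `targets` parameter are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence_counts(tweets, targets=None):
    """Count unordered pairs of distinct words that appear in the same tweet.

    If `targets` is given, only pairs containing at least one target word
    (for example, a set of negative words) are counted.
    """
    counts = Counter()
    for text in tweets:
        # Unique words per tweet, sorted so each pair has one canonical order.
        words = sorted(set(re.findall(r"[a-z]+", text.lower())))
        for first, second in combinations(words, 2):
            if targets is None or first in targets or second in targets:
                counts[(first, second)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "mental health crisis",
        "mental health support",
        "face mask",
    ]
    for pair, n in cooccurrence_counts(sample).most_common(3):
        print(pair, n)
```

The pair counts play the role of the path thickness in the figures: the more tweets contain both words, the stronger the link between them.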
The direct effect of a great pandemic like the coronavirus is death, which can easily be measured. The indirect effects, which range from loss of jobs [25] and mental health issues [22] to the closing down of countries, need other methods of quantification. Sentiment analysis is particularly useful in gauging the uptake of directives, emerging issues relating to the topic of interest, and fake news or misinformation that may lead to fear and panic. Results from sentiment analysis may help the government or relevant authorities relax, tighten or change their approach altogether. Mental health was among the key concerns of individuals; fear and panic were also evident (Fig. 7 and Fig. 9). Studies [26, 27] have indicated that domestic violence is on the rise during this period of the coronavirus pandemic, a clear indication of the mental anguish faced by the masses. Face and mask, and hand and sanitizer, also had high co-occurrence counts, indicating that the sensitization efforts were working.

As with all studies, this one has some shortfalls and limitations. One shortfall is that global Twitter data come in various languages, and methodologies to handle multilingual sentiment analysis are still in development; our study therefore focused on tweets written in English. The 2019 global multidimensional poverty index report indicates that 1.3 billion people, or 23.1%, are multidimensionally poor (in terms of health, education and standards of living) [28]. This makes Twitter a poor platform for obtaining insights from this group of people, which is another downside of this study. Analysis of Twitter data over a long period of time is computationally expensive, as a whole day's tweets may amount to a few hundred gigabytes. The study results relied on data from a two-day live stream; further work can be dedicated to live streaming for a longer period or to using historical data covering a longer period. These limitations, however, do not invalidate the results.
The results from this study may help governments combat the consequences of COVID-19, such as mental health issues and lack of supplies (e.g. food), and also gauge the effectiveness or the reach of their guidelines. Li et al. [29] motivate the need for a multifaceted approach in combating the COVID-19 pandemic. They stress that more global collaboration is needed to combat the pandemic effectively, and they outline five pillars for achieving this: cross-cultural collaboration and communication, strengthening of data and information sharing systems, adoption of early experiences learned in other countries, evaluation and strengthening of public health systems, and promotion of virtual communities to help improve mental health and well-being.

Process mining: data science in action
Internet of things, real-time decision making, and artificial intelligence
Disease background of COVID-19. European Centre for Disease Prevention and Control
WHO | Pneumonia of unknown cause - China
Coronavirus disease 2019
Monitoring novel corona virus (COVID-19) infections in India by cluster analysis
Outbreak prediction of COVID-19 for dense and populated countries using machine learning
What are the underlying transmission patterns of COVID-19 outbreak? An age-specific social contact characterization
Introduction to business data mining
Optimization based data mining: theory and applications
What is Sentiment Analysis? MonkeyLearn Blog
Sentiment analysis on social media for stock movement prediction
Using twitter data and sentiment analysis to study diseases dynamics
Package 'streamR'
joachim-gassen/tidycovid19: {tidycovid19}: An R Package to Download, Tidy and Visualize Covid-19 Related Data
Sentiment analysis algorithms and applications: a survey
Automatic detection of political opinions in tweets
Package 'sentimentr'
udpipe: Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the 'UDPipe
A nationwide survey of psychological distress among Chinese people in the COVID-19 epidemic: implications and policy recommendations
How are small businesses adjusting to COVID-19? Early evidence from a survey
Advice for public
Inequality in the impact of the coronavirus shock: new survey evidence for the UK
A new Covid-19 crisis: domestic abuse rises worldwide
The pandemic paradox: the consequences of COVID-19 on domestic violence
Global Multidimensional Poverty Index (MPI) | Human Development Reports
Culture versus policy: more global collaboration to effectively combat COVID-19

Publisher's Note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The authors also thank the University of KwaZulu-Natal for its continued support for research and publication.

Author Contributions: Elphas Okango: Conceptualization, Methodology, Writing-Original draft preparation, Software. Henry Mwambi: Conceptualization, Editing and Reviewing.

The data set used for this study is available upon request.