key: cord-0059873-ggh4ai3v
authors: Shah, Christina Sanchita; Sebastian, M. P.
title: Sentiment Analysis and Topic Modelling of Indian Government’s Twitter Handle #IndiaFightsCorona
date: 2020-11-10
journal: Re-imagining Diffusion and Adoption of Information Technology and Systems: A Continuing Conversation
DOI: 10.1007/978-3-030-64861-9_30
sha: 313885cef651700b62fdd6ef4d6eaf154747828b
doc_id: 59873
cord_uid: ggh4ai3v

The purpose of this study was to conduct opinion mining on Twitter data containing “#IndiaFightsCorona” to analyse public opinion. This was accomplished using sentiment analysis and topic modelling. First, sentiment analysis was done and positive and negative sentiments were separated. Then, on each sentiment, topic modelling was done to discover hidden topics. Two approaches were used, namely Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA), and their results were compared. It was found that there were more positive sentiments than negative. For positive sentiments, LDA performed better and for negative sentiments, LSA performed better. While some topics were common between LSA and LDA for positive sentiments, there was very little overlap for negative sentiments.

India reported its first case of novel coronavirus on 30th January 2020. On 11th February, the World Health Organization (WHO) announced that this coronavirus would be called COVID-19. A month later, on 11th March, WHO declared the COVID-19 outbreak a pandemic. As of 1st April, there were almost 922,000 confirmed cases of COVID-19 worldwide, of which approximately 656,000 were active, with 193,000 recoveries and 46,000 deaths. In India, there were 1,238 confirmed cases with 32 deaths, according to the Ministry of Health and Family Welfare website. On 31st March 2020, the Ministry of Information and Broadcasting set up a dedicated Twitter handle called @CovidnewsbyMIB to share news and updates about novel coronavirus COVID-19.
The account is named #IndiaFightsCorona with the handle @CovidnewsbyMIB. This handle provides information on the latest number of coronavirus cases in India and on various relief and economic measures. While the handle @CovidnewsbyMIB was set up at the end of March, Twitter data indicates that the hashtag #IndiaFightsCorona was trending even before that.

With the rise of the COVID-19 pandemic, many countries placed restrictions on travel and movement and imposed "lockdowns". The world saw a rise in social distancing initiatives, travel bans, self-quarantines and business closures. As people could no longer access public spaces freely, a significant proportion of the dialogue on the COVID-19 pandemic shifted to online forums and social networking sites such as Twitter [1].

In light of the COVID-19 pandemic, the purpose of this study is to analyse public opinion surrounding the hashtag #IndiaFightsCorona. Opinion mining, broadly speaking, uses text analytics to understand public sentiment. Twitter is a popular microblogging platform that people use to express themselves in real time, and it can be used for opinion mining [2] [3] [4]. While there are multiple hashtags for novel coronavirus trending on Twitter, #IndiaFightsCorona was selected for its credibility, as it is used by the Ministry of Information and Broadcasting, India.

In this research, Twitter data containing #IndiaFightsCorona was extracted, and sentiment analysis using a Support Vector Machine (SVM) classifier was done to classify the data into positive and negative sentiments. The performance of the SVM classifier was measured using a confusion matrix. Once the sentiments had been segregated, topic modelling was conducted on each sentiment. Topic modelling discovers hidden topics or themes in each sentiment. Two approaches were used for conducting topic modelling, namely Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA).
Results from each method were then compared and evaluated to decide which method gave the most coherent topics within the positive and the negative sentiment data. The preliminary findings of this paper are: (1) there were more positive sentiments than negative sentiments; (2) for positive sentiments, LDA performed better, and for negative sentiments, LSA performed better; (3) for negative sentiments, there was minimal overlap between the topics generated by LSA and LDA, and all other topics were distinct.

Research scholars have often used Twitter as a way of understanding trends visible in online social networks [2] [3] [4] [5]. More specifically, Twitter offers researchers the opportunity to analyse the role of social media during a public health crisis, such as the latest COVID-19 pandemic [6] [7] [8] [9]. This helps researchers investigate the social dimensions of the pandemic. While previous epidemics have demonstrated the relevance of researching social media information, there is a special significance to studying the role of social media during the current COVID-19 pandemic. In today's information age, social media is expected to play a much larger role than in previous health crises. For instance, during the Ebola outbreak in February 2014, Twitter had approximately 255 million active users. By 2019, it had 330 million active users [10]. Thus, a lot of people communicate online and get their news through social media sites like Twitter [11, 12]. Further, compared to past epidemics, there is a much higher risk of incorrect information circulating [10]. Studies have shown that the amount of misinformation available on Twitter regarding medical content is as high as 24% and is being circulated at an alarming rate [10, 13, 14]. Another study developed a multilingual COVID-19 Twitter data set to track misinformation and unverified rumours [1].
Thus, we see that social media sites can be used to spread all kinds of information. On the other hand, a positive aspect of the increasing presence of public opinions on social media sites is that policy makers can mine this data to understand popular discourse and develop measures to tackle the pandemic [15].

Opinion mining, also known as sentiment analysis, is the analysis of people's opinions, emotions, and sentiments from written language. With the rapid growth of social media platforms such as Twitter, Facebook and Instagram, the relevance of sentiment analysis and opinion mining has increased. Opinions influence our behaviours and are therefore essential to almost all human actions. The way we perceive and interpret the world is essentially influenced by our beliefs and interpretations of reality. This is why we always look for the inputs or opinions of others when we have to make a decision. This applies not only to individuals but to businesses as well [16]. Microblogs like Twitter are platforms where users can post their opinions, reactions, feelings, and thoughts in real time. Some early analyses of Twitter data using sentiment analysis include Bermingham and Smeaton [6], Pak and Paroubek [4], Barbosa et al. [5], Bifet and Frank [6], Davidov et al. [7] and Agarwal et al. [17]. In this study, we use a support vector machine (SVM) classifier for sentiment classification. For the purpose of this study, we have classified sentiments only as being either positive or negative.

A topic model can be considered a tool for handling large amounts of data in order to find hidden concepts, prominent features or latent variables [9]. It is an unsupervised natural language processing technique that extracts latent topics from a corpus of documents. There are many methods to implement topic modelling. In our study, we use two approaches: (1) Latent Semantic Analysis (LSA) and (2) Latent Dirichlet Allocation (LDA).
Latent Semantic Analysis (LSA). Latent Semantic Analysis is an algebraic process based on singular value decomposition (SVD), in which a bag-of-words (BoW) model is used to create a document-term matrix [9]. LSA is one of the simplest topic models and is easy to understand and implement. More often than not, it gives better results than vector space models. It is also faster.

Latent Dirichlet Allocation (LDA). Latent Dirichlet Allocation learns how words, topics, and documents relate to each other by assuming that documents are generated using a particular probability model [19]. The purpose of LDA is to discover hidden topics based on data [20]. A topic in LDA is defined as a "probability distribution over words". It is also a bag-of-words model [13].

Topic Coherence. Topic coherence is a method of evaluating topic models. It measures the degree of semantic similarity between the high-scoring words of a topic. These measurements help distinguish between topics that are human interpretable and those that are artefacts of statistical inference. Greater coherence scores indicate greater interpretability by humans [21].

This study is divided into several stages. The first stage consists of data collection, followed by the pre-processing of data. Then comes the sentiment analysis stage and the segregation of data into positive and negative sentiments. Topic modelling is then done on each sentiment using LSA and LDA. Finally, both approaches are evaluated using topic coherence scores. Each model is implemented in Python version 3.7.4.

Twitter data was collected using NodeXL, which is a network analysis software package for Microsoft Excel. Since the aim of the study was opinion mining of novel coronavirus COVID-19 in India, the Twitter database was searched for all tweets containing the hashtag "IndiaFightsCorona" or "indiafightscorona". In total, 33378 tweets in all languages were collected, of which 16665 tweets were in English.
For the purpose of this study, we have considered tweets written only in English. The tweets were dated between 16th March and 1st April 2020.

The purpose of pre-processing the data is to clean it and make it ready for further analysis. At the end of this stage, the data is more structured. Pre-processing was done twice, once before sentiment analysis and once before topic modelling, and the steps differed between the two. Combined, the steps include [18, 21, 22]:
1) Data cleaning. In this step, data is converted to lowercase, and all punctuation and stop words such as "a", "and", "to", "the" and so on are removed.
2) Tokenization. A tokenizer splits the document at the word level, and each word is labelled as a token. For our study, we used the word2vec algorithm.
3) Stemming and Lemmatization. Stemming and lemmatization are text normalization techniques in the field of Natural Language Processing that are used to prepare text, words, and documents for further processing. Stemming is the process of reducing inflected words to their root forms.

To conduct sentiment analysis, an SVM classifier was trained and tested. It was then used to predict the sentiment of words using their vector representation. The sentiment score of each word was then calculated. There are four main steps in training and using the sentiment classifier:
(1) Load a pre-trained word embedding. Word embeddings map words in a vocabulary to numeric vectors. These embeddings capture semantic details of the words so that similar words have similar vectors.
(2) Load an opinion lexicon listing positive and negative words.
(3) Train the SVM sentiment classifier to classify words into positive and negative categories.
(4) The sentiment score of each word in the text is calculated, and the mean score is taken.
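The four classifier steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the paper does not name the embedding, the lexicon or the SVM implementation it used, so the two-dimensional vectors in `EMBEDDING`, the four-word `LEXICON` and the hand-written subgradient trainer `train_linear_svm` are all hypothetical stand-ins chosen to keep the example self-contained.

```python
# Step 1 (assumed stand-in): toy 2-D "embeddings" invented for illustration.
# A real run would load a pre-trained embedding with hundreds of dimensions.
EMBEDDING = {
    "good": (1.0, 0.2), "great": (0.9, 0.1), "support": (0.8, 0.3),
    "bad": (-1.0, 0.1), "poor": (-0.9, 0.2), "shortage": (-0.8, -0.1),
}

# Step 2 (assumed stand-in): a tiny opinion lexicon, word -> +1 or -1.
LEXICON = {"good": 1, "great": 1, "bad": -1, "poor": -1}


def train_linear_svm(samples, epochs=200, lr=0.1, reg=0.01):
    """Fit a linear SVM (w, b) by subgradient descent on the
    L2-regularised hinge loss. samples is a list of (vector, label)."""
    dim = len(samples[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in samples:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The regulariser shrinks w on every step.
            w = [wi * (1.0 - lr * reg) for wi in w]
            if margin < 1.0:  # misclassified or inside the margin
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def word_score(word, w, b):
    """SVM decision value for one word, or None if it has no vector."""
    x = EMBEDDING.get(word)
    if x is None:
        return None
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def tweet_score(tokens, w, b):
    """Step 4: mean of the word scores; words without a vector are skipped."""
    scores = [s for s in (word_score(t, w, b) for t in tokens) if s is not None]
    return sum(scores) / len(scores) if scores else 0.0


# Step 3: train on the lexicon words, using their vectors as features.
training = [(EMBEDDING[word], label) for word, label in LEXICON.items()]
W, B = train_linear_svm(training)

positive = tweet_score(["great", "support", "unknownword"], W, B)
negative = tweet_score(["poor", "shortage"], W, B)
print(positive > 0, negative < 0)
```

Note that "support" and "shortage" are not in the toy lexicon; they receive a score only because their vectors lie close to those of the labelled words, which is the reason for classifying word vectors rather than looking words up directly.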
Once the corpus of tweets was divided into positive and negative sentiments, topic modelling was run on each sentiment to discover the topics and themes associated with it. Two approaches to topic modelling were used: Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA). The results of both approaches were then compared for each sentiment using topic coherence scores to understand which approach gave better topics within each sentiment.

This study uses tweets containing "#IndiaFightsCorona" from 16th March to 1st April 2020. Only English tweets were considered, amounting to 16665 tweets. We first conducted a sentiment analysis on the corpus of tweets, segregating it into positive and negative tweets. Then topic modelling using the LSA and LDA approaches was done on each sentiment sub-corpus. Finally, coherence scores of LSA and LDA were computed and compared to see which approach was better.

For the purpose of sentiment analysis, a support vector machine (SVM) classifier was trained, which classifies word vectors into positive and negative categories. 10% of the corpus was set aside for testing while the rest was used for training. A confusion matrix is a table that is often used to describe the performance of a classification model. Figure 1 shows the classification accuracy in a confusion matrix for the SVM sentiment classifier. As can be seen, the classifier performs well. Based on this confusion matrix, Precision was 93.26%, Recall was 91.37% and the F1 score was 92.30%. Next, the sentiment score of each tweet was calculated from the predicted sentiment scores of the words in its text. Scores greater than zero were considered positive sentiments and scores less than zero were considered negative sentiments. There were 13701 positive tweets and 2964 negative tweets.
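The precision, recall and F1 figures above follow from a confusion matrix through the standard definitions, sketched below. The cell counts of Figure 1 are not given in the text, so the counts passed to `precision_recall_f1` are hypothetical; only the last lines use numbers that the paper actually states. Recomputing F1 from the rounded precision and recall gives about 92.31%, which agrees with the reported 92.30% to within rounding.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(tp, fp, fn):
    """Metrics for the positive class from confusion-matrix counts:
    tp = true positives, fp = false positives, fn = false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, f1_from(precision, recall)


# Hypothetical counts for illustration only, NOT the values behind Figure 1.
p, r, f = precision_recall_f1(tp=90, fp=10, fn=10)

# Consistency checks that use only figures stated in the text.
reported_f1 = f1_from(0.9326, 0.9137)
positive_share = 13701 / (13701 + 2964)
print(round(reported_f1, 4), round(positive_share, 3))
```

The same counts also show that the two sentiment classes account for all 16665 English tweets, with roughly 82% of them scored as positive.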
After determining the sentiment, the next step is to find topics within the positive and the negative sentiments. Two approaches, LSA and LDA, are used to find topics, and their results are then compared using coherence scores.

Positive Sentiments. To find the optimal number of topics, we ran the LDA model for different numbers of topics k. Figure 2 shows the coherence chart, in which maximum coherence is achieved at k = 13 with a coherence score of 0.426. We then ran the LDA model with K = 13 and computed the individual coherence of each topic. We see in Table 1 that topics 3, 5, 7, 1, 9, 6 and 11 have high coherences while the rest have lower coherence scores. Topics 3 and 5 refer to the positive reception of the news of the nationwide lockdown, with many offices and Bollywood celebrities like Kartik Aryan being supportive. Topic 7 praises the lockdown as a response to the pandemic. Topic 1 refers to the State-Trait Anxiety Inventory (STAI) and to how people are following the lockdown by working from home. Topic 9 is a favourable reaction to the PM Cares fund set up by Prime Minister Narendra Modi and to the measures taken by the government for its citizens. Topics 6 and 11 favour the direction of India's leadership and refer to the people at the forefront of this battle against the pandemic.

Similarly, we computed the coherence score for the LSA model. As can be seen from Fig. 3, the optimal number of topics is 12 with a score of 0.428. We then ran the LSA model with K = 12 and computed the individual coherence of each topic. We see in Table 2 that topics 12, 2, 5, 11, 6 and 8 have high coherences while the rest have lower coherence scores. In Table 3, we see the comparison of LDA and LSA coherence scores between their most coherent topics. LDA produces higher coherence scores for each topic than LSA, demonstrating that LDA performs better than LSA in finding hidden topics.

Table 3. Coherence scores of the most coherent LDA topics (T3, T5, T7, T1, T9, T6, T11) and LSA topics.

Negative Sentiments.
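For both sentiment classes, the number of topics is chosen by fitting the model for a range of k and keeping the k with the highest coherence. The sketch below illustrates that selection step together with one common coherence measure, UMass coherence. It is an assumption-laden illustration: the paper does not state which coherence measure or library produced its scores, UMass is used here only because it is compact, and its values (sums of log ratios) are not comparable with the scores reported in this paper. The toy corpus and the `hypothetical_sweep` scores are invented.

```python
import math


def umass_coherence(top_words, documents):
    """UMass coherence of one topic: for every pair of top words, add
    log((D(later, earlier) + 1) / D(earlier)), where D counts the
    documents containing all the given words and "earlier" is the
    higher-ranked word of the pair."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    for later in range(1, len(top_words)):
        for earlier in range(later):
            denom = doc_freq(top_words[earlier])
            if denom == 0:
                raise ValueError(f"{top_words[earlier]!r} is not in the corpus")
            joint = doc_freq(top_words[later], top_words[earlier])
            score += math.log((joint + 1) / denom)
    return score


def best_k(coherence_by_k):
    """Return the number of topics with the highest coherence score."""
    return max(coherence_by_k, key=coherence_by_k.get)


# Invented toy corpus of tokenised tweets.
docs = [
    ["lockdown", "home", "stay"],
    ["lockdown", "home", "work"],
    ["fund", "donation", "relief"],
    ["fund", "donation", "cares"],
]
coherent = umass_coherence(["lockdown", "home"], docs)    # words co-occur
incoherent = umass_coherence(["lockdown", "fund"], docs)  # words never co-occur

# Invented sweep results, mapping k to a model-level coherence score.
hypothetical_sweep = {2: 0.31, 4: 0.40, 6: 0.36}
print(coherent > incoherent, best_k(hypothetical_sweep))
```

Words that appear together in the same tweets raise the score, which is why a topic mixing unrelated words scores lower than one whose top words co-occur.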
Similar to the process followed for positive sentiments, we ran the LDA model and computed coherence scores to find the optimal number of topics. As can be seen in Fig. 4, the optimal number of topics is 8 with a score of 0.542. We then ran the LDA model with K = 8 and computed the individual coherence of each topic; Table 4 reports these scores.

In a similar vein, we computed the coherence score for the LSA model to determine the optimal number of topics. As can be seen from Fig. 5, the optimal number of topics is 9 with a score of 0.62. We then ran the LSA model with K = 9 and computed the individual coherence of each topic. We see in Table 5 that some topics have high coherences while the rest have lower coherence scores. Topic 3 is a reference to the negative perception regarding lockdown and social distancing, the lack of availability of sanitizer and how it is affecting people's daily lives. Topic 8 refers to people not following the protocol of staying at their homes, which will result in increased infections spread through cough droplets. There is mention of Software Technology Parks of India (STPI), which refers to IT offices being shut down and people working from home. Topic 5 is about facing coronavirus by staying at home. In some cities of India, police are asking people on the streets to go inside. Topic 4 refers to deaths due to the pandemic and the drastic action of a nationwide lockdown.

In Table 6, we see the comparison of LDA and LSA coherence scores. LSA produces higher coherence scores for each topic than LDA. This means that LSA performs better than LDA in finding hidden topics here. Thus, we see that while LDA performed better for positive sentiments, LSA performed better for negative sentiments. A likely reason is that the corpus size differs between the two sentiments: there are 13701 positive and 2964 negative tweets. It has been reported that LDA performs better for bigger datasets while LSA performs better for smaller datasets [18].
This then explains the difference in performance between the positive and negative datasets.

The purpose of this study was to conduct opinion mining on Twitter data containing "#IndiaFightsCorona" to analyse public opinion. This was accomplished using sentiment analysis and topic modelling. First, sentiment analysis was done, and positive and negative sentiments were separated. Then, on each sentiment, topic modelling was done to discover hidden topics. Two approaches were used, namely LDA and LSA, and their results were compared. It was found that there were more positive sentiments than negative. For positive sentiments, LDA performed better, and for negative sentiments, LSA performed better.

LDA revealed topics within positive sentiments which include the positive reception of the news of the nationwide lockdown and praise for the measures taken by the government, vocal support of the lockdown by Bollywood celebrities like Kartik Aryan, STAI, work from home, the PM Cares fund and the people who are at the forefront of the battle against the pandemic. Topics revealed by LSA for positive sentiments include the altruistic efforts of people, money donation, the measures taken for the people's welfare and benefit, Ratan Tata's donation of Rs. 500 crores, STAI, the PM Cares fund set up by Prime Minister Narendra Modi and the hard work of many people in fighting coronavirus. We can see that while some topics are common between LSA and LDA, like STAI, the PM Cares fund and the people fighting coronavirus, some topics are distinct, such as vocal support of the lockdown by celebrities like Kartik Aryan, Ratan Tata's donation of Rs. 500 crores and work from home.

For negative sentiments, topics revealed by LDA include the Tablighi Jamat incident in New Delhi, the lockdown in Kashmir versus the lockdown in India due to coronavirus, food availability issues and their impact on workers, and the sacrifices made by people during the lockdown period, with social life being affected.
Topics revealed for negative sentiments by LSA include a negative perception regarding lockdown and social distancing, the lack of availability of sanitizer and how it is affecting people's daily lives, people not following the protocol and leaving their homes, increased infections spread through cough droplets, the shutdown of IT offices and actions taken by police to make people follow the lockdown protocol. Here, we see very little overlap between the topics generated by LSA and LDA for negative sentiments, limited to the negative impact on people's social life. All other topics are distinct. Our findings echo prior research [1], in which public sentiment is inclined towards positivity.

This study has a few limitations. First, the data was collected only for #IndiaFightsCorona. Future work can include more hashtags for broader coverage of public opinion. Second, only one evaluation criterion is used, namely topic coherence. Future work can include other criteria such as log-likelihood and perplexity. Other methods of topic modelling apart from LSA and LDA can also be used, and a comparison can be made. Third, the data collected is only a sample of the total Twitter data due to API restrictions. A bigger dataset can perhaps reveal more insights.

Tracking social media discourse about the COVID-19 pandemic: development of a public coronavirus Twitter data set
Information contagion: an empirical study of the spread of news on Digg and Twitter social networks, arXiv:1003.2664 [physics]
Differences in the mechanics of information diffusion across topics: idioms, political hashtags, and complex contagion on Twitter
Information credibility on Twitter
The rise of social bots
Top concerns of tweeters during the COVID-19 pandemic: infoveillance study
Pandemics in the age of Twitter: content analysis of tweets during the 2009 H1N1 outbreak
How did Ebola information spread on Twitter: broadcasting or viral spreading?
Conversations and medical news frames on Twitter: infodemiological study on COVID-19 in South Korea
A first look at COVID-19 information and misinformation sharing on Twitter, arXiv:2003.13907 [cs]
Social media use spikes during pandemic
Coronavirus goes viral: quantifying the COVID-19 misinformation epidemic on Twitter. Cureus
The COVID-19 social media infodemic, arXiv:2003.05004 [nlin, physics]
Understanding the perception of COVID-19 policies by mining a multilanguage Twitter dataset, arXiv:2003.10359 [cs]
Sentiment analysis and opinion mining
Sentiment analysis of Twitter data
Topic modeling: a comprehensive review
Latent Dirichlet allocation
LDA-based topic modelling in text sentiment classification: an empirical analysis
Exploring topic coherence over many models and many topics
Sentiment analysis and topic modelling for identification of Government service satisfaction