key: cord-0840182-cye7pigo
authors: Sv, Praveen; Tandon, Jyoti; Vikas; Hinduja, Hitesh
title: Indian citizen's perspective about side effects of COVID-19 vaccine – A machine learning study
date: 2021-06-10
journal: Diabetes Metab Syndr
DOI: 10.1016/j.dsx.2021.06.009
sha: 03a970589e10ebea041b3bd7d0e328e71c9091ee
doc_id: 840182
cord_uid: cye7pigo

BACKGROUND AND AIMS: Ever since the COVID-19 vaccination drive started in India, citizens have been sharing their views about it on social media. The present study examines the attitude of Indian citizens towards the side effects of the COVID-19 vaccine.

METHODS: Social media posts were used for this research. Using Python, we collected social media posts by Indians that focus on the side effects of COVID-19 vaccines. In study one, sentiment analysis was performed to find the overall attitude of Indian citizens towards the side effects of the COVID-19 vaccine; in study two, topic modeling was performed to analyze the major side effects voiced by citizens after taking the COVID-19 vaccine.

RESULTS: The studies revealed that nearly 78.5% of tweets posted by Indian citizens about the side effects of the COVID-19 vaccine carried either neutral or positive sentiment. Our topic modeling found that fear about efficiency in the workplace and fear of death are the two prime issues that lead Indian citizens to hold negative sentiment about the side effects of the COVID-19 vaccine.

CONCLUSION: While it is important for the Indian government to actively encourage its citizens to get vaccinated, it is also important to help citizens understand the importance of the vaccination program. The best way to educate citizens about the positive aspects of the vaccination program is to address the fears that Indian citizens have voiced in their social media posts about the COVID-19 vaccines.

A pandemic is considered a phenomenon that has devastating effects on people and economies, with many casualties.
Pandemics are said to bring both health and economic calamities [1]. The first case of COVID-19 was registered in India on 27th January 2020. A nationwide lockdown was imposed in March 2020, and the general public was monitored for compliance with the lockdown, social distancing, and mask wearing. Records show that by 18th May 2020 India had registered 1 lakh (100,000) COVID-19 cases. Within less than 2 months, however, the number of cases had increased 8 times and India had 8 lakh infected cases [2]. As of May 4, 2021, the number of confirmed COVID-19 cases in India had risen to 2,02,82,833 and the death toll had reached 2,22,408 [3]. These statistics make India the second most affected country in the world, after only the United States of America. Currently, COVID-19 is spreading at a distressing rate in India.

Journal Pre-proof

The spread of the deadly virus highlights the importance of vaccination at a national level. Vaccination is expected to protect the nation from continued damage. The Government launched the vaccination drive across the country on 16th January 2021 [4]. The drive is being conducted in stages. The first stage focuses on health care workers and frontline staff, who are at the highest risk of exposure to the virus. The second stage focuses on elderly people above the age of 45, who are at high risk of being affected by the virus. In the third stage, people in the age group of 18-45 will be targeted. Two COVID-19 vaccines are approved for nationwide use: Covishield, by Serum Institute of India, and Covaxin, manufactured by Bharat Biotech. As per the official data, as of 4th May 2021 a total of 158932921 doses, including both first and second doses, had been administered in India [3].
The analysis considers the social media posts of Indian citizens on Twitter in order to analyze their perception of the COVID-19 vaccine and its side effects. Twitter was used for data collection. Since the outbreak of COVID-19, more and more people have been using Twitter as a platform to express their views in the form of "tweets" [5]. Twitter is also regarded as a powerful public health tool alongside traditional sources such as radio, newspapers, and television, because leaders use it to communicate information on COVID-19 directly to citizens. Previous research has established that social media is a credible source for accessing and recording mass behavior during unusual periods like the current one [6] [7] [8] [9].

Using the Python library Twint, tweets containing the words 'COVID Vaccine' and 'Side effects' were scraped. With Twint's geographical filtering option, only tweets originating from India were retained. Only tweets posted in English were examined; tweets in other languages were excluded from the analysis. To reduce sampling errors caused by unbalanced samples, an equal number of tweets for each week was used in the analysis.

After the tweets were selected, the data were cleaned by removing punctuation, emoticons, images, hyperlinks, numbers, and stop words, so that only text was considered for analysis. Stop words are filtered out because they carry little meaning of their own, and removing them leaves the meaning of a sentence essentially unaltered.
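The cleaning step described above can be sketched with Python's built-in regular-expression module. This is a minimal illustration, not the authors' code: the `clean_tweet` helper, the two regular expressions, and the short stop-word list are assumptions made for the example, since the study does not list its exact patterns or stop words.

```python
import re

# Tiny illustrative stop-word list (an assumption; the study's list is not given).
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "i", "my", "of",
              "to", "and", "in", "it", "for", "after", "on"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # hyperlinks
NON_LETTER_RE = re.compile(r"[^a-z\s]")          # punctuation, digits, emoticons


def clean_tweet(text):
    """Return the lowercase word tokens of a tweet after basic cleaning."""
    text = text.lower()
    text = URL_RE.sub(" ", text)         # drop hyperlinks first
    text = NON_LETTER_RE.sub(" ", text)  # keep only letters and whitespace
    return [tok for tok in text.split() if tok not in STOP_WORDS]


sample = "Got my COVID vaccine today!!! Mild fever after 2 days https://t.co/abc123"
print(clean_tweet(sample))
```

Removing hyperlinks before the other characters matters here: once punctuation is stripped, a URL would break into letter fragments that look like ordinary words.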
After punctuation, hyperlinks, numbers, and stop words were eliminated from the corpus, the data were stemmed and lemmatized. Stemming removes prefixes and suffixes from words to reduce them to a common base or root. Lemmatization maps the different inflected forms of a word to a single dictionary form (the lemma), which reduces the dimensionality of the data. Lemmatization and stemming are closely related, and both are important steps in Natural Language Processing. The Python regular-expression library and Gensim were used for the data cleaning process.

The aim of study one is to understand the attitude of residents of India towards the side effects of the COVID-19 vaccine. To this end, we used sentiment analysis. Sentiment analysis, also known as opinion analysis, can be defined as "An automatic technique to select and analyze the subjective verdicts on various aspects of an item" [10]. It is a machine learning technique that draws on Natural Language Processing and computationally identifies and classifies people's opinions. Its basic aim is to find the opinion, attitude, or emotions of the writer of a particular text [11], and to identify the polarity of the tone of the text, which can be expressed as positive, negative, or neutral. A positive score denotes satisfaction, happiness, and contentment on the part of the author, whereas a negative score indicates disappointment, sadness, and sorrow.

Sentiment analysis yields a sentiment score. Every word in the sentiment lexicon, whether positive, negative, or neutral, carries a sentiment score, and the model uses these scores to determine whether a particular tweet in the corpus has positive, negative, or neutral sentiment.
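The word-level scoring idea can be illustrated with a toy lexicon-based classifier. The lexicon entries, their scores, and the zero threshold below are invented for illustration only; they are not the lexicon or settings used in the study, and real sentiment libraries handle negation, intensifiers, and much larger vocabularies.

```python
# Hypothetical polarity lexicon: scores in [-1, 1], chosen only for this example.
LEXICON = {
    "good": 0.7, "safe": 0.5, "happy": 0.8, "fine": 0.4,
    "bad": -0.7, "pain": -0.5, "fear": -0.6, "death": -0.8,
}


def polarity(tokens):
    """Average the scores of the tokens found in the lexicon (0.0 if none)."""
    scores = [LEXICON[tok] for tok in tokens if tok in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


def classify(tokens):
    """Map the averaged polarity score to one of three sentiment labels."""
    score = polarity(tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for tweet in (["vaccine", "safe", "happy"], ["fear", "death"], ["took", "vaccine", "today"]):
    print(tweet, classify(tweet))
```

A tweet with no lexicon words is labeled neutral here, which is one simple way to realize the three-way split described above.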
When sentiment analysis began, it was performed at the document level [12]; the sentence level came second [13], followed by the phrase level [14, 15]. For this study, Python (a computer programming language) was used to collect the tweets, and the TextBlob Python library was used to process the textual data. In TextBlob, English words carry a sentiment score. Applying the principles of machine learning and Natural Language Processing (NLP), TextBlob examines each word collected in the corpus and classifies opinions as neutral, positive, or negative [16].

Study one helped in understanding whether the perception of Indian citizens towards the side effects of the COVID-19 vaccine was positive or negative. Sentiment analysis can record the general attitude of people, but it cannot highlight the specific side effects of the vaccine. To understand the major side effects and after-effects of the COVID-19 vaccine reported by Indians, study two was conducted.

Topic modeling uses a set of algorithms to summarize extensive texts by discovering the hidden subjects and themes in a corpus [17]. Before the advent of Latent Dirichlet Allocation (LDA), Probabilistic Latent Semantic Indexing (PLSI) was traditionally used to derive topics. The idea behind PLSI is that each word in a document is modeled as a sample from a mixture model, whose mixture components are multinomial random variables that can be viewed as topics. A major drawback of PLSI, which led to its loss of popularity and to the increased use of LDA, is that it provides no probabilistic model at the level of the whole document [18].
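Study two relies on LDA for topic discovery. As a rough illustration of what fitting such a topic model involves, the sketch below implements a toy LDA with collapsed Gibbs sampling in plain Python. It is not the study's implementation: the function names `fit_lda` and `top_words`, the hyperparameters `alpha` and `beta`, the number of topics, and the sample documents are all hypothetical, and Gibbs sampling is only one common way to fit LDA.

```python
import random
from collections import defaultdict


def fit_lda(docs, num_topics=2, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]                # tokens of doc d in topic k
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # count of word w in topic k
    topic_total = [0] * num_topics                              # tokens assigned to topic k
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic given all other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


def top_words(topic_word, n=3):
    """Return up to n most frequent words for each topic."""
    result = []
    for counts in topic_word:
        ranked = sorted(((w, c) for w, c in counts.items() if c > 0),
                        key=lambda wc: (-wc[1], wc[0]))
        result.append([w for w, _ in ranked[:n]])
    return result


# Made-up, already-cleaned token lists standing in for tweets.
docs = [
    ["fever", "headache", "fever", "tired"],
    ["work", "office", "tired", "work"],
    ["fever", "headache", "arm", "pain"],
    ["office", "work", "leave", "work"],
]
doc_topic, topic_word = fit_lda(docs, num_topics=2)
print(top_words(topic_word))
```

The per-document topic counts in `doc_topic` show how each document mixes topics, while `top_words` gives the words that characterize each topic; on a real corpus, these top words are what an analyst would read to name the topics.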
LDA rests on the "bag of words" assumption and follows Bayesian probability theory [19]. Its fundamental characteristic is the use of algorithms to derive, across the documents, a common set of topics that are most talked about or opined about. LDA builds on the assumption that certain sets of words are consistently linked to particular topics. With the LDA technique, it is possible to discover latent topics in the vast amount of unstructured data in a corpus. The LDAvis library is used to better analyze, understand, and later summarise the identified side effects.

In study one, we performed sentiment analysis on the data we collected (Table 2).

References
Business Continuity and the Pandemic Threat. IT Governance Ltd
How India is dealing with COVID-19 pandemic
Vaccination state-wise
World's largest vaccination programme begins in India on
Twitter's user growth soars amid coronavirus, but uncertainty remains
Crisis information distribution on Twitter: a content analysis of tweets during Hurricane Sandy. Natural hazards
Twitter tsunami early warning network: a social network analysis of Twitter information flows
Twitter earthquake detection: earthquake monitoring in a social world
Evaluating public response to the Boston Marathon bombing and other acts of terrorism through Twitter
A survey of multimodal sentiment analysis
Reflections on sentiment/opinion analysis. In A practical guide to sentiment analysis
A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts
Mining and summarizing customer reviews
Recognizing contextual polarity in phrase-level sentiment analysis
Contextual phrase-level polarity analysis using lexical affect scoring and syntactic n-grams. Proceedings of the 12th Conference of the European Chapter of the ACL
Topic models
Latent Dirichlet Allocation
Press Trust of India. Recovered COVID-19 patients last immunity for 8 months raise hopes for vaccine: Study. India Today
Analyzing Indian general public's perspective on anxiety, stress and trauma during COVID-19 - a machine learning study of 840,000 tweets
General public's attitude toward governments implementing digital contact tracing to curb COVID-19 - a study based on natural language processing
An analysis of attitude of general public toward COVID-19 crises - sentimental analysis and a topic modeling study
Survivors Face - A Text Analysis Study