key: cord-0657796-xrbfcg3n
authors: Yin, Hui; Song, Xiangyu; Yang, Shuiqiao; Li, Jianxin
title: Sentiment Analysis and Topic Modeling for COVID-19 Vaccine Discussions
date: 2021-10-08
journal: nan
DOI: nan
sha: 28e2eb16c9b1ba0dc71e03a2c29dc349f0b7c4fb
doc_id: 657796
cord_uid: xrbfcg3n
The outbreak of the novel Coronavirus Disease 2019 (COVID-19) has lasted for nearly two years and has had an unprecedented impact on people's daily lives around the world. Even worse, the emergence of the COVID-19 Delta variant once again puts the world in danger. Fortunately, many countries and companies have been developing coronavirus vaccines since the beginning of this disaster. To date, more than 20 vaccines have been approved by the World Health Organization (WHO), bringing light to people besieged by the pandemic. The promotion of COVID-19 vaccination around the world has also prompted many discussions on social media about different aspects of vaccines, such as efficacy and safety. However, little research has systematically analyzed public opinion towards COVID-19 vaccines. In this study, we conduct an in-depth analysis of tweets related to the coronavirus vaccine on Twitter to understand the trending topics and their corresponding sentiment polarities at the country and vaccine levels. The results show that a majority of people are confident in the effectiveness of vaccines and are willing to get vaccinated. In contrast, the negative tweets are often associated with complaints about vaccine shortages, side effects after injections, and possible death after being vaccinated. Overall, this study exploits popular NLP and topic modeling methods to mine people's opinions on the COVID-19 vaccines on social media and to analyse and visualise them objectively. Our findings can improve the readability of the noisy information on social media and provide effective data support for governments and policy makers.
in Iran, 42.29% in Mexico, and 18.09% in Pakistan. The current vaccination rate has not yet reached the minimum required to control the spread of the pandemic in these countries. One possible reason is that people do not know enough about the vaccines, lack confidence in them, and are skeptical about their safety. They fear that the vaccines may cause long-term chronic illness because they have not been tested sufficiently. Another possible reason is that fake information about COVID-19 spreading on social media may push those who are hesitant or doubtful about the vaccines to oppose them. For example, one conspiracy theory denies the existence of the coronavirus epidemic. Therefore, it is necessary to identify people's concerns about vaccines when promoting vaccination.
Social media (e.g., Twitter, Facebook, Instagram) and online forums (e.g., StackOverflow, Kaggle, Yahoo) provide a convenient and trustworthy information source for researchers [2, 3, 4, 5]. People can freely post, comment, express their opinions on specific topics, or communicate with others on these platforms [6, 7]. Therefore, the discussion of the COVID-19 vaccine on social media provides us with a source of data to find out people's concerns about the vaccine. This paper examines the tweets about the coronavirus vaccine on Twitter and extracts the topics of concern and the sentiment polarity of the tweets. Rather than looking at the whole world, this paper examines the four countries with the highest number of tweets during the study period. These tweets dominate the direction of public opinion and usually represent the majority of people's opinions. In addition, analyzing people's attitudes towards different vaccines and visualizing the results can help governments understand the public's situation and, if necessary, take corresponding measures more objectively.
The highlights of this work are summarized as follows:
- To the best of our knowledge, this is the first analysis of the public discussions related to the COVID-19 vaccines on social media since the emergence of the COVID-19 Delta variant.
- We adopted two robust text mining techniques, Latent Dirichlet Allocation (LDA) and the Valence Aware Dictionary and sEntiment Reasoner (VADER), to extract the hidden information buried in noisy social media discussions.
- We conducted a comprehensive analysis of COVID-19 related discussions on Twitter. We found that discussions are mostly positive, and the dominant sentiment of trust indicates a high acceptance of the COVID-19 vaccine.
The paper is organised as follows. We review the related work in Section 2. Section 3 introduces the process of data preprocessing and makes an in-depth exploration of the dataset. The methods used in this study are detailed in Section 4. Section 5 introduces the process of data analysis and visualizes the results. Conclusions are drawn in Section 6.
Social media data analysis has been widely used for health-related issues and emerging public health crises [8, 9]. Since the outbreak of the COVID-19 pandemic, new discussions have appeared every minute on social media platforms such as Twitter, Facebook, and Weibo. Such a large number of posts provides a valuable source of data and has received attention from researchers [10, 11, 12, 13]. Some research has been carried out on tweets related to the COVID-19 pandemic, analyzing and mining helpful information hidden in the posts. Some work exploited sentiment analysis as a tool to investigate people's reactions during the pandemic through their posts on social media. Li et al. [14] analyzed the posts of Americans and Chinese on Twitter and Weibo during the pandemic from January 20, 2020 to May 11, 2020.
They compared the emotions (i.e., anger, disgust, fear, happiness, sadness, surprise) and the emotional triggers (e.g., what a user is angry/sad about) to reveal sharp differences in how people perceive COVID-19 in different cultures. Stella et al. [15] investigated the emotional and social repercussions during the lockdown in Italy, the first country to respond to the threat of COVID-19 with a national lockdown. They discovered the emergence of complex emotions in which fear and anger coexisted with solidarity, trust, and hope. Dubey [16] conducted text mining and sentiment analysis on COVID-19 tweets from 12 countries to analyze how people in different countries/regions responded to the pandemic. The results showed that most people were confident that the pandemic could be controlled, but there was also fear, sadness, and disgust around the world. Zhou et al. [17] extracted five months of COVID-19 related tweets to analyze the sentiment dynamics of people living in the state of New South Wales (NSW), Australia during the pandemic period. They divided tweets according to local government areas (LGA) and observed the dynamic changes in sentiment over time. Yin et al. [18] proposed a novel framework to dynamically analyze the topics and sentiment of 13 million tweets related to COVID-19. They found that the proportion of positive tweets was higher than that of negative tweets during the study period (2 weeks), which is consistent with other similar work. This work further analyzed the daily hot topics about the COVID-19 pandemic and found the common concerns discussed during the study period, for example, staying at home to ensure safety, the latest case reports, and people dying from the pandemic.
In addition to being a valuable data source, social media has been described as a source of a toxic "infodemic" (i.e., information of questionable quality).
During the COVID-19 pandemic, vast infodemics mixing false, fake, or misleading information have been generated worldwide in the digital and physical environment. They cause confusion and risk-taking behaviors that can harm health, lead to mistrust in health authorities, and undermine the public health response. Some work has focused on this type of information on social media during the pandemic period. Yang et al. [19] comprehensively studied the spread of prevalent myths related to COVID-19, people's engagement with them, and people's subjective feelings about them. They found that myths about the spread of infection and preventive measures, such as "5g corona is truth" and "Eating curry can prevent the COVID-19", spread faster than other myths. People were most worried about the spread of coronavirus, and the most common emotion was fear. Gallotti et al. [20] noticed that the infodemic spread rapidly and widely through social media platforms during the pandemic. This information may mislead the public or increase social panic. Therefore, while governments and the public were fighting the COVID-19 virus, they also had to fight the infodemic. The authors analyzed more than 100 million Twitter messages posted worldwide during the early stages of the epidemic and classified the reliability of the news being circulated. Furthermore, they developed an Infodemic Risk Index to capture the magnitude of exposure to unreliable news across countries. To contribute to the fight against the infodemic, Bang et al. [21] aimed to achieve a robust model for the COVID-19 fake-news detection task proposed in CONSTRAINT 2021 (FakeNews-19). They further evaluated the model on a different COVID-19 misinformation test set (Tweets-19), with the aim of improving its robustness and its generalization ability for the COVID-19 fake news problem on online social media platforms.
With the development and promotion of vaccines, many researchers have studied COVID-19 vaccine related discussions on social media. Kwok et al. [22] extracted topics and sentiments related to the COVID-19 vaccine from Australian Twitter users between January and October 2020. They employed the R package syuzhet to score each tweet on two sentiments (positive, negative) and eight emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust). They found that two-thirds of all tweets expressed positive opinions and one-third expressed negative opinions. Finally, they identified 3 LDA topics in the dataset: (1) attitudes toward COVID-19 and its vaccination, (2) advocacy of infection control measures against COVID-19, and (3) misconceptions and complaints about COVID-19 control. Lyu et al. [23] used the same methods as [22] to identify sentiments and topics over a long time span in public discussions related to the COVID-19 vaccine on social media, with the goal of better understanding public perceptions, concerns, and emotions that may influence the achievement of herd immunity goals. Their topic modeling yielded 16 topics, which were grouped into five overarching themes. Bonnevie et al. [24] quantified the increase in Twitter conversations around vaccine opposition during the COVID-19 pandemic in the United States. They first collected such tweets, classified them into topics, and then tracked them. After four months of observation, they found a noticeable increase in vaccine opposition on Twitter. Exposure to this increased amount of vaccine opposition may mislead people into opposing vaccines, which could have a drastic impact on the health of populations for decades to come. Therefore, to ensure the widest support for a COVID-19 vaccine, it is essential to identify and address the messages used by vaccine opponents. Thelwall et al.
[25] conducted a study to understand what types of vaccine hesitancy information are shared on Twitter, which might be helpful in designing interventions to address misleading attitudes. The main themes discussed were conspiracies, vaccine development speed, and vaccine safety. The majority (79%) of those who refused vaccines on Twitter expressed right-wing views, fear of the deep state, or conspiracy theories. A significant proportion of those who refused vaccination (18%) tweeted about other topics in a mainly apolitical manner.
For this study, we focus on analyzing the topics and sentiments of the COVID-19 vaccine related tweets on Twitter. As of August 23, 2021, there are 139 vaccine candidates, 22 of which have been approved by different countries, and 192 countries have approved vaccines. For example, vaccines such as Pfizer, Oxford/AstraZeneca, and Sinovac have been approved in the United States, the United Kingdom, and India, respectively. We adopt the latest publicly available dataset of COVID-19 vaccine tweets from Kaggle. The collected tweets span December 12, 2020 to July 2, 2021, and the dataset covers seven popular vaccine brands, as shown in Table 1:
Approved in 60 countries, 9 trials in 7 countries.
Approved in 39 countries, 19 trials in 7 countries.
Oxford/AstraZeneca: Approved in 121 countries, 39 trials in 20 countries.
Approved in 69 countries, 25 trials in 6 countries.
Approved in 9 countries, 7 trials in 1 country.
Approved in 71 countries, 20 trials in 7 countries.
We preprocessed the original dataset in the following steps. As the location of a tweet is necessary information in this study, we first deleted the tweets without location information, which left 78,827 tweets. After that, we removed the noisy words from the remaining tweets.
The procedures include:
(1) Removing the Twitter handles, URLs, emojis, and hashtags;
(2) Removing non-English words or common words that do not provide insights into a specific topic (e.g., stop words);
(3) Case folding (i.e., lowering the case of words to allow for lexical processing);
(4) Lemmatization to remove inflected endings and return a word to its base or dictionary form;
(5) Investigating the combination of two words (bigrams) to ensure that words such as "side effect" are treated as one token instead of separating "side" and "effect".
We also removed tweets with a length of less than four after processing, since they usually cannot provide reasonable semantics. In the end, we obtained 75,665 tweets for our experimental study. According to the COVID-19 vaccine official website, in Table 2 we list the approved vaccines for the four countries with the most tweets in the dataset. The vaccines approved in each country are not limited to these seven brands, but we only study the seven most popular vaccines in this study.
Table 2: Statistics on COVID-19 vaccines approved in four countries. We only count the seven brands included in the dataset. In fact, every country has approved more vaccines.
India, US, Canada, UK
We employ two methods for in-depth analysis of the COVID-19 vaccine related tweets on Twitter. The first one is the Valence Aware Dictionary and sEntiment Reasoner (VADER) for sentiment analysis, and the second one is Latent Dirichlet Allocation (LDA) for topic modeling.
Sentiment analysis (SA), also known as opinion mining, aims to automatically mine the opinions, attitudes, and feelings in texts and has a wide range of applications. We use VADER [26] to analyze the sentiment polarity of tweets in this study. VADER is a lexicon and rule-based sentiment analysis tool specifically attuned to sentiments expressed in social media. One of its most significant advantages is that it can be used directly on the original tweet with no preprocessing of the text.
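As an illustration of cleaning steps (1) to (5) listed earlier in this section, the following is a minimal sketch rather than the pipeline actually used in the study. These steps matter mainly for the word-frequency and topic-modeling analyses, since VADER can score the original text. The stop-word list, the bigram table, and the plural-stripping rule that stands in for lemmatization are all assumptions made for the example; a real run would rely on a full stop-word list and a proper lemmatizer.

```python
import re

# Tiny illustrative stop-word list (assumption, not the list used in the paper).
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "were", "i", "my",
              "to", "of", "and", "in", "it", "for", "on"}

# Hypothetical bigram table: adjacent word pairs merged into one token.
BIGRAMS = {("side", "effect"): "side_effect"}


def crude_lemma(word):
    """Rough stand-in for lemmatization: only strips a trailing plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def clean_tweet(text):
    """Apply steps (1)-(5) to one tweet and return its tokens."""
    text = re.sub(r"https?://\S+", " ", text)   # (1) URLs
    text = re.sub(r"[@#]\w+", " ", text)        # (1) handles and hashtags
    text = text.lower()                         # (3) case folding
    # Keeping ASCII letter runs drops emojis, digits and punctuation (1, 2).
    words = re.findall(r"[a-z]+", text)
    # (2) stop-word removal, then (4) the crude lemmatization stand-in.
    words = [crude_lemma(w) for w in words if w not in STOP_WORDS]
    # (5) merge known bigrams into single tokens.
    tokens, i = [], 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in BIGRAMS:
            tokens.append(BIGRAMS[pair])
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens


def preprocess(tweets, min_len=4):
    """Clean every tweet and drop those shorter than min_len tokens."""
    cleaned = [clean_tweet(t) for t in tweets]
    return [tokens for tokens in cleaned if len(tokens) >= min_len]
```

Here the length filter is interpreted as a minimum number of tokens, which is one plausible reading of "a length less than four".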
Some examples of VADER scoring results are shown in Table 3. VADER provides three types of scores: positive, neutral, and negative. In addition, VADER gives a compound score, which is very effective when dealing with social media data. VADER uses the compound score as the final sentiment score of a sentence. The compound score is computed by summing the valence scores of each word in the lexicon, adjusted according to the rules, and then normalized to be between -1 (most extreme negative) and +1 (most extreme positive). We set a standardized threshold for classifying sentences as positive, neutral, or negative, as follows:
$$S_{f_i} = \begin{cases} \text{positive}, & v_{score} \geq 0.05 \\ \text{negative}, & v_{score} \leq -0.05 \\ \text{neutral}, & \text{otherwise} \end{cases} \qquad (1)$$
where $v_{score}$ is the compound score of the $i$-th tweet and $S_{f_i}$ is the final polarity of that tweet. If the compound score $v_{score}$ is not less than 0.05, the sentence is considered to be positive. If the score is not greater than -0.05, its polarity is negative. Otherwise, the sentence polarity is neutral. Table 4 shows examples of tweets with positive and negative sentiment scores computed with VADER in this study.
Topic modeling is a method for the unsupervised classification of documents. Specifically, it is the process of learning, recognizing, and extracting high-level semantic topics across a corpus of unstructured text, even when people are unsure what they are looking for. It is a great way to get a bird's-eye view of a large text collection. LDA [27] aims to find the topics a document belongs to, based on its words. LDA is based on a Bayesian probabilistic model in which each topic has a discrete probability distribution over words, and each document is composed of a mixture of topics. In LDA, the topic distribution is assumed to have a Dirichlet prior, giving a smooth topic distribution for each document. The probability of a corpus is modeled in Eq. 2, where the documents and words are assumed to be independent. We show the plate notation of LDA in Fig. 3, while the meaning of the notation is given in Table 5.
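To make the generative story behind LDA more concrete, the sketch below draws one synthetic document: a topic mixture is sampled from a Dirichlet prior, then each word is produced by first picking a topic from that mixture and then picking a word from the chosen topic's distribution. It only illustrates the model's assumptions and performs no inference. The two toy topics and their word probabilities are invented for the example, and the topic-word distributions are supplied by hand instead of being drawn from their own Dirichlet prior or learned from data.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution by normalizing Gamma samples."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    if total == 0.0:
        # Extremely small alphas can underflow; fall back to a uniform mixture.
        return [1.0 / len(alphas)] * len(alphas)
    return [d / total for d in draws]


def generate_document(topic_word, alpha, n_words, rng):
    """Generate one document under the LDA generative assumptions.

    topic_word: list of K dicts mapping word -> probability (one per topic).
    alpha: symmetric Dirichlet prior on the document's topic mixture.
    """
    k = len(topic_word)
    theta = sample_dirichlet([alpha] * k, rng)   # per-document topic mixture
    words = []
    for _ in range(n_words):
        z = rng.choices(range(k), weights=theta)[0]   # topic of this word
        vocab = list(topic_word[z])
        probs = [topic_word[z][w] for w in vocab]
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words


# Hypothetical topics, made up for illustration only.
toy_topics = [
    {"vaccine": 0.5, "dose": 0.3, "approved": 0.2},
    {"thank": 0.6, "hope": 0.4},
]
theta, doc = generate_document(toy_topics, alpha=0.5, n_words=10,
                               rng=random.Random(0))
print(theta, doc)
```

Fitting LDA runs this story in reverse: given only the observed words, it estimates the topic mixtures and topic-word distributions that most plausibly generated them.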
In this study, we employ LDA for topic modeling and discuss hot topics in positive and negative tweets separately. The number of topics is a crucial parameter in topic modeling. To make these topics human interpretable, we use the coherence score to determine the optimal number of topics. The coherence score in Eq. 3 helps to distinguish between human-understandable topics and artifacts of statistical inference:
$$\text{Coherence} = \sum_{i<j} \text{score}(w_i, w_j) \qquad (3)$$
The coherence selects the top $n$ most frequently occurring words in each topic, then aggregates all the pairwise scores of the top $n$ words $w_1, \dots, w_n$ of the topic. Finally, we obtain the total coherence score for the current number of topics. Fig. 4 displays the coherence score of all tweets against the number of topics across two validation sets, with fixed α = 0.01 and β = 0.1. We set the range of the number of topics from 1 to 100. According to the results, the coherence score is highest when the number of topics is 11, so we set the number of topics to 11 and then perform LDA topic modeling on the tweets.
We first look at the high-frequency vocabulary in the dataset, and then we extract prevalent words in tweets with respect to user location and vaccine brand. After that, we use VADER to generate the sentiment polarity of each tweet, namely positive, negative, or neutral, and then further analyze the attitudes of users in various countries towards the seven vaccines. Finally, we use the LDA topic model to generate the topics of positive and negative tweets and examine the hot topics discussed in each group.
After removing the stop words and meaningless words, we first count the high-frequency words in the dataset, as shown in Fig. 5. Then, we separately count the popular words in the discussion of the COVID-19 vaccine in different countries. We extract prevalent words from tweets in India, the United States, Canada, and the United Kingdom and then use word clouds to visualize them, as shown in Fig. 6.
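The per-country word counts behind such word clouds can be sketched in a few lines. This is an assumed implementation for illustration; the record layout (a country name paired with the tweet's cleaned tokens) and the sample records are hypothetical and do not come from the dataset.

```python
from collections import Counter, defaultdict


def top_words_by_country(records, k=3):
    """Return the k most frequent tokens per country.

    records: iterable of (country, tokens) pairs, where tokens is the list of
    cleaned words of one tweet (hypothetical layout for this sketch).
    """
    counts = defaultdict(Counter)
    for country, tokens in records:
        counts[country].update(tokens)
    return {country: [word for word, _ in counter.most_common(k)]
            for country, counter in counts.items()}


# Made-up records, for illustration only.
sample = [
    ("India", ["covaxin", "dose", "thank"]),
    ("India", ["covaxin", "dose"]),
    ("India", ["covaxin"]),
    ("US", ["moderna", "thank"]),
    ("US", ["moderna"]),
]
print(top_words_by_country(sample, k=2))
```

The resulting frequencies are what a word-cloud renderer would take as input, with more frequent words drawn larger.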
According to Table 2, India has approved 4 of the 7 vaccines, the United States has approved 2 of the 7 vaccines, and Canada and the United Kingdom have each approved 3 of the 7 vaccines. Fig. 6a clearly shows that people in India pay more attention to Bharat Biotech (Covaxin) and Sputnik than to Moderna and Oxford/AstraZeneca. In the United States, Canada, and the United Kingdom, Moderna and Pfizer are the vaccines most mentioned by users. The word "thank" is clearly visible in the word clouds, showing a positive attitude, as shown in Fig. 6b, 6c, and 6d.
In this section, we pay attention to the high-frequency emotional words related to vaccines and gain a general understanding of people's attitudes toward different vaccines. We employ the VADER dictionary to filter the emotional words in the tweets, select the top 30 high-frequency emotional words for each vaccine, and then separate the words by polarity. The results are shown in Table 6 and Table 7. The number of positive words, such as "thank", "approved", "effective", "safety", and "hope", is much higher than the number of negative words. These words represent a positive attitude towards vaccines and trust that vaccines can protect us from infection. We did not list neutral words because only two neutral words appeared among the top 30 high-frequency emotional words of all vaccines.
The sentiment polarity of each tweet is generated using the VADER tool as described previously. Fig. 7 presents the overall sentiment distribution of tweets about the seven vaccines across the four countries over the study period. The number of positive tweets is greater than that of negative tweets regardless of the brand, which shows that the majority of Twitter users maintain a positive attitude towards the vaccines.
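The labeling and counting step behind such a distribution can be sketched as follows. The thresholds of ±0.05 are the ones stated in the text; the input layout (country, vaccine, compound score) and the sample values are assumptions made for the example. In practice the compound score could be obtained from the vaderSentiment package via `SentimentIntensityAnalyzer().polarity_scores(text)["compound"]`, which is not called here so that the sketch has no third-party dependency.

```python
from collections import Counter


def polarity(compound):
    """Map a VADER compound score to a label using the +/-0.05 thresholds."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"


def tally(scored_tweets):
    """Count tweets per (country, vaccine, polarity).

    scored_tweets: iterable of (country, vaccine, compound) triples
    (hypothetical layout for this sketch).
    """
    counts = Counter()
    for country, vaccine, compound in scored_tweets:
        counts[(country, vaccine, polarity(compound))] += 1
    return counts


# Made-up scores, for illustration only.
scored = [
    ("India", "Covaxin", 0.62),
    ("India", "Covaxin", -0.31),
    ("India", "Covaxin", 0.44),
    ("US", "Moderna", 0.01),
]
print(tally(scored))
```

Grouping the counts by country and vaccine in this way gives the numbers that a chart like Fig. 7 would plot.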
According to Table 6, most of the positive tweets focused on the following aspects: believing that vaccines can provide effective protection, expecting that the vaccine will be approved and promoted as soon as possible, and expressing thanks for receiving the vaccine. In contrast, negative tweets are mostly related to vaccine shortages, side effects after vaccination, and reports of deaths due to vaccination.
Table 6 (excerpt of high-frequency positive words): supreme, number, approved, well, help, ready, thank, thanks, free, allow, best, trust, good, effective, like, want, approval, great, top, launched, approves, please, agreed.
As mentioned in section 4.2, the coherence score indicates that the optimal number of topics for topic modeling is 11. In this section, we use LDA to generate the topics of the tweets to understand which aspects users are concerned about in the positive and negative tweets, respectively. We count the number of tweets corresponding to different topics in positive tweets and negative tweets separately. Ranked by popularity, the top 5 topics discussed in positive and negative tweets are listed in Table 8 and Table 9, where the words contributing most to each topic are shown below the topic. The conclusions are consistent with those in section 5.2 and section 5.3. In the positive tweets, people were grateful for being vaccinated in anticipation
This study conducted a comprehensive analysis of COVID-19 vaccine-related tweets collected from Twitter between December 12, 2020 and July 2, 2021. A total of 75,665 tweets were used for this study, covering seven vaccine brands, namely Pfizer/BioNTech, Sinopharm, Sinovac, Moderna, Oxford/AstraZeneca, Covaxin, and Sputnik V. According to statistics based on the location of tweet users, these tweets are mainly from India, the United States, Canada, and the United Kingdom. We first performed an overall analysis of the whole dataset and then a specific analysis of the four countries.
The sentiment analysis results showed that the overall sentiment polarity is positive, and the number of positive tweets is approximately twice the number of negative tweets. At the country level, the sentiment polarity scores of each country for its approved vaccines were consistent with the overall sentiment polarity scores. For other vaccine brands, however, the number of negative tweets for some vaccines was higher than the number of positive tweets, such as Sputnik V in the United States and Sinovac in Canada and the United States. In the positive tweets, people expressed their gratitude for being able to be vaccinated. They hope that with the help of vaccination, the pandemic can be controlled as soon as possible and normal life can be resumed. In the negative tweets, people mostly complained about side effects after vaccination, such as fever and a sore arm.
In summary, this paper presented a case study of popular topics and sentiment analysis of tweets related to the COVID-19 vaccines. In the future, more interesting topics can be explored based on the current study, for example, performing individual-level topic and sentiment analysis to help local communities locate people who may be suffering from negative sentiments and thus take action in advance.
[1] Examination of community sentiment dynamics due to covid-19 pandemic: A case study from a state in australia
[2] Influence propagation model for clique-based community detection in social networks
[3] Community-diversified influence maximization in social networks
[4] Deep fusion of multimodal features for social media retweet time prediction
[5] Modeling user preferences on spatiotemporal topics for point-of-interest recommendation
[6] Discovering topic representative terms for short text clustering
[7] Sentence level topic models for associated topics extraction
[8] Neural attention with character embeddings for hay fever detection from twitter. Health information science and systems
[9] Automated detection of mild and multi-class diabetic eye diseases using deep learning
[10] Efficient targeted influence minimization in big social networks
[11] Evidence-driven dubious decision making in online shopping
[12] Vulnerability exploitation time prediction: an integrated framework for dynamic imbalanced learning
[13] Decision-based evasion attacks on tree ensemble classifiers
[14] Analyzing covid-19 on online social media: trends, sentiments and emotions
[15] #lockdown: Network-enhanced emotional profiling in the time of covid-19
[16] Twitter sentiment analysis during covid-19 outbreak. Available at SSRN 3572023
[17] Detecting community depression dynamics due to covid-19 pandemic in australia
[18] Detecting topic and sentiment dynamics due to covid-19 pandemic using social media
[19] Analysis and insights for myths circulating on twitter during the covid-19 pandemic
[20] Assessing the risks of 'infodemics' in response to covid-19 epidemics
[21] Model generalization on covid-19 fake news detection
[22] Tweet topics and sentiments relating to covid-19 vaccination among australian twitter users: Machine learning analysis
[23] Covid-19 vaccine-related discussion on twitter: Topic modeling and sentiment analysis
[24] Quantifying the rise of vaccine opposition on twitter during the covid-19 pandemic
[25] Covid-19 vaccine hesitancy on english-language twitter. Profesional de la información (EPI)
[26] Vader: A parsimonious rule-based model for sentiment analysis of social media text
[27] Latent dirichlet allocation