key: cord-0119696-smpq5o61
authors: Na, Tao; Cheng, Wei; Li, Dongming; Lu, Wanyu; Li, Hongjiang
title: Insight from NLP Analysis: COVID-19 Vaccines Sentiments on Social Media
date: 2021-06-08
journal: nan
DOI: nan
sha: bc2697e19e7a13e532c015de7071e3735970335e
doc_id: 119696
cord_uid: smpq5o61

Social media is an appropriate source for analyzing public attitudes towards the COVID-19 vaccine and its various brands. Nevertheless, there are few relevant studies. In this research, we collected tweets posted by UK and US residents via the Twitter API during the pandemic and designed experiments to answer three main questions concerning vaccination. To obtain the dominant public sentiment, we performed sentiment analysis with VADER and proposed a new method that accounts for each individual's influence. This allows us to go a step further in sentiment analysis and explain some of the fluctuations in the data. The results indicate that celebrities can lead opinion shifts on social media as vaccination progresses. Moreover, at the peak, nearly 40% of the population in both countries held a negative attitude towards COVID-19 vaccines. We also investigated people's opinions of different vaccine brands and found that the Pfizer vaccine is the most popular. By applying the sentiment analysis tool, we found that most people hold positive views of the COVID-19 vaccines manufactured by most brands. Finally, we carried out topic modelling with the LDA model. We found that residents of the two countries are willing to share their views and feelings concerning the vaccine. Several deaths have occurred after vaccination. Owing to these negative events, US residents are more worried about the side effects and safety of the vaccine.


Index Terms-Natural Language Processing, COVID-19, Vaccine, UK, US, Sentiment analysis, Topic modelling, Social media, Text mining

Since the first patient was identified in Wuhan, China, in December 2019 [1], COVID-19 has spread rapidly to Europe and eventually worldwide. COVID-19 can cause severe respiratory illness [2], and the disease has caused more than 2.17 million deaths. The virus has also been spreading in the UK and the US since March 2020. As of 22 March 2021, the number of confirmed cases exceeded 29.8 million in the US and 4.3 million in the UK. COVID-19 has killed more than 668,000 people in the two countries.

To prevent further spread of COVID-19 and relieve the enormous pressure on medical systems, the development and promotion of vaccines are crucial. Several pharmaceutical companies and universities have been working on COVID-19 vaccines at an unprecedented rate. More than 260 possible COVID-19 vaccines have been proposed, but only a few have been approved. Several others are in advanced stages of testing [3]. Pfizer is the first international pharmaceutical company to have its vaccine approved in multiple countries: in the UK on 2 December 2020 [4], in the US on 12 December 2020 [5] and in the EU on 21 December 2020 [6]. At the time of writing, the availability of COVID-19 vaccines is still limited, and less than 6% of the global population has received a vaccine. As of 20 March 2021, vaccination coverage in the UK and US was 43.99% and 36.31%, respectively [7].

The development of a safe vaccine through animal models of RSV can take up to thirty years [8]. A vaccine needs to be evaluated repeatedly in animal models before it enters clinical trials. Because of the severity of the epidemic, the pace of COVID-19 vaccine development is unprecedented. Countries and global organizations have lowered the relevant criteria for new COVID-19 vaccines and have also shortened the clinical trials [9]. Although the vaccines now in use have undergone rigorous testing and review to ensure safety, many people remain skeptical about their safety, and a small proportion of people have even refused to be vaccinated. To promote the vaccine effectively, it is therefore important to collect people's opinions about the vaccine and its side effects. However, measures such as social isolation, quarantine and travel restrictions imposed by governments have hindered the collection process. Social media platforms such as Twitter therefore become an ideal source of data.

To investigate the perceptions and attitudes of UK and US citizens regarding the vaccine, we used the Twitter API to collect relevant tweets during the outbreak and conducted social media analysis on them. Starting from a public dataset provided by Banda et al. [10], we kept only the tweet id column of the English-language posts whose location is limited to the US or the UK. We then downloaded all COVID-19 vaccine-related tweets via the Twitter API using these tweet ids. Public fields such as tweet id, date, time, lang and country place are included in our final dataset. Our use of the dataset fully complies with Twitter's terms of service.

The three main questions about the COVID-19 vaccine that we analyze in this paper, and our contribution to each, are as follows:

• What is the dominant sentiment towards COVID-19 vaccines? For this question, we analyze the attitudes of citizens located in the UK and US toward the COVID-19 vaccine. The analysis is mainly conducted with a sentiment analysis tool, VADER [11], which is used primarily to explore the main sentiment expressed in people's tweets. We also propose a new method that captures a user's public influence on the social network and thus contributes to the analysis of the experimental results.
• Which COVID-19 vaccine brands/manufacturers have been most talked about recently? Do people prefer any brands? For this question, we explore the sentiment of people located in the UK and US toward different COVID-19 vaccine brands. We manually determined the vaccine brands that are currently most talked about on Twitter and their corresponding keywords. VADER is used to analyze people's preferences regarding different brands.
• What concerns do people have about the COVID-19 vaccine? What are the popular topics regarding vaccines? For this question, we identified the main concerns among citizens about the COVID-19 vaccine. Using the LDA model, we explored the popular topics from the perspectives of time series, country and sentiment, respectively.

II. RELATED WORK

Sentiment analysis, also called opinion mining, aims to evaluate the attitudes, opinions, sentiments and evaluations embedded in natural language text by computationally assessing its subjectivity [11]. Through sentiment analysis, we can determine whether a text has a positive or negative subjective orientation.

Common sentiment analysis models rely heavily on sentiment dictionaries. A sentiment lexicon is a set of vocabulary in which each word is labelled according to the positivity and negativity of its subjective orientation.

Sentiment lexicons can be classified into two types: those with semantic orientation labels (positive or negative) and those with finer-grained quantitative scores based on predefined rules. LIWC [12], GI [13] and HU-LIU04 [14] are widely used polarity-based lexicons in which words are context-free. In contrast, ANEW [15], SentiWordNet [16] and SenticNet [17] are based on sentiment intensity and can therefore provide quantitative scores.

VADER (Valence Aware Dictionary for sEntiment Reasoning), proposed by C.J. Hutto et al. [11], is a lexicon that is aware of both polarity and intensity. It performs especially well on social media text. Using its rule set, VADER takes several lexical features into account: punctuation, capitalization, degree modifiers, the contrastive conjunction "but", and tri-gram examination for negation flipping. VADER can therefore address the challenges that stem from informal language use and implicit sentiment expression.

Automatically identifying sentiment features in text with deep learning is another research direction. However, most of these methods are unstable and have several problems. First, such an approach requires a considerable amount of labelled sentiment data, which is often difficult to obtain for a particular text domain. Second, deep models are computationally intensive in training, validation and testing, and their prediction speed directly limits their ability to process streaming data. Third, such deep models are usually black boxes with limited interpretability.

Because of these advantages, VADER has a wide range of applications. Toni Pano and Rasha Kashef [18] investigated whether outbreaks of COVID-19 can influence Bitcoin prices. They applied 13 different strategies to BTC tweets, and VADER scoring was regarded as the optimal processing approach. Mohapatra et al. [19] assigned each tweet a compound sentiment score based on the VADER sentiment analysis algorithm; the number of Twitter followers, likes and retweets associated with each tweet is then used to compute the final sentiment score.

Latent Dirichlet Allocation (LDA), a three-level hierarchical Bayesian model, is a generative probabilistic model for finding patterns of words in a text corpus [20]. LDA has been shown to outperform batch variational Bayes (VB) while requiring less running time [21]. David et al. [22] extended classical state space models to specify a statistical model of topic evolution. Based on probabilistic time series, this dynamic model can capture the evolution of topics in a corpus. One limitation of LDA is its inability to model topic correlation. To address this limitation, [23] presented the correlated topic model (CTM), which directly models correlation between topics through the covariance structure among the components. This showed that correlation plays an important role in topic modelling. In [24], keywords are used to represent topics, and automatic coherence evaluation was proposed to rate coherence or interpretability. Michael et al. [25] proposed a framework that combines existing word-based coherence measures with combinations of their basic components. By exploring this configuration space, they identified the coherence definition with the best overall correlation with all available human topic ranking data.

First, we collect tweet data from a public Twitter dataset that contains millions of tweets related to COVID-19. From this dataset, we retain only the tweets that come from the UK or US and are related to the vaccine. After obtaining the data, we preprocess it to remove redundant and invalid content from the tweet text. In the analysis steps, the dataset is split differently depending on the requirements of the three questions. In the experiments, we mainly apply VADER for sentiment analysis. In addition, high-frequency words are collected and displayed as a word cloud, and the LDA model is applied along three dimensions: country, sentiment and time series. We then model and analyze the data from different countries, sentiments and periods to uncover attitudes towards the vaccines and the coronavirus on Twitter.

Banda et al. [10] provide a dataset of Twitter posts in four languages about the COVID-19 vaccine. This dataset is updated at least once every fortnight. Based on this public dataset, we further collected vaccine-related tweets posted by residents of the UK and US during the outbreak.

The dataset acquisition process is as follows (Fig. 1):

• The public dataset file [10] is downloaded.
• A list of file paths covering 432 days of data is created for later use.
• Using the country place field in the public dataset, the tweets posted by residents of GB and the US are retained.
• Multiprocessing is used to speed up the data processing. Specifically, 8 processes are used.
• We combine the 8 results obtained in the previous step to generate 4 datasets for later use.
• We use multiprocessing with 4 processes to collect the tweets. Based on the tweet id field in the data obtained from the previous step, we acquire vaccine-related tweets via four Twitter APIs in parallel.
• Only 15 key fields are kept for each tweet in the 4 separate files. Finally, those 4 files are converted into 4 CSV files and then merged for subsequent research.

Our dataset retains the "retweet count" field, which helps us track and study important tweets. In addition, every tweet we collected is complete (including the memes in the tweets). Through the above method, a dataset of about 110,000 British and American tweets from 25 January 2020 to 14 March 2021 was collected. This dataset then goes through data preprocessing to ensure its accuracy during the research. Changes to the dataset were continuously monitored by our team members during the study.
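The country filter and the first multiprocessing stage described above can be sketched as follows. This is a minimal illustration, not the authors' script: the field names `country_place` and `tweet_id`, the tab-separated file layout, and the helper names are assumptions made for the example.

```python
import csv
from multiprocessing import Pool

# Country codes named in the text; the field name "country_place" is an assumption.
KEEP_COUNTRIES = {"GB", "US"}


def filter_rows(rows, countries=KEEP_COUNTRIES):
    """Keep only rows whose 'country_place' field is one of the wanted codes."""
    return [row for row in rows if row.get("country_place") in countries]


def filter_file(path):
    """Read one daily file (assumed tab-separated with a header row) and filter it."""
    with open(path, newline="", encoding="utf-8") as handle:
        return filter_rows(csv.DictReader(handle, delimiter="\t"))


def filter_files_parallel(paths, processes=8):
    """Filter many daily files with a pool of worker processes and merge the results."""
    with Pool(processes=processes) as pool:
        per_file = pool.map(filter_file, paths)
    return [row for rows in per_file for row in rows]
```

On platforms that start worker processes with "spawn", `filter_files_parallel` should be called from code guarded by `if __name__ == "__main__":`. The retained tweet ids would then be passed to the Twitter API in a second parallel stage.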

Since the Twitter text data has some typical but nonsemantic features, we processed these features in the data preprocessing steps.

First, almost every tweet contains a short link such as "https://t.co/o7amgl8ybl". This kind of link has no actual semantics and can cause ambiguity after tokenization: for example, "o7amg" is divided into "o", "7" and "amg", and "amg" could be taken as a car brand. This causes unnecessary trouble in later analysis, so these links are removed. Second, although VADER can score punctuation, it only makes use of the exclamation point (!), so other punctuation is deleted in this step. Third, we remove unnecessary line breaks, because they are the most common meaningless marker. Fourth, all text is converted to lowercase to ensure that every word appears in a consistent format.
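The four cleaning steps can be sketched with the standard `re` module. This is a minimal illustration under stated assumptions: the regular expressions and the function name `clean_tweet` are ours, and removed punctuation is deleted rather than replaced by a space so that contractions such as "don't" collapse into a single token.

```python
import re

URL_RE = re.compile(r"https?://\S+")   # short links such as https://t.co/...
PUNCT_RE = re.compile(r"[^\w\s!]")     # any character that is not a word character, whitespace or "!"
SPACE_RE = re.compile(r"\s+")          # runs of whitespace, including line breaks


def clean_tweet(text):
    """Apply the four preprocessing steps to one tweet."""
    text = URL_RE.sub(" ", text)       # 1. drop links
    text = PUNCT_RE.sub("", text)      # 2. drop punctuation except "!"
    text = SPACE_RE.sub(" ", text)     # 3. collapse line breaks and extra spaces
    return text.strip().lower()        # 4. lowercase
```

For example, `clean_tweet("Great NEWS!\nSee https://t.co/o7amgl8ybl, now.")` returns `"great news! see now"`.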

For the first and second questions we raised, we try to use sentiment analysis to understand the attitudes of American and British citizens towards COVID-19 vaccine from the perspective of emotional expression. At the same time, we used multiple dimensions to try to explain people's acceptance of different vaccine brands, and how attitudes towards vaccines have changed over time.

In our study, sentiment analysis is mainly realized through VADER. This tool's sentiment lexicon is sensitive towards the polarity and intensity of subjectivity expressed in social media contexts and is also widely applicable to other domains.

Generally speaking, sentiment analysis mainly determines the proportion of positive, negative and neutral texts (the proportion of polarity) and the intensity of their emotional expression in a given text through various methods. Finally, a predefined rule is used to make a category judgment or a comprehensive score for the text.

VADER is a gold-standard sentiment lexicon obtained through the Wisdom of the Crowd (WotC) approach. Built through extensive human effort, this lexicon enables sentiment analysis of social network text to be completed quickly, with accuracy comparable to that of human raters [11]. VADER can provide sentiment scores at the document, sentence and phrase level, depending on the granularity required by the application scenario.

In the scenario we studied, the minimum unit of analysis was set to the preprocessed text data of a tweet. At the same time, according to the classification method recommended by Hutto et al., we mapped the emotional score into three categories: positive, negative and neutral through the same parameter setting.

Because the Twitter API returns rich data, we were also able to capture many non-text fields in this study, such as a user's tweeting location, country code, time, follower number, likes number, retweet number and listed number (how many lists the account is contained in). Based on these data, our analysis can be carried out in terms of geographical region and time.

The study was conducted on all tweets in English, which include tweets from many English-speaking countries. If we did not distinguish between the regions where people tweet, the results would likely be insignificant. Because the epidemic situation in each country is affected by national governance ability, economic level, scientific and technological level, international political status and many other aspects, the vaccination opportunities of people in different countries differ greatly. Meanwhile, owing to international politics, different countries promote different vaccine brands (Chinese vaccines, American vaccines, British vaccines, etc.), and the effectiveness and safety of different vaccines also vary. In other words, common sense suggests that the severity of the epidemic in a country, the speed of vaccination and the brand of the vaccine have a big impact on vaccine acceptance. Therefore, we selected the two most representative English-speaking countries, the United Kingdom and the United States, and analyzed them separately.

Fig. 1: The steps of dataset collection. Using the public dataset and the first multiprocessing stage, the tweets of residents in GB and the US are retained. Subsequently, the fields of specific tweets are collected by using the Twitter APIs and the second multiprocessing stage. Finally, we obtain the final dataset for further research.

In terms of time, considering the imbalance of data volume and data distribution, we divide the data according to month as the smallest unit. The number and proportion of tweets in different categories, as well as the average emotional score of all tweets in each month, were calculated.
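The monthly statistics described here (counts and proportions per category, plus the mean compound score) can be sketched as below. The input layout, a tuple of ISO date string, label and compound score per tweet, and the function name are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def monthly_summary(tweets):
    """Aggregate (date 'YYYY-MM-DD', label, compound) tuples by calendar month."""
    counts = defaultdict(Counter)
    score_sums = defaultdict(float)
    for date, label, compound in tweets:
        month = date[:7]                      # 'YYYY-MM'
        counts[month][label] += 1
        score_sums[month] += compound

    summary = {}
    for month in sorted(counts):
        total = sum(counts[month].values())
        summary[month] = {
            "counts": dict(counts[month]),
            "ratios": {label: n / total for label, n in counts[month].items()},
            "mean_compound": score_sums[month] / total,
        }
    return summary
```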

As far as we know, existing studies on sentiment analysis of COVID-19 have ignored the dynamics of opinion transmission in social networks [26] . In other words, all users are regarded as individuals with the same influence in analysis so as to reach the final conclusion. Such a method deviates from the actual sentiment.

After considering different diffusion models [26] , we believe that follower number and listed number have the same nature of degree in social network, and retweet number and likes number reflect the potential influence of a single user. That is, the degree is directly proportional to the follower number and listed number, which means that the user is in contact with more users and can impose influence on them. Retweet number and likes number are directly proportional to the potential impact, and these two metrics can be interpreted as content quality metrics. Therefore, we propose the following improvements to the compound score:

The Final compound above is used for the study of question one, and we name the method as "Weighted-VADER". 
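The idea of weighting each tweet by its author's influence can be illustrated with a short sketch. The weight used here is purely hypothetical and is not the authors' formula, which is not reproduced in this text: it adds a log-scaled degree term (followers plus listed count) and a log-scaled impact term (retweets plus likes) to a base weight of 1. The function names and dictionary keys are likewise assumptions.

```python
import math


def influence_weight(followers, listed, retweets, likes):
    """Hypothetical influence weight (NOT the paper's formula).

    Degree-like metrics and impact-like metrics are log-scaled so that a few
    very large accounts do not dominate completely.
    """
    degree = math.log1p(followers + listed)
    impact = math.log1p(retweets + likes)
    return 1.0 + degree + impact


def weighted_mean_compound(tweets):
    """Influence-weighted mean of VADER compound scores.

    Each tweet is a dict with keys 'compound', 'followers', 'listed',
    'retweets' and 'likes' (assumed names).
    """
    total_weight = 0.0
    weighted_sum = 0.0
    for tweet in tweets:
        weight = influence_weight(
            tweet["followers"], tweet["listed"], tweet["retweets"], tweet["likes"]
        )
        total_weight += weight
        weighted_sum += weight * tweet["compound"]
    return weighted_sum / total_weight if total_weight else 0.0
```

With all metrics at zero every weight is 1 and the result reduces to the plain mean, so the unweighted VADER analysis is a special case of this kind of scheme.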

For the third question, we chose the LDA model to build topics. The LDA model can extract representative and valuable information from tweets. LDA assumes that documents are generated by a predefined probabilistic procedure [21]. The basic steps are as follows. The first step is to choose a distribution over a mixture of K topics. Next, a topic is selected and a word is drawn from that topic according to the topic's word probability distribution. Finally, the model produces a list of K topics ranked by the percentage of all words related to each topic, and each topic shows its most relevant words.
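The generative steps can be made concrete with a small standard-library sketch. It only simulates the generative story (draw a topic mixture, then draw a topic and a word for each position); it does not perform inference, and the function names, vocabulary and parameter values used in any call are illustrative assumptions.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word_probs, vocab, alpha, length, rng):
    """Generate one document following the LDA generative process.

    topic_word_probs[k][v] is the probability of word vocab[v] under topic k.
    """
    theta = sample_dirichlet(alpha, rng)          # step 1: per-document topic mixture
    words = []
    for _ in range(length):
        topic = rng.choices(range(len(theta)), weights=theta)[0]        # step 2: pick a topic
        word = rng.choices(vocab, weights=topic_word_probs[topic])[0]   # step 3: draw a word
        words.append(word)
    return theta, words
```

Fitting an LDA model amounts to inverting this process: given only the observed words, it estimates the topic mixtures and the per-topic word distributions.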

Meanwhile, we found that content on Twitter is changeable: the corpus is not static but spans several months. The dynamic topic model (DTM) is a generative model for analyzing topic evolution in a large corpus [22]. As a member of the class of probabilistic topic models, it can capture how various themes in tweets evolved. The whole period is split into several time slices, which are fed into the model provided by "gensim" [27]. The details of the DTM are illustrated in Fig. 2.

An essential challenge is to determine an appropriate number of topics for the topic model. Michael et al. [25] proposed coherence scores to evaluate the quality of a topic model. Coherence scores measure the consistency of the words that compose a topic. In addition, we consider the distribution of topics under principal component analysis (PCA), which visualizes the topic model in a two-dimensional word space. An even spread is preferred, as it indicates a high degree of independence among topics. We therefore judge a model to be good if it has a higher coherence and an even distribution in the principal component plot displayed by pyLDAvis [28]. The first criterion indicates that the content within each topic is highly consistent. The second means that the model has few intersections among topics, so the topics summarize the whole word space well and remain relatively independent.
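To make the notion of a coherence score concrete, the sketch below computes the "umass" measure, which this study uses as a secondary reference, directly from document co-occurrence counts. It is a minimal illustration: it returns the sum of the pairwise log conditional probabilities, whereas library implementations may aggregate or smooth differently, and the function name is an assumption.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """UMass-style coherence of one topic.

    top_words: the topic's words ordered from most to least probable.
    documents: iterable of tokenized documents.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    for earlier, later in combinations(top_words, 2):
        denom = doc_freq(earlier)
        if denom == 0:
            continue  # the higher-ranked word never occurs; skip the pair
        # log of the smoothed conditional probability P(later | earlier)
        score += math.log((doc_freq(earlier, later) + 1) / denom)
    return score
```

Scores closer to zero indicate that a topic's top words tend to appear in the same documents.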

The sentiment analysis is applied to section IV-A and section IV-B.

We first filtered the data by location to get all the tweets with "country code" fields "USA", "GB" and then divided it into two subsets, each of which was further divided by month.

To classify the sentiment of a tweet as positive, negative or neutral, we follow the classification method proposed in [11]. When VADER is run on a tweet, it generates four scores: "pos", "neg", "neu" and "compound". The "compound" score is calculated from the valence score of each word in the tweet according to specific rules and is normalized to range from -1 to 1. The "compound" score is therefore taken to represent the sentiment of the tweet, and the sentiment class is determined by thresholding it. Following the rule from [11], the threshold values are -1, -0.05, 0.05 and 1. When the score is greater than -1 and less than or equal to -0.05, the tweet is considered negative. When the score is greater than or equal to 0.05 and less than 1, the tweet is considered positive. A tweet with a score strictly between -0.05 and 0.05 is considered neutral.
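The threshold rule can be written as a small function. This sketch assumes the compound score has already been computed by VADER; the function name and the `threshold` parameter are ours.

```python
def classify_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a sentiment label."""
    if not -1.0 <= compound <= 1.0:
        raise ValueError("compound score must lie in [-1, 1]")
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"
```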

Weighted-VADER and VADER methods are used to conduct a control experiment on the processed dataset.

After data cleaning, we obtain tweet text with less noise. For this experiment, the processed dataset is separated into two subsets based on the locale of the tweets (the UK or the US). Duplicate tweets are removed so that only unique tweets remain. For the UK and US tweets respectively, we analyze which vaccine brands are discussed most frequently and what opinions people hold toward them.

To determine the vaccine brands to analyze, we browsed news and other resources related to vaccine brands on social media platforms. We found 12 large enterprises that are currently making great progress in researching and manufacturing COVID-19 vaccines: Sinopharm, Sinovac, Cansinobio, Novavax, AstraZeneca, Johnson & Johnson (Janssen), Sanofi & GlaxoSmithKline, Moderna, Pfizer, Sputnik-V, Valneva and CureVac. The vaccines from these brands are already widely in use or ready to be used soon. We investigated which keywords are used most frequently on Twitter for each vaccine brand, since a tweet that talks about a brand is very likely to include one of that brand's keywords. For instance, when users talk about Johnson & Johnson on Twitter, keywords such as "johnson & johnson", "johnsonjohnson", "jnjnews", "janssen" and "johnsonandjohnson" are most likely to occur in the tweet text.

For each vaccine brand, we counted the number of tweets that mention it in order to analyze which brand is the most popular and widely known in the UK or US. Furthermore, we applied VADER to perform sentiment analysis on the tweets discussing each brand and recorded the number of tweets in each sentiment class (positive, negative, neutral). The purpose is to study people's opinions of each vaccine brand and to reveal whether people in the UK or US prefer or dislike a certain brand.
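The keyword matching and per-brand counting can be sketched as follows. Only the Johnson & Johnson keywords come from the text; the single-keyword entries for the other brands are placeholders assumed for illustration, and the function names are ours.

```python
from collections import Counter, defaultdict

BRAND_KEYWORDS = {
    # Keywords listed in the text for Johnson & Johnson.
    "Johnson & Johnson": [
        "johnson & johnson", "johnsonjohnson", "jnjnews",
        "janssen", "johnsonandjohnson",
    ],
    # Placeholder entries (assumed): the lowercase brand name only.
    "Pfizer": ["pfizer"],
    "Moderna": ["moderna"],
    "AstraZeneca": ["astrazeneca"],
}


def brands_mentioned(text, brand_keywords=BRAND_KEYWORDS):
    """Return the brands whose keywords occur in the tweet text."""
    lowered = text.lower()
    return [
        brand
        for brand, keywords in brand_keywords.items()
        if any(keyword in lowered for keyword in keywords)
    ]


def count_by_brand(labelled_tweets, brand_keywords=BRAND_KEYWORDS):
    """Count tweets per brand and sentiment label.

    labelled_tweets: iterable of (text, label) pairs, where label is
    'positive', 'negative' or 'neutral'.
    """
    counts = defaultdict(Counter)
    for text, label in labelled_tweets:
        for brand in brands_mentioned(text, brand_keywords):
            counts[brand][label] += 1
    return {brand: dict(counter) for brand, counter in counts.items()}
```

A tweet that mentions several brands is counted once for each of them, which is one possible design choice; the text does not say how such tweets were handled.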

In this part, a higher-quality dataset is required for the topic model. Beyond the general data cleaning methods, lemmatization enables the model to achieve better performance, because different forms of the same word cause misclassification. Consequently, NLTK [29] is used for lemmatization. Finally, we pruned the vocabulary by stemming each term to its root, removing stop words, and removing terms that are unrelated to the topic, such as "wa", "ha" and "would". The total vocabulary size in this part is 41645. After the data processing, we counted word frequencies and displayed the top 30 words in Fig. 3. A word cloud was also generated to analyze the main opinions.

To explore what users are concerned about on Twitter, we applied LDA to our clean corpus. To represent the whole content well, it is necessary to find an appropriate number of topics. We trained LDA models with topic numbers ranging from 5 to 40 and calculated each model's coherence, using "cv" coherence as the main indicator and "umass" coherence as a secondary reference. According to Fig. 4, the coherence score peaked at eight topics. However, an unexpected aggregation of topics appears in Fig. 5a. Coherence scores showed an overall downward trend but still fluctuated considerably. Consequently, an accepted method is to choose a local maximum that also has an even topic distribution. Finally, we chose 18 as the topic number (Fig. 5b), which reaches the highest point from 8-40.
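The "choose a local maximum" step can be sketched as a small helper that scans a table of coherence scores. The function names are assumptions; the check for an even topic distribution in pyLDAvis remains a manual, visual step and is not covered here.

```python
def local_maxima(coherence_by_k):
    """Return topic numbers whose coherence exceeds both neighbouring values."""
    ks = sorted(coherence_by_k)
    peaks = []
    for i in range(1, len(ks) - 1):
        left = coherence_by_k[ks[i - 1]]
        mid = coherence_by_k[ks[i]]
        right = coherence_by_k[ks[i + 1]]
        if mid > left and mid > right:
            peaks.append(ks[i])
    return peaks


def best_peak(coherence_by_k, k_min, k_max):
    """Pick the local maximum with the highest coherence inside [k_min, k_max]."""
    candidates = [k for k in local_maxima(coherence_by_k) if k_min <= k <= k_max]
    if not candidates:
        return None
    return max(candidates, key=lambda k: coherence_by_k[k])
```

Restricting the search range lets an analyst skip a global peak whose topics are known to overlap and select the next-best local maximum instead.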

Besides, due to the complexity of the entire tweet, the information obtained is very general. To analyze more precisely the attitudes towards vaccines in different periods, countries, and emotions, we divided the dataset and modeled them from the following aspects:

• Time Series analysis: We assumed the topics change slightly over time, which is suitable for dynamic topic modelling. Therefore, the whole dataset was split into three time-slices to feed the dynamic topic model. And we explored the word changes in each topic.
• Country analysis: To investigate the different concerns between the UK and US, we divided tweets into two parts according to their countries and analyzed them separately using LDA model.
• Sentiment analysis: Because different emotions express different themes, we divided the dataset into positive and negative subsets and performed LDA analysis on each of them to infer the effect of the vaccine.

Although the data are not evenly distributed across months, we believe they are sufficient to produce valid analysis results. Because the data are randomly sampled from Twitter, the distribution of the dataset over time can be considered the same as the actual distribution. Fig. 6a and 6c show that the cumulative number of tweets has increased since the beginning of the pandemic. In the plot, the US has a significant peak in December 2020, while the UK has peaks in May 2020 and January 2021. The US did not have a peak at the beginning of the epidemic, which may be due to the government's cover-up and neglect of the COVID-19 epidemic, as well as the influence of the anti-vaccination movement in the US [30].

In Fig. 6a and 6c, we can also observe that only the United States consistently has a larger percentage of positive tweets than of tweets with other attitudes in all months. Notably, approximately 40% of the population in both countries held a negative attitude towards COVID-19 vaccines by October 2020, which is not a good sign for vaccine acceptance. After vaccination started in December 2020, a decrease of 20% appeared in the UK, but a decline of only 10% in the US. Governments should focus on the causes of negative attitudes towards the vaccine and formulate corresponding publicity and education policies so that mass vaccination can be completed faster. As for the results of the Weighted-VADER model in Fig. 6b and 6d, a comparison with Fig. 6a and 6c shows that the influence of more influential users (the peak values in the Ratio-Date graph) explains well the turning points in the proportions of different opinions in the unweighted sentiment analysis. In other words, people with public influence should play a positive role in vaccination promotion. In Fig. 6b (Number-Date), we can see a substantial positive peak in December 2020, which we believe is due to publicity from some influential official accounts, resulting in a sharp rise in the overall weighted number. In Fig. 6a and 6c (Score-Date), the overall sentiment trend is not apparent, but in Fig. 6b and 6d, different trends can be observed. The British people's attitude towards vaccines changed more dramatically (-0.55 to 0.75) than the American people's (-0.3 to 0.55).

The results of the vaccine brand analysis are shown in Table I. Table Ia shows that, in this dataset, nearly 55% and 30% of the brand-related COVID-19 vaccine tweets talk about Pfizer and AstraZeneca, respectively, which makes these two brands the most popular in the UK. Table Ib shows that Pfizer and Moderna are the most popular vaccine brands in the US, with approximately 57% and 31% of tweets talking about them, respectively. In both the UK and the US, Pfizer is the most popular vaccine brand.

In terms of the sentiment analysis, we can see a majority of tweets express positive and neutral opinions toward vaccine brands. Furthermore, the proportion of positive tweets is greater than the neutral tweets for most brands. It shows that in the UK and US, the attitude of most people toward different brands of COVID-19 vaccine is positive.

The word frequency and word cloud statistics for the entire dataset are shown in Fig. 3 and Fig. 5, respectively. The LDA analysis for the whole text is shown in Table II. The table shows that the second and third themes take up 15% of the total tokens and include the words "token", "people", "vaccine", "immunity" and "safety". Based on this, we inferred that most people accept vaccination because they believe it has positive effects. In addition, several words, such as "medical", "worker", "get" and "vaccine", appear in the fifth topic. This suggests that tweets discuss healthcare workers as the first group to be vaccinated. The eighth topic contains Pfizer, Moderna and AstraZeneca, which are frequently reported in the UK news; this agreement supports our interpretation of the experimental results. The dynamic topic model shows that people's concerns remained largely stable over the period. Words such as "get", "worker" and "health" gained importance in the later stages of the pandemic, while the word "distribution" ranked low at first but started climbing in the middle of the period. These results indicate growing concern about the distribution of vaccines. From the country perspective, we found that British people worried about vaccine shortages, whereas Americans are keen to talk about vaccination. Most US residents report no adverse reactions after vaccination, and they have faith in vaccination. Citizens of both countries are greatly concerned about their governments' vaccine allocation.

Fig. 4: Coherence values. Topic coherence measures how coherent the topics inferred by a model are. Left: the c_v measure, based on a sliding window, uses normalized pointwise mutual information (NPMI) and cosine similarity. Right: the u_mass measure is based on document co-occurrence counts, a segmentation and a logarithmic conditional probability.

Fig. 6: Sentiment analysis by two models on the USA and GB Twitter text data. Each of the four result graphs is composed of three sub-graphs: number of tweets by label, ratio of tweets by label, and average compound score. The left column shows the results produced by VADER; the right column shows those produced by Weighted-VADER, where the number of tweets is the count after weight adjustment, so its value is much larger than the raw count.
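The u_mass coherence mentioned in the Fig. 4 caption can be illustrated with a small sketch. It assumes the common formulation based on document co-occurrence counts, with add-one smoothing and a mean over word pairs; the smoothing constant and the aggregation are assumptions and may differ from the library settings used in the experiments. The function name and toy documents are hypothetical.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents, smoothing=1.0):
    """Mean of log((D(w_i, w_j) + smoothing) / D(w_j)) over word pairs,
    where w_j precedes w_i in the topic's ranked top-word list and
    D(...) counts the documents containing all the given words."""
    doc_sets = [set(doc) for doc in documents]

    def doc_count(*words):
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    scores = []
    for j, i in combinations(range(len(top_words)), 2):  # always j < i
        w_j, w_i = top_words[j], top_words[i]
        d_j = doc_count(w_j)
        if d_j == 0:
            continue  # the conditioning word never occurs; skip the pair
        scores.append(math.log((doc_count(w_i, w_j) + smoothing) / d_j))
    return sum(scores) / len(scores) if scores else float("nan")


# Toy tokenised documents for illustration only; not data from the study.
docs = [
    ["vaccine", "safe", "get"],
    ["vaccine", "safe"],
    ["vaccine", "death", "news"],
    ["lockdown", "mask"],
]
coherent = umass_coherence(["vaccine", "safe"], docs)
incoherent = umass_coherence(["vaccine", "mask"], docs)
```

Words that tend to appear in the same documents give values closer to zero, while words that rarely co-occur give more negative values, which is why a higher u_mass score is read as a more coherent topic.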

In the topic model for positive-sentiment tweets, the top-ranked topic, covering nearly 9% of tokens, describes the vaccine as effective. A further 8.5% of tokens belong to a topic that mentions "free" and "safe" after vaccination. In the topic model for negative-sentiment tweets, words such as "death", "case" and "news" appear in topic 2, which suggests that news of deaths following vaccination has caused anxiety about the COVID-19 vaccine. The third topic reflects that vaccine shortage is still a severe problem. The rest of the negative sentiment mainly comes from events not directly related to the COVID-19 vaccine, such as lockdowns, mask-wearing and the damage caused by the virus.
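The text does not say how the per-topic token percentages were computed. One common definition, sketched here under that assumption, weights each document's topic proportions by the document's length and normalizes by the total number of tokens. The function name and the toy numbers are hypothetical.

```python
def topic_token_shares(doc_topic, doc_lengths):
    """Estimate the share of corpus tokens attributed to each topic.

    doc_topic[d][k] is the proportion of document d assigned to topic k
    (each row sums to 1); doc_lengths[d] is the token count of document d.
    """
    n_topics = len(doc_topic[0])
    totals = [0.0] * n_topics
    for weights, length in zip(doc_topic, doc_lengths):
        for k, weight in enumerate(weights):
            totals[k] += weight * length  # expected tokens of topic k in doc d
    corpus_tokens = sum(doc_lengths)
    return [total / corpus_tokens for total in totals]


# Toy example for illustration only: three documents, two topics.
doc_topic = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
doc_lengths = [10, 20, 10]
token_shares = topic_token_shares(doc_topic, doc_lengths)
```

Under this definition, a statement such as "nearly 9% of tokens" corresponds to a topic whose entry in `token_shares` is close to 0.09.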

The research dataset in our experiment is limited. From the tweets posted by UK and US residents and obtained via the Twitter API, we kept only those containing the keyword "vaccine". If more research time were available, we would like to carry out cross-platform research, for example by combining Instagram, YouTube and Twitter. We would also apply multi-modal sentiment analysis, analyzing text, audio and video simultaneously. A multi-modal method could give more comprehensive and accurate results on people's attitudes towards vaccines.

Social media are characterized by high interaction, rapid spread and immediate change. Moreover, Twitter content is unpredictable: the hashtags and keywords associated with the COVID-19 vaccine can change at any time depending on the epidemic and on vaccine development. Therefore, our results on this topic are time-sensitive and are only relevant to tweets posted during our research period.

We did not further improve the sentiment analysis or the LDA model; the experimental models are based on existing libraries. Given the opportunity, we would compare the accuracy of different models and attempt to improve the accuracy of both the sentiment analysis and the topic modelling.

In conclusion, it is important to analyze individuals' sentiments towards the COVID-19 vaccine through their social media comments during the outbreak of the coronavirus disease. To analyze residents' attitudes towards the COVID-19 vaccine in the UK and US, we designed and executed a series of experiments on a dataset collected from Twitter. We found that the VADER and Weighted-VADER methods support both quantitative and qualitative analysis. Public attitudes towards vaccines have improved sharply following the rapid progress of vaccination in the UK and the US, but a significant proportion of people still hold negative attitudes. Among the different brands of COVID-19 vaccine, Pfizer is the most talked about in the UK and US. Most citizens hold a positive view of the COVID-19 vaccine. Through the LDA analysis, we found that most people hope to be vaccinated and feel well after receiving the vaccine. Public concerns about vaccines stem mainly from deaths among vaccinated people. Among the remaining tweets with negative sentiment, few criticize the vaccine directly; they merely mention the vaccine when complaining about the damage caused by the virus. Other topics of concern are vaccine distribution, the relationship between schools and vaccines, and appreciation of vaccines.

In future work, we will redesign the functions within the VADER tool for sentiment analysis, with the aim of further improving the accuracy of sentiment recognition. Meanwhile, our team will collect more tweets for model training. We will also do further research on the LDA model.
