key: cord-0999317-k5qecdt8 authors: Mahdikhani, Maryam title: Predicting the popularity of tweets by analyzing public opinion and emotions in different stages of Covid-19 pandemic date: 2021-12-17 journal: International Journal of Information Management Data Insights DOI: 10.1016/j.jjimei.2021.100053 sha: e51dafb28deda8cb7f08e57a291da319a0bb17ca doc_id: 999317 cord_uid: k5qecdt8 In this study, public opinion and emotions regarding different stages of the Covid-19 pandemic from the outbreak of the disease to the distribution of vaccines were analyzed to predict the popularity of tweets. More than 1.25 million English tweets were collected, posted from January 20, 2020, to May 29, 2021. Five sets of content features, including topic analysis, topics plus TF-IDF vectorizer, bag of words (BOW) by TF-IDF vectorizer, document embedding, and document embedding plus TF-IDF vectorizer, were extracted and applied to supervised machine learning algorithms to generate a predictive model for the retweetability of posted tweets. The analysis showed that tweets with higher emotional intensity are more popular than tweets containing information on Covid-19 pandemic. This study can help to detect the public emotions during the pandemic and after vaccination and predict the retweetability of posted tweets in different stages of Covid-19 pandemic. The coronavirus pandemic, also known as Covid-19, began in December 2019 when several patients from Wuhan Hubei province in China reported severe health symptoms. Since then, Covid-19 has spread across the globe. According to the World Health Organization (WHO) report on July 14 th , 2021, there have been 187,519,798 cases of Covid-19, including 4,049,372 deaths 1 . In the very early stages of the pandemic, the WHO advocated for isolation and self-quarantine of affected individuals to reduce the number of cases and mortality rates, leading to the largest lockdown in history. Spending time at home and searching for Covid-19-related news became a common preoccupation, and many turned to social media platforms such as Twitter, which became one of the most important means of sharing information and expressing feelings regarding Covid-19 (Mohammed & Ferraris, 2021; Su et al., 2021; Younis et al., 2020) . Twitter users can "retweet" or forward a posted tweet to their network, which speeds up the information sharing process. Thus, retweets can represent Twitter users' interests on a large scale. The popularity of tweets is measured by their content and the volume of retweets. Shahi et al., (2021) conducted an exploratory study to examine the sources, spread, and content of misinformation in tweets related to the Covid-19 pandemic. Yousefinaghani et al., (2021) examined the content of four million tweets to learn about public opinion regarding the Covid-19 vaccine. Using Twitter data from several mega-cities worldwide, Yao et al., (2021) employed machine learning techniques to analyze the public's response to the Covid-19 pandemic. To the best of our knowledge, none of the previous studies have investigated the patterns in public responses to the pandemic from its onset to vaccine distribution by analyzing the content of tweets and predicting the popularity of tweets. 
This study addresses this gap by collecting tweets generated from January 2020 to May 2021 and by analyzing public opinions and emotions with advanced machine learning techniques, including latent Dirichlet allocation (LDA) topic modeling (Blei et al., 2003) and the CrystalFeel algorithm (Gupta & Yang, 2018). More importantly, the extraction of different categories of content features and the building of a predictive model that assesses the popularity of tweets by using the number of retweets (based on the content of posted tweets) is another gap in the literature that we address in this study. The research objectives for this study are as follows: (i) detecting public emotions in different stages of the Covid-19 pandemic using Twitter data; (ii) exploring the dominant English topics related to Covid-19 on Twitter and the sentiment associated with them; and (iii) building a predictive model for the retweetability of posted tweets based on their content. Furthermore, the contribution of this study to the literature can be summarized as follows: (i) Analyzing 1,251,216 randomly selected tweets from January 20, 2020, to May 29, 2021, which include tweets from the early stages of the pandemic to tweets related to the distribution of vaccines, helps in understanding public opinions and emotions regarding the ongoing Covid-19 pandemic. (ii) This study applied LDA topic modeling and the CrystalFeel algorithm to detect four basic emotions (fear, anger, joy, and sadness) at different stages of the pandemic. (iii) The proposed approach extracts five different sets of content features from the posted tweets and applies them to three base supervised machine learning algorithms and an ensemble voting classifier to predict the retweetability of the posted tweets. (iv) The experimental results are then compared using four metrics, namely accuracy, F1-score, recall, and precision, to choose the model with the highest performance. The study further compared the execution time for running each model to choose the most efficient model. This study is organized as follows: Section 2 reviews the literature, specifically the background on the impact of social media and Twitter during the pandemic. The research methodology is introduced in Section 3. The experimental design and analysis, along with the models' results, are discussed in Section 4. The discussion and the implications of the research are presented in Section 5. The conclusions and limitations of our work are discussed in Section 6. During the Covid-19 pandemic, social media platforms such as Facebook, Instagram, TikTok, and Twitter became even more important as a means to interact and connect with others. Visits to Twitter increased by 36 percent in 2020 compared with the previous year, and users in the United States spent an average of 32.7 minutes on the platform per day. Access to large datasets on various platforms offers opportunities for scholars to use advanced computational science to gain insights (Kar & Dwivedi, 2020). For instance, Mishra et al., (2019) applied term frequency-inverse document frequency (TF-IDF) and cosine similarity to hotel reviews to generate a recommendation system for suggesting suitable hotels to customers. Chintalapudi et al., (2021) analyzed medical records from digital health systems from 2018 to 2020 by implementing a text mining approach to gain insights into improving healthcare quality and assessing patient feedback.
Rajendran & Sundarraj, (2021) conducted experiments in two domains including movies and restaurants to gather users browsing history, generate topics by using Latent Dirichlet Allocation (LDA) models, and extract user preferences by enhancing recommendation algorithm. Mishra et al., (2020) also used the reviews data to apply sentiment intensity analyzer and generate a recommendation system for tourist point of interest. This research contributes to two research streams, including the impact of media, and particularly Twitter during pandemics, and retweeting behavior based on the content of tweets. Regarding the first research stream, Odlum & Yoon., (2015) studied the use of Twitter during the Ebola outbreak to monitor information sharing among users and examine the users' behavior and their knowledge of the disease during the pandemic. The result of this study revealed the pattern in the spread of information among the public and highlighted the value of Twitter as a tool for spreading public awareness. Lazard et al., (2015) used textual analysis to examine public concerns about the Ebola virus and interest in safety information. The study highlighted the efficiency of using Twitter in public health communication. Jain & Kumar., (2015) examined the use of Twitter in the 2015 H1N1 pandemic (also known as Swine flu) to create an inspection system by analyzing information relevant to Influenza (H1N1) and enhancing public awareness in India. They classified tweets as either relevant or irrelevant to studying public opinion regarding H1N1. Their results highlighted the importance of social media for tracking a disease. Szomszor et al., (2011) analyzed tweets and online media related to the Swine flu pandemic of 2009 to identify the popularity of true information. They found that poorly represented scientific information can still be shared in public and cause harm. Furthermore, several studies have examined Twitter content to analyze how the public expresses their feelings at the onset of pandemics (Baboukardos et al., 2021; Garcia & Berton, 2021a; S. Kaur et al., 2020; Ridhwan & Hargreaves, 2021) . By following a quasi-inductive approach, Mittal et al., (2021) found that the majority of Twitter users tend to share positive content regarding the lockdown but their opinions could swing over the course of pandemic based on recent developments. Some studies analyzed tweets with a focus on the public's emotions during the Covid-19 pandemic (Gupta et al., 2021; Kabir & Madria, 2021a) , while others focused on public opinions following the rollout of Covid-19 vaccines (Sv et al., 2021; Yousefinaghani et al., 2021) . Kabir & Madria., (2021b) developed a neural network model to automatically detect a variety of emotions in tweets on Covid-19. They randomly selected ten thousand tweets in English from the United States for their analysis, and their results showed that negative emotions increased during the pandemic. Kaur et al., (2021) discussed the use of advanced machine learning tools to predict and analyze the impact of quarantine during Covid-19 pandemic. Rustam et al., (2021) identified sentiments regarding Covid-19 from tweets using a supervised machine learning approach to understand how people made informed decisions on how to handle their circumstances during the pandemic. 
Mishra et al., (2021) used an LDA model on almost 20,000 tweets from the tourism sector, covering the hospitality and healthcare subdomains, during the Covid-19 pandemic to identify frequent terms and applied state-of-the-art deep learning algorithms to generate a robust sentiment prediction model. This study contributes to this research stream by analyzing 1,251,216 Covid-19-related tweets from January 20, 2020, to May 29, 2021, to investigate Twitter users' opinions and feelings about the Covid-19 pandemic during its different phases, including the early stage of the disease, the lockdown, and the period after the distribution of vaccines. Several studies have contributed to this field by proposing methods for predicting the results of important events, such as games and political elections, using data on the volume of retweets (Abdullah et al., 2015; Liang et al., 2016). Some studies explored the reasons why users retweet certain information without applying machine learning techniques for prediction. Boyd et al., (2010) empirically examined several case studies on Twitter to understand and analyze the motivations behind retweeting behavior. Their study highlighted that bias in interpreting tweets caused the spread of false information on Twitter. Kwak et al., (2010) studied the impact of retweeting on information sharing. To evaluate the popularity of tweets, they ranked users based on their number of followers and followings compared to the volume of retweets. The results of this study showed that the volume of retweets driven by a tweet's content has a stronger impact than the number of people who follow the user's Twitter account. Naveed et al., (2011) examined the impact of a tweet's content on its retweet volume. They analyzed two different levels of content-based features in tweets and predicted the retweetability of a given tweet. Guidry et al., (2014) analyzed the content of 3,415 Twitter updates for 50 nonprofit organizations to examine which type of content is likely to be retweeted and to learn how to engage audiences and facilitate discussions. Marino & lo Presti, (2018) examined the content of tweets of European Commissioners and proposed a retweetability rate to measure citizen engagement based on the content on social media in response to certain events. Chung et al., (2020) collected tweets from Women Who Code (WWC) over a one-year period to examine whether certain content and features such as hashtags and photos resulted in differences in retweet volume. Rao et al., (2020) studied the alarming vs. reassuring retweet distribution patterns related to Covid-19. To the best of our knowledge, none of the Covid-19-related studies used an advanced machine learning predictive model to examine the retweetability of tweets based on content. Neogi et al., (2021) generated models to categorize and analyze sentiments based on a collection of tweets pertaining to the protests of Indian farmers. We contribute to this research stream by examining content-based features for predicting the popularity of tweets based on the volume of retweets during the Covid-19 pandemic. Recently, several studies adopted topic modeling analysis on tweets and online discussion forums to identify public concerns. They adopted the LDA technique by defining 50 topics and reviewing the top ten words associated with each topic. Lwin et al., (2020a) examined worldwide trends of four basic emotions (i.e., fear, anger, sadness, and joy) during the pandemic by analyzing more than 20 million tweets from January 28 to April 9, 2020.
They adopted a lexical approach by using the CrystalFeel algorithm and used "wuhan", "corona", "nCov", and "Covid" as search keywords to generate word clouds related to emotions. Cinelli et al., (2020) collected data related to Covid-19 on Twitter, Instagram, YouTube, Reddit, and Gab to examine public engagement on the topic of Covid-19. They extracted all the topics related to Covid-19 by generating word embeddings for the text corpus and then analyzed the topics. This study contributes to the literature by employing the LDA algorithm to identify the most popular topics related to Covid-19 as content features and by applying them to the CrystalFeel algorithm to examine the public's basic emotions about the Covid-19 pandemic. In this study, the primary objective was to identify public concerns and basic emotions related to the Covid-19 pandemic in its early stages, during the pandemic, and in the post-pandemic phases. Five sets of content features, including topic modeling, topics plus the TF-IDF vectorizer, BOW by the TF-IDF vectorizer, document embedding, and document embedding plus the TF-IDF vectorizer, are then selected. The five sets of features are applied as inputs for the selected classifiers to compare the accuracy of the prediction performance of tweet popularity based on the volume of retweets. To implement this study, a subset of a dataset of tweets related to Covid-19, collected by Chen et al., (2020a) from January 20, 2020, to May 29, 2021, was examined. English tweets for each month were randomly chosen, narrowing the dataset down to 1,251,216 tweet IDs. The tweet IDs were then hydrated into the tweets' complete information by using the Hydrator software. A laptop with a quad-core i7-8750H processor and 16X PCI-e lanes was used for analyzing the data. The following table shows the relevant information about the dataset and an example of one unique record. The data were imported into the Python console by using the numpy, nltk, and pandas packages. In Table 1, the user ID represents a unique identifier for the tweet's author, and EN in our dataset refers to English. Furthermore, the number of tweets issued by the user ID is shown as the user status count, which describes the user's activity on Twitter. The number of times that the tweet is shared with the user ID's network is described as the retweet count. The raw texts were further cleaned by removing punctuation, usernames, URL links, numbers, pictures, and emojis, and converted to lowercase. Furthermore, stop words such as "the", "of", "in", and "at" were removed. Cleaned tweets were then tokenized from sentences into words for further analysis. To measure the popularity of tweets based on the volume of retweets, we considered tweets that had at least one retweet during the period from January 20, 2020, to May 29, 2021. This categorization defines the binary response variable used in the subsequent analysis (a minimal sketch of these preprocessing and labeling steps is given at the end of this subsection). Five different categories of features were chosen for this study: (i) topic modeling, (ii) topic modeling plus TF-IDF vectorizer, (iii) BOW by TF-IDF vectorizer, (iv) document embedding, and (v) document embedding plus TF-IDF vectorizer. The following subsections will cover each set of content features, particularly topic modeling and how basic emotions related to Covid-19 were detected using the CrystalFeel algorithm.
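As a concrete illustration of the cleaning and labeling steps described above, the following is a minimal sketch in Python. It assumes a pandas DataFrame `df` of hydrated tweets with illustrative column names `text` and `retweet_count`; these names and the regular expressions are assumptions, not the author's exact code.

```python
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Requires nltk.download("stopwords") and nltk.download("punkt") beforehand.
stop_words = set(stopwords.words("english"))

def clean_tweet(text):
    """Strip URLs, usernames, punctuation, numbers, and emojis; lowercase;
    tokenize; and drop stop words, as described in the text."""
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # URL links
    text = re.sub(r"@\w+", " ", text)               # usernames
    text = re.sub(r"[^a-zA-Z\s]", " ", text)        # punctuation, numbers, emojis
    tokens = word_tokenize(text.lower())
    return [t for t in tokens if t not in stop_words]

df["tokens"] = df["text"].apply(clean_tweet)
df["cleaned"] = df["tokens"].apply(" ".join)

# Binary response variable: tweets with at least one retweet are labeled popular.
df["popular"] = (df["retweet_count"] >= 1).astype(int)
```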
Due to the large volume of tweets and retweets, topic modeling was used to classify the text data pertaining to Covid-19 based on the frequency of words in each document. The latent Dirichlet allocation (LDA) model (Blei et al., 2003) was applied to identify the most popular topics in tweets related to Covid-19. The LDA model is an unsupervised machine learning algorithm that detects a certain number of topics within documents with a certain probability. Note that each topic is also represented as a probabilistic distribution over words. LDA models a corpus $D$ consisting of $M$ documents, where each document $d$ contains $N_d$ words. In the generative process, the probability of the observed data $D$ is computed as follows:

$$p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d$$

In the above equation, $\alpha$ and $\beta$ are the corpus-level Dirichlet parameters, $\theta_d$ is the document-level topic distribution, and $z_{dn}$ and $w_{dn}$ are the word-level latent topic assignments and observed words, respectively. This research aimed to find the optimal number of topics within the documents by calculating the coherence score, referred to as the $C_v$ score (Röder et al., 2015), which measures the coherence of the topics by the normalized pointwise mutual information (NPMI) metric. NPMI is defined as follows:

$$\mathrm{NPMI}(w_i, w_j) = \frac{\mathrm{PMI}(w_i, w_j)}{-\log\bigl(p(w_i, w_j) + \epsilon\bigr)}$$

where the topic coherence is automatically computed from the pointwise mutual information (PMI) metric as follows:

$$\mathrm{PMI}(w_i, w_j) = \log \frac{p(w_i, w_j) + \epsilon}{p(w_i)\, p(w_j)}$$

Given the size of the dataset in this study, applying the LDA model was one of the most effective methodologies for extracting the features. In this study, Python scikit-learn's LatentDirichletAllocation function is used with a learning decay of 0.85. Learning decay is a parameter for controlling the learning rate, and its value must be set between 0.5 and 1 to guarantee asymptotic convergence. Fig. 1 shows the optimal number of topics along with the coherence score for the whole dataset. A higher value of the coherence score indicates a better number of topics within the documents. The highest coherence value is 0.6088, indicating 38 topics for the whole dataset. Previous studies analyzed the four emotions in different periods of the pandemic using the CrystalFeel algorithm (Garcia & Berton, 2021b; Lwin et al., 2020b; Shah et al., 2021), which has been shown in recent works to be accurate. In this study, the emotional strength scores of the CrystalFeel algorithm (R. K. Gupta & Yang, 2018) were used to label the dominant emotions of fear, anger, sadness, and joy at different phases of the pandemic according to the timeline of WHO tweets and U.S. news during the ongoing Covid-19 pandemic. In the CrystalFeel algorithm, topics are labeled based on the emotion score (i.e., emotional valence, which refers to the polarity of feelings) in three different categories: (i) no specific emotion; (ii) if the valence score is higher than 0.520, the emotion category is "joy"; (iii) if the valence score is lower than 0.480, the emotion category is (1) "anger" if and only if the anger intensity score is higher than both the fear and sadness intensity scores, (2) "fear" if and only if the fear intensity score is higher than both the anger and sadness intensity scores, and (3) "sadness" if and only if the sadness intensity score is higher than both the anger and fear intensity scores (Garcia & Berton, 2021b). Fig. 4 illustrates Algorithm 1. The results of the CrystalFeel analysis from January 2020 to May 2021 are shown in Table 2. For each month, the LDA algorithm was applied to the randomly selected tweets, and then the top ten words for each topic were extracted and used as inputs for the CrystalFeel algorithm. Furthermore, Fig. 5 shows the timeline of the Covid-19 pandemic based on selected WHO tweets and U.S. news.
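The emotion-labeling rule described above can be expressed compactly. The sketch below assumes the valence and intensity scores have already been obtained from the CrystalFeel service; the function is hypothetical and only reproduces the thresholding logic reported in the text, not the CrystalFeel algorithm itself.

```python
def emotion_category(valence, anger_intensity, fear_intensity, sadness_intensity):
    """Map CrystalFeel-style scores to one of the categories described above."""
    if valence > 0.520:                       # positive valence -> joy
        return "joy"
    if valence < 0.480:                       # negative valence -> dominant negative emotion
        intensities = {"anger": anger_intensity,
                       "fear": fear_intensity,
                       "sadness": sadness_intensity}
        return max(intensities, key=intensities.get)
    return "no specific emotion"              # valence in [0.480, 0.520]

# Example: strongly negative valence with fear as the dominant intensity
print(emotion_category(valence=0.31, anger_intensity=0.42,
                       fear_intensity=0.67, sadness_intensity=0.55))  # -> fear
```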
N-gram analysis for extracting features is one of the most reliable, efficient, and fastest techniques for text classification. The process starts by preprocessing the language documents, removing unnecessary information (e.g., punctuation, numbers, and tags) while keeping the necessary terms. N-grams are sequences of words from the documents, and "N" corresponds to the window size of the words in the text analysis. In this study, the window size of the word sequences for the n-gram analysis is one, i.e., a bag of words, which generates the vocabulary list of all the unique words and their frequencies in the documents. To enhance the performance of the classification models, the TF-IDF vectorizer was used to weight the n-gram profiles (Hassan et al., 2020; Nasser et al., 2021). The highest TF-IDF weight occurs when a word has a high term frequency (TF) in a given tweet and a low document frequency (DF) in the entire dataset. In this study, the TF-IDF weighting method introduced by Salton & Buckley (1988) was applied to the documents; it is an older method compared with the other aforementioned features. The TF-IDF method assumes that the important words in a given document appear frequently in that document but rarely in other documents, which aids in recognizing meaningless terms. Doc2vec, or document embedding, is the extension of word embedding for text analysis. Word2Vec converts tokenized words into vectors that represent the vocabulary of the texts within documents. Word2Vec enables exploration of the correlations among words and their contextual information and constructs the network of words. Doc2vec builds a numerical representation of a document, in which a group of words is treated as a unique document, to achieve sentence embedding. Thus, when training Word2Vec (Mikolov et al., 2013), Doc2vec is also trained. One of the main learning algorithms for Doc2vec that is implemented in this research is the distributed bag-of-words version of the paragraph vector (PV-DBOW), which is based on the skip-gram model. In PV-DBOW, each text is associated with a specific paragraph vector, and each word is associated with a specific word vector across the whole dataset. The gensim package was further imported into Python to create the document-to-vector model, learn the network of documents, and detect similar tweets based on the vector distance (a sketch of this feature extraction is given below).
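To make the feature sets concrete, the sketch below assembles three of them, the TF-IDF bag of words, the LDA topic proportions (using the 38 topics and learning decay of 0.85 reported earlier), and their concatenation, plus a PV-DBOW document embedding with gensim. It reuses the `df["cleaned"]` column from the earlier preprocessing sketch; the `min_df`, `vector_size`, `window`, and `epochs` values are illustrative assumptions rather than the paper's tuned settings.

```python
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

texts = df["cleaned"].tolist()     # cleaned tweet strings from the preprocessing sketch

# Bag of words weighted by TF-IDF (unigrams, i.e., window size one)
tfidf = TfidfVectorizer(ngram_range=(1, 1), min_df=5)
X_tfidf = tfidf.fit_transform(texts)

# Topic features: LDA fitted on raw term counts (LDA expects counts, not TF-IDF weights)
counts = CountVectorizer(min_df=5)
X_counts = counts.fit_transform(texts)
lda = LatentDirichletAllocation(n_components=38, learning_method="online",
                                learning_decay=0.85, random_state=42)
X_topics = lda.fit_transform(X_counts)             # document-topic proportions

# "Topics plus TF-IDF vectorizer": concatenate the two representations
X_topics_tfidf = hstack([csr_matrix(X_topics), X_tfidf]).tocsr()

# Document embedding with PV-DBOW (dm=0 in gensim's Doc2Vec)
tagged = [TaggedDocument(words=t.split(), tags=[i]) for i, t in enumerate(texts)]
d2v = Doc2Vec(tagged, dm=0, vector_size=100, window=5, min_count=5, epochs=20)
X_doc2vec = np.vstack([d2v.dv[i] for i in range(len(texts))])   # gensim >= 4 uses .dv
```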
The scikit-learn package in Python 3.8 was used to implement three base and effective supervised machine learning algorithms, (i) the random forest (RF) classifier (Breiman, 2001), (ii) the stochastic gradient descent (SGD) classifier (Zhang, 2004), and (iii) the logistic regression (LR) classifier (Hosmer Jr, 2013), along with an ensemble voting classifier of the three machine learning algorithms (i.e., RF, SGD, and LR) to enhance accuracy and reduce the classifiers' error rates. Each classifier and the ensemble approach are explained in detail in the following subsections. Note that, in this study, the ensemble voting classifier is referred to as EVC. The random forest classifier is a supervised machine learning algorithm. It consists of tree classifiers where each tree is grown with a random vector that is distributed independently and identically, and each tree casts a vote for the most popular class of the input vectors (Breiman, 2001). After creation, the RF classifier can be split into two stages: random forest creation and prediction from the created RF classifier (Biau & Scornet, 2016). The algorithm has the following steps. Step 1: RF randomly selects "k" features from a total of "m" features, where k < m. Step 2: RF calculates the node "d" among the "k" features using the best split point. Step 3: RF uses the optimal split by breaking the node into child nodes. Step 4: RF repeats steps 1 to 3 iteratively until the number of nodes reaches the maximum allocated value. Step 5: RF builds a forest by repeating steps 1 to 4 "n" times to create "n" trees. In this study, the RF classifier's accuracy was compared with the accuracy of the stochastic gradient descent (SGD), logistic regression (LR), and ensemble voting classifier (EVC). The stochastic gradient descent (SGD) classifier is a supervised machine learning algorithm and is a very powerful classifier for building a predictive model (Zhang, 2004). The algorithm has the following steps. Step 1: SGD computes the gradient of the loss function with respect to each feature. Step 2: SGD selects a random initial value for the parameters. Step 3: SGD updates the gradient function by allocating the parameter values. Step 4: SGD calculates the step size for each feature with respect to the learning rate of the algorithm. Step 5: SGD calculates the new parameters. Step 6: SGD repeats steps 3 to 5 until the gradient approaches zero. In the SGD classifier, the learning rate value has a significant impact on the behavior of gradient descent. Thus, the learning rate in the Python code is set to "optimal" and the loss function is set to "log", which gives logistic regression, a probabilistic classifier. The log loss function gives the probability of false classifications (Rustam et al., 2021) and can be defined as:

$$\mathrm{LogLoss} = -\frac{1}{N} \sum_{i=1}^{N} \Bigl[ y_i \log p(y_i) + (1 - y_i) \log\bigl(1 - p(y_i)\bigr) \Bigr]$$

where $N$ is the number of instances, $y_i$ is the outcome of the $i$-th instance, and $p(y_i)$ is the probability of the $i$-th instance for the value $y_i$. The logistic regression (LR) classifier is a supervised machine learning algorithm that is used to model the probability of a binary classification problem (Hosmer Jr, 2013). The LR algorithm has the following steps. Step 1: the LR classifier implements the sigmoid function. The LR model predicts the binary outcome with the sigmoid function as follows:

$$p(y = 1 \mid x) = \frac{1}{1 + e^{-w^{\top} x}}$$

where $x$ is the input vector and $w$ is the vector of coefficients. Step 2: the LR classifier determines the cost function. Step 3: the LR classifier calculates and updates the new coefficients. The value of the coefficients is updated as follows:

$$w \leftarrow w - \eta \, \nabla_w J(w)$$

where $\eta$ is the learning rate and $J(w)$ is the cost function. Step 4: the LR classifier calculates the output with the highest probability. Step 5: the LR classifier repeats steps 1 to 4 and updates the model for each training instance in the dataset. In this classifier, scikit-learn's LogisticRegression uses "liblinear" as the solver parameter, a different algorithmic style for optimizing the loss function, and it supports both L1 and L2 regularization for penalizing the model complexity. Note that liblinear applies a trust region Newton method for the LR classifier (Galli & Lin, 2020; Lin et al., 2007). An ensemble approach is a combination of classifiers that improves the performance of a classification system (Li et al., 2007). Classic machine learning methods are trained by using one classification method on the dataset, while an ensemble approach is trained by using multiple classifiers. The error rate of an ensemble approach is lower than that of an individual classifier. To combine the decisions of RF, SGD, and LR, this study used soft voting in the ensemble approach (a sketch of the base classifiers' configuration is given below).
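A minimal sketch of how the three base classifiers described above could be instantiated in scikit-learn follows, using the settings mentioned in the text (logistic loss and the "optimal" learning-rate schedule for SGD, the liblinear solver for LR, and balanced class weights); the remaining hyperparameters are placeholders for the grid-searched values reported in Table 4.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier, LogisticRegression

# Random forest: ensemble of independently grown trees that vote on the class.
rf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=42)

# SGD with logistic loss (named "log" in older scikit-learn releases, "log_loss" in
# recent ones) and the "optimal" learning-rate schedule, as described in the text.
sgd = SGDClassifier(loss="log_loss", learning_rate="optimal",
                    class_weight="balanced", random_state=42)

# Logistic regression with the liblinear solver (a trust region Newton method).
lr = LogisticRegression(solver="liblinear", class_weight="balanced", random_state=42)
```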
A convex combination of the predicted class probabilities was applied over the individual classifiers. The weights of the classifiers summed to one, and the weighting was chosen based on the performance of each classifier due to its simplicity and accurate results (Pierola et al., 2016). In the soft voting approach, the predict_proba attribute is used to obtain the class probabilities, and the training set and data points are shuffled for the RF, SGD, and LR classifiers. Each classifier computes its prediction, and with the soft voting technique the probability-weighted vote is calculated for the final prediction (Kumari et al., 2021). Fig. 7 illustrates Algorithm 2 for the soft voting technique. As mentioned in the research method section, tweets related to the Covid-19 pandemic were collected using the Twitter APIs (Chen et al., 2020b) and keywords such as Covid, corona, pandemic, and similar terms. The study randomly chose 1,251,216 tweets written in English that were posted between January 20, 2020 and May 29, 2021. The tweets were labeled as popular and non-popular based on the number of retweets. Each of the classification models used a grid search to find the optimal hyperparameters. The grid search utilized the GridSearchCV object of scikit-learn in Python for all classification models. The results of the models were obtained using five-fold cross-validation with a split ratio of 0.75 to train the classifiers. The optimal hyperparameters for all the proposed classifiers are summarized in Table 4. Furthermore, to overcome the imbalanced data problem, the class weight for each classifier was modified such that a higher weight is given to the smaller class to produce optimal results. (For the ensemble voting classifier, Table 4 reports the flatten_transform parameter, which affects the shape of the transform output as a matrix of (n_samples, n_classifiers * n_classes), set to TRUE, and the weights parameter, which weights the class probabilities before averaging, set to [45, 35, 20].) The binary response variable in this study was popular versus non-popular tweets based on the volume of retweets, where tweets with at least one retweet were labeled as popular and tweets with no retweets were labeled as non-popular. Since there were 435,900 non-popular tweets and 815,316 popular tweets, this was an imbalanced dataset. To avoid misleading results due to the imbalanced dataset, an oversampling technique in which the minority class is duplicated was adopted to keep all the relevant information in the training set. Furthermore, three main sets of content features and their combinations were utilized as inputs for three robust and effective machine learning classifiers and an ensemble voting classifier suited to imbalanced datasets and were used to predict the retweetability. To enhance the performance of the classifiers, the feature_extraction module of the scikit-learn package in Python 3.8 was used to extract the lexical features and weight them using a TF-IDF vectorizer. The gensim package was then applied for Doc2vec and LDA, and the LatentDirichletAllocation function from the scikit-learn package was used for the topic analysis. The parameters of the classifiers were also adjusted to prevent poor results. All the classifiers were modified by setting the class weights to "balanced" in their cost function, so that the penalty for errors on the minority class is higher. The scikit-learn Python package provides the class weights for the classifiers. Furthermore, an ensemble voting classifier was applied to enhance the prediction accuracy and reduce bias and error rate (a minimal sketch of this voting setup is given below).
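The soft-voting ensemble and the GridSearchCV tuning described above might look like the following sketch. It reuses the `rf`, `sgd`, and `lr` estimators and the `X_topics_tfidf` feature matrix from the earlier sketches together with a label vector `y` (e.g., `df["popular"]`); the weight grid and scoring metric are illustrative assumptions, with [45, 35, 20] being the EVC weighting reported in Table 4.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Soft voting averages the classifiers' predicted class probabilities
# (predict_proba), weighted per classifier, to produce the final prediction.
evc = VotingClassifier(estimators=[("rf", rf), ("sgd", sgd), ("lr", lr)],
                       voting="soft", weights=[45, 35, 20])

X_train, X_test, y_train, y_test = train_test_split(
    X_topics_tfidf, y, train_size=0.75, stratify=y, random_state=42)

# Five-fold cross-validated grid search over the voting weights (illustrative grid).
grid = GridSearchCV(evc, param_grid={"weights": [[45, 35, 20], [1, 1, 1], [2, 1, 1]]},
                    cv=5, scoring="f1")
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```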
This study utilized an ensemble of random forest, stochastic gradient descent, and logistic regression by applying a soft voting technique. Furthermore, this research addressed two main components of generating a prediction model: first, tuning the hyperparameters of each base model, and second, weighting the base models by adopting a soft voting technique to create the prediction model; both are explained in the following sections. The classification models were trained for 250 epochs on a system with 32 GB of RAM. The GPU had 8 GB of RAM. The unsupervised machine learning algorithms took more than 30 hours to train. The supervised machine learning algorithms were efficient and took less time to run and provide outcomes. However, creating an ensemble voting classifier for each set of features took more time for both training and executing the models. By optimizing the hyperparameters with GridSearchCV and tuning the classification models, the performance improved and the runtime became more efficient. To evaluate the performance of the selected classifiers, four metrics were chosen: (i) accuracy, (ii) precision, (iii) recall, and (iv) F1-score. The accuracy score is the ratio of correct predictions to total predictions, and its range is between zero and one. The equation for accuracy is as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

The F1-score is the harmonic mean of the precision score and the recall score, and its value also lies between zero and one. The equation for the F1-score is as follows:

$$F_1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$

The execution time for running the classifiers was also used to compare and evaluate which classifier achieves more accurate results in a shorter time. In sum, the EVC achieved the highest accuracy compared with the RF, SGD, and LR classifiers for all five sets of features, particularly when using the topics plus TF-IDF vectorizer feature set, with a runtime of 12420.34 seconds. Table 5 also shows that although the RF, SGD, and LR classifiers had the shortest runtimes of all the models compared with the ensemble approach, their accuracy was not as high as that of the ensemble approach. Fig. 8 shows the F1-score for the four classifiers and all five sets of features. However, applying the ensemble approach with the soft voting technique increased the runtime for all five sets of features. The runtime of each model depends on the complexity of the base learners and the size of the dataset. Fig. 9 shows the comparison between the runtime of the models using the ensemble approach and the accuracy of the models. Among all the sets of features, topics plus TF-IDF vectorizer has the highest accuracy, and its runtime is relatively short compared with BOW by TF-IDF vectorizer. Inaccurate information related to the ongoing Covid-19 pandemic and the safety of vaccines and their side effects spread quickly through social media, especially via retweets on Twitter. Therefore, it has become more important to address misinformation (Budhwani & Sun, 2020; Forati & Ghose, 2021; Singh et al., 2020). Prior research has explored the essential characteristics of retweet prediction, including retweeting behaviors, emoji and playfulness engagement, and number of followers. However, there has been less progress in exploring the content of tweets and in predicting retweetability over the phases of the pandemic, from the initial spread of the virus to the distribution of vaccines. In this study, the content and popularity of tweets and public opinion and emotions were analyzed according to the number of retweets occurring during different phases of the Covid-19 pandemic.
Five different sets of content features (i.e., topic modeling, BOW by TF-IDF vectorizer, topics plus TF-IDF vectorizer, Doc2vec, and Doc2vect plus TF-IDF vectorizer) were selected, compared, and then used for three effective and robust classifiers, random forest, stochastic gradient descent, and logistic regression, and an ensemble voting classifier which is a meta classifier to evaluate and compare the outcomes. The results highlighted a strong support for the study's contributions by introducing a novel approach to extract the features from tweets and to predict their retweetability using supervised machine learning algorithms. The results of this study showed that topics plus TF-IDF vectorizers outperformed other sets of features for all the base classifiers and the ensemble voting classifier. The result of BOW by TF-IDF vectorizers as a content feature set was very close to topics plus TF-IDF vectorizers. One possible explanation is that all tweets pertained to Covid-19, so the performance of the basic text representation was close to that of topic modeling. Moreover, the results of all the experiments in this study confirmed that the EVC has the highest accuracy compared with the state-of-art methods. The results of this study have several theoretical and practical implications. To the best of our knowledge, this is the first study that used the most updated dataset that covers tweets from the onset of the pandemic to the distribution of vaccines. As such, this is the first study that utilized unsupervised machine learning algorithms such as LDA, and document embedding to extract the features and apply them to the supervised machine learning algorithms such as random forest, stochastic gradient descent, and logistic regression, and an optimal ensemble voting model of the selected classifiers to build a predictive model for their retweetability. Furthermore, by applying the LDA algorithm, the most popular topics for each month were identified. The CrystalFeel algorithm was employed to label the public emotions in response to the Covid-19 pandemic, to analyze the patterns in public opinion and emotions, and to extract the most effective features for the predictive model. In terms of practical implications, the results of this research can be adopted to create a recommendation system for tweets that are relevant to certain events, or as a means of obtaining a higher number of retweets. Identifying patterns in public emotions during the ongoing pandemic can help public health authorities make strategic decisions regarding communication during critical events such as a pandemic. The findings of this study show that although negative emotions, such as anger, fear and sadness were dominant in the early stages of the Covid-19 pandemic, the vaccine rollout and published results on vaccine effectiveness has a positive influence on public emotions. Furthermore, the finding of this study can help to detect and minimize the misleading information related to Covid-19 on Twitter. In this study, the popularity of tweets (based on the number of retweets) was predicted by extracting content features from tweets written in English on the Twitter platform from January 20, 2020, to May 29, 2021. This study shows that the popularity of tweets based on the number of retweets can be drawn from the content of tweets and certain repeated terms during important events such as the Covid-19 pandemic. This section discusses the findings of the study, and its limitations. 
The results of this study revealed how public opinion changed throughout the stages of the Covid-19 pandemic. The study aimed to select the effective features from the content of the posted tweets by applying unsupervised machine learning algorithms and then to use them as inputs to feed the selected supervised machine learning algorithms for predicting retweetability. Identifying negative and misleading sentiments on popular social media platforms such as Twitter can help to prevent the spread of misinformation. Promoting accurate information and positive sentiments can enhance public awareness regarding certain events such as pandemics. In the proposed approach, the most popular topics at different stages of the pandemic were first identified by using the LDA, and the emotional intensity were detected by employing the CrystalFeel algorithm (Gupta & Yang, 2018) for four emotions: fear, anger, joy and sadness. Second, they were used as one category of content features along with other sets of features to apply them to the selected classifiers. The results showed that topics plus TF-IDF vectorizers feature set had the highest accuracy compared with other sets of content features, and the ensemble voting classifier by ensemble of three machine learning algorithms such as random forest, stochastic gradient decent, and logistic regression had the highest performance when compared with the state-of-art classifiers. The analysis in this study was limited to tweets written in English and related to Covid-19. Future studies can expand the analysis into different languages. Furthermore, the findings of this study are limited to only users on Twitter platform; future research can explore text content from other social platform to compare the results. Top Concerns of Tweeters During the COVID-19 Pandemic: Infoveillance Study User's action and decision making of retweet messages towards reducing misinformation spread during disaster Social performance and social media activity in times of pandemic: evidence from COVID-19-related Twitter activity. Corporate Governance: The International Journal of Business in Society, ahead-of-print(ahead-of-print A random forest guided tour Latent dirichlet allocation Tweet, Tweet, Retweet: Conversational Aspects of Retweeting on Twitter. 
2010 43rd Hawaii International Conference on System Sciences Random Forests Creating COVID-19 stigma by referencing the novel coronavirus as the "Chinese virus" on twitter: Quantitative analysis of social media data Tracking Social Media Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus Twitter Data Set Tracking Social Media Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus Twitter Data Set Text mining with sentiment analysis on seafarers' medical documents Understanding the information diffusion of tweets of a non-profit organization that targets female audiences: an examination of Women Who Code's tweets The COVID-19 social media infodemic Geospatial analysis of misinformation in COVID-19 related tweets Trust region Newton method for large-scale logistic regression Topic detection and sentiment analysis in Twitter content related to COVID-19 from Brazil and the USA Topic detection and sentiment analysis in Twitter content related to COVID-19 from Brazil and the USA CrystalFeel at SemEval-2018 Task 1: Understanding and Detecting Emotion Intensity using Affective Lexicons An Emotion Care Model using Multimodal Textual Analysis on COVID-19 Credibility Detection in Twitter Using Word N-gram Analysis and Supervised Machine Learning Techniques Applied logistic regression An Effective Approach to Track Levels of Influenza-A (H1N1) Pandemic in India Using Twitter EMOCOV: Machine learning for emotion detection, analysis and visualization using COVID-19 tweets EMOCOV: Machine learning for emotion detection, analysis and visualization using COVID-19 tweets Theory building with big data-driven research -Moving away from the "What" towards the "Why Machine Learning Tools to Predict the Impact of Quarantine Monitoring the dynamics of emotions during Covid-19 using twitter data An ensemble approach for classification and prediction of diabetes mellitus using soft voting classifier What is Twitter Detecting themes of public concern: A text mining analysis of the Centers for Disease Control and Prevention's Ebola live Twitter chat Sentiment Classification through Combining Classifiers with Multiple Feature Sets RTPMF: Leveraging user and message embeddings for retweeting behavior prediction Trust region Newton methods for large-scale logistic regression Global Sentiments Surrounding the COVID-19 Pandemic on Twitter: Analysis of Twitter Trends Global Sentiments Surrounding the COVID-19 Pandemic on Twitter: Analysis of Twitter Trends Machine Learning to Detect Self-Reporting of Symptoms, Testing Access, and Recovery Associated With COVID-19 on Twitter: Retrospective Big Data Infoveillance Study From citizens to partners: the role of social media content in fostering citizen engagement. Transforming Government: People Linguistic regularities in continuous space word representations A Sentiment analysis-based hotel recommendation using TF-IDF Approach Sentiment Analysis for POI Recommender Systems Deep Learning-based Sentiment Analysis and Topic Modeling on Tourism During Covid-19 Pandemic Twitter users exhibited coping behaviours during the COVID-19 lockdown: an analysis of tweets using mixed methods. 
Information Discovery and Delivery, ahead-of-print(ahead-of-print Leveraging Twitter data to understand public sentiment for the COVID-19 outbreak in Singapore Factors influencing user participation in social media: Evidence from twitter usage during COVID-19 pandemic in Saudi Arabia n-Gram based language processing using Twitter dataset to identify COVID-19 patients. Sustainable Cities and Society Bad news travel fast Sentiment analysis and classification of Indian farmers' protest using twitter data What can we learn about the Ebola outbreak from tweets? Moving social marketing beyond personal change to social change An ensemble of ordered logistic regression and random forest for child garment size matching Using topic models with browsing history in hybrid collaborative filtering recommender system: Experiments with user ratings Retweets of officials' alarming vs reassuring messages during the COVID-19 pandemic: Implications for crisis management Exploring the Space of Topic Coherence Measures A performance comparison of supervised machine learning models for Covid-19 tweets sentiment analysis Term-weighting approaches in automatic text retrieval Mining topic and sentiment dynamics in physician rating websites during the early wave of the COVID-19 pandemic: Machine learning approach An exploratory study of COVID-19 misinformation on Twitter A first look at COVID-19 information and misinformation sharing on Twitter Public Priorities and Concerns Regarding COVID-19 in an Online Discussion Forum: Longitudinal Topic Modeling Twitter-based analysis reveals differential COVID-19 concerns across areas with socioeconomic disparities Indian citizen's perspective about side effects of COVID-19 vaccine -A machine learning study Twitter Informatics: Tracking and Understanding Public Reaction during the 2009 Swine Flu Pandemic Comparing tweet sentiments in megacities using machine learning techniques: In the midst of COVID-19 Social media as an early proxy for social distancing indicated by the COVID-19 reproduction number: Observational study An analysis of COVID-19 vaccine sentiments and opinions on Twitter Solving large scale linear prediction problems using stochastic gradient descent algorithms