key: cord-0823509-1cr7axc4
authors: Gourisaria, Mahendra Kumar; Chandra, Satish; Das, Himansu; Patra, Sudhansu Shekhar; Sahni, Manoj; Leon-Castro, Ernesto; Singh, Vijander; Kumar, Sandeep
title: Semantic Analysis and Topic Modelling of Web-Scrapped COVID-19 Tweet Corpora through Data Mining Methodologies
date: 2022-05-10
journal: Healthcare (Basel)
DOI: 10.3390/healthcare10050881
sha: aed084980b66f40bebe1fba2bd4be3b143eb3105
doc_id: 823509
cord_uid: 1cr7axc4

The evolution of the coronavirus (COVID-19) disease took a toll on the social, healthcare, economic, and psychological prosperity of human beings. In the past couple of months, many organizations, individuals, and governments have adopted Twitter to convey their sentiments on COVID-19, the lockdown, the pandemic, and hashtags. This paper aims to analyze the psychological reactions and discourse of Twitter users related to COVID-19. In this experiment, Latent Dirichlet Allocation (LDA) has been used for topic modeling. In addition, a Bidirectional Long Short-Term Memory (BiLSTM) model and various classification techniques such as random forest, support vector machine, logistic regression, naive Bayes, decision tree, logistic regression with stochastic gradient descent optimizer, and majority voting classifier have been adapted for analyzing the polarity of sentiment. The effectiveness of the aforesaid approaches along with LDA modeling has been tested, validated, and compared with several benchmark datasets and on a newly generated dataset for analysis. To achieve better results, a dual dataset approach has been incorporated to determine the frequency of positive and negative tweets and word clouds, which helps to identify the most effective model for analyzing the corpora. The experimental result shows that the BiLSTM approach outperforms the other approaches with an accuracy of 96.7%.

The micro-blogging and social networking site Twitter is a leading platform on which individuals and organizations express their views and opinions, share their thoughts, and keep up to date with day-to-day social and political affairs [1]. Twitter has about 145 million daily active users and 330 million monthly active users, making it an important source of tweets for research [2]. Twitter originally restricted tweets to 140 characters, but in 2017 it doubled the limit to 280 characters per tweet, which compels users to adapt the phrasing of their tweets [3]. Twitter has over 1 billion unique tweets posted every day and receives 15 billion API calls every day [1]. The first main goal of the study is to provide a comparison between various algorithms and identify the best model in terms of accuracy and other result parameters, which are further discussed in the experimental section. Our second main goal was to provide our own mined dataset, which can be used in further studies and contribute to the research community. This dataset is useful because it has been mined from Twitter and its sentiments were labeled using the best algorithm after a comparison of eight different models. We explain the methods that were used for collecting the data samples and the pre-processing steps. We also take the best model identified on the previous datasets and apply it to our newly created dataset for more accurate analysis and efficiency. The main contributions of the paper are summarized in the following points:

• Our own mined COVID-19 dataset from the Twitter API is proposed, consisting of 6648 tweets.

• Our mined dataset has been compared with the other two trained datasets.

• Topic modeling with the help of LDA has been performed on all datasets.

• An RNN network, BiLSTM, and various other classification algorithms have been applied, and the ROC curve has been obtained for all of these to select the best among them.

The remaining part of the paper is arranged into the following sections. Section 2 briefly reviews works related to COVID-19 semantic analysis. Section 3 describes the methodology and materials, covering statistics inspection, data pre-processing, and feature extraction. Section 4 describes the topic modeling technique, Latent Dirichlet Allocation (LDA), the Bidirectional Long Short-Term Memory (BiLSTM) algorithm, and the various algorithms implemented in the paper, namely, support vector machine, naïve Bayes [14], logistic regression-stochastic gradient descent, logistic regression, decision tree, random forest [15], and Majority Voting Classifier (MVC). Section 5 comprises scrutiny, results, and comparison of models, followed by Section 6, which presents the analysis and discussion of the results obtained in the experiment. Section 7 constitutes the conclusion and future work.

COVID-19 has evolved as one of the major challenges in the world due to its highly mutating, contagious nature. Tweets have a wide impact on public emotions; therefore, it is very important to know the polarity of tweets. In this paper we review several articles related to sentiment analysis of COVID-19 tweets collected from a Kaggle dataset using various deep learning and machine learning models. Hung et al. (2020) [10] applied Natural Language Processing (NLP), a Machine Learning (ML) technique, for analyzing and exploring the sentiments of Twitter users during the COVID-19 crisis. The hidden semantic features in the posts were extracted via topic modeling using Latent Dirichlet Allocation (LDA). Their dataset originated exclusively from the United States and consisted of tweets in English from 20 March to 19 April 2020. They analyzed 902,138 tweets, out of which semantic analysis classified 434,254 (48.2%) as positive, 280,842 (31.1%) as negative, and 187,042 (20.7%) as neutral. Tennessee, Vermont, Utah, North Dakota, North Carolina, and Colorado expressed the most positive sentiment, while Wyoming, Alaska, Pennsylvania, Florida, and New Mexico conveyed the most negative tweets. The themes that were considered in the experimental section included health care environment, business economy, social change, emotional support, and psychological stress. However, the authors do not provide any industrial-level model that can be implemented for analyzing these themes and providing conclusive results, unlike our experiments, where models can provide different results based upon the tone, speech, etc. of the text given. Xue et al. (2020) [16] also applied the Latent Dirichlet Allocation (LDA) technique for topic modeling and identified themes, patterns, and structures using a Twitter dataset containing 1.9 million tweets associated with coronavirus gathered from 23 January to 7 March 2020. 
They identified 10 themes including "COVID-19 related deaths", "updates about confirmed cases", "early signs of the outbreak in New York", "cases outside China (worldwide)", "preventive measures", "Diamond Princess cruise", "supply chain", "economic impact", and "authorities". These results do not reveal symptoms and treatment-related messages. They also noticed that fear of the mysterious nature of COVID-19 prevailed in all themes. Although the study describes the procedure used in the experiment, which comprises machine learning techniques, it does not provide any experimental results or analysis that can be used as a model. In comparison, in our work we have used machine learning techniques and have selected our best models by testing them on two Kaggle datasets. This result has been compared with that on our own mined dataset (the dataset was mined from Twitter using the keyword COVID-19 and contains 6648 tweets). We have also attached a label to every tweet.

Muthusami et al. (2020) [17] aimed to inspect and visualize the impact of the COVID-19 outbreak in the world using Machine Learning (ML) algorithms on tweets extracted from Twitter. They utilized various machine learning algorithms such as naïve Bayes, decision tree, SVM, max entropy, random forest, and LogitBoost for classifying the tweets as positive, neutral, and negative. LogitBoost ensemble classifier with three classes performed better with an accuracy of 74%. However, authors lack in terms of their model's accuracy when compared to our models used in the different datasets. Similar work was presented by Lwin et al. (2020) [18] investigating four emotions, namely, anger, fear, sadness, and joy, during the COVID-19 pandemic. They collected 20,325,929 tweets from Twitter during the initial phase of COVID-19 from 28 January to 9 April 2020 using the keywords "Wuhan", "corona", "nCov" and "COVID". They found that social emotions altered from fear to anger throughout the COVID-19 crisis, while joy and sadness also surfaced. Sadness was indicated by topics of losing family members and friends, while gratitude and good health showed joy. Chakraborty et al. (2020) [19] analyzed the kinds of tweets collected during this COVID-19 crisis. The first dataset containing 23,000 tweeted posts from 1 January 2020 to 23 March 2020 had a maximum number of negative sentiments while the second dataset contains 226,668 tweets collected from December 2020 to May 2021, which contrasts the greatest number of negative and positive tweets. They utilized bag-of-words vectorizers like TF-IDF vectorizer and count vectorizer from the sklearn library for word embedding purposes. They used various classifiers such as ensemble models, naïve Bayes models, Bernoulli classifier, multinomial classifier, support vector machine models, AdaBoost, logistic regression, and LinearSVC. The best classifier was naïve Bayes with an accuracy of 81%. Li et al. 
(2020) [20] analyzed the effect of COVID-19 on the psychological well-being of people by organizing different trials on sentiment analysis using microblogging sites. It was established that information gaps in the short-term in individuals change with psychological burdens after the outbreak. They used Online Ecological Recognition (OER), which automatically recognizes psychological conditions such as anxiety, well-being, etc. of a person. Bakur et al. (2020) [21] studied the sentiments of Indian people after the lockdown enforced by the Indian government. They collected about 24,000 tweets obtained from the hashtags #IndiafightsCorona and #IndiaLockdown in the period of 25 to 28 March 2020. The study drew its conclusions using only a word cloud, and it indicates that Indians took the lockdown decision positively. Imran et al. (2020) [22] used deep learning models like Long Short-Term Memory (LSTM) to analyze tweets related to the COVID-19 crisis. They utilized different datasets such as the Sentiment140 dataset containing 1.6 million tweets, an emotional tweet dataset, and a trending dataset on COVID-19. For comparison, they also trained Bidirectional Encoder Representations from Transformers (BERT), GloVe, BiLSTM, and GRU. Wang et al. (2020) [23] fine-tuned the Bidirectional Encoder Representations from Transformers (BERT) model for classifying the sentiments of Chinese Weibo posts about COVID-19 into positive, negative, and neutral and analyzed the trends. The dataset contains 999,978 posts from 1 January 2020 to 18 February 2020. The model achieved an accuracy of 75.65%, which surpasses many NLP baseline algorithms. However, this accuracy is lower than that of our results. Sitaula et al. (2021) [24] conducted an analysis on COVID-19 tweets in the Nepali language. They utilized different feature extraction methods such as domain-agnostic (da), domain-specific (ds), and fastText-based (ft). 
They also proposed three CNN models and combined them using a CNN ensemble. They built a Nepali Twitter sentiment analysis dataset. Their feature extraction technique is able to capture discriminating characteristics for sentiment analysis. Shahi et al. (2022) [25] demonstrated the text representation methods fastText and TF-IDF and a combination of both to obtain hybrid features. They used nine classifiers on NepCov19Tweets, which is a dataset of COVID-19 tweets in the Nepali language. The best classifier was an SVM with a Radial Basis Function (RBF) kernel, with an overall classification accuracy of 72.1%. Sitaula et al. (2022) [26] combined the semantic information generated from the combination of the domain-specific (ds) and fastText-based (ft) methods. They used a Multi-Channel Convolutional Neural Network (MCNN) for classification purposes. They found that the hybrid feature extraction technique performed better with 69.7% accuracy, while the MCNN also performed much better than an ordinary CNN with 71.3% accuracy.

The studies presented in this section cover various themes and other analyses of sentiment, but most lack a machine-learning-based model that can help in doing the same with other tweets or messages: out of the nine studies shown above, only two presented a model-based application. Furthermore, these models lack in terms of accuracy when compared to our experimental models. Apart from the models, previous studies do not compare their outcomes across other datasets to gain a deeper insight into the sentiments of the tweets. In our experiment, we adopt a new approach in which we first try different models on the previously collected datasets (varying in size) and, after selecting the best model, introduce our new dataset collected based upon the understandings and algorithms. We then apply the best model to our dataset to check how varied the results are and how they can improve the work. Table 1 provides a summary of the dataset.

Machine learning is a trending technology in which algorithms improve automatically by learning the relationships found in data [27]. This research paper deals with classifying the tweets related to COVID-19 into positive or negative sentiments. The dataset was pre-processed before being fed to the model. The pre-processing steps include stopword removal, stemming, and tokenization. Latent Dirichlet Allocation (LDA) was used for topic modeling. Data classification techniques like BiLSTM [28], random forest, naïve Bayes, LR-SGD classifiers, logistic regression, decision tree, SVM, and MVC were used to categorize the tweets as positive or negative. These methods were chosen because they are promising classifiers and techniques for analyzing the polarity of tweets. These classifiers have been successfully applied in many applications such as social media text analysis, emotion analysis, text analysis, etc. These technologies also prove beneficial in analyzing the thought processes of the general public, and they can successfully classify and express the opinions and feelings of human beings. LDA for topic modelling is beneficial in recognizing the patterns in the tweets: we can find groups of words which are mainly involved in the negative or positive tweets and later select those tweets which contain these types of words. The count vectorization and tokenization techniques provide vectors for experimentation with the models and are well-known feature extraction techniques. Figure 1 shows the workflow of the COVID-19 sentiment analysis. For the experimental workflow, we used the Keras library (with the TensorFlow backend) in an Anaconda environment. All models were trained in Python 3, using high-level APIs for the construction of the neural network in the Bi-LSTM model. We used an 8th-generation i5 processor with 16 GB of RAM.

In this research paper, three different datasets were taken. The first dataset [29] was taken from Kaggle in CSV format consisting of 648,958 tweets with 177,456 unique tweets, and the remaining 471,412 tweets were retweeted by the users. As the retweeted posts contain the same tweets and sentiments, we removed those tweets. These tweets were related to COVID-19 and the sentiments of people in India during lockdown from 20 March to 31 May 2020.

The second dataset [30] was also taken from Kaggle and contains 3090 tweets related to the coronavirus and lockdown in India from 23 March to 15 July. The third dataset consists of 6648 self-mined tweets collected through the Twitter API. Inspecting the data plays an important part in machine learning, as it assists us in visualizing the classes and statistics of the corpora. Figure 2 shows the statistics of positive and negative tweets of both datasets. As the data are textual, the word cloud can also be seen in Figure 3.


The dataset is heterogeneous and unstructured and contains ill-formed words, non-dictionary terms, and irregular grammar, so before the feature extraction step the tweets were cleaned using several NLTK techniques [2]. The pre-processing steps are [31]:

• Removing non-ASCII and non-English characters from the text.

• Eliminating the HTML tags and URL links.

• Removing numbers and extra white spaces, as they do not impart any facts about sentiment.

• Removing the special characters such as @, $, *, #, etc.

• Converting all the letters to lowercase.

• Eliminating English stopwords such as "an", "about", "as", "any", etc., as these words are not involved in detecting the polarity of sentiments.

• Stemming was done to bring each word back to its root form, such as "strength" becoming "strong", "better" becoming "good", and so on.
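The cleaning steps listed above can be sketched with the Python standard library alone. This is a minimal illustration and not the authors' code: the function name `clean_tweet` and the tiny `STOPWORDS` set are hypothetical stand-ins (the paper relies on NLTK's English stopword list), and stemming is left out so that the example has no third-party dependency.

```python
import re

# Hypothetical, deliberately tiny stopword list used only for illustration;
# the paper uses the English stopwords shipped with NLTK.
STOPWORDS = {"a", "an", "about", "as", "any", "the", "is", "in", "of", "to", "and"}

def clean_tweet(text):
    """Apply the listed cleaning steps (except stemming) and return the tokens."""
    text = text.encode("ascii", errors="ignore").decode("ascii")  # drop non-ASCII characters
    text = re.sub(r"<[^>]+>", " ", text)                # strip HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # strip URL links
    text = re.sub(r"\d+", " ", text)                    # remove numbers
    text = re.sub(r"[^A-Za-z\s]", " ", text)            # remove special characters (@, $, *, #, ...)
    text = text.lower()                                 # convert to lowercase
    tokens = text.split()                               # splitting also collapses extra white space
    return [tok for tok in tokens if tok not in STOPWORDS]

tokens = clean_tweet("Stay <b>SAFE</b> in the #lockdown!! 100% http://t.co/abc @user")
```

In a fuller pipeline, a stemmer would be applied to the returned tokens as the last step.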

Feature extraction is the premier step of Natural Language Processing (NLP). The text data cannot be fed directly in its original form into the machine learning or deep learning models so these words are encoded into numbers and these numbers are represented as vectors.

Count vectorization is a basic encoding technique in which a vector whose size equals the size of the English dictionary is taken with all its elements initialized to zero. Every time the text contains a vocabulary word, the element of the vector representing that word is increased by one, leaving zeroes at every position whose word was never found, as shown in Equations (1) and (2). A vector was created with the 171,476 words of the Oxford English Dictionary [32]; the resulting architecture has a very large number of features, and thus high variance is noted. Here, the count vectorizer keeps track of the rare as well as the most frequent words of the corpora. Feature selection is then applied as a dimensionality reduction step to eliminate rare and non-informative words: a bag-of-words model consisting of the 1500 most frequent words of the corpora is created from the feature vector to enhance the accuracy of the model [2].
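As a rough illustration of the counting scheme and of restricting the vocabulary to the most frequent words, the following standard-library sketch builds a small bag-of-words model. The helper names `build_vocabulary` and `count_vector` and the toy documents are assumptions made for this example; a limit of 3 words stands in for the 1500 used in the paper.

```python
from collections import Counter

def build_vocabulary(docs, max_features):
    """Map the max_features most frequent words to vector positions."""
    counts = Counter(tok for doc in docs for tok in doc.split())
    # Sort by descending frequency; ties are broken alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: idx for idx, (word, _) in enumerate(ranked[:max_features])}

def count_vector(doc, vocab):
    """Return the count vector of one document over the given vocabulary."""
    vec = [0] * len(vocab)          # all elements start at zero
    for tok in doc.split():
        if tok in vocab:            # words outside the vocabulary are ignored
            vec[vocab[tok]] += 1    # increase the element for this word by one
    return vec

docs = ["stay home stay safe", "lockdown extended stay home"]
vocab = build_vocabulary(docs, max_features=3)
vectors = [count_vector(d, vocab) for d in docs]
```

In scikit-learn, `CountVectorizer(max_features=1500)` offers comparable behaviour; whether the authors used that exact call is not stated in the text.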

Breaking up raw text into individual units, i.e., tokens, is known as tokenization. Every distinct token has its own token id. In tokenization, a vector of size equal to the corpora is created. A token sequence is created and represented as a vector, as demonstrated in Equations (3) and (4). Because each tweet, and hence its corresponding vector sequence, has a different length, the sequences cannot be fed directly into deep learning models, which need sequences of equal length [33]. This issue has been countered using truncating and padding steps. If the tokenized sequence is longer than the chosen sequence length, the extra tokens are truncated, and if it is shorter, it is padded with '0'. On choosing the sequence length to be 6, truncating is applied to Equation (3) and padding to Equation (4), as shown in Equations (5) 
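The truncating and padding step described above can be sketched as follows for a sequence length of 6. The function `pad_or_truncate` and the token ids are hypothetical; the sketch truncates and pads at the end of the sequence, which is an assumption, since the text does not say on which side the operation is applied.

```python
def pad_or_truncate(seq, maxlen, pad_value=0):
    """Return a copy of seq with exactly maxlen elements."""
    if len(seq) >= maxlen:
        return seq[:maxlen]                           # truncate the extra tokens
    return seq + [pad_value] * (maxlen - len(seq))    # pad the remainder with 0

long_seq = [4, 12, 7, 9, 3, 15, 8, 2]   # hypothetical token ids, longer than 6
short_seq = [5, 1, 6]                   # hypothetical token ids, shorter than 6

truncated = pad_or_truncate(long_seq, maxlen=6)
padded = pad_or_truncate(short_seq, maxlen=6)
```

Note that Keras' `pad_sequences` pads and truncates at the beginning of the sequence by default (`padding='pre'`, `truncating='pre'`), so the side should be set explicitly if a specific layout is required.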

One of the most crucial parts of supervised machine learning is the classification algorithms which find the class of the data. This research paper utilizes various classification algorithms for classifying the tweets as positive or negative.

In the context of topic modeling, Latent Dirichlet Allocation (LDA) [34] is the most popular and widely used technique. It is a generative model [35] used for topic modeling; however, it is more widely known as a dimensionality reduction technique. Topic modeling can be defined as the process by which a machine predicts the most pertinent and relevant topics in an input corpus. We now explain how LDA achieves this. LDA assumes that there is a vocabulary of $P$ distinct words and $T$ different topics, where every word can be represented as $P_j$ such that $0 \le j \le P-1$. Similarly, each topic $T_i$ ($0 \le i \le T-1$) is represented by a probability distribution $\Psi_{T_i}$ over the $P$ words, each having a Dirichlet prior $\beta$. Thus, $\Psi_{T_i,P_j}$ is the probability that the word $P_j$ represents the topic $T_i$. Given a total of $D$ documents (here, documents do not mean full instances of articles or reports, but a small block of text such as a paragraph), $\alpha$ governs the distribution of the $T$ topics over the $D$ documents. If we take a variable $Z$ denoting the assignment of topics to every word, then a document can be considered to be a mixture of different topics. We assume that there are $\mu_{D_b}$ words in a document $D_b$ ($0 \le b \le D-1$) and that $\delta_{D_b}$ is the probability distribution of the document over the topics, drawn from a Dirichlet distribution parameterized by $\alpha$. Figure 4 denotes the plate notation for LDA. Clearly, $\delta_{D_b,T_i}$ is the probability that $D_b$ is associated with $T_i$. For now, we assume that $\alpha$ and $\beta$ are scalars (in Figure 1 and when defining the Dirichlet distribution we take them to be vectors, however). LDA iterates through all the documents $D_b$, each of which has $\mu_{D_b}$ words. For every word $P_j$, a topic assignment $Z_{D_b,P_j}$ is drawn from the categorical distribution $\delta_{D_b}$, after which a word $W_{D_b,P_j}$ is drawn from the categorical distribution $\Psi_{Z_{D_b,P_j}}$. The following are the steps of the algorithm:

$$\Psi_{T_i} \sim \mathrm{Dirichlet}(\beta), \quad 0 \le i \le T-1$$

$$\delta_{D_b} \sim \mathrm{Dirichlet}(\alpha), \quad 0 \le b \le D-1$$

$$Z_{D_b,P_j} \sim \mathrm{Ca}(\delta_{D_b}), \qquad W_{D_b,P_j} \sim \mathrm{Ca}\left(\Psi_{Z_{D_b,P_j}}\right), \quad 0 \le j \le \mu_{D_b}-1$$

where $\mathrm{Ca}$ is the categorical distribution and $\mathrm{Dirichlet}$ denotes the Dirichlet distribution having $\alpha$ or $\beta$ as its arguments (Dirichlet priors). For a vector argument, the Dirichlet distribution is given in Equation (7).

where µ is the multivariate Beta function. For input α, it is defined in Equation (8). The function τ appearing in Equation (8) is given in Equation (9).
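For reference, the textbook forms of these three quantities are sketched below. They are written under the assumption that Equations (7)-(9) follow the standard definitions; the dimension $K$, the index $k$, and the vector $x$ are names introduced here for illustration and do not come from the source.

```latex
% Dirichlet density over a probability vector x = (x_1, ..., x_K), cf. Equation (7)
\mathrm{Dirichlet}(x \mid \alpha) = \frac{1}{\mu(\alpha)} \prod_{k=1}^{K} x_k^{\alpha_k - 1}

% Multivariate Beta function used as the normalizer, cf. Equation (8)
\mu(\alpha) = \frac{\prod_{k=1}^{K} \tau(\alpha_k)}{\tau\left(\sum_{k=1}^{K} \alpha_k\right)}

% Gamma function, cf. Equation (9)
\tau(y) = \int_{0}^{\infty} t^{\,y-1} e^{-t} \, dt
```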

We remark that τ(y) is more popularly known as the complete gamma function. LDA is used in plenty of applications including web-spam filtering [36] , tag recommendation [37] , bug localization [38] , etc. LDA has also been used for annotation of satellite images to segment different types of regions such as golf courses, deserts, urban areas, etc. [39] .
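The generative process that LDA assumes (topic-word distributions drawn from a Dirichlet prior β, document-topic distributions drawn from a Dirichlet prior α, then a topic and a word drawn for every word position) can be sketched with the standard library. This is an illustrative simulation of the assumed process, not the inference procedure used to fit LDA and not the authors' code; the function names, the toy vocabulary, and the prior values are assumptions.

```python
import random

def sample_dirichlet(concentration, k, rng):
    """Draw a k-dimensional symmetric Dirichlet sample via normalized Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(vocab, n_topics, n_docs, words_per_doc, alpha=0.5, beta=0.5, seed=0):
    """Simulate documents from the LDA generative process with scalar priors."""
    rng = random.Random(seed)
    # psi[i]: word distribution of topic T_i, drawn from Dirichlet(beta)
    psi = [sample_dirichlet(beta, len(vocab), rng) for _ in range(n_topics)]
    corpus = []
    for _ in range(n_docs):
        # delta: topic distribution of this document, drawn from Dirichlet(alpha)
        delta = sample_dirichlet(alpha, n_topics, rng)
        doc = []
        for _ in range(words_per_doc):
            z = rng.choices(range(n_topics), weights=delta)[0]  # topic assignment Z
            w = rng.choices(vocab, weights=psi[z])[0]           # word W drawn from psi[Z]
            doc.append(w)
        corpus.append(doc)
    return psi, corpus

vocab = ["covid", "lockdown", "vaccine", "mask"]
psi, corpus = generate_corpus(vocab, n_topics=2, n_docs=3, words_per_doc=5)
```

Fitting LDA to real tweets runs this process in reverse: given only the words, it estimates the topic-word and document-topic distributions.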

A conventional neural architecture cannot recall prior inputs, but a Recurrent Neural Network (RNN) has the ability to memorize and recall them due to its loops and the hidden layers in between them. An RNN converts independent activations into dependent activations by assigning the same weights and biases to all the layers, and the outcome of one layer is the input to the next hidden layer. LSTM is a particular form of RNN which avoids the long-term dependency problem [40]. The long short-term memory cell stores the hidden layer of an RNN. The memory cell of an LSTM can be obtained via Equations (10)-(14). Figure 5 denotes the LSTM memory cell.

where the logistic sigmoid function is represented by σ. The forget gate, cell vector, output gate, and input gate are represented by f, c, o, and i, respectively. Their dimensions are the same as that of the hidden vector h [40].
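For orientation, one widely used formulation of the LSTM memory cell is sketched below. It is the standard variant without peephole connections and may differ in detail from Equations (10)-(14) of the cited formulation [40]; the weight matrices $W$, $U$, the biases $b$, the input $x_t$, and the element-wise product $\odot$ are notation assumed here.

```latex
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c)
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
h_t = o_t \odot \tanh(c_t)
```

The forget gate $f_t$ decides how much of the previous cell state is kept, the input gate $i_t$ how much new information is written, and the output gate $o_t$ how much of the cell state is exposed as the hidden vector $h_t$.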

An extension of the LSTM is the Bidirectional Long Short-Term Memory (BiLSTM), which is built from two independent LSTM cells. Bi-LSTM has been used to solve fixed sequence-to-sequence problems. It is very efficient on text datasets where the inputs have various lengths. Through this architecture, the neural network has both backward and forward information at every time step. Figure 6 shows the Bi-LSTM [41].

Logistic Regression (LR) is among the statistical methods that have been demonstrated to be highly reliable for sentiment analysis. It is easy to interpret, implement, and train efficiently. It is less prone to overfitting on low-dimensional datasets; on high-dimensional data overfitting can happen, but it can be mitigated by using L1 and L2 regularization. The independent variables in this algorithm are treated as the predictors of the dependent variable. It proves very efficient when the dataset has features that are linearly separable. The relation between the independent and dependent variables is nonlinear, and the model can be treated as a particular instance of a generalized linear model. It assumes a binomial distribution in place of a Gaussian distribution since the dependent variable is categorical [42,43]. It returns a probability by transforming the linear output with the logistic sigmoid function. If the predicted value is greater than 0.5, the tweet is marked as positive, or else negative. Nonlinear problems cannot be solved using logistic regression, and it has difficulty capturing complex relationships. The linear regression equation is specified in Equation (15).

The equation of sigmoid function is given in Equation (16).

Now, combining Equations (15) and (16) and solving for y, we get Equation (17), i.e., the logistic regression computation.
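The computation described above, a linear combination passed through the sigmoid and thresholded at 0.5, can be sketched as follows. The weights, bias, and feature values are made up for the example, and the function names are hypothetical; training of the weights is not shown.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    """Probability of the positive class for feature vector x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b   # linear part
    return sigmoid(z)                              # squashed to a probability

def predict_label(x, w, b, threshold=0.5):
    """Mark the tweet positive if the probability exceeds the threshold."""
    return "positive" if predict_proba(x, w, b) > threshold else "negative"

w, b = [1.0, -0.5], 0.0          # hypothetical learned parameters
label = predict_label([2.0, 1.0], w, b)
```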

A Support Vector Machine (SVM) is a well-known ML method that maximizes predictive performance while avoiding overfitting to the data by building a decision line between the two classes, i.e., positive or negative [44]. An SVM is very efficient on high-dimensional datasets and is capable of delivering good results even on complex problems. It remains effective in cases where the number of dimensions is greater than the number of samples, although it can underperform when the number of features for each data point greatly exceeds the number of training samples. An SVM does not perform very well on noisy data. The decision line, also known as the hyperplane, is aligned such that it is as far away as possible from the nearest data points of each category. An SVM locates the hyperplane by calculating the Euclidean distance between data points. These nearest points are known as support vectors. The distance between the support vectors of the two classes is called the margin. The margin of the hyperplane [45] can be calculated by using Equation (18).

A SVM aims to identify the class correctly so the mathematical calculations of the SVM are given in Equations (19) and (20) .

The optimal hyperplane can be defined in Equation (21).

where $x_j$ represents the feature vector, $w$ refers to the weight vector, and $b$ is the bias. SVMs are implemented using kernels. Here, a linear SVM kernel is used, whose mathematical form is defined in Equation (22).
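A minimal sketch of how a trained linear SVM is applied is given below, assuming the usual forms: the linear kernel is the dot product, the decision value is $w \cdot x + b$, and the margin width is $2 / \lVert w \rVert$. These forms are standard but are an assumption about what Equations (18)-(22) contain; the weight vector and bias are invented for the example, and training (finding $w$ and $b$) is not shown.

```python
import math

def linear_kernel(x, z):
    """Linear kernel: the dot product of two vectors."""
    return sum(a * b for a, b in zip(x, z))

def decision_value(x, w, b):
    """Signed score of x relative to the hyperplane w.x + b = 0."""
    return linear_kernel(w, x) + b

def classify(x, w, b):
    """Assign the class according to the side of the hyperplane."""
    return "positive" if decision_value(x, w, b) >= 0 else "negative"

def margin_width(w):
    """Width of the margin between the two supporting hyperplanes."""
    return 2.0 / math.sqrt(linear_kernel(w, w))

w, b = [3.0, 4.0], -1.0          # hypothetical trained parameters
width = margin_width(w)
```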

The naïve Bayes classification algorithm is one of the simplest supervised machine learning models; it evaluates the probability of an observation belonging to a predetermined class using Bayes' theorem with a naïve independence assumption between the features [46]. Naïve Bayes is an elementary technique for constructing classifiers. This algorithm does not require much training data and is highly scalable with the number of data points and predictors. Real-time predictions can be obtained very easily because it is fast to run. A naïve Bayes algorithm makes use of bag-of-words features to recognize the sentiments of tweets. It performs the classification by correlating the use of tokens with positive or negative tweets and then using Bayes' theorem to estimate the probability that a tweet is positive or not. A multiclass prediction problem can easily be solved using this classifier, and it performs best in the case of categorical input variables. The technique determines the prior probability of each class based on the training set and presumes that the classification can be predicted by considering the posterior probability and conditional density function [42]. The posterior probability can be evaluated using Equation (23). The main drawback is that it assumes all the features are independent, and in real life, it is very hard to find a set of independent features.

where P N j V is the posterior probability; P V N j presents the chance, i.e., the probability of V when N j is true; P(N j ) is the prior, i.e., the possibility of N j ; and P(V) presents the marginalization, i.e., the probability of V.

As the features are assumed to be independent and to contribute equally to the classification problem, a simple naïve Bayes method has been developed. Due to the conditional independence, $P(V \mid N_j)$ can be evaluated using Equation (24).

The prediction is made for the category with the greatest posterior probability [42] , given in Equation (25),

where j = {positive, negative}.
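
The steps above (class priors, per-token conditional probabilities under the independence assumption, and selection of the class with the greatest posterior) can be sketched in Python as follows. This is a minimal bag-of-words illustration, not the implementation used in the experiments: the toy tweets are invented, and the add-one (Laplace) smoothing and the log-space scoring are assumptions introduced here to avoid zero probabilities and numerical underflow.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    """docs: list of (tokens, label). Returns (priors, per-class token counts, vocabulary)."""
    class_counts = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        token_counts[label].update(tokens)
        vocab.update(tokens)
    total = sum(class_counts.values())
    priors = {label: n / total for label, n in class_counts.items()}
    return priors, token_counts, vocab

def predict_naive_bayes(model, tokens):
    """Return the class with the greatest (log) posterior for a tokenized tweet."""
    priors, token_counts, vocab = model
    scores = {}
    for label, prior in priors.items():
        n_tokens = sum(token_counts[label].values())
        score = math.log(prior)
        for tok in tokens:
            # Add-one smoothing (an assumption of this sketch).
            p = (token_counts[label][tok] + 1) / (n_tokens + len(vocab))
            score += math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)

# Invented toy corpus, for illustration only.
docs = [
    (["stay", "safe", "thank", "doctors"], "positive"),
    (["grateful", "doctors", "safe"], "positive"),
    (["lockdown", "fear", "deaths"], "negative"),
    (["deaths", "rising", "fear"], "negative"),
]
model = train_naive_bayes(docs)
print(predict_naive_bayes(model, ["safe", "doctors"]))
```

The marginal P(V) is not computed because it is the same for every class and therefore does not change which class attains the maximum.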

Decision Tree [47] organizes labeled training data into rules or trees [48]. It is a technique for approximating discrete-valued functions that is robust to noisy data, and the learned function is represented by a decision tree. To increase human readability, the trees can be expressed as a set of if-then rules [49]. The structure of a decision tree consists of a root node with a left subtree and a right subtree. The class labels are represented by the leaf nodes, and the conditions on the attributes are denoted by the arcs from one node to another. Data preparation requires much less effort during preprocessing: the tree-building process is not affected by missing values, and scaling and normalization of the data are not required. A decision tree is non-parametric and can handle numerical, categorical, and Boolean data. One major drawback of decision trees is overfitting, which can be addressed by pruning. In addition, decision trees are not well suited to big data, because the training time increases as the input size increases.

Overfitting and noise are controlled by pruning the tree. The benefits of the tree-structured approach are that it handles numeric and categorical attributes easily, is easy to interpret and understand, and is robust to missing values [50].

A random forest classifier is an ensemble learning method that has gained tremendous interest, as it is more accurate and more robust to noise than an individual classifier. It is based on the philosophy that a set of classifiers performs better than a single classifier does [51]. It is less prone to overfitting even with many features, and it is very efficient on large datasets. The forest, once created, can be saved and reused. A random forest is a combination of classifiers with trees as base classifiers. Each classifier casts a single vote, and the most frequent class is assigned to the input vector (X) [52].

Here, votemajority{} refers to the majority of votes cast by the classifiers for the class, and $\hat{C}_d(X)$ refers to the category predicted by the $d$-th random forest tree. While training the classifiers, some data may be used more than once, while some might never be used. Thus, greater classifier stability is attained, which makes the classifier more robust and improves its accuracy. Designing a decision tree requires an attribute selection measure and a pruning technique [53]. There are many approaches to selecting the attributes used for the decision tree, and most of them assign a quality measure directly to the attribute. The most frequently used attribute selection measures are the Gini index and the information gain ratio. The random forest classifier uses the Gini index, which measures the impurity of an attribute with respect to the categories. The Gini index [54] can be described by Equation (27).

$$\sum\sum_{j \neq i} \left(\frac{f(G_i, T)}{|T|}\right) \left(\frac{f(G_j, T)}{|T|}\right) \quad (27)$$

where $f(G_i, T)/|T|$ refers to the probability that the chosen case belongs to the category $G_i$. The main advantage of a random forest is that it can be used for both classification and regression problems and works well with categorical and continuous variables. It also automatically handles missing values and outliers. However, a long training period is required, and it is complex in nature.
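
The Gini index can be illustrated with a short Python sketch. It uses the equivalent form 1 - Σ p_i², which equals the sum over all pairs i ≠ j of p_i·p_j when the class probabilities p_i sum to one. The function names and the size-weighted split score are assumptions made for this example (a CART-style convention), not details reported in the paper.

```python
from collections import Counter

def gini_index(labels):
    """Gini impurity of a list of class labels: 1 - sum_i p_i^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Size-weighted impurity of a binary split (lower means a purer split)."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_index(left) + (len(right) / n) * gini_index(right)

# A perfectly mixed node versus a split that separates the two classes.
mixed = ["positive", "positive", "negative", "negative"]
print(gini_index(mixed))
print(weighted_gini(["positive", "positive"], ["negative", "negative"]))
```

A tree-building procedure would evaluate candidate splits with such a score and keep the one with the lowest weighted impurity.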

Logistic Regression-Stochastic Gradient Descent (LR-SGD) is a kind of linear model; stochastic gradient descent is also known as incremental gradient descent. An LR-SGD classifier is an efficient approach to the discriminative learning of linear classifiers under various penalties and loss functions, such as those of an SVM and logistic regression [55]. The 'log' loss function yields logistic regression, while the 'hinge' loss function yields a support vector machine. The large and sparse problems encountered in sentiment analysis are well suited to LR-SGD, and this factor motivated its use in this paper. A major strength of LR-SGD is hyperparameter tuning, which is used to minimize the error function, also known as the cost function. It is computationally fast, as only one sample is processed at a time. It also converges faster for larger datasets and fits in memory more easily, because only a single training sample is handled at a time. The classifier also has some drawbacks. First, it loses the advantage of vectorized operations, as it deals with only a single example at a time. Second, due to noisy steps, it may take longer to achieve convergence. Logistic regression [56, 57] has a likelihood function, which is expressed in Equation (28).

where M denotes the number of data samples. The following likelihood function is to be maximized to find the optimal model parameter θ.

The parameter θ can be optimized using the stochastic gradient descent classifier technique. Therefore, parameter θ can be given using Equation (30) .

where $x_{i,0} = 1$ for all $i$.
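
The per-sample update described above can be sketched in Python. The sketch assumes the standard logistic regression update θ ← θ + η (y_i − σ(θ·x_i)) x_i, i.e., stochastic gradient ascent on the log-likelihood, with a constant feature x_{i,0} = 1 prepended for the bias. The learning rate, the number of epochs, and the toy data are illustrative choices and are not values reported in the paper.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sgd_logistic_regression(X, y, lr=0.1, epochs=200, seed=0):
    """Fit theta one sample at a time; labels y must be 0 or 1."""
    rng = random.Random(seed)
    rows = [[1.0] + list(x) for x in X]  # prepend x_{i,0} = 1
    theta = [0.0] * len(rows[0])
    order = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in random order each epoch
        for i in order:
            pred = sigmoid(sum(t * v for t, v in zip(theta, rows[i])))
            error = y[i] - pred
            # Gradient ascent step on the log-likelihood of a single sample.
            theta = [t + lr * error * v for t, v in zip(theta, rows[i])]
    return theta

def predict_proba(theta, x):
    """Probability of the positive class for one feature vector."""
    return sigmoid(sum(t * v for t, v in zip(theta, [1.0] + list(x))))

# Invented one-dimensional toy data: small values negative, large values positive.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
theta = sgd_logistic_regression(X, y)
print(predict_proba(theta, [0.0]), predict_proba(theta, [3.0]))
```

Because each step uses a single sample, the update is cheap but noisy, which matches the trade-off between speed and convergence noted for this classifier.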

In this paper, the Majority Voting Classifier (MVC) has been adopted to get the best result. It is based on combining various single classifiers to obtain a highly accurate classifier from less accurate ones [58]. The combination can rectify the errors made by single classifiers on various parts of the input space, thereby improving on the accuracy of the single classifiers used in isolation [59]. In majority voting, the predicted category is the one that receives the majority of votes, i.e., the category output by more than half of the classifiers. It relies on the performance of many models and is not hindered by large errors from one model. It performs well in classification and regression problems.

However, this classifier is more computationally intensive and thus very costly in terms of training and deploying.

$$\hat{y} = \operatorname{mode}\{C_1(x), C_2(x), \ldots, C_n(x)\} \quad (31)$$
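
Equation (31) can be sketched in a few lines of Python. The helper names are hypothetical, the predictions are invented, and the tie-breaking rule (the label encountered first wins) is an assumption of this sketch; the paper does not state how ties are resolved.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most frequent label among the classifiers' predictions.

    Ties are resolved in favour of the label seen first, which is the
    ordering Counter.most_common uses for equal counts.
    """
    return Counter(predictions).most_common(1)[0][0]

def ensemble_predict(per_classifier_predictions):
    """Combine one prediction list per classifier into one vote per sample."""
    return [majority_vote(votes) for votes in zip(*per_classifier_predictions)]

# Invented predictions of three classifiers on three tweets (1 = positive, 0 = negative).
predictions = [
    [1, 0, 1],  # classifier 1
    [1, 1, 0],  # classifier 2
    [0, 0, 1],  # classifier 3
]
print(ensemble_predict(predictions))
```

Using an odd number of binary classifiers avoids ties altogether, which is one reason odd-sized ensembles are a common choice.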

The implementation of latent Dirichlet allocation [35] produced interesting themes that make good sense to a great extent. Before applying Latent Dirichlet Allocation (LDA), the principal step is to analyze the text corpora, so bar graphs showing the top ten most frequent words of each dataset were plotted, as shown in Figures 7-9. LDA was applied to all three datasets to detect five themes, and the top 10 most notable words of each theme were displayed; the results obtained are listed in Tables 2-4. Relevance [60] and saliency [61] were introduced, which are defined in Equations (32) and (33).

A major part of building a model is evaluating it by observing the accuracy and performance of the classifiers on the test data and selecting the best among them. The confusion matrix [62] contains the four outcomes produced by a binary classifier, which can be used to describe the performance of the models. Various metrics such as recall, accuracy [63], precision, AUC score, specificity, F1-score [64], and BAC were examined to verify and validate the results. The four outcomes of the confusion matrix, i.e., false negative, true negative, false positive, and true positive, for the various classifiers on the first and second datasets are shown in Table 5. The evaluation metrics are shown in Tables 6 and 7, respectively. The results of the classifiers with respect to the AUC score, F1-score, recall, accuracy, precision, BAC, and specificity are represented graphically in Figures 13 and 14. The evaluation metrics are mathematically described in Equations (34)-(40).

Figure 11. Inter-topic distance map for the mined dataset.

Figure 12. Inter-topic distance map for the second dataset.

Table 7. The performance measure of various classifiers of the second dataset.

In this research paper, a Balanced Accuracy (BAC) metric has been used. BAC is suited to imbalanced datasets, for which it represents the accuracy of a model better. It is the average of the recall obtained on both classes. The balanced accuracy can be calculated using Equation (39).

where FP is the false positive, TN refers to the true negative, FN means false negative, TP refers to true positive, P refers to precision, and R is the recall.
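
The metrics can be sketched in Python directly from the four confusion-matrix counts. The formulas below are the standard textbook definitions and are assumed, not confirmed, to match Equations (34)-(40); the counts in the example are invented and are not taken from Table 5. AUC is left out because it requires classifier scores and not only the four counts.

```python
def _ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def classification_metrics(tp, tn, fp, fn):
    """Compute the evaluation metrics from confusion-matrix counts."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)        # sensitivity, true positive rate
    specificity = _ratio(tn, tn + fp)   # true negative rate
    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": _ratio(2 * precision * recall, precision + recall),
        # Balanced accuracy: the average of the recall of both classes.
        "bac": (recall + specificity) / 2,
    }

# Invented counts, for illustration only.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, accuracy and balanced accuracy coincide because the two classes are of equal size; on an imbalanced dataset the two values would diverge.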

The Receiver Operating Characteristics (ROC) curve [65] is a graphical plot that demonstrates the discriminative ability of a binary classifier. The ROC curve shows the relationship between the False Positive Rate (FPR) and the True Positive Rate (TPR). The area below the ROC curve gives the AUC score, which is a useful metric because it covers the entire range between 0 and 1. At an AUC of 0.5, the false positive rate is equal to the true positive rate, which represents a non-skilled or random classifier. Figure 15 shows the ROC curves of all the models for the first dataset and Figure 16 for the second dataset.
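
As a hedged illustration of the AUC score, the sketch below uses the pairwise (rank-based) formulation: the AUC equals the fraction of positive-negative pairs in which the positive sample receives the higher score, with ties counted as one half. This is a known equivalent of the area under the ROC curve, chosen here for brevity; it is not necessarily how the scores in the paper were computed, and the labels and scores are invented.

```python
def roc_point(y_true, scores, threshold):
    """Return (FPR, TPR) when samples with score >= threshold are predicted positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    return fp / n_neg, tp / n_pos

def roc_auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Invented labels and classifier scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores), roc_point(y_true, scores, 0.5))
```

A classifier that assigns the same score to every sample obtains 0.5 under this definition, which corresponds to the random classifier mentioned above.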

In this paper, a novel dataset has been proposed. The dataset was mined from Twitter using the keyword "COVID-19". The mined tweets were labelled by comparing the classifiers trained on the two datasets, i.e., the first and the second. It has been seen that the first dataset gave more accurate labels than the second dataset. Table 8 shows the predictions for the mined tweets from the classifiers trained on the first dataset, together with the predictions made by the authors. Table 9 shows the same for the second dataset. Tables 10 and 11 show the number of correct and incorrect predictions by all classifiers.

Latent Dirichlet Allocation (LDA), a topic modeling technique, was applied to all three datasets of tweets on the COVID-19 pandemic. This revealed various kinds of reactions, with the model representing a set of themes and the most relevant words for each topic. The first dataset indicates that "India", "people", "cases", "lockdown", etc. are the most frequent topics, showing that the users are very conscious of their country and its citizens, while the second dataset emphasizes "people", "twitter", etc. The mined tweets have "Trump", "people", and "cases" as the top three topics, showing that people are very much aware of COVID-19 and that most of the tweets involved the former president of the USA; this is not surprising, since a majority of Twitter users are based in the USA. The topics have been plotted as circles, and the center of each topic was calculated by evaluating the distance among topics. In Figures 10-12, it can be seen that many topics are very close to each other and intersect in a few cases, thereby showing that they have many words in common.

To label the mined tweets, it was very important to find the best classifier for the mined dataset, so all the classifiers were trained on the first and second datasets to predict the sentiment of the mined tweets. In this paper, 15 samples of the mined tweets and their predictions by the various classifiers trained on both datasets are tabulated in Tables 8 and 9. A Majority Voting Classifier (MVC) was also utilized for choosing the best classifier. We compared our own predictions for the tweets with the predictions of the classifiers, enumerated the number of correct and incorrect predictions, and then calculated the accuracy of each classifier trained on both datasets, as shown in Tables 10 and 11, respectively. By observing the accuracy, it was noted that the logistic regression classifier trained on the second dataset has an accuracy of 86.67%.

In this section, we analyze the results obtained during the experiment. For the first dataset, the accuracy of the classifiers ranges from 76.5% to 96.7%. From Table 6 and Figure 13, it can be seen that the BiLSTM, random forest, and decision tree models performed exceptionally well in terms of accuracy when compared with the other models used on the same dataset. For the second dataset, however, the models' accuracies do not differ as much as they do for the first dataset. One reason that can explain the results of the BiLSTM model is its use of a deep neural approach. This model has two LSTM networks, which give it access to both backward and forward information at every step, so that every new result is generated from the previous instances. As for random forest and decision tree, both use more or less similar techniques for classifying the data points; in a random forest, however, a group of decision trees is used to provide the best result over all the trees. As a result, on the dataset that we have used, random forest and decision tree provide promising results when compared to the other models. This can be seen in Figure 14 as well.

However, when we consider the same models on a smaller dataset, the results are not the same. Table 7 reports every model tested on the smaller dataset, which allows a comparison with the previous result metrics. The accuracies achieved by logistic regression, naïve Bayes, SVM, and LR-SGD were 90.93%, 90.93%, 89.96%, and 89.96%, respectively. Although these models are known for their accurate results, their accuracy depends on the size of the dataset considered and on the relationships and dependencies among the features and target variables. This can be seen from Tables 6 and 7, which show that the models that performed poorly in terms of accuracy on the first dataset performed well when the size of the dataset was reduced. However, if we compare the results overall, there is not much of a difference: the mean accuracy achieved is 89.03% for the second dataset and 86.41% for the first dataset.

Furthermore, among the other result parameters, precision is often considered a more dominant metric than the others, because it states the proportion of the positive predictions made by the model that are correct. However, in models for the medical industry, recall is considered a more useful metric besides accuracy, as it reflects the positive cases that the model fails to detect. Considering our models, the mean precision value for the first dataset is 86.81%. This means that, on average, roughly 87 of every 100 tweets predicted as positive were classified correctly, and roughly 13 were misclassified. For the second dataset, which has fewer data samples, the mean precision value was 84.98%, which is approximately equal to the value for the first dataset. This could be due to the size of the data samples that were considered in the experiment. Another possible explanation could be the internal relationships that are formed by the model for classifying the results. For instance, the logistic regression model assumes a linear relationship among the data points and performs the classification based on the equations formed. Similarly, other models also have an internal equation based on the relationships formed, which helps in determining the results.

Similarly, recall, also termed sensitivity, is the parameter that accounts for the positive samples the model misses. From Table 6, the average recall value is 85.79%; recall is the ratio of correct positive predictions to the total number of positive data samples. Likewise, for Table 7, the mean recall value was found to be 83.93%. Apart from these result parameters, the F1-score is among the most widely used parameters, as it combines recall and precision. Mathematically, the F1-score is the harmonic mean of the precision and recall values. Since we have discussed the individual parameters in detail, the F1-score is omitted from our discussion, but for performance analysis it can be a more informative metric than a comparison of the individual parameters.

Another parameter taken into consideration, apart from the performance criteria, is the time cost of the model. For this, we provide the CPU utilization time for each model, which can support the model selection decision. Table 12 shows the time taken by each model on each dataset. For the BiLSTM model, the training time is the highest among all the classifier models; its average training time per epoch was 3141.4 s and 13.2 s for the first and second datasets, respectively.

In this paper, Twitter users' sentiments and discussions related to COVID-19 have been analyzed. The findings can be used to understand public sentiment and discussion of the COVID-19 outbreak in a rapid, real-time way, aiding surveillance systems in grasping the evolving conditions. The recognized patterns and responses in public tweets could be used to guide targeted intervention strategies. Different deep learning and machine learning approaches were used for analyzing tweets. The tweets were filtered in the pre-processing stage by eliminating numbers, stopwords, URLs, and various Twitter-related features with the assistance of NLTK. The features were extracted using a bag-of-words model, tokenization, and padding. Two datasets were used for classifying the tweets into positive or negative sentiments using different classifiers: naïve Bayes, random forest, decision tree, SVM, logistic regression, the LR-SGD classifier, bidirectional LSTM, and the majority voting classifier (MVC). The most suitable classifier was selected by comparing various evaluation metrics and the ROC curve. This research could be very helpful in understanding people's sentiments during the coronavirus pandemic and could also help to reduce fear among people by filtering out negative comments. Governments can take informed decisions based on the results of our application and thus reduce chaos in society. Through the LDA approach, the types of tweets that can create negativity in society can also be filtered out. Although our approach is somewhat time-consuming on large or high-dimensional datasets, it could be very beneficial for society.

In this paper, a novel dataset consisting of 6648 tweets has been proposed. The dataset was mined from Twitter using the keyword "COVID-19". We took a few tweets and labeled them to compare the results achieved by the different models trained on the other two datasets. This dataset can be used for further research related to COVID-19 using various other methods. The approach can be deployed in web and Android applications to understand public opinion and control any negative sentiments or rumors related to COVID-19 in the future. It can also be applied to other social networking sites such as Facebook and LinkedIn to gauge people's sentiment on any topic.

Twitter Sentiment Analysis: A Bootstrap Ensemble Framework

Deep Convolution Neural Networks for Twitter Sentiment Analysis

Tweeting Made Easier

COVID-19): A Perspective from China


WHO. Report of the WHO-China Joint Mission on Coronavirus Disease 2019 (COVID-19). 2020. Available online

WHO Director-General's Opening Remarks at the Media Briefing on COVID-19-11

Coronavirus Disease (COVID-19) Weekly Epidemiological Update and Weekly Operational Update

Total Lockdown" after Spike in Cases

Social Network Analysis of COVID-19 Sentiments: Application of Artificial Intelligence

The psychological impact of quarantine and how to reduce it: Rapid review of the evidence

Top Concerns of Tweeters During the COVID-19 Pandemic: Infoveillance Study

Comparison Research on Text Pre-processing Methods on Twitter Sentiment Analysis

Performance Analysis of Machine Learning Algorithms for Prediction of Liver Disease

Classification System for Prediction of Chronic Kidney Disease Using Data Mining Techniques

Public discourse and sentiment during the COVID 19 pandemic: Using Latent Dirichlet Allocation for topic modeling on Twitter

COVID-19 outbreak: Tweet based analysis and visualization towards the influence of coronavirus in the world. Gedrag

Global Sentiments Surrounding the COVID-19 Pandemic on Twitter: Analysis of Twitter Trends

Sentiment Analysis of COVID-19 tweets by Deep Learning Classifiers-A study to show how popularity is affecting accuracy in social media

The Impact of COVID-19 Epidemic Declaration on Psychological Consequences: A Study on Active Weibo Users

Sentiment analysis of nationwide lockdown due to COVID 19 outbreak: Evidence from India

Cross-Cultural Polarity and Emotion Detection Using Sentiment Analysis and Deep Learning on COVID-19 Related Tweets

COVID-19 Sensing: Negative Sentiment Analysis on Social Media in China via BERT Model

Deep Learning-Based Methods for Sentiment Analysis on Nepali COVID-19-Related Tweets

A Hybrid Feature Extraction Method for Nepali COVID-19-Related Tweets Classification

Multi-channel CNN to classify nepali COVID-19 related tweets using hybrid features

Early-Stage Detection of Liver Disease through Machine Learning Algorithms

Semantic Analysis of Sentiments through Web-Mined Twitter Corpus

Indian Sentiments on COVID-19 and Lockdown

Sentiment analysis in twitter using machine learning techniques

How Many Words Are in the English Language? Word Count

Training word embeddings for deep learning in biomedical text mining tasks

Latent dirichlet allocation

A comprehensive survey and analysis of generative models in machine learning

Latent dirichlet allocation in web spam filtering

Latent Dirichlet allocation for tag recommendation

Bug localization using latent Dirichlet allocation

Semantic Annotation of Satellite Images Using Latent Dirichlet Allocation

Bidirectional LSTM-CRF models for sequence tagging

Framewise phoneme classification with bidirectional LSTM and other neural network architectures

Comparison of a logistic regression and Naïve Bayes classifier in landslide susceptibility assessments: The influence of models complexity and training dataset size

Discrimination of Mine Seismic Events and Blasts Using the Fisher Classifier, Naive Bayesian Classifier and Logistic Regression

Applications of Support Vector Machine (SVM) Learning in Cancer Genomics

Tutorial on Support Vector Machine (SVM)

Fake news detection using naive Bayes classifier

Data science appositeness in diabetes mellitus diagnosis for healthcare systems of developing nations

Ensemble Decision Tree Classifier for Breast Cancer Data

Classification of epileptiform EEG using a hybrid system based on decision tree classifier and fast Fourier transform

Using decision tree algorithms as a basis for a heart sound diagnosis decision support system

Bagging predictors

An assessment of the effectiveness of a random forest classifier for land-cover classification

Random forest classifier for remote sensing classification

Classification and Regression Trees

Comparative Study of Classification Algorithms used in Sentiment

Spatial prediction of shallow landslide using Bat algorithm optimized machine learning approach: A case study in Lang Son Province

Automatic detection of asphalt pavement raveling using image texture based feature extraction and stochastic gradient descent logistic regression

Listed companies' financial distress prediction based on weighted majority voting combination of multiple classifiers

Predicting stock returns by classifier ensembles

LDAvis: A method for visualizing and interpreting topics

Termite: Visualization Techniques for Assessing Textual Topic Models Categories and Subject Descriptors

Diagnosis of Intracranial Tumors via the Selective CNN Data Modeling Technique

Prolificacy Assessment of Spermatozoan via State-of-the-Art Deep Learning Frameworks

Mycobacterium Tuberculosis Detection Using CNN Ranking Approach

A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems

Institutional Review Board Statement: Not applicable.

Data Availability Statement: This study consists of three datasets, where dataset 1 and dataset 2 were collected from Kaggle, and for dataset 3, a GitHub link is provided as it is our own mined proposed dataset. Links for all the datasets are given below. Dataset 1 Link: https://www.kaggle.com/abhaydhiman/covid19-sentiments (accessed on 7 January 2022). Dataset 2 Link: https://www.kaggle.com/surajkum1198/twitterdata (accessed on 7 January 2022). Dataset 3 Link: https://github.com/satish-1999/Covid-Sentiment-Analysis/blob/main/mined_Logistic_small.csv (accessed on 7 February 2022).

Conflicts of Interest: The authors declare no conflict of interest.