key: cord-0793544-27hssh4x authors: Alkhaldi, Nora A.; Asiri, Yousef; Mashraqi, Aisha M.; Halawani, Hanan T.; Abdel-Khalek, Sayed; Mansour, Romany F. title: Leveraging Tweets for Artificial Intelligence Driven Sentiment Analysis on the COVID-19 Pandemic date: 2022-05-13 journal: Healthcare (Basel) DOI: 10.3390/healthcare10050910 sha: 5af455766e2b77c8673bca54365a43c03de518c8 doc_id: 793544 cord_uid: 27hssh4x The COVID-19 pandemic has been a disastrous event that has elevated several psychological issues such as depression given abrupt social changes and lack of employment. At the same time, social scientists and psychologists have gained significant interest in understanding the way people express emotions and sentiments at the time of pandemics. During the rise in COVID-19 cases with stricter lockdowns, people expressed their sentiments on social media. This offers a deep understanding of human psychology during catastrophic events. By exploiting user-generated content on social media such as Twitter, people’s thoughts and sentiments can be examined, which aids in introducing health intervention policies and awareness campaigns. The recent developments of natural language processing (NLP) and deep learning (DL) models have exposed noteworthy performance in sentiment analysis. With this in mind, this paper presents a new sunflower optimization with deep-learning-driven sentiment analysis and classification (SFODLD-SAC) on COVID-19 tweets. The presented SFODLD-SAC model focuses on the identification of people’s sentiments during the COVID-19 pandemic. To accomplish this, the SFODLD-SAC model initially preprocesses the tweets in distinct ways such as stemming, removal of stopwords, usernames, link punctuations, and numerals. In addition, the TF-IDF model is applied for the useful extraction of features from the preprocessed data. Moreover, the cascaded recurrent neural network (CRNN) model is employed to analyze and classify sentiments. 
Finally, the SFO algorithm is utilized to optimally adjust the hyperparameters involved in the CRNN model. The novelty of this study lies in the design of the SFODLD-SAC technique, which includes an SFO-based hyperparameter optimizer, for analyzing people’s sentiments on COVID-19. The simulation analysis of the SFODLD-SAC model is performed using a benchmark dataset from the Kaggle repository. Extensive comparative results report the promising performance of the SFODLD-SAC model over recent state-of-the-art models, with a maximum accuracy of 99.65%.

COVID-19 is a communicable disease that spreads mainly through the tiny droplets released by an individual while sneezing, coughing, and talking. It has also become a source of anxiety, depression, and stress, owing to the false information posted on social media. The mental well-being of people is severely affected by the fast spread of incorrect information on social media [1, 2]. Due to the present situation of lockdown and social distancing, people are mainly dependent on,

• An intelligent SFODLD-SAC model is presented, consisting of TF-IDF-based feature extraction, CRNN classification, and SFO-based hyperparameter optimization for COVID-19 tweet analysis. To the best of our knowledge, the SFODLD-SAC model has never been presented in the literature;
• The SFODLD-SAC technique involves the design of an SFO algorithm to optimally choose the hyperparameters, which helps in increasing the classification accuracy and avoids computational overhead;
• The performance of the SFODLD-SAC model is validated using a benchmark dataset from the Kaggle repository, and the results are investigated under distinct sizes of training/testing data.

The rest of this paper is organized as follows: Section 2 offers related research, and Section 3 discusses the proposed model.
Then, Section 4 elaborates on the experimental validation with the benchmark Kaggle dataset, and Section 5 draws the conclusions of the paper.

This section offers a detailed review of existing sentiment analysis (SA) models related to COVID-19. Researchers in [9] analyzed Indian people's sentiment during the lockdown, using popular hashtags to measure negativity and positivity. Samuel et al. [10] highlighted public sentiments related to the COVID-19 pandemic using two machine learning (ML) classification techniques. The researchers in [11] presented an architecture in which a deep-learning-based language model built on a long short-term memory (LSTM) recurrent neural network was applied for sentiment analysis during the increase in COVID-19 cases in India. In [12], COVID-19 tweet data were analyzed using a bidirectional encoder representations from Transformers (BERT) model. Gulati et al. [13] conducted a comparative analysis of ML-based classifiers, applied to more than 72,000 tweets related to COVID-19. Mujahid et al. [14] employed a Twitter dataset comprising 17,155 tweets regarding e-learning. ML and DL methods showed the potential, suitability, and capability for object detection, natural language processing, and image processing tasks. Luo and Xu [15] presented a DL method to explore customer opinion regarding restaurant features and to discover reviews with mismatched ratings. This study strengthens the extant literature by analyzing restaurant reviews posted during the COVID-19 pandemic and finding a DL algorithm for text mining tasks [16]. Singh et al. [17] proposed a DL technique for SA of Twitter data related to COVID-19. The suggested model relies on an LSTM-RNN-based network, with feature weighting improved by an attention layer; that is, the approach uses an improved feature transformation architecture based on the attention model. Yin et al. [18] conducted a study of COVID-19 vaccination discussions on Twitter.
The authors analyzed the discussions of individuals on this topic and the emotional polarization regarding vaccine brands and perceptions across countries. The results showed that the majority of individuals trust the usefulness of vaccines and are ready to be vaccinated. In another study [19], the authors focused on improving the understanding of public awareness of COVID-19 pandemic trends and uncovering meaningful themes of concern posted by Twitter users in English. An NLP method and the latent Dirichlet allocation model were utilized to classify, cluster, and identify themes based on keyword analysis, along with identifying the most common Twitter topics. In [20], data from an Arabic COVID-19 tweet dataset were gathered and processed with ML prediction models. The results showed that applying SVM classification together with bigrams in TF-IDF outperformed other algorithms, with 85% accuracy. Lyu et al. [21] identified sentiments and topics in COVID-19 vaccine-related conversation among the public on social networking platforms and examined the relevant changes in sentiments and topics over time for a better understanding of public emotions, perceptions, and concerns that might affect the achievement of herd immunity objectives. Basiri et al. [22] presented a methodology based on the fusion of four DL models and one traditional supervised ML method for SA of COVID-19-related tweets from eight countries. Moreover, the authors analyzed COVID-19-related searches using Google Trends for a better understanding of the changes in sentiment patterns at different places and times. Imran et al. [23] analyzed the reaction of citizens from various cultures to the novel COVID-19 and people's sentiments regarding subsequent actions taken by many countries. A deep LSTM model was utilized for assessing the emotions and sentiment polarities from extracted tweets.
In [24], GloVe and fastText were tested as word embeddings. Data collected from Twitter were prepared as stemmed and unstemmed datasets.

In short, SA can be considered a meaningful source of data mining, particularly when massive quantities of publicly relevant data must be examined, such as when investigating public behavior concerning the COVID-19 pandemic and its effect on people's lives. Furthermore, it is desirable to improve decision makers' countermeasures and offer them a simple method with a collection of common rules that assist complex decision-making processes based on people's sentiments, by examining and sorting an essential set of key features for COVID-19 posts. Thus, the proposed study differs from earlier research in combining decision support systems (DSS) with SA for improving government decisions at the time of COVID-19. The use of the SFODLD-SAC model offers more insights and achieves better performance than other state-of-the-art techniques.

In this study, a novel SFODLD-SAC model was developed for the identification and classification of sentiments on COVID-19 tweets. The presented SFODLD-SAC model follows a series of processes, namely, preprocessing, TF-IDF feature extraction, CRNN classification, and SFO-based parameter optimization. Figure 1 illustrates the pipeline of the SFODLD-SAC model. The workflow of each module in the SFODLD-SAC model is elaborated in the following subsections.

In this section, the performance of the SFODLD-SAC model on the COVID-19 tweet dataset is investigated [25]. The dataset holds 2750 instances with 11 class labels. The details related to the dataset are given in Table 1. Some sample tweets related to COVID-19 are provided in Table 2.

Table 2. Sample tweets (class labels as printed in the table: 0, 4, 11, 5, 10).
• The problem of poverty has now covered the cover of religion. The issue has changed.
• There is relief from corona. All is well
• My mental health hasn't suffered at all under the coronavirus quarantine! Ha-ha, April Fools.
• i cannot die before watching a concert live coronavirus pls try to understand

At first, the SFODLD-SAC model preprocessed the tweets in distinct ways, such as stemming and the removal of stopwords, usernames, links, punctuation, and numerals [25]:
• Removing usernames and links in tweets, which do not affect SA;
• Removing punctuation marks and hashtag symbols and converting the text to lower case;
• Removing stopwords and numerals.

In addition, stemming was performed to reduce the terms to their root forms, which also helps to reduce the complexity of the text features. Then, the TextBlob approach was used to determine the sentiment scores. Afterward, the TF-IDF model was applied to the preprocessed data to extract useful features and generate a collection of feature vectors.
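The preprocessing and TF-IDF steps described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the stopword list, the `naive_stem` suffix stripper (a stand-in for a real stemmer), the smoothed IDF formula, and all function names are hypothetical choices, and the TextBlob scoring step is not reproduced.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "at", "to", "of", "and", "in", "on", "it"}


def naive_stem(token):
    """Crude suffix stripper for illustration only (not the Porter stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(tweet):
    """Remove usernames, links, punctuation, numerals, and stopwords; then stem."""
    text = re.sub(r"@\w+", " ", tweet)                    # usernames
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)    # links
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)                 # punctuation, '#', numerals
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return [naive_stem(t) for t in tokens]


def tfidf(docs):
    """Map a list of token lists to (vocabulary, list of TF-IDF vectors)."""
    vocab = sorted({t for d in docs for t in d})
    n_docs = len(docs)
    doc_freq = Counter(t for d in docs for t in set(d))
    # Smoothed inverse document frequency (one common variant, assumed here).
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    vectors = []
    for d in docs:
        counts = Counter(d)
        total = len(d) or 1
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors


if __name__ == "__main__":
    tweets = [
        "@user Staying home!! 2 weeks of #lockdown https://t.co/abc",
        "There is relief from corona. All is well",
    ]
    vocab, vectors = tfidf([preprocess(t) for t in tweets])
    print(vocab)
    print(vectors[0])
```

Terms that occur in fewer tweets receive a larger IDF weight, so the resulting vectors emphasize words that distinguish one tweet from the others.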
For the effective recognition and classification of sentiments, the CRNN model was exploited [26]. An RNN is a type of artificial neural network (ANN) that extends a feedforward neural network (FFNN) with recurrent connections (loops). Unlike an FFNN, an RNN can process an input sequence using a recurrent hidden layer whose activation depends on that of the previous step. Given the sequential dataset $(x_1, x_2, \ldots, x_T)$, where $x_i$ denotes the data at the $i$-th time step, the RNN updates the recurrent hidden layer $h_t$ as follows:

$$h_t = \begin{cases} 0, & t = 0 \\ \varphi(h_{t-1}, x_t), & \text{otherwise} \end{cases} \qquad (1)$$

where $\varphi$ indicates a nonlinear function. The RNN can also produce an output sequence $(y_1, y_2, \ldots, y_T)$, and data classification is eventually performed using the final output $y_T$. In the traditional RNN model, the update rule of the recurrent hidden layer in (1) can be implemented by

$$h_t = \varphi(W x_t + U h_{t-1})$$

where $W$ and $U$ represent the coefficient matrices for the input and for the activation of the recurrent hidden units, respectively. The sequential probability $p(x_1, x_2, \ldots, x_T)$ can be factorized as follows:

$$p(x_1, x_2, \ldots, x_T) = p(x_1)\, p(x_2 \mid x_1) \cdots p(x_T \mid x_1, \ldots, x_{T-1})$$

Next, each conditional likelihood distribution can be modeled by utilizing a recurrent network:

$$p(x_t \mid x_1, \ldots, x_{t-1}) = \varphi(h_t)$$

The tweets can be processed as sequence data, and a recurrent network is employed to model the sequence [26]. Compared with the LSTM unit, the gated recurrent unit (GRU) has fewer parameters, and fewer training instances are needed. Therefore, GRU was chosen as the key element of the RNN. The essential component of GRU is two gating units that control the data flow within the unit.
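To make the role of the two gating units concrete, the following is a minimal sketch of one GRU update in its standard textbook formulation. It is an assumption that the CRNN in this study uses exactly this form; the parameter names (`W_u`, `U_u`, and so on), the random initialization, and the helper functions are hypothetical and chosen only for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def gru_step(x_t, h_prev, p):
    """One GRU update (standard formulation, assumed here):
    u_t = sigmoid(W_u x_t + U_u h_prev)          update gate
    r_t = sigmoid(W_r x_t + U_r h_prev)          reset gate
    c_t = tanh(W_h x_t + U_h (r_t * h_prev))     candidate state
    h_t = (1 - u_t) * h_prev + u_t * c_t
    """
    u = [sigmoid(a + b) for a, b in zip(matvec(p["W_u"], x_t), matvec(p["U_u"], h_prev))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["W_r"], x_t), matvec(p["U_r"], h_prev))]
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    c = [math.tanh(a + b) for a, b in zip(matvec(p["W_h"], x_t), matvec(p["U_h"], gated))]
    return [(1.0 - ui) * hi + ui * ci for ui, hi, ci in zip(u, h_prev, c)]


def init_params(input_dim, hidden_dim, seed=0):
    """Small random weights; the initialization scheme is an arbitrary choice."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    params = {}
    for gate in ("u", "r", "h"):
        params["W_" + gate] = mat(hidden_dim, input_dim)
        params["U_" + gate] = mat(hidden_dim, hidden_dim)
    return params


def run_gru(sequence, params, hidden_dim):
    """Run the GRU over a sequence of input vectors and return the last hidden state."""
    h = [0.0] * hidden_dim  # initial hidden state set to zero
    for x_t in sequence:
        h = gru_step(x_t, h, params)
    return h


if __name__ == "__main__":
    params = init_params(input_dim=3, hidden_dim=2)
    print(run_gru([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], params, hidden_dim=2))
```

The update gate decides how much of the previous hidden state is kept, and the reset gate decides how much of it feeds the candidate state. In a classifier, the last hidden state would be passed to an output layer, which is omitted here.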
Figure 2 depicts the framework of CRNN. In the GRU, $u_t$ denotes the update gate.

Finally, the SFO algorithm was utilized to optimally adjust the hyperparameters involved in the CRNN model. Gomes et al. [27] introduced the SFO algorithm, a nature-inspired approach modeled on flowering plants that takes into account the biological process of reproduction through pollination. Generally, the SFO algorithm involves six steps, as given in Figure 3. It starts with the parameter initiation process, during which the number of sunflowers, maximum iterations, and solution dimension space are initialized. Then, the sunflower parameters such as pollination rate, mortality rate, and survival rate are fixed. In the third step, the optimal objective of every sunflower is arbitrarily chosen. Next, the optimal sunflower is updated. Afterward, the new sunflower is produced depending upon the pollination and mortality rate. In the final step, the termination condition is checked, and the process continues until the stopping criteria are fulfilled. The mathematical modeling of the SFO algorithm is given in what follows.
For this algorithm, we considered the peculiar nature of sunflowers in detecting the optimal direction toward the sun. Pollination was considered to occur randomly, with minimal distance between flower $i$ and flower $i + 1$. In nature, a flower patch releases billions of pollen gametes; for simplicity, it was assumed that each sunflower only generates one pollen gamete and reproduces individually. Next, the amount of heat $Q$ received by each plant is given by

$$Q_i = \frac{P}{4 \pi r_i^2}$$

where $P$ denotes the source power, and $r_i$ indicates the distance between the current best plant and plant $i$.
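The six steps outlined for the SFO algorithm can be illustrated with the simplified sketch below. It is not the implementation of Gomes et al. [27] or of this study: the pollination probability is approximated by a uniform random number, only a mortality rate is modeled, the best plant is kept unchanged between iterations, and the toy objective stands in for the classifier error rate. The function name `sfo_minimize` and all parameter names are hypothetical.

```python
import math
import random


def sfo_minimize(objective, lower, upper, n_pop=20, n_iter=100,
                 mortality_rate=0.1, lam=1.0, seed=0):
    """Simplified sunflower-style search that minimizes `objective` within bounds."""
    rng = random.Random(seed)

    def random_plant():
        return [rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]

    def clip(x):
        return [min(max(v, lo), hi) for v, lo, hi in zip(x, lower, upper)]

    # Upper limit on the step length: ||X_max - X_min|| / (2 * N_pop).
    d_max = math.dist(upper, lower) / (2 * n_pop)
    n_dead = int(mortality_rate * n_pop)

    # Steps 1-3: initialize the population and rank it by objective value.
    plants = sorted((random_plant() for _ in range(n_pop)), key=objective)
    for _ in range(n_iter):
        best = plants[0]              # the "sun": best plant found so far
        moved = [best]                # keep the best plant unchanged
        for i in range(1, n_pop):
            if i >= n_pop - n_dead:   # the worst plants die and are re-seeded
                moved.append(random_plant())
                continue
            x, neighbour = plants[i], plants[i - 1]
            gap = math.dist(best, x)
            if gap == 0.0:
                moved.append(x)
                continue
            # Unit direction from the plant toward the best plant.
            direction = [(b - v) / gap for b, v in zip(best, x)]
            # ||X_i + X_{i-1}||, scaled by a random factor standing in for
            # the pollination probability, then capped at d_max.
            spread = math.hypot(*[a + b for a, b in zip(x, neighbour)])
            step = min(lam * rng.random() * spread, d_max)
            moved.append(clip([v + step * s for v, s in zip(x, direction)]))
        plants = sorted(moved, key=objective)
    return plants[0], objective(plants[0])


if __name__ == "__main__":
    def sphere(x):
        # Toy objective; in hyperparameter tuning this would be the
        # validation error rate of a model trained with hyperparameters x.
        return sum(v * v for v in x)

    best_x, best_f = sfo_minimize(sphere, [-5.0, -5.0], [5.0, 5.0])
    print(best_x, best_f)
```

Because the best plant is always carried over, the best objective value never gets worse from one iteration to the next. For hyperparameter tuning, each coordinate of a plant would encode one hyperparameter, for example a learning rate or a hidden-layer size.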
The sunflower's direction toward the sun can be represented as follows:

$$\vec{s}_i = \frac{X^{*} - X_i}{\lVert X^{*} - X_i \rVert}, \quad i = 1, 2, \ldots, N_{pop}$$

where $X^{*}$ denotes the best plant (the sun) and $X_i$ the position of plant $i$. The step of the sunflowers in direction $\vec{s}$ is evaluated by

$$d_i = \lambda \times P_i(\lVert X_i + X_{i-1} \rVert) \times \lVert X_i + X_{i-1} \rVert$$

where $\lambda$ represents a constant value, and $P_i(\lVert X_i + X_{i-1} \rVert)$ denotes the pollination probability, i.e., sunflower $i$ pollinates with its neighbor $i - 1$, creating an individual in an arbitrary position that varies according to the distance between the flowers. Specifically, individuals near the sun take small steps, which yields a local refinement search. Additionally, it is necessary to bound the maximal step taken by an individual. Hence, it is defined as

$$d_{max} = \frac{\lVert X_{max} - X_{min} \rVert}{2 \times N_{pop}}$$

where $X_{max}$ and $X_{min}$ indicate the upper and lower bounds, and $N_{pop}$ represents the number of plants in the population. The new plantation can be expressed as follows:

$$X_{i+1} = X_i + d_i \times \vec{s}_i$$

The SFO approach derives a fitness function (FF) for achieving enhanced classification performance. In this case, the classifier error rate to be minimized was taken as the FF, determined by Equation (12):

$$fitness(x_i) = ClassifierErrorRate(x_i) = \frac{\text{number of misclassified instances}}{\text{total number of instances}} \times 100 \qquad (12)$$

The best result has a minimal error rate, and the worst result has a high error rate.

Table 3 provides the detailed classification outcomes of the SFODLD-SAC model on the 70% training set (TRS). The experimental results revealed that the proposed model provided effective outcomes under all class labels. Table 4 provides the detailed classification outcomes of the SFODLD-SAC model on the 30% testing set (TSS). Figure 8 showcases a comparative result of the SFODLD-SAC model on 30% of TSS in terms of $accu_y$, $prec_n$, and $reca_l$. The figure exhibits that the SFODLD-SAC technique attained improved performance under all class labels. For instance, the SFODLD-SAC model recognized class 0 with $accu_y$, $prec_n$, and $reca_l$ of 99.52, 96.05, and 98.65%, respectively. Moreover, the SFODLD-SAC method identified class 5 with $accu_y$, $prec_n$, and $reca_l$ of 99.76, 98.57, and 98.57%, respectively. Furthermore, the SFODLD-SAC model recognized class 10 with $accu_y$, $prec_n$, and $reca_l$ of 99.76, 100, and 97.40%, respectively.
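Per-class values such as those quoted above are typically derived from one-vs-rest counts. The sketch below shows one way to compute them. It assumes, without confirmation from the text, that the reported per-class accuracy counts true negatives as correct predictions; the function names and the toy labels are hypothetical and are not taken from the study's data.

```python
def per_class_metrics(y_true, y_pred, labels):
    """One-vs-rest accuracy, precision, and recall for each class label."""
    n = len(y_true)
    metrics = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        tn = n - tp - fp - fn
        metrics[c] = {
            "accuracy": (tp + tn) / n,
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        }
    return metrics


def macro_average(metrics, name):
    """Unweighted mean of one metric over all classes."""
    return sum(m[name] for m in metrics.values()) / len(metrics)


if __name__ == "__main__":
    # Toy labels for three classes, not taken from the dataset.
    y_true = [0, 0, 1, 1, 2, 2]
    y_pred = [0, 1, 1, 1, 2, 0]
    m = per_class_metrics(y_true, y_pred, labels=[0, 1, 2])
    for label, values in m.items():
        print(label, values)
    print("macro recall:", macro_average(m, "recall"))
```

Under this convention, per-class accuracy can exceed precision and recall when there are many classes, because every instance of the other classes that is correctly kept out of class c counts as a true negative.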
Figure 10 showcases the average classification performance of the SFODLD-SAC model on 30% of TSS. The results revealed that the SFODLD-SAC model provided average $accu_y$, $prec_n$, and $reca_l$ values of 99.76, 98.12, and 98.05%, respectively. Therefore, the SFODLD-SAC model accomplished effective sentiment classification on tweets.

The training accuracy (TA) and validation accuracy (VA) attained by the SFODLD-SAC model on tweet sentiment classification are demonstrated in Figure 11. Based on the experimental outcomes, the SFODLD-SAC model gained maximum values of TA and VA. Specifically, VA seemed to be higher than TA.

The training loss (TL) and validation loss (VL) achieved by the SFODLD-SAC model on tweet sentiment classification are shown in Figure 12. Based on the experimental outcomes, it can be inferred that the SFODLD-SAC model accomplished the least values of TL and VL. Specifically, VL seemed to be lower than TL. The results denoted that the SFODLD-SAC model exhibited its ability in categorizing different classes on the test datasets.
To compare the SFODLD-SAC model with recent approaches [12], a comparative study was conducted, the results of which are shown in Table 5 and Figure 13. The experimental outcomes indicated that the SVM and DT models showed the lowest classification performance among the compared methods. At the same time, the RF and XGBoost models accomplished slightly improved outcomes. In addition, the extra tree classifier accomplished reasonable performance with $accu_y$, $prec_n$, $reca_l$, and F1 score of 92.32, 93.08, 92.42, and 92.13%, respectively. However, the SFODLD-SAC model accomplished superior outcomes with maximum $accu_y$, $prec_n$, $reca_l$, and F1 score of 99.65, 98.12, 98.05, and 98.06%, respectively. The abovementioned results and discussion demonstrate that the SFODLD-SAC model accomplished effective classification performance on COVID-19 tweets. The enhanced performance of the proposed model is due to the optimal hyperparameter tuning of the CRNN model using the SFO algorithm.

In this study, a novel SFODLD-SAC model was introduced for the recognition and classification of sentiments on COVID-19 tweets. At the initial stage, the SFODLD-SAC model preprocessed the tweets in distinct ways, such as stemming and the removal of stopwords, usernames, links, punctuation, and numerals. Then, the TF-IDF model was applied for the useful extraction of features from the preprocessed data. Afterward, features were passed into the CRNN model to analyze and classify sentiments. Lastly, the SFO algorithm was utilized to optimally adjust the hyperparameters that exist in the CRNN model.
A simulation analysis of the SFODLD-SAC model was performed using a benchmark dataset from the Kaggle repository. Extensive comparative results report the promising performance of the SFODLD-SAC model over other recent state-of-the-art models, with maximum $accu_y$, $prec_n$, $reca_l$, and F1 score of 99.65, 98.12, 98.05, and 98.06%, respectively. Thus, the presented SFODLD-SAC model can be applied for enhanced SA on COVID-19 tweets, as well as on big data environments to analyze the sentiments in a real-time environment. In the future, outlier detection and clustering models can be employed to improve the sentiment classification performance. Moreover, the proposed SFODLD-SAC model can be extended to the design of an ensemble voting-based fusion model to improve classification performance. In addition, the proposed model can focus on the design of metaheuristic feature selection techniques to reduce the curse of dimensionality. Finally, different data preprocessing approaches can be employed for improving the input data quality in the future.

Data Availability Statement: Data sharing is not applicable to this article, as no datasets were generated during the current study.

[1] Predicting the impact of the third wave of COVID-19 in India using hybrid statistical machine learning models: A time series forecasting and sentiment analysis approach
[2] A proposed sentiment analysis deep learning algorithm for analyzing COVID-19 tweets
[3] Unsupervised deep learning based variational autoencoder model for COVID-19 diagnosis and classification
[4] Optimized convolutional neural network for automatic detection of COVID-19
[5] Deep Convolutional Neural Network Approach for COVID-19 Detection
[6] A novel machine learning classification model to investigate topics and sentiment in COVID-19 tweets
[7] A large-scale benchmark Twitter data set for COVID-19 sentiment analysis
[8] A complete VADER-based sentiment analysis of bitcoin (BTC) tweets during the era of COVID-19. Big Data Cogn
[9] Sentiment analysis and its applications in fighting COVID-19 and infectious diseases: A systematic review
[10] COVID-19 public sentiment insights and machine learning for tweets classification
[11] COVID-19 sentiment analysis via deep learning during the rise of novel cases
[12] Sentimental analysis of COVID-19 tweets using deep learning models
[13] Comparative analysis of machine learning-based classification models using sentiment classification of tweets related to COVID-19 pandemic
[14] Sentiment analysis and topic modeling on tweets about online education during COVID-19
[15] Comparative study of deep learning models for analyzing online restaurant reviews in the era of the COVID-19 pandemic
[16] A performance comparison of supervised machine learning models for Covid-19 tweets sentiment analysis
[17] Learning Approach for Sentiment Analysis of COVID-19
[18] Sentiment analysis and topic modeling for COVID-19 vaccine discussions
[19] Public perception of the COVID-19 pandemic on Twitter: Sentiment analysis and topic modeling study
[20] A sentiment analysis approach to predict an individual's awareness of the precautionary procedures to prevent COVID-19 outbreaks in Saudi Arabia
[21] COVID-19 vaccine-related discussion on Twitter: Topic modeling and sentiment analysis
[22] A novel fusion-based deep learning model for sentiment analysis of COVID-19 tweets
[23] Cross-cultural polarity and emotion detection using sentiment analysis and deep learning on COVID-19 related tweets
[24] Sentiment Analysis of COVID-19 Vaccines in Indonesia on Twitter Using Pre-Trained and Self-Training Word Embeddings
[25] Sentiment Analysis of COVID-19 Related Tweets
[26] An optimal cascaded recurrent neural network for intelligent COVID-19 detection using Chest X-ray images
[27] A sunflower optimization (SFO) algorithm applied to damage identification on laminated composite plates

Taif University Researchers Supporting Project number (TURSP-2020/154), Taif University, Taif, Saudi Arabia.

The authors declare that they have no conflicts of interest.
The manuscript was written with the contributions of all authors. All authors have given approval to the final version of the manuscript.