key: cord-0963146-afxwu278
authors: Rahib, Md. Rumman Hussain Khan; Tamim, Amzad Hussain; Tahmeed, Mohammad Zawad; Hossain, Mohammad Jaber
title: Emotion Detection Based on Bangladeshi People’s Social Media Response on COVID-19
date: 2022-03-03
journal: SN Comput Sci
DOI: 10.1007/s42979-022-01077-1
sha: 8c7091d80c6b61b9190de601e4f8146640e3ae0c
doc_id: 963146
cord_uid: afxwu278

The COVID-19 pandemic, caused by a virus first identified in December 2019 in Wuhan, China, is still active on a global scale. As the pandemic continues to affect millions of lives, several countries, including Bangladesh, have gone into complete lockdown for a second time. During the lockdown periods, people have expressed their experiences, curiosities, and ideas regarding the health and socioeconomic problems caused by the pandemic. This study was conducted to determine how Bangladeshi people are responding to and dealing with the circumstances of COVID-19. It took into account statuses and comments on COVID-19-related issues from a variety of Facebook pages and YouTube channels run by reputable Bangladeshi news organizations and health experts. Several machine learning methods were studied, covering the classical algorithms SVM and Random Forest and the deep learning algorithms CNN and LSTM, in experiments on a labeled data set owned by the authors that contains 10,581 data points. In the model evaluation, LSTM outperforms all other models with an accuracy of 84.92%.

In today's culture, people's values and opinions matter a lot. In a world where nearly everything is online, users on the web provide a great deal of insight. Sentiment analysis has altered the way people interpret data and make decisions based on its evaluation.
With the extensive development of connections between users across social media platforms, sentiment analysis has attracted growing academic and business attention. Because of the social media influenced lifestyle of recent times, experiences and ideas regarding issues arising in daily life have become easy to share and access. Sentiment analysis is one of the most well-studied and active research fields of natural language processing, since access to such vast amounts of opinion-based data allows the research community to interpret and evaluate the information to help detect emotion, support decision-making, and assess product sentiment or perspectives on cultural, social, international, and political agendas [1].

In December 2019, Chinese authorities first identified the coronavirus (COVID-19) in Wuhan, China, and since then the disease has spread globally, resulting in a pandemic [2]. At the time of updating this manuscript, the outbreak had affected more than 157 million people worldwide and killed more than 3.2 million, according to official updates from the World Health Organization (May 2021). In Bangladesh, the country's epidemiology institute, IEDCR, announced the first three known cases on March 8, 2020; since then, the pandemic has progressively spread across the country, with the number of people infected steadily increasing [3]. The concerns around the pandemic's progression are complicated because they have far-reaching implications not just in the medical field but also in the social, economic, technological, political, and psychological areas. This study provides an in-depth exploration of Bangladeshi people's response to the circumstances arising from the pandemic, the lockdown imposed in the country, and the ongoing vaccination process, with an emphasis on the second wave.
The continuing reactions of Bangladeshi people to the global pandemic on social media platforms provide insight into how they are conceptualizing the pandemic and responding to its progression. The study interprets social media data to identify public opinion, with a focus on Bangladeshi people, regarding the outbreak of COVID-19 infections and the country's condition as a result of the pandemic. Comparative studies were carried out using an in-house data set covering three emotions, namely insightful, curious, and gratitude, drawn from statuses and comments on various COVID-19-related issues from several Facebook pages and YouTube channels run by prominent Bangladeshi news organizations and health specialists. The study considered both traditional and deep learning algorithms, namely Random Forest, SVM, LSTM, and CNN, for model evaluation. The data set was carefully selected and cleaned, and it was investigated thoroughly to gain a more extensive understanding of the emotion behind each opinion.

Sentiment analysis is not restricted to evaluating online product ratings; it has evolved with the rise of social media platforms (e.g., Twitter, Facebook). Many areas other than product ratings, such as financial prices, elections, disasters, medicine, technological production, and cyberbullying, apply sentiment analysis to detect people's emotions [4]. Research on sentiment analysis began to grow rapidly in the early 2000s. In Ref. [5], Pang and Lee did groundbreaking work in this domain by approaching two-class sentiment analysis of movie reviews using several classifiers.
The growth and availability of social media platforms gave the field a boost: sentiment analysis using machine learning algorithms has been studied for over a decade, as researchers have become more enthusiastic about analyzing sentiment in social media data. The enhanced lexical resource SentiWordNet, specifically designed to support sentiment classification and opinion mining applications, was introduced in Refs. [6, 7]. In Ref. [8], sentence-level sentiment orientation was measured through the positive and negative scores of words from SentiWordNet. In Ref. [9], the proposed method examined sentiment analysis approaches for classifying Web forum opinions in several languages. In 2010, a Bangla variant of SentiWordNet, claimed to be sound and reliable, was introduced in Ref. [10].

For the Bangla language, much research has addressed people's responses, but the majority of it is devoted to binary sentiment polarity, while emotion identification has received comparatively less attention. In Ref. [11], a lexicon-based backtracking technique was proposed, with 77.16% accuracy, to explore how people use emotional keywords to express their emotions. In Ref. [12], a data set of Bangla, English, and Romanized Bangla comments on various YouTube videos was presented together with a deep learning-based multi-label sentiment and emotion detection technique, which reported an accuracy of 59.23%. In Ref. [13], the study presented a data set of six distinct emotion types and showed that a non-linear SVM with an RBF kernel delivers an average accuracy of 52.98%; a significant contribution was its exploration of various pre-processing and feature selection techniques. Using the same data set as Ref. [13] but considering three classes, the study in Ref. [14] applied a multinomial naive Bayes (NB) classifier with several features (e.g., stemmer, parts-of-speech (POS) tagger, n-grams) to distinguish multi-class emotions (i.e., happy, sad, and anger) in Bangla text and achieved an accuracy of 78.6%. In Ref. [15], the study produced a data set along with three text classification models used as a baseline evaluation of the data set's efficacy. In Ref. [16], the study conducted comparison tests on several annotated Bangla sentiment data sets for positive and negative sentiment, with a thorough investigation of several machine learning algorithms, among which transformer-based models outperformed all others.

At the time of updating this manuscript, no attempt had been made to detect emotion in Bangladeshi people's responses to the COVID-19 pandemic using both classical and deep learning algorithms. The findings of this study show considerable consistency. In contrast to previous research, this study presents a comparative analysis of multiple classical and deep learning algorithms on Bangladeshi people's responses to the COVID-19 pandemic.

The data set used for this study includes a vast number of user statuses and comments on issues related to COVID-19 from a variety of social media sources run by reputable news organizations and health experts of Bangladesh. People's reactions to the pandemic covered a wide variety of topics, the majority of which were suggestions and criticism about the current situation, questions to doctors on health issues, and appreciation of the authorities and frontline workers for their efforts. Therefore, the study categorized the data into three emotion classes (i.e., insightful, curious, and gratitude).
The authors, who are native Bangla speakers, manually annotated the data set based on the words and phrases conveying the opinion, as well as the overall emotion expressed in the comment. In some cases an entry could have been assigned all three labels; such ambiguous entries were removed. Figure 1 depicts raw text data and the emotion class finally assigned to it. The data set contains 10,581 entries, labeled with three different labels: Insightful, Curious, and Gratitude. There are 3800, 3549, and 3232 entries for the Insightful, Curious, and Gratitude classes, respectively.

Data preprocessing is essential for achieving the best results in a data analysis task. Since text-based reviews have no set length, are not well standardized, and do not meet any clear criteria, the gathered raw data carried significant noise. The reviews sometimes included data that was irrelevant or insignificant for interpretation and therefore required processing. While annotating the raw text, the main focus was to avoid texts with high noise and keep the data as clean as possible; therefore, very noisy texts were dropped during annotation. Besides, as the annotation was done manually, some basic corrections, such as fixing spelling errors, were made. In the data preprocessing phase, further noise was eliminated through the following steps:

(a) Eliminating noise: initially, all HTML markup, URLs, non-ASCII characters, and trailing white space were removed. Along with Unicode replacement, needed because the data set is in Bangla, punctuation and digits were also dropped from the text.

(b) Removing emoticons: all emoticons were removed from the comments at this stage, since the aim was to consider only textual content.
(c) Tokenization: the texts were tokenized (the size of the tokens was 57,269) and the unique tokens were counted for the deep learning models. In addition, padding was applied to keep every input (sentence or text) at the same length. The labels were then mapped to integers for classification.

(d) Stop word removal could have been applied, but because pre-trained Word2Vec was used for all the models, no measures such as stemming or lemmatization were required, since words with the same stem can have different meanings.

The term word embedding simply refers to the representation of words in vector form. The procedure for converting a word to a vector is therefore critical, and there are several ways to convert text data into vectors that a model can process (e.g., CountVectorizer, Word2Vec, TF-IDF); this study focused on Word2Vec. Initially, we encountered some difficulties in finding a well-performing Word2Vec model for the Bangla language. We therefore modified the standard Word2Vec model to create our customized Word2Vec model. A pre-trained Word2Vec with a vector size of 300 was used in the experiments to train the models. The authors' data set for this work, as well as the most recent Bangla Wikipedia dump, was used to train this Word2Vec model. To train the classical classifiers (i.e., SVM, Random Forest), the pre-processed data was turned into numerical feature vectors using Word2Vec, a generic procedure for turning a collection of text documents into numerical feature vectors.

Convolutional neural networks (CNN) were primarily used in image classification tasks, where convolution lies at the core of processing pictures, and are now also employed in natural language processing (NLP) [17]. For text classification with a CNN, the words of a text were embedded into a 2D array by stacking their vectors together in an embedding layer.
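The preprocessing steps (a) to (c) and the embedding lookup described above can be sketched in plain Python. This is a minimal illustration rather than the authors' code: the function names, the regular expressions, the choice to keep only the Bangla code points U+0980 to U+09E5, the use of index 0 for both padding and unknown tokens, and padding at the end of the sequence are all assumptions made for the example.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")                 # HTML markup
URL_RE = re.compile(r"(?:https?://|www\.)\S+")  # URLs
# Assumption: keep only code points U+0980-U+09E5 (Bangla letters and signs).
# Latin text, punctuation, Bangla digits and emoticons all become spaces.
NON_BANGLA_RE = re.compile(r"[^\u0980-\u09E5\s]")


def clean_text(text):
    """Steps (a) and (b): strip markup, URLs, punctuation, digits, emoticons."""
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = NON_BANGLA_RE.sub(" ", text)
    return " ".join(text.split())  # collapse runs of white space


def build_vocab(texts):
    """Step (c): map each unique token to an integer; 0 is kept for padding."""
    vocab = {}
    for text in texts:
        for token in text.split():
            vocab.setdefault(token, len(vocab) + 1)
    return vocab


def encode_and_pad(text, vocab, max_len):
    """Step (c): integer-encode a text, then truncate or pad it to max_len."""
    ids = [vocab.get(token, 0) for token in text.split()][:max_len]
    return ids + [0] * (max_len - len(ids))


def embed(ids, embedding_matrix):
    """Look up one vector per token id, giving a 2D array (max_len x dim)."""
    return [embedding_matrix[i] for i in ids]


# Toy usage: 2-dimensional vectors stand in for the 300-dimensional Word2Vec.
cleaned = clean_text("<p>করোনা ১৯ http://example.com ভালো!</p>")
vocab = build_vocab([cleaned])
ids = encode_and_pad(cleaned, vocab, max_len=4)
toy_matrix = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4]]  # row 0 is the padding vector
sentence_2d = embed(ids, toy_matrix)
```

In a real pipeline the rows of the embedding matrix would come from the trained Word2Vec model, and a deep learning framework would perform the lookup inside its embedding layer.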
A pre-trained Word2Vec with 300 dimensions was loaded into the embedding layer. The CNN model was trained with 256 filters and a kernel size of 3. Features were then extracted using one convolutional layer, with a ReLU activation to add non-linearity. After the convolutional layer, a GlobalMaxPooling layer was used to reduce dimensionality, and a dropout of 0.20 was included with a fully connected layer to reduce overfitting. Finally, a SoftMax function was used to classify input reviews into the three emotion labels. The maximum number of epochs was set to 64, and the model was compiled with the Adam optimizer [18] and the categorical cross-entropy loss function. These parameter values differed across tests. Table 1 illustrates the sequence of CNN layers.

Because they incorporate memory into the model, RNN and LSTM are frequently used in sentiment analysis. Since the meaning of a word in text depends on the context of the preceding text, it is useful to have memory in the network. The recurrent neural network, however, has a significant limitation: it can only cope with short-term dependencies. LSTM networks are designed to solve this problem by including a long-term memory in the network [19]. Most of the setup used for the CNN remained the same for the LSTM. The embedding layer contains a pre-trained Word2Vec with 300 dimensions, while a ReLU activation was employed to add non-linearity. Following that, an LSTM hidden layer with 128 units was implemented. To reduce overfitting, a dropout of 0.20 was employed. Next, a fully connected layer was added to link the outputs of all prior states. Finally, a SoftMax function was used to discriminate between the three previously described emotion classes (i.e., insightful, curious, gratitude).
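To make the CNN layer sequence described above concrete (one convolution over the embedded text, ReLU, global max pooling, a fully connected SoftMax output), the following is a minimal forward-pass sketch in plain Python. It is an illustration under assumptions, not the authors' implementation: the function names and toy weights are hypothetical, the dimensions are tiny stand-ins for the 300-dimensional embeddings and 256 filters, and dropout is omitted because it is inactive at inference time.

```python
import math


def conv1d_relu(seq, filters, bias):
    """Valid 1D convolution over the token axis, followed by ReLU.

    seq:     list of T embedding vectors, each of length D
    filters: list of F filters, each a list of K vectors of length D
    returns: list of T-K+1 rows, each holding F activations
    """
    k, dim = len(filters[0]), len(seq[0])
    feature_map = []
    for t in range(len(seq) - k + 1):
        row = []
        for f, b in zip(filters, bias):
            s = b + sum(f[i][d] * seq[t + i][d]
                        for i in range(k) for d in range(dim))
            row.append(max(0.0, s))  # ReLU
        feature_map.append(row)
    return feature_map


def global_max_pool(feature_map):
    """Keep the strongest response of each filter across all positions."""
    return [max(column) for column in zip(*feature_map)]


def dense_softmax(x, weights, bias):
    """Fully connected layer followed by a numerically stable softmax."""
    logits = [b + sum(w_i * x_i for w_i, x_i in zip(w, x))
              for w, b in zip(weights, bias)]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy usage: 5 tokens, 2-dimensional embeddings, 2 filters with kernel size 3.
seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [1.0, 0.0]]
filters = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
    [[-1.0, 0.0], [0.0, -1.0], [0.0, 0.0]],
]
pooled = global_max_pool(conv1d_relu(seq, filters, bias=[0.0, 0.0]))
# Three output units, one per emotion class (insightful, curious, gratitude).
probs = dense_softmax(pooled, [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
                      [0.0, 0.0, 0.0])
```

Global max pooling reduces each filter's variable-length response to a single number, which is why texts only need to be padded to a common length for batching and not for the classifier head itself.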
As with the CNN, the maximum number of epochs was set to 64, and the LSTM model was compiled with the Adam optimizer [18] and the categorical cross-entropy loss function. Table 2 shows the sequence of LSTM layers.

A text unit can be classified with three types of techniques: machine learning techniques, lexicon-based techniques, and hybrid techniques. For the classification problem, sentiment analysis employs the assessment metrics of precision, recall, F1-score, and accuracy. To address class imbalance, the performance of each classifier was evaluated using weighted averages of precision (P), recall (R), and F1-score (F1). The data were split into training and test sets in proportions of 0.8 and 0.2, respectively. Table 3 presents the overall evaluation results. Among the traditional methods, Random Forest surpasses SVM in accuracy, with Random Forest achieving 78.94% and SVM achieving 75.72%. Tables 4 and 5 show the normalized confusion matrix and evaluation of SVM, respectively, while Tables 6 and 7 show the same for Random Forest.

We used CNN and LSTM as the deep learning algorithms, a decision influenced by a number of considerations. In recent years, CNN and LSTM have frequently been used with static embeddings (e.g., Word2Vec) for detecting emotion classes. CNN achieved an accuracy of 81.91%. Tables 8 and 9 present the normalized confusion matrix and evaluation of CNN, respectively. LSTM outperforms all other models explored in the study, with an accuracy of 84.92%. Tables 10 and 11 present the normalized confusion matrix and evaluation of LSTM, respectively.

There are various opportunities to enhance our Word2Vec model, and a proper stemmer for the Bangla language could increase the overall performance of our suggested models.
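The weighted-average precision, recall, and F1-score used in the evaluation above can be computed directly from a confusion matrix, with each class weighted by its share of the true samples. The sketch below is illustrative only: the function name is hypothetical and the confusion matrix is toy data, not a result from this study.

```python
def weighted_metrics(confusion):
    """Support-weighted precision, recall and F1, plus accuracy.

    confusion[i][j] is the number of samples whose true class is i and
    whose predicted class is j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    precision = recall = f1 = 0.0
    for c in range(n_classes):
        tp = confusion[c][c]
        support = sum(confusion[c])                       # true samples of c
        predicted = sum(confusion[r][c] for r in range(n_classes))
        p = tp / predicted if predicted else 0.0
        r = tp / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total                          # class share
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    accuracy = sum(confusion[c][c] for c in range(n_classes)) / total
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Toy 3-class confusion matrix (rows: true class, columns: predicted class).
toy_confusion = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 0, 10],
]
scores = weighted_metrics(toy_confusion)
```

Note that the support-weighted recall always equals the overall accuracy, so the weighted precision and F1-score are the figures that add information beyond accuracy when classes are imbalanced.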
The models were built using an NVIDIA GeForce MX250 4 GB DDR5 graphics card; access to more computing resources may improve the overall stability and accuracy of the models.

Although extensive work has been done to detect emotion in English text, recognizing emotion in the Bangla language still requires a great deal of attention. This study constructed a reliable and authentic in-house data set for emotion detection, which includes 10,581 entries based on responses from a range of sources regarding several concerns relevant to the COVID-19 pandemic. A variety of machine learning methods, from classical (i.e., SVM, Random Forest) to deep learning (i.e., CNN, LSTM), were investigated. The authors' contributions are as follows: acquiring an authentic and reliable data set, which was meticulously cleaned and tested; examining the data to analyze the expressions associated with specific emotions; and experimenting with classical and deep learning algorithms (Random Forest, SVM, CNN, and LSTM) and documenting the results for future reference. According to the findings of this study, deep learning algorithms perform better than traditional methods. In the future, the authors want to examine a wider range of emotions from different Bangla data sources and to consider Transformer-based models.

[1] Sentiment analysis: mining opinions, sentiments, and emotions
[2] A novel coronavirus outbreak of global health concern
[3] Bangladesh confirms its first three cases of coronavirus
[4] The evolution of sentiment analysis-a review of research topics, venues, and top cited papers
[5] Thumbs up? sentiment classification using machine learning techniques
[6] Sentiwordnet: a publicly available lexical resource for opinion mining
[7] Sentiwordnet 3.0: an enhanced lexical resource for sentiment analysis and opinion mining
[8] Sentiment classification of reviews using sentiwordnet
[9] Sentiment analysis in multiple languages: feature selection for opinion classification in web forums
[10] Sentiwordnet for bangla. Knowl Shar Event-4 Task
[11] A survey on emotion detection: A lexicon based backtracking approach for detecting emotion from Bengali text
[12] international conference on bangla speech and language processing (ICBSLP). IEEE
[13] Comparison of classical machine learning approaches on Bangla textual emotion analysis
[14] Emotion detection from bangla text corpus using naive bayes classifier
[15] Data set for sentiment analysis on Bengali news comments and its baseline evaluation
[16] Sentiment classification in bangla textual content: A comparative study
[17] A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification
[18] Adam: a method for stochastic optimization
[19] Sentiment analysis with long short-term memory networks

The authors declare that they have no conflict of interest.