key: cord-1048223-5n06yw6h
authors: Jelodar, Hamed; Wang, Yongli; Orji, Rita
title: Deep Sentiment Classification and Topic Discovery on Novel Coronavirus or COVID-19 Online Discussions: NLP Using LSTM Recurrent Neural Network Approach
date: 2020-04-24
journal: bioRxiv
DOI: 10.1101/2020.04.22.054973
sha: 1bdf3c4613936ba022e38f126f620c770e18b9ca
doc_id: 1048223
cord_uid: 5n06yw6h

Internet forums and public social media, such as online healthcare forums, provide a convenient channel for users (people/patients) concerned about health issues to discuss and share information with each other. In late December 2019, an outbreak of a novel coronavirus (infection from which results in the disease named COVID-19) was reported, and, due to the rapid spread of the virus in other parts of the world, the World Health Organization declared a state of emergency. In this paper, we used automated extraction of COVID-19-related discussions from social media and a natural language processing (NLP) method based on topic modeling to uncover various issues related to COVID-19 from public opinions. Moreover, we also investigate how to use an LSTM recurrent neural network for sentiment classification of COVID-19 comments. Our findings shed light on the importance of using public opinions and suitable computational techniques to understand issues surrounding COVID-19 and to guide related decision-making.

Online forums, such as Reddit, enable healthcare service providers to collect people/patient experience data. These forums are valuable sources of people's opinions, which can be examined for knowledge discovery and user behaviour analysis. In a typical sub-reddit forum, a user can use keywords and apply search tools to identify relevant questions/answers or comments posted by other Reddit users. Moreover, a registered user can create a topic or post a new question to start discussions with other community members. In answering the questions, users reflect and share their views and experiences.
In these online forums, people may express positive and negative comments, or share questions, problems, and needs related to health issues. By analysing these comments, we can identify valuable recommendations for improving health services and understanding the problems of users.

In late December 2019, the outbreak of a novel coronavirus causing COVID-19 was reported [1]. Due to the rapid spread of the virus, the World Health Organization declared a state of emergency. In this paper, we focused on analysing COVID-19-related comments to detect sentiment and semantic ideas relating to COVID-19 based on the public opinions of people on Reddit. Specifically, we used automated extraction of COVID-19-related discussions from social media and a natural language processing (NLP) method based on topic modeling to uncover various issues related to COVID-19 from public opinions. The main contributions of this paper are as follows:
- We present a systematic framework based on NLP that is capable of extracting meaningful topics from COVID-19-related comments on Reddit.
- We propose a deep learning model based on Long Short-Term Memory (LSTM) for sentiment classification of COVID-19-related comments, which produces better results compared with several other well-known machine-learning methods.
- We detect and uncover meaningful topics that are being discussed on COVID-19-related issues on Reddit, as primary research.
- We calculate the polarity of the COVID-19 comments related to sentiment and opinion analysis from 10 sub-reddits.

Our findings shed light on the importance of using public opinions and suitable computational techniques to understand issues surrounding COVID-19 and to guide related decision-making.

Overall, the paper is structured as follows. First, we provide a brief introduction to online healthcare forums. Discussion of COVID-19-related issues and some similar works is provided in section 2.
In section 3, we describe the data pre-processing methods adopted in our research, and the NLP and deep-learning methods applied to the COVID-19 comments database. Next, we present the results and discussion. Finally, we conclude and discuss future works based on NLP approaches for analysing the online community in relation to the topic of COVID-19.

Machine and deep-learning approaches based on sentiment and semantic analysis are popular methods of analysing text content in online health forums. Many researchers have used these methods on social media such as Twitter and Reddit [2]-[7], and on health information websites [8], [9]. For example, Halder and colleagues [10] focused on exploring linguistic changes to analyse the emotional status of a user over time. They utilized a recurrent neural network (RNN) to investigate user content in a huge dataset from the mental-health online forums of healthboards.com. McRoy and colleagues [11] investigated ways to automate identification of the information needs of breast cancer survivors based on user posts in online health forums. Chakravorti and colleagues [12] extracted topics based on various health issues discussed in online forums by evaluating user posts of several subreddits (e.g., r/Depression, r/Anxiety) from 2012 to 2018. VanDam and colleagues [13] presented a classification approach for identifying clinic-related posts in online health communities. For that dataset, the authors collected 9576 thread-initiating posts from WebMD, which is a health information website. The COVID-19-related comments from an online healthcare-oriented group can be considered potentially useful for extracting meaningful topics to better understand the opinions and highlight discussions of people/users and improve health strategies.
Although there are similar works regarding various health issues in online forums, to the best of our knowledge, this is the first study to utilize NLP methods to evaluate COVID-19-related comments from sub-reddit forums. We propose utilizing an NLP technique based on topic modeling algorithms to automatically extract meaningful topics, and we design a deep-learning model based on an LSTM RNN for sentiment classification of COVID-19 comments, in order to understand the positive or negative opinions of people as they relate to COVID-19 issues and to inform relevant decision-making.

This section describes the methods used in this study, which proposes the use of an unsupervised topic model together with a deep-learning model based on an LSTM RNN to analyse COVID-19-related comments from sub-reddits. The developed framework, shown in Fig. 2, uses sentiment and semantic analysis for mining and opinion analysis of COVID-19-related comments.

Reddit is an American social media and discussion website for various topics that includes web content ratings. On this platform, users are able to post questions and comments, and to respond to each other regarding different subjects, such as COVID-19. The posts are organised by subjects created by online users, called "subreddits", which cover a variety of topics like news, science, healthcare, video, books, fitness, food, and image-sharing. This website is an ideal source for collecting health-related information about COVID-19-related issues. This paper focuses on COVID-19-related comments of 10 sub-reddits based on an existing dataset as the first step in producing this model.

One of the most important steps in pre-processing COVID-19-related comments is removing useless words/data, which are defined as stop-words in NLP, from the raw text. By eliminating stop-words, we also decreased the dimensionality of the feature space.
For example, the most common words in the text comments are words that usually carry little meaning and do not effectively influence the output, such as articles, conjunctions, pronouns, and linking verbs. Some examples include: am, is, are, they, the, these, I, that, and, them.

Text-document modeling in NLP is a practical technique that represents an individual document and the set of text-documents based on terms appearing in the text-documents. Topic modeling based on Latent Dirichlet Allocation (LDA) [14] is one type of document modelling approach. As a third step, we utilized topic modeling based on an LDA topic model and Gibbs sampling [15] for semantic extraction and latent topic discovery of COVID-19-related comments. COVID-19 comments can depend on various subjects that are discussed by Reddit users, and in this step we can detect and discover these meaningful subjects or topics. Therefore, based on the LDA model, we modelled the collection of documents (the COVID-19-related comments) and their words as mixtures over K topics, where the discrete topic distributions are drawn from a symmetric Dirichlet distribution. The probability of the observed data is given by Equation 1, where α denotes the parameters of the topic Dirichlet prior and β denotes the parameters of the word Dirichlet prior.

[Fig. 2: framework overview. Recoverable labels: Step 1: COVID-19 comments; defining and applying stop-words; Step 3: semantic processing; Step 4: deep learning and comment classification.]

M is the number of text-documents, and N is the vocabulary size. Moreover, (α, θ) was determined for the corpus-level topic distributions with a pair of Dirichlet multinomials. (β, ϕ) was also determined for the topic-word distributions with a pair of Dirichlet multinomials. In addition, the document-level variables were defined as θ_d, which may be sampled for each document. The word-level variables z_dn and w_dn were sampled in each text-document for each word [14].
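For reference, the marginal likelihood of a corpus under LDA, in the standard form given in the original LDA paper [14], can be written as below. This is offered as a sketch of what "the probability of the observed data" refers to, not as a transcription of this paper's Equation 1. The symbol N_d (the number of word tokens in document d) is an assumption introduced here for clarity and is distinct from the vocabulary size N used in the text.

```latex
p(D \mid \alpha, \beta)
  = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha)
    \left( \prod_{n=1}^{N_d} \sum_{z_{dn}}
      p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right)
    d\theta_d
```

Here θ_d is the document-level topic distribution, and z_dn and w_dn are the word-level topic assignment and observed word, matching the variables described above.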
Algorithm 2
 1: Algorithm 1: pre-processing and removing the noise to prepare the input data
 2: for each topic k ∈ {1, . . . , K} do
 3:     word probability under the sampled topic, i.e. the word distribution for topic k among COVID-19-related comments
 4:     φ ∼ Dirichlet(β)
 5: end for
 6: for each COVID-19-related comment d ∈ {1, . . . , D} do
 7:     the topic distribution for document d
 8:     θ_d ∼ Dirichlet(α)
 9:     for each word in COVID-19-related comment-document d do
10:         sample from the distribution of topics in the COVID-19-related comment-documents to obtain the topic of the word: z_d ∼ Mul(θ)
11:         sample the word under the topic: w_d ∼ Mul(φ)
12:     end for
13: end for

Algorithm 2 describes a general process as part of our framework for extracting latent topics and semantic mining. The input data consists of the COVID-19-related comments, which serve as the documents. Line 1 processes the raw data to eliminate noise and stop-words based on Algorithm 1. Lines 2-5 compute the probability of the word distribution from Topic K[i]. Lines 6-11 compute the probability of the topic distribution from the COVID-19-Content-Document m[i]. As highlighted in Equation 1, the variables θ_m and w_n are computed for the document level and word level of the framework. In more detail, LDA handles documents as multinomial distributions over topics, and words as a probabilistic mixture of a pre-determined number of latent topics. Lines 1-3 of Algorithm 3 show the semantic mining to extract the latent topics. We then used a sorting function to determine the recommended highlighted topics. Because the Gibbs sampling method is used in this step, the time required for model inference can be specified as the time for inferring LDA. Therefore, the time complexity for LDA is O(NK), where N denotes the total size of the corpus (COVID-19-related comments) and K is the number of topics.
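The pre-processing and topic-extraction steps described above can be sketched in a few lines of plain Python. The paper itself used MALLET for inference, so the collapsed Gibbs sampler below is only an illustrative stand-in: the function names (`preprocess`, `lda_gibbs`, `top_words`), the hyperparameter defaults, and the tiny stop-word list (taken from the examples in the text) are all assumptions made for this sketch.

```python
import random

# Example stop-words from the text; a real list would be much longer (assumption).
STOP_WORDS = {"am", "is", "are", "they", "the", "these", "i", "that", "and", "them"}


def preprocess(comment):
    """Lower-case, split on whitespace, strip punctuation, drop stop-words."""
    tokens = (tok.strip(".,!?;:\"'()") for tok in comment.lower().split())
    return [tok for tok in tokens if tok and tok not in STOP_WORDS]


def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; returns (doc_topic, topic_word, vocab)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]               # counts n_{d,k}
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # counts n_{k,w}
    topic_total = [0] * num_topics                              # counts n_k
    assignments = []

    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][n]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # p(z = j | rest) is proportional to
                # (n_dj + alpha) * (n_jw + beta) / (n_j + V * beta)
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][v] + beta)
                    / (topic_total[j] + vocab_size * beta)
                    for j in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][n] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    return doc_topic, topic_word, vocab


def top_words(topic_word, vocab, n=3):
    """For each topic, return its n highest-count words, highest first."""
    return [
        [vocab[i] for i in sorted(range(len(vocab)), key=lambda i: -row[i])[:n]]
        for row in topic_word
    ]
```

Each sweep visits every token once and evaluates K weights per token, which is where the O(NK) cost per iteration comes from. Calling `top_words` on the returned counts plays the role of the sorting step used to pick the highlighted terms of each topic.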
Deep neural networks have been successfully employed for different types of machine-learning tasks, such as NLP-based methods utilizing sentiment aspects for deep classification [16]-[21]. Deep neural networks can model high-level abstractions and reduce dimensionality by using multiple processing layers with complex structures, combined with non-linear transformations. RNNs are popular models with demonstrated importance and strength in most NLP works [22]-[24]. RNNs are designed to exploit sequential information, and their output is informed by stored results of previous computations. In fact, RNNs are equipped with a memory function that saves formerly calculated information. Basic RNNs, however, suffer from vanishing or exploding gradients and are unable to learn long-term dependencies. LSTM [25], [26] units avoid this problem by adjusting the information in a cell state using 3 different gates. In the formulation of each LSTM cell, the forget (f_t), input (i_t), and output (o_t) gates are determined by eqs. 2-4, respectively. In an LSTM layer, the forget gate determines which previous information from the cell state is forgotten. The input gate determines the new information that is saved in the memory cell. The output gate determines the amount of information in the internal memory cell to be exposed. In the cell-memory/input block equations, C_t is the cell state, z_t is the hidden output, and x_t is an input vector. W and b are the weight matrix and the bias term, respectively. σ is the sigmoid function, φ is tanh, and ⊙ is element-wise multiplication.

As the last step of this framework, an LSTM model was utilised to assess the COVID-19-related comments of online users who posted on Reddit, in order to recognize the emotion/sentiment elicited from these comments.
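To make the gate mechanics concrete, the following is a minimal sketch of a single LSTM time step in plain Python, written in the widely used standard formulation: f_t, i_t, o_t = σ(W·[z_{t-1}, x_t] + b) for each gate, a tanh candidate g_t, then C_t = f_t ⊙ C_{t-1} + i_t ⊙ g_t and z_t = o_t ⊙ tanh(C_t). It is an assumption that the paper's equations take exactly this form (for example, that each weight matrix acts on the concatenation of the previous hidden output and the input). The names `lstm_step`, `affine`, and the `params` layout are hypothetical; the paper's actual model was built with Keras.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, v, b):
    """Return W @ v + b, with W given as a list of rows."""
    return [
        sum(w_ij * v_j for w_ij, v_j in zip(row, v)) + b_i
        for row, b_i in zip(W, b)
    ]


def lstm_step(x_t, z_prev, c_prev, params):
    """One LSTM time step.

    params maps 'f', 'i', 'o', 'g' to (W, b) pairs; each W acts on the
    concatenation [z_prev, x_t] (an assumed, conventional layout).
    Returns the new hidden output z_t and cell state c_t.
    """
    v = z_prev + x_t  # list concatenation: [z_{t-1}, x_t]

    def gate(name, activation):
        W, b = params[name]
        return [activation(a) for a in affine(W, v, b)]

    f_t = gate("f", sigmoid)    # forget gate: what to drop from the cell state
    i_t = gate("i", sigmoid)    # input gate: what new information to store
    o_t = gate("o", sigmoid)    # output gate: how much of the cell to expose
    g_t = gate("g", math.tanh)  # candidate cell values

    # Element-wise cell update and hidden output.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    z_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return z_t, c_t
```

With a large positive forget-gate bias and a large negative input-gate bias, the cell state passes through almost unchanged, which is the mechanism that lets LSTM units carry information across long sequences.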
We designed a model with two LSTM layers and, for pre-trained embeddings, used the 50-dimensional GloVe vectors, which were trained over a large corpus of COVID-19-related comments (Figure 3). The processed text from the COVID-19-related comments is converted into vectors with a fixed dimension using the pre-trained embeddings. Moreover, a COVID-19 comment can also be described as a character sequence which, with its corresponding dimension, forms a matrix [27].

In this section, we provide a detailed description of the data collection and experimental results followed by a comprehensive discussion of the results. We assessed 563,079 COVID-19-related comments from Reddit. The dataset was collected between January 20, 2020 and March 19, 2020 (the full dataset is available on the Kaggle website). We used MALLET to implement the inference and capture the LDA topic model to retrieve latent topics. We used the Python library Keras to implement our deep-learning model.

According to Tables 1 and 2 and Figures 4-8, the following observations were made. Topics 85 and 18 had a similar concept in "People/Infection". Topic 85 included words referring to people, such as "people", "virus", "day", "bad", "stop", "news", "worse", "sick", "spread", and "family". This topic is the first-ranked topic discovered from the generated latent topics, in which most users express their opinion and comment on this issue. Based on Table 1 and Figure 5(a), in this topic the terms "people" and "virus" were the most highlighted words, with word-weights of 0.1295% and 0.0301%, respectively. We can also see the importance of the term "family" in this topic. In addition, Topic 18 contains the telling words "virus", "people", "symptoms", "infection", "cases", "disease", "pneumonia", "coronavirus", and "treatment". Other revealing words in Topic 18 included "people", "infection", and "treatment". These terms initially suggest a set of user comments about treatment issues.
Moreover, the sentiment analysis of the terms suggests that negative words were more highlighted than positive words.

Topic 63 also addresses healthcare and hospital issues, with the most frequent term being "hospital". Words such as "hospital", "medical", "healthcare", "patients", "care", and "city" were included. The terms "hospital", "medical", and "healthcare" were the most highlighted words, with word-weights of 0.0561%, 0.0282%, and 0.0278%, respectively. Other words worth mentioning for this topic were "person", "patient", "staff", "workers", and "emergency". Topic 63 was labelled "medical staff issues".

Topic 4 included words relating to money, such as "pay", "money", "companies", "insurance", "paid", "free", "cost", "tax", "years", and "employees". Moreover, the sentiment analysis of the terms suggested that negative words were more highlighted than positive words.

Topic 30 covers users' comments concerning issues related to "feelings and hopes" and highlights words such as "good", "hope", "feel", "house", "safe", "hard", "months", "fine", "live", and "friend". Moreover, sentiment analysis of the terms suggested that positive words were more highlighted than negative words. Positive words such as "good", "hope", "safe", "fine", "kind", and "friend" thus pertain to the phenomenon of "positive feelings".

For Topic 93, we can see that there was a clear focus on "people, age, and COVID issues", with the top words being "covid", "young", "risk", "fever", "immune", "age", "sick", "cough", "life", "cold", "elderly", and "older". The terms "covid", "young", and "risk" were the most highlighted words, with word-weights of 0.0299%, 0.0222%, and 0.0218%, respectively, and this topic had negative polarity.

Topic 48 also addresses "COVID-19 testing issues" and contains words like "people", "testing", "government", "country", "tested", "test", "infected", "home", "covid", and "pandemic".
Based on the results, the terms "people" and "testing" were the most highlighted words, with word-weights of 0.0447% and 0.0337%, respectively.

Moreover, the opinion words based on sentiment analysis scored high in negative polarity for Topic 17. The top terms of this topic were "coronavirus", "quarantine", "stupid", "happening", "shit", "watch", and "dangerous", thus pertaining to the phenomenon "quarantine issues". The terms "coronavirus" and "quarantine" were the most highlighted words, with word-weights of 0.0353% and 0.0346%, respectively.

Sentiment analysis is a practical technique in NLP for opinion mining that can be used to classify text/comments based on word polarities [28]-[30]. This technique has many applications in various disciplines, such as opinion mining in online healthcare communities [31]-[33]. We obtained the sentiment of the COVID-19-related comments using the SentiStrength algorithm [34]-[36]. With all COVID-19-related comments tagged with sentiment scores, we calculated the average sentiment of the entire dataset along with that of the comments from the 10 COVID-19 sub-reddits. The main objective of this analysis was to identify the overall sentiment. Figure 9 shows the sentiment of all comments in the database along with the average sentiment of comments containing the term COVID-19. For each of the polar comments in our labelled dataset, we assigned negative and positive scores utilizing SentiStrength, and used these scores directly as rules for inferring the polarity/sentiment of the COVID-19 comments. Based on SentiStrength, we determined that a comment was positive if the positive sentiment score was greater than the negative sentiment score, and applied the mirror-image rule for determining a negative sentiment. For example, a score of +5 and -4 indicates positive polarity and a score of +4 and -6 indicates negative polarity.
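The comparison rule can be sketched as a small function. This is an illustrative reading of the rule as stated in the text, comparing the magnitude of the positive score with the magnitude of the negative score; the function name `polarity` is hypothetical, and the tie case returns "neutral" to match the paper's treatment of equal scores.

```python
def polarity(positive, negative):
    """Classify a comment from a (positive, negative) sentiment score pair.

    `positive` is a positive number and `negative` a negative number, as in
    the examples from the text (+5 and -4, +4 and -6). The stronger of the
    two magnitudes decides the label; equal magnitudes are neutral.
    """
    if positive > abs(negative):
        return "positive"
    if positive < abs(negative):
        return "negative"
    return "neutral"


# The worked examples from the text.
examples = [(5, -4), (4, -6), (1, -1), (4, -4)]
labels = [polarity(p, n) for p, n in examples]
```

The finer five-way labels used for training (very positive through very negative) would need additional thresholds that the text does not specify, so they are not sketched here.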
Moreover, if the sentiment scores were equal (such as -1 and +1, or +4 and -4), we determined that the comment was neutral.

To prepare the dataset to automatically classify the sentiment of the COVID-19 comments for all of the data, we labelled each of the comments as very positive, positive, very negative, negative, or neutral based on the sentiment score obtained using the SentiStrength method. The training set had 338,666 COVID-19-related comments and the testing set had 112,888 comments. In this experiment, we evaluated the proposed LSTM model and also supervised machine-learning methods, using the Support Vector Machine (Senti-ML1), Naive Bayes (Senti-ML2), Logistic Regression (Senti-ML3), and K Nearest Neighbors (Senti-ML4) techniques. Figure 4 shows the accuracy of the best model for classifying a COVID-19 comment as either a very positive, positive, very negative, negative, or neutral sentiment. Our approach based on the LSTM model, which classified all COVID-19 comments in the majority class, achieved 81.15% accuracy, which was higher than that of the traditional machine-learning algorithms.

We believe that the sentiment and semantic techniques can provide meaningful results with an overview of how users/people feel about the disaster. Analysing social media comments on platforms such as Reddit could provide meaningful information for understanding people's opinions, which might be difficult to achieve through traditional techniques, such as manual methods. The text content on Reddit has been analysed in various studies [37]-[39]; to the best of our knowledge, this is the first study to analyse comments by considering semantic and sentiment aspects of COVID-related comments from Reddit for online health communities. Overall, we extended the analysis to check whether we could find a dependency of semantic aspects of user comments for different issues on COVID-19-related topics.
In this case, we considered an existing dataset that included 563,079 comments from 10 sub-reddits. We found and detected meaningful latent topics of terms about COVID-19 comments related to various issues. Thus, user comments proved to be a valuable source of information, as shown in Tables 1 and 2 and Figures 4-8. A variety of different visualisations was used to interpret the generated LDA results. As mentioned, LDA is a probabilistic model that, when applied to documents, hypothesises that each document from a collection has been generated as a mixture of latent topics.

This research was limited to English-language text, which was considered a selection criterion. Therefore, the results do not reflect comments made in other languages. In addition, this study was limited to comments retrieved between January 20, 2020 and March 19, 2020. Therefore, the gap between the period in which the research was being completed and the time-frame of our study may have somewhat affected the timeliness of our results.

Overall, the study suggests that the systematic framework combining NLP and deep-learning methods, based on topic modelling and an LSTM model, enabled us to generate some valuable information from COVID-19-related comments. These kinds of statistical contributions can be useful for determining the positive and negative actions of an online community, and for collecting user opinions to help researchers and clinicians better understand the behaviour of people in a critical situation. Regarding future work, we plan to evaluate other social media, such as Twitter, using hybrid fuzzy deep-learning techniques [40]-[41] that can be used in the future for sentiment-level classification as a novel method of retrieving meaningful latent topics from public comments.

To our knowledge, this is the first study to analyse the association between COVID-19 comments' sentiment and semantic topics on Reddit.
The main goal of this paper, however, was to show a novel application of NLP based on an LSTM model to detect meaningful latent topics and perform sentiment classification of comments on COVID-19-related issues from healthcare forums, such as sub-reddits. We believe that the results of this paper will aid in understanding the concerns and needs of people with respect to COVID-19-related issues. Moreover, our findings may aid in improving practical strategies for public health services and interventions related to COVID-19.

[1] The coronavirus 2019-nCoV epidemic: Is hindsight 20/20?
[2] Reddit and Radiation Therapy: A Descriptive Analysis of Posts and Comments Over 7 Years by Patients and Health Care Professionals
[3] Sentiment analysis of Twitter data during critical events through Bayesian networks classifiers
[4] Unsupervised Classification of Health Content on Reddit
[5] Ebola and localized blame on social media: analysis of Twitter and Facebook conversations during the 2014-2015 Ebola epidemic
[6] Deep learning for pollen allergy surveillance from twitter in Australia
[7] Ontology-Based Healthcare Named Entity Recognition from Twitter Messages Using a Recurrent Neural Network Approach
[8] Similarity of medical concepts in question and answering of health communities
[9] Identifying peer experts in online health forums
[10] Modeling temporal progression of emotional status in mental health forum: A recurrent neural net approach
[11] Assessing unmet information needs of breast cancer survivors: Exploratory study of online health forums using text classification and retrieval
[12] Detecting and Characterizing Trends in Online Mental Health Discussions
[13] Detecting clinically related content in online patient posts
[14] Latent dirichlet allocation
[15] Gibbs sampling for logistic normal topic models with graph-based priors
[16] Improving the reliability of deep neural networks in NLP: A review. Knowledge-Based Systems
[17] Semantic-based padding in convolutional neural networks for improving the performance in natural language processing. A case of study in sentiment analysis
[18] Gluoncv and gluonnlp: Deep learning in computer vision and natural language processing
[19] Deep learning models and datasets for aspect term sentiment classification: Implementing holistic recurrent attention on target-dependent memories. Knowledge-Based Systems
[20] Sentiment Analysis in Healthcare: A Brief Review
[21] Develop a Neural Model to Score Bigram of Words Using Bag-of-Words Model for Sentiment Analysis
[22] Recurrent neural networks with specialized word embeddings for health-domain named-entity recognition
[23] Recurrent neural networks for classifying relations in clinical notes
[24] Optimization of Recurrent Neural Networks on Natural Language Processing
[25] Long short-term memory
[26] Convolutional, long short-term memory, fully connected deep neural networks
[27] Sentiment extraction from Consumer-generated noisy short texts
[28] Natural Language Processing, Sentiment Analysis, and Clinical Analytics
[29] Deep Learning Approaches for Textual Sentiment Analysis
[30] Sentiment analysis using deep learning approaches: an overview
[31] [WiP] Sentiment Analysis Electronic Healthcare System Based on Heart Rate Monitoring Smart Bracelet
[32] Enriching user experience in online health communities through thread recommendations and heterogeneous information network mining
[33] Sentiment lexicons for health-related opinion mining
[34] The Heart and soul of the web? Sentiment strength detection in the social web with SentiStrength
[35] Sentiment strength detection in short informal text
[36] Topic-based sentiment analysis for the Social Web: The role of mood and issue-related words
[37] Natural language processing of Reddit data to evaluate dermatology patient experiences and therapeutics
[38] Tracking health related discussions on Reddit for public health applications
[39] Social media based analysis of opioid epidemic using Reddit
[40] Fuzzy deep belief networks for semi-supervised sentiment classification
[41] Classification of healthcare data using hybridised fuzzy and convolutional neural network

We acknowledge SciTechEdit International, LLC (Highlands Ranch, CO, USA) for providing pro bono professional English-language editing of this article. This work has been awarded by the National Natural Science

All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.

Declaration of Conflict of Interest: All authors declare no conflict of interest directly related to the submitted work.