key: cord-0117292-o8ouwed0 authors: Yan, Tian; Liu, Fang title: Sentiment Analysis and Effect of COVID-19 Pandemic using College SubReddit Data date: 2021-11-30 journal: nan DOI: nan sha: da9ba49d6f91f36a179e7a1a41c5777d4b233915 doc_id: 117292 cord_uid: o8ouwed0

The COVID-19 pandemic has affected societies and human health and well-being in various ways. In this study, we collected Reddit data from 2019 (pre-pandemic) and 2020 (pandemic) from the subreddit communities associated with 8 universities, applied natural language processing (NLP) techniques, and trained graph neural networks with the social media data to study how the pandemic has affected people's emotions and psychological states compared to the pre-pandemic era. Specifically, we first applied a pre-trained Robustly Optimized BERT pre-training approach (RoBERTa) model to learn embeddings from the semantic information of Reddit messages and trained a graph attention network (GAT) for sentiment classification. The use of GAT allows us to leverage the relational information among the messages during training. We then applied subgroup-adaptive model stacking to combine the prediction probabilities from RoBERTa and GAT to yield the final sentiment classification. With the manually labeled and model-predicted sentiment labels on the collected data, we applied a generalized linear mixed-effects model to estimate the effects of the pandemic and of online teaching on people's sentiment and to assess their statistical significance. The results suggest the odds of negative sentiments in 2020 are $14.6\%$ higher than the odds in 2019 ($p$-value $<0.001$), and the odds of negative sentiments are $41.6\%$ higher with in-person teaching than with online teaching in 2020 ($p$-value $=0.037$) in the studied population.

The COVID-19 pandemic has affected our society in many ways: travel was limited and restricted, supply chains were disrupted, companies experienced contractions in production, financial markets were unstable, schools were temporarily closed, and in-person learning was replaced by remote and online learning. Last but not least, the pandemic has taken a toll on our physical health and has also had a huge impact on mental health, as found by multiple studies. Xiong et al. (2020) provided a systematic review of how the pandemic has led to increased symptoms of anxiety, depression, and post-traumatic stress disorder in the general population. Bo et al. (2021) observed that COVID-19 patients suffered from post-traumatic stress symptoms before discharge. Zhang et al. (2020) detected an increased prevalence of depression predominately in COVID-19 patients. Chen et al. (2020b) observed that the prevalence of self-reported depression and anxiety among pediatric medical staff members was significantly high during the pandemic, especially for workers with COVID-19 exposure experience. Using online surveys, Sønderskov et al. (2020) found that the psychological well-being of the general Danish population was negatively affected by the COVID-19 pandemic. Wu et al. (2020) surveyed college students from the top 20 universities in China and found that the main causes of anxiety among college students included online learning and epidemic diseases. Sharma et al. (2020) collected text data via an app and learned that there was a high rate of negative sentiment among students, who were more interested in health-related topics and less interested in education during the pandemic.
The work listed above uses surveys, patient data from hospitals and health care providers, or data collected through specifically designed apps to study the impact of the pandemic on mental states in various demographic groups. Social media data, on the other hand, provide another great source of real-world data and contain a vast amount of information that can be leveraged to study the effects of the pandemic on people's lives, behaviors, and health, among others. Due to the unstructured nature of the text and semantic data collected from social media, text mining and machine learning (ML) techniques in natural language processing (NLP) are often used to make sense of the data. For example, Low et al. (2020) applied sentiment analysis, classification, clustering, and topic modeling to uncover concerns throughout Reddit before and during the pandemic. Jelodar et al. (2020) applied long short-term memory (LSTM) recurrent neural networks (RNNs) to Reddit data and achieved an 81.15% sentiment classification accuracy on COVID-19 related comments. Jia and Li (2020) collected and analyzed data from Weibo (a Chinese social media platform) and concluded that students' attitude toward returning to school was positive. Pandey (2021) applied the Bidirectional Encoder Representations from Transformers (BERT) to Reddit data to extract meaningful latent topics on COVID-19 and perform sentiment classification.

Besides the rich text and semantic information in social media data, the data are also known for their vast amount of relational information, which is likewise important for training effective ML procedures. Relational information is often formulated as networks or graphs. Graph neural networks (GNNs), a type of NN that takes graphs as input for various learning tasks such as node classification and graph embedding, can be used for learning from both the semantic and relational information in social media data. Since Scarselli et al. (2008) proposed the first GNN, many GNN extensions and variants have been developed. Li et al. (2015) incorporated gated recurrent units into GNNs. Defferrard et al. (2016) generalized convolutional NNs (CNNs) to graphs using spectral graph theory. Kipf and Welling (2016) proposed a semi-supervised layer-wise spectral model to effectively utilize graph structures. Atwood and Towsley (2016) introduced the diffusion-convolution operation to CNNs to learn representations of graphs. Monti et al. (2017) generalized CNNs to non-Euclidean structured data, including graphs and manifolds. Hamilton et al. (2017) extended semi-supervised learning to a general spatial CNN by sampling from local neighborhoods, aggregating features, and exploring different aggregation functions. Veličković et al. (2017) proposed graph attention networks (GAT), a self-attention-based GNN that assigns different weights to different neighbors. It is computationally efficient because of parallelization and does not require knowing the overall graph structure upfront, but it works with homogeneous graphs only. Wang et al. (2019) developed the heterogeneous attention network (HAN), which extends GAT to heterogeneous graphs of multi-type nodes and edges. HAN employs attention mechanisms to aggregate different types of neighbors on different types of meta-paths that represent different types of relations among nodes of different or the same types. HAN is capable of paying attention to "important" neighbors for a given node and capturing meaningful semantics efficiently. Hu et al.
(2020) extended the HAN framework to learn from dynamic heterogeneous graphs. Chen et al. (2020a) improved previous GNN approaches by using initial residual connections and identity mapping and mitigating over-smoothing.

In this work, we leverage the advances in NLP and GNNs to learn from social media data and study whether the pandemic has negatively affected the emotions and psychological states of people who are connected with a higher-education institute (HEI). "Connected with an HEI" in this context is defined in a very broad sense, described as follows. The data we collected are from university subreddit communities and include essentially everyone who contributed to the subreddits from August to November 2019 and August to November 2020. Therefore, an individual who contributes to our data might be a student at an HEI, a staff or faculty member there, or someone who does not have a direct connection with the HEI but is interested in it or is involved with it in some way. It is not our goal to generalize the conclusions from this study to the general population, as the studied population is clearly not a representative sample of the whole population. However, it represents a subgroup of the general population to which the study conclusions can be generalized, and the study contributes to the body of literature and research on the impact of the pandemic on mental health in various sub-populations using real-world data.

We collected the social media data in 2019 (pre-pandemic) and 2020 (pandemic) from the subreddit communities associated with 8 universities chosen per a full-factorial design according to three factors, as detailed in Section 2.1. We employ the Robustly Optimized BERT pre-training approach (RoBERTa) and GNNs with the attention mechanism that take both semantic and relational information as input to perform sentiment analysis in a semi-supervised manner to save manual labeling costs. The predicted sentiment labels are then analyzed by a generalized linear mixed model to examine the effects of the pandemic and online learning on the mental state of the communities represented by the collected data. Our contributions and the main results are summarized as follows.

• We adopt a full-factorial $2 \times 2 \times 2$ design when choosing schools for Reddit data collection. This not only makes the collected data more representative by covering different types of schools compared to using data from a single HEI, but also helps with controlling for confounders for the subsequent statistical inference on the effect of the pandemic on the outcome of interest.

• We use both the semantic and graph information from the collected Reddit data as inputs, leverage state-of-the-art NLP techniques and GNNs with the attention mechanism, and apply model stacking to combine the prediction powers of the two ML techniques and improve the sentiment classification accuracy.

• Our results suggest that the odds of having negative sentiments increased by 14.6% during the pandemic (65,329 cases in 2020) compared to the pre-pandemic period (55,409 cases in 2019) and the increase is statistically significant (p-value < 0.001). During the pandemic, the odds of having negative sentiments in schools that opted for in-person learning (38,132 cases from 4 schools) are 1.416 times those in schools that chose remote learning (27,197 cases from 4 schools), and this increase is also statistically significant (p-value = 0.037).

The rest of the paper is organized as follows.
Section 2 describes the research method, including data collection and processing, the ML procedures we adopt for sentiment classification, and the model used for hypothesis testing and statistical inference on the effects of the pandemic on sentiment. The main results are presented in Section 3. The final discussions and remarks are offered in Section 4.

Figure 1 depicts the process and main steps of the research method we take in this study. We feed the collected text data from Reddit to a pre-trained RoBERTa, which produces embeddings that capture the meaning of each word in the input text data. The embeddings are used for two downstream tasks. First, they are sent to a classification layer (softmax function) to predict the probabilities of negative and non-negative sentiments; second, they are used as part of the input, along with adjacency matrices formulated from the relational data among the messages collected from Reddit, to a GNN. The GNN outputs another set of predicted probabilities of negative and non-negative sentiments, which are combined with the set of probabilities from RoBERTa to train a meta-model (the ensemble and model stacking step) to produce a final classifier. The classifier is then used to generate a sentiment label for each unlabeled message in our collected data. Finally, we combine the data with the observed or learned sentiment labels across the 8 schools prior to and during the pandemic, and employ regression (a generalized linear mixed-effects model, or GLMM) to examine the effect of the pandemic on sentiment, after adjusting for school characteristics.

In what follows, we describe each step in detail. Sections 2.1 and 2.2 present the data collection and manual labeling steps, respectively, which create a set of labeled data to train the ML procedures; Sections 2.3, 2.4, and 2.5 present the application of RoBERTa, the formulation and training of GAT, and the model stacking and ensemble of RoBERTa and GAT, respectively; Section 2.6 lists the GLMM used for statistical testing and inference.

In the current study, we focus on schools that are "R1: Doctoral Universities - Very high research activity" per the Carnegie Classification of Institutions of Higher Education (The Carnegie Classification of Institutions, 2021), with the plan to examine more universities in the future. A university can be characterized in many different ways. When choosing which schools to include in our study, we focus on three factors that can potentially affect students' sentiment: private vs. public schools, school location (small vs. large cities), and whether a school opted for in-person learning during the pandemic vs. taking a fully online learning approach from August to November 2020. To balance potential confounders for the subsequent statistical analysis, we adopt a full-factorial $2 \times 2 \times 2$ design on the above three factors and select one school for each of the eight cells in the full-factorial design (Table 1). Each of the eight schools has a subreddit on Reddit. We downloaded the post and comment data from each subreddit from August to November in 2019 and 2020, respectively, representing the pre-pandemic and pandemic periods, using the Pushshift API (https://github.com/pushshift/api). This results in 120,738 messages in total. For the downloaded data, we manually labeled a sizable number of messages in each school-year so that there are sufficient labeled cases in each sentiment category for training and testing of our ML procedures.
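As a concrete illustration of the download step above, here is a minimal sketch of paging through a subreddit's comments with the Pushshift API; the endpoint and parameters follow Pushshift's public documentation rather than code released with the paper, and the subreddit name and timestamps are placeholders.

```python
import time
import requests

# Pushshift comment-search endpoint (https://github.com/pushshift/api).
BASE_URL = "https://api.pushshift.io/reddit/search/comment/"

def fetch_comments(subreddit, start_epoch, end_epoch, size=100):
    """Page through all comments posted to `subreddit` in [start_epoch, end_epoch)."""
    results, after = [], start_epoch
    while True:
        resp = requests.get(BASE_URL, params={
            "subreddit": subreddit,
            "after": after,
            "before": end_epoch,
            "size": size,
            "sort": "asc",
        })
        batch = resp.json().get("data", [])
        if not batch:
            return results
        results.extend(batch)
        after = batch[-1]["created_utc"]  # resume after the last comment seen
        time.sleep(1)  # stay well under the API rate limit

# Example: one school's subreddit, August 1 to November 30, 2019 (epoch seconds);
# the subreddit name here is only a placeholder.
# comments = fetch_comments("dartmouth", 1564617600, 1575158399)
```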
The labeled messages are summarized in Table 2. We focus on binary sentiment classification in this study, and rather than labeling the messages as Negative and Positive, we use the Negative and non-Negative classification because some messages are neutral rather than either positive or negative. For example, a message could just be a question on the number of students in a class, and the reply to that message is just a number. There is also subjectivity among manual labelers on what Neutral means. For example, some labelers labeled "OK" as Positive while others labeled it as Neutral. We also examined the possibility of adding Very Positive and Very Negative categories, and again there was substantial disagreement among the labelers on what can be described as Very Positive vs. Positive and Very Negative vs. Negative. We also examined the classification accuracy on sentiments when using 3 categories (Positive, Negative, Neutral) vs. 5 categories (Very Positive, Positive, Negative, Very Negative, Neutral); the prediction accuracy was approximately 60% for the former and approximately 50% for the latter. Given the objective of our study (whether the pandemic has negatively affected the emotions and psychological states of people) and the fact that 2-class classification is commonly accepted in sentiment analysis, we expect the Negative versus non-Negative classification to be sufficient to address our study goal, along with less discrepancy when labeling and higher accuracy when training the ML procedures and predicting the messages in our collected data.

RoBERTa (Liu et al., 2019) is an improved version of the Bidirectional Encoder Representations from Transformers (BERT) framework (Devlin et al., 2018), owing to several modifications of BERT (e.g., dynamic masking, input changes without the next-sentence-prediction loss, large mini-batches, etc.). The BERT framework itself is based on transformers (Vaswani et al., 2017), a deep learning model built upon the attention mechanism. The BERT model is pre-trained using text from Wikipedia and can be fine-tuned for various types of downstream learning tasks. The Reddit data in our study, like any social media data, contain a large number of emoticons, non-standard spellings, and internet slang, causing difficulty for traditional sentence embedding learning methods designed for semantics with standard grammar and spelling. The RoBERTa model that we employ is capable of generating embeddings from internet slang and other non-standard spellings by using a special tokenizer. Specifically, we applied the RoBERTa model trained on approximately 58 million messages from Twitter and fine-tuned for sentiment analysis (Barbieri et al., 2020a,b) to obtain the embeddings of the semantic information in our collected Reddit text data. Besides generating embeddings from the text data via RoBERTa, which are fed to the GAT NN in the next step, we also use the RoBERTa framework to predict the sentiment labels in our collected Reddit data, as part of the subgroup-adaptive model stacking introduced in Section 2.5. The Python code for the RoBERTa framework that we applied is adapted from Barbieri et al. (2020a) and is available at https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment.

We employ GAT to incorporate the relational information in the collected data as a graph input to train a sentiment classifier. GAT is a GNN that employs the attention mechanism to incorporate neighbors' information in graph embedding.
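Before turning to the graph construction, here is a minimal sketch of the RoBERTa step just described, using the cardiffnlp/twitter-roberta-base-sentiment model named above; mean-pooling the last hidden layer as the message embedding is our assumption, since the paper does not specify the pooling.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# The fine-tuned sentiment model named in the text.
MODEL = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def embed_and_score(message: str):
    """Return (message embedding, sentiment probabilities) for one message."""
    enc = tokenizer(message, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**enc, output_hidden_states=True)
    # Mean-pool the last hidden layer as the message embedding h_i fed to GAT
    # (pooling choice is an assumption).
    h_i = out.hidden_states[-1].mean(dim=1).squeeze(0)
    # This model's three classes are: 0 = negative, 1 = neutral, 2 = positive.
    probs = torch.softmax(out.logits, dim=-1).squeeze(0)
    return h_i, probs

h_i, probs = embed_and_score("Midterms are crushing me this semester.")
p_negative = probs[0].item()
p_non_negative = 1.0 - p_negative  # collapse neutral + positive into non-Negative
```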
In our problem setting, each message is regarded as a node in the graph, and the relation "message $\xrightarrow{\text{reply to}}$ message" is coded as an edge in the graph and the corresponding adjacency matrix. We randomly selected 601 non-Negative and 94 Negative messages from the merged 2019 and 2020 Dartmouth data in Table 2, along with the adjacency matrix of the merged Dartmouth data set, as the training set for the GAT NN. There are a few reasons why we only used the data from one school to train GAT. First, the graph on the merged data across different schools leads to a high-dimensional block-diagonal adjacency matrix because the comments from different schools are barely linked. In other words, merging the data from different schools does not provide additional benefits from the perspective of leveraging relational data to train GAT, while potentially increasing computational costs and, in our experiments, leading to worse prediction accuracy than using the data from a single school. Second, laborious manual labeling confined us to obtaining sufficient labeled data in only one school to train the GAT with meaningful relational information. Third, what constitutes negative vs. positive sentiment and how sentiment relates to the relational information among messages is largely independent of a particular school, so there would not be much sacrifice in prediction accuracy from training the GAT NN with the relational data from one school.

The steps in training the GAT NN are listed below, and a schematic illustration of the steps is given in Figure 2. Before the implementation, we first apply RoBERTa to learn a set of representative features (embeddings) $h_i$ for message $i$. We use $N_i$ to denote the neighbors of message $i$, that is, the set of messages that reply to message $i$.

1) Define the "closeness" of message $j \in N_i$ to message $i$ as $e_{ij} = \sigma_1(a^T [h'_i \| h'_j])$, where $h'_i = M h_i$, $h'_j = M h_j$, and $M$ is a linear transformation matrix; $\|$ denotes the row concatenation of the column vectors $h'_i$ and $h'_j$; $a$ is the attention vector that measures the importance of the elements in $h'_i \| h'_j$; and $\sigma_1$ is an activation function.

2) Define $\alpha_{ij} = \exp(e_{ij}) / \sum_{k \in N_i} \exp(e_{ik})$ for each $j \in N_i$ and calculate $z_i = \big\|_{k=1}^{K} \sigma_2\big(\sum_{j \in N_i} \alpha_{ij} h'_j\big)$ per the multi-head attention mechanism, where $K$ is the number of attention heads and $\sigma_2$ is an activation function. The multi-head attention mechanism allows the GNN to capture different types of relations among words within each message as well as across messages. We use $K = 4$ in our application.

3) Feed the embedding $z_i$ from step 2) through a dense single-layer perceptron $f$, parameterized by $\theta_f$, and train $f$ by minimizing the $l_2$-regularized cross-entropy loss in Eq (1) over $i = 1, \ldots, |Y_L|$, where the set $Y_L$ contains the labeled messages and $Y_L = \{Y_L^{\text{non-neg}}, Y_L^{\text{neg}}\}$:

$$L(\Theta) = -\sum_{i=1}^{|Y_L|} \big[\, y_i \log f(z_i) + r (1 - y_i) \log(1 - f(z_i)) \,\big] + \lambda \|\Theta\|_2^2. \qquad (1)$$

Here $y_i$ denotes the observed label of message $i$ in the training data ($y_i = 1$ for non-Negative and $y_i = 0$ for Negative), and $f(z_i)$ is the predicted probability that message $i$ is non-Negative. $f$ classifies the messages into either Negative or non-Negative and is parameterized by $\theta_f$; $\theta_z$ contains the parameters from the graph embedding and attention mechanism in steps 1) and 2) above; and $\Theta = (\theta_f, \theta_z)^T$. $r$ is a class-weight parameter and is set at 1 in the general case. Our training data are imbalanced in terms of labeled non-Negative vs. labeled Negative cases (the former is about 6 times the latter). To achieve better classification results, we "oversampled" the Negative cases by setting $r = 2$. The $l_2$ penalty parameter is $\lambda = 0.8$, chosen via 4-fold cross-validation.
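A minimal sketch of steps 1)-3) using the GAT layer in PyTorch Geometric, under stated assumptions: the hidden size and learning rate are our choices (the paper does not report them), and mapping the $l_2$ penalty $\lambda = 0.8$ onto optimizer weight decay is an approximation of Eq (1).

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv

class SentimentGAT(torch.nn.Module):
    """Steps 1)-3): multi-head graph attention over RoBERTa embeddings,
    followed by a dense single-layer classifier f."""

    def __init__(self, in_dim=768, hidden=64):
        super().__init__()
        self.gat = GATConv(in_dim, hidden, heads=4)  # K = 4, as in the paper
        self.clf = torch.nn.Linear(hidden * 4, 2)    # Negative vs non-Negative

    def forward(self, x, edge_index):
        z = F.elu(self.gat(x, edge_index))  # attention + neighbor aggregation
        return self.clf(z)

model = SentimentGAT()
# r = 2 enters as a class weight on the minority Negative class; treating the
# paper's l2 penalty (lambda = 0.8) as optimizer weight decay is an assumption.
class_weight = torch.tensor([1.0, 2.0])  # index 0 = non-Negative, 1 = Negative
optimizer = torch.optim.Adam(model.parameters(), lr=5e-3, weight_decay=0.8)

def train_step(x, edge_index, labels, labeled_mask):
    """One update: x holds RoBERTa embeddings h_i; edge_index holds reply links.
    The loss uses labeled nodes only, but message passing still propagates
    information from unlabeled neighbors (the semi-supervised aspect)."""
    model.train()
    optimizer.zero_grad()
    logits = model(x, edge_index)
    loss = F.cross_entropy(logits[labeled_mask], labels[labeled_mask],
                           weight=class_weight)
    loss.backward()
    optimizer.step()
    return loss.item()
```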
The loss function in Eq (1) is formulated with the labeled data only. Though some nodes in the input graph are unlabeled (given the high cost of manual labeling) and are not part of the loss function, their text information and their relational information with the labeled nodes are still used in the graph embedding in steps 1) and 2) above. In this sense, the training of the GAT NN can be regarded as semi-supervised learning.

Besides the classification via the GAT NN in Section 2.4, we also performed sentiment classification via RoBERTa. Comparing the classification results of GAT and RoBERTa on the testing data side by side (Figure 3), we found some inconsistency. The left plot shows the sensitivity, or the probability of correctly predicting non-Negative sentiments, and the right plot depicts the specificity, defined as the probability of correctly predicting Negative sentiments. The cutoff $c$ for labeling a message Negative or non-Negative based on its predicted Prob(non-Negative) $p$ is optimized by maximizing the geometric mean of sensitivity (recall) and positive predictive value (precision) (Fowlkes and Mallows, 1983) in the ROC curves for GAT and RoBERTa, and is 0.2438 and 0.6923, respectively. If $p < c$, the message is labeled Negative; otherwise, it is labeled non-Negative. In both plots, there is some deviation from the identity line, indicating the inconsistency in the classification between GAT and RoBERTa. GAT performs better than RoBERTa in terms of predicting true Negative sentiments, whereas RoBERTa performs better than GAT in terms of predicting true non-Negative sentiments. In addition, both classifiers output some low probabilities; ideally, if both classifiers had high accuracy, most of the points would be clustered in the upper right corner around $(1, 1)$.

Figure 3: (a) Pr(predicted Neg | labeled Neg); (b) Pr(predicted non-Neg | labeled non-Neg).

The observations in Figure 3 suggest that both GAT and RoBERTa have their respective strengths but also suffer some degree of inaccuracy when predicting a certain sentiment category. To leverage the advantages of both methods, we develop a subgroup-adaptive ensemble method aiming for better prediction results. Ensemble learning is an ML technique that combines multiple base models (weak learners) to form a strong learner. Bagging, boosting, and model stacking are all well-known and popular ensemble methods and concepts. The specific technique we use here is model stacking (Wolpert, 1992), fitting logistic regression models, a.k.a. meta-models, on the predicted classification probabilities from GAT and RoBERTa. The fitted logistic models are adaptive to each population subgroup, i.e., a specific school-year combination in our study. We first scale the raw prediction probability $p$ of being non-Negative from GAT and RoBERTa to obtain $p'$. Specifically, let $p^{(m)}_{ijk}$ be the predicted probability that message $i$ from school $j$ in year $k$ has a non-Negative sentiment per model $m$, where $m = 1, 2$ represents RoBERTa and GAT, respectively; $p^{(m)}_{ijk}$ is rescaled monotonically to $p'^{(m)}_{ijk}$ so that the model-specific optimal cutoff $c^{(m)}$ is mapped to 0.5. The reason behind the scaling of $p$ is that the optimal $c^{(m)}$ for the raw $p$ differs between RoBERTa and GAT and neither equals 0.5, leading to ineffective downstream training of the logistic regression meta-model (Eq (3)) and difficulty in choosing a cutoff on the final probability $\bar{p}$ from the meta-model. After the scaling, the cutoff on $p'$ is 0.5 for both GAT and RoBERTa, eliminating both concerns.
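A minimal sketch of the cutoff search and the rescaling, together with the per-subgroup logistic meta-model described next (Eq (3)); the piecewise-linear form of the rescaling map is our assumption, since the paper states only that $c^{(m)}$ must map to 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def optimal_cutoff(p_nonneg, y):
    """Grid-search the cutoff c maximizing the geometric mean of recall and
    precision for the Negative class; a message with p < c is labeled Negative
    (y uses 1 = non-Negative, 0 = Negative)."""
    best_c, best_g = 0.5, -1.0
    for c in np.linspace(0.01, 0.99, 99):
        pred_neg, true_neg = p_nonneg < c, (y == 0)
        tp = np.sum(pred_neg & true_neg)
        recall = tp / max(true_neg.sum(), 1)
        precision = tp / max(pred_neg.sum(), 1)
        g = np.sqrt(recall * precision)
        if g > best_g:
            best_c, best_g = c, g
    return best_c

def rescale(p, c):
    """One monotone piecewise-linear map sending the cutoff c to 0.5 (the
    exact form of the paper's scaling is not printed; this is an assumption)."""
    return np.where(p < c, 0.5 * p / c, 0.5 + 0.5 * (p - c) / (1.0 - c))

def fit_meta_models(p1, p2, y, groups):
    """One logistic meta-model per school-year subgroup; p1 and p2 are the
    rescaled RoBERTa and GAT probabilities (see Eq (3) in the text)."""
    X = np.column_stack([p1, p2])
    return {g: LogisticRegression().fit(X[groups == g], y[groups == g])
            for g in np.unique(groups)}
```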
The logistic meta-model for school $j$ and year $k$ then uses $p'^{(1)}_{ijk}$ and $p'^{(2)}_{ijk}$ as input to generate the final prediction probability

$$\bar{p}_{ijk} = \frac{\exp\big(\hat{\beta}_{0,jk} + \hat{\beta}_{1,jk}\, p'^{(1)}_{ijk} + \hat{\beta}_{2,jk}\, p'^{(2)}_{ijk}\big)}{1 + \exp\big(\hat{\beta}_{0,jk} + \hat{\beta}_{1,jk}\, p'^{(1)}_{ijk} + \hat{\beta}_{2,jk}\, p'^{(2)}_{ijk}\big)}, \qquad (3)$$

where $\hat{\beta}_{0,jk}$, $\hat{\beta}_{1,jk}$, $\hat{\beta}_{2,jk}$ are the estimated regression coefficients from the logistic model for school $j$ and year $k$. The cutoff on $\bar{p}$ from the meta-model prediction is 0.5; that is, if $\bar{p}_{ijk} \ge 0.5$, then message $i$ in school $j$ and year $k$ is labeled non-Negative; otherwise, it is labeled Negative. In terms of the training data for the model stacking step, we used 25 Negative cases and 25 non-Negative cases from the merged Dartmouth data and from each of the other 14 school-years, leading to 375 Negative and 375 non-Negative cases overall. This very set of data (375 Negative and 375 non-Negative cases) was also used to find the optimal cutoff $c^{(m)}$ on the probability of Negative sentiments for the binary classification for RoBERTa and GAT, respectively. The testing data set for the logistic meta-model is the same as the one used for testing the prediction accuracy of RoBERTa and GAT, with 25 Negative and 25 non-Negative cases per school-year.

With the learned sentiment labels (Negative vs. non-Negative) for the messages in the data set, we apply a GLMM to examine the effects of the pandemic on emotional state. The GLMM uses a logit link function with a binary outcome (non-Negative vs. Negative). Specifically, we run two GLMMs. Model 1 is fitted to the whole collected data of 120,738 cases and compares the sentiment between 2019 (pre-pandemic) and 2020 (pandemic), after controlling for school location and type. Model 2 is fitted to the 2020 subset of 65,329 cases and examines how in-person learning during the pandemic affects the sentiment compared to remote learning, after controlling for school location and type. Since the messages are clustered by school and the messages from the same school are not independent, we include in both models a random effect of school, thus the "mixed-effects" model. The formulations of the two models are given below:

Model 1: $\log\Big(\frac{\Pr(\text{non-Negative})}{1 - \Pr(\text{non-Negative})}\Big) = \beta_0 + \beta_1 X_{\text{type}} + \beta_2 X_{\text{location}} + \beta_3 X_{\text{year}} + \gamma$;

Model 2: $\log\Big(\frac{\Pr(\text{non-Negative})}{1 - \Pr(\text{non-Negative})}\Big) = \beta'_0 + \beta'_1 X_{\text{type}} + \beta'_2 X_{\text{location}} + \beta'_3 X_{\text{in-person}} + \gamma'$,

where $\gamma \sim N(0, \sigma^2_\gamma)$ and $\gamma' \sim N(0, \sigma^2_{\gamma'})$ are school-level random effects; $X_{\text{type}} = 1$ if private and 0 if public; $X_{\text{location}} = 1$ if small city and 0 if large city; $X_{\text{year}} = 1$ if 2020 and 0 if 2019; $X_{\text{in-person}} = 1$ if in-person and 0 if remote; and $\beta_j$ and $\beta'_j$ for $j = 1, 2, 3$ are the corresponding fixed effects, each representing the log odds ratio of being non-Negative when $X_j = 1$ vs. $X_j = 0$ in the respective model.

The classification results from the meta-model of the model stacking step on the testing data are presented in Table 3. We examine multiple metrics of prediction accuracy, including the overall accuracy rate, F1 score, recall, precision, and specificity. Though there is some variation across the schools and years, we have achieved satisfactory accuracy for each subgroup by all the metrics. Compared to GAT and RoBERTa, model stacking leads to better or similar classification results across all the examined metrics (Figure 4). The GLMM results are presented in Table 4. "Year 2020" shows a statistically significant effect on sentiment with p-value < 0.001; the odds of Negative sentiment in year 2020 are 1.146 times the odds in year 2019, after adjusting for school type and location. "In-person" learning in 2020 also affects sentiment in a statistically significant manner, and the odds of negative sentiments increase by 41.6% compared to online learning.
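Returning briefly to the GLMM specification in Section 2.6, here is a minimal sketch of fitting Model 1 with a school-level random intercept; the paper does not name its GLMM software, so the variational Bayes mixed GLM in statsmodels stands in here, and the file and column names are assumptions.

```python
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

# One row per message with the final predicted label and school covariates;
# the file name is a placeholder. Assumed columns: non_negative (1/0),
# x_type (1 = private), x_location (1 = small city), x_year (1 = 2020),
# and school (identifier for the random effect).
df = pd.read_csv("labeled_messages.csv")

# Model 1: fixed effects for type, location, and year; the school-level
# random intercept enters as a variance component.
glmm = BinomialBayesMixedGLM.from_formula(
    "non_negative ~ x_type + x_location + x_year",
    {"school": "0 + C(school)"},
    df,
)
result = glmm.fit_vb()  # variational Bayes fit
print(result.summary())

# Since the model's success category is non-Negative, exp(-beta) for a fixed
# effect gives the corresponding odds ratio of Negative sentiment, e.g., the
# reported 1.146 for year 2020 vs. 2019.
```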
Whether the school is located in a small city and whether the school is private do not appear to influence the odds of Negative sentiment in a statistically significant manner in either analysis.

In this study, we collected social media data from Reddit and applied state-of-the-art ML techniques to study whether the pandemic has negatively affected the emotional and psychological states of a sub-population. The ML techniques we employed achieved greater than 80% prediction accuracy on sentiment overall by various metrics. Our results suggest the pandemic has had a negative impact on the group's emotional and psychological states in a statistically significant manner, and in-person teaching also increased the odds of negative sentiment in a statistically significant manner compared to online teaching. In the future, we plan to keep collecting Reddit data from the same period every year (2021 and beyond) and examine the long-term effects of the pandemic on emotional and psychological states and whether and when these states will return to the pre-pandemic baseline. In addition, we plan to apply and evaluate more ML techniques, such as those developed for heterogeneous graphs, to further improve the prediction accuracy of sentiment analysis based on semantic and relational information.

References

Atwood and Towsley (2016). Diffusion-convolutional neural networks.
Barbieri et al. (2020a). TweetEval: Unified benchmark and comparative evaluation for tweet classification.
Barbieri et al. (2020b). Twitter-RoBERTa-base for sentiment analysis.
Bo et al. (2021). Posttraumatic stress symptoms and attitude toward crisis mental health services among clinically stable patients with COVID-19 in China.
Chen et al. (2020a). Simple and deep graph convolutional networks.
Chen et al. (2020b). Prevalence of self-reported depression and anxiety among pediatric medical staff members during the COVID-19 outbreak in Guiyang, China.
Defferrard et al. (2016). Convolutional neural networks on graphs with fast localized spectral filtering.
Devlin et al. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding.
Fowlkes and Mallows (1983). A method for comparing two hierarchical clusterings.
Hamilton et al. (2017). Inductive representation learning on large graphs.
Hu et al. (2020). Heterogeneous graph transformer.
Jelodar et al. (2020). Deep sentiment classification and topic discovery on novel coronavirus or COVID-19 online discussions: NLP using LSTM recurrent neural network approach.
Jia and Li (2020). Emotional analysis on the public sentiment of students returning to university under COVID-19.
Kingma and Ba (2015). Adam: A method for stochastic optimization.
Kipf and Welling (2016). Semi-supervised classification with graph convolutional networks.
Li et al. (2015). Gated graph sequence neural networks.
Liu et al. (2019). RoBERTa: A robustly optimized BERT pretraining approach.
Low et al. (2020). Natural language processing reveals vulnerable mental health support groups and heightened health anxiety on Reddit during COVID-19: Observational study.
McCann et al. (2017). Learned in translation: Contextualized word vectors.
Monti et al. (2017). Geometric deep learning on graphs and manifolds using mixture model CNNs.
Pandey (2021). REDBERT: A topic discovery and deep sentiment classification model on COVID-19 online discussions using BERT NLP model. medRxiv.
Scarselli et al. (2008). The graph neural network model.
Sharma et al. (2020). Assessing COVID-19 impacts on college students via automated processing of free-form text.
Socher et al. (2013). Recursive deep models for semantic compositionality over a sentiment treebank.
Sønderskov et al. (2020). The depressive state of Denmark during the COVID-19 pandemic.
The Carnegie Classification of Institutions (2021). Doctoral universities: Highest research activity.
Vaswani et al. (2017). Attention is all you need.
Veličković et al. (2017). Graph attention networks.
Wang et al. (2019). Heterogeneous graph attention network.
Wolpert (1992). Stacked generalization.
Wu et al. (2020). Analysis of college students' psychological anxiety and its causes under COVID-19.
Xiong et al. (2020). Impact of COVID-19 pandemic on mental health in the general population: A systematic review.
Zhang et al. (2020). The differential psychological distress of populations affected by the COVID-19 pandemic.