Textual Analysis of Communications in COVID-19 Infected Community on Social Media

Yuhan Liu, Yuhan Gao, Zhifan Nan, Long Chen

May 3, 2021

Abstract

During the COVID-19 pandemic, people turned to social media to discuss pandemic-related topics. On the subreddit r/COVID19positive, a range of topics is shared and discussed, including the experiences of those who received a positive test result, the stories of those who are presumably infected, and questions about the pandemic and the disease. In this study, we examine the nature of the discussions on the subreddit from a linguistic perspective. We found differences in linguistic characteristics (e.g., psychological, emotional, and reasoning features) across three categories of topics. We also classified posts into these categories using state-of-the-art (SOTA) pre-trained language models. Such a classification model can support further pandemic-related research on social media.

Introduction

Starting in late 2019, the COVID-19 pandemic rapidly spread to over 200 countries, areas, and territories. As of December 7, 2020, according to the World Health Organization (WHO), 66,243,918 COVID-19 cases had been confirmed worldwide, with 1,528,984 confirmed deaths. The disease has had a tremendous impact on people's daily lives around the world. As the pandemic spread in the United States, people who tested positive began sharing information about their physical condition, their emotions, and their experiences with the virus. Meanwhile, those who had not been infected were curious about the symptoms and nature of the virus, as well as testing procedures across the country. A community of people who want to share their own stories, or who want to know more about the virus, emerged on Reddit, a platform where any user (older than 13 years) can discuss, connect, and share experiences and opinions online. On the subreddit r/COVID19positive, people share and discuss the virus while seeking and giving emotional support. An online community like this carries mixed emotions and rich textual content.

In this study, we investigate the linguistic features of the content on this subreddit. First, we classify threads into three categories: a) reports of a positive COVID-19 test, b) reports of a presumed COVID-19 case, and c) questions regarding COVID-19. Second, we investigate the linguistic characteristics of posts and subsequent comments in different contexts. Specifically, we found differences in content when people post for different purposes (i.e., when they self-report their discussion as belonging to one of the three categories above).

Related Work

A large number of studies have used LIWC, an API for linguistic analysis of documents. Tumasjan et al. (2010) used LIWC to capture political sentiment and predict elections with Twitter. The API was also used by Zhang et al. (2020) to provide insights into the sentiment of the descriptions of crowdfunding campaigns. Previous studies have also attempted textual classification of social media data. Mouthami et al. (2013) implemented a classification model that approximately classifies sentiment using bag-of-words features with the Support Vector Machine (SVM) algorithm. Huang et al. (2014) applied the SMOTE (Synthetic Minority Oversampling Technique) method to detecting online cyberbullying behavior.
In addition, a number of other studies performed textual classification for various purposes using social media data (Chen et al., 2020; Chatzakou and Vakali, 2015, among others).

Data

Data from the subreddit r/COVID19positive between March 14, 2020 and October 14, 2020 was collected using the Pushshift API. In total, 17,285 submissions (the posts that start a Reddit thread) were collected. As a medium-sized subreddit with 91.1K members, the community should contain few fake posts and little misinformation, yielding a relatively clean dataset. A submission on Reddit starts a discussion with a title and an optional textual body; together they are a natural source for textual analysis. In addition, most submissions carry a flair, a hashtag-like, user-reported label that describes the category of discussion the submission belongs to. The flairs serve as labels for supervised classification tasks.

Data cleanup and preprocessing were performed on the collected dataset (a sketch of these steps is given below). First, all posts without flairs were removed, leaving 15,410 posts in the dataset. Next, the title and body were concatenated into a single field, titletext, which serves as the textual input to our models. Then, we removed emojis, extra separators, and repeated punctuation from the text. Lastly, since we have 10 different flairs but a limited dataset size, we merged related flairs into three categories: a) question, b) tested positive, and c) presumed positive, as shown in Table 1.

Exploratory Analysis

In this exploratory task, we aim to find discrepancies among texts on topics in the different categories. We therefore apply Linguistic Inquiry and Word Count (LIWC) to extract the sentiment of the submissions and comments in our corpus. LIWC2015 is a dictionary-based linguistic analysis tool that counts the percentage of words reflecting different emotions, thinking styles, and social concerns, thereby capturing people's psychological states. We concatenated the posts in each of the three categories to form three large documents, as done by Yu et al. (2008), and then used LIWC to score each category. Significance tests were performed to find relevant fields, and we also manually selected fields of interest even where no significant differences were found. In the end, we selected 3 summary linguistic features (Analytic, Clout, and Tone), 5 psychological features (positive emotion, negative emotion, sadness, anxiety, and anger), and 3 time-orientation features (focuspast, focuspresent, and focusfuture).
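To make the cleanup pipeline concrete, a minimal sketch in pandas follows. This is not the authors' code: the input field names (title, selftext, link_flair_text) follow Pushshift's standard submission schema, and the flair-to-category mapping is hypothetical, since Table 1 is not reproduced here.

```python
import pandas as pd

# Hypothetical local dump of the collected submissions; the field names
# (title, selftext, link_flair_text) follow Pushshift's submission schema.
df = pd.read_json("covid19positive_submissions.json")

# 1. Drop submissions without a flair (17,285 -> 15,410 posts in the paper).
df = df.dropna(subset=["link_flair_text"])

# 2. Concatenate title and body into the single field `titletext`.
df["titletext"] = df["title"].fillna("") + " " + df["selftext"].fillna("")

# 3. Remove emojis (approximated here by stripping non-ASCII characters),
#    repeated punctuation, and extra separators.
df["titletext"] = (
    df["titletext"]
    .str.encode("ascii", "ignore").str.decode("ascii")
    .str.replace(r"([!?.,])\1+", r"\1", regex=True)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

# 4. Merge the ten original flairs into three categories.
#    The flair names below are placeholders for the mapping in Table 1.
FLAIR_TO_CATEGORY = {
    "Tested Positive": "tested positive",
    "Presumed Positive": "presumed positive",
    "Question": "question",
    # ... remaining flairs mapped to the closest of the three categories
}
df["category"] = df["link_flair_text"].map(FLAIR_TO_CATEGORY)
df = df.dropna(subset=["category"])
```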
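The per-category LIWC scoring can be sketched as follows, assuming the open-source liwc Python package and a licensed LIWC2015 dictionary file (the paper used the official LIWC2015 tool). Note that the summary variables Analytic, Clout, and Tone are computed by LIWC's proprietary formulas and cannot be reproduced from the dictionary alone; this sketch covers only word-count categories such as posemo, negemo, and focuspast.

```python
import re
from collections import Counter

import liwc  # pip install liwc; requires a licensed LIWC2015 .dic file

# parse(token) yields the LIWC categories that the token matches.
parse, category_names = liwc.load_token_parser("LIWC2015.dic")

def tokenize(text):
    # Simple regex tokenizer; the official tool's tokenization may differ.
    for match in re.finditer(r"\w+", text.lower()):
        yield match.group(0)

def liwc_percentages(document):
    """Percentage of tokens that fall into each LIWC category."""
    tokens = list(tokenize(document))
    counts = Counter(cat for token in tokens for cat in parse(token))
    return {c: 100.0 * counts[c] / max(len(tokens), 1) for c in category_names}

# One concatenated document per category, as in Yu et al. (2008);
# `df` is the preprocessed dataframe from the sketch above.
scores = {
    category: liwc_percentages(" ".join(group["titletext"]))
    for category, group in df.groupby("category")
}
```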
Classification Models

With the categorized flairs, we built three-label classification models, detailed below.

First, we built a stacking ensemble model with Random Forest, SVM, Naïve Bayes, XGBoost, Logistic Regression, and K-Nearest Neighbors as meta-models. To transform the dataset into a model-compatible format, we applied term frequency-inverse document frequency (TF-IDF) vectorization. Default hyperparameters were used.

Next, we built a Bidirectional Long Short-Term Memory (Bi-LSTM) model. The dataset was converted into 50-dimension word vectors using spaCy encodings. Hyperparameters were tuned with grid search; the best configuration was the Adam optimizer with lr=1e-3, eps=1e-8, and dropout=0.2. Best performance was achieved at epoch 5.

BERT (Devlin et al., 2018) was used as our first pre-trained language model. We used the BERT-base cased model, as more complex models performed poorly on our limited-sized dataset. We fine-tuned the model with a dense layer and an output layer of 3 neurons (a sketch of this setup is given at the end of this section). Hyperparameters were also tuned using grid search; the best configuration was the Adam optimizer with lr=1e-5, eps=1e-8, and hidden size=50. Best performance was achieved at epoch 3.

XLNet (Yang et al., 2019) was used as our second pre-trained language model. For comparability, we chose the XLNet-base cased model. We also fine-tuned this model with a dense layer and an output layer of 3 neurons. Hyperparameters were tuned using grid search; the best configuration was the Adam optimizer with lr=3e-5, eps=1e-8, and hidden size=50. Best performance was achieved at epoch 4.

The dataset was converted into model-compatible formats using the corresponding tokenization/vectorization methods. We then made a train-validation-test split with a 70:15:15 ratio. As the dataset is imbalanced across the three classes, we upsampled the minority classes in the training set using SMOTE (Chawla et al., 2002).
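The split-and-oversample step can be sketched with scikit-learn and imbalanced-learn as follows. The stratified two-step split and the random seed are our assumptions, and the TF-IDF features shown correspond to the ensemble pipeline; the neural models use their own tokenizers.

```python
from imblearn.over_sampling import SMOTE
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

X, y = df["titletext"], df["category"]

# 70:15:15 train/validation/test split, done in two steps.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)

# TF-IDF vectorization, fit on the training split only.
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)

# Oversample minority classes in the training set only, so the
# validation and test sets keep the natural class distribution.
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(
    X_train_vec, y_train)
```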
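For the BERT classifier described above, the following is a minimal sketch of the fine-tuning setup, assuming PyTorch and the Hugging Face transformers library (the paper does not name its implementation). The dense layer size, learning rate, and eps follow the reported hyperparameters; the single training step and the label index are illustrative.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BertClassifier(nn.Module):
    """BERT-base (cased) with a 50-unit dense layer and a 3-way output."""

    def __init__(self, hidden_size=50, num_labels=3):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-cased")
        self.dense = nn.Linear(self.bert.config.hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        pooled = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).pooler_output
        return self.out(torch.relu(self.dense(pooled)))

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5, eps=1e-8)
criterion = nn.CrossEntropyLoss()

# One illustrative training step (the paper trained for roughly 3 epochs).
batch = tokenizer(["Tested positive today, day 3 of symptoms."],
                  return_tensors="pt", truncation=True, padding=True)
labels = torch.tensor([1])  # hypothetical index for "tested positive"
loss = criterion(model(batch["input_ids"], batch["attention_mask"]), labels)
loss.backward()
optimizer.step()
```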
Results

First, we found differences in the 3 summary linguistic features among the three classes, as shown in Fig. 1. Presumed positive posts have a higher Analytic (i.e., analytical thinking) score, suggesting more logical and formal thinking in the discussion. Indeed, most presumed positive posts are written by people who are very likely positive but still uncertain. Facing this uncertainty, they tend to reason through their symptoms and the recent activities that might have gotten them infected. Question posts have a relatively lower Analytic score, which is explained by their question-raising nature. In terms of Clout, which indicates the level of confidence, tested positive posts score higher, suggesting their authors are "more certain" about their positive diagnosis and "more confident" about getting better. The same category also has a higher emotional Tone score, suggesting these posts are "more emotional" about being infected.

Next, we investigated the time-orientation features of the three categories, as shown in Fig. 2. Tested positive posts focus more on the past, while question posts focus more on the present. Looking at sample posts, we found that tested positive posts tend to reflect on the potential causes of infection (e.g., hanging out in a pub, attending a gathering without a mask) and on symptoms before visiting a doctor, while question posts tend to report current feelings and symptoms in search of answers. In addition, the categories show no significant difference in focus on the future, and the magnitude of future focus is low in all three categories compared with focus on the past or present.

As for the emotional features, tested positive posts surprisingly show a higher level of positive emotion, suggesting that those who are infected are more "optimistic" about getting better, while presumed positive posts show a higher level of negative emotion, which makes sense as their authors are the most uncertain and may genuinely feel unwell. Among the specific negative emotions, presumed positive posts have a significantly higher sadness level, which can be interpreted as feeling depressed during the uncertainty of a very likely diagnosis, while question posts have a significantly higher anxiety level, reflecting uncertainty and concern about the pandemic in general.

The performance of the classification models is shown in Table 2. The best model is BERT, with a test F-1 score of 0.722.

Discussion

We observed that the performance difference between SOTA pre-trained models and traditional models is not large. We suspect that the relatively small dataset limited the performance of BERT and XLNet, as these models are more complex and thus require more training data. This finding is congruent with the research of Ezen-Can (2020).

Conclusion

In this study, we performed a linguistic analysis of posts in an online COVID-19 discussion community on Reddit. Posts in the three categories showed differences in psychological, emotional, and other characteristics. We also built classification models to differentiate posts among the categories and found that SOTA pre-trained models yield the best performance. Future work could incorporate more features into the classification models, such as metadata from Reddit (e.g., up/down votes, number of comments) and LIWC scores for each individual post. The classification model developed in our study could also be used on social media to identify infected people for other studies, such as psychological evaluation of the infected group. In addition, as we only analyzed Reddit submissions, comments are another textual source on Reddit, far larger in volume but unlabeled: more than 1 million comments were collected from this subreddit, compared with only 17,285 submissions. Such a data source could be used for unsupervised pattern recognition, such as Latent Dirichlet Allocation topic modeling.

References

Despoina Chatzakou and Athena Vakali. 2015. Harvesting opinions and emotions from social media textual resources. IEEE Internet Computing.

Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321-357.

Long Chen, Hanjia Lyu, Tongyu Yang, Yu Wang, and Jiebo Luo. 2020. In the eyes of the beholder: Sentiment and topic analyses on social media use of neutral and controversial terms for COVID-19. arXiv preprint arXiv:2004.10225.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Aysu Ezen-Can. 2020. A comparison of LSTM and BERT for small corpus. arXiv preprint arXiv:2009.05451.

Qianjia Huang, Vivek Kumar Singh, and Pradeep Kumar Atrey. 2014. Cyber bullying detection using social and textual analysis. In Proceedings of the 3rd International Workshop on Socially-Aware Multimedia.

Michal Lukasik, P. K. Srijith, Duy Vu, Kalina Bontcheva, Arkaitz Zubiaga, and Trevor Cohn. 2016. Hawkes processes for continuous time sequence classification: An application to rumour stance classification in Twitter. In Proceedings of ACL.

K. Mouthami, K. Nirmala Devi, and V. Murali Bhaskaran. 2013. Sentiment analysis and classification based on textual reviews. In Proceedings of the International Conference on Information Communication and Embedded Systems (ICICES).

Andranik Tumasjan, Timm O. Sprenger, Philipp G. Sandner, and Isabell M. Welpe. 2010. Predicting elections with Twitter: What 140 characters reveal about political sentiment. In Proceedings of ICWSM.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems.

Bei Yu, Stefan Kaufmann, and Daniel Diermeier. 2008. Exploring the characteristics of opinion expressions for political opinion classification. In Proceedings of the International Conference on Digital Government Research.

Dong Zhang, Shoushan Li, Hongling Wang, and Guodong Zhou. 2016. User classification with multiple textual perspectives. In Proceedings of COLING.

Xupin Zhang, Hanjia Lyu, and Jiebo Luo. 2020. What contributes to a crowdfunding campaign's success? Evidence and analyses from GoFundMe data.