key: cord-0958413-rx9cux8i
authors: Sarker, Abeed; Lakamana, Sahithi; Hogg-Bremer, Whitney; Xie, Angel; Al-Garadi, Mohammed Ali; Yang, Yuan-Chi
title: Self-reported COVID-19 symptoms on Twitter: an analysis and a research resource
date: 2020-07-04
journal: J Am Med Inform Assoc
DOI: 10.1093/jamia/ocaa116
sha: 254915afdc312a7b0b7ef122a1576fd802219b30
doc_id: 958413
cord_uid: rx9cux8i

OBJECTIVE: To mine Twitter and quantitatively analyze COVID-19 symptoms self-reported by users, compare symptom distributions across studies, and create a symptom lexicon for future research.

MATERIALS AND METHODS: We retrieved tweets using COVID-19-related keywords, and performed semiautomatic filtering to curate self-reports of positive-tested users. We extracted COVID-19-related symptoms mentioned by the users, mapped them to standard concept IDs in the Unified Medical Language System, and compared the distributions to those reported in early studies from clinical settings.

RESULTS: We identified 203 positive-tested users who reported 1002 symptoms using 668 unique expressions. The most frequently reported symptoms were fever/pyrexia (66.1%), cough (57.9%), body ache/pain (42.7%), fatigue (42.1%), headache (37.4%), and dyspnea (36.3%) amongst users who reported at least 1 symptom. Mild symptoms, such as anosmia (28.7%) and ageusia (28.1%), were frequently reported on Twitter, but not in clinical studies.

CONCLUSION: The spectrum of COVID-19 symptoms identified from Twitter may complement those identified in clinical settings.

The outbreak of the coronavirus disease 2019 (COVID-19) is 1 of the worst pandemics in known world history.1,2 As of May 8, 2020, over 4 million confirmed positive cases have been reported globally, causing over 275 000 deaths.3 As the pandemic continues to ravage the world, numerous research studies are being conducted, with focuses ranging from trialing possible vaccines and predicting the trajectory of the outbreak to investigating the characteristics of the virus by studying infected patients. Early studies focusing on identifying the symptoms experienced by those infected by the virus mostly included patients who were hospitalized or received clinical care.4-6 Many infected people only experience mild symptoms or are asymptomatic and do not seek clinical care, although the specific proportion of asymptomatic carriers is unknown.7-9 To better understand the full spectrum of symptoms experienced by infected people, there is a need to look beyond hospital- or clinic-focused studies.

With this in mind, we explored the possibility of using social media, namely Twitter, to study symptoms self-reported by users who tested positive for COVID-19. Our primary goals were to (i) verify that users report their experiences with COVID-19 (including their positive test results and symptoms experienced) on Twitter, and (ii) compare the distribution of self-reported symptoms with those reported in studies conducted in clinical settings. Our secondary objectives were to (i) create a COVID-19 symptom corpus that captures the multitude of ways in which users express symptoms so that natural language processing (NLP) systems may be developed for automated symptom detection, and (ii) collect a cohort of COVID-19-positive Twitter users whose longitudinal self-reported information may be studied in the future. To the best of our knowledge, this is the first study that focuses on extracting COVID-19 symptoms from public social media.
We have made the symptom corpus public with this article to assist the research community, and it will be part of a larger, maintained data resource: a social media COVID-19 Data Bundle (https://sarkerlab.org/covid_sm_data_bundle/).

We collected tweets, including texts and metadata, from Twitter via its public streaming application programming interface. First, we used a set of keywords/phrases related to the coronavirus to detect tweets through the interface: covid, covid19, covid-19, coronavirus, and corona AND virus, including their hashtag equivalents (eg, #covid19). Due to the high global interest in this topic, these keywords retrieved very large numbers of tweets. Therefore, we applied a first level of filtering to keep only tweets that also mentioned at least 1 of the following terms: positive, negative, test, and tested, along with at least 1 of the personal pronouns I, my, us, we, and me; only these tweets were stored in our database. To discover users who self-reported positive COVID-19 tests with high precision, we applied another layer of filtering using regular expressions. We used the expressions "i.*test[ed] positive," "we.*test[ed] positive," "test.*came back positive," "my.*[covid|coronavirus|covid19].*symptoms," and "[covid|coronavirus|covid19].*[test|tested].*us." We also collected tweets from a publicly available Twitter dataset that contained IDs of over 100 million COVID-19-related tweets10 and applied the same layers of filters. Three authors manually reviewed the tweets and profiles to identify true self-reports, while discarding the clear false positives (eg, "... I dreamt that I tested positive for covid ..."). We further removed users from our COVID-19-positive set if their self-reports were deemed to be fake or were duplicates of posts from other users, or if they stated that their tests had come back negative despite their initial beliefs about contracting the virus. These multiple layers of filtering gave us a manageable set of potential COVID-19-positive users (a few hundred) whose tweets we could analyze semiautomatically. The filtering decisions were made iteratively by collecting sample data for hours and days and then updating the collection strategy based on analyses of the collected data.

For all the COVID-19-positive users identified, we collected all their past posts dating back to February 1, 2020. We excluded non-English tweets and those posted earlier than that date. We assumed that symptoms posted prior to February 1 were unlikely to be related to COVID-19, particularly because our data collection started in late February, and most of the positive test announcements we detected were from late March to early April.
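As an illustration of the keyword and regular-expression filtering layers described above, the following is a minimal Python sketch of one possible implementation; it is not the authors' actual pipeline, the example tweets are invented, and the pattern "tested?" is used in place of the printed "test[ed]" so that both "test positive" and "tested positive" match.

```python
import re

# Keyword layer (per the first-level filter described above):
# the tweet must contain a test-related term and a personal pronoun.
TEST_TERMS = {"positive", "negative", "test", "tested"}
PRONOUNS = {"i", "my", "us", "we", "me"}

# Regular-expression layer approximating the self-report patterns listed above.
SELF_REPORT_PATTERNS = [
    re.compile(r"i.*tested? positive"),
    re.compile(r"we.*tested? positive"),
    re.compile(r"test.*came back positive"),
    re.compile(r"my.*(covid|coronavirus|covid19).*symptoms"),
    re.compile(r"(covid|coronavirus|covid19).*(test|tested).*us"),
]


def passes_keyword_filter(text: str) -> bool:
    """First-level filter: requires a test-related term and a personal pronoun."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return bool(tokens & TEST_TERMS) and bool(tokens & PRONOUNS)


def looks_like_self_report(text: str) -> bool:
    """Second-level filter: regular expressions for self-reported positive tests."""
    lowered = text.lower()
    return any(p.search(lowered) for p in SELF_REPORT_PATTERNS)


if __name__ == "__main__":
    examples = [  # invented examples for illustration
        "So my covid test came back positive, day 3 of fever",
        "New testing sites open across the city today",
    ]
    for tweet in examples:
        keep = passes_keyword_filter(tweet) and looks_like_self_report(tweet)
        print(keep, "|", tweet)
```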
Since we were interested only in identifying patient-reported symptoms in this study, we attempted to shortlist tweets that were likely to mention symptoms. To perform this, we first created a meta-lexicon by combining MedDRA,11 the Consumer Health Vocabulary (CHV),12 and SIDER.13 Lexicon-based approaches are known to have low recall, particularly for social media data, since social media expressions are often nonstandard and contain misspellings.14,15 Therefore, instead of searching the tweets for exact expressions from the lexicon, we performed inexact matching using a string similarity metric. Specifically, for every symptom in the lexicon, we searched windows of term sequences in each tweet, computed their similarities with the symptom, and extracted sequences whose similarity values were above a prespecified threshold. We used the Levenshtein ratio as the similarity metric, computed as 1 - (Lev. dist. / max(length)), where Lev. dist. represents the Levenshtein distance between the 2 strings and max(length) represents the length of the longer string. Our intent was to attain high recall, so that we were unlikely to miss possible expressions of symptoms while filtering out many tweets that were completely off topic. We set the threshold via trial and error over sample tweets, and because of the focus on high recall, this approach still retrieved many false positives (eg, tweets mentioning body parts but not in the context of an illness or a symptom). After running this inexact matching approach on approximately 50 user profiles, we manually extracted the true positive expressions (ie, those that expressed symptoms in the context of COVID-19) and added them to the meta-lexicon.

Following these multiple filtering methods, we manually reviewed all the posts from all the users, identified each true symptom expressed, and removed the false positives. We semiautomatically mapped the expressions to standardized concept IDs in the Unified Medical Language System using the meta-lexicon we developed and the National Center for Biomedical Ontology BioPortal.16 In the absence of exact matches, we searched the BioPortal to find the most appropriate mappings. Using Twitter's web interface, we manually reviewed all the profiles, paying particularly close attention to those with fewer than 5 potential symptom-containing tweets, to identify possible false negatives left by the similarity-based matching. All annotations and mappings were reviewed, and the reviewers' questions were discussed at meetings. In general, we found that it was easy for annotators to detect expressions of symptoms, even when the expressions were nonstandard (eg, "pounding in my head" = Headache). Each detected symptom was reviewed by at least 2 authors, and the first author of the article reviewed all the annotations.

Once the annotations were completed, we computed the frequencies of the patient-reported symptoms on Twitter and compared them with those reported in several other recent studies that used data from other sources. We also identified users who reported that they had tested positive and also specifically stated that they showed "no symptoms." We excluded nonspecific statements about symptoms, such as "feeling sick" and "signs of pneumonia." When computing the frequencies and percentages of symptoms, we used 2 approaches: (i) computing raw frequencies over all the detected users, and (ii) computing frequencies over only those users who reported at least 1 symptom or explicitly stated that they had no symptoms. We believe the frequency distribution for (ii) is more reliable because, for users who reported no specific symptoms, we could not verify whether they had actually not experienced any symptoms (ie, were asymptomatic) or simply did not share any symptoms on Twitter.
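To make the similarity-based shortlisting described earlier in the Methods concrete, the sketch below shows one way to compute the Levenshtein ratio and slide it over windows of tweet tokens. The threshold value, window sizes, lexicon entries, and example tweet are illustrative assumptions, not the values or data used in the study.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def levenshtein_ratio(a: str, b: str) -> float:
    """Similarity = 1 - Lev. dist. / max(length), as defined in the Methods."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / max(len(a), len(b))


def match_symptoms(tweet: str, lexicon, threshold: float = 0.8, max_window: int = 4):
    """Slide windows of 1..max_window tokens over the tweet and keep spans whose
    similarity to any lexicon entry meets the (assumed) threshold."""
    tokens = tweet.lower().split()
    matches = []
    for size in range(1, max_window + 1):
        for start in range(len(tokens) - size + 1):
            span = " ".join(tokens[start:start + size])
            for symptom in lexicon:
                score = levenshtein_ratio(span, symptom.lower())
                if score >= threshold:
                    matches.append((span, symptom, round(score, 2)))
    return matches


if __name__ == "__main__":
    # Hypothetical lexicon entries and tweet; the real meta-lexicon combined
    # MedDRA, CHV, and SIDER, and the real threshold was set by trial and error.
    lexicon = ["fever", "shortness of breath", "loss of smell"]
    tweet = "day 4: still have a feverr and some shortness of breathe"
    print(match_symptoms(tweet, lexicon))
```

This high-recall setup deliberately tolerates misspellings such as "feverr" and "breathe"; as described above, the resulting false positives were removed during manual review.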
Our initial keyword-based data collection and filtering from the different sources retrieved millions of tweets, excluding retweets. We found many duplicate tweets, which were mostly reposts (not retweets) of tweets posted by celebrities. Removing duplicates left us with 305 users (499 601 tweets). Of these, 102 users were labeled as "negatives": users who stated that their tests had come back negative, removed their original COVID-19-positive self-reports, or posted fake information about testing positive (eg, we found some users claiming they tested positive as an April Fools' joke). This left us with 203 COVID-19-positive users with 68 318 tweets posted since February 1. The similarity-based symptom detection approach reduced the number of unique tweets to review to 7945.

The 203 users expressed 1002 total symptoms (mean: 4.94; median: 4) using 668 unique expressions, which we grouped into 46 categories, including a "No Symptoms" category (Table 1). A total of 171 users (84.2%) expressed at least 1 symptom or stated that they were asymptomatic. The remaining 32 users (15.8%) did not mention any symptoms or only expressed generic symptoms, which we did not include in the counts (we provide these expressions in the lexicon accompanying this paper). Ten users explicitly mentioned that they experienced no symptoms. As Table 1 shows, fever/pyrexia was the most commonly reported symptom, followed by cough, body ache & pain, headache, fatigue, dyspnea, chills, anosmia, ageusia, throat pain, and chest pain, each mentioned by over 20% of the users who reported at least 1 symptom. Figure 1 illustrates the first detected report of each symptom from the cohort members on a timeline, and Figure 2 shows the distribution of the number of symptoms reported by the cohort.

Table 2 compares the symptom percentages reported by our Twitter cohort with those from several early studies conducted in clinical settings (ie, patients who were either hospitalized or visited hospitals/clinics for treatment). The top symptoms remained fairly consistent across the studies: fever/pyrexia, cough, dyspnea, headache, body ache, and fatigue. The percentage of fever (66%), though the highest in our dataset, is lower than in all the studies conducted in clinical settings. In our study, we distinguished, where possible, between myalgia and arthralgia, and combined pain (any pain other than those explicitly specified) and body ache. Combining all of these into 1 category, as some studies had done, would result in a higher proportion. We found considerable numbers of reports of anosmia (29%) and ageusia (28%), with approximately one-fourth of our cohort reporting these symptoms. Reports of these symptoms, however, were missing from the referenced studies conducted in clinical settings.

Our study revealed that there were many self-reports of COVID-19-positive tests on Twitter, although such reports are buried in large amounts of noise. We observed a common trend among Twitter users of describing their day-to-day disease progression since the onset of symptoms. This trend perhaps became popular as celebrities started describing their symptoms on Twitter. We saw many reports from users who reported having tested positive but initially showed no symptoms, and some who expressed anosmia and/or ageusia (first reported on March 5) as the only symptoms; these symptoms were undocumented in the comparison studies. Some studies suggest that anosmia and ageusia may be the only symptoms of COVID-19 among otherwise asymptomatic patients.20-22 The most likely explanation for the differences between the symptoms reported on Twitter and those in the clinical studies is that the former were reported mostly by users who had milder infections, while people who visited hospitals often went there to receive treatment for serious symptoms.
Also, the median ages of the patients in the clinical studies tended to be much higher than the median age of Twitter users (in the US, the median Twitter user age is 40).23 In contrast to the clinical studies, some users in our cohort expressed mental health-related consequences (eg, stress/anxiety) of testing positive. In many cases, it was difficult to ascertain whether the mental health issues were directly related to COVID-19 or whether the users had prior histories of such conditions.

To the best of our knowledge, this is the first study to have utilized Twitter to curate symptoms posted by COVID-19-positive users. In the interest of community-driven research, we have made the symptom lexicon available with this publication. The cohort of users detected over social media will enable us to conduct targeted studies in the future, including studies of relatively unexplored topics such as the mental health impacts of the pandemic and the long-term health-related consequences for those infected by the virus.

The work reported in this article was supported by funding from Emory University, School of Medicine. Funding for computational

(Table 2 column headings and footnotes: the comparison studies were Huang et al,6 Chen et al,5 Wang et al,17 Chen et al,18 and Guan et al. Percentages are for users who expressed at least 1 symptom or expressed that they did not have any symptoms. One study provided a combined number for myalgia and fatigue; headache and dizziness were combined for another study; one reported number is for myalgia/muscle ache and/or arthralgia. In our study, we separated myalgia, arthralgia, body ache, and pain.)

1. WHO Director-General's opening remarks at the media briefing on COVID-19 - 11 March 2020.
2. Effects of the COVID-19 Pandemic on the World Population: Lessons to Adopt from Past Years Global Pandemics.
3. COVID-19 Map - Johns Hopkins Coronavirus Resource Center.
4. Clinical characteristics of coronavirus disease 2019 in China.
5. Clinical progression of patients with COVID-19 in Shanghai.
6. Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China.
7. Presumed asymptomatic carrier transmission of COVID-19.
8. A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission: a study of a family cluster.
9. Early transmission dynamics in Wuhan, China, of novel coronavirus-infected pneumonia.
10. Tracking Social Media Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus Twitter Data Set.
11. MedDRA: an overview of the medical dictionary for regulatory activities.
12. Exploring and developing consumer health vocabularies.
13. The SIDER database of drugs and side effects.
14. Semi-supervised approach to monitoring clinical depressive symptoms in social media.
15. Combining Lexicon-based and Learning-based Methods for Twitter Sentiment Analysis.
16. Welcome to the NCBO BioPortal | NCBO BioPortal.
17. Clinical characteristics of 138 hospitalized patients with 2019 novel coronavirus-infected pneumonia in Wuhan, China.
18. Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China: a descriptive study.
19. Report of the WHO-China Joint Mission on Coronavirus Disease 2019 (COVID-19).
20. Isolated sudden onset anosmia in COVID-19 infection. A novel syndrome?
21. Loss of smell or taste as the only symptom of COVID-19.
22. Anosmia and Dysgeusia in the Absence of Other Respiratory Diseases: Should COVID-19 Infection Be Considered?
23. Sizing Up Twitter Users | Pew Research Center.

Conflict of interest statement: none declared.

AS designed the study and the data collection/filtering strategies.
All authors contributed to the analyses, annotation process, and the writing of the manuscript. The authors would like to acknowledge the feedback provided by collaborators from Emory University and the Georgia Department of Public Health (GDPH) through the Emory-GDPH partnership for COVID-19.