title: Pandemic Pulse: Unraveling and Modeling Social Signals during the COVID-19 Pandemic
authors: Krieg, Steven J.; Schnur, Jennifer J.; Marshall, Jermaine D.; Schoenbauer, Matthew M.; Chawla, Nitesh V.
date: 2020-06-10

We present and begin to explore a collection of social data that represents part of the COVID-19 pandemic's effects on the United States. The data is collected from a range of sources and includes longitudinal trends of news topics, social distancing behaviors, community mobility changes, web searches, and more. This multimodal effort enables new opportunities for analyzing the impacts such a pandemic has on the pulse of society. Our preliminary results show that the number of COVID-19-related news articles published peaked immediately after the World Health Organization declared the pandemic on March 11, and has steadily decreased since then, regardless of changes in the number of cases or public policies. Additionally, we found that politically moderate and scientifically grounded sources have, relative to baselines measured before the beginning of the pandemic, published a lower proportion of COVID-19 news than more politically extreme sources. We suggest that further analysis of these multimodal signals could produce meaningful social insights, and we present an interactive dashboard to aid further exploration.

The COVID-19 pandemic has disrupted the rhythms of society in unprecedented ways and at an unparalleled scale. In this work, we present and begin to explore a collection of social signals that represent part of the social pulse of the United States. These signals include COVID-19 case data, demographic data, longitudinal news and web search trends, media bias data, and mobility reports.
As a doctor studies a patient's vitals to aid in identifying a diagnosis and prescribing treatment, we aim to unravel and model these signals to inform our understanding of the broad effects of the COVID-19 pandemic on the spread of information, social behaviors, and more. To aid in further exploration, we published an interactive dashboard alongside this paper. The rest of the paper proceeds as follows: in Section 2 we describe data collection and preprocessing, in Section 3 we present the results of a preliminary analysis of news signals, and in Section 4 we discuss opportunities for future work.

We collected COVID-19 case data from Johns Hopkins University [11], news data from the Global Database of Events, Language, and Tone (GDELT) [13], web search data from Google Trends, media bias labels from Media Bias/Fact Check [8] and AllSides [5], social distancing data from Unacast [15], and demographic data from the Centers for Disease Control and Prevention [1, 2, 3, 4, 9, 10]. In the following sections, we detail our methods for collection and analysis.

Johns Hopkins University (JHU) has created a repository for COVID-19 case data that combines information from the World Health Organization (WHO) and a number of other global and national sources [11]. We use this data from JHU to report the number of new cases and new deaths by location and date.

In order to represent demographic information as well as risk factors for individual states, we collected data from various sources including the Centers for Disease Control and Prevention, the United States Census Bureau, and the Bureau of Labor Statistics. This data enables us to explore correlations between the demographic characteristics of locations and other data, for example by searching for relationships among locations with higher rates of COVID-19 deaths. The demographic data we collected includes heart disease hospitalization rate, cancer rate, population age, hypertension and stroke rates, obesity, walk scores, eating habits (i.e., vegetable intake), ethnicity, and smoking habits. After collecting all variables for each state, we performed standard preprocessing and cleaning steps: noise removal, aggregation, and conversion to percentages.

COVID-19 Articles
The Global Database of Events, Language, and Tone (GDELT) monitors worldwide print, broadcast, and online news in over 100 languages [13]. For each article published, GDELT adds to its Global Knowledge Graph (GKG) a record that contains a variety of metadata including geographical references, textual themes, and sentiment scores. The GKG processes several terabytes of data every year, making it a rich source of longitudinal news data. We created a corpus of COVID-19 news by extracting from the GKG any record that met at least one of the criteria listed in Table 1. We also removed duplicate articles, which we defined as those with a non-unique combination of publisher and title.

Table 1. Criteria used to determine whether an article from the GKG should be included in the COVID-19 corpus. The * character represents the prefix "TAX DISEASE".

Media Bias Data
We used two independent sources for labeling the political bias of news sources: Media Bias/Fact Check (MBFC) and AllSides. MBFC is an independent online media outlet that evaluates news sources on their political bias and the factuality of their publications [8]. AllSides [5] takes on a similar task, but incorporates surveys, reviews, and additional data into its evaluation process. Both have been used in recent work on media bias detection [14, 17, 12]. Table 2 lists the possible ratings given by each organization. We use MBFC as our primary source and AllSides as a supplementary one. We prefer MBFC for the following reasons:
1. MBFC's evaluation methodology is explained in more detail, and is thus more transparent.
2. MBFC includes a "Scientific" category, which we found to be a helpful addition. Most of MBFC's Scientific sources were labeled "Least Biased" by AllSides.
3. MBFC includes a "Questionable Sources" category. While this is composed largely of extreme right sources, it also contains many extreme left sources. We found it helpful to separate these extremes from regular right- and left-leaning sources.

Unacast Social Distancing Data
Unacast provides social distancing scores for U.S. states and counties based on cell phone GPS data [15]. From this data set, we retrieved the Daily Distance Reduction score for all states since February 24. This feature measures the change between the average distance traveled per device for each day and the average distance traveled on the same weekday during the four weeks prior to the COVID-19 outbreak in the U.S. (February 10-March 8). Based on this percent change, each state is given a letter grade on each day according to the following rules [16]:
- A: > 70% decrease
- B: 55-70% decrease
- C: 40-55% decrease
- D: 25-40% decrease
- F: < 25% decrease or increase

Google Community Mobility Reports
The publicly available global Google Mobility Report [6] describes longitudinal changes in population movement trends over the course of the COVID-19 outbreak. These movement trends are divided into categories for retail and recreation, groceries and pharmacies, parks, transit stations, workplaces, and residential. For each category, the report provides the percent change in visitation or time spent in places of that category relative to a baseline, which is computed as the median value for each weekday from the 5-week period January 3, 2020-February 6, 2020. The data is aggregated from anonymized users who have opted in to sharing their location history in Google Maps.

Using our collection of COVID-19-related news, we first extracted a set of keywords by tokenizing and lemmatizing the title of each news article. Next, we retrieved the 1000 most frequently mentioned terms, the first 10 of which are reported in Table 3.
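The keyword extraction step described above can be illustrated with a minimal Python sketch, which should not be read as the exact pipeline behind Table 3. The tokenizer pattern, the stop-word list, the `LEMMA_OVERRIDES` table (a small stand-in for a real lemmatizer), the function names, and the example titles are all assumptions introduced here for illustration, since the text does not specify the tools used.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the text does not say whether or how
# stop words were removed.
STOP_WORDS = {"a", "an", "and", "as", "in", "of", "on", "the", "to"}

# Hypothetical stand-in for a lemmatizer: maps a few inflected forms
# to a base form. A real lemmatizer would replace this table.
LEMMA_OVERRIDES = {"cases": "case", "deaths": "death", "rises": "rise"}

# Lowercase alphanumeric tokens, keeping internal hyphens ("covid-19").
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def tokenize(title):
    """Lowercase a title and split it into tokens."""
    return TOKEN_PATTERN.findall(title.lower())


def lemmatize(token):
    """Return the base form from the override table, or the token unchanged."""
    return LEMMA_OVERRIDES.get(token, token)


def top_keywords(titles, k=1000):
    """Count lemmatized, non-stop-word tokens and return the k most frequent."""
    counts = Counter()
    for title in titles:
        for token in tokenize(title):
            if token not in STOP_WORDS:
                counts[lemmatize(token)] += 1
    return counts.most_common(k)


# Made-up titles for illustration only; they are not drawn from the corpus.
example_titles = [
    "Coronavirus cases rise in Italy",
    "New coronavirus case confirmed",
    "Deaths rise as cases surge",
]
print(top_keywords(example_titles, k=3))
```

With a real corpus, `top_keywords(titles, k=1000)` would return the 1000 most frequent terms as `(term, count)` pairs; terms with equal counts are returned in the order they were first seen.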
We then scraped Google Trends [7] for the longitudinal "Interest over Time" of each keyword from January 1 to May 31, 2020, in each U.S. state. For each keyword, Trends measures web search popularity by taking an anonymized sample of Google searches and dividing the count of searches containing the given keyword by the total number of searches associated with a particular location and time range. This value is normalized between 0 and 100 in order to represent search interest relative to the given state and time, where 100 represents peak popularity for the term and 0 represents a lack of available data for the given term.

Through May 31, 2020, we extracted data on over 7.6 million news articles related to the COVID-19 pandemic. Figure 1 shows the daily and weekly article counts from Jan. 1 through May 31, 2020. The daily oscillation reflects a consistent pattern in which fewer articles are published on Saturdays and Sundays. Weekly coverage increased at the end of January, around when the first case was confirmed in the United States (Jan. 20) and the Chinese authorities quarantined the city of Wuhan (Jan. 23). A local peak of 18,636 articles was published on Jan. 31, the day after the World Health Organization (WHO) declared a public health emergency. Average weekly coverage then slowly declined until the last week of February, when cases surged in Italy and Iran. At that point news coverage surged through the first reported death in the United States (Feb. 29) and the WHO's declaration of a global pandemic (Mar. 11) to a global peak of 123,623 articles (Mar. 18). Since then, coverage has decreased steadily, even as new cases reached their global peak of 36,163 (Apr. 24). Even after the number of new cases began to decrease, news coverage continued to decrease at a faster rate. This suggests that, on a broad scale, news sources were most interested in reporting the novel events surrounding the beginning of the pandemic.
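The daily and weekly views of article counts discussed above can be sketched as follows. This is a hedged illustration rather than the code behind Figure 1: grouping by ISO week is an assumption (the text does not define how weeks were formed), the function names are hypothetical, and the counts are synthetic placeholders, not values from the corpus.

```python
from collections import defaultdict
from datetime import date, timedelta


def weekly_totals(daily_counts):
    """Sum daily counts into ISO weeks, keyed by (ISO year, ISO week number)."""
    totals = defaultdict(int)
    for day, count in daily_counts.items():
        iso_year, iso_week, _ = day.isocalendar()
        totals[(iso_year, iso_week)] += count
    return dict(totals)


def weekend_to_weekday_ratio(daily_counts):
    """Mean count on Saturdays and Sundays divided by mean weekday count.

    Assumes the input contains at least one weekend day and one weekday.
    """
    weekend = [c for d, c in daily_counts.items() if d.weekday() >= 5]
    weekday = [c for d, c in daily_counts.items() if d.weekday() < 5]
    return (sum(weekend) / len(weekend)) / (sum(weekday) / len(weekday))


def peak_day(daily_counts):
    """Return the date with the largest daily count."""
    return max(daily_counts, key=daily_counts.get)


# Synthetic counts for two weeks starting Monday 2020-03-09:
# 100 articles on weekdays, 40 on weekends, plus one artificial spike.
start = date(2020, 3, 9)
synthetic = {}
for offset in range(14):
    day = start + timedelta(days=offset)
    synthetic[day] = 40 if day.weekday() >= 5 else 100
synthetic[date(2020, 3, 18)] = 150

print(weekly_totals(synthetic))
print(weekend_to_weekday_ratio(synthetic))
print(peak_day(synthetic))
```

A ratio well below 1 from `weekend_to_weekday_ratio` corresponds to the weekend dip seen in the daily series, while the weekly totals smooth that oscillation out.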
Of the 7.6 million articles extracted from the GKG, just under 2 million were published by sources evaluated for bias by MBFC or AllSides. Figure 2 shows the daily count of articles published by each bias category, each of which follows a similar trend to the total article count. This is corroborated by Pearson tests performed with respect to the normalized distribution of articles from all sources (Figure 1), which report correlation coefficients ≥ 0.99 for each bias category except "Scientific" and "Conspiracy-pseudoscience," which report coefficients of 0.91 and 0.92, respectively. The lower correlation for these two bias categories may be attributable to noise: as Figure 3 shows, both Scientific and Conspiracy-pseudoscience represent only a small percentage of the collection of COVID-19-related articles. However, we found that the representation of articles published by Scientific sources, measured as a percentage of total published news, is significantly lower (0.68x) for COVID-19-related news than in a baseline of all 2019 articles, in which Scientific sources accounted for 1.5% of the records. In contrast, some bias categories have increased representation in COVID-19-related news: Right sources increased their representation by 1.15x, Right-center sources by 1.07x, and Left sources by 1.04x. This could be because sources with stronger political biases are publishing more news than their baseline, because more moderate and scientific sources are publishing less, or because of a combination of the two.

By aggregating multimodal data from many sources that represent a variety of social signals in the United States, we have begun to explore the effects of the COVID-19 pandemic on the pulse of U.S. society.
Our current data includes COVID-19 case data, demographic data, longitudinal news and web search trends, media bias data, and mobility reports, but there are many other types of social signals that could be studied in order to better understand and model the effects of the pandemic. These could include social media trends, economic patterns, and additional healthcare data. In beginning to explore this data, we analyzed the quantity of news coverage and showed that the amount of COVID-19-related news peaked just after the announcement of the pandemic, after which it steadily decreased. We additionally explored media bias and demonstrated that, with respect to quantity, all groups of political biases published news in a similar pattern, and that more scientific sources have significantly less representation in COVID-19-related news when compared to their pre-pandemic baseline. There are many opportunities to examine other relationships between signals, such as the influence of news on social distancing and web searches, correlations between web searches and news topics, and differences in these effects between locations and demographics. We additionally hope to extend this data and work beyond the United States to understand the effects of the COVID-19 pandemic around the world.

Data trends and maps
Population distribution by age
Media Bias/Fact Check: The most comprehensive media bias resource
Automated identification of media bias in news articles: an interdisciplinary literature review
GDELT: Global data on events, location, and tone
Media bias monitor: Quantifying biases of social media news outlets at large-scale
Experiments in detecting persuasion techniques in the news