key: cord-0680973-re1k30cd authors: Wang, Guannan; Gu, Zhiling; Li, Xinyi; Yu, Shan; Kim, Myungjin; Wang, Yueying; Gao, Lei; Wang, Li title: Comparing and Integrating US COVID-19 Daily Data from Multiple Sources: A County-Level Dataset with Local Characteristics date: 2020-06-02 journal: nan DOI: nan sha: 3b4e9b54d81b59554064026f339981cfb44eb25f doc_id: 680973 cord_uid: re1k30cd
Over the past several months, the outbreak of COVID-19 has been expanding across the world. A reliable and accurate dataset of the cases is vital for scientists to conduct related research and for policy-makers to make better decisions. We collect the COVID-19 daily reported data from four open sources: the New York Times, the COVID-19 Data Repository by Johns Hopkins University, the COVID Tracking Project at the Atlantic, and USAFacts, and compare the similarities and differences among them. In addition, we examine the following problems, which occur frequently: (1) order dependency violations, (2) delayed reporting on weekends and/or holidays, and (3) abnormal data points or data periods. We also integrate the COVID-19 reported cases with county-level auxiliary information on local features from official sources, such as health infrastructure, demographic, socioeconomic, and environmental information, which are important for understanding the spread of the virus.
four major sources, including (1) the New York Times (NYT, 2020a), (2) the COVID Tracking Project at the Atlantic (Atlantic) (Atlantic, 2020), (3) the data repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU) (CSSE, 2020), and (4) USAFacts (USAFacts, 2020). Although these sources usually obtain their data on confirmed infections and deaths from government agencies, the counts still vary because of differences in collection time and several other issues. These differences can be critical for real-time analysis.
In this work, we first collect and compare the COVID-19 daily reported data from the above four open sources. The COVID-19 data pose unique data quality challenges because of their spatiotemporal nature and the problems of delayed reporting and under-reporting. We explore various methods to detect and repair the problematic data. To be more specific, the data cleaning procedure is divided into two categories: (1) manual cleaning, and (2) automatic cleaning. Manual cleaning has very high accuracy, but it is challenging to implement because of its high cost in time and effort.
Furthermore, it has been observed that local characteristics, such as socioeconomic inequity, may also contribute to the spread of the epidemic (Ahmed et al., 2020; Silver, 2020). For example, intrinsic local community characteristics, such as demographics, endemic infections, and environmental conditions, might influence and shape the spread of COVID-19. The availability of census data thus leads us to model the infections, deaths, and recoveries while accounting for the epidemic data, control measures, and local information. To facilitate research on identifying the significant factors that affect the disease spread pattern and on predicting future infections, we also collect and combine local auxiliary information at the county level in the U.S. from reliable sources.
To help users better visualize the epidemic data, we developed multiple R Shiny apps embedded in a COVID-19 dashboard launched on March 27, 2020. Currently, we provide both infection and death maps and time series for the U.S. Moreover, we provide a short-term (7-day) forecast (Wang et al., 2020a) (updated daily) and a long-term (2-month) projection (Wang et al., 2020b) (updated weekly) of the COVID-19 infection and death counts at both the county level and the state level.
For public usage, a GitHub repository (https://github.com/covid19-dashboard-us/cdcar) is established to provide daily updated and cleaned data. An R package, cdcar, is also created for abnormal data detection and repair.
Thanks to the contributions of data science communities across the world, multiple sources provide COVID-19 data with different precision and focus. In our article, we consider the reported cases from the following four sources: the NYT (NYT, 2020a), the Atlantic (Atlantic, 2020), the JHU (CSSE, 2020), and USAFacts (USAFacts, 2020). The NYT releases daily data at the country, state, and county levels at noon of the following day. The Atlantic releases daily state-level data along with testing, hospitalization, and recovery information, updated every afternoon. The repository by the CSSE at JHU provides both state-level and county-level data daily, updated every night. USAFacts updates county-level data daily in the early morning of the following day. Table 1 summarizes the differences among the four sources based on how the data are collected and compiled. For the cleaned data in the proposed repository, we first fetch data from the above four sources and compile them into the same format for further comparison and cross-validation. Then, we detect the anomalies in the data sources and choose the one with the fewest abnormalities to repair.
Islands, Virgin Islands.
***: Whether the dataset has unallocated/unassigned information, which is useful for matching state-level and county-level data.
#: How the dataset assigns cases to a place. p indicates the place of infection/fatality; r indicates the place of residence; r, p indicates that both standards exist in the dataset; unknown indicates that the information was not found.
##: Whether the dataset includes both confirmed and probable cases when probable data are available. y means yes.
NYT releases daily live data for probable and confirmed cases separately, but historical data is unavailable.
In addition, there are two other issues that currently have no perfect solution but require attention from users who draw conclusions from COVID-19 data.
infected at the moment and suggested to be quarantined to avoid infecting others. Meanwhile, people who test positive for antibodies must have been exposed to the virus, but the test gives no indication of whether they are still infectious or have recovered (The U.S. Department of Health and Human Services, 2020). In addition, antibody tests are known to be much less accurate. Mixing these two types of tests makes the positive case counts uninterpretable. Some states and counties have started to separate antibody tests from virus tests (Madrigal and Meyer, 2020), while states such as Pennsylvania, Texas, Georgia, and Vermont did not specify the type of tests.
In this subsection, we discuss a measure to assess the dissimilarity of the time series of reported cases collected from different sources. Let $K$ be the number of sources available for comparison: for the county-level comparison, $K = 3$ since the Atlantic does not have county-level data, while for the state level, $K = 4$. Let $T$ be the number of days observed, that is, the length of each time series. Let $n$ be the number of counties or states. For source $k$, let $Y_{it}^{(k)}$ be the cumulative number of reported cases at location $i$ on day $t$, where $i = 1, \ldots, n$ and $t = 1, \ldots, T$. In the following, we define a dissimilarity measure to assess the difference between two time series, $Y_i^{(k)}$ and $Y_i^{(k')}$:
$$d\left(Y_i^{(k)}, Y_i^{(k')}\right) = \frac{1}{T} \cdot \frac{\left\| Y_i^{(k)} - Y_i^{(k')} \right\|_2}{Y_{iT}},$$
where $Y_i^{(k)} = (Y_{i1}^{(k)}, \ldots, Y_{iT}^{(k)})$ and $Y_{iT}$ is used as the denominator to mitigate the variability of the current observed counts.
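As an illustration only, the measure described here can be sketched in Python under one plausible reading: the L2-norm of the difference between two cumulative series, divided by the number of days and by a final-day count. Which source's final-day count serves as the denominator is not spelled out in the text, so the sketch assumes the larger of the two; the function name `dissimilarity` is hypothetical and is not part of the cdcar package.

```python
import math
from typing import Sequence


def dissimilarity(y_a: Sequence[float], y_b: Sequence[float]) -> float:
    """Scaled L2 distance between two cumulative count series for one location.

    Assumption (not confirmed by the source): the denominator is the larger
    of the two final-day cumulative counts.
    """
    if len(y_a) != len(y_b) or len(y_a) == 0:
        raise ValueError("series must be non-empty and of equal length")

    n_days = len(y_a)
    # L2-norm of the day-by-day difference between the two sources.
    l2 = math.sqrt(sum((a - b) ** 2 for a, b in zip(y_a, y_b)))

    # Final-day count used for scaling (assumed: the larger of the two).
    denom = max(y_a[-1], y_b[-1])
    if denom == 0:
        # No cases reported on the final day by either source.
        return 0.0 if l2 == 0 else math.inf

    return l2 / (n_days * denom)


# Two sources that disagree on a single day out of four.
source_a = [0, 0, 3, 4]
source_b = [0, 0, 0, 4]
print(dissimilarity(source_a, source_b))
```

Scaling by the final-day count keeps locations with large totals from dominating the ranking, so the values are comparable across counties or states of very different sizes.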
By taking the L2-norm and dividing by the number of days observed, we obtain a measure that effectively detects the counties and states with the largest discrepancy between each pair of sources and that is also meaningful for comparisons between different locations. In Fig. 1, we present the county maps for infection and death counts collected from three different data sources. In Figs. 2 and 3, we present the state maps for infection and death counts collected from four different data sources. Areas shown in dark blue in these three figures are those detected as differing between the corresponding pair of sources. Next, we look further into the underlying reasons for the dissimilarity at the county level and the state level. In Tables
Apart from the issues mentioned above in the raw data collection, we observe three types of abnormalities in the data: (1) order dependency violations, (2) delayed reporting on weekends/holidays, and (3) abnormal data points or data periods. Examples of these issues are illustrated in Fig. 4. One might need to clean and repair these issues before doing the analysis.
Order Dependencies Violation. Order dependency (OD) is widely used in relational databases. In this project, we incorporate this concept into the abnormal data detection and data repairing process for cumulative time series. To be more specific, OD for the cumulative time series can be defined as follows: for any two time points, $t_1$ and
have multiple causes, including (1) the release of the results of a large batch of tests; (2) a change of reporting standard, e.g., some states started to report probable cases from a specific date. Sometimes, we may observe a continuous abnormal period, defined as a period in which the rate of increase is significantly different from that of the previous and subsequent periods. Since this type of abnormal data could reflect a change of pattern in the time series, we only provide a warning message once it is detected.
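The order-dependency check on a cumulative series can be sketched as follows. This is a minimal Python illustration, assuming that the OD condition means a cumulative count must never decrease over time; the function name `find_od_violations` is hypothetical and does not describe how the cdcar package implements its detection.

```python
from typing import List, Sequence


def find_od_violations(cumulative: Sequence[float]) -> List[int]:
    """Return the indices t at which a cumulative series drops below day t - 1.

    Checking adjacent days is enough: if no adjacent pair decreases, then no
    pair of days (t1 < t2) decreases either.
    """
    return [
        t
        for t in range(1, len(cumulative))
        if cumulative[t] < cumulative[t - 1]
    ]


# Hypothetical cumulative counts with two drops (at index 3 and index 6).
counts = [0, 2, 5, 4, 6, 6, 3]
print(find_od_violations(counts))  # [3, 6]
```

A flagged index only says that the pair of days is inconsistent; it does not say whether the earlier or the later value is the wrong one, which is why a repair step still has to decide which observation to adjust.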
Note that both the single abnormal point and the continuous abnormal period are essentially change-point problems in time series. We apply the R package segmented (function segmented) (Muggeo, 2019) to detect the change points.
The data curation workflow is illustrated in Fig. 5. Once the raw data are collected, we start with OD violation detection and repair. Next, we check for delayed-reporting issues on weekends/holidays and let the user decide whether to repair them. Last, we check for abnormal data points and data periods. If an abnormal data point/period is detected, we suggest checking it manually first and then deciding on the repair method.
First of all, the infection and death counts can be considered count time series by nature. Therefore, when repairing count time series, we need to take into account that the observations are nonnegative integers, and we should utilize the dependence structure among observations. Furthermore, in the study of infectious disease, the population is usually assigned to compartments such as Susceptible
different compartments are usually considered as an entire system and studied together, for example, the SIR model (Brauer et al., 2008; Lawson et al., 2016; Pfeiffer et al., 2008). Third, the spread of the disease also has a spatial pattern.
In this section, we propose data repairing methods to handle the issues in the above three categories. In the following, we summarize the background of these methods and give details on the implementation of the repairing procedure. All of these repairing procedures can be readily applied in other areas whenever the variable is a count of events observed over time, for instance, the daily number of hospital admissions, the number of stock market transactions, or the number of defective components in industrial engineering.
• Time Series Model for Count Data.
One of the conventional methods to deal with these challenges is the generalized linear model (GLM), which models the observations conditionally on past information. In this project, we consider both the Poisson and the Negative Binomial as the conditional distribution. The second important class for analyzing count time series is the integer autoregressive moving average (INARMA) models; a comprehensive review is given by Weiß (2008). State-space models are another type of count time series model. Compared with the GLM, they allow a more flexible data generating process, but they require a more complicated model specification. Because of their explicit formulation, GLM-based models offer a more convenient way to make predictions. Thus, in this project, we focus on the GLM-based method. We denote by $Y_t$ the infection or death count at time $t$. To repair the dataset, we model the conditional mean $\mu_t = E(Y_t \mid Y_{t-1}, \mu_{t-1})$ in the following form where $\nu_t = \log(\mu_t)$. For this type of data repairing, we use the R package tscount (Liboschik et al., 2020), which conducts the model estimation by the quasi-conditional maximum likelihood method (function tsglm).
• Combined Linear and Exponential Predictors (CLEP) (Altieri et al., 2020). This method assembles the following three different models:
1. An individual county exponential predictor: model (2) uses a series of separate predictors for each county to capture the reported exponential growth of COVID-19 infection and death counts, and we assume that where the parameters $\beta_{i0}$ and $\beta_{i1}$ are the coefficients for county $i$ in the generalized linear model (GLM), fitted using the glm function in R with the Poisson family.
2. An individual county linear predictor: model (3) fits a linear version of the separate county predictors, as shown in the following:
3.
An individual county exponential epidemic predictor: model (4) uses a series of disease-related factors for each county to capture the reported exponential growth of COVID-19 infection and death counts, and we assume that
• Spatio-Temporal Epidemic Model (STEM). Based on the idea of the SIR models, Wang et al. (2020c) proposed a discrete-time spatial epidemic model that combines the susceptible, infectious, and removed states. In the following, we denote by $I_{it}$, $D_{it}$, and $R_{it}$ the counts in the infected, death, and recovered states at location $i$ and time $t$, respectively. We assume that the conditional mean values of daily new positive cases ($I_{it}$), fatal cases ($D_{it}$), and recoveries ($R_{it}$) can be modeled via a link function $g$ as follows: In practice, we use bivariate splines over triangulation to approximate the spatially varying coefficient functions, $\beta_{0t}(\mathrm{lon}_i, \mathrm{lat}_i)$ and $\beta_{1t}(\mathrm{lon}_i, \mathrm{lat}_i)$. The triangulation can be obtained through various software packages; see, for example, the Matlab code DistMesh, and the R package
For public usage, a GitHub repository at https://github.com/covid19-dashboard-us/cdcar has been established. A copy of the cleaned dataset starting from January 22, 2020, has also been included in the R package cdcar. A live version of the data analysis will be continually updated on our dashboard at https://covid19.stat.iastate.edu.
We collect the epidemic data down to the county level in the U.S. along with control measures and other local information, such as socioeconomic status, demographic characteristics, healthcare infrastructure, and other essential factors, to analyze the spatiotemporal dynamic pattern of the spread of COVID-19. Our data cover about 3,200 county-equivalent areas from the 50 U.S. states and the District of Columbia. The sources and introductions for these data are detailed in Table 3. I.
Epidemic Data
The daily counts of cases and deaths of COVID-19 are crucial for understanding how this pandemic is spreading. Using the algorithm discussed in the methods section, we aggregate the reported COVID-19 infected, death, and recovered cases from January 22, 2020 from (1)
gatherings, limits on bars, restaurants and other public places, the deployment of severe travel restrictions, and "stay-at-home" or "shelter-in-place" orders). President Trump declared a state of emergency
The entire detection and repairing procedure is illustrated in Figure 5. First of all, we obtain the data from all four data sources and use the dissimilarity measure proposed above to compare them. We visualize and check the differences at the state level among the data sources based on the comparison results. For the county-level data, we calculate the measure and report the 10 counties that differ most for each pair of sources. Then, all the data are processed with all types of abnormality detection discussed in the section on abnormal data detection. Once an abnormality has been detected, a warning is given automatically by the R package cdcar. We handle the abnormal data differently depending on the type of problem. For example, if an order dependency violation is detected, we repair that point using our data repairing algorithms. If a single abnormal point is detected, we first manually check for possible legitimate reasons based on news and social media. If no such information can be found, we repair the point using the proposed algorithm.
The integrated data are publicly available to assist researchers in investigating the spread of COVID-19 in the U.S. We will continue to provide the cleaned data as the pandemic progresses. Both the R package and the datasets discussed in this article are hosted on our GitHub repository: https://github.com/covid19-dashboard-us/cdcar.
Since late April, all 50 states in the U.S.
began to reopen successively because of the immense pressure from the crippled economy and an anxious public. A state is categorized as "reopening" once its stay-at-home order is lifted, or once reopening is permitted in at least one primary sector (restaurants, retail stores
We compiled the dates of executive orders by checking national and state governmental websites, news articles, and press releases
In the demographic characteristics category, we consider factors describing racial, ethnic, sex, and age structures. Specifically, we include the following six variables
AA PCT - The percent of the population who identify as African American
HL PCT - The percent of the population who identify as Hispanic or Latino
Old PCT - The percent of aged people (age ≥ 65 years)
Sex ratio - The ratio of males to females
PD log - The logarithm of the population density per square mile of land area
Pop log - The logarithm of the local population
Mortality - The 5-year (1998-2002) average mortality rate, measured by the total number of deaths per 100,000 population in a county
We incorporated three features related to the healthcare infrastructure at the county level in the datasets. Among these variables, NHIC PCT is available in the USA Counties Database
The percent of persons under 65 years without health insurance
EHPC - The local government expenditures for health per capita
TBed - Total bed counts per 1,000 population
Socioeconomic Status.
We consider diverse socioeconomic factors in the county level datasets
We also calculate the Gini coefficient based on the household income data from the
Affluence - Social affluence generated by factor analysis from HighIncome, HighEducation, WCEmployment and MedHU
HIncome PCT - The percent of families with annual incomes higher than $75
HEducation PCT - The percent of the population aged 25 years or older with a bachelor's degree or higher
MedHU - The median value of owner-occupied housing units
The percent of the households with public assistance income
The percent of households with female householders and no husband present
Unemployment PCT - Civilian labor force unemployment rate
Gini coefficient, a measure for income inequality and wealth distribution in economics
Another category of factors in the literature that affects the spread of epidemics significantly is the environmental factor, such as the urban rate and crime rate
UrbanRate - Urban rate
ViolentCrime - The total number of violent crimes per 1,000 population
PropertyCrime - The total number of property crimes per 1,000 population
ResidStability - The percent of the population residing in the same house for one year or more
Geographic Information. The longitude and latitude of the geographic center of each county in the U.S. are available in the Gazetteer Files from
Available at https://coronavirus.1point3acres.com/en
Why inequality could spread COVID-19
Curating a COVID-19 data repository and forecasting county-level death counts in the United States
The COVID tracking project data
Standardized surveillance case definition and national notification for 2019 novel coronavirus disease (COVID-19)
2019 novel Coronavirus COVID-19 (2019-nCoV) data repository
A county-level dataset for informing the United States' response to COVID-19
Handbook of spatial epidemiology
How could the CDC make that mistake?
Coronavirus (Covid-19) Data in the United States
See Which States and Cities Have Told Residents to Stay at Home
Spatial analysis in epidemiology
Stiglitz: Pandemic exposed health inequality and flaws of market economy
Department of Health and Human Services (2020)
American Community Survey 5-year estimates
Decennial Census
USA Counties
Economic Census
American Community Survey Demographic and Housing Estimates
The U.S. Gazetteer Files
Homeland infrastructure foundation-level data
Available at https://usafacts.org/visualizations/coronavirus-covid-19-spread-map
BPST: Bivariate Spline over Triangulation
Triangulation
An R shiny app to visualize, track, and predict real-time infected cases of COVID-19 in the United States
An R shiny app to predict the infected and death cases of COVID-19 in the U.S. in the next two months
Spatiotemporal dynamics, nowcasting and forecasting COVID-19 in the United States
Thinning operations for modeling time series of counts-a survey
We would like to thank all our sources, especially the NYT, the JHU, the Atlantic and the USAFacts for