key: cord-0711412-3pn0gap4 authors: Haratian, A.; Fazelinia, H.; Maleki, Z.; Ramazi, p.; Wang, H.; Lewis, M.; Greiner, R.; Wishart, D. title: Dataset of COVID-19 outbreak and potential predictive features in the USA date: 2020-10-20 journal: nan DOI: 10.1101/2020.10.16.20214098 sha: efdf773cc452d1fbb04ef3c4aa85d9a5778fedea doc_id: 711412 cord_uid: 3pn0gap4 This dataset provides information related to the outbreak of COVID-19 disease in the United States, including data from each of 3142 US counties from the beginning of the outbreak (January 2020) until September 2020. This data is collected from many public online databases and includes the daily number of COVID-19 confirmed cases and deaths, as well as 33 features that may be relevant to the pandemic dynamics: demographic, geographic, climatic, traffic, public-health, social-distancing-policy adherence, and political characteristics of each county. We anticipate many researchers will use this dataset to train models that can predict the spread of COVID-19 and to identify the key driving factors. mechanistic models, such as the Susceptible-Infected-Recovered (SIR) type models 1 , as well as effective machine-learning models 2 . We rely on authoritative government and academic sources to collect the data, here is to provide the data for each of the features at the county level. For each of the 3142 counties in the US, for each day from the beginning of the diseaseouttrea in the country, January 22, 2020, until September 15, 2020, the dataset provides the number of confirmed COVID-19 cases and deaths as well as 33 other demographic, geographic, climatic, epidemiological and sociological features that potentially influence the spread and effects of the disease. These features include both fixed and temporal characteristics. Fixed (time-invariant) features generally represent a county's geographic, demographic, and public health information. Temporal ( eme -garyinv ) eatures consist o climate actors , adherence to social -distancinv policies , girus pressure rom neivhtorinv counees , and a numter o tests per ormed in each state (Online-only Table 1 ). The colleceon and pre -processinv o each o the gariatles are descrited telod . We obtained the number of COVID-19 confirmed cases and deaths from the USAFacts website 3 , which is sourced from the US Centers For Disease Control and Prevention (CDC) 4 . This data source records the cumulative number of confirmed COVID-19 cases and the cumulative number of confirmed COVID-19 deaths for each county and day from the first case report on January 22, 2020. We obtained the new cases (and deaths) per day by subtracting the total of each day from the previous. This resulted in negative values for some counties on some days. After contacting the data source, we learned that the government agencies update their reported accumulative numbers to be lower than the previous day if they obtain a more accurate count by re-examining their medical records. Thus, the negative values generated in the daily confirmed cases and deaths data are related to the additional counts that were incorrectly reported. In the processed dataset, we used the following method to solve this problem. We set the entry of each day with a negative value to zero and add this negative value to the previous day's entry. Then if the result of the sum is negative, we repeat this step until one day's entry sums up to a positive value. So far, the repetition was never required as the resulting sums were always non-negative. We obtained most of the demographic data --The total population, age and gender distribution, number of housing units, and county area --from the US Census Bureau websites; here we report 2018 values 5,6 . We calculated population and housing density by (respectively) dividing the total population and number of housing units in each county by the area of that county. To calculate the academic population ratio, we first collected the total enrollment of university and college students and academic staff that each county reported in the fall of 2018 using data from the National Center of Education Statistics (NCES) 7 . This was divided by the population of that county in 2018 to calculate the ratio. Education level distribution is the percentage of educated people at the three education legels 'less than hivh school diploma', 'hivh school diploma', and 'some colleve or associate's devree', dhich de dodnloaded rom the United States Department of Agriculture (USDA) 8 . (Note this is 3 real values, each between 0 and 1.) Immigrant students are those who enrolled in the fall of 2018 in any institution in the county for the first time at the undergraduate level but reside in another state 7 . We derived the immigrant student ratio by dividing the total number of immigrant students by the total county population. Religious congregation ratio is calculated by dividing the total number of active members of a county's relivious congregations reported in 2010 by the total county population 9 . We collected from CDC 10 the number of deaths per 100,000 residents as the overall death rate in 2018 for each county, regardless of the cause of the death. The information about a county's health facilities is included in the data mostly on a per capita basis. We collected the number of intensive care unit (ICU) beds and ventilator capacity data from the Tableau Public website 11 and the total number of hospital beds per 1000 individuals from the Urban Institute website 12 , which is sourced from the American Hospital Association Annual Survey Database 13 . Then we derived the per capita information atout these acilities usinv each county's population. Percent of smokers and Percent of diabetes show the percentage of adult smokers and diabetic adults in the total population of the county, respectively. Our data source for smoker and diabetes ratios and percent of insured residents was the County Health Rankings and Roadmaps website 14 . Airport distance for each county shows the distance to the nearest international airport, ottained ty considerinv the "vreat circle distance" calculated throuvh the latitude and longitude of the airport and the county center. Only airports with more than ten daily passengers, on average, are considered (using data prior to COVID-19). The passenger load for each county is the passenger load of the nearest international airport to that county, and if a county includes more than one international airport, the passenger load equals the total passenger load of these airports. These data were collected from the United States Department of Transportation website and other resources 15, 16, 17 . We derived the passenger load ratio by dividing the passenger load by the total population of the county. The number of meat plants shows the number of meat and poultry processing plants in each county, collected from the United States Department of Agriculture website 18 . Our economic features for each county include the median household income and GDP per capita, both taken from data reported in 2018. These data are obtained from the Census Bureau and the United States Bureau of Economic Analysis and State Science and Technology Institute websites 19,20,21 . The governing political party data was collected from wikipedia 22 . Our source for climate data contains daily precipitation and daily maximum, minimum, and average values for temperature each day. We obtained these data from the Daily Summaries dataset 23 . Note that 35% of the daily precipitation values were missing, which we imputed using the KNN imputer 24 on the dataset consisting of the daily precipitation values over the All rights reserved. No reuse allowed without permission. perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in The copyright holder for this this version posted October 20, 2020. . https://doi.org/10.1101/2020.10.16.20214098 doi: medRxiv preprint counties. We similarly imputed each of the other temporal features using the KNN imputer on that temporal feature. As with daily average temperature, 82% of the values were missing. If the corresponding min and max daily temperatures were reported, we would impute the mean by averaging those two values. We used the KNN imputer to impute the remaining 48% of the missing values for the instances that did not include any of the average, minimum, or maximum temperatures. Our social distancing data source, Unacast 25 , is based on mobile location data. In collecting this data, the users consent to opt-in and can opt-out by filling out a form on the data source website. This data source contains four different metrics: travel distance grade, visitation grade, encounters grade, and an overall average score of these three metrics. Each of these grades determines the percentage of reduction in a measure of unnecessary social activities (e.g. traveling, human encounters, non-essential visitation) compared to the pre-COVID-19 period, which is translated into letter grades, as described in Online-only Table 1 . The encounters grade is based on the proximity of two devices within a circle of radius 50m for less than an hour, counted as one encounter. This grade shows the decrease in human encounter density (number of encounters per square km) compared to the baseline, where the baseline is defined as the national average encounter density during the four weeks before the COVID-19 outbreak (February 10th -March 8th). The reason why the baseline is defined over the whole nation is to assign lower grades to denser areas, even if they witness fewer encounters compared to the pre-COVID-19 period. Namely, dense areas have a high infection risk, even if they were denser in the past. The visitation grade indicates the percent change in visits to non-essential venues compared to the pre-COVID-19 period. Non-essential venues include but are not limited to restaurants (multiple kinds), department and clothing stores, jewelers, consumer electronics stores, cinemas and theaters, office supply stores, spas and hair salons, gyms and fitness/recreation facilities, car dealerships, hotels, craft, toy and hobby shops. The average visitation for each day of the week prior to the COVID-19 outbreak (between February 10th to March 8th) is considered as the baseline. The percent change is calculated by comparing those baselines to visits on the corresponding days of the week post-outbreak (March 9th onwards). The travel distance grade simply shows the percentage reduction in average distance traveled in each county for each day. The highest grade for this metric represents more than 70% reduction in average distance traveled. This threshold is selected based on the experience gained from Italy because they implemented some of the most strict social distancing policies, which resulted in a 70% to 80% reduction. Therefore, Unacast expects a maximum of 70% reduction in distance traveled under a total shot-down. The data source began recording the social distancing data from February 24. We set the values for the previous days (January 22 to February 23) to the lowest grade for each social distancing feature, assuming that no social distancing policies were imposed by then. However, the encounters grade, travel distance grade, and total grade, each still had 17% missing values for the remaining dates (post 24 February) and the visitation grade had 44% missing values, all of which we imputed using the KNN imputer. Daily state tests refer to the number of daily tests performed in each state. These numbers were obtained using statistics from multiple type COVID-19 tests including antibody, antigen, All rights reserved. No reuse allowed without permission. perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in The copyright holder for this this version posted October 20, 2020. . https://doi.org/10.1101/2020.10.16.20214098 doi: medRxiv preprint and PCR. This data was downloaded from the COVID Tracking Project 26 and contains negative values for some states and days. The starting recording date for the number of performed tests varies from state to state. We considered the negative numbers as missing values and imputed them along with the unrecorded number of tests using the KNN imputation method. The virus pressure at county , and day , denoted by , is defined based on the number of COVID-19 cases in the neighboring counties: where denotes the number of COVID-19 cases in county at day , and is the set of the neighboring counties of county . We removed from the processed dataset, those counties that missed any of the fixed features. For each of the temporal features, if a county had missing values for only some of the recorded dates, these missing values were imputed as explained above; however, if it missed the values in all of the recorded dates, the county was removed from the dataset. This resulted in the elimination of a total of 790 counties. The version of the dataset at the time of submission, containing data from January 22 to September 16, has been archived in figshare 27 , and the latest version of the dataset is publicly available in our Github repository https://github.com/network-and-Data-Science-IUT/covid-19. We included 2 datasets: (i) the raw dataset with the negative and missing values, and (ii) the processed dataset where the counties with missing values are all imputed or removed from the dataset. Each row in the datasets corresponds to a specific county and date. Counties and their associated states are represented using their name and fips code 28 . Online-only Table 1 specifies the name, type, spatial scale, and description of each feature and also the percentage of their values available in the raw dataset. Note that virus pressure exists only in the processed dataset, and the visitation grade is removed from this dataset because it has a high volume of missing values. The size of the current raw and processed datasets are 277 and 218 MB. We plan to update the datasets until the end of 2021. For illustration, Figure 1 plots the daily temporal features for the New York and Los Angeles counties over the peak days of the disease outbreak. All rights reserved. No reuse allowed without permission. perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in The copyright holder for this this version posted October 20, 2020. . https://doi.org/10.1101/2020.10.16.20214098 doi: medRxiv preprint All rights reserved. No reuse allowed without permission. perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in We checked the compatibility of the number of COVID-19 confirmed cases and deaths with worldometers website 29 reports. We checked the values of each collected feature for All rights reserved. No reuse allowed without permission. perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in The copyright holder for this this version posted October 20, 2020. . https://doi.org/10.1101/2020. 10.16.20214098 doi: medRxiv preprint abnormal values (e.g., negative values for counting variables) using the pandas package 30 for the python programming language. We then considered these values as missing values and imputed them using the imputation methods mentioned above. When adding a feature to the dataset, we used the pandas package to identify and remove duplicate feature records for the same county and date. In addition, by looking at the summary of each feature, including its min, max, and mean values, as well as randomly observing some of the values of that feature, we checked if the values belong to the logical range of that feature. For example, the raw data collected for temperature included values in the range [-500,500] for Celsius. After contacting the corresponding website 23 , it appeared that we had to divide by 10 to obtain the correct values in Celsius. None of the other features had this issue. Data collection and preparation are done using the python programming language. We used the json and requests packages 31, 32 to collect data and the scikit-learn package 33 to impute missing values. The codes used to collect and prepare the datasets are available in our Github repository https://github.com/network-and-Data-Science-IUT/covid-19. perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in The copyright holder for this this version posted October 20, 2020. A SIR model assumption for the spread of COVID-19 in different communities Analysis on novel coronavirus (COVID-19) using machine learning methods USA Facts, US Coronavirus Cases and Deaths United States Census Bureau, Population and Housing Unit Estimates Tables Multiple Cause of Death Definitive HealthCare COVID-19 Capacity Predictor Airport, airline and route data United States Department of Agriculture, Food Safety and Inspection Service List of United States governors Missing value estimation methods for DNA microarrays USA covid-19 data Data structures for statistical computing in python Requests: HTTP for Humans All rights reserved. No reuse allowed without permission. perpetuity preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in The copyright holder for this this version posted Scikit-learn: machine learning in python. the Journal of machine This work was funded by Alberta Innovates, project number: RES0052027. We thank Dr. Ammar Hassanzadeh Keshteli for illustrative discussions. A.H. found relevant data sources and drafted the original manuscript. H.F. helped with finding the data sources, developed the code for automatic data collection, and contributed to the writing of the manuscript. P.R. and Z.M. managed and coordinated the production of the dataset and revised the manuscript. H.W., M.L., R.G., and D.W. revised the manuscript and supervised the study. The authors declare that they have no known competing financial interests or personal relationships which have, or could be perceived to have, influenced the work reported in this article.