key: cord-0863506-901ghexi
authors: Mehta, Mihir; Julaiti, Juxihong; Griffin, Paul; Kumara, Soundar
title: Early Stage Prediction of US County Vulnerability to the COVID-19 Pandemic
date: 2020-04-11
journal: nan
DOI: 10.1101/2020.04.06.20055285
sha: 975310567e0a03b1f00ed0ca13b4dfe9f28a4ce9
doc_id: 863506
cord_uid: 901ghexi

Key Points:
Question: What are key factors that define the vulnerability of counties in the US to cases of the COVID-19 virus?
Findings: In this epidemiological study based on publicly available data, we develop a model that predicts vulnerability to COVID-19 for each US county in terms of the likelihood of going from no documented cases to at least one case within five days and in terms of the number of occurrences of the virus.
Meaning: Predicting county vulnerability to COVID-19 can help health organizations to better plan for resource and workforce needs.

Abstract
Importance: The rapid spread of COVID-19 means that government and health service providers have little time to plan and design effective response policies. It is therefore important to rapidly provide accurate predictions of how vulnerable geographic regions such as counties are to the spread.
Objective: To develop county-level predictions of near-future COVID-19 occurrences using publicly available data.
Design: Original Investigation; Decision Analytical Model Study for County Level COVID-19 occurrences using data from March 14-31, 2020.
Setting: Disease spread prediction for US counties.
Participants: All US counties, with data fused from multiple publicly available sources, including health statistics, demographics, and geographical features.
Exposure(s) (for observational studies): Daily county level reported COVID-19 occurrences from March 14-31, 2020.
Main Outcome(s) and Measure(s): We developed a three-stage model. First, an XGBoost classifier estimates the probability of COVID-19 occurrence for unaffected counties. Second, an XGBoost regression estimates the number of potential occurrences in a county. Third, these results are combined to compute the county-level risk, which is then used as the estimated vulnerability of the county five days later.
Results: Using data from March 14-31, 2020, the model shows a sensitivity over 71.5% and specificity over 94%.
Conclusions and Relevance: We found that population, population density, percentage of people aged 70 or greater and prevalence of comorbidities play an important role in predicting COVID-19 occurrences. We found a positive association between affected and urban counties as well as between less vulnerable and rural counties. The developed model can be used for identification of vulnerable counties and potential data discrepancies. Limited testing facilities and delayed results introduce significant variation in reported cases and produce a bias in the model.
Trial Registration: Not Applicable

The continued spread of confirmed cases of COVID-19, the absence of a vaccine, and limited resources for testing and assisting people with confirmed cases have presented a great challenge for our public health and healthcare provider systems. To this point, nonpharmaceutical interventions such as social distancing are the only effective mitigation measures. The rapid spread of the disease means that government and health services have very little time to plan and design effective response policies such as resource and workforce planning. Accurately predicting the near future COVID-19 spread at sufficient granularity would provide these organizations with better information and time to appropriately plan and respond. We have developed a three-stage machine learning model to estimate COVID-19 spread outcomes at the US county level.
In the first stage, we estimate the probability that a county has at least one confirmed COVID-19 case. In the second stage, we estimate the number of COVID-19 occurrences given that the county has at least one case. Finally, we combine the results from the two stages to identify the counties that have the greatest and least vulnerability to changes in disease prevalence over the next five-day period.

There has been significant epidemiological work for previous coronavirus pandemics such as MERS and SARS. 1 For example, Badawi et al. 2 performed a systematic analysis of the prevalence of comorbidities in MERS using data from 12 studies and found that diabetes and hypertension were present in 50% of the cases. Matsuyama et al. 3 systematically reviewed studies involving laboratory confirmed MERS cases to measure both the risk of admission to the Intensive Care Unit (ICU) and death. They compared risks by age, gender and underlying comorbidities. Park et al. 4 reviewed characteristics and associated risk factors of MERS. Bauch et al. 5 surveyed SARS modeling literature focused on understanding the basic epidemiology of the disease and evaluating control strategies. The surveyed SARS models varied in terms of population studied and geographical characteristics. 6, 7 Different designs were used for SARS modeling, consisting of deterministic compartmental models 7 , stochastic compartmental models 6 , a combination of stochastic and deterministic compartmental models 8 , discrete-time models 9 , logistic curve fitting models 10 , contact network models 11 and likelihood-based models. 12 Studies associated with risk factors for SARS 13 and MERS 3,14-20 have found an association between comorbidities and infected cases. MERS and SARS epidemiological modeling has been done at different granularities such as the country 21, 22 , specific region 23 , and case clusters.
6 Given the much broader reach of COVID-19 compared to MERS and SARS, it is very important to predict at a sufficiently high level of granularity. This is particularly important since previous studies have shown that there is considerable heterogeneity in space, transmissibility and susceptibility. 5 Our approach is developed at the county level with inclusion of a variety of health statistics, demographics and geographical features of counties. Further, we use publicly available data so that any organization could use the model. To the best of our knowledge, no work has been done to predict near future infection risk at the county level using the combination of health statistics, demographics and geographical features of counties.

The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND 4.0 International license. medRxiv preprint doi: https://doi.org/10.1101/2020.04.06.20055285

We performed an epidemiological study at the US county level using publicly available data to develop a machine learning predictive model. Data analysis was performed from February 15, 2020, to April 3, 2020. The study was reviewed by the Penn State Integrated Research Ethics Board and deemed exempt because it was a deidentified, secondary data analysis. This study followed the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guideline. 24

We used US Census data to obtain county level population statistics for age, gender and density. 25

There are three primary outcomes for our predictive model: i) the probability that a county has at least one confirmed case of COVID-19, which we define as a positive instance, ii) the number of confirmed COVID-19 cases within a county, which we define as occurrences, and iii) vulnerability of the county.

Previous studies have shown that angiotensin-converting enzyme 2 (ACE2) facilitates COVID-19 infection 35-37 , and that patients with diabetes, hypertension and cardiovascular diseases have an increased expression of ACE2. 35 County population factors such as density, age, and sex have a significant impact on the spread of an epidemic. 38 Cancer and chronic respiratory diseases have also been shown to increase mortality risk for COVID-19. 39

The dataset used for our three-stage model contains correlated variables, for example diabetes and hypertension prevalence, cancer crude rate, and the proportion of older residents. Additionally, the underlying relationship between variables was assumed to be non-linear. For such cases the literature supports 40-47 using gradient tree boosting and deep learning methods for better prediction results.

In order to predict COVID-19 outcomes, we divided the problem into three stages. In the first stage, we formulated a binary classification problem that included both positive and negative instances. We developed an XGBoost 48 classifier model to learn from the data. We divided the dataset into training and testing sets in 80-20 proportions for each class. We tuned the hyperparameters of the model using the Hyopt package. In the second stage, we formulated an XGBoost regression model that included data only for positive instances, with the number of occurrences as the response.
As in the first stage, we divided the data into training and testing sets in 80-20 proportions and used the Hyopt package for hyperparameter tuning. In the last stage, we combined results from the first two stages and calculated the expected occurrences for each county as a measure of county vulnerability. To calculate expected occurrences, we multiplied the probability that a county belongs to the positive instances, derived from the classification model, by the number of occurrences the same county would have if it became a positive instance, derived from the regression model.

Area under the receiver operating characteristic curve (AUC) and accuracy are used as the criteria to evaluate the classification model (the first stage of the model). The root mean squared error (RMSE) is used as the criterion to evaluate the regression model (the second stage of the model). The final stage of the model, vulnerability, was assessed by examining the sensitivity and specificity of the prediction.

The variable importance for the overlapping predictors between the final classification and regression models for March 16th is shown in Figure 1. Total population (TOT_POP) was the most important variable for both the classification and regression models.
Other important variables included population density, longitude, hypertension prevalence, chronic respiratory mortality rate, cancer crude rate, and diabetes prevalence. Latitude (we use this to identify neighboring counties and the presence or absence of positive class in the neighborhood) and the percentage of the population older than 70 years were found to be the least important features of those considered, though they still played a role.

The sensitivity (Table 2) is given by the percentage of counties with no confirmed cases, identified as being among the 5% most vulnerable, that had at least one confirmed COVID-19 case five days later. The specificity (Table 3) is given by the percentage of counties with no confirmed cases, identified as being among the 10% least vulnerable, that still had no confirmed cases five days later.

The dataset comprises 37% urban and 63% rural counties based on the urban and rural county definition for year 2013. 49 In order to determine if there is an association between urbanicity and vulnerability, we performed a set of one-sided t-tests. The null hypothesis that the 10% least vulnerable counties have the same proportion of rural counties as the dataset as a whole was rejected for every day from March 14th to March 26th. Additionally, the null hypothesis that the actual positive-instance counties have the same proportion of urban counties as the dataset as a whole was also rejected for every day over the analysis period. It can therefore be concluded that there is a positive association between urban and most vulnerable counties as well as between rural and least vulnerable counties.
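As a rough illustration of the urban-rural comparison above, the following minimal sketch tests whether an observed proportion exceeds a reference proportion. It is not the authors' procedure: it substitutes a one-sample, one-sided z-test under the normal approximation for the one-sided t-tests reported in the text, and the function name and the counts in the usage lines are hypothetical rather than taken from the study.

```python
import math


def one_sided_proportion_test(successes, n, p0):
    """One-sample z-test of H0: p = p0 against H1: p > p0.

    successes: number of counties in the group with the trait (e.g. rural)
    n:         size of the group (e.g. the 10% least vulnerable counties)
    p0:        reference proportion (e.g. share of rural counties overall)
    Returns (observed proportion, z statistic, upper-tail p-value).
    """
    if n <= 0 or not 0.0 < p0 < 1.0:
        raise ValueError("need n > 0 and 0 < p0 < 1")
    p_hat = successes / n
    # Standard error of the sample proportion under the null hypothesis.
    se = math.sqrt(p0 * (1.0 - p0) / n)
    z = (p_hat - p0) / se
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return p_hat, z, p_value


# Hypothetical counts: 280 rural counties among 310 least vulnerable ones,
# compared with the 63% rural share reported for the whole dataset.
p_hat, z, p_value = one_sided_proportion_test(280, 310, 0.63)
print(p_hat, z, p_value)
```

A small p-value here would point the same way as the rejection of the null hypothesis described in the text, namely that rural counties are over-represented among the least vulnerable group.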
The continuously decreasing trend in the confidence interval of the estimated proportion of urban counties among actual positive-instance counties suggests that COVID-19 is propagating from urban counties to rural counties.

We developed a three-stage machine learning model using publicly available data to predict the five-day vulnerability of a US county. The model estimates the likelihood that a county with no documented COVID-19 cases will report a case within a five-day period, together with the likely impact, and combines the two into a vulnerability prediction for the county. Using data from March 14th to March 31st, 2020, the model showed a sensitivity over 71.5% and specificity over 94%. We found a positive association between affected counties and urban counties as well as between the top 10% least vulnerable counties and rural counties. Further, counties with higher population density, a greater percentage of people aged 70 years or above, and higher prevalence of diabetes, cardiac illness and respiratory diseases are more vulnerable to COVID-19 than their counterparts.

Our model serves multiple purposes. First, it can help in identifying potentially vulnerable counties. This prediction would be a vital component in managing COVID-19 spread by providing vulnerability information based on the likelihood and magnitude of change within five days. That can help health organizations to plan effectively for management of hospital resources and workforce, rapid response teams, and COVID testing kits and testing locations. In addition, there are multiple counties with limited testing facilities, and with current swab-based testing, it takes multiple days to get the results.
Thus, occurrences associated with each county fluctuate rapidly from day to day.

There are multiple limitations to our work. First, there are several predictors that we did not include in the model that have known associations with COVID-19. However, one of our goals was to make sure that any organization could use our model by only including data that is publicly available. Second, our analysis (Table e2) found that there is an increasing trend for the coefficient of variation (CV) for occurrences associated with positive-instance counties. Note that CV is a proxy for economic inequality. 50-53 Hence, there is a bias in the response variable, which can reduce the accuracy of the prediction. As testing facilities improve in terms of numbers and efficiency, this bias would be minimized and would be reflected in the model. Given this point, it would be useful to look at the riskiest and safest counties predicted by the MJK model and examine them for potential data discrepancies. Finally, additional feature engineering and stacking methods can be utilized to enhance the prediction capabilities of existing models.

Our work uses open source programming and publicly available data. We will make the full dataset, sample modeling and result outputs available with instructions for use soon on: https://github.com/mihirpsu/covid_19

Funding: There was no funding provided for any of the authors.
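The third stage of the model and its evaluation, as described in the text, reduce to simple arithmetic, which the following minimal sketch illustrates under stated assumptions. It assumes the stage-one probabilities and stage-two predicted occurrences are already available as dictionaries keyed by county, and that only counties with no confirmed cases on the prediction day are included. All function names and the toy numbers are hypothetical, and the "sensitivity" and "specificity" functions follow the definitions given in the text rather than the usual confusion-matrix definitions.

```python
import math


def expected_occurrences(p_positive, predicted_cases):
    """Stage 3: probability of becoming positive times predicted case count."""
    return {c: p_positive[c] * predicted_cases[c] for c in p_positive}


def select_by_rank(risk, fraction, highest=True):
    """Return the counties in the top (or bottom) `fraction` by risk score."""
    k = max(1, math.ceil(len(risk) * fraction))
    ordered = sorted(risk, key=risk.get, reverse=highest)
    return set(ordered[:k])


def paper_sensitivity(most_vulnerable, positive_later):
    """Share of flagged most-vulnerable counties with a case five days later."""
    return len(most_vulnerable & positive_later) / len(most_vulnerable)


def paper_specificity(least_vulnerable, positive_later):
    """Share of flagged least-vulnerable counties still case-free later."""
    return len(least_vulnerable - positive_later) / len(least_vulnerable)


# Toy inputs (hypothetical): four currently unaffected counties.
p_positive = {"A": 0.9, "B": 0.5, "C": 0.1, "D": 0.05}
predicted_cases = {"A": 10.0, "B": 4.0, "C": 2.0, "D": 1.0}
positive_later = {"A", "C"}  # counties with >= 1 case five days later

risk = expected_occurrences(p_positive, predicted_cases)
# The text uses the 5% most and 10% least vulnerable counties; larger
# fractions are used here only because the toy example has four counties.
most = select_by_rank(risk, 0.25, highest=True)
least = select_by_rank(risk, 0.5, highest=False)
print(paper_sensitivity(most, positive_later))
print(paper_specificity(least, positive_later))
```

Ranking by the product rather than by probability alone means a county with a moderate chance of a first case but a large predicted case count can outrank a county that is more likely to report a case but with few predicted occurrences.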
Economics in the Time of COVID-19: A New EBook
Prevalence of comorbidities in the Middle East respiratory syndrome coronavirus (MERS-CoV): a systematic review and meta-analysis
Clinical determinants of the severity of Middle East respiratory syndrome (MERS): A systematic review and meta-analysis
MERS transmission and risk factors: A systematic review
Dynamically modeling SARS and other newly emerging respiratory illnesses: Past, present, and future
Transmission dynamics of the etiological agent of SARS in Hong Kong: Impact of public health interventions
Transmission dynamics and control of severe acute respiratory syndrome
A simple approximate mathematical model to predict the number of severe acute respiratory syndrome cases and deaths
Severe Acute Respiratory Syndrome Epidemic in Asia. Emerging Infectious Diseases
Transmission of severe acute respiratory syndrome in dynamical small-world networks
Different epidemic curves for severe acute respiratory syndrome reveal similar impacts of control measures
A family cluster of middle east respiratory syndrome coronavirus infections related to a likely unrecognized asymptomatic or mild case
Community Case Clusters of Middle East Respiratory Syndrome Coronavirus in Hafr Al-Batin, Kingdom of Saudi Arabia: A Descriptive Genomic study
Presentation and outcome of Middle East respiratory syndrome in Saudi intensive care unit patients. Critical Care
Risk factors for primary middle east respiratory syndrome coronavirus illness in humans, Saudi Arabia
Clinical and epidemiologic characteristics of spreaders of middle east respiratory syndrome coronavirus during the 2015 outbreak in Korea
Recovery from the Middle East respiratory syndrome is associated with antibody and T cell responses
Clinical aspects and outcomes of 70 patients with Middle East respiratory syndrome coronavirus infection: A single-center experience in Saudi Arabia
Epidemiological investigation of MERS-CoV spread in a single hospital in South Korea
Fatality risks for nosocomial outbreaks of Middle East respiratory syndrome coronavirus in the Middle East and South Korea. Archives of Virology
The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: Guidelines for reporting observational studies
Annual Resident Population Estimates, Estimated Components of Resident Population Change, and Rates of the Components of Resident Population Change for States and Counties
County Level Population Density Database: National Program of Cancer Registries and Surveillance, Epidemiology, and End Results SEER*Stat Database: NPCR and SEER Incidence - U.S. Cancer Statistics Public Use Research Database with Puerto Rico
United States Chronic Respiratory Disease Mortality Rates by County 1980-2014 | GHDx. Accessed April 3, 2020.
34. NYTimes. NYtimes/covid-19-data: An ongoing repository of data on coronavirus cases and deaths in the U
Are patients with hypertension and diabetes mellitus at increased risk for COVID-19 infection? The Lancet Respiratory Medicine
Two Things About COVID-19 Might Need Attention
COVID-19-New Insights on a Rapidly Changing Epidemic
Spread of infectious disease modeling and analysis of different factors on spread of infectious disease based on cellular automata
Preliminary Estimates of the Prevalence of Selected Underlying Health Conditions Among Patients with Coronavirus Disease 2019 - United States
Greedy function approximation: A gradient boosting machine
Predicting clicks: Estimating the click-through rate for new ads
Application of XGBoost algorithm in hourly PM2.5 concentration prediction
Probability Analysis of Hypertension-Related Symptoms Based on XGBoost and Clustering Algorithm
Deep learning for healthcare: Review, opportunities and challenges
Using Deep Learning for Energy Expenditure Estimation with wearable sensors
Deep Neural Networks for Acoustic Modeling in Speech Recognition. IEEE Signal Processing Magazine
A Critical Review for Developing Accurate and Dynamic Predictive Models Using Machine Learning Methods in Medicine and Health Care
XGBoost: A Scalable Tree Boosting System
NCHS urban-rural classification scheme for counties. Vital and Health Statistics, Series 2: Data Evaluation and Methods Research
Income Distribution: Includes CD
Policy Impacts on Inequality. Welfare Based Measures of Inequality. The Atkinson Index. EASYPol. 2006.
53. Coefficient of variation - Wikipedia