title: USING MACHINE LEARNING TO DEVELOP A NOVEL COVID-19 VULNERABILITY INDEX (C19VI)
authors: Tiwari, Anuj; Dadhania, Arya V.; Ragunathrao, Vijay Avin Balaji; Oliveira, Edson R.A.
date: 2021-02-05
journal: Sci Total Environ
DOI: 10.1016/j.scitotenv.2021.145650

COVID-19 is now one of the leading causes of death in the United States (US). Systemic health, social and economic disparities have put minorities and economically poor communities at a higher risk than others. There is an immediate need for a reliable measure of county-level vulnerability that can capture the heterogeneity of vulnerable communities. This study reports a COVID-19 Vulnerability Index (C19VI) for identifying and mapping vulnerable counties. We propose a Random Forest machine learning-based vulnerability model using the CDC’s sociodemographic and COVID-19-specific themes. An innovative ‘COVID-19 Impact Assessment’ algorithm was also developed to evaluate the severity of the pandemic and to train the vulnerability model. The C19VI was statistically validated and compared with the CDC COVID-19 Community Vulnerability Index (CCVI). Finally, using the C19VI and census data, we explored racial inequalities and economic disparities in COVID-19 health outcomes. Our index indicates that 575 counties (45 million people) fall into the ‘very high’ vulnerability class, 765 counties (66 million people) into the ‘high’ vulnerability class, and 1435 counties (204 million people) into the ‘moderate’ or ‘low’ vulnerability class. Only 367 counties (20 million people) were classified as ‘very low’ vulnerability areas. Furthermore, the C19VI reveals that 524 counties with a racial minority population higher than 13% and 420 counties with poverty higher than 20% are in the ‘very high’ or ‘high’ vulnerability classes.
The C19VI aims to help public health officials and disaster management agencies develop effective mitigation strategies, especially for the disproportionately impacted communities.

In the current study, we developed a more reliable assessment: the COVID-19 Vulnerability Index (C19VI), which quantifies the pandemic vulnerability of each United States county. This relative index processes the same six input variables as the CCVI; however, instead of a linear statistical algorithm, we used the Random Forest (RF) machine learning technique to calculate it. An innovative 'COVID-19 Impact Assessment' algorithm was also developed, using homogeneity analysis and temporal trend assessment techniques, to train the RF model. This algorithm introduces, for the first time, the analysis of the temporal dynamics of confirmed cases, deaths and infection fatality rate (IFR) alongside the CDC's six themes in a non-parametric, non-linear, machine learning-integrated method. Our vulnerability modeling approach thus has two advantages over conventional methods. First, we assessed additional variables that introduce variability in vulnerability modeling, i.e., temporal analysis of daily confirmed cases, deaths, and IFR data. Second, all of the variables were processed in a non-linear, non-parametric fashion using RF machine learning techniques. Next, the C19VI was compared with the CDC's CCVI using advanced statistical measures and a machine learning model. We then tested the accuracy and checked the internal consistency of the C19VI. Our vulnerability assessment methodology allowed us to analyze the impact of COVID-19, which has been unequal and widespread across the nation [12] [13] [14] [15]. In addition, it builds on current techniques in vulnerability modeling and can be leveraged to improve the preparedness of vulnerable counties and reduce the COVID-19 burden within the United States.
We used publicly available datasets from Johns Hopkins University [2], Centers for Disease Control and Prevention (CDC) [10]

To understand the impact of the COVID-19 pandemic in all 3142 counties in the United States, we propose a 'COVID-19 Impact Assessment' algorithm. This algorithm 'Scores' and 'Ranks' the impact of the pandemic by evaluating the temporal changes in the confirmed cases, deaths, and infection fatality rate (IFR) [20] datasets using trend analysis (Mann-Kendall [21, 22] and Theil and Sen slope [23, 24]) and homogeneity assessment (Pettitt's test [25]). Trend analysis characterizes the overall pattern in a daily time-series dataset, and homogeneity assessment identifies abrupt changes in temporal trends [21] [22] [23] [24] [25]. Together, trend and homogeneity analyses make the algorithm more sensitive to daily changes in the epidemiological curve and allow it to recognize the subtle impacts of health policies. The algorithm thus classifies each county into one of six impact groups: 'very high' (Rank = 1), 'high' (Rank = 2), 'moderate' (Rank = 3), 'low' (Rank = 4), 'very low' (Rank = 5) and 'non-significant' (Rank = -999). See the supplementary material for the 'COVID-19 Impact Assessment' algorithm pseudocode. The algorithm functions in four steps:
1. Data import and pre-processing: County-wise, daily time-series data of the confirmed cases and deaths were obtained from Johns Hopkins University, as mentioned above [2]. Daily time-series data for IFR were then calculated from the imported datasets.
2. Homogeneity analysis: Pettitt's test [25] was applied county-wise to check the homogeneity of the time-series datasets of all three epidemiological parameters obtained after step 1. If the data were found to be non-homogeneous, pre- and post-changepoint time series were computed and kept alongside the 'overall' dataset, which was the only populated data column in the case of homogeneous datasets.
This expanded the time-series dataset into three aspects, i.e., pre-changepoint, post-changepoint, and overall, for each of the three epidemiological parameters, i.e., confirmed cases, deaths, and IFR, for each county.
3. Trend analysis: We applied the Mann-Kendall test [21, 22] to assess the trend and its nature, i.e., increasing, decreasing, or no trend, in a given time series. Next, the trend magnitude was quantified using the Theil and Sen slope estimator [23, 24]. The Mann-Kendall test and the Theil and Sen slope estimator were applied to all three time series computed at the end of step 2 for all three epidemiological parameters in each county.
4. COVID-19 Impact 'Score' and 'Rank' determination: The Impact Score was determined using the trend magnitude data obtained from the previous step. We used IFR as the most important parameter for assessing the impact of the COVID-19 pandemic in our algorithm [20, 26]. Where IFR did not show a significant trend in a given county, we first used the deaths [26]. If the deaths did not show a significant trend either, confirmed cases were used to evaluate the impact of the pandemic [26]. Thus, rank classification occurred in three stages, each further divided according to the homogeneity results:
a. On the basis of the IFR:
i. In a homogeneous IFR time-series with an increasing 'overall' trend, the county was assigned Rank 1 and its impact Score was equal to the 'overall' trend magnitude.
ii. In a non-homogeneous IFR time-series with an increasing pre-changepoint trend, the scoring and ranking were specified based on the post-
i. In a homogeneous death time-series with an increasing 'overall' trend, the county was assigned Rank 2 and its impact Score was equal to the 'overall' trend magnitude.
ii. In a non-homogeneous death time-series with an increasing pre-
Every other county was classified as Rank -999 and Score -999.
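The homogeneity and trend steps (steps 2 and 3) can be illustrated with a minimal, standard-library-only sketch of Pettitt's test, the Mann-Kendall test and the Theil and Sen slope. This is an illustration under stated assumptions, not the authors' implementation, whose pseudocode is in the supplementary material. The function names, the 0.05 significance level, the omission of a tie correction in the Mann-Kendall variance, the use of the common approximate p-value for Pettitt's test, and the example series are all choices made here.

```python
import math
from statistics import median


def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def pettitt_test(x):
    """Pettitt change-point test for a series with at least two values.

    Returns (t, K, p): t is the number of observations before the most
    likely change point (None if K is 0), K is the test statistic and p is
    the commonly used approximate two-sided p-value.
    """
    n = len(x)
    k_stat, change = 0, None
    for t in range(1, n):
        # U_t compares every value up to position t with every later value.
        u = sum(sign(x[i] - x[j]) for i in range(t) for j in range(t, n))
        if abs(u) > k_stat:
            k_stat, change = abs(u), t
    p = min(1.0, 2.0 * math.exp(-6.0 * k_stat ** 2 / (n ** 3 + n ** 2)))
    return change, k_stat, p


def mann_kendall(x, alpha=0.05):
    """Mann-Kendall trend test (normal approximation, no tie correction).

    Returns (trend, p) with trend in {'increasing', 'decreasing', 'no trend'}.
    The significance level alpha is an assumption of this sketch.
    """
    n = len(x)
    s = sum(sign(x[j] - x[i]) for i in range(n - 1) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided p-value
    if p < alpha and s != 0:
        return ("increasing" if s > 0 else "decreasing"), p
    return "no trend", p


def sen_slope(x):
    """Theil and Sen estimator: median of all pairwise slopes (unit time steps)."""
    n = len(x)
    return median((x[j] - x[i]) / (j - i)
                  for i in range(n - 1) for j in range(i + 1, n))


# Hypothetical daily series with an abrupt shift after the tenth day.
series = [0.0] * 10 + [10.0] * 10
change_point, k_stat, p_value = pettitt_test(series)
trend, _ = mann_kendall(series)
magnitude = sen_slope(series)
```

In this sketch a non-homogeneous series would be split at `change_point` into pre- and post-changepoint segments, and `mann_kendall` and `sen_slope` would then be run on each segment as well as on the overall series. The pairwise loops are written for readability rather than speed.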
Finally, of the three ranks assigned to each county based on the three epidemiological variables, the highest impact group (lowest rank) and its corresponding trend magnitude were taken as the final COVID-19 Impact Score and Rank for that county.

Our study methodology was built and tested in six steps (Figure 3). First, the training-testing data was prepared from the 'most affected' and the 'non-significantly' affected counties identified by the proposed 'COVID-19 Impact Assessment' algorithm. Second, a COVID-19 vulnerability map was generated using the RF machine learning technique [27, 28]. Third, the vulnerability modeling was validated using the Receiver Operating Characteristic (ROC)-Area Under the ROC Curve (AUC) technique [29-31] and Cronbach's α [32]. Fourth, our C19VI modeling was compared against the CDC's CCVI using the Friedman [33] and two-tailed Wilcoxon signed-rank [34] tests, and the contributions of the input themes to each vulnerability index were then ranked using the Boruta technique [35]. Fifth, the C19VI was analyzed with racial minority population and poverty datasets to determine the disproportionate county-level impact of the COVID-19 pandemic. Lastly, an interactive version of the C19VI map, together with other results, was released to the public using the ESRI Web GIS customization toolkit [36]. Each step is further detailed below:
1. Preparation of the training-testing dataset: The proposed 'COVID-19 Impact Assessment' algorithm was used to map the impact of the COVID-19 pandemic on all 3142 counties in the US using confirmed cases and deaths. Out of the total 3142 counties, 200 very highly affected and 200 non-significantly affected counties were selected to prepare the training and testing dataset for COVID-19 vulnerability modeling. 70% of these 400 counties (280) were randomly selected as the training dataset, while the remaining 30% (120) were used for testing.
4. Comparison of the CCVI and C19VI: As both the CCVI and the C19VI models were developed using the same six thematic indicators, the Friedman [33] and two-tailed Wilcoxon signed-rank [34] statistical tests were used to compare the vulnerability prediction ability of the two models. Next, the Boruta feature importance assessment technique [35] was used to evaluate the relative importance of the input indicators in the CCVI and the C19VI.
5. Community-specific vulnerability analysis: Long-standing systemic, social and economic inequities across the counties have put many people from racial minority groups and people living below the poverty line at increased risk of getting sick and dying from COVID-19 [15, 16, 37]. By overlaying the C19VI map on racial minority population percentage data, COVID-19 vulnerability specific to racial minority groups was identified. As recommended by the CDC, a racial minority threshold of 13%, i.e., a given county with more than 13% racial minority residents, was used for computing the COVID-19 vulnerability for racial minority groups [38]. Similarly, by overlaying the C19VI map on poverty percentage data, COVID-19 vulnerability specific to economically poor communities was identified. As defined by the Economic Research Service (ERS), United States Department of Agriculture (USDA), a poverty threshold of 20%, i.e., a given county with more than 20% economically poor residents, was used to estimate the vulnerability for economically poor communities [39, 40]. The ESRI ArcGIS overlay analysis tool [41] was used to conduct the community-specific vulnerability analysis.

Our 'COVID-19 Impact Assessment' algorithm performed a county-wise assessment of the pandemic using the confirmed cases, deaths and IFR data from 22 January 2020 to 31 July 2020. We generated a map of our assessment that groups the impact of the pandemic on all United States counties into one of the six categories (Figure 4(A)).
We found 88 counties with 'very high', 30 with 'high', 73 with 'moderate', 344 with 'low', 214 with 'very low', and 2393 with 'non-significant' impact due to the COVID-19 pandemic (Figure 4(B)). The top 200 counties with the most impact and the bottom 200 with non-significant impact were used as training and testing datasets for our COVID-19 vulnerability model.

Using the impact assessment data of the selected United States counties, the input themes and the RF technique, we developed the COVID-19 Vulnerability Index (C19VI). Figure 5(A) shows the C19VI map on a scale of 0 to 1. As presented in Figure 5

We used the AUC-ROC technique to validate the prediction accuracy of our C19VI model. As shown in Figure 7

The racial minority populations of the United States reside more densely in the southern states and in urban areas [17, 44, 45]. Our community-specific analysis reveals that racial minorities disproportionately reside in counties that are more vulnerable to COVID-19 (Figure 9(A)). We found that 77.62% of counties with racial minority populations > 13% have very high or high (C19VI > 0.60) COVID-19 vulnerability. Like racial minorities, economically poor communities are more likely to be affected by the virus and have higher mortality rates [45]. The C19VI-derived COVID-19 vulnerability with reference to poverty is presented in Figure 9(B). We find that 82.84% of economically poor counties, where poverty > 20%, have very high or high (C19VI > 0.60) COVID-19 vulnerability.

iii) heuristic modeling [52]. In addition, a few studies performed numerical simulations of the total confirmed cases, deaths, and IFRs using statistical [47, 53] and machine learning [54] techniques to compute COVID-19-specific vulnerability. While these approaches have advanced the domain of pandemic vulnerability modeling, each shows at least one of three underlying limitations, recognized by public health planners and policy makers, that impair an optimal modeling process.
Either they implement an equal weight assignment approach in vulnerability assessment, assume steady transmission rates in mathematical modeling, or treat confirmed cases, deaths, and IFR as constants for vulnerability assessment. However, it is known that 1) not all input theme variables are equally important in determining vulnerability [49], 2) confirmed cases, deaths, and IFR are not biological constants in a pandemic and thus reflect the severity of the pandemic in a particular context, at a particular time [55]

Furthermore, we optimized the handling of the dynamic characteristics of the pandemic by developing the novel 'COVID-19 Impact Assessment' algorithm, which assesses the regional pandemic impact by performing trend and homogeneity analysis on daily datasets rather than on static values for a defined period. Trend and homogeneity assessments help characterize the course of the pandemic and, by identifying subtle changes in daily datasets, point out the COVID-19 response through changes in healthcare infrastructure or policies in a given region [21] [22] [23] [24] [25]. Moreover, beyond this optimization, our impact assessment algorithm allows vulnerability modeling to be driven by the chronic disease burden, healthcare infrastructure, and policy impacts such as lockdown phases. In conjunction with the optimized impact assessment algorithm, the high training (90%) and testing (84%) accuracy and the favorable internal reliability score (Cronbach's α = 0.709) of the RF machine learning-derived predictive modeling technique make the C19VI an accurate and reliable index. In addition, despite using the same inputs, our machine learning-derived C19VI produced significantly different and consistent results compared with the CDC's CCVI, as shown by the Friedman and Wilcoxon signed-rank tests.
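For readers unfamiliar with the two validation measures cited above, the following standard-library sketch shows how an AUC and a Cronbach's α can be computed. It is a generic illustration, not the authors' code: the function names and the toy inputs are hypothetical, and the text does not state which items entered the reported α of 0.709, so the `items` columns below are placeholders. The AUC is computed through its rank interpretation, i.e., the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, with ties counted as one half.

```python
from statistics import variance


def roc_auc(labels, scores):
    """Area under the ROC curve from binary labels (0/1) and model scores."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Count positive/negative pairs ranked correctly; ties count one half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cronbach_alpha(items):
    """Cronbach's alpha for k item columns measured on the same units.

    items: list of k lists, each holding one item's values for every unit.
    alpha = k / (k - 1) * (1 - sum of item variances / variance of totals)
    """
    k = len(items)
    totals = [sum(row) for row in zip(*items)]
    item_variance_sum = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical test-set labels (1 = highly affected) and model scores.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

# Hypothetical item columns; perfectly consistent items give alpha close to 1.
alpha = cronbach_alpha([[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]])
```

The pairwise AUC loop is quadratic in the number of cases, which is acceptable for a test set of 120 counties; a rank-based formulation would be preferable for larger samples.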
Moreover, the Boruta algorithm-based importance assessment of the variables shows that the two methods handled the variables in notably different ways. The divergence between the two methods indicates that the C19VI was able to capture non-linear relationships in the variables that were not captured by the linear 'equal weight assignment approach' used in the CDC's CCVI model [10]. The ability to capture non-linearity in the input variables, alongside the unique characteristics of the C19VI methodology, makes the C19VI a strong candidate for vulnerability assessment.

Our nationwide vulnerability analysis reveals interesting patterns of vulnerability distribution around the country. We found that most of the vulnerable counties are concentrated in the southern states. As shown in the Figure 5

This index can also be used alongside other epidemiological data, such as disease transmission, infection fatality rate, and the proportion of cases needing hospitalization, intensive care unit admission, or ventilator support, to heighten the preparedness of a district or state and to plan and execute the response. We also recommend using the C19VI alongside the CDC's Social Vulnerability Index (SVI) when developing disaster risk assessment and preparedness plans in COVID-19-affected regions. For example, during the COVID-19 pandemic, the C19VI should be used alongside the SVI for disaster management in counties with frequent forest fires, tornadoes or hurricanes.

COVID-19 has brought previously unaddressed health disparities affecting racially marginalized and economically poor communities to the forefront of concern for both disaster management officials and governments. By overlaying the C19VI with the race and poverty data, we found that racial minorities and economically poor Americans disproportionately reside in communities that are more vulnerable to COVID-19.
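The overlay described above amounts to a two-way classification of each county by its C19VI value and by a community threshold (13% racial minority population or 20% poverty). The authors performed it with the ESRI ArcGIS overlay analysis tool; the sketch below only reproduces the tabular logic in plain Python. The function names and county values are hypothetical, and treating a C19VI of exactly 0.6 as low vulnerability is an assumption of this sketch.

```python
VULNERABILITY_CUTOFF = 0.6  # C19VI above this is 'high' or 'very high'


def overlay_class(c19vi, group_pct, group_threshold):
    """Classify one county by vulnerability and by community share."""
    vuln = "high vulnerability" if c19vi > VULNERABILITY_CUTOFF else "low vulnerability"
    group = "above threshold" if group_pct > group_threshold else "below threshold"
    return vuln, group


def share_high_among_group(counties, group_threshold):
    """Percentage of counties above the community threshold that are highly vulnerable.

    counties: iterable of (c19vi, group_pct) pairs.
    Returns None when no county exceeds the threshold.
    """
    in_group = [c for c in counties if c[1] > group_threshold]
    if not in_group:
        return None
    high = sum(1 for c in in_group if c[0] > VULNERABILITY_CUTOFF)
    return 100.0 * high / len(in_group)


# Hypothetical (C19VI, racial minority %) pairs for six counties.
counties = [(0.9, 30), (0.7, 15), (0.3, 20), (0.65, 40), (0.8, 5), (0.2, 2)]
minority_share = share_high_among_group(counties, group_threshold=13)
```

Calling the same function with poverty percentages and `group_threshold=20` gives the corresponding figure for economically poor counties.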
This finding is consistent with other evidence highlighting the disproportionate incidence of COVID-19 among minority groups and poor communities [13, 15, 37, 57-59]. The currently available county-level datasets of cases and deaths segregated by minority population and economic status are not sufficient to generate reliable COVID-19 risk estimates. The analysis proposed here provides a way to help the communities that disproportionately bear the burden of this crisis by precisely identifying these areas. The C19VI is thus intended to help policy makers, non-profit entities, private companies, local organizations, and the general public improve COVID-19 contingency planning. This index may also be useful for: i) better management of the distribution of resources; ii) addressing pandemic-associated healthcare disparities; iii) providing businesses with opportunities to grow where support is needed the most; and iv) raising public awareness of the COVID-19 pandemic. We also hope that this methodology will prove useful in driving more advanced predictive modeling techniques by professionals in academia.

Ideally, it would be possible to calculate the index at the census-tract level. However, several important variables used to define vulnerability were not available at this level. Hence, this analysis is restricted to the county level. Secondly, being based on the ranking of counties for the CDC's six themes, our C19VI is a relative index for each county rather than an absolute score. Thirdly, we were unable to test the external validity of the C19VI since no accurate and stable measure of vulnerability was available. Fourthly, the 'COVID-19 Impact Assessment' algorithm needs to be evaluated for space and time complexity and for internal errors.

We declare no competing interests.

vulnerability of the racial minorities.
The map shows counties with high vulnerability (C19VI > 0.6) and higher than 13% racial minorities in cobalt, low vulnerability (C19VI < 0.6) and higher than 13% racial minorities in tropical blue, high vulnerability (C19VI > 0.6) and lower than 13% racial minorities in red, and low vulnerability (C19VI < 0.6) and lower than 13% racial
and higher than 20% poverty in red, low vulnerability (C19VI < 0.6) and higher than 20% poverty in pink, high vulnerability (C19VI > 0.6) and lower than 20% poverty in orange, and low vulnerability (C19VI < 0.6) and lower than 20% poverty in chardonnay.

Rolling Updates on Coronavirus Disease (COVID-19)
Bringing resources to state, local, tribal & territorial governments
Quarantine Fatigue: first-ever decrease in social distancing measures after the COVID-19 pandemic outbreak before reopening United States
Trends in Number and Distribution of COVID-19 Hotspot Counties-United States
Forecasting COVID-19 impact on hospital bed-days, ICU-days, ventilator-days and deaths by US state in the next 4 months
A social vulnerability index for disaster management
The Impact of Social Vulnerability on COVID-19 in the US: An Analysis of Spatially Varying Relationships
Development of a vulnerability index for diagnosis with the novel coronavirus, COVID-19
Foundation S. The COVID-19 Community Vulnerability Index (CCVI)
A New Approach to the Social Vulnerability Indices: Decision Tree-Based Vulnerability Classification Model
The disproportionate impact of COVID-19 on racial and ethnic minorities in the United States
Disparities in Incidence of COVID-19 Among Underrepresented Racial/Ethnic Groups in Counties Identified as Hotspots During
Does the Covid-19 pandemic disproportionately affect the poor? Evidence from a six-country survey
Poverty and Covid-19: rates of incidence and deaths in the United States during the first 10 weeks of the pandemic
Why inequality could spread COVID-19
Homeland infrastructure foundation-level data
Shapefile technical description. An ESRI white paper
How Large Was the Mortality Increase Directly and Indirectly Caused by the COVID-19 Epidemic? An Analysis on All-Causes Mortality Data in Italy
Nonparametric tests against trend
Rank correlation methods
A rank-invariant method of linear and polynominal regression analysis (Parts 1-3)
Estimates of the regression coefficient based on Kendall's tau
A non-parametric approach to the change-point problem
Modelling insights into the COVID-19 pandemic
Random forests
Classification and regression by randomForest
Diagnostic tests 3: receiver operating characteristic plots
Understanding receiver operating characteristic (ROC) curves
Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach
Coefficient alpha and the internal structure of tests
The use of ranks to avoid the assumption of normality implicit in the analysis of variance
Individual comparisons by ranking methods
Feature selection with the Boruta package
COVID-19 exacerbating inequalities in the US
Improving Health Equity for Black Communities in the Face of Coronavirus Disease-2019
Rural Health Disparities: The Economic Argument. Application of the Political Economy to Rural Health Disparities
Rural, low-income families and their well-being: Findings from 20 years of research
Making sense of Cronbach's alpha
Cronbach's Alpha: Simple definition, use and interpretation
The Coronavirus's unique threat to the south
Spatial Variation in Socio-ecological Vulnerability to COVID-19 in the Contiguous United States
COVID-19 Progression Timeline and Effectiveness of Response-to-Spread Interventions across the United States
Risk assessment of novel coronavirus COVID-19 outbreaks outside
Data-Driven Development of a Small-Area COVID-19 Vulnerability Index for the United States
A vulnerability index for the management of and response to the COVID-19 epidemic in India: an ecological study. The Lancet Global Health
Social Vulnerability and Racial Inequality in COVID-19 Deaths in Chicago
COVID-19: District level vulnerability assessment in India
COVID-19 and urban vulnerability in India
Visualizing and Assessing US County-Level COVID19 Vulnerability
The COVID-19 Pandemic Vulnerability Index (PVI) Dashboard: monitoring county level vulnerability
What do we know about the risk of dying from COVID-19. Our World in Data
Mathematical models for COVID-19: Applications, limitations, and potentials
Inequity and the disproportionate impact of COVID-19 on communities of color in the United States: The need for a trauma-informed social justice response
Social Vulnerability and Equity: The Disproportionate Impact of COVID-19
The Disproportionate Impact of Covid-19 on Communities of Color. NEJM Catalyst Innovations in Care Delivery

We would like to thank Johns Hopkins University, Centers for Disease Control and Prevention