key: cord-0955311-6mntqyo4
authors: Patel, C. J.; Deonarine, A.; Lyons, G.; Lakhani, C.; Manrai, A. K.
title: Identifying communities at risk for COVID-19-related burden across 500 U.S. Cities and within New York City
date: 2020-12-24
journal: nan
DOI: 10.1101/2020.12.17.20248360
sha: f661e0b5e7c42413972b8f7cb861e3893db458ed
doc_id: 955311
cord_uid: 6mntqyo4

Background. While it is well-known that older individuals with certain comorbidities are at highest risk for complications related to COVID-19 including hospitalization and death, we lack tools to identify communities at highest risk with fine-grained spatial and temporal resolution. Information collected at a county level obscures local risk and complex interactions between clinical comorbidities, the built environment, population factors, and other social determinants of health. Methods. We develop a robust COVID-19 Community Risk Score (C-19 Risk Score) that summarizes the complex disease co-occurrences for individual census tracts with unsupervised learning, selected on their basis for association with risk for COVID complications, such as death. We mapped the C19 Risk Score onto neighborhoods in New York City and associated the score with C-19 related death. Last, we predict the C-19 Risk Score using satellite image data to map the built environment in C-19 Risk. Results. The C-19 Risk Score describes 85% of variation in co-occurrence of 15 diseases that are risk factors for COVID complications among 26K census tract neighborhoods (median population size of tracts: 4,091). The C-19 Risk Score is associated with a 40% greater risk for COVID-19 related death across NYC (April and September 2020) for a 1SD change in the score (Risk Ratio for 1SD change in C19 Risk Score: 1.4, p < .001). Satellite imagery coupled with social determinants of health explain nearly 90% of the variance in the C-19 Risk Score in the United States in held-out census tracts (R2 of 0.87). Conclusions. The C-19 Risk Score localizes COVID-19 risk at the census tract level and predicts COVID-19 related morbidity and mortality.

Introduction used in navigation --to predict the C-19 Risk Score. This is inspired by work from Maharana and Nsosie, who developed an approach to map the built environment to obesity prevalence [19] . We also demonstrate how social determinants of health correlate with and predict the C-19 Risk Score. Further, we focus on a late-May 2020 hotspot of COVID-19 epidemic, New York City, to show how the C-19

Risk Score is associated with zipcode-level COVID-19 related deaths. Lastly, deploy the COVID-19 Risk Score with an application programming interface and a browsable dashboard (accessible at http://launchpad.xybeta.ai/).

We obtained geocoded disease prevalence data at the census tract level from the U.S. Centers for Disease Control and Prevention 500 Cities Project (updated December 2019 [20] ) ( Figure 1A ). 500 Cities contains disease and health indicator prevalence for 26,968 number individual census tracts of the 500 Cities which are estimated from the Behavioral Risk Factor Surveillance System [21] .

From the 500 Cities data, we chose 13 health indicators that are associated with COVID-19-related hospitalization and death based on reports from China, Italy, and the United States (e.g., [1, 6, 8, 9] ).

Disease indicators include the prevalence amongst adults of diabetes, coronary heart disease, chronic kidney disease, asthma, arthritis, any cancer, chronic obstructive pulmonary disorder. We also selected behavioral risk factors including smoking and obesity. Third, we selected variables that reflected access to care, such as prevalence of individuals on blood pressure medication and high cholesterol levels.

4 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint

The copyright holder for this this version posted December 24, 2020. ; https://doi.org/10.1101/2020. 12.17.20248360 doi: medRxiv preprint We further obtained 5-year 2013-2017 American Community Survey (ACS) Census data [22] , which contain sociodemographic prevalences and median values for Census tracts ( Figure 1C) . We selected the total number of individuals in the tract, proportion of males and females over the age of 65, and proportion of individuals by race. These data also included information on socioeconomic indicators including median income, the proportion of individuals living in poverty, unemployed, cohabitate with more than one individual per room, and have no health insurance.

Defining the COVID-19 Community Risk Score (C-19 Risk Score)

We merged disease, behavior, and health access prevalence data from the 500 Cities Project for each of the 26,968 census with ACS information and calculated their Pearson pairwise correlations ( Figure 1D ).

We considered 15 variables in total, including 13 health indicators (e.g., diseases and risk factors), and 2 demographic factors, the proportion of males and females individuals over 65 in the risk score. The disease prevalence included diabetes, coronary heart disease, any cancer, asthma, chronic obstructive pulmonary disease, arthritis; the behavioral risk factors included prevalence of obesity, smoking, high cholesterol, and high blood pressure. The clinical risk factors included the prevalence of individuals on a blood pressure medication.

To summarize the total variation of the disease prevalences in a single score, we devised 2 approaches.

The first approach scaled each of the 13 health indicators plus 2 indicators of males and females over 65 (z-score transformation) prevalence by subtracting the overall average and dividing by the standard deviation of the prevalence. Then, for each census tract, the z-scores were summed and rescaled to be between 0-100. Therefore, the tracts with the highest scores have the highest "additive" prevalences for all the health indicators. We call this the additive score.

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint

The copyright holder for this this version posted December 24, 2020. ; https://doi.org/10.1101/2020. 12.17.20248360 doi: medRxiv preprint The second and primary score utilized principal components analysis (PCA), a "unsupervised" machine learning approach that transforms the data (26,968 by 15 dimensions) into a new space where each new variable is a linear combination of variables from the original dataset. PCA attempts to maximize the variance explained over the dataset in successive new variables (call them Y 1 through Y 15 ) that are defined by the "components", or linear combination of each of the original variables (e.g., health indicators) ( Figure 1E ). Therefore, the first variables (Y 1 , Y 2 , etc) that correspond to the first principal components in the new dataset explain the maximal amount of variation in the entire dataset. After reprojecting the census tracts on the first two principal components, call them Y 1 and Y 2 (fitting each census tract to the first two principal components), we "aligned" Y 1 and Y 2 such that the increasing prevalence of all the disease and health indicators were monotonically increasing with increase of the disease prevalences. Next, we estimated a single score for each census tract as a weighted average between Y 1 and Y 2 , where the weights were proportional to the variance explained by the first and second principal components respectively. Finally, the score is rescaled to be between 0-100. The higher the value, the higher the total burden of disease and proportion of individuals over 65 in that census tract. We calculated risk scores for different units of administrative areas, including the 500 cities, 316 counties, and 50 states. We estimated a city-wide prevalence of each of the 15 COVID-19 risk factors and diseases and then computed the additive and PCA-based scores as above. We repeated the same procedure for counties (m=316) and states (m=50).

Next, we sought to estimate how robust the PCA-based risk score is to sampling error via simulation. To do so, we estimated the standard deviation of the prevalences as a function of the size of the population of the tract. We assumed a covariance structure between the disease prevalences to be the observed census-level correlation across the US. Next, for each census tract, we simulated 100 times the prevalence of the diseases using a multivariate normal distribution, centered around the actual prevalence and with 6 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint

The copyright holder for this this version posted December 24, 2020. ; https://doi.org/10.1101/2020.12.17.20248360 doi: medRxiv preprint covariance equal to the COVID-19 risk factor correlation over all 26,968 tracts. Next, for each of the 100 simulated datasets we computed the principal components and obtained a simulated distribution about the principal component. Next, we estimated the predicted new projections for plus or minus 5 standard deviations (SD) of the principal components. The "robustness" score is the range of the score across the +/-5 SD of the principal components.

We associated each of the ACS-estimated sociodemographic indicators with the C-19 Risk Score multivariate linear and random forests regression to test the linear and non-linear contribution of the sociodemographic indicators in the COVID-19 Score ( Figure 1H ). We split the dataset into half "training" and "testing" to get a conservative estimate of variance explained and predictive capability of the sociodemographic variables in the COVID-19 Risk Score while not overfitting the data. Specifically, we tested the linear and non-linear association prediction between American Community Survey 5-year proportions of individuals in each census tract who were (a) below poverty, (b) unemployed, (c) non-employed, (d) have less than high school education, (e) lack health insurance, (f) have more than one person occupied per room, and (g) Hispanic, Asian, African American, or Other Ethnic group.

Furthermore, we also included the median income and median home value of a census tract. Coefficients in the linear regression denoted a 1 unit change in a 1 SD change in the socioeconomic variable (e.g., a 1 SD unit change in the prevalence or a 1 SD change in the median income for a census tract). Random forests were fit using 1000 trees and a tree size of 5.

Reconstructing the COVID-19 Community Risk Score from satellite imagery

To reconstruct the C-19 Risk Score from satellite imagery ( Figure 1B CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint

The copyright holder for this this version posted December 24, 2020. ; https://doi.org/10.1101/2020.12.17.20248360 doi: medRxiv preprint machine learning algorithm. The images are satellite raster tiles that we downloaded from the OpenMapTiles database. The images have a spatial resolution close to 20 meters per pixel allowing a maximum zoom level of 13. [23] Images were extracted in tiles from the OpenMapTiles database using the coordinate geometries of the census tracts. After extraction, images were digitally enlarged to achieve a zoom level of 18. First, we passed images through AlexNet, a pretrained convolutional neural network, in an unsupervised deep learning approach called feature extraction. [24] (FIgure 1G) The resulting vector from this process is a "latent space feature" representation of the image comprising 4,096 features. This latent space representation is essentially an encoded (non-human readable) version of the visual patterns found in the satellite images, which, when coupled with machine learning approaches, is used to model the built environment of a given census tract. [19] For each census tract, we calculated the mean of the latent space feature representation. We performed feature extraction on a NVIDIA Tesla T4 GPU using Python 3.7.7 and the PyTorch package. Finally, the latent space feature representation was regressed against the COVID-19 Risk Score using gradient boosted decision trees. [25] We split the 80% of the data into training and the remaining 20% to testing. To train the model, we used a maximum tree depth of 5, a subsample of 80% of the features per tree, a learning rate (i.e., feature weight shrinkage for each boosting step) of 0.1, and used 3-fold cross-validation to determine the optimal number of boosted trees. Training was completed on a NVIDIA Tesla T4 GPU using Python 3.7.7 and the XGBoost package. In a separate 8 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint

The copyright holder for this this version posted December 24, 2020. ; https://doi.org/10.1101/2020.12.17.20248360 doi: medRxiv preprint analysis, both satellite image features and the social determinants of health features (above) were regressed against the COVID-19 Risk Score. We report R 2 for the predictions in the test data ( Figure   1GH ).

Association of the COVID-19 Community Risk Score with zipcode-level

We downloaded case and death count data on a zipcode tabulation area (ZCTA) of New York City, a hotspot of the US COVID-19 epidemic as of 5/20/20 and then again on 9/20/20 ( Figure 1J ). We used 2010 Census cross-over files to map Census tracts to ZCTAs. We computed the average C-19 Risk Score for the ZCTA, weighting the average by population size of the Census tract. As above, we estimated the ZCTA-level socioeconomic values and proportions. We associated the C-19 Risk Score with the death rate using a negative binomial model. We set the offset term as the logarithm of the total population size of a zipcode. The exponentiated coefficients are interpreted as the incidence rate ratio for a unit change (e.g, 1 standard deviation increase) in the variable (versus no change).

To distribute the data to other software applications, users, and provide interactive graphing capabilities, we created an Application Programming Interface (API) and dashboard for the C-19 Risk Score. We obtained COVID-19 case data from Johns Hopkins University [26] . The API endpoints and their outputs are described in Appendix I. We also developed a dashboard through the API in the form of geographical coordinate information for Census tracts, and data values (in GeoJSON format) to construct a choropleth in the web browser. We obtained geographical coordinates and shapefiles from the US Census (2018) [27] . Distributions of values are displayed for Census tracts being viewed at a particular location (the 9 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint

The copyright holder for this this version posted December 24, 2020. 

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint

The copyright holder for this this version posted December 24, 2020. ; https://doi.org/10.1101/2020.12.17.20248360 doi: medRxiv preprint

Prevalence and heterogeneity of COVID-19 associated co-morbidities and risk factors across 500 Cities of the United States Figure 1A and 1C) . We chose these comorbidities and risk factors because they were (1) among the strongest risk factors for COVID-19 related hospitalization, intensive care unit utilization, and death (e.g., males and females above 65, diabetes, heart disease, and stroke), (2) were indicative of risk for cardiometabolic disease or impaired lung function (e.g., smoking, obesity, high blood pressure, high cholesterol, kidney disease, asthma, chronic pulmonary obstructive disorder), or (3) may be indicative of drug use that might impair immune systems (e.g., cancer, arthritis, blood pressure drug use).

Census tracts represent small "communities" that have a median population size of 4,091 (total range of 15-51,536). For example, of the 500 cities in our analysis, cities had a median 28 number of census tracts 

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint

The copyright holder for this this version posted December 24, 2020. ; https://doi.org/10.1101/2020.12.17.20248360 doi: medRxiv preprint

We found that risk factor prevalence and its range in census tracts across the entire US was, as expected, large (ranging from 6% to 100% [entire tract]) ( Figure 1D , Figure 2A ). For example, the median proportion of females and males over that age of 65 is 14% and 11% respectively (interquartile range

[IQR] of 10-18% for females and 7-15% for males). The median proportion of diabetes and heart disease was 10% (IQR of 8-13%) and 5% (IQR of 4-7%); on the other hand, the median proportion of a census tract that were on treatments for, or had risk factors for heart disease, was over a third of their respective communities. Specifically, the proportion having high blood pressure, on blood pressure lowering medication, or had high cholesterol was 30% (IQR of 25-36%), 72% (IQR of 67-76%), and 32% (IQR of 29-34%) respectively.

The median proportion of individuals with impaired lung function, such as chronic obstructive pulmonary disorder and asthma was 10% (IQR of 6-18%) and 6% (IQR of 5-8%) respectively. Smoking, a risk factor for both impaired lung function and heart disease was 17% (IQR of 13-22%).

There was a large variation of COVID-19 comorbidities and risk factors within cities themselves ( Figure   2B , Supplementary Table 1 , 2) as ranked by the interquartile range (IQR, 75th percentile minus the 25th percentile) of the within city prevalence. For example, Atlanta had the greatest IQR for obesity (IQR: 22 to 40%), high blood pressure (20 to 44%), and COPD (4 to 9%), while places such as Gainesville had the highest variation in prevalence of high cholesterol (18 to 34%) and blood pressure medication (51 to 74%). Surprise, Arizona, had the highest variation in population over 65 (females:11 to 52% , males: 9 to 49%) and cancer prevalence (5 to 12%).

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint

The copyright holder for this this version posted December 24, 2020. ;

We estimated the cross-census tract correlation between the 15 COVID-19 candidate health indicators and risk factors ( Figure 1D, Figure 3A ). We found that there was dense correlation (median absolute value of correlation was 0.63 (IQR of 0.35 to 0.78) amongst many of the disease prevalences; specifically, higher or lower prevalence of one disease was strongly associated with higher or lower prevalence of their risk factors or complications of disease. For example, cardiometabolic diseases, such as diabetes, stroke, and heart disease displayed a strong correlation (mean pairwise Pearson correlation between these 4 disease outcomes were 0.92) between one another. Risk factors for these diseases, such as obesity, high blood pressure and high cholesterol exhibited on average a correlation of 0.62 between them, and furthermore, an average correlation of 0.78 between diseases such as diabetes, stroke, and heart disease.

Communities that had a higher prevalence of smoking also exhibited higher prevalence of COPD and chronic asthma. The correlation between asthma and COPD was 0.69, between smoking and COPD was 0.81, and between smoking and asthma was 0.78. Obesity, an established risk factor for many diseases, was correlated with all disease indicators and risk factors (mean absolute value of pairwise correlation of 0.54). Communities that had higher prevalence of males and females older than age 65 larger prevalence of any cancer (average correlation of 0.78).

The first two principal components of the 15 COVID-19 health indicators and risk factors described 85% of the total variation (61 and 24% for component 1 and 2, respectively) of the variation over all 29,768 census tracts ( Figure 1E ). The first principal component had equal contribution from all 15 health diseases, except for cancer and males and females over the age 65; the second principal component was 13 .

CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this this version posted December 24, 2020. ; https://doi.org/10.1101/2020.12.17.20248360 doi: medRxiv preprint dominated by cancer and age ( Figure 3B, Supplementary Table 3 ). The structure across health indicators and risk factors persisted when examining less specific geographical units of counties and states.

To estimate a C-19 Risk Score, we reprojected the 29,978 tracts onto the first two principal components, reducing the 15 health indicators to two variables ( Figure 1EF ). The COVID-19 score was the weighted sum of the two variables, where the weight was the total variance explained for that variable (61% and 24% respectively). We scaled the score to be between 0-100. The average score was 33.7 (SD: 8.6); the median of the score was 33.32 and the interquartile range was 28 to 38. Table 1 The COVID-19 Community Risk Score can be explained by social determinants of health and satellite images of the built environment "Social determinants of health" and demographic characteristics of a community ( Figure 1C , 1H) explain a 54% of the total additive variation of the C-19 Risk Score in the testing dataset. Second, we found an additional 11% of variation attributed to non-linear relationships, or a total of 65% between social 14 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this this version posted December 24, 2020. ; https://doi.org/10.1101/2020.12.17.20248360 doi: medRxiv preprint determinants and the C-19 Risk Score in the testing data using random forest based regression. Third, we found that features of the built environment captured by satellite images contributed to 27% of the variation in the C-19 Risk Score. In total, combining both social determinants and satellite imagery explained 87% of the variation of the C-19 Tisk Score ( Figure 1G, 1H) . Below, we describe the most important features of each of the models.

All 13 sociodemographic variables contribute independently to the C-19 Risk Score (linear regression p-values less than 2x10 -16 for 11 out of 13 variables); however, the relationships are complex ( Table 2 ).

The variables that had the largest additive contribution included the proportion of the community that was non-employed (for a 1 SD change in proportion of non-employed was associated with a 5.3 unit increase in the COVID-19 Score, P < .001). Similarly and independently, a 1 SD increase in the increase of individuals with less than a high school education was associated with a 2 unit increase in the score.

However, for a 1 SD change in the increase of those at or below the poverty level was associated with a 3.3 unit decrease in the COVID-19 Risk Score.

We trained a random forest regression to test for non-linear relationships (1000 trees and 4 variables at each split) in the training data. In the testing data, the social determinants of health indicators explained 65% of variation in the C-19 Risk Score in the US. The "most important" variables in the training data, ascertained through a permutation of each variable sequentially, included the proportion of the tract that was not employed (273% increase of mean-squared error [MSE] when permuted), of Asian ethnicity (93% increase of MSE), at or below poverty (91% increase of MSE), Hispanic (78% increase MSE), and less than high school (78% increase MSE). The rank order of the importance of these variables was similar to the strength of their association in the linear model (Table 2) .

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this this version posted December 24, 2020. ; https://doi.org/10.1101/2020. 12.17.20248360 doi: medRxiv preprint We hypothesized that the C-19 Risk Score is highly correlated with indicators of the built environment as captured by satellite imagery, even after accounting for social determinants of health. We deployed existing deep learning models originally trained on images from the internet and "retrained" them to predict C-19 Risk Score. In testing data, satellite imagery information plus 13 social determinants indicators explained 87% of the C-19 Risk Score variation. We deploy our findings as visualizations in a dashboard ( Figure 1IJK ).

Neighborhoods with higher C-19 Risk were associated with higher rates of COVID-19 related deaths in New York City ( Figure 1L ). We mapped the C-19 Risk Score to each zipcode tabulation area (ZCTA) in 

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this this version posted December 24, 2020. ; https://doi.org/10.1101/2020.12.17.20248360 doi: medRxiv preprint

In this multi-scale analysis integrating spatial disease information from gold standard disease prevalence sources such as the US Centers for Disease Control and Prevention, social determinants of health information from the US Census, and satellite imagery data, we demonstrate an approach to identify characteristics of communities at risk for COVID-19 complications. We use the tools of unsupervised learning to develop a C-19 Risk Score that provides a single interpretable number that summarize a communities' (census tract) aggregate risk. The constituents of the C-19 Risk Score are established neighborhood level risk factors for COVID-19, such as age, obesity, diabetes, and heart disease.

We believe that the COVID-19 Risk Score can be a tool in the growing armamentarium for public health and healthcare companies' toolbox to enable communities to prepare for the potential onslaught of cases in the coming winter months, ultimately helping to "flatten the curve" [29] to achieve precision public health. Notably, we found that the zipcode-level COVID-19 Risk Score for New York City and surrounding areas predicted risk for COVID-19 complications, such as death. Zipcodes with the highest COVID-19 scores (in the top 5%) had double the risk of COVID-19 death versus zipcodes with lowest scores. As of this writing, New York City is contemplating another lockdown due to a surge in the very same zipcodes we identified as high risk.

As a byproduct of developing a risk score for communities, we observed that there is a great disparity of chronic disease prevalence within cities and across cities in the United States. With the exception of New York City and a few other places in the United States, public health agencies mostly report COVID-19 case and death records are collected at the level of the county. However, the findings in our study implicate that smaller populations are at risk, and counties are heterogeneous.

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint

The copyright holder for this this version posted December 24, 2020. ;  Furthermore, we demonstrated how COVID-19 is a disease inextricably connected to social determinants. Social determinants of health are many fold, comprising both non-modifiable and modifiable factors socioeconomic status, residential location, type of occupation, access to healthcare, physical exposures (e.g. air pollution and individual factors of the exposome [30] [31] [32] [33] ), and built environment [34] . They are hierarchical in structure and distributed over both geographic space and time whose measurement can occur on both the individual-level (exposure of a person) or area level (exposure levels of a place). Satellite images provide a microscope into the area-level "built environment", a concept that encapsulates the physical structures of how humans live, such as the city layout, resource presence, and landscape.

We uncovered immense structure in the disease prevalence information and the correlation between disease prevalence was large. ~90% of the variation of prevalence of the 15 disease and health indicator prevalences (e.g. diabetes, obesity, cardiovascular disease, populations that take blood pressure medication, and average age, among others) can be explained by just two dimensions.

There are a few limitations of our study. First, we rely on disease and health indicator prevalence from the 500 largest cities in the United States, but miss out on less urban areas whose populations are at risk for COVID-19 complications. In the future, we aim to task satellite imaging technology to locations that cannot be covered by resource-limited public surveillance programs. Second, while the CDC 500

Cities data are reflective of the diversity of individuals who live in a Census tract, they are updated every two years and are dated to the latest collection (2019 data release reflects disease prevalence in 2017).

Relatedly, individual-level disease nor COVID-19 status of individuals from these communities are measured. Last, satellite image data are captured at a resolution of approximately 20m per pixel. It is not clear from our study if higher resolution images (that can theoretically capture more human-visible details of the built environment) would lead to better predictions of the C-19 Risk Score.

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint

The copyright holder for this this version posted December 24, 2020. ;  It is clear that COVID-19 is a disease of disparity; however, we cannot make a causal claim between the instruments, such as the C-19 Risk Score, satellite imagery, and census-tract-level sociodemographic factors and eventual individual-level COVID-19 related complications.

Others have deployed similar risk scores to identify communities at risk [35] . Furthermore, we were inspired by the work of others that demonstrate how remote sensing images predict obesity prevalence [19] . However, to our knowledge, this is the first study to integrate with social determinants of health and satellite image information for prediction of C-19 Risk in neighborhoods. We found that, by combining established social determinants, information measured on earth with the built environment from space can explain most of the variation in the C-19 Risk Score. A mere 13 sociodemographic variables explain 50% of variation of the C-19 Risk Score. Combined with satellite images, we could identify ~90% of variation in the C-19 Risk Score. Therefore, it may be possible to survey populations where COVID-19 testing and resource allocation is overlooked through the use of our predictors.

While it is clear that individual-level comorbidities are associated with risk for COVID-19, here we show that communities' clinical co-prevalence structure, satellite indicators of the built environment, and social determinants are predictive of risk. We provide all our tools to monitor these phenomena in real-time ( https://dashboard.xyai-health.com/ )

19 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. 

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint

The copyright holder for this this version posted December 24, 2020. ; 

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint

The copyright holder for this this version posted December 24, 2020. ; https://doi.org/10.1101/2020.12.17.20248360 doi: medRxiv preprint CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint

The copyright holder for this this version posted December 24, 2020. ; 

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint

The copyright holder for this this version posted December 24, 2020. ; . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint

The copyright holder for this this version posted December 24, 2020. ; . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint

The copyright holder for this this version posted December 24, 2020. 

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint

The copyright holder for this this version posted December 24, 2020. ; https://doi.org/10.1101/2020.12.17.20248360 doi: medRxiv preprint

CDC COVID-19 Response Team. Severe Outcomes Among Patients with Coronavirus Disease 2019 (COVID-19) -United States

Hospitalization Rates and Characteristics of Patients Hospitalized with Laboratory-Confirmed Coronavirus Disease 2019 -COVID-NET, 14 States

Characteristics and Clinical Outcomes of Adult Patients Hospitalized with COVID-19 -Georgia

Factors associated with hospital admission and critical illness among 5279 people with coronavirus disease 2019 in New York City: prospective cohort study

Who is at the highest risk from COVID-19 in India? Analysis of health, healthcare access, and socioeconomic indicators at the district level

Viral and host factors related to the clinical outcome of COVID-19

Clinical course and outcomes of critically ill patients with SARS-CoV-2 pneumonia in Wuhan, China: a single-centered, retrospective, observational study

Baseline Characteristics and Outcomes of 1591 Patients Infected With SARS-CoV-2 Admitted to ICUs of the Lombardy Region

Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study

Geographic distribution of diagnosed diabetes in the U.S.: a diabetes belt

Geocoding and monitoring of US socioeconomic inequalities in mortality and cancer incidence: does the choice of area-based measure and geographic level matter?: the Public Health Disparities Geocoding Project

county-level characteristics to inform equitable COVID-19 response. medRxiv

Community-Level Factors Associated With Racial And Ethnic Disparities In COVID-19 Rates In Massachusetts

Links between air pollution and COVID-19 in England

Geospatial Analysis of Individual and Community-Level Socioeconomic Factors Impacting SARS-CoV-2 Prevalence and Outcomes

Zip code caveat: bias due to spatiotemporal mismatches between zip codes and US census-defined geographic areas--the Public Health Disparities Geocoding Project

Use of Deep Learning to Examine the Association of the Built Environment With Prevalence of Neighborhood Adult Obesity

Search & Browse 500 Cities

Multilevel regression and poststratification for small-area estimation of population health outcomes: a case study of chronic obstructive pulmonary disease prevalence using the behavioral risk factor surveillance system

American Community Survey Data

World maps you can deploy on your own

ImageNet Classification with Deep Convolutional Neural Networks

XGBoost: A Scalable Tree Boosting System

Cartographic boundary files -shapefile

Repurposing large health insurance claims data to estimate genetic and environmental contributions in 560 phenotypes

Studying the elusive environment in large scale

Development of exposome correlation globes to map out environment-wide associations

Informatics and Data Analytics to Support Exposome-Based Discovery for Public Health

Physicians and Social Determinants of Health

Individual and community-level risk for COVID-19 mortality in the United States

We thank Emmanuel Coloma and Sumeet Parekh for their help in crafting the figures. This study is funded by XY Health, Inc, a company that develops machine learning approaches for prediction of health outcomes using satellite and land sensor data.

This study was funded by XY Health, Inc. Chirag J Patel and Arjun K Manrai are co-founders and consultants to XY Health, Inc. All other authors are employees of the XY Health, Inc.