key: cord-0797435-3dhumsoa authors: Jannot, A.-S.; Countouris, H.; Burgun, A.; Katsahian, S.; Rance, B. title: COVID-19, a social disease in Paris: a socio-economic wide association study on hospitalized patients highlights low-income neighbourhood as a key determinant of severe COVID-19 incidence during the first wave of the epidemic date: 2020-11-03 journal: nan DOI: 10.1101/2020.10.30.20222901 sha: f9488ba625fae0b3db80ed59e1abf2dd1987cd13 doc_id: 797435 cord_uid: 3dhumsoa Background Studies have already shown that many environmental factors are associated with COVID-19 incidence. However, none have studied a very large set of socio-economic indicators and analysed to what extent these factors could highlight populations at high risk for COVID-19. We propose here a new approach, a socio-economic wide study, to pinpoint subgroups with a high incidence of COVID-19, and illustrated this approach using hospitalized cases in Paris area. Methods We extracted 303 socio-economic indicators from French census data for the 855 residential units in Paris and assessed their association with COVID-19 hospitalization risk. We then fitted a predictive model using a penalized regression on these indicators to predict the incidence of patient hospitalization for COVID-19 in Paris. Findings The most associated indicator was income, corresponding to the 3rd decile of the population (OR= 0.11, CI95% [0.05; 0.20]). A model including only income achieves a high performance in both the training set (AUC=0.78, CI95%: 0.72-0.85) and the test set (AUC=0.79 (CI95%: 0.71-0.87). Overall, the 45% most deprived areas gathered 86% of the areas with a high incidence of COVID-19 hospitalized cases. Interpretation During the first wave of the epidemic, income predicted Paris areas at risk for a high incidence of patients hospitalized for COVID-19 with a high performance. Socio-economic wide association studies, collecting passively data from hospitalized cases, therefore not necessitating any effort for health caregivers, is of particular interest in such a period of hospital overcrowding as it provides real-time indirect information on populations having high COVID-19 incidence. The COVID-19 outbreak has been raging worldwide since late 2019, killing thousands and disrupting health systems with hundreds of hospitalized patients. In France, the peak number of positively tested patients was reached on the 31st of March, with 7578 new cases, most of which were located in the eastern part of France and the greater Paris area. Many environmental factors may contribute to the spread of the outbreak, including population density, pollution, demographics, education and income (1) . Paris, the capital of France, is the largest city in the country, with 2 187 526 inhabitants. Its density is one of the highest in the world, with 20,754 inhabitants per squared kilometre. By comparison, New York City, NY, US, and New Delhi, India, have densities of 7,101 and 5,855 inhabitants per squared kilometre, respectively. Because of the multiple factors influencing the diffusion rate of the virus, the geographical spread is heterogeneous, with some neighbourhoods having a higher incidence than others. From a public health perspective, predicting neighbourhoods at high risk of incidence for severe infections is particularly useful to organize targeted actions and to reduce hospital overcrowding and patient burden in such areas. While clinical factors can easily be retrieved from electronic health records, socio-economic determinants are not always available but can be gathered from census data available through patient geolocalization. In most countries, census data enable the assessment of a large variety of socio-economic indicators at fine geographical levels, and many ecological studies have already been performed on COVID-19 using these data (2, 3) . For instance, in patients from the UK biobank, a striking gradient in COVID−19 hospitalization rates according to the Townsend Deprivation Index − a composite measure of socio-economic deprivation − and household income was observed (4) . In the US, wide inequities have been shown (5) in COVID-19 positivity and incidence, with less advantaged neighbourhoods having a higher incidence in Chicago, NYC and Philadelphia. In the megacity of Chennai (India) (6), spatial risks of COVID-19 were predominantly concentrated in wards with higher levels of area deprivation, which were mostly located in the northeast parts of Chennai. The same holds for England and Wales (7) . Other indicators among the many analysed have been shown to be associated with COVID-19 incidence, such as ethnicity (8) and population density (9) . Previous studies only analysed a few indicators together and showed that most of them were associated with COVID-19 incidence. A socio-economic environment-wide analysis considering a very large set of socio-economic indicators would enable a global picture of neighbourhoods at risk. This data-mining strategy has become common since the first genome-wide association studies (10) and has been adapted to many kinds of data, such as phenotypes (11) and medications (12) . This type of All rights reserved. No reuse allowed without permission. perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in The copyright holder for this this version posted November 3, 2020. ; https://doi.org/10.1101/2020.10.30.20222901 doi: medRxiv preprint study would enable the pinpointing of specific environmental characteristics strongly associated with COVID-19 risk with no a priori hypothesis regarding the type of associated indicator. Using highly associated characteristics, a predictive model could be inferred to precisely highlight neighbourhoods at risk for a greater incidence of severe COVID-19. In this paper, our objective is to perform a socio-economic environment-wide study and to analyse to what extent a large set of census indicators can predict the incidence of hospitalization for COVID-19 in the Paris area. Our analysis follows recommendations provided by the REporting of studies Conducted using Observational Routinely-collected health Data (RECORD) Statement. Completed checklist is available as supplementary material (Table S2 ). The socio-economic wide association study method presented here aims at testing the association between a large set of socio-economic indicators from French census data and COVID-19 hospitalization risk. The AP-HP is the largest hospital entity in Europe, with 39 hospitals (22,474 beds) mainly located in the Greater Paris area, with 1.5 M hospitalizations per year (10% of all hospitalizations in France). Because of the centralized healthcare organization during the COVID epidemic, COVID-infected patients were mostly hospitalized in AP-HP hospitals, and private clinics were not habilitated to take charge of COVID patients. Since 2014, the AP-HP has deployed an analytics platform based on a clinical data repository (CDR), aggregating day-to-day clinical data from 8.8 million patients captured by clinical databases. An "EDS-COVID" database stemmed from this initiative. The AP-HP COVID database retrieved electronic health records from all AP-HP facilities and aggregated them into a clinical data warehouse following the OMOP common data model (13) . The clinical data warehouse allows a large set of data to be retrieved in real time to deeply characterize hospitalized patients, including their residential address. Using the IRIS Paris division and adopting one-week periods, we aggregated the data for the new patients who tested positive for COVID-19 in one of the AP-HP facilities and lived in Paris. inhabitants are divided into several IRIS, while smaller towns form one IRIS each). There are two All rights reserved. No reuse allowed without permission. perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in The copyright holder for this this version posted November 3, 2020. ; https://doi.org/10.1101/2020.10.30.20222901 doi: medRxiv preprint types of IRIS: business and residential. The homogeneity of each unit is based mainly on habitat type (residential area, public housing, etc.) Therefore, each French residential IRIS has relatively homogeneous social characteristics. New patients who tested positive by polymerase chain reaction (PCR) as being infected by SARS-CoV-2, hospitalized in one of the AP-HP hospitals and living in Paris were aggregated at the IRIS Paris division level to constitute the dataset for this study. For each Paris IRIS residential unit, we estimated the proportion of patients hospitalized at AP-HP from the 1 st of February until the 4 th of June. High-and low-risk IRIS units were defined using the first decile of this proportion. We then assessed the association of a given socio-economic indicator above median value with having a high COVID-19 risk for each of the 303 socio-economic indicators using a chi-square test. This enables an equivalent power when testing the association with each indicator. This is an important point of the methodology of this study, as it enables us to directly compare the magnitude of odds ratios for each indicator. P-values for the different performed tests were represented through a Manhattan plot using the indicator type as the variable group. We randomly divided the IRIS residential unit into a training (66%) and a test (33%) dataset. In the training set, we selected for the prediction model all indicators reaching significance at the 0.1% level and having a correlation between each other lower than 80%. To achieve this selection, we ranked indicators reaching significance at 0.1% from the lower to the higher p-values, and we withdrew each indicator presenting a correlation larger than 80% with one of the other indicators. We then performed a multivariate regularized logistic regression using lasso cross-validated penalty, with the selected indicators as the explaining variable and COVID-19 risk as the variable to explain. We estimated the prediction performance of the fitted model on the training set by estimating the area under the curve (AUC). We simplified the full model when relevant: for each included variable in this model, we also estimated the corresponding AUC and compared it with the AUC of the full model using the Delong method. For the final model, we then estimated the threshold maximizing the Youden index and the corresponding sensitivity and specificity. All rights reserved. No reuse allowed without permission. perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in The copyright holder for this this version posted November 3, 2020. ; https://doi.org/10.1101/2020.10.30.20222901 doi: medRxiv preprint We applied the final model to the test set and estimated the AUC and the sensitivity and specificity corresponding to the above estimated Youden index. To take into account possible spatial autocorrelation, we tested spatial autocorrelation in the final model residuals using the Moran's I test. All analyses were performed using R software with the ggplot, pROC and DHARMa packages. perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in Using a high-dimensional set of socio-economic indicators, we show that many socio-economic indicators are strongly associated with an area of high incidence of hospitalized patients for COVID-19 in Paris, with an odds ratio of magnitude of approximately 5 for indicators measuring deprivation. As a consequence, several of these indicators, such as income, proportion of the population with no All rights reserved. No reuse allowed without permission. perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in The copyright holder for this this version posted November 3, 2020. ; https://doi.org/10.1101/2020.10.30.20222901 doi: medRxiv preprint diploma, and the rate of workers in unskilled occupations, are by themselves highly predictive of hospitalized case clusters. Overall, the 45% most deprived areas comprised 86% of the areas with a high incidence of COVID-19 hospitalized cases. This type of study, collecting passively data from hospitalized cases, therefore not necessitating any effort for health caregivers, is of particular interest in such a period of hospital overcrowding. Indeed, it can provide real-time information on populations having high COVID-19 incidence. This type of study, with nearly no cost, has to be faced to seroprevalence studies deemed to be biased due to non-respondents, involving huge research efforts and not useable for real-time surveillance. This study highlights the high pertinence of geographic breakdown at the small IRIS level that is finely described using census data. This can both preserve patients' privacy and finely depict patients' environment. Indeed, geolocalization based on the IRIS area level renders the data anonymous while being pertinent to describing patients' socio-economic environment, as attested by the strong value of the predictive model built using these data. This study is based on declared residential addresses, which can be different from a patient's living residence at the time of hospitalization. Interestingly, 14% of hospitalized patients have their residential address outside the Ile-de-France region and are therefore located far from Paris. Although some transfers from remote hospitals occurred during the study period, we cannot exclude the possibility that some patients were actually living in Paris at the time of their hospitalization. The same holds for the considered indicators, which were based on census data that might differ during the study period because a large exodus from Paris was observed when lockdown started: during lockdown, in addition to the departure of foreign visitors and overseas departments, Paris saw its current population decrease by 450,000 people (i.e., 20%). Half of this population decline was due to non-residents of the capital who were able to return home, and the other half was due to Parisians leaving their city (15). Only public hospitals were considered for this study. This is unlikely to have introduced a bias because private facilities were not allowed to take charge of patients affected by COVID-19, with the exception of non-profit private facilities, in which patients were hospitalized according to the same financial conditions as public hospitals. Moreover, non-profit private hospitals are distributed evenly across Paris like public hospitals are. A significant portion of hospitalized patients came from medical institutions, such as medico-welfare and long-term care establishments. These patients were not considered in our study, as the vast majority of these establishments are located in non-residential areas. Considering such patients in All rights reserved. No reuse allowed without permission. perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in The copyright holder for this this version posted November 3, 2020. ; https://doi.org/10.1101/2020.10.30.20222901 doi: medRxiv preprint our study would have introduced a bias as the transmission mechanism underlying these cases is different from the general population. Notably, we did not include air pollution data because they were not available at the same scale. Some studies have already shown that air pollution might increase the COVID-19 mortality rate (16, 17) , but these studies did not consider socio-economic determinants, while low-income areas are usually located in more polluted areas (18) . In Paris, low-income areas are concentrated near the Parisian ring road, which is a highly polluted area. Therefore, it would have been interesting to have these data considered in our model. Moreover, from the map, it can visually be deduced that income is a more important determinant than ring road proximity. The predictive value of low-income neighbourhoods on epidemic clusters is high and can be explained both by factors favouring infection, such as occupational status, and by factors favouring severe infection in infected patients, such as obesity and hypertension. The subgroup pinpointed as overrepresented in the high-risk area, i.e., the population with unskilled occupations living in deprived areas, is not considered to be at high risk of severe COVID-19 because these people are still active, i.e., below 65 years, and age is the major risk factor for severe COVID-19 infection. However, they might have a high rate of infection because of their professional status (residence employee, housekeeper, delivery driver, etc.) involving a large number of professional contacts with the more vulnerable population. This strong predictive value of income level for areas at high risk has to be compared with the low level of inequalities in France compared to the level worldwide (19). The French health care system is ranked among the better regarding waiting times, results, and generosity (20), which renders these results unexpected. Indeed, many studies have performed ecological analysis on COVID-19 mortality and hospitalization rates but have not found such a strong association in other countries, including the US and the United Kingdom, but their study design was slightly different, with an odds ratio of approximately 3 for poverty in Chicago (21) and of approximately 2 in England (1), for instance, while odds ratios were as high as 5 in our study. We did not find any association with population density. The population density in residential areas is high (37,000 inhabitants per km2 in residential areas), but the interquartile range is also high (from 28,000 to 46,000 inhabitants per km2 in residential areas). Therefore, this lack of association is not explained by a homogeneous population density. In Paris, population density is not strongly correlated with population wealth, as opposed to the case in most other large cities, where parts with high densities correspond to low-income neighbourhoods (21, 22) . This might explain why population density was associated with high COVID-19 incidence in other areas but not in Paris. All rights reserved. No reuse allowed without permission. perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in The copyright holder for this this version posted November 3, 2020. ; https://doi.org/10.1101/2020.10.30.20222901 doi: medRxiv preprint Moreover, most previous studies focus on COVID-19 incidence, sometimes in relationship with testing incidence (5) . COVID-19 incidence strongly depends on testing policy, which is associated with socio-economic determinants of the population. Therefore, analysing the association between COVID-19 incidence and socio-economic determinants is deemed to be biased. This is why we focused on COVID-19 hospitalized cases, which better reflect epidemic spread because they are less related to testing policy. This was particularly relevant in this study, as a low-income population might have difficulties accessing testing facilities (23) because of the testing cost and long working hours of this population. This study only includes data from the first wave of the epidemic. Different mechanisms could be at stake in the future. In Germany, it was shown that different mechanisms were responsible for epidemic spread at different stages of the epidemic (24) . The period analysed in this study corresponds mostly to lockdown, which has a strong impact on population behaviour in France and therefore might have led to very specific mechanisms for epidemic spread. Nevertheless, our study highlighted that the proposed methodology in this paper, a socio-economic wide association study in hospitalized patients, can provide indirect information on populations having high COVID-19 incidence. It is therefore of utmost importance to organize a real-time follow-up of these socioeconomic indicators for newly hospitalized patients using such an approach. perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in The copyright holder for this this version posted November 3, 2020. ; https://doi.org/10.1101/2020.10.30.20222901 doi: medRxiv preprint Factors associated with COVID-19-related death using OpenSAFELY An ecological study of socioeconomic predictors in detection of COVID-19 cases across neighborhoods in New York City Social Determinants of COVID-19 in Massachusetts, United States: An Ecological Study Socioeconomic Deprivation, and Hospitalization for COVID-19 in English participants of a National Biobank Spatial Inequities in COVID-19 outcomes in Three US Cities Modeling the effect of area deprivation on COVID-19 incidences: a study of Chennai megacity Covid-19: People in most deprived areas of England and Wales twice as likely to die Coronavirus disease 2019 mortality: a multivariate ecological analysis in relation to ethnicity, population density, obesity, deprivation and pollution High population densities catalyse the spread of COVID-19 The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019 PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations Natural Language Processing for Rapid Response to Emergent Diseases: Case Study of Calcium Channel Blockers and Hypertension in the COVID-19 Pandemic Observational Health Data Sciences and Informatics (OHDSI): Opportunities for Observational Researchers COVID-19 prevalence and fatality rates in association with air pollution emission concentrations and emission sources All rights reserved. No reuse allowed without permission. perpetuity the author/funder, who has granted medRxiv a license to display the preprint in The copyright holder for this this version posted Air Pollution and Individual and Neighborhood Socioeconomic Status: Evidence from the Multi-Ethnic Study of Atherosclerosis (MESA) COVID-19 and Inequity: a Comparative Spatial Analysis of New York City and Chicago Hot Spots COVID-19 and Slums: A Pandemic Highlights Gaps in Knowledge About Urban Poverty Disparities in COVID-19 Testing and Positivity in New York City The Pandemic Predominantly Hits Poor Neighbourhoods? SARS-CoV-2 Infections and Covid-19 Fatalities in German Districts We would like to acknowledge the authors thank the EDS APHP Covid consortium integrating the APHP Health Data Warehouse team as well as all the APHP staff and volunteers who contributed to the implementation of the EDS-COVID database and operating solutions for this database (list in the Supplementary Table S1 ).Funding H. Countouris received funding from Canceropole Ile-de-France to develop the geolocalization method used for this study. AS Jannot received funding for this project from the COVID sponsorship (mécénat « Collecte Crise Covid ») of AP-HP Centre Université de Paris. The authors declare no other support from any organisation for the submitted work, no financial relationships with any organisations that might have an interest in the submitted work in the previous three years and no other relationships or activities that could appear to have influenced the submitted work. All authors have seen and approved the manuscript. This study was approved by the institutional review board (authorization number IRB 00011591) from All rights reserved. No reuse allowed without permission. perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in The copyright holder for this this version posted November 3, 2020. ; https://doi.org/10.1101/2020.10.30.20222901 doi: medRxiv preprint All rights reserved. No reuse allowed without permission. perpetuity. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in The copyright holder for this this version posted November 3, 2020. ; https://doi.org/10.1101/2020.10.30.20222901 doi: medRxiv preprint