key: cord-0865309-tr6vyae8
authors: Su, C.; Zhang, Y.; Flory, J. H.; Weiner, M. G.; Kaushal, R.; Schenck, E. J.; Wang, F.
title: Novel clinical subphenotypes in COVID-19: derivation, validation, prediction, temporal patterns, and interaction with social determinants of health
date: 2021-03-02
journal: nan
DOI: 10.1101/2021.02.28.21252645
sha: 23e709eb0e0d9ee0efcf86f61fa9869ea3fc8b1b
doc_id: 865309
cord_uid: tr6vyae8

Coronavirus disease 2019 (COVID-19) is heterogeneous, and our understanding of the biological mechanisms of host response to the novel viral infection remains limited. Identification of meaningful clinical subphenotypes may benefit pathophysiological study, clinical practice, and clinical trials. Here, our aim was to derive and validate COVID-19 subphenotypes using machine learning and routinely collected clinical data, assess temporal patterns of these subphenotypes during the pandemic course, and examine their interaction with social determinants of health (SDoH). We retrospectively analyzed 14418 COVID-19 patients in five major medical centers in New York City (NYC), between March 1 and June 12, 2020. Using clustering analysis, four biologically distinct subphenotypes were derived in the development cohort (N = 8199). Importantly, the identified subphenotypes were highly predictive of clinical outcomes (especially 60-day mortality). Sensitivity analyses in the development cohort, and re-derivation and prediction in the internal (N = 3519) and external (N = 3519) validation cohorts confirmed the reproducibility and usability of the subphenotypes. Further analyses showed varying subphenotype prevalence across the peak of the outbreak in NYC. We also found that SDoH specifically influenced mortality outcome in Subphenotype IV, which is associated with older age, worse clinical manifestation, and high comorbidity burden.
Our findings may lead to a better understanding of how COVID-19 causes disease in different populations and potentially benefit clinical trial development. The temporal patterns and SDoH implications of the subphenotypes may add new insights to health policy to reduce social disparity in the pandemic.

The outbreak of coronavirus disease 2019 (COVID-19), caused by the novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection, has led to a pandemic that imposed tremendous pressure on healthcare systems globally 1. As the pandemic continues and a second wave has emerged in the US and many other countries, research is still needed to understand how SARS-CoV-2 causes the wide spectrum of COVID-19 disease. Previous studies have uncovered substantial variation in the host response to SARS-CoV-2 and in the clinical manifestations of this disease, including respiratory failure, kidney injury, and cardiovascular dysfunction 2-8. Pivotal studies of corticosteroids 9 and anticoagulation 10,11 demonstrate differential responses in distinct subpopulations based on severity of disease. The pathophysiology of differential organ dysfunction in COVID-19 remains unclear across varied patient populations. Prior to the COVID-19 pandemic, identification of biologically distinct, data-driven subphenotypes 12,13 helped to disentangle complex syndromic diseases such as sepsis 14,15, ARDS 16, heart failure 17,18, diabetes 19, and Alzheimer's disease 20. Identifying robust subphenotypes in COVID-19 patients could improve understanding of the biological mechanisms of host response to SARS-CoV-2 infection and may identify subpopulations that could be prioritized for clinical trial enrollment 13,21. Previous efforts 22-24 have been made in this area but remain limited, probably because of cohort size, data availability, and a lack of evaluation of the robustness and usability of the identified subphenotypes.
In addition, the hospitalized case fatality rate of COVID-19 has varied over the course of the pandemic 25,26 and according to social determinants of health (SDoH) 27-29. Exploring temporal patterns and SDoH characteristics in conjunction with subphenotypes may yield new insights to improve public health.

In this analysis, our goal was to derive and validate COVID-19 subphenotypes among patients who presented to the emergency department (ED) or were hospitalized in multiple health systems in New York City (NYC). Specifically, we first used routinely collected clinical data to derive subphenotypes with an agglomerative hierarchical clustering model. Then, multiple strategies in data pre-processing, data filtering, and data-driven models (both an unsupervised clustering model and a supervised predictive model) were used to confirm the reproducibility and usability of the identified subphenotypes. After that, statistical analyses were conducted to evaluate the characteristics and clinical outcomes of the subphenotypes. Further analyses were performed to examine temporal patterns of the subphenotypes and the impact of SDoH status on subphenotype-level outcomes. The overall workflow of our study is illustrated in [...]

[...] (20.2%) black patients. Across the three cohorts, the overall 60-day mortality rates after ED or hospital discharge were 18.65%, 19.78%, and 20.59%, respectively. More details of the characteristics of the studied cohorts are given in Table 1. In the development cohort, the agglomerative hierarchical clustering model identified 4 distinct subphenotypes based on presenting clinical data of the patients (see eResults, eFigures 3 and 4 in Supplement).
Characteristics including demographics, clinical variables, comorbidities, clinical outcomes, and medication treatments across the 4 subphenotypes were presented in [...]

The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission. medRxiv preprint, this version posted March 2, 2021. https://doi.org/10.1101/2021.02.28.21252645

In the development cohort, sensitivity analyses under two different settings (sensitivity to quality control and outliers, and sensitivity to clustering methods) confirmed the underlying 4-cluster structure of the data (see eResults, eFigures 3 and 4, and eTable 5 in Supplement). Patients' memberships of the 4 clusters re-derived by the sensitivity analyses were highly consistent with those derived in the primary analysis (see eFigure 6 in Supplement). Moreover, we did not find substantial changes in the clinical characteristics of the subphenotypes in the sensitivity analyses (see eTables 6 and 7 in Supplement). Subphenotypes were also re-derived in the internal validation cohort, where the 4-cluster structure was again found to be the optimal fit (see eResults and eFigure 7 in Supplement). Clinical characteristics of the re-derived subphenotypes in the internal validation cohort, including demographics, laboratory variables, comorbidities, and clinical outcomes, also showed patterns very similar to those of the subphenotypes derived in the primary analysis (see Figure 3, and eTable 8 and eFigure 8 in Supplement).

To further evaluate subphenotype robustness and usability, we trained a predictive [...] After that, the trained predictive model was used to predict subphenotype memberships of patients in the external validation cohort.
The predicted subphenotypes in the external validation cohort were well separated in the UMAP space (see eFigure 11 in Supplement) and showed clinical characteristics similar to the findings of the primary analysis (see Figure 3, and eTable 9 and eFigure 12 in Supplement).

Temporal patterns of the COVID-19 subphenotypes are illustrated by bar charts showing the composition of subphenotype memberships of patients confirmed per week since the outbreak in NYC, i.e., March 1, 2020 (see Figures 4a-c). Except for weeks 1 and 14, in which few patients were confirmed, the composition of the four subphenotypes per week evolved over time and showed similar patterns across the development, internal validation, and external validation cohorts. In general, the number of patients with confirmed SARS-CoV-2 infection increased rapidly within the first month of the outbreak and peaked at week 5 (early April). Subphenotype I (mild symptom) and Subphenotype II (moderate symptom, low comorbidity burden) dominated the period prior to the peak (first 4 weeks since outbreak). In contrast, Subphenotype IV (severe symptom, high comorbidity burden) had a low proportion within the first 4 weeks but a much larger proportion from week 6 to week 9. From week 10 onward, the proportion of Subphenotype I gradually increased while the others, especially Subphenotype IV, shrank. Subphenotype III (moderate symptom, high comorbidity burden) had a relatively stable proportion over time.

In general, worse SDoH in terms of the socioeconomic variables was likely related to Subphenotype IV (see eTable 10 in Supplement).
Moreover, logistic regression analysis identified similar patterns of relationships between the SDoH variables and 60-day mortality risk across subphenotypes; however, the absolute log odds of the SDoH variables varied across subphenotypes (see Figure 4d and eTable 11 in Supplement). For example, low absolute log odds were observed for all six SDoH variables in Subphenotype I. In contrast, we saw increased absolute log odds for all six SDoH variables in Subphenotype IV. Agglomerative hierarchical clustering based on the SDoH variables grouped the patients into a 3-cluster model (see eResults and eFigure 13 in Supplement), which can be interpreted as high (H), middle (M), and low (L) SDoH strata (see eTable 12 in Supplement). Stratum L, [...] were shown in eTable 13 in Supplement. In the analysis to further explore how SDoH strata affected the outcome of each biological subphenotype, we found varied patterns of correlation between SDoH strata and 60-day mortality by subphenotype (see Figure 4e). Notably, in line with the results of the univariate analysis above, SDoH strata were likely to have a strong impact on 60-day mortality in Subphenotype IV. In particular, in Subphenotype IV, SDoH stratum L was associated with a 55.19% 60-day mortality rate, which was 5.55 percentage points higher than the subphenotype-level rate (49.64%, see Table 2) and 8.52 percentage points higher than that of SDoH stratum H. In Subphenotypes I, II, and III, we did not find a mortality rate discrepancy greater than 3 percentage points between any pair of SDoH strata. Similarly, taking stratum H as the reference, stratum L had substantially increased log odds of mortality in Subphenotype IV (log odds = 0.40, SD = 0.19, P-value = 0.04).
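The figures quoted above for Subphenotype IV can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration, not part of the original analysis. It assumes that the reported "SD = 0.19" behaves like a standard error of the log odds, so that a Wald z-statistic and a normal-approximation interval apply. The stratum H rate it prints is implied by the reported differences rather than stated in the text.

```python
import math

# Values reported in the text for Subphenotype IV (60-day mortality, in %).
rate_stratum_l = 55.19      # SDoH stratum L
rate_subphenotype = 49.64   # whole subphenotype (Table 2)
gap_l_vs_h = 8.52           # reported gap between strata L and H

# Differences between rates are percentage points, not relative changes.
gap_l_vs_overall = rate_stratum_l - rate_subphenotype
implied_rate_stratum_h = rate_stratum_l - gap_l_vs_h  # implied, not reported

# Reported log odds for stratum L versus stratum H.
log_odds = 0.40
spread = 0.19  # ASSUMPTION: the reported "SD" is used here as a standard error

odds_ratio = math.exp(log_odds)
z = log_odds / spread
# Two-sided p-value under a standard normal approximation (Wald test).
p_two_sided = math.erfc(abs(z) / math.sqrt(2))
ci_low = math.exp(log_odds - 1.96 * spread)
ci_high = math.exp(log_odds + 1.96 * spread)

print(f"L vs subphenotype: {gap_l_vs_overall:.2f} percentage points")
print(f"implied stratum H rate: {implied_rate_stratum_h:.2f}%")
print(f"odds ratio: {odds_ratio:.2f} (approx. 95% CI {ci_low:.2f} to {ci_high:.2f})")
print(f"approx. two-sided p: {p_two_sided:.3f}")
```

Under these assumptions, a log odds of 0.40 corresponds to an odds ratio of about 1.49, and the approximate p-value rounds to the reported 0.04.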
We derived subphenotypes of COVID-19 patients treated at five major medical centers in NYC across the whole course of the first wave of the pandemic, using clinical data at presentation to the emergency department (ED) or hospital. Unlike previous subphenotype studies of COVID-19 22-24, we focused on a larger, more representative, and diverse population presenting to the ED and/or hospitalized without COVID-19 specific therapy. We derived subphenotypes using clustering analysis in the development cohort and validated them using a combination of multiple validation strategies, including different data processing, different data filtering, and different machine learning models (both unsupervised clustering and supervised predictive models). All validation approaches confirmed the reproducibility of the 4-cluster structure of the data and the clinical characteristics of the identified subphenotypes. We would also highlight that all machine learning models used for subphenotype derivation and validation were run only on presenting clinical variables that are routinely collected in daily patient care and are available to providers at ED or hospital admission. This not only allows us to potentially capture the underlying variable mechanisms of this complex disease, but also enhances the generalizability and feasibility of using the identified subphenotypes in clinical practice and in patient enrollment for clinical trials.

Importantly, the 4 subphenotypes identified were significantly separated in demographics, clinical variables, and chronic comorbidities, and were strongly predictive of the 60-day mortality outcome.
Subphenotype IV included more older, male patients, with abnormal markers indicating hyperinflammation, liver injury, cardiovascular problems, renal dysfunction, and coagulation disorders, and had a higher comorbidity burden than the other subphenotypes. In contrast, Subphenotype I was composed of relatively healthy, younger females who had more normal marker values and lower comorbidity burdens than the other subphenotypes. There was strong concordance between clinical profiles and outcomes: Subphenotype IV showed the worst clinical outcome, while Subphenotype I showed the best outcome among the 4 subphenotypes. These findings are in line with observations reported in a previous small cohort study 23. Interestingly, Subphenotypes II and III showed similar, moderate-level 60-day mortality rates, but their clinical characteristic profiles suggested that they were likely to have distinct biological mechanisms. In particular, results from our primary analysis and validation approaches demonstrated that Subphenotype II was correlated with relative hyperinflammation, while Subphenotype III was associated with renal injury, lower platelet level, and a high comorbidity burden (significantly higher than Subphenotypes I and II, and equivalent to Subphenotype IV). Moreover, in accordance with the clinical characteristics and outcomes, the worse subphenotypes (Subphenotypes III and IV) were more likely than the others to receive antibiotics, corticosteroids, and vasopressors. These findings suggest that our identified subphenotypes offer insight into the varied mechanisms of COVID-19.
Typically, data-driven approaches for identifying subphenotypes of human disease are based on unsupervised clustering methods 12,14-16,22-24,31. Because unsupervised methods discover underlying patterns directly from data, they are a natural fit for subphenotype identification. Once the subphenotypes were determined, there would be a [...]

Time is a crucial factor in the spread of COVID-19. Previous studies have examined the temporal trends of COVID-19 outcomes, such as in-hospital mortality rate, during the course of the pandemic 25,26, but limited attention has been paid to evolving patterns of COVID-19 phenotypes. We addressed this gap in the present study. Our observations suggested varied temporal trends of the identified subphenotypes during the first 14 weeks of the pandemic in NYC. Interestingly, after the COVID-19 outbreak in NYC on March 1, 2020, Subphenotypes I and II dominated the period prior to the peak (first 4 weeks since outbreak), possibly because they contained more relatively young patients, who may have had more frequent social contact and therefore more exposure to infection. Subphenotype IV, with older age, worse health conditions, and poorer outcomes, increased within the second month (April 2020), after the peak of spread, consistent with the very high mortality rate in NYC in April 32. This would suggest that younger, biologically strong patients (Subphenotypes I and II) were infected early and drove the spread, while older, biologically vulnerable patients (Subphenotype IV) accounted for the secondary infections within the population, probably due to housing.
After that, the proportion of Subphenotype I among all patients confirmed per week gradually expanded while that of the others, especially Subphenotype IV, shrank. A potential reason is that accumulated experience (such as the improved use of masks and social distancing), reinforced health care systems, and announced health policies protected the population most likely to develop the severe subphenotype (Subphenotype IV). In general, such temporal trends of the novel biological subphenotypes may offer a fine-grained explanation of the evolving trends in outcome (mortality rate) observed in epidemiology 25.

SDoH such as vulnerable socioeconomic neighborhood status have been associated with poor outcomes of COVID-19 25,29. In this work, we explored the impact of SDoH on different biological subphenotypes from both univariate and multivariate perspectives. We first examined the associations of individual socioeconomic characteristics with mortality risk in each subphenotype. We then derived comprehensive SDoH strata using the data-driven clustering method and evaluated their correlations with mortality risk in each subphenotype. The results confirmed our hypothesis that SDoH impacts biological subphenotypes differently. The markedly larger mortality-risk log odds of individual SDoH variables and the discrepancy in mortality rate among SDoH strata indicate that SDoH has a much stronger association with mortality outcomes in Subphenotype IV than in the others. In other words, when a sick, elderly patient presents with COVID-19 (Subphenotype IV), disadvantaged socioeconomic status significantly increases their mortality risk.
In contrast, disadvantaged SDoH status was unlikely to lead to significantly increased mortality risk in Subphenotype I. This evidence further demonstrated that the COVID-19 pandemic has disproportionately affected patients with lower socioeconomic status. In general, our findings added new information on social disparities in the COVID-19 pandemic. Unlike previous studies 28,29,33,34 that focused on the entire population, we approached the question from a new angle by focusing on biologically different populations (i.e., subphenotypes). Our findings also suggest that the identified subphenotypes could provide guidance for health policy to reduce social disparities in the pandemic.

While this study presents a new contribution to the efforts to parse the biological heterogeneity of COVID-19, several limitations remain. First, our data-driven approach relied on the availability of patient data. In this study, we identified subphenotypes using the routinely collected clinical variables that were correlated with COVID-19 35 and available in the INSIGHT database 36. We were not able to extract presenting symptoms and vital-sign data, although incorporating such data could add new insights. Second, the analyzed data were collected at ED or hospital presentation, so the time between COVID-19 symptom onset and ED or hospital presentation could be a covariate of disease severity and clinical outcomes. However, such data were not available in the INSIGHT database. Third, missing values may affect the robustness of the identified subphenotypes. To address this issue, we excluded variables with high missingness.
For the remaining variables, we used the K-nearest neighbors imputation algorithm 37. Even so, the true values remain unknown, so the imputed values may introduce bias. Fourth, our study was based on presenting clinical data, such that each patient was characterized in a snapshot. Making full use of patients' longitudinal data may allow us to capture the complexity of the disease arc and identify further subphenotypes. Previous studies tried to derive COVID-19 subphenotypes based on longitudinal information 22,24, yet they were based on univariate trajectory data in small cohorts. The collection of multivariate, longitudinal data in large cohorts remains challenging, and modeling such data to identify subphenotypes requires improved data-driven methods 12,13,21. Fifth, this is a multi-institutional analysis in NYC. To evaluate the generalizability of the identified subphenotypes, further validation on data collected from other areas is needed in future work.

We [...]

We considered 30 clinical variables associated with COVID-19 onset, symptoms, or outcomes 37 and available in the INSIGHT database as the candidate variables to derive subphenotypes. The variables included inflammatory markers (C-reactive protein, erythrocyte sedimentation rate [...]

We also examined other clinical characteristics of the patients, including demographics, comorbidities, and body mass index (BMI). Demographics included age, sex, and race.
Baseline comorbidities included hypertension, diabetes, coronary artery disease (CAD), heart failure, chronic obstructive pulmonary disease (COPD), asthma, cancer, obesity, and hyperlipidemia. For each patient, the most recent BMI data were collected.

We analyzed 60-day all-cause mortality as the primary outcome for the patients. Need for mechanical ventilation and admission to the intensive care unit (ICU) were the secondary outcomes. We also analyzed the treatments for COVID-19, including antibiotics (combining ceftriaxone, azithromycin, piperacillin-tazobactam, meropenem, vancomycin, and doxycycline), corticosteroids (combining prednisone, methylprednisolone, dexamethasone, and hydrocortisone), hydroxychloroquine, enoxaparin, heparin, and vasopressors.

To explore the impact of SDoH status on the subphenotypes, we extracted patients' neighborhood socioeconomic characteristics, including median household income, percentage of residents without a high school degree, percentage of residents who are essential workers, percentage of households with crowded housing conditions (i.e., households with >1 person per room), percentage of non-white residents, and unemployment rate. These characteristics were extracted from the 2018 American Community Survey 38. Previous studies have indicated that these social conditions are associated with a higher probability of infection, hospitalization, and other adverse outcomes.

We first assessed the value distributions and missingness of the 30 candidate clinical variables (see eTables 2 and 3 in Supplement).
For data quality control, 7 variables with high missingness (more than 70% of values missing) were excluded, and the remaining 23 variables were used for deriving subphenotypes. Logarithmic transformation was applied to the non-normally distributed variables (see eTable 4 in Supplement). To eliminate the effects of value magnitude, all variables were scaled using z-scores. K-nearest neighbors (KNN) imputation 39 was used to address missing values (see eMethods in Supplement).

We originally derived subphenotypes using the development cohort. More specifically, agglomerative hierarchical clustering with Euclidean distance and the Ward linkage criterion 40 was applied to the 23 clinical variables after data preparation. We used agglomerative hierarchical clustering because it is robust to different types of data distributions and produces a dendrogram that visualizes the data structure and helps determine the optimal number of clusters. In addition to the dendrogram, we calculated 21 measures of clustering models provided by the 'NbClust' software 41 to determine the optimal number of clusters, i.e., subphenotypes.

To evaluate reproducibility, we validated our subphenotypes in three ways. First, we performed sensitivity analyses using the development cohort to evaluate 1) sensitivity to quality control and outliers and 2) sensitivity to clustering algorithms. To assess sensitivity to quality control and outliers, we incorporated all 30 candidate variables and excluded patients who had outlier values (see eMethods in Supplement).
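The preparation and clustering steps described above (log transformation, z-score scaling, then agglomerative clustering under the Ward criterion) can be sketched on toy data as follows. This is a minimal standard-library illustration, not the study's implementation, which relied on 'scikit-learn' and 'scipy'. The function names `zscore_columns` and `ward_clusters` and the two-variable toy values are hypothetical, and KNN imputation is omitted because the toy data contain no missing values.

```python
import math
from statistics import mean, pstdev

def zscore_columns(rows):
    """Scale every column to mean 0 and population SD 1."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]

def ward_clusters(points, n_clusters):
    """Naive agglomerative clustering with the Ward criterion.

    Each step merges the two clusters whose union gives the smallest
    increase in within-cluster sum of squares:
        cost(A, B) = |A||B| / (|A| + |B|) * ||mean(A) - mean(B)||^2
    """
    clusters = [{"members": [i], "centroid": list(p)} for i, p in enumerate(points)]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca, cb = clusters[a], clusters[b]
                na, nb = len(ca["members"]), len(cb["members"])
                d2 = sum((x - y) ** 2 for x, y in zip(ca["centroid"], cb["centroid"]))
                cost = na * nb / (na + nb) * d2
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        ca, cb = clusters[a], clusters[b]
        na, nb = len(ca["members"]), len(cb["members"])
        merged = {
            "members": ca["members"] + cb["members"],
            "centroid": [(na * x + nb * y) / (na + nb)
                         for x, y in zip(ca["centroid"], cb["centroid"])],
        }
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    labels = [0] * len(points)
    for k, c in enumerate(clusters):
        for i in c["members"]:
            labels[i] = k
    return labels

# Hypothetical toy data: two right-skewed laboratory-like variables per patient.
toy = [
    [5.0, 0.8], [6.0, 0.9], [7.0, 0.7],        # low values
    [150.0, 3.1], [180.0, 2.8], [160.0, 3.5],  # high values
]
logged = [[math.log(v) for v in row] for row in toy]  # log-transform skewed values
scaled = zscore_columns(logged)                       # remove magnitude effects
labels = ward_clusters(scaled, n_clusters=2)
print(labels)
```

The pairwise search is cubic in the number of patients, so this form is useful only for checking the logic on small inputs; a cohort-scale analysis would use an optimized library routine.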
Then, as in the primary analysis, we performed agglomerative hierarchical clustering to re-derive subphenotypes and determined the optimal cluster number using the dendrogram and 'NbClust'. To assess sensitivity to clustering algorithms, we re-derived subphenotypes using the Gaussian mixture model (GMM) 42, which is a probabilistic model for clustering analysis based on a mixture of Gaussian distributions. The optimal cluster number in GMM was determined by jointly considering the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and the median probability of group membership (see eMethods in Supplement).

Second, we used the internal validation cohort and re-derived subphenotypes using the same agglomerative hierarchical clustering as in the primary analysis. The optimal cluster number was again determined using the dendrogram and 'NbClust'.

Last, to confirm the subphenotypes and their usability, we used a supervised predictive model. More specifically, treating the subphenotype membership of each patient as the label to predict, we built a predictive model of subphenotypes based on the 23 clinical variables used for subphenotype derivation. The predictive model was based on the supervised XGBoost classifier 43, a powerful tree-based machine learning model. The predictive model was trained in the development cohort using a 10-fold cross-validation strategy. To handle the multi-class classification (since we identified more than 2 subphenotypes), a one-vs-the-rest strategy was used in model training. Prediction performance was measured by the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC).
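As an illustration of one-vs-the-rest evaluation, the sketch below computes one AUC per subphenotype from predicted class probabilities. It is a standard-library example under stated assumptions: the labels and probabilities are hypothetical, the helper names `roc_auc` and `one_vs_rest_auc` are introduced only for this example, and the probabilities could come from any classifier, not specifically the XGBoost model used in the study.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a positive case outranks a negative one.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def one_vs_rest_auc(labels, prob_rows, classes):
    """One AUC per class: that class is treated as positive, all others as negative."""
    result = {}
    for k, cls in enumerate(classes):
        y = [1 if lab == cls else 0 for lab in labels]
        result[cls] = roc_auc(y, [row[k] for row in prob_rows])
    return result

classes = ["I", "II", "III", "IV"]
# Hypothetical true memberships and predicted probabilities (columns follow `classes`).
labels = ["I", "I", "II", "II", "III", "IV"]
probs = [
    [0.70, 0.10, 0.10, 0.10],
    [0.60, 0.20, 0.10, 0.10],
    [0.20, 0.50, 0.20, 0.10],
    [0.50, 0.15, 0.25, 0.10],  # a Subphenotype II patient scored poorly for class II
    [0.10, 0.20, 0.60, 0.10],
    [0.10, 0.10, 0.20, 0.60],
]
aucs = one_vs_rest_auc(labels, probs, classes)
print(aucs)
```

In this toy example the poorly scored Subphenotype II patient lowers the class II AUC to 0.75, while the other three classes are perfectly ranked.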
We also used SHapley Additive exPlanation (SHAP) values to assess the contributions of the clinical variables in distinguishing each subphenotype from the others. Once the predictive model was trained, it was applied to the external validation cohort to predict the patients' subphenotype memberships.

For subphenotype interpretation, we first visualized the subphenotypes in two ways: 1) [...]

We also characterized subphenotypes by evaluating their differences in demographics, all clinical variables, comorbidities, clinical outcomes, and medications prescribed after COVID-19 confirmation. Data were presented as median (interquartile range [IQR]) for continuous variables and exact patient number (percentage) for categorical variables. To compare subphenotypes, we performed the Kruskal-Wallis test for continuous data and the χ² test for categorical data. Analysis of covariance (ANCOVA) was also applied for between-subphenotype comparisons, adjusting for age and gender. Two-tailed P-values below 0.05 were considered statistically significant. Survival analyses were performed to assess associations of subphenotypes with clinical outcomes, and Kaplan-Meier plots were created accordingly.

To evaluate the temporal pattern of the subphenotypes during the course of the pandemic, we created bar charts to visualize the proportion of each subphenotype among the total patients confirmed per week since the COVID-19 outbreak in NYC (March 1, 2020).

Multiple analyses were conducted to assess the impact of SDoH on COVID-19 subphenotypes. For each subphenotype, we first performed logistic regression analysis to assess the association of each SDoH variable with 60-day mortality, adjusting for age and sex.
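The weekly composition behind the temporal bar charts described above can be sketched as a simple tabulation. This is a hedged standard-library example: the records are hypothetical, the helper names are introduced only for illustration, and the week convention (week 1 covering March 1 to March 7, 2020) is an assumption that may not match the week boundaries used in the study.

```python
from collections import Counter, defaultdict
from datetime import date

OUTBREAK_START = date(2020, 3, 1)  # outbreak date in NYC as given in the text

def week_index(confirmed_on):
    """1-based week number counted from the outbreak start (ASSUMED convention)."""
    return (confirmed_on - OUTBREAK_START).days // 7 + 1

def weekly_composition(records):
    """Map week -> {subphenotype: proportion of patients confirmed that week}."""
    counts = defaultdict(Counter)
    for confirmed_on, subphenotype in records:
        counts[week_index(confirmed_on)][subphenotype] += 1
    return {
        week: {s: n / sum(c.values()) for s, n in c.items()}
        for week, c in sorted(counts.items())
    }

# Hypothetical (confirmation date, subphenotype) pairs.
records = [
    (date(2020, 3, 2), "I"),
    (date(2020, 3, 5), "II"),
    (date(2020, 3, 30), "II"),
    (date(2020, 4, 1), "IV"),
    (date(2020, 4, 2), "IV"),
    (date(2020, 4, 3), "I"),
]
composition = weekly_composition(records)
for week, shares in composition.items():
    print(week, shares)
```

Each week's proportions sum to 1, so the resulting dictionary can be passed directly to a stacked bar chart routine.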
After that, we performed agglomerative hierarchical clustering on the 6 socioeconomic variables to derive comprehensive SDoH strata. Within each subphenotype, we compared 60-day mortality rates between the SDoH strata. We also used logistic regression analysis to assess the association of SDoH strata with 60-day mortality, adjusting for age and sex, within each subphenotype. All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted March 2, 2021. ; All data studied in this work can be downloaded from INSIGHT clinical research network at https://insightcrn.org/our-data/, via request. Implementation of our work is based on Python 3.7 and R 3.6. More specifically, clustering models were implemented based on Python packages 'scikit-learn 0.23.2' (https://scikit-learn.org/stable/) and 'scipy 1. Chord diagrams were created using R package 'circlize' (https://cran.rproject.org/web/packages/circlize/index.html). All statistical tests and survival analyses were performed based on R. The Institutional Review Board of the Weill Cornell Medicine approved this study (Protocol number: 20-04021948). All rights reserved. No reuse allowed without permission. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted March 2, 2021. ; cohort. 
Reproducibility of the identified subphenotypes was evaluated in multiple ways, including (d) sensitivity analyses in the development cohort and subphenotype re-derivation in the internal validation cohort; (e) training a subphenotype predictive model in the development cohort; and (f) using it to predict subphenotype memberships of patients in the external validation cohort. (g) Further analyses were conducted to interpret subphenotypes, explore temporal patterns of subphenotypes during the pandemic, and evaluate the impact of SDoH characteristics on subphenotypes.

Figure 3. Kaplan-Meier plots for 60-day mortality by subphenotype. Survival probabilities are shown with 95% confidence intervals.

[...] individual SDoH characteristics and 60-day mortality risk, using logistic regression analysis, [...]
This study is funded by the COVID-19-Related Project Enhancement to the grant PCORI/HSD-1604-35187 ("Identifying and Predicting Patients with Preventable High Utilization", PI: Kaushal) from the Patient-Centered Outcomes Research Institute.

The authors have declared that no conflict of interest exists.