key: cord-0941841-bf04ipp9
authors: Huang, Yufang; Liu, Yifan; Steel, Peter A D; Axsom, Kelly M; Lee, John R; Tummalapalli, Sri Lekha; Wang, Fei; Pathak, Jyotishman; Subramanian, Lakshminarayanan; Zhang, Yiye
title: Deep significance clustering: a novel approach for identifying risk-stratified and predictive patient subgroups
date: 2021-09-27
journal: J Am Med Inform Assoc
DOI: 10.1093/jamia/ocab203
sha: a44384a0e1c6a512b8dc823951a7bfccbedadbb
doc_id: 941841
cord_uid: bf04ipp9

OBJECTIVE: Deep significance clustering (DICE) is a self-supervised learning framework. DICE identifies clinically similar and risk-stratified subgroups that neither unsupervised clustering algorithms nor supervised risk prediction algorithms alone are guaranteed to generate.

MATERIALS AND METHODS: Enabled by an optimization process that enforces statistical significance between the outcome and subgroup membership, DICE jointly trains 3 components (representation learning, clustering, and outcome prediction) while providing interpretability to the deep representations. DICE also allows unseen patients to be predicted into trained subgroups for population-level risk stratification. We evaluated DICE using electronic health record datasets derived from 2 urban hospitals. Outcomes and patient cohorts used include discharge disposition to home among heart failure (HF) patients and acute kidney injury among COVID-19 (Cov-AKI) patients, respectively.

RESULTS: Compared to baseline approaches including principal component analysis, DICE demonstrated superior performance in the cluster purity metrics, Silhouette score (0.48 for HF, 0.51 for Cov-AKI), Calinski-Harabasz index (212 for HF, 254 for Cov-AKI), and Davies-Bouldin index (0.86 for HF, 0.66 for Cov-AKI), and in the prediction metric, area under the receiver operating characteristic (ROC) curve (0.83 for HF, 0.78 for Cov-AKI). Clinical evaluation of DICE-generated subgroups revealed more meaningful distributions of member characteristics across subgroups, and higher risk ratios between subgroups. Furthermore, DICE-generated subgroup membership alone was moderately predictive of outcomes.

DISCUSSION: DICE addresses a gap in current machine learning approaches where predicted risk may not lead directly to actionable clinical steps.

CONCLUSION: DICE demonstrated the potential to be applied in heterogeneous populations, where having the same quantitative risk does not equate with having a similar clinical profile.

Risk stratification involving clinical and sociodemographic factors is crucial to the management of disease in medicine. Risk stratification is often implemented in clinical pathways to direct care to distinct subgroups of patients according to risk status. 1-4 While risk stratification has been particularly successful within specific disease or outcome contexts, clinical pathways that address risk in a broad cohort of patients with heterogeneous sociodemographic and clinical profiles are more complex to implement, due to the need to identify interventions specific to risk levels and patient subgroups. 5-9 For example, heart failure (HF) impacts nearly 6 million Americans, more than 80% of whom suffer from 3 or more comorbidities. 10 The complexity due to frequent comorbidity and the lack of guidelines that incorporate heterogeneity present challenges in the discovery of patient strata to assist with clinical decision-making. 11
Another motivating example is acute kidney injury among COVID-19 patients (Cov-AKI), 12-14 where the initial kidney recovery during admission ranges from 30% to 75%. 12,14-16 The high degree of heterogeneity potentially originates from different pathophysiologic mechanisms such as volume depletion, acute tubular necrosis leading to fibrosis, and cardiometabolic disease leading to incident cardiorenal syndrome. 12,14-16 Effective treatment strategies against Cov-AKI may benefit from risk stratification that targets each stratum.

Machine learning has been widely explored for risk stratification in medicine, 17 with supervised algorithms showing great potential in predicting individual risks. However, in a heterogeneous population, patients may have the same risk level while exhibiting different disease manifestations, and thus require different interventions. Thus, to support use in real patient care, there remains a gap between predicted risks and the next reasonable clinical actions. From the opposite angle, unsupervised machine learning algorithms have been used in previous literature to identify patient subgroups who exhibit similar disease manifestations and thus require similar interventions. 18-22 However, the lack of supervision may yield patient subgroups that, although derived as clusters, do not actually stratify patients based on the outcome of interest. 23-25 Existing clustering algorithms are also not designed to be predictive, limiting their utility for unseen patients. Thus, distinctively partitioned patient subgroups, or precisely predicted individualized risks, without a bridge to the next clinical steps, may still bear limited translational value. 26-29 Yet, few existing clustering and risk prediction algorithms jointly achieve outcome-driven clustering in an end-to-end fashion for clinical applications. 23-25,30

This gap between practical needs in medicine and existing machine learning solutions inspired deep significance clustering (DICE), an end-to-end, risk-stratifying, and predictive clustering algorithm. By jointly training representation learning, clustering, and classification, DICE identifies deep representations that generate outcome-driven cluster membership as subgroups. Patients within each subgroup are intended to have similar levels of risk of an outcome, as well as similar clinical needs. The novelty and feasibility of DICE originate from the use of a combined objective function that includes a constraint requiring significantly different outcome distributions across clusters. This framework design enforces backpropagation through the representation, clustering, and outcome prediction components. In addition, this design allows unseen patients to be predicted into the risk-stratified subgroups trained in DICE as a multiclass classification task. Lastly, DICE performs neural architecture search (NAS) designed with an alternative grid search strategy over the number of clusters and the representation dimension size to heuristically optimize outcome prediction. The architecture of DICE is illustrated in Figure 1. Supplementary Figure S1 illustrates DICE with a simple example to motivate its development. DICE is customized to medicine by incorporating statistical significance, a concept familiar to many medical researchers, into a machine learning framework.
Previous work on risk stratification and subtyping has commonly conducted post hoc analysis of variable significance, 22,31 whereas DICE directly incorporates statistical significance as a constraint. For evaluation, we applied DICE to 2 real-world electronic health record (EHR) datasets and compared its performance to baseline methods through extensive experiments, ablation studies, and fairness evaluation. Baseline methods compared include principal component analysis (PCA), 32 as well as autoencoder (AE), 33 k-means clustering, and logistic regression performed in separate steps without the statistical significance constraint. Since the ground truth for stratification is unknown, we used the Silhouette score, 34 Calinski-Harabasz index, 35 and Davies-Bouldin index 36 to evaluate the clustering performance. We also computed relative risk ratios across the subgroups to assess the associations between subgroups and the outcomes. In addition, we evaluated the predictiveness of the DICE-learned representation by the area under the ROC curve (AUC).

Unsupervised learning is a fundamental topic in machine learning and has been widely applied to medical data. 19,22,37,38 Clustering algorithms such as k-means and hierarchical clustering separate a population based on the similarities of the input variables. For example, the k-means algorithm determines the cluster centroids by iterating between selecting centroids according to the assignment of data points to clusters, and assigning data points to clusters according to the current centroids, until stopping criteria are met. 39 The cluster assignment is mainly driven by cluster purity in terms of distances within or between clusters, but not by whether the distribution of a target variable differs across clusters. There are also semisupervised learning algorithms that make use of a small amount of labeled data together with a large amount of unlabeled data. 40 Neither purely unsupervised learning nor semisupervised learning directly addresses the need for risk-stratified clustering of patients.

Most related to our proposed methodology is self-supervised learning, 41 and in particular, previous work on outcome-driven, or predictive, clustering. 23-25,42 Xia et al 25 applied k-means clustering to the representation learned from a multitask classification model. Liu et al 23 applied agglomerative hierarchical clustering based on a distance metric that best suits the patient population; in their experiment, linear discriminant analysis was chosen to learn a generalized Mahalanobis distance metric. These 2 methods are 2-stage, with the clustering process independent of the representation learning or metric learning process. Locally Supervised Metric Learning, proposed by Sun et al, minimizes the distance between neighborhoods with the same class label while maximizing the distance between neighborhoods with different class labels. In addition, Lee et al proposed an actor-critic approach for predictive clustering that minimizes the Kullback-Leibler (KL) divergence between a predictor's output given the learned representations and that given the assigned centroids, ensuring that patients in the same cluster share similar future outcomes. Lastly, Zhang et al 43 added a constraint on a centroid-based probability distribution.
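To make concrete the purity-driven, outcome-blind nature of the k-means iteration described above, the following is a minimal NumPy sketch; variable names are illustrative, and note that the outcome variable never enters either step, which is precisely the limitation DICE targets.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if (labels == j).any() else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # stopping criterion: centroids no longer move
        centroids = new_centroids
    return labels, centroids
```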
Different from previous works, DICE proposes a "backpropagation" through the cluster membership classification component, using the cluster membership probabilities as input to predict the outcome so that patients in the same cluster have similar outcome distributions. Importantly, DICE proposes a novel constraint to ensure that the outcome distribution is statistically significantly different across clusters. To select the variables that most contribute to the outcome-driven stratification, DICE has a deep representation learning step that compresses the input data prior to clustering. 43-47

The representation learning component of DICE is related to other data transformation approaches that map the raw data into a new feature set, such as PCA 32 and AE. 33 To reduce the dimensionality of input data, PCA identifies the principal components that best explain the data by computing eigenvectors and eigenvalues of the covariance matrix, regardless of whether a target outcome variable is represented. Recent deep clustering approaches are learning-based and conduct inference in one shot, typically consisting of 2 stages, such as deep representation learning followed by various clustering models. 47 Caron et al 48 jointly learned the parameters of a deep network and the cluster assignments of the resulting representation. Deep clustering via a Gaussian-mixture variational autoencoder with graph embedding (DGG) uses Gaussian mixture variational AEs and graph embedding to improve the clustering and data representation abilities. 49 Yang et al use alternating stochastic optimization to update clustering centroids and representation learning parameters iteratively. Different from these methods, DICE constructs a cluster classification network and updates the representation learning parameters through self-supervised learning by considering cluster memberships as pseudolabels for the cluster classification network.

In addition to machine learning approaches, DICE shares objectives with statistical approaches including finite mixture models, 50,51 Gaussian mixture models (GMM), 39 kernel methods, 52 model-based clustering, 53,54 and spectral methods. 55,56 Compared to these models, DICE makes no distributional assumptions on the observations 54 and can handle the high computational complexity of large-scale datasets. 57 Jagabathula et al 58 proposed a conditional gradient approach for nonparametric estimation of mixing distributions. However, clustering of high-dimensional heterogeneous data remains challenging because of inefficient data representation.

NAS is a technique to find the network architecture with the best performance on the validation set. Early NAS conducted architecture optimization and network learning in a nested manner. 59-61 These works typically used reinforcement learning or evolutionary algorithms to explore the architecture search space A. A recent work decoupled architecture search from weight optimization in a one-shot NAS framework and used evolutionary architecture search to find candidate architectures after training. 62 EfficientNet and EfficientDet 63,64 further used grid search to balance network depth, width, and resolution, achieving state-of-the-art results on the ImageNet and COCO datasets, respectively. 65,66 We propose an alternative grid search to optimize the number of clusters and other hyperparameters in the DICE framework.
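As a rough illustration of this grid search, the sketch below trains one candidate model per (K, d) pair and keeps the architecture with the best validation AUC. The callables train_dice and validation_auc are hypothetical stand-ins for DICE's weight-optimization and evaluation steps, and the grids shown are illustrative, not the exact search spaces used in the experiments.

```python
from itertools import product

def grid_search(train_data, val_data, train_dice, validation_auc,
                Ks=(2, 3, 4, 5), ds=(20, 35, 50, 100)):
    """Select (K, d) by best validation AUC. train_dice and validation_auc
    are caller-supplied callables (hypothetical helpers, not the paper's API)."""
    best = {"auc": float("-inf"), "K": None, "d": None, "model": None}
    for K, d in product(Ks, ds):
        model = train_dice(train_data, n_clusters=K, rep_dim=d)  # weight optimization
        auc = validation_auc(model, val_data)                    # architecture selection
        if auc > best["auc"]:
            best.update(auc=auc, K=K, d=d, model=model)
    return best
```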
Given a dataset $X = \{X_1, \ldots, X_P\}$ with $P$ subjects, we denote each subject as a sequence of events $X_p = (x_p^1, \ldots, x_p^{T_p})$, $x_p^t \in \mathbb{R}^F$, where $F$ is the number of features at each timestamp. We have an outcome $y_p$ for each subject $p$. The first step is to transform the discrete sequences into latent continuous representations, followed by clustering and outcome prediction.

The latent representation learning for each subject is performed by a long short-term memory (LSTM) AE. 33 The AE consists of 2 parts, the encoder and the decoder, denoted as $\mathcal{E}$ and $\mathcal{F}$, respectively. Given the $p$th input sequence $X_p$, the encoder computes $z_p = \mathcal{E}(X_p; \theta_{\mathcal{E}})$, where $z_p \in \mathbb{R}^d$ is the representation, $d$ is the dimension of the representation, and $\mathcal{E}$ is an LSTM network with parameters $\theta_{\mathcal{E}}$. 67 We choose the last hidden state $z_p$ of the LSTM to be the representation of the input $X_p$. The decoder can be formulated as $\hat{X}_p = \mathcal{F}(z_p; \theta_{\mathcal{F}})$, where $\mathcal{F}$ is the other LSTM network, with parameters $\theta_{\mathcal{F}}$. The representation learning is achieved by minimizing the reconstruction error

$$L_{AE} = \sum_{p=1}^{P} \| X_p - \hat{X}_p \|_2^2, \quad (1)$$

where we use the $L_2$ norm in the loss. The obtained representations $Z = \{z_p\}_{p=1}^{P}$ can be employed for clustering with $K$ clusters, where $K$ is a hyperparameter for the total number of clusters to tune:

$$\min_{M, \{c_p\}} \sum_{p=1}^{P} \| z_p - M c_p \|_2^2 \quad \text{s.t. } c_p = [c_p^1, \ldots, c_p^K]^\top,\; c_p^k \in \{0, 1\},\; \mathbf{1}^\top c_p = 1, \quad (2)$$

where $c_p^k$ is the cluster membership for cluster $k$, $M \in \mathbb{R}^{d \times K}$, and the $k$-th column of $M$ is the centroid of the $k$-th cluster.

To enable fast inference and learn outcome-driven representations, we build a cluster classification network for deep clustering based on self-supervision from $c_p$ in Equation (2). We employ the a priori clustering results $\{c_p\}_{p=1}^{P}$ in Equation (2) as pseudolabels to update the parameters of the encoder $\mathcal{E}$ and the decoder $\mathcal{F}$. The cluster membership assignment can be formulated as a classification network

$$\hat{c}_p = g(z_p; \theta_1), \quad (3)$$

where $\hat{c}_p = [\hat{c}_p^1, \ldots, \hat{c}_p^K]$ is the predicted cluster membership from the cluster classification network $g(\cdot; \theta_1)$, $\theta_1$ is the parameter of the cluster classification network, and $L_1$ is the negative log-likelihood loss for multiclass cluster classification.

After obtaining the cluster memberships $\{\hat{c}_p\}_{p=1}^{P}$ for the $K$ clusters, we use the cluster membership and other confounders such as demographics to predict the outcome, formulated as

$$\hat{y}_p = g([\hat{c}_p; v_p]; \theta_2), \quad (4)$$

where $v_p$ represents the confounders to adjust for in testing the significance, $[\cdot; \cdot]$ denotes the concatenation of the cluster membership feature and the confounders, $g(\cdot; \theta_2)$ is the logistic regression for the outcome prediction, and $L_2$ is the negative log-likelihood loss for the classification.

This approach partially addresses interpretability in the application of deep learning methods in medicine. Using the cluster membership from the learned representation as the input to predict the outcome allows us to infer a broad theme (ie, a risk-stratified stratum) from a set of learned representations. Interpretability is further enhanced by enforcing the following statistical significance constraint on the cluster membership with respect to the outcome.

The main novelty of DICE is the introduction of a statistical significance constraint on the cluster membership with respect to the outcome distribution to drive the deep clustering process. This step also drives the interpretation of the representation learning. After obtaining the cluster memberships $\{\hat{c}_p\}_{p=1}^{P}$ for the $K$ clusters, we require that the association between the cluster membership and the outcome be statistically significant while adjusting for relevant confounders.
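The following is a simplified PyTorch sketch of how the 3 components above could be wired together: an LSTM encoder whose last hidden state is $z_p$, an LSTM decoder for reconstruction ($L_{AE}$), a cluster classification head trained on pseudolabels ($L_1$), and a logistic outcome head on the concatenated membership and confounders ($L_2$). The layer choices, the softmax on the membership, and the loss weighting are illustrative assumptions, not the authors' exact implementation; the significance constraint from the next section would be checked outside this loss, for example after each training epoch.

```python
import torch
import torch.nn as nn

class DiceModel(nn.Module):
    """LSTM autoencoder + cluster classification head + outcome head (a sketch)."""
    def __init__(self, n_features, rep_dim, n_clusters, n_confounders):
        super().__init__()
        self.encoder = nn.LSTM(n_features, rep_dim, batch_first=True)
        self.decoder = nn.LSTM(rep_dim, n_features, batch_first=True)
        self.cluster_head = nn.Linear(rep_dim, n_clusters)            # g(.; theta_1)
        self.outcome_head = nn.Linear(n_clusters + n_confounders, 1)  # g(.; theta_2)

    def forward(self, x, v):
        # x: (batch, T, n_features) event sequences; v: (batch, n_confounders).
        _, (h, _) = self.encoder(x)
        z = h[-1]                                 # last hidden state = representation z_p
        # Feed z at every timestep so the decoder reconstructs the whole sequence.
        x_hat, _ = self.decoder(z.unsqueeze(1).repeat(1, x.size(1), 1))
        c_logits = self.cluster_head(z)           # predicted cluster membership c_hat_p
        c_prob = torch.softmax(c_logits, dim=-1)
        y_logit = self.outcome_head(torch.cat([c_prob, v], dim=-1)).squeeze(-1)
        return x_hat, c_logits, y_logit, z

def dice_loss(model, x, v, y, pseudolabels, lambdas=(1.0, 10.0, 1.0)):
    """Combined objective lambda_1*L_AE + lambda_2*L_1 + lambda_3*L_2.
    pseudolabels: LongTensor of k-means cluster indices; y: float 0/1 outcomes."""
    x_hat, c_logits, y_logit, _ = model(x, v)
    l_ae = ((x - x_hat) ** 2).sum(dim=(1, 2)).mean()                        # L_AE
    l_cluster = nn.functional.cross_entropy(c_logits, pseudolabels)         # L_1
    l_outcome = nn.functional.binary_cross_entropy_with_logits(y_logit, y)  # L_2
    return lambdas[0] * l_ae + lambdas[1] * l_cluster + lambdas[2] * l_outcome
```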
To quantify the significance of the difference between cluster $k_1$ and cluster $k_2$ ($k_1 \neq k_2$), we use the likelihood-ratio test to calculate the $P$ value of the variable $\hat{c}^{k_2}$ when considering cluster $\hat{c}^{k_1}$ as the reference, where $\hat{c}^{k}$ refers to the cluster membership belonging to cluster $k$, formulated as

$$G_{k_1,k_2} = -2 \left[ \ln L(\text{model without } \hat{c}^{k_2}) - \ln L(\text{model with } \hat{c}^{k_2}) \right]. \quad (5)$$

Then we obtain the $P$ value from the Chi-square distribution, denoted as $S_{k_1,k_2}$. A predefined threshold of significance $\alpha$ (equivalently, $G_{k_1,k_2} > \alpha_G$) is used to measure significance. In this paper, we use $\alpha = .05$. In implementation, we design a mask technique to remove the variables of the input $\hat{c}$ corresponding to cluster $k_1$ and cluster $k_2$ in Equation (5), then calculate the likelihood ratio $G_{k_1,k_2}$, and add the significance constraint on the likelihood ratio, that is, $G_{k_1,k_2} > \alpha_G$.

The neural weights optimization is denoted as

$$\min_{\theta_{\mathcal{E}}, \theta_{\mathcal{F}}, \theta_1, \theta_2, M, \{c_p\}} \; \lambda_1 L_{AE} + \lambda_2 L_1 + \lambda_3 L_2 + \lambda_4 L_{sig} \quad \text{s.t. } \mathbf{1}^\top c_p = 1,\; c_p^j \in \{0, 1\},\; \forall p \in \{1, \ldots, P\};\; G_{k_1,k_2} > \alpha_G,\; \forall k_1, k_2 \in \{1, \ldots, K\},\; k_1 \neq k_2, \quad (6)$$

where $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ are the tradeoffs for $L_{AE}$, $L_1$, $L_2$, and the statistical significance constraint ($L_{sig}$, the term enforcing $G_{k_1,k_2} > \alpha_G$), respectively. We iteratively optimize the deep clustering and the other components with the statistical significance constraint. We first employ an a priori clustering algorithm, such as k-means, 68 to obtain pseudolabels for the cluster classification network. Then, we optimize $L_{AE}$ for the representation learning network, $L_1$ for the cluster classification network, $L_2$ for the outcome prediction network, and the statistical significance constraint jointly. The algorithm is elaborated in Algorithm 1.

We utilize NAS to optimize the network hyperparameters in DICE: the hyperparameter in the clustering (the number of clusters $K$) and the network hyperparameters in the representation learning (the hidden state dimension $d$). NAS conducts 2 processes sequentially. The first is the neural weights optimization of a given network architecture with fixed $K$ and $d$. The second is the NAS process itself, which is conducted in the search space $\mathcal{A}$ to select the combination of hyperparameters and has no direct link to the cost function of the neural weights optimization. We choose the network architecture that is trained on the training set and has the best evaluation performance on the validation set, that is,

$$(K^*, d^*) = \arg\max_{(K, d) \in \mathcal{A}} \text{AUC}_{val}(K, d), \quad (7)$$

where $\text{AUC}_{val}(\cdot)$ is the AUC score on the validation set.
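To illustrate the pairwise likelihood-ratio check, here is a hedged sketch using statsmodels under one plausible reading of the mask technique: for each cluster pair, restrict to subjects assigned to either cluster (making $k_1$ the implicit reference), fit the outcome model with and without the $k_2$ indicator while keeping the confounders, and require $G > \alpha_G$ for every pair. The function name and the exact masking are assumptions, not the authors' code.

```python
import numpy as np
from itertools import combinations
from scipy.stats import chi2
import statsmodels.api as sm

def significance_holds(c_hat, y, confounders, alpha=0.05):
    """c_hat: (P, K) one-hot memberships; y: (P,) binary outcomes;
    confounders: (P, m) array with m >= 1 adjustment variables."""
    K = c_hat.shape[1]
    crit = chi2.ppf(1 - alpha, df=1)        # 3.841 for alpha = .05, 1 df
    for k1, k2 in combinations(range(K), 2):
        mask = (c_hat[:, k1] + c_hat[:, k2]) > 0      # subjects in k1 or k2
        x_full = sm.add_constant(
            np.column_stack([c_hat[mask, k2], confounders[mask]]))
        x_red = sm.add_constant(confounders[mask])    # drop the k2 indicator
        ll_full = sm.Logit(y[mask], x_full).fit(disp=0).llf
        ll_red = sm.Logit(y[mask], x_red).fit(disp=0).llf
        G = 2 * (ll_full - ll_red)          # likelihood ratio G_{k1,k2}
        if G <= crit:                       # this pair is not significantly different
            return False
    return True
```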
Data

Study data included HF and COVID-19 patients treated in the inpatient and emergency department (ED) settings of 2 hospitals of an urban academic center, respectively. EHR variables extracted include information on sociodemographics, vital signs, diagnoses, therapeutic orders, medication prescriptions, laboratory test orders and test results, and census-tract-level social determinants of health (SDOH). The sociodemographic information included age, gender, race, marital status, preferred language, and insurance payor. Diagnoses were extracted using International Classification of Diseases, Ninth/Tenth Revision, Clinical Modification (ICD-9/10-CM) codes. 69 Outcomes are defined as discharge to home among HF patients and Cov-AKI among COVID-19 patients. Continuous variables for each patient were represented as normalized vectors, normalized to a mean of 0 and a standard deviation of 1. Categorical variables were converted to binary vectors, whose values were represented as 1 or 0, using one-hot encoding. Missing values in laboratory and SDOH variables were imputed with mean values.

HF data include adult patients from years 2014 to 2018 who were treated on the inpatient Medicine services. Only those patients whose initial (acute, ED, admitting) and principal diagnoses both contained HF codes were included, to ensure that HF was the working diagnosis throughout the hospital stay and was being treated from the beginning of the encounter. In the HF data, variables were timestamped into day intervals since ED arrival and used as sequential features. Figure 2 describes the datasets and the inclusion/exclusion criteria for the HF cohort. HF definitions in ICD-9/10-CM are listed in Supplementary Table S4.

COVID-19 was defined by a positive polymerase chain reaction test. COVID-19 data included adult patients who were admitted to the hospital in March and April 2020 from the ED. 14 We define baseline creatinine to be the closest creatinine obtained prior to March 2020 or, if not available, the earliest creatinine at the time of ED presentation. AKI was defined by the Kidney Disease: Improving Global Outcomes criteria, 70,71 that is, an increase in creatinine of 0.3 mg/dL or greater from the baseline creatinine during the hospitalization, an increase in creatinine to greater than 1.5 times the baseline creatinine during the hospitalization, or the initiation of renal replacement therapy. Furthermore, this definition of AKI was verified by manual chart reviews led by an MD coauthor. 14 In the COVID-19 data, demographic variables (age, gender, race), chronic conditions, and the first values of commonly ordered laboratory tests obtained within 12 h of ED presentation were included as one-time features. Variables used are listed in Supplementary Table S5.

We compared our method with baseline methods including (1) PCA for representation learning followed by k-means clustering (PCA [k-means]), (2) AE for representation learning followed by k-means clustering (AE [k-means]), and (3) AE for representation learning with classification followed by k-means clustering (AE w/ class [k-means]). For baseline (1), we treated sequential data as one-time features in the HF dataset to learn PCA representations, followed by k-means clustering. In (2), k-means clustering was applied directly to the representations learned from the AE. 33 In (3), we first jointly trained the AE and outcome prediction, with the representation learned from the AE as the input for outcome prediction, and then applied k-means clustering to the final learned representation. We report the results of these baseline methods using the same hyperparameters as DICE. Supplementary Table S6 lists the baseline methods against DICE.

Based on the dataset size and the number of features, the number of clusters experimented with was set to 2 through 5. The sizes of the representation dimension were 20 through 100 for HF and 10 through 20 for COVID-19, respectively. Experiments were conducted in PyTorch 72 on an NVIDIA GeForce RTX 2070. We initialized the AE with one epoch of training. We set the P value threshold $\alpha = .05$, which leads to $\alpha_G = 3.841$, with $n_{iter} = 150$ and $n_{epoch} = 1$. Parameters $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$ were set to 1.0, 10, 1.0, 1.0 for HF and COVID-19 based on the accuracy on the validation set. The HF and COVID-19 datasets were split into training, validation, and test sets in a 4:1:1 ratio.

Evaluation

DICE was compared against the 3 baseline methods with respect to AUC on the outcome prediction, Silhouette score, 34 Calinski-Harabasz index, 35 and Davies-Bouldin index. 36 The Silhouette score, Calinski-Harabasz index, and Davies-Bouldin index are normalized metrics and therefore allow us to evaluate the cluster goodness across methods regardless of the input representation scale. To evaluate the outcome-driven nature of the clusters, we computed risk ratios between each cluster and the cluster with the lowest incidence as $CI_{c_i}/CI_{c_r}$, where $CI_{c_i}$ is the cumulative incidence of cluster $c_i$ and $CI_{c_r}$ that of the reference cluster $c_r$. Source code is available at https://github.com/YiyeZhangLab/DICE.
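The cluster metrics and risk ratios described above can be computed with scikit-learn and NumPy along the following lines; the inputs shown (Z for representations, labels for cluster assignments, y for binary outcomes) are illustrative.

```python
import numpy as np
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

def cluster_report(Z, labels, y):
    """Cluster purity metrics plus per-cluster risk ratios vs the
    lowest-incidence (reference) cluster."""
    metrics = {
        "silhouette": silhouette_score(Z, labels),                # higher is better
        "calinski_harabasz": calinski_harabasz_score(Z, labels),  # higher is better
        "davies_bouldin": davies_bouldin_score(Z, labels),        # lower is better
    }
    clusters = np.unique(labels)
    incidence = {k: y[labels == k].mean() for k in clusters}  # cumulative incidence
    ref = min(incidence, key=incidence.get)                   # lowest-incidence cluster
    risk_ratios = {k: incidence[k] / incidence[ref] for k in clusters}
    return metrics, incidence, risk_ratios
```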
HF data contained 1585 patients, of whom 36.8% were discharged to home (Figure 2). Supplementary Tables S1 and S7 describe the demographic information in the data. Among the 1002 COVID-19 patients, 30.3% developed AKI during hospitalization. The network hyperparameters chosen were $K = 4$ and $d = 35$ for discharge to home among HF patients, and $K = 3$ and $d = 16$ for AKI among COVID-19 patients.

Table 1, displaying performance on the test set, shows that DICE can generate more distinctive clusters as subgroups. Experiments and analyses demonstrate that DICE obtained better performance than the baseline methods in deriving subgroups that have higher risk ratios when comparing the reference (lowest-risk) subgroup with the other subgroups. We further demonstrate the clustering separation across the 2 datasets through the t-distributed stochastic neighbor embedding (t-SNE) visualizations in Figures 3 and 4. Compared with the baselines shown in Figure 3B-D for HF, the 4 subgroups in Figure 3A discovered by DICE displayed tighter separation (Silhouette score = 0.48, Calinski-Harabasz index = 212, Davies-Bouldin index = 0.86). In order of outcome rates, subgroups 1-4 had rates of 79.9%, 38.8%, 29.7%, and 8.6%, respectively. The baseline AE with classification also discovered 4 subgroups, with the outcome rate in each subgroup ranging from 72.2% to 5.8%, but the cluster purity metrics were lower (Silhouette score = 0.35, Calinski-Harabasz index = 200, Davies-Bouldin index = 1.30). PCA (k-means) and AE (k-means) did not discover subgroups as clearly separated and outcome-driven as DICE. Examining the visualizations for the COVID-19 dataset in Figure 4, AE (k-means) achieved good cluster purity metrics (Silhouette score = 0.46, Calinski-Harabasz index = 163, Davies-Bouldin index = 0.84). However, the subgroups were similar with respect to the outcome rates, which ranged from 46.9% to 23.9% (risk ratios 1.88 and 1.10), showing that cluster purity does not necessarily guarantee risk-stratified subgroups.

Table 2, displaying predictive performance on the test set, shows that DICE-learned representations are predictive. To evaluate the representation learned by DICE, we used the representations for outcome prediction using L1-regularized logistic regression. DICE outperformed the baselines in AUC, true positive rate (TPR), false negative rate, positive predictive value (PPV), and negative predictive value (NPV) in both the HF and COVID-19 datasets. Relatedly, we evaluated the AUC for outcome prediction using the DICE subgroup membership alone. Notably, the DICE subgroup membership alone achieved moderately high prediction of the outcome (AUC = 0.772 for HF, AUC = 0.627 for COVID-19). Supplementary Table S8 describes the predictiveness of the DICE cluster membership alone, including AUC with confidence bounds, accuracy (ACC), TPR, true negative rate (TNR), PPV, and NPV.

Using HF data, we examined the advantage of the statistical significance constraint as well as algorithm fairness. Figure 5 illustrates the AUC on the HF validation set across different neural network architectures, with AUC on the Y-axis and representation dimension $d$ on the X-axis.
At each fixed cluster size and representation dimension, the network architectures that met the statistical significance constraint achieved higher AUC than those that did not. We further conducted ablation studies to gauge the effect of the statistical significance constraint. When we disabled the statistical significance constraint, NAS output 2 clusters, compared to the 4-level separation reported in Table 1. In addition, the percentage of neural networks that passed the significance constraint in NAS decreased from 82.4% to 64.7% when the cluster size was set to 5 in the ablation study. The AUC was lower when the statistical significance constraint was not met. These results suggest that the statistical significance constraint contributes to better stratification, especially as the number of clusters increases.

Subgroups generated by DICE were evaluated for their clinical relevance. Figure 6 illustrates the distribution of relevant laboratory variables across subgroups in the COVID-19 cohort. DICE discovered 3 subgroups that had high (69%), medium (42%), and low (14%) incidence of AKI. Distributions of laboratory measurements across subgroups showed linear trends from the high- to the low-risk subgroup. The distributions were consistent with clinically expected risk factors of AKI among COVID-19 patients, 73 including older age; higher values of alkaline phosphatase, C-reactive protein, D-dimer, and ferritin; and lower values of hemoglobin and albumin, corresponding to severe illness and higher risk of AKI. 73 The other baseline techniques were unable to detect these AKI-focused subgroups (Figure 6).

Table 3 (vs Supplementary Table S2) and Table 4 (vs Supplementary Table S3) show the notable characteristics across subgroups in the 2 datasets, respectively, comparing DICE and the baseline AE w/ class (k-means). Only variables with a linear trend observed across the highest- to lowest-risk clusters are displayed. For example, we observe a linear trend in comorbidity (chronic kidney disease and obesity) across the clusters, where the clusters with the lowest and the highest percentages of patients discharged to home display the highest and lowest percentages of comorbidity, respectively. Similarly, we observe trends in the use of medications that are indicative of disease severity and complexity, such as bumetanide and haloperidol being more prevalent in the cluster with the lowest percentage of patients discharged to home. This cluster of patients also had the highest need for social work referrals, as observed in the orders placed. P values were calculated using the Kruskal-Wallis rank-sum test for continuous variables and the Chi-square/Fisher's exact test for categorical variables.
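The subgroup comparison tests named above are available in SciPy; a minimal sketch with illustrative data (one array of lab values per DICE subgroup, and a subgroup-by-category contingency table of counts):

```python
import numpy as np
from scipy.stats import kruskal, chi2_contingency, fisher_exact

groups = [np.array([1.2, 0.9, 1.5]), np.array([2.1, 1.8]), np.array([3.0, 2.7, 3.3])]
table = np.array([[20, 5], [8, 17]])

stat, p_continuous = kruskal(*groups)  # Kruskal-Wallis rank-sum test across subgroups
chi2_stat, p_categorical, dof, expected = chi2_contingency(table)  # Chi-square test
if (expected < 5).any():                      # small expected counts:
    _, p_categorical = fisher_exact(table)    # use Fisher's exact test (2x2 table)
```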
DICE was motivated by the goal of joining concepts from machine learning and statistics into a machine learning algorithm customized for medicine. It is intended to create risk-stratified and predictive subgroups to facilitate risk-stratified intervention designs. These features of DICE were demonstrated in the evaluation using EHR datasets with different sizes, variable types, incidences, and clinical areas. Compared to DICE, the baseline methods applied to the COVID-19 data produced subgroups with good cluster purity that were not clearly stratified by the risk level of the outcome. In the HF data, we observed that DICE achieved cluster purity while the cluster membership also served a predictive purpose.

Evaluation results suggest that DICE has a particular advantage over baseline methods when the characteristics indicative of the outcome risk, or its root causes, are heterogeneous, rendering outcome prediction challenging. Beyond the patient populations evaluated in this paper, DICE may have the potential to be used in other clinical areas to facilitate subgroup-specific care and clinical pathways for clinical decision support. In this study, DICE jointly trained an AE for representation learning, k-means for clustering, and logistic regression for prediction. 74 Depending on the data structure, DICE can be revised to replace k-means with other clustering algorithms and, similarly, logistic regression with other prediction algorithms. Moreover, if clinical notes were used as input, Transformers may serve as the encoder and decoder in representation learning. 75 Future studies may also evaluate additional statistical concepts to better ensure the outcome separation, such as Tukey's Honestly Significant Difference test and the Cochran-Armitage test for trend, to increase the risk ratios across clusters. 76 In summary, DICE offers a flexible framework and a conceptual innovation that may drive meaningful application of machine learning to the EHR.

This paper demonstrated DICE, an outcome-driven clustering algorithm for risk-stratifying patients. Compared to baseline methods, DICE is optimized to cluster patients based on both the risk level of an outcome and the input clinical features. Because of this feature, we propose that DICE may be used to identify subgroups of patients in a heterogeneous population who require risk-stratified interventions and who are similar in ways that allow them to respond to similar treatments against their risk of an outcome. Beyond the datasets used in this paper, DICE has the potential to be used in other clinical areas.

This study was supported by NLM K01LM013257-01 (PI Zhang) and the Center for Transportation, Environment, and Community Health (CTECH) New Research Initiatives Fund (PI Zhang). YZ and YH designed the overall study in consultation with LS, PADS, KMA, JRL, and SLT. YZ, YH, and YL performed data analysis. PADS, KMA, JRL, and SLT provided clinical input and interpretation to the study. YH and YZ wrote the manuscript with input from all the authors. FW, JP, LS, and YL provided suggestions on the manuscript. Supplementary material is available at Journal of the American Medical Informatics Association online. YZ and JP have equity ownership in Iris OB Health. JRL has filed patent US-2020-0048713-A1 titled "Methods of detecting cell-free DNA in biological samples" and has received a grant from BioFire Diagnostics LLC. The data generated and/or analyzed during the current study cannot be shared publicly due to its inclusion of patient health information protected by the Health Insurance Portability and Accountability Act, but will be shared on reasonable request to the corresponding author.
REFERENCES

1. Risk stratification and clinical pathways to optimize length of stay after transcatheter aortic valve replacement.
2. Risk stratification of patients with nonalcoholic fatty liver disease using a case identification pathway in primary care: a cross-sectional study.
3. Risk stratification and the care pathway.
4. Beyond screening: a stepped care pathway for managing postpartum depression in pediatric settings.
5. Crisis clinical pathway for COVID-19.
6. Problems related to the application of guidelines in clinical practice: a critical analysis.
7. Chest pain in the emergency room: value of the HEART score.
8. Application of the ABCD2 score to identify cerebrovascular causes of dizziness in the emergency department.
9. Assessing the effectiveness of NICE criteria for stratifying breast cancer risk in a UK cohort.
10. Global public health burden of heart failure.
11. Performance of 2014 NICE defibrillator implantation guidelines in heart failure risk stratification.
12. AKI in hospitalized patients with COVID-19.
13. Northwell Nephrology COVID-19 Research Consortium. Acute kidney injury in patients hospitalized with COVID-19.
14. Characteristics of acute kidney injury in hospitalized COVID-19 patients in an Urban Academic Medical Center.
15. Outcomes among patients hospitalized with COVID-19 and acute kidney injury.
16. AKI in hospitalized patients with and without COVID-19: a comparison study.
17. Machine learning for patient risk stratification: standing on, or looking over, the shoulders of clinicians?
18. Investigating clinical care pathways correlated with outcomes.
19. Paving the COWpath: learning and visualizing clinical pathways from electronic health record data.
20. Utilization of deep learning for subphenotype identification in sepsis-associated acute kidney injury.
21. Identifying sub-phenotypes of acute kidney injury using structured and unstructured electronic health record data with memory networks.
22. Data-driven subtyping of Parkinson's disease using longitudinal clinical records: a cohort study.
23. Precision cohort finding with outcome-driven similarity analytics: a case study of patients with atrial fibrillation.
24. Temporal phenotyping using deep predictive clustering of disease progression.
25. Outcome-driven clustering of acute coronary syndrome patients using multi-task neural network with attention.
26. Potential biases in machine learning algorithms using electronic health record data.
27. Implementing machine learning in health care: addressing ethical challenges.
28. Physician perspectives on integration of artificial intelligence into diagnostic pathology.
29. What this computer needs is a physician: humanism and artificial intelligence.
30. Robust finite mixture regression for heterogeneous targets.
31. Development and validation of a machine learning algorithm for predicting the risk of postpartum depression among pregnant women.
32. Principal component analysis.
33. Sequence to sequence learning with neural networks. Advances in neural information processing systems.
34. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis.
35. A dendrite method for cluster analysis.
36. A cluster separation measure.
37. Paving the COWpath: data-driven design of pediatric order sets.
38. Machine learning in medicine.
39. Pattern Recognition and Machine Learning.
40. Introduction to semi-supervised learning.
41. Self-supervised visual feature learning with deep neural networks: a survey.
42. Supervised patient similarity measure of heterogeneous patient records.
43. A framework for deep constrained clustering: algorithms and advances.
44. Deep clustering: discriminative embeddings for segmentation and separation.
45. Discriminatively boosted image clustering with fully convolutional auto-encoders.
46. Unsupervised deep embedding for clustering analysis.
47. Towards k-means-friendly spaces: simultaneous deep learning and clustering.
48. Deep clustering for unsupervised learning of visual features.
49. Deep clustering by Gaussian mixture variational autoencoders with graph embedding.
50. Finite Mixture Models.
51. A review of recent developments in latent class regression models.
52. Kernel methods in machine learning.
53. Model-based clustering, discriminant analysis, and density estimation.
54. A unified framework for model-based clustering.
55. On spectral clustering: analysis and an algorithm.
56. A tutorial on spectral clustering.
57. A survey of clustering with deep learning: from the perspective of network architecture.
58. A conditional gradient approach for nonparametric estimation of mixing distributions.
59. Designing neural network architectures using reinforcement learning.
60. Neural architecture search with reinforcement learning.
61. Learning transferable architectures for scalable image recognition.
62. Single path one-shot neural architecture search with uniform sampling.
63. EfficientNet: rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning. PMLR.
64. EfficientDet: scalable and efficient object detection.
65. ImageNet: a large-scale hierarchical image database.
66. Microsoft COCO: common objects in context.
67. Long short-term memory.
68. Some methods for classification and analysis of multivariate observations.
69. A modification of the Elixhauser comorbidity measures into a point system for hospital death using administrative data.
70. Work group: section 2: AKI definition.
71. KDOQI US commentary on the 2012 KDIGO clinical practice guideline for acute kidney injury.
72. PyTorch: an imperative style, high-performance deep learning library.
73. Coronavirus disease (COVID-19) and the liver: a comprehensive systematic review and meta-analysis.
74. Stacked convolutional auto-encoders for hierarchical feature extraction.
75. Attention is all you need.
76. What is the proper way to apply the multiple comparison test?