authors: Cui, Sunny; Yoo, Elizabeth C.; Li, Didong; Laudanski, Krzysztof; Engelhardt, Barbara E. title: Hierarchical Gaussian Processes and Mixtures of Experts to Model COVID-19 Patient Trajectories date: 2021-10-04 journal: bioRxiv DOI: 10.1101/2021.10.01.462821

Gaussian processes (GPs) are a versatile nonparametric model for nonlinear regression and have been widely used to study spatiotemporal phenomena. However, standard GPs offer limited interpretability and generalizability for datasets with naturally occurring hierarchies. With large-scale, rapidly updating electronic health record (EHR) data, we want to study patient trajectories across diverse patient cohorts while preserving patient subgroup structure. In this work, we partition our cohort of over 2000 COVID-19 patients by sex and ethnicity. We develop and apply a hierarchical Gaussian process and a mixture of experts (MOE) hierarchical GP model to fit patient trajectories on clinical markers of disease progression. A case study for albumin, an effective predictor of COVID-19 patient outcomes, highlights the predictive performance of these models. These hierarchical spatiotemporal models of EHR data bring us a step closer toward our goal of building flexible approaches to capture patient data that can be used in real-time systems.

The highly contagious nature of the emergent coronavirus and limited knowledge of treatment methods necessitate decision support tools that can efficiently estimate and predict patient trajectories in order to measure disease progression. Notably, recent findings report considerable disparities in the manifestations of COVID-19 across racial minorities, arising from a combination of physiological, socioeconomic, behavioral, and cultural factors. A model that can account for group structures that arise both inherently and environmentally is necessary in order to develop clinical recommendations tailored to individual patients and to mitigate bias in treatment procedures; at the same time, that model should also allow for the sharing of signal across groups when patient group sample sizes are small.

The Hospitals at the University of Pennsylvania (HUP) COVID-19 dataset contains clinical observations of 2069 patients who tested positive for COVID-19 via a PCR test between April 2020 and August 2020 at the University of Pennsylvania Medical Center (UPMC) hospital in Philadelphia, PA. This anonymized dataset includes the following patient information:

• patient demographic information, including age, sex, and ethnicity;
• labs and vital sign measurements, including blood serum creatinine, partial pressure of oxygen, and total urine output;
• procedural information, including details of mechanical ventilation, nasal cannula, and liters of oxygen flow; and
• medication information, including type, dosage, and time of administration.

With an emergent disease like COVID-19, we want a model that is robust to missing and noisy patient data, and also computationally tractable to allow continuous data updates. Known for their flexibility, interpretability, and uncertainty quantification, Gaussian processes (GPs) have proven useful in machine learning, 3 spatiotemporal statistics, 4 and functional data analysis. 5 Among their applications, GP regression is a nonparametric regression model that places a distribution on arbitrary nonlinear functions with smoothness modulated by the selected kernel function. 6
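To make the role of the kernel concrete, the short Python sketch below draws random functions from zero-mean GP priors under a squared exponential and an exponential (Ornstein-Uhlenbeck) kernel; the kernel choices and parameter values here are illustrative only and are not taken from the paper's fitted models.

```python
# Minimal sketch (not from the paper): drawing random functions from zero-mean GP
# priors under two kernels to show how the kernel choice controls smoothness.
# The lengthscale and variance values below are arbitrary illustrations.
import numpy as np

def sq_exp(x1, x2, amp=1.0, ls=2.0):
    """Squared exponential kernel: very smooth sample paths."""
    d = x1[:, None] - x2[None, :]
    return amp**2 * np.exp(-0.5 * (d / ls) ** 2)

def ornstein_uhlenbeck(x1, x2, amp=1.0, ls=2.0):
    """Exponential (Ornstein-Uhlenbeck) kernel: rough, non-differentiable paths."""
    d = np.abs(x1[:, None] - x2[None, :])
    return amp**2 * np.exp(-d / ls)

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 200)                      # e.g., days since admission
jitter = 1e-8 * np.eye(len(t))                   # numerical stability for sampling
smooth_draws = rng.multivariate_normal(np.zeros(len(t)), sq_exp(t, t) + jitter, size=3)
rough_draws = rng.multivariate_normal(np.zeros(len(t)), ornstein_uhlenbeck(t, t) + jitter, size=3)
```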
Updated by observations, the GP posterior enables predictions and uncertainty estimates at unobserved locations on sequences, such as the time or space domain, including the future. Due to the Gaussian assumption on the joint distribution over observations, the posterior is Gaussian with closed-form mean and variance terms. Previous work has exploited the flexibility of GPs to obtain insights into problems in healthcare, including early detection of sepsis through multi-output GPs, 7,8 online updates of patient vital signals with sparse multi-output GPs, 9 and reliable prediction of adverse hospital events by jointly modeling longitudinal trajectories and time-to-event data. 10

For the task of modeling disease trajectories, particularly for a large patient cohort, standard GP regression is insufficient because many complex diseases, such as lupus and pneumonia, manifest heterogeneously in patients across different demographic and clinical subgroups. 11,12 Noting this heterogeneity, prior work placed a hierarchy on scleroderma patients at the population, subgroup, and individual levels; 10 B-splines were used to model each subgroup trajectory and a GP was used to capture noise. Although the MedGP approach 9 combined information across patients using an empirical Bayes approach, allowing subgroups to be captured via kernel parameters, it lacks a rigorous approach to evaluating group structure and posteriors. Motivated by the need to explicitly account for group structure, our framework builds on the premise of a group structure in the patient population and provides a fully Bayesian treatment of hierarchical disease trajectory modeling.

The contributions of this work are as follows. At a high level, we develop a flexible Gaussian process that is able to capture sparse, noisy electronic health record (EHR) time-series data. More specifically, we build a hierarchical mixture of experts (MOE) Gaussian process (GP) regression model that allows sharing of strength across patient samples with known group structure. The MOE allows each sample to participate in multiple patient groups simultaneously, such as inclusion in both the female (sex) and Black (race) patient groups. Furthermore, our fast closed-form inference method allows us to apply this framework to hundreds of COVID-19 patient trajectories to show its robustness in fitting a variety of clinically important covariates.

This paper is organized as follows: In Section 2, we discuss the background for standard, hierarchical, and MOE Gaussian process regression models. We introduce our framework of MOE hierarchical Gaussian process regression in Section 3. We demonstrate the performance of our framework on COVID-19 patient EHR data and discuss the implications of these results in Section 4. We conclude by exploring future directions in Section 5.

In this section, we provide a brief summary of GP regression and its extension to a Bayesian hierarchical setting. We consider the Bayesian analysis of standard linear regression $f(x_i) = \beta^\top x_i$, where $\beta$ is the vector of weights of the linear model, $x_i$ are regressors, and $f(x_i)$ is the noiseless function. Given observed regressors $X = \{x_i\}_{i=1}^{n}$, such as time, across $n$ total observations, and noisy scalar responses $Y = \{y_i\}_{i=1}^{n}$, we can write each response as $y_i = f(x_i) + \epsilon_i$, where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$. We can extend these linear models to nonlinear regression functions using Gaussian processes.
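Before the formal construction below, the following Python sketch illustrates this extension under simple assumptions: a zero-mean GP with a Matérn 3/2 kernel is conditioned on a hypothetical noisy lab trajectory, with kernel parameters chosen by maximizing the log marginal likelihood with SciPy. The kernel choice, parameter initialization, and simulated data are illustrative and are not the paper's implementation.

```python
# Minimal sketch of zero-mean GP regression with a Matern 3/2 kernel: closed-form
# posterior at new time points, with kernel parameters chosen by maximizing the log
# marginal likelihood via SciPy. Parameter values and data are illustrative only.
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.optimize import minimize

def matern32(x1, x2, amp, ls):
    """Matern 3/2 covariance between two sets of 1-D inputs (e.g., times)."""
    r = np.abs(x1[:, None] - x2[None, :]) / ls
    return amp**2 * (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)

def neg_log_marginal_likelihood(log_params, x, y):
    """-log p(y | x, theta) for theta = (amp, ls, noise), parameterized on the log scale."""
    amp, ls, noise = np.exp(log_params)
    K = matern32(x, x, amp, ls) + noise**2 * np.eye(len(x))
    L, lower = cho_factor(K, lower=True)
    alpha = cho_solve((L, lower), y)                       # K^{-1} y
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(x) * np.log(2 * np.pi)

def gp_posterior(x, y, x_new, amp, ls, noise):
    """Closed-form posterior mean and marginal variance at x_new."""
    K = matern32(x, x, amp, ls) + noise**2 * np.eye(len(x))
    K_s = matern32(x, x_new, amp, ls)
    K_ss = matern32(x_new, x_new, amp, ls)
    L = cho_factor(K, lower=True)
    mu = K_s.T @ cho_solve(L, y)                           # posterior mean
    cov = K_ss - K_s.T @ cho_solve(L, K_s)                 # posterior covariance
    return mu, np.diag(cov)

# Hypothetical sparse, noisy lab trajectory for one patient.
rng = np.random.default_rng(0)
x_obs = np.sort(rng.uniform(0, 10, 25))
y_obs = 3.5 + 0.4 * np.sin(x_obs) + 0.1 * rng.standard_normal(25)

res = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 1.0, 0.1]), args=(x_obs, y_obs))
amp_hat, ls_hat, noise_hat = np.exp(res.x)
mu, var = gp_posterior(x_obs, y_obs, np.linspace(0, 10, 100), amp_hat, ls_hat, noise_hat)
```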
Gaussian process regression is a probability distribution over arbitrary smooth functions such that any finite realization is a multivariate Gaussian random variable. For any input $x$, we write $f(x) \sim \mathcal{GP}(m(x), \kappa(x, x'))$, where $m(\cdot)$ is the mean function and $\kappa(\cdot, \cdot)$ is a positive definite kernel function. As in prior work, the mean function $m$ is assumed to be zero. 9 There are many possible positive definite kernel functions $\kappa$, including exponential (Ornstein-Uhlenbeck), squared exponential, and Matérn covariance functions. These covariance functions include parameters that control the spatial variance and the decay of the dependency over the domain; these kernel parameters $\theta$ are often estimated by maximum likelihood (MLE), maximizing the log marginal likelihood

$$\log p(Y \mid X, \theta) = -\tfrac{1}{2} Y^\top (K + \sigma^2 I_n)^{-1} Y - \tfrac{1}{2} \log \lvert K + \sigma^2 I_n \rvert - \tfrac{n}{2} \log 2\pi,$$

where $K$ is the $n \times n$ matrix with entries $\kappa(x_i, x_j)$. Conditioning on the observations, the posterior at test inputs $X_*$ has mean $\mu_* = K_*^\top (K + \sigma^2 I_n)^{-1} Y$ and covariance $\Sigma_* = K_{**} - K_*^\top (K + \sigma^2 I_n)^{-1} K_*$, where $K_* = \kappa(X, X_*)$ and $K_{**} = \kappa(X_*, X_*)$. A point estimate of $Y_*$ is given by $\mu_*$, the posterior mean, while $\Sigma_*$ is the variance of this posterior mean. The computational complexity of inference for GP regression (GPR) is $O(n^3)$ because of the need to invert $K + \sigma^2 I_n$, an $n \times n$ matrix. Fortunately, there is an immense literature on scalable inference algorithms for GPs, including tapering. 13 The idea of tapering is to impose zero correlation between two points that are not close to each other by multiplying $\kappa$ by a tapering function, resulting in a sparse block-diagonal covariance matrix.

One of the main challenges in predicting future values of a disease trajectory, or imputing unobserved values within a trajectory, is that biological and environmental factors lead to high variance in patient state and disease progression. For instance, many diseases include one or more disease subtypes, and the progression and severity of a disease can vary across patients with different ages, sexes, or chronic conditions. For datasets with known subgroups, hierarchical models are a natural choice because they allow the sharing of information across and within subgroups. The use of hierarchical models allows precise modeling of each subgroup and sharing of signal across all of the subgroups; it is particularly beneficial when each subgroup has a small sample size.

Hierarchical structure can be enforced through the mean function, the covariance function, or a structured prior. Prior work [14] placed a hierarchy on the mean function parameters to model PM2.5 levels, a measurement of air quality, much like the spline model for individualized disease prediction. 10 Other work [15] placed a two-level hierarchy on gene expression, at the level of each experiment and each replicate gene, to model heterogeneity. Conjugate inverse-Gamma priors have been placed on the kernel parameters to model the relationships between low- and high-accuracy experiments. 16 Variants of the hierarchical model include hierarchical MOE models that lend a tree structure to the computation of parameter values, 17 deep GPs in which the inputs to each GP have their own GP prior, 18 and hierarchically partitioned approximations that use subsets of inducing points to fit experts, which hold information at the group and individual levels. 19

In the context of prior work, we develop a Bayesian hierarchical GP (HGP) regression model for patient data. We group the patient population by attributes including sex and ethnicity. We impose a hierarchy on these trajectories at the group and individual levels by letting the mean of each level in the hierarchy be distributed as a Gaussian process parameterized by the level above. We use $k = 1, \dots, K$ as the group-level subscript and $i = 1, \dots, N_k$ as the patient-level subscript in group $k$. All patients in the $k$th subgroup share an underlying trajectory modeled by $g_k(x)$.
Patient $i$ in subgroup $k$ is associated with a unique trajectory, denoted by $f_{k,i}(x)$, that is influenced by various factors including demographics, lifestyle choices, genetic predispositions, and pre-existing conditions. Let $Y_k := \{Y_{k,i}\}_{i=1}^{N_k}$ be the collection of noisy observations of clinical markers of the $N_k$ patients in subgroup $k$ at time points $X_k := \{X_{k,i}\}_{i=1}^{N_k}$. The hierarchy is

$$g_k(x) \sim \mathcal{GP}(0, \kappa_g), \qquad f_{k,i}(x) \sim \mathcal{GP}(g_k(x), \kappa_f), \qquad y_{k,i}(x) = f_{k,i}(x) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2),$$

so the covariance between the data $Y$ and the functions $f(\cdot)$, $g(\cdot)$ for patients $i$ and $i'$ in subgroup $k$ is

$$\mathrm{Cov}\big(y_{k,i}(x),\, y_{k,i'}(x')\big) = \kappa_g(x, x') + \delta_{i,i'}\,\kappa_f(x, x') + \delta_{i,i'}\,\delta_{x,x'}\,\sigma^2.$$

Our model uses this additive hierarchical kernel, similar to that introduced by [15], with tapering that further enforces sparsity. For flexibility in the smoothness of the inferred functions, we choose the Matérn kernel with parameter $\nu$ that controls the smoothness of the GP: 20

$$\kappa_\nu(x, x') = \alpha^2\, \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\, \lvert x - x' \rvert}{\ell} \right)^{\nu} K_\nu\!\left( \frac{\sqrt{2\nu}\, \lvert x - x' \rvert}{\ell} \right),$$

where $K_\nu$ is the modified Bessel function of the second kind with order $\nu$, $\alpha^2$ is the spatial variance, and $\ell$ is the lengthscale. In practice, we estimate these parameters by maximizing the likelihood. In our model, we set the kernel parameter $\nu = 5/2$ at the group level and $\nu = 3/2$ at the individual level. With this kernel function, we model the data distribution as multivariate normal. The parameters $\theta$ collect the spatial variances $\alpha$, lengthscales $\ell$, and noise variances $\sigma$ at each level. The covariance matrix $\Sigma_n$ is written as

$$\Sigma_n = K_g + K_f + \sigma^2 I_n,$$

where both $K_g$ and $K_f$ are matrices formed by evaluating $\kappa_g$ and $\kappa_f$, respectively, on $x_{k,i}$ and $x_{k,i'}$. These covariance matrices inherit a natural block structure from the kernels (Fig. 2). To scale up the HGP, whose inference has computational complexity $O(n^3)$, we further perform tapering to enforce relationships only between close time points. Tapering encodes sparsity in the covariance matrix on the off-diagonal elements corresponding to time points that are more distant from each other, which improves inference tractability. 13

Although the HGP allows us to model group structure and individual patient trajectories that differ from the group, its exponential cost with respect to the number of groups renders it impractical for large patient cohorts with many groups. Because each patient belongs to multiple groups simultaneously (sex, ethnicity, and disease subtype, for instance), we want a tractable way to combine information from all of a patient's group attributes, i.e., an additive kernel. Thus, we extend the HGP with mixture of experts (MOE) kernels at the group level (Fig. 3). Originally developed to handle multiple modalities in large datasets, 21 MOE GPs can be adapted to a hierarchical setting such that the group-level kernel is the sum of the attribute kernels of the patients belonging to that group. An ensemble of local experts allows the kernel function to adapt to each observation, 22 which in our case corresponds to a patient. Again, we use a tapered Matérn 5/2 kernel at the group level and a tapered Matérn 3/2 kernel at the patient level. We perform efficient closed-form inference, with kernel parameters estimated using the SciPy optimizer.

We first benchmark our MOE HGP model, using HUP patient trajectories, against standard GPR and an HGP. We then present examples of fitted and predicted trajectories of cluster representatives, that is, patients whose trajectory minimizes the Wasserstein distance to all other patients in their subgroup. Intuitively, the cluster representative corresponds to the patient who best captures the canonical trajectory of that group. We evaluate the performance of our MOE HGP on COVID-19 patient trajectories from the Hospitals at the University of Pennsylvania (HUP). For the purposes of model fitting, we only consider patient trajectories with more than 25 observations corresponding to unique time points. We group patients based on attributes of sex (male and female) and ethnicity (Black and white).
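As a rough illustration of the additive hierarchical and mixture-of-experts covariance structure just described, the sketch below assembles a covariance matrix for two hypothetical patients who share two group attributes, using tapered Matérn kernels. The specific taper (Wendland), the lengthscales and variances, and the way the attribute experts are summed are assumptions for illustration rather than the paper's fitted construction.

```python
# Illustrative sketch of the additive hierarchical / mixture-of-experts covariance
# for two hypothetical patients who share the attributes {female, Black}. Kernel
# parameters, the Wendland taper, and how the taper is applied are assumptions for
# illustration, not the fitted model from the paper.
import numpy as np
from scipy.special import gamma, kv

def matern(x1, x2, nu, amp=1.0, ls=1.0):
    """General Matern kernel with smoothness nu (kv = modified Bessel, 2nd kind)."""
    r = np.abs(x1[:, None] - x2[None, :]) / ls
    zero = (r == 0.0)
    r_safe = np.where(zero, 1.0, r)                 # placeholder to avoid 0 * inf
    scaled = np.sqrt(2.0 * nu) * r_safe
    k = amp**2 * (2.0 ** (1.0 - nu) / gamma(nu)) * scaled**nu * kv(nu, scaled)
    return np.where(zero, amp**2, k)                # k(x, x) = amp^2

def wendland1(x1, x2, taper_range=10.0):
    """Compactly supported taper: exactly zero beyond taper_range (sparsity)."""
    d = np.abs(x1[:, None] - x2[None, :]) / taper_range
    return np.where(d < 1.0, (1.0 - d) ** 4 * (4.0 * d + 1.0), 0.0)

rng = np.random.default_rng(0)
t1 = np.sort(rng.uniform(0, 20, 30))                # patient 1 observation times
t2 = np.sort(rng.uniform(0, 20, 30))                # patient 2 observation times
t_all = np.concatenate([t1, t2])
labels = np.repeat([0, 1], [len(t1), len(t2)])
same_patient = labels[:, None] == labels[None, :]   # block-structure indicator

# Group level (Matern 5/2): sum of attribute experts, shared by both patients.
K_group = (matern(t_all, t_all, nu=2.5, ls=8.0)     # e.g., "sex" expert
           + matern(t_all, t_all, nu=2.5, ls=4.0))  # e.g., "ethnicity" expert
# Patient level (Matern 3/2): nonzero only within each patient's own block.
K_indiv = matern(t_all, t_all, nu=1.5, ls=2.0) * same_patient

noise = 0.1
taper = wendland1(t_all, t_all, taper_range=10.0)
Sigma_n = (K_group + K_indiv) * taper + noise**2 * np.eye(len(t_all))
```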
We create balanced patient cohorts with 30 patients per permutation of groups (i.e., 30 Black women, 30 Black men, 30 white women, 30 white men). For each patient and each covariate, we randomly select 25% of the measurements as the test set and use the remaining measurements as the training set. It is also possible to include future time points in the test set, albeit at the expense of GP model performance as test points extend further into the future, meaning there is greater uncertainty in the predictions. 20 To evaluate performance, we use mean squared error (MSE) and $R^2$ metrics to compare the train and test sets to predicted values. We also evaluate the 95% confidence intervals (CIs) to measure model calibration for GPR, HGP, and MOE HGP. In our discussion, the values reported for 95% CI calibration refer to the percentage of points that fall outside the 95% confidence interval.

We focus on albumin as our covariate of interest, as it has been shown to be a clinical marker of COVID-19 progression. 23 The results for albumin are representative of trends across covariates in the dataset (see Supplementary material for details). The shapes of the patient trajectories for albumin vary greatly (Fig. 4). GPR cannot, for example, capture the trajectory of patient 11, but the HGP and MOE HGP are able to do so. For patient 38, the more granular trends in the first few time points are captured by the MOE HGP, but not the HGP. The average train MSE across patients for the covariate albumin is lowest for the MOE HGP. The average test MSE across patients is comparable across the three models. However, the train and test $R^2$ values, and the 95% CI calibration, are much better across patients for the HGP and MOE HGP as compared to GPR (Table 1).

We find substantial overlap in the patient trajectories that benefit from the MOE HGP and HGP over GPR. Patient 7's trajectory is a canonical case in which the $R^2$ value is greatly improved with the HGP and MOE HGP (Fig. 5, top). The mean function for GPR resembles a running average in the first half of the observed time points. The HGP and MOE HGP both provide better fits where GPR cannot. Similar to patient 7, patient 2's trajectory has higher variance under GPR (Fig. 5, bottom). This large variance has negative consequences for the 95% CI calibration. This patient has eight test points: GPR gives a 95% CI calibration value of 0%, while the HGP and MOE HGP each give 25% since they each have two "outlier" test points that fall outside their intervals. Taken together, these empirical results suggest that the two hierarchical models are more effective on these complex patient trajectories.

The importance of group structure becomes more evident when we examine the kernel parameters at the patient level. The MOE HGP has lower spatial variance across all patients, as reflected in the distribution of the patient-level kernel variance parameters. GPR, lacking a group structure, defaults to learning a higher variance parameter. The structure of the MOE HGP is also useful for comparisons across groups. When partitioning the patient cohort by ethnicity and sex, we see that Black patients have higher variance parameters than white patients do (Fig. 6). We do not observe meaningful differences in these parameters between male and female patients.
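The evaluation metrics described above can be computed as in the following sketch, where the 95% CI calibration value is reported, as in the text, as the percentage of held-out points falling outside the 95% posterior interval. The example predictions and variances here are hypothetical.

```python
# Sketch of the evaluation metrics described above: MSE, R^2, and the 95% CI
# calibration value, reported (as in the text) as the percentage of held-out points
# falling outside the 95% posterior interval. Inputs are illustrative only.
import numpy as np

def evaluate(y_true, y_pred_mean, y_pred_var):
    mse = np.mean((y_true - y_pred_mean) ** 2)
    ss_res = np.sum((y_true - y_pred_mean) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    half_width = 1.96 * np.sqrt(y_pred_var)            # 95% interval under a Gaussian posterior
    outside = np.abs(y_true - y_pred_mean) > half_width
    ci_miss_pct = 100.0 * np.mean(outside)             # e.g., 25.0 for 2 of 8 test points
    return {"MSE": mse, "R2": r2, "95% CI miss (%)": ci_miss_pct}

# Hypothetical held-out albumin values and GP predictions for one patient.
y_test = np.array([3.1, 3.4, 2.9, 3.0, 3.6, 3.3, 2.8, 3.2])
mu_test = np.array([3.0, 3.3, 3.0, 3.1, 3.2, 3.3, 2.9, 3.1])
var_test = np.full(8, 0.04)
print(evaluate(y_test, mu_test, var_test))
```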
Next, we fit the three models to the following clinical markers of COVID-19 disease progression for a randomly selected patient: anion gap, creatinine, partial pressure of oxygen (arterial PO2), blood carbon dioxide levels (CO2), fraction of inspired oxygen (FiO2), and blood oxygen saturation (arterial O2 content) (Fig. 7). Our experiments suggest that the MOE HGP effectively fits these markers for any randomly selected patient in the cohort. Across patients and groups, we see that the HGP and MOE HGP consistently outperform GPR in fitting patient trajectories for albumin, blood CO2, fraction of inspired oxygen (FiO2), and lactic acid (Supplementary material, Figs. 1-3, 5). These covariates (albumin as an indicator of kidney function and the remaining covariates as indicators of cardiovascular function) can inform immediate treatment decisions. Furthermore, the MOE HGP demonstrates superior uncertainty quantification over the HGP by giving the best 95% CI calibration at no observed cost to the test MSE, as reported for albumin, blood CO2, and fraction of inspired oxygen (Table 1, Supplementary material Tables 2-4). The MOE HGP's strong performance, particularly in capturing complex trajectories with low spatial variance, can be attributed to its incorporation of group structure.

We propose a hierarchical mixture of experts Gaussian process (MOE HGP) model to fit and predict COVID-19 patient trajectories for clinically relevant covariates. We show that our MOE HGP model is effective in analyzing covariates and provide an in-depth analysis for albumin. We demonstrate the robustness of our model for an individual patient on indicators of blood oxygen levels such as arterial PO2, CO2, and FiO2. These covariates are noisy yet useful for monitoring patient state in ICUs. Overall, the MOE HGP allows us to model groups separately while sharing signal across groups, enabling more precise modeling of the natural group structure in patient populations without losing statistical power.

A natural extension of this work is to generalize the model to perform multi-output predictions. Because clinical covariates are often correlated, a multi-output GP that captures correlations between disparate covariates, in addition to correlations between observations within a single covariate, would be useful for more accurate modeling of clinical markers across time. With a multi-output model, we may include larger patient cohorts that are more diverse with respect to group attributes that could serve as proxies of socioeconomic status, such as zip code, marital status, and insurance status. We anticipate that we would be able to leverage such group structure to explore differences in disease trajectory or biases in treatment. Other group attributes like age, body mass index (BMI), and estimated glomerular filtration rate (eGFR) inform our understanding of how comorbidities such as obesity and renal disease impact disease progression within a certain socioeconomic or ethnic subpopulation. Another direction of future work may be to apply contrastive learning, that is, methods that capture differences between groups using parameters present in one group but not the other. 24 Contrastive modeling has been applied to linear dimension reduction 25 and formalized into a probabilistic model-based alternative. 26,27
With an extension of probabilistic contrastive modeling to Gaussian processes, we could improve the group-based prior for our model with information regarding differences between patients from traditionally marginalized populations, the "foreground" group, and their majority counterparts, the "background" group.

We would like to thank the University of Pennsylvania Medical Center for providing the data and consultation regarding clinical domain knowledge. This work was funded in part by a COVID-19 grant from the Fast Grants program, a grant from the Helmsley Trust, a grant from the NIH Human Tumor Atlas Research Program, NIH NHLBI R01 HL133218, and NSF CAREER AWD1005627.

We show the robustness of our mixture of experts hierarchical Gaussian process model by fitting and predicting patient trajectories on the following additional covariates: blood CO2, fraction of inspired oxygen (FiO2), chloride, lactic acid, and creatinine. We present the results for albumin referenced in the main text as a point of reference. Again, we benchmark our mixture of experts (MOE) HGP model against Gaussian process regression (GPR) and a hierarchical Gaussian process (HGP).

1. Assessment of COVID-19 hospitalizations by race/ethnicity in 12 states.
2. Racial and ethnic differences in presentation and outcomes for patients hospitalized with COVID-19: findings from the American Heart Association's COVID-19 Cardiovascular Disease Registry.
3. Gaussian processes in machine learning, in Summer School on Machine Learning.
4. Hierarchical modeling and analysis for spatial data.
5. Gaussian process regression analysis for functional data.
6. Fundamentals of nonparametric Bayesian inference.
7. An improved multi-output Gaussian process RNN with real-time validation for early sepsis detection.
8. Learning to detect sepsis with a multitask Gaussian process RNN classifier.
9. Sparse multi-output Gaussian processes for online medical time series prediction.
10. A framework for individualizing predictions of disease trajectories by exploiting multi-resolution structure.
11. Differences between male and female systemic lupus erythematosus in a multiethnic population.
12. The influence of age and gender on the population-based incidence of community-acquired pneumonia caused by different microbial pathogens.
13. Covariance tapering for likelihood-based estimation in large spatial data sets.
14. Improving satellite-based PM2.5 estimates in China using Gaussian processes modeling in a Bayesian hierarchical setting.
15. Hierarchical Bayesian modelling of gene expression time series across irregularly sampled replicates and clusters.
16. Bayesian hierarchical modeling for integrating low-accuracy and high-accuracy experiments.
17. Hierarchical mixture-of-experts model for large-scale Gaussian process regression.
18. Deep Gaussian processes.
19. Hierarchically-partitioned Gaussian process approximation.
20. Interpolation of spatial data: some theory for kriging.
21. Infinite mixtures of Gaussian process experts.
22. Human motion tracking by temporal-spatial local Gaussian process experts.
23. Hypoalbuminemia predicts the outcome of COVID-19 independent of age and co-morbidity.
24. Contrastive learning using spectral methods.
25. Exploring patterns enriched in a dataset with contrastive principal component analysis.
26. Probabilistic contrastive principal component analysis.
27. Contrastive latent variable modeling with application to case-control sequencing experiments.