key: cord-0885754-wdf1luje
authors: Wang, X.; Zhang, H. G.; Xiong, X.; Hong, C.; Weber, G. M.; Brat, G. A.; Bonzel, C.-L.; Luo, Y.; Duan, R.; Palmer, N. P.; Hutch, M. R.; Gutierrez-Sacristan, A.; Bellazzi, R.; Chiovato, L.; Cho, K.; Dagliati, A.; Estiri, H.; Garcia-Barrio, N.; Griffier, R.; Hanauer, D. A.; Ho, Y.-L.; Holmes, J. H.; Keller, M. S.; Klann, J. G.; L'Yi, S.; Lozano-Zahonero, S.; Maidlow, S. E.; Makoudjou, A.; Malovini, A.; Moal, B.; Moore, J. H.; Morris, M.; Mowery, D. L.; Murphy, S. N.; Neuraz, A.; Ngiam, K. Y.; Omenn, G. S.; Patel, L. P.; Pedrera-Jimenez, M.; Prunotto, A.; Samayamuthu, M. J.; Sanz Vidorreta, F. J.
title: SurvMaximin: Robust Federated Approach to Transporting Survival Risk Prediction Models
date: 2022-02-04
journal: nan
DOI: 10.1101/2022.02.03.22270410
sha: 9d1bce58da4eff71a511618fb656e48e3a361118
doc_id: 885754
cord_uid: wdf1luje

Objective: For multi-center heterogeneous Real-World Data (RWD) with time-to-event outcomes and high-dimensional features, we propose the SurvMaximin algorithm to estimate Cox model feature coefficients for a target population by borrowing summary information from a set of health care centers without sharing patient-level information. Materials and Methods: For each of the centers from which we want to borrow information to improve the prediction performance for the target population, a penalized Cox model is fitted to estimate feature coefficients for the center. Using estimated feature coefficients and the covariance matrix of the target population, we then obtain a SurvMaximin estimated set of feature coefficients for the target population. The target population can be an entire cohort comprised of all centers, corresponding to federated learning, or can be a single center, corresponding to transfer learning. Results: Simulation studies and a real-world international electronic health records application study, with 15 participating health care centers across three countries (France, Germany, and the U.S.), show that the proposed SurvMaximin algorithm achieves comparable or higher accuracy compared with the estimator using only the information of the target site and other existing methods. The SurvMaximin estimator is robust to variations in sample sizes and estimated feature coefficients between centers, which amounts to significantly improved estimates for target sites with fewer observations. Conclusions: The SurvMaximin method is well suited for both federated and transfer learning in the high-dimensional survival analysis setting. SurvMaximin only requires a one-time summary information exchange from participating centers. Estimated regression vectors can be very heterogeneous. SurvMaximin provides robust Cox feature coefficient estimates without outcome information in the target population and is privacy-preserving.

Electronic health records (EHR) have been widely adopted in the U.S. and other countries [1] [2] [3] [4] [5] .

The EHR contains a wealth of patient medical information collected over time by health care providers, and common structured data types include demographics, diagnoses, laboratory test results, medications, and vital signs. Given its longitudinal nature, EHR data have been utilized for various research purposes, including survival analysis [6] [7] [8] . For example, the Cox proportional hazards model is used commonly and has been applied to EHR risk prediction [9] .

With the increasing availability of EHR data, there is great interest in integrating knowledge from a diverse range of health care centers to improve generalizability and accelerate discoveries. There now exist multiple collaborative consortia, each composed of diverse health care centers seeking to leverage their EHR data in unison. For example, the Consortium for Clinical Characterization of COVID-19 by EHR (4CE consortium) is an international research collaborative that collects patient-level EHR data to study the epidemiology and clinical course of COVID-19 [10]. The consortium comprises more than 300 hospitals across seven countries with 83,178 patients, representing a broad range of multi-national health care centers serving diverse patient populations.

However, EHR data obtained from multiple diverse health care centers often exhibit a high degree of heterogeneity due to variability in EHR and data warehouse platforms, patient populations, health care practices, coding, and documentation. Further, patient-level data often cannot be shared directly between health care centers in a timely manner due to patient and institutional privacy laws [11]. Thus, there is a need for robust analytic strategies to overcome the barriers to conducting multi-center EHR studies.

Our objective is to jointly leverage multi-center, high-dimensional EHR data to make more precise inferences for a target population in the survival analysis setting by sharing only summary statistics obtained from each center, such as Cox feature coefficients and covariance matrices. The target population may be the entire population inclusive of all centers, a subset of centers, or a new, separate population. Integrative analysis approaches that only require individual sites to share summary statistics are often referred to as federated learning [12] [13] [14].

medRxiv preprint doi: https://doi.org/10.1101/2022.02.03.22270410; this version posted February 4, 2022. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Most existing federated learning methods focus on settings with a small number of predictors and/or homogeneous settings where the underlying predictive models are shared across sites [12] [13] [14] [15] [16] . In addition, existing methods generally require several rounds of communication between sites, which can be inefficient and labor-intensive. To ensure transportability of models across sites, transfer learning methods have been proposed to transfer knowledge from separate but related centers to provide robust and precise estimates for patients in a new center. This approach has widespread applications in medical studies such as drug sensitivity prediction, integrative analysis of "multi-omics" data, and natural language processing [17] [18] [19] [20] . However, most transfer learning methods require outcome labels from the target population, which may be difficult and expensive to obtain, and do not consider the federated learning scenario where individual-level data cannot be shared across sites. In the absence of outcome labels in the target population, transfer learning methods require stringent assumptions that the target and source populations share the same underlying risk model, leading to potential transfer failure when the risk model for the target population is similar to only a subset of source populations [21] .

With heterogeneous training datasets from multiple centers, one potential limitation of the existing federated transfer learning methods is that the performance of the prediction model can vary substantially across centers. Thus, although the overall performance may be satisfactory, the performance of the model in a particular center might be low. Moreover, when trained models are applied to a new population, transferability and portability are not guaranteed. To improve the robustness of prediction models, the maximin effect approach was first proposed in [11, 14, 15], and used as a metric to build a robust prediction model for continuous outcomes across heterogeneous training datasets [22] [23] [24]. Instead of optimizing the average performance across all training datasets, the maximin effect method aims to train a model that maximizes the minimum gain over the null model among all training datasets. The maximin approach was further extended to a setting that allows for covariate shift between the source and target populations [25]. The group distributionally robust optimization in [20], which builds a robust prediction model by minimizing the worst-case training loss over a class of distributions, is closely related to the maximin effect [26]. The maximin projection has been developed in [21] to construct the optimal treatment regimen for new patients by leveraging training data from different groups with heterogeneity in optimal treatment decision [27].

In this paper, motivated by the maximin algorithm for continuous outcomes in [11, 27] , we propose a maximin transfer learning algorithm for predicting a survival outcome (SurvMaximin) in a target population with high-dimensional features by robustly combining multiple prediction models trained in different source populations [22] [23] [24] [25] . This algorithm only requires sharing of summary statistics across centers and can easily accommodate high-dimensional features. SurvMaximin can be viewed as a robust federated approach to transfer models trained at multiple external centers to a target population, so we refer to it as a federated transfer learning method. SurvMaximin differs from existing transfer learning methods in that it does not require the target population to share the same underlying model with the source population, a highly desirable property when learning with multiple heterogeneous health care systems. The training of the SurvMaximin algorithm also does not require the target population to have gold-standard outcome labels.

The main aim of the SurvMaximin algorithm is to derive a robust risk prediction model for an unlabeled target population based on labeled data from the $L$ source populations.

Following [27], we may show that the maximin effect $\beta^*$ as defined by (4) can be expressed as a weighted average of

norm, and the minimization above is restricted to the simplex in $L$-dimensional space. The optimal aggregation weight $\gamma^*$ in (5)
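For orientation, the weighted-average form described above can be written out as follows. This is a sketch based on the general maximin-effect formulation in the cited literature, not a verbatim recovery of equations (4) and (5). The symbols $\beta^{(l)}$ (coefficient vector of source center $l$), $\Sigma_Q$ (feature covariance matrix of the target population), $\Gamma$, and $\Delta^L$ (the simplex) are notation assumed here, and any additional norm penalty on $\gamma$ that (5) may include is omitted.

```latex
\beta^* = \sum_{l=1}^{L} \gamma^*_l \, \beta^{(l)},
\qquad
\gamma^* = \operatorname*{arg\,min}_{\gamma \in \Delta^L} \; \gamma^\top \Gamma \, \gamma,
\qquad
\Gamma_{lk} = \beta^{(l)\top} \, \Sigma_Q \, \beta^{(k)},
\qquad
\Delta^L = \Bigl\{ \gamma \in \mathbb{R}^L : \gamma_l \ge 0, \ \sum_{l=1}^{L} \gamma_l = 1 \Bigr\}.
```

Under this reading, $\beta^*$ is the point in the convex hull of the source coefficient vectors that is closest to zero in the distance induced by $\Sigma_Q$.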

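The SurvMaximin algorithm involves three key steps: (I) locally train the prediction model for each of the $L$ source centers; (II) estimate the covariance matrix of the features from the unlabeled target population data; and (III) aggregate the local estimates according to (5). The schema of the SurvMaximin algorithm is shown in Figure 1.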

Step I: Training $L$ local risk prediction models

We first obtain the local feature coefficient estimate for each of the $L$ source centers as the maximizer of the penalized partial likelihood, that is, the log partial likelihood associated with that center's data minus an elastic net penalty function. The elastic net penalty is frequently used to overcome high dimensionality and collinearity of features, with α = 1 corresponding to the standard LASSO and α = 0 corresponding to the ridge penalty [28]. The non-negative penalty parameter can be selected via standard tuning criteria including the AIC, BIC, or cross-validation.
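To make the Step I objective concrete, the following minimal Python sketch evaluates a penalized Cox objective for one center. It is an illustration rather than the authors' implementation: the function names and toy data are hypothetical, ties are handled in the Breslow manner, and the elastic net is scaled as λ[α‖β‖₁ + (1 − α)‖β‖₂²/2], a common convention assumed here. In practice the maximization and the tuning of λ would be delegated to a dedicated penalized Cox solver.

```python
import math


def log_partial_likelihood(beta, X, time, event):
    """Cox log partial likelihood (Breslow form; assumes few or no ties)."""
    eta = [sum(b * x for b, x in zip(beta, row)) for row in X]
    total = 0.0
    for i, (t_i, d_i) in enumerate(zip(time, event)):
        if not d_i:
            continue  # censored subjects contribute only through risk sets
        # Risk set: everyone still under observation at time t_i.
        risk = [eta[j] for j, t_j in enumerate(time) if t_j >= t_i]
        m = max(risk)  # shift for a numerically stable log-sum-exp
        total += eta[i] - (m + math.log(sum(math.exp(r - m) for r in risk)))
    return total


def elastic_net_penalty(beta, alpha):
    """alpha = 1 gives the LASSO penalty, alpha = 0 the ridge penalty."""
    l1 = sum(abs(b) for b in beta)
    l2 = sum(b * b for b in beta)
    return alpha * l1 + 0.5 * (1.0 - alpha) * l2


def penalized_objective(beta, X, time, event, lam, alpha):
    """Quantity to be maximized over beta at a single center."""
    return (log_partial_likelihood(beta, X, time, event)
            - lam * elastic_net_penalty(beta, alpha))


# Toy data for one hypothetical center: 4 patients, 2 features.
X = [[0.5, 1.0], [-0.3, 0.2], [1.2, -0.7], [0.0, 0.4]]
time = [1.0, 2.0, 3.0, 4.0]
event = [1, 0, 1, 1]  # 1 = death observed, 0 = censored
value = penalized_objective([0.1, -0.2], X, time, event, lam=0.05, alpha=0.5)
```

Only the maximizing coefficient vector needs to leave the center, which is what keeps this step compatible with the summary-level sharing described above.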

is the empirical variance-covariance matrix of the features in the target population $Q$, estimated based on the unlabeled target population data.
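The covariance step uses features only, so it can be carried out without any outcome labels. A minimal sketch, with a hypothetical function name and toy feature matrix, and assuming the usual n − 1 denominator:

```python
def empirical_covariance(X):
    """Sample variance-covariance matrix of the rows of X (n - 1 denominator)."""
    n, p = len(X), len(X[0])
    if n < 2:
        raise ValueError("need at least two observations")
    means = [sum(row[j] for row in X) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in X]
    return [[sum(r[j] * r[k] for r in centered) / (n - 1) for k in range(p)]
            for j in range(p)]


# Toy unlabeled target data: 3 patients, 2 features (no survival outcomes).
X_target = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
sigma_q = empirical_covariance(X_target)
```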

Step III: Maximin aggregation via (5)

Finally, we obtain the SurvMaximin aggregated log hazard ratio estimator as
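The aggregation step can be sketched as a small quadratic program over the simplex. The code below is an assumption-laden illustration, not the authors' implementation: it minimizes γᵀΓγ with Γ built from the local coefficient vectors and the target covariance matrix, uses plain projected gradient descent, and omits any extra penalty term that (5) may contain. All function names are hypothetical.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0:
            theta = t  # the last k that passes the check gives the threshold
    return [max(x - theta, 0.0) for x in v]


def quad_form(a, S, b):
    """Compute a' S b for vectors a, b and matrix S."""
    return sum(a[i] * S[i][j] * b[j] for i in range(len(a)) for j in range(len(b)))


def maximin_weights(betas, sigma, n_iter=2000):
    """Weights minimizing w' Gamma w on the simplex, Gamma[l][k] = b_l' Sigma b_k."""
    L = len(betas)
    gram = [[quad_form(betas[l], sigma, betas[k]) for k in range(L)] for l in range(L)]
    trace = sum(gram[l][l] for l in range(L))
    w = [1.0 / L] * L
    if trace <= 0.0:
        return w  # all local models are null; any weights give the same result
    step = 1.0 / (2.0 * trace)  # safe step: the gradient's Lipschitz constant is <= 2 * trace
    for _ in range(n_iter):
        grad = [2.0 * sum(gram[l][k] * w[k] for k in range(L)) for l in range(L)]
        w = project_to_simplex([w[l] - step * grad[l] for l in range(L)])
    return w


def aggregate(betas, weights):
    """Weighted average of the local coefficient vectors."""
    p = len(betas[0])
    return [sum(wl * b[j] for wl, b in zip(weights, betas)) for j in range(p)]


# Toy example: two source centers, two features, identity target covariance.
betas = [[2.0, 0.0], [0.0, 1.0]]
sigma_q = [[1.0, 0.0], [0.0, 1.0]]
weights = maximin_weights(betas, sigma_q)
beta_star = aggregate(betas, weights)
```

In this toy case the center with the larger coefficient receives the smaller weight, which illustrates how the aggregation guards against any single source model dominating the transported model.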

A substantial challenge in transfer learning across different health care centers is that certain risk predictors, such as laboratory test results or demographic information, may be available in one center but not in a different center. For example, in the 4CE Consortium, all U.S. centers report data on race while European centers do not, causing race data to be entirely missing for European centers. To transport a risk prediction model for a target center Q with only a subset of features available, one may fit a reduced model limited to only the available features for each source center and transport the reduced risk models from the source centers. When the target center changes, essentially we will need to retrain the model at each source center according to the feature availability of the target center. Such an approach is not computationally efficient as each center needs to fit multiple models, and also increases the number of communications required across centers.

To enable transfer learning in the context of differential feature availability, we propose a simple projection approach that only requires each source center to additionally compute the empirical covariance matrix of the features.
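The exact projection formula is not recoverable from the text here, so the following sketch shows one plausible form only: a least-squares projection of the full linear predictor onto the available features, using the source center's feature covariance matrix. Writing A for the available and M for the missing features, the reduced coefficients are taken to be β_A + Σ_AA⁻¹ Σ_AM β_M. The function names and this specific formula are assumptions for illustration.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("covariance block is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        residual = M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = residual / M[r][r]
    return x


def project_coefficients(beta, sigma, available):
    """Map a full-model beta to the available features (assumed projection)."""
    missing = [j for j in range(len(beta)) if j not in available]
    sigma_aa = [[sigma[i][j] for j in available] for i in available]
    # Sigma_AM beta_M: the part of the missing features' signal that the
    # available features can carry through their correlation.
    rhs = [sum(sigma[i][m] * beta[m] for m in missing) for i in available]
    adjustment = solve_linear_system(sigma_aa, rhs)
    return [beta[i] + a for i, a in zip(available, adjustment)]


# Toy example: the first covariate is unavailable at the target center.
beta_full = [1.0, 2.0]
sigma_source = [[1.0, 0.5], [0.5, 1.0]]
beta_reduced = project_coefficients(beta_full, sigma_source, available=[1])
```

Because the adjustment needs only the stored coefficients and covariance matrix, a source center would not have to refit a reduced model each time the target center changes.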

We validated the performance of SurvMaximin in federated transfer learning using both simulation studies and a real-world study where we transported COVID-19 mortality risk prediction models to target centers using EHR data from hospitalized patients with COVID-19.

Simulation studies were conducted to assess the performance of SurvMaximin and to compare its performance against existing federated learning methods. Since SurvMaximin transports a risk prediction model to a future target center without survival outcomes, we use other federated learning methods that also do not require supervised training on the target data as comparisons. Specifically, we consider the standard random effect meta-analysis estimator (herein referred to as Meta); the One-shot Distributed Algorithm (ODAC) for the Cox model [26]; and the locally trained risk prediction model with varying training sizes of $n_Q$ = 200, 400, and 600. We considered simulation scenarios with $L = 15$ centers each

with an autoregressive correlation (AR) structure, or where Σ is a compound symmetry covariance matrix with variance 1 and covariance 0.5. We then generated

Subsequently, we generated $T$ and $T_Q$ from:

where $\epsilon$ and $\epsilon_Q$ were generated from extreme value distributions. We let

As naïve benchmarks, we additionally constructed the ODAC and Meta models based on the full set of covariates and transported these models to the target site by removing the component associated with the first covariate. Such naïve approaches are often adopted in practice due to the inability to refit the reduced models on source sites.
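The simulation design described above can be illustrated with a short sketch. The generating equation is not recoverable from the text, so the code assumes the standard construction log T = −η + ε, where η is the linear predictor and ε follows the standard (minimum) extreme value distribution; this yields a Cox model with hazard proportional to exp(η). The autoregressive parameter `rho` and all function names are hypothetical.

```python
import math
import random


def ar1_covariance(p, rho):
    """AR(1) structure: entry (i, j) equals rho ** |i - j|."""
    return [[rho ** abs(i - j) for j in range(p)] for i in range(p)]


def compound_symmetry(p, variance=1.0, covariance=0.5):
    """Compound symmetry: common variance on the diagonal, common covariance off it."""
    return [[variance if i == j else covariance for j in range(p)] for i in range(p)]


def simulate_event_times(linear_predictors, rng):
    """Draw T with log T = -eta + eps, where eps is the log of a unit exponential."""
    times = []
    for eta in linear_predictors:
        eps = math.log(rng.expovariate(1.0))  # standard extreme value error
        times.append(math.exp(-eta + eps))
    return times


sigma_ar = ar1_covariance(3, rho=0.5)
sigma_cs = compound_symmetry(3)
event_times = simulate_event_times([0.0, 0.5, 1.0], random.Random(2022))
```

A larger linear predictor shifts the event times toward zero, so higher risk scores correspond to shorter survival, as expected under a proportional hazards model.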

We evaluate the overall performance of the estimated risk score from each method in predicting survival in the target population using the C-statistic.
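As a rough illustration of how a risk score can be scored against censored survival data, the sketch below computes a Harrell-type concordance index. This is a simplification assumed for illustration: the censoring-adjusted C-statistic cited in the reference list may be defined differently, and the function name is hypothetical.

```python
def concordance_index(time, event, risk):
    """Share of comparable pairs in which the higher risk score fails first."""
    concordant, comparable = 0.0, 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue  # a pair is comparable only if the earlier time is an event
        for j in range(n):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: risk scores perfectly ordered against survival times.
time = [1.0, 2.0, 3.0, 4.0]
event = [1, 1, 0, 1]
c_perfect = concordance_index(time, event, [4.0, 3.0, 2.0, 1.0])
c_reversed = concordance_index(time, event, [1.0, 2.0, 3.0, 4.0])
```

A value of 0.5 corresponds to a risk score with no discriminative ability, and a value of 1 to perfect ordering of patients by survival time.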

We further validated the performance of SurvMaximin by deriving robust and transportable mortality risk prediction models for patients hospitalized with COVID-19 using international, multi-institutional EHR data from the 4CE consortium [10, 16]. Baseline risk factors and mortality information were available for 83,178 patients from $L = 15$ health care centers. We treated each center in turn as a potential target population and sought to derive a mortality risk prediction model that is transportable to this population from multiple external models. Given the multinational nature of our data, we anticipated a significant amount of between-center heterogeneity in the mortality risk models.

Baseline risk predictors considered include: age group (18-25, 26-49, 50-69, 70-80, 80+), sex, and race (White, Black, Asian, Hispanic, and other); the pre-admission Charlson comorbidity index (CCI) derived from diagnostic codes; and laboratory test values at admission [30]. We focused on ten commonly measured laboratory tests (with missing rates < 30%), including C-reactive protein (CRP), albumin,

Simulation results are summarized in Figure 2 for the moderate signal scenario. In setting (I), where 5 source sites have feature coefficients similar to those of the target site, SurvMaximin yields models with accuracy comparable to those from ODAC and Meta when the heterogeneity is low (τ = 0.05, 0.1) and outperforms the other methods when the heterogeneity is high (τ = 0.2). Since there are 5 source sites relatively similar to the target site, the transported model from SurvMaximin attained accuracy higher than the

Results for assessing the performance of the projected SurvMaximin algorithm in the presence of missing features are summarized in Figure S2 of the Supplement. The projected SurvMaximin model attains prediction performance comparable to the SurvMaximin model trained by aggregating the locally fit sub-models with $Z$. Thus, the projection method provides a comparable alternative SurvMaximin estimator when features may be missing for some sites, without the need to unify the set of features for all the centers all the time. The projected SurvMaximin estimator also outperforms the naïve approach of removing the component associated with the first covariate from the ODAC or Meta estimators.

For each covariate, we compare the $L$ local estimates of its log hazard ratio to those based on SurvMaximin in Figure 3. While these two sets of estimators are generally consistent, SurvMaximin estimators tend to be more concentrated at the center, while local estimators exhibit higher variability, in part due to unstable estimates from some sites. For example, the log hazard ratio (HR) of the age group 18-25 ranges from -6.58 to 0 for the local estimates, while the SurvMaximin estimates range from -1.43 to -0.7.

The AUC estimates associated with the risk models obtained based on SurvMaximin and local supervised training for predicting 3-, 7-, and 14-day mortality are shown in Figure 4. For each site, we also compared the AUC of models trained in each of the external sites, the locally trained model, and the SurvMaximin model for predicting 14-day mortality (Figure S4 of the Supplement). The accuracy of risk models transported by SurvMaximin, which does not utilize the outcome information of the target local site, is comparable to, and sometimes even higher than, that of locally trained models. The AUCs of SurvMaximin are more concentrated at a comparatively higher AUC, suggesting the robustness of the SurvMaximin approach.

We proposed the SurvMaximin approach to derive a robust risk prediction model for a target population by synthesizing information from risk models estimated at multiple sites. For the target site, the SurvMaximin estimator is a linear combination of the coefficient estimators of the local models and is closest to a zero point with respect to some distance related to the target population. The method enables us to safely transport a set of existing risk models to a target population in the presence of high cross-site heterogeneity.

Compared with existing federated learning methods, such as Meta and the federated learning methods proposed by [14], the proposed maximin method can handle high-dimensional covariates and is very robust to heterogeneity between sites. It is also robust to sample size differences and improves inference when the sample size of the target population is small, as seen from the simulation studies. Thus, SurvMaximin is flexible and general enough to adapt to a variety of scenarios while achieving high accuracy with limited information.

In this paper, we developed a SurvMaximin covariate effect estimator for multi-center survival data with high-dimensional covariates. Simulation studies and real EHR data analysis show that the proposed estimator achieves high accuracy in a range of settings with different levels of heterogeneity between sites and different sample sizes. SurvMaximin is a highly flexible and robust approach for multi-center survival analysis, which enables federated learning, transfer learning, and federated transfer learning.

Figure S1: Average C-statistics under settings (I) and (II) with Σ being either AR(1) or compound symmetry, p = 20 or 50, and τ = 0.05, 0.1, 0.2 (local coefficient heterogeneity) for predicting survival in the target population with risk models trained by SurvMaximin, Meta, and ODAC, as well as supervised penalized Cox regression with $n_Q$ = 200, 400, 600 labeled target data (Local 200, Local 400, Local 600).

Figure S3: Missing predictors in each site for the COVID-19 mortality risk modeling with 4CE consortium EHR data.

Figure S4: Comparing the AUC of the locally trained model (red), the SurvMaximin model (blue), and the risk models trained in each external site (black) for predicting 14-day mortality by using a given health care center (APHP, FRBDX, UKFR, BIDMC, MGB, UPENN, UPITT, NWU, UMICH, UCLA, VA1-VA5) as a potential target site.

Regression models and life-tables

Adapting a clinical comorbidity index for use with ICD-9-CM administrative databases

Frustratingly easy domain adaptation

Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2)

Easing the adoption and use of electronic health records in small practices

DataSHIELD: resolving a conflict in contemporary bioscience - performing a pooled analysis of individual-level data without sharing the data

On the C-statistics for evaluating overall adequacy of risk prediction procedures with censored survival data

mice: Multivariate imputation by chained equations in R

Physicians in nonprimary care and small practices and those age 55 and older lag in adopting electronic health record systems

Grid Binary LOgistic REgression (GLORE): building shared models without sharing data

Magging: maximin aggregation for inhomogeneous large-scale data

Survival analysis with electronic health record data: Experiments with chronic kidney disease

WebDISCO: a web service for distributed cox model learning without patient-level data sharing

Maximin effects in inhomogeneous large-scale data

Confidence intervals for maximin effects in inhomogeneous large-scale data

Integrative analysis of multi-omics data for discovery and functional studies of complex human diseases

Electronic health record portal adoption: a cross country analysis

Transfer learning approaches to improve drug sensitivity prediction in multiple myeloma patients

Does distributionally robust supervised learning give robust classifiers?

Maximin projection learning for optimal treatment decision with heterogeneous individualized treatment effects

Statistical learning with sparsity: the lasso and generalizations

Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization

Association of patient characteristics and tumor genomics with clinical outcomes among patients with non-small cell lung cancer using a clinicogenomic database

International electronic health record-derived COVID-19 clinical course profiles: the 4CE consortium

Learning from local to global: An efficient distributed algorithm for modeling time-to-event data

Inference for High-dimensional Maximin Effects in Heterogeneous Regression Models Using a Sampling Approach

Adoption rates of electronic health records in Turkish Hospitals and the relation with hospital sizes

Transfer learning for high-dimensional linear regression: Prediction, estimation, and minimax optimality

Predicting with proxies: Transfer learning in high dimension