key: cord-0283673-psg8b0g5
authors: Luo, C.; Islam, M. N.; Sheils, N. E.; Reps, J. M.; Buresh, J.; Duan, R.; Tong, J. M.; Edmondson, M.; Schuemie, M. J.; Chen, Y.
title: Lossless Distributed Linear Mixed Model with Application to Integration of Heterogeneous Healthcare Data
date: 2020-11-18
journal: nan
DOI: 10.1101/2020.11.16.20230730
sha: 9c3c159f153db464460f096ef9c5fc70ce764540
doc_id: 283673
cord_uid: psg8b0g5

Linear mixed models (LMMs) are commonly used in many areas including epidemiology for analyzing multi-site data with heterogeneous site-specific random effects. However, due to the regulation of protecting patients' privacy, sensitive individual patient data (IPD) are usually not allowed to be shared across sites. In this paper we propose a novel algorithm for distributed linear mixed models (DLMMs). Our proposed DLMM algorithm can achieve exactly the same results as if we had pooled IPD from all sites, hence the lossless property. The DLMM algorithm requires each site to contribute some aggregated data (AD) in only one iteration. We apply the proposed DLMM algorithm to analyze the association of length of stay of COVID-19 hospitalization with demographic and clinical characteristics using the administrative claims database from the UnitedHealth Group Clinical Research Database.

The COVID-19 outbreak has become a pandemic, causing a large increase in mortality and posing a heavy burden to the healthcare system. Much research has been done on treatment efficacy and adverse clinical outcomes [1] [2] [3] [4] [5] and much remains to be done. As studies continue to be conducted and published, multi-site collaboration is demanded for evidence synthesis [3, 4] . Multi-site studies based on healthcare data, including the electronic health record (EHR) and claims data, can integrate clinical information across multiple sites or systems to improve estimation and predictive performance due to use of a larger and more inclusive sample from the population of interest.

One primary challenge for multi-site collaboration is preserving the privacy of protected health information. Sensitive individual patient data (IPD) including the patient's identity, diagnoses and treatments are usually not allowed under privacy regulation to be shared across networks. Existing approaches to performing multi-site studies, e.g. distributed algorithms, have the drawback of biased estimation [12] or communication burden due to the requirement of iterative transmission of summarized data [14, 15] . Specifically, for ordinary linear regression, identical results with pooled analysis can be obtained by lossless compression [7] . Another important challenge of analyzing multisite data is the heterogeneity of data distribution across sites. Many existing approaches for multi-site analysis assume the data are homogeneously distributed across sites. This assumption may be violated in practical scenarios and thus make the model vulnerable in estimation and hypothesis testing. For instance, in this manuscript we consider the length of stay of the hospitalization due to COVID-19 as the continuous outcome. Heterogeneity across countries, regions, and sub-populations has been reported in the literature [2] . The aforementioned lossless compression approach [7] for ordinary linear regression, fails to take the heterogeneity into account.

In this paper we propose a novel algorithm for distributed linear mixed models (DLMMs). Linear mixed models (LMMs) are commonly used in many areas including epidemiology for analyzing multi-site data with heterogeneity. The model assumes site-specific random effects of the covariates (and intercept) on a continuous outcome. To the best of our knowledge, there is no existing approach for fitting LMMs in a distributed manner, see Figure 1 for the comparison of several approaches in this context. Our proposed distributed LMM can achieve exactly the same results as if we had pooled individual patient data from all sites, hence the lossless property. These lossless results can be obtained by requiring the sharing of summary statistics from each site in only one iteration. We apply the proposed DLMM to analyze the association of length of stay of COVID-19 hospitalization with demographic and clinical characteristics using the administrative claims database for Medicare Advantage members from a large US Health insurance provider (Appendix Figure 2A) . 

Due to the heterogeneity of data across sites, the effects of the covariates on the outcome among sites in the linear regression model may not always be the same [7] . A linear mixed model is thus often used. Assume for the th patient at the th site, $% is the continuous outcome, $' is the -dimensional covariate vector and is the vector of fixed effects, $' is the -dimensional covariate vector having random effect $ , and $' is the random error. 

where $ ∼ (0, ), $% ∼ (0, σ > ). The random effects covariates $' can be part or all of $% , or constant if random intercept only. The random effect covariance matrix can admit certain structures with unknown parameters. For instance we can assume the random effects are independent, i.e. = ( C > , . . . , D > ). These parameters (e.g. variance components) and the fixed effects are usually estimated by maximum likelihood (ML) or restricted maximum likelihood (REML) estimation [6] . The log-likelihood of LMM using all the data is

where $ and $ are the covariate matrix and the outcome vector of the th site, |. | is the matrix determinant and $ = $ ( > , ) = $ $ / + σ > U V .

The maximum likelihood estimation can be further simplified by profiling out and σ > from (2). Denote = / > , given , the estimation of and σ > are

where $ = $ ( ) = $ $ / + U V . Thus the profile log-likelihood with respect to only is

and the restricted profile log-likelihood is

The ML or REML estimate of can be obtained by maximizing (5) or (6) . The estimates of and σ > can be subsequently obtained by (3) and (4). We denote these estimates as (`, a > , b ). The variance of the estimated fixed effects ` is thus

(7) or the sandwich estimator:

It's easy to see from (5) that there is no closed-form estimation for LMM. Thus unlike in the ordinary linear model [7] , the LMM estimation is not trivial to be distributed to each site losslessly. Fortunately, with some linear algebra, we can disentangle the data ( $ , $ ) and the parameters in | $ | and $ OC and thus reconstruct the profile log-likelihood (5) without communicating individual patient data. Specifically, we utilize the Woodbury matrix identity [8] to obtain

and the matrix determinant lemma [9] to obtain

where D is the × identity matrix. We focus on the situation that the covariates in are a subset of that in . The more general case is similar and will be elaborated in the Appendix. We require the th site to contribute some summary statistics, i.e. the × matrix $ g = $ / $ , the × 1 vector $ gh = $ / $ , the scalar $ j = $ / $ and the sample size $ . The exact log-likelihood (5) (or a restricted likelihood (6)) can then be reconstructed using these summary statistics. The details of the reconstruction are in the Appendix. The result by the DLMM algorithm is thus identical to that of the pooled LMM analysis, see Figure 2 . is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted November 18, 2020. ; https://doi.org/10.1101/2020.11.16.20230730 doi: medRxiv preprint 

We test the significance of random effects of each individual covariate by likelihood ratio test. For simplicity we assume the potential random effects are independent and the random intercept always exists, i.e. = ( C > , . . . , D > ) and C > > 0, for the covariate corresponding to variance component l > , ≥ 2, we test

The likelihood ratio test (LRT) gives the likelihood ratio

follows a 50:50 mixture of q > and C > [10, 11] . Notice both the log-likelihoods in (10) can be reconstructed by the communicated summary statistics.

Finally, the BLUP [6] of the random effects $ at the th site is

Conditioning on $ ,a $ has mean zero and covariance matrix

Since we are more interested in prediction of $ , it is more appropriate to use prediction intervals as below . CC-BY-NC-ND 4.0 International license It is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted November 18, 2020. ; https://doi.org/10.1101/2020.11.16.20230730 doi: medRxiv preprint

We summarize the analysis with the proposed DLMM algorithm as in Algorithm 1. 

Perform the likelihood ratio test for the significance of random effects of each covariate by (10) . 3. With the significant random effects identified by the above step, reconstruct the profile log-likelihood (5) or the restricted profile log-likelihood (6), obtain the estimate b . 4. Obtain `= Y ( b )and a > = [ > ( b )by (3) and (4).

Calculate the variance of the estimated fixed effects ` by (7). 6. Calculates the BLUPs of the random effects in each site by (11) .

We demonstrate the utility and lossless property of the DLMM method by studying the association of length of stay of COVID-19 hospitalization with patients' demographic and clinical characteristics. We emphasize that this example is for illustrative purposes only and to this end we considered only covariates that have already been well-documented in the literature.

We identified patients who were admitted as inpatients to a hospital with a primary or secondary diagnosis of COVID-19 between January 1, 2020 and September 30, 2020. The data are collected from = 538 sites in the UnitedHealth Group Clinical Research Database and the total number of patients is = ∑ $ Q $RC = 47756. The detailed inclusion criteria is in Appendix Figure 1A . We treat length of stay as a continuous outcome. The demographic characteristics include age, gender and race and the clinical characteristics include a history of cancer, chronic obstructive pulmonary disease (COPD), heart disease, hypertension, hyperlipidemia, kidney disease and obesity. Charlson comorbidity index score is also included as a measure of the overall patient's burden of diseases; the higher the score, the more severe the patient's health condition is. We provide the details of the definition of the characteristics in the Appendix Table 2A . is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted November 18, 2020. ; https://doi.org/10.1101/2020.11.16.20230730 doi: medRxiv preprint (c) (d) Figure 2 . We select the covariate-specific random effects $t , = 1, . . . , , as described in Section 2.3. For simplicity we assume the random effect of different covariates are independent, i.e. = ( C > , . . . , D > ). We select the covariates for which the corresponding p-value ≤0.05. In particular, we select random slopes for obesity, diabetes, kidney-diseases, and charlosn scores. A LMM with random intercept and random effects for obesity, diabetes, kidney-diseases, and charlosn scores is then fitted by either pooling the IPD together, or the proposed DLMM algorithm. We also calculate the BLUPs and the prediction intervals of the random effects at each site by (11) in Section 2.4.

We compare the result of the pooled analysis and the distributed algorithm in Figure 2 and Appendix Figure 3A . Specifically, the estimation of the fixed effects, their standard errors, the variance components are shown to be identical by the pooled analysis or the distributed algorithm. The estimated BLUPs by either the pooled analysis or the distributed algorithm are also shown to be identical in Figure  3A . The forest plot of fixed effects estimation and BLUPs of the random effects at a specific site are shown in Figure 3 . Notice male, older age (≥80), higher Charlson score, obesity, prevalence of diabetes, and kidney diseases are shown to be significantly associated with longer COVID-19 hospitalization and Caucasian race, compared to others, is significantly associated with shorter COVID-19 hospitalization.

The results match with that of literature for most of the covariates [16-19].

. CC-BY-NC-ND 4.0 International license It is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted November 18, 2020. ; https://doi.org/10.1101/2020.11.16.20230730 doi: medRxiv preprint BLUPs for random effects corresponding to a site located in the south region; reported are 95% prediction intervals.

Special care must be taken with healthcare data in order to preserve patient privacy. Anonymizing data while preserving features that are important for understanding an individual's health is highly nontrivial. In addition, large, representative datasets are especially scarce. Distributed models solve the privacy issue by requiring that only summary level statistics are shared. The one-shot model presented here requires only the × matrix of summary statistics, sample size, and -dimensional vector be sent once. This allows for efficient sharing to build models for various applications across healthcare where data may remain completely protected by eliminating the need for data pooling at a central source. By considering a large, more diverse sample from multiple sites we expect a more robust outcome which benefits all institutions. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted November 18, 2020. ; https://doi.org/10.1101/2020.11.16.20230730 doi: medRxiv preprint . CC-BY-NC-ND 4.0 International license It is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint (3) and (4). Notice the unknown parameters are contained only in and are separated from the summary statistics. Therefore, the profile log-likelihood (5) with respect to can be reconstructed as

where | $ | = | D + $ ‰ | according to (9) . The restricted profile log-likelihood (6) can also be reconstructed in the same way.

Medical and pharmacy claims data are captured, predominantly electronically, from sites of care seeking third-party reimbursement for both Medicare and commercial plans using the industry standard data collection forms HCFA/CMS-1500 for facility claims, UB04/CMS-1450 for professional services and outpatient claims, and NCPDP for pharmacy claims or their electronic equivalents. Structured data from these standardized forms are coded using the International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM), National Drug Codes (NDC), Current Procedural Terminology (CPT) codes, and Logical Observation Identifiers Names and Codes (LOINC) codes, and Diagnosis Related Groups (DRG). This nomenclature ensures consistency of data collection across geographic regions, health systems, and payers throughout the United States.

Claims that do not adhere to the form or coding standards described above are rejected from reimbursement, minimizing the risk that inappropriately structured data are included in the database. Data specific to SARS-CoV-2 and COVID-19 has an additional Quality Control layer to control for errors in sampling and data collection; this is described below in the section on Quality Control. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted November 18, 2020. ; https://doi.org/10.1101/2020.11.16.20230730 doi: medRxiv preprint Figure 1A . Flow chart of cohort definition for the COVID-19 hospitalization length of stay study using UHG claims data.

A COVID-19 data source-specific layer of quality control is also present, given the rapidly evolving situation. Members with a qualified COVID-19 related hospital admission are included in the report when any diagnosis matches qualified ICD-10 codes of U071, U072, or B9729. Suspected COVID-19 inpatient cases are manually reviewed daily by health plan clinical staff via clinical notes to determine an individual's COVID-19 status. Each case is then manually flagged as either negative, confirmed, presumed positive, or needs clinical review. If a case is confirmed, it is not reviewed again. If a case is listed as negative or unknown, it is periodically reviewed for changes in the record. All others are reviewed and updated daily.

The data are proprietary and are not available for public use but, under certain conditions, may be made available to editors and their approved auditors under a data use agreement to confirm the findings of the current study. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted November 18, 2020. ; https://doi.org/10.1101/2020.11.16.20230730 doi: medRxiv preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted November 18, 2020. ; https://doi.org/10.1101/2020.11.16.20230730 doi: medRxiv preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted November 18, 2020. ; https://doi.org/10.1101/2020.11.16.20230730 doi: medRxiv preprint Myocardial infarction  I21, I22, I252   Paralysis  G041, G114, G801, G802, G81, G82, G830, G831, G832, G833, G834, G839   Peripheral  Vascular  Disease  I70, I71, I731, I738, I739, I771, I790, I792, K551, K558, is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted November 18, 2020. ; https://doi.org/10.1101/2020.11.16.20230730 doi: medRxiv preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted November 18, 2020. ; https://doi.org/10.1101/2020.11.16.20230730 doi: medRxiv preprint hospitalization as the outcome and demographics and comorbidity variables as covariates. Diabetes, obesity, kidney disease and Charlson score are selected as having significant random effects.

. CC-BY-NC-ND 4.0 International license It is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted November 18, 2020. ; https://doi.org/10.1101/2020.11.16.20230730 doi: medRxiv preprint

COVID-19 Clinical Trials: A Teachable Moment for Improving Our Research Infrastructure and Relevance

Understanding the Heterogeneity of Adverse COVID-19 Outcomes: The Role of Poor Quality of Air and Lockdown Decisions

Keep up with the latest coronavirus research

Aggregating data from COVID-19 trials

OpenSAFELY: factors associated with COVID-19 death in 17 million patients

Semiparametric regression

Regression cubes with lossless compression and aggregation

Adjustment of an inverse matrix corresponding to a change in one element of a given matrix

Eigenvalues of rank-one updated matrices with some applications

Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions

Variance components testing in the longitudinal mixed effects model

Learning from electronic health records across multiple sites: A communication-efficient and privacy-preserving distributed algorithm

Learning from local to global-an efficient distributed algorithm for modeling time-to-event data

Webdisco: a web service for distributed cox model learning without patient-level data sharing

G rid binary lo gistic re gression (glore): building shared models without sharing data

In the case that the random effects covariates in is not a subset of the fixed effects covariates in , the proposed DLMM algorithm requires the … † site to communicate