title: Estimating Model Performance on External Samples from Their Limited Statistical Characteristics
authors: El-Hay, Tal; Yanover, Chen
date: 2022-02-28

Methods that address data shifts usually assume full access to multiple datasets. In the healthcare domain, however, privacy-preserving regulations as well as commercial interests limit data availability and, as a result, researchers can typically study only a small number of datasets. In contrast, limited statistical characteristics of specific patient samples are much easier to share and may be available from previously published literature or focused collaborative efforts. Here, we propose a method that estimates model performance in external samples from their limited statistical characteristics. We search for weights that induce internal statistics that are similar to the external ones, and that are closest to uniform. We then use model performance on the weighted internal sample as an estimate of the external counterpart. We evaluate the proposed algorithm on simulated data as well as electronic medical record data for two risk models, predicting complications in ulcerative colitis patients and stroke in women diagnosed with atrial fibrillation. In the vast majority of cases, the estimated external performance is much closer to the actual one than the internal performance. Our proposed method may be an important building block in training robust models and detecting potential model failures in external environments.

Predictive models, such as disease risk scores, are typically trained on a single data source, or a few, but are often expected to work well in other environments, which may vary in their population characteristics, clinical settings, and policies (Steyerberg and Harrell, 2016). In many cases, model performance deteriorates significantly in these external environments, as demonstrated repeatedly (e.g., Ohnuma and Uchino (2017)), and most recently for the widely implemented proprietary Epic Sepsis Model (Wong et al., 2021) and for COVID-19 risk models (Reps et al., 2021). Model robustness - that is, the ability to provide accurate predictions despite changes, e.g., in the characteristics of input covariates - can be demonstrated using external validation, the process of evaluating model performance on data sources that were not used for its derivation. However, full access to medical datasets is often limited due to privacy, regulatory, and commercial factors. Therefore, we aim to estimate the performance of a given model on external sources using only their more commonly available statistical characteristics.

Here, we propose an algorithm that reweights individuals in an internal sample to match external statistics, potentially reported in preceding publications or characterization studies (e.g., Recalde et al. 2021), and then estimates the performance on the external sample using the reweighted internal one. We focus on cases that are common in the healthcare domain, where the size of samples (that is, the number of individuals) is much larger than the number of features. In such cases, an infinite number of weight sets may recapitulate the external statistics; therefore, the proposed algorithm searches for weights with a minimal divergence from a uniform distribution.
We first study the strengths and limitations of our suggested approach using simulated data; then split a sample from a primary care dataset into "internal" versus "external" subsets based on demographic information, and validate the approach using a prediction model for the 3-year risk of complications in ulcerative colitis patients; and, finally, use the entire primary care dataset to estimate the performance of three stroke risk scores in seven external resources and compare it to the actual performance, as reported in a recent study (Reps et al., 2020).

The task of evaluating model performance in external samples, often with (at least some) data shift (Finlayson et al., 2021), is tightly coupled with that of training robust models, as evaluation is a necessary step in model selection and optimization. One line of work handling data shifts adopts ideas from causal inference. Specifically, causal models (Bareinboim and Pearl, 2016) can distinguish invariant relations between risk factors (e.g., biological or physiological) and outcomes from context- or environment-dependent mechanisms (Subbaswamy et al., 2019). Subbaswamy et al. (2021) developed a method for analyzing model robustness (or stability) that, given a model, a set of distribution-fixed (immutable) variables, and a set of distribution-varying (mutable) variables, identifies the sub-population with the worst average loss, thus enabling evaluation of model safety with no external information. Sample reweighting is commonly applied to adjust for confounders, either measured (Hainmueller, 2012) or unmeasured (Streeter et al., 2017), and to account for selection bias (Kalton and Flores-Cervantes, 2003), typically leveraging fully accessible samples. Methodologically, the optimization problem we derive is similar to that studied for entropy balancing (Hainmueller, 2012), which attempts to reweight a sample (e.g., a control group) so that a prespecified set of its moments exactly matches that of another sample (e.g., a treatment group), while maximizing the weight entropy (that is, keeping the weights as close as possible to uniform). We note, however, that we explore a different use-case and, consequently, optimize over moments of an otherwise inaccessible sample (rather than samples from an accessible data source).

The goal of the proposed method is to estimate the performance of a prediction model, e.g., a risk score, on an external sample, given some of its statistical properties and using an internal, fully accessible dataset. Briefly, we reweight the internal sample to reproduce the external statistics, then compute model performance on the weighted sample as an estimate of the external performance.

Let x_i and y_i denote an observation (or feature) vector and a binary outcome†, respectively, for an individual i. Suppose we have access to observations for n_int individuals in an internal sample,

D_int = {(x_i, y_i)}_{i=1}^{n_int},

and to summary statistics of an external sample D_ext (with n_ext individuals),

µ = (1 / n_ext) Σ_{i=1}^{n_ext} φ(x_i, y_i),

where φ(x_i, y_i) is a set of transformations on individual-level observations and the sum runs over the external individuals. For example, a suitable choice of φ allows computation of the features' means in the subsets of individuals with and without the outcome (as often reported in a study's Table 1). We aim at estimating the performance of a model m on the external sample D_ext, using µ and the observations in D_int. To this end, we search for weights w ∈ [0, 1]^{n_int}, Σ_i w_i = 1, such that the statistical properties of the weighted sample {x_i, y_i, w_i}_{i=1}^{n_int} approximate those of the external one. Let W(µ, Z) denote the space of such weight sets,

W(µ, Z) = { w : Zᵀ w = µ, Σ_i w_i = 1, w_i ≥ 0, i = 1, . . ., n_int },

where Z is a matrix whose rows are z_i ≡ φ(x_i, y_i).

†. We focus here on binary outcomes, as these are commonly used - and reported - in healthcare applications; it is possible to extend the proposed approach to continuous outcomes, using an appropriate performance measure and statistical characteristics.
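As an illustration, the following sketch shows one possible choice of φ under which "Table 1"-style statistics (outcome prevalence and class-conditional feature means) can be written as an expectation µ of φ, and how the corresponding matrix Z is built from an internal sample. The helper functions (make_Z, make_mu) and the toy data are illustrative; they are not the only valid choice of φ, nor the implementation used in our experiments.

```r
# Illustrative sketch: one possible transformation phi(x, y) under which
# "Table 1"-style statistics (outcome prevalence and class-conditional feature
# means) can be written as the expectation mu of phi; names are illustrative.
make_Z <- function(X, y) {
  # row i: z_i = phi(x_i, y_i) = (1[y_i = 1], x_i * 1[y_i = 1], x_i * 1[y_i = 0])
  cbind(y, X * y, X * (1 - y))
}

make_mu <- function(prevalence, mean_cases, mean_controls) {
  # target vector mu assembled from published prevalence and class means
  c(prevalence, prevalence * mean_cases, (1 - prevalence) * mean_controls)
}

# toy internal sample: n_int individuals, three features, binary outcome
set.seed(1)
n_int <- 500
X <- matrix(rnorm(n_int * 3), n_int, 3)
y <- rbinom(n_int, 1, plogis(X[, 1]))
Z <- make_Z(X, y)

# with uniform weights, t(Z) %*% w reproduces the *internal* statistics;
# reweighting searches for w in W(mu, Z), i.e., weights with t(Z) %*% w == mu
w_uniform <- rep(1 / n_int, n_int)
mu <- make_mu(prevalence = 0.3,
              mean_cases = c(0.5, 0.1, 0.0),
              mean_controls = c(-0.2, 0.0, 0.0))
round(rbind(internal = as.vector(t(Z) %*% w_uniform), external = mu), 3)
```

The last line contrasts the internal statistics (uniform weights) with the external target µ; the reweighting schemes below search for weights that close this gap.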
As W(µ, Z) may be infinitely large, we propose to search for a set of weights that is also closest to uniform. This additional constraint is based on a proximity assumption: intuitively, that the external distribution is relatively similar to the distribution in W(µ, Z) that is closest to the internal distribution.

Using the reweighted sample we can now estimate two types of performance measures:

• Measures that can be expressed as a pointwise loss function, l(m(x_i), y_i), for which we estimate the expected loss of the model in the external sample as Σ_{i=1}^{n_int} w_i · l(m(x_i), y_i). For example, for a model that computes the probability of an outcome y, we can estimate the expected negative log-likelihood by setting l(m(x), y) = −y log m(x) − (1 − y) log(1 − m(x)).

• Non-decomposable measures that can be evaluated on weighted samples, for example, the area under the receiver operating characteristic curve (AUC).

Below we present a model-independent scheme, which minimizes an f-divergence function (for example, maximizes the weights' entropy); in Appendix A, we derive a model- (and loss-) dependent scheme, which optimizes a weighted upper bound on the model loss together with a regularized divergence term.

Model-independent optimization scheme. To find a weighted representation of an internal sample that replicates the external expectations, we solve the following optimization problem:

min_w D_f(w ‖ u)  subject to  Zᵀ w = µ,  Σ_i w_i = 1,  w_i ≥ 0,    (1)

where u = (1/n_int, . . ., 1/n_int) is the uniform weight vector and the f-divergence D_f(P ‖ Q), for discrete measures P and Q, is

D_f(P ‖ Q) = Σ_i Q_i f(P_i / Q_i),

with f : R+ → R a convex function satisfying f(1) = 0. For example, when f(t) = t log t, Optimization Problem (1) becomes (up to an additive constant)

max_w H(w)  subject to  Zᵀ w = µ,  Σ_i w_i = 1,  w_i ≥ 0,    (2)

where H(w) = −Σ_i w_i log w_i is the entropy function. We derive a dual formulation of Problem (2), similar to Hainmueller (2012), in Appendix B, and show that the optimal solution has the form w_i ∝ exp(z_iᵀ ν), where ν ∈ R^{|φ|}; in other words, the optimal weights are normalized exponents of a linear function of Z. We note that, as the number of features is typically much smaller than the sample size, the solution of the dual problem is expected to be more numerically stable than the primal's.

In cases where W(µ, Z) = ∅, Problem (1) can be adjusted to trade off, using a hyper-parameter λ, the accuracy at which the weighted internal sample reproduces the external statistics against proximity to uniform, and rewritten as

min_w  ‖Zᵀ w − µ‖ + λ · D_f(w ‖ u)  subject to  Σ_i w_i = 1,  w_i ≥ 0,    (3)

where the norm can be L2 or L1.

To estimate the performance of a model in an external sample D_ext, with distribution P_ext(z) of transformed features, we assume that P_int(z) > 0 whenever P_ext(z) > 0. This condition is analogous to the positivity assumption in causal inference, except that it is one-sided; in other words, the support of P_ext(z) can be a strict subset of the support of P_int(z). Although this assumption cannot be verified, its violations can be detected when the external expectations cannot be attained in the internal sample.

We used R's CVXR library (Fu et al., 2020) to solve the optimization problem and the WeightedROC library (Hocking, 2020) to compute weighted AUC. To deal with cases where W(µ, Z) = ∅, we used the relaxed Problem (3) with λ = 10^-6 and the L2 norm. To alleviate numerical issues, we set a minimum weight parameter (10^-6) and removed features with a small standard deviation (< 10^-4).
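A minimal sketch of this scheme is given below, using the CVXR and WeightedROC libraries and the hyper-parameter values listed above; it continues the snippet above (Z, mu, X, y) and is simplified relative to our full implementation (e.g., feature screening is omitted), with the stand-in score function and the estimate_weights helper being illustrative.

```r
# Minimal sketch of the model-independent scheme (relaxed Problem (3)) using
# the CVXR and WeightedROC libraries; Z, mu, X, and y are taken from the
# snippet above, and `scores` stands in for a previously fitted model m.
library(CVXR)
library(WeightedROC)

estimate_weights <- function(Z, mu, lambda = 1e-6, min_w = 1e-6) {
  n <- nrow(Z)
  w <- Variable(n)
  # softly match the external statistics (L2 norm) while keeping the weights
  # close to uniform by penalizing low entropy
  objective <- Minimize(p_norm(t(Z) %*% w - mu, 2) - lambda * sum(entr(w)))
  constraints <- list(sum(w) == 1, w >= min_w)
  result <- solve(Problem(objective, constraints))
  as.vector(result$getValue(w))
}

w_hat <- estimate_weights(Z, mu)

# estimated external AUC = weighted AUC of the model scores on the internal sample
scores <- plogis(X[, 1])                                # stand-in for m(x_i)
roc <- WeightedROC(guess = scores, label = ifelse(y == 1, 1, -1), weight = w_hat)
WeightedAUC(roc)
```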
To evaluate the accuracy of the proposed algorithm, we estimated the external performance of various models using an internal sample and limited external statistical characteristics, in three scenarios:

(a) simulating data using a structural equation model (Bareinboim and Pearl, 2016) for "internal" and "external" environments, training an outcome prediction model on the internal sample, and evaluating its performance on the external one;

(b) extracting a cohort of newly diagnosed ulcerative colitis individuals in the IMRD-UK data, synthetically splitting this cohort into "internal" and "external" samples, training a complication risk model on the internal sample, and evaluating its performance on the external one; and

(c) extracting atrial fibrillation patient cohorts in the IMRD-UK data as an internal sample and evaluating the performance of three stroke risk models in multiple inaccessible claims and EMR databases using their published statistical characteristics.

We simulated synthetic data using structural equation models that contain a hidden variable H ∈ R, features X ∈ R^p, a binary outcome Y, and a deterministic binary variable A, where A = 0 denotes an internal environment and A = 1 denotes an external one (Figure 1). This framework allows examining the strengths and limitations of the proposed algorithm subject to different types of data shifts.

Figure 1: Graphical representation of the data-generating causal model. A is an environment variable (e.g., in a clinical setting, a specific healthcare system), H is a hidden variable (encoding, for example, an individual's healthcare status), X is a set of observed features (e.g., prescribed medications or lab test results) and Y is a binary outcome (e.g., disease onset or progression).

We defined the simulations using the following structural equations model:

H = β_{H,A} A + ε_H,
X = β_{X,A} A + β_{X,H} H + β_{X,AH} A·H + ε_X,
Y ~ Bernoulli( sigmoid( β_{Y,A} A + β_{Y,H} H + β_{Y,X}ᵀ X + β_{Y,AX}ᵀ A·X ) ),

where β_{X,·} and β_{·,X} ∈ R^p are coefficient vectors, the rest of the coefficients are scalars, ε_H ∼ N(0, 1) and ε_X ∼ N(0, I_p) are independent sources of variability, and sigmoid(z) = 1 / (1 + e^{−z}). This model is similar to the anchor regression model (Rothenhäusler et al., 2021), replacing the continuous outcome with a binary one. The dependency of X on the hidden variable H induces correlations between features, and the interaction term A·H induces differences in the correlation structure between environments. Therefore, the coefficient β_{X,AH} controls the "strength" of the environment-dependent shift in the correlations between features, and the coefficient β_{Y,AX} controls the shift in the direct effect of X on Y.

Here, we set the dimension of X to p = 10 and sampled coefficients β_{H,A}, β_{Y,A} ∼ N(0, 0.2), β_{X,A} ∼ N(0, 0.2 I_p), β_{X,H} ∼ N(0, I_p), and β_{Y,H} ∼ N(0, 1). We let only X_1 and X_2 (but not X_3 to X_10) affect the outcome Y, by setting β_{Y,X} = (1, 1, 0, . . ., 0) and β_{Y,AX} = (−0.8, −0.2, 0, . . ., 0). As studies do not typically report correlations between features within each outcome class, we tested our algorithm in different scenarios of correlation shifts. Specifically, we used three configurations of β_{X,AH} ∼ N(0, σ_{X,AH}), where σ_{X,AH} = 0, 0.5, or 1, emulating weak, medium, and strong correlation shifts, respectively.

Given a specific simulation model, we generated three datasets, namely internal training and test sets and an external dataset. We computed the mean and variance of every feature in X, separately for individuals with Y = 0 and Y = 1, in the external set.
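A compact sketch of this data-generating process, one environment at a time, is given below. The simulate_env helper is illustrative; N(0, 0.2) is read here as a variance of 0.2, and σ_{X,AH} as a standard deviation, which is one possible reading of the notation above.

```r
# Sketch of the simulation above, one environment at a time (A = 0 internal,
# A = 1 external); N(0, 0.2) is read here as a variance of 0.2.
simulate_env <- function(n, A, beta, p = 10) {
  H <- beta$H_A * A + rnorm(n)                               # hidden variable
  X <- A * matrix(beta$X_A, n, p, byrow = TRUE) +            # features
       outer(H, beta$X_H) + A * outer(H, beta$X_AH) +
       matrix(rnorm(n * p), n, p)
  lin <- beta$Y_A * A + beta$Y_H * H +
         X %*% beta$Y_X + A * (X %*% beta$Y_AX)              # linear predictor
  Y <- rbinom(n, 1, plogis(as.vector(lin)))                  # binary outcome
  list(X = X, Y = Y)
}

p <- 10
sigma_X_AH <- 0.5                                 # medium correlation shift
beta <- list(H_A  = rnorm(1, 0, sqrt(0.2)), Y_A  = rnorm(1, 0, sqrt(0.2)),
             X_A  = rnorm(p, 0, sqrt(0.2)), X_H  = rnorm(p), Y_H = rnorm(1),
             X_AH = rnorm(p, 0, sigma_X_AH),
             Y_X  = c(1, 1, rep(0, p - 2)), Y_AX = c(-0.8, -0.2, rep(0, p - 2)))

internal <- simulate_env(5000, A = 0, beta, p)   # fully accessible sample
external <- simulate_env(5000, A = 1, beta, p)   # only its statistics are shared
```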
Next, we trained an elastic net regularized logistic regression model on the internal training set and computed its AUC on the internal test and external sets. Finally, we applied the performance estimation algorithm on the internal test set, using the external expectations, and compared the estimated AUC to the actual one. Supplementary Figure 6 presents examples of generated samples with varying values of σ_{X,AH}. For each setting, we generated 200 models and, from each, sampled datasets of varying sizes (n = 200, 500, 1000, 2000, 5000).

The results of the proposed algorithm, in terms of divergence from uniform weights and AUC estimation accuracy, for different values of σ_{X,AH} and data size n = 5000, are shown in Table 1. As expected, both the weight divergence from uniform, D_KL(w ‖ 1/n), and the estimation error grow with σ_{X,AH}. Figure 2 presents the estimation error of the external AUC values as a function of correlation shift strength and sample size, n. Estimation quality is lower for strong shifts in correlations, which are not captured in the shared expectations, whereas milder differences result in good estimations. For comparison, the difference between the internal and the actual external AUC values is around 0.1 (Table 1).

Next, we studied the IMRD-UK primary care data and synthetically split it into "internal" and "external" sets based on various demographic criteria. Specifically, we trained a model on the internal sample to predict the 3-year risk of intestinal surgery (or death) in ulcerative colitis (UC) patients; estimated its performance on the external sample, using limited external statistics; and compared the estimated and observed performance.

UC is a chronic inflammatory bowel condition with consistently increasing incidence rates in both newly industrialized and developed countries (Benchimol et al., 2014; Windsor and Kaplan, 2019; Kaplan and Windsor, 2021). The increase in its prevalence imposes a significant financial burden on healthcare systems, due to chronically administered medications as well as hospitalizations and surgical procedures (Windsor and Kaplan, 2019). UC pathogenesis is not well understood. Presumed risk factors for a more complicated disease include younger age at diagnosis, extensive disease, use of steroids and immunosuppressive drugs, and being a non-smoker (Koliani-Pace and Siegel, 2019).

The UC onset cohort includes individuals with at least two diagnoses of inflammatory bowel disease (IBD), or with an IBD diagnosis and a prescription for an IBD medication, who have an ulcerative colitis diagnosis and no Crohn's disease diagnosis. We set the index (or cohort entry) date to the first IBD diagnosis or medication prescription and required that individuals have a minimum observation period of 365 days prior to the index date. We excluded subjects with insufficient follow-up. For each individual in the ulcerative colitis cohort, we extracted a set of features previously reported as associated with increased intestinal surgery risk (Koliani-Pace and Siegel, 2019), including age (and age^2), sex, smoking, being underweight or overweight, presence of perianal disease, and use of steroids; we also considered sets of predefined features (per OHDSI's FeatureExtraction R library), e.g., drugs prescribed to at least 1,000 subjects, up to 90 days after the index date.
The outcome considers procedure codes for colostomy, colectomy, ileostomy, small intestinal resection, stricturoplasty, balloon dilation, drainage of perianal abscess, drainage of intra-abdominal abscess, or death, within 3 years following the index date. Definitions of all concept sets and cohorts are available at https://atlas-demo.ohdsi.org/.

We split the IMRD-UK data into internal and external sets based on individual age or country of living, as described below. In the following experiments, we split the subset of individuals who live in England by their age. Specifically, in the first experiment, the internal set contained the 75% youngest subjects (≤ 64 years) and the external set the 25% oldest ones; in the second experiment, the internal set contained the 75% oldest individuals (> 34 years) and the external set the remaining, younger, individuals. In each of these setups, we randomly split the internal set into training (75%) and test (25%) sets; we repeated the training-test random split 200 times. Next, we trained a linear (elastic net) model as well as a non-linear one, using XGBoost (Chen and Guestrin, 2016), computed the models' AUC on the internal test and external sets, and estimated the external AUC using the internal set and the expectations of the external one. To maintain positivity and to emulate an environment-dependent hidden factor, we excluded age from the feature set. The populations differed in several observed characteristics, notably the percentages of women and of underweight and overweight individuals; see Table 2 for details. Overall, the external performance estimations, using either elastic net or XGBoost, are close to the actual ones (Figure 3), notably for the external younger sample, where the difference between the internal and external performance is large (right panel).

Next, we split the UC cohort by country of residence and considered the sub-cohort of individuals living in England as the internal sample and those living in Scotland, Wales, and Northern Ireland as three distinct external samples. Similarly to the age-split analysis, we repeatedly (200 times) split the internal sample into training and test sets; trained a model on each training set; extracted expectations for the external samples; and evaluated model performance on the internal test and external sets. The characteristics of the different sub-populations are presented in Table 3; Figure 4 shows the external performance evaluation results, attesting to the (much) improved accuracy of the estimated AUC values, compared to the internal performance.

Atrial fibrillation is a common cardiac rhythm disorder, associated with an increased risk of stroke (Sagris et al., 2021). Risk factors associated with the occurrence of stroke include older age, various comorbidities (in particular, hypertension, diabetes, and renal disease), and smoking (Singer et al., 2013). To guide treatment, multiple risk scores have been devised and externally evaluated in several studies (van den Ham et al., 2015). Recently, Reps et al. (2020) replicated five previously published prognostic models that predict stroke in females newly diagnosed with atrial fibrillation, and externally validated their performance across nine observational healthcare datasets. Below, we use our proposed algorithm and the limited per-database statistical characteristics, as they appear in Reps et al. (2020), to estimate the external performance of these risk scores.
We downloaded the analysis package of Reps et al. (2020) and applied it to the IMRD-UK data, with the following modifications, which adjust the study definitions to a primary care setting:

Target cohorts. We considered ECG-related procedures and conditions, in addition to measurements, within 30 days prior to the atrial fibrillation diagnosis, as an optional inclusion criterion.

Outcome cohort. As stroke, typically not diagnosed in a primary care setting, may be poorly recorded for deceased individuals, we added death as an entry event to the stroke cohort.

Feature definitions. We extended the time window for the extraction of model features to span the entire history of each individual until, and including, the date of the first atrial fibrillation event; included individuals with an estimated glomerular filtration rate (eGFR) lower than 45 mL/min/1.73m^2 in the end-stage renal disease cohort, as originally defined in the ATRIA risk model (Singer et al., 2013); and defined former smokers as individuals with a recorded "smoker" observation, as well as those diagnosed with tobacco dependence syndrome.

For each individual, the analysis package computed a stroke risk score given her set of features, as extracted from IMRD-UK, and then calculated score performance vis-à-vis the recorded stroke (and death) events. To estimate score performance in each external sample, we weighted individuals in the IMRD-UK data using the proposed algorithm to reproduce the sample's population characteristics, as reported in Reps et al. (2020), and computed the score performance on the weighted cohort. We computed 95% confidence intervals using 1000 bootstrapping iterations. The reported population attributes (Reps et al., 2020) include the percentages of individuals in certain age groups (65-74 years, 75-85 years, and above 85 years), with comorbidities (hypertension, congestive heart failure, congestive cardiac failure, coronary heart disease, valvular heart disease, chronic and end-stage renal disease, proteinuria, diabetes, and rheumatoid arthritis), and of former smokers.

A comparison between the risk score performance, as reported by Reps et al. (2020), and the estimated performance is shown in Figure 5. For the full cohort (top panel), in three out of six datasets, the confidence interval of the ATRIA estimation overlaps the actual AUC (Figure 5a); in two other datasets, the estimation is better than the internal, IMRD-UK-based performance. Qualitatively similar results are observed for the CHADS2 and Q-Stroke risk scores (Figures 5b and c, respectively), as well as for women 65 years or older (bottom panel). We note that for two additional risk scores, Framingham and CHA2DS2-VASc, and two datasets, Ajou University School of Medicine (AUSOM) and Integrated Primary Care Information (IPCI), Reps

We presented an algorithm that estimates the performance of prediction models on external samples from their limited statistical characteristics, and demonstrated its utility using synthetic data, synthetic splits of an ulcerative colitis cohort from a single database by age group and by country of living, and a recent benchmark of stroke risk models on multiple external samples. Importantly, our proposed algorithm can help identify models that perform well across multiple clinical settings and geographies, even when detailed test data from such settings are not available. It can thus direct the development of robust models and accelerate deployment to external environments. The algorithm relies on two assumptions: one-sided positivity and proximity.
Neither assumption can be fully tested, but clear violations of the former can be detected, for example, when the expected value of a feature is non-zero in the external distribution but all the individuals in the internal set have a zero value for that feature. Intuitively, proximity is more likely to be plausible when the statistical information becomes more detailed. Therefore, whereas our preliminary experiments involved only marginal statistics of features, it may be informative to test the performance of the algorithm when more detailed statistics are available, for example, interactions among features or the information available in deep characterization studies (Burn et al., 2020).

We believe that the proposed methodology can serve as a building block in network studies that aim to construct robust models across datasets when data sharing is limited, e.g., by regulatory constraints. Although federated learning methods may be a promising avenue for such scenarios, it would be interesting to explore in which cases the proposed algorithm can facilitate a one-shot federated learning scheme that does not require deployment of federated algorithm clients in all network nodes. In future work, we will combine the proposed algorithm with methods that aim to construct robust models, such as those that leverage distributionally robust optimization (Bühlmann, 2020); study methods that exploit the relations between calibration and robustness (Wald et al., 2022); and look into decomposing the AUC (Eban et al., 2017), so that it can be optimized explicitly.

This study has been approved by the IQVIA Scientific Review Committee (reference numbers: 21SRC066, 22SRC002).

References.
Bareinboim and Pearl (2016). Causal inference and the data-fusion problem.
Benchimol et al. (2014). Changing Age Demographics of Inflammatory Bowel Disease in Ontario, Canada: A Population-based Cohort Study of Epidemiology Trends.
Bühlmann (2020). Invariance, Causality and Robustness. Statistical Science (Institute of Mathematical Statistics).
Boyd and Vandenberghe (2004). Convex Optimization. Cambridge University Press.
Burn et al. (2020). Deep phenotyping of 34,128 adult patients hospitalised with COVID-19 in an international network study.
Chen and Guestrin (2016). XGBoost: A Scalable Tree Boosting System.
Eban et al. (2017). Scalable Learning of Non-Decomposable Objectives.
Finlayson et al. (2021). The Clinician and Dataset Shift in Artificial Intelligence.
Fu et al. (2020). CVXR: An R Package for Disciplined Convex Optimization.
Hainmueller (2012). Entropy Balancing for Causal Effects: A Multivariate Reweighting Method to Produce Balanced Samples in Observational Studies.
Hocking (2020). WeightedROC: Fast, Weighted ROC Curves.
Kaplan and Windsor (2021). The four epidemiological stages in the global evolution of inflammatory bowel disease.
Koliani-Pace and Siegel (2019). Prognosticating the Course of Inflammatory Bowel Disease.
OHDSI. The Book of OHDSI: Observational Health Data Sciences and Informatics.
Ohnuma and Uchino (2017). Prediction models and their external validation studies for mortality of patients with acute kidney injury: a systematic review.
Recalde et al. (2021). Characteristics and outcomes of 627 044 COVID-19 patients living with and without obesity in the United States, Spain, and the United Kingdom.
Reps et al. (2020). Feasibility and evaluation of a large-scale external validation approach for patient-level prediction in an international data network: validation of models predicting stroke in female patients newly diagnosed with atrial fibrillation.
Reps et al. (2021). Implementation of the COVID-19 Vulnerability Index Across an International Network of Health Care Data Sets: Collaborative External Validation Study.
Rothenhäusler et al. (2021). Anchor regression: Heterogeneous data meet causality.
Sagris et al. (2021). Atrial Fibrillation: Pathogenesis, Predisposing Factors, and Genetics.
Singer et al. (2013). A new risk scheme to predict ischemic stroke and other thromboembolism in atrial fibrillation: the ATRIA study stroke risk score.
Steyerberg and Harrell (2016). Prediction models need appropriate internal, internal-external, and external validation.
Streeter et al. (2017). Adjusting for unmeasured confounding in nonrandomized longitudinal studies: a methodological review.
Subbaswamy et al. (2019). Preventing Failures Due to Dataset Shift: Learning Predictive Models That Transport.
Subbaswamy et al. (2021). Evaluating Model Robustness and Stability to Dataset Shift.
van den Ham et al. (2015). Comparative Performance of ATRIA, CHADS2, and CHA2DS2-VASc Risk Scores Predicting Stroke in Patients With Atrial Fibrillation: Results From a National Primary Care Database.
Wald et al. (2022). On Calibration and Out-of-domain Generalization.
Windsor and Kaplan (2019). Evolving Epidemiology of IBD.
Wong et al. (2021). External Validation of a Widely Implemented Proprietary Sepsis Prediction Model in Hospitalized Patients.

Acknowledgments. We thank Drs Roni Weisshof and Ramit Magen, Rambam Health Care Campus, Haifa, Israel, for their help in defining the ulcerative colitis predictive model; and Prof Seng Chan You, Yonsei University Health System, Seoul, Republic of Korea, for pointing us to the stroke external validation study. We are also grateful to KI Institute's researchers and, specifically, to Nir Kalkstein and Yonatan Bilu, for many fruitful discussions.

Appendix A. An upper bound on a model m's weighted loss l, up to a finite-sample error, can be derived; its tightness may depend on the number of expectations we consider. Furthermore, as z, and consequently µ, may not represent all inter-feature dependencies existing in the data, an additional constraint may yield improved estimations. As we increase λ, the bound may become tighter but the confidence may decrease.

Appendix B. Recall that optimization Problem (2) is defined as follows:

max_w H(w)  subject to  Zᵀ w = µ,  Σ_i w_i = 1,

where w ≥ 0. Denoting by z̃_i ≡ (z_i, 1) the transformed observations augmented with a constant and by µ̃ ≡ (µ, 1) the corresponding expectations, Problem (2) becomes

min_w Σ_i w_i log w_i  subject to  Σ_i w_i z̃_i = µ̃,  w ≥ 0.

The Lagrangian of the primal problem is

L(w; ν) = Σ_i w_i log w_i + νᵀ( Σ_i w_i z̃_i − µ̃ ).

Following Equation 5.11 in Boyd et al. (2004), the dual function is

g(ν) = −Σ_i e^{−1 − z̃_iᵀ ν} − νᵀ µ̃,

where the exponential terms arise from (−H)*, the conjugate of the negative-entropy function (Boyd et al., 2004). Let ν* be the optimal solution of max_ν g(ν). Then, following Section 5.5.3 of Boyd et al. (2004), the solution of the primal problem minimizes the Lagrangian at ν*:

∂L(w; ν*) / ∂w_i = log w_i + 1 + z̃_iᵀ ν* = 0,

giving w_i = e^{−1 − z̃_iᵀ ν*}. This result shows that the optimal weights are normalized exponents of a linear function of the data points.

Figure 6: Simulation examples with varying values of σ_{X,AH}; panels show (a, b) σ_{X,AH} = 0, (c, d) σ_{X,AH} = 0.5, and (e, f) σ_{X,AH} = 1, for the internal and external environments, respectively. Dot colors denote outcome class; diamonds represent class means. The shift in the correlation between X_1 and X_2, given an outcome class, increases with σ_{X,AH}.
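To complement the derivation in Appendix B, the sketch below (reusing Z and µ as constructed in the earlier snippets) minimizes an equivalent unconstrained form of the dual with a generic optimizer and recovers weights of the derived form, i.e., normalized exponents of a linear function of Z; the entropy_weights_dual helper is illustrative rather than the implementation used in our experiments.

```r
# Companion sketch to Appendix B (Z and mu as constructed in the earlier
# snippets): minimize an equivalent unconstrained form of the dual, in which
# the sum-to-one constraint is absorbed by the final normalization.
entropy_weights_dual <- function(Z, mu) {
  dual_obj <- function(nu) {
    s <- as.vector(Z %*% nu)
    # log-sum-exp of Z nu minus mu' nu; its gradient vanishes exactly when the
    # weighted statistics equal mu
    max(s) + log(sum(exp(s - max(s)))) - sum(mu * nu)
  }
  nu_hat <- optim(rep(0, ncol(Z)), dual_obj, method = "BFGS",
                  control = list(maxit = 500))$par
  s <- as.vector(Z %*% nu_hat)
  w <- exp(s - max(s))
  w / sum(w)                       # normalized exponents of a linear function of Z
}

w_dual <- entropy_weights_dual(Z, mu)
max(abs(t(Z) %*% w_dual - mu))     # approximately zero when W(mu, Z) is non-empty
```

Because the dual has only as many variables as there are shared statistics, this route scales with |φ| rather than with the sample size, in line with the numerical-stability remark in the main text.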