Optimal sizing of a holdout set for safe predictive model updating
Sami Haidar-Wehbe, Samuel R Emerson, Louis J M Aslett, James Liley
2022-02-13

Risk models in medical statistics and healthcare machine learning are increasingly used to guide clinical or other interventions. Should a model be updated after a guided intervention, it may lead to its own failure at making accurate predictions. The use of a 'holdout set' -- a subset of the population that does not receive interventions guided by the model -- has been proposed to prevent this. Since patients in the holdout set do not benefit from risk predictions, the chosen size must trade off maximising model performance against minimising the number of held-out patients. By defining a general loss function, we prove the existence and uniqueness of an optimal holdout set size, and introduce parametric and semi-parametric algorithms for its estimation. We demonstrate their use on a recent risk score for pre-eclampsia. Based on these results, we argue that a holdout set is a safe, viable and easily implemented solution to the model update problem.

Risk scores estimate the probability of an event Y given predictors X. Their use has become routine in medical practice (Topol, 2019), where Y may represent disease incidence and X various clinical observations. Once calculated, risk scores may be used to guide interventions (perhaps modifying X), with the aim of decreasing the probability of an adverse Y. For example, the QRISK3 score predicts thromboembolic risk given predictors including age and hypertension (Hippisley-Cox et al., 2017), and a high score may prompt prescription of antihypertensives. Risk scores are typically developed by regressing observations of Y on X. Should the distribution of (X, Y) subsequently change (or 'drift'), risk estimates may become biased (Tsymbal, 2004; Zliobaite, 2010). This can happen naturally over time, meaning that risk scores typically need to be updated periodically to maintain their utility. Updating of the risk score will involve obtaining new observations of (X, Y). Crucially, the distribution of (X, Y) may also change due to the effect of the risk score itself: that is, high predicted risk of an adverse event may trigger intervention to reduce that risk. The effect of such interventions may be difficult to infer, and indeed the act of intervening is often unrecorded. In the QRISK3 example above, individuals prescribed antihypertensives in response to higher QRISK3 scores should have lower thromboembolic risk than they would have had if QRISK3 were not used. Should a new risk score be fitted to observed (X, Y), the effect of hypertension on risk would be underestimated, and overall risk estimates biased upwards. This bias is worsened by heavier intervention, resulting in risk scores becoming 'victims of their own success' (Lenert et al., 2019). This framework of directly updating a risk score on an 'intervened' population has been termed 'repeated risk minimisation' or 'naïve updating' (Liley et al., 2021b). A possible solution to this problem is to split the population on which the score can be used into an 'intervention' set and a 'holdout' set (Liley et al., 2021b). Risk scores are computed for samples in the intervention set and allowed to guide intervention, while risk scores are not computed for samples in the holdout set.
As opposed to naïve model updating, we now update the model using data exclusively from samples in the holdout set. This allows the model to be updated safely, since only natural drift is represented in the holdout set. This poses a vital tension, the resolution of which is the primary contribution of this paper: for the risk score to be accurate, the holdout set should not be too small; but any samples in the holdout set will not benefit from risk scores, so nor should it be too large. In this work we develop methodology to ascertain the optimal size for a holdout set which balances these conflicting goals. Contributions in this area are important due to rapidly evolving legislation. Currently, the European Union treats each update of a risk model as a separate risk score requiring re-approval, but in the USA a proactive approach is taken with a 'total life cycle' paradigm which allows practitioners to update risk models as necessary without requesting approval (USFDA et al., 2019). This approach could allow updating-induced biases to go undetected, and highlights the need for safe updating methods in risk score deployment. The use of holdout sets as examined in this work offers one potential solution. Our paper is structured in the following way. In Section 2, we review relevant literature and precisely define the problem. In Section 3, we quantify the expected cost as a function of holdout set size, and describe reasonable sufficient conditions under which an Optimal Holdout set Size (OHS) exists. In Section 4, we then describe two algorithms for OHS estimation, together with supporting theory: the first using an explicit parametrisation; the second using Bayesian emulation to allow deviation from these parametric assumptions. In Section 5, we support our findings with numerical demonstrations and apply our algorithms to a risk score for pre-eclampsia (PRE) to estimate an OHS for updating it. Widespread collection of electronic health records has spurred development of new diagnostic and prognostic risk scores (Cook and Collins, 2015; Liley et al., 2021a), which can allow detection of patterns too complex for humans to discover (Koopman and Mainous, 2008). Examples of such scores in widespread use include: EuroSCORE II, which predicts mortality risk at hospital discharge following cardiac surgery (Nashef et al., 2012); and the STS risk score from the USA, predicting risk of postoperative mortality (Jacobs et al., 2006; Shahian et al., 2018; O'Brien et al., 2018). Many such scores have demonstrable efficacy in clinical trials and in vivo (Ad et al., 2016; Chalmers et al., 2013; Barili et al., 2013; Wallace et al., 2014; Durand et al., 2013). An important general concern with these scores is continued accuracy of predictions. A 2011 review found that risk scores for hospital readmission perform poorly, and highlighted issues with the design of their trials (Kansagara et al., 2011). More recently, an analysis of a sepsis response score used during the COVID pandemic found increasing risk overestimation over time (Finlayson et al., 2020). Various efforts have been made to standardise procedures in risk score estimation to address these issues (Collins et al., 2015, 2021). Several algorithms have been developed to update models with new data in the presence of drift (Lu et al., 2018), which ideally leads to the best possible model performance after every update.
Adaptation of model updating to avoid naïve updating-induced bias requires explicit causal reasoning (Sperrin et al., 2019) and often further data collection (Liley, 2021). In a seminal paper, Perdomo et al. (2020) analyse the asymptotic behaviour of repeated naïve updating, giving necessary and sufficient conditions for successive predictions to converge to a stable setting where they 'predict their own effect'. Other approaches to optimise a general loss function by modulating parameters of the risk score are developed in Mendler-Dünner et al. (2020), Drusvyatskiy and Xiao (2020), Li and Wai (2021) and Izzo et al. (2021). These approaches seek to minimise a 'performative' loss to the population in the presence of an arbitrary risk score. Our alternative approach seeks to target risk scores which reliably estimate the same quantity, namely pr(Y | X) in a 'native' system prior to risk score deployment. Our approach is well-suited to settings where the performative loss is essentially intractable, requiring cost estimates of risk scores only in limited settings. We note that 'stability' is not necessarily desirable in terms of the distribution of interventions: in the QRISK3 setting, if an individual is at untreated risk of 50% and treated risk of 10%, with treatment distributed proportionally to assessed risk, a 'stable' risk score would assess risk as e.g. 30%, prompting a mild intervention after which true risk remained at 30%, regardless of treatment cost. We found no literature directly addressing the focus of this paper: determining how large a holdout set should be. Similar problems do arise in clinical trial design. For example, Stallard et al. (2017) describe a method to estimate the optimal size of clinical trial groups for a rare disease, in which individuals not in the trial stand to gain more than those in it. A Bayesian decision-theoretic method is used to optimise a gain function while accounting for benefit to future patients in the population. OHS estimation requires quantification of expected material costs when using risk scores trained on data of various sizes. Such costs will typically depend on the error of risk predictions. The relation of predictive error to training set size is known as the 'learning curve', which can sometimes be accurately parametrised (Amari et al., 1992; Amari, 1993). A recent review paper suggests a power law is accurate for simple models (Viering and Loog, 2021). Our general strategy for safe model updating using a holdout set is illustrated in Figure 1, which is interpretable as a causal graph. Each of the three columns is called an 'epoch' (0, 1, 2 in subscripts), representing a period of time in which a risk score is deployed and data gathered to construct such a risk score. Ellipses containing X or Y are covariates and outcomes (respectively) of populations of samples. Under a 'native' setting prior to deployment of a risk score, X_0 and Y_0 have a single causal link, modelled by risk score ρ_0 (leftmost epoch). Once ρ_0 is in use in the intervention set in epoch 1 (ellipses X^i_1, Y^i_1), a second causal pathway through ρ_0 is established from X^i_1 to Y^i_1, but there remains only one causal pathway from X^h_1 to Y^h_1 in the holdout set (middle epoch). The updating process can be continued rightwards (ρ_1, ρ_2, ...).

[Figure 1: schematic of the holdout-set updating strategy; panel labels: intervention set, holdout set, model fitting, risk score ρ.]

In each epoch e, a new risk score ρ_e is trained.
A model is only ever used to make predictions on the same system to which it was trained (under a naïve updating setting, this is not the case; ρ_1 would be trained on the system X^i_1, Y^i_1 linked through ρ_0, and used on a system X^i_2, Y^i_2 in which they are not). Because the variables inside the grey region are independent of those outside it, we may consider them in isolation. For notational convenience, we jointly consider samples in the holdout set in epoch e and in the intervention set in epoch e+1, and will assume that the total population size and holdout set size are the same across all epochs. Suppose the aggregate groups above comprise n holdout set samples D_n which are independent and identically distributed as (X^h_e, Y^h_e), and N samples in total, where the remaining N - n samples in the intervention set are independent and identically distributed as (X^i_{e+1}, Y^i_{e+1}). A risk score is fitted to the holdout set which approximates E[Y^h_e | X^h_e = x] and is used in the intervention set, affecting Y^i_{e+1} | X^i_{e+1}. We denote the standard normal PDF and CDF by φ(·), Φ(·) respectively. We presume that we fit a risk score to D_n for use in epoch e+1. We define C_1(X) and C_2(X; D_n) as random variables associated with the total 'cost' of an observation with covariates X in the holdout set in epoch e and the intervention set in epoch e+1 respectively, where the cost covers both the cost of managing the event Y = 1 and any gains or losses associated with intervention. The function C_2(X; D_n) depends on D_n only through the fitted risk score. We define the expected cost per observation in the holdout and intervention sets, respectively, as

k_1 = E[C_1(X)], X ~ X^h_e, and k_2(n) = E[C_2(X; D_n)], X ~ X^i_{e+1},

recalling that D_n ~ (X^h_e, Y^h_e)^n. The values C_1 and C_2 in the outer expectations encompass variance in C_1(X), C_2(X; D_n) independent of X, D_n. We make the following assumptions, and describe their interpretation in a hypothetical medical context similar to QRISK3 (Hippisley-Cox et al., 2017).

1. k_1 does not depend on n: treatment plans and outcomes for patients without risk scores do not depend on the number of such patients.
2. k_2(n) is monotonically decreasing in n: the more data available to train the risk score, the greater its clinical utility.
3. There exists an M with 0 < M < N such that n ≥ M ⇔ k_2(n) < k_1: a good enough risk score will lead to better patient outcomes than baseline treatment, and a poor enough risk score fitted to small amounts of data leads to worse expected outcomes than baseline.
4. k_2(i) - k_2(i+1) ≥ k_2(j) - k_2(j+1) for 1 ≤ i < j ≤ N-1: the 'learning curve' for our risk score is convex; there are diminishing returns in the cost per patient from adding more samples to the training data.

We now express the total cost ℓ across the sample population as

ℓ(n) = k_1 n (total cost, holdout set) + (N - n) k_2(n) (total cost, intervention set).

We may freely extend the domain of k_2(·), ℓ(·) to the real interval [0, N) in such a way that both functions are smooth, with k_2'(n) < 0 (given assumption 2) and k_2''(n) > 0 (given assumption 4). This leads to the first core result, namely that there does indeed exist an optimal size for the holdout set which minimises the expected total cost. The proof is given in Appendix A1.

Theorem 1. Suppose assumptions 1-4 hold. Then there exists an N* ∈ {1, ..., N-1}, N ∈ ℕ, such that ℓ(N*) ≤ ℓ(n) for all n ∈ {1, ..., N-1}.

We refer to N* as the optimal holdout set size.
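To make Theorem 1 concrete, the following minimal sketch (in R, the language of our package) constructs a cost function satisfying assumptions 1-4 and locates its minimiser by grid search. The power-law learning curve and all parameter values are illustrative choices of ours, not taken from any fitted model in the paper.

```r
## Minimal sketch of Theorem 1, assuming an illustrative power-law learning
## curve k_2(n) = a n^-b + c (hypothetical values chosen so assumptions 1-4 hold)
N  <- 10000                                   # total population size
k1 <- 0.4                                     # baseline cost per held-out sample
k2 <- function(n, a = 3, b = 1.5, c = 0.1) a * n^(-b) + c

ell <- function(n) k1 * n + (N - n) * k2(n)   # total expected cost l(n)

ns     <- 1:(N - 1)
N_star <- ns[which.min(ell(ns))]              # unique minimiser (Theorem 1)
c(N_star = N_star, min_cost = ell(N_star))
```

Here a holdout set that is too small incurs large intervention-set costs via an inaccurate score, while one that is too large forfeits the score's benefit, and the grid search recovers the unique interior minimum.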
The following Corollary is an immediate consequence: the optimal holdout set size always exceeds the minimal training sample size required to match baseline treatment.

Corollary 1.1. The value of N* always exceeds the value of M in assumption 3, since if n < M we have ℓ'(n) = (k_1 - k_2(n)) + (N - n)k_2'(n) < 0.

Consequently, assumption 4 may be relaxed for i, j < M (though to avoid tedious details we will generally not do so in this work); in other words, we need only be concerned with the behaviour of k_2(n) at realistically large values of n, rather than n ∈ {1, 2, ..., M}. We also note that, since k_1 > k_2(n) > 0 for large n, the expected total cost per observation ℓ(n)/N is increasing in n beyond the OHS, but bounded by the per-observation expected cost k_1 of baseline treatment. In order to estimate the OHS, we must estimate the cost function ℓ(n) up to a linear transformation, which in turn entails estimating the constants N, k_1, and the function k_2. Such estimates may be made in several ways. A simple option may be to use expert opinion or literature search, and in some cases it may be possible to compute expected cost directly, especially if the action taken on a risk score is systematically determined, as may be the case for medical risk scores (Williams, 2003). A more general option is to use some mechanism to estimate the dependence of cost on holdout set size, via pairs (n, C_2(n)), (n, k_2(n)) or (n, ℓ(n)). Even if such point estimates are not made directly, we will assume in our parametric method (section 4.2) that errors in estimates of N, k_1, k_2 decrease as the inverse square root of the effort made to measure them; that is, they behave as though dependent on some number of such point estimates, in that estimates may be made more accurate at the cost of greater effort. If, for instance, an expert opinion poll is used, then more or fewer opinions may be collected. A gold-standard option for making an unbiased estimate of (n, k_2(n)) is an interventional trial, in which a cohort of individuals are given risk scores fitted to a set amount of training data (several scores may be tested on sample subcohorts in parallel), the eventual cost C_2 measured directly, and these regressed on the expected cost (possibly with constraints to observe assumptions 1-4). Although such trials could potentially be conducted at the time an initial risk score is fitted, this option may not be necessary. It may be reasonable to assume that C_2 (and hence k_2) is a linear function L(·) of some measure of the predictive accuracy of the risk score (e.g., mean misclassification error), and to work with this predictive accuracy L^{-1}(k_2) instead, since L^{-1}(ℓ) will conserve the OHS of ℓ. We demonstrate an example in section 5.3. This assumption necessitates an estimate of L^{-1}(k_1) (the 'accuracy of no risk score'), which is typically straightforward and results in an overall simplification of the estimation problem, since L need never be made explicit. That is, only the expected predictive accuracy of the score at candidate holdout set sizes n need be calculated, which can be achieved readily using training observations of X and Y when no score is in use. In the following sections, we will consider that observation/estimation of points (n, k_2(n)) may be costly, and seek to minimise the number of such pairs that must be used. We consider that the constants N (the total number of samples on which a given predictive score can be fitted or used) and k_1 (the average cost per sample under baseline behaviour without a score) will generally be possible to estimate, although we still allow for error in their estimation.
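As an illustration of the predictive-accuracy route, the following R sketch estimates learning-curve points (n, L^{-1}(k_2(n))) by refitting a model to subsamples. The simulated data frame `dat`, the helper `k2_hat` and the use of misclassification as the accuracy measure are hypothetical stand-ins for real pre-deployment data.

```r
## Sketch: estimating learning-curve points by subsampling pre-deployment data
set.seed(1)
dat   <- data.frame(X1 = rnorm(5000), X2 = rnorm(5000), X3 = rnorm(5000))
dat$Y <- rbinom(5000, 1, plogis(dat$X1 - dat$X2))

k2_hat <- function(n, n_test = 1000) {
  idx   <- sample(nrow(dat), n + n_test)
  train <- dat[idx[1:n], ]
  test  <- dat[idx[-(1:n)], ]
  fit   <- glm(Y ~ X1 + X2 + X3, family = binomial, data = train)
  # mean misclassification on held-back samples, a proxy for L^-1(k_2(n))
  mean((predict(fit, test, type = "response") > 0.5) != test$Y)
}

ns <- c(50, 100, 200, 500, 1000, 2000)
cbind(n = ns, k2_est = sapply(ns, k2_hat))   # noisy learning-curve estimates
```

Repeating each call gives the sampling variance of the estimate at each n, which the algorithms below take as known.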
A natural algorithm for estimating the optimal holdout set size (OHS), should it exist, is immediately suggested by Theorem 1: assume k_2 is known up to parameters θ, and estimate N, k_1 and θ in order to estimate the optimal holdout size. Parameters θ of k_2 may be estimated from observations of pairs (n, k_2(n)), potentially with error in k_2(n). To minimise the number of times we have to estimate k_2(n), we suggest iteratively adding observations n to an existing set of estimation data in such a way as to greedily reduce the expected error in the resultant OHS estimate. We firstly develop asymptotic confidence intervals for parametric OHS estimates, to link error in parameter estimates to error in OHS. We will take k_2', k_2'' to mean partial derivatives with respect to n. If for θ in some neighbourhood of its true value we have

k_2'(n; θ) < 0 and k_2''(n; θ) > 0 (3)

and hence

ℓ''(n; θ) = (N - n)k_2''(n; θ) - 2k_2'(n; θ) > 0, (4)

then the function ℓ'(n; θ) is continuous and monotonically increasing, and thus injective and invertible with respect to n. We may consider the optimal holdout set size N* to be a rounding of a real solution n* to ℓ'(n; θ) = 0, and in turn consider n* as a real-valued function of θ, N, and k_1 (though not always defined). We then have

n* = {n : ℓ'(n; θ) = 0} (5)

and, by the implicit function theorem,

∇_Θ n* = -∇_Θ ℓ'(n*; Θ) / ℓ''(n*; Θ). (6)

We will use these expressions to construct asymptotic confidence intervals for n* and ℓ(n*) in the following Theorem. We will use the shorthand Θ = (N, k_1, θ) and Θ_0 = E(Θ). We will also write n* = n*(Θ), ℓ(n*) = ℓ(n*(Θ); Θ), n_0 = n*(Θ_0) and ℓ(n_0) = ℓ(n*(Θ_0); Θ_0) for brevity. We presume that Θ is an unbiased estimate of the parameters, so Θ_0 corresponds to 'true' parameter values. Note that the sample size m used in the following Theorem denotes a proxy for the effort expended in estimating Θ_0, as will be expanded upon later.

Theorem 2. Assume that k_2'(n; θ), k_2''(n; θ) and ∇_θ k_2(n; θ) are continuous in n and θ in some neighbourhood of (n_0, Θ_0), and that Θ_0 parametrises a setting satisfying assumptions 1-4. Suppose that Θ behaves as a mean of m appropriately-distributed samples, in that

√m (Θ - Θ_0) →_d N(0, Σ),

where Θ_0 does not depend on m; that an estimate Σ̂ of Σ is available which is independent of Θ and satisfies ||Σ̂ - Σ||_2 →_d 0; and that n_0 is finite and unique as above. Then, denoting

σ_n²(Θ, Σ) = ∇_Θ n*(Θ)ᵀ Σ ∇_Θ n*(Θ) and σ_ℓ²(Θ, Σ) = ∇_Θ ℓ(n*(Θ); Θ)ᵀ Σ ∇_Θ ℓ(n*(Θ); Θ),

the confidence intervals

I_α(Θ, Σ̂) = n*(Θ) ± z_α σ_n(Θ, Σ̂)/√m and J_α(Θ, Σ̂) = ℓ(n*(Θ); Θ) ± z_α σ_ℓ(Θ, Σ̂)/√m,

where z_α = Φ^{-1}(1 - α/2), satisfy P(n_0 ∈ I_α(Θ, Σ̂)) → 1 - α and P(ℓ(n_0) ∈ J_α(Θ, Σ̂)) → 1 - α.

The proof of this Theorem is given in Appendix A2. A consequence is that for sufficiently accurately estimated costs, the OHS will be of non-trivial size, as follows.

Corollary 2.1. Under the assumptions of Theorems 1 and 2, we have P(0 < n*(Θ) < N) → 1 as m → ∞.

In light of the proportionality assumption in Section 4.1, and the tendency of the accuracy of a risk score with number of training samples (the 'learning curve') to follow a power-law form (Viering and Loog, 2021), we recommend considering such a parametric form for k_2 (i.e. k_2(n; θ) = a n^{-b} + c with θ = (a, b, c)), and provide explicit asymptotic confidence intervals for this setting in Appendix A2. Examples of the variation in n* and ℓ(n*) with a power-law form for k_2 are shown in supplementary figures 7 and 8. Note that confidence intervals must be interpreted with care: if the sampling distributions for k_1 and θ admit the possibility that the assumptions of Theorem 1 are violated, such that no solution of ℓ'(n; θ) = 0 exists, then the standard error of n* does not exist, as n* can be undefined. Finite-sample confidence intervals may be constructed by bootstrapping (see function ci_ohs() in our R package OptHoldoutSize).
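Concretely, under the recommended power-law form, a point estimate of the OHS reduces to one-dimensional root-finding on ℓ'(n; θ) = 0. The following base-R sketch uses illustrative parameter values (the same values reused in section 5.2 below), not fitted ones.

```r
## Sketch: OHS under the power-law form k_2(n; theta) = a n^-b + c, found as
## the root of l'(n) = k_1 - k_2(n) + (N - n) k_2'(n) = 0
N <- 1e5; k1 <- 0.4
a <- 1e4; b <- 1.2; c0 <- 0.2
k2   <- function(n) a * n^(-b) + c0
dk2  <- function(n) -a * b * n^(-b - 1)                  # k_2'(n)
dell <- function(n) k1 - k2(n) + (N - n) * dk2(n)        # l'(n)

n_star <- uniroot(dell, lower = 10, upper = N - 1)$root  # real-valued n*
round(n_star)                                            # rounded to N*
```

Since ℓ'(n; θ) is monotone under condition (4), uniroot() finds the unique sign change, mirroring the invertibility argument above.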
Our parametric algorithm assumes Θ is estimated from a multiset n of values in {1, ..., N} and estimates d of k_2(n) for each n ∈ n, with known finite sampling variances σ². For certain multisets n, estimates of Θ will not converge (for instance, if n contains only a single value repeated), so m in Theorem 2 should be interpreted as an 'effective' population size, such that var(Θ) behaves as Σ/m. Given that our eventual aim is to estimate the OHS with minimal error, we suggest the following way to iteratively select a new value ñ at which an estimate k̂_2(ñ) of k_2(ñ) should be made, given a set n of points at which estimates k_2 of k_2(n) have been made already. We denote by Θ(n, k_2, σ), Σ̂(n, k_2, σ) and I_α(n, k_2, σ) respectively the estimates of Θ_0 and of lim_{m→∞} var(√m(Θ(n, k_2, σ) - Θ_0)), and the width of the confidence interval I_α(Θ(n, k_2, σ), Σ̂(n, k_2, σ)). Suppose we have the option of estimating d(n) for one value of n ∈ {1, ..., N} with known variance var(d(n)) = σ². We select ñ as

ñ = argmin_{n' ∈ {1,...,N}} E[ I_α(n ∪ {n'}, k_2 ∪ {d(n')}, σ ∪ {σ}) ], (8)

that is, 'select the ñ which will minimise the expected OHS confidence interval width if added to our set n, with the expectation computed with respect to our current parameter estimates'. If no minimum exists, ñ is selected uniformly from {1, ..., N}. Our algorithm is now:

Algorithm 1: Overview of estimation of optimal holdout set size; parametric
Data: a total number n_add of times we can afford to estimate k_2(n)
1 Randomly choose a set n of dim(Θ) values of n in {1, ..., N};
2 For all n ∈ n, make an estimate k̂_2(n) ∈ k_2 of k_2(n);
3 while |n| < n_add do
4   Find the best new value ñ to add to n as per formula (8);
5   Make an estimate k̂_2(ñ) of k_2(ñ); add ñ to n and k̂_2(ñ) to k_2;
6 end
7 return n*_final = n*(Θ(n, k_2, σ))

If we require an asymptotic confidence interval on n*_final, this should be evaluated on a new, independently chosen set of values k_2. The consistency of algorithm 1 generally depends on whether n eventually contains enough elements of sufficient multiplicity to estimate Θ_0 consistently. If consistency is of particular concern, sampling some positive proportion of values of n randomly from {1, ..., N} will guarantee that the multiplicity of all n ∈ n eventually exceeds any finite value with probability 1, readily guaranteeing consistency. The finite-sample bias of n*_final depends on ∇_Θ n* and the variance of Θ. For a power-law form of k_2(n), the value of n*_final is generally biased slightly upwards (see supplementary figure 7 for typical forms of ∇_Θ n*).
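The parametric fit underlying steps 2 and 5 of algorithm 1 amounts to nonlinear least squares on the power-law form; a sketch follows. The pairs are simulated here, standing in for the costly real estimates, and the starting values are arbitrary.

```r
## Sketch of the parametric fit inside algorithm 1: estimating theta = (a,b,c)
## in k_2(n) = a n^-b + c from noisy pairs (n, k2hat)
set.seed(2)
n_obs <- sample(100:1e5, 30)
k2hat <- 1e4 * n_obs^(-1.2) + 0.2 + rnorm(30, sd = 0.01)  # simulated estimates

fit <- nls(k2hat ~ a * n_obs^(-b) + c,
           start = list(a = 5e3, b = 1, c = 0.1))
coef(fit)   # estimate of theta, to be plugged into l'(n; theta) = 0 as above
```

With θ estimated, the root-finding sketch after Theorem 2 yields n*_final, and refitting on bootstrap resamples of the pairs gives finite-sample intervals in the spirit of ci_ohs().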
Parametrisation of k_2(n) may be inappropriate if the learning curve of the risk score or the map L (from section 4.1) is complex (Viering and Loog, 2021). We propose a second algorithm which is less reliant on assuming a parametric form for k_2(n), using Bayesian optimisation (Brochu et al., 2010). We approximate the cost function with an emulator consisting of a Gaussian process with a parametric prior mean function, which allows deviation from parametric assumptions. We take the minimum of its posterior mean over n to be our OHS estimate, efficiently choosing values of n at which to estimate ℓ(n) using an 'expected improvement' function. First we must construct an emulator which approximates the cost function. We begin with an initial set of design points n and their corresponding observed cost estimates d, with sampling variances σ² (noting that σ has a slightly different meaning to that in section 4.2). The prior for our emulator is (Vernon et al., 2018)

ℓ(n) = m(n, Θ) + u(n),

with mean function m(n, Θ) = k_1 n + k_2(n; θ)(N - n), given some initial estimate of Θ = (N, k_1, θ), and u(n) a zero-mean Gaussian process with covariance function

k(n, n') = σ_u² exp{ -(n - n')² / (2ζ²) },

where k is chosen to enforce smoothness in ℓ(n), though other covariance functions having varying degrees of smoothness could be used (Stein, 1999). The hyperparameters θ, σ_u and ζ are problem-specific and must be specified; however, we will show that for sufficiently large |n| (with some caveats) mis-specification of θ, σ_u and ζ is overcome. Following McHutchon et al. (2015), since n may be a multiset, we take n^1 as the set of unique values in n, with d^1, σ^1 defined correspondingly: d^1_i is an inverse-variance weighted mean of {d_j : n_j = n^1_i}, with sampling variance

(σ^1_i)² = ( Σ_{j : n_j = n^1_i} σ_j^{-2} )^{-1}, (12)

noting that σ^1_i may change with i. Alternatively, we may account for the variation in d through 'inactive' variables and opt to use a 'nugget' term (Bower et al., 2010); this approach is described in detail in supplementary section S1. Now, for an input n with an unevaluated loss value, our emulator specifies that the joint distribution of ℓ(n) and our observed output values d^1 is

(ℓ(n), d^1) ~ N( (m(n, Θ), m(n^1, Θ)), [ k(n, n), k(n, n^1); k(n^1, n), k(n^1, n^1) + diag((σ^1)²) ] ).

By obtaining the conditional posterior distribution π_n = π(ℓ(n) | n, n^1, d^1, σ^1) and taking the expectation and variance, we obtain the Bayes linear update equations (Vernon et al., 2018):

μ(n) = m(n, Θ) + k(n, n^1)[k(n^1, n^1) + diag((σ^1)²)]^{-1}(d^1 - m(n^1, Θ)),
Ψ(n) = k(n, n) - k(n, n^1)[k(n^1, n^1) + diag((σ^1)²)]^{-1} k(n^1, n).

In algorithm 1, selection of new design points generally favours well-spaced points across {1, ..., N}, serving both exploration and exploitation purposes. In this case, since we wish both to estimate the OHS accurately and to approximate ℓ well locally, we choose the next n in a way which predominantly (but not completely) favours exploitation. We use the 'expected improvement', which measures the discrepancy between the emulator at a design point and the known minimum d^- = min_i d^1_i (Brochu et al., 2010):

EI(n) = (d^- - μ(n)) Φ( (d^- - μ(n)) / √Ψ(n) ) + √Ψ(n) φ( (d^- - μ(n)) / √Ψ(n) ).

[Figure: illustration of the emulator; axes: training/holdout set size against total cost (= num. cases).]

By formulating the problem in terms of EI(·), there is a natural stopping criterion on the size of n: setting a threshold EI(ñ) > τ allows us to specify that at each iteration we expect the total cost to improve by at least τ over our current known minimum d^-.
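The Bayes linear update and expected improvement above can be sketched compactly in base R, assuming the squared-exponential kernel; the helper names sqexp and emulate, and all prior-mean, design-point and noise values, are illustrative choices of ours.

```r
## Sketch of the Bayes linear update (mu, Psi) and expected improvement (EI)
sqexp <- function(x, y, s2u = 1e7, zeta = 5e3)
  s2u * exp(-outer(x, y, "-")^2 / (2 * zeta^2))

emulate <- function(ngrid, n1, d1, s1, m) {
  K   <- sqexp(n1, n1) + diag(s1^2)                # k(n1,n1) + diag((s^1)^2)
  Ki  <- solve(K)
  kx  <- sqexp(ngrid, n1)                          # k(n, n1)
  mu  <- m(ngrid) + as.vector(kx %*% Ki %*% (d1 - m(n1)))   # posterior mean
  Psi <- pmax(sqexp(0, 0)[1] - rowSums((kx %*% Ki) * kx), 0) # posterior var
  dm  <- min(d1)                                   # current known minimum d^-
  z   <- (dm - mu) / sqrt(Psi + 1e-12)
  EI  <- (dm - mu) * pnorm(z) + sqrt(Psi) * dnorm(z)
  list(mu = mu, Psi = Psi, EI = EI)
}

m  <- function(n) 0.4 * n + (1e4 * n^(-1.2) + 0.2) * (1e5 - n)  # m(n, Theta)
n1 <- c(2e3, 1e4, 3e4, 6e4, 9e4)                   # unique design points
d1 <- m(n1) + rnorm(5, sd = 500)                   # stand-in noisy costs
em <- emulate(seq(1e3, 1e5, by = 1e3), n1, d1, rep(500, 5), m)
seq(1e3, 1e5, by = 1e3)[which.min(em$mu)]          # OHS estimate on the grid
```

In the full algorithm, the grid point maximising em$EI would be evaluated next, rather than chosen at random.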
This leads to the following algorithm for OHS estimation by Bayesian emulation:

Algorithm 2: Overview of estimation of optimal holdout set size; emulation
Data: a value τ specifying the minimum improvement in total cost
1 Choose initial values n randomly from {1, ..., N} with |n| > dim(Θ);
2 Estimate costs d_i = d(n_i) ≈ ℓ(n_i) with errors σ_i² = var(d(n_i));
3 Coalesce n, d, σ into n^1, d^1, σ^1 as above;
4 Estimate functions μ(n), Ψ(n), EI(n), with Θ = Θ(n^1, d^1, σ^1);
5 while max_n EI(n) > τ do
6   Choose ñ = argmax_{n ∈ {1,...,N}} EI(n);
7   Estimate the cost d(ñ) ≈ ℓ(ñ) with error σ(ñ)² = var(d(ñ));
8   Add ñ to n, d(ñ) to d, and σ(ñ) to σ;
9   Coalesce n, d, σ into n^1, d^1, σ^1;
10  Re-estimate functions μ(n), Ψ(n), EI(n), with Θ = Θ(n^1, d^1, σ^1);
11 end
12 return n*_final = argmin_{n ∈ {1,...,N}} μ(n)

Various results on the consistency of the expected improvement algorithm have been proved, albeit in differing settings: either with noiseless observations d (Locatelli, 1997; Vazquez and Bect, 2010; Bull, 2011), or with noisy observations with known variance (Ryzhov, 2016). We prove the following consistency results specifically for the setting of this work in Appendix A3.

Theorem 3. If ℓ(n), σ, and m(n, Θ) are almost surely bounded and d_i ~ N(ℓ(n_i), σ_i²), then for every n ∈ {1, ..., N}, as the multiplicity of n in n tends to ∞, we have μ(n) → ℓ(n) and Ψ(n) → 0 almost surely with respect to variation in d.

This result asserts that μ(n) can eventually approximate any loss function sufficiently well, given enough estimates of ℓ at all values of n. It is not obvious that this is guaranteed by algorithm 2, although we show that this generally does occur in the following, the proof of which is given in Appendix A3:

Theorem 4. If ℓ(n), σ, and m(n, Θ) are almost surely bounded and d_i ~ N(ℓ(n_i), σ_i²), then under algorithm 2 with τ = 0, the value μ(ñ) converges almost surely to ℓ(ñ) for every ñ ∈ {1, ..., N}.

We characterise the error in n* using 'the number of values of n for which the probability that the true cost at holdout set size n is less than the estimated minimum cost exceeds 1 - α', or formally

{ n ∈ {1, ..., N} : π_n( ℓ(n) < min_{n'} μ(n') ) > 1 - α },

although this should not be interpreted as a credible set for n*. This is implemented in our R package OptHoldoutSize, available on CRAN.

In this section, we analyse the dynamics of a roughly realistic binary-outcome system, subject to predictions from different families of risk models. Our main aim is to demonstrate the natural emergence of an OHS from a reasonable setting. We generated datasets with a population size N = 5000, with seven standard normally distributed covariates, and outcomes Y under a ground-truth logistic model, either with interaction terms (i.e., non-linear) or without (linear). We considered risk scores ρ derived from either logistic regression models (not including interaction terms) or random forests. We designated random-valued cost functions C_1, C_2 in terms of the discrepancy between Ŷ_j ∈ {0, 1}, sample j's class as predicted by the risk model given X_j, and Y_j, the observed incidence of Y. Figure 3 shows the results of the simulation under this setup, using either logistic regression or random forest prediction models, and linear or non-linear underlying models for Y | X. We can observe that an OHS can arise naturally from standard predictive models, since empirical k_2 curves for both a random forest and logistic regression satisfy assumptions 2 and 4. The OHS occurs at a value of n smaller than that at which k_2(n) is nearly 'flat', indicating that unnecessarily large training sets are suboptimal. However, since ℓ(n) rises only linearly as n increases, it is generally less costly to slightly overestimate rather than underestimate the OHS. Finally, the rightmost panels illustrate that the OHS is not necessarily smaller for a more accurate model: the random forest model (non-lin ρ) in the non-linear underlying case (right panels) leads to uniformly lower expected costs k_2(n) at all potential holdout set sizes, although the optimal holdout set size is larger.

Figure 3: Examples of cost functions as per Theorem 1 arising naturally from a basic risk score, with varying underlying model (und.), risk score type (ρ) and one point-wise standard deviation (shaded regions). The contributions of the terms k_1 n to ℓ(n) depend only on the underlying model and are the same in each column.

In this section, we demonstrate circumstances in which either one of the proposed algorithms may be preferable to the other. We consider two versions of the function k_2(n):

k_2^p(n) = a n^{-b} + c and k_2^{np}(n) = a n^{-b} + c + 10⁴ φ( (n - 4×10⁴) / (8×10³) ),

where 'p'/'np' denotes 'parametric assumptions satisfied/not satisfied', and θ = (a, b, c) = (10000, 1.2, 0.2). The function k_2^{np}(n) exhibits 'double-descent' behaviour (figure 4a), which is possible for various learning curves (Viering and Loog, 2021). Corresponding cost functions are shown in figure 4b. We assume N and k_1 are known to be 1 × 10⁵ and 0.4 respectively.
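Written out in R, these two settings are as follows; the scale of the bump term follows the formula as printed above, and the grid minimisers are illustrative only.

```r
## The two learning-curve forms of this section
k2_p  <- function(n, a = 1e4, b = 1.2, c = 0.2) a * n^(-b) + c
k2_np <- function(n) k2_p(n) + 1e4 * dnorm((n - 4e4) / 8e3)  # double descent

N <- 1e5; k1 <- 0.4
ell <- function(n, k2) k1 * n + (N - n) * k2(n)
ns  <- seq(500, N - 1, by = 500)
c(OHS_p  = ns[which.min(ell(ns, k2_p))],     # power-law setting
  OHS_np = ns[which.min(ell(ns, k2_np))])    # double-descent setting
```

Fitting the power-law form to data generated under k2_np is exactly the misparametrisation that biases the parametric algorithm below.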
For emulation, we use a kernel width ζ = 5000 and variance σ_u² = 1 × 10⁷. The double-descent form of k_2 does not satisfy the assumptions of Theorem 1, but in our case leads to a single optimal holdout set size nonetheless. We note that more subtle misparametrisations will also lead to inconsistency in OHS estimation if the parametric approach is used; for instance, if k_2(n) follows exponential decay but is parametrised using a power-law curve. We firstly show the distribution of estimates of OHS using our two algorithms when k_2 takes either form above. To fit k_2, we use n given by 200 values of n randomly chosen from {1, ..., N}, with values k_2 independently sampled as (k_2)_i ~ N(k_2(n_i), σ_i²), where σ ~ U(0.001, 0.02). Figure 4c shows the distributions and medians of OHS estimates using the parametric and emulation algorithms in settings with parametric assumptions either satisfied or unsatisfied. The parametric OHS estimate is empirically unbiased and has lower variance than the emulation estimate when parametric assumptions are satisfied, but is biased when they are not. The variance of OHS estimates using the emulation method is lower when parametric assumptions are not satisfied, because the true cost function has a sharper minimum in that case (see figure 4b). Because the cost function is 'flat' around the minimum in the setting where parametric assumptions are satisfied (figure 4b), the consequences of the high variance of the semi-parametric (emulation) estimator are minimal, as the cost is similar across a range of values near the OHS. We next examine the consequences of sampling ñ (the 'next' value of n) greedily (using equation (8) as in algorithm 1, or EI as in algorithm 2) rather than randomly (ñ chosen uniformly in {1, ..., N}), by comparing the rates of convergence of OHS estimates. Figure 5 shows medians and (roughly) a discrete kernel estimate of optimal holdout size estimates at various sizes of |n|, where n is generated by adding points ñ either randomly or systematically (after choosing the initial five values in n randomly). Convergence is faster when 'next points' are picked systematically rather than randomly. Convergence is also faster when using parametric estimates, though again parametric estimates are biased and inconsistent when parametric assumptions are not satisfied. This is highlighted by the smaller panels, which show the root mean-square error between the total cost at the estimated optimal sizes and the total cost at the true OHS: as expected, the parametric algorithm is to be favoured where the assumptions are satisfied, and the non-parametric where they are not. Note in particular that the non-parametric method shows bifurcation, detecting both local minima, whilst the parametric method converges to a mid-point which is far from optimal in terms of total cost. In this section, we describe a potential practical end-to-end implementation of our algorithm in a healthcare setting. We describe possible motivations for updating and the harms of updating naïvely, a practical set-up of a holdout set procedure, estimation of the requisite parameters of the function ℓ(·), and computation of an optimal holdout set size and associated error. We consider the ASPRE score (Akolekar et al., 2013) for evaluating the risk of pre-eclampsia (PRE), a hypertensive complication of pregnancy, on the basis of predictors derived from ultrasound scans in early pregnancy. Although treatable, PRE confers a serious risk to both the fetus and the mother.
The risk of pre-eclampsia is lowered by treatment with aspirin through the second and third trimesters, but aspirin therapy itself confers a slight risk, contraindicating universal treatment and suggesting prescription of aspirin only if the risk of PRE is sufficiently high or other indications are present (LeFevre, 2014; ACOG, 2016). The ASPRE score was developed to aid clinicians in estimating PRE risk (Wright et al., 2012) and has been shown to be useful in prioritising patients for aspirin therapy. We will not differentiate early- and late-stage PRE. It may be desirable to update the ASPRE score in the future for several reasons. Firstly, inclusion of additional covariates or more sophisticated machine-learning methods may be of benefit (Akolekar et al., 2013), and distributions of covariate values across the population may change with population demographic shift over time (e.g. maternal serum placental growth factor (Yang et al., 2016)), necessitating changes to the ways that such covariates are used in the model. However, as discussed in Section 1, a difficulty in updating models in this way arises from the possible effect of the ASPRE score itself: namely, a naïve refitting of a risk score on the basis of (X:) maternal assessment in early pregnancy and (Y:) eventual PRE incidence could lead to dangerous underestimation of PRE risk, due to individuals previously assessed as high risk being treated in response to the assessment. Retraining a new model on a held-out set could avoid this problem. For such a holdout set, no ASPRE score would be calculated at the first ultrasound scan, and patients would be treated according to best practice in the absence of a risk score. An updated score would then be fitted to data from such patients.

Figure 5: Comparison of convergence rates with parametric and emulation algorithms, using either a random or greedy method to select the next value of n to add to n. All algorithms are run for 200 datasets simulated from the underlying model. In the larger panels, dashed horizontal lines show the true OHS, while vertical lines indicate when at least 2.5% of runs of the algorithm select the next value of n within a grid of size 1000 (i.e. a form of discrete kernel estimate). Smaller panels show the root mean-square error between the total costs at each of the 200 proposed n and the minimal total costs at the true OHS. Note the variable axis scaling on left and right.

An obvious concern is that patients in this holdout set go without the benefit of the ASPRE score, leading to a less accurate allocation of prophylactic treatment (aspirin) and consequently a higher risk of PRE, which may indicate using the smallest possible holdout set. However, an inappropriately small holdout set would lead to an inaccurate updated model, reducing the benefit of future use of the score. This vividly illustrates the tension at the heart of model updating which we seek to address in this work. If we suppose that ASPRE is to be refitted every five years, then the 'intervention set' should include all individuals in the subsequent years before the model is refitted, and all individuals not used in the next refitting procedure. Suppose we are refitting ASPRE for use in a population of 5 million individuals, from which we have approximately 80,000 new pregnancies per year. Thus N ≈ 400,000 (SE: 1500). See supplementary section S2 for further details.
We presume a simple clinical action in which a fixed proportion π = 10% of individuals at the highest assessed PRE risk are treated with aspirin. We assume that if untreated with aspirin, a proportion π_0 of individuals designated 'low-risk' (lowest 90%) will develop PRE, as will a proportion π_1 of individuals designated high-risk. Under current pre-ASPRE best-practice guidelines (O'Gorman et al., 2017) we have π_0 ≈ 0.02 (SE 0.0009) and π_1 ≈ 0.08 (SE 0.008) (see supplementary section S2). Aspirin reduces PRE risk by approximately 1 - α = 63% (SE 0.09). We denote 'cost' simply as the number of cases of PRE in a population, so we expect a total expected cost per individual under 'baseline' treatment (i.e. clinical actions without the aid of a risk model) proportional to

k_1 = π_0(1 - π) + π_1 π α ≈ 0.022, (21)

with standard error approximately 0.001. Note that this is not equal to the untreated PRE risk in the population, since some proportion of individuals are treated pre-emptively. The data used to fit the initial ASPRE model can be used to estimate the learning curve for potential model updates: at the stage at which the ASPRE score was first fitted, the optimal 'holdout set' size is as large as possible. We do not have access to this dataset, but demonstrate estimation of a learning curve on 'mock-up' data designed to resemble it. Although k_2(n) is easy and fast to estimate here, in order to mimic a real example where such estimation is time-consuming or costly, we restrict ourselves to using only |n| = 120 values of n, determined using either algorithm 1 or 2. For both algorithms, we assumed a power-law form k_2(n; θ = (a, b, c)) = a n^{-b} + c. Using the parametric algorithm, we found an OHS of 10271 (90% CI 8103-12438), with minimum cost (expected number of cases over five years) of 8172. Using the emulation algorithm, we found an OHS of 13313 with an expected cost of 8164, with holdout sizes of 9210-17619 having a probability > 0.1 of an expected cost < 8164. Figure 6 shows estimated cost functions, optimal holdout sizes, and error using the two algorithms.

Figure 6: Estimation of cost functions, OHS and error using the parametric (left) and emulation (middle) algorithms, with the track of estimated OHS against number of sample points |n| on the right. Note that the 'best' points for optimising parametric estimation tend to be spread out, to estimate θ well, but for the emulation method they tend to be close to the OHS, to locally approximate the cost function well. Error measures in the parametric and emulation algorithms have different meanings and are not comparable.
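An end-to-end sketch of this calculation in R follows. N and k_1 are taken from the text, but the power-law parameters are illustrative stand-ins for the fitted values (which require the mock-up ASPRE data), so the outputs only roughly reproduce the reported OHS of ~10,000.

```r
## End-to-end sketch of the ASPRE OHS calculation (illustrative k_2 parameters)
N  <- 4e5                                    # pregnancies per five-year cycle
k1 <- 0.022                                  # baseline PRE cases per individual
k2 <- function(n, a = 5, b = 1.2, c = 0.015) a * n^(-b) + c

ell <- function(n) k1 * n + (N - n) * k2(n)  # expected PRE cases over 5 years
ns  <- 1:(N - 1)
c(OHS = ns[which.min(ell(ns))], cases = min(ell(ns)))
```

In practice the parameters of k_2 would come from the subsampling fits described above, with either algorithm 1 or 2 choosing where to evaluate the learning curve.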
In this work we establish theoretically the existence of an optimal holdout set size under reasonable assumptions, establish two algorithms for estimating this optimal holdout size, and evaluate their use in both a toy simulation and a real-life motivated simulation. We establish a practical and simple approach to an important contemporary problem in modern machine learning, which will become particularly important as risk scores become more widely used in real-world applications. Theorem 1 establishes that straightforward conditions on the system to be modelled lead to an OHS which is straightforward to find, and indeed that the cost function is convex (in a discrete sense). An obvious limitation is the requirement for convexity of k_2; risk score learning curves, on which k_2 depends, are not necessarily convex for complex models (Viering and Loog, 2021). In practice, the cost function ℓ(n) is still generally well-behaved even if convexity of k_2 is violated, and may be approximated using our emulation algorithm (as demonstrated in figures 4b and 5). The use of a Gaussian process emulator to approximate the true loss function enables an automatic selection of the optimal holdout set size under fewer assumptions, although the efficiency of this method depends heavily on the quality of our emulator. Various extensions of the emulator may improve our surrogate of the loss function, for example specifying priors on the parameters θ, σ_u², ζ and using the likelihood provided by the Gaussian process to marginalise out these parameters. An explicit approach is given in Andrianakis and Challenor (2011), but under linearity assumptions which do not hold in our case, so analytic tractability would be lost. If we were able to cheaply estimate the derivative of the cost function at design points, this could be incorporated into our emulator (Killeya, 2004), enabling greater posterior accuracy around these points. Direct estimation of gradients from only estimates of ℓ(n) usually requires double the number of evaluations needed for estimation of ℓ(n) values, and so has the potential to become a more costly procedure than the method presented in section 4.3. We have generally assumed that the function k_2 is to be estimated by repeated noisy observation of pairs (n, k_2(n)). It is possible that k_2 could be estimated in other ways or be known a priori. Testing the impact of a risk score is generally a difficult problem (Ben-Israel et al., 2020) and is unavoidable in OHS estimation, although it is also necessary to justify deployment. If a risk score cannot in any way affect interventions (for instance, a risk score for surgical complications (Nashef et al., 2012) used exclusively to discuss risk with patients), then k_2(n) is constant and no holdout set scheme is needed for updating. We emphasise that even indirect action on risk scores (for instance, using a medical risk score to identify at-risk demographic classes for budgetary planning) leads to risk scores being 'victims of their own success' (Lenert et al., 2019) and requires planning of an updating strategy. Other solutions to the model updating problem have been proposed, such as developing models that introduce missing causal connections (Alaa and van der Schaar, 2018; Sperrin et al., 2019) and using several predictive scores together (Liley, 2021). Such solutions tend to be difficult to implement: more comprehensive modelling generally requires some observation of the intervention, requiring further samples or more data across time, and parallel use of several risk scores is difficult to implement. With this in mind, holdout sets could prove valuable in updating strategies since, in principle, they can be applied in any setting. Moreover, as suggested by Sperrin et al. (2019), the forced availability of direct data on the underlying distribution of (X, Y) through the holdout set in itself facilitates post-deployment maintenance and surveillance. In summary, we demonstrate that standard settings for predictive risk scores give rise to optimal holdout set sizes, and develop tractable approaches to finding them. In particular, we strongly suggest planning an updating strategy for a risk model before it is deployed. This work illustrates one strategy in this direction, and we hope it stimulates both the use and extension of such methods for safe predictive score updating.

Appendix A1: Proof of Theorem 1

Theorem 1. Suppose assumptions 1-4 hold.
Then there exists an N* ∈ (0, N), with N ∈ ℕ, which we call the optimal holdout set size, such that ℓ(N*) ≤ ℓ(n) for all n ∈ {1, ..., N-1}.

Proof. As above, we may impose that

k_2'(n) < 0. (22)

Since both k_2(n) and (N - n) are positive and monotonically decreasing in n, so is their product, and differentiating ℓ(n) = k_1 n + (N - n)k_2(n) gives

ℓ'(n) = (k_1 - k_2(n)) + (N - n)k_2'(n). (25)

By assumption 3, k_1 < k_2(0), and, from equation (22), k_2'(0) < 0, so both terms in equation (25) are negative when n = 0, and ℓ'(0) < 0. When n = N, the second term vanishes while the first one is positive, as k_1 > k_2(N) by assumption 3. We thus have ℓ'(N) > 0. By assumption, ℓ' is smooth, so by Bolzano's Theorem there must exist at least one point n* for which ℓ'(n*) = 0, which is an extremum of ℓ. We now prove that this extremum is unique and a minimum. First, by assumption 4, we may impose that

∂²k_2(n)/∂n² > 0. (26)

Taking the second derivative of ℓ and using equations (22) and (26), we see that ℓ''(n) is strictly positive and, as a consequence, ℓ'(n) is monotonically increasing. Therefore, the extremum of ℓ(n) at n* we found earlier is unique and, as ℓ''(n) > 0, it is a minimum. If n* ∈ {1, ..., N-1}, let N* = n*. If n* ∉ ℕ, let N* be the closest natural number to either side of n*. From assumption 3, N* cannot be 0 or N. In both scenarios, this completes the proof.

Appendix A2: Proof of Theorem 2

As above, we consider n* as a function of the parameters Θ = (N, k_1, θ) (where k_2(·) = k_2(·; θ)), write n* = n*(Θ), and set Θ_0 = E(Θ), n_0 = n*(Θ_0), ℓ(n_0) = ℓ(n_0; Θ_0) and ℓ(n*) = ℓ(n*(Θ); Θ). As discussed above, n* and ℓ(n*) do not generally have means or standard errors.

Theorem 2. Assume that k_2'(n; θ), k_2''(n; θ) and ∇_θ k_2(n; θ) are continuous in n and θ in some neighbourhood of (n_0, Θ_0), and that Θ_0 parametrises a setting satisfying assumptions 1-4. Suppose that Θ behaves as a mean of m appropriately-distributed samples, in that √m(Θ - Θ_0) →_d N(0, Σ), where Θ_0 does not depend on m; that an estimate Σ̂ of Σ is available which is independent of Θ and satisfies ||Σ̂ - Σ||_2 →_d 0; and that n_0 is finite and unique as above. Then we may uniquely define n_0 = {n : ℓ'(n; Θ_0) = 0}, and we have

√m (n*(Θ) - n_0) →_d N(0, σ_n²(Θ_0, Σ)) and √m (ℓ(n*) - ℓ(n_0)) →_d N(0, σ_ℓ²(Θ_0, Σ)), (29)

and the confidence intervals I_α(Θ, Σ̂) and J_α(Θ, Σ̂), where z_α = Φ^{-1}(1 - α/2), satisfy P(n_0 ∈ I_α(Θ, Σ̂)) → 1 - α and P(ℓ(n_0) ∈ J_α(Θ, Σ̂)) → 1 - α as m → ∞.

Proof. From ℓ(n) = k_1 n + k_2(n; θ)(N - n) and n* = {n : ℓ'(n; Θ) = 0}, where such n* is unique, we have (as per section 4.2)

∂n*/∂Θ_i = -(∂ℓ'(n*; Θ)/∂Θ_i) / ℓ''(n*; Θ)

for all components Θ_i of Θ. Thus the partial derivatives of n* exist as long as ℓ''(n*; Θ) ≠ 0. By assumption, ℓ(·; Θ_0) has a minimum at n_0. Since

ℓ''(n; Θ) = (N - n)k_2''(n; θ) - 2k_2'(n; θ),

where both terms are continuous in a neighbourhood of (n_0, Θ_0) by assumption, the value of ℓ'' must be positive in some (possibly smaller) neighbourhood R_δ of (n_0, Θ_0) of width 2δ, and hence all partial derivatives of n* and ℓ(n*) are defined (and indeed continuous) in R_δ. Within R_δ, the gradients ∇_Θ n* and ∇_Θ ℓ(n*) are continuous and bounded, from which, given the assumption of asymptotic normality of Θ, assertions (29) follow by the delta method. We note that despite this convergence in distribution, n* and ℓ(n*) do not generally have first or second moments for finite m. Since Σ̂ converges to Σ, and since P(Θ ∈ R_δ) → 1 by the asymptotic normality of Θ, the coverage of the upper end of I_α(Θ, Σ̂) converges appropriately; combining this with the corresponding limit for the lower end of I_α(Θ, Σ̂) gives P(n_0 ∈ I_α(Θ, Σ̂)) → 1 - α, as required. An identical argument holds for J_α(Θ, Σ̂).

If we assume a power-law form of k_2 parametrised by θ = (a, b, c), with Θ = (a, b, c, k_1, N), then we have

ℓ'(n) = k_1 - c - a n^{-b} - a b (N - n) n^{-b-1}

and, more simply,

ℓ''(n) = 2 a b n^{-b-1} + a b (b + 1)(N - n) n^{-b-2},

from which explicit asymptotic confidence intervals follow.

Appendix A3: Proofs of Theorems 3 and 4
Theorem 3. If ℓ(n), σ, and m(n, Θ) are almost surely bounded and d_i ~ N(ℓ(n_i), σ_i²), then for every n ∈ {1, ..., N}, as the multiplicity of n in n tends to ∞, we have μ(n) → ℓ(n) and Ψ(n) → 0 almost surely with respect to variation in d.

Proof. Assume w.l.o.g. that (n^1)_1 = n. Since σ is bounded, we have (from equation (12)) var((d^1)_1) = ((σ^1)_1)² → 0, so (d^1)_1 → ℓ(n) almost surely. We now show that k(n, n^1)[k(n^1, n^1) + diag((σ^1)²)]^{-1} = (1, 0, ..., 0) when (σ^1)_1 = 0. Indeed, k(n, n^1) = (1, 0, ..., 0)[k(n^1, n^1) + diag((σ^1)²)] is then true by definition, as the first row of k(n^1, n^1) + diag((σ^1)²) is k(n, n^1). Therefore, μ(n) → m(n, Θ) + ((d^1)_1 - m(n, Θ)) = ℓ(n) and Ψ(n) → k(n, n) - k(n, n) = 0, almost surely.

Corollary 4.1. Under the conditions of Theorem 3, EI(n) → 0 almost surely as the multiplicity of n in n tends to ∞.

Proof. From Theorem 3, we have that μ(n) → ℓ(n) < ∞ and d^1_i → ℓ(n^1_i); therefore in the limit we can state P_d(-∞ < d^- - μ(n) ≤ 0) = 1. Indeed, let j be the index such that n^1_j = n. If in the limit d^- > μ(n) = d(n), then this implies that d^1_j < min_i {d^1_i}, which is a contradiction. Also note from Theorem 3 that Ψ(n) → 0, and that Φ(·) ∈ (0, 1), φ(·) ∈ (0, (2π)^{-1/2}]. As a result, the following two scenarios have joint probability 1:

• d^- - μ(n) = 0 in the limit: as Φ(·), φ(·) are bounded and Ψ(n) = 0 in the limit, we also have EI(n) = 0 in the limit.
• -∞ < d^- - μ(n) < 0 in the limit: as Ψ(n) = 0 in the limit, Φ((d^- - μ(n))/√Ψ(n)) = 0 in the limit. As φ(·) is bounded, we have that EI(n) = 0 in the limit.

This proves the corollary.

Theorem 4. If ℓ(n), σ, and m(n, Θ) are almost surely bounded and d_i ~ N(ℓ(n_i), σ_i²), then under algorithm 2 with τ = 0, the value μ(ñ) converges almost surely to ℓ(ñ) for every ñ ∈ {1, ..., N}.

Proof. Our overall argument is to show that algorithm 2 leads to the multiplicity of ñ in n tending to infinity, from which the result follows from Theorem 3. To do this, we begin with the following two lemmas, the second of which describes the limiting behaviour of EI(n) according to how often n occurs in n: namely, that if the multiplicity of n in n diverges, the value of EI(n) converges to 0; otherwise, it remains positive. We introduce the index EI_n(n) to indicate the dependence of EI(n) on n, and assume that the function ℓ(n) is fixed. For a multiset n_i, we denote by mult_{n_i}(n) the multiplicity of n in n_i; adj(·) indicates the adjugate matrix and (·)_1 the top row, notation used in Lemma 4.1, a matrix-algebra result applied below to show that Ψ_{n_i}(n) is increasing in each element of (σ^1_i)². We now have:

Lemma 4.2. Suppose we have infinite sequences n, d, σ, where d ~ N(ℓ(n), σ²) and σ² is upper-bounded, and let n_i, d_i, σ_i denote the (multiset) first i elements of each sequence. Let S_1 ⊆ {1, ..., N} be the set of values whose multiplicity in n_i remains bounded as i → ∞, with complement S_2 = {1, ..., N} \ S_1; that is, q_1(n_i), the greatest multiplicity in n_i of any element of S_1, satisfies q_1(n_i) ≤ m_1 for all i, and q_2(n_i), the least multiplicity of any element of S_2, satisfies q_2(n_i) → ∞. Suppose further that the set {k_2(n, Θ_i) = k_2(n, Θ(n_i, d_i, σ_i)) : n ∈ 1...N, i ∈ ℕ} is almost surely asymptotically bounded. Then for sufficiently large σ_u, lim sup_{i→∞} EI_{n_i}(n) > 0 for n ∈ S_1 and lim_{i→∞} EI_{n_i}(n) = 0 for n ∈ S_2, almost surely.

Proof. We will in fact show that even lim inf_{i→∞} EI_{n_i}(n) > 0 for n ∈ S_1, but lim sup will suffice for our purposes. We will show that for all n, a two-sided bound (45) on EI_{n_i}(n) holds. By the argument in Theorem 3 and Corollary 4.1, we have for n ∈ S_2 that lim_{i→∞} Ψ_{n_i}(n) = 0, from which both sides of (45) converge to 0. For n ∈ S_1 we will show lim_{i→∞} Ψ_{n_i}(n) > 0, in which case we may define

z_{n_i}(n) = (d^-_{n_i} - μ_{n_i}(n)) / √Ψ_{n_i}(n),

in terms of which inequality (45) reduces to a bound which holds for all 0 ≤ z_{n_i}(n) < ∞. Since z_{n_i}(n) is asymptotically bounded between positive values, the result follows. Beginning with d^-_{n_i}, we note that d^-_{n_i} is the minimum of:

1. the values d^1_i at design points in S_1;
2. the values d^1_i at design points in S_2.

For sufficiently large s, the sequence {n_j = (n)_j : j > s} never contains any n ∈ S_1 again; hence, the minimum over item 1 is determined after finitely many i, and its limit is finite. Since σ is upper-bounded, all values of d^1_i in item 2 converge to finite values in {ℓ(n) : n ∈ S_2} almost surely.
Hence d^-_{n_i} converges almost surely to a finite value. Since lim sup_{i→∞} and lim inf_{i→∞} of m(n; Θ(n_i, d_i, σ_i)) are almost surely finite, all terms in μ(n) are asymptotically finite, from which equation (46) follows. It remains to consider Ψ_{n_i}(n) for n ∈ S_1. Firstly, take n ∈ n^1 and suppose w.l.o.g. that (n^1_i)_1 = n. Since n ∈ S_1 we have lim_{i→∞} mult_{n_i}(n) > 0, so lim_{i→∞} (σ^1_i)_1 exists and is positive. Denoting by σ̃ the vector σ^1_i with 0 substituted for the first element, we have, by Lemma 4.1, that Ψ_{n_i}(n), considered as a function of (σ^1_i)²_1, is increasing. Given that lim_{i→∞} (σ^1_i)_j is 0 for (n^1_i)_j ∈ S_2 and is positive for (n^1_i)_j ∈ S_1, we conclude that lim_{i→∞} Ψ_{n_i}(n) is positive when n ∈ S_1 and n ∈ n^1. If n ∉ n^1, so that n never occurs in any n_i, then we first note that since k(n, m) < k(n, n) for any m ≠ n, the expression for Ψ_{n_i}(n) remains positive even when the term diag((σ^1_i)²) is omitted. Moreover, if we denote by k_j the matrix k(n^1_i, n^1_i) + diag((σ^1_i)²) with the jth row replaced by k(n, n^1_i), we have from Lemma 4.1 that Ψ_{n_i}(n) is increasing in any element (σ^1_i)²_j of (σ^1_i)², and its positivity follows. This completes the proof of the lemma.

Now suppose that some n ∈ {1, ..., N} occurs only finitely often in n^1. Then there must be some largest set S_1 of such n, with complement S_2 = {1, ..., N} \ S_1. Since every element of S_1 occurs in n^1 with finite multiplicity, there must be some j such that no n ∈ S_1 occurs amongst the values {(n^1)_{j+1}, (n^1)_{j+2}, ...}. But from Lemma 4.2, there will almost surely eventually be some J > j for which some value in {EI_{n_J}(n) : n ∈ S_1} exceeds all values in {EI_{n_J}(n) : n ∉ S_1}, and hence (n^1)_{J+1} ∈ S_1 (as long as τ is sufficiently small), contradicting the choice of j. So the event that an n ∈ {1, ..., N} occurs in n^1 with finite multiplicity has probability 0. This completes the proof.

One may then be sceptical of the benefit of duplicating design points for this method, and whilst it is possible to use this method without duplication (i.e. n = n^1), the consequence would be heavy reliance on a single sample to locate the minimum, which could be misleading. Averaging various samples at the same design point mitigates this potential problem, as does replacing d^- with μ^- = min_i {μ(n^1_i)}, as detailed in Brochu et al. (2010). Taking the median of samples instead of a weighted mean is more appropriate here, as we are not seeking to accurately approximate an expectation; instead we only wish to avoid extreme samples misleading our search for the minimum.

S2: Estimation of parameters for optimal holdout size in ASPRE

S2.2: Estimation of k_1 and k_2

We assume π = 10% ≈ 2707/25797, the proportion of individuals assigned to the treatment group due to having an estimated risk of PRE > 1%. To estimate k_1, we considered the study reported in O'Gorman et al. (2017) assessing the sensitivity and specificity of NICE and ACOG guidelines in assessing PRE risk. In this study, 8775 individuals were assessed, amongst whom 239 developed PRE, for an overall incidence of 239/8775 ≈ 0.027. We estimated the performance of a 'baseline' estimator of PRE risk (that is, in the absence of any ASPRE score) by linearly interpolating the points corresponding to 'ACOG aspirin', 'NICE' and 'ACOG' on the ROC curves in Figure 1.
On this basis, a baseline estimator identifying the 10% of individuals at highest PRE risk (approximately 800) would correspond to the point (x, y) on the interpolated ROC curve with

239x + (8775 - 239)y = 0.1 × 8775, (59)

which occurs at roughly a 20% detection (true positive) rate and a 10% false positive rate, close to that of the NICE guidelines. Since few women in the study were treated with aspirin, we assume that PRE rates in the highest-10% and lowest-90% risk groups assessed by baseline risk (NICE) are untreated risks (that is, as if not treated with aspirin). At the inferred true and false positive rates, we would expect, amongst the 10% of women designated highest-risk by the NICE guidelines, a PRE rate of π_1, and amongst the remainder a rate of π_0, with the values given in the main text. Given that true positive rates of the NICE guidelines are computed as a fraction with denominator 239, we presume standard errors of π_1 and π_0, from binomial sampling variation, of approximately

SE(π_1) ≈ 0.0076, SE(π_0) ≈ 0.0017.

Now, treating errors in π_0, π_1 and α as pairwise independent,

k_1 = π_0(1 - π) + π_1 π α ≈ 0.0235, SE(k_1) = SE(π_0(1 - π) + π_1 π α) ≈ 0.0016.

We estimate the population prevalence π_PRE of untreated PRE as the frequency observed in the original ASPRE data. Note that, although this is approximately equal to π_0, they are different quantities: π_0 is the population frequency of PRE amongst individuals at the lowest 90% of risk by NICE guidelines. Denoting by π_1(n) the untreated risk of PRE in the top 10% of individuals according to an ASPRE score trained on n individuals (and π_0(n) correspondingly), we note that it is equal to the sensitivity (or TPR) of the risk score at the level where a proportion π of individuals are designated high-risk. Thus for any training set size n,

π_0(n) = (π_PRE - π π_1(n)) / (1 - π), (65)

so the average cost to an individual in the intervention set may be expressed in terms of π_1(n):

k_2(n) = π_0(n)(1 - π) + π_1(n) π α = π_PRE - π π_1(n)(1 - α).

We implemented the complete ASPRE model as described in the original publication. We simulated a population of individuals with a similar distribution of ASPRE model covariates. We computed the ASPRE scores for our simulated individuals, and found a linear transformation of these scores such that, should the scores exactly specify the probability of PRE, the expected population prevalence and sensitivity of the score would match those reported: prevalence π_PRE, and sensitivity amongst the 10% highest scores of 12.3%. We then simulated PRE incidence according to these transformed scores. We found that a generalised linear model with a logistic link performed almost as well as the ASPRE score on our simulated data, so we used this model type to estimate the learning curve in the interests of simplicity. To choose values n and k_2/d, we initially chose a set n of 20 random values from [500, 30000]. For each size n in n, we took a random sample of our data of size n, fitted a logistic model to that sample, and estimated the corresponding expected cost per individual k_2 as above. We fitted values θ = θ(n, k_2) = (a, b, c) parametrising k_2 as the maximum-likelihood estimator of θ under the model

(k_2)_i ~ N(k_2((n)_i; θ), σ²) = N(a (n)_i^{-b} + c, σ²)

for a fixed value of σ, noting that the estimate of θ is independent of σ. For the parametric algorithm, we then set all values of σ to the same value, chosen empirically as the sample variance of k_2 - k_2(n, θ(n, k_2)). For the emulation algorithm, we set values d as d_i = k_1 (n)_i + (N - (n)_i)(k_2)_i, transforming values σ correspondingly for use in the emulation algorithm.
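The cost relations of this section can be collected in a short R sketch. The point values follow the text, but π_PRE is taken here as the implied untreated prevalence, which is an assumption of ours (the value from the original ASPRE data is not reproduced above), and rounding of the inputs means the printed k_1 differs slightly.

```r
## Sketch of the supplementary cost calculations (pi_tr avoids masking base pi)
pi_tr <- 0.10                      # proportion treated
pi0   <- 0.02; pi1 <- 0.08         # untreated PRE risk, low-/high-risk groups
alpha <- 1 - 0.63                  # residual relative risk under aspirin

k1    <- pi0 * (1 - pi_tr) + pi1 * pi_tr * alpha   # eq. (21); ~0.021 here
piPRE <- pi0 * (1 - pi_tr) + pi1 * pi_tr           # assumed untreated prevalence
k2_of <- function(pi1n) piPRE - pi_tr * pi1n * (1 - alpha)  # k_2(n) relation

c(k1 = k1, k2_baseline = k2_of(pi1))   # k_2 at the baseline pi_1 recovers k_1
```

As a sanity check, substituting the baseline π_1 into the k_2(n) relation returns k_1, as the algebra above requires.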
We implemented the complete ASPRE model as described in the original publication, and simulated a population of individuals with a similar distribution of ASPRE model covariates. We computed ASPRE scores for our simulated individuals, and found a linear transformation of these scores such that, should the scores exactly specify the probability of PRE, the expected population prevalence and sensitivity of the score would match those reported: prevalence $\pi_{PRE}$, and sensitivity amongst the 10% highest scores of 12.3%. We then simulated PRE incidence according to these transformed scores. We found that a generalised linear model with a logistic link performed almost as well as the ASPRE score on our simulated data, so in the interests of simplicity we used this model type to estimate the learning curve.

To choose values $n$ and $k_2$ (equivalently $d$), we initially chose a set $n$ of 20 random values from $[500, 30000]$. For each size $n_i$ in $n$, we took a random sample of our data of size $n_i$, fitted a logistic model to that sample, and estimated the corresponding expected cost per individual $k_2$ as above. We fitted values $\theta = \theta(n, k_2) = (a, b, c)$ parametrising $k_2$ as the maximum-likelihood estimator of $\theta$ under the model

$$(k_2)_i \sim N\left(k_2((n)_i; \theta), \sigma^2\right) = N\left(a (n)_i^{-b} + c, \sigma^2\right)$$

for a fixed value of $\sigma$, noting that the estimate of $\theta$ is independent of $\sigma$. For the parametric algorithm, we then set all values of $\sigma$ to the same value, chosen empirically as the sample variance of $k_2 - k_2(n; \theta(n, k_2))$. For the emulation algorithm, we set values $d$ from the corresponding estimated costs, transforming values $\sigma$ correspondingly for use in the emulation algorithm. We then sequentially chose 100 additional values of $n$ using both Algorithm 1 and Algorithm 2, setting $\sigma$ as above. After choosing the 120 values of $n$ using Algorithm 1, we re-estimated $k_2$ (equivalently $d$) for each of these values before estimating the OHS and confidence interval, in order to avoid any potential regression-to-the-mean effects from choosing next values of $n$ so as to minimise the estimated confidence interval width.
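A minimal sketch of the learning-curve fit just described: with $\sigma$ fixed, the maximum-likelihood estimate of $\theta = (a, b, c)$ under the model above coincides with the least-squares estimate, so base R's nls() suffices. The data and starting values below are synthetic, purely for illustration.

```r
# Sketch: fit the power-law learning curve k2(n) = a * n^(-b) + c.
# With sigma fixed, the MLE of theta = (a, b, c) under
# (k2)_i ~ N(a * n_i^(-b) + c, sigma^2) is the least-squares estimate.
set.seed(1)
n  <- sort(sample(500:30000, 20))                  # 20 random training sizes
k2 <- 3 * n^(-0.6) + 0.019 + rnorm(20, sd = 5e-4)  # synthetic costs

fit   <- nls(k2 ~ a * n^(-b) + c,
             start = list(a = 1, b = 0.5, c = 0.02))
theta <- coef(fit)                    # estimates of (a, b, c)

# Empirical value of sigma^2: sample variance of k2 - k2(n; theta)
sigma2_hat <- var(residuals(fit))
```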
References

Practice advisory on low-dose aspirin and prevention of preeclampsia: Updated recommendations
Comparison of EuroSCORE II, original EuroSCORE, and The Society of Thoracic Surgeons risk score in cardiac surgery patients
Competing risks model in early screening for preeclampsia by biophysical and biochemical markers
Autoprognosis: Automated clinical prognostic modeling via Bayesian optimization with structured kernel learning
A universal theorem on learning curves
Four types of learning curves
Parameter estimation for Gaussian process emulators
Does EuroSCORE II perform better than its original versions? A multicentre validation study
The impact of machine learning on patient care: a systematic review
Galaxy formation: a Bayesian uncertainty analysis
A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning
Convergence rates of efficient global optimization algorithms
Validation of EuroSCORE II in a modern cohort of patients undergoing cardiac surgery
Protocol for development of a reporting guideline (TRIPOD-AI) and risk of bias tool (PROBAST-AI) for diagnostic and prognostic prediction model studies based on artificial intelligence
Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): The TRIPOD statement
The rise of big clinical databases
Stochastic optimization with decision
Performance analysis of EuroSCORE II compared to the original logistic EuroSCORE and STS scores for predicting 30-day mortality after transcatheter aortic valve replacement
The clinician and dataset shift in artificial intelligence
Development and validation of QRISK3 risk prediction algorithms to estimate future risk of cardiovascular disease: prospective cohort study
How to learn when data gradually reacts to your model
What is operative mortality? Defining death in a surgical registry database: a report of the STS Congenital Database Taskforce and the joint EACTS-STS Congenital Database Committee. The Annals of Thoracic Surgery
Risk prediction models for hospital readmission: a systematic review
Thinking inside the box: using derivatives to improve Bayesian black box emulation of computer simulators with application to compartmental models
Evaluating multivariate risk scores for clinical decision making
Low-dose aspirin use for the prevention of morbidity and mortality from preeclampsia: US Preventive Services Task Force recommendation statement
Prognostic models will be victims of their own success, unless
State dependent performative prediction with stochastic approximation
Stacking interventions for equitable outcomes
Model updating after interventions paradoxically introduces bias
Bayesian algorithms for one-dimensional global optimization
Learning under concept drift: A review
Nonlinear modelling and control using Gaussian processes
Stochastic optimization for performative prediction
Multicenter screening for pre-eclampsia by maternal factors and biomarkers at 11-13 weeks' gestation: comparison with NICE guidelines and ACOG recommendations
The Society of Thoracic Surgeons 2018 adult cardiac surgery risk models: part 2, statistical methods and results. The Annals of Thoracic Surgery
Performative prediction
ASPRE trial: performance of screening for preterm pre-eclampsia
Aspirin versus placebo in pregnancies at high risk for preterm preeclampsia
On the convergence rates of expected improvement methods
The Society of Thoracic Surgeons 2018 adult cardiac surgery risk models: part 1, background, design considerations, and model development. The Annals of Thoracic Surgery
Explicit causal reasoning is needed to prevent prognostic models being victims of their own success
Determination of the optimal sample size for a clinical trial accounting for the population size
Interpolation of spatial data: Some theory for kriging
High-performance medicine: the convergence of human and artificial intelligence
The problem of concept drift: definitions and related work
Proposed regulatory framework for modifications to artificial intelligence/machine learning (AI/ML)-based software as a medical device (SaMD): discussion paper (US-FDA-Artificial-Intelligence-and-Machine-Learning-Discussion-Paper)
Convergence properties of the expected improvement algorithm with fixed mean and covariance functions
Bayesian uncertainty analysis for complex systems biology models: emulation, global parameter searches and evaluation of gene functions
The shape of learning curves: a review
Risk prediction models to predict emergency hospital admission in community-dwelling adults: a systematic review
Risk assessment and management of cardiovascular disease in New Zealand
A competing risks model in early screening for preeclampsia
Racial-ethnic differences in midtrimester maternal serum levels of angiogenic and antiangiogenic factors
Learning under concept drift: an overview

Acknowledgements

SH's contributions arose from an MSc dissertation undertaken on the MISCADA programme at Durham University. JL and LJMA were partially supported by Wave 1 of The UKRI Strategic Priorities Fund under the EPSRC Grant EP/T001569/1, particularly the "Health" theme within that grant and The Alan Turing Institute, and were partially supported by Health Data Research UK, an initiative funded by UK Research and Innovation, the Department of Health and Social Care (England), the devolved administrations, and leading medical research charities. SRE is funded by the EPSRC doctoral training partnership (DTP) at Durham University, grant reference EP/R513039/1. LJMA was partially supported by a Health Programme Fellowship at The Alan Turing Institute. JL is grateful to the Research Design Service North-East for part-funding their position.
We acknowledge the SPARRA project (Public Health Scotland/Alan Turing Institute) as an example of a risk score used on a population-wide scale, which had a large role in motivating this project. We thank Dr Catalina Vallejos (University of Edinburgh), Professor Sebastian Vollmer (German Research Centre for Artificial Intelligence) and Dr Bilal Mateen (King's College London Hospital, Wellcome Trust) for helpful discussion.

We have implemented all functionality described in this manuscript in an R package on CRAN (OptHoldoutSize) and at https://github.com/jamesliley/OptHoldoutSize. Pipelines to generate all results in this manuscript are available at https://github.com/jamesliley/OptHoldoutSize_pipelines.

Rather than explaining the variation of values in $d$ corresponding to a design point in $n^1$ as approximation error of a deterministic loss function, we can explain this variation as the result of not including active variables, namely the data $(X, Y)$. Note that, as a consequence, we are no longer emulating a deterministic function: we are not generalising the loss through expectations, but rather through omission of the data which generated $d$. To clarify this distinction we replace the loss function $\ell(n)$ with the stochastic function $\mathcal{E}(n)$.

We may now specify variation in $d$ using a 'nugget' term $w(n)$:

$$\mathcal{E}(n) = m(n) + u(n) + w(n),$$

where $m(n)$ and $u(n)$ are as before, but now $w(n)$ represents our nugget term, which we again specify as a Gaussian process, independent across distinct values of $n$, with variance $\kappa(n)$. Since there is less variance in risk scores fitted to larger datasets, we expect less variance in $\mathcal{E}(n)$ for larger $n$, so we specify $\kappa(n)$ as a monotonically decreasing function of $n$.

The joint distribution between $\mathcal{E}(n)$ and $d^1$ is now

$$\begin{pmatrix} \mathcal{E}(n) \\ d^1 \end{pmatrix} \sim \left( \begin{pmatrix} m(n) \\ m(n^1) \end{pmatrix}, \begin{pmatrix} k(n,n) + \kappa(n) & k(n, n^1) \\ k(n^1, n) & k(n^1, n^1) + \mathrm{diag}(\kappa(n^1)) \end{pmatrix} \right).$$

This then gives our Bayes linear update equations, in terms of $\pi_n = \pi(\mathcal{E}(n) \mid n, n^1, d^1)$, as

$$\mu(n) = E_{\pi_n}(\mathcal{E}(n)) = m(n) + k(n, n^1)\left[k(n^1, n^1) + \mathrm{diag}(\kappa(n^1))\right]^{-1}\left(d^1 - m(n^1)\right) \qquad (54)$$

$$\Psi(n) = \mathrm{var}_{\pi_n}(\mathcal{E}(n)) = k(n,n) + \kappa(n) - k(n, n^1)\left[k(n^1, n^1) + \mathrm{diag}(\kappa(n^1))\right]^{-1} k(n^1, n). \qquad (55)$$

Note that this differs only slightly from the emulator constructed in Section 4.3; the main difference is that we now attribute uncertainty in the loss values to inherent behaviour of the emulator rather than to the procedure used to obtain those loss values. As a result, $\kappa(n)$ does not decrease as the multiplicity of elements of $n^1$ increases, which represents a major disadvantage relative to the uncertainty representation of Section 4.3.
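A minimal sketch of the update in equations (54)-(55), assuming a squared-exponential kernel $k$ and a monotonically decreasing nugget $\kappa(n)$; the kernel, nugget and prior-mean choices below are illustrative, not the fitted values used in this paper.

```r
# Sketch of the Bayes linear update (54)-(55) with a nugget term.
k_se  <- function(x, y, v = 1e-4, l = 5000)        # squared-exponential kernel
  v * exp(-outer(x, y, "-")^2 / (2 * l^2))
kappa <- function(n, w = 1e-4) w / n               # decreasing nugget variance
m     <- function(n) rep(0.025, length(n))         # illustrative prior mean

bl_update <- function(n, n1, d1) {
  K11 <- k_se(n1, n1) + diag(kappa(n1), length(n1))
  K01 <- k_se(n, n1)
  mu  <- m(n) + K01 %*% solve(K11, d1 - m(n1))     # eq. (54)
  Psi <- k_se(n, n) + diag(kappa(n), length(n)) -
         K01 %*% solve(K11, t(K01))                # eq. (55)
  list(mu = drop(mu), Psi = Psi)
}

# Usage with hypothetical design points (note the duplicated size 2000;
# the nugget keeps the covariance matrix invertible despite duplication):
upd <- bl_update(n  = seq(1000, 20000, by = 1000),
                 n1 = c(2000, 2000, 8000, 15000),
                 d1 = c(0.027, 0.026, 0.024, 0.0235))
```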