key: cord-0131437-xxipth74 title: Optimizing Precision and Power by Machine Learning in Randomized Trials, with an Application to COVID-19 authors: Williams, Nicholas; Rosenblum, Michael; Díaz, Iván date: 2021-09-09 doc_id: 131437 cord_uid: xxipth74

The rapid finding of effective therapeutics requires the efficient use of available resources in clinical trials. The use of covariate adjustment can yield statistical estimates with improved precision, resulting in a reduction in the number of participants required to draw futility or efficacy conclusions. We focus on time-to-event and ordinal outcomes. A key question for covariate adjustment in randomized studies is how to fit a model relating the outcome and the baseline covariates to maximize precision. We present a novel theoretical result establishing conditions for asymptotic normality of a variety of covariate-adjusted estimators that rely on machine learning (e.g., ℓ1-regularization, Random Forests, XGBoost, and Multivariate Adaptive Regression Splines), under the assumption that outcome data are missing completely at random. We further present a consistent estimator of the asymptotic variance. Importantly, the conditions do not require the machine learning methods to converge to the true outcome distribution conditional on baseline variables, as long as they converge to some (possibly incorrect) limit. We conducted a simulation study to evaluate the performance of the aforementioned prediction methods in COVID-19 trials using longitudinal data from over 1,500 patients hospitalized with COVID-19 at Weill Cornell Medicine New York Presbyterian Hospital. We found that using ℓ1-regularization led to estimators and corresponding hypothesis tests that control type 1 error and are more precise than an unadjusted estimator across all sample sizes tested.
We also show that when covariates are not prognostic of the outcome, ℓ1-regularization remains as precise as the unadjusted estimator, even at small sample sizes (n = 100). We give an R package adjrct that performs model-robust covariate adjustment for ordinal and time-to-event outcomes.

Coronavirus disease 2019 has affected more than 125 million people and caused more than 2.7 million deaths worldwide (World Health Organization 2021). Governments and scientists around the globe have deployed an enormous amount of resources to combat the pandemic with remarkable success, such as the development in record time of highly effective vaccines to prevent disease (e.g., Polack et al. 2020; Baden et al. 2021). Global and local organizations are launching large-scale collaborations to collect robust scientific data to test potential COVID-19 treatments, including the testing of drugs re-purposed from other diseases as well as new compounds (Kupferschmidt and Cohen 2020). For example, the World Health Organization launched the SOLIDARITY trial, enrolling almost 12,000 patients in 500 hospital sites in over 30 countries (WHO Solidarity Trial Consortium 2021). Other large initiatives include the RECOVERY trial (The RECOVERY Collaborative Group 2021) and the ACTIV initiative (Collins and Stoffels 2020). To date, there are approximately 2,400 randomized trials for the treatment of COVID-19 registered at clinicaltrials.gov. The rapid finding of effective therapeutics for COVID-19 requires the efficient use of available resources. One area where such efficiency is achievable at little cost is the statistical design and analysis of the clinical trials. Specifically, a statistical technique known as covariate adjustment may yield estimates with increased precision (compared to unadjusted estimators), and may result in a reduction of the time, number of participants, and resources required to draw futility or efficacy conclusions.
This results in faster trial designs, which may help accelerate the delivery of effective treatments to patients who need them (and may help rule out ineffective treatments faster). Covariate adjustment refers to pre-planned analysis methods that use data on patient baseline characteristics to correct for chance imbalances across study arms, thereby yielding more precise treatment effect estimates. The ICH E9 Guidance on Statistical Methods for Analyzing Clinical Trials (FDA and EMA 1998) states that "Pretrial deliberations should identify those covariates and factors expected to have an important influence on the primary variable(s), and should consider how to account for these in the analysis to improve precision and to compensate for any lack of balance between treatment groups." Even though its benefits can be substantial, covariate adjustment is underutilized; only 24%-34% of trials use covariate adjustment (Kahan et al. 2014). We focus on estimation of marginal treatment effects, defined as a contrast between study arms in the marginal distribution of the outcome. Many approaches for estimation of marginal treatment effects using covariate adjustment in randomized trials invoke a model relating the outcome and the baseline covariates within strata of treatment. Recent decades have seen a surge in research on the development of model-robust methods for estimating marginal effects that remain consistent even if this outcome regression model is arbitrarily misspecified (e.g., Yang and Tsiatis 2001; Tsiatis et al. 2008; Zhang et al. 2008; Moore and van der Laan 2009a; Austin et al. 2010; Zhang and Gilbert 2010; Benkeser et al. 2020). We focus on a study of the model-robust covariate adjusted estimators for time-to-event and ordinal outcomes developed by Moore and van der Laan (2009a) and Díaz et al. (2016). All potential adjustment covariates must be pre-specified in the statistical analysis plan.
At the end of the trial, a prespecified prediction algorithm (e.g., random forests, or using regularization for variable selection) will be run and its output used to construct a model-robust, covariate adjusted estimator of the marginal treatment effect for the trial's primary efficacy analysis. We aim to address the question of how to do this in a model-robust way that guarantees consistency and asymptotic normality, under weaker regularity conditions than related work (described below). We also aim to demonstrate the potential value added by covariate adjustment combined with machine learning, through a simulation study based on COVID-19 data. As a standard regression method for high-dimensional data, ℓ1-regularization has been studied by several authors in the context of covariate selection for randomized studies. For example, Wager et al. (2016) present estimators that are asymptotically normal under strong assumptions that include linearity of the outcome-covariate relationship. Bloniarz et al. (2016) present estimators under a randomization inference framework, and show asymptotic normality of the estimators under assumptions similar to the assumptions made in this paper. Both of these papers present results only for continuous outcomes. The method of Tian et al. (2012) is general and can be applied to continuous, ordinal, binary, and time-to-event data, and its asymptotic properties are similar to the properties of the methods we discuss for the case of ℓ1-regularization, under similar assumptions. More related to our general approach, Wager et al. (2016) also present a cross-validation procedure that can be used with arbitrary non-parametric prediction methods (e.g., ℓ1-regularization, random forests, etc.) in the estimation procedure. Their proposal amounts to computation of a cross-fitted augmented inverse probability weighted estimator (Chernozhukov et al. 2018).
Their asymptotic normality results, unlike ours, require that their predictor of the outcome given baseline variables converges to the true regression function. Wu and Gagnon-Bartsch (2018) proposed a "leave-one-out-potential outcomes" estimator where automatic prediction can also be performed using any regression procedure, such as linear regression or random forests, and they propose a conservative variance estimator. It is as yet unclear whether Wald-type confidence intervals based on the normal distribution are appropriate for this estimator. As in the above related work that compares the precision of covariate adjusted estimators to the unadjusted estimator, we assume that outcomes are missing completely at random (since otherwise the unadjusted estimator is generally inconsistent). In Section 3.3, we present our main theorem. It shows that any of a large class of prediction algorithms (e.g., ℓ1-regularization, Random Forests, XGBoost, and Multivariate Adaptive Regression Splines) can be combined with the covariate adjusted estimator of Moore and van der Laan (2009b) to produce a consistent, asymptotically normal estimator of the marginal treatment effect, under regularity conditions. These conditions do not require consistent estimation of the outcome regression function (as in key related work described above); instead, our theorem requires the weaker condition of convergence to some (possibly incorrect) limit. We also give a consistent, easy to compute variance estimator. This has important practical implications because it allows the use of machine learning coupled with Wald-type confidence intervals and hypothesis tests, under the conditions of the theorem. The above estimator can be used with ordinal or time-to-event outcomes. We next conduct a simulation study to evaluate the performance of the aforementioned machine learning algorithms for covariate adjustment in the context of COVID-19 trials.
We simulate two-arm trials comparing a hypothetical COVID-19 treatment to standard of care. The simulated data distributions are generated from longitudinal data on approximately 1,500 patients hospitalized at Weill Cornell Medicine New York Presbyterian Hospital prior to 15 May 2020. We present results for two types of endpoints: time-to-event (e.g., time to intubation or death) and ordinal (e.g., WHO scale, see Marshall et al. 2020) outcomes. For survival outcomes, we present results for two different estimands (i.e., targets of inference): the survival probability at any given time and the restricted mean survival time. For ordinal outcomes we present results for the average log-odds ratio, and for the Mann-Whitney estimand, interpreted as the probability that a randomly chosen treated patient has a better outcome than a randomly chosen control patient (with ties broken at random). Benkeser et al. (2020) used simulations based on the above data source to illustrate the efficiency gains achievable by covariate adjustment with parametric models including a small number of adjustment variables (and not using machine learning to improve efficiency). In this paper we evaluate the performance of four machine learning algorithms (ℓ1-regularization, Random Forests, XGBoost, and Multivariate Adaptive Regression Splines) at several sample sizes, and compare them in terms of their bias, mean squared error, and type-1 and type-2 errors, to unadjusted estimators and to fully adjusted main terms logistic regression with all available variables included. Furthermore, we introduce a new R package adjrct (Díaz and Williams 2021) that can be used to perform model-robust covariate adjustment for ordinal and time-to-event outcomes, and provide R code that can be used to replicate our simulation analyses with other data sources. In what follows, we focus on estimating intention-to-treat effects and refer to study arm assignment simply as treatment.
We focus on estimation of marginal treatment effects, defined as a contrast between study arms in the marginal distribution of the outcome. We further assume that we have data on n trial participants, represented by n independent and identically distributed copies of data O_i : i = 1, . . . , n. We assume O_i is distributed as P, where we make no assumptions about the functional form of P except that treatment is independent of baseline covariates (by randomization). We denote a generic draw from the distribution P by O. We use the terms "baseline covariate" and "baseline variable" interchangeably to indicate a measurement made before randomization. We are interested in making inferences about a feature of the distribution P. We use the word estimand to refer to such a feature. We describe example estimands, which include those used in our simulation studies, below. For ordinal outcomes, assume the observed data is O = (W, A, Y), where W is a vector of baseline covariates, A is the treatment arm, and Y is an ordinal variable that can take values in {1, . . . , K}. Let F(k, a) = P(Y ≤ k | A = a) denote the cumulative distribution function for patients in arm A = a, and let f(k, a) = F(k, a) − F(k − 1, a) denote the corresponding probability mass function. For notational convenience we will sometimes use the "survival" function instead: S(k, a) = 1 − F(k, a). The average log-odds ratio is then equal to LOR = (K − 1)^{-1} Σ_{k=1}^{K−1} {logit F(k, 1) − logit F(k, 0)}, and the Mann-Whitney estimand is equal to MW = Σ_{k=1}^{K} f(k, 1){F(k − 1, 0) + f(k, 0)/2}. The Mann-Whitney estimand can be interpreted as the probability that a randomly drawn patient from the treated arm has a better outcome than a randomly drawn patient from the control arm, with ties broken at random (Ahmad 1996). The average log-odds ratio is more difficult to interpret and we discourage its use, but we include it in our comparisons because it is a non-parametric extension of the parameter β estimated by the commonly used proportional odds model logit{F(k, a)} = α_k + βa (Díaz et al. 2016).
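Both ordinal estimands are simple functionals of the two arm-specific probability mass functions. As an illustration, the following sketch computes them under the standard definitions assumed here (average of the K − 1 cumulative log-odds ratios, and Σ_k f(k, 1){F(k − 1, 0) + f(k, 0)/2}, with higher levels of the outcome taken as "better"); this is not the adjrct implementation:

```python
import numpy as np

def ordinal_estimands(f1, f0):
    """Average log-odds ratio and Mann-Whitney estimand from the
    arm-specific probability mass functions f(k, a) over levels 1..K.

    Assumes the standard definitions:
      LOR = (1/(K-1)) * sum_{k=1}^{K-1} [logit F(k,1) - logit F(k,0)]
      MW  = sum_{k=1}^{K} f(k,1) * (F(k-1,0) + f(k,0)/2)
    """
    f1, f0 = np.asarray(f1, float), np.asarray(f0, float)
    F1, F0 = np.cumsum(f1), np.cumsum(f0)        # CDFs F(k, a)
    logit = lambda p: np.log(p / (1 - p))
    # Average over k = 1..K-1 only: F(K, a) = 1, so its logit is undefined.
    lor = np.mean(logit(F1[:-1]) - logit(F0[:-1]))
    # P(treated outcome better than control), ties broken at random.
    F0_prev = np.concatenate([[0.0], F0[:-1]])   # F(k-1, 0), with F(0, 0) = 0
    mw = np.sum(f1 * (F0_prev + f0 / 2))
    return lor, mw
```

When the two arms have identical outcome distributions, the Mann-Whitney estimand equals 1/2 and the average log-odds ratio equals 0, which is a useful sanity check.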
For time-to-event outcomes, we assume the observed data is O = (W, A, Ỹ = min(Y, C), ∆ = 1{Y ≤ C}), where C is a right-censoring time denoting the time that a patient is last seen, and 1{E} is the indicator variable taking the value 1 on the event E and 0 otherwise. We further assume that events are observed at discrete time points {1, . . . , K} (e.g., days), as is typical in clinical trials. The difference in restricted mean survival time is given by RMST = Σ_{k=1}^{K} {S(k − 1, 1) − S(k − 1, 0)}, where S(0, a) = 1, and can be interpreted as a contrast comparing the expected survival time within the first K time units for the treated arm minus the control arm (Chen and Tsiatis 2001; Royston and Parmar 2011). The risk difference at a user-given time point k is defined as RD(k) = S(k, 1) − S(k, 0), and is interpreted as the difference in survival probability for a patient in the treated arm minus the control arm. We note that the MW and RD parameters may be meaningful for both ordinal and time-to-event outcomes. For the sake of generality, in what follows we use a common data structure O = (W, A, ∆ = 1{Y ≤ C}, Ỹ) for both ordinal and survival outcomes, where for ordinal outcomes C = K if the outcome is observed and C = 0 if it is missing. Many approaches for estimation of marginal treatment effects using covariate adjustment in randomized trials invoke a model relating the outcome and the baseline covariates within strata of treatment. It is important that the consistency and interpretability of the treatment effect estimates do not rely on the ability to correctly posit such a model. Specifically, in a recent draft guidance (U.S. Food and Drug Administration 2021), the FDA states: "Sponsors can perform covariate adjusted estimation and inference for an unconditional treatment effect ... in the primary analysis of data from a randomized trial. The method used should provide valid inference under approximately the same minimal statistical assumptions that would be needed for unadjusted estimation in a randomized trial."
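Both survival estimands are determined by the arm-specific survival curves. A minimal sketch, assuming the discrete-time convention E[min(Y, K) | A = a] = Σ_{k=1}^{K} P(Y ≥ k | A = a) with P(Y ≥ k | A = a) = S(k − 1, a) and S(0, a) = 1:

```python
import numpy as np

def rmst_difference(S1, S0):
    """Difference in restricted mean survival time over times 1..K.

    S1[k-1] and S0[k-1] hold S(k, a) = P(Y > k | A = a) for k = 1..K.
    Uses E[min(Y, K) | A = a] = sum_{k=1}^K P(Y >= k | A = a), where
    P(Y >= k | A = a) = S(k - 1, a) and S(0, a) = 1.
    """
    S1_prev = np.concatenate([[1.0], np.asarray(S1, float)[:-1]])  # S(k-1, 1)
    S0_prev = np.concatenate([[1.0], np.asarray(S0, float)[:-1]])  # S(k-1, 0)
    return np.sum(S1_prev) - np.sum(S0_prev)

def risk_difference(S1, S0, k):
    """Risk difference at time k: S(k, 1) - S(k, 0)."""
    return S1[k - 1] - S0[k - 1]
```

For example, with K = 2, survival curves S(·, 1) = (0.9, 0.8) and S(·, 0) = (0.8, 0.6) give an RMST difference of (1 + 0.9) − (1 + 0.8) = 0.1 and a risk difference at k = 2 of 0.2.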
The assumption of a correctly specified model is not typically part of the assumptions needed for an unadjusted analysis, and should therefore be avoided when possible. All estimands described in this paper can be computed from the cumulative distribution functions (CDF) F(·, a) for a ∈ {0, 1}, which can be estimated using the empirical cumulative distribution function (ECDF) or the Kaplan-Meier estimator. Model-robust, covariate adjusted estimators have been developed for the CDF, including, e.g., Chen and Tsiatis (2001). We focus on the model-robust, covariate adjusted estimators of Moore and van der Laan (2009b) and Díaz et al. (2016). These estimators have at least two advantages compared to unadjusted estimators based on the ECDF or the Kaplan-Meier estimator. First, with time-to-event outcomes, the adjusted estimator can achieve consistency under an assumption of censoring being independent of the outcome given study arm and baseline covariates (C ⊥⊥ Y | A, W), rather than the assumption of censoring in each arm being independent of the outcome marginally (C ⊥⊥ Y | A) required by unadjusted estimators. The former assumption is arguably more likely to hold in typical situations where patients are lost to follow-up due to reasons correlated with their baseline variables. Second, in large samples and under regularity conditions, the adjusted estimators of Díaz et al. (2016) can be at least as precise as the unadjusted estimator (this requires that missingness/censoring is completely at random, i.e., that in each arm a ∈ {0, 1}, C ⊥⊥ (Y, W) | A = a), under additional assumptions. Additionally, under regularity conditions, the aforementioned adjusted estimators are asymptotically normal. This allows the construction of Wald-type confidence intervals and corresponding tests of the null hypothesis of no treatment effect.
While we make no assumption on the functional form of the distribution P (except that treatment is independent of baseline variables by randomization), implementation of our estimators requires a working model for the conditional probability m(k, a, W) = P(Ỹ = k, ∆ = 1 | Ỹ ≥ k, A = a, W). In time-to-event analysis, this probability is known as the conditional hazard. The expression working model here means that we do not assume that the model represents the true relationship between the outcome and the treatment/covariates. Fitting a working model for m is equivalent to training a prediction model for m (specifically, a prediction model for the probability of Ỹ = k, ∆ = 1 given Ỹ ≥ k, A = a, W), and we sometimes refer to the model fit as a predictor. In our simulation studies, we will use the following working models, fitted in a dataset where each participant contributes a row of data corresponding to each time k = 1 through k = Ỹ: • The following pooled main terms logistic regression (LR), logit{m_β(k, a, W)} = β_{a,0,k} + β_{a,1}^⊤ W, estimated with maximum likelihood estimation. Note that this model has (i) separate parameters for each study arm, and (ii) in each arm, intercepts for each possible outcome level k. • The above model fitted with an ℓ1 penalty on the parameter β_{a,1} (ℓ1-LR, Tibshirani 1996; Park and Hastie 2007). • A random forest classification model (RF, Breiman 2001). • An extreme gradient boosting tree ensemble (XGBoost, Friedman 2001). • Multivariate adaptive regression splines (MARS, Friedman 1991). For RF, XGBoost, and MARS, the algorithms are trained in the whole sample {1, . . . , n}. For these algorithms, we also assessed the performance of cross-fitted versions of the estimators. Cross-fitting is sometimes necessary to guarantee that the regularity assumptions required for asymptotic normality of the estimators hold when using data-adaptive regression methods (Klaassen 1987; Zheng and van der Laan 2011; Chernozhukov et al. 2018), and is performed as follows. Let V_1, . . . , V_J denote a random partition of the index set {1, . . . , n} into J prediction sets of approximately the same size. That is, V_j ⊂ {1, . . . , n}; ∪_{j=1}^{J} V_j = {1, . . . , n}; and V_j ∩ V_{j′} = ∅ for j ≠ j′. In addition, for each j, the associated training sample is given by T_j = {1, . . . , n} \ V_j. Let m̂_j denote the prediction algorithm trained in T_j. Letting j(i) denote the index of the prediction set which contains observation i, cross-fitting entails using only observations in T_{j(i)} for fitting models when making predictions about observation i. That is, the outcome predictions for each subject i are given by m̂_{j(i)}(u, a, W_i). We let η̂_{j(i)} = (m̂_{j(i)}, π̂_A, π̂_C) for cross-fitted estimators and η̂_{j(i)} = (m̂, π̂_A, π̂_C) for non-cross-fitted ones. RF, XGBoost, and MARS were fit using the ranger (Wright and Ziegler 2017), xgboost (Chen et al. 2021), and earth (Milborrow 2020) R packages, respectively. Hyperparameter tuning was performed using cross-validation with the origami (Coyle and Hejazi 2020) R package. Our simulation studies use a TMLE procedure that we will refer to as the TMLE with improved efficiency, or IE-TMLE. We will first present the TMLE of Moore and van der Laan (2009b), which constitutes the basis for the construction of the IE-TMLE. In the supplementary materials we present some of the efficiency theory underlying the construction of the TMLE. Briefly, TMLE is a framework to construct estimators η̂_{j(i)} that solve the efficient influence function estimating equation n^{-1} Σ_{i=1}^{n} D_{k,a}(η̂_{j(i)})(O_i) = 0, where D_{k,a} is the efficient influence function for S(k, a) in the non-parametric model that only assumes treatment A is independent of baseline variables W (which holds by design), defined in the supplementary materials. TMLE enjoys desirable properties such as local efficiency, outcome model robustness under censoring completely at random, and asymptotic normality, under regularity assumptions.
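The cross-fitting scheme described above can be sketched generically as follows; `fit` is a hypothetical placeholder standing in for any of the prediction algorithms, and this is not the origami implementation:

```python
import numpy as np

def cross_fit_predictions(X, y, fit, J=5, seed=1):
    """Cross-fitted predictions: observation i is predicted by the model
    trained on T_j(i) = {1,...,n} minus V_j(i), where V_1,...,V_J is a
    random partition of the indices into prediction sets of roughly
    equal size.

    `fit(X_train, y_train)` must return a function mapping X to
    predictions (a stand-in for training any prediction algorithm).
    """
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = rng.permuted(np.arange(n) % J)  # random partition V_1..V_J
    preds = np.empty(n, float)
    for j in range(J):
        V_j = folds == j                    # prediction set V_j
        T_j = ~V_j                          # training sample T_j
        m_j = fit(X[T_j], y[T_j])           # model trained without V_j
        preds[V_j] = m_j(X[V_j])            # out-of-fold predictions
    return preds
```

For instance, passing a `fit` that returns the training-set mean as its prediction reproduces a cross-fitted intercept-only model.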
TMLE estimator definition: Given a predictor m̂ constructed as in the previous subsection and any k, a, the corresponding TMLE estimation procedure for F(k, a) can be summarized in the following steps: 1. Create a long-form dataset where each participant i contributes one row of data corresponding to each time u = 0 through k, where 1{X} is the indicator variable taking value 1 if X is true and 0 otherwise. 3. Fit a model π̂_A(a, W) for the probability P(A = a | W). Note that, in randomized trials, this model may be correctly specified by a logistic regression logit π_A(1, W) = α_0 + α_1^⊤ W. Let π̂_A(a, W_i) denote the prediction of the model for individual i. 4. Fit a model π̂_C(u, a, W) for the censoring probabilities P(Ỹ = u, ∆ = 0 | Ỹ ≥ u, A = a, W). For time-to-event outcomes, this is a model for the censoring probabilities. For ordinal outcomes, the only possibilities are that C = 0 (outcome missing) or C = K (outcome observed); in this case we only fit the aforementioned model at u = 0 and we set π̂_C(u, a, W) = 0 for each u > 0. For either outcome type, if there is no censoring (i.e., if P(∆ = 1) = 1), then we set π̂_C(u, a, W) = 0 for all u. Let π̂_C(u, a, W_i) denote the prediction of this model for individual i, i.e., using the baseline variable values from individual i. 5. For each individual i and each u ≤ k, compute a "clever" covariate H_{Y,k,u} as a function of m̂, π̂_A, and π̂_C as detailed in the supplementary materials. The outcome model fit m̂ is then updated by fitting the following logistic regression "tilting" model with single parameter ε and offset based on m̂: logit m̂_ε(u, a, W) = logit m̂(u, a, W) + ε H_{Y,k,u}. This can be done using standard statistical software for fitting a logistic regression of the indicator variable 1{Ỹ = u, ∆ = 1} on the variable H_{Y,k,u} using offset logit m̂(u, a, W) among observations with Ỹ ≥ u and A = a in the long-form dataset from step 1.
The above model fitting process is iterated, where at the beginning of each iteration we replace m̂ in the above display and in the definition of H_{Y,k,u} by the updated model fit. We denote the maximum number of iterations that we allow by i_max. 6. Let m̃(u, a, W_i) denote the estimate of m(u, a, W_i) for individual i at the final iteration of the previous step. Note that this estimator is specific to the value k under consideration. 7. Compute the estimate of S(k, a) = 1 − F(k, a) as the standardized estimator S̃_TMLE(k, a) = n^{-1} Σ_{i=1}^{n} Π_{u=1}^{k} {1 − m̃(u, a, W_i)}, and let the estimator of F(k, a) be 1 − S̃_TMLE(k, a). This estimator was originally proposed by Moore and van der Laan (2009b). The role of the clever covariate H_{Y,k,u} is to endow the resulting estimator S̃_TMLE(k, a) with properties such as model-robustness compared to unadjusted estimators. In particular, it can be shown that this estimator is efficient when the working model for m is correctly specified. The specific form of the covariate H_{Y,k,u} is given in the supplementary materials. Throughout, the notation m̂ is used to represent the predictor constructed as in Section 3.1, which is an input to the above TMLE algorithm, while m̃ denotes the updated version of this predictor that is output by the above TMLE algorithm at step 6. IE-TMLE estimator definition: In Section 4 we will compare several machine learning procedures for estimating m in finite samples. The estimators used in the simulation study are the IE-TMLE, where in addition to updating the initial estimator for the outcome regression m̂, we also update the estimators of the treatment and censoring mechanisms. Specifically, we replace step 5 of the above procedure with the following: 5. For each individual i construct "clever" covariates H_{Y,k,u}, H_A, and H_{C,k,u} (defined in the supplementary materials) as a function of m̂, π̂_A, and π̂_C. For each k = 1, . . .
, K, the model fits are then iteratively updated using logistic regression "tilting" models analogous to the one in step 5 of the TMLE algorithm, where the iteration is necessary because H_{Y,k,u}, H_A, and H_{C,k,u} are functions of m̂, π̂_A, and π̂_C that must be updated at each step. As before, for ordinal outcomes we only fit the aforementioned model at u = 0 and we set π̂_C(u, a, W) = 0 for each u > 0. We use S̃_IE-TMLE to denote this estimator. The updating step above combines ideas from Moore and van der Laan (2009b), Gruber and van der Laan (2012), and Rotnitzky et al. (2012) to produce an estimator with the following properties: (i) consistency, with precision at least as great as that of the Kaplan-Meier and inverse probability weighted estimators; (ii) consistency under violations of independent censoring (unlike the Kaplan-Meier estimator) when either the censoring or survival distributions, conditional on covariates, are estimated consistently and censoring is such that C ⊥⊥ Y | W, A; and (iii) nonparametric efficiency when both of these distributions are consistently estimated at n^{1/4}-rate. Please see the works cited above for more details on these estimators, which are implemented in the R package adjrct (Díaz and Williams 2021). Next, we present a result (Theorem 1) stating asymptotic normality of S̃_TMLE using machine learning for prediction that avoids some limitations of existing methods, and present a consistent estimator of its variance. In Section 4 we present simulation results where we evaluate the performance of S̃_IE-TMLE for covariate adjustment in COVID-19 trials for hospitalized patients. We favor S̃_IE-TMLE in our numerical studies because, unlike S̃_TMLE, it satisfies property (i) above. The simulation uses Wald-type hypothesis tests based on the asymptotic approximation of Theorem 1, where we note that the variance estimator in the theorem is consistent for S̃_TMLE but conservative for S̃_IE-TMLE (Moore and van der Laan 2009b).
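Two of the computational steps above are easy to make concrete: the one-parameter offset-logistic tilting fit (step 5), which is just a logistic regression in ε with a fixed offset, and the standardization over covariates (step 7). A self-contained numerical sketch of both, not the authors' adjrct implementation:

```python
import numpy as np

def tilt(y, offset_logit, H, n_iter=50):
    """One-parameter logistic 'tilting' update: fit
    logit m_eps = offset_logit + eps * H by maximum likelihood
    (Newton-Raphson in eps) and return the updated probabilities.

    y            : binary indicators (e.g., 1{Y = u, Delta = 1})
    offset_logit : logit of the current fit, held fixed as an offset
    H            : the 'clever' covariate
    """
    y, off, H = (np.asarray(v, float) for v in (y, offset_logit, H))
    eps = 0.0
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-(off + eps * H)))  # current fitted probabilities
        score = np.sum(H * (y - p))             # d loglik / d eps
        info = np.sum(H ** 2 * p * (1 - p))     # observed information
        if info == 0:
            break
        step = score / info                     # Newton step
        eps += step
        if abs(step) < 1e-10:
            break
    return 1 / (1 + np.exp(-(off + eps * H))), eps

def standardized_survival(hazards):
    """Step-7-style standardization: average over subjects of the
    product-limit survival implied by an updated hazard fit, where
    hazards[i, u-1] holds the prediction for subject i at time u = 1..k."""
    hazards = np.asarray(hazards, float)
    return np.prod(1 - hazards, axis=1).mean()
```

With H identically 1 and a zero offset, the tilting fit reduces to an intercept-only logistic regression, so the updated probabilities equal the sample mean of y; this is a convenient correctness check.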
Most available methods to construct confidence intervals and hypothesis tests in the statistics literature are based on the sampling distribution of the estimator. While using the exact finite-sample distribution would be ideal for this task, such distributions are notoriously difficult to derive for our problem in the absence of strong and unrealistic assumptions (such as linear models with Gaussian noise). Thus, here we focus on methods that rely on approximating the finite-sample distribution using asymptotic results as n goes to infinity. In order to discuss existing methods, it will be useful to introduce and compare the following assumptions: A1. Censoring is completely at random, i.e., C ⊥⊥ (Y, W) | A = a for each treatment arm a. A2. We abbreviate m(k, a, W) and m̂(k, a, W) by m and m̂, respectively. Assume the estimator m̂ is consistent in the sense that ||m̂ − m|| = o_P(1) for all k ∈ {1, . . . , K} and a ∈ {0, 1}, where || · || denotes the L2(P) norm. We also assume that there exists a δ > 0 such that δ < m̂ < 1 − δ with probability 1. A3. Assume the estimator m̂ converges to a possibly misspecified limit m_1 in the sense that ||m̂ − m_1|| = o_P(1) for all k ∈ {1, . . . , K} and a ∈ {0, 1}, where we emphasize that m_1 can be different from the true regression function m. We also assume that there exists a δ > 0 such that δ < m_1 < 1 − δ with probability 1. For estimators m̂ of m that use cross-fitting, the function m̂ consists of J maps (one for each training set) from the sample space of O to the interval [0, 1]. In this case, by convention we define ||m̂ − m|| in A2 as the average across the J maps of the L2(P) norm of each such map minus m. Convergence of ||m̂ − m|| to 0 in probability is then equivalent to the same convergence where m̂ is replaced by the corresponding map before cross-fitting is applied. The same convention is used in A3. There are at least two results on asymptotic normality for S̃_TMLE relevant to the problem we are studying.
The first result is a general theorem for TMLE (see Appendix A.1 of van der Laan and Rose 2011), stating that the estimator is asymptotically normal and efficient under regularity assumptions which include A2. Among other important implications, this asymptotic normality implies that the variance of the estimators can be consistently estimated by the empirical variance of the efficient influence function. This means that asymptotically correct confidence intervals and hypothesis tests can be constructed using a Wald-type procedure. As stated above, it is often undesirable to assume A2 in the setting of a randomized trial, as it is a much stronger assumption than what would be required for an unadjusted estimator. The second result of relevance to this paper establishes asymptotic normality of S̃(k, a) under assumptions that include A3 (Moore and van der Laan 2009a). The asymptotic variance derived by these authors depends on the true outcome regression function m, and is thus difficult to estimate. As a solution, the authors propose to use a conservative estimate of the variance whose computation does not rely on the true regression function m. While this conservative method yields correct type 1 error control, its use is not guaranteed to fully convert precision gains from covariate adjustment into power gains. We note that the above asymptotic normality results from related works rely on the additional condition that the estimator m̂ lies in a Donsker class. This assumption may be violated by some of the data-adaptive regression techniques that we consider. Furthermore, we note that resampling methods such as the bootstrap cannot be safely used for variance estimation in this setting. Their correctness is currently unknown when the working model for m is based on data-adaptive regression procedures such as those described in Section 3.1 and used in our simulation studies.
In what follows, we build on recent literature on estimation of causal effects using machine learning to improve upon the aforementioned asymptotic normality results on two fronts. First, we introduce cross-fitting (Klaassen 1987; Zheng and van der Laan 2011; Chernozhukov et al. 2018) to avoid the Donsker condition. Second, and most importantly, we present a novel asymptotic normality result that avoids the above limitations of existing methods regarding strong assumptions (specifically A2) and conservative variance estimators (that may sacrifice power). The following are a set of assumptions about how components of the TMLE are implemented, which we will use in our theorem below: A5. For time-to-event outcomes, the initial estimator Π̂_C(a, u) is set to be the Kaplan-Meier estimator, estimated separately within each treatment arm a. For ordinal outcomes, Π̂_C(a, 0) is the proportion of missing outcomes in treatment arm a and Π̂_C(a, u) = 0 for u > 0. A6. The initial estimator m̂(u, a, W) is constructed using one of the following: 1. Any estimator in a parametric working model (i.e., a model that can be indexed by a Euclidean parameter), such as maximum likelihood, ℓ1-regularization, etc. 2. Any data-adaptive regression method (e.g., random forests, MARS, XGBoost, etc.) estimated using cross-fitting as described above. A7. The regularity conditions in Theorem 5.7 of van der Vaart (1998, p. 45) hold for the maximum likelihood estimator corresponding to each logistic regression model fit in step (5) of the TMLE algorithm. Theorem 1. Assume A1 and A3-A7 above. Define the variance estimator σ̂² = n^{-1} Σ_{i=1}^{n} D_{k,a}(η̂_{j(i)})(O_i)², where D_{k,a} is the efficient influence function defined in the supplementary materials. Then we have for all k ∈ {1, . . . , K} and a ∈ {0, 1} that σ̂^{-1} √n {S̃_TMLE(k, a) − S(k, a)} converges in distribution to a standard normal random variable. Theorem 1 is a novel result establishing the asymptotic correctness of Wald-type confidence intervals and hypothesis tests for the covariate-adjusted estimator S̃_TMLE(k, a) based on machine learning regression procedures constructed as stated in A6.
For example, the confidence interval S̃_TMLE(k, a) ± 1.96 × σ̂/√n has approximately 95% coverage at large sample sizes, under the assumptions of the theorem. The theorem licenses the large-sample use of any regression procedure for m when combined with the TMLE of Section 3.2, as long as the regression procedure is either (i) based on a parametric model (such as ℓ1-regularization) or (ii) based on cross-fitted data-adaptive regression, and the assumptions of the theorem hold. The theorem states sufficient assumptions under which Wald-type tests from such a procedure will be asymptotically correct. Assumption A3 states that the predictions given by the regression method used to construct the adjusted estimator converge to some arbitrary function (i.e., not assumed to be equal to the true regression function). This assumption is akin to Condition 3 assumed by Bloniarz et al. (2016) in the context of establishing asymptotic normality of a covariate-adjusted estimator based on ℓ1-regularization. We note that this is an assumption on the predictions themselves and not on the functional form of the predictors. Therefore, issues like collinearity do not necessarily cause problems. While this assumption can hold for many off-the-shelf machine learning regression methods under assumptions on the data-generating mechanism, general conditions have not been established and the assumption must be checked on a case-by-case basis. We note that assumption A1 is stronger than the assumption C ⊥⊥ Y | A = a required by unadjusted estimators such as the Kaplan-Meier estimator. However, if W is prognostic (meaning that W is not independent of Y given A = a), then the assumption C ⊥⊥ Y | A = a required by the Kaplan-Meier estimator cannot generally be guaranteed to hold, unless A1 also holds.
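To make the Wald-type construction concrete, the interval above can be computed from a point estimate and the empirical variance of the estimated influence-function values. The following is an illustrative Python sketch (the paper's software is the R package adjrct; the function name and synthetic values here are hypothetical):

```python
import math

def wald_ci(estimate, eif_values, z=1.96):
    """Wald-type confidence interval: the standard error is the square root of
    the empirical variance of the estimated efficient influence function,
    divided by the sample size."""
    n = len(eif_values)
    mean = sum(eif_values) / n
    var = sum((v - mean) ** 2 for v in eif_values) / n
    se = math.sqrt(var / n)
    return estimate - z * se, estimate + z * se

# Toy example with synthetic influence-function values
vals = [0.1, -0.2, 0.05, 0.0, -0.05, 0.15, -0.1, 0.05]
lo, hi = wald_ci(0.8, vals)
```

A two-sided p-value for the null of no effect follows from the same standardized statistic.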
Thus, our theorem aligns with the recent FDA draft guidance on covariate adjustment in the sense that "it provides valid inference under approximately the same minimal statistical assumptions that would be needed for unadjusted estimation in a randomized trial" (U.S. Food and Drug Administration 2021). The construction of estimators based on A5 should be avoided if A1 does not hold. Confidence that A1 holds is typically warranted in trials where the only form of right censoring is administrative. When applied to ordinal outcomes, A1 is trivially satisfied if there is no missing outcome data. Consider the case where censoring is informative such that A1 does not hold, but censoring at random holds (i.e., C ⊥⊥ Y | W, A). Then consistency of the estimators S̃_TMLE and S̃_IE−TMLE will typically require that at least one of two assumptions hold: (a) that the censoring probabilities π_C(u, a, w) are estimated consistently, or (b) that the outcome regression m(u, a, w) is estimated consistently. To maximize the chances of either of these conditions being true, we recommend the use of flexible machine learning for both of these regressions, including model selection and ensembling techniques such as the Super Learner (van der Laan et al. 2007). The conditions for asymptotic normality of S̃_TMLE and S̃_IE−TMLE under these circumstances are much stronger than those for Theorem 1, and typically include consistent estimation of both π_C(u, a, w) and m(u, a, w) at certain rates (e.g., each of them converging at n^{1/4}-rate is sufficient; see Appendix A.1 of van der Laan and Rose 2011).

Our data-generating distribution is based on a database of over 1,500 patients hospitalized at Weill Cornell Medicine New York Presbyterian Hospital prior to 15 May 2020. The database includes information on patients 18 years of age and older with COVID-19 confirmed through reverse-transcriptase-polymerase-chain-reaction assays.
For a full description of the clinical characteristics and data collection methods of the initial cohort sampling, see Goyal et al. (2020). We evaluate the potential to improve efficiency by adjustment for subsets of the following baseline variables: age, sex, BMI, smoking status, whether the patient required supplemental oxygen within three hours of presenting to the emergency department, number of comorbidities (diabetes, hypertension, COPD, CKD, ESRD, asthma, interstitial lung disease, obstructive sleep apnea, any rheumatological disease, any pulmonary disease, hepatitis or HIV, renal disease, stroke, cirrhosis, coronary artery disease, active cancer), number of relevant symptoms, presence of bilateral infiltrates on chest x-ray, dyspnea, and hypertension. These variables were chosen because they have been previously identified as risk factors for severe disease (Guan et al. 2020; Goyal et al. 2020; Gupta et al. 2020), and are therefore likely to improve the efficiency of covariate-adjusted effect estimators in randomized trials in hospitalized patients. Code to reproduce our simulations may be found at https://github.com/nt-williams/covid-RCT-co

We consider two types of outcomes: a time-to-event outcome defined as the time from hospitalization to intubation or death, and a six-level ordinal outcome at 14 days post-hospitalization based on the WHO Ordinal Scale for Clinical Improvement (Marshall et al. 2020). The categories are as follows: 0, discharged from hospital; 1, hospitalized with no oxygen therapy; 2, hospitalized with oxygen by mask or nasal prong; 3, hospitalized with non-invasive ventilation or high-flow oxygen; 4, hospitalized with intubation and mechanical ventilation; 5, dead. For time-to-event outcomes, we focus on evaluating the effect of treatment on the RMST at 14 days and the RD at 7 days after hospitalization; for ordinal outcomes, we evaluate results for both the LOR and the Mann-Whitney (MW) statistic.
We simulate datasets for four scenarios defined by two effect sizes (null versus positive) and two baseline variable settings (prognostic versus not prognostic, where prognostic means marginally associated with the outcome). For each sample size n ∈ {100, 500, 1500} and for each scenario, we simulated 5000 datasets as follows. To generate datasets where covariates are prognostic, we draw n pairs (W, Y) randomly from the original dataset with replacement. This generates a dataset where the covariate prognostic strength is as observed in the real dataset. To simulate datasets where covariates are not prognostic, we first draw outcomes Y at random with replacement from the original dataset, and then draw covariates W at random with replacement and independently of the value of Y drawn. For each scenario, a hypothetical treatment variable is assigned randomly for each patient with probability 0.5, independently of all other variables. This produces a data-generating distribution with zero treatment effect. Next, a positive treatment effect is simulated for time-to-event outcomes by adding an independent random draw from a χ² distribution with four degrees of freedom to each patient's observed survival time in the treatment arm. This effect size translates to a difference in RMST of 1.04 and an RD of 0.10. To simulate outcomes being missing completely at random, 5% of the patients are selected at random to be censored, with censoring times drawn from a uniform distribution between 1 and 14. A positive treatment effect is simulated for ordinal outcomes by subtracting from each patient's outcome in the treatment arm an independent random draw from a four-parameter Beta distribution with support (0, 5) and parameters (3, 15), rounded to the nearest nonnegative integer. This generates effect sizes of 0.60 for the LOR and 0.46 for the MW statistic. We evaluate several estimators.
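The resampling scheme for the time-to-event outcome can be sketched as follows. This is an illustrative Python sketch, not the simulation code itself: the database here is a synthetic stand-in for the real patient data, the variable names are hypothetical, and the χ²(4) draw is generated as a sum of four squared standard normals.

```python
import random

random.seed(1)

# Synthetic stand-in for the real (W, Y) database: (covariate, survival time)
database = [(random.uniform(20, 90), random.randint(1, 14)) for _ in range(200)]

def simulate_trial(n, prognostic=True, effect=True):
    """Draw one simulated trial dataset by resampling the database."""
    if prognostic:
        # Resample (W, Y) pairs jointly: covariates keep their prognostic strength
        rows = [random.choice(database) for _ in range(n)]
    else:
        # Resample W and Y independently: covariates carry no information about Y
        ys = [random.choice(database)[1] for _ in range(n)]
        ws = [random.choice(database)[0] for _ in range(n)]
        rows = list(zip(ws, ys))
    data = []
    for w, y in rows:
        a = random.random() < 0.5  # hypothetical treatment, randomized 1:1
        if effect and a:
            # Positive effect: add an independent chi-squared(4) draw
            y = y + sum(random.gauss(0, 1) ** 2 for _ in range(4))
        data.append((w, a, y))
    return data

trial = simulate_trial(100)
```

Missing-completely-at-random censoring would then be imposed on a random 5% of rows, as described above.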
First, we evaluate unadjusted estimators obtained by substituting the empirical CDF for ordinal outcomes and the Kaplan-Meier estimator for time-to-event outcomes in the parameter definitions of Section 2. We then evaluate the adjusted estimator S̃_IE−TMLE(k, a) with the following working models: LR, a fully adjusted logistic regression including all the variables listed in the previous section; ℓ1-LR, ℓ1-regularization of the previous logistic regression; RF, random forests; MARS, multivariate adaptive regression splines; and XGBoost, extreme gradient boosting tree ensembles. For the RF, MARS, and XGBoost estimators, we further evaluated cross-fitted versions of the working model. For all adjusted estimators, the propensity score π_A is estimated with an intercept-only model (A4), and the censoring mechanism π_C is estimated using a Kaplan-Meier estimator fitted independently within each treatment arm (A5) (or, equivalently for ordinal outcomes, the proportion of missing outcomes within each treatment arm). Confidence intervals and hypothesis tests are based on Wald-type statistics, which use an estimate of the standard error; the standard error was estimated using the asymptotic Gaussian approximation described in Theorem 1. We compare the performance of the estimators in terms of the probability of type 1 error, power, absolute bias, variance, and mean squared error. We compute the relative efficiency (RE) of each estimator compared to the unadjusted estimator as the ratio of their mean squared errors. This relative efficiency can be interpreted as the ratio of sample sizes required by the estimators to achieve the same power at local alternatives, asymptotically (van der Vaart 1998).
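The relative efficiency comparison reduces to a ratio of mean squared errors across simulation replicates; a minimal sketch with toy numbers (the estimates below are illustrative, not simulation output):

```python
def mse(estimates, truth):
    """Mean squared error of an estimator over simulation replicates."""
    return sum((e - truth) ** 2 for e in estimates) / len(estimates)

def relative_efficiency(adjusted, unadjusted, truth):
    """RE < 1 means the adjusted estimator is more precise; 1 - RE approximates
    the relative sample-size reduction from covariate adjustment."""
    return mse(adjusted, truth) / mse(unadjusted, truth)

# Toy example: adjusted estimates cluster more tightly around the truth 1.0
unadj = [1.2, 0.8, 1.3, 0.7]
adj = [1.1, 0.9, 1.15, 0.85]
re = relative_efficiency(adj, unadj, truth=1.0)
```

In the simulations, the truth is the known effect size induced by the data-generating mechanism, and the replicates are the 5000 simulated datasets.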
Equivalently, one minus the relative efficiency is the relative reduction (due to covariate adjustment) in the sample size required to achieve a desired power, asymptotically; e.g., a relative efficiency of 0.8 is approximately equivalent to needing a 20% smaller sample size when using covariate adjustment. In the presentation of the results, we append the prefix CF to cross-fitted estimators; for example, CF-RF denotes cross-fitted random forests. Tables containing the comprehensive results of the simulations are presented in the supplementary materials. In the remainder of this section we present a summary of the results. First, we note that the use of random forests without cross-fitting exhibits very poor performance, failing to appropriately control type 1 error when the effect is null, and introducing substantial bias when the effect is positive. We observed this poor performance across all simulations; thus, in what follows we omit a discussion of this estimator. Results for the LOR in Tables 3 and 11 show that covariate-adjusted estimators outperform the unadjusted estimator at small sample sizes, even when the covariates are not prognostic. In these cases, the unadjusted estimator is unstable with large variance due to near-empty outcome categories in some simulated datasets, which causes division by near-zero numbers in the unadjusted LOR estimator. Some covariate-adjusted estimators fix this problem by extrapolating model probabilities to obtain better estimates of the probabilities in the near-empty cells. Tables 1-4 (in the web supplementary materials) display the results for the difference in RMST, RD, LOR, and MW estimands when covariates are prognostic and there is a positive effect size. At sample size n = 1500, all adjusted estimators yield efficiency gains, with CF-RF offering the best RE, ranging from 0.51 to 0.67 compared to the unadjusted estimator, while appropriately controlling type 1 error.
In contrast, the RE of ℓ1-LR at n = 1500 ranged from 0.79 to 0.89. At sample size n = 500, ℓ1-LR, CF-RF, and XGBoost offer comparable efficiency gains, ranging from 0.29 to 0.99. As the sample size decreases to n = 100, most adjusted estimators yield efficiency losses, and the only estimator that retains efficiency gains is ℓ1-LR, with RE from 0.86 to 0.92. (An exception is in estimation of the LOR, where the RE of ℓ1-LR was 0.1 due to the issue discussed above.) Efficiency gains for ℓ1-LR did not always translate into power gains of a Wald-type hypothesis test compared to other estimators (e.g., LR at n = 100), possibly due to biased variance estimation and/or a poor Gaussian approximation of the distribution of the test statistic. At the small sample size n = 100, power was uniformly better for a Wald-type test based on LR compared to ℓ1-LR. At sample size n = 500, a Wald-type test based on ℓ1-LR dominated all other algorithms, whereas at n = 1500 all algorithms had comparable power very close to one. Results when the true treatment effect is zero and covariates are prognostic are presented in Tables 5-8 (in the web supplementary materials). At sample size n = 1500, CF-RF generally provides large efficiency gains, with relative efficiencies ranging from 0.66 to 0.77. For comparison, ℓ1-LR has RE ranging from 0.88 to 0.92. As the sample size decreases to n = 500, ℓ1-LR and CF-RF both offer the largest efficiency gains while retaining type 1 error control, with RE ranging from 0.74 to 0.88. At the small sample size n = 100, ℓ1-LR consistently leverages efficiency gains from covariate adjustment (RE ranging from 0.73 to 0.95), but its type 1 error (ranging from 0.07 to 0.09) is slightly larger than that of the unadjusted estimator. For estimation of the LOR and MW, XGBoost has similar results at sample size n = 100.
Tables 9-12 (in the web supplementary materials) show results for scenarios where the covariates are not prognostic of the outcome but there is a positive effect. This case is interesting because it is well known that adjusted estimators can induce efficiency losses (i.e., RE > 1) by adding randomness to the estimator when there is nothing to be gained from covariate adjustment. We found that ℓ1-LR uniformly avoids efficiency losses associated with adjustment for independent covariates, with a maximum RE of 1.03; all other covariate adjustment methods had larger maximum RE. At sample size n = 100, the superior efficiency of the ℓ1-LR estimator did not always translate into better power (e.g., compared to LR), due to the use of a Wald test, which relies on an asymptotic approximation to the distribution of the estimator. Results when the true treatment effect is zero and covariates are not prognostic are presented in Tables 13-16 (in the web supplementary materials). In this case, ℓ1-LR also avoids efficiency losses across all scenarios, while maintaining a type 1 error that is comparable to that of the unadjusted estimator. Lastly, at large sample sizes, all cross-fitted estimators along with the logistic regression estimators yield correct type 1 error, illustrating the correctness of the Wald-type tests proved in Theorem 1. Our simulation results also show that Wald-type hypothesis tests based on data-adaptive machine learning procedures fail to control type 1 error if the regression procedures are not cross-fitted. In our numerical studies we found that ℓ1-regularized logistic regression offers the best tradeoff between type 1 error control and efficiency gains across sample sizes, outcome types, and estimands. This algorithm leverages efficiency gains when they are feasible, while protecting the estimator from efficiency losses when they are not (e.g., when adjusting for covariates with no prognostic power).
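The cross-fitting scheme that distinguishes the well-behaved estimators above (and which is described in detail in the appendix) can be sketched as follows. This is an illustrative Python sketch: the `fit` argument is a hypothetical stand-in for any data-adaptive regression routine (e.g., a random forest), and the toy fitter below simply predicts the training-sample mean.

```python
import random

def cross_fit_predictions(X, y, fit, J=5, seed=0):
    """Cross-fitting: randomly partition indices into J validation sets; the
    prediction for observation i uses only a model trained on observations
    outside i's validation set."""
    n = len(X)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[j::J] for j in range(J)]  # random partition V_1, ..., V_J
    preds = [None] * n
    for V in folds:
        T = [i for i in idx if i not in V]  # training sample: complement of V
        model = fit([X[i] for i in T], [y[i] for i in T])
        for i in V:
            preds[i] = model(X[i])          # out-of-fold prediction
    return preds

# Toy fitter: predict the training-sample mean of y, ignoring X
mean_fit = lambda Xs, ys: (lambda x, m=sum(ys) / len(ys): m)
preds = cross_fit_predictions(list(range(10)), [1.0] * 10, mean_fit, J=2)
```

Because each prediction is made by a model that never saw that observation, overfitting of the working model cannot leak into the influence-function-based variance estimate.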
A direction of future research is the evaluation of bootstrap estimators for the variance and confidence intervals of covariate-adjusted estimators, especially for cases where the Wald-type methods evaluated in this manuscript did not perform well (e.g., ℓ1-LR at n = 100). We also found that logistic regression can result in large efficiency losses at small sample sizes, with relative efficiencies as large as 1.17 for the RMST estimand and as large as 7.57 for the MW estimand. Covariate adjustment with ℓ1-regularized logistic regression solves this problem, maintaining efficiency when covariates are not prognostic for the outcome, even at small sample sizes. However, Wald-type hypothesis tests do not appropriately translate the efficiency gains of ℓ1-regularized logistic regression into more powerful tests; this motivates the development of tests appropriate for small samples. We recommend against using the LOR parameter, since it is difficult to interpret and the corresponding estimators (even unadjusted ones) can be unstable at small sample sizes. Covariate adjustment with ℓ1-LR, CF-MARS, CF-RF, or CF-XGBoost can help improve efficiency in estimation of the LOR parameter over the unadjusted estimator when there are near-empty cells at small sample sizes. This improvement in efficiency did not translate into an improvement in power when using Wald-type hypothesis tests, due to poor small-sample Gaussian approximations or poor variance estimators. We discourage the use of non-cross-fitted versions of the machine learning methods evaluated (i.e., RF, XGBoost, MARS) for covariate adjustment. Specifically, we found in simulations that non-cross-fitted random forests can lead to severely biased estimators in the case of a positive effect, and to anti-conservative Wald-type hypothesis tests in the case of a null treatment effect.
We found that cross-fitting the random forests alleviated this problem, producing small bias and acceptable type 1 error at all sample sizes. This is supported at large sample sizes by our main theoretical result (Theorem 1), which establishes the asymptotic correctness of cross-fitted procedures under regularity conditions. In fact, we found that random forests with cross-fitting provided the largest efficiency gains at large sample sizes. Based on the results of our simulation studies, we recommend that cross-fitting with data-adaptive estimators such as random forests and extreme gradient boosting be considered for covariate adjustment in trials with large sample sizes (n = 1500 in our simulations). At large sample sizes, it is also possible to consider an ensemble approach such as Super Learning (van der Laan et al. 2007) that allows one to select the predictor that yields the largest efficiency gains. Traditional model selection with statistical learning is focused on the goal of prediction; adapting those tools to the goal of maximizing efficiency in estimating the marginal treatment effect is the subject of future research. The conditions for asymptotic normality and consistent variance estimation of S̃_TMLE(k, a) established in Theorem 1 may be restrictive if censoring is informative. In that case, consistency of the S̃_TMLE(k, a) and S̃_IE−TMLE(k, a) estimators requires that censoring at random holds (i.e., C ⊥⊥ Y | W, A), and that either the outcome regression or the censoring mechanism is consistently estimated. Thus, it is recommended to also estimate the censoring mechanism with machine learning methods that allow for flexible regression. Standard asymptotic normality results for the S̃_TMLE(k, a) and S̃_IE−TMLE(k, a) estimators require consistent estimation of both the censoring mechanism and the outcome mechanism at certain rates (e.g., both estimated at n^{1/4}-rate is sufficient).
The development of estimators that remain asymptotically normal under the weaker condition that at least one of these regressions is consistently estimated has been the subject of recent research (e.g., Díaz et al.).

Similarly, define the following function of the censoring distribution: G(k, a, w), the probability of remaining uncensored through time k given A = a and W = w. Under the assumption C ⊥⊥ (Y, W) | A = a for each treatment arm a, we have Y ⊥⊥ C | A, W, and therefore S(k, a, w) and G(k, a, w) have product formula representations in terms of the discrete hazards: S(k, a, w) = ∏_{u=1}^{k} {1 − m(u, a, w)}, with the analogous product formula for G(k, a, w) in terms of π_C. At each iteration of the estimation algorithm, the auxiliary covariates for S̃_TMLE and S̃_IE−TMLE are constructed by plugging in π̂_A, Ŝ, and Π̂_C, the estimates at the current step of the iteration.

* Corresponding author: ild2005@med.cornell.edu

Before proving Theorem 1 given in the paper, we introduce some notation and efficiency theory for estimation of S(k, a). We will use the following notation. First, we encode a single participant's data vector O = (W, A, ∆ = 1{Y ≤ C}, Ỹ = min(C, Y)) using the longitudinal data structure

(W, A, R_0, L_1, R_1, L_2, . . . , R_{K−1}, L_K),  (5)

where R_u = 1{Ỹ = u, ∆ = 0} and L_u = 1{Ỹ = u, ∆ = 1}, for u ∈ {0, . . . , K}. The sequence R_0, L_1, R_1, L_2, . . . , R_{K−1}, L_K in the above display consists of all 0's until the first time that either the event is observed or censoring occurs, i.e., time u = Ỹ. In the former case L_u = 1; otherwise R_u = 1. For a random variable X, we denote its history through time u as X̄_u = (X_0, . . . , X_u). For a given scalar x, the expression X̄_u = x denotes element-wise equality.
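The longitudinal encoding of (Ỹ, Δ) can be made concrete in code. The following is an illustrative Python sketch (the function name is hypothetical); it also constructs the at-risk indicator used later in the proofs, which equals one exactly when neither the event nor censoring has occurred before time u.

```python
def encode(y_tilde, delta, K):
    """Encode (y_tilde = min(C, Y), delta = 1{Y <= C}) as the censoring
    indicators R_u and event indicators L_u of the longitudinal data
    structure, plus the at-risk indicator I_u = 1{y_tilde >= u}."""
    R = {u: int(y_tilde == u and delta == 0) for u in range(0, K)}
    L = {u: int(y_tilde == u and delta == 1) for u in range(1, K + 1)}
    I = {u: int(y_tilde >= u) for u in range(1, K + 1)}
    return R, L, I

# A participant whose event is observed at time 3, with K = 5
R, L, I = encode(y_tilde=3, delta=1, K=5)
```

As expected, all indicators are zero until time 3, where the single event indicator fires, and the participant leaves the risk set afterwards.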
The corresponding vector (5) for participant i is denoted with an i subscript. Define the indicator variables I_u and J_u for each u ≥ 1 as follows. The variable I_u is the indicator, based on the data through time u − 1, that a participant is at risk of the event being observed at time u; in other words, I_u = 1 means that all the variables R_0, L_1, R_1, L_2, . . . , L_{u−1}, R_{u−1} in the data vector (5) equal 0, which makes it possible that L_u = 1. Analogously, J_u is the indicator, based on the outcome data through time u and the censoring data before time u, that a participant is at risk of censoring at time u. By convention we let J_0 = 1. The efficient influence function for estimation of S(k, a) (see Moore and van der Laan 2009) is equal to

D_η(O) = ∑_{u=1}^{k} H_Y(u, A, W) {L_u − I_u m(u, A, W)} + S(k, a, W) − S(k, a),

where we have explicitly added the dependence of the auxiliary covariate H_Y on (u, A, W) to the notation, and have denoted the nuisance parameters by η = (m, π_A, π_C). In what follows we write θ = S(k, a), and use θ(η_1) to refer to the target parameter evaluated at the distribution implied by η_1. We denote Pf = ∫ f(o) dP(o), and Ph(t, a, W) = ∫ h(t, a, w) dP(w) for functions f and h. The efficient influence function has important implications for estimation of S(k, a). First, the variance of D_η(O) is the non-parametric efficiency bound, meaning that it is the smallest possible variance achievable by any regular estimator (Bickel et al. 1997). Second, the efficient influence function characterizes the first-order bias of a plug-in estimator based on data-adaptive regression; correcting this first-order bias will allow us to establish normality of the estimators. Specifically, for any estimate η̂ we have the following first-order expansion around the true parameter value θ(η), proved in Lemma 1 in the Supplementary materials of Díaz et al.
(2018):

θ(η̂) = θ(η) − P D_{η̂} + Rem_1(η̂),  (7)

where Rem_1(η̂) is a second-order remainder term and θ(η̂) is the corresponding substitution (plug-in) estimator evaluated at η̂. The following proposition, establishing the robustness of D_η to misspecification of the model m, will be useful to prove consistency of the estimator.

Proposition 1. Let η_1 = (m_1, π_{A,1}, π_{C,1}) be such that either m_1 = m or (π_{A,1}, π_{C,1}) = (π_A, π_C). Then PD_{η_1} = 0.

Recall the cross-fitting procedure described in the main document. Let V_1, . . . , V_J denote a random partition of the index set {1, . . . , n} into J prediction sets of approximately the same size; that is, V_j ⊂ {1, . . . , n}; ∪_{j=1}^{J} V_j = {1, . . . , n}; and V_j ∩ V_{j′} = ∅ for j ≠ j′. For each j, the associated training sample is given by T_j = {1, . . . , n} \ V_j. Let m̂_j denote the prediction algorithm trained on T_j. Letting j(i) denote the index of the validation set containing observation i, cross-fitting entails using only observations in T_{j(i)} when making predictions about observation i; that is, the outcome predictions for each subject i are given by m̂_{j(i)}(u, a, W_i). Since only m, and not (π̂_A, π̂_C), is cross-fitted, we let η̂_{j(i)} = (m̂_{j(i)}, π̂_A, π̂_C). In what follows we let P_{n,j} denote the empirical distribution of the prediction set V_j, and let G_{n,j} denote the associated empirical process √(n/J)(P_{n,j} − P). Let G_n denote the empirical process √n(P_n − P). We use E(g(O_1, . . . , O_n)) to denote expectation with respect to the joint distribution of (O_1, . . . , O_n), and write a_n ≲ b_n to mean a_n ≤ c b_n for a universal constant c. The following lemmas will be useful in the proof of the theorem.

Lemma 1. Assume A1, A4, and A5. Then Π̂_C(k, a, w) does not depend on w, and π̂_A(a, w) does not depend on w. Furthermore, Π̂_C and π̂_A are asymptotically linear, with influence functions given by mean-zero functions ∆_{k,a}(O_i) and Λ_a(O_i) of (k, a) and O_i that do not depend on W_i.

Proof.
This lemma follows by an application of the Delta method to the non-parametric maximum likelihood estimators π̂_A and Π̂_C.

Lemma 2. For two sequences a_1, . . . , a_m and b_1, . . . , b_m we have

∏_{i=1}^{m} a_i − ∏_{i=1}^{m} b_i = ∑_{i=1}^{m} (∏_{j=1}^{i−1} a_j)(a_i − b_i)(∏_{j=i+1}^{m} b_j).

Proof. Substitute the products in the right-hand side and expand the sum to notice it is a telescoping sum.

The proof of Theorem 1 proceeds as follows.

Proof. Since censoring is completely at random by A1, we have θ = S(k, a) = ∫ S(k, a, w) dP(w). Let θ̂ = S̃_TMLE(k, a). Define σ² = Var[D_{η_1}(O)], where η_1 = (m_1, π_A, π_C), and let Θ̂_n = √n(θ̂ − θ)/σ̂ and Θ_n = G_n D_{η_1}/σ. First, note that Θ_n ⇝ N(0, 1) by the central limit theorem. We will now show that |Θ̂_n − Θ_n| = o_P(1), which yields the result in the theorem. Writing Θ̃_n = √n(θ̂ − θ)/σ, we have |Θ̂_n − Θ̃_n| = |σ/σ̂ − 1| |Θ̃_n| = o_P(1), which follows because |σ/σ̂ − 1| = o_P(1) (by Lemma 1 and A3) and because |Θ̃_n| = O_P(1) by the central limit theorem. We will now show that |Θ̃_n − Θ_n| = o_P(1). An application of (7) with η = η̂ yields √n(θ̂ − θ) = G_n D_{η̂} + √n Rem_1(η̂), where we use that P_n D_{η̂} = 0 by definition of η̂. This implies Θ̃_n − Θ_n = (B_{n,2} + B_{n,1})/σ, where B_{n,2} = G_n(D_{η̂} − D_{η_1}) and B_{n,1} = √n Rem_1(η̂). We first tackle the case of A6.2, where the estimators for m are cross-fitted. Note that D_{η̂_j} depends on the full sample through the estimate of the parameter ε of the logistic tilting model. To make this dependence explicit, we introduce the notation D_{η̂_j, ε̂} = D_{η̂_j}. Let ε_1 denote the probability limit of ε̂, which exists and is finite by Assumption A7. We can find a deterministic sequence δ_n → 0 satisfying P(|ε̂ − ε_1| < δ_n) → 1. Let F_n^j = {D_{η̂_j, ε} − D_{η_1} : |ε − ε_1| < δ_n}. Because the function η̂_j is fixed given the training data, we can apply Theorem 2.14.2 of van der Vaart and Wellner (1996) to bound E(sup_{f ∈ F_n^j} |G_{n,j} f| | T_j) in terms of the bracketing number N_{[]}(α‖F_n^j‖, F_n^j, L_2(P)), where we take F_n^j = sup_{ε : |ε − ε_1| < δ_n} |D_{η̂_j, ε} − D_{η_1}| as an envelope for the class F_n^j.
Theorem 2.7.2 of van der Vaart and Wellner (1996) provides a bound on this bracketing number. Since D_{η̂_j, ε̂} → D_{η_1} and δ_n → 0, we have F_n^j = o_P(1). The above argument shows that sup_{f ∈ F_n^j} |G_{n,j} f| = o_P(1) for each j, conditional on T_j; thus |B_{n,2}| = o_P(1). In the case of A6.1, where the estimators for m are not cross-fitted but belong to a parametric family, standard empirical process theory such as Example 19.7 of van der Vaart (1998) shows that D_{η̂} takes values in a Donsker class. Therefore, an application of Theorem 19.24 of van der Vaart (1998) yields |B_{n,2}| = o_P(1). We now show that |B_{n,1}| = o_P(1). First, Lemma 1 along with the Delta method shows that, for some function Γ_{k,a}(O) not depending on W, the remainder √n Rem_1(η̂) can be written (using the telescoping identity of Lemma 2) as G_n Γ_{k,a} multiplied by ∫{Ŝ(k, a, w) − S(k, a, w)} dP(w), up to asymptotically negligible terms. Expression (7) together with the assumptions of the theorem and Proposition 1 shows that the estimator θ̂ is consistent, and thus ∫{Ŝ(k, a, w) − S(k, a, w)} dP(w) = o_P(1). The central limit theorem shows that G_n Γ_{k,a} = O_P(1), which yields |B_{n,1}| = o_P(1), concluding the proof of the theorem.

D.1 Results for simulations with a positive effect and where the covariates are prognostic of the outcome

References
- A class of Mann-Whitney-Wilcoxon type statistics
- origami: Generalized Framework for Cross-Validation
- Statistical inference for data-adaptive doubly robust estimators with survival outcomes
- Doubly robust inference for targeted minimum loss-based estimation in randomized trials with missing outcome data
- Enhanced precision in the analysis of randomized trials with ordinal outcomes
- adjrct: Efficient Estimators for Survival and Ordinal Outcomes in RCTs Without Proportional Hazards and Odds Assumptions
- Improved precision in the analysis of randomized trials with survival outcomes, without assuming proportional hazards
- E9 statistical principles for clinical trials. U.S. Food and Drug Administration: CDER/CBER; European Medicines Agency: CPMP/ICH/363/96
- Multivariate adaptive regression splines
- Greedy function approximation: a gradient boosting machine
- Clinical characteristics of Covid-19 in New York City
- Targeted minimum loss based estimator that outperforms a given estimator
- Clinical characteristics of Coronavirus Disease 2019 in China
- Systematic evaluation and external validation of 22 prognostic models among hospitalised adults with COVID-19: an observational cohort study
- The risks and rewards of covariate adjustment in randomized trials: an assessment of 12 outcomes from 8 studies
- Consistent estimation of the influence function of locally asymptotically linear estimators
- Race to find COVID-19 treatments accelerates
- Semiparametric estimation of treatment effect with time-lagged response in the presence of informative censoring
- A minimal common outcome measure set for COVID-19 clinical research
- earth: Multivariate Adaptive Regression Splines
- Covariate adjustment in randomized trials with binary outcomes: targeted maximum likelihood estimation
- Increasing power in randomized trials with right censored outcomes through covariate adjustment
- Landmark estimation of survival and treatment effect in a randomized clinical trial
- L1-regularization path algorithm for generalized linear models
- Safety and efficacy of the BNT162b2 mRNA Covid-19 vaccine
- Efficient and Adaptive Estimation for Semiparametric Models
- Targeted learning ensembles for optimal individualized treatment rules with time-to-event outcomes
- Asymptotic Statistics
- Weak Convergence and Empirical Processes