key: cord-0179280-khfq1hpy authors: Li, Xiudi; Li, Sijia; Luedtke, Alex title: Estimating the Efficiency Gain of Covariate-Adjusted Analyses in Future Clinical Trials Using External Data date: 2021-04-30 journal: nan DOI: nan sha: 0e7b43ad9bcc13d2cf7c16a41f1d252e044c6fa5 doc_id: 179280 cord_uid: khfq1hpy We present a general framework for using existing data to estimate the efficiency gain from using a covariate-adjusted estimator of a marginal treatment effect in a future randomized trial. We describe conditions under which it is possible to define a mapping from the distribution that generated the existing external data to the relative efficiency of a covariate-adjusted estimator compared to an unadjusted estimator. Under certain conditions, these relative efficiencies approximate the ratio of the sample sizes needed to achieve a desired power. We consider two situations, in which the outcome is either fully or partially observed, and several treatment effect estimands that are of particular interest in most trials. For each such estimand, we develop a semiparametrically efficient estimator of the relative efficiency that allows for the application of flexible statistical learning tools to estimate the nuisance functions, along with an analytic form of a corresponding Wald-type confidence interval. We also propose a double bootstrap scheme for constructing confidence intervals. We demonstrate the performance of the proposed methods through simulation studies and apply these methods to estimate the relative efficiency of using covariate adjustment in Covid-19 therapeutic trials. The aim of most clinical trials is to estimate a marginal treatment effect that contrasts outcomes in a treatment group to those in a control group. In addition to the treatment assignment and outcome, data on prognostic baseline covariates are often available. In the case of continuous outcomes, the U.S.
Food and Drug Administration [10] recommends adjusting for these baseline covariates through ANCOVA or linear regression models. However, such covariate-adjusted analyses are often underutilized in practice, especially with ordinal or time-to-event data [2] . Compared with unadjusted analyses, analyses that adjust for baseline covariates have several benefits. First, adjusted analyses can lead to consistent estimators of the treatment effect under weaker assumptions. One such example arises when right-censoring is present in a trial with a time-to-event outcome. Adjusted analyses often give consistent estimates provided that the censoring and survival times are independent given treatment and covariates [22] . This condition is more plausible in many trial settings than is the requirement made in unadjusted analyses that the censoring and survival times are independent given treatment alone. Second, adjusting for covariates that are predictive of the outcome can improve precision, and thus a smaller sample size can be required to achieve a desired power. Such precision gain is generally expected when the outcome is fully observed, and also applies in certain cases where the outcome is only partially observed -some exceptions occur, for example, when the covariates are highly predictive of the censoring time but are only weakly predictive of the survival time. Despite these potential benefits, covariate adjustment is underutilized in analyzing clinical trial data. This is partly because, at the trial planning stage, there is typically little prior knowledge about the amount of precision gain that should be expected to result from using covariate adjustment. To address this problem, many previous works have aimed to estimate this precision gain using external datasets. 
In particular, some works have demonstrated the potential precision gain of covariate adjustment in clinical trial settings by comparing the standard errors of adjusted and unadjusted estimators on existing clinical trial datasets [e.g., 26, 34] . When the data-generating mechanism that gave rise to one of these existing datasets is reflective of the corresponding mechanism that is anticipated in an upcoming trial, these point estimates may yield a reasonable estimate of the precision gain anticipated in these future trials. It is worth noting, however, that the sampling variability in the existing trial dataset induces uncertainty in this point estimate. Other works have used an existing trial dataset to design a simulation study that can be used to estimate the precision gain [e.g., 16, 7, 26] . However, even when these simulation studies involve many repetitions, so that the Monte Carlo error is negligible, there is still uncertainty associated with these precision gain estimates that arises due to sampling variability in the existing trial dataset. In many cases, there may not be data available from a clinical trial that is reflective of the upcoming trial. An alternative approach, which does not require access to such data but can leverage it when it is available, is to conduct a simulation based on an external dataset that may be reflective of the covariate and outcome distributions that will be seen in the control arm of the upcoming trial [3] . This data may, for example, be derived from a pilot study or an observational study. Treatment arm data can then be simulated under a sharp null of no effect or as a user-defined shift of the conditional distribution of the outcome given covariates in the pilot study. As for the earlier simulation approach, a point estimate of the precision gain can be easily obtained, but there is still uncertainty in this point estimate that arises from the sampling variability in the dataset upon which the simulation is based. 
It can be challenging to be confident that a favorable estimated precision gain is not due to random noise, especially when the external dataset is small. Consequently, some clinical trialists may be cautious when making decisions about using covariate adjustment in future clinical trials based on a point estimate alone, even if the external dataset upon which it is based is known to be reflective of the data that will be seen in the trial. In other statistical estimation problems, the lack of interpretability of point estimates is often addressed by reporting a confidence interval alongside each point estimate. However, to the best of our knowledge, the problem of making statistical inferences about the precision gains of covariate-adjusted estimators has not been formally investigated. In this work, we aim to fill this critical knowledge gap. When doing so, we focus on the most general case described above, namely that data from an external study that is reflective of the covariate and control arm outcome distributions are available. Special cases of this setting include the case where data are available from a previous trial and the control arm data are used for the external study, and also the case where covariate and outcome data are available from an observational study. We primarily consider treatment effect estimands that can be written as contrasts of the distributions of the outcomes within each treatment arm. Most commonly, investigators perform an unadjusted analysis that uses the empirical distribution to estimate these two arm-specific distributions. One approach to covariate adjustment involves instead estimating these distributions with possibly-misspecified working models. Specifically, this involves fitting a working parametric model within each arm that conditions on covariates, and then marginalizing over the arm-pooled empirical distribution of the covariates [21, 3] . 
In many cases, this approach can result in consistent and asymptotically normal estimators of the marginal effect of interest, even if the working model is misspecified. Nevertheless, these approaches are typically inefficient when the model is not correct, in the sense that they fail to achieve the asymptotic efficiency bound within a model that only imposes that treatment is randomized [6]. In contrast, many covariate-adjusted estimators have been proposed recently that achieve the efficiency bound under appropriate regularity conditions (see, for example, [33, 9] for ordinal outcomes; and [22, 27, 8] for survival outcomes). These approaches usually involve estimating nuisance parameters such as the treatment mechanism and the outcome regression functions. While more efficient, these estimators are often more difficult for practitioners to understand because they cannot typically be framed as corresponding to a commonly used estimator within a parametric working model. In this paper, we consider both of the above-described strategies for covariate adjustment, which we refer to as working-model-based approaches and fully adjusted approaches, respectively. Our main contributions are as follows: 1. we provide a framework for using external data to identify the efficiency gain, in terms of the percentage reduction in the sample size needed to achieve a desired power, from using covariate-adjusted rather than unadjusted estimation methods on future clinical trial data; 2. we introduce efficient estimators of this quantity that allow for the incorporation of flexible statistical learning tools to estimate the needed nuisance functions; 3. we present statistical inference procedures to accompany the proposed estimators, namely a Wald-type procedure that requires knowledge of their influence functions but is widely applicable and a bootstrap procedure that applies only to working-model-based estimators but is easy to implement; and 4.
we evaluate the performance of the proposed methods in a simulation study and an application to a dataset of Covid-19 patients hospitalized at the University of Washington Medical Center. This paper is organized as follows. In Section 2, we provide background on efficient estimation in semiparametric models and describe the relevance of the relative efficiency and local alternatives to clinical trial settings. In Section 3, we introduce the framework to identify and estimate the efficiency gain when the outcome is fully observed. We also propose efficient estimators and develop analytical and bootstrap inference procedures for estimands that are of frequent interest in the cases of continuous and ordinal outcomes. In Section 4, we study the case where the outcome is partially observed and consider time-to-event outcomes with right-censoring as an example. In Section 5, we demonstrate the performance of the proposed methods through simulation experiments and an analysis of a real dataset. Section 6 concludes with a discussion. 2 Review of efficiency theory and its relevance to clinical trial settings 2.1 Pathwise differentiability and regular and asymptotically linear estimators The theory of efficient estimation in nonparametric and semiparametric models was described in Pfanzagl [23] and Bickel et al. [6]. Here we give a brief review of the relevant concepts. Let X denote a generic data unit with distribution P, and let $L_0^2(P)$ denote the Hilbert space of real-valued functions of X with P-mean zero and finite variance. For a model M containing P, let $\mathcal{M}(P)$ denote the collection of one-dimensional submodels $\{P_\epsilon : \epsilon \in \mathbb{R}\}$ of M that pass through P at $\epsilon = 0$; the closure in $L_0^2(P)$ of the linear span of the score functions of these submodels at $\epsilon = 0$, denoted $T_M(P)$, is referred to as the tangent space. A parameter $\psi : M \to \mathbb{R}$ is called pathwise differentiable at P in M if there exists a function $D \in L_0^2(P)$ such that, for all submodels $\{P_\epsilon : \epsilon \in \mathbb{R}\} \in \mathcal{M}(P)$, it holds that $\frac{\partial}{\partial \epsilon} \psi(P_\epsilon)\big|_{\epsilon = 0} = E_P[D(X)s(X)]$, where s is the score function of $\{P_\epsilon : \epsilon \in \mathbb{R}\}$ at $\epsilon = 0$. Any such function D is called a gradient of ψ with respect to M at P. The canonical gradient $D^*$ is the gradient that lies in the tangent space $T_M(P)$; it can be shown that this gradient is unique.
We note that the $X \to \mathbb{R}$ functions D and $D^*$ both depend on P. We refer to an estimator $\hat\psi$ of ψ as a random variable that is a function of an independent and identically distributed (iid) sample $\mathbf{X} := \{X_1, \ldots, X_n\}$ drawn from some distribution. An estimator $\hat\psi$ is called regular if there exists a probability distribution L on the real line such that, for all submodels $\{P_\epsilon : \epsilon \in \mathbb{R}\}$ in $\mathcal{M}(P)$ and all $c \in \mathbb{R}$, $\sqrt{n}\,\{\hat\psi - \psi(P_{cn^{-1/2}})\}$ converges in distribution to L under sampling from $P_{cn^{-1/2}}$. Importantly, note that, if an estimator is regular, then the distribution L above does not depend on the choice of submodel in $\mathcal{M}(P)$. An estimator $\hat\psi$ of ψ(P) is called asymptotically linear if there exists some function $\xi_P \in L_0^2(P)$ such that $\hat\psi - \psi(P) = \frac{1}{n}\sum_{i=1}^n \xi_P(X_i) + o_P(n^{-1/2})$. (1) The function $\xi_P$ is referred to as the influence function of $\hat\psi$. Asymptotically linear estimators are consistent and asymptotically normal, in the sense that $\sqrt{n}\,\{\hat\psi - \psi(P)\} \rightsquigarrow N(0, \sigma_P^2)$, (2) where $\sigma_P^2$ is the variance of $\xi_P(X)$ when X ∼ P. If $\hat\psi$ is asymptotically linear and ψ is pathwise differentiable, then $\hat\psi$ is regular if, and only if, $\xi_P$ is a gradient of ψ with respect to M at P. Among the collection of gradients, the canonical gradient $D^*$ has the smallest variance, and is also called the efficient influence function (EIF). This variance characterizes the efficiency bound for estimating ψ in the model M with a regular and asymptotically linear (RAL) estimator. An estimator is called efficient if it is RAL and its influence function is the same as the EIF. Suppose that we have available an initial estimator $\hat P$ of the distribution. A plug-in estimator is defined as $\psi(\hat P)$. However, such estimators may not be $\sqrt{n}$-consistent due to the potential bias in the initial estimator. One way to construct a RAL estimator with influence function D is through one-step estimation [15, 5, 24], which corrects for this bias by using $\hat\psi = \psi(\hat P) + P_n D(\hat P)$, where $P_n(\cdot)$ denotes the empirical mean. Estimating equations [29, 28] and targeted minimum loss-based estimation [30] are alternative approaches.
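As a toy illustration of the one-step recipe (this example and its numbers are ours, not from the paper), consider the pathwise differentiable parameter $\psi(P) = E_P[Y]^2$, whose gradient at P is $D_P(y) = 2 E_P[Y]\{y - E_P[Y]\}$. Starting from a deliberately biased initial mean estimate, the correction $\psi(\hat P) + P_n D(\hat P)$ removes the first-order bias:

```python
# Toy one-step estimator for psi(P) = E[Y]^2, whose gradient is
# D_P(y) = 2*mu*(y - mu) with mu = E[Y]. (Hypothetical example, not
# an estimator from the paper.)

def one_step_mean_squared(ys, mu_init):
    """Plug-in mu_init**2 plus the empirical mean of the gradient at mu_init."""
    n = len(ys)
    plug_in = mu_init ** 2
    correction = sum(2 * mu_init * (y - mu_init) for y in ys) / n
    return plug_in + correction

ys = [1.0, 2.0, 3.0, 4.0]   # sample mean 2.5, so the target estimate is 6.25
biased_mu = 2.0             # initial mean estimator with bias -0.5
print(one_step_mean_squared(ys, biased_mu))  # 6.0
```

The plug-in alone gives 4.0; the one-step correction shrinks the error from 2.25 to 0.25, i.e., from first order to second order in the initial bias.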
These techniques are often used to construct efficient covariate-adjusted estimators of a treatment effect. Later, we will also use them to estimate the relative efficiency of two estimators based on external data. In the context that we consider in this paper, the treatment effect measure that will be estimated with the future clinical trial data will often correspond to an evaluation ψ(P) of a pathwise differentiable parameter ψ. In many cases, a primary objective of the forthcoming trial will be to test the null hypothesis that this quantity is equal to zero against a one-sided alternative, for example, that this quantity is positive. Suppose that a level α Wald test is performed, which corresponds to evaluating whether zero is smaller than $\hat\psi - n^{-1/2} z_{1-\alpha}\hat\sigma_P$ based on a RAL estimator $\hat\psi$, where $z_{1-\alpha}$ is the (1 − α)-quantile of a standard normal distribution and $\hat\sigma_P$ is a consistent estimator of $\sigma_P$, as defined in (2). Fix an arbitrary c ≠ 0. As ψ is pathwise differentiable, for any $\{P_\epsilon : \epsilon \in \mathbb{R}\}$ in $\mathcal{M}(P)$ with score function s at $\epsilon = 0$, it holds that $\psi(P_{cn^{-1/2}}) = \psi(P) + cn^{-1/2}\mu_{P,s} + o(n^{-1/2})$, where $\mu_{P,s} := E_P[D^*(X)s(X)]$. If P is such that the null that ψ(P) = 0 holds, then this shows that $n^{1/2}\psi(P_{cn^{-1/2}}) \to c\mu_{P,s}$ as n → ∞. If s is such that $\mu_{P,s} > 0$, then we call $\{P_{cn^{-1/2}}\}_{n=1}^\infty$ a sequence of local alternatives; this name is natural given that $\psi(P_{cn^{-1/2}}) > 0$ for all n large enough (and so the alternative holds for all n large enough), while also $\psi(P_{cn^{-1/2}}) \to 0$ as n → ∞ (and so these alternatives are local to the null hypothesis). Because $\hat\psi$ is RAL, combining (1) and (2) with Slutsky's theorem implies that, under sampling from $P_{cn^{-1/2}}$, $\sqrt{n}\,\hat\psi \rightsquigarrow N(c\mu_{P,s}, \sigma_P^2)$, (3) where $\sigma_P^2$ is as defined below (2).
Let β denote the power for rejecting the null that ψ(P) = 0 under sampling from $P_{cn^{-1/2}}$; that is, let $\beta := \lim_{n \to \infty} \Pr_{P_{cn^{-1/2}}}\{\hat\psi > n^{-1/2} z_{1-\alpha}\hat\sigma_P\}$. (4) Letting Φ(·) denote the cumulative distribution function (CDF) of the standard normal distribution, (3) implies that $\beta = 1 - \Phi(z_{1-\alpha} - c\mu_{P,s}/\sigma_P)$, which lies in (α, 1) when the shift $c\mu_{P,s}$ in the mean of the limiting normal distribution is positive. This gives us a way to quantify the power of the test in a range of settings where the effect size is small. Hence, effect sizes scaling as $n^{-1/2}$ are interesting in general testing problems, given that it is exactly at these effect sizes that the problem is neither asymptotically trivial (power converging to 1) nor impossible (power converging to α). Nevertheless, in many statistical problems, there may be no a priori reason to believe that the effect size will be of the order $n^{-1/2}$. The setup is quite different in randomized trials. Indeed, these local alternatives are natural to think about in these settings because, under sampling from such a sequence of alternatives, the asymptotic power takes some intermediate value between α and 1, which reflects the fact that the sample size in most trials is specified so that a test of the null will have a chosen power $\beta^* \in (\alpha, 1)$. To be more concrete, suppose that $\{\psi^{(k)}\}_{k=1}^\infty$ is a decreasing sequence of effect sizes that satisfy the alternative hypothesis, that is, that are such that $\psi^{(k)} \downarrow 0$ as k → ∞. We suppose that these effect sizes arise from some sequence of distributions $\{P^{(k)}\}_{k=1}^\infty$ that belong to some submodel $\mathcal{M}_1 := \{P_\epsilon : \epsilon \in \mathbb{R}\} \in \mathcal{M}(P)$, so that $\psi(P^{(k)}) = \psi^{(k)}$ for each k. Our objective is to establish an expression for the sequence of sample sizes $\{n^{(k)}\}_{k=1}^\infty$ so that, as k → ∞, the power for rejecting the null hypothesis converges to $\beta^*$ when $n^{(k)}$ iid observations are drawn from $P^{(k)}$. Let s denote the score of ε at 0 in the submodel $\mathcal{M}_1$, and suppose that $\mu_{P,s} \neq 0$.
To derive the sequence $\{n^{(k)}\}_{k=1}^\infty$, it will be helpful to first find a c such that, when n iid observations are drawn from $P_{cn^{-1/2}}$, the power converges to the desired $\beta^*$ as n → ∞. Because each $P^{(k)}$ lies on the submodel $\mathcal{M}_1$ and is local to P in the sense that $\psi^{(k)} \downarrow 0$, it will then be reasonable to expect that, when a sequence of tests is conducted based on $n^{(k)}$ iid observations sampled from each $P^{(k)}$, the power for rejecting the null hypothesis will converge to $\beta^*$ as k → ∞. We now find the expressions for c and $n^{(k)}$ that were described in the last paragraph. Recalling (4) and the alternative expression for the power given below that display, we see that, when $c = \sigma_P(z_{1-\alpha} - z_{1-\beta^*})/\mu_{P,s}$, the asymptotic power equals $\beta^*$. To find the expression for $n^{(k)}$, we note that, as ψ is pathwise differentiable, $\psi(P_{cn^{-1/2}}) = cn^{-1/2}\mu_{P,s} + o(n^{-1/2})$ when n is large; here, the little-oh term describes behavior as n → ∞. Hence, $P^{(k)} \approx P_{c/\sqrt{n^{(k)}}}$, where $n^{(k)} = \lceil (c\mu_{P,s}/\psi^{(k)})^2 \rceil = \lceil \sigma_P^2(z_{1-\alpha} - z_{1-\beta^*})^2/(\psi^{(k)})^2 \rceil$. Thus, to achieve asymptotic power $\beta^*$ when sampling $n^{(k)}$ iid observations from $P^{(k)}$, $n^{(k)}$ must scale proportionally with $\sigma_P^2$. These calculations also provide a means to compare the sample sizes needed to achieve the same power based on two different RAL estimators. In particular, suppose that a second RAL estimator is available and its asymptotic variance is equal to $\tilde\sigma_P^2 \le \sigma_P^2$. In this case, the proportional reduction in the sample size needed to achieve power $\beta^*$ when using this estimator rather than the estimator with variance $\sigma_P^2$ is approximately equal to $1 - \tilde\sigma_P^2/\sigma_P^2$ when k is large. In fact, $\tilde\sigma_P^2/\sigma_P^2$ is often referred to as the relative efficiency of the RAL estimator with variance $\tilde\sigma_P^2$ versus the RAL estimator with variance $\sigma_P^2$, and so this proportional reduction is exactly equal to one minus the relative efficiency of these two estimators. 3 When the outcome is fully observed We now propose a general framework to identify the relative efficiency of covariate-adjusted estimators.
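The sample-size formula above is simple to evaluate numerically. The sketch below (a minimal illustration with hypothetical planning inputs, not a calculation from the paper) computes $n^{(k)} = \lceil \sigma_P^2(z_{1-\alpha} - z_{1-\beta^*})^2/(\psi^{(k)})^2 \rceil$ and the sample-size reduction implied by a given relative efficiency:

```python
from math import ceil
from statistics import NormalDist

def required_n(sigma2, effect, alpha=0.025, power=0.9):
    """n = ceil(sigma^2 * (z_{1-alpha} - z_{1-power})^2 / effect^2)."""
    z = NormalDist().inv_cdf
    return ceil(sigma2 * (z(1 - alpha) - z(1 - power)) ** 2 / effect ** 2)

# Hypothetical inputs: unadjusted asymptotic variance 1.0, design effect size 0.2.
n_unadjusted = required_n(1.0, 0.2)        # 263
# A relative efficiency of 0.8 scales the variance, hence the sample size.
n_adjusted = required_n(0.8, 0.2)          # 211
reduction = 1 - n_adjusted / n_unadjusted  # close to 1 - 0.8, up to rounding
print(n_unadjusted, n_adjusted, round(reduction, 3))
```

Because $n^{(k)}$ scales proportionally with the variance, the proportional reduction in sample size is (up to the ceiling) one minus the relative efficiency, here roughly 20%.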
We start by defining notation that we will use to describe the data that will arise in the future clinical trial. Let A denote the binary treatment, W denote the d-dimensional covariate vector, and Y denote the outcome. We will use a superscript t to denote random variables in a future clinical trial, and we write $X^t := (W^t, A, Y^t)$ for the trial data unit. Let $P_1$ be the conditional distribution of $Y^t$ given A = 1 and $P_0$ be the conditional distribution of $Y^t$ given A = 0. The treatment effect is often a functional f of these distributions, i.e., $\psi = f(P_1, P_0)$ for some f. An example is the average treatment effect for a continuous outcome, where $f(P_1, P_0) = E_{P_1}[Y^t] - E_{P_0}[Y^t]$. Randomization implies that A and $W^t$ are independent, and thus that the distribution of $X^t$, denoted by ν, is determined by the marginal distribution of A, the conditional distribution of $Y^t$ given $(A = 0, W^t = w)$, the conditional distribution of $Y^t$ given $(A = 1, W^t = w)$, and the marginal distribution of $W^t$; we denote these distributions by Π, $P_0^t$, $P_1^t$, and $P_W$, respectively. Let $\mathcal{M}_X$ consist of all distributions for which A and $W^t$ are independent. In the randomized trial settings that we consider, $\nu \in \mathcal{M}_X$. The adjusted analysis uses $X^t$, while the unadjusted analysis ignores the covariate $W^t$. Let $\hat\psi_u$, $\hat\psi_a$ and $\hat\psi_m$ be a specified unadjusted estimator, fully adjusted estimator and working-model-based adjusted estimator of ψ, respectively. Further suppose that these estimators are regular and asymptotically linear with influence functions $D_u$, $D_a$ and $D_m$, respectively. We note that these influence functions all depend on the underlying distribution ν, but we will omit this dependence when it is clear from the context. We assume that the following regularity condition holds throughout, which guarantees that the treatment effects of interest can be estimated using strategies typically employed in randomized trial settings. Although some of the conditions can in principle be relaxed, they cover most realistic clinical trial settings. Condition A1.
Treatment is independent of covariates ($A \perp W^t$), the treatment probability Π(A = 1) falls in (0, 1), and $Y^t$ and $W^t$ both have bounded support under sampling from ν. Let $\mathcal{W} \subseteq \mathbb{R}^d$ and $\mathcal{Y} \subseteq \mathbb{R}$ be bounded and convex sets that contain the supports of $W^t$ and $Y^t$, respectively. Our objective is to quantify the relative precision of the specified adjusted and unadjusted estimators. To do this, we will consider the relative efficiency of these two estimators, defined as the ratio between the asymptotic variances of the adjusted estimator and the unadjusted estimator under a sharp null distribution. Though we will focus on the sharp null when introducing these relative efficiencies, these quantities also correspond to the relative efficiencies under local alternatives (see Section 2.2). Consequently, our apparent restriction to the sharp null setting will in fact not be restrictive at all. Indeed, from this sharp null setting, we can typically approximate the reduction in sample size needed to achieve a desired power at all local alternatives that are consistent with the design alternative used to size the trial (ibid.). We now define these relative efficiencies. Consider a trial where the sharp null holds, that is, the treatment has no effect and $P_1^t = P_0^t$. In this case, let P denote the joint distribution of $(Y^t, W^t)$, determined by the pair $(P_1^t, P_W)$. Under Condition A1, ν is equal to the product measure $P \times \Pi$ in this sharp null setting. The relative efficiency of the fully adjusted estimator compared to that of the unadjusted estimator is defined as $\phi_a(P) := \mathrm{var}_\nu\{D_a(X^t)\}/\mathrm{var}_\nu\{D_u(X^t)\}$, (5) whereas the analogous quantity for the working-model-based estimator is defined as $\phi_m(P) := \mathrm{var}_\nu\{D_m(X^t)\}/\mathrm{var}_\nu\{D_u(X^t)\}$. (6) Although in general ν depends on both Π and P, we define the relative efficiencies as functions of P only. As we will show, in many cases that are of interest in practice, the relative efficiency does not depend on Π.
Even in cases where the relative efficiency does depend on Π, the investigator would have control over the treatment distribution Π in the trial setting, and the only unknown component would still be P. The local alternatives we consider allow for a variety of perturbations to the underlying distribution. The direction of these perturbations is described by their scores, which belong to the tangent space $T_{\mathcal{M}_X}(\nu)$. This tangent space decomposes into three subspaces, corresponding to the marginal distribution of $W^t$, the distribution of treatment A, and the conditional distribution of $Y^t$ given $(A, W^t)$. The score in a smooth submodel can lie in one or more of these subspaces, which means that the local alternative can perturb one or more of the four components of ν, namely Π, $P_0^t$, $P_1^t$, and $P_W$. For an example of how these perturbations may impact ν, consider the special case of the average treatment effect that was introduced at the beginning of this section. We note that $w \mapsto E[Y^t \mid A = 1, W^t = w] - E[Y^t \mid A = 0, W^t = w]$ is the conditional average treatment effect function. Here, ψ will be zero when this function is zero for all values of the covariates. We now describe conditions that we use to identify the relative efficiencies $\phi_a(P)$ and $\phi_m(P)$ using the external data that are available at the trial planning stage. When doing this, we assume that $P \in \mathcal{M}$, where $\mathcal{M}$ is a locally nonparametric model of all distributions of $(Y^t, W^t)$, that is, a model where the tangent space at P is $L_0^2(P)$. Let X = (Y, W) be the data unit in the external dataset, which we assume consists of n iid draws from some distribution. The identifiability condition that we consider imposes that the external data should accurately reflect the distribution of the covariates and outcomes in future trials where treatment has no effect. Condition A2. A random variate X = (Y, W) from the external dataset has distribution P. Under this condition, the relative efficiencies in (5) and (6) can be estimated based on the external data.
We now consider estimating the relative efficiency for certain treatment effect estimands that are of particular interest in many clinical trials. We focus primarily on continuous and ordinal outcomes in this section. For the examples we consider, the asymptotic variances of the adjusted and unadjusted estimators factorize into a product of two terms, one that depends on Π only and another that depends on P only. Moreover, the term that depends only on Π is the same for both the adjusted and unadjusted estimators, and the relative efficiency is a function of P only. In particular, (5) and (6) now take the following forms: $\phi_a(P) = \sigma_a^2(P)/\sigma_u^2(P)$, $\phi_m(P) = \sigma_m^2(P)/\sigma_u^2(P)$. (7) The exact forms of $\sigma_a$, $\sigma_m$ and $\sigma_u$ depend on the treatment effect estimand and the specified estimators but do not depend on Π. Some specific examples are presented in the remainder of this section. To illustrate the idea, we start with a simple example where the outcome is continuous and we are interested in the average treatment effect, defined as $\psi := E[Y^t \mid A = 1] - E[Y^t \mid A = 0]$. Let $n^t$ denote the sample size of the future trial dataset. The unadjusted estimator that we consider corresponds to the difference between the arm-specific sample means of the outcome. For the fully adjusted estimator, we consider the augmented inverse probability weighted (AIPW) estimator, namely $\hat\psi_a = \frac{1}{n^t}\sum_{i=1}^{n^t}\Big[\frac{A_i\{Y_i^t - \hat r_1(W_i^t)\}}{\hat\pi(W_i^t)} - \frac{(1 - A_i)\{Y_i^t - \hat r_0(W_i^t)\}}{1 - \hat\pi(W_i^t)} + \hat r_1(W_i^t) - \hat r_0(W_i^t)\Big]$, (8) where $\hat r_a(w)$ is an estimator of the conditional mean function $r_a(w) := E[Y^t \mid A = a, W^t = w]$ and $\hat\pi$ is an estimator of the treatment mechanism $\pi(w) := P(A = 1 \mid W^t = w)$. In randomized trials, the treatment mechanism can be estimated with the empirical marginal proportion of A, which is $\sqrt{n}$-consistent, and the AIPW estimator is efficient provided that $\hat r_a$ is consistent and satisfies appropriate conditions. The estimator will remain consistent and asymptotically normal even if $\hat r_a$ is inconsistent but has an appropriately defined limit. For the working-model-based adjusted estimator, we consider linear models, which are commonly used by practitioners for continuous outcomes [10].
Specifically, we fit an arm-specific linear model for the outcome regression, which assumes that $E[Y^t \mid A = a, W^t = w] = \alpha_a + \beta_a^\top w$, and denote by $\hat\alpha_a, \hat\beta_a$ the fitted coefficients. To estimate the average treatment effect, we marginalize the fitted values over all covariates and take the difference between treatment arms, $\hat\psi_m = \frac{1}{n^t}\sum_{i=1}^{n^t}\big[(\hat\alpha_1 + \hat\beta_1^\top W_i^t) - (\hat\alpha_0 + \hat\beta_0^\top W_i^t)\big]$. We note again that the consistency and asymptotic normality of this estimator do not rely on the arm-specific linear models being correct. The following lemma gives the forms of the relevant variances in the definition of the relative efficiency in this setting. Lemma 1. Suppose that the appropriate regularity conditions hold such that the AIPW estimator $\hat\psi_a$ is efficient. Then, for the above $\hat\psi_u$, $\hat\psi_a$ and $\hat\psi_m$, we have that $\sigma_u^2(P) = \mathrm{var}_P(Y)$, $\sigma_a^2(P) = E_P[\{Y - r(W)\}^2]$ and $\sigma_m^2(P) = E_P[(Y - \alpha^* - \beta^{*\top} W)^2]$, where $r(w) := E_P[Y \mid W = w]$ and $(\alpha^*, \beta^*)$ are the population least-squares coefficients of the linear regression of Y on W. We now estimate these variances using the external data $\{(Y_i, W_i), i = 1, \ldots, n\}$. Specifically, we use the sample variance for the unadjusted variance, $\hat\sigma_u^2 = \frac{1}{n}\sum_{i=1}^n (Y_i - \bar Y)^2$, where $\bar Y$ is the overall mean of the outcome. Let $\hat r(w)$ be an estimator of $r(w) := E_P[Y \mid W = w]$; then we estimate the adjusted variance by $\hat\sigma_a^2 = \frac{1}{n}\sum_{i=1}^n \{Y_i - \hat r(W_i)\}^2$ and the working-model-based variance by $\hat\sigma_m^2 = \frac{1}{n}\sum_{i=1}^n (Y_i - \hat\alpha - \hat\beta^\top W_i)^2$, where $(\hat\alpha, \hat\beta)$ are the coefficients in the linear regression of Y on W. We estimate the relative efficiencies by $\hat\phi_a = \hat\sigma_a^2/\hat\sigma_u^2$ and $\hat\phi_m = \hat\sigma_m^2/\hat\sigma_u^2$. For a generic $\mathcal{W} \to \mathbb{R}$ function f, define its squared $L^2(P_W)$ norm as $\|f\|_{L^2(P_W)}^2 := \int f(w)^2\,dP_W(w)$. The following theorem establishes the asymptotic linearity of $\hat\phi_a$ and $\hat\phi_m$ under appropriate conditions. Theorem 1. Suppose that Conditions A1 and A2 hold. Suppose, in addition, that the random function $\hat r : \mathcal{W} \to \mathcal{Y}$ is such that $\|\hat r - r\|_{L^2(P_W)} = o_P(n^{-1/4})$ and belongs to some fixed P-Donsker class $\mathcal{F}$ of functions with probability tending to one. Then, $\hat\phi_a$ is an efficient estimator of $\phi_a$ and $\hat\phi_m$ is an efficient estimator of $\phi_m$. Now suppose that the outcome is ordinal and takes values in {1, 2, . . . , K}. Dichotomous outcomes correspond to the special case where K = 2. Let $F_a(k) := P(Y^t \le k \mid A = a)$ denote the treatment-specific CDF.
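With a single covariate, the working-model-based relative efficiency estimate $\hat\phi_m = \hat\sigma_m^2/\hat\sigma_u^2$ reduces to the residual variance of a simple linear regression divided by the outcome variance, i.e., one minus the in-sample R-squared. A minimal sketch with made-up data (ours, not from the paper):

```python
def phi_m_hat(ys, ws):
    """Estimate phi_m = sigma_m^2 / sigma_u^2 for one covariate via least squares."""
    n = len(ys)
    y_bar = sum(ys) / n
    w_bar = sum(ws) / n
    # Ordinary least-squares slope and intercept of Y on W.
    s_wy = sum((w - w_bar) * (y - y_bar) for w, y in zip(ws, ys))
    s_ww = sum((w - w_bar) ** 2 for w in ws)
    beta = s_wy / s_ww
    alpha = y_bar - beta * w_bar
    sigma2_m = sum((y - alpha - beta * w) ** 2 for w, y in zip(ws, ys)) / n
    sigma2_u = sum((y - y_bar) ** 2 for y in ys) / n
    return sigma2_m / sigma2_u

# A perfectly predictive covariate drives the estimated relative efficiency to 0.
print(phi_m_hat([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0]))  # 0.0
```

In practice one would pair such a point estimate with the Wald-type or bootstrap intervals developed in this section, since a small external dataset can make $\hat\phi_m$ noisy.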
The treatment effect estimands that we consider can all be written as $\psi = g(\{F_0(k), F_1(k)\}_{k=1}^{K-1})$ for some real-valued function g. The proportional odds model [19] is a commonly used parametric model for ordinal outcomes. Here we use a treatment-specific proportional odds model as our working parametric model. For k ∈ {1, . . . , K − 1}, the model assumes that $P(Y^t \le k \mid A = a, W^t = w) = \theta_{\alpha_a, \beta_a}(k, w)$, where $\mathrm{logit}\,\theta_{\alpha_a, \beta_a}(k, w) = \alpha_a(k) + \beta_a^\top w$. The above reduces to a logistic regression when the outcome is dichotomous. Let $(\hat\alpha_a, \hat\beta_a)$ be the coefficients, fitted by minimizing the following empirical risk function: $L_{n,a}(\alpha, \beta) := -\sum_{i: A_i = a} \sum_{k=1}^{K-1} \big[I\{Y_i^t \le k\}\log\theta_{\alpha,\beta}(k, W_i^t) + I\{Y_i^t > k\}\log\{1 - \theta_{\alpha,\beta}(k, W_i^t)\}\big]$, (9) where I{·} is the indicator function. In the special case that some fitted intercept is infinite, as can occur when an outcome level is empty in an arm, the expressions above involve $\mathrm{logit}^{-1}(\pm\infty)$ or 0 log(0). For such cases we use the conventions that $\mathrm{logit}^{-1}(-\infty) = 0$, $\mathrm{logit}^{-1}(\infty) = 1$, and 0 log(0) = 0. The treatment-arm-specific CDFs are estimated by $\hat F_a(k) = \frac{1}{n^t}\sum_{i=1}^{n^t} \theta_{\hat\alpha_a, \hat\beta_a}(k, W_i^t)$. In addition, we define $(\alpha_a^*, \beta_a^*)$ as the minimizer of $E[L_{n,a}(\alpha, \beta)]$ over $\mathbb{R}^{K-1} \times \mathbb{R}^d$. We first establish the RAL property of $\hat F_a(k)$, which holds even when the proportional odds model is misspecified. Let $\theta_a(k, w) := P(Y^t \le k \mid A = a, W^t = w)$ be the true outcome regression, and let $\theta_a^*(k, w) = \theta_{\alpha_a^*, \beta_a^*}(k, w)$ be the best model approximation to the true outcome regression according to the population analogue of the risk in (9). Note that $\theta_a^*(k, w)$ can differ from $\theta_a(k, w)$ in the presence of misspecification. Lemma 2. Suppose that Condition A1 holds and that $(\hat\alpha_a, \hat\beta_a)$ is estimated by minimizing (9). Then $\hat F_a(k)$ is an asymptotically linear estimator of $F_a(k)$, for k ∈ {1, . . . , K − 1}; its influence function is given in Appendix B. Following [3], we focus on three treatment effect estimands that are often of interest. The first, the difference in means (DIM), is defined as $\psi = E[u(Y^t) \mid A = 1] - E[u(Y^t) \mid A = 0]$ for a pre-specified monotone transformation u(·). This reduces to the average treatment effect when u(·) is the identity function.
The unadjusted estimator is the difference between the arm-specific sample means of the transformed outcome, $\hat\psi_u = \frac{1}{n_1^t}\sum_{i: A_i = 1} u(Y_i^t) - \frac{1}{n_0^t}\sum_{i: A_i = 0} u(Y_i^t)$, where $n_a^t$ denotes the number of participants in arm a. Instead of using sample means, the adjusted estimator based on the proportional odds model computes means with respect to the estimated CDFs $\hat F_a$. Finally, we define an AIPW estimator similarly to (8), but with $Y^t$ replaced by $u(Y^t)$. We denote this estimator by $\hat\psi_a$. As in the previous section, this estimator achieves the semiparametric efficiency bound when the treatment mechanism is estimated with the marginal proportion of treatment and the outcome regression is consistently estimated. For the above three estimators, the variances in (7) are given in the following lemma. Lemma 3. Suppose that the appropriate regularity conditions hold. Then $\sigma_u^2(P) = \mathrm{var}_P\{u(Y)\}$ and $\sigma_a^2(P) = E_P[\{u(Y) - r(W)\}^2]$, where $r(w) := E_P[u(Y) \mid W = w]$; the variance $\sigma_m^2(P)$ is the analogous quantity computed from $\theta^*(k, w) = \theta_{\alpha^*, \beta^*}(k, w)$, where $(\alpha^*, \beta^*)$ maximizes the following objective: $E_P\big[\sum_{k=1}^{K-1} I\{Y \le k\}\log\theta_{\alpha,\beta}(k, W) + I\{Y > k\}\log\{1 - \theta_{\alpha,\beta}(k, W)\}\big]$. (10) We now propose estimators of these quantities for settings where external data are available. Let $\hat r(w)$ be an estimator of $r(w) := E_P[u(Y) \mid W = w]$, the conditional mean of u(Y), and let $\bar u_n$ be the sample mean of u(Y). We estimate the unconditional and conditional variances by $\hat\sigma_u^2 = \frac{1}{n}\sum_{i=1}^n \{u(Y_i) - \bar u_n\}^2$ and $\hat\sigma_a^2 = \frac{1}{n}\sum_{i=1}^n \{u(Y_i) - \hat r(W_i)\}^2$. To estimate $\sigma_m^2$, we fit the proportional odds model by maximizing the empirical counterpart of (10), let $(\hat\alpha, \hat\beta)$ denote the fitted coefficients, and construct the corresponding plug-in estimator $\hat\sigma_m^2$. Finally, we estimate the relative efficiencies by $\hat\phi_a = \hat\sigma_a^2/\hat\sigma_u^2$ and $\hat\phi_m = \hat\sigma_m^2/\hat\sigma_u^2$. In the upcoming theorem, we let $u(\mathcal{Y})$ denote the convex hull of $\{u(y) : y \in \mathcal{Y}\}$. Theorem 2. Suppose that Conditions A1 and A2 hold. Suppose, in addition, that the random function $\hat r : \mathcal{W} \to u(\mathcal{Y})$ is such that $\|\hat r - r\|_{L^2(P_W)} = o_P(n^{-1/4})$ and belongs to some fixed P-Donsker class $\mathcal{F}$ of functions with probability tending to one. Then $\hat\phi_a$ is an efficient estimator of $\phi_a$ and $\hat\phi_m$ is an efficient estimator of $\phi_m$. Exact forms of the influence functions of $\hat\phi_a$ and $\hat\phi_m$ are given in Appendix B. The Mann-Whitney estimand (MW) is defined as $\psi = P(Y_1^t > \tilde Y_0^t) + P(Y_1^t = \tilde Y_0^t)/2$ for two independent variables $Y_1^t \sim P_1$ and $\tilde Y_0^t \sim P_0$.
It is the probability that a randomly chosen individual's outcome under treatment is larger than another randomly chosen individual's outcome under control, plus one half times the probability that the two outcomes are equal. Define $h(x, y) = I\{x > y\} + I\{x = y\}/2$. Then the Mann-Whitney parameter can alternatively be written as $\psi = \iint h(x, y)\, dP_1(x)\, dP_0(y)$. This alternative definition suggests the unadjusted estimator $\hat\psi_u = \iint h(x, y)\, d\tilde P_1(x)\, d\tilde P_0(y)$, with $\tilde P_a$ the empirical distribution in arm $a$, and the working-model-based estimator $\hat\psi_m = \iint h(x, y)\, d\hat P_1(x)\, d\hat P_0(y)$, where $\hat P_a$ is the distribution with CDF $\hat F_a(k)$. In addition, let $\hat\psi_a$ be the covariate-adjusted estimator in Vermeulen et al. [33] for the MW parameter, which is efficient under appropriate regularity conditions.

Lemma 4. Let $(Y, W) \sim P$, and define $\eta_P(k) = P(Y < k) + P(Y = k)/2$ and $p_k = P(Y = k)$. Suppose that the appropriate regularity conditions hold such that $\hat\psi_a$ is an efficient estimator of the MW parameter. Then, for the above $\hat\psi_u$, $\hat\psi_a$ and $\hat\psi_m$, the variances $\sigma_u^2$, $\sigma_a^2$ and $\sigma_m^2$ have closed forms in terms of $\eta_P$ and $p_k$.

We now propose estimators for these quantities. Unlike in the case of the DIM estimand, $\eta_P(\cdot)$ depends on the unknown marginal distribution of $Y$ and needs to be estimated from the external data; a simple estimator is based on the empirical distribution. Finally, a natural plug-in estimator $\hat\sigma_m^2$ is obtained by substituting the fitted coefficients $(\hat\alpha, \hat\beta)$ from the proportional odds model, again obtained by maximizing the sample counterpart of (10). The relative efficiencies can be estimated as $\hat\phi_a = \hat\sigma_a^2/\hat\sigma_u^2$ and $\hat\phi_m = \hat\sigma_m^2/\hat\sigma_u^2$. The next theorem establishes the asymptotic properties of these estimators.

Theorem 3. Suppose that Conditions A1 and A2 hold. Suppose, in addition, that the random function $\hat r : \mathcal{W} \to \mathbb{R}$ is such that $\|\hat r - r\|_{L^2(P_W)} = o_P(n^{-1/4})$ and belongs to some fixed $P$-Donsker class $\mathcal{F}$ of functions with probability tending to one. Then, $\hat\phi_a$ is an efficient estimator of $\phi_a$ and $\hat\phi_m$ is an efficient estimator of $\phi_m$. The influence functions of $\hat\phi_a$ and $\hat\phi_m$ are given in Appendix B.
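The pairwise kernel $h$ and the empirical estimate of $\eta_P$ can be sketched directly; names are illustrative:

```python
import numpy as np

def mann_whitney_unadjusted(y1, y0):
    """psi_hat_u = average of h(Y1_i, Y0_j) over all treatment-control pairs,
    with h(x, y) = 1{x > y} + 1{x = y}/2."""
    d = np.asarray(y1)[:, None] - np.asarray(y0)[None, :]
    return np.mean((d > 0) + 0.5 * (d == 0))

def eta_and_pmf(y, K):
    """Empirical eta(k) = P(Y < k) + P(Y = k)/2 and p_k = P(Y = k), k = 1..K,
    computed from external data on an ordinal outcome with levels 1..K."""
    p = np.array([np.mean(y == k) for k in range(1, K + 1)])
    below = np.concatenate(([0.0], np.cumsum(p)[:-1]))  # P(Y < k)
    return below + p / 2, p
```

For example, with treatment outcomes (2, 3) and control outcomes (1, 2), the four pairwise kernel values are 1, 1/2, 1, and 1, giving $\hat\psi_u = 0.875$.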
The log odds ratio (LOR) is defined as $\psi = \frac{1}{K-1} \sum_{k=1}^{K-1} \log\left[ \frac{F_1(k)/\{1 - F_1(k)\}}{F_0(k)/\{1 - F_0(k)\}} \right]$, which is an average of the cumulative log odds ratios [9]. In general, one can also consider a weighted average. Employing this definition for the LOR ensures that the LOR is well-defined even in settings where a proportional odds assumption fails. The unadjusted estimator $\hat\psi_u$ is obtained by plugging the arm-specific empirical CDFs $\tilde F_a(k)$ into this expression. The working-model-based adjusted estimator $\hat\psi_m$ replaces $\tilde F_a(k)$ with the proportional odds model-based estimator $\hat F_a(k)$ that was defined earlier. Finally, let $\hat\psi_a$ be the covariate-adjusted estimator proposed in Díaz et al. [9], which achieves the semiparametric efficiency bound under regularity conditions. The following lemma gives the forms of the relevant variances.

Lemma 5. Suppose that the appropriate regularity conditions hold such that $\hat\psi_a$ is an efficient estimator of the LOR. Then, for the above $\hat\psi_u$, $\hat\psi_a$ and $\hat\psi_m$, the variances $\sigma_u^2$, $\sigma_a^2$ and $\sigma_m^2$ have closed forms.

Let $\hat\theta(k, w)$ be an estimator of the true conditional distribution function $\theta(k, w) := P(Y \le k \mid W = w)$ in the setting where external data are available. We estimate the relative efficiencies by $\hat\phi_a = \hat\sigma_a^2/\hat\sigma_u^2$ and $\hat\phi_m = \hat\sigma_m^2/\hat\sigma_u^2$, with the variance estimators all taking the same form for a certain choice of the estimator $\hat\theta_c$: for $\hat\sigma_a^2$, $\hat\theta_c(k, w)$ is replaced with $\hat\theta(k, w)$; for $\hat\sigma_u^2$, we use the empirical CDF $\tilde F(k)$; and, for $\hat\sigma_m^2$, we take $\hat\theta_c(k, w) = \theta_{\hat\alpha, \hat\beta}(k, w)$.

Theorem 4. Suppose that Conditions A1 and A2 hold and that there exists a constant $\delta > 0$ such that $\delta \le F(k) \le 1 - \delta$ for all $k \in \{1, \ldots, K-1\}$. Suppose, in addition, that, for all $k \in \{1, \ldots, K-1\}$, the random function $\hat\theta(k, \cdot) : \mathcal{W} \to \mathbb{R}$ is such that $\|\hat\theta(k, \cdot) - \theta(k, \cdot)\|_{L^2(P_W)} = o_P(n^{-1/4})$ and belongs to some fixed $P$-Donsker class $\mathcal{F}$ of functions with probability tending to one. Then, $\hat\phi_a$ is an efficient estimator of $\phi_a$. Moreover, $\hat\phi_m$ is an efficient estimator of $\phi_m$.

For all of the aforementioned treatment effect estimands, $\phi_a \in (0, 1]$, since the fully adjusted estimator achieves the semiparametric efficiency bound. However, $\phi_m$ might be larger than 1 if the proportional odds model is far from the truth.
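The unadjusted LOR computation can be sketched in a few lines; the unweighted average and the sign convention (treatment minus control, on the odds of $Y \le k$) are assumptions of this illustration, and names are illustrative:

```python
import numpy as np

def ecdf_at_cutoffs(y, K):
    """Arm-specific empirical CDF F_tilde(k), k = 1, ..., K-1."""
    return np.array([np.mean(y <= k) for k in range(1, K)])

def avg_log_odds_ratio(F1, F0):
    """Unweighted average of cumulative log odds ratios:
    psi = (K-1)^{-1} sum_k [logit F1(k) - logit F0(k)]."""
    logit = lambda p: np.log(p) - np.log1p(-p)
    return np.mean(logit(np.asarray(F1)) - logit(np.asarray(F0)))
```

Note that the empirical CDFs must be bounded away from 0 and 1 at every cut-point for the plug-in to be finite, which is the role of the constant $\delta$ in Theorem 4.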
Wald-type intervals are a standard approach for constructing confidence intervals when an asymptotically linear estimator $\hat\phi$ of $\phi$ is available. Specifically, a $(1-\alpha)$-CI is given by $\hat\phi \pm n^{-1/2} z_{1-\alpha/2} (P_n \hat\tau^2)^{1/2}$, where $z_{1-\alpha/2}$ is the $(1-\alpha/2)$-quantile of a standard normal distribution and $\hat\tau$ is the influence function of $\hat\phi$, except that unknown quantities are replaced by consistent estimates. However, there are certain cases where, even though such consistent estimates are used, the Wald-type confidence interval will not provide asymptotically valid coverage. A key time when this challenge arises occurs when the limiting distribution of a RAL estimator is degenerate because the influence function is almost everywhere zero, which will often occur in our setting when the relative efficiency is one. One such example arises in estimating the relative efficiency of the fully adjusted estimator to the unadjusted estimator for the ATE or DIM estimands. In this case, the influence function of $\hat\phi_a$ is almost everywhere zero when $r(w) = E_P[u(Y)]$ for almost all $w$, that is, when the covariates are not prognostic. While a Wald-type interval will typically achieve asymptotically valid coverage outside of these degenerate cases, there is no way to know in advance whether or not $P$ is such that degeneracy will occur. To overcome this challenge, we propose an alternative approach that yields a confidence set that achieves the desired coverage regardless of whether degeneracy occurs. Most importantly, the resulting confidence sets are valid regardless of whether or not the true relative efficiency is, in fact, one. The proposed confidence set is constructed as follows. Suppose that we have available a valid level $\alpha$ test of the null hypothesis $H_0 : \phi_a = 1$ (respectively, $\phi_m = 1$); one such test based on sample splitting is given in Appendix E. Denote the Wald confidence interval by $I_{\mathrm{wald}}$. We first test this null hypothesis. If we do not reject it, we take the $1-\alpha$ confidence set to be $I_{\mathrm{wald}} \cup \{1\}$.
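As a concrete sketch of this two-step construction (also covering the case, described next, where the test rejects), with illustrative names; confidence sets are represented as lists of disjoint intervals:

```python
def two_step_conf_set(wald_ci, reject_h0, convexify=False):
    """Two-step confidence set for a relative efficiency phi.
    reject_h0 is the outcome of a valid level-alpha test of H0: phi = 1
    (e.g., a sample-splitting test); wald_ci = (lo, hi) is the Wald interval.
    If H0 is rejected, the Wald interval is returned as is; otherwise the
    point {1} is added, optionally convexified into a single interval."""
    lo, hi = wald_ci
    if reject_h0:
        return [(lo, hi)]
    if lo <= 1.0 <= hi:
        return [(lo, hi)]            # {1} already covered
    if convexify:
        return [(min(lo, 1.0), max(hi, 1.0))]
    return [(lo, hi), (1.0, 1.0)]    # disconnected set I_wald union {1}
```

The `convexify` option corresponds to reporting the convex hull when a disconnected set is considered undesirable, at the cost of a wider interval.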
If, instead, the null hypothesis is rejected, we take the confidence set to be $I_{\mathrm{wald}}$. At first glance, it seems that the proposed approach may fail to achieve valid coverage given that it uses the same data twice: once to test the null hypothesis and a second time to form the Wald-type interval. Nevertheless, as we show in Appendix E, the confidence set resulting from this procedure in fact has at least $1-\alpha$ asymptotic probability of covering the truth. It may happen that $I_{\mathrm{wald}}$ and $\{1\}$ are disjoint. In this case, a disconnected confidence set $I_{\mathrm{wald}} \cup \{1\}$ can be reported or, if this is considered undesirable, the convex hull of this confidence set can be taken to form a confidence interval.

The inferential procedures described in the preceding subsection are based on closed-form expressions for the relative efficiency parameter in several problems and knowledge of the corresponding efficient influence functions. On the one hand, now that these expressions have been calculated, the estimators that we have presented can be used in any problem in which the relative efficiency takes this form. On the other hand, if a new effect estimand or working parametric model is of interest in a future setting, then new analytical calculations will need to be conducted to derive the closed-form expression for the relative efficiency parameter and to develop corresponding estimators and confidence intervals. Here, we propose an automated double bootstrap procedure that avoids the need to perform these potentially tedious analytic calculations. When doing so, we focus on the case where the goal is to infer about the relative efficiency of a new working-model-based adjusted estimator, that is, we focus on $\phi_m$. The reason for this choice is discussed at the end of this subsection. Before describing this procedure, we first investigate the applicability of a more traditional, one-layer bootstrap procedure.
Suppose that the relative efficiency parameter $\Phi_m : \mathcal{M} \to \mathbb{R}^+$ is sufficiently smooth so that a plug-in estimator of $\Phi_m(P)$ based on the empirical distribution is asymptotically linear [see, e.g., Theorem 20.8 in 31]. In this case, we can construct a plug-in estimator based on the empirical distribution $P_n$ of a sample of iid observations. We denote this plug-in estimator by $\Phi_m(P_n)$ and note that all the estimators proposed so far for $\phi_m$ correspond to plug-in estimators of this form. In a traditional setting where the bootstrap would be applied, a closed-form expression for the functional $\Phi_m$ would be available, and so would be the plug-in estimator $\Phi_m(P_n)$, and the goal would be to derive a corresponding confidence interval. In particular, let the $n$ entries of $X = (X_k)_{k=1}^n$ correspond to an iid sample of external data, where $X_k = (Y_k, W_k)$. We sample from $X$ with replacement $B_1$ times, to form the bootstrap resamples $X_i^*$ of size $n$ for $i = 1, \ldots, B_1$. Letting $P_{n,i}^*$ denote the empirical distribution of the observations in $X_i^*$, we could then use the empirical standard deviation of $\Phi_m(P_{n,i}^*)$, $i = 1, \ldots, B_1$, as the standard error estimate used to construct a confidence interval centered around $\Phi_m(P_n)$. Though this traditional bootstrap approach is useful in that it avoids explicitly computing the influence function of $\Phi_m(P_n)$, it does not fully avoid the aforementioned analytic calculations. Indeed, in many cases, deriving the plug-in estimator will itself require deriving the explicit form of the relative efficiency parameter, which in turn relies on computing inefficient and efficient gradients of the treatment effect estimand in the model $\mathcal{M}^X$. Computing these gradients requires specialized calculations that are unfamiliar to many practitioners. To avoid this challenge, we approximate the plug-in estimator with an alternative estimator $\tilde\phi_n$ that can be obtained in a fully automated fashion.
Specifically, we propose to use an additional layer of resampling to approximate $\phi_m$. Evaluating the resulting estimation strategy only requires having access to the external data and the treatment effect estimator that will be used to analyze data from the future clinical trial. Let $X_i^*$, $i = 1, \ldots, B_1$, be the first-layer resamples, and in addition define $X_0^* := X$. For each $i \ge 0$, we then let $\tilde X_{ij}$, $j = 1, \ldots, B_2$, denote an iid sample of size $N$ from the product measure $P_{n,i}^* \times \Pi$, where $\Pi$ is the known distribution of treatment. To simulate $\tilde X_{ij}$, we first draw an iid sample of size $N$ from the empirical distribution $P_{n,i}^*$ and then append a random draw of the treatment vector, which is a tuple consisting of $N$ iid draws from a Bernoulli($\pi$) distribution. For each $X_i^*$, we will construct an estimator $\tilde\phi_n(X_i^*)$ using the collection of second-layer resamples. Specifically, for each $(i, j)$, we compute the adjusted and unadjusted estimators based on the sample $\tilde X_{ij}$. A stochastic approximation of the parameter evaluation $\Phi_m(P_{n,i}^*)$ is then given by the ratio of the empirical variance of the adjusted estimates to that of the unadjusted estimates across $j = 1, \ldots, B_2$. We note that $\tilde\phi_n(X_i^*)$ depends on $\{\tilde X_{ij}\}_{j=1}^{B_2}$, and therefore also on $B_2$ and $N$; we omit these dependencies in the notation. A Wald-type bootstrap confidence interval can be constructed by using the empirical standard deviation of $\tilde\phi_n(X_i^*)$ over $i = 1, \ldots, B_1$ as the standard error and $\tilde\phi_n(X) := \tilde\phi_n(X_0^*)$ as the center. This double bootstrap procedure is summarized in Algorithm 1 in Appendix F. We now provide some intuition for why the above-described double bootstrap procedure is expected to work; we then provide a theorem that formalizes these arguments. First, we observe that the double bootstrap procedure is analogous to a traditional single-layer bootstrap, except that we replace the plug-in estimator with an estimator $\tilde\phi_n$, which is itself defined through an additional layer of bootstrap.
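The two-layer scheme just described can be sketched as follows. This is an illustrative reimplementation, not Algorithm 1 itself: `psi_u` and `psi_m` stand in for the unadjusted and working-model-based estimators that would be used in the future trial, and the default choices of `B1`, `B2`, and `N` are placeholders.

```python
import numpy as np

def double_bootstrap(y, w, psi_u, psi_m, pi=0.5, B1=50, B2=200, N=None, seed=0):
    """Sketch of the double bootstrap: the outer layer resamples the external
    data B1 times; for each resample, the inner layer simulates B2 trial
    datasets of size N from P*_{n,i} x Bernoulli(pi) and takes the ratio of
    the empirical variances of the working-model-based and unadjusted
    estimators. Returns the center phi_tilde(X) and a bootstrap SE."""
    rng = np.random.default_rng(seed)
    n = len(y)
    N = N or 4 * n  # second-layer sample size should exceed n (cf. Condition B4)

    def phi_tilde(yb, wb):
        est = np.empty((B2, 2))
        for j in range(B2):
            l = rng.integers(0, len(yb), N)   # iid draw of size N from P*_{n,i}
            a = rng.random(N) < pi            # appended treatment labels
            est[j] = psi_u(yb[l], wb[l], a), psi_m(yb[l], wb[l], a)
        return np.var(est[:, 1]) / np.var(est[:, 0])

    center = phi_tilde(y, w)                  # phi_tilde(X), i.e., i = 0
    reps = []
    for _ in range(B1):
        idx = rng.integers(0, n, n)           # first-layer resample X*_i
        reps.append(phi_tilde(y[idx], w[idx]))
    return center, float(np.std(reps))
```

Only black-box evaluations of the two estimators are needed, which is the point of the procedure: no gradients or influence functions are derived.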
Intuitively, if $\tilde\phi_n(X_i^*)$ is close enough to the plug-in estimator $\Phi_m(P_{n,i}^*)$ on all $X_i^*$, we would expect that using the stochastic approximation instead of the plug-in makes little difference, so that the procedure works similarly to the traditional bootstrap. We now give heuristic arguments showing that $\tilde\phi_n(X_i^*)$ and $\Phi_m(P_{n,i}^*)$ should indeed be close, in the sense that their difference is asymptotically negligible. To see this, for an arbitrary $j \in \{1, \ldots, B_2\}$, consider a general asymptotically linear estimator $\hat\psi$ of the treatment effect that satisfies $\hat\psi(\tilde X_{ij}) = \psi + N^{-1} \sum_{l=1}^{N} D(\tilde X_{ij}^l) + \mathrm{Rem}$ (12), where $\tilde X_{ij}^l$ is the $l$-th observation in $\tilde X_{ij}$ and $D$ is the corresponding influence function. In our upcoming theorem, we will assume that Rem is negligible in an appropriate sense. Suppose that we take sufficiently many samples from $P_{n,i}^* \times \Pi$, that is, that $B_2$ is sufficiently large, so that the Monte Carlo error from the second bootstrap layer is negligible. We can then accurately approximate the sampling distribution of $\hat\psi$ under $P_{n,i}^* \times \Pi$ by the empirical distribution of $\{\hat\psi(\tilde X_{ij})\}_{j=1}^{B_2}$. Applying these arguments at $\hat\psi_u$ and $\hat\psi_m$ suggests that $\tilde\phi_n(X_i^*)$ accurately approximates $\tilde\sigma_{m,i}^2/\tilde\sigma_{u,i}^2$, where $\tilde\sigma_{u,i}^2$ and $\tilde\sigma_{m,i}^2$ are the variances of the sampling distributions of $\hat\psi_u$ and $\hat\psi_m$ when $\tilde X_{ij}$ is an iid sample from $P_{n,i}^* \times \Pi$. In addition, provided that $N \gg n$, so that the remainders in the above linear expansion (12) are sufficiently small when $\hat\psi$ is equal to $\hat\psi_u$ or $\hat\psi_m$, the ratio between these variances, $\tilde\sigma_{m,i}^2/\tilde\sigma_{u,i}^2$, will be close to the plug-in value $\Phi_m(P_{n,i}^*)$. As a result, we expect $\tilde\phi_n(X_i^*)$ to be reasonably close to the plug-in estimator $\Phi_m(P_{n,i}^*)$.

The upcoming theorem formalizes the heuristic argument given in the previous paragraph. Before giving this result, we note that Hadamard differentiability is the key differentiability concept used for establishing the theoretical guarantees. We will assume that the following conditions hold:

Condition B1. Both $\sigma_u^2(\cdot)$ and $\sigma_m^2(\cdot)$ are Hadamard differentiable;

Condition B2. There exists a $\gamma \in (1/2, \infty)$ such that the remainder Rem in (12) is of order $N^{-\gamma}$ in expectation, where the expectation is over the draw of the bootstrap sample $P_{n,1}^*$ and $X_1, X_2, \ldots$;

Condition B3.
$B_2$ grows with $n$ quickly enough that the Monte Carlo error from the second bootstrap layer is asymptotically negligible;

Condition B4. $N$ grows with $n$ quickly enough that the remainder in (12) is asymptotically negligible after scaling by $\sqrt{n}$.

We are now ready to state the theorem.

Theorem 5. Under Conditions B1-B4, $\sqrt{n}\{\tilde\phi_n(X_1^*) - \tilde\phi_n(X)\}$ converges conditionally in distribution to $\Phi_m'(G)$, where $\Phi_m'$ is the Hadamard derivative of the functional $\Phi_m$ and $G$ is a mean-zero Gaussian process with covariance $\mathrm{cov}(Gf_1, Gf_2) = P(f_1 f_2) - (Pf_1)(Pf_2)$.

The proof is a modification of the proof of Theorem 23.9 in Van der Vaart [31], and is given in the appendix. To approximate the limiting distribution $\Phi_m'(G)$ given in the above theorem, Algorithm 1 uses the empirical distribution of $\sqrt{n}\{\tilde\phi_n(X_i^*) - \tilde\phi_n(X)\}$ across the $B_1$ bootstrap replicates $X_1^*, \ldots, X_{B_1}^*$. We now discuss the conditions of Theorem 5. Condition B1 ensures the Hadamard differentiability of the relative efficiency parameter $\Phi_m(\cdot)$, which is used in most standard sets of sufficient conditions for the validity of bootstrap methods. We can establish these Hadamard differentiability conditions by noting that the variance is essentially the mean of a function indexed by nuisance parameters, which are themselves transformations of some population means. We use the Mann-Whitney estimand in the ordinal outcome case as an example. The variance of the adjusted estimator takes the form of a population mean of a function indexed by nuisance parameters $b$, $\alpha$, $\beta$, which are defined, either explicitly or implicitly, through a set of population means. The mean functional is Hadamard differentiable, for example, when the support is bounded. One can then apply the chain rule of Hadamard differentiability as in [14]. Condition B2 ensures that the remainder term in the asymptotic linear expansion is sufficiently small. For the examples we have considered, it is possible to show that we can take $\gamma = 1$ under mild conditions. Conditions B3 and B4 require that the user selects sufficiently large values for $B_2$ and $N$. Condition B3 places a restriction on the Monte Carlo approximation $\tilde\phi_n(X_1^*)$ of $\tilde\sigma_{m,1}^2/\tilde\sigma_{u,1}^2$. In most cases, this condition will hold provided the number of second-layer bootstrap samples goes to infinity faster than does $n$, that is, so that $n/B_2 \to 0$.
Condition B4 places a restriction on the sample size $N$ of each second-layer bootstrap sample. When $\gamma = 1$, this condition requires that these samples be of a larger order than the original sample size $n$. Taken together, Conditions B3 and B4 impose that sufficient computing power must be available to compute the estimator $\hat\psi$ approximately $B_1 B_2$ times on samples of size $N$; in contrast, the analytic method in the previous section only required fitting the estimator $\hat\phi_m$ (and estimating its standard error) once on a sample of size $n$. We conclude by noting that we can define a double bootstrap procedure analogous to Algorithm 1 for the estimation of the relative efficiency $\phi_a$ of a fully adjusted estimator. However, our arguments cannot generally be used to establish the validity of double bootstrap confidence intervals for $\phi_a$. The problem arises because the asymptotic variance of the fully adjusted estimator often involves a regression function of the outcome against the covariate. Because the statistical model is nonparametric up to knowledge of the treatment probability, this dependence will often make it so that the parameter $\sigma_a^2(\cdot)$ is not Hadamard differentiable, and so the theoretical guarantee presented above for our double bootstrap procedure may not apply. It is therefore an open question whether the double bootstrap will yield valid confidence intervals for the relative efficiency of fully adjusted estimators.

We now consider settings where the outcome in the trial is only partially observed. For this purpose, we use the notion of coarsening-at-random [12, 13]. Let $Z^t = (T^t, A, W^t)$ be the full data unit in the trial, $C^t$ be a coarsening variable, and $X^t = G_a(Z^t, C^t)$ be the observation unit in the trial, where $G_a(\cdot, \cdot)$ is some many-to-one function. We further assume that, under $G_a$, the covariate $W^t$ is fully observed. The adjusted analysis estimates the treatment effect $\psi$ based on $X^t$.
We can write $X^t$ as $(\tilde X^t, W^t)$, where $\tilde X^t$ represents the components of $X^t$ that are not covariates and $W^t$ is the covariate vector. Define a function $c(\cdot)$ such that $c(X^t) := \tilde X^t$, and write $G_u$ to denote the composition $c \circ G_a$. The unadjusted analysis ignores the covariate information, which is equivalent to working with the observation unit $\tilde X^t = G_u(Z^t, C^t)$. The relative efficiency, defined in terms of the variances of the unadjusted and adjusted estimators, is interesting only when both analyses give consistent estimators. Thus, we will assume that both conditional distributions $G_a(Z^t, C^t) \mid Z^t$ and $G_u(Z^t, C^t) \mid Z^t$ satisfy the coarsening-at-random assumption, so that both the unadjusted and adjusted analyses are asymptotically unbiased for the treatment effect. Let $\nu$ denote the distribution of $X^t$. We again define the relative efficiencies by focusing on trials under the sharp null, that is, settings in which the conditional distributions $T^t \mid A = 0, W^t$ and $T^t \mid A = 1, W^t$ are the same. We let $G(\cdot \mid a, w)$ denote the conditional distribution of $C^t$ given that $(A, W^t) = (a, w)$. Under the sharp null, $\nu$ is fully characterized by the treatment distribution $\Pi$, the conditional distribution of $C^t$ characterized by $G$, and the joint distribution of $(T^t, W^t)$, denoted by $P$; when we wish to emphasize this dependence, we write $\nu_{\Pi,G,P}$. We define the relative efficiencies as $\Phi_a(\nu_{\Pi,G,P}) = \mathrm{Var}_\nu(D_a)/\mathrm{Var}_\nu(D_u)$ and $\Phi_m(\nu_{\Pi,G,P}) = \mathrm{Var}_\nu(D_m)/\mathrm{Var}_\nu(D_u)$ (13), where $D_u$, $D_a$ and $D_m$ are the influence functions of the unadjusted, fully adjusted and working-model-based adjusted estimators, respectively. We will often suppress the dependence of these relative efficiencies on $\Pi$ and $G$ in the notation by writing $\Phi_a(P)$ and $\Phi_m(P)$. We aim to identify and estimate these relative efficiencies from external data available at the trial planning stage. Like the future trial data, the external data can be subject to coarsening. Let $C$ be the coarsening variable, and $\Gamma(\cdot, \cdot)$ be a many-to-one function.
The full data unit in the external dataset is Z = (T, W ), and the observed data unit is X = Γ(Z, C). Let Q be the distribution of X, induced by the joint distribution of (Z, C) and the many-to-one function Γ. To identify the relative efficiencies from the observed external data X, we assume the following condition holds throughout this section. This condition is similar to Condition A2 and assumes in addition that coarsening-at-random holds in the external data. Condition A3. A full data unit in the external data Z = (T, W ) has distribution P , and the conditional distribution Γ(Z, C) | Z satisfies the coarsening-at-random assumption. Under this condition, it is possible to identify the relative efficiencies in (13) as parameters of the distribution of the observed external data, and also to show that, under reasonable conditions, these parameters will be smooth enough so that it should be possible to develop regular and asymptotically linear estimators based on the external data [Theorem 1.3 in 29]. The external data might be obtained from various settings including observational studies, some of which are distinct from randomized clinical trials. Consequently, the reasons for coarsening can be much different from those in the future trial. For example, for time-to-event data, administrative censoring may account for a large proportion of right censoring in clinical trials, but a lesser proportion for observational data. Thus, it is often not plausible to assume that we can identify G from the external data. To overcome this issue, we define the relative efficiencies for a particular G, and the user can choose a coarsening mechanism that is expected to reflect the setting of a future trial. In Appendix A, we use the identifiability result stemming from Condition A3 to develop estimators and confidence intervals for the relative efficiency in settings where there are time-to-event outcomes with right censoring. 
In this case, $T^t$ is the time to some event of interest and $C^t$ is the censoring time. The mapping that gives rise to this observation unit is given by $G_a(z^t, c^t) = (\min\{t^t, c^t\}, I\{t^t \le c^t\}, a, w^t)$. The validity of the unadjusted analysis relies on the condition that $T^t \perp C^t \mid A$, while the validity of the adjusted analysis relies on the condition that $T^t \perp C^t \mid (A, W^t)$.

For the ordinal outcome case, we generate data based on a CDC report describing the age distribution and probabilities of various outcomes within age groups for hospitalized Covid-19 patients [4], which are also presented in Table 1. The ordinal outcome is assigned the value 1, 2, or 3 for "death", "ICU and survived", or "no ICU and survived", respectively. Age category is the only covariate we adjust for. We consider both the fully adjusted and working-model-based estimators. The relative efficiency of the fully adjusted estimators is estimated with the analytical approach, while for the working-model-based estimators we use both the analytical and the bootstrap approaches. We consider three estimands of the treatment effect: difference in means, Mann-Whitney, and average log odds ratio. As the covariate is ordinal as well, the nuisance conditional mean functions are estimated by sample averages within each age group. In the analytical approach, we build Wald-type confidence intervals on the logit scale first and then transform them. For the bootstrap, we take the numbers of bootstrap resamples in the two layers to be 100 and 500. Though these resample sizes are small compared to those used in typical applications of the bootstrap, we use them to reduce the computational cost of this Monte Carlo simulation. We run 1,000 replications for the analytical approach and 200 for the bootstrap. The simulation results for sample size 1,000 are presented in Table 2.
We observe that, despite the small number of resamples, the bootstrap procedure gives approximately 95% coverage, but that this estimator has larger variance than the analytical estimator across all settings considered. We expect the performance to improve as the number of resamples increases. The coverage of the analytical approach is close to the nominal level. Additional results for sample sizes 200 and 500 are given in Appendix D. We note that, as the true relative efficiency is strictly less than 1, the confidence sets constructed using the two-step approach detailed in Section 3.2 have the same coverage.

For survival outcomes, we only consider the relative efficiency of the fully adjusted estimators. We generate a univariate covariate $W \sim \mathrm{Uniform}(0, 1)$, and the survival time follows an exponential distribution, $Y \mid W \sim \mathrm{Exp}\{(1 + 9W)/10\}$. The censoring time $C$ in the external data is generated from an $\mathrm{Exp}(0.1)$ distribution, and the user-specified censoring mechanism in the trial is taken to be the same, that is, $\mathrm{Exp}(0.1)$. We again consider three estimands: risk difference (RD), relative risk (RR), and restricted mean survival time (RMST). With continuous time, the nuisance functions are estimated using a sequence of Cox proportional hazards models with polynomials of the covariate. We select the best model based on BIC. For discrete time, we use a proportional odds model instead, which slightly outperforms the Cox model in the simulations. Results for sample size 1,000 are presented in Table 3. The coverage of the confidence intervals is close to the nominal level across all settings. The uncertainty in the estimates becomes larger as the time $t$ increases, due to the reduced size of the risk set. The R scripts for all the simulation experiments are available as supplementary files.

We apply the proposed methods to assess the efficiency gain of covariate adjustment using Covid-19 data. Ordinal Outcome.
We use the following mutually exclusive ordinal outcome based on the severity of a patient's Covid-19 status: (1) censored or discharged, (2) intubation or ventilation, and (3) death. Among 40 patients who had been admitted twice, only 14 patients had different outcomes between the two visits (9 patients were classified as 2 during the first admission and as 1 in the second admission, while 5 patients transitioned from 1 to 2). Among 3 patients who had been admitted three times, only 1 patient had different ordinal outcomes across the 3 visits, being classified as 1, 2, and 1, respectively. No patient who had been admitted more than once died. To deal with duplicated observations for these patients, we only include the observation with the more severe outcome. As a result of the above classification, there are 207 (60%) censored/discharged, 59 (17%) intubation or ventilation, and 79 (23%) deaths. We consider three estimands of treatment effects: difference in means (DIM), Mann-Whitney (MW), and average log odds ratio (LOR). To estimate the nuisance functions for the fully adjusted estimators, we fit a series of polynomial regressions of orders 1 to 5 and then select the optimal model based on the BIC score for DIM and MW. For LOR, these nuisance functions are estimated by fitting proportional odds models with polynomials of orders 1 to 5 and selecting the best model based on BIC. We present the relative efficiency of covariate-adjusted estimators that adjust for all the covariates in Table 4. The estimated efficiency gain is about 7% for the fully adjusted estimator, whereas for the working-model-based estimators we do not see evidence of a significant efficiency gain. In contrast, adjusting for a single baseline covariate gives an estimated efficiency gain ranging from 1% to 5%, and the difference between the fully adjusted and working-model-based estimators is negligible when only adjusting for one covariate. We leave the details to Appendix D. Survival Outcome.
We choose the time point of interest to be $t = 350$ hours, where the overall survival is around 70%, and assess the relative efficiency for survival outcomes. We consider three estimands of treatment effects: risk difference (RD), relative risk (RR), and restricted mean survival time (RMST). We use the elastic net [11] for variable selection, where the tuning penalty parameter is selected via 5-fold cross-validation; in particular, we select those variables with nonzero coefficients. To estimate the nuisance functions, we then fit a sequence of Cox proportional hazards models with polynomials of orders 1 to 7 of the selected variables and select the model with the smallest BIC score. The results are shown in Table 5. Adjusting for a single baseline covariate gives an efficiency gain ranging from 1% to 9% in estimating RD or RR, with age being the most prognostic factor. A similar trend is observed for RMST. Using the elastic net, we select the following 4 baseline factors: age, CVD, chronic kidney disease, and cholesterol medications. Adjusting for these four factors gives an 11% efficiency gain in estimating RD, RR, and RMST.

In this paper, we presented a framework for using external data to infer about the relative efficiency of covariate-adjusted analyses in a future clinical trial. We also exhibited the applicability of our framework for a variety of treatment effect estimands of particular interest. For each of these estimands, we introduced a consistent and asymptotically normal estimator of the relative efficiency and provided an analytic means to develop Wald-type confidence intervals. We also introduced a double bootstrap scheme that enables confidence interval construction in certain problems even when an analytic form for the standard error is not available. When the outcome is only partially observed, standard unadjusted and adjusted analyses typically rely on different conditions on the coarsening mechanism for their validity. The relative efficiency we considered is based on a sharp null setting where the treatment has no effect.
As a consequence, we do not need to specify the full distribution of Y t |A, W t expected in the trial. Moreover, if the treatment effect estimator is regular, which is the case for all those that we considered, then the relative efficiency at this sharp null also serves as an accurate approximation to the relative efficiency under a variety of local alternatives. Though accurate in such settings, we expect that this approximation may be poor when the treatment is extremely beneficial in some subgroups while being quite harmful in some others. While a subgroup analysis might be able to detect this after the trial is completed, it is not generally possible to know a priori whether this kind of subgroup effect exists. An alternative approach would involve specifying a particular alternative distribution that the investigator is interested in. In this case, the relative efficiency under that alternative can be derived and estimated. Our framework for estimating relative efficiencies based on external data can be easily modified for this setting. Observational settings and clinical trials can be quite different in terms of coarsening, and thus we define relative efficiency for a user-specified coarsening mechanism that approximates that of the future trial. This also extends to the case where the covariate distribution is different between the external data and the future clinical trial due to, for example, trial eligibility criteria. In such cases, a particular covariate distribution for the future trial can be imposed when defining the relative efficiency, and the external data can then be used to estimate the distribution of the outcome conditional on covariates. This appendix is organized as follows. In Appendix A, we develop estimators and confidence intervals for the relative efficiency in settings where there are time-to-event outcomes with right censoring. In Appendix B, we prove lemmas and theorems in Section 3 on continuous and ordinal outcomes. 
In Appendix C, we prove lemmas and theorems in Appendix A on time-to-event outcomes with right censoring. In Appendix D, we show some additional experimental results. In Appendix E, we develop a two-step procedure with sample splitting to construct confidence intervals, and show that it achieves nominal coverage. In Appendix F, we give the pseudocode for the double bootstrap scheme presented in Section 3.

A Estimation of relative efficiencies for time-to-event outcomes with right censoring

We consider three estimands of treatment effect, which are all functionals of the treatment-arm-specific survival function $S_a(t) := P(T^t > t \mid A = a)$. For the unadjusted analysis, we consider plug-in estimators based on the treatment-arm-specific Kaplan-Meier estimator [17], which we denote by $\tilde S_a(t)$. Such plug-in estimators are consistent and asymptotically linear provided that $T^t \perp C^t \mid A$ [see, for example, 8]. In contrast, the consistency of covariate-adjusted estimators often relies on the assumption that $T^t \perp C^t \mid (W^t, A)$. In fact, many recently proposed adjusted estimators are based on the efficient influence function of the treatment effect estimand in a model where the only assumption is that $T^t \perp C^t \mid (W^t, A)$ [e.g., 22, 27, 8]. Under regularity conditions, these estimators achieve the semiparametric efficiency bound in this model. Constructing these estimators often requires estimation of nuisance functions such as the conditional hazard function $h_a(t, w) = P(T^t = t \mid T^t \ge t, A = a, W^t = w)$, the conditional survival function $S_a(t, w) = P(T^t > t \mid A = a, W^t = w)$, the conditional distribution of the censoring time $G_a(t, w) := P(C^t \ge t \mid A = a, W^t = w)$, or the treatment mechanism $\pi(w) = P(A = 1 \mid W^t = w)$. We call these estimators "fully adjusted".
As discussed in Section 4, the efficiency of an adjusted estimator relative to that of an unadjusted estimator is relevant only when both estimators are consistent - as noted earlier, a sufficient condition for this to hold is that the observed data arise from a distribution in the intersection model consisting of all distributions of (Z_t, C_t) for which T_t ⊥ C_t | (W_t, A) and T_t ⊥ C_t | A. Notably, there is no general guarantee that a fully adjusted estimator will be efficient relative to the observed data model consisting of the distributions of the coarsened data G_a(Z_t, C_t) generated by sampling (Z_t, C_t) from a distribution in this intersection model. Stated more plainly, if it is known in advance that both the adjusted and unadjusted survival function estimators are consistent, then, in certain cases, there may exist an even more efficient estimator of this survival function. Unlike in the cases of continuous or ordinal outcomes that we considered in Section 3, we are not aware of a parametric working model for the conditional distribution of T_t | A, W_t that yields a RAL estimator of S_0 and S_1 when marginalized over the distribution of the covariate W_t. Nevertheless, it is possible to define adjusted estimators based on working models in this setting. To see this, note that many of the aforementioned fully adjusted estimators have the doubly robust property: they are consistent if either (S_0, S_1) or (G, π) is correctly specified, and are efficient if both are correctly specified. This allows us to use potentially misspecified parametric working models to estimate (S_0, S_1), as long as we estimate the distribution of the censoring time using a correctly specified semiparametric or nonparametric model - this is the case, for example, if we estimate the censoring distribution via a correctly specified arm-specific Kaplan-Meier estimator. Such estimators are, however, rarely used in practice.
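The doubly robust phenomenon invoked here can be illustrated outside the survival setting with the simplest coarsened-data analogue: estimating a mean E[Y] when Y is missing at random given W. The simulation below is ours and purely illustrative (the `dr_mean` helper and all variable names are assumptions, not the paper's estimators); it shows an AIPW-style estimator remaining consistent under a grossly misspecified outcome model so long as the coarsening probability is correct, while the naive complete-case mean is biased.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
w = rng.uniform(size=n)
y = w + rng.normal(scale=0.1, size=n)   # E[Y] = 0.5
g = 0.3 + 0.4 * w                       # true P(observed | W): depends on W
delta = rng.binomial(1, g)              # observation (non-coarsening) indicator

def dr_mean(y, delta, g_hat, m_hat):
    """Doubly robust mean: consistent if either g_hat or m_hat is correct."""
    return float(np.mean(m_hat + delta / g_hat * (y - m_hat)))

wrong_m = np.zeros(n)                   # badly misspecified outcome model
dr_est = dr_mean(y, delta, g, wrong_m)  # correct g rescues consistency
cc_est = float(y[delta == 1].mean())    # complete-case mean: biased upward
```

Here `dr_est` lands near the truth 0.5 despite the zero outcome model, whereas `cc_est` concentrates near E[gY]/E[g] ≈ 0.567, since subjects with large W are both more likely to be observed and have larger outcomes.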
We therefore focus on computing the relative efficiency of fully adjusted estimators, which see more use, as compared to that of unadjusted estimators. As in previous works [22, 27, 8], we assume that the survival and censoring times are discrete and take values in {t_1, t_2, . . . , t_K}. We let t_0 = 0 be the baseline time. We expect that similar derivations can be carried out for continuous time, and in the simulation studies we empirically validate the performance of our proposed methods when time is measured on a continuous scale. The first two estimands we consider focus on survival functions at a specific time point. The risk difference (RD) is defined as S_0(t_k) − S_1(t_k) for a time t_k of interest. The relative risk (RR) is defined as {1 − S_1(t_k)}/{1 − S_0(t_k)}. The unadjusted estimators are the plug-in estimators S̃_0(t_k) − S̃_1(t_k) for RD and {1 − S̃_1(t_k)}/{1 − S̃_0(t_k)} for RR, where S̃_a is the Kaplan-Meier estimator within each treatment group. Let Ŝ_a denote the efficient adjusted estimator proposed in Moore and van der Laan [20]. For each of the two estimands under consideration, we refer to the estimator that replaces S̃_a in the unadjusted estimator with Ŝ_a as the fully adjusted estimator. Recall that Ŝ_a is a consistent estimator of S_a when T_t ⊥ C_t | (A, W_t). Under additional regularity conditions given in Theorem 1 of Moore and van der Laan [22], for each a ∈ {0, 1} and k ∈ {1, . . . , K}, Ŝ_a(t_k) is an asymptotically linear estimator of S_a(t_k) with the influence function given in (14). Moreover, for each a ∈ {0, 1} and k ∈ {1, . . . , K}, S̃_a(t_k) is a RAL estimator of S_a(t_k) when T_t ⊥ C_t | A, with the influence function given in (15) [see, e.g., 8]. Here, h_a(t) is the hazard corresponding to S_a at time t and G_a(t) := P(C_t ≥ t | A = a). The influence functions of the fully adjusted and unadjusted estimators of the treatment effect estimand, which we denote by D_a and D_u, respectively, can then be derived via the delta method. As in Section 4, we define the relative efficiency as the ratio between the variances of D_a and D_u under the sharp null.
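Given arm-specific survival estimates on the grid t_1 < · · · < t_K (from either the unadjusted or the fully adjusted fit), the estimands are simple transformations of the two curves. The sketch below (function and argument names are ours) maps the curves to RD, RR, and the RMST contrast Σ_j {S_1(t_j) − S_0(t_j)} discussed later in this appendix.

```python
import numpy as np

def survival_contrasts(s0, s1, k_idx):
    """Plug-in treatment effect estimands from two survival curves.

    s0, s1 : estimated survival curves S_0, S_1 evaluated at t_1, ..., t_K
    k_idx  : 0-based index of the time point t_k of interest
    """
    rd = s0[k_idx] - s1[k_idx]                                # risk difference
    rr = (1.0 - s1[k_idx]) / (1.0 - s0[k_idx])                # relative risk of failure by t_k
    rmst = float(np.sum(s1[: k_idx + 1] - s0[: k_idx + 1]))   # RMST contrast up to t_k
    return rd, rr, rmst
```

Note that RD is defined as S_0(t_k) − S_1(t_k), so a beneficial treatment (higher survival) yields a negative RD and an RR below one.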
In such cases, the distribution of the observed data in the trial is characterized by the marginal distribution of A, denoted by Π, the joint distribution of (T_t, W_t), denoted by P, and the conditional distribution of C_t given (A, W_t). In particular, this implies that S_1(t, w) = S_0(t, w) = S(t, w) for all (t, w), where S(t, w) := P(T_t > t | W_t = w) is the conditional survival function under P, and also that S_1(t) = S_0(t) = S(t) for all t, where S(t) := P(T_t > t) is the marginal survival function under P. To simplify the presentation, we suppose additionally that C_t ⊥ A | W_t = w, and write G(t, w) := P(C_t ≥ t | W_t = w) and G(t) := P(C_t ≥ t). For given G and Π, the relative efficiency parameter is a functional of P. Before presenting the form of the relative efficiency, we introduce some additional notation. For (T, W) ∼ P, let h(t, w) := P(T = t | T ≥ t, W = w) and h(t) := P(T = t | T ≥ t) be the conditional and marginal hazard functions under P, respectively. We define the following quantities, which will be useful throughout this section: Interestingly, in the null case that we consider, the relative efficiencies are the same for the RD and RR estimands. Lemma 6. Suppose that Conditions A1 and A3 hold and, in addition, that S(t_k) < 1. Suppose that Ŝ_a and S̃_a are asymptotically linear with the influence functions given in (14) and (15), respectively. For both the RR and the RD estimand, the relative efficiency of the fully adjusted estimator as compared to the unadjusted estimator is given by Φ_a(P) = σ²_a(P)/σ²_u(P), where In what follows, we will often write φ_a for Φ_a(P). To estimate φ_a from the external data (Y, ∆, W), we estimate σ²_u and σ²_a separately. We observe that σ²_u is a transformation of S(t), and hence we construct a plug-in estimator ŝ^{kl}_j of s^{kl}_j using the covariate-adjusted estimator of S(t) given in Moore and van der Laan [20], and estimate σ²_u by σ̂²_u = Σ_{j=1}^{k} ŝ^{kk}_j / G(t_j).
We estimate σ²_a using one-step estimation based on its EIF. Recall that C is the censoring time in the external data. We define H(t, w) := P(C ≥ t | W = w). For notational convenience, we define the following function, which appears multiple times in the EIF of σ²_a: The efficient influence function of σ²_a relative to the observed data model is The derivation of this expression is deferred to Appendix C. Let Ŝ(t, w) and ĥ(t, w) be estimators of the conditional survival and hazard functions, respectively, and let Ĥ(t, w) be an estimator of the conditional censoring distribution. Define ĝ^{kk}_j and f̂^{kk}_j by plugging in these estimates. We estimate σ²_a with the resulting one-step estimator σ̂²_a, and then estimate φ_a by φ̂ = σ̂²_a/σ̂²_u. The properties of φ̂ are given in the following theorem; the influence function of φ̂ is given in Appendix C. The final treatment effect estimand we consider is the restricted mean survival time (RMST), defined as ψ = Σ_{j=1}^{k} {S_1(t_j) − S_0(t_j)}. We again consider two plug-in estimators: the unadjusted KM-based estimator and the fully adjusted estimator. Lemma 7. Suppose that Conditions A1 and A3 hold. Then the relative efficiency φ_a takes the form φ_a = σ²_a(P)/σ²_u(P), where We construct σ̂²_u in a similar way as in the case of the risk difference, namely by plugging in an efficient adjusted estimator of S(·). As for σ²_a, its EIF can be derived in a similar fashion as in the case of the RD, and we defer the details to Appendix C. We propose the following estimator The relative efficiency is estimated by φ̂ = σ̂²_a/σ̂²_u. Based on (20), it appears that computing the quadruple sum used to define σ̂²_a will take O(nk³) time. As it turns out, these sums can be computed much more efficiently. In Appendix C, we show that, for a given i and given estimates of τ and S, the inner three sums can be computed in O(k) time, resulting in an O(nk) complexity for computing the above quadruple sum. Theorem 7. Under the same conditions as in Theorem 6, φ̂ is an efficient estimator of φ_a.
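The O(nk) claim rests on collapsing nested tail sums with cumulative sums. The generic device is shown below on an arbitrary ordered triple sum Σ_{j ≤ l ≤ m} a_j b_l c_m rather than on the paper's actual τ and S terms, so `a`, `b`, `c` are placeholder arrays, not quantities defined in Appendix C.

```python
import numpy as np

def triple_sum_naive(a, b, c):
    """Direct O(k^3) evaluation of sum_{j <= l <= m} a[j] * b[l] * c[m]."""
    k, total = len(a), 0.0
    for j in range(k):
        for l in range(j, k):
            for m in range(l, k):
                total += a[j] * b[l] * c[m]
    return total

def triple_sum_fast(a, b, c):
    """Same sum in O(k): each nested tail sum becomes a suffix cumulative sum."""
    C = np.cumsum(c[::-1])[::-1]          # C[m] = sum_{m' >= m} c[m']
    B = np.cumsum((b * C)[::-1])[::-1]    # B[l] = sum_{l' >= l} b[l'] * C[l']
    return float(np.sum(a * B))           # sum_j a[j] * B[j]
```

Applying this once per subject i replaces the O(k³) inner work by O(k), which is the source of the overall O(nk) complexity.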
Again, the specific form of its influence function is given in Appendix C. As discussed at the end of Section 3.2, the influence function of φ̂ is identically 0 in certain special cases. One such case arises when T ⊥ W under P and the mapping G does not depend on W. In these cases, a two-step procedure can be considered for inference - see the end of Section 3.2 for a description of such an approach in a similar setting. As noted earlier, for a comparison of the efficiency of the adjusted and unadjusted estimators to be interesting, both must be asymptotically linear. In such settings, we now characterize cases in which the adjusted estimator will be more efficient than the unadjusted estimator. Moreover, unlike in the uncoarsened data setting, there are also settings where the adjusted estimator may be less efficient than the unadjusted estimator; we characterize these cases as well. Theorem 8. For all three estimands considered: 1. If the conditions in Lemma 6 hold and P is such that T ⊥ W, then σ²_u(P) ≤ σ²_a(P), that is, the unadjusted estimator is at least as efficient as the adjusted estimator. Moreover, if var_P[G(t_j, W)] > 0 for some j ∈ {1, . . . , k}, then the inequality is strict: σ²_u(P) < σ²_a(P). 2. If the conditions in Lemma 6 hold and G(·, ·) is such that G(t, w_1) = G(t, w_2) for all t ≤ t_k and w_1, w_2 ∈ W, then σ²_a(P) ≤ σ²_u(P), that is, the adjusted estimator is at least as efficient as the unadjusted estimator. B Proofs of results in the case where the outcome is fully observed B.1 Supporting lemmas for proofs in Section 3 Lemma 8 (EIF of mean conditional variance). For a given function u_P(·) that can be written as u_P(y) = ∫ h(x, y) dP(x) for some function h, the canonical gradient of σ²_a = E_P[var(u_P(Y) | W)] is given by where f_P(w) = E_P[u_P(Y) | W = w]. Proof. We prove this lemma by directly applying the definition of a gradient.
We consider the one-dimensional submodel {P_ǫ : ǫ} with density p_ǫ(x, w) = {1 + ǫ s_1(x | w)}{1 + ǫ s_2(w)} p(x, w), where s_1 and s_2 are scores for the conditional and marginal components, respectively. The second term on the right has the following derivative with respect to ǫ at ǫ = 0, and so will contribute {u_P(Y) − f_P(W)}² − σ²_a to the gradient. We now focus on the first term, which rewrites as where c(ǫ) = ∫ h(x, y){ǫ s_1(x | w) + ǫ s_2(w) + ǫ² s_1(x | w) s_2(w)} dP(x, w). The first term in the above display has derivative 0 with respect to ǫ at ǫ = 0, as f_P is the true conditional mean. The second term also has derivative 0 at ǫ = 0, as it is quadratic in ǫ. At ǫ = 0, the third term has derivative 2 ∫ [∫ h(x, y){s_1(x | w) + s_2(w)} dP(x, w)] [∫ h(x, y) dP(x, w) − f_P(w)] dP(y, w). Computing the mean of the inner integral yields the remaining contribution to the gradient. Therefore, the following is a gradient: Since we are working within a locally nonparametric model, the above is also the canonical gradient. Lemma 9 (EIF of mean conditional covariance). Consider a locally nonparametric model of distributions of (Y, W). For given functions u(·) and v(·), the canonical gradient of Proof. As in the proof of Lemma 8, we consider a one-dimensional submodel {P_ǫ : ǫ} with density Note that Also, because This shows that D_cov is a gradient and, because the model is locally nonparametric, D_cov must therefore be the canonical gradient. Let X be a generic random variate with distribution P ∈ M with support in a bounded set X ⊂ R^d. Let U_α : X → R^m be a function indexed by α ∈ R^m. Suppose that P U_α = 0 has a unique solution in α, and denote this solution by ψ_1 = ψ_1(P). In general, we can regard ψ_1(P) as a parameter defined implicitly through the estimating equation. The following lemma establishes the pathwise differentiability of this and a related parameter under appropriate conditions. Lemma 10 (Pathwise differentiability of parameters defined via estimating equations). Let ψ_1 : M → R^m be such that ψ_1(P) is the unique solution in α to the estimating equation P U_α = 0. Suppose that, for each x ∈ X, U_α(x) is continuously differentiable in α, with derivative U̇_α(x) := (∂/∂α′) U_{α′}(x) |_{α′ = α}.
Suppose in addition that U_{ψ_1}, U̇_{ψ_1} ∈ L²(P) and that P U̇_{ψ_1} is invertible. Then ψ_1 is pathwise differentiable and its gradient relative to any locally nonparametric model is given by Moreover, for each α ∈ R^m, let g_α : X → R be a function, and suppose that α → g_α(x) is differentiable for all x ∈ X. For each P ∈ M, define ψ_2 := P g_{ψ_1}. Then Proof. For h ∈ L²_0(P) whose range is contained in [−1, 1], consider the one-dimensional submodel {P_ǫ : ǫ}, where each P_ǫ has density p_ǫ(x) = {1 + ǫ h(x)} p(x). This submodel has score h at ǫ = 0. By definition, P_ǫ U_α = ∫ U_α(x){1 + ǫ h(x)} p(x) dx, which is linear in ǫ. Thus, the continuous differentiability of U_α as a function of α implies that this map is also continuously differentiable. The implicit function theorem then implies that We note that D_1 ∈ L²_0(P). Hence, by definition, ψ_1 is pathwise differentiable with gradient D_1, which is also the only gradient in any locally nonparametric model. Also, noting that ψ_2(P_ǫ) = ∫ g_{ψ_1(P_ǫ)}(x){1 + ǫ h(x)} dP(x), we see that Thus ψ_2 is pathwise differentiable and D_2 is the gradient in any locally nonparametric model. Proof of Lemma 1. For notational convenience, we define μ_a = E[Y_t | A = a] for a ∈ {0, 1}. We first establish properties of the working-model-based estimator of μ_a, given by μ̂_a = α̂_a + β̂_a^⊤ W̄_t. To do this, we note that the fitted coefficients (α̂_a, β̂_a) from the arm-specific linear regression satisfy the following first-order conditions: Let (α*_a, β*_a) be the large-sample limit of (α̂_a, β̂_a), defined implicitly as the solution to The efficient influence function of μ_a when the treatment is randomized is given by where π_a = Π(A = a). We claim that the D_a defined below is also a gradient of μ_a in this model. To show this, we will show that D*_a − D_a lies in the orthogonal complement of the tangent space.
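The working-model-based estimator μ̂_a = α̂_a + β̂_a^⊤ W̄_t can be sketched directly: fit an arm-specific linear regression and evaluate it at the pooled covariate mean, which marginalizes the working model over the empirical distribution of W_t. The function below is an illustrative implementation with names of our choosing, not the paper's code.

```python
import numpy as np

def adjusted_arm_mean(y, a, w, arm):
    """Working-model estimate of E[Y | A = arm]: arm-specific OLS fit,
    evaluated at the covariate mean of the *pooled* sample.

    y : outcomes, shape (n,);  a : treatment indicators, shape (n,)
    w : covariates, shape (n, d);  arm : 0 or 1
    """
    y, a, w = np.asarray(y, float), np.asarray(a), np.asarray(w, float)
    ya, wa = y[a == arm], w[a == arm]
    X = np.column_stack([np.ones(len(ya)), wa])    # arm-specific design matrix
    coef, *_ = np.linalg.lstsq(X, ya, rcond=None)  # (alpha_hat_a, beta_hat_a)
    return float(coef[0] + w.mean(axis=0) @ coef[1:])  # alpha + beta' W-bar
```

Using the pooled covariate mean (rather than the arm-specific one) is what makes the two arm estimates comparable: both working models are marginalized over the same covariate distribution.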
First, define L²_{0,A}(ν) = {s ∈ L²_0(ν) : s(y, a, w) = s(y′, a, w′) for all (y, w), (y′, w′)}. The tangent space decomposes as We note that each individual factor in the above display has mean 0. This, together with the independence between A and W_t, implies that ⟨s, D_a − D*_a⟩_ν = 0 for any s in L²_{0,A}(ν), L²_{0,W_t}(ν), or L²_{0,Y_t|A,W_t}(ν). This implies that the difference is orthogonal to each component of the tangent space, and hence to the tangent space itself. Next, we note that μ̂_a rewrites as a one-step estimator based on the gradient D_a. Let ν̂ be a distribution under which the probability of A = a is the sample proportion of A = a. The remainder is given by which is o_P(n^{−1/2}), as the first term is O_P(n^{−1/2}) and the second term is o_P(1). We note that Y and W both have bounded support and that π_a ∈ (0, 1). Thus ‖D_a(ν̂) − D_a‖_{L²(ν)} = o_P(1), which follows from the convergence of α̂_a, β̂_a, and π̂_a to their population counterparts. Let γ* := (π_a, α*_a, β*_a, μ_a) be the true parameter value and γ = (π, α, β, μ) be a generic parameter value, and define a class of functions F indexed by γ in a small neighborhood of γ*. We note that D_a(ν̂) ∈ F with probability tending to 1. The boundedness of Y_t, W_t, and π_a implies that this class of functions is Lipschitz in γ, and Example 19.7 in Van der Vaart [31] implies that F is a Donsker class. Therefore, μ̂_a is asymptotically linear with influence function D_a, and so is ψ̂_m, with influence function D_m = D_1 − D_0. Now, focusing on a ν where the sharp null holds, D_m simplifies to Due to the independence of A and (Y_t, W_t) under the sharp null, it has variance σ²_m/(π_1 π_0) under the sharp null. Now, we derive the asymptotic variances of the unadjusted estimator and the fully adjusted estimator. In doing so, we consider a more general parameter for a given function u; the average treatment effect corresponds to the special case where u is the identity function. Recall that the unadjusted estimator is given by the difference between arm-specific means.
Applying the delta method, we see that ψ̂_u is asymptotically linear with influence function Under the sharp null, it simplifies to with variance var_P(u(Y))/(π_1 π_0) under the sharp null, due to the independence between A and (Y_t, W_t) under the sharp null. The fully adjusted estimator we consider is the AIPW estimator given by The influence function of this estimator ψ̂_a is given by Under the sharp null, it simplifies to with variance E_P[var_P(u(Y) | W)]/(π_1 π_0) under the sharp null, where we again use the independence between A and (Y_t, W_t) under the sharp null. Proof of Theorem 1. First, we consider the estimation of σ²_m. Define a parameter P → (α*(P), β*(P)). Define a set of estimating functions U(α, β). The first-order condition implies that (α*(P), β*(P)) is the unique solution to the estimating equation P U(α, β) = 0. This solution can be written as a smooth transformation of moments of P; hence, by the chain rule, (α*(P), β*(P)) is pathwise differentiable. In the remainder of this proof, we will often write (α*(P), β*(P)) as (α*, β*). The fitted coefficients (α̂, β̂) solve the empirical estimating equation P_n U(α, β) = 0, and are asymptotically linear estimators of (α*, β*). Lemma 10 implies that their influence function is also the canonical gradient of (α*, β*) in any locally nonparametric model, and therefore (α̂, β̂) is also regular [Proposition 2.3.i, 23]. Define the estimating function U_ATE(α, β, σ²) := (y − α − β^⊤w)² − σ². We note that σ²_m is the solution in σ² to the estimating equation P U_ATE(α*, β*, σ²) = 0, and σ̂²_m solves its empirical counterpart P_n U_ATE(α̂, β̂, σ²) = 0. Hence, The consistency of (α̂, β̂) implies that σ̂²_m is also consistent. This, together with the bounded support of W, implies that ‖U_ATE(α̂, β̂, σ̂²_m) − U_ATE(α*, β*, σ²_m)‖_{L²(P)} = o_P(1).
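The AIPW estimator referenced in this proof can be sketched as follows for a randomized trial with known π_1 = P(A = 1); `outcome_fit` is a hypothetical user-supplied regression estimate of E[Y | A = a, W = w], and the function name is ours.

```python
import numpy as np

def aipw_ate(y, a, w, outcome_fit, pi1):
    """AIPW estimate of E[Y(1)] - E[Y(0)] with known randomization probability.

    outcome_fit(w, arm) should return estimates of E[Y | A = arm, W = w].
    """
    y, a = np.asarray(y, float), np.asarray(a, int)
    m1, m0 = outcome_fit(w, 1), outcome_fit(w, 0)
    # regression prediction plus inverse-probability-weighted residual correction
    psi1 = m1 + (a == 1) / pi1 * (y - m1)
    psi0 = m0 + (a == 0) / (1.0 - pi1) * (y - m0)
    return float(np.mean(psi1 - psi0))
```

With a zero outcome model, this collapses to the pure inverse-probability-weighted difference in means, which makes the augmentation structure easy to check by hand.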
Furthermore, we can show that U_ATE is a Lipschitz function of (α, β, σ²) in a neighborhood of (α*, β*, σ²_m), and thus that U_ATE(α̂, β̂, σ̂²_m) belongs to a Donsker class with probability tending to 1 [Example 19.7, 31]. Lemma 19.24 in Van der Vaart [31] implies that the last term in the above display is o_P(n^{−1/2}). Applying a Taylor expansion to the second term, we have In particular, we have both of which have mean 0 at (α*, β*) by the first-order condition for (α*, β*). This implies that σ̂²_m is asymptotically linear with influence function IF_m(y, w) = (y − α* − w^⊤β*)² − σ²_m. Lemma 10 implies that this is also the canonical gradient of σ²_m, and hence σ̂²_m is also regular. Next, we estimate σ²_a. The proposed estimator is a one-step estimator based on the canonical gradient. Applying Lemma 8 with h(x, y) = y, we obtain the canonical gradient IF_a(y, w) = {y − r(w)}² − σ²_a. Let P̂ be a distribution of (Y, W) such that the conditional mean of Y given W is r̂(w) and the marginal distribution of W is the empirical distribution of W. Then, As r̂(w) is uniformly bounded and belongs to a Donsker class F with probability tending to 1, and Y has bounded support, Theorem 2.10.6 in Van der Vaart and Wellner [32] implies that IF_a(P̂) = {y − r̂(w)}² − σ̂²_a belongs to a Donsker class F̂ that is also bounded. The difference between IF_a(P̂) and IF_a(P) is given by Therefore, its L²(P) norm is bounded, up to a constant M, by ‖r̂ − r‖_{L²(P)} plus |σ̂²_a − σ²_a|, as the support of Y and r̂ are bounded. The first term in the last line is o_P(1) by the assumption that ‖r̂ − r‖_{L²(P)} = o_P(n^{−1/4}). Thus, ‖IF_a(P̂) − IF_a(P)‖_{L²(P)} = o_P(1) provided that σ̂²_a is a consistent estimator of σ²_a. Supposing for now that this is indeed the case, Lemma 19.24 in Van der Vaart [31] implies that (P_n − P){IF_a(P̂) − IF_a(P)} is o_P(n^{−1/2}). This shows that σ̂²_a is regular and asymptotically linear with influence function IF_a. We now show that σ̂²_a is indeed a consistent estimator of σ²_a.
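At the level of point estimation, the one-step construction analyzed here reduces to averaging squared residuals about r̂ and dividing by the sample variance. The sketch below (names ours) deliberately omits cross-fitting and the asymptotic corrections discussed in this proof; `r_hat` stands in for any user-supplied regression estimate of E[Y | W].

```python
import numpy as np

def relative_efficiency(y, w, r_hat):
    """phi_hat = sigma2_a_hat / sigma2_u_hat.

    sigma2_a_hat = P_n {y - r_hat(w)}^2, estimating E[var(Y | W)];
    sigma2_u_hat = sample variance of Y (the unadjusted variance).
    """
    y = np.asarray(y, float)
    resid = y - np.asarray(r_hat(w), float)
    sigma2_a = float(np.mean(resid ** 2))   # plug-in for E[var(Y | W)]
    sigma2_u = float(np.var(y))             # unadjusted variance
    return sigma2_a / sigma2_u
```

The two extremes behave as expected: a perfect regression drives φ̂ to 0 (maximal precision gain from adjustment), while a constant regression at the mean gives φ̂ = 1 (no gain).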
Finally, we estimate σ²_u with the sample variance of Y, which is regular and asymptotically linear. Theorem 1 then follows by applying the delta method. Specifically, the influence function of φ̂_a is given by (IF_a − φ_a IF_u)/σ²_u, and similarly the influence function of φ̂_m is (IF_m − φ_m IF_u)/σ²_u. Both estimators are efficient, as we work in a locally nonparametric model. Proof of Lemma 2. First, recall that θ*_a(k, w) is the best approximation to the true outcome regression θ_a(k, w) within the proportional odds model. The coefficients α*_a and β*_a satisfy the following first-order conditions: We claim that, when the treatment is randomized, the following is a gradient of F_a(k): To show this, we will show that the difference between IF_{F_a(k)} and the canonical gradient of F_a(k) lies in the orthogonal complement of the tangent space. The canonical gradient is given by Hence, As each individual factor has mean 0 and A ⊥ W_t, the difference is orthogonal to each component of the tangent space and hence to the tangent space itself. Finally, F̂_a rewrites as a one-step estimator based on the gradient IF_{F_a(k)}. In particular, let ν̂ be a distribution with outcome regression θ_{α̂_a, β̂_a} and treatment probability π̂_1; then First, we note that R(ν̂, ν) is o_P(n^{−1/2}), as the first term in the last line is O_P(n^{−1/2}) and the second term is o_P(1). Also, given the bounded support of W_t and the fact that π_1 and π_0 are bounded away from 0, we can show that the convergence of π̂_a, α̂_a, and β̂_a implies that ‖IF_{F_a(k)}(ν̂) − IF_{F_a(k)}(ν)‖_{L²(ν)} = o_P(1). Example 19.7 in Van der Vaart [31] shows that IF_{F_a(k)}(ν̂) lies in a Donsker class with probability tending to 1, as IF_{F_a(k)} is Lipschitz in its indexing parameters in a neighborhood of the true parameter value, again due to the boundedness of W_t and π_a. Lemma 19.24 in Van der Vaart [31] then implies that Thus, F̂_a(k) is asymptotically linear with influence function IF_{F_a(k)}.
Under the sharp null, the above display simplifies to Under the sharp null, its variance is E_P[ · ]/(π_1 π_0), due to the independence between A and (Y_t, W_t) under the sharp null. Proof of Theorem 2. We first consider estimating σ²_m. To start, we show that (α̂, β̂), the maximizer of the empirical version of (10), is a RAL estimator of (α*, β*). The first-order conditions of this maximization imply that (α̂, β̂) solves a set of estimating equations, as the parameter space is unconstrained. Specifically, define U(α, β) = (U_1(α, β), . . . , U_{K−1}(α, β), U_K(α, β)). Then we have that P_n U_k(α̂, β̂) = 0 for all k = 1, . . . , K. We can show, through the usual arguments used to study estimating equations [Chapter 5, 31], that the influence function of (α̂, β̂) is In particular, the derivative matrix U̇ can be partitioned into Lemma 10 implies that IF_{αβ} is the canonical gradient of (α*, β*) in a locally nonparametric model, and thus (α̂, β̂) is also regular [Proposition 2.3.i, 23]. Next, we define the following estimating equation: By definition, σ²_m is the unique solution in σ² to the equation P U_DIM(α*, β*, σ²) = 0, and σ̂²_m solves its empirical counterpart P_n U_DIM(α̂, β̂, σ²) = 0. Hence, we have The consistency of (α̂, β̂) implies that σ̂²_m is also consistent. Combining this with the bounded support of W, we see that ‖U_DIM(α̂, β̂, σ̂²_m) − U_DIM(α*, β*, σ²_m)‖_{L²(P)} = o_P(1). Furthermore, it can be shown that U_DIM is a Lipschitz transformation of (α, β, σ²) in a neighborhood of (α*, β*, σ²_m), and thus that U_DIM(α̂, β̂, σ̂²_m) belongs to a Donsker class with probability tending to 1. Lemma 19.24 in Van der Vaart [31] implies that the last term in the above display is o_P(n^{−1/2}). Thus, The partial derivatives are given by Combining all of the results above, we see that σ̂²_m is an asymptotically linear estimator of σ²_m with influence function IF_m, where We now consider the estimation of σ²_a.
The proposed estimator is a one-step estimator based on the canonical gradient, and the proof is very similar to that of Theorem 1. Applying Lemma 8 with h(x, y) = u(y), we obtain the canonical gradient IF_a(y, w) = {u(y) − r(w)}² − σ²_a. Let P̂ be a distribution of (Y, W) such that the conditional mean of u(Y) given W is r̂(w) and the marginal distribution of W is the empirical distribution of W. We have that where the latter equality holds by assumption. As r̂(w) is uniformly bounded and belongs to a Donsker class F with probability tending to 1, and Y has bounded support, Theorem 2.10.6 in Van der Vaart and Wellner [32] implies that IF_a(P̂) = {u(y) − r̂(w)}² − σ̂²_a belongs to a transformed Donsker class F̂ that is also bounded. The difference between IF_a(P̂) and IF_a(P) is given by Therefore, its L²(P) norm is bounded, up to a constant M, by ‖r̂ − r‖_{L²(P)} plus |σ̂²_a − σ²_a|, as the support of Y and r̂ are bounded. The first term in the last line is o_P(1) by the assumption that ‖r̂ − r‖_{L²(P)} = o_P(n^{−1/4}). Thus, ‖IF_a(P̂) − IF_a(P)‖_{L²(P)} = o_P(1) provided that σ̂²_a is a consistent estimator of σ²_a. Supposing for now that this is indeed the case, Lemma 19.24 in Van der Vaart [31] implies that (P_n − P){IF_a(P̂) − IF_a(P)} is o_P(n^{−1/2}). Hence, once we show that σ̂²_a is a consistent estimator of σ²_a, we will have shown that σ̂²_a is regular and asymptotically linear with influence function IF_a. We now show that σ̂²_a is indeed consistent. Note that |σ̂²_a − σ²_a| can be bounded by the sum of an empirical-process term and M_1 P_n|r̂ − r| for some constant M_1, where we use the fact that both Y and W have bounded support and r̂ is uniformly bounded. The first term in the last line is o_P(1) by the weak law of large numbers. As for the second term, we see that it is equal to M_1{(P_n − P)|r̂ − r| + P|r̂ − r|}. Lemma 19.24 in Van der Vaart [31] implies that (P_n − P)|r̂ − r| = o_P(1), as |r̂ − r| lies in a Donsker class with probability tending to 1 and ‖r̂ − r‖_{L²(P)} = o_P(1). In addition, P|r̂ − r| ≤ ‖r̂ − r‖_{L²(P)} = o_P(1). This establishes the consistency of σ̂²_a.
Finally, we estimate σ²_u with the sample variance of u(Y), which is regular and asymptotically linear. Theorem 2 then follows by applying the delta method. Specifically, the influence function of φ̂_a is given by (IF_a − φ_a IF_u)/σ²_u, and, similarly, the influence function of φ̂_m is (IF_m − φ_m IF_u)/σ²_u. Proof of Lemma 4. Recall the form of the unadjusted estimator ψ̂_u. First, we introduce some notation. Let ν_n be the empirical distribution of X_t in the future trial data. Define the functions η_a(·), a ∈ {0, 1}, analogously to η(·) but within each treatment arm, as η_a(k) := P_a(Y < k) + P_a(Y = k)/2 for a ∈ {0, 1}, and define ψ̂_u1 as follows. We note that ψ̂_u1 is a V-statistic with symmetric kernel h̃(X_t1, X_t2). With this notation, ψ̂_u1 = ν²_n h̃. Note that To establish the asymptotic linearity of ψ̂_u1, we first show that (ν_n − ν)² h̃ is o_P(n^{−1/2}). To start, define a class of functions H̃ := {x → h̃(x, x_2) : x_2 ∈ X}, where X is the support of X_t. Each function in H̃ is a weighted sum of 4 binary terms, with weights equal to either 1 or 1/2. Each term is indexed by a_2 and y_t2, and can be computed with 3 arithmetic operations and 1 comparison. Theorem 8.4 in Anthony and Bartlett [1] implies that each binary term belongs to a VC class with VC dimension at most 64, and Lemma 19.15 in Van der Vaart [31] in turn implies that this class is Donsker. Theorem 2.10.6 in Van der Vaart and Wellner [32] then implies that H̃ is a Donsker class (and hence also Glivenko-Cantelli). Next, we consider ν h̃(x_t, ·). Combining this with the previous results, we have that Applying the delta method, we see that ψ̂_u is asymptotically linear with influence function which, under the sharp null, simplifies to (a_t/π_1 − (1 − a_t)/π_0){η(y_t) − 1/2}. The variance of η(Y_t) can be calculated exactly, and equals (1 − Σ_{k=1}^{K} p³_k)/12. Next, we look at the fully adjusted estimator.
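Before moving on, the closed-form variance just stated for the unadjusted analysis, var{η(Y_t)} = (1 − Σ_k p³_k)/12, is easy to verify numerically for any pmf; the `mid_rank_transform` helper name is ours.

```python
import numpy as np

def mid_rank_transform(probs):
    """eta(k) = P(Y < k) + P(Y = k)/2 for an ordinal Y with pmf `probs`."""
    p = np.asarray(probs, float)
    return np.cumsum(p) - p / 2.0   # cdf(k) - p_k/2 = P(Y < k) + p_k/2

probs = np.array([0.2, 0.3, 0.5])
eta = mid_rank_transform(probs)
var_eta = float(np.sum(probs * eta**2) - np.sum(probs * eta)**2)
identity = float((1.0 - np.sum(probs**3)) / 12.0)   # the stated closed form
```

A byproduct of the same computation is that E[η(Y_t)] = 1/2 for any pmf, which is what makes the centering by 1/2 in the influence function above natural.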
The efficient influence function of ψ was given in, for example, Mao [18], which, under the sharp null, simplifies to where we use the independence between A and (Y_t, W_t) under the sharp null. Finally, we consider the estimator based on the proportional odds model. Let P̂_a be a distribution with CDF F̂_a(k) for a = 0, 1. Recall that ψ̂_m = ∫∫ h(x, y) dP̂_1(x) dP̂_0(y). Since the F̂_a's are asymptotically linear, we have The first term can alternatively be written as and similarly for the second term. Then Lemma 2 implies that ψ̂_m is asymptotically linear with influence function D_m(y_t, a_t, w_t), which, under the sharp null, simplifies to The variance of D_m under the sharp null is σ²_m/(π_1 π_0), where we use the independence between A and (Y_t, W_t) under the sharp null. Lemma 4 follows by the definition of relative efficiency. Proof of Theorem 3. We first study σ̂²_m based on estimating equations. The proof is similar to that of Theorem 2, except that we additionally need to estimate the marginal distribution of Y. We use the sample proportion p̂_k, which is asymptotically linear with influence function IF_{p_k}(y, w) = I{y = k} − p_k. Consider the following estimating equation By definition, σ²_m is the unique solution to the equation P U_MW(α*, β*, p, σ²) = 0, and σ̂²_m solves its empirical counterpart. Hence, we have The consistency of (α̂, β̂) and p̂ implies that σ̂²_m is also consistent. This, together with the bounded support of W, allows us to show that U_MW is a Lipschitz function of (α, β, p, σ²) in a neighborhood of (α*, β*, p, σ²_m), and thus that U_MW(α̂, β̂, p̂, σ̂²_m) belongs to a Donsker class with probability tending to 1. Lemma 19.24 in Van der Vaart [31] implies that the last term in the above display is o_P(n^{−1/2}). Applying a Taylor expansion, we have that We note that, using arguments similar to those in Lemma 10, we can show that IF_m is the canonical gradient of σ²_m. We estimate the unadjusted variance by plugging in p̂_k. By the delta method, σ̂²_u is asymptotically linear with influence function IF_u = −Σ_{k=1}^{K} p²_k IF_{p_k}/4. Next, we establish the asymptotic linearity of σ̂²_a.
Applying Lemma 8 with h(ỹ, y) = I{ỹ < y} + (1/2) I{ỹ = y}, we obtain the EIF of σ²_a as follows We show through direct linearization that σ̂²_a is indeed asymptotically linear with influence function IF_a. To start, we note that σ̂²_a = P_n(η̂ − r̂)² and σ²_a = P(η − r)². Thus, We analyze term 2 first. The first term on the right-hand side is already linear and contributes the first term to the influence function. By the assumption that ‖r̂ − r‖_{L²(P_W)} = o_P(n^{−1/4}), the third term is negligible because We now turn to the second term on the right-hand side of (23). By Theorem 2.10.6 in Van der Vaart and Wellner [32], the fact that r̂ belongs to a bounded P-Donsker class with probability tending to 1 implies that (η − r̂)² − (η − r)² also belongs to a P-Donsker class with probability tending to 1, as η and r are both fixed and bounded functions. Furthermore, its L²(P) norm is bounded by M‖r̂ − r‖_{L²(P)} for some constant M. Lemma 19.24 in Van der Vaart [31] implies that the second term is also o_P(n^{−1/2}). We now analyze term 1. Note that We note that η̂ belongs to a bounded P-Donsker class, as the empirical distribution function lies in the closure of the convex hull of a P-Donsker class. Theorem 2.10.6 in Van der Vaart and Wellner [32] again implies that (η̂ − r̂)² − (η − r̂)² belongs to a P-Donsker class with probability tending to 1. Moreover, Thus the third term is o_P(n^{−1/2}). The second term is also o_P(n^{−1/2}) by our assumption on the convergence rate of r̂ and the convergence of η̂. Finally, the first term is the linear term that contributes the rest of IF_a. To see this, we write it in integral form. Note that Lemma 8 implies that ∫∫{η(ỹ) − r(w̃)} h(·, ỹ) dP(ỹ, w̃) = σ²_a. Theorem 3 now follows by applying the delta method. Specifically, the influence function of φ̂_a is given by (IF_a − φ_a IF_u)/σ²_u, and similarly the influence function of φ̂_m is (IF_m − φ_m IF_u)/σ²_u. These estimators are RAL in any locally nonparametric model. Proof of Lemma 5.
Recall that the unadjusted estimator is given by Applying the delta method shows that its influence function is given by which, under the sharp null, simplifies to Due to the independence of A and (Y_t, W_t) under the sharp null, the variance of D_u(Y_t, A, W_t) under the sharp null is σ²_u/(π_1 π_0). The working-model-based adjusted estimator ψ̂_m replaces the empirical CDF estimators with F̂_a. By Lemma 2 and the delta method, the influence function of ψ̂_m is which, under the sharp null, becomes The variance of D_m(Y_t, A, W_t) under the sharp null is σ²_m/(π_1 π_0). Finally, the efficient influence function can be obtained by projecting D_u onto the tangent space. Under the sharp null, it simplifies to The variance of D*(Y_t, A, W_t) under the sharp null is σ²_a/(π_1 π_0). Proof of Theorem 4. We first consider estimating σ²_u. Define the following estimating equation Then σ²_u is the unique solution in σ² to the equation P U_{LOR,u}(F, σ²) = 0, and σ̂²_u solves P_n U_{LOR,u}(F̂, σ²) = 0, with F̂ being the CDF of the empirical distribution of Y. We can apply estimating-equation arguments similar to those in the proofs of Theorems 2 and 3 to show that σ̂²_u − σ²_u = (P_n − P) IF_u + o_P(n^{−1/2}), where This implies that σ̂²_u is asymptotically linear with influence function IF_u. In particular, Next, we consider the estimation of σ²_m. Define the following estimating equation Then σ²_m is the solution in σ² to P U_LOR(α*, β*, F, σ²) = 0, while σ̂²_m is the solution to P_n U_LOR(α̂, β̂, F̂, σ²) = 0. We can again apply estimating-equation arguments as in the proofs of Theorems 2 and 3 to show that This implies that σ̂²_m is asymptotically linear with influence function IF_m. Here, Finally, we consider the estimation of σ²_a. We define σ_kl := E_P[cov(I{Y ≤ k}, I{Y ≤ l} | W)]. Then σ²_a can equivalently be written as We estimate F(k) with F̂(k), which is an asymptotically linear estimator.
By Lemma 9, the EIF of σ kl is given by We show that the estimator σ̂ kl is asymptotically linear with influence function D kl . For ease of notation, we use I k as shorthand for the function y → I{y ≤ k}, θ k for the function w → θ(k, w) and θ̂ k for the function w → θ̂(k, w), for k ∈ {1, . . . , K − 1}. Then we have that σ̂ kl = P n {(I k − θ̂ k )(I l − θ̂ l )} and σ kl = P {(I k − θ k )(I l − θ l )}. Therefore, The first term is exactly (P n − P )D kl . For the second term, we note that Now we turn to the third term. By Theorem 2.10.6 in Van Der Vaart and Wellner [32] , the fact that θ̂ k and θ̂ l belong to fixed P -Donsker classes with probability tending to 1 implies that (I k − θ̂ k )(I l − θ̂ l ) − (I k − θ k )(I l − θ l ) also belongs to a fixed P -Donsker class with probability tending to 1, as the support of W is bounded and I k , I l , θ k and θ l are all fixed and bounded functions. In addition, (I k − θ̂ k )(I l − θ̂ l ) − (I k − θ k )(I l − θ l ) L 2 (P ) = o P (1). Thus, Lemma 19.24 in Van der Vaart [31] implies that the third term is o P (n −1/2 ). Hence σ̂ kl has influence function D kl . Applying the delta method, we see that σ̂ 2 a is asymptotically linear with influence function We study term 2 first. By the Lipschitz property of h, term 2 is bounded by We now show that E M √ n σ 2 (P * n ) − σ 2 (P * n ) P → 0. Recall the asymptotic linear expansion (12) . It then follows that We note that N var P * n (Rem) = N 1−2γ var P * n (N γ Rem). Therefore, Condition B2 implies that E N var P * n (Rem) ≤ LN 1−2γ for some constant L. In addition, the boundedness of the support of X implies that var P * n [D P * n Π (X)] is also bounded. Thus, by the Cauchy-Schwarz inequality and Jensen's inequality, for some constant L 1 . As a result, we have that E √ n σ 2 (P * n ) − σ 2 (P * n ) ≤ √ L 2 nN 1−2γ for some constant L 2 , where the outer expectation is over X 1 , X 2 , . . .
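The plug-in form σ̂ kl = P n {(I k − θ̂ k )(I l − θ̂ l )} just analyzed can be sketched directly. Purely for illustration we take W discrete, so that θ̂(k, w) = P̂(Y ≤ k | W = w) is a within-group mean; in practice any regression or machine-learning estimator of θ could be substituted, and the helper name below is ours.

```python
import numpy as np

def sigma_kl_plugin(y, w, k, l):
    """Plug-in estimate of sigma_kl = E[cov(1{Y<=k}, 1{Y<=l} | W)].

    Sketch assuming a discrete covariate W: theta(k, w) is estimated by the
    within-group empirical mean of 1{Y <= k}.
    """
    y = np.asarray(y)
    w = np.asarray(w)
    ik = (y <= k).astype(float)
    il = (y <= l).astype(float)
    theta_k = np.empty_like(ik)
    theta_l = np.empty_like(il)
    for wv in np.unique(w):
        m = (w == wv)
        theta_k[m] = ik[m].mean()
        theta_l[m] = il[m].mean()
    # empirical mean of (I_k - theta_k)(I_l - theta_l), i.e. sigma_hat_kl
    return float(np.mean((ik - theta_k) * (il - theta_l)))
```

With a flexible θ̂, the one-step/EIF correction discussed in the proof would be added on top of this naive plug-in.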
Markov's inequality and Condition B4 then imply that, for all ǫ > 0, We now turn to term 1, which further decomposes into sup h∈BL1(R) Theorem 23.7 and Equation 23.8 in Van der Vaart [31] imply that term 1.2 converges to 0 in probability. For term 1.1, the same argument as was used in the proof of Theorem 23.9 in Van der Vaart [31] shows that this term converges to 0 in probability. Combining these steps as in the proof of Theorem 23.9 in Van der Vaart [31] , we see that √ n{σ 2 (P * n )− σ 2 (P n )} converges conditionally in distribution to (σ 2 ) ′ (G), given X 1 , X 2 , . . ., in probability. In particular, we can apply the above argument to the variance of the unadjusted estimator σ 2 u and the variance of the working-model-based estimator σ 2 m to show that (i) √ n{σ 2 u (P * n ) − σ 2 u (P n )} converges conditionally in distribution to (σ 2 u ) ′ (G), and (ii) √ n{σ 2 m (P * n ) − σ 2 m (P n )} converges conditionally in distribution to (σ 2 m ) ′ (G), both given X 1 , X 2 , . . ., in probability. The delta method implies that √ n{σ 2 m (P * n )/σ 2 u (P * n ) − φ m (P n )} converges conditionally in distribution to (φ m ) ′ (G), given X 1 , X 2 , . . ., in probability. Finally, Theorem 3.10 follows by Condition B3 and Slutsky's theorem. C Proofs of results for time-to-event outcomes with right censoring All three estimands of the treatment effect we consider are transformations of the arm-specific survival functions S 1 (t) and S 0 (t). Recall that for the unadjusted analysis we plug inS 1 (t) andS 0 (t), the armspecific KM estimators, and, for the adjusted analysis, we plug inŜ 1 (t) andŜ 0 (t), the efficient adjusted estimators for the arm-specific survival function. The influence function of the KM estimator was derived, for example, in Reid [25] . In particular, if T t ⊥ C t |A, thenS a (t k ) is an asymptotically linear estimator of S a (t k ) with influence function for a = 0 or 1, and k ∈ {1, . . . , K}. 
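The bootstrap quantity √n{σ 2 (P * n ) − σ 2 (P n )} analyzed in the first part of this argument can be simulated directly. A minimal sketch follows; the helper name `bootstrap_dist` and the generic argument `stat` (standing in for any of the variance functionals σ 2 u , σ 2 m ) are ours, not the paper's.

```python
import numpy as np

def bootstrap_dist(x, stat, b=500, seed=0):
    """Draws of sqrt(n) * {stat(P*_n) - stat(P_n)} under the nonparametric
    bootstrap: resampling n observations with replacement mimics P*_n.

    x    : 1-d array of observations
    stat : callable mapping a sample to a real-valued functional
    b    : number of bootstrap replicates
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    center = stat(x)
    draws = np.empty(b)
    for i in range(b):
        draws[i] = stat(x[rng.integers(0, n, size=n)])
    return np.sqrt(n) * (draws - center)
```

Quantiles of these draws approximate the limit law (σ 2 ) ′ (G) conditionally on the data, which is what licenses bootstrap confidence intervals for the relative efficiency.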
Here h a (t) is the hazard corresponding to S a at time t, G a (t, w) := P (C t ≥ t|A = a, W t = w) and G a (t) := P (C t ≥ t|A = a). Under the assumption that T t ⊥ C t |(A, W t ) and other regularity conditions, Ŝ a (t k ) is asymptotically linear with influence function λ a,k (y t , δ t , a t , w t ) given in Moore and van der Laan [20] , for a = 0 or 1, and k ∈ {1, . . . , K}. C.1 Supporting lemmas for proofs in Section C.2 Lemma 11 (EIF of the variance of the fully-adjusted estimators). For (k, l) ∈ {1, . . . , K} 2 and j ∈ {1, . . . , min(k, l)}, let D kl j be defined as in (25) . When we measure the treatment effect with the risk difference or relative risk, When we use the restricted mean survival time, Proof. Recall that we define f kl j (·) as f kl j (w) = S(t k , w)S(t l , w) When we wish to emphasize the dependence of f kl j on P through its survival function, we instead denote this function by f kl j,P . First we define a parameter σ kl j : M → R such that σ kl j (P ) : We consider the efficient influence function of σ kl j in the full data model, that is, the model where we observe (T, W ) and there is no censoring. Let p(t, w) = p(t|w)p(w) be the density of the joint distribution. We consider the one-dimensional submodel [28] to show that the following is an observed data influence function of σ kl j : Moreover, as our observed data model is locally nonparametric, D kl j is the efficient observed data influence function of σ kl j . Lemma 11 then follows since the variances are linear combinations of E[f kl j (W )/G(t j , W )]. Lemma 12 (Computation time of σ̂ 2 a with RMST). For given τ̂ and Ŝ, the function can be computed in O(k) time. Proof. There are five terms inside the sums, and we show that the sum of each term can be computed in O(k) time. To start, we note that, by taking a cumulative sum, { l≥u Ŝ(t l , w)} k u=1 can be pre-computed in O(k) time. Thus, the above display can be computed in O(k) time as we sum over u. The terms in g are of two types.
The first of these is k j=1 k l=1 min(j,l) By taking a cumulative sum, ( l≥uτ l ) k u=1 can be pre-computed in O(k) time and summing over u takes another O(k) steps. The second type of term is Again with the sum over j and l pre-computed for all u, the summation over u takes O(k) time. Proof of Lemma 6. First we note that the (conditional) independencies A ⊥ W t , T t ⊥ A|W t , C t ⊥ T t |A and C t ⊥ T t |(A, W t ) together imply that T t ⊥ C t and T t ⊥ C t |W t . This result will be useful when we compute the variances of the adjusted and unadjusted estimators. We consider the risk difference first. The unadjusted estimator, namelyψ u =S 0 (t k ) −S 1 (t k ), has influence function D u,k = η 0,k − η 1,k . Under the sharp null where the treatment has no effect and the assumption that C t ⊥ A|W t , the influence function simplifies to Noting that The adjusted estimator, namelyψ a =Ŝ 0 (t k ) −Ŝ 1 (t k ), has influence function D a,k = λ 0,k − λ 1,k . Under the sharp null and the assumption that C t ⊥ A|W t , the influence function becomes The variance of D a,k can be calculated in a similar way as was done for the unadjusted estimator, except that in taking expectation of the indicators, we condition on W t first. var(D a,k ) = 1 Hence the relative efficiency is given by σ 2 a /σ 2 u , which depends only on the distribution of survival time and the covariate, for a user-specified mapping G. Next we consider the relative risk. The unadjusted estimator, namelyψ u = 1−S1(t k ) 1−S0(t k ) , has influence function 1 1−S0(t k ) (−η 1,k + ψη 0,k ), which becomes D u,k /{1 − S(t k )} under the sharp null. Similarly, the influence function of the adjusted estimatorψ a = 1−Ŝ1(t k ) 1−Ŝ0(t k ) simplifies to D a,k /{1 − S(t k )} under the sharp null and the assumption that C t ⊥ A|W t . Hence the relative efficiency is again σ 2 a /σ 2 u . Proof of Theorem 6. Recall thatŜ(t k ) is the efficient adjusted estimator of the survival probability at time t k using the external data. 
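The cumulative-sum device of Lemma 12 generalizes: any double sum of the form Σ u a u Σ l≥u b l collapses to O(k) once the suffix sums of b are pre-computed. A minimal sketch (the helper name `weighted_tail_sum` is ours):

```python
import numpy as np

def weighted_tail_sum(a, b):
    """Compute sum_u a[u] * sum_{l >= u} b[l] in O(k) time.

    The naive evaluation is O(k^2); pre-computing the suffix sums of b
    (one reversed cumulative sum) reduces it to a single O(k) pass.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    tail = np.cumsum(b[::-1])[::-1]  # tail[u] = sum_{l >= u} b[l]
    return float(np.dot(a, tail))
```

Each of the five terms in the RMST variance expression reduces to sums of this shape, which is why σ̂ 2 a is computable in O(k) time overall.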
Hence, this estimator is asymptotically linear with influence function where τ is defined in (18) . Furthermore, define IF 0 (y, δ, w) := 0. Recall that Applying the delta method, we have that ŝ jl u is an asymptotically linear estimator of s jl u with influence function ξ jl u (y, δ, w) = Also, σ̂ 2 u is a linear combination of ŝ kk j , and its influence function is given by We now consider estimating σ 2 a . Its efficient influence function is derived in Lemma 11, and is also given in (19) , which is of the form IF a = k j=1 D kk j . The proposed σ̂ 2 a is a one-step estimator based on the EIF. Let Q be the distribution of the observed data unit (Y, ∆, W ) in the external dataset, induced by the joint distribution P of (T, W ), the conditional distribution of the censoring time H and the function Γ. Let P̂ be a joint distribution of (T, W ) such that the conditional hazard function is given by ĥ(t, w), the conditional survival function is given by Ŝ(t, w) and the distribution of W is given by its empirical distribution. Let Q̂ be the observed data distribution induced by P̂ , Ĥ and Γ. Then, we have that σ̂ 2 a = Q n IF a (Q̂) + σ 2 a (P̂ ), where Q n is the empirical distribution of (Y, ∆, W ) in the external dataset. Hence, Our first step is to show that the remainder term R(Q̂, Q) is o P (n −1/2 ). As the influence function IF a and the variance σ 2 a itself can both be written as a sum of k terms, it is easy to see that we can write Here we omit the superscript "kk" in f and g. We examine the terms coming from f and g separately. First, we study the terms in the remainder resulting from f . Then, we apply a second-order Taylor expansion to the function (x, y, z) → x 2 (1/y − 1/z). The relevant second-order derivatives are bounded by some constant M when Ŝ(t, w), S(t, w) and G(t, w) are all uniformly bounded away from 0. Then, Next, we study the terms in the remainder resulting from g. To do this, we define δ l h (w) := S(t l−1 , w){h(t l , w) − ĥ(t l , w)}/Ŝ(t l , w). Then, We first establish condition (1).
Using the triangle inequality, it suffices to show that for each j, {ĝ j − g j }/G(t j , ·) L 2 (Q) = o P (1) and {f̂ j − f j }/G(t j , ·) L 2 (Q) = o P (1), and that σ 2 a (Q̂) − σ 2 a (Q) = o P (1). As G is uniformly bounded away from 0, it suffices to show that ĝ j − g j L 2 (Q) = o P (1) and f̂ j − f j L 2 (Q) = o P (1). Recall that when studying the remainder term, we applied a second-order Taylor expansion to the function (x, y, z) → x 2 (1/y − 1/z). Here a first-order Taylor expansion suffices. As S is bounded away from 0, there exists some constant M such that the first derivatives are bounded by M . Then, We note that ĝ j − g j consists of three terms that are of similar forms. We study one of them in detail; similar arguments apply to the other two, and we can then apply the triangle inequality to conclude that ĝ j − g j L 2 (Q) = o P (1). For notational convenience, we define τ̂ k (y, δ, w) := l≤k 1/{Ŝ(t l , w)Ĥ(t l , w)} I{y = t l , δ = 1} − ĥ(t l , w)I{y ≥ t l } . We focus on the term in ĝ j − g j that is given by The triangle inequality allows us to bound each term separately. Since Ŝ and Ĥ are uniformly bounded away from 0 and Ŝ and ĥ are uniformly bounded above, there exists some constant M such that |τ̂ k | ≤ M . Therefore, in term 1 it suffices to upper bound 2Ŝ(t k , ·)/S(t j , ·) − 2Ŝ(t k , ·)/S(t j−1 , ·) − 2S(t k , ·)/S(t j , ·) + 2S(t k , ·)/S(t j−1 , ·) L 2 (P W ) . As in our analysis of f̂ j − f j L 2 (P W ) , we apply a first-order Taylor expansion, where the first-order derivatives are bounded by some constant M ′ . Then, Thus, term 1 is indeed o P (1). As S is uniformly bounded away from 0, 2S(t k , w)/S(t j , w) − 2S(t k , w)/S(t j−1 , w) is bounded above. Therefore, it suffices to show that τ̂ k − τ k L 2 (Q) = o P (1). Note that τ̂ k (y, δ, w) − τ k (y, δ, w) To see that Term 2.1 is o P (1), note that the first factor is bounded above and ĥ(t, ·) − h(t, ·) L 2 (P W ) = o P (1).
To see that Term 2.2 is o P (1), note that the second factor is bounded above and the first factor is o P (1), which can be shown again by a Taylor expansion. This argument shows that one of the three terms in ĝ j − g j is o P (1). A similar argument can be applied to show that the other two terms are o P (1) as well. Finally we show that σ 2 a (Q̂) − σ 2 a (Q) = o P (1). Note that The term (Q n − Q)f j is o P (1) by the law of large numbers. We have shown that f̂ j − f j L 2 (Q) is o P (1), which provides an upper bound on Q(f̂ j − f j ). As we will show momentarily, f̂ j lies in a Q-Donsker class with probability tending to 1, so Lemma 19.24 in Van der Vaart [31] implies that (Q n − Q)(f̂ j − f j ) = o P (1). Combining these results, we have σ 2 a (Q̂) − σ 2 a (Q) = o P (1). Next we establish condition (2), which says that IF a (Q̂) lies in a Q-Donsker class with probability tending to 1. By Theorem 2.10.6 in Van Der Vaart and Wellner [32] , it suffices to show that ĝ j and f̂ j both lie in Q-Donsker classes with probability tending to 1, since G is bounded away from 0. This can be shown again by Theorem 2.10.6 in Van Der Vaart and Wellner [32] , as by assumption Ŝ, Ĥ and ĥ all belong to fixed Q-Donsker classes with probability tending to 1 and all the functions involved in ĝ j and f̂ j are uniformly bounded away from 0 and also bounded above. Conditions (1) and (2) allow us to apply Lemma 19.24 in Van der Vaart [31] to conclude that (Q n − Q){IF a (Q̂) − IF a (Q)} is o P (n −1/2 ). Now, combining step 1, which showed that R(Q̂, Q) is o P (n −1/2 ), and step 2, which showed that (Q n − Q){IF a (Q̂) − IF a (Q)} is o P (n −1/2 ), we see that σ̂ 2 a is asymptotically linear with influence function IF a . Theorem 6 then follows by the delta method. Specifically, the influence function of φ̂ is given by (IF a − φ a IF u )/σ 2 u . This estimator is efficient as its influence function agrees with the EIF of φ a . Proof of Lemma 7.
The unadjusted estimator, namelyψ u = k j=1 {S 1 (t j ) −S 0 (t j )}, has influence function k j=1 (η 1,j − η 0,j ). Under the sharp null, the form of the influence function simplifies to k j=1 D u,j , where the definition of D u,j is given in (26) . Since Now we consider the fully adjusted estimatorψ a = k j=1 {Ŝ 1 (t j ) −Ŝ 0 (t j )}. Under the sharp null, the influence function ofψ a is given by k j=1 D a,j , where D a,j is defined in (27) . As in the proof of Lemma 6, the variance of k j=1 D a,j can be derived in the same way, except that we condition on W first when taking expectations. Moreover,σ 2 a is again a one-step estimator based on its EIF given in Lemma 11. Using the same approach as was used in the proof of Theorem 6, it can be shown that the remainder term R(Q, Q) is o P (n −1/2 ), and (Q n − Q){IF a (Q) − IF a (Q)} = o P (n −1/2 ). Due to their close similarity to earlier arguments, we omit the details here. We can then conclude thatσ 2 a has influence function IF a = k j=1 k l=1 min(j,l) u=1 where the definition of D jl u is given in (25) in Lemma B.1. Applying the delta method, the influence function ofφ is given by (IF a − φ a IF u )/σ 2 u . This estimator is efficient as its influence function agrees with the EIF of φ a . Proof of Theorem 8. We prove the first claim by showing that E P f jl u (W )/G(t u , W ) ≥ s jl u /G(t u ), for all (u, j, l) such that max{j, l} ≤ k and u ≤ min(j, l). To start, we note that T ⊥ W under P implies that f jl u (w) = s jl u for all w ∈ W. Thus, it suffices to show that E P [1/G(t u , W )] ≥ 1/G(t u ), which follows from the convexity of the function a → 1/a for a > 0 and Jensen's inequality. Strict inequality holds when var P [G(t u , W )] > 0 for some u. To prove the second claim, we note that C ⊥ W implies that G(t j , w) = G(t j ) for all w ∈ W and j ∈ {1, . . . , k}. We focus on the case of RD and RR first. Consider a bivariate function (a, b) → a 2 /b for (a, b) ∈ (0, 1) 2 . 
The Hessian matrix has entries ∂ 2 f /∂a 2 = 2/b, ∂ 2 f /∂a∂b = −2a/b 2 and ∂ 2 f /∂b 2 = 2a 2 /b 3 , with eigenvalues 0 and 2(a 2 + b 2 )/b 3 , both of which are non-negative. Therefore, this function is convex for (a, b) ∈ (0, 1) 2 . We note that where the inequality follows from Jensen's inequality on the function (a, b) → a 2 /b for (a, b) ∈ (0, 1) 2 . For the case of RMST, we consider a (q + 1)-variate function that maps (a 1 , . . . , a q , b) to ( q i=1 a i ) 2 /b for a i ∈ (0, 1) and b ∈ (0, 1). The only non-zero eigenvalue of its Hessian matrix is 2{( q i=1 a i ) 2 + qb 2 }/b 3 , which is positive. Therefore, this function is convex for any q ≥ 1. In the following argument, we will take q ∈ {1, . . . , k}, where the inequality follows from Jensen's inequality. In this section, we present additional simulation results for sample sizes n = 200 and n = 500. The simulation set-up is otherwise the same as described in Section 5. The results for ordinal outcomes are presented in Table 6 , and the results for survival outcomes are presented in Table 7 . In this section, we present additional results when applying our proposed methods to the Covid-19 dataset with ordinal outcomes. In particular, we estimate the efficiency gain from using the fully adjusted and working-model-based estimators that adjust for one of the covariates, for estimating three treatment effect estimands: DIM (Table 8 ), MW (Table 9 ) and LOR (Table 10 ). Table 7 : Simulation results for survival outcomes. We only consider the analytical approach and consider the relative efficiency of fully adjusted estimators for RD at times 1, 2 and 3 (the relative efficiency is the same for RR) and RMST at time 3. The sample size is 200 or 500, and results are based on 1000 replications. We split the external data into two subsamples of size n/2, denoted by D 1 and D 2 . Let σ̂ 2 a,1 be the proposed estimator of the adjusted variance, calculated from observations in D 1 only; and let σ̂ 2 u,2 be the proposed estimator of the unadjusted variance using D 2 .
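The sample-splitting construction just described can be turned into a concrete test of H 0 : φ a = 1. The sketch below assumes we already have the two independent variance estimates and their standard errors (obtained, e.g., from the empirical standard deviations of the estimated influence functions); the function name and signature are ours, and the standard error of the ratio comes from the delta method under independence.

```python
import numpy as np

def split_wald_test(va, se_va, vu, se_vu, alpha=0.05):
    """Wald test of H0: phi = va/vu = 1 using sample-split estimates.

    va, se_va : adjusted-variance estimate (from half D1) and its std. error
    vu, se_vu : unadjusted-variance estimate (from half D2) and its std. error
    Under independence, Var(va/vu) ~ (se_va/vu)^2 + (va*se_vu/vu^2)^2.
    """
    z975 = 1.959963984540054  # Phi^{-1}(0.975)
    phi = va / vu
    se_phi = np.sqrt((se_va / vu) ** 2 + (va * se_vu / vu ** 2) ** 2)
    reject = abs((phi - 1.0) / se_phi) > z975
    ci = (phi - z975 * se_phi, phi + z975 * se_phi)
    return reject, phi, ci
```

In the two-step procedure, the returned interval would additionally be unioned with {1} when the test fails to reject.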
By the asymptotic linearity of these estimators, we have that Moreover, as σ̂ 2 a,1 and σ̂ 2 u,2 are based on different observations, they are independent. Therefore, the delta method implies that √ n (σ̂ 2 a,1 /σ̂ 2 u,2 − φ a ) is asymptotically normal. A Wald test can be used to test the hypothesis H 0 : φ a = 1, and in fact a Wald confidence interval can also be constructed directly, although it might be wider than the one obtained from the proposed two-step procedure. The same argument applies when we consider the working-model-based variance. The sample splitting approach was also used in Williamson and Feng [35] to test a null hypothesis that lies on the boundary of the parameter space. Recall that the confidence set, which we denote by I ts , is constructed using a two-step procedure. We first test the null hypothesis H 0 : φ = 1 using a level α test. If it is rejected, we take the confidence set to be I wald , the Wald confidence interval; otherwise the confidence set is taken to be I wald ∪ {1}. We argue that the asymptotic coverage of this confidence set is at least 1 − α. First, consider the case where φ ≠ 1 under P . This implies that the influence function of φ̂ without sample splitting is not identically 0. Hence, asymptotic linearity in the form of (2) implies that P (φ ∈ I wald ) → 1 − α. In addition, we have that P (φ ∈ I wald ) ≤ P (φ ∈ I ts ).
Next, consider P such that φ = 1. In this case, P (φ ∈ I ts ) ≥ P (φ ∈ I ts | do not reject H 0 )P (do not reject H 0 ). Since 1 ∈ I ts whenever H 0 is not rejected, the first factor equals 1; and since the test has asymptotic level α, the second factor is asymptotically at least 1 − α, so the asymptotic coverage is again at least 1 − α.
1: for k = 1, . . . , B 2 do
2: Resample X k of size N from X with replacement
3: Randomly assign treatment to form X̃ k
4: Compute ψ̂ u,k = ψ̂ u (X̃ k ), ψ̂ m,k = ψ̂ m (X̃ k )
5: end for
6: Let φ̂(X) = var(ψ̂ m,k )/var(ψ̂ u,k )
7: for i = 1, 2
References
[1] Neural network learning: Theoretical foundations.
[2] A substantial and confusing variation exists in handling of baseline covariates in randomized controlled trials: a review of trials published in leading medical journals.
[3] Improving precision and power in randomized trials for Covid-19 treatments using covariate adjustment, for binary, ordinal, and time-to-event outcomes.
[4] Severe outcomes among patients with coronavirus disease 2019 (COVID-19)-United States.
[5] On adaptive estimation. The Annals of Statistics.
[6] Efficient and adaptive estimation for semiparametric models.
[7] Leveraging prognostic baseline variables to gain precision in randomized trials.
[8] Improved precision in the analysis of randomized trials with survival outcomes, without assuming proportional hazards.
[9] Enhanced precision in the analysis of randomized trials with ordinal outcomes.
[10] Adjusting for covariates in randomized clinical trials for drugs and biologics with continuous outcomes. Draft guidance for industry.
[11] Regularization paths for generalized linear models via coordinate descent.
[12] Coarsening at random: Characterizations, conjectures, counter-examples.
[13] Ignorability and coarse data. The Annals of Statistics.
[14] On differentiability of implicitly defined function in semi-parametric profile likelihood estimation.
[15] Statistical estimation: asymptotic theory.
[16] The risks and rewards of covariate adjustment in randomized trials: an assessment of 12 outcomes from 8 studies.
[17] Nonparametric estimation from incomplete observations.
[18] On causal estimation using U-statistics.
[19] Regression models for ordinal data.
[20] Application of time-to-event methods in the assessment of safety in clinical trials. In Design and Analysis of Clinical Trials with Time-to-Event Endpoints.
[21] Robust extraction of covariate information to improve estimation efficiency in randomized trials.
[22] Increasing power in randomized trials with right censored outcomes through covariate adjustment.
[23] Estimation in semiparametric models.
[24] Contributions to a general asymptotic statistical theory.
[25] Influence functions for censored data.
[26] Improving precision by adjusting for prognostic baseline variables in randomized trials with binary outcomes, without regression model assumptions.
[27] A general implementation of TMLE for longitudinal data applied to causal inference in survival analysis.
[28] Semiparametric theory and missing data.
[29] Unified methods for censored longitudinal data and causality.
[30] Targeted maximum likelihood learning.
[31] Asymptotic statistics.
[32] Weak convergence. In Weak Convergence and Empirical Processes.
[33] Increasing the power of the Mann-Whitney test in randomized experiments through flexible covariate adjustment.
[34] Analysis of covariance in randomized trials: More precision and valid confidence intervals, without model assumptions.
[35] Efficient nonparametric statistical inference on population feature importance using Shapley values.
The authors gratefully acknowledge the support of the NIH through award number DP2-LM013340. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
similarly to f kl j but with S replaced by S ǫ . We omit the subscript "j" and superscript "kl" below when it is clear from the context that we are focusing on f kl j . Note that we then see that Let τ full,k (t, w) = I{t > t k } − S(t k , w). By definition, the gradient is given by where the partial derivatives d l , d k , d j , and d j−1 are given by This is the EIF in the full data model, as we work with a locally nonparametric model. The observed data unit (with censoring) is (Y, ∆, W ).
To find the EIF in the observed data model, we proceed as follows. Noting that S(t k , w) = l≤k {1 − h(t l , w)} and Ŝ(t k , w) = l≤k {1 − ĥ(t l , w)}, we can write the difference between Ŝ(t k , w) and S(t k , w) as a telescoping sum over the hazard differences. Hence, The term in the first line on the right-hand side is o P (n −1/2 ) because G, S, Ŝ and Ĥ are uniformly bounded away from 0; Ŝ is uniformly bounded above; and {ĥ(t, ·) − h(t, ·)}{Ĥ(t, ·) − H(t, ·)} L 2 (P W ) = o P (n −1/2 ) for all t. The term in the second line is also o P (n −1/2 ). To see this, we apply a first-order Taylor expansion to the first factor, which is very similar to the second-order Taylor expansion we studied earlier, and the derivative is again bounded by some constant M . In addition, we have that Similarly, we can show that R f 2 + R g2 and R f 3 + R g3 are both o P (n −1/2 ), and so is R j (Q̂, Q), and consequently so is R(Q̂, Q). Our second step is to show that (Q n − Q){IF a (Q̂) − IF a (Q)} is o P (n −1/2 ). To do this, we again use Lemma 19.24 in Van der Vaart [31] . We need to verify the following two conditions: (1) IF a (Q̂) − IF a (Q) L 2 (Q) = o P (1), and (2) IF a (Q̂) lies in a fixed Q-Donsker class with probability tending to 1.
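The discrete product-limit identity S(t k , w) = Π l≤k {1 − h(t l , w)} used throughout these proofs is easy to sketch; the helper name `survival_from_hazards` is ours.

```python
import numpy as np

def survival_from_hazards(h):
    """Discrete-time product-limit: S(t_k) = prod_{l <= k} (1 - h(t_l)).

    h : sequence of discrete hazards h(t_1), ..., h(t_K)
    Returns the survival curve evaluated at t_1, ..., t_K. This is the
    identity that lets S_hat - S be expanded over hazard differences.
    """
    return np.cumprod(1.0 - np.asarray(h, dtype=float))
```

For instance, constant hazard 0.5 on two time points gives survival probabilities 0.5 and 0.25, and replacing h by an estimate ĥ gives the plug-in Ŝ whose error telescopes over the per-time hazard errors.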