key: cord-0443125-jpj2n5tw
authors: Heng, Fei; Sun, Yanqing; Gilbert, Peter B.
title: Estimation and Hypothesis Testing of Strain-Specific Vaccine Efficacy with Missing Strain Types, with Applications to a COVID-19 Vaccine Trial
date: 2022-01-22
journal: nan
DOI: nan
sha: 2c554fc82654b98b6b658433d3edd7c281b624ad
doc_id: 443125
cord_uid: jpj2n5tw

Statistical methods are developed for analysis of clinical and virus genetics data from phase 3 randomized, placebo-controlled trials of vaccines against novel coronavirus COVID-19. Vaccine efficacy (VE) of a vaccine to prevent COVID-19 caused by one of finitely many genetic strains of SARS-CoV-2 may vary by strain. The problem of assessing differential VE by viral genetics can be formulated under a competing risks model where the endpoint is virologically confirmed COVID-19 and the cause-of-failure is the infecting SARS-CoV-2 genotype. Strain-specific VE is defined as one minus the cause-specific hazard ratio (vaccine/placebo). For the COVID-19 VE trials, the time to COVID-19 is right-censored, and a substantial percentage of failure cases are missing the infecting virus genotype. We develop estimation and hypothesis testing procedures for strain-specific VE when the failure time is subject to right censoring and the cause-of-failure is subject to missingness, focusing on $J ge 2$ discrete categorical unordered or ordered virus genotypes. The stratified Cox proportional hazards model is used to relate the cause-specific outcomes to explanatory variables. The inverse probability weighted complete-case (IPW) estimator and the augmented inverse probability weighted complete-case (AIPW) estimator are investigated. Hypothesis tests are developed to assess whether the vaccine provides at least a specified level of efficacy against some viral genotypes and whether VE varies across genotypes, adjusting for covariates. The finite-sample properties of the proposed tests are studied through simulations and are shown to have good performances. In preparation for the real data analyses, the developed methods are applied to a pseudo dataset mimicking the Moderna COVE trial.

Randomized, placebo-controlled vaccine efficacy (VE) trials have demonstrated that several SARS-CoV-2 candidate vaccines prevent acquisition of COVID-19 with VE level above 50% and reaching up to 95% (e.g., Polack et al., 2020; Baden et al., 2020) .

All of these vaccines use the so-called Wuhan or Washington strain (henceforth WA strain) of SARS-CoV-2 in the vaccine construct. Genetic variability of SARS-CoV-2 viruses has been increasing over time, with several variants emerging that have several genetic mutations compared to the WA strain, generating concern that the level of VE could be lower against certain variants (Lauring and Hodcroft, 2021) . For efficacy results reported through about February 2021, the viruses circulating during the trial were almost all WA strains or very slight variants (1 or 2 mismatches), such that the trials did not provide information on VE against variants. For recent efficacy results for trials in the United Kingdom, Brazil, and South Africa, variants dominated the circulating strains, such that estimates did assess VE against variants, and the estimates were lower than for trials in regions where WA strains dominated (Madhi et al., 2021; Sadoff et al., 2021) . To date the published statistical analyses for understanding how VE may depend on variants has consisted of analyses of VE against all SARS-CoV-2 strains circulating in a given geographic region, for example the ENSEMBLE trial of the Ad26.COV2.S vaccine reported an estimate of VE of 66.9% in the U.S. where the WA strain predominated and an estimate of VE of 52.0% in South Africa where the B.1.351 variant strain predominated (Sadoff et al., 2021) .

Many of the COVID-19 VE trials are sequencing the SARS-CoV-2 spike gene for all COVID-19 primary endpoint cases, where in general the vaccines only include the spike gene. These data will enable sieve analysis (Gilbert, Ashby, Self, 1998; Edlefsen et al., 2015; Neafsey et al., 2015; Juraska et al., 2018) , which, based on a competing risks failure time data set-up, assesses whether and how VE depends on genetic features of the pathogen strain causing the disease endpoint. Rolland and Gilbert (2021) briefly discussed motivation and applications of sieve analysis in SARS-CoV-2 VE trials. While many statistical methods of sieve analysis have been developed, some features of the forthcoming data sets for the COVID-19 VE trials require some novel methods development. First, the methodology needs to allow for missing sequences, because sequencing technology is only able to measure the spike sequence if the viral load of the sample used for sequencing is sufficiently high (Xiao et al., 2022) . A wealth of data in natural history studies suggest an expected 20-30% of placebo arm COVID-19 endpoint cases will have missing sequences, and the rate of missingness will generally be higher in vaccine arm COVID-19 endpoint cases, given that the vaccines usually have some impact to suppress viral load. Second, the methodology needs to handle J discrete categorical genotypes with J > 2 and allowing the multiple genotypes to be either unordered categorical or ordered categorical (e.g., a Hamming distance that is the number of amino acids in the Spike protein that are mismatched to the WA vaccine strain). Third, it is useful for the methods to provide inferences for whether VE differs across genotypes (with variation termed a "sieve effect"), not only providing separate inferences about VE against each individual genotype. Lastly, the methodology needs to accommodate that the background incidence of COVID-19 can vary over calendar time, given waxing and waning outbreaks. In this work, we focus on addressing these needs through a proportional hazards model, which is reasonable for the COVID-19 VE trials during their primary periods of follow-up, given that these periods last less than 6 months and immune responses induced by the vaccines are fairly stable during these periods. This indicates that the Cox model assumption of time-constant VE against any given genotype is a reasonable assumption, at least approximately.

Among Cox model methods that handle missing causes of failure, Goetghebeur and Ryan (1995) used weighted Cox modeling, and Lu and Tsiatis (2001) used a parametric model for the probability of observing a sequence and used multiple imputation to predict missing genotypes from auxiliary covariates. Adapting the theory of Robins, Rotnitzky, and Zhao (1994) , Gao and Tsiatis (2005) considered linear transformation models -with the Cox model a special case -developing inverse probability weighted (IPW) and augmented IPW (AIPW) methods; Hyun et al. (2012) also developed IPW and AIPW methods for the Cox model. These papers focused on two causes of failure, and did not provide techniques for sieve effect tests. Moreover, auxiliary covariates were considered, but for the COVID-19 application auxiliary marks are more valuable. (A 'mark' is a random variable only meaningfully defined in failure endpoint cases.) In particular, the key auxiliary mark is the SARS-CoV-2 viral load from the blood sample used for sequencing the virus. In addition, for the applica-tions it is useful to allow separate baseline hazards for different calendar intervals of enrollment, as one way to handle unpredictable secular trends in placebo COVID-19 arm incidence. While all of the methods could be devised to allow multiple baseline hazards, the available implementations typically do not include this implementation.

This current work most closely resembles that of Hyun et al. (2012) , where we also develop IPW and AIPW methods, and take on the new features not considered previously of handling J > 2 unordered or ordered categorical genotypes, and developing hypothesis testing procedures for multiple new questions of interest including the assessment of sieve effects. The methods are implemented in the R package cmprskPH available at https://github.com/fei-heng/cmprskPH. The rest of this article is organized as follows. In Section 2, we present the mathematical framework for estimating VE against specific genotypes which are subject to missingness. Section 3 is devoted to demonstrating the development of IPW and AIPW estimation methods. Asymptotic properties for the proposed estimators are established in Section 4, with proofs in the Web Appendix A. In Section 5, the confidence intervals and hypothesis testing procedures for VE are derived. We conduct simulation studies to examine the finite-sample performance of estimators and the tests in Section 6, and we apply our methods to a pseudo dataset mimicking the Moderna COVE trial in Section 7.

2 Stratified cause-specific proportional hazards models and missing causes Let τ be the duration of the vaccine trial. Let T be the failure time, V the cause of failure (also termed as discrete mark or type of infecting strain in VE trials), and Z(t) a possibly time-dependent p-dimensional covariate. Statistical interest focuses on the conditional cause-specific hazard rate of cause j defined by

for j = 1, . . . , J. The function λ j (t|z(t)) is the instantaneous failure rate from cause j at time t in the presence of the other failure types. For VE trials, we specifically consider the covariate z(t) = (z 1 , z 2 (t)) T , where z 1 is the treatment group indicator (1=vaccine; 0=placebo) and z 2 (t) the vector of other covariates. Vaccine efficacy to reduce susceptibility to strain j at time t is defined as

Gilbert (2000) discussed the assumptions required for the strain-specific vaccine efficacy to have a meaningful biological interpretation.

In practice, different key subgroups (e.g., men and women; individuals living in different geographic regions, individuals enrolled during different calendar intervals)

typically have different baseline cause-specific hazards of failure. The stratified causespecific proportional hazard regression model postulates that the conditional causespecific hazard function for cause j for an individual in the kth stratum with the covariate value z(t) = (z 1 , z T 2 (t)) T equals

for j = 1, . . . , J and k = 1, . . . , K, where λ 0kj (·) is an unspecified cause-specific baseline hazard function for the kth stratum and K is the number of strata. Here β j = (α j , γ j ) is a p-vector of regression parameters for the jth strain. Define β = (β 1 , . . . , β J ). Model (1) allows different baseline functions for different strata. Similar generalizations of the Cox model were studied by Dabrowska (1997) . Under model

(1), the covariate-adjusted strain-specific vaccine efficacy V E j is one minus exp(α j ).

To ease the notation in the method development, we generally use Z(t) = (Z 1 , Z T 2 (t)) T to represent the covariate process on [0, τ ], denoted by Z(·). The right-censored failure time data are observations of (X, δ, Z(·)), where X = min{T, C}, δ = I(T ≤ C), and C is a censoring random variable. In a competing risks framework, the cause V can only be observed when failure occurs, whereas it is unknown if the failure time T is censored. Then, the completely observed right-censored competing risks data are observations of the random variables (X, Z(·), V ) for δ = 1 and (X, Z(·)) for δ = 0.

In the presence of missing causes, we introduce a binary indicator, R, representing whether all possible data are observed for a subject. R equals to one if either δ = 0 (right-censored) or if δ = 1 and V is observed, and zero otherwise. The auxiliary covariates A may be helpful for predicting missingness and for informing about the distribution of missing causes. A may include useful auxiliary marks because causes can only be missing for failures. In the COVID-19 application presented in Section 7,

A is the SARS-CoV-2 viral load measured in COVID-19 endpoint cases, a continuous mark that is possibly associated with the probability of missingness and the strain type V .

We assume that the censoring time C is conditionally independent of (T, V ) given Z(·) for an individual in the kth stratum. We also assume the cause V is missing at random (Rubin, 1976) ; that is, given δ = 1 and W = (T, Z(T ), A) of an individual in the kth stratum, the probability that the cause V is missing depends only on the observed W , not on the value of V ; this assumption is expressed as r k (W ) ≡ P (R = 1|δ = 1, W ) = P (R = 1|V, δ = 1, W ).

(2)

Let π k (Q) = P (R = 1|Q) where Q = (δ, W ). Then π k (Q) = δr k (W ) + (1 − δ). The missing at random assumption (2) also implies that V is independent of R given Q:

For an observed value w of W of an individual in the kth stratum, we write r k (w) = P (R = 1|δ = 1, W = w) and ρ kj (w) = P (V = j|δ = 1, W = w). The stratum-specific definitions of r k (w) and ρ kj (w) leave the options for the models of the probability of complete-case and cause distribution to be different for different strata.

Let n k be the number of subjects in the kth stratum; the total sample size is n = K k=1 n k . Let {X ki , Z ki (·), δ ki , R ki , V ki , A ki ; i = 1, . . . , n k } be iid replicates of {X, Z(·), δ, R, V, A} from the kth stratum. The observed data are denoted by

, R ki = 1} for δ ki = 0. We assume that {O ki ; i = 1 . . . , n k , k = 1, . . . , K} are independent for all subjects. Similarly, we denote W ki = (T ki , Z ki (T ki ), A ki ) and Q ki = (δ ki , W ki ).

We consider a parametric model r k (W ki , ψ k ) for r k (W ki ), where ψ k is an unknown vector of parameters to be further discussed in the next section. Let π k (Q ki , ψ k ) = δ ki r k (W ki , ψ k ) + (1 − δ ki ). Additional notation is introduced in the following. For

where for any z ∈ R p , z ⊗0 = 1, z ⊗1 = z and z ⊗2 = zz T . Define s 

When there are no missing causes, for each j, β j in model (1) can be estimated by maximizing the local log-partial likelihood function

where N kij (dt) = I(X ki ≤ t, δ ki = 1, V ki = j) is the counting process with a jump at the uncensored failure time X ki and the associated cause V ki = j. Taking the derivative of l(j, β) with respect to β gives the score function

The maximum partial likelihood estimator is a solution to U(j, β j ) = 0.

Following Horvitz and Thompson (1952) , inverse probability weighting of completecases has been commonly used in missing data problems. Let r k (W ki , ψ k ) be the parametric model for the probability of complete-case r k (W ki ) defined in (2), where ψ k is a q-dimensional parameter. For example, one can assume the logistic model with logit(r k (W ki , ψ k )) = ψ T k W ki for those with δ ki = 1. By (2), the maximum likelihood estimator ψ = ( ψ 1 , . . . , ψ K ) of ψ = (ψ 1 , . . . , ψ K ) is obtained by maximizing the observed data likelihood,

We propose the following inverse probability weighted (IPW) estimating equation for β:

The IPW estimator of β j solves the above equation and is denoted by β j,I .

Let Λ 0kj (t) = t 0 λ 0kj (s) ds be the cumulative baseline function for each k and j. Let K(·) be a kernel function with bandwidth h and K h (x) = K(x/h)/h. The baseline function λ 0kj (t) can be estimated by λ I 0kj (t), obtained by smoothing the increments of the following estimator of the cumulative baseline function Λ 0kj (t):

.

The IPW estimator β j,I uses only complete cases and is inefficient. To increase estimation efficiency, we propose the augmented inverse probability weighted complete-case (AIPW) estimating function following the idea of Robins et al. (1994) . The proposed AIPW estimating equation utilizes available information for individuals with missing causes through a consistent estimator of ρ kj (W ), the conditional distribution of the failure cause.

In the case that J j=1 ρ kj (W ki ) = 1, we posit parametric models ρ kj (W ki , ϕ kj ) for ρ kj (W ki ) for j = 1, . . . , J − 1, where ϕ kj are unknown parameters. It is natural to use a logistic multinomial regression model logit{ρ kj (W ki , ϕ kj )} = W T ki ϕ kj for j = 1, . . . , J −1, but other parametric models can also be accommodated. Under the MAR assumption (3), ρ kj (W ki ) can be estimated using the complete cases with R ki = 1 and δ ki = 1. The maximum likelihood estimator ϕ kj of ϕ kj can be obtained by maximizing the likelihood based on complete-case data

Since ϕ kj is the maximum likelihood estimator, it follows that for a correctly specified model ρ kj (W ki , ϕ kj ), ϕ consistently estimates ϕ kj , the true value of the parametric Robins et al. (1994) , we obtain the following AIPW estimating equation for β:

The AIPW estimator of β j solves the above equation and is denoted by β j,A . We can similarly estimate the baseline hazard function λ 0kj (t) by a kernel estimator

is an estimator of the cumulative baseline function Λ 0kj (t).

We investigate the asymptotic properties of the IPW estimator β I = ( β 

Then S ψ ki and I ψ k be the score vector and information matrix for ψ k under (6), with

The consistency and asymptotic normality of β I are established in the next two theorems. 

as n → ∞, where Σ = diag{Σ j , j = 1, . . . , J} with Σ j given in the condition (A.3),

Here D kj is a p × q matrix.

Note that the derivative of U I (j, β j , ψ) with respect to β j equals

whereJ k (t, β j , ψ k ) is defined at the end of Section 2. Let Σ j,I = −n −1 U ′ I (j, β j,I , ψ) and Σ I = diag{ Σ j,I , j = 1, . . . , J}.

Let D kj , I ψ k , S ψ ki be the empirical counterparts of D kj , I ψ k , S ψ ki , respectively, obtained by replacing expectation with sample average, ψ k with ψ k , and M kij (dt) with

By the consistency of β I and ψ, Σ and Σ * I can be consistently estimated by Σ I andΣ *

The asymptotic variance of n 1/2 ( β j,I − β j ) can be consistently estimated by ( Σ j,I ) −1 Σ * j,I ( Σ j,I ) −1 . The IPW estimators β j,I , j = 1, . . . , J, are not asymptotically independent.

We introduce the following notation:

The next theorem shows that the AIPW estimator β j,A is consistent if either r k (w, ψ k ) or g k (a|t, v, z, θ k ) is correctly specified, a double robustness property.

Theorem 4.3 Assuming Condition A, β j,A P −→β j uniformly in j = 1, . . . , J as n →

∞. This consistency holds if either r k (w, ψ k ) or g kj (a|t, z, θ k ) is correctly specified.

Following the proofs of Theorem 4.2 and 4.3, it is easy to show that n 1/2 β j,A −β j is asymptotically normal for j = 1, . . . , J if either r k (w, ψ k ) or g kj (a|t, z, θ k ) is correctly specified. When both r k (w, ψ k ) and g kj (a|t, z, θ k ) are correctly specified, Theorem 4.4

below shows that β j,A is more efficient than β j,I .

Theorem 4.4 Assuming Condition A, if both r k (w, ψ k ) and g kj (a|t, z, θ k ) are correctly specified for j = 1, . . . , J and for k = 1, . . . , K, we have (s)) ds] and λ kj (t|Z ki (t)) = λ 0kj (t) exp(β T j Z ki (t)). We have

By the consistency of β A , ψ, and ρ(·), Σ and Σ * A can be consistently estimated by Σ A andΣ * A , respectively. Under Theorem 4.4, we have n 1/2 β j,

The asymptotic variance of n 1/2 ( β j,A − β j ) can be consistently estimated by ( Σ j,A ) −1 Σ * j,A ( Σ j,A ) −1 . The AIPW estimators β j,A , j = 1, . . . , J, are not asymptotically independent.

The estimator β j,A is more efficient than β j,I in the sense that

Equation (12) shows that the asymptotic covariance Cov{n 1/2 β j,A − β j } is smaller than Cov{n 1/2 β j,I − β j }.

Under the stratified cause-specific Cox model (1), the strain-specific vaccine efficacy 

By the asymptotic results in Section 4, n 1/2 α − α D −→N(0, Ω α ), as n → ∞, where Ω α is the asymptotic covariance matrix consists of elements in corresponding positions

of Ω. Let Ω α be a consistent estimator of Ω α and Ω α,ij be the (i, j)th entry of Ω α . A large sample 100(1−α)% confidence interval for α j is given by α j ±z α/2 σ j , j = 1, . . . , J,

where z α/2 is the upper α/2th percentile of the standard normal distribution and σ j = ( Ω α,jj /n) 1/2 is an estimate of σ j , the standard error of α j .

The strain-specific vaccine efficacy V E j = 1 −exp(α j ) can be estimated by V E j = 1 − exp( α j ). By the asymptotic property of α j and the delta method, we have

An approximate 100(1 − α)% confidence interval for V E j is then given by V E j ± z α/2 σ j exp( α j ), j = 1 . . . , J.

Using the transformation log((1 − V E)/(1 − V E)) = α j − α j , we can construct an alternative large-sample approximation of the 100(1 − α)% confidence interval for

show that the latter one has better coverage probability.

To measure how much greater the level of VE is against a strain V = j virus than against a strain V = i virus, we define

which can be estimated by V D(i, j) = exp( α i − α j ). A larger V D(i, j) value indicates that the vaccine provides greater protection against a strain type j virus than against a strain type i virus. The asymptotic variance of α i − α j can be estimated by V ar( α i − α j ) = n −1 ( Ω α,ii + Ω α,jj − 2 Ω α,ij ). A large sample 100(1 − α)% confidence interval for V D(i, j) using the logarithm transformation is

We propose test procedures to evaluate various hypotheses concerning strain-specific VE. The tests assess if the vaccine provides at least a certain specified level of efficacy against some strains and whether vaccine efficacy varies across strains. The hypothesis tests concerning V E j are constructed based on the estimator of α j .

(1) First, we consider testing V E j ≤ V E 0 for j = 1, . . . , J, where V E 0 is a fixed constant such as 0.30 or 0. Let c 0 = log(1 − V E 0 ). We develop the tests of the null hypothesis (A) that VE is at most V E 0 against all strains H A0 : V E j ≤ V E 0 for all j = 1, · · · , J. This is equivalent to testing H A0 : α j ≥ c 0 , for all j = 1, · · · , J versus one of the following alternative hypotheses H A1 : α j ≤ c 0 with strict inequality for some j, H A2 : α j = c 0 for some j.

Thus, H A0 implies that VE against any strain is no more than V E 0 , say, 30%. The alternative H A1 indicates that the VE is higher than V E 0 for at least some of the viral strains, while H A2 states that VE differs from V E 0 for some of the viral strains.

The following test statistics are proposed for detecting departures from H A0 in the directions of H A1 and H A2 , respectively:

The test statistic U 1 can be used to detect the departure H A1 from H A0 and U 2 can be used to detect the general departure H A2 from H A0 . Under H A0 , the test statistic U 1 has the asymptotic distribution of inf 1≤j≤J Z j /σ j and U 2 has the asymptotic distribution of J j=1 Z 2 j /σ 2 j , where the vector (Z 1 , . . . , Z J ) follows a multivariate normal distribution with mean vector 0 = (0, . . . , 0) and covariance matrix Ω α ). Let ( Z 1 , . . . , Z J ) ∼ N(0, Ω α ). Let U * 1,α be the αth percentile of U * 1 = inf 1≤j≤J Z j / σ j , and U * 2,α the upper αth percentile of U * 2 = J j=1 Z 2 j / σ 2 j , respectively. If U 1 < U * 1,α , the test based on the test statistic U 1 rejects H A0 in favor of the alternative H A1 at the α level of significance. If U 2 > U * 2,α , the test based on the test statistic U 2 rejects H A0 in favor of the alternative H A2 .

(2) To test VE against each strain j, j = 1, . . . , J, we test the following hypotheses:

versus one of the following alternative hypotheses

The following test statistics are used for detecting departures from H Aj0 in the directions of H Aj1 and H Aj2 , respectively:

The critical values are obtained similar to above but only for one j at a time. In clinical trials that require the simultaneous test of J null hypotheses {H Aj0 , j = 1, . . . , J}, a common approach is to apply Bonferroni adjustment of the level of significance.

While the Bonferroni procedure is simple to implement, it tends to be quite conservative for control of the familywise error rate (FWER), especially when the number of tests is large. Since the test statistics follow multivariate Gaussian distribution asymptotically, we can apply a less conservative step-downŠidák-like procedure here (Holland and Copenhaver, 1987) . The corresponding adjusted p-values are given by

. . , J} are ordered unadjusted p-values from smallest to largest.

(3) Next, we develop tests for whether strain-specific VE depends on strain type, so-called "sieve effect" tests. These tests evaluate the null hypothesis (B)

versus the following alternative hypotheses:

H B0 implies that strain-specific VE does not vary with strain type. The ordered alternative H B1 states that VE decreases with strain type. Under the proportional hazards model (1) the hypotheses (B) can be rewritten as H B0 : α 1 = α 2 = · · · = α J against the following alternative hypotheses H B1 : α 1 ≤ · · · ≤ α j ≤ · · · ≤ α J with at least one strict inequality, H B2 : α i = α j for at least one pair of i and j, 1 ≤ i < j ≤ J.

The following test statistic T 1 is suggested for detecting the monotone departure H B1 from H B0 :

To detect general alternative H B2 from H B0 , we consider the test statistic

.

The asymptotic distribution of test statistic T 2 under H B0 can be approximated by

We conduct a simple simulation study to examine the performance of the proposed methods. We consider a p = 2 dimensional covariate Z = (Z 1 , Z 2 ), where Z 1 is a Bernoulli random variable with probability of success 0.5 that represents the treatment group indicator, and Z 2 is a uniformly distributed random variable on (0, 1).

We consider the following cause-specific proportional hazards model for J = 2 causes and K = 3 strata:

λ kj (t|z) = t θ kj exp(α j z 1 + γ j z 2 ), j = 1, 2, k = 1, 2, 3,

where θ kj , α j and γ j are the parameters to be specified. All failure times greater than τ = 1 are right-censored at τ . In addition, random censoring times are generated from an exponential distribution, independent of (T, V ), with parameter adjusted so approximately 40% of the observations are censored. The sizes and powers of the tests at the nominal 0.05 level are estimated from 1000 independent samples.

We consider a single auxiliary covariate A that follows a Bernoulli distribution with success probability of 0.5. For the cause V = j, we also generate a single auxiliary mark variable A that follows a uniform distribution on (2a(j − 1), 1 + 0.5aj).

We examine the performance of the estimators under three different levels of association between A and the failure cause V , by considering the settings a = 0, 0.2, and 0.5, which result in approximate Kendall's tau values of 0, 0.3, and 0.6, respectively.

These three auxiliary association level settings are denoted by (Aux0), (Aux1), and (Aux2), respectively. Note that A is independent of V for the setting (Aux0), and the association between A and V increases from (Aux1) to (Aux2).

The cause V is missing at random (MAR). r k (W ), the conditional probability that the cause is not missing when δ = 1 for k-th stratum, follows a logistic regression model logit{r k (W, ψ)} = ψ 1 + ψ 2 Z 1 + ψ 3 A. With ψ = (1.5, −1, −0.5), we have about 45% missingness for Z 1 = 1 and about 20% missingness for Z 1 = 0. Since only two causes are considered in this simulation study, we posit a logistic regression model logit{ρ k2 (W, ϕ)} = ϕ 1 + ϕ 2 Z 1 + ϕ 3 A for ρ k2 (W ), the probability P (V = 2|δ = 1, W )

in the k-th stratum. The parameter ρ k1 (W ) is estimated by 1 − ρ k2 (W, ϕ).

We conducted simulations with sample size n = 1200 and with different sets of values for θ kj , α j and ϕ j , j = 1, 2, k = 1, 2, 3. We choose ( (2) For testing H B0 , N 1 : (α 1 , α 2 ) = (log(1 − 0.5), log(1 − 0.5)), N 2 : (α 1 , α 2 ) = (log(1 − 0.7), log(1 − 0.5)), and N 3 : (α 1 , α 2 ) = (log(1 − 0.9), log(1 − 0.5)).

The for both IPW and AIPW methods. The powers increase as the extend of departure from the corresponding null hypothesis increases. When the correlation between auxiliary A and cause V becomes stronger, powers using AIPW method increase and are slightly higher than those using IPW methods.

7 An application to a pseudo dataset for the Mod-

We apply the proposed methods to a pseudo dataset designed to approximate the Moderna COVE vaccine efficacy trial of the mRNA-1273 vaccine (Baden et al., 2020) .

The primary endpoint is virologically confirmed COVID-19 disease (e.g., Krause et al., 2020, Lancet; Mehrotra et al., 2020, Ann Int Med) . The data set approximately fits the Moderna COVE trial design, in terms of numbers of enrolled study participants, numbers of COVID-19 endpoints in the two treatment groups, and through analysis of 1122 randomly sampled real SARS-CoV-2 Spike protein sequences downloaded from GISAID to determine the strain type (i.e. cause) V with a realistic distribution of interest.

The COVE trial randomized adults at risk for COVID-19 to vaccine or placebo in one-to-one allocation (administered at Day 1 and Day 29), and was designed to follow participants for occurrence of the COVID-19 primary endpoint for 2 years. Participants were enrolled and followed starting on July 27, 2020, and in late December of 2020 the U.S. FDA granted Emergency Use Authorization to the vaccine based on its demonstrated high vaccine efficacy. Shortly after that, a process began to unblind study participants and to offer the vaccine to placebo recipients. Our analysis restricts to the primary period of follow-up (pre-unblinding) and to the participants who tested negative for SARS-CoV-2 at enrollment and who received both vaccinations without specified protocol violations (the primary analysis cohort). Following is correlated with the probability of missingness and the strain type V , we consider it as an auxiliary variable in the analysis. A special problem that needs to be addressed is that samples with low viral load may have the probability of missingness very close to 1. This "positivity problem" can make methods that use inverse probability weighting perform unstably. To allow our IPW and AIPW methods to work more robustly, we classify the COVID-19 endpoint participants with low viral load as a minor type of failure causes while the study endpoint of interest is still COVID-19. Specifically, V is redefined as the cause of failure with three types: 1 = vaccine-matched genotype AND viral load above minimum threshold; 2 = vaccine-mismatched genotype AND viral load above minimum threshold; and 3 = viral load below minimum threshold.

Here, the minimum threshold is chosen large enough such that based on an empirical analysis COVID-19 endpoint cases with V L at the minimum threshold have at least 0.05-0.10 probability that the sequence genotype is observed. We use the minimum threshold h 0 = 1 for illustration.

Since the cause V = 3 can be determined based on the observed viral load < h 0 (an auxiliary variable), the probability of it being not missing is one conditional on the observed viral load. The missing at random (MAR) assumption still holds:

The data analysis uses K = 3 baseline strata defined by geography and calendar time. Let T be the time from 13 days post dose 2 until the COVID-19 endpoint. We consider the following stratified cause-specific proportional hazards model:

for j = 1, 2, 3 and k = 1, 2, 3, where Trt is the vaccine group indicator, Highrisk is the baseline covariate high risk/at-risk pre-existing condition (1=yes, 0=no), Age 65 is the age group at enrollment (1="65+", 0="18-64"), Minority is the baseline covariate underrepresented minority status (1=minority, 0=non-minority), and Sex is sex assigned at birth (1=female, 0=male). Tables 10-13 . The vaccine efficacy is significantly greater than the null level 30% for hamming distances less than 9. We also confirm a trend that the vaccine provides better protection against circulating viruses with smaller hamming distances.

1, . . . , K, the second partial derivative of λ 0k (t, v) with respect to v exists and is continuous on [0, τ ] × [0, 1]. The covariate process Z ki (t) has paths that are left continuous and of bounded variation, and satisfies the moment condition 

(A.

3) The limit p k = lim n→∞ n k /n exists and 0 < p k < 1. s 

Smoothed Cox regression

Comprehensive sieve analysis of breakthrough hiv-1 sequences

Comparison of competing risks failure time methods and timeindependent methods for assessing strain variations in vaccine protection

Statistical methods for assessing differential vaccine protection against human immunodeficiency virus types

An improved sequentially rejective bonferroni test procedure

A generalization of sampling without replacement from a finite universe

Viral genetic diversity and protective efficacy of a tetravalent dengue vaccine in two phase 3 trials

Covid-19 vaccine trials should seek worthwhile efficacy

Genetic variants of sars-cov-2-what do they mean?

Efficacy of the chadox1 ncov-19 covid-19 vaccine against the b. 1.351 variant

Clinical endpoints for evaluating efficacy in COVID-19 vaccine trials

Genetic diversity and protective efficacy of the RTS,S/AS01 malaria vaccine

Safety and efficacy of the bnt162b2 mrna covid-19 vaccine

Estimation of regression-coefficients when some regressors are not always observed

Sieve analysis to understand how SARS-CoV-2 diversity can impact vaccine protection

Inference and missing data

Safety and efficacy of single-dose ad26. cov2. s vaccine against covid-19

Multiple approaches for massively parallel sequencing of sars-cov-2 genomes directly from clinical samples

the estimate of covariate coefficients; SE, the estimated standard error of the estimators of covariate coefficients

This research was partially supported by NIAID NIH award number R37AI054165.Dr. Sun's research was also partially supported by the National Science Foundation grants DMS1915829.

Let F t = σ{I(X ki ≤ s, δ ki = 1), I(X ki ≤ s, δ ki = 0), V ki I(X ki ≤ s, δ ki = 1), Z ki (s); 0 ≤ s ≤ t, i = 1, . . . , n k , k = 1, . . . , K} be the (right-continuous) filtration generated by, that is, the causespecific instantaneous failure rate at time t given the observed information up to time t only depends on the failure status and the current covariate value. Under model (1), the cause-specific intensity of N kij (t) with respect to F t equals Y ki (t)λ kj (t|Z ki (t)). Let Aalen and Johansen (1978) , for j = j ′ , M kij (·) and M kij ′ (·) are orthogonal square integrable martingales with respect to F t for j, j ′ = 1, . . . , J.. . , n k , k = 1, . . . , K} be the right continuous filtration obtained by adding R ki and δ Johansen (1978) , the processes M * kij (t) and M * kij ′ (·), 0 ≤ t ≤ τ are orthogonal square integrable martingales for j = j ′ .The following regularity conditions are assumed throughout the rest of the paper.Most of the notation can be found at the end of Section 2.