key: cord-0646835-7fj3zts4
authors: Schnitzer, Mireille E.; Harel, Daphna; Ho, Vikki; Koushik, Anita; Merckx, Joanna
title: Identifiability and estimation under the test-negative design with population controls with the goal of identifying risk and preventive factors for SARS-CoV-2 infection
date: 2020-06-04
journal: nan
DOI: nan
sha: 77d79db743161f776ec3db337bde603f9dbb1aab
doc_id: 646835
cord_uid: 7fj3zts4

Due to the rapidly evolving COVID-19 pandemic caused by the SARS-CoV-2 virus, quick public health investigations of the relationships between behaviours and infection risk are essential. Recently the test-negative design was proposed to recruit and survey participants who are being tested for SARS-CoV-2 infection in order to evaluate associations between their characteristics and testing positive on the test. It was also proposed to recruit additional untested controls who are part of the general public in order to have a baseline comparison group. This study design involves two major challenges for statistical risk factor analysis: 1) the selection bias invoked by selecting on people being tested and 2) imperfect sensitivity and specificity of the SARS-CoV-2 test. In this study, we investigate the nonparametric identifiability of potential statistical parameters of interest under a hypothetical data structure, expressed through missing data directed acyclic graphs. We clarify the types of data that must be collected in order to correctly estimate the parameter of interest. We then propose a novel inverse probability weighting estimator that can consistently estimate the parameter of interest under correctly specified nuisance models.

Under the current pandemic caused by the SARS-CoV-2 virus, where the resulting illness is referred to as COVID-19, it is challenging to implement fast epidemiological inquiries to map and understand the disease.

Highly infectious 1 in a completely non-immune population and targeting primarily the respiratory system with clinical symptoms that include fever, cough, and fatigue, 2, 3 this illness continues to cause substantial morbidity and mortality, straining the healthcare systems of many countries. With the aim of reducing infection, global campaigns encourage individuals to modify their daily behaviour by measures which include physical distancing, the use of masks, and intensified hygienic practices. Much interest lies in establishing whether these interventions are effective at reducing infection probabilities at an individual or population level.

Given the challenges involved in testing large portions of the population for active infection with SARS-CoV-2, cases in the general public are typically ascertained through testing sites. In Canada, though regulations have varied by epidemic stage and jurisdiction, 4 in order to obtain a test, individuals may be required to be experiencing symptoms of COVID-19 and/or have other reason to believe that they are infected, such as being a healthcare worker. Thus, if potential study participants are recruited at test centers, the resulting study cohort will not be representative of the general population at risk for the disease. Further, due to the nature of testing self-selection, associations measured between participant covariates (e.g. demographics, characteristics, and behaviours) and outcomes will not necessarily be representative of true causes or even predictors of infection. 5

Recently, Vandenbroucke et al. 6 proposed a modified case-control design that combines a test-negative design, best known from its use in vaccine effectiveness research, 7, 8 with the recruitment of additional population controls in order to identify risk/preventive factors of SARS-CoV-2 infection. This study design involves both the recruitment of patients who are seeking testing for SARS-CoV-2 infection and of untested individuals in the general population. They propose a matching approach to compare differences in study participant covariates between people who test positive, people who test negative, and people who are not seeking testing in order to triangulate factors that likely increase or decrease the odds of infection. Karmakar and Small 9 proposed a more efficient test to compare factors between the three groups. However, neither of these studies addresses the problem from a missing data perspective while dealing with potential selection bias of having two groups of participants who received testing based on outcome-related symptoms.

Conflicts of interest statement: Joanna Merckx is an employee of bioMérieux. This work is unrelated to her function as Director Medical Affairs, bioMérieux Canada, Inc. The other authors have declared no conflict of interest.

In this methodological study, we give a specific definition of the parameters of interest in a "risk/preventive factor" analysis, corresponding to modeling the covariates that are predictive of SARS-CoV-2 infection in a regression model. We then provide identifiability conditions under the test-negative design and an assumed structural setting in the context of the COVID-19 pandemic. A parameter is identifiable if we would know its exact value given an infinite sample size; our study thus gives some conditions under which identifiability is achieved. Identifiability is of great importance because without this guarantee, we cannot construct a consistent estimator under the given assumptions. Finally, we propose an inverse probability weighting (IPW) estimator 10-12 of the parameters of interest that may be feasible in this setting. We evaluate this approach through a simulation study and compare it with a naive approach to estimation that does not incorporate untested population controls.

Given access to the recruitment of people seeking SARS-CoV-2 tests at a given testing site, we consider a study design that involves the recruitment of two groups of people: (1) people who are being tested for SARS-CoV-2 at the test site and (2) members of the general population, possibly selected as matched pairs for those being tested, who are not seeking testing but who are under the jurisdiction of the test site. It must be the case that members of group (2) would be able to access the test site were they to have symptoms of COVID-19 or otherwise qualify for and seek testing. Participants recruited from group (1) are denoted R = 1 and those from group (2) are denoted R = 0. This is essentially a case-control study design except that cases are those who are tested and controls are those who are untested. We assume random sampling of independent members of both groups, with a total sample size of n. The need for independence implies that, for example, only one member per household should be recruited.

All participants are given a questionnaire to collect information about the potential risk/preventive factors under study, X, which we will now refer to simply as risk factors, and related confounders, C. The questionnaire may also capture information about current symptoms related to COVID-19, W . It is also necessary that we receive the result of the SARS-CoV-2 test from those being tested. The test result is denoted Y (1=positive; 0=negative), which is only observable for those who are being tested (R = 1).

Recruitment from groups (1) and (2) will give us three categories of participants to contrast: those being tested and who test positive for COVID-19, those being tested and who test negative for COVID-19, and those not being tested. In principle, the comparison of the covariates (X) between test-positives and negatives will allow us to see which risk factors differ between people who have become infected with SARS-CoV-2 versus those who haven't and we can then compare to the controls from the general population. 6 But simple contrasts of tested participants may not estimate an interpretable parameter. In fact, the interpretation of any measured associations relies on certain structural assumptions and the collection of necessary data that would allow for the credibility of these assumptions. We expand on these concepts in the remainder of the paper.

Our scientific objective is to identify risk factors, i.e. behaviours or characteristics of individuals that are associated with COVID-19 in a chosen outcome regression model, possibly after adjustment for other suspected confounders of these risk factors. 13, 14 The population of interest is members of the general public, who were not previously infected with SARS-CoV-2 (i.e. who are at risk of infection), and who are under the jurisdiction of the COVID-19 testing site under study, but who may or may not be seeking testing. The regression model thus represents the associations between the risk factors and prospective short-term risk of infection with SARS-CoV-2 in this population.

The binary outcome of interest is infection with SARS-CoV-2, denoted Y 1. Due to imperfect test sensitivity and specificity, this outcome may not correspond with Y, the result of the test for COVID-19. Sensitivity is defined as the probability of testing positive when truly infected. Specificity is defined as the probability of testing negative when not infected. The observed data are thus of the form O = (X, C, W, R, Y). The complete data under knowledge of the true outcomes are (X, C, W, R, Y 1). Under a perfect test for COVID-19 (i.e. sensitivity and specificity equal to one), Y = Y 1 when R = 1, in which case the observed data with outcome censoring can be written as O* = (X, C, W, R, R × Y 1). We will use lower case letters to represent realizations of these random variables. In particular, o i = (x i, c i, w i, r i, y i) for i = 1, ..., n where n is the total sample size.

We then define a logistic regression model

$$\operatorname{logit} \Pr(Y_1 = 1 \mid X = x, C = c) = \beta_0 + \sum_{k=1}^{r} \beta_k x_k + \sum_{l=1}^{s} \gamma_l c_l \tag{1}$$

where X = (X 1, ..., X r), C = (C 1, ..., C s) (and similarly for the realizations denoted by lower case letters), β = (β 0, β 1, ..., β r), and γ = (γ 1, γ 2, ..., γ s), under a typical log-likelihood loss function. Our interest lies in the vector parameter β, where exp(β k) corresponds to the conditional odds ratio related to the covariate X k.

Importantly, the parameters in this regression model may not represent causal effects, i.e. even if a coefficient β k is negative, it does not necessarily mean that X k decreases the risk of infection. 13, 14 In order to establish such a relationship, all confounders of the effects of risk factors on the outcome must be adjusted for in the model, the model must correctly represent the mechanisms of infection, and all risk factors must be independently manipulable. While we could extend this work to consider these aspects, for simplicity we retain β as the statistical parameter of interest in this study; in practice its estimation may provide important insights into the variables related to SARS-CoV-2 infection or ways that high-risk individuals may be identified. The remainder of the article addresses the estimation of the statistical parameter β within the regression model.

The challenge in comparing patients who tested positive versus negative arises from selecting on patients who seek testing. Figure 1 is a missing data directed acyclic graph (mDAG) 15, 16 representing assumed relationships between covariates in this analysis. In particular, we allow for the baseline covariates to potentially cause (i.e. influence the risk for) SARS-CoV-2 infection, Y 1. If a patient seeks testing (R = 1) then we observe a test result, Y.

Testing is typically obtained if the individual has suspected symptoms of COVID-19, W (which may include fever, respiratory symptoms, etc) but the act of seeking and then receiving testing may also be affected by the risk factors (X) and other baseline covariates (C). For example, an alert individual who frequently hand washes may be more inclined to seek testing, possibly also depending on whether they are experiencing real or perceived symptoms of COVID-19 (included in W ). Any variable in X, such as recent travel, that places a person at higher risk for infection may also prompt that person to seek testing, even with absent or mild symptoms. We assume that true infection only affects test-seeking behaviour through symptoms. Thus, in W we include all symptoms known to the participant. W may also represent symptoms of other (respiratory or other) infections, and these may be caused by pathogens other than the SARS-CoV-2 virus.

Other unmeasured factors may modify the risk of COVID-19, Y 1 . We allow for unmeasured causes (omitted from graph for simplicity) of both COVID-19 and other infections. In constructing this mDAG, we also assume that no unmeasured factor simultaneously affects any pair of nodes. We will discuss and relax some of these assumptions in Section 5.

The objective of our analysis is to estimate the model parameters representing the relationship between X and Y 1 while adjusting for C, i.e. the parameters of the model for Pr(Y 1 = 1 | X = x, C = c). But because we only have outcome data from those who are seeking testing, we may consider directly modeling the observed outcomes among those who were tested, Pr(Y = 1 | X = x, C = c, R = 1). Even under a perfect test for COVID-19, so that Y 1 = Y when R = 1, such modeling of the selected population may produce misleading associations between X and Y. This is due to collider bias, 5, 18 which is caused by subsetting or adjusting for a variable that is caused by the two variables whose association is of interest. In our case, we would be conditioning on R = 1, which is caused by both Y 1 (through W) and by X. Thus, there is a possibility for erroneous conclusions resulting from the measured associations between X and Y among those seeking testing.
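The collider bias described above can be illustrated with a small simulation. The following R sketch is illustrative only: the models for Y 1 and W follow the data-generating code in Appendix B, whereas the population size, the models for X and R, and all of their coefficients are assumptions chosen for this example. A perfect test is assumed, so that Y = Y 1 among the tested.

```r
# Illustrative simulation of collider bias from restricting to tested subjects.
# The population size and the models for X and R are assumptions of this sketch.
set.seed(2020)
popsize <- 400000
true_or <- 1.5

C  <- rnorm(popsize)                                                # baseline confounder
X  <- rbinom(popsize, 1, plogis(-2.2 + 0.3 * C))                    # assumed model for X
Y1 <- rbinom(popsize, 1, plogis(log(true_or) * X + 0.5 * C - 2.7))  # true infection
W  <- rbinom(popsize, 1, plogis(-2 + 0.2 * C + 0.5 * X + 3 * Y1))   # symptoms
R  <- rbinom(popsize, 1, plogis(-5 + 2.5 * X + 3 * W))              # assumed model for testing

pop <- data.frame(C, X, Y1, W, R)

# Target regression in the full population (not available in practice)
fit_full <- glm(Y1 ~ X + C, family = binomial(), data = pop)
# Naive regression restricted to tested subjects (conditions on the collider R)
fit_tested <- glm(Y1 ~ X + C, family = binomial(), data = pop[pop$R == 1, ])

coef_full   <- unname(coef(fit_full)["X"])
coef_tested <- unname(coef(fit_tested)["X"])
print(c(truth = log(true_or), full = coef_full, tested_only = coef_tested))
```

Under these assumed coefficients, subjects with X = 1 are tested more readily even without symptoms, so the coefficient of X among the tested is expected to fall below the full-population coefficient.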

In this section, we discuss the identifiability of the parameters of the model in equation (1) under the mDAG in Figure 1. We then discuss identifiability under some less restrictive assumptions. When identifiability holds, a maximum likelihood substitution estimator 19 and an IPW estimator can be constructed.

The mDAG in Figure 1 makes several important assumptions. In particular, it assumes that W is fully measured, meaning that we measured all symptoms caused by SARS-CoV-2 infection that may lead to seeking testing. This can be achieved by thorough harmonized data collection on individuals being recruited from both populations. We assume that Y is equal to Y 1 when R = 1 up to random error. As mentioned, we allow for unmeasured common causes of Y 1 and other infections. Otherwise, we assume that there are no common causes of any pair of nodes in the graph.

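As a consequence of this structure, we have the independence condition

$$Y_1 \perp\!\!\!\perp R \mid (X, C, W). \tag{2}$$

We then use the law of total probability to write our association of interest as

$$\Pr(Y_1 = 1 \mid X = x, C = c) = \int \Pr(Y_1 = 1 \mid X = x, C = c, W = w)\, f_{W \mid X, C}(w \mid x, c)\, dw = \int \Pr(Y_1 = 1 \mid X = x, C = c, W = w, R = 1)\, f_{W \mid X, C}(w \mid x, c)\, dw \tag{3}$$

by the independence assumption (2). The quantity $f_{W \mid X, C}(w \mid x, c)$ is the possibly multivariate distribution of W conditional on X = x and C = c, and the multiple integrals are taken over the domain of this distribution.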

If the SARS-CoV-2 test is perfect then we can replace Y 1 by Y directly. If not, assuming uniform test sensitivity σ and specificity ρ independent of patient characteristics, under the condition that σ + ρ > 1, the law of total probability expanding on Y and rearranging terms gives us

$$\Pr(Y_1 = 1 \mid X = x, C = c, W = w, R = 1) = \frac{\Pr(Y = 1 \mid X = x, C = c, W = w, R = 1) + \rho - 1}{\sigma + \rho - 1} \tag{4}$$

which is estimable from the data of the tested subjects. 20
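To make the rearrangement concrete: conditional on (X, C, W, R = 1), the law of total probability gives Pr(Y = 1 | ·) = σ p + (1 − ρ)(1 − p), where p denotes the corresponding probability of true infection, and solving for p gives p = {Pr(Y = 1 | ·) + ρ − 1}/(σ + ρ − 1). A minimal R sketch of this back-transformation follows; the function name and the example values of p are hypothetical.

```r
# Back-transform the probability of a positive test into the probability of
# true infection, assuming known, uniform sensitivity and specificity.
correct_misclassification <- function(q_obs, sens, spec) {
  stopifnot(sens + spec > 1)   # condition required for the inversion
  (q_obs + spec - 1) / (sens + spec - 1)
}

sens <- 0.95
spec <- 0.99
p_true <- c(0.05, 0.30, 0.80)                        # hypothetical infection probabilities
q_obs  <- sens * p_true + (1 - spec) * (1 - p_true)  # implied test-positive probabilities
p_back <- correct_misclassification(q_obs, sens, spec)

# An estimated test-positive probability below 1 - spec maps to a value below zero
p_low <- correct_misclassification(0.005, sens, spec)
```

The last line shows why corrected values can leave the unit interval when the test-positive probability is estimated rather than known.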

For the distribution in equation (3), we may write

$$f_{W \mid X, C}(w \mid x, c) = \sum_{r \in \{0, 1\}} f_{W \mid X, C, R}(w \mid x, c, r)\, \Pr(R = r \mid X = x, C = c). \tag{5}$$

Because we are able to relate the target probability Pr(Y 1 | X = x, C = c) to quantities that are known if given infinite samples of the observed data structure, under the given assumptions, we have established identifiability. This result does not rely on the specification of parametric models.

Thus, for identifiability we need complete data on the symptoms W (specifically, all variables caused by Y 1 leading to testing) in addition to the covariates of interest (X, C). These covariates must include all common causes of testing and Y 1 and/or W (see next section). We also require knowledge of the parameters Pr(R = 1), σ, and ρ.

The assumed relationships in the mDAG of Figure 1 may be too restrictive in certain studies. We thus describe several specific generalizations of the above graph and the consequences on identifiability and interpretability of the β parameter.

Unmeasured variable affecting test seeking which also affects Y 1 and/or W

In Figure 2 we add an unmeasured variable U that is a factor influencing the act of getting tested which acts independently of all other variables in the graph. Suppose that U also affects being infected with SARS-CoV-2 or having symptoms of COVID-19 or both. For instance, mold exposure is associated with living in lower income neighborhoods. 22 Mold exposure may lead to respiratory symptoms (e.g. asthma exacerbation) that could be confused for COVID-19 symptoms. People living in lower income neighborhoods may be more at risk of COVID-19 due to greater population density and greater proportions of people who work in "essential services". 23 Thus, socio-economic status may be such a variable if it is not included in X or C. A second example is any variable that makes an individual high-risk (arrow into Y 1) that also leads to testing even if the individual is symptom free.

The consequence of such a variable is that independence condition (2) no longer holds. This is because U directly creates dependence between R and Y 1 and/or adjusting for W , which is a collider of Y 1 and U , creates the dependence between R and Y 1 . Thus, if such a variable U exists we cannot use the described maximum likelihood procedure. We should therefore attempt to measure all such factors and include them in X or C.

Another potential scenario involves symptoms U of COVID-19 that were not included in W but can also lead to testing. This scenario is portrayed in Figure 3 . Such a variable may exist if, for instance, a study does not ask about the less common symptoms of COVID-19, such as headache or skin rashes.

In this scenario, U is a mediator of the effect of Y 1 on R so the independence condition (2) does not hold. This is still the case if U is related to the baseline covariates or partially correlated with other symptoms.

Consider the presence of a variable U affecting R and correlated with baseline covariates (either by causal relationship or other). Such a variable does not affect the independence condition (2) and thus identifiability is preserved. Such variables include demographic information and participant characteristics that affect test-seeking behaviour but are otherwise not related to infection or symptoms.

In the presence of a variable U that affects both X and Y 1, the association between risk factors and the outcome of interest will be confounded. The presence of such a variable will not affect the independence condition (2), and thus the parameters β in model (1) will still be identifiable. However, their values may be less meaningful and may not represent causal relationships between risk factors and infection due to the unmeasured confounding.

The g-formula relates the observed data to P r(Y 1 | X = x, C = c) and thus the model of interest. Estimation is available in principle through modeling the components of the g-formula, producing a substitution estimator. 24 We find that when W is high-dimensional, as is likely the case in this setting, the g-formula estimator may not be feasible. Alternatively, one may model the probability of selection directly using the case-control data of tested and untested individuals to construct an estimator using IPW. 12, 21 We describe the latter, which requires knowledge of the test sensitivity, σ, and specificity, ρ, and the value of testing prevalence P r(R = 1). If these parameters are uncertain, then one can undertake a sensitivity analysis by varying the assigned values.

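The IPW estimator for the parameters of interest in model (1) is given through the score equations of a weighted logistic regression

$$\sum_{i=1}^{n} \frac{r_i}{\Pr(R = 1 \mid X = x_i, C = c_i, W = w_i)} \begin{pmatrix} 1 \\ x_i \\ c_i \end{pmatrix} \left\{ \Pr(Y_1 = 1 \mid X = x_i, C = c_i, W = w_i, R = 1) - \operatorname{expit}\left( \beta_0 + \sum_{k=1}^{r} \beta_k x_{ik} + \sum_{l=1}^{s} \gamma_l c_{il} \right) \right\} = 0 \tag{6}$$

where values (x i, c i, w i, r i) refer to the data realizations of subject i.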

In order to estimate the numerator of the IPW estimator, we must first define a model for Pr(Y = 1 | X = x, C = c, W = w, R = 1). This model is fit on subjects who received a test. Predictions from this model fit are denoted $\hat{Q}_{Y,R=1}(x, c, w)$. By the relationship in equation (4), we set

$$\hat{Q}_{Y_1,R=1}(x, c, w) = \frac{\hat{Q}_{Y,R=1}(x, c, w) + \rho - 1}{\sigma + \rho - 1}.$$

In order to estimate the denominator of (6), we note that the associations between covariates, symptoms, and the probability of testing must be estimated from the data resulting from the case-control design, where sampling is carried out in both the tested and untested groups. If we know the baseline testing prevalence q 0 = Pr(R = 1), we may use a simple weighting method for case-control studies. 21 Specifically, we assign all cases the weight q 0 and all controls the weight (1 − q 0)/J, where J is the ratio of the number of controls to cases in the sample. We use these weights in any chosen binomial regression model for R conditional on X, C, and W. Finally, we use predictions from this model fit to estimate $\widehat{\Pr}(R = 1 \mid X = x, C = c, W = w)$ for all tested subjects.
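The case-control weighting step can be sketched in R as follows. This is a minimal illustration rather than the authors' implementation: the function name, the choice of a main-effects logistic model, and the toy data are assumptions of this example, and the data frame is assumed to have columns R, X, C, and W.

```r
# Case-control weighted regression for the probability of being tested.
# `dat` needs columns R (1 = tested, 0 = untested), X, C, W; q0 = Pr(R = 1).
fit_testing_model <- function(dat, q0) {
  J <- sum(dat$R == 0) / sum(dat$R == 1)            # controls per case in the sample
  dat$ccw <- ifelse(dat$R == 1, q0, (1 - q0) / J)   # case-control weights
  # quasibinomial gives the same point estimates as binomial but does not warn
  # about non-integer weights
  fit <- glm(R ~ X + C + W, family = quasibinomial(), data = dat, weights = ccw)
  list(fit = fit,
       weights = dat$ccw,
       p_tested = predict(fit, newdata = dat[dat$R == 1, ], type = "response"))
}

# Toy data (hypothetical values): symptoms are more common among the tested
set.seed(11)
n1 <- 1000
n0 <- 1000
toy <- data.frame(
  R = rep(c(1, 0), times = c(n1, n0)),
  X = rbinom(n1 + n0, 1, 0.1),
  C = rnorm(n1 + n0),
  W = c(rbinom(n1, 1, 0.7), rbinom(n0, 1, 0.15))
)
out <- fit_testing_model(toy, q0 = 0.02)
summary(out$p_tested)
```

Because the model includes an intercept, the weighted average of the fitted probabilities equals the assumed prevalence q 0, which offers a quick check that the weights were applied as intended.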

A simple proof of the consistency of this estimator under the independence assumption (2) is given in the Appendix. It is required that the models for Q Y,R=1 (x, c, w) and Pr(R = 1 | X = x, C = c, W = w) are both correctly specified. We expect that some values of the denominator may be close to one for some tested subjects who are experiencing several symptoms of COVID-19. However, we would not necessarily expect denominator values close to zero because the IPW equation (6) only uses subjects who did, in fact, get tested. We thus expect our IPW estimator to be fairly stable in this setting.

In order to evaluate the proposed IPW method under the mDAG in Figure 1 , and compare it to a naive approach, we perform a simulation study. We evaluate the method under ideal circumstances, where the assumed parametric models are close to well-specified, where sensitivity and specificity of the test are known, and where the baseline prevalence of testing is known. We then evaluate the sensitivity to departures from these assumptions.

We first simulate ordered data O+ = (C, X, Y 1, W, R, Y), where each variable is unidimensional, for a population of 1,000,000. Baseline confounder C is generated from a standard Gaussian distribution. The risk factor of interest, X, is generated as a Bernoulli random variable conditional on C such that its prevalence in the population is approximately 10%. True COVID-19 status Y 1 is generated as Bernoulli conditional on X and C such that the true incidence of acute infection in this previously untested population is 10% overall (for this example, though this is likely lower in the general population in practice). The true conditional association between X and Y 1 is exp(β) = 1.5. Symptoms W are Bernoulli conditional on C, X, and Y 1, where the dependence on Y 1 is strong so that infected individuals have a high probability of experiencing symptoms (roughly between 0.5 and 0.9). Testing status R is then generated given C, X, and W with a fairly strong dependence on W such that symptoms lead to a higher probability of being tested. Given the true infection state Y 1, true sensitivity σ = 0.95, and specificity ρ = 0.99, test outcome Y is drawn for all tested subjects. Then, we randomly sample 2000 tested participants (roughly all available) and 2000 untested participants (a small subsample of the total available) from the population, which gives us our study sample. The data generation is given in Appendix Table B.

In order to demonstrate the selection bias from using only tested subjects to evaluate risk factors, we fit a logistic regression for Y conditional on C and X using the data from tested subjects in the sample. We then apply our method using logistic regressions for Q Y,R=1 (x, c, w) and Pr(R = 1 | X = x, C = c, W = w), where the latter regression is weighted using the case-control weights. The score equations (6) are then solved using a standard optimization procedure for logistic regressions with a log-likelihood loss function, though our implementation allows for values of $\hat{Q}_{Y_1,R=1}(x, c, w)$ that are outside of (0,1), which occurs due to the transformation with σ and ρ.
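R's glm with a binomial-type family rejects outcome values outside [0, 1], so the weighted score equations can instead be solved by direct optimization of the weighted logistic log-likelihood. The sketch below shows one way to do this and is not the authors' implementation: the function names, the use of optim with BFGS, and the toy data and weights are assumptions of this example.

```r
# Weighted logistic pseudo-likelihood that accepts outcomes outside (0, 1).
neg_loglik <- function(beta, Y, X, w) {
  eta <- drop(X %*% beta)
  log1pexp <- -plogis(-eta, log.p = TRUE)    # log(1 + exp(eta)), computed stably
  -sum(w * (Y * eta - log1pexp))
}
neg_score <- function(beta, Y, X, w) {
  pi_hat <- plogis(drop(X %*% beta))
  -drop(crossprod(w * X, Y - pi_hat))        # minus the weighted score vector
}
solve_weighted_logistic <- function(Y, X, w) {
  optim(rep(0, ncol(X)), fn = neg_loglik, gr = neg_score,
        Y = Y, X = X, w = w, method = "BFGS",
        control = list(reltol = 1e-12, maxit = 500))$par
}

# Toy check against glm with binary outcomes and hypothetical weights
set.seed(3)
n <- 2000
x <- rnorm(n)
X <- cbind(1, x)
y <- rbinom(n, 1, plogis(-1 + 0.7 * x))
w <- runif(n, 1, 3)
beta_optim <- solve_weighted_logistic(y, X, w)
beta_glm <- unname(coef(glm(y ~ x, family = quasibinomial(), weights = w)))

# Outcomes pushed outside [0, 1] by the sensitivity/specificity correction
y_corr <- (y + 0.99 - 1) / (0.95 + 0.99 - 1)
beta_corr <- solve_weighted_logistic(y_corr, X, w)
```

In the setting described above, Y would hold the back-transformed predictions for tested subjects and w the inverse of their estimated testing probabilities.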

We implement our method with correctly specified logistic regression models under the following settings: assumed values (σ, ρ) set to (0.95, 0.99) (i.e. truth), (1, 1), and (0.99, 0.95); assumed testing prevalence $\hat{q}_0$ set to truth (roughly 0.2% of the full population), truth ×10, and truth ×100. We then misspecify our testing model by omitting an interaction term between X and W. In the last IPW implementation, we do not adjust for symptoms by removing W from all models.

We use a case-control nonparametric bootstrap method, where resampling with replacement is done separately in the tested and untested groups, to estimate the standard error and 95% confidence intervals for the IPW method. 25 The usual logistic regression standard errors are used for the naive method. All simulations were run with R statistical software v. 3.6.1. 26

The results of all implementations in addition to the analysis conducted only on tested subjects are given in Table 1. Mean parameter estimates, mean standard error estimates, Monte Carlo standard errors, and % coverage of the 95% confidence intervals are given. We first note that the logistic regression analysis run with only tested subjects is highly biased, suggesting on average that X leads to a lower risk of the

In this paper, we have contributed to the investigation of statistical analysis under the test-negative design in the context of evaluating risk or preventive factors of COVID-19 when participants may be conveniently recruited at disease testing sites. We defined a potential parameter of interest in such a study as the coefficients in a regression model for the true infection outcome. We explained and demonstrated the importance of sampling additional population controls 6 in order to avoid selection bias from comparing only tested individuals. We then investigated the identifiability of the target parameter under several settings. Finally, we proposed a novel IPW estimator that accounts for both imperfect test sensitivity and specificity and the study design. We then evaluated this estimator through a simulation study.

There is a growing literature on identifiability conditions for statistical parameters under missingness 27 and a large literature of identifiability of causal parameters. 28, 29 Our setting is somewhat different from a typical missing data setting in that, because observed outcomes are obtained through imperfect tests, true infection status is not observed for any subject. In addition, the case-control component of the study design must be considered when estimating all probabilities and distributions in the general population of interest. These results are important as they shed light on the data collection needed to correctly estimate the parameter of interest. In particular, we must measure all variables on the pathway between SARS-CoV-2 infection and testing. This means that incomplete ascertainment of the symptoms leading some individuals to be tested would result in a biased estimator. We must also measure and adjust for all causes of testing if they are also causes of SARS-CoV-2 infection and/or symptoms.

The estimator proposed assumes knowledge of the test properties and the prevalence of testing in the population. Given the potential sensitivity to errors made when specifying these quantities, one could undertake a sensitivity analysis. Specifically, confidence intervals could be constructed using all combinations of credible values for σ, ρ, and q 0. By taking the minimum confidence interval lower bound and the maximum upper bound, we can place bounds on the set of parameter values that are supported by the data and assumed model. Other approaches may involve Bayesian estimation 30 where informed priors are placed on these values, but we do not explore such approaches here. We also noted the sensitivity of the results to misspecification of the model for testing. It is thus important to understand the mechanisms driving people to seek and receive testing and to use a flexible modeling approach. 31

This work can be directly adapted to an investigation of causality by modifying the target parameter of interest to a causal parameter under additional assumptions including "no unmeasured confounders" for an exposure of interest. If the additional assumptions hold, then this approach could investigate potential epidemiological causes of SARS-CoV-2 infection. Future work could also improve the efficiency of the IPW estimator through such approaches as targeted maximum likelihood estimation. 32 Though improvements are likely possible, a practical fully efficient estimator is probably infeasible due to the difficulties in applying the g-formula.

Due to the rapidly evolving nature of the COVID-19 pandemic, studies with short timelines are necessary to monitor public health. The accessibility of the test-negative design with untested controls allows for much shorter timelines compared to a cohort study of uninfected individuals. We must however overcome the inherent selection bias arising from this design. Novel study designs must be followed by clear definitions of parameters of interest, investigations of identifiability of these parameters, and potentially tailored estimators. These steps allow for a principled approach that does not solely rely on intuition and may help avoid substantial sources of bias when tracking risk and preventive factors of COVID-19.

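By the definition of the parameters (β, γ) in equation (1) and the typical log-likelihood loss function, the true values are defined through the equations

$$E\left[ \begin{pmatrix} 1 \\ X \\ C \end{pmatrix} \left\{ Y_1 - \operatorname{expit}\left( \beta_0 + \sum_{k=1}^{r} \beta_k X_k + \sum_{l=1}^{s} \gamma_l C_l \right) \right\} \right] = 0.$$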

We first assume that the model for Q Y,R=1 (x, c, w) is consistent such that the values for given (x i, c i, w i) converge to the truth. Then, the estimates $\hat{Q}_{Y_1,R=1}(x_i, c_i, w_i)$ will converge to the true Pr(Y 1 = 1 | X = x i, C = c i, W = w i) as long as the parameters σ and ρ are correct. Then, as n goes to infinity, and assuming consistent nuisance function estimation in the denominator, the IPW score equations (6)

In this section, we present the R code to run the estimator for observed data with structure O = (X, C, W, R, Y) where X, C, and W are univariate. Note that the simulation study data has such a structure. This code can be easily extended for multivariate versions of those variables.

The IPW function uses the following two helper functions.

#Log-bin function that can take Y values outside of (0,1)
LogLikelihood<-function(beta, Y, X,w){
pi<-plogis( X%*%beta ) # P(Y|A,W)= expit(beta0 + beta1*X1+beta2*X2...
pi<-plogis( X%*%beta ) # P(Y|A,W)= expit(beta0 + beta1*X1+beta2*X2...)

Figure 1: mDAG representing hypothetical relationship between baseline covariates X and C, symptoms W, seeking testing R, true infection Y 1, and observed test outcome Y. Note that Y 1 is observed with error for tested subjects (R = 1). Drawn using DAGitty. 17

Figure 2: Presence of an unmeasured variable U that affects R and symptoms W and/or infection with COVID-19, Y 1. Drawn using DAGitty. 17

Figure 3: Presence of unmeasured symptoms U of COVID-19, Y 1. Drawn using DAGitty. 17

1. Temporal dynamics in viral shedding and transmissibility of COVID-19
2. Epidemiology, clinical course, and outcomes of critically ill adults with COVID-19
3. Clinical Characteristics of 138 Hospitalized Patients With 2019 Novel Coronavirus-Infected Pneumonia in Wuhan
4. Canadian Public Health Laboratory Network Best Practices for COVID-19
5. A structural approach to selection bias. Epidemiology
6. The test-negative design with additional population controls: a practical approach to rapidly obtain information on the causes of the SARS-CoV-2 epidemic. arXiv
7. The test-negative design: validity, accuracy and precision of vaccine efficacy estimates compared to the gold standard of randomised placebo-controlled clinical trials
8. Basic principles of test-negative design in evaluating influenza vaccine effectiveness. Vaccine
9. Inference for a test-negative case-control study with added controls. arXiv
10. A generalization of sampling without replacement from a finite universe
11. Analysis of Semiparametric Regression Models for Repeated Outcomes in the Presence of Missing Data
12. Constructing inverse probability weights for marginal structural models
13. To Explain or To Predict?
14. Clarifying questions about "risk factors": predictors versus explanation
15. Using causal diagrams to guide analysis in missing data problems
16. Graphical models for inference with missing data in
17. Robust causal inference using directed acyclic graphs: the R package 'dagitty'
18. Illustrating bias due to conditioning on a collider
19. A new approach to causal inference in mortality studies with sustained exposure periods: application to control of the healthy worker survivor effect. Mathematical Modeling
20. Basic methods for sensitivity analysis of biases
21. Estimation based on case-control designs with known prevalence probability
22. Environmental conditions in low-income urban housing: clustering and associations with self-reported health
23. The plight of essential workers during the COVID-19 pandemic
24. Implementation of G-Computation on a Simulated Data Set: Demonstration of a Causal Inference Technique
25. Estimation in choice-based sampling with measurement error and bootstrap analysis
26. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna
27. Identification In Missing Data Models Represented By Directed Acyclic Graphs. Uncertain Artif Intel
28. Causality: Models, reasoning and inference
29. Causal diagrams for epidemiologic research
30. Estimating Prevalence Using an
31. Improving Propensity Score Estimators' Robustness to Model Misspecification Using Super Learner
32. Targeted learning: causal inference for observational and experimental data

B Simulation study data generation

The data-generating mechanism used in the simulation study is given in Table B. We also present the R code below.

n=popsize) #prevalence is around 0.1
#Y will be censored, Y1 is latent for everyone
Y1<-rbinom(prob=plogis(log(OR)*X+0.5*C-2.7),size=1,n=popsize)
#check desired prevalence of true outcome
#generate test results
Y<-rbinom(prob=(sens*Y1+(1-spec)*(1-Y1)),size=1,n=popsize)
#symptoms based on infection
W<-rbinom(n=popsize, prob=plogis(-2+0.2*C+0.5*X+3*Y1),size=1)
#selection on outcome for testing
R<-rbinom

002 of pop tested -> determines sample size
q0<-mean(R) #Pr(R=1) in population
Y[R=0]<-NA
indcontrols<-sample(1:sum(R==0),size=2000,replace=F)
indcases<-sample(1:sum(R==1),size=min(sum(R==1),2000),replace=F)
dat<-as

Machine$double.neg.eps # for consistency with above
pi[pi==1] <-1-.Machine$double.neg.eps
gr<-crossprod(w*X, Y-pi) # gradient
return
}

The function to run IPW depends on the data (dat) and values for sensitivity (sens), specificity (spec), and the baseline prevalence (q0hat). The function follows

IPWest<-function(dat,sens,spec,q0hat){
#Use IPW estimator (with true sens and spec) to estimate
modYR1<-glm
newdata=dat)+spec
for R, fit with weights w. We use a logistic regression as an example:
modRwxc<-glm