title: Randomized Controlled Trials Under Influence: Covariate Factors and Graph-Based Network Interference
authors: Schwarz, Tassilo
date: 2021-11-18

Randomized controlled trials are not only the gold standard in medicine and vaccine trials but have spread to many other disciplines like behavioral economics, making them an important interdisciplinary tool for scientists. When designing randomized controlled trials, how to assign participants to treatments becomes a key issue. In particular in the presence of covariate factors, the assignment can significantly influence statistical properties and thereby the quality of the trial. Another key issue is the widely popular assumption among experimenters that participants do not influence each other -- which is far from reality in a field study and can, if unaccounted for, degrade the quality of the trial. We address both issues in our work. After introducing randomized controlled trials and bridging terms from different disciplines, we first address the issue of participant-treatment assignment in the presence of known covariate factors. To this end, we review a recent assignment algorithm that achieves good worst-case variance bounds. Second, we address social spillover effects. To do so, we build a comprehensive graph-based model of influence between participants, for which we design our own average treatment effect estimator $\hat{\tau}_{net}$. We discuss its bias and variance and reduce the problem of variance minimization to a certain instance of minimizing the norm of a matrix-vector product, which has been considered in the literature before. Further, we discuss the role of disconnected components in the model's underlying graph.

I am grateful to my advisors Prof. Rasmus Kyng and Federico Soldà for their support throughout this work, in particular for their feedback on new ideas and their mathematical advice. Working on an interdisciplinary topic required support and inspiration from many sides. I am grateful to Prof. Stefan Bechtold, who introduced me to randomized controlled trials and with whom I have been working on this topic for the last years. I appreciate the insightful discussions on randomized controlled trials in behavioral economics with Dr. Tobias Gesche. Working on an interdisciplinary topic where little theoretical work exists (RCTs with social spillover effects) requires making many fundamental decisions. I am grateful to my advisors for providing me with the necessary guidelines and recommendations.

After giving an introduction to randomized controlled trials, this work is two-fold: On the one hand, it reviews recent progress on optimal experiment designs (in particular the Gram-Schmidt Walk Design, Section 2.6). On the other hand, it addresses the issue of interference among units by introducing a novel model and estimator for the average treatment effect (Chapter 3). For readers only interested in the latter, we recommend still reading the spectral discussion (Section 2.5), as it provides good intuition for the formal arguments made in Chapter 3. This work reflects the research undertaken during my bachelor thesis project "Randomized Controlled Trials: Review and Network Effects" in the Computer Science Department at ETH Zürich.
Rubin coined an assumption made by Cox in 1958 as the so-called stable unit treatment value assumption (SUTVA): Any unit's observed outcome depends only on the treatment it was assigned to, not on any other unit. "If unit i is exposed to treatment j, the observed value of Y will be Y ij ; that is, there is no interference between units (Cox 1958, p. 19) leading to different outcomes depending on the treatments other units received [...]" - [Rub80] This assumption remains dominant in many RCTs to date, even though it might be inaccurate in many instances. We will address this issue below and provide more accurate models beyond SUTVA in Chapter 3. RCTs have spread to many other disciplines to perform either field studies (i.e., studies in the real world) or laboratory experiments, where the environment can be controlled. These disciplines include psychology, behavioral economics, educational research, and agricultural science. In medicine, they remain to date the "golden standard for ascertaining the efficacy and safety of a treatment" [Kab+11] . Surprisingly, the in 1958 established stable unit treatment value assumption (SUTVA) remains to date a dominant assumption in many RCTs. Especially in field studies, where the experimenter has limited influence on the surroundings, the assumption that units do not affect each other can be inaccurate and lead to wrong conclusions. Another important property that can affect the quality of an RCT is the balancedness of covariate factors, i.e., properties of units that might affect their treatment's outcome. Large trials often rely on the law of large numbers, hoping that and i.i.d. design 1 avoids imbalances of predictive characteristics between the treatment and control group. One such large RCT is the medical trial performed to assess the performance of the BNT162b2 mRNA Covid-19 Vaccine [Pol+20] : With over 35.000 participants, most subgroups have approximately an equal share in the treatment and control group. However, the subgroup Native Hawaiian or other Pacific Islander is not balanced at all: Subgroup # in treatment # in control Native Hawaiian or other Pacific Islander (n=76) 50 26 If this subgroup were predictive of the treatment outcome, such an imbalance could distort the outcome. In this case, this is only a small subgroup, so that it has a negligible impact on the overall average outcome -but might have a significant impact for members of this subgroup. The goal of this work is to both discuss the idea of balance within subgroups as well as to give a framework to make RCTs work beyond the stable unit treatment value assumption (SUTVA). 1 a design in which each unit gets allocated with probability 1 2 to either treatment or control group, independent of the others In Chapter 2, we will introduce RCTs formally and dive into the idea of balance within subgroups. We examine this idea both from the perspective of medical RCT literature as well as from the perspective of a recent paper by Harshaw et al. [Har+21] . This paper provides an efficient algorithm for finding a design with two desirable properties: unbiasedness and low variance of a specific treatment effect estimator. Chapter 3 covers RCTs beyond SUTVA. We motivate the need for frameworks capable of dealing with RCTs in the presence of peer influence, review literature on this issue, define a new model and estimator, and analyze its error. Under basic assumptions, we can give a bound on the estimator's variance consisting of only positive terms. 
This also reduces variance minimization in that context to the minimization of the l 2 norm of a certain matrixvector product. We further show that if units are clustered, randomization can be done for each cluster individually. Finally, we conclude and give an outlook discussing further research questions. This chapter aims to introduce RCTs, review their state-of-the-art in literature and discuss the idea of balance within subgroups both in terms of medical literature and the so-called Gram-Schmidt Walk Design. In more detail, we introduce RCTs formally (Section 2.1), give a summary of means of randomization from (medical) literature (Section 2.2), discuss a common way of approximating the so-called average treatment effect (Section 2.3), give a statistical discussion of bias and variance (Section 2.4) and finally review the Gram-Schmidt Walk Design algorithm and key theorems from Harshaw et al. (Sections 2.5 and 2.6). For consistency with the review of the Gram-Schmidt Walk Design, we stick to the notation of [Har+21] . A randomized controlled trial is an experiment with n participants, the so-called units. These units get partitioned into two groups: the treatment and control group, where they either receive a specific treatment or placebo. Because the placebo treatment can be seen as a "null-treatment", we refer for brevity to both groups as the treatment groups. 1 To which of the treatment groups a unit i ∈ [n] gets assigned to is determined by the assignment vector z ∈ {±1} n , where z i = +1 means they are assigned to the treatment group Z + := {i ∈ [n] : z i = +1}. Analogously, z i = −1 means they are assigned to the treatment group Z − := {i ∈ [n] : z i = −1}. Because the assignment z is random in a randomized controlled trial, z is a random vector. The distribution of this random vector is called the design. After the units have been treated, the experimenter measures a certain quantity for each unit, the outcome. We assume this to be a scalar value. This outcome depends on a unit's treatment group: We measure a i if unit i is in group Z + . If it is in Z − , we measure b i . We refer to these quantities, that we potentially measure, as the potential outcomes a, b ∈ R n . Harshaw et al. introduce a vector combining both a, b: the potential outcomes vector µ := 1 2 (a + b). To assess the treatment's effect, we would ideally measure for each unit both a i and b i . However, as the units are partitioned into the treatment groups, we can either observe one of these quantities. This is encoded in the observed outcome vector y ∈ R n : The overall goal is to find the average treatment effect τ : Because a i − b i is unobservable, 2 we use estimatorsτ to approximate τ . These estimators should, in expectation, match τ . If this is the case, we say thatτ is unbiased. Is unbiasedness enough? At first glance, a solely unbiased estimator could suffer from one issue: Even though it is unbiased, it could have a very high error |τ − τ | in every outcome. For example, consider an estimator that has the following error distribution: In expectation over all possible designs,τ is unbiased. But clearly, this is not a desirable estimator: The error is |τ − τ | = 1000 with certainty. We therefore want in addition the estimator to have small mean squared error When taking a deeper look, we find that the mean squared error decreases for unbiased estimators when increasing the sample size n (Section 2.4). Therefore, solely unbiased estimators are widely considered a good tool. 
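The following Python sketch makes this concrete: under the i.i.d. design, a simple estimator (the Horvitz-Thompson estimator introduced formally in Section 2.3, here written as τ̂ = (2/n)⟨z, y⟩) is unbiased, and its mean squared error shrinks as n grows. The potential outcomes are synthetic and serve only as an illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_mse(n, reps=20000):
    # Synthetic potential outcomes (unknown to the experimenter in practice).
    a = rng.normal(loc=3.0, scale=1.0, size=n)   # outcomes under treatment
    b = rng.normal(loc=2.0, scale=1.0, size=n)   # outcomes under control
    tau = np.mean(a - b)                         # true average treatment effect

    errors = np.empty(reps)
    for r in range(reps):
        z = rng.choice([-1, 1], size=n)          # i.i.d. design: Pr[z_i = +1] = 1/2
        y = np.where(z == 1, a, b)               # observed outcomes
        tau_hat = 2.0 / n * np.dot(z, y)         # estimator discussed in Section 2.3
        errors[r] = tau_hat - tau
    return np.mean(errors), np.mean(errors ** 2)

for n in (10, 100, 1000):
    bias, mse = simulate_mse(n)
    print(f"n={n:5d}  empirical bias={bias:+.4f}  empirical MSE={mse:.4f}")
```

For this setup, the empirical bias stays near zero for every n, while the empirical mean squared error decreases roughly like 1/n.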
But in case of small study sizes, finding a design that minimizes an estimator's mean squared error is important. We collect the definitions from above in the following table. This should serve as a reference point throughout this work. Units the n participants Potential outcomes a, b The potential outcomes vector µ := 1 2 (a + b) Observed outcomes y Assignment z Design distribution of assignments, i.e., distribution of random vector z Z + := {i ∈ [n] : z i = +1}, Z − := {i ∈ [n] : z i = −1} Average treatment effect τ = 1 n i∈[n] a i − b i = 1 n 1, a − b There are mainly four types of randomization used in practice: Complete randomization, random allocation, permuted-block randomization, and adaptive randomization [CL13; KRP08] . We review them in the following. Complete randomization. The simplest form is complete randomization, where Pr [z i = 1] = 1 2 , independent of any other unit j. We refer to this as the i.i.d. randomization. This is both simple in application and analysis, which is presumably the reason for its popularity. However, this randomization scheme might lead to unequal group sizes. Random allocation. To overcome the issue of different group sizes, random allocation is used: n 2 units are drawn from the set of all n units without replacement and assigned to (w.l.o.g.) the treatment group. Random allocation versus complete randomization. Intuition tells us that for large n, the difference between complete randomization and random allocation should be small. Therefore, it might be possible to use the easier-to-analyze complete randomization for large n. But how large should n be, for being able to neglect the difference in group sizes? A brief calculation allows for a quantitative argument, based on a simple estimation by Lachin [Lac88] : z i be the sum of the assignment vector entries. S = 0 indicates that the assignment is perfectly balanced. The sizes of Z + , Z − relate to S in the following sense: A measure for imbalance, according to Lachin, is Under complete randomization, linearity of expectation and the i.i.d property yields The central limit theorem applies for large enough n: 3 Therefore, the probability that the imbalance is greater than t ∈ [0, 1] is: (CLT) Figure 2 .1 shows this function for t = 0.6. It can be seen that for n > 200, the probability of having imbalance > t = 0.6 4 is negligibly small (< 0.0047). This is the mathematical reason why practitioners use n > 200 as a rule of thumb when imbalance is negligibly small and there is no reason to use random allocation. Instead, the easier-to-analyze method of complete randomization can be used [KRP08; CL13] . Remark (on notation). Harshaw et al. mention that random allocation is sometimes called complete randomization. However, this is not the case in standard literature such as [CL13] or reviews such as [Lac88; KRP08] where complete randomization clearly refers to the independent random allocation, as presented above. Therefore, it should be safe to use the term "complete randomization" for the i.i.d. design. What if both treatment and control group are balanced, but subgroups are not? This is unfavorable, as subgroups could be predictive of the outcome. In medical trials, this is referred to as "prognostic factors" if, e.g., the age influences the observed outcome. To overcome this issue, the portion of treatment / control assignments gets balanced within each subgroup (also known as stratum). This technique is referred to as stratified randomization. 
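Returning to the imbalance calculation above, the following sketch reproduces the ≈0.0047 figure quoted for n = 200 and t = 0.6, assuming the imbalance measure is the share of units in the larger group (so that imbalance > t is the event |S| > (2t − 1)n); this reading of Lachin's measure is a reconstruction from the surrounding text. A Monte-Carlo estimate of the exact probability is printed next to the normal approximation for comparison.

```python
import numpy as np
from scipy.stats import norm

def prob_imbalance_exceeds(n, t):
    """Normal approximation to Pr[larger group holds more than a fraction t of the n units]
    under complete (i.i.d.) randomization: |S| > (2t - 1) * n with S = sum(z_i), Var(S) = n."""
    threshold = (2 * t - 1) * np.sqrt(n)
    return 2 * (1 - norm.cdf(threshold))

def prob_imbalance_monte_carlo(n, t, reps=200000, seed=0):
    rng = np.random.default_rng(seed)
    s = 2 * rng.binomial(n, 0.5, size=reps) - n     # S = sum of n i.i.d. +/-1 assignments
    larger_share = 0.5 + np.abs(s) / (2 * n)        # share of units in the larger group
    return np.mean(larger_share > t)

for n in (50, 100, 200, 400):
    clt = prob_imbalance_exceeds(n, 0.6)
    mc = prob_imbalance_monte_carlo(n, 0.6)
    print(f"n={n:4d}  normal approx={clt:.5f}  Monte Carlo={mc:.5f}")
```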
The following randomization schemes are both stratified randomization techniques. Permuted-Block Randomization. Units get partitioned into blocks, and randomization is performed within each block, commonly using the random allocation scheme. For example, units might sequentially enter a study over an extended period of time (e.g., patients in a hospital). To ensure that any covariates that change over time are balanced, units get allocated into blocks according to their time of entry. Random allocation is now performed for each block, thereby ensuring balancedness of time-fluctuating covariates. Adaptive Randomization. In adaptive randomization, the probability of getting assigned to treatment / control gets adapted over time, based on covariates measured until the time of entry. This is highly situation-dependent, and therefore does not allow for a good general analysis. In Sections 2.5 and 2.6, we will review a paper [Har+21] that performs at its very heart stratification -a technique used for decades. 5 The paper's key contribution is that it provides an algorithm with good bounds on the variance of a common average treatment effect estimator, which is defined in the following. A widely used estimator of the average treatment effect is the Horvitz-Thompson estimator [HT52] . In the most general case, the Horvitz-Thompson estimator is defined aŝ (2.5) For the remainder of this work, we will assume Pr [z i = +1] = Pr [z i = −1] = 1 2 for all units, as this gives -in expectation -equal group sizes. 6 In this case, we get a simpler expression: Proof. As (2.6) We will therefore use the term variance of the estimator in the following. It shall be noted that in medical literature, the term "increasing precision" is sometimes used. Mathematically, this means reducing the variance. Note that unbiasedness or low variance always refers to a combination of both design and estimator. Talks with experimenters showed that they clearly prefer an unbiased estimator-design combination over a low variance one, noting that a low variance can also be achieved by simply increasing the experiment size n. Here, we give a mathematical justification for this attitude, at the example of a slightly modified estimator based onτ ht . be our estimator. Suppose, for simplicity, that each unit has the same E [z i which is biased, as the average treatment effect is (2.9) But the variance under the i.i.d. design is (2.10) The key observation here is the factor 1 n , which exists in the final variance term, but not in the final bias term: By choosing the i.i.d. design, the variance decreases with 1 n , but the expectation is fixed independent of n. Therefore, a biased estimator cannot be "made unbiased" by increasing n, but the precision of a high-variance estimator can be increased by increasing n. Note that besides the i.i.d. design, we used in Equation 2.10 the SUTVA (the y i do not depend on each other). Therefore, such a calculation becomes more challenging in the presence of spillover effects (peer influence). We will pick up this idea in Chapter 3. In this section, we first find a simple expression for the error ofτ ht , subsequently analyze the variance by using spectral decomposition, and finally draw conclusions about a tradeoff between potential performance and robustness ofτ ht . This insight will be key for realizing the possibilities and limitations of a good design. Lemma 3 (Error ofτ ht ). The error ofτ ht is: This lets us rewrite the variance: where ( * ) follows from the fact that E [z] = 0. 
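Before analyzing this expression spectrally, a quick numerical check of the identity Var[τ̂_ht] = (4/n²) µᵀ Cov(z) µ (the display above, reconstructed from Lemma 3) can be instructive. The sketch below compares the closed form against a Monte-Carlo estimate under the i.i.d. design, where Cov(z) = I, for synthetic potential outcomes.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
a = rng.normal(2.0, 1.0, size=n)
b = rng.normal(1.0, 1.0, size=n)
mu = (a + b) / 2                       # potential outcomes vector
tau = np.mean(a - b)

# Monte-Carlo variance of the Horvitz-Thompson estimator under the i.i.d. design.
reps = 100000
z = rng.choice([-1, 1], size=(reps, n))
y = np.where(z == 1, a, b)
tau_hat = 2.0 / n * np.sum(z * y, axis=1)
print("empirical variance:", tau_hat.var())

# Closed form (4 / n^2) * mu^T Cov(z) mu; under the i.i.d. design Cov(z) = I.
print("closed form       :", 4.0 / n**2 * (mu @ mu))
```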
As Cov (z) is a symmetric matrix, the spectral theorem applies, and we can eigendecompose: where Λ = diag(λ i ) consists of the eigenvalues and V of the eigenvectors of Cov (v). According to the spectral theorem, V forms a basis of R n . We can therefore express µ in terms of the eigenbasis V: ∃w ∈ R n : µ µ 2 = Vw with w 2 = 1. (2.17) We thus get for the variance: where ( * ) is a convex combination of the eigenvalues of Cov (z), since w 2 = 1. This gives a first insight for designing good experiments: If possible, we should align the smallest eigenvector of Cov (z) with µ while making the biggest eigenvectors orthogonal to µ. However, we do not know µ (as it directly depends on both potential outcomes a, b). Nonetheless, we might be able to predict µ based on some covariate information: This is the topic of Section 2.6. Proof. For any i ∈ [n], we have: And as the sum of a matrix's eigenvalues is its trace, we have This shows an inherent trade-off between a design's robustness and potential performance: We see in Equation 2.18 that λ min determines how good a design's potential performance is, while λ max expresses how bad a design's worst-case variance can get. 7 The best design would thus have both a small λ max (robust against bad µ) and a small λ min (good best-case performance). However, Lemma 5 shows that i∈[n] λ i = n, so that we cannot have both small λ min and λ max . The i.i.d. design with λ 1 = . . . = λ n = 1 has minimal λ max so that it is robust. However, it has no good potential performance. A design with λ min < 1, on the other hand, has good potential performance (for µ being aligned with the minimal eigenvector) but bad worst-case performance, as λ max > 1. Harshaw et al. use this argument to state that no design can be uniformly better than all others. There will always exist potential outcomes µ where one design has better (i.e., lower) variance than the other. If we know that µ lies predominantly in a specific direction, we can create good designs. This is the baseline assumption of the Gram-Schmidt Walk Design and is covered in the next section. As we saw in the previous section, we cannot improve the i.i.d. design without making further assumptions on the potential outcomes. The assumption made in [Har+21] is that there exists some information (covariate information) for each unit that is predictive of the potential outcome vector µ. As shown in the introduction, this idea is not new and has been discussed under the term stratification in the clinical trial literature at least since the 1970s [Zel74] . However, a key contribution from Harshaw et al. is to give an algorithm under these assumptions with good bounds on the variance. After stating the assumptions formally, the goal of this section is to both give intuition behind the algorithm (Figure 2 .2) as well as review some of the main theorems. Formally, we represent the d covariates we have for each unit in the covariate matrix X ∈ R n,d , where the ith row represents the covariates of unit i. The covariates are linearly predictive of the potential outcomes if µ is close to the column span of X: (2.20) If we denote by β the vector representing the linear function between covariates and potential outcomes best we can decompose µ into µ =μ + ε whereμ := Xβ, ε := µ −μ and get for the variance of the Horvitz-Thompson estimator: To minimize that expression, we would like to align the smallest eigenvector of Cov (z) with bothμ and ε. Since they are orthogonal, however, we cannot align to both vectors at the same time. 
Nonetheless, by our assumption that X is predictive of µ, we can assume that ε is small and thus Asμ ∈ Span(X), we can further simplify: If the covariates are predictive, we should therefore focus on Cov (X z) rather than solely on Cov (z). In the following sections, we first give an intuition for the Gram-Schmidt Walk Design algorithm (Section 2.6.2), then describe the algorithm itself and establish some mathematical properties (Section 2.6.3) and finally analyze the Horvitz-Thompson estimator's error under that design, corresponding to the bias (Section 2.6.4) and mean square error (Section 2.6.5). The overall goal is to randomly find an assignment vector z ∈ {±1} n , that assigns each of the n elements into either Z + or Z − . There is an inherent tradeoff between robustness, achieved by i.i.d. randomization, and potential performance, 8 achieved by balancing the covariates between both Z + and Z − groups (Section 2.5). This founds the necessity for a trade-off parameter Φ between robustness (Φ = 1) and potential performance (Φ = 0), which needs to be specified by the experimenter. How can we connect (1) i.i.d. randomization and (2) balance of covariates in order to find a randomized design corresponding to a given Φ ∈ (0, 1]? Intuitively, (2) corresponds to balancing some measure of the n covariate vectors. We can phrase (1) in terms of (2): i.i.d. random assignment of n elements into two groups Z + , Z − is the same as randomly balancing n orthogonal vectors: It is impossible to balance these so that the best random balance is just an i.i.d. randomization. Let us use the unit vectors e i as orthogonal vectors. Combining robustness and balance, we, therefore, aim to balance n vectors consisting of both an orthogonal part e i , scaled by ≈ Φ, and a part consisting of the ith covariate vector X i,: , scaled by ≈ 1 − Φ. where ξ = max i∈[n] X i,: 2 is some scaling factor. 9 Consider the extreme cases: For Φ = 1, B consists of only orthogonal unit vectors, which cannot be balanced. So the best way to balance will be an i.i.d. assignment. If Φ = 0, B consists of only the covariate vectors. Thus, balancing the column vectors of B corresponds to balancing the covariate vectors. In summary, the goal is to randomly find a vector z on one of the corners of the hypercube {±1} n , while somehow balancing the columns of B. The Gram-Schmidt Walk Design starts by a relaxation: z = 0 ∈ [−1, +1] n . In order to achieve integrality, we first choose a direction u. Then, we start walking from z randomly either in positive or negative direction along u, until we hit a new boundary of [−1, +1] n . By repeating this procedure at most n times, we achieve integrality. This process is depicted in Figure 2 .2. In this procedure, we have the following two degrees of freedom that both help achieve a particular purpose: • The step direction u is chosen such that the covariate imbalance between Z + , Z − does not increase by too much. 10 • The step size is chosen randomly between two values: Either positive or negative in order to hit a new boundary of [−1, +1] n when walking along u. The probability distribution between these two step sizes is chosen such that the Horvitz-Thompson estimator is unbiased. We restate the Gram-Schmidt Walk Design algorithm from Harshaw et al. in Algorithm 1. 
The Gram-Schmidt Walk Design (aka GSWD) from [Har+21] Input: Set step size δ t : Update fractional assignment z t+1 ← z t + δ t u t 11: t ← t + 1 12: end while 13: return assignment vector z := z t Note that we did not emphasize in the intuitive explanation the procedure of choosing a pivot at random. While so-called "pivot phases" are important in the analysis of the Gram-Schmidt Walk in [Ban+17] , the only importance of choosing these pivots at random is for finding an estimator of the ridge loss, which is in turn necessary for constructing confidence intervals. 11 Therefore, we will not dive deeper into this. A rigorous algorithm analysis can be found in the appendix of Harshaw et al. We highlight the key results in the following. Theorem 6. Under the Gram-Schmidt Walk Design (Algorithm 1), the Horvitz-Thompson estimatorτ ht is an unbiased estimator for the average treatment effect. To prove Theorem 6, we first show the following lemma. 11 Formally, the random choice of pivots is used to prove that all "second-order assignment probabilities": [Har+21] Lemma 7 (Lemma 2 in [Har+21] ). The sequence of fractional assignments in Algorithm 1 z 0 , z 1 , . . . forms a martingale. Proof. The central observation is that by choice of δ t , we have The rest follows by applying basic probabilistic identities, as done in the following. where the conditional independence holds because δ t is complety determined by δ + t , δ − t and thus for any x ∈ R. Proof of Theorem 6. The unbiasedness ofτ ht under the Gram-Schmidt Walk Design follows now from the fact that z 1 , z 2 , . . . form a martingale and by observing that z 0 = 0. For the advanced reader familiar with basic martingale properties, we recommend skipping the remainder of this proof. Formally, we show by induction that for any t ≥ 0 (where t is the index of an iteration of the GSWD): Step (t > 0). Assume, by induction, that Equation (2.28) holds for t. Then, we have (induction hypothesis, Equation 2.28) Therefore, we have for the algorithm's returned assignment vector z : Therefore, for any unit i ∈ [n]: which implies that Pr [z(i) = +1] = Pr [z(i) = −1] = 1 2 . From Lemma 1 we know that this implies unbiasedness ofτ ht . This section shows two bounds on the worst-case variance ofτ ht under the Gram-Schmidt Walk Design. In the following, we assume Φ ∈ (0, 1] to eliminate issues when Φ = 0. Theorem 8 (Theorem 1 in [Har+21] ). Under the Gram-Schmidt Walk Design, Proof outline. This proof is based on the main proof of Bansal et al.'s Gram-Schmidt Walk [Ban+17] and stems from the connection of the above algorithm to its namesake, the Gram-Schmidt orthogonalization procedure. The lengthy proof is beyond the scope of this work and we refer to Appendix A, pp. 58-67, in [Har+21] . However, we give a basic sketch that should help the interested reader while studying the proof from Harshaw et al. In general, the goal is to show a Loewner order bound, it thus suffices to show (2.30) The key step is that for any v ∈ R n+d , v Cov (Bz) v can be rewritten in terms of the step sizes and directions, and then further in terms of projection matrices: v Here, S i is unit i's pivot phase (the set of iterations for which unit i was the pivot). 12 P i is the projection matrix onto the subspace of {Bu t : t ∈ S i }, i.e., the subspace of updates generated during the ith pivot phase. The above equation shall serve as a guideline when studying the detailed proof of Harshaw et al., where proofs for all the steps are given. 
To briefly touch on the connection to the Gram-Schmidt orthogonalization process, where a set of vectors are orthogonalized one after the other, it is shown in Bansal et al. that the subspaces mentioned above are orthogonal to each other, i.e. (2.31) Proof. We first calculate the upper n × n block of the projection matrix P. The lemma then follows from this n × n block and Theorem 8. By definition of the projection matrix P onto the column space of B, we have and therefore, by matrix block multiplication (note that the middle matrix has dimension n × n): (2.32) This inverse indeed exists, as for any Φ ∈ (0, 1], ΦI + (1 − Φ)ξ −2 XX is positive definite and thereby invertible. By definition of B, we have: As Φ is positive, we can multiply by Φ −1 , which gives the result. From this, the following (first) bound on the variance ofτ ht follows. Corollary 10 (Theorem 2 in [Har+21] ). For any given trade-off parameter Φ ∈ (0, 1] and potential outcome vector µ = a+b 2 , the worst-case variance under the Gram-Schmidt Walk Design is upper bounded by: Proof. This result follows from a bound in the spectral discussion and a bound on the maximal eigenvalue. We know from the spectral discussion (Lemma 4), that where λ max is the maximal eigenvector of Cov (z). As Cov (z) We can summarize the contributions that lead to Corollary 10 in the following diagram. However, it is possible to refine this result. In order to do so, we first need to establish a connection to the ridge regression loss. Lemma 11 (Lemma A10 in [Har+21] ). For any µ ∈ R n , Φ ∈ (0, 1] and X ∈ R n,d with maximum row norm ξ = max i∈ [n] x i 2 , the following equality holds. Proof. Note that there exists an explicit formula for the optimal β ∈ R d , see Hastie et al. [HTF09] . This lemma follows then by realizing that the optimal value of the ridge loss function, i.e., the optimizer β ∈ R d plugged into the ridge loss function, is equal to µ Qµ. For a detailed multi-page calculation, we refer to pp. 79-81 in the appendix of [Har+21] . This lemma allows now for an improvement of Corollary 10: Theorem 12 (Theorem 3 in [Har+21] ). For any given trade-off parameter Φ ∈ (0, 1) and potential outcome vector µ = a+b 2 , the worst-case variance under the Gram-Schmidt Walk Design is upper bounded by: Proof. This result follows by first rewriting the variance in terms of the squared error expression (Lemma 3) and then applying the connection to the ridge loss established in Lemma 11: Theorem 12 lets us connect to previous results and an earlier discussion from the spectral section. First, notice that Theorem 12 is at least as good as Corollary 10: Choosing β = 0 makes their bounds match. But Theorem 12's power can be discovered when considering Φ → 0. Recall that this choice of Φ corresponds to emphasizing "potential performance" in the potential performance versus robustness tradeoff. Then, the term has the most weight in Theorem 12. If the covariates are linearly predictive of the potential outcomes vector, we have that µ is close to the column span of X, and we can thus find a good β such that µ − Xβ 2 2 ≈ 0. Thereby, Theorem 12 gives a good upper bound for the variance ofτ ht -which we have shown to be the mean squared error of the estimator. Because we know that the variance can be decreased with an increase of the experiment size n, this result is significant when increasing the sample size of an experiment is expensive or not possible. 
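To summarize the algorithmic content of this chapter, the following numpy sketch implements the Gram-Schmidt Walk Design as described in Section 2.6: the augmented vectors b_i, the least-squares step direction with u_p = 1 on a randomly drawn pivot, and the randomized step size that makes the fractional assignments a martingale. It follows the published description of Algorithm 1 in [Har+21], but it is a simplified illustration rather than the authors' reference implementation.

```python
import numpy as np

def gram_schmidt_walk_design(X, phi, rng):
    """Sketch of the Gram-Schmidt Walk Design: returns a random assignment z in {-1, +1}^n.

    X   : (n, d) covariate matrix
    phi : trade-off parameter in (0, 1]; 1 = robustness (i.i.d.-like), -> 0 = covariate balance
    """
    n, d = X.shape
    xi = np.max(np.linalg.norm(X, axis=1))            # scaling factor for the covariate part
    # Augmented vectors b_i = (sqrt(phi) e_i, sqrt(1 - phi) x_i / xi), stacked as columns of B.
    B = np.vstack([np.sqrt(phi) * np.eye(n),
                   np.sqrt(1 - phi) / xi * X.T])      # shape (n + d, n)

    z = np.zeros(n)
    alive = lambda: np.where(np.abs(z) < 1 - 1e-9)[0]
    pivot = rng.choice(alive())

    while alive().size > 0:
        act = alive()
        if pivot not in act:                          # re-draw a pivot once the old one froze
            pivot = rng.choice(act)
        others = act[act != pivot]

        # Step direction: u_pivot = 1, u zero on frozen units, and u on the remaining
        # alive units chosen to minimize ||B u||_2 (a least-squares problem).
        u = np.zeros(n)
        u[pivot] = 1.0
        if others.size > 0:
            v, *_ = np.linalg.lstsq(B[:, others], -B[:, pivot], rcond=None)
            u[others] = v

        # Largest step sizes in either direction that keep z inside [-1, 1]^n.
        def max_step(sign):
            cap = np.inf
            for zi, wi in zip(z, sign * u):
                if wi > 1e-12:
                    cap = min(cap, (1 - zi) / wi)
                elif wi < -1e-12:
                    cap = min(cap, (-1 - zi) / wi)
            return cap

        d_plus, d_minus = max_step(+1.0), max_step(-1.0)
        # Randomize the step so that E[delta] = 0, which keeps the estimator unbiased.
        if rng.random() < d_minus / (d_plus + d_minus):
            z = z + d_plus * u
        else:
            z = z - d_minus * u

    return np.where(z > 0, 1, -1)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
z = gram_schmidt_walk_design(X, phi=0.5, rng=rng)
print(z, "covariate imbalance:", np.linalg.norm(X.T @ z))
```

Decreasing phi toward 0 trades the robustness of the i.i.d. design for a smaller covariate imbalance ||Xᵀz||, mirroring the trade-off discussed in Section 2.5.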
A key assumption in many RCTs, including the model considered in Harshaw et al., is Rubin's SUTVA (see Chapter 1): the absence of any influence between units. However, in many situations, this is an inaccurate model that can lead to wrong results. For example, in a vaccine trial, a participant is not only influenced by their own group assignment (vaccine / placebo) but also by the vaccination rate in their peer group. Therefore, the underlying social network should be considered in an effective RCT. The influence of one unit on another unit's observed outcome is commonly referred to as interference or a spillover effect (the effect on one individual "spills over" to another) and occurs in the literature in two ways: It might either be the quantity of interest in an experiment (how do peer groups affect each other?) or an undesired perturbation factor that makes the statistical analysis of an RCT harder. The latter is the case for the vaccine trial described at the beginning, and it is the focus of this work. Early occurrences of interference are in agricultural studies, where nearby crops influenced each other. This carried over to educational research, where disjoint student groups are exposed over an extended period of time to different learning techniques, but any interference (e.g., by communication between students from different groups) should be ruled out. A common practice in these educational studies is block randomization at the institutional level, where schools are treated as blocks, and each school is randomly assigned to treatment / control as a whole 1 [Sch17]. The assumption made there is that students might communicate within schools, but there is no communication between students from different schools. Sobel calls this the "partial interference assumption" [Sob06], which has been studied extensively [HH08]. But this model's field of application is also limited. Coming back to the vaccine trial, what are the blocks? Contagious diseases spread via social contacts within a population. Thus, there is a social network underlying the spillover effects, and we therefore speak of network interference. As we model potential influence (it is not known ex ante which person will infect another), we have a rather connected graph instead of multiple, clearly separated clusters. The example of a vaccine trial motivates the need for both an RCT model and a treatment effect estimator under network interference. Before designing an appropriate estimator, we need to model how units influence each other. In a recent economics paper, Candogan et al. model interference patterns as a graph and consider interference to occur as soon as a unit u has any neighbor not in the same treatment group as u. If this is the case, they discard the data obtained from u. If this is not the case (u has only neighbors within the same treatment group), they keep the observed outcome from u as it is [CCN21]. This model has three major drawbacks. First, their wasteful handling of data points leads to low precision: A unit is likely to have at least one neighbor that is not in the same treatment group. By discarding the observed outcomes from any such units, the number of units used for analysis (call this n′) shrinks significantly, and so does the estimator's precision. 2 Second, the intensity of network effects is not accounted for.
Suppose there are two units (that have both only neighbors in the same treatment group and are therefore not discarded) where one of them has only a single neighbor, and the other has 100 neighbors. The latter might be more influenced than the former, but this is not accounted for in this model. Third, what if we only know the likelihood of interaction between two units? Their model assumes a deterministic graph, which might not be known ex-ante -i.e., when making the group assignments. The three points above show that it is crucial to define the model of interference thoughtfully. "In presence of network interference, even designing an unbiased estimator for τ given a randomized experiment can be challenging" - [CCN21] We overcome all of the issues described above in the following sections: First, we find a good interference model (Section 3.2) -which is different from Candogan et al. Then, we present an unbiased estimator, 3τ net , for the average treatment effect (Section 3.3). Next, we analyze its variance (Section 3.4), where we can give a nice matrix-vector product expression that bounds the variance (Theorem 18). Then, we draw comparisons to Candogan et al. by giving a similar linear program as [CCN21] (Section 3.5) and finally discuss the role of disconnected graphs (Section 3.6). Finding a good model is the foundational part of this work: Without a good model, the subsequent analysis is useless. The process of finding a good model is a balancing act between generality and complexity: The good model should be both general enough to be representative of the real world and simple enough to be analyzable. We outline in this section different possible models and give thereby reasons for our final model. Suppose we knew the underlying social network in the form of a graph G = (V, E) with V = [n] and adjacency matrix A. Without loss of generality, we assume no self-loops. 4 Under interference, we do not observe only this unit's base outcome y i but also some network effect based on its neighbors. In our model, we assume that these parts are additive, 5 so that we get for the observed outcome under interference, y i : The question at hand is: What is a good quantity for "some quantity depending on unit i's neighbors and A"? We explore some possibilities in the following. The idea of this model is similar to an echo chamber. One can think of people's political points of view and how they influence each other through communication. Consider, for example, a group of people belonging to either of two political parties, say party "D" and party "R". It is a well-known phenomenon that people that only communicate within their party (sociologically speaking, their "bubble") and never consider outside opinions, tend to have a more extreme point of view. On the other hand, if someone communicates with both "R" and "D" people, they might have a more balanced opinion, and their point of view does not get much distorted from their friends' opinion. This model tries to capture exactly this notion. To see that quantitatively, consider a unit i in group Z + . If this unit has many neighbors within the same group, Z + , the network effect term +z i j∈[n] A i,j z j is big and thus the observed outcome y i increases. On the other hand, if a unit k in group Z − has many neighbors in Z − , the term +z k j∈[n] A k,j z j is big (as z k = −1) and thus the observed outcome y j increases, too. For a unit l with connections to both groups equally, the network term will be zero, and we observe y l = y l . 
This model has a specific statistical property: (using the fact that G has no self-loops) and therefore, in expectation, y = y. However, this model has two major shortcomings. First, a unit's influence on others might depend on the magnitude of its base outcome y i : People with a stronger opinion might have a stronger influence on their peer group. This model does only account for edge weights but not for the neighbor's base outcome. Accounting for such a dependence (adding a factor y j in the sum) would destroy the nice statistical property described above. Second, while this echo chamber model might seem reasonable in some sociological context, there are many other cases where network effect is inherently different. Recall that in this model, the observed outcome's increase depends not on the neighbors' actual treatment groups but only on their group assignments relative to the unit. It could also be that the neighbors' actual treatment groups determine the influence on the observed outcome. For example, units from Z + might universally increase their neighbor's observed outcome -independent of their group -and units from Z − might universally decrease their neighbor's observed outcome. Treatment group dependent model. Fixing the above model's drawbacks, we consider the following model Here, the network influence also depends on the neighbor's base outcome (y j ) and it depends on the neighbor's actual group assignment (versus in the echo chamber model, where the sum's z i coefficient made the effect depend only on the neighbor's group relative to unit i). In vector notation, this model is y = (I + A) y. (3.5) Mathematically, we could simply infer y from y (assuming (I + A) is invertible): and use the standard Horvitz-Thompson estimator on y. Is modeling interference really that simple? Taking a close look shows us that it is not: Introducing randomness. Knowing connections between units (such as friendships or social closeness) does not mean that they really do influence each other. It is just a possibility. Nonetheless, we can say that the likelihood of influence is stronger for persons with a strong friendship or more communication. Therefore, the actual observed outcome under interference is rather a random variable, following some probabilistic model that we make of the network. To account for this, we model each edge weight as an independent random variable C i,j ≥ 0 with known expected value A i,j . We do not make any further assumptions. We let this random model deliberately be that free and do not make any further assumptions on the exact distribution, as this is situation dependent. Our model now becomes: The experimenter will give the underlying expected adjacency matrix A and will define the probability distribution of C. Note that diag(C) = 0 corresponds to the absence of self-loops. Therefore, our final model becomes: Definition 2 (Probabilistic interference model). The observed outcome under interference is y = (I + C)y where C is the random adjacency matrix with known expectation E [C] = A, diag(C) = 0 and y ∈ R n with It shall be noted that another possibility would be to use A i,j y j (note the "prime" symbol: y j ). (3.8) which would reflect the inherent recursive nature of peer influence: One unit influences another, which again, under this influence, influences the next one. There are two points to make on this idea. First, order-n influence can be described already with the adjacency matrix A, as we do not restrict it to be binary. 
Therefore, further degree-k-neighborhoods in G could be encoded using the kth power of the adjacency matrix instead. However, and this is the second point, this would create difficulties when adding randomness at a later stage. We discuss this further in Section 3.7. The Horvitz-Thompson estimator (Definition 1) simply applied to the observed outcomes y would be biased, as it still contains all the spillover effects. Therefore, we define the following estimator for the (unobservable) average treatment effect: Definition 3 (Network ATE estimator). Note that all quantities in this definition are known (A, n) or measurable (y ). Our estimator has the nice property of being unbiased (see next section). Its well-definedness, however, depends on the invertibility of I + A. In the following, we give some sufficient conditions on the probabilistic graph model under which I + A is guaranteed to be invertible. This should illustrate that under reasonable assumptions,τ ht is well-defined. Note that the exact modeling of the probabilistic graph will be the experimenter's task. This section aims to give examples of sufficient conditions on the graph model but does by no means try to capture all necessary conditions. In general, it is useful to have some mechanism in the model that allows an edge (i, j) to have zero weight with certainty, as big social networks inherently have clusters and thereby units that are known to have no interaction. The Bernoulli model provides that functionality. Bernoulli model. Suppose unit j's influence on unit i (i = j) is either 0 or α, according to a Bernoulli distribution: Therefore, I + A is strictly diagonally dominant, thereby (I + A) −1 exists and thusτ net is well-defined. Uniform weight distribution with activation. We can extend the Bernoulli model where not only the existence of a link is probabilistic, but also the weight -if it exists. The following captures this in case of a uniform weight distribution Therefore, I + A is strictly diagonally dominant, thereby (I + A) −1 exists and thusτ net is well-defined. For a particular probabilistic graph model in an experiment, the experiment will either have to verify beforehand that under his assumptions, I + A is guaranteed to be invertible, or they will have to check after determining the expected edge weights, denoted in A, that the matrix I + A is invertible. From now on, we will assume well-definedness ofτ ht . To analyze bias and variance ofτ net , we first need to find a compact expression of the estimator's error. To that end, we use the analysis of τ from Harshaw et al. (see proof of Lemma A1 in [Har+21] ): Definition 4. ∀i ∈ [n] : It can be easily seen that the following identities hold. Lemma 15 (Identities forå,b as in [Har+21] ). • y =å +b (by definition) Their purpose is to express the deterministic τ in terms of z to make it better comparable withτ net . Therefore, we can express the average treatment effect as Therefore, the errorτ net − τ iŝ (3.14) Remark. Note that the only difference to the error ofτ ht in the non-interference casê is the term +2 z, Ey . Forτ ht , this error representation eliminates y in the right side of z, • . This leads to having only linear dependencies on z in the z, • term. More importantly, it leads to having only quadratic terms in z, • 2 , which is part of the variance expression. In our network interference case at hand, y does not cancel, so that the right-hand side of z, • depends on y. 
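As a quick numerical illustration of the model and estimator before continuing this analysis, the following sketch checks unbiasedness by simulation. Since Definition 3 is stated above without its displayed formula, the sketch assumes the form τ̂_net = (2/n)⟨z, (I + A)^{-1} y'⟩, which is consistent with the error decomposition in Equation 3.14; it uses the Bernoulli edge model from the well-definedness discussion and an i.i.d. design.

```python
import numpy as np

rng = np.random.default_rng(2)

n, p_edge, alpha = 30, 0.10, 0.2           # Bernoulli model: C_ij = alpha w.p. p_edge, else 0
A = np.full((n, n), alpha * p_edge)        # expected adjacency matrix, E[C] = A
np.fill_diagonal(A, 0.0)                   # no self-loops
M_inv = np.linalg.inv(np.eye(n) + A)       # exists: I + A is strictly diagonally dominant here

a = rng.normal(2.0, 1.0, size=n)           # potential outcomes under treatment
b = rng.normal(1.0, 1.0, size=n)           # potential outcomes under control
tau = np.mean(a - b)                       # true average treatment effect

reps = 20000
estimates = np.empty(reps)
for r in range(reps):
    z = rng.choice([-1, 1], size=n)                        # i.i.d. design
    y = np.where(z == 1, a, b)                             # base outcomes
    C = alpha * (rng.random((n, n)) < p_edge)              # random influence matrix
    np.fill_diagonal(C, 0.0)
    y_obs = (np.eye(n) + C) @ y                            # observed outcomes under interference
    estimates[r] = 2.0 / n * z @ (M_inv @ y_obs)           # assumed form of tau_net_hat
print("true tau      :", round(tau, 3))
print("mean estimate :", round(estimates.mean(), 3))
print("empirical var :", round(estimates.var(), 4))
```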
As y depends on the random vector z, we have quadratic dependence on z in z, • (relevant for the bias) and quartic dependence in z, • 2 (relevant for the variance), making an analysis hard. However, we can eliminate the dependency in the bias term (Section 3.4.1)and reduce to cubic dependencies in the variance term (Section 3.4.2). Remark. The error ofτ net (Equation 3.14) depends on two inherently different random variables. On the one hand, it depends on the random matrix C, on which E depends. On the other, it depends on the design z upon which y depends. But both C and z are stochastically independent, as the former describes the random connections (e.g., friendships or communication), and the latter describes the group assignment: We can only make the design dependent on the observable A = E [C] but not on the unobservable C. Therefore, C ⊥ z. Theorem 16. The average treatment effect estimatorτ net is unbiased. Proof. Note that we used our general assumption, namely that the design ensures E [z] = 0. Lemma 17. The variance ofτ net is where both summands are non-negative. Proof. (using the error expression, equation 3.14) The second summand is clearly non-negative. For the first one note that Cov (z) is positivesemidefinite. The dependence of y on z can be written as follows: where represents element-wise multiplication. This shows the challenge of the variance term in Lemma 17: Due to the linear dependence of y on z, the variance is an expectation over a quartic term in z. There are two nice properties for an expression of Var [τ net ] that we would like to achieve. (P1) It should only consist of positive terms, as this makes a possible algorithm design easier by avoiding the burden of cancellation. (P2) Its dependence on z should be simpler, possibly removing the quartic dependency. We have explored this in two ways. (a) By assuming natural bounds on y, we can achieve both (P1) and (P2). (b) Without making any assumptions on y, we can achieve (P2) by transforming the quartic dependency on z to a cubic dependency. However, this comes at the cost of losing (P1). Both ways (a) and (b) are described in the following sections. A natural assumption is a bound on the outcome y. Assumption 1 (Bound on y) . Assume ∃f ∈ R + .∀i ∈ [n]: This extends to The mathematical effect of this assumption is to reduce the quartic dependence on z to a quadratic one. where M is a known, positive-semidefinite matrix and R := M 1 2 is its root. We first show a stochastic property of E that will be needed in the proof of Theorem 18. Analogously, for any other E p,q , we have If those are from different columns, meaning j = q, these two terms do not share any entry of C. As different entries of C are independent by our interference model assumption, Proof of Theorem 18. We first find a bound on the network part of Var [τ net ] and then turn this into the bound of Theorem 18. We now rewrite the network term of Var [τ net ] as a sum of positive terms. (z e i y i )(z e j y j ) (z e i y i )(z e j y j ) By applying Assumption 1, this yields Plugging this into Var [τ net ], we obtain (by Assumption 1: |a + b| ≤ 2 √ f · 1) By setting Cov (e i ) Note that the bound of Theorem 18 is tight in the following sense: There exist a, b such that it holds with equality (namely a = b = √ f · 1). To minimize the variance ofτ net in this setting, we thus need to minimize the euclidean norm This gives us some asymptotic insight into our problem. 
Theorem 20 implies for the case d = n that it is not efficiently possible to find a z ∈ {±1} n such that for matrices R ∈ R n,n with maximum l 2 column norm 1 in general. 6 We can transform our problem (Equation 3.20) into the above form by setting where ξ := max i∈[n] R :,i 2 is the maximum column norm. Therefore, we cannot hope to efficiently find a better z for all R ∈ R n,n . However, it has to be noted that Theorem 20 is a statement about matrices V in general. We might have a very special case here where the matrices R have a special form for which it is indeed possible to make the distinction. We will continue this conversation in the outlook. Do we know of an efficient algorithm that finds an asymptotically tight z ∈ {±1} n for any given R ? We do, namely the simple i.i.d. design. Lemma 21. The i.i.d. design is asymptotically tight in the above sense, i.e., for any R ∈ R n,n with maximum l 2 column norm 1, the i.i.d. design gives a z ∈ {±1} n s.t. Proof. For any i ∈ [n], we have [Spi21] . We conclude this section with a bound on the probability of having a "bad"τ net . If we do not make any assumptions on y, we can work towards (P2) by transforming the variance ofτ net to have only a cubic, instead of a quartic, dependence on z. However, we lose during this process the consisting-of-positive-terms-only property (P1). In this section, we will first establish an exact term for the variance. Based on this, we will determine the variance ofτ net under an i.i.d. design. where the network-related constant K is defined as To prove this theorem, we first need a technical lemma: where K i,j,k is defined as in Theorem 24. We prove this technical lemma in Appendix B.1. Proof of Theorem 24. Recall from Equation 3.15 the identity for y that exhibits the dependence on z: a − b) ) . We can use this identity in combination with Lemma 25 to show the theorem: Constraining a, b to some range, e.g. [0, 1] n , finding a robust design could be formulated as follows, based on Theorem 24: However, it is not clear how to optimize this min-max expression. If we constrain the potential outcomes to be binary a, b ∈ {0, 1} n , we can formulate finding an optimal robust design as a linear program, similar to Candogan et al.: We first introduce a technical notation that simplifies the linear program notation later. Let us enumerate all 2 n possible assignment vectors z ∈ {−1, 1} n : 1, . . . , 2 n in some order and stack all of these assignment vectors in that order on top of each other. We call this matrix of stacked vectors W ∈ {±1} 2 n ,n : (2) 3.6. Disconnected Graphs: Block design Proof. We will first show that the above linear program gives a valid distribution, then we show that this distribution ensures thatτ net is unbiased, and finally show the optimality claim. Constraints (3) and (4) ensure that {p u : u ∈ [2 n ]} is a valid probability distribution over the sample space of all 2 n possible assignments. Note that we can rewrite the expectation of z i which holds by definition of W. Therefore, constraint (2) ensures unbiasedness, asτ net is unbiased for E [z] = 0 (Theorem 16). It remains to show that D is worst-case optimal. Note that v is just a scalar, introduced for minimization. Similar to the expectation of z i , we can use W to rewrite (3.33) Therefore, (by Equations 3.32, 3.33) Therefore, constraint (1) is the variance term. As the scalar v is guaranteed to be greater than or equal to the variance term for all possible outcomes a, b ∈ {±1} n , v is at least the worst-case variance. 
As we minimize v, it is the optimal worst-case variance. We emphasize this section's introductory remarks: As with Candogan et al.'s linear program, this one is only feasible for very small n. Note that we have not only exponentially many constraints -which could allow for an efficient linear program if a separation oracle was found -but also exponentially many variables {p u : u ∈ [2 n ]}. We have seen before that the variance of an unbiased estimator decreases naturally with 1 n . The calculation (Equation 2.10), however, assumed independence between the units (SUTVA). It is not clear how this carries over to the non-SUTVA case, 7 where dependence between the units is the characteristic. In this section, we show that if a graph has multiple connected components, the network's contribution to the variance term can be regarded as a contribution per connected component. Therefore, we have a notion of independence at the inter-connected-components-level. Proof. Recall the definition of K: because at least one of the factors must be zero, for any of the above reasons. The lemma follows by summing expression 3.39 over all u ∈ [n]. 7 Even though there is a 1 n 2 factor in the variance term, it is not clear whether or not Ez,C (z Ey) 2 ∈ Ω n 2 8 Block-diagonality of A implies block-diagonality of C as E [C] = A, and as the entries of C are nonnegative random variables There are many questions worth exploring further. Capturing "more" graph influence. An interesting generalization of our model would be trying to capture degree-k influence: In reality, a unit gets influenced from its neighbors. This unit, again, influences its neighbors -but now with the influence it gained so far. If we consider influence from units upto the k-th neighbor ("degree-k influence"), we would get a model like this: More realistically, spillover effects from neighbors further away are less influential, thus motivating a damping factor λ < 1. In addition, an accurate model would capture the influence from neighbors of any distance: k → ∞. This gives rise to the following model: which is similar to a model in the game-theoretic paper by Ballester et al. [BCAZ06] . While it is very nice to have an explicit representation of the matrix series, further calculations would require taking the expectation of an inverse of a matrix, making mathematical reasoning hard. An asymptotic lower bound result. Another interesting question comes from the complexity side: Does the hardness result from Theorem 20, which is a statement on matrices in general, also apply to the matrices R of interest for our problem? This could be approached by trying to find a reduction from Max-2-2-Set-Splitting -which is known to be NP-hard [Gur04; CNN11] -to our variance bound minimization problem (Equation 3.20). If possible, Max-2-2-Set-Splitting could be reduced to an instance of our problem for a certain family of graphs, using a reduction similar to [CNN11] . It shall be noted that this statement would still give an asymptotical lower bound, given the boundness assumptions on y. While experimenters are sometimes interested in small constants, such an asymptotic result would give guidance for further algorithm searches. Chapter 4 In this work, we covered the randomized controlled trial, one of the major scientific tools in medicine, behavioral economics and other scientific areas. We gave a formal introduction, covering terms and notions from different disciplines. 
Further, we introduced Harshaw et al.'s Gram-Schmidt Walk Design, built an intuition for the algorithm, and gave its major theorems. Furthermore, we developed a model for RCTs in the presence of social spillover (network interference), i.e., when Rubin's Stable Unit Treatment Value Assumption (SUTVA) does not hold. For this case, we introduced an unbiased estimator and analyzed its variance. Under boundedness assumptions on the potential outcomes, we reduced the task of variance minimization to the minimization of the l2 norm of a matrix-vector product in expectation, where the random vector has to lie in {±1}^n. While the variance decreases with ∼ 1/n if the SUTVA holds, we cannot make this argument in the presence of interference. However, we have shown that such an argument can be made with an increasing number of disconnected components in the underlying graph. Social interference in randomized controlled trials is a source of many interesting questions. We encourage more research both on finding optimal designs in our model and on extending the model.

References
Who's Who in Networks. Wanted: The Key Player
The Gram-Schmidt Walk: A Cure for the Banaszczyk Blues
On Some Combinatorial Questions in Finite-Dimensional Spaces
"Integer-Making" Theorems
Near-Optimal Experimental Design for Networks: Independent Block Randomization. SSRN Scholarly Paper ID 3852100
Tight Hardness Results for Minimizing Discrepancy
Design and Analysis of Clinical Trials: Concepts and Methodologies
Streptomycin Treatment of Pulmonary Tuberculosis
The Design of Experiments
Clinical Research in Ancient Babylon: Methodologic Insights from the Book of Daniel
Inapproximability Results for Set Splitting and Satisfiability Problems with No Mixed Clauses
Balancing Covariates in Randomized Experiments with the Gram-Schmidt Walk Design. Version 3
The Elements of Statistical Learning
The Clinical Trial
A Generalization of Sampling Without Replacement From a Finite Universe
Toward Causal Inference With Interference
Issues in Outcomes Research: An Overview of Randomization Techniques for Clinical Trials
Stratified Randomization for Clinical Trials
Properties of Simple Randomization in Clinical Trials
Central Limit Theorem
Safety and Efficacy of the BNT162b2 mRNA Covid-19 Vaccine
Randomization Analysis of Experimental Data: The Fisher Randomization Test Comment
What Is Design-Based Causal Inference for RCTs and Why Should I Use It? NCEE 2017-4025
What Do Randomized Studies of Housing Mobility Demonstrate?
Discrepancy Theory and Randomized Controlled Trials
The Randomization and Stratification of Patients to Clinical Trials

(using the above identity for y) The theorem follows by multiplying by 4/n². Corollary 26. For the i.i.d. design, due to the independence of the entries of z. Applying this to Theorem 24 gives the desired result. Remark. This exact expression can be turned into a better bound on Pr[|τ̂_net − τ| ≥ t] than the one given by Theorem 22 using the Chebyshev inequality. We do not carry this out because this bound would require knowledge of a, b, which is unknown to the experimenter. However, it shall be said that if the experimenter finds other bounds for a, b than suggested in the previous section, they can use this exact expression (Corollary 26) and apply their bounds from here on. This section aims to derive a formulation similar to the formulation in Candogan et al. [CCN21]. We demonstrate thereby that under our model assumptions and estimator, it is possible to derive a similar linear program.
However, as in their paper, it still suffers from major drawbacks and can only be used for small experiment sizes n and under limiting assumptions on a, b. One notion of a design's robustness is to have minimal variance under the worst-case potential outcomes a, b. Recall that the design D is the distribution of the random assignment vector z: z ∼ D.
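To make the worst-case formulation concrete, the following toy sketch sets up such a linear program for n = 4 with binary potential outcomes, the Bernoulli interference model of Section 3.3, and the assumed estimator form τ̂_net = (2/n)⟨z, (I + A)^{-1} y'⟩. The per-assignment mean squared error is computed by a direct variance calculation for independent Bernoulli edge weights (a stand-in for the exact constant of Theorem 24), and the resulting LP over all 2^n assignment probabilities is solved with scipy; as emphasized above, this is only feasible for very small n.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

n, p_edge, alpha = 4, 0.3, 0.3                      # tiny instance; Bernoulli interference model
A = np.full((n, n), alpha * p_edge)
np.fill_diagonal(A, 0.0)
M_inv = np.linalg.inv(np.eye(n) + A)

# All 2^n assignments, stacked as rows of W (the matrix used in the LP formulation).
W = np.array(list(itertools.product([-1, 1], repeat=n)))
# All binary potential-outcome pairs (a, b) in {0,1}^n x {0,1}^n.
outcomes = list(itertools.product(itertools.product([0, 1], repeat=n), repeat=2))

def mse_given_assignment(z, a, b):
    """Exact E_C[(tau_net_hat - tau)^2] for a fixed assignment z, derived for
    independent Bernoulli entries with Var(C_ij) = alpha^2 * p_edge * (1 - p_edge)."""
    a, b = np.array(a, float), np.array(b, float)
    mu = (a + b) / 2.0
    y = np.where(z == 1, a, b)
    g = M_inv.T @ z
    var_c = alpha ** 2 * p_edge * (1 - p_edge)
    cross = (g ** 2).sum() * (y ** 2).sum() - ((g ** 2) * (y ** 2)).sum()
    return (2.0 / n) ** 2 * ((z @ mu) ** 2 + var_c * cross)

# q[u, k] = variance contribution of assignment u under outcome pair k (linear in p).
q = np.array([[mse_given_assignment(z, a, b) for (a, b) in outcomes] for z in W])

# Variables x = (p_1, ..., p_{2^n}, v); minimize v.
N = W.shape[0]
c = np.r_[np.zeros(N), 1.0]
A_ub = np.c_[q.T, -np.ones(len(outcomes))]          # sum_u p_u q_u(a,b) <= v for all (a,b)
b_ub = np.zeros(len(outcomes))
A_eq = np.vstack([np.c_[W.T, np.zeros((n, 1))],     # E[z] = W^T p = 0 (unbiasedness)
                  np.r_[np.ones(N), 0.0][None, :]]) # probabilities sum to 1
b_eq = np.r_[np.zeros(n), 1.0]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, 1)] * N + [(0, None)])

print("worst-case variance:", res.x[-1])
print("design (nonzero assignment probabilities):")
for prob, z in zip(res.x[:N], W):
    if prob > 1e-6:
        print(f"  p={prob:.3f}  z={z}")
```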