key: cord-0549412-jylcg9rr authors: Kancharla, Manjusha; Kang, Hyunseung title: A Robust, Differentially Private Randomized Experiment for Evaluating Online Educational Programs With Sensitive Student Data date: 2021-12-05 journal: nan DOI: nan sha: 82948e6e3027c5ad87c6f1e2d5e6fba0254426e5 doc_id: 549412 cord_uid: jylcg9rr

Randomized control trials (RCTs) have been the gold standard to evaluate the effectiveness of a program, policy, or treatment on an outcome of interest. However, many RCTs assume that study participants are willing to share their (potentially sensitive) data, specifically their response to treatment. This assumption, while trivial at first, is becoming difficult to satisfy in the modern era, especially in online settings where there are more regulations to protect individuals' data. The paper presents a new, simple experimental design that is differentially private, one of the strongest notions of data privacy. Also, using works on noncompliance in experimental psychology, we show that our design is robust against "adversarial" participants who may distrust investigators with their personal data and provide contaminated responses to intentionally bias the results of the experiment. Under our new design, we propose unbiased and asymptotically Normal estimators for the average treatment effect. We also present a doubly robust, covariate-adjusted estimator that uses pre-treatment covariates (if available) to improve efficiency. We conclude by using the proposed experimental design to evaluate the effectiveness of online statistics courses at the University of Wisconsin-Madison during the Spring 2021 semester, where many classes were online due to COVID-19.

The gold standard to evaluate the effectiveness of a program, policy, or treatment on an outcome of interest is a randomized control trial (RCT). In its purest form, an investigator randomly assigns each individual i, i = 1, . . .
, n, to treatment, denoted as A i = 1, or control, denoted as A i = 0, and observes his/her outcome Y i . Because the treatment was randomized, the treated and control groups are similar in their unobserved and observed characteristics and thus, taking the difference of the average outcomes from the two groups yields an unbiased estimate of the average treatment effect (ATE). A bit more formally, a RCT satisfies strong ignorability [Rosenbaum and Rubin, 1983] and the ATE can be identified from observed data; see Imbens and Rubin [2015] and Hernan and Robins [2020] for textbook discussions. In addition to strong ignorability, one subtle, yet important assumption underlying RCTs is that after randomly assigning treatment, individuals in the study share their responses/outcomes with the investigator. This assumption, almost axiomatic in a RCT, is becoming less plausible in the modern era, especially in online settings where there are increasing regulations to protect users' privacy. For example, Bond et al. [2012] ran an online randomized experiment among 61 million users of Facebook and collected their voting behaviors as the primary outcome; this experiment attracted controversy due, in part, to concerns over data privacy [Benbunan-Fich, 2017]. In RCT-based evaluations of educational programs, investigators often collect sensitive data on students' performance, say test scores, probation status, and class rank, as their primary outcomes [Aaronson ...]. A common way to address the resulting privacy concerns is to de-identify the data before making it public. Some examples include removing any protected health information (PHI) to be compliant with the Health Insurance Portability and Accountability Act (HIPAA) [Annas, 2003], using k-anonymity [Samarati, 2001, Sweeney, 2002], ℓ-diversity, or t-closeness [Machanavajjhala et al., 2006, Li et al., 2007]. While these methods are an improvement in terms of replicability and data sharing, it has been shown that many popular de-identification methods are not sufficient to guarantee privacy.
For example, Sweeney [2000] linked de-identified patient-specific health data to voter registration records using variables such as ZIP code, birth date, and gender, and observed that 87% of the U.S. population can be uniquely identified from these variables. Also, Narayanan and Shmatikov [2008] linked the Netflix Prize dataset containing anonymized movie ratings of 500,000 Netflix subscribers to the Internet Movie Database (IMDb), allowing re-identification of users on Netflix; this led to the discontinuation of the Netflix Prize in 2010 [Hunt, 2010]. The approach to data privacy we use in this work is differential privacy [Evfimievski et al., 2003, Dwork, 2006, Dwork and Smith, 2010], specifically local differential privacy; see Duchi et al. [2018] and references therein. Broadly speaking, differential privacy is a mathematical definition of privacy under which nearly identical statistics, say the sample mean or the p-value from a hypothesis test, are computed from a dataset regardless of whether any one individual is present or absent in the dataset; see Section 2.2 for details. Differential privacy is considered to be the strongest form of data privacy in that if an adversary were to obtain differentially private data, it is, up to a privacy loss value ε, impossible to re-identify an individual in the data. Due to these strong privacy guarantees, differential privacy is used by Google's Chrome browser [Erlingsson et al., 2014] and Apple's mobile iOS platform [Apple, 2019] to protect their users' privacy while enabling the development of novel machine learning methods and statistical analysis. Our main contribution is to propose a simple, robust RCT that guarantees local differential privacy while allowing investigators to estimate treatment effects. Specifically, similar to a typical RCT, we assume that the investigator randomly assigns treatment to individual i and therefore, the treatment value is known to the investigator.
But, unlike a typical RCT, we use randomized response techniques originally from Warner [1965] to collect differentially private outcome data from individual i, denoted as Ỹ i , instead of the sensitive, "true" outcome, denoted as Y i . That is, the investigator only sees a privatized response Ỹ i along with the treatment assignment A i to estimate the treatment effect on the sensitive/true outcome Y i ; in contrast, a typical RCT allows investigators to see both the sensitive/true outcome Y i and A i to estimate the treatment effect on Y i . A key innovation in our proposed experimental design that distinguishes it from a straightforward application of existing differential privacy techniques to RCTs is that we allow the privatized response Ỹ i to be "adversarial." More concretely, unbeknownst to the investigator, some participants may provide "adversarial" data to further mask their identity, say by providing a completely random value as Ỹ i that deviates from the experimental protocol. Our proposed design allows responses from such participants, whom we broadly call "cheaters," and even if their identity is unknown to the investigator, their responses will not harm estimation and inference of treatment effects. In relation to works in differential privacy, a cheater represents, in a loose sense, an "imperfect" implementation of a differentially private algorithm where a database/central entity holding the private data may not faithfully execute the privacy-preserving algorithm, say the entity added the wrong noise to the private data or forgot to add any noise at all. Our work shows how to still obtain relevant statistics of interest even if the differentially private algorithm is imperfectly implemented.
We achieve this by using a simple idea based on sample splitting and noncompliance in psychometric testing [Clark and Desharnais, 1998] where we apply two slightly different differentially private algorithms to two random subgroups of participants and re-weight the outputs from the two algorithms via inverse probability weights to remove bias arising from cheaters; see Section 3.3 for details. Also, in relation to works in psychometric testing, our work extends Clark and Desharnais [1998] to allow for arbitrary types of cheaters and differential privacy; see Section 2.3 for details. Once we have data from the proposed design, we propose two consistent estimators. The first estimator is essentially a difference-in-means estimator weighted by the proportion of non-cheaters and is similar, in form, to the local average treatment effect in the noncompliance literature [Angrist et al., 1996]. The second estimator is a doubly robust, covariate-adjusted estimator that uses pre-treatment covariates, if available, to improve efficiency. We also compare our design to a typical RCT that collects the true, sensitive/private outcome and assess the trade-off between statistical efficiency and data privacy. Finally, the proposed experimental design is used to evaluate online statistics courses at the University of Wisconsin-Madison. Specifically, during the Spring of 2021 when most classroom instruction went online due to COVID-19, n = 72 students participated in an evaluation of the impact of instructors being present in online lecture videos on learning outcomes. Similar to prior works in this area [Kizilcec et al., 2014, Pi and Hong, 2016, Wilson et al., 2018, Wang et al., 2020], we find that instructor-present video lectures improved students' attention among non-cheaters.
Critically, unlike these prior works, the sensitive learning outcomes from the students are guaranteed to be differentially private to any investigator (including those who actually conducted the evaluation). In fact, the proposed design received approval from the Education and Social/Behavioral Science IRB of the University of Wisconsin-Madison.

We review the potential outcomes notation used to define treatment effects [Neyman, 1923, Rubin, 1974]. For each individual i = 1, . . . , n, let A i ∈ {0, 1} denote the binary treatment assignment with A i = 1 denoting treatment and A i = 0 denoting control. Let Y i (a) ∈ D ⊆ R denote the potential outcome of individual i under treatment assignment value a ∈ {0, 1}, where D denotes the support of the outcome. For simplicity, we consider D to take on binary values, D = {0, 1}. In Section 3.2, we show that as long as D is restricted so that differential privacy is well-defined (see Chapter 2.3 of Dwork and Roth [2014]), our proposed experimental design remains differentially private. Finally, let Y i ∈ D and X i ∈ R p be the observed outcome and the p pre-treatment covariates, respectively, for individual i. The estimand of interest is the average treatment effect (ATE), defined as τ = E[Y i (1) − Y i (0)]. To identify τ, the following assumptions are usually made; here, we write the assumptions under a RCT, but the interested reader is referred to Imbens and Rubin [2015] and Hernan and Robins [2020] for identification strategies outside of RCTs. (A1) Stable unit treatment value assumption (SUTVA) [Rubin, 1980]. (A2) Randomization of treatment: A i ⊥ (Y i (1), Y i (0)). (A3) Overlap: there exists 0 < ζ < 1 where 1 − ζ < δ = Pr(A i = 1) < ζ. Briefly, assumption (A1) states that there are no different versions of the treatment and that the treatment of individual i does not impact the potential outcome of any individual i' ≠ i. Assumption (A2) states that the treatment is randomly assigned, and assumption (A3) states that there is a non-zero probability that individual i is assigned to either treatment or control.
Assumptions (A1)-(A3) are usually satisfied by a RCT and consequently, τ can be identified by the well-known difference-in-means formula, i.e., τ = E[Y i | A i = 1] − E[Y i | A i = 0]. In particular, we review two consistent estimators of the ATE in RCTs: the difference-in-means estimator τ̂ Diff , which takes the difference of the average outcomes between the treated and control groups, and the covariate-adjusted, doubly robust estimator τ̂ Cov = (1/n) Σ i { A i (Y i − f 1 (X i ))/δ − (1 − A i )(Y i − f 0 (X i ))/(1 − δ) + f 1 (X i ) − f 0 (X i ) }. Here, f 1 (X i ) and f 0 (X i ) are postulated outcome regression models for the treatment and control groups, respectively. The difference-in-means estimator does not require an outcome model and does not adjust for covariates X i . The doubly robust estimator is an augmented version of the difference-in-means estimator that adjusts for covariates and, in a randomized experiment, is consistent even if f 1 or f 0 is mis-specified. But, if f 1 and f 0 are correctly specified, τ̂ Cov is more efficient than τ̂ Diff . Finally, both estimators are asymptotically Normal with mean τ and their standard errors can usually be estimated by a sandwich formula or the bootstrap [Efron and Tibshirani, 1994]. For additional discussions, see Lunceford and Davidian [2004], Zhang et al. [2008], Tsiatis et al. [2008], and Bang and Robins [2005]. While these estimators have desirable statistical properties, from a data privacy perspective, they require individuals' responses Y 1 , . . . , Y n . If individuals are unwilling or apprehensive about sharing their responses to treatment, or if they share a dishonest response due to reservations about their data privacy, the estimators may no longer be consistent. The next few sections describe a formal way to "privatize" Y i and how to use this privatized response to identify and estimate treatment effects on Y i . Also, while we could privatize the pre-treatment covariates X i in a similar fashion as the response, we leave them to take on arbitrary values since we can identify the causal effect without X i . Instead, we will primarily use X i to gain efficiency; see Sections 3.3 and 3.6 for details.
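As a concrete illustration of the two estimators, the sketch below computes the difference-in-means estimator and the doubly robust (AIPW) estimator on simulated RCT data. The data-generating process, sample size, and linear working models are hypothetical choices made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, delta = 5000, 0.5  # sample size and treatment probability (hypothetical)

# Simulated RCT: one covariate X, randomized treatment A, outcome Y with true ATE = 2.
X = rng.normal(size=n)
A = rng.binomial(1, delta, size=n)
Y = 1.0 + 2.0 * A + 1.5 * X + rng.normal(size=n)

# Difference-in-means estimator: no outcome model, no covariate adjustment.
tau_diff = Y[A == 1].mean() - Y[A == 0].mean()

# Doubly robust (AIPW) estimator with linear working models f1, f0
# fit separately on the treated and control groups.
f1 = np.poly1d(np.polyfit(X[A == 1], Y[A == 1], 1))(X)
f0 = np.poly1d(np.polyfit(X[A == 0], Y[A == 0], 1))(X)
tau_cov = np.mean(A * (Y - f1) / delta - (1 - A) * (Y - f0) / (1 - delta) + f1 - f0)
```

Because the working models absorb the outcome variation explained by X, tau_cov typically has a smaller standard error than tau_diff across repeated simulations, matching the efficiency claim above.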
Forced Randomized Response (FRR). As before, let Y i be the original response that individual i wishes to keep private, and let Ỹ i = M(Y i ) denote the response reported to the investigator after applying a map M. Consider two examples of M: (1) a constant function where for any Y i ∈ D, M(Y i ) = 0, and (2) an identity function where for any Y i ∈ D, M(Y i ) = Y i . Intuitively, the first example is more privacy-preserving than the second in that the constant function always produces the same value 0 for every individual, making it impossible for an investigator to recover the original, sensitive response Y i from Ỹ i . But the first example's M makes the ATE unidentifiable since everyone in the study has the same outcome, irrespective of the treatment. In contrast, in the second example with the identity function, identification of the ATE is possible, but it is not privacy-preserving since the investigator sees the original, sensitive response. Dwork and Roth [2014] argued that most non-randomized functions M are inadequate to simultaneously preserve privacy and allow estimation, and consequently, Dwork [2006] proposed differential privacy, a family of non-deterministic Ms that take an input and stochastically generate an output. For example, if the input is individual i's original response Y i , M may return Y i plus some stochastic noise generated by M, say a random value from a Laplace distribution. The investigator specifies M, carefully choosing how "random" M should be in order to balance the need for data privacy and the need to estimate scientifically meaningful quantities; too much randomness would mean that the ATE becomes "less identifiable" while too little randomness would mean that the individual's data is less private. The amount of this randomness is measured by the privacy loss parameter ε ∈ [0, ∞), and we say a random map M is (ε, 0)-differentially private if changing the input to M only changes its output distribution by a factor based on ε; see Definition 2.1. Definition 2.1 (Differential privacy [Dwork and Roth, 2014]).
A randomized map M is (ε, 0)-differentially private if, for every pair of inputs Y, Y' ∈ D and every output ỹ, Pr(M(Y) = ỹ) ≤ exp(ε) Pr(M(Y') = ỹ). Lower values of ε imply that M is more privacy-preserving; in the extreme case where ε = 0 and there is no loss in privacy, the output from M is statistically indistinguishable for any pair of inputs Y and Y', and estimation of treatment effects would be impossible. An example of such an M would be a fair coin toss where, regardless of the original input, the output is the result of the coin toss. On the other hand, by setting ε > 0, M allows some "signal" from the private outcome Y i to be passed onto the privatized outcome Ỹ i so that treatment effects can be estimated. Some values of (ε, 0)-differential privacy that are used in practice are (4, 0) [Apple, 2019], (3, 0) [Qin et al., 2016], or approximately (1.5, 0) [Fanti et al., 2015]. We also remark that unlike the usual definition of differential privacy, which considers privacy of databases containing records from multiple individuals, Definition 2.1 is specific to a single entry in a database and, as such, is a local definition of differential privacy [Dwork and Roth, 2014, Duchi et al., 2018]. The function M that we will use in our proposed design is based on the forced randomized response (FRR) of Fox and Tracy [1984]. Broadly speaking, in an FRR, individuals use a randomization device, typically a six-sided die, and based on the result of the randomization, some are instructed or "forced" to give a specific type of response, say 0 or 1, regardless of their original response Y i , while others are instructed to give their original response Y i . For example, if individual i's die roll lands on 1, individual i is instructed to report a 0 to the investigator, i.e., Ỹ i = 0, and if the die roll lands on 6, individual i is instructed to report a 1 to the investigator, i.e., Ỹ i = 1. If the die roll lands on anything but 1 or 6, individual i is instructed to provide the original, true response, Ỹ i = Y i .
A bit more formally, let P i ∈ {0, 1, 2} represent the different instructions, where P i = 0 represents 'report 0', P i = 1 represents 'report 1', and P i = 2 represents 'report the original response', occurring with probabilities r 0 , r 1 , and 1 − r 0 − r 1 , respectively. An FRR map M is then defined as M(Y i ) = 0 if P i = 0, M(Y i ) = 1 if P i = 1, and M(Y i ) = Y i if P i = 2. A critical part of an FRR is that investigators are unaware of the result of the die roll; only the participant knows the result. Hence, investigators have no idea whether Ỹ i from individual i is the true Y i or one of the forced responses 0 or 1. To put it differently, an FRR protects the privacy of individual i's response through plausible deniability of any particular response. Nevertheless, investigators can choose the die probabilities, represented by r 0 and r 1 , and these values affect both the privacy loss and the efficiency of estimators of τ; in Section 3.2, we present a formula relating the FRR parameters r 0 , r 1 to the privacy loss parameter ε in differential privacy. For simplicity, we will set r 1 = r 0 = r going forward so that a single parameter r parametrizes the privacy-preserving M. Suppose an individual is generally wary of sharing their response to treatment and does not trust the privacy-preserving nature of M. For example, a participant may feel that the die in an FRR is rigged and produce a response Ỹ i that deviates from the original FRR protocol. Or, a participant, after being instructed by the die roll to report the true response Y i , may feel uncomfortable doing so and instead report the opposite, say Ỹ i = 1 − Y i , to the investigator. These are instances of noncompliance with the experimental protocol, and our goal is to still estimate causal effects in the presence of it. To achieve this goal, we use the concept of "cheaters" from Clark and Desharnais [1998] in psychometric testing. Broadly speaking, cheaters are those who deviate from the experimental protocol laid out by investigators, whereas non-cheaters/honest individuals are those who follow the experimental protocol.
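The FRR mechanism just described can be sketched as follows; this is a minimal illustration that replaces the six-sided die with a single uniform draw (with r0 = r1 = 1/6 mimicking the die example above), not code from the paper.

```python
import random

def frr(y, r0=1/6, r1=1/6, rng=random):
    """Forced randomized response: report 0 with probability r0,
    report 1 with probability r1, and report the true response y
    with probability 1 - r0 - r1."""
    u = rng.random()
    if u < r0:            # P_i = 0: forced to report 0
        return 0
    if u < r0 + r1:       # P_i = 1: forced to report 1
        return 1
    return y              # P_i = 2: report the original response

# Plausible deniability: even a participant whose true response is 1
# reports 0 about r0 of the time, so no single report reveals Y_i.
random.seed(1)
reports = [frr(1) for _ in range(60000)]
share_forced_zero = sum(rep == 0 for rep in reports) / len(reports)
```

Since the investigator never sees the draw u, any reported value could be a forced response, which is exactly the plausible-deniability property described above.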
For example, an individual who reports '0' (i.e., Ỹ i = 0) even though the FRR prompt was to report '1' is a cheater; see Table 1 in the supplementary materials for more examples. Clark and Desharnais [1998] showed that if all cheaters are assumed to generate the same response to the investigator, say all cheaters only report Ỹ i = 0, we can detect the presence of cheaters by (a) randomly splitting the study sample into two pieces, say n individuals split into samples of size n 1 and n 2 , n 1 + n 2 = n, and (b) comparing appropriate statistics between the two subsamples. Our work extends this idea of using sample splitting to detect cheaters by relaxing the requirement that all cheaters generate the same response. We achieve this by using a different statistic to compare between the two subsamples, namely a variant of the compliance rate in instrumental variables [Angrist et al., 1996]; see Section 3.3 for details. Combining the aforementioned ideas on differential privacy and cheaters, we propose a robust and differentially private RCT, which we call a RP-RCT; it is laid out in Figure 1. In addition to the observed variables, a RP-RCT involves two unobserved variables: the cheater status C i , equal to 1 if i is a cheater and 0 otherwise, and the result of the randomization device behind the FRR, P i ; these are denoted as unobserved variables in Figure 1. In particular, the variable C i can be thought of as a latent characteristic of individual i that is never observed by the investigator; the investigator only sees Ỹ i without knowing whether it came from a cheater (i.e., C i = 1) or not (i.e., C i = 0).

[Figure 1: Flowchart of a RP-RCT. Enroll individuals, collect X, and fix δ. Sample splitting: fix r, r' for the two FRRs and randomly assign each subject to one of two subsamples, S, with equal probability. Privatizing (with parameter r or r' depending on the subsample): instruct the subject to run the randomization device and note its outcome, P, without sharing it with the investigator. Random assignment of treatment, A: randomly assign the subject to treatment A = 1 with probability δ or control A = 0 with probability 1 − δ. Observed variables: (Ỹ, A, S, X). Compared to a RCT, a RP-RCT adds two additional steps, sample splitting and FRR; sample splitting makes our design robust to responses from cheaters and FRR privatizes responses.]

We make some remarks about the experimental protocol. First, compared to a traditional RCT, a RP-RCT adds two additional steps, the sample-splitting step to be robust against cheaters and the FRR step to privatize the study participant's response. Here, for simplicity, the sample-splitting step creates two equal, random subsamples of individuals, but this can be relaxed at the expense of additional notation. Second, a RP-RCT satisfies assumptions (A2) and (A3) because the treatment A i is still randomly assigned to individuals with probability δ. Also, so long as the treatment is well-defined and does not cause interference (i.e., does not violate SUTVA), all of assumptions (A1)-(A3) are satisfied by a RP-RCT. Third, while we only consider the FRR as our M, we can replace M with another differentially private algorithm. Fourth, because the treatment A i is randomized, the pre-treatment covariates X i can take on any value (e.g., missing, censored, etc.) without impacting the identification strategy. The following theorem shows that among non-cheaters, the privatized response Ỹ i generated from a RP-RCT is (ln(2/(r + r') − 1), 0)-differentially private. Theorem 1 (Differential Privacy of RP-RCT). Consider a non-cheater's true response Y i .
Then, a RP-RCT which generates his/her privatized response Ỹ i is (ln(2/(r + r') − 1), 0)-differentially private. In words, Theorem 1 states that regardless of the non-cheater's true response Y i , the privatized outcome Ỹ i generated from a RP-RCT is private up to some privacy loss parameter. The exact privacy loss depends on the FRR parameters r, r', and investigators can choose r, r' to achieve the desired level of privacy loss. Also, Theorem 1 does not make any claims about differential privacy for cheaters. This is because cheaters can choose to report any response Ỹ i that may or may not be differentially private. For example, if a cheater provides the result of a random coin flip as Ỹ i irrespective of his/her true response Y i , then his/her response achieves perfect differential privacy, i.e., ε = 0. However, if a cheater always provides the opposite of his/her true response to potentially hide it, i.e., Ỹ i = 1 − Y i , then his/her data, despite his/her best intentions, is never differentially private. Without making an assumption about cheaters' behaviors, we cannot make any guarantees about their responses' differential privacy. To lay out the identification strategy with a RP-RCT, we assume that the RP-RCT satisfies the following assumptions. (A4) Random sample splitting: S i ⊥ (Y i (1), Y i (0), X i , C i ). (A5) Random FRR device: P i ⊥ (Y i (1), Y i (0), X i , C i ) with r, r' ∈ [0, 0.5) and r ≠ r'. (A6) Extended randomization of treatment: A i ⊥ (Y i (1), Y i (0), X i , C i , S i , P i ). Assumption (A4) states that the two subsamples in a RP-RCT were split randomly. Assumption (A5) states that the randomization device in the FRR (i.e., the die roll) is random and that the two FRRs used in the two subsamples are different. Assumption (A6) is a re-iteration of the treatment randomization assumption (A2), except we now include the new variables introduced as part of a RP-RCT: S i , P i , and C i . Note that assumptions (A4)-(A6) are satisfied by the design of a RP-RCT. Let λ ∈ [0, 1) be the proportion of cheaters in the population, i.e., Pr(C i = 1) = λ.
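The privacy guarantee in Theorem 1 can be checked numerically. The sketch below computes the privacy loss ε = ln(2/(r + r') − 1) and verifies, by enumerating the exact reporting probabilities (averaging over the equal-probability subsample assignment and the FRR with r0 = r1 = r_s), that the likelihood ratio of any reported value under Y = 1 versus Y = 0 never exceeds exp(ε). The specific values r = 1/6 and r' = 1/4 are hypothetical.

```python
import math

def privacy_loss(r, r_prime):
    """Privacy loss of a RP-RCT for a non-cheater: ln(2/(r + r') - 1)."""
    return math.log(2.0 / (r + r_prime) - 1.0)

def report_prob(y_true, y_report, r, r_prime):
    """Pr(reported = y_report | true = y_true), averaging over the random
    subsample assignment (probability 1/2 each) and an FRR with r0 = r1 = r_s."""
    p = 0.0
    for r_s in (r, r_prime):
        forced = r_s                                   # forced to report y_report
        honest = (1 - 2 * r_s) * (y_true == y_report)  # honest report matches
        p += 0.5 * (forced + honest)
    return p

r, r_prime = 1/6, 1/4
eps = privacy_loss(r, r_prime)
ratios = [report_prob(1, y, r, r_prime) / report_prob(0, y, r, r_prime) for y in (0, 1)]
worst = max(max(rho, 1 / rho) for rho in ratios)  # worst-case likelihood ratio
```

For these values the bound is attained exactly (worst = exp(ε)), illustrating that the stated ε is tight for this mechanism.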
We now state assumptions about the cheater status C i . While plausible in many settings, these assumptions may not always be satisfied by the design of a RP-RCT. (A7) Cheater's response: for individuals with C i = 1, the reported response Ỹ i does not depend on the treatment assignment A i . (A8) Proportion of non-cheaters: 0 < Pr(C i = 0) = 1 − λ. Assumption (A7) states that a cheater gives the same response to the investigator regardless of whether he/she was randomized to treatment or control; note that assumption (A7) does not say that all cheaters produce the same response (i.e., the assumption underlying Clark and Desharnais [1998]). Assumption (A7) is plausible if the treatment assignment A i is blinded so that the participant does not know which treatment he/she is receiving and thus cannot use this information to change his/her final response Ỹ i to the investigator. Also, assumption (A7) still allows a cheater to use his/her original response Y i , potential outcomes Y i (1), Y i (0), or pre-treatment characteristics X i to tailor his/her reported response Ỹ i . For example, if a cheater reports a constant value, say Ỹ i = 0, or the opposite of his/her true response, say Ỹ i = 1 − Y i , assumption (A7) will still hold. Or, if some cheaters' reported responses depend on unmeasured, pre-treatment characteristics, say, cheaters who are more privacy-conscious report Ỹ i = 0 while less privacy-conscious cheaters report a mixture of 0 or Y i , assumption (A7) will hold. However, if a cheater uses the treatment assignment to change his/her response Ỹ i , say reporting Ỹ i = 1 under treatment and Ỹ i = 0 under control, assumption (A7) is violated. Assumption (A8) states that not all participants in the study are cheaters. If assumption (A8) does not hold and every participant is a cheater, we cannot identify any treatment effect. Also, in Section 3.4, we present a way to assess assumption (A8) with the observed data by estimating the proportion of cheaters.
Overall, so long as the treatment is blinded and there is at least one non-cheater, a RP-RCT plausibly satisfies assumptions (A1)-(A8) by design. We now show that the data from a RP-RCT can identify the ATE among non-cheaters.

Theorem 2 (Identification with a RP-RCT). Suppose a RP-RCT has FRR parameters r, r' ∈ [0, 0.5) set by the investigator and the observed data is (Ỹ i , A i , S i , X i ). Under assumptions (A1)-(A8), the ATE among non-cheaters, denoted as τ H = E[Y i (1) − Y i (0) | C i = 0], is identified by τ H = (E[Ỹ i | A i = 1] − E[Ỹ i | A i = 0]) / ((1 − λ)(1 − r − r')). Additionally, the proportion of cheaters λ is identified from the observed data; the exact formula is given in Section B of the supplementary materials.

Theorem 2 shows that our new design can identify the ATE among non-cheaters by taking the difference in the averages of the privatized outcomes Ỹ i between treated and control units, re-weighted by the FRR parameters r, r' and the proportion of cheaters λ. If there are no cheaters in the population, Theorem 2 implies τ = τ H and we can identify the ATE for the entire population. In contrast, if everyone is a cheater, we cannot identify the treatment effect; intuitively, if everyone is a cheater, disregards the FRR, and reports Ỹ i = 0, it would be impossible to know the effect of the treatment on the response. Generalizing this intuition, we can only identify the treatment effect for the subpopulation of individuals who are non-cheaters, even if the investigator does not know who the cheaters and non-cheaters are. We remark that this result is similar in spirit to the local average treatment effect in the noncompliance literature [Angrist et al., 1996], where under noncompliance, only the treatment effect for the subpopulation of compliers can be identified from data. Theorem 2 also shows that we can identify the proportion of cheaters. While the exact formula is complex, roughly, we measure the excess proportion of privatized outcomes relative to what would be expected had everyone followed the FRR, and use additional moment conditions generated from sample splitting to identify λ; see Section B of the supplementary materials for details. Overall, Theorems 1 and 2 show that we can identify the treatment effect with privatized outcomes Ỹ i , some of which may be contaminated by cheaters.
The proposed design has some key parameters, r and r', that govern the privacy loss parameter ε. They also affect estimation and testing of τ H , which we discuss below. Using the identification result in Section 3.3, we can construct estimators of τ H by replacing the expectations in Theorem 2 with their sample counterparts. The resulting estimator τ̂ H,Diff is essentially the difference-in-means estimator in a RCT re-weighted by the estimated proportion of non-cheaters and the privacy-preserving map, i.e., divided by (1 − λ̂)(1 − r − r'). Also, an estimate λ̂ of the proportion of cheaters can be obtained by replacing the expectations in Theorem 2 with their sample counterparts. Theorem 3 shows that τ̂ H,Diff is a consistent and asymptotically Normal estimator of τ H .

Theorem 3 (Asymptotic Properties of τ̂ H,Diff ). Suppose the observed data (Ỹ i , A i , S i ) are i.i.d. and generated from a RP-RCT. Then, √n(τ̂ H,Diff − τ H ) converges in distribution to a Normal with mean zero and variance V H,Diff . Also, V H,Diff can be consistently estimated from the observed data; denote this estimate by V̂ H,Diff .

Theorem 3 can be used as a basis to construct 1 − α confidence intervals and p-values for testing the null hypothesis H 0 : τ H = τ 0 . For example, a Wald-style test statistic for H 0 would be t = √n(τ̂ H,Diff − τ 0 )/√V̂ H,Diff , and one would reject the null in favor of a two-sided alternative at level α if |t| exceeds z 1−α/2 , where z 1−α/2 is the 1 − α/2 quantile of the standard Normal distribution. We can also construct a Wald-based (1 − α) × 100% two-sided confidence interval for τ H as (τ̂ H,Diff − z 1−α/2 √(V̂ H,Diff /n), τ̂ H,Diff + z 1−α/2 √(V̂ H,Diff /n)). Note that, similar to the usual difference-in-means estimator in Section 2.1, practitioners could also use the bootstrap to estimate the standard error of τ̂ H,Diff and its associated confidence interval.

In this section, we compare a RP-RCT to a traditional RCT as a way to study the cost of guaranteeing differential privacy on statistical efficiency. To begin, suppose λ = 0 so that both τ̂ H,Diff and τ̂ Diff estimate the same parameter τ.
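To illustrate the re-weighted difference-in-means estimator, the following simulation generates data from a RP-RCT with no cheaters (so λ = 0 and λ̂ is set to zero) and recovers the true effect by dividing the difference in means of the privatized outcomes by (1 − λ̂)(1 − r − r'). All design parameters and outcome probabilities below are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2021)
n, delta = 20000, 0.5
r, r_prime = 1/6, 1/4  # FRR parameters for the two subsamples (hypothetical)

# Potential outcomes: Pr(Y(1)=1) = 0.7, Pr(Y(0)=1) = 0.4, so tau_H = tau = 0.3.
A = rng.binomial(1, delta, size=n)
Y = rng.binomial(1, np.where(A == 1, 0.7, 0.4))

# Sample splitting: subsample S=1 uses parameter r, subsample S=2 uses r'.
S = rng.integers(1, 3, size=n)
r_s = np.where(S == 1, r, r_prime)

# FRR: forced 0 w.p. r_s, forced 1 w.p. r_s, honest report otherwise.
U = rng.random(size=n)
Ytilde = np.where(U < r_s, 0, np.where(U < 2 * r_s, 1, Y))

# Re-weighted difference in means; lambda_hat = 0 since there are no cheaters here.
lam_hat = 0.0
tau_hat = (Ytilde[A == 1].mean() - Ytilde[A == 0].mean()) / (
    (1 - lam_hat) * (1 - r - r_prime))
```

The estimate lands close to the true effect of 0.3, consistent with the identification result: pooled over the two subsamples, the difference in means of the privatized outcomes shrinks by exactly the factor (1 − r − r').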
The following theorem shows that τ̂ H,Diff is never as asymptotically efficient as τ̂ Diff ; in short, there is a statistical cost to using a RP-RCT to guarantee differential privacy of an individual's response.

Theorem 4 (Relative Asymptotic Efficiency of τ̂ H,Diff and τ̂ Diff ). For any ε ∈ (0, ∞) and λ = 0, we have Var(τ̂ Diff )/Var(τ̂ H,Diff ) < 1.

The relative efficiency between the two estimators, measured by Var(τ̂ Diff )/Var(τ̂ H,Diff ), is determined by ε, which in turn is governed by the FRR parameters r, r'. In particular, as individuals' responses become less private through an increase in the privacy loss ε, the relative efficiency approaches 1. In fact, only when the privacy loss approaches infinity, i.e., ε → ∞, does the relative efficiency equal 1; thus, τ̂ H,Diff will never be as efficient as τ̂ Diff as long as we want to guarantee some amount of data privacy. Investigators can use the formula in Theorem 4 to assess what value of the privacy loss ε works best in their own studies. In particular, we recommend investigators specify ε based on (a) their tolerance for loss in efficiency in exchange for more privacy, (b) τ 0 and τ 1 , which may be informed by subject-matter experts during the planning stage of the experiment, and/or (c) recommended data privacy standards from relevant regulatory bodies. Similar to a RCT, suppose a RP-RCT collected pre-treatment covariates X i , which may be missing, contaminated, and/or corrupted. We propose to use the pre-treatment covariates to develop a more efficient estimator without incurring additional bias by using a doubly robust estimator [Bang and Robins, 2005]. Formally, let f 0 (x i ) and f 1 (x i ) be the postulated models for the true outcome regressions E[Ỹ i | A i = 0, X i = x i ] and E[Ỹ i | A i = 1, X i = x i ], respectively. For simplicity, we assume that these functions are fixed, but our result below holds if these functions were estimated at fast enough rates, say those based on parametric models.
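The efficiency cost in Theorem 4 can also be seen empirically. The Monte Carlo sketch below, under a hypothetical binary-outcome setup with λ = 0, compares the sampling variance of the standard RCT estimator (which observes Y directly) with that of the RP-RCT estimator (which observes only the FRR report); the estimated relative efficiency Var(τ̂ Diff )/Var(τ̂ H,Diff ) comes out below 1, reflecting the price of privacy.

```python
import numpy as np

rng = np.random.default_rng(7)
n, delta = 2000, 0.5
r, r_prime = 1/6, 1/4        # hypothetical FRR parameters
scale = 1 - r - r_prime      # re-weighting factor when lambda = 0

def one_trial():
    # Binary outcomes with true effect tau = 0.3 and no cheaters.
    A = rng.binomial(1, delta, size=n)
    Y = rng.binomial(1, np.where(A == 1, 0.7, 0.4))
    # Standard RCT estimator sees the true response Y.
    tau_rct = Y[A == 1].mean() - Y[A == 0].mean()
    # RP-RCT estimator sees only the FRR-privatized report.
    r_s = np.where(rng.integers(1, 3, size=n) == 1, r, r_prime)
    U = rng.random(size=n)
    Yt = np.where(U < r_s, 0, np.where(U < 2 * r_s, 1, Y))
    tau_rp = (Yt[A == 1].mean() - Yt[A == 0].mean()) / scale
    return tau_rct, tau_rp

draws = np.array([one_trial() for _ in range(2000)])
var_rct, var_rp = draws.var(axis=0)
rel_eff = var_rct / var_rp   # < 1: privacy costs statistical efficiency
```

Both estimators are centered at the true effect, but the RP-RCT estimator is noisier because the FRR injects randomness and the re-weighting inflates that noise by 1/(1 − r − r').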
Consider the covariate-adjusted estimator τ H,Cov of τ H. The following theorem shows that τ H,Cov is a consistent, asymptotically Normal estimator of τ H even if f 1 and f 0 are mis-specified.

Theorem 5 (Consistency and Asymptotic Normality of τ H,Cov). Suppose the same assumptions in Theorem 3 hold. Then, for any fixed working models f 1 and f 0 of the privatized outcomes, τ H,Cov is consistent and asymptotically Normal, and its asymptotic variance V H,Cov can be consistently estimated.

Our application asks whether instructor-present lecture videos (i.e., treatment) provide a better learning experience for students compared to instructor-absent online lectures (i.e., control); see Figure 2 for an example. Some prior works [Wilson et al., 2018, Kizilcec et al., 2014] found no evidence that instructor-present lecture videos had a significant impact on student learning in terms of attention and comprehension. However, other works [Pi and Hong, 2016, Wang et al., 2020] show the opposite: that instructor-present lecture videos enhanced student learning. Regardless, the participant data from these works are not publicly available due to the sensitive nature of students' educational data. Additionally, there was concern that students may be less willing to provide their true attention and retention rates in the courses when asked by the Department. For example, some students might not be comfortable admitting to not paying attention to online lectures and simply lie to the investigator, leading to potentially biased results. Compared to a RCT, a RP-RCT provides several remedies for these issues. First, students' data is guaranteed to be differentially private, which may encourage more honest participation. Second, even if some students remain dishonest and are cheaters, a RP-RCT still provides a robust estimate of the treatment effect.
Third, the data can be shared publicly while students' data privacy is still preserved via differential privacy; as mentioned earlier, the experimental protocol, including sharing students' response data, was approved by the Education and Social/Behavioral Science IRB of the University of Wisconsin-Madison.

The study population consisted of students enrolled in introductory statistics classes at the University of Wisconsin-Madison during the Spring 2021 semester. Electronic, informed consent was obtained from all participants before enrollment. Once a student gave consent, the study collected the following pre-treatment covariates: gender, race/ethnicity, year in college, major or field of study, prior subject matter knowledge, previous exposure to and grades in statistics/mathematics/computer science classes, self-rated interest in statistics, proficiency in English, amount of experience with video lectures in the past or current semester, and preference on online lecture format. Students had the option to not provide answers to any of the pre-treatment covariates. Afterwards, students were randomly placed into one of two subgroups (S i = 1 or S i = 2) and, within each subgroup, they were randomly assigned to treatment or control. The control arm used a narrated 'Instructor Absent' (IA) video format where students see the lecture slides and hear an audio narration of the lecture from the instructor. The treatment arm was identical to the control arm except that it used an 'Instructor Present' (IP) video format where the instructor's face was embedded in the upper-right corner of the lecture video. Both lecture videos were 13 minutes long and introduced identical statistical concepts, specifically about RCTs. We remark that RCTs were not covered in the classes in which the study was conducted.

Table 1: Questions Concerning Four Areas of Student Learning. Students were asked to answer "yes" or "no" depending on whether they agreed or disagreed with the statements below.

  Attention: "I found it hard to pay attention to the video."
  Retention: "I was unable to recall the concepts while attempting the followup quiz."
  Judgement of Learning: "I don't feel that I learnt a great deal by watching the video."
  Comprehension: "I found the topic covered in the video to be hard."

Also, the study used an FRR with r 0 = 0, r 1 = 0.1040 and another FRR with r 0 = 0, r 1 = 0.1667, resulting in a privacy loss of ε = 2. This value was based on our own preference toward data privacy at the expense of efficiency, where we were willing to tolerate roughly a 50% increase in standard error under a RP-RCT compared to a RCT, and on consultation with the University's IRB, which, among other things, gave approval to release the student-level data to the public for future replication and analysis. After students watched either the IA or IP lecture video, they were asked a series of questions, notably on four areas of student learning (i.e., Attention, Retention, Judgement of Learning, and Comprehension) used by previous works [Kizilcec et al., 2014, Pi and Hong, 2016, Wang and Antonenko, 2017, Wilson et al., 2018]. All four outcomes were 'yes/no' questions; the exact wording of the questions is in Table 1. For our working models of the privatized outcomes f 1 and f 0, we used logistic regression models that minimized the Akaike information criterion [Bozdogan, 1987]; see Section C of the supplementary materials for additional details. We remark that, as part of the RP-RCT protocol, students were prompted to 'roll a die' using an online die roll (i.e., the FRR device). Students were allowed to roll the die only once, and the resulting roll was visible to the student only. Based on the outcome of the roll, students were asked to answer the four questions in Table 1 using the FRR prompt presented in panel C of Figure 2; students had to roll the die before being presented with the four questions.
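As a rough illustration of the FRR protocol above, the sketch below simulates a forced-"yes" device with r 0 = 0 and r 1 = 1/6 ≈ 0.1667 (mimicking one forced face of a fair die) and shows how a true proportion of "yes" answers can be recovered by inverting the privacy-preserving map E[report] = (1 − r 0 − r 1)p + r 1. The function names and the die encoding are our own illustrative assumptions, not the study's exact implementation, and the sketch ignores cheaters (λ = 0).

```python
import random

def frr_answer(true_yes, r0=0.0, r1=1/6, rng=random):
    """One pass through a forced randomized response (FRR) device:
    with probability r1 the device forces a 'yes', with probability r0
    it forces a 'no', and otherwise the truthful answer passes through."""
    u = rng.random()
    if u < r1:
        return 1          # forced "yes"
    if u < r1 + r0:
        return 0          # forced "no"
    return 1 if true_yes else 0

def recover_proportion(reports, r0=0.0, r1=1/6):
    """Invert the privacy-preserving map: since
    E[report] = (1 - r0 - r1) * p + r1, an unbiased estimate of the
    true proportion p is (mean(report) - r1) / (1 - r0 - r1)."""
    p_obs = sum(reports) / len(reports)
    return (p_obs - r1) / (1.0 - r0 - r1)

rng = random.Random(2021)
truth = [1] * 300 + [0] * 700                  # true "yes" rate 0.30
reports = [frr_answer(t, rng=rng) for t in truth]
p_hat = recover_proportion(reports)            # estimates the true 0.30
```

This inversion is the same re-weighting idea behind τ H,Diff, which additionally rescales by the estimated proportion of non-cheaters.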
ance between the IP and IA subgroups and, as expected, we found no differences between the two groups in terms of their baseline pre-treatment covariates.

Our results suggest that cheating varies with the question. For example, the estimated proportions of cheaters for the Attention and Judgement of Learning questions were smaller (24% and 32%, respectively), while they were larger for the Retention and Comprehension questions (40% and 38%, respectively). These differences may suggest that students are more apprehensive about sharing outcomes related to their learning abilities (i.e., Retention, Comprehension) than outcomes related to instruction (e.g., the instructor's ability to engage students and transfer knowledge). Our results largely agree with previous works on online video lectures. For example, our results and those by Kizilcec et al. [2014], Pi and Hong [2016], and Wang et al. [2020] agree that IP lectures receive considerably more attention than IA lectures. Also, our findings on retention, judgement of learning, and comprehension match those in Wang and Antonenko [2017], Wang et al. [2020], and Wilson et al. [2018]. However, we remark that there is work [Kizilcec et al., 2014] that suggests the opposite of what we find on retention. Finally, unlike these works, all of the student-level data and code is publicly available for replication and future analysis, especially if investigators want to combine this data with future datasets to boost power in related evaluations of online video lectures.

We propose a new experimental design to evaluate the effectiveness of a program, policy, or treatment on an outcome that may be sensitive, with a particular focus on online education programs where students' response data are often sensitive. Our design, a RP-RCT, has differential privacy guarantees while also allowing estimation of treatment effects.
A RP-RCT also accommodates cheaters who may not trust the privacy-preserving nature of our design and who provide arbitrary responses to further protect their privacy. We provide two consistent, asymptotically Normal estimators, one of which allows for covariate adjustment. We also assess the trade-off between differential privacy and statistical efficiency. We conclude by using the RP-RCT to evaluate different types of online video lectures in the Department of Statistics at the University of Wisconsin-Madison and find that our results largely agree with existing results on online video lectures, while preserving students' data privacy and allowing this data to be shared for future replication.

References

Teachers and student achievement in the Chicago public high schools
Identification of causal effects using instrumental variables
HIPAA regulations: a new era of medical-record privacy
Apple differential privacy technical overview
Doubly robust estimation in missing data and causal inference models
The ethics of online research with unsuspecting users: from A/B testing to C/D experimentation
A 61-million-person experiment in social influence and political mobilization
Model selection and Akaike's information criterion (AIC): the general theory and its analytical extensions
Honest answers to embarrassing questions: detecting cheating in the randomized response model
Minimax optimal procedures for locally private estimation
Differential privacy
The algorithmic foundations of differential privacy
Differential privacy for statistics: what we know and what we want to learn
An introduction to the bootstrap
Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security (CCS '14)
Limiting privacy breaches in privacy preserving data mining
Building a RAPPOR with the unknown: privacy-preserving learning of associations and data dictionaries
Measuring associations with randomized response
Reform of EU data protection rules
Causal Inference: What If
Netflix Prize update. The Netflix Blog
Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction
Showing face in video instruction: effects on information retention, visual attention, and affect
t-closeness: privacy beyond k-anonymity and l-diversity
Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study
ℓ-diversity: privacy beyond κ-anonymity
Robust de-anonymization of large sparse datasets
Sur les applications de la theorie des probabilites aux experiences agricoles: essai des principes
Office of the Federal Register and R. Administration. 45 CFR 46: protection of human subjects
Learning process and learning outcomes of video podcasts including the instructor and PPT slides: a Chinese case
Heavy hitter estimation over set-valued data with local differential privacy. Association for Computing Machinery
The central role of the propensity score in observational studies for causal effects
Estimating causal effects of treatments in randomized and nonrandomized studies
Randomization analysis of experimental data: the Fisher randomization test comment
Protecting respondents' identities in microdata release
Simple demographics often identify people uniquely
k-anonymity: a model for protecting privacy
Covariate adjustment for two-sample treatment comparisons in randomized clinical trials: a principled yet flexible approach
Instructor presence in instructional video: effects on visual attention, recall, and perceived learning
Does visual attention to the instructor in online video affect learning and learner perceptions? An eye-tracking analysis. Computers and Education, 146:103779, 2020
Instructor presence effect: liking does not always lead to learning
Improving efficiency of inferences in randomized clinical trials using auxiliary covariates