title: Incentivizing Compliance with Algorithmic Instruments
authors: Ngo, Daniel; Stapleton, Logan; Syrgkanis, Vasilis; Wu, Zhiwei Steven
date: 2021-07-21

Randomized experiments can be susceptible to selection bias due to potential non-compliance by the participants. While much of the existing work has studied compliance as a static behavior, we propose a game-theoretic model to study compliance as a dynamic behavior that may change over time. In rounds, a social planner interacts with a sequence of heterogeneous agents who arrive with their unobserved private type that determines both their prior preferences across the actions (e.g., control and treatment) and their baseline rewards without taking any treatment. The planner provides each agent with a randomized recommendation that may alter their beliefs and their action selection. We develop a novel recommendation mechanism that views the planner's recommendation as a form of instrumental variable (IV) that only affects an agent's action selection, but not the observed rewards. We construct such IVs by carefully mapping the history -- the interactions between the planner and the previous agents -- to a random recommendation. Even though the initial agents may be completely non-compliant, our mechanism can incentivize compliance over time, thereby enabling the estimation of the treatment effect of each treatment and minimizing the cumulative regret of the planner, whose goal is to identify the optimal treatment.

In many applications, estimating the causal effect of a treatment or intervention is at the heart of a decision-making process. Examples include a study on the effect of a vaccine on immunity, an assessment of the effect of a training program on workers' efficiency, and an evaluation of the effect of a sales campaign on a company's profit. Many studies on causal effects rely on randomized experiments, which randomly assign each individual in a population to a treatment group or a control group and then estimate the causal effects by comparing the outcomes across groups. However, in many real-world domains participation is voluntary, which makes such experiments susceptible to non-compliance. For example, people may turn down a vaccine or a drug when they are assigned to receive the treatment Wright [1993]. Another example is a randomized evaluation of the Job Training Partnership Act (JTPA) training program Bloom et al. [1997], where only 60 percent of the workers assigned to be trained chose to receive training, while roughly 2 percent of those assigned to the control group chose to receive training. In many cases, non-compliance can cause selection bias: for example, those who choose to receive the drug or vaccine in a randomized trial tend to be healthier, and those who join the training program may be more productive to begin with. Although non-compliance in randomized experiments has been well studied in observational settings (see e.g. Angrist and Pischke [2008]), there has been little work that studies and models how compliance varies over time. In reality, however, participants' compliance behaviors may not be static: they may change according to their time-varying beliefs about the treatments. If the outcomes from the previous trials suggest that the treatments are effective, then the participants may become more willing to accept the recommendation.
For example, those initially wary of a new vaccine may change their mind once they see others take it without experiencing negative symptoms. 1 Motivated by this observation, this paper studies the design of dynamic trial mechanisms that map history -- the observations from previous trials -- to a treatment recommendation and gradually incentivize compliance over time.

In this paper, we introduce a game-theoretic model to study the dynamic (non-)compliance behavior that arises from changing beliefs. In our model, there is a collection of treatments such that each treatment j is associated with an unknown treatment effect θ_j. We study an online learning game, in which a set of T myopic agents arrive sequentially over T rounds. Each agent t has a private unobserved type u_t, which determines their heterogeneous prior beliefs about the treatment effects. Each agent's goal is to select a treatment j that maximizes the reward θ_j + g_t^{(u_t)}, where g_t^{(u_t)} denotes the type-dependent baseline reward (without taking any treatment). We introduce a social planner whose goal is to estimate the effects of the underlying treatments and incentivize the agents to select the treatment that maximizes the long-term cumulative reward. Upon the arrival of each agent t, the planner provides the agent with a random treatment recommendation, which is computed by a policy that maps the history of interactions with the previous (t−1) agents to a recommendation. While agent t does not observe the previous history, they form a posterior belief over the treatment effects based on the recommendation and then select the action that maximizes their expected utility.

Under this model, we provide dynamic trial mechanisms that incentivize compliance over time and accurately estimate the treatment effects. The key technical insight is that the planner's random recommendation at each round can be viewed as an instrument that is independent of the agent's private type and only influences the observed rewards through the agent's choice of action. By leveraging this observation, we can perform instrumental variable (IV) regression to recover the treatment effects, as long as some of the agents are compliant with the recommendations. To create compliance incentives, our mechanisms leverage techniques from the literature on incentivizing exploration Mansour et al. [2015], Slivkins [2019]. The key idea is information asymmetry: since each agent does not directly observe the previous history, the planner has more information. By strategically mapping previous history to instruments, the planner can incentivize agents to explore treatments that are less preferred a priori.

We first focus on the binary action setting, where each agent can select treatment or control; we then extend our results to the k-treatment setting in Section 6. In the binary setting, we provide two mechanisms that handle two different initial compliance situations.

Complete non-compliance. In Section 3, we consider a setting where the planner initially has no information about the treatment effect θ, so all agents are initially non-compliant with the planner's recommendations. We provide Algorithm 1, which first lets the initial agents choose their preferred arms and then constructs recommendations that incentivize compliance among some of the later agents. This enables treatment effect estimation through IV regression.

Partial compliance.
In Section 4, we consider a setting where the planner has an initial estimate of the treatment effect θ (which may be obtained by running Algorithm 1), so they can incentivize some agents to comply. We provide Algorithm 2, which can be viewed as a variant of the active arm elimination bandit algorithm Even-Dar et al. [2006] that uses IV estimates to compare treatments. Samples collected by Algorithm 2 provide an increasingly accurate estimate θ̂ and incentivize more agents to comply over time.

Regret minimization. In Section 5, we show that if the planner first runs Algorithm 1 to obtain an initial treatment effect estimate θ̂ and then runs Algorithm 2 to amplify compliance, then the planner can achieve Õ(√T) regret w.r.t. the cumulative reward obtained by enforcing the optimal action for all agents. We then extend this regret minimization policy to the setting with k different treatments in Section 6.

Experiments. Lastly, in Section 7, we complement our theoretical results with numerical simulations, which allow us to examine how parameters in agents' prior beliefs influence the convergence rate of our recommendation algorithm.

We design mechanisms which strategically select instruments to incentivize compliance over time, so that we can apply tools from IV regression [Angrist and Krueger, 2001, Angrist and Imbens, 1995, Imbens et al., 1996] to estimate causal effects. Although IV regression is an established tool for estimating causal effects under non-compliance in observational studies (see e.g. Bloom et al. [1997], Angrist [2005]), our results deviate significantly from previous work due to the dynamic nature of our model. In particular, even if all agents are initially non-compliant, our mechanism can still incentivize compliance over time and estimate treatment effects -- whereas directly applying standard IV regression at the outset cannot.

Our work draws on techniques from the growing literature on incentivizing exploration (IE) Kremer et al. [2013], Mansour et al. [2015, 2016], Immorlica et al. [2019], Sellke and Slivkins [2020], where the goal is also to incentivize myopic agents to explore arms in a multi-armed bandit setting [Auer et al., 2002] using information asymmetry techniques from Bayesian persuasion Kamenica and Gentzkow [2011]. While our mechanisms are technically similar to those in Mansour et al. [2015], our work differs in several key aspects. First, prior work in IE -- including Mansour et al. [2015] -- does not capture selection bias and cannot be directly applied in our setting to recover causal effects. The mechanism in Mansour et al. [2015] aims to enforce full compliance (also called Bayesian incentive-compatibility), which requires all agents to follow the planner's recommendations; as a result, the mechanism needs to cater to the type of agents that is most difficult to convince. By contrast, our mechanism relies only on the compliance of a subset of agents in order to obtain accurate estimates. There has also been a line of work on mechanisms that incentivize exploration via payments Frazier et al. [2014], Chen et al. [2018], Kannan et al. [2017]. There are several known disadvantages of such payment mechanisms, including potentially high costs and ethical concerns Groth [2010]; see Slivkins [2017] for a detailed discussion. Thematically, our work relates to work on "instrument-armed bandits" by Kallus [2018], which also views arm recommendations as instruments.
However, the compliance behavior (modeled as a fixed stochastic mapping from instrument to treatments) is static in Kallus [2018]: it does not change over time -- even if the planner has obtained accurate estimates of the treatment effects. By comparison, since all agents eventually become compliant in our setting, we can achieve sublinear regret w.r.t. the best treatment, which is not achievable in a static compliance model.

We study a sequential game between a social planner and a sequence of agents over T rounds, where T is known to the social planner. We will first focus on the binary setting with a single treatment, and study the more general setting of k treatments in Section 6. In the binary setting, the treatment of interest has unknown effect θ ∈ [−1, 1]. In each round t, a new agent indexed by t arrives with their private type u_t drawn independently from a distribution U over the set of all private types U. Each agent t has two actions to choose from: taking the treatment (denoted as x_t = 1) and not taking the treatment, i.e. the control (denoted as x_t = 0). Upon arrival, agent t also receives an action recommendation z_t ∈ {0, 1} from the planner.

[Figure 1: Causal diagram. Recommendation z_t influences treatment x_t, which influences outcome y_t. The unobserved type u_t influences both treatment x_t and outcome y_t -- the latter via the error term g_t^{(u_t)}.]

After selecting an action x_t ∈ {0, 1}, agent t receives a reward y_t ∈ R, given by

    y_t = θ x_t + g_t^{(u_t)},

where g_t^{(u_t)} denotes the confounding baseline reward, which depends on the agent's private type u_t; each g_t^{(u_t)} is drawn from a sub-Gaussian distribution with a sub-Gaussian norm of σ_g. The social planner's goal is to estimate the treatment effect θ and maximize the total expected reward of all T agents.

History and recommendation policy. The interaction between the planner and agent t is given by the tuple (z_t, x_t, y_t). For each t, let H_t denote the history from round 1 to t, i.e. the sequence of interactions between the social planner and the first t agents, such that H_t := ((z_1, x_1, y_1), . . . , (z_t, x_t, y_t)). Before the game starts, the social planner commits to a recommendation policy π = (π_t)_{t=1}^T, where each π_t : ({0, 1} × {0, 1} × R)^{t−1} → ∆({0, 1}) is a randomized mapping from the history H_{t−1} to the recommendation z_t. Policy π is fully known to all agents.

Beliefs, incentives, and action choices. Each agent t knows their place t in the sequential game, and their private type u_t determines a prior belief P^{(u_t)}, which is a joint distribution over the treatment effect θ and the noisy error term g^{(u)}. Agent t selects the action that maximizes their expected utility under their posterior belief given the recommendation and the policy, that is,

    x_t = 1{ E_{π_t, P^{(u_t)}}[θ | z_t] ≥ 0 },   (2)

so agent t takes the treatment whenever the treatment effect is non-negative in expectation under their posterior belief, and takes control otherwise. An agent t is compliant with a recommendation z_t if the agent chooses the recommended action, i.e. x_t = z_t. We will also say that a recommendation is compliant if x_t = z_t. Figure 1 shows the causal diagram for this setting.

Unlike the standard multi-armed bandit and previous models of incentivizing exploration Mansour et al. [2015, 2016], the heterogeneous beliefs in our setting can lead to selection bias. For example, agents who are willing to take the treatment may also have higher baseline rewards. Thus, simply comparing rewards across the treatment group (x = 1) and the control group (x = 0) will lead to a biased estimate of θ. To overcome this selection bias, we will view the planner's recommendations as instruments and perform instrumental variable (IV) regression to estimate θ.
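To see the selection bias concretely, and how the planner's recommendation acts as an instrument, the following self-contained simulation is a minimal sketch with hypothetical parameter values (two types, a fixed complier fraction, Gaussian baseline noise; none of these numbers come from the paper). It compares the naive difference in group means against the Wald/IV estimate defined below, and evaluates an error bound of the form developed in Theorem 2.1.

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 0.5          # true treatment effect (hypothetical value)
T = 20_000           # number of agents
sigma_g = 1.0

# Two hypothetical types: type 1 ("always-takers") has a higher baseline reward
# and always takes the treatment; type 0 complies with the recommendation.
u = rng.integers(0, 2, size=T)                  # private types
g = rng.normal(loc=0.4 * u, scale=sigma_g)      # confounded baseline reward
z = rng.integers(0, 2, size=T)                  # randomized recommendation (instrument)

# Action choice: type-1 agents always take the treatment; type-0 agents follow z.
x = np.where(u == 1, 1, z)

# Observed reward, as in the model y_t = theta * x_t + g_t^{(u_t)}.
y = theta * x + g

# Naive estimate: difference in mean rewards between chosen groups (biased,
# because g is correlated with x through the private type u).
naive = y[x == 1].mean() - y[x == 0].mean()

# Wald / 2SLS estimate using the recommendation z as the instrument.
wald = ((y - y.mean()) * (z - z.mean())).sum() / ((x - x.mean()) * (z - z.mean())).sum()

# Error bound of the form developed below in Theorem 2.1 (assumed form):
# |theta_hat - theta| <= 2*sigma_g*sqrt(2*n*log(2/delta)) / |sum (x_i - xbar)(z_i - zbar)|.
delta = 0.05
denom = abs(((x - x.mean()) * (z - z.mean())).sum())
bound = 2 * sigma_g * np.sqrt(2 * T * np.log(2 / delta)) / denom

print(f"naive difference in means: {naive:.3f}")   # noticeably above theta
print(f"Wald/IV estimate:          {wald:.3f}")    # close to theta
print(f"error bound A(S, delta):   {bound:.3f}")
```

Because the complier fraction is constant in this sketch, the covariance term in the denominator grows linearly in the number of agents, so the bound shrinks at a 1/√n rate; this is exactly the regime that the mechanisms in the following sections are designed to reach.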
There are two criteria for recommendation z_t to be a valid instrument: (1) z_t influences the selection x_t, and (2) z_t is independent of the noisy baseline reward g^{(u)}. See Figure 1 for a graphical explanation of how these criteria will be satisfied in our setting. Criterion (2) follows because the planner chooses z_t randomly, independently of the type u_t. Our goal is to design a recommendation policy to meet criterion (1).

Wald Estimator. Our mechanism periodically solves the following IV regression problem: given a set S of n observations (x_i, y_i, z_i)_{i=1}^n, compute an estimate θ̂_S of θ. We consider the following two-stage least squares (2SLS) or Wald estimator (the two are equivalent for a binary treatment):

    θ̂_S = Σ_{i=1}^n (y_i − ȳ)(z_i − z̄) / Σ_{i=1}^n (x_i − x̄)(z_i − z̄),   (3)

where x̄, ȳ, z̄ denote the empirical means of the variables x_i, y_i, and z_i, respectively. While existing work on IV regression mostly focuses on asymptotic analyses, we provide a high-probability finite-sample error bound for θ̂_S, which is required by our regret analysis and may be of independent interest.

Theorem 2.1 (Finite-sample error bound for the Wald estimator). Let z_1, z_2, . . . , z_n ∈ {0, 1} be a sequence of instruments. Suppose there is a sequence of n agents such that each agent i has their private type u_i drawn independently from U, selects action x_i under instrument z_i, and receives reward y_i = θ x_i + g_i^{(u_i)}. Then the Wald estimator given by (3) satisfies, with probability at least 1 − δ for any δ ∈ (0, 1),

    |θ̂_S − θ| ≤ A(S, δ) := 2σ_g √(2n log(2/δ)) / | Σ_{i=1}^n (x_i − x̄)(z_i − z̄) |.

Proof Sketch. See Appendix B for the full proof. The bound follows by substituting our expressions for y_t, x_t into the IV regression estimator, applying the Cauchy-Schwarz inequality to split the bound into two terms (one dependent on {(g_t, z_t)}_{t=1}^{|S|} and one dependent on {(x_t, z_t)}_{t=1}^{|S|}), and bounding the second term with high probability.

Note that the error rate above depends on the covariance between the instruments z and the action choices x. In particular, when Σ_{i=1}^n (x_i − x̄)(z_i − z̄) is linear in n, the error rate becomes O(1/√n). In the following sections, we will provide mechanisms that incentivize compliance so that the instruments z are correlated with the actions x, enabling us to achieve such an error rate.

In this section, we present a recommendation policy that incentivizes compliance to enable IV estimation. We focus on a setting where the agents are initially completely non-compliant: since the planner has no information about the treatment effect in the initial rounds, the recommendations have no influence on agents' action selections. For simplicity of exposition, we will present our policy in a setting where there are two types of agents who are initially "always-takers" and "never-takers." As we show later in Section 6, this assumption can be relaxed to allow arbitrarily many types and also to allow all types to be "always-takers."

Formally, consider two types of agents i ∈ {0, 1}. For type i, let p_i be the fraction of agents in the population, P^{(i)} the prior beliefs, and g^{(i)} the baseline reward random variable. Agents of type 1 initially prefer the treatment and type 0 agents prefer control: their prior means for θ satisfy E_{P^{(1)}}[θ] > 0 > E_{P^{(0)}}[θ].

Our policy (Algorithm 1) splits into two stages. In the first stage, agents take their preferred action according to their prior beliefs: type 0 agents choose control and type 1 agents choose treatment. This allows us to collect at least ℓ_0 and ℓ_1 reward observations for x = 0 and x = 1, respectively. Let ȳ_0 and ȳ_1 denote the empirical average rewards for the two actions, respectively.
Note that since the baseline rewards g (u) are correlated with the selections x, the difference (ȳ 1 −ȳ 0 ) is a biased estimate for θ. In the second stage, we use this initial set of reward observations to construct valid instruments which incentivize agents of one of the two types to follow both control and treatment recommendations. Without loss of generality, we focus on incentivizing compliance among type 0 agents. Since they already prefer control, the primary difficulty here is to incentivize type 0 agents to comply with treatment recommendations. 2 We leverage the following observation: according to the prior P (0) of type 0 agents, there is a non-zero probability that the biased estimate (ȳ 1 −ȳ 0 ) is so large that θ must be positive. Formally, consider the following event for the average rewardsȳ 0 andȳ 1 : ] and σ g is the variance parameter for g (0) and g (1) . Assumption 3.1 (Knowledge Assumption for Algorithm 1). Within Section 3, the following are common knowledge among agents and planner: 3 1. Type 0 agents prefer control and type 1 agents prefer treatment. The fraction of agents of type 0 in the population is p 0 ≥ 0 and the fraction of type 1 is p 1 > 0. 2. Type 0's prior treatment effect mean µ (0) and the probability of event ξ, denoted P P (0) [ξ], over the prior P (0) of type 0. 4 Algorithm 1 Overcoming complete non-compliance Input: exploration probability ρ ∈ (0, 1), ∈ N (assume w.l.o.g. ρ ∈ N), minimum first stage samples 0 , 1 ∈ N, and failure probability δ < PP 0 [ξ]/8 1st stage: The first 2 max ( 0 /p 0 , 1 /p 1 ) agents are given no recommendation (they choose what they prefer) 2nd stage: Based on at least 0 control and 1 treatment samples collected in the first stage: 2 then a * = 1 else a * = 0 From the next agents, pick ρ agents uniformly at random to be in the explore set E for the next rounds do if agent t is in explore set E then z t = 1 else z t = a * We prove that Algorithm 1 is compliant for agents of type 0 as long as the exploration probability ρ is less than some constant that depends on prior P (0) . When an agent of type 0 is recommended treatment, they do not know whether this is due to exploration or exploitation. However, with small enough ρ, their expected gain from exploiting exceeds the expected loss from exploring. Hence, the agents comply with the recommendation and take treatment. 2 We could instead incentivize type 1 agents to take control. This would require 1) rewriting event ξ so it indicates that the expectation of θ over P (1) must be negative and 2) rewriting Algorithm 1 so that control is recommended when exploring. We cannot incentivize both types to comply at the same time. 3 Assumptions do not hold elsewhere, unless explicitly stated. 4 These assumptions (as well as Assumption 4.1 and Assumption 5.1) require only partial knowledge of the priors for compliant agents only. They are no more restrictive than the least restrictive (detail-free) assumptions of Mansour et al. [2015] . Lemma 3.2 (Type 0 compliance with Algorithm 1). Under Assumption 3.1, any type 0 agent who arrives in the last rounds of Algorithm 1 is compliant with any recommendation, as long as the exploration probability ρ satisfies where the event ξ is defined above in Equation (4). Proof Sketch. See Appendix C for the full proof. The proof follows by expressing the compliance condition for type 0 agents as different cases, depending on the recommendation. 
By keeping the exploration probability ρ small with regard to type 0 agent's prior-dependent probability P P (0) [ξ] and the conditional expected treatment effect E P (0) [θ|ξ], the expected gain from exploiting is greater than the expected loss from exploring. Hence, type 0 agents would comply with the recommendation. We further simplify the condition on exploration probability ρ by applying high probability bound on the samples collected from the 1st stage (where no recommendations were given). We also provide a separate accuracy guarantee for the treatment effect estimateθ at the end of the Algorithm 1. Theorem 3.3 (Treatment Effect Confidence Interval after Algorithm 1). With sample set S = (x i , y i , z i ) i=1 of samples collected from the second stage of Algorithm 1 -run with exploration probability ρ small enough so that type 0 agents are compliant (see Lemma 3.2),approximation bound A(S , δ) satisfies the following, with probability at least 1 − δ: for any δ ∈ (0, 1). Recall σ g is the variance of g (u i ) , p 0 is the fraction of compliant never-takers in the population of agents, 5 and A(S , δ) is defined as in Theorem 2.1. Proof Sketch. See Appendix C for the full proof. Note that Theorem 2.1 applies, so we only have to bound the denominator term which is dependent on {(x t , z t )} |S | t=1 . We assume that Algorithm 1 is initialized with parameters (see Lemma 3.2) such that type 0 agent is compliant. We bound the term dependent on {(x t , z t )} |S | t=1 with high probability. Algorithm 1 can be extended to handle more general settings: 1. There can be arbitrarily many types of agents that do not share the same prior. In this case, let E P (u) [g 0 ] and E P (u) [g 1 ] denote the expected baseline rewards for never-takers and always-takers, respectively, over the prior P (u) of any type u and Then, Algorithm 1 can still incentivize any never-taker type u agents to comply as long as the planner has a lower bound on P P (u) [ξ (u) ], where ξ (u) is defined just as ξ in Equation (4), except G (0) is replaced with G (u) . Theorem 3.3 applies as is. 2. All types can be always-takers (who prefer the treatment). The algorithm can incentivize some of the agents to take control with an event ξ defined withoutȳ 0 and flipped (i.e. the mean treatment reward is much lower than the expected baseline reward). 6 5 We redefine p0 here to be applicable to more general settings. 6 Also, Lemma C.3 can be proved sans clean event C0. By Theorem 3.3, samples collected from Algorithm 1 produce a confidence interval on the treatment effect θ which decreases proportionally to 1/ √ t by round t. However, it still decreases slowly because the exploration probability ρ is small (see roughly how small in Section 7). In Section 4, we give an algorithm for which this confidence interval improves quicker and works for arbitrarily many types. In this section, we present a recommendation policy which (1) capitalizes on partial compliance, eventually incentivizing all agents to comply, and (2) determines whether the treatment effect is positive or not (with high probability). Algorithm 2 recommends control and treatment sequentially (one after the other). Lemma 4.2 gives conditions for partial compliance from the beginning of Algorithm 2, given access to initial samples which form a crude estimate of the treatment effect. 
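To make the structure of Algorithm 2 concrete before the formal treatment, here is a simplified Python sketch. It is illustrative rather than the paper's exact pseudocode: it pools all collected samples instead of tracking the best sample set S_q^BEST, the function respond(t, z) is an assumed stand-in for agent t's (possibly non-compliant) action and reward, and the error bound follows the form of Theorem 2.1.

```python
import numpy as np

def wald_estimate(x, y, z):
    """Wald / 2SLS estimate of the treatment effect, as in Equation (3)."""
    x, y, z = np.asarray(x), np.asarray(y), np.asarray(z)
    return ((y - y.mean()) * (z - z.mean())).sum() / ((x - x.mean()) * (z - z.mean())).sum()

def error_bound(x, z, sigma_g, delta):
    """Finite-sample bound A(S, delta) of the form in Theorem 2.1 (assumed form)."""
    x, z = np.asarray(x), np.asarray(z)
    denom = abs(((x - x.mean()) * (z - z.mean())).sum())
    return 2 * sigma_g * np.sqrt(2 * len(x) * np.log(2 / delta)) / denom

def run_algorithm_2(initial_samples, respond, T, h=50, sigma_g=1.0, delta=0.05):
    """Racing loop in the spirit of Algorithm 2.

    initial_samples: list of (x, y, z) triples, e.g. collected by Algorithm 1.
    respond(t, z):   assumed helper returning agent t's chosen action x_t and reward
                     y_t when recommended z (the agent may or may not comply)."""
    xs, ys, zs = (list(v) for v in zip(*initial_samples))
    theta_hat = wald_estimate(xs, ys, zs)
    bound = error_bound(xs, zs, sigma_g, delta)
    t = 0

    # Racing stage: phases of 2h rounds, recommending control and treatment in turn,
    # until the interval [theta_hat - bound, theta_hat + bound] excludes zero.
    while abs(theta_hat) <= bound and t + 2 * h <= T:
        for _ in range(h):
            for z in (0, 1):
                x, y = respond(t, z)
                xs.append(x); ys.append(y); zs.append(z)
                t += 1
        theta_hat = wald_estimate(xs, ys, zs)
        bound = error_bound(xs, zs, sigma_g, delta)

    # Exploitation stage: recommend the apparent winner for all remaining agents.
    winner = int(theta_hat > 0)
    while t < T:
        respond(t, winner)
        t += 1
    return theta_hat, bound
```

The loop mirrors active arm elimination: recommendations alternate until the IV confidence interval around θ̂ excludes zero, after which the apparent winner is recommended for the remaining rounds.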
Theorem 4.3 demonstrates how rapidly this estimate improves throughout Algorithm 2, which solely depends on the fraction of compliant agents (and not on some fraction like ρ with Algorithm 1). More (and eventually all) types of agents progressively become compliant throughout Algorithm 2. Assumption 4.1 (Knowledge Assumption for Algorithm 2). Within Section 4, the following are common knowledge among agents and planner: 1. The fraction of agents in the population who prefer control is p 0 ≥ 0; that who prefer treatment is p 1 ≥ 0. 2. For each type u and for some τ (which can differ per u), the probability τ P Algorithm 2 Overcoming partial compliance which meet Theorem 2.1 conditions and produce IV estimateθ S 0 ; 7 time horizon T ; number of recommendations of each action per phase h; approximation bound failure probability δ; Split the remaining rounds (up to T ) into consecutive phases of h rounds each, starting with q = 1; Letθ 0 =θ S 0 and A 0 = A(S 0 , δ); while |θ q−1 | ≤ A q−1 do The next 2h agents are recommended control and treatment sequentially (one after the other); Let S q be samples up to and including phase q, i.e. S q := (x i , z i , y i ) be the samples with smallest approximation bound so far (from phase 1 to q), i.e. S BEST q = argmin Sr,0≤r≤q A(S r , δ); Defineθ q =θ S BEST q and A q = A(S BEST q , δ); q = q + 1; For all remaining agents recommend a * = 1 θ q > 0 . We focus on a setting where agents are assumed to have been at least partially compliant in the past, such that we may form an IV estimate from the history. The social planner employs Algorithm 2, which is a modification of the Active Arms Elimination algorithm Even-Dar et al. [2006] . Treatment and control "race", i.e. are recommended sequentially, until the expected 7 Operator | · | denotes the cardinality of a set. treatment effect is known to be negative or positive (with high probability). Then, the algorithm recommends the "winner" (the action with higher expected reward) for the remainder of the time horizon T . The compliance incentive works as such: when an agent is given a recommendation, they do not know whether it is because the action is still in the "race" or if the action is the "winner". When the algorithm is initialized with samples that form an IV estimate which is sufficiently close to the true treatment effect (according to the agent's prior), then the probability that any recommended action has "won" is high enough such that the agent's expected gain from taking a "winning" action outweighs the expected loss from taking a "racing" one. We formalize this in Lemma 4.2. Lemma 4.2 (Algorithm 2 Partial Compliance). Recall that Algorithm 2 is initialized with input For any type u with the following prior preference (control or treatment), if S 0 satisfies the following condition, with probability at least 1 − δ, then all agents of type u will comply with recommendations of Algorithm 2: for some τ ∈ (0, 1), where A(S 0 , δ) is the approximation bound for S 0 and any δ ∈ (0, 1) (see Theorem 2.1). Proof Sketch. See Appendix D.1 for the full proof. The proof follows by using a "clean event" analysis where the IV estimated treatment effectθ is close to the true treatment effect θ. We split the conditional expected treatment effect E P (u) [θ] into different cases for the value of θ. 
With an IV estimate that is sufficiently close to the true treatment effect, the expected gain from exploiting (taking the "winning" action) is greater than the expected loss from exploring (taking a recommended action when the "race" is not over) and any agent of type u will comply with recommendation. When a nonzero fraction of agents comply from the beginning, the samples gathered in Algorithm 2 provide treatment effect estimatesθ which become increasingly accurate over rounds. In the following Theorem 4.3, we provide a high probability guarantee on this accuracy. Theorem 4.3 (Treatment Effect Confidence Interval from Algorithm 2 with Partial Compliance). of |S| samples collected from Algorithm 2 where p c is the fraction of compliant agents in the population, we form an estimateθ S of the treatment effect θ. With probability at least 1 − δ, for any δ ∈ (0, 1), where σ g is the variance of g (u i ) . Proof Sketch. See Appendix D.3.1 for a full proof. Note that Theorem 2.1 applies, so we only to have to bound the denominator term which is dependent on {(x t , z t )} |S| t=1 . We assume that Algorithm 2 is initialized with parameters such that p c > 0 fraction of the population complies with all recommendations. We bound the term dependent on {(x t , z t )} |S| t=1 with high probability. Agents become compliant during Algorithm 2 for the same reason others become compliant from the beginning: they expect that the estimateθ is sufficiently accurate and it's likely they're getting recommended an action because it won the race. For large enough T , all agents will become compliant. 8 Note that the accuracy improvement in Theorem 4.3 relies solely on the proportion of agents p c who comply from the beginning of Algorithm 2, which relies on the accuracy of the approximation bound given by initial samples S 0 . Thus, if the social planner can choose more accurate S 0 , then the treatment effect estimateθ given by samples from Algorithm 2 becomes more accurate quicker. In Section 5, we present a recommendation policy in which S 0 can be chosen by running Algorithm 1. In this section, we present a recommendation policy π c , which spans T rounds and runs Algorithms 1 and 2 in sequence. This policy achievesÕ( √ T ) regret for sufficiently large T and produces an estimateθ which deviates from the true treatment effect θ by O(1/ √ T ). 9 Assumption 5.1 (Knowledge Assumption for Policy π c ). Within Sections 5.1 and 5.2, the following are common knowledge among agents and planner: 1. All prior-dependent constants given in Assumption 4.1 2. For each type u which prefers control, prior mean µ (u) and a lower bound on the probability P P (u) [ξ (u) ] (defined in Extension 1 of Algorithm 1 from Section 3.1) Recommendation policy π c over T rounds is given as such: 1) Run Algorithm 1 with exploration probability ρ set to incentivize at least p c 1 > 0 fraction of agents of the population who initially prefer control to comply in Algorithm 1 and to make at least p c 2 > 0 fraction of agents comply in Algorithm 2 (see Lemma 5.2). 2) Initialize Algorithm 2 with samples from Algorithm 1. At least p c 2 fraction of agents comply in Algorithm 2. We first provide conditions on to define policy π c . Lemma 5.2 (Lower bound on for Type u Compliance in Algorithm 2). Recall that S denotes the samples collected from the second stage of Algorithm 1. Let S be the input samples S 0 in Algorithm 2. 
Assume that p c 1 proportion of agents in the population are compliant with recommendations of Algorithm 1 and length satisfies: for some τ ∈ (0, 1) and where κ 1 := 8σg √ 2 log(5/δ) pc 1 ρ(1−ρ) and κ 2 := (3 − ρ) ρ log(5/δ) 2(1−ρ) for any δ ∈ (0, 1). Then any agent of type u will comply with recommendations of Algorithm 2. Proof Sketch. See Appendix D.3 for the full proof. The proof follows by substituting the value of into the approximation bound Theorem 3.3 and simplifying. The compliance condition follows from Lemma 4.2. Policy π c shifts from Algorithm 1 to Algorithm 2 as soon as the condition on above is satisfied. This is because 1) the treatment effect estimateθ get more accurate quicker and 2) less regret is accumulated in Algorithm 2 than Algorithm 1. The goal of recommendation policy π c is to maximize the cumulative reward of all agents. We measure the policy's performance through regret. We are interested in minimizing regret, which is specific to the treatment effect θ. Since agents' priors are not exactly known to the social planner, this pseudo-regret is correct for any realization of these priors and treatment effect θ. Definition 5.3. [Pseudo-regret] The pseudo-regret of a recommendation policy is given as such: We present regret guarantees for recommendation policy π c . First, policy π c achieves sub-linear pseudo-regret. Lemma 5.4 (Pseudo-regret). The pseudo-regret accumulated from policy π c is bounded for any θ ∈ [−1, 1] as follows, with probability at least 1 − δ for any δ ∈ (0, 1): for sufficiently large time horizon T , where the length of Algorithm 1 is L 1 = + 2 max 0 p 0 , 1 p 1 . Proof Sketch. See Appendix E.1 for the full proof. The proof follows by observing that Algorithm 2 must end after some log(T ) phases. We can bound the regret of the policy π c by at most that of Algorithm 2 plus θ per each round of Algorithm 1, or alternatively, we can upper bound it by θ per each round of the policy π c . Policy π c also achieves sub-linear regret, where the expectation is over the randomness in the priors of the agents. Lemma 5.5 provides a basic performance guarantee of our recommendation policy. Lemma 5.5 (Regret). Policy π c achieves regret as follows: for sufficiently large time horizon T . Proof Sketch. See Appendix E.2 for the full proof. The proof follows by observing that we can set the parameters in Algorithm 1 and Algorithm 2 in terms of the time horizon T while maintaining compliance throughout policy π c . These results are comparable to the pseudo-regret of the classic multi-armed bandit problem, with some added constants factors for the compliance constraints Even-Dar et al. [2006] . The pseudo-regret of our policy π c is asymptotically equivalent to an extension of the detail-free recommendation algorithm of Mansour et al. [2015] , which incentivizes full compliance for all types. However, our policy can finish in a more timely manner and has smaller prior-dependent constants in the asymptotic bound. In Section 6, we provide an extension of our model and policy π c to arbitrarily many treatments with unknown effects. We also provide similar regret guarantees. In this section, we introduce a setting which extends the previous binary treatment setting by considering k treatments (and no control). We now consider a treatment effect vector θ ∈ R k ; and x, z ∈ {0, 1} k are one-hot encodings of the treatment choice and recommendation, respectively. We assume that E [g (u i ) ] = 0. 10 All other terms are defined similar to those in Section 2. 
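Before the formal definitions, it may help to see the shape of the estimator in this one-hot encoding. The sketch below is a hypothetical illustration assuming the just-identified 2SLS form θ̂_S = (Z^T X)^{-1} Z^T y, which is the natural matrix analogue of the Wald estimator when E[g^{(u)}] = 0; the paper's Equation (11) may differ in normalization.

```python
import numpy as np

def iv_estimate_k(X, Z, y):
    """Just-identified 2SLS/IV estimate with one-hot treatments X and instruments Z.

    X, Z: (n, k) arrays whose rows are one-hot; y: (n,) rewards. No intercept is
    needed because the baseline noise is assumed to have mean zero. This is an
    assumed form of the estimator, not necessarily identical to Equation (11)."""
    ZtX = Z.T @ X  # (k, k); well-conditioned when recommendations move action choices
    return np.linalg.solve(ZtX, Z.T @ y)

# Tiny usage example with hypothetical data: k = 3 treatments, fully compliant agents.
rng = np.random.default_rng(1)
theta = np.array([0.2, 0.5, -0.1])
n, k = 5000, 3
Z = np.eye(k)[rng.integers(0, k, size=n)]       # uniformly random one-hot recommendations
X = Z.copy()                                     # full compliance, for illustration only
y = X @ theta + rng.normal(scale=1.0, size=n)    # zero-mean baseline noise
print(iv_estimate_k(X, Z, y))                    # approximately recovers theta
```

The finite-sample bound in Theorem 6.1 depends on a smallest singular value σ_min(·), which plays the role that the covariance Σ (x_i − x̄)(z_i − z̄) plays in the binary case: compliance keeps it bounded away from zero.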
Here, the reward y i ∈ R and action choice x i ∈ {0, 1} k at round i are given as such: 11 , we compute IV estimateθ S of θ as such: Next, we state finite sample approximation results which extend Theorem 2.1 to this general setting. Theorem 6.1 (Many Treatments Effect Approximation Bound). Let z 1 , . . . , z n ∈ {0, 1} k be a sequence of instruments. Suppose there is a sequence of n agents such that each agent i has private type u i drawn independently from U, selects x i under instrument z i and receives reward , and the IV estimator given by Equation (11) satisfies with probability at least 1 − δ for any δ ∈ (0, 1). Proof Sketch. See Appendix F.3 for the full proof. The bound follows by substituting our expressions for y t , x t into the IV regression estimator, applying the Cauchy-Schwarz inequality to split the bound into two terms (one dependent on {(g ). We bound the second term with high probability. Next, we extend recommendation policy π c to k treatments (see Definition 6.3 in Appendix F for details). 13 Assumption 6.2 (Knowledge Assumption for General Policy π c ). Within Section 6, the following are common knowledge among agents and planner: 1. All agents share a preference ordering over all k treatments, i.e. for any type u, the prior 10 Without this assumption, we run into identifiability issues: we cannot reconstruct the individual treatment effects θ 1 , . . . , θ k without fixing some mean E[g (u) ]. Yet, for purposes of regret minimization, assuming E[g (u) ] = 0 does not change our results. 11 We bastardize notation by writing xi = j instead of xi = ej (the k-dimensional unit vector along the jth dimension). 12 The operator σmin(·) denotes the smallest singular value. 13 Algorithm 3 extends Algorithm 1 and Algorithm 4 extends Algorithm 2. 14 This ordering assumption is shared by Mansour et al. [2015] . We assume that every agent -regardless of type-shares the same prior ordering of the treatments, such that all agents prior expected value for treatment 1 is greater than their prior expected value for treatment 2 and so on. First, Algorithm 3 is a generalization of Algorithm 1 which serves the same purpose: to overcome complete non-compliance and incentivize some agents to comply eventually. The incentivization mechanism works the same as in Algorithm 1, where we begin by allowing all agents to choose their preferred treatment -treatment 1-for the first rounds. Based on the samples collected from the first stage, we then define a number of events ξ (u) j -which are similar to event ξ from Algorithm 1-that each treatment j ≥ 2 has the largest expected reward of any treatment and treatment 1 has the smallest, according to the prior of type u: where C = σ g 2 log(3/δ) + 1 4 for any δ ∈ (0, 1) and whereȳ 1 denotes the mean reward for treatment 1 over the samples of the first stage of Algorithm 3. Thus, if we set the exploration probability ρ small enough, then some subset of agents will comply with all recommendations in the second stage of Algorithm 3. Input: exploration probability ρ ∈ (0, 1), minimum number of samples of any treatment ∈ N (assume w.l.o.g. 
( /ρ) ∈ N), failure probability δ ∈ (0, 1), compliant type u 1st stage: The first agents are given no recommendation (they choose treatment 1) holds, based on the samples from the first phase and any samples of treatment 2 ≤ j < i collected thus far then a * i = i else a * i = 1 From the next /ρ agents, pick agents uniformly at random to be in the explore set E 15 for the next rounds do if agent t is in explore set E then z t = 1 else z t = a * Second, Algorithm 4 is a generalization of Algorithm 2, which is required to start with at least partial compliance and more rapidly and incentivizes more agents to comply eventually. The incentivization mechanism works the same as in Algorithm 1, where we begin by allowing all agents to choose their preferred treatment -treatment 1-for the first rounds. Based on the samples collected from the first stage, we then define a number of events -which are similar to event ξ from Algorithm 1-that each treatment j ≥ 2 has the largest expected reward of any treatment and treatment 1 has the smallest. Thus, if we set the exploration probability ρ small enough, then some subset of agents will comply with all recommendations in the second stage of Algorithm 3. Definition 6.3 (General recommendation policy π c for k treatments). Recommendation policy π c over T rounds is given as such: Algorithm 4 Overcoming partial compliance for k treatments which meet Theorem 6.1 conditions and produce IV estimateθ S 0 , time horizon T , number of recommendations of each action per phase h, failure probability δ ∈ (0, 1) Split the remaining rounds (up to T ) into consecutive phases of h rounds each, starting with q = 1; The next |B| agents are recommended each treatment i ∈ B sequentially in lexicographic order; Let S BEST q be the sample set with the smallest approximation bound so far, i.e. S BEST For all remaining agents, recommend a * that remains in B. 1) Run Algorithm 3 with exploration probability ρ set to incentivize at least p c 1 > 1/k fraction of agents of the population to comply in Algorithm 3. Let S 0 be the sample set given from Algorithm 3. By Corollary F.5 and Theorem F.6, we can use S 0 as the initial samples in Algorithm 4 to incentivize compliance for any arm a if the approximation bound A(S 0 , δ) given by S 0 is small enough (see Theorem F.6). Thus, we run Algorithm 3 long enough (i.e. we set large enough) so that the approximation bound is small enough and at least p c 2 > 1/k fraction of agents comply with recommendations of every treatment in Algorithm 4. 2) Initialize Algorithm 4 with samples S 0 from Algorithm 3. At least p c 2 > 1/k fraction of agents comply with recommendations from Algorithm 4 from the beginning and until time horizon T . Similar to the control-treatment setting, we provide the compliance lemmas and proof sketches for Algorithm 3 and Algorithm 2. In Algorithm 3, any type u agent who arrives in the last /ρ rounds of Algorithm 3 is compliant with any recommendation if µ (u) i > 0 for all 1 < i ≤ k, and the exploration probability ρ satisfies: Proof Sketch. Let the recommendation policy π here be Algorithm 3. Part I (Compliance with recommendation for treatment i > 1): We first argue that an agent t of type u who is recommended treatment i will not switch to any other treatment j. For treatments j > i, there is no information about treatments i or j collected by the algorithm and by assumption, we have µ j . Hence, it suffices to consider when j < i. 
We want to show that Part II (Recommendation for treatment 1): When agent t is recommended treatment 1, they know that they are not in the explore group E. Therefore, they know that the event ¬ξ (u) i occurred. Thus, in order to prove that Algorithm 3 is BIC for an agent of type u, we need to show the following for any treatment j > 1: We omit the remainder of this proof due to its similarity with the proof of Lemma 3.2 Lemma 6.5 (Algorithm 4 Partial Compliance). Recall that Algorithm 4 is initialized with input For any type u, if S 0 satisfies the following condition, then with probability at least 1 − δ all agents of type u will comply with recommendations of Algorithm 4: for some τ ∈ (0, 1), where A(S 0 , δ) is the approximation bound for S 0 and any δ ∈ (0, 1) (see Theorem 6.1). Proof Sketch. Let the recommendation policy π here be Algorithm 4. We want to show that for any agent at time t with a type i < u in the racing stage and for any two treatments a, b ∈ B : We omit the remainder of this proof due to its similarity with the proof of Lemma 4.2. In order to incentivize agents of any type u to comply with general extensions Algorithms 3 and 4, we (again) set exploration probability ρ and length to satisfy some compliance conditions relative to P P (u) [ξ (u) ] and τ P P (u) [G v > τ ], respectively (see Appendix F.4). We present the (expected) regret from the k treatment extension of policy π c next. Lemma 6.6 (Regret of Policy π c for k Treatments). An extension of policy π c achieves (expected) regret as follows: for sufficiently large time horizon T . Proof Sketch. See Appendix F.4 for the full proof. The proof follows the same structure as that of Lemma 5.5. Though our analysis covers a more general k treatment setting than Mansour et al. [2015] (capturing non-compliance and selection bias), our policy π c accumulates asymptotically comparable regret in terms of T . See Appendix F for all other results. Next, in Section 7, we implement Algorithm 1 experimentally. In this section, we present experiments to evaluate Algorithm 1. We mention previously in the paper that this approximation bound decreases slowly throughout Algorithm 1, because the exploration probability ρ in Algorithm 1 is small. Here, we are interested in (1) how small the exploration probability ρ is and (2) how slowly the approximation bound on the absolute difference |θ −θ| decreases as Algorithm 1 progresses (whereθ is based on samples from Algorithm 1). These are important to study, because this slow improvement in accuracy is the primary source of inefficiency (in terms of sample size) for policy π c , which accumulates linear regret during Algorithm 1 (see Lemma 5.4) for marginal improvements in estimation accuracy. This motivates the social planner to move to Algorithm 2 -where the estimation accuracy increases much quicker-as soon as possible in policy π c . Yet, there is also a tradeoff for moving to Algorithm 2 too quickly: if Algorithm 1 is not run for long enough, then only a small portion of agents may comply in Algorithm 2. In order to better inform the choice of hyperparameters in policy π c (specifically, the compliance paramters p c 1 and p c 2 ), we empirically estimate these quantities experimentally. We defer experiments on Algorithm 2 to the appendix. 16 Experimental Description. We consider a setting with two types of agents: type 0 who are initially never-takers and type 1 who are initially always-takers. 
We let each agent's prior on the treatment effect be a truncated Gaussian distribution between −1 and 1. The noisy baseline reward g (ut) t for each type u of agents is drawn from a Gaussian distribution N (µ g (u) , 1), with its mean µ g (u) also drawn from a Gaussian prior. We let each type of agent have equal proportion in the population, i.e. p 0 = p 1 = 0.5. We are interested in finding the probability of event ξ (as defined in Equation (4)) and the exploration probability ρ (as defined in Equation (5)). Instead of deriving an explicit formula for PP 0 [ξ] to calculate the exploration probability ρ, we estimate it using Monte Carlo simulation by running the first stage of Algorithm 1 for 1000 iterations and aggregating the results. After this, Algorithm 1 is run with the previously-found exploration probability ρ over an increasing number of rounds. We repeatedly calculate the IV estimate of the treatment effect and compare it to a naive OLS estimate (that regresses the treatment onto the reward) over the same samples as a benchmark. Results are averaged over 5 runs; light blue error bars represent one standard error. The OLS estimate converges to a value with approximation error around 0.1; whereas the approximation error of our IV estimate steadily decreases over time. Results. In Figure 2 , we compare the approximation bound on |θ −θ| between IV estimateθ versus via a naive estimate for a specific, chosen ρ = 0.001. In our experiments, the exploration probability ρ generally lies within [0.001, 0.008]. In Figure 2 , we let hidden treatment effect θ = 0.5, type 0 and type 1 agents' priors on the treatment effect be N (−0.5, 1) and N (0.9, 1) -each truncated onto [−1, 1],-respectively. We also let the mean baseline reward for type 0 and type 1 agents be µ g (0) ∼ N (0, 1) and µ g (1) ∼ N (0.1, 1), respectively. These priors allow us to set the exploration probability ρ = 0.001 for Figure 2 , where IV regression consistently outperforms OLS for any reasonably long run of Algorithm 1. To our knowledge, these experiments are the first empirical evaluation of an Incentivizing Exploration algorithm. Figure 2 shows the effect of small exploration probability ρ = 0.001: we need to run Algorithm 1 for a while to produce a decently accurate causal effect estimate. 17 In this paper, we present a model for how (non)-compliance changes over time based on beliefs and new information. We observe that recommendations which incentivize (at least partial) compliance can be treated as instrumental variables (IVs), enabling consistent treatment effect estimation via IV regression, even in the presence of non-compliance and confounding. Finally, we provide recommendations mechanisms which provably incentivize compliance and achieve sublinear regret. In this paper, we provide recommendation algorithms for incentivizing exploration (IE), each which consist of two parts with their own ethical considerations: 1) recommending for exploration and 2) selectively disclosing information. First, in certain contexts, recommending for exploration of a treatment with unknown effects could place undue risk or harm onto an individual in order to benefit society or maximizing well-being overall. 18 Because of this, in certain medical contexts, incentivizing exploration may violate the principle of nonmaleficence. Second, incentivizing exploration requires selective information disclosure. 
While selective disclosure may be required in studies where information is protected or proprietary, it may be (morally) wrong in other settings. Restricting full information about the history of a treatment for the purpose of changing an individual's behavior may be manipulative or deceptive and may restrict their autonomy. We do not provide ethical considerations to discourage incentivizing exploration outright: there may be settings where the benefits outweigh the harms. We leave it to study designers and policymakers to weigh these in specific settings. Future Work. Here, we focused on a setting where the causal model is linear and there is no treatment modification by the private type (so all agents share the same treatment effect θ). Future work may extend our results to non-linear settings and settings with treatment effect heterogeneity. We may also relax the (somewhat unrealistic) assumptions 1) that the social planner knows key prior-dependent constants about all agents and 2) that agents fully know their prior, the recommendation mechanism, and can exactly update their posterior over treatment effects. Finally, our empirical results invite further work to improve the practicality of incentivizing exploration mechanisms to allow for more frequent exploration and lessen the number of samples needed. Theorem A.1. (Chernoff Bound for unbounded sub-Gaussian random variables) Let X 1 , . . . , X n be independent sub-Gaussian random variables with parameter σ. Corollary A.2. (High probability bound on the sum of unbounded sub-Gaussian random variables) For any δ ∈ (0, 1), with probability at least 1 − δ, . . , X n be independent and bounded random variables such that a ≤ X i ≤ b for all i. Then Corollary A.4. (High probability upper bound on the sum of bounded random variables) For any δ ∈ (0, 1), with probability at least 1 − δ, where X i ∈ [a, b] for all i from 1 to n. Lemma A.5. (Cauchy-Schwarz Inequality) For any n-dimensional vectors u, v ∈ R n , the L 2 −norm of the inner product of u and v is less than or equal to the L 2 −norm of u times the Alternatively, for any m × n-dimensional matrices A ∈ R m×n and n-dimensional vector v ∈ R n , the L 2 −norm of the dot product of A and v is less than or equal to the spectral norm of A times Theorem A.6. (Matrix Chernoff ) Consider a finite sequence X k of independent, random, self-adjoint matrices with common dimension d. Assume that: Introduce the random matrix Y = k X k . Define the minimum eigenvalue µ min and maximum eigenvalue µ max of the expectation E [Y ]. Then, for θ > 0, Recall that our reward model can be stated as the following equation: To analyze the Wald estimator, we introduce two conditional probabilities that an agent chooses the treatment given a recommendationγ 0 andγ 1 , given as proportions over a set of n samples (x i , z i ) n i=1 and formally defined aŝ Then, we can write the action choice x i as such: we can rewrite the reward y i as Let operator· denote the sample mean, e.g.ȳ := 1 Thus, the centered reward and treatment choice at round i are given as: This formulation of the centered reward y i −ȳ allows us to express and bound the error between the treatment effect θ and its instrumental variable estimateθ S , which we show in the following Theorem 2.1. Theorem 2.1 (Finite-sample error bound for Wald estimator). Let z 1 , z 2 , . . . , z n ∈ {0, 1} be a sequence of instruments. 
Suppose there is a sequence of n agents such that each agent i has their private type u i drawn independently from U, selects action x i under instrument z i , and receives reward and the Wald estimator given by (3) satisfies with probability at least 1 − δ, for any δ ∈ (0, 1). Proof. Given a sample set S = (x i , y i , z i ) n i=1 of size n, we form an estimate of the treatment effectθ S via a Two-Stage Least Squares (2SLS). In the first stage, we regress y i −ȳ onto z i −z to get the empirical estimateβ S and x i −x onto z i −z to getγ S as such: In the second stage, we take the quotient of these two empirical estimates as the predicted treatment effectθ S , i.e. Next, we can express the absolute value of the difference between the true treatment effect θ and the IV estimate of the treatment effectθ S given a sample set S of size n as such: (18) In order to complete our proof, we demonstrate an upper bound on the numerator (21) in the last line above. We do so in Lemma B.1. Lemma B.1. For all δ ∈ (0, 1), with probability at least 1 − δ, we have if the set of g (u i ) i are i.i.d. sub-Gaussian random variables with sub-Gaussian norm σ g . Proof. We can rewrite the left hand side as follows g (u) ] (by the triangle inequality and |z| ≤ 1) is sub-Gaussian, then the last line in the system of inequalities above is given as: ≤ σ g 2n 1 log(1/δ 1 ) + σ g 2n log(1/δ 2 ) (by Corollary A.2, where n 1 := n i=1 z i ) ≤ 2σ g 2n log(2/δ) (since n 1 ≤ n and by Theorem A.7, where δ 1 = δ 2 = δ/2) This recovers the stated bound and finishes the proof for Theorem 2.1. Next, we demonstrate a lower bound on the denominator of Theorem 2.1, in terms of the level of compliance at each phase of Algorithms 1 and 2. denote a sample set which satisfies the conditions of Theorem 2.1. Furthermore, assume that there are p c fraction of agents in the population who would be compliant. Recall that z = 1 n n i=1 z i andx = 1 n n i=1 x i . Then, the denominator of the approximation bound A(S, δ) (from Theorem 2.1) is lower bounded as such: otherwise. The second case above occurs with probability at least 1 − δ for any δ ∈ (0, 1). Proof. In this theorem, we formulate the denominator of the approximation bound in Theorem 2.1 in terms ofz, sincez is determined by the social planner. For any type u, let u ∈ U c denote that agents of type u comply; let u ∈ U 0 denote that agents of type u are never-takers (agents which prefer control, according to their prior); and let u ∈ U 1 denote that agents of type u are always-takers (agents which prefer treatment, according to their prior). Let p 0 and p 1 be the fractions of never-takers and always-takers, respectively. Next, we expand the binomial in the denominator and arrive at the following simplified form: First, observe that at any round i, the product x i = 1 only when agent i is a non-compliant always-taker or when z i = 1 and agent i is compliant. Formally, for any agent i, action choice x i = 1 is equivalent to the following: Then, the sum n i=1 x i can be expressed as follows: where we definep c:z i =1 as the empirical proportion of agents with types in U c when the recommendation z = 1 andp nc1 as the empirical proportion of non-compliant always-takers. Formally, Define p nc1 to be the proportion of non-compliant always-takers in the population of agents. Then, in expectation over the randomness of how agents arrive, E [p c:z i =1 ] = p c and E [p nc1 ] = p nc1 . Next, we rewrite the sum n i=1 x i z i in terms ofz and some population constants. 
Observe that at any round i, the product x i z i = 1 only when both x i = 1 and z i = 1. Thus, by Equation (24), for any agent i, the event x i z i = 1 is equivalent to the following: Then, the sum n i=1 x i z i can be expressed as follows: where we definep nc1:z i =1 as the empirical proportions of non-compliant always-takers who arrive when z i = 1 -i.e.p nc1: In expectation over the randomness of how agents arrive, E [p nc1: Finally, by Equations (23), (25) and (26), we can provide a high probability lower bound on the denominator as such: x i z i − nzx (by Equation (23)) = |nz (p c:zi=1 +p nc1:zi=1 ) − nz (zp c:zi=1 +p nc1 )| (by Equations (25) and (26) with probability at least 1 − δ for any δ ∈ (0, 1). Claim C.1. For any agent t at round t with recommendation policy π t with a positive probability of recommending either control or treatment, according to the prior P (ut) , i.e. P πt,P (u t ) [z t = 0] > 0 and P πt,P (u t ) [z t = 1] > 0. Furthermore, a (u) and b (u) denote the initially preferred and unpreferred actions for any type u, i.e. a (u) : Formally, the following holds: Proof. Note that a (u) is defined in such a way that (−1) a (u) E P (u) [θ] < 0 always: if agents of type u prefer initially control, then a (u) = 0 and (−1) a (u) Then, Recall that we assume that type 0 agents prefer the control, i.e. the expected treatment effect E P (0) [θ] < 0. Then: Therefore, given that both P πt,P (u t ) [z t = 0] > 0 and P πt,P (u t ) [z t = 1] > 0 and, by assumption, Lemma 3.2 (Type 0 compliance with Algorithm 1). Under Assumption 3.1, any type 0 agent who arrives in the last rounds of Algorithm 1 is compliant with any recommendation, as long as the exploration probability ρ satisfies where the event ξ is defined above in Equation (4). Proof. Let the event ξ = ξ (0) (as given by Definition C.2). By Lemma C.3, if ρ satisfies the following condition, then any type 0 agent will comply with any recommendation of the last rounds of Algorithm 1: Definition C.2 (Extension 1 of Algorithm 1). Here, we formalize the recommendation policy of Extension 1 in Section 3.1, which modifies Algorithm 1 in two ways: 1. We redefine event ξ as ξ (u) such that it is relative to any type u, defined as follows: where G (ut) is an upper bound on the difference between the prior mean of the treatment versus the control according to type u, i.e. G (ut) > E P (u) [g 1 − g 0 ], and where E P (u) [g 0 ] and E P (u) [g 1 ] are the expected baseline rewards for initial never-takers and always-takers. 2. If we are trying to incentivize compliance for always-takers, then those agents in the exploration set E are recommended control (rather than treatment, as described in the pseudocode for Algorithm 1). Lemma C.3 (Arbitrary Type Compliance with Extension 1 of Algorithm 1). Under Assumption 3.1, any type u t agent who arrives at round t in the last rounds of Extension 1 of Algorithm 1 (given in Definition C.2) is compliant with any recommendation z t , as long as the exploration probability ρ satisfies: where the event ξ (ut) is defined in Definition C.2. Proof. This proof follows a similar structure to the Sampling Stage BIC proof in Mansour et al. [2015] . We will prove compliance for any type u in the more general Extension 1 of Algorithm 1, as given in Definition C.2, which admits arbitrarily many types and the option to incentivize initial always-takers, instead of initial never-takers, to comply. Let recommendation policy π be that described in Definition C.2, i.e. 
Extension 1 of Algorithm 1 which admits arbitrarily many types and allows for the exploration recommendations to be given in order to incentivize initial always-takers, instead of initial never-takers, to comply. Throughout this proof, we will assume that the exploration set E is defined relative to the initial preference of any agent of type u t , who we are proving compliance for. (2), if any agent t expects the treatment effect θ to be positive, they will select the treatment x t = 1. Conversely, if they expect the treatment effect θ to be negative, they will select control x t = 0. Thus, for any agent of type u t at round t, proving compliance entails the expected treatment effect θ over the prior of type u t and policy π t is positive given that the recommendation z t = 1 and negative given that the recommendation z t = 0, i.e. Next, we show that we can reduce our proof to demonstrating only one of the above statements, depending on the prior preference of type u. Let a (u) and b (u) denote the prior preferred and unpreferred actions for any type u, i.e. a (u) Because policy π (Algorithm 1 extension) is designed in a such way that at any round t in the last rounds, treatment or control is recommended each with positive probability -i.e. P πt,P (u t ) [z t = 1] > 0 and P πt,P (u t ) [z t = 0] > 0,-Claim C.1 applies and the following holds: Thus, at round t, in order to prove compliance for agents of type u t with prior preferred and unpreferred actions a (ut) and b (ut) , respectively, it suffices to demonstrate that The remainder of the proof is devoted to demonstrating this. We first rewrite E πt,P (u t ) (for explore set E defined relative to recommend action b (ut) ) (since the only way z t = b (ut) when exploiting (i.e. when t ∈ E) is when event ξ occurs) Now, we can rewrite our compliance condition as such: [θ|ξ] P πt,P (u t ) [ξ] + ρµ (u) ≥ 0. Now, we rewrite this compliance condition strictly in terms of the exploration probability ρ and relative to a number of constants which depend on the prior P (ut) . Thus, if we set ρ to satisfy the following condition (in Equation (31)), then all agents of type u will comply with recommendations from policy π (Algorithm 1 extension): [ξ (ut) ] + ρµ (ut) ≥ 0 (by Equation (30)) Finally, we can further simplify the upper bound on ρ given in Equation (31) above by showing that E πt,P (u t ) [θ|ξ (ut) ] satisfies some constant lower bound. This will complete our proof. For any type u, the baseline reward g (u) is a random variable independently distributed according to a sub-Gaussian distribution with variance σ (u) which is bounded above by σ g , i.e. σ (u) < σ g for any u. Furthermore, recall that G (ut) > EP u t [g 1 − g 0 ], where EP u t [g 1 ] and EP u t [g 0 ] are the expected value of the baseline rewards of always-takers and never-takers over the prior of type u t , respectively. Now, we define 3 clean events: C 0 and C 1 pertain to these baseline reward random variables, and C 2 occurs when the first stage of Algorithm 1 generates at least 0 control samples and at least 1 treatment samples: where = 2 max( 0 /p 0 , 1 /p 1 ) is the number of rounds in the first stage of Algorithm 1. Let δ 0 = δ 1 = P πt,P (u t ) [ξ (ut) ]/24. Furthermore, event C 2 occurs when the binomial random variable with success u t = x t = 1 (since x t = u t in the first stage of Algorithm 1) and success probability p 1 is lower bounded by 1 and upper bounded by − 0 . 
For = 2 max( 0 /p 0 , 1 /p 1 ) total trials, the probability of this event is less than P πt,P (u t ) [ξ (ut) ]/24. Now, define another clean event C where all C 0 , C 1 , and C 2 happen simultaneously. Letting δ = δ 0 + δ 1 + δ 2 , the event C occurs with probability at least 1 − δ where δ < P πt,P (u t ) [ξ (ut) ]/8. 19 This point is not entirely obvious: If a (u t ) = 0, then (−1) a (u t )+1 < 0 and E π t ,P (u t ) [θ|ξ (u t ) ] − µ (u t ) > 0, since We can now rewrite E πt,P (u t ) [θ|ξ (ut) ] P πt,P (u t ) [ξ (ut) ] in terms of event C: This comes down to finding a lower bound on the denominator of the expression above. We can reduce the dependency of the denominator to a single prior-dependent constant P πt,P (u t ) [ξ (ut) ] if we lower bound the prior-dependent expected value E πt,P (u t ) [θ|ξ (ut) ]. That way, assuming we know the prior and can calculate the probability of event ξ (ut) , we can pick an appropriate exploration probability ρ to satisfy the compliance condition for all agents of type 0. Then: Hence, the term E πt,P (u t ) [θ|ξ (ut) ] P πt,P (u t ) [ξ (ut) ] satisfies the following lower bound: [θ|ξ (ut) , C] P πt,P (u t ) [ξ (ut) ] − 2δ (by Equation (36)) (31), we arrive at a lower bound to set the exploration probability ρ for the agent any round t with type u t to comply with recommendation policy π t (extension of Algorithm 1): Theorem 3.3 (Treatment Effect Confidence Interval after Algorithm 1). With sample set S = (x i , y i , z i ) i=1 of samples collected from the second stage of Algorithm 1 -run with exploration probability ρ small enough so that type 0 agents are compliant (see Lemma 3.2),approximation bound A(S , δ) satisfies the following, with probability at least 1 − δ: for any δ ∈ (0, 1). Recall σ g is the variance of g (u i ) , p 0 is the fraction of compliant never-takers in the population of agents, 20 and A(S , δ) is defined as in Theorem 2.1. Proof. First, Theorem 2.1 demonstrates, for any δ 1 ∈ (0, 1), with probability at least 1 − δ 1 that the approximation bound Next, recall that the mean recommendationz = ρ for exploration probability ρ in the second stage of Algorithm 1. We assume Algorithm 1 to be initialized with parameters (see Lemma 3.2 for details) such that its recommendations are compliant for agents of type 0. In the worst case, only type 0 agents are compliant. Therefore, Theorem B.2 implies that, for any δ 2 ∈ (0, 1), with probability at least With a union bound over Equations (39) and (40) while letting δ 1 = δ 2 = δ 3 for any δ ∈ (0, 1), we conclude: with probability at least 1 − δ, A(S , δ) ≤ 2σ g 2 log(3/δ) Lemma 4.2 (Algorithm 2 Partial Compliance). Recall that Algorithm 2 is initialized with input For any type u with the following prior preference (control or treatment), if S 0 satisfies the following condition, with probability at least 1 − δ, then all agents of type u will comply with recommendations of Algorithm 2: for some τ ∈ (0, 1), where A(S 0 , δ) is the approximation bound for S 0 and any δ ∈ (0, 1) (see Theorem 2.1). Proof. Just as in the proof for Lemma C.3, let a (u) and b (u) denote the prior preferred and unpreferred actions for agents of any type u, i.e. a (u) Let π denote the recommendation policy defined by Algorithm 2. At any round t of Algorithm 2, recommendation policy π t has a positive probability of recommending either control or treatment, according to the prior P (ut) for type u t , i.e. P πt,P (u t ) [z t = 0] > 0 and P πt,P (u t ) [z t = 1] > 0. 
Thus, by Claim C.1, the following holds: and it suffices to prove the premise ut) ] ≥ 0 in order to prove that agent t of type u t complies with recommendation z t . Recall that the sample set S BEST q is made up of the best samples up until phase q of Algorithm 2, i.e. the samples which produce the smallest approximation bound A q . The treatment effect estimate derived from set S BEST q is denotedθ q . We define the event C as the event that the treatment effect estimateθ q satisfies the approximation bound A q at every phase q throughout Algorithm 2: By Theorem 2.1, for event C, the failure probability P [¬C] ≤ δ. Furthermore, we assume here that Therefore, since |θ| ≤ 1, we have: In order to lower bound the last line above, we marginalize b (ut) , C] based on four possible ranges which θ lies on: Because A q is the smallest approximation bound derived from samples collected over any phase q of Algorithm 2 (including the initial sample set S 0 ), the following holds: Conditional on C, |θ −θ q | < A q . Thus, we may reduce two of the terms in Equation (42) above, based on the structure of Algorithm 2. First, from the first term in Equation (42), note that invokes the stopping criterion for the while loop in Algorithm 2. Thus, type u t 's preferred action a (ut) must have been eliminated from the race before phase q = 1 and the unpreferred action b (ut) is recommended almost surely throughout Algorithm 2, i.e. P πt,P (u t ) Second, from the last term in Equation (42), note that by phase q = 1 and the unpreferred action b (ut) is recommended almost never, i.e. Substituting these probabilities back into Equation (42), we proceed: Putting everything together, we get that Therefore, so long as any agent of type u t will comply with recommendations from Algorithm 2. Lemma D.1 (Algorithm 2 Full Compliance). Suppose that some fraction p c > 0 of agents is compliant from the beginning of Algorithm 2 and assume that p c < 1. Of all types u which were not compliant from the beginning, let type u * agents be the most resistant to compliance. Suppose that phase q satisfies one of the following bounds (depending on whether type u * agents prefer control or treatment): for some τ ∈ (0, 1) and any δ ∈ (0, 1). Then, with probability at least 1 − δ, for any phase q greater or equal to the following lower bound all agents will comply with recommendations from Algorithm 2. Proof. First, recall that the set S q is made up of the input samples S 0 plus samples collected following Algorithm 2 over all phases up to q. Let S q−0 = (x i , y i , z i ) 2hq i=1 denote S q sans S 0 (i.e. just samples collected following Algorithm 2 up to phase q). Note that for any δ ∈ (0, 1), the approximation bound A q ≤ A(S q−0 , δ). We want to prove that type u * is compliant by and beyond phase q. By Lemma 4.2, it suffices to prove that the approximation bound A q satisfies the following upper bound with probability at least 1 − δ: 21 for any δ ∈ (0, 1) and some τ ∈ (0, 1). In order to prove this, recall that each phase q of Algorithm 2 is 2hq rounds long and the mean recommendationz = 1 2 . By assumption, p c proportion of agents are compliant. 
Thus, by Theorem B.2, with probability 1 − δ for any δ ∈ (0, 1), the set S q−0 satisfies: By assumption, q satisfies the following lower bound for some τ ∈ (0, 1): Substituting these lower bound values for q in A(S q−0 , δ), we get that the approximation bound A q satisfies the following inequalities (since A q ≤ A(S q−0 , δ)): Finally, after Algorithm 2 has become compliant for both types of agents, we achieve the following accuracy guarantee for the final treatment estimateθ S . Suppose sample set S = (x i , y i , z i ) |S| i=1 is collected from Algorithm 2 during |S| rounds when all agents comply. We form estimateθ S of the treatment effect θ. For any δ ∈ (0, 1), with probability at least 1 − δ, Proof. We assume Algorithm 2 is initialized and allowed to run long enough such that both types 0 and 1 become compliant at some point. collected during these rounds (from i to |S|)), we form an estimateθ S of the treatment effect θ. By Theorem 2.1, this estimate satisfies the following bound with probability at least 1 − δ for any δ ∈ (0, 1): . Recall thatz = 1 2 throughout Algorithm 2. Then, by Theorem B.2, the denominator of the bound in Equation (43) above satisfies the following bound: Therefore, by Equations (43) and (44), the confidence interval |θ S − θ| satisfies the following upper bound: Lemma 5.2 (Lower bound on for Type u Compliance in Algorithm 2). Recall that S denotes the samples collected from the second stage of Algorithm 1. Let S be the input samples S 0 in Algorithm 2. Assume that p c 1 proportion of agents in the population are compliant with recommendations of Algorithm 1 and length satisfies: for some τ ∈ (0, 1) and where κ 1 := 8σg √ 2 log(5/δ) pc 1 ρ(1−ρ) and κ 2 := (3 − ρ) ρ log(5/δ) 2(1−ρ) for any δ ∈ (0, 1). Then any agent of type u will comply with recommendations of Algorithm 2. Proof. By assumption, Algorithm 1 is initialized so that agents of type 0 comply and we collect S samples from the second stage. Then, for any δ ∈ (0, 1) and some τ ∈ (0, 1), approximation bound A(S , δ) satisfies: (by Equation (6)) Thus, by Lemma 4.2, if we let the samples S collected from the second stage of Algorithm 1 be the input samples S 0 in Algorithm 2, i.e. S 0 = S , then that agents of type 0 will comply with recommendations of Algorithm 2. Theorem 4.3 (Treatment Effect Confidence Interval from Algorithm 2 with Partial Compliance). of |S| samples collected from Algorithm 2 where p c is the fraction of compliant agents in the population, we form an estimateθ S of the treatment effect θ. With probability at least 1 − δ, for any δ ∈ (0, 1), where σ g is the variance of g (u i ) . Proof. First, Theorem 2.1 demonstrates, for any δ 1 ∈ (0, 1), with probability at least 1 − δ 1 that the approximation bound Next, recall that the mean recommendationz = 1 2 throughout Algorithm 2. We assume Algorithm 2 to be initialized with parameters such that its recommendations are compliant for agents of type 0. In the worst case, only type 0 agents are compliant. Therefore, Theorem B.2 implies that, for any δ 2 ∈ (0, 1), with probability at least 1 − δ 2 that With a union bound over Equations (45) and (46) while letting δ 1 = δ 2 = δ 3 for any δ ∈ (0, 1), we conclude: with probability at least 1 − δ, A(S, δ) ≤ 8σ g 2 log(3/δ) p 0 |S| − log(3/δ) E Missing Regret Proofs for Section 5 Lemma 5.4 (Pseudo-regret). 
The pseudo-regret accumulated from policy π c is bounded for any θ ∈ [−1, 1] as follows, with probability at least 1 − δ for any δ ∈ (0, 1): for sufficiently large time horizon T , where the length of Algorithm 1 is L 1 = + 2 max 0 p 0 , 1 p 1 . Proof. Recall that the clean event C, as defined in the proof of Lemma 4.2, entails that the approximation bound over all rounds. If event C fails (i.e. ¬C holds), then the pseudo-regret may only be bounded by the maximum possible value, which is at most T |θ|. Assume now that C holds for every round L 2 in Algorithm 2. Then, the absolute value of the treatment |θ| ≤ |θ S L 2 | + A(S L 2 , δ) where A(S L 2 , δ) is the approximation bound based on the samples S L 2 collected from L 2 rounds of Algorithm 2. Before the stopping criterion of Algorithm 2 is invoked, we also have |θ S L 2 | ≤ A(S L 2 , δ). Hence, the treatment effect absolute value satisfies the following inequalities: and, to carry on: ⇒ L 2 ≤ 2048σ 2 g log(5T /δ) p 2 c 2 |θ| 2 Therefore, a winner is declared -i.e. the treatment effect is definitively either positive or not-by the following round of Algorithm 2: After this, the winner is recommended for the remainder of the rounds and (because we assume event C holds) no regret is accumulated for the remaining rounds (until time horizon T ). Note that, by the length the assumption that L 2 ≥ 200 log(5T /δ) p 2 c 2 holds trivially for L * 2 if |θ| ≤ 16σg 5 . Otherwise, we simply assume that L 2 ≥ 200 log(5T /δ) p 2 c 2 holds. We may incorporate this bound into a necessary lower bound on the time horizon T , such that we assume T ≥ L 1 + L 2 ≥ L 1 + 200 log(5T /δ) p 2 c 2 . Now, we demonstrate the amount of regret accumulated during these n * rounds of Algorithm 2 before a winner is declared. During each phase Algorithm 2, each control and treatment each get recommended n * /2 times. Furthermore, recall that p c 2 fraction of agents are compliant throughout all rounds of Algorithm 2. Without loss of generality, assume that these agents initially prefer control and the rest of the 1 − p c 2 fraction of agents prefer treatment (and do not comply). Then, if the treatment θ < 0, then in expectation over the randomness of the arrival of agents, the regret is on average (1 − p c 2 /2)|θ|. Then, the total accumulated regret throughout Algorithm 2 in policy π c is given as such: On the other hand, if treatment θ ≥ 0, then on average, the regret for each phase is p c 2 |θ|/2, then the total accumulated regret for Algorithm 2 is given: Observe that the pseudo-regret for each round t of the policy π c over the entire T rounds is at most that of Algorithm 2 plus |θ| per round of Algorithm 1. Recall that, by policy π c , there are L 1 total rounds of Algorithm 1. Alternatively, we can also upper bound the total pseudo-regret by |θ| per each round. Therefore, the total accumulated pseudo-regret for policy π c once we reach the time horizon T is bounded as follows. If the treatment effect θ ≥ 0, then the total regret of policy π c satisfies the following bound with probability at least 1−δ, for any δ ∈ (0, 1) which satisfies the condition that L 1 ≥ We can solve for |θ| in terms of T at the point when T |θ| is the better regret: Substituting this expression for |θ| back into our expression for the total pseudo-regret, we get the following: The L 1 ρ regret for Algorithm 1 in the last line above is given because |θ| ≤ 1. 
Following a similar analysis, if the treatment effect θ < 0, then the total pseudo-regret accumulated following policy π c satisfies the following bound with probability at least 1 − δ for any δ ∈ (0, 1) which satisfies the condition that L 1 ≥ Note that (as stated above) this regret holds only if the time horizon T is sufficiently large such that T ≥ L 1 + 200 log(5T /δ) Lemma 5.5 (Regret). Policy π c achieves regret as follows: for sufficiently large time horizon T . Proof. We can set parameters δ, 0 , 1 , , and ρ in terms of the time horizon T , in order to both guarantee compliance throughout policy π c and to obtain sublinear (expected) regret bound relative to T . First, to guarantee sublinear expected regret, we must guarantee that δ = 1/T 2 . To meet our compliance conditions for Algorithm 2, we must set for some τ . These may be expressed as conditions on the time horizon T : for any δ ∈ (0, 1) which satisfies the above compliance conditions, we set T sufficiently large to satisfy the following condition: Second, recall that p 0 and p 1 denote the fractions in the population of agents who are never-takers and always-takers, respectively. Furthermore, recall that p c 1 and p c 2 denote the fractions of agents who comply with Algorithm 1 and Algorithm 2, respectively. Assume that the length of the first stage of Algorithm 1 is non-zero and the exploration probability ρ is set to be small enough in order to guarantee compliance throughout Algorithm 1. The length L 1 of Algorithm 1 must be sufficiently large so that p c fraction of agents comply in Algorithm 2, as well. However, in order to guarantee sublinear regret, we also need that Recall that the clean event C, as defined in the proof of Lemma 4.2, entails that the approximation bound over all rounds. This event C holds with probability at least 1 − δ for any δ ∈ (0, 1). Conditional on the failure event ¬C, policy π c accumulates at most linear pseudo-regret in terms of T , i.e. T |θ|. Thus, in expectation it accumulates at most T |θ|δ regret. Then, with the above assumptions on T in mind, the expected regret of policy π c is: Therefore, assuming that all hyperparameters δ, 0 , 1 , , and ρ are set to incentivize compliance of some nonzero proportion of agents throughout π c and assuming that T is sufficiently large so as to satisfy both Equations (51) and (52) above, policy π c achieves sublinear regret. We now consider a general setting for the sequential game between a social planner and a sequence of agents over T rounds, as first mentioned in Section 2. In this setting, there are k treatments of interest, each with unknown treatment effect. In each round t, a new agent indexed by t arrives with their private type u t drawn independently from a distribution U over the set of all private types U . Each agent t has k actions to choose from, numbered 1 to k. Let x t ∈ R k be a one-hot encoding of the action choice at round t, i.e. a k-dimensional unit vector in the direction of the action. For example, if the agent at round t chooses action 2, then x t = e 2 = (0, 1, 0, · · · , 0) ∈ R k . Additionally, agent t receives an action recommendation z t ∈ R k from the planner upon arrival. After selecting action x t ∈ R k , agent t receives a reward y t ∈ R, given by where g (ut) t denotes the confounding baseline reward which depends on the agent's private type u t . Each g (ut) t is drawn from a sub-Gaussian distribution with a sub-Gaussian norm of σ g . 
The social planner's goal is to estimate the treatment effect vector θ ∈ R k and maximize the total expected reward of all T agents. History, beliefs, and action choice. As in the body of the paper, the history H t is made up of all tuples (z i , x i , y i ) over all rounds from i = 1 to t. Additionally, before the game starts, the social planner commits to recommendation policy π, which is known to all agents. Each agent also knows the number of the round t when they arrive. Their private type u t maps to their prior belief P (ut) , which is a joint distribution over the treatment effect θ and noisy error term g (u) . With all this information, the agent t selects the action x t which they expect to produce the most reward: As in the body of the paper, we view the planner's recommendations as instruments and perform instrumental variable (IV) regression to estimate θ. IV Estimator for k > 1 Treatments Our mechanism periodically solves the following IV regression problem: given a set S of n observations (x i , y i , z i ) n i=1 , compute an estimateθ S of θ. We consider the following two-stage least square (2SLS) estimator: where (·) −1 denotes the pseudoinverse. To analyze the 2SLS estimator, we introduce a compliance matrix of conditional probabilities that an agent chooses some treatment given a recommendationΓ, given as proportions over a set of n samples S = (x i , z i ) n i=1 , where any entry inΓ is given as such: Then, we can write the action choice x i as such: where η i = x i −Γz i . Now, we can rewrite the reward y i at round i as such: This formulation allows us to express and bound the error between the treatment effect θ and its IV estimateθ in Theorem 6.1. F.3 Proof of Theorem 6.1 Theorem 6.1 (Many Treatments Effect Approximation Bound). Let z 1 , . . . , z n ∈ {0, 1} k be a sequence of instruments. Suppose there is a sequence of n agents such that each agent i has private type u i drawn independently from U, selects x i under instrument z i and receives reward The approximation bound A(S, δ) is given as such: 22 , and the IV estimator given by Equation (11) satisfies with probability at least 1 − δ for any δ ∈ (0, 1). Proof. Given a sample set S = (x i , y i , z i ) n i=1 of size n, we form an estimate of the treatment effectθ S via Two-Stage Least Squares regression (2SLS). In the first stage, we regress y i onto z i to get the empirical estimateβ S and x i onto z i to getΓ S as such: Now, note that by definition θ = (Γ) −1 β. In the second stage, we take the inverse ofΓ timesβ as the predicted causal effect vectorθ S , i.e. The operator σmin(·) denotes the smallest singular value. Hence, the L2-norm of the difference between θ andθ S is given as: in the following lemma F.1. Lemma F.1. For any δ ∈ (0, 1), with probability at least 1 − δ, we have Proof. Recall that the baseline reward g (u) is an independently distributed random variable which, by assumption, has a mean of zero, i.e. E [g (u) ] = 0. Because of these properties of g (u) , with probability at least 1 − δ for any δ ∈ (0, 1), the numerator above satisfies the following upper bound: by Corollary A.2 and, by assumption, E [g (u) ] = 0) ≤ k j=1 σ g 2n j log(k/δ) 2 (by a Theorem A.7 where δ j = δ k for all j) ≤ k σ g 2n log(k/δ) 2 (since n j ≤ n for all j) = σ g 2nk log(k/δ) This recovers the stated bound and finishes the proof for Theorem 6.1. Next, we demonstrate a lower bound which the denominator of the approximation bound A(S, δ) in Theorem 6.1 equals O(1/ |S|), where |S| is the size of sample set S. 
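To make the k-treatment estimator concrete, here is a short NumPy sketch of one way to read the 2SLS estimator in Equation (11): with one-hot recommendations and actions, the estimate can be computed by applying the pseudoinverse of Σ_i z_i x_i^T (the matrix whose smallest singular value appears in the denominator of A(S, δ)) to Σ_i z_i y_i. The function name and array conventions are ours, and the sketch is an illustrative reading of the estimator analyzed below, not the authors' code.

```python
import numpy as np

def iv_estimate_k(X, Y, Z):
    """Sketch of the 2SLS estimator for k treatments.

    X, Z: (n, k) arrays of one-hot action choices and recommendations.
    Y:    length-n array of observed rewards.
    Writing the estimate as (sum_i z_i x_i^T)^+ (sum_i z_i y_i) matches the
    error decomposition used for Theorem 6.1; it is an illustrative reading
    of Equation (11), not the authors' exact implementation.
    """
    ZX = Z.T @ X            # sum over i of z_i x_i^T, a k-by-k matrix
    Zy = Z.T @ Y            # sum over i of z_i y_i, a length-k vector
    return np.linalg.pinv(ZX) @ Zy
```

For k = 1 with scalar 0/1 actions and recommendations, this reduces to Σ_i z_i y_i / Σ_i z_i x_i, the uncentered analogue of the Wald estimator used in the binary setting.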
Theorem F.2 (Treatment Effect Confidence Interval for General k Treatments). Let z 1 , . . . , z n ∈ {0, 1} k be a sequence of instruments. Suppose there is a sequence of n agents such that each agent i has private type u i drawn independently from U, selects x i under instrument z i and receives reward y i . Assume that each agent initially prefers treatment 1, i.e. x = e 1 . Let sample set S = (x i , y i , z i ) n i=1 . Let r be the proportion of recommendations for each treatment j > 1 and let 1 − (k − 1)r be the proportion of recommendations for treatment 1. Let p c fraction of agents in the population of agents be compliant over the rounds from which S is collected. For any δ ∈ (0, 1), if n ≥ rp 2 c log(k/δ) , then the approximation bound A(S, δ) is given as such: and the IV estimator given by Equation (11) satisfies with probability at least 1 − δ, where α > 0 is a constant of proportionality given in Claim F.3 below. Proof. Note that Theorem 6.1 holds in this case and it suffices to demonstrate that the denominator is bounded by 1 αn . Claim F.3 (Proportionality of the Denominator of the Approximation Bound for k Treatments). Given all assumptions in Theorem F.2 above, the denominator σ min ( n i=1 z i x i ) of the approximation bound A(S, δ) is positive and increases proportionally to n. Formally, σ min ( n i=1 z i x i ) = Ω(n). Proof. Recall that we assume that every agent initially prefers treatment 1. Thus, whenever agent is recommended any treatment greater than 1 and does not comply, the agent takes treatment 1. At any round i, if agent i is always compliant, then x i = z i ; if not, then x i = e 1 . (If z i = e 1 , then x i = z i = e 1 always.) Furthermore, at any round i when x i = z i , the outer product z i x i = diag(z i ), i.e. a diagonal matrix where the diagonal equals z i . If x i = e 1 , then the outer product which is a k × k matrix where the first column is z i and all other entries are 0. Thus, as long as we have at least one sample of each treatment, i.e. at least one round i where x i = z i = e j for all 1 ≤ j ≤ k, then the sum n i=1 z i x i is a lower triangular matrix with all positive entries in the diagonal. To illustrate this, let A denote the expected mean values of the sum n i=1 z i x i , such that Note that Furthermore, let denote the empirical approximation of A over our n samples, given as such: where for any j ≥ 2, the empirical proportion of agents who comply with the recommended treatment j is denoted asp c,j . Note that, since E [p c,j ] = p c for all j ≥ 2, the expected value E  = A. We may bound the difference betweenp c,j and p c with high probability, based on the number of times each treatment j is recommended, which is rn. Over n samples, with probability at least 1 − δ j for any δ j ∈ (0, 1) for any treatment j, the proportionp c,j satisfies the following: 2rn . In order for this bound to hold for all j ≥ 2, let δ 2 = δ 3 = · · · = δ k = δ/k. Then, by a union bound, with probability 1 − δ for any δ ∈ (0, 1), the boundp c,j ≥ p c − log(k/δ) 2rn holds simultaneously for all 2 ≤ j ≤ k. Thus, for any δ ∈ (0, 1) and n ≥ rp 2 c log(k/δ) , each entry in the diagonal of is positive. Thus, (since it is a triangular matrix) the eigenvalues of equal the entries in the diagonal and are all positive. Furthermore, because rank(Â) = rank(  ), the singular values of are all positive, as well. Thus, for n ≥ rp 2 c log(k/δ) , the minimum singular value where α = σ min  > 0 is some (possibly small) constant of proportionality. 
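The linear growth of σ_min(Σ_i z_i x_i^T) established in Claim F.3 can be checked numerically. The sketch below simulates the claim's setting under assumed parameter values (k = 4 arms, recommendation proportion r = 0.1 for each arm j > 1, compliance rate p_c = 0.4, all illustrative choices of ours): every agent prefers treatment 1 and falls back to it when non-compliant, so the accumulated matrix is lower triangular with a positive diagonal, and its minimum singular value scales roughly linearly with n.

```python
import numpy as np

rng = np.random.default_rng(0)

def min_singular_value(n, k=4, r=0.1, p_c=0.4):
    """Simulate Claim F.3's setting under assumed parameters: every agent
    prefers treatment 1, complies with probability p_c, and each arm j > 1
    is recommended with proportion r. Returns sigma_min(sum_i z_i x_i^T)."""
    probs = np.full(k, r)
    probs[0] = 1.0 - (k - 1) * r                 # share of treatment-1 recommendations
    recs = rng.choice(k, size=n, p=probs)        # recommended arm index per round
    comply = rng.random(n) < p_c
    acts = np.where((recs == 0) | ~comply, 0, recs)   # non-compliers fall back to arm 1
    M = np.zeros((k, k))
    for rec, act in zip(recs, acts):
        M[rec, act] += 1.0                       # accumulate the outer products z_i x_i^T
    return np.linalg.svd(M, compute_uv=False).min()

for n in (200, 2000, 20000):
    print(n, round(min_singular_value(n), 1))    # grows roughly linearly with n
```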
Thus, by Claim F.3 and Theorem 6.1, the approximation bound Corollary F.4 (Treatment Effect Confidence Interval for General k Treatments). Given all assumptions in Theorem F.2, plus the assumptions that the minimum compliance rate for any arm is at least 1/k and the minimum proportion of treatment 1 recommendations is at least 1/k, for any δ ∈ (0, 1), with a large enough sample size n, the approximation bound A(S, δ) is given as such: Proof. Note that Claim F.3 holds in this case and it suffices to demonstrate that the α is bounded by 1 k . We focus on the denominator σ min ( n i=1 z i x i ) of the approximation bound A(S, δ). Note that since z i and x i are one-hot encoded vectors, we have: where ∀j : v j1 = r k (1 − p j ) is the probability of getting a treatment 1 sample when the recommendation is j > 1, the term v 11 = 1 − r is the probability of recommending treatment 1 (since we assume agents always comply with treatment 1 recommendations), and the term v jj = r k p j is the probability of getting a treatment j sample when the recommendation is j > 1. By definition, we can write the denominator term squared as: Also, without loss of generality, assume that a 1 > 0 and ∀j > 1 : a j ≤ 0. Substituting the expression above with algorithm-specific variables, we have: where the second line is direct substitution, the third line comes from lower bounding all p 2 j terms with the minimum compliance rate p 2 min and lower bounding p j (1 − p j ) by 1/4. The last line comes from the fact that a = 1 and from applying Lemma A.5 on the last term. Since we assume that the probability of recommending treatment 1 is 1 − r ≥ 1 k , we have: Since we assume that the minimum compliance rate p min ≥ 1 k , we have: Therefore, we have α = 1 k and We apply Theorem A.6 to this matrix to get that, with probability at least 1 − δ, for δ ∈ (0, 1): Hence, we have the approximation bound A(S, δ) for Algorithm 4 is given as We assume that every agent -regardless of type-shares the same prior ordering of the treatments, such that all agents prior expected value for treatment 1 is greater than their prior expected value for treatment 2 and so on. First, Algorithm 3 is a generalization of Algorithm 1 which serves the same purpose: to overcome complete non-compliance and incentivize some agents to comply eventually. The incentivization mechanism works the same as in Algorithm 1, where we begin by allowing all agents to choose their preferred treatment -treatment 1-for the first rounds. Based on the samples collected from the first stage, we then define a number of events ξ (u) j -which are similar to event ξ from Algorithm 1-that each treatment j ≥ 2 has the largest expected reward of any treatment and treatment 1 has the smallest, according to the prior of type u: where C = σ g 2 log(3/δ) + 1 4 for any δ ∈ (0, 1) and whereȳ 1 denotes the mean reward for treatment 1 over the samples of the first stage of Algorithm 3. Thus, if we set the exploration probability ρ small enough, then some subset of agents will comply with all recommendations in the second stage of Algorithm 3. Second, Algorithm 4 is a generalization of Algorithm 2, which is required to start with at least partial compliance and more rapidly and incentivizes more agents to comply eventually. The incentivization mechanism works the same as in Algorithm 1, where we begin by allowing all agents to choose their preferred treatment -treatment 1-for the first rounds. 
Based on the samples collected from the first stage, we then define a number of events -which are similar to event ξ from Algorithm 1-that each treatment j ≥ 2 has the largest expected reward of any treatment and treatment 1 has the smallest. Thus, if we set the exploration probability ρ small enough, then some subset of agents will comply with all recommendations in the second stage of Algorithm 3. Corollary F.5. (Pairwise Treatment Effect Confidence Interval for General k Treatments) Given all assumptions in Corollary F.4, the pairwise approximation bound between any two particular arms a, b is given as whereθ a andθ b are the IV estimate for the treatment effect of arm a and arm b, respectively. Proof. We have: This recovers the stated bound and we only pay a small constant ( √ 2) to obtain a pairwise approximation bound from our IV estimator. Theorem F.6. Let G vw denote the gap between the causal effects of any arms v and w, i.e. G vw := θ v − θ w and let G v denote the smallest gap for arm v, i.e. Let A vw (S, δ) denote a high-probability upper bound (with probability at least 1 − δ) on the difference between the true gap G vw (for causal effects θ v and θ w for arms v and w) and its estimate G vw based on the sample set S, i.e. Furthermore, let A v (S, δ) denote a high-probability upper bound on the difference between the true minimum gap G v for arm v and its empirical estimate G v based on sample set S, i.e. where w min = argmin w =v θ v − θ w . For shorthand, let A v q denote the best (i.e. smallest) approximation bound A v (S BEST q , δ) by phase q. . Any agent at time t with type u t will comply with recommendation z t = e v for arm v from policy π t according to Algorithm 4, if the following holds for some τ ∈ (0, 1): Proof. Any agent at time t with type u t will comply with an arm v recommendation z t = e v from policy π t following Algorithm 4, if the following holds: For any two treatments v, w ∈ B, We will prove a stronger statement: We can prove this in largely the same way as we proved Lemma 4.2 in Appendix D.1: we simply replace θ and A v q in the proof for Lemma 4.2 with G v and A v q , respectively. The clean event C is given as: By Corollary F.4, for event C, the failure probability P [¬C] ≤ δ. We assume that Next, we marginalize E P (u t ) ,πt [G v |z t = e v , C] P P (u t ) ,πt [z t = e v , C] based on four possible ranges 48 which G v lies on: Because A v q is the smallest approximation bound derived from samples collected over any phase q of Algorithm 4 (including the initial sample set S 0 ), the following holds: which invokes the stopping criterion for the while loop in Algorithm 4. Thus, all other arms must have been eliminated from the race before phase q = 1 and arm v is recommended almost surely throughout Algorithm 4, i.e. P πt, Substituting in these probabilities (and substituting minimum possible expected values), we proceed: Putting everything together, we get that Thus, as long as A v (S 0 , δ) ≤ τ P π,P (u) [G v ≥ τ ]/4, any agent of type u will comply with a recommendation of arm v from recommendation policy π according to Algorithm 4. Finally, we present the (expected) regret from the k treatment extension of policy π c given in Definition 6.3. Lemma 6.6 (Regret of Policy π c for k Treatments). An extension of policy π c achieves (expected) regret as follows: for sufficiently large time horizon T . Proof. Let θ * be the best treatment effect overall and the gap between θ * and any treatment effect θ i be ∆ i = |θ * − θ i |. 
Recall that the clean event C entails that the approximation bound holds for all rounds. If event C fails, then we can only bound the pseudo-regret by the maximum value, which is at most T min i ∆ i . For the rest of this proof, assume that the event C holds for every round of Algorithm 4. This proof follows the standard technique from Even-Dar et al. [2006] . Since C holds, we have ∆ i ≤ A(S L 2 , δ) + |θ * −θ i | for any treatment i, where A(S L 2 , δ) is the approximation bound based on S L 2 samples of Algorithm 4. Before the stopping criteria is invoked, we also have |θ * −θ i | ≤ A(S L 2 ,δ ). Hence, the gap between the best treatment effect and any other treatment effect is: where σ g is the variance parameter for the baseline reward g (u) and α 2 is defined as in Claim F.3 relative to the proportion r 2 = 1/|B| of recommendations for each treatment during Algorithm 4 and the proportion of compliant agents p c 2 . Hence, we must have eliminated treatment i by round L 2 = 8kσ 2 g log(2kT /δ) ∆ 2 i 1 k 2 − log(k/δ) , assuming that L 2 ≥ p 2 c k log(k/δ) (in order to satisfy the criterion for Theorem F.2). During Algorithm 4, the social planner gives out |B| recommendations for each treatment i ∈ B sequentially. Hence, the contribution of each treatment i for each phase is ∆. Conditioned on event C, the treatment a * at the end of Algorithm 4 is the best treatment overall; so, no more regret is collected after Algorithm 4 is finished. If treatment 1 is not the winner, then we accumulate R 1 (T ) = ∆ i ((1 − kρ)L 1 + L 2 /k) regret for treatment 1. If some other treatment i > 1 is not the winner, then we accumulate R i (T ) = ∆ i (ρL 1 + L 2 /k) regret for treatment i. Hence, the total regret accumulated in Algorithm 4 is: Observe that the pseudo-regret of the combined recommendation policy is at most that of Algorithm 4 plus ∆ = min i ∆ i per each round of Algorithm 3. Alternatively, we can also upper bound the regret by ∆ per each round of the combined recommendation policy. Following the same argument as Lemma 5.4, we can derive the pseudo-regret of the policy π c for k treatments: R(T ) ≤ min L 1 (1 − ρ)∆ i + 8(k − 1)σ 2 g log(2kT /δ) ∆ i 1 k 2 − log (k/δ) , T ∆ ≤ L 1 + O(k kT log(kT /δ)). For the expected regret, we can set the parameters δ and L 1 in terms of the time horizon T , in order to both guarantee compliance throughout policy π c and to obtain sublinear expected regret bound relative to T . First, we must guarantee that the failure probability δ in Algorithm 4 is small, i.e. δ = 1/T 2 . To meet our compliance condition for Algorithm 4, we must set δ ≤ τ P P (u) [θ ≥ τ ] 2(τ P P (u) [θ ≥ τ ] + 1) for some constant τ ∈ (0, 1). Hence, we can set T sufficiently large such that, for any δin(0, 1), we have We also recall that the length L 1 of Algorithm 3 needs to be sufficiently large so that p c 2 fraction of agents comply in Algorithm 4. Moreover, we accumulate linear regret in each round of Algorithm 3. Hence, in order to guarantee sublinear regret, we also require that T satisfies the following: T ≥ L 2 1 = ( + /ρ) 2 Finally, recall that the clean event C in Algorithm 4 holds with probability at least 1 − δ for any δ ∈ (0, 1). Conditioned on the failure event ¬C, policy π c accumulates at most linear pseudo-regret in terms of T . 
Thus, in expectation, it accumulates at most T max i,j |θ i − θ j |δ regret. Therefore, we can derive the expected regret of the k-treatment recommendation policy π c accordingly. Assuming that all hyperparameters δ, L 1 are set to incentivize compliance for some nonzero proportion of agents throughout π c and assuming that T is sufficiently large so as to satisfy the conditions above, policy π c (for k treatments) achieves sublinear regret.

In this section, we present additional experiments to evaluate Algorithm 1 and Algorithm 2, which were omitted from Section 7. Our code is available here: https://github.com/DanielNgo207/Incentivizing-Compliance-with-Algorithmic-Instruments. We are interested in (1) the effect of different prior choices on the exploration probability ρ, (2) comparing the approximation bound in Algorithm 1 to that of Algorithm 2, and (3) the total regret accumulated by the combined recommendation policy. First, we observed that the exploration probability ρ in Figure 2 is small, leading to slow improvement in the accuracy of Algorithm 1. Since ρ depends on the event ξ (as defined in Equation (5)), we want to investigate whether changes in the agents' priors would increase the exploration probability. Second, we claimed earlier in the paper that the estimation accuracy improves much more quickly in Algorithm 2 than in Algorithm 1. This improvement motivates the social planner to move to Algorithm 2, provided that a large enough portion of the agents comply with the recommendations. Finally, while we provide a regret guarantee in Lemma 5.4, it is not immediately clear how the length L 1 of Algorithm 1 affects the overall regret. There is a tradeoff: if we run Algorithm 1 for a small number of rounds, then it does not add much to the regret, but a portion of the agents in Algorithm 2 may not comply. For our combined recommendation policy, we run Algorithm 1 until it is guaranteed that type 0 agents will comply in Algorithm 2.

Experimental Description. For Algorithm 1, we consider a setting with two types of agents: type 0, who are initially never-takers, and type 1, who are initially always-takers. For Algorithm 2, we consider a setting with two types of agents: type 0, who are compliant, and type 1, who are initially always-takers. We let each agent's prior on the treatment effect be a Gaussian distribution truncated to [−1, 1]. The noisy baseline reward g (ut) t for each type u of agents is drawn from a Gaussian distribution N (µ g (u) , 1), with its mean µ g (u) also drawn from a Gaussian prior. We let the two types of agents have equal proportions in the population, i.e. p 0 = p 1 = 0.5. For the first experiment, we are interested in how the exploration probability ρ varies with different prior parameters, namely the difference between the mean baseline rewards µ g (1) − µ g (0) and the variance of the Gaussian prior on the treatment effect θ. Similar to the experiment in Section 7, we use Monte Carlo simulation by running the first stage of Algorithm 1 with varying choices of the two prior parameters above. From these initial samples, we calculate the probability of event ξ and, subsequently, the exploration probability ρ. For the second experiment, we are interested in finding when agents of type 1 also comply with the recommendations. This shift in compliance depends on a constant τ (as defined in Lemma 4.2).
We find two values of the constant τ that minimize the number of samples needed to guarantee that agents of type 0 and type 1 are compliant in Algorithm 2 (as defined in Lemma 5.2). After this, Algorithm 2 is run for an increasing number of rounds. Similar to the Algorithm 1 experiment, we repeatedly calculate the IV estimate of the treatment effect and compare it to the naive OLS estimate over the same samples as a benchmark. In a separate experiment, we evaluate the combined recommendation policy by running Algorithm 1 and Algorithm 2 successively using the priors above. We calculate the accumulated regret of this combined policy using the pseudo-regret notion (as defined in Definition 5.3).

Table 2: Upper bounds on the exploration probability ρ to incentivize partial compliance, for different variances in the prior over the treatment effect θ.

Results. In Table 1 and Table 2, we calculate the exploration probability ρ under different initializations of the agents' priors. In Table 1, we let the mean baseline reward of type 1, µ g (1) , be drawn from N (0.5, 1) and the mean baseline reward of type 0, µ g (0) , be drawn from N (c, 1) with c ∈ [0, 1]. The gap between these priors is defined as E [µ g (1) ] − E [µ g (0) ]. We observe that the exploration probability does not change monotonically with an increasing gap between the mean baseline rewards. In Table 2, we calculate the exploration probability ρ under different variances in the prior over the treatment effect θ. As in Table 1, we observe that the exploration probability ρ does not change monotonically with increasing variance in the prior over θ. In both tables, the value of ρ lies in [0.001, 0.008], which implies infrequent exploration by Algorithm 1. This slow rate of exploration is also reflected in Figure 2, which motivates the social planner to transition to Algorithm 2.

Figure 3: Approximation bound using IV regression and OLS during Algorithm 2 with τ = 0.43. The y-axis uses a log scale. Results are averaged over 5 runs; error bars represent one standard error.

In Figure 3, we compare the approximation bound on |θ − θ̂| between the IV estimate θ̂ and the naive estimate for Algorithm 2. In our experiments, the constant τ generally lies within [0.4, 0.6]. Similar to the experiment in Section 7, we let the hidden treatment effect be θ = 0.5 and let the type 0 and type 1 agents' priors on the treatment effect be N (−0.5, 1) and N (0.9, 1), respectively, each truncated to [−1, 1]. We also let the mean baseline rewards for type 0 and type 1 agents be µ g (0) ∼ N (0, 1) and µ g (1) ∼ N (0.1, 1), respectively. With these priors, we found a suitable value of τ = 0.43 for Algorithm 2. Instead of using the theoretical lower bound from Lemma 5.2, we compare the approximation bound |θ − θ̂| against the conditions in Lemma 4.2. In Figure 3, the IV estimate consistently outperforms the naive estimate for any number of rounds. Furthermore, we observe that the scale of the IV estimate's approximation bound in Figure 3 is much smaller than that of Figure 2. This difference shows the improvement of Algorithm 2 over Algorithm 1 in estimating the treatment effect θ. Because Algorithm 1 explores so infrequently (small ρ), Algorithm 2 needs only a small number of rounds to obtain a better estimate than Algorithm 1.
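As a self-contained companion to the IV-versus-OLS comparison above, the sketch below simulates a stylized version of the Algorithm 2 setting with assumed parameters (fixed baseline means 0 and 0.1 in place of the Gaussian priors used in the experiments, hidden effect θ = 0.5, and a 50/50 split between compliant type 0 agents and type 1 always-takers). The naive OLS difference in means is biased upward because always-takers both take the treatment more often and have higher baseline rewards, while the Wald/IV estimate that uses the recommendation as an instrument recovers θ.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n, theta=0.5, p_compliant=0.5):
    """Stylized IV-vs-OLS comparison under assumed parameters: compliant
    type-0 agents follow the recommendation, type-1 always-takers pick the
    treatment and have a higher mean baseline reward, so the naive OLS
    difference in means is confounded while the Wald/IV estimate is not."""
    is_compliant = rng.random(n) < p_compliant
    z = rng.integers(0, 2, size=n)                     # recommendations, mean ~ 1/2
    x = np.where(is_compliant, z, 1)                   # always-takers choose treatment
    g = np.where(is_compliant, 0.0, 0.1) + rng.normal(0.0, 1.0, n)  # confounded baseline reward
    y = theta * x + g
    ols = y[x == 1].mean() - y[x == 0].mean()
    wald = (y[z == 1].mean() - y[z == 0].mean()) / (x[z == 1].mean() - x[z == 0].mean())
    return ols, wald

print(simulate(200_000))   # OLS is biased upward; the Wald/IV estimate is close to 0.5
```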
References

Instrumental variables methods in experimental criminological research: What, why, and how? Working Paper 314.
Instrumental variables and the search for identification: From supply and demand to natural experiments.
Two-stage least squares estimation of average causal effects in models with variable treatment intensity.
Mostly Harmless Econometrics: An Empiricist's Companion.
Finite-time analysis of the multiarmed bandit problem.
The benefits and costs of JTPA Title II-A programs: Key findings from the National Job Training Partnership Act study.
Incentivizing exploration by heterogeneous users. Conference On Learning Theory.
Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems.
Incentivizing exploration.
Honorarium or coercion: use of incentives for participants in clinical research.
KFF COVID-19 vaccine monitor.
Identification of causal effects using instrumental variables.
Bayesian exploration with heterogeneous agents.
Instrument-armed bandits.
Bayesian persuasion.
Fairness incentives for myopic agents.
Implementing the "wisdom of the crowd".
Bayesian incentive-compatible bandit exploration.
Bayesian exploration: Incentivizing exploration in Bayesian games.
Sample complexity of incentivized exploration.
Incentivizing exploration via information asymmetry.
Introduction to multi-armed bandits.
Non-compliance-or how many aunts has Matilda?

Acknowledgements

We thank the members of the Social AI Group for their comments on drafts of this work. Zhiwei Steven Wu was supported by the NSF FAI Award #1939606, a Google Faculty Research Award, a J.P. Morgan Faculty Award, a Facebook Research Award, and a Mozilla Research Grant. We also thank Nicole Immorlica and Akshay Krishnamurthy for their discussions.