key: cord-0177362-6rbthp5x
authors: Wu, Han; Wager, Stefan
title: Partial Likelihood Thompson Sampling
date: 2022-03-02
sha: 4f549fd0ffb79a2821f1f4336db8b9a11c798c39
doc_id: 177362
cord_uid: 6rbthp5x

We consider the problem of deciding how best to target and prioritize existing vaccines that may offer protection against new variants of an infectious disease. Sequential experiments are a promising approach; however, challenges due to delayed feedback and the overall ebb and flow of disease prevalence make available methods inapplicable for this task. We present a method, partial likelihood Thompson sampling, that can handle these challenges. Our method involves running Thompson sampling with belief updates determined by partial likelihood each time we observe an event. To test our approach, we ran a semi-synthetic experiment based on 200 days of COVID-19 infection data in the US.

Methods for sequential experimentation have proven powerful and versatile in a number of application areas, ranging from online advertising [Chapelle and Li, 2011] and revenue management [Ferreira et al., 2018] to website optimization [Letham et al., 2019]. These methods enable us to efficiently manage the explore-exploit tradeoff between first discovering which of a number of actions is best and then deploying it once we have identified it. One simple yet successful idea for doing so is Thompson sampling [Thompson, 1933, Russo et al., 2018], where an agent dynamically updates a belief distribution for the probability that each action they could take is best, and then performs these actions with propensity proportional to these beliefs.

An important and potentially promising application for sequential decision making is in deciding how best to use existing vaccines to target new variants of an infectious disease.
For example, in the case of the COVID-19 pandemic, a number of vaccines were developed and found to be safe and effective in protecting against the original viral strain; however, new coronavirus variants then emerged that exhibited at least partial ability to evade protection from vaccines, thus making it more difficult to contain the pandemic [Kustin et al., 2021]. In situations like this, it is of considerable value to promptly assess which of the existing vaccines (if any) offer protection against the new variant. Castillo et al. [2021] call for embedding vaccine trials on new COVID-19 variants within national vaccine rollouts, and sequential learning seems like a perfect candidate for optimizing the resulting explore-exploit tradeoff across vaccines. In this setup, we would only be comparing vaccines that have already been established as generally safe and effective, and so the goal is to discover, as quickly as possible, which vaccine is most effective in the context of interest.

The main difficulty in using adaptive experiments for vaccine trials is that such trials involve delayed feedback of a type that cannot readily be handled by available methods, including Thompson sampling [Thompson, 1933, Russo et al., 2018] or the UCB algorithm [Lai and Robbins, 1985, Auer et al., 2004]. The standard framework for sequential learning involves a tight feedback loop, where in each time step an agent chooses an action, sees the corresponding reward, and can then update their beliefs. In a vaccine trial, however, we cannot immediately assess success after an inoculation; rather, we can only wait and see whether the patient gets infected at any time before the end of the trial. There has been some recent work on adaptive trials with delayed feedback [Grover et al., 2018, Joulani et al., 2013, Zhou et al., 2019]; however, available methods cannot simultaneously address some key difficulties that arise when testing vaccines in a pandemic.
First, there is no (useful) upper bound on the delay separating an action and the corresponding reward. Study subjects could experience a negative reward (i.e., get infected) at any time between when they are enrolled in the study (and given a vaccine) and the end of the study. Second, the rate of infections depends not only on which vaccines are used, but also on the ebb and flow of the pandemic. Any method that adaptively adjusts vaccine allocation frequencies without accounting for varying baseline infection rates risks providing a biased comparison.

The main goal of this paper is to develop methods for adaptive experimentation that can handle the above challenges. Our core proposal is to extend the Thompson sampling method for sequential Bayesian learning to the proportional hazards model [Cox, 1972], which is widely used in medical statistics. In our context, as spelled out in more detail below, the proportional hazards model posits that, at time $t$ and for an as-of-yet uninfected person having received vaccine $k$, the instantaneous risk (i.e., hazard) of getting infected is of the form $h_0(t) e^{-\theta_k}$. Here, $h_0(t)$ is the baseline hazard, i.e., the time-varying instantaneous risk that an unvaccinated person gets infected, and $\theta_k$ captures the protective effect of the vaccine (the larger $\theta_k$, the better the vaccine).

The proportional hazards model is a natural fit for our setting in that it allows us to address the challenges highlighted above (i.e., unbounded delays to observed infections and time-varying baseline hazards), yet it has enough structure to enable sample-efficient learning. One celebrated property of the proportional hazards model is that we can learn about the underlying efficiency parameters $\theta_k$ via a partial likelihood in which the baseline risk $h_0(\cdot)$ gets canceled out [Cox, 1975, Efron, 1977].
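The cancellation of the baseline hazard can be checked numerically. The sketch below is only an illustration under stated assumptions: it takes the hazard form $h_0(t) e^{-\theta_k}$ from the text, and computes the share of infections expected to fall in each arm at a given instant. The function name `infection_share` and all numbers are hypothetical, chosen only for the example.

```python
import math

def infection_share(at_risk, theta, baseline_hazard):
    """Share of the instantaneous infection rate attributable to each arm.

    at_risk[k] is the number of still-uninfected participants in arm k,
    theta[k] is the arm's protective parameter (0 for a placebo), and
    baseline_hazard plays the role of h_0(t) at the current time.
    """
    rates = [o * baseline_hazard * math.exp(-th) for o, th in zip(at_risk, theta)]
    total = sum(rates)
    return [r / total for r in rates]

# Hypothetical three-arm example: placebo plus two vaccines.
at_risk = [1000, 1000, 1000]
theta = [0.0, 1.2, 3.0]

# The same arms during a quiet period and during a surge in prevalence.
quiet = infection_share(at_risk, theta, baseline_hazard=1e-5)
surge = infection_share(at_risk, theta, baseline_hazard=1e-3)
```

Because the baseline hazard multiplies every arm's rate, it drops out of the ratio, so `quiet` and `surge` agree up to floating-point error. This is the property that lets a partial-likelihood update ignore how prevalent the disease is at the moment an infection is observed.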
Our proposed approach, partial likelihood Thompson sampling (PLTS), involves running Thompson sampling with belief updates determined by partial likelihood each time we observe an event (i.e., each time an already vaccinated study participant gets infected). This differs from a Bernoulli bandit in that we do not control when events may happen, and the relevant "at risk" sample size changes with time. The resulting Bayesian problem does not have a closed-form solution for the posterior, but we find the setting to be amenable to popular methods for approximate inference with Thompson sampling, including Laplace approximation [Chapelle and Li, 2011, Russo et al., 2018]. While the use of partial likelihood for Bayesian inference in general is well established [Kalbfleisch, 1978], we are not aware of prior research on using proportional hazards modeling or partial likelihood for sequential Bayesian learning, or sequential experiments more generally.

In a semi-synthetic study using data from the COVID-19 pandemic, we find that our approach can more reliably identify the best vaccine than a classical randomized controlled trial (RCT), in which volunteers are assigned to different treatments uniformly at random throughout the trial. Our approach also considerably reduces the within-experiment regret from assigning study participants to sub-optimal vaccines.

At a high level, sequential vaccine experiments can be seen as a bandit problem with partial, delayed feedback: feedback is partial because we only ever observe negative rewards (infections), and delayed because it takes time for a study participant to potentially get infected post vaccination. There is a large amount of work on bandits with full feedback: Dudik et al. [2011] consider bandits with constant deterministic delays; Joulani et al. [2013] study a setting where delays have bounded expectation; Mandel et al. [2015] consider bounded delays in the stochastic multi-armed bandit problem; Thune et al. [2019] work with bounded delays in the nonstochastic bandit problem. Meanwhile, Vernade et al. [2017] allow for partial feedback but assume i.i.d. delays with a known distribution, and Manegueu et al. [2020] consider the same partially observable model but with the assumption that delay distributions satisfy polynomial tail bounds. Lancewicki et al. [2021] develop algorithms based on UCB and successive elimination that allow for unrestricted delay distributions. However, their bounds are vacuous in our scenario as we only observe infections (i.e., negative rewards), and the delays considered in Lancewicki et al. [2021] are assumed to be i.i.d. across time. Thus, we are not aware of existing methods studied in a setting that includes vaccine trials, where delays are unbounded and time-varying, and positive rewards are never observed. We do note, however, that the method of Thune et al. [2019] is one that, at least algorithmically, could plausibly be considered in our setting, and we use it as a baseline in our experiments; see Section 4 for details.

Proportional hazards modeling and partial likelihood are core techniques in survival analysis. Cox [1972] first proposed the proportional hazards model, while Cox [1975] and Efron [1977] further developed statistical theory for estimators based on partial likelihood. Kalbfleisch and Prentice [1973], Breslow [1974], and Efron [1977] provided alternative likelihood formulas for when the event times are discrete with multiplicity. We also note a line of work justifying the use of Bayesian methods on partial likelihood. Kalbfleisch [1978] shows that partial likelihood is a limiting marginal posterior under noninformative priors for baseline hazards. Sinha et al. [2003] further extend the result to scenarios with time-dependent covariates and time-varying regression parameters. Ibrahim et al. [2014] give a comprehensive textbook treatment of Bayesian survival analysis.
Finally, we note that using hazard rates to model the efficacy of vaccines is common in medical statistics; see for example Longini Jr and Halloran [1996], Durham et al. [1998] and Halloran et al. [1999]. Thus, the main contribution of this paper is to leverage fundamental concepts in survival analysis and classical vaccine RCTs, i.e., proportional hazards modeling and partial likelihood, to develop a new bandit algorithm suitable for adaptive vaccine trials.

We model vaccine trials as follows. At the start of the trial (i.e., at time $t = 0$), some participants are recruited to the trial and assigned to each vaccine group uniformly at random. After the initial assignment, volunteers arrive over time, and we randomize and assign them to a treatment arm as soon as they arrive. After enrollment, participants are followed until either they get infected or the study ends; any infected participants are removed from the study at the moment they are infected. Throughout, we use the following notation:

• $M_{t,k}$ = # participants assigned to arm $k$ by time $t$
• $m_{t,k}$ = # participants assigned to arm $k$ at time $t$
• $N_{t,k}$ = # observed infections in arm $k$ by time $t$
• $n_{t,k}$ = # observed infections in arm $k$ at time $t$
• $o_{t,k}$ = # participants remaining in arm $k$ at time $t$

That is, $M_{t,k}$ and $N_{t,k}$ are cumulative sums of $m_{t,k}$ and $n_{t,k}$ respectively, and $o_{t,k} = M_{t,k} - N_{t,k}$. We denote the sums of these statistics across all arms by $M_t$, $m_t$, $N_t$, $n_t$, $o_t$. We also adopt the convention that $n_{0,k} = 0$ for all $k$, since we do not observe any infection at the start of the trial, and $m_{T,k} = 0$ for all $k$, since we do not assign any new participants when we end the trial. This general model is formalized in Protocol 1.

One important case of this study design is the batched setting we consider in this paper, where there are a finite number of time points at which participants can join the study and infections can be recorded.
Specifically, at $t = 0, ..., T$, we collect $m_t$ newly arrived participants and assign them to different groups, and we also observe a vector of new infections $(n_{t,1}, ..., n_{t,K})$. At time $T$, we end the experiment. This is summarized in Protocol 2.

Protocol 1 General Vaccine Trial
Input: Length of experiment $T$, number of vaccines $K$
Assign $m_0$ participants uniformly at $t = 0$
while $t \le T$ do
    if $m_t \neq 0$ then
        Assign $m_t$ participants to vaccine groups.
    end if
    Observe a vector of infections $(n_{t,1}, ..., n_{t,K})$ and end trial for the infected participants.
end while

Protocol 2
Input: Length of experiment $T$, number of vaccines $K$
for $t = 0, 1, ..., T$ do
    Assign $m_t$ participants to vaccine groups.
    Observe a vector of infections $(n_{t,1}, ..., n_{t,K})$ and end trial for the infected participants.
end for

We model person-specific infection risk using the classical notion of a hazard rate, as follows. We assume that each of the $k = 1, ..., K$ treatment arms is characterized by a hazard rate $h_k(t)$, which captures the instantaneous risk that a person in study arm $k$ becomes infected at time $t$. Below, note that $o_{t,k}$ denotes the number of still uninfected participants in arm $k$ at time $t$, and $h_k(t)$ describes the expected fraction of these participants who will become infected in the next instant [Cox and Oakes, 1984].

Assumption 1. For each study arm $k = 1, ..., K$, there is a hazard rate $h_k(t)$ such that, for all $0 < t < T$,

$$\lim_{\delta \downarrow 0} \frac{1}{\delta}\, E_t\left[N_{t+\delta,k} - N_{t,k}\right] = h_k(t)\, o_{t,k}, \qquad (1)$$

where $E_t$ denotes expectations conditionally on information available at time $t$.

The key flexibility of Assumption 1 is that it allows infection risk to ebb and flow over time: there may be some periods where very few people from any study arm are getting infected, and others where infections are highly prevalent in some arms. However, Assumption 1 does impose non-trivial structure on the problem: for example, it implies that the length of time a patient has been in the study does not affect their risk of getting infected.

Given Assumption 1, Halloran et al. [1999] define vaccine efficiency in terms of a ratio of hazard functions. Suppose that one of the study arms (without loss of generality the first arm, $k = 1$) is a placebo that does not provide any protection against infection. Then the efficiency of the $k$-th vaccine depends on $h_k(t)/h_1(t)$.

Definition 2.1. Under Assumption 1, for each non-placebo arm $k = 2, ..., K$, the vaccine efficiency is

$$\mathrm{VE}_k(t) = 1 - \frac{h_k(t)}{h_1(t)}. \qquad (2)$$

Given our assumptions so far, the vaccine efficiency $\mathrm{VE}_k(t)$ may vary with time, which creates some potential ambiguity in defining what the best vaccine is. Our next major assumption is that vaccine efficiency does not change with time, i.e., equivalently, that the hazard functions follow the proportional hazards model of Cox [1972].

Assumption 2. For each study arm $k = 2, ..., K$, there is an efficiency parameter $\theta_k$ such that

$$h_k(t) = e^{-\theta_k}\, h_1(t) \quad \text{for all } 0 < t < T. \qquad (3)$$

Given Assumption 2, the main task of interest in assessing vaccine effectiveness is to estimate the efficiency parameter $\theta_k$: the bigger the $\theta_k$, the more effective the vaccine is. To illustrate this with a concrete example, in initial studies, Polack et al. [2020] reported that the Pfizer COVID-19 vaccine was 95% effective in preventing infection, while Baden et al. [2021] reported that the Moderna COVID-19 vaccine was 94.1% effective. In the context of our model, where $\mathrm{VE}_k = 1 - e^{-\theta_k}$, both of these point estimates correspond to an efficiency parameter $\theta \approx 3$ (since $1 - e^{-3} \approx 0.95$).

Remark 1. Here, for simplicity, we assume constant efficiency of the vaccine. This may be a reasonable assumption for experiments run on the order of months; however, for longer experiments, it may be necessary to extend the model to allow for waning effectiveness. We leave extensions to non-constant efficiency to future work.
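The correspondence between a reported efficacy and the efficiency parameter can be made concrete with a small conversion. This is a sketch under one assumption: that efficiency is one minus the hazard ratio against placebo and that the hazard ratio is $e^{-\theta}$, so that $\mathrm{VE} = 1 - e^{-\theta}$. That mapping is inferred from the surrounding text and its $\theta \approx 3$ example; the function names are hypothetical.

```python
import math

def theta_from_efficacy(ve):
    """Efficiency parameter implied by an efficacy in [0, 1), assuming VE = 1 - exp(-theta)."""
    if not 0.0 <= ve < 1.0:
        raise ValueError("efficacy must lie in [0, 1)")
    return -math.log(1.0 - ve)

def efficacy_from_theta(theta):
    """Inverse map: efficacy implied by an efficiency parameter, under the same assumption."""
    return 1.0 - math.exp(-theta)

# Point estimates quoted in the text for the two mRNA vaccines.
theta_pfizer = theta_from_efficacy(0.95)
theta_moderna = theta_from_efficacy(0.941)
```

Both values land close to 3, and `efficacy_from_theta(3.0)` rounds to 0.95, which matches the approximation used in the example above.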
One major advantage of the proportional hazards model is that it enables a simple approach to learning the efficiency parameters $\theta_k$ via partial likelihood [Cox, 1972, 1975, Efron, 1977], as follows. Let us first suppose that the infection times (event times) of our participants are continuous, as in Protocol 1, and recall that at time $t$ the number of participants in each group is characterized by the vector $(o_{t,1}, ..., o_{t,K})$, i.e., the number of participants in each group who have joined the study, been assigned a treatment, and have not yet been infected. Then, given that there is an infection at time $t$, the conditional probability that it happens to a given at-risk person in vaccine group $j$ is (let $\theta_1 = 0$):

$$\frac{e^{-\theta_j}}{\sum_{k=1}^{K} o_{t,k}\, e^{-\theta_k}}. \qquad (4)$$

The unknown baseline hazard function cancels out because of the proportional hazards assumption. Now suppose we have $J$ events (infections) happening at times $t_1 < t_2 < \cdots < t_J$, and that event $j$ happened in group $I_j$. We can then form the following partial likelihood, which is a product over all the conditional probabilities of the observed events:

$$L(\theta_2, \ldots, \theta_K) = \prod_{j=1}^{J} \frac{e^{-\theta_{I_j}}}{\sum_{k=1}^{K} o_{t_j,k}\, e^{-\theta_k}}. \qquad (5)$$

It is a partial likelihood because we ignore all non-events. It has nonetheless been shown to be efficient for estimating the hazard rate parameters [Efron, 1977].

The partial likelihood (5) obtained above assumes continuous infection times, with no ties at any single event time $t_j$. However, in Protocol 2 the event times will be $1, ..., T$, and there could be multiple infections at a single time point, even within a single vaccine group, if the hazard rate is high. Recall the definition of $(n_{t,1}, ..., n_{t,K})$, which denotes the number of infections that occurred in each vaccine group during the time interval $(t-1, t]$, and of $n_t$, which denotes the sum of infections across all vaccine groups.
In this case the exact likelihood proposed in Cox [1972] is the following:

$$\prod_{t=1}^{T} \frac{\exp\left(-\sum_{k=1}^{K} n_{t,k}\,\theta_k\right)}{\sum_{l \in R(n_t)} \exp\left(-\theta(l)\right)}, \qquad (6)$$

where $R(n_t)$ is the set of all possible sets of $n_t$ participants from the risk set $(o_{t,1}, ..., o_{t,K})$ and $\theta(l)$ is the sum of all the $\theta$ values of the individuals in set $l$. Due to its complicated form, Breslow [1974] suggests using the following approximation of the exact partial likelihood (6), and this is what we do in our approach:

$$\prod_{t=1}^{T} \frac{\exp\left(-\sum_{k=1}^{K} n_{t,k}\,\theta_k\right)}{\left(\sum_{k=1}^{K} o_{t,k}\, e^{-\theta_k}\right)^{n_t}}. \qquad (7)$$

In this section, we describe our proposed algorithm for sequential experimentation, which we call Partial Likelihood Thompson Sampling (PLTS). Thompson sampling [Thompson, 1933] is a Bayesian heuristic for sequential experiments that chooses the actions at each round according to the posterior probability that the action maximizes expected reward. This is usually implemented by sampling: we sample an instance of the environment from the posterior and take the action that maximizes the expected reward in that instance [Russo et al., 2018].

In our setting, the model parameters are the efficiency parameters $\theta_2, ..., \theta_K$ (recall we assume that $\theta_1 = 0$, i.e., that the first arm is a placebo). At each round we draw a sample of $(\theta_2, ..., \theta_K)$ from the posterior and assign our participants accordingly. We start with an uninformative prior for all the parameters, as they are potentially unconstrained. Then, following the blueprint of Thompson sampling (see Algorithm 3), we update the posterior each time we collect new data; here, we do so using the partial likelihood introduced in the previous section. The use of partial likelihood for Bayesian posterior updates is further discussed in Kalbfleisch [1978].

Given this setup, it now remains to derive an efficient posterior sampling method for assigning new participants as they arrive. Since we put an uninformative prior on all the parameters $\theta_2, ..., \theta_K$, the posterior at time $t$ given observed data $D$ will be

$$P_t(\theta_2, \ldots, \theta_K \mid D) \propto \prod_{s \le t} \frac{\exp\left(-\sum_{k=1}^{K} n_{s,k}\,\theta_k\right)}{\left(\sum_{k=1}^{K} o_{s,k}\, e^{-\theta_k}\right)^{n_s}}. \qquad (8)$$

Now, one difficulty in using (8) directly is that efficiently sampling from the posterior is non-trivial.
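To make the Breslow-style approximation more tangible, here is a minimal Python sketch of the corresponding log partial likelihood for batched data. It assumes the standard Breslow form, in which each batch contributes $-\sum_k n_{t,k}\theta_k - n_t \log \sum_k o_{t,k} e^{-\theta_k}$. The function name, the list-of-lists data layout, and the toy counts are illustrative assumptions and are not taken from the paper.

```python
import math

def breslow_log_partial_likelihood(theta, infections, at_risk):
    """Breslow-approximated log partial likelihood for batched infection counts.

    theta[k]         : efficiency parameter of arm k, with theta[0] = 0 for placebo
    infections[t][k] : new infections n_{t,k} recorded in arm k in batch t
    at_risk[t][k]    : participants o_{t,k} at risk in arm k for batch t
    """
    total = 0.0
    for n_t, o_t in zip(infections, at_risk):
        events = sum(n_t)
        if events == 0:
            continue  # batches without infections carry no information here
        denom = sum(o * math.exp(-th) for o, th in zip(o_t, theta))
        total -= sum(n * th for n, th in zip(n_t, theta))
        total -= events * math.log(denom)
    return total

# Hypothetical two-arm data (placebo, vaccine) over two batches.
infections = [[8, 2], [5, 1]]
at_risk = [[100, 100], [92, 98]]

ll_no_effect = breslow_log_partial_likelihood([0.0, 0.0], infections, at_risk)
ll_moderate = breslow_log_partial_likelihood([0.0, 1.5], infections, at_risk)
ll_extreme = breslow_log_partial_likelihood([0.0, 5.0], infections, at_risk)
```

With roughly four times as many infections in the placebo arm as in the vaccine arm, the moderate value $\theta = 1.5$ scores higher than both "no effect" and an implausibly large effect. Under a flat prior this same function is the log posterior up to an additive constant.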
For computational tractability, we thus use the popular idea of replacing the exact posterior with its Laplace approximation [Basu and Ghosh, 2020, Chapelle and Li, 2011, Gomez-Uribe, 2016, Russo et al., 2018]. The main idea is as follows. Writing $P_t(\theta_2, ..., \theta_K)$ for the partial likelihood used in (8), we see that, ignoring constant terms,

$$\log P_t(\theta_2, \ldots, \theta_K) = -\sum_{s \le t} \sum_{k=1}^{K} n_{s,k}\,\theta_k - \sum_{s \le t} n_s \log\left(\sum_{k=1}^{K} o_{s,k}\, e^{-\theta_k}\right).$$

From the above formula we see that the logarithm of the likelihood is concave (it is a linear term minus log-sum-exp terms), hence there exists a unique global maximizer of the likelihood. Laplace approximation involves approximating the posterior with a Gaussian distribution centered at the posterior mode; the inverse covariance matrix will be $-\nabla^2 \log P_t(\hat\theta_2, ..., \hat\theta_K)$, where $\hat\theta_2, ..., \hat\theta_K$ are the posterior mode. Formally, this gives us an approximate sampling distribution

$$(\hat\theta_2, \ldots, \hat\theta_K) = \operatorname{argmax}\, P_t(\theta_2, \ldots, \theta_K \mid D), \quad \hat\Sigma = -\nabla^2 \log P_t(\hat\theta_2, \ldots, \hat\theta_K), \quad P_t \approx \mathcal{N}\left((\hat\theta_2, \ldots, \hat\theta_K);\ \hat\Sigma^{-1}\right). \qquad (9)$$

Note that $\hat\Sigma$ is the inverse covariance (precision) matrix, so the Gaussian in (9) has covariance $\hat\Sigma^{-1}$. We can efficiently solve the maximization problem in (9) using any smooth convex optimization algorithm (for example, Newton's method). Here, the Hessian is readily available given the form of the likelihood, and so approximate sampling of the posterior is computationally efficient.

Now we can proceed to formulate our algorithms. Given the posterior sampling scheme described above, there are two popular ways of running Thompson sampling. The canonical way [Thompson, 1933, Chapelle and Li, 2011] samples parameters from the current posterior distribution and takes the action that maximizes the expected reward, which aims at achieving low regret [Agrawal and Goyal, 2012, Kaufmann et al., 2012, Russo and Van Roy, 2016]. In our setting, we sample $\theta_2, ..., \theta_K$ using (9) and choose the vaccine group with the maximum $\theta$. Algorithm 3 gives the details. In our experiments, we also consider a PLTS-based adaptation of the top-two Thompson sampling algorithm of Russo [2020].
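The Laplace step and a single canonical Thompson draw can be sketched in plain Python. This is a rough illustration rather than the authors' implementation: it assumes the Breslow-form log partial likelihood with a flat prior, uses undamped Newton iterations (which presuppose that every arm has at least one infection, so that the mode is finite), and samples through a Cholesky factor of the precision matrix. All function names and the toy data are hypothetical.

```python
import math
import random

def grad_and_precision(theta, infections, at_risk):
    """Gradient and negative Hessian of the Breslow log partial likelihood
    with respect to theta[1:], keeping theta[0] = 0 fixed for the placebo."""
    K = len(theta)
    grad = [0.0] * (K - 1)
    prec = [[0.0] * (K - 1) for _ in range(K - 1)]
    for n_t, o_t in zip(infections, at_risk):
        events = sum(n_t)
        if events == 0:
            continue
        w = [o * math.exp(-th) for o, th in zip(o_t, theta)]
        total = sum(w)
        p = [x / total for x in w]  # share of infection risk carried by each arm
        for j in range(1, K):
            grad[j - 1] += events * p[j] - n_t[j]
            for i in range(1, K):
                diag = p[j] if i == j else 0.0
                prec[j - 1][i - 1] += events * (diag - p[j] * p[i])
    return grad, prec

def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive definite a."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_lower(L, b):
    """Solve L y = b by forward substitution."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def solve_lower_transposed(L, y):
    """Solve L^T x = y by back substitution."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def laplace_fit(infections, at_risk, n_arms, iters=50):
    """Posterior mode via Newton's method, plus the Cholesky factor of the precision."""
    theta = [0.0] * n_arms
    for _ in range(iters):
        grad, prec = grad_and_precision(theta, infections, at_risk)
        L = cholesky(prec)
        step = solve_lower_transposed(L, solve_lower(L, grad))
        for j, s in enumerate(step):
            theta[j + 1] += s
    _, prec = grad_and_precision(theta, infections, at_risk)
    return theta, cholesky(prec)

def thompson_draw(mode, L, rng):
    """Sample theta from N(mode, precision^{-1}) and return the arm with the largest draw."""
    z = [rng.gauss(0.0, 1.0) for _ in range(len(L))]
    x = solve_lower_transposed(L, z)  # x has covariance (L L^T)^{-1}
    sample = [0.0] + [m + d for m, d in zip(mode[1:], x)]
    return max(range(len(sample)), key=lambda k: sample[k]), sample

# Hypothetical three-arm data (placebo, vaccine A, vaccine B) over two batches.
infections = [[8, 3, 1], [6, 2, 1]]
at_risk = [[100, 100, 100], [92, 97, 99]]
mode, chol = laplace_fit(infections, at_risk, n_arms=3)
arm, sample = thompson_draw(mode, chol, random.Random(0))
```

Solving $L^\top x = z$ with standard normal $z$ gives $x$ the covariance $(LL^\top)^{-1}$, i.e. the inverse of the negative Hessian, which is the Gaussian covariance used in a Laplace approximation. In a trial, `thompson_draw` would be called once per arriving participant (or batch) to pick the arm to assign.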
This adaptation, where the second best action is selected in any given round with fixed probability $\beta$, targets the best-arm identification problem. Algorithm 4 details the algorithm. The algorithm takes in a parameter $\beta$ which indicates the probability of sampling from the second best arm. We fix $\beta = 0.5$, which Russo [2020] suggests as a safe default choice.

In this section, we conduct semi-synthetic experiments motivated by real COVID-19 vaccines and case counts. Specifically, we model the baseline hazard rate using real-world COVID-19 infection data and choose the efficiency parameters according to the efficacy of some approved vaccines. We evaluate our algorithm on both tasks: incurring as few infections as possible and correctly identifying the best vaccine.

We simulate our experiment using Protocol 2 and fix the length of the experiment to be $T = 200$. For simplicity we let the number of new participants be constant at each time step, i.e., $m_t$ is a constant. Denoting the total number of volunteers by $M$, we let $m_t = M/T$. To model the baseline (or placebo) hazard rate, we use the 7-day moving average of infections in the US provided by the CDC data tracker [Centers for Disease Control and Prevention, 2021]. We pick the period of 200 days starting from March 9th, 2020. To get $h_1(t)$ we divide the daily infection numbers by the US population. The resulting baseline hazard rate is shown in Figure 1. We clearly see two distinct waves of infections that occurred during this 200-day period.

Our next task is to set the efficiency parameters $\theta_k$ corresponding to the non-placebo study arms. To do so, we use point estimates from a number of randomized controlled trials run early in the COVID-19 pandemic. Specifically:

• Based on AstraZeneca vaccine trials with a 70% reported efficacy [Voysey et al., 2021], we set $\theta_2 = 1.2$.
• Based on SinoPharm vaccine trials with a 78% reported efficacy [Al Kaabi et al., 2021], we set $\theta_3 = 1.5$.
• Based on Novavax vaccine trials with an 89% reported efficacy [Heath et al., 2021], we set $\theta_4 = 2.2$.
• Based on Sputnik vaccine trials with a 91% reported efficacy [Logunov et al., 2021], we set $\theta_5 = 2.4$.
• Based on Pfizer and Moderna vaccine trials with roughly 95% reported efficacies [Polack et al., 2020, Baden et al., 2021], we set $\theta_6 = 3.0$.

The motivation for using these numbers is that we hope they capture realistic effect sizes one might see in a multi-arm vaccine trial, and not necessarily that they exactly match real-world efficiencies of the above vaccines established after pooling data from multiple trials.

We evaluate the performance of each experimental design using the following metrics; throughout, we use the fact that the 6-th arm is best to condense notation. Regret is measured as the nonnegative gap between the best arm's parameter and that of the arm in question.

• In-sample regret (ISR): Defined as $\frac{1}{T}\sum_{t=1}^{T} (\theta_6 - \theta_{I_t})$, where $I_t$ is the action chosen at round $t$.
• Best arm identification probability (BIP), i.e., the fraction of times that the best arm (here, the 6-th) has the lowest estimated infection hazard. Specifically, letting $A_i$ be the estimated best arm for replication $i$ and $B$ the number of replications, the best arm identification probability is defined as $\sum_{i=1}^{B} 1\{A_i = 6\}/B$.
• Expected policy regret (EPR), as defined in Kasy and Sautmann: Let $a$ be the estimated best action, and let $\Delta_a = \theta_6 - \theta_a$. This is defined as the average of $\Delta_{A_i}$ over replications, $\sum_{i=1}^{B} \Delta_{A_i}/B$.

Of these metrics, the first measures the "cost" of running the experiment (i.e., how many study participants were assigned to suboptimal arms during the trial), while the latter two measure the quality of the findings from the study.

Our goal is to evaluate our proposed method, PLTS, as well as the top-two Thompson sampling based variant designed for best-arm identification (TTPLTS).
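The three evaluation metrics are simple averages and can be sketched directly. The snippet below assumes the $\theta$ values listed above with $\theta_1 = 0$ for the placebo, takes regret as the nonnegative gap $\theta_6 - \theta_a$ (an assumption about the sign convention), and averages the policy regret over replications. Function names and the toy inputs are hypothetical.

```python
# Efficiency parameters of the simulated arms (arm 1 is the placebo).
THETA = {1: 0.0, 2: 1.2, 3: 1.5, 4: 2.2, 5: 2.4, 6: 3.0}
BEST_ARM = 6

def in_sample_regret(chosen_arms):
    """Average gap to the best arm over the arms chosen during one trial."""
    return sum(THETA[BEST_ARM] - THETA[a] for a in chosen_arms) / len(chosen_arms)

def best_arm_identification_probability(recommended_arms):
    """Fraction of replications whose final recommendation is the best arm."""
    hits = sum(1 for a in recommended_arms if a == BEST_ARM)
    return hits / len(recommended_arms)

def expected_policy_regret(recommended_arms):
    """Average gap to the best arm of the recommended arm, over replications."""
    gaps = [THETA[BEST_ARM] - THETA[a] for a in recommended_arms]
    return sum(gaps) / len(gaps)

# Toy inputs: one trial's sequence of choices, and four replications' recommendations.
isr = in_sample_regret([1, 6])
bip = best_arm_identification_probability([6, 6, 5, 6])
epr = expected_policy_regret([6, 6, 5, 6])
```

In this toy input the recommendation misses the best arm once in four replications, and the miss lands on the second-best arm, so the policy regret stays small even though the identification probability drops to 0.75. That is the distinction the two post-trial metrics are meant to capture.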
We compare these methods against two baselines: a classical RCT with uniform assignment, and the DEW algorithm of Thune et al. [2019].

Algorithm 5 DEW
Input: Learning rate $\eta$, number of arms $K$
Initialize weights $w^a_0 = 1$, $\forall a = 1, ..., K$
for $t = 1, ..., T$ do
    Let $p^a_t =$

As discussed in the related work section, we are not aware of any existing methods for adaptive experimentation that were designed for our setting, i.e., with only negative feedback, unbounded delays in receiving feedback, and time-varying delays. However, the DEW approach, although introduced and studied in an adversarial setting with bounded delays, is simple and flexible enough that, at least algorithmically, it can be used in our setting, which is why we also explore using it as a baseline. The DEW algorithm is a form of exponential weighting where weights are updated whenever negative rewards are observed; see Algorithm 5 for details.

The one major challenge in using this algorithm is in choosing the learning rate $\eta$. Thune et al. [2019] offer guidance based on bounds on the delay distribution, but here of course we have no such bounds (and our setting does not fall under the purview of their theory), so it was not clear to us how to choose $\eta$. Thus, we simply consider 3 baselines, DEW with $\eta = 0.01, 0.1, 0.4$, that span the range of behaviors one can get from the method. (For Thompson sampling, we use an uninformative prior and so there is no analogous tuning parameter that needs to be specified.)

Finally, in order to evaluate best arm identification probabilities, we need each method to output a recommended best arm at the end of the experiment. PLTS and TTPLTS output the vaccine with the largest posterior mode and DEW outputs the vaccine with the largest weight. RCT picks the vaccine with the lowest infection rate at the end of the trial.

For each method we consider, we use a sample size of $M = 60000$ study participants and replicate all simulations 1000 times. Table 1 summarizes the results across all methods and performance metrics.
Our first comparison is between the simplest variant of our methods, PLTS, and the RCT baseline. Here we see that PLTS outperforms the RCT along all metrics: it both achieves smaller in-sample regret and has more power to identify the best arm. The reason it can do so is that it quickly shifts sampling towards the most promising vaccines; see Figure 2. This is clearly desirable from a regret minimization point of view, but here it is also desirable from a power point of view since it concentrates sampling on the most difficult questions, i.e., distinguishing the best arms from each other.

Table 1: Results comparing Partial Likelihood Thompson sampling (PLTS), top-two PLTS (TTPLTS), DEW with varying learning rate ($\eta = 0.4, 0.1, 0.01$) and the randomized controlled trial (RCT). We display three metrics defined previously and fix $M = 60000$. Standard errors are given in parentheses; each configuration is replicated 1000 times.

Next, we compare PLTS to the DEW baselines in terms of in-sample regret. Here, the picture is nuanced. When well tuned, DEW can slightly outperform PLTS; however, it is not clear whether an adaptive tuning parameter choice could mirror this result. The rate at which all methods incur infections during the study is shown in Figure 3. We see that all the methods incur similar numbers of infections in the first wave, but the well-performing methods are able to focus on the better arms and considerably cut down on infections by the time we get to the second (larger) wave.

The comparison looks different, however, once we look at metrics that consider the quality of the selected arm, i.e., best arm identification probability and policy regret. Here, PLTS still does well, but variants of DEW that achieved small in-sample regret do very poorly. It appears that, in order to achieve good in-sample regret, DEW needs to make unstable or greedy choices that hurt the quality of the final conclusions we can derive from the data.
In contrast, PLTS is able to focus on the best arms without suffering from this phenomenon.

Relative to PLTS, the top-two variant TTPLTS achieves better post-trial metrics but worse in-sample regret. This is to be expected, since TTPLTS invests more in sampling the second-best arm in order to improve power for best arm identification. Whether a practitioner prefers the behavior of PLTS or TTPLTS will depend on the relative importance they give to in-sample versus post-trial performance metrics.

Finally, we investigate how arm-assignment probabilities of different methods evolve over time: Figure 4 shows the assignment probabilities averaged over 1000 replicates for each vaccine candidate as a function of time for both DEW and PLTS. The dashed horizontal line shows the uniform probability the RCT uses. We see that in both cases the more promising candidates get larger shares as time goes on. However, we do see that when DEW uses a large learning rate (corresponding to the cases with good in-sample regret), the assignment probabilities almost flatten out as we approach the end of the trial, suggesting that by this point the learning rate has become too large to enable effective learning.

Adaptive experiments have considerable potential to address challenges associated with new disease variants that emerge during a pandemic [Castillo et al., 2021]. However, the vaccine trial setting comes with a number of statistical challenges, including unbounded and time-varying delay distributions and partial feedback, that have not been considered in the context of existing bandit algorithms. We introduced partial likelihood Thompson sampling, which adapts Thompson sampling to the setting of vaccine trials using fundamental modeling techniques that have been prevalent for decades in the survival analysis literature [Cox and Oakes, 1984].
We find our method to be a robust and performant option for sequential experimentation in an experiment built around data from the COVID-19 pandemic, thus highlighting its promise as a tool for quickly targeting the use of existing vaccines against a new disease variant.

Analysis of Thompson sampling for the multi-armed bandit problem
Near-optimal regret bounds for Thompson sampling
An Pan, and Xiaoming Yang. Effect of 2 Inactivated SARS-CoV-2 Vaccines on Symptomatic COVID-19 Infection in Adults: A Randomized Clinical Trial
Finite-time analysis of the multiarmed bandit problem
Efficacy and safety of the mRNA-1273 SARS-CoV-2 vaccine
Adaptive rate of convergence of Thompson sampling for Gaussian process optimization
Covariance analysis of censored survival data
Market design to accelerate COVID-19 vaccine supply
Data table for case daily trends - United States
An empirical evaluation of Thompson sampling
Regression models and life-tables
Analysis of Survival Data
Efficient optimal learning for contextual bandits
Estimation of Vaccine Efficacy in the Presence of Waning: Application to Cholera Vaccines
The efficiency of Cox's likelihood function for censored data
Online network revenue management using Thompson sampling
Online algorithms for parameter mean and variance estimation in dynamic regression models
Best arm identification in multi-armed bandits with delayed feedback
Design and Interpretation of Vaccine Field Studies
Safety and efficacy of NVX-CoV2373 COVID-19 vaccine
Bayesian Survival Analysis
Online learning under delayed feedback
Marginal likelihoods based on Cox's regression and life model
Non-parametric Bayesian analysis of survival time data
Adaptive treatment assignment in experiments for policy choice
Thompson sampling: An asymptotically optimal finite-time analysis
Evidence for increased breakthrough rates of SARS-CoV-2 variants of concern in BNT162b2-mRNA-vaccinated individuals
Asymptotically efficient adaptive allocation rules
Stochastic multi-armed bandits with unrestricted delay distributions
Constrained Bayesian optimization with noisy experiments
Safety and efficacy of an rAd26 and rAd5 vector-based heterologous prime-boost COVID-19 vaccine: an interim analysis of a randomised controlled phase 3 trial in Russia
A frailty mixture model for estimating vaccine efficacy
The queue method: Handling delay, heuristics, prior data, and evaluation in bandits
Stochastic bandits with arm-dependent delays
Safety and efficacy of the BNT162b2 mRNA COVID-19 vaccine
Simple Bayesian algorithms for best-arm identification
An information-theoretic analysis of Thompson sampling
A tutorial on Thompson sampling. Foundations and Trends® in Machine Learning
A Bayesian justification of Cox's partial likelihood
On the likelihood that one unknown probability exceeds another in view of the evidence of two samples
Nonstochastic multiarmed bandits with unrestricted delays
Stochastic bandit models for delayed conversions
Single-dose administration and the influence of the timing of the booster dose on immunogenicity and efficacy of ChAdOx1 nCoV-19 (AZD1222) vaccine: a pooled analysis of four randomised trials
Learning in generalized linear contextual bandits with stochastic delays