Reconciling Risk Allocation and Prevalence Estimation in Public Health Using Batched Bandits

Ben Chugg and Daniel E. Ho

2021-10-25

Abstract. In many public health settings, there is a perceived tension between allocating resources to known vulnerable areas and learning about the overall prevalence of the problem. Inspired by a door-to-door COVID-19 testing program we helped design, we combine multi-armed bandit strategies and insights from sampling theory to demonstrate how to recover accurate prevalence estimates while continuing to allocate resources to at-risk areas. We use the outbreak of an infectious disease as our running example. The public health setting has several characteristics distinguishing it from typical bandit settings, such as distribution shift (the true disease prevalence changes over time) and batched sampling (multiple decisions must be made simultaneously). Nevertheless, we demonstrate that several bandit algorithms are capable of outperforming greedy resource allocation strategies, which often perform worse than random allocation because they fail to notice outbreaks in new areas.

Public health crises, such as lead paint, drinking water contamination, natural disasters, and infectious disease outbreaks, demand effectively distributing scarce resources. In the early stages of the COVID-19 pandemic, for instance, public health officials grappled with how best to allocate diagnostic tests [10]. Some advocate random sampling to learn about the disease [12, 11], while others favor more targeted allocation [5, 4]. This is perceived to pose a tradeoff: random sampling is seen to enable reliable prevalence estimates and the discovery of new hot spots, while risk targeting focuses on symptomatic and/or exposed individuals to ostensibly drive down disease rates. While the general tension runs throughout public health, the setting presents novel challenges to classical bandit approaches: outbreaks present distribution shifts, and allocation is necessarily batched for operational reasons. In this short work, we demonstrate that these two objectives are not in zero-sum conflict. Inspired by a door-to-door COVID-19 testing program we helped design, we demonstrate that the batched setting allows us to recover unbiased prevalence estimates, and we investigate the performance of bandits in the presence of distribution shift.

To illustrate, we consider the early stages of an infectious disease outbreak when tests are not widely available. A public health agency must decide how to allocate tests among $K$ different units, $U_1, \dots, U_K$. Units might be neighbourhoods [4], apartment buildings [9], or dorm rooms on a college campus [17]. They might also be artificially constructed clusters of individuals based on interaction networks [14]. We consider a series of discrete timesteps $t = 1, \dots, T$, which could represent days or weeks. At time $t$, $m_t$ tests can be distributed among the units in any way we choose. The supply may vary from period to period, in which case $m_t \neq m_{t+1}$.

Our approach involves treating each unit as an arm in the multi-armed bandit problem [16]. The prevalence rate in each unit is the "reward" distribution, thus encouraging tests to be allocated to units with higher proportions of infected individuals. Unlike most bandit problems, our reward distributions are non-stationary [1]. In particular, they change as a function of the disease spread and the number of tests (Section 2). In addition, we must take multiple actions at once (actions are batched) [7]. We investigate four bandit algorithms: upper confidence bound (UCB) sampling, Thompson sampling, Exponential Weights (Exp3), and $\epsilon$-greedy. We discuss their details and how they have been extended to the non-stationary and batched setting in Section 3. The following section discusses how we model the spread of the disease in each unit and the effect of testing. Code is available at https://github.com/reglab/mab-infectious-disease.
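To make the batched setup concrete, the following is a minimal sketch of the interaction protocol, with a uniform-random baseline strategy standing in for a bandit. It is not the authors' implementation (their code is in the repository above); all names, parameter values, and the fixed per-unit prevalence rates are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

K, T = 20, 30                             # number of units and timesteps
budgets = rng.integers(50, 100, T)        # m_t: the test supply may vary over time
prevalence = rng.uniform(0.01, 0.2, K)    # hypothetical per-unit infection rates

def random_allocation(m_t):
    """Baseline strategy: spread the m_t tests uniformly at random over units."""
    return np.bincount(rng.integers(0, K, m_t), minlength=K)

for t in range(T):
    m_t = budgets[t]
    n_tests = random_allocation(m_t)               # tests allocated per unit
    positives = rng.binomial(n_tests, prevalence)  # observed test outcomes
    # A bandit strategy would update its per-unit statistics here, and the
    # disease dynamics in each unit would advance (Section 2).
```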
Many mathematical models for infectious disease dynamics exist [3]. Here, we use a simple logistic model which incorporates the effects of testing and contact tracing. We note, however, that neither the bandit sampling strategies nor the prevalence estimation rely on the specifics of the disease model.

Given a unit with population $N$, let $C(t)$ be the number of disease cases at time $t$. We begin by considering the logistic growth equation, $f'(t) = \alpha f(t)(1 - f(t)/N)$, which accounts for the carrying capacity $N$ of the population and the growth rate $\alpha$ of the virus. In our setting, however, we want to take into account the additional effects of testing. For an empirically chosen parameter $\beta$, we thus add an extra term to the logistic equation and obtain a Bernoulli-type differential equation:

$$C'(t) = \alpha C(t)\left(1 - \frac{C(t)}{N}\right) - \beta \varphi(t)\, C(t), \qquad (1)$$

where $\varphi(t)$ is the number of tests conducted at time $t$. The additional term $\beta\varphi(t)$ represents the impact of testing on the rate of growth of the virus: the more tests performed at time $t$, the lower the rate of growth. We assume that the term $\beta\varphi(t)$ captures not only the immediate effects of testing, but also the longer-run effects of subsequent interventions, such as isolation/quarantine and contact tracing. See Figure 1 for an illustration of the model dynamics and the effects of testing. In our setting, disease spread in each unit $U_i$ evolves independently according to Equation (1).

At each time $t$, a given strategy chooses $m_t$ regions to sample, and then samples individuals uniformly within each chosen region. In order to adapt the strategies to the non-stationary setting, we apply a discount factor $\gamma \in (0,1)$ to the information received over time [2, 15]. Let $a_\tau$ be the number of positive tests witnessed at time $\tau$. When making a decision based on the overall number of, say, positive tests seen from time 0 to $t$, each strategy uses the discounted quantity $\sum_{\tau=0}^{t} \gamma^{t-\tau} a_\tau$. In this way, each strategy smoothly forgets the past. We give a very brief overview of each strategy; for more details, we refer readers to [16].

$\epsilon$-greedy. Perhaps the simplest bandit strategy is $\epsilon$-greedy, which chooses a random action with probability $\epsilon$ and otherwise chooses the action with the highest expected reward (i.e., the positivity rate: the fraction of tests that return a positive result). In the batched setting, we simply repeat this process $m_t$ times.

Thompson Sampling. Thompson sampling maintains a beta distribution over each unit, $\mathrm{Beta}(P_k, N_k)$, where $P_k$ (resp., $N_k$) is the (discounted) number of positive (resp., negative) tests in unit $k$. To batch sample, first draw $p_k^1, \dots, p_k^{m_t} \sim \mathrm{Beta}(P_k, N_k)$ for each $k = 1, \dots, K$, and then for each $r = 1, \dots, m_t$ sample the unit $\arg\max_k p_k^r$.
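As a concrete illustration, here is a minimal sketch of the batched Thompson sampler just described, together with the discounted count update. It is not taken from the authors' repository, and the added pseudo-count of 1 (a Beta(1, 1) prior) is an assumption so that the distribution is well-defined for units with no test history.

```python
import numpy as np

def thompson_batch(P, N, m_t, rng):
    """Batched Thompson sampling: for each unit k, draw m_t samples from
    Beta(P_k, N_k); the r-th test goes to the unit with the largest r-th draw.
    Returns the number of tests allocated to each unit.
    (The +1 below is an assumed uniform Beta(1, 1) prior for empty counts.)"""
    draws = rng.beta(P + 1, N + 1, size=(m_t, len(P)))  # draws[r, k]
    return np.bincount(draws.argmax(axis=1), minlength=len(P))

def discount_and_update(P, N, positives, tests, gamma):
    """Multiply past counts by gamma each step, so evidence from time tau
    carries weight gamma^(t - tau) as in the text, then add today's results."""
    return gamma * P + positives, gamma * N + (tests - positives)
```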
UCB Sampling. We use a binomial proportion confidence interval as our upper confidence bound. Formally, we use a Clopper-Pearson interval, so unit $k$ receives a score $s_k(t)$ equal to the upper limit of the Clopper-Pearson interval for its (discounted) positivity rate. In order to batch sample, we form a distribution over the regions using their scores,

$$\pi_k(t) = \frac{s_k(t)}{\sum_{j=1}^{K} s_j(t)}, \qquad (2)$$

and sample from this distribution (with replacement) $m_t$ times.

Exp3. Each unit $i$ begins with weight $w_i(0) = 1$. At time $t$, we sample unit $i$ with probability

$$\pi_i(t) = (1 - \epsilon)\,\frac{w_i(t)}{\sum_{j=1}^{K} w_j(t)} + \frac{\epsilon}{K}, \qquad (3)$$

for some $\epsilon \in (0, 1)$. If unit $k$ was sampled and $r_k$ is its reward, we update the weight as $w_k(t+1) = w_k(t)\exp\left(\frac{\epsilon\, r_k}{\pi_k(t)\, K}\right)$. Otherwise, we set $w_k(t+1) = w_k(t)$. For us, the reward $r_k$ is the number of positive tests received after testing unit $k$.

For an individual $i$ at time $t$, let $y_t(i) \in \{0, 1\}$ indicate whether they are infected. Write $i \in U_k$ if individual $i$ is in unit $U_k$. We seek an estimate of the true prevalence of the disease at time $t$ across the entire population. That is, putting $N = \sum_k N_k$, we wish to estimate

$$\mu_t = \frac{1}{N} \sum_{k=1}^{K} \sum_{i \in U_k} y_t(i). \qquad (4)$$

We employ a Horvitz-Thompson-type approach which takes into account the probability that each region was sampled [6]. Let $S_t$ be the set of individuals tested during timestep $t$, and let $\pi_k(t)$ denote the probability that unit $U_k$ was sampled by the bandit strategy (e.g., Equations (2) or (3)). Writing $k_i$ for the index of the unit containing individual $i$, our estimate is

$$\hat\mu_t = \frac{1}{N} \sum_{i \in S_t} \frac{N_{k_i}\, y_t(i)}{m_t\, \pi_{k_i}(t)}. \qquad (5)$$

The quantity $\hat\mu_t$ is an unbiased estimate of $\mu_t$ (see Appendix B for a proof).

The right-hand side of Figure 1 examines the improvement of each bandit strategy over random selection, using a discount rate of $\gamma = 0.5$. For each budget, we run the simulation for 30 timesteps. Each point reflects the difference in the total number of positive cases across all regions, at the end of the 30 timesteps, between the bandit strategy and random sampling. Each bandit strategy causes a decrease in the total number of positive cases when compared with random selection, with the exception of $\epsilon$-greedy at $\epsilon = 0.01$ (i.e., using 1% of its budget to perform random exploration). Thompson Sampling and Exp3 are the two best-performing strategies, especially as the budget grows. The results of more experiments across a wider range of parameters are shown in Appendix C. The general trends are that (i) as the number of allocation units $K$ decreases, the differences between strategies become less apparent, and (ii) as the budget and/or $K$ increases, Thompson Sampling, Exp3, and UCB outperform $\epsilon$-greedy by wider margins.

Using UCB as an example, Figure 2 illustrates population estimation as a function of the budget. The left-hand side demonstrates the evolution of the true prevalence rates (across all units) and the estimates over time. Note that the true means differ across budgets because testing changes the prevalence of the disease, and a higher budget implies more testing. Thus, a lower budget results in a higher prevalence. The right-hand side of Figure 2 demonstrates the effect of different budget sizes on the variance of the estimate. While the estimates are unbiased on average across all three budgets, smaller budgets result in larger variance. This is because the variance of $\hat\mu_t$ grows as a function of the inverse of $p_i p_j$, where $p_i$ is the probability that individual $i$ is tested. This, in turn, is a function of $m_t$. Thus, as $m_t \to 0$, $\mathrm{Var}(\hat\mu_t) \to \infty$.
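Because every individual in a unit shares the same inclusion probability, the estimator in Equation (5) aggregates cleanly into per-unit counts. The following sketch computes it that way; the function and argument names are illustrative (this is not the authors' code), and it assumes every $\pi_k(t)$ is strictly positive, as is guaranteed by distributions such as (2) and (3).

```python
import numpy as np

def prevalence_estimate(positives, pi, unit_sizes, m_t):
    """Horvitz-Thompson-type estimate of mu_t (Equation (5)).

    positives[k]  -- positive tests observed in unit k at this timestep
    pi[k]         -- probability the strategy samples unit k (Eq. (2) or (3))
    unit_sizes[k] -- population N_k of unit k

    Each positive result is inverse-weighted by the probability that the
    individual was tested, Pr(i in S_t) = m_t * pi_k / N_k (see Appendix B).
    """
    N = unit_sizes.sum()
    inclusion = m_t * pi / unit_sizes   # Pr(a given individual in unit k is tested)
    return float((positives / inclusion).sum() / N)
```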
This work has taken initial steps toward understanding how multi-armed bandits can enable resources to be distributed more effectively during a public health crisis, and has demonstrated that allocating resources to high-risk areas and maintaining an accurate understanding of the true dynamics of the problem (in this case, the spread of an infectious disease) are not necessarily in conflict.

References

[1] Stochastic multi-armed-bandit problem with non-stationary rewards.
[2] Non-stationary multi-armed bandit: Empirical evaluation of a new concept drift-aware algorithm.
[3] Mathematical models to characterize early epidemic growth: A review.
[4] Evaluation of allocation schemes of COVID-19 testing resources in a community-based door-to-door testing program.
[5] A workable strategy for COVID-19 testing: Stratified periodic testing rather than universal random testing.
[6] A generalization of sampling without replacement from a finite universe.
[7] Linear upper confidence bound algorithm for contextual bandit problem with piled rewards.
[8] Adaptive targeted infectious disease testing.
[9] Understanding the spatial clustering of severe acute respiratory syndrome (SARS) in Hong Kong.
[10] COVID-19 testing: One size does not fit all.
[11] Testing of asymptomatic individuals for fast feedback-control of COVID-19 pandemic.
[12] Why only test symptomatic patients? Consider random screening for COVID-19.
[13] Optimal COVID-19 quarantine and testing policies.
[14] Clustering for epidemics on networks: A geometric approach.
[15] Weighted linear bandits for non-stationary environments.
[16] Introduction to multi-armed bandits.
[17] Multiple COVID-19 clusters on a university campus, North Carolina.

A Solving Equation (1)

Let $\Phi(t) = \int_0^t \varphi(z)\,dz$ be the total number of tests conducted in the region up to and including time $t$. The solution to Equation (1) is given by

$$C(t) = \frac{\lambda_{\alpha,\beta}(t)}{C(0)^{-1} + \frac{\alpha}{N}\int_0^t \lambda_{\alpha,\beta}(z)\,dz},$$

where $\lambda_{\alpha,\beta}(x) = \exp(\alpha x - \beta\Phi(x))$. Observe that for $\varphi \equiv 0$, the solution reduces to the solution of the usual logistic growth equation. For $\varphi \not\equiv 0$, however, the integral in the denominator of $C(t)$ is harder to evaluate owing to the term $\Phi(t)$. To evaluate it, we recall that $\Phi(t)$ is a function of discrete time steps: it changes only at integer times $t = 0, 1, \dots$ and is constant on each interval $[t, t+1)$. Thus,

$$\int_0^t \lambda_{\alpha,\beta}(z)\,dz = \sum_{s=0}^{t-1} e^{-\beta\Phi(s)}\,\frac{e^{\alpha(s+1)} - e^{\alpha s}}{\alpha},$$

which is straightforward to evaluate computationally (a numerical sketch appears at the end of this document).

B Unbiasedness of the Prevalence Estimate

We claim that $\hat\mu_t$, defined in Equation (5), is an unbiased estimate of $\mu_t$. This is straightforward to see after noting that for an individual $i$ in unit $U_{k_i}$, $\Pr(i \in S_t) = m_t \pi_{k_i}(t)/N_{k_i}$. We can thus take the expectation of (5) and rearrange it to read

$$\mathbb{E}[\hat\mu_t] = \frac{1}{N} \sum_{k=1}^{K} \sum_{i \in U_k} \Pr(i \in S_t)\,\frac{N_k\, y_t(i)}{m_t\, \pi_k(t)} = \frac{1}{N} \sum_{k=1}^{K} \sum_{i \in U_k} y_t(i) = \mu_t.$$

The result for $\Pr(i \in S_t)$ follows because $|U_{k_i} \cap S_t|$, the number of tests allocated to unit $U_{k_i}$, is binomially distributed with parameters $m_t$ and $\pi_{k_i}$. So

$$\Pr(i \in S_t) = \sum_{\ell=1}^{m_t} \Pr\big(i \in S_t \,\big|\, |U_{k_i} \cap S_t| = \ell\big)\,\Pr\big(|U_{k_i} \cap S_t| = \ell\big) = \sum_{\ell=1}^{m_t} \frac{\ell}{N_{k_i}} \binom{m_t}{\ell} \pi_{k_i}^{\ell} (1 - \pi_{k_i})^{m_t - \ell} = \frac{m_t \pi_{k_i}}{N_{k_i}}.$$

C Additional Experiments

Figure 3: Improvement over random sampling. As in the right-hand side of Figure 1, the x-axis is the budget, and the y-axis is the difference in the number of cases between the given strategy and random sampling by the end of 30 timesteps. As $K$ increases, we see a general increase in the effectiveness of Thompson Sampling and Exp3 relative to the two $\epsilon$-greedy strategies.

For all values of $K$ and $\gamma$, $\epsilon$-greedy at $\epsilon = 0.01$ performs worse than random sampling across the majority of budgets. This suggests that with such a small exploration budget, it is unable to discover those units whose prevalence is increasing drastically over time. Interestingly, at lower values of $K$, $\epsilon$-greedy with more exploration (10%) is also unable to compete with the other bandit algorithms.
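As a supplement to Appendix A, here is a minimal sketch of how the piecewise-constant integral above can be evaluated. It is not from the authors' repository; the function name is illustrative, and the boundary convention (tests administered at time $s$ enter $\Phi$ on $[s, s+1)$, with $\lambda$ evaluated using the left limit of $\Phi$ at $t$) is an assumption.

```python
import math

def case_count(t, C0, N, alpha, beta, tests):
    """Evaluate the closed-form solution of Equation (1) at integer time t.

    tests[s] is the number of tests phi at step s; their running sum Phi is
    constant on each interval [s, s+1), so the integral of lambda_{alpha,beta}
    decomposes into a sum of explicit terms.
    """
    Phi, integral = 0.0, 0.0
    for s in range(t):
        Phi += tests[s]  # assumed convention: tests at time s count on [s, s+1)
        integral += math.exp(-beta * Phi) * (
            math.exp(alpha * (s + 1)) - math.exp(alpha * s)
        ) / alpha
    lam = math.exp(alpha * t - beta * Phi)  # lambda_{alpha,beta}(t), left limit
    return lam / (1.0 / C0 + (alpha / N) * integral)

# Sanity check: with no testing this recovers ordinary logistic growth, e.g.
# case_count(10, 5, 1000, 0.3, 0.0, [0] * 10)
```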