title: Adaptive Sampling for Estimating Distributions: A Bayesian Upper Confidence Bound Approach
authors: Kartik, Dhruva; Sood, Neeraj; Mitra, Urbashi; Javidi, Tara
date: 2020-12-08

The problem of adaptive sampling for estimating probability mass functions (pmfs) uniformly well is considered. Performance of the sampling strategy is measured in terms of the worst-case mean squared error. A Bayesian variant of the existing upper confidence bound (UCB) based approaches is proposed. It is shown analytically that the performance of this Bayesian variant is no worse than that of the existing approaches. The posterior distribution on the pmfs in the Bayesian setting allows for a tighter computation of upper confidence bounds, which leads to significant performance gains in practice. Using this approach, adaptive sampling protocols are proposed for estimating SARS-CoV-2 seroprevalence in groups defined by attributes such as location and ethnicity. The effectiveness of this strategy is discussed using data obtained from a seroprevalence survey in Los Angeles county.

We frequently encounter scenarios where we need to estimate a finite collection of pmfs from their samples. We have a limited sample budget, and we would like to adaptively choose which pmf to sample from so that we have uniformly good estimates of all the pmfs. The goodness of our estimates can be evaluated using various distribution distance metrics; in this paper, we focus on the mean squared error.

A concrete application that fits this setting is that of estimating SARS-CoV-2 incidence rates in various geographic regions. The incidence rate, the fraction of currently infected individuals, is a crucial metric for assessing the safety of a given region. The incidence rate cannot be estimated in an unbiased way from reported cases alone; it needs to be estimated by conducting randomized tests in each region of interest. Typically, such randomized surveys are conducted by a central authority with a limited testing budget. A natural question that arises in this scenario is that of allocating the available tests to the various regions in an efficient manner. We may also be interested in other parameters, such as SARS-CoV-2 seroprevalence, which is an indicator of an individual's potential immunity against the virus, and in other kinds of groups, such as ethnicity and age, in addition to geographic location. Other applications of this setting include dynamics estimation in Markov decision processes (MDPs) and text compression, which are discussed in greater detail in Shekhar et al. (2020).

When we are interested in minimizing the mean squared error for every pmf, an efficient sampling strategy would be to let the number of tests be proportional to the variance of the associated pmf. The key problem, however, is that we do not know the variance of each pmf. Therefore, we need to adapt our allocation as we learn the distributions over time. An upper confidence bound (UCB) based approach was proposed in Shekhar et al. (2020). The main idea in this approach is to first form an upper bound on the variance of each pmf; the next sample is then chosen as a function of this upper bound and the number of samples acquired so far for each pmf.
It is shown that the number of samples obtained in this manner is approximately the same as that of an oracle that allocates samples using knowledge of the variances of the underlying distributions. While this strategy satisfies many interesting theoretical properties, we observed that it tends to be more conservative than necessary in practice. Our main goal in this paper is to devise a less conservative approach while ensuring the same performance guarantees as in Shekhar et al. (2020). Our contributions towards this goal are: (i) we assume a Dirichlet prior on the pmfs and compute the aforementioned upper confidence bounds using the posterior belief on the pmfs; this, combined with a stronger set of inequalities, gives us noticeably tighter bounds on the variance of the pmfs; (ii) we analytically prove that this Bayesian approach for computing confidence bounds is no worse than the one in Shekhar et al. (2020); and (iii) we employ these methods for estimating SARS-CoV-2 seroprevalence in various categories using data obtained by a randomized survey Sood et al. (2020) conducted in Los Angeles county. Our analysis of the survey data illustrates the importance of adaptive sampling in order to obtain better estimates. We would like to emphasize that the Bayesian approach used herein to compute upper confidence bounds, and the associated proof methodologies, can also be adapted to other UCB based approaches that are widely used in online learning Efroni et al. (2020); Rosenberg and Mansour (2019) and dynamics estimation Tarbouriech and Lazaric (2019).

Related Work. Our work is closely related to Shekhar et al. (2020); we adopt their sampling strategy and many of their proof methodologies. The key difference between our work and Shekhar et al. (2020) is our Bayesian approach for computing the upper confidence bounds on the variances. Our approach for computing these bounds has a significant impact on the performance in practice, especially in the non-asymptotic regime. Multiple distance metrics have been considered in Shekhar et al. (2020), whereas we restrict our attention to the mean squared error. It is, however, possible to extend our methodology to the other distance metrics as well. When the pmfs of interest have a binary support, which is the case in the problem of seroprevalence estimation, the problem of estimating distributions reduces to the problem of estimating means of random variables. A substantial amount of work has been done in the area of estimating the means of random variables Carpentier et al. (2011). In Carpentier et al. (2011), two approaches for computing upper confidence bounds were proposed, one using the Chernoff-Hoeffding bound and the other using the empirical Bernstein bound. The Chernoff-Hoeffding bound approach is very close to that of Shekhar et al. (2020). We observe that the performance of our Bayesian approach is better than both approaches in Carpentier et al. (2011). However, the regret bounds in Carpentier et al. (2011) for the Bernstein approach are tighter than ours.

Notation. In general, subscripts denote time indices unless stated otherwise. For a sampling strategy $g$ and a collection of pmfs $p$, we use $\mathbb{P}^g_p[\cdot]$ and $\mathbb{E}^g_p[\cdot]$ to indicate, respectively, that the probability and expectation depend on the choice of $g$ and that they are conditioned on the model $p$. We denote the indicator function associated with an event $E$ by $\mathbb{1}_E$.
We are interested in estimating $K$ probability mass functions (pmfs) over a finite space $\{1, \ldots, L\}$ from their samples. Let the $k$-th pmf be denoted by $p^{(k)} \doteq (p^{(k,1)}, \ldots, p^{(k,L)})$ and let $p \doteq (p^{(1)}, \ldots, p^{(K)})$. For convenience, we will refer to each of these pmfs as an arm. The total number of samples that can be obtained is $N$. At any given time, we can adaptively choose an arm to obtain a sample. The $n$-th arm sampled is denoted by $U_n$ and the outcome of the $n$-th sample is denoted by $Y_n$; we have $Y_n \sim p^{(U_n)}$. Let the history of sampled arms and their corresponding outcomes be denoted by $I_n \doteq \{U_1, Y_1, \ldots, U_{n-1}, Y_{n-1}\}$, and let the strategy used to collect the sample at time $n$ be $g_n$, so that $U_n = g_n(I_n)$. After collecting $n$ samples, we form an estimate $\hat{p}^{(k)}_n$ of each pmf $p^{(k)}$, which in our case is the empirical distribution. The mean squared error between the true distribution $p^{(k)}$ and the estimate is denoted by
$$\mathcal{L}^{(k)}_n \doteq \sum_{l=1}^{L} \big(p^{(k,l)} - \hat{p}^{(k,l)}_n\big)^2. \quad (1)$$

Remark 1. Note that we can scale the loss $\mathcal{L}^{(k)}$ by an appropriate weight $w^{(k)}$. Our sampling strategy and proofs can be easily adapted to this weighted loss.

As discussed earlier, our sampling strategy will involve the computation of upper confidence bounds on certain parameters. When Bayesian methods are used to derive upper confidence bounds, it is generally not possible to provide performance guarantees for every instance $p$ of the underlying pmfs. In such cases, the loss $\mathcal{L}$ is locally averaged over a small region containing the true pmf $p$. We will therefore consider a loss that is locally averaged with respect to $p$ using a distribution $\varrho$. We refer to Brown et al. (2001) and Bayarri and Berger (2004) for a detailed discussion on the interpretation of local averaging.

Remark 2. Note that the distribution $\varrho$ used for local averaging is not the same as the prior on $p$. We will use a Bayesian interpretation of the pmfs $p$ only to compute tighter upper confidence bounds. However, our formulation and performance guarantees are frequentist in nature since the distribution $\varrho$ is supported on an arbitrarily small region around $p$.

Let $\Delta$ denote the $(L-1)$-dimensional simplex and let $\varrho = \varrho^{(1)} \times \cdots \times \varrho^{(K)}$ be a product distribution on $\Delta^K$. Our goal is to minimize the following cost associated with a sampling strategy $g$:
$$\mathbb{E}_{\pi \sim \varrho}\Big[\max_{k} \mathbb{E}^{g}_{\pi}\big[\mathcal{L}^{(k)}_N\big]\Big], \quad (2)$$
where $\pi \sim \varrho$. The distribution $\varrho^{(k)}$ has the following form: for any Borel measurable set $A \subseteq \Delta$,
$$\varrho^{(k)}(A) = \frac{\mu\big(A \cap \eta(p^{(k)})\big)}{\eta},$$
where $\mu$ is the uniform distribution over $\Delta$ and $\eta(p^{(k)})$ is an $\ell_2$-ball around $p^{(k)}$ such that $\mu(\eta(p^{(k)})) = \eta$. Clearly, the cost in (2) approaches the cost in Shekhar et al. (2020) and Carpentier et al. (2011) as $\eta \to 0$, and thus it is desirable to select a small $\eta$. However, the performance guarantees we provide in this paper become weaker as $\eta$ becomes small. We will discuss the effects of the choice of $\eta$ in subsequent sections.

The Oracle. Let there be an oracle that knows the probability distributions $p$. Using this information, the oracle allocates a fixed number of samples to each arm. If the number of samples allocated to arm $k$ is $T^{(k)}$, the mean squared error associated with arm $k$'s estimate is given by
$$\mathbb{E}\big[\mathcal{L}^{(k)}\big] = \frac{c^{(k)}}{T^{(k)}} \doteq \varphi\big(c^{(k)}, T^{(k)}\big), \qquad c^{(k)} \doteq \sum_{l=1}^{L} p^{(k,l)}\big(1 - p^{(k,l)}\big),$$
where $\varphi$ is referred to as the tracking function and $c^{(k)}$ is referred to as the tracking parameter Shekhar et al. (2020). The oracle solves the following optimization problem to obtain an allocation:
$$\min_{T^{(1)}, \ldots, T^{(K)}} \; \max_{k} \; \varphi\big(c^{(k)}, T^{(k)}\big) \quad \text{subject to} \quad \sum_{k=1}^{K} T^{(k)} = N, \; T^{(k)} \in \mathbb{Z}_{\geq 0}. \quad \text{(P1)}$$
While Problem (P1) is a combinatorial optimization problem, an approximate solution can easily be obtained by making a convex relaxation. The non-integer solution and the optimal value of the relaxed problem can be expressed in closed form as
$$T^{(k)*} = \frac{N c^{(k)}}{\sum_{i} c^{(i)}}, \qquad \max_{k} \; \varphi\big(c^{(k)}, T^{(k)*}\big) = \frac{\sum_{i} c^{(i)}}{N}.$$
A minimal numerical sketch of this oracle allocation is given below.
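To make this benchmark concrete, the following is a minimal Python sketch of the relaxed oracle allocation under the tracking function $\varphi(c, T) = c/T$ above. The function names are illustrative, and the two arms used in the example are the same binary-support arms considered in the numerical experiments later in the paper.

```python
import numpy as np

def tracking_parameter(pmf):
    """Tracking parameter c^(k) = sum_l p^(k,l) (1 - p^(k,l)) of a single arm."""
    pmf = np.asarray(pmf, dtype=float)
    return float(np.sum(pmf * (1.0 - pmf)))

def oracle_allocation(pmfs, N):
    """Relaxed oracle allocation T^(k)* = N c^(k) / sum_i c^(i) and the
    corresponding worst-case mean squared error sum_i c^(i) / N."""
    c = np.array([tracking_parameter(p) for p in pmfs])
    return N * c / c.sum(), c.sum() / N

if __name__ == "__main__":
    # The two binary-support arms used in the numerical experiments below.
    allocation, worst_case_mse = oracle_allocation([[0.99, 0.01], [0.7, 0.3]], N=2500)
    print(allocation)       # samples per arm, proportional to each arm's variance term
    print(worst_case_mse)   # oracle's worst-case mean squared error
```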
The regret of a sampling strategy $g$ with respect to the oracle that knows the underlying distributions, denoted by $\mathcal{R}_N$, is the difference between the cost in (2) achieved by $g$ and the corresponding oracle-optimal value.

We will first define some important quantities and compute an upper confidence bound on the tracking parameter $c^{(k)}$. This bound will be used in our sampling strategy described in Section 4. Let us assume that each of the pmfs $p^{(k)}$ is independently drawn from a Dirichlet distribution Kotz et al. (2004) with parameters $\alpha^{(k,1)}_1, \ldots, \alpha^{(k,L)}_1$. In other words,
$$p^{(k)} \sim \mathcal{D}\big(\alpha^{(k,1)}_1, \ldots, \alpha^{(k,L)}_1\big), \quad \text{independently for each } k. \quad (8)$$
This prior distribution, which is distinct from the distribution $\varrho$ used for local averaging, will be denoted by $\rho_1$. We will use the uniform prior, which is equivalent to setting $\alpha^{(k,l)}_1 = 1$ for every $k, l$. Thus, $\rho_1$ is the same as the distribution $\mu$ used in the definition of $\varrho$.

Remark 3. Note that the prior $\rho_1$ is just something we assume. In a frequentist setting with local averaging, we are not provided with any prior information on the pmfs $p$. Since the distribution $\varrho$ is defined using the uniform distribution $\mu$, the uniform prior $\rho_1 = \mu$ happens to be a convenient choice for computing our confidence bounds and subsequently analyzing them. In some cases, however, we do have some prior information on $p$. A discussion on how to incorporate certain kinds of priors is in Appendix D.

When the prior on the pmfs is a factored Dirichlet distribution as in (8), the posterior distribution on the pmfs after collecting $n$ samples is
$$p \mid I_n \sim \prod_{k} \mathcal{D}\big(\alpha^{(k,1)}_n, \ldots, \alpha^{(k,L)}_n\big), \quad (9)$$
where $\alpha^{(k,l)}_n$ is obtained by adding to $\alpha^{(k,l)}_1$ the number of times outcome $l$ has been observed from arm $k$ up to time $n$. Further, since the pmf $p^{(k)}$ is Dirichlet distributed, the probability $p^{(k,l)}$ is distributed according to a Beta distribution. More precisely, conditioned on the information $I_n$, we have
$$p^{(k,l)} \mid I_n \sim \mathrm{Beta}\Big(\alpha^{(k,l)}_n, \; \textstyle\sum_{l' \neq l} \alpha^{(k,l')}_n\Big). \quad (10)$$
Let the number of times arm $k$ has been sampled until time $n$ be
$$T^{(k)}_n \doteq \sum_{m < n} \mathbb{1}_{\{U_m = k\}}. \quad (11)$$
Based on the posterior distribution of $p^{(k,l)}$, we can compute a lower bound $a^{(k,l)}_n$ and an upper bound $b^{(k,l)}_n$ on $p^{(k,l)}$ such that
$$\mathbb{P}\big[p^{(k,l)} < a^{(k,l)}_n \mid I_n\big] \leq \delta_n, \quad (12)$$
$$\mathbb{P}\big[p^{(k,l)} > b^{(k,l)}_n \mid I_n\big] \leq \delta_n, \quad (13)$$
where the sequence $\delta_n$ is chosen so that $\sum_n \delta_n \leq \delta$, and the bounds are chosen so that the interval $E^{(k,l)}_n \doteq [a^{(k,l)}_n, b^{(k,l)}_n]$ is non-increasing in $n$. Note that we can compute this interval based on the inverse cdf of the posterior distribution, which is a Beta distribution.

For arm $k$, let $q^{(k)}_n$ and $u^{(k)}_n$ respectively be the optimal solution and the optimum value of the following quadratic program:
$$\max_{q \in \Delta} \; \sum_{l} q_l (1 - q_l) \quad \text{subject to} \quad a^{(k,l)}_n \leq q_l \leq b^{(k,l)}_n, \; l = 1, \ldots, L. \quad (14)$$
Notice that since $E^{(k,l)}_n$ is non-increasing in $n$, the upper bound $u^{(k)}_n$ is also non-increasing in $n$. Let us define the event $E$ as
$$E \doteq \bigcap_{n} \bigcap_{k,l} \big\{p^{(k,l)} \in E^{(k,l)}_n\big\}.$$
Note that because of the way the upper bound $u^{(k)}_n$ is defined in (14), we have $\sum_l p^{(k,l)}(1 - p^{(k,l)}) \leq u^{(k)}_n$ under the event $E$. Further, we have $\mathbb{P}_{\rho_1}[E] \geq 1 - \delta$ because of the union bound and the fact that $\sum_n \delta_n \leq \delta$.

The main idea is simply to sample the arm that has maximum $\varphi(u^{(k)}_n, T^{(k)}_n)$ at time $n$. The precise strategy is stated as Algorithm 1.

Algorithm 1 (Bayesian UCB sampling):
    for n = 1, ..., N do
        compute $u^{(k)}_n$ and $T^{(k)}_n$ for every arm $k$
        select $U_n \in \arg\max_k \varphi(u^{(k)}_n, T^{(k)}_n)$
        sample arm $U_n$ to obtain $Y_n$ and update the posterior according to (9)
    end for
    return $\hat{p}^{(k)}_N$ for every arm $k$

Using Algorithm 1, we can achieve nearly the same mean squared error as the oracle would have achieved with the knowledge of the underlying distributions. The gap between the oracle and Algorithm 1 is characterized by the regret bound below.

Theorem 4. Using the sampling strategy in Algorithm 1, the regret $\mathcal{R}_N$ satisfies a bound of the same order as the guarantee established in Shekhar et al. (2020).

Our methodology for deriving the regret bound in Theorem 4 is adapted from that in Shekhar et al. (2020). There are two key distinctions from the approach in Shekhar et al. (2020): (i) the gap $e^{(k,l)}_n \doteq b^{(k,l)}_n - a^{(k,l)}_n$ between our posterior-based confidence bounds must be controlled; and (ii) the loss is locally averaged. The first distinction is handled by the following two lemmas. Lemma 5 bounds the gap $e^{(k,l)}_n$ in terms of the number of samples $T^{(k)}_n$ and the confidence level $\delta_n$; its proof is in Appendix A. Lemma 6 shows that, under the event $E$, the upper bound $u^{(k)}_n$ exceeds the tracking parameter $c^{(k)}$ by at most a quantity that vanishes as $T^{(k)}_n$ grows; its proof is in Appendix B. Note that the bound in Lemma 6 is as tight as the one in Lemma 3 of Shekhar et al. (2020). These lemmas address the first aforementioned distinction from Shekhar et al. (2020). Using Lemma 6, we can follow the steps of the proof of Theorem 1 in Shekhar et al. (2020) to complete the proof of Theorem 4; we refer the reader to Appendix C for the full proof with details related to local averaging. Before turning to the numerical experiments, we sketch an illustrative implementation of Algorithm 1 below.
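The following is a minimal Python sketch of Algorithm 1 for the binary-support case ($L = 2$), which is the setting of the experiments that follow. It is an illustrative sketch rather than the authors' implementation: the uniform Beta(1, 1) prior matches the paper, but the schedule for $\delta_n$ and the initialization round are assumptions made here, and `variance_ucb` solves the binary-support special case of the program (14) in closed form.

```python
import numpy as np
from scipy.stats import beta

def posterior_interval(alpha, beta_param, delta_n):
    """Posterior bounds (12)-(13) for a Bernoulli parameter with a Beta(alpha, beta) posterior."""
    return beta.ppf(delta_n, alpha, beta_param), beta.ppf(1.0 - delta_n, alpha, beta_param)

def variance_ucb(a, b):
    """Optimum value of (14) for L = 2: maximize 2 q (1 - q) over q in [a, b].

    The objective is concave with its peak at q = 1/2, so the maximizer is the
    point of [a, b] closest to 1/2.
    """
    q = min(max(0.5, a), b)
    return 2.0 * q * (1.0 - q)

def bayesian_ucb(true_pmfs, N, delta=0.05, seed=None):
    """Sketch of Algorithm 1 for K Bernoulli arms: sample the arm maximizing phi(u, T) = u / T."""
    rng = np.random.default_rng(seed)
    K = len(true_pmfs)
    pos = np.zeros(K)     # number of times outcome 2 has been observed for each arm
    counts = np.zeros(K)  # T_n^(k): number of samples drawn from each arm
    for n in range(1, N + 1):
        if n <= K:
            arm = n - 1   # illustrative initialization: sample every arm once
        else:
            delta_n = delta / (2.0 * K * n ** 2)  # illustrative schedule with sum_n delta_n <= delta
            index = np.empty(K)
            for k in range(K):
                # Beta posterior under the uniform Beta(1, 1) prior after counts[k] samples.
                a, b = posterior_interval(1.0 + pos[k], 1.0 + counts[k] - pos[k], delta_n)
                index[k] = variance_ucb(a, b) / counts[k]  # phi(u, T) = u / T
            arm = int(np.argmax(index))
        y = rng.random() < true_pmfs[arm][1]  # outcome of the n-th sample from the chosen arm
        pos[arm] += y
        counts[arm] += 1
    # Empirical estimates of each arm's pmf after N samples.
    p_hat = [np.array([1.0 - pos[k] / counts[k], pos[k] / counts[k]]) for k in range(K)]
    return p_hat, counts

if __name__ == "__main__":
    # The two arms from the experiments, [0.99, 0.01] and [0.7, 0.3], with N = 2500.
    estimates, allocation = bayesian_ucb([[0.99, 0.01], [0.7, 0.3]], N=2500, seed=0)
    print(allocation)  # samples drawn from each arm
    print(estimates)   # empirical estimates of the two pmfs
```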
We consider an experimental setup with two arms ($K = 2$) and a binary support ($L = 2$). The pmfs associated with arms 1 and 2 are $[0.99, 0.01]$ and $[0.7, 0.3]$. The total number of samples is $N = 2500$ and the value of $\eta$ is $1/N$. Figure 1(a) depicts the regret $\mathcal{R}_N$ for the sampling strategy in Shekhar et al. (2020) and our sampling strategy in Algorithm 1. We will refer to the former strategy as UCB and the latter as Bayesian UCB. There is a clear improvement in performance in terms of regret. We also plot the average number of samples acquired by both strategies in Figure 1(c), and we observe that the number of samples acquired by our Bayesian UCB approach is closer to the oracle allocation.

Recall that the main difference between the UCB and the Bayesian UCB strategies is the way the upper bound $u^{(k)}_n$ is computed. In both approaches, we observe that at the end of the horizon the value of the tracking function $\varphi(u^{(k)}_{N+1}, T^{(k)}_{N+1})$ is nearly the same for all arms $k$.

Proposition 7. For every arm $k$, we have $T^{(k)}_{N+1} = N u^{(k)}_{N+1} / \sum_{i} u^{(i)}_{N+1}$, approximately.

In Figure 1(b), we plot a typical realization of the upper bounds $u^{(k)}_n$ computed as in (14). The bounds computed using the method in Shekhar et al. (2020) do capture the difference between the tracking parameters for arms 1 and 2. However, according to Proposition 7, it is the ratio between the upper bounds that matters. The ratio between the upper bounds in Shekhar et al. (2020) is close to 1 even for $N = 2500$. While this ratio will approach $c^{(1)}/c^{(2)}$ eventually, the convergence is quite slow in practice. While our Bayesian approach provides tighter bounds, there is a computational cost associated with it, especially because of the diminishing probability $\delta_n$. For the binary support case, there are simpler closed-form approximations Bayarri and Berger (2004); Brown et al. (2001) that can be used to compute $a^{(k,l)}_n$ and $b^{(k,l)}_n$; one such approximation is sketched below.
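The exact bounds above require inverting a Beta cdf at every step. The sketch below shows one simple closed-form alternative for the binary case: a Gaussian approximation to the Beta posterior with matched mean and variance. This is an illustrative choice made here; it is not necessarily the specific approximation suggested in Brown et al. (2001) or Bayarri and Berger (2004).

```python
import math
from scipy.stats import beta, norm

def beta_quantile_interval(alpha, beta_param, delta_n):
    """Exact posterior bounds (12)-(13) via the inverse Beta cdf."""
    return beta.ppf(delta_n, alpha, beta_param), beta.ppf(1.0 - delta_n, alpha, beta_param)

def normal_approx_interval(alpha, beta_param, delta_n):
    """Closed-form approximation: Gaussian with the same mean and variance as Beta(alpha, beta)."""
    s = alpha + beta_param
    mean = alpha / s
    std = math.sqrt(alpha * beta_param / (s ** 2 * (s + 1.0)))
    z = norm.ppf(1.0 - delta_n)
    return max(mean - z * std, 0.0), min(mean + z * std, 1.0)

if __name__ == "__main__":
    # Hypothetical example: 30 positives out of 200 samples under a uniform prior, delta_n = 1e-3.
    print(beta_quantile_interval(31.0, 171.0, 1e-3))
    print(normal_approx_interval(31.0, 171.0, 1e-3))
```

For moderately large counts the two intervals are close, and the Gaussian version avoids repeated inverse Beta cdf evaluations.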
We propose to use our sampling strategy for tracking SARS-CoV-2 seroprevalence in populations classified on the basis of location, ethnicity, age, etc. Each category of interest can be modeled as an arm with a Bernoulli pmf, where the probability $p^{(k,2)}$ represents the positivity associated with category $k$. We use data from a randomized seroprevalence survey conducted in Los Angeles county Sood et al. (2020). The plots in Figures 2(a) and 2(b) depict the gap between the optimal sample allocation and the random allocation strategy used by the survey. This gap illustrates the importance of adaptive randomized sampling for obtaining uniform estimates of seroprevalence. Without careful sampling, we can have bad estimates for certain groups that may be particularly vulnerable.

Remark 8. For estimating SARS-CoV-2 seroprevalence, there are some practical aspects that require a slightly different formulation and sampling strategy. We provide a heuristic approach for addressing these issues, and the analysis of this approach is a problem for future work.

For the problem of uniformly estimating pmfs in a mean squared sense, a UCB sampling approach is proposed. Unlike the state-of-the-art approach in Shekhar et al. (2020), a Dirichlet prior is assumed on the pmfs and the upper confidence bounds are computed based on the posterior. It is analytically shown that the performance of this Bayesian approach can be no worse than the approach in Shekhar et al. (2020), and that it has noticeably better performance in practice. We discuss the potential application of the sampling approach to SARS-CoV-2 seroprevalence and incidence rate estimation. There are two potential directions for future work. One is to incorporate temporal variation in the underlying pmfs; this is particularly important in studying SARS-CoV-2 incidence rates. The other is to incorporate a prior on the collection of pmfs so that the correlation between them can be exploited to improve our estimates and the sampling strategy. For both these directions, a Bayesian model can be more helpful.

Appendix A (proof of Lemma 5). A Beta distribution with parameters $\alpha, \beta$ is sub-Gaussian with parameter $\frac{1}{4(\alpha + \beta + 1)}$ Marchal et al. (2017). Because of (10), we can therefore use the Chernoff bound Ross (2014) to control the deviation of $p^{(k,l)}$ from its posterior mean, which yields the first inequality in Lemma 5. The second inequality in the lemma follows from the form of the posterior parameters $\alpha^{(k,l)}_n$.

Appendix B (proof of Lemma 6). Under the event $E$, the difference $u^{(k)}_n - c^{(k)}$ can be bounded by a chain of inequalities in which step (a) follows from the triangle inequality and step (b) is due to Lemma 5.

Appendix C (proof of Theorem 4). By a simple change of measure argument, the probability of the unfavorable event $E^c$ under the locally averaged model can be expressed in terms of $\delta_p$, the probability of $E^c$ conditioned on the model $p$. We can then obtain a regret bound in terms of $\delta_p$ without local averaging using the same arguments as in the proofs of Lemma 1 and Theorem 1 in (Shekhar et al., 2020, Appendix D.3), with $c^{(i)} = \sum_l \pi^{(i,l)}(1 - \pi^{(i,l)})$. Using this bound, we can then locally average the regret to complete the proof of Theorem 4.

Appendix D (incorporating priors). In this section, we consider a scenario where we are provided with a prior $\rho_1$ on $p$. We first modify the distribution $\varrho$ used for local averaging as
$$\varrho(A) = \frac{\rho_1\big(A \cap \eta(p)\big)}{\eta}$$
for any Borel measurable set $A \subseteq \Delta^K$. Here, $\eta(p)$ is a ball around $p$ with $\rho_1(\eta(p)) = \eta$.

Dirichlet priors. When the prior $\rho_1$ is any factored Dirichlet distribution (i.e., not necessarily uniform) of the form in (8), one can easily extend all the results with the modified $\varrho$.

Intervals for Bernoulli parameters. Another scenario of interest is when $L = 2$ and we are given that the Bernoulli parameter associated with $p^{(k)}$ lies in some interval $[\gamma^{(k)}_l, \gamma^{(k)}_u]$. We can then assume that the Bernoulli parameter $p^{(k)}$ is distributed uniformly and independently over the interval $[\gamma^{(k)}_l, \gamma^{(k)}_u]$. Since this prior does not correspond to a Beta distribution, the posterior belief may not be a Beta distribution. However, we can still easily find the cdf of the posterior in this case because the posterior on the parameter is just a truncated Beta distribution. Let $F^{(k)}_n$ be the cdf of the posterior at time $n$ with a uniform prior and let $\tilde{F}^{(k)}_n$ be the cdf of the posterior with the truncated prior. Then one can show that, for $x \in [\gamma^{(k)}_l, \gamma^{(k)}_u]$,
$$\tilde{F}^{(k)}_n(x) = \frac{F^{(k)}_n(x) - F^{(k)}_n(\gamma^{(k)}_l)}{F^{(k)}_n(\gamma^{(k)}_u) - F^{(k)}_n(\gamma^{(k)}_l)}.$$
The confidence intervals in (12) and (13) can then be computed using the inverse of $\tilde{F}^{(k)}_n$.

In seroprevalence estimation, we generally allocate samples in batches of size $B$. Also, we are generally interested in estimating the positivity of each category as well as the positivity of the overall population. Let the fraction of individuals of category $k$ in the overall population be $w^{(k)}$. Then the overall positivity $r$ and its estimate $\hat{r}$ are given by
$$r = \sum_{k} w^{(k)} p^{(k,2)}, \qquad \hat{r}_N = \sum_{k} w^{(k)} \hat{p}^{(k,2)}_N.$$
The mean squared error between $r$ and $\hat{r}_N$ is given by
$$\mathbb{E}\big[(r - \hat{r}_N)^2\big] = \sum_{k} \big(w^{(k)}\big)^2 \, \frac{p^{(k,2)}\big(1 - p^{(k,2)}\big)}{T^{(k)}},$$
and a small sketch evaluating these quantities for a candidate allocation is given below.
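As a quick illustration, the following sketch evaluates the per-category mean squared errors and the mean squared error of the overall estimate $\hat{r}$ for a candidate allocation. The weights, positivities, and allocation used here are hypothetical placeholders, not values from the Los Angeles county survey.

```python
import numpy as np

def positivity_mse(positivity, allocation):
    """Mean squared error of each category's positivity estimate: p (1 - p) / T."""
    positivity = np.asarray(positivity, dtype=float)
    allocation = np.asarray(allocation, dtype=float)
    return positivity * (1.0 - positivity) / allocation

def overall_mse(weights, positivity, allocation):
    """Mean squared error of the overall estimate r_hat = sum_k w^(k) p_hat^(k,2)."""
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights ** 2 * positivity_mse(positivity, allocation)))

if __name__ == "__main__":
    # Hypothetical example: three categories with population fractions w,
    # positivities p, and a candidate allocation T of 1000 tests.
    w = [0.60, 0.37, 0.03]
    p = [0.04, 0.06, 0.15]
    T = [400, 400, 200]
    print(positivity_mse(p, T))   # per-category estimation quality
    print(overall_mse(w, p, T))   # quality of the overall estimate
```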
If the mean squared error associated with $r$ is not considered, then it may so happen that a tiny group (small $w^{(k)}$) with high positivity is allocated too many samples. The contribution of this small group to the overall estimate would be small, and thus allocating too many samples to it could compromise the quality of the overall estimate $r$. Therefore, we need to determine an allocation that accounts for the quality of the overall estimate $r$ as well. A suitable way to formalize this notion is to pose the following constraints on the oracle allocation $T^{(k)}$:
$$\mathbb{E}\big[\mathcal{L}^{(k)}\big] \leq \theta^{(k)}, \; k = 1, \ldots, K; \quad \mathbb{E}\big[(r - \hat{r}_N)^2\big] \leq \theta^{(0)}; \quad \sum_{k=1}^{K} T^{(k)} \leq N; \quad T^{(k)} \geq 0, \; k = 1, \ldots, K, \quad \text{(C1)}$$
where $\theta^{(k)}$, $k = 0, \ldots, K$, are predetermined constants. A solution to the above set of constraints can be obtained by solving the following optimization problem:
$$\min_{T^{(1)}, \ldots, T^{(K)}} \; \frac{1}{N}\sum_{k=1}^{K} T^{(k)} \quad \text{subject to} \quad \mathbb{E}\big[\mathcal{L}^{(k)}\big] \leq \theta^{(k)}, \; \mathbb{E}\big[(r - \hat{r}_N)^2\big] \leq \theta^{(0)}, \; T^{(k)} \geq 0, \; k = 1, \ldots, K. \quad \text{(C2)}$$
Problem (C1) is feasible if and only if the optimum value of Problem (C2) is less than or equal to 1; in that case, the solution to Problem (C2) is a solution to Problem (C1). Notice that Problem (C1) is very similar to Problem (P1) except for the additional mean squared error term associated with the overall estimate $r$. Because of this distinction, it is not clear whether one can view Problem (C2) as a particular instance of Problem (P1) with appropriate modifications and simply apply the adaptive sampling strategy in Section 4 to the modified problem. Nonetheless, we provide a similar heuristic sampling approach that tracks the oracle quite well (see Figure 3).

Figure 3: The allocation in the survey Sood et al. (2020), the oracle allocation in Problem (C1) with appropriate constants $\theta^{(k)}$, and the allocation by the heuristic (denoted by Adaptive). Notice that the heuristic tracks the oracle closely.

This plot tells us that, in order to get a good overall estimate, we should allocate fewer samples to an underrepresented group like the Pacific Islanders ($w^{(k)} \approx 0.003$) than the number suggested by the oracle (P1) (see Figure 2(a)). However, it also tells us that we can allocate substantially more samples than the number in the survey ($\approx w^{(k)} N$) to this group for a better estimate of their positivity, without compromising the quality of the overall estimate. At each time $n$, the heuristic is to allocate samples within each batch according to the solution of the following optimization problem:
$$\min_{\tau^{(1)}, \ldots, \tau^{(K)}, \lambda} \; \lambda. \quad \text{(C3)}$$

References

Bayarri and Berger (2004). The interplay of Bayesian and frequentist analysis.
Brown et al. (2001). Interval estimation for a binomial proportion.
Carpentier et al. (2011). Upper-confidence-bound algorithms for active learning in multi-armed bandits.
Efroni et al. (2020). Exploration-exploitation in constrained MDPs.
Marchal et al. (2017). On the sub-Gaussianity of the Beta and Dirichlet distributions.
Rosenberg and Mansour (2019). Online convex optimization in adversarial Markov decision processes.
Ross (2014). Introduction to Probability Models.
Shekhar et al. (2020). Adaptive sampling for estimating probability distributions.
Sood et al. (2020). Seroprevalence of SARS-CoV-2-specific antibodies among adults in Los Angeles County, California.
Tarbouriech and Lazaric (2019). Active exploration in Markov decision processes.

This research was supported, in part, by the National Science Foundation under Grants CCF-1817200, CCF-1718560, and CPS-1446901, by ONR Grant N00014-15-1-2550, and by ARO Grant W911NF1910269.