title: Optimality Conditions and Algorithms for Top-K Arm Identification authors: Wang, Zihao; Yang, Shuoguang; You, Wei date: 2022-05-24

We consider the top-k arm identification problem for multi-armed bandits with rewards belonging to a one-parameter canonical exponential family. The objective is to select the set of k arms with the highest mean rewards by sequential allocation of sampling effort. We propose a unified optimal allocation problem that identifies the complexity measures of this problem under the fixed-confidence and fixed-budget settings, as well as the posterior convergence rate from the Bayesian perspective. We provide the first characterization of its optimality conditions and the first provably optimal algorithm in the fixed-confidence setting for k > 1. We also propose an efficient heuristic algorithm for the top-k arm identification problem. Extensive numerical experiments demonstrate superior performance compared with existing methods in all three settings.

Since their introduction in the 1930s, bandit problems have received enormous interest, particularly in the past two decades. The majority of the literature focuses on the design of efficient algorithms to maximize the cumulative reward; see [28] for a textbook treatment. In applications, however, the objective may not be to maximize the cumulative reward but to identify the competing designs with the best performance. In the multi-armed bandit context, this is called a top-k arm identification problem, where the set of k arms with the highest means is to be learned. It serves as a fundamental model with rich applications. For example, in the COVID-19 vaccine race, the goal is to identify a set of best-performing vaccines as quickly and accurately as possible and push them to mass production. For an aerospace company investigating different proposals for designing a rocket, prototyping is hugely costly; how should it rapidly determine the designs with top-of-the-line reliability? In the applications above, best-performing alternatives are determined based on costly sampling. Algorithms that require fewer samples to reach the desired confidence level or accuracy guarantee are invaluable. An essential step toward understanding the efficiency of an algorithm is to characterize the sample complexity. The special case of k = 1 is called the best arm identification problem, which has been studied for decades under the names ranking and selection or ordinal optimization; see [17, 16] for thorough reviews. The sample complexity in the case of k = 1 has been obtained in the statistics [6, 18], simulation optimization [15], and machine learning [12, 31, 32] literature, each of which derives a complexity measure that depends on the problem instance. The literature above falls within three main formulations of this problem: the fixed-confidence and fixed-budget settings from the frequentist perspective, and the convergence rate of the posterior probability from the Bayesian perspective. This paper proposes a unified optimal allocation problem that characterizes the complexity measure for top-k identification under all three settings for k ≥ 1. Closely related to these complexity measures is the optimal allocation of sampling effort, as discussed in [6, 15]. Finding the optimal allocation of sampling effort to competing arms is essential to the efficiency of any algorithm. Our unified optimization problem characterizes this optimal allocation.
We consider a multi-armed bandit problem with K arms A = [K], where [K] denotes the list {1, 2, . . . , K}. The rewards from arm i ∈ A are independent and identically distributed random variables drawn from the distribution F i . We restrict the focus to the problem where the reward distribution F i follows a onedimensional non-singular canonical exponential family, parameterized by its mean θ i . We assume that rewards from different arms are mutually independent. Let θ = (θ 1 , θ 2 , . . . , θ K ) denote the vector of mean rewards. The set of the top-k arm is assumed to be unique and is defined as I = argmax S⊂[K],|S|=k s∈S θ i , where |S| denotes the cardinality of S. We refer to all other arms as the bottom arms, and define J = [K]\I. At the start of each round t, the player follows a sampling rule to choose an arm I t and observes the noisy reward Y t,It from the selected arm. Based on the information available at the end of round t, the player follows a recommendation rule to propose the estimated top-k armÎ t . The objective is to identify the top-k arm efficiently. To facilitate further discussion, we define the number of samples collected from arm i in the first t rounds as T t,i = t s=1 1(I s = i), and define the sample mean as θ t,i = t s=1 1(I s = i)Y s,Is /T t,i . A natural candidate recommendation is the set of arms with the k largest sample means, i.e., Various settings have been proposed for the top-k arm identification problem in the literature. In this paper, we consider three major formulations: the fixed-confidence and the fixed-budget settings from the frequentist perspective; and the convergence rate of the posterior probability from the Bayesian perspective. We characterize the asymptotic complexity measures under all three settings as a unified optimal allocation problem. Fixed-confidence setting. Given a confidence level δ, the player specifies a stopping rule τ δ to ensure that the probability of incorrect recommendation is less than δ. The goal is to minimize the expected number of samples E θ [τ δ ]. An algorithm is said to be δ-correct if P θ (τ δ < ∞,Î τ δ = I) ≤ δ for all problem instances θ. Let Σ K denote the probability simplex in R K and let ψ ∈ Σ K denote a generic allocation rate vector. The following function is key to our analysis where d(·, ·) is the Kullback-Leibler (KL) divergence of the two reward distributions and The lemma below characterizes an asymptotic lower bound on the sample complexity of any δ-correct algorithm. We defer detailed proof to Appendix E. Lemma 2.1. For any δ-correct algorithm such that , where Γ(ψ) = min i∈I min j∈J C i,j (ψ i , ψ j ). By minimizing the lower bound in Lemma 2.1, we obtain the optimal allocation problem Posterior convergence rate. Many recent works [32, 31, 33] adopt a Bayesian setting and consider the convergence rate of the posterior probability of incorrect selection as a complexity measure. We now characterize the asymptotic optimality in the posterior convergence rate. In the Bayesian setting, a prior distribution π 0 is imposed on the mean vector θ. At the end of round t, the information available to the player is F t ≡ σ({I 1 , Y 1,I1 , . . . , I t , Y t,It }). Let L t (θ) = t s=1 p(Y s,Is |θ Is ) denote the likelihood function. Conditional on F t , the posterior distribution of θ has the following density π t (θ) = π 0 L t (θ) π 0 L t (θ )dθ . 
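To make the function C i,j and the objective Γ(ψ) of the allocation problem (5) concrete, the short sketch below evaluates them for a Gaussian bandit, where d(θ 1 , θ 2 ) = (θ 1 − θ 2 )²/(2σ²) and the infimum defining C i,j is attained at the weighted mean θ i,j (ψ) = (ψ i θ i + ψ j θ j )/(ψ i + ψ j ) (see Lemma E.3), so that C i,j (ψ) = ψ i ψ j (θ i − θ j )²/(2σ²(ψ i + ψ j )). The instance and the allocation vector are arbitrary illustrations, not taken from the paper.

```python
import numpy as np

def kl_gaussian(x, y, sigma2):
    """KL divergence d(x, y) between N(x, sigma2) and N(y, sigma2)."""
    return (x - y) ** 2 / (2.0 * sigma2)

def C(i, j, psi, theta, sigma2):
    """Transportation cost C_{i,j}(psi) for a Gaussian bandit.

    The infimum over the common value is attained at the weighted mean
    theta_bar = (psi_i*theta_i + psi_j*theta_j) / (psi_i + psi_j).
    """
    theta_bar = (psi[i] * theta[i] + psi[j] * theta[j]) / (psi[i] + psi[j])
    return (psi[i] * kl_gaussian(theta[i], theta_bar, sigma2)
            + psi[j] * kl_gaussian(theta[j], theta_bar, sigma2))

def Gamma(psi, theta, k, sigma2):
    """Objective Gamma(psi) = min over top arms i and bottom arms j of C_{i,j}(psi)."""
    top = np.argsort(theta)[::-1][:k]          # indices of the k largest means
    bottom = [a for a in range(len(theta)) if a not in set(top)]
    return min(C(i, j, psi, theta, sigma2) for i in top for j in bottom)

if __name__ == "__main__":
    theta = np.array([0.9, 0.7, 0.5, 0.3])     # illustrative means
    psi = np.array([0.4, 0.3, 0.2, 0.1])       # an arbitrary allocation in the simplex
    print(Gamma(psi, theta, k=2, sigma2=0.25))
```

For other exponential families, only kl_gaussian and the location of the minimizer change; Table 1 in Appendix A lists the explicit forms of C i,j for several distributions.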
For the set of top arms I, the set of parameters that makes I optimal is The performance of an algorithm is evaluated by P πn (Θ), i.e., the posterior probability that I is optimal. We are interested in the optimal rate at which this posterior probability converges to 1 with the number of samples n → ∞. For any algorithm allocates sampling effort according to some probability vector ψ, Proposition 5 and Lemma 2 of [32] imply that Note that Θ can be decomposed as Θ c = i∈I j∈J Θ c i,j , the posterior probability of incorrect selection is bounded by max i∈I max j∈J P πn (Θ c i,j ) ≤ P πn (Θ c ) ≤ |I||J | max i∈I max j∈J P πn (Θ c i,j ). By (8) and (9) , the optimal allocation problem takes the same form as (5) , i.e., Fixed-budget setting. Given a finite sampling budget n, the goal is to minimize the frequentist probability of incorrect recommendation. For ordinal optimization (selecting the top-1 arm), [15] proposed a large deviation framework. We generalize their results to the top-k arm identification problem. Let P n (·) denote the joint distribution of the vector of the sample means of the rewards collected from the arms. The following lemma extends Lemma 1 of [15] to the top-k arm setting. Lemma 2.2. For any static allocation rate ψ and any θ i > θ j , Furthermore, there exists a unique solutionθ B i,j to the infimum in (10) . Recall that Θ is defined in (7) , the probability of incorrect recommendation can be bounded by Hence, the large deviation rate of the probability of incorrect recommendation satisfies As a result, maximizing the large deviation rate is equivalent to the following optimization problem Remark 1. Due to the asymmetry of the KL divergence for general distributions, the functions B i,j and C i,j may not be the same. This is consistent with previous studies [23, 3] of the best-arm identification (k = 1) problem, which shows that the complexity of the fixed-budget setting may be smaller than the complexity of the fixed-confidence setting under general reward distributions. For Gaussian bandits, however, it can be verified that B i,j = C i,j . Hence, the complexity measure and optimal allocation rates under two settings coincide. Henceforth, we focus on the optimal allocation problem (5). We remark that all results in this section can be easily adapted to address the optimal allocation problem (11) in the fixed budget setting, observing that it shares the same structure as (5) . We omit the details. Assuming the explicit knowledge of the true mean rewards, we conduct a comprehensive study of (5) from both structural and algorithmic perspectives. In Section 3.1, we provide novel structural insights to its optimal solution. The optimality condition to this problem is complicated and hence cannot be easily adapted into a learning algorithm. Motivated by the structural insights, we address this challenge by formulating (5) as a saddle-point problem and proposing an algorithm based on the Frank-Wolfe algorithm in Section 3.2. We prove that our algorithm finds an optimal solution to (5) with a guarantee on the convergence rate. The key challenge in solving (5) is that Γ(ψ) is a non-smooth function in ψ. As we will discuss in this subsection, there are multiple (i, j) pairs such that C i,j (ψ * ) achieves the optimal value Γ * , and the locations of such pairs exhibit a complicated structure. To begin with, we identify the following necessary conditions for optimality, which can be viewed as a direct extension of a well-known result for k = 1, e.g., Proposition 8 of [32] . 
Proposition 3.1. The optimal solution to (5) satisfies the following conditions As a corollary of Proposition 3.1, we show that the optimal allocation rates are monotone within the top-arm set and the bottom-arm set, respectively. This generalizes the observation in [12] for the top-1 arm identification problem. For any j, j ∈ J such that θ j ≤ θ j , we have ψ * j ≤ ψ * j . We now identify another necessary condition for optimality. For general reward distributions, this condition appears to have an obscure expression. Hence, we postpone it to Appendix A. However, the result is significantly simplified for Gaussian bandits as we discuss now. Recall from Proposition 3.1 that there are multiple (i, j) pairs such that C i,j (ψ * ) = Γ * . We say that a bipartite graph G(I, J , E) is induced by the optimal solution ψ * if an edge connects i and j if and only if C i,j (ψ * ) = Γ * . The bipartite graph may not be connected, as Example 1 below exhibits. To proceed our analysis, we decompose the bipartite graph into L connected subgraphs G l (I l , J l , E l ), i = 1, . . . , L. By Proposition 3.1, for any arm i ∈ I ∪ J , there must exist an edge e ∈ E such that i is incident to e. Hence, we have I l = ∅ and J l = ∅ for all l = 1, . . . , L. We say that (for a Gaussian bandit) a connected subgraph G l balances the sum-of-squared allocations if Now, we are ready to present the necessary condition for Gaussian bandits. A generalization of the following result under general reward distributions is provided in Appendix A. Proposition 3.3. For any Gaussian bandit, each connected subgraph of the bipartite graph induced by the optimal solution balances the sum-of-squared allocations. The following example demonstrates that the bipartite graph can be disconnected. Nevertheless, each connected subgraph balances the sum-of-squared allocations at optimality. Example 1. Consider a Gaussian bandit with θ = {0.51, 0.5, 0, −0.01, −0.092}, σ 2 = 1/4 and k = 2. The optimal allocation is ψ * ≈ (0.2185, 0.2371, 0.2185, 0.2026, 0.1232). We have Γ * = C 1,3 = C 2,4 = C 2,5 = 0.0568. Hence, there are two connected subgraphs. Note that the sum-of-squared allocations are balanced, i.e., (ψ * 1 ) 2 = (ψ * 3 ) 2 and (ψ * 2 ) 2 = (ψ * 4 ) 2 + (ψ * 5 ) 2 . Remark 2. Proposition 3.3 implies that the sum-of-squared allocations are balanced in the entire bipartite graph, i.e., This recovers the observations in [12] for k = 1 and in [36] for k ≥ 1. We emphasize that Example 1 implies that Proposition 3.3 is strictly stronger than (13) . Remark 3. We point out that (12) and (13) (12) and (13) with an objective value of Γ 1 ≈ 0.0873. For ψ 1 , we have C 1,3 = C 1,4 = C 2,4 < C 2,3 . However, the optimal solution ψ * ≈ (0.1277, 0.3894, 0.4016, 0.0813), with an optimal objective value of Γ * ≈ 0.1938 > Γ 1 . For the optimal solution, we have C 1,3 = C 2,4 = C 2,3 < C 1,4 . Next, we present a sufficient condition for a large set of problem instances θ. We discuss complete characterization in Appendix A, but it does not appear to have an elegant presentation. For the ease of notation, we re-label the arm with the l-th largest mean as arm i l , for l = 1, . . . , k and re-label the arm with the (l + k)-th largest mean as arm j l , for l = 1, . . . , K − k. Theorem 3.4. Consider any problem instance such that the optimal solution ψ * satisfies Then ψ * is the optimal solution to (5) if Remark 4. For general bandits, condition (14) always holds when (i) |I| ≤ 2 and |J | ≤ 2; and (ii) |I| = 1 or |J | = 1. 
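The claims of Proposition 3.3 and Example 1 can be checked numerically. Below is a small verification sketch using the Gaussian closed form C i,j (ψ) = ψ i ψ j (θ i − θ j )²/(2σ²(ψ i + ψ j )); the means and the allocation ψ* are copied from Example 1, with arm indices shifted to 0-based.

```python
import numpy as np

theta = np.array([0.51, 0.5, 0.0, -0.01, -0.092])   # Example 1, k = 2
psi = np.array([0.2185, 0.2371, 0.2185, 0.2026, 0.1232])
sigma2 = 0.25

def C(i, j):
    # Gaussian transportation cost C_{i,j}(psi).
    return psi[i] * psi[j] * (theta[i] - theta[j]) ** 2 / (2 * sigma2 * (psi[i] + psi[j]))

top, bottom = [0, 1], [2, 3, 4]                      # I = {1, 2}, J = {3, 4, 5} in the paper's numbering
for i in top:
    for j in bottom:
        print(f"C_{i + 1},{j + 1} = {C(i, j):.4f}")
# Binding pairs C_1,3, C_2,4, C_2,5 all evaluate to about 0.0568 (= Gamma*); the
# remaining pairs are slightly larger (psi* is rounded, so the margins are small).

# Balance of sum-of-squared allocations within each connected subgraph:
print(psi[0] ** 2, psi[2] ** 2)                      # (psi_1)^2 = (psi_3)^2
print(psi[1] ** 2, psi[3] ** 2 + psi[4] ** 2)        # (psi_2)^2 = (psi_4)^2 + (psi_5)^2
```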
For Gaussian bandits with equal variance, the assumption (14) reduces to (ψ * i k ) 2 > j∈J \{j1} (ψ * j ) 2 ; and (16) reduces to the balancing of the sum-of-squared allocations. Recall the monotonicity in Proposition 3.2, we have ψ j1 ≥ ψ j for j = j 1 . In general, ψ * j1 − ψ * j increases as the gap (θ j1 − θ j ) increases. Roughly speaking, when (θ j1 − θ j ) is large enough with respect to (θ i k − θ j1 ), condition (14) will hold. We caution that (14) is non-trivial to check, and developing a sampling rule based on tracking the sufficient condition can be challenging. To conclude this subsection, we present a necessary and sufficient condition, which motivates our algorithm. To this end, note that (5) is equivalent to In the proof of Theorem 3.5, we show that (17) is a convex optimization problem and the Slater's condition holds. Hence, the Karush-Kuhn-Tucker (KKT) condition is necessary and sufficient for global optimality. Let µ ∈ Σ k(K−k) be the dual variables that correspond to the inequality constraints in (17) . Define the following sampling functions For Gaussian bandits, we have h i;j (ψ) = ψ j /(ψ i + ψ j ) and h j;i (ψ) = ψ i /(ψ i + ψ j ). As demonstrated by Theorem 3.4, the sufficient conditions exhibit complicated structure. Furthermore, the condition in Theorem 3.5 is described by a set of nonlinear equations, which is nontrivial to solve. In this section, we present a simple algorithm that generates a sequence of solutions converging to the optimal solution to (5). We further provide a theoretical guarantee on the convergence rate. The algorithm can be directly integrated with the learning of the mean reward vector, see Section 4.1. Recall that (5) targets to maximize the minimum of the concave functions C i,j 's over sets I and J . The minimization part here introduces a non-smooth structure, so that the minimum is non-smooth even if each C i,j is smooth. We consider the following generic non-smooth optimization problem: where the feasible region X is convex and closed, and C i,j 's are smooth concave functions. We denote by C(ψ) ∈ R |I||J | the vectorized form of C i,j 's, and denote by µ ∈ R |I||J | an auxiliary vector of length |I||J |. By using the strong duality, we provide the following result to reformulate the non-smooth problem (19) as a saddle point problem. Proposition 3.6. The dual problem of (19) is formulated as where (µ * , ψ * ) is called a saddle point. We point out that here, it is no longer required to handle the non-smoothness within (19) . We now present our Frank-Wolfe Gradient Ascent (FWGA) algorithm, which can be treated as a variant of the hybrid method that combines an FW step and a gradient ascent projection step within each update. In the k-th iteration of our algorithm, at the current solution (µ k , ψ k ), FWGA alternatively conducts an FW step to update µ, and conducts one-step gradient ascent to update ψ. Note that here the feasible region of ψ, Σ K , is a simplex, making projections of ψ is easy to conduct. We summarize the details of our FWGA algorithm in Algorithm 1. Algorithm 1 FWGA for optimal allocation problem (5) 1: Input: θ, ψ 0 , µ 0 , T 2: for t = 1, . . . , T do 3: Define the objective optimality gap induced by solution (µ k , ψ k ) as By the saddle point formulation (20) , we observe that Q k = 0 at the optimal solution (µ * , ψ * ) and Q k > 0 otherwise. The next theorem establishes the convergence rate of the objective optimality gap. Theorem 3.7. 
Suppose the function C(ψ) is Lipschitz continuous over Σ K and we run FWGA for N rounds with step-sizes α t = 2/(t + 1) and τ t = N^{3/2}/t. Then there exists a constant C 0 > 0 such that Q N ≤ C 0 /√N. The result above suggests that our FWGA algorithm requires O(1/ε²) steps to obtain an ε-optimal solution. As pointed out by [26], the sample complexity of FW-type saddle point algorithms is lower bounded by a term of order 1/ε². This suggests that our result matches the state-of-the-art convergence rate for non-smooth convex-concave saddle point problems under the FW framework. We now incorporate the learning of mean rewards and propose two anytime and parameter-free algorithms for the top-k identification problem. In Section 4.1, we integrate a solver for (5) (e.g., FWGA) with the Track-and-Stop framework of [12], and we provide a matching upper bound on the sample complexity in the fixed-confidence setting. In Section 4.2, we propose an efficient heuristic algorithm that tracks the KKT conditions in Theorem 3.5. Building upon the convergence guarantee of the FWGA procedure, we can integrate FWGA and the learning of mean rewards in the Track-and-Stop framework of [12]. We show that such an integration yields an asymptotically optimal top-k arm identification algorithm in the fixed-confidence setting. We start by introducing Chernoff's stopping rule for top-k arm identification: τ δ = inf{t : Z t > β(t, δ)}, where Z t = min i∈Î t min j / ∈Î t [T t,i d(θ t,i , θ t,i,j ) + T t,j d(θ t,j , θ t,i,j )], θ t,i,j = (T t,i θ t,i + T t,j θ t,j )/(T t,i + T t,j ), and β(t, δ) is a threshold to be determined. We propose the FWGA-TaS algorithm that combines FWGA, the C-Tracking or D-Tracking rule (see Appendix B.2) of [12], Chernoff's stopping rule (22) and the recommendation rule (1). The following lemma provides a sufficient condition for δ-correctness, which extends Proposition 12 in [12] to the top-k arm identification problem. Proofs and discussions are deferred to Appendix B. Lemma 4.1. Let δ ∈ (0, 1) and α > 1. There exists a constant C = C(α, K) such that, for any sampling rule, using Chernoff's stopping rule with threshold β(t, δ) = log(Ct α /δ) and the recommendation rule (1) ensures that for any θ, P θ (τ δ < ∞, Î τ δ ≠ I) ≤ δ. Next, we show that FWGA-TaS admits an upper bound that matches the lower bound in (4). The Track-and-Stop sampling rule requires solving (5) in each step, which can be time-consuming. We now present an efficient heuristic algorithm that tracks the KKT condition (18), eliminating the need to solve the optimal allocation problem. We illustrate the main idea in the setting where θ is known. At time step t, the algorithm first identifies a pair of arms (i, j) and then flips a biased coin to decide which of the two to sample. We interpret the dual variable µ i,j as the relative frequency of generating the (i, j)-pair. By (18b), µ i,j > 0 holds only if C i,j = min i' min j' C i',j' at the optimal solution. Accordingly, we choose (i, j) = arg min i∈I,j∈J C i,j and break ties arbitrarily. Given the chosen (i, j)-pair, we then select i with probability h i;j and j otherwise; recall that h i;j + h j;i = 1. The overall frequencies of drawing the arms are then exactly given by (18a). Intuitively, this procedure enforces all conditions in (18) as t → ∞. In Appendix C.1, we provide further discussion of this algorithm for solving (5). To integrate the learning of the mean rewards, we propose Algorithm 2 to track a sample version of (18). Let ψ t,i = T t,i /t be the allocation rate of arm i at the end of round t. In a Bayesian setting, the sample version of θ is given by its posterior sample θ̃ t ∼ π t . A schematic sketch of one round of this procedure is given below.
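As an illustration, the following sketch implements one round of the sampling rule of Algorithm 2 for Gaussian rewards with known variance and an improper flat prior, so that the posterior of each mean is N(θ t,i , σ²/T t,i ). It is a schematic reading of the procedure just described, not the authors' implementation; the helper names and the assumption that every arm has already been pulled at least once are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def kkt_ts_round(sum_reward, count, k, sigma2):
    """One round of the sampling rule of Algorithm 2 (Gaussian rewards, flat prior).

    sum_reward[i], count[i]: cumulative reward and number of pulls of arm i
    (every arm is assumed to have been pulled at least once).
    Returns the index of the arm to pull next.
    """
    t = count.sum()
    psi = count / t                                          # allocation rates psi_{t,i} = T_{t,i}/t
    mean = sum_reward / count
    theta_tilde = rng.normal(mean, np.sqrt(sigma2 / count))  # posterior sample of each mean

    order = np.argsort(theta_tilde)[::-1]
    top, bottom = order[:k], order[k:]                       # top/bottom sets under the sample

    # Step 3: the (i, j)-pair with the smallest sampled transportation cost.
    best_pair, best_val = None, np.inf
    for i in top:
        for j in bottom:
            bar = (psi[i] * theta_tilde[i] + psi[j] * theta_tilde[j]) / (psi[i] + psi[j])
            c = (psi[i] * (theta_tilde[i] - bar) ** 2
                 + psi[j] * (theta_tilde[j] - bar) ** 2) / (2 * sigma2)
            if c < best_val:
                best_pair, best_val = (i, j), c
    i, j = best_pair

    # Step 4: pull i with probability h_{i;j} = psi_j/(psi_i + psi_j), otherwise pull j.
    return i if rng.random() < psi[j] / (psi[i] + psi[j]) else j

# Example usage on a 5-arm instance after one pull of each arm:
# arm = kkt_ts_round(np.array([0.4, 0.6, 0.1, 0.0, -0.2]), np.ones(5), k=2, sigma2=0.25)
```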
In Algorithm 2, the sample versions θ̃ t,i,j , C̃ t,i,j and h̃ t,i;j of θ i,j (ψ), C i,j and h i;j are obtained by evaluating the corresponding definitions at the current allocation rates ψ t , with θ replaced by the posterior sample θ̃ t . We refer to this procedure as the KKT Thompson Sampling (KKT-TS) algorithm. For the fixed-budget setting, a similar algorithm can be obtained by replacing C i,j with B i,j . 3: (i t , j t ) = arg min i∈Ĩ t ,j∈J̃ t C̃ t,i,j . 4: I t = i t with probability h̃ t,i t ;j t and I t = j t otherwise. 5: Collect reward Y t,I t and update π t+1 by (6). In this section, we conduct numerical studies of our algorithms. Discussions of relevant algorithms can be found in Appendix C. Additional numerical examples are collected in Appendix D. We consider the fixed-confidence and fixed-budget performance for the Gaussian bandit with θ i = 1 − (i − 1)/20 for i ∈ [20] and σ 2 = 1/4, where the goal is to select the top-5 arms. For the fixed-confidence setting, we compare our KKT-TS with UGapE, KL-LUCB, KL-Elim and a uniform allocation rule. We also include a DT-TS algorithm, which combines the KKT-Tracking algorithm (Appendix C.1) with the Thompson Sampling Direct-Tracking algorithm (Appendix C.2). For implementation details and discussions of existing algorithms, we refer to Appendix C.3. We adopt the Chernoff stopping rule (22) with a heuristic threshold β(t, δ) = log((log(t) + 1)/δ) for both our algorithms and the uniform allocation rule, as in [12]. To make fair comparisons, we use β(t, δ) as the exploration rate in UGapE, KL-LUCB and KL-Elim, coupled with their own stopping rules. Figure 1 demonstrates the fixed-confidence performance of our algorithms. We also report the realized error rate δ̂, i.e., the empirical probability that the algorithm fails to identify the correct top arms upon stopping. We observe that the stopping rule is still conservative, and our algorithms significantly outperform the alternatives. Although UGapE achieves a smaller sample complexity for δ = 0.1, its realized error rate is much higher than that of our algorithms. Figure 1: Fixed-confidence sample complexity for the Gaussian bandit with θ i = 1 − (i − 1)/20, i ∈ [20]. Estimated over 10^4 independent replications; the realized error rate is reported as δ̂. Figure 2 demonstrates the fixed-budget performance. We remark that UGapE is sensitive to tuning parameters that are nontrivial to set. We give UGapE an advantage by selecting the best parameter (a = 5) in hindsight and passing through the true sample complexity. We point out that our algorithm is parameter-free and yet achieves better performance. We proposed a unified optimal allocation problem to characterize the complexity measures of top-k arm identification in the fixed-confidence, fixed-budget, and posterior convergence rate settings. By combining FWGA and TaS, we provide the first algorithm that achieves optimal sample complexity in the fixed-confidence setting. The main limitation is the computational burden associated with TaS, which requires solving the optimal allocation problem in each step. Furthermore, there is a gap between the theoretical guarantee and the practical performance. By analyzing the optimality conditions of the optimization problem, we propose the heuristic KKT-TS algorithm, which eliminates the need to solve the optimal allocation problem in each iteration. KKT-TS is parameter-free and applies to general exponential family distributions. Numerical studies demonstrate the exceptional empirical performance of KKT-TS. In future work, we wish to establish convergence and sample complexity guarantees for the KKT-TS algorithm. Another important future direction is to justify the benefits of using Thompson Sampling. The following proposition reveals some basic properties of C i,j , which are frequently used in our analysis. Proof. See the proof of Lemma E.3.
We also give the second order derivative of C i,j (ψ) for the reference of interested readers. The calculation detail is omitted. The smallest eigenvalue can be calculated correspondingly, Table 1 gives examples of the explicit expressions of C i,j for some distributions. For ease of notation, we define x * = 1 − x for any x ∈ (0, 1). Table 1 : In Proposition 3.3, we show that for Gaussian bandits, the optimal solution must balance the top and bottom arms in all connected subgraphs in terms of the sum-of-squared allocations. For general reward distribution, we have similar results for each connected subgraph. However, the form of this necessary condition cannot be easily simplified. To derive the general result, we mirror the proof of Proposition 3.3 in Section E.6. Consider any connected subgraph G l = G(I l , J l , E l ) of the bipartite graph induced by an optimal solution ψ. By Proposition 3.1, for any arm i ∈ I ∪ J , there must exist an edge e ∈ E such that i is incident to e. Hence, we have I l = ∅ and J l = ∅. Take any i 1 ∈ I l as an anchor point. By the connectedness of G l , for any j ∈ J l , there exists a path p i1,j = (i 1 , j 1 , i 2 , j 2 , . . . , j r = j) connecting i 1 and j. Along the path, we perform the Taylor's expansion that Recall from Proposition A.1 that Define˜ i1 and˜ j1 such that the first order terms in the Taylor's series vanish Hence, we can write˜ Repeating this procedure along the path connecting i 1 and j, we havẽ Similarly, we have for any i ∈ I l , there exists a path p i1,i = (i 1 , j 1 , i 2 , j 2 , . . . , i r = i) connecting i 1 and i. We can then derive˜ We adopt the convention that S i1→i1 = 1. For Gaussian bandits, this simplifies to For general distribution, we cannot easily simplify S i1→j and S i1→i . With the same argument as in Section E.6, we provide the following proposition for general bandits. Proposition A.2. Consider a bandit with general exponential family rewards. Let G be the bipartite graph induced by the optimal solution ψ * and let G l be any connected subgraph of G. For any anchor point i ∈ I l , G l balances the top and bottom arms Remark 5. Condition (15) implies that the bipartite graph is connected. By setting the anchor point as i k in Proposition A.2, we see that (14) is equivalent to the balancing of top and bottom arms. In Theorem 3.4, we characterized the sufficient conditions for optimality under the assumption of (14) . We now discuss how to obtain complete characterization. We assume that the mean rewards are mutually different without loss of generality (for the optimal allocation problem). Otherwise, we can combine arms with the same mean rewards and evenly distribute the allocation among them. To facilitate our discussion, we present C i,j (ψ * ) as a matrix in R |I|×|J | , where the (i, j)-th entry is "=" if C i,j (ψ * ) = Γ * and ">" if C i,j (ψ * ) > Γ * . The following matrix represents the case discussed in Theorem 3.4. Recall the assumption in Theorem 3.4, i.e., In general, the ='s locations can be quite different from the clean structure in the matrix above. We illustrate with an example and discuss the sufficient conditions for it. Consider choosing the top-2 arms among 5 competing arms. Without loss of generality, we assume that the arms are sorted in descending order of the mean rewards, i.e., I = {1, 2} and J = {3, 4, 5}. We may have the following scenario. Mirroring the proof of Theorem 3.4, we have the following sufficient conditions for this case. Theorem A.3. 
Consider any problem instance such that the optimal solution ψ * satisfies Then ψ * is the optimal solution to (5) if and Remark 6. For this scenario, the bipartite graph is connected, and the condition (26) Remark 7. For Gaussian bandits with equal variances, the assumption (24) is equivalent to Recall that (16) reduces to Hence the two scenarios are mutually exclusive. Recall that we have the balancing of top and bottom arms in the entire bipartite graph, i.e., By the monotonicity in Proposition 3.2, we must have which contradicts with the balancing of top and bottom arms. Finally, recall that Example 1 corresponds to Combining (14), (24) and (27), we have now fully characterized the sufficient conditions for the case of choosing 2 arms among 5. We omit the discussion for general k and K. A similar analysis can be carried out. However, we are cautious that any elegant presentation can be derived due to the combinatorial number of scenarios, even though conditions such as (14) and (24) can significantly reduce the number of scenarios to be discussed. We consider the Chernoff stopping rule from the perspective of generalized hypothesis testing. Recall that at the end of each round t, the information available to the player is {I 1 , Y 1,I1 , . . . , I t , Y t,It }. Based on this information, the player recommend top-k arm setÎ t aŝ Define the following reject region: and the Generalized Likelihood Ratio Test statistic To ensure the δ-correctness, we may find the threshold β t,δ such that, Next we show a convenient closed-form to compute the test statistic Z t . First, it is clear that Then Z t can be simplified as Thus, for the fixed-confidence setting, at the end of each round t, the algorithm compute Z t . If Z t > β t,δ , then it stops and recommendÎ t . Proof of Lemma 4.1 Mirroring the proof of Proposition 12 in [12] , we first rewrite the test statistics Z t as where Z t,i,j = T t,i d(θ t,i , θ t,i,j ) + T t,j d(θ t,j , θ t,i,j ). One can check that if θ t,i > θ t,j and θ i < θ j , then Using this property, we have the following bound. The last equality is used in the proof of Proposition 12 in [12] . Hence, for α > 1, choosing the same constant C(α, K) in [12] ensures that P θ (τ δ < ∞,Î τ δ = I) ≤ δ. Building on the fast convergence result of the FWGA procedure, we can integrate it and parameter learning under the framework of "Track-and-Stop" strategy in [12] . Accordingly, we can obtain the optimal sample complexity. For completeness, we give the detailed sampling rule and sketch the proof by following [12] . Let ψ * (θ) be the solution returned by FWGA. Assume that ψ * (θ) is continuous in every θ. Theorem 4.2 can be restated as follows. Theorem B.1. Let α = (1, e/2] and r(t) = O(t α ). Using the Chernoff 's stopping rule τ δ = inf{t : Z t > β(t, δ)} with β(t, δ) = log(r(t)/δ) and the sampling rule C-Tracking or D-Tracking, we have Proof. By the continuity of ψ * (θ), for any > 0, there exists ξ = ξ( ) such that for all θ ∈ Θ : Conditional on E T , we conduct the following analysis: 1. By Lemma 20 in [12] , there exists a constant L such that for T > L, both C-Tracking and D-Tracking yield that for any t > √ T , . where the last inequality comes from lemma 19 (in [12] ) and, B and C are constants. Next we bound T 0 . Let η > 0 and One can show that where D is a constant such that r(T ) ≤ DT α . We are now ready to bound E θ [τ δ ]: For every η > 0 and > 0, we have, Then letting η → 0 and → 0, it follows that The proof is completed. 
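For reference, the stopping statistic discussed above is cheap to compute. The sketch below evaluates Z t as the minimum of Z t,i,j over i in the empirical top set and j outside it for Gaussian rewards, and compares it with the heuristic threshold β(t, δ) = log((log t + 1)/δ) used in the experiments of Section 5. This is an illustrative reading of the stopping rule, not code released with the paper, and it assumes every arm has been pulled at least once.

```python
import numpy as np

def chernoff_Z(sum_reward, count, k, sigma2):
    """Chernoff stopping statistic Z_t for Gaussian rewards.

    Z_t = min over i in the empirical top-k set and j outside it of
          T_i * d(mean_i, pooled) + T_j * d(mean_j, pooled),
    where pooled = (T_i*mean_i + T_j*mean_j)/(T_i + T_j) and d is the Gaussian KL.
    """
    mean = sum_reward / count
    order = np.argsort(mean)[::-1]
    top, bottom = order[:k], order[k:]
    z = np.inf
    for i in top:
        for j in bottom:
            pooled = (count[i] * mean[i] + count[j] * mean[j]) / (count[i] + count[j])
            z_ij = (count[i] * (mean[i] - pooled) ** 2
                    + count[j] * (mean[j] - pooled) ** 2) / (2 * sigma2)
            z = min(z, z_ij)
    return z

def should_stop(sum_reward, count, k, sigma2, delta):
    # Heuristic threshold beta(t, delta) = log((log t + 1)/delta) used in the experiments.
    t = count.sum()
    return chernoff_Z(sum_reward, count, k, sigma2) > np.log((np.log(t) + 1) / delta)
```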
Detaching the learning part of Algorithm 2, we immediately obtain Algorithm 3 for solving (5) . We now show that the KKT-Tracking algorithm can be viewed as a dual descent algorithm. Algorithm 3 KKT-Tracking for (5) 1: Input: θ, ψ 0 2: Output: ψ T 3: for t = 1, 2, · · · T do 4: To start, we consider yet another equivalent formulation of the problem (5). Proposition C.1. ψ * is an optimal solution to (5) if and only if By Proposition C.1, it is equivalent to considering the following optimization problem: Later, we will see that this log-transformation induces the simple updating rule of ψ t of Algorithm 3. Proposition C.2. The dual problem of (33) can be formulated as In Step 4 of Algorithm 3, we break tie arbitrarily so that only one pair (i, j) is selected at each time step t. The dual variables µ i,j are interpreted as the frequency at which the pair (i, j) is selected. We now show that the dual variables are updated implicitly. Define µ t i,j the frequency that (i, j) pair has been chosen up to time t. Then i∈I j∈J µ t i,j = 1 holds for all t. Suppose that ψ t−1 ∈ argmax ψ∈Σ K i∈I j∈J µ t−1 i,j log C i,j (ψ). To do dual descent, we perform one-step Frank-Wolfe algorithm: Note that Step 4 of Algorithm 3 is equivalent to Step 1 of the Frank-Wolfe algorithm above, which yields z t = I (i,j) , i.e., the I × J matrix with (i, j)-th entry being 1 and 0 elsewhere. After updating µ t , we solve ψ t ∈ argmax ψ∈Σ K i∈I j∈J µ t i,j log C i,j (ψ) to prepare for next round dual descent. is equivalent to ψ i = j∈J µ i,j h i;j , for all i ∈ I, Given µ, a close-form solution for (35) is not readily available. We may choose to solve ψ in each round numerically. In the KKT-Tracking algorithm, we perform one-step iteration on (35), i.e., The following lemma shows that above updating rule yields exactly Step 7 of the KKT-Tracking algorithm. , then one-step iteration gives the following equation: The Track-and-Stop strategy [12] requires forced exploration, which can hamper practical performance of the algorithm. The authors there observed that using sample means as the plug-in estimate for θ can be hazardous because inaccurate initial estimates can lead to the abandonment of top arms. Consequently, they propose to incorporate forced explorations. We take an alternative approach inspired by Thompson Sampling (TS). In particular, we uses posterior samples of the mean rewards,θ t ∼ π t , as the plug-in estimates. As such, we implicitly force the exploration of the inadequately sampled arms. Consider a meta algorithm with access of a Solver for (5), for example Algorithm 1 or Algorithm 3. In each round, we sampleθ t from the posterior distribution π t of the mean rewards. The Solver takesθ t as input and produces the estimated optimal allocation ψ t = Solver(θ t ). The algorithm then selects the arm whose current allocation deviates the most from ψ t . We refer to it as the Thompson Sampling Direct-Tracking algorithm. 1: Sampleθ t ∼ π t 2: ψ t ← Solver(θ t ) 3: I i ← argmax i∈[K] tψ t,i − T t,i 4: Collect reward Y t,It and update π t+1 by (6). Assume that θ is sorted by descending order θ 1 ≥ · · · ≥ θ k > θ k+1 ≥ · · · ≥ θ K . Many existing algorithms for top-k arm identification try to match the sample complexity 1 ∆i , where the gap ∆ i for each arm is defined as Evaluating the empirical gap is the critical step. Hence, these algorithms share a similar structure and performance. 
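For concreteness, the following snippet computes the gap quantities on which the algorithms surveyed next rely. We use the standard top-k definition from this strand of literature, Δ i = θ i − θ k+1 for a top arm and Δ i = θ k − θ i for a bottom arm, together with H = Σ i Δ i ^{−2}; this should be read as our illustration rather than a restatement of any single cited algorithm.

```python
import numpy as np

def gaps_and_H(theta, k):
    """Standard top-k gaps and the complexity H = sum_i 1/Delta_i^2.

    Assumes theta is sorted in descending order, as in the surrounding text.
    """
    theta = np.asarray(theta, dtype=float)
    delta = np.empty_like(theta)
    delta[:k] = theta[:k] - theta[k]        # top arms: gap to the (k+1)-th largest mean
    delta[k:] = theta[k - 1] - theta[k:]    # bottom arms: gap to the k-th largest mean
    return delta, np.sum(1.0 / delta ** 2)

# Example: the 20-arm instance from Section 5 with k = 5.
theta = 1 - np.arange(20) / 20
print(gaps_and_H(theta, k=5)[1])
```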
In the fixed-confidence setting, existing algorithms use upper or lower confidence bound as the plug-in estimate for θ i , including LUCB, Racing [24] , and UGapEc [9] . In the fixed-budget setting, existing algorithms use the sample mean as the plug-in estimate for θ i , including SAR [2] , GapE, UGapEb [9] . Note that this complexity measure H is derived from approximations of the optimal allocation, albeit with a tractable closed-form. A refined sample complexity and optimal allocation ratio can be obtained by solving a max-min optimization problem, which shows the sample complexity H is not tight. In [12] , the optimal allocation ratio for best arm identification (corresponding to top-k=1) is achieved by adaptively tracking the optimal conditions of the corresponding optimization problem. Under this framework, the optimal algorithms of many bandit pure-exploration problems [8, 29, 34] have been derived. However, these algorithms focus on choosing a unique arm, which can not be easily extended to top-k arm identification. To the best of our knowledge, a computationally efficient procedure to track the optimal conditions of top-k identification is still missing in the bandit literature. There is also a branch of algorithms under the Optimal Computing Budget Allocation (OCBA) framework in the simulation optimization literature. OCBA-type algorithms solve the top-k identification problem in the Gaussian reward bandit with unknown variance. Among them, OCBASS [11] , and OCBAss [10] are two alternatives with the empirical best performance. We caution that, through our analysis of optimal conditions, they are not optimal. Many recent works design Bayesian algorithms for top-k arm identification, where the posterior convergence rate is of particular interest. See TTPS, TTVS and TTTS in [32] , TTEI in [31] and T3C in [33] . All these algorithms can only be applied for (top-1) best arm identification. Besides, these algorithms rely on a parameter β, which needs online tuning. Below we give brief descriptions and essential implementation details of these algorithms. All the following algorithms require the construction of upper and confidence lower bound. Instead of using a common Hoeffding bound, we use an improved KL-divergence-based bound introduced in [24] : where β t,δ is the exploration rate we specify to be log ((log t + 1)/δ). As they pointed out, the KL-confidence region is always smaller than those obtained with Hoeffding's bound but shares the same coverage probability. • KL-LUCB [24] . At each round, the algorithm samples two critical arms u t and l t : The algorithm stops when U ut (t) > L lt (t). • KL-Racing and KL-Elimination [24] . The Racing algorithm is a uniform-based sampling algorithm. The algorithm maintains a selected set S, an eliminated set E and a remaining set R. At the beginning, the algorithm initialize S = ∅, E = ∅, R = [K]. At each round t, the algorithm sample all the arms in R. Then it computes the empirical top-(k − |S|) arm set T and empirical bottom arm set B in R . There are four critical arms to be evaluated: The only difference between the Racing algorithm and Elimination algorithm is that the Elimination algorithm never selects arm a B into S. In practice, we observe that KL-Racing can be too aggressive to select a wrong arm in the early stage, which is not desirable to control the confidence level. In our experiments, we only report the results of KL-Elimination. 
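The KL-divergence-based confidence bounds used by KL-LUCB, KL-Racing and KL-Elimination can be computed by bisection, since q ↦ d(θ t,i , q) is increasing for q above the sample mean and decreasing below it. The sketch below does this for Bernoulli rewards with the exploration rate β t,δ = log((log t + 1)/δ); it follows the standard construction of [24] and is only an illustration, not the implementation used in our experiments.

```python
import numpy as np

def kl_bernoulli(p, q, eps=1e-12):
    p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def kl_upper(mean, count, beta, iters=50):
    """U_i(t): largest q with count * d(mean, q) <= beta, found by bisection."""
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if count * kl_bernoulli(mean, mid) <= beta:
            lo = mid
        else:
            hi = mid
    return lo

def kl_lower(mean, count, beta, iters=50):
    """L_i(t): smallest q with count * d(mean, q) <= beta, found by bisection."""
    lo, hi = 0.0, mean
    for _ in range(iters):
        mid = (lo + hi) / 2
        if count * kl_bernoulli(mean, mid) <= beta:
            hi = mid
        else:
            lo = mid
    return hi

t, delta = 1000, 0.1
beta = np.log((np.log(t) + 1) / delta)
print(kl_lower(0.6, 40, beta), kl_upper(0.6, 40, beta))
```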
Besides, as pointed out by [12] , Racing-type algorithms cannot be optimal since it forces the allocation ratios among selected arms to be equal. • UGapE [9] . The UGapE is a gap-based exploration algorithm. At each round t, the algorithm identifies the arm with largest gap index B i (t) = max k i =i U i (t) − L i (t). The algorithm stops when min i B i (t) > 0. • UGapE [9] . The UGapE provide a variant running in the fixed-budget setting. At each round t, the algorithm also identifies the arm with largest gap index B i (t) = max k i =i U i (t)−L i (t). In the fixed-budget setting, U i (t) and L i (t) are defined by A tuning parameter a for exploration is needed. Furthermore, the algorithm needs to estimate the sample complexity H of the problem. In our implementation, we give advantage to UGapE by hand-selecting the best problem-dependent tuning parameter and pass-through the true sample complexity H, both of which are unknown beforehand. • SAR [2] . The SAR maintains an active set A and a selected set S, which are initialized to be A = [K] and S = ∅ . The time horizon (budget) T is divided into K − 1 phases. At the beginning, the algorithm computes the following quantities: i , n 0 = 0, and n p = 1 log(K) At each phase p, the algorithm samples each active arm i ∈ A for n p − n p−1 times. Sort the arms in A in descending order, say θ t,1 ≥ θ t,2 ≥ · · · . Define the following empirical gap • OCBAss [10] . For the Gaussian bandits with equal known variance, OCBAss can be described as follows. At the beginning, the algorithm samples each arm for ∆ 0 times. Then at time t, for all i ∈Î t and j / ∈Î t , the algorithm compute C i,j defined as follows: Then it identifies the (i t , j t )-pair that minimizes C i,j as candidate arms. If i∈Ît T 2 t,i < j / ∈Ît T 2 t,j , then sample i t for n times. Otherwise, sample j t for n times. In the experiment, we choose the parameter ∆ 0 = n = 10 suggested by [10] . At first glance, the OCBAss algorithm appears very similar as the KKT-TS algorithm, becuase they both identify the (i, j)-pair that minimizes C i,j as candidate arms. However, there are several important differences. First, in choosing from the (i, j)-pair, OCBAss balances the sum-of-squared allocation from the entire entire bipartite graph. However, as Remark 2 shows, the optimal solution should balance the sum-of-squared allocation in each and every connected subgraph. Second, the balancing of sum-of-squared allocations works only for Gaussian rewards. However, as Proposition A.2 shows, the sufficient conditions under general reward distribution is not the balancing of sum-of-squared allocations. Third, OCBAss uses the sample mean in the calculation of C i,j instead of the posterior samples. Plugging in sample mean can be hazardous, because initial inaccurate estimations may cause the algorithm to abandon optimal arms in early stage. • OCBASS [11] . At the beginning, the algorithm samples each arm for ∆ 0 times. Then at time t, the algorithm compute C i,j defined by (38). Then it identifies the (i t , j t )-pair that minimizes C i,j as candidate arms. Choose i t or j t arbitrarily and sample it for n times. In the experiment, we choose the parameter ∆ 0 = n = 10 suggested by [11] . The OCBASS algorithm is very similar to the OCBA-ss algorithm. In particular, they both identify the (i, j)-pair that minimizes C i,j as candidate arms. In each iteration of the algorithm, there will be multiple arms that achieves the minimum value of C i,j . 
However, the important step of balancing sum-of-squared allocation is missing, which implies that OCBASS cannot be optimal. C.3.3 Bayesian algorithms for k = 1. • TTTS [32] . In each step, the Top-Two Thompson Sampling (TTTS) algorithm samples the top-two arms from the posterior distribution of the mean rewards. It then plays the best arm with probability β and the second-best arm with probability 1 − β. The top-two sampling strategy is surprisingly simple and yet allocates optimally among the suboptimal arms. This algorithm is shown to be optimal (with respect to the tuning parameter β) in the posterior convergence rate, and fixed-confidence settings (proved by [33] ). The main limitations are two-fold: the need of the tuning parameter β that determines the allocation rate of the best arm, and the computation burden of sampling the second best arm, which get progressively harder as the posterior density concentrates. TTPS and TTVS proposed in [32] share the same asymptotic optimality as TTTS, while TTTS usually has the best empirical performance among three. • TTEI [31] . The Top-Two Expected Improvement TTEI algorithm identifies the two arms with the largest amount of improvement of the target objective, and then sample the best arm w.p. β and the second-best arm w.p. 1 − β. The novel proof technique is noticeable. The algorithm provably improves the EI algorithm [19] . This algorithm is also shown to be optimal (with respect to the tuning parameter β) in the posterior convergence rate, and fixed-confidence settings. • T3C [33] . The Top-Two Transporation Cost (T3C) algorithm simplifies the TTTS algorithm by replacing the posterior sampling of the second-best arm with the arm that has the lowest transportation cost with respect to the best arm, see (1) there. Such a selection criteria for the second-best arm resembles the inner minimization problem min j∈J C i,j (ψ i , ψ j ) of (5). However, the allocation between the best arm and the second-best arm is still determined by a tuning parameter β. This algorithm is also shown to be optimal (with respect to the tuning parameter β) in the posterior convergence rate, and fixed-confidence settings. We remark it is challenging to apply the proof techniques of [32, 31, 33] to the case of k > 1. We point out two critical steps in their proof. (I) Their proofs rely critically on the fact that their algorithm forces a β proportion of sampling effort to be allocated to the top-1 arm (and hence the need for the tuning parameter). With k > 1, even more tuning parameters are required. (II) Based on the (rather forced) convergence of the top-1 arm, the TTEI, TTTS and T3C algorithms were able to establish the convergence of allocation on the sub-optimal arms. This is possible because of the drastically simplified optimality condition in the case of k = 1 -condition (15) in our paper reduces to the equality of C 1,j (ψ) for all j > 1. However, the optimal solution in the case of k > 1 does not admit such a simple structure. All experiments in the paper were conducted on a desktop computer. (Processor: Intel Core i7-8700 @ 3.20GHz, cores: 6, RAM: 16GB.) We demonstrate the convergence of FWGA in Algorithm 1 and KKT-Tracking in Algorithm 3 for (5). Figure 3 (left) presents the Gaussian bandit from Example 1. This example shows that even though the sufficient condition for optimality is complicated, our algorithms can converge to the optimal solution. 
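To complement the convergence experiment in Figure 3, here is a compact sketch of the FWGA iteration of Algorithm 1 as we read it: a Frank-Wolfe step on the dual weights µ, which over the simplex simply puts mass on the currently smallest C i,j , followed by a projected gradient-ascent step on ψ using the gradients ∂C i,j /∂ψ i = d(θ i , θ i,j ) from Proposition A.1. The mixing weights α t = 2/(t + 1) follow Theorem 3.7, but the ascent step size below is an arbitrary illustrative choice rather than the τ t schedule analyzed in the paper, so this should be treated as a sketch under our own assumptions.

```python
import numpy as np

def project_simplex(v):
    # Euclidean projection of v onto the probability simplex (sort-based).
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]
    tau = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + tau, 0.0)

def fwga(theta, k, sigma2, n_iter=2000, step=0.5):
    K = len(theta)
    top = list(np.argsort(theta)[::-1][:k])
    bottom = [a for a in range(K) if a not in top]
    pairs = [(i, j) for i in top for j in bottom]
    psi = np.full(K, 1.0 / K)
    mu = np.full(len(pairs), 1.0 / len(pairs))

    def C_and_grad(psi):
        vals, grads = np.empty(len(pairs)), np.zeros((len(pairs), K))
        for p, (i, j) in enumerate(pairs):
            s = psi[i] + psi[j] + 1e-12
            bar = (psi[i] * theta[i] + psi[j] * theta[j]) / s
            di = (theta[i] - bar) ** 2 / (2 * sigma2)    # dC_{i,j}/dpsi_i = d(theta_i, bar)
            dj = (theta[j] - bar) ** 2 / (2 * sigma2)    # dC_{i,j}/dpsi_j = d(theta_j, bar)
            vals[p] = psi[i] * di + psi[j] * dj
            grads[p, i], grads[p, j] = di, dj
        return vals, grads

    for t in range(1, n_iter + 1):
        alpha = 2.0 / (t + 1)
        vals, grads = C_and_grad(psi)
        p = np.zeros(len(pairs))
        p[np.argmin(vals)] = 1.0                         # Frank-Wolfe step on mu
        mu = (1 - alpha) * mu + alpha * p
        u = project_simplex(psi + step * grads.T @ mu)   # projected ascent step on psi
        psi = (1 - alpha) * psi + alpha * u
    return psi, mu

# Example 1 instance; compare the output with the allocation reported there.
psi, mu = fwga(np.array([0.51, 0.5, 0.0, -0.01, -0.092]), k=2, sigma2=0.25)
print(np.round(psi, 4))
```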
Figure 3 (right) presents the example of choosing the top 3-arm in a Bernoulli bandit with θ 1 = 0.8, θ 2:3 = 0.6, θ 4:6 = 0.4, and θ 7:10 = 0.2. For FWGA, we set τ t = N 3/2 /t for the Gaussian example and 10N 3/2 /t for the Bernoulli example. We evaluate the performance of our algorithm through the objective optimality gap (21) , which is zero at the optimal solution and strictly positive otherwise. Both algorithms converges rapidly to the optimal solutions. The tracking heuristic converges even faster than FWGA. We conjecture that KKT-Tracking converges at a faster rate of 1/t and leave the convergence analysis for future work. In this section, we present further numerical examples in the fixed-confidence setting. In particular, we consider two examples in Figure 4 -5. For each example, we plot the fixed-confidence sample complexity for confidence levels δ = 0.1, 0.01, 0.001. In addition to the algorithms presented in Figure 1 , we also include two more algorithms: DT-TS and BTS. The DT-TS algorithm is the Thompson Sampling Direct-Tracking from Algorithm 4 coupled with the KKT-Tracking from Algorithm 3, where we set T = 100. The BTS is an algorithm designed to track the sufficient conditions derived in Theorem 3.4, assuming that conditions (14) always holds. In particular, it selects the (i, j)-pair only from (i k , j) or (i, j 1 ), and to choose from i and j, it seeks to balance conditions (16) . The rest of the steps in BTS are exactly the same as that of KKT-TS. We make a few remarks on KKT-TS, DT-TS and BTS. 1. There is no practical difference between the fixed-confidence performance of KKT-TS, DT-TS and BTS. 2. The computation complexity of DT-TS is significantly larger than the other two due to the need to solve for (5) in each iteration. The computation complexity of KKT-TS is slightly larger than BTS, because KKT-TS need to identify the minimum over k(K − k) values of C i,j where as BTS finds minimum over K − 1 values. 3. As discussed in Section A.2, the complete characterization of the sufficient condition is complicated, and the assumption (14) is non-trivial to verify. This implies that BTS may not always converge to the optimal allocation. Nevertheless, BTS has no discernable different from KKT-TS. In our experiments, we find that even when (14) fails to hold, Theorem 3.4 gives a feasible solution that is sufficiently close to the optimal solution. In Figure 6- Comparing the optimal allocation rate and the average allocation rate of the algorithms with a budget of 10 6 . We observe that KKT-TS and OCBAss (omitted) allocate almost exactly as the optimal rate. In this simpler case, OCBAss tracks the correct sufficient condition, hence also converges to the optimal rate. In contrary, since the key step of balancing the sum-of-squared allocation is missing from OCBASS, it does not converge to the optimal allocation, albeit sufficiently close. UGapE and SAR algorithms deviates quite a lot from the optimal allocation, although their empirical performance is respectable. and σ 2 = 1/4. Estimated over 10 7 runs. The broken lines with the same marker and color indicates the large deviation rate of the probability of false selection, i.e. exp(−tΓ(ψ)), whereψ is the average allocation rate for each algorithm with budget 10 6 . The large deviation rate of OCBAss and OCBASS, as well as the optimal rate Γ * is practically indiscernible from that of KKT-TS, hence are omitted. 
We observe that the realized large deviation rate of the probability of false selection matches the corresponding Γ(ψ) for all algorithms, except for the OCBA-type algorithms. This is due to the use of samples means as plug-in estimates, which incurs small probability that the algorithm abandons optimal arm due to initially inaccurate estimations. In this section, we present numerical examples for posterior convergence. As benchmarks, we compare with the KKT-TS, TTTS, TTEI and T3C algorithms. We remark that these algorithms are designed only for the best arm identification problem (k = 1). Before presenting the numerical results, we include the details about how we compute the posterior probability. Recall the set of parameters that make I optimal is Under independent prior, the posterior distribution of θ at time n can be decomposed by π n (θ) = π n,1 (θ 1 ) · · · π n,K (θ K ), where π n,k is the posterior probability density function of θ k . Let F n,k (·) be the posterior cumulative density function of θ k . Then we can compute P πn (Θ) by Thus, the computation of P πn (Θ) is reduced to a one-dimensional integral, which can be further computed efficiently using quadrature. We do the simulation on two problem instances for choosing the best arm (k = 1). For Gaussian bandits, we impose independent improper Gaussian priors with zero mean and ∞ variance; for Bernoulli bandits, we impose independent beta priors. We plot the sample complexity for the first time when posterior probability of optimal set P πn (Θ) reach a given level c = 0.9, 0.99, 0.999. Results are shown in Figure 10 As observed from Figure 10 and 11, there is no practical difference among the KKT-TS, TTTS, TTEI and T3C algorithms and all of them outperform the uniform allocation rule dramatically. This is expected because TTTS, TTEI and T3C algorithms attain provably optimal rate (in a restricted sense of any given β), and furthermore, it is shown in their papers that β = 1/2 is a robust choice. The significant contribution of KKT-TS is three-fold: the extension to cover k > 1; the elimination of tuning parameters; and the elimination of expensive posterior sampling of the second-best arm (same for T3C). The performance of KKT-TS is further demonstrated by Figure 12 below, which compare the sample paths of the postior probability P πn (Θ c ) with the optimal rate of e −Γ * n . Figure 12 provide strong evidence that the posterior convergence rate of KKT-TS attains optimality. In particular, we conjecture that for KKT-TS, with probability 1: We leave the proof for future work. We restrict the distribution of the reward Y t,i in the one-dimensional natural exponential family, which is a subset of exponential family. In particular, a natural exponential family has identity natural statistic in canonical form, i.e., where η is the natural parameter and A(η) is assumed to be twice differentiable. The distribution is called non-singular if Var η (Y t,i ) > 0 for all η ∈ T o , the interior of the parameter space T . Denote by θ the mean reward θ(η) = yp(y|η)dy. We have the following lemma. Lemma E.1. At any η ∈ T o , θ(η) is strictly increasing and A(η) is strictly convex. Proof. By the properties of the exponential family, we have, for any η ∈ T o , This implies the lemma. Lemma E.1 implies that θ(·) is a one-to-one mapping. Consequently, we can use θ to denote the distribution p(·|η) for convenience. Let θ 1 and θ 2 be the mean reward of the distribution p(·|η 1 ) and p(·|η 2 ). 
Denote by d(θ 1 , θ 2 ) the Kullback-Leibler (KL) divergence from p(·|η 1 ) to p(·|η 2 ). We have a convenient closed-form to calculate d(·, ·) and its partial derivative: Above equations are used repeatedly in our derivation. To derive the lower bound, we utilize Lemma 1 in [23] , which is rephrased below. Lemma E.2. Let θ and θ be two bandit models with K arms such that for all i, the distributions θ i and θ i are mutually absolutely continuous. For any almost-surely finite stopping time τ σ with respect to where d(·, ·) is the Kullback-Leibler (KL) divergence of the two reward distributions and kl(·, ·) is the KL divergence of two Bernoulli distributions. The δ-correctness of an algorithm implies that, for any E ∈ F τ δ , we have P θ (E) > 1 − δ and P θ (E) ≤ δ. Hence, (42) implies that where we use the monotonicity of the KL divergence of Bernoulli distribution. We write above inequality as Note that T τ δ ,i /τ δ is the proportion of samples allocated to arm i. For any δ-correct algorithm such that and some probability vector ψ ∈ Σ K , the lower bound of sample complexity It can be verify that lim δ→0 log(1/δ) = 1. Hence, we have the asymptotic lower bound Minimizing this lower bound is equivalent to the following optimization problem, To further simplify (43), we note that the complement of Θ can be written as where Θ c i,j = {θ : θ j > θ i }. We hence write the optimization problem as We define Invoking following lemma, we obtain explicit expression for C i,j (ψ) as in Lemma 2.1, which completes the proof. Lemma E.3. For any i ∈ I and j ∈ J , C i,j (ψ) is concave in ψ and depends only on ψ i and ψ j Proof of Lemma E.3. Recall that We have [By the monotonicity of KL-divergence] Taking derivative of the function inside the infimum with respect toθ and equating it to 0, we have where we use d(θ i ,θ) = A(η(θ)) − A(η(θ i )) − θ i (η(θ) − η(θ i )). We find that the infimum is achieved at θ = (ψ i θ i + ψ j θ j )/(ψ i + ψ j ). Hence, which depends only on ψ i and ψ j . To see the concavity of C i,j (ψ), note that C i,j (ψ) = infθ ψ i d(θ i ,θ) + ψ j d(θ j ,θ) , which is the minimum over a family of linear functions in ψ. This directly implies that C i,j (ψ) is concave in ψ (See Chapter 3.2 of [1] ). The partial derivatives of C i,j (ψ) are derived as follows The last equality comes from (44). The calculation of ∂Ci,j (ψ) ∂ψj is a verbatim of the above procedure. This completes the proof. For any static allocation ratio ψ, Lemma 1 in [15] implies that for any θ i > θ j , where I i (θ) = sup λ λθ − Λ θi (λ) and Λ θi (λ) is the log-moment generating function of the rewards of arm i. Let A * (θ) = sup λ λθ − A(λ) be the Fenchel-Legendre transform of the log-partition function A of the reward distribution. To calculate I i (θ), we defineη = η(θ). Then [By the property of exponential family] [By Equation (40) ] Therefore, Let η i = η(θ i ) and η j = η(θ j ). Then we have Take derivative of the function inside the infimum of (45) with respect toθ and equate it to 0, we have We find that the infimum is achieved atθ Thus, B i,j (ψ) = ψ i d(θ B i,j , θ i )+ψ j d(θ B i,j , θ j ). Note that, in general, B i,j (ψ) = C i,j (ψ) due to the asymmetry of KL divergence. Suppose without loss of generality that there exist j = j ∈ J * such that Since C i,j 's are continuous functions strictly increasing in ψ j , there exist a sufficiently small > 0 such that and Hence, ψ * cannot be an optimal solution. 
Consequently, we have With similar argument, we can show that using the fact that C i,j 's are continuous functions strictly increasing in ψ i . For the ease of exposition, we introduce the following function Then C i,j (ψ) can be written as Lemma E.4. For any i ∈ I and j ∈ J , we have Moreover, for any i, j and j such that θ i > θ j ≥ θ j , we have Proof of Lemma E.4. We start by identifying that By Proposition A.1, we have Hence, To see (46), we fix θ i and x and define It follows that h(θ j ) = g i,j (x). To complete the proof, we need to show that, for any θ i > θ j ≥ θ j , g i,j (x) ≥ g i,j (x). It suffices to verify that h (α) < 0 when θ i > α. To see this, let We recall the partial derivative of d(·, ·) in (41) and calculate h (α) as follows. h (α) = ∂d(θ i , m) ∂m dη dm Noticing that η (θ) = 1/θ (η) > 0 and m − α = ( This completes the proof. Proof of Proposition 3.2. At the optimal solution ψ * , consider any j, j ∈ J such that θ j ≥ θ j . Proposition 3.1 implies that there exist a i ∈ I such that C i,j (ψ * ) = Γ * ≤ C i,j (ψ * ). Note that we must have ψ * i > 0, the inequality above implies that Combining (48) , (49) and g i,j (x) > 0, we conclude that ψ * j ≥ ψ * j . This shows the monotonicity of ψ * within the bottom arms J . For the top arms I, we introduce Then C i,j can be written as From the proof of Lemma E.4 (See (47) and replace θ i with θ j in h), we have g j,i (x) = d θ i , θj +xθi 1+x > 0 and, for any θ i ≥ θ i > θ j , The rest of the proof mirrors the proof of the monotonicity within the bottom arms. At the optimal solution ψ * , consider any i, i ∈ I such that θ i ≥ θ i . Proposition 3.1 also implies that there exists a j ∈ J such that Note that we must have ψ * j > 0, the inequality above implies that Combining (51) , (52) and g j,i (x) > 0, we conclude that ψ * i ≥ ψ * i . This shows the monotonicity of ψ * within the top arms I. The proof is complete. We start with the simple case where the variance of the rewards are the same σ 2 . Consider any connected subgraph G l = G(I l , J l , E l ) of the bipartite graph induced by an optimal solution ψ. We show that if then we can find another ψ = ψ with higher objective value, and hence ψ cannot be an optimal solution. By Proposition 3.1, for any arm i ∈ I ∪ J , there must exist an edge e ∈ E such that i is incident to e. Hence, we have I l = ∅ and J l = ∅. Take any i 1 ∈ I l as an anchor point. By the connectedness of G l , for any j ∈ J l , there exist a path p i1,j = (i 1 , j 1 , i 2 , j 2 , . . . , j r = j) connecting i 1 and j. Along the path, we perform Taylor's expansion Starting from the first edge (i 1 , j 1 ), we define˜ i1 and˜ j1 such that the first order terms in the Taylor's series vanishes Recall that, for Gaussian bandits with reward distribution N (θ i , σ 2 ), we have Hence, we can rewrite (53) as˜ Repeat this procedure for (j 1 , i 2 ) on the path, we havẽ . Repeat for all edges on the path, we have˜ Similarly, for all i ∈ I l , we have˜ Now, we claim that if i∈I l (ψ i ) 2 = j∈J l (ψ j ) 2 , then we can find a such that Γ(ψ + ) > Γ(ψ). Consequently, ψ cannot be an optimal solution. To this end, let where the multiplier λ is determined by (58a), i.e., j =j1 For the dual feasibility condition (58f) to hold, we require conidition (14) to hold, i.e., For (60) to hold, we require (16), i.e., We have now checked that equations (58a), (58b), (58c), (58d) and (58f) hold. For the primal variable (equivalent to objective value) φ, we let , for all i ∈ I and j ∈ J . 
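As a numerical sanity check on the conditions just used, the sketch below verifies the stationarity equations of Theorem 3.5 on Example 1: with dual weights supported on the binding pairs (1, 3), (2, 4) and (2, 5), solving µ from the bottom-arm equations ψ j = Σ i µ i,j h j;i yields weights that sum to one and also satisfy the top-arm equations ψ i = Σ j µ i,j h i;j . The specific µ values are computed here purely for illustration and are not reported in the paper.

```python
import numpy as np

psi = np.array([0.2185, 0.2371, 0.2185, 0.2026, 0.1232])   # optimal allocation from Example 1

def h(a, b):
    # Gaussian bandits: h_{a;b}(psi) = psi_b / (psi_a + psi_b).
    return psi[b] / (psi[a] + psi[b])

# Binding pairs of Example 1, written 0-indexed: (1,3)->(0,2), (2,4)->(1,3), (2,5)->(1,4).
# Solve mu from the bottom-arm equations psi_j = mu_{i,j} * h_{j;i}.
mu_13 = psi[2] / h(2, 0)
mu_24 = psi[3] / h(3, 1)
mu_25 = psi[4] / h(4, 1)
print(mu_13 + mu_24 + mu_25)                      # ~ 1: mu lies in the simplex
# The same weights satisfy the top-arm equations psi_i = sum_j mu_{i,j} * h_{i;j}.
print(mu_13 * h(0, 2), psi[0])                    # arm 1
print(mu_24 * h(1, 3) + mu_25 * h(1, 4), psi[1])  # arm 2
```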
By noting that $\mu_{i,j} = 0$ for all pairs other than those specified above, we have (58g). Finally, since $\Gamma^* = \phi$, the inequality in (58e) also holds automatically. Hence, all the conditions in (58) hold. The proof is complete.

We restate Theorem 3.5 here for easy reference.

Theorem E.5. A feasible solution $(\phi, \psi)$ to (17) is optimal if and only if there exists $\mu \in \Sigma_{k(K-k)}$ such that the conditions (65) hold for all $i \in I$ and $j \in J$.

See the proof of Theorem 3.4 in Section E.7: problem (17) has been shown to be a convex optimization problem, and the KKT conditions are necessary and sufficient for global optimality. We restate the KKT conditions (58) from Section E.7 as the system (66a)-(66f), holding for all $i \in I$ and $j \in J$. It can also be seen that, at the optimal solution, $\phi > 0$, $\psi_i > 0$ and $\psi_j > 0$.

To show the equivalence between (66) and (65), we first show that (66) implies (65). By the complementary slackness condition (66f), there must exist $i' \in I$ and $j' \in J$ such that $\phi = C_{i',j'}(\psi)$, i.e., $\phi = \min_{i \in I}\min_{j \in J} C_{i,j}(\psi)$. Suppose otherwise, i.e., $\phi < C_{i,j}(\psi)$ for all $i \in I$ and $j \in J$. Then $\mu_{i,j} = 0$ for all $i \in I$ and $j \in J$, which contradicts $\sum_{i \in I}\sum_{j \in J}\mu_{i,j} = 1$. Further, we can write the stationarity conditions (66a) and (66b) in the form (67). Summing (67) over all $i \in I$ and $j \in J$ and using the fact that $h_{i;j} + h_{j;i} = 1$, we have $\lambda = \phi$. Plugging $\lambda = \phi$ into (67), we obtain (65).

Next we show that (65) also implies (66). Using complementary slackness again, we can rewrite (65a) so that
$$\phi = \sum_{j \in J}\mu_{i,j}\, d(\theta_i, \bar{\theta}_{i,j}),$$
which coincides with (66a). Similarly, we can derive (66b) from (65b). Finally, (66c) follows by directly summing (65a) and (65b) over $i \in I$ and $j \in J$. This completes the proof.

E.9 Proof of Proposition 3.6

Consider the primal problem
$$\max_{\phi,\ \psi \in \Sigma_K}\ \phi \qquad \text{s.t.}\quad \phi - C_{i,j}(\psi) \le 0 \quad \text{for all } i \in I,\ j \in J.$$
We write the Lagrangian as
$$L(\phi, \psi, \mu) = \phi + \sum_{i \in I}\sum_{j \in J}\mu_{i,j}\bigl(C_{i,j}(\psi) - \phi\bigr),$$
where $\psi \in \Sigma_K$ and $\mu_{i,j} \ge 0$ for all $i \in I$, $j \in J$. Recall that in Appendix Section E.7 we have shown that the Slater condition holds. Consequently, the above problem is equivalent to the following min-max optimization problem: $\Lambda^* = \min_{\mu \ge 0}\max_{\phi,\,\psi \in \Sigma_K} L(\phi, \psi, \mu)$. Note that the maximization over $\phi$ is bounded only if $\sum_{i \in I}\sum_{j \in J}\mu_{i,j} = 1$. We conclude that the dual problem can be formulated as
$$\Lambda^* = \min_{\mu \in \Sigma_{k(K-k)}}\ \max_{\psi \in \Sigma_K}\ \sum_{i \in I}\sum_{j \in J}\mu_{i,j}\, C_{i,j}(\psi).$$
This completes the proof.

E.10 Convergence of FWGA: Proof of Theorem 3.7

In this section, we consider the following generic non-smooth optimization problem:
$$\max_{x \in X}\ \min_{j \in [m]}\ f_j(x),$$
where the feasible region $X$ is convex and closed, and the $f_j$'s are smooth concave functions. By strong duality, this non-smooth problem can be reformulated as the bilinear saddle-point problem
$$\max_{x \in X}\ \min_{\mu \in \Sigma_m}\ \langle \mu, f(x)\rangle \;=\; \min_{\mu \in \Sigma_m}\ \max_{x \in X}\ \langle \mu, f(x)\rangle.$$
Let $(\mu^*, x^*)$ be the optimal solution pair to the above min-max optimization problem. We observe that the saddle-point formulation avoids handling the non-smooth inner minimum and preserves a smooth objective.

We first introduce the algorithm. In iteration $k$:
• We conduct a Frank-Wolfe step $p_k = \mathrm{argmin}_{p \in \Sigma_m}\ \langle p, f(x_{k-1})\rangle$.
• We update $\mu_k = (1 - \alpha_k)\mu_{k-1} + \alpha_k p_k$.
• We conduct a gradient ascent step from $u_{k-1}$, with step size governed by $\tau_k$, to obtain $u_k$.
• We update $x_k = (1 - \alpha_k)x_{k-1} + \alpha_k u_k$.

Proof of Theorem 3.7. Suppose the function $f$ is $C_F$-Lipschitz continuous. By using Lemma 3.8 of [27], we obtain a bound for the gradient ascent step applied to the weighted objective with gradient $\sum_{j=1}^m p_{k,j}\nabla f_j(u_{k-1})$. Recall the definition $Q_k = \langle \mu_k, f(x^*)\rangle - \langle \mu^*, f(x_k)\rangle$. We then have a bound on $Q_k$ in which the first inequality uses the concavity of each $f_j$, namely $f_j\bigl((1-\alpha_k)x_{k-1} + \alpha_k u_k\bigr) \ge (1-\alpha_k)f_j(x_{k-1}) + \alpha_k f_j(u_k)$. By recalling the update rule for $p_k$, we have $\langle p_k, f(u_{k-1})\rangle \le \langle \mu^*, f(u_{k-1})\rangle$.
Meanwhile, a complementary bound for the gradient ascent step implies a corresponding inequality for the dual term. Furthermore, by using the concavity of each $f_j$, we obtain a further bound; combining it with (80), we obtain a recursive bound on $Q_k$, where $C_F = \max_j C_{f_j}$ is the maximal Lipschitz continuity parameter of the $f_j$'s and the last inequality uses the facts that $\sum_{j=1}^m p_{k,j} = 1$ and $p_{k,j} \ge 0$. Combining this bound with (82), we obtain a further inequality in which the last step uses the fact that $2ab \le a^2/2 + 2b^2$.

Recall that the step-sizes are set as $\alpha_k = 2/(k+1)$ and $\tau_k = N^{3/2}/k$. Let
$$\Gamma_k = \begin{cases} 1, & k = 1;\\ (1 - \alpha_k)\Gamma_{k-1}, & k > 1,\end{cases}$$
so that $\Gamma_k = \frac{2}{k(k+1)}$. Meanwhile, we use the following result (Lemma 1 of [13]) to characterize the convergence.

Lemma E.6. Let $\{\alpha_k\}$ be the step-sizes and suppose the sequence $\{A_k\}$ satisfies
$$A_k \le (1 - \alpha_k)A_{k-1} + B_k, \qquad k = 1, 2, \ldots, N.$$
Then for all $1 \le k \le N$,
$$A_k \le \Gamma_k \sum_{s=1}^{k}\frac{B_s}{\Gamma_s}.$$

By utilizing the above lemma, we obtain a bound on $Q_N$. Moreover, we note that $\alpha_k\tau_k/\Gamma_k = N^{3/2}$ for all $k = 1, 2, \ldots, N$, which allows the accumulated step-size terms to be summed in closed form. By substituting this into (84), we further obtain the claimed convergence bound, where $C > 0$ is a positive constant. This completes the proof.

To see this, we use the notation $\Gamma(\psi) = \min_{i \in I}\min_{j \in J} C_{i,j}(\psi)$. Then
$$\mathrm{argmax}_{\psi \in \Sigma_K}\ \min_{i \in I}\min_{j \in J}\ \log C_{i,j}(\psi) \;=\; \mathrm{argmax}_{\psi \in \Sigma_K}\ \Gamma(\psi).$$
The last equation holds by noting that both $C_{i,j}$ and $\log C_{i,j}$ achieve the minimum at the same $(i, j)$ pairs. This completes the proof.

Consider the primal problem
$$\max_{\phi,\ \psi \in \Sigma_K}\ \phi \qquad \text{s.t.}\quad \phi - \log C_{i,j}(\psi) \le 0 \quad \text{for all } i \in I,\ j \in J.$$
We write the Lagrangian as
$$L(\phi, \psi, \mu) = \phi + \sum_{i \in I}\sum_{j \in J}\mu_{i,j}\bigl(\log C_{i,j}(\psi) - \phi\bigr),$$
where $\psi \in \Sigma_K$ and $\mu_{i,j} \ge 0$ for all $i \in I$, $j \in J$. Recall that in Appendix Section E.7 we have shown that the Slater condition holds. Consequently, the above problem is equivalent to the following min-max optimization problem: $\Lambda^* = \min_{\mu \ge 0}\max_{\phi,\,\psi \in \Sigma_K} L(\phi, \psi, \mu)$. Note that the maximization over $\phi$ is bounded only if $\sum_{i \in I}\sum_{j \in J}\mu_{i,j} = 1$, so the dual reduces to minimizing $\max_{\psi \in \Sigma_K}\sum_{i \in I}\sum_{j \in J}\mu_{i,j}\log C_{i,j}(\psi)$ over $\mu \in \Sigma_{k(K-k)}$.

E.14 Proof of Lemma C.4

We perform one step of the iteration as follows. For all $i \in I$, expanding the update of $\psi^t_i$, the last equality follows by assuming $\psi^{t-1}_i = \sum_{j' \in J}\mu^{t-1}_{i,j'}\,h_{i;j'}$. Similarly, for all $j \in J$, the update of $\psi^t_j$ expands under the assumption $\psi^{t-1}_j = \sum_{i' \in I}\mu^{t-1}_{i',j}\,h_{j;i'}$. Written compactly, the two displays give the claimed one-step update, which completes the proof.
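The following is a compact sketch (ours, not the paper's implementation) of the four-step iteration described in Section E.10 for the generic problem $\max_{x \in \Sigma_K}\min_j f_j(x)$. The Frank-Wolfe step, the averaging with $\alpha_k = 2/(k+1)$, and the step-size $\tau_k = N^{3/2}/k$ follow the text; the projected-gradient form of the ascent step and all helper names are our assumptions, since the displayed update rule did not survive extraction.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (standard sort-based method)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    tau = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - tau, 0.0)

def fwga(fs, grads, K, N):
    """Frank-Wolfe gradient-ascent sketch for max_{x in simplex} min_j f_j(x).

    fs, grads: lists of callables f_j(x) and their gradients (each f_j concave).
    Returns the averaged iterate x_N and the averaged dual weights mu_N.
    """
    m = len(fs)
    x = np.ones(K) / K          # averaged primal iterate
    u = x.copy()                # ascent iterate
    mu = np.ones(m) / m         # averaged dual weights
    for k in range(1, N + 1):
        alpha = 2.0 / (k + 1)
        tau = N ** 1.5 / k
        # Frank-Wolfe step on mu: a vertex of the simplex minimizing <p, f(x_{k-1})>
        fx = np.array([f(x) for f in fs])
        p = np.zeros(m)
        p[np.argmin(fx)] = 1.0
        mu = (1 - alpha) * mu + alpha * p
        # (assumed) projected gradient ascent on u for the weighted objective <p, f(.)>
        g = sum(p[j] * grads[j](u) for j in range(m))
        u = project_simplex(u + g / tau)
        # averaging step for the primal iterate
        x = (1 - alpha) * x + alpha * u
    return x, mu

# Toy usage: three linear (hence concave) objectives on the simplex in R^3.
A = np.array([[1.0, 0.2, 0.1], [0.1, 1.0, 0.3], [0.2, 0.1, 1.0]])
fs = [lambda x, a=a: float(a @ x) for a in A]
grads = [lambda x, a=a: a for a in A]
x_hat, mu_hat = fwga(fs, grads, K=3, N=2000)
print(x_hat, min(f(x_hat) for f in fs))
```

In the paper's setting the $f_j$'s would be the $C_{i,j}$ (or $\log C_{i,j}$) functions indexed by the pairs $(i, j)$; the toy linear objectives above are only meant to illustrate the control flow and step-size schedule.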
References

[1] Convex optimization.
[2] Multiple identifications in multi-armed bandits.
[3] Tight (lower) bounds for the fixed budget best arm identification bandit problem.
[4] Efficient simulation budget allocation for selecting an optimal subset.
[5] Simulation budget allocation for further enhancing the efficiency of ordinal optimization.
[6] Sequential design of experiments.
[7] Decomposition techniques for bilinear saddle point problems and variational inequalities with affine monotone operators.
[8] Non-asymptotic pure exploration by solving games.
[9] Best arm identification: A unified approach to fixed budget and fixed confidence.
[10] A note on the subset selection for simulation optimization.
[11] A new budget allocation framework for selecting top simulated designs.
[12] Optimal best Markovian arm identification with fixed confidence.
[13] Accelerated gradient methods for nonconvex nonlinear and stochastic programming.
[14] Frank-Wolfe algorithms for saddle point problems.
[15] A large deviations perspective on ordinal optimization.
[16] Review on ranking and selection: A new perspective.
[17] Discrete optimization via simulation.
[18] Asymptotically optimal procedures for sequential adaptive selection of the best of several normal means.
[19] Efficient global optimization of expensive black-box functions.
[20] Solving variational inequalities with monotone operators on domains given by linear minimization oracles.
[21] PAC subset selection in stochastic multi-armed bandits.
[22] On Bayesian index policies for sequential resource allocation.
[23] On the complexity of best-arm identification in multi-armed bandit models.
[24] Information complexity in bandit subset selection.
[25] The extragradient method for finding saddle points and other problems.
[26] The complexity of large-scale convex programming under a linear optimization oracle.
[27] First-order and Stochastic Optimization Methods for Machine Learning.
[28] Bandit Algorithms.
[29] Gradient ascent for active exploration in bandit problems.
[30] Solving a class of non-convex min-max games using iterative first order methods.
[31] Improving the expected improvement algorithm.
[32] Simple Bayesian Algorithms for Best-Arm Identification.
[33] Fixed-confidence guarantees for Bayesian best-arm identification.
[34] Fast pure exploration via Frank-Wolfe.
[35] Provably improving the optimal computing budget allocation algorithm.
[36] Asymptotically optimal sampling policy for selecting top-m alternatives.

We consider two cases.

1. If $\sum_{i \in I_l}(\psi_i)^2 > \sum_{j \in J_l}(\psi_j)^2$, then let $\epsilon_{i_1} < 0$ be sufficiently small in magnitude so that the first-order terms dominate the remainder in the Taylor series. We have $\delta < 0$. Define $\epsilon$ by adding $-\delta/K > 0$ to every coordinate of the perturbation $\tilde{\epsilon}$, so that $\epsilon$ sums to zero. Note that $\partial C_{i,j}(\psi)/\partial \psi_i$ is strictly positive, so the first-order term is strictly positive for every $(i, j)$ pair, and hence $\Gamma(\psi + \epsilon) > \Gamma(\psi)$.

2. If $\sum_{i \in I_l}(\psi_i)^2 < \sum_{j \in J_l}(\psi_j)^2$, then let $\epsilon_{i_1} > 0$ be sufficiently small so that the first-order terms dominate the remainder in the Taylor series. Again $\delta < 0$. Define $\epsilon$ analogously; the first-order term is strictly positive for every $(i, j)$ pair, and hence $\Gamma(\psi + \epsilon) > \Gamma(\psi)$.

Finally, we remark that it is straightforward to generalize the above result to Gaussian bandits with heterogeneous variances. To this end, note that for reward distributions $N(\theta_i, \sigma_i^2)$,
$$C_{i,j}(\psi) = \frac{(\theta_i - \theta_j)^2}{2\bigl(\sigma_i^2/\psi_i + \sigma_j^2/\psi_j\bigr)}.$$
Consequently, the first-order terms along an edge $(i, j)$ vanish when $\tilde{\epsilon}_j = -\bigl(\psi_j^2/\sigma_j^2\bigr)\bigl(\sigma_i^2/\psi_i^2\bigr)\tilde{\epsilon}_i$, and the balance condition becomes $\sum_{i \in I_l}\psi_i^2/\sigma_i^2 = \sum_{j \in J_l}\psi_j^2/\sigma_j^2$. The rest follows exactly as in the case with equal variances, and we omit the details.

Notice that Prob. (5) can be reformulated as
$$\max_{\phi,\ \psi \in \Sigma_K}\ \phi \qquad \text{s.t.}\quad \phi - C_{i,j}(\psi) \le 0 \quad \text{for all } i \in I,\ j \in J. \qquad (17)$$
This is a convex optimization problem, since every $C_{i,j}$ is a concave function of $\psi$. Furthermore, the point $\psi = (1/K, 1/K, \ldots, 1/K)$ together with $\phi = 0$ satisfies the inequality constraints strictly.
Hence, Slater's condition holds, and thus the Karush-Kuhn-Tucker (KKT) conditions are both necessary and sufficient for global optimality. Let $\lambda \in \mathbb{R}$ and $\mu_{i,j} \ge 0$ be the Lagrange multipliers corresponding to the equality and inequality constraints, respectively. The Lagrangian can be written as
$$L(\phi, \psi, \lambda, \mu) = \phi + \sum_{i \in I}\sum_{j \in J}\mu_{i,j}\bigl(C_{i,j}(\psi) - \phi\bigr) + \lambda\Bigl(1 - \sum_{a=1}^{K}\psi_a\Bigr).$$
The KKT conditions are derived as the system (58). Let $\psi$ be the optimal primal solution, so that (58d) holds automatically. We aim to find a KKT solution satisfying (59). In particular, under (59), we find that the solution to (58b) and (58c) satisfies the corresponding stationarity relations for $j \neq j_1$. We conclude that the dual problem can be formulated as claimed. This completes the proof.

E.13 Proof of Proposition C.3

Fix $\mu$. We show the optimality condition for
$$\psi \in \mathrm{argmax}_{\psi' \in \Sigma_K}\ \sum_{i \in I}\sum_{j \in J}\mu_{i,j}\log C_{i,j}(\psi').$$
Note that $\max_{\psi \in \Sigma_K}\sum_{i \in I}\sum_{j \in J}\mu_{i,j}\log C_{i,j}(\psi)$ is a convex optimization problem for any $\mu \in \Sigma_{k(K-k)}$, and the feasible region $\Sigma_K$ contains interior points. Hence, Slater's condition holds and the KKT conditions are necessary and sufficient. We write down the Lagrangian as follows:
$$L(\psi, \lambda) = \sum_{i \in I}\sum_{j \in J}\mu_{i,j}\log C_{i,j}(\psi) + \lambda\Bigl(1 - \sum_{a=1}^{K}\psi_a\Bigr).$$
The KKT conditions are, for any $i \in I$ and $j \in J$, the stationarity conditions with respect to $\psi_i$ and $\psi_j$ together with $\sum_{a=1}^{K}\psi_a = 1$. Using the relation $\partial C_{i,j}/\partial \psi_i = d(\theta_i, \bar{\theta}_{i,j})$ and multiplying by $\psi_i$, the stationarity condition on $\psi_i$ is equivalent to
$$\lambda\,\psi_i = \sum_{j \in J}\mu_{i,j}\,\frac{\psi_i\, d(\theta_i, \bar{\theta}_{i,j})}{C_{i,j}(\psi)} = \sum_{j \in J}\mu_{i,j}\, h_{i;j},$$
and similarly for each $j \in J$. Summing these relations over all arms, using $h_{i;j} + h_{j;i} = 1$ and combining with $\sum_{a=1}^{K}\psi_a = 1$, we conclude that $\lambda = 1$. This completes the proof.
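As a numerical sanity check on the equal-variance Gaussian balance condition established in the proof of Theorem 3.4 ($\sum_{i \in I_l}\psi_i^2 = \sum_{j \in J_l}\psi_j^2$ on every connected component, hence $\sum_{i \in I}\psi_i^2 = \sum_{j \in J}\psi_j^2$ overall), the sketch below (ours; the solver choice and tolerances are assumptions, and SLSQP is only an off-the-shelf option for this small non-smooth instance) maximizes $\Gamma(\psi)$ over the simplex and compares the two sums.

```python
import numpy as np
from scipy.optimize import minimize

theta = np.array([1.0, 0.8, 0.5, 0.3])      # Gaussian means, common variance sigma^2
sigma2, k = 1.0, 2
K = len(theta)
I = list(np.argsort(theta)[-k:])             # indices of the k largest means
J = [a for a in range(K) if a not in I]

def Gamma(psi):
    # Gamma(psi) = min_{i in I, j in J} C_{i,j}(psi) for equal-variance Gaussian arms
    def C(i, j):
        return psi[i] * psi[j] * (theta[i] - theta[j]) ** 2 / (2.0 * sigma2 * (psi[i] + psi[j]))
    return min(C(i, j) for i in I for j in J)

# Maximize Gamma over the probability simplex by minimizing its negative.
cons = [{"type": "eq", "fun": lambda psi: psi.sum() - 1.0}]
res = minimize(lambda psi: -Gamma(psi), np.ones(K) / K,
               method="SLSQP", bounds=[(1e-9, 1.0)] * K, constraints=cons)
psi_star = res.x
print("psi* =", np.round(psi_star, 4))
print("sum_I psi^2 =", sum(psi_star[i] ** 2 for i in I))
print("sum_J psi^2 =", sum(psi_star[j] ** 2 for j in J))
```

The two printed sums should agree up to solver tolerance; because $\Gamma$ is non-smooth, a generic solver is only an illustration here, and a dedicated method such as the FWGA iteration above would be used in practice.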