When to Call Your Neighbor? Strategic Communication in Cooperative Stochastic Bandits
Udari Madhushani and Naomi Leonard
October 8, 2021

Abstract. In cooperative bandits, a framework that captures essential features of collective sequential decision making, agents can minimize group regret, and thereby improve performance, by leveraging shared information. However, sharing information can be costly, which motivates developing policies that minimize group regret while also reducing the number of messages communicated by agents. Existing cooperative bandit algorithms obtain optimal performance when agents share information with their neighbors at every time step, i.e., full communication. This requires $\Theta(T)$ messages, where $T$ is the time horizon of the decision-making process. We propose ComEx, a novel cost-effective communication protocol in which the group achieves the same order of performance as full communication while communicating only $O(\log T)$ messages. Our key step is developing a method to identify and communicate only the information crucial to achieving optimal performance. Further, we propose novel algorithms for several benchmark cooperative bandit frameworks and show that our algorithms obtain state-of-the-art performance while consistently incurring a significantly smaller communication cost than existing algorithms.

Sequential decision making in uncertain environments has been extensively studied over the past several decades due to its wide range of real-world applications, including recommender systems, user-targeted online advertising (Tossou and Dimitrakakis, 2016), clinical trials (Durand et al., 2018) and target searching (e.g., finding a nuclear or temperature source) in robotics.
Making optimal decisions under uncertainty requires striking a balance between exploring the environment to identify better decisions and exploiting the decisions that are already known to produce higher outcomes. In collective decision making, i.e., a group of agents making sequential decisions, performance can be greatly improved through cooperative communication by sharing information about the environment. However, communication is often time consuming and expensive. For example, consider a recommender system in which multiple servers are networked to handle high demand. In this case, heavy communication between servers can lead to service latency. Similarly, for a group of robots, communication can increase battery power consumption. Thus the cost associated with communication makes it desirable to reduce the amount of shared information. Motivated by this, we ask: can a group achieve the performance of full communication while communicating significantly less?

In this section we provide the mathematical formulation and the intuition behind our communication protocol.

Notations. For any positive integer M we denote the set {1, 2, . . . , M} as [M]. We define 1{x} as an indicator variable that takes value 1 if x is true and 0 otherwise. Further, we use X \ x to denote the set X excluding the element x. We use |X| to denote the number of elements in set X. For any general graph G we define χ(G) and γ(G) as the clique covering number and the domination number, respectively. We use G^γ to denote the γ-th power graph of G. Let g(M, x) = M + Σ_{i=1}^{N} (12 log(3(x + 1)) + 3 log(x + 1)).

Cooperative stochastic bandits. We consider the cooperative bandit problem with K arms and N agents. The reward distribution of each arm k ∈ [K] is assumed to be sub-Gaussian with mean µ_k and variance proxy σ²_k. At each time step t ∈ [T] each agent i ∈ [N] pulls an arm A^(i)_t and receives a numerical reward X^(i)_t drawn from the probability distribution associated with the pulled arm. Without loss of generality we assume that µ_1 ≥ µ_2 ≥ . . .
≥ µ_K and define ∆_k := µ_1 − µ_k, ∀k > 1, to be the expected reward gap between the optimal arm, i.e., the arm with the highest mean reward, and arm k. Let ∆̄ := min_{k≠1, k∈[K]} ∆_k be the minimum expected reward gap. We make the following assumptions. (A1) When more than one agent pulls the same arm at the same time, they receive rewards independently drawn from the probability distribution associated with the pulled arm. (A2) All the agents know σ² ≥ σ²_k, ∀k, an upper bound on the variance proxies associated with the arms.

Communication over a general graph. Let G(V, E) be a general graph that encodes the hard communication constraints among agents. The vertex set V is the set of agents [N] and each edge (i, j) ∈ E indicates that agents i and j are neighbors. We consider that agents directly communicate with their neighbors only. Let 1{(i, i) ∈ E} = 1, ∀i. At each time step t we define the communication between agents by G_t(V, E_t) where E_t ⊆ E. Let d^(i) be the degree of agent i. Let G^γ denote the γ-th power graph of G. Denote by d^(i)_γ the degree of agent i in graph G^γ, i.e., the number of agents within a distance of γ from agent i in graph G. We denote by m^(i)_t the message shared by agent i at time t with its neighbors. This can be either a single message containing information about a particular arm pull, typically the last arm pull of agent i, or a concatenation of information about several arm pulls by more than one agent over several previous time steps. We define n^(i)_k(t) := Σ_{τ=1}^{t} 1{A^(i)_τ = k} and N^(i)_k(t) := Σ_{τ=1}^{t} Σ_{j=1}^{N} 1{A^(j)_τ = k} 1{(i, j) ∈ E_τ} to be the number of times until time step t that agent i pulled arm k and the number of reward values agent i observed from arm k, respectively. Note that the number of observations satisfies N^(i)_k(t) ≥ n^(i)_k(t), since (i, i) ∈ E.

Regret and communication cost. Following convention, we define regret as the loss suffered by agents due to pulling suboptimal arms. Let R(t) be the cumulative group regret at time t.
Then the expected cumulative group regret can be given as E[R(t)] := Σ_{k=1}^{K} ∆_k Σ_{i=1}^{N} E[n^(i)_k(t)]. We define the communication cost as the number of messages shared by agents. We consider the cost of sharing a concatenated message to be the number of single messages included in it. Let L(t) be the cumulative group communication cost at time t, and let E[L(t)] be the corresponding expected group communication cost.

Proposed communication protocol: ComEx. We propose ComEx, a cost-effective partial communication protocol that obtains the same order of performance as full communication.

Figure 1: A summary of our proposed algorithms and existing state-of-the-art algorithms for different cooperative bandit frameworks.

As motivated above, information about suboptimal arms is most valuable to agents seeking to maximize expected cumulative reward. This is because, with information from neighbors on a suboptimal arm, an agent can obtain a sufficiently accurate estimate of the expected reward of the suboptimal arm without having to pull the arm itself. Agents typically pull suboptimal arms when they are exploring. Thus, to provide the means to maintain high performance with low communication costs, we propose a new communication protocol in which agents only share information they obtained through exploring. Note that according to this communication protocol, agents initiate sharing messages only about the rewards received from arms that are instantaneously suboptimal, i.e., arms that do not have the maximum estimated expected reward. This maximizes the chance of sharing information about suboptimal arms.

Generalizability of ComEx. As we demonstrate in the next few sections, our communication protocol is an easily implementable, general communication protocol that can be incorporated into a wide range of cooperative bandit algorithms. We illustrate this generality by proposing 5 novel algorithms incorporating ComEx in several cooperative bandit frameworks.
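The initiation rule described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact pseudocode: the message format (agent id, arm id, reward, time step) is an assumption, and `means` stands for the agent's current estimated expected rewards.

```python
import numpy as np

def comex_message(agent, arm, reward, t, means):
    """ComEx initiation rule (sketch): initiate a message only when the
    pulled arm is instantaneously suboptimal, i.e. it does not have the
    highest estimated expected reward in this agent's current estimates."""
    if arm != int(np.argmax(means)):
        return (agent, arm, reward, t)   # exploratory pull -> share with neighbors
    return None                          # greedy pull -> stay silent

# hypothetical estimates after a few pulls
means = np.array([1.0, 0.4, 0.7])
msg = comex_message(0, 1, 0.35, 5, means)    # arm 1 is suboptimal -> message
quiet = comex_message(0, 0, 1.10, 6, means)  # arm 0 is the greedy arm -> None
```

Under full communication the `None` branch disappears and every pull is shared, which is where the Θ(T) message count comes from.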
Figure 1 provides a summary of our algorithms and state-of-the-art algorithms in several benchmark cooperative bandit frameworks.

In this section we propose novel algorithms for decentralized cooperative bandits. We present our first algorithm, ComEx-UCB, by combining the above communication protocol with instantaneous reward sharing. Each agent follows a sampling rule that balances exploiting with exploring. We use a natural extension of the Upper Confidence Bound (UCB) algorithm as the sampling rule. In UCB, at each time step t, for each arm k, each agent i constructs an upper confidence bound, i.e., the sum of its estimated expected reward (the empirical average of the observed rewards) and the uncertainty C^(i)_k(t) associated with the estimate, where ξ > 1, and pulls the arm with the highest bound. If the pulled arm is instantaneously suboptimal, the agent sends a message m^(i)_t to its neighbors (see Definition 1). Note that under this communication rule agents do not share concatenated messages, so passing information about the time step and agent id is redundant. Pseudocode for ComEx-UCB is given in Appendix I.

Theorem 1. (Group regret of ComEx-UCB) Consider a group of N agents following ComEx-UCB while sharing instantaneous rewards over a general communication graph G. Then for any ξ ≥ 1.1 the expected cumulative group regret satisfies:

Proof sketch. We follow an approach similar to the standard UCB analysis (Auer, Cesa-Bianchi and Fischer, 2002; Dubey et al., 2020) with a few key modifications. We partition the communication graph into a set of non-overlapping cliques, analyse the regret of each clique, and take the summation over cliques to obtain the regret of the group. When agents use full communication, group regret can be given as the summation of a log T term that scales with the clique covering number χ(G) and a term which is independent of T.
The second term depends on the summation of tail probabilities of the arm estimates. For full communication a similar result can be found in (Dubey et al., 2020). Note that full communication is a deterministic communication protocol, while ComEx-UCB is a stochastic communication protocol that depends on the decision-making process. Two major technical challenges in proving the regret bound for ComEx-UCB are 1) deriving a tail probability bound for the case in which the communication between agents is stochastic, and 2) bounding the additional regret incurred by not sharing information when pulling the arm with the highest estimated average reward. We overcome the first challenge by noticing that the communication random variables 1{(i, j) ∈ E_t}, ∀i, j, t, are previsible, i.e., measurable with respect to the sigma algebra generated by the information obtained up to time t − 1. We address the second challenge by proving that the number of times agents do not share information about any suboptimal arm k can be bounded by the tail probabilities of arm k and the optimal arm. A complete proof of Theorem 1 is given in Appendix A.

Remark 1. By replacing ComEx with full communication in the ComEx-UCB algorithm, agents obtain an expected cumulative group regret of E[R(T)] = O(Kχ(G) log T + KN) (Appendix H). Thus from Theorem 1 we see that ComEx obtains the same order of performance as full communication.

Recall that the expected communication cost under full communication is Θ(T). Now we prove that the expected communication cost under ComEx is logarithmic in time. In the ComEx-UCB algorithm agents only send single messages (not concatenated messages). Thus the expected group communication cost at time step t can be given as the expected number of single messages initiated by all agents up to time t.

Theorem 2. (Communication cost of ComEx-UCB) Consider a group of N agents following ComEx-UCB while sharing instantaneous rewards over a general communication graph G. Then for any ξ ≥ 1.1 the expected group communication cost satisfies:

Proof sketch.
Note that the expected group communication cost is the sum of 1) the expected number of times agents pull any suboptimal arm when it is instantaneously suboptimal and 2) the expected number of times agents pull the optimal arm when it is instantaneously suboptimal. We note that the first term can be directly bounded by the expected number of times agents pull suboptimal arms. We prove that the second term can be bounded logarithmically in time. A detailed proof of Theorem 2 is given in Appendix B.

We propose ComEx-MPUCB, an improved version of ComEx-UCB, by incorporating a message passing method (Suomela, 2013; Bar-On and Mansour, 2019; Dubey et al., 2020) that allows agents to share the messages they initiate with agents who are within a distance of γ. We call γ the communication density parameter. We consider that at time t each agent i initiates a message m^(i)_t according to ComEx as given in Definition 1 and sends the message to its neighbors. Subsequently, the agents who receive the message forward it to their neighbors. Messages received at time t are forwarded to neighbors at time t + 1, so each hop adds a delay of one time step. Under this message passing method, γ-hop neighbors receive the message after a delay of γ time steps. Agents do not forward messages that are older than γ − 1 and discard messages that are older than γ. Note that for a connected graph the maximum number of time steps required to pass a message between any two agents equals the diameter of the graph. Thus we choose γ to be an integer constant that is at most the diameter of the communication graph G. The pseudocode for ComEx-MPUCB is given in Appendix J.

Theorem 3. (Group regret of ComEx-MPUCB) Consider a group of N agents following ComEx-MPUCB. Then for any ξ ≥ 1.1 the expected cumulative group regret satisfies:

Proof sketch.
We see that the regret under ComEx-MPUCB can be given as the summation of the regret of ComEx-UCB when the communication graph is G^γ and the regret incurred by the delay in passing messages to agents who are not 1-hop neighbors. We prove that the expected regret due to delay is at most (N − χ(G^γ))(γ − 1). A detailed proof is provided in Appendix C.

Remark 2. Similar to ComEx-UCB, by replacing ComEx with full communication in the ComEx-MPUCB algorithm, agents obtain an expected cumulative group regret of E[R(T)] = O(Kχ(G^γ) log T + KN) (Appendix H). Thus from Theorem 3 we see that ComEx obtains the same order of performance as full communication. Now we proceed to prove that the expected group communication cost under ComEx-MPUCB is logarithmic in time.

Proof sketch. Note that under ComEx-MPUCB agents send concatenated messages to their neighbors. Recall that agents do not forward messages that are older than γ − 1. Thus each message initiated by agent i is subsequently forwarded by all agents who are within distance γ − 1 of agent i. A detailed proof can be found in Appendix D.

We propose ComEx-LFUCB by combining the ComEx communication protocol with a leader-follower method (Kolla, Jagannathan and Gopalan, 2018; Landgren, Srivastava and Leonard, 2018; Dubey et al., 2020; Wang, Proutiere, Ariu, Jedra and Russo, 2020). ComEx-LFUCB provides better performance compared to its decentralized counterpart ComEx-MPUCB. Let V_γ be the set of vertices in a minimum dominating set of graph G^γ. We consider each agent i ∈ V_γ to be a leader and all the other agents to be followers. Note that every follower has at least one leader as a neighbor. We consider that each leader uses ComEx-MPUCB and each follower copies the last action observed from its leader. Each follower j is assigned a leader i such that d(i, j) = min_{i'} d(i', j), where d(i, j) is the distance between agent i and agent j in graph G. Let N^(i)_γ be the set of followers of leader i.
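The nearest-leader assignment just described can be computed with a multi-source breadth-first search from the leader set. This is an illustrative sketch (the paper does not prescribe an algorithm for the assignment, and tie-breaking between equidistant leaders is arbitrary here); `adj` is an assumed adjacency-list representation of G.

```python
from collections import deque

def assign_followers(adj, leaders):
    """Assign each non-leader to a leader at minimum graph distance via
    multi-source BFS. Returns the follower sets N_gamma^(i) and the
    distance d(i, j) of every agent to its assigned leader."""
    owner, dist = {}, {}
    q = deque()
    for l in leaders:
        owner[l], dist[l] = l, 0
        q.append(l)
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:               # first visit = shortest distance
                dist[v] = dist[u] + 1
                owner[v] = owner[u]         # inherit the nearest leader
                q.append(v)
    followers = {l: [v for v, o in owner.items() if o == l and v != l]
                 for l in leaders}
    return followers, dist

# path graph 0-1-2-3-4 with hypothetical leaders {0, 4}
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
fol, dist = assign_followers(adj, [0, 4])
```

The message-passing delay bound d(i, j) used in the proof of Theorem 5 is exactly the `dist` value returned for each follower.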
We consider that each leader sends a message containing the id of the arm it pulls and whether it is instantaneously suboptimal, i.e., for i ∈ V_γ at time step t, leader i sends m^(i)_t to its neighbors, and they subsequently forward it to their neighbors. Note that at time step t each follower j ∈ N^(i)_γ copies the action of its leader i, subject to the message-passing delay. Each follower passes a message containing information about the reward and arm id if it pulls an arm that is specified as instantaneously suboptimal by its leader. Thus the followers communicate according to ComEx. Accordingly, under full communication followers share their rewards and arm pulls at every time step. Pseudocode for ComEx-LFUCB is provided in Appendix K.

Theorem 5. (Group regret of ComEx-LFUCB) Consider a group of N agents following ComEx-LFUCB with communication density parameter γ. Then for any ξ ≥ 1.1 the expected cumulative group regret satisfies:

Proof sketch. We follow an approach similar to the proof of Theorem 3 with a few key modifications, based on the argument below. Note that the number of suboptimal arm pulls by each j ∈ N^(i)_γ can be upper bounded using the suboptimal arm pulls by i and the message-passing delay. Note that the message-passing delay can be upper bounded by d(i, j). A detailed proof of Theorem 5 is given in Appendix E.

Remark 3. Similar to ComEx-MPUCB, by replacing ComEx with full communication in the ComEx-LFUCB algorithm, i.e., allowing followers to share information about arm pulls at every time step, agents obtain an expected cumulative group regret of the same order. Thus from Theorem 5 we see that ComEx obtains the same order of performance as full communication.

Theorem 6. (Communication cost of ComEx-LFUCB) Consider a group of N agents following ComEx-LFUCB with communication density parameter γ. Then for any ξ ≥ 1.1 the expected group communication cost satisfies:

Proof sketch. Note that the expected number of times a leader initiates a message can be upper bounded by twice the expected number of its suboptimal arm pulls.
Further, the number of times each follower j ∈ N^(i)_γ initiates a message can be bounded by the number of instantaneously suboptimal arms pulled by the leader i. Similar to ComEx-MPUCB, in ComEx-LFUCB agents send concatenated messages to their neighbors. Thus each message initiated by any agent i is subsequently forwarded by all agents who are within distance γ − 1 in graph G. A detailed proof can be found in Appendix F.

Remark 4. The algorithm and results provided in this section can be specialized to centralized cooperative bandits with instantaneous reward sharing by substituting γ = 1.

We propose two more algorithms, thus extending ComEx to additional cooperative bandit frameworks. We leave providing theoretical guarantees for these as future work.

Estimate sharing. We propose ComEx-EstUCB by combining ComEx with estimate sharing (Landgren, Srivastava and Leonard, 2016a; Martínez-Rubio, Kanade and Rebeschini, 2019; Landgren, Srivastava and Leonard, 2020), which obtains better performance than instantaneous reward sharing. In estimate sharing, for each arm k, agents maintain an estimated sum of rewards and an estimated number of pulls from the arm. At each time step, agents average their estimates with their neighbors according to a consensus protocol and update the estimates by incorporating the information from the arm pull at that time step. We refer readers to Landgren, Srivastava and Leonard (2020) for more details. In ComEx-EstUCB agents only average estimates of instantaneously suboptimal arms. Pseudocode for ComEx-EstUCB is given in Appendix L.

Thompson sampling. We extend our communication protocol to cooperative Thompson bandits as follows. We propose ComEx-MPThompson, a new algorithm obtained by replacing the UCB sampling rule with the Thompson sampling rule in ComEx-MPUCB. We combine ComEx with message passing and a natural extension of Thompson sampling to cooperative bandits.
Here we provide a brief description of the cooperative Thompson sampling rule and refer readers to Lalitha and Goldsmith (2020) for more details. The algorithm is initialized by each agent assigning a suitable prior distribution to each arm. Typically, Gaussian priors are used for Gaussian reward distributions and Beta priors are used for Bernoulli distributions. At each time step, each agent constructs a posterior distribution for each arm using the prior distribution and the reward information available at that time step. Each agent draws a sample from the posterior distribution associated with each arm and pulls the arm with the highest sampled value. Agents initiate messages according to ComEx and pass the messages to neighbors using the same protocol as in ComEx-MPUCB. Pseudocode for ComEx-MPThompson is given in Appendix M.

In this section we provide numerical simulations illustrating our results and validating our theoretical claims. All the experiments were run on the first author's personal laptop. We show that ComEx obtains the same order of performance, i.e., the same order of group regret, as full communication for a significantly smaller communication cost. We also demonstrate that our algorithms outperform state-of-the-art algorithms in several bandit frameworks.

Experimental setup. We provide simulation results for the following cooperative bandit frameworks: 1) decentralized instantaneous reward sharing, 2) decentralized message passing, 3) decentralized estimate sharing, 4) centralized leader-follower, and 5) Thompson sampling. We compare the performance of our algorithms (ComEx-UCB, ComEx-MPUCB, ComEx-EstUCB, ComEx-LFUCB and ComEx-Thompson) with their corresponding full communication algorithms (Full-UCB, Full-MPUCB, Full-EstUCB, Full-LFUCB and Full-Thompson) and state-of-the-art algorithms in each framework. For all simulations presented in this section we consider 10 arms (K = 10), 100 agents (N = 100) and 500 time steps (T = 500).
The communication graph between agents is an Erdős–Rényi random graph with edge probability 0.7. Results are averaged over 100 Monte Carlo simulations. Additional experimental results for different graph structures and parameters (ξ, γ) are also provided.

Hyperparameters. We use tuning parameter ξ = 1.01 for UCB-based algorithms. For the results provided in Figures 2(b)-2(e) we use communication density parameter γ = 5. None of the competing algorithms, except UCB-Coop2, MP-UCB(D) and MP-UCB(C), have hyperparameters. We tuned the parameters of UCB-Coop2 to get the best results for that algorithm and used κ = 0.02, γ' = 1.001, η = 0.001 for the final results (Equations 9 and 15 in Landgren, Srivastava and Leonard (2020); here we write γ' for their γ to avoid confusion with the communication density parameter γ used in this paper). Decreasing γ' below 1.001 and η below 0.001 did not offer any significant improvement. MP-UCB(D) and MP-UCB(C) were originally proposed in Dubey et al. (2020) for heavy-tailed distributions, and we adapt them to sub-Gaussian distributions as directed by the authors. For MP-UCB(D) and MP-UCB(C) we used the same confidence bounds as in our algorithms.

We consider triangle distributions with mode 1 for the optimal arm and mode 0 for all suboptimal arms. In the simulations provided in Figures 2(b), 2(c) and 2(e) we consider Gaussian reward distributions. The expected reward of the optimal arm is µ_1 = 11 and of all suboptimal arms k > 1 is µ_k = 10. We let the variance associated with all arms be σ²_k = 1, ∀k. We use the notation Obs-UCB to denote the algorithm presented in Madhushani and Leonard (2020b).

ComEx obtains the same order of performance as full communication. Our results in Figure 2 illustrate that ComEx obtains the same order of performance, i.e., the same order of group regret, as full communication. Comparing Figures 2(a) and 2(b) we see that the performance difference between full communication and ComEx decreases when the communication density γ increases.
Comparing Figure 2(e) with the others, we see that the performance difference between full communication and ComEx is small under both UCB-based and Thompson-based sampling rules. All results illustrate that our algorithms consistently outperform state-of-the-art algorithms in all five benchmark cooperative bandit frameworks.

ComEx only incurs a logarithmic communication cost. Our simulation results also illustrate that ComEx only incurs a logarithmic communication cost. In Figure 2(a) we observe that Obs-UCB also incurs a logarithmic cost. However, ComEx-UCB incurs a smaller cost than Obs-UCB while suffering a smaller group regret. Further, the results illustrate that ComEx-enabled algorithms incur a significantly smaller communication cost compared to existing state-of-the-art algorithms.

Additional discussion. The state-of-the-art algorithm for the leader-follower setting is DPE2 in Wang, Proutiere, Ariu, Jedra and Russo (2020). DPE2 uses a phased communication protocol in which, during the leader selection phase, which lasts at least 2D rounds, where D is the diameter of the graph, agents do not pull arms. Thus, this phase accumulates an expected group regret of at least 2DNµ_1. In our experimental setup, this alone exceeds the regret accumulated by our algorithms during the entire time horizon. So a meaningful comparison cannot be provided without modifying DPE2 to allow pulling arms during the leader selection phase.

Limitations. The main limitation of this work is that all the theoretical claims are provided using upper bounds. Obtaining lower bounds for cooperative bandits that communicate over general graphs is difficult due to the complex nature of the probability distribution associated with the sampling process of agents. This is an active area of research. We provide a discussion in Appendix B of the optimality of our regret bounds by providing a lower bound when G is a complete graph.

Future extensions.
We plan to analyse the regret and communication cost of the algorithms provided in Section 6. Our intuition can be extended to the collision setting by not allowing agents to share information about the first N instantaneously optimal arms. In the collision setting, when more than one agent pulls the same arm at the same time step, a collision occurs. This causes agents to either split the reward or completely lose the reward at that time step. Another extension would be proposing similar algorithms for linear bandits and adversarial bandits.

We proposed ComEx, a general and effective communication protocol which obtains the same order of performance as full communication while incurring a significantly smaller communication cost. Next, we proposed novel algorithms for several benchmark bandit frameworks by incorporating the ComEx protocol. We provided theoretical guarantees followed by experimental results illustrating the state-of-the-art performance of our algorithms.

We begin the proof of Theorem 1 by proving a few useful lemmas.

Lemma 1. (Restatement of results from Auer, Cesa-Bianchi and Fischer (2002)) For any suboptimal arm k and ∀i, t we have

Proof. Note that for any k > 1 we have

This concludes the proof of Lemma 1.

Proof. Let C be a non-overlapping clique covering of G. Note that for each suboptimal arm k > 1 we have

Let τ_k,C be the maximum time step such that the total number of pulls from arm k shared by agents in the clique C is at most η_k. This can be stated as

We analyse the expected number of times all agents pull suboptimal arm k as follows. Taking the expectation of (3) we have

Now we proceed to upper bound the first term on the right hand side of (3) as follows. Note that we have

Taking the expectation of (6) we have

Now we proceed to upper bound the last term of (7) as follows. Note that for any suboptimal arm k we have

From (5), (7) and (12) we have

From (1), (13) and Lemma 1 we have

This concludes the proof of Lemma 2.
Now we proceed to bound the tail probabilities as follows.

Lemma 3. (Tail probability bound) Let d^(i) be the degree of agent i. For some σ ≥ σ_k and for any ζ > 1,

Proof. Let X_k be the sub-Gaussian random variable that models rewards drawn from arm k. Then X_k has mean µ_k and variance proxy σ²_k. Then we have

Define a new random variable for all τ > 0. Equality (a) follows from the properties of the random variables exp(λ(·)). Then we have

Further, using the properties of conditional expectations,

Thus we see that

Note that we have

Recall from Markov's inequality that P(Y ≥ a) ≤ E[Y]/a for any positive random variable Y. Thus,

Then we have

Since σ ≥ σ_k we have

Note that ∀ζ > 1 we have

Then we have

This concludes the proof of Lemma 3.

Lemma 4. ≤ 12 log(3(d^(i) + 1)) + 3 log(d^(i) + 1) + 1

Proof. For ζ = 1.3 we have 1/log ζ < 8.78. Further, (ξ + 1)(1 − (ζ − 1))² > 2 and ∀t ≥ 3 we see that the bound is monotonically decreasing. Thus we have

Thus we have

Recall that for ζ = 1.3 we have 1/log ζ < 8.78. Thus the proof of Lemma 4 follows from (17) and (21).

Now we proceed to prove Theorem 1. From the definition of expected cumulative group regret and Lemmas 2, 3 and 4 we have

This concludes the proof of Theorem 1.

Recall that all the agents communicate their rewards and arm ids at time t = 1. Then the expected communication cost can be given as

Note that we have

For all agents, we first upper bound the expected number of times they share rewards and actions with their neighbors until time T when they pull a suboptimal arm:

Next, for all agents, we upper bound the expected number of times they share rewards and actions with their neighbors until time T when they pull the optimal arm, as follows. Let k*_t be the suboptimal arm with the highest estimated expected reward for agent i at time t. This can be stated as

Thus, we have

Note that the first term on the right hand side of the above equation is the summation of tail probabilities of the estimate of the optimal arm.
Now we proceed to upper bound the second term as follows. Let τ^(i)_1 denote the maximum time step at which the total number of times agent i pulled the optimal arm plus the total number of observations it received from its neighbors about the optimal arm is at most η̄. This can be stated as

If agent i pulls the optimal arm at time t we have Q^(i)

From (27), (28) and (29) we have

From (31) and Lemma 3 we have

The proof of Theorem 2 follows from (24), (25), (26), (32) and Theorem 1.

In this section we follow an approach similar to Section A. Recall that G^γ is the γ-th power graph of G. Thus each pair of vertices in G^γ is adjacent if and only if the distance between them in G is at most γ. We begin the proof of Theorem 3 by proving a lemma similar to Lemma 2.

Proof. Let C_γ be a non-overlapping clique covering of G^γ. Note that for each suboptimal arm k > 1 we have

Let τ_k,C be the maximum time step such that the total number of messages about pulls from arm k initiated by agents in the clique C is at most η_k + (|C| − 1)(γ − 1). This can be stated as

We analyse the expected number of times all agents pull suboptimal arm k as follows. Taking the expectation of (34) we have

Now we proceed to upper bound the first term on the right hand side of (34) as follows. Note that we have

Taking the expectation of (36) we have

Now we proceed to upper bound the last term of (37) as follows. Note that for any suboptimal arm k we have

Now we proceed to upper bound the last term of (38) as follows. Note that we have

From (38) and (39),

From (35), (37) and (41) we have

The proof of Lemma 5 follows from (33), (42) and Lemma 1.

Now we proceed to prove Theorem 3 as follows. We start by obtaining a modified tail bound similar to the result in Lemma 3. Note that ∀i, k, t we have 1 ≤ N^(i)

The proof of Theorem 3 follows from Lemmas 4, 5 and (43).
Following an approach similar to the proof of Theorem 2, we obtain

Similarly we get

From (44) and (45) we have

Note that (46) is the expected number of messages initiated by all the agents. Recall that in ComEx-MPUCB a message initiated by agent i is subsequently passed by agents within a distance of γ − 1 in graph G. Thus we have

From (46) and (47) we have

From (43), (48) and Lemma 5 we have

Recall that η_k = (8(ξ + 1)σ²/∆²_k) log T. Thus the proof of Theorem 4 follows from (49) and Lemma 4.

We follow an approach similar to the proof of Theorem 3. We begin the proof by providing a lemma similar to Lemma 5, where V_γ is a minimum dominating set of G^γ and N^(i)_γ is the set of followers of leader i.

Proof. Recall that V_γ is a minimum dominating set of G^γ. Let N^(i)_γ be the set of followers of leader i. Then for each suboptimal arm k > 1 we have

Let τ^(i)_k be the maximum time step such that the total number of times agent i pulls arm k plus the number of times agents in N^(i)_γ initiated messages about pulls from arm k is at most η_k + |N^(i)_γ|(γ − 1). This can be stated as

Then we have N^(i)_k(t) > η_k, ∀t > τ^(i)_k. We analyse the expected number of times all agents pull suboptimal arm k as follows. Let d(i, j) be the distance between agents i and j in graph G. Then note that for any j ∈ N^(i)_γ

Now we proceed to upper bound the first two terms on the right hand side of (51) as follows. Note that we have

Taking the expectation of (51) and (52) we have

Now we proceed to upper bound the last term of (53) as follows. Note that for any suboptimal arm k we have

Thus from (53), (56) and Lemma 1 we have

The proof of Lemma 6 follows from (50) and (57).

Now we proceed to prove Theorem 5 as follows. We start by obtaining a modified tail bound similar to the result in Lemma 3. Note that ∀i ∈ V_γ we have 1 ≤ N^(i)

The proof of Theorem 5 follows from Lemmas 4, 6 and (58).
Following a similar approach to the proof of Theorem 4 we obtain (59). Similarly we get (60). From (59) and (60) we have (61). Note that (61) is the expected number of messages initiated by all the agents. Recall that in ComEx-LFUCB a message initiated by agent $i$ is subsequently passed by agents within a distance of $\gamma - 1$ in graph $G$. Thus we have (62). From (61) and (62) we have (63). From (58), (63) and Lemma 6 we have (64). Recall that $\eta_k = \frac{8(\xi+1)\sigma^2}{\Delta_k^2}\log T$. Thus the proof of Theorem 6 follows from (64) and Lemma 4.

In this section we provide theoretical bounds on the group regret of Full-UCB, Full-MPUCB and Full-LFUCB. We start by proving a lemma similar to Lemma 2.

Lemma 7. Let $\eta_k = \frac{8(\xi+1)\sigma^2}{\Delta_k^2}\log T$. Let $\mathcal{C}$ be a non-overlapping clique covering and $\bar{\chi}(G)$ be the clique covering number of the graph $G$. Let $\tau_{k,C}$ be the maximum time step such that the total number of pulls from arm $k$ by agents in the clique $C \in \mathcal{C}$ is at most $\eta_k$. Define $\tau_k := \min_{C \in \mathcal{C}} \tau_{k,C}$. Then the following bound holds.

Proof. Let $\mathcal{C}$ be a non-overlapping clique covering of the graph $G$. Then we have (65). Let $\tau_{k,C}$ be the maximum time step such that the total number of pulls from arm $k$ by agents in the clique $C$ is at most $\eta_k$; this can be stated as $\tau_{k,C} := \max\{t \in [T] : \sum_{i \in C} N^{(i)}_k(t) \le \eta_k\}$. We analyse the expected number of times all agents pull suboptimal arm $k$ as follows. Taking the expectation of (66) we have (67). Let $\tau_k := \min_{C \in \mathcal{C}} \tau_{k,C}$. Similarly to Lemma 2, from (65), (67) and Lemma 1 we have the stated bound. This concludes the proof of Lemma 7.

Then from Lemmas 3, 4 and 7 it follows that $E[R(T)] = O(K\bar{\chi}(G)\log T + KN)$.

We start by proving a lemma similar to Lemma 5.

Lemma 8. Let $\eta_k = \frac{8(\xi+1)\sigma^2}{\Delta_k^2}\log T$. Let $\mathcal{C}_\gamma$ be a non-overlapping clique covering and $\bar{\chi}(G^\gamma)$ be the clique covering number of the graph $G^\gamma$, which is the $\gamma$-th power graph of $G$. Let $\tau_{k,C}$ be the maximum time step such that the total number of pulls from arm $k$ by agents in the clique $C \in \mathcal{C}_\gamma$ is at most $\eta_k$.

Proof. Let $\mathcal{C}_\gamma$ be a non-overlapping clique covering of the graph $G^\gamma$.
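The step from the number of *initiated* messages, (61), to the total communication cost, (63), is pure relay accounting: each initiated message is re-sent once by every agent within $\gamma - 1$ hops of its initiator. A small sketch of that accounting (illustrative only; the graph and counts are made up):

```python
from collections import deque

def within_distance(adj, source, radius):
    """Vertices at hop distance 1..radius from `source`, via BFS."""
    dist = {source: 0}
    queue = deque([source])
    found = []
    while queue:
        u = queue.popleft()
        if dist[u] == radius:
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                found.append(v)
                queue.append(v)
    return found

def total_transmissions(adj, gamma, initiated):
    """Total sends when each message from agent i (initiated[i] of them)
    is relayed by every agent within gamma-1 hops: 1 send + relays each."""
    return sum(n * (1 + len(within_distance(adj, i, gamma - 1)))
               for i, n in initiated.items())

# Path 1-2-3-4-5, gamma = 3: a message from agent 3 is relayed by 1, 2, 4, 5.
path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
assert total_transmissions(path, 3, {3: 1}) == 5
assert total_transmissions(path, 3, {1: 2}) == 6
```

Since the relay factor depends only on the graph and $\gamma$, the total cost keeps the same $O(\log T)$ order as the number of initiated messages.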
Then we have (68). Let $\tau_{k,C}$ be the maximum time step such that the total number of pulls from arm $k$ by agents in the clique $C$ is at most $\eta_k$; this can be stated as $\tau_{k,C} := \max\{t \in [T] : \sum_{i \in C} N^{(i)}_k(t) \le \eta_k\}$. We analyse the expected number of times all agents pull suboptimal arm $k$ as follows. Taking the expectation of (69) we have (71). Let $\tau_k := \min_{C \in \mathcal{C}_\gamma} \tau_{k,C}$. Similarly to Lemma 5, from (68), (71) and Lemma 1 we have the stated bound. This concludes the proof of Lemma 8.

Then from Lemmas 3, 4 and 8 it follows that $E[R(T)] = O(K\bar{\chi}(G^\gamma)\log T + KN)$.

We begin the proof by providing a lemma similar to Lemma 6.

Lemma 9. Let $\gamma(G^\gamma)$ be the domination number of graph $G^\gamma$. Let $\eta_k = \frac{8(\xi+1)\sigma^2}{\Delta_k^2}\log T$. The stated bound holds, where $V_\gamma$ is the maximal dominating set of $G^\gamma$ and $N^{(i)}_\gamma$ is the set of followers of leader $i$. Here $\tau^{(i)}_k$ is the maximum time step such that the total number of times agent $i$ pulls arm $k$ plus the number of times agents in $N^{(i)}_\gamma$ pull from arm $k$ is at most $\eta_k + |N^{(i)}_\gamma|(\gamma - 1)$.

Proof. Recall that $V_\gamma$ is the maximal dominating set of $G^\gamma$. Let $N^{(i)}_\gamma$ be the set of followers of leader $i$. Then for each suboptimal arm $k > 1$ we have (72). Let $\tau^{(i)}_k$ be defined as above; then we have $N^{(i)}_k(t) > \eta_k$ for all $t > \tau^{(i)}_k$. We analyse the expected number of times all agents pull suboptimal arm $k$ as follows. Let $d(i, j)$ be the distance between agents $i$ and $j$ in graph $G$. Then note that for any $j \in N^{(i)}_\gamma$ we have (73). Now we proceed to upper bound the first two terms on the right-hand side of (73): note that we have (74). Taking the expectation of (73) and (74) we have (75). Recalling that $N^{(i)}_k(t) > \eta_k$ for all $t > \tau^{(i)}_k$, we obtain (76). The proof of Lemma 9 follows from (72) and (76).

Then from Lemmas 3, 4 and 9 it follows that $E[R(T)] = O(K\gamma(G^\gamma)\log T + KN)$.

In this section we provide additional simulation results. We observe that the performance of the algorithms improves when we decrease $\xi$. Thus for the simulations provided in this section we use $\xi = 1.001$.
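All of the regret bounds above scale with the exploration threshold $\eta_k = \frac{8(\xi+1)\sigma^2}{\Delta_k^2}\log T$, which grows only logarithmically in the horizon. A quick numerical check of that scaling (the parameter values below are arbitrary examples, not from the paper's experiments):

```python
import math

def eta_k(sigma2, xi, delta_k, T):
    """Exploration threshold eta_k = 8(xi+1) sigma^2 log(T) / delta_k^2:
    roughly the number of group observations of arm k after which its
    confidence interval separates from the optimal arm's."""
    return 8 * (xi + 1) * sigma2 * math.log(T) / delta_k ** 2

# Example: sigma^2 = 1, xi = 1.001, gap delta_k = 0.5, horizon T = 10_000.
eta = eta_k(1.0, 1.001, 0.5, 10_000)
assert 580 < eta < 600
# Increasing T a thousandfold less than doubles eta: logarithmic growth.
assert eta_k(1.0, 1.001, 0.5, 10_000_000) < 2 * eta
```

Because each clique (or leader neighborhood) only needs about $\eta_k$ observations of each suboptimal arm $k$, the group regret picks up a factor of $\bar{\chi}(G)$, $\bar{\chi}(G^\gamma)$, or $\gamma(G^\gamma)$ rather than a factor of $N$.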
Further, when $\gamma$ is increased, communication density increases and performance improves. For the simulations provided in this section we consider $\gamma = 7$. We use the same graph structure and reward structures used in the results provided in the main paper.

Additional details on estimate sharing. Note that in estimate sharing agents average their estimates of instantaneously suboptimal arms at every time step. Thus at each time step each agent creates $2K$ messages (the estimated sum of rewards for each arm and the estimated number of pulls from each arm). If the number of arms is of the same order as the time horizon, this leads to an $O(T^2)$ cost for Full-EstUCB and an $O(T \log T)$ cost for ComEx-EstUCB. However, we consider the number of arms to be fixed, with $K \ll T$ for large $T$, and when providing simulation results for the communication cost we only considered the cost associated with initiating messages and passing them through the network, neglecting the dependence on the number of arms. This leads to an $O(T)$ cost for Full-EstUCB and an $O(\log T)$ cost for ComEx-EstUCB.
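The gap between the two cost regimes is easy to see with back-of-the-envelope counting. The sketch below compares full communication ($N$ messages per step for $T$ steps) against a ComEx-style budget in which each agent stops initiating messages about an arm once the group has roughly $\eta_k$ observations of it; the cost model is a simplification for illustration, not the algorithms' exact accounting:

```python
import math

def full_cost(N, T):
    """Full communication: every agent initiates a message at every step."""
    return N * T

def comex_cost(N, K, eta):
    """ComEx-style sketch: at most ~eta initiated messages per agent per
    arm, so O(N * K * eta) = O(log T) total for fixed N and K."""
    return N * K * math.ceil(eta)

N, K, T = 10, 5, 100_000
eta = 8 * 2.001 * math.log(T) / 0.5 ** 2  # eta_k with sigma^2=1, xi=1.001, gap 0.5
assert comex_cost(N, K, eta) < full_cost(N, T)
```

Even at this modest horizon the thresholded budget is orders of magnitude below $NT$, and the ratio widens as $T$ grows since one side is linear and the other logarithmic in $T$.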
Input: Arms $k \in [K]$, variance proxy upper bound $\sigma^2$, parameter $\xi$. Initialize: $N^{(i)}$. Let $N^{(i)}_k(t)$ be the estimated number of pulls from arm $k$ by agent $i$ up to time $t$.

In Thompson sampling, for each arm $k$ each agent $i$ maintains a posterior distribution $\phi^{(i)}_k$ and updates the distribution according to the available information. The agent then draws a sample from the posterior distribution of each arm and pulls the arm with the highest sample value.

Input: Arms $k \in [K]$, variance proxy upper bound $\sigma^2_k$, parameters $\xi$, $\gamma$. Initialize: $N^{(i)}$.

Input: Arms $k \in [K]$, variance proxy upper bound $\sigma^2_k$, parameters $\xi$, $\gamma$. Initialize: $N^{(i)}$.
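The Thompson sampling step described above (maintain a posterior $\phi^{(i)}_k$ per arm, sample from each, pull the argmax) can be sketched for a single agent with Gaussian rewards. This is a generic conjugate-Gaussian sketch, not the paper's cooperative variant; the prior $(0, 100)$ and unit noise variance are assumptions made for illustration:

```python
import random

def gaussian_posterior(prior_mean, prior_var, rewards, noise_var):
    """Conjugate Gaussian posterior over an arm's mean given its rewards."""
    n = len(rewards)
    if n == 0:
        return prior_mean, prior_var
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    post_mean = post_var * (prior_mean / prior_var + sum(rewards) / noise_var)
    return post_mean, post_var

def thompson_choice(history, rng, prior=(0.0, 100.0), noise_var=1.0):
    """Sample one value from each arm's posterior; pull the argmax arm."""
    samples = []
    for rewards in history:
        m, v = gaussian_posterior(*prior, rewards, noise_var)
        samples.append(rng.gauss(m, v ** 0.5))
    return max(range(len(samples)), key=lambda k: samples[k])

rng = random.Random(0)
# Arm 0 has clearly higher observed rewards, so once the posteriors
# concentrate the sampled values almost always favor arm 0.
history = [[1.0] * 50, [0.0] * 50]
picks = [thompson_choice(history, rng) for _ in range(20)]
assert picks.count(0) >= 19
```

In the cooperative setting, the observations a neighbor communicates would simply be appended to `history[k]` before the posterior update, which is the role played by the shared messages in the algorithms above.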