title: Dealing with Expert Bias in Collective Decision-Making
authors: Abels, Axel; Lenaerts, Tom; Trianni, Vito; Nowé, Ann
date: 2021-06-25

Many real-world problems can be formulated as decision-making problems wherein one must repeatedly make an appropriate choice from a set of alternatives. Expert judgements, whether human or artificial, can help in making correct decisions, especially when exploration of alternative solutions is costly. As expert opinions might diverge, the problem of finding the right alternative can be approached as a collective decision-making problem (CDM). Current state-of-the-art approaches to solve CDM are limited by the quality of the best expert in the group, and perform poorly if experts are not qualified or if they are overly biased, thus potentially derailing the decision-making process. In this paper, we propose a new algorithmic approach based on contextual multi-armed bandit problems (CMAB) to identify and counteract such biased expertise. We explore homogeneous, heterogeneous and polarized expert groups and show that this approach is able to effectively exploit the collective expertise, irrespective of whether the provided advice is directly conducive to good performance, outperforming state-of-the-art methods, especially when the quality of the provided expertise degrades. Our novel CMAB-inspired approach achieves a higher final performance and does so while converging more rapidly than previous adaptive algorithms, especially when heterogeneous expertise is readily available.

The recent emergence of COVID-19 [WHO et al., 2020] has presented diverse groups of decision-makers with substantially the same problem: what measures to put in place to reduce the loss of human life while limiting economic decline and ensuring mental wellbeing. While the single best sequence of decisions is unlikely to be identical for any two regions, governments have taken strikingly different approaches to handle the pandemic [e.g. the difference between Sweden and other EU countries, see Jonung, 2020]. Whether due to legislative restrictions on the available decisions, cultural differences, or past experiences, governments, influenced by the information and expertise available to them, have not been uniform in their response to the pandemic [Anderson et al., 2020]. It has become increasingly clear that experts' advice is not aligned, raising the question of which advice should be followed. This problem is especially apparent in the current pandemic, as previous knowledge on how to manage a pandemic cannot be reliably exploited due to the specificity of the virus and the variability of the socio-economic context. What followed instead was a series of regulation adjustments as governments re-evaluated previously taken measures in terms of their effectiveness and the limitations they imposed, weighing them in the light of demands coming from the electorate and different lobbies. Experts may display many biases, stemming for instance from different goals (e.g., a rapid economic recovery), different levels of knowledge about epidemics, different levels of exposure to a health crisis of this scale, or even different personalities. It is well known that, when biases are present, the effectiveness of deciding in a group is reduced, sometimes leading to major mistakes [Bang and Frith, 2017]. Hence, methods to identify and counteract possible biases are paramount.
In this paper, we take a computational stance to address the problem of taking appropriate decisions in a group of experts (hereafter referred to as collective decision-making, CDM), explicitly accounting for possible biases. Specifically, we propose an algorithmic solution that finds the best options from a set of alternatives, exploiting in the best possible way the advice from biased experts. We start by observing that any repeated decision problem, e.g., finding the best response to an expanding pandemic, can be formalized as a Contextual Multi-Armed Bandit [Contextual MAB or CMAB, Li et al., 2010]. In CMABs, the decision-maker repeatedly observes a context which characterizes the decision to be made and must select the option from the set of available alternatives (often referred to as arms) whose (a-priori unknown) value function maps the context to the highest expected outcome. In the case of COVID-19, as the demographics of the cases change, the context for which a decision must be made changes, and the most appropriate course of action might need to evolve, e.g., lock-down measures targeting different age-groups over time. Early stages of learning how to solve a CMAB problem are characterized by a high uncertainty about the outcomes of choosing one or the other alternative (i.e., which arm to select), as limited or no knowledge of the problem is available. CMAB algorithms therefore balance exploration, to reduce this uncertainty, and exploitation, to maximize the collected rewards. Because more complex problems require more data (thus, more exploration) to acquire an appropriate approximation of the outcomes of decisions, CMAB algorithms which learn this approximation from scratch are of limited use for complex problems wherein extensive exploration is unreasonable. For example, empirically evaluating the efficacy of containment measures on COVID-19 may lead to costs in lives or a collapse of the healthcare system. A sensible way to alleviate these costs is therefore to consult experts, who should be able to reduce the necessity to explore by providing advice on how to handle the problem. Concretely, experts, whether humans or AI algorithms, observe the problem's context and provide advice about the alternative solutions. In other words, each expert is a decision-maker that proposes a solution to the decision problem in the form of advice. This advice can then be exploited centrally in conjunction with other experts' advice to determine the best possible alternative, e.g., through some voting process or other decision-making approaches. While a simple majority vote can lead to improvements in performance under certain assumptions [Nicolas et al., 1785], adaptive methods have also been proposed to tackle this problem. Algorithms for bandits with expert advice [Auer et al., 2002] learn to weigh the advice of good experts more strongly. Previous work has approached this CDM problem with methods that are bounded by the performance of the best available expert (i.e., no solution is considered outside the set of solutions provided by the experts). In such cases, performance is measured in terms of regret with regard to the best expert [Auer et al., 2002, McMahan and Streeter, 2009, Zhou, 2015, Abels et al., 2020a]. Consequently, poor absolute performance is deemed acceptable as long as the performance relative to the best expert is reasonable.
Clearly, when systematic biases are present in the expert population, this approach can lead to very poor absolute performance, for which solutions need to be found. A CDM algorithm that suitably takes into account possible biases must aggregate the advice from multiple experts considering their level of expertise and potential biases to identify the best arm. In other words, a CDM problem is in itself a CMAB problem wherein the available advice from experts forms the contextual information. We refer to this approach towards CDM as meta-CMAB. Starting from a formalization of CDM as a CMAB problem, we investigate in this paper how the meta-CMAB approach can overcome systematic biases in expert advice. We will show that meta-CMAB is robust to such biases as long as expert advice is (positively or negatively) correlated with the actual outcome, in which case its performance often surpasses the single best expert. Furthermore, the meta-CMAB also takes advantage of the presence of under-performing experts, who can often still provide relevant contributions to the decision-making process. We contrast these results with the state-of-the-art approaches to deal with decision making with expert advice, namely EXP4 [Auer et al., 2002] and its derivatives, such as EXP4.P [Beygelzimer et al., 2010]. These algorithms tend to converge either towards the single best expert's policy, limiting the opportunities for collective intelligence, or towards non-adaptive aggregation techniques (e.g., plurality votes) which, predictably, degrade in performance as biases become more widespread. Before discussing the results of meta-CMAB and its relation to these other methods, we first present the necessary background knowledge on CMABs and bandits with expert advice, then we present how we model bias in the expertise, and finally, for varying bias models, we experimentally compare the performance of meta-CMAB to previous state-of-the-art algorithms and show meta-CMAB's capacity to profit even from the input of weak experts.

CDM [Aikenhead, 1985, Robson and Rew, 2010, Marshall et al., 2017] is concerned with combining the opinions of a group of agents (here, experts) to make appropriate decisions. More specifically, we are interested in using a group of experts to maximize the outcome of decisions made in a sequence of decision-making problem instances. In medical diagnostics for example, for each given patient (problem instance) a suitable diagnosis must be made (the single decision in the sequence) leading to the most appropriate treatment. While mistakes can always happen, a rational goal is to maximize the number of correct diagnoses and appropriate treatments administered. Similarly for a pandemic, taking into account the demographics of a population, the estimated basic reproduction number [$R_0$, Milligan and Barrett, 2015], the number of infections, and the current capacity for hospitalization, one might implement more or less extreme measures in the hope of containing the virus while minimizing economic regression. When assessing the situation for a different region, a decision-maker might come to a different decision. However, as a result of a lack of certainty about the potential outcomes, sub-optimal choices are often unavoidable, yet necessary to enhance the understanding of the consequences associated with different choices. This fundamental dilemma between the need for expanding understanding and the desire to make the best decisions is formalized by MABs, which we introduce in the following section.
Following this, we formalize CMABs, wherein the outcome of decisions is conditioned on an observable set of features: a context. We then present prior research on solving CDM problems formulated as bandits with expert advice, for which we present the meta-CMAB approach in Section 3.

Formally, a MAB is characterized by a set of $K$ arms identified by numbers $1, \dots, K$ and a value function $f$ that maps each arm to a reward distribution [Agrawal, 1995]. Hence, in the context of decision-making as we discussed in the introduction, each arm is an alternative we can choose, for example a potential treatment. The aim of an agent solving a MAB problem is typically to identify the best arm $k^* = \arg\max_{k \in \{1,\dots,K\}} \mathbb{E}[f(k)]$. Because $f$ is unknown, exploration is required to identify the best arm(s). State-of-the-art MAB algorithms balance exploration of uncertain arms and exploitation of likely optimal arms to minimize regret [Russo et al., 2017]. An agent's policy at time $t$, $\pi_t$, is a probability distribution over the arms according to which the agent chooses an arm $k_t \sim \pi_t$. While MABs are useful as models of fundamental repeated decision-making, they are limited in their applicability as they assume that arms are only characterized by their identifier. In real-world problems, decisions are typically made based on additional information about the problem or the arms. In medical diagnostics for example, it is necessary to take into account characteristics specific to the patient we consider, such as their medical record. The addition of such information transforms a MAB into a contextual MAB (or CMAB). Two alternative formulations are prevalent in the CMAB literature [Wang et al., 2005, Zhou, 2015, Bouneffouf and Rish, 2019]. Either contexts are disjoint, i.e., arm-specific (and time-dependent), and there is a shared value function mapping each arm's context to a value. Alternatively, CMABs can possess one time-dependent (but not arm-dependent) context vector $\vec{x}_t$, and arm-dependent functions mapping this shared context to arm-specific values (as will be the focus in this paper). While both formulations are more intuitive for different real-world problems, there is a straightforward reduction from the disjoint formulation to the shared-context formulation. Hence, we follow here the latter formalism for contextual bandits introduced by Wang et al. [2005] (also referred to as bandits with side observations), which associates to each individual arm a function which maps the shared context to an outcome. A more complete discussion of the disjoint formulation is given in Appendix C. In the shared-context formulation, each decision (or timestep) is characterized by a $d$-dimensional context vector $\vec{x}_t$, and each arm $k$ is characterized by a fixed but a-priori unknown mapping of this time-dependent context vector to a value, i.e., a function $f_k : \mathbb{R}^d \to [0, 1]$. In medical diagnostics for example, the context vector could be a set of features describing the concerned patient (e.g., their medical record), and each possible treatment is a mapping from this context to an outcome, e.g., whether the treatment was successful, or a QALY score [Whitehead and Ali, 2010]. An agent's policy $\pi : \mathbb{R}^d \to [0, 1]^K$ maps the context vector to a probability distribution according to which the agent chooses an arm.
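To make the shared-context formulation concrete, the following minimal Python sketch shows one possible interface for such a bandit. It is an illustrative sketch only: the class and function names are not from the paper, and the uniform contexts and Bernoulli rewards anticipate the experimental setting described later.

```python
import numpy as np

class SharedContextCMAB:
    """Minimal shared-context CMAB: one context per timestep, one value function per arm.

    Illustrative sketch; value_fns[k] plays the role of f_k : R^d -> [0, 1].
    """

    def __init__(self, value_fns, d, rng=None):
        self.value_fns = value_fns          # list of K callables mapping a context to [0, 1]
        self.d = d                          # context dimensionality
        self.rng = rng or np.random.default_rng()

    def sample_context(self):
        # Contexts are drawn uniformly from [0, 1]^d, as in the paper's experiments.
        return self.rng.uniform(0.0, 1.0, size=self.d)

    def pull(self, k, x):
        # Bernoulli reward whose success probability is the arm's expected value f_k(x).
        return float(self.rng.random() < self.value_fns[k](x))

# Example: a 2-armed bandit with hand-crafted value functions (purely illustrative).
bandit = SharedContextCMAB(value_fns=[lambda x: x.mean(), lambda x: 1.0 - x.mean()], d=2)
x_t = bandit.sample_context()
reward = bandit.pull(0, x_t)
```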
In CMABs, the regret over T rounds of an agent pulling an arm and observing the resulting reward r t at each timestep t is the sum of differences between the reward of the pulled arm and the value of the best arm at each timestep t, i.e., max The aim of any agent is to minimize the regret R π T , or, equivalently, maximize the sum of collected rewards. When approximating f k from scratch is impractical, experts can provide the advice required to select the appropriate arms. Identifying the best way to use this advice can however present another challenge, as all experts might not be equal in their expertise. Both static aggregation techniques such as plurality votes [Grofman et al., 1983] , or adaptive methods which for example learn to identify the best performing experts [Auer et al., 2002 , Beygelzimer et al., 2010 have been proposed to address this. In the following section we formalize this problem of bandits with expert advice and present different algorithmic approaches to tackle it. The problem of bandits with expert advice [Auer et al., 2002] formalises a CDM process wherein at each timestep t a set of N experts observe the context of a CMAB and provide the centralized learner with advice based on their prior knowledge about the problem. This knowledge for each expert n is captured by an approximation of f , i.e.f n , and a policy function π n which maps the arms to a probability distribution. Both functions map the context to a vector (of either values or probabilities), of which we will denote the k-th value by the k subscript, e.g.,f n k ( #» x t ) (resp., π n k ( #» x t )) is expert n's estimated value (resp., probability) for the k-th arm given the context vector #» x t at timestep t. In other words, the expert's advice is a solution of the CMAB from the expert's perspective, on the basis of the expertise encoded intof n and π n . In addition to this advice, each expert n can provide her confidence estimate for each advice c n k ( #» x t ), expressing how accurate she estimates her advice about arm k is given the context #» x t . We make no assumption on how experts acquire their expertise. In case of human experts, it can for example be the result of past experiences; for an AI algorithm, it can be acquired by training on an existing (possibly biased) data set. In any case, the aim of the centralized learner in this setting is to learn a policy that maps expert advice to probabilities over the arms. Auer et al. [2002] assume advice to be a probability distribution: Alternatively, in [Abels et al., 2020a] , an expert's value estimate is considered as a potentially more informative advice: ξ n k,t ←f n k ( #» x t ) We define ξ ξ ξ t as the advice matrix at time t (and thus resulting from the context vector #» x t ), wherein row n consists of expert n's advice vector. An algorithm for bandits with expert advice, i.e., a CDM algorithm, selects arms based on this advice matrix and the subjective confidence estimation. Specifically, a CDM algorithm maintains a time-dependent policy π t which maps expert advice to a probability distribution over the arms. By incorporating experiences (i.e., chosen arms, the advice that led to each chosen arm, and the observed rewards), the CDM algorithm should improve its policy while, again, balancing exploration and exploitation with the ultimate goal of maximizing the cumulative reward, or equivalently, minimizing regret. Algorithm 1 outlines the problem of bandits with expert advice. 
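As a rough illustration of this protocol (Algorithm 1), the sketch below shows how the advice matrix and confidence estimates could be assembled and passed to a CDM learner at each timestep. The `Expert` and learner interfaces are hypothetical, introduced only for illustration; they are not the paper's implementation.

```python
import numpy as np

class Expert:
    """Hypothetical expert wrapping a value approximation f_hat and a confidence model."""

    def __init__(self, f_hat, confidence):
        self.f_hat = f_hat              # callable: context -> length-K vector of value estimates
        self.confidence = confidence    # callable: context -> length-K vector of confidences in [0, 1]

    def advise(self, x):
        # Value advice (xi), as in Abels et al. [2020a]; probability advice would use a policy instead.
        return self.f_hat(x), self.confidence(x)

def run_cdm(bandit, experts, learner, T_cdm):
    """One run of bandits with expert advice: experts see the context, the learner only sees advice."""
    total_reward = 0.0
    for t in range(T_cdm):
        x_t = bandit.sample_context()
        adv_conf = [e.advise(x_t) for e in experts]
        advice = np.array([a for a, _ in adv_conf])       # N x K advice matrix xi_t
        confidence = np.array([c for _, c in adv_conf])   # N x K confidence matrix c_t
        k_t = learner.select_arm(advice, confidence)      # arm chosen from advice alone
        r_t = bandit.pull(k_t, x_t)
        learner.update(advice, confidence, k_t, r_t)      # improve the aggregation policy
        total_reward += r_t
    return total_reward
```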
An important observation here is that only the experts need to observe the context; the CDM algorithm itself does not explicitly take the context into account to make its decision. As a result, it is also applicable to problems for which the contexts can be fully observed by human experts but are hard to meaningfully capture into datapoints to train a CMAB solver. The performance in this setting has traditionally been measured in terms of regret with respect to the single best expert; in other words, algorithms compete with the single best expert in hindsight. Let $r^n_t$ be the reward collected by exclusively following expert $n$'s advice at timestep $t$, and $r^{CDM}_t$ be the reward collected by the CDM algorithm at timestep $t$; the regret is then

$$R_T = \max_{n} \sum_{t=1}^{T} r^n_t - \sum_{t=1}^{T} r^{CDM}_t.$$

This regret expresses the traditional goal of performing as well as or better than the single best expert, but does not give us a clear indication of performance on the underlying CMAB. That performance is better captured by the cumulative reward $\sum_{t=1}^{T} r^{CDM}_t$. We can further scale this cumulative reward such that the minimal cumulative reward is 0 and the maximal cumulative reward is 1. This scaled cumulative reward gives us a more consistent reference point for the worst and best possible performance of our algorithms. For this reason we will focus our results on this scaled cumulative reward instead. Note that maximizing the (scaled) cumulative reward is equivalent to minimizing the regret, so this choice has no impact on our results. Both static and adaptive aggregation methods have been proposed to tackle this problem; we discuss both approaches in the following sections.

Confidence Weighted Averages. A common approach in collective decision-making is to collect the experts' advice and to choose the arm with the most support, i.e., a plurality vote. If knowledge about the experts' expected performance is available (e.g., a confidence estimate), votes can be weighted as a function of this performance estimate, resulting in a weighted majority vote [WMV, see Marshall et al., 2017], which weighs each expert's advice by its confidence to obtain a value for each arm and then greedily selects the arm with the highest value:

$$k_t = \arg\max_{k} \sum_{n=1}^{N} \log\!\left(\frac{c^n_{k,t}}{1 - c^n_{k,t}}\right) \xi^n_{k,t}. \quad (4)$$

The logarithm in the above formula enforces negative weights for experts performing worse than an expert providing random advice ($c^n_{k,t} < 0.5$). When confidence estimates are inaccurate, the performance of a WMV can quickly degrade [Abels et al., 2020a]. It can therefore be desirable to independently approximate performance estimates when confidence estimates are unavailable or inaccurate. In the following sections we present methods from the literature which attempt to learn appropriate weights for the experts.

Algorithm 1: Bandits with expert advice
Require: $N$ experts, a contextual bandit with reward function $f_k : \mathbb{R}^d \to [0, 1]$ for each of the $K$ arms, a learner with initial policy $\pi_1$
1: Each expert $n$ has acquired $T_{tr}$ prior experiences on which they base their advice.
2: for $t = 1, 2, \dots, T_{cdm}$ do
3:   Experts observe the context $\vec{x}_t$
4:   Get expert advice and confidence vectors
5:   Pull arm $k_t \sim \pi_t(\boldsymbol{\xi}_t, \mathbf{c}_t)$ and collect the resulting reward $r^{CDM}_t$
6:   The learner updates its policy based on the received advice, the pulled arm and the observed reward

Meta-MAB. It is straightforward to draw a parallel between the previously introduced MAB problem and the problem of bandits with expert advice, as first discussed by Auer et al. [2002]. In essence, given $N$ experts, each expert can be treated as an arm in an $N$-armed meta-bandit.
Selecting an arm means we only follow the corresponding expert's advice. Concretely, if at time $t$ arm $n$ of the meta-bandit is pulled, the policy of expert $n$ is followed to pull an arm in the original CMAB. The arm's estimated value is then updated based on the reward observed by following its advice. As expert advice is never combined, this approach's performance is bounded by the performance of the best expert. By adding an additional uniform random expert to the expert set we ensure that, if all other experts are worse than random, we still converge towards a random policy. What is more, at each timestep only the estimated value of one expert is updated. These limitations motivated the introduction of exponential weights methods, discussed in the following section. Appendix B provides a full description of the meta-MAB algorithm.

Exponentially Weighted Averages. The exponential weighting method EXP4.P [Beygelzimer et al., 2010] replaces the logarithmic term in Equation 4 by a trained weight $w^n_t$, which iteratively increases the importance of experts whose advice is most correlated with the outcomes. Note that EXP4.P includes an additional uniform random expert. Hence, if all experts are worse than random, EXP4.P competes with the performance of a random policy. In EXP4.P+CON [Abels et al., 2020a], confidence estimates are used as a prior on these weights. As a result, EXP4.P+CON can exploit accurate confidence estimates but is less sensitive to inaccurate confidence estimates [Abels et al., 2020a]. We provide a full description of how we integrate weights into the EXP4.P algorithm in Appendix A.

3 Meta-CMAB: meta-learning for Bandits with Expert Advice

The previously discussed methods build on the proposition that the best way of combining expert advice is a weighted average. In contrast, we hypothesized that there are patterns in the experts' advice which cannot be appropriately exploited by such methods [Abels et al., 2020b]. In the meta-CMAB, we make the less restrictive assumption that there exists some function $E$ which maps the experts' advice and confidence for a given context to an expected reward. In other words, given a context $\vec{x}_{k,t}$ (i.e., a context for a given arm $k$ at time $t$) and a set of $N$ experts wherein each expert $n$ provides advice $\xi^n_{k,t}$ and confidence $c^n_{k,t}$ about that context, we assume the value function $f$ can be approximated by

$$f(\vec{x}_{k,t}) \approx E\left(\xi^1_{k,t}, \dots, \xi^N_{k,t},\ c^1_{k,t}, \dots, c^N_{k,t}\right).$$

If such an $E$ exists, we can construct a secondary CMAB, the meta-CMAB, with arm context $\vec{y}_{k,t} = (\xi^1_{k,t}, \dots, \xi^N_{k,t}, c^1_{k,t}, \dots, c^N_{k,t})$ for each arm $k$, and shared value function $E$. Minimizing regret in this meta-CMAB is equivalent to optimizing the CDM process (i.e., minimizing regret in the bandits with expert advice setting). This reduction makes it possible to select one of the many CMAB algorithms available in the literature to solve the meta-CMAB, and consequently the original problem. Just as in standard CMABs, the choice of the CMAB algorithm depends on the assumptions made about $E$.

Algorithm 2: meta-CMAB for bandits with expert advice
Require: $N$ experts, a contextual bandit with reward function $f_k : \mathbb{R}^d \to [0, 1]$ for each of the $K$ arms
1: Each expert $n$ has acquired $T_{tr}$ prior experiences on which they base their advice.
2: Initialize a CMAB algorithm (e.g., LinUCB) with initial policy $\pi_1$
3: for $t = 1, 2, \dots, T_{cdm}$ do
4:   Experts observe the context $\vec{x}_t$
5:   Get expert advice and confidence vectors
6:   Build the meta-context $\vec{y}_{k,t}$ for each arm $k$
7:   Pull arm $k_t \sim \pi_t(\{\vec{y}_{1,t}, \dots, \vec{y}_{K,t}\})$ and collect the resulting reward $r^{CDM}_t$
8:   Update the policy according to the selected arm's meta-context and the observed reward

In this paper we use LinUCB to solve the meta-CMAB, as it provides a relatively easy to interpret linear relation between expert advice and expected reward. More specifically, we assume that $E$ is linear, i.e., there exists some weight vector $\vec{\theta}$ such that $E(\vec{y}_{k,t}) = \vec{\theta}^{\top} \vec{y}_{k,t}$.

Because of the lack of publicly available data sets which would allow us to extensively experiment with biases, number of experts, and number of arms, we adopt a synthetic experimental setting. In this synthetic CDM setting, CDM algorithms (concretizing the outline provided in Algorithm 1) must solve a collective Perlin bandit (see Section 4.1) with the help of advice provided by experts that are trained on an expert-dependent biased bandit (see Section 4.2 below). Each expert is first trained for $T_{tr} = K \cdot 100$ steps on its prior bandit using KernelUCB [Valko et al., 2013]. Note that while the resulting performance of the experts determines the performance of the CDM algorithms, the exact training procedure is of little importance. Hence, another CMAB algorithm such as GP-UCB [Srinivas et al., 2009], or simply fitting a regression to as many sampled experiences, would result in similar experts in terms of performance, and hence in similar performances for the CDM algorithms. Once partially trained, the experts will provide advice for deciding which arm to pull on the collective bandit, and this for $T_{cdm} = 1000$ steps. Cumulative performance is measured, hence CDM algorithms must rapidly learn to use the (possibly biased) advice provided by experts who were trained on their (possibly biased) prior bandits. We explore varying expert configurations detailed in Section 4.3.

We evaluate the performance of the previously introduced CDM algorithms on Perlin bandits, contextual bandits whose value function is a Perlin noise function, a type of noise commonly used to generate natural-looking patterns [Perlin, 1985]. Figure 1 illustrates the steps involved in the generation of Perlin noise for a 2-dimensional space (i.e., $d = 2$). First (Figure 1.a), a grid of unit vectors is randomly sampled (we will denote the set of all such grids as $\mathcal{V}$). Then, for each point, the dot product between its $2^d$ closest vectors and its offset to those vectors is computed (Figure 1.b illustrates the dot product to the nearest vector). Finally, these $2^d$ dot products are interpolated for each point, resulting in a smooth landscape as in Figure 1.c. The value for a point in this landscape corresponds to the expected reward in the corresponding contextual bandit. Let $V \in \mathcal{V}$ be a grid of vectors sampled for the Perlin procedure; then the value of any given point $\vec{x}$ is denoted $P_V(\vec{x})$, with $P_V : [0, 1]^d \to [0, 1]$ a function which applies the steps of the Perlin procedure subsequent to the sampling of a vector grid and maps the point to its value in the resulting landscape. Sampling a new Perlin bandit (i.e., a contextual bandit whose value functions are Perlin functions) simply implies sampling $K$ new vector grids. Thus, a Perlin bandit $b$'s value function $f^b_k$ for arm $k$ is in practice a function of the context and that bandit's vector grid $V^b_k$ for arm $k$, i.e., $f^b_k(\vec{x}) = P_{V^b_k}(\vec{x})$. The context is always sampled uniformly at random from the context space $[0, 1]^d$.
The reward for pulling arm $k$ with context $\vec{x}$ is the result of a Bernoulli trial wherein the probability of success is $f^b_k(\vec{x})$. Within the context of this work, we sample for each arm a grid of $5 \times 5$ vectors on the unit square (see for example Figure 2.a). The resulting landscape contains multiple highs and lows with gradual transitions from high regions to lower regions. These types of landscapes, combined with Bernoulli rewards, present a challenging problem which typically requires exhaustive exploration for optimal performance. Given the characterization of a Perlin bandit by its vector grids, it is natural to implement biases as the result of rotations of these vectors. By applying such rotations we obtain landscapes that diverge from the original landscape (for example the modified bandit of Figure 2.b). Concretely, random noise can be introduced by rotating each vector by a random angle sampled from a wrapped Cauchy distribution [a Cauchy distribution wrapped around the unit circle, Kent and Tyler, 1988] with mean 0 and scale $\rho$. When $\rho \to 1$ we obtain an identical bandit; when $\rho \to 0$, we obtain a random (and on average uncorrelated) bandit. It is straightforward to see that inverting these vectors results in a bandit that maximally differs from the original one, as in Figure 2.c. Note that an increase in distance from the original bandit will decrease the correlation [as measured by the Pearson Correlation Coefficient (PCC), see Manders et al., 1992]. Figure 2.d confirms this and further shows that the correlation is inversely proportional to the distance. By training an expert on a secondary Perlin bandit at a distance from the true bandit we can model biases in expertise, as detailed in the following section.

Whether it is the result of inaccurate past experiences, cognitive biases [e.g., outcome bias, see Baron and Hershey, 1988], or simply malicious intent [which in essence entails that the expert's values are not aligned with those of the collective, possibly as a result of reactance, Steindl et al., 2015], biases are likely to be introduced along the knowledge acquisition process [O'Sullivan and Schofield, 2018, Dror et al., 2018] and manifest themselves in the expert's advice. The presence of systematic biases poses an understudied challenge to bandits with expert advice. Existing research on bandits with expert advice is focused on minimizing regret with regard to the best expert. Naturally this implies that when all experts perform badly, a poor collective performance is acceptable. In contrast, we assert that even if experts perform poorly, if there are systematic biases in their advice, these biases should be identified and counteracted to benefit overall performance, beyond the single best expert. We measure such biases in terms of the distance between the expected rewards of the true bandit (i.e., the bandit solved by the CDM algorithm) and of an expert's biased bandit. Suppose we have a true bandit $b^*$ defined by a value function $f^*_k$ for each arm $k$ originating from a set $\mathcal{F}$ of value functions with a shared domain $\mathcal{X}$, for example the set of Perlin bandits (presented in Section 4.1). Then, given a secondary (biased) bandit $b$ with value function $f_k \in \mathcal{F}$ for each arm $k$, and a random context $X \sim \mathcal{X}$, we define the distance $d(f^*, f)$ between these bandits in terms of the expected difference between $f^*_k(X)$ and $f_k(X)$ across the arms, and the scaled distance $\hat{d}(f^*, f)$ as its normalization, which ensures that the scaled distance of the furthest bandit within the considered set is 1.
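The sketch below illustrates how a biased variant of a Perlin-style bandit could be generated by rotating its gradient vectors with wrapped-Cauchy noise, and how the divergence from the original could be estimated empirically via the Pearson correlation. The helper names and the use of scipy's `wrapcauchy` are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np
from scipy.stats import wrapcauchy

def rotate(vectors, angles):
    """Rotate each 2-D unit vector in `vectors` by the corresponding angle (radians)."""
    cos, sin = np.cos(angles), np.sin(angles)
    x, y = vectors[..., 0], vectors[..., 1]
    return np.stack([cos * x - sin * y, sin * x + cos * y], axis=-1)

def biased_grid(grid, rho, rng):
    """Perturb a grid of unit gradient vectors with wrapped-Cauchy rotations.

    rho -> 1 keeps the grid (identical bandit); rho -> 0 yields an (on average) uncorrelated one.
    """
    angles = wrapcauchy.rvs(c=rho, size=grid.shape[:-1], random_state=rng)
    return rotate(grid, angles)

def empirical_correlation(f_true, f_biased, d, n_samples=5000, rng=None):
    """Estimate the Pearson correlation between two value functions on random contexts."""
    rng = rng or np.random.default_rng()
    xs = rng.uniform(0.0, 1.0, size=(n_samples, d))
    a = np.array([f_true(x) for x in xs])
    b = np.array([f_biased(x) for x in xs])
    return np.corrcoef(a, b)[0, 1]

# Example: perturb a 5x5 grid of random unit vectors (as used for the Perlin landscapes).
rng = np.random.default_rng(0)
base_angles = rng.uniform(0.0, 2 * np.pi, size=(5, 5))
grid = np.stack([np.cos(base_angles), np.sin(base_angles)], axis=-1)
noisy = biased_grid(grid, rho=0.8, rng=rng)
```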
Complementary to this definition of distance, we define three alternative ways in which experts are distributed in terms of their distance: homogeneously, heterogeneously, or polarized. These expert configurations are elaborated in the following section.

We consider three different configurations representative of different real-world scenarios. First, situations wherein experts are globally homogeneous in their expertise, i.e., they are all approximately equally close to or far from the truth (Figure 3.a). This configuration can capture for example trained professionals whom one might expect to be relatively similar in terms of expertise. In contrast, heterogeneous experts can vary significantly in their expertise (Figure 3.b). Applications relying on crowdsourcing are prime examples of such heterogeneity. Finally, polarized experts are especially representative of polarized collectives, wherein contradicting clusters of similar experts occur (Figure 3.c). Such clusters can for example occur in policy-making, along the left-right axis [Poole and Rosenthal, 1984]. Let us now briefly describe how experts under these different configurations are sampled for this paper.

Homogeneous experts. We consider experts to be homogeneous when all experts' prior bandits are approximately at the same distance $\Delta$ from the true bandit. In other words, given a true bandit with value function $f$, the $n$-th expert's prior bandit with value function $f^n$ is at a distance $\hat{d}(f, f^n) \approx \Delta$. This configuration is illustrated in Figure 3.a, wherein, for a given distance $\Delta$, all experts are close together and within a small distance of $\Delta$.

Heterogeneous experts. Heterogeneous experts are uniformly spread across a window of size $W_\Delta = \min(\Delta, 1-\Delta)$ around the target distance $\Delta$. In other words, experts are found uniformly in the interval $[\Delta - W_\Delta, \Delta + W_\Delta]$. For example, if $\Delta = \frac{1}{6}$, experts will be spread out in the $[0, \frac{1}{3}]$ interval, as illustrated in Figure 3.b.

Polarized experts. Instead of being spread uniformly across the range, experts are divided into two clusters located at the endpoints of the window of size $W_\Delta = \min(\Delta, 1-\Delta)$ around $\Delta$. In other words, experts are split into two evenly sized clusters at distances approximately $\Delta - W_\Delta$ and $\Delta + W_\Delta$. For example, if $\Delta = \frac{1}{6}$, the expert clusters will be located at 0 and $\frac{1}{3}$. Figure 3.c illustrates the clusters of experts that occur for a given distance.

Having defined distances and established the different expert configurations we will consider, and having presented above the type of contextual bandit we evaluate our algorithms on as well as how biases and the resulting distances are implemented, we now turn to the evaluation. We evaluate performance in terms of scaled cumulative reward (see Section 2.3). In addition to the performance of different algorithms, we also plot the performance of the best and worst expert. For the sake of clarity we restrict ourselves here to value advice (see Section 2.3), as the results with probability advice are similar. As they share a lot of similarities with the heterogeneous configuration, the plots for the polarized configuration were moved to Appendix D (Figures A.1 to A.3). Finally, the results below are those when experts provide no confidence estimates; results including hindsight confidence are provided in Appendix E. Those results show that the WMV competes with meta-CMAB when accurate confidence estimates are available, but as confidence estimates become less accurate, the performance of the WMV quickly degrades.
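As a concrete illustration of the meta-CMAB approach evaluated in the experiments below, the following minimal sketch shows a LinUCB-style learner operating on meta-contexts built from advice and confidence values. It is a simplified sketch under the linearity assumption of Section 3; the class and parameter names are illustrative and this is not the authors' exact implementation.

```python
import numpy as np

class LinUCBMetaCMAB:
    """LinUCB over meta-contexts y_{k,t} = (advice for arm k from all experts, confidences for arm k)."""

    def __init__(self, dim, alpha=1.0):
        self.alpha = alpha                  # exploration strength
        self.A = np.eye(dim)                # regularised design matrix
        self.b = np.zeros(dim)              # reward-weighted feature sums

    def _meta_contexts(self, advice, confidence):
        # advice, confidence: N x K arrays -> one meta-context per arm (K x 2N).
        return np.concatenate([advice, confidence], axis=0).T

    def select_arm(self, advice, confidence):
        Y = self._meta_contexts(advice, confidence)
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        # Upper confidence bound per arm: linear estimate plus exploration bonus.
        ucb = Y @ theta + self.alpha * np.sqrt(np.einsum('kd,de,ke->k', Y, A_inv, Y))
        return int(np.argmax(ucb))

    def update(self, advice, confidence, k, reward):
        y = self._meta_contexts(advice, confidence)[k]
        self.A += np.outer(y, y)
        self.b += reward * y

# With N experts and K arms, the meta-context dimension is 2N (advice + confidence per expert);
# such a learner plugs into the hypothetical run_cdm(...) loop sketched earlier.
```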
By increasing the distance between experts' prior bandits and the collective bandit (i.e., $\Delta$), we simulate the presence of biases in the expertise with the aim of establishing how the different CDM algorithms are influenced by different bias levels. As we laid out in Section 4.3, different expert configurations correspond to different distributions of biases in the experts. Figure 4 shows how these biases affect the scaled cumulative reward for the considered algorithms when the set of experts is homogeneous in its expertise. We observe a decline in performance of both the state-of-the-art algorithms and meta-CMAB as the difference between the experts' expertise and the true bandit increases. Interestingly, as the difference goes beyond 0.5, meta-CMAB appears to recapture its prior performance. To explain this, we observe that the average distance between two uncorrelated Perlin bandits is 0.5 (cf. Figure 2); hence it is at this point that the prior bandits are least correlated with the collective bandit. Beyond 0.5, the prior bandits are negatively correlated with the collective bandit. This negative correlation can be identified by meta-CMAB, whose performance increases as the absolute value of the correlation increases. meta-CMAB (in this case using LinUCB for reasons of clarity) learns a relation between advice and expected outcome. When prior bandits are weakly correlated with the collective bandit this relation is harder to establish. In contrast, when the prior bandits are strongly negatively correlated, the inverse relation between advice and actual outcome is identified and exploited by meta-CMAB. The left-hand plot in Figure 5 supports the hypothesis that there is a correlation between an expert's expected performance and the expert's weight in the linear model learned through LinUCB. Concretely this implies that the advice of experts with a negative correlation is inverted, and that the experts with a higher absolute correlation have a higher importance. In contrast, EXP4.P always weighs experts positively, and as a result is unable to counteract widespread negative correlation. The right-hand plot in Figure 5 illustrates this property, as negatively correlated experts are assigned a weight close to 0. In other words, EXP4.P implicitly identifies negatively correlated experts but ignores their advice. Of note is that the inclusion of a synthetic random expert (i.e., an expert providing random advice) acts as a safety net for both EXP4.P and meta-MAB. As such a random expert is expected to perform better than the other experts when the distance grows beyond 0.5, it can allow these two algorithms to implicitly act randomly when no provided expert is better than random. Finally, we observe a small increase in performance of EXP4.P as $\Delta$ approaches 1. Experts at such large distances are so far from the truth that they can be identified with a limited amount of exploration, allowing EXP4.P to quickly converge towards the random policy.

In contrast with the homogeneous configuration, the heterogeneous configuration offers the CDM algorithms a wider variety of experts in terms of their prior bandit's distance to the collective bandit. In practice this results in the availability of better (and also worse) experts than for the homogeneous configuration. Learning algorithms (EXP4.P, meta-MAB and meta-CMAB) are able to identify and use these experts in their decision making, which results in improved performance, displayed in Figure 6.
In contrast, the WMV does not distinguish between better and worse experts; instead, its performance tends to be dictated by the average performance of the experts. Hence, whether the experts are closely grouped around $\Delta$ (as in the homogeneous configuration) or spread out more widely (as in the heterogeneous configuration), the average distance of experts is approximately $\Delta$ for both configurations. As a result, the performance of the WMV does not change significantly for the heterogeneous configuration when compared to the homogeneous configuration. In contrast, the learning algorithms (EXP4.P, meta-MAB, and meta-CMAB) are able to exploit the better experts to varying degrees, resulting in improved performance. This improvement is strongest for $\Delta = 0.5$. For this distance, the experts are spread out across the whole range of distances (see Figure 3.b). As a result, optimal experts are present, resulting in strongly improved performance. Furthermore, meta-CMAB benefits not only from these optimal experts at distance 0, but also from the pessimal experts at distance 1, as both are strongly (positively or negatively) correlated with the collective bandit. This is in contrast to the homogeneous configuration, wherein a distance of 0.5 induced only weakly correlated experts, which hurt the performance of meta-CMAB. While the distribution of experts in the polarized configuration differs significantly from the heterogeneous configuration, we find that the changes in performance between these two configurations are limited (as the similarity between Figure 6 and Figure A.1 shows). The observations we made for the heterogeneous configuration therefore also hold for polarized experts, with only a slight improvement in performance noticeable for the EXP4.P and meta-MAB algorithms. The big gap in performance between the cluster of good experts and the cluster of bad experts (as opposed to the gradual decrease in performance of experts present in the heterogeneous configuration) allows for a quicker convergence towards these experts.

Effect of changes in the number of arms and experts. Because contexts are drawn uniformly at random, any additional arms are as likely to provide high or low rewards. Hence, increasing the number of arms gives better-than-random experts a higher likelihood of selecting an arm with a high reward. Conversely, worse-than-random experts will tend to collect lower rewards when the number of arms is increased. As a result, adaptive algorithms benefit from the increase in performance of the better experts. In contrast, the performance of the WMV is pushed further up (if experts are expected to be better than random) or down (if experts are expected to be worse than random), as the majority will favour more (if better-than-random) or less (if worse-than-random) rewarding arms. Similarly, increasing the number of experts is likely to produce both better and worse experts. While an adaptive algorithm can be expected to benefit from the increase in better experts, the flip-side is the requirement for additional exploration to identify the good experts in a larger expert set. In our experiments we found that an increase in the number of experts can be exploited by EXP4.P and meta-MAB provided the experts are expected to be better than random (i.e., $\Delta < 0.5$).
When this is not the case, it is harder for these algorithms to converge towards the additional random expert, resulting in a decrease in performance relative to the 4-experts scenario (as shown by the differences in performance between Figure 4.A and Figure 4.C, for example). In contrast, an increased number of experts offers meta-CMAB more features (i.e., advice values forming the meta-contexts) which it can map to outcomes to achieve more accurate predictions, which results in improved performance overall (see for example the improvements of meta-CMAB from the 4 experts in Figure 4.A to the 32 experts in Figure 4.C).

We hypothesize that even weak experts provide information on the expected outcome which can be identified and exploited. To demonstrate this we run the same experiments as previously, but only with the top 50% of experts. Concretely, we sample $N$ experts and identify the best $N/2$ experts in hindsight, which we then provide to the different CDM algorithms. Condorcet's jury theorem [Nicolas et al., 1785] suggests that if all experts are better than random, increasing the number of experts should improve the performance of a majority vote. Similarly, when the distance is lower than 0.5, ablating 50% of the experts results in a reduction of the number of better-than-random experts, which in turn decreases the performance of the WMV, as illustrated in Figure 7 (the performance of WMV is higher than WMV 50% when $\Delta < 0.5$). Conversely, when experts are worse than random, filtering out the 50% worst experts improves the majority vote. In contrast to this, we confirm the hypothesis that performance worsens regardless of the distribution of experts when we reduce the number of experts available to meta-CMAB, as even weak experts can provide identifiable patterns from advice to expected outcomes. EXP4.P slightly benefits in both cases as it tends to converge towards a single expert which should, ideally, be in the top 50%. Hence, restricting the expert set to this top half facilitates convergence. A similar conclusion can be drawn for meta-MAB, as fewer (bad) experts leads to a faster identification of the single best expert. Figure 8 shows that these observations also hold for the heterogeneous configuration. The gap in performance is strongest for meta-CMAB at $\Delta = 4/6$, as for this $\Delta$ the bottom 50% of experts are more negatively correlated with the collective bandit. As a result, ablating these highly (negatively) correlated experts reduces meta-CMAB's predictive power, resulting in a decrease in performance. Finally, these observations also hold for the polarized configuration (see Figure A.2 in the Appendix).

5.3 How does the expected reward increase as a function of time?

Figure 9 gives a more detailed picture of the expected reward as time progresses for a specific distance (i.e., $\Delta = 0.5$) in the homogeneous configuration. In these plots the average reward per timestep is given instead of the cumulative reward. In other words, if we were to stop training at step $t$, the corresponding average reward indicates the expected performance if no more training is done. This allows us to estimate the viability of the different algorithms when less training time is available. We find that meta-CMAB quickly outperforms the alternative algorithms and is close to its final performance after 200 timesteps. Note that this distance of $\Delta = 0.5$ is the worst-case scenario for meta-CMAB, as expert advice tends to be uncorrelated. For all other distances meta-CMAB converges more quickly.
This distance seems to be equally hard for EXP4.P and meta-MAB, as they only slightly improve on the performance of a random policy. The performance of the WMV tends to follow the average performance of experts, which is close to 0.5 when $\Delta = 0.5$. The main take-away for the homogeneous configuration becomes even more apparent when considering the heterogeneous case (Figure 10): meta-CMAB quickly surpasses the alternative algorithms, converging rapidly to its final performance thanks to the presence of highly correlated experts in the heterogeneous configuration. Interestingly, these results also demonstrate the quicker convergence of EXP4.P when compared to meta-MAB. As meta-MAB only updates a single expert per timestep, its convergence is slower than that of EXP4.P, which updates its beliefs about all experts at each timestep. Similar conclusions can be drawn for the polarized configuration (Figure A.3 in the Appendix).

In this paper we explored the presence of systematic biases in expert advice and how they affect the performance of several CDM algorithms under different expert configurations. In particular, we found that our proposed meta-CMAB approach can implicitly identify such biases and counteract them, provided there is a correlation (whether positive or negative) between the experts' advice and the outcomes. In contrast, previous approaches to bandits with expert advice, such as EXP4.P and meta-MAB, are constrained by the performance of the single best expert, and as a result these algorithms rapidly degrade in performance when biases are present. We found that these conclusions also hold for the alternative CMAB formulation, as the results presented in the appendix show. Our exploration of different expert configurations demonstrates the potential of meta-CMAB on a wide variety of collective decision-making tasks, provided objective feedback from decisions (i.e., a reward) is available. Our first results showed how a group of homogeneous experts, with relatively similar levels of expertise, can benefit from the meta-CMAB algorithm. In particular, meta-CMAB is able to exploit the collective, whereas EXP4.P and meta-MAB tend to converge towards the exploitation of a single expert. meta-CMAB thus truly reaps the benefits of consulting multiple experts, as it finds solutions better than what can be provided by the best expert in the group. Of further interest is the observation that meta-CMAB is able to beneficially exploit advice provided by weak experts, whereas other algorithms are hurt by the presence of these weak experts. This opens the way for applications wherein imperfect advice is prevalent, for example through crowd-sourcing, wherein a heterogeneous configuration of experts is likely. In such scenarios we found that meta-CMAB assigns an importance to experts which is directly correlated with their expected performance. Similarly, we also demonstrated that meta-CMAB achieves high performance when experts are polarized (i.e., when two groups of experts with opposing advice occur), which could for example be beneficial in policy-making, wherein polarization is a recurring detrimental phenomenon. Finally, we showed that meta-CMAB not only outperforms other algorithms over time but that it does so by quickly achieving high performance, especially when heterogeneous expertise is available. Consequently, meta-CMAB is highly useful for CDM applications wherein only a small number of testing steps can be executed before actual deployment.
It would be worthwhile to now explore how meta-CMAB could be used in existing applications which make use of WMV or EXP4(.P). Lopes et al. [2013], for example, leverage EXP4 to identify appropriate activities in an intelligent tutoring system. Similarly, the use of WMV has been explored in medical diagnostics: Kämmer et al. [2017], for example, enhance diagnostics by computing a (weighted) average of human experts for a simulated diagnostic problem. Similar research has found that combining the opinion of a group of radiologists can improve mammography screening beyond the single best expert [Wolf et al., 2015]. The (weighted) averages in these approaches could be replaced by meta-CMAB to guard against biased expertise. In conclusion, biases are not exclusive to human experts; as machine learning efforts in medical diagnostics have shown, deep learning models can struggle to generalize from one hospital to another [Zech et al., 2018]. In such scenarios too, meta-CMAB may provide relief by finding an appropriate mapping from models (i.e., experts) trained on other hospitals to the expected outcomes for the target hospital. As the COVID-19 pandemic has highlighted, conflicting expertise is inevitable, and, in the absence of prior knowledge to filter expertise, our meta-CMAB method offers a promising approach to reconcile and exploit the knowledge of biased as well as conflicting experts.

We modify the EXP4.P algorithm to use experts' confidence estimates as priors on their weights. More specifically, we add the confidence estimates to the learned weights when computing the weighted average. Hence, instead of weighing each expert $n$'s advice for arm $k$ by $\exp(w^n_t)$ alone, a confidence-based component is added to the learned weight, wherein $\frac{\gamma}{2}$ scales this added component by the learning rate, and $M$ controls the strength of these confidence estimates in the decision making. Intuitively, the confidence weight added in this manner is similar to the weight acquired by experiencing $M$ steps wherein the expected reward of each expert is its confidence estimate.

Algorithm 3: EXP4.P with confidence estimates
Require: $\delta > 0$, $M \geq 0$
1: Define $\gamma = \sqrt{\frac{\ln N}{KT}}$, set $\vec{w}_1 = \mathbf{1}$ of size $N$.
2: for $t = 1, 2, \dots, T$ do
3:   for $k = 1, 2, \dots, K$ do
4:     compute the weighted average $p_{k,t}$ from the advice and the (confidence-adjusted) weights
5:   Draw arm $k_t$ according to $\vec{p}_t$, and receive reward $r_t$.
6:   for $n = 1, \dots, N$ do (update weights)
7:     $\hat{y}^n_t = \xi^n_{k_t,t} \cdot r_t / p_{k_t,t}$
8:     $\hat{v}^n_t = \sum_{k=1}^{K} \xi^n_{k,t} / p_{k,t}$
9:     $w^n_{t+1} = w^n_t + \frac{\gamma}{2}\left(\hat{y}^n_t + \hat{v}^n_t \sqrt{\frac{\ln(N/\delta)}{KT}}\right)$

Algorithm 4: Description of the meta-MAB algorithm
Require: $M \geq 0$
1: Set $\vec{\alpha}_1 = \mathbf{1}$ and $\vec{\beta}_1 = \mathbf{1}$, each of size $N$.
2: for $t = 1, 2, \dots, T$ do
3:   for $n = 1, 2, \dots, N$ do
4:     sample $\hat{\theta}^n$ from the Beta posterior with parameters $\alpha^n_t, \beta^n_t$
5:   $n_t = \arg\max_{n=1}^{N} \hat{\theta}^n$ (choose expert)
6:   Draw arm $k_t$ according to $\vec{\xi}^{n_t}_t$, and receive reward $r_t$. (update beliefs about chosen expert)
7:   $\alpha^{n_t}_{t+1} = \alpha^{n_t}_t + r_t$
8:   $\beta^{n_t}_{t+1} = \beta^{n_t}_t + (1 - r_t)$

We present here in more detail the disjoint CMAB formulation, as well as how any such CMAB can be reduced to the shared-context formulation presented in the main paper. At time $t$, the set of all arms in this alternative CMAB formulation is characterized by a $K \times d$ context matrix consisting of $K$ time-dependent $d$-dimensional context vectors $\vec{x}_{k,t}$. A CMAB also possesses a fixed but unknown value function $f$ which maps each arm context to a reward, $f : \mathbb{R}^d \to [0, 1]$. This formulation is more natural for problems wherein the mapping is the same, but available options are distinguished by their set of features. For example, suppose we wish to select a project from a set of candidates. Each candidate (= arm) would be captured by a set of features (= arm context) predictive of whether it will be successful. We can assume that similar projects (i.e., projects with similar contexts) will have similar outcomes. In other words, the reward function is shared across arms.

From disjoint to shared context. We now briefly show how the above formulation can be reduced to the formulation presented in the main paper.
Suppose we have a bandit $b$, a context vector $\vec{x}_k$ of dimension $d$ per arm, and the shared reward function $f$. The equivalent shared-context formulation can be constructed as follows. First, define the shared context vector as the concatenation of the per-arm vectors: $\vec{x} = (\vec{x}_1, \dots, \vec{x}_K)$. Then each arm's reward function applies the function $f$ to the relevant elements of the shared context: $f_k(\vec{x}) := f(\vec{x}_{(k-1)d:kd})$, where we use the notation $\vec{x}_{i:j}$ to denote the vector slice from index $i$ to index $j$. Hence, any CMAB with arm contexts can be solved using algorithms for the shared-context CMAB by mapping it as presented above and then applying the shared-context CMAB algorithm on the resulting CMAB.

We provide here the results when confidence estimates are provided based on the expected performance of experts. Given an expert with policy $\pi$ and cumulative reward $R^\pi_T$, we derive her hindsight confidence $c^\pi_T$ over $T$ steps by normalizing her cumulative reward with regard to the worst possible cumulative reward ($R^{\pi^-}_T$), the best possible cumulative reward ($R^{\pi^+}_T$), and the reward of a uniform random policy ($R^{\pi^U}_T$); this normalization scales the confidence such that a random policy is assigned a confidence of 0.5. This confidence measure has the following properties: (i) $c^\pi_T \in [0, 1]$ for every policy $\pi$; (ii) $c^\pi_T > c^{\pi'}_T \Leftrightarrow R^\pi_T > R^{\pi'}_T$; (iii) $c^{\pi^U}_T = 0.5$; (iv) $c^{\pi^+}_T = 1$; and (v) $c^{\pi^-}_T = 0$. Hence, worse-than-random experts have a confidence lower than 0.5, while better-than-random experts have a confidence above 0.5. The best possible expert has maximal confidence, while the worst possible expert has minimal confidence. While this information is unlikely to be available in real-world scenarios, we include the results in Figures A.4 to A.15 for completeness. We set $M = 100$ for the confidence-weighted EXP4.P and meta-MAB algorithms. While the performance of meta-CMAB is unaffected (the confidence being a constant, it is of no use for LinUCB), there is a slight improvement for the meta-MAB and EXP4.P algorithms, which can use the confidence as priors on their learned weights. The main take-away is the greatly improved performance of the WMV. Just as meta-CMAB can learn to invert the advice of negatively correlated experts, the WMV inverts the advice of experts with confidence lower than 0.5. This inversion is due to the logit function used by the WMV, for which a confidence lower than 0.5 results in a negative value, and hence a negative weight in the WMV. While this improvement in performance is remarkable, it relies on the availability of full information about the experts, which is typically unavailable in practice. We demonstrate how noisy confidence affects these results by sampling noisy confidences. Given an expert with true confidence $c^n_T$, we sample her noisy confidence from the Beta distribution $\mathrm{Beta}\!\left(1 + c^n_T/\eta,\ 1 + (1 - c^n_T)/\eta\right)$, which ensures one remains in the $[0, 1]$ interval. The parameter $\eta$ controls the strength of the noise, and as $\eta \to \infty$, the confidence distribution converges towards a uniform distribution on the $[0, 1]$ interval. We plot results for noise levels $\eta = 0.1$ (Figures A.7 to A.9), $\eta = 1$ (Figures A.10 to A.12), and $\eta = 10$ (Figures A.13 to A.15). As the noise level $\eta$ increases, the confidence estimates lose their correlation with the expected performance of experts, and as a result the performance of the WMV converges towards the performance of a random policy.
The performance of EXP4.P and meta-MAB also decreases as the noise level increases, but the adaptive nature of these methods counteracts the erroneous confidence estimates generated under high noise levels. In other words, these two methods can diverge from the inappropriate confidence-based prior weights. Finally, meta-CMAB does not take advantage of confidence estimates, and it follows that its performance is unaffected by noisy confidence estimates.

References
How expert confidence can improve collective decision-making in contextual multi-armed bandit problems
Collective decision-making as a contextual multi-armed bandit problem
Sample mean based index policies by O(log n) regret for the multi-armed bandit problem
Collective decision making in the social context of science
How will country-based mitigation measures influence the course of the COVID-19 epidemic? The Lancet
The nonstochastic multiarmed bandit problem
Making better decisions in groups
Outcome bias in decision evaluation
An optimal high probability algorithm for the contextual bandit problem
A survey on practical applications of multi-armed and contextual bandits
When expert decision making goes wrong: consensus, bias, the role of experts, and accuracy
Thirteen theorems in search of the truth
Sweden's Constitution Decides Its Covid-19 Exceptionalism
The potential of collective intelligence in emergency medicine: pooling medical students' independent decisions improves diagnostic performance
Maximum likelihood estimation for the wrapped Cauchy distribution
A contextual-bandit approach to personalized news article recommendation
Multi-armed bandits for intelligent tutoring systems
Dynamics of three-dimensional replication patterns during the S-phase, analysed by double labelling of DNA and confocal microscopy
Individual confidence-weighting and group decision-making
Tighter bounds for multi-armed bandits with expert advice
An Essential Guide
Essai sur l'application de l'analyse à la probabilité des décisions rendues à la pluralité des voix. Par M. le marquis de Condorcet
Cognitive bias in clinical medicine
An image synthesizer
The polarization of American politics
Collective wisdom and decision making in surgical oncology
A tutorial on Thompson sampling
Gaussian process optimization in the bandit setting: no regret and experimental design
Understanding psychological reactance. Zeitschrift für Psychologie
On the likelihood that one unknown probability exceeds another in view of the evidence of two samples
Finite-time analysis of kernelised contextual bandits
Bandit problems with side observations
Health outcomes in economic evaluation: the QALY and utilities
Coronavirus disease 2019 (COVID-19): situation report
Collective intelligence meets medical decision-making: the collective outperforms the best radiologist
Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study
A survey on contextual multi-armed bandits

We solve the meta-MAB using Thompson Sampling [Thompson, 1933] and use a Beta distribution to model the priors. We include confidence estimates by adding them onto the parameters of the Beta distribution.
Intuitively, the confidence weight added in this manner is similar to the weight acquired by experiencing M steps wherein the expected reward of each expert is its confidence estimate.
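A minimal sketch of such a confidence-informed Thompson-sampling meta-MAB is given below. Treating each confidence estimate as the outcome of $M$ pseudo-observations is one reading of the description above; the class name and the exact prior bookkeeping are illustrative, not the authors' implementation.

```python
import numpy as np

class ThompsonMetaMAB:
    """Meta-MAB over N experts: Thompson sampling with Beta posteriors over expert quality.

    Confidence estimates (in [0, 1]) are folded into the priors as M pseudo-observations,
    mirroring the intuition that a confidence c acts like M steps with expected reward c.
    """

    def __init__(self, n_experts, confidences=None, M=0, rng=None):
        self.rng = rng or np.random.default_rng()
        self.alpha = np.ones(n_experts)
        self.beta = np.ones(n_experts)
        if confidences is not None and M > 0:
            c = np.asarray(confidences, dtype=float)
            self.alpha += M * c               # pseudo-successes
            self.beta += M * (1.0 - c)        # pseudo-failures

    def select_expert(self):
        # Sample a plausible quality for each expert and follow the most promising one.
        theta = self.rng.beta(self.alpha, self.beta)
        return int(np.argmax(theta))

    def update(self, expert, reward):
        # Bernoulli-style posterior update for the followed expert only.
        self.alpha[expert] += reward
        self.beta[expert] += 1.0 - reward

# Usage: pick an expert, pull the arm that expert advises on the true bandit, then update.
mab = ThompsonMetaMAB(n_experts=4, confidences=[0.6, 0.55, 0.4, 0.5], M=100)
n_t = mab.select_expert()
# k_t would be drawn from expert n_t's advice; after observing reward r_t: mab.update(n_t, r_t)
```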