Chronological Causal Bandits
Neil Dhir
2021-12-03

Abstract. This paper studies an instance of the multi-armed bandit (MAB) problem, specifically one in which several causal MABs operate chronologically in the same dynamical system. Practically, the reward distribution of each bandit is governed by the same non-trivial dependence structure, which is a dynamic causal model. The model is dynamic because we allow each causal MAB to depend on the preceding MAB, and in doing so we are able to transfer information between agents. Our contribution, the chronological causal bandit (CCB), is useful in discrete decision-making settings where the causal effects change across time and can be informed by earlier interventions in the same system. In this paper we present some early findings of the CCB, as demonstrated on a toy problem.

A dynamical system evolves in time. Examples include weather, financial markets and unmanned aerial vehicles [7]. One practical goal is to take decisions within those systems, such as deciding which financial instrument to buy, one day after another. The multi-armed bandit (MAB) paradigm [20, 6] can help with that. In that setting an agent and an environment interact sequentially [13]; the agent picks an action and receives a reward from the environment. The agent continues like so, usually for a finite number of plays, with the goal of maximising the total reward. The challenge in this problem stems from the trade-off between exploiting actions with known rewards and exploring actions with unknown rewards.

Recently, many studies have addressed the case in which there is a non-trivial dependency structure between the arms. One such direction presumes that the dependency structure is modelled explicitly through causal graphs [14, 15, 12, 17]. We extend that paradigm to also account for the causal temporal dynamics of the system. MABs already constitute sequential decision-making paradigms, but here we expand that idea to cover chronological MABs, where the reward distribution is conditional upon the actions taken by earlier MABs (see fig. 1). We are not considering a Markov decision process (MDP): we have no explicit notion of state and consequently do not maintain a model of the state dynamics. This type of transfer learning in causal MABs was also studied by [25, 3], but there the authors transfer information between unrelated tasks, whereas we are interested in transfer when all agents operate in the same (possibly non-stationary) dynamical system. Ours, instead, is more similar to the restless bandit problem [23, 9], where rewards vary with time (unlike the standard bandit setting, where they are fixed but unknown).

Example. Consider a dynamical environment F,¹ such as a country subject to the COVID-19 pandemic. The highly transient nature of the virus [18] necessitates multiple chronological clinical trials, the start of each indexed by t_i (see fig. 1), to find a treatment or vaccine. Suppose we conduct clinical trial i, which has K different treatments A = {a_1, ..., a_K} of unknown efficacy, and we must decide which treatment to administer to incoming patients.

¹ We will somewhat abuse standard notation for dynamical systems theory.

Figure 1: Structural causal model framed as a multi-armed bandit problem. The SCM-MAB topology is adapted from figure 3(c) in [14]. Top: each intervention can be construed as pulling an arm and receiving a reward (measuring the causal effect). Shown are all possible interventions, including sub-optimal ones (e.g. pulling Z and X together). Bottom: the optimal intervention (and consequent samples from the SEM) from the preceding SCM-MAB are transferred to the current SCM-MAB, via F. Exogenous variables and incoming edges from t_{i-1} are not shown to avoid clutter. Implemented interventions are represented by squares.
To make this decision, we could learn from how the previous choices of treatments fared for the previous patients. After a sufficient number of trials, we may have a reasonable idea of which treatment is most effective, and from thereon we could administer that treatment to all patients. However, the exploration phase may take a long time, and many patients may receive a sub-optimal treatment during that period. But we know that an earlier, similar trial i-1 has just concluded, and because we are aware of the causal nature of our treatments and of their evolution over time, we can condition our trial i on the lessons learned, and actions taken, in trial i-1 before we start ours. There are two purposes to this: (1) the additional information may aid the discovery of the most effective treatment in our trial, and (2) it may also show that the optimal intervention changes over time owing to the highly non-stationary environment of real systems, whereas standard MABs typically assume a stationary reward distribution [13]. The time between two consecutive trials i-1 and i is denoted dt_i.

Contributions. The chronological causal bandit (CCB) extends the SCM-MAB of Lee & Bareinboim [14] by conditioning an SCM-MAB on prior causal bandits played in the same environment F. The result is a piece-wise stationary model which offers a novel approach to causal decision-making under uncertainty within dynamical systems. The reward process of the arms is non-stationary on the whole, but stationary on intervals [24]. Specifically, past optimal interventions are transferred across time, allowing the present MAB to weigh the utility of those actions in the present game.

Notation. Random variables are denoted by upper-case letters and their values by lower-case letters. Sets of variables and their values are denoted by bold upper-case and lower-case letters respectively. We make extensive use of the do-calculus (for details see [19, §3.4]). Samples (observational) drawn from a system or process unperturbed over time are contained in D_O. Samples (interventional) drawn from a system or process subject to one or more interventions are contained in D_I. The domain of a variable is denoted by D(·), where e.g. x ∈ D(X), and analogously for sets of variables.

Structural causal model. Structural causal models (SCMs) [19, ch. 7] are used as the semantic framework to represent an underlying environment. For the exact definition as used by Pearl see [19, def. 7.1.1]. Let M be an SCM parametrised by the quadruple ⟨U, V, P(U), F⟩. Here, U is a set of exogenous variables which follow a joint distribution P(U), and V is a set of endogenous (observed) variables. Within V we distinguish between two types of variables: manipulative (treatment) and target (output) variables (the latter always denoted by Y in this paper). Further, the endogenous variables are determined by the set of functions F. Each SCM is associated with a causal diagram (a directed acyclic graph, DAG for short) G = ⟨V, E⟩, where the edges are given by E. Each vertex in the graph corresponds to a variable, and the directed edges point from members of PA_i and U_i toward V_i [19, ch. 7]. A bidirected edge between V_i and V_j occurs if they share an unobserved confounder, which is to say U_i ∩ U_j ≠ ∅ [14]. Unless otherwise stated, from hereon, when referring to V we are implicitly considering V \ {Y}, i.e. the manipulative variables not including the outcome variable. Finally, the fundamental do-operator do(V = v) represents the operation of fixing a set of endogenous variable(s) V to constant value(s) v irrespective of their original mechanisms. Throughout we do not consider graphs with non-manipulative variables. For a more incisive discussion of the properties of SCMs we refer the reader to [19, 4].
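To make the quadruple ⟨U, V, P(U), F⟩ and the do-operator concrete, below is a minimal sketch of a discrete SCM in Python. The three-variable graph (Z influences X; Z and X influence Y), the noise distributions and the structural functions are illustrative assumptions only; they are not the paper's toy SEM (eqs. 4-9).

```python
import numpy as np

rng = np.random.default_rng(0)

# A small discrete SCM <U, V, P(U), F>: U are exogenous, V = {Z, X, Y} are endogenous.
def sample_exogenous(n):
    """P(U): independent Bernoulli noise terms (an illustrative choice)."""
    return {"U_Z": rng.binomial(1, 0.5, n),
            "U_X": rng.binomial(1, 0.3, n),
            "U_Y": rng.binomial(1, 0.2, n)}

# F: each endogenous variable is a deterministic function of its parents and its noise.
F = {
    "Z": lambda u, v: u["U_Z"],
    "X": lambda u, v: v["Z"] ^ u["U_X"],                         # exclusive disjunction
    "Y": lambda u, v: (v["X"] & (1 - u["U_Y"])) | (v["Z"] & u["U_Y"]),  # Y copies X or Z, depending on the noise
}

def sample(n=10_000, do=None):
    """Draw n samples; do = {var: value} clamps variables, ignoring their mechanisms."""
    do = do or {}
    u, v = sample_exogenous(n), {}
    for name in ["Z", "X", "Y"]:                                 # topological order of the DAG
        v[name] = np.full(n, do[name]) if name in do else F[name](u, v)
    return v

obs = sample()                 # observational samples, i.e. D_O
interv = sample(do={"X": 1})   # interventional samples under do(X = 1)
print("E[Y]           =", obs["Y"].mean())
print("E[Y | do(X=1)] =", interv["Y"].mean())
```

Sampling with do={"X": 1} implements do(X = 1): the mechanism for X is ignored and X is clamped, while downstream variables still respond to the clamped value.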
Multi-armed bandit. The MAB setting [20] entertains a discrete sequential decision-making scenario in which an agent selects an action or 'arm' a ∈ A according to a policy π, and receives a stochastic reward Y(a) emanating from an unknown distribution particular to each arm. The expectation of the reward is given by µ_a. The goal of the agent is to optimise the arm-selection sequence and thereby maximise the expected reward E_π[Σ_{n=1}^{N} Y(A_n)], where E_π is the expectation under the given policy and A_n is the arm played on the n-th round. We will use a similar performance measure, the cumulative regret [14],

    R_N = N µ* − E[Σ_{n=1}^{N} Y(A_n)],        (1)

where the max reward is µ* = max_{a∈A} µ_a. Using the regret decomposition lemma [13, lemma 4.5], we can write this in the form

    R_N = Σ_{a∈A} ∆_a E[#_a(N)],

where each arm's gap from the best arm (the "suboptimality gap" [13, §4.5]) is ∆_a = µ* − µ_a, and #_a(N) is the total number of times arm a was played after N rounds.

Connecting SCMs to MABs. Echoing the approach taken by Lee & Bareinboim [14, §2], and using the notation and concepts introduced in §1.1, let M be an SCM parametrised by ⟨U, V, P(U), F⟩, where Y ∈ V is, as noted, the reward variable. The arms of the bandit are given by the set A = {do(X = x) | X ⊆ V \ {Y}, x ∈ D(X)}, i.e. the set of all possible interventions on the endogenous variables except the reward variable [14, §2]. Each arm a = do(X = x) is associated with a reward distribution p(Y | do(X = x)) whose mean is µ_a = E[Y | do(X = x)]. This is the SCM-MAB setting [14], fully represented by the tuple ⟨M, Y⟩. As noted by the authors, an agent facing an SCM-MAB ⟨M, Y⟩ intervenes (plays arms) with knowledge of G(M) and Y, but does not have access to the structural equation model F nor the joint distribution over the exogenous variables P(U).
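As a concrete illustration of the arm set and the regret decomposition above, the sketch below enumerates every interventional arm over two hypothetical binary manipulative variables (the full set P(V \ {Y}), including the empty intervention; the POMIS of [14] would prune this further), runs Thompson sampling against Bernoulli rewards, and reports the cumulative regret via R_N = Σ_a ∆_a #_a(N). The graph, variable names and reward means are invented for illustration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical manipulative variables and their (binary) domains.
domains = {"X": [0, 1], "Z": [0, 1]}

# Arms: every subset of manipulative variables under every joint assignment,
# i.e. the full set P(V \ {Y}); the empty dict corresponds to do(empty set).
arms = []
for r in range(len(domains) + 1):
    for subset in itertools.combinations(domains, r):
        for values in itertools.product(*(domains[v] for v in subset)):
            arms.append(dict(zip(subset, values)))

# Illustrative (unknown to the agent) mean rewards mu_a = E[Y | do(a)].
mu = {frozenset(a.items()): m
      for a, m in zip(arms, rng.uniform(0.2, 0.8, size=len(arms)))}

def pull(arm):
    """Play do(arm) and observe a Bernoulli reward with mean mu_a."""
    return rng.binomial(1, mu[frozenset(arm.items())])

# Thompson sampling with Beta(1, 1) priors over each arm's mean.
N = 5_000
alpha, beta = np.ones(len(arms)), np.ones(len(arms))
counts = np.zeros(len(arms), dtype=int)
for _ in range(N):
    k = int(np.argmax(rng.beta(alpha, beta)))   # sample a mean per arm, play the best
    y = pull(arms[k])
    alpha[k] += y
    beta[k] += 1 - y
    counts[k] += 1

# Regret decomposition: R_N = sum_a Delta_a * #_a(N).
mus = np.array([mu[frozenset(a.items())] for a in arms])
gaps = mus.max() - mus
print("cumulative regret:", float(gaps @ counts))
print("most played arm:  ", arms[int(np.argmax(counts))])
```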
Causality across time. Taking inspiration from Aglietti et al. [2, §2], we consider causality across time, manifested by propagating a DAG in time and connecting each time-slice DAG with directed edges, as shown in fig. 2. By doing this we are invariably considering a dynamic Bayesian network (DBN) [11]. As we are making interventions on time-slices of the DBN, we introduce notation to aid the exposition of the method. The SCM-MAB is particular to one graph, in which we seek to minimise R_N. We instead seek a sequence of interventions which minimises R_{N_i} at each time-step t_i (i.e. the start of trial i) of a piece-wise stationary SCM-MAB, as set out in the next paragraph.

Problem statement. Similar to [2], the goal of this work is to find a sequence of optimal interventions over time, indexed by t_i, by playing a series of sequential conditional SCM-MABs, or chronological causal bandits (CCB). The agent is provided with ⟨M_i, Y_i⟩ for each t_i and is then tasked with optimising the arm-selection sequence (within each 'trial' i) and, in so doing, minimising the total regret for the given horizon N_i. However, different to the standard MAB problem, we must take previous interventions into account as well. We do that by first writing the regret in eq. (1) in a different form which is conditional on past interventions, enabling the transfer of information:

    R_{N_i} = Σ_{n=1}^{N_i} ( E[Y_i | do(A*_n), I_{i−1} · 1_{i>0}] − E[Y_i | do(A_n), I_{i−1} · 1_{i>0}] ),

where A*_n = arg max_{a∈A_i} E[Y_i | do(a), I_{i−1} · 1_{i>0}], I_{i−1} = do(a*_{i−1}) denotes the previously implemented intervention and 1_{i>0} is the indicator function (so that no conditioning takes place for the very first trial). Now, take particular care with respect to the index to which the "implemented intervention" refers. Drawing upon our example in §1, the implemented intervention here corresponds to the treatment that was found to be the most effective during trial i−1 (during which N_{i−1} rounds were played). Although the agent finds a sequence of decisions/treatments, only one of them, in this setting, is the overall optimal choice, i.e. the one with the lowest, on average, suboptimality gap ∆_a. The implemented intervention at i−1 is found by investigating which arm has the lowest regret at the end of horizon N_{i−1}. Simply, we pick the arm that has been played the most by the agent.

Remark 2.1. The above problem statement does not discuss which set of interventions is being considered. Naively, one approach is to consider all 2^{|V \ {Y}|} sets in P(V \ {Y}) [1]. Though a valid approach, the size will grow exponentially with the complexity of G_i(M_i). Intuitively, one may believe that the best and only action worth considering is to intervene on all the parents of the reward variable Y_i. This is indeed true provided that Y_i is not confounded with any of its ancestors [14]. But whenever unobserved confounders are present (shown by red, dashed edges in fig. 2), this no longer holds true. By exploiting the rules of do-calculus and the partial orders amongst subsets [1], Lee & Bareinboim [14] characterise a reduced set of intervention variables called the possibly-optimal minimal intervention set (POMIS), A from hereon, where A ⊆ P(V \ {Y}) and typically |A| < |P(V \ {Y})|. They demonstrate empirically that selecting arms based on POMISs makes standard MAB solvers "converge faster to an optimal arm" [14, §5]. Throughout we use the POMIS as well; the full set of arms for the toy problem in fig. 2 is given in appendix A.

We make the following assumptions:
1. Invariance of the causal structure: G_0(M_0) = G_i(M_i), ∀i > 0.
2. Aglietti et al. [2] showed that A does not change across time given assumption (1).
3. An agent facing an SCM-MAB ⟨M_i, Y_i⟩ plays arms with knowledge of G_i(M_i) and Y_i, but not of P(U_τ) and F_τ.

These assumptions posit that the DAG is known. If this were not true then we would have to undertake causal discovery (CD) [8], or spend the first interactions with the environment [14] learning the causal DAG from D_O [21], from D_I [10], or both [1]. As it is, CD is outside the scope of this paper.

Herein we describe how to express the reward distribution for trial i as a function of the intervention(s) implemented at the previous trial i−1. The key to this section is the relaxation of assumption (3): we seek an estimate F̂_τ of the true SEM F_τ, and an estimate P̂(U_τ) of P(U_τ). Thus, if we have F̂_τ, we can relay information about past interventions to SCM-MAB i and so enable the present reward distribution to take those interventions into account. Reminding ourselves that the members of V are discrete, we model all functions in M̂ as independent probability mass functions (PMFs). An interventional distribution p(V | do(V = v)) is found by fixing the variables V to v in G(M); its samples are contained in D_I. As we are operating in a discrete world, we do not need to approximate any intractable integrals (as is required in e.g. [2, 1]) to compute the effect of interventions. Rather, by assuming the availability of D_O and using the do-calculus, we are at liberty to estimate F̂ by approximating the individual PMFs that arise from applying the do-calculus (see [19, theorem 3.4.1]). Consequently we can build a 'simulator' F̂_τ using D_O (which concerns the whole of M, not just the window τ). D_I is very scarce because playing an arm does not yield an observed target value but only a reward (hence we cannot exploit the actions taken during horizon N_i). The only interventional data that is available, at each i, to each SCM-MAB ⟨M_i, Y_i⟩, is the implemented intervention I_{i−1}. The CCB method is graphically depicted in fig. 3.
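The estimation step just described can be sketched as follows: fit each conditional PMF from observational samples D_O, then forward-sample the fitted model while clamping intervened variables, which yields a crude 'simulator'. The data-generating process, variable names and the uniform fallback for unseen parent configurations are assumptions for this sketch, not the paper's implementation.

```python
from collections import defaultdict
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical observational data D_O over binary variables Z, X, Y (illustrative only).
n = 20_000
Z = rng.binomial(1, 0.5, n)
X = Z ^ rng.binomial(1, 0.3, n)
Y = np.where(rng.binomial(1, 0.2, n) == 1, Z, X)
D_O = {"Z": Z, "X": X, "Y": Y}
parents = {"Z": (), "X": ("Z",), "Y": ("Z", "X")}

def fit_pmf(data, child, pa):
    """Estimate P(child | pa) as a conditional PMF from the samples in `data`."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(data[child])):
        key = tuple(int(data[p][i]) for p in pa)
        counts[key][int(data[child][i])] += 1
    return {k: {v: c / sum(cs.values()) for v, c in cs.items()} for k, cs in counts.items()}

pmfs = {v: fit_pmf(D_O, v, parents[v]) for v in parents}

def simulate(m=5_000, do=None):
    """Forward-sample the fitted PMFs in topological order, clamping intervened variables."""
    do = do or {}
    out = {}
    for name in ["Z", "X", "Y"]:
        if name in do:
            out[name] = np.full(m, do[name])
            continue
        draws = np.empty(m, dtype=int)
        for i in range(m):
            key = tuple(int(out[p][i]) for p in parents[name])
            # Unseen parent configuration: fall back to a uniform guess over {0, 1} (binary assumption).
            pmf = pmfs[name].get(key, {0: 0.5, 1: 0.5})
            vals, probs = zip(*pmf.items())
            draws[i] = rng.choice(vals, p=probs)
        out[name] = draws
    return out

print("estimated E[Y | do(X=1)] =", simulate(do={"X": 1})["Y"].mean())
print("estimated E[Y | do(X=0)] =", simulate(do={"X": 0})["Y"].mean())
```

This mirrors the role of D_O in the text: interventional data is scarce, so the simulator is fitted from observational samples and only the implemented intervention I_{i−1} is carried over between trials.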
Figure 3: Transfer learning shown through the lens of the CCB applied to fig. 2. Each row corresponds to one SCM-MAB which, if i > 0, takes into account previous interventions and the secondary effects of those interventions (intervening on e.g. Z_0 means we can also calculate values for X_0 and Y_0 in time-slice t_0). Also shown are the structural equation models considered by the CCB at every time step t ∈ {0, 1, 2}, as well as the minimisation task over the cumulative regret, undertaken to find the best action for each time-slice. Square nodes denote intervened variables.

We demonstrate some empirical observations on the toy example used throughout, shown in fig. 2. We consider the first five time-slices (trials), as shown in fig. 3. The reward distribution, under the POMIS, for each slice is shown in fig. 4 (these distributions are found conditional upon the optimal intervention being implemented in the preceding trial, as shown in fig. 3). For the true SEM see eqs. (4–6) and eqs. (7–9). Figure 4 shows that the system is, in effect, oscillating in its reward distribution. This is because the optimal intervention changes in between trials. The horizon for each trial was set to 10,000. We used two common MAB solvers: Thompson sampling (TS) [22] and KL-UCB [5]. Displayed results are averaged over 100 replicates of each experiment and shown in fig. 5. We investigate the cumulative regret (CR) and the optimal-arm selection probability at various instances in time.

For trial i = 0, the CCB and the SCM-MAB are targeting the same reward distribution. Consequently they both identify the same optimal intervention Z_0 = z_0. Come i = 1, things start to change; having implemented the optimal intervention from i = 0, the CCB is now targeting a different set of rewards (see fig. 4). The SCM-MAB, being agnostic about past interventions, targets the same rewards as previously (blue bars in fig. 4). As discussed, this ignores the dynamics of the system; a vaccinated population will greatly alter the healthcare environment, hence administering the same vaccine (arm 3 at i = 0) at the next clinical trial (i = 1), without taking this into account, makes for poor public policy.

Consider the CR from the CCB at trial i = 1 in fig. 5b; it is lower than that of the SCM-MAB in fig. 5a, as the CCB transfers the preceding intervention to the current causal model, fig. 3 (i = 1), and now finds that X_1 = x_1 is the optimal intervention (see appendix A for the full optimal intervention sequence). Let us now turn to fig. 5c, which was found using a Thompson sampling policy [14, §5] (an equivalent plot for KL-UCB can be found in fig. 6). To minimise the regret, the agent should be choosing the optimal action almost all of the time. But it is only possible to reduce regret if the algorithm has discovered the arm with the largest mean. In trials i = 1 and i = 3 the reward per arm, across the POMIS, is almost identical. As it stands, the agent does not have reasonable statistical certainty that it has found the optimal arm (orange and red curves in fig. 5c). But all arms have nearly the same causal effect, which is why the CR in fig. 5b is low.
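For concreteness, here is a minimal sketch of the chronological loop used in the experiments above, under stated assumptions: the arms are the POMIS of appendix A, but the function arm_means below is a made-up stand-in for the conditional reward means µ_{a | I_{i−1}} (in the real method these come from the fitted simulator conditioned on the previously implemented intervention); it is chosen only so that the optimum alternates between trials, mimicking the oscillation in fig. 4.

```python
import numpy as np

rng = np.random.default_rng(2)

# POMIS arms for the toy problem (appendix A): do(X=0), do(X=1), do(Z=0), do(Z=1).
ARMS = [("X", 0), ("X", 1), ("Z", 0), ("Z", 1)]

def arm_means(implemented):
    """Hypothetical stand-in for the conditional reward means mu_{a | I_{i-1}}."""
    base = np.array([0.4, 0.6, 0.7, 0.3])
    if implemented is not None and implemented[0] == "Z":
        base = base[[2, 3, 0, 1]]   # having intervened on Z previously shifts the optimum to X
    return base

def run_trial(mu, horizon=10_000):
    """Thompson sampling over Bernoulli arms; returns play counts and cumulative regret."""
    alpha, beta = np.ones(len(mu)), np.ones(len(mu))
    counts = np.zeros(len(mu), dtype=int)
    for _ in range(horizon):
        k = int(np.argmax(rng.beta(alpha, beta)))
        y = rng.binomial(1, mu[k])
        alpha[k] += y
        beta[k] += 1 - y
        counts[k] += 1
    return counts, float((mu.max() - mu) @ counts)

implemented = None                              # I_{-1}: nothing has been implemented yet
for i in range(5):                              # five chronological trials, t_0 .. t_4
    mu_i = arm_means(implemented)
    counts, regret = run_trial(mu_i)
    implemented = ARMS[int(np.argmax(counts))]  # the most-played arm is carried forward
    print(f"trial {i}: implemented do({implemented[0]}={implemented[1]}), regret {regret:.1f}")
```

The implemented intervention passed to the next trial is simply the most-played arm, matching the rule given in the problem statement.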
We propose the chronological causal bandit (CCB) algorithm, which transfers information between causal bandits that have been played in the same dynamical system at an earlier point in time. Some initial findings are demonstrated on a simple toy example, where we show that taking the system dynamics into account has a profound effect on the final action selection. Further, whilst in this example we have assumed that dt_i is the same for each trial, this remains a large assumption that will be studied further in future work. There are many other interesting avenues for further work, such as the optimal length of the horizon N_i, as well as determining the best time t*_i to play a bandit (i.e. start a trial). Moreover, the current framework allows for confounders to change across time-steps, something we have yet to explore.

Appendix A. Material herein is adapted from [14, task 2]. When t = 0 we use the original set of structural equation models from [14, appendix D], where ⊕ is the exclusive disjunction operator and ∧ is the logical conjunction operator (i.e. 'and'). The biggest difference between eqs. (4–6) and eqs. (7–9) is that the former has an explicit dependence on the past. Depending on the implemented intervention at t−1, one or both of {z_{t−1}, x_{t−1}} will be fixed to the value(s) in I_{t−1}. "The task of finding the best arm among all possible arms can be reduced to a search within the MISs" [14]. The POMIS is given by A = {do(X = 0), do(X = 1), do(Z = 0), do(Z = 1)}. The optimal intervention sequence is given by {Z_0 = z_0, X_1 = x_1, Z_2 = z_2, X_3 = x_3, Z_4 = z_4}; using the example domain, this translates to {0, 1, 0, 1, 0}.

References
[1] Causal Bayesian optimization.
[2] Dynamic causal Bayesian optimization.
[3] Sequential transfer in multi-armed bandit with finite set of models.
[4] Causal inference and the data-fusion problem.
[5] Kullback-Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics.
[6] Sequential design of experiments.
[7] Next generation reservoir computing.
[8] Review of causal discovery methods based on graphical models.
[9] Approximation algorithms for restless bandit problems.
[10] Experimental design for learning causal graphs with latent variables.
[11] Probabilistic Graphical Models: Principles and Techniques. Adaptive Computation and Machine Learning.
[12] Causal bandits: Learning good interventions via causal inference.
[13] Bandit Algorithms.
[14] Structural causal bandits: Where to intervene?
[15] Structural causal bandits with non-manipulable variables.
[16] Characterizing optimal mixed policies: Where to intervene and what to observe.
[17] Regret analysis of bandit problems with causal background knowledge.
[18] A model based study on the dynamics of COVID-19: Prediction and control.
[19] Causality: Models, Reasoning and Inference.
[20] Some aspects of the sequential design of experiments.
[21] Causation, Prediction, and Search.
[22] On the likelihood that one unknown probability exceeds another in view of the evidence of two samples.
[23] Restless bandits: Activity allocation in a changing world.
[24] Piecewise-stationary bandit problems with side observations.
[25] Transfer learning in multi-armed bandit: A causal approach.