title: Dopamine role in learning and action inference
authors: Bogacz, Rafal
date: 2019-11-11
journal: bioRxiv
DOI: 10.1101/837641

This paper describes a framework for modelling dopamine function in the mammalian brain. In this framework, dopaminergic neurons projecting to different parts of the striatum encode errors in predictions made by the corresponding systems within the basal ganglia. These prediction errors are equal to differences between rewards and expectations in the goal-directed system, and to differences between the chosen and habitual actions in the habit system. The prediction errors enable learning about rewards resulting from actions and habit formation. During action planning, the expectation of reward in the goal-directed system arises from formulating a plan to obtain that reward. Thus dopaminergic neurons in this system provide feedback on whether the current motor plan is sufficient to obtain the available reward, and they facilitate action planning until a suitable plan is found. The presented models account for dopaminergic responses during movements and for the effects of dopamine depletion on behaviour, and they make several experimental predictions.

Neurons releasing dopamine send widespread projections to many brain regions, including the basal ganglia and cortex (Björklund & Dunnett, 2007), and substantially modulate information processing in the target areas. Dopaminergic neurons in the ventral tegmental area respond to unexpected rewards (Schultz, Dayan, & Montague, 1997), and hence it has been proposed that they encode reward prediction error, defined as the difference between obtained and expected reward (Houk, Adams, & Barto, 1995; Montague, Dayan, & Sejnowski, 1996). According to the classical reinforcement learning theory, this prediction error triggers an update of the estimates of expected rewards encoded in the striatum. Indeed, it has been observed that dopaminergic activity modulates synaptic plasticity in the striatum in a way predicted by the theory (Reynolds, Hyland, & Wickens, 2001; Shen, Flajolet, Greengard, & Surmeier, 2008). This classical reinforcement learning theory of dopamine has been one of the greatest successes of computational neuroscience, as the predicted patterns of dopaminergic activity have been seen in diverse studies in multiple species.

However, this classical theory does not account for the important role of dopamine in action planning. This role is evident from the difficulties in initiating voluntary movements seen after the death of dopaminergic neurons in Parkinson's disease. It is also consistent with the diversity in the activity of dopaminergic neurons, with many of them responding to movements (da Silva, Tecuapetla, Paixão, & Costa, 2018). Understanding the role of dopamine in action planning and movement initiation is important for refining treatments for Parkinson's disease, where the symptoms are caused by dopamine depletion. Despite this importance, there is no mathematical framework that can describe the role of dopamine in both learning and action planning.

A promising theory, called active inference, may provide the foundation for a framework accounting for such a dual role of dopamine (Friston, 2010). This theory relies on the assumption that the brain attempts to minimize prediction errors defined as the differences between observed stimuli and expectations.
In active inference, these prediction errors can be minimized in two ways: through 7 learning -by updating expectations to match stimuli, and through action -by changing the world to 8 match the expectations. According to the active inference theory, prediction errors may need to be 9 minimized by actions, because the brain maintains prior expectations that are necessary for survival 10 and so cannot be overwritten by learning, e.g. food reserves should be at a certain level. When such 11 predictions are not satisfied, the brain plans actions to reduce the corresponding prediction errors, 12 e.g. by finding food. 13 This paper suggests that a more complete description of dopamine function can be gained by 14 integrating reinforcement learning with elements of three more recent theories. First, taking 15 inspiration from active inference, we propose that prediction errors represented by dopaminergic 16 neurons are minimized by both learning and action planning, which gives rise to the roles of dopamine 17 in both these processes. Second, we incorporate a recent theory of habit formation, which suggests 18 that the habit and goal-directed systems learn on the basis of distinct prediction errors (Miller, 19 Shenhav, & Ludvig, 2019), and we propose that these prediction errors are encoded by distinct 20 populations of dopaminergic neurons, giving rise to the observed diversity of their responses. Third, 21 we assume that the most appropriate actions are identified through Bayesian inference (Solway & 22 Botvinick, 2012), and present a mathematical framework describing how this inference can be 23 physically implemented in anatomically known networks within the basal ganglia. Since the 24 framework extends the description of dopamine function to action planning, we refer to it as the 25 DopAct framework. The DopAct framework accounts for a wide range of experimental data including 26 the diversity of dopaminergic responses, the difficulties in initiation of voluntary movements under 27 dopamine depletion, and it makes several experimentally testable predictions. 28 To provide an intuition for the DopAct framework, we start with giving its overview. Next, we formalize 30 the framework, and then show examples of models developed within it for two tasks commonly used 31 in experimental studies of reinforcement learning and habit formation: selection of action intensity 32 (such as frequency of lever pressing) and choice between two actions. 33 Overview of the framework 34 This section first gives an overview of computations taking place during action planning in the DopAct 35 framework, and then summarizes how these computations could be implemented in neural circuits 36 including dopaminergic neurons. 37 It has been proposed that the aim of action selection is to bring an animal to a desired level of reserves 38 such as food, water, etc. (Hull, 1952; Stephan et al., 2016) . Although there are multiple internal 39 dimensions which animals need to optimize, e.g. temperature (Buckley, Kim, McGregor, & Seth, 2017) , 40 internal salt levels (Cone et al., 2016) , etc., for simplicity, we will consider a single dimension of food 41 reserves. As there exists an optimal level of food reserves, an animal should only seek food resources, 42 if its reserves are below the optimum value. 
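To make the two routes of minimization concrete, the toy sketch below contrasts them; all quantities, rates and function names are schematic illustrations and not part of the model developed later.

```python
# Two ways of reducing a prediction error in active inference:
# update the expectation (learning) or act on the world (action).
def minimize_by_learning(expectation, observation, rate=0.5):
    error = observation - expectation
    return expectation + rate * error        # the expectation moves towards the stimulus

def minimize_by_action(reserves, desired_reserves, food_found=3.0):
    error = desired_reserves - reserves
    if error > 0:                            # the prior "reserves should be high" cannot be unlearned,
        reserves += min(food_found, error)   # so the error is reduced by acting, e.g. finding food
    return reserves

print(minimize_by_learning(expectation=0.0, observation=2.0))   # 1.0: halfway to the stimulus
print(minimize_by_action(reserves=4.0, desired_reserves=6.0))   # 6.0: the deficit removed by action
```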
Here we propose, that the mechanisms, which the brain 43 employs to achieve the desired reserves level, include a system that is able to compute how much 44 resources should be acquired in a given situation, and a circuit that is able to select an action to obtain 1 the desired reward. We refer to these two components as a valuation system and an actor, 2 respectively. This paper focusses on describing the actor. Nevertheless, we briefly summarise the 3 computations in the valuation system below, because it will help in understanding the computations 4 of the actor. 5 In DopAct framework, the role of the valuation system is to compute how much resources the animal 6 should aim at acquiring in a given situation. We distinguish between two classes of factors that 7 describe a situation of an animal: internal factors discussed earlier to which we refer as 'reserves', and 8 external factors related to the environment, such as stimulus or location in space, to which we refer 9 as a 'state' following reinforcement learning terminology. Thus the role of the valuation system is to 10 compute the value defined as the amount of resource, which should be acquired by the animal in a 11 given state . The value depends on both the amount of resources available in state , and the 12 current level of reserves, as illustrated in Figure 1A , where arrows denote animal's estimate for the 13 change in reserve levels that may be achieved in a given state. For example, imagine a whale in a state 14 "by a big school of sardines". If the reserves level resulting from consuming the entire resource 15 available is still below the optimum level, as in the top display of Figure 1A , then the desired value 16 is equal to the entire resource available (i.e. the whale should swallow the whole school). By contrast, 17 the bottom display illustrates the case when consuming the whole resource would move the animal 18 beyond the optimum reserves level, and here value is equal to the amount required to bring the 19 food reserves to the desired level (i.e. the whale should only swallow a corresponding fraction of the 20 school). To perform this computation, the valuation system needs to be able to learn how much 21 resource is available in a given state (analogously to a critic in classical reinforcement learning), and 22 during planning compute what fraction of that resource should be acquired. 23 Since this paper focusses on describing computations in the actor, for simplicity, we assume that the 24 valuation system is able to compute the value , but this paper does not describe how that 25 computation is performed. In simulations we mostly focus on a case of low reserves, where is equal 26 to the whole resource available, and use a simple model similar to a critic in standard reinforcement 27 learning, which just learns the average value ( ) of resource in state (Sutton & Barto, 1998 ), but 28 does not consider reserve levels. Extending the description of the valuation system will be an 29 important direction for future work and we come back to it in Discussion. In the considered case of 30 low resources, for simplicity we use word 'reward' as a synonym of 'resource'. 31 The goal of the actor is to select an action to obtain the reward set by the valuation system. This action 32 is selected through inference in a probabilistic model, which describes relationships between states, 33 actions and rewards. 
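Although the full treatment of the valuation system is deferred, the computation illustrated in Figure 1A can be summarized in a minimal sketch; this is our own simplification, ignoring learning and noise.

```python
def desired_value(resource_available, reserves, optimum):
    """Amount of resource the animal should aim to acquire in the current state.

    If consuming everything still leaves the animal below its optimum, the
    desired value equals the whole resource (top panel of Figure 1A);
    otherwise it equals the remaining deficit (bottom panel).
    """
    deficit = max(optimum - reserves, 0.0)
    return min(resource_available, deficit)

# Whale example: a school of sardines worth 10 units of reserves.
print(desired_value(resource_available=10, reserves=2, optimum=20))   # 10: swallow the whole school
print(desired_value(resource_available=10, reserves=15, optimum=20))  # 5: only part of the school
```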
We denote the random variables from which states, actions and rewards are sampled by S, A and R, and particular values of these variables by the corresponding small letters s, a and r. The DopAct framework assumes that two systems within the actor learn distinct relationships between the variables, shown in Figure 1B. The first system, shown in orange, learns how the reward depends on the action selected in a given state, and we refer to it as 'goal-directed', because it can infer actions that typically lead to the desired reward. The second system, in blue, learns which actions should generally be chosen in a given state, and we refer to it as 'habit', because it suggests actions without considering the value of the reward currently available. Both the goal-directed and habit systems propose an action, and their influence depends on their relative certainty.

Figure 1C gives an overview of how the systems mentioned above contribute to action planning in a typical task. During initial trials of a task, the valuation system (shown in red) evaluates the current state and computes the value of the desired reward v, and the goal-directed system selects the action a. At this stage the habit system contributes little to the planning process, as its uncertainty is high. As training progresses, the selected actions are learned by the habit system (Miller et al., 2019). If for a given state s the animal selects very similar actions over many trials, the certainty of the habit system increases. In this case, on later trials the action is mostly determined by the habit system (Figure 1C). Such a choice is also faster, because it does not require an intermediate step of computing the value of the state.

Figure 1. … circles, and arrows denote dependencies learned by different systems. C) Schematic overview of information processing in the framework at different stages of task acquisition. D) Mapping of the systems onto different parts of the cortico-basal ganglia network. Circles correspond to neural populations located in the regions indicated by the labels to the left, where 'Striatum' denotes medium spiny neurons expressing D1 receptors, 'GABA' denotes inhibitory neurons located in the vicinity of dopaminergic neurons, and 'Reward' denotes neurons providing information on the magnitude of instantaneous reward. Arrows denote excitatory projections, while lines ending with circles denote inhibitory projections. E) Summary of prediction errors encoded in different systems.

The details of the above computations in the framework will be described in the next section, and it will later be shown how an algorithm inferring the action can be implemented in a network resembling the anatomy of the basal ganglia (Figure 1D). But before going through a mathematical description, let us first provide an overview of this implementation. In this implementation, the valuation, goal-directed and habit systems are mapped onto the spectrum of cortico-basal ganglia loops (Alexander, DeLong, & Strick, 1986), with the habit system mapped onto the loop including the dorsolateral striatum, which has been shown to be critical for habitual behaviour (Burton, Nakamura, & Roesch, 2015). In the DopAct framework, the probability distributions learned by the actor are encoded in the strengths of synaptic connections in the corresponding loops, primarily in cortico-striatal connections. As in a standard implementation of the critic (Houk et al., 1995), the parameters of the value function learned by the valuation system are encoded in the cortico-striatal connections of the corresponding loop.
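To make the structure of Figure 1B concrete, the sketch below samples from the hierarchical model, anticipating the Gaussian forms introduced later (Figure 3B); the parameter names q_h and q_g, and all numerical values, are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters (our notation): q_h scales the habitual action for a
# stimulus, and q_g scales how reward grows with the stimulus-action product.
q_h, sigma_h = 2.0, 0.5   # habit system:         P(a | s)    = N(a; q_h*s, sigma_h^2)
q_g, sigma_g = 0.5, 0.3   # goal-directed system: P(r | s, a) = N(r; q_g*s*a, sigma_g^2)

def sample_episode(s):
    """Ancestral sampling from the hierarchical model sketched in Figure 1B."""
    a = rng.normal(q_h * s, sigma_h)      # the habit system proposes a typical action
    r = rng.normal(q_g * s * a, sigma_g)  # the goal-directed system predicts the resulting reward
    return a, r

print(sample_episode(s=1.0))
```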
7 Analogous to classical reinforcement learning theory, dopaminergic neurons play a critical role in 8 learning, and encode errors in predictions made by the systems in the DopAct framework. However, 9 by contrast to the standard theory, dopaminergic neurons do not all encode the same signal, but 10 instead dopaminergic populations in different systems compute errors in predictions made by their 11 corresponding system ( Figure 1E ). Since both valuation and goal-directed systems learn to predict 12 reward, the dopaminergic neurons in these systems encode reward prediction errors (which slightly 13 differ between these two systems, as will be illustrated in simulations presented later). By contrast, 14 the habit system learns to predict action on the basis of a state, so its prediction error encodes how 15 the currently chosen action differs from a habitual action in the given state. Thus these dopaminergic 16 neurons respond to non-habitual actions in the DopAct framework. We denote the prediction errors 17 in the valuation, goal-directed and habit systems by , and , respectively. In the DopAct 18 framework, the dopaminergic neurons send these prediction errors to the striatum, where they 19 trigger plasticity of cortico-striatal connections. For example, when an action is selected mostly by the 20 goal-directed system, the prediction error in the habit system will trigger plasticity in the striatal 21 neurons of the habit system, so they tend to predict this action in the future. In this way, the habit 22 system learns to mimic the goal-directed system. 23 The systems communicate through an 'ascending spiral' structure of striato-dopaminergic projections 24 identified by Haber, Fudge, and McFarland (2000) . These Authors observed that dopaminergic 25 neurons within a given loop project to the corresponding striatal neurons, while the striatal neurons 26 project to the dopaminergic neurons in the corresponding and next loops, and they proposed that the 27 projections to the next loop go via interneurons, so they are effectively excitatory ( Figure 1D ). In the 28 DopAct framework, once the striatal neurons in the valuation system compute the value of the state 29 , they communicate it to the goal-directed system via the dopaminergic neurons in the goal-directed 30 system. 31 In the DopAct framework, dopamine in the goal-directed system plays a role in both action planning 32 and learning, and now an overview of this role is given. In agreement with classical reinforcement 33 learning theory, the dopaminergic activity encodes reward prediction error, namely the difference 34 between the reward (including both obtained and available reward) and the expected reward (Schultz 35 et al., 1997) , but in the DopAct framework the expectation of reward only arises from formulating a 36 plan to achieve it. Thus in the presence of reward, the prediction error can only be reduced to zero, 37 once a plan to obtain the reward is formulated. 38 The dual role of dopamine in the DopAct framework stems from the two ways in which the goal- 39 directed system minimizes reward prediction error: learning and action planning. To gain an intuition 40 for how this system operates, let us consider a simple example of a naïve hungry rat exploring a 41 conditioning apparatus. Assume that the rat presses a lever and a food pellet is delivered in a food 42 port ( Figure 2 ). The sight of this unexpected reward will trigger a dopamine response. 
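Before continuing with this example, the three prediction errors introduced above (written here as delta_v, delta_g and delta_h) can be summarized compactly. Their precise forms appear later as Equations 5.2; the weight names q_v, q_g and q_h are ours, and the valuation error is shown without the temporal-difference terms used in the full model.

```python
def prediction_errors(s, a, r, q_v, q_g, q_h):
    delta_v = r - q_v * s        # valuation: reward vs. the learned value of the state
    delta_g = r - q_g * s * a    # goal-directed: reward vs. the reward expected from the plan
    delta_h = a - q_h * s        # habit: chosen action vs. the habitual action in this state
    return delta_v, delta_g, delta_h

# A well-planned but non-habitual action: both reward errors are zero,
# while the habit error is large, so only the habit system is "surprised".
print(prediction_errors(s=1.0, a=4.0, r=2.0, q_v=2.0, q_g=0.5, q_h=1.0))
```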
According to 43 the DopAct framework, the reward prediction error arises in the goal-directed system, because the 44 valuation system noted that a reward is available, but the goal-directed system has not yet prepared 45 actions to obtain the reward (so it has not formed an expectation). The resulting prediction error is 1 being used in two ways. First, the prediction error triggers a process of planning actions that can get 2 the reward. This facilitation of planning arises in the network, because the dopaminergic neurons in 3 the goal-directed system project to striatal neurons ( Figure 1D ), and increase their excitability. Once 4 an action plan has been formulated, the animal starts to expect the available reward, and the 5 dopamine level encoding the prediction error decreases. Importantly, in this network dopamine 6 provides a crucial feedback to striatal neurons on whether the formulated action plan is sufficient to 7 obtain the available reward. If it is not, this feedback triggers changes in the action plan until it 8 becomes appropriate. Thus the framework suggests why it is useful for neurons encoding reward 9 prediction error to be involved in planning, namely it suggests that this prediction error provides a 10 useful feedback for the action planning system, informing if the plan is suitable to obtain the reward. 11 Second, the prediction error allows the animal to learn that rewards are available after certain actions 12 at particular states, so in this case, it will modify synaptic connections encoding the value of lever 13 pressing. 14 15 Figure 2 . Schematic illustration of changes in dopaminergic activity in the goal-directed system while 16 a naïve hungry rat presses a lever and a food pellet is delivered. The prediction error encoded in 17 dopamine (bottom trace) is equal to a difference between the reward available (top trace) and the 18 expectation of reward arising from a plan to obtain it (middle trace). The dashed arrows schematically 19 indicate the processes such prediction error triggers. 20 Formal description of the framework 21 Let us now provide the details of the DopAct framework. For clarity, we will follow Marr's levels of 22 description, and in this section, we discuss computation and algorithm employed by the actor, while 23 sample implementations for two commonly used tasks are presented in the following sections. To 24 illustrate the computations in the framework we will consider a simple task, in which only an intensity 25 of a single action needs to be chosen. Such choice needs to be made by animals in classical 26 experiments investigating habit formation, where the animals are offered a single lever, and need to 27 decide how frequently to press it. Furthermore, action intensity often needs to be chosen by animals 28 also in the wild (e.g. a cat deciding how vigorously to pounce on a prey, a chimpanzee choosing how 29 strongly to hit a nut with a stone, or a sheep selecting how vigorously to eat the grass). Let us denote 30 the action intensity by . Let us assume that the animal chooses it on the basis of the reward it expects 31 and the stimulus (e.g. size of prey, nut or height of grass). Thus the animal needs to infer an action 32 intensity sufficient to obtain the desired reward (but not larger to avoid unnecessary effort). 
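As an aside, the within-trial picture of Figure 2 can be reproduced with a toy first-order dynamics; the time course and the rate constant are purely illustrative.

```python
import numpy as np

dt, steps, onset = 0.01, 300, 50        # 3 time units; the pellet appears at t = 0.5
available = np.zeros(steps)
available[onset:] = 1.0                 # reward available after the pellet is delivered

expectation = np.zeros(steps)           # reward expected from the current motor plan
dopamine = np.zeros(steps)              # prediction error in the goal-directed system
rate = 2.0                              # hypothetical speed of action planning

for t in range(1, steps):
    dopamine[t] = available[t] - expectation[t - 1]
    expectation[t] = expectation[t - 1] + dt * rate * dopamine[t]

# Large just after the pellet, near zero once a plan has been formed.
print(round(dopamine[onset + 10], 2), round(dopamine[-1], 2))
```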
As described in the previous section, we assume that the actor maintains two probability distributions: the goal-directed system encodes how the reward depends on states and actions, while the habit system encodes the probability distribution of generally selecting actions in particular states. During action planning, when an animal notices that reward r = v is available, it combines information from both systems through Bayesian inference. According to Bayes' theorem (Equation 3.1 in Figure 3), the posterior probability of selecting a particular action given the available reward is proportional to the product of the likelihood of the reward given the action, which we propose is represented in the goal-directed system, and a prior, which we propose is encoded by the habit system.

Figure 3. … action intensity. In this example the stimulus intensity is equal to s = 1, the valuation system computes the desired resource v = 2, and the parameters of the probability distributions encoded in the goal-directed and habit systems are listed in the panel. The blue curve shows the distribution of action intensity which the habit system has learned to be generally suitable for this stimulus. The orange curve shows the probability density of obtaining a reward of 2 for a given action intensity, and this probability is estimated by the goal-directed system. For the chosen parameters, it is the probability of obtaining 2 from a normal distribution whose mean depends on the action intensity (Equation 3.2). Finally, the green curve shows the posterior distribution computed from Equation 3.1.

In the DopAct framework, an action is selected which maximizes the posterior probability P(a|s, r). An analogous way of selecting actions has been used in models treating planning as inference (Attias, 2003), and it has been nicely summarized by Solway and Botvinick (2012): this approach treats the occurrence of reward as a premise, and leverages the generative model to determine which course of action best explains the observation of reward. In this paper, we make explicit the rationale for this approach: the desired amount of resources that should be acquired depends on the level of reserves (and on the given state); this value is computed by the valuation system, and the actor needs to find the action appropriate for this reward. Let us provide a further rationale for selecting an action which maximizes P(a|s, r), by analysing what this probability expresses. Consider the following hypothetical scenario: an animal selected an action without considering the desired reward, i.e. by sampling it from its default policy P(a|s) provided by the habit system, and obtained reward r. In this case, P(a|s, r) is the probability that the selected action was a. When an animal knows the amount of resource desired, v, then instead of just relying on the prior, the animal should rather choose an action maximizing P(a|s, r = v), which was the action most likely to yield this reward in the above scenario.

One may ask why it is useful to employ the habit system, instead of exclusively relying on the goal-directed system that encodes the relationship between rewards and actions. The answer is that there may be uncertainty in the action suggested by the goal-directed system, arising, for example, from noise in the computations of the valuation system or inaccurate estimates of the parameters of the goal-directed system. According to Bayesian philosophy, in the face of such uncertainty, it is useful to additionally bias the action by a prior, which here is provided by the habit system.
This prior encodes 18 an action policy that has overall worked in the situations previously experienced by the animal, so it 19 is a useful policy to consider under uncertainty in the goal-directed system. Later we will show in 20 simulations that incorporating the prior may indeed help to select more optimal actions. 21 To provide an intuition for how the action intensity is computed, let us consider an example. Let us 22 specify the form of the prior and likelihood distributions. They are given in Figure 3B , where ( ) 23 denotes the probability density of a normal distribution with mean and variance . In a case of the 24 prior, we assume that action intensity is normally distributed around a mean given by stimulus 25 intensity scaled by parameter , reflecting an assumption that a typical action intensity often depends 26 on a stimulus (e.g. the larger a nut, the harder a chimpanzee must hit it). On the other hand, in a case 27 of the probability of reward maintained by the goal-directed system, the mean of the reward is 28 equal to a product of action intensity and the stimulus size, scaled by parameter . We assume that 29 the mean reward depends on a product of and for three reasons. First, in many situations reward 30 depends jointly on the size of the stimulus, and the intensity with which the action is taken, because 31 if the action is too weak, the reward may not be obtained (e.g. a prey may escape or a nut may not 32 crack), and the product captures this dependence of reward on a conjunction of stimulus and action. 33 Second, in many foraging situations, the reward that can be obtained within a period of time is 34 proportional to a product of and (e.g. amount of grass eaten by a sheep is proportional to both 35 how vigorously the sheep eats it, and how high the grass is). Third, when the framework is generalized 36 to multiple actions later in the paper, the assumption of reward being proportional to a product of 37 and will highlight a link with classical reinforcement learning. We denote the variances of the 38 distributions of the goal-directed and habit systems by and . Figure 3C shows an example of 39 probability distributions encoded by the two systems for sample parameters. It also shows a posterior 40 distribution ( ), and please note that its peak is in between the peaks of the distributions of the 41 two systems, but it is closer to the peak of a system with smaller uncertainty (orange distribution is 42 narrower). This illustrates how in the DopAct framework, the action is inferred by incorporating 43 information from both systems, but weighting it by the certainty of the systems. 44 In addition to action planning, the animal needs to learn from the outcomes, to predict rewards more 45 accurately in the future. After observing the reward actually obtained = , the parameters of the 46 distributions should be updated to increase ( ), so in the future the animal is less surprised by the 1 reward obtained in that state ( Figure 3A ). 2 Let us now describe an algorithm used by the actor to infer action intensity that maximizes the 3 posterior probability ( ). This posterior probability can be computed from Equation 3.1, but 4 note that does not occur in the denominator of that equation, so we can simply find the action that 5 maximizes the numerator. Hence, we define an objective function equal to a logarithm of the 6 numerator of Bayes' theorem (Equation 4 .1 in Figure 4 ). 
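As a numerical counterpart of the Figure 3C example (the parameter values shown in that panel are not reproduced above, so the numbers below are hypothetical, with our names q_g and q_h for the two weights):

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

s, v = 1.0, 2.0               # stimulus intensity and desired resource, as in the Figure 3 example
q_h, var_h = 3.0, 1.0         # habit prior      P(a|s)   = N(a; q_h*s, var_h)     (hypothetical values)
q_g, var_g = 0.5, 0.1         # goal-directed    P(r|s,a) = N(r; q_g*s*a, var_g)   (hypothetical values)

a = np.linspace(0.0, 6.0, 601)
prior = normal_pdf(a, q_h * s, var_h)
likelihood = normal_pdf(v, q_g * s * a, var_g)   # probability of obtaining r = v for each intensity
posterior = prior * likelihood
posterior /= posterior.sum() * (a[1] - a[0])     # normalization (denominator of Equation 3.1)

print("habit peak:", q_h * s,
      "goal-directed peak:", v / (q_g * s),
      "posterior peak:", round(float(a[np.argmax(posterior)]), 2))
```

With these values the posterior peaks near 3.7, between the two systems' preferred intensities (3 and 4) and closer to that of the sharper goal-directed distribution, illustrating the weighting by certainty described above.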
Introducing the logarithm will simplify the function F, because it cancels with the exponents present in the definition of the normal density (Equation 3.3), and it does not change the position of the maximum of the numerator, because the logarithm is a monotonic function. For example, the green curve in Figure 4B shows the function F corresponding to the posterior probability in Figure 3C. Both green curves have their maximum at the same point, so instead of searching for the maximum of the posterior probability, we can seek the maximum of the simpler function F.

Figure 4. … if the action is initialized to a = 1.5, then the gradient of F at this point is positive, so a is increased (Equation 4.2), as indicated by a green arrow on the x-axis. These changes in a continue until the gradient is no longer positive, i.e. when F is at its maximum. Analogously, if the action is initialized to a value beyond the maximum, then the gradient of F is negative, so a is decreased until it reaches the maximum of F.

During action planning we set the reward to the reward available, r = v, in Equation 4.1, and we find the action maximizing F. This can be achieved by initializing a to any value, and then changing it proportionally to the gradient of F (Equation 4.2). Figure 4B illustrates that with such dynamics, the value of a approaches a maximum of F. Once a converges, the animal may select the action with the corresponding intensity. In summary, this method yields a differential equation describing the evolution of a variable a, which converges to the value that maximizes P(a|s, r).

After obtaining a reward, r is set to the reward obtained in Equation 4.1, and the values of the parameters are changed proportionally to the gradients of F (Equations 4.3). Such parameter updates allow the model to be less surprised by the rewards (as aimed for in Figure 3A), because under certain assumptions the function F expresses the "negative free energy", which provides a lower bound on the log probability of the reward in the given state (Friston, 2005) (a detailed explanation of why F expresses negative free energy for an analogous problem is given by Bogacz (2017)). Thus changing the parameters to increase F raises this lower bound, and so it tends to increase the probability of the observed reward.

This section describes how the above algorithm for inferring action intensity could be implemented in a network resembling basal ganglia anatomy. Let us start by considering a special case in which both variance parameters are fixed to 1, because then the mapping of the algorithm onto the network is particularly beautiful. Let us derive the details of the algorithm (the general form of which is given in Figure 4A) for the problem of choosing action intensity. Substituting the probability densities of the likelihood and prior distributions (Equations 3.2-3.3) for the case of unit variances into Equation 4.1 (and ignoring the constants 1/√(2π)), we obtain the expression for the objective function in Equation 5.1. We see that F consists of two terms, which are the squared prediction errors associated with the goal-directed and habit systems. The prediction error for the goal-directed system describes how the reward differs from the expected mean, while the prediction error of the habit system expresses how the chosen action differs from that typically chosen in the current state (Equations 5.2). Equation 5.1 highlights that inferring the action intensity that maximizes F corresponds to reducing the prediction errors. As described in the previous section, the action intensity can be found by changing its value according to the gradient of F (Equation 4.2).
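Anticipating the derivative computed in the next paragraph (Equation 5.3), the planning step can be sketched for the unit-variance case as follows; the weights q_g and q_h and their values are hypothetical.

```python
# Planning by gradient ascent on F in the unit-variance case (our notation).
s, v = 1.0, 2.0
q_g, q_h = 0.5, 3.0          # hypothetical goal-directed and habit weights

a, dt = 0.0, 0.05
for _ in range(400):
    delta_g = v - q_g * s * a                 # reward available vs. reward expected from the plan
    delta_h = a - q_h * s                     # chosen action vs. habitual action
    a += dt * (delta_g * q_g * s - delta_h)   # da/dt proportional to dF/da

print(round(a, 2))   # intensity at which the (scaled) goal-directed and habit errors balance
```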
17 Computing the derivative of over , we obtain Equation 5 .3, where the two colours indicate terms 18 connected with derivatives of the corresponding prediction errors. Finally, when the reward is 19 obtained, we modify the parameters proportionally to the derivatives of over the parameters, which 20 are equal to relatively simple expressions in Equations 5.4. in Figure 1D , and additionally "Output" denotes the output nuclei of the basal ganglia. C) Definition of 25 striatal activity in the goal-directed system. 26 The key features of the algorithm in Figure 5A action intensity in the model is jointly determined by the striatal neurons in the goal-directed and 2 habit systems, which compute the corresponding terms of Equation 5 .3, and communicate them by 3 projecting to the thalamus via the output nuclei of the basal ganglia. The first term can be 4 provided by striatal neurons in the goal-directed system (denoted by in Figure 5B ): They receive 5 cortical input encoding stimulus intensity , which is scaled by cortico-striatal weights encoding 6 parameter , so these neurons receive synaptic input To compute , the gain of the striatal 7 neurons in the goal-directed system needs to be modulated by dopaminergic neurons encoding 8 prediction error (this modulation is represented in Figure 5B by an arrow from dopaminergic to 9 striatal neurons). Hence, these dopaminergic neurons drive an increase in action intensity until the 10 prediction error they represent is reduced (as discussed in Figure 2 ). The second term in Equation 11 5.3 can be computed by a population of neurons in the habit system receiving cortical input via 12 connection with the weight . Finally, the last term simply corresponds to a decay. Moreover, 13 according to Equations 5.4, the prediction errors modulate the plasticity of cortico-striatal connections 14 in both systems (represented in Figure 5B by arrows going from dopamine neurons to parameters). 15 There are several ways of mapping the remaining details of the algorithm on the striato-dopaminergic 16 circuit, and as a proof of principle we show here one such mapping, as an example. Within each 17 system, dopaminergic neurons compute errors in the predictions about the corresponding variable, 18 i.e. reward for the goal-directed system, and action for the habit system. In the habit system, the 19 prediction error is equal to a difference between action and expectation (blue Equation 5.2). Such 20 error can be easily computed in a network of Figure 5B , where the dopaminergic neurons in the habit 21 system receive effective input form the output nuclei equal to (as they receive inhibition equal to 22 ), and inhibition from the striatal neurons. In the goal-directed system, the prediction error is 23 proportional to the difference between the reward and the expectation (orange Equation 5.2). The 24 neurons computing prediction error in the goal-directed system in the network in Figure 5B receive 25 input equal to reward, so let us now consider where the inhibition equal to the expectation could 26 originate. The prediction error node receives a related term from the striatum ( Figure 5B ), and 27 this input could be normalized by the activity of the error node itself , to result in an effective input 28 . Such input would still need to be scaled by action intensity to compute the expectation. 
Further 29 work is required to understand how such scaling may be achieved and one possibility is thorough an 30 input from the output nuclei (included in Figure 5B ) (Watabe-Uchida, Zhu, Ogawa, Vamanrao, & 31 Uchida, 2012). 32 The prediction errors are used to update the parameters of the distributions represented by the 33 systems, which are encoded in the weights of cortico-striatal connections. Once the actual reward is 34 obtained, the learning of these parameters could be achieved through local synaptic plasticity 35 dependent on dopaminergic modulation. In the goal-directed system, orange Equation 5. 4 36 corresponds to local plasticity, if at the time of reward the striatal neurons encode information about 37 action intensity (see definition of in Figure 5C ). Such information could be provided from the 38 thalamus during action execution. Then the update of synaptic weight encoding parameter will 39 correspond to a standard three-factor rule ( neurons (such as leak conductance). In such an extended model, the action proposals of the two 46 systems are weighted according to their certainties. As described in the Methods, a simple model of 1 the valuation system based on a critic from standard reinforcement learning is employed in 2 simulations (because the simulations correspond to a case of low level of animal's reserves). Striatal 3 neurons in the valuation system compute the reward expected in a current state as = , where 4 is a parameter encoded in cortico-striatal weights. The dopaminergic neurons in the valuation system 5 encode the prediction error similar to that in the temporal-difference learning model, and after 6 reward delivery, they modulate plasticity of cortico-striatal connections. The Method section also 7 provides details of the implementation and simulations of the model. 8 Simulations of action intensity selection 9 To illustrate how the model mechanistically operates and to help relate it to experimental data, we 10 now describe a simulation of the model inferring action intensity. On each simulated trial the model 11 selected action intensity, after observing a stimulus, which was set to = 1. The reward obtained 12 depended on action intensity as shown in Figure 6A , according to = ta h( / ) 1 . Thus, the 13 reward was proportional to the action intensity, transformed through a saturating function, and a cost 14 was subtracted proportional to the action intensity, that could correspond to a price for making an 15 effort. We also added small Gaussian noise to reward with standard deviation = (to account 16 for randomness in the environment), and to action intensity with standard deviation = (to 17 account for imprecision of the motor system or exploration). 18 Figure 6B shows how the prediction errors and action intensity changed within a trial. The left display 19 corresponds to one of the early trials. The stimulus was presented at time 1. The valuation system 20 detected that a reward was available, which initially resulted in a prediction error in the goal-directed 21 system, visible as an increase in the orange curve. This prediction error triggered a process of action 22 planning, so with time the green curve representing planned action intensity increased. Once the 23 action plan has been formulated, it provided a reward expectation, so the orange prediction error 24 decreased. 
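To connect these pieces, here is a compact end-to-end sketch of the action-intensity simulation under strong simplifications: unit variances in both systems (so the adaptive certainty weighting is omitted), a simple delta-rule critic in place of the temporal-difference learner, and illustrative constants throughout. In particular, the constants of the saturating reward function are not legible in the text above; the values below were chosen only so that the maximum reward is about 4 at an intensity of about 9, matching the values reported for Figure 6A. The weight names q_v, q_g, q_h and all learning rates are ours.

```python
import numpy as np

rng = np.random.default_rng(3)

def reward(a):
    """Illustrative reward schedule: saturating gain minus a linear effort cost."""
    return 6.5 * np.tanh(a / 6.0) - 0.2 * a + rng.normal(0.0, 0.2)

def plan(s, v, q_g, q_h, dt=0.05, T=10.0):
    """Within-trial planning: change a along the gradient of F (unit variances)."""
    a = 0.0
    for _ in range(int(T / dt)):
        delta_g = v - q_g * s * a              # reward available vs. reward expected from the plan
        a += dt * (delta_g * q_g * s + q_h * s - a)
    return a

s = 1.0
q_v, q_g, q_h = 1.0, 0.3, 0.0                  # small positive initial weights so the first trials produce some action
alpha_v, alpha_g, alpha_h = 0.05, 0.005, 0.05  # smaller rate for q_g keeps its update stable at large intensities

for trial in range(2000):
    v = q_v * s                                            # reward available, as estimated by the valuation system
    a = max(plan(s, v, q_g, q_h) + rng.normal(0.0, 0.5), 0.0)   # motor noise / exploration
    r = reward(a)
    q_v += alpha_v * (r - q_v * s) * s                     # valuation system (delta rule, not full TD)
    q_g += alpha_g * (r - q_g * s * a) * s * a             # goal-directed system
    q_h += alpha_h * (a - q_h * s) * s                     # habit system

print("q_v:", round(q_v, 2), " q_g:", round(q_g, 2), " q_h:", round(q_h, 2))
```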
As the habit system has not formed significant habits on early trials, it was surprised by the 25 chosen action, and this high value of blue prediction error drove its learning over trials. For simplicity, 26 the simulation is shown for the entire period of 2 time units, but in a real neural system, the action is 27 likely to be executed once converges, and the resulting movements would change and thus the 28 habit prediction error. Therefore, the increase in the habit prediction error would be transient rather 29 than sustained (as depicted in the figure). Extending the model to detect convergence of action 30 intensity will be an important direction of future work and we come back to it in the Discussion. 31 The middle display in Figure 6B shows the same quantities on a trial that followed an extensive 32 training. Now the habit system was highly trained and rapidly drove action planning, so the green 33 curve showing planned action intensity increased more rapidly and to a higher level. Nevertheless, 34 due to the dynamics in the model, the increase in action intensity was not instant, so there was a 35 transient negative prediction error in the habit system while an action was not yet equal to the value 36 predicted by the habit system. After this transient, the habit system was no longer surprised by the 37 planned action, thus the blue prediction error converged to a value close to 0. At this stage of training, 38 the goal-directed system was too slow to lead action planning, so the orange prediction error was 39 lower. Finally, the red prediction error in the valuation system behaved as expected from the standard 40 temporal difference learning model, i.e. a response to a stimulus predicting a reward was produced 41 even after extensive training. 42 Dopaminergic neurons in the model are only required to facilitate planning in the goal-directed 43 system, where they increase excitability of striatal neurons, but not in the habit system. To illustrate 44 it, right display in Figure 6B shows simulations of a complete dopamine depletion in the model. It 45 shows action intensity produced by the model in which following training, all dopaminergic neurons 1 were set to 0. After just 9 trials of training, on the 10th trial, the model was unable to plan an action. 2 By contrast, after 999 training trials, the model was still able to produce a habitual response, because 3 dopaminergic neurons are not required for generating habitual responses in the model. This parallels 4 the experimentally observed robustness of habitual responses to blocking dopaminergic modulation 5 ( display shows how the parameters in the three systems changed during learning. The parameter of 14 the valuation system correctly converged to the maximum value of the reward available in the task 15 ≈ 4 (i.e. the maximum of the curve in Figure 6A ). The parameter of the habit system correctly 16 converged to action intensity resulting in the maximum reward ≈ 9 ( Figure 6A ). The parameter of 17 the goal-directed system converged to a vicinity of ≈ 4/9, which allows the goal-directed system to 18 expect the reward of 4 after selecting an action with intensity 9 (see orange Equation 3.2). 19 The middle display in Figure 6C shows how the variance parameters in the goal-directed and habit 1 systems changed during the simulation. The variance of the habit system was initialised to a high 2 value, and it decreased over time, resulting in an increased confidence of the habit system. 
The middle 3 display also shows the action intensity produced in the model. On average it was close to ≈ 9, which 4 was approximately the value resulting in the maximum reward ( Figure 6A ). Although the model 5 discovered this optimal action intensity quite rapidly, on a first few hundred trials, the inferred value 6 of action was quite variable, because it relied on the inference in the goal-directed system, which in 7 turn relied on the valuation system. The fluctuations in the parameters of these systems arising from 8 learning from noisy rewards resulted in variable action intensity. However, once the habit system 9 became more confident, the produced action intensity became closer to the optimum value, as the 10 fluctuations in its parameter were lower than for the other system (left display in Figure 6C ). This 11 illustrates the benefits of Bayesian inference mentioned earlier, that in the face of imprecision in the 12 goal-directed system, a more accurate action intensity can be identified by also relying on a prior, 13 which here was encoded in the habit system. 14 The right display in Figure 6C shows prediction errors after the stimulus in the three systems. The 15 prediction error of the valuation system simply followed the value of the stimulus estimated by that 16 system. The prediction error in the goal-directed system decreased on later trials, when it became 17 slower than the habit system and no longer could lead action planning. There was a prediction error 18 in the habit system on the initial trials, but it decreased once the action became habitual. 19 Although we demonstrated that including the habit system brings benefits of more accurate actions 20 ( Figure 6C , middle) and faster planning ( Figure 6B ), it may also have costs of excessive perseveration. 21 Such perseveration is typically studied experimentally in tasks where rats have to press a lever multiple 22 times to get a reward. As mentioned earlier, such tasks could be conceptualized as a choice of the 23 frequency of pressing a lever, that could also be described by a single number . Furthermore, the 24 average reward rate experienced by an animal may correspond to a non-monotonic function similar 25 to that in Figure 6A , because with more presses more rewards is typically received, but beyond a 26 certain frequency, there may be a large effort cost. Therefore, we will continue to use the model 27 described above to illustrate key qualitative effects known in the literature. 28 Figure 7A shows a simulation paralleling the reward omission protocol. The model has been simulated 29 with reward depending on action intensity as in Figure 6A for a particular number of trials shown on 30 x-axis of Figure 7A , and then the reward was not delivered so in simulations = 1 reflecting just 31 a cost connected with making an effort. The figure presents the average action frequency selected by 32 the model after 500 trials since the rewards had been omitted. The action frequency increases with 33 longer training, which parallels experimental observation of decreased sensitivity to reward omission 34 with increased training (Dickinson, Squire, Varga, & Smith, 1998). When the rewards become omitted, 35 the optimal action frequency is = , as there is no benefit for action, but only a cost. Nevertheless, 36 if the model had made the same responses for many trials, the habit system increased its confidence 37 to the extent that it continued to drive action planning. 
In that case, the chosen action was close to 38 the habitual one, so there was little prediction error in the habit system ( ≈ ), despite no reward, 39 hence the habit system did not adjust significantly its behaviour or even its confidence. Although such 40 perseveration in absence of rewards is not useful for animals, it is a cost paid in this neural circuit for 41 the important benefits of fast and accurate habitual actions. 42 Figure 7B shows simulations of the model in a devaluation paradigm. In this paradigm, animals are 43 trained to press a level for reward (e.g. food), and following this training the reward is devalued (e.g. 44 the animals are fed to satiety, so they no longer desire the reward). It has been observed that animals 45 trained for a short period would not press a lever following devaluation, while after an extensive 46 training, the animal would press the lever, even though they no longer desire the reward (Dickinson, 1 1985) . To simulate this paradigm, the model was trained for a number of trials (shown on x-axis), and 2 then to simulate devaluation, the value computed by the valuation system was set to = 3 throughout the subsequent trial, as in a previous modelling study (Solway & Botvinick, 2012 ). Figure 4 7B shows the average action frequency inferred on this trial. It increased with training, analogously as 5 in the devaluation experiments (Dickinson, 1985) . the perseveration is more likely when the rewards are noisier. This effect occurs, because increasing 12 noise in reward increases the estimate of reward variance , and this parameter also encodes the 13 uncertainty of the goal-directed system. The increased uncertainty of the goal-directed systems makes 14 it more likely to give in to the habit system. This property of the model parallels experimental 15 observation that habitual behaviour is easier to produce in an experimental paradigm (variable 16 interval schedule) (Dickinson, Nicholas, & Adams, 1983 ), which has more variable reward probability 17 (Miller et al., 2019) . 18 In summary, this section described a model for a simple task in which an animal had to select an 19 intensity of a single action. Simulations of the model revealed a rich pattern of responses of different 20 populations of dopaminergic neurons at different stages of task acquisition, and we will relate them 21 to experimental data in Discussion. 22 Choice between two actions 23 This section shows how models developed within the DopAct framework can also describe more 24 complex tasks with multiple actions and multiple dimensions of state. We consider a task of choice 25 between two options, often used in experimental studies, as it allows illustrating the generalization, 26 and at the same time results in a relatively simple model. This section will also show that the models 27 developed in the framework can under certain assumptions be closely related to previously proposed 28 models of reinforcement learning and habit formation. 29 To make dimensionality of all variables and parameters explicit, we will denote vectors with a bar and 30 matrices with a bold font. Thus ̅ is a vector where different entries correspond to intensities of 31 different stimuli in an environment, and ̅ is a vector where different entries correspond to intensities 32 of different actions. Equation 8 .1 in Figure 8 shows how the definitions of the probability distributions 1 encoded by the goal-directed and habit systems can be generalized to multiple dimensions. 
Orange 2 Equation 8 .1 states that the reward expected by the goal-directed system has mean ̅ ̅ , where 3 is now a matrix of parameters. This notation highlights the link with the standard reinforcement 4 learning, where the expected reward for selecting action in state is denoted by : Note that if ̅ 5 and ̅ are both binary vectors with entries and equal to 1 in the corresponding vectors, and all other 6 entries equal to 0, then ̅ ̅ is equal to the element of matrix . In the model, the prior probability is proportional to a product of three distributions. The first of them 11 is encoded by the habit system and given in blue Equation 8 .1. The expected action intensity encoded 12 in the habit system has mean ̅ , and this notation highlights the analogy with a recent model of habit 13 formation (Miller et al., 2019) where a tendency to select action in state is also denoted by . 14 Additionally, we introduce another prior given in Equation 8 .2, which ensures that only one action has 15 intensity significantly deviating from 0. Furthermore, to link the framework with classical 16 reinforcement learning, we enforce a third condition ensuring that action intensity remains between 17 0 and 1 (Equation 8 .3). These additional priors will often result in one entry of ̅ converging to 1, while 18 all other entries decaying towards 0 due to competition. Since in our simulations we also use a binary 19 state vector, the reward expected by the goal-directed system will often be equal to as in the 20 classical reinforcement learning (see paragraph above). 21 Methods section derives equations describing inference and learning for the above probabilistic action with a larger initial input is likely to win the competition, so the action for which the right hand 28 side of Equation 8.4 is highest is most likely to be selected. Furthermore, after selection of action in 29 state , the parameters are updated according to Equations 8.5. Namely, the parameter describing 30 expected reward for action in state is modified proportionally to a reward prediction error, as in 31 classical reinforcement learning. Additionally, for every action and current state the parameter 32 describing a tendency to take this action is modified proportionally to a prediction error equal to a 33 difference between the intensity of this action and the intensity expected by The similarity of a model developed in the DopAct framework to classical reinforcement learning, 1 which has been designed to maximize resources, highlights that the model also tends to maximize 2 resources, when animal's reserves are sufficiently low. But the framework is additionally adaptive to 3 the levels of reserves: If the reserves were at the desired level, then = during action planning, so 4 according to Equation 8 .4, the goal-directed system would not suggest any action. 5 Weighting the contribution of the goal-directed system by in Equation 8.4 also has another function: 6 it can bring the contributions of the two systems to the same range, irrespectively of the magnitudes 7 of reward used in the task. Note that in the orange term of Equation 8.4, both and have units of 8 reward magnitude, while has units of reward magnitude squared (because it is a variance), hence 9 the whole orange term is unit-less. The blue term in Equation 8 .4 corresponding to the habit system 10 is also unit-less. 
Therefore, if a task has stochastic rewards, and their magnitude is scaled up, the value 11 of the orange term will not change to a different range, which allows the habit system to contribute 12 to the action selection process irrespective of the scale of rewards. 13 In the case of deterministic rewards or behaviour, the variance parameters may approach 0 due to 14 learning over trials, which would result in both terms in Equation 8.4 diverging to infinity. To prevent 15 this from happening, a constraint (or a "hyperprior") on the minimum value of the variance 16 parameters needs to be introduced (in all simulations, if or decreased below 0.2, it was set to 17 0.2). 18 The Methods section describes how the inference and learning can be implemented in a generalized 19 version of the network described above. In this network, striatum, output nuclei and thalamus To illustrate predictions made by the model, we simulated it in a probabilistic reversal task. On each 29 trial, the model was "presented" with one of two "stimuli", i.e. one randomly chosen entry of vector 30 ̅ was set to 1, while the other entry was set to 0. On the initial 150 trials, the correct response was to 31 select action 1 for stimulus 1 and action 2 for stimulus 2, while on the subsequent trials, the correct 32 responses were reversed. The mean reward was equal to 1 for a correct response and 0 for an error. 33 In each case, a Gaussian noise with standard deviation = was added to the reward. 34 Figure 9A shows changes in action intensity and inputs from goal-directed and habit systems as a 35 function of time on different trials within a simulation. On an early trial (left display) the changes in 36 action intensity were primarily driven by the goal-directed system. The intensity of the correct action 37 converged to 1, while it stayed at 0 for the incorrect one. After substantial training (middle display), 38 the changes in action intensity were primarily driven by the habit system. Following a reversal (right 39 display) one can observe a competition between the two systems: Although the goal-directed system 40 had already learned the new contingency (solid orange curve), the habit system still provided larger 41 input to the incorrect action node (dashed blue curve). Since the habit system was faster, the incorrect 42 action had higher intensity initially, and only with time, the correct action node received input from 43 the goal-directed system, and inhibited the incorrect one. 44 1 Figure 9 . Simulation of the model of choice between two actions. A) Changes in action intensity and 2 inputs from the goal-directed and habit systems, defined below Equation 8.4. Solid lines correspond 3 to a correct action and dashed lines to an error. Thus selecting action 1 for stimulus 1 (or action 2 for 4 stimulus 2) correspond to solid lines in left and middle display (before reversal) and to dashed lines in 5 the right display (after reversal). B) Changes in model parameters and prediction errors across trials. 6 Dashed black lines indicate a reversal trial. 7 Figure 9B shows how parameters and prediction errors in the model changed over trials. Left display 8 illustrates changes in sample cortico-striatal weights in the three systems. The valuation system 9 rapidly learned the value available after the first stimulus, but after reversal this estimate decreased, 10 as the model persevered in choosing the incorrect option. Once the model discovered the new rule, 11 the estimated value of the stimulus increased. 
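The two-action model described above can also be sketched compactly in code. The exact forms of Equations 8.4 and 8.5 are not reproduced in the text, so the dynamics below are only one plausible reading: each action node integrates a goal-directed drive weighted by the desired value, a habit drive, lateral inhibition and a decay, with intensities clipped to [0, 1]. The variances are fixed, so the certainty weighting and the floor described above are omitted, and the matrix names Q_g and Q_h are ours. The sketch covers only the initial acquisition of the stimulus-action mapping; reproducing the reversal dynamics of Figure 9 would additionally require the valuation system and the adaptive variances.

```python
import numpy as np

rng = np.random.default_rng(4)

n_actions = n_stimuli = 2
Q_g = np.zeros((n_actions, n_stimuli))   # expected reward for (action, stimulus) pairs
Q_h = np.zeros((n_actions, n_stimuli))   # habitual tendency for (action, stimulus) pairs

def plan(s_vec, v, dt=0.05, T=5.0, inhibition=2.0):
    """Competitive dynamics over action intensities (one plausible reading of Equation 8.4)."""
    a = np.zeros(n_actions)
    for _ in range(int(T / dt)):
        goal = v * (Q_g @ s_vec)                 # goal-directed drive, weighted by the desired value
        habit = Q_h @ s_vec                      # habit drive
        lateral = inhibition * (a.sum() - a)     # competition between action nodes
        a = np.clip(a + dt * (goal + habit - lateral - a), 0.0, 1.0)
    return a

def learn(s_vec, a, r, alpha=0.2):
    """Parameter updates in the spirit of Equations 8.5 (our reading)."""
    global Q_g, Q_h
    delta_g = r - a @ Q_g @ s_vec                # reward prediction error
    delta_h = a - Q_h @ s_vec                    # habit prediction errors, one per action
    Q_g += alpha * delta_g * np.outer(a, s_vec)
    Q_h += alpha * np.outer(delta_h, s_vec)

# Acquisition: action 0 is rewarded for stimulus 0 and action 1 for stimulus 1.
for trial in range(200):
    stim = trial % 2
    s_vec = np.eye(n_stimuli)[stim]
    a = np.clip(plan(s_vec, v=1.0) + rng.normal(0.0, 0.1, n_actions), 0.0, 1.0)  # exploration noise
    r = (1.0 if int(np.argmax(a)) == stim else 0.0) + rng.normal(0.0, 0.1)
    learn(s_vec, a, r)

print(np.round(Q_g, 2))
print(np.round(Q_h, 2))
```

With these arbitrary settings, both weight matrices typically become approximately diagonal: each stimulus comes to predict reward for its rewarded action, and the habit matrix mirrors the actions that were actually chosen.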
The goal-directed system learned that selecting the 12 first action after the first stimulus gave higher rewards before reversal, but not after. The changes in 13 the parameters of the habit system followed those in the goal-directed system. The middle display 14 shows that the variance estimated by the habit system initially decreased, but then increased several 15 trials after the reversal, when the goal-directed system discovered the new contingency, and thus 16 selected actions differed from the habitual ones. The right display shows an analogous pattern in 17 dopaminergic activity, where the neurons in the habit system signalled higher prediction errors 18 following a reversal. The prediction errors in the goal-directed system increased in a period after 19 reversal, when the agent started to explore the option that had been unrewarded before reversal but 20 now gave higher rewards. These simulations illustrate several experimental predictions of the model 21 to which we will come back in Discussion. 22 In this paper, we proposed how an action can be identified through Bayesian inference, where the 24 habit system provides a prior and the goal-directed system represents reward likelihood. Within the 25 DopAct framework, the goal-directed and habit systems may not be viewed as fundamentally different 1 systems, but rather as analogous segments of neural machinery performing inference in a hierarchical 2 probabilistic model ( Figure 1B) , which correspond to different levels of hierarchy. 3 In this section, we discuss the relationship of the framework to other theories and experimental data, 4 and suggest experimental predictions and directions for future theoretic work. 5 Relationship to other theories 6 The DopAct framework combines elements from four theories: reinforcement learning, active 7 inference, habit formation, and planning as inference. For each of the theories we summarize 8 similarities, and highlight the ways in which the DopAct framework extends them. 9 As in classical reinforcement learning (Houk et al., 1995; Montague et al., 1996) , in the DopAct 10 framework the dopaminergic neurons in the valuation and goal-directed systems encode reward 11 prediction errors. Furthermore, similarly to reinforcement learning models, the parameters describing 12 expected reward are encoded in cortico-striatal weights. The learning rules for these parameters aim 13 at minimizing prediction errors over time, and involve tri-factor Hebbian plasticity often used in 14 models of reinforcement learning (Frémaux & Gerstner, 2016; Kuśmierz et al., 2017; Roelfsema & 15 Holtmaat, 2018). However, the key conceptual difference of the DopAct framework is that it assumes 16 that the goal of animals' behaviour is to achieve a desired level of reserves, rather than always 17 maximize acquiring resources. It has been proposed that when a physiological state is considered, the 18 reward an animal aims to maximize can be defined as a reduction of distance between the current 19 and desired levels of reserves (Juechems & Summerfield, 2019; Keramati & Gutkin, 2014) . Under this 20 definition, a resource is equal to such subjective reward only if consuming it would not bring the 21 animal beyond its optimal reserve level. When an animal is close to the desired level, acquiring a 22 resource may even move the animal further from the desired level, resulting in a negative subjective 23 reward. 
As standard reinforcement learning algorithms do not consider physiological state, they do not always maximize the subjective reward defined in this way. Nevertheless, we highlighted (Figure 8) that when the level of reserves is low, the framework can produce behaviour similar to standard reinforcement learning models (before an action becomes habitual), but importantly, the framework offers the flexibility to stop acquiring resources once the reserves reach the optimal level.

The DopAct framework relies on key high-level concepts from the active inference theory (Buckley et al., 2017; Friston, 2010): animals aim to reach the desired level of reserves, prediction errors can be minimized by both learning and action planning, and both of these processes can be derived from minimization of free energy. In the DopAct framework, the neurons encoding prediction errors affect both the plasticity and the activity of their target neurons, analogously to previous predictive coding architectures derived through free-energy minimization (Friston, 2005). In addition to being a conceptual model, active inference also describes a mechanistic model computing the optimal action (Friston et al., 2013), but the details of the neural implementation of active inference in that model are very different from those in the DopAct framework. Nevertheless, the function of dopamine is related to that proposed in a past active inference model: in that model it encodes precision, i.e. the inverse variance of the choice policy (Friston et al., 2013), while in the DopAct framework it encodes a precision-weighted prediction error in the goal-directed system.

We have demonstrated that, for a certain set of assumptions, the DopAct framework describes choice and learning processes in a very similar way to a recent model of habit formation (Miller et al., 2019). Both in that model and in the DopAct framework, the parameters of the habit system are updated on the basis of prediction errors that do not depend on reward, but rather encode the difference between the chosen and habitual actions. The simulations of several paradigms in this paper parallel those in that previous study (Miller et al., 2019). The key new contribution of this paper is to offer a normative rationale for how such a model can arise from Bayesian inference in which the habit system provides a prior. Furthermore, we proposed how learning in this model can be implemented in the basal ganglia circuit, including multiple populations of dopaminergic neurons encoding different prediction errors.

As in the model describing goal-directed decision making as probabilistic inference (Solway & Botvinick, 2012), the actions selected in the DopAct framework maximize the posterior probability of an action given the reward. Our simulations of behaviour in devaluation experiments were inspired by the simulations in that study. The new contribution of this paper is to make explicit the rationale for why such probabilistic inference is the right thing for the brain to do: the resource that should be acquired in a given state depends on the level of reserves, so the inferred action should depend on what portion of the available reward is necessary to restore the reserves. We also propose a detailed implementation of the probabilistic inference in the basal ganglia circuit.

It is useful to discuss the relationship of the DopAct framework to several other theories.
It has been suggested that action planning involves a competition between model-based and model-free systems, located in the prefrontal cortex and the striatum, respectively (Daw, Niv, & Dayan, 2005). Here we propose

Relationship to experimental data

Relating the DopAct framework to experimental data is not fully straightforward for two reasons. First, the tasks simulated in this paper are much simpler than the tasks studied in experiments, in which stimuli and actions have multiple dimensions and animals often need to make predictions about the temporal effects of their behaviour. To fully account for the wide repertoire of neural responses observed in these tasks, more complex models would need to be developed and trained within the DopAct framework. Nevertheless, in this section we will extrapolate from the presented simulations and relate their qualitative patterns to the available data. Second, to relate the simulations to data, we will need to assume a particular mapping of the different systems onto anatomically defined brain regions. Thus we will assume that dopaminergic neurons in the valuation, goal-directed, and habit systems can be mapped onto a spectrum of dopaminergic neurons ranging from the ventral tegmental area (VTA), which includes the valuation system, to the substantia nigra pars compacta (SNc), which includes the habit system. The mapping of the dopaminergic neurons of the goal-directed system is less clear, so let us assume that these neurons may be present in both areas. Furthermore, we will assume that the striatal neurons in the valuation, goal-directed, and habit systems can be approximately mapped onto the ventral, dorsomedial, and dorsolateral striatum. However, the neurons corresponding to the different systems may not be perfectly separated in space. Keeping these limitations in mind, let us consider the relationship of the DopAct framework to data on functional anatomy, physiology and behaviour.

In the DopAct framework the role of dopamine during action planning is specific to preparing goal-directed, but not habitual, movements (Figure 6B, right). Thus the framework is consistent with the observation that blocking dopaminergic transmission slows responses to reward-predicting cues early in training, but not after extensive training, when the responses have presumably become habitual (Choi et al., 2005). Analogously, the DopAct framework is consistent with an impairment in Parkinson's disease for goal-directed but not habitual choices (de Wit, Barker, Dickinson, & Cools, 2011), or for voluntary but not cue-driven movements (Johnson et al., 2016). The difficulty in movement initiation in Parkinson's disease seems to depend on whether the action is voluntary or made in response to a stimulus, so even highly practiced movements like walking may be difficult if performed voluntarily, but easier in response to auditory or visual cues (Rochester et al., 2005). Such cue-driven movements are likely to engage the habit system, because responding to stimuli is a hallmark of habitual behaviour (Dickinson & Balleine, 2002).

Dopaminergic modulation of plasticity is required for forming habits in the DopAct framework. Thus the framework is consistent with the observation that lesions to the SNc (which we assumed to contain the neurons encoding the habit prediction error) prevent habit formation (Faure, Haberland, Condé, & El Massioui, 2005).
In that study, the lesioned animals could learn to perform an action leading to reward, but after extensive training they performed the action after reward devaluation less frequently than the control animals did (implying an impairment in habit formation).

The mapping of the goal-directed and habit systems onto the dorsomedial and dorsolateral striatum is

Let us now discuss the relationship of the DopAct framework to the responses of dopaminergic neurons. These responses have been recorded in different behavioural paradigms, which we discuss in turn. In classical conditioning, dopaminergic neurons in VTA have been shown to encode reward prediction error (Eshel et al., 2016; Schultz et al., 1997; Tobler et al., 2005). Since in the DopAct framework the valuation system is similar to the standard temporal difference learning model, it inherits the ability to account for the dopaminergic responses to unexpected rewards previously explained with that model. It is intriguing to ask whether some of the observed dopaminergic responses to a conditioned stimulus reflect the prediction error in the goal-directed system. The motivation for this question is that even in classical conditioning tasks the animal needs to perform some actions to consume the reward, e.g. swallow it, and it has been reported that animals start to lick after the conditioned stimulus (Tobler et al., 2005). To answer this question, one would need to analyse how the responses to the conditioned stimulus change across trials. According to the DopAct framework, dopaminergic responses in the goal-directed system should diminish as the action becomes habitual (Figure 6C, right). Thus if there existed dopaminergic neurons that encoded reward prediction error in early stages of classical conditioning but whose responses diminished in later stages, the DopAct framework would suggest that they are part of the goal-directed system.

In operant conditioning tasks, dopaminergic responses to movements performed in order to obtain a reward were observed in both VTA and SNc (Engelhard et al., 2019; Schultz, 1986). Analogous dopaminergic responses were produced in all systems of the DopAct framework in simulations (Figure 6B, left). At the time of reward delivery, the DopAct framework predicts a response in the valuation and goal-directed systems, but not in the habit system. Accordingly, a much larger fraction of neurons has been reported to respond to reward in VTA (Engelhard et al., 2019) than in SNc (Schultz, 1986). The DopAct framework also predicts that the responses to movements should be modulated by reward magnitude in the valuation and goal-directed systems, but not in the habit system. This prediction can be compared with data from a task in which animals could press one of two levers that differed in the magnitude of the resulting rewards (Jin & Costa, 2010). As mentioned above, the framework predicts that the dopaminergic neurons in the valuation and goal-directed systems would respond differently depending on which lever was pressed, while the dopaminergic neurons in the habit system would have responses that depend only on action intensity and not on reward magnitude. Indeed, diverse dopaminergic neurons have been observed in SNc, and these neurons differed in whether their movement-related responses depended on the available reward (Figure 4j of Jin and Costa (2010)).
Dopaminergic responses have also been observed in a task in which mice could make spontaneous movements and reward was delivered at random times (Howe & Dombeck, 2016). A fraction of dopaminergic neurons showed increased responses to rewards, while another group responded to movements; moreover, the reward-responding neurons were located in VTA, while most movement-responding neurons were in SNc (Howe & Dombeck, 2016). In that study the rewards were delivered irrespective of the animals' movements, so the movements they generated were most likely not driven by processes aiming at obtaining reward (simulated in this paper), but rather by other inputs (modelled by noise in our simulations). To relate this task to the DopAct framework, let us consider the prediction errors likely to occur at the times of reward and of movement. At the time of reward the animal was not able to predict it, so the prediction error in the valuation system was positive, but the animal was not necessarily making any movement, so the prediction error in the habit system was close to zero. Conversely, at the time of a movement the animal might not have expected any reward, so the reward prediction errors were close to zero, but it might have made a non-habitual movement, producing a positive prediction error in the habit system. Hence the framework predicts that separate groups of dopaminergic neurons produce responses at the times of reward and of movement, as observed experimentally (Howe & Dombeck, 2016). Furthermore, the peak of the movement-related response of SNc neurons was observed to occur after movement onset (Howe & Dombeck, 2016), which suggests that most of this dopaminergic activity was a response to a movement rather than a response initiating a movement. This timing is consistent with the role of dopaminergic neurons in the habit system, which compute a movement prediction error rather than initiate movements.

In tasks involving both operant conditioning and spontaneous movements, it has been observed that a fraction of dopaminergic neurons in SNc pause during movements (Dodson et al., 2016; Schultz et al., 1983). It is intriguing to ask whether these decreased responses could correspond to the brief decreases in the activity of dopaminergic neurons in the habit system present during simulations of habitual movements (Figure 6B, middle). This may not be the case, because the decreases in the simulation are very brief, while they could last around 200 ms in SNc neurons (Dodson et al., 2016). Furthermore, a recent study suggests that the decreases in the activity of dopaminergic neurons specifically occur during brief movements ("jerks") that do not develop into longer movements (Howe et al., 2019), and differences in movement duration were not modelled in the presented simulations. Nevertheless, further investigation of pauses in dopaminergic activity in relation to the DopAct framework may be an interesting direction for future work.

The DopAct framework is consistent with behavioural data connected with habit formation. Analogously to a previous model of habit formation (Miller et al., 2019), it accounts for the effects of training duration and protocol on perseveration in omission and devaluation paradigms (Figure 7). Additionally, the framework describes the dynamics of the competition between the systems during action planning. In a recent study, human participants were extensively trained to make particular responses to given stimuli (Hardwick, Forrence, Krakauer, & Haith, 2018).
After a reversal, they tended to produce incorrect habitual actions when required to respond rapidly, but were able to produce the correct actions when given sufficient time. Analogous behaviour of the model is shown in the right display of Figure 9A, where the faster habit system initially prepares an incorrect action, but later the slower goal-directed system increases the intensity of the correct action.

Furthermore, the DopAct framework accounts for the observation that animals are less likely to persevere after a reversal if larger rewards are used (Theios & Blosser, 1965). An important detail of this experiment was that the rewards were deterministic and of constant magnitude within conditions. Such an effect of deterministic reward magnitude on reversals is consistent with the model, because the magnitude of the available reward scales the contribution of the goal-directed system (Equation 8.4), but with deterministic rewards the variance term that normalizes the goal-directed contribution is little affected by reward magnitude (it may stay at its lower boundary determined by the hyperprior). Thus with higher deterministic rewards, the goal-directed system is more likely to overcome the tendency of the habit system to persevere following a reversal.

Experimental predictions

The analysis in the previous paragraph suggests that the effect of reward magnitude on the tendency to persevere after a reversal (Theios & Blosser, 1965) is a result of using deterministic rewards. Thus the DopAct framework predicts that if this study were repeated with stochastic rewards (typical of a natural environment), the contribution of the goal-directed system (Equation 8.4) would be correctly normalized by the variance, and the effect would disappear.

The DopAct framework predicts distinct patterns of activity for different populations of dopaminergic neurons. Dopaminergic neurons in the habit system should respond more to movements when the movements are not habitual, e.g. in an initial phase of task acquisition or after a reversal (Figure 9C). When the movements become highly habitual, these neurons should more often produce brief decreases in response (Figure 6B). Furthermore, when the choices become mostly driven by the habit system (as was the case in the simulation of Figure 6, where the variance of the habit system decreased far below the variance of the goal-directed system; middle display in panel C), dopaminergic neurons in the goal-directed system should no longer signal reward prediction error (Figure 6C, right). By contrast, the dopaminergic neurons in the valuation system should signal reward prediction error irrespective of the stage of task acquisition (Figure 6C, right).

A central feature of the DopAct framework is that the expectation of the reward in the goal-directed system arises from forming a motor plan to obtain it. Thus the framework predicts that the dopaminergic responses in the goal-directed system to stimuli predicting a reward should last longer if planning the actions needed to obtain the reward takes more time. One way to test this prediction would be to optogenetically block striatal neurons expressing D1 receptors in the goal-directed system for a fixed period after the onset of a stimulus, so that the action plan cannot be formed. The framework predicts that such a manipulation should prolong the response of dopaminergic neurons in that system.
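The scaling argument about deterministic rewards can be illustrated with a rough sketch. The functional form below is an assumption standing in for Equation 8.4 (which is not reproduced here); it is intended only to show the role of the variance floor.

```python
def goal_directed_pull(available_reward, learned_variance, variance_floor=0.2):
    """Rough sketch of the scaling argument above: the goal-directed
    contribution grows with the available reward but is normalised by the
    learned reward variance, which the hyperprior keeps above variance_floor."""
    return available_reward / max(learned_variance, variance_floor)

# Deterministic rewards: the learned variance sits at the floor, so doubling
# the reward doubles the goal-directed pull and reduces perseveration.
print(goal_directed_pull(1.0, learned_variance=0.0))   # 5.0
print(goal_directed_pull(2.0, learned_variance=0.0))   # 10.0
# With stochastic rewards the learned variance itself grows with the reward
# scale, so (as predicted above) the magnitude effect should largely vanish.
```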
In the DopAct framework dopaminergic neurons increase the gain of striatal neurons during action planning only in the goal-directed system, not in the habit system. Therefore, the framework predicts that dopamine concentration should have a larger effect on the slope of the firing-rate versus input curves of striatal neurons in the goal-directed system than in the habit system. This prediction may seem surprising, because striatal neurons express dopaminergic receptors throughout the striatum (Huntley, Morrison, Prikhozhan, & Sealfon, 1992). Nevertheless, it is consistent with the reduced effect of dopamine blockade on the initiation of habitual movements (Choi et al., 2005), which are known to rely on the dorsolateral striatum (Yin et al., 2004). Accordingly, the DopAct framework predicts that the dopaminergic modulation of the dorsolateral striatum should primarily affect the plasticity rather than the excitability of striatal neurons.

Directions for future work

This paper described a general framework for understanding the function of dopaminergic neurons in the basal ganglia, and presented simple models capturing only a subset of the experimental data. To describe responses observed in tasks of more realistic complexity, models could be developed following a procedure similar to the one used in this paper: a probabilistic model could be formulated for a task, and a network minimizing the corresponding free energy could be derived, simulated and compared with experimental data. This section highlights key experimental observations that the models described in this paper are unable to capture, and suggests directions for developing models consistent with them.

The presented models do not mechanistically explain the dependence of dopamine release in the ventral striatum on motivational state, such as hunger or thirst (Papageorgiou, Baudonnat, Cucca, & Walton, 2016). To reproduce these activity patterns, it will be important to extend the framework to describe the computations in the valuation system.

The models do not describe how striatal neurons distinguish whether the dopaminergic prediction error signal should affect their plasticity or their excitability, and for simplicity the presented simulations separated the planning and learning processes in time. However, as illustrated in Figure 2, the same dopaminergic signal may need to trigger plasticity in one group of striatal neurons (selective for a past action) and changes in excitability in another group (selective for a future action). It will be important to further understand the mechanisms that striatal neurons may employ to react appropriately to dopamine signals (Berke, 2018).

The models presented in this paper described only a part of the basal ganglia circuit, and it will be important to also include other elements of the circuit. In particular, this paper focussed on the subset of striatal neurons expressing D1 receptors, which project directly to the output nuclei and facilitate movements, but another population expressing D2 receptors projects via an indirect pathway and inhibits movements (Kravitz et al., 2010). Several theories have been proposed for the function of these two classes of neurons. For example, it has been suggested that D1 neurons encode the payoffs of actions, while D2 neurons encode their costs (Collins & Frank, 2014; Möller & Bogacz, 2019).
Consequently, it would be interesting to investigate whether the framework can be extended so that the mean expected reward represented by the goal-directed system explicitly includes costs such as effort (e.g. the mean reward is assumed to follow a non-monotonic function of action intensity like that in Figure 6A). It could then be investigated whether the inference in such a modified probabilistic model could be mapped onto a network including the striatal D2 neurons. Furthermore, it has been shown how the D1 and D2 neurons can jointly encode the variability of rewards associated with selecting particular actions in a given state (Mikhael & Bogacz, 2016). It would be interesting to investigate whether that model can be used to extend the framework so that the variance of the reward distribution in the goal-directed system is learned for individual actions.

The basal ganglia circuit also includes a hyperdirect pathway, which contains the subthalamic nucleus. It has been proposed that a function of the subthalamic nucleus is to inhibit non-selected actions (Gurney et al., 2001), and the hyperdirect pathway may support the competition between the actions present in the framework. It will be important to investigate how the definition of the additional priors ensuring that only a single action is selected (Equation 8.2) can be extended to a choice between multiple actions, so as to effectively suppress multiple non-selected actions, and how enforcing such a prior might be achieved by the basal ganglia circuit. The subthalamic nucleus has also been proposed to be involved in determining when the planning process should finish and the action should be initiated (Frank et al., 2007). For simplicity, we have simulated the planning process for a fixed interval (Figures 6 and 9), but it will be important to extend the framework to describe the mechanisms initiating an action.

The presented models cannot reproduce the gradual ramping of the activity of dopaminergic neurons observed as animals approach rewards (Howe, Tierney, Sandberg, Phillips, & Graybiel, 2013). To be consistent with these data, the valuation system could be extended to incorporate synaptic decay, which has been shown to allow standard reinforcement learning models to reproduce the ramping of prediction error (Kato & Morita, 2016).

Selecting action intensity

We first describe an extended version of the actor that learns the uncertainties associated with the goal-directed and habit systems, and then provide the details of the dynamics of the simulated model. Substituting the probability densities of the likelihood and prior distributions (Equations 3.2-3.3) into Equation 4.1, we obtain an expression for the objective function in Equation 10.1 in Figure 10A, which has an analogous form as before but now includes the variance parameters. To find the action intensity, we change it according to the gradient of the objective function, as described in Equation 10.2. Analogously to the simple case described in the Results, the action intensity is driven by both the goal-directed and habit systems, but now their contributions are normalised by the variance parameters. For the habit system this normalization is stated explicitly in Equation 10.2, while for the goal-directed system it comes from a modified definition of the prediction error in the orange Equation 10.3. Since the variance of the habit system already normalizes its contribution in Equation 10.2, we choose not to include it in the definition of the habit prediction error.
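As an illustration of this planning step, the following sketch performs gradient ascent on the objective function under assumed linear-Gaussian forms (a reward likelihood with mean q times the intensity, and a habit prior centred on a habitual intensity). The symbols q, var_g and var_h are stand-ins rather than the notation of Figure 10A.

```python
def plan_action_intensity(available_reward, q, habit, var_g, var_h,
                          steps=200, dt=0.1, a=0.0):
    """Sketch of planning as gradient ascent on the objective function.
    The goal-directed prediction error is divided by its variance var_g
    (the 'modified' prediction error), and the habit term is divided by
    var_h, so each system's contribution is precision-weighted."""
    for _ in range(steps):
        delta_g = (available_reward - q * a) / var_g   # goal-directed error
        delta_h = (a - habit) / var_h                  # habit error
        a += dt * (q * delta_g - delta_h)              # ascend the gradient
    return a

# Early in training the habit is uncertain (large var_h), so the goal-directed
# system dominates; after training a small var_h pulls the plan toward habit.
print(plan_action_intensity(1.0, q=1.0, habit=0.0, var_g=1.0, var_h=10.0))
print(plan_action_intensity(1.0, q=1.0, habit=0.3, var_g=1.0, var_h=0.2))
```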
The update rules for the parameters encoding the means of the distributions remain the same as in Equation 5.4 (ignoring constants that can be incorporated into the learning rate). The update rule for the parameter describing the variance in the goal-directed system can be obtained by computing the derivative of the objective function, giving the orange Equation 10.5. Following an analogous procedure for the variance in the habit system, we would obtain a derivative containing terms divided by the variance, but to simplify this expression we scale it by the variance, resulting in the blue Equation 10.5. As the variance is a positive number, such scaling does not change the value to which the parameter converges.

There are several ways of including the variance parameters in the network, and as a proof of principle we show one of them here (Figure 10B). This network is very similar to that shown in Figure 5B, but now the projection to the output nuclei from the habit system is weighted by its precision, i.e. the reciprocal of the habit variance (to reflect the weighting factor in Equation 10.2), and the rate of decay (or relaxation to baseline) in the output nuclei could also be proportional to the corresponding variance parameter. One way to ensure that the prediction error in the goal-directed system is scaled by its variance is to encode the variance in the rate of decay, or leak, of these prediction error neurons (Bogacz, 2017; Friston, 2005). To see how the scaling of the prediction error by the variance in the orange Equation 10.3 arises, the dynamics of these prediction error neurons can be written as a differential equation, which is included as the orange Equation 11.2 in Figure 11. This equation includes at the end a decay term with a decay rate equal to the variance. To find the value to which the orange Equation 11.2 converges, we note that in equilibrium the derivative is 0, so by setting the left-hand side of the equation to 0 and solving for the prediction error we obtain its equilibrium value. This value is equal to that in the orange Equation 10.3, with the difference that the total reward is here replaced by the sum of the instantaneous reward and the available reward provided by the valuation system, which are the inputs to these prediction error neurons (Figure 1D). The updates of the variance parameters (Equations 10.5) depend only on the corresponding prediction errors and on the variance parameters themselves, so they could be implemented with local plasticity if the neurons encoding the variance parameters received the corresponding prediction errors.

Figure 11 provides a complete description of the dynamics of the simulated model. The orange and blue equations describe the actor and parallel those in Figure 10A, but now explicitly include time constants for the nodes encoding action intensity and prediction errors, as well as learning rates for the means of the distributions in the goal-directed and habit systems and for the variances.
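Before turning to the valuation system, a minimal numerical sketch of such a leaky prediction-error unit is given below; the assumed dynamics and symbols are illustrative, and the sketch only demonstrates the equilibrium argument above.

```python
def prediction_error_unit(instant_reward, available_reward, expectation,
                          var_g, tau=1.0, dt=0.01, steps=2000):
    """Sketch of a leaky prediction-error unit whose decay rate encodes the
    goal-directed variance var_g (cf. Bogacz, 2017; Friston, 2005), assuming
    tau * d(delta)/dt = (r + V - expectation) - var_g * delta.
    Setting the derivative to zero gives the equilibrium
    delta = (r + V - expectation) / var_g, a precision-weighted error."""
    delta = 0.0
    for _ in range(steps):
        drive = instant_reward + available_reward - expectation
        delta += (dt / tau) * (drive - var_g * delta)
    return delta

# Converges to (0.0 + 1.0 - 0.4) / 2.0 = 0.3
print(prediction_error_unit(0.0, 1.0, expectation=0.4, var_g=2.0))
```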
The red equations provide a description of the valuation system. According to the red Equation 11.1, the estimate of the value of the state converges in equilibrium to the learned value of the current state. To illustrate the dynamics of the prediction error of the valuation system, it is simulated according to the red Equation 11.2, which converges to the difference between the total reward and the expectation of that reward made at an earlier time (where a parameter describes how long ago the prediction was made). This prediction error is similar to that in standard temporal difference learning (Sutton & Barto, 1998), but for simplicity we do not represent the vector of parameters encoding the reward predicted at different moments in time. The dynamics of this prediction error after a conditioned stimulus will be similar to those in a trained temporal difference learning model, where all parameters encoding the reward expectation up to the time when the reward typically occurs converge to the same value (if the "discount factor" is set to 1). This prediction error is used only to illustrate the expected dynamics of dopaminergic neurons in the valuation system; it does not drive plasticity. Following reward delivery, the value parameter is modified according to the red Equation 11.3, in which the estimated value is taken at the end of the simulated planning phase on that trial and the update is scaled by a learning rate.

Figure 11. Dynamics of the model choosing action intensity. Red, orange and blue equations describe the valuation, goal-directed and habit systems, respectively.

We assume that the intensity of the action executed by the agent is equal to the inferred action intensity plus motor noise with a fixed standard deviation. For simplicity we do not explicitly simulate the dynamics of the model after the delivery of reward; instead, we fix the action intensity to that of the executed action, compute the prediction errors in the goal-directed and habit systems at equilibrium (Equations 10.3), and update the parameters. In the simulations, the learning rate of the valuation system took one of two values depending on the trial outcome, and the other simulation constants and the initial values of the model parameters were set to fixed values (mostly equal to 1 or 2).

To obtain the equations describing action planning and learning, we need to compute derivatives of the objective function over vectors and matrices. The rules for computing such derivatives are natural generalizations of the standard rules and can be found in a tutorial paper (Bogacz, 2017). During planning, the action intensity should change in proportion to the gradient of the objective function, which is given in Equation 12.2, where the prediction errors are defined in Equations 12.3. These equations have an analogous form to those in Figure 10A, but are generalized to matrices. The only additional element is the last term in Equation 12.2, which ensures competition between different actions: the intensity of each action is decreased in proportion to the intensity of the other action, and vice versa. During learning, the parameters also need to be updated in proportion to the corresponding gradients of the objective function, which are given in Equations 12.4 and 12.5. Again, these equations are fully analogous to those in Figure 10A.

Figure 12. Notation as in Figure 5B. C) An approximation of learning in the habit system: approximate error and approximate learning rule.

It is interesting to note how the equations in Figure 12A simplify in certain relevant limits. First, consider the evolution of action intensity at the start of the trial, when the intensities are still close to their initial values.
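For illustration, the following toy sketch integrates planning dynamics of the kind described by Equation 12.2 for two action nodes. The fixed drives, the leak term and all numerical values are simplifying assumptions added here, not the model of Figure 12A.

```python
import numpy as np

def plan_choice(reward_drive, habit_drive, var_h, inhibition=1.0,
                steps=300, dt=0.05):
    """Toy sketch of planning a choice between two actions.  Each node
    integrates a goal-directed drive and a habit drive weighted by the habit
    precision (1 / var_h); the competition term decreases each intensity in
    proportion to the other, as in the last term of Equation 12.2.  The leak
    term (-a) is added only to keep this toy example bounded."""
    a = np.zeros(2)
    for _ in range(steps):
        competition = inhibition * a[::-1]      # each node inhibited by the other
        a += dt * (reward_drive + habit_drive / var_h - a - competition)
        a = np.clip(a, 0.0, None)               # intensities stay non-negative
    return a

# Just after a reversal the goal-directed drive favours action 0, but a habit
# with small variance still favours action 1 and wins this static-drive toy
# (in the full model the slower goal-directed input eventually overturns it).
print(plan_choice(reward_drive=np.array([1.0, 0.0]),
                  habit_drive=np.array([0.1, 0.9]), var_h=0.5))
```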
Let us now consider how the equations in Figure 12A could be implemented in the neural circuit shown schematically in Figure 12B. There are many ways of achieving these computations, but as a proof of principle we discuss one candidate mapping. We assume that the striatum, output nuclei and thalamus include neural populations selective for the two alternative actions (shown in vivid and pale colours in Figure 12B), and that the connections between these nuclei run within the populations selective for a given action, as in previous models (Bogacz & Gurney, 2007; Frank et al., 2007; Gurney et al., 2001). Additionally, we assume that sensory cortex includes neurons selective for different states (shown in black and grey in Figure 12B), and that the parameters of the goal-directed and habit systems are encoded in cortico-striatal connections. The orange and blue terms in Equation 12.2 can then be computed by the striatal neurons in the goal-directed and habit systems in exactly the same way as in the network inferring action intensity, and these terms can be integrated in the output nuclei and thalamus. The last term in Equation 12.2 corresponds to mutual inhibition between the populations selective for the two actions, and such inhibition could be provided by inhibitory projections that are present in many regions of this circuit, e.g. by collateral projections of striatal neurons (Preston, Bishop, & Kitai, 1980) or via the subthalamic nucleus, which has been proposed to play a role in inhibiting non-selected actions (Bogacz & Gurney, 2007; Frank et al., 2007; Gurney et al., 2001).

The prediction error in the goal-directed system (orange Equation 12.3) could be computed analogously to the network selecting action intensity, as a difference between the reward and the expectation provided by the striatum. To compute the expected reward, the striatal neurons could calculate the product of the cortico-striatal weights and the stimulus vector, and the elements of this vector could be summed thanks to projections from the striatum to the dopaminergic neurons that are scaled by an input encoding the action vector (Figure 12B). During learning, the prediction error in the goal-directed system modulates the plasticity of the corresponding cortico-striatal connections according to the orange Equation 12.4, which describes a standard tri-factor Hebbian rule (provided that following the movement the striatal neurons encode the chosen action, as assumed in Figure 5C). The learning rule for the variance parameter of the goal-directed system (orange Equation 12.5) is exactly the same as in the model in the previous section (cf. orange Equation 10.5).

Learning in the habit system can be approximated with a single dopaminergic population, because the habit prediction error has a characteristic structure with large redundancy. Namely, if the stimulus and action vectors are binary (i.e. only one entry is equal to 1 and the other entries are 0), then only the entry of the habit prediction error corresponding to the chosen action is positive, while all other entries are negative (because the habit parameters stay in the range between 0 and 1 when they are initialized within this range and updated according to the blue Equation 12.4). Hence, we simulated an approximate model with a single prediction error encoding just the prediction error for the chosen action (Equation 12.6). With such a single modulatory signal, the learning rules for the striatal neurons in the habit system have to be adjusted so that the plasticity has opposite directions for the neurons selective for the chosen and for the other actions. Such a modified rule is given in Equation 12.7 and corresponds to tri-factor Hebbian learning (provided that the striatal neurons in the habit system have activity proportional to the action vector during learning, as we assumed for the goal-directed system). Thanks to this approximation, the prediction error and plasticity in the habit system take a form that is more analogous to that in the goal-directed system. When the prediction error in the habit system is a scalar, the learning rule for the variance parameter (blue Equation 12.5) becomes the same as in the model in the previous section (cf. blue Equation 10.5).
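To make these plasticity rules concrete, the sketch below implements a generic tri-factor Hebbian update and the scalar approximation of habit learning just described. The matrices Q and H, the learning rates and all numerical values are illustrative assumptions, not the notation of Figure 12.

```python
import numpy as np

def tri_factor_update(W, presynaptic, postsynaptic, dopamine, lr=0.1):
    """Generic tri-factor Hebbian rule: the weight change is the product of
    presynaptic (cortical) activity, postsynaptic (striatal) activity and a
    dopaminergic prediction-error signal."""
    return W + lr * dopamine * np.outer(postsynaptic, presynaptic)

def approximate_habit_update(H, stimulus, chosen, lr=0.1):
    """Sketch of the approximation described above: a single scalar habit
    prediction error (for the chosen action) increases the weights of neurons
    selective for that action and decreases those of the non-chosen action."""
    habitual = H @ stimulus                      # habitual tendencies for this stimulus
    delta_h = 1.0 - habitual[chosen]             # chosen intensity minus habit strength
    signs = -np.ones(len(habitual))
    signs[chosen] = 1.0                          # opposite plasticity directions
    return H + lr * delta_h * np.outer(signs, stimulus)

stimulus, action = np.array([1.0, 0.0]), np.array([1.0, 0.0])
Q = tri_factor_update(np.zeros((2, 2)), presynaptic=stimulus,
                      postsynaptic=action, dopamine=0.5)   # goal-directed weights
H = approximate_habit_update(np.full((2, 2), 0.5), stimulus, chosen=0)
print(Q)
print(H)   # weight (action 0, stimulus 0) rises; (action 1, stimulus 0) falls
```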
The simulated model also included a valuation system, which was parameterized 2 by a vector ̅ that described the values of the states in the simulation. At the end of the planning 3 phase, Gaussian noise with standard deviation = 2 was added to all entries of the action vector (to 4 allow exploration), and the action with the highest intensity was "chosen" by the model. Subsequently 5 for the chosen action , the intensity was set to = 1, while for the other action it was set to ≠ = 6 . All parameters of the simulations had the same value as in the previous section, except for = 7 1 and = . 8 Parallel organization of functionally segregated 14 circuits linking basal ganglia and cortex Planning by probabilistic inference What does dopamine mean? What is the role of dopamine in reward: hedonic impact, 18 reward learning, or incentive salience? Dopamine neuron systems in the brain: an update. Trends in 20 neurosciences A tutorial on the free-energy framework for modelling perception and learning The basal ganglia and cortex implement optimal decision making 24 between alternative actions The free energy principle for action and 26 perception: A mathematical review From ventral-medial to dorsal-lateral striatum: 28 neural correlates of reward-guided decision-making Extended habit training reduces dopamine mediation 31 of appetitive response expression Opponent actor learning (OpAL): Modeling interactive effects of 33 striatal dopamine on reinforcement learning and choice incentive Physiological state gates acquisition and expression of mesolimbic reward prediction signals Dopamine neuron activity before action 38 initiation gates and invigorates future movements Uncertainty-based competition between prefrontal and 1 dorsolateral striatal systems for behavioral control Habitual versus goal-directed action 3 control in Parkinson disease Actions and habits: the development of behavioural autonomy The role of learning in the operation of motivational systems Stevens' handbook of experimental psychology The effect of the instrumental training contingency 9 on susceptibility to reinforcer devaluation Omission learning after instrumental 12 pretraining Representation of spontaneous movement by dopaminergic neurons is cell-type selective and 15 disrupted in parkinsonism Specialized coding of sensory, motor and cognitive variables in VTA dopamine neurons Dopamine neurons share common response 19 function for reward prediction error Lesion to the nigrostriatal dopamine 21 system disrupts stimulus-response habit formation Hold your horses: impulsivity, deep 23 brain stimulation, and medication in parkinsonism Neuromodulated spike-timing-dependent plasticity, and theory 25 of three-factor learning rules A theory of cortical responses The free-energy principle: a unified brain theory? The 31 anatomy of choice: active inference and agency Rethinking dopamine as generalized 33 prediction error Go and no-go 35 learning in reward and punishment: interactions between affect and effect A computational model of action selection in the 1 basal ganglia. I. 
Striatonigrostriatal pathways in primates form an ascending spiral from the shell to the dorsolateral striatum
Time-dependent competition between habitual and goal-directed response preparation
D2 dopamine receptors in striatal medium spiny neurons reduce L-Type Ca2+ currents and excitability via a novel PLCβ1-IP3-calcineurin-signaling cascade
A model of how the basal ganglia generate and use neural signals that predict reinforcement
Rapid signalling in distinct dopaminergic axons during locomotion and reward
Coordination of rapid cholinergic and dopaminergic signaling in striatum during spontaneous movement
Prolonged dopamine signalling in striatum signals proximity and value of distant rewards
A behavior system
Localization of multiple dopamine receptor subtype mRNAs in human and monkey motor cortex and striatum
Start/stop signals emerge in nigrostriatal circuits during sequence learning
Closed-loop deep brain stimulation effects on parkinsonian motor symptoms in a non-human primate - is beta enough? Brain stimulation
Where does value come from? Trends in cognitive sciences
Forgetting in reinforcement learning links sustained dopamine signals to motivation
Homeostatic reinforcement learning for integrating reward collection and physiological stability
Regulation of parkinsonian motor behaviours by optogenetic control of basal ganglia circuitry
Learning with three factors: modulating Hebbian plasticity with errors
Reward prediction error does not explain movement selectivity in DMS-projecting dopamine neurons
A computational substrate for incentive salience
Learning reward uncertainty in the basal ganglia
Habits without values
Learning the payoffs and costs of actions
A framework for mesencephalic dopamine systems based on predictive Hebbian learning
Tonic dopamine: opportunity costs and the control of response vigor
Dissociable roles of ventral and dorsal striatum in instrumental conditioning
Mesolimbic dopamine encodes prediction errors in a state-dependent manner
Medium spiny neuron projection from the rat striatum: an intracellular horseradish peroxidase study
A cellular mechanism of reward-related learning
The effect of external rhythmic cues (auditory and visual) on walking during a functional task in homes of people with Parkinson's disease. Archives of physical medicine and rehabilitation
Control of synaptic plasticity in deep cortical networks
Attention-gated reinforcement learning of internal representations for classification
Responses of midbrain dopamine neurons to behavioral trigger stimuli in the monkey
A neural substrate of prediction and reward
The activity of pars compacta neurons of the monkey substantia nigra in relation to motor activation
Dichotomous dopaminergic control of striatal synaptic plasticity
Goal-directed decision making as probabilistic inference: a computational framework and potential neural correlates
Allostatic self-efficacy: a metacognitive theory of dyshomeostasis-induced fatigue and depression. Frontiers in human neuroscience
Introduction to reinforcement learning
Action initiation shapes mesolimbic dopamine encoding of future rewards
Overlearning reversal effect and magnitude of reward
Dopamine increases the gain of the input-output response of rat prefrontal pyramidal neurons
Adaptive coding of reward value by dopamine neurons
A specific role for posterior dorsolateral striatum in human habit learning
Whole-brain mapping of direct inputs to midbrain dopamine neurons
Lesions of dorsolateral striatum preserve outcome expectancy but disrupt habit formation in instrumental learning
The role of the dorsomedial striatum in instrumental conditioning
Human substantia nigra neurons encode unexpected financial rewards

Acknowledgements

This work has been supported by MRC grant MC_UU_12024/5. The author thanks Moritz Moeller and Sashank Pisupati for comments on an earlier version of the manuscript, and Karl Friston, Yonatan Loewenstein, Mark Howe, Friedemann Zenke, Kevin Miller and Peter Dayan for discussion.