Solving Heterogeneous General Equilibrium Economic Models with Deep Reinforcement Learning
Edward Hill, Marco Bardoscia, Arthur Turrell
31 March 2021

Abstract: General equilibrium macroeconomic models are a core tool used by policymakers to understand a nation's economy. They represent the economy as a collection of forward-looking actors whose behaviours combine, possibly with stochastic effects, to determine global variables (such as prices) in a dynamic equilibrium. However, standard semi-analytical techniques for solving these models make it difficult to include the important effects of heterogeneous economic actors. The COVID-19 pandemic has further highlighted the importance of heterogeneity, for example in age and sector of employment, in macroeconomic outcomes and the need for models that can more easily incorporate it. We use techniques from reinforcement learning to solve such models incorporating heterogeneous agents in a way that is simple, extensible, and computationally efficient. We demonstrate the method's accuracy and stability on a toy problem for which there is a known analytical solution, its versatility by solving a general equilibrium problem that includes global stochasticity, and its flexibility by solving a combined macroeconomic and epidemiological model to explore the economic and health implications of a pandemic. The latter successfully captures plausible economic behaviours induced by differential health risks by age.

One of the core problems in macroeconomics is to create models that capture how the self-interested actions of individuals and firms combine to drive the aggregate behaviour of the economy. These models can provide a guide for policymakers as to what actions they should take in any particular circumstance. Historically, macroeconomic models have tended to be simple because of the need for interpretability, but also because of a heavy reliance on solution methods that are semi-analytical. Such methods allow for the solution of a wide range of important macroeconomic problems. However, events such as the Great Financial Crisis and the COVID-19 crisis have shown that the ability to solve more general problems that include multiple, discrete agents and complex state spaces is desirable. We propose a way to use reinforcement learning to extend the frontier of what is possible in macroeconomic modelling, both in terms of the model assumptions that can be used and the ease with which models can be changed. Specifically, we show that reinforcement learning can solve the 'rational expectations equilibrium' (REE) models that are ubiquitous in macroeconomics, where choice variables are continuous and may have time-dependency, and where there are global constraints that bind agents' collective actions. Importantly, we show how to solve rational expectations equilibrium models with discrete heterogeneous agents (rather than a continuum of agents or a single representative agent). We apply reinforcement learning to solve three REE models: precautionary saving; the interaction between a pandemic and the macroeconomy (an 'epi-macro' model), with stochasticity in health statuses; and a macroeconomic model which has global stochasticity, i.e. where the background is changing in a way that the agents are unable to predict.
With these three models, we show that we can capture a macroeconomy that has rational, forward-looking agents, that is dynamic in time, that is stochastic, and that attains 'general equilibrium' between the supply and demand of goods or services in different markets.

Macroeconomic models seek to explain the behaviour of economic variables such as wages, hours worked, prices, investment, interest rates, the consumption of goods and services, and more, depending on the level of complexity. They do this through 'microfoundations', that is, describing the behaviour of individual agents and deriving the system-wide behaviour based on how those atomic behaviours aggregate. An important class of these models is used to describe how variables co-move in time when supply and demand are balanced (in general equilibrium), and when some variables are subject to stochastic noise (aka 'shocks').

A typical macroeconomic rational expectations model with general equilibrium is a representation of an economy populated by households, firms, and public institutions (such as the government). The choices made by these distinct agents are framed as a dynamic programming problem in which households maximise their discounted future utility $U = \mathbb{E}\left[\sum_{t=1}^{\infty} \beta^t u(s_t, a_t)\right]$ with $u$ per-period utility, $\beta$ a discount factor, $s_t \in S$ a vector of state variables, $a_t \in a(s_t)$ a vector of choice variables, and $s_t$ evolving as $s_{t+1} = h(s_t, a_t)$. $\mathbb{E}(\cdot)$ represents an expectation operator, usually assumed to be 'rational' in the sense of being the households' best possible forecast given the available information (and implying that any deviations from perfect foresight are random). For household agents, $u$ is monotonically increasing in consumption, $c_t$, and decreasing in hours worked, $n_t$ (both choice variables). Extra conditions are imposed via other equations, for example a budget constraint of the form $(1 + r_t) b_{t-1} + w_t n_t \geq p_t c_t + b_t$, with $p_t$ the price, $w_t$ wages, and $r_t$ the interest rate. $b_t$ captures savings, typically in the form of a risk-free bond or other investment. If $b_t < 0$ is permitted (i.e. debt), then $b_t$ usually satisfies a 'no Ponzi' condition that rules out unlimited borrowing and effectively imposes the rule that $b_T = 0$ (for $t \in \{0, \dots, T\}$). Consumers take prices, wages, and the interest rate as given; these are state variables.

Firms maximise profits $\Pi_t = p_t Y_t - w_t N_t$ (possibly including a $-r_t K_t$ term if savings are invested) subject to a production function, $Y_t = A_t f(N_t, K_t)$, that turns labour, $N_t$, and capital, $K_t$, into consumption goods. Typically, $f$ is a monotonically increasing function of its inputs and $A_t$ is either predetermined or follows a log-autoregressive process $\ln A_t = \rho_A \ln A_{t-1} + \epsilon_t$; $\epsilon_t \sim N(0, \sigma_A)$ is known as a technology 'shock'. Governments perform functions such as the collection and redistribution of taxes. Firms are assumed to be perfectly competitive, meaning that each firm takes prices and wages as given. Prices, wages, and interest rates are determined by market clearing for goods, labour, and savings respectively, in which supply and demand are balanced in each market. These 'general equilibrium' conditions bind agents and the environment together, and are atypical in reinforcement learning. The competitive equilibrium is defined by a vector of state variables, and by consumption and production plans for the agents that maximise utility.
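To make the structure above concrete, the following minimal Python sketch (not from the paper) encodes a household's per-period utility and budget-constraint update. The log-utility functional form is borrowed from the toy model used later in the paper; the parameter values, function names, and the assumption that the budget constraint binds with equality are illustrative choices, not the authors' implementation.

```python
import numpy as np

# Illustrative sketch of the household problem described above, with
# consumption c_t, hours n_t, wage w_t, interest rate r_t, price p_t,
# and bond holdings b_t.  THETA and BETA are assumed placeholder values.
THETA = 1.0   # weight on the disutility of hours worked (assumption)
BETA = 0.97   # discount factor (assumption)

def period_utility(c_t: float, n_t: float, theta: float = THETA) -> float:
    """u(s_t, a_t): increasing in consumption, decreasing in hours worked."""
    return np.log(c_t) - 0.5 * theta * n_t ** 2

def budget_step(b_prev: float, c_t: float, n_t: float,
                w_t: float, r_t: float, p_t: float) -> float:
    """Advance savings assuming the budget constraint
    (1 + r_t) b_{t-1} + w_t n_t = p_t c_t + b_t holds with equality."""
    return (1.0 + r_t) * b_prev + w_t * n_t - p_t * c_t

def discounted_utility(c_path, n_path, beta: float = BETA) -> float:
    """U = sum_t beta^t u(c_t, n_t) along one realised history."""
    return sum(beta ** t * period_utility(c, n)
               for t, (c, n) in enumerate(zip(c_path, n_path)))
```

In an RL framing of this problem, budget_step plays the role of the state-transition function and period_utility the role of the reward.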
Often, the optimal policies of all agents are solved for analytically by Lagrangian methods: the equilibrium conditions are substituted in and the system of equations simplified, usually by log-linearising the model around an assumed steady state.

We now review some macroeconomic models before briefly discussing multi-agent models more generally. Representative agent models with rational expectations are an important class of macroeconomic model, the most well-known being the representative agent dynamic stochastic general equilibrium (DSGE) model. The canonical example is the representative agent New Keynesian (RANK) model (Smets & Wouters, 2007). Continuum rational expectations models overcome some of the heterogeneity-related shortcomings of those models by replacing the representative household with a continuum of households that are ex ante differentiated by their assets and labour productivity. The canonical example is the heterogeneous agent New Keynesian (HANK) model (Kaplan et al., 2018). Macroeconomic agent-based models differ in that they simulate agents as discrete entities, but they also typically make very different assumptions to, say, RANK or HANK models; the most important being that they tend not to assume rational expectations/perfect foresight and they may not necessarily have competitive markets. Importantly, they allow for heterogeneity in multiple dimensions simultaneously (Haldane & Turrell, 2019). Agent-based models (ABMs) are also extensively used in epidemiology (Tracy et al., 2018), sometimes under the name 'individual-based models'. At the start of the coronavirus crisis, UK government policy was heavily informed by such models, most notably that of Ferguson et al. (2020), and there are several ABMs modelling the coronavirus pandemic (Hoertel et al., 2020; Kerr et al., 2020). These epi-ABMs do not capture economic effects.

Epi-macro models attempt to combine macroeconomic and epidemiological effects, and their interaction. The canonical examples combining epidemiology and an REE representative agent model are Eichenbaum et al. (2020a) and Eichenbaum et al. (2020b), who link the two by assuming that, in addition to the usual Susceptible, Infected, Recovered (SIR) transmission mechanism posed by Kermack & McKendrick (1927), a household agent may be infected at work or while engaging in consumption. Market clearing is also assumed. Building on many of the same assumptions as HANK, the canonical continuum agent epi-macro model with REE is by Kaplan et al. (2020). Agents are differentiated by their assets, productivity, occupation, and health status. There are three types of good: regular, social, and home-produced; and three types of work: workplace, remote, and home. The epi-macro link is achieved through a transmissibility of infection that is modified to include terms proportional to hours worked and amount consumed, with avoidance of infection captured through a disutility of death. Market clearing is assumed. Finally, recent work has seen reinforcement learning applied to multi-agent systems of relevance to economics, in the case of bidding in auctions under constraints (Feng et al., 2018; Dütting et al., 2019), and in deciding on behaviours for both agents and a social planner in a gather-and-build economy (Zheng et al., 2020).
In the rest of this paper, we show how to use reinforcement learning to solve typical rational expectations macroeconomic models while also incorporating discrete agent heterogeneity and, potentially, stochasticity; demonstrating that all three can be combined is by far our major contribution and has applications for a wide class of economic problems.

A typical rational expectations equilibrium problem is that of precautionary saving, in which agents anticipate a change in circumstances that will adversely affect their utilities, in this case a reduction in wages, and respond in advance in order to smooth their consumption. Such behaviour is typical of the agents in an REE model. The simplest version of this problem has a known analytical solution. We solve this model using reinforcement learning so that we may compare it to the analytical solution, and we also use it as a way to demonstrate many of the challenges of using RL for this class of problems; notably the speed and accuracy of convergence given the sensitivity to the estimate of the value function, the continuous action and state spaces, and the enforcement of the 'no Ponzi' condition.

We assume that there is a single household agent with rational expectations. There are $I = 2$ firms, with the firms and the good each firm produces indexed by $i$. The household agent is employed by one of these firms, which we will denote $e$, and has per-period utility $u_t = \sum_{i \in I} \ln c_{it} - \frac{\theta}{2} n_t^2$ with actions (choice variables) $c_{it}$ consumption and $n_t$ hours worked. $0 \le t < T$ is the discrete timestep. The price vector is fixed to $p_{it} = 1$, and the interest rate is fixed to 0. The wage is imposed as $w = 1$ for $t < T/2$ and $w = 0.5$ afterwards, a fall that is anticipated. The household agent is subject to a budget constraint such that $b_{t+1} = b_t + w_t n_t - \sum_{i \in I} p_{it} c_{it}$. The no Ponzi condition is imposed via $b_T = 0$, which prevents unlimited borrowing by the household. The agent maximises its discounted utility $\sum_{0 \le t < T} \beta^t u_t$.

We then advance both the capital and epidemiological state to proceed to the next timestep. $\tilde{U}_{\tau+1}$ is obtained from $\tilde{U}_{\tau}$ by continuing training using $\tilde{H}_{\tau}$. The network parameters are the same as in §3.1; however, we use a gentler decay of the learning rate. There are no problems observed with convergence, and the adherence to the no Ponzi condition is a good test of this, since achieving it is sensitive to the entire time history of the simulation. As in the multi-agent model, consumers take prices ($p_i$, $w_i$ and $r$) as given, and find their optimal consumptions, hours worked and investments before advancing their state using the budget constraint and the probabilities of their SIRD state changing. We examine two cases: a 'heterogeneous' case as described above, and a 'homogeneous' case without age heterogeneity but with the same mean death rate.

Figure 1 shows the percentages of susceptible, infected, recovered and deceased agents as the pandemic progresses. Each line is an average over the results of 3 simulations, and each simulation's result is an average over 20 histories. As can be seen from the 95% confidence intervals in the figure, the behaviour is similar across simulations. In the homogeneous-age case, more people are infected over the course of the pandemic, but there are fewer deaths in total. Figure 2 shows the agents' consumptions. We bin the uniform age distribution into young (< 40), old (> 70) and middle-age groups; we find considerable differences in behaviour between them.
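As a rough illustration of how an agent's health state might be advanced "using ... the probabilities of their SIRD state changing", the sketch below (not the authors' code) implements a single SIRD transition with an age-dependent death probability. The infection probability, recovery probability, and the shape of the death-risk curve are placeholder assumptions; in the paper these quantities depend on the agent's consumption, hours worked, and the aggregate state of the epidemic.

```python
import numpy as np

# Hedged sketch of one SIRD health-state transition for one agent.
RNG = np.random.default_rng(0)
RECOVERY_P = 0.1   # per-step recovery probability (assumption)

def death_probability(age: float) -> float:
    """Illustrative, uncalibrated death risk that increases with age."""
    return 0.001 + 0.01 * max(age - 40.0, 0.0) / 40.0

def advance_sird(state: str, age: float, p_inf: float) -> str:
    """Advance an agent's state among 'S', 'I', 'R', 'D'."""
    if state == "S" and RNG.random() < p_inf:
        return "I"
    if state == "I":
        u = RNG.random()
        if u < death_probability(age):
            return "D"
        if u < death_probability(age) + RECOVERY_P:
            return "R"
    return state  # 'R' and 'D' are absorbing; 'S' stays 'S' if not infected
```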
After consuming the most before the pandemic, since they anticipate the coming opportunity to save, the old strongly reduce consumption in response to infection risk, and reduce consumption of the riskier 'social' good more. The young, conversely, are unlikely to die and so their consumption is relatively unchanged, governed by the decrease in the size of the economy. Figure 3 shows the mean investments per agent in each age group, normalised to the salary (product of wage and hours worked) of an agent in the same system with no pandemic and no saving. The young anticipate the pandemic and save before it in order to spend when infection rates are higher, with the converse behaviour for the old. It also shows that the no Ponzi condition holds with good accuracy for all age groups. Finally, Figure 4 shows the total consumption for both heterogeneous-age and homogeneous-age cases. The inclusion of distributional effects causes significant changes to bulk macroeconomic quantities. Together these results show that, in this uncalibrated model, the inclusion of age heterogeneity makes a substantial difference to both the epidemiological and economic progress of a pandemic. This model and solution method has been tested with a range of epidemiological and economic parameters and has shown consistent stability and convergence. This exercise has also shown the sensitivity of the model's conclusions to those parameters, emphasising the importance of calibration in all aspects of the model if it were to be used as more than a test case of the methodology.

The hardware, software, and parameters are identical to §3.1 with the exception of the learning rate decay, which is slower here to allow for the longer time history. Scaling is linear with $J$, since the number of iterations of the least squares optimisation seems to scale very weakly with $J$ for this problem. Each history calculation followed by an RL update takes ∼30 minutes on the reference machine, so a single simulation takes ∼24 hours, or ∼720 GFLOPS-hours.

In the previous section, only the agents' state was stochastic and, because we were in the limiting case of large numbers of agents, each individual agent's state did not affect the global state. We now return to the model in §3.1, but instead of having a deterministic drop in wages, wages now follow a log-autoregressive stochastic process. We return to the Bellman equation, (1). We assume there is no stochasticity in health states, so the $\mathbb{E}_{s'|s,a}(\cdot)$ are no longer present; however, the $\mathbb{E}_{S_{t+1}|S}(\cdot)$ remain, where $s_{t+1} = a(s)$ advances deterministically. We change the method to solve for $\tilde{U}$ and $\tilde{D}$, defining
$$\tilde{U}(t, S, s') = \mathbb{E}_{S_{t+1}|S}\left[U(t+1, S_{t+1}, s')\right]$$
and
$$U(t, S, s) = \max_a \left[ u_t(S, s, a) + \beta \tilde{U}(t, S, s_{t+1}) \right],$$
where the value of $a$ which attains the maximum defines $a^*$. Again, we will find the maximum using Lagrange's method and the auxiliary quantity $\tilde{D}$, and so
$$\partial_a L(t, S, s, a) = \partial_a u_t(S, s, a) + \beta\, (\partial_a s_{t+1})\, \tilde{D}(t + 1, S, s_{t+1}). \qquad (7)$$
As is standard in reinforcement learning, the expectations are approximated by using a large number of global state histories to update $\tilde{U}$ and $\tilde{D}$. Excepting the wage history, the set-up is as in §3.1. The agents are statically heterogeneous in their propensity to work, $\theta \in [0.6, 1.4]$, and employer, $e$; and dynamically heterogeneous in savings. Again, the multiple employers and the treatment of heterogeneity in $\theta$ are introduced so that demonstrations of convergence and hyperparameter choice are relevant to the problem in the next section.
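The role of the auxiliary quantity $\tilde{D}$ in equation (7) can be illustrated with a one-dimensional toy: given an estimate of the derivative of the continuation value with respect to the next state, the optimal action is the root of the first-order condition. The sketch below is a simplification under stated assumptions, not the paper's multi-good implementation: a single consumption choice, a fixed income and interest rate, and a closed-form stand-in for the learned network $\tilde{D}$.

```python
import numpy as np
from scipy.optimize import brentq

# One-dimensional illustration of an Eq. (7)-style first-order-condition solve.
# The agent chooses consumption a given savings s; the next state is
# s' = (1 + r) s + y - a, and d_tilde stands in for the learned dU/ds'.
BETA, R, INCOME = 0.97, 0.02, 1.0   # assumed placeholder values

def d_tilde(s_next: float) -> float:
    """Placeholder for the trained derivative network; a toy 1/(s' + 2) shape."""
    return 1.0 / (s_next + 2.0)

def foc(a: float, s: float) -> float:
    """d/da [u(a) + beta * U(s')] = 1/a + beta * (ds'/da) * D(s'), ds'/da = -1."""
    s_next = (1.0 + R) * s + INCOME - a
    return 1.0 / a - BETA * d_tilde(s_next)

def optimal_action(s: float) -> float:
    """Root of the first-order condition on a feasible consumption bracket."""
    upper = (1.0 + R) * s + INCOME + 1.9   # keeps s' > -2 so d_tilde stays finite
    return brentq(foc, 1e-6, upper - 1e-6, args=(s,))

print(optimal_action(0.5))
```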
In this toy problem, we compare two types of agent: one a current-time wage-observing agent, whose future utility is a function of the wages at the current timestep, $w_t$: $U(t, S = (t, w_t), s = (\theta, e, b_t))$; the other a non-wage-observing agent with $U(t, S = (t), s = (\theta, e, b_t))$. Both are able to optimise by testing how their strategies play out within the histories they have seen; however, they only have partial visibility of the global state. In the previous case, which had a deterministic global state, knowledge of the time determined all other global variables, but here the mapping from observable values to global state is one-to-many. The wage follows a log-autoregressive process, $\ln w_t = \rho_w \ln w_{t-1} + \epsilon_t$; $\epsilon_t \sim N(0, \sigma_w)$; where we use $\rho_w = 0.97$, $\sigma_w = 0.1$, and $w_{t=0} = 1$. Since the wage is autoregressive, knowledge of the current wage adds information about the wage in the future. The agent is trained on $\#T \in \{100, 1000\}$ training histories. We parametrise wage histories by their mean absolute fractional deviation, $d_h$, from the mean path of the autoregressive process, $w_{\text{mean},t} = \exp(\sigma_w^2 t / 2)$, and use previously unseen histories with $d_h > 0.2$ to represent those with significant deviation from the mean path. We judge the success of this model by calculating the average total utility an agent with $\theta = 1$ attains over these previously unseen wage histories, $\{w_{h,t}\}_{0 \le t < T}$; the average over histories with $d_h > x$ is denoted $\bar{u}_{d_h > x}$.

Table 1 compares the average utilities of agents with different training setups and wage visibility to an analytic approximation, found by defining the action at time $t$ in history $h$ to be the values obtained from the formulae in §3.1 for a wage history beginning at time $t$: $c_{h,t} = 1/\tilde{\lambda}$ and $n_{h,t} = w_{h,t}\, \theta^{-1} \tilde{\lambda}$, obtaining $\tilde{\lambda}$ from the no Ponzi condition, where the expectation over future wages $w_{p,t}$ is evaluated analytically over all possible wage paths that have $w_{p,t} = w_{h,t}$, using the second moment of the log-normal distribution.

We implement prioritised experience replay (Schaul et al., 2015) by retaining experiences with a larger error for a larger number of training epochs. This increases the agents' performance relative to a base agent trained as in previous sections, particularly on the $d_h > 0.2$ histories that deviate significantly from the mean path. This is expected since the training examples are sparser and more varied at higher $d_h$. The wage-observing agent outperforms the non-wage-observing agent. In addition to the solution becoming stable and the no Ponzi condition being satisfied, the fact that the utilities for the wage-observing agent are consistently higher than those for the analytical approximation gives us confidence that the answers converged to are accurate. We record the utilities after 100 epochs, which equates to 40,000 experiences or 10 minutes on the reference machine. A degradation in performance is seen if an insufficient number of training histories is used. The specification is identical to §3.1.2 except that the rate of decay of the learning rate is decreased to allow averaging, $l_r = e^{-0.01E}/(1 + E)$ for epoch $E$; and, as discussed, prioritised experience replay is used.

Finally, we demonstrate a general equilibrium model that has stochastic global variables, specifically a log-autoregressive process in the technology. This is given by $\ln A_{it} = \rho_A \ln A_{i,t-1} + \epsilon_t$; $\epsilon_t \sim N(0, \sigma_A)$, and it affects the production of a sector, given by $Y_{it} = A_{it} K_{it}^{1-\alpha} N_{it}^{\alpha}$. We use $\alpha = 2/3$, $\rho_A = 0.97$, $\sigma_A = 0.1$, and $A_{i0} = 1$.
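For concreteness, a minimal simulation of the log-autoregressive shock process defined above might look as follows (the wage process in the previous subsection has the same form with $\rho_w$ and $\sigma_w$). The horizon and random seed are arbitrary illustrative choices, and the code is a sketch rather than the authors' implementation.

```python
import numpy as np

# Minimal sketch of the log-autoregressive technology process,
# ln A_t = rho_A * ln A_{t-1} + eps_t,  eps_t ~ N(0, sigma_A),  A_0 = 1.
RHO_A, SIGMA_A, A0 = 0.97, 0.1, 1.0
T = 200  # horizon chosen only for illustration

def simulate_technology(rng: np.random.Generator, horizon: int = T) -> np.ndarray:
    log_a = np.empty(horizon)
    log_a[0] = np.log(A0)
    eps = rng.normal(0.0, SIGMA_A, size=horizon)
    for t in range(1, horizon):
        log_a[t] = RHO_A * log_a[t - 1] + eps[t]
    return np.exp(log_a)

def production(A: np.ndarray, K: float, N: float, alpha: float = 2.0 / 3.0) -> np.ndarray:
    """Cobb-Douglas output Y_t = A_t * K^(1-alpha) * N^alpha for a sector."""
    return A * K ** (1.0 - alpha) * N ** alpha

A_path = simulate_technology(np.random.default_rng(0))
Y_path = production(A_path, K=1.0, N=1.0)
```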
The agents are as in §3.3, being statically heterogeneous in propensity to work, $\theta$, and employer, $e$, and dynamically heterogeneous in investment, $k_t$; however, they now have visibility of all prices, not just wages. The global model that couples the agents is the same RBC model described in §3.2 and Appendix I, but without infections. As the agents' internal states are no longer stochastic, a smaller number ($J = 10$) of agents can be used. As in §3.3, we compare two types of agents, one that 'sees' realisations of the prices and another that does not. Since differing values of the technology shock move the general equilibrium, changing the prices, observing them gives the agent information about the state of the underlying stochastic process. Figure 5 shows the paths of hours worked, $n_j$, for 5 agents with a range of values of $\theta$, all of whom work in the same sector, given by $i = 0$. As expected, the paths of agents who can observe realisations of prices have smaller fluctuations, which is also true of the paths of other quantities. Averaging over 256 runs, the mean unsigned curvature of the paths drops from $\bar{\kappa}_{\text{non-obs}} = 0.44$ to $\bar{\kappa}_{\text{obs}} = 0.27$; the same fall is seen in the curves in the figure, where $\bar{\kappa}_{\text{non-obs}} = 0.42$ and $\bar{\kappa}_{\text{obs}} = 0.19$. This difference arises because agents who can observe realisations are better able to adjust their behaviour to the current and (since it is an autoregressive process) future values of the technology shock.

The specification of the neural network and learning remains unchanged from the previous section. Calculation of the multiple histories is parallelised; we use 8 threads. For each simulation epoch $E$, $8(4 + E)$ histories are found, followed by 20 RL training epochs. In total, there are 12 simulation epochs, ≈1000 histories, and 240 RL training epochs, each of which samples from the most recent 50% of the histories. The number of histories and epochs is informed by the convergence properties from the previous section. A total of ≈100,000 experiences are recorded during the whole simulation, which takes ≈6 hours.

This work shows that reinforcement learning can be used to solve a wide range of important macroeconomic rational expectations models in a way that is simple, flexible, and computationally tractable. Furthermore, these methods can be immediately applied to previously intractable problems with multiple degrees of discrete heterogeneity and stochasticity. Being highly relevant to real-world phenomena, such as climate change and disease transmission, these capabilities are of great value to policymakers and can be developed into serious tools to aid decision making in complex scenarios. Finally, by linking to reinforcement learning, this work provides the potential to apply its extensive toolkit of techniques, many of which have direct relevance to economic questions: examples include accessing larger state and action spaces (e.g. Lillicrap et al., 2015), including bounded rationality, or applying inverse reinforcement learning to deduce agents' objectives and rewards from observed micro- and macro-economic behaviours. Additionally, we can harness improvements in implementation such as GPU/TPU acceleration (Paszke et al., 2019) and distributed computing (Mnih et al., 2016).

Appendix I: The Real Business Cycle model

We use a standard real business cycle model; however, we adopt notation and variables common in reinforcement learning, in particular an emphasis on state variables (capital) rather than action variables (consumption, hours worked), and the inclusion of an action-dependent expectation.
A baseline RBC model would use Equations 9, 12, 13, 14, 15, 16, and 19. We use Equations 9, 10, 11, 14, 15, 16, and 19, but note that 10 and 12, and 11 and 13, are the same up to algebraic manipulation, expressing the future behaviour in terms of $U(k)$ and $U'(k)$, functions of the state, rather than in terms of consumption or other actions as is usually seen. We work in real quantities, using $p_{i=0} = 1$ as the numéraire, with other quantities, including the prices $p_{i \neq 0}$, defined relative to this.

Notation: $j$ is an index that runs over the $J$ consumer-workers; $i$ runs over the $I$ consumption goods, each of which is produced by a different sector/firm. For consumers, $n_j$ is hours worked, $c_{ji}$ is consumption, $k_j$ is capital held by the consumer, $v_j$ is their investment in capital in that timestep, and $\theta_j > 0$ is the weight given to hours in the utility function. $E_i$ denotes the set of agents employed at firm $i$, and $e(j)$ is the index, $i$, of the employer of $j$. For firms, $N_i$ is the number of hours worked at the firm, $K_i$ is its capital, $A_i$ is the firm's technology, $Y_i$ is its production, and $C_i$ is the consumption of the firm's goods. $r$ is the real interest rate, $w_i$ are wages, and $p_i$ are the prices of goods.

For consumers, the time-$t$ utility is
$$u_j(\{c_{ji}\}, n_j; \theta_j) = \sum_i \ln c_{ji} - \tfrac{1}{2}\theta_j n_j^2$$
and their total utility from time $t$ onward is
$$U_{j,t,S}(k_{t,j}) = \max_{c_{ji},\, n_j}\Big[ u_j(c_{ji}, n_j) + \beta \sum_{S'} P_{S \to S'}(c_{ji}, n_j)\, U_{j,t+1,S'}(k_{t+1,j}) \Big].$$
Their budget constraint is
$$w_{e(j)} n_j + r k_{j,t} = \sum_i p_i c_{ji} + v_j \quad \forall j, \qquad (9)$$
with capital accumulating as $k_{j,t+1} = k_{j,t} + v_j$. Let $U'_j = \partial_{k_{t+1,j}} U(k_{t+1,j})$; expectations are over the distribution of probabilities $P_{S \to S'}$, and so $\mathbb{E}\,\partial_{c_{ji}} \ln P_j = \sum_{S \to S'} \partial_{c_{ji}} P_j$. Consumers take prices ($p_i$, $w_i$ and $r$) as given and use their first-order conditions to find $c_{ji}$ and $n_j$; this is solved iteratively since $U'$ is a function of $k_{t+1}$ and thus of $c_{ji}$ and $n_j$. In the reinforcement learning training we add an additional reward for individually achieving a no-Ponzi condition at the final time-step. If the probabilities were independent of the action, $\partial_{c_{ji}} P_j = 0$, as would be the case in a standard RBC model, then that term could be removed and the $\mathbb{E}\,U'_j$ eliminated, obtaining
$$n_j = \frac{w_{e(j)}}{\theta_j c_{ji} p_i} \quad \forall i.$$
A small amount of work, with care taken as to the maximisation over $a$ in the definition of the utility, shows that $\mathbb{E}\,U'_j = \mathbb{E}\big[(1 + r_{t+1})(p_{i,t+1} c_{ji,t+1})^{-1}\big]$, and so Equation 10 reduces to the Euler equation
$$(p_{i,t} c_{ji,t})^{-1} = \beta\, \mathbb{E}\big[(1 + r_{t+1})\,(p_{i,t+1} c_{ji,t+1})^{-1}\big],$$
where the $p_i$ remains due to the multiple goods with $p_{i \neq 0} \neq 1$.

Firms are profit maximising with production function $Y_i = A_i K_i^{1-\alpha} N_i^{\alpha}$. Since profit is $\Pi_i = p_i Y_i - w_i N_i - (r + \delta)K_i$, taking prices ($p_i$, $w_i$, $r$) as given yields the first-order conditions
$$w_i = \alpha \frac{p_i Y_i}{N_i}, \qquad r + \delta = (1 - \alpha) \frac{p_i Y_i}{K_i}, \qquad (16)$$
and therefore $\Pi_i = 0$. We split $K_i = K_e + \tilde{K}_i$, where $K_e$ is an endowment and $\tilde{K}_i$ is provided by investment from the consumers. The technology shock is log-autoregressive, $\ln A_{i,t} = \rho_A \ln A_{i,t-1} + \epsilon_t$, $\epsilon_t \sim N(0, \sigma_A)$. Wages are set by market clearing for hours worked, $N_i = \sum_{j \in E_i} g_j n_j$, and the real interest rate is set by market clearing for capital, $\sum_i \tilde{K}_i = \sum_j g_j k_j$, where $g_j$ is the weight of each agent, with $\sum_j g_j = 1$.
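A hedged sketch of the market-clearing step implied by these conditions is given below: given each sector's aggregated hours and capital, the Cobb-Douglas first-order conditions pin down wages and the rental rate. The depreciation rate value, the agent weights, and the two-sector example inputs are illustrative assumptions; in the full model there is a single economy-wide interest rate, with capital allocated across sectors so that the sectoral returns coincide.

```python
import numpy as np

# Sketch (not the authors' code) of clearing the labour and capital markets
# given aggregated sectoral hours N_i and capital K_i, using the Cobb-Douglas
# first-order conditions  w_i = alpha*p_i*Y_i/N_i,  r + delta = (1-alpha)*p_i*Y_i/K_i.
ALPHA = 2.0 / 3.0
DELTA = 0.05   # assumed depreciation rate for illustration

def clear_markets(A, K, N, p):
    """A, K, N, p are arrays over sectors i; returns (Y, w, r) per sector."""
    Y = A * K ** (1.0 - ALPHA) * N ** ALPHA
    w = ALPHA * p * Y / N
    r = (1.0 - ALPHA) * p * Y / K - DELTA
    return Y, w, r

# Aggregation of individual labour supplies with agent weights g_j (sum to 1),
# N_i = sum_{j in E_i} g_j n_j; here a toy two-sector, four-agent example.
g = np.array([0.25, 0.25, 0.25, 0.25])
n = np.array([0.9, 1.1, 1.0, 0.8])
employer = np.array([0, 0, 1, 1])
N = np.array([g[employer == i] @ n[employer == i] for i in range(2)])
Y, w, r = clear_markets(A=np.ones(2), K=np.array([1.0, 1.2]),
                        N=N, p=np.array([1.0, 1.0]))
```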
Acknowledgements: We would like to thank Federico Di Pace for useful discussions.

References
- Compatible value gradients for reinforcement learning of continuous deep policies
- Optimal auctions through deep learning
- The macroeconomics of epidemics
- Epidemics in the neoclassical and new Keynesian models
- Revisiting fundamentals of experience replay
- Deep learning for revenue-optimal auctions with budgets
- Impact of non-pharmaceutical interventions (NPIs) to reduce COVID-19 mortality and healthcare demand
- Drawing on different disciplines: macroeconomic agent-based models
- Learning continuous control policies by stochastic value gradients
- A stochastic agent-based model of the SARS-CoV-2 epidemic in France
- Monetary policy according to HANK
- The great lockdown and the big stimulus: tracing the pandemic possibility frontier for the U.S. (Working Paper 27794)
- A contribution to the mathematical theory of epidemics
- Covasim: an agent-based model of COVID-19 dynamics and interventions (medRxiv)
- A method for stochastic optimization
- Time to build and aggregate fluctuations
- Continuous control with deep reinforcement learning
- A conceptual model for the outbreak of coronavirus disease 2019 (COVID-19) in Wuhan, China with individual reaction and governmental action
- Human-level control through deep reinforcement learning
- Asynchronous methods for deep reinforcement learning
- An imperative style, high-performance deep learning library
- Prioritized experience replay
- Proximal policy optimization algorithms
- Shocks and frictions in US business cycles: a Bayesian DSGE approach
- Reinforcement learning: An introduction
- Agent-based modeling in public health: current applications and future directions
- Deep reinforcement learning with double Q-learning
- The AI economist: improving equality and productivity with AI-driven tax policies