Solving Heterogeneous General Equilibrium Economic Models with Deep Reinforcement Learning
Edward Hill, Marco Bardoscia, Arthur Turrell
31 March 2021

Abstract: General equilibrium macroeconomic models are a core tool used by policymakers to understand a nation's economy. They represent the economy as a collection of forward-looking actors whose behaviours combine, possibly with stochastic effects, to determine global variables (such as prices) in a dynamic equilibrium. However, standard semi-analytical techniques for solving these models make it difficult to include the important effects of heterogeneous economic actors. The COVID-19 pandemic has further highlighted the importance of heterogeneity, for example in age and sector of employment, in macroeconomic outcomes and the need for models that can more easily incorporate it. We use techniques from reinforcement learning to solve such models incorporating heterogeneous agents in a way that is simple, extensible, and computationally efficient. We demonstrate the method's accuracy and stability on a toy problem for which there is a known analytical solution, its versatility by solving a general equilibrium problem that includes global stochasticity, and its flexibility by solving a combined macroeconomic and epidemiological model to explore the economic and health implications of a pandemic. The latter successfully captures plausible economic behaviours induced by differential health risks by age.

One of the core problems in macroeconomics is to create models that capture how the self-interested actions of individuals and firms combine to drive the aggregate behaviour of the economy. These models can provide a guide for policymakers as to what actions they should take in any particular circumstance. Historically, macroeconomic models have tended to be simple because of the need for interpretability, but also because of a heavy reliance on solution methods that are semi-analytical. Such methods allow for the solution of a wide range of important macroeconomic problems. However, events such as the Great Financial Crisis and the COVID-19 crisis have shown that the ability to solve more general problems that include multiple, discrete agents and complex state spaces is desirable. We propose a way to use reinforcement learning to extend the frontier of what is possible in macroeconomic modelling, both in terms of the model assumptions that can be used and the ease with which models can be changed. Specifically, we show that reinforcement learning can solve the 'rational expectations equilibrium' (REE) models that are ubiquitous in macroeconomics, where choice variables are continuous and may have time-dependency, and where there are global constraints that bind agents' collective actions. Importantly, we show how to solve rational expectations equilibrium models with discrete heterogeneous agents (rather than a continuum of agents or a single representative agent). We apply reinforcement learning to solve three REE models: precautionary saving; the interaction between a pandemic and the macroeconomy (an 'epi-macro' model), with stochasticity in health statuses; and a macroeconomic model which has global stochasticity, i.e. where the background is changing in a way that the agents are unable to predict.
With these three models, we show that we can capture a macroeconomy that has rational, forward-looking agents, that is dynamic in time, that is stochastic, and that attains 'general equilibrium' between the supply and demand of goods or services in different markets.

Macroeconomic models seek to explain the behaviour of economic variables such as wages, hours worked, prices, investment, interest rates, the consumption of goods and services, and more, depending on the level of complexity. They do this through 'microfoundations', that is, describing the behaviour of individual agents and deriving the system-wide behaviour based on how those atomic behaviours aggregate. An important class of these models is used to describe how variables co-move in time when supply and demand are balanced (in general equilibrium), and when some variables are subject to stochastic noise (aka 'shocks').

A typical macroeconomic rational expectations model with general equilibrium is a representation of an economy populated by households, firms, and public institutions (such as the government). The choices made by these distinct agents are framed as a dynamic programming problem in which households maximise their discounted future utility $U = \mathbb{E}\left[\sum_{t=1}^{\infty} \beta^t u(s_t, a_t)\right]$ with $u$ per-period utility, $\beta$ a discount factor, $s_t \in S$ a vector of state variables, $a_t \in a(s_t)$ a vector of choice variables, and $s_t$ evolving as $s_{t+1} = h(s_t, a_t)$. $\mathbb{E}(\cdot)$ represents an expectation operator, usually assumed to be 'rational' in the sense of being the households' best possible forecast given the available information (and implying that any deviations from perfect foresight are random). For household agents, $u$ is monotonically increasing in consumption, $c_t$, and decreasing in hours worked, $n_t$ (both choice variables). Extra conditions are imposed via other equations, for example a budget constraint of the form $(1 + r_t) b_{t-1} + w_t n_t \geq p_t c_t + b_t$, with $p_t$ the price, $w_t$ wages, and $r_t$ the interest rate. $b_t$ captures savings, typically in the form of a risk-free bond or other investment. If $b_t < 0$ is permitted (i.e. debt), then $b_t$ usually satisfies a 'no Ponzi' condition that rules out unlimited borrowing and effectively imposes the rule that $b_T = 0$ (for $t \in \{0, \dots, T\}$). Consumers take prices, wages, and the interest rate as given; these are state variables.

Firms maximise profits $\Pi_t = p_t Y_t - w_t N_t$ (possibly including a $-r_t K_t$ term if savings are invested) subject to a production function, $Y_t = A_t f(N_t, K_t)$, that turns labour, $N_t$, and capital, $K_t$, into consumption goods. Typically, $f$ is a monotonically increasing function of its inputs and $A_t$ is either predetermined or follows a log-autoregressive process $\ln A_t = \rho_A \ln A_{t-1} + \epsilon_t$; $\epsilon_t \sim N(0, \sigma_A)$ is known as a technology 'shock'. Governments perform functions such as the collection and redistribution of taxes. Firms are assumed to be perfectly competitive, meaning that each firm takes prices and wages as given. Prices, wages, and interest rates are determined by market clearing for goods, labour, and savings respectively, in which supply and demand are balanced in each market. These 'general equilibrium' conditions bind agents and the environment together, and are atypical in reinforcement learning. The competitive equilibrium is defined by a vector of state variables, and by consumption and production plans for the agents that maximise utility.
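To make the structure above concrete, the following minimal Python sketch (not from the paper) encodes a household's per-period utility and budget-constraint update. The log-utility functional form is borrowed from the toy model used later in the paper; the parameter values, function names, and the assumption that the budget constraint binds with equality are illustrative choices, not the authors' implementation.

```python
import numpy as np

# Illustrative sketch of the household problem described above, with
# consumption c_t, hours n_t, wage w_t, interest rate r_t, price p_t,
# and bond holdings b_t.  THETA and BETA are assumed placeholder values.
THETA = 1.0   # weight on the disutility of hours worked (assumption)
BETA = 0.97   # discount factor (assumption)

def period_utility(c_t: float, n_t: float, theta: float = THETA) -> float:
    """u(s_t, a_t): increasing in consumption, decreasing in hours worked."""
    return np.log(c_t) - 0.5 * theta * n_t ** 2

def budget_step(b_prev: float, c_t: float, n_t: float,
                w_t: float, r_t: float, p_t: float) -> float:
    """Advance savings assuming the budget constraint
    (1 + r_t) b_{t-1} + w_t n_t = p_t c_t + b_t holds with equality."""
    return (1.0 + r_t) * b_prev + w_t * n_t - p_t * c_t

def discounted_utility(c_path, n_path, beta: float = BETA) -> float:
    """U = sum_t beta^t u(c_t, n_t) along one realised history."""
    return sum(beta ** t * period_utility(c, n)
               for t, (c, n) in enumerate(zip(c_path, n_path)))
```

In an RL framing of this problem, budget_step plays the role of the state-transition function and period_utility the role of the reward.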
Often, the optimal policies of all agents are solved for analytically by Lagrangian methods: the equilibrium conditions are substituted in and the system of equations simplified, usually by log-linearising the model around an assumed steady state.

We now review some macroeconomic models before briefly discussing multi-agent models more generally. Representative agent models with rational expectations are an important class of macroeconomic model, the most well-known being the representative agent dynamic stochastic general equilibrium (DSGE) model. The canonical example is the representative agent New Keynesian (RANK) model (Smets & Wouters, 2007). Continuum rational expectations models overcome some of the heterogeneity-related shortcomings of those models by replacing the representative household with a continuum of households that are ex ante differentiated by their assets and labour productivity. The canonical example is the heterogeneous agent New Keynesian (HANK) model (Kaplan et al., 2018). Macroeconomic agent-based models differ in that they simulate agents as discrete entities, but they also typically make very different assumptions to, say, RANK or HANK models; the most important being that they tend not to assume rational expectations/perfect foresight and they may not necessarily have competitive markets. Importantly, they allow for heterogeneity in multiple dimensions simultaneously (Haldane & Turrell, 2019). Agent-based models (ABMs) are also extensively used in epidemiology (Tracy et al., 2018), sometimes under the name 'individual-based models'. At the start of the coronavirus crisis, UK government policy was heavily informed by such models, most notably that of Ferguson et al. (2020), and there are several ABMs modelling the coronavirus pandemic (Hoertel et al., 2020; Kerr et al., 2020). These epi-ABMs do not capture economic effects.

Epi-macro models attempt to combine macroeconomic and epidemiological effects, and their interaction. The canonical examples combining epidemiology and an REE representative agent model are Eichenbaum et al. (2020a) and Eichenbaum et al. (2020b), who link the two by assuming that, in addition to the usual Susceptible, Infected, Recovered (SIR) transmission mechanism posed by Kermack & McKendrick (1927), a household agent may be infected at work or while engaging in consumption. Market clearing is also assumed. Building on many of the same assumptions as HANK, the canonical continuum agent epi-macro model with REE is by Kaplan et al. (2020). Agents are differentiated by their assets, productivity, occupation, and health status. There are three types of good: regular, social, and home-produced; and three types of work: workplace, remote, and home. The epi-macro link is achieved through a transmissibility of infection that is modified to include terms proportional to hours worked and amount consumed, with avoidance of infection captured through a disutility of death. Market clearing is assumed. Finally, recent work has seen reinforcement learning applied to multi-agent systems of relevance to economics, in the case of bidding in auctions under constraints (Feng et al., 2018; Dütting et al., 2019), and in deciding on behaviours for both agents and a social planner in a gather-and-build economy (Zheng et al., 2020).
In the rest of this paper, we show how to use reinforcement learning to solve typical rational expectations macroeconomic models while also incorporating discrete agent heterogeneity and, potentially, stochasticity; demonstrating that all three can be combined is by far our major contribution and has applications for a wide class of economic problems.

A typical rational expectations equilibrium problem is that of precautionary saving, in which agents anticipate a change in circumstances that will adversely affect their utilities, in this case a reduction in wages, and respond in advance in order to smooth their consumption. Such behaviour is typical of the agents in an REE model. The simplest version of this problem has a known analytical solution. We solve this model using reinforcement learning so that we may compare it to the analytical solution, and we also use it as a way to demonstrate many of the challenges of using RL for this class of problems; notably the speed and accuracy of convergence given the sensitivity to the estimate of the value function, the continuous action and state spaces, and the enforcement of the 'no Ponzi' condition.

We assume that there is a single household agent with rational expectations. There are $I = 2$ firms, with the firms and the good each firm produces indexed by $i$. The household agent is employed by one of these firms, which we will denote $e$, and has per-period utility $u_t = \sum_{i \in I} \ln c_{it} - \frac{\theta}{2} n_t^2$ with actions (choice variables) $c_{it}$ consumption and $n_t$ hours worked. $0 \le t < T$ is the discrete timestep. The price vector is fixed to $p_{it} = 1$, and the interest rate is fixed to 0. The wage is imposed as $w = 1$ for $t < T/2$ and $w = 0.5$ afterwards, a fall that is anticipated. The household agent is subject to a budget constraint such that $b_{t+1} = b_t + w_t n_t - \sum_{i \in I} p_{it} c_{it}$. The no Ponzi condition is imposed via $b_T = 0$, which prevents unlimited borrowing by the household. The agent maximises its discounted utility $\sum_{0 \le t < T} \beta^t u_t$.

We then advance both the capital and epidemiological state to proceed to the next timestep. $\tilde{U}_{\tau+1}$ is obtained from $\tilde{U}_{\tau}$ by continuing training using $\tilde{H}_{\tau}$. The network parameters are the same as in §3.1; however, we use a gentler decay of the learning rate. There are no problems observed with convergence, and the adherence to the no Ponzi condition is a good test of this, since achieving it is sensitive to the entire time history of the simulation. As in the multi-agent model, consumers take prices ($p_i$, $w_i$ and $r$) as given, and find their optimal consumptions, hours worked and investments before advancing their state using the budget constraint and the probabilities of their SIRD state changing. We examine two cases: a 'heterogeneous' case as described above, and a 'homogeneous' case without age heterogeneity but with the same mean death rate.

Figure 1 shows the percentages of susceptible, infected, recovered and deceased agents as the pandemic progresses. Each line is an average over the results of 3 simulations, and each simulation's result is an average over 20 histories. As can be seen from the 95% confidence intervals in the figure, the behaviour is similar across simulations. In the homogeneous-age case, more people are infected over the course of the pandemic, but there are fewer deaths in total. Figure 2 shows the agents' consumptions. We bin the uniform age distribution into young (< 40), old (> 70) and middle-age groups; we find considerable differences in behaviour between them.
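As a rough illustration of how an agent's health state might be advanced "using ... the probabilities of their SIRD state changing", the sketch below (not the authors' code) implements a single SIRD transition with an age-dependent death probability. The infection probability, recovery probability, and the shape of the death-risk curve are placeholder assumptions; in the paper these quantities depend on the agent's consumption, hours worked, and the aggregate state of the epidemic.

```python
import numpy as np

# Hedged sketch of one SIRD health-state transition for one agent.
RNG = np.random.default_rng(0)
RECOVERY_P = 0.1   # per-step recovery probability (assumption)

def death_probability(age: float) -> float:
    """Illustrative, uncalibrated death risk that increases with age."""
    return 0.001 + 0.01 * max(age - 40.0, 0.0) / 40.0

def advance_sird(state: str, age: float, p_inf: float) -> str:
    """Advance an agent's state among 'S', 'I', 'R', 'D'."""
    if state == "S" and RNG.random() < p_inf:
        return "I"
    if state == "I":
        u = RNG.random()
        if u < death_probability(age):
            return "D"
        if u < death_probability(age) + RECOVERY_P:
            return "R"
    return state  # 'R' and 'D' are absorbing; 'S' stays 'S' if not infected
```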
After consuming the most before the pandemic, since they anticipate the coming opportunity to save, the old strongly reduce consumption in response to infection risk, and reduce consumption of the riskier 'social' good more. The young, conversely, are unlikely to die and so their consumption is relatively unchanged, governed by the decrease in the size of the economy. Figure 3 shows the mean investments per agent in each age group, normalised to the salary (product of wage and hours worked) of an agent in the same system with no pandemic and no saving. The young anticipate the pandemic and save before it in order to spend when infection rates are higher, with the converse behaviour for the old. It also shows that the no Ponzi condition holds with good accuracy for all age groups. Finally, Figure 4 shows the total consumption for both heterogeneous-age and homogeneous-age cases. The inclusion of distributional effects causes significant changes to bulk macroeconomic quantities. Together these results show that, in this uncalibrated model, the inclusion of age heterogeneity makes a substantial difference to both the epidemiological and economic progress of a pandemic. This model and solution method has been tested with a range of epidemiological and economic parameters and has shown consistent stability and convergence. This exercise has also shown the sensitivity of the model's conclusions to those parameters, emphasising the importance of calibration in all aspects of the model if it were to be used as more than a test case of the methodology.

The hardware, software, and parameters are identical to §3.1 with the exception of the learning rate decay, which is slower here to allow for the longer time history. Scaling is linear with $J$, since the number of iterations of the least squares optimisation seems to scale very weakly with $J$ for this problem. Each history calculation followed by an RL update takes ∼30 minutes on the reference machine, so a single simulation takes ∼24 hours, or ∼720 GFLOPS-hours.

In the previous section, only the agents' state was stochastic and, because we were in the limiting case of large numbers of agents, each individual agent's state did not affect the global state. We now return to the model in §3.1, but instead of having a deterministic drop in wages, wages now follow a log-autoregressive stochastic process. We return to the Bellman equation, (1). We assume there is no stochasticity in health states, so the $\mathbb{E}_{s'|s,a}(\cdot)$ are no longer present; however, the $\mathbb{E}_{S_{t+1}|S}(\cdot)$ remain, where $s_{t+1} = a(s)$ advances deterministically. We change the method to solve for $\tilde{U}$ and $\tilde{D}$, defining
$$\tilde{U}(t, S, s') = \mathbb{E}_{S_{t+1}|S}\left[U(t+1, S_{t+1}, s')\right]$$
and
$$U(t, S, s) = \max_a \left[ u_t(S, s, a) + \beta \tilde{U}(t, S, s_{t+1}) \right],$$
where the value of $a$ which attains the maximum defines $a^*$. Again, we will find the maximum using Lagrange's method and the auxiliary quantity $\tilde{D}$, and so
$$\partial_a L(t, S, s, a) = \partial_a u_t(S, s, a) + \beta\, (\partial_a s_{t+1})\, \tilde{D}(t + 1, S, s_{t+1}). \qquad (7)$$
As is standard in reinforcement learning, the expectations are approximated by using a large number of global state histories to update $\tilde{U}$ and $\tilde{D}$. Excepting the wage history, the set-up is as in §3.1. The agents are statically heterogeneous in their propensity to work, $\theta \in [0.6, 1.4]$, and employer, $e$; and dynamically heterogeneous in savings. Again, the multiple employers and the treatment of heterogeneity in $\theta$ are introduced so that demonstrations of convergence and hyperparameter choice are relevant to the problem in the next section.
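The role of the auxiliary quantity $\tilde{D}$ in equation (7) can be illustrated with a one-dimensional toy: given an estimate of the derivative of the continuation value with respect to the next state, the optimal action is the root of the first-order condition. The sketch below is a simplification under stated assumptions, not the paper's multi-good implementation: a single consumption choice, a fixed income and interest rate, and a closed-form stand-in for the learned network $\tilde{D}$.

```python
import numpy as np
from scipy.optimize import brentq

# One-dimensional illustration of an Eq. (7)-style first-order-condition solve.
# The agent chooses consumption a given savings s; the next state is
# s' = (1 + r) s + y - a, and d_tilde stands in for the learned dU/ds'.
BETA, R, INCOME = 0.97, 0.02, 1.0   # assumed placeholder values

def d_tilde(s_next: float) -> float:
    """Placeholder for the trained derivative network; a toy 1/(s' + 2) shape."""
    return 1.0 / (s_next + 2.0)

def foc(a: float, s: float) -> float:
    """d/da [u(a) + beta * U(s')] = 1/a + beta * (ds'/da) * D(s'), ds'/da = -1."""
    s_next = (1.0 + R) * s + INCOME - a
    return 1.0 / a - BETA * d_tilde(s_next)

def optimal_action(s: float) -> float:
    """Root of the first-order condition on a feasible consumption bracket."""
    upper = (1.0 + R) * s + INCOME + 1.9   # keeps s' > -2 so d_tilde stays finite
    return brentq(foc, 1e-6, upper - 1e-6, args=(s,))

print(optimal_action(0.5))
```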
In this toy problem, we compare two types of agent: one a current-time wage-observing agent, whose future utility is a function of the wages at the current timestep, $w_t$: $U(t, S = (t, w_t), s = (\theta, e, b_t))$; the other a non-wage-observing agent with $U(t, S = (t), s = (\theta, e, b_t))$. Both are able to optimise by testing how their strategies play out within the histories they have seen; however, they only have partial visibility of the global state. In the previous case, which had a deterministic global state, knowledge of the time determined all other global variables, but here the mapping from observable values to global state is one-to-many. The wage follows a log-autoregressive process, $\ln w_t = \rho_w \ln w_{t-1} + \epsilon_t$; $\epsilon_t \sim N(0, \sigma_w)$; where we use $\rho_w = 0.97$, $\sigma_w = 0.1$, and $w_{t=0} = 1$. Since the wage is autoregressive, knowledge of the current wage adds information about the wage in the future. The agent is trained on $\#T \in \{100, 1000\}$ training histories. We parametrise wage histories by their mean absolute fractional deviation, $d_h$, from the mean path of the autoregressive process, $w_{\text{mean},t} = \exp(\sigma_w^2 t / 2)$, and use previously unseen histories with $d_h > 0.2$ to represent those with significant deviation from the mean path. We judge the success of this model by calculating the average total utility an agent with $\theta = 1$ attains over these previously unseen wage histories, $\{w_{h,t}\}_{0 \le t < T}$; the average over histories with $d_h > x$ is denoted $\bar{u}_{d_h > x}$.

Table 1 compares the average utilities of agents with different training setups and wage visibility to an analytic approximation, found by defining the action at time $t$ in history $h$ to be the values obtained from the formulae in §3.1 for a wage history beginning at time $t$: $c_{h,t} = 1/\tilde{\lambda}$ and $n_{h,t} = w_{h,t}\, \theta^{-1} \tilde{\lambda}$, obtaining $\tilde{\lambda}$ from the no Ponzi condition, where the expectation over future wages $w_{p,t}$ is evaluated analytically over all possible wage paths that have $w_{p,t} = w_{h,t}$, using the second moment of the log-normal distribution.

We implement prioritised experience replay (Schaul et al., 2015) by retaining experiences with a larger error for a larger number of training epochs. This increases the agents' performance relative to a base agent trained as in previous sections, particularly on the $d_h > 0.2$ histories that deviate significantly from the mean path. This is expected since the training examples are sparser and more varied at higher $d_h$. The wage-observing agent outperforms the non-wage-observing agent. In addition to the solution becoming stable and the no Ponzi condition being satisfied, the fact that the utilities for the wage-observing agent are consistently higher than those for the analytical approximation gives us confidence that the answers converged to are accurate. We record the utilities after 100 epochs, which equates to 40,000 experiences or 10 minutes on the reference machine. A degradation in performance is seen if an insufficient number of training histories is used. The specification is identical to §3.1.2 except that the rate of decay of the learning rate is decreased to allow averaging, $l_r = e^{-0.01E}/(1 + E)$ for epoch $E$; and, as discussed, prioritised experience replay is used.

Finally, we demonstrate a general equilibrium model that has stochastic global variables, specifically a log-autoregressive process in the technology. This is given by $\ln A_{it} = \rho_A \ln A_{i,t-1} + \epsilon_t$; $\epsilon_t \sim N(0, \sigma_A)$, and it affects the production of a sector, given by $Y_{it} = A_{it} K_{it}^{1-\alpha} N_{it}^{\alpha}$. We use $\alpha = 2/3$, $\rho_A = 0.97$, $\sigma_A = 0.1$, and $A_{i0} = 1$.
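For concreteness, a minimal simulation of the log-autoregressive shock process defined above might look as follows (the wage process in the previous subsection has the same form with $\rho_w$ and $\sigma_w$). The horizon and random seed are arbitrary illustrative choices, and the code is a sketch rather than the authors' implementation.

```python
import numpy as np

# Minimal sketch of the log-autoregressive technology process,
# ln A_t = rho_A * ln A_{t-1} + eps_t,  eps_t ~ N(0, sigma_A),  A_0 = 1.
RHO_A, SIGMA_A, A0 = 0.97, 0.1, 1.0
T = 200  # horizon chosen only for illustration

def simulate_technology(rng: np.random.Generator, horizon: int = T) -> np.ndarray:
    log_a = np.empty(horizon)
    log_a[0] = np.log(A0)
    eps = rng.normal(0.0, SIGMA_A, size=horizon)
    for t in range(1, horizon):
        log_a[t] = RHO_A * log_a[t - 1] + eps[t]
    return np.exp(log_a)

def production(A: np.ndarray, K: float, N: float, alpha: float = 2.0 / 3.0) -> np.ndarray:
    """Cobb-Douglas output Y_t = A_t * K^(1-alpha) * N^alpha for a sector."""
    return A * K ** (1.0 - alpha) * N ** alpha

A_path = simulate_technology(np.random.default_rng(0))
Y_path = production(A_path, K=1.0, N=1.0)
```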
The agents are as in §3.3, being statically heterogeneous in propensity to work, $\theta$, and employer, $e$, and dynamically heterogeneous in investment, $k_t$; however, they now have visibility of all prices, not just wages. The global model that couples the agents is the same RBC model described in §3.2 and Appendix I, but without infections. As the agents' internal states are no longer stochastic, a smaller number ($J = 10$) of agents can be used. As in §3.3, we compare two types of agents, one that 'sees' realisations of the prices and another that does not. Since differing values of the technology shock move the general equilibrium, changing the prices, observing them gives the agent information about the state of the underlying stochastic process. Figure 5 shows the paths of hours worked, $n_j$, for 5 agents with a range of values of $\theta$, all of whom work in the same sector, given by $i = 0$. As expected, the paths of agents who can observe realisations of prices have smaller fluctuations, which is also true of the paths of other quantities. Averaging over 256 runs, the mean unsigned curvature of the paths drops from $\bar{\kappa}_{\text{non-obs}} = 0.44$ to $\bar{\kappa}_{\text{obs}} = 0.27$; the same fall is seen in the curves in the figure, where $\bar{\kappa}_{\text{non-obs}} = 0.42$ and $\bar{\kappa}_{\text{obs}} = 0.19$. This difference arises because agents who can observe realisations are better able to adjust their behaviour to the current and (since it is an autoregressive process) future values of the technology shock.

The specification of the neural network and learning remains unchanged from the previous section. Calculation of the multiple histories is parallelised; we use 8 threads. For each simulation epoch $E$, $8(4 + E)$ histories are found, followed by 20 RL training epochs. In total, there are 12 simulation epochs, ≈1000 histories, and 240 RL training epochs, each of which samples from the most recent 50% of the histories. The number of histories and epochs is informed by the convergence properties from the previous section. A total of ≈100,000 experiences are recorded during the whole simulation, which takes ≈6 hours.

This work shows that reinforcement learning can be used to solve a wide range of important macroeconomic rational expectations models in a way that is simple, flexible, and computationally tractable. Furthermore, these methods can be immediately applied to previously intractable problems with multiple degrees of discrete heterogeneity and stochasticity. Being highly relevant to real-world phenomena, such as climate change and disease transmission, these capabilities are of great value to policymakers and can be developed into serious tools to aid decision making in complex scenarios. Finally, by linking to reinforcement learning, this work provides the potential to apply its extensive toolkit of techniques, many of which have direct relevance to economic questions: examples include accessing larger state and action spaces (e.g. Lillicrap et al., 2015), including bounded rationality, or applying inverse reinforcement learning to deduce agents' objectives and rewards from observed micro- and macro-economic behaviours. Additionally, we can harness improvements in implementation such as GPU/TPU acceleration (Paszke et al., 2019) and distributed computing (Mnih et al., 2016).

Appendix I: The Real Business Cycle model

We use a standard real business cycle model; however, we adopt notation and variables common in reinforcement learning, in particular an emphasis on state variables (capital) rather than action variables (consumption, hours worked), and the inclusion of an action-dependent expectation.
A baseline RBC model would use Equations 9, 12, 13, 14, 15, 16, and 19. We use Equations 9, 10, 11, 14, 15, 16, and 19, but note that 10 and 12, and 11 and 13, are the same up to algebraic manipulation, expressing the future behaviour in terms of $U(k)$ and $U'(k)$, functions of the state, rather than in terms of consumption or other actions as is usually seen. We work in real quantities, using $p_{i=0} = 1$ as the numéraire, with other quantities, including the prices $p_{i \neq 0}$, defined relative to this.

Notation: $j$ is an index that runs over the $J$ consumer-workers; $i$ runs over the $I$ consumption goods, each of which is produced by a different sector/firm. For consumers, $n_j$ is hours worked, $c_{ji}$ is consumption, $k_j$ is capital held by the consumer, $v_j$ is their investment in capital in that timestep, and $\theta_j > 0$ is the weight given to hours in the utility function. $E_i$ denotes the set of agents employed at firm $i$, and $e(j)$ is the index, $i$, of the employer of $j$. For firms, $N_i$ is the number of hours worked at the firm, $K_i$ is its capital, $A_i$ is the firm's technology, $Y_i$ is its production, and $C_i$ is the consumption of the firm's goods. $r$ is the real interest rate, $w_i$ are wages, and $p_i$ are the prices of goods.

For consumers, the time-$t$ utility is
$$u_j(\{c_{ji}\}, n_j; \theta_j) = \sum_i \ln c_{ji} - \tfrac{1}{2}\theta_j n_j^2$$
and their total utility from time $t$ onward is
$$U_{j,t,S}(k_{t,j}) = \max_{c_{ji},\, n_j}\Big[ u_j(c_{ji}, n_j) + \beta \sum_{S'} P_{S \to S'}(c_{ji}, n_j)\, U_{j,t+1,S'}(k_{t+1,j}) \Big].$$
Their budget constraint is
$$w_{e(j)} n_j + r k_{j,t} = \sum_i p_i c_{ji} + v_j \quad \forall j, \qquad (9)$$
with capital accumulating as $k_{j,t+1} = k_{j,t} + v_j$. Let $U'_j = \partial_{k_{t+1,j}} U(k_{t+1,j})$; expectations are over the distribution of probabilities $P_{S \to S'}$, and so $\mathbb{E}\,\partial_{c_{ji}} \ln P_j = \sum_{S \to S'} \partial_{c_{ji}} P_j$. Consumers take prices ($p_i$, $w_i$ and $r$) as given and use their first-order conditions to find $c_{ji}$ and $n_j$; this is solved iteratively since $U'$ is a function of $k_{t+1}$ and thus of $c_{ji}$ and $n_j$. In the reinforcement learning training we add an additional reward for individually achieving a no-Ponzi condition at the final time-step. If the probabilities were independent of the action, $\partial_{c_{ji}} P_j = 0$, as would be the case in a standard RBC model, then that term could be removed and the $\mathbb{E}\,U'_j$ eliminated, obtaining
$$n_j = \frac{w_{e(j)}}{\theta_j c_{ji} p_i} \quad \forall i.$$
A small amount of work, with care taken as to the maximisation over $a$ in the definition of the utility, shows that $\mathbb{E}\,U'_j = \mathbb{E}\big[(1 + r_{t+1})(p_{i,t+1} c_{ji,t+1})^{-1}\big]$, and so Equation 10 reduces to the Euler equation
$$(p_{i,t} c_{ji,t})^{-1} = \beta\, \mathbb{E}\big[(1 + r_{t+1})\,(p_{i,t+1} c_{ji,t+1})^{-1}\big],$$
where the $p_i$ remains due to the multiple goods with $p_{i \neq 0} \neq 1$.

Firms are profit maximising with production function $Y_i = A_i K_i^{1-\alpha} N_i^{\alpha}$. Since profit is $\Pi_i = p_i Y_i - w_i N_i - (r + \delta)K_i$, taking prices ($p_i$, $w_i$, $r$) as given yields the first-order conditions
$$w_i = \alpha \frac{p_i Y_i}{N_i}, \qquad r + \delta = (1 - \alpha) \frac{p_i Y_i}{K_i}, \qquad (16)$$
and therefore $\Pi_i = 0$. We split $K_i = K_e + \tilde{K}_i$, where $K_e$ is an endowment and $\tilde{K}_i$ is provided by investment from the consumers. The technology shock is log-autoregressive, $\ln A_{i,t} = \rho_A \ln A_{i,t-1} + \epsilon_t$, $\epsilon_t \sim N(0, \sigma_A)$. Wages are set by market clearing for hours worked, $N_i = \sum_{j \in E_i} g_j n_j$, and the real interest rate is set by market clearing for capital, $\sum_i \tilde{K}_i = \sum_j g_j k_j$, where $g_j$ is the weight of each agent, with $\sum_j g_j = 1$.
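A hedged sketch of the market-clearing step implied by these conditions is given below: given each sector's aggregated hours and capital, the Cobb-Douglas first-order conditions pin down wages and the rental rate. The depreciation rate value, the agent weights, and the two-sector example inputs are illustrative assumptions; in the full model there is a single economy-wide interest rate, with capital allocated across sectors so that the sectoral returns coincide.

```python
import numpy as np

# Sketch (not the authors' code) of clearing the labour and capital markets
# given aggregated sectoral hours N_i and capital K_i, using the Cobb-Douglas
# first-order conditions  w_i = alpha*p_i*Y_i/N_i,  r + delta = (1-alpha)*p_i*Y_i/K_i.
ALPHA = 2.0 / 3.0
DELTA = 0.05   # assumed depreciation rate for illustration

def clear_markets(A, K, N, p):
    """A, K, N, p are arrays over sectors i; returns (Y, w, r) per sector."""
    Y = A * K ** (1.0 - ALPHA) * N ** ALPHA
    w = ALPHA * p * Y / N
    r = (1.0 - ALPHA) * p * Y / K - DELTA
    return Y, w, r

# Aggregation of individual labour supplies with agent weights g_j (sum to 1),
# N_i = sum_{j in E_i} g_j n_j; here a toy two-sector, four-agent example.
g = np.array([0.25, 0.25, 0.25, 0.25])
n = np.array([0.9, 1.1, 1.0, 0.8])
employer = np.array([0, 0, 1, 1])
N = np.array([g[employer == i] @ n[employer == i] for i in range(2)])
Y, w, r = clear_markets(A=np.ones(2), K=np.array([1.0, 1.2]),
                        N=N, p=np.array([1.0, 1.0]))
```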
Acknowledgements: We would like to thank Federico Di Pace for useful discussions.

References
- Compatible value gradients for reinforcement learning of continuous deep policies
- Optimal auctions through deep learning
- The macroeconomics of epidemics
- Epidemics in the neoclassical and new Keynesian models
- Revisiting fundamentals of experience replay
- Deep learning for revenue-optimal auctions with budgets
- Impact of non-pharmaceutical interventions (NPIs) to reduce COVID-19 mortality and healthcare demand
- Drawing on different disciplines: macroeconomic agent-based models
- Learning continuous control policies by stochastic value gradients
- A stochastic agent-based model of the SARS-CoV-2 epidemic in France
- Monetary policy according to HANK
- The great lockdown and the big stimulus: tracing the pandemic possibility frontier for the U.S. (Working Paper 27794)
- A contribution to the mathematical theory of epidemics
- Covasim: an agent-based model of COVID-19 dynamics and interventions (medRxiv)
- A method for stochastic optimization
- Time to build and aggregate fluctuations
- Continuous control with deep reinforcement learning
- A conceptual model for the outbreak of coronavirus disease 2019 (COVID-19) in Wuhan, China with individual reaction and governmental action
- Human-level control through deep reinforcement learning
- Asynchronous methods for deep reinforcement learning
- An imperative style, high-performance deep learning library
- Prioritized experience replay
- Proximal policy optimization algorithms
- Shocks and frictions in US business cycles: a Bayesian DSGE approach
- Reinforcement learning: An introduction
- Agent-based modeling in public health: current applications and future directions
- Deep reinforcement learning with double Q-learning
- The AI economist: improving equality and productivity with AI-driven tax policies