key: cord-0619800-jcwjmf4s authors: Petrone, Daniele; Rodosthenous, Neofytos; Latora, Vito title: Artificial intelligence applied to bailout decisions in financial systemic risk management date: 2021-02-03 journal: nan DOI: nan sha: c4962b4c1a92cc82dc5788a5dd56488cc6d56780 doc_id: 619800 cord_uid: jcwjmf4s

We describe the bailout of banks by governments as a Markov Decision Process (MDP) where the actions are equity investments. The underlying dynamics is derived from the network of financial institutions linked by mutual exposures, and the negative rewards are associated with the banks' defaults. Each node represents a bank and is associated with a probability of default per unit time (PD) that depends on its capital and is increased by the default of neighbouring nodes. Governments can control the systemic risk of the network by providing additional capital to the banks, lowering their PD at the expense of an increased exposure in case of their failure. Considering the network of European global systemically important institutions, we find the optimal investment policy that solves the MDP, providing direct indications to governments and regulators on the best way of action to limit the effects of financial crises.

In times of crisis, as during the recession of 2008 or the economic disruption triggered by the COVID-19 pandemic, governments face difficult decisions about bailing out strategically important companies. In particular, large banks are critical for the stability of the financial system and are closely monitored by central banks and governments. As an example, to rescue Royal Bank of Scotland (RBS) in 2008-2009, the UK government became the majority shareholder of the bank, purchasing shares for a total of 45.5 billion pounds [5]. The government achieved its objectives to stabilise the financial system, and no depositor in UK banks lost money. However, the cost for taxpayers has been estimated by the Office for Budget Responsibility (OBR) to be in the region of 27 billion pounds as of March 2018 [2]. The price of RBS shares plummeted after the purchase and the government has since sold part of its investment at a loss. Was the government intervention value for money?
The National Audit Office (NAO) is the UK's public spending watchdog and in December 2009 released the report "Maintaining financial stability across the United Kingdom's banking system" [3], where it analysed the government support for the banking sector and concluded that: "If the support measures had not been put in place, the scale of the economic and social costs if one or more major UK banks had collapsed is difficult to envision. The support provided to the banks was therefore justified, but the final cost to the taxpayer of the support will not be known for a number of years". The NAO did not, however, produce an estimate of the impact of government inaction. In this paper, we propose a mathematical framework that allows a quantitative comparison between alternative government investment decisions.

Our framework is based on the following three building blocks: (a) a dynamical network model of the financial system with a contagion mechanism between financial institutions; (b) a set of allowed government interventions to control the network; and (c) a quantitative way to assess the government actions at each time step. A network model [11] [14] [19] [20] is essential, as the main concern is not the direct cost of a default but the systemic risk that it entails. Systemic risk can be defined as the risk that a large part of the financial system is disrupted, and as such it requires connections between financial institutions that can transfer the distress along the network [12] [13] [16]. The contagion mechanism that we use is the impact that a bank default has on other banks [6]. The impact can be due to direct losses in bilateral credit exposures [23] [22] (for example, if other banks had lent money to the defaulting bank), or indirect losses due to fire sales of assets by the defaulting bank [15], which would lower the market value of similar assets in the balance sheets of the other financial institutions. The impact lowers the capital buffer of the affected banks, weakening the network and its ability to withstand future shocks. In particular, the probability of default per unit time (PD) of the nodes (banks or financial institutions) would increase, hence increasing the expected loss in the network [6].

One main novelty of our model is that we allow the network to be controlled by a government investment in the capital of the banks. Such an investment would, conversely, decrease the PD of the banks that receive the additional capital, but also increase the loss for the government in case of default. In our framework, the connection between the change in PD and the variation in the amount of capital is provided by the Merton model of credit risk [18]. To follow the evolution in time of the network, we simulate the default process given the PD of the nodes and their tendency to default during the same time step. Finally, we use artificial intelligence techniques [24] [26] [28] to assess the optimality of government decisions (no investment vs different amounts of investment), recasting the system as a Markov Decision Process (MDP) [29] where the actions (controls) are government investments at each time step.

The paper is structured as follows. In section II A we describe the network of financial institutions, its dynamics and contagion mechanism, and then in section II B we introduce a Markov Decision Process based on the network, in order to model government interventions on bailed-out banks.
In section II C we present our strategy to solve the MDP, by finding the optimal government investment decision for each state of the network and time. Section III contains our results, obtained by applying our model to a homogeneous network organised as a Krackhardt kite graph (see Fig. 1) and to the network of the European Global Systemically Important Institutions. We have found that a preexisting investment in a distressed node makes it convenient for the government to intervene again to try to save the invested capital (creating moral hazard, as the node could act recklessly relying on the implicit government guarantee). Moreover, by changing the parameter α, which accounts for the taxpayers' loss in case a bank defaults, we have observed that there is a 'critical' value that separates networks for which the inaction of the government is the best option from networks where an investment by the government would be the optimal decision, as it would lower the overall expected loss of the system. Finally, we provide our conclusions in section IV.

We consider a network G with a set I = {1, ..., N} of nodes representing financial institutions. Each node i ∈ I is characterised at time t by a probability of default PD_i(t) ∈ (0, 1] per time interval ∆t, a total asset W_i(t) and an equity E_i(t) (such that E_i(t) ≤ W_i(t)), which is the capital used by node i as a buffer to withstand financial losses. The edges w_ij of the network represent the exposure of node i to the default of node j, for all i ≠ j ∈ I. To take into account government interventions aimed at limiting the overall losses, we use an adaptation of the 'PD Model' described in [6], extending it to allow the possibility for the nodes (banks) to incur positive shocks, via investments in the nodes, rather than just negative shocks due to the default of other nodes. The focus has also changed from the one in [6], as we are now exclusively interested in the losses incurred by the taxpayers, disregarding the losses sustained by private investors. In the following, we will measure the time in discrete time steps that are multiples of ∆t, i.e. t + 1 is equivalent to t + ∆t.

We define the total impact I_i(t) on node i at time t, due to the default of other nodes j ∈ I \ {i} in the network, as

$$I_i(t) = \sum_{j \in I \setminus \{i\}} w_{ij}\, \delta_j(t), \qquad (1)$$

where δ_j(t) = 1 if and only if node j defaults at time t and δ_j(t) = 0 otherwise. The impact I_i(t) represents a loss for the total asset W_i, which in turn decreases also the equity E_i of node i, hence reducing their value at time t + 1. This can be seen from the accounting equation for each node i, namely

$$W_i(t) = E_i(t) + B_i(t), \qquad (2)$$

which states that the total asset W_i is always equal at all times to the equity E_i plus the total liability B_i. Note that B_i is not affected by the losses, as it is comprised of loans from other banks, deposits, etc., that are due in full unless the bank i defaults. Hence, we have

$$\Delta W_i(t) = \Delta E_i(t), \qquad (3)$$

where we define ∆X_i(t) := X_i(t + 1) − X_i(t). We can therefore write

$$W_i(t+1) = W_i(t) - I_i(t) + \Delta J_i(t), \qquad (4)$$

$$E_i(t+1) = E_i(t) - I_i(t) + \Delta J_i(t), \qquad (5)$$

where ∆J_i(t) denotes the potential increase in the current investment J_i(t) of the government in node i at time t. On the other hand, the probability of default PD_i(t) of node i is increased by the impact I_i(t) at time t, since part of the capital buffer (equity E_i) is lost.
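To make the bookkeeping in (1)-(5) concrete, the following minimal Python sketch (function and variable names are ours, not from the paper) performs one time step of the balance-sheet update, given an exposure matrix, a vector of default indicators and the government injections:

```python
import numpy as np

def balance_sheet_step(W, E, J, w, delta, dJ):
    """One-step update of total assets W, equity E and government stake J.

    W, E, J : (N,) arrays with the current values per node
    w       : (N, N) exposure matrix, w[i, j] = exposure of i to the default of j
    delta   : (N,) 0/1 vector of nodes defaulting at this time step
    dJ      : (N,) government capital injections decided at this step
    """
    I = w @ delta              # total impact I_i(t) = sum_j w_ij * delta_j(t), eq. (1)
    dX = -I + dJ               # equity and assets move together, since B_i is unchanged
    W_next, E_next = W + dX, E + dX          # eqs. (4)-(5)
    defaulted = E_next <= 0    # impact at least as large as the buffer: node defaults
    return W_next, E_next, J + dJ, defaulted
```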
In order to model the effect of the impact I_i(t) on PD_i(t), we use the Merton model for credit risk [18] to calculate the 'implied probability of default' PDM as a function of the parameters of each node:

$$PDM(W, E, \mu, \sigma) = \Phi\left( \frac{\ln\left(\frac{W - E}{W}\right) - \left(\mu - \frac{\sigma^2}{2}\right)\Delta t}{\sigma \sqrt{\Delta t}} \right), \qquad (6)$$

where the term W − E represents the total liability B of each bank, Φ is the univariate standard Gaussian distribution, µ is the drift and σ is the volatility of the geometric Brownian motion associated to the total asset W in the Merton model. We then use (6) to obtain

$$PD_i(t) = \max\left\{ PDM(W_i(t), E_i(t), \mu_i, \sigma_i),\; PDMfloor_i \right\}, \qquad (7)$$

where we introduced the fixed number PDMfloor_i representing the lower bound of the PD_i that is used to exclude unreasonably low probabilities of default. For example, it is a standard assumption for the PD_i of a bank i to be greater than or equal to the probability of default of the country where it is based. In this context, the latter is the probability of a country defaulting on its debt.

Now, if node i loses an amount of capital I_i(t) at some time t greater than or equal to its buffer E_i(t), the total asset W_i(t) becomes less than its liability B_i(t) and it is convenient for the shareholders to exercise their option to default. In practice, when this occurs, we set PD_i(t+1) = 1 and node i will default at time t + 1. Moreover, recall that node i may also default at any time t with probability PD_i(t) due to its own individual characteristics given by (7); see also the default mechanism in (11) below.

Now, when node i defaults, we denote by LGD_i the "Loss Given Default" of node i, which is a fixed number representing the percentage of the government's investment J_i in node i that cannot be recovered after a default. In case of default of node i, we further assume that, in addition to the aforementioned loss of investments, the taxpayers' loss L_i also includes a fixed percentage α_i (for simplicity) of the total asset W_i of the node. That is, the taxpayers' overall loss L_i is given by

$$L_i = LGD_i\, J_i + \alpha_i\, W_i. \qquad (8)$$

To complete our framework, we need to specify the probability of more than one default happening during the same time step, given the PD_i of each node i obtained by (7). For example, if the nodes were independent, the probability of nodes i and j defaulting at the same time step, denoted by PD_[ij], would be the product of the individual probabilities PD_i and PD_j. In this paper, we allow nodes to depend on each other and use a Gaussian latent variable model [17] to calculate the probabilities of simultaneous defaults of two or more nodes. To be more precise, the probability of a finite subset of nodes {i, j, k, ...} ⊆ I in the network G defaulting at the same time is given by the following integral

$$PD_{[ijk...]} = \int_D \Phi_N(x)\, dx, \qquad (9)$$

where Φ_N is the standardised multivariate Gaussian density function with zero mean and a symmetric correlation matrix Σ ∈ [−1, 1]^{N×N}, given by

$$\Phi_N(x) = \frac{1}{\sqrt{(2\pi)^N |\Sigma|}} \exp\left( -\frac{1}{2} x^T \Sigma^{-1} x \right), \qquad (10)$$

and |Σ| is the determinant of Σ. We further note that the integration domain D in (9) is the Cartesian product of the intervals (−∞, Φ₁⁻¹(PD_i)] for each node i that belongs to the set of defaulting nodes, and the intervals (−∞, ∞) for the remaining nodes, where Φ₁ is the univariate standard Gaussian distribution. In the sequel, this model will also be used to simulate the default mechanism.
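As an illustration of (6)-(7) and of the joint default simulation (made precise in the next paragraph), here is a hedged Python sketch; the function names, the one-period horizon ∆t = 1 and the example parameters are our own choices, not taken from the paper:

```python
import numpy as np
from scipy.stats import norm

def merton_pd(W, E, mu, sigma, dt=1.0, pd_floor=0.00021):
    """Implied probability of default per time step, floored as in (7).
    Default occurs when assets fall below the liability B = W - E."""
    B = W - E
    d = (np.log(B / W) - (mu - 0.5 * sigma**2) * dt) / (sigma * np.sqrt(dt))
    return np.maximum(norm.cdf(d), pd_floor)

def sample_defaults(pd, Sigma, rng):
    """One draw of the default indicators delta_i via the Gaussian latent
    variable model: node i defaults iff x_i <= Phi^{-1}(PD_i), x ~ N(0, Sigma)."""
    x = rng.multivariate_normal(np.zeros(len(pd)), Sigma)
    return (x <= norm.ppf(pd)).astype(int)

# Toy usage: three identical banks, homogeneous correlation 0.5.
rng = np.random.default_rng(0)
W, E = np.full(3, 100.0), np.full(3, 3.0)
pd = merton_pd(W, E, mu=0.0, sigma=0.01)          # roughly 0.001 per step
Sigma = 0.5 * np.ones((3, 3)) + 0.5 * np.eye(3)
delta = sample_defaults(pd, Sigma, rng)
```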
To be more precise, by sampling values x_1, ..., x_N of the random vector X = (X_1, X_2, ..., X_N)^T with the multivariate Gaussian distribution mentioned above, at each time step t, we will assume that node i defaults according to the rule:

$$\delta_i(t) = \begin{cases} 1, & \text{if } x_i \le \Phi_1^{-1}(PD_i(t)), \\ 0, & \text{otherwise.} \end{cases} \qquad (11)$$

We describe the government decisions of bailing out banks as a Markov Decision Process (MDP) driven by the network framework described above. We assume that the government estimates that the crisis will likely be over at time M, and that in any case it will then be able to sell the shares of the rescued banks to the private sector for a price that is similar to the purchasing price. We define the 4-tuple (S, A_s, P_a, R_a) of the set S of all the states, the set A_s of all actions available from state s ∈ S, the transition probabilities P_a(s, s′) = P(s_{t+1} = s′ | s_t = s, a_t = a) between state s at any time t and state s′ at time t + 1 having taken action a ∈ A_s at time t, and the rewards (negative losses in our model) R_a(s, s′) received after taking action a at any time t while being at state s and landing in state s′ at time t + 1, where s, s′ ∈ S. Furthermore, a constant discount factor γ is defined with 0 ≤ γ < 1, so that rewards obtained sooner are more relevant in the calculation of the cumulative reward CR over M steps. The latter is therefore defined by

$$CR := \sum_{t=0}^{M-1} \gamma^t R_{a_t}(s_t, s_{t+1}). \qquad (12)$$

In the remainder of this section, we expand on the 4-tuple (S, A_s, P_a, R_a) that defines our MDP.

MDP states. The states s_t ∈ S, at each time t, are defined by three main pillars: (a) all the parameters of the active nodes, namely the total assets W_i(t), equities E_i(t), investments J_i(t) and probabilities of default PD_i(t); (b) the set I_def(t) of defaulted nodes; and (c) the time t itself.

MDP actions. The MDP actions in our model are injections of capital a_t → ∆J^a(t) = (∆J^a_1(t), ∆J^a_2(t), ..., ∆J^a_N(t)) by the government to the nodes (1, 2, ..., N). These additional resources, on one hand, make the nodes more resilient, hence diminishing their probability of default via (4)-(7); on the other hand, they will be at risk in case of default, since they increase each J_i in (8). These actions are the control variables of the government when trying to minimise the losses of the network (i.e. maximise the expected CR in (12), see Section II C for more details). We further assume that these government investments (relative to action a_t) decided at time t are implemented immediately, so that the probability of default PD_i(t) given by (7), the default mechanism in (11) and subsequently the impacts I_i(t), for each node i ∈ I, are computed using the updated values E_i(t), W_i(t) and J_i(t).

MDP transition probabilities. Within our framework, a node that has defaulted does not contribute to future losses and cannot become active again, i.e. the cardinality of the set of defaulted nodes |I_def(t)| is a nondecreasing function of time t. Hence the transition probability P_a(s, s′) from state s to s′ will be non-zero only for states s′ that: (a) have the same number or more defaulted nodes than state s; (b) are "reachable", in the sense that their PD_i(t + 1), W_i(t + 1) and E_i(t + 1), for i ∈ I \ I_def(t + 1) (the set of remaining active nodes in s′), take values that are coherent with equations (4)-(7) after calculating the impacts I_i(t) from the nodes i ∈ I_def(t + 1) \ I_def(t). In order to illustrate the above, we consider the following example.

Example 1. Let us consider a network with three nodes I = {1, 2, 3} and w_ij = 1 ∀ i ≠ j ∈ I, at a time t, such that node 3 ∈ I_def(t) has already defaulted, while the remaining nodes have W_i(t) = 100, E_i(t) = 3 and PD_i(t) = 0.001 for i ∈ I \ I_def(t) = {1, 2}.
In case the government does not intervene, the states s′ that can be reached are the ones where: (i) all the nodes default at time t, i.e. I_def(t + 1) = I; (ii) nodes 1 and 2 are still active and W_i(t+1), E_i(t+1) and PD_i(t+1) for i ∈ {1, 2} are the same as for state s; (iii) node 1 defaults at time t while node 2 remains active, i.e. I_def(t + 1) = {1, 3}, W_2(t + 1) = 99 and E_2(t + 1) = 2 (since the impact I_2(t) = w_21 = 1) and PD_2(t + 1) needs to take the value calculated via (7) using the W_2(t+1) and E_2(t+1) inputs; and (iv) node 2 defaults at time t but node 1 remains active, which is analogous to (iii) by swapping indices 1 and 2. Now, if the government decides to invest, i.e. a → (∆J^a_1(t), ∆J^a_2(t)) on nodes 1 and 2, respectively, at time t, we need to update the capitals E_i(t), W_i(t) for i ∈ {1, 2} according to the government intervention and then use the updated E_i(t), W_i(t) to perform the same analysis as above to identify the reachable states.

For states s_{t+1} with a non-zero transition probability P_{a_t}(s_t, s_{t+1}), we can calculate the latter via the Gaussian latent variable model, so they will depend exclusively on the parameters PD_i and Σ_ij with i, j ∈ I \ I_def(t). To be more precise, we first create an intermediate state s̃ by applying the government investments relative to action a_t to state s_t; hence each node i of s̃ will have increased values of E_i(t), W_i(t) and J_i(t), and a probability of default given by (7) with inputs W_i(t) and E_i(t). Using the intermediate state s̃ with updated E_i(t), W_i(t), J_i(t) and updated PD_i(t), we calculate the transition probability via (see also (9)) the following integral

$$P_{a_t}(s_t, s_{t+1}) = \int_D \Phi(x)\, dx, \qquad (13)$$

where Φ is the density given by (10) with dimension equal to the cardinality of the set of surviving nodes in s_t, and the integration domain D is the Cartesian product of the intervals (−∞, Φ₁⁻¹(PD_i(t))] for the nodes i defaulting at time t and the intervals (Φ₁⁻¹(PD_i(t)), ∞) for all the remaining active nodes i ∈ I \ I_def(t + 1) at state s_{t+1}, upon recalling the default mechanism in (11). The matrix Σ_sub appearing in Φ is the sub-matrix of the original correlation matrix Σ after removing the rows and the columns corresponding to defaulted nodes i ∈ I_def(t) at state s_t.

MDP rewards. In our model the "rewards" take nonpositive values, since their overall maximisation has to translate, for our MDP, into the minimisation of the overall taxpayers' losses L_i(t) given by (8), for all nodes i ∈ I \ I_def(t) at each time t. Namely,

$$R_{a_t}(s_t, s_{t+1}) = -\sum_{i \in I \setminus I_{def}(t)} \delta_i(t)\, L_i(t), \qquad (14)$$

where only the nodes defaulting at time t with δ_i(t) = 1 contribute to the sum of losses, hence the "reward" is 0 if there are no additional defaults at time t. The expected "reward" depends on the action taken, as it influences the total asset W_i and investment J_i corresponding to the intermediate state s̃ described previously, as well as the PD_i, i.e. the probability of having δ_i(t) = 1.

Solving the MDP means finding the optimal action for each possible state s_t. In our context, we expect our model to indicate whether the government should intervene and, if so, which amount it should invest for a given configuration of the financial system network. To find a solution and describe it mathematically, we need to define a few concepts as described below.

Optimal policy. The optimal policy π*(s_t) → a*_t is a function that returns the optimal action a* ∈ A_s for each state s at time t. The optimal action is the one that obtains the maximum expected cumulative reward as defined in (12).

Optimal value function.
The optimal value function V*(s_t) is defined by

$$V^*(s_t) := E\left[ \sum_{k=t}^{M-1} \gamma^{k-t} R_{a^*_k}(s_k, s_{k+1}) \right]. \qquad (15)$$

This is the expected cumulative reward starting from state s_t and following the optimal policy π* for any of the successive time steps till the end of the episode (recall that a full episode consists of M time steps). One way to obtain this expected cumulative reward is to run the MDP starting at s_t multiple times and average the results. Given the definition of π*, V*(s_t) represents the maximum expected cumulative reward that can be obtained starting from s_t.

Optimal action value function. The optimal action value function Q*(s_t, a_t) is the expected cumulative reward we obtain if we first take action a_t at state s_t and then follow the optimal policy π* for any of the successive steps from t + 1 until the end of the episode M. It is defined by

$$Q^*(s_t, a_t) := E\left[ R_{a_t}(s_t, s_{t+1}) + \sum_{k=t+1}^{M-1} \gamma^{k-t} R_{a^*_k}(s_k, s_{k+1}) \right]. \qquad (16)$$

Similarly to the previous paragraph, this represents the maximum expected cumulative reward that can be obtained starting from s_t after taking action a_t. Notice that finding Q* is equivalent to solving the MDP, since the optimal action for each state s_t (hence the optimal policy π*) can be obtained by

$$\pi^*(s_t) = \arg\max_{a_t \in A_{s_t}} Q^*(s_t, a_t). \qquad (17)$$

Relationships between Q* and V*. From the definitions of V*(s_t) and Q*(s_t, a_t), it follows that

$$V^*(s_t) = \max_{a_t \in A_{s_t}} Q^*(s_t, a_t), \qquad (18)$$

i.e. the maximum cumulative reward from s_t is the one corresponding to the maximum value of Q* after looking at all the potential alternative actions a_t. Conversely, we can write Q*(s_t, a_t) in terms of V*(s_t) as:

$$Q^*(s_t, a_t) = \sum_{s_{t+1}} P_{a_t}(s_t, s_{t+1}) \left[ R_{a_t}(s_t, s_{t+1}) + \gamma V^*(s_{t+1}) \right]. \qquad (19)$$

In other words, Q*(s_t, a_t) can be expressed as the immediate expected reward at time t, given by Σ_{s_{t+1}} P_{a_t}(s_t, s_{t+1}) R_{a_t}(s_t, s_{t+1}), plus the expected cumulative reward from time t + 1 onwards, given by γ Σ_{s_{t+1}} P_{a_t}(s_t, s_{t+1}) V*(s_{t+1}). Merging together equations (18) and (19), we then obtain the Bellman Optimality Equation:

$$V^*(s_t) = \max_{a_t \in A_{s_t}} \sum_{s_{t+1}} P_{a_t}(s_t, s_{t+1}) \left[ R_{a_t}(s_t, s_{t+1}) + \gamma V^*(s_{t+1}) \right]. \qquad (20)$$

Our strategy to solve the MDP. We have a complete description of our MDP (in particular, we have the transition probabilities P_{a_t}(s_t, s_{t+1}) and the rewards R_{a_t}(s_t, s_{t+1})); hence, in theory, we could enumerate all the possible states, use Dynamic Programming and the Value Iteration algorithm [30] to find V*, calculate Q* via equation (19), and thus solve the MDP. However, this is not a scalable approach, due to the complexity of the MDP states and the very large number of successor states s′ for all but trivial networks. Instead, we use a Fitted Value Iteration algorithm [25] that involves: (a) devising a parametric representation V*(s, β) for the optimal value function V*(s) in (20), where β is a placeholder for a set of parameters to fit (see sections II C 1 and V B); (b) using the approximate Bellman Optimality Equation (i.e. substituting V*(s′) with V*(s′, β) in (20)) to fit β so that eventually V*(s) ≈ V*(s, β_fit) via a learning process (see section II C 2); and finally (c) calculating Q*(s, a) from V*(s, β_fit), hence solving the MDP.
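For contrast with the fitted approach adopted below, a minimal tabular sketch of the finite-horizon Bellman recursion (18)-(20) is shown here; the array layout is our own, and this is exactly the exhaustive enumeration that does not scale to realistic networks:

```python
import numpy as np

def finite_horizon_value_iteration(P, R, gamma, M):
    """P[a, s, s']: transition probabilities; R[a, s, s']: rewards.
    Returns V[t, s] and the greedy policy pi[t, s] for t = 0, ..., M-1."""
    n_actions, n_states, _ = P.shape
    V = np.zeros((M + 1, n_states))            # V[M] = 0: the episode ends at M
    pi = np.zeros((M, n_states), dtype=int)
    for t in range(M - 1, -1, -1):             # backwards in time
        # Q[a, s] = sum_s' P[a, s, s'] * (R[a, s, s'] + gamma * V[t+1, s'])
        Q = np.einsum('asx,asx->as', P, R + gamma * V[t + 1][None, None, :])
        V[t] = Q.max(axis=0)                   # eq. (18)
        pi[t] = Q.argmax(axis=0)               # eq. (17)
    return V, pi
```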
In order to solve our MDP using the Fitted Value Iteration algorithm, we need a parametric representation of the optimal value function V*(s_t). In our case, V*(s_t) is minus the minimum expected cumulative losses from s_t (i.e. the maximum expected cumulative reward from state s_t) incurred between time t and the end of the episode time M (see also (14)-(15)). The greater the number of nodes in the financial network and the number of residual steps m := M − t, the greater the potential for additional losses. It is natural, then, to try to express V*(s_t) as a sum of the loss contributions due to each individual node at each of the remaining m time steps. Hence, we introduce the matrix Z := (Z_ik(s_t)) with i ∈ I \ I_def(t) and k ∈ {1, ..., m}, where each element Z_ik(s_t) represents the approximate expected loss due to the default of node i at time t + k − 1, taking into account potential government investments. Our final ansatz for the parametric representation V*(s_t, β) of V*(s_t) is that it is given by a linear combination of the elements Z_ik, in which the coefficients β are arranged in a matrix that can change with time, i.e. β ≡ β_t := (β_ik(t)). Namely,

$$V^*(s_t, \beta_t) := \sum_{i \in I \setminus I_{def}(t)} \sum_{k=1}^{m} \beta_{ik}(t)\, Z_{ik}(s_t). \qquad (21)$$

We then let the system learn the parameters β_ik(t) that maximise the expected cumulative reward (minimise the losses). In the following section, we describe how we fit the parameters β_ik(t) to achieve the aforementioned task, while in section V B we detail our choice of Z(s_t) in terms of the characteristics of the network.

In order to learn the parameters β_ik(t), we use the Bellman Optimality Equation (20) and define the "Bellman value" as its right hand side after substituting V*(s′) with the approximation V*(s′, β) from the previous section. Namely, we define

$$V_B(s_t, \beta_{t+1}) := \max_{a_t \in A_{s_t}} \sum_{s_{t+1}} P_{a_t}(s_t, s_{t+1}) \left[ R_{a_t}(s_t, s_{t+1}) + \gamma V^*(s_{t+1}, \beta_{t+1}) \right]. \qquad (22)$$

We can initialise β with β_ik(t) = 1 for all i, k, t, as a natural starting point due to our initial approximation of expected direct losses Z_ik(s_t) in (21) (see also Section V B for more details). We can then compare V*(s_t, β_t) from (21) with V_B(s_t, β_{t+1}) from (22) at state s_t (starting from the initial state s_0 at time 0 and moving forward to time t), and adjust β so that the two values come closer. Afterwards, we move to another state s_{t+1} and repeat the same procedure until the difference between V* and V_B is "small enough", within the subset of the state space S that is reachable from s_0. Notice, however, that the above approach does not converge in general, unless we use specific learning strategies. The issue is that V_B itself depends on β, which is what we want to fit, potentially triggering a divergent loop. To resolve this issue, we notice that V*(s_t, β_t) depends on β at time t, while the corresponding V_B(s_t, β_{t+1}) is a function of β at time t + 1. Using this fact, if we fit β backwards in time, then V*(s_t, β_t) is compared with a value V_B(s_t, β_{t+1}) that is fixed (because β_{t+1} would have been already fitted), thus solving the convergence problem.

The primary issue that we now need to address is to find a way to calculate V_B from (22), despite the fact that the set of states s′ that can be reached from state s is huge, even for relatively small networks. We first notice that Σ_{s′} P_a(s, s′) R_a(s, s′) is the 'one-step' expected reward, which can be rewritten in terms of the nodes of the network:

$$\sum_{s'} P_a(s, s') R_a(s, s') = -\sum_{i \in I \setminus I_{def}(t)} \widetilde{PD}_i\, \widetilde{L}_i, \qquad (23)$$

with P̃D_i and L̃_i denoting, respectively, the probability of default (7) and the taxpayers' loss (8) of node i evaluated at the intermediate state s̃ obtained by applying action a to state s. Secondly, the term Σ_{s′} P_a(s, s′) V*(s′, β) can be estimated via Monte Carlo simulations, which involve (a) sampling s′ using the distribution P_a(s) defined by the probability mass function P_a(s, s′) and (b) calculating the expected value E_{P_a(s)}[V*(s′, β)] by averaging the values V*(s′, β). Essentially,

$$\sum_{s'} P_a(s, s')\, V^*(s', \beta) = E_{P_a(s)}[V^*(s', \beta)] \approx \frac{1}{n} \sum_{q=1}^{n} V^*(s'_q, \beta), \qquad (26)$$

where s′_1, ..., s′_n are the sampled states. However, it is not feasible to calculate P_a(s, s′) for all the states s′ that can be reached from s after taking action a, due to the huge number of these states s′.
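A minimal sketch of the Monte Carlo estimate in (26); `sample_next_state` is our placeholder for a routine drawing a successor state s′ (the next paragraph explains how the network dynamics provides one), and `V` is the parametric approximation of the value function:

```python
import numpy as np

def expected_next_value(s, a, V, beta, sample_next_state, n_samples=1000):
    """Estimate sum_{s'} P_a(s, s') V(s', beta) by averaging the values of V
    over successor states sampled from P_a(s), as in (26)."""
    draws = [V(sample_next_state(s, a), beta) for _ in range(n_samples)]
    return float(np.mean(draws))
```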
Once again, we use our knowledge of the underlying network dynamics to describe the right hand side of (26) in terms of nodes defaulting instead of MDP transition probabilities. We observe that the transition probability P_a(s, s′) was defined through the Gaussian latent variable model (see (13)) and that there is a one-to-one correspondence between additional nodes defaulting from state s and the state s′ reached given action a. In particular, we denote by G^a_s the probability distribution of states s′ which are derived by using our Gaussian latent variable model in order to first simulate which nodes default via the default mechanism in (11) and then to obtain the corresponding state s′. Since G^a_s is equivalent to P_a(s) due to the aforementioned one-to-one correspondence, we can therefore rewrite (26) as

$$E_{P_a(s)}[V^*(s', \beta)] = E_{G^a_s}[V^*(s', \beta)]. \qquad (27)$$

Putting all these together (using essentially (23) and (27)), we can eventually rewrite V_B(s_t, β_{t+1}) from (22), for all t ∈ [0, M − 1], in the form of

$$V_B(s_t, \beta_{t+1}) = \max_{a_t \in A_{s_t}} \left\{ -\sum_{i \in I \setminus I_{def}(t)} \widetilde{PD}_i(t)\, \widetilde{L}_i(t) + \gamma\, E_{G^{a_t}_{s_t}}\left[ V^*(s_{t+1}, \beta_{t+1}) \right] \right\}. \qquad (28)$$

Now, given that our episode ends at time step M, we observe that V*(s_t) = 0 for all t ≥ M. Hence, at time M − 1 the Bellman value can be calculated exactly, since it will not depend on β, and we can thus write

$$V^*(s_{M-1}) = V_B(s_{M-1}) = \max_{a_{M-1}} \left( -\sum_{i \in I \setminus I_{def}(M-1)} \widetilde{PD}_i(M-1)\, \widetilde{L}_i(M-1) \right), \qquad (29)$$

where the latter equality follows from (20) and (23). Now that we can calculate the exact optimal value function V* for each state at time M − 1, we notice from (28) that we can fit β_{M−1} so that V*(s_{M−1}, β_{M−1}) ≈ V*(s_{M−1}), which in turn allows the calculation of V_B(s_{M−2}, β_{M−1}). We then fit β backwards in time for the decreasing sequence of times t = M − 2, M − 3, ..., 0. Finally, we can solve the MDP by combining all the above results to calculate Q*(s_t, a_t). In particular, by substituting V*(s_{t+1}) in (19) with its approximation V*(s_{t+1}, β^fit_{t+1}) obtained in Section II C 2, and applying the same analysis performed to obtain (28), we conclude that

$$Q^*(s_t, a_t) \approx -\sum_{i \in I \setminus I_{def}(t)} \widetilde{PD}_i(t)\, \widetilde{L}_i(t) + \gamma\, E_{G^{a_t}_{s_t}}\left[ V^*(s_{t+1}, \beta^{fit}_{t+1}) \right],$$

which provides the solution to the MDP.

The main result of this paper is the creation of the framework itself. A professional calibration of our model would require the effort of a central bank or a government office. To show how our model works, we explore two instances of our framework: in Section III A we use a network with homogeneous nodes organised as the Krackhardt kite [1] (KK) graph (Fig. 1), while in Section III B we use the network of the European Global Systemically Important Institutions (GSIIs) obtained from the data in the European Banking Authority website [31] (EBA network). We assume the number of steps to be M = 7, the discount factor γ = 0.98 and, for each node i, J_i = 0 (unless otherwise specified), µ_i = 0 (assuming conservatively that the expected value of the assets' return is zero), α_i = α (the same for each node), and that the government can invest only in "risky" banks i with "relatively high" PD_i (in our examples, "risky" banks will have PD_i > 0.009). We have used a homogeneous correlation matrix that takes into account the average correlation between banks and, following [21], we have set Σ_ij = 0.5 for i ≠ j ∈ I \ I_def. The value of σ_i, for each node i, is calculated at time t = 0 from PD_i(0), E_i(0) and W_i(0), by inverting (6). Finally, we set PDMfloor_i = 0.00021, which is the upper end of the AAA default probability bracket within the internal credit rating methodology used by Credit Suisse [8]. The available actions are expressed with the notation <node>@<amount>. For example, 8@05 means an investment of 50 bp of W_8, or 0.5% W_8, in node 8. An action that considers all the nodes is indicated with <node> = 0. Hence 0@15 stands for an investment of 1.5% W_i in each "risky" node i ∈ I \ I_def.
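For instance, a small parser for this action notation could look as follows; the division by 1000 is our reading of "8@05 = 0.5% of W_8", not a convention stated explicitly in the paper:

```python
def parse_action(code):
    """Decode actions like '8@05' -> (node 8, 0.5% of W) or '0@15' ->
    (all risky nodes, 1.5% of W); node index 0 targets every risky node."""
    node, amount = code.split('@')
    return int(node), int(amount) / 1000.0   # '05' -> 0.005, '15' -> 0.015

assert parse_action('8@05') == (8, 0.005)
assert parse_action('0@15') == (0, 0.015)
```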
The common theme is that adding external resources makes the network more resilient, but those resources can be lost in a subsequent default, which creates a trade-off for the decision maker. For relatively low values of α, it is generally not convenient to invest, while for relatively high values of α, the best action is to invest an amount of capital that makes the network sufficiently resilient. In the EBA network (see Section III B), we have shown that there exists a "critical" α_c that splits the space of α-values into two "regimes" of low/high values, where α_c ≈ 0.0025 in the original network and α_c ≈ 0.00096 in the distressed network where the capital of the banks was halved.

[Fig. 2: Optimal action values at time t = 0 for the Krackhardt kite network (risky nodes 4, 8, 10). For small values of α, the best action is not to invest (0@0); as α increases, so does the convenience of investing more capital. For α = 10⁻², the best action is to invest 1.5% W in nodes 4, 8 and 10 (0@15).]

We have chosen this particular network to assess if our algorithm can distinguish between central nodes and peripheral ones. All the nodes have W(0) = 100, E(0) = 3, µ = 0, LGD = 1 (we assume, conservatively, that all the investment would be lost in case of a default) and PD(0) = 0.001, except for nodes 4, 8 and 10, which have PD(0) = 0.01. The edges between nodes are oriented and homogeneous, assuming the value w_ij = 1 ∀ i ≠ j. We have restricted the potential investment amounts, for each node, to be: 0, 0.5% W, 1% W, 1.5% W or 2% W. Furthermore, the government can choose to invest in a single node or in all the nodes at each time step, provided that the nodes are considered distressed. In our example, a node i is defined as risky or distressed if PD_i > 0.009. We have analysed the system for different values of α (0.0001, 0.001, 0.01) and reported the optimal action values at time t = 0 in Fig. 2. For α = 0.0001, the best action is 0@0 (i.e. no investment in any node), followed by investing the minimum amount of capital in individual nodes. As α increases, not investing becomes less and less convenient compared to the other options. For α = 0.01, the best action is to invest 1.5% W in all the risky nodes (0@15). It is interesting to note that action 0@2 (i.e. investing the maximum amount, 2% W, in all the risky nodes) is never the best choice, while 0@05 is always the worst, as it provides too little capital to each node to make them resilient.

In Fig. 3(a) we can see that the optimal action values corresponding to investments in different nodes tend to converge as the time to the end of the episode decreases, because the contagion has less time to propagate and the node position becomes less and less relevant. In Fig. 3(a), we focus our analysis on nodes 4 and 10 for α = 0.0001 and we note that investing in node 4 is always better than in node 10 for the same amount of capital. In Fig. 3(b), we show how the results would change in the case when the government had already invested 0.5 in node 10 (i.e. J_10(0) = 0.5). In this case, a substantial investment in node 10 (10@15 or 10@20) largely outperforms investments in node 4. The government needs to keep investing a sufficient amount of capital in node 10 to protect its previous investment. For example, an investment of 0.5 in node 10 is not sufficient to strengthen it and the corresponding action value is the worst among the ones considered at time t = 0 (time to end = 7).

[Fig. 3: (a) The algorithm 'feels' the network structure and suggests investing in node 4 rather than in node 10.
(b) If the government had previously invested in node 10, it needs to protect its investment, risking an additional investment in node 10. In the legend, 0@0 means no investment, 10@05 means investing 0.5 in node 10, 4@10 means investing 1 in node 4, etc.]

We use the data from the European Banking Authority website about the Global Systemically Important Institutions, relative to the year 2014 (EBA network) [4]. The data does not contain the complete bilateral network (as this is considered business sensitive information) but aggregates of credit exposures towards other financial institutions. For our analysis, we have used the algorithm described in our previous paper [6] (see also [27]) to reconstruct the network. We set LGD = 0.6, as this is the standard rule of thumb in financial credit risk [7]. The values of the asset volatility σ have been obtained by inverting (6) at time zero and assuming they remain constant during the simulation. We have restricted the potential investment amounts to be: 0, 0.5% W, 1% W, 1.5% W, 2% W, 2.5% W or 3% W. Furthermore, if the government decides to invest, it needs to provide additional capital to all the risky nodes, defined in our example as the nodes with PD > 0.009. For this exercise, we pretend that the European Union (including the UK) is also a fiscal union with a single government that is accountable to all the European taxpayers. In particular, we consider investments that individual states might have had in banks as of 2014 as 'private' investments, hence we start with J_i(0) = 0 ∀ i. We define the 'Convenience' to intervene as

$$Convenience(s_t) := \max_{a_t \neq a^0_t} Q^*(s_t, a_t) - Q^*(s_t, a^0_t),$$

with a⁰_t representing the action at time t corresponding to no investments. In Fig. 5(a) we have reported the Convenience vs the time (number of steps) to the end of the episode. We have found that the Convenience is positive and almost constant for large values of α (α = 0.01, α = 0.005), and is negative and decreasing for smaller values of α (α = 0.001, α = 0.0001). In Fig. 5(b) we show the same chart for a severely distressed version of the network, where the capital of the banks has been halved. The distress has the effect of lowering the value of α at which the Convenience is positive; for example, the Convenience is now positive for α = 0.001. To explore the transition between positive and negative Convenience, we have reported the optimal action values at time t = 0 as a function of α in Fig. 6(a). For α > α_c ≈ 0.0025, inaction is no longer the most convenient choice: "0@05" (investing 0.5% W in the risky nodes) becomes the best action. It is also interesting to notice that the optimal action becomes "0@10" for higher values of α. In Fig. 6(b) we report the optimal action values at time t = 0 when the capital of the banks has been halved. We notice that the value α_c at which a government intervention becomes favourable is lower, at α_c ≈ 0.00096.

[Fig. 4: The EBA network (see Table I). The graph has been reconstructed from aggregated data available at the European Banking Authority (EBA) website and it can be different from the actual network of bilateral exposures. The darker edges identify stronger exposures. The nodes with higher PD are Monte dei Paschi di Siena (MPS) and BFA.]

[Fig. 6: (a) The optimal action value at time t = 0 is reported as a function of α for different actions. 0@0 means no investment. As α increases, 0@0 becomes less and less convenient and for α = α_c ≈ 0.0025 the best action becomes investing 0.5% W in each of the risky nodes.
For even higher values of α, the best action becomes 0@10 (i.e. investing 1% W in each of the risky nodes). (b) The chart is relative to a distressed version of the 'EBA network' where the capital of the banks has been halved. The value of α_c at which a government intervention is convenient is lower than in (a): α_c ≈ 0.00096.]

We have shown how to cast a bank bailout decision by a government into an action in a Markov Decision Process (MDP), where the states of the MDP are defined in terms of the underlying network of financial exposures and the MDP dynamics is derived from the network dynamics. In our example, which uses the data relative to the European Global Systemically Important Institutions, we have found that government interventions do not improve the expected loss of the financial network if the loss for the taxpayer as a fraction of the bank total assets, α, satisfies α < α_c ≈ 0.0025. The value of α_c becomes lower as the distress of the network increases. It is evident from our analysis that the parameter α plays a central role in systemic risk modelling and, even though there are works [9] [10] studying the impact for the taxpayers linked to a bank default, additional analysis needs to be performed for its reliable estimation. Using a simplified Krackhardt kite network, we have found that the government becomes biased toward investing in a risky node if it has already invested in it in the past. The government therefore needs to evaluate a potential investment carefully: the rescued bank could increase its risky investments knowing that it would be bailed out in case it became distressed again, thus leading to moral hazard.

In section II C 1 we have expressed the approximated value function V*(s_t, β) as a linear combination of terms Z_ik(s_t) with coefficients β_ik(t), given by (21). In order to fit these β_ik(t), we first identify a representative portfolio of MDP states that can be reached, at time t, from the initial state s_0, and for which we can calculate the Bellman value V_B using (28)-(29). Equating V*(s_t, β_t) with the corresponding V_B(s_t, β_{t+1}), for each state s_t in the portfolio, we derive a set of linear equations that we use to obtain the coefficients β_ik(t) via a ridge regression (with a 5-fold cross-validation). The states in the representative portfolio, at time t, are obtained from the initial state s_0, after changing the time to maturity from M to M − t (i.e. the states are 'moved' forward in time) and forcing a set U of nodes to default. The representative portfolio contains: (a) the state corresponding to U = ∅; plus (b) all the states corresponding to U = {i} for i ∈ I (i.e. with one additional defaulted node with respect to s_0); plus (c) a selection of states corresponding to |U| > 1 (i.e. with multiple additional defaulted nodes), which are chosen randomly with probabilities proportional to exp(−|U|) (a greater importance is given to states with a smaller number of additional defaults, as they are more likely to be reached in an actual simulation). In addition, we obtain elements of the representative portfolio by performing a government action on s_0 and then moving the corresponding state to time t (i.e. to time to maturity M − t). The number of states in the representative portfolio needs to be chosen taking into account the trade-off between stable results and computational resources.
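One backward fitting step could be sketched as below, assuming a design matrix whose rows are the flattened Z(s_t) of the portfolio states and whose targets are the corresponding Bellman values V_B(s_t, β_{t+1}); we use scikit-learn's RidgeCV as a stand-in for the ridge regression with 5-fold cross-validation mentioned above:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

def fit_beta_t(Z_rows, bellman_values, alphas=(0.01, 0.1, 1.0, 10.0)):
    """Z_rows         : (n_states, N*m) flattened Z matrices, one row per portfolio state
    bellman_values : (n_states,) targets V_B(s_t, beta_{t+1})
    Returns the fitted coefficients beta_ik(t), flattened."""
    model = RidgeCV(alphas=alphas, cv=5, fit_intercept=False)
    model.fit(np.asarray(Z_rows), np.asarray(bellman_values))
    return model.coef_          # reshape to (N, m) to recover beta_ik(t)
```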
In this section, we detail our choice for (Z_ik(s_t)) used in our ansatz for the value function approximation V*(s_t, β) in (21), with node i ∈ I \ I_def(t) and step k ∈ {1, ..., m = M − t} until the end of the episode. We introduce the auxiliary matrix Z̃, with elements Z̃_ik(s_t; a_1, ..., a_k) representing the approximated contribution of the expected direct loss, due to the default of node i, at time t + k − 1, taking into account the government actions a_j at time t + j − 1, for all j = 1, ..., k. That is, we define

$$\widetilde{Z}_{ik} := \begin{cases} \widetilde{PD}_{ik}\, \widetilde{L}_{ik}, & \text{if } k = 1, \\ \gamma^{k-1}\, \widetilde{PD}_{ik}\, \widetilde{L}_{ik} \displaystyle\prod_{r=1}^{k-1} \left( 1 - \widetilde{PD}_{ir} \right), & \text{if } k > 1. \end{cases}$$

Here, the value P̃D_ik is the modified probability of default and the value L̃_ik is the modified loss associated with node i at time t + k − 1, which take into account the expected cumulative impact Ĩ_ik and the potential cumulative investment from the government J̃_ik on node i from time t up to time t + k − 1. Note that, for k > 1, a node i can contribute to the expected loss only if it has not defaulted in the previous time steps (hence the presence of the survival probabilities (1 − P̃D_ir)). To be more precise, we firstly define P̃D_ik as the probability of default given by (7), with inputs updated by the cumulative impact Ĩ_ik and the cumulative investment J̃_ik. Then, the cumulative impact Ĩ_ik depends on the modified probability of default of all the nodes j ∈ H := I \ (I_def(t) ∪ {i}) and is defined by

$$\widetilde{I}_{ik} := \sum_{j \in H} w_{ij} \left[ 1 - \prod_{r=1}^{k-1} \left( 1 - \widetilde{PD}_{jr} \right) \right].$$

Moreover, the cumulative government investment J̃_ik in node i is a function of the actions (a_1, ..., a_k) that the government can take between t and t + k − 1 and is defined by

$$\widetilde{J}_{ik}(a_1, ..., a_k) := \sum_{r=1}^{k} \Delta J^{a_r}_i(t + r - 1).$$

Finally, the modified loss incurred is defined, consistently with (8), by

$$\widetilde{L}_{ik} := LGD_i \left( J_i(t) + \widetilde{J}_{ik} \right) + \alpha_i \left( W_i(t) - \widetilde{I}_{ik} + \widetilde{J}_{ik} \right).$$

In light of the above equations, we observe that the Z̃_ik = Z̃_ik(s_t; a_1, ..., a_k) depend on the actions (a_1, ..., a_k) via the terms J̃_ik(a_1, ..., a_k) involved in both P̃D_ik and L̃_ik. We now call a⁰ the action corresponding to no additional government investment and define the total expected direct loss, for all i ∈ I \ I_def(t) and k ∈ {1, ..., m}, as

$$TL(s_t; a_1, a_2, ..., a_m) := \sum_{i,k} \widetilde{Z}_{ik}(s_t; a_1, a_2, ..., a_k).$$

Then, the specific matrix Z = (Z_ik(s_t)) involved in our value function approximation is defined by Z_ik(s_t) := Z̃_ik(s_t; ā_1, ..., ā_k), where each ā_j is calculated sequentially for j ∈ {1, ..., m}: given the previously determined ā_1, ..., ā_{j−1}, we select as ā_j the action that minimises the total expected direct loss TL, assuming no further interventions (a⁰) at the subsequent steps.

References

Assessing the Political Landscape: Structure, Cognition, and Power in Organizations
A dynamic approach merging network theory and credit risk techniques to assess systemic risk in financial networks
Default recovery rates in credit risk modelling: a review of the literature and empirical evidence
Snethlage, D. Estimating the size and incidence of bank resolution costs for selected banks in OECD countries
Reducing and sharing the burden of bank failures
Network models of financial systemic risk: a review
Contagion in financial networks
Managing global finance as a system
Complex networks: Structure and dynamics
Running for the exit: distressed selling and endogenous correlation in financial markets
DebtRank: too central to fail? Financial networks, the FED and systemic risk
The Gaussian latent variable model, in Modelling Single-name and Multi-name Credit Derivatives
On the pricing of corporate debt: The risk structure of interest rates
Measuring systemic risk: A risk management approach
Interbank exposures: Quantifying the risk of contagion
A framework for assessing the systemic risk of major financial institutions
Simulation methods to assess the danger of contagion in interbank markets
Estimating bilateral exposures in the German interbank market: Is there a danger of contagion?
Reinforcement Learning: An Introduction
Approximate solutions to Markov decision processes
An Artificial Intelligence Approach to Regulating Systemic Risk
Filling in the blanks: Network structure and interbank contagion
A Markovian decision process