key: cord-0649292-1l7i5b38 authors: Li, Zun; Jia, Feiran; Mate, Aditya; Jabbari, Shahin; Chakraborty, Mithun; Tambe, Milind; Vorobeychik, Yevgeniy title: Solving Structured Hierarchical Games Using Differential Backward Induction date: 2021-06-08 journal: nan DOI: nan sha: 9507164af4a398dc742452b2549f52cea62bfd9b doc_id: 649292 cord_uid: 1l7i5b38 From large-scale organizations to decentralized political systems, hierarchical strategic decision making is commonplace. We introduce a novel class of structured hierarchical games (SHGs) that formally capture such hierarchical strategic interactions. In an SHG, each player is a vertex in a tree, and strategic choices of players are sequenced from root to leaves, with root moving first, followed by its children, then followed by their children, and so on until the leaves. A player's utility in an SHG depends on its own decision, and on the choices of its parent and all the tree leaves. SHGs thus generalize simultaneous-move games, as well as Stackelberg games with many followers. We leverage the structure of both the sequence of player moves as well as payoff dependence to develop a novel gradient-based back propagation-style algorithm, which we call Differential Backward Induction (DBI), for approximating equilibria of SHGs. We then provide a sufficient condition for convergence of DBI. Finally, we demonstrate the efficacy of the proposed algorithmic approach in finding approximate equilibrium solutions to several classes of SHGs. The COVID-19 pandemic has revealed considerable strategic tension among the many parties involved in decentralized hierarchical policy-making. For example, recommendations by the World Health Organization are sometimes heeded, and other times discarded by nations, while subnational units, such as provinces and urban areas, may in turn take a policy stance (such as on lockdowns, mask mandates, or vaccination priorities) that is not congruent with national policies. Similarly, in the US, policy recommendations at the federal level can be implemented in a variety of ways by the states, while counties and cities, in turn, may comply with state-level policies, or not, potentially triggering litigation [15] . Central to all these cases is that, besides this strategic drama, what ultimately determines infection spread is how policies are implemented at the lowest level, such as by cities and towns, or even individuals. Similar strategic encounters routinely play out in large-scale organizations, where actions throughout the management hierarchy are ultimately reflected in the decisions made at the lowest level (e.g., by the employees who are ultimately involved in production), and these lowest-level decisions play a decisive role in the organizational welfare. We propose a novel model of hierarchical decision making which is a natural stylized representation of strategic interactions of this kind. Our model, which we term structured hierarchical games (SHGs), represents each player by a node in a tree hierarchy. The tree plays two roles in SHGs. First, it captures the sequence of moves by the players: the root (the lone member of level 1 of the hierarchy) makes the first strategic choice, its children (i.e., all nodes in level 2) observe the root's choice and follow, their children then follow in turn, and so on, until we reach the leaf node players who move upon observing their predecessors' choices. 
Second, the tree partially captures strategic dependence: a player's utility depends on its own strategy, that of its parent, and the strategies of all of the leaf nodes. The sequence of moves in our model naturally captures the typical sequence of decisions in hierarchical policy-making settings, as well as in large organizations, while the utility structure captures the decisive role of leaf nodes (e.g., individual compliance with vaccination policies), as well as hierarchical dependence (e.g., employee dependence on a manager's approval of their performance, or state dependence on federal funding). Significantly, the SHG model generalizes a number of well-established models of strategic encounters, including (a) simultaneous-move games (captured by a 2-level SHG with the root having a single dummy action), (b) Stackelberg (leader-follower) games (a 2-level game with a single leaf node) [11, 37] , and (c) single-leader multi-follower Stackelberg games (e.g., a Stackelberg security game with a single defender and many attackers) [5, 8] . Our second contribution is a novel gradient-based algorithm for approximately computing subgameperfect equilibria of SHGs. Specifically, we propose Differential Backward Induction (DBI), which is a backpropagation-style gradient ascent algorithm that leverages both the sequential structure of the game, as well as the utility structure of the players. As DBI involves simultaneous gradient updates of players in the same level (particularly at the leaves), convergence is not guaranteed in general (as is also the case for best-response dynamics [12] ). Viewing DBI as a dynamical system, we provide a sufficient condition for its convergence to a stable point. Our results also imply that in the special case of two-player zero-sum Stackelberg games, DBI converges to a local Stackelberg equilibrium [11, 39] . Finally, we demonstrate the efficacy of DBI in finding approximate equilibrium solutions to several classes of SHGs. First, we use a highly stylized class of SHGs with polynomial utility functions to compare DBI with five baseline gradient-based approaches from prior literature. Second, we use DBI to solve a recently proposed game-theoretic model of 3-level hierarchical epidemic policy making. Third, we apply DBI to solve a hierarchical variant of a public goods game, which naturally captures the decentralization of decision making in public good investment decisions, such as investments in sustainable energy. Fourth, we evaluate DBI in the context of a hierarchical security investment game, where hierarchical decentralization (e.g., involving federal government, industry sectors, and particular organizations) can also play a crucial role. In all of these, we show that DBI significantly outperforms the state of the art approaches that can be applied to solve games with hierarchical structure. Related Work SHGs generalize both simultaneous-move games and Stackelberg games with multiple followers [5, 23] . They are also related to graphical games [20] in capturing utility dependence structure, although SHGs also capture sequential structure of decisions. Several prior approaches use gradient-based methods for solving games with particular structure. A prominent example is generative adversarial networks (GANs), though these are zero-sum games [9, 14, 19, [29] [30] [31] . Ideas from learning GANs have been adopted in gradient-based approaches to solve multi-player general-sum games [4, 7, 17, 22, 25, 28, 29] . 
However, all of these approaches assume a simultaneous-move game. A closely related thread to our work considers gradient-based methods for bi-level optimization [24, 35]. Several related efforts consider gradient-based learning in Stackelberg games, and also use the implicit function theorem to derive gradient updates [2, 11, 32, 38, 39]. We significantly generalize these ideas by considering an arbitrary hierarchical game structure. Jia et al. [18] recently considered a stylized 3-level SHG for pandemic policy making, and proposed several non-gradient-based algorithms for this problem. We compare with their approach in Section 4.

Notation We use bold lower-case letters to denote vectors. Let f(x, y) be a real-valued function of two vector arguments. We use ∇_x f to denote the partial derivative of f with respect to x. When there is a functional dependency between x and y, we use D_x f to denote the total derivative of f(x, y(x)) with respect to x. We use ∇²_{x,x} f and ∇²_{x,y} f to denote the second-order partial derivatives, and D²_{x,x} f to denote the second-order total derivative of f. For a mapping f : R^d → R^d, we use f^t(x) to denote t iterative applications of f to x. For mappings f_1 : R^d → R^d and f_2 : R^d → R^d, we define (f_1 ∘ f_2)(x) := f_1(f_2(x)) and (f_1 + f_2)(x) := f_1(x) + f_2(x). Moreover, for a given ε ∈ R_{≥0} and x ∈ R^d, we define the ε-ball around x as B_ε(x) = {x' ∈ R^d | ‖x − x'‖_2 < ε}. Finally, I denotes an identity matrix.

Formal Model A structured hierarchical game (SHG) G consists of a set N of n players. Each player i is associated with a set of actions X_i ⊆ R^{d_i}. The players are partitioned across L levels, where N_l is the set of n_l players occupying level l. Let l_i denote the level occupied by player i. This hierarchical structure of the game is illustrated in Figure 1, where players correspond to nodes and levels are marked by dashed boundaries. The hierarchy plays two crucial roles: 1) it determines the order of moves, and 2) it partly determines utility dependence among players. Specifically, the temporal pattern of actions is as follows: level 1 has a single player, the root, who chooses an action first, followed by all players in level 2 making simultaneous choices, followed in turn by players in level 3, and so on until the leaves in the final level L. Players of level l observe only the actions chosen by all players of levels 1, 2, ..., l−1, but not those of their peers in the same level. So, for example, pandemic social distancing and vaccination policies in the US are initiated by the federal government (including the Centers for Disease Control and Prevention), with states subsequently instituting their own policies, counties reacting to these by determining their own, and the behavior of people ultimately influenced, but not determined, by the guidelines and enforcement policies of the local county or city. Next, we describe the utility structure of the game as entailed by the SHG hierarchy. Each player i in level l_i > 1 (i.e., any node other than the root) has a unique parent in level l_i − 1; we denote the parent of node i by Pa(i). A player's utility function is determined by 1) its own action, 2) the action of its parent, and 3) the actions of all players in level L (i.e., all leaf players). To formalize, let x_l denote the joint action profile of all players in level l. Player i's utility function then has the form u_i(x_i, x_{Pa(i)}, x_L); for a leaf player i ∈ N_L we write it as u_i(x_i, x_{Pa(i)}, x_{L,−i}), where x_{L,−i} is the action profile of all players in level L other than i.
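To make the hierarchy and the utility dependence concrete, the following minimal Python sketch shows one way an SHG instance could be represented in code; the class and field names (SHGPlayer, SHG, the utility signature) are illustrative assumptions rather than part of the formal model or of our implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional
import numpy as np

@dataclass
class SHGPlayer:
    pid: int                      # player index
    level: int                    # level in the hierarchy (1 = root)
    parent: Optional[int]         # parent player id (None for the root)
    action: np.ndarray            # current action x_i in R^{d_i}
    # utility(x_i, x_parent, x_leaves) -> float, where x_leaves stacks all leaf actions
    utility: Callable[[np.ndarray, np.ndarray, np.ndarray], float]

@dataclass
class SHG:
    players: Dict[int, SHGPlayer] = field(default_factory=dict)

    def children(self, pid: int) -> List[int]:
        return [j for j, p in self.players.items() if p.parent == pid]

    def leaves(self) -> List[int]:
        last = max(p.level for p in self.players.values())
        return [j for j, p in self.players.items() if p.level == last]

    def payoff(self, pid: int) -> float:
        p = self.players[pid]
        x_parent = self.players[p.parent].action if p.parent is not None else np.zeros(0)
        x_leaves = np.concatenate([self.players[j].action for j in self.leaves()])
        return p.utility(p.action, x_parent, x_leaves)
```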
For example, in our running pandemic policy example, the utility of a county depends on both the policy and enforcement strategy of its state (its parent) and on the ultimate pandemic spread and economic impact within it, both determined largely by the behavior of the county residents (leaf nodes). Note the considerable generality of the SHG model. For example, an arbitrary simultaneous-move game is an SHG with 2 levels and a "dummy" root node (utilities of all leaves depend on one another's actions), and an arbitrary Stackelberg game (e.g., a Stackelberg security game), even with many followers, can be modeled as a 2-level SHG with the leader as root and followers as leaves. Furthermore, while we have defined SHGs with respect to real-vector player action sets, it is straightforward to represent mixed strategies of finite-action games in this way by simply using a softmax function to map an arbitrary real vector into a valid mixed strategy.

Solution Concept Since an SHG has important sequential structure, it is natural to consider the subgame perfect equilibrium (SPE) as the solution concept [33]. Here, we focus on pure-strategy equilibria. To begin, we note that in SHGs, the strategies of players in any level l > 1 are, in general, functions of the complete history of play in levels 1, ..., l−1, which we denote by h_{<l} = (x_1, x_2, ..., x_{l−1}). Formally, a (pure) strategy of a player i is denoted by s_i(h_{<l}), which deterministically maps an arbitrary history h_{<l} into an action x_i ∈ X_i. A Nash equilibrium of an SHG is then a strategy profile s = (s_1, ..., s_i, ..., s_n) such that for all i ∈ N, u_i(s_i, s_{−i}) ≥ u_i(s'_i, s_{−i}) for all possible alternative strategies s'_i for i. Here, we denote the realized payoff of i from profile s by u_i(s_i, s_{−i}). Next, we define a level-l subgame given h_{<l} as an SHG that includes only players at levels ≥ l, with actions chosen in levels < l fixed to h_{<l}. A strategy profile s is a subgame perfect equilibrium of an SHG if it is a Nash equilibrium of every level-l subgame of the SHG for every l and history h_{<l}. We prove in Appendix A that our definition of SPE is equivalent to the standard SPE in an extensive-form representation of the SHG. While in principle we can compute an SPE of an SHG using backward induction, this cannot be done directly (i.e., by complete enumeration of actions of all players) since actions are real vectors. Moreover, even discretizing actions is of little help, as the hierarchical nature of the game leads to an exponential explosion of the search space. We now present a gradient-based approach for approximating an SPE along the equilibrium path of an SHG that leverages the game structure to derive backpropagation-style gradient updates.

In this section, we describe our gradient-based algorithm, Differential Backward Induction (DBI), for approximating an SPE, and then analyze its convergence. Just as gradient ascent does not, in general, identify a globally optimal solution to a non-convex optimization problem, DBI in general yields a solution which only satisfies first-order conditions (see Section 3.2 for further details). Moreover, we leverage the structure of the utility functions to focus computation on an SPE in which strategies of players are only a function of their immediate parents.
In this spirit, we define local best response functions φ_i : R^{d_{Pa(i)}} → R^{d_i} mapping a player i's parent's action x_{Pa(i)} to i's action x_i; the notation φ_i is distinct from s_i above for i's strategy to emphasize that φ_i is only locally optimal. (While we cannot guarantee that an SPE exists in SHGs in general, let alone in those possessing the assumed structure, we find experimentally that our approach often yields good SPE approximations.) Now, suppose that a player i is in the last level L. Local optimality of φ_i implies that if x_i = φ_i(x_{Pa(i)}), then ∇_{x_i} u_i(x_i, x_{Pa(i)}, x_{L,−i}) = 0 and ∇²_{x_i,x_i} u_i(x_i, x_{Pa(i)}, x_{L,−i}) < 0. (For simplicity, we omit degenerate cases where ∇²_{x_i,x_i} u_i = 0 and assume all local maxima are strict.) Let φ_l denote the local best response of all the players in level l given the actions of all players in level l−1. We can compose these local best response functions to define Φ_l := φ_L ∘ φ_{L−1} ∘ ... ∘ φ_{l+1} : R^{d n_l} → R^{d n_L}, i.e., the local best response of players in the last level L given the actions of the players in level l (in particular, Φ_L = φ_L). Then for any player i in level l_i < L, D_{x_i} u_i(x_i, x_{Pa(i)}, Φ_l(⟨x_i, x_{l,−i}⟩)) = 0 and D²_{x_i,x_i} u_i(x_i, x_{Pa(i)}, Φ_l(⟨x_i, x_{l,−i}⟩)) < 0, where D_{x_i} is the total derivative with respect to x_i (as Φ_l(⟨x_i, x_{l,−i}⟩) is also a function of x_i). Note that the functions φ and Φ are implicit, capturing the functional dependencies between actions of players in different levels at the local equilibrium. Throughout, we make the following standard assumption on the utility functions [10, 39].

Assumption 1. For any x_i ∈ X_i, the second-order partial derivatives of the form ∇²_{x_i,x_i} u_i are non-singular.

The DBI algorithm works in a bottom-up manner, akin to back-propagation: for each level l, we compute the total derivatives (gradients) of the utility functions and of the local best response maps (φ, Φ) based on analytical expressions that we derive below. We then propagate this information up to level l−1, where it is used to compute gradients for that level, and so on until level 1. Algorithm 1 gives the full DBI algorithm. In this algorithm, Chd(i) denotes the set of children of player i (i.e., nodes in level l_i + 1 for whom i is the parent). DBI works in a backward message-passing manner, comparable to back-propagation: after each player has computed its total derivative, it passes (back-propagates) D_{x_i} Φ_l to its direct parent; this information is, in turn, used by the parent to compute its own total derivative, which is passed to its own parent, and so on.

Algorithm 1 Differential Backward Induction (DBI)
Input: An SHG instance G
Parameters: Learning rate α, maximum number of iterations T for gradient update
Output: A strategy profile
Randomly initialize x^0 = ⟨x^0_1, ..., x^0_L⟩
for t = 1, 2, ..., T do
  for l = L, L−1, ..., 1 do
    for i = 1, 2, ..., n_l do
      if l = L then update x^t_i by gradient ascent on ∇_{x_i} u_i (Equation (1))
      else update x^t_i by gradient ascent on D_{x_i} u_i (Equations (3) and (6)), using the terms back-propagated from level l+1

Algorithm 1 takes the total derivatives as given. We now derive closed-form expressions for these. We start from the last level L. Given the actions of players in level L−1, the total derivative of a player i ∈ N_L with respect to its own action coincides with its partial gradient,

D_{x_i} u_i(x_i, x_{Pa(i)}, x_{L,−i}) = ∇_{x_i} u_i(x_i, x_{Pa(i)}, x_{L,−i}).   (1)

For a player i in level L−1, the total derivative (at a local best response) is

D_{x_i} u_i(x_i, x_{Pa(i)}, φ_L(x_{L−1})) = ∇_{x_i} u_i + ∇_{x_L} u_i · D_{x_i} φ_L,   (2)

where ∇_{x_L} u_i is a 1 × (d·n_L) vector and D_{x_i} φ_L is a (d·n_L) × d matrix. The technical challenge here is to derive the term D_{x_i} φ_L for i ∈ N_{L−1}. Recall that φ_L is the vectorized concatenation of the functions φ_j for j ∈ N_L.
Since the local best response strategy of a player in level L depends only on its parent in level L−1, the only terms in φ_L that depend on x_i are the actions of Chd(i) in level L. Consequently, it suffices to derive D_{x_i} φ_j for j ∈ Chd(i). Note that for these players j, ∇_{x_j} u_j = 0 (by local optimality of φ_L). We will use this first-order condition to derive the expression for the total derivative using the implicit function theorem.

Theorem 1 (Implicit Function Theorem (IFT) [10, Theorem 1B.1]). Let f(x_1, x_2) : R^d × R^d → R^d be continuously differentiable in a neighborhood of (x̄_1, x̄_2) such that f(x̄_1, x̄_2) = 0. Suppose also that ∇_{x_2} f, the Jacobian of f with respect to x_2, is non-singular at (x̄_1, x̄_2). Then in a neighborhood of x̄_1 there is a local diffeomorphism x̄_2(x_1) such that f(x_1, x̄_2(x_1)) = 0 and D_{x_1} x̄_2 = −(∇_{x_2} f)^{−1} ∇_{x_1} f.

To use Theorem 1, we set f = ∇_{x_j} u_j (which satisfies the conditions of Theorem 1 by Assumption 1), x_1 = x_i and x_2 = x_j (recall that j ∈ Chd(i)). By the IFT, there exists φ_j(x_i) such that

D_{x_i} φ_j = −(∇²_{x_j,x_j} u_j)^{−1} ∇²_{x_j,x_i} u_j.

Plugging this into Equation (2), we obtain

D_{x_i} u_i(x_i, x_{Pa(i)}, φ_L(x_{L−1})) = ∇_{x_i} u_i − Σ_{j ∈ Chd(i)} ∇_{x_j} u_i (∇²_{x_j,x_j} u_j)^{−1} ∇²_{x_j,x_i} u_j.   (3)

For a level l < L−1, the total derivative of player i ∈ N_l at a local best response is

D_{x_i} u_i(x_i, x_{Pa(i)}, Φ_l(⟨x_i, x_{l,−i}⟩)) = ∇_{x_i} u_i + ∇_{x_L} u_i · D_{x_{l+1}} Φ_{l+1} · D_{x_i} φ_{l+1}.   (4)

Applying the IFT as before, we get

D_{x_i} φ_j = −(∇²_{x_j,x_j} u_j)^{−1} ∇²_{x_j,x_i} u_j   (5)

for j ∈ Chd(i). We can apply the above procedure recursively to D_{x_{l+1}} Φ_{l+1} to derive the total derivative for players i ∈ N_l for l < L−1:

D_{x_i} u_i = ∇_{x_i} u_i + Σ_j ∇_{x_j} u_i ∏_{k ∈ Path(j → i)} [ −(∇²_{x_k,x_k} u_k)^{−1} ∇²_{x_k, x_{Pa(k)}} u_k ],   (6)

where the sum is over leaf players j in the subtree rooted at i and Path(j → i) is an ordered list of nodes (players) lying on the unique path from j to i, excluding i. Note that Equation (6) is a generalization of Equation (3), in which the path consists only of the leaf player. While the above derivation assumes that the φ and Φ functions are local best responses, in our algorithm we evaluate these expressions for the total derivatives at the current joint action profile in each iteration. This significantly reduces computational complexity and ensures that Algorithm 1 satisfies the first-order conditions upon convergence.
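To illustrate Equations (1)-(3) on the simplest non-trivial case, the sketch below instantiates DBI for a two-level SHG with a single leader and a single follower (the Stackelberg special case), using JAX to obtain the partial and cross derivatives; the quadratic utilities and the learning rate are hypothetical placeholders, and the loop mirrors the simultaneous per-iteration updates of Algorithm 1 rather than reproducing it exactly.

```python
import jax

# Hypothetical smooth utilities: one leader (action x) and one follower (action y).
def u_leader(x, y):
    return -(x - 1.0) ** 2 - 0.5 * (y - 2.0) ** 2 + 0.3 * x * y

def u_follower(x, y):
    return -(y - x) ** 2 - 0.1 * y ** 2

du_f_dy = jax.grad(u_follower, argnums=1)   # follower's partial gradient (Equation (1))

def leader_total_grad(x, y):
    # Implicit function theorem: D_x phi = -(grad^2_{yy} u_f)^{-1} grad^2_{yx} u_f
    d2_yy = jax.grad(du_f_dy, argnums=1)(x, y)
    d2_yx = jax.grad(du_f_dy, argnums=0)(x, y)
    dphi_dx = -d2_yx / d2_yy
    # Total derivative (Equations (2)-(3)): D_x u_l = grad_x u_l + grad_y u_l * D_x phi
    return jax.grad(u_leader, argnums=0)(x, y) + jax.grad(u_leader, argnums=1)(x, y) * dphi_dx

# DBI-style iterations: the follower ascends its partial gradient, the leader ascends
# its total derivative, both evaluated at the current joint profile.
x, y, alpha = 0.0, 0.0, 0.05
for _ in range(1000):
    gx = leader_total_grad(x, y)
    gy = du_f_dy(x, y)
    x, y = x + alpha * gx, y + alpha * gy
print(x, y, leader_total_grad(x, y), du_f_dy(x, y))  # both gradients are near zero at convergence
```

In the general case, Equation (6) replaces the single implicit-function term with a product of such terms along each root-to-leaf path, which is exactly the quantity each player back-propagates to its parent.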
Our next task is to provide sufficient conditions for the DBI algorithm to converge to a stable point. A few notes are in order. First, as we remarked earlier, stable points of DBI are not guaranteed to be SPE, just as stable points of gradient ascent are not guaranteed to be globally optimal for general non-convex objective functions. Furthermore, the DBI algorithm entails what are effectively iterative better-response updates by players, and it is well known that best-response dynamics in games can in general lead to cycles [12]. To begin, we observe that the gradient updates in DBI can be interpreted as a discrete dynamical system, x^{t+1} = F(x^t), with F(x^t) = (I + αG)(x^t), where G is the update gradient vector field. This discrete system can be viewed as an approximation of a continuous limit dynamical system ẋ = G(x) (i.e., letting α → 0). A standard solution concept for such dynamical systems is a locally asymptotically stable point (LASP).

Definition 1 ([13]). A continuous (or discrete) dynamical system ẋ = G(x) (or x^{t+1} = F(x^t)) has a locally asymptotically stable point (LASP) x* if there exists ε > 0 such that lim_{t→∞} x^t = x* for all x^0 ∈ B_ε(x*).

There are well-known necessary and sufficient conditions for the existence of an LASP.

Proposition 1 (Characterization of LASP [40, Theorem 1.2.5, Theorem 3.2.1]). A point x* is an LASP for the continuous dynamical system ẋ = G(x) if G(x*) = 0 and all eigenvalues of the Jacobian matrix ∇_x G at x* have negative real parts. Furthermore, for any x* such that G(x*) = 0, if ∇_x G has eigenvalues with positive real parts at x*, then x* cannot be an LASP.

Note that an LASP of DBI is an action profile of all players that satisfies the first-order conditions, i.e., no player can improve its utility through a local gradient update. While the existence of an LASP depends on game structure, we show that under Assumption 1, and as long as the sufficient conditions for an LASP in Proposition 1 are satisfied, DBI converges to an LASP. We defer all omitted proofs to Appendix C.

Proposition 2. Let λ_1, ..., λ_d denote the eigenvalues of the updating Jacobian ∇_x G at an LASP x*, and define λ* = arg max_{i ∈ [d]} Re(λ_i)/|λ_i|², where Re is the real-part operator. Then with a learning rate α < −2 Re(λ*)/|λ*|², and an initial point x^0 ∈ B_ε(x*) for some ε > 0 around x*, DBI converges to an LASP. Specifically, if the learning rate is chosen as some α* satisfying this bound and the spectral radius ρ(I + α* ∇_x G) = 1 − κ < 1, then the dynamics converge to x* at the rate O((1 − κ/2)^t).

Proposition 2 states that there exists a region such that, if the initial point lies in that region, DBI converges to an LASP. We next show that if we additionally assume that the update rule is Lipschitz, then we can also characterize the region of initial points from which DBI converges to an LASP.

Proposition 3. Suppose G is L-Lipschitz. Then for all x^0 ∈ B_{κ/2L}(x*) and ε > 0, after T rounds of gradient updates DBI outputs a point x^T ∈ B_ε(x*) as long as T ≥ ⌈(2/κ) log(‖x^0 − x*‖/ε)⌉, where κ is as defined in Proposition 2.

We further show that with random initialization, the probability of reaching a saddle point is 0, which means that with probability 1, DBI converges to an LASP at which players play local best responses.

Proposition 4. Suppose G is L-Lipschitz. Let α < 1/L and define the saddle points of the dynamics G as X_sad = {x* ∈ X | x* = (I + αG)(x*), ρ(I + α∇_x G(x*)) > 1}. Also let X⁰_sad = {x^0 ∈ X | lim_{t→∞} (I + αG)^t(x^0) ∈ X_sad} denote the set of initial points that converge to a saddle point. Then µ(X⁰_sad) = 0, where µ is the Lebesgue measure.

While our convergence analysis does not guarantee convergence to an approximate SPE, our experiments show that DBI is in fact quite effective at finding one in practice.
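The conditions in Propositions 1 and 2 are straightforward to verify numerically once the Jacobian ∇_x G of the update field at a candidate stationary point is available. The sketch below does so for a hypothetical 2×2 Jacobian; the matrix values and the helper name are illustrative assumptions, not part of DBI itself.

```python
import numpy as np

def lasp_diagnostics(J, tol=1e-9):
    """Check the LASP condition (Proposition 1) and the step-size bound of Proposition 2
    for a Jacobian J = grad_x G evaluated at a candidate stationary point."""
    eigvals = np.linalg.eigvals(J)
    is_lasp_candidate = bool(np.all(eigvals.real < -tol))
    # Proposition 2: alpha < -2 Re(lambda*) / |lambda*|^2, lambda* maximizing Re(lambda)/|lambda|^2.
    ratios = eigvals.real / np.abs(eigvals) ** 2
    lam_star = eigvals[np.argmax(ratios)]
    alpha_max = -2.0 * lam_star.real / (abs(lam_star) ** 2)
    return is_lasp_candidate, alpha_max

# Hypothetical Jacobian of the DBI update field at a stationary point.
J = np.array([[-1.7, -0.6],
              [ 2.0, -2.2]])
ok, alpha_max = lasp_diagnostics(J)
print(ok, alpha_max)                 # True, and any learning rate below alpha_max is admissible
rho = max(abs(np.linalg.eigvals(np.eye(2) + 0.05 * J)))
print(rho < 1.0)                     # the discrete dynamics x -> (I + alpha G)(x) contract locally
```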
In this section, we empirically investigate: (1) the convergence behavior of DBI, (2) the solution quality of DBI, and (3) the behavior of DBI in games where we can verify global stability. All our code is written in Python. We ran our experiments on an Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz to obtain the results in Sections 4.1 and C.3, and on an Intel(R) Core(TM) i9-9820X CPU @ 3.30GHz for the rest of the experiments. We evaluate performance in terms of the quality of equilibrium approximation as a function of the number of iterations of a given algorithm, or of its running time. Ideally, given a collection of actions x played by players along the computed (approximate) equilibrium path, we would like to find the largest utility gain any player can obtain by deviating from this path, which we denote by ε(x). However, this computation is impossible in our setting, as it would need to consider all possible histories as well, whereas our approach and the alternatives only return x along the path of play (moreover, considering all possible histories is itself intractable). Therefore, we consider two heuristic alternatives. The first, which we call local SPE regret, runs DBI for every player i starting with x, and returns the greatest benefit that any player can thereby obtain; we use this in Section 4.1. In the rest of this section, we use the second alternative, which we call global SPE regret. It considers, for each player i in level l, a discrete grid of alternative actions, and uses best-response dynamics to compute an approximate SPE of the level-(l+1) subgame in order to evaluate player i's utility for each such deviation. This approach then returns the highest regret among all players computed in this way. Our evaluation considers several SHG scenarios. We begin by comparing DBI to a number of baselines on simple, stylized SHG models, and then move on to three complex hierarchical game models motivated by concrete applications.

We begin by considering instances of SHGs to which we can readily apply several state-of-the-art baselines, allowing a direct comparison to previous work. Specifically, we consider 3 SHG instances with different game properties: (a) a three-level chain structure (the (1, 1, 1) game) with 1-d actions, (b) a three-level tree with two leaves (the (1, 1, 2) game) with 1-d action spaces, and (c) a (1, 1, 1) game with 3-d actions. In all the games, the payoffs are polynomial functions of x with randomly generated coefficients (we can think of these as proxies for a Taylor-series approximation of actual utility functions). Details are in Appendix C. We compare DBI with the following five baselines: 1) simultaneous partial gradient ascent (SIM) [7, 28], 2) symplectic gradient dynamics with and 3) without alignment (SYM_ALN and SYM, respectively) [4], 4) consensus optimization (CO) [30], and 5) Hamiltonian gradient (HAM) [1, 26]. SIM, SYM_ALN, SYM, CO, and HAM are all designed to compute a local Nash equilibrium [4, 7].

We start by comparing the convergence behavior of DBI to the baselines. We run all algorithms with the same initial point and learning rate. The results are in Figure 2, where we plot the L2 norm of the total gradient of each algorithm (Y axis) against the number of iterations (X axis). In all cases, DBI converges to a critical point that meets the first-order conditions, while the baseline algorithms fail to do so in most cases. In Figures 2(a) and (c), all baselines converge to a point with a finite total-gradient norm. In (b), however, only CO and HAM converge to a stationary point, while SIM, SYM, and SYM_ALN all diverge. For scenario (b), DBI appears to be on an inward spiral towards a critical point. We further check the second-order condition (see Appendix C) and verify that DBI has in fact converged to local maxima of the individual payoffs in all three games. Next, we investigate solution quality in terms of the local regret of DBI compared to the baselines. As shown in Figure 3, across all three game instances, DBI outputs a profile of actions (along the path of play) with near-zero local regret, while the other algorithms fail to do so.
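Since the remaining experiments report global SPE regret, the following sketch shows one way such a regret evaluation can be computed for a two-level SHG: each candidate deviation by the leader re-equilibrates the follower subgame via grid-based best-response dynamics, and follower deviations are evaluated unilaterally. The grid, the inner best-response routine, and the function names are illustrative assumptions, not the exact procedure used to produce the figures.

```python
import numpy as np

def follower_equilibrium(x, u_followers, grid, n_rounds=50):
    """Approximate a pure Nash equilibrium of the followers' simultaneous-move subgame
    (for a fixed leader action x) by best-response dynamics over a discrete action grid."""
    n = len(u_followers)
    y = np.full(n, grid[len(grid) // 2])
    for _ in range(n_rounds):
        for i, u_i in enumerate(u_followers):
            vals = [u_i(x, np.concatenate([y[:i], [a], y[i + 1:]])) for a in grid]
            y[i] = grid[int(np.argmax(vals))]
    return y

def global_spe_regret(x, y, u_leader, u_followers, grid):
    # Leader deviations: each alternative action re-equilibrates the follower subgame.
    base = u_leader(x, y)
    leader_regret = max(u_leader(a, follower_equilibrium(a, u_followers, grid)) - base
                        for a in grid)
    # Follower deviations: unilateral, holding the rest of the profile fixed.
    follower_regret = 0.0
    for i, u_i in enumerate(u_followers):
        best = max(u_i(x, np.concatenate([y[:i], [a], y[i + 1:]])) for a in grid)
        follower_regret = max(follower_regret, best - u_i(x, y))
    return max(leader_regret, follower_regret)
```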
Next, we consider DBI for solving a class of games inspired by hierarchical decentralized policy-making in the context of epidemics such as COVID-19 [18]. The hierarchy has levels corresponding to the (single) federal government, multiple states, and the county administrations under each state. Each player's action (policy) is a scalar in [0, 1] that represents, for example, the extent of social distancing recommended or mandated by a player (e.g., a state) for its administrative subordinates (e.g., counties). Crucially, these subordinates have considerable autonomy in setting their own policies, but incur a non-compliance cost for significantly deviating from the recommendations made by the level immediately above (of course, non-compliance costs are not relevant for the root player). The full cost function of each player additionally includes infection prevalence within the geographic territory of interest to the associated entity (e.g., within the state), as well as the socio-economic cost of the policy itself (complete details are provided in Appendix C). Since the actions are in a one-dimensional compact space and the depth of the hierarchy is at most 3, our baseline is the best response dynamics (BRD) algorithm proposed by Jia et al. [18] (detailed in Appendix C), and we use global regret as the measure of efficacy in comparing the proposed DBI algorithm with BRD.

The results of this comparison are shown in Figure 4 for two-level (government and states) and three-level (government, states, counties) variants of this game. We consider two-level games with 20 and 50 leaves (states), and three-level games with 2 players in level 2 (states) and 4 and 10 leaves (counties). The left (resp., right) column corresponds to results for games with 2 (resp., 3) levels. As we can see from the top plots in Figure 4, BRD can have poor convergence behavior in terms of global regret, whereas DBI appears to converge quite reliably to a path of play with considerably lower global regret. Notably, the improvement in solution quality becomes more substantial as we increase the game complexity, either in terms of scale (number of leaves) or in terms of the depth of the hierarchy (moving from 2- to 3-level games). Running time (in seconds) demonstrates the relative efficacy of DBI even further (bottom plots in Figure 4). In particular, observe the significant increase in the running time of BRD as we increase the number of leaves. In contrast, DBI is far more scalable: indeed, even more than doubling the number of players appears to have little impact on its running time. Moreover, BRD is several orders of magnitude slower than DBI for the more complex games.

Next, we consider hierarchical public goods games. A conventional networked public goods game endows each player i with a utility function u_i(x_i, x_{−i}) = a_i + b_i x_i + Σ_j g_{ji} x_i x_j − c_i(x_i), where g_{ji} is the impact of player j on player i (often represented as a weighted edge in a network), and x_i ∈ [0, 1] is player i's level of investment in the public good [6]. We construct a 3-level hierarchical variant of such games by starting with the karate club network [43], which represents friendships among 34 individuals. Level-2 nodes are obtained by partitioning the network into two (sub)clubs, with leaves (level-3 nodes) representing all the individuals. The utility of a level-2 node is the sum of the utilities of the individual members of the associated club, and the utility of the root is the sum of the utilities of all individuals. Furthermore, we introduce non-compliance costs with respect to the investment policies recommended by the level immediately above, as we did in the decentralized epidemic policy game (Section 4.2). Further details are provided in Appendix C.6. We observe that DBI yields considerably lower regret in these games than BRD, even as we discretize the latter finely. Moreover, DBI reaches smaller regret orders of magnitude faster than BRD.
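For concreteness, the sketch below writes down a leaf player's utility in the hierarchical public goods game on the karate club network. The quadratic non-compliance penalty follows the description above, while the way the public-goods payoff and the penalty are weighted (the parameter kappa) and the investment cost function are assumptions here, since the exact combination is specified only in Appendix C.6.

```python
import numpy as np
import networkx as nx

# Influence weights g_{ji} taken from the karate club network [43].
club = nx.karate_club_graph()
g = nx.to_numpy_array(club)

def leaf_utility(i, x, a, b, cost, x_parent, kappa):
    """Utility of leaf (level-3) player i.

    x        : investment levels of all 34 individuals (leaves)
    a, b     : per-player coefficients a_i, b_i
    cost     : investment cost function c_i(.)
    x_parent : investment policy recommended by i's (sub)club in level 2
    kappa    : assumed weight trading off the public-goods payoff against
               the non-compliance penalty (a modeling assumption in this sketch)
    """
    public_good = a[i] + b[i] * x[i] + x[i] * np.dot(g[:, i], x) - cost(x[i])
    non_compliance = (x[i] - x_parent) ** 2
    return (1.0 - kappa) * public_good - kappa * non_compliance

# Level-2 ((sub)club) and root utilities aggregate the utilities of their members,
# as described in the text, plus the level-2 players' own non-compliance terms.
```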
In the final set of experiments, we evaluate DBI on a hierarchical extension of interdependent security games [3]. In these games, n defenders can each invest x_i ≥ 0 in security. If defender i is attacked, the probability that the attack succeeds is 1/(1 + x_i). Furthermore, defenders are interdependent, so that a successful attack on defender i cascades to defender j with probability q_{ji}. In the variant we adopt, the attacker strategy is a uniform distribution over defenders (e.g., the "attacker" is just nature, with attacks representing stochastic exogenous failures). The utility of a defender is its probability of surviving the attack, less the cost of its security investment. We extend this simultaneous-move game to a hierarchical structure consisting of one root player (e.g., a government), three level-2 players (e.g., sectors), and six leaf players (e.g., organizations). The policy-makers in the first two levels of the game recommend an investment policy to the level below, and aim to maximize total welfare (the sum of utilities) among the leaf players in their subtrees. Just as in the hierarchical epidemic and public goods games, whenever a player in level l does not act according to the recommendation of its parent in level l−1, it incurs a non-compliance cost. Complete model details are deferred to Appendix C.7.

We conduct experiments with two values of the weight κ that determines the relative importance of non-compliance costs in the decisions of non-root players: κ ∈ {0.1, 0.5}. Figures 5(b) and 5(c) present the results of comparing DBI with BRD on this class of games, where BRD is again evaluated with different levels of action-space discretization (note, moreover, that in this setting discretizing actions is not enough, since they are unbounded, and we also had to impose an upper bound). We observe that for either value of κ, DBI yields a high-quality SPE approximation (in terms of global SPE regret) far more quickly than BRD. In particular, when we use a relatively coarse discretization, BRD is approximately an order of magnitude slower and yields significantly higher regret. If we instead use a finer discretization for BRD, the global regret of BRD and DBI becomes comparable, but BRD is then several orders of magnitude slower. For example, DBI converges within several seconds, whereas if we discretize x_i into multiples of 0.02, BRD takes nearly 2 hours, and discretization at the level of 0.01 results in BRD taking nearly 7 hours.

We introduced a novel class of hierarchical games, proposed a new game-theoretic solution concept, and designed an algorithm to compute it. We assume a specific form of utility dependence between players, and our solution concept only guarantees local stability; improvement on each of these two fronts is an interesting direction for future work. Given the generality of our framework, our approach can be used in many applications characterized by a hierarchy of strategic agents, e.g., pandemic policy making. However, our modeling requires full knowledge of the true utility functions of all players, and our analysis assumes full rationality of all players. Although the model we have addressed here is already challenging, these assumptions are unlikely to hold in many real-world applications. Therefore, further analysis is necessary to fully gauge the robustness of our approach before deployment.

[9] Constantinos Daskalakis and Ioannis Panageas. The limit points of (optimistic) gradient descent in min-max optimization. In Thirty-Second International Conference on Neural Information Processing Systems, pages 9256-9266, 2018.
[10] Asen Dontchev and Tyrrell Rockafellar. Implicit functions and solution mappings. Springer, 2009. [11] Tanner Fiez, Benjamin Chasnov, and Lillian Ratliff. Implicit learning dynamics in Stackelberg games: Equilibria characterization, convergence analysis, and empirical study. In Thirty-Seventh International Conference on Machine Learning, pages 3133-3144, 2020. [12] Drew Fudenberg, Fudenberg Drew, David K Levine, and David K Levine. The theory of learning in games, volume 2. MIT press, 1998. [32] Thanh Nguyen, Arunesh Sinha, and He He. Partial adversarial behavior deception in security games. In Twenty-Nineth International Joint Conference on Artificial Intelligence, pages 283-289, 2020. [33] Martin J Osborne and Ariel Rubinstein. An introduction to game theory. Oxford university press New York, 2004. [34] R Selten. Reexamination of the perfectness concept for equilibrium points in extensive games. International Journal of Game Theory, 4(1): 1975 . [35] Amirreza Shaban, Ching-An Cheng, Nathan Hatch, and Byron Boots. Truncated back-propagation for bilevel optimization. In Twenty-Second International Conference on Artificial Intelligence and Statistics, pages 1723-1732, 2019. [36] Michael Shub. Global stability of dynamical systems. Springer, 1987. [37] Heinrich Von Stackelberg. A Omitted Details from Section 2 Proposition 5. The SPE notion we defined for an SHG model corresponds exactly to the SPE of its extensive-form game (EFG) representation. Proof. First we show how to construct the EFG representation of an SHG. Since within level l players act simultaneously, we can designate a canonical player order for player in level l, say we label them as pl, 1q, pl, 2q, . . . , pl, n l q. Then in the EFG representation, we sequentialize the playing order as always be p1, 1q, p2, 1q, p2, 2q, . . . , p3, 1q, . . . , pL, 1q, . . . , pL, n L q. For every pl, 1q, l " 1, . . . , L, its information sets are all in the form of some singleton th ăl u for some history h ăl . While for pl, n l q, l ą 1, its information sets are sets that consist of some h ăl concatenated with all possible moves of pl, 1q, . . . , pl, n l´1 q. Then according the the classical definition of the subgame notion [27, 34] , the only well-defined subgames are those rooted at pl, 1q, @l since these are the only ones with singleton information set. So every level-l subgames corresponds to a subgame in the EFG representation, and every subgame in the EFG representation maps to a level-l subgame of the EFG by investigating the player index. Then the SPE for both representations are also equivalent. Proof of Proposition 2. The learning dynamics of DBI can be written as x t " pI`αGqpx t´1 q. Since G has eigenvalues λ 1 , . . . , λ d at x˚, the matrix I`α∇ x G at a stationary point x˚has eigenvalues 1`αλ 1 , . . . , 1`αλ d . Since the set of LASPs is non-empty, Repλ i q ă 0 for all λ i . Then for the choice of α˚in the Proposition, the modulus of the Jacobian ). For completeness, we provide a proof. Since ρppIὰ ∇ x Gqpx˚qq " 1´κ, according to Horn and Johnson [16, Lemma 5.6.10], there exists a matrix norm ¨ such that I`α∇ x G ă 1´κ` , for @ ą 0. We choose " κ 4 The Taylor expansion of pI`αGq at x˚is pI`αGqpxq " pI`αGqpx˚q`pI`α∇ x Gqpx˚qpx´x˚q`Rpx´x˚q, where Rpx´x˚q " op x´x˚ q. Let R 1 px´x˚q " 1 α Rpx´x˚q, Then we have lim xÑx˚R 1px´x˚q x´x˚ " 0. Then we can choose δ ą 0 such that R 1 px´x˚q ď κ 4 x´x˚ when x´x˚ ă δ. This shows the operator pI`αGq is a contract mapping with contraction constant p1´κ 2 q. 
Therefore the convergence rate is Opp1´κ{2q t q Before we present the proof of Proposition 3, we state the following lemma. Lemma 1. The update gradient vector G is L-Lipchitz if and only if ∇ x G ď L at all x P X . Proof. First we prove for "if" direction. Consider x 1 , x 2 P X , Then we prove for the "only if" direction. Take ą 0, then for @x 1 , x 2 P X , Since it holds for any x 2 , it must be ∇ x Gpx 1 q ď L. And since it applies for any x 1 P X , this completes the proof. Proof of Proposition 3. From the proof from Proposition 2 notice that in order to make x t´x˚ ď we only have to let t ě r 2 κ log x 0´x˚ { s since now x t´x˚ ď p1´κ 2 q t x 0´x˚ ď exp p´κ{2tq ď . Now we need to characterize the region of initial points that can converge to x˚by characterizing the maximum possible radius of such initial point to x˚. Recall in the proof of Proposition 2, this is captured by the. parameter δ. On one hand by using the Lipschitzness, we can bound the residual function by One the other hand to maintain this convergence rate we should let R 1 px´x˚q ď κ 4 x´x˚ . Then we simply let ď L 2 x´x˚ 2 ď κ 4 x´x˚ we get an initial point should satisfy x´x˚ ď κ 2L . To prove Proposition 4, we need a few more machinery. Proposition 6. With α ă 1{L, I`αG is a diffeomorphism. Proof. First we show pI`αGq is invertible. Suppose x 1 , x 2 P X such that x 1 " x 2 and pI`αGqpx 1 q " pI`αGqpx 2 q. Then x 1´x2 " αpGpx 2 q´Gpx 1 qq. And by α ă 1{L we have x 1´x2 ď αL x 1´x2 ă x 1´x2 , which is a contraction. Next we show we show its invert function is well-defined on any point of X . Notice that ρpα∇ x Gq ď α∇ x G ď αL ă 1. And notice that since the eigenvalues of I`α∇ x G is the eigenvalues of α∇ x G plus 1, the only way that makes detpI`α∇ x Gq " 0 is to have one of the eigenvalues of α∇ x to be´1. This contradicts to ρpα∇ x Gq ă 1. Therefore by the implicit function theorem, pI`αGq is a local diffeomorphism on any point of X , and therefore pI`αGq´1 is well defined on X . Theorem 2 (Center and Stable Manifold [36, Theorem III.7, Chapter 5]). Suppose x˚" hpx˚q is a critical point for the C r local diffeomorphism h : X Ñ X . Let X " X s ' X u , where X s is the stable center eigenspace belonging to those eigenvalues of ∇ x hpx˚q whose modulus is no greater than 1, and X s is the unstable eigenspace belonging to those whose modulus is greater than 1. Then there exists a C r embeded disk W cs loc px˚q that is tangent to X s at x˚called the local stable center manifold. Moreover there D ą 0, hpW cs loc px˚qq Ş B px˚q Ă W cs loc and Ş 8 t"0 h´tpB px˚qq Ă W cs loc px˚q. Proof of Proposition 4. For @x˚P Xs ad , let px˚q ą 0 be the radius of neighborhood provided by Theorem 2 for diffeomorphism h " I`αG and point x˚. Then define B " Ť x˚PXs ad B px˚q px˚q. And since X is a subset of Euclidean space so it is second-countable. By Lindelöf's Lemma [21] , which stated that every open cover there is a countable subcover, we can actually write B " i"1 B pxi q pxi q for a countable family of saddle points txi u 8 i"1 Ď Xs ad . Therefore for @x 0 P X 0 sad that converges to a saddle point, it must converge to xi for some i. And DT px 0 q ą 0, such that @t ą T px 0 q, h t px 0 q P B pxi q pxi q. From Theorem 2 we have h t px 0 q P W sc loc pxi q. Since h is a diffeomorphism on X we have x 0 P h´tpW sc loc pxi qq. We furthermore union over all finite time step x 0 P Ť 8 t"0 h´tpW sc loc pxi qq. Then we have X 0 sad Ď Ť 8 i"1 Ť 8 t"0 h´tpW sc loc pxi qq. 
For each i since xi is a saddle point, it has an eigenvalue greater than 1, so the dimension of unstable eigenspace dimpX u pxi qq ě 1. Therefore the dimension of W sc loc pxi q is less than full dimension. This leads to µpW sc loc pxi qq " 0 for any i. Again since h is a diffeomorphism, h´1 is locally Lipschitz which are null set preserving. Then µph´tpW sc loc pxi qqq " 0 for @i, t. And since the countable union of measure zero sets is measure zero, µp Ť 8 i"1 Ť 8 t"0 h´tpW sc loc pxi qqq " 0. So µpX 0 sad q " 0. Our DBI algorithm essentially is designed to satisfy a first order condition @pl, iq, D x l,i u l,i " 0. If we furthermore impose a second-order condition @pl, iq, D 2 x l,i ,x l,i u l,i ă 0, then at such point every player reaches their local maxima in payoff, which is effectively corresponding to an local equilibrium notion. In fact, in two-level, one-agent each level case this is consistent with the local Stackelberg equilibrium notion in previous literature [11, 39] . We next formally define the notion of local subgame perfect equilibrium (LSPE). Definition 2 (LSPE). x˚" px1 , . . . , xLq is a local subgame perfect equilibrium (LSPE) if Proposition 2 only guarantees that DBI will find an LASP when there exists one. So a natural question is to characterize the relationship between LSPEs and LASPs. Fiez et al. [11] shows that for two-player zero-sum Stackelberg games, the set of LASPs and the set of LSPEs coincide. Unfortunately, we next show that this is not the case for general SHGs. Proposition 7. There exists a SHG G such that LSPEpGq Ć LASPpGq and LASPpGq Ć LSPEpGq. Proof of Proposition 7. Consider a two layer game with one player each level (also the Stackelberg game [11, 39] ). Suppose the action for player 1 is x and for 2 is y, where x, y P R. The analytical form for first-order gradient is By 2 u 2 , and, D y u 2 " Bu 2 By . The total second-order closed-form is By 2 u 2`B 2 Bx 2 u 1 , and, D 2 y,y u 2 " B 2 By 2 u 2 . The Jacobian matrix of the dynamics of Algorithm 1 iŝ ∇ x pD x u 1 q ∇ y pD x u 1 q ∇ x pD y u 2 q ∇ y pD y u 2 q˙, By 2 u 2`B 2 Bx 2 u 1 , By 2 u 2`B 2 BxBy u 1 , ∇ x pD y u 2 q " B 2 BxBy u 2 , and, ∇ y pD y u 2 q " B 2 By 2 u 2 . Then consider payoff function u 1 px, yq "´8x 4`8 x 3 y´5x 2 y 2`5 xy 3`6 y 4`3 x 3`3 x 2 y`6xy 2`3 y 3´8 x 2x y´3y 2`6 x and u 2 px, yq " 4x 4´x3 y`5x 2 y 2´1 0xy 3`1 0y 4`7 x 3´9 x 2 y´7xy 2´2 y 3´3 x 2´9 xy´4y 2´7 x´2y. We can derive that D x u 1 " 6´16x`9x 2´3 2x 3´y`6 xy`24x 2 y`6y 2´1 0xy 2`5 y 3´p p´9´18x´3x 2´1 4y2 0xy´30y 2 qp´x`3x 2`8 x 3´6 y`12xy´10x 2 y`9y 2`1 5xy 2`2 4y 3 qq{p´8´14x`10x 2´1 2y´60xy`120y 2 q and D y u 2 "´2´9x´9x 2´x3´8 y´14xy`10x 2 y´6y 2´3 0xy 2`4 0y 3 . Then there are two critical points px1 , y1 q « p´0.35225, 0.09948q and px2 , y2 q « p´0.175443,´0.188133q. At point px1 , y1 q, D 2 x,x u 1 " 28350.9, D 2 y,y u 2 " 0.268741 the Jacobianˆ∇ x pD x u 1 q ∇ y pD x u 1 q ∇ x pD y u 2 q ∇ y pD y u 2 q˙«´1 464.55 1477.72 5.42229 0.268741˙. And it can be verified that it is an LASP but not an LSPE. At point px2 , y2 q, D x,x u 1 "´3413.93, D y,y u 2 "´0.711531 the Jacobianˆ∇ x pD x u 1 q ∇ y pD x u 1 q ∇ x pD y u 2 q ∇ y pD y u 2 q˙« 27.9321 661.496 3.70221´0.711531˙. And it can be verified that it is an LSPE but not an LASP. In our experiments we may need to check the second-order total derivative at a point. This section provides our approach to calculate such quantity. For player of level L, D 2 x l,i ,x l,i u l,i " ∇ 2 x l,i ,x l,i u l,i . 
For level l ă L, notice that D x l,i u l,i is an explicit function of the action variables of players that are descendants of pl, iq. Denote this set of players as DES pl, iq (excluding pl, iq), then we have D 2 x l,i ,x l,i u l,i " ∇ x l,i pD x l,i u l,i q`ř pλ,ηqPDES pl,iq ∇ x λ,η pD x l,i u l,i qD x l,i x λ,η . Here D x l,i x λ,η " p´1q λ´l ś pλ 1 ,η 1 qP PATH ppλ,ηq Ñpl,iqq´∇ 2 x λ 1 ,η 1 ,x λ 1 ,η 1 u λ 1 ,η 1¯´1 ∇ 2 x λ 1 ,η 1 ,x PApλ 1 ,η 1 q u λ 1 ,η 1 . The terms ∇ x l,i pD x l,i u l,i q and ∇ x λ,η pD x l,i u l,i q involve thrice derivatives. In practice we use either numerical finite-difference methods to approximate this value or use symbolic algebra feature in software tools such as Mathematica [42] to calculate the analytical form of this derivative. Setup We compare the following algorithms, such that the gradient update vectors are as follows DBI: G DBI " pD x1 u 1 , . . . , D xn u n q. SIM: G SIM " p∇ x1 u 1 , . . . , ∇ xn u n q. SYM: G SYM " G SIMG SIM A SIM , where A SIM " pJ SIM´p J SIM q T q{2, and J SIM is the Jacobian matrix of G SIM . HAM: G HAM "´∇ x G SIM 2 . CO: G CO " G SIM`γ G HAM , where we let γ " 0.1 for all experiments. SYM_ALN: Experimental Details for (1, 1, 1) , 1-d Action Games We use u 1 px, zq "´7x 2`9 xz`x´z, u 2 px, y, zq "´2y 2´4 yz´10x 2`2 xz´3z 2`4 y`7x´8z´8xyz, and, u 3 py, zq "´10z 2´9 yz`9y 2´5 z´2y. Here px, y, zq are action variables for players 1, 2, 3. We use the learning rate of α " 1e´5 for all algorithms. We verify that DBI converges to an LSPE px, y, zq " p´0.34, 1.85,´1.08q. Experimental Details for (1, 1, 2) , 1-d Action Games We use u 1 px, y, zq "´2x 2´3 xy`y 2`5 x7 y`3xz´10yz`5xyz´6z, u 2 pw, x, y, zq " 2w 2´w x´3wy´5x 2`9 xy`2y 2`3 w`5x´4y`5z 28 wz`7xz´9yz´10z, u 3 pw, y, zq "´5y 2´8 yz`z 2`8 y´9z´2wy´4wz´w 2´8 wyz´2w, and, u 4 pw, y, zq "´10z 2´2 yz`5y 2´7 z´6y´3wz´8wy´10wyz`5w Here px, w, y, zq are action variables for players 1, 2, 3, 4. We use α " 4e´6 for all algorithms. We verify that DBI converges to an LSPE px, w, y, zq " p4.70,´2.13, 10.27, 9.93q. Experimental Details for (1, 1, 1), 3-d Action Games We use u 1 px, zq "´7 i"1 z i q, and, u 3 py, zq "´10 x, y, z are action variables of players 1, 2, 3. We use the learning rate α " 1e´5 for all algorithms. We verify that DBI converges to an LSPE px, y, zq " pp´0.39,´0.39,´0.39q, p0.29, 0.29, 0.29q, p´0.58,´0.58,´0.58qq. The time cost of each algorithm is measured in Table (1) DBI SIM SIM_ALN SYM CO HAM (1,1,1) A note on the physical significance of the above utility functions Although our methodology is applicable in principle to any set of agents' utility function satisfying the definition of SHGs, it is reasonable to ask whether there exists a game model, grounded in some real-world domain, that exhibits the SHG hierarchy as well as the polynomial multi-variable utility structure above. To that end, consider the COVID-19 Game delineated below. In the experiments described below, each agent (a policy-maker in a hierarchy such as the federal government, state governments, and county governments) has a non-dimensional action restricted to closed interval r0, 1s, interpreted as a social distancing factor due to policy intervention (either recommended or implemented). However, in general, a policy intervention (hence, and agent's action) may take the form of a multi-dimensional real vector, e.g. 
factor for social distancing, factor for vaccination roll-out, etc.; hence, in general, the overall cost (negative payoff) of each agent is a function of 2`n L vectors rather than 2`n L scalars (correspoding to the agent, its parent, and all leaves) as in Equation (7). Although the actual functional form is not necessarily polynomial, we can always take a polynomial approximation in our model via curve-fitting or a Taylor series expansion. The above polynomial functions would then belong to the resulting class of functions; in our experiments for Convergence Analysis experiments are randomly chosen since the focus of this paper is not on modeling any real-world multi-agent interaction scenario to a high degree of accuracy but on assessing the performance of our algorithm on reasonable models consistent with the SHG structure. This line of thinking can be also applied to the utility functions used in the experiments on the relationship between LASP and LSPE described at the end of the appendix. In this additional experiment, We will study the chance of having a convergent algorithm as well as the probability of finding an equilibrium upon convergence on several classes of SHGs. We design game classes of SHGs in a way that every game class F has the same parameter space on both its topology and payoff structure. For each class, we generate N instances, and for each instance, we first find the set of critical points by numerically solving the first order condition D xi u i " 0, @piq. For each of these critical points, we determine whether it is an LASP of DBI by checking whether the Jacobian of updating gradient G have eigenvalues all of which have negative real part. If this is the case, we classify the critical point as an LASP according to Proposition 1. Finally, for each of the LASPs, we further check whether it is an LSPE by checking whether D 2 xi,xi u i ă 0, @piq. Finally, across the N instances we compute the fraction of games for which an LASP was found (% LASP), and for those with an LASP, the fractions of the instances where an LASP is also an LSPE (% LSPE). We call these two numbers the measure properties of a game class. We denote by our game classes by F super sub where the subscript is a vector listing the number of players in each SHG level from top to bottom, and the superscript indicates the parameters of the game. For example F C 1,1 is an SHG class with two levels and one player in each level (a Stackelberg game) with C P Rd etermining payoffs as follows: u i " ř α`βď4,α,βPN c i,α,β x α y β , where x, y P R are 1-d action variables for the player (i " 1) and the second level-2 player (i " 2), respectively; the player-specific coefficients c i,α,β are integers generated uniformly in r´C, Cs for different values of C. When C " 8, we generate the coefficients from a continuous uniform distribution in r´1, 1s. See Appendix C for more information (e.g., the utility structure) of the other game classes. Table 2 : Results for F C 1,1 F C 1,2 , F C 1,1,1 and F C 1,1,2 averaged over N " 10 5 instances. The results are shown in Table 2 . First, we notice for the same game topology, C does not appear to substantially affect the measure property. Parameter C essentially controls the granularity of a uniform discrete distribution and approaches a uniform continuous distribution as it becomes large. The measure properties will also be quite similar as C grows. The second observation is that as the topology becomes much more complex, it is less probable for DBI to find a stable point. 
The relevant probabilities degrade from 52% " 59% for Stackelberg games to 13% " 21% for 4-player 3-level games. This suggests a limitation of our algorithm in facing complex game topologies with more intricate back-propagation. However, we observe that the probability of an LASP being an LSPE does not decay speedily, as it achieves 87% " 88% for the structure of p1, 1q, 60% " 67% for p1, 2q, p1, 1, 2q and p1, 1, 1q. We elaborate our experimental designs here. For each function class F, we generates N instances. For each instances, we solve equations D x l,i " 0, @pl, iq, using numerical methods provided by Mathematica [42] . For each critical point found, we check whether the Jacobian of the updating gradient has eigenvalues all of which have negative real parts. If it is the case, we classify it as an LASP. For each of such LASP, we can determine whether it is an LSPE by checking whether D 2 x l,i ,x l,i ă 0. Then for each instance, we can have two numbers: (1) an indicator demonstrating whether it is found an LASP (2) if there is an LASP, the fraction of these LASPs being LSPE. Then across these N instances, we define %LASP fi #pInstances found LASPq{N and %LSP E fi ř pFraction of these LASPs being LSPE in an instanceq{#pInstances found LASPq. F C 1,1 is detailed in the main body. For F C 1,2 , the payoff structure is u 1,1 px, y, zq " ř α`β`γď3,α,β,γPN c p1,1q,α,β,γ x α y β z γ , u 2,1 px, y, zq " ř α`β`γď3,α,β,γPN c p2,1q,α,β,γ x α y β z γ , u 2,2 px, y, zq " ř α`β`γď3,α,β,γPN c p2,2q,α,β,γ x α y β z γ , where x, y, z are action variables for players p1, 1q, p2, 1q, p2, 2q respectively. For F C 1,1,1 , u 1 px, zq " ř α`βď4,α,βPN c 1,α,β x α z β , u 2 px, y, zq " ř α`β`γď3,α,β,γPN c 2,α,β x α y β z γ , u 3 py, zq " ř α`βď4,α,β,γPN c 3,α,β y α z β , where x, y, z are action variables for players p1, 1q, p2, 1q, p3, 1q respectively. For F C 1,1,2 , u 1,1 px, y, zq " ř α`β`γď3,α,β,γPN c 1,α,β,γ x α y β z γ , u 2,1 pw, x, y, zq " ř α`β`γ`ιď2,α,β,γ,ιPN c 2,α,β x α y β z γ w ι , u 3,1 pw, y, zq " ř α`β`γď3,α,β,γPN c p3,1q,α,β,γ w α y β z γ , u 3,2 pw, y, zq " ř α`β`γď3,α,β,γPN c p3,2q,α,β,γ w α y β z γ , where px, w, y, zq are action variables for players p1, 1q, p2, 1q, p3, 1q, p3, 2q. Global Subgame Perfect Equilibrium in an SHG Before we analyze the COVID-model, we should formally define the notion of global subgame perfect equilibrium in an SHG. Denote the set of players as N . Roughly speaking, a profile x˚is an SPE, if for every pl, iq, given actions of players belonging to N zptpl, iqu Y DES pl, iqq, player pl, iq reaches its maximum payoff by choosing xl ,i , while assuming DES pl, iq also had reached an SPE. We only consider pure SPE here so we assume x˚P X . However, there are a couple of issues here. First let's imagine when solving a two-level games with n 2 players at the second-level. Given p1, 1q's action choice, the problem of computing SPE in the second-level is equivalent to computing a Nash equilibrium in the second-level. However, an exact pure Nash equilibrium may not exist. So which profile should we choose to propagate back to p1, 1q? To resolve this issue, we define ε-Nash equilibrium. Definition 3. For a simultaneous-move game, a profile x˚is an ε-Nash if for any player n, @x 1 n P X n , u n px 1 n , x˚nq ď u n px˚q`ε. In another word, an ε-Nash is a profile where for every player a unilateral deviation cannot offer benefit more than ε while fixing other's profile. 
In the context of our example for the two level game, given p1, 1q's action, we select the profile with the minimum ε of the simultaneous-move game defined on level 2 as an SPE back to p1, 1q. We now generalize this example to formally define the notion of ε-SPE in an SHG. First let us define Φ l,i pxq P R d L that returns the equilibrated profile at level L. In this profile, leaves that are not descendants of pl, iq are fixed in x, while LEAF pl, iq moved to a profile that corresponding to an SPE of G l,i pxq with an minimum ε. Then we define the ε l,i pxq as max x 1 l,i PX l,i u l,i px 1 l,i , x Papl,iq , Φpx 1 l,i qq´u l,i px l,i , x Papl,iq , Φpx l,i qq and define εpxq " max l,i εpxq as the ε of profile x in an SHG G. We next provide an approximate algorithm (see Algorithm 2) to compute in an SHG. Pay attention here that x : x i Ñ x 1 i means replacing x i of x with x 1 i . The functions in Algorithm 2 operate as follows. The procedure Search takes a given joint profile x, player index pl, iq and returns a best response profile for a given player pl, iq. Our approach is that for each action in x 1 l,i P X l,i , we re-equilibriate subgames G l,i px : x l,i Ñ x 1 l,i q, and then compute the corresponding payoff. In our actual implementation, we discretize X l,i in a bucket of grid points, and search within such bucket. The procedure Re-Eq returns the re-equilibrated profile of G l,i pxq given pl, iq and x. The procedure SHG_Solve solve a simultaneous-move game at level l`1. It applies an iterative best response approach for Chdpl, iq to generate diverse profiles, and select the one with minimum possible ε. In each iteration, for each pl`1, jq P Chdpl, iq, it computes its best response action against the previous joint profile x t´1 . And then in the next iteration it replace those descendant-profiles of pl`1, jq in x t´1 by computed re-equilibrated profiles when pl`1, jq selected its best response. Then it will return the joint profile with the minimum ε found so-far. To solve the whole game, we just call SHG_Solvepp0, 0q, xq, where x is some other action profile. To compute the ε of a given profile, we just call Compute_ ε, where it just compute the maximum unilateral deviation for every player using Search to compute the best response action and payoff. We will now describe in detail the particular subclass of SHGs that we studied in our experiments reported in Section 4.2. This class is impact the realized cost of every player -one of the defining attributes of SHGs) while those taken by the government and states are recommendations. Similar to Jia et al. [18] , we also study a restricted variant where each county is non-strategic and constrained to comply with the action (recommendation) of its parent-state, effectively reducing the model to a 2-level hierarchy. We call this special case a two-level game (Figure 4(a) ) and the more general model a three-level game (Figure 4(b) ). The cost function of each player piq has, in general, three components: a policy impact cost C inc i px L q which we will elaborate on below; a policy implementation cost C dec i px L q, e.g. economic and psychological costs of a lockdown; and, for each player in levels l ą 1, a non-compliance cost C NC i px i , x Papiq q, a penalty incurred by a policy-maker for deviating from the recommendation of its parent in the hierarchy (e.g., a fine, litigation costs, or reputation harm). Let N p3,iq ą 0 denote the fixed population under the jurisdiction of county p3, iq for every i P t1, 2, . . . , n 3 u. 
By construction, the population under each state is $N_{(2,i)} = \sum_{(3,j) \in \mathrm{Chd}(2,i)} N_{(3,j)}$, and the population under the government is $N_{(1,1)} = \sum_{i=1}^{n_2} N_{(2,i)}$. We next define the expressions for each component of the cost function. Policy Impact Cost: This cost component is a quadratic closed-form approximation to the agent-based model introduced by Wilder et al. [41]; it is inspired by the infection-cost computation of Jia et al. [18], who used a different closed-form approximation. For each county $(3,i)$, let $N^{\mathrm{init}}_{(3,i)}$ denote the number of infected individuals within the county's population prior to the policy intervention; thus, the number of post-intervention susceptible individuals is $(N_{(3,i)} - N^{\mathrm{init}}_{(3,i)}) x_{3,i}$. Another parameter of the game is the transport matrix $R = \{r_{aa'}\}_{a,a' \in \{(3,1),(3,2),\ldots,(3,n_3)\}}$, where $r_{aa'} \ge 0$ is the proportion of the population of county $a'$ that is active in county $a$ in the absence of an intervention; thus, the number of post-intervention infected individuals of county $a'$ that are active in county $a$ is $r_{aa'} N^{\mathrm{init}}_{a'} x_{a'}$. The last parameters of the model are $M$, the average number of contacts with active individuals that a susceptible individual makes, and $\mu \in (0,1)$, the probability that a susceptible individual becomes infected upon contact with an active infected individual. Putting these together, the policy impact cost of a county $a = (3,i)$, $i \in \{1, 2, \ldots, n_3\}$, is defined as the fraction of post-intervention infected individuals in that county; for a higher-level player $i$, the policy impact cost aggregates those of its descendant counties. Policy Implementation Cost: Each county $(3,i)$ incurs a policy implementation cost that depends on its own action, with a lower $x_{3,i}$ (a stronger intervention) implying a higher implementation cost; for a higher-level player $i$, the implementation cost again aggregates those of its descendant counties. Non-Compliance Cost: The non-compliance cost of a player $i$ at level $l \in \{2, 3\}$ is given by the distance between its action and that of its parent, $C^{\mathrm{NC}}_i(x_i, x_{\mathrm{Pa}(i)}) = (x_i - x_{\mathrm{Pa}(i)})^2$. Finally, each player $i$ at a level $l > 1$ has an idiosyncratic pair of weights $\kappa_i \ge 0$ and $\eta_i \ge 0$ that trade its three cost components off against each other via a convex combination and account for differences in ideology; the overall cost of such a player is this convex combination of $C^{\mathrm{inc}}_i$, $C^{\mathrm{dec}}_i$, and $C^{\mathrm{NC}}_i$. The player $(1,1)$ obviously has no non-compliance issues, and hence only a single weight $\kappa_{1,1} > 0$, its overall cost being $C_{1,1}(x_{1,1}, x_L) = \kappa_{1,1} C^{\mathrm{inc}}_{1,1}(x_L) + (1 - \kappa_{1,1}) C^{\mathrm{dec}}_{1,1}(x_L)$. In our experiments, we set $r_{aa'} = 1/n_3$ for every pair of counties $(a, a')$, $M = 20$, and $\mu = 0.3$.
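As a concrete illustration of these bookkeeping quantities, the following minimal sketch computes the population aggregates, the post-intervention susceptible and active-infected counts, and the non-compliance costs for a small hypothetical (1, 2, 4) instance. The county populations, tree structure, and action values are made up for illustration, and the closed-form policy impact and implementation cost expressions of Jia et al. [18] are not reproduced.

```python
# Hedged sketch of the epidemic-SHG bookkeeping on a hypothetical (1, 2, 4) tree.
import numpy as np

n3 = 4
N = np.array([1000., 2000., 1500., 500.])        # county populations N_(3,i) (hypothetical)
N_init = np.array([50., 80., 30., 20.])          # initially infected per county (hypothetical)
children = {0: [0, 1], 1: [2, 3]}                # state -> counties (hypothetical split)
x_county = np.array([0.6, 0.8, 0.5, 0.9])        # social-distancing factors x_(3,i)
x_state = np.array([0.7, 0.6])                   # state recommendations
x_gov = 0.65                                     # federal recommendation

# Populations aggregate up the tree.
N_state = np.array([N[c].sum() for c in children.values()])
N_gov = N_state.sum()

# Post-intervention susceptible individuals per county: (N_a - N_a^init) * x_a.
susceptible = (N - N_init) * x_county
# Transport matrix r_{a a'}: proportion of county a' active in county a (uniform, 1/n_3).
R = np.full((n3, n3), 1.0 / n3)
# Post-intervention infected of county a' active in county a: r_{a a'} * N_a'^init * x_a'.
active_infected = R * (N_init * x_county)

M, mu = 20, 0.3   # contacts per susceptible and infection probability; these enter the
                  # omitted closed-form policy impact cost and are not used further here.

# Non-compliance costs: squared deviation from the parent's recommendation.
C_nc_state = (x_state - x_gov) ** 2
C_nc_county = np.array([(x_county[c] - x_state[s]) ** 2
                        for s, cs in children.items() for c in cs])
print(N_state, N_gov, susceptible.round(1), C_nc_state.round(3), C_nc_county.round(3))
```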
We formally define the BRD algorithm as the procedure SHG_Solve$((0,0), x)$ for a randomly initialized $x$. We begin by defining what constitutes one full algorithm iteration for each of DBI and BRD. For DBI(1,20), DBI(1,50), DBI(1,2,4), and DBI(1,2,10), one algorithm iteration consists of 50 steps of gradient ascent with a learning rate of 0.01. For BRD(1,20) and BRD(1,50), one algorithm iteration consists of 20 iterations of level-wise best response during the recursive procedure in SHG_Solve; for BRD(1,2,4) and BRD(1,2,10), it corresponds to 500 and 200 iterations of level-wise best response, respectively. For DBI we adopt a projection operator that maps the resulting action to the nearest point in $[0,1]$. We discretize each action space uniformly into 101 grid points for two-level games and 11 grid points for three-level games. We let $T = 100$ for BRD(1,20) and BRD(1,50) and $T = 20$ for BRD(1,2,4) and BRD(1,2,10). In the three-level experiments, $\kappa$ is set to 0.5 for the counties and the states and to 0.8 for the government; $\eta$ is set to 0.2 for the states, and for the counties to 0.3 in the (1,2,4) setting and 0.2 in the (1,2,10) setting. In the two-level experiments, $\kappa$ is set to 0.2 for the government and 0.5 for the counties and the states, and $\eta$ is set to 0.2 for the counties and the states. To further study the scalability of the BRD and DBI algorithms, we compare (1) the time until the DBI algorithm converges with (2) the time until the BRD algorithm terminates. Convergence of DBI is defined as convergence of the action profile: when DBI converges, the action profile remains unchanged. Under the projected gradient method, DBI converges when all gradients go to zero, but the iterates may also hit the boundary of the constraint set, so DBI can converge with a non-zero gradient norm. The BRD algorithm terminates either when the iteration limit $T$ is reached or when the $\varepsilon$ goes to zero for all players. Under these conditions, the two algorithms have found their respective solutions. Figure 6 shows the run-time results. We repeat each experiment four times with different random seeds. In the two-level problems, the DBI algorithm is more than twice as fast as the BRD algorithms. Although the action spaces are discretized into only 11 grid points for three-level games (rather than 101), the DBI algorithms still perform better; in practice, discretizing the action spaces into 101 grid points for three-level games is computationally intensive. The performance and run-time of the BRD algorithm depend more heavily on randomization and on the initial points of the best responses at each level. When facing many players or multiple levels, the DBI algorithm is the natural choice: it is significantly more efficient and more stable than the BRD algorithm. We extend the public-good game on Zachary's Karate Club network described earlier to a (1, 2, 34) hierarchical game by introducing non-compliance costs analogous to those in the COVID-19 game model. The overall utility of each club member combines its public-good utility $u_i$ with the non-compliance cost $C^{\mathrm{NC}}_i(x_i, x_{\mathrm{Pa}(i)}) = (x_i - x_{\mathrm{Pa}(i)})^2$. At the second level, the administrator's and the instructor's utilities are composed of the total public-good utility of their group members and their own non-compliance costs. The root player's utility is the social welfare of the whole club. In our experiments, we set $\kappa_{2,i} = \kappa_{3,j} = 0.5$ for all $i, j$. The other parameters of the public-good utility are set to $a_i = 0$, $b_i = 1$, and $c_i = 6$ for all $i$. We use a learning rate of 0.1 in DBI and project the iterates onto $[0,1]$. To compare with DBI, we use BRD with discretization granularities 0.5, 0.2, 0.1, and 0.05, and with 2 and 3 best-response rounds. Let $a = (a_1, \ldots, a_n)$ be the attacker strategy, a probability distribution over which defender is attacked. In our experiments, $a$ is set to follow the logit distribution $\mathrm{softmax}(\lambda (1 - x_L))$, so that defenders with lower security investments are more likely to be attacked. Let $c_i(x_i) = c_i x_i$ be the cost of the security investment of defender $i$, which is assumed to be linear. The utility of defender $i$ then combines its expected loss under the attack distribution $a$ with its investment cost $c_i(x_i)$. We then build the (1, 3, 6) structured hierarchical game by introducing the non-compliance costs $C^{\mathrm{NC}}_i(x_i, x_{\mathrm{Pa}(i)}) = (x_i - x_{\mathrm{Pa}(i)})^2$. The utilities of the agents at the different levels are constructed in the same way as in the public-good game (Eqs. 8, 9, 10). In the experiments, we consider agents that almost obey their parent's recommendation ($\kappa = 0.1$) and somewhat selfish agents ($\kappa = 0.5$). The influence probabilities $q_{ji}$ are set to be equal across all agents, and the other parameters are set to $c_i = 0.2$ and $\lambda = 5$. We use a learning rate of 0.1 in DBI. To compare with DBI, we use BRD with discretization granularities 0.2, 0.1, and 0.05, and with 2 and 3 best-response rounds. Since the strategy profile has no natural upper bound, we cap it at $x_i \le 1.0$ based on the observed performance, and then search the strategy space $x \in [0,1]$ in the BRD execution.
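The attacker model and the per-defender cost terms defined above can be sketched as follows; the numerical values are illustrative only, and the full defender utility (which mirrors Eqs. 8–10 of the public-good game) is not reproduced here.

```python
# Hedged sketch of the attacker's logit distribution and the per-defender cost
# terms for a hypothetical (1, 3, 6) security-game instance.
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

lam, c = 5.0, 0.2                                     # lambda and linear investment cost rate
x_leaf = np.array([0.2, 0.9, 0.5, 0.7, 0.3, 0.6])     # leaf defenders' investments x_L (hypothetical)
x_parent = np.array([0.5, 0.5, 0.6, 0.6, 0.4, 0.4])   # their parents' recommendations (hypothetical)

a = softmax(lam * (1.0 - x_leaf))        # attack distribution: lower investment => more likely target
invest_cost = c * x_leaf                 # linear security-investment cost c_i(x_i) = c_i x_i
nc_cost = (x_leaf - x_parent) ** 2       # non-compliance cost (x_i - x_Pa(i))^2
print(a.round(3), invest_cost.round(3), nc_cost.round(3))
```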
Algorithm 2: Procedures for computing the $\varepsilon$ of a profile $x$, and an algorithm for computing an approximate SPE. Input: an SHG instance $\mathcal{G}$. Parameters: best-response iterations $T$.

procedure Search($x$, $(l,i)$):
  for each $x'_{l,i} \in \mathcal{X}_{l,i}$ do
    $x^{\text{re-eq}} \leftarrow$ Re-Eq($x : x_{l,i} \to x'_{l,i}$, $(l,i)$); record the resulting payoff of $(l,i)$
  return the best action $\hat{x}_{l,i}$, its payoff $\hat{u}_{l,i}$, and the corresponding $x^{\text{re-eq}}$
end procedure

procedure SHG_Solve($(l,i)$, $x$):
  for $t = 1, \ldots, T$ do
    for each $(l+1, j) \in \mathrm{Chd}(l,i)$ do
      compute $(l+1,j)$'s best response to $x^{t-1}$ via Search; replace the dimensions of $x^{t-1}$ belonging to $\mathrm{Des}(l+1,j)$ with those of the returned re-equilibrated profile
    $\varepsilon^{t}_{l+1} \leftarrow \max_j \varepsilon^{t}_{l+1,j}$
  $t^* \leftarrow \arg\min_t \varepsilon^{t}_{l+1}$
  return $x^{t^*}$, $\varepsilon^{t^*}_{l+1}$
end procedure

procedure Compute_$\varepsilon$($x$):
  for each $(l,i)$ do
    $\hat{x}_{l,i}, \hat{u}_{l,i}, x^{\text{re-eq}}_{(l,i)} \leftarrow$ Search($x$, $(l,i)$)
    $\varepsilon_{l,i} \leftarrow \hat{u}_{l,i} - u_{l,i}(x)$
  return $\max_{l,i} \varepsilon_{l,i}$
end procedure
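For concreteness, the following is a minimal Python sketch of these procedures for a simplified two-level case (one root and several leaf players with scalar actions on a shared grid). The payoff functions are placeholders rather than any of the game models above, and the full SHG_Solve additionally tracks the minimum-$\varepsilon$ profile across best-response iterations.

```python
# Simplified two-level sketch of the Search / Re-Eq / Compute_eps procedures.
import numpy as np

GRID = np.linspace(0.0, 1.0, 11)       # discretized action space X_{l,i}

def u_root(x_root, x_leaves):           # placeholder root payoff
    return -np.sum((x_leaves - x_root) ** 2)

def u_leaf(i, x_root, x_leaves):        # placeholder leaf payoffs
    return -(x_leaves[i] - 0.5) ** 2 - 0.5 * (x_leaves[i] - x_root) ** 2

def with_action(x_leaves, i, g):        # the profile x : x_i -> g
    y = x_leaves.copy()
    y[i] = g
    return y

def re_eq(x_root, x_leaves, rounds=20):
    """Re-Eq at the root: iterated best response among the leaves."""
    x = x_leaves.copy()
    for _ in range(rounds):
        for i in range(len(x)):
            x[i] = GRID[int(np.argmax([u_leaf(i, x_root, with_action(x, i, g)) for g in GRID]))]
    return x

def search_root(x_root, x_leaves):
    """Search for the root: try each grid action and re-equilibrate the leaves."""
    return max(((u_root(g, re_eq(g, x_leaves)), g) for g in GRID), key=lambda t: t[0])

def search_leaf(i, x_root, x_leaves):
    """Search for a leaf: a plain best response (leaves have no descendants)."""
    return max(((u_leaf(i, x_root, with_action(x_leaves, i, g)), g) for g in GRID),
               key=lambda t: t[0])

def compute_eps(x_root, x_leaves):
    """Compute_eps: maximum benefit of a unilateral (re-equilibrated) deviation.
    Assumes x_leaves is already re-equilibrated for the current root action."""
    eps = search_root(x_root, x_leaves)[0] - u_root(x_root, x_leaves)
    for i in range(len(x_leaves)):
        eps = max(eps, search_leaf(i, x_root, x_leaves)[0] - u_leaf(i, x_root, x_leaves))
    return eps

x_leaves = re_eq(0.5, np.zeros(3))
print(compute_eps(0.5, x_leaves))       # close to 0 if 0.5 is near the root's optimum
```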