Catastrophe by Design in Population Games: Destabilizing Wasteful Locked-in Technologies

Stefanos Leonardos, Iosif Sakos, Costas Courcoubetis, Georgios Piliouras

Date: 2020-07-25

Abstract: In multi-agent environments in which coordination is desirable, the history of play often causes lock-in at sub-optimal outcomes. Notoriously, technologies with a significant environmental footprint or high social cost persist despite the successful development of more environmentally friendly and/or socially efficient alternatives. The displacement of the status quo is hindered by entrenched economic interests and network effects. To exacerbate matters, the standard mechanism design approaches based on centralized authorities with the capacity to use preferential subsidies to effectively dictate system outcomes are not always applicable to modern decentralized economies. What other types of mechanisms are feasible? In this paper, we develop and analyze a mechanism that induces transitions from inefficient lock-ins to superior alternatives. This mechanism does not exogenously favor one option over another -- instead, the phase transition emerges endogenously via a standard evolutionary learning model, Q-learning, where agents trade off exploration and exploitation. Exerting the same transient influence on both the efficient and inefficient technologies encourages exploration and results in irreversible phase transitions and permanent stabilization of the efficient one. On a technical level, our work is based on bifurcation and catastrophe theory, a branch of mathematics that deals with changes in the number and stability properties of equilibria. Critically, our analysis is shown to be structurally robust to significant and even adversarially chosen perturbations to the parameters of both our game and our behavioral model.
Efficient and sustainable innovations often fail to reach the critical mass of adoption that is needed to supersede socially harmful and energy-wasteful status-quo technologies [49, 42, 15] due to positive network effects and lock-in costs [38]. This frequently observed deadlock, termed inefficient lock-in [40], is a recurrent theme in the economics literature. Recent (technological) examples include the transportation sector, where the existing infrastructure (network effect) for CO2-emitting vehicles hinders or, at least, slows down the adoption of greener alternatives (e.g., platooning systems, electric cars, or aircraft powered by renewable energy); updates to the Internet's infrastructure, e.g., the long-anticipated transition from IPv4 to IPv6, a problem also known as protocol ossification; and, notoriously, Proof of Work mining, a setting that has attracted widespread attention from academics, entrepreneurs, and policymakers [21, 25]. While this issue is not entirely new [3, 17, 58], both experimental and theoretical studies suggest that little is known about how to move out of a lock-in [18, 35, 30]. Surprisingly, in all these cases, the challenge of transitioning to an existing alternative is primarily not technological but rather game-theoretical in nature. Yet, game theory alone can only be used to understand whether a lock-in will occur: since most standard solution concepts are invariant to the history of play, they cannot offer much insight on whether such a lock-in can be overcome unless they are coupled with a proper dynamic, behavioral model [8]. In this respect, the need to develop a suitable behavioral model that combines adaptive learning with strategic behavior and that can make predictions in games with a history of lock-in has been receiving increasing attention [40].
In this paper, we address this problem from the perspective of mechanism design and focus on developing an economically feasible mechanism that provably allows a central planner, or more generally, an external authority with partial and only temporary influence over agents' behavior, to move the system out of the inefficient lock-in and lead it to a socially desirable outcome. From a theoretical perspective, we stylize this situation as a coordination or evolutionary game with multiple equilibria [20]. Accordingly, the population state (or equilibrium) in which the currently prevailing, yet wasteful, technology is used by all agents is evolutionary stable, and small perturbations, i.e., adopters of the alternative technology, are doomed to fail [43]. In turn, the externality that agents' actions impose on society is not quantified in their short-term utility function and hence does not shape their current decisions. This creates a deadlock: a situation, among many known in the social and economic sciences, in which selfish behavior stands at odds with the social good. Yet, mechanism design, the natural approach to attack this problem, is powerless in this setting. The inherently decentralized nature of contemporary economies and social interactions raises challenges that severely lessen the applicability of off-the-shelf solutions. The reasons are well-rooted in the structure of large population networks (which represent today's transborder societies) and the behavioral patterns of the participating agents. Specifically, mechanisms that consistently subsidize socially beneficial behavior are not economically feasible and would be subject to gaming in the long run. Even more of a showstopper is the fact that many top-down policies that treat one option differently from another have proven rather hard, if not outright impossible, to sustain in practice or have frequently failed to offset users' potential losses from prevailing network effects.
More importantly, standard expected utility models, the bedrock of classic mechanism design, are arguably too simplistic to model agents' behavior in practice for several reasons: hedging against volatility or extreme events, risk attitudes, collusion, politics/governance, to name but a few. It is thus important to develop solutions that are robust to more complex behavioral assumptions. Our model: To address this problem, we introduce an evolutionary game-theoretic model that captures agent behavior in decentralized systems with network effects. The agents' strategic interaction is stylized as a population game whose state updates are governed by a revision protocol that balances exploitation (expected utility maximization) and exploration of suboptimal alternatives (hedging) [29]. These two elements generate an evolutionary dynamic that can then be used to reason about agents' decision making in this setting [54]. The advantage of having both a game-theoretic model (Section 2) and a learning-theoretic model (Section 3) is that it allows us to formally argue about the stability of the equilibria of the game. For example, in the simplest possible game-theoretic model of competition between a prevalent, yet wasteful, technology (W) and a more efficient (both socially and individually) alternative (S), the utility of using the W (respectively, S) strategy increases linearly in the number of other agents that are using the same technology. This results in three types of fixed points: everyone using W, everyone using S, and a "mixed" population case at the exact split where both technologies are equally desirable/profitable. Intuitively, this mixed state is an unstable equilibrium, as a slight increase in the fraction of W (respectively, S) adopters is enough to break ties and encourage convergence to a monomorphic state.
However, to make this discussion concrete, we need to formally describe how a mixed population state (i.e., the W/S split) evolves over time. To model the adaptive behavior of the agents, we use the Boltzmann Q-learning dynamics [66, 61] (one of the most well-known models of evolutionary reinforcement learning, which can also be derived as a limit of the well-known Experience-Weighted Attraction behavioral game theory model [12, 13, 23]). The decision of each agent, or equivalently of each unit of investment, is whether to adopt technology W or technology S, given that an x ∈ [0, 1] fraction of the population adopts technology W. According to the Q-learning behavioral model, the agents update their actions by keeping track of the collective past performance, in particular, of a properly defined Q-score, of their available strategies (W and S). Roughly (see Section 2 for the formal specifications and notation), if we denote by u(W, x) and u(S, x) the utility of an investment unit from either of the two technologies W and S when the state of the population is (x, 1 − x), with x ∈ [0, 1] denoting the fraction of W investments, then the Q-learning dynamics are given by the following scheme:

ẋ = x(1 − x) [u(W, x) − u(S, x) − T ln(x/(1 − x))].   (1)

In a far-reaching twist, each agent's utility function is enhanced by an entropy term that is weighted by a parameter T, termed temperature or rationality. When T = 0, the dynamics are precisely the replicator dynamics [54, 47, 7] and they recover the Nash equilibria of the game, whereas large values of T lead to uniform randomization between the available options. Informally, low temperatures capture cool-headed agents that focus primarily on strategies with good historical performance, whereas high temperatures favor hedging and exploration. This interpretation suggests that T is a behavioral attribute of the population, which cannot be influenced by an external authority.
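To illustrate the role of the temperature, the following Python sketch (the Q-score values are hypothetical) evaluates the Boltzmann choice rule that underlies the dynamics: at T = 0 it reduces to pure exploitation of the historically better strategy, while large T approaches uniform exploration.

```python
import math

def boltzmann(q_w, q_s, T):
    """Probability of choosing W under the Boltzmann (softmax) rule.

    T -> 0 approaches pure exploitation (argmax of the Q-scores);
    large T approaches uniform randomization (pure exploration)."""
    if T == 0:  # pure exploitation: deterministic argmax, ties split evenly
        return 1.0 if q_w > q_s else (0.5 if q_w == q_s else 0.0)
    m = max(q_w, q_s)  # subtract the max for numerical stability
    e_w = math.exp((q_w - m) / T)
    e_s = math.exp((q_s - m) / T)
    return e_w / (e_w + e_s)

# Hypothetical Q-scores: W has performed slightly better historically.
q_w, q_s = 0.60, 0.55
print(boltzmann(q_w, q_s, 0.0))    # cool-headed agents: pick W outright -> 1.0
print(boltzmann(q_w, q_s, 100.0))  # high temperature: nearly uniform hedging
```

At intermediate temperatures the rule interpolates smoothly between these two extremes, which is precisely the exploration knob that the mechanism later manipulates.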
However, from equation (1), it can be seen that the application of a tax rate (or any other mechanism) that rescales agents' utilities is essentially equivalent to a change in the parameter T, i.e., to a change in the agents' rationality [67, 68]. This is precisely the intuition that we exploit here to prompt the transition from the evolutionary stable, yet inefficient, equilibrium of the locked-in technology W (cf. Proposition 1) to the adoption of the desirable technology S. Our solution: Catastrophe Design. Our main contribution is to formally prove that there exists a simple, robust, and transient catastrophe-based mechanism to destabilize the W equilibrium and enforce the S equilibrium. The proposed mechanism, while simple to implement, has provable guarantees that critically exploit the complex underlying geometric structure of the Q-learning dynamics, cf. Section 4. As a first step in the process, we provide a complete characterization of both the number and the stability of the equilibria of the Q-learning dynamics, also known as Quantal Response Equilibria (QRE) [41], in game-theoretic models of W/S competition for all values of T (Theorem 1). This is a question of independent interest, as it explores the possible limit behaviors of the system in the absence of any controlling mechanism. Then, we describe how simply raising taxation/temperature up to (slightly beyond) a critical value and then reducing it back down to zero results in convergence to the socially optimal S equilibrium (Theorem 2). The mechanism is formally described in Section 5. Here, we provide an informal statement for the reader's convenience. Theorem (Informal statement of Theorem 2). For any initial population state, there exists a finite sequence of control variable levels T = T_0 = 0, T_1, . . .
, T_n = 0 such that, through the following procedure (scale the control variable to T_i and wait until the system converges to a QRE), the system is going to converge to the desirable state x = 0, which corresponds to the energy-friendly technology S. In particular, the sequence of control levels starts from T_0 = 0 (no control), increases temporarily above a critical level, and is reset back to T_n = 0. Our approach is based on the combination of two observations, which can be visualized in Figure 1 (cf. Figure 4): First, the number and stability of equilibria (QRE) of the Q-learning dynamics are a function of T, i.e., of the trade-off level between exploration and exploitation. For T = 0, there exist three QRE, which are precisely the three described Nash equilibria. For slightly larger T > 0, we still have three QRE, but now, due to the exploration term, all three lie in the interior of the interval (0, 1). Finally, beyond some critical level of the control parameter, the number of QRE drops from three to one.

Figure 1: At phase 1, all resources are invested in W, and the control parameter is at 0. As the control parameter increases, the population moves along the red line on the QRE surface. At phase 2, the control parameter reaches the critical value or tipping point at which the two upper QRE merge into one, and at the very next moment, phase 3, at which the parameter T is increased slightly above the critical level, the system undergoes an abrupt transition at which the upper QRE vanishes. Between these two successive time points, the population state changes from 69% of investment in W right before the catastrophe to only 15% immediately after. After this point, the control parameter is reset to 0, and due to the resulting hysteresis effect [50], the population converges to the new equilibrium, which corresponds to the adoption of the (new) technology S.
In particular, at this critical level, the prevailing equilibrium fuses with the unstable mixed equilibrium and, at the very next moment, the now merged equilibria cancel each other out in a phenomenon known as a saddle-node bifurcation or catastrophe [60, 34]. The second observation is that we can effectively control the parameter T (e.g., increase it by a multiplicative scale α), since this is mathematically equivalent to scaling down the utility of the agents by the same factor α (up to time reparameterization in equation (1)). In policy terms, this can be interpreted as taxation of income (i.e., a multiplicative decrease of payoffs for all actions), which results in agent behavior that is less stringent about maximizing earnings at all costs. Informally, and taking this idea to its logical extreme, a taxation level of 99% (i.e., a very large T) would effectively render the agents indifferent about payoffs and make them choose actions at random. Putting these two observations together, by controlling T, we can control the resulting QRE and, thus, the resulting state of the system over time. More critically, when exceeding the critical level of the control parameter (tipping point), a phase transition or catastrophe occurs at which the state behavior changes abruptly (phases 2 and 3 in Figure 1). This is the critical step that disrupts the prevalent outcome and encodes a new memory to the system that moves the population into the attracting region of the desirable equilibrium. At this point, the level of adoption of the new technology, or critical mass, is typically well below a simple majority (50%).
Finally, we leverage (in principle at least) this saddle-node bifurcation (which eliminates the inefficient equilibria) to create irreversible phase transitions such that, even when the controlling parameter T returns to its initial state T = 0, the state of the system is not the original undesirable stable state (x = 1), but the target S state, x = 0 (phase 4 in Figure 1). Critically, we stress-test our findings and establish that they are robust to modeling uncertainty/misspecifications across different axes, cf. Appendix A. In Appendix A.1, we allow (possibly adversarial) uncertainty/perturbations in the dynamics and the population game-theoretical model. For the reader's convenience, our main result is restated in the following theorem. Theorem (Informal statement of Theorem 3). Under state-dependent, bounded perturbations in the evolution of the Q-learning dynamics, the proposed catastrophe mechanism yields a transition from the inefficient to the efficient equilibrium, with the only difference that the population stabilizes in some neighborhood of the QRE of the unperturbed system in the intermediate phases. In Appendix A.2, we explore utility functions that capture strong superadditive effects on network valuation. These models introduce significant difficulties, as they lie outside the typical framework of evolutionary game theory with multi-linear utility functions. Again, our main result is informally stated below. Theorem (Informal statement of Theorem 4). For positive network effects of arbitrary intensity, i.e., for arbitrary non-linear payoff functions that yield superadditive value in the degree of adoption, the proposed catastrophe mechanism yields a transition from the inefficient to the efficient equilibrium. The findings of the robustness analysis justify the selection of our baseline modeling assumptions, made with an eye towards simplicity, primarily for expositional purposes.
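The full mechanism can be sketched numerically. The script below is an illustrative forward-Euler integration of the dynamics for the normalized α = 2 game of Section 2 (u(W, x) = x − γ, u(S, x) = 1 − x); the particular γ, temperature levels, step size, and horizon are assumptions chosen for this demonstration, not values from the paper.

```python
import math

def step(x, T, gamma, dt=1e-3):
    """One forward-Euler step of x' = x(1-x)[u(W,x) - u(S,x) - T ln(x/(1-x))]
    for the normalized alpha = 2 game: u(W,x) = x - gamma, u(S,x) = 1 - x."""
    drift = 2 * x - 1 - gamma
    if T > 0:
        drift -= T * math.log(x / (1 - x))
    x = x + dt * x * (1 - x) * drift
    return min(max(x, 1e-9), 1 - 1e-9)  # clip so the log stays defined

def settle(x, T, gamma, steps=100_000):
    """Hold the control level T fixed and let the population converge."""
    for _ in range(steps):
        x = step(x, T, gamma)
    return x

gamma = 0.2
x = 0.999                    # phase 1: locked in at the wasteful technology W
x = settle(x, 0.20, gamma)   # raise T below the tipping point: upper QRE survives
assert x > 0.5               # ... most capital is still invested in W
x = settle(x, 0.45, gamma)   # exceed the tipping point: the upper QRE vanishes
x = settle(x, 0.00, gamma)   # reset the control to 0; hysteresis does the rest
print(round(x, 3))           # -> 0.0, i.e., full adoption of S
```

Note the irreversibility: once the control is reset to 0, the population sits in the attracting region of x = 0 and stays there, exactly the hysteresis effect described above.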
In short, the main takeaway from the two robustness theorems is that our results extend directly to a wide range of modeling designs and approaches. Technically, while the number of equilibria and the stability properties of the dynamics may change under these perturbed conditions, the resulting control mechanism can be applied without any significant changes. Other related work. While applications of catastrophe theory in the field of economics have received mixed reactions in the past, not least due to their original hype [51], the formal connections between bifurcations, hysteresis effects, and coordination games have started to gain increasing attention [63, 31, 50]. The idea of leveraging these complex dynamics in the arsenal of mechanism design has been proposed only recently [67, 68], yet its formal treatment remains unexplored. By applying these concepts in a real options setting and bringing forth theoretically provable properties concerning a market's network effects and the adoption rate of competing technologies, our work extends previous studies that identify and reason about these connections mainly via experiments [5, 38, 9, 59]. Further supporting the scarcity of a formal approach, [4] emphasizes the increasing complexity of economic and social interactions and argues for the need to join forces and develop tools from complexity theory as a complement to existing economic modeling approaches. Focusing on physical applications, the main field where catastrophe theory is studied, [57] uses the fold catastrophe model to identify early warning signs of critical transitions in complex systems. In relation to mechanism design, such results can amplify the benefits of the catastrophe model as a tool to create, rather than prevent, critical transitions in the hands of policymakers, who can thus timely predict the effects of their policies. Outline. The rest of the paper is structured as follows.
In Sections 2 and 3, we provide the formal definitions of the game-theoretical and behavioral framework. Our working assumptions (which we relax in Appendix A) are presented in Section 2.1. Section 4 contains the technical work up to the design of the main catastrophe mechanism, which is presented and explained in Section 5. In Appendix A, we establish the robustness of the proposed mechanism under a broad set of assumptions. We conclude with an illustrative application in the context of blockchain mining in Appendix B. Technical proofs and accompanying materials are deferred to Appendices C and D. We consider a society or population p of acting agents or investors who form a continuum of mass K > 0. Here, K denotes the total available capital or resources that the agents are willing to invest and is expressed in monetary terms. The set of available actions or technologies is denoted by A = {W, S}, where W is the costly (wasteful) technology and S is the innovative (socially and individually preferable) technology. In particular, investing one monetary unit in technology W incurs a cost of γ > 0 to the investor, while investment in technology S incurs zero cost. The latter assumption ensures that (rational) agents may disregard potential alternatives and invest all available resources in either of the two technologies (there is no loss from doing so). Accordingly, the set of population states is X = [0, 1], where x ∈ [0, 1] denotes the fraction of agents (in terms of capital or resources) in population p that is choosing technology W. Thus, we may slightly abuse notation and refer to x ∈ [0, 1] as the population state. The payoff function, u : A × X → R, assigns to each population state a vector of payoffs, one for each strategy in A. We assume that the total value created by each technology {W, S} in A depends on the fraction xK of capital that has been invested in that technology via a parameter α > 0, and that the total value is distributed evenly among all invested units.
We assume that either technology can generate an aggregate value V > K if fully adopted by the population. In particular, the values V(W, x) and V(S, x) created by technologies W and S depend on the population state x ∈ [0, 1] via the relationships

V(W, x) = V · x^α,   V(S, x) = V · (1 − x)^α.   (2)

Different values of α give rise to different network effects or externalities. In particular, α < 1 implies subadditive value (it is optimal for the population to split), α = 1 implies linear value, and α > 1 implies superadditive value, i.e., the population is better off if it fully adopts either of the two technologies. Combining the above, the payoff of each strategy in A = {W, S} is given by

u(W, x) = V(W, x)/(xK) − γ = (V/K) x^(α−1) − γ,   u(S, x) = V(S, x)/((1 − x)K) = (V/K) (1 − x)^(α−1).   (3)

Hence, the average payoff obtained by the members of the population at state x ∈ [0, 1] is equal to

ū(x) = x · u(W, x) + (1 − x) · u(S, x),   (4)

and the aggregate payoff that is achieved by the population as a whole is u_A(x) = K ū(x). The cost Kγx is paid by the population as a whole and hence captures the negative externality (or cost) of the undesirable technology. To study instances with positive network externalities (or direct network effects [38]) in which full adoption of one of the technologies is preferable, our main focus will be the case α ≥ 1. For expositional purposes, we will restrict our attention to the case α = 2, but all arguments essentially carry over to any α > 1 (and to the trivial case α = 1), as we show in Appendix A.2. Specifically, for α = 2, eqs. (3) and (4) become linear in x,

u(W, x) = (V/K) x − γ,   u(S, x) = (V/K)(1 − x),   (5)

which allows for an equivalent, yet more intuitive, interpretation of the agents' interaction as a single population evolutionary game, cf. [28]. By substituting x = 1 (action W for the whole population) and x = 0 (action S) in (5), we can represent the game by the matrix

         W          S
  W   V/K − γ      −γ
  S      0         V/K      (G1)

Since V > K > γ > 0, we may henceforth normalize V/K to 1 and write γ → γ/(V/K), with γ ∈ (0, 1). The equilibria of the resulting game are characterized next. The proof is standard and is presented for completeness in Appendix C. Proposition 1.
The payoff functions in (5) describe a single population, evolutionary game with three Nash equilibria: the two pure equilibria, x_1 = 0 and x_3 = 1, are evolutionary stable, while the fully mixed equilibrium, x_2 = (1 + γ)/2, is not. Equilibrium x_1, in which the desirable technology is fully adopted, is strictly payoff and risk dominant. This formulation provides a theoretical explanation of the reasons why the more efficient technology may not succeed. In particular, starting from the prevailing equilibrium of the wasteful technology, evolutionary stability implies that even if a small part of the population adopts the new technology, the system will not move away from the current equilibrium. Hence, although payoff and risk dominated by the more efficient technology, the W (or equivalently, x = 1) equilibrium persists. This creates the lock-in at the wasteful technology. To formally reason about potential mechanisms to disrupt this situation, we first need to develop a proper adaptive learning model that accurately describes the updates of the population states. The adoption of new technologies is a gradual process in which concerned individuals adjust their actions, i.e., the distribution of their investments between old and new technologies, via repeated interaction with their environment. In the presence of network effects, each agent's payoff critically depends on the constantly evolving population state or, equivalently, on the fractions x, 1 − x of the population that adopt either of the two technologies, cf. equation (3). In this context, [54] provides several alternative microfoundations of the standard replicator dynamics under the term revision protocols.
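The claims of Proposition 1 about the monomorphic equilibria can be checked directly on the normalized 2×2 matrix. The snippet below (with an illustrative γ) verifies the indifference point and the payoff and risk dominance of the all-S equilibrium, using the standard 2×2 deviation-loss test for risk dominance.

```python
# Normalized alpha = 2 game of Section 2: u(W,x) = x - gamma, u(S,x) = 1 - x,
# with an illustrative cost parameter gamma in (0, 1).
gamma = 0.3
uw = lambda x: x - gamma   # payoff of the wasteful technology W
us = lambda x: 1.0 - x     # payoff of the efficient technology S

# Payoff matrix entries from the monomorphic states x = 1 (all-W), x = 0 (all-S):
A = {("W", "W"): uw(1.0), ("W", "S"): uw(0.0),
     ("S", "W"): us(1.0), ("S", "S"): us(0.0)}

# The fully mixed Nash equilibrium is the indifference point u(W,x) = u(S,x):
x2 = (1.0 + gamma) / 2.0
assert abs(uw(x2) - us(x2)) < 1e-12

# Payoff dominance: all-S pays strictly more than all-W (1 > 1 - gamma).
assert A[("S", "S")] > A[("W", "W")]

# Risk dominance (symmetric 2x2 test): S risk-dominates W iff the loss from
# unilaterally deviating at all-S exceeds the corresponding loss at all-W.
loss_at_S = A[("S", "S")] - A[("W", "S")]   # 1 + gamma
loss_at_W = A[("W", "W")] - A[("S", "W")]   # 1 - gamma
assert loss_at_S > loss_at_W
print("all-S is payoff and risk dominant; mixed equilibrium at x2 =", x2)
```

Since the deviation losses are 1 + γ versus 1 − γ, both dominance notions favor S for every γ ∈ (0, 1), which is why the persistence of the W lock-in is a genuinely dynamic, not static, phenomenon.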
To improve upon the sub-optimal outcomes that are often reached by the greedy and myopic updates of the replicator dynamics [22] and, importantly, to more accurately model agents' strategic deviations from expected utility maximization in risky choices under uncertainty (as in the present context), more elaborate models of adaptive learning have been proposed [12, 13]. Specifically, the effects of exploration and exploitation in coordination games with path-dependence (history of play) and network effects are commonly examined via a smooth variant of the Q-learning dynamics. Its connection with game-theoretical settings has been recently elaborated [63, 56, 55, 67, 31] and provides the building block for our subsequent analysis. We consider a homogeneous population in which each agent is adapting their strategy by repeatedly interacting with their environment (the rest of the population). The critical parameter that we need to track (and update) is the fraction x ∈ [0, 1] of the population that invests in the costly (and currently prevailing) technology W. To do this, we focus on the standpoint of a single agent (or unit of investment) and describe the system via the Q-learning dynamics, which are based on the principle of Q-learning [66, 12]. The connection to population games that is described here closely follows [63, 31] and [68] without further reference. Q-values: At each time t ≥ 0, the learning agent assigns a value Q_t(j) to each strategy j ∈ {W, S}, which is updated according to

Q_{t+1}(j) = (1 − δ) Q_t(j) + δ u(j, x_t),

where δ > 0 is the learning rate and u(j, x_t) is the reward from selecting strategy j ∈ {W, S} (as given by equations (3)) when the distribution of the population is x_t ∈ [0, 1]. Using the Q-values, the critical decision for each learning agent is the update of her choice distribution.
To avoid suboptimal results from greedy updating, i.e., selection of the strategy with the highest Q-value, the agents incorporate in their decision problem an entropy term that rewards exploration of the whole action space (both technologies). In particular, we assume that each agent selects their strategy x_t ∈ [0, 1] at time point t ≥ 0 as the (unique) solution of the convex optimization problem

max_{x ∈ [0, 1]} { x Q_t(W) + (1 − x) Q_t(S) − T [x ln x + (1 − x) ln(1 − x)] },   (S1)

where T ≥ 0 is the control parameter that tunes the exploration rate. In particular, for T = 0, the agent selects the action with the highest Q-value (pure exploitation), whereas for T → ∞, the agent randomizes between the two available actions, i.e., investments in technologies W and S (pure exploration). The decision rule or revision protocol [54] in (S1) yields the choice distribution

x_t = e^{Q_t(W)/T} / (e^{Q_t(W)/T} + e^{Q_t(S)/T}),

which is known as the Boltzmann distribution. With a slight abuse of notation, x_t denotes both the learning agent's choice distribution and the state of the population. However, under the assumption that all agents are symmetric and that they are learning concurrently, both these notions are equivalent. The population game in (G1), together with the revision protocol in (S1), determine the evolutionary population dynamics. In particular, if we take the time interval between updates to be infinitely small, this sequential joint learning process can be well approximated, after rescaling the time horizon to t → δt/T, by the continuous-time dynamics

ẋ_j = x_j [u(j, x) − Σ_k x_k u(k, x)] − T x_j [ln x_j − Σ_k x_k ln x_k],   j ∈ {W, S},   (8)

(where x_W := x and x_S := 1 − x), which is the desired expression of the dynamics in terms of population states x ∈ [0, 1] (rather than Q-values). For any given T ≥ 0, the steady states of the system in (8) are the values of x ∈ (0, 1) for which the expression on the right side becomes zero. As shown in [31], these are precisely the Quantal Response Equilibria (QRE) of the underlying population game [41].
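A minimal discrete-time sketch of this learning process is given below. The exponential-averaging form of the Q-update and the parameter values (γ, δ, T, number of rounds) are illustrative assumptions; the population state is identified with the Boltzmann choice probability of W, following the symmetric-agents argument above.

```python
import math

def simulate(x0, T, gamma=0.2, delta=0.1, rounds=5000):
    """Stateless Q-learning with Boltzmann choice for the normalized
    alpha = 2 game: u(W,x) = x - gamma, u(S,x) = 1 - x."""
    x = x0
    q_w, q_s = x - gamma, 1.0 - x        # initialize Q-scores at current payoffs
    for _ in range(rounds):
        # Q-update: exponential average of each strategy's realized payoff
        q_w += delta * ((x - gamma) - q_w)
        q_s += delta * ((1.0 - x) - q_s)
        # Boltzmann choice distribution = next population state
        e_w, e_s = math.exp(q_w / T), math.exp(q_s / T)
        x = e_w / (e_w + e_s)
    return x

# Mild exploration: starting near the W lock-in, the process settles on the
# upper-branch QRE (most capital remains in W).
print(round(simulate(0.95, T=0.2), 3))
# Exploration above the critical level: only the low-x QRE survives.
print(round(simulate(0.95, T=0.6), 3))
```

The fixed points of this discrete map satisfy x = e^{u(W,x)/T}/(e^{u(W,x)/T} + e^{u(S,x)/T}), i.e., exactly the QRE condition discussed next.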
Assuming the standard logit form, for any T ≥ 0, the QRE are defined as all points x*(T) ∈ [0, 1] that satisfy the equation

x* = e^{u(W, x*)/T} / (e^{u(W, x*)/T} + e^{u(S, x*)/T}).

Importantly, as shown in [68], in coordination games with 2 strategies, starting from any interior point x ∈ (0, 1), the Q-learning dynamics converge to interior rest points (QRE) for any T ≥ 0. Our main task is to understand how an external designer can influence the state of the population, i.e., the distribution of the aggregate investment capital across the two different technologies, by modifying the control parameter T ≥ 0. Before formally discussing the proposed mechanism in Section 5, we first need to reason about the solutions (steady states) and stability properties of the dynamics in equation (8) for all different values of the control parameter T ≥ 0. We start with an observation that immediately follows from (3) and (8). Lemma 1. For a given cost parameter γ ∈ (0, 1), consider the game described by (G1) and the revision protocol in (S1). Then, the updates in the fraction of investments in the wasteful technology W, i.e., the population state x ∈ [0, 1], are governed by the continuous-time dynamics

ẋ = x(1 − x) [2x − 1 − γ − T ln(x/(1 − x))]   (10)

for any value T ≥ 0 of the control parameter. When determining the steady states (QRE) and stability properties of equation (10) for different T ≥ 0, it is more intuitive to treat the instance T = 0 separately. In this case, the system reduces to the well-known replicator dynamics, and its steady states are precisely the Nash equilibria of the respective evolutionary game in (G1). Concerning stability, the one-dimensional differential equation in (10) describes a vector field on the line and specifies the velocity ẋ at each x ∈ [0, 1].
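The sign test for stability on the line can be illustrated numerically for the T = 0 case, assuming the normalized α = 2 payoffs u(W, x) = x − γ and u(S, x) = 1 − x of Section 2 (the value of γ is illustrative):

```python
def xdot(x, gamma):
    """T = 0 velocity field (replicator dynamics):
    x' = x(1-x)[u(W,x) - u(S,x)] = x(1-x)(2x - 1 - gamma)."""
    return x * (1.0 - x) * (2.0 * x - 1.0 - gamma)

gamma = 0.5                   # illustrative cost parameter
x2 = (1.0 + gamma) / 2.0      # interior rest point
eps = 1e-3

# All three rest points have (numerically) zero velocity:
assert xdot(0.0, gamma) == 0.0 == xdot(1.0, gamma) and abs(xdot(x2, gamma)) < 1e-12
# The flow points away from x2 on both sides (unstable) ...
assert xdot(x2 - eps, gamma) < 0 < xdot(x2 + eps, gamma)
# ... and towards the boundary rest points x1 = 0 and x3 = 1 (both stable).
assert xdot(eps, gamma) < 0 and xdot(1.0 - eps, gamma) > 0
print("x1 = 0 and x3 = 1 are stable; x2 =", x2, "is unstable")
```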
Accordingly, a fixed point, i.e., an x_0 ∈ [0, 1] such that ẋ = 0 at x_0, is stable if the flow points towards it, i.e., if ẋ > 0 for any x < x_0 and ẋ < 0 for any x > x_0 such that |x − x_0| < ε, for some ε > 0, i.e., the property needs to be satisfied in a (strictly positive) neighborhood around x_0. The neighborhood of x_0 for which this holds is called its attracting region [47]. If such an attracting region does not exist, the fixed point is called unstable [60]. Proposition 2. For T = 0, the steady states of the Q-learning dynamics in equation (10) are x_1 = 0, x_2 = (1 + γ)/2, and x_3 = 1. The steady states on the boundary, i.e., x_1 and x_3, are stable, whereas x_2 is unstable. Proof. For T = 0, the dynamics in (10) reduce to the replicator equation ẋ = x(1 − x)(2x − 1 − γ), and the first claim follows trivially. Concerning their stability, observe that ẋ < 0 for any x ∈ (0, (1 + γ)/2) and ẋ > 0 for x ∈ ((1 + γ)/2, 1). Hence, starting from any point other than x = (1 + γ)/2, the system will converge to the boundary steady states, i.e., to x_1 = 0 for any initial starting point x < x_2 and to x_3 = 1 for any initial starting point x > x_2, which completes the proof. The steady states and their stability properties for T = 0 are illustrated in Figure 3a. To proceed with the general case, T > 0, we restrict to interior population states, i.e., x ∈ (0, 1), so that ln(x/(1 − x)) is well defined. For x ∈ (0, 1), the term x(1 − x) is always strictly positive and hence, the fixed points (QRE) and velocity of the dynamics in equation (10) are fully determined by the term 2x − 1 − γ − T ln(x/(1 − x)). Hence, it will be convenient to introduce the following notation. Notation 1. For given cost and control parameters, γ ∈ (0, 1) and T ≥ 0, let

f(x; T, γ) := 2x − 1 − γ − T ln(x/(1 − x)).   (11)

Whenever obvious from the context, we will simplify notation and write f(x).
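With f in hand, the number of QRE at any (T, γ) can be found numerically by scanning for sign changes and bisecting. The sketch below uses f(x; T, γ) = 2x − 1 − γ − T ln(x/(1 − x)) for the normalized α = 2 game; the grid resolution and parameter values are illustrative.

```python
import math

def f(x, T, gamma):
    """Sign-determining term of the dynamics: interior QRE are its roots."""
    return 2.0 * x - 1.0 - gamma - T * math.log(x / (1.0 - x))

def qre(T, gamma, grid=10_000):
    """All interior QRE, located by bisecting f on sign-changing grid cells."""
    roots, eps = [], 1e-9
    xs = [eps + (1.0 - 2.0 * eps) * i / grid for i in range(grid + 1)]
    for a, b in zip(xs, xs[1:]):
        if f(a, T, gamma) * f(b, T, gamma) <= 0:
            for _ in range(60):                  # plain bisection
                m = 0.5 * (a + b)
                if f(a, T, gamma) * f(m, T, gamma) <= 0:
                    b = m
                else:
                    a = m
            roots.append(0.5 * (a + b))
    return roots

gamma = 0.2
print(len(qre(0.10, gamma)))  # below the critical level: 3 QRE
print(len(qre(0.45, gamma)))  # above the critical level: a unique QRE
```

Sweeping T in this way traces out the QRE correspondence discussed next.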
Keeping in mind that T is viewed as a control parameter that may vary over time (in response to the actions of some exogenous actor), a two-dimensional visualization of the geometric locus of all points x* ∈ (0, 1) such that f(x*; T, γ) = 0, i.e., of the QRE correspondence, for selected fixed values of γ and variable T ≥ 0 is given in Figure 2a. Figure 2b shows the QRE correspondence in a three-dimensional space, where both γ and T are treated as variables, and Figure 2c shows its projection on the cost-control, (γ, T), parameter space.

Figure 2: Panel (b) shows the QRE correspondence for all possible values of γ ∈ (0, 1). The (T, x*)-slices at the γ = 0.02, 0.2, and 0.8 levels correspond precisely to the blue, red, and green lines in panel (a). Panel (c) shows the projection of the QRE correspondence on the (γ, T) plane. Darker areas correspond to parameter values with multiple (three) QRE. The critical values T_c(γ), at which the phase transitions from one to three QRE occur, correspond to the middle boundary that separates the dark and light regions.

The main observation from Figure 2 is that for each γ ∈ (0, 1), there exists a unique critical value T_c(γ) of the control parameter such that the number of steady states (QRE) depends on whether T is less than, equal to, or larger than T_c(γ). In particular, for T < T_c(γ), there exist three QRE, which for T = 0 are precisely the Nash equilibria of the underlying game; for T = T_c(γ), i.e., at the transition point, there exist two QRE; and for T > T_c(γ), there remains only one QRE. In all cases, the critical temperature T_c lies in the interval (0, 1/2). These properties are formalized in Theorem 1. Before proceeding with its statement, however, it will be convenient to introduce the following notation. Notation 2. For given γ ∈ (0, 1) and T ≥ 0, let x_l(T) denote the smallest and x_u(T) the largest solution of f(x; T, γ) = 0 in (0, 1), whenever they exist. Whenever obvious from the context, we will omit the dependence on T and write x_l,u. Using the above, we can now formulate Theorem 1.
Theorem 1. For any value of the cost parameter γ ∈ (0, 1), there exists a unique critical value of the control parameter T_c(γ), with T_c(γ) ∈ (0, 1/2), so that the steady states of the population or, equivalently, the Quantal Response Equilibria (QRE) of the continuous-time Q-learning dynamics in (10), and their stability properties, are determined by the relative value of T in comparison to T_c(γ) as follows:
- for 0 < T < T_c(γ), there exist three QRE, x_1 ∈ (0, x_l), x_2 ∈ ((1 + γ)/2, x_u), and x_3 ∈ (x_u, 1), of which x_1 and x_3 are stable and x_2 is unstable;
- for T = T_c(γ), there exist two QRE, x_1 ∈ (0, x_l) and x_2 = x_u, of which x_1 is stable and x_2 is unstable;
- for T > T_c(γ), there exists a unique QRE, x_1, which is stable,
where x_{l,u} are given in Notation 2 whenever they exist, with x_l < 1/2 and x_u > (1 + γ)/2. In all cases, the steady state x_1 is stable. The proof of Theorem 1 relies on Lemmas 2 and 3, for all of which the reader is referred to Appendix C.1. The case T = 0, which was separately treated in Proposition 2, can be derived as a special case of the part T < T_c(γ) of Theorem 1, for which x_1 = 0, x_2 = (1 + γ)/2 and x_3 = 1. The stability considerations remain the same for the general case T > 0 and, as in Proposition 2, they are fully determined by the sign of ẋ or, equivalently, of f(x). Summing up, the statements of Theorem 1 and Proposition 2 are illustrated in Figure 3. For 0 < T ≤ T_c(γ) (panels (b) and (c)), it holds that x_u, x_2 > (1 + γ)/2. A typical instantiation is given in Figure 9 in Appendix C.1. Remark 1. From Theorem 1, it follows that for any γ ∈ (0, 1), there are no QRE in [1/2, (1 + γ)/2). Also, for any x ∈ (0, 1) \ [1/2, (1 + γ)/2), there exists precisely one T ≥ 0 such that this x is a QRE of (G1). To see this, we set f(x; T, γ) = 0 in (11) and solve for T as a function of x: T(x; γ) = (2x − 1 − γ)/ln(x/(1 − x)). For any x ∈ [1/2, (1 + γ)/2), the expression on the right is negative, which implies that there exists no T ≥ 0 so that this x is a QRE of (G1). For any other x ∈ (0, 1) \ [1/2, (1 + γ)/2), the expression on the right is positive (or equal to zero for x = (1 + γ)/2) and yields a unique T ≥ 0 for which this x is a QRE. The points x = 0 and x = 1 are QRE (equivalently, Nash equilibria) for T = 0. Finally, as can be seen in Figure 2b, the critical level T_c(γ) is lower for larger values of γ.
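The critical value of Theorem 1 can be computed numerically. A minimal sketch, assuming the reconstructed forms f(x; T, γ) = 2x − 1 − γ − T ln(x/(1 − x)) and g_γ(u) = u − γ − ((1 − u²)/2) ln((1 + u)/(1 − u)) (our reading of the auxiliary function used in Appendix C.1), finds T_c(γ) by bisection and counts the QRE on either side of it:

```python
import math

def g(u, gamma):
    # g_gamma(u) = f(x_u; T, gamma) with x_u = (1 + u)/2, T = (1 - u^2)/2
    return u - gamma - 0.5 * (1 - u * u) * math.log((1 + u) / (1 - u))

def T_critical(gamma, tol=1e-12):
    """Unique root u_c of g in (0, 1) by bisection; then T_c = (1 - u_c^2)/2."""
    lo, hi = 1e-9, 1 - 1e-9
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid, gamma) < 0:
            lo = mid
        else:
            hi = mid
    u_c = (lo + hi) / 2
    return (1 - u_c * u_c) / 2

def count_qre(T, gamma, grid=200_000):
    """Count the sign changes of f(x; T, gamma) on a fine grid of (0, 1)."""
    def f(x):
        return 2 * x - 1 - gamma - T * math.log(x / (1 - x))
    xs = [(i + 0.5) / grid for i in range(grid)]
    signs = [f(x) > 0 for x in xs]
    return sum(signs[i] != signs[i + 1] for i in range(grid - 1))

gamma = 0.185
Tc = T_critical(gamma)
print(round(Tc, 3))             # close to 0.3 for gamma = 0.185
print(count_qre(0.25, gamma))   # T < T_c: three QRE
print(count_qre(1.00, gamma))   # T > T_c: one QRE
print(T_critical(0.5) < T_critical(0.2) < T_critical(0.05))  # T_c decreasing in gamma
```

The last line also illustrates the monotonicity of T_c(γ) in γ discussed below.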
Intuitively, the control that needs to be exercised on the system to reach the tipping point, at which the number of QRE of the population changes, is lower for larger differences - as expressed by the parameter γ - between the costs of the two technologies. This intuitive property is formally shown in Corollary 1. Corollary 1. The critical level T_c(γ) is decreasing in γ ∈ (0, 1). Theorem 1 and Corollary 1 provide a static view - in terms of the control parameter T - of the QRE and stability properties of the dynamical system in equation (10). From a mechanism design perspective, we are interested in the behavior of the system as T varies according to the control of the central planner. This is discussed next. Assuming that the population faces a lock-in at the inefficient state, x = 1, at which the wasteful and/or costly technology is fully adopted, we can now leverage the stability properties of the behavioral model (Q-learning dynamics) of Theorem 1 to design a mechanism that allows the population to overcome this lock-in and adopt the efficient technology, i.e., to converge to the equilibrium x = 0. The mechanism is described next and visualized in Figure 4 (cf. Figure 1 in Section 1 and Figure 10 in Appendix C.1). Along with the technical details that emerge along the way, the mechanism is formalized in Theorem 2. Phase 1: Lock-in for T = 0 and T < T_c(γ). Initially, the population rests at (or close to) the x = 1 equilibrium, at which the whole population adopts the incumbent, wasteful technology. As the control parameter T is gradually increased (but remains below the critical value T_c(γ)), the population stabilizes in the interior QRE on the upper branch of the QRE correspondence (light color line). This QRE is stable, and its attracting region is separated from the stable lower QRE by the unstable middle equilibrium (cf. Figures 3a and 3b). Phase 2: Saddle-node bifurcation at T = T_c(γ).
As T reaches (from below) the critical value T_c(γ), the upper stable QRE and the middle unstable QRE collide and amalgamate into a single, unstable equilibrium, cf. Figure 3c. Intuitively, all this time, the mixed QRE served as a threshold between the attracting regions of the two stable QRE. Hence, although unstable, its existence was critical for the behavior of the system, and its amalgamation with the upper QRE has a critical impact on the dynamics of the system. Together with phase 3, at which the unstable equilibrium is annihilated, this creates a saddle-node bifurcation, also called a fold bifurcation or fold catastrophe, in the population dynamics [60, 34]. Phase 3: Critical transition for T > T_c(γ). The critical phase transition occurs when the control parameter exceeds (even marginally) the critical level T_c(γ). The population moves from a QRE with fraction x > (1 + γ)/2 of W adopters prior to the saddle-node bifurcation to a QRE with x < 1/2 immediately after the bifurcation. At this point, the unstable QRE that was generated by the amalgamation of the two upper QRE vanishes, and for any larger value of T, the system has only one QRE, which is the single attracting state of the population dynamics (cf. Figure 3d). The bifurcation that annihilates the amalgamated QRE and the subsequent stabilization of the dynamics in the single remaining QRE is the critical change of the system.
Fig. 4: Visualization of the combined catastrophe mechanism and hysteresis effect that move the population out of the lock-in on the inefficient technology W to the efficient technology S. The QRE surface is projected in time, and the path of the population is depicted by starting from a lighter and gradually moving to darker tones. The phases are described in detail in Section 5.
In response to a small change in the control parameter (from below to slightly above the critical level), the population moves (abruptly) out of a state (QRE) in the attracting region of the inefficient equilibrium x = 1 and into a QRE in the attracting region of the efficient equilibrium x = 0 for T = 0. Phase 4: Hysteresis effect as T → 0. The (permanent) impact of the bifurcation on the population dynamics is that it encodes a new memory into the population. In particular, as explained in the previous phase, for any value of T > T_c(γ), the system stabilizes at a QRE with fraction x < 1/2 of W adopters. Consequently, using this QRE as a starting point and reducing (gradually or even instantaneously) the control parameter T back to 0, the system provably converges to the efficient equilibrium, i.e., to the adoption of the efficient technology. This creates a hysteresis effect: when the control parameter T is back at its initial level, i.e., T = 0, at which no exogenous control is exercised on the system, the population equilibrates at a different state than its initial one. The above mechanism, termed the catastrophe mechanism, is formalized in Theorem 2, whose proof is given in Appendix C.1. Theorem 2 is stated for initial conditions x_0 ∈ [(1 + γ)/2, 1], i.e., for initial conditions that capture the lock-in at the inefficient technology. For x_0 ∈ [0, (1 + γ)/2), convergence to the x = 0 equilibrium is trivial. Theorem 2 (Catastrophe mechanism). For a fixed cost parameter γ ∈ (0, 1), consider the population game (G1) and the revision protocol (S1), which together lead to the population evolutionary dynamics ẋ in (10). Then, for any initial population state x_0 ∈ [(1 + γ)/2, 1], there exists a finite sequence T_0, T_1, . . .
, T_n of control parameter values, with T_0 = T_n = 0 and max_{i≤n} T_i > T_c(γ), so that the iterative procedure which, starting from phase i = 0, scales the control parameter T to T_i, allows the system to converge to a QRE, x*(T_i), and reiterates for phase i + 1, generates a sequence of population states (QRE) that converges to the risk-dominant equilibrium, x*(T_n) = 0, at which the population adopts the efficient technology. Remark 2. In theory, it suffices to use the sequence T_0 = 0, T_1 > T_c(γ), T_2 = 0. However, for practical applications, such abrupt changes of the control parameter T may not be possible. For instance, if T denotes a tax or temperature parameter, then only gradual or smooth changes in T may be permissible. In such cases, the sequence T_0, T_1, . . . , T_n may require additional steps, but as long as it satisfies the requirements of Theorem 2, the final outcome of the catastrophe mechanism still obtains.
Properties and Practical Implementation. From the above, it is immediate that the proposed mechanism has two properties that are desirable for practical applications. First, the critical population mass, x*(T_c(γ)), i.e., the fraction of adopters of the wasteful technology at the critical level T_c(γ), is strictly larger than the mixed equilibrium (1 + γ)/2 of the initial game. This means that, under the proposed mechanism, the bifurcation occurs while the wasteful technology is still adopted by the majority of the population. The deviation to the new technology by a minority of the population is sufficient to trigger adoption of the new technology by the whole population and to make the transition permanent. Indicatively, in the scenario depicted in Figure 4, the population transitions from 69% of W adopters immediately before the saddle-node bifurcation (or catastrophe) to 15% of W adopters immediately after.
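The full schedule of Theorem 2 can be replayed numerically. The sketch below again assumes the reconstructed drift f(x; T, γ) = 2x − 1 − γ − T ln(x/(1 − x)); the three-step schedule T: 0 → 0.45 → 0 and the value γ = 0.185 (for which T_c(γ) ≈ 0.3) are illustrative choices, not the authors' code. It exhibits the hysteresis: after T is reset to 0, the population rests near x = 0 instead of its initial lock-in near x = 1.

```python
import math

def drift(x, T, gamma):
    # reconstructed eq. (10):  dx/dt = x (1 - x) f(x; T, gamma)
    f = 2 * x - 1 - gamma - (T * math.log(x / (1 - x)) if T > 0 else 0.0)
    return x * (1 - x) * f

def relax(x, T, gamma, time=200.0, dt=1e-3):
    """Euler-integrate until the state settles at the QRE selected by T."""
    for _ in range(int(time / dt)):
        x = min(max(x + dt * drift(x, T, gamma), 1e-12), 1 - 1e-12)
    return x

gamma, x = 0.185, 0.99            # population locked in at the wasteful technology
history = []
for T in (0.0, 0.45, 0.0):        # raise T above T_c(gamma) ~ 0.3, then reset it
    x = relax(x, T, gamma)
    history.append((T, round(x, 3)))
print(history)  # lock-in persists at T = 0; T = 0.45 pulls x below 1/2; reset lands near 0
```

Note that the middle step already stabilizes the population well below 1/2, so the final step merely lets the unperturbed T = 0 dynamics finish the transition.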
Second, and more importantly from a social planner's perspective, the required influence on the control parameter needs to be only temporary. Yet, due to the resulting hysteresis, the effects of the mechanism are permanent. Assuming that the exercise of any influence on the system incurs a cost, this provides a desirable trade-off between the resources (expenses/time) required to implement such a mechanism and the duration of its impact. Finally, related to the previous property is the interpretation, in monetary terms, of the initial normalization of V K to 1, cf. equation (10). Since this normalization is equivalent to dividing equation (8) by V K, all instances of γ and T in the previous analysis should be interpreted as γ/V K and T/V K, respectively, for practical purposes. For instance, the threshold T_c(γ) < 1/2 for any γ ∈ (0, 1) in Theorem 1 implies that in applications with V K ≠ 1 (typically V K ≫ 1), the tipping point satisfies T_c < V K/2. An illustrative application of this mechanism, in which T is interpreted as taxation, is presented in Appendix B.
In this paper, we developed and studied a mechanism to destabilize lock-in at inefficient equilibria in (population) coordination games. While requiring only temporary influence on the population with limited implementation costs, the proposed mechanism achieves a permanent transition from a sub-optimal locked-in equilibrium state, e.g., the adoption of a status-quo, yet wasteful, technology, to a socially and individually preferable state, e.g., the adoption of an innovative and sustainable technology. As a first step, our methods expose the geometric features of the Q-learning dynamics in the widely studied model of population games. The transition from the inefficient to the efficient equilibrium is then achieved via an induced saddle-node bifurcation or controlled catastrophe in the learning dynamics.
The resulting mechanism is both simple to understand and easy to implement, yet its provable guarantees make critical use of the complex underlying geometry. Lock-in at inefficient equilibria is a long-standing problem with ample manifestations across the whole spectrum of the economic, social, natural, and behavioral sciences. However, little is known about how to overcome such lock-in in practical applications. Most existing studies report experimental findings [8, 9, 39] (and references therein), with only a few recent exceptions [68, 40]. As economies and social interactions become increasingly complex and unpredictable, the design of effective mechanisms, even in purely adversarial conditions (limited resources and ability to control the population), is more relevant than ever. Aiming to make progress in this direction, our work provides a (first, to our knowledge) provably efficient mechanism to overcome this lock-in and opens several directions for future work. Some of the most naturally arising questions concern the implementation of the current mechanism - or modifications thereof - in adversarial settings in which the population can control some parameter (here, that would be the parameter γ) in response to changes in the control parameter T, or under modified behavioral models. Testing the efficiency of the proposed mechanism in field or laboratory experiments is another promising direction for study. Thus far, our analysis and the proposed mechanism to move out of the inefficient lock-in rested on two (seemingly) restrictive assumptions. The first is that the updates in the population state are not subject to uncontrolled perturbations, and the second is that α = 2, i.e., that the parameter which expresses the intensity of network effects is such that the system can be represented as an evolutionary game, cf. Section 2.1.
For practical purposes, and more generally for systems with constantly changing or unknown characteristics, it is important to show that the efficiency of the suggested mechanism is not tied to these particular assumptions. It turns out that this is indeed the case, and our goal in this section is to test and prove this claim in two directions. First, we show that the proposed mechanism continues to apply under state-dependent perturbations of the Q-learning dynamics without significant impact on its actual implementation, cf. Appendix A.1. This holds for any reasonably bounded perturbation, in the sense that the model parameters remain in their admissible regions. Second, we study the robustness of the model for network effects of different intensity, as expressed by different values of the parameter α, cf. Equation (2). In this direction, we obtain a conservative (yet tight for extreme values of the parameters) theoretical bound on the critical value T_c(γ), i.e., on the control that needs to be exercised on the system to trigger the desired transition out of the attracting region of the initial technology. In practical terms, as α increases, a network split becomes more damaging and the population reaches the tipping point for increasingly lower values of T_c. In this respect, the case α = 2 that we treated in the previous part is one of the costliest to destabilize. Finally, even though the number of QRE in the attracting region of the efficient technology may change for large values of α, the mechanics of the proposed mechanism remain essentially unaffected. In general, small perturbations in a dynamical system can have major effects on its outcome, cf. [60, 48] or [46] among many others. To study the behavior of the Q-learning dynamics in equation (10) in response to such perturbations, we consider a state-dependent noise term ε(x). Similarly to equation (11), it will be convenient to introduce the following notation.
For fixed γ ∈ (0, 1) and T ≥ 0, let f_ε(x; T, γ) := f(x; T, γ) + ε(x), where f(x; T, γ) is defined in equation (11) and ε(x) : (0, 1) → R is a noise function that depends on the current population state and which, for all x ∈ (0, 1), satisfies |ε(x)| ≤ ε_0 for some constant ε_0 ∈ (0, min{γ, 1 − γ}). Whenever obvious from the context, we will omit the dependence on T, γ and write f_ε(x). The bound ε_0 is naturally imposed by the requirement that the game parameter, γ, should remain within its admissible region. In particular, the bound ε_0 ∈ (0, min{γ, 1 − γ}) on |ε(x)| ensures that the perturbation (noise) is not large enough to change the nature of the system in favor of the originally undesirable equilibrium. Turning to the dynamics of the ε(x)-perturbed system, the number of interior steady states of the dynamics, as expressed by the roots of f_ε(x; T, γ), may change significantly in comparison to the unperturbed system. To see this, assume that x* is a steady state of the unperturbed system for some T > 0, so that f(x*; T, γ) = 0. Then f_ε(x*; T, γ) = ε(x*) and hence, there exists a local neighborhood around x* in which the behavior of the system is essentially determined by the noise term. Accordingly, ẋ may have an arbitrary number of sign changes in this neighborhood, depending on the exact shape of ε(x), and hence it is not possible to argue about the exact state of the system within this region. However, we can still reason about the stability of the system outside these (bounded) neighborhoods in the same fashion as we did for the unperturbed system. In fact, the stability analysis, cf. Theorem 1 and Figure 3, carries over, with the only difference that the system will now stabilize within a neighborhood of the initial QRE, i.e., of the QRE of the unperturbed system. This allows us to establish equivalent convergence guarantees and retain the desirable effect of the proposed catastrophe mechanism.
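The convergence guarantee under bounded perturbations can also be illustrated numerically. In the sketch below, the perturbation ε(x) = ε_0 sin(40x) with ε_0 = 0.2 < min{γ, 1 − γ} for γ = 0.3, as well as the schedule itself, are our own illustrative choices; the drift is the same reconstruction as before, with the noise added inside the bracket. Raising T above the conservative threshold T_c(γ − ε_0) still lands the population in the attracting region of x = 0:

```python
import math

def perturbed_drift(x, T, gamma, eps0=0.2):
    noise = eps0 * math.sin(40 * x)   # bounded: |eps(x)| <= eps0 < min{gamma, 1-gamma}
    f = 2 * x - 1 - gamma - (T * math.log(x / (1 - x)) if T > 0 else 0.0)
    return x * (1 - x) * (f + noise)

def relax(x, T, gamma, time=300.0, dt=1e-3):
    """Euler-integrate the perturbed dynamics until the state settles."""
    for _ in range(int(time / dt)):
        x = min(max(x + dt * perturbed_drift(x, T, gamma), 1e-9), 1 - 1e-9)
    return x

gamma, x = 0.3, 0.95
for T in (0.0, 0.45, 0.0):   # 0.45 exceeds the conservative threshold T_c(gamma - eps0)
    x = relax(x, T, gamma)
print(round(x, 4))           # back at T = 0, the population has escaped the lock-in
```

The intermediate state is only pinned down up to a noise-dependent neighborhood, but any rest point below 1/2 suffices for the final step.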
This is formalized in Theorem 3, whose proof can be found in Appendix D.1. Theorem 3 (Catastrophe mechanism under bounded perturbations). For a fixed cost parameter γ ∈ (0, 1), consider the population game (G1), the revision protocol (S1) and the resulting perturbed dynamics in equation (12), where ε(x), with |ε(x)| ≤ ε_0 ∈ (0, min{γ, 1 − γ}), is a state-dependent noise term for all x ∈ (0, 1). Then, the catastrophe mechanism of Theorem 2 still applies. In particular, for any initial population state x_0 ∈ [(1 + γ)/2, 1], there exists a finite sequence T_0, T_1, . . . , T_n of control parameter values, with T_0 = T_n = 0 and max_{i≤n} T_i > T_c(γ − ε_0), so that the iterative procedure which, starting from phase i = 0, scales the control parameter T to T_i, allows the system to converge to a neighborhood of the QRE, x*(T_i), of the unperturbed system, and reiterates for phase i + 1, generates a sequence of population states (QRE) that converges to the risk-dominant equilibrium, x*(T_n) = 0, at which the population adopts the efficient technology. Theorem 3 suggests that the results of the unperturbed case largely carry over to the perturbed case. Intuitively, while the exact number of steady states in the perturbed system cannot be determined, the new steady states are all located in some bounded neighborhood of the old steady states. Outside these regions, the sign of ẋ remains the same as in the unperturbed case and hence, the dynamics converge to these regions in the same fashion that they converged to the exact steady states in the unperturbed case. Two illustrations - of bounded and unbounded state-dependent perturbations - are given in Figure 5. From the perspective of mechanism design, it is also meaningful to study the effects of the perturbation on the implementation of the proposed mechanism and, in particular, on the control that needs to be exercised on T. To quantify this change, we solve the equation f_ε(x) = 0 for T, cf.
Remark 1, and recover the control parameter as a function of x on the geometric locus of all QRE of the unperturbed system, T(x; γ) = (2x − 1 − γ)/ln(x/(1 − x)). Similarly, if T_ε(x; γ) denotes the value of the control parameter on the geometric locus of all QRE of the perturbed system, then T_ε(x; γ) = (2x − 1 − γ + ε(x))/ln(x/(1 − x)), (14) which implies that if some x ∈ (0, 1) is a QRE of both systems, then |T_ε(x; γ) − T(x; γ)| = |ε(x)|/|ln(x/(1 − x))| ≤ ε_0/|ln(x/(1 − x))|. (15) Equation (15) highlights an important property that comes from the inclusion of the entropy term in the dynamics. Namely, as x approaches the boundary for decreasing T > 0, the effect of any bounded perturbation on the control variable gets increasingly dominated by the entropy term. At T = 0, the system stabilizes again at x = 0. An illustration is given in Figure 5. Finally, for an alternative bound on the maximum additional control that needs to be exercised on T to trigger the critical phase transition, equation (14) and the monotonicity of the critical level T_c(γ) in γ, cf. Corollary 1, imply that T_c(γ; ε) ≤ T_c(γ − ε_0), where T_c(γ; ε) denotes the critical value in the perturbed system (cf. Theorem 3). In both panels of Figure 5, the largest deviations occur for population states close to 1/2, cf. equation (15), whereas for values of x close to the boundaries, the perturbed systems reduce to the original system (red line) as the perturbations get dominated by the entropy term. In the left panel, which satisfies the assumptions of Theorem 3, the proposed mechanism can be implemented, with the only difference that the dynamics can now converge to any blue dot (instead of the red line) that lies in a neighborhood of the original QRE (of the unperturbed system). By contrast, the perturbation in the right panel does not satisfy the assumptions, and the outcome of the mechanism is ambiguous, as the population may not transition to states below 1/2 even for large values of the control parameter T.
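The vanishing effect of bounded noise near the boundary, cf. equation (15), is easy to verify numerically. In this sketch, the control-parameter locus T(x; γ) = (2x − 1 − γ + ε)/ln(x/(1 − x)) follows the reconstruction above; the constant worst-case offset ε_0 = 0.1 and the value γ = 0.25 are illustrative choices:

```python
import math

def T_locus(x, gamma, eps=0.0):
    # control value that makes x a QRE: solves f(x; T, gamma) + eps = 0 for T
    return (2 * x - 1 - gamma + eps) / math.log(x / (1 - x))

gamma, eps0 = 0.25, 0.1
deviations = [abs(T_locus(x, gamma, eps0) - T_locus(x, gamma))
              for x in (0.9, 0.99, 0.999)]
print([round(d, 4) for d in deviations])   # shrinks as x approaches the boundary
```

Each deviation equals ε_0/|ln(x/(1 − x))| exactly, so the sequence decreases as x → 1, in line with the entropy-domination argument.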
Parameter α in Equation (2) captures the intensity of the network effects, i.e., the value that can be created by each technology as a function of its adoption by the population of investors. Values of α < 1 indicate subadditive value, i.e., that the population payoff is maximized when the network splits between the two technologies, and fall outside the present scope. Similarly, when α = 1, the degree of adoption does not affect the generated value, and the efficient technology constitutes a dominant strategy, which trivializes the resolution of the game. In the present context, we are interested in cases with superadditive value - or direct positive network effects - which are expressed by values of α > 1 (an illustration is provided in Remark 6). Thus far, we have assumed that α = 2 mainly for expositional purposes; in this part, however, we show that the results generalize essentially unaltered to any α > 1. Specifically, let α > 1. Then, by equations (3) and (8), we obtain a relationship which, after normalizing V K^{α−1} to 1 - or, equivalently, after dividing the equation by V K^{α−1} and setting γ → γ/V K^{α−1} and T → T/V K^{α−1} - yields the dynamics ẋ = x(1 − x)(x^{α−1} − (1 − x)^{α−1} − γ − T ln(x/(1 − x))). (16) The geometric locus of the steady states (QRE) of the dynamics in equation (16) is illustrated for various values of α > 1 and T ≥ 0 in Figure 6. As network effects become more intense, i.e., as α increases, splits of the population become increasingly detrimental for both aggregate and individual welfare. As a consequence, the system becomes more sensitive to the control T, and the inefficient equilibrium is easier to destabilize. This can be seen from the thinning upper component of the QRE surface in the left panel of Figure 6.
Moreover, for higher values of α, the lower component of the QRE correspondence develops an unstable part in the [0, 1/2] region (dashed part of the red and green lines in the left panel and horn-shaped dark area in the right panel), in which there exist three or, marginally, two QRE. However, all these QRE lie in the attracting region of x = 0 for T = 0 and hence, this instability does not compromise the outcome of the proposed catastrophe mechanism. In particular, whenever T ≥ 1/2, there exists a unique QRE x* ∈ (0, 1/2) (light area in the right panel). This implies that by increasing T to (at most) 1/2, the system can be stabilized in a state x* which lies in the attracting region of the desired equilibrium x = 0. Subsequently, the last phase of the proposed catastrophe mechanism in Theorem 2 - at which the control parameter is reset back to 0 - can be applied unaltered and still yields the desired outcome of convergence to the x = 0 equilibrium, independently of the intensity of the network effects, i.e., of the value of α > 1. This is formalized in Theorem 4. In the right panel of Figure 6, the bifurcations between one and three QRE occur at the boundary between dark and light areas, at which the system has two QRE. The parameter range with three QRE that develops for larger values of α can be seen as a dark, horn-shaped area in the upper left part of the graph. For certain parameter values, there exist more QRE both in the upper and in the lower component of the QRE surface (darker area at which the two initially dark areas overlap). This unstable area is inconsequential for the transition to the efficient technology, since it lies entirely in the attracting region of the x = 0 equilibrium, cf. Theorem 4. Finally, it can be seen that T_c(γ) is decreasing in α for values of α larger than approximately 2.5, and that the bound T ≥ 1/2 is (in all cases) particularly conservative.
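The T ≥ 1/2 bound can be probed numerically for a strong network effect. The sketch below assumes the reconstructed general-α drift term x^{α−1} − (1 − x)^{α−1} − γ − T ln(x/(1 − x)), which reduces to the α = 2 case used earlier; the parameter values (α = 4, γ = 0.3) are illustrative:

```python
import math

def f_alpha(x, T, gamma, alpha):
    # drift term of the general-alpha dynamics (our reconstruction of eq. (16))
    return x**(alpha - 1) - (1 - x)**(alpha - 1) - gamma - T * math.log(x / (1 - x))

def qre(T, gamma, alpha, grid=100_000):
    """Approximate the QRE as sign changes of f_alpha on a fine grid of (0, 1)."""
    xs = [(i + 0.5) / grid for i in range(grid)]
    signs = [f_alpha(x, T, gamma, alpha) > 0 for x in xs]
    return [xs[i] for i in range(grid - 1) if signs[i] != signs[i + 1]]

roots = qre(0.5, 0.3, 4.0)              # T = 1/2, strong network effects (alpha = 4)
print(len(roots), round(roots[0], 3))   # a single QRE, lying below 1/2
```

Setting T to 1/2 thus parks the system at a unique rest point inside the attracting region of x = 0, after which the control can be released.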
Theorem 4 (Catastrophe mechanism for arbitrary network effects). For a fixed cost parameter γ ∈ (0, 1), consider the population game defined by the payoff functions in equation (3) and the revision protocol (S1), which together lead to the population evolutionary dynamics ẋ in equation (16). Then, for any α > 1, the catastrophe mechanism of Theorem 2 still applies. In particular, for any initial population state x_0 ∈ [(1 + γ)/2, 1], there exists a finite sequence T_0, T_1, . . . , T_n of control parameter values, with T_0 = T_n = 0 and max_{i≤n} T_i ≥ 1/2 (> T_c(γ)), so that the iterative procedure which, starting from phase i = 0, scales the control parameter T to T_i, allows the system to converge to a QRE, x*(T_i), and reiterates for phase i + 1, generates a sequence of population states (QRE) that converges to the risk-dominant equilibrium, x*(T_n) = 0, at which the population adopts the efficient technology. Remark 4. As mentioned above, the only difference in comparison to Theorem 2 is that there exists a critical value α_c, so that for values α > α_c, there exist multiple QRE in the interval x ∈ (0, 1/2) for certain values of T < 1/2, cf. Figure 6. In practice, the bound T = 1/2 - required to obtain a unique QRE, x* ∈ (0, 1/2) - is extremely conservative, and essentially, it is only tight for the extreme cases of γ close to 0 and α close to 1 (not depicted here). Moreover, since convergence to any QRE x* ∈ (0, 1/2) suffices to move the system out of the inefficient lock-in, Theorem 4 can be equivalently stated with max_{i≤n} T_i ≥ T_c(γ) without compromising its outcome. From the perspective of policy-making, it is also worth noting that the case α = 2 that was used up to now corresponds to one of the costliest cases to treat in practice.
As can be seen in the right panel of Figure 6, the critical level T_c at which the bifurcation in the upper part of the QRE surface occurs - i.e., the point at which the two upper QRE merge and, immediately thereafter, disappear - is decreasing in α for values of α larger than approximately 2.5 (protrusion of the dark blue area in the bottom left corner). In particular, larger values of α capture technologies with more intense (positive) network effects. The more intense effects translate to smaller values of the tipping point T_c(γ), which, in turn, implies a lower cost to implement the proposed mechanism. Since the launch of Bitcoin (BTC) by the pseudonymous Satoshi Nakamoto, [45], Proof of Work (PoW) blockchains and their applications - most notably cryptocurrencies - have taken the world by storm. Widely considered a revolutionary technology, blockchains have attracted the attention of institutions, technology corporations, investors and academics. In these networks, self-interested miners ensure the proper functionality of the supported applications by exerting effort and receiving monetary incentives in return [36]. To participate, and without any centralized authority to grant permission, miners are required to provide evidence of some predefined, and typically scarce, resource, e.g., computational power that utilizes electricity in Proof of Work (PoW) protocols, [21, 25], or units of the native cryptocurrency in Proof of Stake (PoS) protocols [10]. Yet, PoW blockchain networks grow orthogonally to their design philosophy in terms of their environmental footprint. Currently, one BTC transaction wastes as much energy - in terms of carbon footprint - as 775,818 VISA transactions [19]. More alarming than its current levels - which rank the BTC network above Finland and Pakistan in terms of electricity consumption - is the consumption's increasing trend: the electricity used by the BTC network approximately doubles every year [64].
The total figures only get worse if we account for all other PoW blockchains, such as Ethereum [11]. These alarming figures call for mechanisms to accelerate the development and, more crucially, the adoption of alternative technologies such as PoS [10, 27], also known as virtual mining [6]. Yet, an important barrier is that the value of a cryptocurrency - or, in general, the reliability of its applications - depends on the size of its mining network (with a larger network implying higher safety). More mining power implies that it is more costly for a potential attacker to gather the required resources and compromise the functionality of the blockchain, [32, 10]. Hence, when the rest of the population mines a specific PoW cryptocurrency, it is individually rational (preferable) for any single miner to also mine that cryptocurrency. To foster the transition to existing sustainable alternatives, like PoS, using the currently developed framework, the control parameter T ≥ 0 can be interpreted as taxation. As mentioned in Section 1, changes in T are essentially equivalent to rescaling agents' utilities. Importantly, taxation need be neither preferential, i.e., imposed only on PoW networks, nor permanent. By taxing transactions between (all) cryptocurrencies and fiat currencies, central authorities - especially in countries like China, where most mining rigs are concentrated - can influence the transition of the growing PoW networks to the PoS technology (or any other desirable alternative with equivalent technological guarantees). Importantly, our work shows that it suffices to (i) exercise a temporary influence on the system and (ii) reach a critical mass of PoS adopters that is well below the simple majority before the critical phase transition becomes permanent. Finally, our auxiliary results concerning the functional relationship between the critical value, T_c(γ), and the cost parameter, γ, cf.
Corollary 1, suggest that temporary shocks in the electricity price would accelerate the effects of the transition mechanism. Specifically, since T_c(γ) decreases in γ, an increase in the electricity price for PoW mining would prompt the critical phase transition of the population to the alternative technology sooner. These results, extrapolated to similar technological dilemmas - CO2-emitting vs electric (or generally environmentally friendly) vehicles, e.g., cars or aircraft, or sustainable urban planning - provide an enhanced, and provably effective, toolbox in the hands of contemporary policymakers. Remark 5 (Derivation of revision protocol (S1)). For games with general action sets A = {1, 2, . . . , n}, i.e., with possibly more than two strategies, the Q-learning agents update their strategies according to the following rule: maximize over x the objective Σ_{j=1}^n x_j Q_t(j) − T Σ_{j=1}^n x_j ln x_j, subject to Σ_{j=1}^n x_j = 1 and x_j ≥ 0, for j = 1, 2, . . . , n. (S1') In the current setting, in which each agent has two strategies, this conveniently reduces to the one-dimensional maximization problem in Equation (S1). To intuitively explain the objective function in (S1'), observe that the first term, i.e., Σ_{j=1}^n x_j Q_t(j), enforces maximization of the Q-values. Since it is linear in the x_j's, it would simply choose (put full probability on) the strategy with the highest Q-value if the second term (entropy) were missing. However, the introduction of the entropy term, i.e., of −Σ_{j=1}^n x_j ln x_j, essentially requires the agent to choose the distribution x with maximum entropy for every given weighted sum of the Q-values, and hence, to explore (assign positive probability to) sub-optimal strategies. The relative importance between maximization of Q-values and exploration of the strategy space is controlled by the parameter T ≥ 0.
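The solution of (S1') is the Gibbs/Boltzmann (softmax) distribution x_j ∝ exp(Q_t(j)/T), a standard fact about entropy-regularized linear objectives. The sketch below (with Q-values and T of our own choosing, purely for illustration) computes this closed form and checks that it dominates other feasible points of a two-strategy simplex:

```python
import math

def boltzmann(Q, T):
    """Maximizer of  sum_j x_j Q_j - T sum_j x_j ln x_j  over the simplex:
    the Gibbs/Boltzmann distribution x_j proportional to exp(Q_j / T)."""
    m = max(q / T for q in Q)                 # stabilize the exponentials
    w = [math.exp(q / T - m) for q in Q]
    s = sum(w)
    return [wi / s for wi in w]

def objective(x, Q, T):
    """Entropy-regularized objective of (S1')."""
    return sum(xj * qj for xj, qj in zip(x, Q)) - T * sum(
        xj * math.log(xj) for xj in x if xj > 0)

Q, T = [1.0, 0.4], 0.5
x_star = boltzmann(Q, T)
# the closed form should (weakly) dominate any other point on the simplex
for p in (0.01, 0.25, 0.5, 0.75, 0.99):
    assert objective(x_star, Q, T) >= objective([p, 1 - p], Q, T)
print([round(v, 3) for v in x_star])
```

Lowering T concentrates x_star on the strategy with the highest Q-value (exploitation), while raising it flattens the distribution toward uniform (exploration), matching the temperature interpretation discussed next.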
Termed temperature in physics, T can be interpreted as a tuning parameter: as T → 0, the agent always acts greedily and chooses the strategy corresponding to the maximum Q-value (pure exploitation), whereas as T → ∞, the agent chooses a strategy completely at random (pure exploration). In particular, for T = 0, the system reduces to the well-known replicator (best response) dynamics, which (under standard regularity assumptions that are met in the present model) recover the Nash equilibria of the underlying evolutionary game, cf. Section 2.1. For different values of T > 0, the resting points of the system change - sometimes in an abrupt way in response to smooth changes in T - and this is precisely the intuition that we exploit here to design a mechanism that will stabilize the system and move it out of a lock-in to a desirable state. In fact, as shown in [68], the temperature can be considered a control parameter in the arsenal of a system designer. From the objective function, one discerns that the parameter T essentially rescales all the Q-values in a multiplicative way. Hence, as shown in [68], T can be treated as a taxation parameter in economic systems or as a medically controlled substance in health-related settings. The interpretation of T in a concrete application is further elaborated in Appendix B. Proof of Proposition 1. The resulting game with γ ∈ (0, 1) has three Nash equilibria: two pure, (W, W) with payoffs (1 − γ, 1 − γ) and (S, S) with payoffs (1, 1), and one fully mixed, ((1 + γ)/2, (1 − γ)/2) for each player, with payoffs ((1 − γ)/2, (1 − γ)/2). These correspond to the population states x_1 = 0, x_2 = (1 + γ)/2, and x_3 = 1. The corresponding payoffs can also be derived by substituting in the average payoff function, (4), which here becomes u(x) = 2x² − (2 + γ)x + 1. All three Nash equilibria are symmetric.
The (S, S) (bottom right) Nash equilibrium is (strictly) payoff dominant, i.e., it Pareto-dominates the other two (recall that γ ∈ (0, 1) is the cost of the current costly technology), and it is also (strictly) risk dominant, since 0 + 1 > 1 − 2γ for any γ ∈ (0, 1). Additionally, both pure-strategy equilibria are evolutionary stable, since u(W, W) = 1 − γ > 0 = u(S, W) and u(S, S) = 1 > −γ = u(W, S). A symmetric mixed Nash equilibrium (x*, x*) is evolutionary stable if u(x*, x) > u(x, x) for all other mixed strategies x ≠ x*. Hence, the mixed Nash equilibrium is not evolutionary stable, since for any other x ∈ (0, 1) with x ≠ (1 + γ)/2, it holds that u(x*, x) − u(x, x) = −2(x − x*)^2 < 0.

Proof of Lemma 1. From (3) and (8), it follows directly that f(x; T, γ) = 2x − (1 + γ) − T ln(x/(1 − x)) for any x ∈ (0, 1).

Lemma 2. Let T ∈ [0, 1/2]. Then, for any γ ∈ (0, 1), the equation f(x_u; T, γ) = 0, with x_u as in (2), has a unique solution T_c(γ), or simply T_c, in (0, 1/2).

Proof of Lemma 2. Let u := √(1 − 2T). Then, T ∈ [0, 1/2] implies that u ∈ [0, 1) and the transformation is one-to-one with inverse T = (1 − u^2)/2. Hence, we need to show that the function g_γ(u) := u − γ − ((1 − u^2)/2) ln((1 + u)/(1 − u)) has a unique root u_c in (0, 1), so that T_c(γ) = (1 − u_c^2)/2. Note that g_γ(u) is defined for any u ∈ (−1, 1). The derivative of g_γ(u) with respect to u is g'_γ(u) = u ln((1 + u)/(1 − u)) ≥ 0 for all u ∈ (−1, 1), with g'_γ(u) = 0 only for u = 0. Hence, g_γ(u) is strictly increasing with lim_{u→−1+} g_γ(u) = −(1 + γ) < 0, g_γ(0) = −γ < 0, and lim_{u→1−} g_γ(u) = 1 − γ > 0. Accordingly, g_γ(u) has precisely one root, u_c ∈ (0, 1), which yields the unique solution of the equation in the statement of Lemma 2, given by T_c(γ) = (1 − u_c^2)/2 ∈ (0, 1/2).

Lemma 3. Let γ ∈ (0, 1) and let T_c(γ) ∈ (0, 1/2) be as given by Lemma 2. Then, the number of solutions of the equation f(x; T, γ) = 0 for x ∈ (0, 1) is as follows:
- T < T_c(γ): there are 3 solutions x_1, x_2, x_3, with x_1 ∈ (0, x_l), x_2 ∈ ((1 + γ)/2, x_u), and x_3 ∈ (x_u, 1), with x_{u,l} as in (2). Moreover, it holds that 0 < x_l < 1/2 < (1 + γ)/2 < x_u < 1.
- T = T_c(γ): there are 2 solutions x_1, x_2, with x_1 ∈ (0, x_l) and x_2 = x_u, with x_l < 1/2 and x_u > (1 + γ)/2.
- T > T_c(γ): there is 1 solution x_1, with x_1 ∈ (0, x_l), where x_l < 1/2, when T < 1/2, and x_1 ∈ (0, 1/2) when T ≥ 1/2.

Proof of Lemma 3.
For any T > 0 and γ ∈ (0, 1), the function f(x; T, γ) is continuous in x ∈ (0, 1) with lim_{x→0+} f(x; T, γ) = +∞ and lim_{x→1−} f(x; T, γ) = −∞. This implies that f(x; T, γ) starts positive and ends up negative. Its derivative with respect to x is f'(x; T, γ) = 2 − T/(x(1 − x)) for any x ∈ (0, 1). For T < 1/2, there exist two critical points, x_{l,u} = (1 ∓ √(1 − 2T))/2, and for T = 1/2 only one, i.e., x_l = x_u = 1/2. Depending on the value of T relative to T_c(γ), there are three cases for the number of solutions of the equation f(x; T, γ) = 0 for x ∈ (0, 1).
- 0 < T < T_c(γ). Since T_c(γ) < 1/2 for any γ ∈ (0, 1), as shown in Lemma 2, we have, for any T ∈ (0, T_c(γ)), by equation (17), that x_u > (1 + γ)/2. Moreover, it is also immediate that 1 − 2T > 0 and, hence, that 0 < x_l < 1/2, which shows that 0 < x_l < 1/2 < (1 + γ)/2 < x_u < 1, as claimed. Also, by the strict monotonicity of g_γ(u) on (−1, 1) that was shown in the proof of Lemma 2, we have that f(x_u; T, γ) = g_γ(√(1 − 2T)) > g_γ(√(1 − 2T_c(γ))) = 0. Hence, using equation (18) and the sign of f(x; T, γ), we obtain that f(x; T, γ) has precisely one root x_1 ∈ (0, x_l), one root x_2 ∈ ((1 + γ)/2, x_u), and one root x_3 ∈ (x_u, 1).
- T = T_c(γ). As in the previous case, T_c(γ) < 1/2, which by the same reasoning implies that 0 < x_l < (1 + γ)/2 < x_u < 1. However, in this case, f(x_u; T, γ) = g_γ(√(1 − 2T_c(γ))) = 0 and, hence, using again equation (18) and the sign of f(x; T, γ), it follows that f has one root in (0, x_l) and a second root in ((1 + γ)/2, 1), which is precisely x_u.
- T > T_c(γ). In this case, f(x_u; T, γ) = g_γ(√(1 − 2T)) < 0 whenever x_u exists, which implies that f(x; T, γ) turns negative at some point x_1 < x_l when T < 1/2 (and remains negative thereafter, since the only remaining critical point, x_u, is a local maximum with f(x_u; T, γ) < 0), or at some x_1 < 1/2 when T ≥ 1/2 (in which case x_l does not exist and f is strictly decreasing). Hence, this x_1 ∈ (0, x_l) or x_1 ∈ (0, 1/2) is the unique root of f(x; T, γ) in (0, 1).
The three cases of Lemma 3 are illustrated in Figure 8. In the depicted instantiation, γ = 0.185, which yields T_c(γ) ≈ 0.3. The three curves correspond to T = 0.25, T = T_c(γ), and T = 1, which, in agreement with Lemma 3, yield 3, 2, and 1 solutions, respectively, to the equation f(x; T, γ) = 0, x ∈ (0, 1).
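The root structure of Lemma 3 can be reproduced numerically. The sketch below assumes the Boltzmann Q-learning form f(x; T, γ) = 2x − (1 + γ) − T ln(x/(1 − x)), consistent with the payoff difference of the game above; this is an illustrative reconstruction, not necessarily the paper's exact definition of f:

```python
import math

def f(x, T, gamma):
    # assumed steady-state function: payoff difference 2x - (1 + gamma)
    # minus the temperature-weighted log-ratio (entropy) term
    return 2 * x - (1 + gamma) - T * math.log(x / (1 - x))

def count_roots(T, gamma, n=50000):
    """Count sign changes of f over a fine grid on (0, 1)."""
    xs = [(i + 0.5) / n for i in range(n)]
    signs = [f(x, T, gamma) > 0 for x in xs]
    return sum(s != t for s, t in zip(signs, signs[1:]))

gamma = 0.185
assert count_roots(0.25, gamma) == 3   # T < T_c: three QRE
assert count_roots(1.00, gamma) == 1   # T > T_c: unique QRE

# bisect for the critical temperature at which two roots vanish
lo, hi = 0.25, 1.0
for _ in range(30):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if count_roots(mid, gamma) >= 3 else (lo, mid)
T_c = (lo + hi) / 2   # close to the value of roughly 0.3 reported for gamma = 0.185
```

Under this assumed f, the computed critical temperature for γ = 0.185 lands near 0.3, in line with the instantiation depicted in Figure 8.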
Using Lemmas 2 and 3, we can now determine the convergence properties of the Q-learning dynamical system ẋ and, hence, prove Theorem 1.

Proof of Theorem 1. The existence of the steady states in the three cases has been established in Lemma 3. Hence, it remains to prove the claims about their stability. The dynamics defined by ẋ are 1-dimensional and, hence, their convergence properties and the stability of their steady states can be fully determined by the sign of ẋ. Since x(1 − x) > 0 for any x ∈ (0, 1), the sign of ẋ depends entirely on f(x; T, γ). In turn, the sign of f(x; T, γ) for any x ∈ (0, 1) has already been determined by the calculations in the proof of Lemma 3, and in particular by equation (18). Formally, we have the following cases.
- T < T_c(γ). In this case, the dynamics ẋ have three steady states x_1, x_2, x_3 ∈ (0, 1) that correspond to the respective roots of the function f(x; T, γ). In particular, it holds that 0 < x_1 < x_l < 1/2 < (1 + γ)/2 < x_2 < x_u < x_3 < 1, and the sign of ẋ starts positive and alternates accordingly. This gives the stability results in Figure 3b.
- T = T_c(γ). At this point, the two roots that are larger than 1/2, namely x_2 and x_3, merge into one root precisely at x_u. The critical observation is that this new steady state is unstable, since the dynamics ẋ have a negative sign on both sides of the root x_u. Moreover, it holds that x_u > (1 + γ)/2. This is shown in Figure 3c.
- T > T_c(γ). In this case, f(x; T, γ) < 0 for all x > x_1, which implies that the dynamics ẋ are decreasing there. Since f(x; T, γ) starts positive and ends up negative, there remains only one root (steady state), x_1, of f(x; T, γ), which is strictly less than 1/2. An illustration is given in Figure 3d.

Proof of Corollary 1. For γ ∈ (0, 1), the critical temperature T_c(γ) is the unique solution of the equation F(γ, T) = 0 from Lemma 2.
Implicit differentiation of F(γ, T) = 0 with respect to γ yields that dT_c(γ)/dγ is negative, since the argument of the ln is larger than 1 for all T ∈ (0, 1/2].

Proof of Theorem 2. As mentioned in Remark 2, in theory it suffices to select T_0 = 0, T_1 > T_c(γ), and T_2 = 0. Then, the stability properties of the dynamics that were established in Theorem 1 directly imply the result. For practical purposes, though, only gradual changes in T may be possible. However, we can show that the mechanism still results in the same outcome. Specifically, starting from any initial point x_0 ∈ [(1 + γ)/2, 1] and increasing T from T_0 = 0 to some T_1 ∈ (0, T_c(γ)), the dynamics will stabilize at the QRE x_3. Further increasing T above T_c(γ) will lead the dynamics to stabilize at the unique remaining QRE, x*_1(T) < 1/2. It remains to show that we can reduce T back to 0 and converge to the desired equilibrium x = 0. By Lemma 3, the remaining QRE satisfies 0 < x_1(T) < x_l(T) for T < T_c(γ) < 1/2. Hence, for any ε > 0, there exists a δ > 0 such that x_1(T) < ε for any T < δ, since x_l(T) = (1 − √(1 − 2T))/2 → 0 as T → 0+. This implies that x_1(T) → 0 as T → 0+, which concludes the proof.

An alternative visualization of the controlled catastrophe mechanism described in Section 5 is provided in Figure 10.
Fig. 10: The process of destabilizing an equilibrium via a controlled catastrophe and hysteresis mechanism in the game (G1). The first panel shows the evolution of the population state for 10 initial starting points and control parameter T = 0. The Q-learning dynamics (in this case, equivalent to the replicator dynamics) converge to the two stable equilibria of the system. The only trajectory that converges to the mixed, unstable equilibrium is the one that starts precisely at it (horizontal line at (1 + γ)/2, i.e., slightly above 1/2). In the second panel, the control parameter increases (but remains below the critical level). The starting points are the possible steady states for T = 0.
The two stable equilibria (QRE) are in the interior of the admissible region, x ∈ (0, 1). There is a third, unstable equilibrium that separates the attracting regions of the two stable equilibria (not visible, since no trajectory converges to it). In the third panel, the control parameter is increased above the critical level. Starting again from the endpoints of the previous panel, the stability of the system changes abruptly: the upper and middle equilibria merge and annihilate each other. There is only one remaining equilibrium, and the population converges to it independently of the starting point. This equilibrium lies in the attracting region of the efficient technology (risk-dominant equilibrium), which can now be recovered by resetting the system back to its initial control value, T = 0 (last panel).

Proof. Since ∂f(x; T, γ)/∂γ = −1 < 0, f(x; T, γ) is decreasing in γ, which implies that f(x; T, γ + ε_0) ≤ f(x; T, γ) + ε(x) = f̃(x; T, γ) ≤ f(x; T, γ − ε_0). For given γ ∈ (0, 1) and any T > 0, let x*_m(T) := min {x ∈ (0, 1) : f(x; T, γ) = 0} and x*_m(T; ε_0) := min {x ∈ (0, 1) : f̃(x; T, γ) = 0} denote the minimum QRE at level T of the unperturbed and perturbed dynamics, respectively. Note that, by Theorem 1, x*_m(T) < 1/2 for any pair (T, γ).

Proof of Theorem 3. The stability analysis of the dynamics in the perturbed case is derived in the same way as in Theorem 1 and follows directly from Lemma 4 and the fact that |ε(x)| ≤ ε_0. In particular, let x* ∈ (0, 1) be a QRE of the unperturbed system. Then, since |ε(x)| ≤ ε_0, we can find a δ(x*) > 0 so that |f(x* ± δ(x*); T, γ)| > ε_0. Hence, outside the δ(x*)-neighborhood of x*, the perturbed function f̃(x; T, γ) retains the sign of the unperturbed function f(x; T, γ), and the stability properties of the dynamics follow directly from Theorem 1.
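This robustness argument can be sanity-checked numerically. The sketch below uses the illustrative Boltzmann form f(x; T, γ) = 2x − (1 + γ) − T ln(x/(1 − x)) together with an arbitrary bounded perturbation ε(x); both choices are assumptions for demonstration, not the paper's exact objects:

```python
import math

def f(x, T, gamma):
    # assumed steady-state function (Boltzmann Q-learning form)
    return 2 * x - (1 + gamma) - T * math.log(x / (1 - x))

def bisect_root(g, lo, hi, iters=60):
    """Bisection for a sign change of g on [lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if (g(lo) > 0) == (g(mid) > 0):
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

gamma, T, eps0 = 0.185, 1.0, 0.01            # T > T_c: unique QRE regime
eps = lambda x: eps0 * math.sin(50 * x)      # bounded: |eps(x)| <= eps0

x_star = bisect_root(lambda x: f(x, T, gamma), 0.01, 0.5)
x_pert = bisect_root(lambda x: f(x, T, gamma) + eps(x), 0.01, 0.5)

# the unique QRE moves by at most about eps0 / |f'| under the perturbation
assert abs(x_pert - x_star) < 0.01
```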
To obtain an estimate of δ(x*), observe that for any x ∈ (0, 1), by the first-order Taylor theorem, we have that f(x* ± δ; T, γ) = f(x*; T, γ) ± δ f'(ξ; T, γ), where ξ ∈ (x* − δ, x*) or ξ ∈ (x*, x* + δ), respectively, and δ > 0. This equation can then be solved to find a minimum value δ(x*) so that |f(x* ± δ(x*); T, γ)| ≥ ε_0. As claimed in Appendix A, the steep increase, 1/(x(1 − x)), of the entropy term as x* approaches the boundary dominates the (bounded) perturbation term ε(x*) and yields lower values of δ(x*). This implies that the distance between x*_m(T) and x*_m(T; ε_0), as defined in Notation 4, is decreasing for values of x*_m(T) close to 0. Thus, convergence to x = 0 (which corresponds to the efficient technology) is still achieved.

Remark 6. The parameter α expresses the value that is created by a technology in response to its adoption. There are three interesting cases, depending on whether α is smaller than, larger than, or equal to 1.
- α < 1: Subadditive value. In this case, the total value generated by the two technologies is subadditive, implying that a split of the network is more beneficial for society.
- α = 1: Linear value. In this case, the aggregate value that is generated is increasing in the rate of adoption of the innovative (less costly) technology. The resolution of this case is trivial from a mathematical perspective, since the less costly technology constitutes a strictly dominant strategy for the population for any γ ∈ (0, 1).
- α > 1: Superadditive value. In this case, the aggregate value is (locally) maximized when either technology is fully adopted. This case corresponds to direct (positive) network effects and is the one discussed in the current paper.
The proof of Theorem 4 relies on Lemma 5, which exploits a particular inequality of the natural logarithm.
Fig. 11: Typical instantiations of the three cases, depicted in two panels for selected values of the parameters V = 10, K = 4, and γ = 1.
For α = 2 (left panel) and, in general, for α > 1, the total aggregate value is (locally) maximized at the boundaries, i.e., when either technology is fully adopted (superadditive value). The global maximum is attained when the less costly technology S is fully adopted, i.e., when x = 0. By contrast, for α = 1/2 (right panel, blue line) and, in general, for α < 1, the aggregate wealth is maximized when the population is split between the two technologies (subadditive value). For α = 1 (right panel, red line), the aggregate wealth is increasing in the adoption of the less costly technology S (linear value).

Proof. Since x ∈ [0, 1], we can rewrite the inequality in an equivalent form. Hence, by symmetry, it suffices to show that x^α(1 − x) − x^{2α} ≤ 0 for any x ∈ [0, 1] and α > 2, which is in turn equivalent to x^{α−1}(1 − x) ≤ 1/(2α). By differentiating the left-hand side with respect to x, we find that its maximum over x ∈ [0, 1] is attained at x = (α − 1)/α, so that the inequality reduces to ((α − 1)/α)^{α−1} · (1/α) ≤ 1/(2α), or, equivalently, ((α − 1)/α)^{α−1} ≤ 1/2 for any α ≥ 2. However, the term on the left side is decreasing in α, since, by taking the logarithm and applying the inequality ln(x) ≤ x − 1, we obtain that d/dα [(α − 1) ln((α − 1)/α)] = ln(1 − 1/α) + 1/α ≤ 0. Hence, the maximum of the left side is attained for α = 2, yielding a value of ((2 − 1)/2)^{2−1} = 1/2, which concludes the proof.

To ease the proof of Theorem 4, we restrict attention to integer α > 1, yet the results continue to hold for any α ≥ 1. Recall that for α = 1 the efficient technology constitutes a strictly dominant strategy and, hence, its adoption is trivial (which is why this case is excluded from the statement of Theorem 4).
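The three regimes of Remark 6 can be visualized with a stand-in aggregate-value curve W(x) = x^α + (1 − x)^α − γx, where x is the share on the costly technology W; this is a hypothetical form chosen only to reproduce the qualitative picture of Figure 11 (the paper's exact function involves the parameters V and K):

```python
# Hypothetical aggregate-value curve illustrating Remark 6's three regimes.
gamma, n = 0.5, 2000
xs = [i / n for i in range(n + 1)]

def W(x, alpha):
    # x = share on the costly technology W; 1 - x on technology S
    return x**alpha + (1 - x)**alpha - gamma * x

for alpha, regime in [(2.0, "superadditive"), (1.0, "linear"),
                      (0.5, "subadditive")]:
    vals = [W(x, alpha) for x in xs]
    argmax = xs[vals.index(max(vals))]   # share of W at the social optimum
    print(regime, round(argmax, 3))
```

With this stand-in, the superadditive case (α = 2) has local maxima at both boundaries and its global maximum at x = 0 (full adoption of S), the linear case decreases in x, and the subadditive case peaks at an interior split, matching the qualitative description of the two panels.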
The continuous case requires an additional technical step for the interval (2, 3): in particular, for α ∈ [2, 3], f_α(x) is not monotone decreasing. However, the statement of Theorem 4 continues to hold in this interval as well. Since this case requires more technical details without providing any additional insight, its proof is omitted.

Proof of Theorem 4. The case α = 2 has been treated in Theorem 1 and Theorem 2. For α ≥ 3 and γ ∈ (0, 1), we argue as follows. To prove uniqueness, it is sufficient to prove that, for T ≥ 1/2, f_α(x) is decreasing in x. This implies that, for T ≥ 1/2, there is a unique steady state x* and, hence, that the critical temperature T_c necessarily satisfies T_c < 1/2. Taking the derivative of f_α(x) with respect to x, we obtain an expression that is decreasing in T, and, hence, it suffices to prove that (d/dx) f_α(x) ≤ 0 for T = 1/2. In this case, the condition (d/dx) f_α(x) ≤ 0 reduces to the inequality of Lemma 5, and the claim follows. In particular, equality holds only if T = 1/2, α = 3 (α − 1 = 2), and x = 1/2. Hence, f_α(x) is decreasing in (0, 1), which proves the claim. Finally, we need to show that, for T = 0 and any α > 1, the system still has 3 steady states: x_1 = 0, x_3 = 1, and x_2 ∈ (1/2, 1). For T = 0, the first derivative of f_α(x) with respect to x is positive for all x ∈ (0, 1) and all α > 1, which implies that f_α(x) is monotonically increasing. Since f_α(x) starts negative at x = 0, ends up positive at x = 1, and f_α(1/2) = −γ < 0, it has one root that lies in (1/2, 1). Hence, the dynamics have two obvious steady states, x_1 = 0 and x_3 = 1, and a third steady state x_2 ∈ (1/2, 1). Accordingly, for T = 0, the usual stability analysis applies, which proves that [0, 1/2] lies in the attracting region of x = 0, as claimed.

Conflict of interest. The authors declare that they have no conflict of interest.

References:
- The logit-response dynamics
- Dynamic Global Games of Regime Change: Learning, Multiplicity, and the Timing of Attacks
- Competing Technologies, Increasing Returns, and Lock-In by Historical Events
- Complexity theory and financial regulation
- Network effects as drivers of individual technology adoption: Analyzing adoption and diffusion of mobile communication services
- Cryptocurrencies Without Proof of Work
- From Darwin to Poincaré and von Neumann: Recurrence and Cycles in Evolutionary and Algorithmic Game Theory
- An Experimental Study on How to Overcome Coordination Failure in Organizations
- Stand by me: experiments on help and commitment in coordination games
- Formal Barriers to Longest-Chain Proof-of-Stake Protocols
- Incentives in Ethereum's hybrid Casper protocol
- Experience-weighted Attraction Learning in Normal Form Games
- Behavioral Game Theory: Experiments in Strategic Interaction (The Roundtable Series in Behavioral Economics)
- Market Entry in the Presence of Network Effects: A Real Options Perspective
- Role of network structure and network effects in diffusion of innovations
- Penalty-Regulated Dynamics and Robust Learning Procedures in Games
- Nuclear Power Reactors: A Study in Technological Lock-in
- Coordination and Information in Critical Mass Games: An Experimental Study
- Coordination and Lock-In: Competition with Switching Costs and Network Effects
- Energy Equilibria in Proof-of-Work Mining
- Beyond myopic best response (in Cournot competition)
- Complex dynamics in learning complicated games
- The Bitcoin Backbone Protocol: Analysis and Applications
- Mind the Mining
- Network effects aren't enough
- Comparative evaluation of consensus mechanisms in cryptocurrencies
- Evolutionary Games and Population Dynamics
- Exploration vs. Exploitation in Team Formation
- Technology adoption in markets with network effects: Theory and experimental evidence
- Dynamics of Boltzmann Q-learning in two-player two-action games
- Blockchain Mining Games
- Ouroboros: A Provably Secure Proof-of-Stake Blockchain Protocol
- Elements of Applied Bifurcation Theory, Third Edition
- Adoption of Information Technology Under Network Effects
- Oceanic Games: Centralization Risks and Incentives in Blockchain Mining
- Individual Q-Learning in Normal Form Games
- Investment decisions and coordination problems in a market with network externalities: An experimental study
- Overcoming coordination failure in a critical mass game: Strategic motives and action disclosure
- Overcoming inefficient lock-in in coordination games with sophisticated and myopic players
- Quantal Response Equilibria for Normal Form Games
- Diffusion of Innovations
- Mutation, Sexual Reproduction and Survival in Dynamic Environments
- Learning in Games via Reinforcement and Regularization
- Bitcoin: A Peer-to-Peer Electronic Cash System
- Multiplicative Weights Update with Constant Step-Size in Congestion Games: Convergence, Limit Cycles and Chaos
- Average Case Performance of Replicator Dynamics in Potential Games via Computing Regions of Attraction
- Chaos in duopoly pricing
- The Process of Innovation and the Diffusion of Innovation
- The effect of hysteresis on equilibrium selection in coordination games
- The rise and fall of catastrophe theory applications in economics: Was the baby thrown out with the bathwater?
- Adoption of Technologies with Network Effects: An Empirical Examination of the Adoption of Automated Teller Machines
- The prevalence of chaotic dynamics in games with many players
- Population Games and Evolutionary Dynamics
- Stability and diversity in collective adaptation
- Coupled replicator equations for the dynamics of learning in multiagent systems
- Early-warning signals for critical transitions
- Information Rules: A Strategic Guide to the Network Economy
- 'Everybody's doing it': on the persistence of bad social norms
- Nonlinear Dynamics and Chaos: With Applications to Physics
- Multi-Agent Reinforcement Learning: Independent vs. Cooperative Agents
- An Evolutionary Dynamical Analysis of Multi-Agent Learning in Iterated Games
- A Selection-Mutation Model for Q-learning in Multi-agent Systems (AAMAS '03)
- Critical slowing down as early warning for the onset and termination of depression
- Technical Note: Q-Learning
- Hysteresis effects of changing the parameters of noncooperative games
- Bifurcation Mechanism Design: from Optimal Flat Taxes to Improved Cancer Treatments