title: Stochastic Games with Lexicographic Reachability-Safety Objectives
authors: Chatterjee, Krishnendu; Katoen, Joost-Pieter; Weininger, Maximilian; Winkler, Tobias
date: 2020-06-16  journal: Computer Aided Verification  DOI: 10.1007/978-3-030-53291-8_21

Abstract. We study turn-based stochastic zero-sum games with lexicographic preferences over reachability and safety objectives. Stochastic games are standard models in control, verification, and synthesis of stochastic reactive systems that exhibit both randomness as well as angelic and demonic non-determinism. The lexicographic order allows one to consider multiple objectives with a strict preference order over the satisfaction of the objectives. To the best of our knowledge, stochastic games with lexicographic objectives have not been studied before. We establish determinacy of such games and present strategy and computational complexity results. For strategy complexity, we show that lexicographically optimal strategies exist that are deterministic, and memory is only required to remember the already satisfied and violated objectives. For a constant number of objectives, we show that the relevant decision problem is in NP ∩ coNP, matching the currently known bound for single objectives; in general, the decision problem is PSPACE-hard and can be solved in NEXPTIME ∩ coNEXPTIME. We present an algorithm that computes the lexicographically optimal strategies via a reduction to the computation of optimal strategies in a sequence of single-objective games. We have implemented our algorithm and report experimental results on various case studies.

Simple stochastic games (SGs) [26] are zero-sum turn-based stochastic games played over a finite state space by two adversarial players, the Maximizer and Minimizer, along with randomness in the transition function. These games allow the interaction of angelic and demonic non-determinism as well as stochastic uncertainty. They generalize classical models such as Markov decision processes (MDPs) [39], which have only one player and stochastic uncertainty. An objective specifies a desired set of trajectories of the game, and the goal of the Maximizer is to maximize the probability of satisfying the objective against all choices of the Minimizer. The basic decision problem is to determine whether the Maximizer can ensure satisfaction of the objective with a given probability threshold. This problem is among the rare and intriguing combinatorial problems that are in NP ∩ coNP, and whether it belongs to P is a major and long-standing open problem. Besides the theoretical interest, SGs are a standard model in control and verification of stochastic reactive systems [4, 18, 31, 39], and they provide robust versions of MDPs when precise transition probabilities are not known [22, 45]. The multi-objective optimization problem is relevant in the analysis of systems with multiple, potentially conflicting goals, where a trade-off between the objectives must be considered. While multi-objective optimization has been extensively studied for MDPs with various classes of objectives [1, 28, 39], the problem is notoriously hard for SGs. Even for multiple reachability objectives, such games are not determined [23] and their decidability is still open. This work considers SGs with multiple reachability and safety objectives under a lexicographic preference order over the objectives.
That is, we consider SGs with several objectives where each objective is either reachability or safety, and there is a total preference order over the objectives. The motivation to study such lexicographic objectives is twofold. First, they provide an important special case of general multiple objectives. Second, lexicographic objectives are useful in many scenarios. For example, (i) an autonomous vehicle might have a primary objective to avoid crashes and a secondary objective to optimize performance; and (ii) a robot saving lives during a fire in a building might have a primary objective to save as many lives as possible, and a secondary objective to minimize energy consumption. Thus studying reactive systems with lexicographic objectives is a very relevant problem which has been considered in many different contexts [7, 33]. In particular, non-stochastic games with lexicographic objectives [6, 25] and MDPs with lexicographic objectives [47] have been considered, but to the best of our knowledge SGs with lexicographic objectives have not been studied. In this work we present several contributions for SGs with lexicographic reachability and safety objectives. The main contributions are as follows.
- Determinacy. In contrast to SGs with multiple objectives, which are not determined, we establish determinacy of SGs with a lexicographic combination of reachability and safety objectives.
- Computational complexity. For the associated decision problem we establish the following: (a) if the number of objectives is constant, then the decision problem lies in NP ∩ coNP, matching the currently known bound for SGs with a single objective; (b) in general, the decision problem is PSPACE-hard and can be solved in NEXPTIME ∩ coNEXPTIME.
- Strategy complexity. We show that lexicographically optimal strategies exist that are deterministic but require finite memory. We also show that memory is only needed in order to remember the already satisfied and violated objectives.
- Algorithm. We present an algorithm that computes the unique lexicographic value and the witness lexicographically optimal strategies via a reduction to the computation of optimal strategies in a sequence of single-objective games.
- Experimental results. We have implemented the algorithm and present experimental results on several case studies.

Technical Contribution. The key idea is that, given the lexicographic order of the objectives, we can consider them sequentially. After every objective, we remove all actions that are not optimal, thereby forcing all subsequent computation to consider only locally optimal actions. The main complication is that local optimality of actions does not imply global optimality when interleaving reachability and safety, as the latter objective can use locally optimal actions to stay in the safe region without reaching the more important target. We introduce quantified reachability objectives as a means to solve this problem.

Related Work. We present related work on: (a) MDPs with multiple objectives; (b) SGs with multiple objectives; (c) lexicographic objectives in related models; and (d) existing tool support.

(a) MDPs with multiple objectives have been widely studied over a long time [1, 39]. In the context of verifying MDPs with multiple objectives, both qualitative objectives, such as reachability and LTL [29], and quantitative objectives, such as mean payoff [8, 13], discounted sum [17], or total reward [34], have been considered.
Besides multiple objectives with the expectation criterion, other criteria have also been considered, such as combination with variance [9], or multiple percentile (threshold) queries [8, 20, 32, 41]. Practical applications of MDPs with multiple objectives are described in [2, 3, 42].

(b) More recently, SGs with multiple objectives have been considered, but the results are more limited [43]. Multiple mean-payoff objectives were first examined in [5] and the qualitative problems are coNP-complete [16]. Some special classes of SGs (namely stopping SGs) have been solved for total-reward objectives [23] and applied to autonomous driving [24]. However, even for the most basic question of solving SGs with multiple reachability objectives, decidability remains open.

(c) The study of lexicographic objectives has been considered in many different contexts [7, 33]. Non-stochastic games with lexicographic mean-payoff objectives and parity conditions have been studied in [6] for the synthesis of reactive systems with performance guarantees. Non-stochastic games with multiple ω-regular objectives equipped with a monotonic preorder, which subsumes the lexicographic order, have been studied in [12]. Moreover, the beyond worst-case analysis problems studied in [11] also consider primary and secondary objectives, which has a lexicographic flavor. MDPs with lexicographic discounted-sum objectives have been studied in [47], and have been extended with partial observability in [46]. However, SGs with lexicographic reachability and safety objectives have not been considered so far.

(d) PRISM-games [37] provides tool support for several multi-player multi-objective settings. MultiGain [10] is limited to generalized mean-payoff MDPs. Storm [27] can, among numerous single-objective problems, solve Markov automata with multiple timed reachability or expected cost objectives [40] and multi-cost bounded reachability MDPs [35], and it can provide simple strategies for multiple expected reward objectives in MDPs [28].

Structure of this Paper. After recalling preliminaries and defining the problem in Sect. 2, we first consider games where all target sets are absorbing in Sect. 3. Then, in Sect. 4 we extend our insights to general games, yielding the full algorithm and the theoretical results. Finally, Sect. 5 describes the implementation and experimental evaluation. Section 6 concludes.

A probability distribution on a finite set A is a function f : A → [0, 1] such that Σ_{x∈A} f(x) = 1. We denote the set of all probability distributions on A by D(A). Vector-like objects x are denoted in a bold font and we use the notation x_i for the i-th component of x. A (simple) stochastic game (SG) is a tuple G = (S_□, S_♦, Act, P), where S = S_□ ∪ S_♦ is a finite set of states partitioned into the states S_□ of player Max and S_♦ of player Min, Act : S → 2^L \ {∅} assigns to every state a non-empty set of available actions from a finite set L of action labels, and P : S × L → D(S) is the transition function, defined for all pairs (s, a) with a ∈ Act(s); we write P(s, a)(s′) for the probability to move from s to s′ under action a. A state s is a sink if P(s, a)(s) = 1 for all a ∈ Act(s), and Sinks(G) denotes the set of all sinks of G. An infinite path is an infinite sequence π = s_0 a_0 s_1 a_1 … ∈ (S × L)^ω such that for all i ≥ 0 we have a_i ∈ Act(s_i) and P(s_i, a_i)(s_{i+1}) > 0. Finite paths are defined analogously as elements of (S × L)* × S. Note that when considering MCs, every state has just a single action, so an infinite path can be identified with an element of S^ω. A strategy of player Max is a function σ : (S × L)* × S_□ → D(L) where σ(πs)(a) > 0 only if a ∈ Act(s). It is memoryless if σ(πs) = σ(π′s) for all π, π′ ∈ (S × L)*. More generally, σ has memory of class-size at most m if the set (S × L)* can be partitioned into m classes M_1, …, M_m ⊆ (S × L)* such that σ(πs) = σ(π′s) for all 1 ≤ i ≤ m, π, π′ ∈ M_i and s ∈ S_□. A memory of class-size m can be represented with log(m) bits. A strategy is deterministic if σ(πs) is Dirac for all πs. Strategies that are both memoryless and deterministic are called MD and can be identified with functions σ : S_□ → L. Notice that there are at most |L|^{|S_□|} different MD strategies, that is, exponentially many in |S_□|; in general, there can be uncountably many strategies. Strategies τ of player Min are defined analogously, with S_□ replaced by S_♦. The set of all strategies of player Max is denoted by Σ_Max, the set of all MD strategies by Σ_Max^MD, and similarly Σ_Min and Σ_Min^MD for player Min. Fixing a strategy σ of one player in a game G yields the induced MDP G^σ. Fixing a strategy τ of the second player, too, yields the induced MC G^{σ,τ}. Notice that the induced models are finite if and only if the respective strategies use finite memory. Given an (induced) MC G^{σ,τ}, we let P^{σ,τ}_s be its associated probability measure on the Borel-measurable sets of infinite paths, obtained from the standard cylinder construction where s is the initial state [39].
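To make these objects concrete, the following is a minimal Python sketch of one possible encoding; it is purely illustrative and not taken from the authors' implementation (all names such as SG, induced_mdp, trans are our own).

```python
# Illustrative encoding of an SG G = (S_box, S_diam, Act, P); not the
# authors' tool. States and action labels are plain strings.
from dataclasses import dataclass

@dataclass
class SG:
    max_states: set      # S_box: states controlled by Max
    min_states: set      # S_diam: states controlled by Min
    acts: dict           # Act: state -> list of available action labels
    trans: dict          # P: (state, action) -> {successor: probability}

def induced_mdp(g: SG, sigma: dict) -> SG:
    """Fix an MD strategy sigma : S_box -> L for Max, yielding the
    induced MDP G^sigma (Max states keep only the chosen action)."""
    acts = {s: ([sigma[s]] if s in g.max_states else g.acts[s])
            for s in g.acts}
    trans = {(s, a): d for (s, a), d in g.trans.items() if a in acts[s]}
    return SG(g.max_states, g.min_states, acts, trans)
```

Fixing an MD strategy for Min in the same way then yields the induced (finite) MC, matching the remark above that finite-memory strategies induce finite models.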
Reachability and Safety. In our setting, a property is a Borel-measurable set Ω ⊆ S^ω of infinite paths in an SG. The reachability property Reach(T), where T ⊆ S, is the set Reach(T) = {s_0 s_1 … ∈ S^ω | ∃i ≥ 0 : s_i ∈ T}. The set Safe(T) = S^ω \ Reach(T) is called a safety property. Further, for sets T_1, T_2 ⊆ S we define the until property T_1 U T_2 = {s_0 s_1 … ∈ S^ω | ∃i ≥ 0 : s_i ∈ T_2 ∧ ∀j < i : s_j ∈ T_1}. These properties are measurable (e.g. [4]). A reachability or safety property where the set T satisfies T ⊆ Sinks(G) is called absorbing. For the safety probabilities in an (induced) MC, it holds that P_s(Safe(T)) = 1 − P_s(Reach(T)). We highlight that an objective Safe(T) is specified by the set of paths to avoid, i.e., paths satisfying the objective remain forever in S \ T.

SGs with lexicographic preferences are a straightforward adaptation of the ideas of e.g. [46] to the game setting. The lexicographic order on R^n is defined as x ≤_lex y iff x_i ≤ y_i, where i ≤ n is the greatest position such that for all j < i it holds that x_j = y_j. The position i thus acts like a tiebreaker. Notice that for arbitrary sets X ⊆ [0, 1]^n, suprema and infima exist in the lexicographic order. Given a lex-objective Ω = (Ω_1, …, Ω_n), the lex-value of state s is

    v_lex(s) = sup_σ inf_τ P^{σ,τ}_s(Ω),    (1)

where P^{σ,τ}_s(Ω) denotes the vector (P^{σ,τ}_s(Ω_1), …, P^{σ,τ}_s(Ω_n)) and the suprema and infima are taken with respect to the order ≤_lex on [0, 1]^n. Thus the lex-value at state s is the lexicographically supremal vector of probabilities that Max can ensure against all possible behaviors of Min. We will prove in Sect. 4.3 that the supremum and infimum in (1) can be exchanged; this property is called determinacy. We omit the superscript Ω in v^Ω_lex if it is clear from the context. We also omit the sets Σ_Max and Σ_Min in the suprema in (1), e.g. we will just write sup_σ.

Example 1 (Lex-value). Consider the SG in Fig. 1a with the lex-objective Ω = (Reach(S_1), Safe(S_2)). Player Max must thus maximize the probability to reach S_1 and, moreover, among all possible strategies that do so, it must choose one that maximizes the probability to avoid S_2 forever.

We extend the notion of value to actions. Let s ∈ S be a state. The lex-value of an action a ∈ Act(s) is v_lex(a) = Σ_{s′∈S} P(s, a)(s′) · v_lex(s′). An action a ∈ Act(s) of a state s ∈ S_□ is called lex-optimal if v_lex(a) = max_{a′∈Act(s)} v_lex(a′). Lex-optimal actions are defined analogously for states s ∈ S_♦ by considering the minimum instead of the maximum. Notice that there is always at least one lex-optimal action because Act(s) is finite by definition.
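As an illustration, action lex-values can be computed directly from the successor values. The sketch below reuses the illustrative encoding from the previous code block and exploits that Python tuples compare lexicographically, which coincides with ≤_lex; the function names are again our own.

```python
def action_lex_value(trans, v_lex, s, a):
    """v_lex(a) = sum over successors s' of P(s,a)(s') * v_lex(s'),
    computed componentwise; v_lex maps states to tuples in [0,1]^n."""
    n = len(next(iter(v_lex.values())))
    return tuple(sum(p * v_lex[t][i] for t, p in trans[(s, a)].items())
                 for i in range(n))

def lex_optimal_actions(g, v_lex, s):
    """Actions attaining the lex-maximum at a Max state and the
    lex-minimum at a Min state; tuples compare lexicographically."""
    vals = {a: action_lex_value(g.trans, v_lex, s, a) for a in g.acts[s]}
    best = max(vals.values()) if s in g.max_states else min(vals.values())
    return [a for a, v in vals.items() if v == best]
```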
Example 2 (Lex-value of actions). We now intuitively explain the lex-values of all states in Fig. 1a. The lex-value of the sink states s, t, u and w is determined by their membership in the sets S_1 and S_2. E.g., v_lex(s) = (1, 1), as it is part of the set S_1 that should be reached and not part of the set S_2 that should be avoided. Similarly, we get the lex-values of t, u and w as (1, 0), (0, 0) and (0, 1), respectively. State v has a single action that yields (0, 0) or (0, 1) each with probability 1/2, so v_lex(v) = (0, 1/2). State p has one action going to s, which would yield (1, 1). However, as p is a Min-state, its best strategy is to avoid giving such a high value. Thus, it uses the action going downwards and v_lex(p) = v_lex(q). State q only has a single action going to r, so v_lex(q) = v_lex(r). State r has three choices: (i) Going back to q, which results in an infinite loop between q and r, and thus never reaches S_1. So a strategy that commits to this action will not achieve the optimal value. (ii) Going to t or u each with probability 1/2. In this case, the safety objective is definitely violated, but the reachability objective is achieved with probability 1/2. (iii) Going to t or v each with probability 1/2. Similarly to (ii), the probability to reach S_1 is 1/2, but additionally, there is a 1/2 · 1/2 chance to avoid S_2. Thus, since r is a Max-state, its lex-optimal choice is the action leading to t or v and we get v_lex(r) = (1/2, 1/4).
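The arithmetic of Example 2 can be checked mechanically; the following self-contained snippet, with the values read off above, verifies the comparison between choices (ii) and (iii) of state r.

```python
# Numeric check of Example 2; tuples compare lexicographically in Python.
v = {'s': (1.0, 1.0), 't': (1.0, 0.0), 'u': (0.0, 0.0),
     'w': (0.0, 1.0), 'v': (0.0, 0.5)}   # v's value via its single action

def mix(a, b):                            # uniform 1/2-1/2 combination
    return tuple(0.5 * x + 0.5 * y for x, y in zip(a, b))

choice_ii  = mix(v['t'], v['u'])          # = (0.5, 0.0)
choice_iii = mix(v['t'], v['v'])          # = (0.5, 0.25)
assert choice_iii == (0.5, 0.25)          # v_lex(r), as in Example 2
assert choice_ii < choice_iii             # (iii) is lex-better at Max state r
```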
Notice that with the kind of objectives considered, we can easily swap the roles of Max and Min by exchanging safety objectives with reachability and vice versa. It is thus no loss of generality to consider subsequently introduced notions, such as optimal strategies, only from the perspective of Max. We stress that counter-strategies of Min depend on the strategy chosen by Max.

Locally Lex-Optimal Strategies. An MD strategy σ of Max (Min, resp.) is called locally lex-optimal if for all s ∈ S_□ (s ∈ S_♦, resp.) and a ∈ Act(s), σ(s)(a) > 0 implies that action a is lex-optimal. Thus, locally lex-optimal strategies only assign positive probability to lex-optimal actions.

Convention. For the rest of the paper, unless stated otherwise, we use G = (S_□, S_♦, Act, P) to denote an SG and Ω = (Ω_1, …, Ω_n) is a suitable (not necessarily absorbing) lex-objective, that is, each Ω_i is either Reach(S_i) or Safe(S_i) for some S_i ⊆ S.

In this section, we show how to compute the lexicographic value for SGs where all target sets are absorbing. We first show various theoretical results in Sect. 3.1, upon which the algorithm for computing the values and optimal strategies presented in Sect. 3.2 is then built. The main technical difficulty arises from interleaving reachability and safety objectives. In Sect. 4, we will reduce solving general (not necessarily absorbing) SGs to the case with absorbing targets.

This first subsection derives a characterization of lex-optimal strategies in terms of local optimality and an additional reachability condition (Lemma 2 further below). It is one of the key ingredients for the correctness of the algorithm presented later and also gives rise to a (non-constructive) proof of existence of MD lex-optimal strategies in the absorbing case. We begin with the following lemma that summarizes some straightforward facts we will frequently use. Recall that a strategy is locally lex-optimal if it only selects actions with optimal lex-value.

Lemma 1. (a) If σ ∈ Σ_Max^MD is lex-optimal and τ ∈ Σ_Min^MD is a lex-optimal counter-strategy against σ, then σ and τ are both locally lex-optimal. (We do not yet claim that such strategies σ, τ always exist.) (b) Let G′ be obtained from G by removing all actions (of both players) that are not locally lex-optimal, and let v′_lex be the lex-values in G′. Then v′_lex = v_lex.

Proof (Sketch). Both claims follow from the definitions of lex-value and lex-optimal strategy. For (b) in particular, we show that a strategy using actions which are not lex-optimal can be transformed into a strategy that achieves a greater (lower, resp.) value. Thus removing the non-lex-optimal actions does not affect the lex-value. See [19, Appendix A.1] for more technical details.

Example 3. Consider again the SG from Fig. 1a. Recall the lex-values from Example 1. Now we remove the actions that are not locally lex-optimal. This means we drop the action that leads from p to s and the action that leads from r to t or u (Fig. 1b). Since these actions were not used by the lex-optimal strategies, the value in the modified SG is the same as that of the original game.

Example 4 (Locally lex-optimal does not imply globally lex-optimal). Note that we do not drop the action that leads from r to q, because v_lex(r) = v_lex(q), so this action is locally lex-optimal. In fact, a lex-optimal strategy can use it arbitrarily many times without reducing the lex-value, as long as it eventually picks the action leading to t or v. However, if we only played the action leading to q, the lex-value would be reduced to (0, 1), as we would not reach S_1, but would also avoid S_2. We stress the following consequence of this: Playing a locally lex-optimal strategy is not necessarily globally lex-optimal. It is not sufficient to just restrict the game to locally lex-optimal actions of the previous objectives and then solve the current one.

Note that in fact the optimal strategy for the second objective Safe(S_2) would be to remain in {p, q}; however, we must not pick this safety strategy before we have "tried everything" for all previous reachability objectives, in this case reaching S_1. This idea of "trying everything" for an objective Reach(S_i) is equivalent to the following: either reach the target set S_i, or reach a set of states from which S_i cannot be reached anymore. Formally, let Zero_i = {s ∈ S | v_lex,i(s) = 0}, where v_lex,i denotes the i-th component of the lex-value, be the set of states that cannot reach the target set S_i anymore. Note that it depends on the lex-value, not the single-objective value. This is important, as the single-objective value could be greater than 0, but a more important objective has to be sacrificed to achieve it. We define the set of states where we have "tried everything" for all reachability objectives as follows:

Definition 4 (Final set). The final set is F = ⋃_{i : Ω_i = Reach(S_i)} S_i ∪ ⋂_{k : Ω_k = Reach(S_k)} Zero_k.

The final set contains all target states as well as the states that have lex-value 0 for all reachability objectives; we need the intersection of the sets Zero_k because, as long as a state still has a positive probability to reach any target set, its optimal behaviour is to try that.

Example 5. In the SG of Fig. 1, we have Zero_1 = {u, v, w} and thus F = Zero_1 ∪ S_1 = {s, t, u, v, w}. An MD lex-optimal strategy of Max must almost-surely reach this set against any strategy of Min; only then has it "tried everything".
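Given already-computed lex-values, the final set can be computed directly; the sketch below is illustrative (the names final_set, objectives, targets are ours; the example data assumes, as in Example 2, that S_1 = {s, t}).

```python
def final_set(states, v_lex, objectives, targets):
    """Definition 4: the union of all reachability target sets, united
    with the intersection of Zero_k over all reachability indices k."""
    reach = [i for i, kind in enumerate(objectives) if kind == 'reach']
    F = set().union(*(targets[i] for i in reach))
    common_zero = set(states)
    for i in reach:
        common_zero &= {s for s in states if v_lex[s][i] == 0}  # Zero_i
    return F | common_zero

# Example 5: one reachability objective, assumed target S_1 = {s, t}.
v = {'s': (1, 1), 't': (1, 0), 'u': (0, 0), 'v': (0, 0.5), 'w': (0, 1),
     'p': (0.5, 0.25), 'q': (0.5, 0.25), 'r': (0.5, 0.25)}
assert final_set(set(v), v, ['reach', 'safe'], {0: {'s', 't'}}) \
       == {'s', 't', 'u', 'v', 'w'}
```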
The following lemma characterizes MD lex-optimal strategies in terms of local lex-optimality and the final set.

Lemma 2. Let Ω be absorbing and σ ∈ Σ_Max^MD. Then σ is lex-optimal for Ω if and only if σ is locally lex-optimal and for all s ∈ S we have

    ∀τ ∈ Σ_Min^MD : P^{σ,τ}_s(Reach(F)) = 1.    (⋆)

Proof (Sketch). The "if"-direction is shown by induction on the number n of targets. We make a case distinction according to the type of Ω_n: If it is safety, then we prove that local lex-optimality is already sufficient for global lex-optimality. Else, if Ω_n is reachability, then intuitively, the additional condition (⋆) ensures that the strategy σ indeed "tries everything" and either reaches the target S_n or eventually a state in Zero_n where the opponent Min can make sure that Max cannot escape. The technical details of these assertions rely on a fixpoint characterization of the reachability probabilities combined with the classic Knaster-Tarski Fixpoint Theorem [44] and are given in [19, Appendix A.2]. For the "only if"-direction, recall that lex-optimal strategies are necessarily locally lex-optimal by Lemma 1 (a). Further, let i be such that Ω_i = Reach(S_i) and assume for contradiction that σ remains forever within S \ (S_i ∪ Zero_i) with positive probability against some strategy of Min. But then σ visits states with positive lex-value for Ω_i infinitely often without ever reaching S_i. Thus σ is not lex-optimal, contradiction.

Finally, this characterization allows us to prove that MD lex-optimal strategies exist for absorbing objectives.

Theorem 1. For an absorbing lex-objective Ω, there exist MD lex-optimal strategies for both players.

Proof (Sketch). We consider the subgame G′ obtained by removing lex-suboptimal actions of both players and then show that the (single-objective) value of Reach(F) in G′ equals 1. An optimal MD strategy for Reach(F) exists [26]; further, it is locally lex-optimal, because we are in G′, and it reaches F almost surely. Thus, it is lex-optimal for Ω by the "if"-direction of Lemma 2. See [19, Appendix A.3] for more details on the proof.

Theorem 1 is not constructive because it relies on the values v_lex without showing how to compute them. Computing the values and constructing an optimal strategy for Max in the case of an absorbing lex-objective is the topic of this subsection.

A quantified reachability objective (QRO) is determined by a function q : S′ → [0, 1] where S′ ⊆ S. For all strategies σ and τ, we define:

    P^{σ,τ}_s(Reach(q)) = Σ_{t∈S′} q(t) · P^{σ,τ}_s((S \ S′) U {t}).

Intuitively, a QRO generalizes its standard Boolean counterpart by additionally assigning a weight to the states in the target set S′. Thus the probability of a QRO is obtained by computing the sum of the q(t), t ∈ S′, weighted by the probability to avoid S′ until reaching t. Note that this probability does not depend on what happens after reaching S′; so it is unaffected by making all states in S′ absorbing. In Sect. 4, we need the dual notion of a quantified safety property, defined as P^{σ,τ}_s(Safe(q)) = 1 − P^{σ,τ}_s(Reach(q)); intuitively, this amounts to minimizing the reachability probability.

Remark 1. A usual reachability property Reach(S′) is a special case of a quantified one with q(s) = 1 for all s ∈ S′. Vice versa, quantified properties can easily be reduced to usual ones defined only by the set S′: Convert all states t ∈ S′ into sinks, then for each such t prepend a new state t′ with a single action a and P(t′, a)(t) = q(t) and P(t′, a)(⊥) = 1 − q(t), where ⊥ is a sink state. Finally, redirect all transitions leading into t to t′. Despite this equivalence, it turns out to be convenient and natural to use QROs.
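The reduction of Remark 1 is mechanical; the following sketch implements it on the transition-dictionary encoding used in the earlier code blocks (the names qro_to_reach, 'BOT', 'enter' are ours, not the paper's, and corner cases such as an initial state inside S′ are not handled).

```python
def qro_to_reach(trans, q, bot='BOT'):
    """Reduce Reach(q) with q : S' -> [0,1] to ordinary reachability:
    make each t in S' absorbing, prepend a fresh state pre(t) whose
    single action goes to t with probability q(t) and to a sink with
    probability 1 - q(t), and redirect transitions into t to pre(t)."""
    pre = {t: ('pre', t) for t in q}
    new_trans = {}
    for (s, a), dist in trans.items():
        if s in q:
            continue                       # t becomes a sink, see below
        new_trans[(s, a)] = {pre.get(t, t): p for t, p in dist.items()}
    for t, w in q.items():
        d = {}
        if w > 0:
            d[t] = w
        if w < 1:
            d[bot] = 1 - w
        new_trans[(pre[t], 'enter')] = d
        new_trans[(t, 'stay')] = {t: 1.0}  # t is now absorbing
    new_trans[(bot, 'stay')] = {bot: 1.0}
    return new_trans, set(q)               # new target set: S' itself
```

With this, any single-objective reachability solver can be called on a QRO, which the algorithm below relies on.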
Example 6 (QRO). Example 4 illustrated that solving a safety objective after a reachability objective can lead to problems, as the optimal strategy for Safe(S_2) did not use the action that actually reaches S_1. In Example 5 we indicated that the final set F = {s, t, u, v, w} has to be reached almost surely, and among those states the ones with the highest safety values should be preferred. This can be encoded in a QRO as follows: Compute the values for the Safe(S_2) objective for the states in F. Then construct the function q_2 : F → [0, 1] that maps all states in F to their safety value, i.e., q_2 : s ↦ 1, t ↦ 0, u ↦ 0, v ↦ 1/2, w ↦ 1.

Thus, using QROs, we can effectively reduce (interleaved) safety objectives to quantified reachability objectives:

Lemma 3. Let Ω = (Ω_1, …, Ω_n) be an absorbing lex-objective with Ω_n = Safe(S_n). Let q_n : F → [0, 1] map every state in F to its value for Ω_n, where F is the final set (Definition 4), and let Ω′ = (Ω_1, …, Ω_{n−1}, Reach(q_n)). Then: v^Ω_lex = v^{Ω′}_lex.

Proof (Sketch). By definition, v^Ω_lex(s) = v^{Ω′}_lex(s) for all s ∈ F, so we only need to consider the states in S \ F. Since any lex-optimal strategy for Ω or Ω′ must also be lex-optimal for the more important objectives Ω_1, …, Ω_{n−1}, one can then show that the values agree on S \ F as well.

The overall algorithm treats the objectives sequentially in order of importance; we summarize its main ingredients:
- Loop invariant: In the i-th iteration of the loop, G is the original SG restricted to only those actions that are locally lex-optimal for the targets 1 to (i−1); this is the case because Line 11 was executed for all previous targets.
- Single-objective case: The single objective that is solved in Line 5 can be either reachability or safety. We can use any (precise) single-objective solver as a black box, e.g. strategy iteration [36]. Recall that by Remark 1, it is no problem to call a single-objective solver with a QRO since there is a trivial reduction.
- QRO for safety: If an objective is of type reachability, no further steps need to be taken; if, on the other hand, it is safety, we need to ensure that the problem explained in Example 4 does not occur. Thus we compute the final set F and replace the safety objective with the corresponding QRO (cf. Lemma 3), as in the sketch below.
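Under the same illustrative encoding as the earlier sketches, the main loop could look as follows; the helper functions passed as arguments are assumptions standing in for a precise single-objective solver (e.g. strategy iteration), Definition 4, Lemma 3's QRO construction, and the action removal of Line 11, so this is a sketch of the loop structure, not the paper's pseudocode.

```python
def lex_solve(game, objectives, solve_single, final_set, to_qro, restrict):
    """Sketch of the main loop for absorbing lex-objectives. Assumed helpers:
    solve_single(game, obj) -> (values, locally lex-optimal actions),
    final_set(game, values_so_far)          # Definition 4
    to_qro(game, obj, F, values_so_far)     # Lemma 3
    restrict(game, opt_actions)             # drop non-locally-optimal actions
    """
    values = []
    for kind, target in objectives:
        obj = (kind, target)
        if kind == 'safe':
            # avoid the pitfall of Example 4: first compute the final set,
            # then solve the safety objective as a QRO on F
            F = final_set(game, values)
            obj = ('reach', to_qro(game, (kind, target), F, values))
        vals, opt_actions = solve_single(game, obj)
        values.append(vals)
        game = restrict(game, opt_actions)
    return values, game   # game now contains only locally lex-optimal actions
```

Returning the restricted game allows extracting MD lex-optimal strategies: by Lemma 2, any locally lex-optimal MD strategy in it that additionally reaches F almost surely is lex-optimal.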