authors: Barfuss, Wolfram; Mann, Richard P.
title: Modeling the effects of environmental and perceptual uncertainty using deterministic reinforcement learning dynamics with partial observability
date: 2021-09-15
journal: Phys Rev E
DOI: 10.1103/physreve.105.034409

Assessing the systemic effects of uncertainty that arises from agents' partial observation of the true states of the world is critical for understanding a wide range of scenarios. Yet, previous modeling work on agent learning and decision-making either lacks a systematic way to describe this source of uncertainty or puts the focus on obtaining optimal policies using complex models of the world that would impose an unrealistically high cognitive demand on real agents. In this work we aim to efficiently describe the emergent behavior of biologically plausible and parsimonious learning agents faced with partially observable worlds. Therefore we derive and present deterministic reinforcement learning dynamics where the agents observe the true state of the environment only partially. We showcase the broad applicability of our dynamics across different classes of partially observable agent-environment systems. We find that partial observability creates unintuitive benefits in a number of specific contexts, pointing the way to further research on a general understanding of such effects. For instance, partially observant agents can learn better outcomes faster, in a more stable way and even overcome social dilemmas. Furthermore, our method allows the application of dynamical systems theory to partially observable multiagent learning. In this regard we find the emergence of catastrophic limit cycles, a critical slowing down of the learning processes between reward regimes and the separation of the learning dynamics into fast and slow directions, all caused by partial observability. Therefore, the presented dynamics have the potential to become a formal, yet practical, lightweight and robust tool for researchers in biology, social science and machine learning to systematically investigate the effects of interacting partially observant agents.

We do not observe the world as it is, but instead as our limited sensory and cognitive apparatus perceives it. There are always elements of the world that are hidden from us, such as the detailed physical state of our environment and the internal states of other agents. As such, uncertainty is a fundamental feature of life. To be more specific, we might not know what will happen (stochastic uncertainty), what currently is (state uncertainty) and what others are going to do (strategic uncertainty), among other forms of uncertainty (1-3). In common with other animals, we must learn and make decisions amid this uncertainty using the limited cognitive resources available to us. So must everybody else. Given the cognitive demands of fully integrating all sources of uncertainty when learning from experience and making decisions, real agents must employ methods of bounded rationality (4) that use cognitive resources efficiently to obtain acceptable solutions in a timely manner (5). As such, evolutionary game theory (6) accounts for strategic uncertainty not by assuming that other agents are perfectly rational, but by allowing agents to adapt to each other sequentially, with relatively successful strategies being reinforced and less successful strategies selected against.
Tools and methods from evolutionary game theory have also been used successfully to formally study the dynamics of multiagent reinforcement learning (7, 8) . Börgers and Sarin (9) established the formal relationship between the learning behavior of one of the most basic reinforcement learning schemes, Cross learning (10) , and the replicator dynamics of evolutionary game theory. Since then, this approach of evolutionary reinforcement learning dynamics has been extended to stateless Q-learning (11, 12) , regret-minimization (13) and temporal-difference learning (14) , as well as discrete-time dynamics (15) , continuous strategy spaces (16) and extensive-form games (17) . This learning dynamic approach offers a formal, lightweight and deterministically reproducible way to gain improved, descriptive insights into the emerging multiagent learning behavior. Apart from strategic uncertainty, representing stochastic uncertainty, i.e., uncertainty about what will happen in the form of probabilistic events within the environment, requires foremost the presence of an environment. Recent years have seen a growing interest to move evolutionary and learning dynamics in stateless games to changing environments. Here, the term environment can mean external fluctuations (18, 19) , a varying population density (20, 21) , spatial network structure (22, 23) , or coupled systems out of evolutionary and environmental dynamics. Coupled systems may further be categorized into those with continuous environmental state spaces (24) (25) (26) (27) (28) or discrete ones (14, (29) (30) (31) . We will focus on learning dynamics in stochastic games (14, 29) which encode stochastic uncertainty via action-depended transition probabilities between environmental states. However, all dynamics discussed so far are either applicable only to stateless environments, assume that agents do not tailor their response to the current environmental state, or if they do, assume that agents observe the true states of the environment perfectly. Yet, often in real-world settings state observations are noisy and incomplete. Thus, they are lacking a systematic way to describe interacting agents under state uncertainty. In this work, we relax the assumption of perfect observations and introduce deterministic reinforcement learning dynamics for partially observable environments. With the derived dynamics we are able to study the idealized reinforcement learning behavior in a wide range of environmen-tal classes, from partially observable Markov decision processes (POMDPs, 32), decentralized (Dec-)POMDPs (33) , and fully general partially observable stochastic games (34) . Note that while a great deal of works on partially observable decision domains is of normative nature, ours is descriptive. For the normative agenda, agents are often enriched with, e.g., generative models and belief-state representations (32, 33) , abstractions (35) or predictive state representations (36) in order to learn optimal policies in partially observable decision domains. Also the economic value of signal is often studied by asking how fully rational agents optimally deal with a specific form of state uncertainty (37) . However, such techniques can become computationally extremely expensive (38) . It is unlikely that biological agents perform those elaborate calculations (39) and the focus on unboundedly rational game equilibria lacks a dynamic perspective (40) making it unable to answer which equilibrium (of the often many) the agents select. 
Instead, this work takes a dynamical systems perspective on individual learning agents employing the widely occurring principle of temporal-difference reinforcement learning (41), in which the agents simply treat their observations as if they were the true states of the environment. Temporal-difference learning is not only a computational technique (42), it also occurs in biological agents through the dopamine reward-prediction error signal (43, 44). We focus on agents which employ either so-called memoryless policies, with which they choose their actions based solely on their current observation (45), or a short and fixed history of current and past observations and actions on which to base the current action. Such policies have the advantage of being simple to act on (46) and are easy to realize at little or no additional computational cost. To highlight the broad applicability of our dynamics we study the emerging learning behavior across five partially observable environment classes. We find a variety of effects caused by partial observability which generally depend on the environment and its representation. For instance, partial observability can lead to better learning outcomes faster in a single-agent renewable resource harvesting task, stabilize a chaotic learning process in a multistate zero-sum game and even overcome social dilemmas. Compared to fully observant agents, partially observant learning often requires more exploration and less weight on future rewards to obtain the favorable learning outcomes. Furthermore, our method allows the application of dynamical systems theory to partially observable multiagent learning. We find that partial observability can cause the emergence of catastrophic limit cycles, a critical slowing down of the learning processes between reward regimes and the separation of the learning dynamics into fast and slow eigendirections. We hope that the presented dynamics become a practical, lightweight and robust tool to systematically investigate the effect of uncertainty of interacting agents.

2.1 Partially observable stochastic games

Definition. The game G = ⟨N, S, A, O, T, R, O⟩ is a stochastic game with N ∈ ℕ agents. The environment consists of Z ∈ ℕ states, S = (S_1, ..., S_Z). In each state s, each agent i ∈ {1, ..., N} has M ∈ ℕ available actions A^i = (A^i_1, ..., A^i_M) to choose from. A = ×_i A^i is the joint-action set and agents choose their actions simultaneously. A joint action is denoted by a = (a^1, ..., a^N) ∈ A. With a^{-i} = (a^1, ..., a^{i-1}, a^{i+1}, ..., a^N) we denote the joint action except agent i's. We chose an identical number of actions for all states and all agents out of notational convenience. Throughout this paper, we restrict ourselves to ergodic environments without absorbing states. The transition function T: S × A × S → [0, 1] determines the probabilistic state changes. T(s, a, s′) is the transition probability from current state s to next state s′ under joint action a. The reward function R: S × A × S → ℝ^N maps the triple of current state s, joint action a and next state s′ to an immediate reward scalar for each agent. R^i(s, a, s′) is the reward agent i receives. Instead of observing the states s ∈ S directly, each agent i perceives one of Q ∈ ℕ observations O^i = (O^i_1, ..., O^i_Q) according to the observation function O^i: S × O^i → [0, 1], where O^i(s, o^i) is the probability that agent i observes o^i when the environment is in state s. We chose an identical number of observations for all agents out of notational convenience. By construction, this observation function can model both noisy state observations (Q = Z) and hidden states (Q < Z).

Policies.
We consider agents that choose their actions probabilistically according to their memoryless policy X^i: O^i × A^i → [0, 1], where X^i(o^i, a^i) is the probability that agent i chooses action a^i when it observes o^i.

Histories. Besides memoryless policies we also consider policies with fixed histories H_h of type h. The type h is composed of h = h_o × h_a with h_o ∈ ℕ and h_a ∈ ℕ^N. Here h_o represents how many current and past observations are used to encode the histories. Likewise, h_a represents how many past actions of each agent are encoded in the histories. For example, the default memoryless policy is of type h = (1, 0). Practically, histories induce an embedding of the game into a larger state space in which the histories H_h correspond to the larger state set, and transitions, rewards and observations are adjusted accordingly.

Temporal-difference Q-learning is one of the most widely studied reinforcement learning processes (42, 43, 47). Agents successively improve their evaluations of the quality of the available actions. Originally developed under the assumption that agents can observe the true Markov state of the environment, we here present the basic temporal-difference Q-learning algorithm in the more general formulation, where agents use observations instead of states. When observations exactly map onto the states, the original algorithm is recovered. At time step t, agent i evaluates action a^i at observation o^i to be of quality Q^i_t(o^i, a^i). Those observation-action values are updated according to

$$Q^i_{t+1}(o^i_t, a^i_t) = Q^i_t(o^i_t, a^i_t) + \alpha\, \delta^i_t, \qquad (1)$$

with the temporal-difference error

$$\delta^i_t = (1-\gamma)\, r^i_t + \gamma \max_b Q^i_t(o^i_{t+1}, b) - Q^i_t(o^i_t, a^i_t). \qquad (2)$$

The discount factor parameter γ ∈ [0, 1) regulates how much the agent cares for future rewards. The learning rate parameter α ∈ (0, 1) regulates how much new information is used for an observation-action-value update. For the sake of simplicity, we assume identical parameters across agents throughout this paper and therefore do not equip parameters with agent indices. The variable r^i_t refers to the immediate reward at time step t. Note that the (1 − γ) prefactor in front of the reward occurs when we assume that agents aim to maximize a return defined as $G^i_t = (1-\gamma) \sum_{k=0}^{\infty} \gamma^k r^i_{t+k}$ (14). This leads the values to be on the same scale as the rewards.

Agents select actions based on the current observation-action values Q^i_t(o^i, a^i), balancing exploitation (i.e., selecting the action of maximum quality) and exploration (i.e., selecting lower-quality actions in order to learn more about the environment). We here use the widely used Boltzmann policy function. The probability of choosing action a^i under observation o^i is

$$X^i_t(o^i, a^i) = \frac{\exp[\beta\, Q^i_t(o^i, a^i)]}{\sum_{b} \exp[\beta\, Q^i_t(o^i, b)]}, \qquad (3)$$

where the intensity of choice parameter β controls the exploration-exploitation trade-off. Throughout this paper, we are interested in the idealized learning process with fixed parameters α, β and γ throughout learning and evaluating a policy.

In this section we derive the deterministic reinforcement learning dynamics under partial observability in discrete time. As classic evolutionary dynamics operate in the theoretical limit of an infinite population, the learning dynamics are derived by considering an infinite memory batch (48, 49). A learning dynamics update of the current policy uses policy averages instead of individual samples. Thus, we need to construct the policy-average temporal-difference error δ̄^i to be inserted in the update for the joint policy,

$$X^i_{t+1}(o^i, a^i) = \frac{X^i_t(o^i, a^i)\, \exp[\alpha \beta\, \bar{\delta}^i(o^i, a^i)]}{\sum_b X^i_t(o^i, b)\, \exp[\alpha \beta\, \bar{\delta}^i(o^i, b)]}. \qquad (4)$$

The challenge is that the rewards R^i(s, a, s′) in the stochastic game model depend on the true states, not on the observations of the agents. Thus, in order to obtain the average observation-action rewards R̄^i(o^i, a^i), we need a mapping from observations to states.
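For reference, Eqs. (1)-(3) describe an ordinary sample-based learner; the minimal sketch below is not the authors' implementation — the environment interface (env.reset()/env.step()) and all array shapes are assumptions made for illustration. The derivation that follows replaces the sampled quantities in this loop by their policy averages.

```python
import numpy as np

def boltzmann_policy(Q_row, beta):
    """Eq. (3): action probabilities from one row of observation-action values."""
    prefs = np.exp(beta * (Q_row - Q_row.max()))  # subtract max for numerical stability
    return prefs / prefs.sum()

def q_learning_step(Q, obs, act, reward, next_obs, alpha, gamma):
    """Eqs. (1)-(2): one temporal-difference update of the observation-action values."""
    td_error = (1 - gamma) * reward + gamma * Q[next_obs].max() - Q[obs, act]
    Q[obs, act] += alpha * td_error
    return Q

# Hypothetical usage; Q has one row per observation and one column per action.
# Q = np.zeros((n_observations, n_actions))
# obs = env.reset()
# for t in range(T):
#     act = np.random.choice(n_actions, p=boltzmann_policy(Q[obs], beta))
#     next_obs, reward = env.step(act)   # the agent never sees the true state
#     Q = q_learning_step(Q, obs, act, reward, next_obs, alpha, gamma)
#     obs = next_obs
```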
The observation function is a mapping from states to observations. With Bayes' rule we can transform the observation function into a belief function,

$$\bar{B}^i(o^i, s) = \frac{O^i(s, o^i)\, \bar{P}(s)}{\sum_{s'} O^i(s', o^i)\, \bar{P}(s')}, \qquad (5)$$

following the rules of probability. B̄^i(o^i, s) is the belief of agent i (or simply the probability) that the environment is in state s when it observed observation o^i. The only problem is how to obtain the policy-average stationary state distribution P̄(s). P̄(s) is the left eigenvector of the average transition matrix T̄(s, s′), where the entry T̄(s, s′) denotes the probability of transitioning from state s to state s′. This matrix could be obtained as $\bar{T}(s, s') = \sum_{a} \prod_j \bar{Y}^j(s, a^j)\, T(s, a, s')$ if we had the probability Ȳ^j(s, a^j) for each agent j to choose action a^j in state s. However, we assumed that agents condition their actions only on observations, X^j(o^j, a^j). Yet, whenever the environment is in state s, agent j observes observation o^j with probability O^j(s, o^j) and then chooses action a^j with probability X^j(o^j, a^j). Thus, with

$$\bar{Y}^j(s, a^j) = \sum_{o^j} O^j(s, o^j)\, X^j(o^j, a^j), \qquad (6)$$

we can average out the observation and obtain the policy-average state policies Ȳ^j(s, a^j). Note that Ȳ^j(s, a^j) are proper conditional probabilities, which can be seen by applying $\sum_{a^j}$ to both sides of Eq. 6. With Ȳ^j(s, a^j) we can then compute the policy-average transition matrix T̄(s, s′), its left eigenvector, the stationary state distribution P̄(s), and thus the policy-average belief B̄^i(o^i, s) of agent i that the environment is in state s when it observed observation o^i.

First, whenever agent i observes observation o^i, with probability B̄^i(o^i, s) the environment is in state s, where all other agents j ≠ i behave according to Ȳ^j(s, a^j), the environment transitions to a next state s′ with probability T(s, a, s′), and agent i receives the reward R^i(s, a, s′). Mathematically, the policy-average reward for action a^i under observation o^i reads

$$\bar{R}^i(o^i, a^i) = \sum_s \bar{B}^i(o^i, s) \sum_{a^{-i}} \prod_{j \neq i} \bar{Y}^j(s, a^j) \sum_{s'} T(s, a, s')\, R^i(s, a, s'). \qquad (7)$$

Second, the policy-average of the quality of the next observation is computed by averaging over all states, all actions of the other agents, next states and next observations. Whenever agent i observes observation o^i, the environment is in state s with probability B̄^i(o^i, s). There, all other agents j ≠ i choose their action a^j with probability Ȳ^j(s, a^j). Consequently, the environment transitions to the next state s′ with probability T(s, a, s′). At s′, the agent observes observation o′ with probability O^i(s′, o′) and estimates the quality to be of value max_b Q̄^i(o′, b). Mathematically, we write

$$\overline{{\max}Q}^i(o^i, a^i) = \sum_s \bar{B}^i(o^i, s) \sum_{a^{-i}} \prod_{j \neq i} \bar{Y}^j(s, a^j) \sum_{s'} T(s, a, s') \sum_{o'} O^i(s', o')\, \max_b \bar{Q}^i(o', b). \qquad (8)$$

Here, we replace the quality estimates Q^i_t(o^i, a^i), which evolve in time t (Eq. 1), with the policy-average observation-action quality Q̄^i(o^i, a^i), which is the expected discounted sum of future rewards from executing action a^i at observation o^i and then following along the joint policy X. It is obtained as a discount factor weighted average of the current policy-average reward R̄^i(o^i, a^i) and the policy-average observation quality V̄^i(o′) of the next observation o′, where all other agents j ≠ i select action a^j with probability Ȳ^j(s, a^j), the environment transitions to the next state s′ with probability T(s, a, s′), and agent i observes o′ with probability O^i(s′, o′),

$$\bar{Q}^i(o^i, a^i) = (1-\gamma)\, \bar{R}^i(o^i, a^i) + \gamma \sum_s \bar{B}^i(o^i, s) \sum_{a^{-i}} \prod_{j \neq i} \bar{Y}^j(s, a^j) \sum_{s'} T(s, a, s') \sum_{o'} O^i(s', o')\, \bar{V}^i(o'). \qquad (9)$$

Further, at Eq. 9, V̄^i(o^i) is the policy-average observation quality, i.e., the expected discounted sum of future rewards from observation o^i and then following along the joint policy X.
They are computed via matrix inversion according to

$$\bar{V}^i(\underline{o}) = (1-\gamma) \left[ \mathbb{1}_Q - \gamma\, \bar{T}^i(\underline{o}, \underline{o}) \right]^{-1} \cdot \bar{R}^i(\underline{o}). \qquad (10)$$

This equation is a direct conversion of the Bellman equation

$$\bar{V}^i(o^i) = (1-\gamma)\, \bar{R}^i(o^i) + \gamma \sum_{o'} \bar{T}^i(o^i, o')\, \bar{V}^i(o'), \qquad (11)$$

which expresses that the value of the current observation is the discount factor weighted average of the current reward and the value of the next observation. Underlined observation variables indicate that the corresponding object is a vector or matrix, and $\mathbb{1}_Q$ is a Q-by-Q identity matrix. T̄^i(o, o′) denotes the policy-averaged transition matrix for agent i. The entry T̄^i(o^i, o′) indicates the probability that agent i will observe observation o′ after observing observation o^i at the previous time step, given all agents follow the joint policy X. We compute them by averaging over all states, all actions from all agents and all next states,

$$\bar{T}^i(o^i, o') = \sum_s \bar{B}^i(o^i, s) \sum_{a} \prod_{j} \bar{Y}^j(s, a^j) \sum_{s'} T(s, a, s')\, O^i(s', o'). \qquad (12)$$

For any observation o^i, B̄^i(o^i, s) is the probability to be in state s, where all agents j act according to Ȳ^j(s, a^j). Therefore, the environment transitions with probability T(s, a, s′) from state s to the next state s′, which is observed by agent i as observation o′ with probability O^i(s′, o′). T̄^i(o^i, o′) is a proper probabilistic matrix. This can be seen by applying $\sum_{o'}$ to both sides of Eq. 12. Further, in Eq. 11, R̄^i(o^i) denotes the policy-average reward agent i obtains from observation o^i. We compute them by averaging over all states, all actions from all agents and all next states. Whenever agent i observes observation o^i, the environment is in state s with probability B̄^i(o^i, s). Here, all agents j choose action a^j with probability Ȳ^j(s, a^j). Hence, the environment transitions to the next state s′ with probability T(s, a, s′) and agent i receives the reward R^i(s, a, s′),

$$\bar{R}^i(o^i) = \sum_s \bar{B}^i(o^i, s) \sum_{a} \prod_{j} \bar{Y}^j(s, a^j) \sum_{s'} T(s, a, s')\, R^i(s, a, s'). \qquad (13)$$

Note that the quality $\overline{{\max}Q}^i(o^i, a^i)$ depends on o^i and a^i although it is the policy-averaged maximum observation-action value of the next observation. All together, the policy-average temporal-difference error, to be inserted into Eq. 4, reads

$$\bar{\delta}^i(o^i, a^i) = (1-\gamma)\, \bar{R}^i(o^i, a^i) + \gamma\, \overline{{\max}Q}^i(o^i, a^i) - \bar{Q}^i(o^i, a^i). \qquad (14)$$

We study the emerging learning dynamics across five test environments: three single-agent decision problems and two multiagent games. Three environments cover noisy observations; the other two focus on a reduced observation space, where a given observation is consistent with multiple true states of the world. As one evaluation metric we use the average reward, $\sum_s \bar{P}(s)\, \bar{R}^i(s)$, where P̄(s) is the stationary state distribution and $\bar{R}^i(s) = \sum_{a} \prod_j \bar{Y}^j(s, a^j) \sum_{s'} T(s, a, s')\, R^i(s, a, s')$ is the average reward for each state given the current policy X (see Sec. 3). We defined a learning trajectory as having converged if the norm between the old and the updated policy (according to Eq. 4) is below 10^{-5}. Since we defined the return with the (1 − γ) prefactor, we also consider a scaled version of the intensity of choice parameter, β = β′/(1 − γ), for some experiments. Doing so preserves the ratio of exploration and exploitation in the temporal-difference error (Eq. 14) under changes in the discount factor γ.

Environment description. The first environment is a simple coordination task in which the agent must move between the left and right environmental state in order to obtain a maximum reward of 1. Coordinating which of the two available actions (Left, Right) to choose is complicated by observational noise ν, letting the agent perceive the correct state only with probability 1 − ν (Fig. 1 A). This environment is adapted from Singh et al. (45).
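As a concrete, self-contained illustration of how the policy-average quantities above can be assembled, the sketch below encodes one plausible reading of the two-state coordination task just described (the exact transition and reward tensors of the original environment may differ) and evaluates the single-agent case of Eqs. (5)-(14) plus one update of Eq. (4) with plain NumPy. All numerical values are illustrative.

```python
import numpy as np

# --- Environment tensors (our reading of the two-state coordination task) ---
nu = 0.49                 # observational noise level (e.g., the almost fully noisy case)
Z, M, Q = 2, 2, 2         # number of states, actions, observations
gamma, alpha, beta = 0.9, 0.01, 40.0

# T[s, a, s']: action 0 leads to state 0, action 1 to state 1 (deterministic moves)
T = np.zeros((Z, M, Z))
T[:, 0, 0] = 1.0
T[:, 1, 1] = 1.0

# R[s, a, s']: +1 for moving to the other state, -1 for staying (assumed values)
R = np.zeros((Z, M, Z))
R[0, :, 1] = R[1, :, 0] = 1.0
R[0, :, 0] = R[1, :, 1] = -1.0

# O[s, o]: the correct state is perceived with probability 1 - nu
O = np.array([[1 - nu, nu],
              [nu, 1 - nu]])

X = np.full((Q, M), 0.5)  # memoryless policy X[o, a], initialized uniformly

# --- Policy-average quantities (single agent, so products over others are empty) ---
Ybar = O @ X                                    # Eq. (6): policy-average state policy
Tbar = np.einsum('sa,sap->sp', Ybar, T)         # average state-to-state transitions

# stationary state distribution: left eigenvector of Tbar for eigenvalue 1
evals, evecs = np.linalg.eig(Tbar.T)
Pbar = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
Pbar /= Pbar.sum()

joint = O.T * Pbar[None, :]                     # joint[o, s] = O[s, o] * Pbar[s]
Bbar = joint / joint.sum(axis=1, keepdims=True) # Eq. (5): belief

Rbar_oa = np.einsum('os,sap,sap->oa', Bbar, T, R)          # Eq. (7)
Tbar_obs = np.einsum('os,sa,sap,pq->oq', Bbar, Ybar, T, O) # Eq. (12)
Rbar_o = np.einsum('os,sa,sap,sap->o', Bbar, Ybar, T, R)   # Eq. (13)

# Eq. (10): observation values by matrix inversion
Vbar = (1 - gamma) * np.linalg.inv(np.eye(Q) - gamma * Tbar_obs) @ Rbar_o

# Eqs. (9), (8), (14): qualities, next-observation maximum, policy-average TD error
Qbar = (1 - gamma) * Rbar_oa + gamma * np.einsum('os,sap,pq,q->oa', Bbar, T, O, Vbar)
maxQbar = np.einsum('os,sap,pq,q->oa', Bbar, T, O, Qbar.max(axis=1))
delta_bar = (1 - gamma) * Rbar_oa + gamma * maxQbar - Qbar

# Eq. (4): one step of the deterministic learning dynamics
X_new = X * np.exp(alpha * beta * delta_bar)
X_new /= X_new.sum(axis=1, keepdims=True)
```

Iterating the last two lines (recomputing the averages each step) until the policy change drops below 10^{-5} produces deterministic trajectories of the kind discussed next.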
Figure 1: (…) Panel B shows the corresponding policy spaces, in which the agent's probability of choosing Left, given that it perceived the environment to be in the left state, is plotted on the x-axes, and the agent's probability of choosing Left, given that it perceived the environment to be in the right state, is plotted on the y-axes. Five individual trajectories, whose initial policies were centered around the center of the policy space, are plotted in color. Arrows in gray indicate the flow of the learning dynamical system. Panel C shows the corresponding reward trajectories. Remaining hyperparameters were set as α = 0.01, γ = 0.9. Partial observability can cause the learning to enter low-rewarding limit cycles under a high intensity of choice.

Results. Figure 1 shows how partial observability can cause the deterministic learning dynamics to enter low-rewarding limit cycles under a high intensity of choice. Often, learning a policy involves a trade-off between the amount of reward from that policy and the amount of time required to learn it. In the simple coordination task with perfect observation, a high intensity of choice can speed up the learning process by a factor of 6. The trajectories with β = 40 require about 18 time steps to arrive at the optimal policy with average reward 1 (green lines, top row); the trajectories with β = 400 require only 3 time steps (green lines, bottom row). Thus, a high intensity of choice is clearly preferable under perfect observation.

With fully uninformative observations (observational noise level ν = 0.5, Fig. 1 B, third column), a more explorative agent (i.e., lower intensity of choice, top row) has an advantage. From all initial policies, it takes the agent about 580 time steps to learn to fully randomize its actions. This yields an average reward of zero and is also the optimal memoryless policy (45). The more exploitative agent (bottom row), on the other hand, enters a limit cycle between choosing Left and Right almost deterministically, irrespective of its current observations. Thus, while choosing Left the agent is trapped in the LEFT state, obtaining an average reward of ∼ −1. While choosing Right, the agent is trapped in the RIGHT state, also obtaining an average reward of ∼ −1. The positive reward obtained through the move between states is neglected, since the derived dynamics consider the theoretical limit of an infinite memory batch (48). This can also be interpreted as a complete separation of the interaction timescale and the adaptation timescale (14, 49). The agent experiences an infinite amount of negative reward during interaction and only a single positive reward after the policy adaptation. It will be interesting to reexamine this scenario under relaxed conditions, when interaction and adaptation timescales are not completely separated, in future work.

When observations are almost completely noisy, yet still contain some information about the true environmental state, the more exploitative agent learns a slightly more rewarding policy faster (Fig. 1 B, second column). An observational noise level of ν = 0.49 means that of 100 times being in the LEFT environmental state, the agent will observe on average left 51 times and right 49 times. Here, from all initial policies, the more explorative agent (top row) converges to a fixed point in the upper left part of the policy space in about 600 time steps, i.e., slower than under completely noisy observations. This policy yields an average reward of about 0.013.
The more exploitative agent (bottom row) learns on an interesting transient resembling the limit cycle of the fully uninformative case, yet manages to converge to the deterministic policy in the upper left of the policy space. This yields an average reward of about 0.02 and takes at most 250 times, depending on the initial policy. This is still distinctly faster than the more explorative agent. Overall, it is interesting to observe how partial observability caused a well-known dynamical-systems phenomenon in the learning dynamics, which can explain the abrupt improvements in the reward trajectories ( Fig. 1 C bottom) : the separation of the dynamics into a fast eigendirection along the diagonal from the bottom left to the top right of the policy space and a slow eigendirection perpendicular to that (50) . The slow eigendirection corresponds to a coordinated policy where the agent's observation is decisive for its actions. Along the fast eigendirection the agent's policy is independent of its observations. The more explorative agent moves along these axes whereas the more exploitative agent overrides. Yet, as long as there is some information in the observations about the environmental state, the more exploitative agent learns better policies faster. So far we examined only memoryless policies, i.e., policies that condition their choice of action only on the cur- rent observation. If the more exploitative agent (β = 400) is able to condition its choice of action not only on the current observations but also on its last action, then it learns the optimal policy with an average reward of 1 in at most only two time steps -even under fully uninformative state observations (Fig. 2) . The agent learns to alternate between Left and Right. This learned policy and even the whole learning dynamics do not depend on the stateobservation, as shown by the straight line trajectories and corresponding dynamical flow arrows in Fig. 2 . As a consistency check we compare the derived deterministic learning dynamics with partial observability to a sample-batch reinforcement learning algorithm, as detailed in Ref. (49) (Fig. 3) . The batch learning algorithm collects observation and reward experiences inside a batch of size K while keeping its policy fixed before it then updates its policy using the whole information of the collected batch. This is a widely occurring principle for improved data efficiency and learning stability (51) and is used for example in memory-replay (52) and model-based reinforcement learning (53) . Figure 3 shows that our deterministic theory describes such batch learning approaches well under large batch sizes. Yet the calculation time of the deterministic dynamics was in the order of 100 times faster than the simulation of the algorithms. Environment description. The next environment is the single-agent navigation task adapted from Parr and Russell's Grid World (1995) . It consists of 11 states, 6 observations, 4 actions and 1 agent (Fig. 4 A) . The agent can move north, south, east, and west. If the agent would move into a wall, then it stays on its current patch. The agent wants to reach the patch in the upper right, which is rewarded by a reward of 1. However, entering the patch below is punished by a reward of −1. In both cases, the episode ends and the agent begins a new episode on one randomly chosen patch out of the nine other patches. All other state-action combinations yield zero reward. 
We use this environment to compare the effect of various hyperparameter combinations on the learning behavior of an agent with partial observability with that of an agent with full observability. Under partial observability the agent can only observe whether or not there is a wall east and west of its current patch. Imagine, for example, a robot equipped only with haptic sensors on its sides or an insect with corresponding antennae. With full observability, the agent can distinguish each grid patch separately.

Figure 4: Learning trajectories under both partial (dashed lines) and full observability (straight lines) are plotted for various hyperparameter combinations from 15 random initial policies each. Panel A shows the grid world. The trajectories of the policies' action probabilities are projected into each observation/state, such that a deterministic policy toward one direction appears at the center of that direction's edge; stochastic policies appear inside the patches. Panel B shows the corresponding reward trajectories. In Panel C, two hyperparameter grids show the average reward at convergence for an agent with partial and full observability (with independent color scales for each case). The learning rate was set to α = 0.01. In contrast to a fully observant agent, neither a high weight on future rewards (large γ) nor a high intensity of choice (large β′) leads to the highest reward for a partially observant agent. Instead, the highest reward depends on the mutual combination of the two hyperparameters.

Results. In contrast to a fully observant agent, neither a high weight on future rewards (large γ) nor a high intensity of choice (large β′) leads to a large reward for a partially observant agent. Instead, the highest reward depends on the mutual combination of the two hyperparameters. For the hyperparameter combination γ = 0.99 and β′ = 50, an agent with full observability quickly learns the optimal policy (Fig. 4 A&B, light-red straight lines). Observe also how the light-red straight lines in states (3, 2) and (3, 3) avoid being close to the penalty state. In contrast, less weight on future rewards (a lower discount factor of γ = 0.4) and more exploration with β′ = 20 lead to a lower average reward at convergence (light-blue straight lines). Observe how the convergence points in policy space (light-blue dots) are increasingly farther away from the optimal policy (light-red dots) the farther the grid cell is from the goal. However, when we turn to the agent with partial observability, it is the other way around. Here, less weight on future rewards and more exploration lead to a better average reward at convergence (dark-colored dashed lines). This result can be explained as follows. In a fully observable Markov decision process there is always an optimal deterministic policy (55). See how the light-red straight lines converge to the edge of most grid cells, indicating a deterministic action in that direction (Fig. 4 A). Yet policies in a partially observable Markov decision process often require stochasticity (45). More exploration directly ensures that, although not in a reward-targeted way. Less weight on future rewards might be advantageous under partial observability since too much weight on too distant rewards in the future cannot pay off when there is a fundamental uncertainty about which state the agent occupies, or even about what the real states of the environment are. When the environment is only partially observable, anticipating too distant rewards does not have to be beneficial.
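The hyperparameter grids discussed next (Fig. 4 C) can be assembled, in outline, by iterating the deterministic policy update to convergence for every parameter combination. The sketch below is hypothetical scaffolding rather than the paper's code: learning_step, random_policy and average_reward stand in for an implementation of Eq. (4) and of the evaluation metric for the environment at hand, and the β = β′/(1 − γ) scaling follows the definition given above.

```python
import numpy as np

def run_to_convergence(X0, step_fn, tol=1e-5, max_steps=100_000):
    """Iterate the deterministic policy update (Eq. 4) until the change in the
    joint policy falls below tol -- the convergence criterion used in the text."""
    X = X0
    for t in range(max_steps):
        X_new = step_fn(X)
        if np.linalg.norm(X_new - X) < tol:
            return X_new, t
        X = X_new
    return X, max_steps

# Hypothetical sweep over the two hyperparameters, 15 random initial policies each:
# results = {}
# for gamma in [0.1, 0.4, 0.7, 0.9, 0.99]:
#     for beta_prime in [5, 10, 20, 50, 100]:
#         beta = beta_prime / (1 - gamma)      # scaled intensity of choice
#         rewards, times = [], []
#         for _ in range(15):
#             X, t = run_to_convergence(
#                 random_policy(),
#                 lambda X: learning_step(X, alpha=0.01, beta=beta, gamma=gamma))
#             rewards.append(average_reward(X))
#             times.append(t)
#         results[(gamma, beta_prime)] = (np.mean(rewards), np.mean(times))
```

Recording the convergence times alongside the rewards is what makes phenomena such as the critical slowing down discussed below visible in such a grid.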
A systematic analysis of the hyperparameter grid (Fig. 4 C) confirms that partial observability requires the right combination of the two hyperparameters in order to learn the highest reward. Under full observability, simply setting a sufficiently high weight on future rewards γ and a sufficiently strong intensity of choice β′ (i.e., little exploration) leads to the average reward of the optimal policy. In contrast, under partial observability too much farsightedness and too intense exploitation can hurt the performance of the agent. Instead, the optimal reward at convergence is obtained by a more randomly explorative and myopic agent.

With memoryless policies, the average reward of the partially observant agent is smaller by an order of magnitude compared to the fully observant agent. Figure 5 compares the results of the two simplest types of history, i.e., where the agent uses one more piece of information. Thus, the agent conditions its actions either on the current observation and the last action, h = (1, 1), or on the current and last observations, h = (2, 0). Both types of history are able to obtain a similar maximum average reward, with a slight advantage for history h = (1, 1). Although both are of the simplest type of history conceivable, the difference in maximum reward between the partially observant and the fully observant agent is already halved compared to the partially observant agent without history. Also, the set of hyperparameter combinations that obtain a high average reward is shifted to the lower right corner of the parameter space, where also the fully observant agent obtains its maximum. Interestingly, though, the set of high-rewarding hyperparameter combinations is not identical across the two types of history. The action-dependent history (h = (1, 1)) performs best with a high weight on future rewards γ and more exploration, whereas the two-observation history (h = (2, 0)) obtains the highest rewards by more exploitation across a wider range of weights on future rewards γ.

Figure 5: The plots on the left show results for an agent which conditions its action on the current observation and the last action, h = (1, 1). The plots on the right show results for an agent which conditions its action on the current and last observations, h = (2, 0). The top plots show the average reward at convergence, the bottom plots the time steps to convergence, each on the same color scale. Results are averaged over 15 Monte Carlo runs from random initial policies. The learning rate was set to α = 0.01. Both types of history obtain a similar maximum average reward, but at different hyperparameter combinations.

Furthermore, the learner experiences another dynamical systems phenomenon: a critical slowing down of its learning dynamics (50) before a hyperparameter bifurcation into the high-rewarding regime (Fig. 5, bottom row). In the area around the hyperparameter regions which obtain a high average reward (yellow areas in the plots of the top row), the number of time steps it takes the agent to converge is distinctly higher compared to other hyperparameter regions. Interestingly, this effect is absent in the memoryless learner of Fig. 4 (not shown). Utilizing such dynamical systems phenomena has the potential to improve the efficiency of hyperparameter search.

Environment description. Harvesting a renewable resource is a foundational challenge in environmental economics and the Earth and sustainability sciences (56-59). Here we use a standard logistic growth model, in which the (continuous) resource stock $\bar{s}_{t+1} = \bar{s}_t + r\, \bar{s}_t (1 - \bar{s}_t / C)$ first regrows exponentially with rate r ∈ ℝ until it saturates at capacity C ∈ ℕ.
In order to turn the stock-continuous logistic growth into a state-discrete Markov decision process, we discretize the continuous resource stock into the environmental states s ∈ {0, ..., C − 1}. The agent has three possible actions: harvest nothing, harvest a small amount, or harvest a large amount. What is small and large depends on the maximum amount, ∆s max , the resource regrows from environmental states S. The small harvest amounts to (1 − ∆E)∆s max , the large harvest amounts to (1 + ∆E)∆s max , with ∆E representing the deviation in the agent's harvesting effort. State transitions work as follows: The harvest amount is subtracted from the current stock state s t . The stock regrows according to the logistic growth equation, yielding a new hypothetical stocks t+1 . In order to avoid the complete depletion of the resource, the minimum hypothetical stock yields a value proportional to a base level s base . Since the agent should have an influence on the regrowth of the resource,s base is multiplied by (1 + ∆E) if the agent chose to harvest nothing, by (1 − ∆E) if the agent chose to harvest a little, and by 0 if the agent chose to harvest a lot. The resource stock is then discretized by a normal distribution arounds t+1 with variance σ 2 . The probability mass that lies between stock s t+1 − 0.5 and s t+1 + 0.5 gives the probability to transition to the new state s t+1 . (For s t+1 = 0 the lower bound is −∞, for s t+1 = C the upper bound is +∞.) Thus, σ represents the level of stochasticity within the environmental dynamics. The rewards are identical to the harvest amount. Harvesting a lot yields a higher immediate reward than harvesting a little. Except when the resource is degraded, i.e., either the current state s t or the next state s t+1 equals zero, then the rewards are only 10% of the harvest amount. Thus, the agent has always an immediate incentive to harvest more over a little. The optimal policy depends on the weight the agent puts on future rewards (by its discount factor γ). We use this environment to showcase how partial observability can be used to investigate the effect of different (imperfect) representations of the environment. We focus on representations under which the agent perceives several adjacent states as a single coherent observations. Figures 6 A & B illustrate the renewable resource harvesting environment and the investigated observation representations for capacity C = 5. Results. We find that inaccurate (reduced complexity) representations of the environment can lead to a better learning outcome faster, when compared to an agent which perceives the environment accurately (at full complexity) (Fig 6) . In the majority of cases an inaccurate representation of the environment leads to a speed out-performance in the order of 10%, i.e., a smaller number of time steps it takes the learner to converge to a fixed point. Only four representations of the 9-state environment take distinctly longer to converge. Overall, there is a slight tendency that simpler representations lead to faster convergence. Representations (dots) are ordered from the most complex, i.e., the accurate one, on the left to the simplest, i.e., perceiving all states as one, on the right (per environment). All top speed representations (dashed bars) cluster the resource stock 0 and 1 together but separate between stock 1 and 2. In the environments with capacity 8 and 9, a resource stock of 2 is represented completely separate by all top speed representations. 
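Before turning to the results in detail, here is a sketch of the state-transition construction just described. The treatment of ∆s_max (taken here as the maximum of the logistic regrowth term over the discrete states) and of the depletion floor follows our reading of the text, the parameter values are those reported for Fig. 6, and the original implementation may differ in details.

```python
import numpy as np
from scipy.stats import norm

def transition_row(s, action, C, r=0.8, dE=0.2, s_base=0.1, sigma=0.5):
    """Probabilities of the next stock state 0..C-1, given current stock state s.
    Actions: 0 = harvest nothing, 1 = harvest a little, 2 = harvest a lot."""
    ds_max = max(r * k * (1 - k / C) for k in range(C))   # maximum regrowth (assumed definition)
    harvest = [0.0, (1 - dE) * ds_max, (1 + dE) * ds_max][action]
    stock = s - harvest                                # subtract the harvest from the current stock
    stock = stock + r * stock * (1 - stock / C)        # logistic regrowth of the remainder
    floor = s_base * [1 + dE, 1 - dE, 0.0][action]     # action-dependent base level
    stock = max(stock, floor)                          # the resource cannot be fully depleted
    # Discretize: the normal probability mass in [s'-0.5, s'+0.5] goes to state s',
    # with open intervals at the two boundary states.
    edges = np.concatenate(([-np.inf], np.arange(C - 1) + 0.5, [np.inf]))
    cdf = norm.cdf(edges, loc=stock, scale=sigma)
    return np.diff(cdf)

# Assembling the full transition tensor T[s, a, s'], e.g. for capacity C = 5:
# T = np.array([[transition_row(s, a, C=5) for a in range(3)] for s in range(5)])
```

The rewards then equal the harvest amount, reduced to 10% when the current or next stock state is zero, following the description above.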
In contrast, the majority of inaccurate representations lead to a worse reward at convergence. As is clearly visible from the red dots on the right for each environment, the simpler the representation, the worse the performance. Nevertheless, a few representations of intermediate complexity lead to a reward out-performance in the order of 1%. This is remarkable, since Blackwell's theorem (60) showed that a rational decision maker cannot improve by using an inaccurate representation. Of course, our result does not contradict Blackwell, since we investigate a learning process. To better understand the relationship between the learning process and the rational optimal policies, Table 1 shows the average reward of the optimal policy R*,γ and the average-reward optimal policy R*,avg relative to the reward obtained by the fully observant agent (shown in Fig. 6 by diamonds and down-triangles).

Env. states      6        7        8        9
Reward R*,avg    0.25     0.013    0.019    0.024
Reward R*,γ     -0.015    0.013    0        0.003

Table 1: Average reward of the optimal average-reward policy R*,avg and the optimal policy of the discounted reward setting R*,γ for the same four renewable resource environments as in Fig. 6. Rewards are transformed in the same way (R = r/r_ac − 1, with r_ac being the reward the fully observant agent obtained at convergence).

The optimal policy maximizes the state values for each state and depends on the discount factor γ. The average-reward optimal policy maximizes the average reward. Since in this environment R*,γ approaches R*,avg under γ → 1, the rewards between R*,avg and R*,γ|_{γ=0.9} represent the rewards more patient or future-caring agents could obtain. Thus, the out-performing representations cause the learner to behave as if it were more patient or future-oriented than it actually is (as defined by its discount factor γ). However, it is not obvious how to identify regularities across the environments among the top-rewarding representations. Moreover, Table 1 shows that the learning process under full observability yields decent results. For the environment with 8 states the learner obtains exactly the same reward as the optimal policy. In the environment with 6 states the learner obtains an average reward which is even above that of the optimal policy. Taken together, a speed out-performance in the order of 10% multiplied by a reward out-performance in the order of 1% leads to a combined speed-reward out-performance in the order of 0.1%. Along the four environments investigated, the magnitude of the out-performance increases with the number of environmental states. Future work is needed to investigate this effect in larger, more complex resource harvesting environments, and also how to obtain those representations which lead to better outcomes faster. Notably, this result resembles the one by Mark et al. (61), who also show that simpler views of the world can be advantageous. However, in their model perceiving the truth comes with a cost which is subtracted from the rewards of the environment. If this cost parameter is sufficiently large, then perceiving the truth cannot pay off, by design. We do not model such a cognitive cost of being close to the truth and still find that some inaccurate representations lead to better outcomes faster.

Figure 6: (…) 3) The stock is discretized by a normal distribution in order to have the number of states equal the capacity C of the logistic growth function. We set the growth rate r = 0.8, the effort deviation ∆E = 0.2, the stock base level s_base = 0.1, and the environmental stochasticity σ = 0.5. Panel B shows the possible observation spaces — how the environment is represented by the agent — ordered by decreasing complexity, for a world in which there are five possible true environmental states. In the most complex (at the top) the agent perceives all real states of the world as distinct; in the least complex (at the bottom), the agent makes the same observation regardless of the true state. We investigate all representations where the agent perceives several adjacent states as a single coherent observation. Panel C shows the reward out-performance R = r/r_ac − 1 (red), the speed out-performance S = 1 − l/l_ac (blue), and the combined reward-speed out-performance R·S (if R > 0 ∧ S > 0) (purple) for all possible representations, for the four renewable resource environments with capacities C and likewise numbers of states, 6-9. Out-performance is measured with respect to the agent which used the accurate representation of the environment and obtained a reward r_ac in l_ac time steps. For each environment, each dot represents the average of 100 Monte Carlo simulations from random initial policies of a single representation, ordered from the most complex, i.e., the accurate one, on the left to the simplest, i.e., perceiving all states as one, on the right. Violin plots show the distribution of rewards and speed, relative to the agent with the accurate representation. The three top-performing representations are shown schematically by the dashed lines. Additionally, the average rewards of the optimal discounted policy R*,γ and the optimal average-reward policy R*,avg are shown. The agent's discount factor γ = 0.9, intensity of choice β = 25, and learning rate α = 0.02. There exist inaccurate representations (partial observation functions) of the environment that lead to a better learning outcome faster compared to the fully observant agent.

Figure 7: (…) Panels B and C show the average rewards at convergence for agent 1 in red and agent 2 in blue (top row) and the time steps it takes the learners to converge (bottom row) for various observational noise levels from 0 to 0.5. For each noise level, the plots show a histogram via the color scale. Each histogram results from a Monte Carlo simulation from 100 random initial policies. Panel B shows the case of homogeneous uncertainty, where both agents' observations are corrupted equally by noise. In Panel C only agent 2 is increasingly unable to observe the environment correctly (heterogeneous uncertainty). The discount factor was set to γ = 0.5 since future states are independent of the agents' actions, which makes the discount factor irrelevant for the learning in this case. Remaining hyperparameters were set to α = 0.01 and β = 50. Homogeneous uncertainty can overcome the social dilemma through the emergence of a stable, mutually high rewarding fixed point above a critical level of observational noise. Heterogeneous uncertainty, however, leads to reward inequality. In both cases, the transition is accompanied by a critical slowing down of the convergence speed.

Environment description. The emergence of cooperation in social dilemmas is another key research challenge for evolutionary biology and the social and sustainability sciences (62-64). We will focus on the situation where two agents can either cooperate (C) or defect (D) and either face a Prisoner's Dilemma or a Stag Hunt game with equal probability (Fig. 7 A; cf. Refs. 65, 66).
In the pure Prisoner's Dilemma defection is the Nash equilibrium, which leads to a suboptimal reward for both agents, also known as the tragedy of the commons (67) . In the pure Stag Hunt game, both mutual cooperation and mutual defection are Nash equilibria with the difference that mutual cooperation yields a higher reward than mutual defection for both agents. It is therefore also referred to as a coordination challenge (68) . Here, we consider the situation when the agents are uncertain about the type of game they are facing at each decision point. Whether we are facing a tragedy or a coordination challenge is relevant for, e.g., the mitigation of human-caused climate change (69) . We investigate two scenarios. Under homogenous uncertainty (Fig. 7 B) , both agents' observations are blurred by an increasing level of observational noise. Under heterogeneous uncertainty (Fig. 7 C) , only agent 2's observations become noisier. Since the environment is symmetric under exchanging the roles of the agents, it suffices to explore only one heterogeneous uncertainty scenario. Results. Homogeneous uncertainty can overcome the social dilemma through the emergence of a stable, mutually high rewarding fixed point above a critical level of observational noise. Under perfect observation both agents convergence to full defection when observing the Prisoner's Dilemma. When observing the Stag Hunt game it depends on the initial joint policy whether the agents converge to mutual defection or mutual cooperation. Reward values are as such that the defective basin of attraction is comparable small (see the light line at an average reward of 0 in Fig. 7 B ). Increasing the observational noise level from zero under homogeneous uncertainty will first decrease the average reward at convergence. The agents still converge to the perfect observation policy which leads them to defect when they observe the Prisoners' Dilemma but the situation is actually the Stag Hunt. However, increasing observational noise further eventually leads to a bifurcation (Fig. 7 B) . Mutual cooperation under both observations becomes a stable fixed point. As a consequence both agents obtain an average reward of 5 at convergence. Interestingly, there seems to be a small range of observational noise at which all three rewards 0, ∼ 2, and 5 are supported by equilibria. For large noise levels only the rewards at 0 and 5 are stable. Thus, we find that the deterministic learning dynamics under homogeneous partial observability are able to converge to mutually more rewarding policies compared to the perfect observation case. The existence of those equilibria is long known in traditional static game theory (65) . Here we show that our derived dynamics are able to serve as a dynamic micro-foundation for those static equilibria. They correspond to fixed points of the derived learning dynamics, and the transitions between equilibria are again accompanied by the dynamical systems phenomenon of a critical slowing down of the convergence speed (Fig. 7 B, bottom) . However, the mutual benefit of uncertainty vanishes when not all agents' observations are uncertain (Fig. 7 C) . Under slight uncertainty only the reward of the illinformed agent (Agent 2 in Fig. 7) decreases. After the bifurcation point under large uncertainty, the ill-informed agent converges to full cooperation under both observations, whereas the well-informed agent still defects in the Prisoner's Dilemma which earns it an average reward of even more than 5. 
The knowledgable agent exploits the ill-informed and heterogeneous uncertainty leads to reward-inequality between the agents. Interestingly, Fig. 7 suggests a difference in the type of phase transition between the policy of mediocre reward at low observational noise levels and the policies at high noise levels. The phase transition under homogeneous uncertainty seems to be discontinuous and shifted toward greater noise levels whereas the transition under heterogeneous uncertainty seems to be continuous. Investigating the relationship between the learning dynamics, free energy equivalents (49) and phase transitions is a promising direction of future work. Environment description. The last environment we use as a test bed is a two-agent, two-state, two-action zerosum competition, also known as the two-state matching pennies game (70) . It roughly models the situation of penalty kicks between a kicker and a keeper. Both agents can choose between the left and the right side of the goal. The keeper agent scores one point if it catches the ball (when both agents have chosen the same action), otherwise the kicker agent receives one point. The two states of the environment encode which agent is the keeper and which one is the kicker. In state KeepKick agent 1 is the keeper and agent 2 is the kicker. In the state KickKeep it is the other way around. Agents change roles under state transitions, which depend only on agent 1's actions. When agent 1 selects either left as keeper or right as kicker both agents will change roles. With symmetrical rewards but asymmetrical state transitions, this two-state zero-sum game presents the challenge of coordinating both agents on playing a mixed strategy with equiprobable actions. Similarly as in Secs. 4.1 and 4.4, the agents' observations of the environmental states are obscured by a noise level ν. Results. Figure 8 shows how partial observability can stabilize the learning process. When both agents observe the environment perfectly the learning dynamics are prone to be unstable, either unpredictably chaotic or on periodic orbits and limit cycles (Panel A, 14) . The rewards of agent 1 and 2 are circulating around zero. Under a medium observational noise level of ν = 0.25 the learning dynamics are still unstable. Especially the transient dynamics in the policy space (Panel B, on the right) appear strange. The average reward trajectory looks damped compared to the fully observant agents. Increasing the observational noise further such that the agents perceive the two environmental states (KeepKick and KickKeep) as a single observation (KK ), is able to stabilize the learning process. Interestingly, the flow of the learning dynamics is separated into two half circles directed at the upper half of the line at which agent 1 chooses both actions with equal probabilities. As shown by the gray arrows, the circled flow is on a fast timescale compared to the movement downward to the center of the policy space (which is not reached here within 1000 time steps). At this downward movement, both agents play the different roles of kicker and keeper in equal amounts, since only agent 1 is responsible for the state transitions. Any advantage agent 2 gains from deviating from the equiprobable policy as kicker is balanced by the same amount of disadvantage agent 2 looses as keeper. Thus, the rewards for both agents quickly stabilize at zero. In this article we analyzed the efficacy of temporaldifference reinforcement learning under irreducible environmental uncertainty. 
To do so, we introduced deterministic multiagent reinforcement learning dynamics, in which the agents are only partially able to observe the true states of the environment. These dynamics operate in the theoretical limit of an infinite memory batch, and make implicit inference about the true states via Bayes rule and can be well approximated by finite-size batch learning algorithms. This limit allows us to systematically separate the stochasticity of reinforcement learning, resulting from probabilistic environmental dynamics, observations, and decisions, from the environmental uncertainty that originates in the agents' incomplete awareness of the true state space. Overall, we have shown how these dynamics can serve as a practical, lightweight, deterministically reproducible and robust tool, to systematically study the combined effects of strategic uncertainty, stochastic uncertainty and state uncertainty in collectives of self-learning agents across a wide range of partially observable environment classes. We have found a variety of effects caused by partial observability, yet general conclusion and recommendations cannot be stated, due to the generality of the partially observable agent-environment setting. Providing agents with only a partial view of the true state of the world might be expected to always result in poorer learning decisionmaking outcomes. However, we have demonstrated that irreducible environmental uncertainty can instead lead to better learning outcomes, even in a single-agent environment, stabilize the learning process and overcome social dilemmas in multiagent domains. Furthermore, our method allows the application of dynamical systems theory to partially observable multiagent learning. We have found that partial observability can cause the emergence of catastrophic limit cycles, within which the agent obtains the worst possible reward. We also found instances where partial observability induces phase transitions between low and high rewarding regimes accompanied by a critical slowing down of the learning processes. Further, we saw partial observability induced separations of the learning dynamics into fast and slow eigendirections, as well as multistability of the learning process. Potential applications. These results may be of use in technological applications of multiagent reinforcement learning, with respect to training regimes, hyperparameter tuning, and the development of novel algorithms. For example, if agents are able to detect that they entered a slow eigendirection, then they can safely increase their learning rate for a faster convergence. Or training regimes and hyperparameter search techniques might be on the lookout for a critical slowing down since this can indicate a phase transition toward high rewarding solutions. With respect to the hyperparameter values required for a decent performance we found across environments that learning with partial observability demands more exploration and less weight on future rewards, compared to fully observant agents. Moreover, the learning with partial observability might depend crucially on the precise combination of the two parameters, whereas without uncertainty both parameters can be tuned fairly independently. The fast computation speed and visualization capabilities of the deterministic learning dynamics approach might be particular suited for the challenge to engineer interpretable and safety-critical learning systems. 
We have shown that whether partial observability in the classic principle of temporal-difference learning is advantageous depends on the specific nature of the environment and its representation (cf. ecological rationality, 71, 72) . Given that temporal-difference reinforcement learning is a relatively simple and widely effective algorithm, and one which closely matches known features of neurological learning (43, 44) , this points to a potential evolutionary pressure for agents to develop internal models of the world that do not match the true state space of their environment (cf. Refs. 61, 73) . The proposed dynamics are therefore a suitable tool to advance theoretical research in cognitive ecology, which studies how animals acquire, retain, and use information within their ecology, evolution and behavior (74, 75) . Within this area, research has begun to ask how agents' may evolve nonveridical or incomplete representations of the world (61, 76, 77) ; the dynamic model presented here offers a tool to study the effect of nonveridical representations in greater depth. We also showed that partial observability can lead to better collective outcomes in the case of social dilemmas. The question for the preconditions of cooperation and sustainable behavior presents an important area for deeper investigation (78) (79) (80) . Temporal-difference learning is a widespread principle in neuroscience and psychology (44) and there is indeed evidence that humans use a payoffbased learning rule in social dilemmas (81) . The topic of uncertainty is of special relevance in the mitigation of the climate crisis through global cooperation agreements (68, (82) (83) (84) (85) . Our results highlight the potential for a systematic investigation of mechanisms that incorporate useful uncertainty (86) for learning and adaptive actors. In our examples, the mutual benefit of uncertainty in the social dilemma vanishes when not all agents are likewise ill-informed causing reward-inequality between the agents. This suggests that partial observability as a mechanism for solving social dilemmas may need to be regulated externally (e.g., by authorities that monitor information flow, or as a feature of the environment) rather than something that is likely to be generated as an evolutionary adaptation amongst individuals in competition with each other. Future directions. A promising directions for future work is the integration of model uncertainty through an analytical treatment of noisy dynamics (cf. Ref. 87 ). The stochastic noise models the finiteness of a reasonable learning algorithm compared to the theoretical limit of the infinite memory batch of the present dynamics. The challenge is that this problem is ill defined and many reasonable learning algorithms exist. Furthermore exciting is the embedding of representation and generalization dynamics into the nonlinear dynamics of learning, acting and environment to study the principles of advantageous representations. Python code to reproduce all results is available at https://github.com/wbarfuss/POLD and archived at https://doi.org/10.5281/zenodo.6361994. 
References
1. Decision making under uncertainty: theory and application
2. Reasoning about uncertainty
3. Decision making under deep uncertainty: from theory to practice
4. Models of bounded rationality: Empirically grounded economic reason
5. Rational use of cognitive resources: Levels of analysis between the computational and the algorithmic
6. Evolutionary games and population dynamics
7. Evolutionary dynamics of multi-agent learning: a survey
8. Towards a unified treatment of the dynamics of collective learning
9. Learning through reinforcement and replicator dynamics
10. A stochastic learning model of economic behavior
11. A selection-mutation model for Q-learning in multi-agent systems
12. Coupled replicator equations for the dynamics of learning in multiagent systems
13. Evolutionary dynamics of regret minimization
14. Deterministic limit of temporal difference reinforcement learning for stochastic games
15. Complex dynamics in learning complicated games
16. Continuous strategy replicator dynamics for multi-agent Q-learning
17. Evolutionary dynamics of Q-learning over the sequence form
18. Cooperation dilemma in finite populations under fluctuating environments
19. Fixation in finite populations evolving in fluctuating environments
20. Evolutionary games and population dynamics: maintenance of cooperation in public goods games
21. Eco-evolutionary dynamics of social dilemmas
22. Cooperation in changing environments: Irreversibility in the transition to cooperation in complex networks
23. Environmental feedback drives cooperation in spatial social dilemmas
24. The survival of the conformist: social pressure and renewable resource management
25. An oscillating tragedy of the commons in replicator dynamics with game-environment feedback
26. Punishment and inspection for governing the commons in a feedback-evolving game
27. Evolutionary games with environmental feedbacks
28. Eco-evolutionary dynamics with environmental feedback: Cooperation in a changing world
29. Evolution of cooperation in stochastic games
30. Asymmetric evolutionary games with environmental feedback
31. Evolutionary dynamics with game transitions
32. Partially observable Markov decision processes
33. A concise introduction to decentralized POMDPs
34. Dynamic programming for partially observable stochastic games
35. Temporal abstraction in temporal-difference networks
36. Predictive representations of state
37. On the economic value of signals
38. Using eligibility traces to find the best memoryless policy in partially observable Markov decision processes
39. Heuristic decision making
40. Game dynamics as the meaning of a game
41. Learning to predict by the methods of temporal differences
42. Reinforcement Learning
43. A neural substrate of prediction and reward
44. Reinforcement learning: the good, the bad and the ugly
45. Learning without state-estimation in partially observable Markovian decision processes
46. Experimental results on learning stochastic memoryless policies for partially observable Markov decision processes
47. Q-learning
48. Reinforcement learning dynamics in the infinite memory limit
49. Dynamical systems as a level of cognitive analysis of multi-agent learning
50. Nonlinear dynamics and chaos
51. Batch reinforcement learning
52. Self-improving reactive agents based on reinforcement learning, planning and teaching
53. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming
54. Approximating optimal policies for partially observable stochastic domains
55. Markov Decision Processes: Discrete Stochastic Dynamic Programming
56. Natural resource and environmental economics
57. Modeling experiential learning: The challenges posed by threshold dynamics for sustainable renewable resource management
58. Sustainable use of renewable resources in a stylized social-ecological network model under heterogeneous resource distribution
59. The physics of governance networks: critical transitions in contagion dynamics on multilayer adaptive networks with application to the sustainable use of renewable resources
60. Equivalent comparisons of experiments
61. Natural selection and veridical perceptions
62. Five rules for the evolution of cooperation
63. Social dilemmas: The anatomy of cooperation
64. Caring for the future can turn tragedy into comedy for long-term collective action under risk of collapse
65. The values of information in some nonzero sum games
66. Categorization and cooperation across games
67. The tragedy of the commons
68. Climate negotiations under scientific uncertainty
69. Tipping versus cooperating to supply a public good
70. RESQ-learning in stochastic games
71. Ecological rationality: Intelligence in the world
72. Taming uncertainty
73. The interface theory of perception
74. Cognition, evolution, and behavior
75. Cognitive ecology II
76. Fitness beats truth in the evolution of perception
77. Optimal use of simplified social information in sequential decision-making
78. When optimization for governing human-environment tipping elements is neither sustainable nor safe
79. Deep reinforcement learning in world-earth system models to discover sustainable management strategies
80. Stewardship of global collective behavior
81. Payoff-based learning explains the decline in cooperation in public goods games
82. Systematic uncertainty in self-enforcing international environmental agreements
83. The collective-risk social dilemma and the prevention of simulated dangerous climate change
84. Uncertainty, learning and heterogeneity in international environmental agreements
85. Timing uncertainty in collective risk dilemmas encourages group reciprocation and polarization
86. Adding noise to the institution: an experimental welfare investigation of the contribution-based grouping mechanism
87. Intrinsic noise in game dynamical learning