Ebtekar, Aram. Information Dynamics and The Arrow of Time. 2021-09-16.

Abstract. Time appears to pass irreversibly. In light of CPT symmetry, the Universe's initial condition is thought to be somehow responsible. We propose a model, the stochastic partitioned cellular automaton (SPCA), in which to study the mechanisms and consequences of emergent irreversibility. While their most natural definition is probabilistic, we show that SPCA dynamics can be made deterministic and reversible, by attaching randomly initialized degrees of freedom. This property motivates analogies to classical field theories. We develop the foundations of non-equilibrium statistical mechanics on SPCAs. Of particular interest are the second law of thermodynamics, and a mutual information law which proves fundamental in non-equilibrium settings. We believe that studying the dynamics of information on SPCAs will yield insights on foundational topics in computer engineering, the sciences, and the philosophy of mind. As evidence of this, we discuss several such applications, including an extension of Landauer's principle, and sketch a physical justification of the causal decision theory that underlies the so-called psychological arrow of time.

The complete trajectory of a dynamical system at all times can be given in two pieces: an initial condition specifying its configuration at the initial time, and dynamics that specify how the configuration evolves over time. It's widely believed that the Universe has dynamics that exhibit CPT symmetry: under a simultaneous reversal in charge (C) and parity (P), the laws of physics are symmetric in time (T); this has been proved in the context of axiomatic quantum field theory [9].
Roughly speaking, CPT symmetry says that every video recording remains physically valid when played in rewind, except that all particles would behave like the mirror images of their antiparticle twins. This finding appears to contradict our common-sense experience, not only of irreversible phenomena such as dropped glasses shattering, but also of our sense of the passage of time, mediated by memory, causality, and planning. If the dynamics are truly symmetric, then by a process of elimination, we must conclude that the symmetry is broken by a special choice of initial condition. In particular, the initial condition must be set in such a way as to imply the second law of thermodynamics: a general principle of physics that forbids the entropy of any closed system from decreasing. Much work has gone into justifying various formal definitions of entropy, along with conditions that would imply the second law. However, even accepting the second law, it remains to explain how it relates to causal and decision-theoretic concepts. The toolkit of thermodynamics makes ample use of large-scale limits, equilibrium, coarse-graining, and conservation laws. Notwithstanding the power of these techniques, we believe they obscure the fundamental role of information in nature. The mismatch is particularly egregious when discussing systems capable of sophisticated computations, be they electronic or biological, as they operate far outside the large-scale equilibrium regime. We present a novel approach that minimizes use of this toolkit. To compensate for the loss of tools from physics, we abstract away the details of physical field theories, substituting a generic class of cellular automata in their place. Like our Universe, these stochastic partitioned cellular automata (SPCAs) admit dynamics that, while most naturally defined probabilistically, can be made deterministic and reversible.

In the search for physical explanations, a common line of attack takes the thermodynamic arrow as given, and focuses on what appears to be a basic component of the psychological arrow: memory.
After proposing a definition for memory, one tries to argue that its operation must align with the thermodynamic arrow. For example, Wolpert [34] distinguishes between two types of memory systems: computer-type and photograph-type. Both types aim to encode information at some time t, about an event occurring at some other time t′. A computer-type memory has access to so much state information at t, that it can deduce the event at t′ by directly computing the dynamical evolution of the Universe. Such "memories" have no arrow of time, with both t′ < t and t′ > t being admissible. On the other hand, while Wolpert's photograph-type memory is less demanding, it requires initialization. He argues that real-world initialization procedures result in a net increase in entropy, forcing the memory to align with the thermodynamic arrow; however, this is begging the question: it is the thermodynamic arrow that makes nearly all real-world processes increase entropy. If the thermodynamic arrow were reversed, we might expect a variety of entropy-reducing many-to-one mappings to function as initialization procedures. In addition, initialization procedures can be made reversible, simultaneously performing a many-to-one mapping on the memory alongside a one-to-many mapping on a second system (e.g., a heat bath) [2]. Mlodinow and Brun [16] argue that even a reversibly implemented memory can only function in the direction of the thermodynamic arrow. Their thought experiment consists of a pair of connected chambers containing elastic particles, and a counter that tracks the net flow of particles from one chamber to the other. Instead of treating initialization explicitly, the counter is assumed to have a known value at some reference time t1. Thus, reading the counter's value at another time t2 is enough to infer the net flow of particles during the time interval between t1 and t2. In order for this memory to be useful, the authors argue that it should be robust to a small random perturbation of the particles at t1.
Such a perturbation directs both arrows of time away from t1: the thermodynamic arrow because perturbed particles tend to increase in entropy, and the psychological arrow because the counter's value at times other than t1 is randomized, and hence must be read rather than assumed by initialization. Thus, regardless of which of t1 or t2 is greater, it appears that the thermodynamic and psychological arrows must align. We raise two rebuttals against Mlodinow and Brun. First, perturbations that evolve backward in time cannot be a realistic model of uncertainty, as they would violate the Universe's low-entropy initial condition (see Section 3.4). Second, if the particles were mixed to equilibrium, the thermodynamic arrow would not be discernible by their evolution; nonetheless, one can still ask whether the particles' past or future movements can be made to correlate with the memory. Our paper answers this question in the context of a special kind of initial condition, which ensures macroscopic homogeneity and locality of the forward-in-time dynamics. As a result, the system can transition to states previously unknown to the memory. Indeed, we can conclude from the Memory Law (Theorems 4.8 and 4.11) that correlations must be traceable to past interactions. In a physically plausible thought experiment, Rovelli [27] considers a memory whose temperature is cooler than that of its environment. Viewing the environment as an agent, and its random interactions with the memory as choices, he concludes that exercising free will must increase entropy [26], hence aligning with the thermodynamic arrow. Unfortunately, since temperatures are only defined at thermodynamic equilibrium, Rovelli's thought experiment excludes the vast majority of systems that perform interesting computations. On philosophical grounds, we also object to a definition of free will that requires choices to be random.
If we take the SCM (structural causal model) approach seriously, then deterministic decisions, e.g., made by a computer program, should be considered equally "free" so long as they are model-based, i.e., based upon an evaluation of counterfactual outcomes. For this reason, we also reject accounts of free will that require quantum or Knightian forms of uncertainty, including Aaronson's freebit picture [1]. On the other hand, since SCMs allow for sources of non-determinism that are independent of the past, we also reject superdeterminism, which is the view that our actions must somehow conspire to meet constraints on future outcomes. In this regard, our philosophical tenets differ from 't Hooft's cellular automaton interpretation of quantum mechanics [32]. A number of additional approaches may be considered relevant; we try our best to mention a representative sample of these. Heinrich et al. [8] present simulation experiments as evidence that natural selection favors agents with knowledge of the past over those with knowledge of the future. However, their arguments don't consider non-living memory systems, nor alternative scenarios in which knowledge of the future may be more advantageous. Furthermore, they present no mechanism by which selective pressure may be applied, between populations whose survival rates are evaluated in opposite temporal directions. Finally, in the context of quantum information theory, Maccone [15] argues that when entropy decreases, any trace or memory must be erased so that we cannot recall the decrease. Maccone's entropy differs from the macroscopically emergent definitions: it's attributed to quantum entanglement within a larger pure-state system, which limits the concept's scope. Furthermore, when the entropy decreases to a non-zero value, it's unclear whether the erased trace necessarily includes all evidence of a higher past entropy. Perhaps it's surprising to find so little clarity across the literature on the psychological arrow of time.
One must bear in mind that the tools and abstractions historically favored by the physics community were, by and large, motivated by relatively homogeneous systems, in the vicinity of thermodynamic equilibrium. Powerful digital computers, capable of manipulating quantities of information that are non-negligible in comparison to the entropy of their physical parts, are only recently within the realm of possibility. The accompanying advances in computer science include new sets of abstractions. By building upon these, we arrive at a much clearer picture of the psychological arrow as an information-theoretic phenomenon. Of course, we are far from the first to study connections between logical and physical descriptions of entropy or reversibility; see for example, [2]. More recently, the thermodynamics of correlated systems has also received substantial research attention [20]. Nonetheless, to our knowledge, we are the first to develop a rigorous theory of the dynamics of information, over a space-time structure. We use it to clarify old ideas, as well as to uncover new insights into the mechanisms and consequences of the psychological arrow of time. Our main model, the SPCAs, can be thought of as a hybrid of SCMs [21] and partitioned cellular automata (PCAs) [11, 18]. We'll study them in terms of quantities derived from classical information theory [5], and refer informally in Sections 5.3 and 5.5 to algorithmic information theory [14]. The reader will find it helpful to possess at least a passing familiarity with these topics. Here we summarize our ideas, leaving the rigorous general exposition to later sections. The core technical contributions of this paper can be divided into two parts. The first part consists of defining SPCA dynamics in three ways, and proving all three to be equivalent. This equivalence allows us to translate between microscopically reversible dynamics and their macroscopic counterpart.
The second part consists of proving the laws of information dynamics: this is a term we use to describe statistical mechanics in the generic SPCA setting, where we assume neither scale, equilibrium, nor conservation laws. By combining both sets of results, we arrive at a model of how an arrow of time may emerge from reversible dynamics. We now elaborate on each set of contributions in turn. An SPCA's description consists of four parts: • a discrete spatial geometry (X, T), • a countable state space S = ∏_{τ∈T} S_τ, • an initial condition μ, and • dynamics. For concreteness, we can imagine the spatial set to be an infinite grid X = Z^d, or a finite grid X = (Z/LZ)^d with periodic boundary conditions. Neighborhoods are defined by a finite set T of local translations, which are bijections on X. These always include the identity; for the grid, a natural choice also includes the orthogonal unit displacements. Altogether, an SPCA's description uniquely determines the joint distribution of a random variable C = (C_{t,x,τ})_{(t,x,τ) ∈ (1/2)Z+ × X × T}, called its configuration history. The coordinates t, x, τ are called the time, cell (or position), and track, respectively, and C_{t,x,τ} is an S_τ-valued random variable. To visualize, picture the SPCA as being divided into cells, with each cell x ∈ X further subdivided into |T| subcells. The reason time proceeds in half-steps is that an SPCA's evolution alternates between cellwise dynamics, which apply the transition rule independently at each cell, and trackwise translations, which translate every track according to its respective map τ ∈ T. This separation of concerns makes analysis easier. The initial condition μ is a probability measure on S^X, specifying a random configuration at the initial time t = 0. From there, the dynamics evolve the SPCA forward in time. We discuss three ways to specify the dynamics.
Ordered from most mathematically convenient to most physically plausible, they are: 1) a matrix P of pairwise transition probabilities between states; 2) a probability distribution Γ over transition functions, from which to sample i.i.d. at every time-space coordinate (t, x); and 3) a deterministic transition function applied identically at every (t, x), on an extended state that includes randomly initialized "microscopic" degrees of freedom. To make the third presentation more explicit, the cellular state space, instead of S, is extended to S × R^Z. If we think of elements of R as encoding transition functions, the initial condition can be made to effectively sample an infinite i.i.d. sequence from Γ. To obtain our desired dynamics deterministically, we simply use one of these embedded samples at each time step. On the other hand, if we think of elements of R as digits with which to build real variables in a positional numeral system (e.g., decimal), then the deterministic dynamics become a symbolic representation of a chaotic multibaker map [7]. By exploring this connection, we obtain a close analogy to classical Hamiltonian mechanics, which serves to justify the SPCA model. In the case where |X| = 1, SPCAs reduce to Markov chains; thus, we also justify the modeling of physical systems by Markov chains, whenever spatial structure is to be ignored. Of course, these justifications rest on the three presentations being equivalent, in the sense of describing the same set of joint distributions on C. In fact, the statements of Theorems 3.7 and 3.9 are slightly stronger: for a fixed spatial geometry and state space, every dynamics, given in any one of the three presentations, has equivalents in each of the remaining presentations, such that regardless of the initial condition, C's distribution doesn't depend on which of the three presentations is used to define it.
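To illustrate the relationship between the first two presentations, the following minimal sketch (with a hypothetical two-element state space; all numbers are illustrative) induces the matrix P from a distribution Γ over transition functions, via P(s, s′) = Pr_{f∼Γ}[f(s) = s′]:

```python
# Hypothetical tiny state space; the paper's S may be any countable set.
S = [0, 1]

# Presentation 2: a distribution Gamma over transition functions f: S -> S,
# here encoded as tuples (f(0), f(1)) with their probabilities.
Gamma = {
    (0, 1): 0.5,  # identity
    (1, 0): 0.3,  # swap
    (0, 0): 0.2,  # constant 0 (not a bijection)
}

# Induced presentation 1: P[s][s2] = Pr_{f ~ Gamma}[f(s) = s2].
P = {s: {s2: 0.0 for s2 in S} for s in S}
for f, prob in Gamma.items():
    for s in S:
        P[s][f[s]] += prob

for s in S:
    assert abs(sum(P[s].values()) - 1.0) < 1e-12  # each row sums to 1
```

Because this Γ puts weight on a non-bijective (constant) map, the induced P is stochastic but not doubly stochastic: its first column sums to 1.2, previewing the equivalence discussed next.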
In addition, we prove a more specific equivalence between: 1) P being doubly stochastic, 2) Γ being restricted to bijections, and 3) the deterministic transition function being a measure-preserving bijection. Thus, double stochasticity can be seen as a generalization of reversibility to the macroscopic, or probabilistic, setting. For computer engineering purposes, this means the most general set of operations that can be performed on a closed system (i.e., without dissipating heat [13]) are the probabilistic mixtures of bijections. The proof of equivalence between doubly stochastic P and random bijections Γ merits special attention, so we highlight its main ideas. We cast P in the language of weighted bipartite graphs, with both vertex partitions isomorphic to S. For every s, s′ ∈ S, the edge from s in the left partition to s′ in the right has weight P(s, s′). Now, suppose Γ selects a particular bijection with probability p. Its contribution to the equivalent matrix can be represented by a perfect matching of weight p, which adds p to the weighted degree of every vertex. In this manner, a mixture Γ, of countably many bijections whose total probability is one, amounts to a weighted bipartite graph whose vertices all have degree one. Therefore, P is doubly stochastic. The converse is trickier, as it turns out that to find an equivalent Γ, we might need to decompose P into a mixture of uncountably many perfect matchings. Fortunately, the existence of such a decomposition is guaranteed by a very general form of the Birkhoff-von Neumann theorem [25]. As a final application of the graph-theoretic view, note that by a simple exchange of the two partitions, it's immediately apparent that inverting the dynamics of the second or third presentation amounts to replacing P by its transpose. This is relevant to our analysis of time reversal in Section 3.4, where we also present a more direct proof. Altogether, this first set of theoretical contributions establishes SPCAs as a model of emergent time-reversal asymmetry.
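In the finite case, the converse direction can be made concrete with a greedy decomposition. The sketch below (the matrix is hypothetical) repeatedly peels a perfect matching, i.e., a permutation, off a doubly stochastic matrix until nothing remains:

```python
from itertools import permutations

def birkhoff_decompose(P, tol=1e-9):
    """Greedily decompose a finite doubly stochastic matrix into a convex
    combination of permutation matrices, whose existence is guaranteed by
    the Birkhoff-von Neumann theorem."""
    n = len(P)
    P = [row[:] for row in P]      # work on a copy
    mixture = []                   # list of (weight, permutation) pairs
    total = 1.0
    while total > tol:
        # Hall's theorem guarantees a permutation inside the support.
        perm = next(p for p in permutations(range(n))
                    if all(P[i][p[i]] > tol for i in range(n)))
        w = min(P[i][perm[i]] for i in range(n))
        mixture.append((w, perm))
        for i in range(n):
            P[i][perm[i]] -= w
        total -= w
    return mixture

P = [[0.5, 0.5, 0.0],
     [0.25, 0.25, 0.5],
     [0.25, 0.25, 0.5]]
mix = birkhoff_decompose(P)
assert abs(sum(w for w, _ in mix) - 1.0) < 1e-6
```

For countably infinite S, this greedy argument no longer suffices, which is why the general form of the theorem [25] is needed; the decomposition may then require uncountably many matchings.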
Our second set of contributions consists of very general statements and extensions of the second law of thermodynamics, on SPCAs. Collectively, we term these the laws of information dynamics. To proceed, we must assume the dynamics to have some stationary distribution π. To simplify the present overview, we specialize to the case where π is the counting measure, which corresponds to doubly stochastic dynamics. Let t1 ≤ t2 be two instants in time, and A ⊂ X be a finite region in space. Using T to define permissible movements, let A+ be the set of positions reachable at time t2, starting from inside A at time t1. Similarly, let A− be the set of positions at time t2, that could only have come from A at time t1; that is, A− := X \ (X \ A)+. Note that A− ⊂ A ⊂ A+ ⊂ X. Let H denote Shannon's entropy, and I the mutual information (see Section 2.2). The Resource Law for open systems (Theorem 4.7) consists of a pair of inequalities, the first of which resembles a standard statement of the second law of thermodynamics: writing H_t(A) for the entropy of the configuration in region A at time t,

H_{t1}(A) ≤ H_{t2}(A+).

In other words, entropy is non-decreasing. In order to capture any entropy that might escape via translations, the right-hand side uses the expanded region A+. The second inequality concerns a region's negentropy N, obtained by subtracting H from the entropy of a uniform distribution:

N_{t1}(A) ≥ N_{t2}(A−).

N is non-increasing but, in order to avoid capturing any negentropy that originated outside A, the right-hand side uses the contracted region A−. These inequalities are severely weakened by their need to account for the worst-case scenario, in which information enters or exits the region at the SPCA's analogue of the speed of light. To strengthen the Resource Law, we require some notion of a closed system that blocks the movement of information in or out. Our solution, detailed in Definition 4.9, considers (possibly time-varying) regions whose boundaries remain filled with a quiescent state, throughout the time interval [t1, t2).
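The regions A+ and A− admit a direct computation. A minimal sketch on a cyclic one-dimensional grid, assuming T consists of the identity and the unit displacements, and writing `steps` for the number of half-steps between t1 and t2 (the grid size and region are hypothetical):

```python
# Translations on the cyclic grid X = Z/LZ: the identity plus unit
# displacements, mirroring the natural choice of T described in the text.
L = 12
T = [0, +1, -1]

def expand(region, steps):
    """A+ : positions reachable from `region` within `steps` half-steps,
    using only the translations in T."""
    cur = set(region)
    for _ in range(steps):
        cur = {(x + t) % L for x in cur for t in T}
    return cur

def contract(region, steps):
    """A- := X \\ (X \\ A)+ : positions that could only have come from A."""
    X = set(range(L))
    return X - expand(X - region, steps)

A = {4, 5, 6, 7}
A_plus = expand(A, 1)
A_minus = contract(A, 1)
assert A_minus <= A <= A_plus   # A- is a subset of A, which is a subset of A+
```

With each half-step, A+ grows by one cell on each side while A− shrinks correspondingly, which is the "speed of light" weakening described above.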
In this setting, Theorem 4.10 implies the straightforward inequalities:

H_{t1}(A_{t1}) ≤ H_{t2}(A_{t2})  and  N_{t1}(A_{t1}) ≥ N_{t2}(A_{t2}).

Of course, if the region in question is not time-varying, we can omit the subscripts on A. The Resource Law considers one system in isolation. In the non-equilibrium regime, we should also account for correlations between disjoint systems. The Memory Law says that the correlation between two disjoint regions A, B ⊂ X is non-increasing. Its precise statement for general open systems is Theorem 4.8, a special case of which yields

I_{t2}(A−; B−) ≤ I_{t1}(A; B).

When A, B are each closed systems, it strengthens (Theorem 4.11) to

I_{t2}(A; B) ≤ I_{t1}(A; B).

Once again, the inequality for open systems contracts its regions at the "speed of light" so that no new information may enter, whereas closed systems ensure this by virtue of being walled off with quiescent cells. Both versions of the Memory Law are applications of the data processing inequality [5, §2.8]: intuitively speaking, since the future states at A and B are functions of their initial states plus independent sources of randomness, they cannot acquire any information about one another that was not already present in their respective initial states. This concludes the summary of our core technical results. The latter parts of the paper are devoted to applications. Even on topics that are fairly well-established, we find that our model's precision and simplicity serve to clarify or extend a variety of analyses, suggesting that SPCAs will continue to be powerful tools for investigations involving the dynamics of information. To start, we see how the negentropy N can be thought of as a resource, analogous to the free energy in physics. In the presence of correlations, it's no longer additive over disjoint regions, but supermodular, decomposing as

N(A ∪ B) = N(A) + N(B) + I(A; B).

By the Resource and Memory laws, all three terms on the right are non-increasing when the systems at A and B are separated from one another. However, when the systems collide, the terms may redistribute arbitrarily, subject to their total not increasing.
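The intuition behind the Memory Law can be checked numerically. In the sketch below (a hypothetical pair of binary systems), two initially correlated systems evolve under independent local noise, modeled here as binary symmetric channels; by the data processing inequality, their mutual information can only shrink:

```python
from math import log2

def mutual_information(joint):
    """I(A;B) in bits, from a joint pmf given as {(a, b): p}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

# Correlated initial condition on two binary systems.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def step(joint, flip=0.1):
    """Independent local dynamics: each system passes through its own
    binary symmetric channel with flip probability `flip`."""
    out = {}
    for (a, b), p in joint.items():
        for a2 in (0, 1):
            for b2 in (0, 1):
                q = (flip if a2 != a else 1 - flip) * \
                    (flip if b2 != b else 1 - flip)
                out[(a2, b2)] = out.get((a2, b2), 0.0) + p * q
    return out

before = mutual_information(joint)
after = mutual_information(step(joint))
assert after <= before   # correlation cannot grow at a distance
```

Any increase in I(A; B) would therefore have to be mediated by an interaction, which is precisely what lets the mutual information serve as a record of past contact.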
One consequence of this is that "forgetting", i.e., decreasing the mutual information between systems that are not in physical contact, carries an irrevocable entropic cost. We present the latter result as a non-local extension of Landauer's principle. In addition, we clarify the usual interpretations of Landauer's principle, using our equivalence theorems to make the arguments precise. Next, we discuss the psychological arrow of time. The Memory Law gets its name for the following reason: since it forbids the mutual information from spontaneously increasing at a distance, the presence of such mutual information must necessarily be traceable to a past interaction. Thus, the mutual information can be understood as a memory of the interaction. We also informally discuss a second type of memory, substituting mutual information with Bennett's logical depth [3] . Then, by viewing SPCAs as causal structural models, we proceed to justify the time-reversal asymmetry present in causal concepts. A full formal proof of the emergence of counterfactual-based decision theories is beyond the scope of this paper. In light of functional decision theory [36] , which models certain effects as preceding their causes, we speculate that it might not be universally appropriate to model causal relations as following physical time. Finally, we remark on the basic limits of empirical knowledge. In the SPCA setting, we arrive at an especially lucid presentation of Boltzmann brains. To escape absurdities, we are pushed into borrowing ideas from algorithmic information theory. Ultimately, they inform our interpretation of probabilities, and lead us to a close analogy between data compression and free energy gathering. Section 2 sets some notational conventions, and then gives an overview of the Kullback-Leibler divergence. 
This overview collects some useful technical lemmas, and describes how the KL divergence is used to quantify the entropy and negentropy of a dynamical system, relative to its stationary distributions. Discrete time-homogeneous Markov chains can be thought of as SPCAs without a spatial geometry. Section 3 explores the theory of Markov chains, focusing on the extent to which they exhibit time-reversal symmetry. It turns out that this setting suffices for discussing our three equivalent presentations of the dynamics, with the proofs being essentially the same. Theorem 3.9 is this section's main result, providing the link between microscopic and macroscopic dynamics. Section 4 explores the theory of full-fledged SPCAs, and how the locality of their dynamics informs the arrow of time. In particular, we state and prove the laws of information dynamics. Section 5 explores several applications of these laws to computer engineering, the sciences, and the philosophy of mind. In some cases, we present novel answers or extensions; in all cases, we obtain additional clarity and rigor by applying the SPCA model and its laws. The arrow of time is a ubiquitous aspect of our reality, fundamental to all that we experience. As such, its study has the potential to deepen our understanding of information, resources, and agency in the physical world. We suggest some possibly fruitful directions for future work in the concluding Section 6. R+ denotes the non-negative real numbers; Z+ and Z− the non-negative and non-positive integers; N := Z+ \ {0} the natural numbers; and Z_n := {0, 1, . . . , n − 1} denotes the first n elements of Z+. We use uppercase letters for other sets A, B, as well as for random variables X. We use lowercase letters for elements of the corresponding sets a ∈ A, b ∈ B, as well as for specific realizations of random variables x = X(ω). A's power set, consisting of all subsets of A, is denoted by ℘(A).
Script letters are used for σ-algebras F, G, as well as for three important sets that will be treated as fixed in most contexts: the state space S and the discrete geometry (X, T). The Greek letters μ, ν, π, Γ denote measures, with λ reserved for the Lebesgue measure (i.e., volume) on R^d. B^A is the set of functions from A to B or, equivalently, sequences indexed by A, of terms in B. f : A → B is synonymous with f ∈ B^A; in either case, a ∈ A implies f(a) = f_a ∈ B. The image of a set A′ ⊂ A under f is f(A′) := {f(a) : a ∈ A′}. Bij(A) ⊂ A^A denotes the set of invertible functions, or bijections, from A onto itself. The substitution f_{a←b} : A → B equals f everywhere except at a, where it equals b instead. That is, f_{a←b}(a′) := b if a′ = a, and f(a′) otherwise. Some intuitive shorthands will be used: for instance, f_{A′} denotes the restriction of f to the set A′ ⊂ A. If A is ordered, f_{≤a} := f_{A′} where A′ := {a′ ∈ A : a′ ≤ a}. If f takes multiple subscripts (i.e., arguments), we allow partial application of the arguments from left to right, so that, e.g., ((f_a)_b)_c := f_{a,b,c} =: (f_{a,b})_c. When convenient, we will speak of a set of jointly distributed random variables, without explicit reference to the implied probability space (Ω, F, Pr). Thus, when referring to a random variable X : Ω → A, we may abuse notation and write X ∈ A, instead of the technically accurate X(ω) ∈ A. We write M(A), M+(A) for the set of probability measures and non-null measures, respectively, on a set A equipped with the following σ-algebra: ℘(A) if A is countable, or the product σ-algebra (generated by cylinder sets) if A is itself defined as a product of sets with specified σ-algebras. For a ∈ A, the point mass that puts probability one on a is denoted by δ_a ∈ M(A). When A is countable, it will often be convenient to identify measures with real-valued functions on A, so that with a slight abuse of notation, μ(A′) = Σ_{a∈A′} μ(a). Finally, indexed collections of random variables C = (C_i)_{i∈I} will be bolded for emphasis. Note that C can itself be thought of as a random variable, mapping ω to (C_i(ω))_{i∈I}.
All of our information-theoretic quantities are derived from this definition:

Definition 2.1. Let S be a countable set, μ ∈ M(S), and ν ∈ M+(S). The Kullback-Leibler divergence of μ relative to ν is

KL_ν(μ) := Σ_{s∈S} μ(s) log (μ(s)/ν(s)).

Following standard conventions, terms with μ(s) = 0 are treated as 0, and terms with ν(s) = 0 < μ(s) are treated as +∞.² When the logarithm's base is 2, the KL divergence is measured in units of bits; when the base is e, it's measured in nats. KL_ν(μ) quantifies how much knowledge we have about a state distributed according to μ, with respect to the weights ν. Relative to a fixed ν, we'll say μ has zero entropy (negentropy) when KL_ν(μ) is maximal (minimal). Let's derive formulas for these extrema: sup_{μ∈M(S)} KL_ν(μ) and inf_{μ∈M(S)} KL_ν(μ). First, we show that for all μ ∈ M(S),

−log ν(S) ≤ KL_ν(μ),    (1)
KL_ν(μ) ≤ −log inf_{s∈S} ν(s).    (2)

The left-hand inequality is trivial if the sum ν(S) is infinite; otherwise, it follows from Gibbs' inequality. As for the right-hand inequality, since μ(s) ≤ 1,

KL_ν(μ) ≤ Σ_{s∈S} μ(s) log (1/ν(s)) ≤ −log inf_{s∈S} ν(s).

Now to actually achieve the infimum, enumerate S = {s_1, s_2, . . .} and let μ_n be proportional to ν restricted to {s_1, . . . , s_n}. Then, as n → ∞,

KL_ν(μ_n) = −log ν({s_1, . . . , s_n}) → −log ν(S).

Finally, to achieve the supremum, choose a sequence (s_n)_{n∈N} such that lim_{n→∞} ν(s_n) = inf_{s∈S} ν(s). Then, as n → ∞,

KL_ν(δ_{s_n}) = −log ν(s_n) → −log inf_{s∈S} ν(s).

² When the sum's positive and negative terms both diverge, the result is ill-defined. In this case, μ is a convex combination of some μ+, μ− ∈ M(S), with disjoint support, satisfying KL_ν(μ+) = +∞ and KL_ν(μ−) = −∞. Linear transformations of μ remain convex combinations of the results of the same transformation on μ+, μ−. Therefore, the identities in this paper extend to all such μ, provided we set their KL divergence to an arbitrary constant in R ∪ {+∞, −∞}.

When ν is a finite measure, the bound in Equation (1) is finite, and we define the negentropy of any μ ∈ M(S) by

N_ν(μ) := KL_ν(μ) + log ν(S).

It follows that N_ν(μ) = 0 iff μ = ν / Σ_{s∈S} ν(s). Conversely, when inf_{s∈S} ν(s) > 0, the bound in Equation (2) is finite, and we define the entropy of any μ ∈ M(S) by

H_ν(μ) := −log inf_{s∈S} ν(s) − KL_ν(μ).

It follows that H_ν(μ) = 0 iff μ = δ_s with s ∈ arg min_{s∈S} ν(s).
Note that the sum H_ν(μ) + N_ν(μ), which we term the information capacity of ν, does not depend on μ; therefore, a zero in either quantity is a maximum in the other. It may aid the reader's intuition to specialize the results in this paper to the case where ν is the counting measure: ♯(s) := 1 for all s ∈ S. In this case, we recover Shannon's entropy

H(μ) := H_♯(μ) = −Σ_{s∈S} μ(s) log μ(s).

However, our theoretical development will work directly in terms of the KL divergence, as it's more general. The next definition will be useful when we discuss memory, and want to quantify how much two systems know about one another. In what follows, let S_A, S_B be countable sets. It will often be convenient to speak of the KL divergence of a random variable or its conditional expression. We will take these as shorthand for the corresponding marginal or conditional distributions. For example, we write KL_ν(A) and KL_ν(A | B = b) instead of KL_ν(Pr_A) and KL_ν(Pr_{A|B=b}), respectively.

Lemma 2.4. Let ν_A ∈ M+(S_A), ν_B ∈ M+(S_B). For any pair of random variables A ∈ S_A, B ∈ S_B, we have the following identities:

KL_{ν_A × ν_B}(A, B) = KL_{ν_A}(A) + E_{a∼A}[KL_{ν_B}(B | A = a)],
E_{a∼A}[KL_{ν_B}(B | A = a)] = KL_{ν_B}(B) + I(A; B).

Therefore,

KL_{ν_A × ν_B}(A, B) = KL_{ν_A}(A) + KL_{ν_B}(B) + I(A; B).

Substituting for E_{a∼A}[KL_{ν_B}(B | A = a)] in Lemma 2.4 yields the desired result.

For ease of exposition, as well as to better appreciate the role of locality in the full setting of Section 4, we begin our investigation of time-reversal asymmetry in a more generic setting, devoid of spatial structure. An SPCA without a spatial geometry is simply a Markov chain: a sequence of random variables, each of which depends only on the previous one. The sequence's joint distribution is uniquely determined by the initial state's probability distribution, together with a matrix detailing the conditional probabilities of transitioning from one state to another. Throughout this section, let S be a fixed countable set. Definition 3.1.
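These quantities are easy to compute directly. A minimal sketch, using a hypothetical three-state space, checks that with the counting measure the entropy reduces to Shannon's, and that the information capacity H + N depends only on ν:

```python
from math import log2

def kl(mu, nu):
    """KL divergence of mu relative to a (not necessarily normalized)
    positive measure nu, in bits (cf. Definition 2.1)."""
    return sum(p * log2(p / nu[s]) for s, p in mu.items() if p > 0)

def negentropy(mu, nu):
    """N_nu(mu) := KL_nu(mu) + log nu(S), for finite nu."""
    return kl(mu, nu) + log2(sum(nu.values()))

def entropy(mu, nu):
    """H_nu(mu) := -log inf_s nu(s) - KL_nu(mu), for inf_s nu(s) > 0."""
    return -log2(min(nu.values())) - kl(mu, nu)

nu = {'a': 1.0, 'b': 1.0, 'c': 1.0}       # counting measure on S = {a, b, c}
mu = {'a': 0.5, 'b': 0.25, 'c': 0.25}

# With the counting measure, entropy reduces to Shannon's entropy.
assert abs(entropy(mu, nu) - 1.5) < 1e-12
# The information capacity H + N depends only on nu, not on mu.
cap = entropy(mu, nu) + negentropy(mu, nu)
assert abs(cap - log2(3)) < 1e-12
```

Here the capacity log 3 bits is simply log ν(S) − log inf ν(s), so entropy and negentropy are complementary ways of measuring the same distance from the extremes.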
The sets of stochastic matrices and doubly stochastic matrices on S are, respectively,

SM(S) := {P ∈ [0, 1]^{S×S} : Σ_{s′∈S} P(s, s′) = 1 for all s ∈ S},
DM(S) := {P ∈ SM(S) : Σ_{s∈S} P(s, s′) = 1 for all s′ ∈ S}.

A measure or matrix is said to have a common denominator n ∈ N if all its entries are multiples of 1/n; that is, if μ ∈ ((1/n)Z+)^S or P ∈ ((1/n)Z+)^{S×S}, respectively. It is said to be strictly positive if none of its entries are zero. Note that P ∈ DM(S) iff the counting measure ♯ is stationary for P. Our first presentation of Markov chain dynamics is the canonical one, given by a stochastic matrix: a random variable C = (C_t)_{t∈Z+} is a discrete time-homogeneous Markov chain with initial condition μ ∈ M(S) and transition matrix P ∈ SM(S), or a (μ, P)-Markov chain for short, if its joint distribution is given by, for all s ∈ S^{Z+},

Pr(C_{≤t} = s_{≤t}) = μ(s_0) ∏_{i=1}^{t} P(s_{i−1}, s_i).

Or equivalently,

Pr(C_{t+1} = s_{t+1} | C_{≤t} = s_{≤t}) = P(s_t, s_{t+1}).    (4)

Notice from Equation (4) that, conditioned on C_t, C_{t+1} is independent of C_{<t}. By induction on t, we obtain the Markov property in a form that is symmetric under time reversal: conditioned on C_t, the future C_{>t} is independent of the past C_{<t}. In light of this symmetry, we might wonder if the asymmetry in Equation (4) is a presentational artifact. That is, does C satisfy similar equations with the time subscripts negated? In addition, can C be extended to negative times, in such a way that Equation (4) is satisfied at all times? In general, extensions to negative times are not guaranteed to exist, nor to be unique if they do. Even at positive times, the backward probabilities may lack time-homogeneity. Using the Markov property and Bayes' rule, we derive the backward probabilities:

Pr(C_{t−1} = s_{t−1} | C_t = s_t) = P(s_{t−1}, s_t) Pr(C_{t−1} = s_{t−1}) / Pr(C_t = s_t).    (6)

Let's suppose that the dynamics is recurrent, meaning that from every s ∈ S, we'll almost surely eventually revisit s. By Theorem 17.48 in [12], P then has a strictly positive stationary measure π. By Remark 17.51(i) in [12], π is uniquely determined up to constant multiples, on each irreducible component of S; that is, the ratio π(s′)/π(s) is uniquely determined whenever P(s′, s) > 0.
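The time-inhomogeneity of the backward probabilities is easy to exhibit numerically. A sketch with a hypothetical two-state chain, started from a point mass rather than from stationarity:

```python
# Backward probabilities via Bayes' rule:
# Pr(C_{t-1}=s' | C_t=s) = P(s', s) Pr(C_{t-1}=s') / Pr(C_t=s).
S = [0, 1]
P = {0: {0: 0.9, 1: 0.1},
     1: {0: 0.2, 1: 0.8}}

mu = [{0: 1.0, 1: 0.0}]   # non-stationary initial condition: a point mass
for _ in range(3):         # forward marginals mu_t for t = 0..3
    prev = mu[-1]
    mu.append({s2: sum(prev[s1] * P[s1][s2] for s1 in S) for s2 in S})

def backward(t, s_prev, s):
    """Pr(C_{t-1} = s_prev | C_t = s), computed by Bayes' rule."""
    return P[s_prev][s] * mu[t - 1][s_prev] / mu[t][s]

# The backward transition probability out of the same state pair depends
# on t: the backward dynamics are time-inhomogeneous.
assert abs(backward(1, 0, 0) - backward(2, 0, 0)) > 1e-6
```

Started from stationarity instead, the same computation would give t-independent backward probabilities, matching the dual dynamics discussed next.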
Therefore, if the initial distribution μ₀ is stationary and the dynamics is recurrent, the backward dynamics are time-homogeneous and given by a well-defined dual transition matrix: P^dual(s, s′) := ν(s′) P(s′, s) / ν(s), (7) and ν is stationary for both P and P^dual. In this case, it's also clear that C has a unique extension to negative times, such that Equation (4) is satisfied for all t ∈ Z. Unfortunately, when μ₀ is non-stationary, the backward probabilities in Equation (6) generally differ from P^dual(s_t, s_{t−1}). In the typical case, the ratio Pr(X_{t−1} = s_{t−1})/Pr(X_t = s_t) is time-dependent, the backward probabilities are time-inhomogeneous, and C has no extension that satisfies Equation (4) at negative times. Nonetheless, in Section 3.4, we will argue that a natural extension to negative times still exists. The negative-time dynamics will be given by P^dual. If we picture the arrow of time pointing away from t = 0 on both sides, toward increasing values of |t|, then we find homogeneous dynamics (either P or P^dual) along the arrow, and inhomogeneous probabilities (inferred using Bayes' rule) against the arrow. 3.1.1 The Second Law of Thermodynamics. As time advances, non-stationary distributions tend to evolve toward stationary distributions. To make this statement precise, we define a preorder on M(S): Definition 3.3. Suppose μ, μ′ ∈ M(S) and ν ∈ M₊(S). We say μ thermo-majorizes μ′ with respect to ν, and write μ ⪰_ν μ′, if there exists P ∈ SM(S) transforming μ into μ′ while keeping ν stationary. That is, for all s′ ∈ S, μ′(s′) = Σ_{s∈S} μ(s) P(s, s′) and ν(s′) = Σ_{s∈S} ν(s) P(s, s′). For alternative characterizations of thermo-majorization and a broad survey of the topic, see [28, §3]. For fixed ν, the relation ⪰_ν is clearly reflexive and transitive, hence a preorder. The zero entropy and zero negentropy distributions from Section 2.2, when they exist, are the maxima and minimum of ⪰_ν, respectively. Note that if ν ∈ M(S), then ν is the minimum of ⪰_ν. The precise statement we were looking for, then, is that evolutions always follow this preorder: Theorem 3.4. Let C be a Markov chain with stationary measure ν. Then for all t ∈ Z₊, μ_t ⪰_ν μ_{t+1}.
Therefore, KL_ν(μ_t) ≥ KL_ν(μ_{t+1}). In particular, if C's transition matrix is doubly stochastic, then S(μ_t) ≤ S(μ_{t+1}). Proof. The first and last statements are immediate from the definitions of ⪰_ν and S, respectively. The only non-trivial claim is that μ_t ⪰_ν μ_{t+1} implies KL_ν(μ_t) ≥ KL_ν(μ_{t+1}). In fact, a much more general result is presented as Theorem 17 in [4], according to which KL_ν is but one of a broad class of monotones compatible with thermo-majorization. Restricting attention to the doubly stochastic case, it's natural to wonder how good a monotone S is. For example, does S(μ₁) ≤ S(μ₂) imply μ₁ ⪰_♯ μ₂? To see that the answer is no, consider the following distributions on the state space S = Z_{2^{500}}: let μ₁ be uniform on a set of 2^{300} states, while μ₂(s) = 2^{−100} on a set of 2^{99} states and μ₂(s) = 2^{−500} on a disjoint set of 2^{499} states. It's easily verified that S(μ₁) = S(μ₂) = 300 bits, but neither thermo-majorizes the other; in fact, any μ simultaneously satisfying μ₁ ⪰_♯ μ and μ₂ ⪰_♯ μ must have S(μ) > 399. Nonetheless, [28, §3] summarizes some settings in which an increase in S is almost sufficient for thermo-majorization. Intuitively, the most important such setting occurs when we have a large number n of i.i.d. samples, say from μ₂. In this case, the joint sample almost certainly belongs to a typical set, consisting of about 2^{n S(μ₂)} outcomes, each approximately equally likely [5, §3.1]. Therefore, in sufficiently large aggregates, distributions of equal entropy are effectively interchangeable. However, that's not at all the case in the one-shot setting, where μ₂ represents a large fraction of our system's total entropy. Equating the negentropy with chemical free energy (see Section 5.1), an illustrative example would be to find ourselves with a 50% chance of discovering a crude oil deposit underneath our property. No matter our risk-aversion, no local action on our part can convert this situation into one with a 100% chance of having half an oil deposit!
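The doubly stochastic case of the theorem above is easy to see numerically. The Python sketch below (our own illustration; the matrix is invented for the example) evolves a point mass under a doubly stochastic matrix and checks that Shannon entropy never decreases:

```python
import math

def entropy(mu):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in mu if p > 0)

# A doubly stochastic matrix: every row and every column sums to 1,
# so the uniform (counting) measure is stationary.
P = [[0.7, 0.2, 0.1],
     [0.2, 0.6, 0.2],
     [0.1, 0.2, 0.7]]

mu = [1.0, 0.0, 0.0]  # zero-entropy initial condition
ents = [entropy(mu)]
for _ in range(20):
    mu = [sum(mu[s] * P[s][sp] for s in range(3)) for sp in range(3)]
    ents.append(entropy(mu))

# Entropy is non-decreasing along the evolution.
assert all(ents[i] <= ents[i + 1] + 1e-12 for i in range(len(ents) - 1))
```

As the text warns, the converse fails: equal entropy does not imply that either distribution thermo-majorizes the other.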
On the other hand, note that it takes a negligible amount of information to describe which of the two branches we're in: merely peeking at the most significant bit of ∈ 2 500 reveals whether 2 ( ) = 2 −100 or 2 ( ) = 2 −500 . Thus, it seems more useful to say that the entropy is either 100 bits or 500 bits, rather than averaging it out to 300 bits. This intuition is captured by the Kolmogorov complexity ( ), which is a function of the specific instance , rather than of a distribution as in ( ). In Section 5.5, we'll revisit how to quantify resources in terms of the Kolmogorov complexity. Until then, for convenience's sake, our analyses are confined to the distribution-based framework of KL divergences. Duplication. It will often be convenient to assume ∈ DM(S). Fortunately, any recurrent Markov chain can be cast approximately in these terms. To demonstrate, suppose ∈ SM(S) has a strictly positive stationary measure . If we think of ( ) as the "size" of state , we want to split all the states into equal-sized pieces. If has common denominator (or we are willing to tolerate approximations with large ), then we can construct dup ∈ DM(S dup ) on a new state space S dup , consisting of · ( ) duplicates of each ∈ S. Intuitively, any time spent in state ∈ S would instead be uniformly distributed among its duplicates ( , ) ∈ S dup . It's straightforward to verify that the projection of In physics, we specify dynamics not with transition matrices, but with reversible equations of motion. The discrete-time analogue would be invertible functions on S. In this subsection, we demonstrate a close correspondence between transition matrices and random functions on S. In particular, we will see that doubly stochastic matrices correspond to invertible functions. Then, in Section 3.3, the functions will be made deterministic by attaching microscopic degrees of freedom in which to store the randomness. Definition 3.6. 
A random variable C = ( ) ∈Z + is a discrete time-homogeneous Markov chain with initial condition ∈ M (S) and random dynamics Γ ∈ M (S S ), or a ( , Γ)-Markov chain for short, if it can be extended to a joint distribution on (C, F) = ( , ) ∈Z + (whose marginal agrees on C) such that Pr ( 0 ,F) = × Γ Z + , and The Markov chain terminology is justified by the following formal correspondence: Conversely, for every ∈ SM(S), there exists Γ ∈ M (S S ) such that, for all ∈ M (S), every ( , )-Markov chain is also a ( , Γ)-Markov chain. If ∈ DM(S), then Γ can be chosen to be supported on Bij(S). The sets { ∈ S S : ( ) = ′ }, with fixed and ′ ranging over S, are mutually exclusive and exhaustive, so ∈ SM(S). If is invertible, then these sets with ′ fixed and ranging over S are also mutually exclusive and exhaustive, so ∈ DM(S). Now fix ∈ M (S). Consider a ( , Γ)-Markov chain C and its companion functions F. Since Pr 0 = , Equation (3) holds. Furthermore, since is independent of ≤ , For the converse, suppose ∈ SM(S) is given. . Taking to be drawn uniformly from [0, 1), is then drawn from the pushforward of the Lebesgue measure ; call it Γ ∈ M (S S ). For all , ∈ S, it satisfies For this Γ and a fixed , let (C, F) be jointly distributed according to Definition 3.6. To see it as an extension of an arbitrary ( , )-Markov chain, it remains to show that the marginal on C agrees with Equations (3) and (4); the verification steps are identical to the previous case. If ∈ DM(S), the only change in the argument is that Γ must be constructed using only bijections in its support, satisfying Γ({ ∈ Bij(S) : ( ) = ′ }) = ( , ′ ). The existence of such a Γ is a generalization of the Birkhoff-von Neumann theorem. Its proof is highly technical; for details, see [25] . In classical physics, we think of the dynamics as fundamentally deterministic and reversible. 
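The last claim of the correspondence theorem, that a doubly stochastic matrix is a mixture of bijections, reduces in the finite case to the classical Birkhoff-von Neumann theorem. The Python sketch below (our own illustration; the greedy peeling strategy is one standard finite-dimensional proof technique, not the paper's construction) decomposes a small doubly stochastic matrix into permutations and checks that the convex combination reassembles it:

```python
from fractions import Fraction
from itertools import permutations

def birkhoff(P):
    """Greedily decompose a doubly stochastic matrix into a convex
    combination of permutation matrices (Birkhoff-von Neumann)."""
    n = len(P)
    P = [row[:] for row in P]  # work on a copy
    parts = []
    while any(any(row) for row in P):
        # Birkhoff's theorem guarantees a permutation inside the support.
        perm = next(p for p in permutations(range(n))
                    if all(P[i][p[i]] > 0 for i in range(n)))
        w = min(P[i][perm[i]] for i in range(n))
        parts.append((w, perm))
        for i in range(n):
            P[i][perm[i]] -= w
    return parts

half = Fraction(1, 2)
P = [[half, half, 0],
     [half, 0, half],
     [0, half, half]]
parts = birkhoff(P)
assert sum(w for w, _ in parts) == 1
# Mixing the permutations with these weights reassembles P exactly.
Q = [[sum(w for w, p in parts if p[i] == j) for j in range(3)]
     for i in range(3)]
assert Q == P
```

Sampling a bijection from this mixture and applying it is then one way to realize a random-function presentation supported on Bij(S).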
The appearance of randomness emerges from chaos: the gradual amplification of microscopic uncertainty until it enters the macroscopic world, increasing some coarse-grained notion of entropy. Since the time-reversed dynamics are equally chaotic, we might be puzzled as to why entropy only increases in the future direction, i.e., why time has an arrow. For instance, consider a system that has been brought to thermodynamic equilibrium. It is considered effectively random for the purposes of its evolution into the future. However, the system's past evolution would see it evolve out of equilibrium, revealing that the present equilibrium state is, in reality, far from random. In this subsection, we develop a third presentation of Markov chains, adding microscopic degrees of freedom which are effectively random for the purposes of its future evolution, while in fact containing hidden structure sufficient to recover its past evolution. The dynamics are deterministic, but appear random when considering only the macroscopic evolution forward in time. This section's main result, Theorem 3.9, provides a formal justification for the modeling of physical systems, despite their determinism and reversibility, by Markov chains. When is stationary with respect to a macroscopic view of the dynamics, we think of it as defining a measure space (S, ℘ (S), ). This macroscopic variable is coupled with infinitely many microscopic degrees of freedom, each initially sampled i.i.d. from the probability space (R, G, Γ). These microscopic components are arranged along a bidirectional sequence indexed by ∈ Z; it may be helpful to think of them as random seeds prepared at initialization, one to use at each time step. Putting the pieces together, a full state is given generically by an element of S × R Z . Formally, define the shift map : R Z → R Z by ( ) := +1 . The dynamics is specified in terms of a function : S × R → S × R that ignores all but the zeroth microscopic component. 
Interleaving it with the shift map ensures that we always act on fresh, never-before-seen components. To be precise, given φ, we define its shift-extension φ̄: φ̄(x, ω) := (x′, σ(ω′)), where (x′, ω′₀) := φ(x, ω₀) and ω′ᵢ := ωᵢ for all i ≠ 0. (9) In other words, φ̄ first applies φ to (x, ω₀), and then applies σ to the entire sequence ω. φ̄ will be our deterministic dynamical law: Definition 3.8. Let (R, G, Γ) be a probability space. A random variable C = (X_t)_{t∈Z₊} is a discrete time-homogeneous Markov chain with initial condition μ ∈ M(S), randomness generator (R, G, Γ), and ℘(S) × G-measurable deterministic dynamics φ : S × R → S × R, or a (μ, R, G, Γ, φ)-Markov chain for short, if it can be extended to a joint distribution on (C, R) = (X_t, R_t)_{t∈Z₊} (whose marginal agrees on C) such that Pr_{(X₀, R₀)} = μ × Γ^Z, and (X_{t+1}, R_{t+1}) = φ̄(X_t, R_t), (10) where φ̄ is as defined in Equation (9). A straightforward induction verifies the identity R_{t,u} = R_{t′,u′} for all t, t′, u, u′ ∈ Z₊ satisfying t + u = t′ + u′. In particular, R_{t,Z₊} = R_{0,t+Z₊}; therefore, at all times t ∈ Z₊, (X_t, R_{t,Z₊}) is distributed according to μ_t × Γ^{Z₊}. This property ensures the forward macroscopic dynamics are time-homogeneous, as we now show. Theorem 3.9. Given a randomness generator (R, G, Γ) and ℘(S) × G-measurable dynamics φ, there is a unique P ∈ SM(S) such that, for all μ ∈ M(S), every (μ, R, G, Γ, φ)-Markov chain is also a (μ, P)-Markov chain. If φ is ν × Γ-measure-preserving, then ν is stationary for P. Conversely, for every P ∈ SM(S), there exists a probability space (R, G, Γ) and ℘(S) × G-measurable φ : S × R → S × R such that, for all μ ∈ M(S), every (μ, P)-Markov chain is also a (μ, R, G, Γ, φ)-Markov chain. If P has a strictly positive stationary measure ν with a common denominator, then φ can be chosen to be a ν × Γ-measure-preserving bijection. Proof. Suppose (R, G, Γ) and φ are given. For s, s′ ∈ S, define the sets A_{s,s′} := {r ∈ R : φ(s, r) ∈ {s′} × R}. Then, setting P(s, s′) := Γ(A_{s,s′}) yields P ∈ SM(S), because the sets A_{s,s′} with s fixed are mutually exclusive and exhaustive. Furthermore, if φ is ν × Γ-measure-preserving, then νP = ν, so ν is stationary for P. Now fix μ ∈ M(S), a (μ, R, G, Γ, φ)-Markov chain C, and its microscopic companion R. By definition, Pr_{X₀} = μ, so Equation (3) holds.
Furthermore, since ,0 = 0, is independent of ≤ , For the converse, suppose ∈ SM(S) is given. By Theorem 3.7, it corresponds to some random dynamics Γ ∈ M (S S ), defined on the -algebra G generated by the cylinder subsets of S S . Define : S × S S → S × S S by ( , ) := ( ( ), ). For a generic cylinder set = { ∈ S S : is a countable union of cylinder sets; hence, is ℘ (S) × G-measurable. It's straightforward to check that for all ∈ M (S), every ( , )-Markov chain is also a ( , S S , G, Γ, )-Markov chain. We modify this construction in the case where ∈ ( 1 N) S is stationary for . The -duplication of (see Definition 3.5) has dynamics dup ∈ DM(S dup ), where S dup consists of · ( ) "duplicates" of each ∈ S. By Theorem 3.7, dup 's random function presentation Γ can be taken to be supported on Bij(S dup ). Now for each ∈ S, split the interval [0, 1) into · ( ) equal-sized sub-intervals , := [ · ( ) , +1 · ( ) ), one for each duplicate ( , ) ∈ S dup of . For each ( , ) ∈ S dup and ∈ Bij(S dup ), use ( ′ , ′ ) := (( , )) to define the bijection The collection { , , } have disjoint domains and disjoint ranges. By joining them together, we obtain a single × × Γ-measure-preserving bijection : To elucidate the situation in physical terms, let's imagine for simplicity that ∈ SM(S) has common denominator ∈ N. In this case, the functions constructed in the proof of Theorem 3.7 only depend on through the value of ∈ Z := {0, 1, . . . , − 1} for which ∈ [ , +1 ). Hence, Γ is the uniform distribution on the multiset (allowing for duplicates) { 0 , 1 , . . . , −1 }. When ∈ DM(S), that proof used a different construction to obtain a mixture of bijections. We replace it with a discrete variant: rather than invoking the general Birkhoff-von Neumann theorem, we can consider the -regular bipartite graph with vertex partition (S, S), and · ( , ′ ) edges from each on the left partition to each ′ on the right. 
Repeated application of Hall's marriage theorem decomposes the graph into perfect matchings. Therefore, we can take Γ to be the uniform distribution over the corresponding bijections on S. In the proof of Theorem 3.9, we embedded these random functions into the state's microscopic components, allowing to simply "read" a choice of function and apply it to the macroscopic component. Now that a function is uniquely determined by the integer ∈ Z , we might as well store directly. Thus, we take R := Z and Γ := 1 ♯. The microscopic information ∈ (Z ) Z , then, forms a bidirectional sequence of -ary digits. We can map (Z ) Z onto the unit square [0, 1] 2 as follows: This mapping is almost one-to-one, aside from ambiguities at the -adic rationals, e.g., 1 = 0.9. Ignoring the ambiguous set, whose measure is zero, we can therefore represent our generic state as an element of S × [0, 1) 2 , initially distributed according to × 2 . Such a multibaker map was first studied for simple random walks in [7] . With Theorem 3.9, we have generalized it to arbitrary discrete Markov chains. The macroscopic state in S is coupled to a microscopic element in [0, 1) 2 , whose shift map now becomes area-preserving and chaotic, expanding the -axis while contracting the -axis, each by a factor of . In the language of dynamical systems, can be viewed as a measure-preserving transformation with Lyapunov exponents ± log , and Kolmogorov-Sinai entropy log [6] . Now, perhaps the requirement that ( , ) be initialized uniformly on the unit square seems too restrictive for analogies to physics. Instead, let's imagine starting with a fixed arbitrary state in S × [0, 1) 2 , and perturbing it slightly. This is done in such a way that the microscopic component becomes uniformly distributed on a set ⊂ [0, 1) 2 of positive area: 2 ( ) > 0. Consider the half-open axis-aligned squares with -adic rational endpoints: D is a semiring of sets that generates the Borel -algebra on [0, 1) 2 . 
By the measure approximation theorem (Theorem 1.65(ii) in [12] ), for any > 0, there exists a finite collection of disjoint sets 1 , . . . , ∈ D, whose union approximates , in the sense that Let − be the minimum width among the sets { } =1 . Then, in the base representations of the coordinates of a uniform sample from =1 , all but the first positions are uniformly distributed. Since the coordinate expands by one position per time step, it will be uniformly distributed on [0, 1) at all times ≥ ; consequently, the macroscopic dynamics will approximate a time-homogeneous Markov chain in the forward direction. Since was arbitrary, the error in this approximation goes to 0 as → ∞. At last, we have the tools needed to discuss the extension of a non-stationary Markov chain C to negative times. If is invertible, such an extension is uniquely determined by Equation (10) . The preceding argument then applies equally well in reverse: at all times ≤ − , our sample's coordinate is uniformly distributed on [0, 1); consequently, the macroscopic dynamics will approximate a timehomogeneous Markov chain in the backward direction. If the set is very convoluted, the arrow of time may be ambiguous over a broad time interval, during which the macroscopic dynamics are not necessarily time-homogeneous nor Markov in either direction. Nonetheless, we see that for sufficiently large positive times, the arrow of time must eventually stabilize in the forward direction; likewise for sufficiently negative times, the arrow stabilizes in the backward direction. Remarkably, it turns out that the forward and backward dynamics (as → +∞ and → −∞, respectively) are related by duality of their transition matrix presentations. We demonstrate this connection next. Under the idealized conditions of Definition 3.8, a dynamical system's macroscopic behavior is exactly that of a Markov chain, and Theorem 3.9 gives it a unique transition matrix . 
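The digit-driven presentation is easy to simulate. In the Python sketch below (our own illustration; the two-state chain and its lookup table are invented for the example), the macroscopic state consumes one fresh b-ary digit of its microscopic seed per step, as in the multibaker construction: the trajectory is exactly reproducible from the seed, yet its forward statistics match the intended transition matrix:

```python
import random

b = 4  # common denominator: each transition probability is a multiple of 1/4
# Encodes P = [[1/2, 1/2], [1/4, 3/4]] as a deterministic lookup table:
# from state s, reading digit d sends us to table[s][d].
table = [[0, 0, 1, 1],   # from state 0: digits 0,1 -> 0; digits 2,3 -> 1
         [0, 1, 1, 1]]   # from state 1: digit 0 -> 0; digits 1,2,3 -> 1

def run(state, seed_digits):
    """Deterministic evolution: consume one fresh digit per step (shift map)."""
    traj = [state]
    for d in seed_digits:
        state = table[state][d]
        traj.append(state)
    return traj

rng = random.Random(0)
digits = [rng.randrange(b) for _ in range(10_000)]  # i.i.d. uniform seed
traj = run(0, digits)
# Determinism: the same seed digits reproduce the same trajectory.
assert run(0, digits) == traj
# Macroscopically, transitions out of state 1 land on 0 about 1/4 of the time.
from_1 = [traj[t + 1] for t in range(len(digits)) if traj[t] == 1]
frac = from_1.count(0) / len(from_1)
assert 0.2 < frac < 0.3
```

The consumed digits are not discarded: in the multibaker picture they accumulate on the contracting axis, which is what makes the evolution invertible.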
Now, if is invertible, the dynamical system's extension to negative times has macroscopic backward probabilities identical to the forward probabilities corresponding to −1 : this is an immediate consequence of inverting Equation (10) under symmetric initial conditions. If not only has an inverse, but also preserves a relevant measure, we now show that replacing with −1 changes the transition matrix from to dual . Recall that, with respect to a strictly positive stationary distribution of ∈ SM(S), the dual matrix is defined by Equation (7). If is recurrent, dual is uniquely determined; otherwise, it may depend on a choice of , which will be made clear by context. In addition, for Γ ∈ M (Bij(S)), we define Γ dual ∈ M (Bij(S)) by Γ dual ( ) := Γ({ ∈ Bij(S) : −1 ∈ }). This allows us to equate dual and inverse dynamics across all three presentations: is a random function presentation for ∈ DM(S). Then, Γ dual is a random function presentation for dual . Alternatively, suppose (R, G, Γ, ) is a deterministic function presentation for ∈ SM(S), which has a strictly positive stationary measure . If is bijective and × Γ-measure-preserving, then (R, G, Γ, −1 ) is a deterministic function presentation for dual . . When discussing a generic ∈ DM(S), the implied stationary measure is ♯; hence, the dual matrix in Equation (7) is simply the transpose of . Let ∈ M (S) be arbitrary. We verify the dynamics of a ( , Γ dual )-Markov chain C, with companion functions F: Similarly, under the hypotheses of the second claim, we verify the dynamics of a ( , R, G, Γ, −1 )-Markov chain C, with microscopic companion R: In light of Theorem 3.10, it's natural to extend Equations (3) and (4) with an additional dual equation. That is, given ∈ M (S) and a recurrent matrix ∈ SM(S), the bidirectional ( , )-Markov chain C = ( ) ∈Z has its joint distribution uniquely determined by: C is effectively two Markov chains glued together at = 0, at which time its entropy is minimal, by Theorem 3.4. 
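The duality above can be checked on a small chain. This Python sketch (our own illustration; the matrix is invented for the example) builds a non-reversible doubly stochastic matrix, for which the implied stationary measure is uniform; its dual, computed from the definition in Equation (7), is then simply the transpose:

```python
# A non-reversible doubly stochastic chain: mostly cycles 0 -> 1 -> 2 -> 0.
P = [[0.1, 0.8, 0.1],
     [0.1, 0.1, 0.8],
     [0.8, 0.1, 0.1]]
nu = [1/3, 1/3, 1/3]  # uniform; stationary since P is doubly stochastic

def dual(P, nu):
    """P_dual[s][s'] = nu[s'] * P[s'][s] / nu[s], as in Equation (7)."""
    n = len(P)
    return [[nu[sp] * P[sp][s] / nu[s] for sp in range(n)] for s in range(n)]

Pd = dual(P, nu)
# With the uniform stationary measure, the dual is the transpose of P...
assert all(abs(Pd[i][j] - P[j][i]) < 1e-12 for i in range(3) for j in range(3))
# ...which is again a stochastic matrix, and differs from P itself,
# reflecting the chain's preferred direction of cycling.
assert all(abs(sum(row) - 1) < 1e-12 for row in Pd)
assert Pd != P
```

Running the dual chain forward mimics watching the original chain in rewind, but only at stationarity.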
In the context of reversible theories in physics, the dual dynamics would be the macroscopic representation of time reversal, if we destroy (e.g., via perturbation) its microscopic correlations, thus preventing restoration of the past's low entropy. To illustrate with a thought experiment, suppose we start from the present state of the Universe, and apply its dual dynamics. When looking at reversible phenomena, such as celestial mechanics, the effect is like rewinding time. However, thermodynamically irreversible events would not become undone. For instance, a dropped glass that has shattered would remain in pieces. A fried egg in a pan would not "unfry", but overcook instead. Life on Earth would likely malfunction in bizarre ways, though we don't presume to guess whether it would eventually adapt to the new arrow of time, or instead suffer a mass extinction. In reality, of course, rewinding through history does not exhibit statistics characteristic of this dual Markov chain. Instead, microscopic correlations render the macroscopic backward probabilities time-inhomogeneous, as in Equation (6). In the next section, we'll see that the backward probabilities can even violate locality. This motivates us to split the second law of thermodynamics into two parts: a Resource Law and a Memory Law. Informally speaking, homogeneity is most closely related to the Resource Law, which informs the thermodynamic arrow of time; and locality is most closely related to the Memory Law, which informs the psychological arrow of time. Generic Markov chains are not well-suited to investigations of locality, for they lack a notion of space. In the physical Universe, dynamics are defined locally, interactions propagating no faster than the speed of light. As a result, each point in space-time is causally related to regions called its past and future light cones, ensuring some causal separation between spatially distant entities. We will need discrete analogues of these concepts. Definition 4.1.
A discrete spatial geometry (X, T ) is a set of spatial cells X, along with a finite set of local translations T ⊂ Bij(X) which includes the identity function: id ∈ T . Associated with (X, T ) is the quasimetric which is a metric iff T is closed under inverses. It determines a partial order on 1 2 Z + × X: given times , ∈ 1 2 Z + and positions , ∈ X, we write ( , ) < ( , ) iff < and ( , ) ≤ ⌈ ⌉ − ⌈ ⌉. In this case, we say ( , ) is reachable from, or in the future of, ( , ), and that ( , ) is in the past of ( , ). If either ( , ) < ( , ), ( , ) = ( , ), or ( , ) > ( , ), we say the pairs are timelike related; otherwise, they are spacelike related. The reason for times to be half-integers will be made clear in Definition 4.2. We now describe some useful sets, including the past and future "light cones". For ∈ 1 2 Z + , ∈ X, and ⊂ 1 2 Z + × X: In addition, we define T • ( ) to be the set of subcells reachable from ⊂ X in one local translation. As it will usually be inconvenient to deal with a fine region containing fractions of cells, we also define + ( ) and − ( ) to be sets of cells that over-and under-estimate T • ( ), respectively. To be precise: Our main object of study, the stochastic partitioned cellular automaton or SPCA, combines the structure of a spatial geometry with that of a Markov chain. Its deterministic analogue, the PCA, was first studied in [18] . For the purposes of modeling reversible systems, we find PCAs to be more useful than ordinary cellular automata (CAs), for two reasons. Firstly, a PCA's evolution alternates between two types of steps, enforcing a clear separation between localized dynamics and movement; this helps with the intuition and analysis. Secondly, a PCA is reversible if and only if its cellwise dynamics is, and its time-reversal has the same maximum speed of propagation; in contrast, the question of whether a given CA is reversible is undecidable, and its time-reversal may interact across much larger neighborhoods [10] . 
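The discrete light-cone structure can be illustrated on the one-dimensional lattice. In the Python sketch below (our own simplification: X = Z, T consists of the identity and the unit shifts in each direction, and times are integers, so the ceilings in the definition are vacuous), the quasimetric is |y − x| and reachability is the familiar discrete light cone:

```python
# Spatial geometry X = Z with translations T = {identity, x+1, x-1}.
# The quasimetric d(x, y) counts the fewest composed translations taking
# x to y; with unit shifts in both directions, d(x, y) = |y - x|.
def d(x, y):
    return abs(y - x)

def reachable(t1, x1, t2, x2):
    """(t2, x2) lies in the future light cone of (t1, x1): information
    propagates at most one cell per time step."""
    return t1 < t2 and d(x1, x2) <= t2 - t1

assert reachable(0, 0, 5, 3)       # inside the light cone
assert not reachable(0, 0, 5, 6)   # outside: would exceed the speed limit
assert not reachable(5, 0, 0, 0)   # the past is not in the future
# Spacelike-separated pairs are reachable in neither direction:
assert not reachable(0, 0, 2, 5) and not reachable(2, 5, 0, 0)
```

Here the quasimetric is a genuine metric because this T is closed under inverses; dropping one of the shifts would make propagation one-directional and d asymmetric.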
For a modern survey of reversible cellular automata, see [11] . While all cellular automata are divided into cells ∈ X, a PCA or SPCA is additionally divided along tracks ∈ T . The pair ( , ) describes a subcell. Whereas a Markov chain assigns a random element of S to each time ∈ Z + , an SPCA decomposes S = ∈T S along tracks, and assigns a random element of S to each time-cell-track coordinate ( , , ) ∈ 1 2 Z + × X × T . The SPCA's evolution alternates between cellwise dynamics, applied independently at each cell, and trackwise translations. From now on, we treat both the spatial geometry (X, T ) and the finitely many countable sets {S } ∈T , whose product makes up the state space S, as fixed. We augment our trio of Markov chain presentations in Definitions 3.2, 3.6 and 3.8 with spatial structure, making them SPCAs: Definition 4.2. Fix an initial condition ∈ M (S X ). Given a random configuration history C = ( , , ) ( , , ) ∈Z + ×X×T , extend it to the half-integer times ∈ 1 2 Z + by setting For ∈ SM(S), we call C a ( , )-SPCA if Pr 0 = , and For Γ ∈ M (S S ), we call C a ( , Γ)-SPCA if it can be extended to a joint distribution on (C, F) = ( , , , ) ( , ) ∈Z + ×X (whose marginal agrees on C) such that Pr ( 0 ,F) = × Γ Z + ×X , and Finally, let (R, G, Γ) be a probability space, and : S × R → S × R be ℘ (S) × G-measurable. We call C a ( , R, G, Γ, )-SPCA if it can be extended to a joint distribution on (C, R) = ( , , , ) ( , ) ∈Z + ×X (whose marginal agrees on C) such that Pr ( 0 , 0 ) = × Γ X×Z , and where is as defined in Equation (9). Some remarks are in order. If the unequal time coordinates on the right-hand side of Equation (14) seem unnatural, we can think of the microscopic variables R as residing on the stationary track id ∈ T . Repeating Equation (11) for R, with = id, yields − 1 2 , := −1, . 
To make the situation even more symmetric, we could place microscopic variables , , ∈ R Z on all the tracks, resulting in a deterministic PCA whose subcellular states belong to S × R Z . Since the dynamics apply at the cellular level, it doesn't matter how we distribute R between tracks: either way, Theorems 3.7 and 3.9 generalize directly, simply substituting "Markov chain" with "SPCA" in their statements. The proofs are essentially the same, so we won't repeat them. However, note that a stationary measure with respect to the cellwise dynamics will not necessarily be stationary with respect to trackwise translations. To ensure stationarity with respect to translations, we'll typically require that the measure be i.i.d. across cells and independent across tracks. That is, it should be of the form X , with satisfying the following: is stationary for , then we say that is stationary for C. We now have all the pieces of our model. If Definition 4.2 seems dense, recall that by the equivalence theorems, we really only need Equations (11) and (12) . SPCAs and Markov chains simultaneously generalize one another. On one hand, every Markov chain can be thought of as an SPCA with a trivial geometry, having |X| = 1 and T = {id}. On the other hand, every SPCA can be thought of as a Markov chain by collapsing all its cells into a global state C ∈ S ′ := S X . Applying Theorem 3.4 to an SPCA's global state yields a global statement of the second law of thermodynamics; however, it only applies when there are finitely many cells, whose aggregate KL divergence is finite. Even then, it's not particularly useful, as it says nothing about the KL divergence within specific regions of interest. The importance of a spatial geometry is that it makes the dynamics local, constraining the movement of information. Before setting out to derive local laws, we need one more piece of notation. A region that consists only of whole cells can be described by its set of cells, ⊂ X. 
If ∈ M + (S), then ∈ M + (S ). However, a fine region, which may include fractions of cells, is best described in terms of its subcells, ⊂ X × T . In this case, we abuse notation slightly and write: Unlike traditional thermodynamic analyses, we do not assume any sort of internal equilibrium. As a result, regions may become arbitrarily correlated with one another, rendering the KL divergence not quite additive as a set function, but supermodular. The following decomposition is very useful to keep in mind. The desired result now follows from a direct substitution. Under suitable hypotheses, we'd like to show that all three terms in this decomposition are nonincreasing. One difficulty is that the trackwise translation step can make a mess of our regions. Let C be an SPCA with stationary measure . Consider a disjoint pair of finite regions , ⊂ X, and ∈ N. Then, where the right-hand parts also hold for = 0. P . The left-hand inequalities concern a cellwise dynamics step. When each cell evolves according to ∈ SM(S), for which is stationary, then the evolution from − 1 2 , to , is mediated by the Kronecker product := ∈ ∈ SM(S ), for which is stationary. Theorem 3.4 now implies both − 1 2 , , and the left half of Equation (17) . The left half of Equation (18) is an immediate consequence of the data processing inequality [5, §2.8] . Finally, the right-hand parts concern a trackwise translation step. By Equation (11), the contents of + 1 2 ,T • ( ) are identical to those of , , so the result follows. Equation (17) is a myopic version of the second law of thermodynamics; in the SPCA context, we'll call it a Resource Law. Intuitively, it says that the KL divergence can be moved at the velocities present in T , and it can be destroyed, but it cannot be spontaneously created. Accordingly, the KL divergence can be seen as a kind of resource, analogous to the free energy in physics. Equation (18) is a similarly myopic version of what we'll call a Memory Law. 
It says essentially the same thing for the mutual information: correlations can be moved, each piece of the correlated pair travelling at some velocity from T . They can also be destroyed. However, non-local correlations cannot be spontaneously created. On the other hand, we'll see in Section 5 that a correlated pair can be created within a cell, and then be made to move apart. This will play a critical role in our understanding of memory. Unfortunately, Lemma 4.5 doesn't compose nicely over multiple time steps. The issue is that the regions T • ( ), T • ( ) are fine, containing some fractions of cells. The left-hand inequalities only hold for regions consisting of complete cells, so we are blocked from applying them inductively at the subsequent time step. To get loose bounds, we may either complete or clear the fractional cells: P . Corollary 2.5 can be used to bound the difference between the KL divergences of two fine regions, one containing the other, relative to independent product measures. Here, independence is ensured by Equation (15). We'll also need the fact that T • ( ) consists of exactly | | subcells on each track ∈ T . This implies, for instance, that (T + ( ) × T ) \ T • ( ) consists of |T + ( ) \ | subcells on each track. The first inequality now follows from applying Corollary 2.5 to bound the difference We can now extend the Resource Law to longer time horizons, at the expense of requiring our regions to expand or contract at the "speed of light": . We fix and proceed by induction on . The base case = is trivial, since + = = − . We have two types of induction steps. The cellwise dynamics step follows from the left-hand side of Equation (17), since + , − do not change during these steps. On the other hand, the trackwise translation step follows by chaining the right-hand side of Equation (17) with Lemma 4.6. Remark. In terms of Section 2.2's entropy and negentropy , the conclusions of Theorem 4.7 are more simply written: . 
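At the heart of the Resource Law is the fact that a stochastic map can only decrease the KL divergence relative to its own stationary measure. The following Python sketch (our own illustration, in the single-cell setting, with a chain invented for the example) verifies this monotonicity numerically for a non-uniform stationary measure:

```python
import math

def kl(mu, nu):
    """KL divergence of mu relative to the measure nu, in bits."""
    return sum(p * math.log2(p / nu[s]) for s, p in enumerate(mu) if p > 0)

P = [[0.5, 0.5],
     [0.25, 0.75]]
nu = [1/3, 2/3]       # stationary: nu P = nu
mu = [0.99, 0.01]     # far from stationarity

kls = [kl(mu, nu)]
for _ in range(10):
    mu = [sum(mu[s] * P[s][sp] for s in range(2)) for sp in range(2)]
    kls.append(kl(mu, nu))

# The resource KL_nu(mu_t) can be destroyed but never created.
assert all(kls[i] >= kls[i + 1] - 1e-12 for i in range(len(kls) - 1))
```

In the SPCA setting, the same quantity can additionally be transported at the velocities available in T, which is what the light-cone hypotheses of the theorem track.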
Likewise, we can extend the Memory Law to longer time horizons. Our new statement is more general, saying that it's impossible to spontaneously create information about any sort of event, unless the event in question is in our future. . The first result is a direct application of the data processing inequality, on the SCM-style graphical model implied by Equation (12) . To prove the second result, we apply the first result twice. In the first application, we substitute , for ; in the second, we substitute , − for . Chaining the steps together yields To see that the Resource and Memory Laws need not hold under time reversal, let's consider a simple counter-example. Fix ∈ M (S), and consider the transition matrix ( , ′ ) = ( ′ ). In the first time step, it collapses an arbitrary initial condition into the stationary distribution X , setting the negentropy and mutual information to zero everywhere. Therefore, reversing this step backward would yield an increase in both quantities, in clear violation of both laws. Theorems 4.7 and 4.8 are significantly weakened by their need to allow for all possible movements of information. In many practical settings, movements are much more restricted. In particular, we take an interest in regions that are insulated for a period of time, by a boundary that's impervious to the transmission of information. For a fixed finite region ⊂ X and track ∈ T , the translation step moves an equal number of -subcells into as out of . If the arriving and departing subcells happen to have identical contents, then the translation step amounts to nothing more than a trackwise permutation of the subcells in . When satisfies Equation (15), is stationary with respect to such a permutation. To ensure that the arriving and departing subcells match, we fix a state ∈ S, and require that the boundary of be filled with . This way, all subcells crossing in or out of via track will be in state . 
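The Memory Law's single-cell essence, that independent local dynamics can destroy but never create correlation, can likewise be checked numerically. In this Python sketch (our own illustration; the binary symmetric channel stands in for a generic cellwise dynamics whose stationary measure is uniform), a perfectly correlated pair of cells loses mutual information monotonically:

```python
import math

def mutual_info(joint):
    """I(X;Y) in bits, from a joint distribution {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

def local_step(joint, eps=0.1):
    """Each cell independently applies a binary symmetric channel
    with flip probability eps (a doubly stochastic cellwise dynamics)."""
    out = {}
    for (x, y), p in joint.items():
        for xf in (0, 1):
            for yf in (0, 1):
                q = p * (eps if xf != x else 1 - eps) \
                      * (eps if yf != y else 1 - eps)
                out[(xf, yf)] = out.get((xf, yf), 0.0) + q
    return out

joint = {(0, 0): 0.5, (1, 1): 0.5}  # perfectly correlated pair: 1 bit
infos = [mutual_info(joint)]
for _ in range(5):
    joint = local_step(joint)
    infos.append(mutual_info(joint))

assert abs(infos[0] - 1.0) < 1e-12
# Independent cellwise dynamics never increase the correlation.
assert all(infos[i] >= infos[i + 1] - 1e-12 for i in range(5))
```

Creating correlation requires the pair to interact within a single cell, after which the correlated pieces may move apart, which is exactly the mechanism for memory discussed in Section 5.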
If is a quiescent state, satisfying ( , ) = 1, then it will be possible to maintain a thick "barrier" of cells surrounding ; however, we will not formally require quiescence. To increase generality, we allow the region to vary with time, thus modeling systems that can move and change size. Definition 4.9. Let C be an SPCA, and = ( ) ∈Z + be a (non-random) sequence of subsets of X. We say is finite if | | < ∞ for all ∈ Z + . For ∈ Z + , ∈ T , define the boundary We say that ∈ S insulates at time ∈ Z + if, for all ∈ T and ∈ , , we have , , = . Given , ∈ 1 2 Z + , we say that is a -closed system during the time interval [ , ) if, with probability one, ∈ S insulates at all times in [ , ) ∩ Z + . Note that we only care about insulating at integer times ∈ Z + ; in particular, every region is a -closed system with respect to intervals of the form [ + 1 2 , + 1), corresponding to a single cellwise dynamics step. This makes sense, as the step involves no movement. Accordingly, in what follows, we adopt the notational convention that when ∈ 1 2 Z + , := ⌈ ⌉ . All time intervals are interpreted as subsets of 1 2 Z + . Theorem 4.10 (Resource Law for Closed Systems). Let C be an SPCA with stationary measure . Let ∈ S, , ∈ 1 2 Z + with ≤ , and ( ) ∈Z + be a finite -closed system during [ , ). Then, . Now, is stationary with respect to , and \ is stationary with respect to id \ . Therefore, is stationary with respect to ⊗ id \ , proving our claim. The case ∈ Z + corresponds to a trackwise translation step. Fix ∈ T and let ′ := \ , = ∩ −1 ( +1 ). Note that maps ′ ⊂ injectively into , and that ′ , = ′ + 1 2 , = for all ∈ \ ′ , ∈ \ ( ′ ). Therefore, without altering the translation step's outcome on ′ , we can re-arrange on \ ′ to get a bijection ′ : → , agreeing with on ′ . Putting all the modified tracks together yields a trackwise permutation of 's subcells, for which is stationary. This proves our claim. By transitivity of , we can chain it across the time interval [ , ] to obtain ′ ′ . 
Or, expanding the definition of C ′ : In particular, when | | = | |, we conclude that , , , as desired. Now, a second consequence of ′ ′ is that Applying Substituting these into Equation (19) yields the desired inequality. Remark. If ∈ arg min ∈S ( ), the conclusion of Theorem 4.10 can be more simply written: . We fix and prove both inequalities by induction on . The base cases are trivial. The translation steps follow easily too, because all arriving and departing subcells on track are deterministically , which cannot contribute to the mutual information. Finally, let's consider the cellwise dynamics step, supposing it ends at ∈ Z + . Note that − 1 2 := , by our convention. By Equation (12), , is conditionally independent of NonFut( { − 1 2 }× ) (and hence, of ), given − 1 2 , . The data processing inequality now implies the cellwise step of our first result: ( − 1 2 , ; ) ≥ ( , ; ). Applying it twice, substituting first − 1 2 , and then , for , yields the cellwise step of our second result: . By induction, the proof is complete. Compared to Theorems 4.7 and 4.8, our new formulations in Theorems 4.10 and 4.11 offer stronger conclusions. The price, of course, is the hypothesis of closed systems. As spelled out in Definition 4.9, a closed system and its coordinates must be known a priori: in a Universe such as our own, this amounts to foreseeing, ever since the Big Bang, precisely where any laboratory experiment is going to take place! Realistically, then, it's often insufficient to apply the laws directly on the SPCA C of interest. Instead, we should condition on the experiment's setup at the time that it's initiated. It's relatively straightforward to check that the conditional distribution of C ≥ , given any event at time ∈ Z + , is itself an SPCA with initial condition Pr | . By applying Theorems 4.10 and 4.11 on this transformed SPCA, we obtain conditional versions of those laws. 
Using a similar transformation, it's possible to define counterfactual versions of the laws. Instead of conditioning on an event, we derive C ′ from C by removing the instance of Equation (12) corresponding to a particular pair ( , ), and instead set ′ , to some exogenous value. Note that, according to this definition, counterfactual interventions only alter the future, so that C ′ < = C < . Since the dynamics at times other than are unaffected, C ′ ≥ is an SPCA. These transformations mirror their SCM equivalents [21] . Intuitively, conditionals represent beliefs updated by an observation, whereas counterfactuals represent the hypothetical effects of alternate causes. Section 5.3 gives physical relevance to the conditional view: once we demonstrate how memories of past events are formed, it will follow that agents can condition their actions on such a memory. The physical interpretation of counterfactuals is more mysterious, as they involve hypothetical alternatives, even ones that are known not to occur. They violate the "laws of physics" given by Equation (12) in a decidedly time-asymmetric manner: in the deterministic function presentation, it's the constraint immediately preceding the intervention that's ignored, therefore cutting links to the past. We'll return to this topic in Section 5.4. The SPCA is a simple, mathematically precise model of emergent irreversibility. It's also quite general, being parametrized by (X, T , S, , ). Our main motivation for creating such a model is to provide a firm foundation with which to reason clearly about time and information. The arrow of time is so deeply ingrained in our intuitions, even in the language of science and mathematics, that, in the absence of a precise model, it's extremely easy for assumptions to go unchecked. This section explores some applications that demonstrate the clarifying power of SPCAs and their laws of information dynamics. 
[11] summarizes several computational universality results for reversible cellular automata. While extending them to SPCAs is beyond the scope of this paper, it's easy to see that, despite the fixed dynamics , an SPCA's evolution in a non-closed region can vary substantially as a result of interactions with the region's surroundings. For example, we might think of these surroundings as including an agent who "intentionally" (see Section 5.4) manipulates the dynamics, by employing various technologies. Since it's difficult to reason in detail about all possible interactions, we seek general constraints imposed by the underlying dynamics . Suppose is stationary in the sense of Definition 4.3. Then, in terms of Section 2.2's negentropy , Theorem 4.4 reads: If , are closed systems, then by Theorems 4.7 and 4.10, none of the three terms on the right may increase. As a result, we may think of them as resources. Intuitively speaking, Equation (20) decomposes the aggregate resource in ∪ into three non-negative components: the internal resource in each of and , plus a mutual resource due to their correlations. The resource interpretation makes even more sense for non-closed systems. If , are non-fine (i.e., made of complete cells), then Lemma 4.5 shows that the three terms are non-decreasing over any single time step, although afterward the regions may no longer consist entirely of complete cells. On the resulting fine regions, the laws no longer apply. Now we must interpret Equation (20) differently. Suppose ∪ is non-fine, even though each of , individually are fine. Then by the Resource Law, the aggregate resource cannot decrease; however, it may be arbitrarily redistributed among its three components! In summary, the general situation is that the negentropy may be moved or destroyed, but not created; and likewise, non-local mutual information may be moved or destroyed, but not created. 
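Equation (20)'s decomposition can be verified for any small joint distribution. The following sketch (uniform reference measures and a randomly generated correlated joint, both purely illustrative) checks that the aggregate KL divergence equals the sum of the two internal divergences and the mutual information:

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) in bits."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

rng = np.random.default_rng(0)
p_ab = rng.random((3, 3))          # a correlated joint over subsystems A, B
p_ab /= p_ab.sum()
p_a, p_b = p_ab.sum(axis=1), p_ab.sum(axis=0)

u = np.full(3, 1 / 3)              # uniform reference measure per subsystem
aggregate = kl(p_ab.ravel(), np.outer(u, u).ravel())    # negentropy of A ∪ B
mutual = kl(p_ab.ravel(), np.outer(p_a, p_b).ravel())   # I(A; B)
internal = kl(p_a, u) + kl(p_b, u)                      # internal negentropies

print(abs(aggregate - (internal + mutual)) < 1e-9)  # True: the decomposition
```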
However, as soon as two systems come close enough to collide at the cellular level, they are free to produce new mutual information and/or exchange negentropy, subject to the aggregate negentropy not increasing. The aggregate negentropy of an SPCA is analogous to the free energy in physics. Readers who are less familiar with statistical mechanics may find this counter-intuitive: after all, energy is a conserved quantity, not a decreasing one. Nonetheless, real-life applications consistently see energy change from useful forms into less useful forms. For example, the chemical energy of fossil fuels can be converted into electricity that powers household devices, before ultimately changing irreversibly into ambient heat. Even the class of energy sources that we term renewable -hydro, wind and solar -can be traced back to the consumption of nuclear fuel at the Sun's core: a vast reserve to be sure, but finite. The explanation is that, in the physical Universe as well, the ultimate currency is negentropy, for which the free energy is merely a convenient accounting mechanism. Just as the stationarity of is a property of certain dynamics , so is the law of conservation of energy. In a world where both properties hold, a closed system's negentropy should be more precisely defined as the difference between its KL divergence, and the least possible KL divergence at the same total energy. A system is said to be in thermodynamic equilibrium when its negentropy is zero. Defined this way, it's possible for two systems to have zero internal negentropy and zero mutual information, but positive aggregate negentropy: for example, while a box of hot gas and a box of cold gas may be useless individually, by exchanging energy, together they can power a heat engine. The details of such an engine are standard material, so we end the discussion here. 
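To make the hot-box/cold-box example quantitative: if we idealize the two boxes as finite reservoirs of equal, constant heat capacity, a reversible engine (zero total entropy change) equilibrates them at the geometric-mean temperature and extracts the remaining heat as work. A minimal sketch under those assumptions:

```python
import math

def max_work(T_hot, T_cold, C=1.0):
    """Maximum work from two finite reservoirs of equal heat capacity C.
    A reversible engine leaves total entropy unchanged, which fixes the
    common final temperature at sqrt(T_hot * T_cold); the heat not needed
    to warm the cold box is extracted as work."""
    T_final = math.sqrt(T_hot * T_cold)
    return C * (T_hot + T_cold - 2 * T_final)

print(max_work(400.0, 100.0))  # 100.0 units of work from a 400 K / 100 K pair
print(max_work(300.0, 300.0))  # 0.0: equal temperatures yield no work
```

At equal temperatures the extractable work vanishes: the aggregate negentropy of the pair has been spent, even though energy is conserved throughout.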
As an additional note, the modern framework of resource theories abstracts over these sorts of exchanges in the presence of multiple constraints. For an exposition and additional references, see [31] . In the digital age, a question of great practical interest is how much computation can be performed while consuming a given amount of free energy, i.e., dissipating it as heat. Von Neumann initially proposed that every elementary computation should cost at least one bit of negentropy. Landauer refined this hypothesis, and argued quite convincingly [13] that only irreversible computations, which erase information, carry this cost. By translating his argument to the setting of SPCAs, we'll now make it clearer and more precise, and offer an extension involving non-local correlations. Let each subcell consist of an array of ∈ N digital bits: that is, S := {0, 1} for each ∈ T . As previously discussed, we seek constraints implied by the dynamics' stationary distribution. For simplicity, let's choose it to be ♯; this is equivalent to assuming ∈ DM(S). By Theorem 3.7, the dynamics can in full generality be described as a probabilistic mixture of local (i.e., cellwise) bijections. Non-bijective operations are forbidden. Now, it's important to be precise about what we mean by "erase" and "cost". Suppose our computer's digital memory contains a random bit. If erasing means simply replacing the bit's value (0 or 1 with equal probability) with another, i.i.d. random value, this can be done for free, provided that no other data is correlated with the bit's original value (see the Principle below). On the other hand, suppose we wish to clear the bit without giving it a random value, i.e., set it to zero. Erasing entropy in this manner is forbidden, as it's non-bijective; in order to make it bijective, we must move the entropy elsewhere. For example, in a real physical system, we can simultaneously dissipate energy into a heat bath, such as the surrounding air. 
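The claim that re-randomizing an uncorrelated bit is free can be seen concretely: replacing a bit with an i.i.d. uniform value is the same channel as applying the identity or the NOT bijection, each with probability 1/2, hence a probabilistic mixture of local bijections. A minimal sketch:

```python
import numpy as np

# Two local bijections on a single bit, as 2x2 permutation matrices.
identity = np.eye(2)
bitflip = np.array([[0.0, 1.0], [1.0, 0.0]])

# Their equal mixture is exactly the channel outputting an i.i.d. uniform bit.
mixture = 0.5 * identity + 0.5 * bitflip
randomize = np.array([[0.5, 0.5], [0.5, 0.5]])
print(np.array_equal(mixture, randomize))  # True

# In particular, the uniform (stationary) distribution is preserved:
print(np.full(2, 0.5) @ mixture)  # [0.5 0.5]
```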
The additional energy gives the air access to twice as many configurations as it had before, making the overall operation bijective. The "cost" of erasure, then, is the minimum energy that must be dissipated as heat, in order to produce a bijection. At typical room temperatures, this amounts to about 3 × 10 −21 Joules per bit of entropy. Note that the aggregate negentropy does not change: we merely move it from the memory into the heat bath. In principle, the operation can be reversed, randomizing the cleared bit while withdrawing energy from the heat bath [2] . In this sense, clearing a randomized bit is also free. If "costs" should refer to something irrevocable, then we should really be looking for decreases in the aggregate negentropy. They occur when we try to erase that which has no genuine entropy to begin with. Consider a deterministic computation: it likely creates junk data, which should eventually be disposed of in order to reuse the memory. As the data was created deterministically, it has zero entropy, and can be erased for free by reversible computing methods that make careful use of this fact [17, 33] . Indeed, reversible computing research is largely motivated by the desire to circumvent the Landauer price 4 . On the other hand, these methods tend to spend a fair amount of computation time in order to undo the logical depth [3] of the data. To avoid this, we may be tempted to forget about the data's source and perform a generic erasure, as if we were clearing a random value. But we know that erasing a random value is free! Conceptually, we should say that the Landauer price was paid at the moment when we decided to forget the junk data's deterministic origins and treat it as random: at this point, our subjective model of the data already has a decreased negentropy. Eventually, we'll dispose of each junk bit by dissipating heat. 
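The figure of about 3 × 10^−21 Joules is Landauer's bound k_B T ln 2, evaluated at a typical room temperature; here is the arithmetic, assuming T = 300 K:

```python
import math

k_B = 1.380649e-23   # Boltzmann constant in J/K (exact SI value)
T = 300.0            # a typical room temperature in K (illustrative choice)

cost_per_bit = k_B * T * math.log(2)   # Landauer bound per erased bit
print(f"{cost_per_bit:.2e} J")         # 2.87e-21 J, i.e. about 3e-21 Joules
```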
Since the bit was in fact deterministic, the heat bath initially occupies only half of its now-expanded configuration space; however, it quickly "consummates" the Landauer price by applying true randomization 5 during its return to thermodynamic equilibrium. At this point, even someone with knowledge of the erased bit's former value would be unable to retrieve the lost negentropy. An interesting twist on the situation occurs in the presence of non-local correlations, i.e., regions with positive mutual information. Since the three terms in Equation (20) are non-increasing, a decrease in any one of them necessarily carries a price. In particular, non-local correlations are a resource whose destruction would be irrevocable: Non-Local Landauer's Principle. Non-local increases in mutual information are forbidden. Non-local decreases in mutual information are permitted, but at an equal cost to the aggregate entropy. This principle is an immediate consequence of the discussion following Equation (20). The first part says that information cannot be spontaneously created: while it is possible to obtain news from a faraway nation, say by carrier pigeon, the pigeon must first visit the source, and then physically carry the information to you! This is essentially just another way of saying that faster-than-light communication is forbidden. On the other hand, since the time-reversal of any forbidden operation is necessarily irreversible, it should be no surprise that a spontaneous decrease in mutual information would be costly. In case this principle seems too abstract, let's see it in engineering terms. Returning to our arrays of bits, recall that the free operations are probabilistic mixtures of local bijections. A deterministic example is the XOR-assignment operation (x, y) ↦ (x, x ⊕ y). It can be used to copy the data onto blank memory, resulting in two copies: (x, 0) ↦ (x, x). 
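The XOR-assignment's copy and cancellation behavior can be demonstrated on bit tuples. A minimal sketch (the helper name and the 4-bit example values are illustrative):

```python
def xor_assign(x, y):
    """The local bijection (x, y) -> (x, x XOR y) on equal-length bit tuples."""
    return x, tuple(a ^ b for a, b in zip(x, y))

data, blank = (1, 0, 1, 1), (0, 0, 0, 0)

# Copy: applied to (data, blank), it produces two copies of data.
print(xor_assign(data, blank) == (data, data))      # True

# Cancel: applied to two identical copies, it leaves one copy and a blank.
print(xor_assign(data, data) == (data, blank))      # True

# Involution: applying it twice restores any input, so it is a bijection.
other = (0, 1, 1, 0)
print(xor_assign(*xor_assign(data, other)) == (data, other))  # True
```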
Conversely, two local copies can be cancelled by this operation, leaving only one copy: (x, x) ↦ (x, 0). However, once two copies are separated, erasing one copy is no easier than if it were the only copy in the Universe. In fact, the situation is even worse: the previous trick of re-randomizing the data is no longer free either, as it now results in two independent pieces of random data. The dream of reversible computing depends on carefully returning all copies of data to their source, so that they can be cleaned up without cost. This seems plausible for internal computations, where we can hope to tightly control the components of our computer and isolate them from external noise. However, in the case of sensory data, the Landauer cost is likely to persist, as we have less control over environmental sources of information. If the environment changes between the time when data is acquired, and when we'd like to dispose of it, we may have to do so at a cost proportional to the amount of collected sensory data. We leave further exploration of reversible sensing to future work. Now, we'll advance our understanding of the psychological arrow of time: the subjective sense that time passes sequentially. 
4 [...] come to thermal equilibrium with the heat bath. If an acceptable probability of failure is 2^−50, then this method dissipates at least 50 bits worth of entropy! Methods that improve upon this should perhaps be called semi-reversible computing: for example, we might accumulate the entropy in a separate "junk memory", which is dissipated in batches using a correlated energy differential. Or, we might smooth the thermal equilibration over the course of a long computation, as in the branching Brownian model in Figure 10c of [2]. 
5 Recall that, in the SPCA model, randomization is true at the level of the macroscopic state space S. The deterministic presentation works on the microscopic state space S × R^Z, randomly perturbed at initialization. 
Following in the footsteps of [16, 27, 34] (see our critique in Section 1.1), we'll begin by discussing how nature produces memories, which are detailed records of past interactions. Then, in Section 5.4, we'll informally sketch explanations of future-directed agency and causality. We'll consider two types of memories: those characterized by long-distance mutual information, and those characterized by Bennett's logical depth [3] . Without going into its formal definition, the logical depth of a string is the amount of computation time needed to derive it from a compressed description. Bennett originally defined this concept to characterize the RNA and DNA strings that encode life on Earth. He observed that entropy alone cannot characterize life, because both low-entropy strings (e.g., all zeros) and high-entropy strings (e.g., uniformly random) can be very uninteresting. The interesting strings are those containing structure that can only have come from a long and difficult computation, such as billions of years of evolution by natural selection. The motivation for starting with a compressed source string is that short strings are more likely to be arrived at "by chance"; for a more philosophical justification, see Section 5.5. It's typically assumed that the initial condition is low in both long-distance mutual information and logical depth. The former means that the initial configuration is nearly uncorrelated at long distances, while the latter means it's "shallow" in the sense that it doesn't look like the result of a difficult computation. If doesn't satisfy these criteria, we can turn it into a point mass by conditioning on 0 ; similar conclusions will then hold for the conditional mutual information and logical depth, with respect to 0 . We've seen, using the Memory Law, that mutual information can only be created locally. 
Therefore, if at the time we find disjoint regions , ∈ X with ( , ; , ) being large, it would follow that this mutual information can be traced back in time. Either we trace it all the way back to the initial condition, which is impossible if such correlations didn't exist initially, or we trace it back to a local interaction between some predecessors of and . Thus, we see that mutual information in the present is evidence of an interaction in the past. Similarly, it has been shown [3] that the logical depth is extremely unlikely to increase very quickly. Therefore, if at the time we find a logically deep object, such as the DNA of an advanced organism, then we'll expect that the growth in its logical depth can be traced very far back in time. In other words, we'll know that it had a shallower predecessor in the past, from which it gradually grew deeper through some kind of computational process. In a macroscopically deterministic setting, such as a reversible computer, only the logical depth concept would apply. Perhaps, in future work, the two memory concepts may be unified. For now, we merely note something in common between the production of mutual information and the production of logical depth: while both may be done reversibly, without altering the aggregate negentropy resource, both processes effectively render our resources less accessible. With long-distance mutual information, it takes a lot of time to bring the pair of systems close enough together to make use of their correlation; with logical depth, it takes a lot of computation to reversibly recover the shallow source string. Agency, as well as the related concepts of cause, effect, and counterfactuals, are fundamental to any study of beings with objectives or incentives. Agents are central objects in economics, public policy, law, artificial intelligence, game theory, philosophy, and more. Since agents plan only their futures, agency is also a clear manifestation of the arrow of time. 
Perhaps the oldest conceptual framework with a claim to describing the phenomenon of agency is mind-body dualism. Predating the scientific method, dualist ideas are sometimes stated vaguely, and interpreted broadly. Their unifying feature is the assertion that the mind is something distinct from physical matter, not emerging from it, and hence subject to different laws. In the scientific age, investigations of the central nervous system have produced physical explanations for many features of the mind. Nonetheless, a modern form of dualism persists in scientific circles: decision theories, as employed by the same academic disciplines listed in the previous paragraph, typically treat agents as qualitatively distinct from other processes. While employing dualist abstractions does not imply a literal belief in dualism, it does raise the question of why such abstractions fit our physical Universe so well. Causal decision theory (CDT) collects a representative set of these abstractions. It models salient variables by nodes in a structural causal model (SCM). Decision nodes are singled out as variables whose outcomes are chosen by one or more intelligent agents. While other nodes are thought of as being determined by immutable physical processes, the decision nodes are treated as exogenous: each of them is assigned to an agent, each of whom has its own objective. When evaluating these models, we set the value of each decision node in a way that is "best" among counterfactual alternatives, according to the corresponding agent's objective. In order to endogenize the decision nodes, we would have to explicitly model the optimization procedure's physical implementation, typically by a brain or computer. The brain or computer would either "instinctively" answer the optimization problem in a model-free manner, or it might try to compute an optimal answer by recursively employing its own mental copy of the same SCM model in which the agent lives. 
Neither of these descriptions is particularly convenient for modeling purposes; clearly, the exogenous description is more parsimonious. Simply assuming that a given objective is optimized, or approximately so, seems to fit reality fairly well. Thus, we are left with two questions: first, without considering agents or decision nodes, why are SCMs a good model of the physical world? Second, why are exogenous decision nodes a good model of living agents? Since SCMs lack time-reversal symmetry, both questions deal with the emergent psychological arrow of time. The first question is about causality, while the second is about agency or, to use the more philosophical term, free will. A rigorous language for causality is provided in Pearl's groundbreaking book [21]: it describes the mathematics of do-calculus on SCMs, demonstrating both its practical utility and its alignment with the intuitive notions of cause and effect. What remains for us is to demonstrate how this mathematics, which is asymmetric with respect to time-reversal, emerges in a deterministic Universe whose dynamics do have a time symmetry. Specifically, we seek to demonstrate the emergence of Pearl's two fundamental principles [22]: Principle 1: "The law of structural counterfactuals." Principle 2: "The law of structural independence." We start with the second principle, as it's easier to justify. This principle is formalized in Pearl's d-separation criterion: from an SCM's graph structure, it determines which sets of variables must be probabilistically independent of one another. Since Equation (12) already has the mathematical structure of an SCM, all of SCM theory, including the d-separation criterion, applies to SPCAs as well. To tie this back to a deterministic CPT-symmetric Universe, we need only refer to Theorem 3.9 and the subsequent discussion. There, we demonstrated that randomized microscopic degrees of freedom can yield macroscopic dynamics that are probabilistic and time-asymmetric. 
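The d-separation criterion can be illustrated on the simplest graph, a chain X → Y → Z. The sketch below (binary variables with arbitrary illustrative channel matrices) computes the exact mutual informations, confirming that X and Z are correlated, yet conditionally independent given Y:

```python
import numpy as np

def mi(p_xy):
    """Mutual information (bits) of a joint distribution given as a matrix."""
    px, py = p_xy.sum(1, keepdims=True), p_xy.sum(0, keepdims=True)
    mask = p_xy > 0
    return float((p_xy[mask] * np.log2(p_xy[mask] / (px @ py)[mask])).sum())

pX = np.array([0.5, 0.5])                # P(x)
A = np.array([[0.9, 0.1], [0.1, 0.9]])   # P(y | x), a noisy channel
B = np.array([[0.8, 0.2], [0.2, 0.8]])   # P(z | y), another noisy channel

# Joint p(x, y, z) = P(x) P(y|x) P(z|y), the chain X -> Y -> Z.
p_xyz = pX[:, None, None] * A[:, :, None] * B[None, :, :]

p_xz = p_xyz.sum(axis=1)
print(mi(p_xz) > 0)       # True: X and Z are correlated...

# ...but Y d-separates them: I(X; Z | Y) = 0.
cmi = sum(mi(p_xyz[:, y, :] / p_xyz[:, y, :].sum()) * p_xyz[:, y, :].sum()
          for y in range(2))
print(abs(cmi) < 1e-9)    # True
```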
In fact, we now see that these dynamics emulate an SCM. On the other hand, the first principle is more mysterious. A counterfactual is a hypothetical alternate Universe in which one or more events are intervened upon, causing additional changes in their future. An example would be the "alternate timeline" in which the coronavirus pandemic never occurred. Historically, as discussed in [21], the idea of counterfactuals was widely dismissed as unscientific. The problem is two-fold: these alternate worlds are inaccessible to us, and they violate the dynamical laws of physics (e.g., Equations (12) to (14)) at the intervened events. Even if we accept such violations as somehow meaningful, why should they always be construed to alter only the event's future, and not its past? Pearl vigorously defended counterfactuals by demonstrating their vital role in commonsense reasoning, and their practical utility in the scientific processes of experiment design and inference. However, he stopped short of explaining how counterfactuals emerge in a CPT-symmetric Universe. One clue, perhaps, appears in the deterministic presentation of SPCAs: there, we see that specifying an intervention macroscopically is sufficient to propagate it forward in time. In contrast, any change that's not microscopically fine-tuned would be immensely destructive if propagated backward in time: as we saw at the end of Section 3, a perturbed backward evolution approximates the dual dynamics dual , and undoes all of the finely-tuned "coincidences" that would be needed to recover a low-entropy past. In any case, the violation of dynamics is reason enough to rule out counterfactuals as a literally accurate description of the Universe as a whole. Nonetheless, they may be useful for modeling a usually-closed system that's occasionally subject to outside interactions. When modeling such a system, the outside interaction may be treated as exogenous. 
Furthermore, by the Memory Law, information about the interaction can only appear in its future. Therefore, if we're uncertain about the precise nature of an outside interaction, each possibility can be modelled counterfactually. Now that we've justified both of Pearl's principles, we return to our second question. We want to know why agents are best modeled by exogenous decision nodes, whose values are set by optimizing over a space of counterfactuals. The reasoning appears to be similar to the one we gave for Pearl's law of structural counterfactuals, with the additional twist that Darwinian evolution tends to produce agents that make near-optimal decisions. As we saw in Section 5.3, the arrow of time for Darwinian evolution seems to require an assumption of low logical depth at initialization. In future work, we hope to expand upon these arguments in order to rigorously model the emergence of counterfactuals and free will. Any complete argument will have to address an important class of counter-examples to CDT: Yudkowsky and Soares [36] consider Newcomb-like problems, in which causal decision theorists tend to underperform. These problems typically involve an agent, the oracle, whose decision is a function of what it predicts another agent will do. The oracle's prediction, being based on the inner workings of the predictee's mind, should somehow be correlated with the predictee's action. The predictee, knowing this, will be better off if it models its action as affecting the oracle's decision, despite the latter having already occurred! As a result of such multi-agent thought experiments, CDT is found to be inconsistent in the following sense: given the ability to do so, a causal decision theorist will prefer to rewire its own brain so as to cease to follow CDT. In the authors' proposed alternative, functional decision theory (FDT), agents act as if they directly control the output of the mathematical function corresponding to their decision algorithm. 
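The predictee's dilemma can be made quantitative with the standard Newcomb payoffs. In the toy computation below (the payoff values 10^6 and 10^3 and the 99%-accurate oracle are conventional illustrative choices), an agent that models its choice as correlated with the oracle's prediction, as FDT recommends, sees a far higher expected payoff from taking only the opaque box:

```python
def expected_payoff(one_box: bool, oracle_accuracy: float) -> float:
    """Toy Newcomb problem. The opaque box holds 1e6 iff the oracle
    predicted one-boxing; the transparent box always holds 1e3. The
    oracle's prediction matches the agent's actual choice with
    probability oracle_accuracy, reflecting their correlation."""
    p_predicted_one_box = oracle_accuracy if one_box else 1 - oracle_accuracy
    opaque = p_predicted_one_box * 1e6
    transparent = 0.0 if one_box else 1e3
    return opaque + transparent

print(expected_payoff(True, 0.99))   # ~990000: one-boxing
print(expected_payoff(False, 0.99))  # ~11000: two-boxing underperforms
```

A causal decision theorist, treating the oracle's already-made prediction as unaffected by the choice, would two-box; the correlation built into the oracle is exactly what that model discards.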
Like CDT, FDT performs counterfactual reasoning on SCMs; however, it may model the same problem with a different graph. For example, the Newcomb problem above has an effect that temporally precedes its cause. We believe that, due to the misconception that such retroactive causality is more unphysical than CDT, FDT may have been relegated to an ultra-niche community. However, we saw previously that the counterfactuals of CDT cannot be interpreted in a literally physical sense either. In our view, then, CDT and FDT should compete on an equal footing, judged only on their respective ability to parsimoniously predict how the products of Darwinian evolution behave. In that light, the reflective inconsistency of CDT offers a compelling reason to prefer FDT over it. In Sections 5.3 and 5.4, we discussed the psychological arrow of time. Now, in closing, let's examine some basic limitations on empirical knowledge. By letting SPCAs stand in as a model for the Universe, we arrive at a version of the famous Boltzmann brain paradox; its resolution has consequences for our theories of inference and resource gathering. While related arguments have been made plenty of times in the literature, our SPCA version is simple, clear, and general. As a macroscopic inhabitant of an SPCA, we ask: what knowledge is available to us? The use of probability theory in Definition 4.2 might raise concerns. Even in the deterministic function presentation, the initial state is a random variable, distributed according to × Γ X×Z . We only have empirical access to one instance of the Universe, evolved from one sample from this distribution. How can our belief in this ensemble be justified from only one sample? What's to stop us from theorizing an alternative ensemble, perhaps one with radically different statistics, the only restriction being that one of its astronomically many outcomes correspond to our actual experience? 
Well, if we commit to the hypothesis that the microscopic degrees of freedom are i.i.d., then our one sample is really an endless sequence of samples from Γ. This is promising: now we can do statistical tests. However, this only worked because we confined ourselves to a small class of hypotheses in advance; otherwise, our statistical analysis might happen to overfit by chance. How should we decide which small class of hypotheses to prioritize? A classic theorem of algorithmic information theory states that, for any computable prioritization strategy, there's a corresponding encoding that compresses the highest-priority hypotheses to the smallest size. Thus, we might as well go with the strategy of prioritizing the hypotheses with the most concise descriptions. We'll elaborate on this idea later, but first let's consider an even more severe empirical limitation. It turns out that we cannot provably know much about the state of our specific instance of the Universe, either. Let's model the situation with a precise thought experiment. Example 5.1 (The Clueless God). As everything that we know should somehow be encoded in the brain, which lives within the SPCA, let's designate some region Mem ⊂ X, the long-term memory in which our knowledge should be stored. We imagine being born without any prior knowledge, Mem being filled with random junk data. Can we learn what's in this world, at least in our immediate vicinity? That is, can we arrange it so that, at some future time, the mutual information between Mem and its surroundings is large? Since it's difficult to reason in detail about the capabilities of a realistic agent controlling Mem, let's be generous and overstate our powers. Suppose that, rather than being totally ignorant, we somehow know that we are located inside an SPCA with finite local state space S, and a spatial geometry (X, T ) containing the special region Mem ⊂ X. Suppose we also know the dynamics , for which we'll assume ♯ to be stationary. 
No, better yet: suppose we can control the dynamics, perhaps even applying a different transition matrix in DM(S) at each position and time! The only thing we don't know is the actual state C, including its initial distribution. Certainly, equipped with such divine capabilities, we should have no trouble sensing some aspects of our surroundings, right? Unfortunately, we cannot rule out the possibility that the Universe is in heat death: a stationary distribution of zero negentropy. Once heat death occurs, stationarity (or the Resource Law) ensures that we remain in heat death forever, regardless of how we set the dynamics. Every term in Equation (20) is then zero; in particular, we must have I(C_{t,Mem}; C_{t,X\Mem}) = 0. Heat death is like a uniform soup of randomness. On rare occasions, we might find Boltzmann brains: regions shaped like Mem which contain thought patterns identical to what you are thinking right now, complete with all your senses and memories. For that instant, you'd feel as if you had lived a full life on planet Earth; the next instant, you'd vanish back into the soup of randomness. This seems absurd: while we can't logically eliminate heat death, it appears too contrived to be likely. However, the problem runs deeper. If we're looking for likely theories, we might encode our initial state of ignorance by imposing a uniform Bayesian prior on the Universe's initial configuration C_0. That is, we subjectively believe C_0 to be distributed according to ν = (|S|^{-1} ♯)^X; but this is the heat death distribution! Subjective heat death is mathematically equivalent to the previously considered objective heat death. Therefore, nothing that can be done within the Universe would produce evidence to sway our belief away from ν. By Equation (20), knowledge, as defined in terms of the mutual information, is quantitatively bounded by the aggregate negentropy resource; thus, our initial ignorance expressed in ν amounts to a form of starvation. Quite literally, knowledge is power.
To clarify the issue, our task can be made concrete: suppose we want to copy the contents of some region R ⊂ X \ Mem into Mem, thus obtaining knowledge about R. Such a copy operation cannot be implemented as a mixture of bijections, as it would overwrite the existing contents of Mem. If we had just one blank sheet of paper, we could swap it into Mem, and then reversibly copy our desired data onto it, e.g., using an XOR-assignment. However, our information capacity would never grow beyond the size of the allocated sheet, and would most likely deteriorate with imperfect use (e.g., see the sensing example at the end of Section 5.2). Moreover, we would need to know a priori about the paper, rather than use vision to see it, since the act of sensing is itself a form of writing. Without prior information, even our own senses are useless! The situation is similar to the No Free Lunch theorems in optimization theory [35]: without some sort of prior bias, inductive learning is impossible. It remains to ask ourselves which sort of prior is least questionable, philosophically and practically. Solomonoff proposed an answer in [30]. See also [14, §5] for a thorough textbook treatment, or [24] for a more accessible overview of the same ideas. To summarize, Solomonoff's theory of universal induction is a mathematically rigorous version of an ancient epistemic principle, Occam's razor. Informally speaking: Given competing theories that predict our observations equally well, prefer the simplest. "Theories that predict our observations" are identified with computer programs which, when run on a universal Turing machine U, output our observed data x; "the simplest" is the shortest such program. Once U is fixed, the Kolmogorov complexity K(x) of a given string x is defined to be the minimum length of a program p (in a prefix-free language) such that U(p) = x. The choice of U affects the Kolmogorov complexity by no more than a constant additive term that does not depend on x.
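Kolmogorov complexity itself is uncomputable, but any concrete compressor gives a computable upper bound on description length, up to an additive constant for the compressor's own description. A minimal illustration, using Python's zlib; the particular strings and threshold are chosen for demonstration only:

```python
import random
import zlib

def code_length_bits(data: bytes) -> int:
    """Bits zlib needs to describe `data`: a computable upper bound
    (up to an additive constant) on the Kolmogorov complexity K(data)."""
    return 8 * len(zlib.compress(data, 9))

random.seed(0)
regular = b"ab" * 500                                      # generated by a very short program
junk = bytes(random.randrange(256) for _ in range(1000))   # admits no short description

# Both strings are 1000 bytes long, but the regular one has a far shorter code.
assert code_length_bits(regular) < 200 < code_length_bits(junk)
```

A regular string and an incompressible one of equal length thus receive very different code lengths, which is exactly the asymmetry that a simplicity prior exploits.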
Solomonoff's universal prior can be defined in a few ways, all of which are equivalent up to constant factors. One is to give each string x a prior probability of 2^{-K(x)}. Another is to enter a uniformly random infinite sequence of bits into the universal machine U; since the language is prefix-free, this defines at most one finite prefix constituting a valid program p, which in turn defines at most one output string (if the program halts). Cover and Thomas [5, §14.6] illustrate the difference between the uniform and universal priors like so: imagine a monkey sitting at a keyboard and typing randomly. The uniform prior corresponds to a monkey entering the state C_0 directly into a text editor. On the other hand, the universal prior corresponds to a monkey typing a program p into an interpreter, which runs on the universal machine U to output C_0 := U(p). This prior is called universal because it dominates all lower-semicomputable priors, up to a constant factor that does not depend on x. Universality serves as a practical justification: natural selection favors agents whose inferences happen to succeed. Any inference computation taking place within an SPCA is equivalent to the results of some program running on U, and therefore any probabilities that it might compute will be dominated by the universal prior. Using such heuristic approximations of the universal prior solves our Boltzmann brain problem: when Mem contains well-formed memories, we'll be biased toward simpler explanations, in which the memories correlate with truths about the Universe. An agent with such a highly non-uniform prior can also "hunt" for negentropy by employing a compression algorithm, whose encodings are shortest for the strings that it deems likely. To see how, let f : {0, 1}* → {0, 1}* be an injective function such that, for some fixed c ∈ N that does not depend on x ∈ {0, 1}*, |f(x)| ≤ |x| + c. Fix n ∈ N.
Assuming |x| ≤ n, a fixed-length binary encoding of |x| takes 1 + ⌊log n⌋ bits, so the concatenation (|x|, x, 0^{n-|x|}) takes 1 + ⌊log n⌋ + n bits, regardless of x. By the injectivity of f, there exists a bijection on {0, 1}^{1+⌊log n⌋+n} that, for all x satisfying |x| ≤ n - c, maps (|x|, x, 0^{n-|x|}) to (|f(x)|, f(x), 0^{n-|f(x)|}). Suppose our agent finds a string x in the environment. If |f(x)| < |x|, then f effectively compresses x, lengthening the "bank" of zeros in the process. If the string f(x) has no further use to the agent, it can be expelled into the environment, perhaps swapping it with a fresh candidate for compression. In real-world terms, we may think of x as food, f(x) as waste, c as the energy expenditure of digestion, and the store of n - |x| zeros as free energy reserves. These reserves may be consumed to perform useful manipulations: for instance, reproduction can be implemented by XOR-assigning a copy of the agent onto a string of zeros. Our agent's performance depends on how well f compresses the typical strings that are found in its environment. Since K(x) < |f(x)| + O(1), where the constant depends on f but not on x, the Kolmogorov complexity can be seen as a platonic ideal for data compression. The Kolmogorov complexity is also closely related to the physicist's definition of entropy. In classical physics, the set of all possible states of an N-particle system, when all microscopic information is included, is a subset of R^{6N} called phase space. It is partitioned into measurable subsets called macrostates. The entropy of a system in macrostate M, then, is defined to be log μ(M), where μ is the Lebesgue measure on R^{6N}; this happens to equal Shannon's differential entropy of the uniform distribution on M. We might object that this definition depends heavily on the choice of partition. However, this dependence goes away if we assume two things: 1) the partition is simple, in the sense that each macrostate has a short description, and 2) the system's actual state is a "typical" random point within M.
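The food-and-waste bookkeeping above can be sketched concretely. Here f is a hypothetical toy compressor, not from the paper, chosen only so that injectivity and the bound |f(x)| ≤ |x| + c (with c = 1) are easy to verify:

```python
def f(x: str) -> str:
    """A toy injective map with |f(x)| <= |x| + 1 (so c = 1): the leading
    output bit records which branch was taken, which keeps f invertible."""
    if x.startswith("00000000"):
        return "0" + x[8:]   # a run of eight zeros shrinks to one marker bit
    return "1" + x           # any other string pays a one-bit penalty

def pad(x: str, n: int) -> str:
    """The fixed-length form (|x|, x, 0^(n-|x|)), on bit_length(n) + n bits."""
    return format(len(x), f"0{n.bit_length()}b") + x + "0" * (n - len(x))

def bank(s: str) -> int:
    """Length of the trailing run of zeros: the free-energy reserve."""
    return len(s) - len(s.rstrip("0"))

n = 32
x = "00000000" + "1011"   # a compressible string found in the environment

assert len(f(x)) < len(x)                    # digestion: 12 bits shrink to 5
assert len(pad(x, n)) == len(pad(f(x), n))   # same fixed-length container
# Since f is injective, pad(x, n) -> pad(f(x), n) extends to a bijection,
# i.e. a reversible update, and the zero bank grows by |x| - |f(x)| = 7 bits.
assert bank(pad(f(x), n)) == bank(pad(x, n)) + 7
```

Running f forward lengthens the trailing zero bank exactly when it compresses; running the extended bijection backward would "un-digest" the waste, as reversibility demands.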
Typicality means that, conditional on M, the point's shortest description, up to neighborhoods of measure less than a given ε > 0, has length approximately log μ(M) + log(1/ε). It can be shown that this two-part description is approximately the shortest possible: since the macroscopic part is short, it follows that the latter expression approximates the ε-neighborhood's Kolmogorov complexity. For details of this reduction, see [14, §8.6-8.7]. There's a caveat: in a reversible universe, closed systems that have mixed must necessarily "unmix" in their time-reversed evolutions; as such, they cannot truly be at typical random states. Nonetheless, ergodicity makes it so that, over time, the state appears random at finer and finer levels of precision ε. For a practical illustration, consider a well-mixed container of gas. It will be "typical" given its macroscopic parameters, which include temperature, pressure, and chemical composition. Describing these to a satisfactory degree of precision takes on the order of 100 bits. This is negligible in comparison to the entropy of the large number of particles in the container, on the order of Avogadro's constant, 6 × 10^23. Therefore, we can in principle measure the macroscopic parameters almost "for free", thereby obtaining macroscopic knowledge of the gas's contents. The gas's usefulness as a resource, then, is quite accurately determined by the physicist's entropy. Such two-part descriptions apply equally well to probabilistic theories in fundamental physics. For example, consider a quantum theory, which uses the Born rule to prescribe probabilistic outcomes for quantum measurements. We consider this theory to be good if the experimental data x's two-part description, consisting of quantum theory along with x's Shannon encoding optimized for the Born probabilities, has length not much more than K(x). With this approach, probabilistic theories are not so mysterious after all. Like other theories, they compete to win the favor of Occam's razor.
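As a toy version of this accounting, one can compare two-part description lengths for coin-flip data; the specific bit costs below are illustrative assumptions, not values from the paper:

```python
import math

def shannon_bits(data: str, p_heads: float) -> float:
    """Ideal Shannon code length for a flip sequence, under a theory
    assigning probability p_heads to the outcome '1'."""
    return sum(-math.log2(p_heads if flip == "1" else 1.0 - p_heads)
               for flip in data)

data = "1" * 900 + "0" * 100      # observations from a heavily biased coin

# Two-part description: a few bits to state the theory's parameter, plus the
# Shannon encoding of the data optimized for that theory's probabilities.
theory_bits = 10                  # say, p_heads written to ~3 decimal places
biased = theory_bits + shannon_bits(data, 0.9)
fair = shannon_bits(data, 0.5)    # the parameter-free "fair coin" theory

assert biased < fair              # roughly 479 bits versus exactly 1000 bits
```

The biased theory wins despite paying for its parameter, and nearly all of its description length is the Shannon-coded data rather than the theory's conceptual core.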
To say that the Universe is probabilistic is to say that its statistics are "typical" for the probabilistic theory in question, in the sense that our two-part description cannot be further compressed. Now, if it so happens that several competing theories are of similar length, we might as well choose the most useful. In this case, probabilistic theories are especially promising, because only a small fraction of K(x) goes toward their conceptually interesting core; the rest is a Shannon-style encoding of the probabilistic observations. For a rigorous discussion of two-part theories, see [14, §5.5]. In summary, we provide a model, the SPCA, in which the thermodynamic and psychological arrows of time emerge from reversible dynamics and an initial condition. We do this in two parts. In the first part, we prove, within our model, a physically plausible correspondence between microscopic and macroscopic dynamics. Several additional results come out of this correspondence. For example, we find that reversibility with measure-preservation at the microscopic level is equivalent to double-stochasticity at the macroscopic level. We also find that time-reversal, in the absence of "coincidences" due to the initial condition, amounts to taking the dual of the macroscopic transition matrix. In the second part, we prove the laws of information dynamics, which generalize and extend the second law of thermodynamics. We see that the laws take a more powerful form when we're able to insulate a system from having information enter or leave it. Then, we discuss a variety of applications. We use the laws to show how negentropy behaves as a resource. On the union of two systems, it decomposes into the negentropies of the two individual systems plus a mutual information term. For closed systems, all three of these terms are non-increasing. In particular, this means a decrease in the mutual information term cannot be offset by an increase elsewhere.
We present this fact as an extension of Landauer's principle, within a wider discussion of operations that do or don't necessarily carry an entropic cost. Regarding the psychological arrow of time, we discuss how memory, causality, and agency emerge along the arrow of time in an SPCA. After seeing that counterfactuals are useful despite not representing physical reality, we argue that a related idea, functional decision theory, should be taken equally seriously. Finally, we discuss how the lack of an informative prior leads immediately to Boltzmann brains. Invoking Solomonoff induction to escape this trap, we find a way to express resource gathering in terms of data compression. For the reader's benefit, we take this opportunity to summarize the connection between Kolmogorov complexity and the physicist's entropy. We now suggest some directions for future work. One possibility would be to bridge the gap between our model and real-world physics. For example, it remains to be seen how the laws of information dynamics would adapt to continuous and/or relativistic space-time geometries, to the setting of quantum information, or under the constraint of conservation laws. Even within the SPCA setting, the laws of information dynamics might be formulated with different hypotheses: for instance, the hypothesis of closed systems might be weakened, in order to study situations where the flow of information in or out is restricted in realistic ways. Another potential direction would be to explore more of the existing physics literature on entropy and information, looking for places where the simplicity of SPCAs might improve an analysis, either simplifying, generalizing, or contradicting it. In the case of a contradiction, one should investigate whether the fault lies in an error in the existing analysis, or in a feature of the real world that's not adequately captured by the SPCA model.
For example, it was recently argued that the entropic cost of measuring time is linear in the number of ticks of precision [23]. The proposed clock used irreversible ticks. If, instead, the ticks were implemented by incrementing a reversible counter, then it would seem that the entropic cost should reduce to just the amount of memory needed to record the time, which is only logarithmic in the number of ticks. However, this remains to be demonstrated in the real world. As mentioned in Section 5.3, it might be possible to unify the mutual information and logical depth approaches to resource and memory. Such a connection may help us better understand the limits of reversible computing, since the time-consuming cleanup operations of reversible computers essentially serve to undo logical depth [33]. Finally, it would be very exciting to see a fully rigorous proof that causal decision theory, or perhaps functional decision theory, is the right model of intelligence. By this, we mean a proof that the decision theory accurately predicts the behavior of sophisticated products of Darwinian selection, in the context of some model with a space-time structure. Such a development might contribute towards a more unified formulation of the psychological arrow of time.

References

The Ghost in the Quantum Turing Machine
The thermodynamics of computation - a review
Logical depth and physical complexity
The second laws of quantum thermodynamics
Elements of information theory
Entropy in ergodic theory and topological dynamics
Diffusion, effusion, and chaotic scattering: An exactly solvable Liouvillian dynamics
Entropy, biological evolution and the psychological arrow of time
A remark on the C.T.P. theorem. Helvetica Physica Acta
Reversibility and Surjectivity Problems of Cellular Automata
Reversible Cellular Automata: From Fundamental Classical Results to Recent Developments
Probability Theory: A Comprehensive Course
Irreversibility and Heat Generation in the Computing Process
An Introduction to Kolmogorov Complexity and Its Applications
Quantum solution to the arrow-of-time dilemma
Relation between the psychological and thermodynamic arrows of time
Theory of Reversible Computing
Computation universality of one-dimensional reversible (injective) cellular automata
Master-equation approach to deterministic chaos
Thermodynamics of information
Causality (2 ed.)
Structural Counterfactuals: A Brief Introduction
Measuring the thermodynamic cost of timekeeping
A probabilistic solution of problem 111
Agency in Physics (2020). arXiv: History and Philosophy of Physics
Memory and Entropy (2020). arXiv: History and Philosophy of Physics
Entropy, Divergence, and Majorization in Classical and Quantum Thermodynamics (2020). arXiv: Quantum Physics
Chaos for Liouville probability densities
A Formal Theory of Inductive Inference. Parts I and II
The first law of general quantum resource theories
The cellular automaton interpretation of quantum mechanics
Time, space, and energy in reversible computing
Memory systems, computation, and the second law of thermodynamics
No free lunch theorems for optimization
Functional Decision Theory: A New Theory of Instrumental Rationality

Acknowledgments

Jason Li contributed in-depth feedback at several stages of this manuscript's preparation, which helped to clarify this work. Danica Sutherland and William Evans provided additional helpful feedback.