Time and Knowability in Evolutionary Processes

Elliott Sober and Mike Steel*†

Historical sciences like evolutionary biology reconstruct past events by using the traces that the past has bequeathed to the present. Markov chain theory entails that the passage of time reduces the amount of information that the present provides about the past. Here we use a Moran process framework to show that some evolutionary processes destroy information faster than others. Our results connect with Darwin’s principle that adaptive similarities provide scant evidence of common ancestry whereas neutral and deleterious similarities do better. We also describe how the branching in phylogenetic trees affects the information that the present supplies about the past.

Received February 2014; revised April 2014.

*To contact the authors, please write to: Elliott Sober, Philosophy Department, University of Wisconsin, Madison, WI, USA; e-mail: ersober@wisc.edu. Mike Steel, Biomathematics Research Centre, University of Canterbury, Christchurch, New Zealand.

†Elliott Sober presented this paper in 2012 at the University of Bordeaux, the London School of Economics, and the Institute for Mathematical Philosophy at the Ludwig Maximilian University in Munich, and received valuable comments. We are grateful for these and also to Elchanan Mossel for helpful comments. Elliott Sober thanks the William F. Vilas Trust of the University of Wisconsin–Madison, and Mike Steel thanks the Allan Wilson Centre (New Zealand).

Philosophy of Science, 81 (October 2014) pp. 558–579. 0031-8248/2014/8104-0009$10.00 Copyright 2014 by the Philosophy of Science Association. All rights reserved.

1. Introduction. What is the epistemic relation of present to past? Absent a time machine, we are trapped in the present and must rely on present traces to learn about the past. There are memory traces inside the skull, but outside there are tree rings, fossils, and traces of other kinds. People use these traces to reconstruct the past. Sometimes they simply assume that the traces provide unerring information about the past, but often they realize that the jump from present to past is subject to error. A bevy of epistemic concepts can be pressed into service to investigate the relation of present traces to past events, ranging from strong concepts like knowledge and certainty to more modest ones like justified belief and evidence.

We are interested in how the natural processes connecting past to present constrain our ability to know about the past by looking at the traces found in the present. An optimistic view of these processes is that the past is potentially an open book; all we need do is understand the connecting processes correctly and look around for the right traces. If the relation of the past state of a system to its present state were deterministic and one to one, this optimistic view would be correct. If only we could know the present state with sufficient precision, and if only we could grasp the true mapping function that connects present and past, we would be home free. This optimism is something that Laplace (1814, 4) affirmed when he discussed une intelligence (now referred to as “a demon”):

We may regard the present state of the universe as the effect of its past and the cause of its future. An intellect which at a certain moment would know all forces that set nature in motion, and all positions of all items of which nature is composed, if this intellect were also vast enough to submit these data to analysis, it would embrace in a single formula the movements of the greatest bodies of the universe and those of the tiniest atom; for such an intellect nothing would be uncertain and the future just like the past would be present before its eyes.
It is worth noting that determinism is not sufficient for the optimistic view to be true; without the one-to-one assumption, distinct states of the past may map onto the same state of the present, with the consequence that the exact state of the past cannot be retrieved from even a perfectly precise grasp of the present.

Is determinism necessary for the optimistic view to be true? It is not, provided that we set to one side strong concepts like knowledge and certainty and take up an epistemic evaluation that is more modest. Consider, for example, a process in which the system is, at each moment, in one of two states (coded 0 and 1). Suppose Past = 0 makes Present = 0 extremely probable (say, 0.96) and that Past = 1 makes Present = 1 extremely probable as well (say, 0.98). This means that when we observe the system’s present state, we gain strong evidence that discriminates between the two hypotheses Past = 0 and Past = 1. We cannot infer from Present = 0 that the past state was certainly 0; in fact, we cannot even infer that the past state was probably 0. But we can conclude that the observation favors the hypothesis that Past = 0 over the hypothesis that Past = 1. This conclusion is licensed by what Hacking (1965, 59–62) calls the Law of Likelihood:

Observation O favors hypothesis $H_1$ over hypothesis $H_2$ if and only if $\Pr(O \mid H_1) > \Pr(O \mid H_2)$.

Royall (1997, 9–11) suggests that this qualitative principle should be supplemented by a quantitative measure of favoring:

The degree to which O favours $H_1$ over $H_2$ is given by the likelihood ratio $\Pr(O \mid H_1)/\Pr(O \mid H_2)$.

Royall further suggests that a reasonable convention for separating strong evidence from weak is a ratio of 8. Royall’s suggestion entails that the probabilistic process just described has the consequence that Present = 0 provides strong evidence favoring Past = 0 over Past = 1, since the likelihood ratio is 0.96/0.02 = 48.
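The arithmetic of the two-state example can be checked with a minimal Python sketch. The transition probabilities 0.96 and 0.98 and the threshold of 8 come from the text; the prior `Fraction(1, 100)` on Past = 0 is a hypothetical value, chosen only to illustrate how an observation can strongly favor Past = 0 without making Past = 0 probable.

```python
from fractions import Fraction

# Transition probabilities from the two-state example in the text.
pr_present0_given_past0 = Fraction("0.96")
pr_present1_given_past1 = Fraction("0.98")
pr_present0_given_past1 = 1 - pr_present1_given_past1

# Royall's suggested convention for "strong" evidence.
STRONG_EVIDENCE_THRESHOLD = 8

# Likelihood ratio for the observation Present = 0.
ratio = pr_present0_given_past0 / pr_present0_given_past1
is_strong = ratio > STRONG_EVIDENCE_THRESHOLD

# Hypothetical prior (an assumption, not from the text): Past = 0 is rare.
prior_past0 = Fraction(1, 100)
posterior_past0 = (pr_present0_given_past0 * prior_past0) / (
    pr_present0_given_past0 * prior_past0
    + pr_present0_given_past1 * (1 - prior_past0)
)

print("likelihood ratio:", ratio, "strong:", is_strong)
print("posterior Pr(Past = 0 | Present = 0):", float(posterior_past0))
```

Under this assumed prior the posterior probability of Past = 0 stays below one half even though the likelihood ratio is far above Royall's threshold, which is the distinction between favoring and probability that the passage draws.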
This simple example should not be overinterpreted. If there were a process connecting past to present in the way described, the present would provide strong evidence about the past. Do not forget the if. Perhaps there are such processes, especially when the past under discussion is the recent past. But what if we consider not just the recent past, but past times that are more and more ancient? How does increasing the temporal separation between present and past affect the amount of information that the present provides about the past?

2. Two Theorems. A simple theorem provides an answer to the question just posed (Cover and Thomas 2006). Consider a system that at any time is in one of n possible states ($s_1, s_2, \ldots, s_n$). For simplicity we shall think of the system as evolving in discrete time steps. We stipulate that the system has the following two properties:

The Markov property. For any two times $t_1 < t_2$, the state of the system at time $t_1$ screens-off the system’s history prior to $t_1$ from the state at $t_2$. That is, for all states x and y:

$$\Pr(\text{system is in state } y \text{ at } t_2 \mid \text{system is in state } x \text{ at } t_1) = \Pr(\text{system is in state } y \text{ at } t_2 \mid \text{system is in state } x \text{ at } t_1 \ \&\ \text{system's history prior to } t_1).$$

Note that the Markov property does not require that the transition probabilities are constant with time (chains with constant transition probabilities are often called ‘time-homogeneous’); rather, they may vary from step to step.

The second property we stipulate is sometimes referred to, in the time-homogeneous, finite-state setting, as the Markov chain’s being “regular”; we will use the same term in the broader setting we are considering in which time homogeneity is relaxed:

Regularity. For some positive integer n (a number of time steps), and some strictly positive real value ε, the following holds for all ordered pairs of states (i, j): given that the system at any given time t is in state i, the probability that the system is in state j at time t + n is at least ε.
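For the special case of a time-homogeneous chain, regularity can be checked directly by taking powers of the transition matrix. The following is a minimal sketch under that assumption; the function names, the search cap `max_steps`, and the three example matrices are illustrative choices, not part of the paper. The time-inhomogeneous setting considered in the text would require the bound to hold at every starting time t, which this sketch does not cover.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def regularity_witness(transition, max_steps=50):
    """Search for a step count n and bound eps witnessing regularity.

    Returns (n, eps) for the smallest n <= max_steps such that every entry of
    the n-step transition matrix is positive, where eps is the smallest entry.
    Returns None if no such n is found within max_steps.
    """
    power = transition
    for n in range(1, max_steps + 1):
        eps = min(min(row) for row in power)
        if eps > 0:
            return n, eps
        power = mat_mul(power, transition)
    return None


# The two-state example from the introduction: regular with n = 1.
two_state = [[0.96, 0.04], [0.02, 0.98]]
# A chain that deterministically alternates between states: periodic, not regular.
flip_flop = [[0.0, 1.0], [1.0, 0.0]]
# A chain with one forbidden one-step transition that becomes regular at n = 2.
lazy = [[0.0, 1.0], [0.5, 0.5]]

for name, chain in [("two_state", two_state), ("flip_flop", flip_flop), ("lazy", lazy)]:
    print(name, regularity_witness(chain))
```

The alternating chain shows why regularity asks for a single n that works for all pairs of states: every power of that matrix still contains a zero entry.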
This condition asserts that it is possible to move from any given state to any other state (including the original state) in a fixed finite number of steps with a probability that remains bounded away from zero. Regularity for finite-state time-homogeneous Markov processes is equivalent to the condition that the Markov chain is ‘aperiodic’ and ‘irreducible’ (for details, see Häggström [2002, corollary 4.1]).

For any system of this sort, the following result holds (a precise statement and proof are provided in app. A):

Exponential information loss theorem. If a finite-state system satisfies the Markov property and regularity, then I(Past; Present) is less than or equal to a term that approaches zero exponentially fast as the time between the present and past increases.

Here I(X; Y) is the “mutual information” linking the two variables. If the variables are discrete, the formula for this quantity is

$$I(X; Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log\left(\frac{p(x, y)}{p(x)\,p(y)}\right),$$

where p(x, y) is the joint probability that X = x and Y = y, and p(x), p(y) are the (marginal) probabilities that X = x, and that Y = y, respectively. Mutual information measures how much (on average) you learn about the state of one of the variables by observing the state of the other. Its value is zero when X and Y are independent; otherwise it is positive. Mutual information is symmetrical: I(X; Y) = I(Y; X).

The exponential information loss theorem generalizes the special case where transition probabilities are constant with time (Sober and Steel 2011, proposition 6). This generalization is important, since many processes (including the biological examples we will discuss) often change their rates from one period of time to another.

Note that the theorem does not ensure a monotonic decline in information as the temporal separation of past and present is increased.
That extra element is provided by a different result, the so-called Data Processing Inequality (DPI; Cover and Thomas 2006, 32):

The Data Processing Inequality: In a causal chain from a distal cause D to a proximate cause P to an effect E, if P screens-off D from E, then I(E; D) is less than or equal to both I(E; P) and I(P; D).

For a discrete-state process, these two inequalities are strict whenever P is neither perfectly correlated with D or with E nor independent of them (see app. B). The Data Processing Inequality does not require that the process linking D to P is the same as the process linking P to E.

The Data Processing Inequality is “chain internal.” It does not say that the present always provides more information about recent events than about ones that are older. Consider figure 1. Suppose that R screens-off A from $D_1$ & $D_2$. Then the Data Processing Inequality says that $I(D_1 \,\&\, D_2; R) \ge I(D_1 \,\&\, D_2; A)$. It does not say that $I(D_1 \,\&\, D_2 \,\&\, \cdots \,\&\, D_{100}; R) \ge I(D_1 \,\&\, D_2 \,\&\, \cdots \,\&\, D_{100}; A)$. It is perfectly possible that the hundred descendants that now exist provide more information about A than they do about R. Note that R does not screen-off A from $D_1$ & $D_2$ & . . . & $D_{100}$. The heightened information that the leaves provide about A is not due to the simple fact that A has a larger number of outgoing lineages than R does; it is possible to modify this tree so that each nonleaf node (including the root node A) has just two descending lineages (Sober and Steel 2011, fig. 5b, 243–34), and still the leaves provide more information about A than they do about R, owing to the values of the transition probabilities attaching to branches.

Both the exponential information loss theorem and the Data Processing Inequality are very general.
They characterize any system whose laws of motion have the requisite probabilistic features. The system might be a chamber of gas, but it also might be an evolving population of organisms. Indeed, if there were disembodied spirits that changed probabilistically, the results would apply to them. Both results are more general than physics: they cover the systems and properties that are discussed in the laws of physics, but they also apply to systems and properties that are not. In addition, both results are a priori mathematical truths, although of course it is an empirical matter whether a given system satisfies the antecedent of the conditional that each result expresses (Sober 2011a).

Figure 1. A case in which it is possible for the present to provide more information about an ancient event (A) than about a more recent event (R).

3. Five Evolutionary Processes. Since our interest here is in how evolutionary processes affect the amount of information that the present provides about the past, it is worth making clear how the exponential information loss theorem applies to models of biological evolution. In phylogenetics, the rapid loss of information in models of nucleotide substitution with time has been highlighted as a significant problem for using DNA sequence data to accurately resolve deep divergences of species lineages (for a recent review, see Salichos and Rokas [2013]) and for inferring ancestral states deep within a given tree (Mossel 2003; Gascuel and Steel 2014).

We will return to phylogenies in a later section, but for now we consider information loss in population genetics. The application of information theory to population genetics has been investigated a bit. For example, Frieden, Plastino, and Soffer (2001) explore a variational principle (“extreme physical information”) based on Fisher information to study genotype frequency changes.
The questions we consider here are different, as our primary interest is in the relative ranking of likelihood ratios across different evolutionary processes, including both drift and different types of selection.

The Moran (1962) models of evolution will be our workhorse in what follows. We consider a population containing N individuals. The population evolves through a sequence of discrete temporal “moments” (how long a moment is will not matter). At each moment, one of those N individuals produces a copy of itself and one of those N individuals dies. We consider two traits A and B; each individual has one of them or the other. At any moment, the population is in one of N + 1 states (ranging from 0% A to 100% A). This Moran framework can be articulated in different ways to represent different evolutionary processes. For example, if individuals are chosen at random to reproduce and die, then we have a drift process. Selection processes of different kinds can be represented by letting A individuals have chances of dying or reproducing that differ from those possessed by B individuals. A population undergoing a Moran process forms a Markov chain, with its recent past screening-off its more remote past from the present.

Suppose we observe the population in the present and see that all N individuals are in state A. How much information does that observation provide about the state of the population at some earlier time? If all states are accessible to each other (which requires that mutations can prevent the population from getting “stuck” at 100% A or 100% B), then the exponential information loss theorem applies and so the mutual information declines asymptotically to zero with time. However, if there is no mutation, then the population will evolve to either 100% A or 100% B and will stay there. In this case, the present state of the population provides information about its past even if the two are infinitely separated. For example, if we observe that the population is now 100% A and the population has been evolving by drift, this
observation favors the hypothesis that the population was at more than 95% A at some earlier time over the hypothesis that it was 5% A or less, and this is true regardless of the time separation between the past and the present.

John Maynard Keynes (1924, chap. 3) once said that “in the long run, we are all dead.” His point was to pooh-pooh the relevance of claims about the infinite long run. What should matter to us mortals is finite time. This point applies to the bearing of the exponential information loss theorem on our knowledge of the evolutionary past. Who cares if mutual information goes to zero as the time separating present from past goes to infinity? Life on earth is a mere 3.8 billion years old. What is relevant is that information decays monotonically in Markov chains. But in addition, we know that there are different kinds of evolutionary process. Which of these speeds the loss of information and which slows it?

The five processes we want to investigate are represented in figure 2; each of them can result in our present observation that all N of the individuals in the population now have trait A. In panel i, trait A was favored by selection. In panel iii, there was selection against trait A. In panel ii, the traits are equal in fitness, and so the traits evolved by pure drift. In panel iv, selection favored the majority trait. And in panel v, selection favored the minority trait. We are not asking which of these process hypotheses is most plausible, given the observed state of the population at present. Rather, we want to explore what happens to information loss under each of these five scenarios, in each case thinking of the process in the context of the Moran framework of a finite population of fixed size N.
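As a rough illustration of information loss in this setting, the sketch below builds a neutral Moran chain with mutation and computes I(Past; Present) at several temporal separations. Several ingredients are assumptions made for illustration and are not specified in the text: the symmetric mutation scheme (the new copy switches trait with probability `mutation`), the uniform prior over past states, the population size of 10, and the chosen checkpoints. Mutual information is reported in nats.

```python
from math import log


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def moran_drift_matrix(pop_size, mutation):
    """One-step transition matrix on states k = 0..pop_size (number of A's).

    Assumed scheme: a uniformly chosen individual is copied, the copy switches
    trait with probability `mutation`, and a uniformly chosen individual dies.
    """
    matrix = []
    for k in range(pop_size + 1):
        freq = k / pop_size
        # Probability that the new copy carries trait A after possible mutation.
        birth_a = freq * (1 - mutation) + (1 - freq) * mutation
        up = birth_a * (1 - freq)       # an A is born and a B dies
        down = (1 - birth_a) * freq     # a B is born and an A dies
        row = [0.0] * (pop_size + 1)
        if k < pop_size:
            row[k + 1] = up
        if k > 0:
            row[k - 1] = down
        row[k] = 1.0 - up - down
        matrix.append(row)
    return matrix


def mutual_information(prior, conditional):
    """I(Past; Present) in nats, given Pr(Past) and Pr(Present | Past)."""
    size = len(prior)
    present = [sum(prior[i] * conditional[i][j] for i in range(size))
               for j in range(size)]
    total = 0.0
    for i in range(size):
        for j in range(size):
            joint = prior[i] * conditional[i][j]
            if joint > 0:
                total += joint * log(joint / (prior[i] * present[j]))
    return total


def information_curve(pop_size, mutation, checkpoints):
    """Mutual information between Past and Present at each separation t."""
    step = moran_drift_matrix(pop_size, mutation)
    size = pop_size + 1
    prior = [1.0 / size] * size  # assumed uniform prior over past states
    power = [[float(i == j) for j in range(size)] for i in range(size)]
    curve = {}
    for t in range(1, max(checkpoints) + 1):
        power = mat_mul(power, step)
        if t in checkpoints:
            curve[t] = mutual_information(prior, power)
    return curve


curve = information_curve(pop_size=10, mutation=0.01, checkpoints=[10, 100, 1000])
for t, nats in sorted(curve.items()):
    print(t, round(nats, 4))
```

Because the chain is Markovian, the Data Processing Inequality guarantees that the printed values cannot increase with t; how quickly they fall depends on the assumed mutation rate and population size.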
To compare the five processes in this respect, we calculate the following likelihood ratio for each of them:

$$R_{ij} = \frac{\Pr(\text{Present} = N \mid \text{Past} = i)}{\Pr(\text{Present} = N \mid \text{Past} = j)}, \quad \text{for } i > j.$$

Although our main interest is to compare cases in which i is close to N and j is close to 0, the Moran framework makes it easy to derive results for the more general case in which i > j.

So the observation is that all N of the individuals now in the population have trait A. Does this observation favor the hypothesis that there were exactly i individuals at some past time who had trait A over the hypothesis that there were exactly j, where i > j? It does; $R_{ij} > 1$ for each of the five processes we are considering (see app. C). Our question is how the magnitude of $R_{ij}$ depends on the underlying evolutionary process. It is worth noting that although the likelihood ratio $R_{ij}$ is “past-directed” (in that it describes the degree to which a present observation discriminates between two hypotheses about the past), evaluating this ratio requires one to consider two “future-directed” probabilities: the probability of reaching Present = N if the system begins at Past = i and the probability of reaching Present = N if the system begins at Past = j.

Figure 2. Five processes that can result in all N of the individuals now in the population having trait A.

We begin by adopting an assumption that we treated above with Keynesian disdain. Let us assume that the temporal separation of past and present is infinite and that there is zero mutation. We will relax this idealization in due course. For each of the processes we are considering, we now can describe what the value of $R_{ij}$ is for each pair of values for i and j such that i > j. For example, under neutral evolution the value of $R_{ij}$ is i/j. The values for the other four processes are given in appendix C. The ordering of the $R_{ij}$ values for the five processes is depicted in figure 3.
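The infinite-time, zero-mutation values of $R_{ij}$ can be explored numerically with the standard fixation-probability formula for birth-death chains. The sketch below is an illustration, not a reproduction of appendix C: it assumes that selection acts on reproduction, so that the ratio Pr(k → k−1)/Pr(k → k+1) equals the fitness of B divided by the fitness of A, and it uses a hypothetical twofold fitness difference with N = 10, i = 9, j = 1. The function names are likewise illustrative.

```python
from fractions import Fraction


def fixation_probabilities(pop_size, death_birth_ratio):
    """Probability of ending at state pop_size (all A) from each start state.

    death_birth_ratio(k) is Pr(k -> k-1) / Pr(k -> k+1) for 0 < k < pop_size,
    in a birth-death chain with no mutation (states 0 and pop_size absorb).
    """
    partial = [Fraction(0), Fraction(1)]  # running sums for start states 0 and 1
    product = Fraction(1)
    for k in range(1, pop_size):
        product *= death_birth_ratio(k)
        partial.append(partial[-1] + product)
    return [value / partial[-1] for value in partial]


def r_ij(pop_size, i, j, death_birth_ratio):
    """Likelihood ratio Pr(Present = N | Past = i) / Pr(Present = N | Past = j)."""
    fix = fixation_probabilities(pop_size, death_birth_ratio)
    return fix[i] / fix[j]


def constant(ratio):
    """Frequency-independent process: the same death/birth ratio in every state."""
    return lambda k: ratio


def frequency_dependent(pop_size, majority_fitness):
    """Whichever trait is in the majority has relative fitness majority_fitness."""
    def ratio(k):
        if 2 * k > pop_size:
            return 1 / majority_fitness   # A is the majority trait
        if 2 * k < pop_size:
            return majority_fitness       # B is the majority trait
        return Fraction(1)
    return ratio


N, i, j = 10, 9, 1  # hypothetical values with i near N and j near 0
drift = r_ij(N, i, j, constant(Fraction(1)))
selection_for_a = r_ij(N, i, j, constant(Fraction(1, 2)))   # A twice as fit as B
selection_against_a = r_ij(N, i, j, constant(Fraction(2)))  # A half as fit as B
majority_favored = r_ij(N, i, j, frequency_dependent(N, Fraction(2)))
minority_favored = r_ij(N, i, j, frequency_dependent(N, Fraction(1, 2)))

for name, value in [("selection for A", selection_for_a), ("drift", drift),
                    ("selection against A", selection_against_a),
                    ("minority favored", minority_favored),
                    ("majority favored", majority_favored)]:
    print(f"{name}: {float(value):.3f}")
```

Exact rational arithmetic is used so that the drift value can be compared directly with i/j; for these particular parameter choices the remaining values fall on the sides of drift that the text describes.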
Let us first consider three of the cases described in figure 3: drift and the two cases of frequency-independent selection. The ordering of $R_{ij}$ values for these three processes means that the observation that all the individuals in the population now have trait A provides more information about the past state of the population the less probable it was that A would evolve to fixation. Selection for A is at the bottom of the pile, with neutrality next, and selection against A at the top.

The ordering of these three processes has an intuitive interpretation. Suppose we observe that trait A is fixed in a population and we wish to estimate whether A was common (at frequency i) or rare (at frequency j < i) at some time in the past. Selection for A makes it easy for A to go to 100% in the population, regardless of whether it starts off common or starts off rare, which is why selection for A comes in last in the part of figure 3 that describes frequency-independent processes. The evidence in favor of the former hypothesis relative to the latter, as measured by the likelihood ratio, is stronger under selection against A than under a drift model. This is because the drift model provides more opportunity for a rare allele A to fix at 100% in the population than the model in which A is selected against. For a formal proof concerning how the $R_{ij}$ values for different processes compare, see appendix C.

Figure 3. Comparing $R_{ij}$ values for five processes, assuming infinite temporal separation of Past from Present and zero mutation. The relation of the two frequency-dependent processes to drift is derived using the assumption that j < N/2.

The three results depicted in figure 3 that describe frequency-independent processes echo an insight that Darwin expresses in the Origin: “Adaptive characters, although of the utmost importance to the welfare of the being, are almost valueless to the systematist.
For animals belonging to two most distinct lines of descent, may readily become adapted to similar conditions, and thus assume a close external resemblance; but such resemblances will not reveal – will rather tend to conceal their blood-relationship to their proper lines of descent” (Darwin 1859, 427). Darwin illustrates this idea by giving an example: whales and fish both have fins, but this is not strong evidence for their common ancestry, since the trait is an adaptation for swimming through water. Far stronger evidence for common ancestry is provided by similarities that are useless or deleterious. One of us has called this idea Darwin’s Principle, in view of its centrality to Darwin’s framework (Sober 2008, 2011b).

Darwin’s topic in the passage quoted is inferring common ancestry, not inferring the past state of a lineage from its present state, but the epistemologies are similar. Here is a simplified argument that illustrates why. Suppose trait A has probability p of being fixed in a recent species. If two species x and y diverged from their most recent common ancestor very recently, the probability that they both have trait A is approximately p. On the other hand, if they have no common ancestor, then the probability of them both having A is $p^2$. This means that the likelihood ratio of the hypotheses “x and y have a recent common ancestor” and “x and y have no common ancestor” is approximately $p/p^2 = 1/p$ under Markovian trait evolution. Thus, the smaller p is, the greater this likelihood ratio will be. And the value for p if there is selection for A is larger than the value for p if there is drift, which in turn is larger than the value for p if there is selection against A.

There are two cases of frequency-dependent selection represented in figure 3.
One of them (frequency-dependent selection for the majority trait) shows that it would be an overstatement to say that the current state of the population (all individuals having trait A) always provides scant evidence concerning the population’s past state if trait A evolved because of natural selection. It matters a great deal what sort of selection process we are talking about. Frequency-dependent selection for the majority trait is better than drift in terms of how much information the present state of a lineage provides about its past. Figure 3 also locates the evidential meaning of frequency-dependent selection for the minority trait; it has an informational yield that is worse than that provided by drift. As noted in appendix C, our results for both cases of frequency-dependent selection make the assumption that j < N/2. This assumption ensures that the ordering of the frequency-dependent selection cases is always maintained as shown in figure 3; it is an innocent assumption since, as noted above, we are mainly interested in the case where i and j are majority and minority values, respectively.[1]

Figure 3 provides only a partial ordering of the five cases depicted. The reason for this is that a comparison of, say, frequency-independent selection against A with frequency-dependent selection for the majority trait would depend on the values of specific parameters.

We now can remove the idealization of infinite time and zero mutation. The ordering of the $R_{ij}$ ratios for the five processes, when time is infinite and mutation is zero, is the same as the ordering of those processes for the following slightly different likelihood ratio, when time is finite (and sufficiently large) and there is a sufficiently small mutational input:

$$\frac{\Pr(\text{Present} = N \mid \text{Past} \approx N)}{\Pr(\text{Present} = N \mid \text{Past} \approx 0)}.$$

See appendix D for a proof of this ordinal equivalence result.
Here “Past ≈ N” and “Past ≈ 0” just mean any states close to N and to 0 (respectively), which are then held fixed across the five models. With mutational input, the exponential information loss theorem applies to all five processes. The mutual information between present and past declines monotonically as their temporal separation increases, but the decline is faster under some processes than it is under others.

Our results are derived within the setting of the Moran model of population genetics. This is a finite-state Markov chain that forms a “continuant process” (Ewens 2010). That is, at each step the number of individuals carrying a particular allele in the population of fixed size N either goes up by 1, or goes down by 1, or it stays the same. We expect similar conclusions concerning the partial ordering of the $R_{ij}$ ratios for other continuant processes, provided the transition probabilities faithfully reflect the various types of selection being compared. Our results also apply to a slightly different problem: how does the kind of evolutionary process at work in lineages affect the strength of the evidence that similarity provides for common ancestry (Sober 2008)?

[1] The situation is messier if you want to consider frequency-dependent processes for the full range of all i, j values such that i > j. If i > j > N/2, then frequency-dependent selection for the majority trait is rather like frequency-independent selection for trait A, and we know that $R_{ij}$ for frequency-independent selection for trait A is less than $R_{ij}$ for drift.

4. Mutual Information and the Likelihood Ratio. We began our discussion of information loss by using the concept of mutual information but then shifted to considering the likelihood ratio. Have we illicitly changed horses in midstream? We think not.
Consider the mutual information between the binary random variable E that takes the value 1 if allele A is fixed in the present population (E = 0 otherwise), and the frequency X of allele A in the past population, under some strictly positive prior distribution. In this case, mutual information and the likelihood ratio are linked as follows:

$$I(X; E) = 0 \ \text{ precisely when } \ R_{ij} = 1 \ \text{ for all past states } i, j.$$

If we now observe E = 1, then instead of comparing I(X; E) across the different processes, it is more relevant to compare the values of $R_{ij}$. That is, we are thinking of a single observation (Present = N), so we are not considering all the possible observations that we might make of the present state. This is why we used the likelihood ratio $R_{ij}$ rather than mutual information to carry out the cross-process comparison.

5. The Impact of Branching on Information. We have emphasized that loss of information within a lineage is a fact of life. However, the branching that takes place in evolution is a force that pushes in the opposite direction, since it creates new lineages. As illustrated in figure 1, it is possible for an ancient ancestor to have more present-day descendants than a more recent ancestor has, and this means that the information lost to the passage of time can be offset by the proliferation of descendants that each bear witness to the ancient ancestor’s state. If the process is a symmetric one on two states, it is possible to describe precisely how often branching must occur if information loss is to be offset in this way (Evans et al. 2000); for more general processes, one must usually be content with upper and lower bounds.

It might be asked why one needs to worry about observing descendants to infer the characteristics of ancestors. Doesn’t observing fossils provide a simpler and more definitive solution? Our answer has three parts.
First, one cannot assume that a fossil comes from an ancestor of extant species; it may just be an ancient relative. Second, fossils provide evidence about the morphological hard parts of ancient organisms; molecular characters, not to mention phenotypic features of physiology and behavior, typically do not fossilize. And finally, fossil traces degrade and are subject to the exponential information loss theorem if fossils change state in conformity with the regularity assumption and have the Markov property.

Just as the process at work in lineages has an impact on information loss, so too does the topology of the branching process itself. It might seem intuitive to conjecture that the star phylogeny shown in figure 4i is “better” than the bifurcating topology shown in figure 4ii in the sense that the former topology allows the observations to provide more information about the root than the latter topology does. In order to hold other factors fixed, we assume that the two topologies have the same number of leaves and that the process at work in branches is the same in the two topologies (in particular, that the expected number of substitutions, that is, the branch lengths, between the root and each leaf match up for the two trees). The conjecture just stated seems reasonable, since in figure 4i the observations are independent of each other, conditional on the state of the root, whereas in 4ii, the observations are not conditionally independent. The guess is correct for a two-state symmetrical process when we compare the two trees in figure 4 (Sober 1989, 280). More generally, this guess holds whenever we compare any binary tree with a matching star phylogeny on the same number of leaves when a two-state symmetric process is at work (Evans et al. 2000). But, surprisingly, the guess is not always true for symmetric Markov processes with more than two states.
More precisely, consider a completely balanced binary tree with $2^n$ leaves, on which a constant symmetric process on five (or more) states operates with a constant substitution probability on each edge of the tree. Then if this substitution probability lies in a certain region, and n is large enough, the mutual information between the leaf states and the root state can be higher for the binary tree than for a comparable star tree on the same number of leaves (Sly 2011, theorem 1.2). Here “comparable” means that the expected number of substitutions from the root to any tip is the same in both trees. In other words, the root state of a tree can sometimes be more accurately predicted from the state of its leaves when the tree is binary (and the leaves are correlated) than when the tree is a star (and the leaves are independent), where the two trees have the same marginal distribution. A similar result holds for strongly asymmetric two-state models (Mossel 2001, theorem 1). So the intuitive maxim that testimony from independent witnesses of an event provides more information about the event than testimony from otherwise similar dependent witnesses is sometimes false.

Figure 4. Does observing the four leaves of the star phylogeny (i) provide more information about the state of the root than observing the four leaves of the bifurcating phylogeny (ii)?

6. Conclusion. In summary, the view of evolution as an “information-destroying process” is basically right, but it overlooks some interesting details. For some special processes (e.g., zero mutation in simple population genetic models), information never completely disappears, even after infinite time. For example, under a drift model with zero mutation, some information concerning whether allele A was initially in the minority or the majority is always detectable at any time in the present frequency of A (which eventually fixes at 0 or N) in the population.
At the other extreme there are certain (discrete time) processes for which the information can collapse completely to zero in finite time (Mossel 1998; Sober and Steel 2011, 233). The more usual situation lies between these two extremes; for processes that are Markovian and regular, the information between past and present decays at an exponential rate and vanishes only in the limit. Here we can still compare the relative support such models provide in estimating an ancestral state from an observation today. For the five models considered, this support varies in a predictable way depending on the type of model assumed.

We are well aware that the Moran framework contains various simplifications (e.g., constant population size), but the model’s simplicity allows for explicit calculations and results and does not raise questions as to whether the ordering might depend on the complexities that might be introduced in a more intricate model. We hope that our results will provide a foundation for exploring how adding complexities affects information loss. In philosophy as well as in science, it is reasonable to walk before one runs.

The estimation of an ancestral state from the leaves of a phylogenetic tree exhibits further subtleties, along with a surprise: the independent estimates obtained from a star phylogeny may or may not be more informative than the correlated estimates obtained from a binary tree, depending on the number of states, the size of the tree, and the substitution rate. The epistemological principle that “the testimony of independent witnesses always provides more evidence than the testimony of otherwise similar dependent witnesses” is wrong.

Appendix A
Formal Statement and Proof of the Exponential Loss Theorem

Proposition 1. Suppose $X_t$, $t \ge 0$, is any discrete, finite-state Markov process that satisfies the following condition.
For some $\varepsilon > 0$ and integer $N > 0$, the following inequality holds for all $t \ge 0$ and states $i, j$:

$$\Pr(X_{t+N} = j \mid X_t = i) \ge \varepsilon. \tag{A1}$$

Then $I(X_0; X_t) \le C \exp(-ct)$ for constants $C, c > 0$.

Proof. Let $Y_n$ be the “$N$-step” Markov chain defined by $Y_n = X_{nN}$ for all $n \ge 0$. Note that equation (A1) implies that $Y_n$ satisfies the following inequality for all $n, i, j$:

$$\Pr(Y_{n+1} = j \mid Y_n = i) \ge \varepsilon. \tag{A2}$$

We will show that

$$I(Y_0; Y_n) \le B \exp(-bn) \tag{A3}$$

for constants $B, b > 0$, from which proposition 1 follows because if $t = nN + r$ for $0 \le r < N$ then

$$I(X_0; X_t) = I(Y_0; X_t) \le I(Y_0; Y_n) \le B \exp(-bn) < B e^{b} \exp(-(b/N)t),$$

where the first inequality is from the Data Processing Inequality, and the second inequality is from (A3). Thus we can take $C = B e^{b}$ and $c = b/N$ to obtain the bound of proposition 1 from (A3).

To establish inequality (A3) we apply a standard type of coupling argument. For $k \ge 0$ let $p_k(i) := \Pr(Y_k = i)$ for each state $i$. Consider the Markov process $Y'_k$ defined as follows: $Y'_0 = Y_0\ (= X_0)$; and for each $k \ge 0$, the state of $Y'_{k+1}$ is determined as follows. At each step of this chain, we toss a biased coin, which returns a head ($H$) with probability $\varepsilon$ (independently of the chain) or a tail ($\bar{H}$) with probability $1 - \varepsilon$. If a head $H$ is returned, $Y'_{k+1}$ is assigned a random state according to the distribution $p_{k+1}$. If the coin toss results in a tail $\bar{H}$, then $Y'_{k+1}$ selects a state that depends on $Y'_k$ as follows: if $Y'_k = i$, then $Y'_{k+1}$ is assigned state $j$ with probability

$$\frac{\Pr(Y_{k+1} = j \mid Y_k = i) - \varepsilon\, p_{k+1}(j)}{1 - \varepsilon}. \tag{A4}$$

Note that this expression is greater than or equal to zero (by (A2)), and is less than or equal to 1; moreover,

$$\sum_j \frac{\Pr(Y_{k+1} = j \mid Y_k = i) - \varepsilon\, p_{k+1}(j)}{1 - \varepsilon} = 1,$$

so (A4) describes a legitimate probability distribution conditional on $\bar{H}$.
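The construction in (A4) can be sanity-checked numerically. The sketch below is an illustration only: the 3-state transition matrix `P` and the distribution `p_k` are hypothetical values chosen so that condition (A2) holds with `eps = 1/10`, and the helper names are not from the source. It checks that the tail-branch probabilities form a valid distribution and that mixing the head branch (weight `eps`) with the tail branch (weight `1 - eps`) gives back the original transition probabilities. Exact rational arithmetic is used so the equalities can be tested without rounding error.

```python
from fractions import Fraction as F

def next_distribution(p, P):
    """One step of the chain: distribution p_{k+1} obtained from p_k and transition matrix P."""
    n = len(P)
    return [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]

def tail_kernel(P, p_next, eps):
    """Tail-branch probabilities from (A4): (P[i][j] - eps * p_next[j]) / (1 - eps)."""
    n = len(P)
    return [[(P[i][j] - eps * p_next[j]) / (1 - eps) for j in range(n)]
            for i in range(n)]

# Hypothetical 3-state transition matrix (illustration only); each row sums to 1.
P = [[F(7, 10), F(2, 10), F(1, 10)],
     [F(3, 10), F(4, 10), F(3, 10)],
     [F(1, 4),  F(1, 4),  F(1, 2)]]
eps = min(min(row) for row in P)      # (A2) holds with this eps, here 1/10
p_k = [F(1, 2), F(3, 10), F(1, 5)]    # hypothetical distribution of Y_k
p_next = next_distribution(p_k, P)
Q = tail_kernel(P, p_next, eps)

# Each row of Q is non-negative and sums to 1, so it is a probability distribution.
valid = all(q >= 0 for row in Q for q in row) and all(sum(row) == 1 for row in Q)
# Mixing head (eps) and tail (1 - eps) branches recovers the original kernel exactly.
recovers = all(eps * p_next[j] + (1 - eps) * Q[i][j] == P[i][j]
               for i in range(3) for j in range(3))
print(valid, recovers)
```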
By the law of total probability we can express $\Pr(Y'_{k+1} = j \mid Y'_k = i)$ as follows:

$$\begin{aligned}
&\Pr(Y'_{k+1} = j \mid H \,\&\, Y'_k = i)\Pr(H \mid Y'_k = i) + \Pr(Y'_{k+1} = j \mid \bar{H} \,\&\, Y'_k = i)\Pr(\bar{H} \mid Y'_k = i) \\
&\quad = p_{k+1}(j)\,\varepsilon + [\Pr(Y_{k+1} = j \mid Y_k = i) - \varepsilon\, p_{k+1}(j)] \\
&\quad = \Pr(Y_{k+1} = j \mid Y_k = i).
\end{aligned}$$

Summarizing, we have: $Y'_0 = Y_0$, and $\Pr(Y'_{k+1} = j \mid Y'_k = i) = \Pr(Y_{k+1} = j \mid Y_k = i)$, and so $Y_k$ and $Y'_k$ describe the same Markov chain. In particular, $I(Y_0; Y_n) = I(Y'_0; Y'_n)$, and $p_n(j) = \Pr(Y'_n = j)$. Let

$$p_n(i, j) := \Pr(Y_0 = i \,\&\, Y_n = j) = \Pr(Y'_0 = i \,\&\, Y'_n = j),$$

and let $A_n$ be the event that $H$ occurs at least once in the first $n$ biased coin tosses that are performed in the construction of the chain $Y'_1, Y'_2, \ldots$. Then $p_n(i, j)$ can be written as the sum of two terms:

$$\Pr(Y'_n = j \mid A_n \,\&\, Y'_0 = i)\Pr(A_n \,\&\, Y'_0 = i) + \Pr(Y'_n = j \mid \bar{A}_n \,\&\, Y'_0 = i)\Pr(\bar{A}_n \,\&\, Y'_0 = i). \tag{A5}$$

Now, the first term in (A5) is exactly $p_n(j)\,(1 - (1 - \varepsilon)^n)\,p_0(i)$ since
• conditional on $A_n$, $Y'_n$ is independent of $Y'_0$ and, in addition, $Y'_n$ is distributed as $p_n$; and
• $A_n$ and $Y'_0$ are independent, and so $\Pr(A_n \,\&\, Y'_0 = i) = \Pr(A_n)\Pr(Y'_0 = i) = [1 - (1 - \varepsilon)^n]\,p_0(i)$.

The second term in (A5) can be written as $p_0(i)$ multiplied by a term that lies between 0 and $(1 - \varepsilon)^n$, since

$$0 \le \Pr(\bar{A}_n \,\&\, Y'_0 = i) \le \Pr(\bar{A}_n) = (1 - \varepsilon)^n.$$

Thus, selecting $b > 0$ so that $e^{-b} = (1 - \varepsilon)$ we can write:

$$p_n(i, j) = p_0(i)\,[p_n(j) + O(e^{-bn})], \tag{A6}$$

where, as usual, “$f(n) = O(g(n))$” is shorthand for the statement that $f(n)$ is at most some constant times $g(n)$. Finally, observe that, from (A2), $p_n(j) \ge \varepsilon > 0$ and so, from (A6):

$$I(Y'_0; Y'_n) = \sum_{i,j} p_n(i, j) \log\!\left(\frac{p_n(i, j)}{p_0(i)\,p_n(j)}\right) = \sum_{i,j} p_n(i, j) \log\!\left(1 + O(e^{-bn})\right) \le B e^{-bn},$$

for a constant $B > 0$. Since $I(Y_0; Y_n) = I(Y'_0; Y'_n)$, this establishes (A3), as required. QED

Appendix B

Proposition 2.
Suppose that $X \to Y \to Z$ forms a Markov chain, where the state spaces for $X$, $Y$, and $Z$ are discrete, and $\Pr(X = x \,\&\, Y = y) > 0$ and $\Pr(Y = y \,\&\, Z = z) > 0$ for all choices of states $x, y, z$ for $X, Y, Z$, respectively. Then the DPI is an equality if and only if $X$, $Y$, and $Z$ are mutually independent.

Proof. First, observe that if $X, Y, Z$ are independent then they are pairwise independent and so $I(X; Y) = I(X; Z) = I(Y; Z) = 0$ and thus equality holds trivially.

Next, suppose that $\Pr(X = x \,\&\, Y = y) > 0$ and $\Pr(Y = y \,\&\, Z = z) > 0$ for all choices of states $x, y, z$ for $X, Y, Z$, respectively. Then $\Pr(X = x \,\&\, Y = y \,\&\, Z = z) > 0$ holds also (since $X \to Y \to Z$ is a Markov chain). Suppose further that the DPI is an equality; we will show that $X$, $Y$, and $Z$ are independent. Since the DPI is an equality, $X \to Y \to Z$ and $X \to Z \to Y$ are both Markov chains (Cover and Thomas 2006). We write $p(xyz)$ as shorthand for the probability $\Pr(X = x \,\&\, Y = y \,\&\, Z = z)$ and similarly for conditional and marginal probabilities (thus, e.g., $p(x \mid z) = \Pr(X = x \mid Z = z)$). First observe that the positivity condition $p(xyz) > 0$ for all $(x, y, z)$ implies that $p(xy)$, $p(xz)$, $p(yz)$, $p(x)$, $p(y)$, $p(z)$ are also strictly positive. Since $X \to Y \to Z$ is a Markov chain, and $p(xy) > 0$:

$$p(xyz) = p(z \mid xy)\,p(xy) = p(z \mid y)\,p(xy), \tag{B1}$$

and since $X \to Z \to Y$ is also a Markov chain, and $p(xz) > 0$, we have $p(xyz) = p(y \mid xz)\,p(xz) = p(y \mid z)\,p(xz)$.
Applying Bayes’s theorem, the last term can be written as $\dfrac{p(z \mid y)\,p(y)}{p(z)}\,p(xz)$ (note that $p(z) > 0$) and so, combining this with equation (B1) gives

$$p(xyz) = p(z \mid y)\,p(xy) = \frac{p(z \mid y)\,p(y)}{p(z)}\,p(xz),$$

and so

$$p(z \mid y)\,p(xy)\,p(z) = p(z \mid y)\,p(y)\,p(xz). \tag{B2}$$

Since $p(z \mid y) > 0$ (because $p(yz) > 0$) we can cancel this term on the left and right of equation (B2) to obtain:

$$p(xy)\,p(z) = p(y)\,p(xz). \tag{B3}$$

Now, we can further write $p(xy) = p(x \mid y)\,p(y)$ and $p(xz) = p(x \mid z)\,p(z)$ which, upon substitution into equation (B3) gives:

$$p(x \mid y)\,p(y)\,p(z) = p(y)\,p(x \mid z)\,p(z);$$

in other words, $p(x \mid y) = p(x \mid z)$ (noting that $p(y), p(z) > 0$). Now, this equation must hold for all choices of $x$, $y$, and $z$, so $p(x \mid y)$ must be constant as $y$ varies, which implies that $X$ and $Y$ are independent. Similarly $X$ and $Z$ are independent. Finally, reversing the two Markov chains gives that $Z$ is independent of $Y$. Thus $X$, $Y$, and $Z$ are pairwise independent. Moreover they are independent as a triple since $X \to Y \to Z$ is a Markov chain and so

$$p(xyz) = p(z \mid xy)\,p(xy) = p(z \mid y)\,p(x)\,p(y) = p(yz)\,p(x) = p(x)\,p(y)\,p(z).$$

QED

Appendix C
$R_{ij}$ Ratios with Zero Mutation at the Infinite Time Limit

Consider the Moran model in population genetics, with two trait values A and B and population size $N$. Let $X_t \in \{0, 1, \ldots, N\}$ be the number of copies of A in the population at time $t$. In this section we assume zero mutation, and we consider neutral evolution, selection for A, selection against A, and frequency-dependent selection (for the majority state and against the majority state). Since each of these Markov processes has absorbing states 0 and $N$ (because of zero mutation) eventually one allele will be fixed and the other lost. Let $E \in \{0, N\}$ be this end state, and $S \in \{0, 1, \ldots, N\}$ the starting state (thus $S = X_0$ and $E = \lim_{t \to \infty} X_t$). We are interested in comparing the ratio of conditional probabilities:

$$R_{ij} := \frac{\Pr(E = N \mid S = i)}{\Pr(E = N \mid S = j)},$$

for $i > j$ under the various models.

Proposition 3.
(i) Under neutral evolution $R_{ij} = i/j$, for all $0 < i, j \le N$.
(ii) Under frequency-independent selection $R_{ij} = (1 - c^i)/(1 - c^j)$, for all $0 < i, j \le N$, where $c$ is a positive constant with $c < 1$ when there is selection for A and $c > 1$ when there is selection against A.
(iii) For any two values $i, j \in \{1, 2, \ldots, N\}$ with $i > j$ the $R_{ij}$ value for selection against A exceeds the $R_{ij}$ value for neutral evolution, which in turn exceeds the $R_{ij}$ value for selection for A.
(iv) For frequency-dependent selection, where the fitness of trait A is proportional to its frequency, the associated $R_{ij}$ value exceeds that for neutral evolution for all $i, j \in \{1, 2, \ldots, N\}$ with $i > j$ provided that $j < N/2$.
(v) For frequency-dependent selection, where the fitness of trait A is proportional to the frequency of the alternative trait B, the associated $R_{ij}$ value is lower than that for neutral evolution for all $i, j \in \{1, 2, \ldots, N\}$ with $i > j$, provided that $j < N/2$.
(vi) In all the cases considered above we have $R_{ij} > 1$ for all $i > j$.

Proof: Part i. For any integer $x$ with $0 \le x \le N$ it is a well-known result (for many neutral models) that $\Pr(E = N \mid S = x) = x/N$ (see, e.g., eq. 3.49 of Ewens 2010), from which part i immediately follows.

Part ii. From Ewens (2010, eq. 3.66) we have, for any $x \in \{1, \ldots, N\}$:

$$\Pr(E = N \mid S = x) = \frac{1 - c^x}{1 - c^N},$$

for a positive constant $c$ which is greater than 1 for selection against A and less than 1 for selection for A. Part ii now follows immediately.

Part iii. By parts i and ii, part iii is equivalent to the assertions that for $i > j$, if $c \in (0, 1)$ then:

$$\frac{1 - c^i}{1 - c^j} < \frac{i}{j},$$

while if $i > j$ and $c > 1$ then:

$$\frac{1 - c^i}{1 - c^j} > \frac{i}{j}.$$

Now, using the identity $1 - c^k = (1 - c)(1 + c + c^2 + \cdots + c^{k-1})$ we have

$$\frac{1 - c^i}{1 - c^j} = \frac{1 + \cdots + c^{i-1}}{1 + \cdots + c^{j-1}}, \tag{C1}$$

and so we wish to compare

$$\frac{1 + \cdots + c^{i-1}}{1 + \cdots + c^{j-1}} \quad\text{and}\quad \frac{i}{j},$$

which is equivalent to comparing

$$\frac{1 + \cdots + c^{i-1}}{i} \quad\text{and}\quad \frac{1 + \cdots + c^{j-1}}{j}.$$

Now, the left-hand side is merely the average of the terms $c^k$ from $k = 0$ up to $k = i - 1$ while the right-hand side is the average of these terms up to $j - 1$, and since $i > j$ the left-hand side is smaller than the right when $c < 1$ and greater when $c > 1$. This completes the proof.

Part iv. When selection is frequency dependent, we need to use the expression:

$$\Pr(E = N \mid S = i) = \left(1 + \sum_{j=1}^{i-1} \prod_{k=1}^{j} \frac{g_k}{f_k}\right) \Bigg/ \left(1 + \sum_{j=1}^{N-1} \prod_{k=1}^{j} \frac{g_k}{f_k}\right), \tag{C2}$$

where $f_k$ and $g_k$ denote the fitnesses of alleles A and B, respectively, when $k$ individuals have allele type A (see, e.g., Huang and Traulsen 2010, eq. 6). If we now take the fitness of trait A to be proportional to its frequency, that is, $f_k = a \cdot k$ for a constant $a > 0$, and take the fitness of trait B to also be proportional to its frequency (with the same coefficient), that is, $g_k = a \cdot (N - k)$, then equation (C2) gives

$$R_{ij} = \sum_{m=0}^{i-1} \binom{N-1}{m} \Bigg/ \sum_{m=0}^{j-1} \binom{N-1}{m}. \tag{C3}$$

As before, this ratio exceeds $i/j$ precisely when the average of the first $i$ terms $\binom{N-1}{m}$ (for $m = 0, 1, 2, \ldots$) exceeds the average of the first $j$ terms $\binom{N-1}{m}$; this holds for all $i > j$ with $j < N/2$.

Part v. If we take the fitness of trait A to be proportional to the frequency of B, that is, $f_k = a \cdot (N - k)$ for a constant $a > 0$, and take the fitness of trait B to be proportional to the frequency of A (with the same coefficient), that is, $g_k = a \cdot k$, then equation (C2) gives

$$R_{ij} = \sum_{m=0}^{i-1} \binom{N-1}{m}^{-1} \Bigg/ \sum_{m=0}^{j-1} \binom{N-1}{m}^{-1}, \tag{C4}$$

and this ratio is lower than $i/j$ precisely when the average of the first $i$ terms $\binom{N-1}{m}^{-1}$ (for $m = 0, 1, 2, \ldots$) is lower than the average of the first $j$ terms $\binom{N-1}{m}^{-1}$; this holds for all $i > j$ with $j < N/2$.
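The five ratios can be computed directly from the numerators of (C2), since the denominator is common to both fixation probabilities and cancels in $R_{ij}$. The following is a minimal sketch using exact rational arithmetic; the function name `ratio_R`, the dictionary labels, and the example values $N = 10$, $i = 4$, $j = 2$, $c = 1/2$, and $c = 2$ are illustrative assumptions, not taken from the source. The constant $a$ is set to 1 because it cancels in every ratio $g_k/f_k$.

```python
from fractions import Fraction

def ratio_R(i, j, fitness_A, fitness_B):
    """R_ij = Pr(E=N | S=i) / Pr(E=N | S=j), using the numerators of (C2).

    fitness_A(k) and fitness_B(k) play the roles of f_k and g_k: the fitnesses
    of alleles A and B when k individuals carry A.
    """
    def numerator(start):
        # 1 + sum_{m=1}^{start-1} prod_{k=1}^{m} g_k / f_k
        total, prod = Fraction(1), Fraction(1)
        for m in range(1, start):
            prod *= Fraction(fitness_B(m), fitness_A(m))
            total += prod
        return total
    return numerator(i) / numerator(j)

N, i, j = 10, 4, 2   # illustrative values with i > j and j < N/2

models = {
    "neutral":                    (lambda k: 1,     lambda k: 1),
    "selection for A (c=1/2)":    (lambda k: 2,     lambda k: 1),
    "selection against A (c=2)":  (lambda k: 1,     lambda k: 2),
    "freq-dep, favors majority":  (lambda k: k,     lambda k: N - k),
    "freq-dep, favors minority":  (lambda k: N - k, lambda k: k),
}

R = {name: ratio_R(i, j, f, g) for name, (f, g) in models.items()}
for name, value in R.items():
    print(f"{name:28s} R_ij = {value} ({float(value):.4f})")
```

For these example values the ratios come out as 2 (neutral), 5/4 (selection for A), 5 (selection against A), 13 (frequency dependence favoring the majority), and 29/28 (frequency dependence favoring the minority), which matches the orderings asserted in parts iii, iv, and v and the bound $R_{ij} > 1$.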
Part vi. The inequality $R_{ij} > 1$ for $i > j$ is trivial for neutral evolution, while for the other processes the inequality follows from equations (C1), (C3), and (C4), noting that when $i > j$ the $i$ terms in the numerator include the $j$ terms in the denominator along with some additional positive terms.

Note that $N$ plays no role in the expression of the ratio $R_{ij}$ in cases i–iii, but it does for iv and v.

Appendix D
$R_{ij}$ Ratios with Finite Time and Nonzero Mutation Rate

The results in the previous section assume zero mutation and consider the infinite time limit. However, they also have some bearing on what happens at finite time and for nonzero mutation. First assume zero mutation and consider the ratio $R_{ij}$ at finite times $t$. Since these ratios are continuous functions of $t$ and converge to values that satisfy the partial order described in figure 3, this ordering also holds for $t$ sufficiently large (but finite). Select any such sufficiently large value $t_0$ of $t$, and consider this ratio $R_{ij}$ at $t_0$ as a function of the mutation rates. Again, $R_{ij}$ is a continuous function of these mutation rates (with $t$ fixed to $t_0$), and so, for sufficiently small (but nonzero) mutation rates, the five $R_{ij}$ values will still be ordered as in figure 3 at $t_0$.

In summary, for a sufficiently large value of time, we can take small but strictly positive mutation rates that preserve the order of the $R_{ij}$ ratios shown in figure 3. This ordering may change if the mutation rates are then held fixed and time increased. In other words, the order of quantifiers here is important. We are merely asserting that for any sufficiently large value of time, there exist positive mutation rates that preserve the orderings of the ratios; if we select a larger time, the mutation rates may need to be reduced (but will still be strictly positive).

REFERENCES

Cover, Thomas M., and Joy A. Thomas. 2006. Elements of Information Theory. 2nd ed. New York: Wiley.
Darwin, Charles R. 1859. On the Origin of Species by Means of Natural Selection. London: Murray.
Evans, William, Claire Kenyon, Yuval Peres, and Leonard J. Schulman. 2000. “Broadcasting on Trees and the Ising Model.” Annals of Applied Probability 10:410–33.
Ewens, Warren J. 2010. Mathematical Population Genetics. Vol. 1, Theoretical Introduction. New York: Springer.
Frieden, B. Roy, Angelo Plastino, and B. H. Soffer. 2001. “Population Genetics from an Information Perspective.” Journal of Theoretical Biology 208 (1): 49–64.
Gascuel, Olivier, and Mike Steel. 2014. “Predicting the Ancestral Character Changes in a Tree Is Typically Easier than Predicting the Root State.” Systematic Biology 63 (3): 421–35.
Hacking, Ian. 1965. The Logic of Statistical Inference. Cambridge: Cambridge University Press.
Häggström, Olle. 2002. Finite Markov Chains and Algorithmic Applications. Cambridge: Cambridge University Press.
Huang, Weini, and Arne Traulsen. 2010. “Fixation Probabilities of Random Mutants under Frequency Dependent Selection.” Journal of Theoretical Biology 263 (2): 262–68.
Keynes, John Maynard. 1924. A Tract on Monetary Reform. London: Macmillan.
Laplace, Pierre-Simon. 1814. A Philosophical Essay on Probabilities. Trans. from the French 6th ed. by F. Truscott and F. Emory. New York: Dover, 1951.
Moran, Patrick. 1962. The Statistical Processes of Evolutionary Theory. Oxford: Oxford University Press.
Mossel, Elchanan. 1998. “Recursive Reconstruction on Periodic Trees.” Random Structures and Algorithms 13 (1): 81–97.
Mossel, Elchanan. 2001. “Reconstruction on Trees: Beating the Second Eigenvalue.” Annals of Applied Probability 11:285–300.
Mossel, Elchanan. 2003. “On the Impossibility of Reconstructing Ancestral Data and Phylogenies.” Journal of Computational Biology 10:669–78.
Royall, Richard. 1997. Statistical Evidence: A Likelihood Paradigm. Boca Raton, FL: Chapman & Hall.
Salichos, Leonidas, and Antonis Rokas. 2013. “Inferring Ancient Divergences Requires Genes with Strong Phylogenetic Signals.” Nature 497:327–31.
Sly, Allan. 2011. “Reconstruction for the Potts Model.” Annals of Probability 39:1365–1406.
Sober, Elliott. 1989. “Independent Evidence about a Common Cause.” Philosophy of Science 56:275–87.
Sober, Elliott. 2008. Evidence and Evolution: The Logic behind the Science. Cambridge: Cambridge University Press.
Sober, Elliott. 2011a. “A Priori Causal Models of Natural Selection.” Australasian Journal of Philosophy 89:571–89.
Sober, Elliott. 2011b. Did Darwin Write the Origin Backwards? Amherst, NY: Prometheus Books.
Sober, Elliott, and Mike Steel. 2011. “Entropy Increase and Information Loss in Markov Models of Evolution.” Biology and Philosophy 26:223–50.