key: cord-0619579-ffkrnqyh
authors: Vogelstein, Joshua T.; Verstynen, Timothy; Kording, Konrad P.; Isik, Leyla; Krakauer, John W.; Etienne-Cummings, Ralph; Ogburn, Elizabeth L.; Priebe, Carey E.; Burns, Randal; Kutten, Kwame; Knierim, James J.; Potash, James B.; Hartung, Thomas; Smirnova, Lena; Worley, Paul; Savonenko, Alena; Phillips, Ian; Miller, Michael I.; Vidal, Rene; Sulam, Jeremias; Charles, Adam; Cowan, Noah J.; Bichuch, Maxim; Venkataraman, Archana; Li, Chen; Thakor, Nitish; Kebschull, Justus M; Albert, Marilyn; Xu, Jinchong; Shuler, Marshall Hussain; Caffo, Brian; Ratnanather, Tilak; Geisa, Ali; Roh, Seung-Eon; Yezerets, Eva; Madhyastha, Meghana; How, Javier J.; Tomita, Tyler M.; Dey, Jayanta; Huang, Ningyuan; Shin, Jong M.; Kinfu, Kaleab Alemayehu; Chaudhari, Pratik; Baker, Ben; Schapiro, Anna; Jayaraman, Dinesh; Eaton, Eric; Platt, Michael; Ungar, Lyle; Wehbe, Leila; Kepecs, Adam; Christensen, Amy; Osuagwu, Onyema; Brunton, Bing; Mensh, Brett; Muotri, Alysson R.; Silva, Gabriel; Puppo, Francesca; Engert, Florian; Hillman, Elizabeth; Brown, Julia; White, Chris; Yang, Weiwei
title: Prospective Learning: Back to the Future
date: 2022-01-19
journal: nan
DOI: nan
sha: 66cc87463c0f27256a18eb02915460ef0b510c0b
doc_id: 619579
cord_uid: ffkrnqyh

Research on both natural intelligence (NI) and artificial intelligence (AI) generally assumes that the future resembles the past: intelligent agents or systems (what we call 'intelligence') observe and act on the world, then use this experience to act on future experiences of the same kind. We call this 'retrospective learning'. For example, an intelligence may see a set of pictures of objects, along with their names, and learn to name them. A retrospective learning intelligence would merely be able to name more pictures of the same objects. We argue that this is not what true intelligence is about. In many real world problems, both NIs and AIs will have to learn for an uncertain future. Both must update their internal models to be useful for future tasks, such as naming fundamentally new objects and using these objects effectively in a new context or to achieve previously unencountered goals. This ability to learn for the future we call 'prospective learning'. We articulate four relevant factors that jointly define prospective learning. Continual learning enables intelligences to remember those aspects of the past which they believe will be most useful in the future. Prospective constraints (including biases and priors) help the intelligence find general solutions that will be applicable to future problems. Curiosity motivates taking actions that inform future decision making, including in previously unmet situations. Causal estimation enables learning the structure of relations that guide choosing actions for specific outcomes, even when the specific action-outcome contingencies have never been observed before. We argue that a paradigm shift from retrospective to prospective learning will enable the communities that study intelligence to unite and overcome existing bottlenecks to more effectively explain, augment, and engineer intelligences.

Figure 1: Our conceptual understanding of prospective learning. W: world state. X: input (sensory data). Y: output (response/action). h: hypothesis. g: learner. Subscript n: now. Subscript f: future. NOW: the task and context that matters right now. FUTURE: the task and context that will matter at some point of time in the future.
Note that g, the learning algorithm, is fixed and unchanging, while the hypothesis is continually updated. Black arrows denote inputs and outputs. White arrows indicate conceptual linkages.

Figure 1 contrasts prospective learning with traditional learning approaches (see Section 3 for further details). At any given time (now, indicated by subscript n), the intelligence receives some input from the world (X_n). The intelligence contains a continual learning algorithm (g). The goal of that algorithm is to leverage the new data to update the intelligence's current hypothesis (h_n) without catastrophically forgetting the past, and ideally even improving upon previously acquired skills or capabilities. The hypothesis that is created is selected from a constrained set of hypotheses that are inductively biased towards those that also generalize well to problems the intelligence is likely to confront in the future. Curiosity motivates gathering more information that could be useful now, or in the future. Based on all available information, the intelligence makes a decision about how to respond or act (Y). Those actions causally impact the world. This process of acquiring input, learning from the past, updating hypotheses, and acting to gain rewards or information remains relevant and repeats itself in the far future (indicated by subscript f). The central premise of our work here is that by understanding how NIs achieve (or fail at) each of these four capabilities, and by describing them in terms of an AI formalism, we can overcome existing bottlenecks in explaining, augmenting, and engineering both natural and artificial intelligences.

2 Evidence for prospective learning in natural intelligences

NIs (which we limit here to any organism with a brain or brain-like structure) always learn for the future, because the future is where survival, competition, feeding, and reproduction happen. That is not to say that NIs are perfect at it, or even particularly efficient. Rather, we argue that prospective abilities are just successful enough to bolster evolutionary fitness so as to be reinforced over time. In many ways, brains appear to be explicitly built (i.e., evolved) to make predictions about the future [8]. NIs learn abstract, transferable knowledge that can be flexibly applied to new problems, rather than greedily optimizing for the current situation (for review see Gershman et al. [9], Raby and Clayton [10]). In the field of ecology, this process is part of what is called 'future planning' [11] or 'prospective cognition' [10], both of which describe the ability of animals to engage in 'mental time travel' [12] by projecting themselves into the future and imagining possible events, anticipating unseen challenges. Given our emphasis on prospective learning, we concentrate here on the learning aspects of future planning and prospective cognition. While prospective abilities have classically been thought of as a uniquely human trait [13], we now know that many other NIs have them. Bonobos and orangutans socially learn to construct tools for termite fishing, not for immediate use, but instead to carry with them in anticipation of future use [14]. This tool construction and use also extends beyond primates. Corvids collect materials to build tools that solve novel tasks [15]. This building of novel tools can be seen as a form of prospective learning.
Here the experience of learning novel objects (e.g., the pliability of specific twigs, inspecting glass bottles) transfers to novel applications in the future (e.g., curving a twig to use as a hook to fish food out of a bottle). It requires that the animal seek out the information (curiosity) to learn the physics of various objects (causality), both biased and facilitated by internal heuristics that limit the space of hypotheses for how to configure the tool (constraints), and extend this knowledge to produce new behavioral patterns (continual learning) within the constraints of the inferred structure of the world. Another manifestation of learning for the future is food caching, seen in both mammals (e.g., squirrels and rats) and birds. Western scrub-jays not only have a complex spatial memory of food cache locations, but can flexibly adapt their caching to anticipate future needs [16]. Experiments on these scrub-jays have shown that they will stop caching food in locations where it gets degraded by weather or is stolen by a competitor [17, 18]. Indeed, consistent with the idea that these birds are learning, scrub-jays that are caught stealing food from another's cache (i.e., observed by another jay) will re-store the stolen food in private, as if aware that the observing animal will take back the food [19]. This behavior can be considered prospective within a spatial framework, wherein prior experience facilitates learning a unique 'cognitive map' that supports innovative zero-shot transfer to creatively solve tasks (e.g., strategic storage and retrieval of food) in a novel future context (i.e., next season) [20]. This spatial learning likely evolved to help animals solve ethologically critical tasks such as navigation and foraging, as well as to allow animals to use vicarious trial and error to imaginatively simulate multiple scenarios [21], generalizing previously learned information to a novel context. But such mechanisms are not limited to navigational problems: many animals have evolved the ability to transform non-spatial variables into a spatial framework, enabling them to solve a broad set of problems using the same computational principles and neural dynamics underlying spatial mapping [22, 23]. Finally, as an animal explores its environment, it can quickly incorporate novel information using the flexibility of the cognitive map, while leaving existing memories largely intact (i.e., without disrupting the weights of the existing network). Learning for the future is seen across phyla as well. Bees (arthropods) can extrapolate environmental cues not only to locate food sources, but to communicate this location to hivemates via a 'waggle dance' that indicates where future targets lie with a high degree of accuracy (for review see Menzel [24]). Importantly, bees can also learn novel color-food associations, identifying the color of new high-value nectar sources via associative learning in novel foraging environments [25, 26]. This ability also extends to learning novel flower geometries that may indicate high-nectar food sources, which are then communicated back to the hive for future visits by other bees [27]. This remarkably sophisticated form of learning for the future happens in an animal with fewer than a million neurons and fewer than 10 billion synapses. In contrast, modern deep learning systems, such as GPT-3, have over 100 billion synapses [28] and yet fail at similar forms of sophisticated associative learning.
Even in the phylum Mollusca, octopuses have been found to perform observational learning, with single-shot accuracy, selecting novel objects after simply watching another octopus perform the task [29]. This rapid form of (continual) learning allows the animal to effectively use an object it has never seen before in new situations (constraints and causality), simply by choosing to play with it (curiosity). Thus, it is learning for the future. Prospective learning thus has a very long evolutionary history. Given that arthropods, mollusks, and chordates diverged some 500 million years ago, the observation of prospective learning abilities across these phyla suggests one of two possibilities: 1) prospective learning is an evolutionarily old capacity with a shared neural implementation that exists in very simple nervous systems (and scales with evolution), or 2) prospective learning has independently evolved multiple times with different implementation-level mechanisms (a.k.a., multiple realizability [30, 31]). These two possibilities have different implications for the types of experiments that would inform our understanding of prospective learning in NIs and how we can implement it in artificial systems (see § 5 for further details).

3 The traditional approach to (retrospective) learning

The standard machine learning (ML) formalism dates back to the 1920s, when Fisher wrote the first statistics textbook ever. In it, he states that "statistics may be regarded as . . . the study of methods of the reduction of data." In other words, he established statistics to describe the past, not predict the future. Shortly thereafter, Glivenko and Cantelli established the fundamental theorem of pattern recognition: given enough data from some distribution, one can eventually estimate any parameter of that distribution [32, 33]. Vapnik and Chervonenkis [34] and then Valiant [35] rediscovered and further elaborated upon these ideas, leading to nearly all of the advancements of modern ML and AI. Here we will highlight the structure of this standard framework for understanding learning as used in AI. As above (Section 1.2), let X be the input to our intelligence (e.g., sensory data) and Y be its output (e.g., an action). We assume those data are sampled from some distribution $P_{X,Y}$ that encapsulates some true but unknown properties of the world. For brevity, we allow $P_{X,Y}$ to also incorporate the causal graph, rather than merely the probabilistic distribution. Let n denote the n-th experience, the one that is happening right now. In the classical form of the problem, a learning algorithm g (which we hereafter refer to as a 'learner') takes in the current data sample $S := \{(X_i, Y_i)\}_{i=1}^{n}$ and outputs a hypothesis $h_n \in \mathcal{H}$, where the hypothesis $h_n : \mathcal{X} \to \mathcal{Y}$ chooses a response based on the input. The n-th sample corresponds to 'now', as described in Section 1.2. The learner chooses a hypothesis, often by optimizing a loss function $\ell$ that compares the predicted output of any hypothesis, $h(X)$, with the (sometimes unobserved) ground-truth output, $Y$: $\ell(h(X), Y)$. The goal of the learner is to minimize risk, which is often defined as the expected loss, integrating over all possible test data:

$R(h) := \mathbb{E}_{(X,Y) \sim P_{X,Y}} \left[ \ell\big(h(X), Y\big) \right].$

Note that when we are learning, h is dependent on the past observed (training) dataset. However, when we are designing new learners (e.g., as evolution does), we do not have a particular training dataset available. Therefore we seek to develop learners that work well on whatever training dataset we have.
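To make this formalism concrete, here is a minimal sketch in Python (our own illustration, not code from the paper; the threshold task, the 0-1 loss, and the names sample_task, g, and loss are all hypothetical choices): a learner g performs empirical risk minimization over threshold hypotheses, and its risk is estimated by Monte Carlo integration over fresh test draws.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task(n, flip=0.1):
    """Draw n (X, Y) pairs from a fixed P_{X,Y}: Y = 1[X > 0], with labels flipped w.p. `flip`."""
    X = rng.normal(size=n)
    Y = (X > 0).astype(int)
    flipped = rng.random(n) < flip
    return X, np.where(flipped, 1 - Y, Y)

def loss(y_hat, y):
    """0-1 loss l(h(X), Y)."""
    return (y_hat != y).astype(float)

def g(S):
    """Learner g: map a training sample S to a hypothesis h by empirical risk
    minimization over the constrained class of threshold rules h_t(x) = 1[x > t]."""
    X, Y = S
    candidates = np.sort(X)
    risks = [loss((X > t).astype(int), Y).mean() for t in candidates]
    t_star = candidates[int(np.argmin(risks))]
    return lambda x: (x > t_star).astype(int)   # the chosen hypothesis h: X -> Y

S = sample_task(100)                 # the data observed 'now'
h = g(S)                             # h_n
X_te, Y_te = sample_task(100_000)    # fresh test draws from the SAME distribution
print("empirical risk:", loss(h(S[0]), S[1]).mean())
print("true risk (MC):", loss(h(X_te), Y_te).mean())  # approaches the 0.1 noise floor
```

Because the training and test draws here come from the same $P_{X,Y}$, the empirical and true risks agree; the next paragraphs examine what this assumption buys, and what happens when it is dropped.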
To design such learners, retrospective learning assumes that all data are sampled from the exact same distribution (in supervised learning, this means the train and test sets); we can then evaluate a learner by integrating out all possible training datasets to determine the expected risk, $\mathcal{E}$, of what is learned:

$\mathcal{E}(g) := \mathbb{E}_{S \sim P_{X,Y}^{n}} \left[ \mathbb{E}_{(X,Y) \sim P_{X,Y}} \left[ \ell\big(g(S)(X), Y\big) \right] \right].$

Although both training and test datasets are assumed to be drawn from the same distribution, the two integrals are over two different sets of random variables: the inner integral is over all possible test data, and the outer integral is over all possible training datasets. Assuming that the two distributions are identical has enabled retrospective learning to prove a rich set of theorems characterizing the limits of learning. Many learners have been developed based on this assumption, and such algorithms have recently enjoyed a cornucopia of successes, spanning computer vision [36], natural language processing [37], diagnostics [38], protein folding [39], autonomous control [40], and reinforcement learning [41]. The successes of the field, so far, rely on problems amenable to the classical statistical definition of learning, in which the data are all sampled under a fixed distributional assumption. This, to be fair, encompasses a wide variety of applied tasks. However, in many real-world data problems, the assumption that the training and test data distributions are the same is grossly inadequate [42]. Recently a number of papers have proposed developing a theory of 'out-of-distribution' (OOD) learning [43-45], which includes as special cases transfer learning [46], multitask learning [47-49], meta-learning [50], and continual [51] and lifelong learning [52]. The key to OOD learning is that we now assume that the test set is drawn from a distribution that differs in some way from the training set distribution. This assumption is an explicit generalization of the classical retrospective learning problem [45]. In OOD problems, the train and test sets can come from different sample spaces and different distributions, and can be optimized with respect to different loss functions. Thus, rather than designing learners with small expected risk as defined above, in OOD learning we seek learners with small $\mathcal{E}_{OOD}$ (the difference from classical 'in-distribution' retrospective learning being that the training and test distributions are decoupled):

$\mathcal{E}_{OOD}(g) := \mathbb{E}_{S \sim P_{\mathrm{train}}^{n}} \left[ \mathbb{E}_{(X,Y) \sim P_{\mathrm{test}}} \left[ \ell\big(g(S)(X), Y\big) \right] \right].$

Note that this expression for the risk permits the case of multiple tasks: both g and h are able to operate on different spaces of inputs and outputs, the inputs to both could include task identifiers or other side information, and the loss could measure performance on different tasks. All of prospective learning builds upon, and generalizes, finding hypotheses that minimize $\mathcal{E}_{OOD}$. There are various ways in which an intelligence that can solve this OOD problem can still be only a retrospective learner. First, consider continual learning. A learner designed to minimize $\mathcal{E}_{OOD}$ has no incentive to remember anything about the past. In fact, there is no coherent notion of past, because there is no time. However, even if there were time (for example, by assuming the training data is in the past and the testing data is in the future), there is no mechanism by which anything about the past is retained. Rather, it could all be overwritten. Second, consider constraints. Often in retrospective ML, constraints are imposed on learning algorithms to help find a good solution for the problem at hand with limited resources.
These constraints, therefore, do not consider the possibility of other future problems that might be related to the existing problems in certain structured ways, which a prospective learner would. Third, curiosity is not invoked at all. Even if we generalized the above to consider time, curiosity would still only be about gaining information for the current situation, not realizing that there will be future situations that are similar to this one, but distinct along certain predictable dimensions (for example, entropy increases, children grow, etc.). Fourth, there is no notion of causality in the above equation; the optimization problem is purely one of association and prediction. These limitations of retrospective learning motivate formalizing prospective learning to highlight these four capabilities explicitly.

4 The capabilities that characterize prospective learning

The central hypothesis of this paper is that by posing the problem of learning as being about the future, many of the current problem areas of AI become tractable, and many aspects of NI behaviors can be better understood. Here we spell out the components of systems that learn for the future. While Figure 1 provides a schematic akin to a partially observable Markov decision process [53], it is important to note that prospective learning is not merely a rebranding of Markov decision processes, or reinforcement learning. Specifically, the 'future' in Figure 1 is not the next time step, but rather some time in the potentially distant future. Moreover, at that time, everything could be different: not merely a time-varying state transition distribution and reward function, but also possibly different input and output spaces. In other words, it could be a completely different environment. Thus, while Markov assumptions may be at play, algorithms designed merely to address a stationary Markov decision process will catastrophically fail in the more general settings considered here. Nonetheless, without further assumptions, the problem would be intractable (or even uncomputable) [4]. Thus, we further assume that the external world W is changing somewhat predictably over time. For example, the distribution of states in a world that operates according to some mechanisms (e.g., sunny weather, cars driving on the right, etc.) may change when one or more of those mechanisms changes (e.g., rainy weather, or driving on the left). Continual learning thereby enables the intelligence to store information in h about the past that it believes will be particularly useful in the (far) future. Prospective constraints, $h \in \mathcal{H}$, including inductive biases and priors, contain information about the particular idiosyncratic ways in which both the intelligence and the external world are likely to change over time. Such constraints, for example, include the possibility of compositional representations. The constraints also push the hypotheses towards those that are accurate both now and in the future. The actions are therefore not only aimed at exploiting the current environment but also at exploring to gain knowledge useful for future environments and behaviors, reflecting a curiosity about the world. Finally, learning causal, rather than merely associational, relations enables the intelligence to choose actions that lead to the most desirable outcomes, even in complex, previously unencountered environments.
These four capabilities, continual learning, prospective constraints, curiosity, and causality, together form the basis of prospective learning.

4.1 Continual learning

Continual learners learn for performance in the future, in a way that involves sequentially acquiring new capabilities (e.g., skills and representations) without forgetting (and ideally while improving upon) previously acquired capabilities that are still useful. In general we expect previously learned abilities to be useful again in the future, either in part or in full. As such, it is clear that an intelligence that can remember useful capabilities, despite learning new behaviors, will outperform (and outsurvive) those that do not [54]. However, AI systems often do forget the old upon learning the new, a phenomenon called catastrophic interference or forgetting [3, 55, 56]. Better than merely not forgetting old things, a continual learner improves performance on old things, and even on potential future things, upon acquiring new data [49, 52, 57-59]. As such, the ability to do well on future tasks is the hallmark of real learning, and the need to not forget immediately derives from it. An example of successful continual learning in NI is learning to play music. If a person is learning Mozart and then practices arpeggios, having learned the arpeggios will improve their ability to play Mozart, and also their ability to play Bach in the future. When people learn another language, it also improves their comprehension of previously learned languages, making future language learning even easier [60]. The key to successful continual learning is, therefore, to transfer information from data and experiences backwards to previous tasks (called backward transfer) and forwards to future tasks (called forward transfer) [45, 58, 59]. Humans also have failure modes: sometimes this prior learning can impair future performance, a process known as interference (e.g., [61, 62]). The extent of transfer or interference in future performance depends on both environmental context and internal mechanisms (see Bouton [63]). While continual learning is obviously required for efficient prospective learning, to date there are relatively few studies quantifying forward transfer in NIs, and, as far as we know, none that explicitly quantify backward transfer [64-66]. Crucially, learning new information does not typically cause animals to forget old information. Nonetheless, while existing AI algorithms have tried to enable both forward and backward transfer, for the most part they have failed [67]. The field is only just beginning to explore effective continual learning strategies [68], including those that explicitly consider non-stationarity of the environment [69, 70]. Traditional retrospective learning starts from a tabula rasa mentality, implicitly assuming that there is only one task (e.g., a single machine learning problem) to be learned [71]. In these classical machine learning scenarios, each data sample is assumed to be sampled independently; this is true even in OOD learning. While online learning [72], sequential estimation [73], and reinforcement learning [74] relax this assumption, traditional variants of those ML disciplines typically assume a slow distributional drift, disallowing discrete jumps. This does not consider the possibility that the far future may strongly depend on the present (e.g., memories from the last time an animal was at a particular location will be useful next time, even if it is far in the future).
These previous approaches also typically consider only single-task learning, whereas in continual learning there are typically multiple tasks and multiple distinct datasets, sometimes each with a different domain. In prospective learning, however, data from the far past can be leveraged to update the internal world model [75, 76]. Here the training and test sets are necessarily coupled by time. This is in contrast to the canonical OOD learning problem, in which the training and testing data lack any notion of time, but similar to classical online, sequential, and reinforcement learning. Here we assume that the future depends to some degree on the past. This dependency can be described by their conditional distribution, $P_{\mathrm{future} \mid \mathrm{past}}$. Crucially, we do not necessarily assume a Markov process, where the future only depends on the recent past, but rather allow for more complex dependencies depending on structural regularities across scenarios. We thus obtain a more general expected risk in the learning-for-the-future scenario:

$\mathcal{E}(g, n) := \mathbb{E}_{P_{\mathrm{future},\mathrm{past}}} \left[ \ell\big(h_f(X_f), Y_f\big) \mid \{(X_i, Y_i)\}_{i=1}^{n} \right], \qquad h_f = g\big(\{(X_i, Y_i)\}_{i=1}^{n}\big).$

Continual learning is thus an immediate consequence of prospective learning. Recent work on continual reinforcement learning [77] can be thought of as devoted to developing algorithms that optimize the above equation, but such efforts typically lack the other capabilities of prospective learning. As we will argue next, continual learning is only non-trivial upon assuming certain computational constraints.

4.2 Prospective constraints

Constraints for prospective learning effectively shrink the hypothesis space, so that less data and fewer resources are required to find solutions to the current problem which also generalize to potential future problems. Whereas in NI these constraints come from evolution, in AI these constraints are built into the system. Traditionally, constraints come in two forms. Statistical constraints limit the space of hypotheses that are possible so as to enhance statistical efficiency; they reduce the amount of data required to achieve a particular goal. For our purposes, priors and inductive biases are 'soft' statistical constraints. Computational constraints, on the other hand, impose limits on the amount of space and/or time an intelligence can use to learn and make inferences. Such constraints are typically imposed to enhance computational efficiency; that is, to reduce the amount of computation (space and/or time) required to achieve a particular error guarantee. Both kinds of constraints, statistical and computational, restrict the search space of effective or available hypotheses, and of the two, statistical constraints likely play a bigger role in prospective learning than computational ones. Moreover, both kinds of constraints can be thought of as different ways to regularize, either explicitly (e.g., priors and penalties) or implicitly (e.g., early stopping). There is no way to build an intelligence, either via evolution or by human hands, without it having some constraints, particularly inductive biases (i.e., assumptions that a learner uses to facilitate learning an input-output mapping). For example, most mammals [23, 78, 79], and even some insects [80], excel at learning general relational associations that are acquired in one modality (e.g., space) and applied in another (e.g., social groups). Inductive biases like this often reflect solutions to problems faced by predecessors and learned over evolution.
They are often expressed as instincts and emotions that provide motivation to pursue or avoid a course of action, leading to opportunities to learn about relevant aspects of the environment more efficiently. For example, mammals have a particular interest in moving stimuli, and specifically biologically relevant motion [81], likely reflecting behaviorally relevant threats [82]. Both chicks [83] and human babies [84] have biases for parsing visual information into object-like shapes, without extensive experience with objects. Newborn primates are highly attuned to faces [85] and direction of gaze [86], and these biases are believed to facilitate future conceptual [87] and social learning [88]. Thus, within the span of an individual's lifetime, NIs are not purely data-driven learners. Not only is a great deal of information baked in via evolution, but this information is then used to guide prospective learning [89]. AI has a rich history of choosing constrained search spaces, including priors and specific inductive biases, so as to improve performance (e.g., [90]). Perhaps the best-known inductive bias deployed in modern AI solutions is the convolution operation [91], something NIs appear to have discovered hundreds of millions of years before we implemented it in AIs [92]. Such ideas can be generalized in terms of symmetries in the world [93]. Machine learning has developed many techniques to incorporate known invariances into the learning process [94-97], as well as to mathematically quantify how much one can gain by imposing them [98, 99]. In fact, in many cases we may want to think of constraints themselves as something to be learned [100, 101], a process that would unfold over evolutionary timescales for NIs. However, in many areas the true potential of prospective constraints for accelerating learning for the future remains unmet. For example, as pointed out above, many NIs can learn the component structure of problems (e.g., relations), which accelerates future learning when new contexts have similar underlying compositions (see Whittington et al. [23]). This capability corresponds to zero-shot cross-domain transfer, a challenge unmet by current state-of-the-art machine learning methods [102]. Why are these constraints important? With a sufficiently general search space, and enough data, space, and time, one can always find a learner that does arbitrarily well [32, 103]. In practice, however, intelligences have finite data (in addition to finite space and time). Moreover, a fundamental theorem of pattern recognition is the arbitrarily slow convergence theorem [104, 105], which states that given a fixed learning algorithm and any sample size N, there always exists a distribution such that the performance of the algorithm is arbitrarily poor whenever n < N [34, 35]. This theorem predates and implies the celebrated no-free-lunch theorem [106], which states that there is no one algorithm to rule them all; rather, if a learner g converges faster than another learner g' on some problems, then g' will converge faster on other problems. In other words, one cannot hope for a general "strong AI" that solves all problems efficiently. Rather, one can search for a learner that efficiently solves problems in a specified family of problems.
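This trade-off can be seen in a few lines of code. Below is a minimal sketch (our own; the polynomial hypothesis classes and the two target functions are arbitrary illustrative choices, not from the paper) in which a tightly constrained learner is statistically efficient from a dozen samples when its inductive bias matches the problem family, and fails when it does not, echoing the no-free-lunch point above.

```python
import numpy as np

rng = np.random.default_rng(1)

def risk(degree, target, n_train=12, n_test=10_000, sigma=0.1):
    """Fit a degree-`degree` polynomial by least squares on a small noisy sample;
    return its Monte Carlo test mean-squared error on the same target."""
    x_tr = rng.uniform(-1, 1, n_train)
    y_tr = target(x_tr) + sigma * rng.normal(size=n_train)
    coef = np.polyfit(x_tr, y_tr, deg=degree)
    x_te = rng.uniform(-1, 1, n_test)
    return float(np.mean((np.polyval(coef, x_te) - target(x_te)) ** 2))

linear = lambda x: 2 * x + 1       # a problem inside the constrained family
wiggly = lambda x: np.sin(8 * x)   # a problem outside it

# A tight constraint (degree 1) is efficient when its bias matches the problem...
print("linear target | constrained:", risk(1, linear), "| unconstrained:", risk(9, linear))
# ...but, per no-free-lunch, the same constraint fails outside the assumed family.
print("wiggly target | constrained:", risk(1, wiggly), "| unconstrained:", risk(9, wiggly))
```

The prospective question, addressed next, is which constraints to impose when the family of future problems is itself only partially known.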
Constraints on the search space of hypotheses thus enable intelligences to solve the problems of interest efficiently, by virtue of encoding some form of prior information and limiting the search to specific problems. Prospective learners use prospective constraints, that is, constraints that push hypotheses towards the most general solution that works for a given problem, such that it can readily be applied to future distinct problems. Formalizing constraints using the above terminology (see Section 3) does not require modifying the objective function of learning. It merely modifies the search space. Specifically, we place constraints on the learner, $g \in \tilde{\mathcal{G}} \subset \mathcal{G}$, the hypothesis, $h \in \tilde{\mathcal{H}} \subset \mathcal{H}$, and the assumed joint distribution governing everything, $P = P_{\mathrm{future},\mathrm{past}} \in \tilde{\mathcal{P}} \subset \mathcal{P}$. The existence of constraints is what makes prospective learning possible, and the quality of these constraints is what decides how effective learning can be.

4.3 Curiosity

We define curiosity for prospective learning as taking actions whose goal is to acquire information that the intelligence expects will be useful in the future (rather than to obtain rewards in the present). Goal-driven decisions can be broken down into a choice between maximizing one of two objective functions [107]: (1) an objective aimed at directly maximizing rewards, R, and (2) an objective aimed at maximizing relevant information, E. For prospective learning, E is needed for making good choices now and in the as-yet-unknown future. In this way the intelligence, at each point in time, decides whether it should dedicate time to learning about the world, thereby maximizing E, or to performing a rewarding behavior, thereby maximizing R. Critically, by being purely about relevant information for the future, objective (2) (i.e., pure curiosity) can maximize information about both current and future states of the world. E can be defined simply as the value of the unknown: the integration over possible futures and the knowledge they may afford. However, this term cannot easily be evaluated. Instead, we know much about what it drives the intelligence to learn: compositional representations, causal relations, and other kinds of invariances that allow us to solve current and future problems. In this way, E ultimately quantifies our understanding of the relevant parts of the world. In humans, curiosity is a defining aspect of early-life development, where children consistently engage in more exploratory and information-driven behavior than adults [6, 108-111]. This drive for directed exploration, particularly important in children, is often focused on learning causal relations, acquiring both forward and reverse causal explanations [112], and developing models of the world that they can exploit in later development. But curiosity is not limited to humans (for review see Loewenstein [113]). Just like children [114], monkeys show curiosity about counterfactual outcomes [115]. Rats are driven to explore novel stimuli and contexts, even in the absence of rewards [116, 117]. Just like children [118], octopuses appear to learn from playing with novel objects [119], a pure form of curiosity. In fact, even the roundworm C. elegans, an animal with a simple nervous system of only a few hundred neurons, shows evidence of exploration in novel environments [120]. Curiosity is clearly a fundamental drive of behavior in NIs [107].
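A toy sketch of the choice between the two objectives follows (our own construction; the Beta-Bernoulli world, the posterior-variance proxy for E, and the fixed information-then-reward schedule are all simplifying assumptions, not claims about the paper's formalism): the agent first takes actions to maximize information about the world, then switches to maximizing reward.

```python
import numpy as np

rng = np.random.default_rng(2)
p_true = np.array([0.3, 0.5, 0.7])   # hidden Bernoulli reward rates (part of W)
alpha = np.ones(3)                   # Beta posterior over each action's reward rate,
beta = np.ones(3)                    # starting from a flat Beta(1, 1) belief

def uncertainty(a, b):
    """Variance of a Beta(a, b) belief: a cheap proxy for how much is left to learn."""
    return a * b / ((a + b) ** 2 * (a + b + 1))

for t in range(300):
    if t < 150:   # objective E: probe whichever action we currently know least about
        arm = int(np.argmax(uncertainty(alpha, beta)))
    else:         # objective R: exploit the current belief about reward
        arm = int(np.argmax(alpha / (alpha + beta)))
    r = rng.random() < p_true[arm]   # act on the world, observe the outcome
    alpha[arm] += r
    beta[arm] += 1 - r

print("posterior means:", np.round(alpha / (alpha + beta), 2))
print("believed best action:", int(np.argmax(alpha / (alpha + beta))))
```

A prospective learner would schedule this switch itself, weighing the value of information against rewards expected in futures it has not yet encountered.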
It is well established in the active learning literature that curiosity, i.e., gathering information rather than rewards, can lead to an exponential speed-up in sample-complexity guarantees [121, 122]. Specifically, this means that if a passive learner requires n samples to achieve a particular performance guarantee, then an active learner requires only ln n samples to achieve the same performance. This matters because the scenarios in which prospective learning provides a competitive advantage are those where information is relatively sparse and the outcomes are of high consequential value, so every learning opportunity must really count. We cannot expect that either AIs or NIs perfectly implement prospective learning by integrating E over long time horizons. Instead, we can approximate what we learn about the parts of the world in which we will want to take future actions, which compositional elements (i.e., constraints) exist in this world, and which causal interactions these components have. These properties mean that we can see E as an approximation to how well we can learn from the world. Thus optimal information gathering (i.e., curiosity) relies on learning policies similar to those of reinforcement learning. This may explain why empirical studies in humans show that information seeking relies on circuits overlapping with those for reward learning [123]. Most importantly, this shows how curiosity is innately future-focused. The solution to reinforcement learning (i.e., the Bellman equation) reflects the optimal decision to make to maximize future returns [76, 124]. Thus, in the case of curiosity, this solution is the optimal decision to maximize information in the future. What distinguishes curiosity from reward learning is that learning E informs intelligences, whether NI or AI, about the structure of the world. E provides the necessary knowledge of things like spatial configurations, hierarchical relationships, and contingencies. In other words, to find an optimal curiosity policy we can find an optimal policy today about the structure of the world, regardless of immediate rewards, and solve the optimization problem again tomorrow.

4.4 Causality

Causal estimation is enabled in practice by assuming that direct causal relationships are sparse. This sparsity assumption greatly simplifies modeling the world: it adds some bias, but drastically reduces the search space over hypotheses. While it might be tempting to think that prospective learning boils down to simply learning factorizable probabilistic models of the world, such models are inadequate for prospective learning. This is because probabilistic models are inherently invertible. That is, we can just as easily write the probability of wet grass given that it is raining, P(wet|rain), as the probability that it is raining given wet grass, P(rain|wet). Yet these probabilities do not tell us what would be the effect of intervening on one or the other variable. These probabilistic statements about the world do not convey whether or not increasing P(wet) increases P(rain). According to causal essentialists, such as Pearl [7], making such statements requires more than a probabilistic model: it requires a causal model. Causal reasoning enables intelligences to transfer information across time. Specifically, it enables transferring causal mechanisms which, by their very nature, are consistent across environments.
This includes environments that have not previously been experienced, thereby transferring out-of-distribution. Thus, causal reasoning, like continual learning, is a qualitative capability, rather than a quantitative improvement, that is necessary for prospective learning. Causal reasoning has long been seen by philosophers as a fundamental feature of human intelligence [125]. While it is not always easy to distinguish causal reasoning from associative learning in animals, many non-human animals have been shown to perform predictive inferences about object-object relationships, allowing them to estimate causal patterns (for review see Völter and Call [126]). For example, great apes [127], monkeys [128], pigs [129], and parrots [130] can use simple sensory cues (e.g., the rattling sound of a shaken cup) to infer outcomes (e.g., the presence of food in the cup), a form of diagnostic inference. However, this form of causal reasoning is inconsistently observed in NIs (see Cheney and Seyfarth [131], Collier-Baker et al. [132]). Other studies have shown that, particularly in social contexts, animals from great apes [133] and baboons [134] to corvids [135] and rats [136, 137] can perform transitive causal inference (i.e., if A → B and B → C, then A → C; for review see Allen [138]). This causal ability has even been observed in insects [139], suggesting that forms of causal inference exist across taxa. The insight driving causal reasoning is that the causal mechanisms in the world tend to persist, while correlations are often highly context-sensitive [140]. Further, the same causal mechanisms are involved in generating many observations, so that models of these mechanisms are reusable modules in compositional solutions to many problems within the environment. For example, understanding gravity is useful for catching balls as well as for modeling tides and launching rockets. Prospective learning thus crucially benefits from causal models: they are more likely to be useful because they encode real invariances that persist across environments. For example, different variants of COVID will continue to emerge, but certain treatments are likely to be effective for each of them insofar as they act on the mechanism of disease, which remains constant [141]. Such scenarios pose a problem for traditional AI algorithms. Modern retrospective learning machines notoriously fail to learn causal models in all but the most anodyne settings. Some AI researchers have advocated for creating models that can perform causal reasoning, which would help AI systems generalize better to new settings and perform prospective inference [142-144], but this field remains in its infancy. Going back to our formulation of the problem, what this all means is that what matters for future decisions is 'doing Y': intervening on the world by virtue of taking action Y, rather than simply noting the (generalized) correlation between X and Y. Fundamentally, implementing Y simply means returning the value of the hypothesis for a specific X, i.e., do(h(X)). This modification yields an updated term to optimize in order to achieve prospective learning:

$\mathcal{E}_{\mathrm{do}}(g) := \mathbb{E}_{P_{\mathrm{future},\mathrm{past}}} \left[ \ell\big(\mathrm{do}(h(X)), Y\big) \right].$

Crucially, the ability to choose actions, Y, allows the agent to discover causal relations, regardless of the amount of confounding in the outside world. Causality links actions and learner, both by enabling actions that are helpful for learning (e.g., randomized ones) and by enabling learning strategies that are useful for discovering causal aspects of reality (e.g., through quasi-experiments).
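To ground the distinction, here is a minimal simulation (our own illustration; the structural model, in the spirit of the rain/wet-grass example above, is hypothetical) in which the observational contrast P(Y|X) suggests a strong effect of X on Y, while randomizing X, i.e., do(X), reveals that the true causal effect is zero.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

def world(x_policy):
    """Toy structural causal model: a hidden common cause Z drives both X and Y,
    and X has NO causal effect on Y. `x_policy` decides how X is set."""
    Z = rng.random(n) < 0.5                # hidden confounder
    X = x_policy(Z)
    Y = rng.random(n) < (0.2 + 0.6 * Z)    # Y depends only on Z
    return X, Y

# Observational regime: X tends to follow Z, so X and Y are strongly associated.
X, Y = world(lambda Z: Z ^ (rng.random(n) < 0.1))
print("association  P(Y|X=1) - P(Y|X=0):", round(Y[X].mean() - Y[~X].mean(), 3))

# Interventional regime: do(X) severs the Z -> X edge by randomizing X.
X, Y = world(lambda Z: rng.random(n) < 0.5)
print("intervention P(Y|do(X=1)) - P(Y|do(X=0)):", round(Y[X].mean() - Y[~X].mean(), 3))
```

Only a learner that can act on the world, rather than merely observe it, can tell these two regimes apart.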
For our purposes here, we consider all interactions between learning and action strategies to belong to either causality or curiosity. Table 1 provides a summary of the four capabilities that are necessary for prospective learning, examples of how they are expressed in nature, how retrospective learning handles each process, and how a prospective learner would implement it. In it we highlight examples in the literature where the behavior of a prospective learner has been demonstrated in AI, illustrating that the field is already moving somewhat in this direction, though not yet completely. We argue this is because the form of prospective learning has not been carefully defined, as we have attempted to do here. With this gap in mind, we argue that in NIs, evolution has led to the creation of intelligent agents that incorporate the above key components that jointly characterize prospective learning. Continual learning, enabled by constraints and driven by curiosity, allows for the ability to make causal inferences about which actions to take now that lead to better outcomes both now and in the far future. In other words, our claim is that evolution led to the creation of NIs that choose a learner which, with each new experience, updates the internal model, $g(h_n, X_n, Y_n) \to h_f$, where each $h_f$ is the solution that minimizes $\mathcal{E}(\mathrm{do}(h), n, P)$, where $P = P_{\mathrm{future},\mathrm{past}}$, and the constraints on g, h, and P encode aspects of time, compositionality, and causality (e.g., that the future is dependent on the past via causal mechanisms). The expected risk, $\mathcal{E}$, for a specified hypothesis, at the current moment, given a set of experiences, is defined by

$\mathcal{E}(\mathrm{do}(h), n, P) := \mathbb{E}_{P_{\mathrm{future},\mathrm{past}}} \left[ \ell\big(\mathrm{do}(h(X_f)), Y_f\big) \mid \{(X_i, Y_i)\}_{i=1}^{n} \right]. \quad (1)$

This $\mathcal{E}$ gives the fundamental structure of the prospective learning problem. We argue that although there has been substantial effort in the NI and AI communities to address each of the four capabilities that lead to solving Eq. (1) independently, each remains to be solved at either the theoretical or algorithmic/implementation level. Solving prospective learning requires a coherent strategy for jointly solving all four of these problems together. While we argue that optimizing Eq. (1) characterizes our belief about what intelligent agents do when performing prospective learning, it is strictly a computational-level problem. It does not, however, answer the question of how they do it. What is the mechanism or algorithm that intelligent agents use to perform prospective learning? Intriguingly, the implementation of prospective learning in NIs happens in a network (to a first approximation) [145], and most modern state-of-the-art machine learning algorithms are also networks [146]. Moreover, both fields have developed a large body of knowledge about network learning [147-150]. Thus, a key to solving how prospective learning can be implemented lies in understanding how networks learn representations, particularly representations that are important for the future. This is a critical component in explaining, augmenting, and engineering prospective learning. Understanding the role of representations in prospective learning thus requires a deep understanding of the nature of internal representations in modern ML frameworks. A fundamental theorem of pattern recognition characterizes sufficient conditions for a learning system to be able to acquire any pattern.
Specifically, an intelligent agent must induce a hypothesis such that, for any new input, it only looks at a relatively small amount of data 'local' to that input [151]. In other words, a good learner will map new observations into a new representation, i.e., a stable and unique trace within a memory, such that inputs that are 'close' together in the external world are also close in their internal representation (see also Sorscher et al. [152], Sims [153]). In networks, this is implemented by any given input activating only a sparse subset of nodes, which is typical in NIs [154] and becoming more common in AIs [155]. Indeed, deep networks can satisfy these criteria [156]. Specifically, deep networks partition feature space into geometric regions called polytopes [157]. The internal representation of any given point then corresponds to which polytope the point is in, and where within that polytope it resides. Inference within a polytope is then simply linear [158]. The success of deep networks is a result of their ability to efficiently learn what counts as 'local' [159]. In prospective learning, in contrast to retrospective learning, what counts as local is also a function of potential future environments. Thus, the key difference between retrospective and prospective representation learning is that the internal representation for prospective learning must trade off between being effective for the current scenario and being effective for potential future scenarios.

Table 1: The four core capabilities of prospective learning, evidence for their existence in natural intelligence, how retrospective learners deal with (or fail to deal with) each, and how a prospective learner would deal with it. See Figure 1 for notation.

| Capability | Principle | Evidence in NI | Retrospective learner | Prospective learner |
|---|---|---|---|---|
| Continual learning | Don't forget the important stuff, $h_n \to h_f$ | When people learn a new language, they get better at their old one [60] | Learning new information overwrites old information [163] | Reuse useful information to facilitate learning new things without interference [59, 164] |
| Constraints | Regularize via prior knowledge, heuristics, & biases, $h \in \mathcal{H}$ | Animals learn to store food in locations that are optimal given local weather conditions [16] | Generic constraints like sparseness enable learning a single task more efficiently [165] | Compositional representations can be exponentially reassembled for future scenarios and to compress the past [90] |
| Curiosity | Go get information about the (expected future) world, instead of just rewards, $E(h_n, W_f)$ | Animals explore novel stimuli and contexts, even in the absence of rewards [116, 117] | Use randomness to explore new options with unknown outcomes in the current scenario [166] | Seek out information about potential future scenarios [107] |
| Causality | The world W has sparse causal relationships, $\mathrm{do}(Y_n) \to W_f$ | Animals can learn: if A → B and B → C, then A → C [138] | Learn statistical associations between stimuli [167] | Apply causal information to novel situations [168] |

5 The future of learning for the future

In many ways, prospective learning has always been a central (though often hidden) goal of both AI and NI research. Both fields offer theoretical and experimental approaches to understand how intelligent agents learn for future behavior. What we are proposing here is a formalization of the structure for how to approach studying prospective learning jointly in NI and AI, one that will benefit both by establishing a more cohesive synergy between these research areas (see Table 2).
Indeed, the history of AI and NI exhibits many beautiful synergies [169, 170]. In the middle of the 20th century, cognitive science (NI) and AI started in large part as a unified effort. Early AI work like the McCulloch-Pitts neuron [171] and the Perceptron [172] had strong conceptual links to biological neurons. Neural network models in the Parallel Distributed Processing framework had success in explaining many aspects of intelligent behavior, and there have been strong recent drives to bring deep learning and neuroscience closer together [169, 170, 173]. We believe that our understanding of both AI and NI is significantly held back by the lack of a coherent framework for prospective learning. NI research requires a framework to analyze the incredible way in which humans and non-human animals learn for the future. AI can benefit by studying how NIs solve problems that remain intractable for AI systems. We argue that the route to solving prospective learning rests on the two fields coming together around three major areas of development.

• Theory: A theory of prospective learning, building on and complementing the theory of retrospective learning, will indicate which experiments (in both NIs and AIs) will best fill gaps in our current understanding, while also providing metrics to evaluate progress [174]. A theoretical understanding of prospective learning will also enable the generation of testable mechanistic hypotheses characterizing how intelligent systems can and do prospectively learn.

• Experiments: Carefully designed, ecologically appropriate experiments across species, phyla, and substrates will enable (i) quantifying the limitations and capabilities of existing intelligent systems with respect to the prospective learning criteria, and (ii) refining the mechanistic hypotheses generated by the theory. NI experiments across taxa will also establish milestones for experiments in AI [175].

• Real-World Evidence: Implementing and deploying AI systems, and observing NIs exhibiting prospective learning 'in the wild', will provide real-world evidence to deepen our understanding. These implementations could be purely software, involve leveraging specialized (neuromorphic) hardware [176, 177], or even include wetware and hybrids [178].

An astute reader may wonder how prospective learning relates to reinforcement learning (RL), a subfield of both AI and NI. RL has long worked towards bridging the gap between AI and NI. For example, early AI models of reinforcement learning formalized the phenomenon of an 'eligibility trace' in synaptic connections that may be crucial for resolving the credit assignment problem, i.e., determining which actions lead to a specific feedback signal [179]. Over 30 years later, this AI work informed the design of experiments that led to the discovery of such traces in brains [180, 181]. In RL, through repeated trials of a task usually specified by its corresponding task reward, agents are trained to choose actions at each time instant that maximize those task rewards at future instants [76, 124]. This future-oriented reward-maximization objective at first glance bears resemblance to prospective learning, and deep learning-based RL algorithms, building on decades of research, have recently made great progress towards meeting this challenge [182-186].
However, these standard RL algorithms do not truly implement prospective learning: for example, while deep RL agents may do even better than humans in the games they were trained on, they have extreme difficulty transferring skills across games (see Shao et al. [187]), even to games with task or rule structures similar to the training set [188, 189]. Rather than optimizing a prespecified task as in RL, prospective learning aims to acquire skills and representations now which will be useful for future tasks whose rewards or other specifications are not usually available in advance. In the example above, a true prospective learner would acquire representations and skills that transfer across many games. As in other machine learning subfields, there are several growing movements within RL that study problems that would fall under the prospective learning umbrella, including continual multi-task RL [77], hierarchical RL [190-193] that combines low-level skills to solve new tasks, causal RL [194-196], and unsupervised exploration [197-199]. Advances in AI's understanding of prospective learning provide a necessary formalism that can be harvested by NI research. AI can produce the critical structure of scientific theories that lead to more rigorous and testable hypotheses around which to build experiments [200]. There is some historical evidence that our understanding of NI abilities has conceptually benefited from AI, including a few examples whereby theoretical formalisms from AI have inspired understandings of NI [201-204]. Our proposal for prospective learning expands upon the existing interrelation between these fields. Consider the problem of designing NI experiments to understand prospective learning, rather than cognitive function or retrospective learning. How would such experiments look different? They would demand that the NIs wrestle with each of the four capabilities, continual learning, constraints, curiosity, and causality: exactly what NI researchers currently avoid, because we cannot readily fit theories to such behaviors. For continual learning, the experiments would have a sequence of tasks, some of which repeat, so that both forward and backward transfer can be quantified. For constraints, tasks would specifically investigate the priors and inductive biases of the animal, rather than its ability to learn over many repeated trials. For curiosity, tasks would require a degree of exploration and include information relevant only for future tasks. For causality, tasks would encode various conditional dependencies that are not causal, as in Simpson's paradox. The ways in which the NIs are evaluated would also be informed by the theory, for example, quantifying the amount of information transferred, rather than long-run performance properties [45, 205]. While extensive research in all these areas exists, a deepened dialogue between individuals studying NI and AI will significantly advance their synergy. Importantly, prospective learning provides a scaffolding around which to organize the debate. Advances in our understanding of NIs have always been a central driver, if not the central driver, of AI research. Indeed, the fundamental logic of modern computer circuits was directly inspired by McCulloch and Pitts' [171] model of the binary logic of neural circuits.
The way we build AIs often involves a crucial step: observing differences between NIs and AIs, looking to NIs for inspiration about what is missing, and building it into our algorithms. The entire concept of intelligence used by AI derives from considerations of NIs. The concept of prospective learning promises to enable a stronger link between NI and AI, where the many components of the study of NI can be directly ported to the components of AI. Focusing the study of both NI and AI together promises to clarify the logical relations of concepts in the two fields, making them considerably more synergistic. Over the past few decades, there have been many proposals for how to advance AI, and they all center on capturing the abilities of NIs, primarily humans. These include understanding 'meaning' [206], in particular language, and continuing the current trend of building ever larger deep networks and feeding them ever larger datasets as a means of approaching the depth and complexity seen in biological brains [207]. Others have looked exclusively at human intelligence, going so far as to define AI as modeling human intelligence [208]. Indeed, today's rallying cry in AI is for 'artificial general intelligence', commonly defined as the hypothetical ability of an intelligent agent to understand or learn any intellectual task that a human being can [209]. This ongoing influence can be seen in the naming of AI concepts with cognitive science words: deep learning is concerned with concepts like curiosity [210], attention [211], memory, and most recently consciousness priors [212]. Our approach fundamentally differs from, and builds upon, those efforts. Yes, human intelligence (or, more accurately, human intellect) is incredibly interesting. Yet NI, and specifically prospective learning, evolved hundreds of millions of years before human intelligence and natural language. We therefore argue that prospective learning is much more fundamental than human intellect, and that AI can advance by studying prospective learning in non-human animals (in addition to studying human animals), which are more experimentally accessible. Moreover, we have an existence proof (from evolution) that one can get to human intellect by first building prospective learning. Whether one can side-step prospective learning capabilities and go straight to complex language understanding is an open question. So how does studying prospective learning in NIs potentially move AI forward? Consider the design of experiments. The study of NI in the lab typically focuses on simple constrained environments, such as two-alternative forced-choice paradigms. This is in contrast to how behavior is studied in ecology, where NIs have been studied in complex, unconstrained environments for centuries, and whose contributions to our understanding of NI behavior are largely underappreciated. Like ecology, but in contrast to typical cognitive science and neuroscience approaches, modern AI often investigates the abilities of agents in environments with rich, but complex, structure (e.g., video games, autonomous vehicles). Yet those same AIs (or similar ones) often catastrophically fail in real-world environments [213-217]. Part of this failure to generalize to natural environments is likely because the real world places a heavy emphasis on prospective learning abilities, something that most artificial testing environments do not do.
Thus, prospective learning provides additional context and motivation to design experiments that transcend boundaries between taxa and substrates, both natural and artificial [218]. We argue that experiments such as those described in Section 2 for studying NIs can be ported to study AIs as well. We can build artificial environments, such as video-game worlds, that place heavy demands on prospective abilities, including learning, and that allow direct comparison of the abilities of AIs and NIs (a toy sketch of such an environment appears below). Since it is possible to get some non-human NIs to play video games (e.g., monkeys [219], rats [220]), these experiments need not limit the comparisons to humans and AIs alone. Thus we can more effectively transfer our understanding of the abilities of NIs into AIs via a unification of tasks.

In many ways, retrospective learning is simple: it requires skill sets that many in ML already have, namely statistics, algorithms, and mathematics. Prospective learning, on the other hand, requires us to reason about potential futures that we have not yet experienced. In other words, we need to do prospective learning in order to understand prospective learning. As such, solving the problem of prospective learning requires a far broader group of people working on it. While it sits squarely within the domain of statistics and machine learning, the problem of prospective learning also requires perspectives from well outside these fields, such as biology, ecology, and philosophy. And because AI is not only modeling but also shaping the future, it reminds us of the deep ethical debt intelligence research owes to the society that enables it, and to those who are most directly impacted by it [221, 222].
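As a sketch of the unified video-game-style testbeds proposed above, the toy environment below presents a sequence of bandit tasks whose reward contingencies are redrawn at fixed intervals while one abstract regularity persists across tasks. The class name ProspectiveBandit and every parameter are our own illustrative inventions, not an existing benchmark; this is a minimal sketch assuming nothing beyond NumPy.

```python
import numpy as np

class ProspectiveBandit:
    """Toy task sequence: a two-armed bandit whose reward probabilities
    are redrawn every `episodes_per_task` pulls. One regularity persists
    across tasks (the better arm always wins by the same margin), so
    structure learned now remains useful after every task switch."""

    def __init__(self, episodes_per_task=100, margin=0.3, seed=0):
        self.rng = np.random.default_rng(seed)
        self.episodes_per_task = episodes_per_task
        self.margin = margin
        self.pulls = 0
        self._new_task()

    def _new_task(self):
        # Only *which* arm is better changes; the margin is invariant.
        base = self.rng.uniform(0.2, 0.5)
        self.p = np.array([base, base + self.margin])
        self.rng.shuffle(self.p)

    def step(self, action):
        reward = float(self.rng.random() < self.p[action])
        self.pulls += 1
        if self.pulls % self.episodes_per_task == 0:
            self._new_task()  # the future stops resembling the past
        return reward

# A learner that tracks only per-arm means must relearn after every
# switch; one that infers the invariant margin can adapt in a few trials.
env = ProspectiveBandit()
rewards = [env.step(t % 2) for t in range(1000)]  # dummy alternating policy
print(f"mean reward of a structure-blind policy: {np.mean(rewards):.2f}")
```

Because only the shared margin structure survives task switches, the gap between a structure-blind learner and one that exploits the invariance is one crude operationalization of prospective ability, and the same episode stream could, in principle, be presented to an animal.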
References

Fast & furious: the misregulation of driverless cars
Gender imbalance in medical imaging datasets produces biased classifiers for computer-aided diagnosis
Catastrophic interference in connectionist networks: The sequential learning problem
How hard is cognitive science?
Large-scale neuromorphic spiking array processors: A quest to mimic the brain
The psychology and neuroscience of curiosity
Causality: Models, Reasoning, and Inference
I of the Vortex: From Neurons to Self
Discovering latent causes in reinforcement learning
Prospective cognition in animals
Goal-directed behavior and future planning in animals. Animal thinking: Contemporary issues in comparative cognition
Planning for breakfast
Studying mental states is not a research program for comparative cognition
Apes save tools for future use
Compound tool construction by New Caledonian crows
Can animals recall the past and plan for the future?
Food caching by western scrub-jays (Aphelocoma californica) is sensitive to the conditions at recovery
The control of food-caching behavior by western scrub-jays (Aphelocoma californica)
Effects of experience and social context on prospective caching strategies by scrub jays
Acrobatic squirrels learn to leap and land on tree branches without falling
Vicarious trial and error
Organizing conceptual knowledge in humans with a gridlike code
The Tolman-Eichenbaum machine: Unifying space and relational memory through generalization in the hippocampal formation
A short history of studies on intelligence and brain in honeybees
Colour learning when foraging for nectar and pollen: bees learn two colours at once
Bees remember flowers for more than one reason: pollen mediates associative learning
Trading off short-term costs for long-term gains: how do bumblebees decide to learn morphologically complex flowers?
Language Models are Few-Shot Learners
Observational learning in Octopus vulgaris
What functionalism should now be about
The Multiple Realization Book
Sulla determinazione empirica delle leggi di probabilità
Sulla determinazione empirica delle leggi di probabilità
On the uniform convergence of relative frequencies of events to their probabilities
A theory of the learnable
ImageNet classification with deep convolutional neural networks
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
International evaluation of an AI system for breast cancer screening
Improved protein structure prediction using potentials from deep learning
OpenAI's Dota 2 AI steamrolls world champion e-sports team with back-to-back victories
George van den Driessche, Thore Graepel, and Demis Hassabis
Classifier technology and the illusion of progress
The risks of invariant risk minimization
Out of distribution generalization in machine learning
Towards a theory of out-of-distribution learning
The influence of pattern similarity and transfer learning upon the training of a base perceptron B2
On the optimization of a synaptic learning rule
Multitask learning
A model of inductive bias learning
Meta-learning in natural and artificial intelligence
Continual learning in reinforcement environments
Lifelong robot learning
Learning in non-stationary partially observable Markov decision processes
No woman no cry
An empirical investigation of catastrophic forgetting in gradient-based neural networks
Anatomy of catastrophic forgetting: Hidden representations and task semantics
Is learning the n-th thing any easier than learning the first?
ELLA: An efficient lifelong learning algorithm
Ensembling Representations for Synergistic Lifelong Learning with Quasilinear Complexity
Corpus use in language learning: A meta-analysis
Transfer and interference in bumblebee learning
Context, time, and memory retrieval in the interference paradigms of Pavlovian learning
Extinction of instrumental (operant) learning: interference, varieties of context, and mechanisms of contextual control
The formation of learning sets
Budgerigars and zebra finches differ in how they generalize in an artificial grammar learning experiment
Complementary task representations in hippocampus and prefrontal cortex for generalising the structure of problems
Meta-Learning representations for continual learning
Continual lifelong learning with neural networks: A review
Continual learning with Bayesian neural networks for non-stationary data
Deep reinforcement learning amidst lifelong non-stationarity
The Blank Slate: The Modern Denial of Human Nature
Prediction, Learning, and Games
Statistical Learning and Sequential Prediction
Patterns, predictions, and actions: A story about machine learning
A possibility for implementing curiosity and boredom in model-building neural controllers
Introduction to Reinforcement Learning
Towards continual reinforcement learning: A review and perspectives
Relational memory and the hippocampus: representations and methods
Time and space in the hippocampus
The concepts of 'sameness' and 'difference' in an insect
Spontaneous discriminative response to the biological motion displays involving a walking conspecific in mice
Moving and looming stimuli capture attention
One-shot object parsing in newborn chicks
Principles of object perception
Face scanning and responsiveness to social cues in infant rhesus monkeys
Five primate species follow the visual gaze of conspecifics
From simple innate biases to complex visual concepts
Social origins of cortical face areas
A critique of pure learning and what artificial neural networks can learn from animal brains
Environment generation for zero-shot compositional reinforcement learning
Object recognition with gradient-based learning
On the division of cortical cells into simple and complex types: a comparative viewpoint
Invariante Variationsprobleme. Math-phys. Klasse
Scalars are universal: Equivariant machine learning, structured like classical physics
Risi Kondor. N-body networks: a covariant hierarchical neural network architecture for learning atomic potentials
Group equivariant convolutional networks
A practical method for constructing equivariant multilayer perceptrons for arbitrary matrix groups
On the sample complexity of learning with geometric stability
Learning with invariances in random features and kernel models
Learning invariances in neural networks
Learning augmentation distributions using transformed risk minimization
Not-So-CLEVR: learning same-different relations strains feedforward neural networks
A generalization of the Glivenko-Cantelli theorem
On arbitrarily slow rates of global convergence in density estimation
Lower bounds for Bayes error estimation
The supervised learning no-free-lunch theorems
A way around the exploration-exploitation dilemma
Hegel's Practical Philosophy: Rational Agency as Ethical Life
The exploration advantage: Children's instinct to explore allows them to find information that adults miss
Exploring exploration: Comparing children with RL agents in unified environments
Children are more exploratory and learn more than adults in an approach-avoid task
Why ask why? Forward causal inference and reverse causal questions
The psychology of curiosity: A review and reinterpretation
Counterfactual curiosity in preschool children
Monkeys are curious about counterfactual outcomes
The arousal and satiation of perceptual curiosity in the rat
Analysis of exploratory, manipulatory, and curiosity behaviors
Play, curiosity, and cognition. Annual Review of Developmental Psychology
Learning from play in octopus
Maximally informative foraging by Caenorhabditis elegans
Agnostic active learning
The true sample complexity of active learning
Common neural code for reward and information value
An Abstract of a Treatise of Human Nature
Causal and inferential reasoning in animals
Inferences about the location of food in the great apes (Pan paniscus, Pan troglodytes, Gorilla gorilla, and Pongo pygmaeus)
Capuchin monkeys (Cebus apella) use positive, but not negative, auditory cues to infer food location
Domestic pigs' (Sus scrofa domestica) use of direct and indirect visual and auditory cues in an object choice task
Grey parrots use inferential reasoning based on acoustic cues alone
How Monkeys See the World: Inside the Mind of Another Species
Do dogs (Canis familiaris) understand invisible displacement?
Chimpanzees extract social information from agonistic screams
The responses of female baboons (Papio cynocephalus ursinus) to anomalous social interactions: evidence for causal reasoning?
Ravens notice dominance reversals among conspecifics within and outside their social group
Transitive inference in rats (Rattus norvegicus)
Transitive inference in rats: A test of the spatial coding hypothesis
Transitive inference in animals: Reasoning or conditioned associations. Rational animals
Transitive inference in Polistes paper wasps
Causal inference by using invariant prediction: identification and confidence intervals
The association between alpha-1 adrenergic receptor antagonists and in-hospital mortality from COVID-19. medRxiv
Causality for machine learning
Off-the-shelf deep learning is not enough, and requires parsimony, Bayesianity, and causality
Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio. Towards causal representation learning
Connectome: How the Brain's Wiring Makes Us Who We Are
Deep Learning
Statistical inference on random dot product graphs: a survey
A simple spectral failure mode for graph convolutional networks
How Powerful are Graph Neural Networks?
Consistent Nonparametric Regression
The geometry of concept learning. bioRxiv
Efficient coding explains the universal law of generalization in human perception
Modern Machine Learning: Partition & Vote
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
A probabilistic theory of pattern recognition. Stochastic Modelling and Applied Probability
When are deep networks really better than decision forests at small sample sizes, and how?
On the number of linear regions of deep neural networks
Deep Neural Networks with Random Gaussian Weights: A Universal Classification Strategy?
Random Forests
The strength of weak learnability
Do we need hundreds of classifiers to solve real world classification problems?
Catastrophic forgetting in connectionist networks
Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks
Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond
Balancing exploration and exploitation with information and randomization
Deterministic policy gradient algorithms
Discovering causal signals in images
Toward an integration of deep learning and neuroscience
A logical calculus of the ideas immanent in nervous activity
The perceptron: a probabilistic model for information storage and organization in the brain
A deep learning framework for neuroscience
How computational modeling can force theory building in psychological science
Dynamically reconfigurable silicon array of spiking neurons with conductance-based synapses
Progress in Neuromorphic Computing: Drawing Inspiration from Nature for Gains in AI and Computing
Understanding the human brain using brain organoids and a structure-function theory
Neuronlike adaptive elements that can solve difficult learning control problems
Reinforcement determines the timing dependence of corticostriatal synaptic plasticity in vivo
A silent eligibility trace enables dopamine-dependent synaptic plasticity for reinforcement learning in the mouse striatum
Human-level control through deep reinforcement learning
George van den Driessche, Thore Graepel, and Demis Hassabis
Grandmaster level in StarCraft II using multi-agent reinforcement learning
Human-level performance in 3D multiplayer games with population-based reinforcement learning
A distributional code for value in dopamine-based reinforcement learning
A survey of deep reinforcement learning in video games
Combining imagination and heuristics to learn strategies that generalize. Neurons, Behavior, Data analysis, and Theory
Knowledge transfer between similar Atari games using deep Q-networks to improve performance
Recent advances in hierarchical reinforcement learning. Discrete event dynamic systems
The Option-Critic architecture
Pieter Abbeel, and John Schulman. Meta learning shared hierarchies
Feudal networks for hierarchical reinforcement learning
Causal reinforcement learning
Causal learning for decision making (CLDM)
Causal sequential decision making workshop
Diversity is all you need: Learning skills without a reward function
Self-supervised exploration via disagreement
Dynamics-aware unsupervised discovery of skills
Theory before the test: How to build high-verisimilitude explanatory theories in psychological science
The Transfer of Cognitive Skill
Probing the compositionality of intuitive functions
A theory of causal learning in children: causal maps and Bayes nets
Bayesian models of child development
Don't forget, there is more than forgetting: new metrics for Continual Learning
On Crashing the Barrier of Meaning in Artificial Intelligence
The bitter lesson
Contemporary approaches to artificial general intelligence
Curiosity-driven exploration by self-supervised prediction
Neural machine translation by jointly learning to align and translate
The consciousness prior
The Parable of Google Flu: Traps in Big Data Analysis
Big data's disparate impact
Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy
AI can be sexist and racist - it's time to make it fair
Algorithms of Oppression: How Search Engines Reinforce Racism by Safiya Umoja Noble (review). Feminist Formations
Navigation Turing Test (NTT): Learning to evaluate human-like navigation
Prefrontal neurons represent winning and losing during competitive video shooting games between monkeys
A neuroengineer's guide on training rats to play Doom
Algorithmic injustice: a relational ethics approach

Acknowledgements

This white paper was supported by an NSF AI Institute Planning award (#2020312), with additional support from Microsoft Research and DARPA. The authors would especially like to thank Kathryn Vogelstein for putting up with endless meetings at the Vogelstein residence in order to make these ideas come to life.

Diversity Statement

By our estimates (using cleanBib), our references contain 11.67% woman (first) / woman (last), 22.15% man/woman, 22.15% woman/man, and 44.03% man/man; and 9.16% author of color (first) / author of color (last), 13.09% white author/author of color, 17.63% author of color/white author, and 60.12% white author/white author.