title: Representation Ensembling for Synergistic Lifelong Learning with Quasilinear Complexity
authors: Vogelstein, Joshua T.; Dey, Jayanta; Helm, Hayden S.; LeVine, Will; Mehta, Ronak D.; Geisa, Ali; Xu, Haoyin; Ven, Gido M. van de; Gao, Chenyu; Yang, Weiwei; Tower, Bryan; Larson, Jonathan; White, Christopher M.; Priebe, Carey E.
date: 2020-04-27

Abstract
In biological learning, data are used to improve performance not only on the current task, but also on previously encountered and as yet unencountered tasks. In contrast, classical machine learning starts from a blank slate, or tabula rasa, using data only for the single task at hand. While typical transfer learning algorithms can improve performance on future tasks, their performance on prior tasks degrades upon learning new tasks (called forgetting). Many recent approaches for continual or lifelong learning have attempted to maintain performance given new tasks. But striving to avoid forgetting sets the goal unnecessarily low: the goal of lifelong learning, whether biological or artificial, should be to improve performance on both past tasks (backward transfer) and future tasks (forward transfer) with any new data. Our key insight is that even though learners trained on other tasks often cannot make useful decisions on the current task (the two tasks may have non-overlapping classes, for example), they may have learned representations that are useful for this task. Thus, although ensembling decisions is not possible, ensembling representations can be beneficial whenever the distributions across tasks are sufficiently similar. Moreover, we can ensemble representations learned independently across tasks in quasilinear space and time. We therefore propose two algorithms: representation ensembles of (1) trees and (2) networks. Both algorithms demonstrate forward and backward transfer in a variety of simulated and real data scenarios, including tabular, image, spoken, and adversarial tasks. This is in stark contrast to the reference algorithms we compared to, all of which failed to transfer forward, backward, or both, even though many of them require quadratic space or time complexity.

1 Introduction
Learning is the process by which an intelligent system improves performance on a given task by leveraging data [1]. In biological learning, learning is lifelong, with agents continually building on past knowledge and experiences, improving on many tasks given data associated with any task. For example, learning a second language often improves performance in an individual's native language [2]. In classical machine learning, the system often starts with essentially zero knowledge, a "tabula rasa", and is optimized for a single task [3, 4]. While it is relatively easy to simultaneously optimize for multiple tasks (multi-task learning) [5], it has proven much more difficult to sequentially optimize for multiple tasks [6, 7]. Specifically, classical machine learning systems, and natural extensions thereof, exhibit "catastrophic forgetting" when trained sequentially, meaning their performance on prior tasks drops precipitously upon training on new tasks [8, 9]. This is in contrast to many biological learning settings, such as the second-language learning setting mentioned above. In the past 30 years, a number of sequential task learning algorithms have attempted to overcome catastrophic forgetting.
These approaches naturally fall into one of two camps. In one, the algorithm has fixed resources, and so must reallocate resources (essentially compressing representations) in order to incorporate new knowledge [10-14]. Biologically, this corresponds to adulthood, where brains have a nearly fixed or decreasing number of cells and synapses. In the other, the algorithm adds (or builds) resources as new data arrive (essentially ensembling representations) [15-17]. Biologically, this corresponds to development, where brains grow by adding cells, synapses, etc. Approaches from both camps demonstrate some degree of continual (or lifelong) learning [18]. In particular, they can sometimes learn new tasks while not catastrophically forgetting old tasks. However, as we will show, many state-of-the-art lifelong learning algorithms are unable to transfer knowledge forward, and none are able to transfer knowledge backward with small sample sizes, where it is particularly important. This inability to synergistically learn has been identified as one of the key obstacles limiting the capabilities of artificial intelligence [19, 20].

Our work falls into the resource-growing camp, in which each new task is learned with additional representational capacity. Our key innovation is the introduction of ensembling independent representations, rather than ensembling decisions (as in random forests [21] or network ensembles [22]). This is in contrast to ensembling representations that are conditionally dependent on past representations (like gradient boosted trees [23] and ProgNN [16]) or on both past and future representations (DF-CNN [17] and all of the fixed-resource algorithms). By learning the representations independently, we prevent past and future representations from interfering with the current representation, and thereby mitigate catastrophic forgetting. Moreover, the representations interact with each other through a channel layer, which enables both forward and backward transfer. In doing so, we also reduce computational time and space from quadratic to quasilinear (i.e., linear up to polylog terms).

In this paper, we consider a simplified learning environment similar to the ones used in [10-13, 16, 17], in which we know the task identities and the data arrive in batches rather than in a streaming fashion. Moreover, we keep the data from previous tasks in order to achieve backward transfer between tasks. Robins [24], Shin et al. [25], and van de Ven et al. [26] showed that if an algorithm is allowed to keep the old task data, it can mitigate catastrophic forgetting. However, it is not obvious how one can leverage future tasks to improve performance on past tasks, i.e., achieve backward transfer. Moreover, the problem of data storage can be effectively addressed by learning a generative model alongside the task representations. Therefore, throughout the paper, we focus on how synergistic learning from past and future tasks can be achieved by combining independently learned representations across tasks in this simplified environment.

In this work, we introduce two complementary synergistic learning algorithms, one based on decision forests (Synergistic Forests, SynF), and another based on deep networks (Synergistic Networks, SynN). Both SynF and SynN demonstrate forward and backward transfer, while maintaining computational efficiency. Simulations illustrate their learning capabilities, including performance properties in the presence of adversarial tasks.
We then demonstrate their learning capabilities in vision and language benchmark applications. Although the algorithms presented here are primarily resource building, we illustrate that they can effectively leverage prior representations. This ability implies that the algorithm can convert from a "juvenile" resource-building state to the "adult" resource-recruiting state, all while maintaining key synergistic learning capabilities and efficiencies.

Classical supervised learning [27] considers random variables (X, Y) ∼ P_{X,Y}, where X is an 𝒳-valued input, Y is a 𝒴-valued label (or response), and P_{X,Y} ∈ 𝒫_{X,Y} is the joint distribution of (X, Y). Given a loss function ℓ : 𝒴 × 𝒴 → [0, ∞), the goal is to find the hypothesis (also called predictor) h : 𝒳 → 𝒴 that minimizes the expected loss, or risk, R(h) = E[ℓ(h(X), Y)]. A learning algorithm is a function f that maps data sets (n training samples) to a hypothesis, where a data set S_n = {(X_i, Y_i)}_{i=1}^{n} is a set of n input/response pairs. Assume the n samples of (X, Y) pairs are independently and identically distributed from some true but unknown P_{X,Y} [27]. A learning algorithm is evaluated on its generalization error (or expected risk), E[R(f(S_n))], where the expectation is taken with respect to the true but unknown distribution governing the data, P_{X,Y}. The goal is to choose a learner f that learns a hypothesis h with small generalization error for the given task [28].

Lifelong learning generalizes classical machine learning in a few ways: (i) instead of one task, there is an environment 𝒯 of (possibly infinitely) many tasks, (ii) data arrive sequentially, rather than in batch mode, and (iii) there are computational complexity constraints on the learning algorithm and hypotheses. This third requirement is crucial, though often implicit. Consider, for example, the algorithm that stores all the data, and then retrains everything from scratch each time a new sample arrives. Without computational constraints, such an algorithm could be classified as a lifelong learner; we do not think such a label is appropriate for that algorithm. The goal in lifelong learning is therefore, given new data and a new task, to use all the existing data to achieve lower generalization error on this new task, while also using the new data to obtain lower generalization error on the previous tasks. This is distinct from classical online learning scenarios, because the previously experienced tasks may recur, so we are concerned about maintaining and improving performance on those tasks as well. In "task-aware" scenarios, the learner is aware of all task details for all tasks, meaning that the hypotheses are of the form h : 𝒳 × 𝒯 → 𝒴. In "task-unaware" (or task-agnostic [29]) scenarios, the learner may not know that the task has changed at all, which means that the hypotheses are of the form h : 𝒳 → 𝒴. We only address task-aware scenarios here.

We compared our approaches to nine reference lifelong learning methods. These algorithms can be classified into two groups, based on whether or not they add capacity resources per task. Among them, ProgNN [16] and Deconvolution-Factorized CNNs (DF-CNN) [17] learn new tasks by building new resources. For ProgNN, a new "column" of network is introduced for each new task. In addition to introducing this column, lateral connections from all previous columns to the new column are added. These lateral connections are computationally costly, as explained below.
DF-CNN [17] is a lifelong learning algorithm that improves upon ProgNN by introducing a knowledge base with lateral connections to each new column, thereby avoiding all pairwise connections and dramatically reducing computational costs. We also compare to two variants of exact replay (Total Replay and Partial Replay) [30]. Both store all the data they have ever seen, but Total Replay replays all of it upon acquiring a new task, whereas Partial Replay replays M samples, randomly sampled from the entire corpus, whenever we acquire a new task with M samples. The other five algorithms, Elastic Weight Consolidation (EWC) [10], Online-EWC (O-EWC) [13], Synaptic Intelligence (SI) [11], Learning without Forgetting (LwF) [12], and "None", all have fixed capacity resources. For the baseline "None", the network was incrementally trained on all tasks in the standard way, while always only using the data from the current task. The implementations for all of the algorithms are adapted from open-source code [17, 31]; for implementation details, see Appendix D.

Others have previously introduced criteria to evaluate transfer, including forward and backward transfer [32, 33]. These definitions typically compare the difference, rather than the ratio, between learning with and without transfer. Pearl [34] introduced the transfer benefit ratio, which builds directly off relative efficiency from classical statistics [28]. Our definitions are closely related to his. Learning efficiency is the ratio of the generalization error of an algorithm that has learned on one dataset, as compared to the generalization error of that same algorithm on a different dataset. Typically, we are interested in situations where the former dataset is a subset of the latter dataset. Let R^t be the risk associated with task t, and S_n^t be the data from S_n that is specifically associated with task t, so that R^t(f(S_n^t)) is the risk on task t of the hypothesis learned by f only on task t data, and R^t(f(S_n)) denotes the risk on task t of the hypothesis learned on all the data.

Definition 1 (Learning Efficiency). The learning efficiency of algorithm f for a given task t with sample size n is LE_n^t(f) := E[R^t(f(S_n^t))] / E[R^t(f(S_n))]. We say that algorithm f has learned all the tasks up to t with data S_n if and only if LE_n^{t′}(f) > 1 for all tasks t′ up to t.

To evaluate a lifelong learning algorithm while respecting the streaming nature of the tasks, it is convenient to consider two extensions of learning efficiency. Forward learning efficiency is the expected ratio of the risk of the learning algorithm with (i) access only to task t data, to (ii) access to the data up to and including the last observation from task t. This quantity measures the relative effect of previously seen out-of-task data on the performance on task t. Formally, let N^t = max{i : T_i = t} be the index of the last occurrence of task t in the data sequence, and let S_n^{≤t} denote all data up to and including that data point.

Definition 2 (Forward Learning Efficiency). The forward learning efficiency of f for task t given n samples is FLE_n^t(f) := E[R^t(f(S_n^t))] / E[R^t(f(S_n^{≤t}))].

We say an algorithm (positively) forward transfers for task t if and only if FLE_n^t(f) > 1. In other words, if FLE_n^t(f) > 1, then the algorithm has used data associated with past tasks to improve performance on task t. One can also determine the rate of backward transfer by comparing R^t(f(S_n^{≤t})) to the risk of the hypothesis learned having seen the entire training dataset.
More formally, backward learning efficiency is the expected ratio of the risk of the learned hypothesis with (i) access to the data up to and including the last observation from task t, to (ii) access to the entire dataset. Thus, this quantity measures the relative effect of future task data on the performance on task t.

Definition 3 (Backward Learning Efficiency). The backward learning efficiency of f for task t given n samples is BLE_n^t(f) := E[R^t(f(S_n^{≤t}))] / E[R^t(f(S_n))].

We say an algorithm (positively) backward learns task t if and only if BLE_n^t(f) > 1. In other words, if BLE_n^t(f) > 1, then the algorithm has used data associated with future tasks to improve performance on previous tasks. After observing m tasks, the extent to which the LE for the j-th task comes from forward transfer versus backward transfer depends on the order of the tasks. If we have a sequence in which tasks do not repeat, learning efficiency for the first task is all backward transfer, for the last task it is all forward transfer, and for the middle tasks it is a combination of the two. In general, LE factorizes into FLE and BLE: LE_n^t(f) = FLE_n^t(f) × BLE_n^t(f). Throughout, we report log LE, so that positive learning corresponds to LE > 1. In a lifelong learning environment having T tasks drawn with replacement from 𝒯, learner f w-lifelong learns the tasks t ∈ 𝒯 if the log of the convex combination of learning efficiencies is greater than 0, that is, log(Σ_t w_t · LE_n^t(f)) > 0, where w_t ≥ 0 and Σ_t w_t = 1. Note that the hardest performance to achieve is when w puts equal weight on every task. We say an agent has synergistically learned if the agent has positively learned for all tasks under all possible convex combinations w. Conversely, we say an agent has catastrophically forgotten if it has negative backward transfer for all the tasks.

Figure 1: Schemas of composable hypotheses. Ensembling decisions (as output by the channels) is a well-established practice, including random forests and gradient boosted trees. Ensembling representations (learned by the encoders) was previously used in lifelong learning scenarios, but they were not trained independently, thereby causing interference or forgetting. Note that the new encoders interact with the previous encoders through the channel layer (indicated by red arrows), thereby enabling backward transfer. Likewise, the old encoders interact with the future encoders (indicated by black arrows), thereby enabling forward transfer.

Our approach to lifelong learning is based on an information-theoretic decomposition of a hypothesis into an encoder, a channel, and a decoder [35, 36] (Figure 1A): the encoder, u : 𝒳 → 𝒳̃, maps an 𝒳-valued input into an internal representation space 𝒳̃ [37, 38]. The channel, v : 𝒳̃ → Δ_𝒴, maps the transformed data into a posterior distribution (or, more generally, a score) on the response space 𝒴. Finally, a decoder, w : Δ_𝒴 → 𝒴, produces a predicted label. See Appendix A for a concrete example using a decision tree. One can generalize the above decomposition by allowing for multiple encoders. Given B different encoders, one can attach a single channel to each encoder, yielding B different channels (Figure 1B). Doing so requires generalizing the definition of a decoder, which would operate on multiple channels. Such a decoder ensembles the decisions, because each channel provides its output based on its own encoder. This is the learning paradigm behind boosting [39] and bagging [40]; indeed, decision forests are a canonical example of a decision function operating on a collection of B outputs [21].
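To make the three schemas concrete, they can be written compactly as follows. This is our own sketch of the figure's panels; the task-indexed notation (h_t, v_t, w_t) is ours and is not taken verbatim from the paper:

```latex
% Figure 1A: a single decomposable hypothesis.
\[
  h = w \circ v \circ u, \qquad
  u : \mathcal{X} \to \tilde{\mathcal{X}}, \quad
  v : \tilde{\mathcal{X}} \to \Delta_{\mathcal{Y}}, \quad
  w : \Delta_{\mathcal{Y}} \to \mathcal{Y}.
\]
% Figure 1B: ensembling decisions. Each channel sees only its own encoder,
% and the decoder aggregates the B channel outputs (e.g., a decision forest).
\[
  h(x) = w\bigl( v_1(u_1(x)), \ldots, v_B(u_B(x)) \bigr).
\]
% Figure 1E: ensembling representations. Each task's channel operates on ALL
% encoders, so representations learned for other tasks can inform every task.
\[
  h_t(x) = w_t\Bigl( v_t\bigl( u_1(x), \ldots, u_B(x) \bigr) \Bigr).
\]
```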
A decision forest learns B different decision trees, each of which has a tree structure corresponding to an encoder. Each tree is assigned a channel that outputs that single tree's estimate of the probability that an observation belongs to each class. The decoder outputs the most likely class averaged over the trees. Although the task-specific structures in Figure 1B can provide useful decisions on their corresponding task, they cannot, in general, provide meaningful decisions on other tasks, because those tasks might have completely different class labels, for example. Therefore, in the multi-head structure (Figure 1C), a single encoder is used to learn a joint representation from all the tasks, and a separate channel is learned for each task to produce the scores or class-conditional posteriors for that task, each followed by a task-specific decider [10, 11, 13]. A further modification of the multi-head structure allows ProgNN to learn a separate encoder for each task, with forward connections from the past encoders to the current one (Figure 1D). This creates the possibility of forward transfer while precluding backward transfer (the past encoders are frozen).

Note that if the encoders are learned independently across different tasks, they may have learned useful representations that the tasks can mutually leverage. Thus, a further generalization of the decomposition in Figure 1B allows each channel to ensemble the encoders (Figure 1E). Doing so requires generalizing the definition of the channel, so that it can operate on multiple distinct encoders. The result is that the channels ensemble representations (learned by the encoders), rather than decisions (learned by the channels). The channels ensemble all the existing representations, regardless of the order in which they were learned. In this scenario, as with bagging and boosting, the ensemble of channels then feeds into a single decoder. When each encoder has learned complementary representations, this latter approach has certain appealing properties, particularly in multiple-task scenarios, including lifelong learning. See Appendix B for a concrete example.

We developed two different representation ensembling algorithms. The key to both of our algorithms is the realization that both forests and networks partition feature space into a union of polytopes [41]. Thus, the internal representation learned by each can be considered a sparse vector encoding which polytope a given sample resides in. In both algorithms, as new data from a new task arrive, the algorithm first builds a new independent encoder (using forests or networks), mapping each data point to a sparse vector encoding which polytope it is in. Then, it builds the channel for this new task, which integrates information across all existing encoders, thereby enabling forward transfer. If new data arrive from a new task, the algorithm can also leverage the new encoder to update the channels of the old tasks, thereby enabling backward transfer. In either case, new test data are passed through all existing encoders and corresponding channels to make a prediction. Note that while updating the previous task channels with the cross-task posteriors, we do not need to subsample the previous task data (see Appendix C for implementation details and pseudocode).

Synergistic Forests (SynF) ensembles decision trees or forests. For each task, the encoder u_t of SynF is the representation learned by a decision forest [21, 42].
The leaf nodes of each decision tree partition the input space 𝒳 [43]. The representation of x ∈ 𝒳 corresponding to a single tree b is a one-hot encoded L_b-dimensional vector, with a 1 in the location corresponding to the leaf of tree b into which x falls. The representation of x resulting from the collection of trees simply concatenates the B one-hot vectors from the B trees. Thus, the encoder u_t is the mapping from 𝒳 to a B-sparse vector of length Σ_{b=1}^{B} L_b. The channel then learns the class-conditional posteriors by populating the cells of the partitions and taking class votes with out-of-bag samples, as in "honest trees" [43-45]. Each channel outputs the average normalized class votes across the collection of trees, adjusted for finite-sample bias [46]. The decoder w_t averages the posterior estimates and outputs the argmax to produce a single prediction. Recall that honest decision forests are universally consistent classifiers and regressors [45], meaning that with sufficiently large sample sizes, under suitable though general assumptions, they will converge to minimum risk. Thus, the single-task version of this approach simplifies to an approach called "Uncertainty Forests" [46]. A table in the appendix lists the hyperparameters used in the CIFAR experiments.

A Synergistic Network (SynN) ensembles deep networks. For each task, the encoder u_t in a SynN is the "backbone" of a deep network (DN), including all but the final layer. Thus, each u_t maps an element of 𝒳 to an element of ℝ^d, where d is the number of neurons in the penultimate layer of the DN. In practice, we use the architecture described in van de Ven et al. [26] as "5 convolutional layers followed by 2 fully-connected layers each containing 2,000 nodes with ReLU non-linearities and a softmax output layer." We trained this network using the cross-entropy loss and the Adam optimizer [47] to learn the encoder. The channels are learned via k-Nearest Neighbors (k-NN) [48]. Recall that a k-NN, with k chosen such that as the number of samples goes to infinity, k also goes to infinity while k/n → 0, is a universally consistent classifier [48]. We use k = 16 log₂ n, which satisfies these conditions. The decoder is the same as above.

SynN was motivated by ProgNN, but differs from ProgNN in two key ways. First, recall that ProgNN builds a new neural network "column" for each new task, and also builds lateral connections between the new column and all previous columns. In contrast, SynN excludes those lateral connections, thereby greatly reducing the number of parameters and training time. Moreover, this makes each representation independent, thereby potentially avoiding interference across representations. Second, for inference on task j data, assuming we have observed tasks up to J > j, ProgNN only leverages representations learned from tasks up to j, thereby excluding tasks j + 1, …, J. In contrast, SynN leverages representations from all J tasks. This difference enables backward transfer. SynF adds yet another difference as compared to SynN by replacing the deep network encoders with random forest encoders. This has the effect of making the capacity, space complexity, and time complexity scale with the complexity and sample size of each task. In contrast, both ProgNN and SynN have a fixed capacity for each task, even if the tasks have very different sample sizes and complexities.
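To make the procedure concrete, below is a minimal, self-contained sketch of SynF-style representation ensembling built on scikit-learn. It is our illustration, not the authors' reference implementation (that is at http://proglearn.neurodata.io/): for brevity it takes class votes with the training data itself rather than with out-of-bag samples as the "honest" channel above does, it applies Laplace smoothing instead of the finite-sample bias adjustment of [46], and it assumes each task's labels are coded as integers 0, …, K−1.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

class SynFSketch:
    def __init__(self, n_trees=10):
        self.n_trees = n_trees
        self.encoders = {}   # task id -> fitted forest (encoder u_t)
        self.channels = {}   # (encoder task, voter task) -> per-tree leaf vote tables
        self.task_data = {}  # stored data, used to update old channels (backward transfer)

    def _build_channel(self, forest, X, y, n_classes):
        # One vote table per tree: leaf (node) id -> class-vote histogram.
        leaves = forest.apply(X)                      # (n_samples, n_trees) leaf ids
        tables = []
        for b, tree in enumerate(forest.estimators_):
            table = np.zeros((tree.tree_.node_count, n_classes))
            np.add.at(table, (leaves[:, b], y), 1.0)  # count class votes per leaf
            table += 1.0                              # Laplace smoothing for empty leaves
            tables.append(table / table.sum(axis=1, keepdims=True))
        return tables

    def add_task(self, t, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.task_data[t] = (X, y)
        # New, independent encoder for task t.
        forest = RandomForestClassifier(n_estimators=self.n_trees).fit(X, y)
        self.encoders[t] = forest
        # Every encoder votes on every task: the new encoder votes on old task
        # data (backward transfer), and old encoders vote on the new task's
        # data (forward transfer).
        for s, (Xs, ys) in self.task_data.items():
            self.channels[(t, s)] = self._build_channel(
                forest, Xs, ys, len(np.unique(ys)))
            if s != t:
                self.channels[(s, t)] = self._build_channel(
                    self.encoders[s], X, y, len(np.unique(y)))

    def predict(self, X, t):
        # Decoder: average posteriors across all encoders' voters for task t.
        X = np.asarray(X)
        posteriors = []
        for s, forest in self.encoders.items():
            leaves = forest.apply(X)
            tables = self.channels[(s, t)]
            p = np.mean([tables[b][leaves[:, b]] for b in range(len(tables))], axis=0)
            posteriors.append(p)
        return np.mean(posteriors, axis=0).argmax(axis=1)
```

SynN follows the same skeleton: the forest encoder is replaced by a deep network backbone (u_t mapping into ℝ^d), and the leaf-vote channel is replaced by a k-NN classifier on the backbone outputs.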
Lifelong learning approaches can be divided into those with fixed computational space resources (hereafter, 'fixed resource' algorithms) and those with growing space resources. We therefore quantify the computational space and time complexity of the internal representation of a number of algorithms, using both theoretical analysis and empirical investigation. We also study the representational capacity of these algorithms. We use the soft-O notation Õ to quantify complexity [49]. Letting n be the sample size and T be the number of tasks, we write that a lifelong learning algorithm satisfies f(n, T) = Õ(g(n, T)) when |f| is bounded above asymptotically by a function g of n and T, up to a constant factor and polylogarithmic terms. Table 1 summarizes the capacity, space, and time complexity of several reference algorithms, as well as our SynN and SynF. For the deep learning methods, we assume that the number of iterations is proportional to the number of samples. For space and time complexity, the table shows results as a function of n and T, as well as the common scenario where the sample size per task is fixed and therefore n is proportional to the number of tasks, n ∝ T.

Table 1: Capacity, space, and time constraints of the representation learned by various lifelong learning algorithms. We show soft-O notation (Õ(·, ·), defined in the main text) as a function of n = Σ_t n_t and T, as well as the common setting where n is proportional to T. Our algorithms and DF-CNN are the only algorithms whose space and time both grow quasilinearly while capacity grows. (Columns: Capacity, Space, Time, Examples.)

Parametric lifelong learning methods have a representational capacity that is invariant to sample size and task number. Although the space complexity of some of these algorithms grows (because the size of the constraints grows, or because they continue to store more and more data), their capacity is fixed. Thus, given a sufficiently large number of tasks, without placing constraints on the relationship between the tasks, eventually all parametric methods will catastrophically forget at least some things. EWC, Online EWC, SI, and LwF are all examples of parametric lifelong learning algorithms.

Semi-parametric algorithms' representational capacity grows more slowly than the sample size. For example, if T increases more slowly than n (e.g., T ∝ log n), then algorithms whose capacity is proportional to T are semi-parametric. ProgNN is semi-parametric; nonetheless, its space complexity is Õ(T²) due to the lateral connections. Moreover, the time complexity for ProgNN also scales quadratically with n when n ∝ T. Thus, an algorithm that literally stores all the data it has ever seen, and retrains a fixed-size network on all those data with the arrival of each new task, would have smaller space complexity and the same time complexity as ProgNN. For comparison, we implement such an algorithm and refer to it as Total Replay. DF-CNN improves upon ProgNN by introducing a "knowledge base" with lateral connections to each new column, thereby avoiding all pairwise connections. Because these semi-parametric methods have a fixed representational capacity per task, they will either lack the representational capacity to perform well given sufficiently complex tasks, and/or will waste resources on very simple tasks. SynN eliminates the lateral connections between columns of the network, thereby reducing space complexity down to Õ(T). SynN stores all the data to enable backward transfer, but retains linear time complexity.
SynF is, to our knowledge, the only non-parametric lifelong learning algorithm. Its capacity, space, and time complexity are all Õ(n), meaning that its representational capacity naturally increases with the complexity of each task.

In this paper, we have proposed two synergistic lifelong learning algorithms, namely SynF and SynN. They are based on the same representation ensembling method illustrated in Fig. 1. However, SynN requires training DNs as encoders, which is computationally expensive to repeat over many Monte Carlo repetitions. Therefore, for the SynN experiments we ran 100 repetitions and report the results after smoothing them with a moving average of window size 5. For the SynF experiments, we ran 1,000 repetitions and report the mean of these repetitions.

Synergistic learning in a simple environment
Consider a very simple two-task environment: Gaussian XOR and Gaussian Exclusive NOR (XNOR) (Figure 2A; see Appendix E for details). The two tasks share the exact same discriminant boundaries: the coordinate axes. Thus, transferring from one task to the other merely requires learning a bit flip. We sample a total of 750 samples from XOR, followed by another 750 samples from XNOR. SynF and random forests (RF) achieve the same generalization error on XOR when training with XOR data (Figure 2Bi). But because RF does not account for a change in task, when XNOR data appear, RF performance on XOR deteriorates (it catastrophically forgets). In contrast, SynF continues to improve on XOR given XNOR data, demonstrating backward transfer. Now consider the generalization error on XNOR (Figure 2Bii). Both SynF and RF are at chance levels for XNOR when only XOR data are available. When XNOR data become available, RF must unlearn everything it learned from the XOR data, and thus its performance on XNOR starts out nearly maximally inaccurate and quickly improves. On the other hand, because SynF can leverage the encoder learned using the XOR data, upon getting any XNOR data it immediately performs quite well, and then continues to improve with further XNOR data, demonstrating forward transfer (Figure 2Biii). SynF demonstrates positive forward and backward transfer for all sample sizes, whereas RF fails to demonstrate forward or backward transfer, and eventually catastrophically forgets the previous tasks. Similar results are visible for SynN and DN in Figure 2.

Statistics has a rich history of robust learning [50, 51], and machine learning has recently focused on adversarial learning [52-55]. In those cases, however, the focus is on adversarial examples, rather than adversarial tasks. In the context of synergistic learning, we informally define a task t′ to be adversarial with respect to task t if the true joint distribution of task t′, without any domain adaptation, impedes performance on task t. In other words, training data from task t′ can only add noise, rather than signal, for task t. An adversarial task for Gaussian XOR is Gaussian XOR rotated by 45° (R-XOR) (Figure 2Aiii). Training on R-XOR therefore impedes the performance of SynF and SynN on XOR, and thus backward transfer falls below one, demonstrating graceful forgetting [56] (Figure 2Ci). Because R-XOR is more difficult than XOR for SynF (the discriminant boundaries are oblique [57]), and because the discriminant boundaries are learned imperfectly with finite data, data from XOR can actually improve performance on R-XOR, and thus forward transfer is positive.
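For concreteness, the simulation distributions (detailed in Appendix E) can be sampled as in the following sketch. The noise scale sigma is our choice here, since the paper specifies only that the variances are proportional to the identity matrix:

```python
import numpy as np

def gaussian_xor(n, sigma=0.25, rng=None):
    """Gaussian XOR (Appendix E): equal class priors; class 0 is a mixture of
    Gaussians with means +-(0.5, 0.5), class 1 with means +-(0.5, -0.5)."""
    if rng is None:
        rng = np.random.default_rng()
    y = rng.integers(0, 2, size=n)            # equal class priors
    sign = rng.choice([-1.0, 1.0], size=n)    # which mixture component
    base = np.where(y[:, None] == 0, [0.5, 0.5], [0.5, -0.5])
    X = sign[:, None] * base + sigma * rng.standard_normal((n, 2))
    return X, y

def gaussian_xnor(n, **kwargs):
    X, y = gaussian_xor(n, **kwargs)
    return X, 1 - y                           # XNOR: XOR with labels flipped

def gaussian_rxor(n, theta_deg=45.0, **kwargs):
    X, y = gaussian_xor(n, **kwargs)
    t = np.deg2rad(theta_deg)                 # rotate XOR by theta degrees
    R = np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])
    return X @ R.T, y
```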
In contrast, for RF and DN, both forward and backward transfer are negative. To further investigate this relationship, we designed a suite of R-XOR examples, generalizing R-XOR from only 45° to any rotation angle between 0° and 90°, sampling 100 points from XOR and another 100 from each R-XOR (Figure 2Cii). As the angle increases from 0° to 45°, log BLE flips from positive (≈ 0.18) to negative (≈ −0.11) for SynF. A similar trend is also visible for SynN. The 45°-XOR is the maximally adversarial R-XOR. Thus, as the angle increases further, log BLE increases back up to ≈ 0.18 at 90°, which has an identical discriminant boundary to XOR. Moreover, when θ is fixed at 25°, BLE increases at different rates for different sample sizes of the source task (Figure 2Ciii). Together, these experiments indicate that the amount of transfer can be a complicated function of (i) the difficulty of learning good representations for each task, (ii) the relationship between the two tasks, and (iii) the sample size of each. Appendix E further investigates this phenomenon in a multi-spiral environment.

We consider two modalities for real data experiments: vision and language. Below we provide a detailed analysis of the performance of lifelong learning algorithms on vision data; Appendix F provides details for our language experiments, which have qualitatively similar results, illustrating that SynF and SynN are modality-agnostic, sample-efficient, and computationally efficient lifelong learning algorithms. The CIFAR 100 challenge [58] consists of 50,000 training and 10,000 test samples, each a 32×32 RGB image of a common object from one of 100 possible classes, such as apples and bicycles. CIFAR 10x10 divides these data into 10 tasks, each with 10 classes [17] (see Appendix F for details). We compare SynF and SynN to the deep lifelong learning algorithms discussed above. Under the lifelong learning framework, a learning agent, constrained by capacity and computational time, is sequentially trained on multiple tasks. For each task, it has access to limited training samples [17, 59], and it improves on a particular task by leveraging knowledge from the other tasks. Therefore, for the following experiments, we are particularly interested in the behavior of our representation ensembling algorithms in the low training sample size regime. The experiments below use only 500 training samples per task. For the corresponding experiments using more training samples per task (5,000), see Appendix Figure 4.

We first compare SynF and SynN to the state-of-the-art resource-growing algorithms ProgNN and DF-CNN (Figure 3, top panels). Both SynF and SynN demonstrate positive forward transfer for every task (for SynF, it increases nearly monotonically), indicating they are robust to distributional shift in ways that ProgNN and DF-CNN are not. SynF and SynN uniquely demonstrate positive backward transfer; SynN's backward transfer is actually monotonically increasing, indicating that with each new task, performance on all prior tasks increases (and SynF's BLE increases nearly monotonically as well). In contrast, while neither ProgNN nor DF-CNN exhibits catastrophic forgetting, they also do not exhibit any positive backward transfer. Final transfer efficiency per task is the transfer efficiency associated with that task having seen all the data. SynF and SynN both demonstrate positive final transfer efficiency for all tasks (synergistic learning), whereas ProgNN and DF-CNN both exhibit negative final transfer efficiency for at least one task.
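The quantities reported in these experiments follow directly from Definitions 1-3 and are straightforward to estimate from per-task error rates. A minimal sketch, with array names of our own choosing:

```python
import numpy as np

def transfer_metrics(err_single, err_upto, err_all):
    """Estimate log learning efficiencies (Definitions 1-3) from error rates.
    err_single[t]: error on task t, trained only on task t's data   ~ R^t(f(S_n^t))
    err_upto[t]:   error on task t, trained on data up to task t    ~ R^t(f(S_n^{<=t}))
    err_all[t]:    error on task t, trained on the entire sequence  ~ R^t(f(S_n))
    Positive values indicate forward, backward, and overall transfer."""
    err_single, err_upto, err_all = map(np.asarray, (err_single, err_upto, err_all))
    log_fle = np.log(err_single / err_upto)   # forward learning efficiency
    log_ble = np.log(err_upto / err_all)      # backward learning efficiency
    log_le = log_fle + log_ble                # LE = FLE x BLE (factorization)
    return log_fle, log_ble, log_le

# Example: errors of 0.25 (task data only), 0.20 (after earlier tasks), and
# 0.18 (after all tasks) yield positive forward and backward transfer.
print(transfer_metrics([0.25], [0.20], [0.18]))
```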
It is possible that the above algorithms are leveraging additional resources to improve performance without meaningfully transferring information between representations. To address this concern, we devised a "resource-constrained" variant of SynF. In this constrained variant, we compare the lifelong learning algorithm to its single-task variant, but ensure that both have the same amount of resources. For example, on Task 2, we would compare SynF with 20 trees (10 trained on 500 samples from Task 1, and another 10 trained on 500 samples from Task 2) to RF with 20 trees (all trained on 500 samples from Task 2). If SynF is able to meaningfully transfer information across tasks, then its resource-constrained FLE and BLE will still be positive. Indeed, FLE remains positive after enough tasks, and BLE is actually invariant to this change (Figure 3, bottom left and center). In contrast, all of the reference algorithms that have fixed resources exhibit negative forward and backward transfer. Moreover, the reference algorithms all exhibit negative final transfer efficiency on each task, whereas our resource-constrained SynF maintains positive final transfer on every task (Figure 3, top right). Interestingly, when using 5,000 samples per task, the replay methods are able to demonstrate positive forward and backward transfer (Supplementary Figure 4), although they require quadratic time. Note that in this experiment, building the single-task learners actually requires substantially more resources, specifically 10 + 20 + ... + 100 = 550 trees, as compared with only 100 trees in the prior experiments. In general, ensuring that single-task learners use the same amount of resources per task as omnidirectional learners requires Õ(n²) resources, whereas SynF only requires Õ(n), a polynomial reduction in resources. In both cases, resource-growing or resource-constrained, both SynF and SynN show synergistic learning over all 10 tasks (Figure 3, top right panel), whereas all other algorithms except ProgNN suffer from catastrophic forgetting.

The binary distinction we made above, that algorithms either build resources or reallocate them, is a false dichotomy, and biologically unnatural. In biological learning, systems develop from building (juvenile) to constrained (adult) resources, the latter requiring the recruitment of some existing resources for new tasks. We therefore train SynF on the first nine CIFAR 10x10 tasks using 50 trees per task, with 500 samples per task. For the tenth task, we could (i) select the 50 trees (out of the 450 existing trees) that perform best on task 10 (recruiting), (ii) train 50 new trees, as SynF would normally do (building), (iii) build 25 and recruit 25 trees (hybrid), or (iv) ignore all prior trees (RF). SynF outperforms the other approaches except when 5,000 training samples are available, though the recruiting approach is nearly as good as SynF (Figure 3, bottom right). This result motivates future work to investigate strategies for optimally leveraging existing resources given a new task, as well as task-unaware settings.

Adversarial Experiments
Consider the same CIFAR 10x10 experiment above, but, for tasks two through nine, randomly permute the class labels within each task, rendering each of those tasks adversarial with regard to the first task (because the labels are uninformative).
Figure 4A indicates that BLE for both SynF and SynN is invariant to such label shuffling (the other algorithms also seem invariant to label shuffling, but did not demonstrate positive backward transfer). Now, consider a Rotated CIFAR experiment, which uses only data from the first task, divided into two equally sized subsets (making two tasks), where the second subset is rotated by different amounts (Figure 4, right). The learning efficiency of both SynF and SynN is nearly invariant to the rotation angle, whereas the other approaches are far more sensitive to it. Note that a rotation angle of zero corresponds to the two tasks having identical distributions.

We introduced quasilinear representation ensembling as an approach to synergistic lifelong learning. Two specific algorithms, SynF and SynN, achieve both forward and backward transfer, due to leveraging resources (encoders) learned for other tasks, without undue computational burdens. Forest-based representation ensembling approaches can easily add new resources when appropriate. This work therefore motivates additional work in deep learning to enable dynamically adding resources when appropriate [60]. To achieve backward transfer, SynF and SynN store old data so that it can vote on the newly learned transformers. Because the representation space scales quasilinearly with sample size, storing the data does not increase the space complexity of the algorithm, which remains quasilinear. It could be argued that by keeping old data and training a model with increasing capacity from scratch (a sequential multitask learning approach), it would be straightforward to maintain performance (TE = 1) on a particular task. However, it is not obvious how to achieve backward transfer with quasilinear time and space complexity even if we are allowed to store all the past data, because computational time would naively become quadratic. For example, both ProgNN and Total Replay have quadratic time complexity, unlike SynF and SynN. Thus, one natural extension of this work would obviate the need to store all the data by using a generative model.

While we employed quasilinear representation ensembling to address catastrophic forgetting, the paradigm of ensembling representations, rather than learners, can be readily applied more generally. For example, "batch effects" (sources of variability unrelated to the scientific question of interest) have plagued many fields of inquiry, including neuroscience [61] and genomics [62]. Similarly, federated learning is becoming increasingly central in artificial intelligence, due to its importance in differential privacy [63]. This may be particularly important in light of global pandemics such as COVID-19, where combining small datasets across hospital systems could enable more rapid discoveries [64]. Finally, our quasilinear representation ensembling approach closely resembles the constructivist view of brain development [65, 66]. According to this view, the brain goes through a progressive elaboration of neural circuits, resulting in augmented cognitive representations as it matures in a certain skill. In a similar way, representation ensembling algorithms can mature in a particular skill, such as vision, by learning a rich dictionary of encoders from different vision datasets, and can thereby transfer forward to future, as-yet-unseen vision datasets (see the CIFAR 10x10 recruitment experiment as a demonstration).
However, there is also substantial pruning of brain circuitry during development and maturity, which is important for performance [67]. This motivates future work on pruning adversarial encoders to further enhance transferability among tasks. Moreover, by carefully designing experiments in which both behavior and the brain are observed while learning across sequences of tasks (possibly in multiple stages of neural development or degeneration), we may be able to learn more about how biological agents are able to synergistically learn so efficiently, and transfer that understanding to building more effective artificial intelligences. In the meantime, our code, including code to reproduce the experiments in this manuscript, is available from http://proglearn.neurodata.io/.

Appendix A. A Decision Tree Example.
Consider a single decision tree for a two-class classification problem. The input to the decision tree is a set of n feature-vector/response pairs, (x_i, y_i). The learned tree structure corresponds to the encoder u, because the tree structure maps each input feature vector into an indicator encoding of the leaf node in which that feature vector resides. Formally, u : 𝒳 → [L], where [L] = {1, 2, …, L} and L is the total number of leaf nodes. In other words, u maps from the original data space to an L-dimensional one-hot encoded sparse binary vector, where the sole non-zero entry indicates the leaf node into which a particular observation falls; that is, x̃ = u(x). Learning the voter is simply a matter of counting the fraction of observations of each class in each leaf. So, the voter is trained using the n transformed feature-vector/response pairs (x̃_i, y_i), and it assigns a probability to each class in each leaf: {v_l := P[y_i = 1 | x̃_i = l], ∀l ∈ [L]}, and v(x̃) = v_{x̃}. In other words, for two-class classification, v maps from the L-dimensional binary vector to the probability that x is in class 1. The decider is simply w(v(x̃)) = 1{v(x̃) > 0.5}; that is, it outputs the most likely class label of the leaf node into which x falls. For inference, the tree is given a single x, which is passed down the tree until it reaches a leaf node, where it is represented by its leaf identifier x̃. The voter takes x̃ as input and outputs the estimated posterior probability of being in class 1 for the leaf node in which x̃ resides, v(x̃). If v(x̃) is bigger than 0.5, the decider decides that x is in class 1; otherwise, it decides that x is in class 0.

Appendix B. A Two-Task Example.
Consider a scenario in which we have two tasks, one following the other. Assume that we have already learned a single decomposable hypothesis for the first task, w_1 ∘ v_1 ∘ u_1, and then we get new data associated with a second task. Let n_1 denote the sample size for the first task, n_2 the sample size for the second task, and n = n_1 + n_2. The representation ensembling approach generally works as follows. First, since we want to transfer forward to the second task, we push all the new data through the first encoder u_1, which yields x̃^(1)_{n_1+1}, …, x̃^(1)_n. Second, we learn a new encoder u_2 using the new data, {(x_i, y_i)}_{i=n_1+1}^{n}. We then push the new data through the new encoder, yielding x̃^(2)_{n_1+1}, …, x̃^(2)_n. Third, we train a new channel, v_2. To do so, v_2 is trained on the outputs from both encoders, that is, {(x̃^(j)_i, y_i)}_{i=n_1+1}^{n} for j = 1, 2. The output of v_2 for any new input x is the posterior probability (or score) for that point for each potential response (class label) in task two. Thus, by virtue of ensembling these representations, this approach enables forward transfer [16, 68].
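A schematic of this forward-transfer step is below; fit_encoder and fit_channel are hypothetical stand-ins for the tree- or network-specific training routines, and only the data flow is meant to be faithful to the procedure just described:

```python
def forward_transfer_update(task1_data, task2_data, fit_encoder, fit_channel):
    """Sketch of the two-task forward-transfer procedure (Appendix B)."""
    (X1, y1), (X2, y2) = task1_data, task2_data

    u1 = fit_encoder(X1, y1)   # already learned for the first task
    u2 = fit_encoder(X2, y2)   # new, independent encoder for the second task

    # The task 2 channel is trained on the outputs of BOTH encoders, so the
    # representation learned on task 1 can inform task 2 (forward transfer).
    v2 = fit_channel([u1(X2), u2(X2)], y2)
    return u1, u2, v2
```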
Now, we would also like to improve performance on the first task using the second task's data. While many lifelong learning methods have tried to achieve this kind of backward transfer, to date they have mostly failed [15]. Recall that we previously pushed all the first-task data through the first-task encoder, which yielded x̃^(1)_1, …, x̃^(1)_{n_1}. We now also push the first-task data through the second encoder, yielding x̃^(2)_1, …, x̃^(2)_{n_1}, and update the first task's channel v_1 so that it is trained on the outputs of both encoders. This 'replay-like' procedure facilitates backward transfer, that is, improving performance on previous tasks by leveraging data from newer tasks. Both the forward and backward transfer updates can be implemented every time we obtain data associated with a new task. Enabling the channels to ensemble omnidirectionally between all sets of tasks is the key innovation of our proposed synergistic learning approaches.

Appendix C. Synergistic Algorithms.
We propose two concrete synergistic algorithms, Synergistic Forests (SynF) and Synergistic Networks (SynN). The two algorithms differ in the details of how they update representers and voters, but, abstracting a level up, they are both special cases of the same procedure. Let SynX refer to any such synergistic algorithm. Algorithms 1, 2, 3, and 4 provide pseudocode for adding representers, adding and updating voters, and making predictions for any SynX algorithm; the sections below provide SynF- and SynN-specific details.

Algorithm 1: Add a new SynX representer for a task. (OOB = out-of-bag.)
Input: (1) t, the current task number; (2) D_n^t = (x^t, y^t) ∈ ℝ^{n×p} × {1, …, K}^n, the training data for task t.
Output: (1) u_t, a representer; (2) I_OOB^t, the set of indices of OOB data.
1: function SynX.FIT(t, (x^t, y^t))
2:   train a representer u_t on bootstrapped data
3:   return u_t, I_OOB^t
4: end function

Algorithm 2: Add a new SynX voter for the current task.
Input: (1) t, the current task number; (2) the set of representers {u_1, …, u_t}; (3) D_n^t = (x^t, y^t) ∈ ℝ^{n×p} × {1, …, K}^n, the training data for task t; (4) I_OOB^t, the set of indices of OOB data for the current task.
Output: the in-task (t′ = t) and cross-task (t′ ≠ t) voters for task t.
1: function SynX.ADD_VOTER(t, {u_1, …, u_t}, (x^t, y^t), I_OOB^t)
2:   v_{tt} ← u_t.add_voter(x^t_{I_OOB}, y^t_{I_OOB})    ▷ add the in-task voter using OOB data
3:   for t′ = 1, …, t − 1 do    ▷ add the cross-task voters for task t
4:     v_{t′t} ← u_{t′}.add_voter(x^t, y^t)
5:   end for
6:   return {v_{t′t}}_{t′=1}^{t}
7: end function

Algorithm 3: Update the SynX voters for the previous tasks.
Input: (1) t, the current task number; (2) u_t, the representer for the current task; (3) D, the training data for tasks t′ = 1, …, t − 1; (4) all previous-task voters.
Output: the updated cross-task voters.
1: function SynX.UPDATE_VOTER(t, u_t, D)
2:   for t′ = 1, …, t − 1 do    ▷ update the cross-task voters
3:     v_{tt′} ← u_t.get_voter(x^{t′}, y^{t′})
4:   end for
5:   return {v_{tt′}}_{t′=1}^{t−1}
6: end function

Algorithm 4: Predict a class label using SynX.
Input: (1) x, a test sample; (2) t, the task number; (3) the voters for task t.
Output: ŷ, a predicted class label.
1: function SynX.PREDICT(x, t)
2:   T ← SynX.get_task_number()    ▷ get the total number of tasks
3:   p̂_t ← 0    ▷ p̂_t is a K-dimensional posterior vector
4:   for t′ = 1, …, T do    ▷ accumulate the posteriors from the T task voters
5:     p̂_t ← p̂_t + v_{t′t}.predict_proba(u_{t′}(x))
6:   end for
7:   p̂_t ← p̂_t / T
8:   ŷ ← argmax_i (p̂_t)_i    ▷ the index of the element of p̂_t with maximum probability
9:   return ŷ
10: end function

Appendix D. Reference Algorithm Implementation Details.
The same network architecture was used for all compared deep learning methods. Following van de Ven et al. [26], the 'base network architecture' consisted of five convolutional layers followed by two fully-connected layers, each containing 2,000 nodes, with ReLU non-linearities and a softmax output layer. The convolutional layers had 16, 32, 64, 128, and 254 channels; they used batch-norm and a ReLU non-linearity, and they had a 3×3 kernel, a padding of 1, and a stride of 2 (except the first layer, which had a stride of 1). This architecture was used with a multi-headed output layer (i.e., a different output layer for each task) for all algorithms using a fixed-size network. For ProgNN and DF-CNN, the same architecture was used for each column introduced for each new task, and in our SynN this architecture was used for the transformers u_t (see above); for DF-CNN, the column for the current task is additionally generated from the knowledge base learned on the previous tasks. Each of these algorithms has 1.4M parameters in total.

Appendix E. Simulated Results.
In each simulation, we constructed an environment with two tasks. For each, we sample 750 times from the first task, followed by 750 times from the second task. These 1,500 samples comprise the training data. We sample another 1,000 hold-out samples to evaluate the algorithms. We fit a random forest (RF) (technically, an uncertainty forest, which is an honest forest with a finite-sample correction [46]) and a SynF. We repeat this process 30 times to obtain error bars. Error bars in all cases were negligible.

E.1 Gaussian XOR
Gaussian XOR is a two-class classification problem with equal class priors. Conditioned on being in class 0, a sample is drawn from a mixture of two Gaussians with means ±(0.5, 0.5)ᵀ and variances proportional to the identity matrix. Conditioned on being in class 1, a sample is drawn from a mixture of two Gaussians with means ±(0.5, −0.5)ᵀ and variances proportional to the identity matrix. Gaussian XNOR is the same distribution as Gaussian XOR, with the class labels flipped. Rotated XOR (R-XOR) rotates XOR by θ degrees.

A description of the distributions for the two spiral tasks is as follows. Let K be the number of classes and S ∼ multinomial(1/K · 1_K, n). Conditioned on S, each feature vector is parameterized by two variables, the radius r and an angle θ. For each sample, r is sampled uniformly in [0, 1]. Conditioned on a particular class, the angles are evenly spaced on an arc whose extent is determined by t_K, where t_K controls the number of turns in the spiral. To inject noise along the spiral, we add Gaussian noise to the evenly spaced angles: θ′ = θ + N(0, σ_K²). The observed feature vector is then (r cos(θ′), r sin(θ′)). Consider an environment with a three-spiral task and a five-spiral task (Figure 1). In this environment, axis-aligned splits are inefficient, because the optimal partitions are better approximated by irregular polytopes than by the orthotopes provided by axis-aligned splits. The three-spiral data help the five-spiral performance, because the optimal partitionings for these two tasks are relatively similar to one another, as indicated by positive forward transfer. This is despite the fact that the five-spiral task requires finer partitioning than the three-spiral task. Because SynF grows relatively deep trees, it over-partitions space, thereby rendering tasks with coarser optimal decision boundaries useful for tasks with finer optimal decision boundaries. The five-spiral data also improve the three-spiral performance.

Appendix F. Real Data Extended Results.
In this experiment, we used the spoken digit dataset provided at https://github.com/Jakobovski/free-spoken-digit-dataset.
The dataset contains audio recordings from 6 different speakers, with 50 recordings of each digit per speaker (3,000 recordings in total). The experiment was set up with 6 tasks, where each task contains the recordings from only one speaker. For each recording, a spectrogram was extracted using Hanning windows of duration 16 ms with an overlap of 4 ms between adjacent windows. The spectrograms were resized down to 28 × 28. The extracted spectrograms from 8 random recordings of '5' for the 6 speakers are shown in Figure 2. For each Monte Carlo repetition of the experiment, the spectrograms extracted for each task were randomly divided into a 55% train set and a 45% test set. As shown in Figure 3, both SynF and SynN show positive transfer and synergistic learning between the spoken digit tasks, in contrast to the other methods: some show only forward transfer, others show only backward transfer, none show both, and some show neither.

Table 3 shows the image classes associated with each task number. Supplementary Figure 4 is the same as Figure 3, but with 5,000 training samples per task rather than 500. Notably, with 5,000 samples, the replay methods are able to transfer both forward and backward as well. However, note that although Total Replay outperforms both SynF and SynN with large sample sizes, it is not a bona fide lifelong learning algorithm, because it requires n² time. Moreover, the replay methods will eventually forget as more tasks are introduced, because they will run out of capacity. Figure 5 shows the same result as the label shuffling experiment from Figure 4, but with 5,000 samples per class. The results for SynN and SynF are qualitatively similar, in that they transfer backward. The replay methods are also able to transfer when using this larger number of samples, although with considerably higher computational cost. We also considered the setting where each task is defined by a random sampling of 10 out of 100 classes with replacement. This environment is designed to demonstrate the effect of tasks with shared subtasks, which is a common property of real-world lifelong learning tasks. Supplementary Figure 6 shows the transfer efficiency of SynF and SynN on Task 1. SynF still demonstrates positive forward, backward, and final transfer, unlike most of the state-of-the-art algorithms, which demonstrate forgetting. The replay methods, however, do demonstrate transfer, albeit with significantly higher computational cost.

References
Machine learning and data mining
On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities
A Theory of the Learnable
Multitask learning. Machine learning
Is learning the n-th thing any easier than learning the first?
Learning to Learn
Catastrophic interference in connectionist networks: The sequential learning problem
Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory
Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks
Continual learning through synaptic intelligence
Learning without forgetting
Progress & compress: A scalable framework for continual learning
Online meta-learning
ELLA: An Efficient Lifelong Learning Algorithm
Razvan Pascanu, and Raia Hadsell. Progressive neural networks
Learning shared knowledge for deep lifelong learning using deconvolutional networks
Continual lifelong learning with neural networks: A review
The seven tools of causal inference, with reflections on machine learning
Rebooting AI: Building Artificial Intelligence We Can Trust. Pantheon
Random forests. Machine learning
A high-throughput screening approach to discovering good forms of biologically inspired visual representation
XGBoost: A Scalable Tree Boosting System
Catastrophic forgetting, rehearsal and pseudorehearsal
Continual learning with deep generative replay
Brain-inspired replay for continual learning with artificial neural networks
Foundations of Machine Learning
Mathematical statistics: basic ideas and selected topics, volumes I-II package
Task Agnostic Continual Learning Using Online Variational Bayes
Experience replay for continual learning
Three scenarios for continual learning. CoRR, abs/1904.07734
Gradient episodic memory for continual learning
Measuring Cumulative Gain of Knowledgeable Lifelong Learners
What is gained from past learning
Elements of Information Theory
Learning phrase representations using RNN encoder-decoder for statistical machine translation
Attention is All you Need
BERT: pre-training of deep bidirectional transformers for language understanding
Boosting a Weak Learning Algorithm by Majority
Bagging predictors
Modern Machine Learning: Partition & Vote
Shape Quantization and Recognition with Randomized Trees
Classification and regression trees
Narrowing the gap: Random forests in theory and in practice
Generalized random forests
Estimating information-theoretic quantities with random forests
Adam: A method for stochastic optimization
Consistent Nonparametric Regression
Cognition and Intractability: A Guide to Classical and Parameterized Complexity Analysis
Robust learning with missing data
Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks
Mitigating unwanted biases with adversarial learning
Attacks which do not kill training make adversarial learning stronger
Adversarial learning
Memory aware synapses: Learning what (not) to forget
Sparse Projection Oblique Randomer Forests
Learning multiple layers of features from tiny images
Lifelong Machine Learning
Lifelong Learning with Dynamically Expandable Networks. International Conference on Learning Representations
Big Data Reproducibility: Applications in Brain Imaging. bioRxiv
Adjusting batch effects in microarray expression data using empirical Bayes methods
Differential Privacy: A Survey of Results
Alpha-1 adrenergic receptor antagonists for preventing acute respiratory distress syndrome and death from cytokine storm syndrome
The constructivist brain
From constructivism to neuroconstructivism: The activity-dependent structuring of the human brain
Core Concept: How synaptic pruning shapes neural wiring during development and, possibly, in disease
A baseline for few-shot image classification