Memory Efficient Continual Learning for Neural Text Classification
Beyza Ermis, Giovanni Zappella, Martin Wistuba, Cedric Archambeau (Amazon Web Services)
2022-03-09

Learning text classifiers based on pre-trained language models has become the standard practice in natural language processing applications. Unfortunately, training large neural language models, such as transformers, from scratch is very costly and requires a vast amount of training data, which might not be available in the application domain of interest. Moreover, in many real-world scenarios, classes are uncovered as more data is seen, calling for class-incremental modelling approaches. In this work we devise a method to perform text classification using pre-trained models on a sequence of classification tasks provided one at a time. We formalize the problem as a continual learning problem, where the algorithm learns new tasks without performance degradation on the previous ones and without re-training the model from scratch. We empirically demonstrate that our method requires significantly fewer model parameters than other state-of-the-art methods and that it is significantly faster at inference time. The tight control on the number of model parameters, and hence on memory, does not only improve efficiency: it makes the algorithm usable in real-world applications, where deploying a solution with constantly increasing memory consumption is simply unrealistic. While our method suffers little forgetting, it retains a predictive performance on par with state-of-the-art but less memory-efficient methods.

Text classification is a widely used technology powering a number of industrial applications, from news tagging to semantic search (Ye et al., 2020; Liu et al., 2017). Large pre-trained language models such as BERT (Devlin et al., 2018) have shown their effectiveness in various natural language processing tasks such as classification (Ke et al., 2020), Natural Language Inference (NLI) (Pfeiffer et al., 2020b), and Question Answering (Greco et al., 2019). Adapting large-scale pre-trained language models to downstream tasks via fine-tuning is the standard method for achieving state-of-the-art performance on natural language processing (NLP) benchmarks, and practitioners creating applications for tagging sentences, web pages or text documents rely on pre-trained language models and "fine-tune" them by performing additional training on a small amount of data collected specifically for the task at hand. For example, newspapers can tag news according to topics of interest such as "sport", "politics", "food", etc. by leveraging a pre-trained language model and refining it using a few thousand pre-tagged articles. Tags can also appear over time due to external events: for example, "COVID-19" was a completely unknown news category in 2019 but became very frequent over the last two years. The ability to continually extend the set of tags (or classes) used to categorize content is a major challenge in many applications. Retraining the models from scratch is often impractical, and the set of tags is often subject to discussion or is identified through successive refinements and experimentation.
Retraining a model from scratch can lead to inconsistencies in the labelling compared to the one provided by the previous model, which can have a negative impact on the customer experience. Last but not least, lowering compute consumption would reduce the environmental impact of training machine learning models. At the same time, naïvely training a classifier using only new labels will degrade the performance on the previously observed ones, a phenomenon called "catastrophic forgetting" (McCloskey & Cohen, 1989). The simplest way of avoiding forgetting is to freeze all parameters, use the pre-trained model as a feature extractor, and fine-tune only the prediction layer. However, this leads to an inferior performance compared to fine-tuning all parameters (Rosenfeld & Tsotsos, 2018). A promising approach is to use Adapter modules (Houlsby et al., 2019). Adapters have been introduced as a lightweight fine-tuning strategy that achieves performance on par with full fine-tuning on most tasks in multi-task learning. Adapters are neural modules with a small number of additional, newly introduced parameters within a large pre-trained model. The adapter parameters are learnt on a target task while keeping the pre-trained model parameters fixed; they thus learn to encode task-specific representations in intermediate layers of the pre-trained model. This enables efficient parameter sharing between tasks by training many task-specific adapters with the same pre-trained model, which can be exchanged and combined post-hoc. However, in the sequential learning setting, Adapters keep adding parameters for each task, with the consequence of significantly increasing the number of parameters used to fine-tune the language model. In this work, we tackle the problem of labeling unstructured text data in a setting where the number of tags associated with the text documents grows over time. In particular, we focus on incrementally extending classifiers based on pre-trained language models. To address the issues observed with other approaches, we propose Adaptive Distillation of Adapters (ADA), an algorithm that keeps a fixed number of adapters in memory and leverages task similarity to effectively consolidate newly created adapters with previously created ones. This allows the user to keep strict control over memory consumption, while retaining state-of-the-art performance when running the algorithm on sequences of tens of tasks. The tight memory control is important in industrial applications. Indeed, having to change the hardware to adapt to the growing memory requirements of the deployed model would be problematic, and being forced to be conservative and provision cloud instances with very large memory (to prevent crashes) would make its use impractical. In Section 2 we discuss related work; in Section 3, the problem setup; and in Section 4 we introduce our main contribution: the ADA algorithm. In Section 5, we test ADA with different similarity scores on three large public datasets to compare its performance with state-of-the-art methods. We show that ADA can achieve similar predictive performance while using up to twelve times fewer parameters for fine-tuning. We also show how reducing the number of parameters impacts the inference time, a key metric for machine learning models used in customer-facing online services. In Section 6, we provide the results of a number of additional comparisons and ablation studies demonstrating the effectiveness of ADA in practice.
Finally, in Section 7 we summarize our findings and discuss potential extensions of our work. Existing methods for continual learning (CL) can be roughly categorized as follows: (1) Replay-based methods (Lopez-Paz & Ranzato, 2017; Rolnick et al., 2018; d'Autume et al., 2019; Chaudhry et al., 2019; Wang et al., 2020) retain some training data of old tasks and use it when learning a new task to circumvent the issue of catastrophic forgetting (CF); (2) Regularization-based methods (Kirkpatrick et al., 2017; Aljundi et al., 2018; Huang et al., 2021b) add a regularization term to the loss to consolidate previous knowledge when learning a new task; (3) Gradient-based methods (Zeng et al., 2019; Aljundi et al., 2019) ensure that gradient updates occur only in the direction orthogonal to the input of old tasks and thus do not affect old tasks; recently, some studies have also used pre-trained models for class-incremental learning (Ke et al., 2021; Hu et al., 2021); (4) Parameter-isolation-based methods (Ke et al., 2020; Wortsman et al., 2020) allocate model parameters dedicated to different tasks and mask them out when learning a new task; (5) Meta-learning-based methods directly optimize the knowledge transfer among tasks (Riemer et al., 2018; Obamuyide et al., 2019) or learn robust data representations (Javed & White, 2019; Wang et al., 2020). Recently, Adapters were proposed for fine-tuning pre-trained language models. Adapters (Houlsby et al., 2019) were studied in the multi-task setting, where AdapterFusion provides state-of-the-art performance and can simply be adapted to prevent CF in CL tasks. It was shown by Houlsby et al. (2019) that for common adapter architectures the number of additional model parameters is significantly smaller than the number of parameters of the pre-trained model, e.g., 3.6% of the pre-trained model parameters. At the same time, since approaches like AdapterFusion require storing the adapter parameters of every task, for 30 tasks we would have to add more parameters than the pre-trained model itself contains. The number of parameters added per task can reach 25 M for some models with larger adapter sizes. We give the details of the architectures we use and the number of model parameters for each case in Section 5. In CL applications, we observe the tasks sequentially and may have to process a large number of tasks. This linear increase in the model size has immediate practical consequences on both the storage requirements and the inference time, which makes solutions such as AdapterFusion impractical for our setting. Recent work studied catastrophic forgetting (Sun et al., 2019; Chuang et al., 2020; Ke et al., 2020; Liu et al., 2020) and incremental learning (Xia et al., 2021) for NLP. Pasunuru et al. (2021) focus on the few-shot setting where only a few data points are available for each task. Ke et al. (2021) proposed an architecture to achieve both CF prevention and knowledge transfer. This method has some similarity to AdapterBERT (Houlsby et al., 2019), since it inserts a CL plug-in module in two locations in BERT, but it learns all tasks using only one pair of CL-plugin modules inserted into BERT. A CL-plugin is a capsule network (Sabour et al., 2017) that uses one separate capsule (Hinton et al., 2011) for each task and a proposed transfer routing algorithm to identify and transfer knowledge across tasks to achieve improved accuracy. It further learns and uses task masks to protect task-specific knowledge and avoid forgetting.
Each capsule is a 2-layer fully connected network and, like Adapters, capsules are added for each task, so memory increases linearly over time. In addition, this algorithm requires learning task masks to handle knowledge transfer, which is costly to compute. We introduce a method with predictive performance on par with the state-of-the-art solutions for fine-tuning large language models, which at the same time provides strict control over memory consumption. In particular, it should avoid a significant increase in model parameters for each task. We would like the model to scale to tens or hundreds of tasks, while keeping the number of additional parameters required well below that of the pre-trained language model. This is simply not possible with methods that add a new module for every task (e.g., a new Adapter for AdapterFusion). In this section we formalize our goal of CL on a sequence of text classification tasks {T_1, . . . , T_N}, where each task T_i contains a different set of text-label training pairs (x^i_{1:t}, y^i_{1:t}). Each task T_i may contain c new classes, denoted Y^{1:c}_i. In the case of the tagging application described in Section 1, each task represents a tag and the learner creates a new binary classifier for each tag. The goal of the learner is to learn a set of parameters Θ̂ such that (1/N) Σ_{i∈{1,...,N}} loss(T_i; Θ̂) is minimized. In our specific case, Θ̂ is composed of a set of parameters Θ provided by a pre-trained model and, depending on the algorithm, some additional parameters which need to be learned for each specific task. In the simplest case this additional set of model parameters can just be a head model, but some algorithms use significantly more elaborate functions. For the training of task T_i, the system can only access the newly added examples and label names of this task. To evaluate the system, the test data consists of examples across all the previous tasks, where the potential label space for a test example is Y^{1:c}_1 ∪ Y^{1:c}_2 ∪ · · · ∪ Y^{1:c}_N. All methods that we define in the following sections receive as input a pre-trained language model f_Θ(·), e.g., BERT (Devlin et al., 2018), parameterized by Θ. The pre-trained model receives as input raw text, denoted x_i, and computes a representation for it. To address the issues we mentioned in the previous section, we propose Adaptive Distillation of Adapters (ADA). ADA keeps a fixed number of adapters in memory and takes the transferability of representations into account to effectively consolidate newly created adapters with previously created ones, using unlabeled training data for the consolidation. We perform ADA in two steps: the first step is to train an adapter model and a new classifier using the new task T_i's training dataset D_i, which we refer to as the new model; the second step is to consolidate the old model(s), i.e., the model(s) obtained in previous rounds, with the new model. For ADA, we fix a budget K for the number of adapters that we can keep, for instance due to memory constraints, and the algorithm selects which adapter from the pool of K adapters to consolidate by using scores computed via transferability estimation. In the following sections we explain the building blocks of ADA: Adapters, Distillation of Adapters, and Transferability Estimation.
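To make the evaluation protocol above concrete, the following is a minimal Python sketch (not the authors' code) of the class-incremental loop: tasks arrive one at a time, each adding a binary classifier, and after every task we report the test accuracy averaged over all tasks seen so far. The train_task and predict callables are placeholders for whichever learner is being benchmarked.

```python
from typing import Callable, List, Tuple

# Each task is a (train_set, test_set) pair of (text, binary label) examples.
Example = Tuple[str, int]
Task = Tuple[List[Example], List[Example]]

def continual_evaluation(
    tasks: List[Task],
    train_task: Callable[[int, List[Example]], None],  # fits the head/adapter of task i
    predict: Callable[[int, str], int],                 # predicts with the head of task i
) -> List[float]:
    """Class-incremental protocol: after training on task i, report the accuracy
    averaged over the test sets of all tasks seen so far."""
    history = []
    for i, (train_set, _) in enumerate(tasks):
        train_task(i, train_set)                        # only task i's data is visible
        accuracies = []
        for j in range(i + 1):                          # evaluate every task seen so far
            _, test_set = tasks[j]
            correct = sum(predict(j, x) == y for x, y in test_set)
            accuracies.append(correct / max(len(test_set), 1))
        history.append(sum(accuracies) / len(accuracies))
    return history
```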
As we mentioned in Section 2, Adapters were proposed by Houlsby et al. (2019) as an alternative to fine-tuning. Adapters share a large set of parameters Θ across all tasks and introduce a small number of task-specific parameters Φ_i. Current work on adapters focuses on training adapters for each task separately. For each of the N tasks, the model is initialized with the parameters Θ of a pre-trained model. In addition, a set of new, randomly initialized adapter parameters Φ_i is introduced for each task i ∈ {1, . . . , N}. The parameters Θ are fixed and only the parameters Φ_i are trained. This makes it possible to train adapters for all N tasks and store the corresponding knowledge in designated parts of the model. The objective for each task i ∈ {1, . . . , N} is of the form Φ_i ← argmin_Φ L_i(D_i; Θ, Φ). AdapterFusion has been proposed to mitigate the lack of knowledge sharing across tasks. It works in two phases: i) in the knowledge extraction stage, adapters, which encapsulate the task-specific information, are learnt for each of the N tasks; ii) in the knowledge composition stage, the set of N adapters is combined by using additional parameters Ψ. The additional parameters Ψ_i for task i are defined as

Ψ_i ← argmin_Ψ L_i(D_i; Θ, Φ_1, . . . , Φ_N, Ψ). (1)

While this provides good predictive performance, in the CL setting new tasks are added sequentially and storing a large set of adapters Φ_1, . . . , Φ_N is practically infeasible. For each new task T_i, the adapter parameters Φ_i are added to the model, while the pre-trained model parameters Θ are, as always, kept frozen and never updated. Only the task-specific adapter parameters Φ_i and the head model parameters h_i are trained for the current task. The head model parameters are fixed after training the new model and they are not updated during model consolidation. When a prediction for T_i is required, the corresponding head model h_i is called. For the consolidation step, we use both the new model, trained on T_i, and the old model, trained on the previous tasks. We defer the explanation of the mechanism selecting the old model to the next section. The distillation of the two models has the following objective, where i denotes the index of the considered task: we want the output of the consolidated model to approximate the combination of the outputs of the old model and the new model. To achieve this, the outputs of the old model and the new model are employed as supervisory signals in the joint training of the consolidated model Φ_c. To this purpose, we use the double distillation loss proposed by Zhang et al. (2020) to train a new adapter that is used with the pre-trained model to classify both old and newly learned tasks. Alternative solutions for distillation are discussed in Appendix A.1, but this solution was the best performing one in our experiments. The distillation process proceeds as follows: we freeze f_old and f_new, run a feed-forward pass for each training sample, and collect the logits of the two models, ŷ_old = [ŷ^1, . . . , ŷ^{n−1}] and ŷ_new = ŷ^n respectively, where the superscript is the class label associated with the output neuron of the model. Then we minimize the difference between the logits produced by the consolidated model and the combination of the logits generated by the two existing specialist models based on an L2 loss,

ℓ_dd(x) = (1/n) Σ_{j=1}^{n} (y^j − ŷ^j)²,

where y^j are the logits produced by the consolidated model for task T_i and ŷ is the concatenation of ŷ_old and ŷ_new. The training objective for consolidation is given by

Φ_c ← argmin_{Φ_c} (1/|U|) Σ_{x∈U} ℓ_dd(x),

where U denotes the unlabeled training data used for distillation.
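The consolidation objective above can be sketched in a few lines of PyTorch. This is an illustrative sketch rather than the authors' implementation: the two specialist models are frozen, their logits are concatenated to form the regression target, and the consolidated adapter is trained with an L2 loss on unlabeled covariates.

```python
import torch
import torch.nn.functional as F

def double_distillation_loss(consolidated_logits: torch.Tensor,
                             old_logits: torch.Tensor,
                             new_logits: torch.Tensor) -> torch.Tensor:
    """L2 loss between the consolidated model's logits and the concatenated
    logits of the two frozen specialist models (old tasks + new task)."""
    target = torch.cat([old_logits, new_logits], dim=-1).detach()
    return F.mse_loss(consolidated_logits, target)

def consolidate(f_old, f_new, f_consolidated, unlabeled_loader, optimizer, epochs=1):
    """Train the consolidated adapter on unlabeled covariates U; f_old and f_new
    only provide supervisory logits and are never updated."""
    f_old.eval()
    f_new.eval()
    for _ in range(epochs):
        for x in unlabeled_loader:                      # covariates only, no labels needed
            with torch.no_grad():
                y_old, y_new = f_old(x), f_new(x)
            loss = double_distillation_loss(f_consolidated(x), y_old, y_new)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```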
While several different data sources can be used to populate U, such as auxiliary external data (Zhang et al., 2020) or generated synthetic data (Chawla et al., 2021), in this work we populate the buffer using covariates from previous tasks selected with reservoir sampling (Vitter, 1985). After the consolidation, the adapter parameters Φ_c are used for f_old in the next round. In the previous section we assumed f_old was given, but ADA keeps a pool of adapters and the selection of the adapter to be distilled is an important part of the algorithm. To this end, we designed a selection mechanism where the adapters trained for similar tasks get distilled together in order to minimize forgetting. The intuition behind this choice is that highly similar tasks will interfere less with each other and so will cause significantly less forgetting. The algorithm is independent of the choice of the transferability score. In this work we leverage two common methods for transferability estimation: (1) Log Expected Empirical Prediction (LEEP) (Nguyen et al., 2020) and (2) TransRate (Huang et al., 2021a). LEEP is a measure (or score) that can tell us, without training on the target data set, how effectively transfer learning algorithms can transfer the knowledge learned in the source model Θ_s to the target task, using the target data set D. LEEP is a three-step method. In Step 1, it computes the dummy label distributions Θ_s(x) of the inputs in the target data set D. In Step 2, it computes the empirical conditional distribution P̂(y|z) of the target label y given the source label z. In Step 3, it computes LEEP using Θ_s(x) and P̂(y|z):

LEEP(Θ_s, D) = (1/n) Σ_{i=1}^{n} log( Σ_z P̂(y_i|z) Θ_s(x_i)_z ),

where z is a dummy label randomly drawn from Θ_s(x) and y is randomly drawn from P̂(y|z). TransRate measures the transferability as the mutual information between the features of the target examples extracted by a pre-trained model and their labels, with a single pass through the target data. The knowledge transfer from a source task T_s to a target task T_t is measured as

TrR_{T_s→T_t}(Z, Y) = H(Z) − H(Z|Y),

where Y are the labels of the target examples and Z = g(X) are their features extracted by the pre-trained feature extractor g(·). TransRate achieves its minimal value when the data covariance matrices of all classes are the same. In this case, it is impossible to separate the data from different classes and no classifier can perform better than random guessing. The ADA algorithm is detailed in Algorithm 1. For every new task, the algorithm trains a new adapter and head model (called Φ_n and h_n). If the adapter pool has not reached its maximum size yet (controlled by K), it simply adds the new adapter to the pool. If the pool has reached its maximum size, the algorithm is forced to select one of the adapters already in the pool and distill it together with the newly trained one. In order to select which adapter to distill, it leverages the transferability scores (e.g., LEEP or TransRate). Once the adapter in the pool with the highest transferability score (called f^{j*}_old) is identified, it consolidates that adapter and the newly trained one into a new adapter and replaces the old one in the pool. In order to be able to make effective predictions, the algorithm also keeps a mapping (in the map m) of which adapter in the pool must be used in combination with each of the task-specific heads.
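The following Python-style sketch mirrors the steps of Algorithm 1 as just described. It is illustrative rather than the released implementation; train_adapter, transferability_score (e.g., LEEP or TransRate) and distill stand in for the corresponding routines.

```python
def ada_step(task_id, dataset, pool, head_map, K, pretrained, buffer,
             train_adapter, transferability_score, distill):
    """One ADA iteration for a new task under a fixed adapter budget K.

    pool:     list of adapters currently kept in memory
    head_map: task id -> (index of the adapter serving it, task-specific head)
    buffer:   unlabeled covariates from previous tasks (e.g. a reservoir sample)
    """
    # Step 1: train a new adapter and head on the new task's data only.
    phi_new, head_new = train_adapter(pretrained, dataset)

    if len(pool) < K:
        # Budget not exhausted yet: simply add the new adapter to the pool.
        pool.append(phi_new)
        head_map[task_id] = (len(pool) - 1, head_new)
        return

    # Budget exhausted: pick the pool adapter with the highest transferability
    # score with respect to the new task's data ...
    scores = [transferability_score(phi, dataset) for phi in pool]
    j_star = max(range(K), key=scores.__getitem__)

    # ... and consolidate it with the new adapter by distillation on the buffer.
    pool[j_star] = distill(pretrained, old=pool[j_star], new=phi_new, unlabeled=buffer)

    # The consolidated adapter now serves the new task's head, as well as all
    # heads that were already mapped to pool[j_star].
    head_map[task_id] = (j_star, head_new)
```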
In this section, we empirically validate our adapter distillation approach and show that ADA achieves performance similar to AdapterFusion, a state-of-the-art solution, while consuming significantly less memory. Datasets. We use three text datasets for multi-label classification: (1) Arxiv Papers (Yang et al., 2018) (paper classification), (2) Reuters (RCV1-V2) (Lewis et al., 2004) (news classification) and (3) Wiki-30K (Zubiaga, 2012) (Wikipedia article classification). The arXiv Papers dataset contains the abstracts and the corresponding subjects of 55,840 papers in the computer science field from Arxiv.org. There are 54 subjects in total and each paper can cover multiple subjects. In our work, each of these subjects represents a different task for the classifier, where the target is to predict the corresponding subjects of an academic paper according to the content of its abstract. Reuters consists of over 800,000 manually categorized newswire stories made available by Reuters Ltd for research purposes. Multiple topics can be assigned to each newswire story and there are 103 topics in total. For Wiki-30K, a set of tags for the English Wikipedia was gathered. Starting with a set of more than 2 million articles from the English Wikipedia in April 2009, the tag information for each of these articles was retrieved from the social bookmarking site Delicious. Only the articles annotated by at least 10 users on Delicious were preserved. As a result, a dataset with 20,764 tagged Wikipedia articles was generated. There are 29,947 labels in this dataset. Experimental Setup. The setup works as follows: we first sample a sequence of labels from the label space of the multi-label classification problem. Then, we create a balanced binary classification task for each label by sampling the same amount of positive data points from the label considered and negative data points from the labels preceding the current one in the sequence. After splitting the data into training and test sets, we provide the algorithm with the training set and subsequently measure its performance on the test set. The algorithm never observes any data point in the test set and, more generally, every data point in the dataset is used only once. For the Arxiv Papers and Reuters datasets we created 20 tasks, and for the Wiki-30K dataset we created 60 tasks; we fixed the number of training samples per task to 100. The test set is made of 40 data points on Reuters and of 100 data points on AAPD and Wikipedia. Since ADA requires some data points for which to generate soft labels at distillation time, in order to keep the comparison fair, we avoid using external data sources and keep a buffer of the covariates observed in the previous tasks. To manage the distillation memory, we use reservoir sampling (Vitter, 1985), fixing the maximum memory size m to 500 for Reuters and AAPD and to 1000 for Wikipedia, which has a larger number of tasks. All the results in this section are averages of 5 runs. We use pre-trained models from HuggingFace Transformers (Wolf et al., 2020) as our base feature extractors. We ran experiments with BERT, RoBERTa and DistilBERT. The results with the first are displayed in this section, while the rest are reported in Appendix A.4. We use Adam as the optimizer with a batch size of 8. For the learning rate, we select the best value from {0.00005, 0.0001, 0.0005, 0.001} after observing the results on the first five tasks.
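The distillation buffer mentioned above can be maintained with standard reservoir sampling (Vitter, 1985). A generic sketch is shown below (not the authors' code); capacity plays the role of the memory size m.

```python
import random

class ReservoirBuffer:
    """Keep a uniform random sample of at most `capacity` items from a stream."""

    def __init__(self, capacity: int, seed: int = 0):
        self.capacity = capacity
        self.items = []
        self.n_seen = 0
        self.rng = random.Random(seed)

    def add(self, item) -> None:
        self.n_seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            # Replace an existing item with probability capacity / n_seen.
            j = self.rng.randrange(self.n_seen)
            if j < self.capacity:
                self.items[j] = item

# Usage: buffer = ReservoirBuffer(capacity=500); call buffer.add(x) for every covariate seen.
```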
For the adapter implementation, we use the Adapter-Hub framework (Pfeiffer et al., 2020a).
Figure 1. Comparison between baselines and ADA on arXiv, Reuters and Wikipedia. On the x-axis we report the number of tasks processed, on the y-axis we report the average accuracy measured on the test sets of the tasks processed; the shaded area shows the standard deviation.
Baselines. We compare ADA with the following baselines. 1) Fine-tuning the head model (B1): we freeze the pre-trained representation and only fine-tune the output layer of each classification task. The output layer is a multiple-head binary classifier that we also use for the other methods. 2) Fine-tuning the full model (B2): we fine-tune both the pre-trained representation and the output layer for each classification task. 3) Adapters (Houlsby et al., 2019): we train and keep separate adapters for each classification task as well as the head models. 4) AdapterFusion: a two-stage learning algorithm that leverages knowledge from multiple tasks by combining the representations of several task adapters in order to improve the performance on the target task. This follows exactly the solution depicted in Section 4.1. 5) Experience Replay (ER) (Rolnick et al., 2018): ER is a commonly used baseline in continual learning that stores a subset of data for each task and then "replays" the old data together with the new one to avoid forgetting old concepts. d'Autume et al. (2019) propose to use such a memory module for sparse experience replay and local adaptation in the language domain. This method stores all training examples in order to achieve optimal performance. To make this method comparable with the adapter-based methods, we freeze the pre-trained representation, add a single set of adapter parameters Φ and train the adapter by replaying examples from old tasks while training on data from the new task. In order to keep the baselines comparable, we assign to ER the same amount of memory m used for the distillation buffer in ADA. In addition to these baselines, we use the special case of ADA with K=1 as a baseline to demonstrate the advantage of effective consolidation of adapters.
Adapter Architectures. In our work, we use BERT_base, DistilBERT_base and RoBERTa_base as our base models. We analyze the cases based on all these models; due to space constraints, we present BERT_base in this section and the rest in Appendix A.2. The BERT_base model uses 12 transformer layers with a hidden size of 768 and 12 self-attention heads, and has around 110 M (440 MB) trainable parameters. An adapter has a simple bottleneck architecture; the bottleneck contains fewer parameters than the attention and feed-forward layers. The adapter size is the hyper-parameter that is tuned; it can be set to {12, 24, 48, 96, 192, 384} for the BERT_base model. For all the methods, we use the same configuration for the adapters, setting the size to 48. With this setting, an adapter contains 1.8 M parameters. We also train a head model for each task, which has 768 parameters for BERT_base (the last hidden size of BERT_base × the output size, which equals 1 for binary classification). The table in Appendix A.2 reports the number of parameters used for the baselines and ADA in our experiments.
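As a rough sanity check on the 1.8 M figure quoted above, the parameter count of a size-48 bottleneck adapter in BERT_base can be reproduced with simple arithmetic, assuming (as in Houlsby et al., 2019) two adapter blocks per transformer layer and ignoring the small layer-norm terms:

```python
hidden = 768            # BERT_base hidden size
bottleneck = 48         # adapter size used in our experiments
layers = 12             # transformer layers in BERT_base
blocks_per_layer = 2    # one adapter after attention, one after the feed-forward block

# Down-projection and up-projection, each with a bias term.
per_block = (hidden * bottleneck + bottleneck) + (bottleneck * hidden + hidden)
total = per_block * blocks_per_layer * layers
print(per_block, total)  # 74544 per block, 1789056 in total (~1.8 M)
```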
Predictive performance. Figure 1 shows the comparison of ADA and the baseline methods. It can be clearly seen that freezing all pre-trained model parameters and fine-tuning only the head models (B1) leads to an inferior performance compared to the adapter-based approaches. The main reason is that the head models have a small number of parameters to train and fine-tuning only the heads suffers from underfitting. B2 performs well only for the first 2-3 tasks: since we keep training the complete model, it forgets the previously learned tasks very quickly. As mentioned above, Adapters and AdapterFusion add 1.8 M parameters for each task and train these parameters with the new task data; the parameters are fixed after training, so they perform well on both new and previous tasks. The results on each dataset confirm this. Both ER and ADA K=1 perform close to Adapters for almost half of the tasks. The similar behavior of ER and ADA K=1 demonstrates that the distillation with soft labels works well and is almost as good as training with the true labels. Later, the performance declines for both methods, because the capacity of the adapter is exceeded. The ADA LEEP and ADA TransRate results with K=4 adapters show that selective consolidation of adapters significantly improves the performance. Their performance is on par with AdapterFusion while the number of model parameters is significantly lower. Memory consumption. Figure 2 shows the number of parameters used by each method and their predictive performance. These results make clear that ADA is significantly more efficient in terms of memory usage. It can achieve predictive performance similar to that of Adapters and AdapterFusion while requiring significantly fewer model parameters. On Reuters and arXiv, it stores the parameters of only 5 adapters (K=4 adapters in the pool, and one adapter for the new task), against the 20 required by AdapterFusion. Inference time. Inference time is often overlooked when leveraging ensemble methods, but it is one of the main concerns for many online services. When machine learning models are used to power customer-facing web sites, they are often required to provide predictions in a few milliseconds to keep the overall latency within requirements. Moreover, in this kind of application the model will be trained once and make billions of predictions, so a reasonable increase in the training time is irrelevant compared to a decrease in the inference time. In Figure 3 we report the average time per prediction made during our experiments. We observe a significant speedup at inference time compared to state-of-the-art models such as AdapterFusion. For example, on Reuters, ADA is 5 times faster than AdapterFusion for both K=1 and K=4 (because it always uses one distilled adapter of fixed size for inference). The inference time of AdapterFusion instead depends on the number of adapters fused. Our method provides sufficiently fast inference for most applications and still offers opportunities to speed it up further, for example by employing smaller pre-trained transformers (e.g. DistilBERT, see Appendix A.2). Comparison with larger distilled models. In Section 5 we compared ADA with the special case of ADA with K=1 to evaluate the improvement provided by our approach over a distillation-only solution. We would like to provide additional evidence of the superior performance of ADA by comparing it with a distilled adapter using more parameters. Specifically, we run an experiment where we compare ADA with K=4 and ADA with K=1 as displayed before, but in this case the "size" of the adapter, which is 48 for Size×1, is multiplied by 4 for the Size×4 adapter, to obtain a comparison where the different methods use the same number of model parameters. Since K=1 is a special case where a single adapter is kept in the pool, the transferability metric is irrelevant and we can see ADA with K=1 as a method purely based on distillation, like DMC (Zhang et al., 2020).
The results reported in Figure 5 show that ADA can make better use of the model parameters than a distillation-only method and that the intelligent selection of which adapters to distill together once again makes a big difference. It is also interesting to observe that the usage of additional model parameters brings a clear advantage, but the mixed comparison between ADA K=4 with random adapter selection and ADA K=1 with four-times-larger adapters leaves some questions open regarding how far distillation can go in this setting. Another finding is that TransRate outperforms LEEP in all cases. It is also demonstrated in the original paper (Huang et al., 2021a) that TransRate has a strong correlation with transfer learning performance and outperforms LEEP and the other metrics employed. Impact of the adapter pool size. In our experiments we used a fixed adapter pool size, but nothing prevents us from adding adapters to ADA's pool along the way. This may actually be the preferred usage in some applications. We already know that having an adapter per task provides good performance and that leveraging multiple of them at the same time, as in AdapterFusion, provides a benefit, but we would like to verify the sensitivity to this parameter. The results reported in Figure 5 show a rapidly decreasing added value as the number of adapters grows, a behavior which aligns well with our practical requirement of keeping the number of model parameters under control as the number of tasks grows. Additional experiments on this aspect are available in Appendix A.5. In this paper we presented Adaptive Distillation of Adapters (ADA). It allows neural text classifiers to incrementally learn new classes based on pre-trained models such as BERT, while maintaining strict control of the memory usage and retaining state-of-the-art predictive performance. The ability to effectively control the memory consumption, via the instantiation of a predefined number of model parameters, makes the algorithm suitable for practical usage. Besides exhibiting good predictive performance on new tasks, it prevents catastrophic forgetting of previously learnt tasks. Hence, ADA can adapt in a continual manner after deployment. We evaluated ADA on three large text datasets for multi-label text classification, where each new task corresponds to an unseen binary classifier to be learned. The predictive performance is competitive with state-of-the-art methods, such as AdapterFusion, with fewer parameters. Moreover, ADA displayed lower latency at inference time and improved data efficiency in some specific settings (see Appendix A.3). The proposed approach is not limited to text classification. It is applicable to other NLP problems, such as learning multilingual language models, building question answering models in different domains, or sentiment classification. Our approach could also be useful for computer vision applications, where Transformers (Khan et al., 2021) and Adapters (Rebuffi et al., 2017) are becoming more popular. A.1. Related work on distillation and transferability. Knowledge distillation refers to the process of transferring the knowledge from a large, bulky model or a set of models to a single smaller model that can be practically deployed under real-world constraints. Essentially, it is a form of model compression that was first proposed by Bucilua et al. (2006) and used by Hinton et al.
(2015) to preserve the output of a complex ensemble of networks when adopting a simpler network for more efficient deployment. The idea has been adopted in the CL and incremental learning domain to maintain the responses of the network unchanged on the old tasks whilst updating it with new training samples in different ways (Shmelkov et al., 2017; Castro et al., 2018; Li & Hoiem, 2017; Zhou et al., 2019). Shmelkov et al. (2017) propose an end-to-end learning framework where the representation and the classifier are learned jointly without storing any of the original training samples. Li & Hoiem (2017) distill previous knowledge directly from the last trained model. Zhou et al. (2019) propose to use the current model to distill knowledge from all previous model snapshots, of which a pruned version is saved. Schwarz et al. (2018) use distillation to consolidate the network after each task has been learned, and Buzzega et al. (2020) leverage knowledge distillation for retaining past experience. Zhang et al. (2020) proposed the idea of consolidating two individual image classification models, trained on image data of two distinct sets of classes (old classes and new classes), into one single model that can classify all classes. In our work, we adopt the idea of model consolidation and use it for incremental text classification. In our setting, we leverage the pre-trained model, keep it fixed, and only use adapters to transfer knowledge from old tasks to new tasks, training one adapter that can perform well on all classification tasks. Our main goal is to exploit knowledge transfer between tasks through distillation, so we also use transferability estimation methods to select the adapters that need to be distilled. By enhancing the power of distillation, we achieve the same performance as state-of-the-art methods while keeping the number of model parameters much smaller. Automatically selecting intermediate tasks that yield transfer gains is critical when considering the increasing availability of tasks and models. There are a number of works that explore task transferability in NLP (Phang et al., 2018; Liu et al., 2019; Vu et al., 2020; Puigcerver et al., 2020; Pruksachatkun et al., 2020). Poth et al. (2021) present a large-scale study on adapter-based sequential fine-tuning. Given multiple source and target task pairs (s, t), they first train an adapter on s, then fine-tune the trained adapter on t and show the relative transfer gains across the different combinations. They use different methods for intermediate task selection; LEEP (Nguyen et al., 2020) is one of the methods they use to measure transferability and it is well established in the NLP domain. TransRate (Huang et al., 2021a) is a very recent method and was applied to image classification tasks in the original work. To the best of our knowledge, we are the first to use TransRate in the NLP domain. Our work is quite different from what has been proposed in the literature. We focus on selecting the best representation from a pool of representations (trained adapters) for model consolidation, without the need for a computationally expensive additional approach. We use proxy estimators, LEEP and TransRate, that evaluate the transferability of pre-trained models towards a target task without explicit training on all potential candidates.
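For concreteness, below is a minimal NumPy sketch of the LEEP score (Nguyen et al., 2020) as summarized in Section 4.3. The source_probs array holds the dummy label distributions produced by the source model on the target data; function and variable names are ours, not those of any released implementation.

```python
import numpy as np

def leep_score(source_probs: np.ndarray, target_labels: np.ndarray) -> float:
    """LEEP transferability score.

    source_probs:  (n, Z) dummy label distributions of the source model on the target data
    target_labels: (n,) integer target labels in {0, ..., Y-1}
    """
    n, _ = source_probs.shape
    num_y = int(target_labels.max()) + 1

    # Empirical joint P(y, z) and conditional P(y | z).
    joint = np.zeros((num_y, source_probs.shape[1]))
    for theta, y in zip(source_probs, target_labels):
        joint[y] += theta
    joint /= n
    p_y_given_z = joint / (joint.sum(axis=0, keepdims=True) + 1e-12)

    # Average log-likelihood of the target labels under the EEP predictor.
    eep = source_probs @ p_y_given_z.T                  # shape (n, Y)
    return float(np.mean(np.log(eep[np.arange(n), target_labels] + 1e-12)))
```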
The tables below report the number of parameters used for the baselines and ADA in our experiments. We report all the cases for the different models: BERT_base, RoBERTa_base and DistilBERT_base. We do not include the head size in the tables, since it is very small and the same for all the methods. We would like to verify whether the intelligent distillation mechanism we designed for ADA is not only able to avoid forgetting and save memory, but also to increase data efficiency. Distilling together similar tasks for which only a small number of data points is available could also provide a better representation of the data points. To verify this hypothesis, we repeated our experiments with a variable number of data points in the training set of each task. The number of positive and negative samples is balanced in both the training and test tasks. The training sets of the Reuters tasks contain t = {20, 50, 80} samples per class (positive and negative) and the test sets contain 20 samples per class. arXiv has more samples than the Reuters dataset, so we added larger training tasks of size 400 to the configuration and increased the test task size. For arXiv, we created the training sets with t = {20, 50, 100, 200} samples per positive and negative class and the test sets with 50 samples per class. Our expectation is that increasing the training set size will improve the overall predictive performance, but we also expect the predictive performance of ADA to match (or narrow the gap with) that of independent adapters when using a smaller training set. In Figure 6 we report the results of our experiment. We observe TransRate performing generally better than LEEP, as in the previous experiments. Focusing on TransRate, we can see that ADA K=4 with TransRate can actually outperform independent adapters when the training set size is around 100 data points, and can even match the performance of independent adapters that use significantly more labels (200 labels on arXiv and 160 on Reuters). The effect becomes smaller or vanishes when the training set gets larger, but this could still bring an important advantage in the "few-shot" setting. We repeated all the experiments presented in Section 5 with DistilBERT_base and RoBERTa_base as base models in order to show that our findings are not limited to one specific model. The results show the same trends as the BERT_base experiments.
Figure 7. Comparison of baselines and distillation methods on arXiv and Reuters with RoBERTa_base. On the x-axis we report the number of tasks processed, on the y-axis we report the average accuracy measured on the test set of the tasks processed; the shaded area shows the standard deviation.
Figure 7 compares the ADA algorithms with the baselines. The findings we reported for predictive performance apply directly to the RoBERTa_base results; RoBERTa_base performs slightly better than BERT_base across all methods. The behavior of the algorithms is the same with DistilBERT_base and is very similar to the results with BERT_base; however, the number of parameters used is different. Figure fig:compMemoryDB shows the number of parameters used by each method and their predictive performance with the DistilBERT_base model. These results show that, with a small number of labels and a model with far fewer parameters, we can still obtain good prediction accuracy on old and new tasks in the CL setting. This section presents additional results with different adapter pool sizes on the arXiv dataset.
As in Figure 5, the results show a rapidly decreasing added value as the number of adapters grows, a behavior which aligns well with our practical requirement of keeping the number of model parameters under control as the number of tasks grows.
Figure 10. Impact of adapter pool size for LEEP and TransRate when K = {1, 2, 4, 8} on arXiv for t = 50.
References
Selfless sequential learning
Gradient based sample selection for online continual learning
Model compression
Dark experience for general continual learning: a strong, simple baseline
End-to-end incremental learning
Continual learning with tiny episodic memories
Data-free knowledge distillation for object detection
Lifelong language knowledge distillation
Episodic memory in lifelong language learning
Pre-training of deep bidirectional transformers for language understanding
Psycholinguistics meets continual learning: Measuring catastrophic forgetting in visual question answering
Distilling the knowledge in a neural network
Transforming auto-encoders
Parameter-efficient transfer learning for NLP
Continual learning by using information of each class holistically
Frustratingly easy transferability estimation
Continual learning for text classification with information disentanglement based regularization
Meta-learning representations for continual learning
Continual learning with knowledge transfer for sentiment classification
Adapting BERT for continual learning of a sequence of aspect sentiment classification tasks
Transformers in vision: A survey
Overcoming catastrophic forgetting in neural networks
RCV1: A new benchmark collection for text categorization research
Learning without forgetting
Deep learning for extreme multi-label text classification
Linguistic knowledge and transferability of contextual representations
Exploring fine-tuning techniques for pre-trained cross-lingual models via continual learning
Advances in neural information processing systems
Catastrophic interference in connectionist networks: The sequential learning problem
LEEP: A new measure to evaluate transferability of learned representations
Meta-learning improves lifelong relation extraction
Continual few-shot learning for text classification
A framework for adapting transformers
MAD-X: An adapter-based framework for multi-task cross-lingual transfer
AdapterFusion: Non-destructive task composition for transfer learning
Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks
What to pre-train on? Efficient intermediate task selection
Intermediate-task transfer learning with pretrained models for natural language understanding: When and why does it work?
Scalable transfer learning with expert models
Learning multiple visual domains with residual adapters
Learning to learn without forgetting by maximizing transfer and minimizing interference
Experience replay for continual learning
Incremental learning through deep adaptation
Dynamic routing between capsules
Progress & compress: A scalable framework for continual learning
Incremental learning of object detectors without catastrophic forgetting
Language modeling for lifelong language learning
Random sampling with a reservoir
Exploring and predicting transferability across NLP tasks
Efficient meta lifelong-learning with limited memory
Transformers: State-of-the-art natural language processing
Supermasks in superposition
Incremental few-shot text classification with multi-round new classes: Formulation, dataset and system
SGM: Sequence generation model for multi-label classification
Pretrained generalized autoregressive model with adaptive probabilistic label clusters for extreme multi-label text classification
Continual learning of context-dependent processing in neural networks
Class-incremental learning via deep model consolidation
Multi-model and multi-level knowledge distillation for incremental learning
Enhancing navigation on Wikipedia with social tags