title: Continual Pre-Training Mitigates Forgetting in Language and Vision
authors: Cossu, Andrea; Tuytelaars, Tinne; Carta, Antonio; Passaro, Lucia; Lomonaco, Vincenzo; Bacciu, Davide
date: 2022-05-19

Pre-trained models are nowadays a fundamental component of machine learning research. In continual learning, they are commonly used to initialize the model before training on the stream of non-stationary data. However, pre-training is rarely applied during continual learning. We formalize and investigate the characteristics of the continual pre-training scenario in both language and vision environments, where a model is continually pre-trained on a stream of incoming data and only later fine-tuned to different downstream tasks. We show that continually pre-trained models are robust against catastrophic forgetting and we provide strong empirical evidence supporting the fact that self-supervised pre-training is more effective in retaining previous knowledge than supervised protocols. Code is provided at https://github.com/AndreaCossu/continual-pretraining-nlp-vision.

Continual Learning (CL) (Lesort et al., 2020) focuses on the design of agents able to learn from a stream of non-stationary data while preserving previously acquired knowledge. The tendency of neural networks to catastrophically forget when confronted with new data has been the subject of many studies (McCloskey, Cohen, 1989; French, 1999), mostly focused on the design of new CL strategies that mitigate this problem. The traditional CL scenario currently used in the literature considers a single model tackling a sequence of tasks, one after the other (Parisi et al., 2019). In this setting, the CL model needs to learn its features while, at the same time, leveraging the same features to solve the supervised task. However, this scenario is not the only conceivable one. Natural Language Processing (NLP), for example, often exploits Transfer Learning techniques (Ruder et al., 2019) implemented through the so-called pre-training/fine-tuning setup. In this setting, the more general linguistic knowledge acquired with pre-training is leveraged as a starting point to target specific downstream tasks. Specifically: 1) during pre-training, language models focus on unsupervised learning tasks (e.g., predicting masked words based on the surrounding context), and 2) during fine-tuning, the pre-trained model is further trained on supervised learning tasks (e.g., sequence labeling). Pre-trained models are widespread also in CL (Mehta et al., 2021; Wu et al., 2021), where they are mostly used to conveniently initialize the model weights before learning from the non-stationary stream of data. However, the generality and robustness of the learned representations may be greatly impaired during continual training on the sequence of tasks, since the model will tend to overfit to the task objectives. By separating the goal of building robust features from that of solving the task during continual training, we provide a new way to design continual learning models which are 1) kept continuously up-to-date over time and 2) more robust to catastrophic forgetting, since pre-trained features have been reported to be subject to milder drifts during adaptation to the task (Mehta et al., 2020; Ramasesh et al., 2021).
Figure 1: (top) the model is continually pre-trained on a stream of datasets D^pr_i (e.g., scientific abstracts). Subsequently (bottom), the model is fine-tuned on one (or more) downstream tasks D^ds_i (e.g., scientific abstract classification). Forgetting is measured by fine-tuning on D^fc (e.g., sentiment analysis). At each stage, only the current pre-trained and downstream datasets/models are available.

The former point can be better understood with an example: let us consider the case in which a model is pre-trained on a snapshot of Wikipedia containing articles up to 2018. Part of the knowledge contained inside the model will soon become outdated: on one hand, the information contained in the original articles is likely to be replaced with up-to-date versions (e.g., changes in public figures such as a new President). On the other hand, outdated models do not incorporate the semantics of concepts related to more recent events. For example, the semantics of a term like COVID-19, which became important in a short amount of time, cannot be incorporated in the model without additional pre-training. As a consequence, an outdated language model may perform worse on tasks like language generation and Question Answering (Q/A), since it will not be able to generate sentences related to recent events (Jang et al., 2022). In this paper, we formalize and study the continual pre-training scenario (Figure 1), where the model is continuously updated via an appropriate pre-training objective on a non-stationary stream of (possibly unlabeled) data. After each stage of pre-training, we build a new model from the pre-trained one (e.g., by substituting its final classifier head) and we train it on a number of downstream tasks. We monitor whether continual pre-training improves/worsens the performance on tasks which are similar/different with respect to the ones encountered during continual pre-training. We are particularly interested in studying the possible deterioration, which represents catastrophic forgetting. For the sake of the evaluation, we specifically introduce a Forgetting Control (FC) dataset as one of the downstream tasks. The FC dataset contains samples different from the ones present in the non-stationary stream and more similar to the dataset used for the original pre-training phase prior to continual training. On this FC dataset, we compare the performance of the pre-trained model at the beginning of the sequence of tasks with the performance of the model after each stage of continual pre-training. Our aim is to investigate the behavior of different architectures, pre-training protocols and input modalities in the continual pre-training scenario, and how these factors impact catastrophic forgetting. We explore this broad research question through the evaluation environments and experiments described in the following sections. The ability of pre-trained models to solve a diverse set of tasks through fine-tuning has led to considering them as almost static models. However, it was recently shown that taking a pre-trained model and performing an additional step of pre-training on domain-specific data is beneficial for the downstream performance in that domain (e.g., Q/A in bio-medicine, as shown by Gururangan et al. (2020); Lee et al. (2020)). Pre-trained models are helpful also in CL, where leveraging a pre-trained model as the starting point for the continual training leads to better results with respect to forgetting both in CV (Mehta et al., 2021; Ramasesh et al., 2021) and NLP (Wu et al., 2021), especially when combined with CL strategies.
An additional pre-training step before the continual training also provides advantages in terms of downstream performance on domain-specific tasks (Rongali et al., 2021). The need to perform continual pre-training is present in many different applications, where updating the pre-trained model is fundamental to incorporate new knowledge and to update or erase outdated information (Lazaridou et al., 2021; Han et al., 2021; Jang et al., 2022). While models trained directly on a domain task may achieve similar or even better performance on downstream tasks (Gu et al., 2021), the cost of starting from scratch each time is large, and mitigating it is one of the objectives of CL. Continual pre-training has been recently explored in the context of NLP by leveraging either domain-specific datasets (like multi-domain research papers) (Jin et al., 2021) or news/tweets corpora split into different temporal segments (Loureiro et al., 2022; Jang et al., 2021). The results show that continual pre-training is beneficial to the downstream performance and that forgetting on the tasks stream can be effectively mitigated by employing CL strategies. Moreover, continual pre-training is also able to provide advantages in terms of temporal generalization on unseen future data (Loureiro et al., 2022) and event temporal reasoning (Han et al., 2021). The work by Hu et al. (2021) focuses on the performance difference between contrastive self-supervised (MoCo-v2 by Chen et al. (2020)) and supervised pre-training in CV, showing that self-supervised pre-training leads to features that are more robust in terms of forgetting. A more detailed discussion of related works is presented in Appendix D. Our work provides new evidence of the behavior of pre-trained models in the continual pre-training scenario. We propose to evaluate the performance in terms of catastrophic forgetting on an FC dataset not present in the CL stream. We provide results for both CV and NLP, with experiments on longer streams than most of the existing studies (with the exception of Qin et al. (2022)). Unlike prior works, we do not use any CL strategy, but simply employ naive fine-tuning. The traditional CL scenario (Lomonaco et al., 2021) trains a model h_0 on a (possibly infinite) stream of experiences S = (e_1, e_2, e_3, ...), where each experience e_i brings a dataset D_i representing the current task. The model is trained on S, one experience after the other, and needs to address the non-stationarity and drifts occurring between experiences without having access to the previously encountered data. The model h_0 is sometimes initialized with the weights of a pre-trained model. The pre-training phase is conducted on a dataset D^pr which is, however, not available during CL. We provide a formal characterization of the continual pre-training scenario (pseudo-code in Appendix C) and highlight the differences with respect to the traditional CL setup. The continual pre-training scenario leverages a model h^pr_0 originally pre-trained on a dataset D^pr_0, which is not available anymore. The model is presented with a (possibly infinite) stream of experiences, where each experience e_i brings a dataset D^pr_i for pre-training and a downstream dataset D^ds_i for fine-tuning. For each experience e_i, the last pre-trained model h^pr_{i-1} is further pre-trained on D^pr_i. After the pre-training step, the model h^pr_i is fine-tuned on D^ds_i, resulting in h^ds_i. We adopt naive fine-tuning, without any CL strategies.
In order to measure catastrophic forgetting, we leverage an FC dataset D^fc in place of the D^pr_0 originally used during the first pre-training phase. While each D^ds_i contains samples similar to the ones encountered during continual pre-training, the FC dataset contains knowledge which is more similar to the one in D^pr_0 than to the one in the union of the D^pr_i, i = 1, 2, 3, .... Forgetting is assessed after each experience e_i by comparing the performance of h^pr_0 fine-tuned on D^fc with the performance of h^pr_i fine-tuned on the same dataset. We use h^ds_i to verify that the continual pre-training step actually contributes to learning meaningful features for the downstream task. In this way, we avoid the uninteresting case where pre-training leaves features (mostly) unchanged, resulting in no catastrophic forgetting of previous knowledge but also in a lower performance on the downstream task. It is important to note that the head (last layer of the model) used during pre-training is not the one used during fine-tuning. In fact, the pre-training and downstream tasks are different and therefore require different heads. Before fine-tuning on each downstream task, the head of h^pr_i is replaced with a randomly initialized head. The model is then trained until convergence to obtain h^ds_i. During the continual pre-training step, instead, the head is not replaced. The continual pre-training scenario has different characteristics with respect to the traditional CL setup. Firstly, the continual pre-training scenario continuously updates the pre-trained model and only then adapts it to specific tasks. The traditional CL setup does not consider this important distinction, using the same model both for representation learning and to solve the incoming tasks. Secondly, model evaluation in continual pre-training requires an additional training phase on the target task, while CL usually requires the model to be readily able to tackle all tasks seen so far without any additional training. In the traditional setup, therefore, the model has to focus on the new task without the opportunity to build robust, general features via pre-training protocols. As our results will show, the additional cost of a training phase in continual pre-training can be largely mitigated by a quick adaptation phase (e.g., one epoch of training). In fact, fast remembering of previous knowledge is considered one of the objectives of CL (Hadsell et al., 2020). Ultimately, our continual pre-training scenario aims at building models which are general learners, able to quickly adapt to unseen data while still preserving the original knowledge. We studied continual pre-training by introducing two evaluation environments: one for NLP and one for CV. They are designed to investigate the impact on forgetting of specific components of the scenario (Table 1), namely the input modality, the pre-training protocol and the model architecture. Current NLP applications are all based on the idea of leveraging large-scale pre-trained models to then solve different tasks under fine-tuning, few- or even zero-shot learning settings. Therefore, NLP applications based on the traditional pre-training/fine-tuning setting seem to be the most natural field for evaluating our continual pre-training scenario. For example, when dealing with a stream of news, it is important to keep the language model updated (Lazaridou et al., 2021) so that it can incorporate information which was not previously available.
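The protocol can be summarized with the following minimal sketch. It is only an illustration of the scenario formalized above: pretrain, replace_head, fine_tune and evaluate are hypothetical helpers standing in for the actual training routines, not the API of the released code.

```python
import copy

# Hypothetical helpers (assumed for illustration, not the released implementation):
#   pretrain(model, data)  -> continues pre-training the model on `data` (head kept)
#   replace_head(model)    -> swaps the current head for a randomly initialized one
#   fine_tune(model, data) -> trains the model on `data` until convergence
#   evaluate(model, data)  -> accuracy on the held-out split of `data`

def continual_pretraining(h_pr, stream, d_fc):
    """h_pr: model pre-trained on D^pr_0; stream: experiences (D^pr_i, D^ds_i); d_fc: FC dataset."""
    # Reference: the original pre-trained model fine-tuned on the FC dataset.
    fc_reference = evaluate(fine_tune(replace_head(copy.deepcopy(h_pr)), d_fc), d_fc)

    for d_pr_i, d_ds_i in stream:
        # 1) Continual pre-training: the pre-training head is NOT replaced.
        h_pr = pretrain(h_pr, d_pr_i)
        # 2) Downstream fine-tuning with a fresh, randomly initialized head.
        h_ds_i = fine_tune(replace_head(copy.deepcopy(h_pr)), d_ds_i)
        # 3) Forgetting Control: fine-tune another copy on D^fc and compare with the reference.
        h_fc_i = fine_tune(replace_head(copy.deepcopy(h_pr)), d_fc)
        forgetting = fc_reference - evaluate(h_fc_i, d_fc)
        yield h_ds_i, forgetting
```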
Our NLP environment employs an unsupervised/self-supervised pre-training protocol and different Transformer architectures (Vaswani et al., 2017). These components are standard in NLP and represent the state of the art of the field. We use the popular pre-trained Transformers RoBERTa (Liu et al., 2019) and BERT (Devlin et al., 2019), pre-trained on Wikipedia. In addition, we study a variant of RoBERTa in which the vocabulary is dynamically expanded with the addition of new tokens. We select the most frequent tokens of the continual pre-training dataset which were not present in the pre-trained tokenizer. Vocabulary expansion is beneficial for downstream performance, as shown by recent works on dynamic token expansion in both CV (Douillard et al., 2022) and NLP (Zhang et al., 2020; Han et al., 2021). Our aim is to understand whether the addition of new tokens may result in a larger forgetting of existing knowledge. We apply continual pre-training on a dataset of scientific abstracts from arXiv (Geiger, 2019). The motivation behind the choice of this dataset is that scientific abstracts represent a very specific domain for NLP, both in terms of syntactic structures and of domain-specific terminology. Indeed, updating the language model before fine-tuning is particularly beneficial under these circumstances. The downstream task is modeled as a document classification problem aiming to associate scientific abstracts to their corresponding arXiv classes. The CL stream includes 5 experiences, with 2 scientific domains (classes) in each experience (as in common CL benchmarks like Split-MNIST/CIFAR-10). Please refer to Appendix A for a complete description of the splits used for pre-training and downstream fine-tuning. We test two different FC datasets to measure forgetting: sentiment analysis from tweets and Question Answering Natural Language Inference (QNLI). The idea behind these choices is that the dataset of scientific abstracts should not contain much knowledge about either sentiments or generic facts for language inference. Pre-training on scientific abstracts may therefore disrupt the knowledge contained in the original language model. We additionally expand our analysis by using the 20 datasets present in the SentEval benchmark (Conneau, Kiela, 2018) as FC datasets. We found CV to be a useful test-bed to disentangle the importance of the three components in our continual pre-training scenario. In particular, we design the CV environment to understand to what extent forgetting depends on the input modality (natural language against vision), on the architecture (Transformer against CNN) and on the pre-training protocol (unsupervised/self-supervised against supervised). To limit the large number of experiments needed to explore these three factors, in the CV environment we do not measure the performance on the downstream task after each step of continual pre-training. Instead, we focus on the study of forgetting on the FC dataset. In fact, the impact of pre-training on downstream tasks similar to the ones in the pre-training stream is assessed both in the discussion of related works (Section 2 above) and in the experiments with scientific abstract classification in the NLP environment (results presented below in Section 4 and Appendix B.2). The CV environment uses iNaturalist (Van Horn et al., 2018) for continual pre-training and CORe50 (Lomonaco, Maltoni, 2017) as the FC dataset for catastrophic forgetting.
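As a concrete illustration of the vocabulary expansion described above, the sketch below selects candidate domain tokens and adds them to a pre-trained RoBERTa with Huggingface Transformers. The placeholder corpora, vocabulary sizes and per-experience token budget are assumptions made for illustration only; the paper ranks candidates by frequency and handles token precedence as detailed in Appendix A.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Placeholder corpora: lists of raw strings (WikiText articles and the
# current experience's arXiv abstracts in the actual setup).
wiki_texts = ["...Wikipedia-style text..."]
abstract_texts = ["...scientific abstract text..."]

# Train two throwaway tokenizers to approximate the two vocabularies.
wiki_tok = tokenizer.train_new_from_iterator(wiki_texts, vocab_size=30_000)
abs_tok = tokenizer.train_new_from_iterator(abstract_texts, vocab_size=30_000)

# Candidate domain tokens: present in the abstracts vocabulary but not in the Wikipedia one.
candidates = set(abs_tok.get_vocab()) - set(wiki_tok.get_vocab())
new_tokens = sorted(candidates, key=len, reverse=True)[:40]  # illustrative per-experience budget

# Expand the tokenizer and grow the embedding matrix (new rows are randomly initialized).
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))
```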
In the CV environment, we use ResNet101, Vision Transformer (ViT) and BEiT, all originally pre-trained on ImageNet. The choice of ResNet and ViT is fundamental to disentangle the role of the architecture (NLP uses only Transformers) and of the pre-training protocol (NLP uses only self-supervised pre-training). In fact, ResNet (He et al., 2016) and ViT (Dosovitskiy et al., 2020) are pre-trained via supervised image classification. The choice of BEiT (Bao et al., 2021), instead, allows us to understand the role of the input modality. BEiT uses the recent self-supervised masked image modeling pre-training, which closely resembles the masked language modeling objective used in NLP. The proposed setup allows us to run experiments by changing one factor at a time among the three we studied while keeping the other two fixed. In this way, we are able to properly compare results between the NLP and CV environments. For NLP, we use Huggingface's pre-trained BERT and RoBERTa with 12 layers. The NLP datasets, SentEval excluded, are also taken from Huggingface. For SentEval, we train our models using the original code. We use the same pre-training protocol across all experiments, with a learning rate of 5e-5 and 30 epochs with early stopping (patience of 2 epochs). For fine-tuning, we adopt a similar setup but with a learning rate of 1e-5 and 20 epochs. For CV, we use ResNet101 and iNaturalist from Torchvision, while we retrieve the ViT and BEiT models from Huggingface, using the version with 12 layers in order to properly compare results with the NLP experiments. We use Avalanche (Lomonaco et al., 2021) to run the continual pre-training and fine-tuning. For fine-tuning on the FC task, we try a few combinations of learning rates (1e-5, 1e-4, 1e-3) and batch sizes (64, 128, 256) on a held-out validation set built from CORe50. We report the best performance in terms of accuracy on the test set. The experimental setup is described in detail in Appendix A. We provide strong empirical evidence supporting the hypothesis that the continual pre-training scenario is less impacted by catastrophic forgetting than the traditional one. In particular, we found the unsupervised pre-training objective to be the common factor behind the resistance to forgetting in the proposed environments. Our results add to the evidence discussed in Section 2 on the robustness of unsupervised and self-supervised protocols with respect to catastrophic forgetting. Our evaluation offers similar conclusions for the novel continual pre-training scenario. Continual pre-training improves performance on the downstream task without forgetting on FC datasets. We verified that continual pre-training positively impacts the performance on the downstream scientific abstract classification task. That is, we observed that acquiring domain knowledge on scientific abstracts helps when solving the classification task (on held-out data). Appendix B.2 shows that continual pre-training over 5 experiences improves the downstream classification performance (Table 11). Performance also improves with a single step of pre-training on the entire scientific abstracts dataset. As discussed in Appendix B.2, while the improvement is relatively small, we were able to achieve it by using a smaller number of samples with respect to common pre-training datasets (e.g., Wikipedia): this points to the fact that continual pre-training does not necessarily need enormous datasets to actually be beneficial (a very useful aspect for continual learning).
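For reference, the following is a minimal sketch of a single continual pre-training step with Huggingface, using the hyperparameters reported above (learning rate 5e-5, up to 30 epochs, early stopping with a patience of 2). The file names, sequence length and masking probability are illustrative assumptions, not values taken from the paper.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer, DataCollatorForLanguageModeling,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Hypothetical text files: one scientific abstract per line for the current experience.
ds = load_dataset("text", data_files={"train": "abstracts_exp1_train.txt",
                                      "validation": "abstracts_exp1_val.txt"})
ds = ds.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
            batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="pretrain_exp1",
    learning_rate=5e-5,
    num_train_epochs=30,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,   # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds["train"],
    eval_dataset=ds["validation"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()  # masked language modeling on the current experience only
```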
In terms of catastrophic forgetting on the FC dataset after continual pre-training, we show that, quite surprisingly, both RoBERTa (Tables 2 and 3) and BERT (Table 4) achieve almost zero forgetting, reaching an accuracy comparable to the one originally obtained by the model before continual pre-training. This happens both for sentiment analysis and for QNLI. Moreover, a single epoch of gradient descent is sufficient to retain most of the original performance, showing the quick adaptation capabilities of the pre-trained models. Notably, the additional pre-training steps on domain-specific texts, along with the expansion of the RoBERTa vocabulary, do not worsen the effects of catastrophic forgetting. We conducted a broader empirical assessment on a diverse set of NLP tasks by using the SentEval benchmark. Figure 2 shows the downstream performance of BERT and RoBERTa after the entire continual pre-training stream. GloVe and fastText results are used as baselines and are taken from Conneau, Kiela (2018), except on SNLI and on all probing tasks, for which they were not available. Therefore, we computed these results using the original code. The results confirm our findings: BERT and RoBERTa do not show clear signs of forgetting, neither with respect to their original pre-trained version, nor with respect to the baselines. Self-supervised continual pre-training mitigates forgetting. We found self-supervised continual pre-training to be the main factor behind the mitigation of forgetting in continual pre-training. Since all NLP models use the self-supervised masked language modeling task for pre-training, we turned our attention to the CV environment. In fact, ResNet and ViT both use supervised image classification during pre-training. In contrast, BEiT uses the recent self-supervised protocol of masked image modeling (Bao et al., 2021) (mirroring the NLP setting). We show that BEiT, the only self-supervised model in the CV environment, retains its performance on the FC dataset far better than the supervised ResNet and ViT. Feature Space Analysis: supervised pre-training induces larger drifts. We verified the coherence of our findings by studying the feature space of the models. We leveraged linear evaluation for a quantitative analysis and Centered Kernel Alignment (CKA) (Kornblith et al., 2019) for a qualitative analysis. Linear evaluation (i.e., training only the linear classifier and keeping the rest of the model fixed) is a powerful tool to understand the impact of the learned model representations in terms of catastrophic forgetting (Davari et al., 2022). A model which exhibits forgetting during linear evaluation is likely to possess features which are not representative of the task. Conversely, a good linear evaluation performance points to a set of strong features, since it means that the task is linearly separable in that feature space. We adopted this approach for our continual pre-training scenario. In the NLP environment (Table 6), the features built by the models during continual pre-training are robust and do not cause a large deviation of performance with respect to the original pre-trained model. The lower training accuracy with respect to fine-tuning is expected, since we train only a subset of all parameters. In the CV environment (Table 7), instead, the supervised models exhibit a larger performance deviation. This is in line with findings previously reported for unsupervised CL, namely that unsupervised models in the traditional CL scenario have larger correlations in the lower layers than supervised ones.
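A minimal sketch of the linear evaluation protocol used in this analysis (training only a new linear head on top of frozen features), here with a torchvision ResNet101 for concreteness; the optimizer, learning rate and data loader are assumptions, not the exact configuration used in the experiments.

```python
import torch
from torch import nn
from torchvision import models

def linear_evaluation(train_loader, num_classes, epochs=20, lr=1e-3, device="cuda"):
    """Train only a new linear head on top of a frozen, pre-trained backbone."""
    model = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
    for p in model.parameters():
        p.requires_grad = False                               # freeze the backbone
    model.fc = nn.Linear(model.fc.in_features, num_classes)   # new trainable linear classifier
    model.to(device).eval()  # eval mode keeps BatchNorm statistics frozen; the head still trains

    opt = torch.optim.Adam(model.fc.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model
```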
Our results extend this conclusion on layer-wise correlations to continual pre-training, supporting the idea that pre-training acts mainly on the upper layers of the network (the ones containing more specific domain knowledge) and that heavy changes in these layers are enough to cause a deterioration of performance on the FC dataset, resulting in forgetting. Our empirical evaluation provides evidence that forgetting is mitigated in continual pre-training by the use of self-supervised pre-training protocols (Table 8). Fine-tuning for only one epoch allows the model to recover most of the performance: this is important, since an expensive fine-tuning phase might reduce the applicability of continual pre-training in environments with constrained resources. Deciding when to use continual pre-training and when to use the traditional CL scenario is an open question. As previously discussed, the properties of continual pre-training do not fit the case in which a single model has to be readily applicable to different tasks without a step of fine-tuning. Nonetheless, we believe that whenever knowledge must be kept updated over time, continual pre-training can deliver a superior solution, less affected by forgetting (see Appendix B.3 for a comparison with the traditional CL scenario). Continual pre-training offers the possibility to shift the focus from the mitigation of forgetting to other CL objectives like quick adaptation and knowledge reuse and transfer. The main limitation of our study is related to the scale of the experiments, as we were able to experiment with only a limited number of datasets for each environment. While the computational cost of each experiment was reasonable (from a few hours for fine-tuning to around one day for continual pre-training, on a single A100), the number of experiments per environment was large. We preferred to thoroughly evaluate a few environments rather than trying to address a wide range of different datasets without being able to properly explore them (Table 1). We are well aware that a comprehensive exploration of continual pre-training in both NLP and CV domains is an ambitious objective, possible only in the context of a broad research program. However, we are confident that this study has shed some light on the subject and clearly pointed towards promising research directions. Continual pre-training represents a novel CL scenario with promising opportunities and unexpected characteristics. In this work, we formally defined the continual pre-training scenario and we showed the effect that pre-training has on catastrophic forgetting, for both NLP and CV environments and with different architectures. Our results show that forgetting can be effectively mitigated by means of self-supervised pre-training, even with a single epoch of fine-tuning on the FC dataset. Ultimately, this work opens up the possibility to continually train large pre-trained models in a scalable and efficient way. Much like Deep Learning has advanced by disentangling the representation learning objective from the solution to specific tasks, continual pre-training aims to focus on the incremental development of robust features which are kept updated over time. This is a fundamental property towards the achievement of agents that can truly learn continuously in the real world. Here, we describe the experimental setup we adopted in our work for both the NLP environment and the CV environment. All our experiments were run on a single A100 GPU with 80 GB of memory, on a server with 96 cores.
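As a reference for the CKA-based analysis discussed above, linear CKA between two activation matrices can be computed as in the sketch below. This is the simple full-batch linear form; the reported experiments instead use the unbiased minibatch estimator (Nguyen et al., 2020) provided by the external library referenced later in this appendix.

```python
import torch

def linear_cka(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Linear CKA between activations x (n, d1) and y (n, d2) computed on the same n inputs."""
    x = x - x.mean(dim=0, keepdim=True)  # center each feature dimension
    y = y - y.mean(dim=0, keepdim=True)
    cross = torch.linalg.matrix_norm(y.T @ x) ** 2  # ||Y^T X||_F^2 (Frobenius norm is the default)
    return cross / (torch.linalg.matrix_norm(x.T @ x) * torch.linalg.matrix_norm(y.T @ y))
```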
NLP. The continual pre-training dataset of scientific abstracts is taken from GitHub. We selected 10 arXiv classes to build our continual pre-training stream, namely 'hep-ph', 'astro-ph', 'hep-th', 'quant-ph', 'cond-mat.mes-hall', 'gr-qc', 'cond-mat.mtrl-sci', 'cond-mat.str-el', 'cond-mat.stat-mech' and 'astro-ph.SR'. For both pre-training and downstream fine-tuning, we selected 10,000 abstracts for each of the 10 classes for the training set and 1,000 for the test set. Hence, an abstract present in one of the training/test sets of continual pre-training or downstream fine-tuning is not present in the other partitions. We chose similar abstract categories since being able to distinguish very different kinds of abstracts may greatly simplify the problem (e.g., one term may be enough to classify the entire abstract). We will publicly release our version of the scientific abstract dataset used in the experiments. The dataset can be easily loaded via Huggingface. In order to select new tokens for the expansion of the RoBERTa vocabulary at each experience of continual pre-training, we trained from scratch a tokenizer on the WikiText dataset (Merity et al., 2016). This tokenizer quickly approximates the tokens present in Wikipedia. We also trained a tokenizer on our scientific abstracts dataset and ranked the tokens occurring in the latter but not in the former tokenizer, that is, the domain tokens related to the scientific abstracts dataset. We selected 426 new tokens for the joint training experiments (Appendix B.2) and 39/42/28/30/10 for each of the 5 experiences of continual pre-training. We added the new tokens to the tokenizer such that they take precedence over already existing tokens during the tokenization process. Within the new tokens, we sorted inversely by token length, and precedence is given by the order of addition (First In, First Out). The list of new tokens is embedded in the released code. We also ran a few experiments (not reported here) in which, with the same procedure, we added sub-word tokens (BPE encoding) instead of word tokens. We did not find significant differences in the results, which do not seem to depend on which specific new tokens are selected, as long as they provide domain knowledge about the task. The FC dataset QNLI is available from Huggingface as part of the GLUE benchmark at https://huggingface.co/datasets/glue. The sentiment analysis from tweets dataset is also taken from Huggingface at https://huggingface.co/datasets/emotion. The SentEval benchmark is taken from the official codebase at https://github.com/facebookresearch/SentEval. During linear evaluation, we removed the feedforward layer right before the classifier. We observed that keeping it frozen yielded a very low training performance. On the other hand, fine-tuning it together with the linear classifier did not show this issue, but resulted in a non-linear fine-tuning procedure, making it difficult to compare results against the CV setup. Therefore, linear evaluation is performed by taking the representation built for the special CLS token by the last hidden layer of the Transformer and decoding it with a trained linear classifier. Computer Vision. We adopted the masked image modeling task for self-supervised pre-training with BEiT. Following the original BEiT paper, we leveraged the DALL-E encoder, which is kept fixed during continual pre-training.
A simple example of masked image modeling can be found at https://github.com/NielsRogge/Transformers-Tutorials/blob/master/BEiT/Understanding_BeitForMaskedImageModeling.ipynb. Experiments which also continually pre-train the encoder may constitute interesting future work. Following the original PyTorch code at https://github.com/pytorch/vision/blob/main/references/classification/presets.py, for continual pre-training and fine-tuning on the FC dataset with ResNet we used a chain of augmentations: RandomResizedCrop with bilinear interpolation, RandomHorizontalFlip and normalization of mean and standard deviation. On the test sets, we resized the image to 256x256, applied center crop and normalization. ViT uses the same setup without normalization. BEiT applies the ViT setup on the FC dataset only. For all CKA experiments, we used the Python library from https://github.com/AntixK/PyTorch-Model-Compare, which provides the unbiased minibatch estimator of the CKA. Table 9 shows the complete set of results for the SentEval benchmark. We compare the performance of continual pre-training after 5 experiences on scientific abstracts against two baselines (GloVe and fastText) and the original pre-trained model. For RoBERTa, we also provide the results in case of vocabulary expansion. We used one hidden layer of 50 units for the probing tasks and logistic regression for the transfer tasks. Table 9 (caption): Accuracy on 10 transfer and 10 probing tasks from SentEval. For comparison, we report the performance of the pre-trained models at the end of pre-training on the last experience (e5) of the scientific abstracts dataset. Table 10 shows the accuracy on the scientific abstract classification task (held-out sets) after pre-training on the entire dataset of scientific abstracts. Therefore, this setup uses only one step of pre-training to assess its effectiveness on the performance on the downstream task. We show that pre-training is beneficial to the final performance with respect to the original model pre-trained on Wikipedia. Similarly, Table 11 shows the impact of 1 and 5 steps of continual pre-training on the scientific abstract classification task. Each fine-tuning step is performed on the corresponding split of the scientific abstract dataset. Again, we see a moderate improvement in the final performance. It is important to note that the improvement, although small, is nonetheless present even if each experience of continual pre-training contains a smaller set of samples with respect to the pre-training datasets typically used in the NLP literature, like Wikipedia. For each experience, we have 20,000 samples. This aspect is particularly important for continual learning, where the model is not updated one-shot with a large dataset, but in multiple steps with fewer samples. Table 12 shows that, in a traditional CL setup, continuously fine-tuning a single model on the scientific abstract classification tasks leads to large forgetting on the same scientific abstract classification task (held-out dataset), unless CL strategies are employed. We measure the popular ACC metric (Lopez-Paz, Ranzato, 2017), which computes the accuracy on all tasks after training on the last task. The lower its value, the larger the forgetting effect.
This shows that, although in the traditional CL scenario we always have a model ready to tackle all the previous tasks without retraining, the loss in terms of performance (accuracy, in this case) is very large with respect to the continual pre-training scenario. CKA is computed incrementally in minibatches, following Nguyen et al. (2020). We provide the full set of CKA plots in Figure 4 for the NLP environment and in Figure 5 for the CV environment. We include the CKA against the original pre-trained model and its continually pre-trained version after each experience of continual pre-training. The upper-right corner of each image represents the upper layers of the models; its correlation is very low only for ViT and ResNet, while it stays large for BEiT, RoBERTa and BERT on all FC datasets. We report in Table 13 and Table 14 the performance obtained by larger Vision Transformer models with 24 Transformer layers for fine-tuning and linear evaluation, respectively. The results are in line with our main findings with smaller models, except for the ViT, which shows a smaller degree of forgetting. However, the training curves for the large ViT show an unstable trend: the best accuracy is usually reached after one epoch, after which the value quickly degrades to a lower performance. We believe that future works investigating the impact of model depth on our results may shed light on this phenomenon. Algorithm 1 provides a high-level description of the continual pre-training scenario, showing the steps of continual pre-training, downstream fine-tuning and catastrophic forgetting evaluation against the FC dataset. To obtain the configuration we used in linear evaluation, it is sufficient to replace fine-tune with linear-eval in Line 6. The continual pre-training scenario appeared very recently in the literature. In this section, we provide a more detailed description of the existing works exploring continual pre-training and of the differences with respect to our work. Section 2 already provides a brief description but, due to lack of space, we were unable to thoroughly discuss the few existing studies there. Among existing works, the CL scenario used in (Jin et al., 2021) constitutes the most similar setup with respect to our definition of continual pre-training. Like us, the authors used a dataset of research papers as the pre-training stream and leveraged RoBERTa in their experiments. Unlike ours, though, their work focuses on NLP tasks and on the impact that different CL strategies have on the final performance, rather than on the kind of pre-training protocol and its impact on a separate FC task. Moreover, the downstream tasks used to measure performance are strongly related to the pre-training stream, making it difficult to understand the impact of each pre-training step on catastrophic forgetting. The results they provided show that the amount of forgetting does not depend on the specific CL strategy used. In line with our findings, a naive fine-tuning approach is robust and does not show a catastrophic loss in performance. The Continual Knowledge Learning (CKL) framework (Jang et al., 2021) shares some similarities with the continual pre-training scenario adopted in our work. CKL considers a pre-trained model that is continuously updated and, throughout its training, focuses on different objectives: recognizing invariant knowledge which does not change over time, incorporating new knowledge not previously present, and updating knowledge which is outdated.
The proposed benchmark is entirely based on NLP: it consists of a continual pre-training dataset of news, a "time-invariant knowledge" dataset hand-crafted from a relations dataset, and "updated knowledge" and "new knowledge" datasets built from scratch through Amazon Mechanical Turk and validated by a set of external experts. The empirical evaluation provided in the paper is based on a new metric, called FUAR, which condenses the performance of the pre-trained model on these three tasks into a single number. The experiments are conducted on the T5 Transformer endowed with existing CL strategies. The authors found that parameter expansion methods are among the best performing ones, although they require a larger number of parameters with respect to static alternatives.

Algorithm 1 (surviving fragment): 7: Compare performance of h^fc_i with h^fc_0 to assess forgetting. 8: end for. 9: return.

References
BEiT: BERT Pre-Training of Image Transformers // International Conference on Learning Representations.
He Kaiming. Improved Baselines with Momentum Contrastive Learning.
SentEval: An Evaluation Toolkit for Universal Sentence Representations.
Probing Representation Forgetting in Supervised and Unsupervised Continual Learning.
Tuytelaars Tinne. A Continual Learning Survey: Defying Forgetting in Classification Tasks // IEEE Transactions on Pattern Analysis and Machine Intelligence.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding // Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale // International Conference on Learning Representations.
DyTox: Transformers for Continual Learning with DYnamic TOken eXpansion.
Catastrophic Forgetting in Connectionist Networks // Trends in Cognitive Sciences.
ArXiV Archive: A Tidy and Complete Archive of Metadata for Papers on Arxiv.
Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing // ACM Transactions on Computing for Healthcare.
Don't Stop Pretraining: Adapt Language Models to Domains and Tasks // Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
Embracing Change: Continual Learning in Deep Neural Networks // Trends in Cognitive Sciences.
ECONET: Effective Continual Pretraining of Language Models for Event Temporal Reasoning.
Deep Residual Learning for Image Recognition // 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
How Well Does Self-Supervised Pre-Training Perform with Streaming Data?
TemporalWiki: A Lifelong Benchmark for Training and Evaluating Ever-Evolving Language Models.
Towards Continual Knowledge Learning of Language Models // International Conference on Learning Representations.
Ren Xiang. Lifelong Pretraining: Continually Adapting Language Models to Emerging Corpora.
Similarity of Neural Network Representations Revisited // Proceedings of the 36th International Conference on Machine Learning.
Blunsom Phil. Mind the Gap: Assessing Temporal Generalization in Neural Language Models // Thirty-Fifth Conference on Neural Information Processing Systems.
BioBERT: A Pre-Trained Biomedical Language Representation Model for Biomedical Text Mining // Bioinformatics.
Continual Learning for Robotics: Definition, Framework, Learning Strategies, Opportunities and Challenges // Information Fusion. 2020. 58.
RoBERTa: A Robustly Optimized BERT Pretraining Approach.
CORe50: A New Dataset and Benchmark for Continuous Object Recognition // Proceedings of the 1st Annual Conference on Robot Learning. 78. 2017. 17-26. (Proceedings of Machine Learning Research).
Avalanche: An End-to-End Library for Continual Learning.
Gradient Episodic Memory for Continual Learning // NIPS.
TimeLMs: Diachronic Language Models from Twitter.
Hwang Sung Ju. Representational Continuity for Unsupervised Continual Learning // International Conference on Learning Representations.
Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem // Psychology of Learning and Motivation.
Bayesian Nonparametric Weight Factorization for Continual Learning // arXiv.
An Empirical Investigation of the Role of Pre-training in Lifelong Learning.
Do Wide and Deep Networks Learn the Same Things?
Continual Lifelong Learning with Neural Networks: A Review // Neural Networks.
ELLE: Efficient Lifelong Pre-training for Emerging Data // Findings of ACL.
Effect of Scale on Catastrophic Forgetting in Neural Networks.
Continual Domain-Tuning for Pretrained Language Models.
Transfer Learning in Natural Language Processing // Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials.
The INaturalist Species Classification and Detection Dataset // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Polosukhin Illia. Attention Is All You Need // Advances in Neural Information Processing Systems 30.
Pretrained Language Model in Continual Learning: A Comparative Study // International Conference on Learning Representations.
Multi-Stage Pre-training for Low-Resource Domain Adaptation.

This work has been partially supported by the H2020 TEACHING project (GA 871385).