title: Studying Catastrophic Forgetting in Neural Ranking Models
authors: Lovon-Melgarejo, Jesus; Soulier, Laure; Pinel-Sauvagnat, Karen; Tamine, Lynda
date: 2021-01-18

Several deep neural ranking models have been proposed in the recent IR literature. While their transferability to one target domain held by a dataset has been widely addressed using traditional domain adaptation strategies, the question of their cross-domain transferability is still under-studied. We study here to what extent neural ranking models catastrophically forget old knowledge acquired from previously observed domains after acquiring new knowledge, leading to a performance decrease on those domains. Our experiments show that the effectiveness of neural IR ranking models is achieved at the cost of catastrophic forgetting, and that a lifelong learning strategy using a cross-domain regularizer successfully mitigates the problem. Using an explanatory approach built on a regression model, we also show the effect of domain characteristics on the rise of catastrophic forgetting. We believe that the obtained results can be useful for both theoretical and practical future work in neural IR.

Neural ranking models have been increasingly adopted in the information retrieval (IR) and natural language processing (NLP) communities for a wide range of data and tasks [35, 40]. One common underlying issue is that they learn relationships that may hold only in the domain from which the training data is sampled, and generalize poorly in unobserved domains [6, 40]. To enhance the transferability of neural ranking models from a source domain to a target domain, transfer learning strategies such as fine-tuning [53], multi-tasking [29], domain adaptation [41], and more recently adversarial learning [7], have been widely used. However, these strategies have by essence two critical limitations reported in the machine learning literature [6, 22]. The first one, acknowledged in the NLP and IR communities [7, 29], is that they require all the domains to be available simultaneously at the learning stage (except for fine-tuning). The second limitation, under-studied in both communities, is that the model tends to catastrophically forget existing knowledge (source domain) when the learning is transferred to new knowledge (target domain), leading to a significant drop of performance on the source domain. These limitations are particularly thorny when considering open-domain IR tasks including, but not limited to, conversational search. In the underlying settings (e.g., QA systems and chatbots [15, 25, 33, 43]), neural ranking models are expected to continually learn features from online information streams, sampled from either observed or unobserved domains, and to scale across different domains without forgetting previously learned knowledge. Catastrophic forgetting is a long-standing problem addressed in machine learning using lifelong learning approaches [6, 42]. It has been particularly studied in neural-network based classification tasks in computer vision [22, 26] and more recently in NLP [32, 37, 46, 49].
However, while previous work showed that the level of catastrophic forgetting is significantly impacted by dataset features and network architectures, we are not aware of any existing research in IR providing clear lessons about the transferability of neural ranking models across domains, nor, more basically, showing whether state-of-the-art neural ranking models actually face the catastrophic forgetting problem and, if so, how to overcome it. Understanding the conditions under which these models forget accumulated knowledge, and whether a lifelong learning strategy is a feasible way of improving their effectiveness, would bring important lessons for both practical and theoretical work in IR. This work contributes to filling this gap identified in the neural IR literature by studying the transferability of ranking models. We put the focus on catastrophic forgetting, which is the bottleneck of lifelong learning. The main contributions of this paper are as follows. 1) We show the occurrence of catastrophic forgetting in neural ranking models. We investigate the transfer learning of five representative neural ranking models (DRMM [14], PACRR [17], KNRM [50], V-BERT [31] and CEDR [31]) over streams of datasets from different domains (MS MARCO [3], TREC Microblog [45] and TREC COVID19 [47]); 2) We identify domain characteristics, such as relevance density, as signals of catastrophic forgetting; 3) We show the effectiveness of constraining the objective function of the neural IR models with a forget cost term to mitigate catastrophic forgetting.

From Domain Adaptation to Lifelong Learning of Neural Networks. Neural networks are learning systems that must commonly, on the one hand, exhibit the ability to acquire new knowledge and, on the other hand, exhibit robustness by refining knowledge while maintaining stable performance on existing knowledge. While the acquisition of new knowledge gives rise to the well-known domain shift problem [18], maintaining model performance on existing knowledge is faced with the catastrophic forgetting problem. Those problems have been respectively tackled using domain adaptation [41] and lifelong learning strategies [6, 42]. Domain adaptation, commonly known as a specific setting of transfer learning [41], includes machine learning methods (e.g., fine-tuning [49] and multi-tasking [29]) that assume that the source and the target domains, from which the training and testing data are respectively sampled, might have different distributions. By applying a transfer learning method, a neural model should acquire new specialized knowledge from the target domain, leading to optimal performance on it. One of the main issues behind common transfer learning approaches is catastrophic forgetting [11, 12]: the newly acquired knowledge interferes with, or in the worst case overwrites, the existing knowledge, leading to a performance decrease on information sampled from the latter. Lifelong learning [6, 42] tackles this issue by enhancing the models with the ability to continuously learn over time and accumulate knowledge from streams of information sampled across domains, either previously observed or not.
The three common lifelong learning approaches are [42]: 1) regularization, which constrains the objective function with a forget cost term [22, 26, 49]; 2) network expansion, which adapts the network architecture to new tasks by adding neurons and layers [5, 44]; and 3) memory models, which retrain the network using instances selected from a memory drawn from different data distributions [2, 32].

On the Transferability of Neural Networks in NLP and IR. Transferability of neural networks has been particularly studied in classification tasks, first in computer vision [4, 54] and then only recently in NLP [19, 38, 39]. For instance, Mou et al. [39] investigated the transferability of neural networks in sentence classification and sentence-pair classification tasks. One of their main findings is that transferability across domains depends on the level of similarity between the considered tasks. In contrast, previous work in IR, which mainly involves ranking tasks, has only casually applied transfer learning methods (e.g., fine-tuning [53], multi-tasking [29] and adversarial learning [7]) without bringing generalizable lessons about the transferability of neural ranking models. One consensual result reported across previous research in the area is that traditional retrieval models (e.g., learning-to-rank models [28]) that make fewer distributional assumptions exhibit more robust cross-domain performances [7, 40]. Besides, it has been shown that the ability of neural ranking models to learn new features may be achieved at the cost of poor performances on domains not observed during training [35]. Another consensual result is that although embeddings are trained using large-scale corpora, they are generally sub-optimal for domain-specific ranking tasks [40]. Beyond domain adaptation, there is a recent research trend in NLP toward lifelong learning of neural networks, particularly in machine translation [46] and language understanding tasks [37, 49, 51]. For instance, Xu et al. [51] recently revisited the domain transferability of traditional word embeddings [34] and proposed lifelong domain embeddings using a meta-learning approach. The proposed meta-learner is fine-tuned to identify similar contexts of the same word in both past domains and the new observed domain. Thus, its inference model is able to compute similarity scores on pairs of feature vectors representing the same word across domains. These embeddings have been successfully applied to a topic-classification task. In contrast, catastrophic forgetting and lifelong learning are still under-studied in IR. We believe that a thorough analysis of the transferability of neural ranking models from a lifelong learning perspective would be desirable for a wide range of emerging open-domain IR applications including, but not limited to, conversational search [15, 33, 25, 43].

Our study mainly addresses the following research questions:
RQ1: Does catastrophic forgetting occur in neural ranking models?
RQ2: What are the dataset characteristics that predict catastrophic forgetting?
RQ3: Is a regularization-based lifelong learning method effective to mitigate catastrophic forgetting in neural ranking models?

To answer these questions, we adopt the following experimental pipeline over a setting (D_1 → · · · → D_n), starting from a model M_k trained on the left dataset D_k (k = 1):
- Repeat:
  • Apply to model M_k a method D w.r.t. objective O1 (resp. method L w.r.t. objective O2) to transfer knowledge to the right dataset D_{k+1} (forward transfer). The resulting model is noted M_{k+1} with parameters θ̂_{k+1}. Its performance on dataset D_{k+1} is noted R_{k+1,k+1}.
  • Measure the retrieval performance R_{k+1,k} of model M_{k+1} obtained on the testing instances of the left dataset D_k (backward transfer).
  • Move to the next right dataset: k = k + 1.
- Until the end of the dataset stream setting (k = n).
This experimental pipeline, illustrated in Figure 1, follows general guidelines adopted in previous work [2, 20, 26]. We detail below the main underlying components highlighted in bold.
Neural ranking models. We use five (5) state-of-the-art models selected from a list of models critically evaluated in Yang et al. [52]: 1) interaction-based models: DRMM [14], PACRR [17] and KNRM [50]; 2) BERT-based models: Vanilla BERT [31] and CEDR-KNRM [31]. We use the OpenNIR framework [30] that provides a complete neural ad-hoc document ranking pipeline. Note that in this framework, the neural models are trained by linearly combining their own score (S_NN) with a BM25 score (S_BM25).
Datasets and settings. We use the three following datasets: 1) MS MARCO (ms) [3], a passage ranking dataset which includes more than 864K question-like queries sampled from the Bing search log and a large-scale web document set including 8,841,823 documents; 2) TREC Microblog (mb) [27], a real-time ad-hoc search dataset from TREC Microblog 2013 and 2014, which contains a public Twitter sample stream between February 1 and March 31, 2013, including 124,969,835 tweets, and 115 queries submitted at a specific point in time; 3) TREC CORD19 (c19) [47], an ad-hoc document search dataset which contains 50 question-like queries and a corpus of 191,175 published research articles dealing with SARS-CoV-2 or COVID-19 topics. It is worth mentioning that these datasets fit the requirement of cross-domain adaptation [41] since they have significant differences in both their content and sources. Besides, we consider four settings (see Table 1, column "Setting"), among which three 2-dataset (n = 2) settings and one 3-dataset (n = 3) setting. As done in previous work [2, 26], these settings follow the patterns (ms → mb, ms → c19, mb → c19, and ms → mb → c19), where dataset orders are based on the decreasing sizes of the training sets, assuming that larger datasets allow starting with well-trained networks.
Domain adaptation and lifelong learning methods. We adopt fine-tuning (training on one domain and fine-tuning on the other) as the representative domain adaptation task D since it suffers from the catastrophic forgetting problem [2, 22]. Additionally, we adopt the Elastic Weight Consolidation (EWC) [22] as the lifelong learning method L. The EWC constrains the loss function with an additional forget cost term that we add to the objective function of each of the five neural models studied in this work. Basically speaking, EWC constrains the neural network-based model to remember knowledge acquired on left datasets by reducing the overwriting of its most important parameters as:
$\tilde{L}(\theta_k) = L(\theta_k) + \sum_{i<k} \frac{\lambda}{2} F_i (\theta_k - \hat{\theta}_i)^2$
where L(θ_k) is the loss of the neural ranking model with parameters θ_k obtained right after learning on D_k, λ is the importance weight of the model parameters trained on the left datasets (D_i, i < k) with respect to the current one (D_k), F_i is the Fisher information matrix and θ̂_i are the parameters obtained right after learning on D_i.
Measures. Given the setting (D_1 → · · · → D_n), we use the remembering measure (REM) derived from the backward transfer measure (BWT) proposed by Rodriguez et al. [10] as follows:
• BWT: measures the intrinsic effect (either positive or negative) that learning a model M on a new dataset (right in the setting) has on the model performance obtained on an old dataset (left in the setting), referred to as backward transfer.
Practically, in line with a lifelong learning perspective, this measure averages, along the setting, the differences between the performance obtained on a left dataset by the model right after learning on a right dataset and the performance of the oracle model trained and tested on that same left dataset. Thus, while positive values indicate positive backward transfer, negative values indicate catastrophic forgetting. Formally, the BWT measure is computed as:
$BWT = \frac{\sum_{i=2}^{n}\sum_{j=1}^{i-1}\left(R_{i,j} - R^*_{j,j}\right)}{\frac{n(n-1)}{2}}$
where R_{i,j} is the performance measure, obtained on dataset D_j, of model M_i learned up to dataset D_i, and R*_{j,j} is the performance of the oracle model M*_j trained on dataset D_j and tested on the same dataset. To make fair comparisons between the different studied neural models, we normalize the differences in performance (R_{i,j} − R*_{j,j}) by the model-agnostic performance obtained using the BM25 model on each left dataset D_j. In our work, we use the standard IR performance measures MAP, NDCG@20 and P@20 to measure R_{i,j}, but we only report the REM values computed using the MAP measure, as they all follow the same general trends.
• REM: because the BWT measure takes either positive values for positive backward transfer or negative values for catastrophic forgetting, it can be mapped to a positive remembering value in the range [0, 1] as follows:
$REM = 1 - |\min(BWT, 0)|$
A REM value equal to 1 means that the model does not catastrophically forget. To better measure the intrinsic ability of the neural ranking models to remember previously acquired knowledge, we deploy in the OpenNIR framework two runs for each neural model based on the score combination (Score_G = α × S_NN + (1 − α) × S_BM25). The first one considers the neural model in a re-ranking setup (0 < α < 1), leading to an overall REM measure on the ranking model. The second one only considers the neural ranking based on the S_NN score, totally disregarding the BM25 scores (α = 1). REM_N denotes the remembering measure computed in this second run. We use the OpenNIR framework with default parameters and the pairwise hinge loss function [8]. To feed the neural ranking models, we use the GloVe pre-trained embeddings (42B tokens, 300-dimensional vectors). The datasets are split into training and testing instance sets. For MS MARCO, we use the default splits provided in the dataset. For TREC CORD19 and TREC Microblog, where no training instances are provided, we adopt splits by proportion, leading to 27/18 and 92/23 training/testing queries respectively. In practice, we pre-rank documents using the BM25 model. For each relevant document-query pair (positive pair), we randomly sample a document for the same query with a lower relevance score to build the negative pair. We re-rank the top-100 BM25 results and use P@20 to select the best-performing model. For each dataset, we use the optimal BM25 hyperparameters selected using grid search. In the training phase, we consider a maximum of 100 epochs, or early stopping if no further improvement is found. Each epoch consists of 32 batches of 16 training pairs. All the models are optimized using Adam [21] with a learning rate of 0.001. BERT layers are trained at a rate of 2e−5 following previous work [31]. For the EWC, we fixed λ = 0.5. The code is available at https://github.com/jeslev/OpenNIR-Lifelong.
Within- and Across-Model Analysis of Ranking Models (RQ1). Our objective here is to investigate whether each of the studied neural models suffers from catastrophic forgetting while it is fine-tuned over a setting (D_1 → D_2 or D_1 → D_2 → D_3).
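To make the evaluation measures above concrete, the following minimal Python sketch (not the authors' implementation; the variable names and toy numbers are illustrative assumptions) computes the normalized BWT and the derived REM from a matrix of MAP scores collected along a setting.

```python
import numpy as np

def backward_transfer(R, R_oracle, R_bm25):
    """Normalized backward transfer (BWT) for a stream of n datasets.

    R        : (n, n) array; R[i, j] = MAP on dataset j of the model obtained
               right after learning on dataset i of the stream (0-indexed).
    R_oracle : (n,) array; R_oracle[j] = MAP of the oracle model trained and
               tested on dataset j only.
    R_bm25   : (n,) array; R_bm25[j] = MAP of the BM25 run on dataset j, used
               to normalize the differences across datasets.
    """
    n = R.shape[0]
    diffs = [(R[i, j] - R_oracle[j]) / R_bm25[j]
             for i in range(1, n) for j in range(i)]
    return sum(diffs) / (n * (n - 1) / 2)

def remembering(bwt):
    """Map BWT to REM in [0, 1]; 1 means no catastrophic forgetting."""
    return 1.0 - abs(min(bwt, 0.0))

# Toy 2-dataset setting (e.g., ms -> mb); the numbers are illustrative only.
R = np.array([[0.30, 0.00],    # model trained on the left dataset
              [0.25, 0.40]])   # model fine-tuned on the right dataset
R_oracle = np.array([0.30, 0.40])
R_bm25 = np.array([0.20, 0.35])
bwt = backward_transfer(R, R_oracle, R_bm25)
print(round(float(bwt), 3), round(remembering(bwt), 3))  # -0.25 0.75
```

With these toy numbers, the fine-tuned model loses 0.05 MAP on the first (left) dataset, giving a negative BWT after BM25 normalization and hence a REM below 1.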
To carry out a thorough analysis of each model-setting pair, we compute the following measures in addition to the REM/REM_N measures: 1) the MAP@100 performance ratio $PR = \frac{1}{n-1}\sum_{i=2}^{n}\frac{R_{i,i}}{R^*_{i,i}}$ of the model learned on the right dataset, normalized by the oracle model performance; 2) the relative improvement in MAP@100, ∆MAP (resp. ∆MAP_N), achieved with the ranking based on the global relevance score Score_G (resp. Score_NN) trained and tested on the left dataset, over the performance of the BM25 ranking obtained on the same testing dataset. Table 1 reports all the metric values for each model/setting pair. In line with this experiment's objective, we focus on the "Fine-tuning" columns. Looking first at the PR measure reported in Table 1, we notice that it is greater than 0.96 in 100% of the settings, showing that the fine-tuned models are successful on the right dataset, and thus allow a reliable investigation of catastrophic forgetting as outlined in previous work [38]. It is worth recalling that the general evaluation framework is based on a pre-ranking (using the BM25 model) which is expected to provide positive training instances from the left dataset to the neural ranking model being fine-tuned on the right dataset. The joint comparison of the REM (resp. REM_N) and ∆MAP (resp. ∆MAP_N) measures leads us to highlight the following statements:
• We observe that only the CEDR and VBERT models achieve positive improvements w.r.t. both the global ranking (∆MAP: +19.6%, +17.4% resp.) and the neural ranking (∆MAP_N: +29.2%, +25.8% resp.), particularly under the setting where mb is the left dataset (mb → c19). Both models are able to bring effectiveness gains additively to those brought by the exact-based matching signals in BM25. These effectiveness gains can be viewed as new knowledge in the form of semantic matching signals which are successfully transferred to the right dataset (c19) while maintaining stable performances on the left dataset (mb) (REM_N = 0.940 and 0.913 for resp. CEDR and VBERT). This result is consistent with previous work suggesting that the regularization used in transformer-based models has an effect of alleviating catastrophic forgetting [23].
• We notice that the CEDR model achieves positive improvements w.r.t. the neural ranking score (∆MAP_N: +14.2%) in all the settings (3/4) where ms is the left dataset, while very low improvements are achieved w.r.t. the global score (∆MAP: +2.6%). We make the same observation for the PACRR model, but only for 1/4 model-setting pairs (∆MAP_N: +10% vs. ∆MAP: 0%), with mb as the left dataset. Under these settings, we can see that, even though the exact-matching signals brought by the BM25 model are very moderate (leading to very few positive training instances), the CEDR and, to a lesser extent, the PACRR models are able to inherently bring significant new knowledge in terms of semantic matching signals, however at the cost of significant forgetting on the global ranking for CEDR (REM in the range [0.510; 0.826]) and on the neural ranking for PACRR (REM = 0.523).
• All the models (DRMM, PACRR, KNRM, and VBERT for 3/4 of the settings) that do not significantly beat the BM25 baseline, either by using the global score (∆MAP in the range [−12.1%; +2.2%]) or by using the neural score (∆MAP_N in the range [−89%; +0%]), achieve a near upper bound of remembering (both REM and REM_N are in the range [0.94; 1]).
Paradoxically, this result does not allow us to argue about the ability of these models to retain old knowledge. Indeed, the lack of, or even low, improvements over both the exact matching (using the BM25 model) and the semantic matching (using the neural model) indicate that a moderate amount of new knowledge, or even no knowledge, about effective relevance ranking has been acquired from the left dataset. Thus, the ranking performance of the fine-tuned model on the left dataset only depends on the level of mismatch between the data available in the right dataset for training and the test data in the left dataset. We can interestingly see that upper-bound remembering performance (REM = 1) is particularly achieved when ms is the left dataset (settings ms → c19, ms → mb, ms → mb → c19). This could be explained by the fact that the relevance matching signals learned by the neural model on in-domain knowledge (the right dataset) do not degrade its performances on general-domain knowledge (ms). Assuming a well-established practice in neural IR which consists in linearly interpolating the neural scores with the exact-based matching scores (e.g., BM25 scores), these observations give rise to three main findings: 1) the more a neural ranking model is inherently effective in learning additional semantic matching signals, the more likely it catastrophically forgets; in other terms, the intrinsic effectiveness of neural ranking models comes at the cost of forgetting; 2) transformer-based language models such as CEDR and VBERT exhibit a good balance between effectiveness and forgetting, as reported in previous work in NLP [38]; 3) given the variation observed in REM and REM_N, there is no clear trend about which ranking (BM25-based ranking vs. neural ranking) impacts more the overall level of catastrophic forgetting of the neural models.
Across-Dataset Analysis (RQ2). Our objective here is to identify catastrophic forgetting signals from the perspective of the left dataset. As argued in [1], understanding the relationships between data characteristics and catastrophic forgetting allows anticipating the choice of datasets in lifelong learning settings regardless of the neural ranking models. We fit a regression model to explain the REM metric (dependent variable) using nine dataset characteristics (independent variables). The latter are presented in Table 2 and include dataset-based measures inspired from [1, 9] and effectiveness-based measures using the BM25 model. To artificially generate datasets with varying data characteristics, we follow the procedure detailed in [1]: we sample queries within each left dataset of the settings presented in Table 1 (15 for mb and 50 for ms) to create sub-datasets composed of those selected queries and the top 100 corresponding documents retrieved by the BM25 model. Then, we replace in each setting the left dataset by the corresponding sub-dataset. We estimate for each model-setting pair the REM value as well as the characteristic values of the left sub-dataset. We repeat this procedure 300 times to obtain 300 new settings per model, based on the 300 sampled sub-datasets. This leads to 300 (sub-setting, model) pairs with variation in both the dependent and the independent variables. Finally, we build the following explanatory regression model, referring to the "across dataset analysis" in [1]:
$REM_{ij} = Constant + \sum_{k} C_k \, f_{ik} + Dataset_i + M_j$
where i denotes the i-th sub-setting and j refers to the neural ranking model M_j.
C_k and f_{ik} denote respectively the weight and the value of the k-th characteristic of the left dataset in the i-th sub-setting. Please note that the dataset feature values are independent of the model M_j. Dataset_i and M_j are the residual variables of resp. the left dataset and the model. The characteristic values f_{ik} are centered before the regression, as suggested in Adomavicius and Zhang [1]. Table 2 presents the results of the regression model. From R² and the Constant, we can see that our regression model can explain 54.4% of the variation of the REM metric, highlighting an overall good performance in explaining the remembering metric with a good level of prediction (0.7014). From the independent variables, we can infer that the difficulty of the dataset positively impacts the remembering (namely, decreasing the catastrophic forgetting). More precisely, the lower the relevance density (RD) and the BM25 effectiveness (MAP) are, and the higher the variation in BM25 performance over queries (std-AP) is, the higher the REM metric is. This suggests that relevance-matching difficulty provides positive feedback signals to the neural model to face diverse learning instances, and therefore to better generalize over different application domains. This is however true under the constraint that the vocabulary of the dataset (Vocab) is not too large, so as to boost neural ranking performance as outlined in [16, 36]. Looking at the residual variables (Dataset_i and M_j), we can corroborate the observations made at first glance in RQ1 regarding the model families, clearly opposing (DRMM, PACRR, KNRM, VBERT) to CEDR, since the former statistically exhibit higher REM values than CEDR.
EWC-based Lifelong Learning Analysis (RQ3). From RQ1, we observed that some models are more prone to the catastrophic forgetting problem than others. Our objective here is to examine whether an EWC-based lifelong strategy can mitigate the problem. It is worth mentioning that this objective has been targeted in previous research in computer vision but without establishing a consensus [24, 46, 48]. While some studies reveal that EWC outperforms domain adaptation strategies in their settings [24, 46], others found that it is less effective [48]. To achieve the experiment's objective, we particularly report the following measures in addition to the REM/REM_N measures: 1) ∆REM (resp. ∆REM_N), which reveals the improvement (positive or negative) of the REM (resp. REM_N) measure achieved using an EWC-based lifelong learning strategy over the one achieved using a fine-tuning strategy; 2) the PR measure introduced in Section 4.1. Unlike in RQ1, our aim through this measure here is to highlight the performance stability of the learned model on the right dataset while avoiding catastrophic forgetting on the left dataset. We now turn our attention to the "EWC-based lifelong learning" columns in Table 1. Our experiment results show that, among the 9 (resp. 11) settings that exhibit catastrophic forgetting in the combined model (resp. neural model), the EWC strategy allows improving 9/9, i.e., 100% (resp. 9/11, i.e., 82%) of them, in the range [+0.3%, +96.1%] (resp. [+3.3%, +79.7%]). Interestingly, this improvement in performance on the left dataset does not come at the cost of a significant decrease in performance on the right dataset, since 100% of the models achieve a PR ratio greater than 0.96.
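To make the mechanism behind these improvements concrete, here is a minimal PyTorch-style sketch of an EWC forget-cost term as described in the experimental setup. It is an illustrative sketch under common EWC assumptions (a diagonal Fisher estimate per previously seen dataset, assumed variable names), not the authors' exact OpenNIR integration.

```python
import torch

def ewc_penalty(model, fisher, old_params, lam=0.5):
    """Quadratic EWC forget cost: (lam / 2) * sum_p F_p * (theta_p - theta_p_old)^2.

    fisher     : dict {param_name: diagonal Fisher estimate} computed on the
                 left (previously learned) dataset.
    old_params : dict {param_name: parameter tensor} saved right after
                 training on the left dataset.
    """
    device = next(model.parameters()).device
    penalty = torch.zeros((), device=device)
    for name, param in model.named_parameters():
        if name in fisher:
            penalty = penalty + (fisher[name] * (param - old_params[name]) ** 2).sum()
    return (lam / 2.0) * penalty

# During fine-tuning on the right dataset, the pairwise ranking loss is simply
# augmented with this penalty before back-propagation:
#   loss = ranking_loss + ewc_penalty(model, fisher, old_params, lam=0.5)
#   loss.backward()
```

In practice, the diagonal Fisher estimates are typically obtained by averaging the squared gradients of the ranking loss over batches of the left dataset before moving on to the right one.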
Given, on the one hand, the high variability of the settings derived from the samples and, on the other hand, the very low number of settings (10%, i.e., 2/20) where a performance decrease is observed on the left dataset, we could argue that the EWC-based lifelong learning is not inherently impacted by the dataset order, leading to a general effectiveness trend over the models. We emphasize this general trend by particularly looking at the CEDR model which, we recall (see Section 4.1, RQ1), clearly exhibits the catastrophic forgetting problem. As can be seen from Table 1, model performances on the left datasets are significantly improved (+6.4% ≤ ∆REM ≤ +96.1%; 0% ≤ ∆REM_N ≤ +8.7%) while keeping model performances on the right dataset stable (0.961 ≤ PR ≤ 1.008). This property is referred to as the stability-plasticity dilemma [42]. To get a better overview of the effect of the EWC strategy, we compare in Figure 2 the behavior of the CEDR and KNRM models, which exhibit respectively a low level (REM = 0.510) and a high level (REM = 1) of remembering, particularly in the setting ms → mb. The loss curves in Figure 2(a) highlight a peak after the 20th epoch for both CEDR and KNRM. This peak denotes the beginning of the fine-tuning on the mb dataset. After this peak, we can observe that the curve representing the EWC-based CEDR loss (in purple) is slightly above the CEDR loss (in orange), while both curves for the KNRM model (green and blue, resp. with and without EWC) are overlaid. Combined with the statements outlined in RQ1 concerning the ability of the CEDR model to accumulate knowledge, this suggests that EWC is able to discriminate models prone to catastrophic forgetting and, when necessary, to relax the constraint of good ranking prediction on the dataset used for the fine-tuning to avoid over-fitting. This small degradation of knowledge acquisition during the fine-tuning on the mb dataset is carried out to the benefit of previous knowledge retention, maintaining retrieval performance on the ms dataset (Figure 2(b)). Thus, we can infer that the EWC strategy applied to neural ranking models fully plays its role in mitigating the trade-off between stability and plasticity.
We investigated the problem of catastrophic forgetting in neural-network based ranking models. We carried out experiments using 5 SOTA models and 3 datasets, showing that neural ranking effectiveness comes at the cost of forgetting and that transformer-based models allow a good balance between effectiveness and remembering. We also showed that the EWC-based strategy mitigates the catastrophic forgetting problem while ensuring a good trade-off between transferability and plasticity. Besides, datasets providing weak and varying relevance signals are likely to be remembered. While previous work in the IR community mainly criticized neural models regarding effectiveness [35, 40, 52], we provide complementary insights on the relationship between effectiveness and transferability in a lifelong learning setting that involves cross-domain adaptation. We believe that our study, even under limited setups, provides fair and generalizable results that could serve future research and system design in neural IR.
References
Impact of data characteristics on recommender systems performance
Progressive memory banks for incremental domain adaptation
MS MARCO: A human generated machine reading comprehension dataset
Deep learning of representations for unsupervised and transfer learning
Adaptive parameterization for neural dialogue generation
Lifelong Machine Learning, Second Edition
Cross domain regularization for neural ranking models using adversarial learning
Neural ranking models with weak supervision
How dataset characteristics affect the robustness of collaborative recommendation models
Don't forget, there is more than forgetting: New metrics for continual learning
Catastrophic forgetting in connectionist networks
An empirical investigation of catastrophic forgetting in gradient-based neural networks
In search of lost domain generalization
A deep relevance matching model for ad-hoc retrieval
Learning from dialogue after deployment: Feed yourself
On the effect of low-frequency terms on neural-IR models
PACRR: A position-aware neural IR model for relevance matching
When does data augmentation help generalization in NLP
Measuring catastrophic forgetting in neural networks
Adam: A method for stochastic optimization
Overcoming catastrophic forgetting in neural networks
Mixout: Effective regularization to finetune large-scale pretrained language models
Overcoming catastrophic forgetting by incremental moment matching
Learning through dialogue interactions by asking questions
Learning without forgetting
Overview of the TREC-2013 microblog track
Learning to rank for information retrieval
Representation learning using multi-task deep neural networks for semantic classification and information retrieval
OpenNIR: A complete neural ad-hoc ranking pipeline
CEDR: Contextualized embeddings for document ranking
Episodic memory in lifelong language learning
Towards a continuous knowledge learning engine for chatbots
Efficient estimation of word representations in vector space
An introduction to neural information retrieval. Foundations and Trends in Information Retrieval
An updated duet model for passage re-ranking
On the stability of fine-tuning BERT: Misconceptions, explanations, and strong baselines. arXiv e-prints
On the stability of fine-tuning BERT: Misconceptions, explanations, and strong baselines
How transferable are neural networks in NLP applications? In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing
Neural information retrieval: At the end of the early years
A survey on transfer learning
Continual lifelong learning with neural networks: A review
Open-domain conversational agents: Current progress, open problems, and future directions
Progressive neural networks
Overview of the TREC-2012 microblog track
Overcoming catastrophic forgetting during domain adaptation of neural machine translation
CORD-19: The COVID-19 open research dataset
Overcoming catastrophic forgetting problem by weight consolidation and long-term memory
Neural domain adaptation for biomedical question answering
End-to-end neural ad-hoc ranking with kernel pooling
Lifelong domain word embedding via meta-learning
Critically examining the "neural hype": Weak baselines and the additivity of effectiveness gains from neural ranking models
Data augmentation for BERT fine-tuning in open-domain question answering
How transferable are features in deep neural networks? In: NIPS'14

Acknowledgments. We would like to thank projects ANR COST (ANR-18-CE23-0016) and ANR JCJC SESAMS (ANR-18-CE23-0001) for supporting this work.