TemporalWiki: A Lifelong Benchmark for Training and Evaluating Ever-Evolving Language Models
Joel Jang, Seonghyeon Ye, Changho Lee, Sohee Yang, Joongbo Shin, Janghoon Han, Gyeonghun Kim, Minjoon Seo
2022-04-29

Language Models (LMs) become outdated as the world changes: they often fail to perform tasks requiring recent factual information that was absent or different during training, a phenomenon called temporal misalignment. This is an especially challenging problem because the research community still lacks a coherent dataset for assessing the adaptability of LMs to frequently updated knowledge corpora such as Wikipedia. To this end, we introduce TemporalWiki, a lifelong benchmark for ever-evolving LMs that utilizes the difference between consecutive snapshots of English Wikipedia and English Wikidata for training and evaluation, respectively. The benchmark hence allows researchers to periodically track an LM's ability to retain previous knowledge and acquire updated or new knowledge at each point in time. We also find that training an LM on the diff data through continual learning methods achieves similar or better perplexity than training on the entire snapshot in our benchmark, at 12 times less computational cost, which verifies that factual knowledge in LMs can be safely updated with minimal training data via continual learning. The dataset and the code are available at https://github.com/joeljang/temporalwiki.

Large Language Models (LMs) pretrained on vast text corpora have been shown to be highly effective when finetuned or prompted to perform various downstream tasks (Raffel et al., 2019; Brown et al., 2020; Sanh et al., 2022; Wei et al., 2022). However, most of the datasets used to evaluate these LMs are static benchmarks: the train and test data both come from similar points in time. In the real world, by contrast, factual knowledge is frequently changed, added, or deprecated. For example, suppose a language model is asked what the most dominant coronavirus variant is (Figure 1). The answer would have been the Delta variant in the fall of 2021 but changed to the Omicron variant near the end of 2021. If LMs remain unchanged and are not periodically trained to cope with the changing world, they will become outdated very quickly. This means that downstream tasks that directly depend on or are finetuned from the LM will suffer from temporal misalignment (Luu et al., 2022; Lazaridou et al., 2021), which refers to the misalignment in time between the train and test data.

Temporal misalignment becomes a critical problem especially when language models are used for knowledge-intensive tasks such as closed-book question answering (Roberts et al., 2020; Petroni et al., 2021; Jang et al., 2022), since such models rely solely on the knowledge stored in their parameters. Furthermore, LMs augmented with retrieval mechanisms (Guu et al., 2020; Lewis et al., 2020; Borgeaud et al., 2021) often suffer from hallucination even when they successfully retrieve up-to-date information (Zhang and Choi, 2021; Longpre et al., 2021). This means that the implicit knowledge stored in the model parameters has to be updated as well, because it may conflict with the explicit knowledge retrieved from external sources such as up-to-date knowledge bases and ultimately cause the LM to hallucinate. Recently, Lazaridou et al.
(2021) and Jang et al. (2022) have explored updating the internal knowledge of LMs through continual pretraining on new and updated data as a solution for mitigating temporal misalignment. However, these datasets are still static in nature: as the world changes, they will eventually become outdated as well. In order to comprehensively measure the capability of ever-evolving LMs to address temporal misalignment, automated periodic evaluation of the LMs is crucial.

In this paper, we introduce TEMPORALWIKI, a lifelong benchmark for training and evaluating ever-evolving LMs in a periodic and automated manner, shown in Figure 1. The corpora used for updating the LMs and the datasets used for evaluating them are constructed automatically from the differences between consecutive snapshots of Wikipedia and Wikidata. Our main findings and contributions are as follows:

• We find that continually training LMs only on the updated portion of English Wikipedia, which we call temporal language modeling, is much more efficient than updating LMs on entire English Wikipedia snapshots in terms of both computation and the stability-plasticity trade-off. It remains a challenging task, however, due to catastrophic forgetting, especially when multiple updates are required.

• As competitive baselines for temporal language modeling, we implement previous continual learning approaches that mitigate forgetting while bolstering the learning of new knowledge, thus providing an overall enhancement in terms of both stability and plasticity.

We hope that TEMPORALWIKI will foster future research towards training ever-evolving LMs.

Recent works have introduced the need to tackle the issue of temporal misalignment, which refers to neural networks showing poor performance due to misalignment in time between the train and test data. Temporal misalignment can be caused either by (1) the dynamic nature of language (Röttger and Pierrehumbert, 2021; Hombaiah et al., 2021; Rosin et al., 2021; Loureiro et al., 2022) or (2) the update of factual information (Dhingra et al., 2021; Jang et al., 2022). Luu et al. (2022) have emphasized the effect of temporal misalignment on eight different NLP downstream tasks, asserting that misalignment between the train and test sets of the downstream tasks causes severe performance degradation that can be mitigated by finetuning on a corpus from the target period. Agarwal and Nenkova (2021) have argued this to be less of a concern when utilizing representations from pretrained LMs, and show that self-labeling on the downstream task is more effective than continued pretraining on more recent data for temporal adaptation. Note that these works have focused on misalignment caused by the dynamic nature of language on tasks that are not knowledge-intensive, such as text classification.

Others have tackled the problem of temporal misalignment caused by the update of factual knowledge. Lazaridou et al. (2021) have shown that LMs deteriorate significantly in performance when there is a misalignment in time between the pretraining data and the downstream task, and argued that ever-evolving LMs are necessary. Dhingra et al. (2021) have proposed explicitly including time information during pretraining as a potential solution. Jang et al. (2022) and Jin et al. (2021) have implemented continual learning methods to mitigate the catastrophic forgetting that occurs during continued pretraining on new data. Despite the recent surge of interest in ever-evolving LMs, the community still lacks widely available resources to train and evaluate such LMs.
Previous works have introduced benchmarks built from data sources such as Twitter feeds (Osborne et al., 2014; Yogatama et al., 2014; Loureiro et al., 2022), recent news articles (Jang et al., 2022), and arXiv papers (Lazaridou et al., 2021), on which the temporal adaptability of LMs and the effectiveness of different methodologies for updating LMs can be evaluated. However, these data sources are domain-specific and inherently static. On the other hand, Wikipedia and Wikidata are known to be great sources of general world knowledge and have thus been widely used by the community (Dinan et al., 2019; Thorne et al., 2018; Kwiatkowski et al., 2019; Piktus et al., 2021). 120K volunteer editors make 120 updates to the English Wikipedia per minute and add hundreds of new article entries every day (Logan IV et al., 2021). Even though not every Wikipedia or Wikidata update corresponds to an actual change in the real world, TEMPORALWIKI leverages the dynamic nature of Wikipedia and Wikidata to provide a lifelong benchmark for developing and maintaining ever-evolving LMs.

In this section, we delve into the process of creating TEMPORALWIKI, which is comprised of training corpora (TWIKI-DIFFSETS) and evaluation datasets (TWIKI-PROBES), constructed by comparing consecutive snapshots of English Wikipedia and English Wikidata, respectively (for brevity, we omit 'English' throughout the paper). Moreover, we clarify that not all Wikipedia/Wikidata updates equate to actual updates of world knowledge. In Section 3.1, we first describe the process of constructing the training corpora from Wikipedia snapshots. Then in Section 3.2, we describe the process of generating the evaluation datasets from Wikidata snapshots. In Section 3.3, we describe the quality control applied to the evaluation datasets, including the alignment of instances from Wikidata with Wikipedia. Lastly, in Section 3.4, we briefly discuss the current limitations of TEMPORALWIKI.

In terms of computational resources, it is highly inefficient to train an LM on the entire Wikipedia snapshot every time the LM requires updates, since most of Wikipedia is unchanged from the previous snapshot. Moreover, it is not certain whether updating the LM on the entire Wikipedia snapshot is the best approach for updating the factual knowledge stored in the LM. Therefore, we compute the differences between consecutive Wikipedia snapshots in order to use only updated and new text for training. We call these subsets TWIKI-DIFFSETS. Algorithm 1 shows the procedure for generating them: a single TWIKI-DIFFSET is generated by taking the differences (similarly to git diff) between two consecutive Wikipedia snapshots.

Algorithm 1: Generating a TWIKI-DIFFSET
  Require: Wikipedia snapshots WP_prev and WP_recent, where WP_recent is more recent;
           each article in a snapshot has attributes id and text
  D := an empty array to store new and updated data
  for all articles a_r in WP_recent do
      if a_r.id = a_p.id for some article a_p in WP_prev then
          D.append(GetDiff(a_p, a_r))
      else
          D.append(a_r)
      end if
  end for

  function GetDiff(a_p, a_r)
      Diff := an empty string to collect the differences between the two articles
      for all paragraphs p_r in a_r.text do
          if p_r has no matching sentences with any paragraph p_p in a_p.text then
              Diff := Diff + p_r
          else if p_r has some matching and some differing sentences with a paragraph p_p in a_p.text then
              Diff := Diff + (the sentences that differ between p_r and p_p)
          end if
      end for
      return Diff
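For concreteness, the following is a minimal Python sketch of this diffing procedure. It assumes each snapshot is a dict mapping article ids to lists of paragraphs, pools the previous article's sentences across paragraphs for simplicity, and uses a naive sentence splitter; all names are illustrative assumptions rather than the released code.

```python
import re

def split_sentences(paragraph):
    # Naive sentence splitter; the actual pipeline may use a proper tokenizer.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]

def get_diff(prev_text, recent_text):
    """Return new/updated text of an article relative to its previous version.

    prev_text and recent_text are lists of paragraphs (strings).
    """
    prev_sents = {s for p in prev_text for s in split_sentences(p)}
    diff = []
    for paragraph in recent_text:
        sents = split_sentences(paragraph)
        if not any(s in prev_sents for s in sents):
            diff.append(paragraph)  # no matching sentences: keep the whole paragraph
        else:
            # Some sentences match: keep only the sentences that differ.
            diff.extend(s for s in sents if s not in prev_sents)
    return diff

def build_twiki_diffset(wp_prev, wp_recent):
    """wp_prev / wp_recent: dicts mapping article id -> list of paragraphs."""
    diffset = []
    for article_id, recent_text in wp_recent.items():
        if article_id in wp_prev:
            diffset.extend(get_diff(wp_prev[article_id], recent_text))
        else:
            diffset.extend(recent_text)  # entirely new article: keep all of it
    return diffset
```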
If an article with a new unique id is included in the recent snapshot, we append the entire article to the TWIKI-DIFFSET. For an article whose id already exists in the previous snapshot, we compare the two versions paragraph by paragraph and add new or updated sentences to the TWIKI-DIFFSET. Examples of TWIKI-DIFFSETS are shown in Figure 2, and detailed statistics are given in Section 4.

In this work, the main objective of continually pretraining LMs is to add and update the factual knowledge stored implicitly in the parameters of LMs. The success of an LM update can be evaluated by quantifying the stability-plasticity dilemma (Mermillod et al., 2013): the dilemma that artificial and biological neural systems must sacrifice either stability, the ability to retain learned knowledge, or plasticity, the ability to acquire new knowledge. In order to evaluate whether each update is successful, we need evaluation datasets that can quantify the amount of changed (updated or new) knowledge successfully gained (plasticity) and the amount of knowledge that remains unchanged as intended after the LM update (stability). Therefore, we categorize factual instances from Wikidata snapshots that are temporally aligned with Wikipedia snapshots and call the resulting datasets TWIKI-PROBES.

Wikidata snapshots are structured knowledge graphs that store factual information in the form of (Subject, Relation, Object) triples such as (Barack Obama, born-in, Hawaii). These factual instances can be used to probe the LM for factual knowledge (Petroni et al., 2019). Through Algorithm 2, we label each factual instance as either UNCHANGED or CHANGED.

Algorithm 2: Categorizing factual instances
  Require: Wikidata snapshots WD_prev and WD_recent, where WD_recent is more recent
  Un, C := empty arrays that store UNCHANGED and CHANGED factual instances, respectively
  for all facts (s_r, r_r, o_r) in WD_recent do
      if (s_r, r_r, o_r) also appears in WD_prev then
          Un.append((s_r, r_r, o_r))
      else
          C.append((s_r, r_r, o_r))
      end if
  end for

As shown in Algorithm 2, a single TWIKI-PROBE is constructed from two consecutive Wikidata snapshots; it is used to evaluate an LM updated with the TWIKI-DIFFSET constructed from the two consecutive Wikipedia snapshots with the same timestamps. Algorithm 2 categorizes instances with a new Relation, or with the same Relation but a new Object, into CHANGED, and unchanged instances into UNCHANGED.

We apply several quality control steps to the categorized factual instances from Section 3.2 (Algorithm 2) so that they best represent the actual change of knowledge from the LM update.

Alignment with TWIKI-DIFFSETS We ensure correct alignment of CHANGED factual instances with articles in TWIKI-DIFFSETS and of UNCHANGED factual instances with articles from the entire Wikipedia, since Wikidata updates do not necessarily entail Wikipedia updates and vice versa. In order to do this, we take three steps. Step #1: We crawl information from each Wikipedia article page to find the mapping to the corresponding Wikidata entity id and store the information as a dictionary. Step #2: For each factual instance from CHANGED, we check whether the Subject id can be mapped to an article from TWIKI-DIFFSETS using the dictionary of id mappings. Likewise, for each instance from UNCHANGED, we check whether the Subject id can be mapped to an article from Wikipedia. Step #3: Lastly, for each factual instance successfully mapped in Step #2 (whether CHANGED or UNCHANGED), we keep only the instances whose Object appears in the text of the article.
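A hedged Python sketch of this alignment filter follows. It assumes the dictionary built in Step #1 maps Wikidata Subject ids to (article id, article text) pairs; all field and variable names are illustrative assumptions, not the benchmark's actual code.

```python
def align_probes(instances, wikidata_to_article, diffset_ids, full_ids):
    """Keep only factual instances whose Subject maps to an aligned article
    and whose Object appears verbatim in that article's text.

    instances: list of dicts with keys 'subject_id', 'object', 'changed' (bool)
    wikidata_to_article: dict mapping Wikidata entity id -> (article_id, article_text)
    diffset_ids / full_ids: sets of article ids in TWiki-Diffsets / full Wikipedia
    """
    aligned = []
    for inst in instances:
        mapped = wikidata_to_article.get(inst["subject_id"])  # Step #1 lookup
        if mapped is None:
            continue
        article_id, article_text = mapped
        # Step #2: CHANGED must map into the diffset, UNCHANGED into full Wikipedia.
        pool = diffset_ids if inst["changed"] else full_ids
        if article_id not in pool:
            continue
        # Step #3: the Object string must occur in the article text.
        if inst["object"] in article_text:
            aligned.append(inst)
    return aligned
```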
Heuristic Filtering In addition to the alignment with TWIKI-DIFFSETS, we apply three heuristic filtering rules to further ensure the quality of the evaluation datasets. Rule #1: We remove instances where either the Subject or the Object is a substring of the other. Rule #2: We remove instances where the Object contains more than 5 words. Rule #3: We limit the number of highly frequent entities and relations: each Subject entity may account for less than 1% of instances, and each Object entity and each Relation for less than 5%. Table 1 shows some examples of TWIKI-PROBES after quality control.

It is important to note that, as mentioned at the beginning of this section, not every Wikipedia and Wikidata update aligns with an actual change in the real world. For instance, an update in Wikidata or Wikipedia may reflect an event that actually happened in the distant past, and some changes in the real world may not be immediately reflected in the next snapshots. Moreover, one aspect that is not covered in this work is knowledge deletion. While maintaining Wikipedia and Wikidata, volunteer editors not only update or add new information but also delete information that is incorrect or misinformed. As removing the misinformation and bias stored in LMs is an important issue and necessary for truly ever-evolving LMs, future work should address this aspect by utilizing deleted information from general knowledge sources such as Wikipedia.

In this paper, we construct TEMPORALWIKI from the 08.2021 to 12.2021 snapshots; its statistics are discussed below.

Training Corpora Statistics Statistics of the Wikipedia snapshots and TWIKI-DIFFSETS are shown in Table 2. An interesting aspect of TWIKI-DIFFSETS is that the amount of information being updated and added (i.e., the number of tokens in each subset) is similar for each month. The statistics of TWIKI-PROBES after the initial categorization of Algorithm 2 and after quality control are shown in Table 3. For further analysis, we break down the entity types of Subject and Object and observe a similar proportion of each entity category for each month of TWIKI-PROBES (Appendix A). We also show the distribution of the 30 most frequent Relations of UNCHANGED and CHANGED (Appendix B).

In this section, we train and evaluate ever-evolving LMs with TEMPORALWIKI, which consists of TWIKI-DIFFSETS and TWIKI-PROBES. Section 5.1 describes the experimental settings. Section 5.2 describes the baseline methodologies for updating LMs. Section 5.3 shows evaluation results on the training corpora. Section 5.4 presents the experimental results on TWIKI-PROBES.

For our experiments, we continue pretraining GPT-2 Large (Radford et al., 2019) (774M parameters), which serves as our baseline language model (LM). We first compare the baseline performances of updating GPT-2 with TWIKI-DIFFSETS against updating it with entire Wikipedia snapshots, and evaluate each update using TWIKI-PROBES. (This amounts to roughly 2.8 billion factual instances; since most instances from Algorithm 2 are categorized into UNCHANGED, we randomly sample 0.1% of the factual instances after applying Algorithm 2.) We also implement continual learning methods from the literature known for mitigating the catastrophic forgetting that occurs when updating GPT-2 with only TWIKI-DIFFSETS. Further detailed configuration of the experimental settings is provided in Appendix C.

Here we describe the baseline methods used for training and evaluation, namely INITIAL, FULL, DIFF, RECADAM, MIX-REVIEW, K-ADAPTER, and LORA, as shown in Tables 4 and 5.
Initial As the starting model checkpoint for all of the experiments, we take the initially pretrained GPT-2 from Radford et al. (2019) and continue pretraining it on the 08.2021 Wikipedia snapshot for four epochs in total (around 546K global steps) so that the initial GPT-2 used for all of the experiments is updated with the last two years of world knowledge. We denote this checkpoint as INITIAL; it serves as the initial checkpoint for all of the other methods.

Full We start from INITIAL and continue pretraining it on the entire Wikipedia snapshot of each month in a sequential manner. For example, after training on the 09.2021 Wikipedia snapshot from INITIAL, we continue training it on the 10.2021 Wikipedia snapshot and then move on to the next snapshot. We denote the resulting model as FULL. We iterate through the training data only once, which corresponds to an average of 4.6 billion token updates (140K global steps) for each month.

Diff We start from INITIAL and continue pretraining it on TWIKI-DIFFSETS in a sequential manner. We denote the resulting model as DIFF. As with FULL, we iterate through the training data only once, which is an average of 347 million token updates (12K global steps) for each month.

RecAdam We implement RECADAM (Chen et al., 2020), a regularization-based continual learning method for training large LMs that places a stronger independence assumption among the model parameters, overcoming the limitations of applying traditional methods such as EWC (Kirkpatrick et al., 2017) to large language models. We set the hyperparameters of the optimizer identical to the original implementation.

Mix-Review We implement a rehearsal-based continual learning method for training large LMs called MIX-REVIEW (He et al., 2021), which mixes in random subsets of the initial pretraining data (the 08.2021 Wikipedia data). We fix the mix ratio to 2 in our experiments.

LoRA We implement a parameter-expansion-based continual learning method called LORA (Hu et al., 2021), which freezes the original parameters while adding trainable rank-decomposition matrices to each layer. We use hyperparameters identical to the optimal setting of the original implementation.

K-Adapter We implement another parameter-expansion-based continual learning method, K-ADAPTER (Wang et al., 2021), which freezes the original parameters while adding additional adapters (an increase of 103M parameters) to the LM. (We add the additional parameters once, for the updates from 08.2021; exploring the optimal interval for adding parameters to ever-evolving LMs is left for future work.)

We first perform intrinsic evaluation by measuring the perplexity of the baseline models on their training corpora. For each month, we measure the model's perplexity on TWIKI-DIFFSETS and NON-TWIKI-DIFFSETS, where the latter refers to the subset of the month's entire Wikipedia snapshot that does not include the data from TWIKI-DIFFSETS. We sample 10,000 input instances from each subset with a fixed length of 512 and measure the perplexity on proper noun tokens determined by a Part-of-Speech (POS) tagger (Honnibal and Montani, 2017), as in Lazaridou et al. (2021); such tokens can be considered a proxy for tokens containing factual knowledge. The result on NON-TWIKI-DIFFSETS is therefore meant to indicate the performance on unchanged knowledge, while the result on TWIKI-DIFFSETS corresponds to updated and new knowledge. Figure 3 shows the relative perplexity of each baseline method compared to INITIAL (i.e., each model's perplexity divided by that of INITIAL, so lower is better).
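As an illustration of this protocol, the following hedged Python sketch scores perplexity only on proper-noun tokens, using spaCy for POS tagging and Hugging Face Transformers for scoring. The model name, truncation length, and span-overlap heuristic are illustrative assumptions; the paper's exact implementation may differ.

```python
import math
import spacy  # requires: python -m spacy download en_core_web_sm
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

nlp = spacy.load("en_core_web_sm")
tok = GPT2TokenizerFast.from_pretrained("gpt2-large")
model = GPT2LMHeadModel.from_pretrained("gpt2-large").eval()

def propn_perplexity(text):
    """Perplexity restricted to tokens overlapping proper-noun spans."""
    # Character spans of proper nouns according to the POS tagger.
    doc = nlp(text)
    spans = [(t.idx, t.idx + len(t.text)) for t in doc if t.pos_ == "PROPN"]
    enc = tok(text, return_offsets_mapping=True, return_tensors="pt",
              truncation=True, max_length=512)
    ids = enc["input_ids"]
    with torch.no_grad():
        logits = model(ids).logits
    # Token-level negative log-likelihoods (next-token prediction).
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    nll = -logprobs[torch.arange(ids.size(1) - 1), ids[0, 1:]]
    # Keep positions whose character span overlaps a proper-noun span.
    offsets = enc["offset_mapping"][0][1:]  # aligned with predicted tokens
    keep = [i for i, (s, e) in enumerate(offsets.tolist())
            if any(s < pe and ps < e for ps, pe in spans)]
    return math.exp(nll[keep].mean().item()) if keep else float("nan")
```

The relative perplexity reported in Figure 3 is then simply this value divided by the corresponding value for INITIAL.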
Results on NON-TWIKI-DIFFSETS show that the relative perplexity of DIFF increases rapidly while that of FULL remains constant as time goes on, which implies that forgetting occurs when the LM is trained with TWIKI-DIFFSETS. The relative perplexities of the continual learning methods increase less rapidly than that of DIFF, which means that applying continual learning mitigates catastrophic forgetting. MIX-REVIEW, especially, shows the least forgetting among the continual learning methods, indicating that training on the past corpus is effective in retaining performance on the previous training corpora in terms of perplexity.

On the other hand, the results on TWIKI-DIFFSETS show the opposite trend: the relative perplexity of DIFF is much lower than that of FULL. One thing to note is that the perplexity of FULL is very similar to that of INITIAL on TWIKI-DIFFSETS, which suggests that updating LMs on entire Wikipedia snapshots hinders the effective learning of changed data compared to DIFF, despite both having seen the same instances of TWIKI-DIFFSETS during training for the same number of iterations. Among the continual learning methods, K-ADAPTER and LORA show higher overall perplexities than DIFF, while MIX-REVIEW and RECADAM show perplexity similar to DIFF on TWIKI-DIFFSETS.

Table 4: Zero-shot perplexity of LMs measured on TWIKI-PROBES. Time represents the average training time of a single update under the setting described in Section 5.1. Each baseline model is described in Section 5.2. Best performance is marked in bold; second best is underlined.

Figure 4: Average overall perplexity on TWIKI-PROBES. We average the perplexities of UNCHANGED and CHANGED with equal importance placed on stability and plasticity. The x-axis depicts the two-month intervals. A lower score indicates better performance.

Performing only intrinsic evaluation on the training corpora is not sufficient, because intrinsic evaluation by itself only tests the LMs' capability for memorization (McCoy et al., 2021). Through extrinsic evaluation with TWIKI-PROBES (Section 3.2), we specifically focus on evaluating the factual knowledge of the LMs after each update. Placing equal importance on stability (UNCHANGED) and plasticity (CHANGED), we show the average of the perplexities of UNCHANGED and CHANGED, as well as the individual perplexities, in Table 4, and a bar graph of the average perplexities in Figure 4. (The perplexities of UNCHANGED and CHANGED are each calculated as the average perplexity of generating each factual instance.)

As shown in Table 4, DIFF and all continual learning methods show better overall performance on CHANGED factual instances than INITIAL in all months, bolstering the results from the intrinsic evaluation. For UNCHANGED, however, DIFF suffers from catastrophic forgetting, showing consistent performance degradation as the number of updates increases. In contrast, continual learning methods effectively mitigate much of the catastrophic forgetting during temporal language modeling, resulting in lower perplexity on UNCHANGED, except for RECADAM, which performs worse as the number of updates increases. K-ADAPTER, especially, shows surprising results on UNCHANGED, outperforming even FULL throughout all of the months.
Moreover, all continual learning methods surpass or are on par with DIFF on CHANGED factual instances, showing that the ability to learn new knowledge (plasticity) is not sacrificed to preserve previous knowledge (stability). In addition, as shown in the average perplexity column of Table 4 and in Figure 4, K-ADAPTER shows the most robust performance throughout the time periods. It is important to note that K-ADAPTER is around 12 times more computationally efficient than FULL in terms of total training time under the same computational constraint. DIFF also outperforms FULL in all months but the 10.2021-11.2021 interval, showing that temporal language modeling itself is an effective approach for the overall stability-plasticity trade-off. We note that, as also shown in previous works (Lazaridou et al., 2021), the results in Table 4 present overall high perplexity (>200) because the sentences in TWIKI-PROBES are not natural sentences; they are factual phrases synthetically generated by a naive concatenation of Subject, Relation, and Object. We discuss experiments with light-tuning as an alternative in Appendix D.

We quantify the effect of temporal misalignment on each method by training the LMs and evaluating their zero-shot perplexity on CHANGED instances of TWIKI-PROBES with various time intervals between training and evaluation. Among the continual learning methods, we select K-ADAPTER since it shows the most robust performance in the extrinsic evaluation across all time periods. As shown in Figure 5, the FULL method is mostly influenced by the number of training updates and not much by whether there is temporal alignment. Since FULL is continually pretrained on the entire Wikipedia corpus in each month, it has likely seen the data containing CHANGED factual instances multiple times, leading to lower perplexity as the number of training steps increases. For DIFF and K-ADAPTER, there is a general trend of strong performance when there is temporal alignment (diagonal entries), outperforming FULL with far fewer global training steps. It is important to note that K-ADAPTER shows robustness against temporal misalignment, i.e., its perplexity does not increase much even when the training and evaluation months do not match, compared to DIFF, which suffers a more severe perplexity spike.

In this paper, we provide answers to the four questions proposed in Section 1. (1) How can we train ever-evolving LMs efficiently and automate the evaluation of each update? We introduce TEMPORALWIKI, a lifelong benchmark that can be used for training and evaluating ever-evolving language models (LMs) in an automated manner. It consists of TWIKI-DIFFSETS as the training corpora for temporal language modeling and TWIKI-PROBES as the evaluation datasets for measuring the stability-plasticity trade-off at each LM update. (2) How does updating LMs only on new and updated data compare to updating LMs on entire snapshots?

A Details of Entity Types of Subject and Object

Figure 6 shows the ratio of different entity types of Subject and Object for UNCHANGED and CHANGED.

B Distribution of Relations

The distribution of Relations for UNCHANGED and CHANGED factual instances in TWIKI-PROBES is shown in Figure 7.

C Experimental Details

For each LM update, we use eight 32GB V100 GPUs with a global batch size of 64 and a fixed input sequence length of 512. We use a maximum learning rate of 1e-4 and the one-cycle learning rate scheduling policy (Smith, 2018). For light-tuning, training is done for only one epoch with a learning rate of 1e-5 and a batch size of 32. Input and output sequence lengths are set to 25.
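To make the schedule concrete, a minimal PyTorch sketch of this update configuration is shown below. The optimizer choice and the step count are illustrative assumptions (the step count roughly matches a single monthly DIFF update from Section 5.2), not values taken from the released code.

```python
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2-large")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# One-cycle policy (Smith, 2018): warm up to max_lr, then anneal.
total_steps = 12_000  # roughly one monthly Diff update; illustrative
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-4, total_steps=total_steps
)

def train_step(batch):
    # batch: dict with input_ids / attention_mask / labels tensors of length 512
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
    return loss.item()
```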
D Light-Tuning

For the continual learning-based methods, we unfreeze all of the parameters during light-tuning, following Jang et al. (2022). To alleviate the distributional shift that causes high zero-shot perplexity, we light-tune the LMs on 500 instances randomly sampled from Wikidata that do not overlap with instances from TWIKI-PROBES (details in Appendix E). Unlike finetuning, light-tuning lets the LM learn only the input and output distribution of the task, avoiding the problem of test-train overlap pointed out by Lewis et al. (2021). Table 5 shows the results of light-tuning, which demonstrate a similar trend to the zero-shot performance. Although light-tuning avoids the problem of test-train overlap, the results are largely affected by the instances sampled for tuning, so a zero-shot evaluation setting is preferred for reliability.

Many knowledge-intensive tasks such as closed-book question answering (Roberts et al., 2020; Petroni et al., 2021; Jang et al., 2022) or slot filling (Petroni et al., 2021) use accuracy, EM, or F1 score for evaluation. We also show the F1 score on TWIKI-PROBES in Table 6. The overall trend is consistent with the zero-shot perplexity metric; K-ADAPTER shows robust performance for both UNCHANGED and CHANGED.

E Details of Light-Tuning Datasets

We sample 500 instances from Wikidata for each time step that do not overlap with instances from TWIKI-PROBES, for each factual instance category. During sampling, we keep the distribution of each Relation proportional to the original distribution. Table 7 shows the size and Relation distribution of the light-tuning datasets.

References

Agarwal and Nenkova (2021). Temporal effects on pre-trained models for language processing tasks.
Borgeaud et al. (2021). Improving language models by retrieving from trillions of tokens.
Brown et al. (2020). Language models are few-shot learners.
Chen et al. (2020). Recall and learn: Fine-tuning deep pretrained language models with less forgetting.
Chen et al. (2021). A dataset for answering time-sensitive questions.
Dhingra et al. (2021). Time-aware language models as temporal knowledge bases.
Dinan et al. (2019). Wizard of Wikipedia: Knowledge-powered conversational agents.
Guu et al. (2020). REALM: Retrieval-augmented language model pre-training.
He et al. (2021). Analyzing the forgetting problem in pretrain-finetuning of open-domain dialogue response models.
Hombaiah et al. (2021). Dynamic language models for continuously evolving content.
Honnibal and Montani (2017). spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing.
Hu et al. (2021). LoRA: Low-rank adaptation of large language models.
Jang et al. (2022). Towards continual knowledge learning of language models.
Jin et al. (2021). Lifelong pretraining: Continually adapting language models to emerging corpora.
Kirkpatrick et al. (2017). Overcoming catastrophic forgetting in neural networks.
Kwiatkowski et al. (2019). Natural Questions: A benchmark for question answering research.
Lazaridou et al. (2021). Mind the gap: Assessing temporal generalization in neural language models.
Lewis et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks.
Lewis et al. (2021). Question and answer test-train overlap in open-domain question answering datasets.
Logan IV et al. (2021). FRUIT: Faithfully reflecting updated information in text.
Longpre et al. (2021). Entity-based knowledge conflicts in question answering.
Loureiro et al. (2022). TimeLMs: Diachronic language models from Twitter.
Luu et al. (2022). Time waits for no one! Analysis and challenges of temporal misalignment.
McCloskey and Cohen (1989). Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation.
McCoy et al. (2021). How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN.
Mermillod et al. (2013). The stability-plasticity dilemma: Investigating the continuum from catastrophic forgetting to age-limited learning effects.
Osborne et al. (2014). Exponential reservoir sampling for streaming language models.
Petroni et al. (2021). KILT: A benchmark for knowledge intensive language tasks.
Petroni et al. (2019). Language models as knowledge bases?
Piktus et al. (2021). The web is your oyster: Knowledge-intensive NLP against a very large web corpus.
Radford et al. (2019). Language models are unsupervised multitask learners.
Raffel et al. (2019). Exploring the limits of transfer learning with a unified text-to-text transformer.
Roberts et al. (2020). How much knowledge can you pack into the parameters of a language model?
Rosin et al. (2021). Time masking for temporal language models.
Röttger and Pierrehumbert (2021). Temporal adaptation of BERT and performance on downstream document classification: Insights from social media.
Sanh et al. (2022). Multitask prompted training enables zero-shot task generalization.
Smith (2018). A disciplined approach to neural network hyper-parameters: Part 1: Learning rate, batch size, momentum, and weight decay.
Thorne et al. (2018). FEVER: A large-scale dataset for fact extraction and verification.
Wang et al. (2021). K-Adapter: Infusing knowledge into pre-trained models with adapters.
Yogatama et al. (2014). Dynamic language models for streaming text.
Zhang and Choi (2021). SituatedQA: Incorporating extra-linguistic contexts into QA.