key: cord-0540307-taftm4zu authors: Liu, Zihan; Xu, Yan; Yu, Tiezheng; Dai, Wenliang; Ji, Ziwei; Cahyawijaya, Samuel; Madotto, Andrea; Fung, Pascale title: CrossNER: Evaluating Cross-Domain Named Entity Recognition date: 2020-12-08 journal: nan DOI: nan sha: 5cb87cd3b1feb8f39e565b1d054d37a3cf38b66c doc_id: 540307 cord_uid: taftm4zu

Cross-domain named entity recognition (NER) models are able to cope with the scarcity issue of NER samples in target domains. However, most of the existing NER benchmarks lack domain-specialized entity types or do not focus on a certain domain, leading to a less effective cross-domain evaluation. To address these obstacles, we introduce a cross-domain NER dataset (CrossNER), a fully-labeled collection of NER data spanning five diverse domains with specialized entity categories for different domains. Additionally, we provide a domain-related corpus, since using it to continue pre-training language models (domain-adaptive pre-training) is effective for domain adaptation. We then conduct comprehensive experiments to explore the effectiveness of leveraging different levels of the domain corpus and pre-training strategies to do domain-adaptive pre-training for the cross-domain task. Results show that focusing on the fractional corpus containing domain-specialized entities and utilizing a more challenging pre-training strategy in domain-adaptive pre-training are beneficial for NER domain adaptation, and our proposed method can consistently outperform existing cross-domain NER baselines. Nevertheless, experiments also illustrate the challenge of this cross-domain NER task. We hope that our dataset and baselines will catalyze research in the NER domain adaptation area. The code and data are available at https://github.com/zliucr/CrossNER.

Named entity recognition (NER) is a key component in text processing and information extraction. Contemporary NER systems rely on numerous training samples (Ma and Hovy 2016; Lample et al. 2016; Chiu and Nichols 2016; Dong et al. 2016; Yadav and Bethard 2018), and a well-trained NER model could fail to generalize to a new domain due to the domain discrepancy. However, collecting large amounts of data samples is expensive and time-consuming. Hence, it is essential to build cross-domain NER models that possess transferability to quickly adapt to a target domain by using only a few training samples. Existing cross-domain NER studies (Yang, Salakhutdinov, and Cohen 2017; Jia, Liang, and Zhang 2019; Jia and Zhang 2020) consider the CoNLL2003 English NER dataset (Tjong Kim Sang and De Meulder 2003) from Reuters News as the source domain, and utilize NER datasets from Twitter (Derczynski, Bontcheva, and Roberts 2016; Lu et al. 2018), biomedicine (Nédellec et al. 2013) and CBS SciTech News (Jia, Liang, and Zhang 2019) as target domains. However, we find two drawbacks in utilizing these datasets for cross-domain NER evaluation. First, most target domains are either close to the source domain or not narrowed down to a specific topic or domain. Specifically, the CBS SciTech News domain is close to the Reuters News domain (both are related to news), and the content in the Twitter domain is generally broad, since diverse topics are tweeted about and discussed on social media. Second, the entity categories for the target domains are limited.
Except for the biomedical domain, which has specialized entities in the biomedical field, the other domains (i.e., Twitter and CBS SciTech News) only have general categories, such as person and location. However, we expect NER models to recognize certain entities related to target domains.

In this paper, we introduce a new human-annotated cross-domain NER dataset, dubbed CrossNER, which contains five diverse domains, namely, politics, natural science, music, literature and artificial intelligence (AI). Each domain has particular entity categories; for example, there are "politician", "election" and "political party" categories specialized for the politics domain. As in previous works, we consider the CoNLL2003 English NER dataset as the source domain, and the five domains in CrossNER as the target domains. We collect ∼1000 development and test examples for each domain and a small number of training samples (100 or 200) for each domain, since we consider a low-resource scenario for target domains. In addition, we collect the corresponding five unlabeled domain-related corpora for domain-adaptive pre-training, given its effectiveness for domain adaptation (Beltagy, Lo, and Cohan 2019; Donahue et al. 2019; Lee et al. 2020; Gururangan et al. 2020). We evaluate existing cross-domain NER models on our collected dataset and explore using different levels of domain corpus and masking strategies to continue pre-training language models (e.g., BERT (Devlin et al. 2019)). Results show that emphasizing the partial corpus with specialized entity categories in BERT's domain-adaptive pre-training (DAPT) consistently improves its domain adaptation ability. Additionally, in the DAPT, BERT's masked language modeling (MLM) can be enhanced by intentionally masking contiguous random spans, rather than random tokens. Comprehensive experiments illustrate that the span-level pre-training consistently outperforms the original MLM pre-training for NER domain adaptation. Furthermore, experimental results show that the cross-domain NER task is challenging, especially when only a few data samples in the target domain are available. The main contributions of this paper are summarized as follows:
• We introduce CrossNER, a fully-labeled dataset spanning five diverse domains, as well as the corresponding five domain-related corpora for studies of the cross-domain NER task.
• We report a set of benchmark results of existing strong NER models, and propose competitive baselines which outperform the current state-of-the-art model.
• To the best of our knowledge, we are the first to conduct in-depth experiments and analyses in terms of the number of target domain training samples, the size of the domain-related corpus and different masking strategies in the DAPT for NER domain adaptation.

Existing NER Datasets
CoNLL2003 (Tjong Kim Sang and De Meulder 2003) is the most popular NER dataset and is collected from the Reuters News domain. It contains four general entity categories, namely, person, location, organization and miscellaneous. The Email dataset (Lawson et al. 2010), Twitter dataset (Derczynski, Bontcheva, and Roberts 2016; Lu et al. 2018), and SciTech News dataset (Jia, Liang, and Zhang 2019) have the same or a smaller set of entity categories than CoNLL2003. WNUT NER (Strauss et al. 2016), from the Twitter domain, has ten entity types.
However, aside from four entity types that are the same as those in CoNLL2003, the other six types come from four domains, which means they are not concentrated on a particular domain. Different from these datasets, the OntoNotes NER dataset (Pradhan et al. 2012) consists of six genres (newswire, broadcast news, broadcast conversation, magazine, telephone conversation and web data). However, the six genres are either relatively close (e.g., newswire and broadcast news) or have broad content (e.g., web data and magazine). To the best of our knowledge, only Biomedical NER (Nédellec et al. 2013) and CORD-NER focus on specific domains (biomedicine and COVID-19, respectively), which differ from the news domain and have specialized entity classes. However, annotations in CORD-NER are produced by models instead of annotators.

Cross-Domain NER
Cross-domain algorithms alleviate the data scarcity issue and boost the models' generalization ability to target domains (Kim et al. 2015; Yang, Liang, and Zhang 2018; Lee, Dernoncourt, and Szolovits 2018; Lin and Lu 2018; Liu, Winata, and Fung 2020). Daumé III (2007) enhanced the adaptation ability by mapping the entity label space between the source and target domains. Wang et al. (2018) proposed a label-aware double transfer learning framework for cross-specialty NER, while Wang, Kulkarni, and Preoţiuc-Pietro (2020) investigated different domain adaptation settings for the NER task. Liu et al. (2020b) introduced a two-stage framework to better capture entities in input sequences. Sachan et al. (2018) and Jia, Liang, and Zhang (2019) injected target domain knowledge into language models for fast adaptation, and Jia and Zhang (2020) presented a multi-cell compositional network for NER domain adaptation. Additionally, fast adaptation algorithms have been applied to low-resource languages (Lample and Conneau 2019; Liu et al. 2019, 2020a; Wilie et al. 2020), accents, and machine translation (Artetxe et al. 2018; Lample et al. 2018).

To collect CrossNER, we first construct five unlabeled domain-specific (politics, natural science, music, literature and AI) corpora from Wikipedia. Then, we extract sentences from these corpora for annotating named entities. The details are given in the following sections. Wikipedia contains various categories, and each category has further subcategories. It serves as a valuable source for us to collect a large corpus related to a certain domain. For example, to construct the corpus in the politics domain, we gather Wikipedia pages that are in the politics category as well as its subcategories, such as political organizations and political cultures. We utilize these collected corpora to investigate domain-adaptive pre-training.

Pre-Annotation Process
For each domain, we sample sentences from our collected unlabeled corpus, which are then given named entity labels. Before annotating the sampled sentences, we leverage the DBpedia Ontology (Mendes, Jakob, and Bizer 2012) to automatically detect entities and pre-annotate the selected samples. By doing so, we can alleviate the workload of annotators and potentially avoid annotation mistakes. However, the quality of the pre-annotated NER samples is not satisfactory, since some entities are incorrectly labeled and many entities are not in the DBpedia Ontology. In addition, we utilize the hyperlinks in Wikipedia and mark tokens that have hyperlinks to facilitate the annotation process and assist annotators in noticing entities.
This is because tokens with hyperlinks are highly likely to be named entities.

Annotation Process
Each data sample requires two well-trained NER annotators to annotate it and one NER expert to double check it and give final labels. The data collection proceeds in three steps. First, one annotator needs to detect and categorize the entities in the given sentences. Second, the other annotator checks the annotations made by the first annotator, makes markings if he/she thinks that the annotations could be wrong and gives another annotation. Finally, the expert first goes through the annotations again and checks for possible mistakes, and then makes the final decision for disagreements between the first two annotators. In order to ensure the quality of annotations, the second annotator concentrates on looking for possible mistakes made by the first annotator instead of labeling from scratch. In addition, the expert gives a second-round check and confers with the first two annotators when he/she is unsure about the annotations. A total of 63.64% of entities (the number of pre-annotated entities divided by the number of entities with hyperlinks) are pre-annotated based on the DBpedia Ontology, 73.33% of entities (the number of corrected entities divided by the number of entities with hyperlinks) are corrected in the first annotation stage, 8.59% of entities are annotated as possibly incorrect in the second checking stage, and finally, 8.57% of annotations (out of all annotations) are modified by the experts. The details are reported in the Appendix.

The data statistics of the Reuters News domain (Tjong Kim Sang and De Meulder 2003) and the collected five domains are illustrated in Table 1. (In Table 1, "Reuters" denotes the Reuters News domain, "Science" denotes the natural science domain and "Litera." denotes the literature domain.) In general, it is easy to collect a large unlabeled corpus for one domain, while for some low-resource domains, the corpus size could be small. As we can see from the statistics of the unlabeled corpus, the size is large for all domains except the AI domain (only a few AI-related pages exist in Wikipedia). Since DAPT experiments usually require a large number of unlabeled sentences (Wu et al. 2020; Gururangan et al. 2020), this data scarcity issue introduces a new challenge for the DAPT. We make the size of the training set (from Table 1) relatively small, since cross-domain NER models are expected to do fast adaptation with a small number of target domain data samples. In addition, there are domain-specialized entity types for each domain, resulting in a hierarchical category structure. For example, there are "politician" and "person" classes, but if a person is a politician, that person should be annotated as a "politician" entity, and if not, a "person" entity. Similar cases can be found for "scientist" and "person", "organization" and "political party", etc. We believe this hierarchical category structure will bring a challenge to this task, since the model needs to better understand the context of inputs and be more robust in recognizing entities.

The vocabulary overlaps of the NER datasets between domains (including the source domain (the Reuters News domain) and the five collected target domains) are shown in Figure 4. Vocabularies for each domain are created by considering the top 5K most frequent words (excluding stopwords).
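To make this overlap statistic concrete, the following is a minimal sketch of how such pairwise vocabulary overlaps could be computed. The function names, the tiny stopword list and the overlap ratio (intersection size divided by the smaller vocabulary size) are illustrative assumptions, not the paper's exact implementation.

```python
from collections import Counter
from itertools import combinations

# Illustrative stopword list; in practice a standard list (e.g., NLTK's) could be used.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "was", "for"}

def build_vocab(sentences, top_k=5000):
    """Return the top-k most frequent words (excluding stopwords) as the domain vocabulary."""
    counter = Counter()
    for sent in sentences:
        for tok in sent.lower().split():
            if tok.isalpha() and tok not in STOPWORDS:
                counter[tok] += 1
    return {word for word, _ in counter.most_common(top_k)}

def vocab_overlap(vocab_a, vocab_b):
    """Overlap ratio between two (roughly equally sized) vocabularies."""
    return len(vocab_a & vocab_b) / min(len(vocab_a), len(vocab_b))

def pairwise_overlaps(domain_sentences, top_k=5000):
    """domain_sentences: dict mapping a domain name to its list of sentences."""
    vocabs = {d: build_vocab(s, top_k) for d, s in domain_sentences.items()}
    return {(a, b): vocab_overlap(vocabs[a], vocabs[b])
            for a, b in combinations(vocabs, 2)}
```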
We observe that the vocabulary overlaps between domains are generally small, which illustrates that the domains of our collected datasets are diverse. The vocabulary overlaps of the unlabeled corpora between domains are reported in the Appendix.

We continue pre-training the language model BERT (Devlin et al. 2019) on the unlabeled corpus (i.e., DAPT) for domain adaptation. The DAPT is explored in two directions. First, we investigate how different levels of the corpus influence the pre-training. Second, we compare the effectiveness of token-level and span-level masking in the DAPT.

When the size of the domain-related corpus is enormous, continuing to pre-train language models on it would be time-consuming. In addition, there would be noisy and domain-unrelated sentences in the collected corpus, which could weaken the effectiveness of the DAPT. Therefore, we investigate whether extracting more indispensable content from the large corpus for pre-training can achieve comparable or even better cross-domain performance. We consider three different levels of corpus for pre-training. The first is the domain-level corpus, which is the largest corpus we can collect related to a certain domain. The second is the entity-level corpus. It is a subset of the domain-level corpus and is made up of sentences having plentiful entities. Practically, it can be extracted from the domain-level corpus based on an entity list. We leverage the entity list in the DBpedia Ontology and extract sentences that contain multiple entities to construct the entity-level corpus. The third is the task-level corpus, which is explicitly related to the NER task in the target domain. To construct this corpus, we select sentences containing domain-specialized entities that exist in the DBpedia Ontology. The size of the task-level corpus is expected to be much smaller than the entity-level corpus. However, its content should be more beneficial than that of the entity-level corpus. Taking this further, we propose to integrate the entity-level and task-level corpora. Instead of simply merging these two corpora, we first upsample the task-level corpus (double the size in practice) and then combine it with the entity-level corpus. Hence, models will tend to focus more on the task-level sentences in the DAPT.

Inspired by Joshi et al. (2020), we propose to change the token-level masking (MLM) in BERT (Devlin et al. 2019) into span-level masking for the DAPT. In BERT, MLM first randomly masks 15% of the tokens in total, and then replaces 80% of the masked tokens with special tokens ([MASK]), 10% with random tokens and 10% with the original tokens. We follow the same masking strategy as BERT except for the first masking step. In the first step, after the random masking, we move each individually masked index to a position adjacent to another masked index in order to produce more masked spans, while we do not touch contiguous masked indices (i.e., masked spans). For example, the randomly masked sentence "Western music's effect would [MASK] to grow within the country [MASK] sphere" would become "Western music's effect would continue to grow within the [MASK] [MASK] sphere". Intuitively, span-level masking provides a more challenging task for pre-trained language models. For example, predicting the whole span "San Francisco" is much harder than predicting only "San" when "Francisco" is given as the next word.
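The re-positioning step described above can be made concrete with a short sketch. This is a minimal illustration rather than the authors' released implementation: the 15% masking budget follows BERT, while the choice of moving an isolated masked index next to its nearest masked neighbor is an assumption where the text leaves the tie-breaking unspecified.

```python
import random

def random_mask_indices(num_tokens, mask_prob=0.15):
    """Step 1 of BERT's MLM: choose roughly 15% of positions uniformly at random."""
    k = max(1, int(round(num_tokens * mask_prob)))
    return sorted(random.sample(range(num_tokens), k))

def spanify(masked_indices, num_tokens):
    """Move each isolated masked position next to another masked position so that more
    contiguous masked spans are produced; already-contiguous masked indices are left untouched."""
    masked = set(masked_indices)
    for idx in sorted(masked_indices):
        isolated = (idx - 1 not in masked) and (idx + 1 not in masked)
        if not isolated or len(masked) == 1:
            continue
        # Pick the nearest other masked index and move this one next to it (assumed tie-breaking).
        nearest = min((m for m in masked if m != idx), key=lambda m: abs(m - idx))
        target = nearest - 1 if nearest > idx else nearest + 1
        if 0 <= target < num_tokens and target not in masked:
            masked.discard(idx)
            masked.add(target)
    return sorted(masked)

# Example: isolated masks at positions 3 and 10 become the contiguous span {9, 10}.
print(spanify([3, 10], num_tokens=12))
```

After this step, BERT's usual 80/10/10 replacement with [MASK], random, or original tokens is applied to the (now more contiguous) masked positions.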
Hence, the span-level masking can facilitate BERT to better understand the domain text so as to complete the more challenging task.

We consider the CoNLL2003 English NER dataset (Tjong Kim Sang and De Meulder 2003) from Reuters News, which contains person, location, organization and miscellaneous entity categories, as the source domain and the five domains in CrossNER as target domains. Our model is based on BERT (Devlin et al. 2019) in order to have a fair comparison with the current state-of-the-art model (Jia and Zhang 2020), and we follow Devlin et al. (2019) to fine-tune BERT on the NER task. More training details are in the Appendix. Before training on the source or target domains, we conduct the DAPT on BERT when the unlabeled domain-related corpus is leveraged. Moreover, in the DAPT, different types of unlabeled corpora are investigated (i.e., domain-level, entity-level, task-level and integrated corpora), and different masking strategies are inspected (i.e., token-level and span-level masking). Then, we carry out three different settings for the domain adaptation, which are described as follows:
• We ignore the source domain training samples, and fine-tune BERT directly on the target domain data.
• We first pre-train BERT on the source domain data, and then fine-tune it on the target domain samples.
• We jointly fine-tune BERT on both source and target domain data samples. Since the number of data samples in the target domains is smaller than in the source domain, we upsample the target domain data to balance the source and target domain data samples.

We compare our methods to the following baselines:
• BiLSTM-CRF (Lample et al. 2016) incorporates a bidirectional LSTM (Hochreiter and Schmidhuber 1997) and a conditional random field for named entity recognition. We combine source domain data samples and the upsampled target domain data samples to jointly train this model (i.e., the joint training setting mentioned in the experimental settings). We use the word-level embeddings from Pennington, Socher, and Manning (2014).

From Table 2, we can see that DAPT using the entity-level or the task-level corpus achieves results that are better than or on par with using the domain-level corpus, while according to the corpus statistics illustrated in Table 3, the size of the entity-level corpus is generally around half or less than half that of the domain-level corpus, and the size of the task-level corpus is much smaller than the domain-level corpus. We conjecture that the content of the corpus with plentiful entities is more suitable for the NER task's DAPT. In addition, selecting sentences with plentiful entities is able to filter out numerous noisy sentences and some domain-unrelated sentences from the domain corpus. Picking sentences having domain-specialized entities also filters out a great many sentences that are not explicitly related to the domain and makes the DAPT more effective and efficient. In general, DAPT using the task-level corpus performs slightly worse than using the entity-level corpus. This can be attributed to the large corpus size difference. Furthermore, integrating the entity-level and task-level corpora is able to consistently boost the adaptation performance compared to utilizing other corpus types, although the size of the integrated corpus is still smaller than the domain-level corpus. This is because the integrated corpus ensures the pre-training corpus is relatively large, and in the meantime, focuses on the content that is explicitly related to the NER task in the target domain.
The results suggest that the corpus content is essential for the DAPT, and we leave exploring how to extract effective sentences for the DAPT for future work. Surprisingly, the DAPT is still effective for the AI domain even though the corpus size in this domain is relatively small, which illustrates that the DAPT is also practically useful in a small corpus setting.

As we can see from Table 2, when leveraging the same corpus, the span-level masking consistently outperforms the token-level masking. For example, in the Pre-train then Fine-tune setting, DAPT on the integrated corpus using span-level masking outperforms that using token-level masking by a 1.34% F1-score on average. This is because predicting spans is a more challenging task than predicting tokens, forcing the model to better comprehend the domain text and thus to possess a more powerful capability for the downstream task. Moreover, adding DAPT using the span-level masking and the integrated corpus to Jia and Zhang (2020) further improves the F1-score by 2.16% on average. Nevertheless, we believe that exploring more masking strategies or DAPT methods is worthwhile. We leave this for future work.

From Table 2, we can clearly observe the improvements when the source domain data samples are leveraged. For example, compared to Directly Fine-tune, Pre-train then Fine-tune (w/o DAPT) improves the F1-score by 3.45% on average, and Jointly Train (w/o DAPT) improves the F1-score by 3.08% on average. We notice that Pre-train then Fine-tune generally leads to better performance than Jointly Train. We speculate that jointly training on both the source and target domains makes it difficult for the model to concentrate on the target domain task, leading to a sub-optimal result, while for Pre-train then Fine-tune, the model learns the NER task knowledge from the source domain data in the pre-training step and then focuses on the target domain task in the fine-tuning step. Finally, we can see that our best model can outperform the existing state-of-the-art model in all five domains. However, the average F1-score of the best model is not yet perfect (lower than 70%), which highlights the need for more advanced cross-domain models.

Given that a large-scale domain-related corpus might sometimes be unavailable, we investigate the effectiveness of different corpus sizes for the DAPT and explore how the masking strategies influence the adaptation performance. As shown in Figure 2, as the size of the unlabeled corpus increases, the performance generally keeps improving. This implies that the corpus size is generally essential for the DAPT, and within a certain range, the larger the corpus, the better the domain adaptation performance the DAPT produces. Additionally, we notice that in the Pre-train then Fine-tune setting, the improvement becomes comparably smaller when the percentage reaches 75% or higher. We conjecture that it is relatively difficult to improve the performance once it reaches a certain level. Furthermore, little performance improvement is observed for both the token-level and span-level masking strategies when only a small-scale corpus (1% of the music integrated corpus, 1.48M) is available. As the corpus size increases, the span-level masking starts to outperform the token-level masking. We notice that in Directly Fine-tune, the performance discrepancy between the token-level and span-level masking first increases and then decreases.
In the other two settings, the performance discrepancies generally keep increasing. We hypothesize that the span-level masking can learn the domain text more efficiently since it is a more challenging task, while the token-level masking requires a larger corpus to better understand the domain text.

[Figure 3: Few-shot F1-scores (averaged over three runs) in the music domain. We use the integrated corpus for the DAPT.]

From Figure 3, we can see that the performance drops when the number of target domain samples is reduced, the span-level pre-training generally outperforms the token-level pre-training, and the task becomes extremely difficult when only a few data samples (e.g., 10 samples) in the target domain are available. Interestingly, as the target domain sample size decreases, the advantage of using source domain training samples becomes more significant; for example, Pre-train then Fine-tune outperforms Directly Fine-tune by ∼10% F1-score when the sample size is reduced to 40 or fewer. This is because these models are able to gain the NER task knowledge from the large number of source domain examples and then possess the ability to quickly adapt to the target domain. Additionally, using DAPT significantly improves the performance in the Pre-train then Fine-tune setting when target domain samples are scarce (e.g., 10 samples). This can be attributed to the boost in the domain adaptation ability made by the DAPT, which allows the model to quickly learn the NER task in the target domain. Furthermore, we notice that as the sample size decreases, the performance discrepancy between Pre-train then Fine-tune and Jointly Train grows. We speculate that in the Jointly Train setting, the models focus on the NER task in both the source and target domains. This makes the models tend to ignore the target domain when the sample size is too small, while for the Pre-train then Fine-tune setting, the models can focus on the target domain in the fine-tuning stage to ensure good performance in the target domain.

In this section, we further explore the effectiveness of the DAPT and of leveraging NER samples in the source domain.

[Table 4: F1-scores (averaged over three runs) for the categories in the music domain over the Directly Fine-tune and Pre-train then Fine-tune settings. Span-level+Integrated denotes that the span-level masking and integrated corpus are utilized for the DAPT. "Loc.", "Org.", "Per." and "Misc." denote "Location", "Organization", "Person" and "Miscellaneous", respectively.]

As shown in Table 4, the performance is improved on almost all categories when the DAPT or the source domain NER samples are utilized. We observe that using source domain NER data might hurt the performance in some domain-specialized entity categories, such as "artist" ("musical artist") and "band". This is because "artist" is a subcategory of "person", and models pre-trained on the source domain tend to classify artists as "person" entities. Similarly, "band" is a subcategory of "organization", which leads to the same misclassification issue after the source domain pre-training. When the DAPT is used, the performance on some domain-specialized entity categories is greatly improved (e.g., "song", "band" and "album"). We notice that the performance on the "person" entity is relatively low compared to other categories.
This is because the hierarchical category structure could cause models to be confused between "artist" and "person" entities; we find that 84.81% of "person" entities are misclassified as "artist" by our best model.

In this paper, we introduce CrossNER, a human-annotated NER dataset spanning five diverse domains with specialized entity categories for each domain. In addition, we collect the corresponding domain-related corpora for the study of DAPT. A set of benchmark results of existing strong NER models is reported. Moreover, we conduct comprehensive experiments and analyses in terms of the size of the domain-related corpus and different pre-training strategies in the DAPT for the cross-domain NER task, and our proposed method consistently outperforms existing baselines. Nevertheless, the performance of our best model is not yet perfect, especially when the number of target domain training samples is limited. We hope that our dataset will facilitate further research in the NER domain adaptation field.

We gather NER experts who are familiar with the NER task and the annotation rules, and we train the annotators before they start annotating. For each domain, we first give the annotators the annotation instructions and 100 annotated examples (annotated by the NER experts) and ask them to check for possible annotation errors. After the checking process, the NER experts inspect the results and tell the annotators the mistakes they made in the checking stage. Hence, in this process, the annotators are able to learn how to annotate the NER samples for specific domains. We split the annotation instructions into two parts, namely, general instructions and domain-specific instructions. We describe the instructions as follows:

General Instructions
Each data sample requires two well-trained NER annotators to annotate it and one NER expert to double check it and give final labels. The annotation process consists of three steps. First, one annotator needs to detect and categorize the entities in the given sentences. Second, the other annotator checks the annotations made by the first annotator, makes markings if he/she thinks that the annotations could be wrong and gives another annotation. Finally, the expert first goes through the annotations again and checks for possible mistakes, and then makes the final decision for disagreements between the first two annotators. Notice that we mark the tokens with hyperlinks in Wikipedia, and it is highly likely that these tokens are named entities. When one entity contains another entity, we should label the entity with the larger span. For example, "Fellow of the Royal Society" is an entity (an award entity), while "Royal Society" is another entity (an organization entity); we should annotate "Fellow of the Royal Society" instead of "Royal Society". The requirements for the annotators in different annotation stages are as follows:
• If you are in the first annotation stage, you need to detect and categorize the entities carefully, check whether the pre-labeled entities (annotated by the DBpedia Ontology, as described in the main paper) are correct or not, and give the correct annotations if you think the labels are wrong. Notice that a pre-labeled entity might not be an entity, and some tokens not labeled as entities could be entities.
• If you are in the second annotation stage, you need to focus on looking for possible annotation mistakes made by the first annotator, and give another annotation if you think the labels could be wrong. Additionally, you need to detect and categorize the entities that are missed by the first annotator.
• If you are in the third annotation stage (only applicable for NER experts), you need to carefully go through the annotations and correct the possible mistakes, and in the meantime, you need to check the corrected annotations made by the second annotator and then make final decisions for the disagreements between the two annotators. If you are unsure about the annotations, you need to confer with the two annotators.

Domain-Specific Instructions
We list the annotation details for the five domains, namely, politics, natural science, music, literature, and artificial intelligence.
• Politics: The entity category list for this domain is {person, organization, location, politician, political party, election, event, country, miscellaneous}. The annotation rules for the abovementioned entity categories are as follows:
- Person: The name of a person should be annotated as a person entity.
Note that the annotation rules for some general entity categories (i.e., person, location, organization, event, country, miscellaneous) are the same as those in the other domains, and we do not repeat the annotation rules for these categories under the other domains.
• Natural Science: This domain covers the areas of biology, chemistry, and astrophysics. The entity category list for this domain is {scientist, person, university, organization, country, location, discipline, enzyme, protein, chemical element, chemical compound, astronomical object, academic journal, event, theory, award, miscellaneous}. The annotation rules for the abovementioned entity categories are as follows:
- University: The university entity.
- Discipline: The discipline entity. It contains the areas and subareas of biology, chemistry and astrophysics, such as quantum chemistry.
- Theory: The theory entity. It includes law and theory entities, such as ptolemaic planetary theories.
- Award: The award entity.
- Scientist: If a person entity is a scientist, you should label this person as a scientist entity instead of a person entity.
- Protein: The protein entity.
- Enzyme: Notice that an enzyme is a special type of protein. Hence, if a protein entity is an enzyme, you should label this protein as an enzyme entity instead of a protein entity.
- Chemical element: The chemical element entity. Basically, this category contains the chemical elements from the periodic table.
- Chemical compound: The chemical compound entity. If a chemical compound entity does not belong to protein or enzyme, you should label it as a chemical compound entity.
- Astronomical object: The astronomical object entity.
- Academic journal: The academic journal entity.
• Music: The entity category list for this domain is {music genre, song, band, album, musical artist, musical instrument, award, event, country, location, organization, person, miscellaneous}. The annotation rules for the abovementioned entity categories are as follows:
- Music genre: The music genre entity, such as country music, folk music and jazz.
- Song: The song entity.
- Band: The band entity. If an organization is a band, you should label it as a band entity instead of an organization entity.
- Album: The album entity.
- Musical artist: The musical artist entity. If a person works in the music area (e.g., he/she is a singer, composer, or songwriter), you should label him/her as a musical artist entity instead of a person entity.
- Musical instrument: The musical instrument entity, such as piano.
• Artificial Intelligence: This domain covers the AI research area. The annotation rules for the entity categories are as follows:
- Researcher: The researcher entity. If a person is working on research (including professors, Ph.D. students, researchers in companies, etc.), you should label him/her as a researcher entity instead of a person entity.
- Field: The research field entity, such as machine learning, deep learning, and natural language processing.
- Task: The specific task entity in the research field, such as machine translation and object detection.
- Product: The product entity. It includes products (e.g., a certain kind of robot like Pepper), systems (e.g., a facial recognition system) and toolkits (e.g., TensorFlow and PyTorch).
- Algorithm: The algorithm entity. It contains algorithms (e.g., decision trees) and models (e.g., CNN and LSTM).
- Metrics: The evaluation metrics, such as F1-score.
- Programming Language: The programming language, such as Java and Python.
- Conference: The conference entity. It contains conference and journal entities.

The data statistics for each category in the five domains are illustrated in Table 5, Table 6, Table 7, Table 8, and Table 9. NER data examples for the five domains are shown in Table 10:
- Politics: In the subsequent election, Hugo Chávez (politician)'s political party, the United Socialist Party of Venezuela (political party), drew 48% of the votes overall.
- Natural Science: Mars (astronomical object) has four known co-orbital asteroids, such as 5261 Eureka (astronomical object), all at the Lagrangian points (miscellaneous).
- Music: House of Pain (band) abruptly broke up in 1996 after the release of their third album, Truth Crushed to Earth Shall Rise Again (album), which featured a guest appearance by rappers Sadat X (musical artist) of Brand Nubian (band), etc.
- Literature: Charles (writer) spent time outdoors, but also read voraciously, including the picaresque novels (literary genre) of Tobias Smollett (writer) and Henry Fielding (writer), as well as Robinson Crusoe (book) and Gil Blas (book).
- AI: The gradient descent (algorithm) can take many iterations to compute a local minimum (miscellaneous) with a required accuracy (metrics), if the curvature (miscellaneous) in different directions is very different for the given function.

The vocabulary overlaps of the unlabeled domain-related corpora between domains are shown in Figure 4. The Reuters News corpus is taken from Jia, Liang, and Zhang (2019). For each domain, we sample 150K paragraphs from the domain-related corpus and create the vocabulary by considering the top 50K most frequent words (excluding stopwords). We can see that the overlaps in the unlabeled corpora are generally larger than those in the NER datasets. Since the corpus size is large, it contains more frequent words that overlap with those in other domains. In addition, except for the vocabulary overlaps between the politics and literature domains, and the music and literature domains (above 60%), the overlaps for other domain pairs are still comparably small.

We perform the domain-adaptive pre-training (DAPT) for 15 epochs on the pre-training corpus. We add a linear layer on top of BERT (Devlin et al. 2019) and then fine-tune the whole model on the NER task. We select the best hyperparameters by searching a combination of batch size and learning rate with the following ranges: batch size {16, 32} and learning rate {1 × 10^-5, 3 × 10^-5, 5 × 10^-5}. In the Directly Fine-tune and Pre-train then Fine-tune settings, we use batch size 16 and learning rate 5 × 10^-5, while in the Jointly Train setting, we use batch size 16 and learning rate 1 × 10^-5.
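As a reference point for the fine-tuning setup described above (a linear classification layer on top of BERT, trained end-to-end on the NER tags), here is a hedged sketch using the Hugging Face transformers library. The label set, the example sentence and the single optimization step are illustrative, not the exact CrossNER tag schema or training loop.

```python
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

# Illustrative BIO label set; the real CrossNER label sets are per-domain and larger.
LABELS = ["O", "B-band", "I-band", "B-album", "I-album", "B-musicalartist", "I-musicalartist"]
label2id = {label: i for i, label in enumerate(LABELS)}

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
# BertForTokenClassification = BERT encoder + a linear layer over each token's hidden state.
model = BertForTokenClassification.from_pretrained("bert-base-cased", num_labels=len(LABELS))

def encode(words, tags):
    """Tokenize pre-split words and align word-level tags to wordpieces.
    Sub-tokens after the first piece of a word get the label -100, which the loss ignores."""
    enc = tokenizer(words, is_split_into_words=True, truncation=True, return_tensors="pt")
    labels, prev = [], None
    for wid in enc.word_ids():
        labels.append(-100 if wid is None or wid == prev else label2id[tags[wid]])
        prev = wid
    enc["labels"] = torch.tensor([labels])
    return enc

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
batch = encode(["House", "of", "Pain", "broke", "up", "in", "1996"],
               ["B-band", "I-band", "I-band", "O", "O", "O", "O"])
loss = model(**batch).loss  # cross-entropy over the linear layer's logits
loss.backward()
optimizer.step()
```

Under the settings in the main paper, the same architecture is fine-tuned either directly on the target domain, after pre-training on the source domain, or jointly on both.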
Additionally, we upsample the target domain data in the Jointly Train setting to balance the data samples between the source and target domains. Given that the number of training samples in the source domain is around 100 times larger than the number of target domain samples, the multiplier applied to the target domain data is searched within the range {10, 50, 100, 150, 200}, and we find that 100 is generally suitable for all domains. We use the F1-score to evaluate the models. It is a commonly used evaluation metric for NER models (Ma and Hovy 2016; Lample et al. 2016; Chiu and Nichols 2016). We use an early stopping strategy and select the model based on the performance on the development set of the target domain. All experiments are run on a GTX 1080 Ti. We will release our collected labeled NER datasets, unlabeled domain-related corpora, different levels of unlabeled corpora, and all model checkpoints to catalyze the research in the cross-domain NER area.

In the experiments of the main paper, the entity list is created from the DBpedia Ontology for constructing the entity-level corpus. The domain-specialized entity list, which is needed to build the task-level corpus, is also based on the DBpedia Ontology. However, some low-resource domains might not be covered by the DBpedia Ontology. Therefore, to make the construction of the entity-level and task-level corpora more scalable, we utilize a NER model to categorize entities as a substitute for the DBpedia Ontology. Concretely, the process of generating the entity-level and task-level corpora based on the NER model consists of three steps. First, we leverage the NER training data in both source and target domains and follow the Pre-train then Fine-tune setting to build a NER model. Second, we use this trained NER model to recognize and categorize entities in the domain-related corpus. Third, we follow the same setting as in the main paper to construct the entity-level and task-level corpora. In other words, we construct the entity-level corpus by extracting sentences having plentiful entities and create the task-level corpus by selecting sentences having domain-specialized entities. Finally, the integrated corpus is the combination of the entity-level and task-level corpora. We make the sizes of the corpora constructed based on the DBpedia Ontology and the NER model comparable so as to have a fair comparison between these settings. The comparisons between the DBpedia Ontology and the NER model are illustrated in Table 11. We can see that the DAPT using the NER model-based corpus achieves comparable results to using the DBpedia Ontology-based corpus. The experimental results show that creating different levels of corpora does not rely on the DBpedia Ontology.
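To illustrate the corpus-construction procedure described above, the sketch below filters a domain corpus into entity-level, task-level and integrated subsets. The `find_entities` callback (a DBpedia Ontology lookup or the trained NER model), the minimum-entity threshold and the exact upsampling factor are assumptions for illustration; the paper only states that the task-level corpus is doubled before merging.

```python
def build_corpora(sentences, find_entities, specialized_types,
                  min_entities=2, task_upsample=2):
    """Split a domain-level corpus into entity-level and task-level subsets,
    then build the integrated corpus.

    find_entities(sentence) is assumed to return a list of (entity_text, entity_type)
    pairs, obtained e.g. from a DBpedia Ontology entity list or a trained NER model.
    """
    entity_level, task_level = [], []
    for sent in sentences:
        ents = find_entities(sent)
        # Entity-level corpus: sentences containing multiple entities.
        if len(ents) >= min_entities:
            entity_level.append(sent)
        # Task-level corpus: sentences containing at least one domain-specialized entity.
        if any(etype in specialized_types for _, etype in ents):
            task_level.append(sent)
    # Integrated corpus: upsample (double) the task-level corpus and combine it with
    # the entity-level corpus, so pre-training focuses more on task-level sentences.
    integrated = entity_level + task_level * task_upsample
    return entity_level, task_level, integrated
```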
References (titles as extracted):
Unsupervised Neural Machine Translation
SciBERT: A Pretrained Language Model for Scientific Text
Named Entity Recognition with Bidirectional LSTM-CNNs
Frustratingly Easy Domain Adaptation
Broad Twitter Corpus: A Diverse Named Entity Recognition Resource
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
LakhNES: Improving multi-instrumental music generation with cross-domain pre-training
Character-based LSTM-CRF with radical-level features for Chinese named entity recognition
Don't Stop Pretraining: Adapt Language Models to Domains and Tasks
Task Model: Growing a Neural Network for Multiple NLP Tasks
Conference on Empirical Methods in Natural Language Processing
Long short-term memory
Cross-domain NER using cross-domain language modeling
Multi-Cell Compositional LSTM for NER Domain Adaptation
SpanBERT: Improving Pre-training by Representing and Predicting Spans
New transfer learning techniques for disparate label sets
Neural Architectures for Named Entity Recognition
Cross-lingual language model pretraining
Unsupervised Machine Translation Using Monolingual Corpora Only
Annotating large email datasets for named entity recognition with mechanical turk
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
Transfer Learning for Named-Entity Recognition with Neural Networks
Neural Adaptation Layers for Cross-domain Named Entity Recognition
Zero-shot Cross-lingual Dialogue Systems with Transferable Latent Variables
Zero-Resource Cross-Domain Named Entity Recognition
Attention-informed mixed-language training for zero-shot cross-lingual task-oriented dialogue systems
Coach: A Coarse-to-Fine Approach for Cross-domain Slot Filling
Cross-lingual Spoken Language Understanding with Regularized Representation Alignment
End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF
DBpedia: A Multilingual Cross-domain Knowledge Base
Overview of BioNLP shared task 2013
Glove: Global vectors for word representation
CoNLL-2012 shared task: Modeling multilingual unrestricted coreference in OntoNotes
Results of the wnut16 named entity recognition shared task
Introduction to the CoNLL-2003 shared task: language-independent named entity recognition
Multidomain named entity recognition with genre-aware and agnostic inference
Comprehensive named entity recognition on cord-19 with distant or weak supervision
Label-Aware Double Transfer Learning for Cross-Specialty Medical Named Entity Recognition
IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding
Learning Fast Adaptation on Cross-Accented Speech Recognition
Todbert: Pre-trained natural language understanding for task-oriented dialogues
A Survey on Recent Advances in Named Entity Recognition from Deep Learning models
Design Challenges and Misconceptions in Neural Sequence Labeling
Transfer Learning for Sequence Tagging with Hierarchical Recurrent Networks

[Table caption: Pre-train on the Source Domain then Fine-tune on Target Domains — F1-scores (averaged over three runs) of proposed methods for the five domains in three settings. We use the span-level masking strategy for the DAPT.]

We want to thank Genta Indra Winata for providing insightful comments for this research project. We also want to thank the anonymous reviewers for their constructive feedback. This work is partially funded by ITF/319/16FP and MRP/055/18 of the Innovation Technology Commission, the Hong Kong SAR Government.