title: Few-NERD: A Few-Shot Named Entity Recognition Dataset
authors: Ding, Ning; Xu, Guangwei; Chen, Yulin; Wang, Xiaobin; Han, Xu; Xie, Pengjun; Zheng, Hai-Tao; Liu, Zhiyuan
date: 2021-05-16

Recently, considerable literature has grown up around the theme of few-shot named entity recognition (NER), but little published benchmark data has specifically focused on this practical and challenging task. Current approaches collect existing supervised NER datasets and re-organize them into a few-shot setting for empirical study. These strategies conventionally aim to recognize coarse-grained entity types with few examples, while in practice, most unseen entity types are fine-grained. In this paper, we present Few-NERD, a large-scale human-annotated few-shot NER dataset with a hierarchy of 8 coarse-grained and 66 fine-grained entity types. Few-NERD consists of 188,238 sentences from Wikipedia, containing 4,601,160 words, each annotated as context or as part of a two-level entity type. To the best of our knowledge, this is the first few-shot NER dataset and the largest human-crafted NER dataset. We construct benchmark tasks with different emphases to comprehensively assess the generalization capability of models. Extensive empirical results and analysis show that Few-NERD is challenging and that the problem requires further research. We make Few-NERD public at https://ningding97.github.io/fewnerd/.

Named entity recognition (NER), as a fundamental task in information extraction, aims to locate and classify named entities in unstructured natural language. A considerable number of approaches equipped with deep neural networks have shown promising performance (Chiu and Nichols, 2016) on fully supervised NER. Notably, pre-trained language models (e.g., BERT (Devlin et al., 2019a)) with an additional classifier achieve significant success on this task and have gradually become the base paradigm. Such studies demonstrate that deep models can yield remarkable results when accompanied by a large amount of annotated corpora. As knowledge from various domains continues to emerge, named entities, especially those that require professional knowledge to understand, are difficult to annotate manually on a large scale. Under this circumstance, studying NER systems that can learn unseen entity types with few examples, i.e., few-shot NER, plays a critical role in this area.

There is a growing body of literature that recognizes the importance of few-shot NER and contributes to the task (Hofer et al., 2018; Fritzler et al., 2019; Yang and Katiyar, 2020; Li et al., 2020a; Huang et al., 2020). Unfortunately, there is still no dataset specifically designed for few-shot NER. Hence, these methods collect previously proposed supervised NER datasets and reorganize them into a few-shot setting. Common choices include OntoNotes (Weischedel et al., 2013), CoNLL'03 (Tjong Kim Sang, 2002), WNUT'17 (Derczynski et al., 2017), etc. These research efforts on few-shot learning for named entities mainly face two challenges. First, most datasets used for few-shot learning have only 4-18 coarse-grained entity types, making it hard to construct an adequate variety of "N-way" meta-tasks and learn correlation features; in reality, we observe that most unseen entities are fine-grained.
Second, because of the lack of benchmark datasets, the settings of different works are inconsistent (Huang et al., 2020; Yang and Katiyar, 2020), leading to unclear comparisons. In sum, although these methods make promising contributions to few-shot NER, a dedicated dataset is urgently needed to provide a unified benchmark for rigorous comparisons.

To alleviate the above challenges, we present a large-scale human-annotated few-shot NER dataset, FEW-NERD, which consists of 188.2k sentences extracted from Wikipedia articles, with 491.7k entities manually annotated by well-trained annotators (Section 4.3). To the best of our knowledge, FEW-NERD is the first dataset specially constructed for few-shot NER and also one of the largest human-annotated NER datasets (statistics in Section 5.1). We carefully design an annotation schema of 8 coarse-grained entity types and 66 fine-grained entity types by conducting several pre-annotation rounds (Section 4.1). In contrast, among the most widely used NER datasets, CoNLL'03 has 4 entity types, WNUT'17 has 6 entity types, and OntoNotes has 18 entity types (7 of them are value types). The variety of entity types gives FEW-NERD rich contextual features with a finer granularity for better evaluation of few-shot NER. The distribution of the entity types in FEW-NERD is shown in Figure 1; more details are reported in Section 5.1. We conduct an analysis of the mutual similarities among all the entity types of FEW-NERD to study knowledge transfer (Section 5.2). The results show that our dataset can provide sufficient correlation information between different entity types for few-shot learning. For benchmark settings, we design three tasks on the basis of FEW-NERD, including a standard supervised task (FEW-NERD (SUP)) and two few-shot tasks (FEW-NERD (INTRA) and FEW-NERD (INTER)); for more details see Section 6. FEW-NERD (SUP), FEW-NERD (INTRA), and FEW-NERD (INTER) assess instance-level generalization, type-level generalization, and knowledge transfer of NER methods, respectively. We implement models based on recent state-of-the-art approaches and evaluate them on FEW-NERD (Section 7). Empirical results show that FEW-NERD is challenging in all three settings. We also conduct sets of subsidiary experiments to analyze promising directions for few-shot NER. We hope that FEW-NERD can further facilitate research on few-shot NER.

As a pivotal task of information extraction, NER is essential for a wide range of technologies (Cui et al., 2017; Li et al., 2019b; Ding et al., 2019; Shen et al., 2020), and a considerable number of NER datasets have been proposed over the years. For example, CoNLL'03 (Tjong Kim Sang, 2002) is regarded as one of the most popular datasets; it is curated from Reuters News and includes 4 coarse-grained entity types. Subsequently, a series of NER datasets from various domains were proposed (Balasuriya et al., 2009; Ritter et al., 2011; Weischedel et al., 2013; Stubbs and Uzuner, 2015; Derczynski et al., 2017). These datasets formulate a sequence labeling task, and most of them contain 4-18 entity types. Among them, due to its high quality and size, OntoNotes 5.0 (Weischedel et al., 2013) is considered one of the most widely used NER datasets.
As approaches equipped with deep neural networks have shown satisfactory performance on NER with sufficient supervision (Lample et al., 2016; Ma and Hovy, 2016), few-shot NER has received increasing attention (Hofer et al., 2018; Fritzler et al., 2019; Yang and Katiyar, 2020; Li et al., 2020a). Few-shot NER is a considerably challenging and practical problem that could facilitate the understanding of textual knowledge by neural models (Huang et al., 2020). Due to the lack of a specific benchmark for few-shot NER, current methods collect existing NER datasets and use different few-shot settings. To provide a benchmark that can comprehensively assess the generalization of models given few examples, we annotate FEW-NERD. To make the dataset practical and close to reality, we adopt a fine-grained schema of entity annotation, which is inspired by and modified from previous fine-grained entity recognition studies (Ling and Weld, 2012; Gillick et al., 2014; Choi et al., 2018; Ringland et al., 2019).

NER is normally formulated as a sequence labeling problem. Specifically, for an input sequence of tokens x = {x_1, x_2, ..., x_t}, NER aims to assign each token x_i a label y_i ∈ Y to indicate either that the token is part of a named entity (such as Person, Organization, Location) or that it does not belong to any entity (denoted as the O class), where Y is a set of pre-defined entity types. N-way K-shot learning is conducted by iteratively constructing episodes. For each episode in training, N classes (N-way) and K examples (K-shot) per class are sampled to build a support set S_train = {x^(i), y^(i)}_{i=1}^{N×K}, and another K examples for each of the N classes are sampled to construct a query set Q_train = {x^(j), y^(j)}_{j=1}^{N×K}, with S_train ∩ Q_train = ∅. Few-shot learning systems are trained by predicting the labels of the query set Q_train with the information of the support set S_train. The supervision of both S_train and Q_train is available in training. In the testing procedure, all the classes are unseen in the training phase, and by using the few labeled examples of the support set S_test, few-shot learning systems need to make predictions on the unlabeled query set Q_test (with S_test ∩ Q_test = ∅).

However, in a sequence labeling problem like NER, a sentence may contain multiple entities from different classes, and it is imperative to sample examples at the sentence level, since contextual information is crucial for sequence labeling problems, especially NER. Sampling is thus more difficult than in conventional classification tasks like relation extraction (Han et al., 2018). Some previous works (Yang and Katiyar, 2020; Li et al., 2020a) use greedy sampling strategies to iteratively judge whether a sentence can be added into the support set, but the constraints become increasingly strict during sampling. For example, in a 5-way 5-shot setting, if the support set already has 4 classes with 5 examples and 1 class with 4 examples, the next sampled sentence must contain exactly one entity of that specific class to strictly meet the 5-way 5-shot requirement. This is not suitable for FEW-NERD, since it is densely annotated with entities. Thus, as shown in Algorithm 1, we adopt an N-way K∼2K-shot setting in our paper, the primary principle of which is to ensure that each class in S contains K∼2K examples, effectively alleviating the limitations of sampling; a minimal sketch of this greedy procedure is given below.
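The following is a minimal Python sketch of greedy N-way K∼2K-shot support-set sampling, written as a simplified variant of Algorithm 1; all function and variable names are our own illustration, not the authors' released code, and the exact add-then-check order may differ from the paper's pseudocode.

```python
import random
from collections import Counter

def greedy_sample_support(sentences, target_classes, K, seed=0):
    """Greedily build a support set in which every class in `target_classes`
    ends up with K to 2K entity examples.

    `sentences` is a list of (sentence, class_counts) pairs, where
    class_counts is a Counter over the entity classes occurring in the
    sentence. Illustrative sketch only; assumes each target class occurs
    often enough in `sentences` for the loop to terminate.
    """
    rng = random.Random(seed)
    order = list(range(len(sentences)))
    rng.shuffle(order)

    support, counts = [], Counter()
    for idx in order:
        sent, sent_counts = sentences[idx]
        # Skip sentences containing entity classes outside this episode.
        if any(c not in target_classes for c in sent_counts):
            continue
        # Skip sentences that would push any class beyond the 2K cap.
        if any(counts[c] + n > 2 * K for c, n in sent_counts.items()):
            continue
        # Small efficiency heuristic: only add a sentence if it still
        # helps at least one class that is below K examples.
        if not any(counts[c] < K for c in sent_counts):
            continue
        support.append(sent)
        counts.update(sent_counts)
        if all(counts[c] >= K for c in target_classes):
            break  # every class now has K~2K examples
    return support, counts
```

Because dense annotation makes exact K-shot constraints nearly unsatisfiable, relaxing the upper bound to 2K lets the sampler accept multi-entity sentences instead of rejecting almost all of them.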
Algorithm 1 (greedy N-way K∼2K-shot sampling): given a dataset X, a label set Y, N, and K, initialize the support set S ← ∅ and the per-type entity counts; then repeatedly sample sentences, updating |Count| and each Count_i after every addition, until every sampled class has K∼2K entities (as sketched above).

Collection of FEW-NERD. The primary goal of FEW-NERD is to construct a fine-grained dataset that can specifically be used in the few-shot NER scenario. Hence, the schemas of traditional NER datasets such as CoNLL'03 and OntoNotes, which only contain 4-18 coarse-grained types, cannot meet the requirements. The schema of FEW-NERD is inspired by FIGER (Ling and Weld, 2012), which contains 112 entity tags with good coverage. On this basis, we make some modifications according to the practical situation. It is worth noting that FEW-NERD focuses on named entities, omitting value/numerical/time/date entity types (Weischedel et al., 2013; Ringland et al., 2019) such as Cardinal, Day, and Percent. First, we modify the FIGER schema into a two-level hierarchy to incorporate simple domain information (Gillick et al., 2014). The coarse-grained types are {Person, Location, Organization, Art, Building, Product, Event, Miscellaneous}. Then we count the frequency of entity types in the automatically annotated FIGER data. After removing entity types with low frequency, 80 fine-grained types remain. Finally, to ensure the practicality of the annotation process, we conduct rounds of pre-annotation and make further modifications to the schema. For example, we combine the types Country, Province/State, City, and District into a single class GPE, since it is difficult to distinguish these types based only on context (especially for GPEs at different times). As another example, we create a Person-Scholar type, because in the pre-annotation step we found numerous person entities that express the semantics of research, such as mathematician, physicist, chemist, biologist, and paleontologist, yet the FIGER schema does not define this kind of entity type. We also conduct rounds of manual denoising to select types with truly high frequency. Consequently, the finalized schema of FEW-NERD includes 8 coarse-grained types and 66 fine-grained types, which are shown in detail with selected examples in the Appendix.

The raw corpus we use is the entire English Wikipedia dump, which has been widely used in the construction of NLP datasets (Han et al., 2018; Yang et al., 2018; Wang et al., 2020). Wikipedia contains a large variety of entities and rich contextual information for each entity. FEW-NERD is annotated at the paragraph level, and it is crucial to effectively select paragraphs with sufficient entity information. Moreover, the category distribution of the data is expected to be balanced, since the data is applied in a few-shot scenario. This is also a key difference between FEW-NERD and previous NER datasets, whose entity distributions are usually considerably uneven. To this end, we construct a dictionary for each fine-grained type by automatically collecting entity mentions annotated in FIGER; the dictionaries are then manually denoised. We develop a search engine to retrieve paragraphs that include entity mentions from the distant dictionaries. For each entity, we choose 10 paragraphs and construct a candidate set. Then, for each fine-grained class, we randomly select 1000 paragraphs for manual annotation. Eventually, 66,000 paragraphs are selected, covering all 66 fine-grained entity types, and each paragraph contains an average of 61.3 tokens. A simplified sketch of this dictionary-based retrieval step is shown below.
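Below is a minimal, illustrative sketch of the distant-dictionary paragraph retrieval described above. The real pipeline uses a Lucene-based search engine (see the Appendix); here we approximate it with plain substring matching, and all names and parameters are our own assumptions.

```python
import random
from collections import defaultdict

def retrieve_candidates(paragraphs, type_dicts, per_entity=10,
                        per_type=1000, seed=0):
    """For each fine-grained type, retrieve paragraphs mentioning entities
    from its (manually denoised) distant dictionary, then subsample a
    balanced pool for manual annotation. Illustrative sketch only."""
    rng = random.Random(seed)
    pool = defaultdict(list)  # fine-grained type -> candidate paragraphs
    for ftype, mentions in type_dicts.items():
        for mention in mentions:
            # Toy substring search; the paper indexes with Lucene instead.
            hits = [p for p in paragraphs if mention in p][:per_entity]
            pool[ftype].extend(hits)
    selected = {}
    for ftype, cands in pool.items():
        cands = list(dict.fromkeys(cands))  # drop duplicate paragraphs
        rng.shuffle(cands)
        selected[ftype] = cands[:per_type]  # e.g., 1000 paragraphs per type
    return selected
```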
Table 1 (annotated example): London [Art-Music] is the fifth album by the British [Loc-GPE] rock band Jesus Jones [Org-ShowOrg] in 2001 through Koch Records [Org-Company]. Following the commercial failure of 1997's "Already [Art-Music]", which led to the band and EMI [Org-Company] parting ways, the band took a hiatus before regathering for the recording of "London [Art-Music]" for Koch/Mi5 Recordings, with a more alternative rock approach as opposed to the techno sounds on their previous albums. The album had low-key promotion, initially only being released in the United States [Loc-GPE]. Two EPs were released from the album, "Nowhere Slow [Art-Music]" and "In the Face Of All This [Art-Music]".

As named entities are expected to be context-dependent, the annotation of named entities is complicated, especially with such a large number of entity types. For example, as shown in Table 1, in "London is the fifth album by the British rock band Jesus Jones...", London should be annotated as an entity of type Art-Music rather than Location-GPE. Such situations require that annotators have basic linguistic training and can make reasonable judgments based on context.

The annotators of FEW-NERD include 70 annotators and 10 experienced experts. All the annotators have linguistic knowledge and are instructed with detailed and formal annotation principles. Each paragraph is independently annotated by two well-trained annotators. Then, an experienced expert goes over the paragraph for possible wrong or omitted annotations and makes the final decision. With 70 annotators participating, each annotator spends an average of 32 hours on the annotation process. We ensure that all the annotators are fairly compensated at market price according to their workload (the number of examples per hour). The data is annotated and submitted in batches, and each batch contains 1000∼3000 sentences. To ensure the quality of FEW-NERD, for each batch of data, we randomly select 10% of the sentences and conduct double-checking. If the accuracy of the annotation is lower than 95% (measured at the sentence level), the batch is re-annotated. Furthermore, we calculate Cohen's Kappa (Cohen, 1960) to measure the agreement between the two annotators; the result is 76.44%, which indicates a high degree of consistency.

FEW-NERD is not only the first few-shot dataset for NER, but also one of the largest human-annotated NER datasets. We report statistics on the number of sentences, tokens, entity types, and entities of FEW-NERD and several widely used NER datasets in Table 2, including CoNLL'03, WikiGold, OntoNotes 5.0, WNUT'17, and I2B2. We observe that although OntoNotes and I2B2 are considered large-scale datasets, FEW-NERD is significantly larger than all of them. Moreover, FEW-NERD contains more entity types and annotated entities. As introduced in Section 4.2, FEW-NERD is designed for few-shot learning, so its distribution should not be severely uneven. Hence, we balance the dataset by selecting paragraphs through a distant dictionary. The data distribution is illustrated in Figure 1, where Location (especially GPE) and Person are the entity types with the most examples. Although utilizing a distant dictionary to balance the entity types cannot produce a fully balanced data distribution, it still ensures that each fine-grained type has a sufficient number of examples for few-shot learning. Knowledge transfer is crucial for few-shot learning (Li et al., 2019a).
To explore the knowledge correlations among all the entity types of FEW-NERD, we conduct an empirical study of entity type similarities in this section. We train a BERT-Tagger (details in Section 7.1) on 70% randomly selected data of FEW-NERD and use 10% of the data to select the model with the best performance (this is in fact the FEW-NERD (SUP) setting of Section 6.1). After obtaining a contextualized encoder, we produce entity mention representations for the remaining 20% of FEW-NERD. Then, for each fine-grained type, we randomly select 100 entity embeddings. For each pair of types, we compute the dot products between their entity embeddings and average them to obtain the similarity between the two types, as illustrated in Figure 2. We observe that entity types sharing an identical coarse-grained type typically have larger similarities, which makes knowledge transfer between them easier. In contrast, although some fine-grained types across coarse-grained types have large similarities, most of them share little correlation due to distinct contextual features. This result is consistent with intuition. Moreover, it inspires our benchmark settings from the perspective of knowledge transfer (see Section 6.2).

We collect and manually annotate 188,238 sentences with 66 fine-grained entity types in total, which makes FEW-NERD one of the largest human-annotated NER datasets. To comprehensively exploit the rich information of entities and contexts, and to evaluate the generalization of models from different perspectives, we construct three tasks based on FEW-NERD (statistics are reported in Table 3).

FEW-NERD (SUP). We first adopt a standard supervised setting for NER by randomly splitting 70% of the data as training data, 10% as validation data, and 20% as test data. In this setting, the training, dev, and test sets all contain the whole set of 66 entity types. Although the supervised setting is not the ultimate goal of the construction of FEW-NERD, it is still meaningful for assessing the instance-level generalization of NER models. As shown in Section 6.2, due to the large number of entity types, FEW-NERD is very challenging even in a standard supervised setting.

[Table 2 fragment; recoverable rows, with columns sentences / tokens / entities / entity types / domain: OntoNotes 5.0 (Weischedel et al., 2013): 103.8k / 2067k / 161.8k / 18 / General; WNUT'17 (Derczynski et al., 2017): 4.7k / 86.1k / 3.1k / 6 / Social Media; I2B2 (Stubbs and Uzuner, 2015): 107...]

FEW-NERD (INTRA) and FEW-NERD (INTER). For the two few-shot tasks, FEW-NERD (INTRA) splits the data by coarse-grained types, so that the entity types of the training, dev, and test sets belong to disjoint coarse-grained types, while FEW-NERD (INTER) only requires the fine-grained types to be disjoint, so that all coarse-grained types are shared across the splits; the intuition of this setting is to explore whether the coarse-grained information affects the prediction of new entities.

Recent studies show that pre-trained language models with deep transformers (e.g., BERT (Devlin et al., 2019a)) have become strong encoders for NER (Li et al., 2020b). We thus follow these empirical settings and use BERT as the backbone encoder in our experiments. We denote the parameters as θ and the encoder as f_θ. Given a sequence x = {x_1, ..., x_n}, the encoder produces contextualized representations for all tokens:

[h_1, h_2, ..., h_n] = f_θ([x_1, x_2, ..., x_n]).  (1)

Specifically, we implement four BERT-based models for supervised and few-shot NER: BERT-Tagger (Devlin et al., 2019b), ProtoBERT (Snell et al., 2017), NNShot (Yang and Katiyar, 2020), and StructShot (Yang and Katiyar, 2020).

BERT-Tagger. As stated in Section 6.1, we construct a standard supervised task based on FEW-NERD, so we implement a simple but strong baseline, BERT-Tagger, for supervised NER. BERT-Tagger is built by adding a linear classifier on top of BERT and is trained with a cross-entropy objective under full supervision; a minimal sketch is given below.
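Here is a minimal sketch of such a tagger, using the HuggingFace Transformers library; it is our illustration of the standard BERT-plus-linear-classifier recipe, not the authors' released implementation.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class BertTagger(nn.Module):
    """BERT encoder with a token-level linear classifier (sketch)."""

    def __init__(self, num_types, model_name="bert-base-uncased"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(model_name)
        # num_types entity labels plus one O (non-entity) label.
        self.classifier = nn.Linear(self.encoder.config.hidden_size,
                                    num_types + 1)

    def forward(self, input_ids, attention_mask, labels=None):
        # Contextualized token representations h_1..h_n = f_theta(x_1..x_n).
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        logits = self.classifier(hidden)  # (batch, seq_len, num_labels)
        if labels is None:
            return logits
        # Cross-entropy over tokens; -100 marks padding/special tokens.
        loss = nn.CrossEntropyLoss(ignore_index=-100)(
            logits.view(-1, logits.size(-1)), labels.view(-1))
        return loss, logits
```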
ProtoBERT. Inspired by the achievements of meta-learning approaches on few-shot learning (Finn et al., 2017; Snell et al., 2017; Ding et al., 2021), the first baseline model we implement is ProtoBERT, a method based on prototypical networks (Snell et al., 2017) with a BERT (Devlin et al., 2019a) backbone encoder. This approach derives a prototype z for each entity type by averaging the embeddings of the tokens that share that entity type, computed over the support set S. For the i-th type with support tokens S_i, the prototype is

z_i = (1 / |S_i|) Σ_{x∈S_i} f_θ(x).  (2)

For each token x in the query set Q, we first compute the distance between x and all the prototypes, using the squared l2 distance as the metric: d(f_θ(x), z) = ||f_θ(x) - z||_2^2. Then, from the distances between x and all prototypes, we compute the prediction probability of x over all types. In the training step, parameters are updated in each meta-task. In the testing step, the prediction is the label of the nearest prototype. That is, for a support set S_Y with types Y and a query token x, the prediction is

y* = arg min_{y∈Y} d(f_θ(x), z_y).  (3)

NNShot and StructShot. NNShot and StructShot (Yang and Katiyar, 2020) are state-of-the-art methods based on token-level nearest-neighbor classification. In our experiments, we use BERT as the backbone encoder to produce contextualized representations for a fair comparison. Unlike the prototype-based method, NNShot determines the tag of a query token based on token-level distances, d(x, x') = ||f_θ(x) - f_θ(x')||_2^2. Hence, for a support set S_Y with types Y and a query token x, NNShot assigns x the type of its nearest support token. With the same basic structure as NNShot, StructShot adopts an additional Viterbi decoder during the inference phase (Hou et al., 2020) (not in the training phase), where we estimate a transition distribution p(y'|y) and an emission distribution p(y|x) and solve

y* = arg max_y Π_i p(y_i | x) p(y_i | y_{i-1}).

To sum up, BERT-Tagger is a well-acknowledged baseline that produces strong results on supervised NER, while ProtoBERT and NNShot & StructShot use prototype-level and token-level similarity scores, respectively, to tackle the few-shot NER problem. These baselines are strong and representative models for the NER task. For implementation details, please refer to the Appendix; a minimal sketch of the nearest-prototype prediction is given below.
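The following is a minimal sketch of the prototype computation and nearest-prototype prediction of Equations (2) and (3), assuming token embeddings have already been produced by the BERT encoder f_θ; tensor shapes and function names are our own.

```python
import torch

def build_prototypes(support_emb, support_labels, num_types):
    """support_emb: (num_tokens, hidden); support_labels: (num_tokens,).
    Returns one prototype per type, Eq. (2): the mean embedding of all
    support tokens carrying that type (including the O class). Assumes
    every type occurs at least once in the support set."""
    protos = []
    for t in range(num_types):
        mask = support_labels == t
        protos.append(support_emb[mask].mean(dim=0))
    return torch.stack(protos)  # (num_types, hidden)

def proto_predict(query_emb, protos):
    """query_emb: (num_query_tokens, hidden). Squared l2 distance to every
    prototype; the prediction is the nearest prototype, Eq. (3), and the
    softmax over negative distances gives the training probabilities."""
    dists = torch.cdist(query_emb, protos, p=2) ** 2  # (Q, num_types)
    probs = torch.softmax(-dists, dim=-1)             # used for the loss
    preds = dists.argmin(dim=-1)                      # nearest prototype
    return preds, probs
```

NNShot replaces the prototypes with all individual support-token embeddings and takes the label of the single nearest token instead of the nearest class mean.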
We evaluate models on the query sets Q_test of the test episodes, calculating precision (P), recall (R), and micro F1-score over all test episodes. Instead of the popular BIO schema, we utilize the IO schema in our experiments, using I-type to denote all tokens of a named entity and O to denote other tokens. We evaluate all baseline models on the three benchmark settings introduced in Section 6: FEW-NERD (SUP), FEW-NERD (INTRA), and FEW-NERD (INTER).

Supervised NER. As mentioned in Section 6.1, we first split FEW-NERD as a standard supervised NER dataset. As shown in Table 4, BERT-Tagger yields promising results on the two widely used supervised datasets, with F1-scores of 91.34% and 89.11%, respectively. However, the model suffers a severe performance drop on FEW-NERD (SUP), because FEW-NERD (SUP) has many more types than the other datasets. The results indicate that FEW-NERD is challenging in the supervised setting and worth studying. We further analyze the performance on different entity types (see Figure 3). We find that the model achieves the best performance on the Person type and the worst performance on the Product type. Moreover, for almost all coarse-grained types, the Coarse-Other type has the lowest F1-score. This is because the semantics of such fine-grained types are relatively sparse and difficult to recognize. A natural intuition is that the performance on each entity type is related to the proportion of the type in the data; surprisingly, we find that they are not linearly correlated. For example, the model performs very well on the Art type, although this type represents only a small fraction of FEW-NERD.

Few-shot NER. For the few-shot benchmarks, we adopt 4 sampling settings: 5-way 1∼2-shot, 5-way 5∼10-shot, 10-way 1∼2-shot, and 10-way 5∼10-shot. Intuitively, 10-way 1∼2-shot is the hardest setting because it has the largest number of entity types and the fewest examples; similarly, 5-way 5∼10-shot is the easiest setting. The results on FEW-NERD (INTRA) and FEW-NERD (INTER) are reported in Table 5 and Table 6, respectively. Overall, we observe that the previous state-of-the-art methods equipped with a BERT encoder cannot yield promising results on FEW-NERD. At a high level, models generally perform better on FEW-NERD (INTER) than on FEW-NERD (INTRA); the latter is the more difficult task, as analyzed in Section 5.2 and Section 6, because it splits the data according to coarse-grained entity types, which means that the entity types of the training set and test set share less knowledge. In a horizontal comparison, consistent with intuition, almost all methods produce the worst results in the 10-way 1∼2-shot setting and the best performance in the 5-way 5∼10-shot setting. Across models, ProtoBERT generally achieves better performance than NNShot and StructShot, especially in the 5∼10-shot settings, where prototype-based calculation may differ more from per-token calculation. StructShot sees a large improvement in precision on FEW-NERD (INTRA), which shows that the Viterbi decoder at the inference stage can help remove false-positive predictions when knowledge transfer is hard. We also observe that NNShot and StructShot may suffer from the instability of the nearest-neighbor mechanism in the training phase, while prototypical models are more stable, because the calculation of prototypes essentially serves as regularization.

We conduct an error analysis to explore the challenges of FEW-NERD; the results for 5-way 5∼10-shot on FEW-NERD (INTER) are reported in Table 7, where "Within" indicates errors within the coarse-grained types and "Outer" indicates errors across coarse-grained types. We choose the FEW-NERD (INTER) setting because its test set contains all the coarse-grained types. We analyze the errors of models from two perspectives. Span Error denotes misclassification at the token level: if an O token is misclassified as part of an entity, i.e., as I-type, it is an FP case, and if a token with type I-type is misclassified as O, it is an FN case. Type Error indicates misclassification of the entity type when the span is correctly detected: a "Within" error means the entity is misclassified as another type under the same coarse-grained type, while an "Outer" error means it is misclassified as a type under a different coarse-grained type. As the statistics of type errors may be affected by the episodes sampled in testing, we conduct 5 rounds of experiments and report the average results. The results demonstrate that token-level accuracy is not that low, since most O tokens are detected correctly; however, an entity mention is considered wrong if even one of its tokens is wrong, which becomes the main source of the difficulty of FEW-NERD. If an entity span can be accurately detected, the models yield relatively good performance on entity typing, indicating the effectiveness of metric learning. A minimal sketch of this error categorization is given below.
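The following illustrative sketch implements the span/type error categorization described above, operating on IO tags of the form "I-Coarse-Fine" and "O"; the tag format and function names are our assumptions, not the paper's evaluation code.

```python
from collections import Counter

def categorize_errors(gold_tags, pred_tags):
    """Count span errors (FP/FN at the token level) and type errors
    (Within/Outer the coarse-grained type) for one episode.
    Tags look like "O" or "I-Person-Scholar" (illustrative format)."""
    stats = Counter()
    for g, p in zip(gold_tags, pred_tags):
        if g == "O" and p != "O":
            stats["span_FP"] += 1          # O token predicted as entity
        elif g != "O" and p == "O":
            stats["span_FN"] += 1          # entity token predicted as O
        elif g != "O" and p != "O" and g != p:
            g_coarse = g.split("-")[1]     # coarse type of the gold tag
            p_coarse = p.split("-")[1]     # coarse type of the predicted tag
            key = "type_within" if g_coarse == p_coarse else "type_outer"
            stats[key] += 1
    return stats
```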
We propose FEW-NERD, a large-scale few-shot NER dataset with fine-grained entity types. It is the first few-shot NER dataset and also one of the largest human-annotated NER datasets. FEW-NERD provides three unified benchmarks to assess approaches to few-shot NER and could facilitate future research in this area. By implementing state-of-the-art methods, we carry out a series of experiments on FEW-NERD, demonstrating that few-shot NER remains a challenging problem that is worth exploring. In the future, we will extend FEW-NERD by adding cross-domain annotations, distant annotations, and finer-grained entity types. FEW-NERD also has the potential to advance the construction of continual knowledge graphs.

In this paper, we present a human-annotated dataset, FEW-NERD, for few-shot learning in NER. We describe in the main text the details of the collection process and conditions, the compensation of annotators, and the measures taken to ensure quality. The corpus of the dataset is publicly obtained from Wikipedia, and we have not modified or interfered with its content. FEW-NERD is likely to directly facilitate research on few-shot NER and to further advance the construction of large-scale knowledge graphs (KGs). Models and systems built on FEW-NERD may help construct KGs in various domains, including the biomedical, financial, and legal fields, and further promote the development of NLP applications in specific domains. FEW-NERD is annotated in English, so the dataset will mainly facilitate NLP research in English. For the sake of energy saving, we will not only open-source the dataset and the code, but also release the checkpoints of our models from the experiments to reduce unnecessary carbon emissions.

We use the English Wikipedia dump and extract the raw text with WikiExtractor. The NLTK language toolkit is used for word and sentence tokenization in the preprocessing stage. As stated in Section 4.2, we develop a search engine to index and select paragraphs containing keywords from the distant dictionaries. If the search were performed with linear scans, the process would be extremely slow; instead, we build the search engine on Lucene to conduct efficient indexing and searching.

A.2 More Details of the Schema. As stated in Section 4.1, we use FIGER (Ling and Weld, 2012) as the starting point and conduct several rounds of modifications. Beyond the modifications mentioned in Section 4.1, we also conduct manual denoising of the automatically annotated FIGER data. For each entity type and its automatically annotated mentions, we randomly select 500 mentions and compute the accuracy to estimate the real frequency. For example, the statistics report that cemetery is a type with high frequency; however, a large number of the mentions labeled as cemetery are actually GPE. Similarly, engineer is also affected by noise. The annotation interface is shown in Figure 4; annotators can conveniently select entity spans, annotate the corresponding coarse-grained and fine-grained types, and check the current annotation information on the interface.

All four models use BERT-base (Devlin et al., 2019a) as the backbone encoder, initialized with the corresponding pre-trained uncased weights. The hidden size is 768, and the numbers of layers and attention heads are both 12.
Models are implemented with the PyTorch framework (Paszke et al., 2019) and HuggingFace Transformers (Wolf et al., 2020). BERT models are optimized with AdamW (Loshchilov and Hutter, 2019) at a learning rate of 1e-4. We evaluate our implementations of NNShot and StructShot on the datasets used in the original paper and obtain similar results. For supervised NER, the batch size is 8, and we train BERT-Tagger for 70,000 steps before evaluating it on the test set. For the 5-way 1∼2-shot and 5∼10-shot settings, the batch sizes are 16 and 4, and for the 10-way 1∼2-shot and 5∼10-shot settings, the batch sizes are 8 and 1. We train for 12,000 episodes, use 500 episodes of the dev set to select the best model, and test it on 5,000 episodes of the test set. Most hyper-parameters follow the original settings. We manually tune the Viterbi hyper-parameter τ for StructShot: its value is 0.320 for the 1∼2-shot settings and 0.434 for the 5∼10-shot settings (a generic sketch of the Viterbi decoding step is given below). All experiments are conducted with CUDA on NVIDIA Tesla V100 GPUs. With 2 GPUs, the average time to train 10,000 episodes is 135 minutes. The models have 120M parameters.
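As a companion to the StructShot description in Section 7, here is a generic Viterbi decoding sketch over IO tags that combines emission scores p(y|x) with transition scores p(y'|y); how StructShot actually estimates these distributions, and the exact role of τ in re-normalizing them, follows Yang and Katiyar (2020), so this is only an illustrative decoder with names of our own.

```python
import numpy as np

def viterbi_decode(emission, transition):
    """emission: (T, L) array of log p(y_t | x);
    transition: (L, L) array of log p(y_t | y_{t-1}).
    Returns the highest-scoring label sequence (sketch of the decoding
    step only, not of StructShot's distribution estimation)."""
    T, L = emission.shape
    score = emission[0].copy()            # best log-score ending in each label
    back = np.zeros((T, L), dtype=int)    # backpointers
    for t in range(1, T):
        # cand[i, j]: score of label i at t-1 followed by label j at t
        cand = score[:, None] + transition + emission[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```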
References
Chiu and Nichols. 2016. Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics.
Choi et al. 2018. Ultra-fine entity typing.
Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement.
Cui et al. 2017. KBQA: Learning question answering over QA corpora and knowledge bases.
Derczynski et al. 2017. Results of the WNUT2017 shared task on novel and emerging entity recognition.
Devlin et al. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding.
Ding et al. 2019. Event detection with trigger-aware lattice neural network.
Ding et al. 2021. Prototypical representation learning for relation extraction.
Finn et al. 2017. Model-agnostic meta-learning for fast adaptation of deep networks.
Fritzler et al. 2019. Few-shot classification in named entity recognition task.
Gillick et al. 2014. Context-dependent fine-grained entity type tagging.
Han et al. 2018. FewRel: A large-scale supervised few-shot relation classification dataset with state-of-the-art evaluation.
Hofer et al. 2018. Few-shot learning for named entity recognition in medical text.
Hou et al. 2020. Few-shot slot tagging with collapsed dependency transfer and label-enhanced task-adaptive projection network.
Huang et al. 2020. Few-shot named entity recognition: A comprehensive study.
Lample et al. 2016. Neural architectures for named entity recognition.
Li et al. 2019a. Large-scale few-shot learning: Knowledge transfer with class hierarchy.
Li et al. 2019b. Chinese relation extraction with multi-grained information and external linguistic knowledge.
Li et al. 2020a. Few-shot named entity recognition via meta-learning.
Li et al. 2020b. A unified MRC framework for named entity recognition.
Ling and Weld. 2012. Fine-grained entity recognition.
Loshchilov and Hutter. 2019. Decoupled weight decay regularization.
Ma and Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF.
Paszke et al. 2019. PyTorch: An imperative style, high-performance deep learning library.
Ringland et al. 2019. NNE: A dataset for nested named entity recognition in English newswire.
Ritter et al. 2011. Named entity recognition in tweets: An experimental study.
Shen et al. 2020. Modeling relation paths for knowledge graph completion.
Snell et al. 2017. Prototypical networks for few-shot learning.
Stubbs and Uzuner. 2015. Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus.
Tjong Kim Sang. 2002. Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition.
Wang et al. 2020. MAVEN: A massive general domain event detection dataset.
Weischedel et al. 2013. OntoNotes release 5.0 LDC2013T19. Linguistic Data Consortium, Philadelphia.
Wolf et al. 2020. Transformers: State-of-the-art natural language processing.
Yang and Katiyar. 2020. Simple and effective few-shot named entity recognition with structured nearest neighbor learning.
Yang et al. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering.

As introduced in Section 4.1 of the main text, FEW-NERD is manually annotated with 8 coarse-grained and 66 fine-grained entity types, all of which are listed in Table 8. The schema is designed with practical situations in mind; we hope it helps readers better understand FEW-NERD. Note that ORG abbreviates Organization and MISC abbreviates Miscellaneous.

Table 8: All the coarse-grained and fine-grained entity types in FEW-NERD, each paired with an example sentence in which only entities of the corresponding type are highlighted. Selected rows: Mountain: "C.G.E. Mannerheim met Thubten Gyatso in Wutai Shan during the course of his expedition from Turkestan to Peking."; Other: "Herodotus (7.59) reports that Doriscus was the first place Xerxes the Great stopped to review his troops."; Religion: "D'Souza was born on 10 November 1985 into a Goan Catholic family in Goa, India."; Music: ""Get Right" is a song recorded by American singer Jennifer Lopez for her fourth studio album."; Educational Degree: "Sigurlaug enrolled into the medical department of the University of Iceland and graduated as a Medical Doctor in 2010."