Few-shot Named Entity Recognition with Cloze Questions

Valerio La Gatta, Vincenzo Moscato, Marco Postiglione, Giancarlo Sperlì

November 24, 2021

Abstract

Despite the huge and continuous advances in computational linguistics, the lack of annotated data for Named Entity Recognition (NER) is still a challenging issue, especially in low-resource languages and when domain knowledge is required for high-quality annotations. Recent findings in NLP show the effectiveness of cloze-style questions in enabling language models to leverage the knowledge acquired during the pre-training phase. In this work, we propose a simple and intuitive adaptation of Pattern-Exploiting Training (PET), a recent approach which combines the cloze-question mechanism and fine-tuning for few-shot learning: the key idea is to rephrase the NER task with patterns. Our approach achieves considerably better performance than standard fine-tuning, and comparable or improved results with respect to other few-shot baselines, without relying on manually annotated data or distant supervision, on three benchmark datasets: NCBI-Disease, BC2GM and a private Italian biomedical corpus.

Introduction

Recent advances in computational linguistics, characterized by the intensive study and wide adoption of language models [3, 5, 22], have led to substantial improvements in Named Entity Recognition (NER), the task of identifying and classifying entities in a given text. However, annotating datasets for specific domains or languages is an expensive and time-consuming process which often requires domain knowledge (e.g. healthcare, finance). At present, few-shot settings for NER with pre-trained language models have not been extensively studied.

A recent line of research provides pre-trained language models with "task descriptions" in the form of cloze questions [21], which enable them to leverage the knowledge acquired during the pre-training phase. Brown et al. [3] show that this approach, without any fine-tuning, achieves high performance on a variety of SuperGLUE [29] tasks with only 32 examples. Schick et al. [24] show that extending this approach with regular gradient-based optimization yields better and lighter models.

To the best of our knowledge, cloze questions have not yet been investigated for NER, probably due to its sequence-labeling nature: it is difficult (if not impossible) to provide a single task description which allows the language model to assign a label to each token of the input text. In this work, we propose a simple and intuitive adaptation which enables us to test the effectiveness of cloze questions for NER: from each input example, we generate a new sentence for every token it contains. An example is shown below:

    x. In the sentence above, the word t refers to the [MASK] of a disease entity.

Figure 1: Example of PVP application. An input example (x, t) is generated for each token t in the sentence x; a label y is associated with the token t. The pattern P(x, t) and verbalizer v(y) are then applied to generate the input of the masked language model, which is fine-tuned on the q_p(y | x, t) loss. The language model thus learns how to appropriately replace the masked token (e.g. [MASK] = beginning/inside/outside).

Being based on the Pattern-Exploiting Training (PET) framework of Schick et al. [24], we refer to our approach as PETER (Pattern-Exploiting Training for Named Entity Recognition).
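As an illustration of this per-token reformulation, the following minimal Python sketch generates one cloze question per token using the pattern shown above. The function names and the example sentence are our own illustrative assumptions, not part of the authors' implementation:

```python
# Minimal sketch of PETER's per-token cloze-question generation.
# Names and the example sentence are illustrative, not the authors' code.

MASK = "[MASK]"

def pattern_p1(sentence, token, entity_type="disease"):
    """Instantiate the pattern shown above for a single token of the sentence."""
    return (f"{sentence} In the sentence above, the word {token} "
            f"refers to the {MASK} of a {entity_type} entity.")

def generate_cloze_questions(tokens, entity_type="disease"):
    """Generate one cloze question per token of the input sentence."""
    sentence = " ".join(tokens)
    return [pattern_p1(sentence, tok, entity_type) for tok in tokens]

for q in generate_cloze_questions(["BRCA1", "mutations", "cause", "breast", "cancer", "."]):
    print(q)
```

A sentence of |x| tokens thus yields |x| classification instances, one per token, each carrying exactly one [MASK].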
Method

We describe here the slight adjustment we made to enable PET to handle the recognition of entities. Given a pre-trained masked language model M with vocabulary T, PET addresses the task of mapping a textual input x ∈ X to a label y ∈ Y = {1, ..., k}, with k ∈ N, by making M predict the token which should replace a masked token [MASK] ∈ T. To accomplish this, PET requires a set of pattern-verbalizer pairs (PVPs), each consisting of:

• a pattern P : X → T* which converts inputs to cloze questions, i.e. sequences of tokens containing exactly one [MASK] token;
• a verbalizer v : Y → T mapping each label to a token representing its meaning.

The training set for supervised NER consists of a list of examples, each of which is a sequence of token-tag pairs (x, y). The set of tags depends on the NER schema (e.g. IO, IOB2, IOBES). Differently from PET, our patterns take as input not only the sentence but also the token to which the cloze question refers, as shown in Figure 1. The steps of our pipeline are listed below:

1. from the textual input x, |x| input examples are generated, each of them referring to a different token in the sentence;
2. for each PVP (P(x, t), v(y)), a masked language model M is fine-tuned based on the conditional probability distribution of y given (x, t):

    q_p(y | x, t) = exp(M(v(y) | P(x, t))) / Σ_{y' ∈ Y} exp(M(v(y') | P(x, t))),

   where M(v(y) | P(x, t)) denotes the score that the language model assigns, at the masked position, to the token v(y);
3. unlabeled data are used to generate a soft-labeled dataset from the aggregation of the predictions of the previously trained masked language models;
4. the soft-labeled dataset is used to train a masked language model with a classification head.

Experiments

Datasets. We use the original train-dev-test splits to facilitate comparability. However, in accordance with Schick et al. [28], we do not use the development sets and we do not perform hyperparameter optimization, which is not feasible in few-shot scenarios. For three replications with different seeds, we randomly select k training instances from the training split (k ∈ {10, 25, 50, 100} being the number of "shots") and report results on the whole test set. We evaluate our framework on the following datasets:

• NCBI-Disease [6];
• BC2GM [27];
• ClinicalNotesITA, a private corpus of Italian clinical notes (referred to as WINCARE in the results).

We use the pre-processed versions of NCBI-Disease and BC2GM provided by Wang et al. [30].

Figure 2: Average precision, recall and F1 scores obtained with the standard fine-tuning approach and with PETER as the number of shots increases.

Patterns. We experiment with the following patterns, where ⟨type⟩ denotes an entity-type placeholder:

• P1(x, t): x. In the sentence above, the word t refers to the [MASK] of a ⟨type⟩ entity.
• P2(x, t): x. Question: In the passage above, which part of a ⟨type⟩ entity does the word t refer to? Answer: [MASK].

Note that ⟨type⟩ is replaced with "disease" in the first and third datasets and with "gene" in the second one.

PET. We have mostly left the default configuration of PET unchanged, with the exception of the following settings: (1) we set the maximum sequence length to 128 for NCBI-Disease and BC2GM and to 256 for ClinicalNotesITA; (2) we use BioBERT [12] for NCBI-Disease and BC2GM and GilBERTo^1 for ClinicalNotesITA as pre-trained language models; (3) we limit the number of unlabeled examples to 10,000 for faster training; (4) we set the number of epochs to 10, 7, 5 and 5 for the 10-, 25-, 50- and 100-shot scenarios, respectively.
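To make step 2 of the pipeline concrete, the sketch below shows how q_p(y | x, t) can be computed by reading a masked language model's logits at the [MASK] position and applying a softmax restricted to the verbalizer tokens. The model checkpoint, the IOB-style verbalizer words and all function names are illustrative assumptions; PET additionally fine-tunes the model on this distribution, which we omit here:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Illustrative checkpoint; the paper uses BioBERT and GilBERTo instead.
model_name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

# Assumed verbalizer for an IOB-style reading of the masked position.
verbalizer = {"B": "beginning", "I": "inside", "O": "outside"}

def q_p(cloze_question):
    """q_p(y | x, t): softmax over the verbalizer-token scores at [MASK]."""
    inputs = tokenizer(cloze_question, return_tensors="pt")
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]  # scores over the vocabulary
    ids = [tokenizer.convert_tokens_to_ids(w) for w in verbalizer.values()]
    probs = torch.softmax(logits[ids], dim=-1)        # restrict to the v(y) tokens
    return dict(zip(verbalizer, probs.tolist()))

question = ("Mutations in BRCA1 cause breast cancer. In the sentence above, "
            "the word cancer refers to the [MASK] of a disease entity.")
print(q_p(question))
```

Restricting the softmax to the verbalizer tokens turns the masked-token prediction into a k-way classification over the label set Y, which is what the fine-tuning loss in step 2 operates on.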
Metrics. We use precision, recall and F1 scores to evaluate our results. We rely on seqeval^2 and sklearn^3 for evaluations on the IOB2 schema.

Results. To gauge the ability of PETER to provide effective annotations with few examples, we compare its results with (1) the regular fine-tuning approach, which starts from a pre-trained language model and fine-tunes it with a classification head, (2) TriggerNER [15], which uses manually annotated triggers to guarantee good few-shot performance, and (3) BOND [14], which leverages distant supervision to automatically increase the number of labelled samples.

Figure 2 shows a significant improvement with respect to the standard fine-tuning approach on the NCBI-Disease and BC2GM datasets. On the WINCARE dataset, PETER performs similarly to the baseline, probably because all the WINCARE clinical notes are collected from the same department of cardiology and can thus be effectively represented by few instances. In Table 1 we report the results of the baseline models in a highly constrained 10-shot scenario. PETER's results are extremely promising and worthy of further research: our approach reaches comparable or superior performance with respect to the other baselines without relying on manually annotated triggers, as in TriggerNER, or on distant supervision, as in BOND.

Related Work

Recently, there has been an increasing number of research works on few-shot Named Entity Recognition, mainly based on meta-learning [4, 9, 10, 13, 18, 31], distant and weak supervision [1, 11, 16, 17, 26, 32] and transfer learning [2, 23, 25]. However, current methods usually rely on additional data or annotation efforts [15], which can be a strong limitation in low-resource languages and domains. A recent line of research uses cloze-style questions to rephrase tasks, enabling language models to obtain high performance in unsupervised scenarios by leveraging the knowledge acquired during the pre-training phase [21]. This idea has been successfully applied to unsupervised text classification [20], commonsense knowledge mining [8] and knowledge probing [7, 19]. Schick et al. [24] show that not only can the cloze-question approach be combined with regular gradient-based fine-tuning, but it can also lead to considerable improvements in both quality and efficiency. Thanks to the inclusion of a few supervised examples to fine-tune the language model, it is possible to obtain performance comparable to GPT-3 [3] while using far fewer parameters [28]. Since cloze questions were originally designed for text classification, to the best of our knowledge their use for NER in few-shot settings had not been investigated until now.

Conclusion

We have shown that the cloze-question mechanism can also be a valuable technique for few-shot NER. Our approach, based on an adaptation of PET, strongly outperforms standard fine-tuning and obtains results comparable with the state of the art without relying on any additional information, such as manual annotations or distant supervision. In future work, we would like to extend this approach to multi-entity datasets and to further investigate the application of cloze questions to NER.
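For reference, the following minimal example illustrates the entity-level evaluation on the IOB2 schema mentioned in the Metrics paragraph above, using the seqeval library; the tag sequences are toy values for illustration, not results from our experiments:

```python
from seqeval.metrics import classification_report, f1_score, precision_score, recall_score

# Toy IOB2 sequences for illustration only; the actual evaluation
# uses the model's predictions on the full test split.
y_true = [["O", "O", "B-Disease", "I-Disease", "O"]]
y_pred = [["O", "O", "B-Disease", "O", "O"]]

print(precision_score(y_true, y_pred))     # entity-level precision
print(recall_score(y_true, y_pred))        # entity-level recall
print(f1_score(y_true, y_pred))            # entity-level F1
print(classification_report(y_true, y_pred))
```

Note that seqeval scores whole entity spans rather than individual tokens, so a partially recovered entity (as in the toy prediction above) counts as an error.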
References

[1] Distant supervision and noisy label learning for low-resource named entity recognition: A study on Hausa and Yorùbá.
[2] Using few-shot learning techniques for named entity recognition and relation extraction.
[3] Language models are few-shot learners.
[4] Few-shot event detection with prototypical amortized conditional random field.
[5] BERT: Pre-training of deep bidirectional transformers for language understanding.
[6] NCBI disease corpus: A resource for disease name recognition and concept normalization.
[7] What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models.
[8] Commonsense knowledge mining from pretrained models.
[9] Few-shot classification in named entity recognition task.
[10] Learning to classify intents and slot labels given a handful of examples. arXiv preprint.
[11] Learning to contextually aggregate multi-source supervision for sequence labeling.
[12] BioBERT: A pre-trained biomedical language representation model for biomedical text mining.
[13] MetaNER: Named entity recognition with meta-learning.
[14] BOND: BERT-assisted open-domain named entity recognition with distant supervision. arXiv preprint.
[15] TriggerNER: Learning with entity triggers as explanations for named entity recognition.
[16] Knowledge-augmented language model and its application to unsupervised named-entity recognition.
[17] A graph attention model for dictionary-guided named entity recognition.
[18] Few-shot learning for slot tagging with attentive relational network.
[19] Language models as knowledge bases? In EMNLP.
[20] Zero-shot text classification with generative language models. arXiv preprint.
[21] Language models are unsupervised multitask learners.
[22] Exploring the limits of transfer learning with a unified text-to-text transformer.
[23] Making monolingual sentence embeddings multilingual using knowledge distillation.
[24] Exploiting cloze questions for few-shot text classification and natural language inference. arXiv preprint.
[25] BioBERTpt: A Portuguese neural language model for clinical named entity recognition.
[26] Learning named entity tagger using domain-specific dictionary.
[27] Overview of BioCreative II gene mention recognition.
[28] It's not just size that matters: Small language models are also few-shot learners.
[29] SuperGLUE: A stickier benchmark for general-purpose language understanding systems.
[30] Cross-type biomedical named entity recognition with deep multi-task learning.
[31] Knowledge-aware few-shot learning framework for biomedical event trigger identification.
[32] Counterfactual generator: A weakly-supervised method for named entity recognition.