Named Entity Recognition and Linking Augmented with Large-Scale Structured Data

Paweł Rychlikowski, Bartłomiej Najdecki, Adrian Łańcucki, Adam Kaczmarek
2021-04-27

Abstract. In this paper we describe our submissions to the 2nd and 3rd SlavNER Shared Tasks held at BSNLP 2019 and BSNLP 2021, respectively. The tasks focused on the analysis of named entities in multilingual Web documents written in Slavic languages with rich inflection. Our solution takes advantage of large collections of both unstructured and structured documents. The former serve as data for unsupervised training of language models and embeddings of lexical units. The latter refers to Wikipedia and its structured counterpart, Wikidata, our source of lemmatization rules and real-world entities. With the aid of these resources, our system could recognize, normalize, and link entities while being trained on only small amounts of labeled data.

Intelligent analysis of texts written in natural languages, despite the advancements made with deep neural networks, is still regarded as challenging. The lingua franca of science is English, and new methods are typically evaluated first on English data, and then often on other Germanic or Romance languages. This puts a certain bias on the development and design of modern NLP methods, which are not always transferable, nor their metrics comparable, across languages and language families.

Due to the complexity and inherent vagueness of intelligent language processing, it has naturally been split into simpler tasks, one of which, named entity recognition (NER), is the focus of this paper. The output of a NER system is traditionally a set of labelled phrases recognized in a given text. In order to process a document, one has to not only find and label the entities, but also appropriately link subsequent occurrences of the same entity. The task becomes harder when the linking is to be made across languages, for entities that are globally present.

We describe our submission to the 3rd Multilingual Named Entity Challenge in Slavic languages, held at the 8th Workshop on Balto-Slavic Natural Language Processing (BSNLP) in conjunction with the EACL 2021 conference. The system is similar to the one submitted to the 2nd Multilingual Named Entity Challenge in Slavic languages (Piskorski et al., 2019), held at the 7th BSNLP Workshop in conjunction with the ACL 2019 conference, and we discuss the differences between the two systems. The aim of these shared tasks was to recognize, normalize, and ultimately link - on a document, language, and cross-language level - all named entities in collections of documents concerning the same topic, e.g., the 2020 US presidential election. Named entities are split into five categories: PER (person), LOC (location), ORG (organization), PRO (product), and EVT (event). The 2019 edition featured four Slavic languages (Czech, Russian, Bulgarian, Polish), and the 2021 edition featured six (the previous four plus Ukrainian and Slovenian).

In our solution we combined models trained unsupervised on large datasets and fine-tuned on small ones in a supervised way, with simple, white-box algorithms that perform the later stages of processing in a stable and predictable manner. In addition, we took advantage of similarities between certain languages in order to augment the data and further improve the results.
Our system chains three modules, for named entity recognition, lemmatization, and linking, which correspond to the objectives of the BSNLP Shared Task. We describe them in detail in the following sections. Our submissions for the 2019 and the 2021 shared tasks were similar, and differed only in the first element of the chain: the entity recognition method.

Because the training datasets were small, we looked for other labeled datasets, namely KPWr (Marcińczuk et al., 2016), CNEC (Ševčíková et al., 2014), and FactRuEval (Starostin et al., 2016). There is no common standard for labelling NER datasets, and these extra datasets had to be remapped onto the label set of the shared task. However, their addition did improve the recognition scores, and we describe them in the following paragraphs.

PL We used 1343 documents from KPWr with named entity annotations, pre-processed with the liner2-convert tool (Marcińczuk et al., 2017), flattening and mapping the original categories as shown in Table 1.

RU, BG, UK For languages with Cyrillic script we used the FactRuEval 2016 corpus (Starostin et al., 2016), consisting of 255 documents with 11754 annotated spans. Interestingly, the addition of this dataset improved scores for BG and UK despite the language mismatch.

CS, SL For Czech and Slovene we used the Czech Named Entity Corpus (Ševčíková et al., 2014), containing 8993 sentences with 35220 manually annotated named entities, classified according to a two-level hierarchy.

Recognition in our 2019 submission was realized with Flair (Akbik et al., 2018), a model consisting of an embedding layer and a bi-directional LSTM with a Conditional Random Field output (BiLSTM-CRF). The embedding layer aggregated pre-trained word representations of varying granularity and origin: word embeddings, subword embeddings (Heinzerling and Strube, 2018), and the contextual forward and backward character embeddings inherent to Flair. Because of the data scarcity, we adopted the philosophy of making our systems "neural gazetteers". To this end, we tried to collect as many and as varied embeddings as possible. This line of reasoning applied especially to word-level embeddings. Ideally, we wanted our systems to have, for every language, embeddings trained on Wikipedia, Common Crawl, and a collection of news articles. We found it beneficial to mix word-piece and character embeddings between languages; for instance, our model for Russian used Bulgarian embeddings. This is especially useful when a model of a specific granularity is unavailable in the target language. Lastly, we also found it beneficial to mix training data for seemingly related languages, and improved the scores by adding the additional FactRuEval data to the Bulgarian training dataset. Our recurrent recognition model underperformed in comparison to the top 2019 contestants, notably those based on BERT (Arkhipov et al., 2019; Devlin et al., 2019). We present an excerpt from the 2019 recognition results in Section 3.1.

For our submission to the 2021 BSNLP Shared Task we used FLERT (Schweter and Akbik, 2020), a state-of-the-art architecture for named entity recognition. It is a BERT-style transformer approach in which an XLM-RoBERTa model (Conneau et al., 2019), initially trained on a 100-language Common Crawl corpus (Wenzek et al., 2020), is fine-tuned on a small, language-specific corpus. Unlike our 2019 model, it does not use a CRF output layer. We found that FLERT models train fast and outperform our previously used Flair models by a significant margin.
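For concreteness, a minimal sketch of such FLERT-style fine-tuning with the flair library is shown below (assuming a flair 0.11-style API); the corpus paths, column layout, and hyperparameters are illustrative assumptions and do not reproduce the exact submitted configuration.

```python
# Hedged sketch of FLERT-style fine-tuning with the flair library.
# File paths, column layout, and hyperparameters are illustrative assumptions.

from flair.datasets import ColumnCorpus
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# A small, language-specific corpus in CoNLL-style column format (token, BIO tag)
corpus = ColumnCorpus(
    "data/bsnlp_pl",                      # hypothetical data folder
    column_format={0: "text", 1: "ner"},
    train_file="train.txt", dev_file="dev.txt", test_file="test.txt",
)
label_dict = corpus.make_label_dictionary(label_type="ner")

# XLM-RoBERTa embeddings with document-level context, as in FLERT
embeddings = TransformerWordEmbeddings(
    model="xlm-roberta-large",
    layers="-1",
    subtoken_pooling="first",
    fine_tune=True,
    use_context=True,
)

# Linear tag head without CRF or RNN, following the FLERT fine-tuning recipe
tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=label_dict,
    tag_type="ner",
    use_crf=False,
    use_rnn=False,
    reproject_embeddings=False,
)

trainer = ModelTrainer(tagger, corpus)
trainer.fine_tune("models/flert_pl", learning_rate=5.0e-6, mini_batch_size=4, max_epochs=10)
```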
In the process of lemmatization of compound phrases, some words are converted into their lemmas and some words remain unchanged. Occasionally some words are changed into other forms, e.g., adjectives might be transformed to nominatives with an appropriate gender. In the low-data regime of the shared task, we opted for a simple rule-based system and data augmentation. We pose the lemmatization task as splitting a word w into two concatenated parts, w = w1 w2, and computing the lemma as w1 v2, where (w2, v2) ∈ R_lem and R_lem is a small set of single-word lemmatization rules. We use two main additional sources of information:

Wikipedia We take advantage of the numerous links between articles, from which we extract (anchor, title) pairs. The anchors are often the inflected forms, and the document titles the lemmatized forms, of the same entity. In order to filter out spurious pairs, we consider a pair (anchor, title) a correct lemmatization if both the anchor and the title have the same number of words, and every i-th word in the title is either equal to the i-th word in the anchor, or is its possible lemma. Finally, we heuristically recognize a small set of words for later use, which we call stopper words. We define them as words shared between the anchor and the title such that all words that follow them are identical in the anchor and the title, e.g., in the (anchor, title) pair (Bazylikę św. Pawła za Murami, Bazylika św. Pawła za Murami), a stopper word is "św.".

Universal Dependencies (UD) (Universal Dependencies Consortium, 2021) is a large collection of treebanks in multiple languages. We extract morphosyntactic information (word, lemma, POS tags, and additional parameters) from the words present in the UD subsets for our target languages. Using that information, we construct single-word lemmatization rules. We say that the word w is a possible lemma of v if there is a single-word lemmatization rule transforming v into w.

PoliMorfologik For the Polish language, we additionally use PoliMorfologik (Woliński et al., 2012), a comprehensive morphosyntactic dictionary, which allows us to extract a large collection of lemmatization rules.

Lemmatization of every phrase gives rise to a lemmatization schema. It works as follows: for every word we take its suffix (the longest suffix which occurs in the list of the 2000 most popular suffixes); in this way we obtain the left-hand side of the rule. The right-hand side describes how these suffixes should be transformed. For instance, for the pair (Václavem Havlem, Václav Havel) we obtain a rule that strips the suffix -em from the first word and rewrites the suffix -lem of the second word to -el.

Our lemmatization algorithm takes a phrase (a named entity found in the first stage) and returns its lemma. It follows that we do not use any information from the words surrounding the phrase (its context). We try to apply the following heuristics, in the given order (a code sketch follows the list):

1. Try to find the (rightmost) stopper word. If there is one, then leave the suffix of the phrase after the stopper (including the stopper itself) unchanged, and find the lemma of the prefix.
2. Try to apply rule-based agreement-phrase lemmatization (Polish only).
3. Try to find a lemmatization schema suitable for the phrase. If there is more than one such schema, use the one which gives the "most natural" lemmatization, i.e., which prefers common words and words occurring in lemmas.
4. Replace every word with its most popular lemma (in the training data and in Wikipedia); if the word does not occur there, leave it unchanged.
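The sketch below is a minimal illustration of the single-word rule mechanism and of the most-popular-lemma fallback (heuristic 4); the rule set, popularity counts, and helper names are illustrative assumptions rather than the actual implementation.

```python
# Illustrative sketch (not the actual system code) of single-word lemmatization rules:
# a rule (suffix, replacement) rewrites a word ending in `suffix`, and a phrase
# falls back to the most popular per-word lemma (heuristic 4).

from typing import Dict, List, Tuple

# Hypothetical rule set R_lem extracted from UD / PoliMorfologik
SINGLE_WORD_RULES: List[Tuple[str, str]] = [("lem", "el"), ("em", ""), ("ego", "i")]

# Hypothetical lemma counts observed in training data and Wikipedia links
LEMMA_POPULARITY: Dict[str, Dict[str, int]] = {
    "Václavem": {"Václav": 120},
    "Havlem": {"Havel": 95},
}

def apply_single_word_rules(word: str) -> List[str]:
    """Return possible lemmas of `word` obtained by rewriting its suffix."""
    candidates = []
    for suffix, replacement in SINGLE_WORD_RULES:
        if word.endswith(suffix):
            candidates.append(word[: len(word) - len(suffix)] + replacement)
    return candidates

def most_popular_lemma(word: str) -> str:
    """Heuristic 4: pick the most frequent lemma seen for `word`, else keep it."""
    counts = LEMMA_POPULARITY.get(word)
    if not counts:
        return word
    return max(counts, key=counts.get)

def lemmatize_phrase(phrase: str) -> str:
    """Fallback lemmatization: replace every word with its most popular lemma."""
    return " ".join(most_popular_lemma(w) for w in phrase.split())

if __name__ == "__main__":
    print(apply_single_word_rules("Havlem"))      # ['Havel', 'Havl']
    print(lemmatize_phrase("Václavem Havlem"))    # 'Václav Havel'
```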
A recognized entity, associated with a category and a normalized lemma, has to be linked with other occurrences of this entity: in the same document, in other documents, and ultimately across the documents in all languages. The task is difficult due to the subtle differences between seemingly identical entities. Consider the entity Donald Trump: a given occurrence could be linked with the 45th president of the United States or with Donald Trump Jr, depending on its role in the text, but not with both at the same time. We divide the task into two phases: 1) initial assignment of identifiers, and 2) refinement of identifiers. Our linking algorithm relies on three kinds of matches: exact matches of entity names, partial matches, and fuzzy matches with word embeddings.

In order to ground the recognized entities regardless of the language, as well as to extend our inventory of entities and their possible names, we use Wikidata as a catalogue of entities. Wikidata is a structured database of entities extracted from Wikipedia. Every entity has a unique identifier, e.g. Q123456, a list of labels with a language for each label, a description, subclasses/instances of properties, and relationships to other Wikidata entities (instance of, part of, etc.), which form a graph. Thanks to the hierarchy of the relations, we selected a handful of top-level Wikidata entities (Table 2) and collected all their descendants into sets of wikidata_entities. These are further weighted by their term frequency in Wikidata, so that we can resolve collisions in favor of the most popular entities.

In a typical, coherent paragraph, the narrative develops with every new sentence. Upon introduction, the entities are named carefully (e.g., with a full name or an expanded acronym), to be shortened later, when it is clear from the context what they refer to. For this reason we designed a stateful algorithm that processes and refines a local list of doc_entities caught in the document. Algorithm 1 outlines the linking procedure. Assignment of identifiers is performed separately for every document with the ADD_AND_LINK function. It processes the lemmatized set of entities recognized by the earlier modules of our system, using two kinds of entity dictionaries: doc_entities, which is local to the function, and the global wikidata_entities, which we prepare beforehand from Wikidata. These dictionaries map textual mentions to identifiers from Wikidata together with the target language, e.g., Donald Trump maps to [(Q22686, en), (Q22686, pl), (Q22686, cs), (Q3713655, cs)] (the last identifier refers to Donald Trump Jr). We process document entities starting from the longest ones, and for each we select the best entity id with the BEST_ID function. It prefers, firstly, matching entries from the doc_entities dictionary and, secondly, the most popular Wikidata entries (by term frequency) from wikidata_entities. For instance, with the local doc_entities dictionary, after processing Donald Trump, a subsequent shorter mention Trump will be linked with it. The function ALIASES handles only PRO and ORG labels, and returns a list of all short forms and acronyms specific to those labels present in Wikidata, e.g., Sony Ericsson is aliased as SE.
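A minimal sketch of the initial assignment phase is given below; it follows the description above, but the data structures and the best_id / add_and_link helpers are simplified stand-ins assumed for illustration and do not reproduce Algorithm 1 exactly.

```python
# Simplified, illustrative sketch of the initial identifier assignment.
# Algorithm 1 is only approximated; all data structures here are assumptions.

from typing import Dict, List, Tuple

EntityId = Tuple[str, str]  # (Wikidata id, language), e.g. ("Q22686", "en")

# Global dictionary prepared from Wikidata: mention -> candidate ids with term frequency
WIKIDATA_ENTITIES: Dict[str, List[Tuple[EntityId, int]]] = {
    "donald trump": [(("Q22686", "en"), 5000), (("Q3713655", "en"), 300)],
}

def best_id(mention: str, doc_entities: Dict[str, EntityId]) -> EntityId:
    """Prefer ids already seen in the document; otherwise pick the most
    popular Wikidata candidate (by term frequency); otherwise mint a local id."""
    # 1) partial / exact match against mentions already linked in this document
    for seen_mention, entity_id in doc_entities.items():
        if mention in seen_mention or seen_mention in mention:
            return entity_id
    # 2) most popular Wikidata entry for this mention
    candidates = WIKIDATA_ENTITIES.get(mention, [])
    if candidates:
        return max(candidates, key=lambda c: c[1])[0]
    # 3) fall back to a document-local identifier
    return ("LOCAL-" + mention.replace(" ", "_"), "xx")

def add_and_link(lemmatized_mentions: List[str]) -> Dict[str, EntityId]:
    """Process mentions from longest to shortest, keeping a per-document state."""
    doc_entities: Dict[str, EntityId] = {}
    for mention in sorted(lemmatized_mentions, key=len, reverse=True):
        doc_entities[mention.lower()] = best_id(mention.lower(), doc_entities)
    return doc_entities

if __name__ == "__main__":
    print(add_and_link(["Donald Trump", "Trump"]))
    # both mentions receive ("Q22686", "en"): the shorter one matches the longer
```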
The refinement stage uses dense embeddings of phrases in order to uncover high similarities between them that might otherwise have been missed. We use FastText (Bojanowski et al., 2017), which is well suited to morphologically rich Slavic languages, since its representations are built from generic subword units. The refinement is carried out in two phases. In the first one, all phrases with the same identifier are grouped together. In the second one, two groups are merged into one if there exist two mentions (one from each group) with sufficiently similar embeddings, as measured by their dot product. Phrases are embedded as sums of the embeddings of their words. When we merge two groups, we assign to them the identifier with the higher Wikidata term frequency. We refine identifiers only at the single-language level.

We present experiments carried out on different levels of the entity recognition pipeline. The data used in these experiments comes from the BSNLP 2019 Shared Task test set (Nord Stream and Ryanair subsets). Our algorithms are tested in the submitted form and have not been further adapted to these datasets.

Recognition Table 3 summarizes strict recognition results on the test data.

Lemmatization We analyzed the influence of the various parts of our lemmatization method on its performance. The results are shown in Table 4. Our baseline is the identity function, in which we take each phrase to be its own lemma. One should be aware that, due to the small amount of test data, the results should be treated as approximate; some differences can be caused by the bad lemmatization of a single phrase (especially if the phrase occurs many times in the test data). It seems that all implemented heuristics are reasonable and improve over the baseline. Moreover, it is easy to see that links from Wikipedia are a useful source of information for this task.

Table 5 shows the results of linking. Even though our recognizer did not hold up to the competition, the linking algorithm was able to close the gap in F1 score. In order to test the algorithm in ablation, we include linking results on ground-truth lemmatized data (Lemma Oracle).

We present the results of our FLERT-based submission, based on the partial results of the shared task available at the time of writing. One of the sets of articles in the test data is devoted to COVID-19. This situation is unusual: a phrase used very often in the test data does not appear at all in the training data (nor in the data used to pre-train the language model). We verified that our NER models struggle with assigning consistent labels to the phrase COVID-19, which is common in the test data. An additional difficulty is the ambiguity of this phrase, which may refer to a disease, and possibly remain unclassified as a named entity, or to a pandemic, and be classified as EVT. We decided to apply a simple post-processing step which assigns EVT to all COVID-19 phrases recognized by the NER module. We think that this situation is so unusual that in a real, industrial system it would be handled with a special ad-hoc rule. Moreover, we wanted to know what the results of this fixed assignment are, and therefore submitted two versions of our solution.
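As an illustration only, such an ad-hoc rule amounts to a one-line relabeling pass over the recognizer output; the (mention, label) pair format below is an assumed stand-in, not the actual system interface.

```python
# Illustrative only: force the EVT label on every recognized COVID-19 mention.
# The (mention, label) tuple format is an assumed stand-in for the system's output.

from typing import List, Tuple

def force_covid_evt(entities: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Post-process NER output, assigning EVT to all COVID-19 mentions."""
    return [
        (mention, "EVT" if mention.upper().replace(" ", "") == "COVID-19" else label)
        for mention, label in entities
    ]

if __name__ == "__main__":
    print(force_covid_evt([("COVID-19", "PRO"), ("Ryanair", "ORG")]))
    # [('COVID-19', 'EVT'), ('Ryanair', 'ORG')]
```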
This paper describes our submissions to the 2019 and 2021 BSNLP Shared Tasks on named entity recognition in Slavic languages. Even though the training data was scarce, we have used large-scale datasets: corpora of unstructured text in the unsupervised pre-training phase of the recognition model, and the structured Wikipedia and Wikidata knowledge bases to extract rules and entities for the lemmatization and linking phases. The linking algorithm is a strong point of our submission. In the 2019 task it allowed us to close the performance gap, introduced by a weak initial recognition model, between our solution and the competitors. The results suggest that, perhaps, there is still a blind spot between supervised and unsupervised neural learning, where the structure of the data matters more than its volume, and where simple rule-based systems excel.

References

Alan Akbik, Duncan Blythe, and Roland Vollgraf (2018). Contextual string embeddings for sequence labeling. In Proceedings of COLING 2018.
Mikhail Arkhipov, Maria Trofimova, Yuri Kuratov, and Alexey Sorokin (2019). Tuning multilingual transformers for language-specific named entity recognition. In Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing.
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5.
Alexis Conneau et al. (2019). Unsupervised cross-lingual representation learning at scale.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019.
Benjamin Heinzerling and Michael Strube (2018). BPEmb: Tokenization-free pre-trained subword embeddings in 275 languages. In Proceedings of LREC 2018.
Michał Marcińczuk, Jan Kocoń, and Marcin Oleksy (2017). Liner2 - a generic framework for named entity recognition. In Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing.
Jakub Piskorski et al. (2019). The second cross-lingual challenge on recognition, normalization, classification, and linking of named entities across Slavic languages. In Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing.
Stefan Schweter and Alan Akbik (2020). FLERT: Document-level features for named entity recognition. arXiv preprint.
Magda Ševčíková et al. (2014). Czech Named Entity Corpus 2.0. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL).
Anatoly Starostin et al. (2016). FactRuEval 2016: Evaluation of named entity recognition and fact extraction systems for Russian.
Guillaume Wenzek et al. (2020). CCNet: Extracting high quality monolingual datasets from web crawl data. In Proceedings of LREC 2020.
Marcin Woliński et al. (2012). PoliMorf: a (not so) new open morphological dictionary for Polish. In Proceedings of LREC 2012.

Acknowledgments

The authors thank the Polish National Science Center for funding under the OPUS-18 grant 2019/35/B/ST6/04379. We would also like to thank Adam Wawrzyński and Wojciech Janowski from VoiceLab AI for their support in conducting the experiments and training the models.