KazNERD: Kazakh Named Entity Recognition Dataset
Rustem Yeshpanov, Yerbolat Khassanov, Huseyin Atakan Varol
2021-11-26

Abstract. We present the development of a dataset for Kazakh named entity recognition. The dataset was built in response to a clear need for publicly available annotated corpora in Kazakh, as well as for annotation guidelines containing straightforward, but rigorous, rules and examples. The dataset annotation, based on the IOB2 scheme, was carried out on television news text by two native Kazakh speakers under the supervision of the first author. The resulting dataset contains 112,702 sentences and 136,333 annotations for 25 entity classes. State-of-the-art machine learning models to automatise Kazakh named entity recognition were also built, with the best-performing model achieving an exact match F1-score of 97.22% on the test set. The annotated dataset, guidelines, and codes used to train the models are freely available for download under the CC BY 4.0 licence from https://github.com/IS2AI/KazNERD.

1. Introduction

Named entity recognition (NER) is a subtask of information extraction aimed at identifying named entities (NEs) in semi- or unstructured text and classifying them into pre-specified types (Nadeau and Sekine, 2007). NEs, in turn, generally refer to (proper) names of persons, organisations, and geographical locations (Sang and Meulder, 2003), as well as numerical and temporal expressions, including quantities, monetary units, percentages, dates, and durations (Chinchor, 1998). Widely used in natural language processing applications, including automatic text understanding (Cheng and Erk, 2020), machine translation (Babych and Hartley, 2003), question answering (Aliod et al., 2006), and knowledge base development (Etzioni et al., 2005), NER has been of interest not only to scientific research, but also to business (Schön et al., 2019) and defence (Han et al., 2020) ever since 1995, when the term was coined (Grishman and Sundheim, 1996).

Because most of the early work in information extraction was launched as part of United States Government initiatives (e.g., ACE, MUC, TIPSTER) (Maynard et al., 2003), a great deal of NER research concerns English. Nonetheless, an equally large proportion of NER research has been dedicated to other well-resourced languages, such as Spanish, French, German, Japanese, Chinese, and Russian (see Nadeau and Sekine (2007) for a detailed overview), as well as to less-resourced ones, such as Sindhi (Ali et al., 2020), Romanian (Dumitrescu and Avram, 2020), and Icelandic (Ingólfsdóttir et al., 2019).

Kazakh, the likewise low-resourced language of interest of this paper, has only latterly appeared on the radar of NER researchers. Underrepresented and lexically underdeveloped through being overshadowed by Russian, which was promoted as a lingua franca during the Soviet era (Dave, 2007), this agglutinative Turkic language saw its earliest NER research as recently as 2016. Although there is evidence of annotated corpus construction as part of Kazakh NER research (Akhmed-Zaki et al., 2020; Tolegen et al., 2016), to our knowledge, neither of the corpora is publicly available.
In addition, none of the studies into Kazakh NER appears to have developed annotation guidelines, or at least adapted those existing for other languages, to take into account cases characteristic of Kazakh. Given this relatively nascent stage of Kazakh NER, accompanied by the digital underrepresentation of the language and the lack of freely accessible annotated corpora, it is hoped that our research will fill the existing gaps in the field and thus contribute to its further development. Specifically, we built a dataset consisting of 112,702 sentences from television news, of which 86,246 are unique sentences and 26,456 are their various representations. All sentences in the dataset were manually annotated by two native Kazakh-speaking linguists, supervised by the first author. This resulted in the largest Kazakh NE annotated corpus to date. To assist the annotators in making the right choices when presented with expressions potentially matching NEs, annotation guidelines in Kazakh were developed. The guidelines contain rules for annotating 25 NE types, as well as illustrative examples of Kazakh NEs. Finally, we built four state-of-the-art machine learning models to automatise Kazakh NER, with the highest exact match F1-score reaching 97.22% on the test set.

The remainder of the paper proceeds as follows: Section 2 reviews existing research on Kazakh NER. Section 3 discusses data collection and preparation, and the development of the guidelines and the dataset. Section 4 provides the annotated dataset specifications, including the description of NEs, as well as the dataset structure and statistics. Section 5 offers the details of the implemented NER models, the experimental setup, and the evaluation criteria and results. Section 6 discusses the results of the experiment. Section 7 concludes the paper.

2. Related Work

As mentioned earlier, Kazakh is a digitally low-resourced language, with a small number of (annotated) corpora freely available. Recently, however, there have been concerted efforts to address this underrepresentation. Khassanov et al. (2021) built a crowdsourced, freely accessible Kazakh speech corpus (KSC) containing 332 hours of transcribed audio. In another work, Mussakhojayeva et al. (2021a) constructed the first publicly available large-scale Kazakh text-to-speech synthesis dataset, consisting of approximately 93 hours of transcribed audio recordings spoken by male and female professional narrators. While Kazakh speech processing research has thus been gathering momentum, thanks to the recent development of publicly available datasets, Kazakh NER research can hardly boast of commensurate progress, which appears to be chiefly due to a lack of such resources.

One of the earliest studies into Kazakh NER was conducted by Sadykova and Ivanov (2016). To build a manually annotated Kazakh NE corpus, two experts were tasked with labelling 1,000 news articles with a set of seven NE types, namely (1) person, (2) organisation, (3) location, (4) geopolitical entity (GPE), (5) event, (6) award, and (7) tender, using the brat rapid annotation tool (BRAT) (Stenetorp et al., 2012). Approximately 3,000 NEs are reported to have been tagged, of which 1,084 were persons, 974 locations, and 973 organisations. However, no breakdown of the remaining NEs is provided in the paper, nor is reference made to the metric applied to achieve an inter-annotator agreement (IAA) score of 0.86-0.89 (Artstein and Poesio, 2008).
Another criticism is that, while the annotation guidelines are reported to have been developed specifically for the task, there is no mention of how to access them or the resulting annotated corpus.

Tolegen et al. (2016) created a Kazakh NE corpus, annotated according to the IOB (Inside, Outside, Beginning) scheme, from 2,500 general news media articles. The corpus is reported to consist of 18,054 sentences and 270,306 words. Annotation was performed using a self-developed web-based tool, with two native Kazakh speakers using the MUC-7 NE task definition (Chinchor, 1998) as a guide. More than 14,000 NEs were labelled in three categories: 4,292 persons, 7,391 locations, and 2,560 organisations. The IAA, measured with Fleiss' kappa (Fleiss, 1971), ranged from 0.93 to 0.98. Furthermore, the scholars conducted an extensive analysis of Kazakh morphological and word type features and were the first to apply a statistical model to Kazakh NER based on conditional random fields (CRFs) (Lafferty et al., 2001), achieving an F1-score of 89.81%. The same model was used as a baseline in Tolegen et al. (2020), where the researchers approached the Kazakh NER task by comparing (1) a bidirectional long short-term memory (BiLSTM) model (Hochreiter and Schmidhuber, 1997), (2) BiLSTM with CRF (BiLSTM-CRF), and (3) a tensor layer-based deep neural network (DNN) model. While the BiLSTM model yielded a result significantly lower than that of the baseline (78.76%), the performance of the BiLSTM-CRF model varied depending on whether character embeddings were used: 86.45% with them and 80.28% without. The DNN model outperformed the other models, producing an F1-score of 90.49%. Although the three models were trained on the annotated corpus built in Tolegen et al. (2016), neither of the studies provides information on access to it.

In Kozhirbayev and Yessenbayev (2020), an annotated NE corpus comprising 29,629 sentences was constructed in the IOB format, with the names of persons, organisations, and locations tagged along with Other, a category for NEs of interest that presumably fall outside the three said categories. Four methods were applied to the Kazakh NER task: (1) the random forest classifier (Ho, 1995), (2) the Naïve Bayes classifier (Friedman et al., 1997), (3) CRFs, and (4) a hybrid method combining BiLSTM and CRF. The results show that, while the first two methods achieved F1-scores in the range of 81% to 89%, the hybrid method (88%) was notably outperformed by the CRFs (99%). However, the study included no information on what guidelines were followed to build the corpus, the quantities of NEs in the corpus, or whether and how annotation accuracy checks were performed.

Kuralbayev et al. (2020) compared four NER models, namely (1) CRFs, (2) LSTM with character embeddings, (3) LSTM-CRF, and (4) bidirectional encoder representations from transformers (BERT) (Devlin et al., 2019), to anonymise 40,000 court decisions in Kazakh and Russian. The names of persons, organisations, locations, and addresses were tagged using a self-built annotation tool. The scholars note that the BERT model, which was run without fine-tuning, reached an F1-score of 87%, with the results of the other models peaking at 82%. Nevertheless, some caution is warranted here: although the model is reported to have achieved high accuracy for both Kazakh and Russian, it was trained exclusively on Russian data.
Furthermore, surprisingly, no mention is made of the guidelines used or of any IAA assessment, considering that the annotation was carried out by over 150 local university students recruited for the task. Nor is it stated how many NEs were anonymised as a result.

The last study on Kazakh NER we discuss in this paper is by Akhmed-Zaki et al. (2020), who applied the BiLSTM, CRF, and BERT methods to a dataset collected from Kazakh online news portals. The dataset was manually annotated using the IOB scheme with four NE types: (1) persons, (2) locations, (3) organisations, and (4) other. In this study, too, the BERT model performed best, with an F1-score of 97.99%, followed by CRF (94.27%) and BiLSTM (85.31%). While the study provides clear information on the parameters of the BERT model and formulae for the precision, recall, and F1-scores computed, it is limited by a lack of clarity on the volume of the data: although the dataset is claimed to consist of 7,153 sentences, the scholars explicitly state that it was split into 6,507, 2,531, and 3,015 sentences for the training, validation, and test sets, respectively, which is 12,053 sentences in the aggregate. It is also unclear whether the category Other was used for NEs that were not names of persons, locations, or organisations, but were still of interest (see, e.g., Kozhirbayev and Yessenbayev (2020)), or whether it simply referred to the category of words that are not annotated as NEs and are labelled as O in the IOB scheme. Much like in the previous studies, no reference is made to the annotation guidelines adhered to, the annotators and their backgrounds, the measurement of IAA, or the means of accessing the annotated dataset.

3. Data Collection and Annotation

The source data were obtained from the television news of the Khabar Agency, a major broadcasting network in Kazakhstan. With the agency's permission, the Kazakh transcribed text accompanying the original news posted on its official website [1] was collected over the second half of 2020. The news included reports on events in local and international politics, economy, sports, religion, and education that did not necessarily occur during the data collection period, as some news items were also extracted from the agency's archives. The extracted text [2] was not screened for inappropriate content, on the assumption that this must have been prudently done by the agency's content policy department.

[1] www.khabar.kz
[2] Tokens denoting speech disfluencies and hesitations (parenthesised) and background noise [bracketed] were retained in the transcribed text.

The text was split sentence-wise, with an identifier assigned to each sentence, and inspected for grammatical and spelling errors (cf. Tolegen et al. (2016)) and homoglyphs. Duplicate sentences and those containing only Russian utterances were removed; sentences with both Kazakh and Russian utterances were retained, as Kazakh-Russian code-switching is normal practice in Kazakhstan (Pavlenko, 2008; Mussakhojayeva et al., 2021b). Ultimately, the total number of sentences was 86,246.

To enable the developed NER models (see Section 5) to recognise instances of the same NE regardless of their typographic characteristics (e.g., numerals written in words and digits), the following six sentence representation variants were adopted:

1) AID: All sentence elements were recorded in the Cyrillic script [3]. Arabic and Roman numerals (e.g., 9 → тоғыз, IV → төрт, etc.), names of organisations, applications, events, and so on spelt in Latin characters (e.g., Bank of America → Банк оф Америка, Telegram → Телеграм, etc.), terms conventionally spelt in Latin characters (e.g., PhD → ПиЭйчДи, etc.), and special symbols (e.g., % → процент or пайыз) were recorded in Cyrillic words.
2) BID: Sentences of the AID representation with numerals recorded in digits.
3) CID: Sentences of the BID representation with percentages recorded using the % symbol.
4) DID: Sentences of the AID representation with words conventionally spelt in the Latin script recorded in that script.
5) EID: Sentences of the DID representation with numerals recorded in digits.
6) FID: Sentences of the EID representation with percentages recorded using the % symbol.

[3] At the time of writing, the Kazakh language is undergoing a gradual transition from the Cyrillic to the Latin script, with the full transition scheduled to take place between 2023 and 2031.

The assigned representation designations, as well as example sentences with the resulting quantity of each variant in the dataset, are summarised in Table 1.
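To make the relationship between the variants concrete, the toy sketch below (our illustration, not the project's actual preprocessing code) derives BID and CID forms from an AID sentence; the numeral dictionary and the example phrase are hypothetical stand-ins for full Kazakh number-word handling.

    # A toy illustration (ours, not the project's preprocessing code) of how
    # the BID and CID representations relate to an AID sentence.
    AID_TO_DIGIT = {"тоғыз": "9", "төрт": "4", "жиырма": "20"}  # tiny stand-in mapping

    def to_bid(aid: str) -> str:
        """AID -> BID: numerals written as words are recorded in digits."""
        return " ".join(AID_TO_DIGIT.get(tok, tok) for tok in aid.split())

    def to_cid(bid: str) -> str:
        """BID -> CID: percentages are recorded using the % symbol."""
        return bid.replace("пайыз", "%").replace("процент", "%")

    aid = "өсім тоғыз пайыз болды"  # toy phrase: "the growth was nine percent"
    print(to_bid(aid))              # өсім 9 пайыз болды
    print(to_cid(to_bid(aid)))      # өсім 9 % болды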
The IOB2 scheme, also referred to as BIO, was selected for annotation (Sang and Veenstra, 1999). Under this scheme, each token in the text receives one of three tags, namely B, I, or O, indicating whether the token is at the Beginning, Inside, or Outside of an annotated extent. The scheme is similar to IOB, except that a B tag is used at the beginning of every NE extent (see Table 2).
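As a concrete illustration (ours, constructed for this purpose rather than reproduced from Table 2), consider the sentence Абай Құнанбайұлы Семейде туған "Abai Qunanbaiuly was born in Semey", in the one-token-per-line format used for the dataset files:

    Абай B-PERSON
    Құнанбайұлы I-PERSON
    Семейде B-GPE
    туған O
    . O

Under plain IOB, the two entity-initial tokens would instead receive I- tags, since a B- tag is reserved there for an entity immediately following another entity of the same type.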
Considering that none of the studies on Kazakh NER provided Kazakh annotation guidelines that our study could rely on to embark on the task, we decided to create such a set of rules. First, we studied some of the most referenced annotation guidelines for NER, particularly Chinchor (1998), Brunstein (2002), Raytheon BBN Technologies (2004), Linguistic Data Consortium (2008), and Weischedel et al. (2012). Next, the first author experimentally annotated a random sample of 2,000 sentences to see which NEs could actually be extracted from the data on hand. Twenty-two NE types described in the guidelines studied were found in the sample. The first draft of the annotation guidelines, containing the definition of an NE, information on the valid boundaries of NEs, rules for NE classification, and related examples, was prepared in Kazakh. Later, as a result of the annotator training task, it was decided to tag three more NE types whose examples were found in the annotated news reports: NON_HUMAN, MISCELLANEOUS, and ADAGE. While the first two had been previously mentioned in existing annotation guidelines for NER, the decision to tag ADAGE rested upon the relatively frequent use of Kazakh proverbs and sayings in the training sentences. Due adjustments were made to the guidelines, with some rules clarified and supported by comprehensible examples. It is also worth mentioning that the guidelines were iteratively amended as annotation proceeded, partly owing to subsequent encounters with cases not considered while drafting the guidelines and partly as a result of daily discussions of questions posed by the annotators hired for the task. For a complete list of the 25 NE types and their brief descriptions, see Table 3. The final annotation guidelines (in Kazakh) are available for download from our GitHub repository [4].

[4] https://github.com/IS2AI/KazNERD

Two native Kazakh-speaking linguists received training in NER for two weeks under the supervision of the first author. As part of the training, 3,500 sentences from the Khabar Agency's official website were annotated following the developed guidelines. The annotation was carried out using the web-based tool WebAnno (Yimam et al., 2013) (see Neves and Seva (2021) for an extensive review of annotation tools). The annotators worked independently on the same version of a text file, which was subsequently reviewed by the first author for annotation divergences and inconsistencies. The final version of the file contained text with annotations approved or modified as appropriate by the first author. During the training period, the IAA score, computed by WebAnno, reached a Fleiss' kappa of 0.94. The annotation process proper proceeded for six months, with the annotators labelling 1,500 sentences per day and the first author inspecting these once they were marked complete in WebAnno. During this period, the IAA score ranged from 0.95 to 0.97 Fleiss' kappa.

4. Dataset Specifications

Table 3 provides the statistics for the annotated NEs. The resulting annotated Kazakh NER dataset (hereafter KazNERD) contains 136,333 NEs. As can be seen from Table 3, disease names are well represented in the dataset, for two reasons. First, the COVID-19 pandemic received massive public attention, which led to the source data often reflecting information on the outbreak of the disease across the country and worldwide. Second, the national media regularly discussed symptoms of various diseases similar to those observed in individuals infected with COVID-19, which resulted in the names of these diseases appearing in the source news reports.

To allow reproducibility of the NER experiment between different research groups, KazNERD was split into three sets: training (80%), validation (10%), and test (10%). Table 4 provides statistical information on the number of tokens, sentences, and NEs in the dataset and per set. An evenly proportional distribution of sentence representations and NEs across the sets was ensured, and we also saw to it that a sentence and its representations were only ever assigned to the same set. More detailed information on the numbers of NEs and sentence representations across the three sets can be found in Tables 5 and 6. Furthermore, we extracted all unique NEs from KazNERD and computed the intersection between the training, validation, and test sets (see Figure 1). The total numbers of unique NEs in the training, validation, and test sets are 33,177, 6,547, and 6,742, respectively. We found that 42% of the unique NEs in the test set do not appear in the training and validation sets, which confirms the suitability of the test set for evaluating the generalisation capability of NER models.

The three sets are stored in separate files in the CoNLL-2002 format (Tjong Kim Sang, 2002); that is, each file contains one token and the corresponding NE tag per line, with blank lines representing sentence boundaries (see Table 2). Tokens and IOB2 tags are separated by a single space. Additionally, we provide variants of the […].

5. NER Models and Experimental Setup

We applied several state-of-the-art machine learning methods to evaluate the KazNERD corpus. Detailed information on the NER model implementation and feature construction can be found in our GitHub repository [4].

CRF. We applied the CRF models implemented in the CRFsuite toolkit (Okazaki, 2007). Specifically, we used features derived from the surface forms of tokens, including target and context token prefixes, suffixes, and shape features. We note that the CRF models do not incorporate external linguistic resources, such as gazetteers, lookup tables, or word vector features.
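As a rough sketch of this pipeline, the snippet below reads a file in the CoNLL-2002 format described above and trains a CRF with simple surface-form features. It is our illustration under stated assumptions, not the authors' released code: it uses the sklearn-crfsuite wrapper around CRFsuite rather than the toolkit's native interface, the file name is hypothetical, and the feature set is a reduced version of the one described; the regularisation and iteration settings follow the values reported below.

    # Minimal sketch: CoNLL-2002 reading plus CRF training via sklearn-crfsuite,
    # a thin wrapper around the CRFsuite toolkit.
    import sklearn_crfsuite

    def read_conll(path):
        """One 'token tag' pair per line; blank lines separate sentences."""
        sentences, sent = [], []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    if sent:
                        sentences.append(sent)
                        sent = []
                else:
                    token, tag = line.split(" ")
                    sent.append((token, tag))
        if sent:
            sentences.append(sent)
        return sentences

    def token_features(sent, i):
        """Surface-form features: prefixes, suffixes, and simple shape cues."""
        w = sent[i][0]
        return {
            "lower": w.lower(),
            "prefix3": w[:3],
            "suffix3": w[-3:],
            "is_title": w.istitle(),
            "is_digit": w.isdigit(),
            "prev_lower": sent[i - 1][0].lower() if i > 0 else "<BOS>",
            "next_lower": sent[i + 1][0].lower() if i < len(sent) - 1 else "<EOS>",
        }

    train = read_conll("train.txt")  # hypothetical file name
    X = [[token_features(s, i) for i in range(len(s))] for s in train]
    y = [[tag for _, tag in s] for s in train]

    # L-BFGS with L1 = 0.1, L2 = 0.01, and 550 iterations, as reported below.
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.01, max_iterations=550)
    crf.fit(X, y)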
BiLSTM-CNN-CRF. We used the PyTorch implementation of a BiLSTM-CNN-CRF model (Ma and Hovy, 2016). The model combines word embeddings with character-level representations extracted using a CNN and feeds them into a BiLSTM module with a CRF output layer. Word embeddings are usually pre-trained on large unlabelled corpora, but in the present study we used randomly initialised embeddings.

BERT. A pre-trained BERT model can be readily applied to the NER task by reinitialising the output layer to match the size of the NE label set and fine-tuning the model on the NER data. We used the case-sensitive version of the multilingual BERT model within the Hugging Face Transformers framework (Wolf et al., 2020). The model consists of around 110M parameters and was pre-trained on the 104 languages with the largest Wikipedia content, which include Kazakh.

XLM-RoBERTa. We also applied the XLM-RoBERTa model (Conneau et al., 2020), a multilingual version of RoBERTa (Liu et al., 2019), within the Hugging Face Transformers framework. Similar to BERT, it was adapted for the NER task by reinitialising the output layer and fine-tuning. The rationale behind choosing this model is that it has over five times as many parameters as BERT (560M) and was pre-trained on CommonCrawl data covering 100 languages, Kazakh included.

The four NER models were trained on the training set, and their hyperparameters were tuned on the validation set. The final, best-performing models were evaluated on the test set. The deep learning-based models utilised a single V100 GPU on an NVIDIA DGX-2 machine.

Table 7: Experiment results of the four NER models on the validation and test sets of KazNERD (precision, recall, and F1-score).

The CRF model was run for 550 iterations using the L-BFGS training algorithm, with the L1 and L2 regularisation terms set to 0.1 and 0.01, respectively. The other hyperparameters were left at the default values of the CRFsuite toolkit. For the BiLSTM-CNN-CRF model, we used a single BiLSTM layer with 256 hidden units and a CNN layer with 30 filters of size 3. The word and character embedding sizes were set to 100 and 30, respectively. We chose an initial learning rate of 0.005 and a batch size of 1,024. To prevent overfitting, a dropout rate of 0.5 was applied. The model was trained for up to 1,000 epochs using the Adam optimiser and early stopping based on the validation set, which yielded the highest score at epoch 432. The BERT model was fine-tuned for 8 epochs, with the initial learning rate set to 5·10^-5 and the weight decay rate set to 10^-4; we set the batch size to 128 and applied 3,000 warmup steps. Likewise, the XLM-RoBERTa model was fine-tuned for 10 epochs, with the initial learning rate set to 10^-5 and the weight decay rate set to 10^-3; we set the batch size to 64 and applied 800 warmup steps.

We evaluate NER performance in terms of exact match using precision, recall, and F1-score (Nadeau and Sekine, 2007) and the standard seqeval script (Nakayama, 2018), requiring that both the type and the span of a predicted NE match the gold-standard mention. Table 7 presents the performance of the NER models on the validation and test sets of KazNERD, measured by micro-averaging (Yang, 2001). The highest results were achieved by XLM-RoBERTa, followed by BERT, BiLSTM-CNN-CRF, and CRF. Specifically, XLM-RoBERTa achieved relative improvements of 1%, 4%, and 5% over BERT, BiLSTM-CNN-CRF, and CRF, respectively. In general, all the NER models performed well, achieving precision, recall, and F1-scores above 90%, highlighting the utility of our annotated dataset for the Kazakh NER task. The results of the XLM-RoBERTa model for individual NE classes are shown in Table 8 and are discussed in the following section.
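The authors' training code is available from their GitHub repository; purely as an illustration of the adaptation and evaluation steps just described, the sketch below loads a token-classification head on top of XLM-R, aligns word-level IOB2 tags to subwords, and scores predictions with seqeval. The checkpoint name (matching the 560M-parameter size mentioned above), the abbreviated label set, and the first-subword labelling strategy are our assumptions, not necessarily the exact configuration used in the paper.

    import torch
    from transformers import AutoTokenizer, AutoModelForTokenClassification
    from seqeval.metrics import classification_report, f1_score

    labels = ["O", "B-PERSON", "I-PERSON", "B-GPE", "I-GPE"]  # abbreviated; KazNERD has 25 classes
    model_name = "xlm-roberta-large"  # assumed checkpoint (~560M parameters)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(labels))

    # Align word-level IOB2 tags to subwords: label the first subword of each
    # word and mask the rest (and special tokens) with -100 so the loss ignores them.
    words = ["Абай", "Құнанбайұлы", "Семейде", "туған", "."]
    word_labels = [1, 2, 3, 0, 0]  # indices into `labels`
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    aligned, prev = [], None
    for wid in enc.word_ids():
        aligned.append(-100 if wid is None or wid == prev else word_labels[wid])
        prev = wid
    enc["labels"] = torch.tensor([aligned])
    loss = model(**enc).loss  # a fine-tuning step would backpropagate this loss

    # Exact-match evaluation: seqeval counts an NE as correct only if both its
    # type and its span match the gold standard; F1 = 2PR / (P + R).
    y_true = [["B-PERSON", "I-PERSON", "B-GPE", "O", "O"]]
    y_pred = [["B-PERSON", "I-PERSON", "O", "O", "O"]]
    print(f1_score(y_true, y_pred))  # 0.667: precision 1.0, recall 0.5
    print(classification_report(y_true, y_pred))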
6. Discussion

The performance of XLM-RoBERTa was above 95% for 14 NE classes and in the range of 85% to 95% for eight classes. Only three classes were predicted with an F1-score below 85%. Among the high-scoring classes, monetary expressions benefit from highly regular surface forms, with "tenge" (the local currency), "dollar", and "euro" making frequent appearances. Likewise, in Kazakh, PERSON NEs often appear as a combination of first and last names, with both capitalised and the latter normally ending in -ов(а) "-ov(a)", -ев(а) "-ev(a)", or -ин(а) "-in(a)". These features presumably enabled the model to achieve high prediction accuracy for these classes.

The low F1-scores for NON_HUMAN (0%) and ADAGE (64.52%) on the test set could be due to the apparent insufficiency of instances of the former in the dataset and the form variability of the latter. Increasing the number of NON_HUMAN NEs in the training sample, by expanding the dataset to embrace domains where the use of names of non-humans is expected (e.g., science fiction, children's stories, or animal fantasies), will likely improve the accuracy of the model. As for ADAGE NEs, they are generally easy to recognise in context thanks to their fixedness of form (e.g., No smoke without fire). Lexical and grammatical variations of proverbs and sayings are possible (e.g., There is no smoke without fire or Where there is smoke, there is fire), but such variations are still unlikely to prevent humans from identifying them: such phrases bear greater psychological and social significance than other set expressions do (Norrick, 2015). However, this can hardly apply to a machine learning model, which will struggle to decide whether a given expression is a pre-existing variation of a known adage, a nonce restructuring of one, or not an adage at all, especially if there are inadequate data to make inferences from. As mentioned earlier, the class ADAGE was labelled as a result of our scientific curiosity, and further review and investigation as to the worth of this class for the NER task is required.

Since the present study was, to the best of our knowledge, the first to develop a publicly available annotated corpus, as well as guidelines in Kazakh, for 25 NE classes, it was subject to several challenges. Firstly, although NER generally implies the recognition of proper nouns in text, which are expected to be capitalised given that they designate names of persons, places, organisations, and so forth, some Kazakh nouns assigned to certain NE classes in our dataset do not meet this criterion. For example, the NEs дүйсенбi "Monday" (DATE), христиандар "Christians" (NORP), and ағылшын тiлi "English" (LANGUAGE), to name a few, are normally lower-cased in Kazakh, unless they appear at the beginning of a sentence. Further studies on Kazakh NER taking such cases into account need to be undertaken.

NE coordination posed another problem. The challenge concerns whether two (or more) coordinated NEs, for example, Олжас пен Аина Қорғанбек "Olzhas and Aina Qorganbek" (the names of a husband and a wife followed by their family name; PERSON) or Байтұрсынов пен Қонаев көшелерiнде "on Baitursynov and Qonayev Streets" (the names of two local streets followed by the word "streets"; FACILITY), ought to be labelled as a single NE or as separate NEs.
Although MUC-6 (Grishman and Sundheim, 1996) originally advocated separate annotations, in KazNERD it was decided to label coordinated NEs as a single entity, in accordance with the recommendations of MUC-7 (Chinchor, 1998), which promotes joint annotation. A similar issue related to nested entities: should the expression Қазақстан Президентi "The President of Kazakhstan" be considered two entities, Қазақстан (Kazakhstan, GPE) and Президентi (President, POSITION), or a single entity, Қазақстан Президентi (The President of Kazakhstan, POSITION)? Here again, our decision was guided by MUC-7, which encourages the annotation of such expressions as a single NE. Thus, while developing KazNERD, we chose not to decompose compound entities and not to label subentities. Future research into Kazakh NER should, however, consider these challenges, with the decision as to which of the approaches is more likely to cover the needs of particular application areas left to the discretion of those concerned.

As regards challenges related to metonymy (i.e., the use of the name of something to refer to something else closely associated with it, as in Downing Street to refer to the British Prime Minister), consistent with the MUC recommendations, KazNERD generally retains the conventional semantics of common NEs, unless otherwise specified in the developed annotation guidelines. Thus, in Абайды тану "cognising Abai" (the name of a great Kazakh poet), the NE Абайды is tagged as PERSON, despite the contextual reference to the person's literary works (the NE class ART). This should certainly be borne in mind by those willing to make use of KazNERD.

Similarly, challenges presented by the ambiguity between the classes ORGANISATION and FACILITY may account for the comparatively low F1-score for the latter. In the annotation guidelines, we recommend that, in cases of confusion, preference be given to ORGANISATION when actions normally characteristic of persons (e.g., say, state, report, etc.) are used with names of institutions, or when a building houses an institution of the same name, unless the reference is explicitly to the physical structure alone in a locative manner. Yet, in cases where the distinction is still not clear-cut, such as Президент ... Ақордада арнайы кеңес өткiздi "The President ... held a special meeting in Akorda" (the official workplace of the President of Kazakhstan), we annotated Ақордада as ORGANISATION, in line with existing guidelines that tag White House or Kremlin as ORGANISATION, in spite of the contextual reference to the facility.

7. Conclusion

The present study set out to develop the first publicly available annotated dataset for Kazakh NER. The resulting dataset, KazNERD, contains 112,702 sentences from the television news domain and 136,333 annotations for 25 entity classes. All NEs were labelled using the IOB2 scheme by two native Kazakh speakers under the supervision of the first author, in accordance with the annotation guidelines specially designed in and for the Kazakh language. To automate Kazakh NER, state-of-the-art machine learning models were built, with the best-performing model yielding an exact match F1-score of 97.22% on the test set. In the future, we aim to focus on developing fine-grained and domain-independent NER models to ensure their external validity.
To this end, we intend to train the models on a version of KazNERD supplemented with annotated data from different domains and genres, including transcribed conversations from television and radio shows, podcasts, phone calls, fiction, and senate speeches. The annotated dataset, guidelines, and codes used in training the models can be freely downloaded under the CC BY 4.0 licence from https://github.com/IS2AI/KazNERD.

Acknowledgements

Our very special thanks go to Aigerim Kabduluakhitova and Aizhan Seipanova, the two annotators, who demonstrated their expertise, diligence, and continued patience throughout the whole process of developing KazNERD.

References

Development of Kazakh named entity recognition models.
SiNER: A large dataset for Sindhi named entity recognition. European Language Resources Association.
Named entity recognition for question answering.
Inter-coder agreement for computational linguistics.
Improving machine translation quality with automatic named entity recognition.
Annotation guidelines for answer types. LDC2005T33, Linguistic Data Consortium.
Attending to entities for better text understanding.
Overview of MUC-7.
Unsupervised cross-lingual representation learning at scale.
Kazakhstan: Ethnicity, language and power.
BERT: Pre-training of deep bidirectional transformers for language understanding.
Introducing RONEC - the Romanian named entity corpus.
Unsupervised named-entity extraction from the Web: An experimental study.
Measuring nominal scale agreement among many raters.
Bayesian network classifiers. Machine Learning.
Message understanding conference-6: A brief history.
Research on Named Entity Recognition Technology in Military Software Testing.
Random decision forests.
Long Short-Term Memory.
Towards high accuracy named entity recognition for Icelandic.
A crowdsourced open-source Kazakh speech corpus and initial speech recognition baseline.
Named entity recognition for the Kazakh language [Распознавание именованных объектов для казахского языка].
Named Entity Recognition Algorithms Comparison for Judicial Text Data.
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data.
ACE (Automatic Content Extraction) English Annotation Guidelines for Entities, Version 6.
RoBERTa: A robustly optimized BERT pretraining approach.
End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF.
Towards a Semantic Extraction of Named Entities.
KazakhTTS: An Open-Source Kazakh Text-to-Speech Synthesis Dataset.
A study of multilingual end-to-end speech recognition for Kazakh, Russian, and English.
A survey of named entity recognition and classification.
seqeval: A Python framework for sequence labeling evaluation.
An extensive review of tools for manual annotation of documents.
Subject Area, Terminology, Proverb Definitions, Proverb Features.
CRFsuite: A fast implementation of Conditional Random Fields.
Russian in post-Soviet countries.
OntoNotes Named Entity Guidelines.
The formation of a corpus with the markup of entities in news media resources for the Kazakh language [Формирование корпуса с разметкой сущностей в новостных медиа ресурсах для казахского языка].
Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition.
Representing text chunks.
A corpus study and annotation schema for named entity recognition and relation extraction of business products.
BRAT: A web-based tool for NLP-assisted text annotation.
Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition.
Named entity recognition for Kazakh using conditional random fields.
Neural Named Entity Recognition for Kazakh.
OntoNotes release 5.0 with OntoNotes DB Tool v0.999 beta.
Transformers: State-of-the-art natural language processing.
A study of thresholding strategies for text categorization.
WebAnno: A flexible, web-based and visually supported system for distributed annotations.