key: cord-0043010-xift5l9c
authors: Mukhamedshin, Damir; Nevzorova, Olga; Kirillovich, Alexander
title: Using FLOSS for Storing, Processing and Linking Corpus Data
date: 2020-05-05
journal: Open Source Systems
DOI: 10.1007/978-3-030-47240-5_17
sha: b4331969ceebea2cb1f60d77d5788ac64c37634f
doc_id: 43010
cord_uid: xift5l9c

Corpus data is widely used to solve various linguistic, educational and applied problems. The Tatar corpus management system (http://tugantel.tatar) was developed specifically for Turkic languages. Its functionality includes search for lexical units, morphological and lexical search, search for syntactic units, search for N-grams, and other functions. The search is performed using open source tools (the MariaDB database management system and the Redis data store). This article describes how we chose FLOSS for the main components of the system, how a search query is processed, and how a linked open dataset is built from the corpus data.

In this paper, we discuss the development of the corpus management system for the Tatar National Corpus "Tugan Tel" [1]. The corpus is organized as a collection of texts covering different genres, such as fiction, news, science, official documents, etc. A text consists of sentences (called contexts). Each word in a sentence is provided with a morphological annotation. The annotation of a word includes the lemma, the POS and the sequence of grammatical features expressed by the affixes.
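The annotation structure described above can be sketched as a small data model. This is only an illustration: the class names, the `find` helper and the tag names `N`, `PL` and `GEN` are assumptions made for the example and are not taken from the corpus tagset. The example token is the Tatar noun китап 'book' in the plural genitive form.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Annotation:
    """Morphological annotation of one word token."""
    lemma: str
    pos: str
    # Grammatical features in the order of the affixes that express them.
    features: Tuple[str, ...] = ()


@dataclass(frozen=True)
class Token:
    form: str
    annotation: Annotation


@dataclass
class Sentence:
    """A context: an ordered list of annotated tokens."""
    tokens: List[Token] = field(default_factory=list)

    def find(self, lemma: str, required: Sequence[str] = ()) -> List[Token]:
        """Return tokens with the given lemma that carry all required features."""
        return [
            t for t in self.tokens
            if t.annotation.lemma == lemma
            and all(f in t.annotation.features for f in required)
        ]


# Toy context with hypothetical tags (N = noun, PL = plural, GEN = genitive).
sentence = Sentence([
    Token("китапларның", Annotation("китап", "N", ("PL", "GEN"))),
    Token("китап", Annotation("китап", "N")),
])
matches = sentence.find("китап", required=("PL", "GEN"))
print([t.form for t in matches])
```

Keeping the features as an ordered tuple preserves the affix order mentioned in the text, while the `find` helper treats them as an unordered requirement set.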

Access to the corpus is provided by the corpus management system. The system allows users to search for word tokens with a specified full form or lemma and grammatical features. The search results are presented as a list of the sentences containing the found word tokens. Figure 1 shows the search result for the word китап 'book' in the plural and in the genitive or the directive case. The tooltip under one of the found tokens shows its morphological annotation.

The general architecture of the corpus management system is shown in Fig. 2. The model includes three main components: a web interface, a search engine, and a database. The system imports annotated texts from the corpus annotation tool and exports them to the triplestore of the LLOD publishing platform.

Development of a corpus management system cannot begin without a clear understanding of which tools will be used to store and process corpus data. When choosing these tools, we faced the task of finding a set of FLOSS that ensures high speed of search queries (no more than 0.1 seconds for direct search queries and no more than 1 second for reverse search queries), wide search capabilities (at least direct and reverse search, search in parts of word forms and lemmas, mixed search, and phrase search), and room for further growth of system performance and functionality. In Sect. 2 we describe the use of FLOSS in the corpus management system, and in Sect. 3 the use of FLOSS in the LLOD publishing platform.

To choose FLOSS for storing and processing corpus data, we formed a list of the most important selection criteria. We analyzed information in public sources [2] and selected FLOSS according to criteria 2-5. The resulting FLOSS set is presented in Table 1. To verify compliance with the first criterion, we conducted a series of write and search experiments on generated data [3]. The MySQL + Redis suite showed the best results, so we chose it to store data in the corpus management system. Building the system on the selected components requires some additional elements. To connect the PHP interpreter to the Redis data store, we use the PhpRedis extension. To connect the PHP interpreter to the MySQL DBMS, we use the php-mysql package, namely the mysqlnd driver, which communicates with the DBMS using its native low-level protocol.

Thus, the scripts executed by the PHP interpreter can operate on data from both the Redis data store and the MySQL DBMS. By performing operations in the order required by the tasks of the corpus management system, the PHP scripts link the indexes and cached data stored in Redis with the index tables and text data stored in MySQL.

Let us consider how a simple direct search query for the wordform alma (Tatar for 'apple') is processed. The processing of such a query is shown in Fig. 3. First, the script that performs this task looks up the identifier of the wordform alma in the Redis database by making a request through the PhpRedis extension. Already at this point, the script may report an error in the query if the identifier of the desired wordform is not found in the corpus data.
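The first step can be sketched as follows. The production system uses PHP with PhpRedis; this Python sketch replaces Redis with a small in-memory stand-in so that it runs without a server. The key scheme `wordform:<form>`, the identifier 17 and all function and class names are hypothetical. A real Redis client behaves similarly on a miss: GET on an absent key yields a null reply, which is what triggers the early error here.

```python
class InMemoryStore:
    """Stand-in for a Redis client; only the GET operation is mirrored."""

    def __init__(self, data):
        self._data = dict(data)

    def get(self, key):
        # Like Redis GET, return None when the key does not exist.
        return self._data.get(key)


class UnknownWordformError(LookupError):
    """Raised when the queried wordform is absent from the corpus index."""


def wordform_id(store, wordform):
    """Step 1: resolve a wordform to its numeric identifier."""
    key = "wordform:" + wordform.lower()  # hypothetical key scheme
    raw = store.get(key)
    if raw is None:
        # The query can be rejected before any SQL is issued.
        raise UnknownWordformError(wordform)
    return int(raw)


# Hypothetical index content: the wordform "alma" has identifier 17.
store = InMemoryStore({"wordform:alma": "17"})
print(wordform_id(store, "alma"))
```

Failing at this stage is cheap: a wordform that does not occur in the corpus never reaches the relational database.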

In the second step, the script queries the MySQL database using mysqlnd to get a list of sentences containing the desired wordform. This data is stored in the index table of the corpus data. In response, the script receives a list of context (sentence) identifiers, which it uses in the third step to request the required number of sentence texts. The resulting data is presented to the user as an HTML document or a JSON structure.
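The second and third steps can be sketched in the same spirit. To keep the example runnable without a database server, it uses Python's built-in `sqlite3` module in place of MySQL. The table names `wordform_index` and `sentences`, their columns, the demo rows and the function names are all assumptions made for illustration; the actual schema is not described in this paper.

```python
import json
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical index and text table."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE sentences (id INTEGER PRIMARY KEY, text TEXT NOT NULL);
        CREATE TABLE wordform_index (wordform_id INTEGER, sentence_id INTEGER);
        CREATE INDEX idx_wordform ON wordform_index (wordform_id);
    """)
    db.executemany(
        "INSERT INTO sentences VALUES (?, ?)",
        [(i, f"demo sentence {i}") for i in range(1, 5)],
    )
    db.executemany(
        "INSERT INTO wordform_index VALUES (?, ?)",
        [(17, 1), (42, 2), (17, 3), (17, 4)],
    )
    return db


def direct_search(db, wordform_id, limit=10):
    """Steps 2 and 3: fetch sentence identifiers, then the sentence texts."""
    # Step 2: the index table maps a wordform identifier to sentence identifiers.
    ids = [row[0] for row in db.execute(
        "SELECT DISTINCT sentence_id FROM wordform_index "
        "WHERE wordform_id = ? ORDER BY sentence_id LIMIT ?",
        (wordform_id, limit),
    )]
    if not ids:
        return []
    # Step 3: request only the required number of sentence texts.
    placeholders = ",".join("?" * len(ids))
    rows = db.execute(
        f"SELECT id, text FROM sentences WHERE id IN ({placeholders}) ORDER BY id",
        ids,
    )
    return [{"id": sid, "text": text} for sid, text in rows]


db = build_demo_db()
results = direct_search(db, 17, limit=2)
print(json.dumps(results, ensure_ascii=False))
```

Splitting the lookup into an identifier query and a text query means the potentially large text table is touched only for the sentences that will actually be shown.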

The corpus has been published on the Linguistic Linked Open Data cloud [4]. The dataset is represented in terms of the NIF [5], OntoLex-Lemon [6], LexInfo, OLiA [7] and MMoOn [8] ontologies (Fig. 4). The RDF data is stored in the OpenLink Virtuoso triplestore. The dataset is available via dereferenceable URIs and a SPARQL endpoint. The dereferenceable URIs are accessible through the LodView RDF browser, based on the Apache Tomcat web server. The query interface is powered by YASQE [9].
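A query against such an endpoint might look like the sketch below, which only builds the query text and the request URL and performs no network access. The endpoint address `http://example.org/sparql` is a placeholder, and the graph pattern is an assumption about how the dataset uses the NIF vocabulary (`nif:Word`, `nif:anchorOf`, `nif:sentence`); the published data may model tokens differently, for example with language-tagged literals.

```python
from urllib.parse import urlencode

NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"


def build_query(wordform, limit=10):
    """Build a SPARQL query for sentences containing the given wordform."""
    # Escape backslashes and double quotes for a SPARQL string literal.
    escaped = wordform.replace("\\", "\\\\").replace('"', '\\"')
    return (
        f"PREFIX nif: <{NIF}>\n"
        "SELECT ?sentence ?text WHERE {\n"
        "  ?word a nif:Word ;\n"
        f'        nif:anchorOf "{escaped}" ;\n'
        "        nif:sentence ?sentence .\n"
        "  ?sentence nif:anchorOf ?text .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


def request_url(endpoint, query):
    """Encode the query as a GET request, as the SPARQL protocol allows."""
    return endpoint + "?" + urlencode({"query": query})


query = build_query("alma", limit=5)
url = request_url("http://example.org/sparql", query)  # placeholder endpoint
print(query)
```

The same query text can be pasted into a YASQE-based query form instead of being sent programmatically.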

Publication of the corpus on the LLOD cloud makes it possible to interlink it with external linguistic resources for Tatar, including the Russian-Tatar Socio-Political Thesaurus [10], TatWordNet and TatVerbBank.

The solutions presented in this article are applied in the developed corpus management system. Using FLOSS significantly reduced the development time and ensured transparency of processes, flexibility, and the possibility of in-depth analysis, as well as opportunities for further development.

The corpus management system is used to work with Tatar texts. The collection of Tatar texts totals about 200 million wordforms. The average execution time of a direct search query does not exceed 0.05 s in 98% of cases, and a reverse search is performed within 0.1 s in 82% of cases, which exceeds the expected system performance. Such performance is largely due to the FLOSS used for data storage and processing. FLOSS also made it possible to expand the search capabilities of the system: search using category lists, complex search using logical expressions, search for named entities, and other functions were added to the advanced version of the system.
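Figures of this kind (the share of queries answered within a time budget) can be computed from a list of measured execution times, as in the sketch below. The function name and the sample timings are invented for illustration and are not measurements from the system.

```python
def share_within(latencies, budget):
    """Return the fraction of measured latencies not exceeding the budget."""
    if not latencies:
        raise ValueError("no measurements")
    return sum(1 for t in latencies if t <= budget) / len(latencies)


# Invented sample timings in seconds; four of the five fit a 0.05 s budget.
sample = [0.01, 0.02, 0.03, 0.04, 0.2]
print(share_within(sample, 0.05))
```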

Currently, the corpus management system is in open beta testing and is available online at http://tugantel.tatar. After the testing is complete, we plan to release it under an open license.

[1] "Tugan Tel": grammatical annotation and implementation

[2] Performance analysis for NoSQL and SQL

[3] Choosing the right storage solution for the corpus management system (analytical overview and experiments)

[4] Linguistic linked open data cloud

[5] Integrating NLP using linked data

[6] The OntoLex-Lemon model: development and applications

[7] OLiA - ontologies of linguistic annotation

[8] Creating linked data morphological language resources with MMoOn - the Hebrew morpheme inventory

[9] The YASGUI family of SPARQL clients

[10] Toward domain-specific Russian-Tatar thesaurus construction

Acknowledgements. The work was funded by the Russian Science Foundation, research project no. 19-71-10056.