key: cord-0060593-c9v2buab
authors: Klyshinsky, Eduard; Karpik, Olesya; Bondarenko, Alexander
title: A Comparison of Neural Networks Architectures for Diacritics Restoration
date: 2021-02-20
journal: Recent Trends in Analysis of Images, Social Networks and Texts
DOI: 10.1007/978-3-030-71214-3_20
sha: 8e221e072a5901d9004428779730e4d08c872e27
doc_id: 60593
cord_uid: c9v2buab

Neural networks have been widely used for the task of diacritics restoration in recent years. Authors use different neural network architectures for the languages they consider. In this paper, we demonstrate that an architecture should be selected according to the language at hand. The choice also depends on the task one states: low- and full-resourced languages may require different architectures. We demonstrate that the commonly used accuracy metric should be replaced in this task by precision and recall, due to the heavily unbalanced nature of the input data. The paper contains results for seven languages: Croatian, Slovak, Romanian, French, German, Latvian, and Turkish.

Let us consider an extended alphabet that contains characters with diacritics, e.g. the extended Latin or Cyrillic alphabet. Every character with a diacritical mark can be matched with a character without such a mark, e.g. Ä → A. In this paper, we do not consider the substitution of a single character with a diacritical mark by several characters, e.g. Ä → AE. Let us substitute all characters with diacritics by their corresponding characters without diacritics. After such replacements, we can state the task of diacritics restoration: take the resulting text and restore the diacritics in exactly those positions where they were omitted.

Originally, the task of diacritics restoration was set for natural language text. Finding a way to automate the input of diacritics is necessary not only for old valuable texts stored in an electronic format, but also for modern electronic texts, since they continue to be created in non-diacritical form. For various reasons, such as the lack of keyboards with diacritics, OCR mistakes, ergonomic factors, etc., diacritics are omitted in a text and must be restored. This situation is typical not only for languages that use the Latin script, but also for Cyrillic and Arabic writing. Recently, this task has been solved by machine learning methods.

Preparing a data set can be done in the following simple way. In the first step, one constructs a set of replacements for a given language. For example, the letter A with different diacritics (ÄÀÂÁȂ) is replaced with the same letter, A. To create training and test sets, one can replace all characters with diacritics in a text. In order to create a test set, one should tag the positions of the corresponding characters without diacritics with the classes of the omitted diacritical marks. The set of tagged words can then be shuffled and split into training and test sets. In such a case, the most frequent words have a chance of occurring in both the training and the test set. Therefore, a model is both trained and tested on highly intersecting sets of words, and the probability of overlap is obviously high. As a result, the algorithm may 'memorize answers' rather than 'generalize from the data'. Such an approach to the construction of training and test sets is similar to diacritics restoration using a dictionary.
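To make the replacement step concrete, here is a minimal Python sketch of de-diacritization with gold-label extraction; the REPLACEMENTS table, the function name, and the example are our illustrative assumptions, not the authors' actual resources.

```python
# A minimal sketch of the de-diacritization step described above.
# The replacement table is illustrative, not a full per-language resource.

# Per-language table mapping each character with a diacritic to its
# plain counterpart (a single character, e.g. Ä -> A, never Ä -> AE).
REPLACEMENTS = {
    "ä": "a", "à": "a", "â": "a", "á": "a",
    "ö": "o", "ô": "o",
    "ü": "u",
}

def strip_diacritics(text):
    """Replace every diacritized character with its plain counterpart.

    Returns the stripped text together with the positions where a mark
    was removed and the original character, i.e. the gold labels for
    the restoration task.
    """
    stripped = []
    labels = []  # (position, original character with diacritic)
    for i, ch in enumerate(text):
        plain = REPLACEMENTS.get(ch.lower())
        if plain is not None:
            labels.append((i, ch))
            stripped.append(plain if ch.islower() else plain.upper())
        else:
            stripped.append(ch)
    return "".join(stripped), labels

text, gold = strip_diacritics("die Väter")
# text == "die Vater", gold == [(5, "ä")]
```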
The motivation of this paper is the following. According to the ICAO standard specification, all proper names in formal documents should be presented without diacritics. There is no complete dictionary of proper names for all languages yet; therefore, a dictionary cannot be used to restore diacritics in proper names. The same problem arises with low-resourced languages. So, the more correct way to model such a situation is to divide the vocabulary of a text and generate non-intersecting training and test sets. In this paper, we investigate the difference between these two approaches and compare several models based on neural networks.

The rest of the paper is organized as follows. In Sect. 2, we give a brief overview of current approaches to the restoration of diacritics. In Sect. 3, we describe our data preparation method and the machine learning methods used. Section 4 presents our experimental results and a brief discussion. Finally, Sect. 5 concludes the paper with a summary of outcomes.

Among the first papers in the area of diacritics restoration are those of Yarowsky [1] and Tufiş [2]. The main method used in these papers is looking a word up in a dictionary of the given language. A word from the text was compared with dictionary entries regardless of accentuation. However, such an approach cannot be applied to a language in which the majority of words are accentuated. For example, in German a diacritical mark can be an indicator of the plural form (der Vater → die Väter). In Vietnamese and Igbo, there are words that differ in diacritics only but have completely different meanings; for example, the paper [3] reports the following chain: akwà (cloth), àkwà (bed/bridge), ákwá (cry), àkwá (egg). The same problem is present in the Arabic script, where diacritical marks are mostly omitted despite having a semantic meaning. Due to their lack of accuracy, the described methods were replaced by machine learning methods. For example, Pauw [8] and Schlippe [4] propose to use character n-grams. The former article used the CRF method to determine the correct n-gram for the current position; the reported quality of this method is about 85%-95%.

The development of neural networks shifted the focus of investigations. The state-of-the-art implementations use two architectures of neural networks: recurrent (RNN) and convolutional (CNN) networks. The results of both architectures are comparable. The article [5] reports accuracy for twelve languages as high as 97-99%; the authors used a network constructed of BiLSTM units with an input embedding layer. The authors of [6] used an acausal temporal convolutional neural network with two hidden layers; the precision of diacritics restoration is reported as 96.2% for the Yoruba language, 97% for Arabic, and 97.5% for Vietnamese. The difference between the results of the CNN and the RNN is reported to be less than 0.3%. The article [7] uses a CNN with an attention layer. Note that all these authors use a single neural network to restore all diacritical marks in a given language. All the neural networks accept as input the vector of a character n-gram in various forms.

As mentioned in the previous section, low-resourced languages pose the problem of training a neural network on a small data set. The article [6] demonstrates that although the precision of a neural network trained and tested on the same dictionary is about 96-98%, the precision of the same network on out-of-vocabulary words drops dramatically to 70-85% and even below 50%. Most of the authors (e.g. [3, 5, 7]) do not consider the class of out-of-vocabulary words at all.
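A minimal sketch of the vocabulary-disjoint split advocated above, assuming the data comes as token-level (word, label) pairs; all names are ours, and note that the paper balances the split by token frequency, whereas this sketch partitions by word type count.

```python
# Vocabulary-disjoint split: train and test share no word *types*,
# so a model cannot simply memorize dictionary entries.
import random

def split_by_vocabulary(tagged_words, test_share=0.2, seed=0):
    """tagged_words: list of (word, label) pairs, one per token.

    Word types are partitioned first; every occurrence of a word then
    follows its type into either the training or the test set.
    """
    types = sorted({w for w, _ in tagged_words})
    random.Random(seed).shuffle(types)
    n_test = int(len(types) * test_share)
    test_vocab = set(types[:n_test])
    train = [(w, y) for w, y in tagged_words if w not in test_vocab]
    test = [(w, y) for w, y in tagged_words if w in test_vocab]
    return train, test
```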
It should be noted that the accuracy metric cannot be used for such a task, since there is an imbalance between the numbers of words with and without diacritical marks. As shown in [5], the share of words with diacritics in different languages varies from 8% in German to 88% in Vietnamese. Our experience demonstrates that the numbers of occurrences of the same character with and without a diacritical mark can stand in a ratio of 1:100. Under such conditions, the accuracy metric is useless, and one should prefer the precision metric.

Based on the above, we can state that most of the current methods of diacritics restoration have problems with the processing of out-of-vocabulary words. To use machine learning methods, it is more correct to create non-intersecting training and test sets. The precision and recall metrics should be prioritized for the evaluation of the achieved results. Convolutional and recurrent neural networks demonstrate almost the same results for the same language, but these results differ among languages. Therefore, the aim of this article is to investigate the dependence of the neural network architecture on the selected language and on the type of test and training sets. Moreover, we want to test a new approach to constructing a neural network solver: instead of training a single neural network that recognizes any character, we train a set of binary solvers, where each network is responsible for only one character with a given diacritic.

This section discusses the dataset used, the data preparation method, and the architectures of the neural networks.

In this paper, we used several news collections. The French, German, and Turkish collections were downloaded from news wire sites. The collections for the Croatian, Latvian, Romanian, and Slovak languages were downloaded from the corpus presented in [5] (http://hdl.handle.net/11234/1-2607). The sizes of the former range from 63 to 217 million characters; the latter were cut to the first 200 million characters.

We constructed a list of replacements of characters with diacritics by their ASCII counterparts for each language. We divided the collection into words not shorter than 4 characters; each word was aligned on the left and right with three alignment symbols. Then we extracted all 7-grams with a character from the replacements list in the central position. Each 7-gram was tagged depending on whether the symbol in the central position had a diacritical mark or not. Finally, we replaced all characters with diacritics by their counterparts. For example, the Turkish word 'öncesi' is padded with three spaces in the starting and final positions: '   öncesi   '. Then the algorithm extracts all 7-grams containing an accentuated symbol or its counterpart in the central position; in our case the resulting list is ['   önce', ' öncesi', 'ncesi  ', 'cesi   ']. Finally, symbols with diacritics are replaced by their counterparts, and the 7-grams are tagged with 0 (no diacritics) or 1 (with diacritics) according to the symbol in the central position: ('   once', 1), (' oncesi', 0), ('ncesi  ', 0), ('cesi   ', 0).

We used two types of character embedding. In the first case, we replaced a symbol with its position in the alphabet of the given language (dense representation); in the second case, we used one-hot encoding over this alphabet (one-hot representation). The alphabet was built automatically from the first million characters of the collection after word tokens were extracted.
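The extraction procedure can be sketched as follows, assuming a small subset of the Turkish replacement table; the helper names are ours, and the printed output reproduces the 'öncesi' example above.

```python
# A minimal sketch of the 7-gram extraction and tagging step.
PAD = " "          # alignment symbol added on both sides of a word
TARGETS = {"ö": "o", "ç": "c", "ş": "s", "ı": "i"}  # subset for Turkish
COUNTERPARTS = set(TARGETS.values())

def extract_ngrams(word):
    """Yield (7-gram, tag) pairs for every candidate central character.

    The word is padded with three alignment symbols on each side, a
    7-gram is taken around every character that carries a diacritic or
    is a plain counterpart, diacritics are stripped, and the tag records
    whether the central character originally carried a mark.
    """
    padded = PAD * 3 + word + PAD * 3
    for i, ch in enumerate(padded):
        if ch in TARGETS or ch in COUNTERPARTS:
            gram = padded[i - 3 : i + 4]
            tag = 1 if ch in TARGETS else 0
            plain = "".join(TARGETS.get(c, c) for c in gram)
            yield plain, tag

print(list(extract_ngrams("öncesi")))
# [('   once', 1), (' oncesi', 0), ('ncesi  ', 0), ('cesi   ', 0)]
```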
Unlike other authors, we constructed one binary classifier for each accentuated character of a given language. Such an approach can be applied directly to languages like German, which has only three accentuated characters: one can simply extract all 7-grams with counterpart symbols in the central position and pass each to the corresponding classifier, which decides whether a diacritical mark should be placed there. For languages such as French, Romanian, and Latvian, which have several variants of diacritics for the same base symbol, our approach has to be extended with a second step. A 7-gram with a candidate character is passed to all classifiers associated with that character. After classification, we find the maximum among the classifiers' answers and conclude which diacritical mark is necessary, or that none is. In this paper, we also compared two approaches: training separate classifiers without taking a decision between different diacritics for the same base character vs. training one classifier for every character with diacritics. Under the chosen approach, every neural network contains an output dense layer consisting of two neurons with the SoftMax activation function.

We used three different neural network architectures: dense, convolutional, and recurrent; we also used the Random Forest method as a baseline. As mentioned above, we used two types of input vectors in our experiments: a vector of seven character codes and a one-hot encoded vector of 7 * (size of the alphabet) elements. The architectures of the neural networks are also listed in Table 1.

The dense network consists of three layers of 128 neurons with the ReLU activation function and an output layer of 2 neurons with the SoftMax function; the batch size was 2048. We experimented with the number of neurons and layers, but this decreased the results. We studied this type of network for the 7-character-code vectors only. The network converges in 30 epochs.

The convolutional network consists of two layers with 32 convolution matrices of size 3 × 1, a dense layer of 128 neurons with the ReLU activation function, and an output layer of 2 neurons with the SoftMax function. The batch size was 512; the network converges in 10 epochs. We experimented with the number of neurons, the number of convolution matrices, and their size, but this decreased the results.

For the recurrent networks, we used three configurations. The first one consists of a layer of 64 biLSTM units and an output layer of 2 neurons with the SoftMax function. The first layer of the second configuration consists of 128 biLSTM units. The third configuration has a dense input layer of 32 neurons with the ReLU activation function, 128 biLSTM units, and an output layer of 2 neurons with the SoftMax function. The batch size was 512; all these networks converge in 10 epochs.

In order to investigate the influence of the intersection of the training and test vocabularies, we conducted two series of experiments. For the first series, we separated the input set of 7-grams into two non-intersecting sets; the sum of the test set's frequencies was approximately 20% of the total frequency sum. Thus, in this series we trained and tested a classifier on different vocabularies and simulated the situation of out-of-vocabulary words. The second series of experiments tested only the convolutional and recurrent networks, since they had shown better performance. In this series, we simply shuffled the set of 7-grams and split the input set into training and test sets in the ratio 80%:20%. Our experiments demonstrated that 80% of the words of the test set could then be found in the training set.
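As a reference point, here is a sketch of the convolutional configuration described above, assuming Keras; the hyperparameters (two layers of 32 convolution kernels of size 3, a 128-neuron ReLU dense layer, a 2-neuron SoftMax output, batch size 512, 10 epochs) follow the text, while the ReLU activations inside the convolution layers, the optimizer, and the loss are our assumptions.

```python
# A sketch of the binary per-character CNN classifier over a one-hot
# encoded 7-gram; not the authors' exact implementation.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(alphabet_size):
    model = models.Sequential([
        layers.Input(shape=(7, alphabet_size)),   # one-hot encoded 7-gram
        layers.Conv1D(32, 3, activation="relu"),
        layers.Conv1D(32, 3, activation="relu"),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(2, activation="softmax"),    # diacritic / no diacritic
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# model = build_cnn(alphabet_size=40)
# model.fit(x_train, y_train, batch_size=512, epochs=10)
```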
The results of our experiments are shown in Tables 2, 3, 4, 5, 6, and 8. We did not use Word Error Rate (WER) and Diacritic Error Rate (DER), since they are defined as accuracy calculated over words and single diacritics, respectively. Instead of these metrics, we report precision and recall; as shown below, they evaluate the results of the methods more adequately.

The tables are organized as follows. Each table contains the results for a single language. We measured the results separately for every character with a diacritical mark. The first line for every character contains the precision; the second line reports the recall. Values in bold indicate the maximal precision for the character among all results for separated training and test vocabularies. Underlined values indicate the maximal precision for the character among all results for mixed training and test vocabularies. The last two rows of each table give the average precision and recall for the given classifier. Note that the best precision does not imply the best F1-value, but we did not calculate the F1-value due to the large size of the resulting tables. Mostly, the best precision comes with a low recall; such classifiers, however, are not practically applicable. In the case of a high recall and a small difference between the values, both classifiers can be applied and should be compared by other features. In this article, we tried to draw the overall picture rather than compare the different neural networks in detail.

In the cases of zero precision and recall, our system either failed to randomly select any 7-grams with a diacritical mark because of the lack of words with diacritics, or evaluated all 7-grams as ones without diacritics due to the huge imbalance between the classes. In the case of the Romanian Ș, this was true for every experimental run. For the French Ê, the classes were unbalanced at about 1:50; its precision and recall were equal to zero, but the accuracy was equal to 0.98. Thus, the accuracy metric does not reflect the real situation here. That is why we used both precision and recall in this project.

As we can see, there is no 'silver bullet' among the neural network architectures, even for a single language. For German and Turkish, the preferred architecture is the convolutional network over a one-hot encoded input vector, if we consider the number of winning characters (2 of 3 and 4 of 5, respectively). The other languages do not demonstrate any preference on this criterion. If we consider the average precision of the ML model, the same network wins for Croatian, German, and Turkish; but for Romanian, Slovak, and Latvian the leading model is the LSTM with 64 units, and the convolutional network with dense vectors wins for French. Note that training the CNN with a one-hot encoded input vector takes about 10-50 times more computation time. For mixed training and test vocabularies, the biLSTM is the default best choice; it wins in 39 out of 56 cases.

[Table 5. Results of experiments for the Romanian language (precision, recall). Column names are presented in Table 2. Bold indicates the best solution for separated training and test vocabularies, underline for mixed ones.]

We compared our approach, training a separate neural network for each character, to the common one, training one neural network for all the characters of a given language, using convolutional networks over separated training and test vocabularies on Google Colab's T4. As a result, training all the models for a language like Slovak takes a whole day and has to be repeated several times.
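A toy illustration of the metric argument above (the French Ê case): under a roughly 1:50 class imbalance, a degenerate classifier that never predicts a diacritic scores about 0.98 accuracy with zero precision and recall. The numbers are illustrative, not the paper's data.

```python
# Why accuracy is misleading on heavily unbalanced diacritic classes.
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1] * 2 + [0] * 98     # 2 diacritized 7-grams out of 100
y_pred = [0] * 100              # degenerate "always no diacritic" model

print(accuracy_score(y_true, y_pred))                    # 0.98
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0
print(recall_score(y_true, y_pred, zero_division=0))     # 0.0
```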
Considering the difference between the results with and without intersection of the training and test vocabularies, we can state that the results for mixed vocabularies are almost always better than those for vocabularies without intersection. This is always true for the dense 7-character input vector, except for French, with obviously erroneous results for the LSTM on separated vocabularies, and for the French Ô and the German Ö. The last two cases can probably be explained by a lack of statistics, since we trained the neural networks 3-6 times on the separated data sets, while for the mixed data sets only once. The average difference here is 0.28 for the RNN and 0.33 for the CNN. Mixed vocabularies demonstrated the best result, with an average difference of 0.12. Thus, we can state that the methods of investigating low-resourced languages should be changed.

Note that we considered words not shorter than 4 characters. Languages such as Vietnamese, Chinese, and Irish have shorter names and prefixes. Our experiments demonstrated that the introduced method performs poorly in such cases; we suppose that it is nearly impossible to restore the correct diacritics having just one symbol of context. Moreover, we cannot be completely sure that a neural network trained on news wire will correctly restore diacritics in proper names, due to the specificity of the testing data.

Some characters, such as the French Î, the German Ä, and the Romanian Ș, demonstrate very poor results. The reasons for such behavior need to be investigated in separate research, but we can hypothesize that these characters occur in the same contexts as their counterparts. For example, this is true for the French Î, which is used to differentiate similarly sounding words. But this is not always the case for the German Ä, since there are many words with Ä that do not have counterparts.

In this paper, we demonstrated that there is no single best neural network architecture across the considered languages in the area of diacritics restoration. There are differences in neural network design and results between the cases of low- and full-resourced languages. For a full-resourced language, the task is usually stated as looking a word up in the vocabulary and taking a decision about its diacritics. For a low-resourced language, we may have an out-of-vocabulary word at hand. These tasks demand different approaches to their solution. In the former case, slightly better results are achieved with the biLSTM networks; in the latter case, the choice depends on the considered language.

We investigated the following languages: Croatian, Slovak, Romanian, French, German, Latvian, and Turkish. We found that there is a large difference between the results for completely separated training and test vocabularies and for a random distribution of tokens between these vocabularies. The average difference in precision between these two cases reaches 0.12-0.33.

We have presented a new approach based on several neural networks, each trained for its own symbol. This approach wins over the previous one, where a single neural network is trained for all the symbols. Almost always, there is a good option for any character with diacritics to be restored. However, the results for some characters are quite low: the French Î, the German Ä, and the Romanian Ș. For the last character, any machine learning algorithm decided that it is easier to attribute all the characters to S, since there are not so many examples of words with Ș; the same is true for many other cases.
This means that we have to use other methods to form the training and test sets and to construct batches. The case of the German Ä needs a linguistic investigation of this phenomenon.

References

1. Yarowsky, D.: A comparison of corpus-based techniques for restoring accents in Spanish and French text. In: Natural Language Processing Using Very Large Corpora
2. Tufiş, D.: Automatic diacritics insertion in Romanian texts
3. Automatic restoration of diacritics for Igbo language
4. Schlippe, T.: Diacritization as a machine translation problem and as a sequence labeling problem
5. Diacritics restoration using neural networks
6. Efficient convolutional neural networks for diacritic restoration
7. Attentive sequence-to-sequence learning for diacritic restoration of Yorùbá language text
8. De Pauw, G.: Automatic diacritic restoration for resource-scarce languages