title: Algerian Dialect Translation Applied on COVID-19 Social Media Comments
authors: Slim, Amel; Melouah, Ahlem; Faghihi, Yousef; Sahib, Khouloud
date: 2020-10-30
journal: Artificial Intelligence and Renewables Towards an Energy Transition
DOI: 10.1007/978-3-030-63846-7_68

This work is part of a study on the propagation of misinformation about COVID-19 and its impact on Algerian society. It addresses the problem of translating the Algerian dialect in COVID-19 social media communications. The proposed system first filters messages to identify comments that talk about COVID-19. The COVID-19 texts are then translated from the Algerian dialect into formal standard Arabic. The filtering process is based on a long short-term memory (LSTM) model, and the translation process is based on an embedding-GRU model. Experimental results give precision rates of about 99.98% for the filtering process and about 97.56% for the translation process. The achieved BLEU score is 22.10.

COVID-19 broke out in the city of Wuhan, China, and very rapidly became a pandemic [1]. During this pandemic, people use social media to share opinions and perceptions about the coronavirus, exchanging and acquiring many kinds of information at a historic and unprecedented scale [2]. However, much of the circulating information about the coronavirus is false. According to the World Health Organization, false information is "spreading faster than the virus". Misinformation is psychologically harmful: it increases stress both for the people who consume it and for those around them. It is therefore important to detect this type of information and to understand its impact on society. We must identify messages whose comments spread fear among others, as well as those that promote unverified COVID-19 treatments.

Arab countries face a large volume of misinformation. Like all countries, they must control this propagation, but first they must overcome the dialect language barrier. People in Arab countries communicate in dialects such as Levantine, Egyptian, Algerian, and Tunisian. The Algerian dialect is difficult to understand, even for Arabic speakers who are not Algerian; it is one of the most difficult dialects among Arab countries. Through this work, we aim to provide insight into the problem of coronavirus information propagation in social media expressed in the Algerian dialect. First, a filtering process identifies Algerian-dialect messages about the coronavirus. Second, a translation process transforms the informal messages into formal ones. The filtering applies a long short-term memory model. The translation transforms Algerian-dialect sentences into Modern Standard Arabic (MSA). The main objective of this work is to make COVID-19 messages written in the Algerian dialect understandable.

The rest of the paper is organized as follows. Section 2 reviews related work in the field of machine translation. Section 3 details the adopted filtering model, the proposed neural machine translation network, and the dataset used. Section 4 presents the results with a discussion. Finally, Sect. 5 concludes the paper.

With deep learning, machine translation has become more efficient. Much research has excelled in this field, and several deep learning models have been proposed for machine translation.
The first neural machine translation model was the encoder-decoder model. The encoder-decoder is a recurrent neural network designed to solve sequence-to-sequence problems, and it has inspired much of the subsequent work in machine translation. Several articles have used the encoder-decoder model for language translation; this section presents some of them. [3] proposed an encoder-decoder model combined with a statistical machine translation approach; the model can either score a pair of sequences or generate a target sequence from a source sequence. [4] translated the Algerian Arabic dialect into MSA (Modern Standard Arabic) in two steps: a transliteration step that converts Arabic letters into Arabizi letters, and a translation step that transforms the Arabizi sentences into MSA. [5] demonstrated the performance of the encoder-decoder model by evaluating it and comparing it to other models on an English-French translation problem. Another area of research in the field of translation is transliteration, which consists in transforming a grapheme transcription from one writing system to another while preserving its pronunciation. [6] showed that the encoder-decoder can solve a transliteration problem by applying it to Latin-script-to-Tunisian-dialect transliteration.

The evolution of neural machine translation did not stop at the encoder-decoder model; later works added an attention mechanism. Attention, which can be viewed as an alignment model [7], makes it possible to focus on the relevant parts of the source sentence at each step of the translation. Several works have used this alignment principle in machine translation. The first attentional encoder-decoder model was proposed in [8]; its main idea is to add the alignment mechanism to the basic encoder-decoder for English-French translation. This idea was developed further in two Arabic dialect translation systems [9]. The first system, dialectal translation to a standard language (D2SLT), is based on the attentional sequence-to-sequence learning model. The second system, Google Neural Machine Translation (GNMT), is based on a sequence-to-sequence model with the addition of residual connections, an attention mechanism, and a bidirectional encoder, all proposed in [10]. [11] extended the basic encoder-decoder by including the alignment mechanism in an English-to-German translation task.

In social media, COVID-19 has spread fear among people because of rumors and myths about the virus. This is a problem for all societies, including Algerian society. Deep learning models now dominate machine learning for natural language processing, and several works have used deep learning for classification and translation. In this work, we apply deep learning models to the translation of COVID-19 comments from the Algerian dialect into a formal language. Figure 1 describes the architecture of the proposed system. Filtering and translation are its two main operations, and both are based on deep learning models; a minimal sketch of this two-stage pipeline is given below. In recent years, there have been remarkable developments in deep learning [12, 13]. Architectures such as convolutional neural networks (CNN), LSTM, and GRU have obtained competitive results in several competitions (e.g., computer vision, signal processing, and natural language processing) [14].
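As an illustration of the overall flow (not the authors' released code), the following Python sketch shows how a trained filtering model and a trained translation model could be chained. The names filter_model, translation_model, tokenize, and decode_sequence are hypothetical placeholders for the trained Keras models and the preprocessing helpers.

```python
# Minimal sketch of the two-stage pipeline: filter, then translate.
# All names below (filter_model, translation_model, tokenize, decode_sequence)
# are hypothetical placeholders, not identifiers from the paper.

def process_comment(comment, filter_model, translation_model,
                    tokenize, decode_sequence, threshold=0.5):
    """Return the MSA translation of an Algerian-dialect comment if it
    mentions COVID-19, otherwise return None."""
    x = tokenize(comment)                           # dialect text -> padded id sequence
    p_covid = float(filter_model.predict(x)[0, 0])  # LSTM binary classifier
    if p_covid < threshold:
        return None                                 # not a COVID-19 comment
    y = translation_model.predict(x)                # embedding-GRU translator
    return decode_sequence(y)                       # predicted ids -> MSA text
```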
Long short-term memory networks (LSTMs), introduced in [15], are a special kind of RNN capable of learning long-term dependencies. They work very well on a wide variety of problems and are widely used in different domains. LSTMs are explicitly designed to avoid the long-term dependency problem; remembering information for long periods is practically their default behavior. The following subsections are devoted to the filtering and translation operations, but first we present the data used.

In this study, we construct a dataset from two sources. The first source is the PADIC corpus [16]; the second is Facebook. PADIC contains five Arabic dialects and their translations into MSA (Modern Standard Arabic). The dialects in PADIC are two Algerian dialects (ANB and ALG), Palestinian, Syrian, and Tunisian. PADIC contains 6413 sentences for each dialect, all collected before the COVID-19 pandemic. All ALG sentences in PADIC (6413 sentences) form the first part of our dataset. The second part consists of 1200 sentences collected from Facebook comments about the COVID-19 virus. Table 1 gives some examples from the dataset.

Deep learning is the approach adopted for filtering. An LSTM model classifies sentences into two categories, sentences with COVID-19 information and sentences without COVID-19 information, as shown in Fig. 2. The filtering architecture starts with an embedding layer, which quantifies and categorizes semantic similarities between linguistic items based on their distributional properties in large samples of linguistic data. A dropout layer then forces the LSTM to learn useful, robust features in conjunction with different random subsets of other neurons. Next, the LSTM layer, followed by another dropout layer, classifies the input data. Finally, a batch normalization layer and a dense layer improve the performance, speed, and stability of the architecture. Table 2 shows the number of parameters in each layer.

A neural machine translation system is any neural network that maps a source sentence, X1, …, XN, to a target sentence, Y1, …, YT, where all sentences are assumed to terminate with a special "end-of-sentence" token [17]. In the translation stage, sentences about COVID-19 are translated from the Algerian Arabic dialect into MSA. The translation applies an embedding gated recurrent unit (GRU) model. The network architecture consists of an embedding input layer, a GRU layer, a dropout layer, two time-distributed layers, and a final dropout layer. The translation network takes a dialect sentence as input and outputs its translation, as shown in Fig. 3. In a translation system, words or phrases from the source vocabulary are mapped to vectors of real numbers. The word embeddings are learned as the weights of the first layer, usually referred to as the embedding layer; conceptually, this is a mapping from a space with one dimension per word to a continuous vector space of much lower dimension. Gated recurrent neural networks (gated RNNs) have shown success in several applications [18]. They were proposed in [19] to let each recurrent unit adaptively capture dependencies at different time scales. Similarly to the LSTM unit, the GRU has gating units that modulate the flow of information inside the unit, but without a separate memory cell [20]. The parameters of the model are given in Table 3, which shows the number of parameters in each layer.
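To make the two architectures concrete, the following Keras sketch reproduces the layer stacks described above. It is only a sketch: vocabulary sizes, embedding dimensions, unit counts, and dropout rates are illustrative assumptions rather than the values reported in Tables 2 and 3, and the layer ordering follows the textual description only approximately.

```python
# Illustrative Keras sketch of the filtering (LSTM) and translation
# (embedding-GRU) architectures. All sizes are assumed placeholders.
from tensorflow.keras import layers, models


def build_filter_model(vocab_size=20000, embed_dim=128):
    """LSTM classifier: does a comment mention COVID-19 or not?"""
    return models.Sequential([
        layers.Embedding(vocab_size, embed_dim),   # dialect token ids -> vectors
        layers.Dropout(0.3),
        layers.LSTM(128),
        layers.Dropout(0.3),
        layers.BatchNormalization(),
        layers.Dense(1, activation="sigmoid"),     # 1 = COVID-19 comment
    ])


def build_translation_model(src_vocab=20000, tgt_vocab=20000, embed_dim=128):
    """Embedding-GRU translator: one MSA token per (padded) source position."""
    return models.Sequential([
        layers.Embedding(src_vocab, embed_dim),
        layers.GRU(256, return_sequences=True),
        layers.Dropout(0.3),
        layers.TimeDistributed(layers.Dense(256, activation="relu")),
        layers.Dropout(0.3),
        layers.TimeDistributed(layers.Dense(tgt_vocab, activation="softmax")),
    ])
```

In this sequence-labeling formulation, source and target sequences are padded to the same length, which is consistent with the time-distributed output layers described above.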
The translation of COVID-19 comments written in the Algerian dialect is the problem addressed by this study. The strategy is to separate comments that refer to COVID-19 from those that do not, and then to translate the COVID-19 comments from the Algerian dialect into a formal language. An LSTM model performs the filtering and a GRU model performs the translation. We use TensorFlow and Keras to implement the proposed system. The models are trained and evaluated with the following settings:

• Filtering: the accuracy metric, the binary cross-entropy loss, and the Adam optimizer, with 50 epochs and a batch size of 1024 sequences per iteration.
• Translation: the accuracy metric, the sparse categorical cross-entropy loss, and the Adam optimizer, with a learning rate of 0.005, 30 epochs, and a batch size of 1024 sequences per iteration.

A minimal code sketch of these settings is given at the end of this section. The performance is depicted graphically with accuracy and loss curves (Fig. 4). An overall loss and accuracy on the training, validation, and test sets are computed and used to assess the models; Table 4 presents these metric results. We tested the proposed system on 14026 sentences. The training step used 13000 sentences, of which 10% were reserved for evaluation; the remaining 1026 sentences were used in the test step. We selected weighted averaging to aggregate the per-class metrics. The selected hyperparameter values are also shown in Table 4. In the translation step, 1300 sentences were used for training, and 10% of these were used for evaluation. An interesting result is the BLEU scores shown in Fig. 6.

Table 5 shows the results of the filtering process. The good performance of the proposed model is evident: all the sentences are correctly classified. Table 6 gives translations of some comments from the Algerian dialect into MSA. We observe that the proposed model mistranslates certain words (the examples are given in Table 6), while other sentences are translated correctly. To assess the performance of the proposed system, we compared it with the system proposed in [4]; Table 7 presents the results of this comparison. Although the proposed embedding-GRU translation architecture is simpler, it clearly gives better results than the encoder-decoder model proposed in [4].
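As a hedged illustration of these settings (not the authors' code), the sketch below compiles and trains the two hypothetical model builders from the earlier sketch with the reported hyperparameters and computes a corpus BLEU score with NLTK. All data arguments are assumed to be already tokenized and padded arrays.

```python
# Sketch of the reported training settings; build_filter_model and
# build_translation_model are the hypothetical builders from the previous
# sketch, and all data arguments are assumed tokenized, padded arrays.
import tensorflow as tf
from nltk.translate.bleu_score import corpus_bleu


def train_and_score(x_filter, y_filter, x_dialect, y_msa,
                    msa_references, msa_hypotheses):
    # Filtering: binary cross-entropy, Adam, 50 epochs, batch size 1024.
    filter_model = build_filter_model()
    filter_model.compile(optimizer="adam",
                         loss="binary_crossentropy",
                         metrics=["accuracy"])
    filter_model.fit(x_filter, y_filter,
                     validation_split=0.1, epochs=50, batch_size=1024)

    # Translation: sparse categorical cross-entropy, Adam (lr = 0.005),
    # 30 epochs, batch size 1024.
    translation_model = build_translation_model()
    translation_model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.005),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"])
    translation_model.fit(x_dialect, y_msa,
                          validation_split=0.1, epochs=30, batch_size=1024)

    # Corpus BLEU on held-out sentences (token lists, one reference each).
    bleu = corpus_bleu([[ref] for ref in msa_references], msa_hypotheses)
    return filter_model, translation_model, 100 * bleu
```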
In this work, we were interested in the translation of comments written in the Algerian dialect on the subject of the COVID-19 pandemic. A two-step system first recognizes sentences that mention COVID-19 and then translates them into a formal language. The recognition phase is based on a filtering principle realized by a deep learning model (LSTM); the translation of the Algerian dialect into MSA is based on an embedding-GRU model. The filtering model reached an accuracy score of about 91.96%, and the translation model obtained a BLEU score of 22.10.

References
1. Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modeling study
2. Characterizing the propagation of situational information in social media during COVID-19 epidemic: a case study on Weibo
3. Learning phrase representations using RNN encoder-decoder for statistical machine translation
4. Neural vs statistical translation of Algerian Arabic dialect written with Arabizi and Arabic letter
5. On the properties of neural machine translation: encoder-decoder approaches
6. A sequence-to-sequence based approach for the double transliteration of Tunisian dialect
7. Sequence to sequence learning with neural networks
8. What does attention in neural machine translation pay attention to?
9. Unsupervised dialectal neural machine translation
10. Google's neural machine translation system
11. Effective approaches to attention-based neural machine translation
12. Deep Learning
13. A survey on deep learning for big data
14. A deep learning classifier for sentence classification in biomedical and computer science abstracts
15. Long short-term memory
16. PADIC: extension and new experiments
17. A study of reinforcement learning for neural machine translation
18. Gate-variants of gated recurrent unit (GRU) neural networks
19. Learned-norm pooling for deep feedforward and recurrent neural networks
20. Empirical evaluation of gated recurrent neural networks on sequence modeling

Acknowledgments. We are grateful to the Direction Générale de la Recherche Scientifique et du Développement Technologique (DGRSDT), which kindly supported this research, as well as to the Laboratoire de Recherche Informatique (LRI), where this study was conducted.