key: cord-0180168-0t99ipdt authors: Sabri, Nazanin; Edalat, Ali; Bahrak, Behnam title: Sentiment Analysis of Persian-English Code-mixed Texts date: 2021-02-25 journal: nan DOI: nan sha: 7a89f0e5b1d323c4ca774c2665f0882e59232eb1 doc_id: 180168 cord_uid: 0t99ipdt

The rapid production of data on the internet, together with the business and research need to understand how users feel, has prompted the creation of numerous automatic monolingual sentiment detection systems. More recently, however, the unstructured nature of social media data has led to more instances of multilingual and code-mixed text. This shift in content type has created a new demand for code-mixed sentiment analysis systems. In this study we collect and label a dataset of Persian-English code-mixed tweets. We then introduce a model that uses pretrained BERT embeddings as well as translation models to automatically learn the polarity scores of these tweets. Our model outperforms baseline models that use Naive Bayes and Random Forest methods.

Online social networking platforms place very few constraints on the type and structure of the textual content posted online. The unstructured nature of this content results in text that can be far from the grammatical, syntactic, or semantic rules of the language it is written in [1], [2]. One deviation that has been observed quite often is the use of words from more than one language in the same text [3], commonly known as "code-mixed" text. These texts are written in language A but include one or more words of language B (either written in the official alphabet of language B or transliterated into language A). In this study, we investigate code-mixed Persian-English data and perform sentiment analysis on these texts.
Sentiment analysis of these texts differs from that of purely Persian text because the words carrying the emotional content may be written in English, which leaves Persian-only sentiment analysis models unable to produce the correct output. Grammatical differences could also render such monolingual models useless [4]. The difficulties of this task have been shown for other language pairs [5], [6]. We chose English as the second language because of its widespread use overall and, more importantly, among Persian speakers.

Since Persian is a low-resource language, we first needed to create a dataset of texts and label those texts with the correct sentiment scores. We began by using the Twitter API to search for a list of "Finglish" words. After the data was collected, two annotators each labeled all 3,640 tweets. A third annotator was then added to the project to label the tweets on which the first two annotators disagreed. Once the dataset was complete, an ensemble model was created and used to detect the sentiment scores of the texts.

The rest of this paper is structured as follows: Section II provides a brief overview of related work. Next, we look at our dataset in detail in Section III. Our models, as well as our text cleaning and preparation steps, are described in Section IV. We then report our results in Section V and conclude the study in Section VI.

The prevalence of code-mixed textual data, due to the unstructured and uncontrolled nature of the web and of social networks, has resulted in a focus on the topic in recent years, including multiple shared tasks defined on the subject [7]-[10].
One of the language pairs that has been the center of attention in code-mixed text analysis is Hindi-English [11]-[16]. With the large population of multilingual individuals in India, such texts have become quite common. Other language pairs (such as Bengali-English, Bambara-French, and Tamil-English) have been studied as well [5], [17]-[19].

Some studies approach the task by hand-engineering features. In [20] various features (e.g. the number of code switches in the text) were introduced and employed in a multi-layer perceptron model. The number of sentiment and opinion lexicon words, the number of uppercase words, and POS tags are among the features that have been utilized. Other studies try to find methods with less need for feature selection, for instance by using cross-lingual word embeddings [21] or subword embeddings [13]. The overview of the SemEval-2020 task on Spanglish and Hinglish [10] reports that BERT and ensemble methods are the most common and successful approaches. In [22] an approach is introduced to deal with different variations of the same word by substituting words based on their context words. In [23] a benchmark for linguistic code-switching is presented, with the aim of enabling the evaluation of models.

To the best of our knowledge, the dataset annotated and presented as part of this study is the first dataset for the code-mixed Persian-English sentiment analysis task. We also believe that there are no other studies available on this specific subset of the topic.

In this section, we describe the distributions and characteristics of our data. As described in Section I, our dataset consists of 3,640 tweets labeled with polarity values.
Our dataset fields include the search terms submitted to the API that led to each tweet's retrieval, the text content of the tweet, the three labels assigned to the tweet, and the final label, which is calculated through majority voting. We selected a list of 44 unique Finglish terms (English words transliterated to Persian) in order to collect this data. The list includes, among others, the transliterated forms of "perfect", "happy", "boring", and "single".

In our dataset, 69.2% of the instances received the same label from the first two annotators. A third annotator was then asked to label the rest of the data. In the resulting dataset, 15.7% of the tweets were labeled as positive, 59.7% as negative, and 21.5% as neutral. Table I displays two examples of annotated data in our dataset.

The majority of the data being negative can be explained by two facts. First, due to the access restrictions on Twitter in Iran, only 9.24% of Iranians use Twitter [24]; as a result, the users on the platform mostly come from a particular belief system [25], which could result in the observed negative opinions. Second, the data collection was conducted in the last months of 2019, which included the beginning stages of the spread of the Coronavirus in Iran. Even though our keywords did not relate to the Coronavirus in any way, the shift in mood was observed in our data. We nevertheless believe that this skew reflects the current state of our society and that the dataset can still be used for the task of code-mixed sentiment detection, as the tweets do exhibit the characteristics of code-mixed language and there are enough examples in the dataset to allow a model to learn the attributes of polarities other than negative.

To preserve the privacy of the users, all user mentions in the texts have been replaced by @USERMENTION.
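
As an illustration of the two dataset preparation rules described above (majority voting for the final label and replacement of user mentions), the following is a minimal sketch. The function names, the regular expression for a Twitter handle, and the label strings are assumptions made for this example; they are not taken from the released dataset or its code.

```python
import re
from collections import Counter
from typing import Optional

# Assumed handle pattern: "@" followed by ASCII letters, digits or underscores,
# not preceded by a word character (so e-mail addresses are left alone).
MENTION_PATTERN = re.compile(r"(?<![A-Za-z0-9_])@[A-Za-z0-9_]+")


def anonymize_mentions(text: str) -> str:
    """Replace every user mention with the @USERMENTION placeholder."""
    return MENTION_PATTERN.sub("@USERMENTION", text)


def final_label(first: str, second: str, third: Optional[str] = None) -> Optional[str]:
    """Majority vote over the annotator labels that are available.

    Returns None when no label has a strict majority, for example when the
    first two annotators disagree and no third label has been collected yet.
    """
    votes = [label for label in (first, second, third) if label is not None]
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


# Hypothetical usage with made-up labels and a romanized stand-in tweet.
print(anonymize_mentions("@ali_92 in film perfect bood @sara"))
print(final_label("negative", "positive", "positive"))
```

In this sketch a tweet on which all three annotators disagree receives no final label; how such cases were handled in the actual dataset is not stated in the text.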
Since all tweets were public (at least at the time of collection) and were collected using the official Twitter API, tweet IDs are nevertheless provided to allow use in future research should the need arise. This dataset is publicly available on GitHub (https://github.com/nazaninsbr/Persian-English-Code-mixed-Sentiment-Analysis).

In this section we go over our text processing, feature extraction, and data representation methods, as well as our model. In the text processing step we aim to create a vectorized representation of the textual input so that the data can be fed into our machine learning model. To do so we take the following steps:

1) Finding the non-Persian words in the sentence: By the definition of our task we know that the text we are processing includes non-Persian words; however, we do not know which words in the sentence they are. Knowing this is useful because it allows us to use methods such as translation to convert these words back to their original language. To find them, we use a dataset of Persian words collected from Wikipedia. The huge collection of articles available on Wikipedia makes it likely that the most frequent words of the language are in the list. We then look up every word of the sentence in this list, and any word that is not found is added to our non-Persian word candidates.
2) Translation: Next we translate the non-Persian words. First, we use an automatic tool, Yandex. This tool, however, has difficulty translating Twitter-specific slang and expressions. Thus, we use lists of common Twitter expressions to create a dictionary of our own.
3) Embedding creation: To create an embedding for our textual data we used the pretrained multilingual BERT [26] model that Google has provided on their GitHub page.
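
The first two preprocessing steps (vocabulary lookup and translation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the tokens are romanized stand-ins for Persian-script words, the lookup order (hand-built dictionary first, machine translation second) is an assumption since the text does not say how the two sources are combined, and the machine translation call is represented by an arbitrary callable rather than the Yandex service.

```python
from typing import Callable, Dict, List, Optional, Set


def find_non_persian_candidates(tokens: List[str], persian_vocab: Set[str]) -> List[str]:
    """Step 1: tokens missing from the Persian word list become candidates."""
    return [tok for tok in tokens if tok not in persian_vocab]


def translate_candidates(
    tokens: List[str],
    persian_vocab: Set[str],
    slang_dictionary: Dict[str, str],
    machine_translate: Optional[Callable[[str], str]] = None,
) -> List[str]:
    """Step 2: replace each candidate token with a translation.

    Assumed order: the hand-built dictionary is consulted first; otherwise the
    optional machine translation callable is used; otherwise the token is kept.
    """
    candidates = set(find_non_persian_candidates(tokens, persian_vocab))
    output = []
    for tok in tokens:
        if tok not in candidates:
            output.append(tok)  # known Persian word, left untouched
        elif tok in slang_dictionary:
            output.append(slang_dictionary[tok])
        elif machine_translate is not None:
            output.append(machine_translate(tok))
        else:
            output.append(tok)  # no translation available
    return output


# Toy data: romanized stand-ins, not entries from the real word lists.
persian_vocab = {"in", "film", "bud"}
slang_dictionary = {"perfekt": "perfect"}
tokens = ["in", "film", "perfekt", "bud"]

print(find_non_persian_candidates(tokens, persian_vocab))             # ['perfekt']
print(translate_candidates(tokens, persian_vocab, slang_dictionary))  # ['in', 'film', 'perfect', 'bud']
```

Keeping the vocabulary in a set makes each lookup constant time, which matters when the word list is as large as one derived from Wikipedia.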
Because this BERT model is multilingual, it can create embeddings for words from more than one language, which fits our problem well, since code-mixed data includes words from more than one language. Furthermore, since the model uses subword tokenization and embedding, it could allow for a better understanding of slang or other uncommon words that are made up of better-known subwords.

After our data passes these steps, it is fed into an ensemble model consisting of three Bidirectional Long Short-Term Memory (Bi-LSTM) networks.
• Our second model adds the attention mechanism to the Bi-LSTM network, which enables the model to pay more attention to the most important words in the sentence. The attention mechanism also helps with the encoding of long-distance dependencies and information.
• Another method we use to make sure that long-distance dependencies are accounted for and that information is not lost is the use of pooling layers in our third model.

The ensemble takes the outputs of all three models and uses a weighted average to produce the final output. To find the best weight for the output of each model, we use the optimization algorithms offered by SciPy [27].

We used 10-fold cross-validation and averaged the metrics in order to present more reliable values. Additionally, as a basis for comparison, naive Bayes and random forest models were also trained on the data. The results are presented in Table II. We can see that our ensemble model outperforms the baseline models on all metrics. Through our experiments, we find that the attention and pooling mechanisms both help the performance of the models. We further find that the combination of all three models offers better performance than each individual model, as each model appears to make up for the shortcomings of the other models in the ensemble.

In this study we presented a Persian-English code-mixed dataset.
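
The ensemble weighting described earlier (a weighted average of the three model outputs, with weights found by a SciPy optimizer) could be sketched as follows. The text only states that SciPy's optimization algorithms were used; the choice of `scipy.optimize.minimize` with the SLSQP method, the mean squared error objective, the constraint that the weights are non-negative and sum to one, and the name `fit_ensemble_weights` are all assumptions made for this example. The model outputs are synthetic.

```python
import numpy as np
from scipy.optimize import minimize


def fit_ensemble_weights(probabilities: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Find weights for a weighted-average ensemble.

    probabilities: shape (n_models, n_samples, n_classes), per-model class scores.
    labels: shape (n_samples,), integer class ids.
    Returns weights in [0, 1] that sum to 1 (assumed constraints).
    """
    n_models, _, n_classes = probabilities.shape
    one_hot = np.eye(n_classes)[labels]

    def loss(weights: np.ndarray) -> float:
        # Weighted average of the model outputs, scored by mean squared error.
        blended = np.tensordot(weights, probabilities, axes=1)
        return float(np.mean((blended - one_hot) ** 2))

    result = minimize(
        loss,
        x0=np.full(n_models, 1.0 / n_models),  # start from a plain average
        method="SLSQP",
        bounds=[(0.0, 1.0)] * n_models,
        constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],
    )
    return result.x


# Synthetic stand-ins for three models: true labels blurred by different noise levels.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=200)
one_hot = np.eye(3)[labels]
probabilities = np.stack([
    np.clip(one_hot + rng.normal(0.0, scale, one_hot.shape), 1e-3, None)
    for scale in (0.4, 0.6, 0.8)
])
probabilities /= probabilities.sum(axis=2, keepdims=True)

weights = fit_ensemble_weights(probabilities, labels)
print(weights.round(3))
predictions = np.tensordot(weights, probabilities, axes=1).argmax(axis=1)
```

In practice the weights would be fitted on held-out predictions within each cross-validation fold, so that the reported metrics are not influenced by the data used to choose the weights.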
The dataset consists of 3,640 tweets collected through the Twitter API. Each tweet was subsequently labeled with its corresponding polarity score. We then used neural classification models to learn the polarity scores of these data. Our models employed Yandex and dictionary-based translation to translate the code-mixed words in our texts. We further used pretrained BERT embeddings to represent our data. Our models reached an accuracy of 66.17% and an F1 score of 63.66 on the data. Future work could focus on other methods of dealing with the code-mixed words, or on ways to derive word-level polarity scores for our code-mixed words from sentence-level scores.

Character level embedding with deep convolutional neural network for text normalization of unstructured data for Twitter sentiment analysis
Sentiment analysis of Twitter data using machine learning approaches and semantic analysis
Normalization of Indonesian-English code-mixed Twitter data
Proceedings of the Workshop on Noisy User-generated Text
Preparing Bengali-English code-mixed corpus for sentiment analysis of Indian languages
LIMSI UPV at SemEval-2020 task 9: Recurrent convolutional neural network for code-mixed sentiment analysis
Sentiment analysis of code-mixed Indian languages: An overview of SAIL code-mixed shared task @ ICON-2017
Overview for the first shared task on language identification in code-switched data
SemEval-2020 task 9: Overview of sentiment analysis of code-mixed tweets
SemEval-2020 task 9: Overview of sentiment analysis of code-mixed tweets. arXiv e-prints
IIT Gandhinagar at SemEval-2020 task 9: Code-mixed sentiment classification using candidate sentence generation and selection
Towards sub-word level compositions for sentiment analysis of Hindi-English code mixed text
Towards sub-word level compositions for sentiment analysis of Hindi-English code mixed text
Sentiment analysis of mixed language employing Hindi-English code switching
Understanding language preference for expression of opinion and sentiment: What do Hindi-English speakers do on Twitter
An ensemble model for sentiment analysis of Hindi-English code-mixed data
Sentiment analysis of code-mixed Bambara-French social media text using deep learning techniques
A sentiment analysis dataset for code-mixed Malayalam-English
Corpus creation for sentiment analysis in code-mixed Tamil-English text
Sentiment identification in code-mixed social media text
Sentiment analysis for Hinglish code-mixed tweets by means of cross-lingual word embeddings
Automatic normalization of word variations in code-mixed social media text
LinCE: A centralized benchmark for linguistic code-switching evaluation
Mapping the political landscape of Persian Twitter: The case of 2013 presidential election
Pre-training of deep bidirectional transformers for language understanding