key: cord-0788434-88sya7fi
authors: Pradhan, Rahul; Sharma, Dilip Kumar
title: An ensemble deep learning classifier for sentiment analysis on code-mix Hindi–English data
date: 2022-04-23
journal: Soft comput
DOI: 10.1007/s00500-022-07091-y
sha: f74fc9a514fd9b8967e6469841b395ec876c13ec
doc_id: 788434
cord_uid: 88sya7fi

Code-mixing on social media is a trend in many countries where people speak multiple languages, such as India, where Hindi and English are major communication languages. Sentiment analysis is beneficial in understanding users’ opinions and thoughts on social, economic, and political issues. It eliminates the manual monitoring of each and every review, which is a cumbersome task. However, performing sentiment analysis on code-mix data is challenging, as it involves various out of vocabulary terms and numerous issues, making it a new field in natural language processing. This work includes dealing with such text and ensembling a classifier to detect sentiment polarity. Our classifier ensembles a multilingual variant of RoBERTa and a sentence-level embedding from Universal Sentence Encoder to identify the sentiments of these code-mixed tweets with higher accuracy. This ensemble optimises the classifier's performance by using the strength of both for transfer learning. Experiments were conducted on real-life benchmark datasets and revealed their sentiment. The performance of the proposed classifier framework is compared with other baselines and deep learning models on five datasets to show the superiority of our results. Results showed improved and increased performance in the proposed classifier's accuracy, precision, and recall. The accuracy achieved by our classifier on code-mix datasets is 66% on Joshi et al. 2016, 60% on SAIL 2017, and 67% on SemEval 2020 Task-9 dataset, which is on average around 3% as compared to contemporary baselines.

Since the inception of e-commerce websites and social media, many of us, especially introverts, also got a platform to express ourselves freely. The anonymity of social media makes us write about the grievance towards the recently purchased product. We can aggressively express the grievance on these platforms. All these reviews and posts are not possible for some digital marketing teams to read and generate a report for targeted advertising. Here comes the sentiment analysis that can automate the whole process. There is a dire need to consider reviews or opinions while strategizing marketing schemes for brands and designers while updating or generating new designs. Even governments are also using social media to assess the success of policies in the time of covid-19 and social distancing (Pradhan and Sharma 2022) . This constant demand led to an increase in the interest of all research fraternity towards sentiments and their analysis (Park et al. 2021) . Most FMCGs (Fast Moving Consumer Goods), sellers, and analysts depend on sentiment analysis of reviews/posts to gauge their product feedback overall and on various aspects of that product (Pradhan and Sharma 2021) . Consumers nowadays visit various product pages before going to the showroom; for instance, a car buyer will go and visit the car manufacturer's web page and check simulations (Alattar and Shaalan 2021) . Some manufacturers provide simulation with augmented reality for soundcheck pick up and other features (Ke et al. 2021) . The buyer at the end relies on reviews; the language used in these reviews could Communicated by Priti Bansal. be English (Wang et al. 2021) . Various stock market traders use news articles to predict a stock's price (Qiu et al. 2022) ; automatic analysis of related news is helpful for them (Januário et al. 2022) . In this area, multimodal analysis of news and events is also explored. The field of sentiment analysis is not limited to text, but all forms of media are considered, such as audio and video (Wang et al. 2022) .

The major challenge with social media text is informality that encourages us to use these platforms more and more (Kern et al. 2016 ). This informality is not limited to social media but also applies to reviews posted on e-commerce websites. Users mostly write some reviews in their language that has full-on slang, hence making processing these textual data non-trivial (Yadav and Chakraborty 2021) . Often, users tend to mix the languages, especially in countries where people read two or more languages in their schools. For example, as in India, where half of the population has Hindi as their mother tongue, they also use other languages such as English. Due to informality and anonymity on social media, people often switch between these languages (Banerjee et al. 2016 ). This swapping of languages between the conversation is known as code swapping (see Fig. 1 ).

Since India is a vast country and in recent years with the advent of cheap and affordable internet packs, many Indians have access to mobiles. They now write a post on Facebook, post comments on a YouTube video (Chakravarthi et al. 2021) , write a tweet on Twitter, and many more activities where users extensively use code-switch. This code-switch is also prevalent in these platforms because users implicitly assume the ambience of informality and are easy-going. Therefore, this type of switching is assumed to be relaxed on these platforms and give oneself more acceptance and visibility.

This code-switching in text poses many challenges while processing, such as entity recognition, sarcasm detection, irony detection, and many more. Dowlagar and Mamidi (2021) discuss various issues and challenges in processing code-mix in South Indian Dravidian languages. The task of extracting knowledge from these code-mixed text is not trivial because this text contains many spelling variations, as users tend to write other languages using the script of another language. For instance, we can write the Hindi word ''pooja'' in multiple variations based on sounds such as ''puja'', ''poojaa'', ''puja'', and many more. Another challenge is informality in these texts, such as usage of ''nyt'' in place of ''night'', or ''tc'' in place of ''take care'', and many more. This mixing of code adds up more to the problem of data sparsity. Correctly compressing all words to a single word to minimise the dictionary is tedious. Users often don't adhered to grammar in such texts (see Fig. 3 ).

Various terms have similar sounds but have different meanings in different languages, i.e. they should belong to separate scripts. For example, ''mutter'' in English means ''whisper or talking softly'', while in Hindi, it means ''pea'', a vegetable. Table 1 shows a list of such words that are common in both languages.

We have elaborated this task for Non-Hindi speakers using Table 2 . This first column contains the text in codemix language (in Hindi & English). The text in first column will show how messages are posted using mixing words from different languages. In contrast, the second column of Table 2 shows the English transliteration of this text so that non-Hindi speakers can understand the sentence. The third column shows the polarity of sentiments. The sentence used in the first instance with language tags is shown in Fig. 2 .

The contribution of the carried research work is listed below:

• An ensemble classifier is proposed using a multilingual variant of RoBERTa (XLM-R) and Universal Sentence Encoder word embedding to detect the sentiment polarity in code-switch data.

• Generate encoding by XLM-R for short text sentences and sentence embedding vector for Universal Sentence Encoder (USE). This approach will help us deal with data sparsity and save time for tedious feature engineering in code-switch texts.

• Tweets encoded into a vector of 768 lengths using XLM-R, and another 512 length encoding of these tweets generated through Universal Sentence encoder for vector space.

The results are compared and evaluated with other baselines to show that our model helps deal with the The remaining sections of the paper are organised as follows. A literature review of existing approaches is discussed in Sect. 2. This section will concentrate on approaches developed in past for targeting code-mixed data and for languages other than English. This section also discussed the Part of Speech tagging on Indian and other European languages. Following Sect. 3, we discuss various steps to deal with informality, noise in data, and the proposed approach. Section 4 contains the information about the dataset we consider in this work. In Sect. 5, we will discuss the performance measures of our approaches. In the last section (Sect. 6), we will conclude our work. An ensemble deep learning classifier for sentiment analysis… 2 Related work Sentiment analysis (SA) is one of the essential tasks in NLP since it is majorly explored on data from newer domains such as social media or customer reviews (Liu 2015) . Poria et al. (2020) explored the field of sentiment analysis and listed down future directions of sentiment analysis such as Aspect-Based SA, multimodal SA, and many more; one of the future directions prominently discussed was Multilingual SA, which mentions sub-tasks such as:

• Handling code-switching,

• Creating and maintaining lexicons for code-switch data, and

• Machine translation will preserve text sentiment when translated to a single language.

Many shared tasks have been organised to draw the attention of various researchers towards this field of SA in code-mix data. In code-mix data, various tasks explored are listed by Thara and Poornachandran (2018) . Their list includes Language Identification, which is the most popular task on this type of data, as this task accuracy will be the base of further tasks. Other tasks include entity extraction, question classification, generating code-mix data for training models, topic modelling, detecting switch points, POS tagging, and many more. Sentiment analysis can be divided into various types of approaches employed to predict sentiment polarity. Predicting sentiment analysis using the lexicon is very popular as this proved to be fast and easy. Yin et al. (2020) developed context-dependent part of speech chunks to disambiguate the sentiment of words that have different sentiment values in a different context. Their approach measures the context-aware sentiment polarity and tries to find the intensity to identify the trigger portion of the sentence for that particular sentiment. Aspect-based sentiment analysis is said to be the future in the field of context-dependent or context-aware approaches. Gui et al. (2022) proposed model that can predict sentiment and detect latent topic hidden in text; they try to explore topic evolution and sentiment variations.

Deep learning models with attention layers are employed to perform this; Zhou et al. (2020) proposed a CNN variant, a deformable convolution neural network incorporated with an attention mechanism (Fig. 3) . They have also used Bi-LSTM to extract local features and preserve long-distance dependency. Recurrent neural networks have also been popularly used in sentiment analysis. Aydin and Gungor (2020) used RNN to detect features. The framework they have given uses both recurrent and recursive neural models. They use user reviews, the constituency, and the dependency parser to divide the review into sub reviews for each feature review discussion and then build a neural network from these parse reviews.

Adversarial training and generating data to deal with data sparsity have been explored and Pandey et al. (2021) proposed AugmentatGAN an generative adversarial network for generating coherent syntactic sentences for deep learning models. They have tested their proposed model on English, Hindi, and Bengali data. Lin et al. (2022) use multichannel word embedding which is a combination of traditional vectorisation methods in combination of their variant of BERT for generating story. Such models are helpful for generating data for adversarial training. The intensity of sentiment emotion makes the problem of sentiment analysis multiclass other than positive, negative, and neutral; newer classes had been added based on underlying emotion and their intensity; Akhtar et al. (2022) proposed an ensemble framework to provide more clarity on sentiment and emotion in text.

We can classify the research based on the type of models in particular categories.

There are various models present in sentiment analysis, but the state-of-the-art performance showed by various in recent times masked language models worth mentioning. We will be discussing a few of such notable models in this subsection.

Pretrained models are classified into casual language models and masked language models (Naseem et al. 2021 ). These both have been used in the multilingual domain. Masked language models take data in which few tokens are masked, they act like fill in the blanks, and the model needs to predict the appropriate word for that blank. Few notable masked language models are discussed below.

XLNet (Yang et al. 2019b ) was used for a new task in BERT (Devlin et al. 2019) to predict the bi-direction context. XLNet was designed to deal with the issue of pretraining fine-tuning discrepancies. It uses permutation language with a self-attention mechanism to consider both contexts.

ALBERT (Lan et al. 2019 ) is the modification of BERT (Devlin et al. 2019) to deal with many parameters, creating issues while pretraining. This model was designed to scale the pretraining by splitting the vocabulary matrix into two small matrices and adopting the sentence-order prediction loss over the subsequent sentence prediction loss to shift the focus on inter-sentence coherence.

RoBERTa (Liu et al. 2019 ) is a variant of BERT (Devlin et al. 2019) , wherein the algorithm focuses on hidden sections of the text. RoBERTa results from changes or modifications done in BERT to improve its accuracy for masked language models. RoBERTa, implemented in PyTorch, uses BERT with key hyperparameters, training in smaller units and learning rates, and removing BERT's following sentence pretraining objective. RoBERTa is tested on the following datasets MNLI, QNLI, RTE, STS-B, and RACE and performs better than its contemporise state-of-the-art methods. RoBERTa is trained on unannotated datasets and news articles. RoBERTa beats its parent BERT in the GLUE platform and last time leader XLNetlarge.

DistilBERT (Sanh et al. 2020 ) is a refined version of BERT. It is popularly known as a distilled version of BERT because it has reduced the size of BERT while being faster and more efficient.

Other masked language models, such as UniLMv2 (Bao et al. 2020) , are autoregressive language models. This model is based on learning interrelationships between corrupted tokens and context. This model used pseudomasks to learn relations between the masked span of tokens.

Masked language models form new milestones in sentiment analysis and create a breakthrough in various related tasks such as sarcasm detection, hate speech, multilingual data. Shin et al. (2020) explore these models find that prompt-based mask models are better than pre-train finetuning-based masked language models.

Divate (2021) considers online Marathi news and performs sentiment analysis on these news articles. Although news articles are monolingual, many challenges between the Marathi language and code switch Hindi-English are same. Their proposed work aimed to stop the spread of news that can cause negativity or depression in society. They have used knowledge-based and machine learning-based approaches while implementing their work. These approaches are a core component of their work. This work is significant as Marathi is India's third most spoken language. They have achieved an accuracy of 72% using the LSTM model. Name Entity Recognition (NER) is the task that helps to find the nouns that could represent an entity. Priyadarshani et al. (2020) had performed NER on code-mix data. They have utilised meta-embedding and a transformer with conditional random field (CRF) to locate named entities in code-mix data. Since the usage of CRF, they have also processed data written in romanised script and in native script.

Another set of tasks related to sentiment analysis and could be used to increase the accuracy are word sense disambiguation (Choi et al. 2017) , sentiment reasoning (Alattar and Shaalan 2021) , and sentiment analysis of live comments in video and tagging the timestamp to track what portion of video trigger the sentiment. Many tasks ranging in various languages will design the future of sentiment analysis and opinion mining.

Urdu is one of the popular languages spoken in the Indian subcontinent; Qureshi et al. (2022) have explored Urdu reviews written in Roman Script. They have explored YouTube comments in Romanised Urdu on various Machine Learning models. They also released their dataset by the name DRU of 24,000 reviews.

Similarly, religious text snippets and their impact were analysed while creating their short or summarised versions, one such work done by Chandra et al. (2022) . They have performed semantic and sentimental analyses on the translation of Bhagavad Gita using BERT.

When dealing with sentiment analysis of code-mix data, CNN and BiLSTMs are usually the most popular choices. Kumar et al. (2019) use CNN architecture for sub-word representation and use encoding of BiLSTM for detecting the sentiment analysis of Hindi-English code-mix data. Jhanwar and Das (2018) use the dataset proposed by Joshi et al. (2016) to evaluate the model; they employed LSTM and Naïve Bayes Multinomial Classifier to perform the sentiment analysis. They ensemble both by multiplying the probability from these classifiers and then taking the one with the highest combined probability. They use the variant of LSTM that has tri-gram for exploring the sequential patterns in text that is lengthy and Multinomial Naive Bayes (MNB) with ngram for Polarity of keywords using probability. This approach helps in dealing with sparse and inconsistent data. Jamatia et al. (2020) proposed deep learners for sentiment analysis. They evaluate their work on both benchmark datasets on Patra et al. (2018) and Joshi et al. (2016) . Deep learners proposed in Jamatia et al. (2020) are BiLSTM-CNN, DoubleBiLSTM, and an Attention-based model. They compare the performance of their proposed models with SVM, MNB, and Random Forest. The performance of their deep learners outperforms the traditional baselines. They also compared the work with BERT (Devlin et al. 2019 ) and the mBERT multilingual variant of BERT.

Jain and Batra (2015) proposed a model that can deal with label-scare language using the knowledge of languages that are label rich. Here label-rich and label-scare are based on benchmark dataset available or not. They have used recursive autoencoder architecture for developing cross-lingual sentiment analysis. They have developed a movie review dataset for Hindi based on rating and a framework for Cross-Lingual Classification tasks using an existing model BRAE (Zhang et al. 2014) . They have exploited the fact that semantic meaning is shared between languages to learn independent semantic vector representations. They have fine-tuned the embeddings using labelled datasets in English for using this embedding for resource-scare languages. Liu et al. (2020) use the domain transfer learning from ERNIE, a uni-language model and XLm-R (Conneau et al. 2019 ) from Facebook. They perform adversarial training to achieve accuracy using these misclassifying texts. In addition, they used various word embedding to enhance the sentiment polarity detection. Gupta et al. (2021) also handle the problem of low resource languages, and they consider the issue of the Hindi Language as prime. They discussed the issues with the Hindi language, such as spelling variations due to many dialectics, co-reference resolution, and many more. They have explored many machine languages-based models such as Naïve Bayes, support vector machine, decision tree, and logistic regression for performing sentiment analysis on Hindi. They used a dataset having tweets collected from Twitter. All crawled tweets are in the Hindi language. Their proposed model uses a lexicon-based approach using a Domain-specific sentiment dictionary. Their proposed model is a mixture of RNN and LSTM.

Bi-LSTM and ensemble classifiers are extensively used for predicting sentiment in code-mix data. For example, Yadav et al. (2020) use the ensemble approach using multinomial Naïve Bayes, support vector, stochastic gradient descent and linear regression with soft voting to aggregate the result. They observe that Bi-LSTM is consistent in providing the same scores for precision, recall, and f1-measure for each sentiment polarity and ensemble classifier is best in average accuracy.

There are various issues while evaluating the code-mix data most of the time. Hence, to create a dataset Ranjan et al. (2016) used a recurrent neural network-based language model for language identification. They have also used code mixing index (CMI) (Gambäck and Das 2014) for measuring the level of code-mixing in their dataset. This CMI uses the word-level language identification information for detecting the level of code-mixing. Rajan et al. (2016) proposed dataset and targeted Tamil-English code-mix data on Twitter and Facebook. For Measuring the complexity and level of code-mixing, various researchers have designed various metrics, from which the most popular one was CMI that we have already discussed by Das and Gambäck (2015) ,

Here in the above-given metrics w l represents the word w belong to language l; n is a number of tokens; and u represents the count of independent tokens; over here independent tokens are those that have the name of entity or hashtags, which do not change with language. Lower the value of this CMI indicates that text is almost written in one language, while the higher value of CMI indicates that text has a considerable amount of code-mixing. Srivastava and Singh (2021) have listed down various metrics that help measure the code-mixing. Their list includes CMI (Das and Gambäck 2015; Gambäck and Das 2014) , M-Index also known as Multilingual Index (Barnett et al. 2000) , I-index (Guzmán et al. 2017) , Burstiness (Goh and Barabási 2008) , and Memory (Goh and Barabási 2008) . Joshi et al. (2016) introduced Hindi-English code-mix dataset that can be used to evaluate the coming algorithms. They have targeted Facebook as their social media platform and use the comments on Facebook pages of celebrities. They have selected two celebrities who have great names in the Indian population, especially in the Hindi belt of northern India, i.e. Salman Khan, a Hindi film actor and another Prime Minister of India, Mr Narendra Modi. Both of these pages have more than 40 million likes. This dataset (Prabhu 2021 ) is very apt for this task as celebrity choices attract many users and have various comments on their posts in mixed languages that make tested algorithms robust. They have used subword representation rather than complete word or character level representations. They use the value of sentiment of meaningful morphemes. These values make it fit for noisy data that have misspellings and many spelling variations.

Besides Indian languages, Lee (2012) presented a dataset of code-mix data of Hong Kong speeches. This dataset had code-mixing between Cantonese and English words. This dataset was developed to explore the penetration of code-mixing between these two languages in Hong Kong Society.

This section presents the proposed framework for sentiment analysis of code-mix data. Our proposed framework had three main components a sentence encoder (see Sect. 3.1), a transformer encoder (see Sect. 3.2) and a dense neural network (see Sect. 3.3).

Since its inception in sentence embedding, universal sentence encoder (USE) (Cer et al. 2018 ) has been most prevalent. USE provides a sentence embedding model that converts the sentence into its corresponding vector representation. The best part is that despite converting to vectors, this pre-train embedding model keeps the semantic features of sentences intact. This helps in many natural language processing tasks such as classification, clustering, and many more. The prime reason for considering this encoding is that we need not require a lot of labelled data to train our classifier further. Yang et al. (2019a) released a multilingual variant of universal sentence encoder that supported 16 languages. This multilingual model is based on CNN and transformer encoder. Figure 4 depicts a simple sentence encoding process, and it shows how sentences are embedded into numerical vectors of fixed length. This embedding process helps in dealing with sparse and various textual information in vector space.

Few popular Sentence embeddings such as tf-idf, doc2vec (Le and Mikolov 2014) and InferSent (Conneau et al. 2018 ) are used widely. Sentences are converted into embedding and their semantics into vectors. Sentence embedding helps in dealing with the understanding of context and disambiguation. Various of these sentence embedding algorithms are built with the view that they can be used for various tasks. Major popular tasks are sentiment analysis, generating conversation, and inference. Figure 5 shows the architecture of USE as depicted by Cer et al. (2018) . This architecture contains two variants, deep averaging networks and transformer encoder, for generating a fixed-length vector of 512. Transformer encoder provides great accuracy but utilises more resources, while deep averaging network consumes low recourses, but here accuracy is compromised.

Universal Sentence Encoder had two encoder. Firstly, all preprocessing such as case-folding and tokenisation of sentences is performed in Fig. 5 . Secondly, these tokens send for encoding. One of the encoders has 6 layers with a self-attention mechanism for context-aware embedding. The other is the averaging network that uses the unigram or bigram embeddings and passes to a deep neural network averaging to an embedding. Finally, we get a 512 dimension long vector of sentence embedding as output. Devlin et al. (2019) presented Bidirectional Encoder Representations from Transformers (BERT). This transferbased learning model has become very popular in the field of natural language processing (NLP). It becomes so popular that researchers have started using it in various domains of NLP, such as sentence inference, classification, sentence generation, and many more. The core idea behind designing BERT was to transfer bidirectional representations from text utilising both right and left context in all layers of deep learning techniques. These representations are trained from unlabelled textual data.. Figure 6 depicts the architecture of xlm-RoBERTa as described in Conneau et al. (2019) Table 3 shows the architecture with each layer where input 11 and input 12 are the encoding layers for each XLM-RoBERTa and USE, respectively. These layers will generate the fixed-length vectors of length 768 512, respectively. We further ensemble the output of the classifier using the dense neural network. The activation function of tanh is used for this layer of dense neural network (DNN). Table 3 has the architecture details of DNN. Softmax is used to convert the vector of the dense neural network into a probability distribution. Figure 7 summarises the complete framework of our system and ensemble. 

In this section, we will be discussing the dataset we have used to test our proposed approach. We have employed five datasets from various shared tasks, conferences, research articles, and competitions.

SemEval 2013 Task 2 (Nakov et al. 2019 ) is a dataset developed and annotated using Amazon Mechanical Turk to evaluate and analyse the performance of various sentiment analysis algorithms. This dataset has two subtasks: one for expressions and the other for messages. Data were collected from Twitter from January 2012 to January 2013. We have utilised the dataset training portion from a GitHub repository (Rios 2022) , this repository had 8018 tweets, and their distribution among various sentiment polarity is shown in Fig. 8. 

A shared task on SAIL was conducted earlier in 2015, but it was not on code-mixed languages. That task includes three languages, namely Hindi, English, and Tamil Patra et al., (Patra et al. 2015) . Dataset (Patra et al. 2015) had 4849 tweets of three languages, out of which 1688 are written in Hindi, 1498 are in Bengali, and the remaining 1663 are in Tamil. Sentiment polarity-wise distribution is found in Figs. 9 and 10. Figure 9 displays the distribution of tweets language-wise, and we can observe that the dataset is not balanced concerning languages. It had slightly more tweets on Hindi than in Tamil, and this difference is vast enough when Bengali is considered.

Sentiment polaritywise distribution of tweets in the dataset is shown in Fig. 10 . In all figures and tables, it is evident that negative tweets are way more than positive and neutral tweets. Data were collected by Patra et al. (2015) using Twitter (2022) , and then, they manually segregated the data into groups. This dataset is manually annotated. They have also kept the information of nonverbal features such as smiley in the dataset. Joshi et al. (2016) had collected a dataset for code-mixed text. This dataset was built on Facebook pages, using the page of Bollywood Actor ''Salman Khan'' and Indian Prime Minister ''Narendra Modi''. They have collected the comments from these pages and then manually filtered the comments that had code-mixing. Their filtering is based on three criteria as follows: Comment not more than one sentence long as they assume sentiment might change from sentence to sentence.

• Comments that are not written in Roman scripts.

• Comments do not have code-mixing; if they are entirely written in one language, they were discarded.

This dataset had 3879 tweets, with 7549 being vocabulary terms. This dataset is also labelled; this annotation is done manually. Table 4 shows the proportion of each sentiment polarity in the dataset. Figure 11 shows the pictorial representation of Table 4 concerning the dataset used by Joshi et al. (2016) . In Fig. 11 , we can clearly infer that most tweets are from sentiment polarity neutral; this results in an imbalance concerning other sentiment polarities, i.e. positive and negative. This is quite similar between the above two datasets containing Hindi-English pairs, with way more text having neutral polarity than any two sentiment polarities. However, this might be because users, when in an extreme state of mind like anger or happiness, tend to express their feelings in haste and choose to write in their mother tongue. However, one can argue this could be the other way round too in a hurry, one unable to choose the words correctly from the vocab and mix the words from multiple. Patra, Das, and Das (2018) had published their dataset in a competition SAIL Code Mixed (SAIL Code Mixed 2022) during ICON (2017) . Dataset is released with the name SAIL_Code-Mixed (SAIL Code Mixed 2022). This dataset is for the sentiment analysis task in tweets for two sets of 2 Indian languages code-mixed, namely Bengali-English (BE-EN) and Hindi-English (HI-EN). The main task is to classify the tweets in a dataset into positive, negative and neutral polarity. Dataset is available at the link (SAIL Code Mixed 2022) for download. Patra, Das, and Das (2018) were also based on tweets collected from Twitter using common words from Hindi and Bengali. This collection then filtered for only tweets that have code-mixing of languages is done. This dataset SAIL 2017 is tagged with language information based on a dictionary which they have manually developed and with sentiment polarity. For language tags, they have used the following tags:

• HI for Hindi

• UN for Universal such as hashtags, symbols and many more.

• MIX for words that are a mix of two languages, such as ''Khaing'' where ''Khana'' is eating in Hindi while ''ing'' is used for making present continuous. Figure 12 displays the proportion of tweets in both set of language pair in SAIL 2017 Code-Mix dataset. One can observe that the collected data are not in the same proportion; there are more Hindi-English Tweets than Bengali English, although, in our work, we have considered only Hindi-English pair. Table 5 shows the number of available tweets in both language pairs in Patra et al. (2018) .

For further exploration of the language pairs, Figs. 13 and 14 show the proportion of each sentiment polarity, i.e. a number of tweets having sentiment positive, negative or neutral for both pairs, Hindi-English and Bengali-English, respectively. In the first pair, neutrals are almost double in number from negative and a little less than double compared to positive tweet counts. Such a difference in the count of members of each class makes the classification algorithm overfit for a certain class. Although since we are also comparing and training our proposed approach on other datasets to avoid such issues (see Fig. 13 ). In Fig. 14, showing the data sentiment polarity distribution in Bengali-English, it is also as imbalanced as the earlier pair of this dataset. However, here data are approximately equal, and the difference between the tweet count in each sentiment class is not wide enough to cause an issue like in the Hindi-English pair languages.

4.5 SemEval-2020 Task 9 Patwa et al. (2020) is the third dataset we have used in this work. This dataset is most recent and released during the International Workshop on Semantic Evaluation (2022; . Dataset available at (2022) for download. The task during which this dataset is released is named as SentiMix (CodaLab and Competition 2022). This task is about the prediction of sentiment in code-mixed data. The data used in this dataset are crawled from Twitter (2022) . This dataset is also labelled into three sentiment polarity positive, negative, and neutral. The tweets crawled during the formation of this dataset are code-mixed in two languages: English-Hindi and English-Spanish. Other than each tweet's sentiment polarity, each of these tweets also contains information about language identification at the word level. That is, each term used in tweets is labelled with the language from which it is used as shown in Fig. 16 , tweet shown in Fig. 16 is segregated into multiple lines, and the first line is information. Sentiment polarity and its ID, i.e. sentiment polarity, are "negative'', and tweet id is ''23". One can have observed tags corresponding to each token in a tweet, like O for others, Eng for English, and Hin for Hindi. "Hin" means that token whose corresponding language written in Hindi belongs to language Hindi if O, i.e. others mean it might be smiley or some symbol that has no association with any of two languages in the set. Other tags used in the dataset are ''mixed'', that is, that word belongs partially to both languages. Such words are formed by mixing features of both languages. For example, ''pakauing'' in this work ''pakau'' is from Hindi means boring, while ''ing'' is added to make it a doing verb. Another interesting tag used is ''fw'' which denotes words that do not belong to any language, ''unk'' for unrecognised words. Table 6 shows that the dataset is quite balanced in terms of tweets from both sets of languages; if we are concerned about Hindi-English then Hinglish set contains Twenty thousand tweets while in Spanish-English (Spanglish) contains 18,789 tweets. Figure 15 contains three subfigures; the middle explores the 2 key points about data. Firstly the number of tweets or short texts contributes to both language set. This will not be a wrong inference that the dataset is almost balanced in terms of both set of language pairs; the number of tweets belonging to each language set is almost equal. Secondly, in Fig. 15b we can see the proportion of tweets belonging to each sentiment polarity. We can clearly see that the Spanglish number of positive tweets is more than the other two sentiment polarities, so there is a significant difference in the number of tweets belonging to each sentiment polarity. Tables 7 and 8 show the distribution of tweets for each sentiment polarity in both language pairs. Table 7 shows the actual numbers for each sentiment polarity and for both pairs of languages. Table 8 and Fig. 15 discuss the proportion in which tweets of each language pair are available for each sentiment polarity. Table 8 shows the An ensemble deep learning classifier for sentiment analysis… proportion in percentages, while Fig. 15a and b show the proportion of each sentiment polarity for Hinglish and Spanish, respectively (Fig. 16 ).

The evaluation metrics used in this paper are precision (P), recall (R), F 1 score (F1), and accuracy. These metrics will help us compare and evaluate the performance of classification tasks. The formulas for respective metrics are given below:

As (Conneau et al. 2019) have already discussed in their research, the XLM-RoBERTa is superior in performance compared to its other contemporaries. Combined with the ensemble approach, it gives better results than proving the claim (Conneau et al. 2019 In each set of code-mixed languages, the chart in the middle shows the variation in the number of tweets per codemixed languages set and the variation in sentiment polarity between these sets (Joshi et al. 2016; Patra et al. 2018) and SemEval-2020 Task 9 (CodaLab and Competition 2022).

We have presented results in Tables 9, 10, 11, 12 and 13 for three sentiment polarities: positive, negative, and neutral. We have used the formula in Eqs. 1,2, and 3 for evaluation metrics and evaluation on each sentiment class.

We can observe in the above tables that the accuracy of our proposed framework is higher than the other baselines. Our proposed framework had beaten the other models in all five sets, and it also performed well on Neutral class. When we observe results on SAIL 2017 (Patra et al. 2018) , in that case, our proposed model works better in terms of accuracy. It had outperformed the other models in F1-Score in positive, negative, and neutral classes. F1-Score is the harmonic average of precision and recall. This parameter helps in maintaining a trade-off between precision and recall. Precision is the ratio of the number of times we classify something correctly to the number of times tweets classified by classifier belong to that particular class. While the recall is the number of times, the classifier classifies different texts to a particular class divided by the number of tweets belonging to that particular class in the dataset. If we classify everything to that one particular class, then recall will be one for this particular class. We need an F1score that helps maintain the balance between precision and recall.

As the internet becomes cheaper and with gradual penetration of 5G in various markets in the Indian Subcontinent, there is a considerable demand for tools and frameworks that can analyse social media texts, tweets, and comments to understand the needs and trends in society better. We have performed sentiment analysis on a particular type of data emerging rapidly in third-world countries as these use English as their second language. Many corporates want to tap this market sentiment by analysing tweets or reviews of their products or about their brand. We proposed an ensemble-based approach using a sentence embedding and a multilingual variant of RoBERTa to identify the sentiment out of this code-mix data. The data in consideration is the mix of Hindi-English; in such text, the author tends to switch between languages during the conversation. Most often, this text appears in reviews. Such reviews were earlier supposed as noisy and were neglected as it was considered that this is the incorrect form and could not be modelled by any language model. As time passed, researchers and the linguistic community recognised it and showed interest in it. We have conducted experiments on five datasets, and our proposed classifier had achieved the best accuracy of 67% on SemEval 2020 Task-9 dataset outperforming the other contemporary baselines. We have also evaluated the proposed work on parameters such as precision, recall, and f 1 -score.

In future, we will be extending our work for.

• Aspect-based sentiment analysis on these data to get better insights. Our target will be product reviews and tweets in code mix. In further studies, we will address many issues such as finding aspects which is non-trivial as without proper language model is complicated.

• Exploring and building a low resource lexicon-based model that might be specific to language pair as these models are way faster than deep learning models.

• Due to the scarcity of datasets available in this field of code-mix data, we will create and annotate a large dataset for evaluating and training models.

Funding All authors certify that they have no affiliations with or involvement in any organisation or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.

Data availabilty Enquiries about data availability should be directed to the authors.

Conflict of interest The authors have not disclosed any competing interests.

All-in-one: emotion, sentiment and intensity prediction using a multi-task ensemble framework

Using artificial intelligence to understand what causes sentiment changes on social media

Combination of recursive and recurrent neural networks for aspect-based sentiment analysis using interaspect relations

Overview of the mixed script information retrieval (MSIR) at fire-2016. In: Forum for information retrieval evaluation

Unilmv2: Pseudo-masked language models for unified language model pre-training

The LIDES coding manual: a document for preparing and analyzing language interaction data version 1.1-July 1999

Universal sentence encoder for English

DravidianCodeMix: sentiment analysis and offensive language identification dataset for dravidian languages in code-mixed text

Semantic and sentiment analysis of selected bhagavad gita translations using BERT-based language framework

Coarse-grained ?/-effect word sense disambiguation for implicit sentiment analysis

supervised learning of universal sentence representations from natural language inference data

Code-mixing in social media text: the last language identification frontier

BERT: Pretraining of Deep Bidirectional Transformers for Language Understanding

Sentiment analysis of Marathi news using LSTM

Cmsaone@ dravidian-codemix-fire2020: a meta embedding and transformer model for codemixed sentiment analysis on social media text

On measuring the complexity of codemixing

Burstiness and memory in complex systems

Multi task mutual learning for joint sentiment classification and topic detection

Toward integrated cnn-based sentiment analysis of tweets for scarce-resource language-Hindi

Metrics for modeling code-switching across corpora

Long short-term memory

Cross lingual sentiment analysis using modified BRAE

Deep learning based sentiment analysis in a code-mixed English-Hindi and English-Bengali social media corpus

Sentiment analysis applied to news from the Brazilian stock market

An ensemble model for sentiment analysis of Hindi-English code-mixed data

Towards subword level compositions for sentiment analysis of hindi-english code mixed text

An automatic instrument recognition approach based on deep convolutional neural network

Gaining insights from social media language: Methodologies and challenges

Demixing sentiment from code-mixed text

Albert: A lite bert for self-supervised learning of language representations

Distributed representations of sentences and documents

A corpus-based analysis of mixed code in Hong Kong speech

Chinese story generation of sentence format control based on multi-channel word embedding and novel data format

Sentiment analysis: mining opinions, sentiments, and emotions

A robustly optimized bert pretraining approach. arXiv preprint arXiv

Kk2018 at SemEval-2020 Task 9: Adversarial training for code-mixing sentiment classification

SemEval-2013 Task 2: Sentiment analysis in Twitter

A comprehensive survey on word representation models: From classical to stateof-the-art word representation language models

An ensemble deep learning classifier for sentiment analysis…

Syntactically coherent text augmentation for sequence classification

Mind games: A temporal sentiment analysis of the political messages of the internet research agency on Facebook and Twitter

Shared task on sentiment analysis in indian languages (sail) tweets-an overview

Sentiment analysis of code-mixed indian languages: An overview of sail_code-mixed shared task@ icon-2017

Semeval-2020 task 9: overview of sentiment analysis of code-mixed tweets

Beneath the tip of the iceberg: Current challenges and new directions in sentiment analysis research

Sub-word-LSTM

A frequency-based approach to extract aspect for aspect-based sentiment analysis

Named entity recognition for code-mixed indian corpus using meta embedding

Short-term stock trends prediction based on sentiment analysis and machine learning

Sentiment analysis of reviews in natural language: Roman Urdu as a case study

A comparative study on code-mixed data of Indian social media vs formal text

Classification-Example/allTrainingData.tsv at master · AnthonyMRios/Sentiment-Classification-Example · GitHub

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

SemEval

Autoprompt: eliciting knowledge from language models with automatically generated prompts

Challenges and limitations with the metrics measuring the complexity of code-mixed text

Code-mixing: a brief survey

It's what's happening/Twitter

A research on HMM based speech recognition in spoken english

M2Lens: visualizing and explaining multimodal models for sentiment analysis

Zera-shot sentiment analysis for code-mixed data

Bi-LSTM and ensemble based bilingual sentiment analysis for a code-mixed Hindi-English social media Text

Multilingual universal sentence encoder for semantic retrieval

XLNet: generalized autoregressive pretraining for language understanding

The construction of sentiment lexicon based on context-dependent part-of-speech chunks for semantic disambiguation

Bilinguallyconstrained phrase embeddings for machine translation

ADeCNN: An improved model for aspect-level sentiment analysis based on deformable CNN and attention