key: cord-0078946-pd7mrej8
authors: Kanfoud, Mohamed Raouf; Bouramoul, Abdelkrim
title: SentiCode: A new paradigm for one-time training and global prediction in multilingual sentiment analysis
date: 2022-05-25
journal: J Intell Inf Syst
DOI: 10.1007/s10844-022-00714-8
sha: a9091a90a12af03392c476f78277206b94d6f551
doc_id: 78946
cord_uid: pd7mrej8

The main objective of multilingual sentiment analysis is to analyze reviews regardless of the original language in which they are written. Switching from one language to another is very common on social media platforms. Analyzing these multilingual reviews is a challenge since each language is different in terms of syntax, grammar, etc. This paper presents a new language-independent representation approach for sentiment analysis, SentiCode. Unlike previous work in multilingual sentiment analysis, the proposed approach does not rely on machine translation to bridge the gap between different languages. Instead, it exploits common features of languages, such as part-of-speech tags used in Universal Dependencies. Equally important, SentiCode enables sentiment analysis in multi-language and multi-domain environments simultaneously. Several experiments were conducted using machine/deep learning techniques to evaluate the performance of SentiCode in multilingual (English, French, German, Arabic, and Russian) and multi-domain environments. In addition, the vocabulary proposed by SentiCode and the effect of each token were evaluated by the ablation method. The results highlight the 70% accuracy of SentiCode, with the best trade-off between efficiency and computing time (training and testing) in a total of about 0.67 seconds, which is very convenient for real-time applications.

Sentiment analysis applies natural language processing, opinion mining, and text analytics to derive subjective information from textual data. The online community is becoming an important part of today's business world.
The increase in Internet users from different cultures implies high variability in the style and languages used to express opinions. A variety of general opinions that fall into the category of sentiment analysis can be generated by explicit or indirect measures. This involves determining whether or not the opinion is related to a topic or product of interest, to gain insight into how people feel about the product, service, or brand online. Analyzing these opinions derived from social media platforms (Singh & Singh, 2021) such as Facebook, Twitter, and YouTube is challenging, especially when multiple languages are used (Londhe et al., 2021). The objective is to provide high-quality performance (accuracy and precision) and a high level of coverage (recall). Researchers have proposed and developed approaches in different categories: machine learning (Sebastiani, 2002), knowledge-based (Cambria, 2016), lexicon-based (Taboada et al., 2011), rule/case-based (Berka, 2020), and hybrid (Appel et al., 2016), for specific languages (often English), but these do not contain all relevant expressions or cover all languages. Although multilingual sentiment analysis has attracted much interest from researchers, most efforts focus on English. Moreover, existing multilingual methods mostly rely on standard machine translation, either for training or prediction, or even cross-lingually in both phases. Certainly, the performance of these methods is influenced by the quality of the translation process (Balahur & Turchi, 2012), and indeed by its very availability. In this work, we propose a new method called SentiCode (SENTIment CODE), which represents various features in English and non-English languages without involving machine translation at any stage. We take advantage of existing tools in the different languages, such as part-of-speech taggers, which are highly mature and can reach 99% accuracy.
A series of extensive experiments were conducted to evaluate the performance of SentiCode in multilingual and multi-domain environments. We also established a trade-off between efficiency and computing time (i.e., training and prediction). Furthermore, we compared different features, such as unigrams and bigrams, on vector space representations. An additional experiment on the training data size was also carried out. SentiCode was trained using different traditional machine learning and deep learning techniques.

The rest of the paper is organized as follows: Section 2 reviews previous work in the field; Section 3 presents the SentiCode method; Section 4 describes the techniques and libraries used in the experiment; Section 5 discusses the results, while Section 6 concludes the paper and highlights future work.

Most existing methods for sentiment analysis are designed to address English-language content or, at most, evaluate on English and claim that the proposed approach applies to other languages. Methods that do address multilingual sentiment analysis rely on machine translation to bridge the gap between different languages (Banea et al., 2008; Balahur & Turchi, 2012; Araujo et al., 2016). Two approaches to sentiment analysis are widely adopted in the literature. The first is based on so-called lexicon-based methods. Lexicons are constructed manually or automatically, starting with a seed word list and then expanding it using dictionaries or corpora. Some of these lexicons contain a list of words labeled with their polarity, such as the General Inquirer (Stone et al., 1966). Similarly, the MPQA Subjectivity lexicon (Wilson et al., 2005) includes over 8000 words annotated with four sentiment classes: positive, negative, both, and neutral. Another rich lexicon is SentiWordNet (SWN) (Baccianella et al., 2010).
It is a lexical resource for sentiment analysis created on top of the English lexical database WordNet, which groups words into different grammatical classes. SentiWordNet assigns positive and negative scores to each synset (group of synonyms); the objectivity score is obtained as one minus the sum of the positive and negative scores (objective = 1 - (positive + negative)). There are two versions of SentiWordNet: SentiWordNet 1.0 (Esuli & Sebastiani, 2006) annotates WordNet 2.0, and SentiWordNet 3.0 (Baccianella et al., 2010) annotates WordNet 3.0. Hutto & Gilbert (2014) introduced the VADER model (Valence Aware Dictionary for sEntiment Reasoning) for sentiment analysis in social media texts. The core of VADER is a sentiment lexicon containing a list of words, western emoticons, acronyms, and slang commonly used in social texts. These features are associated with their sentiment intensity, ranging from -4 to 4. The annotation was performed by Turkers recruited from Amazon Mechanical Turk. In addition to the lexicon, VADER employs five manually generalized heuristics: punctuation, capitalization, degree modifiers, opposing expressions, and negation. The authors of VADER claimed that it is not a black box and can be extended or modified for other purposes. It should be noted that the General Inquirer, the MPQA Subjectivity Lexicon, SentiWordNet, and VADER are freely available and widely used in the literature. Unlike the general lexicons mentioned above, there are also specialized lexicons for specific domains. Yekrangi & Abdolvand (2021) constructed a lexicon for the financial markets domain. The authors evaluated its performance by calculating the correlation between dollar price trends and sentiment scores, which averages 60% thanks to the finance-specific words in the specialized lexicon. Then there are the so-called machine learning techniques, which are widely adopted.
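As a concrete illustration, the SentiWordNet-style objectivity computation can be expressed in a few lines. The synset scores below are hypothetical, chosen for illustration only, not taken from SentiWordNet:

```python
def objective_score(pos_score: float, neg_score: float) -> float:
    """SentiWordNet-style objectivity: objective = 1 - (positive + negative)."""
    return 1.0 - (pos_score + neg_score)

# Hypothetical synset scores, for illustration only:
print(objective_score(0.625, 0.0))   # a strongly positive synset -> 0.375
print(objective_score(0.25, 0.25))   # a mildly subjective synset -> 0.5
```

A fully objective synset (both scores zero) therefore gets an objectivity of 1, and the three scores of any synset always sum to 1.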
Pang et al. (2002) used three machine learning techniques, namely Naïve Bayes (NB), Maximum Entropy (ME), and Support Vector Machine (SVM), for sentiment classification. These three techniques were tested and compared on a set of features (Pang et al., 2002). The results indicate that unigrams and presence counts are the most effective for movie reviews. These machine learning techniques also outperform a baseline based on human-rated word lists. The authors further pointed out that SVM is the most effective while NB is the least effective, though with small differences. Tripathy et al. (2016) used four machine learning techniques, SVM, NB, ME, and SGD (Stochastic Gradient Descent), and fed them n-gram and TF-IDF (term frequency-inverse document frequency) features extracted from IMDb (Internet Movie Database) movie reviews. The results show that unigrams and bigrams perform better than trigrams or longer n-grams. Jagdale et al. (2019) selected two machine learning algorithms, namely NB and SVM, to analyze Amazon website reviews in different categories such as cameras, tablets, and TVs. SVM and NB are thus the most widely adopted machine learning techniques in the work mentioned above, with SVM giving very promising results. Recently, Zhang et al. (2019) proposed a new model called quantum-inspired sentiment representation, based on quantum theory, to represent semantic information and sentiment. Combining the two approaches (lexicon-based and machine learning-based) can give better results than either approach alone, as shown by the work of Lu & Tsou (2010) on the diversity of outputs between lexicon-based and machine learning techniques. In conversational sentiment analysis, a bidirectional emotional recurrent unit (BiERU) is used to extract the sentiment of each message in a text-based conversation (Li et al., 2022). The emotional recurrent units have fewer parameters and are compatible with multiparty conversations without adjustment.
Another aspect of sentiment analysis is detecting sentiment toward different product entities. Aspect-based sentiment analysis (ABSA) considers the characteristics of a product or entity instead of working at the document or sentence level (Do et al., 2019). In (Liang et al., 2022), an LSTM (Long Short-Term Memory) layer was used to capture context, and a GCN (graph convolutional network) layer was built to capture the potential sentiment in a particular context. ABSA is beyond the scope of this paper. Multilingual sentiment analysis (MSA) is more challenging and has gained interest over the years. Boiy & Moens (2009) analyzed multilingual (English, Dutch, and French) and multi-domain sentiment using various representations such as unigrams, negation, discourse features, compound words, and verbs for sentiment classification. The authors proposed a three-layer cascade approach, starting with detecting whether the text is opinionated or not; if yes, the second level is called to determine its polarity, and if the output of this second level is not above a threshold, a third level is needed. Banea et al. (2010) used multiple features from several languages and showed how these features behave together in source- and target-language experiments on six languages: English, Arabic, French, German, Romanian, and Spanish. Cui et al. (2011) integrated character and punctuation features alongside emoticons to evaluate multilingual tweets. Ghorbel (2012) classified French posts from a forum. The proposed method extracts word polarity from the following PoS (part-of-speech) tags: adjectives, adverbs, nouns, and verbs, using SWN. As SWN is in English and the posts to be classified are in French, the author compared two translation strategies: traditional machine translation and translation via EuroWordNet.
Štajner et al. (2013) compared different combinations of features and showed that machine translation can generate a training corpus in another language. The authors also showed that general-purpose lexicons can achieve performance similar to domain-specific lexicons. Balahur & Turchi (2014) conducted a series of experiments based on data from the SemEval 2013 corpus translated into four other languages to evaluate sentiment analysis systems in a multilingual environment, concluding that poor translation negatively affects the final sentiment classification. Balahur et al. (2015) performed polarity classification on monolingual and multilingual texts (English, Spanish, and English-Spanish), comparing three techniques (a monolingual model, a monolingual pipeline model, and a multilingual model), in addition to training an English-Spanish tagger to handle code-switched texts. The authors found that the multilingual model performs better on code-switched text and therefore recommend it as the best choice when the messages to be analyzed are in multiple languages. Zhou et al. (2016) conducted a cross-lingual analysis using English as the source language and French, German, and Japanese as target languages. The proposed model, called BiDRL (Bilingual Document Representation Learning), integrates bilingual word representations from the source and target languages, using machine translation to translate between them. Nguyen & Le Nguyen (2018) used a convolutional neural network (CNN) and LSTM over n-gram word embeddings to capture the sentiment of multilingual and multi-domain YouTube comments in experiments on the SenTube dataset. The authors claimed that the proposed model applies to any language without relying on linguistic features.
Chen et al. (2019) proposed a novel ELSA (Emoji-powered representation learning for Cross-Lingual Sentiment Analysis) framework to solve the problem of cross-lingual sentiment analysis using emojis as an instrument. In terms of accuracy, ELSA outperforms the three compared methods, namely MT-BOW (Machine Translation - Bag of Words), CL-RL (Cross-Lingual Representation Learning), and BiDRL (Bilingual Document Representation Learning). In (Cruz Paulino et al., 2021), the authors used tweets about COVID-19 written in Filipino or English to conduct their experiment. The Filipino tweets were translated into English for annotation. Classifier models were then built to evaluate these multilingual tweets, employing count vectorizers as features; SVC was found to be the best-performing model. Both (Chen et al., 2019) and (Cruz Paulino et al., 2021) used machine translation to bridge the gap between the languages in their experiments. Multilingual sentiment analysis is thus an active research area: many studies have been conducted on MSA (Lo et al., 2017; Abdullah & Rusli, 2021; Agüero-Torales et al., 2021), discussing, comparing, and reviewing concepts, results, and trends.

This section first highlights the challenge of multilingual sentiment analysis and explains the motivation for the proposed approach. Then, the SentiCode algorithm, its process, and its vocabulary are described. There are about 7000 languages in the world. People express their opinions in different languages, whether native or preferred, through different media and channels, such as social media platforms. At the same time, with the emergence of online business entities (e.g., e-commerce), producers and buyers need to know people's opinions and feelings about other people, products, events, or anything else, for different reasons. To this end, sentiment analysis techniques are applied.
One such technique is supervised machine learning, which requires prior training and therefore labeled data. People use different languages, so models must be trained for each language. Moreover, this process requires labeled data for each language (which is difficult to obtain), not to mention the required training time. Thus, considering the above and following the adage "train once, use many times", this work proposes an approach that allows us to train a model once and then exploit it several times for multilingual sentiment analysis. We propose an approach called SentiCode (for SENTIment CODE), where instead of training a new model for each language, with all the laborious work that follows, we focus our efforts on a single abstract representation, a pseudocode, and train a single model capable of handling many other languages, saving time and effort (data collection and labeling, retraining, etc.). The SentiCode approach uses SentiCoder (see Algorithm 1), which generates a code from the raw text of the studied language. The generated code is named SentiCode (like the name of the proposed approach). Figure 1 illustrates the main idea and goal of the SentiCode approach. The full SentiCode process combines a symbolic approach (extracting knowledge-based features such as adjectives, adverbs, negation, etc., to generate SentiCode) and a subsymbolic approach (statistical and machine learning algorithms to infer the sentiment expressed in the text) (Cambria et al., 2020). The syntax, or vocabulary, of SentiCode should include features common across languages. Thus, in this paper, we detail the first implementation of SentiCode (other versions will be studied in future work).
In the first version, we considered the grammatical units adjectives, adverbs, nouns, and verbs, because these units are sentiment-bearing words (Taboada et al., 2011; Polanyi & Zaenen, 2006; Hatzivassiloglou & Wiebe, 2000). Hence, they are referred to as bearer tokens. To extract this feature, we use PoS taggers, following the Universal Dependencies (Nivre et al., 2020) PoS tag annotation to obtain a cross-lingual representation for all bearer tokens. Furthermore, when a positive word is combined with a negation such as "not", it takes on a negative polarity (Polanyi & Zaenen, 2006), and vice versa. Therefore, negation is handled in SentiCode: all negation terms are replaced by the token NOT, and this rule is applied to all languages. Furthermore, to tackle the task of sentiment analysis, we enriched the SentiCode vocabulary with two tokens representing the prior polarity of bearer tokens: the POS token stands for positive polarity and the NEG token for negative polarity. POS and NEG are placed in front of bearer tokens. The vocabulary size is 7, comprising ADJ, ADV, NOUN, VERB, NOT, POS, and NEG. Table 1 summarizes the vocabulary of SentiCode.

This section describes the data used in the experiment. In addition, we present the generated SentiCode and the machine learning techniques and tools used to evaluate the proposed approach. Since we address a multilingual sentiment analysis task, we needed a multilingual dataset. Therefore, our experiments were performed on the dataset (Prettenhofer & Stein, 2010a) presented in (Prettenhofer & Stein, 2010b). It consists of Amazon reviews in three categories (books, DVDs, and music) in four languages (English, French, German, and Japanese). However, we only used English, French, and German to evaluate the proposed approach.
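The encoding rules described above (bearer tokens, the NOT token, and the POS/NEG prior-polarity markers) can be sketched as follows. The helper names, toy lexicon, and negation list are illustrative assumptions; the paper's Algorithm 1 relies on spaCy PoS tags and external polarity lists and may differ in detail:

```python
BEARER_TAGS = {"ADJ", "ADV", "NOUN", "VERB"}

def senticoder(tagged_tokens, pos_lexicon, neg_lexicon, negation_terms):
    """Sketch of SentiCode generation: map Universal Dependencies PoS-tagged
    tokens onto the 7-token vocabulary (ADJ, ADV, NOUN, VERB, NOT, POS, NEG)."""
    code = []
    for word, upos in tagged_tokens:
        w = word.lower()
        if w in negation_terms:       # all negation terms collapse to NOT
            code.append("NOT")
        elif upos in BEARER_TAGS:     # sentiment-bearing grammatical units
            if w in pos_lexicon:
                code.append("POS")    # prior positive polarity marker
            elif w in neg_lexicon:
                code.append("NEG")    # prior negative polarity marker
            code.append(upos)         # the bearer token itself
    return " ".join(code)

# Toy example with a hypothetical lexicon; tags as a UD tagger would give them:
tagged = [("this", "DET"), ("movie", "NOUN"), ("is", "AUX"),
          ("not", "PART"), ("good", "ADJ")]
code = senticoder(tagged, {"good"}, {"bad"}, {"not"})
print(code)  # -> NOUN NOT POS ADJ
```

Because the output contains only the seven language-independent tokens, the same trained classifier can consume SentiCode produced from any language for which a UD tagger and a polarity list exist.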
There is a training document (file name: training.review) and a test document (file name: test.review) containing 2000 reviews each for each language-domain pair. The last document (file name: unlabeled.review) was discarded. The files are in XML format, where each review has the attributes described in Table 2. Following (Blitzer et al., 2007), a review with a rating > 3 is labeled as positive, while a review with a rating < 3 is labeled as negative. Thus, we obtained a dataset balanced between positive and negative reviews, i.e., the number of positive reviews equals the number of negative reviews (1000 reviews each per category). This labeling was applied to all language-domain pairs (English-books, English-DVDs, English-music, French-books, French-DVDs, French-music, German-books, German-DVDs, and German-music). The statistics of the three languages are compared in a stacked bar chart in Figure 2. Table 3 gives an example of the SentiCode generated by Algorithm 1: an original review in English, its PoS tags, and the equivalent SentiCode. It is important to note that some reviews are so short that their PoS tags cannot be generated; for example, the review "Hallo" from the German books corpus was ignored.

To evaluate the SentiCode approach, we first conducted an extensive experiment on three languages: English, French, and German. We evaluated SentiCode by selecting four state-of-the-art machine learning algorithms (Pang et al., 2002; Tripathy et al., 2016; Ahuja et al., 2019), along with a multilayer perceptron (MLP), to test its consistency in sentiment analysis across different classifiers. These machine learning algorithms were trained and tested with the default parameters defined by the Scikit-learn library (Pedregosa et al., 2011), except for MLP, which was tuned with alpha = 1e-5, one hidden layer of 10 neurons, and random_state = 1, with the remaining parameters kept at their defaults.
Algorithm 1 was applied to generate the SentiCode. PoS tags were extracted using the spaCy library. Table 4 summarizes the characteristics of the PoS taggers used for the three languages and lists the negation terms that were replaced by the NOT token in the SentiCode. All parts of the system were implemented in Python 3. The Scikit-learn library (Pedregosa et al., 2011) was used for the machine learning algorithms and spaCy to parse the raw text and extract PoS tags. Furthermore, to obtain the prior polarity (Wiegand et al., 2018) of the bearer tokens, the GraLexi approach (Kanfoud & Bouramoul, 2022) was used for English; for French and German, we used another list (Chen & Skiena, 2014). Next, we used a vector space model, namely TF-IDF. TF-IDF is a numerical statistic that reflects how important a term is to a document in a collection or corpus. Furthermore, we compared the results obtained using uni-grams, bi-grams, and uni-grams + bi-grams. After that, we started the training process with the SentiCode provided in en.train.books (i.e., English with the books domain). Then we evaluated it on nine language-domain pairs, namely: English-books (en.test.books), English-DVDs (en.test.dvd), English-music (en.test.music), French-books (fr.test.books), French-DVDs (fr.test.dvd), French-music (fr.test.music), German-books (de.test.books), German-DVDs (de.test.dvd), and German-music (de.test.music). In other words, the models were tested with the same/different languages and the same/different domains. Figure 3 represents the cross-language-domain training and testing process, repeated on five different classifiers (i.e., SVM, MNB, LR, SGDC, and MLP).

This section discusses the results obtained. First, we demonstrate that SentiCode achieves consistent results using different machine learning algorithms.
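Before turning to the results, the n-gram extraction and TF-IDF weighting described above can be sketched with the standard library. The experiments themselves use Scikit-learn's TfidfVectorizer, whose idf smoothing and normalization differ; this is a minimal variant for illustration:

```python
import math
from collections import Counter

def ngrams(tokens, n_range=(1, 2)):
    """Uni-grams plus bi-grams of a SentiCode token sequence."""
    lo, hi = n_range
    feats = []
    for n in range(lo, hi + 1):
        feats += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return feats

def tfidf(docs):
    """Plain TF-IDF weights per document (one common, unsmoothed variant)."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))       # document frequency
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: (tf / len(doc)) * idf[t] for t, tf in Counter(doc).items()}
            for doc in docs]

corpus = [ngrams(["NOT", "POS", "ADJ"]),
          ngrams(["POS", "ADJ", "NOUN"]),
          ngrams(["NEG", "ADJ", "VERB"])]
weights = tfidf(corpus)
# "ADJ" occurs in every document, so its idf, and hence its weight, is 0
```

With SentiCode's 7-token vocabulary, single tokens carry little discriminative weight on their own, which is one intuition for why the bigram features (e.g., "NOT POS") help, as the results below confirm.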
Next, we evaluated the SentiCode vocabulary by applying the ablation technique to show the impact of each token in the proposed vocabulary. Then, cross-language and cross-domain validations were performed to demonstrate the effectiveness of the trained SentiCode on languages other than English and domains other than books. After that, we evaluated SentiCode's trade-off between performance and computing time (training and prediction). An error analysis of misclassified reviews is presented, along with some suggestions for correcting the final classification. Next, we evaluated SentiCode on corpora with combined languages and domains (the training and prediction corpora gather SentiCode generated from English, French, and German in three domains: books, DVDs, and music). The last step assessed SentiCode with non-western languages (Arabic and Russian) and evaluated ternary classification (positive, negative, and neutral).

This subsection presents our results in terms of accuracy (Acc.), macro F1-score (M-F1, as the data was balanced), and Matthews correlation coefficient (MCC) (Chicco & Jurman, 2020). Tables 5, 6, 7, 8 and 9 report all the obtained results using SVM, MNB, LR, SGDC, and MLP, respectively. After reporting the results of SentiCode on the same and different languages and domains, the next step was to evaluate the effect of the vocabulary. We determined which tokens significantly impact the quality of discrimination between classes by removing tokens one by one, thereby decreasing the size of the SentiCode vocabulary. The experiment was conducted on the English-books corpora (trained with the en.train.books corpus, then tested with the en.test.books corpus) with six tunings:

• Keep all tokens (the proposed holistic version of SentiCode)
• Remove the NOT token (to assess the impact of negation)
• Remove the POS token (effect of prior positive polarity of bearer tokens)
• Remove the NEG token (effect of prior negative polarity of bearer tokens)
• Remove the POS and NEG tokens (effect of both prior polarities)
• Remove the POS, NEG, and NOT tokens (effect of prior polarities and negation handling)

Figure 4 compares the effect of the proposed tokens (POS, NEG, and NOT) in SentiCode. The results show that handling negation and taking into account the prior polarity of the bearer tokens contribute to better accuracy. In particular, negation handling (see the NOT token bar in Fig. 4) has a significant impact compared to the POS and NEG tokens alone. Another question raised during the experiments was: what about training the models on corpora other than en.train.books? To answer this question, the heat map in Fig. 5 shows the accuracy of the cross-language-domain evaluation over 81 language-domain pairs. The results vary and depend on the training corpus: they are better when using a corpus rich in varied patterns and negation. A rich corpus makes it possible to train the model well and to determine the discriminating features that differentiate the classes, as was the case here with the en.train.books corpus. Figure 6 summarizes the token distribution for the additional corpora used in the cross-evaluation.

This subsection compares SentiCode with three other methods in terms of performance, using the accuracy metric, and time in seconds. Furthermore, we determine the winning method by ranking the trade-off between accuracy and computing time using the Balanced Integration Score (BIS) metric (Liesefeld & Janczyk, 2019). The method with the higher BIS performs better.
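Assuming the standard definition from Liesefeld & Janczyk (2019), in which accuracy and time are standardized (z-scored) across the compared methods and then subtracted, BIS can be sketched as follows; the numbers are hypothetical, not the paper's measurements:

```python
from statistics import mean, stdev

def balanced_integration_score(accuracies, times):
    """BIS = z(accuracy) - z(time), standardized across the compared methods.
    Higher BIS means a better trade-off between accuracy and speed."""
    def z(xs):
        m, s = mean(xs), stdev(xs)
        return [(x - m) / s for x in xs]
    return [a - t for a, t in zip(z(accuracies), z(times))]

# Hypothetical accuracies (%) and total times (s) for three compared methods:
bis = balanced_integration_score([65.6, 68.0, 67.1], [0.36, 12.0, 25.0])
```

With these toy numbers, the second method obtains the highest BIS despite not being the fastest, illustrating that BIS rewards methods that balance both criteria rather than simply the quickest one.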
All evaluated methods use the SVM algorithm, applying the TF-IDF vector space with uni-grams as features on the book-domain corpora. The first method (trainFR__testFR) follows the classical approach of training and testing on the same language. In the second method (trainEN__testFR), the model was trained on English text and then tested directly (without translation) on French. In the third method (trainEN__testTranslatedFRtoEN), we evaluated a model trained on English and tested on French text translated into English. The accuracy and total time (training time + prediction time) are highlighted in Table 10. Although SentiCode (SentiCode_EN__FR) does not have the maximum performance (Acc. = 65.65), it has by far the best total computing time (0.36 s). Moreover, SentiCode has a good trade-off between performance and total time among the tested methods, with BIS ≈ 1.44. It is important to note that the experiments were performed on a personal laptop: Intel Core i7 4th-generation CPU at 2.10-2.70 GHz with 8 GB of RAM. The operation above was repeated for German, with German replacing French. Based on Table 10, we observe similar results for German: SentiCode (SentiCode_EN__DE) is by far the fastest of the tested methods and has the best trade-off, with BIS ≈ 1.24. This good trade-off makes SentiCode suitable for real applications where speed plays a key role, such as commercial applications.

This subsection highlights some misclassified reviews found in the results. We show the original text of each review with its equivalent SentiCode, along with the actual and predicted classes. Furthermore, we explain and propose an adjustment at the level of the SentiCode to reclassify the review correctly. Tables 11, 12, and 13 highlight three examples of misclassified reviews, one from each language. The added tokens are shown in bold in the adjustment row, while removed tokens are marked as well.
SentiCode is suitable for multilingual and multi-domain sentiment analysis and requires only one training session; afterwards, the trained model can be reused several times on other texts. To evaluate this, the nine test corpora, en.test.books, en.test.dvd, en.test.music, fr.test.books, fr.test.dvd, fr.test.music, de.test.books, de.test.dvd, and de.test.music, were combined into one corpus called multi.test.all. Similarly, the nine train corpora, en.train.books, en.train.dvd, en.train.music, fr.train.books, fr.train.dvd, fr.train.music, de.train.books, de.train.dvd, and de.train.music (see Fig. 2 for their statistics), were combined to create another corpus called multi.train.all. The multi.train.all corpus was used for training and the multi.test.all corpus for testing. The experiment was conducted on five classifiers: SVM, MNB, LR, SGDC, and MLP. Multi-training was then evaluated in terms of n-grams to show the role of context. Figure 7a shows that using unigrams only leads to low performance for all models in the experiment, while adding more context, e.g., bigrams, improves accuracy. Figure 7b shows that 50% of the training corpus is sufficient to achieve stable performance. Another observation based on Fig. 7a and b is that SVM performs better than the other classifiers; the same observation was made for SVM in (Zhang et al., 2019). In terms of computing time, SVM is the slowest, especially with a large number of features, e.g., when using an n-gram range of (2, 3).

After assessing SentiCode on western languages (English, French, and German), we assessed how it performs on non-western languages, taking Arabic and Russian as additional ones. For Arabic, we used the ASTD dataset (Nabil et al., 2015) to generate Arabic SentiCode, referred to as ar.senticode, with some pre-processing (removing diacritization, and lemmatization).
For Russian, we employed the dataset kindly provided by (Araujo et al., 2016) (tweets in Russian: 1145 positive, 1188 negative, and 1635 neutral). The SentiCode generated from Russian is referred to as ru.senticode. The resources used to generate SentiCode from these two languages (following Algorithm 1, presented before) are given in Table 14. Then, cross-evaluations between Arabic, Russian, and the western languages (English, French, and German, i.e., multi.train.all for training and multi.test.all for testing) were conducted with SVM and MLP (with the same configuration as in the previous experiment). The results are detailed in Tables 15 and 16, respectively. In these cross-evaluations, it is important to note that we removed the third class (neutral reviews) for Arabic and Russian to keep consistency with the data for the western languages, which is labeled with only binary classes (positive and negative). For Arabic, we used the 80-20 rule to split the ar.senticode dataset (80% for training and 20% for testing); the same was done for Russian. For all other pairs, the entire data was used. The results show good interoperability between western and non-western languages and are fairly balanced. We recorded an accuracy of 64.04% for the experiment where training was in Russian and testing in Arabic, and 64.68% accuracy in the experiment with Russian as training data and the western languages as testing data. Furthermore, we reported 64.04% accuracy, 62.33% M-F1, and 30.69% MCC for the Russian-Arabic pair, which is higher than the Russian-Russian pair. Finally, we assessed SentiCode on a dataset with ternary classification.
In addition to ar.senticode and ru.senticode, we added a third dataset, generating its SentiCode from (Potts et al., 2021) with the same resources used for English in Section 4.3 and considering three classes: positive, negative, and neutral. This new SentiCode generated from English is called en.senticode. The final dataset used in the ternary-classification experiment gathers ar.senticode, ru.senticode, and en.senticode into a mixed dataset, namely mixed.senticode. The experiment was conducted on SVM, MNB, LR, SGDC, and MLP, with the same parameters and tuning as before. The evaluation used 10-fold cross-validation on globally balanced data (4000 positives, 4000 negatives, and 4000 neutrals), reporting the mean and standard deviation of accuracy, macro F1 score, and MCC. Table 17 lists the results obtained; they are highly comparable, with MLP the most efficient and MNB the least efficient. Also worth noting is the consistency of the results: the 10-fold cross-validation yields a very small standard deviation (on the order of 1e-3).

This paper introduced SentiCode, a new sentiment language for multilingual sentiment analysis and domain adaptation. This work is the first implementation of its vocabulary. The results are very consistent across languages and domains, reaching 70% accuracy. Nevertheless, SentiCode can be improved and needs further analysis. SentiCode is a newcomer in a field with strong competition and interest among researchers, and could attract more attention for further experiments on other languages and domains, or work on its basic structure.
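The 10-fold cross-validation with accuracy, macro F1, and MCC can be sketched in scikit-learn as below. The toy three-class data and the `MultinomialNB` pipeline are assumptions for illustration (the paper's experiment also covers SVM, LR, SGDC, and MLP on the real mixed.senticode data):

```python
from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import make_scorer, matthews_corrcoef

# Toy balanced ternary SentiCode data (positive / negative / neutral).
texts = ["POS_ADJ NOUN"] * 20 + ["NEG_ADJ NOUN"] * 20 + ["NOUN VERB"] * 20
labels = ["pos"] * 20 + ["neg"] * 20 + ["neu"] * 20

# Report the three metrics used in Table 17; MCC needs a custom scorer.
scoring = {
    "acc": "accuracy",
    "macro_f1": "f1_macro",
    "mcc": make_scorer(matthews_corrcoef),
}
model = make_pipeline(CountVectorizer(), MultinomialNB())
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(model, texts, labels, cv=cv, scoring=scoring)

# Mean and standard deviation over the 10 folds.
for name in scoring:
    vals = scores[f"test_{name}"]
    print(name, round(vals.mean(), 3), round(vals.std(), 3))
```

The per-fold standard deviations computed this way are what the text refers to when noting the 1e-3-order consistency of the results.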
Furthermore, one of our objectives in presenting SentiCode is to propose it as a baseline for comparison with future approaches in the field of multilingual sentiment analysis, as we reported results in various metrics (accuracy, F1-score, and MCC) under different tuning. For future work, we propose two axes: first, enhancing the algorithm by applying more advanced techniques to handle negation and by developing prior polarity detection; second, the holistic aspect, applying SentiCode to languages not tested in this paper. Finally, we will train and test it on multiclass classification. To conclude, SentiCode will help focus research on designing and training SentiCode itself rather than dispersing efforts across many languages; it can then be used directly for prediction without training for each language beforehand, saving time and effort.

The datasets used to generate SentiCode during the current study are available in the Zenodo repository, https://zenodo.org/record/3251672. The datasets of SentiCode generated and analysed during the current study are available in the figshare repository, https://doi.org/10.6084/m9.figshare.17695559.

References
- Multilingual sentiment analysis: A systematic literature review. Pertanika
- Deep learning and multilingual sentiment analysis on social media data: An overview
- The impact of features extraction on the sentiment analysis
- A hybrid approach to the sentiment analysis problem at the sentence level. Knowledge-Based Systems
- An evaluation of machine translation for multilingual sentence-level sentiment analysis
- SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining
- Multilingual sentiment analysis using machine translation
- Comparative experiments using supervised learning and machine translation for multilingual sentiment analysis
- Multilingual subjectivity: Are more languages better?
- Multilingual subjectivity analysis using machine translation
- Sentiment analysis using rule-based and case-based reasoning
- Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification
- A machine learning approach to sentiment analysis in multilingual web texts
- Affective computing and sentiment analysis
- SenticNet 6: Ensemble application of symbolic and subsymbolic AI for sentiment analysis
- Emoji-powered representation learning for cross-lingual sentiment classification
- Building sentiment lexicons for all major languages
- The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation
- Multilingual sentiment analysis on short text document using semi-supervised machine learning
- Emotion tokens: Bridging the gap among multilingual Twitter sentiment analysis
- Deep learning for aspect-based sentiment analysis: A comparative review
- SentiWordNet: A publicly available lexical resource for opinion mining
- Experiments in cross-lingual sentiment analysis in discussion forums
- Effects of adjective orientation and gradability on sentence subjectivity
- VADER: A parsimonious rule-based model for sentiment analysis of social media text
- Sentiment analysis on product reviews using machine learning techniques
- Linking the linguistic resources using graph structure for multilingual sentiment analysis
- BiERU: Bidirectional emotional recurrent unit for conversational sentiment analysis
- Aspect-based sentiment analysis via affective knowledge enhanced graph convolutional networks. Knowledge-Based Systems
- Combining speed and accuracy to control for speed-accuracy tradeoffs(?)
- Multilingual sentiment analysis: From formal to informal and scarce resource languages
- Challenges in multilingual and mixed script sentiment analysis
- Creating a general Russian sentiment lexicon
- Combining a large sentiment lexicon and machine learning for subjectivity classification
- ASTD: Arabic Sentiment Tweets Dataset
- Multilingual opinion mining on YouTube: A convolutional n-gram BiLSTM word embedding
- Universal Dependencies v2: An evergrowing multilingual treebank collection
- Thumbs up? Sentiment classification using machine learning techniques
- Scikit-learn: Machine learning in Python
- Contextual valence shifters
- DynaSent: A dynamic benchmark for sentiment analysis
- Webis Cross-Lingual Sentiment Dataset
- Cross-language text classification using structural correspondence learning
- Machine learning in automated text categorization
- Empirical study of sentiment analysis tools and techniques on societal topics
- The General Inquirer: A computer approach to content analysis
- Lexicon-based methods for sentiment analysis
- Classification of sentiment reviews using n-gram machine learning approach
- Sentiment analysis on monolingual, multilingual and code-switching Twitter corpora
- Negation modeling for German polarity classification
- Recognizing contextual polarity in phrase-level sentiment analysis
- Financial markets sentiment analysis: Developing a specialized lexicon
- A quantum-inspired sentiment representation model for Twitter sentiment analysis
- Cross-lingual sentiment classification with bilingual document representation learning
- Informal multilingual multi-domain sentiment analysis

The authors declare that they have no conflict of interest.

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.