key: cord-0133140-ynj97idu authors: Alharbi, Basma; Alamro, Hind; Alshehri, Manal; Khayyat, Zuhair; Kalkatawi, Manal; Jaber, Inji Ibrahim; Zhang, Xiangliang title: ASAD: A Twitter-based Benchmark Arabic Sentiment Analysis Dataset date: 2020-11-01 journal: nan DOI: nan sha: 8ceb8fdf522b826016fdb4f4c9fbc6e1aa88a537 doc_id: 133140 cord_uid: ynj97idu

This paper provides a detailed description of a new Twitter-based benchmark dataset for Arabic Sentiment Analysis (ASAD), which is launched in a competition sponsored by KAUST, awarding 10,000 USD, 5,000 USD and 2,000 USD to the first-, second- and third-place winners, respectively. Compared to other publicly released Arabic datasets, ASAD is a large, high-quality annotated dataset (including 95K tweets) with three-class sentiment labels (positive, negative and neutral). We present the details of the data collection and annotation processes. In addition, we implement several baseline models for the competition task and report their results as a reference for the participants.

Sentiment analysis is a popular and challenging Natural Language Processing (NLP) task that facilitates capturing public opinion on a topic, product or service. It is thus a widely studied research area, with many benchmark datasets and algorithms developed to tackle it. The state of research in Arabic sentiment analysis lags behind that of high-resource languages such as English and Chinese [1]. A comprehensive review of the current state of research on Arabic sentiment analysis can be found in [2]. In general, work on Arabic sentiment analysis can be grouped by the employed approach, ranging from simple rule-based / lexicon-based algorithms [3] to more complex deep learning algorithms [4]. Synthesizing recent publications on Arabic sentiment analysis shows an increase in the use of deep learning approaches over the past few years [5, 6, 7, 8, 9, 10]. In such approaches, a word embedding model for feature representation is often combined with a deep learning model for classification. For example, the authors of [10] applied three deep learning models: CNN, LSTM and RCNN, where the input to all three models was word embeddings generated by a pretrained word2vec CBOW model [11] named AraVec [12]. Contributions to deep learning-based Arabic sentiment analysis are largely limited to applying existing models to Arabic corpora; only a few publications contribute new models specifically designed for the unique characteristics of the Arabic language, as in [13]. This deficiency in Arabic NLP studies is often attributed to the lack of large, high-quality Arabic benchmark datasets [14]. The objective of this work is to introduce a new, large Twitter-based Arabic Sentiment Analysis Dataset (ASAD).

That being said, the increasing interest in Arabic sentiment analysis over the past few years has resulted in the release of a number of datasets. Datasets collected for Arabic sentiment analysis span multiple sources, including newspapers, reviews, and various content from social media platforms [15, 16, 17, 18]. Table 1 compares our dataset, ASAD, with similar benchmark datasets, which are also Twitter-based Arabic corpora for general sentiment (as opposed to topic-based sentiment).
The benchmark datasets are compared in terms of dataset size (i.e., the number of annotated tweets), the number of sentiment classes, the inclusion of Arabic dialects, the annotation approach, the number of annotators, and the quality of the annotation (measured by Fleiss's Kappa [19]). TEAD [20], currently the largest dataset for Arabic sentiment analysis, was annotated automatically using the emojis in tweets. The same automatic annotation approach was adopted by ATSAD [21], while no explicit details were given on how the 40K Tweets dataset [10] was annotated. The remaining datasets are significantly smaller, yet they were all annotated manually. Manual annotation means that a number of human annotators were recruited and trained to provide high-quality annotations for each record in the dataset; it thus provides a more reliable data source for sentiment classification algorithms to learn from. The quality of annotations can be assessed, e.g., by the most commonly used Fleiss's Kappa metric [19], whose values closer to 1 indicate higher agreement between annotators. Among the four manually annotated corpora, AraSenTi-Tweet [22] and Gold Standard [23] reported their annotation quality measures, which are included in Table 1. SemEval [24] assigns each tweet one of five sentiment labels: strongly positive, weakly positive, neutral, weakly negative and strongly negative. AraSenTi-Tweet [22], ASTD [25] and Gold Standard [23] adopt a four-point scale: positive, neutral, negative and mixed. TEAD [20] follows a three-point scale, where each tweet is either positive, negative or neutral. Both 40K Tweets [10] and ATSAD [21] use a two-point scale, where tweets are either positive or negative. Our dataset, ASAD, adopts a three-point scale like TEAD [20], covers multiple dialects, and was annotated manually by an average of three annotators per tweet. The annotation quality of our dataset is κ = 0.56, indicating moderate agreement between the annotators. A detailed analysis of the annotation is presented in the next section.

The rest of this paper is organized as follows. Section 2 details the construction mechanism we followed to collect ASAD. Section 3 presents the benchmark evaluation and the results of baseline approaches for three-class sentiment classification. Section 4 discusses challenges and future applications of ASAD.

Figure 1 illustrates the methodology we adopted to construct the corpus. In general, the methodology consists of three main phases: data collection, data annotation and data cleaning. Each phase is further described in the subsections below. The data collection and annotation were carried out by Lucidya, an AI company with rich experience in organizing data annotation projects.

Figure 1: Corpus Collection Methodology. Illustration of the different phases involved in collecting, annotating and cleaning the dataset. *In the annotation phase, each record was annotated by at least three annotators, which explains the increase in the number of records in the annotation phase compared with the preceding data collection phase.

The tweets in ASAD were randomly selected from a pool of tweets collected using the Twitter public streaming API between May 2012 and April 2020. The selection process consisted of several filtration operations to ensure that the Arabic texts are comprehensive enough for the annotation task. First, a random set of tweets with the "ar" language label in their JSON object was selected from the pool and added to the corpus. Next, tweets containing fewer than seven words (excluding hashtags and user mentions) were removed. After that, all tweets that included URLs, images or videos, or that contained words from a predefined set of inappropriate Arabic keywords, were also removed. Finally, the corpus was refined to include 100K tweets.
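To make these filtration rules concrete, the following is a minimal sketch of such a filter in Python. It is an illustration under our own assumptions, not the authors' pipeline: the helper names are hypothetical, the inappropriate-keyword set is left empty, and the field names follow the classic Twitter JSON layout.

```python
import re

# Hypothetical stand-in for the predefined set of inappropriate Arabic keywords.
INAPPROPRIATE_KEYWORDS: set = set()

URL_PATTERN = re.compile(r"https?://\S+")

def keep_tweet(tweet: dict) -> bool:
    """Return True if a tweet JSON object survives the filtration rules above."""
    # Rule 1: keep only tweets that Twitter labeled as Arabic.
    if tweet.get("lang") != "ar":
        return False
    text = tweet.get("text", "")
    # Rule 2: drop tweets containing URLs or attached media.
    if URL_PATTERN.search(text) or tweet.get("entities", {}).get("media"):
        return False
    # Rule 3: require at least seven words, excluding hashtags and mentions.
    words = [w for w in text.split() if not w.startswith(("#", "@"))]
    if len(words) < 7:
        return False
    # Rule 4: drop tweets containing any inappropriate keyword.
    return not any(kw in text for kw in INAPPROPRIATE_KEYWORDS)

# Example usage on a small sample of tweet objects from the streaming API.
sample = [{"lang": "ar", "text": "w1 w2 w3 w4 w5 w6 w7", "entities": {}}]
corpus = [t for t in sample if keep_tweet(t)]
```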
We analyzed the content of ASAD and found that around 69% of the tweets were posted in 2020 (January to April 2020), 30% in 2019, and the remaining 1% between 2012 and 2018. 82% of the tweets came from unique authors, and we encountered only 503 verified authors in the corpus. Regarding location, 72% of the tweets originated from Saudi Arabia, 13% from Egypt, 7% from Kuwait, 3% from the United States, and 3% from the United Arab Emirates. Based on the models developed by Lucidya, 36% of the tweets are written in Modern Standard Arabic, 31% use the Khaleeji dialect, 22% use the Hijazi dialect, and the remaining 10% are in the Egyptian dialect.

The objective of the annotation phase is to apply tweet-level sentiment annotation. We adopted a three-way sentiment classification scheme, in which a sentiment is either positive, negative or neutral. A screenshot of the annotation platform for our project is given in Figure 2. Each tweet was annotated by at least three experienced annotators who were trained and qualified by the company. The annotation was performed by 69 native Arabic speakers with an average age of 26: 12% of the recruited annotators were under 21, 71% were between 21 and 30, 13% were between 31 and 40, and the rest were over 40. Among the annotators, 85% were female, 71% hold at least a Bachelor's degree, and 4% hold a postgraduate degree. In addition, 67% of the annotators have a background in arts, linguistics or education, while the remaining 33% majored in science or engineering. Based on Lucidya's experience, it is more efficient to hire local annotators, so 71% of the annotators are from Saudi Arabia; the remainder came from Egypt (12%), Yemen (7%), Algeria (7%), Indonesia and Palestine. The annotators started the corpus annotation project in February 2020 and finished by May 2020, with the tweets added to the annotation portal in several batches. At the end of each batch, the quality of the annotations was evaluated to ensure that only high-quality annotators were allowed to participate in the next batch. Over the 10 batches, we collected a total of 304,625 annotations for 100,661 distinct tweets, an average of roughly three annotations per tweet.

Preliminary data cleaning was performed to ensure a minimum level of data quality. Empty tweets, i.e., tweets with no text, as well as meaningless tweets (e.g., those containing only URLs or coupons), were excluded from the corpus. The total number of records retained at this stage is 95,000.

For a corpus to be considered a benchmark, we need to verify the reliability of the labels as well as the difficulty of the task.
For that, we adopt the Fleiss's Kappa metric [19] to measure the Inter-Annotator Agreement (IAA). Fleiss's Kappa is calculated as

$$\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}, \qquad \bar{P} = \frac{1}{N}\sum_{i=1}^{N} P_i, \qquad P_i = \frac{1}{n(n-1)}\Big(\sum_{j=1}^{k} n_{ij}^2 - n\Big), \qquad \bar{P}_e = \sum_{j=1}^{k}\Big(\frac{1}{Nn}\sum_{i=1}^{N} n_{ij}\Big)^2,$$

where $N$ is the total number of tweets, $n$ is the number of annotations per tweet, $k$ is the number of classes, and $n_{ij}$ is the number of annotators who assigned the $i$-th tweet to the $j$-th class. In our ASAD dataset, some tweets were annotated by three annotators ($n=3$) and some by four ($n=4$). The agreement among the annotators and the calculated $\kappa$ for each case are shown in Table 2. The overall Fleiss's Kappa coefficient of ASAD is $\kappa = 0.56$, indicating moderate agreement among the annotators.
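For reference, the formula above can be implemented directly in a few lines of NumPy. This is a sketch rather than the authors' code, and it assumes a constant number of annotations per tweet, so the n=3 and n=4 subsets would be scored separately.

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss's kappa for an (N, k) matrix with counts[i, j] = n_ij, the
    number of annotators assigning tweet i to class j. Assumes every row
    sums to the same n (annotations per tweet)."""
    N, k = counts.shape
    n = counts.sum(axis=1)[0]
    # Per-tweet agreement P_i = (sum_j n_ij^2 - n) / (n (n - 1)).
    P_i = (np.sum(counts ** 2, axis=1) - n) / (n * (n - 1))
    P_bar = P_i.mean()
    # Chance agreement P_e from the marginal class proportions p_j.
    p_j = counts.sum(axis=0) / (N * n)
    P_e = np.sum(p_j ** 2)
    return float((P_bar - P_e) / (1 - P_e))

# Toy example: four tweets, three classes, three annotations each.
toy = np.array([[3, 0, 0], [2, 1, 0], [1, 1, 1], [0, 0, 3]])
print(round(fleiss_kappa(toy), 3))  # ~0.386
```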
Since around 37% of the tweets were initially annotated with more than two different labels (as given in Table 2), we investigate whether the disagreement is due to the labeling task being non-trivial for human annotators, or is caused by the carelessness of certain annotators. We determine the final label for each tweet by majority voting, or by another experienced annotator if there was no consensus. We then evaluate the accuracy of each annotator: the percentage of their annotated tweets that match the final label. The accuracy is shown in Figure 3, where the 69 annotators are ordered by the number of annotations they contributed. The top 10 annotators contributed more than 60% of the annotations, with an average accuracy of 0.88. The overall average accuracy is 0.85, regardless of the number of contributed annotations. Only a few annotators had an accuracy below 0.8, and most of them contributed fewer than 200 annotations. Following [26] and [27], we also evaluate the average inter-rater agreement, obtaining ι = 0.83, which indicates reliable annotators since ι is much higher than 0.5. We therefore conclude that the annotators conducted high-quality labeling work, and that the disagreement is mainly caused by the non-trivial nature of the tweets themselves. Such challenges in annotating real-world tweets also make our competition especially meaningful; we expect that the advanced models developed by the participants will provide effective solutions to this problem.

ASAD consists of 95,000 annotated tweets with three-class sentiments. Table 3 presents basic statistics about ASAD. ASAD (represented by ALL in the table) has a total of 15,215 positive tweets, 15,267 negative tweets and 64,518 neutral tweets. In the Kaggle competition, the entire ASAD dataset is split into TRAINING, TEST1 and TEST2 for use at different stages; the label distribution is the same across all splits. We also use these splits for the benchmark evaluation described in Section 3.

This section provides evaluation details and baseline results. In the competition, the ASAD dataset is divided into three folds: training (TRAINING), development-time testing (TEST1) and main testing (TEST2). The TRAINING set contains 55K labeled tweets, while TEST1 and TEST2 contain 20K tweets each, as shown in Table 3. Both the TRAINING and TEST1 sets are released at the beginning of the competition; at this stage, competing teams can assess their relative rank on TEST1 using the evaluation metric described next. At the end of the competition, teams are ranked by their performance on TEST2, with the ranking done offline by evaluating the submitted code.

The prizes for the three winners are:

• First place: 10,000 USD
• Second place: 5,000 USD
• Third place: 2,000 USD

At the end of the competition, the top-ranking teams will be invited to describe their winning solutions in a published report, and an award event will be organized where the winning teams present their solutions. For more details about the competition, timelines and rules, please visit the competition website: https://wti.kaust.edu.sa/arabic-sentiment-analysis-challenge.

Given the highly imbalanced dataset and the multi-class nature of the problem, and following [24, 28], we use the following metrics in the competition:

• Primary metric: the macro-averaged recall (recall averaged across the three classes),
$$\rho = \frac{R_P + R_N + R_U}{3},$$
where $R_P$, $R_N$ and $R_U$ are the recall of the positive, negative and neutral classes, respectively.

• Secondary metric: the average $F_1$ of the positive and negative classes only,
$$F_1^{PN} = \frac{F_1^P + F_1^N}{2},$$
where $F_1^P$ and $F_1^N$ refer to the $F_1$ measure of the positive and negative classes, respectively.

For more details on the evaluation metrics, please refer to [29]. The two proposed metrics take into consideration the multi-class nature of the sentiment classification problem, as well as the imbalanced nature of the dataset. In the competition, participants are evaluated and ranked by the primary metric only; the secondary metric is introduced to provide additional analysis of the results and is not used for ranking participating teams.

Preprocessing: In our baseline experiments, we preprocessed the data by removing stop words, special symbols such as '@' and '#', and URLs. We also removed non-Arabic characters to restrict the analysis to Arabic text, and applied word tokenization. For this step we used NLTK (https://www.nltk.org) and the pyarabic tool (https://github.com/linuxscout/pyarabic).

Traditional Baseline Methods: We used bag-of-words (BOW) and TF-IDF to extract features from the text, and trained a logistic regression classifier on each representation.

• BOW + logistic regression. BOW is the simplest form of text representation, describing the occurrence of words within a document. We used BOW to extract text features and fed them into a logistic regression classifier with the 'lbfgs' solver.

• TF-IDF + logistic regression. TF-IDF measures how important a particular word is with respect to a document and the entire corpus. As in the previous model, text features were extracted with TF-IDF and classified with logistic regression.

For BOW, TF-IDF and logistic regression, we used the popular Scikit-learn python library. Scikit-learn is an open-source machine learning library that supports supervised and unsupervised learning, and provides various tools for model fitting, data preprocessing, model selection and evaluation, among many other utilities.
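As an illustration of the TF-IDF + logistic regression baseline together with the two competition metrics, a minimal scikit-learn pipeline could look as follows. This is a sketch under our own assumptions: the placeholder texts stand in for preprocessed Arabic tweets, and the hyperparameters are illustrative rather than those used for the reported baselines.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score
from sklearn.pipeline import make_pipeline

# Placeholder tokens standing in for preprocessed Arabic tweets; in practice,
# load the TRAINING and TEST1 splits here.
train_texts = ["great happy day", "bad sad awful", "weather report today"]
train_labels = ["positive", "negative", "neutral"]
test_texts, test_labels = ["weather report tomorrow"], ["neutral"]

# TF-IDF features fed into a logistic regression classifier (lbfgs solver);
# swap TfidfVectorizer for CountVectorizer to obtain the BOW baseline.
model = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(solver="lbfgs", max_iter=1000),
)
model.fit(train_texts, train_labels)
pred = model.predict(test_texts)

# Primary metric: recall averaged over the three classes.
rho = recall_score(test_labels, pred,
                   labels=["positive", "negative", "neutral"], average="macro")
# Secondary metric: average F1 of the positive and negative classes only.
f1_pn = f1_score(test_labels, pred,
                 labels=["positive", "negative"], average="macro")
print(f"avg recall = {rho:.3f}, F1_PN = {f1_pn:.3f}")
```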
Deep Learning Baseline Models: We fine-tuned the pre-trained models BERT [30] and AraBERT [31], both to learn embedding vectors for the tweets and to perform the classification.

• BERT [30] is one of the most popular language representation models, developed by Google. BERT can be fine-tuned with one additional output layer to create a model for a specific NLP task, and several multilingual BERT models have been released to support many languages.

• AraBERT [31] is an Arabic language model obtained by training BERT on a large-scale Arabic corpus. It outperforms the multilingual version of BERT on many NLP tasks applied to Arabic text.

To fine-tune BERT and AraBERT, we used ktrain, a lightweight wrapper around several deep learning libraries that helps build, train and deploy machine learning models in a more accessible manner. With ktrain, the text must be preprocessed in a specific way to obtain a fixed-size sequence of word IDs for each document, which then serves as input to the classifier. Each model in ktrain has its own corresponding preprocessing procedure and classifier.
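A minimal sketch of how such fine-tuning can be set up with ktrain's Transformer interface is shown below. The Hugging Face checkpoint name, sequence length, batch size, learning rate and number of epochs are our illustrative assumptions, not necessarily the settings behind the reported baselines.

```python
import ktrain
from ktrain import text

CLASS_NAMES = ["positive", "negative", "neutral"]

# Placeholder tweets standing in for the real TRAINING / TEST1 splits.
train_texts = ["tweet one", "tweet two", "tweet three"]
train_labels = ["positive", "negative", "neutral"]
val_texts, val_labels = ["tweet four"], ["neutral"]

# "aubmindlab/bert-base-arabert" is the AraBERT checkpoint on Hugging Face;
# a multilingual checkpoint such as "bert-base-multilingual-cased" can be
# substituted for the BERT baseline.
t = text.Transformer("aubmindlab/bert-base-arabert",
                     maxlen=128, class_names=CLASS_NAMES)
trn = t.preprocess_train(train_texts, train_labels)
val = t.preprocess_test(val_texts, val_labels)

model = t.get_classifier()  # encoder plus a softmax classification layer
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=32)

# One-cycle fine-tuning with an illustrative learning rate and epoch count.
learner.fit_onecycle(2e-5, 3)
learner.validate(class_names=CLASS_NAMES)
```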
Table 4: Performance (F1) of the baseline models BOW-LR, TF-IDF-LR, BERT and AraBERT on TEST1 and TEST2.

In this section, we highlight some of the challenges of Arabic sentiment classification on a Twitter-based dataset. We identify three main challenges:

• Sentiment imbalance: ASAD adopts a three-point sentiment scale, where a tweet is labeled as positive, negative or neutral. As detailed in Section 2, the label distribution is imbalanced: the neutral class contains many more tweets than the other two classes. Imbalanced datasets are challenging for most learning algorithms, so the prediction performance of a model should be evaluated with a metric that handles imbalanced data.

• Spam: The ASAD dataset was preliminarily cleaned as discussed in Section 2. However, spam tweets still exist and may affect the prediction performance of a sentiment classifier.

• Multiple dialects: The presence of multiple dialects in the dataset adds difficulty to the sentiment classification problem. In a simpler dataset with a single dialect, e.g., ASTD [25], a supervised learning algorithm can identify the main keywords associated with each sentiment in that dialect. When multiple dialects exist, the range of keywords associated with each sentiment grows, because such words differ from one dialect to another. This creates an additional challenge that a one-dialect corpus does not have: a word embedding model pre-trained on a one-dialect corpus may not work well on a multi-dialect dataset.

The complete dataset, with both training and testing sets, will be publicly released to the research community after the competition ends. Besides sentiment classification, the dataset can be used for a range of other NLP tasks, including dialect identification and spam detection, and it can assist the sentiment analysis of Covid-19 tweets [32]. We will update our work in the near future.

References

[1] Arabic natural language processing: An overview.
[2] A comprehensive survey of Arabic sentiment analysis. Information Processing & Management.
[3] Rawad Abou Assi, and Hazem Hajj. Sentence-level and document-level sentiment mining for Arabic texts.
[4] Deep learning. Nature.
[5] Deep learning models for sentiment analysis in Arabic.
[6] Hybrid deep learning for sentiment polarity determination of Arabic microblogs.
[7] Sentiment analysis of Arabic tweets using deep learning.
[8] Deep recurrent neural network vs. support vector machine for aspect-based sentiment analysis of Arabic hotels' reviews.
[9] Deep learning for Arabic NLP: A survey.
[10] Deep learning approaches for Arabic sentiment analysis. Social Network Analysis and Mining.
[11] Efficient estimation of word representations in vector space.
[12] AraVec: A set of Arabic word embedding models for use in Arabic NLP.
[13] Wassim El-Hajj, and Khaled Shaban. hULMonA: The universal language model in Arabic.
[14] LABR: A large scale Arabic sentiment analysis benchmark.
[15] SANA: A large scale multi-genre, multi-dialect lexicon for Arabic subjectivity and sentiment analysis.
[16] LABR: A large scale Arabic book reviews dataset.
[17] Building large Arabic multi-domain resources for sentiment analysis.
[18] AWATIF: A multi-genre corpus for Modern Standard Arabic subjectivity and sentiment analysis.
[19] Measuring nominal scale agreement among many raters.
[20] Using tweets and emojis to build TEAD: an Arabic dataset for sentiment analysis.
[21] An Arabic tweets sentiment analysis dataset (ATSAD) using distant supervision and self training.
[22] AraSenTi-Tweet: A corpus for Arabic sentiment analysis of Saudi tweets.
[23] An Arabic Twitter corpus for subjectivity and sentiment analysis.
[24] SemEval-2017 Task 4: Sentiment analysis in Twitter.
[25] ASTD: Arabic sentiment tweets dataset.
[26] SemEval-2018 Task 1: Affect in tweets.
[27] Inter-annotator agreement in sentiment analysis: Machine learning perspective.
[28] An axiomatically derived measure for the evaluation of classification algorithms.
[29] SemEval-2016 Task 4: Sentiment analysis in Twitter.
[30] BERT: Pre-training of deep bidirectional transformers for language understanding.
[31] AraBERT: Transformer-based model for Arabic language understanding.
[32] SenWave: Monitoring the global sentiments under the Covid-19 pandemic.