key: cord-0675621-omqnmc2f authors: Mubarak, Hamdy; Hassan, Sabit title: ArCorona: Analyzing Arabic Tweets in the Early Days of Coronavirus (COVID-19) Pandemic date: 2020-12-02 journal: nan DOI: nan sha: 2d6a8e0b3e8f5ad2a39fac0db0e8131c00f9964a doc_id: 675621 cord_uid: omqnmc2f

Over the past few months, huge numbers of tweets and discussions about Coronavirus (COVID-19) have circulated in the Arab region. Identifying the types of shared tweets helps policy makers and the public better understand public behavior, topics of interest, requests from governments, sources of tweets, etc. It is also crucial to prevent the spread of rumors and misinformation about the virus or bad cures. To this end, we present the largest manually annotated dataset of Arabic tweets related to COVID-19. We describe annotation guidelines, analyze our dataset and build effective machine learning and transformer-based models for classification.

As the Coronavirus (COVID-19) crippled lives across the world, people turned to social media to share their thoughts, news about vaccines or cures, personal stories, etc. With Twitter being one of the popular social media platforms in the Arab region, tweets became a major medium of discussion about COVID-19. These tweets can be indicators of psychological and physical well-being, of public reactions to specific actions taken by the government, and of what the public expects from governments. Therefore, identifying types of tweets and understanding their content can aid decision making by governments. It is also important for governments to identify and prevent rumours and bad cures, since they can bring harm to society. While there have been many recent works about tweets related to COVID-19, very few are targeted toward aiding governments in their decision making in the Arab region, despite Arabic being one of the dominant languages on Twitter (Alshaabi et al., 2020).
Some existing works use automatically collected datasets (Alqurashi et al., 2020). Manually labeled datasets are either small (a few hundred tweets) (Alam et al., 2020) or target a different task such as sentiment analysis (Haouari et al., 2020). To fill this gap, we present and publicly share the largest (to the best of our knowledge) manually annotated dataset of Arabic tweets collected from the early days of COVID-19, labeled for 13 classes. We present our data collection and annotation scheme, followed by data analysis that identifies trends, topics and distribution across countries. Lastly, we employ machine learning and transformer models for classification.

Much of the recent work on COVID-19 relies on queries to Twitter or on distant supervision, which allows a large number of tweets to be collected. Chen et al. (2020) collect 123M tweets by following certain queries and accounts on Twitter. GeoCoV19 (Qazi et al., 2020) is a large-scale dataset containing 524M tweets with their location information. Banda et al. (2020) collected 152M tweets at the time of their writing. Li et al. (2020) identify situational information about COVID-19 and its propagation on Weibo. Other works study the propagation of misinformation (Huang and Carley, 2020; Shahi et al., 2020), the cultural, social and political impact of misinformation (Leng et al., 2020) and rumor amplification (Cinelli et al., 2020).

For Arabic, we see a similar trend where few datasets are manually labeled. Alqurashi et al. (2020) provide a large dataset of Arabic tweets containing keywords related to COVID-19. Similarly, ArCOV-19 (Haouari et al., 2020) is a dataset of 750K tweets obtained by querying Twitter. Alam et al. (2020) annotate a small number of English (currently 504) and Arabic (currently 218) tweets for (i) existence of a claim and worthiness of fact-checking, (ii) harmfulness to society, and (iii) relevance to governments or policy makers. Yang et al.
(2020) annotate 10K Arabic and English tweets for the task of fine-grained sentiment analysis.

We used the twarc search API to collect tweets containing the Arabic word for "Corona" in February and March 2020, 30M tweets in total. We selected this word because it is widely used by ordinary people, news media and official organizations, as opposed to the Arabic term for COVID-19, which is rarely used by ordinary people. We aimed to increase the diversity of tweet sources. Our collection covers the period from Feb 21 until March 31, in which Coronavirus was reported for the first time in all Arab countries except United Arab Emirates (AE) (Jan 29) and Egypt (EG) (Feb 14).

During the period of our study (40 days), we extracted the 200 most retweeted tweets of each day (8000 in total). We assume that the most retweeted tweets are the most important ones, since they get the highest attention from Twitter users. Annotation was done manually by a native speaker according to the class descriptions shown in Table 1. To measure quality, a second annotator labeled 200 random tweets. Inter-annotator agreement was 0.85 using Cohen's kappa coefficient, which indicates high quality given that annotation is not trivial and some classes are close to each other. Examples of annotation classes are shown in Figures 1 and 2.

We found that ≈10% of the tweets can take more than one class, e.g. a tweet that both reports new cases and gives medical advice. We plan to allow multiple labels in the future. In the current version, such tweets take the label of the first "important" class. We consider the first 8 classes in Table 1 to be important and the last 5 classes (PRSNL, SUPPORT, PRAYER, UNIMP and NOT ARB) to be less important.

The class timeline is shown in Figure 4. We can observe the following:
• A large portion of tweets (≈ 30%) can be considered LessImportant to many people.
• Reports (REP) and actions taken by governments (ACT) are the most retweeted tweets.
• Information about the virus (INFO) gets less attention over time, while the number of tweets about volunteering (VOLUNT) increases.
• There are continuous requests for governments to take actions (SEEK ACT), especially in the beginning (≈ 15%), while few tweets are about rumors (≈ 5%) and cures (≈ 2%).

We took a random sample of 1000 tweets and annotated them for their topics. Figure 3 shows that, in addition to health, the virus affected many aspects of people's lives such as politics, economy, education, etc. We also found that 7% of tweets contain hate speech, e.g. attacking China and Iran for spreading the virus, as shown in Figure 5. Table 2 shows the country distribution and top accounts for the original authors of tweets. Typically, people retweet tweets from the ministry of health in their countries, in addition to famous news agencies and celebrities. Most of these accounts are verified. The five less important classes (PRSNL, SUPPORT, PRAYER, UNIMP and NOT ARB) are merged into a single LessImportant class.

We randomly split the data into sets of 6000, 1000 and 1000 tweets for the train, dev and test sets respectively. We report macro-averaged Precision (P), Recall (R) and F1 score along with Accuracy (Acc) on the test set (differences between the dev and test sets are within ±2-3% F1). We use F1 score as the primary metric for comparison.

We experimented with character and word n-gram features weighted by term frequency-inverse document frequency (tf-idf). We report results only for the most significant ranges, namely word [1-2] and character [2-5].

Mazajak Embeddings: Mazajak embeddings are word-level skip-gram embeddings trained on 250M Arabic tweets, yielding 300-dimensional vectors.

Support Vector Machines (SVMs): SVMs have been shown to perform decently for Arabic text classification tasks such as spam detection, offensiveness detection or dialect identification (Bouamor et al., 2019). We experimented with i) word n-gram, ii) character n-gram and iii) Mazajak embeddings.
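The n-gram configurations above can be sketched with scikit-learn roughly as follows. This is a minimal illustration rather than the authors' code: the helper names `build_ngram_svm` and `evaluate`, the default hyperparameters, and the toy English sentences standing in for annotated Arabic tweets are all assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data; the real dataset holds 8000 annotated Arabic tweets.
TOY_TEXTS = [
    "ministry reports ten new cases today",
    "health ministry reports two recoveries",
    "wash your hands and stay at home",
    "doctors advise wearing a mask outside",
]
TOY_LABELS = ["REP", "REP", "ADVICE", "ADVICE"]


def build_ngram_svm(analyzer, ngram_range):
    """Build one configuration: tf-idf weighted n-grams fed to a linear SVM.

    analyzer is "word" or "char"; ngram_range is e.g. (1, 2) or (2, 5).
    Default hyperparameters are an assumption, not taken from the paper.
    """
    vectorizer = TfidfVectorizer(analyzer=analyzer, ngram_range=ngram_range)
    return make_pipeline(vectorizer, LinearSVC())


def evaluate(model, texts, labels):
    """Return macro-averaged P, R, F1 and accuracy for a fitted model."""
    predicted = model.predict(texts)
    p, r, f1, _ = precision_recall_fscore_support(
        labels, predicted, average="macro", zero_division=0
    )
    return {"P": p, "R": r, "F1": f1, "Acc": accuracy_score(labels, predicted)}


if __name__ == "__main__":
    # Word [1-2] and character [2-5] ranges, trained as separate models.
    for analyzer, ngram_range in [("word", (1, 2)), ("char", (2, 5))]:
        model = build_ngram_svm(analyzer, ngram_range).fit(TOY_TEXTS, TOY_LABELS)
        # Scoring on the training data only demonstrates the API;
        # a real run would score the held-out dev and test sets.
        print(analyzer, ngram_range, evaluate(model, TOY_TEXTS, TOY_LABELS))
```

Word and character features are kept in separate pipelines here because the text lists them as separate configurations; combining them in one feature union would be a different experiment.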
We used the LinearSVC implementation by scikit-learn.

Deep Contextualized Transformer Models (BERT): Transformer-based pre-trained contextual embeddings, such as BERT (Devlin et al., 2019), have outperformed other classifiers in many NLP tasks. We used AraBERT (Antoun et al., 2020), a BERT-based model trained on Arabic news. We used the ktrain library (Maiya, 2020), which utilizes the Huggingface implementation, to fine-tune AraBERT. We used a learning rate of 8e-5 and a truncation length of 50, and fine-tuned for 5 epochs.

First, we experiment to distinguish LessImportant tweets from others (see Section 4). Our next set of experiments was designed for fine-grained classification into 13 classes. With an F1 score of 60.5, AraBERT outperformed the other models (Table 4).

Error Analysis: The AraBERT confusion matrix (Fig 6) shows that PRSNL, INFO and RUMOR are the hardest classes to identify, and that the most common error is misclassifying INFO as ADVICE. This suggests that increasing the data size, so that each class has more examples, could help.

We presented the largest publicly available manually annotated dataset of Arabic tweets for 13 classes, which includes the most retweeted tweets in the early days of COVID-19. Following data analysis, we presented models that can reliably identify important tweets and perform fine-grained classification. In the future, we plan to compare our data to data from later days of the pandemic.

Arabic dialect identification in the wild
Fighting the covid-19 infodemic: Modeling the perspective of journalists, fact-checkers, social media platforms, policy makers, and the society
Large arabic twitter dataset on covid-19
The growing amplification of social media: Measuring temporal and social contagion dynamics for over 150 languages on twitter for
Arabert: Transformer-based model for arabic language understanding
Yuning Ding, and Gerardo Chowell. 2020.
A large-scale covid-19 twitter chatter dataset for open scientific research - an international collaboration
The MADAR shared task on Arabic fine-grained dialect identification
Tracking social media discourse about the covid-19 pandemic: Development of a public coronavirus twitter data set
The covid-19 social media infodemic
BERT: Pre-training of deep bidirectional transformers for language understanding
Reem Suwaileh, and Tamer Elsayed. 2020. ArCOV-19: The first Arabic COVID-19 twitter dataset with propagation networks
Ammar Rashed, and Shammur Absar Chowdhury. 2020. ALT submission for OSACT shared task on offensive language detection
Disinformation and misinformation on twitter during the novel coronavirus outbreak
Analysis of misinformation during the covid-19 outbreak in china: cultural, social and political entanglements
Characterizing the propagation of situational information in social media during covid-19 epidemic: A case study on weibo
ktrain: A low-code library for augmented machine learning
Spam detection on arabic twitter
An exploratory study of covid-19 misinformation on twitter
Senwave: Monitoring the global sentiments under the covid-19 pandemic