AraCOVID19-MFH: Arabic COVID-19 Multi-label Fake News and Hate Speech Detection Dataset

Mohamed Seghir Hadj Ameur and Hassina Aliane

2021-05-07

Abstract: Along with the COVID-19 pandemic, an "infodemic" of false and misleading information has emerged and has complicated the COVID-19 response efforts. Social networking sites such as Facebook and Twitter have contributed largely to the spread of rumors, conspiracy theories, hate, xenophobia, racism, and prejudice. To combat the spread of fake news, researchers around the world have made, and are still making, considerable efforts to build and share COVID-19-related research articles, models, and datasets. This paper releases "AraCOVID19-MFH", a manually annotated multi-label Arabic COVID-19 fake news and hate speech detection dataset. Our dataset contains 10,828 Arabic tweets annotated with 10 different labels. The labels have been designed to consider some aspects relevant to the fact-checking task, such as the tweet's check worthiness, positivity/negativity, and factuality. To confirm our annotated dataset's practical utility, we used it to train and evaluate several classification models and reported the obtained results. Though the dataset is mainly designed for fake news detection, it can also be used for hate speech detection, opinion/news classification, dialect identification, and many other tasks.

[...] on a daily basis; hence the need for automatic fake news detection to ease the burden on human fake-news annotators. The task's main goal is to automatically evaluate the degree of truthfulness/trustworthiness of a given claim or news item. Addressing this task requires solving several challenges, such as identifying the factual articles that can be judged as fake or real, estimating their fact-checking worthiness, and assessing their content in terms of hate, racism, threats, etc.

The quality of fake news detection models is heavily dependent on the size and richness of the datasets on which they are trained. Thus, tremendous efforts are continuously being made to create annotated datasets for the task of fake news detection in the context of the emerging COVID-19 pandemic [6, 7, 8]. However, for many low-resource languages, such sophisticated datasets are not available or are not rich enough. In this paper, we release "AraCOVID19-MFH", a manually annotated multi-label Arabic COVID-19 fake news and hate speech detection dataset. Our dataset contains 10,828 Arabic tweets annotated with 10 different labels. The labels have been designed to consider some aspects relevant to the fact-checking task, such as the tweet's check worthiness, positivity/negativity, dialect, and factuality. Though the dataset is mainly designed for fake news detection, it can also be used for hate speech detection, opinion/news classification, dialect identification, and many other tasks. To confirm our annotated dataset's practical utility, we used it to train and evaluate several classification models and reported the obtained results. To the best of our knowledge, there are no Arabic COVID-19 fake news detection datasets that are as large and as rich as the one we are releasing in this paper.

The remainder of this paper is organized as follows: Section 2 presents the fake news detection datasets that have been published in the context of the COVID-19 pandemic.
The details of our dataset collection, construction, and statistics are then provided in Section 3. In Section 4, we present and discuss the tests we performed and the results we obtained. Finally, in Section 5, we conclude our work and highlight some possible future improvements.

Though the coronavirus pandemic appeared only a year ago, it has received a considerable amount of attention from the research community. In this section, we first highlight some datasets that have been released to combat the spread of COVID-19, and then summarize the ones that are most relevant to the task of fake news detection.

Since the pandemic occurred at the end of December 2019, considerable efforts have been made to build large COVID-19 datasets. For instance, Wang et al. [12] published the COVID-19 Open Research Dataset (CORD-19), a dataset containing more than 128,000 scientific papers regarding COVID-19. Kleinberg et al. [13] released the Real World Worry Dataset (RWWD), an annotated COVID-19 emotion dataset containing 5,000 English tweets along with their ground-truth emotion labels. Banda et al. [14] released a large-scale dataset containing 383 million tweets related to COVID-19, gathered from January to June 2020; it includes raw tweets in many languages, with a predominance of English, French, and Spanish. Alqurashi et al. [15] released a dataset containing 3.9 million raw multi-dialect Arabic tweets regarding the COVID-19 pandemic. Their dataset has been collected since January 2020 and is still being updated periodically.

The area of COVID-19 fake news detection has also received much attention from the research community, and multiple datasets have been released. For instance, Cui and Lee [6] created "CoAID", an English fake news detection dataset including 4,251 news articles and 926 social network posts about COVID-19, along with their ground-truth labels. Elhadad et al. [17] published "COVID-19-FAKES", an automatically annotated English and Arabic tweet dataset, collected from February 4 to March 10, 2020, and annotated using 13 machine learning algorithms and seven feature extraction techniques. Shahi and Nandini [7] created "FakeCovid", a multilingual dataset gathered from 105 countries and covering a total of 40 languages. The dataset contains 5,182 verified COVID-19 news articles, mostly written in English; the articles were collected between April and May 2020 from 92 different fact-checking websites and have been manually annotated into two categories, "false" and "others". Zhou et al. [8] published "ReCOVery", a repository designed and built to facilitate research on COVID-19 disinformation detection. They collected a total of 2,029 news articles published between January and May 2020, as well as 140,820 retweets of those articles that reveal how the original articles spread on Twitter. Alam et al. [18] designed a multi-label Twitter dataset for disinformation detection in the context of the COVID-19 pandemic for both the English and Arabic languages. Their proposed dataset was annotated using seven questions that investigate different aspects of each tweet; for instance, the questions were designed to check whether the tweet contains verifiable or false information, whether it has an effect on the general public, whether it needs fact-checking, etc. Haouari et al. [19] presented "ArCOV19-Rumors", an Arabic COVID-19 Twitter dataset for misinformation detection containing 9.4K tweets.
The tweets were manually annotated based on verified claims gathered from popular fact-checking websites. Patwa et al. [20] manually annotated a COVID-19 fake news detection dataset containing 10,700 English social media posts and articles; their dataset uses two ground-truth labels, "real" and "fake". Hossain et al. [21] published "COVIDLies", a binary misinformation dataset containing 6,761 annotated tweets built using 86 different known COVID-19 fake news articles. Alsudias and Rayson [22] collected 1 million COVID-19-related Arabic tweets from Twitter and analyzed them for three different tasks: (1) topic identification, (2) rumor detection, and (3) tweet source prediction. For the rumor detection task, they manually annotated a total of 2,000 tweets using binary labels and used several machine learning algorithms to evaluate the quality of the resulting dataset.

The related study that follows a multi-label scheme most similar to ours is the one published by Alam et al. [18]. However, unlike their dataset, which considers 7 questions and includes a very limited number of instances (around 100-200 instances per class), our dataset includes 10 tasks and significantly more annotated instances (thousands of instances per class). To the best of our knowledge, there are no Arabic COVID-19 multi-label fake news detection datasets that are as large and as rich as the one we are releasing in this paper.

This section first presents the "AraCOVID19-MFH" dataset, its design goals, and the different labels it contains. It then explains the process of tweet collection and annotation and provides the dataset's statistics.

The majority of the existing fake news detection datasets contain only two classes, "real" and "fake" [9]. Though such datasets are far easier to build, their use cases in real-world scenarios are limited. One of their main limitations is caused by non-factual news, which cannot be classified as "real" or "fake". Another limitation of their utility is that they do not take several important aspects into consideration. For instance, they do not consider the fact-checking worthiness of the tweet (or news item), which can be very important in prioritizing the urgent (more harmful) news for manual fact-checking. Our dataset uses multi-label classes (tasks) that consider multiple aspects of each tweet, thus increasing its utility and allowing more accurate Arabic fake news detection models to be built. It considers a total of 10 labels (tasks), described in Table 1.

One of the essential labels for fake news detection is the "Factual" label. It allows the distinction between tweets that can be classified as "real" or "fake" and tweets that contain only personal opinions or thoughts, and are thus not verifiable. Another important label is "Worth fact-checking", which considers the level of danger (or harm) that can be inflicted by the tweet, according to which the tweet can be prioritized for manual fact-checking. We note that our dataset's tasks were designed so that they can be used both simultaneously and independently, depending on the desired task. For example, the "Worth fact-checking" label asks whether the tweet contains an important claim or dangerous content that may be worth manual fact-checking (possible values: Yes, Maybe, No, Can't decide), and the "Contains fake information" label asks whether the tweet contains any fake information. As shown in Table 1, the annotation task involves a set of labels (tasks), each of which investigates a given aspect of the Arabic tweet; a sketch of this schema is given below.
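To make the schema concrete, the following is a minimal sketch of how the tasks and their candidate values could be represented in code. Only the labels explicitly discussed in this paper are listed, and the value sets marked as assumed are illustrative guesses rather than the exact contents of Table 1; the names "TASKS" and "validate_annotation" are hypothetical.

# A minimal sketch of the annotation schema. The value sets marked
# "assumed" are illustrative guesses, not the exact contents of Table 1.
TASKS = {
    "Factual": ["Yes", "No", "Can't decide"],                       # assumed
    "Worth fact-checking": ["Yes", "Maybe", "No", "Can't decide"],  # stated in the text
    "Contains fake information": ["Yes", "No", "Can't decide"],     # assumed
    "Contains hate": ["Yes", "No", "Can't decide"],                 # assumed
    "Dialect": ["MSA", "North African", "Middle Eastern", "Can't decide"],
    # ... the remaining labels of Table 1 follow the same pattern.
}

def validate_annotation(annotation: dict) -> None:
    """Raise an error if a task name or value falls outside the schema."""
    for task, value in annotation.items():
        if task not in TASKS:
            raise KeyError(f"unknown task: {task!r}")
        if value not in TASKS[task]:
            raise ValueError(f"invalid value {value!r} for task {task!r}")

# Example: a factual, non-fake MSA tweet annotation.
validate_annotation({"Factual": "Yes", "Dialect": "MSA",
                     "Contains fake information": "No"})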
In the following, we provide more details about the dataset:

• For the dialect label, we considered only three major values: Modern Standard Arabic (MSA), North African, and Middle Eastern dialects. The North African dialect (also known as Maghrebi Arabic) includes all the Arabic dialects spoken in Morocco, Algeria, Tunisia, Libya, Western Sahara, and Mauritania. The Middle Eastern dialect (also known as Mashriqi Arabic) includes all the Arabic dialects spoken in the Mashriq countries. This coarse dialect classification was adopted to make the manual annotation much easier for annotators who may not be very familiar with the subtle differences between the dialects of individual Arabic countries.

• Both the fake information and the worth fact-checking classes are considered only when the tweets are factual. Thus, if a tweet is not factual, we cannot talk about its fact-checking worthiness, nor about it being "fake" or "real".

• We note that for all the tasks of our dataset, an additional label "Can't decide" is added to give the annotators the option not to make a decision for a given class. For example, if the annotator cannot decide whether the tweet is fake or not, he/she can choose the "Can't decide" label for it.

Table 2 shows some examples of annotated Arabic tweets.

[Table 2: Example of some Arabic tweets along with their respective multi-label annotations.]

Tweets N°2, 3, and 4 (Table 2) are not factual; thus, they cannot be classified as "fake" or "real", unlike tweets N°1 and 5, which are factual. We note that our dataset's released version follows the same structure as shown in Table 2. However, it does not provide the tweets' texts (due to the Twitter regulations); instead, it provides only the tweets' IDs, which can be used to retrieve the full information of each tweet.

The first step we followed to build the dataset was to prepare a set of keywords for each specific task; we then retrieved the tweets based on those keywords. Portions of the keyword lists that we prepared for each task are shown in Table 3. The "Base Corona" category (Table 3) contains the COVID-19-specific keywords that we add to each task's keywords to ensure that the retrieved tweets are related to the COVID-19 pandemic. The retrieved tweets were filtered in the following way (these rules are sketched in code below):

• All the retweets of a given tweet were removed.

• Identical tweets that share the exact same textual content (when ignoring the tweets' links) were removed. This is done to ensure that the text of each considered tweet is unique.

• Very short tweets that contain fewer than 5 Arabic words were filtered out.

• Tweets were gathered within the time period spanning from December 15, 2019, to December 15, 2020.
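As referenced above, the following is a minimal sketch of the four filtering rules. It is an illustration under stated assumptions, not the authors' actual pipeline: each tweet is assumed to be a dict with hypothetical "text", "created_at", and "is_retweet" fields, and the Arabic word test is a simple character-range heuristic.

import re
from datetime import datetime

ARABIC_WORD = re.compile(r"[\u0621-\u064A]+")  # basic Arabic letter range
START, END = datetime(2019, 12, 15), datetime(2020, 12, 15)

def filter_tweets(tweets):
    """Apply the four filtering rules described in the text.

    Each tweet is assumed to be a dict with hypothetical "text",
    "created_at" (datetime), and "is_retweet" fields.
    """
    seen_texts, kept = set(), []
    for tweet in tweets:
        if tweet["is_retweet"]:                        # rule 1: drop retweets
            continue
        text = re.sub(r"https?://\S+", "", tweet["text"]).strip()
        if text in seen_texts:                         # rule 2: drop exact duplicates (links ignored)
            continue
        if len(ARABIC_WORD.findall(text)) < 5:         # rule 3: drop tweets with < 5 Arabic words
            continue
        if not (START <= tweet["created_at"] <= END):  # rule 4: keep only the study's time window
            continue
        seen_texts.add(text)
        kept.append(tweet)
    return kept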
After the filtering step, we ended up with a total of 300k unique Arabic tweets related to the COVID-19 pandemic. Due to the high cost of the annotation task, we only required each tweet to be annotated by one expert annotator. This allowed us to annotate a total of 10,828 Arabic tweets out of the 300k gathered tweets. We plan to perform a second annotation phase (re-annotation) at a later time, in which each existing tweet will be further examined and annotated by at least two other annotators, and the remaining non-annotated tweets will be gradually annotated.

The manual annotation task was carried out while taking into account the following guidelines and instructions:

• For each tweet, the annotator was provided with the full text of the tweet, including its links, and asked to read the tweet, check the tweet's links if necessary, and annotate it for each one of the 10 labels (tasks). This results in a dataset in which each tweet is labeled for each one of the 10 tasks (as shown in Table 2).

• For the "Contains fake information" label, the annotator needs to verify the validity of the tweet's claims or news from trusted online sources.

• The dialect classification is based solely on the tweet's text. We do not use the tweets' geolocation data, because it is not always correct, and also because we believe that the distinction between the considered "MSA", "North African", and "Middle Eastern" classes can easily be made from the text alone.

• If a tweet contains both news and opinion parts, the annotator needs to make the classification based on the more significant part, or can choose the "Can't decide" value.

• For the "Contains hate" label, we consider hate in a broad sense, which includes racism and any type of offensive speech.

• We give the annotator the option not to make a decision for any given task by selecting the value "Can't decide". The tweet will then be left unlabeled for that specific task.

The statistics of our "AraCOVID19-MFH" dataset are provided in Table 4. As shown in Table 4, most of the considered tasks contain more than 1,000 instances for each of their values, which helps train robust classification models. For the "Worth fact-checking" and "Contains fake information" tasks, the number of tweets annotated with the "Can't decide" value is very high, because a large portion of our dataset's tweets is not factual and, as stated at the end of Section 3.1, non-factual tweets cannot be annotated for either of those two tasks.

Our tests aim to assess the quality of our constructed dataset and to provide baseline results for each of its tasks. To this end, several deep learning models are trained and tested on each of the dataset's tasks. In the following, we first present the Arabic preprocessing that we performed and the different pretrained deep learning models that we considered; we then report and discuss the results of the tests we performed.

We applied a basic preprocessing to all the collected Arabic tweets (a code sketch is given below), which includes:

• The removal of diacritical marks.

• The removal of elongated and repeated characters.

• Arabic character normalization.

• The removal of links and user references (user notifications).

• Tweet tokenization, in which punctuation, words, and numbers are separated.

We note that this preprocessing has been used only when training the transformer models (Section 4.2); it has been used neither for the annotation task nor in the final dataset.
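The following is a minimal sketch of such a preprocessing function. The paper does not specify its exact normalization rules, so the character mappings below (alef variants, ta marbuta, alef maqsura) are common Arabic NLP conventions assumed for illustration, and the function name "preprocess" is hypothetical.

import re

DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")  # tanween, short vowels, shadda, sukun, dagger alef
TATWEEL    = re.compile(r"\u0640+")                # elongation (tatweel) character
REPEATED   = re.compile(r"(.)\1{2,}")              # runs of 3+ identical characters

def preprocess(tweet: str) -> str:
    text = re.sub(r"https?://\S+", " ", tweet)     # remove links
    text = re.sub(r"@\w+", " ", text)              # remove user references
    text = DIACRITICS.sub("", text)                # remove diacritical marks
    text = TATWEEL.sub("", text)                   # remove elongation
    text = REPEATED.sub(r"\1", text)               # collapse repeated characters
    # Arabic character normalization (assumed, commonly used mappings).
    text = re.sub("[إأآ]", "ا", text)
    text = text.replace("ى", "ي").replace("ة", "ه")
    # Tokenization: separate punctuation and digit runs from words.
    text = re.sub(r"([^\w\s])", r" \1 ", text)
    text = re.sub(r"(\d+)", r" \1 ", text)
    return " ".join(text.split())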
Pretrained transformer models have recently been used in many NLP tasks and have continuously achieved new state-of-the-art results [23]. In the following, we describe the five transformer models that we considered in our tests. We start from three pretrained transformer models:

• AraBERT: a BERT (Bidirectional Encoder Representations from Transformers) model [23] pretrained on 200 million Arabic MSA sentences gathered from different sources [24].

• Multilingual BERT (mBERT): a BERT-based model [23] pretrained on the 104 languages with the largest Wikipedias.

• DistilBERT Multilingual: a smaller, distilled version of mBERT, trained using knowledge distillation [25].

None of the three aforementioned transformer models has been trained on dialectal Arabic, so they may have difficulty dealing with Arabic dialects. For this reason, we performed a further pretraining (language-model fine-tuning) of the AraBERT and mBERT models using 1.5 million tweets from the "Large Arabic Twitter Dataset on COVID-19" [15], which contains raw multi-dialect Arabic tweets regarding the COVID-19 pandemic. This was done to make the transformer language models more familiar with the Arabic COVID-19 multi-dialect vocabulary. We named the resulting two models "AraBERT COV19" and "mBERT COV19". With these two models, we end up with a total of five models for our experiments.

Pretrained transformer models can often achieve better results when their weights are fine-tuned on the classification task [26]. For our tests, we investigate two training scenarios:

1. Without fine-tuning: we train our five considered classification models without allowing their weights to be changed during the training process.

2. With fine-tuning: we train our five considered classification models while allowing their weights to be updated (fine-tuned) to maximize the classification performance.

The implementation of the different models has been done using the following libraries:

• Scikit-learn [27]: a Python-based machine learning library. We used it to evaluate the performance of our models.

• Flair [28]: a framework for building state-of-the-art NLP models. We used it to train our classification models.

• Huggingface Transformers [29]: a framework for building and pretraining different state-of-the-art NLP models. We used it to further pretrain the transformer language models on the new Arabic COVID-19 data.

• PyTorch [30]: an open-source library designed for implementing deep neural networks. We used it as a backend for both the Huggingface Transformers and Flair frameworks.

To evaluate the performance of our five considered classification models, we used a stratified 5-fold cross-validation method (sketched below). The instances of each of our dataset's tasks are randomly partitioned into 5 disjoint sets of equal size, and five experiments are performed; in each one, one of the five sets is selected for testing, and the remaining four are used for training. For each experiment, the weighted F-score is calculated, and finally, the average F-score over the five experiments (the 5 folds) is reported.
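The following sketches this evaluation protocol using scikit-learn. The "train_model" callable is a hypothetical placeholder for the Flair training step described above; only StratifiedKFold and f1_score are actual library APIs.

import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

def cross_validate(texts, labels, train_model, n_splits=5, seed=42):
    """Stratified k-fold cross-validation reporting the average weighted F-score.

    "train_model(train_texts, train_labels)" is a hypothetical placeholder
    that must return a fitted model exposing "predict(texts)"; in the paper,
    this role is played by a Flair text classifier.
    """
    texts, labels = np.asarray(texts), np.asarray(labels)
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in folds.split(texts, labels):
        model = train_model(texts[train_idx], labels[train_idx])
        predictions = model.predict(texts[test_idx])
        scores.append(f1_score(labels[test_idx], predictions, average="weighted"))
    return float(np.mean(scores))  # average weighted F-score over the folds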
Our tests examine two important aspects. First, we compare the transformer models' performance when their weights are fine-tuned on the classification task and when their weights are kept unchanged (with and without fine-tuning). Second, we investigate the effect of the additional COVID-19 pretraining that we incorporated for the AraBERT and mBERT models. The results of these two experiments are given in Table 5.

Regarding the first experiment, we can observe that fine-tuning the models' weights during the training phase of the classification was extremely helpful for all the tested transformer models, with an improvement ranging between 0.1 and 0.5 F-score.

For the second experiment, we can see that the additional training performed for the "AraBERT" and "mBERT" transformer models using 1.5 million multi-dialect Arabic COVID-19 tweets was helpful both when the transformers' weights were and were not updated. Indeed, we can observe that the "AraBERT COV19" and "mBERT COV19" models achieved the best performance, reaching more than a 0.92 F-score across all the tested tasks. This confirms that additional pretraining of the transformer models on task-specific data is helpful for the downstream classification tasks. The quality of the obtained results reflects the importance of having a large annotated dataset and confirms the practical utility of our adopted annotation schema.

We have presented and released "AraCOVID19-MFH", an Arabic COVID-19 multi-label fake news and hate speech detection dataset. The dataset contains 10,828 Arabic tweets; each tweet is annotated with 10 labels. The labels were designed to consider different aspects of each tweet, such as its check worthiness, positivity/negativity, dialect, factuality, etc. Though the dataset is mainly designed for fake news detection, it can also be used for hate speech detection, opinion/news classification, dialect identification, and many other tasks. All the dataset's tweets have been manually annotated and validated by human annotators. The quality of the final annotated dataset has been evaluated using several pretrained transformer models. Two transformer models were further pretrained using COVID-19 data and achieved the best classification results across all the tested categories. Both of these transformer models, along with the full annotated dataset, are freely available for research purposes. As future work, we plan to continue enriching our annotated dataset to make it larger and to keep it up to date with the latest events and discussions shared on Twitter regarding the COVID-19 pandemic.

References

[1] Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and coronavirus disease-2019 (COVID-19): the epidemic and the challenges.
[2] Coronavirus diseases (COVID-19) current status and future perspectives: a narrative review.
[3] COVID-19-related infodemic and its impact on public health: a global social media analysis.
[4] Have you been a victim of COVID-19-related cyber incidents? Survey, taxonomy, and mitigation strategies.
[5] The causes and consequences of COVID-19 misperceptions: understanding the role of news and social media.
[6] CoAID: COVID-19 healthcare misinformation dataset. arXiv e-prints.
[7] FakeCovid: a multilingual cross-domain fact check news dataset for COVID-19.
[8] ReCOVery: a multimodal repository for COVID-19 news credibility research.
[9] A survey of fake news: fundamental theories, detection methods, and opportunities.
[10] Fake news detection on social media: a data mining perspective.
[11] Automated fact checking: task formulations, methods and future directions.
[12] CORD-19: the COVID-19 open research dataset.
[13] Measuring emotions in the COVID-19 real world worry dataset.
[14] A large-scale COVID-19 Twitter chatter dataset for open scientific research: an international collaboration.
[15] Large Arabic Twitter dataset on COVID-19.
[16] COVID-19 open source data sets: a comprehensive survey.
[17] COVID-19-FAKES: a Twitter (Arabic/English) dataset for detecting misleading information on COVID-19.
[18] Fighting the COVID-19 infodemic: modeling the perspective of journalists, fact-checkers, social media platforms, policy makers, and the society.
[19] ArCOV19-Rumors: Arabic COVID-19 Twitter dataset for misinformation detection.
[20] Fighting an infodemic: COVID-19 fake news dataset.
[21] COVIDLies: detecting COVID-19 misinformation on social media.
[22] COVID-19 and Arabic Twitter: how can Arab world governments and public health organizations learn from social media?
[23] BERT: pre-training of deep bidirectional transformers for language understanding.
[24] AraBERT: transformer-based model for Arabic language understanding.
[25] DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.
[26] How to fine-tune BERT for text classification?
[27] Scikit-learn: machine learning in Python.
[28] Contextual string embeddings for sequence labeling.
[29] Transformers: state-of-the-art natural language processing.
[30] PyTorch: an imperative style, high-performance deep learning library.