key: cord-0592165-rv2xke8g authors: Alqurashi, Sarah; Hamoui, Btool; Alashaikh, Abdulaziz; Alhindi, Ahmad; Alanazi, Eisa title: Eating Garlic Prevents COVID-19 Infection: Detecting Misinformation on the Arabic Content of Twitter date: 2021-01-09 sha: 62e2194dda52e13ddd64b8f9a12efcfa999d686c doc_id: 592165 cord_uid: rv2xke8g

The rapid growth of social media content during the current pandemic has provided useful tools for disseminating information, but it has also become a breeding ground for misinformation. There is therefore an urgent need for fact-checking and for effective techniques to detect misinformation on social media. In this work, we study misinformation in the Arabic content of Twitter. We construct a large Arabic dataset related to COVID-19 misinformation and gold-annotate the tweets into two categories: misinformation or not. We then apply eight different traditional and deep machine learning models with different features, including word embeddings and word frequency. The word embedding models (FastText and word2vec) exploit more than two million Arabic tweets related to COVID-19. Experiments show that optimizing the area under the curve (AUC) improves the models' performance, and that Extreme Gradient Boosting (XGBoost) achieves the highest accuracy in detecting COVID-19 misinformation online.

The new coronavirus pandemic was accompanied by a large and rapid spread of rumors, false information, and fake news. Misinformation has existed for years and usually flourishes around important issues such as health outbreaks, climate change, and vaccination. Human crises are fertile ground for misinformation, as happened during the Zika [1] and Ebola [2] outbreaks, among others. Misinformation is further intensified during sudden and intense crises such as the COVID-19 pandemic, and in the modern era social media has helped magnify its spread among individuals.
In recent times, there has been a global surge in the spread of information in general, and of COVID-19-related misinformation in particular, across various social media platforms. This unprecedented amount of information poses serious public health challenges, especially concerning infectious diseases, and prompted the World Health Organization (WHO) to warn against the infodemic: a massive amount of correct and incorrect information that makes it difficult for individuals to access reliable information and credible guidance when needed [3]. This phenomenon, in turn, leads to a fast and easy spread of fake and unreliable information, especially on social media, which facilitates the diffusion of misinformation. Several conspiracy theories about the origins of the COVID-19 virus have spread on Arabic social media, all sharing the common idea that the virus is a biological weapon. This misinformation started from social media accounts with no reliable proof to back their claims. Misleading information about the virus's symptoms, supposed cures, and ways to reduce its transmission also circulates on social media. For example, one widespread claim was that home remedies such as taking vitamin C and eating garlic can treat and prevent COVID-19 infection, despite a complete lack of evidence. Although some home remedies are harmless, others can be very dangerous. While such misinformation serves its promoters' interests, it also harms societies, especially since a high percentage of individuals depend on social media platforms for information and news. Research has shown that the more individuals are exposed to false information and fake news, the more likely they are to accept and believe it [4]. Misinformation confuses people and harms individuals' health. It may also incite violence, discrimination, or hostility against specific groups in society. Furthermore, it may obstruct efforts to control the current health crisis.
Twitter is one of the most used social networking sites in the Arab world and has become a tool for spreading misinformation regarding COVID-19. A recent study shows that false information spreads six times faster than correct information on Twitter [5], which makes it challenging to find accurate information on the platform and contributes to increased mental distress and anxiety during the pandemic. One concerning factor is that the spread of misinformation on Twitter is not limited by physical distance. Conspiracy theories and other false and misleading information may first appear on Twitter, then reach a much larger audience as they are amplified by social media influencers and by reports on unreliable media sites, which reduces the effectiveness of officials' attempts to slow their spread. As a result, individuals around the world are affected mentally and physically by misinformation. The World Health Organization has teamed up with prominent social media platforms such as Facebook, Twitter, and YouTube to fight the infodemic by verifying information and providing evidence-based information to the public [6]. Despite the efforts made by different entities around the globe, including the WHO, governments, and social media sites, misinformation continues to spread widely. The problem lies in the difficulty of detecting and correcting misinformation in Arabic content before it spreads further. The Arabic language also poses a challenge because it has many dialects and a rich vocabulary, so misleading information can exist in more than one dialect, making it harder to detect. As a result, there is an urgent need for systems capable of automatically identifying misinformation in Arabic content. In this work, we investigate detecting Arabic misinformation on Twitter using natural language processing and machine learning.
Our contributions to this area are summarized as follows:
• We extract a sample of tweets from a large Arabic dataset related to the COVID-19 pandemic and use human annotators to label the sample. With high-quality, human-powered annotation, we can estimate the credibility of the considered tweets automatically.
• We build two Arabic word embedding models, using FastText and word2vec, based on more than two million Arabic tweets related to COVID-19, for a comparative analysis between the classifiers.
• We examine the prediction performance of five traditional classifiers: Random Forests (RF), Extreme Gradient Boosting (XGB), Naive Bayes (NB), Stochastic Gradient Descent (SGD), and Support Vector Machines (SVM), with different features, in addition to three deep learning classifiers: CNN, RNN, and CRNN.
• We improve the performance of all models by optimizing the area under the curve (AUC), using grid search for the traditional classifiers and an AUC loss function for the deep learning models.
To reduce misinformation on social media, it is essential to understand what the term misinformation means. Some scholars describe misinformation as false and inaccurate information that is transmitted unintentionally [7]. Ordinary users usually spread this type of misinformation because of their confidence in the information source, whether someone they are personally acquainted with or an influential user on their social network. They share the misinformation to inform people around them about a specific situation or story because they believe it is true. In contrast, disinformation is false and inaccurate information that is transmitted intentionally [7], usually by a group of people, writers, or even publishers with a common goal of deceiving the public. Disinformation includes conspiracy theories, fake news, and spam. The outcome of mis- and disinformation is the same, whether it is published intentionally or not.
On social media, where users can post anything, it is difficult for researchers to determine whether a piece of information was created intentionally or not. Therefore, misinformation has been used as an umbrella term for all false and inaccurate information, regardless of the goal or intention [8]. This umbrella term includes fake news, a type of misinformation that mimics traditional news; rumors, unverified information that may turn out to be correct; and spam, unwanted information that exhausts its recipients [8, 9]. These types of misinformation share a negative impact that extends to every aspect of life and may have social and economic consequences. Furthermore, misinformation has a significant impact on emergency response during disasters: it misleads and confuses public opinion and threatens public security and community stability, especially in the absence of immediate intervention to combat it [9]. Owing to the ease of use of social media, the spread of misinformation has greatly expanded in reach, affecting not only personal life but also society and even the economy. One example of misinformation with a negative effect is the spread of inaccurate information related to vaccination. Anti-vaccination groups claim that vaccines cause autism, which created fear of vaccination among many parents, making them refuse, or at least hesitate, to vaccinate their children and causing an unprecedented increase in preventable diseases [10]. The fear of vaccination continued during the global COVID-19 pandemic, as conspiracy theories spread through social media platforms claiming that the COVID-19 vaccine contained a chip that controls humans. The sheer amount of data on social media makes it difficult to distinguish between misleading and accurate information. Therefore, identifying misinformation on social media has been a popular research topic in recent years.
Many studies in the English language have examined the presence of misinformation on social media, such as detecting rumors [11], fake news [12], spam [13], and health misinformation [1, 2]. However, most Arabic-language research has focused on assessing the credibility of news disseminated on Twitter. Often, the tweets were annotated based on an annotator's judgment, and machine learning models were built on user features, content features, or a combination of both [14, 15]. Some studies added new features such as sentiment analysis [16, 17], the polarity of user replies [18], the similarity between username and display name [19], and TF-IDF [20]. The work in [21] used content and user features to detect Arabic rumors on Twitter using semi-supervised expectation-maximization (E-M); the proposed model achieved an F1 score of 80%. However, little work so far has focused on detecting and tracking health misinformation in the Arabic language. Recently, a study tackling the detection of Arabic cancer-treatment rumors on Twitter was presented in [22]. It utilized ten machine learning models with TF-IDF features over different n-grams, extracted from a dataset of 208 annotated tweets. An oversampling technique was applied to the dataset, and the random forest model with oversampling and 5-gram TF-IDF features achieved an F1 score of 0.86. There is also a growing body of work on the COVID-19 infodemic in social media. The evolution of misinformation was studied on the Weibo platform [23] using misinformation identified by fact-checking platforms. Another study [24] examined the identification of misinformation videos on YouTube using NLP and machine learning. Furthermore, the work in [25] analyzed the evolution of opinion regarding COVID-19 in a Singapore Telegram group chat.
The vast majority of COVID-19 infodemic studies on social media have focused on Twitter, largely because Twitter is one of the most popular platforms and provides access to a large amount of content in many languages. Along this line, many studies of misinformation on Twitter analyzed the content of tweets to understand Twitter conversations during COVID-19 [26, 27, 28]. To study the development of conversation around misinformation on Twitter, Singh et al. [26] collected five common pieces of COVID-19 misinformation, concerning the virus's origin, vaccine development, comparisons with the flu, the claim that heat kills the disease, and home remedies. Each tweet was assigned to the corresponding misinformation based on the words and phrases it contained. The authors observed an increase in the conversation around this misinformation since January 2020. In [29], a study of the dissemination of misleading and reliable COVID-19 information on Twitter using communicative content analysis showed that misleading information was less likely to be retweeted than accurate information. Several studies have relied on fact-checking websites for ground truth data. In [30], the authors collected COVID-19-related tweets mentioned in fact-checking articles to study the sources of misinformation and how it spreads, using retweet speed as a proxy for propagation speed; their work suggests that misinformation propagates faster than accurate information. Another study of how misinformation content spread over five months on Twitter was presented in [31]. On a different note, the work in [27] presents a measure of tweet credibility based on user specialty and occupation. Considerable work has also focused on the quality of the links and information sources found in tweets in many languages (e.g., English [26], Italian [32]).
The links were examined and classified as reputable sources or not, using fact-checking websites [32] and well-known domains [26]; low-quality links were used in tweets less often than high-quality links. Researchers also studied the types of accounts that help spread false information about COVID-19. The role and behavior of bot accounts on Twitter during COVID-19 were analyzed in [33, 34], where it was shown that Twitter bots participate in misinformation propagation, either for political or marketing gain [33]. Machine learning techniques have also been adapted to detect misinformation. The work in [35] discussed the challenges in designing and developing AI solutions for infodemic detection; its authors presented a tool that estimates whether an article contains misinformation based on a URL checker, a fake news classifier, and a website matcher. A misleading-information detection system was presented in [36]; it relies on fact-checking websites and data from international organizations, and is built as an ensemble of 10 machine learning models with 7 feature extraction techniques. Another study [37] applied ensemble machine learning techniques to Twitter misinformation using user-level and tweet-level features; the most accurate models were SVM and random forest. Most of this research has focused on the English language, and there are very few studies on Arabic. The work in [38] applied SVM, FastText, and BERT to 218 Arabic tweets and 504 English tweets; the FastText model provided the best results for Arabic text. The work in [39] studied the Arabic conversation on Twitter by applying topic modeling. Machine learning models such as logistic regression, support vector machines, and naive Bayes were used on 2,000 labeled tweets to build a rumor detection system; the highest accuracy, 84%, was achieved by a logistic regression classifier with count-vector features.
They also found that rumors are usually written in an academic style and promoted by fake health professionals. Another study of COVID-19 misinformation was presented in [40]. The study published a large, manually annotated dataset of Arabic tweets related to COVID-19, labeled with 13 classes but including only 421 rumors. The authors employed machine learning and transformer models using Mazajak embeddings and TF-IDF n-grams over words and characters; the best model was SVC with TF-IDF character n-grams, achieving an F1 score of 0.79. In all previous studies investigating Arabic misinformation content on social media, the datasets used were very limited. In this work, we construct one of the largest datasets of Arabic misinformation tweets. We provide a comparative analysis between classifiers using TF-IDF and Arabic word embedding models built on more than two million Arabic tweets related to COVID-19. Furthermore, we optimize the area under the curve (AUC) to further improve the models' accuracy. The proposed system comprises several stages, shown in Figure 1. It begins by collecting tweets using the Twitter streaming API and ends with evaluating and comparing the models' performance. In the remainder of this section, we describe the steps in detail. We collected a large number of Arabic tweets using the Twitter streaming API and the Tweepy Python library over four months, from January 1, 2020, to April 30, 2020. We extracted tweets based on a list of the most common Arabic keywords associated with COVID-19, filtering the Twitter stream for the Arabic language to obtain tweets relevant to the pandemic. Table 2 shows the list of Arabic keywords used to collect tweets about COVID-19. The dataset contains 4,514,136 tweets.
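For illustration, the keyword matching underlying this collection step can be sketched as follows. The keywords shown here are a small, illustrative subset rather than the authors' full list from Table 2:

```python
# Sketch of the keyword filter applied to the Arabic tweet stream.
# The keyword list below is illustrative; the full list appears in Table 2.
KEYWORDS = ["كورونا", "كوفيد", "فيروس"]  # "corona", "COVID", "virus"

def is_covid_related(text: str) -> bool:
    """Return True if the tweet text mentions any tracked keyword."""
    return any(kw in text for kw in KEYWORDS)

stream = ["فيروس كورونا ينتشر بسرعة",  # "the coronavirus spreads fast"
          "الطقس جميل اليوم"]           # "the weather is nice today"
relevant = [t for t in stream if is_covid_related(t)]
```

In the actual pipeline, this matching is performed server-side by passing the keyword list to the streaming endpoint's track filter rather than filtering locally.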
We store the full tweet object, including the timestamp, the tweet id, user profile information (such as the number of followers), and the tweet's geolocation, in a MongoDB NoSQL database. The dataset is available online on GitHub 2. When dealing with Arabic data, it is important to recognize the rich cultural and linguistic diversity across the Arab region, which translates into challenges (e.g., dialects) that must be addressed during model development. It is also essential to consider the general features of Twitter data. For example, tweet size is limited to 280 characters. Despite this, the content of tweets is varied and can consist of text, symbols, URLs, pictures, and videos. Furthermore, Twitter users tend to use informal writing to reduce the text's length while keeping it comprehensible. Twitter data also contains a large number of spelling errors and does not necessarily follow the language's formal structure, making it very noisy. Accordingly, it is essential to apply some pre-processing to the raw text before feeding it to the classifiers. We perform the following pre-processing steps on the tweets:
• We removed non-Arabic words.
• We removed special characters such as #, %, and &.
• We removed URLs.
• We removed Arabic diacritics and punctuation marks.
• We performed text correction using the TextBlob Python library [41].
• We normalized the Arabic text by mapping letter variants to a single canonical form (e.g., unifying the different written forms of alef).
• We removed character repetitions (e.g., collapsing elongated words to their base form).
• We removed stop words (e.g., the Arabic equivalents of "from", "to", and "in").
• We performed word stemming to convert each word to its root using the farasapy library [42].
Misinformation identification: This work copes with health-related misinformation detection by relying on trusted sources of information. A recent work [43] shows that the official account of the Ministry of Health in Saudi Arabia was among the most influential accounts in March 2020. Hence, we collected the false information reported on both the World Health Organization (WHO) website and the website of the Saudi Ministry of Health. Table 3 shows a sample of tweets containing misinformation.
Dataset annotation: Our misinformation dataset is sampled from tweets collected from early March 2020 to the end of April 2020. To narrow down the set of tweets without misinformation content, we use a procedure similar to that of [1]. We first manually crafted a set of terms that best describe the different pieces of misinformation, then retrieved tweets related to those terms (e.g., "Vitamin C", "Sarin gas", "Mosquitoes", and "Biological warfare"). The retrieved tweets were combined into one dataset and labelled by two volunteer native Arabic speakers. Before labeling, the annotators reviewed the list of collected misinformation. Due to the substantial manual effort involved, each tweet was labeled by exactly one annotator. Tweets containing misinformation were labeled "1" and all others "0". In total, our misinformation dataset consists of 8,786 Arabic tweets containing 36,198 unique words after pre-processing. Overall, the labelled dataset covers significant misleading and inaccurate content that circulated widely among Arabic tweeters during March and April.
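A minimal sketch of the text-cleaning steps listed earlier is shown below. It covers URL removal, diacritic stripping, letter normalization, and repeated-character collapsing; the spelling-correction (TextBlob) and stemming (farasapy) steps are omitted, and the normalization mappings shown are common conventions assumed for illustration rather than the authors' exact list:

```python
import re

# Unicode range U+064B-U+0652 covers the Arabic diacritics (tanween,
# shadda, sukun, etc.); U+0600-U+06FF is the main Arabic block.
ARABIC_DIACRITICS = re.compile(r"[\u064B-\u0652]")

def preprocess(text: str) -> str:
    """Sketch of the tweet cleaning pipeline (no spell-check or stemming)."""
    text = re.sub(r"https?://\S+", " ", text)        # remove URLs
    text = ARABIC_DIACRITICS.sub("", text)           # strip diacritics
    text = re.sub(r"[^\u0600-\u06FF\s]", " ", text)  # drop non-Arabic characters
    text = re.sub("[إأآ]", "ا", text)                # unify alef variants (assumed mapping)
    text = text.replace("ة", "ه").replace("ى", "ي")  # common normalizations (assumed)
    text = re.sub(r"(.)\1{2,}", r"\1", text)         # collapse elongated characters
    return " ".join(text.split())
```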
The number of tweets containing misinformation in April (709 tweets) was higher than in March (602 tweets). Table 4 shows general statistics about the dataset. Recall that a tweet labeled "1" by the annotators is considered (Misinformation) and a tweet labeled "0" is considered (Other). From Table 4, we observe that the dataset is unbalanced: the majority class (Other) has 7,475 more tweets than the minority class (Misinformation). The misinformation dataset is freely accessible on GitHub. 3
Table 2: The list of keywords that we used to collect the tweets (keyword, date added): One and a half meters (2020-04-01); Quarantine activities (2020-04-01); Quarantine (2020-04-01); Malaria medicine (2020-04-25); Remdesivir (2020-04-25); Curfew lift (2020-04-26); Partial curfew (2020-04-26); Active surveillance (2020-04-29); Active testing (2020-04-29).
The next step transforms the pre-processed tweet texts into feature vectors: for each tweet, the feature vector is constructed from a vectorization or a word embedding. From the tokenized word vectors, we build the feature vectors using TF-IDF [44] and word embedding techniques [45, 46]. The concept behind each is briefly explained as follows:
• Sparse Vector Based on TF-IDF: In this representation, the importance of a term (n-gram) in a tweet is evaluated relative to the whole dataset. The method gives high weights to terms that are specific to some tweets and decreases the weight of words that occur frequently across the whole dataset. It combines term frequency (TF) and inverse document frequency (IDF) and is computed using Equation 1:
TF-IDF(i, j) = TF(i, j) × log(N / DF(i)) (1)
where TF(i, j) is the number of occurrences of term i in tweet j, N is the total number of tweets, and DF(i) is the number of tweets containing term i. We constructed TF-IDF vectors twice: once with unigrams and once with n-grams.
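As a concrete illustration of Equation 1, the weight of a term can be computed directly on a toy corpus; the token lists below are invented examples, not tweets from the dataset:

```python
import math

# Worked example of the TF-IDF weight from Equation 1 on a toy corpus
# of three "tweets" represented as token lists.
tweets = [["garlic", "cures", "virus"],
          ["garlic", "garlic", "soup"],
          ["wash", "hands"]]

def tf_idf(term, tweet, corpus):
    tf = tweet.count(term)                    # occurrences of term in this tweet
    df = sum(1 for t in corpus if term in t)  # tweets containing the term
    return tf * math.log(len(corpus) / df)

w = tf_idf("garlic", tweets[1], tweets)  # TF = 2, DF = 2, N = 3
```

A term that appears in every tweet gets weight 0 (log of 1), which is exactly the down-weighting of frequent words described above.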
We obtained for each tweet a sparse vector of dimension 5000 for both the unigram and n-gram settings, using Scikit-learn for the implementation.
• Word Embeddings Creation: In natural language processing, word embedding refers to techniques that map words or phrases to vectors of real numbers. Word embedding methods represent words as continuous vectors in a low-dimensional space. These vectors capture semantic relations between words: words with similar meanings have vectors close to each other. Building a word embedding model on a large-scale training dataset is important to obtain meaningful embeddings [47]. We built word vector models exploiting our whole COVID-19 dataset collected from January 2020 to April 2020; after removing retweets and duplicated tweets, we ended up with 2,821,940 tweets. We consider two notable word embedding methods, word2vec and FastText, and adopt the pre-processing pipeline of [48] to train our models. We investigate the following two types of word embeddings in this work:
- word2vec [45]: This is probably the most widely used technique for learning word embeddings, utilizing a shallow feed-forward neural network. To build the word2vec model, we take into account that the maximum length of a tweet is 280 characters and therefore use a small context window size W = 3. The model is trained using the CBOW algorithm with dimension D = 200. For the remaining parameters, we set the batch size to 50, negative sampling to 10, the minimum word frequency to 5, and the number of iterations to 5.
- FastText [49]: In FastText, the smallest unit is the character-level n-gram, and each word is represented as a bag of character n-grams. This representation helps capture the meaning of shorter words and allows the extraction of all prefixes and suffixes of a given word. For this reason, FastText has been shown to be more accurate and effective compared to word2vec [50]. To train the FastText model, we used a small window of size 3 with a dimension of 200.
We set the minimum word frequency to 5 and the number of iterations to 5. It is worth noting that very few studies have developed FastText models for the Arabic language. General-purpose embedding models, such as the FastText model of [46], were trained on Arabic Wikipedia articles written in Modern Standard Arabic (MSA); hence, such models do not perform well on Twitter datasets. In the literature, the Arabic misinformation studies that employed FastText used it in an unsupervised manner to produce feature vectors in [39], and in a supervised manner to predict class labels in [38]. However, there is a lack of detail about the training set sizes and parameters used for the word embedding model in [39]. A recent study showed that unsupervised pre-training of FastText on domain-specific data can improve classification quality over the supervised variant, particularly when labeled data is limited [51]. Hence, in our classification models we opted to employ FastText models pre-trained in an unsupervised manner. We use the Gensim [52] implementations of the word2vec and FastText tools, and we follow the scheme of [53] to build the tweet-level representation for the machine learning models. Given the word2vec or FastText model, we obtain the representation of each tweet by averaging the word vectors of all its words, as follows:
v(tweet) = (1/n) Σ_{i=1}^{n} W_i
where n is the number of words in the tweet and W_i is the embedding of the i-th word. This representation retains the number of dimensions (D = 200) of the word embedding models. The word embedding models used in our work are made freely available 4. To automatically predict COVID-19 misinformation in Arabic Twitter, we used several types of classifiers, which we present in detail in this section.
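The tweet-level averaging described above can be sketched as follows, using a toy embedding lookup in place of the trained word2vec/FastText models (D = 200 in the paper; D = 4 here for readability):

```python
import numpy as np

# Toy embedding table standing in for the trained word2vec/FastText model.
D = 4
toy_embeddings = {
    "كورونا": np.array([1.0, 0.0, 0.0, 0.0]),   # "corona"
    "فيروس":  np.array([0.0, 1.0, 0.0, 0.0]),   # "virus"
}

def tweet_vector(tokens, embeddings, dim=D):
    """Average the embedding vectors of a tweet's in-vocabulary words."""
    vecs = [embeddings[w] for w in tokens if w in embeddings]
    if not vecs:                 # fully out-of-vocabulary tweet
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

v = tweet_vector(["فيروس", "كورونا"], toy_embeddings)
```

With a real Gensim model, the lookup `embeddings[w]` would be replaced by the model's keyed-vector access, but the averaging itself is identical.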
The first type comprises traditional (i.e., not deep) classifiers: support vector machine (SVM), multinomial naive Bayes (NB), Extreme Gradient Boosting (XGBoost), random forest (RF), and Stochastic Gradient Descent (SGD). We used the implementations of these classifiers from the scikit-learn library [54]. The second type comprises deep learning models: a convolutional neural network (CNN), a recurrent neural network with bidirectional long short-term memory (RNN BiLSTM), and a convolutional recurrent neural network (CRNN). We implemented these classifiers in PyTorch [55]. Each deep learning model consists of an input embedding layer, hidden layers, a dense output layer, and an activation function. The embedding layer is the first layer of our deep learning models and creates a dense vector representation from the input text sequence. It can be initialized with a pre-trained word embedding model or learned while training the model. We experiment with three types of embedding layers: in the first experiment, the weights of the embedding layer are initialized randomly and the embeddings for all words in the dataset are learned during training; the second and third embedding layers are initialized with the weights of the pre-trained word2vec and FastText models, respectively. Once the embedding layer maps each text sequence to a vector representation, this representation is fed into the classifier. The dense output layer takes the number of available categories as its output dimension. We used the sigmoid activation function and the cross-entropy loss function. In the following, we describe the model structures.
• Convolutional Neural Network (CNN): In the CNN model, we use a one-dimensional convolution layer with multi-scale kernels of sizes 4 and 5 and a fixed number of 100 filters per kernel size.
The kernel size defines the number of words considered as the convolution passes over the word vectors, yielding different n-grams. Applying a convolution operation with one filter window over the word vectors produces a new feature map. After each convolution operation, we apply a nonlinear transformation using a Rectified Linear Unit (ReLU) [56]. The convolved result is pooled using max pooling to capture the text's most relevant features. All feature maps are then concatenated into a single fixed-length vector. Finally, we feed this vector through a fully connected layer with a 0.5 dropout rate.
• Recurrent Neural Network (RNN): The RNN model consists of one bidirectional LSTM layer. A bidirectional LSTM trains two LSTMs on the input sequence: the first reads the sequence in forward order and the second in backward order, and their outputs are combined into a single representation. This helps learn better feature representations and capture more sequential patterns from both directions. The bidirectional LSTM layer is followed by a dropout layer and a fully connected layer.
• Convolutional Recurrent Neural Network (CRNN): For the final model, we combined a one-dimensional convolution layer with five bidirectional LSTM layers. The model uses a multi-scale convolutional layer with kernels of sizes 4 and 5 to extract multiple feature maps from the input text; each kernel size has 100 filters. We apply a ReLU [56] nonlinearity to each feature map, and a max-pooling layer pools them separately to extract the essential text features. The extracted features are then concatenated and fed into the bidirectional LSTM layers, whose output is fed to a fully connected layer.
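A minimal PyTorch sketch of the CNN branch described above (embedding layer, kernels of sizes 4 and 5 with 100 filters each, ReLU, max-over-time pooling, 0.5 dropout, and a sigmoid output); the vocabulary size here is an arbitrary placeholder, not the paper's value:

```python
import torch
import torch.nn as nn

class TweetCNN(nn.Module):
    """Sketch of the multi-scale CNN classifier described in the text."""

    def __init__(self, vocab_size=10000, embed_dim=200, n_filters=100):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # one 1-D convolution per kernel size (4 and 5), 100 filters each
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, n_filters, k) for k in (4, 5)]
        )
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(n_filters * 2, 1)  # binary output

    def forward(self, x):                        # x: (batch, seq_len) of token ids
        e = self.embedding(x).transpose(1, 2)    # (batch, embed_dim, seq_len)
        # ReLU then max-over-time pooling for each kernel size
        pooled = [torch.relu(c(e)).max(dim=2).values for c in self.convs]
        h = self.dropout(torch.cat(pooled, dim=1))
        return torch.sigmoid(self.fc(h))         # (batch, 1) probabilities

out = TweetCNN()(torch.randint(0, 10000, (8, 30)))
```

The embedding layer's weight matrix can be overwritten with the pre-trained word2vec or FastText vectors to reproduce the second and third embedding-initialization settings.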
Deep learning models have some advantages. For example, the CNN automatically selects relevant words in tweets, while the RNN-BiLSTM network captures word patterns in both directions, from right to left and vice versa, and, unlike the CNN, can handle tweets of different lengths. The CRNN model combines the benefits of both networks. When dealing with imbalanced classification tasks, classifiers naturally become biased toward the majority class. One of the most common ways to address this is to use evaluation metrics that tell a more truthful story. Therefore, when evaluating model performance, we report several metrics, including the Area Under the ROC Curve (AUC), precision, recall, and F1. These measurements are briefly outlined as follows:
Area Under the ROC Curve (AUC): indicates the classifier's ability to distinguish between classes, measured as the area under the receiver operating characteristic (ROC) probability curve.
Precision: the percentage of positively classified tweets that are actually correct:
Precision = TP / (TP + FP)
Recall: the ability of the classifier to correctly identify all positive instances:
Recall = TP / (TP + FN)
where TP is the number of tweets correctly identified as misinformation, FP the number incorrectly identified as misinformation, TN the number correctly identified as not misinformation, and FN the number incorrectly identified as not misinformation.
F1 score: the weighted harmonic mean of precision and recall:
F1 = 2 · (Precision · Recall) / (Precision + Recall)
First, we shuffled the data to ensure that the models are not affected by the order of the data.
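The precision, recall, and F1 definitions above translate directly into code; the confusion-matrix counts used in the example are invented for illustration:

```python
# Direct implementation of the evaluation metrics defined above,
# computed from confusion-matrix counts.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Example: 30 misinformation tweets found, 10 false alarms, 20 missed.
p = precision(30, 10)        # 30 / 40 = 0.75
r = recall(30, 20)           # 30 / 50 = 0.60
f1 = f1_score(30, 10, 20)
```

In practice, scikit-learn's `precision_score`, `recall_score`, `f1_score`, and `roc_auc_score` compute the same quantities from predicted and true labels.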
We randomly split the sample of 8786 annotated tweets into training and testing sets (80:20) for the traditional classifiers; for the deep learning classifiers, the sample was randomly split into training, testing, and validation sets (60:20:20).
Table 6: Hyper-parameter Settings for Deep Learning Classifiers.
Since our dataset is imbalanced, we ran a grid search to find the best hyper-parameters and maximize the AUC score. In the grid-search function, we chose AUC as the scoring parameter with 5-fold cross-validation. In each fold, the model is trained on all training data with every parameter combination. To find the optimal parameters for the fold, each trained model is evaluated on the validation set, and the model with the optimal parameters is then applied to the test set. This procedure is repeated until the model that maximizes the AUC score is found. We trained the traditional classifiers using unigram and n-gram TF-IDF feature representations, where the n-grams combined bigrams and trigrams. Table 7 reports these results. We also report results based on the word2vec and FASTTEXT embedding methods. Using word2vec or FASTTEXT word embeddings yields slightly higher AUC and F1 scores; the overall AUC increases by 1 to 5 points, as shown in Table 8. Nearly every classifier improved with the trained word embeddings, except for SGD, whose AUC score decreased by 1 to 2 points. The highest AUC scores were obtained by the XGB classifier with both word embedding methods. The XGB classifier with FASTTEXT performed best among the traditional classifiers, reaching an 85.4% AUC score with the second-best precision of 0.72 and an F1 score of 0.39, indicating that the predictions of the XGB classifier are better than those of all other classifiers. It is followed by the SVC classifier, with a close AUC score of 85.3% and a recall of 0.80, together with the highest F1 score (0.53) among all classifiers.
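The grid search described above, with AUC as the scoring metric and TF-IDF n-gram features, can be sketched with scikit-learn. This is an illustrative stand-in, not the paper's code: the two-sentence toy corpus is invented, and LogisticRegression substitutes for the traditional classifiers (XGB, SVC, etc.) to keep the sketch dependency-light; only the structure — TF-IDF features, `scoring="roc_auc"`, 5-fold cross-validation — mirrors the setup.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# toy stand-in corpus: label 1 = misinformation, 0 = not misinformation
texts = ["garlic cures the virus"] * 10 + ["stay home and wash hands"] * 10
labels = [1] * 10 + [0] * 10

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),  # stand-in classifier
])
grid = GridSearchCV(
    pipe,
    param_grid={
        "tfidf__ngram_range": [(1, 1), (2, 3)],  # unigrams vs bigrams+trigrams
        "clf__C": [0.1, 1.0, 10.0],
    },
    scoring="roc_auc",  # maximize AUC, as in the paper
    cv=5,               # 5-fold cross-validation
)
grid.fit(texts, labels)
print(grid.best_params_, grid.best_score_)
```

Because `scoring="roc_auc"` ranks parameter combinations by cross-validated AUC rather than accuracy, the selected model is less likely to simply favor the majority class on an imbalanced dataset.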
The ROC curves generated by the traditional classifiers using both word embedding methods are shown in Figure 2 . We trained the deep learning classifiers with the Adam optimizer to learn model parameters, using varying learning rates, a batch size of 32, and 500 epochs to optimize the cross-entropy loss. Table 6 shows the hyper-parameter settings for the deep learning classifiers. We report the results with and without the pre-trained word embeddings. Without the pre-trained word embeddings, the accuracy of the CNN, RNN, and CRNN was 85.0%, 84.3%, and 85.3%, respectively, while the AUC score was 50% for all three. To handle the imbalanced dataset and further improve classifier performance, we conducted a second experiment in which we trained the classifiers using the AUCPRLoss function, which optimizes for AUC based on [57] ; that work introduced simple building-block bounds that provide a unified framework for efficient, scalable optimization of a wide range of objectives, including directly optimizing AUC. We used the Adam optimizer with varying learning rates, a batch size of 32, and 600 epochs. Using the pre-trained word embeddings with the AUCPRLoss function seems to improve only some of the classifiers; word2vec improves the CNN classifier by 4.7 points. Further hyper-parameter tuning may increase the performance. Table 10 shows the overall performance of the deep learning classifiers after optimizing the AUC. Among the deep learning classifiers, the CNN with FASTTEXT embeddings and cross-entropy loss achieved the best performance overall, with the highest AUC score, precision, recall, and F1 score. Arabic is a rich and complex language with a vast vocabulary; it is also highly morphological and derivational. The complexity increases due to the informal nature of social media texts. Two main forms of the Arabic language are present on social media: Modern Standard Arabic (MSA) and Dialectal Arabic (DA).
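Feeding pre-trained word embeddings into a deep learning classifier amounts to initializing the model's embedding layer from the trained vectors. The sketch below is a hedged, framework-agnostic illustration in numpy: the vocabulary, the toy 4-dimensional vectors, and the `build_embedding_matrix` helper are all invented for the example; in practice the rows would come from the word2vec or FASTTEXT models trained on the COVID-19 tweet corpus.

```python
import numpy as np

def build_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Row i holds the pre-trained vector for word i; out-of-vocabulary
    words get small random vectors (row 0 is reserved for padding)."""
    rng = np.random.default_rng(seed)
    matrix = np.zeros((len(vocab) + 1, dim))
    for word, idx in vocab.items():
        vec = pretrained.get(word)
        matrix[idx] = vec if vec is not None else rng.normal(0, 0.1, dim)
    return matrix

# toy vectors standing in for embeddings trained on COVID-19 tweets
pretrained = {"corona": np.ones(4), "virus": np.full(4, 2.0)}
vocab = {"corona": 1, "virus": 2, "garlic": 3}   # index 0 = padding
E = build_embedding_matrix(vocab, pretrained, dim=4)
print(E.shape)  # (4, 4): vocabulary size + padding row, embedding dim
```

The resulting matrix is what a framework's embedding layer would be initialized with, so that the classifier starts from vectors that already encode corpus-specific word similarity instead of random noise.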
MSA is used for formal writing, while DA is used for informal daily communication; the latter is the most common form. Further complicating matters, Arabic has many different dialects that people use on social media, and these dialects are one of the reasons many new words enter the language, especially stop words [58] . Another challenge is the diacritics used in the Arabic orthographic system. Diacritics represent short vowels and clarify the meanings of words; there are thirteen diacritics in the Arabic language [59] . Many Arabic words have more than one meaning depending on their diacritics: the same undiacritized word form can mean either thought or counting, depending on the context. Furthermore, some Arabic words have different meanings based solely on context, such as a word form that can mean either public or year [59] . In most Arabic tweets, the text is written without diacritics, and the reader is expected to infer the intended meaning; this, however, does not apply to machines. In addition, the Arabic language contains many grammatical rules that change the shape and meaning of words. Despite all these challenges, the classifiers showed promising results in distinguishing COVID-19 misinformation in Arabic tweets, which means that available machine learning methods can deliver high-performance, promising classifiers even from an imbalanced dataset of tweets. Compared to deep learning, the traditional classifiers performed better, with higher AUC values. Based on the experimental results, it is evident that feature selection can be an effective technique for improving the traditional classifiers' performance. Although the deep learning models are biased toward the majority class, their performance can be increased by using pre-trained word embeddings or by optimizing the AUC score.
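Because most tweets are written without diacritics, a common preprocessing step is to strip diacritic marks so that diacritized and bare spellings map to the same token. The sketch below is a minimal illustration (the source does not describe its exact normalization code): Arabic short-vowel marks occupy a contiguous Unicode block, so a single regular expression suffices.

```python
import re

# Arabic diacritic (tashkeel) marks occupy the Unicode range U+064B-U+0652
DIACRITICS = re.compile("[\u064B-\u0652]")

def strip_diacritics(text):
    """Remove short-vowel marks so that diacritized and bare spellings
    of the same word normalize to one token."""
    return DIACRITICS.sub("", text)

# a diacritized word form reduces to its bare, undiacritized form
print(strip_diacritics("\u0639\u064E\u0627\u0645\u064C"))  # عَامٌ -> عام
```

The same normalization applies to both the training corpus and incoming tweets, so the model never has to learn separate vectors for diacritized variants.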
Using pre-trained word embeddings built on a disease-specific dataset can be more accurate for detecting health misinformation than using generic pre-trained word embeddings [60] . As shown by the results, all classifiers perform better with pre-trained embeddings than without them. The FASTTEXT word embeddings improved all classifiers' performance except for CRNN and NB, while word2vec improved the results for CRNN. The key difference between word2vec and FASTTEXT is that, during the learning phase, FASTTEXT treats each word as composed of character n-grams, whereas word2vec treats the word as the smallest unit. Arabic is a morphologically rich language; in addition, social media posts such as tweets are usually written informally and often contain misspellings, which creates ambiguity. The word corona, for example, can be written with several different spellings. FASTTEXT trains the embedding vectors on subword units, which take the language's morphology into account [46] . FASTTEXT therefore copes with misspellings and can learn meaningful representations for rare words, while word2vec ignores them. For this reason, we believe the FASTTEXT tweet-vector representations are better than the word2vec representations. The Area Under the ROC Curve (AUC) measures how well a machine learning model distinguishes between positive and negative tweets. Several studies have shown that optimizing the AUC is extremely useful for evaluating a classifier when class distributions are heavily imbalanced [61, 62] . Our results confirm that maximizing the AUC score improves some classifiers' results on imbalanced datasets: it increases the classifiers' ability to recognize the minority class. Nevertheless, optimizing for AUC is difficult because it requires sorting the dataset, which makes it relatively expensive. Moreover, AUC is not continuous in the training set, and hence most studies optimize a differentiable variant of AUC.
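The subword mechanism that lets FASTTEXT handle misspellings can be made concrete: each word is padded with boundary markers and decomposed into character n-grams (lengths 3-6 by default in FASTTEXT), and the word's vector is built from its n-gram vectors. The sketch below only illustrates the decomposition; the two spellings are hypothetical Latin-script stand-ins for variant spellings of "corona".

```python
def char_ngrams(word, n_min=3, n_max=6):
    """FASTTEXT-style subword units: pad the word with boundary
    markers, then take all character n-grams of length n_min..n_max."""
    padded = f"<{word}>"
    return {padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)}

# two hypothetical spellings of the same word share most subword units,
# so their learned vectors stay close even if one form is rare
a = char_ngrams("corona")
b = char_ngrams("korona")
overlap = len(a & b) / len(a | b)
print(round(overlap, 2))
```

Because the variant spellings share n-grams such as "rona" and "ona>", their vectors are pulled together during training, whereas word2vec would treat the rare spelling as an entirely unrelated (or unknown) token.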
Many methods have been developed that directly optimize the AUC during classifier training [63, 64] . Future studies are needed to investigate the impact of different AUC optimization techniques on misinformation detection. While the proposed dataset covers a diverse range of misinformation content, one limitation is that our work has been limited to tweets disseminated during March and April 2020. This is largely motivated by the fact that most Arabic-speaking countries reported their first confirmed cases of COVID-19 during March 2020, and, due to the lack of proper awareness and knowledge among people, false information spread mostly in the early stage of the pandemic. Experimenting with larger datasets spanning a longer duration (e.g., 5 months) would be a useful way to extend and validate our work. With the increased use of social media as a primary source of information, distinguishing correct from misleading information has become very difficult and critical, especially during the ongoing COVID-19 pandemic. Many COVID-19 intervention strategies depend on the quality and reliability of the information shared between people, and several features of social media facilitate the spread of inaccurate information among users worldwide. Identifying and combating misinformation is therefore a critical task during pandemics. In this work, we conducted an extensive experiment using real misinformation content from Twitter. We examined different machine learning classifiers to automatically identify Arabic misinformation related to COVID-19, using an annotated dataset of 8786 tweets and employing word2vec and FASTTEXT. Our results show that using word embeddings indeed enhances classifier performance: FASTTEXT produces better results with the traditional classifiers and the CNN, while word2vec allows for better results with the deep learning classifiers.
Optimizing the AUC score improved the classifiers' performance and their ability to handle imbalanced datasets. The XGB classifier was shown to be capable of accurately identifying Arabic misinformation based solely on a tweet's text, outperforming all other classifiers in terms of AUC, precision, recall, and F1. In the foreseeable future, we plan to improve the deep learning classifiers by stacking multiple layers and further optimizing the hyper-parameters, and possibly to extend the study to include the very recent AdaBelief optimizer [65] . Finally, we plan to consider other social networks, which will help enrich our dataset and widen its applications.

References

Catching Zika fever: Application of crowdsourcing and machine learning for tracking health misinformation on Twitter
Ebola, Twitter, and misinformation: A dangerous combination?
Managing the COVID-19 infodemic: Promoting healthy behaviours and mitigating the harm from misinformation and disinformation
The spreading of misinformation online
Study: On Twitter, false news travels faster than true stories
Here's how social media can combat the coronavirus 'infodemic'
Social Media and Democracy: The State of the Field, Prospects for Reform
Misinformation in social media: Definition, manipulation, and detection
Deep learning for misinformation detection on online social networks: A survey and new perspectives
Misinformation and its correction: Continued influence and successful debiasing
No, that never happened!! Investigating rumors on Twitter
Automatically identifying fake news in popular Twitter threads
Don't follow me: Spam detection in Twitter
Measuring the credibility of Arabic text content in Twitter
Supervised learning approach for Twitter credibility detection
Classifying Arabic tweets based on credibility using content and user features
CAT: Credibility analysis of Arabic content on Twitter
Arabic news credibility on Twitter: An enhanced model using hybrid features
The effect of the similarity between the two names of Twitter users on the credibility of their publications
Credibility detection in Twitter using word n-gram analysis and supervised machine learning techniques
Rumor detection in Arabic tweets using semi-supervised and unsupervised expectation-maximization
Detecting health-related rumors on Twitter using machine learning methods
Analysis of misinformation during the COVID-19 outbreak in China: Cultural, social and political entanglements
NLP-based feature extraction for the detection of COVID-19 misinformation videos on YouTube
Is this POFMA? Analysing public opinion and misinformation in a COVID-19 Telegram group chat
A first look at COVID-19 information and misinformation sharing on Twitter
Critical impact of social networks infodemic on defeating coronavirus COVID-19 pandemic: Twitter-based study and research directions
An "infodemic": Leveraging high-volume Twitter data to understand public sentiment for the COVID-19 outbreak
COVID-19 infodemic: More retweets for science-based information on coronavirus than for false information
An exploratory study of COVID-19 misinformation on Twitter
Cultural convergence: Insights into the behavior of misinformation networks on Twitter
Analysis of online misinformation during the peak of the COVID-19 pandemic in Italy
Like a virus: The coordinated spread of coronavirus disinformation
What types of COVID-19 conspiracies are populated by Twitter bots?
Challenges in combating COVID-19 infodemic: Data, tools, and ethics
Detecting misleading information on COVID-19
Lies kill, facts save: Detecting COVID-19 misinformation in Twitter
Fighting the COVID-19 infodemic in social media: A holistic perspective and a call to arms
COVID-19 and Arabic Twitter: How can Arab world governments and public health organizations learn from social media?
ArCorona: Analyzing Arabic tweets in the early days of the coronavirus (COVID-19) pandemic
TextBlob: Simplified text processing
Identifying information superspreaders of COVID-19 from Arabic tweets
Term-weighting approaches in automatic text retrieval
Efficient estimation of word representations in vector space
Enriching word vectors with subword information
Methodical evaluation of Arabic word embeddings
Word representations in vector space and their applications for Arabic
FastText.zip: Compressing text classification models
Bag of tricks for efficient text classification
Fast and scalable neural embedding models for biomedical sentence classification
Software framework for topic modelling with large corpora
Using word embeddings in Twitter election classification
Scikit-learn: Machine learning without learning the machinery
PyTorch: An imperative style, high-performance deep learning library
Rectified linear units improve restricted Boltzmann machines
Scalable learning of non-decomposable objectives
Identifying comparative opinions in Arabic text in social media using machine learning techniques
Impact of stemming and word embedding on deep learning-based Arabic text categorization
A tale of two epidemics: Contextual word2vec for classifying Twitter streams during outbreaks
The use of the area under the ROC curve in the evaluation of machine learning algorithms
Maximizing AUC with deep learning for classification of imbalanced mammogram datasets
An efficient variance estimator of AUC and its applications to binary classification
Stochastic AUC maximization with deep neural networks
AdaBelief optimizer: Adapting stepsizes by the belief in observed gradients

Acknowledgments

This work was supported by King Abdulaziz City for Science and Technology, Grant Number 5-20-01-007-0033.