key: cord-0160347-3lwuo2sb authors: Shaar, Shaden; Alam, Firoj; Martino, Giovanni Da San; Nikolov, Alex; Zaghouani, Wajdi; Nakov, Preslav; Feldman, Anna title: Findings of the NLP4IF-2021 Shared Tasks on Fighting the COVID-19 Infodemic and Censorship Detection date: 2021-09-23 journal: nan DOI: nan sha: cd4607527dc3fcce4e0580b258fefdc167503669 doc_id: 160347 cord_uid: 3lwuo2sb

We present the results and the main findings of the NLP4IF-2021 shared tasks. Task 1 focused on fighting the COVID-19 infodemic in social media, and it was offered in Arabic, Bulgarian, and English. Given a tweet, it asked to predict whether that tweet contains a verifiable claim, and if so, whether it is likely to be false, is of general interest, is likely to be harmful, and is worthy of manual fact-checking; also, whether it is harmful to society, and whether it requires the attention of policy makers. Task 2 focused on censorship detection, and was offered in Chinese. A total of ten teams submitted systems for task 1, and one team participated in task 2; nine teams also submitted a system description paper. Here, we present the tasks, analyze the results, and discuss the system submissions and the methods they used. Most submissions achieved sizable improvements over several baselines, and the best systems used pre-trained Transformers and ensembles. The data, the scorers and the leaderboards for the tasks are available at http://gitlab.com/NLP4IF/nlp4if-2021.

Social media have become a major communication channel, enabling fast dissemination and consumption of information. A lot of this information is true and shared in good intention; however, some is false and potentially harmful. While so-called "fake news" is not a new phenomenon, e.g., the term was coined five years ago, the COVID-19 pandemic has given rise to the first global social media infodemic.
The infodemic has elevated the problem to a whole new level, which goes beyond spreading fake news, rumors, and conspiracy theories, and extends to promoting fake cures, panic, racism, xenophobia, and mistrust in the authorities, among others. Identifying such false and potentially malicious information in tweets is important to journalists, fact-checkers, policy makers, government entities, social media platforms, and society. A number of initiatives have been launched to fight this infodemic, e.g., by building and analyzing large collections of tweets, their content, source, propagators, and spread (Leng et al., 2021; Medford et al., 2020; Mourad et al., 2020; Karami et al., 2021). Yet, these efforts typically focus on a specific aspect, rather than studying the problem from a holistic perspective.

Here we aim to bridge this gap by introducing a task that asks to predict whether a tweet contains a verifiable claim, and if so, whether it is likely to be false, is of general interest, is likely to be harmful, and is worthy of manual fact-checking; also, whether it is harmful to society, and whether it requires the attention of policy makers. The task follows an annotation schema proposed in (Alam et al., 2020, 2021b).

While the COVID-19 infodemic is characterized by insufficient attention paid to the problem, there are also examples of the opposite: tight control over information. In particular, freedom of expression in social media is increasingly threatened by a new and more effective form of digital authoritarianism. Political censorship exists in many countries, whose governments attempt to conceal or to manipulate information to make sure their citizens are unable to read or to express views that are contrary to those of people in power.
One such example is Sina Weibo, a Chinese microblogging website with over 500 million monthly active users, which exercises strict control over its content using a variety of strategies to target censorable posts, ranging from keyword list filtering to individual user monitoring: among all posts that are eventually censored, nearly 30% are removed within 5-30 minutes, and 90% within 24 hours (Zhu et al., 2013). We hypothesize that the former is done automatically, while the latter involves human censors. Thus, we propose a shared task that aims to study the potential for automatic censorship, asking participating systems to predict whether a Sina Weibo post will be censored.

In this section, we discuss studies relevant to the COVID-19 infodemic and to censorship detection. Disinformation, misinformation, and "fake news" thrive in social media. Lazer et al. (2018) and Vosoughi et al. (2018) provided a general discussion in Science on the science of "fake news" and the process of proliferation of true and false news online. There have also been several interesting surveys, e.g., Shu et al. (2017) studied how information is disseminated and consumed in social media. Another survey by Thorne and Vlachos (2018) took a fact-checking perspective on "fake news" and related problems. Yet another survey (Li et al., 2016) covered truth discovery in general. Some very recent surveys focused on stance for misinformation and disinformation detection (Hardalov et al., 2021), on automatic fact-checking to assist human fact-checkers (Nakov et al., 2021a), on predicting the factuality and the bias of entire news outlets (Nakov et al., 2021c), on multimodal disinformation detection (Alam et al., 2021a), and on abusive language in social media (Nakov et al., 2021b).

A number of Twitter datasets have been developed to address the COVID-19 infodemic. Some are without labels, others use distant supervision, and very few are manually annotated. Cinelli et al.
(2020) studied COVID-19 rumor amplification on five social media platforms; their data was labeled using distant supervision. Other datasets include a multilingual dataset of 123M tweets (Chen et al., 2020), another one of 383M tweets (Banda et al., 2020), a billion-scale dataset covering 65 languages, with 32M geo-tagged tweets (Abdul-Mageed et al., 2021), and the GeoCoV19 dataset, consisting of 524M multilingual tweets, including 491M with GPS coordinates (Qazi et al., 2020). There are also Arabic datasets, both with (Haouari et al., 2021; Mubarak and Hassan, 2021) and without manual annotations (Alqurashi et al., 2020). We are not aware of Bulgarian datasets.

Zhou et al. (2020) created the ReCOVery dataset, which combines 2,000 news articles about COVID-19, annotated for their factuality, with 140,820 tweets. Vidgen et al. (2020) studied COVID-19 prejudices using a manually labeled dataset of 20K tweets with the following labels: hostile, criticism, prejudice, and neutral. Song et al. (2021) collected a dataset of false and misleading claims about COVID-19 from IFCN Poynter, which they manually annotated with the following ten disinformation-related categories: (1) Public authority, (2) Community spread and impact, (3) Medical advice, self-treatments, and virus effects, (4) Prominent actors, (5) Conspiracies, (6) Virus transmission, (7) Virus origins and properties, (8) Public reaction, (9) Vaccines, medical treatments, and tests, and (10) Cannot determine. Another related study (Pulido et al., 2020) analyzed 1,000 tweets and categorized them based on factuality into the following categories: (i) False information, (ii) Science-based evidence, (iii) Fact-checking tweets, (iv) Mixed information, (v) Facts, (vi) Other, and (vii) Not valid. Ding et al. (2020) presented a position paper discussing the challenges in combating the COVID-19 infodemic in terms of data, tools, and ethics. Hossain et al.
(2020) developed the COVIDLies dataset by matching known misconceptions with tweets, and manually annotated the tweets with stance: whether the target tweet agrees, disagrees, or has no position with respect to a known misconception. Finally, Shuja et al. (2020) provided a comprehensive survey categorizing the COVID-19 literature into four groups: diagnosis-related, transmission and mobility, social media analysis, and knowledge-based approaches.

The most relevant previous work is (Alam et al., 2020, 2021b), where tweets about COVID-19 in Arabic and English were annotated based on an annotation schema of seven questions. Here, we adopt the same schema (but with binary labels only), with a larger dataset for Arabic and English, and we further add an additional language: Bulgarian.

There has been a lot of research aiming at developing strategies to detect and to evade censorship. Most work has focused on exploiting technological limitations of existing routing protocols (Leberknight et al., 2012; Katti et al., 2005; Levin et al., 2015; Weinberg et al., 2012; Bock et al., 2020). Research that pays more attention to the linguistic properties of online censorship in the context of censorship evasion includes Safaka et al. (2016), who applied linguistic steganography to circumvent censorship. Other related work is that of Lee (2016), who used parodic satire to bypass censorship in China and claimed that this stylistic device delays and often evades censorship. Hiruncharoenvate et al. (2015) showed that the use of homophones of censored keywords on Sina Weibo could help extend the time for which a Weibo post remains available online. All these methods require significant human effort to interpret and to annotate texts in order to evaluate the likelihood of censorship, which might not be practical for common Internet users in real life. King et al.
(2013) in turn studied the relationship between political criticism and the chance of censorship, concluding that posts with Collective Action Potential get deleted by the censors even if they support the state. Zhang and Pan (2019) introduced Collective Action from Social Media (CASM), a system that uses convolutional neural networks on image data and recurrent neural networks with long short-term memory on text data in a two-stage classifier to identify social media posts about offline collective action. They found that although online censorship in China suppresses the discussion of collective action in social media, it does not have a large impact on the number of collective action posts identified through CASM-China. They also noted that the system would miss collective action taking place in ethnic minority regions, such as Tibet and Xinjiang, where social media penetration is lower and more stringent Internet control is in place, e.g., Internet blackouts.

Finally, there has been research that uses linguistic and content clues to detect censorship. Knockel et al. (2015) and Zhu et al. (2013) proposed detection mechanisms to categorize censored content and to automatically learn keywords that get censored. Bamman et al. (2012) uncovered a set of politically sensitive keywords and found that the presence of some of them in a Weibo blogpost contributed to a higher chance of the post being censored. Ng et al. (2018b) also targeted a set of topics that had been suggested to be sensitive.

Below, we describe the two tasks: their setup and their corresponding datasets.

Task Setup: The task asks to predict several binary properties of an input tweet about COVID-19. These properties are formulated as seven questions, briefly discussed below:

1. Verifiable Factual Claim: Does the tweet contain a verifiable factual claim?
A verifiable factual claim is a statement that something is true, and this can be verified using factual, verifiable information such as statistics, specific examples, or personal testimony. Following (Konstantinovskiy et al., 2018), factual claims could be (a) stating a definition, (b) mentioning a quantity in the present or in the past, (c) making a verifiable prediction about the future, (d) referencing laws, procedures, and rules of operation, (e) referencing images or videos (e.g., "This is a video showing a hospital in Spain."), or (f) implying correlation or causation (such correlation/causation needs to be explicit).

2. False Information: To what extent does the tweet appear to contain false information? This annotation determines how likely the tweet is to contain false information without fact-checking it, but looking at things like its style, metadata, the credibility of the sources cited, etc.

3. Interesting for the General Public: Will the tweet have an impact on or be of interest to the general public? In general, claims about topics such as healthcare, political news and findings, and current events are of higher interest to the general public. Not all claims should be fact-checked; for example, "The sky is blue.", albeit being a claim, is not interesting to the general public and thus should not be fact-checked.

4. Harmful: To what extent is the tweet harmful to the society/person(s)/company(s)/product(s)? The purpose of this question is to determine whether the content of the tweet aims to and can negatively affect society as a whole, a specific person(s), a company(s), or a product(s).

5. Need to Fact-Check: Do you think that a professional fact-checker should verify the claim in the tweet? Not all factual claims are important or worth fact-checking by a professional fact-checker, as this is a time-consuming process. For example, claims that could be fact-checked with a very simple search on the Internet probably do not need the attention of a professional fact-checker.

6. Harmful to Society: Is the tweet harmful for society? The purpose of this question is to judge whether the content of the tweet could be potentially harmful for society, e.g., by being weaponized to mislead a large number of people. For example, a tweet might not be harmful because it is a joke, or it might be harmful because it spreads panic, rumors, or conspiracy theories, promotes bad cures, or is xenophobic, racist, or hateful.

7. Requires Attention: Do you think that this tweet should get the attention of government entities? A variety of tweets might end up in this category, e.g., tweets blaming the authorities, calling for action, offering advice, discussing actions taken or possible cures, asking important questions (e.g., "Will COVID-19 disappear in the summer?"), etc.

Data: For this task, the dataset covers three different languages (Arabic, Bulgarian, and English), annotated with yes/no answers to the above questions. More details about the data collection and the annotation process, as well as statistics about the corpus, can be found in (Alam et al., 2020, 2021b), where an earlier (and much smaller) version of the corpus is described. We annotated additional tweets for Arabic and Bulgarian for the shared task using the same annotation schema.

Task Setup: For this task, we deal with a particular type of censorship: when a post gets removed from a social media platform semi-automatically based on its content. The goal is to predict which posts on Sina Weibo, a Chinese microblogging platform, will get removed from the platform, and which posts will remain on the website.

Data: Tracking censorship topics on Sina Weibo is a challenging task due to the transient nature of censored posts and the scarcity of censored data from well-known sources such as FreeWeibo and WeiboScope. The most straightforward way to collect data from a social media platform is to make use of its API.
However, Sina Weibo imposes various restrictions on the use of its API, such as restricted access to certain endpoints and a restricted number of posts returned per request. Above all, the API does not provide any endpoint that allows easy and efficient collection of the target data (posts that contain sensitive keywords). Therefore, Ng et al. (2019) and Ng et al. (2020) developed an alternative method to track censorship; the reader is referred to the original articles for details about the data collection. In a nutshell, the dataset contains censored and uncensored tweets, and it includes no images, no hyperlinks, no re-blogged content, and no duplicates. For the present task 2, we use the balanced dataset described in (Ng et al., 2019, 2020). The data was collected across ten topics over a period of four months: from August 29, 2018 until December 29, 2018. Table 2 summarizes the dataset in terms of the number of censored and uncensored tweets in the training, development, and testing sets, while Table 3 shows the main topics covered by the dataset.

In this section, we describe the overall task organization, phases, and evaluation measures. We ran the shared tasks in two phases:

Development Phase: In the first phase, only training and development data were made available, and no gold labels were provided for the latter. The participants competed against each other to achieve the best performance on the development set.

Test Phase: In the second phase, the test set (unlabeled input only) was released, and the participants were given a few days to submit their predictions.

The official evaluation measure for task 1 was the average of the weighted F1 scores for each of the seven questions; for task 2, it was accuracy.

Below, we describe the baselines, the evaluation results, and the best systems for each language. The baselines for Task 1 are (i) majority class, (ii) ngram, and (iii) random.
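The official task 1 measure (the average over the seven questions of the per-question weighted F1 score) can be sketched in a few lines of pure Python. This is a minimal illustration assuming gold and predicted labels come as per-tweet lists of seven binary answers; the official scorer in the task repository is authoritative.

```python
def weighted_f1(gold, pred):
    """Weighted F1: the F1 score of each class, weighted by that
    class's support (frequency) in the gold labels."""
    score = 0.0
    for c in set(gold):
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += f1 * sum(1 for g in gold if g == c) / len(gold)
    return score

def task1_score(gold_answers, pred_answers):
    """Task 1 measure: average of the weighted F1 scores computed
    independently for each of the seven questions."""
    n_questions = len(gold_answers[0])
    return sum(
        weighted_f1([row[q] for row in gold_answers],
                    [row[q] for row in pred_answers])
        for q in range(n_questions)
    ) / n_questions
```

A perfect prediction scores 1.0; because each question is scored independently, a system can trade off accuracy across the seven questions.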
The results on the official test set for English, Arabic, and Bulgarian, including these baselines, are reported in Tables 4, 5, and 6, respectively. We can see that most participants managed to beat all baselines by a sizable margin. Below, we give a brief summary of the best performing systems for each language.

The English Winner: Team TOKOFOU (Tziafas et al., 2021) performed best for English. They gathered six BERT-based models pre-trained on relevant domains (e.g., Twitter and COVID-themed data) or fine-tuned on tasks similar to the shared task's topic (e.g., hate speech and sarcasm detection). They fine-tuned each of these models on the task 1 training data, projecting a label for each of the seven questions in parallel from the sequence classification token. After model selection on the basis of development set F1 performance, they combined the models in a majority-class ensemble.

The Arabic Winner: Team R00 had the best performing system for Arabic. They used an ensemble of the following fine-tuned Arabic transformers: AraBERT (Antoun et al., 2020), Asafaya-BERT (Safaya et al., 2020), and ARBERT. In addition, they also experimented with MARBERT (Abdul-Mageed et al., 2020).

The Bulgarian Winner: The best performing team for Bulgarian did not submit a system description. The second best team, HunterSpeechLab (Panda and Levitan, 2021), explored the cross-lingual generalization ability of multitask models trained from scratch (logistic regression, transformer encoder) and of pre-trained models (English BERT and mBERT) for deception detection.

DamascusTeam (Hussein et al., 2021) used a two-step pipeline, where the first step involves a series of pre-processing procedures to transform Twitter jargon, including emojis and emoticons, into plain text. In the second step, a version of AraBERT is fine-tuned and used to classify the tweets. Their system was ranked 5th for Arabic.
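Several of the top systems combined multiple fine-tuned transformers by majority voting over their predictions. The idea can be sketched as follows; this is an illustrative fragment, not any team's actual code, and `model_preds` is a hypothetical list of per-model binary predictions for one question.

```python
from collections import Counter

def majority_vote(model_preds):
    """Combine the binary predictions of several models by taking,
    for each tweet, the label that most models agree on."""
    combined = []
    for per_tweet in zip(*model_preds):  # one tuple of votes per tweet
        combined.append(Counter(per_tweet).most_common(1)[0][0])
    return combined
```

With an odd number of models, every tweet gets a strict majority label; an ensemble like this can be run independently for each of the seven questions.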
Team dunder_mifflin (Suhane and Kowshik, 2021) built a multi-output model using task-wise multi-head attention for inter-task information aggregation, on top of representations obtained from RoBERTa. To tackle the small size of the dataset, they used back-translation for data augmentation. Their loss function was weighted for each output, in accordance with the distribution of the labels for that output. They were the runners-up in the English subtask with a mean F1 score of 0.891 on the test set, without the use of any task-specific embeddings or ensembles.

Table 5: Task 1, Arabic: Evaluation results. For Q1 to Q7, the results are in terms of weighted F1 score. (Team iCompass submitted their system after the deadline, and thus we rank them with a *.)

Team HunterSpeechLab (Panda and Levitan, 2021) participated in all three languages. They explored the cross-lingual generalization ability of multitask models trained from scratch (logistic regression, transformers) and of pre-trained models (English BERT, mBERT) for deception detection. They were 2nd for Arabic and Bulgarian.

Team iCompass (Henia and Haddad, 2021) had a late submission for Arabic, and would have ranked 2nd. They used contextualized text representations from ARBERT, MARBERT, AraBERT, Arabic ALBERT, and BERT-base-arabic, which they fine-tuned on the training data for task 1. They found that BERT-base-arabic performed best.

Team InfoMiner participated in all three subtasks, and was ranked 4th on all three. They used pre-trained transformer models, specifically BERT-base-cased, RoBERTa-base, BERT-multilingual-cased, and AraBERT. They optimized these transformer models for each question separately and used undersampling to deal with the class imbalance in the data.

Team NARNIA experimented with a number of deep learning models, including different word embeddings such as GloVe and ELMo, among others.
They found that the BERTweet model achieved the best overall F1 score of 0.881, securing them the third place in the English subtask.

Team R00 (Qarqaz et al., 2021) had the best performing system for the Arabic subtask. They used an ensemble of neural networks, each combining a linear layer on top of one of the following pre-trained Arabic language models: AraBERT, Asafaya-BERT, and ARBERT. In addition, they also experimented with MARBERT.

Team TOKOFOU participated in English only, and theirs was the winning system for that language. They gathered six BERT-based models pre-trained on relevant domains (e.g., Twitter and COVID-themed data) or fine-tuned on tasks similar to the shared task's topic (e.g., hate speech and sarcasm detection). They fine-tuned each of these models on the task 1 training data, projecting a label for each of the seven questions in parallel from the sequence classification token. After carrying out model selection on the basis of the F1 score on the development set, they combined the models in a majority-class ensemble in order to counteract the small size of the dataset and to ensure robustness.

Tables 7, 8, and 9 offer a high-level comparison of the approaches taken by the participating systems for English, Arabic, and Bulgarian, respectively (unfortunately, these comparisons miss two systems, which did not submit a system description paper). We can see that across all languages, the participants used transformer-based models, monolingual or multilingual; SVMs and logistic regression were also used. Some teams further used ensembles and data augmentation.

Below, we report the results for the baselines and for the participating systems. For task 2, we have three baselines, as shown in Table 10: a majority class baseline, as before, and two additional baselines described in (Ng et al., 2020). The first additional baseline is a human baseline based on crowdsourcing.
The second additional baseline is a multilayer perceptron (MLP) using linguistic features, as well as features measuring the complexity of the text, e.g., in terms of its readability, ambiguity, and idiomaticity. These features are motivated by observations that censored texts are typically more negative and more idiomatic, and contain more content words and more complex semantic categories. Moreover, censored tweets use more verbs, which indirectly points to Collective Action Potential. In contrast, uncensored posts are generally more positive, and contain words related to leisure, reward, and money.

Due to the unorthodox application, and perhaps to the sensitivity of the data, task 2 received only one submission: from team NITK_NLP. The team used a pre-trained XLNet-based Chinese model by Cui et al. (2020), which they fine-tuned for 20 epochs using the Adam optimizer. The evaluation results for that system are shown in Table 10. We can see that while the system outperformed both the human baseline and the majority class baseline by a large margin, it could not beat the MLP baseline. This suggests that capturing the linguistic fingerprints of censorship might indeed be important, and thus should probably be considered, e.g., in combination with deep contextualized representations from transformers (Ng et al., 2018a, 2019, 2020).

We have presented the NLP4IF-2021 shared tasks on fighting the COVID-19 infodemic in social media (offered in Arabic, Bulgarian, and English) and on censorship detection (offered in Chinese). In future work, we plan to extend the dataset to cover more examples, e.g., from more recent periods when the attention has shifted from COVID-19 in general to vaccines. We further plan to develop similar datasets for other languages. While our datasets do not contain personally identifiable information, creating systems for our tasks could face a "dual-use dilemma," as they could be misused by malicious actors.
Yet, we believe that the need for replicable and transparent research outweighs concerns about dual-use in our case.

Acknowledgments

We would like to thank Akter Fatema, Al-Awthan Ahmed, Al-Dobashi Hussein, El Messelmani Jana, Fayoumi Sereen, Mohamed Esraa, Ragab Saleh, and Shurafa Chereen for helping with the Arabic data annotations.

This research is part of the Tanbih mega-project, developed at the Qatar Computing Research Institute, HBKU, which aims to limit the impact of "fake news," propaganda, and media bias by making users aware of what they are reading.

This material is also based upon work supported by the US National Science Foundation under Grants No. 1704113 and No. 1828199. This publication was also partially made possible by innovation grant No. 21 "Misinformation and Social Networks Analysis in Qatar" from Hamad Bin Khalifa University's (HBKU) Innovation Center. The findings achieved herein are solely the responsibility of the authors.

References

AbdelRahim Elmadany, and El Moatez Billah Nagoudi. 2020. ARBERT & MARBERT: Deep bidirectional transformers for Arabic
Mega-COV: A billion-scale dataset of 100+ languages for COVID-19
Hamed Firooz, and Preslav Nakov. 2021a. A survey on multimodal disinformation detection
Fighting the COVID-19 infodemic in social media: A holistic perspective and a call to arms
Fighting the COVID-19 infodemic: Modeling the perspective of journalists, fact-checkers, social media platforms
Large Arabic Twitter dataset on COVID-19
AraBERT: Transformer-based model for Arabic language understanding
Censorship and deletion practices in Chinese social media
Yuning Ding, and Gerardo Chowell. 2020. A large-scale COVID-19 Twitter chatter dataset for open scientific research - an international collaboration
Detecting and evading censorship-in-depth: A case study of Iran's protocol whitelister
Tracking social media discourse about the COVID-19 pandemic: Development of a public coronavirus Twitter data set
Fabiana Zollo, and Antonio Scala. 2020. The COVID-19 social media infodemic
Revisiting pretrained models for Chinese natural language processing
Challenges in combating COVID-19 infodemic - data, tools, and ethics
ArCOV19-rumors: Arabic COVID-19 Twitter dataset for misinformation detection
Preslav Nakov, and Isabelle Augenstein. 2021. A survey on stance detection for mis- and disinformation identification
2021. iCompass at NLP4IF-2021: Fighting the COVID-19 infodemic
Algorithmically bypassing censorship on Sina Weibo with nondeterministic homophone substitutions
COVIDLies: Detecting COVID-19 misinformation on social media
DamascusTeam at NLP4IF-2021: Fighting the Arabic COVID-19 Infodemic on Twitter using AraBERT
Identifying and analyzing health-related themes in disinformation shared by conservative and liberal Russian trolls on Twitter
Slicing the onion: Anonymous routing without PKI
How censorship in China allows government criticism but silences collective expression
Every rose has its thorn: Censorship and surveillance on social video platforms in China
Towards automated fact-checking: Developing an annotation schema and benchmark for consistent automated claim detection
NARNIA at NLP4IF-2021: Identification of misinformation in COVID-19 tweets using BERTweet
The science of fake news
A taxonomy of censors and anti-censors: Part I: Impacts of internet censorship
Surviving online censorship in China: Three satirical tactics and their impact
Misinformation during the COVID-19 outbreak in China: Cultural, social and political entanglements
Alibi routing
A survey on truth discovery
An "Infodemic": Leveraging high-volume Twitter data to understand early public sentiment for the coronavirus disease
Critical impact of social networks infodemic on defeating coronavirus
COVID-19 pandemic: Twitter-based study and research directions
ArCorona: Analyzing Arabic tweets in the early days of coronavirus (COVID-19) pandemic
Automated fact-checking for assisting human fact-checkers
Guillaume Bouchard, and Isabelle Augenstein. 2021b. Detecting abusive language on online platforms: A critical analysis
Jisun An, and Haewoon Kwak. 2021c. A survey on predicting the factuality and the bias of news media
Detecting censorable content on Sina Weibo: A pilot study
Linguistic fingerprints of internet censorship: the case of Sina Weibo
Linguistic characteristics of censorable language on Sina Weibo
Neural network prediction of censorable language
Detecting multilingual COVID-19 misinformation on social media via contextualized embeddings
COVID-19 infodemic: More retweets for science-based information on coronavirus than for false information
Abdullah. 2021. R00 at NLP4IF-2021: Fighting COVID-19 infodemic with transformers and more transformers
GeoCoV19: A dataset of hundreds of millions of multilingual COVID-19 tweets with location information
Matryoshka: Hiding secret communication in plain sight
KUISAIL at SemEval-2020 task 12: BERT-CNN for offensive speech identification in social media
Fake news detection on social media: A data mining perspective
Waleed Alasmary, and Abdulaziz Alashaikh. 2020. COVID-19 open source data sets: A comprehensive survey
Classification aware neural topic model for COVID-19 disinformation categorisation
Multi output learning using task wise attention for predicting binary properties of tweets: Shared task on fighting the COVID-19 infodemic
Automated fact checking: Task formulations, methods and future directions
Fighting the COVID-19 infodemic with a holistic BERT ensemble
Transformers to fight the COVID-19 infodemic
Detecting East Asian prejudice on social media
The spread of true and false news online
StegoTorus: A camouflage proxy for the Tor anonymity system
CASM: A deep-learning approach for identifying collective action events with text and image data from social media
ReCOVery: A multimodal repository for COVID-19 news credibility research
The velocity of censorship: High-fidelity detection of microblog post deletions