key: cord-0610925-ona9dy3q authors: Li, Yichuan; Jiang, Bohan; Shu, Kai; Liu, Huan title: MM-COVID: A Multilingual and Multimodal Data Repository for Combating COVID-19 Disinformation date: 2020-11-08 journal: nan DOI: nan sha: 48ca40748556f24c77d443e3e78e2848348294d0 doc_id: 610925 cord_uid: ona9dy3q

The COVID-19 epidemic is considered a global health crisis for the whole of society and the greatest challenge mankind has faced since World War Two. Unfortunately, fake news about COVID-19 is spreading as fast as the virus itself. Incorrect health measures, anxiety, and hate speech harm people's physical and mental health worldwide. To help combat COVID-19 fake news, we propose a new fake news detection dataset, MM-COVID (Multilingual and Multidimensional COVID-19 Fake News Data Repository). This dataset provides multilingual fake news and the relevant social context. We collect 3981 pieces of fake news content and 7192 pieces of trustworthy information in 6 languages: English, Spanish, Portuguese, Hindi, French, and Italian. We present a detailed exploratory analysis of MM-COVID from different perspectives and demonstrate its utility in several potential applications of COVID-19 fake news research across languages and social media.

I. INTRODUCTION

COVID-19, an infectious disease caused by a newly discovered coronavirus, had caused more than 40 million confirmed cases and 1.2 million deaths around the world as of November 2020. Unfortunately, fake news about COVID-19 has boosted the spread of both the disease and hate speech.
For example, a couple who followed half-baked health advice took chloroquine phosphate to prevent COVID-19 and became ill within 20 minutes; racists have linked the COVID-19 pandemic to Asians and people of Asian descent, and violent attacks on Asian people have increased in the United States, United Kingdom, Italy, Greece, France, and Germany. To stop the spread of COVID-19 fake news, we must first address the problem of fake news detection.

However, identifying COVID-19 fake news is non-trivial, and there are several challenges. First, COVID-19 fake news is multilingual. For example, FACTCHECK.org, a fact-checking agency, found that the fake news "COVID-19 is caused by bacteria, easily treated with aspirin and coagulant." was first seen in Portuguese in Brazil and then appeared in English in India and America. Currently available fake news datasets and methods mainly focus on a single language and omit the correlation between languages. Thus, a multilingual fake news dataset is necessary so that debunked fake news in resource-rich languages can help detect fake news in low-resource languages. Second, fake news content alone provides a limited signal for spotting fake news, because fake news is intentionally written to mislead readers and multilingual fake news content is difficult to correlate. Thus, we need to explore auxiliary features beyond fake news content, such as social engagements and user profiles on social media. For example, people who post many vaccine conspiracy theories are more likely to transmit COVID-19 fake news. It is therefore necessary to have a comprehensive dataset that contains multilingual fake news content and the related social engagements to facilitate COVID-19 fake news detection. However, to the best of our knowledge, existing COVID-19 fake news datasets do not cover both aspects.
Therefore, in this paper, we present a fake news dataset, MM-COVID, which contains fake news content, social engagements, and spatial-temporal information in 6 different languages. This dataset brings several advantages for combating global COVID-19 fake news. First, the multilingual dataset provides an opportunity for cross-lingual fake news detection. Second, a rich set of features facilitates research on multi-modal (visual and textual) fake news detection and on boosting detection performance by including auxiliary social context. Third, the temporal information provides ideal experimental data for early fake news detection: researchers can flexibly set cutoff time periods to test the sensitivity of a proposed model. Fourth, researchers can investigate the fake news diffusion process across languages and the social network to develop intervention strategies that mitigate the negative impact of fake news [1].

The main contributions of this dataset are as follows:
• We provide a multilingual and multidimensional fake news dataset, MM-COVID, to facilitate fake news detection and mitigation;
• We conduct extensive exploratory analysis of MM-COVID from different perspectives to demonstrate the quality of this dataset, and provide baseline methods for multilingual fake news detection; and
• We discuss benefits and propose insights for fake news detection research on multilingualism and social media with MM-COVID.

The rest of this paper is organized as follows. We review the related work in Section II. The detailed dataset construction and collection are presented in Section III. The exploratory data analysis and fake news detection baselines are illustrated in Section IV and Section V, respectively. Finally, we propose insights into multilingual fake news detection in Section VI and conclude in Section VII.

COVID-19 fake news is now a global threat. Fake news in different languages is exploding on social media.
Most of it is intentionally written to mislead readers. To better combat COVID-19 fake news, a multilingual and comprehensive dataset for developing fake news detection methods is necessary. Although there are many fake news datasets, most of them are either monolingual or contain only linguistic features. To relieve the threat of fake news during the pandemic, we propose a dataset, MM-COVID, which contains not only multilingual fake news but also multi-dimensional features, including news content and social engagements. For clarity, we introduce the related fake news datasets in detail below.

... myth related keywords to collect the fake tweets.

From Table I, we can see that no existing fake news dataset offers multilingual fake news together with comprehensive news content and social engagements. The existing datasets also have limitations that we want to address in our proposed dataset. For example, FakeCovid labels news pieces as fake or not fake, where the not-fake class contains partly fake, half true, missing evidence, and so on. The news content in FakeNewsNet contains noise, since some of it is collected from Google Search results, which often mention similar but unrelated news pieces. ReCOVery labels each news piece as credible or not credible based on the news source, rather than having human experts label each news piece separately. CoAID mostly keeps only the title of the fake news, and many fake news pieces lack social engagements. To address these limitations, we provide a new multilingual and multi-dimensional dataset, MM-COVID, which covers 6 languages and contains information ranging from the fake news content to the related social engagements.

In this section, we introduce the whole procedure of data collection, including fake news content and social context. The whole process is depicted in Figure 1.
As shown in Figure 1, we first need to get reliable labels from the fact-checking websites and then retrieve the source content from these websites. We collect the veracity ...

To keep a sufficient quantity for each language, we keep only the following languages: English (en), Spanish (es), Portuguese (pt), Hindi (hi), French (fr), and Italian (it). Because the Poynter website only displays the translated English claims, we set the language of each claim based on the language used in the fact-checking article. After collecting the reliable labels, we set heuristic crawling strategies for each fact-checking website to fetch the source content URLs.

Fact-checking websites: www.snopes.com, www.poynter.org/coronavirusfactsalliance/, www.politifact.com, fullfact.org/.

In some cases, the source content URL may no longer be available. To resolve this problem, we check an archive website to see whether the page has been archived. If not, we consider the claim itself as the content of the fake news. Since most news pieces in Poynter and Snopes are fake news, to balance the dataset for each language, we choose several official health websites and collect the COVID-19 related news on these websites as additional real information. To filter out unrelated information, we collect only the news pieces whose title contains any of the keywords COVID-19, Coronavirus, and SARS-CoV-2. The reliable websites for each language are listed in Appendix Table VIII. After we get the source URLs, we use Newspaper3k to crawl the content and its meta-information. Note that the sources of both fake news and real news include social media posts on Facebook, Twitter, Instagram, WhatsApp, etc., as well as news articles posted on blogs and by traditional news agencies.

As shown in Figure 1, we collect the user social engagements from the social platform based on the news content.
Specifically, we form the search query from the URL, the headline, and the first sentence of the source content, and then use the Twitter advanced search API through twarc to collect the user social engagements. To reduce search noise, we remove special characters and negative words, use TF-IDF [9] to extract the important words, and finally check the query manually. The social engagements include the tweets that directly mention the news pieces, and the replies and retweets responding to these tweets. After we obtain the related tweets from the advanced search results, we collect their replies and retweets. Because Twitter's API does not support fetching replies, we approximate this by using the tweet's ID as the search query, which can only obtain replies sent in the last week. In the end, we fetch the profiles, network connections, and timelines of all users who engage in the news dissemination process.

In this section, we demonstrate the quality of MM-COVID through statistical analysis and visualization. Because MM-COVID contains multi-dimensional information that can be used as features to identify fake news, we separately compare real news and fake news in terms of source content, social context, and language spatial-temporal information. We also select several fake news detection methods as baselines for further research. The detailed statistics of our dataset are shown in Table III.

Since malicious users mostly manipulate the text content to mislead the audience, textual clues remain in the fake news content. We reveal these clues through word clouds and a visualization of semantic representations, and compare fake news with real news. In Figure 2, we visualize the most frequent words for each language. Non-English languages are translated into English for comparison.
From Figure 2, we can see that fake news often mentions medical-related words like doctor, hospital, and vaccine across languages. This is because these are the front line of defense against the coronavirus, and malicious users transmit such fake news to spread fear and anxiety. Fake news also mentions country names like India, China, Spain, and Brazil, whereas real news often mentions keywords like test and patient. We also observe topic similarities and differences among languages. For example, "es", "fr", and "it" all talk about welfare, with words like commission and aid, while the other languages do not mention these words. Although there is a topic difference between fake news and real news, it is not consistent across languages, and it cannot be directly applied to a single piece of text [10]. Thus, it is necessary to learn a better representation of the content and to include auxiliary features, such as the social context, in detection.

To understand the difference in semantic representation between fake news and real news, we visualize the hidden representations of the content in Figure 3. We first use multilingual RoBERTa to learn the representation of the content and then use t-SNE [11] to visualize these hidden representations. From Figure 3, we can see that there are some separable fake news and real news clusters, while the center upper-right region mixes the two labels. This observation indicates the need for better feature representations across languages and the difficulty of detecting fake news from content alone.

To understand how fake news is spread and debunked in different languages, we show the timeline of when common fake news originated and was debunked in Figure 4. We find that the selected fake news has spread in different languages and that there is a delay between the spreads.
For example, the fake news "Alcohol cures COVID-19" takes about half a month to pass from English to Hindi. In addition, many fake news pieces have several similar versions in the same language. For example, fake news like "Hydroxychloroquine benefit treating COVID-19" has many versions in English. This indicates the possibility of early detection, both across languages and within a language, based on historical data.

Social media platforms provide direct access to a large amount of information that may contain COVID-19 related fake news, along with the propagation networks, transmission paths, and the interacting user nodes on those paths. All of these can provide auxiliary and language-invariant information for fake news detection. Monolingual fake news models that integrate social context, like dEFEND [12] and TCNN-URG [13], have achieved considerable performance improvements over relying only on fake news content. Our dataset contains three kinds of social context: user profiles, tweet posts, and social network structure. These provide the opportunity to explore such findings across languages. In the following sections, we explore the characteristics of these features and discuss their potential use in fake news detection.

User Profiles: Existing research [14] has shown a correlation between user profiles and fake news detection. For example, users who are not credible or are bot-like are more likely to transmit fake news [15] [10] [16], and social bots play a disproportionate role in spreading fake news [17]. In this part, we illustrate several useful features. First, we explore the users' social networks to see whether there is a difference between users who engage with fake news and those who engage with real information. We visualize the follower and friend counts of all users engaging with fake news and real information in Figure 5.
From this figure, we observe that users who interact with fake news in es, pt, hi, fr, and it have a larger number of friends and followers than those who interact with real news, with p-value < 0.05 under a statistical t-test. However, in en, there is no significant difference in followers and friends. Lastly, we include more user profile information to understand the bot-like probability of users for news of different veracity. For each language, we randomly sample 500 users who only respond to fake news and another 500 users related to real news for bot detection. For a language that contains fewer than 500 users, like pt and fr in real news, we take all the users in that language. We use the state-of-the-art bot detection method Botometer [18] to estimate the probability that a user is a social bot. Botometer makes its prediction based on a user's public profile, timeline, and mentions. From the cumulative distributions shown in Figure 6, we find that users who engage with fake news are slightly more likely to be bots. In languages like hi and fr, users with an extremely high bot-likelihood (> 0.6) are more likely to interact with fake news. This observation is consistent with past fake news research [3], [19]. However, we also observe that bot-likelihood does not always indicate the veracity of the news. For example, in es and pt, we have the opposite observation, and in it, there is no significant difference between real news and fake news.

Tweet and Response: On social media, people express their emotions and their focus on an event through tweets and their responses. These features can benefit fake news detection in general [20] [21]. We first perform sentiment analysis on the tweets. Since no sentiment classification method covers these 6 languages and emoji are a proxy for the sentiment of a tweet, we show the distribution of emoji in tweets for the different languages in Figure 7.
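As a minimal sketch of how such an emoji distribution could be tallied (not the authors' pipeline), the snippet below counts emoji per veracity label over a few invented reply texts. The code point ranges in EMOJI_RANGES, the (label, text) input format, and the function names are assumptions made for illustration; a real analysis would need a complete emoji definition.

```python
from collections import Counter

# Rough emoji code point ranges (an assumption, not a complete emoji definition).
EMOJI_RANGES = (
    (0x1F300, 0x1FAFF),  # pictographs, emoticons, transport, supplemental symbols
    (0x2600, 0x27BF),    # miscellaneous symbols and dingbats (includes some arrows)
)


def is_emoji(ch):
    """Return True if the character falls inside one of the assumed ranges."""
    code = ord(ch)
    return any(low <= code <= high for low, high in EMOJI_RANGES)


def emoji_distribution(tweets):
    """Count emoji over (label, text) pairs, returning one Counter per label."""
    counts = {}
    for label, text in tweets:
        counts.setdefault(label, Counter()).update(ch for ch in text if is_emoji(ch))
    return counts


# Invented example replies; escapes are used so the file stays ASCII-only.
replies = [
    ("fake", "no way \U0001F602\U0001F602"),         # two "tears of joy" faces
    ("fake", "this is outrageous \U0001F621"),       # one angry face
    ("real", "read the guidance \u27A1 wash hands"), # one right arrow
]
dist = emoji_distribution(replies)
```

Normalizing each Counter by the number of tweets per label would make languages with different tweet volumes comparable.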
Looking at the emoji of the reply tweets (Figure 7), we observe that there are more emotional emoji in the tweets about fake news, like laughing in en, pt, hi, and fr, and angry in hi and it. However, for real news, direction and enumeration emoji dominate in all languages. These observations indicate that emoji, or users' emotions, can benefit fake news detection.

To gain insight into the intensity of user interaction with fake news and real news, we show the distribution of the counts of retweets and replies towards them. From Figure 8 and Figure 9, we find that for languages other than en, real news gets a larger number of replies and retweets than fake news, but in en there is no significant difference between real news and fake news. These observations indicate that language also affects users' social interactions.

Next, to understand the topic difference between the tweets of fake and real news, we show the most salient hashtags in Table IV. We remove frequent hashtags like #COVID-19, #Coronavirus, and #sars to better expose distinct patterns. From Table IV, we observe that there exist consistent differences in several languages. For example, in en, fake news tweets mention the keywords of common conspiracy theories like #vaccine and #hydroxychloroquine. This also happens in fr and it: fake news in fr mentions #5g, #chloroquine, and #antimasque, and fake news in it mentions #vaccino. Besides, fake news tweets in en and hi mention the political keywords #trump and #telanganaliberation, respectively. In contrast, real news tweets in these languages focus either on official health agencies, like #nih in en, or on general exhortations to defend against COVID-19, like #healthforall in hi, #stopthepandemic and #prevention in fr, and #restiamoadistanza in it. Meanwhile, there is no significant topic difference in es and pt, where fake news and real news both talk about general exhortations.
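The hashtag comparison above can be approximated with a simple frequency count. The sketch below is illustrative only: it assumes that salience means raw frequency, that hashtags can be matched with the regular expression shown (hyphens are allowed so that #COVID-19 is captured as written in the text), and that the removed hashtags are exactly the three named above. The example tweets are invented.

```python
import re
from collections import Counter

# Assumed pattern: '#' followed by word characters or hyphens.
HASHTAG_RE = re.compile(r"#[\w-]+")

# Frequent hashtags named in the text; the full removal list is not given.
COMMON_HASHTAGS = {"#covid-19", "#coronavirus", "#sars"}


def top_hashtags(tweets, k=3):
    """Return the k most frequent hashtags after dropping the common ones."""
    counts = Counter()
    for text in tweets:
        for tag in HASHTAG_RE.findall(text.lower()):
            if tag not in COMMON_HASHTAGS:
                counts[tag] += 1
    return counts.most_common(k)


# Invented tweets standing in for the fake news tweets of one language.
fake_tweets = [
    "The #vaccine is a hoax #COVID-19",
    "#hydroxychloroquine works! #vaccine #Coronavirus",
]
salient = top_hashtags(fake_tweets)
```

Running the same function on the real news tweets of a language and comparing the two lists side by side gives a rough view of the topic differences.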
Recent research has shown that the temporal information of social engagements can improve fake news detection performance [22], [23]. To reveal the difference in temporal patterns between real news and fake news, we follow the analysis approach in [3], [19]: we select two news pieces for each language and show the count of their social engagements. From Figure 10, we observe that (i) real news in en, es, pt, hi, and fr has a sudden increase in social engagements; (ii) in it, on the contrary, there is a steady increase for real news. These common temporal social engagement patterns allow us to extract language-invariant features for fake news detection.

In this section, we select several baseline methods to perform fake news detection on MM-COVID. Since COVID-19 is a global pandemic, COVID-19 fake news has spread all over the world. There are three stages of fake news spreading in one language: at the beginning, there is no fake news resource (labeled fake news content); in the middle, there is a limited resource; and in the end, there is enough resource. We aim to answer three research questions under these resource settings:
• RQ1 Enough Resource: what is the fake news classification performance on each language when there is enough resource?
• RQ2 Low Resource: what is the fake news classification performance for each language when there is low resource in that language?
• RQ3 No Resource: what is the fake news classification performance for each language when there is no resource in that language?
We deploy several fake news detection methods as follows:
• Text Content: Models in this group use only the fake news text content for detection. We apply several classification methods: Support Vector Machine (SVM), XGBoost (XGB), and dEFEND\C, a variant of dEFEND [12] that uses a sentence-attention LSTM model to learn the representation of the news content.
• Social Context: The social context-based models use the social engagements for fake news classification. We use dEFEND\N, a variant of dEFEND [24] that uses the users' reply sequences for fake news detection.
• Text Content and Social Context: dEFEND [12] uses both the user replies from the social engagements and the fake news content for fake news detection.

The overall dataset is randomly divided into training and testing sets, with the proportion depending on the resource setting. To control counterfactual features of the dataset (the length of the fake news, the existence of social engagements), we remove the fake news samples whose length is shorter than 10 word tokens and those whose count of replies and tweets is zero. In addition, we balance the fake news and real news. This results in 1,006, 174, 300, 142, 90, and 70 samples in en, es, pt, hi, fr, and it, respectively. For each method, we repeat the experiment 5 times and report the average accuracy and Macro-F1 score. For traditional machine learning methods (SVM and XGBoost), we use bag-of-words to represent the text. For neural network-based methods (dEFEND and its variants), we use XLM-RoBERTa [25] to get the representation of the text without fine-tuning. To answer the three research questions, we set up three experiment settings:

Enough Resource: We train the fake news classification model on 80% of the data and test on the remaining 20% for each language. The experiment results are provided in Table V. We observe that (i) among content-based approaches, dEFEND\C achieves the best performance, and all content baselines achieve reasonable performance in all languages; (ii) dEFEND, which uses both social context and content, achieves the best performance compared with models that use only content or only social context. These experimental observations indicate the importance of social engagements in fake news detection and the quality of MM-COVID in each language.
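The preprocessing and evaluation steps described above can be sketched as follows. This is a hedged illustration rather than the authors' code: the field names (text, label, replies, tweets), whitespace tokenization, balancing by downsampling the larger class, and the fixed random seed are all assumptions, and the samples are synthetic.

```python
import random


def filter_and_balance(samples, min_tokens=10, seed=0):
    """Drop short or engagement-free samples, then equalize class sizes."""
    kept = [
        s for s in samples
        # Assumption: a sample is removed if EITHER condition from the text holds.
        if len(s["text"].split()) >= min_tokens and s["replies"] + s["tweets"] > 0
    ]
    rng = random.Random(seed)
    fake = [s for s in kept if s["label"] == "fake"]
    real = [s for s in kept if s["label"] == "real"]
    n = min(len(fake), len(real))
    # Assumption: balancing is done by downsampling the larger class.
    balanced = rng.sample(fake, n) + rng.sample(real, n)
    rng.shuffle(balanced)
    return balanced


def split_train_test(samples, train_fraction=0.8):
    """Split an already shuffled list into train and test parts."""
    cut = int(train_fraction * len(samples))
    return samples[:cut], samples[cut:]


def macro_f1(y_true, y_pred, labels=("fake", "real")):
    """Unweighted mean of the per-class F1 scores."""
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Synthetic samples: 12-token texts pass the length filter, "too short" does not.
long_text = "word " * 12
samples = (
    [{"text": long_text, "label": "fake", "replies": 2, "tweets": 1} for _ in range(3)]
    + [{"text": long_text, "label": "real", "replies": 1, "tweets": 4} for _ in range(2)]
    + [{"text": "too short", "label": "real", "replies": 5, "tweets": 5}]
    + [{"text": long_text, "label": "fake", "replies": 0, "tweets": 0}]
)
balanced = filter_and_balance(samples)
train, test = split_train_test(balanced)
score = macro_f1(["fake", "fake", "real", "real"], ["fake", "real", "real", "real"])
```

Repeating the split with different seeds and averaging accuracy and Macro-F1 over the runs would mirror the five-repetition protocol mentioned in the text.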
Low Resource: There is a limited amount of target-language resource and enough resource in other languages. We jointly train the model on multiple source languages and limited target-language samples, then apply the model to the target language. For each source language, 80% of the data is used for training; for the target language, only 20% of the data is used for training and 20% for testing. From the experiment results shown in Table VI, we find that (i) without any source language, dEFEND achieves the best performance across all languages, and dEFEND\N achieves better performance than dEFEND\C in most languages (en, hi, and fr). This indicates that the social context provides auxiliary information when there is a limited resource; (ii) in es, the additional languages improve the performance of the dEFEND\N and dEFEND models, and in fr, the additional languages improve dEFEND\C and dEFEND\N. However, in other cases, simply combining different languages brings much noise.

No Resource: This is the situation in which fake news spreads in a new language and there is no labeled fake news content in that language to train a language-dependent fake news detection model. For each language, we split the dataset into 80% and 20% for training and testing, respectively. For simplicity, we train the detection model on only one source language and then apply it to the target language. From the experiment results shown in Table VII, we observe that social information plays an important role in most languages (en, es, pt, hi, and fr; dEFEND\N, dEFEND > dEFEND\C). This result indicates that social context can provide language-invariant features for cross-lingual fake news detection.

Our goal is to provide a comprehensive COVID-19 fake news dataset to help research around the COVID-19 infodemic.
This dataset provides multilingual and multi-modal information that could benefit various topics like cross-lingual and early fake news detection, fake news propagation, and fake news mitigation.

Cross-Lingual Fake News Detection: The multilingual characteristics bring two new applications from a language perspective. On the one hand, with COVID-19 fake news emerging daily, we can correlate the knowledge learned from different languages to improve overall fake news detection performance in the future; on the other hand, for languages that are poor in annotated fact-checking labels, we can transfer the knowledge from rich source languages such as English to these low-resource languages. Past cross-lingual research, such as abusive language detection [26], cross-lingual rumor verification [27], and cross-lingual hate speech detection [28], has shown strong performance in either language cooperation or low-resource languages. These approaches use only text information, extracting language-invariant features and encoding the text content into a shared embedding space to transfer knowledge among languages. Since fake news is intentionally written to mislead audiences, detecting it from content alone is hard even in a monolingual setting [3], let alone a cross-lingual one. Our dataset provides auxiliary information like social engagements. dEFEND [12] integrates users' replies into fake news representation learning, and Shao [29] proposes a method that uses user profiles for fake news detection.
Thus, MM-COVID provides a comprehensive dataset for studying cross-lingual fake news detection by expanding the feature space to include both fake news content and social engagements.

Early Fake News Detection: COVID-19 fake news has already brought uncertainty, fear, and racism globally. To defend against future epidemic fake news and reduce its impact, it is urgent to identify fake news at an early stage, before it is widely spread [30]. This means that only limited social engagements can be used for detection. Our dataset contains timestamps for the engaged tweets, retweets, and replies, which allows researchers to set specific early time windows to understand the pattern differences between fake news and real news. Besides, user characteristics play a very important role in early fake news detection [14]. We include user profiles, recent timelines, and follower-friend networks in MM-COVID, from which useful features can be extracted to develop early detection models. Overall, this dataset provides not only all the required features but also the flexibility for researchers to carry out early fake news detection analysis to defend against the next epidemic.

Multi-Modal Fake News Detection: Some COVID-19 fake news content contains figures or videos together with text. Existing research also suggests that combining textual and visual features can improve fake news detection performance [10], [31], [32]. MM-COVID contains multi-modal information by keeping the referenced URLs of the pictures and videos embedded in the fake news content. In this way, researchers can develop new models that extract textual and visual features for COVID-19 fake news detection.

Cross-Domain Fake News Detection: MM-COVID is a mixture of different fake news domains, like politics, entertainment, and health.
It can help researchers learn domain-invariant and domain-dependent features for cross-domain fake news detection.

To overcome the negative impact of fake news after it is posted, it is urgent to reduce its spread. Fake news on social media is widely distributed through users' social networks and personalized recommendation algorithms [33].

Propagation Network Intervention: The aim of propagation network intervention is to prevent the spread of fake news. There are two main approaches [3]: (i) Influence Minimization: slowing down the spread of fake news during the dissemination process. Past research [34], [35] proposes methods that delete a small set of users in the propagation network to reduce the spread of fake news. (ii) Mitigation Campaign: maximizing the spread of true news to combat the dissemination of fake news. The work in [34], [36], [37] selects k seed users for a true news cascade in the presence of fake news to minimize the number of users who will be influenced by fake news. MM-COVID provides rich propagation network information, such as multiple dissemination paths (tweet, reply, and retweet) and detailed meta-information about the interacting users and the transmitted information, which can help researchers build heterogeneous diffusion networks to better understand fake news influence minimization and real news influence maximization.

Personalized Recommendation Algorithm Intervention: Since people react more extremely to, and engage more with, fake news content, the recommendation algorithms of social media platforms propagate fake news to attract more users [33]. MM-COVID contains the fake news pages and their relevant authoritative evidence pages from fact-checking websites. These web pages can help researchers develop fake-news-aware recommendation algorithms that drop fake news pages.
In addition, MM-COVID provides user profile metadata and historical tweets, which can facilitate the study of personalized fake-news-aware recommendation algorithms.

C. Fact-checking Accessory: A fact-checking accessory aims to improve the efficiency of the debunking process for fact-checking agencies like Snopes and PolitiFact. The manual fact-checking process requires fact-checkers not only to judge the veracity of the content but also to provide additional evidence and context from authoritative sources to support their decisions. To fully use fact-checkers' expertise and help them engage with the domains they are familiar with, researchers can build a model that recommends suspicious claims of interest to professional fact-checkers. In addition, it is possible to automatically retrieve evidence content during the fact-checking process. MM-COVID provides the metadata of fact-checking reviews, including the suspicious claim, the name of the fact-checker, and the detailed content of the review. This rich information can help researchers develop semi-automatic or automatic fact-checking accessories that help fact-checkers report fake news.

To combat the global infodemic, we release a multilingual fake news dataset, MM-COVID, which contains news content, social context, and spatiotemporal information in six languages: English, Spanish, Portuguese, Hindi, French, and Italian. Through our exploratory analysis, we identify several language-invariant and language-variant features for fake news detection. The experiment results of several fake news detection methods under three settings (enough, low, and no resource) demonstrate the utility of MM-COVID. This dataset can facilitate further research in fake news detection, fake news mitigation, and fact-checking efficiency improvement.
Studying fake news via network analysis: Detection and mitigation
"Liar, liar pants on fire": A new benchmark dataset for fake news detection
FakeNewsNet: A data repository with news content, social context and spatialtemporal information for studying fake news on social media
FakeCovid - a multilingual cross-domain fact check news dataset for COVID-19
ReCOVery: A multimodal repository for COVID-19 news credibility research
CoAID: COVID-19 healthcare misinformation dataset
Characterizing COVID-19 misinformation communities using a novel Twitter dataset
A curated collection of COVID-19 online datasets
Fake news detection on social media: A data mining perspective
Visualizing data using t-SNE
dEFEND: Explainable fake news detection
Neural user response generator: Fake news detection with collective user intelligence
The role of user profile for fake news detection
Leveraging multi-source weak social supervision for early detection of fake news
Media bias in the marketplace
The spread of low-credibility content by social bots
BotOrNot
Ginger cannot cure cancer: Battling fake health news with a comprehensive data repository
News verification by exploiting conflicting social viewpoints in microblogs
Rumor has it: Identifying misinformation in microblogs
Detecting rumors from microblogs with recurrent neural networks
FakeNewsTracker: A tool for fake news collection, detection, and visualization
FakeNewsTracker: A tool for fake news collection, detection, and visualization
Unsupervised cross-lingual representation learning at scale
Cross-domain and cross-lingual abusive language detection: A hybrid approach with deep learning and a multilingual lexicon
Cross-lingual cross-platform rumor verification pivoting on multimedia content
Cross-lingual zero- and few-shot hate speech detection utilising frozen transformer language models and AXEL
The spread of fake news by social bots
Early detection of fake news on social media through propagation path classification with recurrent and convolutional networks
Multimodal fusion with recurrent neural networks for rumor detection on microblogs
EANN: Event adversarial neural networks for multimodal fake news detection, ser. KDD '18
Disinformation in the online information ecosystem: Detection, mitigation and challenges
Limiting the spread of misinformation in social networks
Containment of misinformation spread in online social networks
Influence blocking maximization in social networks under the competitive linear threshold model technical report
Combating fake news: A survey on identification and mitigation techniques

There are several potential improvements for future work: (1) include more languages in the dataset, such as Chinese, Russian, German, and Japanese; (2) collect social context from other social platforms like Reddit, Facebook, YouTube, and Instagram. The sources of real news are listed in Table VIII.