key: cord-0586634-by6v2hhh
authors: Zhou, Xinyi; Mulay, Apurva; Ferrara, Emilio; Zafarani, Reza
title: ReCOVery: A Multimodal Repository for COVID-19 News Credibility Research
date: 2020-06-09
journal: nan
DOI: nan
sha: 06fb506703d8015b67d8f0ecdf01d7b23cd99f22
doc_id: 586634
cord_uid: by6v2hhh

First identified in Wuhan, China, in December 2019, the outbreak of COVID-19 was declared a global emergency in January 2020 and a pandemic in March 2020 by the World Health Organization (WHO). Along with this pandemic, we are also experiencing an "infodemic" of information with low credibility, such as fake news and conspiracies. In this work, we present ReCOVery, a repository designed and constructed to facilitate research on combating such information regarding COVID-19. We first broadly search and investigate ~2,000 news publishers, from which 61 are identified with extreme [high or low] levels of credibility. By inheriting the credibility of the media on which they were published, a total of 2,029 news articles on coronavirus, published from January to May 2020, are collected in the repository, along with 140,820 tweets that reveal how these news articles spread on the social network. The repository provides multimodal information about news articles on coronavirus, including textual, visual, temporal, and network information. The way news credibility is obtained allows a trade-off between dataset scalability and label accuracy. Extensive experiments are conducted to present data statistics and distributions, as well as to provide baseline performances for predicting news credibility, so that future methods can be directly compared. Our repository is available at http://coronavirus-fakenews.com and will be updated over time.

As of June 4, the COVID-19 pandemic has resulted in over 6.4 million confirmed cases and over 380,000 deaths globally. 1 Governments have enforced border shutdowns, travel restrictions, and quarantines to "flatten the curve" [2]. The COVID-19 outbreak has had a detrimental impact not only on the healthcare sector but also on every aspect of human life, such as the education and economic sectors [10]. For example, over 100 countries have imposed nationwide (even complete) closures of education facilities, which has left over 900 million learners affected. 2 Statistics indicate that 3.3 million Americans applied for unemployment benefits in the week ending March 21, and the number doubled in the following week; before then, the highest number of unemployment applications ever received in one week was 695,000, set in 1982 [7]. Along with the COVID-19 pandemic, we are also experiencing an "infodemic" of low-credibility information regarding COVID-19. 3 Hundreds of news websites have contributed to publishing false coronavirus information. 4 Individuals who believe false news articles claiming, for example, that eating boiled garlic or drinking chlorine dioxide (an industrial bleach) can cure or prevent coronavirus might take ineffective or extremely dangerous actions to protect themselves from the virus. 5 This background motivates research to combat the infodemic. Hence, we design and construct a multimodal repository, ReCOVery, to facilitate reliability assessment of news on COVID-19. As past literature has indicated, there is a close relationship between the credibility of news articles and that of their publication sources [22]. We therefore first broadly search and investigate ~2,000 news publishers, from which 61 are identified with extreme [high or low] levels of credibility.
In total, 2,029 news articles on coronavirus are collected in the repository, along with 140,820 tweets that reveal how these news articles spread on the social network. The main contributions of this work are summarized as follows. First, we construct a repository to support research investigating (1) how news with low credibility is created and spreads during the COVID-19 pandemic and (2) ways to predict such "fake" news. The manner in which the ground truth of news credibility is obtained allows the repository to scale: annotators need not label each news article, which is time-consuming; instead, they can directly label news sites. Second, ReCOVery provides multimodal information on COVID-19 news articles. For each news article, we collect its news content and the social context revealing how it spreads on social media, covering textual, visual, temporal, and network information. Third, we conduct extensive experiments using ReCOVery, including analyses of our data (statistics and distributions) and baseline performances for predicting news credibility using ReCOVery data. These baselines allow future methods to be directly compared. Baselines are obtained using either news content alone or news content combined with social context information, within a supervised machine learning framework.

The rest of this paper is organized as follows. We first detail how the data is collected in Section 2. The statistics and distributions of the data are presented and analyzed in Section 3. Experiments that use the data to predict news credibility are designed and conducted in Section 4, whose results can be used as benchmarks. Finally, we review related datasets in Section 5 and conclude in Section 6.

The overall process by which we collect the data, including news content and social media information, is presented in Figure 1. To facilitate scalability, news credibility is assessed based on the credibility of the medium (site) that publishes the news article. Based on the process outlined in Figure 1, we further detail how the data is collected by answering the following three questions: (1) How do we identify reliable (or unreliable) news sites that mainly release real (or fake) news? (addressed in Section 2.1) Having determined such news sites, (2) how do we crawl COVID-19 news articles from these sites, and which news components are valuable to collect? (Section 2.2) And given COVID-19 news articles, (3) how can we track their spread on social networks? (Section 2.3) A high-level sketch of this pipeline is given below.
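As a rough illustration, the pipeline of Figure 1 can be summarized in three steps. The sketch below is hypothetical: the site list, the crawler stub, and all field names are placeholders of ours rather than the repository's actual implementation; the real steps are detailed in Sections 2.1-2.3.

```python
# A high-level, hypothetical sketch of the three-step pipeline in Figure 1.
# All data below is placeholder; Sections 2.1-2.3 describe the real steps.

sites = [  # Step 1 (Section 2.1): sites with extreme credibility ratings
    {"domain": "example-reliable.com", "label": "reliable"},
    {"domain": "example-unreliable.com", "label": "unreliable"},
]

def collect_articles(site):
    """Step 2 (Section 2.2): crawl COVID-19 articles from one site (stub)."""
    return [{"url": f"https://{site['domain']}/article-1",
             "title": "...", "body_text": "... coronavirus ..."}]

repository = []
for site in sites:
    for article in collect_articles(site):
        article["reliability"] = site["label"]  # label inherited from the site
        repository.append(article)

# Step 3 (Section 2.3): tweets spreading each article are collected
# separately via the Twitter Search API.
```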
To determine a list of reliable and unreliable news sites, we primarily rely on two resources: NewsGuard and Media Bias/Fact Check.

NewsGuard. 6 NewsGuard reviews and rates news websites. Its reliability rating team consists of trained journalists and experienced editors, whose credentials and backgrounds are all transparent and available on the site. The credibility of each news website is assessed based on the following nine journalistic criteria:
(1) does not repeatedly publish false content (22 points);
(2) gathers and presents information responsibly (18 points);
(3) regularly corrects or clarifies errors (12.5 points);
(4) handles the difference between news and opinion responsibly (12.5 points);
(5) avoids deceptive headlines (10 points);
(6) discloses ownership and financing (7.5 points);
(7) clearly labels advertising (7.5 points);
(8) reveals who's in charge, including possible conflicts of interest (5 points); and
(9) provides the names of content creators, along with either contact or biographical information (5 points).
The overall score of a site ranges from 0 to 100, where 0 indicates the lowest credibility and 100 the highest. A news website with a NewsGuard score above 60 is typically labeled reliable; otherwise, it is unreliable. NewsGuard has provided ground truth for the construction of news datasets for studying misinformation, such as NELA-GT-2018 [11].

Media Bias/Fact Check (MBFC). 7 MBFC is a website that rates the factual accuracy and political bias of news media. The fact-checking team consists of Dave Van Zandt, the primary editor and website owner, and several journalists and researchers (more details can be found on its "About" page). MBFC assigns each news medium one of six factual-accuracy levels based on the fact-checking results of the news articles it has published (more details can be found on its "Methodology" page): (i) very high, (ii) high, (iii) mostly factual, (iv) mixed, (v) low, and (vi) very low. Such information has been used as ground truth for automatic fact-checking studies [1].

What Are Our Criteria? Drawing on NewsGuard and MBFC, our criteria for determining reliable and unreliable news sites are as follows (a code sketch of this rule appears at the end of this subsection):
✓ Reliable: A news site is reliable if its NewsGuard score is greater than 90 and its factual reporting on MBFC is very high or high.
× Unreliable: A news site is unreliable if its NewsGuard score is less than 30 and its factual reporting on MBFC is below mixed (i.e., low or very low).

Our search for highly credible news media is conducted among the media listed on MBFC (~2,000). To find news media with low credibility, we search MBFC and NewsGuard's newly released "Coronavirus Misinformation Tracking Center", 5 which provides a list of websites publishing false coronavirus information. Ultimately, we obtain a total of 61 news sites, of which 22 are sources of reliable news articles (e.g., National Public Radio 8 and Reuters 9) and the remaining 39 are sources of unreliable news articles (e.g., Humans Are Free 10 and Natural News 11). The full list of sites considered in our repository is available at http://coronavirus-fakenews.com. Note that several "fake" news media, such as 70 News, Conservative 101, and Denver Guardian, are not included, since they no longer exist or their domains are unavailable. Also note that, to achieve a good trade-off between dataset scalability and label accuracy, we adopt more extreme threshold scores (30 and 90) than the default cutoff provided by NewsGuard (60).
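The selection rule above can be expressed compactly. The following is a minimal sketch assuming each site's NewsGuard score and MBFC factual-reporting level have already been looked up manually; the function name and level strings are illustrative, not part of either service's API.

```python
def label_site(newsguard_score, mbfc_factual):
    """Apply the extreme-credibility criteria to one news site.

    Returns 'reliable', 'unreliable', or None when the site is excluded
    because its credibility signals are not extreme enough.
    """
    if newsguard_score > 90 and mbfc_factual in {"very high", "high"}:
        return "reliable"
    if newsguard_score < 30 and mbfc_factual in {"low", "very low"}:
        return "unreliable"
    return None

print(label_site(95, "high"))   # reliable
print(label_site(15, "low"))    # unreliable
print(label_site(70, "mixed"))  # None: excluded, keeping labels accurate
```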
In this way, the selected news sites share extreme reliability (or unreliability), reducing the number of false positives and false negatives among the news labels in our repository; ideally, every news article published on a reliable site is factual, and every article published on an unreliable site is false. Figure 2 illustrates the credibility distributions of reliable and unreliable news sites. It can be observed from the figure that most reliable news sites receive a full score on NewsGuard and are labeled "high" for factual reporting by MBFC; "very high" is rare across all sites listed on MBFC. In contrast, unreliable news sites share an average NewsGuard score of ~15 and a "low" factual label on MBFC; similarly, "very low" is rarely given on MBFC.

To crawl COVID-19 news articles from the selected news sites, we first determine whether a news article is about COVID-19; the process is detailed in Section 2.2.1. Next, we detail how the data is crawled and which news content components are included in our repository (Section 2.2.2).

To identify news articles on COVID-19, we use a list of keywords:
• COVID-19,
• SARS-CoV-2, and
• Coronavirus.
News articles whose content contains any of these keywords (case-insensitive) are considered related to COVID-19. These three keywords are the official names announced by the WHO on February 11, where "SARS-CoV-2" (standing for Severe Acute Respiratory Syndrome CoronaVirus 2) is the name of the virus, and "coronavirus" and "COVID-19" are names of the disease the virus causes. Before the WHO announcement, COVID-19 was known as the "2019 novel coronavirus", 14 which also contains the "coronavirus" keyword we are considering. We consider only official names as keywords to avoid potential biases, or even discrimination, in the articles collected. Furthermore, a news medium (or article) that is credible, or pretends to be credible, often acts professionally and adopts the official name(s) of the disease/virus. Compared to articles that use biased and/or inaccurate terms, false news pretending to be professional is more detrimental and more challenging to detect, which has become the focus of current fake news studies [22]. Examples of such news articles are illustrated in Figure 3.

[Figure 3: Examples of news articles collected. (a) Reliable news. 12 (b) Unreliable news. 13]
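Concretely, the matching rule described above might look like the following sketch; the function and constant names are ours, not from the repository's codebase.

```python
# Case-insensitive filter deciding whether a crawled article is about
# COVID-19, using only the official WHO names as keywords.
OFFICIAL_KEYWORDS = ("covid-19", "sars-cov-2", "coronavirus")

def is_covid_related(text):
    lowered = text.lower()
    return any(keyword in lowered for keyword in OFFICIAL_KEYWORDS)

# Substring matching also catches pre-announcement phrasing:
assert is_covid_related("The 2019 novel coronavirus spreads rapidly.")
assert not is_covid_related("An article about influenza vaccines.")
```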
News content is crawled from the selected news sites using a Python library. 15 The content of each news article corresponds to twelve components:
(C1) News ID: Each news article is assigned a unique ID as its identity;
(C2) News URL: The URL of the news article. The URL helps us verify the correctness of the collected data. It can also be used as the reference and source when repository users would like to extend the repository by fetching additional information;
(C3) Publisher: The name of the news medium (site) that publishes the news article;
(C4) Publication Date: The date (in yyyy-mm-dd format) on which the news article was published on the site, which provides temporal information to support investigating, e.g., the relationship between the volume of misinformation and the progression of the COVID-19 outbreak over time;
(C5) Author: The author(s) of the news article, of which there can be none, one, or several. Note that some news articles might carry fictional author names. Author information is valuable for evaluating news credibility, either by investigating the collaboration network of authors [14] or by exploring its relationships with news publishers and content [20];
(C6-7) News Title and Body Text: the main textual information;
(C8) News Image: the main visual information, provided in the form of a link (URL). Note that most images within a news page are noise: they can be advertisements, images belonging to other news articles surfaced by the recommender systems embedded in news sites, logos of news sites, and/or social media icons, such as the Twitter and Facebook logos for sharing. Hence, we fetch only the main/head/top image of each news article to reduce noise;
(C9) Country: The name of the country where the news is published;
(C10) Political Bias: Each news article is labeled as one of 'extreme left', 'left', 'left-center', 'center', 'right-center', 'right', and 'extreme right', matching the political bias of its publisher. News political bias is verified using two resources, AllSides 16 and MBFC, both of which rely on domain experts to label media bias; and
(C11-12) NewsGuard Score and MBFC Factual Reporting: the original ground truth of news credibility, as detailed in Section 2.1.
To track how these news articles spread on social networks, we first use the Twitter Search API 17 to collect the tweets that share them.
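To make the record structure concrete, the sketch below shows one hypothetical entry covering the twelve content components plus the inherited label; the field names and values are illustrative, not the repository's actual schema.

```python
# A hypothetical ReCOVery-style record illustrating components C1-C12.
# Field names and values are placeholders, not the repository's schema.
record = {
    "news_id": 42,                                     # C1
    "url": "https://example-news-site.com/article-1",  # C2 (placeholder URL)
    "publisher": "Example News",                       # C3
    "publish_date": "2020-03-15",                      # C4
    "authors": ["Jane Doe"],                           # C5 (may be empty)
    "title": "Officials report new COVID-19 cases",    # C6
    "body_text": "Full article text ...",              # C7
    "image": "https://example-news-site.com/top.jpg",  # C8 (main image only)
    "country": "US",                                   # C9
    "political_bias": "left-center",                   # C10
    "newsguard_score": 95.0,                           # C11
    "mbfc_factual": "high",                            # C12
    "reliability": 1,  # label inherited from the publisher (1 = reliable)
}
```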
The general statistics of our dataset are presented in Table 1. The dataset contains 2,029 news articles, most of which have both textual and visual information to support multimodal studies [18, 21] (2,017) and have been shared on social media (1,747). The ratio of reliable to unreliable news articles is roughly 2:1; due to this class imbalance, AUC or F1 scores are more appropriate evaluation metrics than accuracy when using the collected data to predict news credibility. Note that the number of users who spread reliable news (78,659) plus the number of users who spread unreliable news (17,323) is greater than the total number of users included in the dataset (93,761), which indicates that users can engage in spreading both reliable and unreliable news articles (78,659 + 17,323 = 95,982, so at least 95,982 - 93,761 = 2,221 users spread both). Next, we visualize the distributions of data features/attributes.

Distribution of News Publishers. Figure 4 shows the number of COVID-19 news articles published by each [extremely reliable or extremely unreliable] news site. Six unreliable publishers have no news on COVID-19 and hence are not presented in the figure. We keep these publishers in our repository, as the data will be updated over time and they may publish news articles on COVID-19 in the future.

[Figure 4: COVID-19 news articles per site. Reliable publishers shown include ABC News, Business Insider, CBS News, CNBC, Chicago Sun-Times, FiveThirtyEight, Los Angeles Daily News, National Public Radio (NPR), PBS NewsHour, Politico, Reuters, Slate, The Atlantic, The Detroit News, The Mercury News, The New York Times, The New Yorker, The Verge, The Washington Post, USA Today, and Washington Monthly.]

News Publication Dates. The distribution of news publication dates is presented in Figure 5; all articles were published in 2020. We point out that from January to May, the number of COVID-19 news articles published increased dramatically, roughly exponentially. The possible explanation for this phenomenon is threefold. First, from the time the outbreak was first identified in Wuhan, China (December 2019) [8] to May 2020, the numbers of confirmed cases and deaths caused by SARS-CoV-2 grew exponentially worldwide. 1 Meanwhile, the virus became a global topic, triggering ever more discussion on a worldwide scale. Second, some older news articles are no longer available, which has motivated us to update the dataset in a timely manner. Third, the keywords we have used to identify COVID-19 news articles are the official ones. 19 Some news articles published in January are also collected, because before the WHO announcement COVID-19 was known as the "2019 novel coronavirus", which contains one of our keywords: "coronavirus". We have detailed the reasons behind our keyword selection in Section 2.2.1.

News Authors and Author Collaborations. Figure 6 presents the distribution of the number of authors contributing to each news article, which is governed by a long-tail distribution: most articles have fewer than five authors. Instead of providing the [real or fictional] names of authors, some articles list publisher names as authors. As such information is already available in the repository, we leave the author information of these news articles blank, i.e., their number of authors is zero. Furthermore, we construct the coauthorship network, shown in Figure 7. Node degrees also follow a power-law-like distribution: among the 1,095 nodes (authors), over 90% have at most two collaborators.

News Textual Characteristics. Figures 8 and 9 reveal the textual characteristics of news content (news titles and body text). As Figure 8 shows, the number of words within news content follows a long-tail (power-law-like) distribution, with an average of ~800 words and a median of ~600. Figure 9 provides the word cloud for the entire repository. As the collected news articles share the COVID-19 topic, relevant terms are naturally and frequently used by news authors, such as "coronavirus" (6,465), "COVID" (5,413), "state" (4,432), "test" (4,274), "health" (3,714), "pandemic" (3,427), "virus" (2,903), "home" (2,871), "case" (2,676), and "Trump" (2,431), illustrated with word font sizes scaled to their frequencies.

Country Distribution. Figure 10 reveals the countries to which the news articles and news publishers belong. In total, six countries (USA, Russia, UK, Iran, Cyprus, and Canada) are covered, with US news and news publishers constituting the vast majority.

Political Bias Distribution. Figure 11 shows the distribution of the political bias of news articles and news media (publishers). For both news and publishers, the distribution across right-leaning categories (extreme right, right, and right-center) is more balanced than that across left-leaning categories (extreme left, left, and left-center).

News Spreading Frequencies. Figure 12 shows the distribution of the number of tweets spreading each news article. The distribution exhibits a long tail: over 80% of news articles are spread fewer than 100 times, while a few have been shared by thousands of tweets.

News Spreaders. The distribution of the number of spreaders of each news article is shown in Figure 13. It differs from the distribution in Figure 12, as one user can spread a news article multiple times.
As for the social connections of news spreaders, the distributions of their numbers of followers and friends are presented in Figures 14 and 15, respectively; the most popular spreader has over 40 million followers (and over 600,000 friends).

In this section, several methods that often act as baselines are developed and applied to predict COVID-19 news credibility using ReCOVery data, with the aim of facilitating future studies. These methods (baselines) are first specified in Section 4.1. The implementation details of the experiments are then provided in Section 4.2. Finally, we present the performance results of these methods in Section 4.3. Broadly speaking, all developed methods fall under a traditional supervised machine learning framework, where features are manually engineered to represent news articles (see Section 4.1.1) and then classified by a well-trained classifier, such as a random forest (see Section 4.1.2). We design and extract the following three feature groups in our experiments.

LIWC Features. LIWC is a widely accepted psycholinguistic lexicon. Given a news story, LIWC can count the words in the text falling into one or more of 93 linguistic (e.g., self-references), psychological (e.g., anger), and topical (e.g., leisure) categories [12], based on which 93 features are extracted.

Attribute Features. We consider a total of eight features for each news article: (1) the timestamp at which the news was published; (2) the number of news authors; (3-4) the mean and median numbers of collaborators of the news authors; (5-7) the numbers of words in the news title, body text, and entire content; and (8) the number of news images. Compared to LIWC features, which focus only on news textual information (title and body text), this group of features covers most of the components of news content included in the repository.

Social Attributes. Six features are extracted from the available social attributes of each news article in the repository: (1) the frequency with which the news is spread, i.e., the number of corresponding tweets; (2) the number of news spreaders; and (3-6) the mean (and median) numbers of followers (and friends) of the news spreaders.

In current fake news research, a random classifier is often used as one of the baselines [22]; it randomly labels a news article as reliable or unreliable with equal probability. We further use multiple common supervised learners (classifiers) in our experiments: Logistic Regression (LR), Naïve Bayes (NB), k-Nearest Neighbors (k-NN), Random Forest (RF), Decision Tree (DT), Support Vector Machines (SVM), and XGBoost (XGB) [4].

The overall dataset is randomly divided into training and testing sets with a proportion of 80%:20%. As the dataset has an imbalanced distribution of reliable and unreliable news articles (≈2:1), we evaluate the prediction results in terms of precision, recall, and the F1 score. Each performance figure is obtained by averaging five experimental results repeated with different random seeds for the dataset division. All classifiers are trained with default hyperparameters.
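A minimal scikit-learn reproduction of this evaluation protocol might look as follows. It assumes a precomputed feature matrix X (e.g., LIWC, attribute, and/or social features) and binary labels y; the synthetic data at the bottom is only a stand-in so the sketch runs end to end, not the actual ReCOVery features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split

def evaluate_baseline(X, y, n_runs=5):
    """80/20 split, default hyperparameters, averaged over n_runs seeds."""
    scores = []
    for seed in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed)
        clf = RandomForestClassifier(random_state=seed)  # default settings
        clf.fit(X_tr, y_tr)
        p, r, f1, _ = precision_recall_fscore_support(
            y_te, clf.predict(X_te), average="binary")
        scores.append((p, r, f1))
    return np.mean(scores, axis=0)  # mean precision, recall, F1

# Synthetic stand-in for the real feature matrix (e.g., attribute features):
rng = np.random.default_rng(0)
X = rng.normal(size=(2029, 8))     # 2,029 articles, 8 attribute features
y = rng.integers(0, 2, size=2029)  # 1 = reliable, 0 = unreliable
print(evaluate_baseline(X, y))
```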
Prediction results are presented in Table 2. It can be observed that, when predicting news credibility using news content alone, attribute features are more representative than LIWC features: attribute features perform best with an F1 score of 0.772 using a random forest classifier, while LIWC features perform best with an F1 score of 0.708 using XGBoost. Furthermore, using both news content and social information to predict news credibility further improves performance, achieving an F1 score of ~0.8.

Related datasets can be broadly grouped into (I) COVID-19 datasets and (II) "fake" news and rumor datasets.

COVID-19 Datasets. As a global emergency [15], the outbreak of COVID-19 has been labeled a black swan event and likened to the economic scene of World War II [10]. Against this background, a group of datasets has emerged, whose contributions range from real-time tracking of COVID-19 to aid epidemiological forecasting (e.g., [5] and [19]) and collecting scholarly COVID-19 articles for literature-based discovery (e.g., CORD-19 20), to tracking the spread of COVID-19 information on Twitter (e.g., [3]).

20 https://www.semanticscholar.org/cord19

Specifically, researchers at Johns Hopkins University (JHU) developed a Web-based dashboard 21 to visualize and track reported cases of COVID-19 in real time. The dashboard, released on January 22, presents the location and number of confirmed COVID-19 cases, deaths, and recoveries for all affected countries [5]. Another dataset, shared publicly on March 24, was constructed to aid the analysis and tracking of the COVID-19 epidemic; it provides real-time individual-level data (e.g., symptoms; dates of onset, admission, and confirmation; and travel history) from national, provincial, and municipal health reports [19]. Intending to mobilize researchers to apply recent advances in Natural Language Processing (NLP) to generate new insights in support of the fight against COVID-19, the Allen Institute for AI has contributed a free and dynamic database of more than 128,000 scholarly articles about COVID-19, named CORD-19, to the global research community. 20 Finally, Chen et al. [3] released the first large-scale COVID-19 Twitter dataset. The dataset, updated regularly, collects COVID-19 tweets posted from January 21 onward, across languages.

"Fake" News and Rumor Datasets. Existing "fake" news and rumor datasets are collected with various foci. These datasets may (i) contain only news content, which can be full articles (e.g., NELA-GT-2018 [11]) or short claims (e.g., FEVER [16]); (ii) contain only social media information (e.g., CREDBANK [9]), where "news" refers to user posts; or (iii) contain both content and social media information (e.g., LIAR [17] and FakeNewsNet [13]).

Specifically, NELA-GT-2018 [11] is a large-scale dataset of around 713,000 news articles from February to November 2018. News articles are collected from 194 news media, with multiple labels directly obtained from NewsGuard, Pew Research Center, Wikipedia, OpenSources, MBFC, AllSides, BuzzFeed News, and PolitiFact. These labels refer to news credibility, transparency, political polarization, and authenticity. The FEVER dataset [16] consists of ~185,000 claims and was constructed in two steps: claim generation and annotation. First, the authors extract sentences from Wikipedia, and the annotators manually generate a set of claims based on the extracted sentences. Then, the annotators label each claim as "supported", "refuted", or "not enough information" by comparing it with the original sentence from which it was developed.
On the other hand, some datasets focus on user posts on social media. For example, CREDBANK [9] comprises more than 60 million tweets grouped into 1,049 real-world events, each of which is annotated by 30 human annotators. Other datasets contain both news content and social media information. For instance, collecting both claims and fact-checking results (labels, i.e., "true", "mostly true", "half-true", "mostly false", and "pants on fire") directly from PolitiFact, Wang established the LIAR dataset [17], containing around 12,800 verified statements made in public speeches and social media. The aforementioned datasets contain only textual information, valuable for NLP research but with limited information on how "fake" news and rumors spread on social networks, which motivated the construction of the FakeNewsNet dataset [13]. That dataset collects verified (real or fake) full news articles from PolitiFact (#=1,056) and GossipCop (#=22,140) and also tracks news spreading on Twitter.

To fight the coronavirus infodemic, we construct a multimodal repository for COVID-19 news credibility research, which provides textual, visual, temporal, and network information regarding news content and how news spreads on social media. The repository balances data scalability and label accuracy. To facilitate future studies, benchmarks are developed and their performances in predicting news credibility using the repository data are presented. We find that, using the news content and/or social attributes available in the repository, we can achieve an F1 score of ~0.77 when news has not yet spread on social media (i.e., only news content is available), and an F1 score of ~0.81 when it has been shared by social media users. We point out that the data could be further enhanced (1) by including COVID-19 news articles in various languages, such as Chinese, Russian, Spanish, and Italian, as well as information on how these news articles spread on the popular local social media for those languages, e.g., Sina Weibo (China). Countries speaking (but not limited to) these languages have all suffered heavy losses in this pandemic and have shown different characteristics in the virus's physical-world spread, 22 which would be invaluable when investigating the relationship between the spread of the virus in the physical world and that of its related misinformation on social networks. Furthermore, the data could be enhanced (2) by extending the dataset with ground truth for, for example, hate speech, clickbait, and social bots [6].
References
[1] Ramy Baly, Georgi Karadzhov, Dimitar Alexandrov, James Glass, and Preslav Nakov. 2018. Predicting factuality of reporting and bias of news media sources. In EMNLP.
[2] Flattening the COVID-19 curves. Scientific American (2020).
[3] Emily Chen, Kristina Lerman, and Emilio Ferrara. 2020. Tracking social media discourse about the COVID-19 pandemic: Development of a public coronavirus Twitter data set. JMIR Public Health and Surveillance.
[4] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A scalable tree boosting system. In KDD.
[5] Ensheng Dong, Hongru Du, and Lauren Gardner. 2020. An interactive web-based dashboard to track COVID-19 in real time. The Lancet Infectious Diseases.
[6] Emilio Ferrara, Onur Varol, Clayton Davis, Filippo Menczer, and Alessandro Flammini. 2016. The rise of social bots. Communications of the ACM.
[7] An instant economic crisis: How deep and how long? (2020).
[8] Chaolin Huang et al. 2020. Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. The Lancet.
[9] Tanushree Mitra and Eric Gilbert. 2015. CREDBANK: A large-scale social media corpus with associated credibility annotations. In ICWSM.
[10] Maria Nicola, Zaid Alsafi, Catrin Sohrabi, Ahmed Kerwan, Ahmed Al-Jabir, Christos Iosifidis, Maliha Agha, and Riaz Agha. 2020. The socio-economic implications of the coronavirus and COVID-19 pandemic: A review. International Journal of Surgery.
[11] Jeppe Nørregaard, Benjamin D. Horne, and Sibel Adalı. 2019. NELA-GT-2018: A large multi-labelled news dataset for the study of misinformation in news articles. In ICWSM.
[12] James W. Pennebaker, Ryan L. Boyd, Kayla Jordan, and Kate Blackburn. 2015. The development and psychometric properties of LIWC2015.
[13] Kai Shu, Deepak Mahudeswaran, Suhang Wang, Dongwon Lee, and Huan Liu. 2020. FakeNewsNet: A data repository with news content, social context, and dynamic information for studying fake news on social media. Big Data.
[14] Niraj Sitaula, Chilukuri K. Mohan, Jennifer Grygiel, Xinyi Zhou, and Reza Zafarani. 2020. Credibility-based fake news detection.
[15] Catrin Sohrabi, Zaid Alsafi, Niamh O'Neill, Mehdi Khan, Ahmed Kerwan, Ahmed Al-Jabir, Christos Iosifidis, and Riaz Agha. 2020. World Health Organization declares global emergency: A review of the 2019 novel coronavirus (COVID-19). International Journal of Surgery.
[16] James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: A large-scale dataset for fact extraction and verification. In NAACL.
[17] William Yang Wang. 2017. "Liar, liar pants on fire": A new benchmark dataset for fake news detection. In ACL.
[18] Yaqing Wang, Fenglong Ma, Zhiwei Jin, Ye Yuan, Guangxu Xun, Kishlay Jha, Lu Su, and Jing Gao. 2018. EANN: Event adversarial neural networks for multi-modal fake news detection. In KDD.
[19] Bo Xu et al. 2020. Epidemiological data from the COVID-19 outbreak, real-time case information. Scientific Data.
[20] Jiawei Zhang, Bowen Dong, and Philip S. Yu. 2020. Fake news detection with deep diffusive network model.
[21] Xinyi Zhou, Jindi Wu, and Reza Zafarani. 2020. SAFE: Similarity-aware multi-modal fake news detection. In PAKDD.
[22] Xinyi Zhou and Reza Zafarani. 2020. A survey of fake news: Fundamental theories, detection methods, and opportunities. ACM Computing Surveys.