title: Can We Spot the "Fake News" Before It Was Even Written?
authors: Nakov, Preslav
date: 2020-08-10

Given the recent proliferation of disinformation online, there has also been growing research interest in automatically debunking rumors, false claims, and "fake news." A number of fact-checking initiatives have been launched so far, both manual and automatic, but the whole enterprise remains in a state of crisis: by the time a claim is finally fact-checked, it could have reached millions of users, and the harm caused could hardly be undone. An arguably more promising direction is to focus on fact-checking entire news outlets, which can be done in advance. Then, we could fact-check the news before it was even written: by checking how trustworthy the outlets that published it are. We describe how we do this in the Tanbih news aggregator, which makes readers aware of what they are reading. In particular, we develop media profiles that show the general factuality of reporting, the degree of propagandistic content, hyper-partisanship, leading political ideology, general frame of reporting, and stance with respect to various claims and topics.

Recent years have seen the rise of social media, which have enabled people to easily share information with a large number of online users, without quality control. On the bright side, this has given anybody the opportunity to become a content creator, and it has enabled much faster information dissemination. On the not-so-bright side, it has also made it easy for malicious actors to spread disinformation much faster, potentially reaching very large audiences. In some cases, this included building sophisticated profiles for individual users based on a combination of psychological characteristics, metadata, demographics, and location, and then micro-targeting them with personalized "fake news" and propaganda campaigns that have been weaponized with the aim of achieving political or financial gains.

To be clear, false information in the news has always been around, e.g., think of tabloids. However, social media have changed everything. They have made it possible for malicious actors to micro-target specific demographics, and to spread disinformation much faster and at scale, under the guise of news. Thanks to social media, the news can now be weaponized at an unprecedented scale.

Overall, thanks to social media, people today are much more likely to believe in conspiracy theories. For example, according to a 2019 study, 57% of Russians believed that the USA did not put a man on the Moon. In contrast, when the event actually occurred, there was absolutely no doubt about it in the USSR, and Neil Armstrong was even invited to visit Moscow, which he did.

Indeed, disinformation has become a global phenomenon: a number of countries have had election-related issues with "fake news". To get an idea of the scale, 150 million users on Facebook and Instagram saw inflammatory political ads, and Cambridge Analytica had access to the data of 87 million Facebook users in the USA, which it used for targeted political advertisement; for comparison, the 2016 US Presidential election was decided by 80,000 voters in three key states. While initially the focus has been on influencing the outcome of political elections, "fake news" has also caused direct loss of life.
For example, disinformation on WhatsApp has resulted in people being killed in India, and disinformation on Facebook was responsible for the Rohingya genocide, according to a UN report. Disinformation can also put people's health in danger, e.g., think of the anti-vaccine websites and the damage they cause to public health worldwide, or of the ongoing COVID-19 pandemic, which has also given rise to the first global infodemic.

Recently, there has been a lot of research interest in studying disinformation and bias in the news and in social media. This includes challenging the truthiness of claims [6, 52, 81], of news [17, 33, 36, 37, 42, 60, 61], of news sources [8], of social media users [3, 26, 49, 50, 51, 59], and of social media [18, 19, 64, 82], as well as studying credibility, influence, and bias [7, 8, 20, 45, 49, 51]. The interested reader can also check several recent surveys that offer a general overview of "fake news" [46], or that focus on specific topics such as the proliferation of true and false news online [75], fact-checking [71], data mining [67], or truth discovery in general [47]. For some specific topics, research was facilitated by specialized shared tasks, such as the SemEval-2017 task 8 and the SemEval-2019 task 7 on Determining Rumour Veracity and Support for Rumours (RumourEval) [28, 35], the CLEF 2018-2020 CheckThat! lab on Automatic Identification and Verification of Claims [4, 5, 13, 14, 16, 31, 32, 38, 39, 57, 58, 65], the FEVER-2018 and FEVER-2019 tasks on Fact Extraction and VERification [72, 73], and the SemEval-2019 Task 8 on Fact Checking in Community Question Answering Forums [53, 54], among others. Finally, note that the veracity of information is a much bigger problem than just "fake news". It has been suggested that "Veracity" should be seen as the fourth "V" of Big Data, along with Volume, Variety, and Velocity.

In order to fact-check a news article, we can analyze its contents, e.g., the language it uses, and we can use the reliability of its source, which can be represented as a number between 0 and 1, where 1 indicates a very reliable source, and 0 stands for a very unreliable one:

    factuality(article) = reliability(source(article)) * reliability(language(article))    (1)

In order to fact-check a claim (as opposed to an article), we can retrieve articles discussing the claim, detect the stance of each article with respect to the claim, and take a weighted sum (here, the stance is a number between -1 and 1: it is -1 if the article disagrees with the claim, 1 if it agrees, and 0 if it just discusses the claim or is unrelated):

    factuality(claim) = SUM_i factuality(article_i) * stance(article_i, claim)    (2)

Note that in formula (1), the reliability of the website that hosts an article serves as a prior to compute the factuality of the article, while in formula (2), we use the factuality of the retrieved articles to compute a factuality score for a target claim. The idea is that if a reliable article agrees/disagrees with the claim, this is a good indicator of it being true/false, and it is the other way around for unreliable articles. Of course, the formulas above are oversimplifications, e.g., one can fact-check a claim based on the reactions of users in social media [41], based on the claim's spread over time in social media [48], based on information in a knowledge graph [70], extracted from the Web [43] or from Wikipedia [72], using similarity to previously fact-checked claims [64], etc.
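The following is a minimal sketch of formulas (1) and (2) in Python. It assumes the multiplicative combination of source and language reliability in (1) and a simple average-based normalization of the weighted sum in (2); the Article fields, the helper names, and the toy numbers are illustrative assumptions, not the actual Tanbih implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Article:
    source_reliability: float    # in [0, 1]; 1 = very reliable source
    language_reliability: float  # in [0, 1]; reliability estimated from the article's language
    stance: float                # in [-1, 1]; -1 = disagrees with the claim, 1 = agrees, 0 = neutral/unrelated

def article_factuality(a: Article) -> float:
    # Formula (1): the source reliability acts as a prior on the language-based estimate.
    return a.source_reliability * a.language_reliability

def claim_factuality(articles: List[Article]) -> float:
    # Formula (2): weighted sum over the retrieved articles; a reliable article that agrees
    # pushes the score up, while a reliable article that disagrees pushes it down.
    if not articles:
        return 0.0
    total = sum(article_factuality(a) * a.stance for a in articles)
    return total / len(articles)  # normalizing by the number of articles is an assumption

# Toy example: two reliable articles agree with the claim, one unreliable article disagrees.
evidence = [Article(0.9, 0.8, 1.0), Article(0.8, 0.9, 1.0), Article(0.2, 0.3, -1.0)]
print(claim_factuality(evidence))  # positive score, i.e., the evidence leans towards the claim being true
```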
Still, the formulas convey the general idea that the reliability of the source should be an important element of fact-checking articles and claims. Yet, source reliability remains an understudied problem.

Characterizing entire news outlets is an important task in its own right. We argue that it is more useful than fact-checking claims or articles, as it is hardly feasible to fact-check every single piece of news. Doing so also takes time, both for human users and for automatic programs, as they need to monitor how reliable media report on a given target claim, how users react to it in social media, etc., and it takes time to accumulate enough such evidence to make a reliable prediction. It is much more feasible to check entire news outlets. Note that we can fact-check a number of sources in advance, and then we can fact-check the news before it was even written! Once an article is published online, it would be enough to check how trustworthy the outlets that published it are, in order to get an initial (imperfect) idea of how much we should trust it. This would be similar to the movie Minority Report, where the authorities could detect a crime before it was even committed.

In general, fighting disinformation is not easy; as in the case of spam, this is an adversarial problem, where the malicious actors constantly change and improve their strategies. Yet, when they share news in social media, they typically post a link to an article that is hosted on some website. This is what we are exploiting: we try to characterize the news outlet where the article is hosted. This is also what journalists typically do: they first check the source. Finally, even though we focus on the source, our work is also compatible with fact-checking a claim or a news article, as we can provide an important prior and thus help both algorithms and human fact-checkers that try to fact-check a particular news article or claim.

How can we profile a news source? Note that disinformation typically focuses on emotions, and political propaganda often discusses moral categories [23]. There are many incentives for news outlets to publish articles that appeal to emotions: (i) this has a strong propagandistic effect on the target user, (ii) it makes the article more likely to be shared further by users, and (iii) it makes the article more likely to be shown in other users' newsfeeds, as this is what social media algorithms optimize for. News outlets want users to share links to their content in social media, as this allows them to reach a larger audience. This kind of language also makes such outlets potentially detectable by Artificial Intelligence (AI) systems; yet, the outlets cannot do much about it, as changing the language would make their message less effective and would also limit its spread.

While the analysis of the language used by the target news outlet is the most important information source, we can also consider information from Wikipedia and social media, traffic statistics, and the structure of the target site's URL, as shown in Figure 1:

1. the text of a few hundred articles published by the target news outlet, analyzing the style, subjectivity, sentiment, offensiveness [77, 78, 79, 62], toxicity [30], morality, vocabulary richness, propagandistic content, etc.;
2. the text of its Wikipedia page (if any), including the infobox, summary, content, and categories, e.g., it might say that the website spreads false information and conspiracy theories;
3. metadata and statistics about its Twitter account (if any): is it an old account, is it verified, is it popular, how does the medium describe itself, is there a link to its website, etc.;
4. whether people in social media, e.g., on Twitter, post links to articles from the target source in the context of a polarizing topic, and which side of the debate these users are on;
5. whether the audience of the target medium in social media, e.g., on Facebook, shows a liberal, moderate, or conservative bias;
6. the language used in videos by the target medium, e.g., in its YouTube channels (if any), where the focus is on analysis of the speech signal, i.e., not on what is said but on how it is said, e.g., whether it is emotional [44];
7. Web traffic information: whether this is a popular website;
8. the structure of the site's URL: is it too long, does it contain a sequence of meaningful words, does it have a suspicious suffix such as ".com.co", etc. (a sketch of how such signals could be combined into a source-level prediction is shown further below).

Characterizing media in terms of factuality of reporting and bias is part of a larger effort at the Qatar Computing Research Institute, HBKU: the Tanbih mega-project aims to limit the effect of "fake news", disinformation, propaganda, and media bias by making users aware of what they are reading, thus promoting media literacy and critical thinking. The mega-project's flagship initiative is the Tanbih news aggregator [80], which shows real-time news from a variety of news sources [68]. It builds a profile for each news outlet, showing a prediction about the factuality of its reporting [8, 9, 10], its leading political ideology [29], its degree of propaganda [12, 15], hyper-partisanship [63, 66], general frame of reporting (e.g., political, economic, legal, cultural identity, quality of life, etc.), and stance with respect to various claims and topics [11, 27, 55, 56, 69]. For individual news articles, it signals when an article is likely to be propagandistic. It further mixes Arabic and English news, and allows the user to see them all in English or Arabic thanks to QCRI's Machine Translation technology. Tanbih also offers analytics capabilities, allowing a user to explore the media coverage, the frame of reporting, and the propaganda around topics such as Brexit, the Sri Lanka bombings, and COVID-19. Moreover, it performs fine-grained analysis of the propaganda techniques in the news [21, 22, 24, 25, 76].

We developed tools such as a Web browser plugin (http://chrome.google.com/webstore/detail/tanbih/igcppjdbignhkiikejdjpjemejoognen), a mechanism to share media profiles and stories in social media, a tool to detect check-worthiness for English and Arabic [34, 40, 74] (http://claimrank.qcri.org/), a Twitter fact-checking bot (http://twitter.com/factchecker_bot/), and an API to the Tanbih functionality (http://app.swaggerhub.com/apis/yifan2019/Tanbih/0.6.0#/). The latter is used by Aljazeera and other partners. More recently, we have been developing tools for fighting the COVID-19 infodemic by modeling the perspective of journalists, fact-checkers, social media platforms, policy makers, and society [1, 2].

Tanbih was developed in close collaboration with MIT-CSAIL (http://qcri.csail.mit.edu/node/25). We were also partners in a large NSF project on Credible Open Knowledge Networks (http://cokn.org/), and we further collaborate with Carnegie Mellon University in Qatar, Qatar University, Sofia University, the University of Bologna, Aljazeera, Facebook, the United Nations, Data Science Society, and A Data Pro, among others. As part of a larger team, including Al Jazeera, Associated Press, RTE Ireland, Tech Mahindra, Metaliquid, and V-Nova, we won an award from TM Forum at IBC 2019 for our Media-Telecom Catalyst project on AI Indexing for Regulatory Practice (http://www.tmforum.org/ai-indexing-regulatory-practise/).
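To make the profiling signals listed above concrete, here is a minimal sketch of how per-signal scores from the eight groups might be combined into a source-level factuality prediction. The feature-group names, the 0.5 default for missing evidence, and the logistic-regression setup are illustrative assumptions for the sketch, not the actual Tanbih models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One score per signal group, mirroring the list of eight groups above (hypothetical names).
FEATURE_GROUPS = [
    "article_text", "wikipedia_page", "twitter_metadata", "twitter_audience",
    "facebook_audience", "youtube_speech", "web_traffic", "url_structure",
]

def profile_vector(signals: dict) -> np.ndarray:
    # Each score is in [0, 1]; missing evidence (e.g., no Wikipedia page or no
    # YouTube channel) is encoded as an uninformative 0.5.
    return np.array([signals.get(g, 0.5) for g in FEATURE_GROUPS])

# Toy training data: each row is one source; the label is 1 for "high factuality".
X = np.array([
    profile_vector({"article_text": 0.9, "wikipedia_page": 0.8, "web_traffic": 0.7}),
    profile_vector({"article_text": 0.2, "url_structure": 0.1, "twitter_metadata": 0.3}),
    profile_vector({"article_text": 0.8, "twitter_audience": 0.9}),
    profile_vector({"article_text": 0.3, "facebook_audience": 0.2}),
])
y = np.array([1, 0, 1, 0])

clf = LogisticRegression().fit(X, y)
new_source = profile_vector({"article_text": 0.85, "wikipedia_page": 0.9})
print(clf.predict_proba([new_source])[0, 1])  # estimated probability of high factuality
```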
In its 2.5 years of history, the Tanbih mega-project has produced 30+ top-tier publications, a Best Demo Award (honorable mention) at ACL-2020, and several patent applications. The project was featured in 30+ keynote talks, and it was highlighted by 100+ media outlets, including Forbes, the Boston Globe, Aljazeera, MIT Technology Review, Science Daily, Popular Science, Fast Company, The Register, WIRED, and Engadget.

It is widely believed that "fake news" can affect, and has affected, major political events. In reality, the true impact is unknown; however, given the buzz that was created, we should expect a large number of state and non-state actors to give it a try. From a technological perspective, we can expect further advances in "deep fakes", such as machine-generated videos and images. This is a scary development, but probably only in the mid-to-long run; at present, "deep fakes" are still relatively easy to detect, both using AI and by experienced users. We also expect advances in automatic news generation, thanks to recent developments such as GPT-3. This is already a reality, and a sizable part of the news we consume daily is machine-generated, e.g., about the weather, the markets, and sports events. Such software can describe a sports event from various perspectives: neutrally, or taking the side of the winning or the losing team. It is easy to see how this can be used for disinformation purposes.

Yet, we hope to see "fake news" go the way of spam: not entirely eliminated (as this is impossible), but put under control. AI has already helped a lot in the fight against spam, and we expect that it will play a key role in putting "fake news" under control as well. A key element of the solution would be limiting the spread. Social media companies are best positioned to do this on their own platforms. Twitter suspended more than 70 million accounts in May and June 2018, and these efforts continue to date; this can help in the fight against bots and botnets, which are the new link farms: 20% of the tweets during the 2016 US Presidential campaign were shared by bots. Facebook, for its part, warns users when they try to share a news article that has been fact-checked and identified as fake by at least two trusted fact-checking organizations, and it also downgrades "fake news" in its news feed. We expect the AI tools used for this to get better, just like spam filters have improved over time. Yet, the most important element of the fight against disinformation is raising user awareness and developing critical thinking. This would help limit the spread, as users would be less likely to share disinformation further. We believe that practical tools such as the ones we develop in the Tanbih mega-project would help in that respect.

References:
- Fighting the COVID-19 infodemic in social media: A holistic perspective and a call to arms. ArXiv preprint
- Fighting the COVID-19 infodemic: Modeling the perspective of journalists, fact-checkers, social media platforms, policy makers, and the society
- Predicting the role of political trolls in social media
- Overview of the CLEF-2018 CheckThat! Lab on automatic identification and verification of political claims, Task 1: Check-worthiness
- Overview of the CLEF-2019 CheckThat! Lab on Automatic Identification and Verification of Claims. Task 1: Check-Worthiness
- Online journalists embrace new marketing function. Newspaper Research
- Finding credible information sources in social networks based on content and social structure
- Information credibility on Twitter
- Battling the Internet Water Army: detection of hidden paid posters
- SemEval-2020 task 11: Detection of propaganda techniques in news articles
- Findings of the NLP4IF-2019 shared task on fine-grained propaganda detection
- A survey on computational propaganda detection
- Prta: A system to support the analysis of propaganda techniques in the news
- Fine-grained analysis of propaganda in news articles
- Seminar users in the Arabic Twitter sphere
- Unsupervised user stance detection on Twitter
- SemEval-2017 Task 8: RumourEval: Determining rumour veracity and support for rumours
- Predicting the leading political ideology of YouTube channels using acoustic, textual and metadata information
- Detecting toxicity in news articles: Application to Bulgarian
- CheckThat! at CLEF 2019: Automatic identification and verification of claims
- Overview of the CLEF-2019 CheckThat!: Automatic identification and verification of claims
- Digital journalism credibility study
- A context-aware approach for detecting worth-checking claims in political debates
- SemEval-2019 task 7: RumourEval, determining rumour veracity and support for rumours
- In search of credible news
- In search of credible news
- Overview of the CLEF-2020 CheckThat! lab on automatic identification and verification of claims in social media: Arabic tasks
- Overview of the CLEF-2019 CheckThat! Lab on Automatic Identification and Verification of Claims. Task 2: Evidence and Factuality
- ClaimRank: Detecting check-worthy claims in Arabic and English
- Linguistic signals under misinformation and fact-checking: Evidence from user comments on social media
- We built a fake news & click-bait filter: What happened next will blow your mind
- Fully automated fact checking using external sources
- Detecting deception in political debates using acoustic and textual features
- Multi-view models for political ideology detection of news articles
- The science of fake news
- A survey on truth discovery
- Detecting rumors from microblogs with recurrent neural networks
- Finding opinion manipulation trolls in news community forums
- Exposing paid opinion manipulation trolls
- The dark side of news community forums: Opinion manipulation trolls
- Hunting for troll comments in news community forums
- SemEval-2019 task 8: Fact checking in community question answering forums
- Fact checking in community forums
- Automatic stance detection using end-to-end memory networks
- Contrastive language adaptation for crosslingual stance detection
- Overview of the CLEF-2018 CheckThat! Lab on automatic identification and verification of political claims
- CLEF-2018 lab on automatic identification and verification of claims in political debates
- Do not trust the trolls: Predicting credibility in community question answering forums
- FANG: Leveraging social context for fake news detection using graph representation
- A stylometric inquiry into hyperpartisan and fake news
- A large-scale semi-supervised dataset for offensive language identification
- Team QCRI-MIT at SemEval-2019 task 4: Propaganda analysis meets hyperpartisan news detection
- That is a known lie: Detecting previously fact-checked claims
- Overview of the CLEF-2020 CheckThat! lab on automatic identification and verification of claims in social media: English tasks
- Team Jack Ryder at SemEval-2019 task 4: Using BERT representations for detecting hyperpartisan news
- Fake news detection on social media: A data mining perspective
- Dense vs. sparse representations for news stream clustering
- Predicting the topical stance and political leaning of media using tweets
- ClaimsKG: A knowledge graph of fact-checked claims
- Automated fact checking: Task formulations, methods and future directions
- FEVER: a large-scale dataset for fact extraction and VERification
- The second Fact Extraction and VERification (FEVER2.0) shared task
- It takes nine to smell a rat: Neural multi-task learning for check-worthiness prediction
- The spread of true and false news online
- Experiments in detecting persuasion techniques in the news
- Predicting the type and target of offensive posts in social media
- SemEval-2019 task 6: Identifying and categorizing offensive language in social media (OffensEval)
- SemEval-2020 task 12: Multilingual offensive language identification in social media
- Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing
- Fact-checking meets fauxtography: Verifying claims about images
- Analysing how people orient to and spread rumours in social media by looking at conversational threads