Mutual Hyperlinking Among Misinformation Peddlers

Vibhor Sehgal, Ankit Peshin, Sadia Afroz, Hany Farid

April 20, 2021

Abstract

The internet promised to democratize access to knowledge and make the world more open and understanding. The reality of today's internet, however, is far from this ideal. Misinformation, lies, and conspiracies dominate many social media platforms. This toxic online world has had real-world implications ranging from genocide to election interference and threats to global public health. A frustrated public and impatient government regulators are calling for a more vigorous response to mis- and disinformation campaigns designed to sow civil unrest and inspire violence against individuals, societies, and democracies. We describe a large-scale, domain-level analysis that reveals seemingly coordinated efforts between multiple domains to spread and amplify misinformation. We also describe how the hyperlinks shared by certain Twitter users can be used to surface problematic domains. These analyses can be used by search engines and social media recommendation algorithms to systematically discover and demote misinformation peddlers.

Social media's recommendation algorithms dump this flotsam and jetsam onto our news feeds and watch lists, plunging us into increasingly isolated echo chambers devoid of reality. Tackling misinformation at any scale requires striking a balance between public safety and creating an environment that allows for an open exchange of ideas. We do not necessarily advocate a specific solution to achieve this balance, but rather seek to provide the tools to help others find it.

By way of nomenclature, we will refer to the broad category of domains that traffic in conspiracies, distortions, lies, misinformation, and disinformation (whether they are maintained by a state-sponsored actor, a private or public entity, or an individual) as "misinformational domains." All other domains will be referred to as "informational domains." We describe in more detail, in Section 2.1, how domains are characterized as either informational or misinformational.

Tackling misinformation on a per-post/image/video basis (e.g., [26, 31, 27, 1, 16, 45, 44, 38, 49, 28, 17, 53, 15, 43, 37, 3, 21, 42, 2, 29, 18, 46]) is leading to a maddeningly massive game of online whack-a-mole. At the same time, social networks are under intense pressure from the public and government regulators to address the scourge of misinformation. We propose that search engines and social media recommendation algorithms would benefit from more aggressively demoting entire domains that are known to traffic in lies, conspiracies, and misinformation. To this end, we describe two techniques for rooting out domains that consistently or primarily traffic in misinformation. This type of domain-level analysis might also be helpful to fact checkers evaluating the reliability of source material.

To better understand the online misinformation ecosystem, we build two networks of misinformational and informational domains: a domain-level hyperlink network and a social-media-level link-sharing network. The domain-level network represents the hyperlinking relationships between domains. The social-media network represents the link-sharing behavior of social network users.
From these networks, we test two main hypotheses: (1) misinformational domains are more connected (through hyperlinks) to each other than to informational domains; and (2) certain social media users are super-spreaders of misinformation. Our primary contributions include:

1. Collating and curating a large set of more than 1,000 domains identified as trafficking in misinformation.
2. Revealing a distinct difference between how misinformational and informational domains link to external domains.
3. Showing how hyperlink differences can predict whether a domain traffics in misinformation.
4. Revealing that certain Twitter users have predictable patterns in their spread of misinformation.
5. Building a classifier for predicting the likelihood that a domain is a misinformation peddler based on how specific Twitter users engage with the domain.

Over the past five years, academic research on assessing and mitigating misinformation has increased significantly, as has the public's and government's interest in this pressing issue. Research has focused on understanding the nature of misinformation and its impact on the general population [34, 25, 47], understanding how misinformation spreads [40, 46, 50, 29], and the automatic detection of misinformation [45, 44, 38, 49, 28, 17, 53, 15, 43, 37, 3, 21, 42, 2, 18, 46, 22].

With 53% of Americans getting at least some of their news from social media [41], significant efforts have focused on the promotion and spread of misinformation on social media. Vosoughi et al. [50], for example, analyzed the spread of misinformation on Twitter and found that false news spreads faster than true news. The study also found that false news is more novel than true news and is designed to inspire a strong response of fear, disgust, and surprise, and hence more engagement in terms of likes, shares, and retweets. Faddoul et al. [13] and Tang et al. [46] showed how YouTube's own recommendation algorithms help spread conspiracies and misinformation. Automated tools, such as Hoaxy [40], reveal in real time how misinformation spreads on Twitter.

The distinctive characteristics of false news stories make it somewhat easier to automatically detect them. A wide range of machine-learning approaches have been employed, including both traditional classifiers (SVM, LR, decision tree, naive Bayes, k-NN) and deep-learning models (CNN, LSTM, Bi-LSTM, C-LSTM, HAN, Conv-HAN), to demonstrate that false news can be automatically detected with a high level of accuracy (see [22] for a comparative study of detection approaches). A large and accurately labeled corpus of misinformational news, however, is difficult to obtain, which is why most studies use small datasets. Detection, debunking, and fact-checking alone, moreover, are unlikely to stem the flow of online misinformation. It has been shown, for example, that the effects of misinformation may persist even after false claims have been debunked [9, 24].

We systematically study how over 1,000 domains previously identified as peddlers of misinformation are connected with one another, and how these connections can be used to detect and disrupt misinformational networks. This type of hyperlink analysis has been examined previously, though not specifically in the space of misinformation. By analyzing 89 news outlets, for example, Pak et al. [35] found that partisan media outlets are more likely to link to nonpartisan media, but that liberal media link to liberal and neutral outlets, whereas conservative media link more exclusively to conservative outlets.
In analyzing hyperlinks among news media between 1999 and 2006, Weber et al. [51] found that establishing hyperlinks with other, younger news outlets strengthens an organization's position in the network, thus boosting traffic. In contrast to these previous works, by analyzing significantly larger networks (more than 1,000 domains), we demonstrate more robust patterns of hyperlinking, and we specifically focus on the growing problem of misinformation and coordinated misinformation peddlers.

We begin by collating and curating several public databases of previously identified misinformational and informational domains. The domain-level hyperlink network is constructed by scraping all hyperlink tags (<a>) from these domains. These hyperlinks can be to either an internal or an external page. A level-1 scraping collects all hyperlinks from the top-level domain; a level-2 scraping collects all hyperlinks by following the level-1 links and repeating the scraping. A graph G = (V, E) is constructed from the scraped domains. Each vertex/node v ∈ V corresponds to a domain, and each directed edge e = (A, B) ∈ E corresponds to a hyperlink from domain A to domain B. As described below, this graph is used to evaluate our underlying hypothesis and to gain further insight into coordinated efforts by seemingly unconnected domains. The social-media-level link-sharing network is constructed using the Twitter API to find users who shared links to misinformational domains. We use a user-sharing feature vector as input to a linear classifier to predict whether a domain is a likely source of misinformation.

We begin by describing the collation and curation of four publicly available misinformation datasets. In total, these four datasets consist of 1,707 domains. There is, however, overlap between these datasets, which, once removed, yields 1,389 distinct misinformational domains. There are several limitations to immediately using these domains in our analyses. The GossipCop, FakeNewsNet, and PolitiFact entries, for example, only provide the headline of the offending news article, from which we had to perform a heuristic reverse Google search to identify the source domain. This reverse search does not always identify the offending domain; entertainment stories, for example, often lead to domains like imdb.com and people.com.

To contend with these limitations, we applied a ranking of the 1,389 domains to down-rank mislabeled domains like imdb.com. Each domain i in our original data set was assigned a score of s_i = f_i exp(r_i/5000), where f_i is the frequency with which domain i appeared in our original data set, and r_i is the domain's Alexa top-million ranking. The exponential term weights the observation frequency so that highly ranked Alexa domains (small r_i) receive a nearly unit-value weight, while lower-ranked domains receive an exponentially larger weight. The domains with the largest scores s_i were then categorized as misinformational. Despite this ranking system, a dozen clearly non-misinformational domains remained in our data set, like theonion.com and huffingtonpost.com. These domains were manually removed, yielding a total of 1,059 domains.

We paired these 1,059 misinformational domains with 1,059 informational domains corresponding to the top-ranked Alexa domains (which we manually verified are trustworthy domains). We selected 222 domains from the "news & media" Alexa categorization, 198 domains from each of the "business", "education", "entertainment", and "sports" categories, 45 from "health", and 15 from "religion."
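As a minimal sketch, the scoring and ranking described above can be expressed as follows; the domain names, frequencies, and Alexa ranks are hypothetical and purely illustrative.

    import math

    def misinfo_score(freq, alexa_rank):
        # s_i = f_i * exp(r_i / 5000): the exponential term is ~1 for highly
        # ranked (small r_i) domains and grows rapidly for poorly ranked
        # ones, down-weighting popular domains that entered the list via
        # reverse-search errors (e.g., imdb.com).
        return freq * math.exp(alexa_rank / 5000.0)

    # Hypothetical (frequency in dataset, Alexa top-million rank) pairs.
    domains = {
        "imdb.com": (40, 50),                  # popular; likely mislabeled
        "example-misinfo.net": (12, 400_000),  # obscure; likely misinformational
    }

    # Domains with the largest scores are categorized as misinformational.
    ranked = sorted(domains, key=lambda d: misinfo_score(*domains[d]), reverse=True)
    print(ranked)  # ['example-misinfo.net', 'imdb.com']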
Shown in Table 1 are the 20 top-ranked misinformational and informational domains and their corresponding categories.

We next used OpenWPM to scrape the hyperlinks on each of the misinformational and informational domains. The hyperlink tag (<a>) is used to link to an internal or external page. OpenWPM is used to scrape the top-level domain for each hyperlink (level 1), and to scrape all pages linked from this top level (level 2). This scraping was performed from a Google Cloud machine with no user login, running Ubuntu 20.04 with 4 vCPUs, 16 GB of RAM, and 500 GB of disk space. Before each hyperlink was scraped, the browser was reset and all cookies were deleted. This entire process was repeated once every two weeks over a six-week period between February 19, 2021 and April 2, 2021. Any domain that returned a 404 error was excluded from our analysis, yielding a total of 874/1,059 misinformational domains and 888/1,059 informational domains.

After scraping all informational and misinformational domains, we constructed an unweighted, directed graph of hyperlinks in which the graph nodes are the domains and a directed edge connects a domain to each domain it hyperlinks to. For example, a hyperlink from hoggwatch.com to www.infowars.com/posts/<...> is processed as a directed edge from hoggwatch.com to infowars.com. Shown in Fig. 1 are the level-1 (left) and level-2 (right) graphs, in which we can clearly see the strong misinfo-misinfo connections and weak misinfo-info connections.

To visualize and analyze the large hyperlink network, we use the open-source tool Gephi [4]. We use the Louvain method [32, 6] for detecting communities of domains that are connected via hyperlinking relationships. This method is a fast heuristic based on modularity optimization. Networks with high modularity have dense connections between the nodes within modules but sparse connections between nodes in different modules. Because we hypothesize that misinformational domains are more connected to each other than to informational domains, we expect the misinformational and informational domains to fall into different communities.

3 Results: Domain-Level Hyperlinking

Each domain in the network created from the misinformational and informational domains is assigned a label of "misinfo", "info", or "none". The "none" categorization is used for domains that are in neither the misinformational nor the informational data set. Shown in the top portion of Table 2 is the number and proportion (%) of level-1 hyperlinks from misinfo and info (rows 1-2) to misinfo, none, and info (columns 1-3).

Shown in the hyperlink graph in Fig. 2 is a small, nearly fully connected clique of eight domains: cancer.news, climate.news, food.news, health.news, medicine.news, naturalmedicine.news, pollution.news, and sciences.news. This clique is an example of how individual domains can amplify misinformation by disproportionately linking to like-minded domains. Following a reverse whois lookup (www.whois.com), we find that all of these misinformational domains are owned by Webseed, LLC, based in Arizona, USA. A deeper internet search reveals this LLC was created by Mike Texas, a pseudonym for Mike Adams, founder of Natural News. According to Wikipedia, "Natural News (formerly NewsTarget, which is now a separate sister domain) is an anti-vaccination conspiracy theory and fake news website known for promoting pseudoscience and far-right extremism. Characterized as a 'conspiracy-minded alternative medicine website', Natural News has approximately 7 million unique visitors per month" [52].
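As a minimal sketch of this pipeline (not the authors' actual implementation), the directed hyperlink graph and its Louvain communities can be computed with networkx in place of Gephi; the scraped (source, URL) pairs below are illustrative stand-ins for the real crawl data.

    import networkx as nx
    from urllib.parse import urlparse

    def registered_domain(url):
        # Crude normalization: keep only the host and strip a leading "www."
        host = urlparse(url).netloc.lower()
        return host[4:] if host.startswith("www.") else host

    # Illustrative (source domain, hyperlinked URL) pairs from a level-1 scrape.
    scraped = [
        ("hoggwatch.com", "https://www.infowars.com/posts/example"),
        ("cancer.news", "https://climate.news/"),
        ("climate.news", "https://cancer.news/"),
        ("cancer.news", "https://food.news/"),
        ("food.news", "https://climate.news/"),
    ]

    # Unweighted, directed graph: an edge A -> B means A hyperlinks to B.
    G = nx.DiGraph()
    for src, url in scraped:
        dst = registered_domain(url)
        if dst and dst != src:  # internal links are ignored
            G.add_edge(src, dst)

    # Louvain community detection (modularity optimization), applied to the
    # undirected projection as a stand-in for the Gephi workflow.
    for community in nx.community.louvain_communities(G.to_undirected(), seed=0):
        sub = G.subgraph(community)
        avg_degree = sum(deg for _, deg in sub.degree()) / sub.number_of_nodes()
        print(sorted(community), f"avg degree={avg_degree:.2f}",
              f"density={nx.density(sub):.3f}")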
The level-2 scraping reveals that this clique of eight domains is the tip of the iceberg. Shown in Fig. 3 is a large network of 102 .news domains, all owned by Webseed, LLC. In this figure, the red nodes and edges correspond to the original eight-node clique of Fig. 2; the remaining red nodes are those from our original misinformational data set, and the magenta nodes/edges are misinformational domains discovered by the level-2 scraping. This analysis shows the power of hyperlink analysis to discover new domains peddling misinformation.

Our level-1 scraping also discovered a smaller clique of three domains: blackeyepolitics.com, greatamericandaily.com, and americanpatriotdaily.com. Despite forming a fully connected clique, at first glance these domains appear to be unrelated, each with a different owner: Rising Media News Network LLC, Great American Daily Press LLC, and American Patriot News LLC, respectively. These domains, however, are all owned by David A. Warrington [12]. Warrington also owns other domains, including conservativerevival.com and liberalpropagandaexposed.com, the former of which we discovered through our level-2 scraping. This analysis shows how mutual hyperlinking can expose seemingly coordinated misinformation efforts despite owners' attempts to conceal their coordination.

The above insights were gained by visually inspecting the hyperlink graphs in Fig. 1. As these graphs increase in size, however, this type of manual approach will quickly become impractical. We therefore employ a community detection algorithm [6] to discover connections between subsets of domains. We applied this community detection algorithm to the level-1 graph in Fig. 1. This analysis revealed two communities. The first consisted of the following eight domains: globalresearch.ca, journal-neo.org, sott.net, strategic-culture.org, swprs.org, theduran.com, thelibertybeacon.com, and wikispooks.com. The average degree (the average number of edges incident to a node) of this community was 3.12, and the graph density was 0.446 (a graph density of 1 signifies a complete graph). Some of these domains have previously been identified as spreading misleading and false COVID-19 related information [14, 5, 19], and three appear to be controlled by Russian intelligence [48, 33, 7]. The second community consisted of the following five domains: cnsnews.com, protrumpnews.com, thegatewaypundit.com, thepoliticalinsider.com, and waynedupree.com. The average degree of this community was 2.11 and the graph density was 0.12. These domains focus on partisan U.S. political content.

Domain cross-linking among misinformational domains adds to the spread of lies and conspiracies. Additionally, the billions of social-media users worldwide are at least equally responsible for spreading misinformation. A recent report on Facebook's own internal research, for example, found that just 111 users are responsible for the majority of anti-vaccination misinformation [11]. We investigate the ability to identify misinformational domains by tracking the hyperlinks shared by certain social-media users. Because of the relative ease of access, we focus on Twitter's publicly available user data. In particular, we enlist two Twitter APIs: (1) the Search Tweets API allows filtering tweets based on a query term against a tweet's keywords, hashtags, or shared URLs.
We filter tweets by matching shared URLs against our misinfo/info URL dataset, surfacing which users are sharing a particular domain; and (2) the Get Tweet Timelines API allows querying all tweets posted by the users surfaced by the Search Tweets API. In our case, we extract the domain URLs shared by the Twitter users surfaced in the previous step. Although we don't consider them here, the data returned by both APIs contain geo-location, replied-to, time, and other attributes that could be leveraged in the future.

Each domain in our misinformational and informational data set is represented by a binary-valued vector whose components indicate whether a particular user shared a URL from that domain. In order to avoid an overly large and sparse representation, starting with a total of 289,984 users from our initial Twitter search, we eliminated the 244,525 users who shared fewer than two domains each. The resulting 45,459-D, binary-valued vector serves as the feature vector for each domain. We further remove any domains with fewer than 5 total tweeters, yielding a reduction from 961 to 451 misinformational domains and from 962 to 705 informational domains. The resulting total of 1,156 domains is split into a 75%/25% training/testing split. In order to balance the training data, random oversampling with replacement is applied to the minority (misinformational) class. We trained a logistic regression (LR) classifier on the 75% training split and then evaluated it on the remaining 25%. The LR hyperparameters are tuned to maximize the F1-score (a sketch of this pipeline appears at the end of this section).

From the nature of lies, conspiracies, and rumors, to the methods for their delivery and spread, misinformation has been, and is likely to continue to be, an ever-evolving phenomenon. While misinformation is not new, the consequences of its collision with a vast digital landscape have led to significant offline harms to individuals, marginalized groups, societies, and our very democracies. Addressing these harms will require a multi-faceted approach spanning thoughtful government regulation, corporate responsibility, technological advances, and education.

As with most aspects of cybersecurity, technological solutions to addressing misinformation will themselves have to be multi-faceted. With some 500 hours of video uploaded to YouTube every minute, and over a billion posts to Facebook each day, the massive scale of social media makes tackling misinformation an enormous challenge. We propose that, in conjunction with complementary approaches to tackling misinformation, addressing misinformation at the domain level holds promise to disrupt large-scale misinformation campaigns. Previous studies have found that a relatively small group of individuals is responsible for a disproportionate number of lies and conspiracies. Identifying this group and reducing its reach, while not necessarily silencing it entirely, holds the potential to make a large dent in the online proliferation of harmful misinformation.

We understand and appreciate the need to balance an open and free internet, where ideas can be debated, with the need to protect individuals, societies, and democracies. Social media, however, cannot hide behind the facade that they are a neutral marketplace of ideas where good and bad ideas compete equally. They do not. It is well established that social media's recommendation algorithms favor the outrageous and conspiratorial because such content increases engagement and profit.
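As a minimal sketch of the feature construction and classification pipeline described above, assuming scikit-learn and imbalanced-learn, and using synthetic stand-in data rather than the study's actual Twitter data:

    import numpy as np
    from imblearn.over_sampling import RandomOverSampler
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in data: rows are domains, columns are Twitter users;
    # X[i, j] = 1 if user j shared a URL from domain i.
    rng = np.random.default_rng(0)
    X = (rng.random((1156, 45459), dtype=np.float32) < 0.001).astype(np.int8)
    y = rng.integers(0, 2, size=1156)  # 1 = misinformational, 0 = informational

    # 75%/25% training/testing split.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

    # Random oversampling with replacement to balance the minority class.
    X_bal, y_bal = RandomOverSampler(random_state=0).fit_resample(X_tr, y_tr)

    # Logistic regression; in practice the hyperparameters (e.g., the
    # regularization strength C) would be tuned to maximize the F1-score.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_bal, y_bal)
    print("F1:", f1_score(y_te, clf.predict(X_te)))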
Brandeis' concept that the best remedy for falsehoods is more speech, not less, therefore simply does not apply in the era of algorithmic curation and amplification. We propose that identified misinformation peddlers not necessarily be banned or de-platformed, but that their content simply be demoted in favor of more honest, civil, and trustworthy content.

As with any inherently adversarial relationship, all approaches to addressing misinformation, including ours, will have to adapt to new and emerging threats. In our case, misinformation peddlers may add decoy hyperlinks to external trustworthy domains to escape being classified based on their hyperlinks to other misinformational domains. This, in turn, will require techniques to root out such decoy links. And so on, and on, and on. While such a cat-and-mouse game can be frustrating, the end game is that it will become increasingly difficult and time-consuming to create and spread misinformation, with the eventual goal of discouraging most peddlers and leaving us to contend with the die-hard adversary. While this is not a complete success, it will mitigate the risk of misinformation and, hopefully, return some civility and trust to our online ecosystems.

References

Protecting world leaders against deep fakes
Detecting fake news in social media networks
Detecting fake news with machine learning method
Gephi: an open source software for exploring and manipulating networks
Fast unfolding of communities in large networks
Addressing Far-Right QAnon Conspiracy, Offers Praise For Its Followers
Debunking: A meta-analysis of the psychological efficacy of messages countering misinformation
Massive Facebook study on users' doubt in vaccines finds a small group appears to play a big role in pushing the skepticism
A longitudinal analysis of YouTube's promotion of conspiracy videos
Deep learning algorithms for detecting fake news in online text
Deepfake video detection using recurrent neural networks
A retrospective analysis of the fake news challenge stance detection task
Identifying disinformation websites using infrastructure features
More than 1 in 3 Americans believe a "deep state" is working to undermine Trump
Multi-source multi-class fake news detection
A benchmark study of machine learning models for online fake news detection
Lead Stories
Beyond misinformation: Understanding and coping with the "post-truth" era
Misinformation and its correction: Continued influence and successful debiasing
Exposing deepfake videos by detecting face warping artifacts
Celeb-DF: A large-scale challenging dataset for deepfake forensics
Early detection of fake news on social media through propagation path classification with recurrent and convolutional networks
Detect rumors in microblog posts using propagation structure via kernel learning
The hockey stick and the climate wars: Dispatches from the front lines
Exploiting visual artifacts to expose deepfakes and face manipulations
Modularity and community structure in networks
Examining the global spread of COVID-19 misinformation
Intermedia reliance and sustainability of emergent media: A large-scale analysis of American news outlets' external linking behaviors
Supervised learning for fake news detection
CSI: A hybrid deep model for fake news detection
The difference between what Republicans and Democrats believe to be true about COVID-19
Anatomy of an online misinformation network
Detecting fake news on social media
Understanding user profiles on social media for fake news detection
Beyond news contents: The role of social context for fake news detection
The role of user profiles for fake news detection
"Down the rabbit hole" of vaccine misinformation on YouTube: Network exposure study
Belief echoes: The persistent effects of corrected misinformation
Fake news detection in social networks via crowd signals
The spread of true and false news online
Newspapers and the long-term implications of hyperlinking
Detecting fake news for reducing misinformation risks using analytics approaches

Acknowledgments

This work was supported by funding from Avast, Inc. (Sehgal and Farid). We thank Juyong Do and Rajarshi Gupta for their thoughtful comments and discussions.