key: cord-0589267-opj9d4cn authors: Chen, Zhouhan; Chen, Haohan; Freire, Juliana; Nagler, Jonathan; Tucker, Joshua A. title: Understanding how people consume low quality and extreme news using web traffic data date: 2022-01-11 journal: nan DOI: nan sha: d63d50d9734ecd8fc7df1d50dacf53d955a69015 doc_id: 589267 cord_uid: opj9d4cn

To mitigate the spread of fake news, researchers need to understand who visits fake news sites, what brings people to those sites, where visitors come from, and what content they prefer to consume. In this paper, we analyze web traffic data from The Gateway Pundit (TGP), a popular far-right website that is known for repeatedly sharing false information and that has made its web traffic available to the general public. We collect data on 68 million web traffic visits to the site over a one-month period and analyze how people consume news along multiple dimensions. Our traffic analysis shows that search engines and social media platforms are the main drivers of traffic; our geo-location analysis reveals that TGP is more popular in counties that voted for Trump in 2020; and our topic analysis shows that conspiratorial articles receive more visits than factual articles. Because direct website traffic is usually unobservable, existing research relies on alternative data sources such as engagement signals from social media posts. To validate whether social media engagement signals correlate with actual web visit counts, we collect all Facebook and Twitter posts with URLs from TGP during the same time period. We show that all engagement signals positively correlate with web visit counts, but with varying correlation strengths: metrics based on Facebook posts correlate better than metrics based on Twitter posts. Our unique web traffic data set and insights can help researchers better measure the impact of far-right and fake news URLs on social media platforms.

Fake news sites are a major threat on today's Internet (per 2020; Grinberg et al. 2019; Vosoughi, Roy, and Aral 2018). How to measure the consumption of fake news URLs remains a challenge. Since there is no single metric to quantify the spread of information, the choice of metrics can affect downstream analysis and alter final conclusions. There are two approaches to measuring fake news consumption: indirect and direct. For indirect measurement, a common method is to collect social media posts containing the URL of interest, calculate engagement signals, and use those metrics as a proxy for URL popularity (for an Informed Public et al. 2021; Guess, Nagler, and Tucker 2019; Guess et al. 2021). Indirect measurements reveal how people share news URLs, but not how people actually visit those URLs (Sacher and Yun 2017). Only a few studies use direct measurement data. For example, (Chalkiadakis et al. 2021) collects visit data to fake news sites from third-party services such as SimilarWeb and CheckPageRank to assess user engagement. In another study, (Fourney et al. 2017) gathers browsing data from Microsoft Internet Explorer and Edge to analyze visiting patterns to fake news sites before the 2016 US Election. As far as we know, one unexplored data source is web traffic data collected on the server side. This type of web traffic data has rich features that alternative sources do not have. Even though most news websites record their traffic, few make the data publicly available.
During an audit of popular far-right and extreme news websites, we discovered that TGP makes its website traffic available to the general public. TGP is one of the top three right-wing news sites with the largest percentage of traffic surge from December 2019 to December 2020 (Majid 2021). It is also one of "the top-three most cited domains in tweets spreading false and misleading narratives about voter fraud in 2020" (for an Informed Public et al. 2021). Even though (or perhaps because) the site constantly shares misinformation (The 2021; Faris et al. 2017), it remains highly influential. For example, its articles were cited by Former President Trump's lawyer and referenced in Trump's Impeachment Defense Memo (tra 2021). All of these features make TGP an ideal case study for understanding online extreme news consumption behavior. Given this opportunity, we crawl the entire web traffic from TGP for one month, from February 4, 2021 to March 3, 2021, and collect a total of 68 million website visits. Our analysis is two-fold: we first explore available features within the web traffic data to understand how people consume low quality news; we then collect additional social media posts to test correlations between social media engagement signals and actual web visit counts. Our substantive findings include:

1. Search engines such as Google, DuckDuckGo and Bing account for 88.5% of external referral traffic to the TGP home page. Social media platforms including Twitter, Facebook, Telegram and Gab account for more than 42% of external referral traffic to TGP article pages.
2. At the county level, TGP is more popular in counties that voted for Trump. At the state level, TGP is more popular in "swing states" such as Georgia and Arizona.
3. Topic modeling reveals that articles that mention "2020 US election fraud" are visited by 29% more users, and spread 8 hours longer, compared with articles on other topics. Those viral articles usually cover events happening in swing states.
4. Social media engagement signals positively correlate with actual website visit counts. Not all metrics are the same: Facebook metrics achieve a stronger correlation than Twitter metrics.

To the best of our knowledge, our work is the first to analyze server-side web traffic data of a popular low quality news site, and the first to correlate social media engagement signals with actual web traffic counts. In the future, we plan to apply our method to similar server-side web traffic data, although getting access to additional data sets remains challenging.

In this section, we first explain how we collect the entire visit traffic from TGP for one month. We then give an overview of the collected data and address issues related to data integrity, missing data and data privacy. TGP uses StatCounter, a web traffic service, to capture visitor traffic. To access traffic data for TGP, users can either visit a publicly available web portal or download the data by sending an HTTP GET request to a URL endpoint, which we refer to as the download URL. Users need to specify two parameters in the URL, which we refer to as StartTime and EndTime. During our testing phase, we find that no matter what StartTime and EndTime we set, the downloaded CSV file always contains traffic captured during the most recent 20 minutes. To collect website traffic continuously, we set up a Selenium Chrome browser to visit the download URL every 15 minutes, from February 3, 2021 to March 3, 2021 (a sketch of this collection loop is shown below).
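The following is a minimal sketch of this periodic collection loop. The download URL, its parameter format, and the project identifier are hypothetical placeholders; the real StatCounter endpoint and the browser's download configuration are not shown here.

```python
# A minimal sketch of the periodic collection loop, under the assumptions noted above.
import time
from datetime import datetime, timedelta
from selenium import webdriver

# Placeholder endpoint: the real StatCounter download URL and project id are not shown.
DOWNLOAD_URL = "https://statcounter.com/EXAMPLE_PROJECT/csv/?StartTime={start}&EndTime={end}"

def poll_once(driver):
    """Request the CSV export once; the server returns roughly the last 20 minutes of visits."""
    end = datetime.utcnow()
    start = end - timedelta(minutes=20)
    url = DOWNLOAD_URL.format(start=start.strftime("%Y-%m-%d %H:%M"),
                              end=end.strftime("%Y-%m-%d %H:%M"))
    driver.get(url)  # Chrome saves the returned CSV via its normal download handling (config omitted)

def main():
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        while True:
            poll_once(driver)
            time.sleep(15 * 60)  # 15-minute polling interval, safely below the 20-minute window
    finally:
        driver.quit()

if __name__ == "__main__":
    main()
```

Each iteration adds one overlapping CSV snapshot to the browser's download directory.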
We choose a 15-minute interval because it falls below the 20-minute window with a safe margin. One side effect is that our data contains duplicates. To remove duplicates, we observe that each website visit is uniquely defined by the combination of five features: datetime, url, ip, os, and browser. Therefore, we keep only the first record when multiple records share the same five-feature combination. To validate that our collection method captures the entire traffic, we compare the daily number of visits reported by StatCounter against the number calculated from our collection after de-duplication. Figure 1 shows that our data set has a completeness ratio of more than 99.8% on a daily basis, where we define the completeness ratio as our number of visits divided by StatCounter's number of visits. The lost entries are possibly caused by parsing errors or corrupted network packets. We believe that this small number of missing entries (less than 0.2%) will not affect the trends we observe.

Figure 1: Total number of visits per day. The blue bar is the official count from TGP, and the red bar is the count from our data set. Our data collection has a completeness ratio of more than 99.8% on a daily basis.

Even though we capture the entire web traffic, our data source (StatCounter) has several inherent problems. One potential issue is under-counting. For example, anyone who blocks HTTP and HTTPS requests to StatCounter will not have their visits logged by the server. This can happen if people install certain anti-tracking plug-ins. Unfortunately, it is impossible to know exactly how many users install anti-tracking tools, as those tools are designed to hide web visit history. Another problem is the presence of bot traffic. Bots are programs that automatically visit web pages. According to the documentation, StatCounter does not record most bots or crawlers, because clients have to actually load JavaScript for their hit to be logged in the system (sta 2012). For more advanced bots that emulate human behavior (load JavaScript, click buttons), there is no way to distinguish their traffic from real human traffic. To sum up, even though the amount of missing data and advanced bot traffic is unknown and undetectable, we believe that those irregularities will not affect the overall trends in our analysis. To address concerns regarding data privacy, we first note that our web traffic data does not contain any personally identifiable information such as name, phone number, cookie, session ID, device ID, or email address. Additionally, all of the results presented below are aggregated. Despite the challenges noted above, two factors give us confidence in the robustness of our findings. First, our data is about as close to the ground truth as one can hope for in any sort of online data analysis, with a 99.8% completeness rate. Second, we collect more than 68 million page visits that span a full month. This extended period of time ensures that any daily or hourly data irregularity is smoothed out and the overall trend preserved.

In this section, we take a multi-pronged approach to analyze our one-month web visit data along multiple dimensions. To better understand how people consume low quality news, we start by visualizing when people visit the site and from what type of device. We then analyze referrer links to understand which sites bring users to TGP.
We also leverage geo-spatial information to validate whether people who visit TGP come from areas that voted more favorably for Donald Trump in the 2020 Presidential Election. Finally, we apply topic clustering techniques to quantify what topics are discussed and which topics are more likely to go viral.

Finding 1: The majority of users visit the site during the day, on mobile devices. Our data collection contains 68,268,818 unique visits, from February 3, 2021 to March 3, 2021. Figure 2 plots the number of visits per hour. Since more than 95% of the visits come from the United States, we see a regular circadian pattern in which traffic increases during the day and decreases during the night. The peak hourly visit count on a typical day is around 200,000. The only exception is one hour on February 13, 2021, with nearly 300,000 recorded visits. February 13, 2021 is the day Donald Trump was acquitted on impeachment charges. After checking the data set, we find that the two most visited articles published that day are both about the impeachment charges.

Figure 2: Number of visits per hour, from February 4, 2021 to March 3, 2021. There is a peak on February 13, 2021, the day Donald Trump was acquitted on impeachment charges. Our data shows that the two most visited articles during that day both covered this event.

To understand how people visit TGP, we look at the operating system (OS) column, as it reveals what device people use. According to Figure 3, more than 80% of visits come from mobile operating systems, including iPhone and Android devices. If this finding holds for other low quality news sources, it suggests that research that mostly focuses on desktop users may miss a large proportion of the population that visits low quality news sources (Ognyanova et al. 2020).

Knowing which websites bring people to TGP helps us identify the sources of traffic and design intervention strategies to slow down the spread of fake news. To reconstruct traffic flows, we use the referrer column in our web traffic data. When a browser navigates to URL B from URL A, it usually includes a string called the referrer in the HTTP request (in our example, A is the referrer of B). Among 68,268,818 visits, 35,296,042 (52%) have referrers. For visits that do not have referrers, either the user visited the URL directly, or the browser stripped the referrer, which can happen when certain privacy-enhancing features are turned on (moz 2021). To aggregate referrers that belong to the same site, we normalize each referrer URL to its domain name, removing the path and query parameters. We consider two referral behaviors based on the destination URL: sites that bring users to the home page, and sites that bring users to an article page. A home page URL points to the domain thegatewaypundit.com, while an article page URL has the form thegatewaypundit.com/ARTICLE. Each type of traffic flow has its own characteristics, which we analyze separately (a sketch of this normalization and classification is shown below).

Websites that bring users to the home page. Figure 4 shows the top 15 domains that bring visitors to the TGP home page. Three major search engines (Google, DuckDuckGo and Bing) account for 88.5% of external referral traffic. Among them, Google.com is the top driver of home page traffic (66%), the privacy-focused search engine duckduckgo.com ranks fourth (13%), and the Microsoft-developed bing.com ranks fifth (9%). Second to Google are internal TGP article pages. This shows that when people browse articles on TGP, they usually navigate back to the home page from different article pages.
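Before continuing down the referrer ranking, here is a minimal sketch of the referrer normalization and destination classification described above. The handling of domains is simplified and the example URLs are illustrative.

```python
# A minimal sketch; real-world referrer strings require more robust handling.
from urllib.parse import urlparse

def referrer_domain(referrer):
    """Reduce a referrer URL to its domain name, dropping the path and query parameters."""
    if not referrer:
        return None  # direct visit, or referrer stripped by the browser
    host = urlparse(referrer).netloc.lower()
    return host[4:] if host.startswith("www.") else host

def destination_type(url):
    """Label a destination URL as a home-page visit or an article-page visit."""
    path = urlparse(url).path.strip("/")
    return "home" if path == "" else "article"

print(referrer_domain("https://www.google.com/search?q=gateway+pundit"))              # google.com
print(destination_type("https://www.thegatewaypundit.com/"))                            # home
print(destination_type("https://www.thegatewaypundit.com/2021/02/example-article/"))    # article
```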
The third referrer is the TGP home page itself, likely caused by people clicking links to the home page while they are already on the home page. Further down the list are far-right and conservative news sites such as drudgereport.com, 63red.com and protrumpnews.com. We also identify referrers from suspected phishing domains. One such domain is netlix.com, ranked number ten. The domain name was once the center of a lawsuit: in a legal complaint filed in 2009, Netflix claimed that the domain name "netlix" looks too similar to "netflix" and requested that netlix.com be transferred to Netflix. The court rejected the request, and netlix.com still belongs to its original owner. As of September 14, 2021, the website does not host actual content, but automatically redirects users to TGP. We do not know what motivates the owner of netlix.com to redirect visitors to TGP. We will keep monitoring the site, as previous research demonstrates that URL redirection is a common technique to distribute unwanted or malicious software (Chen and Freire 2021).

Websites that bring users to an article page. Figure 5 shows the top 15 domains that bring visitors to an article page. The top two referrers, the TGP home page and TGP article pages, are both internal traffic. This indicates that (a) most users first land on the home page before clicking an individual article, and (b) some users click a new article page while browsing an existing article page, since different articles are interlinked. When we exclude referrers from thegatewaypundit.com, we can classify the remaining sites into two groups:

1. Social media platforms, including Twitter, Facebook, and emerging platforms such as Telegram and Gab. Together they account for 42% of external referral traffic.
2. Conservative news sites such as protrumpnews.com, thelibertydaily.com, populist.press and whatfinger.com. Those sites repost articles from TGP on a regular basis.

Figure 5: Top 15 domains that bring visitors to an article page. The top two referrers are the TGP home page and TGP article pages; together they account for more than 80% of referral traffic. We drop those two domains for better visualization.

To further understand the role each social media platform plays in driving traffic, we plot the daily number of visits with referrers from four different social media platforms, shown in Figure 6. The overall trend shows that Twitter and Facebook drive more traffic than Telegram and Gab. Daily traffic volume fluctuates and can be affected by external events. For example, Jim Hoft, founder of TGP, was suspended by Twitter on February 6, 2021. The suspension is likely related to the decline of traffic from Twitter on that day. Other than several peaks in late February, the volume of traffic from Twitter and Facebook continued to decline. This finding suggests that suspending social media accounts that spread low quality URLs may indeed be an effective way to reduce the spread of misinformation.

Finding 3: Visitors to the site are more likely to be from areas that voted for Donald Trump during the 2020 presidential election. Our web traffic data records an IP address and a city-level geo-location label for every request. To better understand what types of audiences visit TGP, and their political preferences, we leverage the geo-location information to answer two key questions. Given that articles published on TGP are pro-Trump, pro-Republican Party, and often related to the 2020 US election (for an Informed Public et al. 2021): (1) is TGP more popular in counties that voted for Trump?
and (2) is TGP more popular in Republican states, Democratic states, or Swing states? We assume that each unique IP address is one unique visitor, and that each visitor was a voter in the 2020 US Presidential Election. In reality, our assumption might not always hold: multiple people in a household can share the same IP, one person can visit the site from multiple IP addresses, or the person who visits the site may not be eligible to vote. Even though those limitations exist, the IP address is the most accurate proxy for real human traffic in our dataset. IP addresses are also commonly used in security research to generate threat intelligence from traffic logs (Fourney et al. 2017; cis 2021). We also filter out cities that have fewer than 1,000 unique visitors, because those cities are too small to allow a safe margin of error; for example, an IP address that belongs to a small city might get erroneously assigned to a neighboring city. After filtering, our data set contains 596 US cities.

Question 1: Is TGP more popular in counties that voted for Trump? To answer this question, we first collect county-level 2020 US election results, including the total number of voters, the number of voters who voted for Trump, and the number of voters who voted for Biden. Then, in our web traffic data set, we aggregate the number of visits per city into the number of visits per county. Finally, for each county, we calculate two metrics:

% voters who visited TGP (y) = (number of unique IP addresses) / (total number of voters)
% voters who voted for Trump (x) = (number of voters who voted for Trump) / (total number of voters)

Figure 7 shows the scatter plot of x and y. The red line is the expected value of y given x, based on a linear regression model. The r-squared value is 0.17 and the slope is 0.037, which indicates a positive correlation between the percentage of voters who visited TGP and the percentage of voters who voted for Trump. Thus we do in fact find that TGP is more popular in counties that voted for Trump. This finding is consistent with a study of the 2016 election showing that people from counties that voted for Trump are more likely to visit fake news sites (Fourney et al. 2017). (A code sketch of this computation is given below.)

Question 2: Is TGP more popular in Swing, Republican, or Democratic states? We compare the number of visitors, in both absolute and per-capita terms, in Swing, Republican and Democratic states. Table 2 shows the results. Interestingly, we find that TGP is more popular in Swing states and Republican states than in Democratic states. We perform pairwise t-tests and find that the difference is statistically significant when using the per-capita count, but not when using the absolute count. To understand why more people from Swing states visit the site, in the next section we analyze all articles published on the site during our data collection period. We show that the most popular articles frequently mention topics related to Swing states, including "2020 US election fraud", "missing ballot", or "voting irregularity", all of which are unverified or false claims. Those stories have a more direct impact on people from Swing states than on those from Republican or Democratic states.

Visualizing hotspots. To better understand where people visit TGP from, we visualize cities with a high concentration of visitors on two maps, separating visits into two categories: those coming from outside the United States and those coming from within the United States. Figures 8 and 9 show the top US and non-US cities. The radius of each dot is proportional to the percentage of the city population that visited TGP within a month. Most non-US visits come from cities in Canada, Australia, New Zealand and Israel.
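Returning briefly to the county-level comparison above, the following is a minimal sketch of how the two metrics and the regression fit can be computed. The table layouts and column names (county_fips, ip, total_voters, trump_voters) are assumptions for illustration, not the actual schema of our data.

```python
# A minimal sketch under the assumptions stated above.
import pandas as pd
from scipy import stats

def county_metrics(visits, election):
    """Join per-county unique-visitor counts with election results and compute
    x (% of voters who voted for Trump) and y (% of voters who visited TGP)."""
    unique_visitors = (visits.groupby("county_fips")["ip"]
                             .nunique()
                             .rename("unique_visitors"))
    df = election.set_index("county_fips").join(unique_visitors, how="inner")
    df["y"] = df["unique_visitors"] / df["total_voters"]
    df["x"] = df["trump_voters"] / df["total_voters"]
    return df

def fit_line(df):
    """Linear regression of y on x, as in the Figure 7 scatter plot."""
    result = stats.linregress(df["x"], df["y"])
    return result.slope, result.rvalue ** 2  # slope and r-squared

# Example usage (with hypothetical data frames):
# slope, r_squared = fit_line(county_metrics(visits, election))
```

We now return to the hotspot maps.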
For visits within the United States, some come from metropolitan areas such as Denver, Houston and Chicago. Others come from cities within Swing or Republican states such as Florida, Texas and Arizona.

Finding 4: Topics related to "election fraud" receive more clicks and remain popular on the site for a longer period of time than other topics. During the one-month period of our study, TGP published 1070 articles. Some stories go viral, others do not. What topics are discussed? What makes a topic go viral? Is virality associated with the "fakeness" of the story? To better understand those connections, we use a topic clustering technique to group TGP articles into ten distinct topics. We then design two metrics to quantify the popularity of an article: the number of unique visits (volume based), and the number of minutes it takes to receive 50%/90%/95% of all visits (time based). We first show the distribution of those two metrics over all articles, and then aggregate the metrics by topic to identify viral content.

What topics are discussed? Each article published on TGP comes with a one-sentence title with references to key names and events. For example, one article published on February 18, 2021 is titled "Maricopa County Audits Are Proving to Be a Waste of Time and Money, They Were Never Created to Identify the Suspected Election Fraud in the County." Given the rich information in the titles, we use non-negative matrix factorization (NMF) to cluster the 1070 article titles into different topics. NMF is an unsupervised algorithm for extracting topics from a text corpus. In our case, the input to NMF is an article-word matrix, where each entry is the tf-idf weight of a word in an article. NMF factorizes this matrix into an article-topic matrix and a topic-word matrix. The number of topics is a user-defined parameter; after experimenting with different values, we set it to 10, because the resulting topics are coherent and distinct from each other. Table 3 shows the keywords associated with each topic (a code sketch of this clustering step is given below).

topic  keywords
1      president donald trump acquittal
2      joe biden kamala harris
3      2020 election fraud voter integrity
4      marjorie taylor greene, liz cheney
5      capitol riot antifa police fbi
6      governor andrew cuomo new york
7      democrat impeachment trial
8      maricopa arizona county ballot shredded dumpster
9      covid 19 vaccine virus cdc
10     dominion voting machine
Table 3: Keywords associated with each topic. We use non-negative matrix factorization to cluster 1070 articles into 10 topics.

Which topics receive more visits? We first use the number of unique visits to measure article virality. Each unique IP address counts as one unique visit. Figure 10 shows the histogram of the number of unique visits per article. Overall, the average number of visits per article is 32,488 and the median is 22,434. The distribution is skewed to the right, suggesting that some articles have a very high number of unique visits.

Figure 10: Histogram of unique visits per article. We use the IP address as a proxy for a visitor. The mean number of visits is 32,488, and the median is 22,434. This discrepancy suggests that some articles receive a very high number of visits.

We then aggregate article-level visit counts to the topic level. Figure 11 shows the mean and median number of visits per topic. The most visited topic is #3, which according to Table 3 is related to "2020 US election fraud", an unverified claim pushed by far-right news media.
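For reference, here is a minimal sketch of the title clustering step described above, assuming titles holds the 1070 article titles. The preprocessing choices (stop words, document-frequency cutoffs, number of keywords) are illustrative rather than the exact settings we used.

```python
# A minimal sketch of tf-idf + NMF topic clustering of article titles.
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_titles(titles, n_topics=10, n_keywords=8):
    """Cluster titles into topics; return a dominant-topic label per title
    and the top keywords per topic."""
    vectorizer = TfidfVectorizer(stop_words="english", max_df=0.95, min_df=2)
    tfidf = vectorizer.fit_transform(titles)        # article-word matrix (tf-idf weights)
    model = NMF(n_components=n_topics, init="nndsvd", random_state=0)
    article_topic = model.fit_transform(tfidf)      # article-topic weights
    topic_word = model.components_                  # topic-word weights
    vocab = vectorizer.get_feature_names_out()
    keywords = [[vocab[i] for i in row.argsort()[::-1][:n_keywords]]
                for row in topic_word]
    labels = article_topic.argmax(axis=1)           # dominant topic of each article
    return labels, keywords
```

We now return to the per-topic visit counts.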
The second most visited topic is #8, which covers "voting irregularity and ballot counting in Maricopa County", another unfounded claim. The popularity of those topics indicates that readers of TGP had a huge appetite for articles about electoral fraud. The fact that those articles were published and remained popular three months after the 2020 US election shows that this type of misinformation can have a long-lasting effect on readers, and that misinformation does not have to cover real-time topics to remain popular.

Figure 11: Mean and median number of unique visits per topic. The most visited topics are both related to the 2020 US election. Topic #3 is related to election fraud, and topic #8 is related to Maricopa County ballots. Both claims are unverified conspiracy theories.

Do viral topics last longer? To quantify the popularity of an article in the time dimension, we measure how long it takes an article to receive 50% (t1), 90% (t2), and 95% (t3) of all its visits after publication (a code sketch of this metric is given below). Figure 12 shows that, in median values, 50% of visits come within the first 237 minutes (4 hours), 90% within the first 1,177 minutes (20 hours), and 95% within the first 1,634 minutes (28 hours). In general, it is rare for an article to stay viral for more than a day.

Figure 12: Histogram of the number of minutes to reach 50%, 90% and 95% of total unique visits. On average, an article receives 50% of all visit traffic within 4 hours (280 minutes) of publication, and 90% within 20 hours (1,200 minutes) of publication.

We then aggregate the article-level counts to the topic level. Figure 13 shows the median t2 and median t3 for each topic. Topics that trend longer include #3 ("US election fraud"), #8 ("Arizona county ballot"), and #10 ("Dominion voting machine"). Topics that trend shorter include #1 ("President Donald Trump acquittal") and #7 ("Democrat impeachment trial"). The longest-trending topics are also the most visited topics. This suggests that viral topics are not only read by more people, but also last for a longer period of time. In general, topics related to conspiracy theories are more popular, while topics that state a known fact are less viral.

Figure 13: Median number of minutes to reach 90% and 95% of total visits per topic. Topics that trend longer include #3 (US election fraud), #8 (Arizona county ballot), and #10 (Dominion voting machine). Topics that trend relatively shorter include #1 (president donald trump acquittal) and #7 (democrat impeachment trial). In general, conspiratorial topics last longer, while topics that report a known fact do not last as long.

Comparing web traffic data with social media engagement signals. As mentioned previously, existing research on news consumption mostly focuses on how news URLs are shared on social media platforms, especially Twitter and Facebook. While social media signals can tell us how people share news, they do not reveal how many people actually visit each URL. Is there any correlation between social media sharing behavior and actual news consumption behavior? If so, how strong is the correlation? To answer those questions, we first collect Facebook and Twitter metrics that measure the popularity of TGP links shared on each platform. We then test the correlational strength of different social media metrics against website visit counts, and identify metrics that are good estimates of actual web visit counts.
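Before turning to that comparison, here is a minimal sketch of the time-based virality metric (t1, t2, t3) defined earlier. The inputs visit_times and published_at are assumptions for illustration: the timestamp series of one article's visits and that article's publication time.

```python
# A minimal sketch under the assumptions stated above.
import numpy as np
import pandas as pd

def minutes_to_reach(visit_times, published_at, fractions=(0.5, 0.9, 0.95)):
    """Minutes after publication needed to accumulate each fraction of an
    article's total visits (t1, t2, t3)."""
    times = pd.Series(pd.to_datetime(visit_times)).sort_values().reset_index(drop=True)
    n = len(times)
    result = {}
    for f in fractions:
        idx = int(np.ceil(f * n)) - 1          # index of the visit that crosses the threshold
        delta = times.iloc[idx] - pd.to_datetime(published_at)
        result[f] = delta.total_seconds() / 60.0
    return result

# Example usage with hypothetical inputs:
# minutes_to_reach(article_visits["datetime"], "2021-02-18 09:00:00")
```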
Among the 1070 online articles published by TGP during our one-month data collection, 1020 received more than 10,000 unique web visits. To ensure the stability of our experiment, we focus on those 1020 URLs and discard URLs with lower web visit counts. We use the CrowdTangle API to collect Facebook posts that contain any one of the 1020 URLs published by TGP. CrowdTangle is a data intelligence service that tracks aggregated engagements and interactions of posts from Facebook pages and groups (both public and private) (cro 2021). We use the Twitter Academic API (twi 2021) to collect all original and public tweets that contain any one of the 1020 URLs. For each URL, we calculate seven metrics, shown in Table 4. We calculate Pearson correlations between each social media metric and (a) the number of visits from all traffic and (b) the number of visits from platform-specific traffic. The Pearson correlation is the normalized covariance between two variables, and is used to summarize the strength of the linear relationship between two variables (Freedman, Pisani, and Purves 2007).

metric | source
number of unique visits (all) | web traffic dataset
number of unique visits (from facebook.com) | web traffic dataset
number of unique visits (from twitter.com) | web traffic dataset
total number of FB reactions | CrowdTangle API
total number of FB interactions | CrowdTangle API
total number of likes | Twitter API
total number of retweets | Twitter API
Table 4: We calculate seven metrics to quantify the popularity of an article URL. We later correlate the web-traffic-based metrics with the social-media-based metrics.

We first observe that Facebook metrics correlate better with traffic that originated only from Facebook than with traffic that originated from all sites. The same is true for Twitter metrics. For example, Figure 14 shows that the Pearson correlation between total Facebook interactions and the number of visits from facebook.com is 0.894, while the correlation is only 0.595 for the number of visits from all sites. Since social media metrics cannot capture URL sharing activities outside of the platform, the correlation decreases significantly when using the number of visits from all sites. We also observe that Facebook metrics correlate better with web visit counts than Twitter metrics. For example, when we compare the first row of Figure 14 against the first row of Figure 15, we see that Facebook interactions have a higher correlation with web visit counts than Twitter likes: the former metric has a Pearson correlation of 0.595 while the latter has a correlation of 0.435. Why is there a discrepancy? One reason could be that the Facebook metrics we collect count both private and public posts, while the Twitter metrics only count public posts. Understanding what other factors affect the correlation is a subject for future research. To summarize, we validate that all social media metrics have a positive correlation with web visit counts. Therefore it is reasonable to use social media engagement signals as a proxy for URL popularity. However, there are limitations when using social media metrics, as they only capture link sharing activities on one platform. Given those insights, researchers should carefully choose from which platform to collect data and which engagement signals to use, as each metric has a different correlational strength. In the future, we plan to test more metrics to further understand correlations between how people share news on social media versus how people actually read news.
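A minimal sketch of the correlation analysis above follows. The metrics DataFrame and its column names are assumptions, standing in for the per-URL metrics listed in Table 4.

```python
# A minimal sketch under the assumptions stated above.
import pandas as pd
from scipy.stats import pearsonr

def correlate(metrics, social_col, visit_col):
    """Pearson correlation between one social media metric and one visit-count metric."""
    df = metrics[[social_col, visit_col]].dropna()
    r, p = pearsonr(df[social_col], df[visit_col])
    return r, p

# Example usage with hypothetical column names:
# r_all, _ = correlate(metrics, "fb_interactions", "unique_visits_all")
# r_fb, _  = correlate(metrics, "fb_interactions", "unique_visits_from_facebook")
```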
Measuring who consumes fake news and how they consume it is an important but challenging research area. Previous work mostly studies the spread of fake news on social media platforms (for an Informed Public et al. 2021). For example, (Vosoughi, Roy, and Aral 2018) collects tweets containing links to fake news sites and concludes that fake news spreads faster and further than traditional news. In another study using Twitter data, (Grinberg et al. 2019) claims that "fake news accounted for nearly 6% of all news consumption, but it was heavily concentrated" on a small percentage of users. Similarly, (Guess, Nagler, and Tucker 2019) and (Guess et al. 2021) collect Facebook posts to understand news consumption behavior, and find that older people are more susceptible to fake news and share more of it. While social media engagement signals can tell us how people share news on different platforms, they do not necessarily translate into web traffic to the news site (Sacher and Yun 2017). One way to bridge this gap is to gather data directly from volunteers via browser extensions: for example, (Ognyanova et al. 2020) asked participants to install a browser extension to measure their exposure to fake news. However, this approach is usually expensive and the sample size is small. To understand population-level news consumption behavior, there is an urgent need to collect "unique datasets with increased validity" (Pasquetto et al. 2021). Web traffic data is a direct measurement of news consumption. In one study, (Chalkiadakis et al. 2021) assesses user engagement by collecting traffic data from tracking services such as SimilarWeb and CheckPageRank. (Fourney et al. 2017) gathers browsing data from Microsoft Internet Explorer and Edge, and analyzes visitor patterns to a list of fake news domains before the 2016 US Election. Different from all previous approaches, we focus on collecting the entire web traffic to a single but important news site (TGP). Our data set enables us to validate and extend previous traffic-based analyses. As far as we know, we are the first to test correlations between social media engagement signals and web traffic counts by combining Twitter and Facebook posts with web traffic data.

In this paper, we collect and analyze a unique website traffic data set that contains more than 68 million visits to The Gateway Pundit (TGP), a major far-right website known to spread fake news and conspiracy theories. We find that search engines and social media platforms are the main drivers that bring traffic to the site. Our geo-location analysis reveals that TGP is more popular in counties that voted for Donald Trump, and our topic analysis shows that conspiratorial stories are more viral. Finally, we compare engagement signals derived from Twitter and Facebook posts with actual website visit counts, and find varying degrees of correlation. Our population-level behavioral analysis can help researchers design robust intervention methods to counter the spread of misinformation. One major difficulty encountered during our research was our inability to analyze other comparable web traffic data sets. We reached out to several organizations that could offer such data sets, but did not move forward due to insufficient responses. In the future, we plan to collaborate more with industry partners that have direct access to population-level news consumption data. Potential collaborators include web tracking companies and Internet service providers.
As misinformation spreads across multiple platforms with increasing speed, researchers need access to more direct measurement data to quantify and understand how people access low quality news websites.

References

StatCounter. 2012. New Feature Added: Ignore Crawlers and Bots from your Stats.
Persily, N.; and Tucker, J. A., eds. 2020. Social Media and Democracy: The State of the Field, Prospects for Reform. SSRC Anxieties of Democracy.
Mozilla. 2021. Referrers by default to protect user privacy.
NewsGuard. 2021. The-Gateway-Pundit-NewsGuard-Nutrition-Label.pdf.
Twitter. 2021. Twitter API Academic Research product track.
Chalkiadakis et al. 2021. The Rise and Fall of Fake News Sites: A Traffic Analysis.
Chen, Z.; and Freire, J. 2021. Discovering and Measuring Malicious URL Redirection Campaigns from Fake News Domains.
Faris et al. 2017. Partisanship, Propaganda, and Disinformation: Online Media and the 2016 US Presidential Election.
Fourney et al. 2017. Geographic and Temporal Trends in Fake News Consumption During the 2016 US Presidential Election.
Freedman, D.; Pisani, R.; and Purves, R. 2007. Statistics (international student edition), 4th edn.
Grinberg et al. 2019. Fake News on Twitter During the 2016 U.S. Presidential Election.
Guess et al. 2021. Cracking Open the News Feed: Exploring What U.S. Facebook Users See and Share with Large-Scale Platform Data.
Guess, A.; Nagler, J.; and Tucker, J. 2019. Less Than You Think: Prevalence and Predictors of Fake News Dissemination on Facebook.
Majid. 2021. Top 50 Largest News Websites in the World: Surge in Traffic to Epoch Times and Other Right-Wing Sites.
Ognyanova et al. 2020. Misinformation in Action: Fake News Exposure Is Linked to Lower Trust in Media, Higher Trust in Government When Your Side Is in Power.
Pasquetto et al. 2021. Tackling Misinformation: What Researchers Could Do with Social Media Data.
Vosoughi, S.; Roy, D.; and Aral, S. 2018. The Spread of True and False News Online.