title: Leveraging Google's Publisher-specific IDs to Detect Website Administration
authors: Papadogiannakis, Emmanouil; Papadopoulos, Panagiotis; Markatos, Evangelos P.; Kourtellis, Nicolas
date: 2022-02-10
DOI: 10.1145/3485447.3512124

Digital advertising is the most popular way to monetize content on the Internet. Publishers spawn new websites, and older ones change hands, with the sole purpose of monetizing user traffic. In this ever-evolving ecosystem, it is challenging to effectively answer questions such as: Which entities monetize which websites? What categories of websites does an average entity typically monetize, and how diverse are these websites? How has this website administration ecosystem changed across time? In this paper, we propose a novel, graph-based methodology to detect administration of websites on the Web, by exploiting ad-related, publisher-specific IDs. We apply our methodology across the top 1 million websites and study the characteristics of the resulting website administration graphs. Our findings show that approximately 90% of websites are each associated with a single publisher, and that small publishers tend to manage less popular websites. We perform a historical analysis of up to 8 million websites, and find a new, constantly rising number of (intermediary) publishers that control and monetize traffic from hundreds of websites, seeking a share of the ad-market pie. We also observe that, over time, websites tend to move from big to smaller administrators.

Digital advertising keeps the content we consume on the Web free of charge and is an important stream of revenue for web publishers [18]. Even during 2020, with all the adverse economic impacts of the COVID-19 pandemic, there was a reported 12.2% increase in ad revenues [29], and hundreds of billions of dollars in annual spending worldwide ($455B in 2021 [18]). However, it is inherently difficult to assess the effectiveness of digital ad spending, due to the overly complex, layered ecosystem of digital marketing, with thousands of intermediaries brokering ads and ad slots between sellers and buyers. Some even consider this market overvalued and possibly due for a correction, with various adverse effects [28]. In an attempt to increase ad profits, advertisers, intermediaries and publishers resort to analytics and other web tracking services to better measure user audiences and their engagement with webpages. But the increasing complexity of this ecosystem makes it hard to answer questions such as: Who are the entities that control and monetize websites, and which websites? How many websites does the average such entity control? Are they from the same category, or diverse in nature? What are the characteristics of these website administrators, and how have they changed over time?

In the last decade, journalists and academic researchers have grappled with such questions. In fact, they have made several efforts to (i) provide more transparency to the ecosystem of web content monetization and administration [45], (ii) raise awareness of its impact on users' privacy due to online tracking and possibilities of de-anonymization [3, 4, 33, 47, 50], and (iii) shed light on how this ecosystem drives misinformation and fake news [1, 43, 45]. For example, L.
Alexander [1] used Google Analytics IDs to find evidence of a concerted pro-Kremlin web campaign, executed across different websites owned by the same entity. C. Silverman et al. [45] looked into Google-related IDs and found websites operated by the same entities, which promoted fake news content and delivered polarizing ads during the 2016 USA presidential election. Furthermore, C.I. Samson [43] discussed the issue of fake news spreading within the context of the 2016 Philippines presidential election. Such reports demonstrate the urgent need for more transparency in the issue of website administration. In addition, academic works [4, 33, 47, 50] have looked at the problem from the point of view of user tracking or de-anonymization, using such Google-related IDs to detect malicious websites and their administrators. However, to date, there has been no systematic study to reveal, at scale, the way websites are monetized and by which entities.

In this work, we try to shed light on website administration and propose a novel, graph-based methodology to detect the entities that are in charge of websites. To that end, we exploit the ad-related, publisher-specific IDs that publishers embed in their websites in order to use third-party services. We (i) apply our methodology across the top 1 million websites of the Tranco list, to detect groups of websites monetized by the same entity, (ii) study the characteristics of the generated website administration graphs, and (iii) find intermediary publishers that manage and monetize traffic from hundreds or even thousands of websites. We perform a 2-year historical analysis of up to the top 8 million websites and explore how small, medium and large publishers have evolved over time. In summary, the contributions of this work are:
• We propose a novel methodology for detecting website administration and co-ownership based on publisher-specific IDs, with applicability in different use cases.
• We conduct the first, to our knowledge, large-scale systematic study of such publisher-specific IDs, embedded in up to 8M websites. We make our implementation [35] along with our results [36] publicly available to support further research on this topic.
• Our findings show that approximately 90% of websites are associated with a single publisher and that small publishers tend to manage less popular websites. We also conclude that there is preferential administration, with an inclination towards "News and Media" websites. Finally, we show that, over time, websites tend to move away from big administrators to smaller ones.

AdSense is a service for publishers to generate revenue by displaying ads on their websites. For ads to be displayed, publishers need to insert the AdSense code snippet in their website, which includes a Publisher ID: a unique identifier for an AdSense account that follows the format pub-XXXXXXXXXXXXXXX. The owner of the account is allowed to share the account with employees, or even business partners; however, there is always a single account holder, and different AdSense accounts cannot be merged [10]. An AdSense account cannot be transferred to another individual [9], but two or more AdSense accounts with different Publisher IDs can co-exist on the same website [12]. These other Publisher IDs can belong to a business partner, contributing authors, or even third parties.
Google Tag Manager (GTM) is a service for web administrators to manage code snippets (called Tags, provided by third parties to integrate their respective services, e.g., analytics, marketing, support) in their website. GTM provides an interface for publishers to handle such code snippets, which uses an abstraction called a container that needs to be installed in the website by inserting its own snippet [21]. A container is uniquely identified by a Container ID, formatted as GTM-XXXXXX. One GTM account can create and manage more than one container. Usually, a GTM account represents the topmost level of organization and, typically, an organization uses a single GTM account [13]. Thus, containers are not bound to a domain or a website, and with the appropriate configuration, the same container can be used in multiple websites [11].

Google Analytics is a service to track and report website traffic. The service revolves around Properties, which contain the reports and traffic data for one or more websites or applications. There are two types: (i) Google Analytics 4 properties, which are identified by a Measurement ID that follows the format G-XXXXXXX, and (ii) Universal Analytics properties (the older version [15]), which are uniquely identified by a Tracking ID, formatted as UA-000000-1. When a user creates a Google Analytics account, a unique identifier is created that acts as the prefix of its Tracking IDs (i.e., the first set of numbers). Consequently, the Tracking ID included in the code snippet indicates the account to which data is sent [14]. The suffix of a Tracking ID identifies the property to which data is sent. A website publisher that owns more than one website is able to associate a single property with all of these websites.

To detect websites operated by the same entity, we search for the identifiers of the respective services described in Section 2. Specifically, we develop a Puppeteer-based crawler that instruments instances of the Chromium browser. Using these instances, we crawl, with a clean state, the landing pages of the top 962K websites of the Tranco list, which aggregates the ranks of the lists provided by Alexa, Umbrella, and Majestic from 16/3/2021 to 14/4/2021 [49]. This list is formed based on techniques that enable list stability, facilitate reproducibility, and protect against adversarial manipulation. The implementation of our crawler is publicly available [35]. When our crawler visits a website, it waits until the page has completely loaded and for an additional 5-second period, to ensure that all programmatically purchased ads (via Real Time Bidding (RTB)) have been rendered. Then, it stores the HTML of the page, a cookie jar, and the HTTP(S) requests performed during the website visit. We capture all requests passively, in a read-only fashion, without mutating or intercepting them. This ensures that the behavior of the website is not affected by our crawler. To collect the HTML of the website, we utilize the Chrome DevTools Protocol [48]. This way, we ensure that we capture not only the actual HTML code but also the documents, styles or code fetched by iframes or code snippets. Our crawler visits the 962K websites of the Tranco list (1MT crawl) from 15-27/4/2021 and collects 415GB of data. For 93,817 websites (9.75%), the crawling process failed due to timeouts or site inaccessibility. Overall, we detect 525,493 websites with at least one of the identifiers discussed in Section 2.
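To make the data-collection step concrete, the following is a minimal, illustrative sketch of a single-page visit. It uses Playwright for Python rather than the authors' Puppeteer-based crawler (which is publicly available [35]); file and function names are ours, and only the 5-second post-load wait and the passive recording of requests and cookies mirror the description above.

```python
# Minimal sketch of a single-page visit, shown with Playwright for Python rather
# than the authors' Puppeteer-based crawler (publicly available [35]). File and
# function names are illustrative; the 5-second post-load wait and the passive
# recording of requests and cookies mirror the description in the text.
import json
from playwright.sync_api import sync_playwright

def crawl_landing_page(url, out_prefix):
    requests_seen = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context()            # clean state for every website
        page = context.new_page()
        # Record outgoing requests passively, without intercepting them.
        page.on("request", lambda req: requests_seen.append(req.url))
        page.goto(url, wait_until="load", timeout=60_000)
        page.wait_for_timeout(5_000)               # let RTB-delivered ads render
        html = page.content()                      # page HTML after dynamic changes
        cookies = context.cookies()                # cookie jar for this visit
        browser.close()
    with open(f"{out_prefix}.html", "w", encoding="utf-8") as f:
        f.write(html)
    with open(f"{out_prefix}.meta.json", "w", encoding="utf-8") as f:
        json.dump({"requests": requests_seen, "cookies": cookies}, f)

# Example: crawl_landing_page("https://example.com", "data/example.com")
```

Note that this sketch captures only the top-level document; the authors additionally use the Chrome DevTools Protocol to also capture documents, styles and code fetched by iframes and code snippets.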
Ethical concerns regarding the crawling process and collected data are addressed in Appendix A.

We detect the Google identifiers described in Section 2 by performing an offline analysis of the collected data. Specifically, using regular expressions, we search for these identifiers inside the page content, the HTTP(S) requests and the stored cookies. Then, we remove false positives using a combination of data-filtering techniques. First, using the dictionary of GNU Aspell [39], an open-source spell-checking tool, we remove values that match the regular expressions but whose suffix is an English dictionary word (e.g., G-BACKPACK). Using this technique, we were able to remove ∼1,500 distinct false-positive values, which were found in ∼5,000 unique websites. Then, we remove further false positives using a list of common keywords. This list was generated by manually inspecting over 10,000 values that satisfy the regular expressions and investigating whether they are actually used as identifiers. Our keyword list contains over 1,250 values that were filtered out (e.g., G-APRIL2020).

As shown in Table 1, we find that ∼10% of the most popular websites monetize their content through an AdSense account (i.e., a Publisher ID), ∼52% use Google Analytics to track their traffic, and ∼20% use Google Tag Manager for easier management of code snippets. Moreover, for some services, we observe that there are more domains than publisher-specific IDs. This suggests that some identifiers are being re-used in more than one website. Additionally, we examine the source of information for each type of identifier. Specifically, we investigate whether the identifiers can be found in the HTML code of a website, in its outgoing network traffic, or in the cookies set by either the first party or various third parties. As shown in Table 1, regardless of the type of identifier, the majority of them can be found in both the HTML code of the website and the HTTP(S) requests. This result is in line with the official guidelines for using Tags [8]. It indicates that the detected identifiers are not only valid but also in use, since they are sent to the respective Google services. Finally, we find that only Tracking IDs are commonly found in cookies.

Using the detected publisher-specific IDs, we construct a bipartite graph for each of the respective types of identifiers. In these graphs, the nodes are either websites or identifiers. Whenever a website contains an identifier, we introduce a directed edge from the respective website node to the respective identifier node. Tracking IDs and Measurement IDs are placed into the same graph, since they represent the same service and have similar functionality. Thus, we create three bipartite graphs. Moreover, for Tracking IDs, we focus only on the prefix, which refers to the account number, as discussed in Section 2.3. Figure 1 illustrates an example of a small Publisher ID bipartite graph. We also form a metagraph based on the three bipartite graphs of the different publisher-specific IDs. The metagraph contains nodes only for websites and represents the relationships between websites. Whenever two websites share an identifier, we introduce an undirected meta-edge between the two respective website nodes. The more identifiers two websites share, the greater the weight of the connecting edge. Each shared identifier increases the weight of the meta-edge by 1/N, where N is the total number of distinct identifiers of that type found in more than one website.
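As an illustration of the identifier extraction and meta-edge weighting just described, here is a minimal sketch in Python. The regular expressions only approximate the ID formats of Section 2 (the exact patterns used in the paper are not reproduced here), the english_words set stands in for the GNU Aspell dictionary, and the keyword list is abbreviated; all names are ours.

```python
# Illustrative sketch of the identifier extraction and meta-graph construction.
# The regular expressions only approximate the ID formats of Section 2,
# english_words stands in for the GNU Aspell dictionary, and the keyword list
# is abbreviated; all names are illustrative.
import re
from collections import defaultdict
from itertools import combinations
import networkx as nx

PATTERNS = {
    "publisher": re.compile(r"pub-\d{15,16}"),
    "container": re.compile(r"GTM-[A-Z0-9]{4,8}"),
    "analytics": re.compile(r"UA-\d{4,10}(?=-\d)|G-[A-Z0-9]{4,12}"),  # UA prefix only
}
KEYWORDS = {"G-APRIL2020"}       # manually curated false positives (excerpt)
english_words = {"BACKPACK"}     # dictionary words matching an ID suffix (excerpt)

def extract_ids(text):
    """Return {id_type: set of IDs} found in HTML, requests or cookie values."""
    found = defaultdict(set)
    for id_type, pattern in PATTERNS.items():
        for value in pattern.findall(text):
            suffix = value.split("-", 1)[1]
            if value in KEYWORDS or suffix in english_words:
                continue         # drop dictionary words and known keywords
            found[id_type].add(value)
    return found

def build_metagraph(site_ids):
    """site_ids: {website: {id_type: set of IDs}} -> weighted website metagraph."""
    sites_per_id = defaultdict(set)
    for site, per_type in site_ids.items():
        for id_type, ids in per_type.items():
            for identifier in ids:
                sites_per_id[(id_type, identifier)].add(site)
    # Only identifiers shared by more than one website create meta-edges.
    shared = {key: sites for key, sites in sites_per_id.items() if len(sites) > 1}
    # N per type: number of distinct shared identifiers of that type.
    n_per_type = defaultdict(int)
    for id_type, _ in shared:
        n_per_type[id_type] += 1
    meta = nx.Graph()
    for (id_type, _), sites in shared.items():
        for a, b in combinations(sorted(sites), 2):
            weight = meta.get_edge_data(a, b, default={"weight": 0.0})["weight"]
            meta.add_edge(a, b, weight=weight + 1.0 / n_per_type[id_type])
    return meta
```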
A larger edge weight between two websites implies greater confidence that they are indeed operated and monetized by the same entity. Figure 2 illustrates an example of how the metagraph is constructed. The code to construct both the bipartite graphs and the metagraph is publicly available [36].

We hypothesize that the metagraph constructed by following the steps above can lead to clusters of websites operated by the same entity. Meta-edges reflect the actual relationships between websites; thus, the websites operated by the same entity should form a strongly connected community. Since the metagraph combines information from multiple services (i.e., bipartite graphs), it provides us with greater confidence about the actual relationships between websites. We find that there are some outlying cases where communities consist of thousands of websites. After manual investigation, we conclude that these communities are formed due to intermediary publishing partners. These publishers provide services for content creators to monetize their content or improve their website traffic, and require that websites integrate the partner's identifiers. To focus our analysis on an enhanced level of granularity, we ignore such publishers for the time being. This allows us to study more detailed cases of website administrators and results in a metagraph that contains ∼127,000 nodes and ∼2,885,000 edges. We discuss intermediary publishing partners in later sections.

To find the websites operated and monetized by the same entity with high confidence, we perform edge pruning, thus removing noise. Specifically, we remove edges that do not belong to the top 5% when ranked by weight. We choose this threshold based on empirical analysis. This way, we ensure that there are limited false positives, i.e., websites that are wrongfully added to a community because of a typographical error in their source code or because of older identifiers. After the edge pruning, dangling nodes are also removed from the graph, as they do not provide any additional information. To further explore this graph, we execute the Girvan-Newman community detection algorithm [23]. In fact, we compare our methodology with the one described in [6], where the authors apply the Louvain method [5] to the connected components extracted from their bipartite graph. Specifically, we manually examine and evaluate 40 distinct communities, consisting of 215 unique websites, which can be found through both methodologies. We find that applying the Girvan-Newman algorithm after edge pruning results in better communities in 42.5% of the cases, in exactly the same communities in 37.5% of the cases, and in worse communities in 20% of the cases. Consequently, the Girvan-Newman algorithm produces higher-quality communities at the expense of a higher computational cost. Finally, we compare the communities our methodology detects with the (publicly available) communities detected in [6]. We acknowledge that this comparison is difficult, because the two studies focus on different websites, in different time periods (i.e., five years apart: a big portion of those websites are no longer active and cannot be evaluated), and in a very dynamic environment like the Web. We compare only communities whose websites have all been crawled in our 1MT crawl.
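Before presenting the comparison results, here is a minimal sketch of the edge-pruning and community-detection step described above, using networkx. The 95th-percentile weight threshold follows the text; taking only the first Girvan-Newman split per connected component is our simplification, since the exact stopping criterion is not reproduced here.

```python
# Sketch of the edge-pruning and Girvan-Newman step described in the text.
# Variable names and the single-split stopping rule are illustrative.
import numpy as np
import networkx as nx
from networkx.algorithms.community import girvan_newman

def prune_and_cluster(meta, keep_top=0.05):
    weights = np.array([d["weight"] for _, _, d in meta.edges(data=True)])
    threshold = np.quantile(weights, 1.0 - keep_top)      # keep top 5% of edges
    pruned = nx.Graph()
    pruned.add_weighted_edges_from(
        (u, v, d["weight"])
        for u, v, d in meta.edges(data=True)
        if d["weight"] >= threshold
    )
    # Dangling nodes never enter `pruned`, since only surviving edges are added.
    communities = []
    for component in nx.connected_components(pruned):
        sub = pruned.subgraph(component)
        if sub.number_of_nodes() <= 2:
            communities.append(set(component))
            continue
        first_split = next(girvan_newman(sub))             # first partition only
        communities.extend(set(c) for c in first_split)
    return communities
```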
In total, we manually evaluated and found 12 communities with results similar to ours and 15 cases where the methodology of [6] fails and places websites operated by the same legal entity into different communities.

Using our methodology, we detect 2,369 communities, formed by ∼11,000 distinct websites. The distribution of community sizes (Figure 3) shows that the majority of them are small (i.e., fewer than 6 websites). Indeed, 61% of the communities are pairs of websites, indicating that the median publisher operates just 2 websites.

First, we study how organizations or publishers use the identifiers necessary for Google services. For this, we measure the number of unique identifiers found in each website of our dataset. In Figure 4, we plot the portion of websites in our data that contain a certain number of identifiers. Around 82-83% of the websites contain only one Container ID or Analytics ID, and about 90% of websites have only one Publisher ID. This indicates that the majority of organizations prefer to use the simplest and most straightforward configuration of services in their websites, where they use a single identifier to achieve their goal, be it monetization or traffic measurement. Most importantly, in the case of Publisher IDs, it indicates that the majority of websites have a single contributing author and that revenue is not shared. This contrasts with the small portion of websites (less than 3.2% in all cases) that have 3 or more identifiers, which indicates multiple collaborating authors, each with their own Publisher ID, contributing to a website. Surprisingly, we see a small number of websites with an extremely large number of identifiers. For instance, we find that prykoly.ru contains 94 Tracking IDs, while www.pps.net, a website for public schools in Portland, contains 88 IDs hard-coded in its JavaScript code, with the correct identifier selected based on the page that is visited. To further investigate this abnormal behavior, we look up these websites in the VirusTotal [30] and Sucuri [24] security services for malicious content. Sucuri reports [25] that prykoly.ru contains known JavaScript malware associated with a back-link purchase service called Sape. For pps.net, VirusTotal reports [31] that there are 4 detected files that communicate with this domain. In total, we find 67 distinct websites with over 40 publisher-specific IDs of any type. This preliminary analysis suggests that numerous publisher-specific IDs in a website might imply abnormal or even malicious behavior. This observation, though interesting, is considered out of scope for this work and is left for future research.

Next, we explore the number of websites that publishers manage and monetize. For each publisher, we measure the number of websites in which they place their publisher-specific IDs. The analysis from now on is performed on the distinct domains of landing pages, and not on the distinct domains of the Tranco list. Specifically, if two different domains in the Tranco list redirect to the same domain, we count this website only once towards the size of the respective publisher. In Figure 5, we plot, in descending order, the number of websites monetized by each unique Publisher ID in our data. We show that the great majority of publishers (up to 87.8%) monetize traffic from a single site. On the other hand, we find 340 publishers monetizing traffic from more than 10 websites each. We observe some "mega-publishers" that can be found in hundreds or even thousands of websites.
Indeed, the top 10 publishers in our data can be found in a total of more than 4,200 websites. We observe similar behavior for Container IDs and Analytics IDs, where we find that the top 10 identifiers can be found in a total of 4,245 and 6,795 websites, respectively. To verify this finding, we explore the connectivity of the three bipartite graphs (described in Section 3.3) and generate a list of connected components in each graph. In Figure 6, we plot, in decreasing order and on a log-log scale, the number of nodes in each connected component of the Publisher ID bipartite graph. We see that the distribution of connected component sizes can be fitted by a power law with a cutoff, an anomaly due to the intermediary publishers mentioned earlier. By applying appropriate statistical tests [16], we find that the distribution is indeed heavy-tailed, with the power law being a better fit than the exponential distribution (log-likelihood ratio test with p = 1.6 × 10⁻⁸). We find similar results for the Container ID and Analytics ID bipartite graphs but exclude them for brevity. This verifies our finding that there are only a few publishers monetizing traffic from a very large number of websites, while the majority of publishers operate one website. We attribute this behavior to the existence of intermediary publishing partners [26]. These are third-party services that enable content creators to readily deliver their content, effortlessly monetize it, optimize their revenue and deliver better experiences to users. Publishers are required to integrate the service's identifiers in their websites so that the service can monitor traffic and user behavior, and deliver ads. By examining requests towards third-party services, we successfully identified multiple such "mega-publishers", including Ezoic, optAd360, Blogger and ProjectAgora. Specifically, Blogger provides publishers with AdSense gadgets, which can be used to display ads in a blog without taking any percentage of the earnings [17]. At the time of writing, the PublicWWW service [40] reports that Blogger's Publisher ID can be found in more than 364,000 websites.

Next, we explore whether there is an association between the popularity of websites and the size of publishers. First, we group together websites that share the same Publisher ID, meaning that there is a single account responsible for their monetization. For each such account (i.e., publisher), we compute its popularity as the average rank of the websites it operates, based on the Tranco list. In Figure 7, we plot this average popularity of publishers against publisher size. We show that the average website popularity (y-axis) increases (i.e., the Tranco rank decreases) as the number of monetized websites increases (x-axis). The average popularity values have also been fitted with a straight line (the slope appears negative because the y-axis is reversed, so that higher means more popular), indicating a clear trend with R² = 0.28. We observe similar behavior when plotting the median popularity, indicating that there is no skewness in the distribution. As a result, independent publishers, who generate revenue from a single website, tend to monetize less popular websites. On the other hand, publishers that manage multiple websites usually manage the most popular ones.
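As a side note on the statistical test mentioned above, this kind of heavy-tail comparison can be run with the powerlaw Python package, which implements the Clauset et al. [16] methodology. The sketch below uses placeholder data and is not necessarily the authors' exact procedure.

```python
# Sketch of the heavy-tail test referenced in the text, using the `powerlaw`
# package (an implementation of the Clauset et al. [16] methodology). The data
# below are placeholders for the connected-component sizes of the Publisher ID
# bipartite graph.
import powerlaw

component_sizes = [1, 1, 2, 2, 3, 5, 8, 13, 40, 350, 4200]   # placeholder values

fit = powerlaw.Fit(component_sizes, discrete=True)
# A positive ratio R with a small p-value favours the power law over the exponential.
R, p = fit.distribution_compare("power_law", "exponential")
print(f"alpha={fit.power_law.alpha:.2f}  xmin={fit.power_law.xmin}  R={R:.2f}  p={p:.2g}")
```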
Consequently, in a classic case of the rich getting richer, big publishers who operate dozens of websites not only claim a bigger share of the market and generate bigger revenue, but are also able to improve their reputation and attract more visitors. The increased popularity of some websites can also be credited to the intermediary publishers mentioned earlier.

Manual inspection of communities (Section 3.5) revealed that most operators tend to manage websites with similar content. To investigate this further, we retrieve (where available) from SimilarWeb [32] the category of each website with a Publisher ID in our dataset. Figure 8 illustrates the distribution of the categories we retrieved for over 23,000 websites with a Publisher ID. Websites with no category information are excluded from this analysis. We see a preference towards "News and Media" websites (24.5%), followed by websites related to "Computers, Electronics and Technology" (18.6%), "Arts" (11%) and "Science" (8%). Next, we investigate whether there is a preference in the types of websites a publisher monetizes, or whether their portfolio is usually random. As a first step, we perform a Poisson sampling experiment to construct a scenario where publishers monetize websites based on their categories' measured popularity in the data. For this sampling, we perform the following steps. For a given publisher size (i.e., the number of websites they operate and monetize), we randomly select websites from our data. For example, if the size of a publisher is 10, we randomly select 10 websites from the 23K websites with a category. However, this selection is biased, based on the prior probability of a website appearing due to its category. Thus, for example, a "News and Media" website has almost a 1/4 chance of being selected for any publisher. We perform this process for all publishers. Then, for each publisher, we compute the number of unique website categories they have in their control (i.e., the richness). Figure 9 plots the richness distribution of the observed (or "actual") data and of the Poisson-sampling experiment data. The y = x line represents the case of a uniform distribution, where all of a publisher's websites come from different categories with equal probability. We see that the average number of website categories in our actual data is lower than a purely probabilistic choice, for every possible publisher size. This actively demonstrates a preferential administration of websites when it comes to their category. Thus, publishers tend to monetize websites with similar types of content.

To verify this hypothesis, we also utilize Shannon's diversity index [44]. Shannon's diversity index is a statistical measure that provides information about the composition of a community. It is defined as H′ = −Σᵢ pᵢ ln pᵢ, where the sum runs over the different categories in the dataset (i.e., the richness) and pᵢ is the proportion of websites belonging to category i. The maximum value of the diversity index is ln(k), where k stands for the number of distinct website categories in a community. This maximum represents the case where all categories are equally common inside a cluster of websites operated by the same publisher. As we can see in Figure 10, Shannon's diversity index for the websites in our data is much smaller than the maximum value and is closer to zero. A smaller diversity index corresponds to a more unequal composition of the community.
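To make the diversity measure concrete, the sketch below computes H′ for a single publisher and compares it with the maximum ln(k); the function name and example data are ours.

```python
# Sketch of the per-publisher diversity computation described in the text.
# The function name and example data are illustrative.
import math
from collections import Counter

def shannon_diversity(categories):
    """H' = -sum(p_i * ln p_i) over the website categories of one publisher."""
    counts = Counter(categories)
    total = sum(counts.values())
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    h_max = math.log(len(counts))        # ln(k), k = distinct categories present
    return h, h_max

# Example: a publisher with 5 news sites and 1 science site sits well below h_max.
h, h_max = shannon_diversity(["News and Media"] * 5 + ["Science"])
print(f"H' = {h:.3f} vs ln(k) = {h_max:.3f}")
```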
We conclude that there is indeed preferential administration or monetization, and that publishers tend to acquire new websites of the same type as the ones they already manage.

We perform a historical analysis of the last two years (April 2019 to April 2021), on a trimester basis, resulting in 9 snapshots. We use the entire dataset of HTTPArchive [2] and do not limit ourselves to the websites found in the Tranco list. In Section 3.2, we show that HTTP requests are a reliable source for finding the identifiers of interest. Thus, we examine the HTTP(S) requests of the websites in these snapshots and detect the Publisher IDs and Tracking IDs embedded in them. Figure 11 illustrates our findings for the total number of websites crawled per trimester snapshot. We find that, on average, 9.9% of websites monetize their content using Google AdSense Publisher IDs, while around 64% use Google Analytics in order to track and measure their traffic. These trends are stable over both the snapshots and the sample size, with standard deviations of only 0.7 and 1.81, respectively. These results, computed on 8× more websites than the 1MT crawl (up to ∼8M websites in 2021), are in line with our earlier findings described in Section 3, and lend credence to our analysis as being representative of the general Web.

Next, we study how many publishers contribute to the content of a website. In Figure 12, we plot the portion of websites that contain one, two, or three or more Publisher IDs for each time period. On the secondary y-axis, we plot the average number of distinct Publisher IDs in each website and observe that it is almost constant across time, with a mean value of 1.11 and a standard deviation of only 0.015. We also find that, on average, 88.75% of the websites have a single contributing publisher that generates revenue. Finally, we find a very small number of websites (less than 1% in all snapshots) that contain identifiers of 3 or more publishers. These cases are due either to an intermediary publishing partner, or to websites running under the partnership of various authors or authorized external authors. Overall, these numbers match our earlier in-depth analysis using the 1MT crawl of April 2021.

Next, we explore how the market of publishers has changed in the last couple of years and, specifically, how intermediary publishing partners have grown. First, we study how websites behave with regard to their Publisher IDs and detect changes in these identifiers. We perform our analysis on the websites that have been crawled in all snapshots (i.e., their intersection) and contain at least one Publisher ID. There are over 191,000 such websites. For each time interval and for each website, we compare the detected identifiers in the previous and the next snapshot. We ignore websites that made no change in their Publisher IDs. For websites that do not contain exactly the same identifiers across two consecutive snapshots, we compare the size of their publishers. Specifically, we define the old community size of a website as the maximum size of its publishers, detected in the first snapshot for that website. Respectively, we retrieve the new community size from the second snapshot. Note that the size of a publisher for a specific snapshot is computed across all websites in the snapshot, and not only across the common websites. If the new community size of a website is greater than the old size, then we conclude that the website moved to a bigger publisher.
If the old size is greater, the website moved to a smaller publisher, while if the size is the same, the publisher made an insignificant change. Such a change might be the addition or removal of a secondary contributing author, the move to a different AdSense account, etc. The results of this analysis can be seen in Figure 13. We observe that the majority of websites made no changes to their Publisher IDs. Indeed, over 96% of websites have consistent and stable behavior and do not change their monetization scheme. In contrast, we find that, on average, 3.35% of websites made a change in the Publisher IDs they contain. In Figure 14, we show a linear regression model for the different cases of Figure 13. Interestingly, we find that both the case where a website moved to a bigger publisher and the case where a website made an insignificant change in its Publisher IDs have a negative slope. In contrast, the case where websites move to a smaller publisher is the only one with a positive slope. This suggests that there is a tendency towards decentralization, meaning that websites are inclined to move away from big intermediary publishers. To test this hypothesis, we plot in Figure 15 the total number of websites operated by the 10 most popular publishers in each snapshot, along with the total number of websites operated by the top 10 publishers present in all snapshots. We can see that there is a constant decrease in the number of websites that these "mega-publishers" manage. Indeed, in just a 2-year span, "mega-publishers" lost approximately 25% of the websites they manage. Interestingly, this decrease in managed websites is observed even though the number of crawled websites has increased over the years (as shown in Figure 11).

Next, we explore how this market of big publishers has changed over time. We characterize as Small the publishers with up to 10 websites, as Medium those that monetize 11 to 50 websites, as Large those that monetize 51 to 100, and as Mega those that monetize more than 100 websites. In Figure 16, we plot the population of these classes, i.e., the number of such publishers, and we also fit each data subset with a straight line. As we can see, the number of Small publishers has greatly increased over the years (∼15K new Small publishers per trimester), which is expected, with increasing ad revenues motivating new independent publishers to monetize their content. We also observe that there is an increase in the number of Medium and Large publishers (∼29 new Medium and ∼5 new Large publishers per trimester), while Mega publishers are the only class that shrinks over time (∼2 Mega publishers lost per trimester). This is evident in the negative slope of the fitted straight line. This behavior attests to the fact that the market of intermediary publishing partners is flourishing and that new such services have emerged during the last couple of years. These services provide a new platform for independent content creators to generate revenue, and lure clients away from Mega publishers. It is evident that these new services seek their share of a competitive but highly profitable market.

During our manual analysis, we found a lot of communities that were not only operated but also owned by the same legal entity.
To better understand the utility of the metagraph in identifying websites owned by the same legal entity, we manually examine detected communities. Overall, we find communities belonging to the news and media sector, music and entertainment, as well as manufacturing and other industrial applications. As an example, we detect a community of websites owned by Koninklijke Philips N.V. We find 45 official websites, each with a different country-code top-level domain (ccTLD), all of which belong to the same company but serve clients in different countries. Another community in our dataset is a cluster of 73 news websites, all serving news content under the .au ccTLD. In their privacy policy, these websites mention that they are published by a subsidiary or related body corporate of Rural Press Pty Ltd and, in their footer, they declare that they are operated by Australian Community Media & Printing. Australian Community Media is a media company operating over 160 regional publications and targets a vast audience in multiple geographic locations. One of the biggest detected communities is related to the music industry (i.e., over 140 websites of popular singers or music bands). To our surprise, these websites are owned by various companies, including Atlantic Records, Elektra Records, Warner Records and Nonesuch Records. By observing the copyright notification and the privacy notices of these websites, we find that all of them are subsidiaries of a single multinational conglomerate, Warner Music Group [27]. This is clear proof that our methodology is able to overcome the barriers of business organization and subsidiaries, and detect ownership at the highest level of the hierarchy. Finally, we find two communities of websites related to public entertainment and information. We detect a community of 111 radio websites operated by Townsquare Media, Inc., a US-based radio network and media company that owns hundreds of local terrestrial radio stations [34]. We also detect a community of 76 websites which explicitly state in their copyright claim that they are owned by Gray Television, Inc., an American television broadcasting company. These examples, and many more not analyzed here due to space constraints, demonstrate the efficacy of our methodology in detecting the co-ownership status of websites by organizations that monetize them in a collective fashion. Overall, we find and report 112 distinct communities of various sizes, consisting of over 1,280 websites. For each community, we report its size, the websites that compose it, as well as the legal entity that owns the respective websites. We manually visited, evaluated and verified all of these websites and make our results publicly available [36]. We report some of the largest communities, along with their size (i.e., number of websites) and some indicative websites as examples, in Appendix B.

The ecosystem of digital advertising and analytics has motivated a lot of studies that aim to reverse engineer it (e.g., [7, 20, 37, 38]). In [22], the authors studied the advertising ecosystem and services provided by Google, including AdSense, and focused on how revenues are generated across aggregators. In [33], the authors presented an automated tool to de-anonymize Tor hidden services, using information like Google Analytics and AdSense IDs to disclose the server's IP. Their analysis is limited with regard to publisher-specific IDs, since they only extract 24 unique Analytics IDs and 3 Publisher IDs. Similarly, in [50], Yoon et al.
studied phishing threats on the Dark Web by trying to obtain the identity of the owners operating such websites. Using the technique of [33], they extracted 276 Analytics IDs and 1,171 Publisher IDs. In [47], Starov et al. analyzed identifiers of multiple analytics services to bundle websites together and discover malicious websites and campaigns. With a focus on malicious content, they identified 7,945 Analytics IDs and 278 Container IDs and, contrary to our work, they did not consider Publisher IDs or Measurement IDs. In [42], the authors outlined how Google Analytics IDs can be used in digital forensics investigations to unmask online actors and lead to the entity that operates a cluster of websites, which can be an individual, an organization, or a media group. In [46], the authors associate organizations with domain names in an attempt to create a property graph of Internet infrastructures. To achieve this, they utilize X.509 certificates and extract the organization to which each certificate was issued. In [6], the authors argue that relying on such certificates is not effective and propose a methodology that revolves around the email addresses found in WHOIS records. Similar to our work, the authors build a bipartite graph and apply a community detection algorithm to extract clusters of domains owned by the same organization. Limitations of that methodology include the fact that many WHOIS records contain the email address of the registrar or the hosting provider instead (i.e., a WHOIS privacy service). Finally, our methodology provides an additional advantage in cases where websites are purchased by a new legal entity: new website owners or administrators have an incentive to update the publisher-specific IDs in their new websites in order to gain revenue or insight, whereas this does not apply to WHOIS records. In [4], Bashir et al. performed a study of the specification and adoption of ads.txt files during a 15-month period and clustered publishers serving identical ads.txt files. Similar to our work, they found that there is a large number of smaller clusters (i.e., fewer than 5 websites) but only a few big clusters with over 50 websites. Finally, the authors manually investigated the top clusters in their dataset and found that such clusters exist due to (i) shared media properties with a common owner, (ii) independent publishers, (iii) the use of the same platform to deliver content, or (iv) the use of consolidated SSP services.

In this work, we shed light on website administration by using bipartite graphs and exploiting the publisher-specific IDs that publishers embed in their websites in order to use third-party ad-related services. We studied various properties induced by these graphs, reflecting important characteristics of administration, such as portfolio size, popularity, etc., and we identified power-law patterns of website administration, as well as indications of preferential monetization in the type of controlled websites. We studied the use of such publisher-specific IDs across time and showed how the market of intermediary publishing partners has boomed in the last few years. We showed that our methodology can be used to detect ownership on the Web and even overcome company organization barriers (i.e., subsidiary companies).

Our methodology is based on detecting publisher-specific IDs using regular expressions. However, there are cases where alphanumeric values might match these regular expressions without being actual identifiers.
While we apply various techniques to limit these false positives (Section 3.2), we acknowledge that there might be cases that we miss. Additionally, our study focuses on publisher-specific IDs related to services offered by Google, one of the biggest players in the advertising and analytics ecosystem. Even though the analysis of Google services provides good coverage of the real world, there are several other ad networks and analytics services that could be studied. Finally, we acknowledge that our analysis of website categories (Section 4.4) relies on SimilarWeb, which might be prone to errors or subjective bias.

We believe our graph methodology and analysis are a powerful tool for web and privacy measurements that aim to understand the context, nature and activity of websites, as well as the possible leverage or political agendas behind their administration. In fact, our proposed technique can help researchers, journalists, and even individual users to better understand popular websites and the entities that control and monetize them. Furthermore, our preliminary analysis shows that outlier websites in the bipartite graphs yielded by our method may reveal anomalous or even malicious behavior, suggesting that our methodology can be used to discover malicious actors without even examining their published content. Also, ad networks can make use of our technique to detect fraudulent or fake-news-related website administrators that may violate their ad campaign policies. Altogether, we believe that our method can help improve the safety and health of the Web ecosystem at large.

The execution of this work has followed the principles and guidelines of how to perform ethical information research and use shared measurement data [19, 41]. In particular, this study paid attention to the following dimensions. We keep our crawling to a minimum to ensure that we do not slow down or deteriorate the performance of any web service in any way. Therefore, we crawl only the landing page of each website and visit it only once. We do not interact with any component of the visited website, and we only passively observe network traffic. In addition, our crawler has been implemented to wait both for the website to fully load and for an extra period of time before visiting another website. Consequently, we emulate the behavior of a normal user who stumbled upon a website. In accordance with the GDPR and ePrivacy regulations, we did not engage in the collection of data from real users. Also, we do not share with any other entity any data collected by our crawler. Our analysis is, to a large extent, based on public historical data (e.g., the HTTPArchive project). Moreover, we ensure that the privacy of publishers and administrators is not invaded. We do not collect any of their information (e.g., email addresses) and only discuss publishers who explicitly and voluntarily disclose their identity on their websites, as we did in Section 6. Last but not least, we intentionally do not make our 1MT crawl dataset public, to ensure that there is no infringement of copyrighted material.

We manually examined communities and tried to determine the legal entity operating or even owning all websites in each community. We report some of the largest of these communities in Table 2, along with their size (i.e., number of websites) and some indicative websites as examples.

Table 2: Communities detected using the proposed methodology. Through manual investigation, we determine the legal entity behind these websites.
Description | Size | Websites
MinuteMedia | 172 | showsnob.com, sodomojo.com, sportdfw.com, thejetpress.com, reignoftroy.com, 90min.de
Australian Community Media | 73 | thesenior
Postmedia Network Canada Corp | 58 | lfpress.com
Philips | 45 | usa.philips.com, philips.com.br, philips.com.mx, philips.com.pk, philips.cz, philips.ru, philips.pl

This project received funding from the EU H2020 Research and Innovation programme under grant agreements No 830927 (Concordia), No 830929 (CyberSec4Europe), No 871370 (Pimcity) and No 871793 (Accordion). These results reflect only the authors' view and the Commission is not responsible for any use that may be made of the information it contains.

Open-Source Information Reveals Pro-Kremlin Web Campaign
Think You Can Hide
A Longitudinal Analysis of the Ads.Txt Standard
Fast unfolding of communities in large networks
Measurement and Analysis of Private Key Sharing in the HTTPS Ecosystem
I Always Feel like Somebody's Watching Me: Measuring Online Behavioural Advertising
Google Help Center. 2021. About Tags
Google Help Center. 2021. Ad placement policies
Google Help Center. 2021. Manage user access to your account
Google Help Center. 2021. Organize your containers
Google Help Center. 2021. Revenue share
Google Help Center. 2021. Setup and install Tag Manager
Google Help Center. 2021. Tracking ID and property number
Google Help Center. 2021. Universal Analytics property
Power-law distributions in empirical data
What is the adsenseHostId on blog
Worldwide Digital Ad Spending 2021
The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research
Online Tracking: A 1-Million-Site Measurement and Analysis
Google Analytics and Google Tag Manager. ALA TechSource
Best Paper - Follow the Money: Understanding Economics of Online Aggregation and Advertising
Community structure in social and biological networks
Sucuri - Free website security check and malware scanner
Sucuri Report - prykoly
Certified Publishing Partner
Services - Recorded Music
Subprime Attention Crisis: Advertising and the Time Bomb at the Heart of the Internet. FSG Originals x Logic, Farrar, Straus and Giroux
IAB Technology Laboratory. 2021. IAB Releases Internet Advertising Revenue Report for 2020
Chronicle Security Ireland Limited. VirusTotal - Analyze suspicious files and URLs to detect types of malware, automatically share them with the security community
VirusTotal Report - pps
Website Traffic - Check and Analyze Any Website
CARONTE: Detecting Location Leaks for Deanonymizing Tor Hidden Services
Townsquare Media. 2021. Digital Media and Radio Advertising Company
Emmanouil Papadogiannakis. 2021. Scrape Titan
Website Administration Graphs
WWW '18. International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva
If You Are Not Paying for It, You Are the Product: How Much Do Advertisers Pay to Reach You
GNU Aspell - Free and Open Source spell checker
Ethical research standards in a world of big data
Digital Forensics: Repurposing Google Analytics IDs
VERA FILES FACT CHECK YEARENDER: Ads reveal links between websites producing fake news
A mathematical theory of communication. The Bell System Technical Journal
Inside The Partisan Fight For Your News Feed
WWW '17. International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva
Betrayed by Your Dashboard: Discovering Malicious Campaigns via Web Analytics
The Chromium Authors. 2014. Chrome DevTools Protocol
Tranco list with the 1M top sites generated on 14
DoppelgäNgers on the Dark Web: A Large-Scale Assessment on Phishing Hidden Web Services