Scaling up Search Engine Audits: Practical Insights for Algorithm Auditing
Roberto Ulloa, Mykola Makhortykh, Aleksandra Urman
2021-06-10

Abstract. Algorithm audits have increased in recent years due to a growing need to independently assess the performance of automatically curated services that process, filter, and rank the large and dynamic amount of information available on the internet. Among several methodologies to perform such audits, virtual agents stand out because they offer the ability to perform systematic experiments, simulating human behaviour without the associated costs of recruiting participants. Motivated by the importance of research transparency and the replicability of results, this paper focuses on the challenges of such an approach. It provides methodological details, recommendations, lessons learned, and limitations based on our experience of setting up experiments for eight search engines (including main, news, image and video sections) with hundreds of virtual agents placed in different regions. We demonstrate the successful performance of our research infrastructure across multiple data collections, with diverse experimental designs, and point to different changes and strategies that improve the quality of the method. We conclude that virtual agents are a promising avenue for monitoring the performance of algorithms over long periods of time, and we hope that this paper can serve as a basis for further research in this area.

The high and constantly growing volume of internet content creates a demand for automated mechanisms that help to process and curate information. However, by doing so, the resulting information filtering and ranking algorithms can steer individuals' beliefs and decisions in undesired directions [1-3].
At the same time, the dependency that society has developed on these algorithms, together with the lack of transparency of the companies that control them, has increased the need for algorithmic auditing, the "process of investigating the functionality and impact of decision-making algorithms" [4]. A recent literature review on the subject identified 62 articles since 2012, 50 of them published between 2017 and 2020, indicating a growing interest in this method from the research community [5]. Among the most studied platforms are web search engines (almost half of the auditing works reviewed by Bandy (2021) focused on Google alone), as a plethora of concerns have been raised about representation, biases, copyrights, transparency, and discrepancies in their outputs. Research has analyzed issues in areas such as elections [6-11], filter bubbles [12-17], personalized results [18, 19], gender and race biases [20-22], health [23-25], source concentration [10, 26-29], misinformation [30], historical information [31, 32] and dependency on user-generated content [33, 34].

Several methodologies are used to gather data for algorithmic auditing. The data collection approaches range from expert interviews to Application Programming Interfaces (APIs) to data donations to virtual agents. The latter refers to programs (scripts or routines) that simulate user behaviour to generate data outputs from other systems. In a review of auditing methodologies [35], the use of virtual agents (referred to as agent-based testing) stands out as a promising approach to overcome several limitations in terms of applicability, reliability, and external validity of audits, as it allows researchers to systematically design experiments by simulating human behaviour in a controlled environment. Around 20% (12 of 62) of the works listed by Bandy (2021) use this method.
Although some of the studies share their programming code to enable the reproducibility of their approach [18, 24, 36], the methodological details are only briefly summarized, and the challenges involved in implementing the agents and the architecture that surrounds them are often overlooked. Thus, the motivation for this paper is to be transparent to the research community, allow for the replicability of our results, and transfer knowledge about lessons learned in the process of auditing algorithms with virtual agents.

Out-of-the-box solutions to collect data for algorithmic auditing do not exist because websites (and their HTML) evolve rapidly, so data collection tools require constant updates and maintenance to adjust to these changes. The closest option to such an out-of-the-box solution is provided by Haim (2020), but even there the researcher is responsible for creating the necessary "recipes" for the platforms they audit, and these recipes will almost certainly break as platforms evolve. Therefore, this work focuses on the considerations, challenges, and potential pitfalls of implementing an infrastructure that systematically collects large volumes of data through virtual agents interacting with a specific class of platforms (search engines), to give other researchers a broader perspective on the method.

To evaluate the performance of our method, we pose two research questions:

RQ1: How are (a) data coverage and (b) effective size affected when audits are applied at a large scale?

RQ2: What practical challenges emerge when scaling up search engine audits?

Our results demonstrate the success of our approach: we often achieve near-perfect coverage and consistently collect above 80% of results. Additionally, the overall effective size of the collection is above 95%, and we retrieved the exact number of pages in more than 92% of the cases.
We use those "exact" cases to provide size estimates of search results, and we demonstrate that they can be used to successfully calculate data collection sizes using an out-of-sample approach. By providing disaggregated figures per collection and search section, we also show how strategic interventions improved coverage in later rounds. We provide a detailed methodological description of our approach, including the simulated behaviour, the configuration of environments, and the experimental designs for each collection. Additionally, we discuss the contingencies included in our approach to cope with (a) personalization, the adjustment of search results according to user characteristics such as location or browsing history, and (b) randomization, differences in the audited systems' outputs that emerge even under identical browsing conditions. Both issues can distort the process of auditing if not addressed properly. For example, we synchronize the search routines of the agents and utilize multiple machines and different IP addresses under the same conditions to capture the variance of unknown sources.

We focus on simulating user behaviour in controlled environments in which we manipulate several experimental factors: search query, language, type of search engine, browser preference and geographic location. We collect data from the text (also known as main or default results), news, image, and video search results of eight search engines representing the United States, Russia and China. Our main contributions are the presentation of a comprehensive methodology for systematically collecting search engine results at a large scale, as well as recommendations and lessons learned during this process, which could lead to the implementation of an infrastructure for long-term monitoring of search engines.
We demonstrate the successful performance of our research infrastructure across multiple data collections, and we provide average search section sizes that are useful to calculate the scale of future data collections.

The rest of this paper is organized as follows: Section 2 discusses related work in the field of algorithmic auditing, in particular studies that have used virtual agents for data collection. Section 3 presents our methodology in terms of the architecture and features of our agents, including detailed pitfalls and improvements at each stage of development. Section 4 presents the experiments corresponding to our data collections, and the response variables that are used to evaluate the performance of the method. Section 5 presents the results for each round of data collection according to browser, search engine and type of result (text, news, images and video). Section 6 discusses the achievements of our methodology, lessons learned from the last two years of research, and important considerations to successfully perform agent-based audits. Section 7 concludes with an invitation to scale search engine audits further with long-term monitoring infrastructures.

To date, the most common methodology for performing algorithm audits has been the use of APIs [5]. This approach is relatively simple because the researcher accesses clean data directly produced by the provider, avoiding the need to decode website layouts (represented in HTML). However, it ignores user behaviour (e.g., clicks, scrolls, loading times) as well as the environment in which that behaviour takes place (e.g., browser type, location). For example, in the case of search engines, it has been shown that APIs sometimes return different results than standard webpages [37], and that results are highly volatile in general [38]. An alternative to using APIs is to recruit participants and collect their browsing data by asking them to install software (i.e.
browser extensions) to directly observe their behaviour (e.g., Bodo et al., 2018; Möller et al., 2020; Puschmann, 2019; Robertson, Jiang, et al., 2018). Although this captures user behaviour in more realistic ways, it requires a diverse group of individuals who are willing to be tracked on the web and/or capable of installing tracking software on their machines [35, 41]. Additionally, it is difficult to systematically control for sources of variation, such as the exact time at which the data is accessed in the browser, and personalization factors, such as the agent's location. Compared with these two alternatives, virtual agents allow the flexibility to perform systematic experiments that include individual behaviour in realistic conditions, without the costs involved in recruiting human participants.

Several studies have used virtual agents to conduct audits of search engine performance against a variety of criteria. Feutz et al. [42] analyzed changes in search personalization based on the accumulation of data about user browsing behaviour. Mikians et al. [43] found evidence of price search discrimination using different virtual personas on Google. Hannak et al. [18] analyzed how search personalization on Google varied according to different demographics (e.g., age, gender), browsing history and geolocation, and found that only browsing history and geolocation significantly affected the personalization of results. A follow-up study extended the work of Hannak et al. [18] to assess the impact of location, finding that personalization of results grows as physical distance increases [19]. Haim et al. [24] performed experiments to examine whether suicide-related queries lead to a "filter bubble"; instead, they found that the decision to present Google's suicide-prevention result (SPR, with country-specific helpline information) was arbitrary (but persistent over time). In a follow-up study, Scherr et al.
[44] showed profound differences in the presence of the SPR between countries, languages and different search types (e.g., celebrity-suicide-related searches). Recently, virtual agents were used to measure the effects of randomization and the differences in non-personalized results between search engines for the "coronavirus" query in different languages [25] and for the 2020 U.S. Presidential Primary Elections [11].

News, image and video search results have also been subject to virtual agent-based auditing. Cozza et al. [13] found personalization effects for the recommendation section of Google News, but not for the general news section. In line with this, Haim et al. [14] found that only 2.5% of the overall sample of Google news results (N=1200) were exclusive to four constructed agents based on archetypical life standards and media usage. Image search results have been audited for queries related to migrant groups [45], mass atrocities [31] and artificial intelligence [22]. A video search audit found that results are concentrated on YouTube across five different search engines [29], and the predominance of YouTube has also been documented for the Google video carousel [46, 47]. Directly analyzing YouTube search results and Top 5 and Up-Next recommendations, Hussein et al. [30] showed personalization and "filter bubble" effects for misinformation topics after agents had developed a watch history on the platform.

Apart from search engine results, virtual agent auditing has been used to study gender, race and browsing history biases in news and Google advertising [36, 48], price discrimination in e-commerce, hotel reservation and car rental websites [49, 50], music personalization in Spotify [51, 52], and news recommendations in the New York Times [53]. To our knowledge, four previous works have provided their programming code to facilitate data collection [18, 24, 35, 36].
Two of these programming solutions are built on top of the PhantomJS framework, whose development has been suspended [18, 24]. Adfisher [36] specializes exclusively in Google Ads, and includes the automatic configuration of demographic information for the Google account, as well as statistical tests to find differences between groups. Haim [35] has provided a toolkit to set up a virtual agent architecture; the approach is generic, and the bots can be programmed with a list of commands to create "recipes" that target specific websites or services. We contribute to this set of solutions by providing the source code of our browser extension [54], which simulates search navigation on up to eight different search engines, including the text, news, image, and video categories.

The process of conducting algorithmic auditing, from our perspective, has two requirements: on the one hand, the user's information behaviour (e.g., browsing web search results) must be simulated appropriately; on the other, the data must be collected in a systematic way. Regarding the behaviour simulation, our methodology controls for factors that could affect the collection of web search results, so that they are comparable within and across search engines. We focus on the use case of a "default user" who attempts to browse anonymously, i.e., avoids personalization effects by removing historical data (e.g., cookies), but still behaves close to the way a human would when using a browser (e.g., clicking and scrolling search pages). At the same time, we attempt to keep this behaviour consistent across several search engines, e.g., by synchronizing the requests. Effectively, the browsing behaviour is encapsulated in a browser extension called WebBot [54].

For data collection, we have been using a distributed cloud-based research infrastructure. For each collection, we have configured a number of machines that varies depending on the experimental design.
On each machine (2 CPUs, 4GB RAM, 20GB HD), we installed CentOS and two browsers (Firefox and Chrome). In each browser, we installed two extensions: the WebBot that we briefly introduced above, and the WebTrack [55]. The tracker collects the HTML of each page that is visited in the browser and sends it to a server (16 CPUs, 48GB RAM, 5TB HD), a different machine where all the content is stored and where we can monitor the activity of the agents. Throughout this paper, we use the term virtual agent (or simply "agent") to refer to a browser that has the two extensions installed and that is configured for one of our collections.

A virtual agent in our methodology thus consists of an automated browser (through two extensions) that navigates through the search results of a list of query terms on a set of search engines, and that sends all the HTML of the visited pages to a server where the data is collected. The agent is initialized by assigning to it (1) a search engine and (2) the first item of the query term list. Given that pair, the agent simulates the routine of a standard user performing a search on the following search categories of the engine: text, news, images and video. After that, it simultaneously shifts the search engine and the query term in each iteration to form the next pair and repeats the routine.

The rest of this section describes the latest major version (version 3.x) of the browser extension that simulates the user behaviour; later, we list the differences in older versions that have methodological implications. The extension can be installed in Firefox and Chrome. Upon installation, the bot cleans the browser by removing all relevant historical data (e.g., cookies, local storage). For this, the extension requires the "browsingData" privilege. Table 1 presents the full lists of data types that are removed for Firefox and Chrome.
After this, the bot downloads the lists of search engines and query terms that were previously defined as part of an experimental design (see Data Collections section).

Chrome: appcache, cache, cacheStorage, cookies, fileSystems, formData, history, indexedDB, localStorage, pluginData, passwords, serviceWorkers, webSQL
Firefox: cache, cookies, formData, history, indexedDB, localStorage, pluginData, passwords

Table 1. Data types that are cleaned during the installation and after each query. The lists differ due to the differences between the browsers. A description of the data types is available for Chrome [56] and Firefox [57]. The bolded elements were included in version 3.0.

The navigation in the browser extension is triggered on the next exact minute (i.e., "minute o'clock") after a browser tab lands on the main page of any of the supported search engines: Google, Bing, DuckDuckGo, Yahoo, Yandex, Baidu, Sogou and So. Once triggered, the extension uses the first query term to navigate over the search result pages of the search engine categories (text results, news, images and videos). After each search routine, the browser is cleaned again according to Table 1.

The search routines pursue two goals: first, to collect 50 results in each search category and, second, to keep the navigation consistent. For the most part, we succeeded in reaching the first goal, with the majority of search engines providing the required number of results. The only exception was Yandex, for which we decided to collect only the first page of text and news results because Yandex allows a very low number of requests per IP. After the limit is exceeded, Yandex detects the agent and blocks its activity by means of captchas.
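As a concrete illustration of the "minute o'clock" trigger described above, the delay until the next exact minute can be computed as follows. This is a minimal sketch: the function name is ours, and the actual WebBot implements this logic inside a browser extension, not in Python.

```python
import datetime

def seconds_until_next_minute(now: datetime.datetime) -> float:
    """Delay before triggering the search routine: navigation starts
    on the next exact minute after the agent lands on a supported
    search engine's main page."""
    next_minute = (now.replace(second=0, microsecond=0)
                   + datetime.timedelta(minutes=1))
    return (next_minute - now).total_seconds()
```

Triggering on an exact minute, rather than immediately on page load, helps keep agents running on different machines synchronized, which matters when their results are to be compared.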
Our second goal was only partially fulfilled, because it is impossible to reach full consistency given the multiple differences between search engines, such as the number of results per page, speed of retrieval, the navigation mechanics (pagination, continuous scrolling, or scroll and click to load more), and other features highlighted in italics in Table 2. To make the behaviour of agents more consistent, we tried to keep the search routines under 4 minutes and guaranteed that each search routine started at the same time by initializing a new routine every 7 minutes (with negligible differences due to internal machine clock differences). Additionally, the extension is tolerant to network failures (or slow server responses), because it refreshes a page that is taking too long to load (a maximum of 5 attempts per result section). In the worst-case scenario, after 6.25 minutes an internal check is made to make sure that the bot is ready for the next iteration, i.e., the browser has been cleaned and has landed on the corresponding search engine starting page, ready for the trigger of the next query term (which happens every 7 minutes).

To give a clearer idea of the agent functionality, Table 3 presents a detailed step-by-step description of the search routine implementation for an agent configured to start with the Google search engine (followed by Bing) in the Chrome browser. The description assumes that the routine is automatically triggered by a terminal script, for example using Linux commands such as "crontab" or "at".

6. The current query term is typed and the search button is clicked.
7. Once the text result page appears, the bot simulates scrolling down in the browser until the end of the page is reached.
8. If the bot has not reached the 5th result page, the bot clicks on the next-page button and repeats step 7. Otherwise, it continues to step 9.
9. The bot clicks on the news search link and repeats the behaviour used for text results (steps 7 and 8).
10. The bot clicks on the image search link and scrolls until the end of the page is reached. When the end of the page is reached, the bot waits for more images to load and then continues scrolling down. It repeats this process of scrolling and loading more images three times.
11. The bot clicks on the video search link and repeats the behaviour used for text results (steps 7 and 8).
12. The bot navigates to a dummy page hosted at http://localhost:8000. Upon landing, the bot updates the internal counters of the extension and removes historical data as shown in Table 1.
13. The bot navigates to the main page of the next search engine according to the list downloaded in Step 2 (e.g., https://bing.com), and sets the next element of the query term list (or the first element, if the current query term is the last of the list) as the current query term.
14. Upon landing on the search engine page, the bot resolves the consent agreement that pops up on the main page.
15. After 7 minutes have passed since entering the previous query, the next search routine is triggered and continues from Step 6 (adjusting steps 7 to 11 according to the next search engine in Table 2, e.g., https://bing.com).

Table 3. Detailed navigation process for an agent starting its search routine on Google. Each row corresponds to a step of the search routine. The first column enumerates the step and the second gives its description. The process includes the steps that correspond to the agent setup before the actual routine starts (Steps 1 to 5), and steps that correspond to the routine of the next search engine (Steps 13 to 15). The description in this section only involves the steps for one machine (and one browser), which simultaneously shifts to the new search engine and the new query term after the routine ends.
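The consequence of this simultaneous shifting can be checked with a small sketch (an illustration of ours, with hypothetical list contents): a single agent that advances both lists in lockstep only ever visits the "diagonal" (engine, query) pairs, which motivates either repeating query terms in the list or running one machine per engine.

```python
from itertools import product

def visited_pairs(engines, queries, iterations):
    """(engine, query) pairs visited by one agent that shifts both
    lists simultaneously after each search routine."""
    return {(engines[i % len(engines)], queries[i % len(queries)])
            for i in range(iterations)}

engines, queries = ["e1", "e2"], ["q1", "q2"]

# One agent covers only the diagonal pairs, never (e1, q2) or (e2, q1).
diagonal = visited_pairs(engines, queries, 4)

# Workaround 1: repeat each query term in the query list.
all_pairs_1 = visited_pairs(engines, ["q1", "q1", "q2", "q2"], 4)

# Workaround 2 (keeps agents synchronized): one machine per engine,
# each starting the rotation at a different engine.
all_pairs_2 = set()
for start in range(len(engines)):
    rotated = engines[start:] + engines[:start]
    all_pairs_2 |= visited_pairs(rotated, queries, len(queries))
```

Both workarounds cover all engine-query combinations; the second additionally keeps the requests of all machines aligned in time, which is why it is preferable for comparing engines.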
Assuming a list of search engines, say E = (e1, e2), and a list of query terms, say Q = (q1, q2), and an agent that is initialized with engine e1 and query q1, i.e., with the pair (e1, q1), the procedure will only consider the pairs (e1, q1) and (e2, q2), and exclude the combinations (e1, q2) and (e2, q1). To obtain results for all combinations of engines and queries, the researcher can (1) manipulate the lists so that all pairs are included, e.g., one possible solution is to repeat each query term twice in the query list, i.e., Q = (q1, q1, q2, q2), or (2) use as many machines as search engines, e.g., one initialized to e1 and another to e2. The second alternative is preferable, because it keeps the search results synchronized, assuming that all machines are started at the same time.

In Table 4, we report the relevant changes across the WebBot versions that have been used for the data collection rounds (see Data Collections section). We only include differences that have methodological implications, because they either (1) affect the collected search results or (2) affect the experimental design.

Table 4. Relevant features of WebBot versions. The first column indicates the version and the date when it was released. The second column enumerates features that could have an effect on the data collection of search results or on the experimental design. All the features correspond to changes with respect to previous versions, except for the first row (version 1.0); in that case, the included features are the ones that change in the following versions (and that differ from the navigation described in Table 3). The value inside the brackets, e.g., [1.0.a], is used to reference the feature in the text.

Browser cleaning. Since version 3.0, browsing data is removed from the browser backend (see Table 1). Before that, the local storage and cookies were removed from the extension front end [1.0.a], so it could only remove cookies and storage that were allocated by the search engine (due to browsing security policies). For our first version, we were forced to do so due to a technical issue.
Cleaning the local storage or cookies from the backend also removed that data from all installed extensions in the browser (not only the browsing data corresponding to the webpages), including the WebTrack [55]. This session data was generated when the virtual agent was set up by manually entering a token which is pre-registered in the server. A proper fix involved a change that automatically assigned a generic token to each machine.

Cookie consent. Regulations such as the European Union's ePrivacy Directive (together with the General Data Protection Regulation, GDPR) and the California Consumer Privacy Act (CCPA) forced platforms to include cookie statements asking for the user's consent to store and read cookies from the browser, as well as to process personal information collected through them. Since we have focused on non-personalized search results, we decided to ignore these banners in the first extension version, except for Yahoo!, where the search engine window was blocked unless the cookies were accepted [1.0.c]. By the version 1.1 release, Google also started forcing cookie consent, so we integrated consent handling for Google and Yandex; this corresponds to a minor improvement that was only included for consistency with the description provided in Table 3.

Starting February 2020, we have been using the WebBot extension to collect data to explore a multitude of research questions related to, for example, search engine differences, browser and geo-localization effects, visual portrayals in image results, and source concentration. In total, we have performed 15 data collections with diverse experimental designs that are summarized in Table 5.

Table 5. Experimental design of the data collections.
From left to right, each column corresponds to: (1) an identifier (ID) of the collection used in the text of this paper, (2) the date of the collection, (3) the version of the bot utilized for the collection, (4) if the collection replicates the queries and experimental setup of a previous collection, the identifier of that collection, (5) the number of agents that were used for the collection, (6) the number of geographical regions in which agents were deployed, (7) the number of search engines and (8) the number of browsers that were configured for each collection, (9) the number of times (iterations) that each query was performed, and (10) the number of query terms included in the collection. (*) We assigned 24 extra machines to one of the regions (São Paulo) because this Amazon region seemed less reliable in a previous experiment.

All the collections included the same six search engines (Baidu, Bing, DuckDuckGo, Google, Yahoo! and Yandex), except collection B, which also included two extra Chinese engines, So and Sogou; these were important given the nature of the collection and the research questions. The collection rounds E and F excluded Yandex, because the platform was detecting too many requests coming from our IPs (see Results section).

To understand the robustness of the method and the scale of the collections, we present the results of the data collections in terms of coverage, size, and effective size (response variables). Coverage is the proportion of agents that collected data in each experimental condition relative to the number of agents expected. We estimate this value by counting the agents that successfully collected at least one result page under each experimental condition and dividing it by the number of agents assigned to that condition. Size is the space that each collection occupies on the server. We estimate this number by adding up the kilobytes of each file that was collected for the collection.
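These two response variables reduce to simple computations, sketched below for illustration (Python; the function names and data layout are ours, not part of the original pipeline):

```python
def coverage(n_successful: int, n_assigned: int) -> float:
    """Proportion of agents that collected at least one result page
    under an experimental condition, out of the agents assigned
    to that condition."""
    return n_successful / n_assigned

def collection_size_kb(file_sizes_kb: list) -> float:
    """Total space a collection occupies on the server: the sum of
    the sizes (in kilobytes) of every file collected for it."""
    return sum(file_sizes_kb)
```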
Effective size is the size of the collection excluding extra pages that are not relevant for the collection, e.g., home or cookie consent pages, but also pages collected due to a delay between the end of the experiment and turning off the machines. To help future researchers in estimating data collections, we provide average sizes per combination of search engine and results section. For this calculation, we include only those query terms for which we obtained the exact number of pages that we aimed for. To show that these values are robust, we compare two size estimates for each collection: in-sample estimates, calculated using only the averages corresponding to the query terms of that collection, and out-of-sample estimates, calculated using the averages corresponding to the query terms that are not included in that collection.

Figure 1 presents the coverage for 3 experimental conditions (browser, search category, and search engine) for our different collections. In multiple cases, we achieved near-perfect coverage and consistently collected above 80% of results. However, there are some clear gaps that we explain and discuss below. In the first two rows, the collection (together with the version of the extension used and the date) and the browser are presented; in the columns, the result type (text results, news, images or videos) and the six engines most often included in our experiments. Coverage values closer to 0 are coloured with red tones, those closer to .5 with yellow tones, and those closer to 1 with blue tones. The gray colour is used for missing values, i.e., for conditions that were not included in the experimental design. The coverage for so.com and sogou.com (only used for collection 19.02.21) ranged between .48 and .68, except for sogou.com in Chrome, for which it was between .19 and .23.

Poor coverage for Yandex.
Yandex restricts the number of search queries that come from the same IP, and after the limit is reached it starts prompting captchas [58]. After several tests, we found that Yandex only blocks text and news search results, but not image and video ones. Therefore, we improved our extension by making it jump to image and video search when a captcha was detected. We can see that the coverage for images and videos was fixed after collection F (version 3.0). However, coverage for news was still poor (see collections G, H, I). So, we decided to only collect the top 10 results for Yandex (i.e., the first page of search results) for text and news search, which allowed us to improve the consistency of coverage at the cost of volume. We did not experience these issues in the last collection (Q) because it only included 8 queries.

Coverage gaps before v3.0. Most of these gaps were due to various small programming errors that were triggered under special circumstances (e.g., lack of results for queries in certain languages), combined with the lack of recovery mechanisms in the extension. We also noticed that Google detected our extension more often for Chrome than for Firefox (see collections B and C), which caused low coverage.

Differences between the browsers. Apart from Chrome-based agents being more often detected as bots by Google, we noticed that Chrome performed poorly when it did not have visual focus in the graphical user interface, to which the operating system gives more priority. This problem was clearly observed in collection G, so all subsequent collections kept the visual focus on Chrome, which allowed us to address this limitation.

Specific problems with particular collections. Collection J included 720 agents and exceeded our infrastructure capabilities; the bandwidth of our server was not sufficient to handle the upload requests in time. This explains the progressive degradation between the text and the video search results.
Collection E was very distinct as (1) it had very few machines (only one per region and engine) and (2) it took over 4 days (see information about iterations in Table 5). Therefore, a single machine that failed (and did not recover) would heavily affect the coverage for the rest of the iterations in this case.

The first row of Table 6 presents the total size of each of the collections, followed by the effective size, i.e., the size of the files that correspond to the pages that are targeted by the collection. Overall, the effective size is 95.46% of the total size (1.19 of 1.25 terabytes (TB)). The remaining 4.54% is composed of extra pages that do not contain search results, including search engine home, captcha, cookie, and dummy (see v3.2b, Table 4) pages, but also search results pages that were collected after the end of the experiment, due to a delay when stopping the machines and the iteration over the query list (Step 13 of Table 3), and pages from unintended queries (due to search engine automatic corrections and completions, or encoding problems, see Table 7). Although we only use exact cases for these calculations, there is important variance in the size of each section (Figure 1), which stems from the query term and the date of the collections. The variance is even bigger if all cases (not just the exact cases) are considered. To test whether these averages would be useful for calculating future collection sizes, we estimate the size of each collection using in-sample and out-of-sample data (see Coverage and size in the Methodology). The estimates are displayed in the fourth and fifth rows of Table 6. In all cases but one (collection I), the in-sample size estimate is higher than the effective size, and the out-of-sample estimates are close to the in-sample estimates. This indicates that our averages are a good way to approximate the sizes of the collections.
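The in-sample versus out-of-sample estimation described above can be illustrated with a minimal sketch; the page sizes below are made-up numbers for illustration only, not values from Table 6:

```python
from statistics import mean

def size_estimate(avg_sizes, n_queries, pages_per_query):
    """Estimate a collection's total size as (mean result-page size)
    x (number of pages to be collected).

    avg_sizes: mean result-page sizes (bytes), one per query term,
    for a given engine/section combination.
    """
    return mean(avg_sizes) * n_queries * pages_per_query

# Hypothetical mean page sizes (bytes) per query term:
in_collection = [400_000, 500_000]    # query terms of this collection
out_collection = [450_000, 430_000]   # query terms from other collections

# In-sample: averages from the collection's own query terms.
in_sample = size_estimate(in_collection, n_queries=2, pages_per_query=10)
# Out-of-sample: averages from query terms not in the collection.
out_sample = size_estimate(out_collection, n_queries=2, pages_per_query=10)
```

If the two estimates are close, as in our collections, the averages generalize well enough to plan the storage needs of a future collection before running it.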
The coverage obtained with our method is summarized in Figure 1. In terms of effective size (RQ1b), our method introduces very little noise into our collection, as most of the data (~95.45%) corresponds to results pages relevant to the queries of our experimental designs, as opposed to extra pages not containing search results (e.g. cookie agreements) or unintended queries (e.g., due to engine automatic corrections or delays in dismantling the infrastructure at the end of the experiment). Additionally, 92.44% of the effective size corresponds to complete queries, where the number of collected pages matches the expected number according to the pagination of the search engine, thus supporting our success in terms of coverage. We use this exact dataset to estimate the sizes of search sections, which are useful for calculating the size of future collections; using out-of-sample data, we provide evidence that our figures approximate the data collection sizes well.

Researchers should be aware of the complexities of collecting search engine results at a large scale with approaches like ours. This paper describes in detail all the steps we have taken to improve our methodology, and Table 7 summarizes the practical challenges that we had to address in this process (RQ2). We hope this will help researchers to succeed in their data collection endeavours.

Maintenance

Volatility of search engine layouts. The HTML layout of search engines is in constant evolution, making it practically impossible to develop an out-of-the-box solution, even if one limits the simulation of user behaviour to one platform. The changes are unannounced and unpredictable, so collection tools should be tested and adjusted before any new data collection.

Browser evolution. Browsers change the way they organize and allow access to the different data types that they store, and it is necessary to keep the extension up to date. Browsers could offer more controls in relation to the host site in which the data (e.g.
cookie) is added, and not only the third party that adds the cookie. A consequence of cleaning the browser data is that the behaviour must consider the acceptance of cookie statements of the different platforms each time a new search routine is started. Regional differences are important, as regulations differ. For example, the cookie statement no longer appeared for the machines with US-based IPs during our most recent data collection.

It is recommendable to let the agents iterate over the search engines, which brings three benefits: (1) it avoids possible confounds between IPs and search results coming from the platforms, (2) it decreases the number of search requests per IP to the same search engine, which prevents the display of captcha pop-ups for most search engines, and (3) it equally distributes the negative effects of the failure of one agent across all the search engines, so that the collection remains balanced.

Network connectivity. Although rare, network problems could cause major issues if not controlled appropriately. We included several contingencies to keep the machines synchronized, reduce data losses by resuming the procedure from predefined points (e.g., next query or next search section), and avoid saturating the server by allowing pauses between the different events.

Unexpected errors. Multiple extrinsic factors can lead to browsers not starting properly or simply terminating. The underlying reasons for such failures are difficult to identify, as all the machines are configured identically (clones) and we dismantle the architecture as soon as the collections are finished to save costs.

Simulated browsing is not immune to being misled by corrections or completions that search engines offer; e.g. "protesta" (Spanish for protest) changed the query term to "protestant" due to the geolocation, or "derechos LGBTQ" to "derechos LGBT" (which is problematic per se).

Character encoding. Certain search engines do not support characters of all languages, e.g.
Baidu did not handle accents in Latin-based languages; e.g. the query "manifestação" was changed to "manifesta0400o" [11, 25]. Establishing baselines to properly evaluate whether some of these selections might be more or less skewed - or biased - towards certain interpretations of social reality remains difficult [8].

We provide the code for the extension that simulates the user behaviour [54]. It is not as advanced as a recently released tool called ScrapeBot [35]: ScrapeBot is highly configurable and offers an integrated solution for simulating user behaviours through "recipes" for collecting and extracting the data, as well as a web interface for configuring the experiments. Nonetheless, our approach holds some additional merits. First, by using the browser extensions API, we have full control of the HTML and the browser, which, for example, allows us to decide exactly when the browser data should be cleaned, and provides maximum flexibility in terms of interactions with the interface. Second, we collect all the HTML rather than targeting specific parts of it, avoiding potential errors when it comes to defining the specific selectors; once the HTML is collected, post-processing can be used to filter the desired parts. ScrapeBot users are encouraged to target specific HTML sections, but it is also possible, and highly recommended based on our experience, to capture the full HTML to avoid possible problems when the HTML of the services changes. Lastly, our approach clearly separates (1) the simulation of the browsing behaviour and (2) the collection of the HTML that is being navigated. The latter allowed us to repurpose an existing tool that was initially aimed at the collection of human user data. Such an architecture enables more freedom in the use of each of the two components - namely, the WebBot and the WebTrack. Researchers could reuse a different bot, e.g.
one that simulates browsing behaviour on a different platform, without worrying about changing the data collection architecture. Conversely, our WebBot could be used with a different web tracking solution to achieve similar results. One caveat of the former scenario is our aggressive method of cleaning the browser history, which forced us to make modifications to the source code of the tracker that we used.

A limitation of virtual-agent-based auditing approaches is that they rely on a simplified simulation of individual online behaviour. The user actions are simulated "robotically", i.e., the agent interacts with platforms in a scripted way; this is sufficient to collect the data, but not necessarily authentic. On the one hand, it is possible that the way humans interact with pages (e.g., hovering the mouse for a prolonged time over a particular search result) has no effect on the search results, because these interactions are not considered by the platform algorithms. On the other hand, one cannot be certain until it is tested, given that the source code of the platforms is closed. Experiments that closely track user interactions with online platforms could help create more lifelike virtual agents. At the same time, it is important to revisit the differences in results obtained via the alternative ways of generating system outputs: simulating user behaviour via virtual agents, querying platform APIs, and crowdsourcing data from real users.

Our browsing simulation approach is sufficient for experimental designs in which all machines follow a defined routine of searches, but, so far, the only variable that can be configured per agent is the starting search engine, and even then, this is done manually in the start-up script (by preparing the number of machines corresponding to the number of unique search engines that are going to be included). A more sophisticated approach could allow more flexibility in configuring each virtual agent.
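As an illustration of what such a more flexible approach could look like, the start-up script could generate a small configuration per agent instead of preparing machines manually per engine. This is a hypothetical Python sketch, not part of the released WebBot; all names and the configuration format are assumptions:

```python
import itertools

def build_agent_configs(engines, regions, agents_per_cell):
    """Generate one configuration dict per virtual agent, rotating the
    starting search engine so each engine/region cell is covered equally.

    Hypothetical sketch of a configurable start-up step; 'start_engine'
    stands in for the one variable currently set by hand.
    """
    configs = []
    for region, engine in itertools.product(regions, engines):
        for i in range(agents_per_cell):
            configs.append({
                "region": region,
                "start_engine": engine,  # engine the agent queries first
                "agent_id": f"{region}-{engine}-{i}",
            })
    return configs

# One agent per engine/region cell, as in our smaller collections:
configs = build_agent_configs(["google", "yandex"], ["us", "de"], 1)
```

Extending the dict with further fields (query lists, search sections, pacing) would let each agent follow a distinct routine without changing the collection architecture.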
In this paper, we offer an overview of the process of setting up an infrastructure to systematically collect data from search engines. We document the challenges involved and the improvements undertaken, so that future researchers can learn from our experiences. Despite the challenges, we demonstrate the successful performance of our infrastructure and present evidence that algorithm audits are scalable. We conclude that virtual agents can be used for long-term algorithm auditing, for example to monitor long-lasting events, such as the current COVID-19 pandemic, or century-long affairs, such as climate change and human rights.

We have no conflicts of interest to disclose.

References

- The relevance of algorithms
- Algorithms of oppression: How search engines reinforce racism
- Weapons of math destruction: How big data increases inequality and threatens democracy
- Automation, Algorithms, and Politics | Auditing for Transparency in Content Personalization Systems
- Problematic Machine Behavior: A Systematic Literature Review of Algorithm Audits
- I Vote For - How Search Informs Our Choice of Candidate
- Auditing the Partisanship of Google Search Snippets
- Quantifying Search Bias: Investigating Sources of Bias for Political Searches in Social Media
- Search Media and Elections: A Longitudinal Investigation of Political Search Results
- Search as news curator: The role of Google in shaping attention to news information
- The Matter of Chance: Auditing Web Search Results Related to the 2020 U.S. Presidential Primary Elections Across Six Search Engines
- Challenging Google Search filter bubbles in social and political information: Disconforming evidence from a digital methods case study
- Experimental Measures of News Personalization in Google News
- Burst of the Filter Bubble?
Effects of personalization on the diversity of Google News
- Beyond the Bubble: Assessing the Diversity of Political Search Results
- Auditing Partisan Audience Bias within Google Search
- Auditing the Personalization and Composition of Politically-Related Search Engine Results Pages
- Measuring personalization of web search
- Location, Location, Location: The Impact of Geolocation on Web Search Personalization
- Competent Men and Warm Women: Gender Stereotypes and Backlash in Image Search Results
- Female Librarians and Male Computer Programmers? Gender Bias in Occupational Images on Digital Media Platforms
- Detecting Race and Gender Bias in Visual Representation of AI on Web Search Engines
- Abyss or Shelter? On the Relevance of Web Search Engines' Search Results When People Google for Suicide
- How search engines disseminate information about COVID-19 and why they should do better. Harv Kennedy Sch Misinformation Rev
- Auditing local news presence on Google News
- Opening Up the Black Box: Auditing Google's Top Stories Algorithm
- What kind of news gatekeepers do we want machines to be? Filter bubbles, fragmentation, and the normative dimensions of algorithmic recommendations
- Auditing Source Diversity Bias in Video Search Results Using Virtual Agents
- Measuring Misinformation in Video Search Platforms: An Audit Study on YouTube
- Google, is this what the Holocaust looked like? Auditing algorithmic curation of visual historical content on Web search engines. First Monday
- Querying the Internet as a mnemonic practice: how search engines mediate four types of past events in Russia
- The Substantial Interdependence of Wikipedia and Google: A Case Study on the Relationship Between Peer Production Communities and Information Technologies
- Measuring the Importance of User-Generated Content to Search Engines
- Agent-based Testing: An Automated Approach toward Artificial Reactions to Human Behavior
- Automated Experiments on Ad Privacy Settings
- Agreeing to disagree: search engines and their public interfaces
- On the Volatility of Commercial Search Engines and its Impact on Information Retrieval Research
- Tackling the Algorithmic Control Crisis - the Technical, Legal, and Ethical Challenges of Research into Algorithmic Agents
- Explaining Online News Engagement Based on Browsing Behavior: Creatures of Habit?
- How We Built a Facebook Inspector - The Markup
- Personal Web searching in the age of semantic capitalism: Diagnosing the mechanisms of personalisation. First Monday
- Detecting price and search discrimination on the internet
- Equal access to online information? Google's suicide-prevention disparities may amplify a global digital divide
- Visual representation of migrants in Web search results
- YouTube Dominates Google Video in 2020
- Searching for Video? Google Pushes YouTube Over Rivals
- Auditing Race and Gender Discrimination in Online Housing Markets
- Measuring Price Discrimination and Steering on E-commerce Web Sites
- An Empirical Study on Online Price Differentiation
- Tracking Gendered Streams
- More of the Same - On Spotify Radio
- Analyzing the News Coverage of Personalized Newspapers
- GESIS - Leibniz Institute for the Social Sciences
- Webtrack - Desktop Extension for Tracking Users' Browsing Behaviour using Screen-Scraping
- MDN Web Docs: browsingData.DataTypeSet
- Search blocking and captcha - Captcha