key: cord-0276856-s0aansqv authors: Suh, J.; Horvitz, E.; White, R. W.; Althoff, T. title: Widening Disparities in Online Information Access during the COVID-19 Pandemic date: 2021-09-17 journal: nan DOI: 10.1101/2021.09.14.21263545 sha: 9dbf82329e8c2acfa0a50176aea3611daec201ae doc_id: 276856 cord_uid: s0aansqv The COVID-19 pandemic has stimulated a staggering increase in online information access, as digital engagement became necessary to meet the demand for health, economic, and educational resources. We pursue insights about inequity in leveraging online information, spanning challenges with access and abilities to effectively seek and use digital information. We observe a widening of digital inequalities through a population-scale study of 55 billion everyday web search interactions during the COVID-19 pandemic across 25,150 US ZIP codes. We observe that ZIP codes with low socioeconomic status (SES) and high racial/ethnic diversity did not leverage health information and pandemic-relevant online resources (e.g., online learning, online food delivery) as much as regions with higher SES and lower levels of diversity. We also show increased shifts in online information access to financial or unemployment assistance for ZIP codes with low SES and high racial/ethnic diversity. These findings demonstrate that the pandemic has exacerbated existing inequalities in online information access and highlight the role of large-scale, anonymized data about online search activities in digital disparities research. The results frame important questions and future research on identifying and targeting interventions for vulnerable subpopulations that could reduce further widening of digital access inequalities and associated downstream outcomes including health, education, and employment. digital divide [14] [15] [16] ). 
For example, individuals with low socioeconomic status (SES) are shown to be slower to adopt information and communication technologies (ICTs) from childhood and throughout their life course and have smaller social networks and limited employment opportunities as a result 3 . Furthermore, even after controlling for internet access, those from higher SES integrate digital resources into their lives and use the internet for more "capital-enhancing" activities that are likely to result in more upwards mobility 15, 17 . Because this lack of engagement in digital resources may impact a range of downstream outcomes such as health 3, 4, 18 , education 19 , and employment 20, 21 , it is important to observe such behaviors across subpopulations and scrutinize the role of digital inequalities in our society. In addition, disadvantaged subpopulations are already at a higher risk of the COVID-19 infection and mortality and increased level of pandemic-induced socioeconomic burden, such that it is critical to ensure that digital inequalities do not exacerbate the disparate influences of the pandemic even further 2 . Here, we harness the centrality of web search engines for online information access to conduct a retrospective observational study on how the pandemic may have shifted people's engagement towards or away from digital resources and how such shifts may reflect societal disparities. This study extends prior work on pandemic-related disparities which are largely driven by datasets focused on hospitalization or case/fatality rates [8] [9] [10] [11] . Leveraging web search interactions enables us to capture the use of critical digital resources such as online educational sites in response to school closures, online food delivery information in response to restaurant closures, online social interactions in response to physical distancing and travel restrictions, or online unemployment and economic assistance in response to economic instability during the pandemic. 
Given that the pandemic has impacted everyone's web search behaviors nationally 1, 22 , our goal is to identify differences across subpopulations in their behavioral responses to the pandemic and to discover potential barriers and challenges in accessing critical resources on the web. Prior work on understanding digital disparities has relied on costly surveys, interviews, or self-reports 23-25 that require direct engagement with the study population for subjective recounting of their past behaviors rather than passively observing their actual behaviors. Datasets from specific service providers (e.g., Wikipedia 26 , Zearn.org 27, 28 ), domains (e.g., telehealth 29 , eHealth 18 ), or geographic areas (e.g., Northern California 18 ) do not capture digital behaviors across a broad spectrum of human needs and subpopulations and at fine geo-temporal granularities. Macroeconomic measures, such as unemployment rates, do not capture potentially unmet needs or access barriers (e.g., confusion around unemployment benefits [30] [31] [32] ). Conversely, web search logs are routinely collected in near-real time and at large scales, providing unique opportunities to unobtrusively examine digital behaviors across a wide range of topics, geographies, and subpopulations as well as highlighting potential barriers and changes to such engagement behaviors 33 . In fact, web search logs have enabled studies of human behaviors across many different domains 34-37 , times 38-41 , and locations 42, 43 , and have been used to make inferences about the future or to identify risk factors 22, [44] [45] [46] [47] . In the context of the COVID-19 pandemic, such data has stimulated a prolific range of research on physical 22, 48 , psychological 49 , and socioeconomic 50, 51 well-being 1 . We analyze 55 billion everyday web search interactions during the COVID-19 pandemic across 25,150 US ZIP codes.
Our dataset includes anonymized search queries to the Bing search engine and subsequently clicked web site URLs from those queries. Each search interaction is classified into categories of health, education, economic assistance, and food access, covering a range of critical resource needs (Supplementary Table S4). We link the search interactions from each United States ZIP code to their respective per-ZIP code census variables that broadly cover five social determinants of health (SDoH) categories defined by the US Department of Health 52 : (1) Healthcare Access and Quality (through health insurance coverage), (2) Education Access and Quality (through education level), (3) Social and Community Context (through race/ethnicity), (4) Economic Stability (through income and unemployment rate), and (5) Neighborhood and Built Environment (through population density and internet access). We divide our dataset along these SDoH factors and compare the magnitude of change in search behaviors between two ZIP code groups during the pandemic, where more or less observed change in search behaviors could indicate a shift toward higher or lower demand for information (e.g., health, unemployment) or a shift to digital means of accessing resources (e.g., online remote learning). For example, we split our ZIP codes into low and high income groups (below and above $55,000 median household income) and compare the magnitude of change in health condition information queries (Fig. 1a). To disentangle the confounding effects of SES and race/ethnicity on behaviors and health 53 , we compare changes in search behaviors on matched pairs of ZIP codes that are highly similar across these potentially confounding factors (Methods). We isolate the relative changes in search behaviors that occur concurrently with the pandemic using a difference-in-differences approach 54 , adjusting for yearly and weekly seasonality and for pre-existing, pre-pandemic disparities (Methods).
Thus, we operationalize digital disparities attributable to a single SDoH factor by quantifying the differences in these changes in search behaviors between two subpopulations delineated by that factor (Fig. 1e) . Finally, we apply the same process across all SDoH factors (Fig. 1f , Methods). Given the higher rate of pre-existing health conditions, documented disparities in healthcare access, and higher COVID-19 case and mortality rates for low SES subpopulations 8, 53 , low SES subpopulations might be expected to seek information about their health conditions at higher rates during the pandemic as compared to before. Instead, we find that ZIP codes associated with lower incomes show over a 200 percentage point smaller increase (95% CI [−287, −152] ) in health condition queries than their higher income counterparts (Fig. 1e) . This means that a person who was making one health condition query a month before the pandemic makes about ten such queries a month during the pandemic, but the same person would only make about eight such queries a month if they lived in a ZIP code with lower income. We find that ZIP codes with higher Hispanic population, higher population density, and higher unemployment rate also responded to the pandemic with lower relative change in their health condition queries during the first four weeks (Fig. 1f) . While ZIP codes with high (i.e., above population-average) Black population (≥12%) do not seem to be affected as much as high Hispanic population groups during the first four weeks, their response is lower during the months of August to November ( Supplementary Fig. S3g ). On the other hand, we find that ZIP codes with low education (≤21.1% with bachelor's degrees) make over 70 percentage point more (95% CI [31, 117] ) health condition queries compared to ZIP codes with higher education (Fig. 1f) . 
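The comparison described above (a year-over-year seasonal adjustment, removal of the pre-pandemic baseline, and a between-group difference with bootstrapped intervals) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy numbers, the sign convention (group minus reference), and the choice to resample per-pair differences are all assumptions made for illustration, and the normalization that expresses changes as percentages is left to the paper's Methods.

```python
import random
from statistics import mean


def pandemic_change(p2020, p2019, n_baseline):
    """Seasonally adjusted change relative to the pre-pandemic baseline.

    p2020 and p2019 are equal-length daily query proportions that are
    assumed to be already aligned by day of week. The first n_baseline
    points are treated as the pre-pandemic window.
    """
    year_over_year = [a - b for a, b in zip(p2020, p2019)]
    baseline = mean(year_over_year[:n_baseline])
    return [v - baseline for v in year_over_year]


def group_difference(group, reference, n_baseline):
    """Per-day difference in pandemic change: group minus reference.

    Each argument is a (series_2020, series_2019) tuple. The sign
    convention is an assumption of this sketch.
    """
    g = pandemic_change(group[0], group[1], n_baseline)
    r = pandemic_change(reference[0], reference[1], n_baseline)
    return [a - b for a, b in zip(g, r)]


def bootstrap_ci(values, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of values."""
    rng = random.Random(seed)
    means = sorted(
        mean(rng.choices(values, k=len(values))) for _ in range(n_boot)
    )
    lower = means[int(alpha / 2 * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy data (made-up numbers): queries per 10,000 searches, as (2020, 2019).
low_income = ([12, 12, 12, 15, 15, 15], [10, 10, 10, 11, 11, 11])
high_income = ([11, 11, 11, 17, 17, 17], [10, 10, 10, 11, 11, 11])

# The first three points play the role of the pre-pandemic window.
diff = group_difference(low_income, high_income, n_baseline=3)

# Hypothetical per-pair differences for matched ZIP code pairs.
pair_diffs = [-3, -2, -4, -3, -3, -2, -4, -3]
ci = bootstrap_ci(pair_diffs)
```

In this toy setup the difference is zero inside the baseline window by construction and negative afterwards, which mirrors how a smaller increase for one group shows up as a negative percentage point difference in the reported results.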
Online health information seeking behaviors can educate and empower patients to engage in their health care, lead to better health and wellbeing, and reduce health disparities 3, 4, 55, 56 . Prior research has shown that SES and demographics correlated with online health information seeking behaviors, highlighting the digital divide in health information access 56, 57 . Unfortunately, our results suggest that racial and economic factors may have contributed to a lessened response in health condition information seeking behaviors during the pandemic, which may exacerbate health disparities down the line 3, 58 .
Figure 1. Illustration of the process of quantifying disparities in online health information access between high and low income groups. a, We contrast 25,150 ZIP codes above and below $55,224 median household income. Prior work suggests that ZIP codes with lower income typically have higher health risks, and that income is correlated with many other potential confounders such as race and education 53 . We disentangle the influences of other potential factors by matching each high income population ZIP code with a low income population ZIP code of similar profiles along other covariates (see Methods). b, From over 55 billion search queries, we examine the proportion of queries relating to health conditions (e.g., hypertension, diabetes, cancer, coronavirus), across years 2019 and 2020 and across low income population (red) and high income population (gray), using a set of regular expressions (see Supplementary Table S4). c, We remove seasonal and weekly variations by subtracting the proportion of queries in 2019 from 2020 while aligning the days of the week. d, From all data points, we further remove the seasonally adjusted pre-pandemic baseline, shaded in gray (January 6 to February 23, 2020), to quantify changes introduced during the pandemic. e, We compute the difference in the changes in health condition queries during the pandemic by subtracting the low income group's changes from the high income group's changes. The differences between the two groups during the pre-pandemic period average to zero because we controlled for seasonal and pre-pandemic differences between the groups in the previous steps. We observe that low income ZIP codes experienced almost 200% less change in health condition queries compared to that of the high income groups right after the US national emergency is declared. This difference already accounts for each group's own baseline and, therefore, indicates a widening disparity in digital health engagement. Error bars in all charts indicate 95% confidence intervals obtained through bootstrapping (N=500). f, Finally, we repeat this process across all SDoH factors. Here, blue bars indicate the differences in percentage points across two matched comparison groups during the first four weeks since the declaration of the pandemic in the US.
When we examine unemployment-related search interactions, we find that relative changes in unemployment related search queries (e.g., "eligible for unemployment benefits", "jobless claims") closely follow those of reported unemployment claims by the Bureau of Labor Statistics (Supplementary Fig. S1). However, the impact of the pandemic on ZIP codes with high Black population and their unemployment search queries is almost three times the amount on ZIP codes with low Black population (Fig. 2a), with 3026% increase in query proportions for ZIP codes with high Black population compared to over 1365% increase for their counterparts, resulting in a 1,661 percentage point difference (95% CI [260, 2374]) (Fig. 2b). We find another surge in search queries that result in over 1000% increase in the proportion of clicks into state-specific unemployment websites past July 2020, at which point the expanded federal supplement to unemployment insurance benefits expired (Fig. 2c).
During the month of August, ZIP codes with higher Black and higher Hispanic population present 789 (95% CI [595, 957]) and 716 (95% CI [351, 1043]) percentage points more in their change in clicks to unemployment sites, indicating that Black and Hispanic groups may have required additional long-term unemployment benefits. Conversely, ZIP codes with lower education levels experienced 517 percentage points less (95% CI [−1009, −81]) in the change in state unemployment site visits (Fig. 2d). Internet access and digital engagement are important forms of human capital that allow efficient access to information, increase economic opportunities, and ultimately lead to better prospects and economic stability 59 . During economic hardships and especially during the pandemic, the internet can be an efficient way for government and institutions to deliver interventions and can lower barriers to accessing economic assistance or welfare services (e.g., https://www.usa.gov/food-help provides a comprehensive list of resources for food assistance). Unfortunately, the pandemic imposes multiple layers of barriers to accessing crucial economic assistance because low SES subpopulations are more likely to suffer economically from the pandemic 60 and deprioritize improving digital access as a consequence 3 . To understand the economic impact of the pandemic, we examine behaviors for accessing unemployment and financial assistance on the web. Our results suggest that lower education level, even while controlling for other factors such as internet access and SES, is linked to barriers to accessing unemployment sites on the web. Since our difference-in-differences analysis already accounts for existing disparities through normalization, this difference in the increase in unemployment site visits may signal widening disparities in employment.
Such "interest" in digital unemployment resources is not captured in reported claims that measure unemployment claims that are actually submitted, but can be readily observed in web search logs. The discrepancy between unemployment interests expressed online and officially submitted claims may suggest potential barriers in successful submission of benefits application (e.g., confusion, eligibility 30, 31 ). Coupled with low recipiency rate of unemployment benefits 61 and the association between unemployment accessibility and suicide risks 62 , the mismatch between demands and claims are concerning. April of 2020 was a prime time for financial assistance related queries (e.g., "loan forgiveness", "stimulus check deposit") because the first stimulus checks were deposited on April 11, 2020 (Fig. 2e) . We find that financial assistance related queries increased by over 15,000% in mid-April on average, but ZIP codes with higher Black population experience 5,119 percentage points less change (95% CI [−8809, −1407] ) in financial assistance related queries between April 13 and May 10, 2020 (Fig. 2f ). That means that if a person made one financial assistance related query per month in mid-April of 2019, that person makes 167 such queries per month in mid-April during the pandemic, but only 116 queries if that person was from a ZIP code with higher Black population. Since we successfully controlled for other potential confounding factors such as income and education in our comparison, as shown in Supplementary Table S8 , our result points to race as a plausible cause for such disparity. Our finding highlights the need to further investigate potential barriers or causes that disparately prevent Black subpopulations from responding to pandemic-induced stimulus demands on the web. Figure 2 . Disparities in online economic assistance access. 
a, The surge in unemployment related search queries peaks during the first month since the declaration of the pandemic and taper off over the year 2020. During this first month, ZIP codes with high Black population (≥ 12%) have expressed up to 3,358% more unemployment related queries while ZIP codes with low Black population (< 12%) have expressed 1,320% more. b, Across the seven census variables, ZIP codes with high Black, high Hispanic, and low income populations experience greater changes in unemployment related queries during this first month. c, When we examine search queries that led to clicks in state unemployment sites, we see a second surge in August, with ZIP codes with high Hispanic population (≥ 18%) experiencing more than double the change in clicks in state unemployment sites compared to ZIP codes with low Hispanic population (< 18%). d, We observe that ZIP codes with high Black and Hispanic populations experience greater change in clicks in unemployment sites during the month of August, but ZIP codes with low educational attainment express less change in clicks in unemployment sites. e, Search queries related to financial stimulus were at their peak in late April, right after the time that the first stimulus checks were deposited on April 11. f, However, throughout the year and especially during the four weeks since mid-April, ZIP codes with high Black population experienced smaller change in financial stimulus related queries than ZIP codes with low Black population. The COVID-19 pandemic brought a rapid and massive digital transformation to lives as mandated lockdowns forced people to transform and reimagine traditional interpersonal connections (e.g., going to school, getting food, or meeting friends) into virtual and digital ones. Unfortunately, digital inequalities worsen social and material deprivations and perpetuate existing disadvantages into a "digital vicious cycle" 2, 64 . 
To understand the impact of the pandemic on reinforcing this vicious cycle, we investigate two classes of digital resources particularly influenced when traditional in-person access was impossible or severely limited: online remote learning and online food delivery services. Statewide mandates in the US required many schools to close in-person learning as early as March 16, 2020 65 , and school districts scrambled to implement remote learning alternatives. Many parents, students, and teachers turned to free online resources such as Khan Academy to fill the gaps temporarily or permanently 66 . There were also reported disparities in access to technologies or live virtual learning as well as absenteeism that stymied low income students 67 . When we examined search queries that result in visits to free online learning resources (e.g., coursera.org, khanacademy.org), we found that ZIP codes with lower income and higher Hispanic population exhibited only half to two-thirds of the increase (percentage point difference 95% CI [−227, −109] and [−202, −46] respectively) in those queries relative to their counterpart groups (Fig. 3a) . If a person made one search-led click to online learning sites per month before the pandemic, that person would make 5 such clicks per month during the pandemic, but only 3 such clicks would be made if that person was from a ZIP code with lower income or higher Hispanic population, even after controlling for internet access (Fig. 3b ). ZIP codes with high Black population and high population density show a similar trend. Even though these free online learning resources are designed to be accessible and flexible, helping students to go at their own pace, we find that low income and highly diverse subpopulations did not leverage them at the same level as their counterpart groups during the pandemic. 
In addition, school districts in low SES neighborhoods were more likely to be closed during the pandemic and less equipped to provide remote learning or at-home assignments, greatly reducing opportunities for both in-person and online learning for students with negative educational outcomes 68, 69 . Our findings suggest that there exists digital exclusion and the unintended consequences of the public health policies that perpetuate a myriad of disadvantages, as education is such a crucial factor in digital literacy, socioeconomic status, and health. COVID-19 fundamentally changed people's purchasing and spending behaviors, as many of the restaurants, stores, and non-essential businesses were closed to in-person shopping 70 . Spending on food delivery and groceries also increased significantly during the pandemic, with more people eating at home with higher utilization of online e-commerce platforms for accessing food and groceries 70, 71 . When we examine search queries for online food delivery (e.g., "grocery delivery", "deliver food"), we find that online food delivery queries increased by over 500% for ZIP codes with lower Black population while those with higher Black population only increased by over 170% (percentage point difference 95% CI [−382, −188], Fig. 3c,d) . We found similar lessened engagement in online food delivery searches for lower income and higher Hispanic subpopulations (95% CI [−200, −29] and [−140, −24] respectively, Fig. 3d ). These findings could be explained by the fact that low income subpopulations receive and seek more food assistance and tend to eat food away from home less frequently 72 and that such online food delivery services may not be accessible because they accompany higher markup and delivery surcharge. 
ZIP codes with low education subpopulations also experienced 301 percentage point more increase (95% CI [167, 419]) in queries for seeking food assistance (e.g., "Supplemental Nutrition Assistance Program", "help with food stamps", "free and reduced lunch", Fig. 2e,f) relative to their high education counterparts. Unfortunately, those that relied on these traditional food assistance programs were left with severely limited choices during the pandemic because these programs do not extend to online purchase or delivery services 73 . Our findings highlight a potential gap between the increased food assistance need, as illustrated by the increase in the online information seeking behavior about food assistance, and the ability to actually procure food goods through online food purchase and delivery services.
Figure 3. Disparities in shifting to digital resources. Engagement in critical online resources necessary during prolonged lockdown or school/business closures was lower for ZIP codes with low income and high racial/ethnic diversity. a, Online learning sites played a significant role in filling in the gaps introduced by school closures at the beginning of the pandemic with over 200% increase in engagement. b, However, ZIP codes with lower income and higher Black population tend to access online learning resources less. In the new academic year (after August), while the low income group continued to show lower engagement, ZIP codes with higher Black population show slightly higher engagement in online learning sites. c, With mandated restrictions on social gatherings, populations have transitioned to online-mediated social activities during the pandemic. d, For ZIP codes with higher population density, where lockdown measures were more strictly enforced due to higher case and mortality rates 63 , changes in online social activities search were higher. However, we see that ZIP codes with higher Hispanic population show less change in online social engagement, even after controlling for population density or internet access, indicating a potential racial/ethnic barrier or preference in accessing online social activities. e, Online food delivery services were also in high demand due to restaurant closures, with over 250% increase at the beginning of the pandemic. f, However, ZIP codes with higher Black population and lower education showed smaller change in online food delivery search throughout the pandemic, regardless of population density, income, or internet access.
This version posted September 17, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
This study provides quantitative evidence that the current COVID-19 pandemic was associated with a widening of disparities in access, engagement, and utilization. Prior studies have shown that access to digital resources and information and incorporation of such digital technologies in everyday lives from childhood are crucial for gaining upwards mobility 3 . Although SES is an important factor in disparities in digital access, prior research has shown that SES also impacts the levels of web expertise and utilizing those resources in more information-seeking activities 17 . Low SES populations suffer from the lack of training and educational support to build the necessary skills to make efficient use of digital access and tools 13 , highlighting that simply making the internet more accessible may not level the playing field 74 .
In the context of the current COVID-19 pandemic, where digital access and resources became more critical due to prolonged at-home isolation and restrictions on in-person activities, people from low SES backgrounds may experience the compounding effects of multiple potential disadvantages. Our findings are consistent with such compounding effects. We find that lack of internet access is associated with reduced engagement in web unemployment resources, implying that internet access is a crucial ingredient of economic stability and is most important for those experiencing significant economic burden. Furthermore, we discover that low income and high Hispanic populations are taking less advantage of web health resources, presenting similarly unfortunate consequences of eHealth initiatives that may disproportionately benefit digitally advantaged subpopulations 3 . Additionally, the findings confirm that low SES populations fell behind in the digital shift catalyzed by the pandemic 2 , as low income and more diverse subpopulations did not leverage online learning or online food delivery resources as much as their high income and less diverse counterpart subpopulations. We note the inherent limitations of studying digital engagement using digitally obtained data: This and other studies with online data can inadvertently exclude those who leave no or very little digital footprint 3 . Our information sources provide signals about levels of activity, but we cannot study details of changes in types of access if there is no engagement. Our analysis is also limited to the footprint of Bing as one of several search engines used for online information access, and Bing's user population may not be fully representative of the United States population. Our study carefully controls for internet access, as measured by the census, during the analysis such that any observed effects cannot be explained by differences in population internet access.
Our approach combines search log data with socioeconomic and environmental variables that are routinely captured through census tracks to examine the influence of such census variables on a potentially diverse array of topic areas of digital engagement at population scales. Our observed changes can only be attributed to ZIP code levels and not individuals because individual-level SDoH factors are not available and to preserve anonymity. Like any retrospective observational study, the potential for unobserved or uncontrolled confounding prevents us from making causal claims. However, we adjusted for observed confounding through a matching-based and difference-in-differences based methodology (Methods). Our data cannot be used to discern whether different access behaviors are due to the lack of web expertise, the lack of awareness of the information value, or the lack of intangible resources like time or energy. Thus, we see value in follow-up, small-scale focused studies aimed at contextualizing individuals' experiences of the crisis and measuring the effects of population-specific interventions 2 . These population-specific interventions could include education around web expertise or digital know-how or may include non-digital methods, because traditional methods (e.g., text messaging, handouts) have been shown to work better for low SES populations 67 . Although the SDoH factors and outcomes reviewed in our analysis may not be the only variables of interest, our matching-based approach provides methodological robustness relative to traditional univariate analyses in observational studies by controlling for observed covariates 75, 76 .
Future research aimed at understanding digital disparities, therefore, must acknowledge the correlations between different SES, race/ethnicity, and social determinants of health 77 and leverage methods that embrace examining the intersectionality 78 . This study presents a new methodology for digital disparities research by demonstrating that web search logs can be harnessed to deliver key insights about the impact of global crises on widening of digital disparities. Our observational study design is able to scale to a large population (billions of queries by millions of people) to quantify the disparities in digital engagements. Building on prior disparities research that advocated for comprehensive look at SES factors including race/ethnicity 53, 77 , our study emphasizes the inclusion of a broad set of factors and outcomes representative of the SDoH. Through the lens of SDoH factors, our findings highlight under-served subpopulations that may be struggling to overcome the economic burdens through online financial assistance and unemployment resources, that may be facing barriers in maintaining the necessary level of online information access for health, education, and food, and for whom to target public health interventions to prevent further widening of digital disparities. Our source data set consists of a random sample of 57 billion de-identified search interactions in the United States from years 2019 and 2020 from Microsoft's Bing search engine. Each search interaction includes the search query string, URLs of all subsequent clicks from the search result page, timestamp, and ZIP code from reverse IP lookup. We excluded search interactions from ZIP codes with less than 100 queries per month so as to preserve anonymity. All data were deidentified, aggregated to ZIP code levels or higher, and stored in a way to preserve the privacy of the users and in accordance to Bing's Privacy Policy. 
Our study was approved by the Microsoft Research Institutional Review Board (IRB).

Our goal is to characterize the impact of socioeconomic status on digital engagement outcomes. To account for well-known issues associated with residential segregation and socioeconomic disparities 13, 79, we use ZIP codes as our geographic unit of analysis. We leveraged available ZIP code level American Community Survey estimates using the Census Reporter API 80 in order to characterize the ZIP codes in our data set. Because of the multidimensional nature of socioeconomic status and its association with health outcomes, it is important to include relevant socioeconomic factors 77. Therefore, we examined eight census variables that represent all five categories of the social determinants of health (SDoH) defined by the US Department of Health and Human Services 52 to cover a broad range of socioeconomic and environmental factors. Under Healthcare Access and Quality, we included the percentage of population with health insurance coverage (Table B27001). Under Education Access and Quality, we included the percentage of population that attained a Bachelor's degree or higher (Table B15002). Under Social and Community Context, we included the percentage of population with Hispanic origin (Table B03003) and the percentage of population that is Black or African American alone (Table B02001). Under Economic Stability, we included the median household income (Table B19013) and the percentage of the civilian labor force that is unemployed (Table B23025). Under Neighborhood and Built Environment, we included the percentage of population with broadband or dial-up internet subscription (Table B28003) and the population density. We computed per-ZIP code population density by joining area measurements from ZIP Code Tabulation Areas Gazetteer Files 81 and
total population (Table B01003). We joined search interaction data with the above SDoH factors on ZIP codes and excluded ZIP codes that did not have either search interactions or census data. The resulting 55 billion search interactions covered web search traffic from 25,150 ZIP codes in the US, and these ZIP codes represent 97.2% of the total US population. Supplementary Table S1 provides per-ZIP code summary statistics of our dataset.

We leverage interactions with search engines to obtain signals about digital engagement, where everyday human needs are expressed or fulfilled through a digital medium, in our case Bing 1. To gain a nuanced understanding of these search interactions, we categorize each search interaction into topics such as health access, economic stability, and education access, using detectors on query strings and subsequently clicked URLs that are based on regular expressions and basic propositional logic. For example, we compute the proportion of search queries that contain health condition keywords such as cancer, diabetes, or coronavirus to quantify the level of engagement in health information seeking behaviors. In another case, we examine search queries that result in subsequent clicks to state unemployment benefit sites to quantify the level of engagement in unemployment benefits. Supplementary Table S4 enumerates the categories we examined with example query strings, URLs, and regular expressions.
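The detector idea described above can be illustrated with a minimal Python sketch. This is not the study's implementation: the patterns, the category names, and the host `unemployment.example-state.gov` are hypothetical stand-ins, and the actual expressions are those listed in Supplementary Table S4. The sketch only shows how regular expressions on the query string can be combined with simple Boolean logic and with checks on clicked URLs.

```python
import re
from urllib.parse import urlparse

# Hypothetical patterns for illustration only (not the study's expressions).
HEALTH_CONDITION = re.compile(r"\b(cancer|diabetes|coronavirus)\b", re.IGNORECASE)
FOOD = re.compile(r"\b(food|grocery|groceries|restaurant)\b", re.IGNORECASE)
DELIVERY = re.compile(r"\bdeliver(y|ed|s)?\b", re.IGNORECASE)

# Hypothetical host list standing in for state unemployment benefit sites.
STATE_UNEMPLOYMENT_HOSTS = {"unemployment.example-state.gov"}


def categorize(query, clicked_urls):
    """Return the set of (hypothetical) category labels for one search interaction."""
    categories = set()
    # Keyword detector on the query string.
    if HEALTH_CONDITION.search(query):
        categories.add("health_condition_query")
    # Propositional logic: both a food term AND a delivery term must appear.
    if FOOD.search(query) and DELIVERY.search(query):
        categories.add("food_delivery_query")
    # Click detector: any subsequent click landing on a listed host.
    hosts = {urlparse(url).netloc.lower() for url in clicked_urls}
    if hosts & STATE_UNEMPLOYMENT_HOSTS:
        categories.add("state_unemployment_click")
    return categories


def category_proportion(interactions, category):
    """Fraction of (query, clicked_urls) interactions that fall in a category."""
    if not interactions:
        return 0.0
    hits = sum(1 for query, urls in interactions if category in categorize(query, urls))
    return hits / len(interactions)
```

Keeping each detector as a small named pattern makes it straightforward to audit which queries a category captures, which matters when the categories are later compared across ZIP code groups.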
From these search interactions, we want to estimate the disparate impact of the pandemic on digital engagement behaviors across SDoH factors in a way that controls for yearly and weekly seasonality and for pre-existing, pre-pandemic disparities, in order to highlight where disparities have worsened throughout the pandemic period of March 16, 2020 to December 27, 2020. To do this, we first use a difference-in-differences method 54 to correct for seasonality and volume variations. After we categorize each search interaction with our categories of interest, we count and aggregate them per calendar day and per ZIP code (Fig. 1a). We compute the proportion of the total query volume represented by each category at this daily increment to remove undesired variations in query volume over time (Fig. 1b). We denote the digital engagement at time $t$ in category $c$ as the fraction of the total number of queries at time $t$: $E(t, c) = N(t, c)/N(t)$. From this, we control for yearly seasonal variations by subtracting the digital engagement of 2019 from that of 2020: $E(t_{2020}, c) - E(t_{2019}, c)$. People tend to behave differently on weekends, and we observed a 7-day periodicity in our data, sometimes known as the "weekend effect" 82. Therefore, when comparing two years, it is important to account for the weekend effect. In order to highlight actual differences that are not explained by weekend mismatches across years, we aligned the day of the week between both years (i.e., Monday, January 6, 2020 is aligned to Monday, January 7, 2019). In addition, we ensured that our comparison analysis included all seven days of the week (i.e., we look at means across one or multiples of a full week) (Fig. 1c).
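As a small illustration of the engagement fraction and the day-of-week alignment described above, the following Python sketch computes $E(t, c) = N(t, c)/N(t)$ and pairs each 2020 date with a 2019 date. The function names are hypothetical, and the 52-week shift is an assumption: it reproduces the example given in the text (Monday, January 6, 2020 paired with Monday, January 7, 2019), but the text does not state the alignment rule explicitly.

```python
from datetime import date, timedelta


def engagement(category_count, total_count):
    """E(t, c) = N(t, c) / N(t): share of a day's queries that fall in category c."""
    return category_count / total_count if total_count else 0.0


def aligned_2019_day(day_2020):
    """Map a 2020 date to the 2019 date exactly 52 weeks earlier (assumed rule).

    Shifting by whole weeks keeps the weekday identical, so weekends are
    compared with weekends.
    """
    return day_2020 - timedelta(weeks=52)


def yearly_difference(e_2020, e_2019, day_2020):
    """E(t_2020, c) - E(t_2019, c) for a weekday-aligned pair of days.

    e_2020 and e_2019 are dicts mapping date -> engagement fraction.
    """
    return e_2020[day_2020] - e_2019[aligned_2019_day(day_2020)]
```

For instance, `aligned_2019_day(date(2020, 1, 6))` gives `date(2019, 1, 7)`, and both dates are Mondays.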
Finally, to compute the change in digital engagement during the pandemic, starting from the declaration of the US national emergency on March 16, 2020, we subtract the query proportions between January 6, 2020 and February 23, 2020, a time period we define as the "pre-pandemic baseline" (Fig. 1d). Even though the national emergency was declared three weeks later, we use February 23, 2020 as the cut-off because individual states declared a state of emergency at different times between February 29 and March 15 of 2020, and because this avoids partial weeks in our analysis. This process results in the change in digital engagement most likely attributable to the pandemic. Our estimate of the relative change in digital engagement in category $c$ between before and during the pandemic is defined as:

Or the relative percentage change in digital engagement $C_{perc}$ is expressed as: × 100

Next, we aggregate these changes in digital engagement across two comparison ZIP code groups for each SDoH factor. For example, if we are examining the impact of having low income on the changes in digital engagement during the pandemic, we compare the average change in digital engagement of the low-income ZIP codes with the average change of the high-income ZIP codes (Fig. 1d). Thus, we operationalize digital disparities attributable to a single socioeconomic or environmental factor by quantifying the differences in these changes in search behaviors between two subpopulations delineated by that factor (Fig. 1).
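The baseline subtraction described above can be sketched in Python as follows. The displayed formulas for the change and for $C_{perc}$ are not legible in this text, so the sketch rests on two labeled assumptions: the change is taken as the mean weekday-aligned 2020 minus 2019 difference during the pandemic period minus the same mean over the pre-pandemic baseline, and the percentage form divides by the mean 2020 engagement in the baseline period. Function and parameter names are hypothetical.

```python
from statistics import mean


def did_change(yearly_diffs_pandemic, yearly_diffs_baseline):
    """Difference-in-differences change in engagement for one category.

    Each argument is a list of daily values E(t_2020, c) - E(t_2019, c),
    covering whole weeks so that every weekday is equally represented
    (the baseline of January 6 to February 23, 2020 spans 7 full weeks).
    """
    return mean(yearly_diffs_pandemic) - mean(yearly_diffs_baseline)


def percent_change(change, baseline_engagement):
    """Express the change as a percentage of a baseline engagement level.

    ASSUMPTION: baseline_engagement is the mean 2020 engagement over the
    pre-pandemic baseline; the source states only that changes are reported
    as percentages of the pre-pandemic baseline, where 0% means no change.
    """
    return change / baseline_engagement * 100.0
```

Under these assumptions, a category whose yearly difference was 0.25 during the baseline and 0.5 during the pandemic, with a baseline engagement of 0.5, would show a change of 0.25, or 50%.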
In our analysis, we report the change in digital engagement as a percentage of the pre-pandemic baseline, $C_{perc}$, where 0% denotes no change. We report the disparities in digital engagement between two comparison ZIP code groups as the percentage point difference, where 0 denotes no difference (Fig. 1e,f). We formalize disparities in digital engagement in category $c$ during the pandemic between high-risk ZIP code group $g_{high}$ and low-risk ZIP code group $g_{low}$ as the difference $C_{perc}(c, g_{high}) - C_{perc}(c, g_{low})$. To obtain non-parametric confidence intervals, we conducted bootstrapping with replacement during this aggregation step (N=500). Supplementary Figures S2-S15 illustrate percent changes in each query category for each of two matched groups and their differences in percentage points across all SDoH factors.

Our goal is to quantitatively estimate the independent impact of one socioeconomic factor on digital engagement while controlling for other factors, to understand if and how a single factor independently influences digital behaviors during a global crisis such as the COVID-19 pandemic. Specifically, we are interested in the impact of the eight SDoH factors: median household income, % unemployed, % with insurance, % with Bachelor's degree or higher, population density, % Black population, % Hispanic population, and % with internet access. We use % with internet access primarily to control for the level of digital access, because internet access is necessary for web search. Many of the socioeconomic and racial variables are known to be correlated 53, 77, 83. This means that a univariate analysis of outcomes along one SDoH factor would likely be confounded by multiple other variables. In fact, within our dataset, we observed high correlation among many of the SDoH factors examined (Supplementary Table S3).
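Returning to the percentage-point disparity and the bootstrap step described earlier in this passage, a minimal Python sketch is given below. It is an illustration under stated assumptions rather than the study's code: it resamples ZIP codes independently within each group, uses a percentile interval at the 95% level, and fixes a random seed for repeatability. The source specifies only bootstrapping with replacement and N=500, so the resampling unit and interval type are assumptions.

```python
import random
from statistics import mean


def disparity(high_risk_changes, low_risk_changes):
    """Percentage-point difference: mean change (high risk) - mean change (low risk)."""
    return mean(high_risk_changes) - mean(low_risk_changes)


def bootstrap_ci(high_risk_changes, low_risk_changes, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the disparity (assumed interval type).

    Inputs are per-ZIP-code percent changes for each matched group.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping its original size.
        high = rng.choices(high_risk_changes, k=len(high_risk_changes))
        low = rng.choices(low_risk_changes, k=len(low_risk_changes))
        estimates.append(disparity(high, low))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper
```

A positive disparity means the high-risk group increased its engagement more than the low-risk group; an interval that excludes 0 suggests the difference is unlikely to be a resampling artifact.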
As an example of these correlations, median household income of the ZIP codes in our dataset is negatively correlated with the percentage of Black population (Pearson r = −0.23) and positively correlated with internet access (Pearson r = 0.66). Comparing high- and low-income groups without considering other factors would result in two groups with uneven distributions of race and internet access, among many other factors. Therefore, it is important to consider these factors jointly and adequately control for SES factors when analyzing outcome disparities 53, 77. To disentangle the individual independent contributions of these SDoH factors when measuring disparities in online information access between groups, we employ a matching-based approach 75, 76, designed to create a comparable set of groups with similar covariate distributions. Because of the high degree of spatial segregation in the US 13, 79, matching every ZIP code can be challenging. For example, for every ZIP code with low income and high Black population, it is difficult to find a unique ZIP code with high income and high Black population. Therefore, we perform one-to-one matching of ZIP codes with replacement and achieve better matches (i.e., lower bias). Theoretically, this is at the expense of higher variance, but given the size of our dataset, this downside was not a problem in practice. We use the MatchIt package 84 with the nearest neighbor method and Mahalanobis distance measure to perform the matching. For each of the SDoH factors, we first split all available ZIP codes into treatment and control groups using a threshold.
We use a value close to the median to split the population into two groups for median household income ($55,224), % unemployed (3.0%), % with insurance (92.7%), % with internet access (81.8%), and % with Bachelor's degree or higher (21.1%) because the mean and median of those factors across the ZIP codes are similar. In other cases, the distributions across the ZIP codes are highly skewed. For race/ethnicity, we use the rounded percentage of the national population for that race/ethnicity (12% for Black and 18% for Hispanic populations). For population density, we follow previous practices of urban-rural classification (500 people per square mile) 85. Supplementary Tables S1 and S2 outline descriptive statistics of our ZIP codes across SDoH factors as well as the national average and our chosen cutoff thresholds. We consistently defined the treatment group as "high-risk", that is, typically suffering from disadvantaged SES and digital inequalities 58, so the treatment groups are low income, high racial/ethnic diversity, low education, high unemployment rate, low insurance rate, low internet access, and high population density. For example, for income, we split the ZIP codes into a high-income group (median household income > $55,224) and a low-income group (median household income ≤ $55,224), where the low-income group is the treatment group. Then, for each treatment ZIP code, we look for a control (i.e., "low-risk") ZIP code that closely matches it on all other SDoH factors (i.e., |SMD| < 0.25) to generate a matched pair of ZIP codes. We performed this matching on all ZIP codes, and we discarded ZIP codes for which we could not find a good match. As demonstrated in Supplementary Table S6, this process retains at least 99.8% of the treatment ZIP codes, so discarding a ZIP code is a rare exception.
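The split-then-match procedure above can be sketched in plain Python. The study itself used the R package MatchIt; the code below is a stand-in written for illustration, not a port of that package. It assumes the covariance matrix is estimated from the treated and control units pooled together, omits the caliper and the discarding of poor matches, and uses hypothetical function names. Each treated unit is paired with its nearest control under Mahalanobis distance, and a control may be reused (matching with replacement).

```python
def covariance(rows):
    """Sample covariance matrix of a list of equal-length covariate vectors."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(k)] for i in range(k)]


def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (fails on a singular matrix)."""
    k = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(k)] for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def mahalanobis_sq(x, y, inv_cov):
    """Squared Mahalanobis distance (x - y)^T S^-1 (x - y)."""
    d = [a - b for a, b in zip(x, y)]
    k = len(d)
    return sum(d[i] * inv_cov[i][j] * d[j] for i in range(k) for j in range(k))


def match_with_replacement(treated, control):
    """For each treated unit, return the index of its nearest control unit."""
    # ASSUMPTION: covariance pooled over all units; other choices are possible.
    inv_cov = invert(covariance(treated + control))
    return [min(range(len(control)),
                key=lambda j: mahalanobis_sq(t, control[j], inv_cov))
            for t in treated]
```

Here `treated` and `control` would hold, for each ZIP code, the SDoH factors other than the one used for the split. Because controls can be reused, a well-placed control ZIP code may serve several treated ZIP codes, which is what lowers bias at the cost of variance.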
To gauge whether two ZIP code groups are similar across the SDoH factors and to determine the quality of matching while minimizing potential confounding effects of these factors, we leverage the Standardized Mean Difference (SMD) across ZIP code groups as our measure of comparative quality. The SMD quantifies the degree to which two groups differ and is computed as the difference in means of a variable across two groups divided by the standard deviation of one group (often, the treated group) 76, 86, 87. In our analysis, we use |SMD| < 0.25 across all our SDoH factors as the criterion to determine that two groups are comparable, following common practice 75, 87. For example, when we split our ZIP codes in half along median household income to create a high-income ZIP code group (median household income > $55,224) and a low-income ZIP code group (median household income ≤ $55,224) and examine the SMD of other SDoH factors, we find that all SDoH factors except % Hispanic population and population density fail to meet the matching criterion of |SMD| < 0.25 prior to matching. This means that low-income ZIP codes are more likely to have less internet access, lower educational attainment, less health insurance, more unemployment, and a higher Black population. We perform this evaluation process for all comparison groups and find that correlations among all SDoH factors pose threats to validity in univariate analyses. Supplementary Table S5 summarizes the mean SMD if we were to directly compare two ZIP code groups created by splitting the ZIP codes along the chosen split boundaries. Instead of such direct comparison, we perform matching and tune the caliper of the matching algorithm to determine a good match and to meet the |SMD| < 0.25 criterion between the two comparison groups across all covariates.
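The balance check described above is simple enough to sketch directly. The Python below computes the SMD with the treated group's standard deviation in the denominator, as the text suggests is common, and applies the |SMD| < 0.25 criterion to every covariate. Using the sample (n − 1) standard deviation and the dictionary-based interface are assumptions made for illustration.

```python
from statistics import mean, stdev


def smd(treated_values, control_values):
    """Standardized mean difference, scaled by the treated group's sample SD."""
    return (mean(treated_values) - mean(control_values)) / stdev(treated_values)


def is_balanced(treated_covariates, control_covariates, threshold=0.25):
    """True if every covariate satisfies |SMD| < threshold.

    Both arguments map a covariate name to the list of values in that group.
    """
    return all(
        abs(smd(treated_covariates[name], control_covariates[name])) < threshold
        for name in treated_covariates
    )
```

Running the check before and after matching mirrors the pre- and post-matching balance assessments: a split that fails the criterion on several covariates before matching should pass it on all of them afterwards.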
Supplementary Table S6 summarizes the result of the matching operation, with the maximum |SMD| between two ZIP code groups being below 0.25 for all SDoH factors, which ensures comparability across all covariates. Supplementary Tables S7-S22 enumerate pre- and post-matching balance assessments between groups for each SDoH factor. After identifying treatment and control ZIP code groups with comparable distributions along all SDoH factors, we compare the outcomes (i.e., constructs of digital engagement such as online access to health condition information) between the matched ZIP code groups. This matching process estimates the Average Treatment Effect on the Treated (ATT), for example the effect of having low income on digital engagement while removing plausible contributions from all other observed factors. Due to the segregation and inequalities revealed by these factors, estimating the Average Treatment Effect (ATE) is practically impossible. One may opt to compute a local average treatment effect (LATE) and discard a large fraction of the U.S. population. However, such local estimates can easily mislead when the underlying population is not well understood, and they fail to capture key populations in our study (e.g., low income and high Black populations). The ATT estimates in this study provide actionable insights on the effects of being at high risk (e.g., low income, low education, high racial/ethnic diversity) that can be used to suggest interventions to mitigate or reduce risk.

Raw US census data are publicly available through the Census Reporter API (https://censusreporter.org/). Geographical area measurements are available through the US Census Bureau (https://www.census.gov/geographies/reference-files/2010/geo/state-area.html). Seasonally adjusted US unemployment claims data for 2020 are available through the US Department of Labor (https://oui.doleta.gov/unemploy/claims.asp).
The Bing data that support the findings of this study are available on request from the corresponding author with a clear justification and a license agreement. The Bing data are not publicly available.

Source code used for processing and analysis of the data is available on request from the corresponding author.

https://about.zearn.org/press-releases/zearn-provides-real-time-snapshot-on-the-state-of-u-s-math-education-through-new-oi-economic-tracker-by-opportunity-insights (Accessed 2021-08-17).
28. Yglesias, M. Reopening schools safely is going to take much more federal leadership. https://www.vox.com/2020/7/8/21314563/school-reopening-testing-money (Accessed 2021-08-17) (2020).
32. Desilver, D. Not all unemployed people get unemployment benefits; in some states, very few do. https://www.pewresearch.org/fact-tank/2020/04/24/not-all-unemployed-people-get-unemployment-benefits-in-some-states-very-few-do/ (Accessed 2021-08-17) (2020).

Supplementary Information
Figure S1. Percent change in the unemployment related queries in Bing (shaded) and the reported unemployment claims from the US Department of Labor (line, https://oui.doleta.gov/unemploy/claims.asp) compared to pre-pandemic baseline. Y-axis: change since before pandemic (%).
Figure S6. Percent change in 'Clicks to state unemployment sites' between two matched groups across eight SDoH factors.

[Figure panel (h): Percent change in 'Clicks to online learning sites' between two matched groups across '% Unemployed', with groups split at 3% unemployed. Y-axis: change since before pandemic (%). Annotations: US National Emergency, Pre-Pandemic Baseline.]

Figure S11. Differences in percentage points for changes in 'Click to online learning sites' between two matched groups across eight SDoH factors.

[Figure panel (h): Percent change in 'Food delivery related queries' between two matched groups across '% Unemployed', with groups split at 3% unemployed. Y-axis: change since before pandemic (%). Annotations: US National Emergency, Pre-Pandemic Baseline.]
[Figure panel (h): Percent change in 'Food assistance related queries' between two matched groups across '% Unemployed', with groups split at 3% unemployed. Y-axis: change since before pandemic (%). Annotations: US National Emergency, Pre-Pandemic Baseline.]

Figure S15. Differences in percentage points for changes in 'Food assistance related queries' between two matched groups across eight SDoH factors.

Population-scale study of human needs during the covid-19 pandemic: Analysis and implications
COVID-19 and digital inequalities: Reciprocal impacts and mitigation strategies
Digital inequalities and why they matter. Information, communication & society 18
Second-level digital divide: Mapping differences in people's online skills
Digital inequality: Differences in young adults' use of the Internet
How does household spending respond to an epidemic? Consumption during the 2020 COVID-19 pandemic
Covid-19 and the demand for online food shopping services: Empirical evidence from Taiwan
America's eating habits: food away from home
Pandemic reveals vulnerabilities in food access: confronting hunger amidst a crisis
Impact of the digital divide in the age of COVID-19
Matching methods for causal inference: A review and a look forward
An introduction to propensity score methods for reducing the effects of confounding in observational studies
Socioeconomic status in health research: one size does not fit all
Race, socioeconomic status and health: Complexities, ongoing challenges and research opportunities
Racial residential segregation: a fundamental cause of racial disparities in health
American Community Survey 5-year estimates
ZIP Code Tabulation Areas
Examining repetition in user search behavior
Socioeconomic status and health in blacks and whites: the problem of residual confounding and the resiliency of race

[Figure panels: Differences in percentage points for changes in 'Financial assistance related queries' between two matched groups across eight SDoH factors. One panel per factor, each plotted as high risk minus low risk: Education level (attained BA, 21%), % Insurance coverage (93%), % Internet access (82%), % Hispanic population (18%), Income (55,224), Population density (500), % Black population (12%), % Unemployed (3%).]

[Figure panels: Differences in percentage points for changes in 'Clicks to online learning sites' between two matched groups, one panel per SDoH factor, each plotted as high risk minus low risk.]

[Figure panels: Differences in percentage points for changes in 'Food delivery related queries' between two matched groups across eight census variables, one panel per factor, each plotted as high risk minus low risk.]

[Figure panels: Differences in percentage points for changes in 'Food assistance related queries' between two matched groups, one panel per SDoH factor, each plotted as high risk minus low risk.]

We thank E. Pierson, the University of Washington Behavioral Data Science Group, Microsoft Research Human Understanding and Empathy Group, and participants at seminars and talks for their support and comments.

Author contributions statement
J.S., E.H., R.W., and T.A. were involved with the conceptualization of the study, and contributed to the design and refinement of the methodology. J.S. conducted data collection and analysis. All authors interpreted the data, drafted the manuscript, and critically contributed to the important intellectual content of the manuscript.

The authors declare no competing interests.

The study was approved by the Microsoft Research Institutional Review Board (IRB).

Figure S5. Differences in percentage points for changes in 'Unemployment related queries' between two matched groups across eight SDoH factors.