Robust Causal Inference for Incremental Return on Ad Spend with Randomized Paired Geo Experiments
Aiyou Chen and Timothy C. Au
2019-08-08

Abstract. Evaluating the incremental return on ad spend (iROAS) of a prospective online marketing strategy (i.e., the ratio of the strategy's causal effect on some response metric of interest relative to its causal effect on the ad spend) has become increasingly more important. Although randomized "geo experiments" are frequently employed for this evaluation, obtaining reliable estimates of iROAS can be challenging as oftentimes only a small number of highly heterogeneous units are used. Moreover, advertisers frequently impose budget constraints on their ad spends, which further complicates causal inference by introducing interference between the experimental units. In this paper, we formulate a novel statistical framework for inferring the iROAS of online advertising from randomized paired geo experiments, which further motivates and provides new insights into Rosenbaum's arguments on instrumental variables, and we propose and develop a robust, distribution-free and interpretable estimator, "Trimmed Match", as well as a data-driven choice of the tuning parameter which may be of independent interest. We investigate the sensitivity of Trimmed Match to some violations of its assumptions and show that it can be more efficient than some alternative estimators based on simulated data. We then demonstrate its practical utility with real case studies.

Keywords and phrases: effect ratio, interference, heterogeneity, studentized trimmed mean.

1. Introduction. Similar to traditional offline media such as television, radio and print, the primary goal of online advertising is to help promote the selling of goods and services. Despite these shared goals, however, online advertising has been the leading source of advertising revenue in the United States since 2016 (Interactive Advertising Bureau, 2018). Goldfarb and Tucker (2011) attribute this success of online advertising to its superiority over other media in terms of its measurability and targetability.
A prospective online marketing strategy (e.g., expanding the list of search keywords on which to advertise) is frequently evaluated in terms of its incremental return on ad spend (iROAS), that is, the ratio of the strategy's causal effect on some response metric of interest relative to its causal effect on the ad spend. Here the response metric of interest may be, for example, revenue from online sales, offline sales, or overall sales that may be affected by the ad. Indeed, this evaluation has become progressively more important as advertisers increasingly seek to optimize the impact of their marketing decisions; in theory, this evaluation is facilitated by large-scale randomized experiments (i.e., "A/B tests") which randomize users to different ad serving conditions (Goldfarb and Tucker, 2011; Johnson, Lewis and Nubbemeyer, 2017). In practice, however, privacy concerns which restrict the collection of user data, as well as technical issues such as cookie churn and multiple device usage, have made it hard to maintain the integrity of a randomized user experiment in the online advertising context (Gordon et al., 2019). Consequently, observational studies remain an area of active research for estimating the causal impact of online marketing strategies (e.g., Varian (2016), Sapp et al. (2017), Chen et al. (2018), and the references therein), although the empirical studies of Lewis and Rao (2015) and Gordon et al. (2019) continue to suggest caution when using observational methods despite these recent advances.
Indeed, randomized experiments are still regarded as the "gold standard" for causal inference (Imbens and Rubin, 2015) and, to mitigate some of the challenges of user-level experimentation, advertisers frequently instead employ randomized "geo experiment" designs (Vaver and Koehler, 2011) which partition a geographic region of interest into a set of nonoverlapping "geos" (e.g., Nielsen Media Research's 210 Designated Market Areas which subdivide the United States) that are regarded as the units of experimentation rather than the individual users themselves.
More formally, let G be the set of geos in a target population where, for a geo g ∈ G, we let (S_g, R_g) ∈ R^2 denote its observed bivariate outcome with ad spend S_g and response R_g. Following the Neyman-Rubin causal framework, we denote geo g's potential outcomes under the control and treatment ad serving conditions as (S_g^(C), R_g^(C)) and (S_g^(T), R_g^(T)), respectively, where we can only observe one of these two bivariate potential outcomes for each geo g. Therefore, relative to the control condition, there are two unit-level causal effects caused by the treatment condition for each geo g, the incremental ad spend and the incremental response, which are defined by
ΔS_g = S_g^(T) − S_g^(C)  and  ΔR_g = R_g^(T) − R_g^(C).
However, advertisers frequently find the iROAS, i.e. the ratio of incremental response to incremental ad spend, to be a more informative and actionable measure of advertising performance:
(1.1)  θ_g = ΔR_g / ΔS_g  for g ∈ G.
Following Kerman, Wang and Vaver (2017) and Kalyanam et al. (2018), and letting |·| denote the set cardinality function, the overall iROAS with respect to G can be defined as the ratio of the average incremental response to the average incremental ad spend:
(1.2)  θ* = ( Σ_{g∈G} ΔR_g / |G| ) / ( Σ_{g∈G} ΔS_g / |G| ),
which is the parameter of primary interest in this paper. In a randomized experiment, one can use the group difference to obtain unbiased estimates of the average incremental response and average incremental ad spend.
The ratio of these group differences then gives an empirical estimate of θ*:
(1.3)  θ̂^(emp) = ( Σ_{g∈T} R_g / |T| − Σ_{g∈C} R_g / |C| ) / ( Σ_{g∈T} S_g / |T| − Σ_{g∈C} S_g / |C| ),
where T and C denote the sets of geos in treatment and in control, respectively. Similar empirical estimates have been commonly used for effect ratios in other applications, e.g., the incremental cost-effectiveness ratio which summarizes the cost-effectiveness of a health care intervention (Chaudhary and Stearns, 1996; Bang and Zhao, 2014).
However, geo experiments often introduce some additional complexity which makes the causal estimation of the iROAS more difficult. The complexity can be attributed mostly to two sources of interference: 1) spillover effects (e.g., from consumers traveling across geo boundaries), and 2) budget constraints on ad spend. The existence of interference, if not handled properly, may invalidate traditional causal inference which relies on the "stable unit treatment value assumption" (SUTVA), that is, the presumption that the treatment applied to one unit does not affect the outcome of another unit (Rubin, 1980). Interference due to the first source can be controlled to a large extent by using bigger geographically clustered regions as the experimental units (Vaver and Koehler, 2011; Rolnick et al., 2019) and, therefore, such interference is assumed to be ignorable in this paper. As a consequence, however, geo experiments frequently involve only a small number of geos (Vaver and Koehler, 2011) where the distributions of {S_g : g ∈ G} and {R_g : g ∈ G} can be very heavy-tailed and, as a result, the empirical estimator defined in equation (1.3) can be very unreliable.
Interference due to the second source is more subtle and less controllable. Although advertisers have some control over the online marketing strategies that they employ (e.g., which search keywords to advertise on, the maximal price they are willing to pay to show their ads, etc.), they are competing in a dynamic ecosystem where the actual delivery of their online ads is determined by advertising platforms which run auctions and use machine learning models to optimize the ad targeting in real time to maximize key performance indicators such as clicks, site visits, and purchases (Varian, 2009; Johnson, Lewis and Nubbemeyer, 2017). Thus, for a given budget constraint for geos in the treatment group (or the control group), these advertising platforms may choose to allocate ad spend to one geo at the expense of others which, in turn, introduces interference in the responses observed in each geo. We discuss later that this interference can be handled naturally for various kinds of geo experiments, unlike the interference that has recently been studied elsewhere (e.g., in medical science and social networks); see Rosenbaum (2007), Luo et al. (2012), Athey, Eckles and Imbens (2018) and references therein.
The key contributions of this paper are: 1) the formulation of a novel statistical framework for inferring the iROAS θ* from randomized paired geo experiments, a framework which embeds the complex nature of small sample sizes, data heavy-tailedness, and interference due to budgetary constraints; 2) the proposal and development of a robust and distribution-free estimator, "Trimmed Match", with a simple interpretation; and 3) extensive simulations and real case studies demonstrating that Trimmed Match can be more efficient than some alternative estimators.
The rest of this paper is organized as follows. We first provide the background on geo experiments in the online advertising context in Section 2.
Afterwards, we formulate a statistical framework for inferring θ* in a randomized paired geo experiment in Section 3. Under this framework and in the spirit of Rosenbaum (1996, 2002), we review a few distribution-free estimators based on common test statistics in Section 4, and we propose the Trimmed Match estimator in Section 5. Section 6 introduces a data-driven choice of the trim rate. Simulations demonstrating the robustness and efficiency of Trimmed Match are presented in Section 7, and some real case studies illustrating the performance of Trimmed Match in practice are shown in Section 8. Finally, Section 9 concludes with some suggestions for future research. Fast computation of Trimmed Match is presented in Appendix A. Trimmed Match has been applied for advertiser studies at Google; the Python library for the implementation is available at GitHub (Chen, Longfils and Best, 2020).
Geo experiments are now a standard tool for the causal measurement of online advertising at Google; see, for example, Blake, Nosko and Tadelis (2015), Ye et al. (2016), and Kalyanam et al. (2018). There has been some related work in terms of estimating the effectiveness of online advertising with geo experiments but, to the best of our knowledge, all work to date has been model-based. Notably, after introducing the concept of a randomized geo experiment design for online advertising, Vaver and Koehler (2011) proceed to analyze it using a two-stage weighted linear regression approach called Geo-Based Regression (GBR). The first stage fits a linear predictive model for the geo-level potential control ad spend using data from the control group, where pre-experimental ad spend is used as the regressor and model weights are used to try to account for the heteroscedasticity caused by differences in geo size. The second stage of GBR fits a regression for the response variable, where pre-experimental response and incremental ad spend are used as regressors, again with model weights to try to account for heteroscedasticity; here the incremental ad spend is 0 by definition for geos in the control group, while for geos in the treatment group it is inferred as the difference between the observed ad spend and the counterfactual ad spend predicted from the first stage, and the iROAS parameter is the coefficient of incremental ad spend. More recently, to address situations where there are only a few geos available for experimentation, Kerman, Wang and Vaver (2017) propose a Time-Based Regression (TBR) approach which uses a constrained version of the Bayesian structural time series model of Brodersen et al. (2015) to estimate the overall incremental response for the treatment group: the control group's time series is used to contemporaneously model the treatment group's "business as usual" behavior prior to the experiment, and is then used in conjunction with the trained model to predict what the treatment group's "business as usual" counterfactual would have been had the experiment not occurred.
However, it can be shown that these methods rely on some strong modeling assumptions that are often hard to justify in practice. For GBR, the result can be quite sensitive to the choice of weights and, furthermore, even if the geo-level incremental ad spends are known, unlike more recent regression adjustment models (Lin, 2013), its second stage regression may still suffer from the endogeneity problem (incremental ad spend may correlate with the residual) despite randomness in the treatment assignment.
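To make the two-stage structure of GBR concrete, the following is a minimal sketch of a GBR-style analysis. It is not the exact specification of Vaver and Koehler (2011): the single pre-period regressor in each stage, the simple inverse-pre-period weights, and the variable names are illustrative assumptions.

```python
import numpy as np
import statsmodels.api as sm


def gbr_style_iroas(pre_spend, spend, pre_resp, resp, is_treat):
    """Two-stage weighted regression in the spirit of GBR (illustrative sketch only).

    pre_spend, spend, pre_resp, resp: 1-d float arrays over geos (pre-period and
    experiment-period ad spend and response); is_treat: boolean array over geos.
    """
    # Illustrative weights intended to account for heteroscedasticity due to geo size.
    w = 1.0 / np.maximum(pre_resp, 1e-8)

    # Stage 1: predict counterfactual control-condition spend from pre-period spend,
    # fitting on control geos only.
    ctrl = ~is_treat
    X1 = sm.add_constant(pre_spend[ctrl])
    fit1 = sm.WLS(spend[ctrl], X1, weights=w[ctrl]).fit()
    counterfactual_spend = fit1.predict(sm.add_constant(pre_spend))

    # Incremental spend: 0 for control geos by definition; observed minus predicted
    # counterfactual for treatment geos.
    delta_spend = np.where(is_treat, spend - counterfactual_spend, 0.0)

    # Stage 2: regress response on pre-period response and incremental spend;
    # the coefficient of incremental spend is the iROAS estimate.
    X2 = sm.add_constant(np.column_stack([pre_resp, delta_spend]))
    fit2 = sm.WLS(resp, X2, weights=w).fit()
    return fit2.params[-1]
```

The sensitivity of such an estimate to the choice of weights, noted above, is one of the practical concerns that motivates the design-based approach developed in the remainder of the paper.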
As a natural extension of GBR, one might attempt to fit the heterogeneous geo-level causal effects on the response and ad spend separately using parametric or nonparametric models (Bloniarz et al., 2016; Wager and Athey, 2018; Kunzel et al., 2019). Besides requiring a larger sample size, however, this may not be straightforward since budget constraints on ad spend may break the assumption of independent measurements behind these models. Meanwhile, TBR assumes a stable linear relationship between the contemporaneous "business as usual" time series of the treatment group and the control group, carrying over from the pre-experimental period into the experimental period. But this is an untestable assumption and may not hold in practice. Given the temporal dynamics (e.g., the COVID-19 pandemic as an extreme case), it is important to have a method which is robust and does not rely on any fragile and untestable modeling assumptions. Such a method is especially desirable when it needs to be built into a product to serve many geo experiments seamlessly.
We first consider the scenario where there is no budgetary constraint so that there is no interference between geos. Recall from (1.1) that a geo g's unit-level iROAS θ_g is defined as the ratio of its incremental response to its incremental ad spend. Rearranging the terms in this definition leads to the following lemma, which serves as the basis for our statistical framework.
LEMMA 1. For any geo g ∈ G, R_g^(T) − θ_g S_g^(T) = R_g^(C) − θ_g S_g^(C).
Lemma 1 implies that the quantity
Z_g ≡ R_g − θ_g S_g,
which is not observable due to the unknown θ_g, remains the same regardless of whether geo g is assigned to treatment or control. Loosely speaking, the quantity Z_g measures geo g's "uninfluenced response", that is, the part of g's baseline response due to, for example, seasonality in the market demand, which is not influenced by its ad spend.
As previously discussed in Section 1, budgetary constraints may introduce complex interference and thus a violation of SUTVA, since the ad spend allocated to one geo may come at the expense of others; therefore the experimental outcome (S_g, R_g) for each geo g may be affected by the treatment assignment of other geos. In particular, for a design on n matched pairs, there are 2^n possible assignments, each associated with its own potential outcome vector of length 2n, where the realized potential outcome vector depends on the materialized assignment; see for instance Hudgens and Halloran (2008) for a detailed formulation. But in light of Lemma 1, in this paper we assume for notational simplicity a relaxed version of SUTVA when there is interference due to budgetary constraints, which is formally described as Assumption 0 below.
ASSUMPTION 0. The uninfluenced response Z_g for any geo g, introduced by Lemma 1, is invariant to both its own treatment assignment and the treatment assignment of other geos.
It is not hard to verify that under Assumption 0, the parameters θ_g in (1.1) are well defined. Assumption 0 trivially holds when there is no interference (e.g., if the advertiser had no budget constraints, or if the total actual ad spend was below the pre-specified budget constraint). More generally, by the decomposition
R_g = Z_g + θ_g S_g,
it is important to note that under Assumption 0, Z_g is not affected by the geo assignment, and thus any interference introduced by budgetary constraints only affects the response through the magnitude of the materialized ad spend S_g and the iROAS θ_g. In other words, this specifies a simple linear form which quantifies how the interference due to budgetary constraints affects the measurements.
Consequently, Assumption 0 may still hold in less trivial situations with budget constraints, such as:
• When the potential ad spend under the control condition is known to be 0 for every geo (e.g., showing online ads under the treatment condition versus not showing ads under the control condition), or when the potential ad spend under the treatment condition is known to be 0 for every geo (e.g., "go dark" experiments where no ads are shown under the treatment condition (Blair and Kuse, 2004)).
• When advertisers use their budget constraint to pre-specify each geo's potential ad spend under the control condition (but not necessarily under the treatment condition), or when advertisers pre-specify each geo's potential ad spend under the treatment condition (but not necessarily under the control condition).
These scenarios encompass many of the geo experiment designs used in practice and, consequently, we assume that Assumption 0 holds in the remainder of this paper.
PROPOSITION 1. In a completely randomized geo experiment, under Assumption 0, the distribution of R_g − θ_g S_g is the same between the treatment group and the control group.
TABLE 1. Description of the notation used for the ith pair of geos:
(S_i1, R_i1): observed ad spend and response for the 1st geo;
(S_i2, R_i2): observed ad spend and response for the 2nd geo;
A_i: indicator of which geo receives treatment and which receives control;
ε_i(θ): difference in the "uninfluenced responses" with respect to θ.
Proposition 1, whose proof directly follows from Assumption 0, provides a general framework for inferring the unit-level iROAS {θ_g : g ∈ G} (e.g., by parameterizing θ_g with geo-level features) and the overall target population iROAS θ* by simplifying the bivariate causal inference problem to a single dimension. In this paper, we formulate a statistical framework specifically for inferring θ* by assuming, just as Vaver and Koehler (2011) do, that the unit-level iROAS θ_g are all identical.
ASSUMPTION 1. θ_g = θ* for all geos g ∈ G.
Although the rigorous verification of Assumption 1 is beyond the scope of this paper, our sensitivity analysis in Section 7 suggests that estimates of θ* can still be reliable even if the unit-level iROAS θ_g moderately differ, while our hypothesis tests in Section 8 indicate that this assumption is compatible with the data observed from several real case studies.
Following the recommendations of Vaver and Koehler (2011), in the remainder of this paper we consider a randomized paired design where 2n geos are matched into n pairs prior to the experiment such that, within each pair, one geo is randomly selected for treatment and the other geo for control. Let A_i be the random assignment for the two geos in the ith pair with P(A_i = 1) = P(A_i = −1) = 1/2, where A_i = 1 indicates that the 1st geo receives treatment and the 2nd geo receives control, while A_i = −1 indicates that the 2nd geo receives treatment and the 1st geo receives control. Let (S_i1, R_i1) and (S_i2, R_i2) be the observed spend and response values for the 1st geo and the 2nd geo, respectively, in the ith pair. Let X_i and Y_i be the observed differences in the ad spends and responses, respectively, between the treatment geo and the control geo in the ith pair, that is,
(3.1)  X_i = A_i (S_i1 − S_i2)  and  Y_i = A_i (R_i1 − R_i2).
Table 1 lists the notation and definitions used for the ith pair of geos. For any θ ∈ R, define the pair-level difference in the "uninfluenced responses" with respect to θ as
(3.2)  ε_i(θ) = Y_i − θ X_i.
PROPOSITION 2. In a randomized paired geo experiment, under Assumptions 0 and 1, the quantities ε_1(θ*), ..., ε_n(θ*) are mutually independent and each ε_i(θ*) has a distribution that is symmetric about 0.
PROOF. Let Z_i1 and Z_i2 be the uninfluenced responses associated with the two geos in the ith pair, i.e. Z_ij = R_ij − θ* S_ij for j = 1, 2.
Then we have
ε_i(θ*) = Y_i − θ* X_i = A_i [ (R_i1 − θ* S_i1) − (R_i2 − θ* S_i2) ] = A_i (Z_i1 − Z_i2).
Under Assumption 0, Z_i1 and Z_i2 are non-random quantities and are invariant to any treatment assignment of the 2n geos. The conclusion follows immediately since {A_i : i = 1, ..., n} are i.i.d. and each A_i is symmetric about 0.
Proposition 2 provides a general framework that facilitates the estimation of θ*: regardless of how complicated the bivariate distribution of {(R_g, S_g) : g ∈ G} may be, we can always reformulate the causal inference problem in terms of a simpler univariate "location" problem that is defined in terms of the symmetric distribution of each ε_i(θ*). By Proposition 2, the average of {ε_i(θ*) : 1 ≤ i ≤ n} is expected to be 0, so by setting
(1/n) Σ_{i=1}^n ε_i(θ) = 0
and then solving for θ, we arrive at the following estimator for θ*:
(3.3)  θ̂^(emp) = Σ_{i=1}^n Y_i / Σ_{i=1}^n X_i,
which coincides with, and also further motivates, the empirical estimator given in (1.3) with |T| = |C|. However, recall from our discussions in Section 1 that the empirical estimator may be unreliable when the bivariate distribution of {(R_g, S_g) : g ∈ G} is heavy-tailed.
Although the iROAS estimation problem is fundamentally different from the classical location problem studied extensively in the statistics literature, the reformulation of the problem in terms of the symmetry of the ε_i(θ*) values about 0 facilitates the application of robust statistical methods to address the heterogeneity issue of geo experiments. For conciseness, we only consider three such techniques in this paper and leave the exploration of other robust statistical methods to future work; we refer the reader to Tukey and McLaughlin (1963), Lehmann (2006), and Huber and Ronchetti (2009) for a comprehensive overview of such techniques. Specifically, we first briefly review the application of the binomial sign test and the Wilcoxon signed-rank test in Section 4. Afterwards, in Section 5, we develop a robust and more easily interpretable estimator based on the trimmed mean, and we demonstrate its efficiency in practice through simulations and real case studies presented in Sections 7 and 8.
A similar statistic in the form of (3.2) was first proposed and studied by Rosenbaum (1996, 2002) to generalize an instrumental variable argument made by Angrist, Imbens and Rubin (1996), but in a different context and without the concern of interference as studied in this paper. In this section, we review two distribution- and covariate-free estimators of θ* along the same lines, and refer to Rosenbaum (2020) for a more comprehensive overview. For any θ ∈ R, let M_n(θ) be a statistic for testing symmetry where, in the case of the binomial sign test, we have
M_n(θ) = Σ_{i=1}^n I(ε_i(θ) > 0) − n/2,
with ε_i(θ) given by (3.2) and where I(·) is the indicator function, while in the case of Wilcoxon's signed-rank test we have
M_n(θ) = Σ_{i=1}^n sgn(ε_i(θ)) · rank(|ε_i(θ)|),
with sgn(·) and rank(·) denoting the sign and rank functions, respectively. We refer the reader to Lehmann (2006) for additional details on tests of symmetry.
PROPOSITION 3. Under Assumptions 0 and 1, the test statistic M_n(θ*) for either the binomial sign test or Wilcoxon's signed-rank test follows a known distribution that is symmetric about 0.
Proposition 3, whose proof follows directly from Proposition 2, allows us to construct a 100(1 − α)% confidence interval for θ*: if we let q_{1−α/2} be the (1 − α/2) quantile of the distribution of M_n(θ*) under Assumption 1, then we identify the minimal interval containing all θ ∈ R satisfying |M_n(θ)| ≤ q_{1−α/2}. Note, however, that M_n(θ) = 0 may not always have a root. Following Rosenbaum (1996), the point estimator of θ* can therefore be defined as the average of the smallest and largest values of θ that minimize |M_n(θ)|, that is,
θ̂ = ( inf Θ_M + sup Θ_M ) / 2,  where Θ_M = arg min_{θ∈R} |M_n(θ)|.
In the remainder of this paper, we let θ̂^(binom) and θ̂^(rank) denote the estimators which correspond to the binomial sign test statistic and Wilcoxon's signed-rank test statistic, respectively.
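For concreteness, the following is a minimal sketch of how θ̂^(emp), θ̂^(binom) and θ̂^(rank) can be computed from the paired differences. The function names, the centering of the sign statistic at n/2, and the padded grid search over candidate values of θ are implementation conveniences assumed here for illustration; they are not prescriptions from the paper.

```python
import numpy as np
from scipy.stats import rankdata


def eps(theta, x, y):
    """Residuals eps_i(theta) = y_i - theta * x_i, as in (3.2)."""
    return y - theta * x


def emp_estimate(x, y):
    """Empirical estimator (3.3): ratio of summed response and spend differences."""
    return y.sum() / x.sum()


def sign_stat(theta, x, y):
    """Binomial sign statistic, centered so that it is symmetric about 0 under H0."""
    e = eps(theta, x, y)
    return np.sum(e > 0) - len(e) / 2.0


def rank_stat(theta, x, y):
    """Wilcoxon signed-rank statistic of the residuals."""
    e = eps(theta, x, y)
    return np.sum(np.sign(e) * rankdata(np.abs(e)))


def test_based_estimate(stat, x, y, n_grid=10001):
    """Average of the smallest and largest theta minimizing |M_n(theta)|,
    approximated on a grid (assumes the x_i are nonzero)."""
    lo, hi = np.min(y / x), np.max(y / x)
    pad = 0.5 * (hi - lo) + 1e-12          # heuristic padding of the search range
    grid = np.linspace(lo - pad, hi + pad, n_grid)
    vals = np.abs([stat(t, x, y) for t in grid])
    argmins = grid[vals == vals.min()]
    return 0.5 * (argmins.min() + argmins.max())
```

For example, `test_based_estimate(sign_stat, x, y)` and `test_based_estimate(rank_stat, x, y)` give grid approximations to θ̂^(binom) and θ̂^(rank), respectively.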
5. The "Trimmed Match" Estimator. In this section, we derive an important estimator for θ* based on the trimmed mean under Proposition 2.
5.1. Point Estimate. For a randomized paired geo experiment, let {(X_i, Y_i) : 1 ≤ i ≤ n} be as defined in (3.1) and, for any θ ∈ R, let {ε_i(θ) : 1 ≤ i ≤ n} be as defined in (3.2), with the corresponding order statistics given by
ε_(1)(θ) ≤ ε_(2)(θ) ≤ ... ≤ ε_(n)(θ).
For a fixed value λ ∈ [0, 1/2), the trimmed mean statistic as a function of θ is defined as
(5.1)  ε̄_{nλ}(θ) = ( 1 / (n − 2m) ) Σ_{i=m+1}^{n−m} ε_(i)(θ),
where m ≡ ⌈nλ⌉ is the minimal integer greater than or equal to nλ. Here λ is a tuning parameter which is commonly referred to as the trim rate and, in order for (5.1) to be well defined, λ must satisfy n − 2m ≥ 1 so that trimming does not remove all n data points. We first develop an estimator for a fixed λ and defer discussion of the choice of λ to Section 6.
By Proposition 2, ε̄_{nλ}(θ*) has an expected value of 0, so we can estimate θ* by solving
(5.2)  ε̄_{nλ}(θ) = 0.
When multiple roots of (5.2) exist, we choose the one at which the untrimmed residuals {ε_i(θ) : i ∈ I} are least dispersed, where I is the set of n − 2m untrimmed indices of the ε_i(θ) used in the calculation of ε̄_{nλ}(θ̂^(trim)_λ) and thus I depends on θ̂^(trim)_λ; we denote the resulting point estimate by
(5.4)  θ̂^(trim)_λ.
Moreover, since ε̄_{nλ}(θ̂^(trim)_λ) = 0, the point estimate satisfies
(5.5)  θ̂^(trim)_λ = Σ_{i∈I} Y_i / Σ_{i∈I} X_i.
Note that if the two geos in the ith pair are perfectly matched in terms of the uninfluenced response, then ε_i(θ*) = 0. Therefore, θ̂^(trim)_λ has a nice interpretation: it trims the poorly matched pairs in terms of the ε_i(θ*) values and estimates θ* using only the well-matched untrimmed pairs. Consequently, in this paper, we refer to θ̂^(trim)_λ as the "Trimmed Match" estimator.
It is worth emphasizing that Trimmed Match directly estimates θ* without having to estimate either the incremental response or the incremental spend. Moreover, the point estimate is calculated after trimming the pairs that are poorly matched in terms of the ε_i(θ̂^(trim)_λ) values, rather than the pairs which are poorly matched with respect to the differences in their response Y_i or ad spend X_i. Indeed, consider an alternative trimmed estimator which does not directly estimate θ*, but instead first separately calculates a trimmed mean estimate of the average incremental response and a trimmed mean estimate of the average incremental ad spend, and then takes their ratio:
θ̃_λ = Σ_{i∈I_Y} Y_i / Σ_{i∈I_X} X_i,
where the sets I_Y and I_X denote the indices of the untrimmed pairs used for estimating the incremental response and the incremental ad spend, respectively, and where the two sets will generally not be identical. Note, however, that this is not a desirable estimator for θ* since its numerator and denominator may not even yield an unbiased estimate of either the average incremental response or the average incremental spend, respectively, as neither {Y_i : 1 ≤ i ≤ n} nor {X_i : 1 ≤ i ≤ n} is expected to follow a symmetric distribution even if all of the geo pairs are perfectly matched in terms of their "uninfluenced responses".
Finally, it is also interesting to note the connection between trimming poorly matched pairs and the theory presented in Small and Rosenbaum (2008), which shows that a smaller study with a stronger instrument is likely to be more powerful and less sensitive to biases than a larger study with a weaker instrument.
These arguments were later supported empirically by Baiocchi et al. (2010), who studied a similar effect ratio in the form of (1.2) and showed that optimally removing about half of the data in order to define fewer pairs with similar pre-treatment covariates but with a stronger instrument resulted in shorter confidence intervals and more reliable conclusions. Our Trimmed Match method also identifies and trims poorly matched pairs, but does not rely on pre-treatment covariates.
5.2. Confidence Interval. Define the studentized trimmed mean statistic (Tukey and McLaughlin, 1963) with respect to {ε_i(θ) : 1 ≤ i ≤ n} as
(5.6)  T_{nλ}(θ) = ε̄_{nλ}(θ) / ŝ_{nλ}(θ),
where ŝ_{nλ}^2(θ) is the winsorized variance estimate for ε̄_{nλ}(θ), computed around the winsorized mean of the ε_i(θ)'s, i.e. the mean obtained after replacing the m smallest and m largest values of ε_i(θ) by ε_(m+1)(θ) and ε_(n−m)(θ), respectively. The Trimmed Match confidence interval is constructed by determining the minimal interval containing all θ ∈ R satisfying
(5.7)  |T_{nλ}(θ)| ≤ c,
where the threshold c is chosen such that P(|T_{nλ}(θ*)| ≤ c) = 1 − α. Under mild conditions, T_{nλ}(θ*) approximately follows a Student's t-distribution with n − 2m − 1 degrees of freedom, and we therefore set c to be the (1 − α/2) quantile of this distribution. Alternatively, one can also choose the threshold c by using Fisher's randomization test approach (see, for example, Rosenbaum (2002) and Ding, Feller and Miratrix (2016)), relying on the fact that the distribution of ε_i(θ*) is symmetric about zero for each i.
However, when constructing the confidence interval, it is also important to recognize that the trim rate λ is unknown in practice. Later, in Section 6, we propose a data-driven estimate of this trim rate which can be plugged in to construct the confidence interval, although such an interval may suffer from undercoverage in finite samples as it ignores the uncertainty associated with estimating this tuning parameter (Ding, Feller and Miratrix, 2016). Interestingly, however, our numerical studies in Section 7 suggest that the empirical coverage of the confidence intervals constructed using the estimated trim rates is often quite close to the nominal level even when n is small, a finding which is consistent with the observation that the studentized trimmed mean belongs to the class of "less vulnerable confidence and significance procedures" for the classical location problem (Tukey and McLaughlin, 1963).
6. Data-driven Choice of the Trim Rate λ. For the location problem, Jaeckel (1971) proposed minimizing the empirical estimate of the asymptotic variance when choosing the trimmed mean's trim rate λ, while Hall (1981) proved the general consistency of this approach. Similarly, we could choose the trim rate λ for Trimmed Match by minimizing an estimate of the asymptotic variance of θ̂^(trim)_λ, which can be derived by assuming an independent sample for (X, Y). This may apply to the scenario with no budget constraints, but not to the scenario with budget constraints. We note that the essential idea of Jaeckel (1971) is to choose the trim rate by minimizing the uncertainty of the point estimate as measured by an approximate variance. To handle both scenarios, without and with budget constraints, we extend this idea and propose to choose the trim rate by minimizing the uncertainty of the point estimate as measured by the width of the 100(1 − α_0)% two-sided confidence interval previously defined in Section 5.2, where α_0 ∈ (0, 1) is pre-specified. Although different levels of α_0 can be used, our numerical studies in Section 7 suggest α_0 = 0.5 to be a reliable choice in terms of its mean squared error performance for both light- and heavy-tailed distributions. Hereafter we use λ̂ to denote this data-driven trim rate using α_0 = 0.5.
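The sketch below illustrates the construction of the studentized trimmed-mean confidence interval and the data-driven choice of λ. The particular winsorized-variance convention (a standard Tukey-McLaughlin form), the grid search, and the candidate set of trim rates are assumptions made for illustration; they are not the exact formulas of (5.6) or of the Trimmed Match library.

```python
import numpy as np
from scipy.stats import t as student_t


def trimmed_t_stat(theta, x, y, lam):
    """Studentized trimmed mean of eps_i(theta); one common convention, which may
    differ in details from the paper's exact definition of (5.6)."""
    e = np.sort(y - theta * x)
    n = len(e)
    m = int(np.ceil(n * lam))
    trimmed = e[m:n - m]
    # Winsorize: replace the m smallest / largest values by the nearest kept value.
    wins = np.concatenate([np.repeat(trimmed[0], m), trimmed, np.repeat(trimmed[-1], m)])
    swin2 = np.sum((wins - wins.mean()) ** 2) / (n - 1)   # winsorized variance
    se = np.sqrt(swin2 * n) / (n - 2 * m)                 # standard error of the trimmed mean
    return trimmed.mean() / se


def trimmed_ci(x, y, lam, alpha=0.1, n_grid=20001):
    """Minimal theta-interval with |T(theta)| <= t-quantile, via a grid approximation."""
    n = len(x)
    m = int(np.ceil(n * lam))
    c = student_t.ppf(1 - alpha / 2, df=n - 2 * m - 1)
    lo, hi = np.min(y / x), np.max(y / x)
    pad = 0.5 * (hi - lo) + 1e-12
    grid = np.linspace(lo - pad, hi + pad, n_grid)
    ok = [th for th in grid if abs(trimmed_t_stat(th, x, y, lam)) <= c]
    return (min(ok), max(ok)) if ok else (np.nan, np.nan)


def data_driven_lambda(x, y, candidates=(0.0, 0.05, 0.10, 0.15, 0.20), alpha0=0.5):
    """Pick the trim rate whose (1 - alpha0) confidence interval is narrowest,
    following the idea of Section 6."""
    widths = [np.diff(trimmed_ci(x, y, lam, alpha=alpha0))[0] for lam in candidates]
    return candidates[int(np.argmin(widths))]
```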
7. Simulations. In this section, we present several numerical simulations which evaluate the performance and sensitivity of the Trimmed Match estimator θ̂^(trim)_λ defined by (5.4). We consider interference between experimental units due to the presence of a budget constraint on the total incremental ad spend. To meet the requirement of Assumption 0, we consider scenarios where the potential ad spend under the control condition is pre-specified for each geo and is not affected by any geo assignment, but the potential ad spend under the treatment condition may vary subject to the geo assignment, as discussed in Section 3. In particular, for simulations where Assumption 1 holds, we investigate how the choice of the trim rate λ affects the performance of θ̂^(trim)_λ and, more broadly, we compare its performance against the empirical estimator θ̂^(emp) given by (3.3), as well as the binomial sign test estimator θ̂^(binom) and the Wilcoxon signed-rank test estimator θ̂^(rank) defined in Section 4. Meanwhile, for simulations where Assumption 1 is violated, we investigate how the level of deviation from Assumption 1 affects the performance of these estimators.
For each simulation scenario, we first simulate the size of each geo g = 1, 2, ..., 2n as an independent draw from a distribution F, where F controls the amount of geo heterogeneity in the population and is taken to be either a half-normal distribution, a log-normal distribution, or a half-Cauchy distribution. The geos are then paired based on these sizes (the largest two geos form a pair, the third and fourth largest geos form a pair, and so on), and afterwards the geos are randomized within each pair, which determines whether a geo's control or treatment potential outcome is observed. We list the detailed simulation steps as follows.
Step 1. Potential ad spend and response under the control condition are generated for each g ∈ {1, ..., 2n} according to a nonlinear relationship (as functions of the geo sizes) which is not affected by the geo assignment. Let B be the pre-specified budget for the total incremental ad spend, where r > 0 is a parameter controlling the intensity of the incremental ad spend.
Step 2. Given a geo assignment, denoted by (T, C) for the treatment and control groups respectively, the incremental ad spend for each g ∈ T is proportional to its potential control spend:
ΔS_g = B × S_g^(C) / Σ_{g'∈T} S_{g'}^(C),  for g ∈ T.
Step 3. Observed ad spend and response: for each g ∈ C, S_g = S_g^(C) and R_g = R_g^(C); for each g ∈ T, S_g = S_g^(C) + ΔS_g and R_g = R_g^(C) + θ_g × ΔS_g, where θ_g = θ_0 × (1 + δ × (−1)^g) is the iROAS for geo g ∈ {1, 2, ..., 2n}, with δ ∈ [0, 1] controlling the level of deviation from Assumption 1.
The overall iROAS θ* as defined in (1.2) can be rewritten as
θ* = Σ_{g=1}^{2n} θ_g ΔS_g / Σ_{g=1}^{2n} ΔS_g,
which may not be well defined any more when the potential outcomes depend on the geo assignment. When Assumption 1 holds, i.e. θ_g ≡ θ_0 for all g ∈ {1, ..., 2n}, then θ* ≡ θ_0 for any assignment. When Assumption 1 is violated, to get around the ill-definition, we may consider a virtual experiment where all 2n geos are assigned to treatment with a doubled total incremental budget, i.e. 2B, and where the incremental budget for each geo is still proportional to its potential control spend. It is easy to show that for this virtual experiment ΔS_g = 0.5 × r × S_g^(C) for each g, and that the overall iROAS then simplifies to the well-defined static quantity
θ* = Σ_{g=1}^{2n} θ_g S_g^(C) / Σ_{g=1}^{2n} S_g^(C),
which will be treated as the source of truth.
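A rough code sketch of Steps 1-3 is given below. The half-normal size distribution is one of the three choices mentioned above, but the specific nonlinear response curve, the budget formula, and the constants are illustrative stand-ins rather than the paper's actual specification.

```python
import numpy as np

rng = np.random.default_rng(0)


def simulate_paired_experiment(n=50, r=1.0, theta0=10.0, delta=0.0):
    """Structure of the simulation steps (functional forms are illustrative only)."""
    # Geo sizes drawn from a heterogeneity distribution F (half-normal here).
    size = np.sort(np.abs(rng.normal(size=2 * n)))[::-1]   # largest geos first

    # Step 1: potential control spend/response via an illustrative nonlinear curve,
    # plus a pre-specified incremental budget scaled by the intensity r.
    s_ctrl = size
    r_ctrl = 20.0 * size ** 1.2
    budget = 0.25 * r * s_ctrl.sum()   # chosen so treated geos get roughly 0.5*r*S^(C)

    # Geo-level iROAS, allowing deviation from Assumption 1.
    g = np.arange(1, 2 * n + 1)
    theta_g = theta0 * (1 + delta * (-1.0) ** g)

    # Randomize within pairs: geos (0,1), (2,3), ... form the matched pairs.
    a = rng.choice([1, -1], size=n)          # A_i = 1: first geo of pair i is treated
    treated = np.zeros(2 * n, dtype=bool)
    treated[0::2] = a == 1
    treated[1::2] = a == -1

    # Step 2: incremental spend for treated geos, proportional to potential control spend.
    delta_s = np.zeros(2 * n)
    delta_s[treated] = budget * s_ctrl[treated] / s_ctrl[treated].sum()

    # Step 3: observed outcomes.
    s_obs = s_ctrl + delta_s
    r_obs = r_ctrl + theta_g * delta_s

    # Paired differences (3.1): treated geo minus control geo within each pair.
    x = a * (s_obs[0::2] - s_obs[1::2])
    y = a * (r_obs[0::2] - r_obs[1::2])
    return x, y
```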
To summarize, the simulation parameters which are allowed to vary from scenario to scenario are the number of geo pairs n, the distribution F controlling the amount of geo heterogeneity, the iROAS scale θ_0, the intensity of the incremental ad spend r, and the level of deviation δ from Assumption 1. Within each scenario, we then simulate K = 10,000 random assignments, a process that determines which bivariate outcome (S_g, R_g) is actually observed for each geo g, and also the observed differences (X_i, Y_i) as defined in (3.1) for each geo pair i. Note that this assignment mechanism is the only source of randomness within each of our simulations. We summarize the results for n = 50 and θ_0 = 10 for each scenario reported in this section, although we note that other simulation parameters (e.g. n = 25) yielded similar conclusions.
The performance of an estimator's point estimate θ̂ is evaluated in terms of its root mean square error
RMSE = sqrt( (1/K) Σ_{k=1}^K ( θ̂^(k) − θ* )^2 )
and its bias
Bias = (1/K) Σ_{k=1}^K θ̂^(k) − θ*,
where θ̂^(k) is the estimated value of θ* from the kth replicate. Meanwhile, the performance of an estimator's 100(1 − α)% confidence interval (θ̂_{α/2}, θ̂_{1−α/2}) is measured in terms of its one-sided power
Power = (1/K) Σ_{k=1}^K I( θ̂^(k)_{α/2} > 0 )
and its two-sided empirical coverage
Coverage = (1/K) Σ_{k=1}^K I( θ̂^(k)_{α/2} ≤ θ* ≤ θ̂^(k)_{1−α/2} ),
where (θ̂^(k)_{α/2}, θ̂^(k)_{1−α/2}) denotes the confidence interval from the kth replicate.
7.1. Performance Comparison When Assumption 1 Holds. We first fix δ = 0 (i.e. Assumption 1 holds) to investigate the performance of the estimators as we vary the geo heterogeneity F ∈ {Half-Normal, Log-Normal, Half-Cauchy} and the incremental ad spend intensity r ∈ {0.5, 1, 2}. Table 2 summarizes the simulation results in terms of each estimator's RMSE and bias. Here we see that the RMSE and bias of every estimator improve as the intensity of the incremental ad spend r increases, and we note that the rank test based estimator θ̂^(rank) is generally more efficient than the sign test based estimator θ̂^(binom), but both of them perform poorly relative to the Trimmed Match estimator θ̂^(trim)_λ. In addition, recall that the Trimmed Match estimator θ̂^(trim)_λ coincides with the empirical estimator θ̂^(emp) when the trim rate λ = 0. Thus, if we focus specifically on the performance of the Trimmed Match estimator, we see that some level of trimming can be beneficial when the geo sizes are generated from the more heterogeneous Log-Normal and Half-Cauchy distributions. It is interesting to note that the data-driven choice λ̂ generally performs better than a fixed choice of the trim rate λ ∈ {0, 0.10}, even for the Half-Normal scenario at r = 0.5. In Table 2, for each scenario, RMSEs greater than 2 times the RMSE of the best-performing estimator are colored in red. Clearly, among all the estimators, θ̂^(trim)_λ̂ is the most robust across all three distributions.
Meanwhile, Table 3 summarizes the power and empirical coverage of each estimator's accompanying 90% confidence interval. Although the table suggests that the Trimmed Match estimator with the data-driven estimate λ̂ of the trim rate can suffer slightly from undercoverage, a result which agrees with our discussions in Section 5.2, we note that this estimator also provides considerably more power than θ̂^(emp) and θ̂^(binom) when there is a low (r = 0.5) or moderate (r = 1.0) level of incremental ad spend, more power than θ̂^(rank) when there is a low (r = 0.5) level of incremental ad spend, and is quite competitive in the other scenarios.
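In code, these Monte Carlo summaries can be computed as follows. Interpreting the one-sided power as the fraction of replicates whose lower confidence bound excludes 0 is an assumption consistent with the description above, not a formula quoted from the paper.

```python
import numpy as np


def evaluate(estimates, ci_lowers, ci_uppers, theta_true):
    """RMSE, bias, one-sided power and two-sided coverage over K replicates."""
    est = np.asarray(estimates)
    lo, hi = np.asarray(ci_lowers), np.asarray(ci_uppers)
    return {
        "rmse": np.sqrt(np.mean((est - theta_true) ** 2)),
        "bias": np.mean(est) - theta_true,
        "power": np.mean(lo > 0),                                  # lower bound excludes 0
        "coverage": np.mean((lo <= theta_true) & (theta_true <= hi)),
    }
```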
7.2. Sensitivity Analysis When Assumption 1 is Violated. We now fix r = 1.0 (a moderate level of incremental ad spend) and evaluate the performance of the estimators when Assumption 1 is violated, that is, when the geo-level iROAS are no longer all the same. Instead, in these simulations, half of the geos have an iROAS of θ_0(1 − δ) while the other half have an iROAS of θ_0(1 + δ), according to Step 3. Figure 1 compares the performance of the estimators θ̂^(emp), θ̂^(binom), θ̂^(rank) and θ̂^(trim)_λ̂ in terms of a scaled RMSE. Here we see that θ̂^(trim)_λ̂ significantly outperforms θ̂^(binom) and θ̂^(rank) in all scenarios. The empirical estimator θ̂^(emp) slightly outperforms the Trimmed Match estimator in the case of the half-normal distribution; however, the Trimmed Match estimator significantly outperforms the empirical estimator in the cases of the heavier tailed log-normal and half-Cauchy distributions, where there is stronger geo heterogeneity. Moreover, we see that the Trimmed Match estimator still provides a useful estimate of θ* even when Assumption 1 is heavily violated at δ = 1.
8. Real Case Studies. Next, we report real data analyses from three different geo experiments, referred to as "A", "B" and "C", respectively, which were run either in the United States or in Canada. Each of these three experiments focused on a different business vertical, but they were all designed as randomized matched geo pairs. The number of geo pairs ranges from 60 to 105. Each experiment took a few weeks, which were split into two time periods: a test period during which the experiment took place, and a cooldown period during which the treatment geos were returned to the control condition to account for potential lagged effects. For each geo g, S_g and R_g are the aggregated ad spend and aggregated response over both time periods.
In experiments A and B, the advertisers wanted to measure the iROAS of an improved ad format for their business: geos assigned to the control group would run business as usual using the existing ad format, while geos assigned to the treatment group would adopt the improved ad format. For both the control group and the treatment group, the actual ad spend was much lower than the budget pre-specified by the advertisers, and thus we expect no interference, which implies that Assumption 0 holds. Experiment C was run to measure the iROAS of a new ad. The ad would only be shown to geos in the treatment group, but not the control group. The advertiser pre-specified a budget for the total ad spend. After the experiment, the total ad spend for the control group was 0, while that for the treatment group was equal to the budget, which implies strong interference. The potential control ad spend is 0 for each geo, and thus Assumption 0 holds, as discussed in Section 3.
To illustrate the heavy-tailedness of the data, the scatter plots of (X, Y) for the three case studies are provided in Figure 2, where the scales of both X and Y are removed in order to anonymize the experiments. In Table 4, we report the kurtosis as a measure of heavy-tailedness for the empirical distributions of the observed differences {X_i} and {Y_i}, all of which are much larger than 3, the kurtosis of any univariate normal distribution. Table 4 also lists the point estimates and confidence intervals for the empirical estimator θ̂^(emp), the sign test based estimator θ̂^(binom), the rank test based estimator θ̂^(rank), and the Trimmed Match estimator θ̂^(trim)_λ̂.
Here we see that θ̂^(rank) and θ̂^(trim)_λ̂ yield similar results except in experiment B, where the confidence interval from the rank based estimator is almost 74% wider than that of Trimmed Match. Meanwhile, the sign test based estimator gives confidence intervals that are wider still, by a considerable amount, for all experiments. It is also interesting to note that the data-driven choice of Trimmed Match's trim rate results in no trimming (coinciding with the empirical estimator θ̂^(emp)) for case B despite the heavy-tailedness of the data, likely due to the approximately linear relationship between X and Y (as shown in Figure 2), so that trimming a geo pair with a large |ε_i(θ*)| may not necessarily reduce the variance. As experiment A demonstrates, however, the empirical estimator can sometimes be very unreliable as it is heavily affected by outliers (as shown in Figures 2 and 3). Meanwhile, Figure 3 plots the Trimmed Match point estimate and confidence interval as a function of the trim rate λ. Here, in case A, we note the significant reduction in the size of the confidence intervals relative to the empirical estimator θ̂^(emp), i.e. relative to no trimming.
We also investigate whether the real data are incompatible with the statistical framework that we developed in Section 3 under Assumption 1, which assumes that the geo-level iROAS θ_g are all equal to one another. Recall from Proposition 2 that the distribution of the residuals {ε_i(θ*) : 1 ≤ i ≤ n} is symmetric about 0, where n is the number of geo pairs. Therefore, we expect {ε_i(θ̂^(trim)_λ̂) : 1 ≤ i ≤ n} to be approximately symmetric about 0 as well, a null hypothesis which we can test by using the Wilcoxon signed-rank test. The corresponding p-values are 0.66, 0.55 and 0.85 for the above three real case studies, which suggest that the real data are not incompatible with Assumption 1. Figure 4 shows the boxplots of the fitted residuals for the three cases, respectively, which illustrate the heavy-tailedness as well as the approximate symmetry.
9. Discussion. In this paper, we introduced the iROAS estimation problem in online advertising and formulated a novel statistical framework for its causal inference in a randomized paired geo experiment design, which is often complicated by the issues of small sample sizes, geo heterogeneity, and interference due to budgetary constraints on ad spend. Moreover, we proposed and developed a robust, distribution-free, and easily interpretable Trimmed Match estimator which adaptively trims poorly matched geo pairs. In addition, we devised a data-driven choice of the trim rate which extends Jaeckel's idea but does not rely on an asymptotic variance approximation, and presented numerical studies showing that Trimmed Match is often more efficient than alternative methods even when some of its assumptions are violated. Several open research questions of considerable interest remain, such as 1) using Trimmed Match to improve the design of matched pairs experiments, 2) using covariates to further improve the estimation precision (Rosenbaum, 2002), 3) estimating the geo-level iROAS, and 4) further investigation of the choice of the trim rate and the corresponding asymptotic analysis, e.g. using sample splitting (Klaassen, 1987; Nie and Wager, 2017).
Finally, although this paper focused on the estimation of the iROAS of online advertising in geo experiments, we note that Trimmed Match can also be applied to matched pairs experiments in other areas where the ratio of two causal estimands is of interest (e.g., the incremental cost-effectiveness ratio (Chaudhary and Stearns, 1996; Bang and Zhao, 2014)); see Chapter 5.3 of Rosenbaum (2020) for more examples.
APPENDIX A: FAST COMPUTATION OF TRIMMED MATCH
Recall from Section 5.1 that obtaining the Trimmed Match point estimate θ̂^(trim)_λ requires finding all roots of the trimmed mean equation (5.2). Moreover, recall that this computation is trivial when λ = 0, as θ̂^(trim)_λ then just corresponds to the empirical estimator given by (1.3). Therefore, in the remainder of this section, we focus on the computation for a fixed trim rate λ > 0.
Although (5.5) implies that calculating θ̂^(trim)_λ is straightforward once its corresponding set of n − 2m untrimmed indices I is known, I is generally a priori unknown as it depends on θ̂^(trim)_λ. One could, at least in theory, check all possible subsets of size n − 2m, but this brute force approach requires the evaluation of C(n, 2m) such subsets and would be computationally too expensive to be usable in practice when m is large. However, by instead considering how the ordering of the values in the set {ε_i(θ) : 1 ≤ i ≤ n} changes as a function of θ ∈ R, in particular by enumerating all possible values of θ at which this ordering changes, we are able to devise an efficient O(n^2 log n) algorithm for finding all of the roots of (5.2), which are required by (5.4).
Following (3.1), let {(x_i, y_i) : 1 ≤ i ≤ n} be the differences in the ad spends and responses that are observed from a randomized paired geo experiment. For notational simplicity, assume that {(x_i, y_i) : 1 ≤ i ≤ n} is ordered such that x_1 < x_2 < ... < x_n.
LEMMA 2. For any two pairs of geos i and j such that 1 ≤ i < j ≤ n, let
θ_ij = (y_j − y_i) / (x_j − x_i).
Then ε_i(θ) < ε_j(θ) if and only if θ < θ_ij.
Note that ties in {x_i : 1 ≤ i ≤ n} rarely occur in practice; when ties do occur, they can be broken by adding a small amount of random noise to the x_i's. Lemma 2, whose proof is straightforward and is omitted, allows us to efficiently solve the Trimmed Match equation defined by (5.2).
ALGORITHM 1 (Solving the Trimmed Match Equation (5.2)).
Input: {(x_i, y_i) : 1 ≤ i ≤ n} and trim rate λ > 0; let m ≡ ⌈nλ⌉.
Output: roots of (5.2).
i) Reorder the pairs {(x_i, y_i) : 1 ≤ i ≤ n} such that x_1 < ... < x_n; calculate {θ_ij : 1 ≤ i < j ≤ n} and order them such that θ_{i_1 j_1} < θ_{i_2 j_2} < ... < θ_{i_N j_N}. (Break ties with a negligible random perturbation if needed.)
ii) Initialize the set of untrimmed indices I = {i : m < i ≤ n − m}; initialize a = Σ_{i∈I} y_i, b = Σ_{i∈I} x_i, and two ordered sets Θ_1 = {} and Θ_2 = {}.
iii) For k = 1, ..., N:
 a) If i_k ∈ I and j_k ∉ I, then update I ← (I \ {i_k}) ∪ {j_k}, a ← a − y_{i_k} + y_{j_k}, b ← b − x_{i_k} + x_{j_k}, and append a/b to Θ_1 and θ_{i_k j_k} to Θ_2.
 b) If j_k ∈ I and i_k ∉ I, then update I, a and b similarly to (a), and append a/b to Θ_1 and θ_{i_k j_k} to Θ_2.
 c) Otherwise, continue.
iv) Append ∞ to Θ_2, and output the subset of Θ_1 obtained as follows: for k = 1, ..., |Θ_1|, output the kth element of Θ_1 if it lies between the kth and (k + 1)th elements of Θ_2.
A.1. Solving the Trimmed Match Equation. For ease of exposition, assume that {θ_ij : 1 ≤ i < j ≤ n} has been ordered such that θ_{i_1 j_1} ≤ θ_{i_2 j_2} ≤ ... ≤ θ_{i_N j_N}, where N = n(n − 1)/2. Then, for any
k = 1, 2, ..., N − 1, Lemma 2 implies that the ordering of {ε_i(θ) : 1 ≤ i ≤ n} is the same for all θ ∈ (θ_{i_k j_k}, θ_{i_{k+1} j_{k+1}}) and, thus, the set of untrimmed indices I(θ) ≡ {1 ≤ i ≤ n : ε_(m+1)(θ) ≤ ε_i(θ) ≤ ε_(n−m)(θ)} must also be the same for all θ ∈ (θ_{i_k j_k}, θ_{i_{k+1} j_{k+1}}). Moreover, Lemma 2 also implies that as θ increases and crosses a point θ_{i_k j_k}, then for any 1 ≤ i < j ≤ n, the ordering between ε_i(θ) and ε_j(θ) changes if and only if (i, j) = (i_k, j_k) or (i, j) = (j_k, i_k). Therefore, we can sequentially update the set of untrimmed indices I(θ) based on what occurs as θ increases and crosses each point θ_{i_1 j_1}, θ_{i_2 j_2}, ..., θ_{i_N j_N}: if i_k, j_k ∈ I(θ) or if i_k, j_k ∉ I(θ), then I(θ) remains unchanged; if i_k ∈ I(θ) but j_k ∉ I(θ), then we update I(θ) by replacing i_k with j_k; if i_k ∉ I(θ) but j_k ∈ I(θ), then we update I(θ) by replacing j_k with i_k. Pseudocode further describing this O(n^2 log n) procedure is provided in Algorithm 1.
A.2. Computing the Confidence Interval. Lemma 2 also facilitates the calculation of the confidence interval by reducing (5.7) to a quadratic inequality. The specific details are omitted from this paper for conciseness but are available from the authors upon request.
A.3. Existence of θ̂^(trim)_λ. From our discussions in this section, it is not necessarily obvious whether the Trimmed Match point estimate θ̂^(trim)_λ always exists. However, the following theorem, whose proof is also omitted for conciseness, guarantees that it does indeed always exist as long as the trimmed mean of the x_i's is nonzero.
THEOREM 1 (Existence). Suppose that {(x_i, y_i) : 1 ≤ i ≤ n} is ordered such that x_1 ≤ x_2 ≤ ... ≤ x_n. Then: 1) ε̄_{nλ}(θ) is a continuous function with respect to θ ∈ R; and 2) if Σ_{i=m+1}^{n−m} x_i ≠ 0, then ε̄_{nλ}(θ) = 0 has at least one root.
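For illustration, here is a simple (unoptimized) Python sketch of the root-finding idea behind Algorithm 1. Instead of performing the incremental O(n^2 log n) updates, it probes each interval between consecutive breakpoints θ_ij, on which the set of untrimmed pairs is constant, and keeps the candidate root if it falls inside that interval. It assumes distinct x_i (ties can be broken by a tiny jitter, as noted above) and is a sketch, not the Trimmed Match library implementation.

```python
import numpy as np
from itertools import combinations


def trimmed_match_roots(x, y, lam):
    """All roots of the trimmed-mean equation (5.2), by scanning the intervals on
    which the set of untrimmed pairs is constant (simple version of Algorithm 1)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    m = int(np.ceil(n * lam))

    # Breakpoints theta_ij at which the ordering of eps_i(theta) can change (Lemma 2).
    bps = sorted({(y[j] - y[i]) / (x[j] - x[i])
                  for i, j in combinations(range(n), 2) if x[i] != x[j]})

    # One probe point per interval, plus the two unbounded end intervals.
    probes = [bps[0] - 1.0] + [0.5 * (a + b) for a, b in zip(bps[:-1], bps[1:])] + [bps[-1] + 1.0]
    bounds = [(-np.inf, bps[0])] + list(zip(bps[:-1], bps[1:])) + [(bps[-1], np.inf)]

    roots = []
    for theta_mid, (lo, hi) in zip(probes, bounds):
        order = np.argsort(y - theta_mid * x)   # ordering of eps_i, constant on (lo, hi)
        keep = order[m:n - m]                   # untrimmed indices on this interval
        if np.sum(x[keep]) == 0:
            continue                            # cf. the condition of Theorem 1
        cand = np.sum(y[keep]) / np.sum(x[keep])  # root restricted to this untrimmed set
        if lo <= cand <= hi:                    # valid only if the same set applies at cand
            roots.append(cand)
    return sorted(set(roots))
```

When several roots are returned, the Trimmed Match point estimate is the one selected by the dispersion criterion described in Section 5.1.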
References.
Identification of causal effects using instrumental variables
Exact p-Values for Network Interference
Building a stronger instrument in an observational study of perinatal care for premature infants
Cost-effective analysis: a proposal of new reporting standards in statistical analysis
Better practices in advertising can change a cost of doing business to wise investments in the business
Consumer Heterogeneity and Paid Search Effectiveness: A Large-Scale Field Experiment
Lasso adjustments of treatment effect estimates in randomized experiments
Inferring causal impact using Bayesian structural time-series models
Estimating confidence intervals for cost-effectiveness ratios: an example from a randomized trial
The Python library for Trimmed Match and Trimmed Match Design
Bias correction for paid search in media mix modeling. Technical Report
On the derivatives of the trimmed mean
Randomization inference for treatment effect variation
Online Advertising
A Comparison of Approaches to Advertising Measurement: Evidence from Big Field Experiments at Facebook
Large sample properties of Jaeckel's adaptive trimmed mean
Robust statistics
Toward causal inference with interference
Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction
Some flexible estimates of location
Ghost ads: Improving the economics of measuring online ad effectiveness
Cross channel effects of search engine advertising on brick & mortar retail sales: Meta analysis of large scale field experiments on Google
Estimating Ad Effectiveness using Geo Experiments in a Time-Based Regression Framework. Technical Report
Consistent estimation of the influence function of locally asymptotically linear estimators. The Annals of Statistics
Metalearners for estimating heterogeneous treatment effects using machine learning
Nonparametrics: statistical methods based on ranks
The Unfavorable Economics of Measuring the Returns to Advertising
Agnostic notes on regression adjustments to experimental data: Reexamining Freedman's critique
Inference With Interference Between Units in an fMRI Experiment of Motor Inhibition
Quasi-oracle estimation of heterogeneous treatment effects
Randomized Experimental Design via Geographic Clustering
Identification of causal effects using instrumental variables: Comment
Covariance adjustment in randomized experiments and observational studies
Interference Between Units in Randomized Experiments
Design of Observational Studies
Discussion of "Randomization Analysis of Experimental Data in the Fisher Randomization Test" by Basu
Near impressions for observational causal ad impact. Technical Report
War and Wages: The Strength of Instrumental Variables and Their Sensitivity to Unobserved Biases
Less vulnerable confidence and significance procedures for location based on a single sample: Trimming/Winsorization 1
Online Ad Auctions
Causal inference in economics and marketing
Measuring Ad Effectiveness Using Geo Experiments. Technical Report
Estimation and Inference of Heterogeneous Treatment Effects using Random Forests
ggplot2: Elegant Graphics for Data Analysis
The Seasonality Of Paid Search Effectiveness From A Long Running Field Test
Acknowledgements. The authors would like to thank Art Owen and Jim Koehler for insightful early discussion, Peter Bickel for the reference of Jaeckel's paper on the choice of trim rate, Nicolas Remy, Penny Chu and Tony Fagan for their support, Jouni Kerman, Yin-Hsiu Chen, Matthew Pearce, Fan Zhang, Jon Vaver, Susanna Makela, Kevin Benac, Marco Longfils and Christoph Best for interesting discussions, and the people who read and commented on the manuscript. We appreciate Editor Beth Ann Griffin and the anonymous reviewers, whose comments have helped improve the paper significantly. All the figures are produced with the R package ggplot2 (Wickham, 2016).