Torrens-i-Dinares, Miquel; Papaspiliopoulos, Omiros; Rossell, David. Confounder importance learning for treatment effect inference. 2021-10-01.

We address applied and computational issues for the problem of multiple treatment effect inference under many potential confounders. While there is abundant literature on the harmful effects of omitting relevant controls (under-selection), we show that over-selection can be comparably problematic, introducing substantial variance and a bias related to the non-random over-inclusion of controls. We provide a novel empirical Bayes framework to mitigate both under-selection problems in standard high-dimensional methods and over-selection issues in recent proposals, by learning whether each control's inclusion should be encouraged or discouraged. We develop efficient gradient-based and Expectation-Propagation model-fitting algorithms to render the approach practical for a wide class of models. A motivating problem is to estimate the evolution of the salary gap in recent years in relation to potentially discriminatory characteristics such as gender, race, ethnicity and place of birth. We found that, albeit smaller, some wage differences remained for female and black individuals. A similar trend is observed when analyzing the joint contribution of these factors to deviations from the average salary. Our methodology also showed remarkable estimation robustness to the addition of spurious artificial controls, relative to existing proposals.

This article addresses a problem of fundamental importance in applied research: evaluating the joint effect, if any, of multiple treatments on a response variable while controlling for a large number of covariates, many of which might be correlated with the treatments. We refer to the covariates as controls, and when they are correlated with both the response and the treatments we call them confounders. A motivating application is to study the association between salary and multiple treatments such as gender, race and country of birth, after accounting for differential access to education, position at the workplace, economic sector, region in the country, and many other potential controls. Studying which treatments are associated with salary, after accounting for said controls, helps quantify salary variation potentially due to discrimination, and how it evolved over time. We model the dependence of the response y_i ~ p(y_i; η_i, φ) on t = 1, . . . , T treatments d_{i,t} and j = 1, . . . , J controls x_{i,j} via

η_i = Σ_{t=1}^T α_t d_{i,t} + Σ_{j=1}^J β_j x_{i,j},   (1)

where p(y_i; η_i, φ) defines a generalized linear model with linear predictor η_i and dispersion parameter φ. The distinction between treatments and controls is that we are primarily interested in inference for the former, i.e., for the set of α_t's in (1), whereas the latter are included in the model to avoid omitted variable biases. When there are many controls, a shrinkage and selection estimation framework is required, either in terms of penalized likelihood or Bayesian model averaging (BMA). However, a key observation is that their naive application can yield imprecise estimates, especially when there are many confounders and/or the treatment effects are not very strong. For example, Figure 1 (described later in this section) shows that LASSO and BMA suffer from very high root mean squared error (RMSE) when all controls affecting the response (β_j ≠ 0) are confounders.
A further issue for LASSO and other likelihood penalties is that the treatment variables might not be selected, in which case the powerful post-selection inference framework of Lee et al. (2016) is not applicable. An alternative is to use methods based on de-biasing the LASSO (van de Geer et al. 2014, Javanmard & Montanari 2014), which provide inference for all parameters; however, these incur a loss of shrinkage that typically results in low power to detect anything but strong effects. For example, in linear regression with an orthogonal design matrix the debiased LASSO of van de Geer et al. (2014) recovers the least-squares estimate, i.e. it offers no shrinkage. To address this issue a number of penalized likelihood and Bayesian methods have been introduced; Section 2.2 provides an overview. A popular one is the Double Machine Learning (DML) approach of Belloni et al. (2014). In a first step, DML regresses separately the outcome and the treatment on the controls via penalized likelihood, and in a second step it fits a model like (1) via maximum likelihood on the set of controls selected in either regression in the first stage. In a similar spirit, Bayesian Adjustment for Confounding (BAC) in Wang et al. (2012) models jointly the response and the treatment and uses a prior that encourages controls to be simultaneously selected in the two regression models. The key idea behind these approaches and related work is to encourage the inclusion of covariates that are associated with the treatments into the regression model for the response. In this article we highlight an overlooked weakness that is relevant for the application of such methods, which relates to over-selection of controls. Adding controls to the regression model for y_i that are correlated with the treatments, but not conditionally related to the outcome, has two effects. The first is an obvious over-selection variance, an inflation of the standard errors of the treatment effects that reduces the power to detect weaker effects. The second is more subtle, and we call it control over-selection bias: including in (1) controls that were selected precisely because they correlate with the treatments leads to biased inference for the treatment effects. Said over-selection bias and variance worsen as the number of treatments increases and as the level of confounding (the proportion of controls that are relevant both for the response and the treatments) decreases. In Figure 1 we consider a single treatment simulated according to a linear regression on 6 active controls, and we varied the number of controls active in both models from 0 (no confounding) to 6 (full confounding, i.e., the same controls are used for generating the response and the treatment). While LASSO and BMA perform worse the stronger the confounding, due to control under-selection, DML and BAC perform well in the presence of strong confounding but poorly in its absence. In Figure 5 (described later) we show that these effects can be exacerbated in the presence of an increasing number of treatments. We propose a new approach, Confounder Importance Learning (CIL), which can be easily implemented using existing software and deals successfully with both over- and under-selection, in both high- and low-confounding situations, and in the presence of multiple simultaneous treatments. A first illustration of the merits of CIL is given in Figure 1, where it achieves good performance across the spectrum.
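To make the two-step logic concrete, the sketch below implements a generic double-selection procedure in R with glmnet. It is a schematic illustration of the idea described above, not the exact estimator of Belloni et al. (2014); the toy data and variable names (y, d, X) are our own.

library(glmnet)

# Schematic double selection for a single treatment d, outcome y, controls X.
set.seed(1)
n <- 200; J <- 50
X <- matrix(rnorm(n * J), n, J)
d <- drop(X[, 1:3] %*% rep(1, 3)) + rnorm(n)          # treatment depends on controls 1-3
y <- 1 * d + drop(X[, 1:3] %*% rep(1, 3)) + rnorm(n)  # outcome: true treatment effect = 1

# Step 1: LASSO of the outcome on the controls, and of the treatment on the controls
fit_y <- cv.glmnet(X, y)
fit_d <- cv.glmnet(X, d)
sel_y <- which(as.numeric(coef(fit_y, s = "lambda.min"))[-1] != 0)
sel_d <- which(as.numeric(coef(fit_d, s = "lambda.min"))[-1] != 0)

# Step 2: refit the outcome model by OLS on the treatment plus the union of selected controls
sel <- union(sel_y, sel_d)
fit2 <- lm(y ~ d + X[, sel, drop = FALSE])
coef(summary(fit2))["d", ]   # post-double-selection estimate of the treatment effect

As the simulations below illustrate, this union-based selection is precisely what can over-select controls that predict the treatment but not the outcome.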
In Figure 2 we analyze data from the Current Population Survey, which has T = 204 treatments and J = 278 controls (see Section 4). The former include four average treatment effects of particular interest; Figure 2 focuses on two of them, a gender indicator and a black race indicator, and compares a number of methods. All methods show that the average log-salary is reduced for women, although this gap is less pronounced in 2019 relative to 2010. However, the methods differ in their conclusions for the black race. To understand better what drives these differences, we added 100 and 200 simulated controls that are dependent on the treatments but conditionally independent of the response. The figure shows a marked robustness of CIL to the addition of said controls, whereas other methods lose their ability to detect the weaker effects (e.g. gender in 2019 and race in 2010).

[Figure 1 caption: results averaged over 250 simulated datasets, considering strong (α = 1), weak (α = 1/3) and no effect (α = 0). In all panels, n = 100, J = 49 and the response and treatment are simulated from a linear regression model based on 6 active controls each. The overlap between the two sets of active controls varies from 0 (no confounding) to 6 (full confounding). DML is double machine learning, BMA is Bayesian model averaging, BAC is Bayesian Adjustment for Confounding and CIL is confounder importance learning.]

The remainder of the introduction outlines our main ideas and the paper's structure. Section 2 details our proposed approach and provides a deeper discussion of related literature. Our main contribution is a BMA framework where the prior inclusion probabilities, π_j = P(β_j ≠ 0), vary with j in a manner that is learned from the data. We build a model, (2) below, which expresses these probabilities in terms of features f_{j,t} ≥ 0 extracted from the treatment and control data, and hyperparameters ρ ∈ (0, 1/2) and θ = (θ_0, θ_1, . . . , θ_T). The role of ρ is to bound the probabilities away from 0 and 1 and, as we discuss in Section 2.1, we propose the default choice ρ = 1/(1 + J²). Our method relies on a good choice of the features f_{j,t} and the hyperparameters θ_t. Section 2.1 describes high-dimensional approaches to obtain the former; the main idea is to obtain rough estimates of the relative impact of each control on predicting each treatment. Regarding the latter, when θ_t = 0 for t = 1, . . . , T our approach assigns equal inclusion probabilities, when θ_t > 0 controls found to predict treatment t are encouraged to be included in the response model, and when θ_t < 0 their inclusion is discouraged. This is in contrast to methods such as DML and BAC, which encourage the inclusion of any control associated with any treatment, i.e. with large f_{j,t}. Section 3 describes our computational methods, which are another main contribution. We design an empirical Bayes choice for θ based on optimizing the marginal likelihood, and a much cheaper alternative based on an Expectation Propagation approximation. In summary, our approach is based on learning the importance of each possible confounder via (2). In doing so, it helps prevent both under-selection bias and over-selection bias and variance. Section 4 applies the methodology to our motivating salary application. As discussed, our CIL detects negative salary gaps associated with the black race that might otherwise go unnoticed. We also illustrate how considering multiple treatments allows one to portray, via posterior predictive inference, a measure of joint salary variation due to potentially discriminatory factors.
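Expression (2) is not reproduced in this extraction. As a reading aid, the snippet below encodes one plausible form consistent with the stated properties (probabilities bounded in [ρ, 1 − ρ], equal when θ_t = 0, increasing in θ_t when f_{j,t} is large). The logistic link is our assumption, not necessarily the authors' exact expression.

# Hypothetical prior inclusion probability pi_j(theta), assuming a logistic link
# bounded away from 0 and 1 by rho; the exact form of equation (2) may differ.
prior_inclusion <- function(theta, f, rho) {
  # theta: vector (theta_0, theta_1, ..., theta_T)
  # f:     J x T matrix of non-negative features f_{j,t}
  # rho:   lower bound; the paper's default is 1 / (1 + J^2), with J = nrow(f)
  eta <- theta[1] + as.matrix(f) %*% theta[-1]   # linear score per control
  rho + (1 - 2 * rho) * plogis(eta)              # values lie in (rho, 1 - rho)
}

# Example: J = 4 controls, T = 1 treatment
f <- matrix(c(0, 0.1, 0.8, 1.5), ncol = 1)
rho <- 1 / (1 + nrow(f)^2)
prior_inclusion(theta = c(0, 2), f = f, rho = rho)   # larger f_{j,1} gives larger pi_j
prior_inclusion(theta = c(0, 0), f = f, rho = rho)   # theta_t = 0 gives equal probabilities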
Our results suggest that in 2019 said variation decreased nation-wide (from 5.4% in 2010 to 1.5% in 2019) and state-wide, with lesser disparities across states than in 2010. Section 4 also presents simulations that help understand practical differences between methods, in particular in terms of the amount of confounding present in the data. All proofs and additional empirical results are in the supplementary material. R code to reproduce our results is available at https://github.com/mtorrens/rcil.

We model the dependence of the response y_i on treatments d_i = (d_{i,1}, . . . , d_{i,T}) and controls x_i = (x_{i,1}, . . . , x_{i,J}) according to (1). We are primarily interested in inference for α = (α_1, . . . , α_T), i.e. the treatment effects. We refer to the special case T = 1 as the single-treatment setting, while for T > 1 we have multiple treatments. We adopt a Bayesian framework where we introduce variable inclusion indicators γ_j = I(β_j ≠ 0) and δ_t = I(α_t ≠ 0), and define a model prior on (α, β, φ, γ, δ), where θ denotes the hyper-parameters in (2), and p(φ) is dropped for models with known dispersion parameter (e.g. logistic or Poisson regression). For the regression coefficients, we assume prior independence and adopt the so-called product moment (pMOM) non-local prior of Johnson & Rossell (2012), according to which α_t = 0 almost surely if δ_t = 0, with the analogous setting for every β_j. Figure S6 illustrates its density. This prior involves a hyperparameter τ > 0, which we set to τ = 0.348 following Johnson & Rossell (2010), so that the prior signal-to-noise ratio |α_t|/√φ is greater than 0.2 with probability 0.99. Non-local priors achieve model selection consistency on a range of high-dimensional linear and generalized linear regression models and play an important role in helping discard spurious predictors (Johnson & Rossell 2012, Wu 2016, Shin et al. 2018, Rossell 2021). As for the dispersion parameter, where unknown, we place a standard φ ~ IGam(a_φ = 0.01, b_φ = 0.01) prior; see e.g. Gelman (2006). For the inclusion indicators, we also assume prior independence and set

p(γ | θ) = Π_{j=1}^J Bern(γ_j; π_j(θ)).

All treatments get a fixed marginal prior inclusion probability P(δ_t = 1) = 1/2, as we do not want to favor their exclusion a priori, considering that there is at least some suspicion that any given treatment has an effect. This choice is a practical default when the number of treatments T is not too large; otherwise one may set P(δ_t = 1) < 1/2 to avoid false positive inflation due to multiple hypothesis testing (Scott & Berger 2010, Rossell 2021). Our software allows the user to set any desired P(δ_t = 1). The main modelling novelty in this article is the choice of π_j(θ), which we set according to (2). A key part of the construction is the choice of features f_{j,t}. Our generic approach is to take f_{j,t} = |w_{j,t}|, where w_t = (w_{1,t}, . . . , w_{J,t}) are regression coefficients obtained via a high-dimensional regression of d_t on the controls. We highlight two possibilities. First, a LASSO regression,

w_t := arg min_w Σ_{i=1}^n (d_{i,t} − x_i'w)² + λ Σ_{j=1}^J |w_j|,   (6)

where λ > 0 is a regularization parameter, which we set by minimizing the BIC (we obtained similar results when using cross-validation). An alternative to the choice in (6) is the least-squares projection

w_t := (X'X)^+ X' d_t,   (7)

where (X'X)^+ is the Moore-Penrose pseudo-inverse, and X the n × J design matrix. For J < n this is the familiar OLS estimator, but (7) is also well-defined when J > n, and it has been recently investigated in terms of the so-called benign overfitting property in Bartlett et al. (2020).
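The two feature choices just described can be computed with standard tools. The sketch below is a minimal R illustration: a LASSO fit of each treatment on the controls, selecting λ by BIC as in (6), and the pseudo-inverse projection (7). The data and helper names are our own placeholders, not the authors' code.

library(glmnet)
library(MASS)  # for ginv()

set.seed(1)
n <- 150; J <- 60; ntreat <- 2
X <- matrix(rnorm(n * J), n, J)
d <- cbind(X[, 1:3] %*% rep(1, 3), X[, 4:6] %*% rep(1, 3)) + matrix(rnorm(n * ntreat), n, ntreat)

# Option 1: f_{j,t} = |w_{j,t}| from a LASSO of d_t on X, with lambda chosen by BIC
lasso_features <- function(dt, X) {
  fit <- glmnet(X, dt)
  pred <- predict(fit, newx = X)                 # fitted values along the lambda path
  rss <- colSums((dt - pred)^2)
  bic <- n * log(rss / n) + fit$df * log(n)      # BIC for each lambda on the path
  abs(as.numeric(coef(fit, s = fit$lambda[which.min(bic)]))[-1])
}
f_lasso <- sapply(seq_len(ncol(d)), function(t) lasso_features(d[, t], X))

# Option 2: f_{j,t} = |w_{j,t}| from the pseudo-inverse projection w_t = (X'X)^+ X' d_t
f_proj <- abs(ginv(crossprod(X)) %*% crossprod(X, d))

dim(f_lasso); dim(f_proj)   # both are J x T matrices of non-negative features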
Wang & Leng (2016) showed that when J > n, (7) ranks the coefficients consistently under theoretical conditions slightly broader than the LASSO. This is appealing in our context, since π j (θ) are mainly driven by the relative magnitudes of f j,t , rather than their exact value. The scalar ρ in (2) ensures that prior probabilities are bounded away from 0 and 1. In particular, we set it to ρ = 1/(J 2 +1). This sets a lower-bound P (β j = 0) ≥ 1/(J 2 +1) that is of the same order as J grows as the Complexity priors in (Castillo & van der Vaart 2012) , which are sufficiently sparse to discard spurious predictors and attain minimax estimation rates (Castillo et al. 2015 , Rossell 2021 ). The final element in (2) are the hyper-parameters θ t , which can encourage the inclusion or exclusion of controls associated to the treatment t. Figure 3 illustrates π j (θ) for three different values of θ 1 . Setting θ is critical for the performance of our inferential paradigm, and in Section 3 we introduce data-driven criteria and algorithms for its choice. The dotted line in the middle corresponds to θ 1 = 0. We discuss approaches for treatment effect inference in the presence of many controls. The main idea in both frequentist and Bayesian literatures is to encourage the inclusion of confounders in (1) Within a Bayesian framework, a natural approach is to build a joint model the dependence between each treatment t and control j. BAC (Wang et al. 2012 ) considers this approach only for T = 1, setting a prior for γ j where each control has two potential prior inclusion probabilities. If a control j is associated to the single treatment t = 1 (ξ tj = 1), the prior inclusion probability P (γ j = 1) increases by a factor determined by a hyperparameter ω that is be set by the user. Lefebvre et al. (2014) and Wang et al. (2015) provided some theoretical support and proposals to set ω, and Wilson et al. (2018) Our main contributions are of an applied, but relevant, nature: replacing the joint model (8) by extracting features derived from p(d i | x i ) to render computations practical, and learning from data whether confounder inclusion should be encouraged, discouraged, or neither, to avoid over-selection issues. Another contribution is considering the multiple treatments problem (T > 1), which has been less studied. 3 Computational Methodology All expressions in this section are conditional on the observed (x i , d i ), we drop them from the notation for simplicity. Inference for our approach relies on posterior model probabilities is the marginal likelihood of model (γ, δ). We set the hyperparameter θ to a point estimatê θ described in the next section. Conditional on θ, our model prior p(γ | θ) is a product of independent Bernouilli's with asymmetric success probabilities defined by (2). As a simple variation of standard BMA, one can exploit existing computational algorithms, which we outline next. Outside particular cases such as Gaussian regression under Gaussian priors, (9) does not have a closed-form expression. To estimate (9) under our pMOM prior we adopt the approximate Laplace approximations of Rossell et al. (2021) , see Section S7.2. We obtain point estimates using BMA, and similarly employ the BMA posterior density p(α | y, γ, δ, θ) to provide posterior credible intervals. To this end we use posterior samples from this density using a latent truncation representation described by Rossell & Telesca (2017) . 
Expression (10) is a sum across 2^{T+J} models; when this is infeasible we use Markov chain Monte Carlo to explore the posterior distribution p(γ, δ | y, θ); see e.g. Clyde & Ghosh (2012) for a review. We used all the algorithms described above as implemented by the modelSelection function in the R package mombf (Rossell et al. 2020). Our main computational contribution is a strategy to learn the hyperparameter θ, which plays a critical role by determining prior inclusion probabilities. We devised an empirical Bayes approach maximizing the marginal likelihood, with

θ̂^{eb} = arg max_θ log p(y | θ) = arg max_θ log Σ_{δ,γ} p_u(δ, γ | y) p(δ, γ | θ),   (11)

where the right-hand side follows easily, denoting by p_u(δ, γ | y) the posterior probabilities under a uniform model prior p_u(δ, γ) ∝ 1. The use of empirical Bayes for hyperparameter learning in variable selection has been well studied; see George & Foster (2000), Scott & Berger (2010), Petrone et al. (2014). The main challenge is that one must evaluate the costly sum in (11) for each value of θ considered by an optimization algorithm. Note that p(y | γ, δ) does not depend on θ, and hence can be re-used to evaluate (11) for any number of θ values. In fact, by Proposition 1 below, this provides grounds to use stochastic gradient methodology to maximize (11). If, additionally, the model prior is separable, such that p(γ, δ | θ) = Π_{t=1}^T p(δ_t) Π_{j=1}^J p(γ_j | θ), then Corollary 2 gives the gradient in closed form.

Corollary 2. Under the model prior in (4) and (5), and with π_j(θ) as defined by (2), the gradient of log p(y | θ) is given by expressions (12) and (13).

Expressions (12) and (13) evaluate the gradient with a sum of J terms, relative to the 2^{J+T} terms in (11). Further, (13) only depends on y via the marginal inclusion probabilities P(γ_j = 1 | y, θ), which can typically be estimated more accurately than the joint model probabilities in (11). However, two problems remain unaddressed. First, one must compute P(γ_j = 1 | y, θ) for every considered θ, which is cumbersome when the optimization requires more than a few iterations. Second, log p(y | θ) may not be concave, hence standard algorithms may converge to low-quality local optima if θ is poorly initialized. Figure S7 (left) shows an example of a multi-modal p(y | θ). We next describe an Expectation Propagation approximation which, as illustrated in Figure S7, typically provides a good approximation to the global mode. The use of Expectation Propagation (Minka 2001a,b) is common in Bayesian machine learning, including in variable selection (Seeger et al. 2007, Hernández-Lobato et al. 2013, Xu et al. 2014). We propose a computationally tractable approximation to (11), which can also serve as an initialization point for an algorithm to solve (11) exactly, if so desired. We consider a mean-field approximation to the posterior probabilities in (11),

p̃_u(δ, γ | y) = Π_{t=1}^T Bern(δ_t; s_t) Π_{j=1}^J Bern(γ_j; q_j),   (14)

where s = (s_1, . . . , s_T) and q = (q_1, . . . , q_J) are given in Proposition 3 to optimally approximate p_u(δ, γ | y). By Proposition 3 below, (14) leads to replacing (11) by a new objective function (15) that only requires an inexpensive product across J terms. These terms only depend on y via posterior inclusion probabilities q_j = P(γ_j = 1 | y, θ = 0) that can be easily pre-computed prior to conducting the optimization exercise.

Proposition 3. Let s_t, q_j and p̃_u(δ, γ | y) be as defined in (14). Then s_t^{ep} = P(δ_t = 1 | y, θ = 0) and q_j^{ep} = P(γ_j = 1 | y, θ = 0) minimize the Kullback-Leibler divergence from p_u(δ, γ | y) to p̃_u(δ, γ | y). Further,

Σ_{δ,γ} p̃_u(δ, γ | y) p(δ, γ | θ) = 2^{-T} Π_{j=1}^J { q_j^{ep} π_j(θ) + (1 − q_j^{ep})(1 − π_j(θ)) }.   (15)

The gradient of the objective function (15) is given in Section S6.4. Since this function may not be concave, we conduct an initial grid search and subsequently use a quasi-Newton BFGS algorithm.
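To illustrate the EP strategy numerically, the sketch below codes the approximate objective of Proposition 3 for given pre-computed inclusion probabilities q_j and maximizes it by a coarse grid search followed by BFGS, mirroring the procedure just described. It reuses the hypothetical logistic form of π_j(θ) introduced earlier, so it is an illustration of the computational strategy rather than the authors' implementation.

# EP-style objective: sum_j log{ q_j * pi_j(theta) + (1 - q_j) * (1 - pi_j(theta)) },
# where q_j = P(gamma_j = 1 | y, theta = 0) are pre-computed once.
prior_inclusion <- function(theta, f, rho) {
  rho + (1 - 2 * rho) * plogis(theta[1] + as.matrix(f) %*% theta[-1])  # assumed form of (2)
}

ep_objective <- function(theta, q, f, rho) {
  pj <- prior_inclusion(theta, f, rho)
  sum(log(q * pj + (1 - q) * (1 - pj)))
}

set.seed(1)
J <- 40
f <- matrix(abs(rnorm(J)), ncol = 1)   # features for a single treatment (T = 1)
q <- runif(J)                          # stand-in for posterior inclusion probabilities at theta = 0
rho <- 1 / (1 + J^2)

# Coarse grid search over (theta_0, theta_1), then BFGS from the best grid point
grid <- expand.grid(theta0 = -3:3, theta1 = -3:3)
vals <- apply(grid, 1, function(th) ep_objective(th, q, f, rho))
start <- as.numeric(grid[which.max(vals), ])
opt <- optim(start, function(th) -ep_objective(th, q, f, rho), method = "BFGS")
theta_ep <- opt$par
theta_ep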
See Section S7.3 and Algorithm 1 therein for a full description of our algorithm to obtain θ̂^{eb} and θ̂^{ep}. In most of our examples θ̂^{eb} and θ̂^{ep} provided virtually indistinguishable inference, the latter at a significantly lower computational cost, but the exact θ̂^{eb} did provide slight advantages in some high-dimensional settings (see Section 4.3).

We studied the association between belonging to certain social groups and the hourly salary, and its evolution over the last decade (prior to the covid-19 pandemic), to assess progress regarding wage discrimination. We analyzed the USA Current Population Survey (CPS) microdata (Flood et al. 2020), which records many social, economic and job-related factors. The outcome is the individual log-hourly wage, rescaled by the consumer price index of 1999, and we considered four treatments: gender, black race, Hispanic ethnicity and Latin America as place of birth. These treatments are highly correlated with sociodemographic and job characteristics that can impact salary, i.e. there are many potential confounders. We compared CIL, using the features described in Section 2.1, to DML and BMA. In Section 4.3, using simulated data, we also compare to BAC (Wang et al. 2012), which was computationally infeasible to apply to the salary data. We set its hyperparameter to ω = +∞, which encourages the inclusion of confounders relative to standard BMA. For completeness, we also considered a standard LASSO regression on the outcome equation (1), setting the penalization parameter via cross-validation. We compared these methods to the oracle OLS, i.e. OLS based on the subset of controls truly featuring in (1). Since every state has its own regulatory, sociodemographic and political framework, we captured state effects by adding interactions for each pair of treatment and state. We applied a sum-to-zero constraint on these interactions, so that the coefficients associated with the four treatments remain interpretable as average effects across the USA, and the interactions as deviations from that average. Hence, overall, we have T = 4 + 4 × 50 = 204 treatments, our main interest being in the first four. To simplify computation in our CIL prior we assumed a common θ_t shared between each main treatment and all its interactions with state, so that dim(θ) = 5. A full list of the extracted variables can be found as a supplementary file. To study issues related to under- and over-selection, we analyzed the original data and two augmented datasets where we added artificial controls correlated with the treatments but not the outcome. The augmented data incorporated 100 artificial controls in the first scenario, and 200 in the second one, split into four subsets of size 25 and 50, respectively. Each of these subsets was designed to correlate with one of the four main treatments. Section S8.2 summarizes the simulation procedure. The resulting average correlation between gender and its associated artificial variables was 0.83, and analogously 0.69, 0.76 and 0.67 for black race, ethnicity and place of birth, respectively. None of the methods pointed to an association between salary and ethnicity or place of birth; see Figure S8. Figure 2 reports the results for gender and race. The treatment effect for gender is picked up by all methods in both years with similar point estimates. All methods found a significant decrease of this effect in 2019.
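As an aside on implementation, the treatment-by-state interactions with a sum-to-zero constraint described above can be encoded in R with contr.sum contrasts, so that the main treatment coefficients remain average effects and the interactions are deviations. The toy snippet below is our own illustration, not the authors' data-preparation code.

# Toy example: one binary treatment, a state factor, sum-to-zero interaction coding.
set.seed(1)
n <- 12
dat <- data.frame(
  female = rbinom(n, 1, 0.5),
  state  = factor(rep(c("CA", "NY", "TX"), length.out = n))
)

# "contr.sum" makes the state interaction columns sum to zero across states, so the
# coefficient on 'female' is an average effect across states and the female:state
# columns are deviations from that average.
M <- model.matrix(~ female + female:state, data = dat,
                  contrasts.arg = list(state = "contr.sum"))
colnames(M)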
When adding the artificial controls, the confidence intervals for OLS and DML became notably wider, which in 2019 led to a loss of statistical significance. This points towards a relevant loss of power due to over-selecting irrelevant controls, with the associated variance inflation. The CIL results were particularly robust to the artificial controls. As for race, we obtained more diverse results. In 2010, DML, BMA and CIL found a negative association between black race and salary. However, in 2019 DML and BMA found no significant association, OLS found a small positive association, and CIL was the only method finding a negative association in both years. Once we introduce the artificial controls, we observe that OLS and DML suffer a large variance inflation, and in 2010 BMA suffers a significant loss of power, failing to detect an effect that was confidently picked up in the original data. In contrast, CIL shows no perceptible change when adding the artificial controls. These results suggest that CIL has sufficient power to detect the difference that other methods miss in the original data in 2019. The full scope of our proposed approach materializes when considering more complex functions of the parameters. We study a measure of the overall treatment contribution to deviations from the average salary. For a new observation n + 1, with observed treatments d_{n+1} and controls x_{n+1}, let

h_{n+1}(d_{n+1}, α, x_{n+1}) = d_{n+1}'α − E(d_{n+1} | x_{n+1})'α   (16)

be its expected salary minus the expected salary averaged over possible d_{n+1}, given equal control values x_{n+1}. Since y_{n+1} is a log-salary, we examine the posterior predictive distribution of exp{h_{n+1}(d_{n+1}, α, x_{n+1})} as a measure of salary variation associated with the treatments. A value of 1 indicates no deviation from the average salary, relative to another individual with the same controls x_{n+1}. To evaluate the posterior predictive distribution of (16) given y, the observed d and the set of controls, we obtained posterior samples from the model-averaged posterior p(α | y) associated with CIL (Section 3.1). Given that we do not have an explicit model for (d_{n+1}, x_{n+1}), we sampled pairs (d_{n+1}, x_{n+1}) from their empirical distribution, and estimated E(d_{n+1} | x_{n+1}) from a logistic regression of d on the set of controls. Figure 4 shows the results. In 2010, joint variation in the treatments was associated with an average 5.4% salary variation (90% predictive interval [0.1%, 18.3%]). The posterior mean in 2019 dropped to 1.5% and the 90% predictive interval was [0%, 4.9%]. That is, the treatments not only played a smaller role in the 2019 average, but there was also a strong improvement in inequality, e.g. for individuals whose salary was furthest from the average. It is also of interest to study differences between states. This is possible in our model, which features 200 interaction terms for the 4 treatments and 50 states. Figure 4 (right) shows the results. The most salient feature is a lower heterogeneity across states in 2019 relative to 2010. The states whose median improved the most were among the lowest-ranking states in 2010: Pennsylvania (reducing it by 3.7%), Indiana (3.7%) and Ohio (3.6%), while those improving the least (Nebraska aside, whose median is 0% in both years) were Maine (0.2%), Wyoming (0.2%) and South Dakota (0.3%), which were already among the top-ranking states. This points towards a gradual catch-up effect across U.S. states, although the intervals still show some variability within states.
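The computation of this salary-variation measure can be sketched as follows: given posterior draws of α, sampled (d, x) pairs, and fitted logistic regressions for E(d | x), one evaluates exp{h} per draw and summarizes the resulting predictive distribution. The code below is a schematic illustration with simulated stand-in inputs, not the CPS analysis itself.

# Schematic posterior predictive computation of exp{h} = exp{d'alpha - E(d|x)'alpha}.
set.seed(1)
n <- 500; J <- 5; Tt <- 2; B <- 1000

X <- matrix(rnorm(n * J), n, J)
D <- cbind(rbinom(n, 1, plogis(X[, 1])), rbinom(n, 1, plogis(X[, 2])))  # treatments
alpha_draws <- cbind(rnorm(B, -0.05, 0.01), rnorm(B, -0.02, 0.01))      # stand-in for p(alpha | y)

# Estimate E(d_t | x) with one logistic regression per treatment
Ed <- sapply(seq_len(Tt), function(t) fitted(glm(D[, t] ~ X, family = binomial)))

# For each posterior draw, sample an individual (d, x) pair and evaluate exp{h}
idx <- sample.int(n, B, replace = TRUE)
h <- rowSums(alpha_draws * (D[idx, ] - Ed[idx, ]))
ratio <- exp(h)

mean(ratio)                      # average multiplicative deviation from the average salary
quantile(ratio, c(0.05, 0.95))   # 90% predictive interval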
To illustrate issues associated with under- and over-selection of controls, a key factor we focus on is the level of confounding. Our scenarios range from no confounding (no controls affect both y and d) to complete confounding (any control affecting y also affects d, and vice versa). We also considered the effect of dimensionality, treatment effect sizes α, and the true level of sparsity (number of active controls). We considered a Gaussian outcome and a single treatment (T = 1), with error variance φ = 1. The controls were obtained as independent Gaussian draws x_i ~ N(0, I), and any active control had a coefficient β_j = 1. The treatment d was also Gaussian with its mean depending linearly on the controls, unit error variance, and any control with an effect on d had a unit regression coefficient. Throughout, the number of controls that truly affected d was set equal to the number of controls affecting the outcome y. We measured the RMSE of the estimated α̂. Figures 1, S10, S9 and 5 summarize the results. Figure 1 shows that the RMSE of BMA and LASSO worsens as confounding increases; this was due to lower power to select all relevant confounders (see Figure S9 for model selection diagnostics), i.e. an omitted variable bias. These effects have been well studied. Methods such as DML and BAC were designed to prevent omitted variables, but as shown in Figure 1 they can run into over-selection when there truly are few confounders. In contrast, our CIL performed well at all levels of confounding. The Empirical Bayes and the Expectation Propagation versions of CIL provide nearly indistinguishable results (not shown). It is worth noting that, when the treatment truly had no effect (α = 0), CIL provided a strong shrinkage that delivered a significantly lower RMSE than other methods. Figure S10 extends the analysis to consider a growing number of covariates, under a strong treatment effect (α = 1). As dimensionality grew, standard LASSO and BMA incurred a significantly higher RMSE under strong confounding. Our CIL generally provided marked improvements over BMA, except for the largest case J + T = 200. Here we observed the only perceptible differences between the EB and EP approximations, with the former attaining better results, pointing to advantages of the EB approach in higher dimensions. Figure S11 further extends the analysis to less sparse settings, with ‖γ‖_0 = 6, 12 and 18 active parameters. Overall, the results were similar to Figures 1 and S10. A focus in this paper is to understand over-selection issues in multiple treatment inference. To this end, we added a multiple-treatments design with an increasing number of treatments, up to a maximum of T = 5. There, every present treatment was active, setting α_t = 1 for all treatments. For all levels of T, we set β_j = 1 for j = 1, . . . , 20, denoting the set of active controls by x_{1:20}, and β_j = 0 for the remaining controls x_{21:J}. Regarding the association between treatments and controls, x_{1:20} were divided into five disjoint subsets of four variables each, and each of these subsets was associated with a different treatment. Each treatment, when present, was linearly dependent on every control in its associated subset. Additionally, each treatment also depended on a further subset of controls in the set x_{21:J}. The size of this subset increased by four with each added treatment: treatment 1 was associated with x_{21:24}, treatment 2 with x_{21:28}, and so on, up to treatment 5, which was associated with x_{21:40}. All controls affecting a treatment had a unit linear coefficient. The rest of the design, including the data-generating process of the controls and the error variances, is akin to that in Figure 1. We also replaced BAC with the ACPME method of Wilson et al. (2018), an extension of BAC for multiple treatments.
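As a concrete rendering of the single-treatment design described at the beginning of this section, a minimal simulation might look as follows; it reproduces the stated ingredients (Gaussian controls, unit coefficients, unit error variances, six active controls per equation) for a chosen level of confounding, but it is our own rendering rather than the authors' simulation code.

# One simulated dataset from the T = 1 design: 6 active controls in each equation,
# of which `overlap` are shared (confounders); overlap ranges from 0 to 6.
simulate_one <- function(n = 100, J = 49, alpha = 1, overlap = 3) {
  X <- matrix(rnorm(n * J), n, J)
  active_d <- 1:6                                       # controls affecting the treatment
  active_y <- c(seq_len(overlap), 6 + seq_len(6 - overlap))  # controls affecting the outcome
  d <- drop(X[, active_d] %*% rep(1, 6)) + rnorm(n)
  y <- alpha * d + drop(X[, active_y] %*% rep(1, 6)) + rnorm(n)
  list(y = y, d = d, X = X)
}

dat <- simulate_one(overlap = 6)   # full confounding
str(dat)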
Figure 5 shows the estimation results for α for the different values of T, akin to Figure 1. We observe similar trends as before. Methods prone to over-selection recovered more inactive controls as T increased, i.e. they included too many inactive controls, which affected inference for the treatments those controls were correlated with. Some of these controls became increasingly influential with T as they were associated with more treatments, and so they became harder to discard. Despite that, under-selection (here suffered by BMA) was also problematic: for larger T the model became highly confounded, since a subset of the controls accounted for a larger proportion of the variance in both the outcome and the treatments. This led BMA to discard, with high probability, active controls that were highly correlated with the treatments. We also observed stable but improving performance of ACPME, although in its best-performing scenario its average RMSE still more than doubled that of the oracle OLS. In contrast, our CIL proposal achieved oracle-type performance for every examined level of T.

The two main ingredients in our proposal are learning from data whether and to what extent control inclusion/exclusion should be encouraged to improve multiple treatment inference, and a convenient computational strategy to render the approach practical. This is in contrast to the current literature, which encourages the inclusion of certain controls to avoid under-selection biases but can run into serious over-selection bias and variance, as we have illustrated. By learning the relative importance of potential confounders, as in our CIL framework, one may bypass this problem. These issues are practically relevant; e.g. in the salary data we showed that one may fail to detect a negative association between black race and salary. Further, the proposed Bayesian framework naturally allows for posterior predictive inference on functions that depend on multiple parameters, such as the variation in salary jointly associated with multiple treatments. Interestingly, our analyses revealed a reduced association between salary and potentially discriminatory factors such as gender or race in 2019 relative to 2010, as well as lower heterogeneity across states. These results are conditional on controls that include education, employment and other characteristics that affect salary. That is, our results reveal lower salary discrepancies in 2019 between races/genders, provided that two individuals have the same characteristics (and that they were hired in the first place). This analysis offers a complementary view to analyses that are unadjusted by controls, and which may reveal equally interesting information. For example, if females migrated towards lower-paying occupational sectors in 2019 and received a lower salary as a consequence, this would not be detected by our analysis, but would be revealed by an unadjusted analysis. To keep our exposition simple, we focused our discussion on linear treatment effects, but it is possible to extend our framework to non-linear effects and interactions between treatments, e.g. via splines or other suitable non-linear bases.
S6.1 Proof of Proposition 1

Let p(y | γ, δ, θ) = p(y | γ, δ). Then p(y | θ) = Σ_{δ,γ} p(y | δ, γ) p(δ, γ | θ), and differentiating,

∇_θ log p(y | θ) = Σ_{δ,γ} p(δ, γ | y, θ) ∇_θ log p(δ, γ | θ).

If, further, the model prior satisfies p(γ, δ | θ) = Π_{t=1}^T p(δ_t) Π_{j=1}^J p(γ_j | θ), then

∇_θ log p(y | θ) = Σ_{j=1}^J E[∇_θ log p(γ_j | θ) | y, θ],

since the terms involving p(δ_t) do not depend on θ.

S6.2 Proof of Corollary 2

The empirical Bayes estimate defined by (11) maximizes log p(y | θ). For short, denote H(θ) = p(y | θ) and h_j(θ) = p_j(γ_j | θ), where generically ∇_θ log H(θ) = ∇_θ H(θ)/H(θ). Under the assumptions of Corollary 2, Proposition 1 gives

∇_θ log H(θ) = Σ_{j=1}^J E[{γ_j − π_j(θ)} / {π_j(θ)(1 − π_j(θ))} | y, θ] ∇_θ π_j(θ).   (S18)

Denoting f_j = (1, f_{j,1}, . . . , f_{j,T})', direct algebra gives

∇_θ log h_j(θ) = (1 − 2ρ) f_j {γ_j − π_j(θ)},   (S19)

since ∇_θ π_j(θ) = (1 − 2ρ) f_j π_j(θ)(1 − π_j(θ)) (S20). Then, replacing (S19) into (S18),

∇_θ log H(θ) = (1 − 2ρ) Σ_{j=1}^J f_j {P(γ_j = 1 | y, θ) − π_j(θ)}.

S6.3 Proof of Proposition 3

Consider the right-hand side in (11), where p_u(δ, γ | y) are the posterior probabilities under a uniform prior p_u(δ, γ) ∝ 1. We seek to set the parameters s_t and q_j in the approximation p̃_u(δ, γ | y) = Π_{t=1}^T Bern(δ_t; s_t) Π_{j=1}^J Bern(γ_j; q_j) using Expectation Propagation. That is, we set q = (q_1, . . . , q_J) such that the Kullback-Leibler divergence from p_u(δ, γ | y) to p̃_u(δ, γ | y) is minimized, and analogously for s = (s_1, . . . , s_T). Proceeding elementwise, we derive the divergence as a function of each individual q_j; optimizing this expression yields q_j^{ep} = P_u(γ_j = 1 | y). With the same procedure one analogously obtains s_t^{ep} = P_u(δ_t = 1 | y). Let (S23) denote the approximation to the sum in (11) obtained by replacing p_u(δ, γ | y) with p̃_u(δ, γ | y), i.e. Σ_{δ,γ} p̃_u(δ, γ | y) p(δ, γ | θ). Note that the product in the right-hand side of (S23) defines a probability distribution on (δ_1, . . . , δ_T, γ_1, . . . , γ_J) with independent components, hence the sum is the normalizing constant of that distribution. Thus, this constant is just the product of the univariate normalizing constants. The univariate normalizing constant of each Bernoulli is q_j^{ep} π_j(θ) + (1 − q_j^{ep})(1 − π_j(θ)) for every q_j, and similarly s_t^{ep} π_t + (1 − s_t^{ep})(1 − π_t) for every s_t. Hence, replacing into (S23), we obtain the expression in (15).

S6.4 Gradient of the objective function in (15)

From (15), for a given set of q_j we have, up to an additive constant, the log objective

Σ_{j=1}^J log{ q_j π_j(θ) + (1 − q_j)(1 − π_j(θ)) }.   (S24)

We are interested in computing the gradient of the function being optimized in (S24). Denote h_j(θ) := q_j π_j(θ) + (1 − q_j)(1 − π_j(θ)) for short. Simple algebra provides

∇_θ log h_j(θ) = (2q_j − 1) ∇_θ π_j(θ) / h_j(θ).

From (S20) we recover the remaining gradient, ∇_θ π_j(θ) = (1 − 2ρ) f_j π_j(θ)(1 − π_j(θ)), where f_j = (1, f_{j,1}, . . . , f_{j,T})', and so the gradient for the expression in (S24) is simply

Σ_{j=1}^J (1 − 2ρ) f_j (2q_j − 1) π_j(θ)(1 − π_j(θ)) / h_j(θ).

S7 Computational methods

S7.1 Product MOM non-local prior

Figure S6 illustrates the density of the product MOM non-local prior of Johnson & Rossell (2012). Briefly, denote by p_n(α_t | δ_t = 1, φ) = N(α_t; 0, τφ) independent Gaussian priors for t = 1, . . . , T, and similarly p_n(β_j | γ_j = 1, φ) = N(β_j; 0, τφ) for j = 1, . . . , J. Proposition 1 in Rossell & Telesca (2017) shows that the following identity holds exactly:

p(y | γ, δ) = p_n(y | γ, δ) E_n[ Π_{t: δ_t = 1} α_t²/(τφ) Π_{j: γ_j = 1} β_j²/(τφ) ],

where p_n(y | γ, δ) is the integrated likelihood under p_n(α, β), and E_n[·] denotes the posterior expectation under p_n(α, β | y, γ, δ). To estimate p_n(y | γ, δ) for non-Gaussian outcomes we use a Laplace approximation. Regarding the second term, we approximate it by a product of expectations, which Rossell et al. (2021) showed leads to the same asymptotic properties and typically enjoys better finite-n properties than a Laplace approximation.

Algorithm 1 describes our method to estimate θ̂^{ep} and θ̂^{eb}. We employ the quasi-Newton BFGS algorithm to optimize the objective function. For θ̂^{eb}, we use the gradients from Corollary 2, while the Hessian is evaluated numerically using line search, with the R function nlminb. Note, however, that obtaining θ̂^{eb} requires sampling models from their posterior distribution for each θ, to then obtain the posterior inclusion probabilities required by (13), which is impractical. Instead, we restrict attention to the models M sampled for either θ = 0 or θ = θ̂^{ep}, in order to avoid successive MCMC runs at every step, relying on the relative regional proximity between the starting point θ̂^{ep} and θ̂^{eb}.
This proximity would ensure that M contains the large majority of models with non-negligible posterior probability under θ̂^{eb}. For θ̂^{ep}, we employ the same BFGS strategy using the gradient computed in Section S6.4, with numerical evaluation of the Hessian. This computation requires only one MCMC run at θ = 0, which allows us to use a grid search to avoid local optima. As for the size of the grid, we let the user specify which points are evaluated. For K points in the grid one must evaluate the log objective function K^{T+1} times, so we recommend reducing the grid density as T grows. By default, we evaluate every integer in the grid assuming T is not large, but preferably we avoid coordinates greater than 10 in absolute value, as in our experiments it is very unlikely that any global posterior mode far from zero is isolated, i.e. not reachable by BFGS when starting from its closest point in the grid. Additionally, even if that were the case, numerically it makes no practical difference, considering that marginal inclusion probabilities are bounded away from zero and one regardless.

Algorithm 1: Obtaining θ̂^{ep} and θ̂^{eb}.
Output: θ̂^{ep} and θ̂^{eb}.
1. Obtain B posterior samples (γ, δ)^{(b)} ~ p(γ, δ | y, θ = 0) for b = 1, . . . , B. Denote by M(0) the corresponding set of unique models.
2. Compute s_t = P(δ_t = 1 | y, θ = 0) and q_j = P(γ_j = 1 | y, θ = 0).
3. Conduct a grid search for θ̂^{ep} around θ = 0, then optimize (15) via BFGS starting from the best grid point to obtain θ̂^{ep}.
4. Obtain posterior samples at θ = θ̂^{ep} and let M be the union of the corresponding unique models with M(0).
5. Optimize (11), restricting the sum to the models in M and using the gradients from Corollary 2, initialized at θ̂^{ep}, to obtain θ̂^{eb}.

Figure S7 shows the empirical Bayes objective function in (11) and (15) in a simulated dataset with a single treatment. A bimodality is apparent in the left panel.

S8.2 Simulation of artificial controls

For both amounts K_1 = 100 and K_2 = 200 of artificial predictors, the simulation protocol was the same. Every artificial control z_k ∈ R^n, for k = 1, . . . , 100 or k = 1, . . . , 200 respectively, was simulated to correlate with one individual treatment, according to the subset to which it was assigned, and only indirectly with the remaining treatments. In particular, we drew elements of z_k from z_{i,k} | d_{i,t} = 1 ~ N(1.5, 1) and z_{i,k} | d_{i,t} = 0 ~ N(−1.5, 1), where d_t denotes the column of the treatment matrix associated with the given z_k.

S8.6 Testing CIL with different amounts of confounders for T = 1

Figure S11 shows the effect of having various amounts of active confounders. The results are consistent with the effects reported in Figures 1 and S10, which were magnified for large numbers of active confounders. These are challenging situations to tackle, since the tested methods aim at model sparsity while the true model size is relatively large. Although our method still performed at oracle rates in low-confounding scenarios, its relative performance was compromised for the highest levels of confounding. This occurred in part because accurate point estimation in (6) became increasingly harder as the correlation between covariates strengthened, which in turn affected the ability of the algorithm to calibrate θ reliably. Even in these hard cases, however, its performance was not excessively far from the best competing method, while it clearly outperformed BMA in all of them.

[Figure caption fragment: ... with the corresponding 50% posterior intervals for both years.]
[Figure caption fragment: results over simulated datasets for each level of confounding, as described in Figure 1; in all panels, n = 100, J + T = 100 and α = 1. The sudden general improvement at the right end of the center and right panels is due to a sharper deterioration of the oracle OLS RMSE at complete confounding relative to other methods.]
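The artificial-control simulation of Section S8.2 can be written in a few lines; the sketch below draws each artificial control from N(1.5, 1) or N(−1.5, 1) depending on the value of its assigned binary treatment, exactly as described above (toy sizes and binary treatments assumed).

# Artificial controls correlated with the treatments but not with the outcome:
# z_{i,k} | d_{i,t} = 1 ~ N(1.5, 1) and z_{i,k} | d_{i,t} = 0 ~ N(-1.5, 1),
# where each artificial control k is assigned to one treatment t.
make_artificial <- function(D, n_per_treatment = 25) {
  n <- nrow(D); Tt <- ncol(D)
  Z <- matrix(NA_real_, n, Tt * n_per_treatment)
  assigned <- rep(seq_len(Tt), each = n_per_treatment)   # treatment assigned to each control
  for (k in seq_len(ncol(Z))) {
    mu <- ifelse(D[, assigned[k]] == 1, 1.5, -1.5)
    Z[, k] <- rnorm(n, mean = mu, sd = 1)
  }
  Z
}

set.seed(1)
D <- matrix(rbinom(400, 1, 0.5), nrow = 100, ncol = 4)   # four binary treatments
Z <- make_artificial(D, n_per_treatment = 25)            # 100 artificial controls
round(cor(D[, 1], Z[, 1]), 2)                            # correlated with its assigned treatment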
References

Double robust matching estimators for high dimensional confounding adjustment
High-dimensional confounding adjustment using continuous spike and slab priors
Guided Bayesian imputation to adjust for confounding when combining heterogeneous data sources in comparative effectiveness research
Benign overfitting in linear regression
Inference on treatment effects after selection among high-dimensional controls
Bayesian linear regression with sparse priors
Needles and Straw in a Haystack: Posterior concentration for possibly sparse sequences
hdm: High-dimensional metrics
Automatic debiased machine learning of causal and structural effects
Finite population estimators in stochastic search variable selection
Variable selection in causal inference using a simultaneous penalization method
Robust inference on average treatment effects with possibly more covariates than observations
Integrated public use microdata series, current population survey: Version 8.0 [dataset]
Regularization paths for generalized linear models via coordinate descent
Prior distributions for variance parameters in hierarchical models
Calibration and empirical Bayes variable selection
Bayesian regression tree models for causal inference: Regularization, confounding, and heterogeneous effects
Regularization and confounding in linear regression for treatment effect estimation
Generalized spike-and-slab priors for Bayesian group feature selection using expectation propagation
Confidence intervals and hypothesis testing for high-dimensional regression
On the use of non-local prior densities in Bayesian hypothesis tests
Bayesian model selection in high-dimensional settings
Exact post-selection inference, with application to the lasso
The effect of the prior distribution in the Bayesian adjustment for confounding algorithm
Expectation propagation for approximate Bayesian inference
A family of algorithms for approximate Bayesian inference
Bayes and empirical Bayes: do they merge?
Concentration of posterior probabilities and normalized l0 criteria
Approximate Laplace approximations for scalable model selection
mombf: Bayesian Model Selection and Averaging for Non-Local and Local Priors
Nonlocal priors for high-dimensional estimation
Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem
Bayesian inference for sparse generalized linear models
Scalable Bayesian variable selection using non-local prior densities in ultrahigh-dimensional settings
Outcome-adaptive lasso: Variable selection for causal inference
BACprior: Choice of the hyperparameter omega in the Bayesian Adjustment for Confounding (BAC) algorithm. R package version 2
The Bayesian causal effect estimation algorithm
On asymptotically optimal confidence regions and tests for high-dimensional models
Accounting for uncertainty in confounder and effect modifier selection when estimating average causal effects in generalized linear models
Bayesian effect estimation accounting for adjustment uncertainty
High dimensional ordinary least squares projection for screening variables
Model-averaged confounder adjustment for estimating multivariate exposure effects with linear regression
Nonlocal priors for Bayesian variable selection in generalized linear models and generalized linear mixed models and their applications in biology data
Distributed Bayesian posterior sampling via moment sharing