Variable Selection for Causal Inference via Outcome-Adaptive Random Forest
Daniel Jacob
2021-09-09

Estimating a causal effect from observational data can be biased if we do not control for self-selection. This selection is based on confounding variables that affect both the treatment assignment and the outcome. Propensity score methods aim to correct for confounding. However, not all covariates are confounders. We propose the outcome-adaptive random forest (OARF), which includes only desirable variables for estimating the propensity score in order to decrease bias and variance. Our approach works in high-dimensional datasets and when the outcome and propensity score models are non-linear and potentially complicated. The OARF excludes covariates that are not associated with the outcome, even in the presence of a large number of spurious variables. Simulation results suggest that the OARF produces unbiased estimates, has a smaller variance, and is superior in variable selection compared to other approaches. The results from two empirical examples, the effect of right heart catheterization on mortality and the effect of maternal smoking during pregnancy on birth weight, show treatment effects comparable to previous findings but tighter confidence intervals and more plausible selected variables.

In the causal inference literature, we can classify data into two categories. The first is data from a randomized controlled trial, where the researcher or practitioner has full control of the selection process. The second is data from a so-called observational study. In such a setting there are confounding variables that influence both the outcome and the probability of treatment.
To construct unbiased treatment effect estimates from observational studies, propensity score (PS) methods are an increasingly popular tool to control for confounding (Rosenbaum and Rubin, 1983). One model-based approach that directly adjusts for the confounding bias using propensity scores, the inverse probability of treatment weighting (IPTW), was proposed by Hirano and Imbens (2001). Estimating the propensity score can be seen as a classification problem where one seeks a good prediction of the assignment probability given covariates, regardless of the functional form of the distribution of the probabilities. Besides logistic regression, non-parametric methods such as random forests (Lee, Lessler, and Stuart, 2010; Westreich, Lessler, and Funk, 2010; Zhao, Su, Ge, and Fan, 2016), neural networks, and support vector machines (Westreich et al., 2010) have been proposed to estimate the propensity score. An interesting question is which variables should be included to estimate the propensity score. A common suggestion is to include all pre-treatment variables that influence the treatment while excluding variables that do not affect it (these can be spurious variables that do not influence any dependent variable, but also variables that are predictive only of the outcome). Following this rule, we would include confounding variables, since they influence the treatment and the outcome, as well as variables that predict only the treatment and not the outcome, but we would exclude variables that are only predictive of the outcome. We give an example of the different relationships between the covariates, the outcome, and the treatment in Figure 1.1. We denote by X_t covariates that predict only the treatment but not the outcome, and by X_o covariates that predict the outcome but not the treatment probability.
Of special interest are the confounding covariates X_c that we need to take into account to get an unbiased estimate of the average treatment effect (ATE), while X_s are spurious covariates that are uncorrelated with both treatment and outcome. Let us illustrate the role of the variables using vaccination against COVID-19 as an example (Figure 1.1: dependencies of the covariates D, Y, X_c, X_o, X_t, X_s). The treatment variable (D) is whether a person is vaccinated or not, while the outcome variable (Y) is the individual probability of severe symptoms. Variable X_t could be the industry sector a person works in, since it influences the probability of being vaccinated but has no influence on symptoms. X_o could be whether a person smokes or not, which has an influence on the degree of symptoms but does not determine the vaccination probability. X_c might be the age of a person, which is associated with vaccination and symptoms, while the variable body height might be unrelated and is classified as X_s. The rationale behind the classification of covariates is to perform variable selection when estimating the propensity score. First, and most important to ensure unbiased treatment effects in observational studies, is to achieve unconfoundedness. In theory, it is sufficient to use only covariates X_c for the propensity score, since the main analytic goal is to eliminate bias. This implies that we do not aim to explain the treatment assignment mechanism with high accuracy. If this were the goal, variables X_t would need to be included. However, doing so could limit the overlap assumption. These are the two main reasons why we want to exclude covariates X_t from the propensity score model. Last, even if the dependency between X_o and D is zero in the true data-generating process, the finite-sample bias can be reduced if variables X_o are included in the propensity score model. The finite-sample bias arises due to random confounding when the sample size is small.
In a recent paper, Shortreed and Ertefaie (2017) suggest a different approach to get unbiased treatment effects from observational studies, with a focus on decreasing the variance. Their proposed outcome-adaptive lasso (OAL) selects for the propensity score estimation only features that have a relationship with the outcome (X_c, X_o), while excluding variables that are predictive only of the treatment (X_t) as well as spurious variables (X_s). To do so, they first find covariates that predict the outcome by regressing the outcome on the treatment and all covariates using a linear model. In the second step, the estimation of the propensity score, they use the lasso with an additional penalty term. The penalty contains individual weights for each covariate based on the coefficients from the first step. The result is that the lasso excludes covariates that predict the treatment but are not related to the outcome, as well as spurious covariates. Consider p covariates, denoted X_j for j = 1, ..., p; the OAL estimator is then defined as

α̂(OAL) = argmin_α { −ℓ_n(α; D, X) + λ_n Σ_{j=1}^p ω̂_j |α_j| },

where ℓ_n is the log-likelihood of the logistic propensity score model with coefficients α, and ω̂_j = |β̂_j|^(−γ) s.t. γ > 1. The estimates β̂_j refer to the coefficients from regressing the outcome on the covariates, conditioning on the treatment. An important limitation of the outcome-adaptive lasso is that this approach is restricted to parametric models. Both the outcome model and the propensity score model have to be correctly specified. However, to control for selection bias we would like to have as many pre-treatment characteristics as possible to condition on. This requirement lets new datasets easily become high-dimensional. In such settings, we face two potential challenges. First, the outcome and the treatment variable might not depend linearly on the covariates. There can be interactions between variables and complex structures.
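As an illustration, the two-step OAL idea can be sketched in Python rather than the authors' R implementation: the per-covariate penalty ω̂_j = |β̂_j|^(−γ) is equivalent to a standard L1 logit after rescaling each column by |β̂_j|^γ and mapping the coefficients back. The covariate roles, the choice γ = 2, and the regularization strength C = 0.1 below are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
n, p = 1000, 10
X = rng.normal(size=(n, p))
# illustrative roles: X0, X1 confound; X2 predicts only Y; X4, X5 predict only D
ps = 1 / (1 + np.exp(-(X[:, 0] + X[:, 1] + X[:, 4] + X[:, 5])))
D = rng.binomial(1, ps)
Y = 0.5 * D + 0.6 * (X[:, 0] + X[:, 1] + X[:, 2]) + rng.normal(size=n)

# step 1: outcome regression gives the adaptive weights omega_j = |beta_j|^(-gamma)
beta = LinearRegression().fit(np.column_stack([D, X]), Y).coef_[1:]
gamma = 2.0
scale = np.abs(beta) ** gamma      # rescaling by 1/omega_j = |beta_j|^gamma

# step 2: L1 logit on the rescaled covariates, then map coefficients back;
# covariates with tiny outcome coefficients are shrunk to exactly zero
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lasso.fit(X * scale, D)
alpha = lasso.coef_.ravel() * scale
print(np.flatnonzero(alpha))       # indices selected for the propensity model
```

The treatment-only predictors X4 and X5 receive near-zero outcome coefficients and hence an effectively infinite penalty, so they drop out of the propensity model.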
Second, even if we could include such interactions, facing many covariates leads to the problem of deciding which covariates to include in the outcome and propensity score models. It might be the case that only a few characteristics are important; the question is which ones. Such settings call for a non-parametric method that uses regularization. Keeping the idea of outcome-adaptive regularization but accounting for non-linearity and high-dimensionality, we propose the outcome-adaptive random forest (OARF). First, we estimate a modified and standardized variable importance score using a random forest; this score is used as the penalty weight. Second, we use the modified variable importance to regularize a random forest that learns the propensity score. To do so, we penalize the gain at each split and propose the use of an initial feature space based on the modified variable importance. This approach allows replacing both linear models, the regression (which we will refer to as OLS) and the lasso, with a random forest to estimate the ATE when the functions are non-linear or high-dimensional. We also make use of sample splitting to avoid overfitting and apply cross-fitting to restore efficiency. Our approach of a regularized random forest thus selects in the propensity score only those covariates that are predictive of the outcome and excludes spurious variables. The OARF is designed to allow for categorical variables and is robust to different numbers of levels between categorical or continuous variables. The result is a flexible non-parametric method that allows unbiased estimation of the ATE while decreasing the variance. Our approach is also fast in computation (about 30 seconds for 2,000 observations and 20 covariates). Let us demonstrate the outcome-adaptive approach in a simple example. Assume the true outcome model as in equation 2.2, Y = θD + β_1 X_1 + β_2 X_2 + β_3 X_3 + ε, so that Y depends on the treatment D and linearly on three covariates (X_1, X_2, X_3).
We also generate a propensity score model that includes variables X_1, X_2, X_5, and X_6 (see the Appendix for how the function is generated). The covariates are generated from a multivariate normal distribution and are independent. We set the sample size to N = 1000 and generate p = 20 covariates, where only the first three are dependent on the outcome Y. Using the notation from Figure 1.1, we have the following structure: X_1, X_2 ∈ X_c, X_3 ∈ X_o, and X_5, X_6 ∈ X_t. Using a linear model (OLS), as in Shortreed and Ertefaie (2017), we want variables X_1 to X_3 to have the highest coefficients and all others to be penalized; hence the latter should have very small estimated coefficients. We run 500 Monte Carlo simulations and predict the treatment effect using the inverse probability of treatment weighted (IPTW) estimator (Lunceford and Davidian, 2004; Hernán and Robins, 2006). We use a logit model to estimate the propensity score, once using all covariates (the full model) and once using only covariates that are confounding or predict the outcome (the target model). Figure 2.1 shows the estimated treatment effect using both models, as well as the standardized and absolute coefficients from the OLS model. The target model (which can be seen as an oracle model) shows a smaller variance around the true treatment effect of 0.5 (indicated with the horizontal line). We also see that the OLS correctly assigns the highest coefficients to the variables that are predictors of the outcome, while all other variables get smaller coefficients. The approach of Shortreed and Ertefaie (2017), selecting for the propensity score model only variables that are predictive of the outcome, works as expected if the underlying outcome and propensity score functions are linear. As soon as we introduce a more complicated function, the selection process in step 1 is biased.
To demonstrate this, let us now assume a true outcome model that is non-linear in the covariates. Selecting variables using non-parametric models is not straightforward, since we do not directly estimate coefficients for each feature as in the OLS. Using a random forest, we can instead create a variable importance measure to find the most predictive variables from the outcome model. Since we want to use a random forest for both steps, the outcome model and the propensity score, we have to replace the adaptive lasso by penalizing the tree-building mechanism. In this section, we show how to best estimate the variable importance measure from the outcome model and state the importance of the initial feature space. We describe in detail how the information about the covariates is then used to penalize the random forest that estimates the propensity score. This leads to a regularized version of the random forest that shrinks penalized variables to zero. To ensure unbiased effects for a causal parameter, the following assumptions from the potential outcome framework are required. Each observation has two potential outcomes, Y^1 if treated and Y^0 if not. We denote treatment by the binary indicator D ∈ {0, 1} and observed covariates X ∈ R^p. See, for example, Rubin (1980). First, ignorability: the treatment assignment is independent of the two potential outcomes, conditional on the covariates. Second, the stable unit treatment value assumption (SUTVA) ensures that there is no interference, no spillover effects, and no hidden variation. This means that the treatment status of individual i does not affect the potential outcomes of individual j. The third assumption concerns the propensity score, e(x) := P(D_i = 1 | X_i = x), which needs to be bounded away from zero and one: ∀x ∈ supp(X_i), 0 < P(D_i = 1 | X_i = x) < 1. The last assumption states that the covariates are not affected by the treatment: X_i^1 = X_i^0.
We are interested in estimating the ATE (θ), assuming a partially linear model that takes the following form:

Y_i = θD_i + g(X_i) + U_i, E[U_i | X_i, D_i] = 0. (3.5)

Building a tree based on the classification and regression tree (CART) algorithm works as follows. A subset of the feature space X is selected at each node t in the tree. Internal nodes t are labeled with a split s_t = (X_j < c), which creates at least two further subsets or children, t_L and t_R (L and R refer to left and right, as in a tree). This procedure is repeated until we reach terminal nodes (or leaves) that are labeled with the best prediction of the outcome variable. In the regression case, this would be the mean value ȳ_t. The predicted output for a new instance is the mean value of the leaf reached by the instance when it is propagated through the tree. Using a recursive procedure at every node t, a tree identifies the split s_t = s* for which the partition of the sample into t_L and t_R maximizes the reduction of the cost, ∆(s, t) = cost(D_t) − cost(D_{t_L}) − cost(D_{t_R}). Let D be the set of observations for a specific decision or split. For the first-step regression setting, we define the cost function as cost(D) = Σ_{i∈D} (y_i − ȳ)², where ȳ = |D|^{−1} Σ_{i∈D} y_i is the mean of the outcome variable in the specified set or region. Maximizing the decrease of the variance within each leaf can be seen as making the leaves pure in terms of the outcome values. Hence, the cost function is called the impurity measure. The impurity change (or gain) through a split can be used to estimate the importance of a variable by evaluating each cost function given the specific feature j on which the split s* is based. Hence, we can define the gain in terms of a feature j instead of a split s: ∆(j, t) := ∆(s, t). The global importance value is given by accumulating the gain over a feature, ∆(j) = Σ_{t∈S_j} ∆(j, t), where S_j represents all the splitting points used in a single tree for the j-th feature. This is because a feature X_j can be used multiple times during the recursive growth of a tree.
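The impurity cost and split gain described above can be written out directly; a toy numeric check with made-up values:

```python
import numpy as np

def impurity(y):
    """Variance-type cost: sum of squared deviations from the node mean."""
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def split_gain(x, y, c):
    """Impurity reduction Delta(s, t) for the split s: x < c at node t."""
    left, right = y[x < c], y[x >= c]
    return impurity(y) - impurity(left) - impurity(right)

# toy node: the outcome level jumps at x = 0, so c = 0 should give the largest gain
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0.1, 0.0, 0.2, 1.0, 1.1, 0.9])

gains = {c: split_gain(x, y, c) for c in (-1.0, 0.0, 1.5)}
best = max(gains, key=gains.get)
print(best)  # 0.0
```

A CART implementation evaluates every candidate cutpoint for every candidate feature in this way and keeps the split with the largest gain.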
Since a random forest consists of B such trees, the importance value (Imp_j) is just an average over all trees:

Imp_j = B^{−1} Σ_{b=1}^B Σ_{t∈S_j^b} ∆(j, t).

Nembrini, König, and Wright (2018) find that the impurity importance can be decomposed into two parts: the impurity reduction directly related to the true importance, and a part of the impurity reduction that is driven by the structure of the feature (e.g., a dummy variable with only two levels vs. a continuous variable). The latter component introduces a bias in the impurity measure. To correct for the structure of different features, the authors propose to extend the feature space from dimension p to 2p by adding a permuted copy of each feature. If the selected split variable is one of the permuted copies, its impurity reduction estimates the structural bias, which is subtracted to obtain the actual impurity reduction (AIR), yielding approximately unbiased importance measures. We note that the permutation method is computationally expensive compared with the AIR method, because the former relies on OOB predictions for each feature. In terms of robustness with respect to different numbers of levels between variables, they perform similarly. In our setting, we have at least one binary variable, the treatment assignment, which has a limited range of splitting values compared to continuous variables. Therefore, it is important to take the different structures into account by applying the AIR measure (available in the ranger package as 'impurity corrected'). The penalization parameter λ_j should depend on the predictive power of the covariates for the outcome. Our measure of predictive power is the importance score Imp*_j, which can be included as proposed by Deng and Runger (2013):

λ_j = (1 − γ)λ_0 + γ Imp*_j,

where λ_0 is a general penalization parameter and γ is a weight parameter that determines the proportion of general and feature-specific penalization. Next, we define the normalized importance score (Imp*_j) as

Imp*_j = max(Imp_j, 0) / max_{j'} max(Imp_{j'}, 0). (3.7)

The normalized importance score is scaled within the interval [0, 1] (see the Appendix for the proof), and equation 3.7 sets all negative importance scores from the AIR measure to zero.
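The normalization in equation 3.7 is straightforward to implement; a small sketch in which the raw AIR scores are made up:

```python
import numpy as np

def normalize_importance(imp):
    """Clip negative AIR scores at zero, then scale into [0, 1] (cf. eq. 3.7)."""
    clipped = np.clip(np.asarray(imp, dtype=float), 0.0, None)
    top = clipped.max()
    return clipped / top if top > 0 else clipped

raw = np.array([4.0, 2.0, -0.5, 0.1])  # AIR scores can be negative for noise features
imp_star = normalize_importance(raw)
print(imp_star)  # largest score maps to 1, negative scores to 0
```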
Since we only want to rely on the importance values obtained from the outcome model, we set λ_0 = 0 and γ = 1. This allows for the heaviest penalization based on outcome-related covariates, namely λ_j = Imp*_j. Next, we want to use the information obtained in the first step and use only variables that have a high importance for the outcome to estimate the propensity score. Again, we want to use a random forest, since the propensity score function can have a similarly non-linear structure as the outcome model. To guide the feature selection, Deng and Runger (2012) introduce a (guided) regularized random forest (RRF), proposing to weight the gains of the splits during the recursive procedure. As a result of the random feature selection at each split, after m splits only a subset F of features is included in the tree. To limit the feature space, the idea is to exclude variables not belonging to F unless their importance is substantially larger than the maximum gain of features already included. Deng and Runger (2013) define the regularized gain as

gain_R(X_j, t) = λ_j ∆(j, t) if j ∉ F, and ∆(j, t) if j ∈ F, (3.10)

where λ_j ∈ (0, 1] is the penalty coefficient that controls the gain for each feature j if this feature was not previously used for a split. Originally, the feature space F is an empty set at the root node of the first tree. Only if a feature adds enough predictive information is it included. Based on equation 3.10, the smaller the value of λ_j, the higher the penalty and hence the less likely it is for feature j to be included in the subset. In our setting, we want to include features that may not be that predictive of the treatment but are predictive of the outcome. We also want to make sure that the important features are used in the splitting process (at least with a higher probability). Therefore, we already include features in F that fulfill the following criterion: if the importance score of X_j is at least the mean over all importance scores, we include the variable in the initial feature space, and we drop variables with an importance score of zero from F.
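The decision rule in equation 3.10 can be sketched in a few lines of Python; the λ values and gains below are hypothetical:

```python
def regularized_gain(gain, j, F, lam):
    """Penalize the raw gain by lam[j] when feature j is not yet in F (cf. eq. 3.10)."""
    return gain if j in F else lam[j] * gain

def maybe_admit(gain, j, F, lam, parent_gain):
    """Admit feature j only if its (possibly penalized) gain beats the parent's."""
    if regularized_gain(gain, j, F, lam) > parent_gain:
        F.add(j)
        return True
    return False

lam = {0: 1.0, 1: 0.9, 2: 0.1}  # feature 2 is heavily penalized (low importance)
F = {0}                          # initial feature space from the importance rule
print(maybe_admit(0.5, 1, F, lam, 0.3))  # 0.9 * 0.5 = 0.45 > 0.3 -> True
print(maybe_admit(0.5, 2, F, lam, 0.3))  # 0.1 * 0.5 = 0.05 < 0.3 -> False
```

A feature with λ_j near zero can only enter F if its raw gain is far larger than anything seen so far, which is exactly how outcome-irrelevant covariates are kept out of the propensity score model.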
Figure 3.1 illustrates the guided feature selection process. We start with a non-empty set (here, variables 1 and 2 are included). The first split is based on variable 3. Since this variable is not in the feature space, the gain is multiplied by λ_3. If the penalized gain is higher than the gain from the parent, the feature is included. As an illustration, we always move from top to bottom and from left to right. The next split is then on X_4; again, if the penalized gain difference is positive, the feature is included in F. Next, a split on X_1 is made; since the variable is already in the feature space, the gain is not penalized. Still, it has to be higher than the gain from the parent node to keep the split. Building the first tree, we end up with a feature space containing four variables. The information about the feature space is now used to build further trees. In tree 2, we start with the initial feature space as extracted from tree 1. Note that this procedure does not allow growing trees in parallel, since each tree needs the information about the feature space from the former tree to determine whether the gain from a variable should be penalized or not. The covariate selection process does not depend on how often a variable can be split, and hence on whether the variable is continuous or, for example, binary. It is sufficient if the gain from one split is above the threshold. Different from calculating the variable importance using the impurity measure, we do not average over all splits within a tree. This allows the covariates to be of any form, such as binary, categorical, or continuous. To avoid overfitting, which can easily happen when using flexible methods such as random forests, we make use of sample splitting and cross-fitting. First, we split our sample into two equal parts, where I_A denotes the auxiliary sample and I_E the estimation sample.
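Schematically, the sequential growth with a shared feature space F looks as follows; the per-tree candidate splits and their gains are hypothetical stand-ins for the recursive CART procedure:

```python
def grow_forest(candidate_splits, lam, F_init):
    """Trees are grown one after another; F accumulates admitted features,
    so later trees see the feature space built up by earlier ones."""
    F = set(F_init)
    for tree in candidate_splits:          # each entry: (feature, gain, parent_gain)
        for j, gain, parent_gain in tree:
            g = gain if j in F else lam.get(j, 1.0) * gain
            if g > parent_gain:
                F.add(j)                   # admitted: no penalty in later trees
    return F

splits = [
    [(3, 0.8, 0.5), (4, 0.6, 0.5)],  # tree 1 admits features 3 and 4
    [(2, 0.9, 0.5)],                 # tree 2: feature 2 is penalized away
]
lam = {2: 0.2, 3: 0.9, 4: 0.9}
print(grow_forest(splits, lam, F_init={1}))
```

The shared, growing set F is also why the trees cannot be built in parallel, as noted above.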
We first use the subset I_A to train the propensity score function and I_E to estimate the treatment effect, using the predicted propensity score in the IPTW estimator. Let us denote the resulting estimator as θ̂(I_A, I_E). Now we switch the roles of the auxiliary and estimation samples to obtain a second estimator, θ̂(I_E, I_A). Since both estimators are estimated on only a subset of the observations, there might be an efficiency loss. Cross-fitting, which was recently introduced by Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, and Robins (2018), aims to restore efficiency by simply averaging the two estimators: θ̂ = ½[θ̂(I_A, I_E) + θ̂(I_E, I_A)]. This approach generalizes to K folds, where I_A contains K − 1 folds and I_E the remaining fold. Similar to cross-validation, each fold is used to estimate the ATE by iteratively looping through the folds. The final estimator is the average over the K estimators. We use the full sample to get the variable importance from the first step. This is especially helpful if the sample size is small. When using a logit model or the lasso as a benchmark, we also use the full sample and estimate the final treatment effect in one step. To evaluate the performance of our OARF method in more detail, we consider different data generating processes (DGPs) and the following methods for comparison and benchmarking: the OAL method by Shortreed and Ertefaie (2017) and two generalized linear models, the first using all covariates in a logit model (Lo full) and the second using only target variables (X_c, X_o) to estimate the propensity score (Lo targ). We use the same benchmarks for the random forest, denoted RF full and RF targ. We also use the regularized random forest (RRF), which only sets a penalty based on the first-step variable importance but does not make use of an initial feature set, as proposed here for the OARF. We use the following order of variables when we look at the variable selection plots: X_c, X_o, X_t.
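A compact Python sketch of K-fold cross-fitting for the IPTW estimator, with a logit propensity model standing in for the OARF and an illustrative data-generating process:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def crossfit_iptw(X, D, Y, K=2, seed=0):
    """K-fold cross-fitting: fit e(x) on K-1 folds, estimate the ATE on the
    held-out fold, and average the fold estimates to restore efficiency."""
    est = []
    for train, test in KFold(K, shuffle=True, random_state=seed).split(X):
        ps = LogisticRegression(max_iter=1000).fit(X[train], D[train])
        e = np.clip(ps.predict_proba(X[test])[:, 1], 0.01, 0.99)
        d, y = D[test], Y[test]
        est.append(np.mean(d * y / e) - np.mean((1 - d) * y / (1 - e)))
    return float(np.mean(est))

rng = np.random.default_rng(2)
n = 4000
X = rng.normal(size=(n, 3))
e = 1 / (1 + np.exp(-X[:, 0]))             # X0 confounds treatment and outcome
D = rng.binomial(1, e)
Y = 0.5 * D + X[:, 0] + rng.normal(size=n)
print(round(crossfit_iptw(X, D, Y), 2))
```

With K = 2 this reproduces the two-way swap θ̂ = ½[θ̂(I_A, I_E) + θ̂(I_E, I_A)] described above.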
If the number of variables in each group is set to two, then the first two are confounders, variables three and four are regressors on the outcome, and five and six are regressors on the treatment. We use the ranger package in R for all estimations based on the RF, and in part the RRF package by Deng (2013) for the OARF. The OAL approach is based on the replication file from Shortreed and Ertefaie (2017). The tuning parameters λ and γ for the OAL method are selected using a weighted absolute mean difference (wAMD), which we describe in the Appendix. First, we consider two linear settings and generate a binary treatment, D, from a Bernoulli distribution with logit{P(D = 1)} = Σ_{j=1}^p ν_j X_j, and the continuous outcome variable Y as Y = θD + Σ_{j=1}^p β_j X_j + ε, where ε ∼ N(0, 1) and θ = 0.5. The two settings differ in the strength of the confounding effect. Setting 1 sets β = (0.6, 0.6, 0.6, 0.6, 0, 0, 0, ..., 0) and ν = (1, 1, 0, 0, 1, 1, 0, ..., 0), and setting 2 sets β = (0.6, 0.6, 0.6, 0.6, 0, 0, 0, ..., 0) and ν = (0.4, 0.4, 0, 0, 1, 1, 0, ..., 0). Setting 2 hence has a weaker confounding relationship than setting 1. These two settings are identical to the ones used in Shortreed and Ertefaie (2017). Setting 3 aims to have a non-linear relationship between the covariates and the outcome but the same linear structure for the propensity score as in setting 1. Setting 3 is generated as follows: Y = θD + 0.8(X_1 ⊗ X_2) + 0.8(X_3 ⊗ X_4) + ε; logit{P(D = 1)} = Σ_{j=1}^p ν_j X_j, with ν = (1, 1, 0, 0, 1, 1, 0, ..., 0). In all three settings we set N = 500 and p = 20. We find that our OARF performs similarly to the benchmark OAL method. The random forest using all covariates and the one that uses the penalization weights from the first step perform worse and are clearly biased. The reason for the bias might be that the full random forest has a higher chance of selecting variables X_t and X_s than the OARF.
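The linear settings above can be generated with a few lines of Python (a sketch mirroring the formulas; the seed and implementation details are ours):

```python
import numpy as np

def simulate(n, p, beta, nu, theta=0.5, seed=0):
    """Linear DGP: logit P(D=1) = sum_j nu_j X_j, Y = theta*D + sum_j beta_j X_j + eps."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    D = rng.binomial(1, 1 / (1 + np.exp(-X @ nu)))
    Y = theta * D + X @ beta + rng.normal(size=n)
    return X, D, Y

p = 20
beta = np.zeros(p); beta[:4] = 0.6        # X1..X4 predict the outcome
nu = np.zeros(p); nu[[0, 1, 4, 5]] = 1.0  # setting 1: X1, X2 confound; X5, X6 -> D
X, D, Y = simulate(500, p, beta, nu)
print(X.shape, round(D.mean(), 2))
```

Setting 2 only changes the first two entries of ν from 1 to 0.4, weakening the confounding.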
We do see an improvement in terms of bias when using the RRF, which selects fewer of the above-mentioned variables due to the penalization weights. We also find that if we weaken the confounding relationship, all methods have a smaller variance and the RF methods a smaller bias. Boxplots illustrating the IPTW estimator using different methods are shown in Figure C.1. Selected covariates from the propensity score model are shown in Figure C.5. In both settings, the OARF approach selects only the desired features. This is similar to the OAL method. For comparison, we show that the full RF selects all features and the RRF a higher proportion of all variables, while always selecting the desired variables. In setting 3, illustrated in Figure 4.1, all approaches are unbiased, since the propensity score function depends linearly on the covariates. Only the outcome model is non-linear, which is why the variance is higher in the linear models compared to the random forest. Next, we generate data processes where both functions are non-linear (settings 4 to 10). These are the settings where we would expect the OARF to be superior. A complete list of all the DGPs is shown in Table A.1. In those settings, we set the sample size to N = 1000. Our results are shown for p = 20 to allow a clear visualization. Varying the number of covariates to 50, 100, and 200 changes neither the results of the ATE estimates nor the selection of the correct covariates. In settings 4 to 7, we keep the function of the outcome model fixed and only change the propensity score function. Settings 1 to 7 set the number of variables in each of X_c, X_o, and X_t to 2, while the remaining p − 6 variables are spurious (X_s). Settings 8 to 10 use 6 confounders (X_c), 2 covariates each for X_o and X_t, and again the remaining set for X_s. The favourable covariates to select are X_1 to X_4 for settings 1 to 7 and X_1 to X_8 for settings 8 to 10.
Overall, we find that the OARF performs best and is close to the RF that only uses X_c and X_o (RF targ). Settings 4 and 5 show that all methods are biased, while the OARF is closest to the true ATE. In setting 4, all methods are downward biased, which leads to a negative estimate when using the OAL. Only the RF approaches estimate the correct sign of the treatment effect. In setting 6, the linear methods are slightly upward biased, while the RF approaches show no bias. Setting 7 shows a similar bias pattern, where only the OARF estimates unbiased treatment effects. Using more covariates allows making the functions even more complicated, since the flexibility can be increased. Settings 8 and 10 again show some bias for the linear methods as well as a higher variance compared to the RF approaches. Setting 9 is comparable to setting 4 in the sense that all methods are downward biased. In this setting, even the OARF is biased and shows a higher variance. We notice that in some settings there is little difference between the OARF and the full RF. The advantage of the OARF remains, since it achieves the same accuracy using fewer variables than the full RF. In all settings that have at least one non-linear function, only the OARF selects the correct features with high precision. The full RF selects all covariates, and the RRF uses unnecessary covariates in about 80% of the cases. Introducing correlation between the covariates increases the bias and variance. We investigate this effect in a linear setting with moderate and heavy correlation. The results are shown in Figure C.3. If the correlation is moderate (ρ = 0.2), the OARF is still unbiased, while all other methods show a slight bias and an increase in variance. It is reassuring that the random forest can find the correct importance scores even with moderate correlation among the variables.
If we introduce heavy correlation (ρ = 0.5), the random forest also struggles to recover the correct importance scores, and the OARF shows some bias as well. In Table 4.1 we show coverage rates of 95% confidence intervals and the width of the interval (in parentheses). Confidence intervals for IPW were constructed using a percentile-based non-parametric bootstrap. For the OAL method, we use a smoothed non-parametric bootstrap approach that takes the model selection procedure into account. This procedure is described in Efron (2014). The confidence intervals for all RF approaches are based on non-parametric bootstrapping. We use 500 bootstrap samples for each method. We then take the 0.025 (α/2) and 0.975 (1 − α/2) quantiles of the empirical distribution as the lower and upper bounds of the confidence interval. We apply three non-parametric versions: the RF full (without regularization), the RRF (which uses regularization but no initial feature set), and our proposed OARF. We also apply the IPW method using a linear model to estimate the propensity score and the OAL method using the outcome-adaptive lasso to estimate the propensity score. We notice that the width of the confidence interval for the OARF is smaller than for the OAL method (in some settings only half as wide). The vanilla RF and the RRF do not achieve a coverage rate of 95% in any setting, while the OARF achieves the rate in 8 out of the 15 settings. The results show that some data generating processes might be too complex, so that the IPTW estimator remains biased. This is why increasing the sample size does not lead to a higher coverage rate in those cases. In all other settings, we see an increase in the coverage rate and a decrease in the width of the confidence intervals when increasing the sample size from N = 500 to N = 2000. The OARF approach not only decreases the variance when increasing the sample size but also the bias.
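The percentile bootstrap just described is simple to implement; a generic sketch with B = 500 resamples as in the paper (the toy data are ours):

```python
import numpy as np

def percentile_ci(estimate_fn, data, B=500, alpha=0.05, seed=0):
    """Non-parametric percentile bootstrap CI: resample with replacement,
    re-estimate, and take the alpha/2 and 1-alpha/2 empirical quantiles."""
    rng = np.random.default_rng(seed)
    n = len(data)
    stats = [estimate_fn(data[rng.integers(0, n, n)]) for _ in range(B)]
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

data = np.random.default_rng(3).normal(loc=0.5, scale=1.0, size=1000)
lo, hi = percentile_ci(np.mean, data, B=500)
print(round(lo, 2), round(hi, 2))
```

For the ATE, `estimate_fn` would recompute the full IPTW pipeline (propensity fit plus weighting) on each resample, so that the interval reflects the estimation uncertainty of the propensity model as well.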
Figure 4.3 shows the mean squared error over 400 Monte Carlo replications. In the Appendix, we discuss the possibility of tuning certain parameters in the random forest while keeping the objective to obtain balanced covariates rather than to maximize prediction accuracy. We also discuss a generalization of the OARF to other methods, such as double machine learning, introduced by Chernozhukov et al. (2018). We find that using the OARF to estimate the propensity score decreases the bias compared to a full RF. In settings where even the OARF is biased, we additionally use the OARF to estimate the conditional mean of Y. This approach further decreases the bias. In this section, we revisit two empirical examples. Both datasets are observational and contain a rich set of characteristics that may lead to selection into treatment. We use the same five methods as in the simulation study above. We report the ATE and a 95% confidence interval (CI). The datasets we use are freely accessible: RHC (https://hbiostat.org/data/), birth weight (http://www.stata-press.com/data/r13/cattaneo2.dta). Right heart catheterization (RHC) is a diagnostic procedure used for critically ill patients. Connors, Speroff, Dawson, Thomas, Harrell, Wagner, Desbiens, Goldman, Wu, Califf, et al. (1996) used a propensity score matching approach to study the effectiveness of right heart catheterization in an observational setting. The authors found that, after controlling for selection bias by using a rich set of covariates, RHC appeared to lead to higher mortality than not performing RHC. This conclusion is in contrast to the popular belief among practitioners that RHC was beneficial. The SUPPORT study collected data on hospitalized adult patients at five medical centers in the US. Based on information from a panel of experts, a rich set of characteristics, believed to be related to the decision of whether to perform right heart catheterization or not, was collected. Previous analyses of these data report treatment effect estimates of −0.065 and −0.067 when using optimal matching.
All methods mentioned above use a logit model to estimate the propensity score. Given the rich set of covariates, which includes characteristics like age, race, income, and medical characteristics, there might be interaction effects and non-linear dependence. It would also be useful to know which covariates are true confounders and are selected when we do not assume any parametric form for either the outcome model or the propensity score model.

In this example, we reinvestigate the effect of maternal smoking during pregnancy (the treatment variable) on babies' birth weight (the outcome variable). We use a publicly available dataset consisting of 4642 singleton births in the USA. Additionally, we observe a rich set of characteristics such as age, marital status, race, education, the number of prenatal care visits, months since last birth, an indicator for a firstborn infant, and an indicator for alcohol consumption during pregnancy. All covariates refer to the mother, except education, which we also observe for the father. The full dataset, containing more observations, was first used by Almond, Chay, and Lee (2005), who found a strong negative effect of maternal smoking during pregnancy on birth weight (babies of smoking mothers are about 200 to 250 grams lighter). In their study, the authors use a logit model to estimate the propensity score. We focus on estimating the propensity score without assuming any parametric form and again base the variable selection on features that are associated with the outcome.

Table 5.3 shows the ATE estimates along with 95% CIs. We find similar negative effects as Almond et al. (2005): the OARF estimates a decrease in birth weight of 224 grams. Compared to the other RF models and the OAL method, the OARF has the tightest confidence intervals; only the classic IPW estimator has a slightly tighter upper bound. The OAL model and the full RF include 17 out of 19 variables, while the RRF model includes 16 covariates.
The OARF only includes 7 covariates and excludes, for example, an indicator for marital status, the education of mother and father, and whether there were prenatal visits. The excluded variables are shown in Table 5.4, and the complete list with inclusion probabilities is shown in Table A.3. The OARF is the only method that excludes the variable alcohol, meaning that it is neither a confounding variable nor predictive of the outcome. The results from a recent study by Lundsberg, Illuzzi, Belanger, Triche, and Bracken (2015) suggest that low-to-moderate alcohol exposure during early and late gestation is not associated with an increased risk of low birth weight. Such findings are quite interesting since they allow a better understanding and interpretation of which variables are true confounders and which are not informative.

We propose a non-parametric variable selection procedure for the estimation of treatment effects from observational studies. Building on outcome-adaptive penalization, we use a random forest to identify variables that are predictive of the outcome. This allows the outcome model to deviate from a linear dependence on the covariates. We use a modified variable importance score to generate coefficients from the outcome model and use this importance score to penalize variables that are spurious or that only predict the treatment but not the outcome. We show how to use penalty weights for each covariate to regularize the random forest that estimates the propensity score. In addition to the penalty, we show the importance of an initial feature space and how to include it in the RF. A Monte Carlo simulation shows that our proposed method, the OARF, has a smaller variance and produces unbiased estimates. In cases where all estimators are biased, the OARF produces the smallest bias and variance.
The second goal of our proposed approach is to select only variables that have a relationship with the outcome (including all confounding variables) while excluding variables that only predict the treatment as well as unimportant variables. Based on the simulation, we find that only the OARF selects the correct covariates and disregards all others. This holds for linear and non-linear settings, and even if there is a strong correlation between the variables. We apply the OARF and all other benchmark methods in two empirical examples.

[Table of simulation settings; recoverable fragments: ν = (1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, ..., 0); settings 3 and 4 repeat settings 1 and 2 with a correlation between the variables of around 0.2; settings 13 and 14 use 2 cos(X_2) + Σ_{j=1}^p ν_j X_j with ν = (1, 1, 0, 0, 1, 1, 0, ..., 0) and 2 cos(X_2) + Σ_{j=1}^p β_j X_j with β = (0.6, 0.6, 0.6, 0.6, 0, 0, 0, ..., 0).] Notes: Only settings 1 and 2 have a linear DGP. Settings 1 to 7 and 11 to 15 set X_c = X_o = X_t = 2, while settings 8 to 10 draw the treatment as D ∼ Bernoulli{e_0(x)}.

Min-max normalization of the variable importance scores to the interval [a, b] = [0, 1]:

Imp*_j = a + (VarImp_j − min(VarImp)) (b − a) / (max(VarImp) − min(VarImp)).

Since min(VarImp) = 0, a = 0, and b = 1, the expression simplifies to Imp*_j = VarImp_j / max(VarImp). If min(VarImp) ≠ 0, the general expression above applies. The favourable covariates are X_1 to X_8.

The outcome-adaptive lasso (OAL) approach has two important tuning parameters. The parameter λ_n is the regularization parameter that needs to be optimized, while the parameter γ is set to fulfil λ_n n^{γ/2−1} = n^2, with N = n. In the IPTW estimator, the propensity score is used to balance the covariate distribution between the treatment and the control group. Shortreed and Ertefaie (2017) propose to select λ_n by minimizing a weighted absolute mean difference (wAMD), computed from the covariates and the propensity score for the treatment and control groups:

wAMD = Σ_{j=1}^p |β̂_j| · | Σ_i ŵ_i D_i X_{ij} / Σ_i ŵ_i D_i − Σ_i ŵ_i (1 − D_i) X_{ij} / Σ_i ŵ_i (1 − D_i) |.

Equation 4.18 represents the IPT-weights ŵ_i obtained from the propensity score model using the OAL method for variable selection.
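Two quantities from this part, the min-max normalized importance score and the wAMD criterion, could be sketched as follows. This is a minimal illustration in Python; the function names are ours, and the exact weighted-mean form of the wAMD is our reading of Shortreed and Ertefaie (2017), so treat the details as assumptions rather than the paper's code.

```python
import numpy as np

def penalty_weights(var_imp, a=0.0, b=1.0):
    """Min-max normalize raw importance scores VarImp_j to [a, b].

    With a = 0, b = 1 and min(VarImp) = 0, this reduces to
    Imp*_j = VarImp_j / max(VarImp)."""
    lo, hi = var_imp.min(), var_imp.max()
    return a + (var_imp - lo) * (b - a) / (hi - lo)

def wamd(x, d, ps, beta):
    """Weighted absolute mean difference across covariates.

    IPT weights 1/ps for treated and 1/(1 - ps) for controls; each
    covariate's weighted mean difference is scaled by |beta_j| (a lasso
    coefficient for the OAL, or an importance score for the OARF)."""
    w = np.where(d == 1, 1.0 / ps, 1.0 / (1.0 - ps))
    mean_t = np.average(x[d == 1], weights=w[d == 1], axis=0)
    mean_c = np.average(x[d == 0], weights=w[d == 0], axis=0)
    return float(np.sum(np.abs(beta) * np.abs(mean_t - mean_c)))
```

A wAMD of zero indicates that the weighted covariate means coincide between treated and control observations, i.e. the propensity score balances the (importance-weighted) covariates.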
The λ_n value that minimizes the wAMD is used to estimate the ATE, using the propensity score estimates given that specific λ_n and γ. In equation 4.17, the beta coefficients weight the covariate balancing according to the strength of the coefficients. Since we do not require exact coefficients from a linear model for the weighting, we can use the wAMD to tune the OARF as well: instead of the coefficients, the β_j can contain the variable importance scores (they do not even need to be standardized). Since we mainly want a good penalization procedure for the propensity score function, possible tuning parameters include the threshold for the initial feature space, the normalization of the importance score, and whether to apply different penalty weights based on the depth of the tree. For example, we currently use the penalty Imp*_j for each node that contains variable j and is not in the initial feature set F. Another possibility would be to use (Imp*_j)^ξ, where ξ denotes the depth of the node. Considering the depth of a node makes sense if we believe that splits near the end of a tree are less important and should hence be penalized more heavily.

In the simulation settings that we consider, we find that the default values already lead to a significant decrease in bias and variance. Moreover, even though the ranger implementation of the RF is quite fast, tuning its parameters is computationally expensive (compared to tuning the lasso). These are the main reasons why we do not consider parameter tuning at this stage but instead provide a possible approach for doing so if necessary. The standard RF tuning parameters aim at high prediction accuracy; we, however, do not aim for a high classification rate of the propensity score but rather for propensity scores that balance the covariate distribution between the treated and control observations.
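The node-level penalty rule just described, including the optional depth-dependent variant (Imp*_j)^ξ, could be sketched as follows. This is an illustrative helper of our own; the convention that a multiplier of 1.0 means "no penalty" follows the guided/regularized RF idea, and is an assumption, not the paper's code.

```python
def node_penalty(imp_star_j, depth, in_initial_set, depth_weighted=False):
    """Penalty multiplier for a split on variable j at a given node.

    Variables in the initial feature set F are not penalized (multiplier 1).
    Otherwise the multiplier is Imp*_j, optionally raised to the node depth
    xi; since Imp*_j <= 1, deeper splits are then penalized more heavily."""
    if in_initial_set:
        return 1.0
    return imp_star_j ** depth if depth_weighted else imp_star_j
```

In a regularized forest, this multiplier would scale the split gain of variable j, so variables with low outcome importance are chosen only if their raw gain is large enough to overcome the penalty.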
Still, we can use the wAMD to tune, for example, the number of random variables to choose at each split (mtry), the number of trees, the node size, or imbalance methods, as long as we minimize the weighted absolute mean difference. Figure D.1 shows the ATE when applying different tuning parameters to the OARF. The ATE is the median ATE over 200 Monte Carlo iterations. We also show the relative frequency with which each tuning combination was chosen when minimizing the wAMD. We have 18 combinations of tuning parameters to choose from; the more often a combination was chosen based on minimizing the wAMD, the bigger its symbol. We find that, with 500 observations, the most frequently chosen combination has a node size of 20 and selects 2 variables at each split. The same holds when increasing the sample size to 2000 observations. With the latter number of observations, the best tuning parameter combination is closest to the true ATE of 0.5. The number of trees does not seem to be important, since the results are largely constant.

The balancing of covariates through the propensity score generalizes to other methods besides the IPTW estimator. We illustrate ATE estimation using the double machine learning (DML) approach proposed by Chernozhukov et al. (2018). The estimation is based on the residual-on-residual approach to cancel out the effect of confounding covariates. If many variables are available, of which only a few are true confounders, it might be beneficial to select variables for the propensity score estimation. Since this approach needs to estimate two functions (the conditional mean of the outcome and the propensity score function), we do not consider it as a direct comparison in the main part of the paper. The treatment effect is estimated as follows: First, we estimate the conditional mean of the outcome by regressing Y on X. This results in the function ℓ̂(X). Second, we estimate two propensity score models.
One uses all covariates, and the second is based on the OARF. We then estimate the residuals Û = Y − ℓ̂(X) and V̂_m = D − ê_m(X). Note that only V̂ depends on the method m, while Û is estimated only once. The treatment effect is then estimated by regressing Û on V̂_m:

θ̂_m = Σ_i V̂_{m,i} Û_i / Σ_i V̂_{m,i}².

As in the previous settings, we use sample splitting and cross-fitting to estimate the final parameter. Figure E.1 shows the ATE estimated using the full RF and the OARF for the propensity score. We find that, using the OARF, the ATE is less biased and has a smaller variance in most settings. In settings 3 and 6, however, we find that both methods have the same variance, but using the full RF might be less biased. The reason might be that if both functions, E[Y | X] and E[D | X], are quite complicated, it is desirable to also regularize the first function. This means that we only use the variables that are confounders and predictive of Y to estimate the outcome function. In Figure E.2 we show that, using the regularized RF to estimate the outcome model, we can decrease the bias further. This is as expected, since we want to exclude variables that have no association with Y. We call this additional approach, in which we regularize both functions based on outcome variables, the "double OARF" (DOARF).
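The cross-fitted residual-on-residual step could be sketched as follows. This is a simplified illustration, not the paper's implementation: `fit_outcome` and `fit_ps` stand in for any nuisance learners (e.g. a full RF or the OARF), the fold construction is a plain random split, and the clipping bound is an illustrative safeguard.

```python
import numpy as np

def dml_ate(y, d, x, fit_outcome, fit_ps, n_splits=2, seed=0):
    """Cross-fitted residual-on-residual (partialling-out) DML sketch.

    fit_outcome(x_tr, y_tr) -> prediction function for l(X) = E[Y | X]
    fit_ps(x_tr, d_tr)      -> prediction function for e(X) = Pr(D = 1 | X)
    Residuals: U = Y - l(X), V = D - e(X); theta = sum(V * U) / sum(V^2)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_splits)
    u = np.empty(len(y))
    v = np.empty(len(y))
    for k, test in enumerate(folds):
        train = np.concatenate([f for i, f in enumerate(folds) if i != k])
        l_hat = fit_outcome(x[train], y[train])   # nuisance fit on train fold
        e_hat = fit_ps(x[train], d[train])
        u[test] = y[test] - l_hat(x[test])        # out-of-fold residuals
        v[test] = d[test] - np.clip(e_hat(x[test]), 0.01, 0.99)
    # final step: regress U on V without intercept
    return np.sum(v * u) / np.sum(v * v)
```

Running the function twice, once with a propensity learner using all covariates and once with an OARF-style learner, corresponds to the two models compared in Figure E.1.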
References

- The costs of low birth weight
- Double/debiased machine learning for treatment and structural parameters
- The effectiveness of right heart catheterization in the initial care of critically ill patients
- Dealing with limited overlap in estimation of average treatment effects
- Guided random forest in the rrf package
- Feature selection via regularized trees
- Gene selection with guided regularized random forest
- Estimation and accuracy after model selection
- Estimating causal effects from epidemiological data
- Estimation of causal effects using propensity score weighting: An application to data on right heart catheterization
- Improving propensity score weighting using machine learning
- Balancing covariates via propensity score weighting
- Stratification and weighting via the propensity score in estimation of causal treatment effects: A comparative study
- Low-to-moderate prenatal alcohol consumption and the risk of selected birth outcomes: A prospective cohort study
- The revival of the Gini importance?
- Extending iterative matching methods: An approach to improving covariate balance that allows prioritisation
- The central role of the propensity score in observational studies for causal effects
- Randomization analysis of experimental data: The Fisher randomization test comment
- Outcome-adaptive lasso: Variable selection for causal inference
- Propensity score estimation: Neural networks, support vector machines, decision trees (CART), and meta-classifiers as alternatives to logistic regression

Proof of unconfoundedness based on the propensity score. We show that Pr(D_i = 1 | Y_i^0, Y_i^1, e(X_i)) = Pr(D_i = 1 | e(X_i)) = e(X_i). For simplification, we show the proof for Y^0 and note that the same logic follows for Y^1 or both.
Using the law of iterated expectations and noticing that e(X_i) is a function of X_i:

Pr(D_i = 1 | Y_i^0, e(X_i)) = E[D_i | Y_i^0, e(X_i)] = E[ E[D_i | Y_i^0, X_i] | Y_i^0, e(X_i) ].    (1)

Using the assumption of conditional independence of D_i and Y_i^0 given X_i allows us to drop Y_i^0 from the inner conditioning:

E[D_i | Y_i^0, X_i] = E[D_i | X_i] = e(X_i).    (2)

Analogously, conditioning only on the propensity score gives

Pr(D_i = 1 | e(X_i)) = E[ E[D_i | X_i] | e(X_i) ] = E[ e(X_i) | e(X_i) ] = e(X_i).    (3)

Combining equations (1) and (2) shows that Pr(D_i = 1 | Y_i^0, e(X_i)) = e(X_i); together with equation (3), this shows that, given the propensity score, D_i is independent of Y_i^0.

Propensity score model for the linear example: The dependence is linear, a(x) = 1·X_1 + 1·X_2 + 1·X_5 + 1·X_6. The treatment probability then follows from the logistic distribution function:

e_0(x) = exp{a(x)} / (1 + exp{a(x)}).
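The linear propensity model above could be coded as a short sketch (our own naming; the column indexing assumes X_1 is stored in the first column, and the treatment draw D ∼ Bernoulli{e_0(x)} follows the simulation notes):

```python
import numpy as np

def propensity(x):
    """e0(x) = exp{a(x)} / (1 + exp{a(x)}) with a(x) = X1 + X2 + X5 + X6."""
    a = x[:, 0] + x[:, 1] + x[:, 4] + x[:, 5]   # 1-based indices 1, 2, 5, 6
    return np.exp(a) / (1.0 + np.exp(a))

def draw_treatment(x, rng):
    """Draw treatment D ~ Bernoulli(e0(x)) for each observation."""
    return (rng.random(len(x)) < propensity(x)).astype(int)
```

At x = 0 the score is 0.5, and large positive values of the four active covariates push the treatment probability toward 1; a numerically stable sigmoid would be preferable for extreme a(x), but the direct formula matches the expression above.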