key: cord-0501996-ai2ixw8i authors: Liu, Yutao; Gelman, Andrew; Chen, Qixuan title: Inference from Non-Random Samples Using Bayesian Machine Learning date: 2021-04-12 journal: nan DOI: nan sha: 2ea628d2454b8b96cf31479351a0ea35073bb5aa doc_id: 501996 cord_uid: ai2ixw8i We consider inference from non-random samples in data-rich settings where high-dimensional auxiliary information is available both in the sample and the target population, with survey inference being a special case. We propose a regularized prediction approach that predicts the outcomes in the population using a large number of auxiliary variables, such that the ignorability assumption is reasonable and the Bayesian framework is straightforward for quantification of uncertainty. Besides the auxiliary variables, inspired by Little & An (2004), we also extend the approach by estimating the propensity score for a unit to be included in the sample and including it as a predictor in the machine learning models. We show through simulation studies that the regularized predictions using soft Bayesian additive regression trees yield valid inference for the population means and coverage rates close to the nominal levels. We demonstrate the application of the proposed methods using two different real data applications, one in a survey and one in an epidemiology study. Inference about a target population based on sample data relies on the assumption that the sample is representative. However, simple random samples are often not available in real data problems. Therefore, there is a need to generalize inference from the available non-random sample to the target population of interest. For example, randomized controlled trials (RCTs) are considered a gold standard for estimating treatment effects, but the measured effects can only be formally generalized to the participants within the trial. Recent evidence has indicated that subjects in an RCT can differ substantially from patients in routine practice.
Such concern among clinicians about the external validity of RCTs has led to the underuse of effective treatments (Rothwell 2005). This highlights the importance of generalizing treatment effects of RCTs to a definable patient population. Survey sampling is a field that specifically deals with inference on populations from non-random samples, which can be viewed as a special case of generalizing inference. Probability samples collected via surveys have historically proven effective. However, such data come at considerable cost in both time and budget. In the past several decades, large-scale probability surveys have suffered increasingly high non-response rates, in addition to the rising costs. Probability surveys with low response rates are often non-representative, which challenges the validity of survey inference. Meanwhile, recent developments in information technology have made it increasingly convenient and cost-effective to collect large samples with detailed information via online surveys and opt-in panels. Such samples are highly non-representative due to selection bias. Classical weighting methods in the survey literature, such as post-stratification (Valliant 1993) and raking (Deming & Stephan 1940), can improve the representativeness of survey samples when a small number of discrete auxiliary variables about populations are available for survey adjustments. However, such weighting methods can yield highly variable estimates of population quantities in the presence of extreme weights. Alternatively, model-based methods can be used. Wang, Rothschild, Goel & Gelman (2015) demonstrate, through election forecasts with non-representative voter intention polls on the Xbox gaming platform, that multilevel regression and post-stratification (MRP) can be used to generate accurate survey estimates from non-representative samples. Their estimates are in line with the forecasts from leading poll analysts.
MRP is very appealing when statistical adjustments are made using a small number of discrete auxiliary variables. In recent years, population data of high volume, variety, and velocity have become increasingly available, with examples including administrative data and electronic medical records. Such data contain detailed individual-level information of high dimensionality and can be used to generalize inference from non-random samples to their target populations. Although post-stratification, raking, and MRP can improve representativeness in the presence of a small number of discrete auxiliary variables, they are infeasible in high-dimensional settings. With high-dimensional auxiliary variables, Bayesian machine learning techniques have been shown to be effective in improving statistical inference in missing data and causal inference (Hill 2011, Tan, Flannagan & Elliott 2019, Hahn, Murray & Carvalho 2020). Specifically, Hill (2011) shows that Bayesian additive regression trees (BART) produce more accurate estimates of average treatment effects compared to propensity score matching, propensity-weighted estimators, and regression adjustment when the response surface is nonlinear and not parallel between treatment and control groups. Tan, Flannagan & Elliott (2019) demonstrate, in the presence of missing data, that BART reduces the bias and root mean square error of doubly robust estimators when both the propensity and mean models are misspecified. Inspired by these works, we propose Bayesian machine learning model-based methods and extensions for estimating population means using non-random samples. The proposed methods can be applied not only in the context of survey inference but also in more general settings, such as RCTs and epidemiological observational studies.
We evaluate the proposed methods using simulation studies and demonstrate their applications in a mental health survey of Ohio Army National Guard service members and a non-random sample from an observational study using electronic medical records of COVID-19 patients. Let U be the finite population of size N and s be a non-random sample of size n from the population. In the sample s, information on the outcome of interest Y, discrete auxiliary variables Z, and continuous auxiliary variables X was collected. In addition, data from the population U (e.g., census, administrative data, or electronic medical records) are also available, with the same set of auxiliary variables Z and X measured for all units in the population. Figure 1 illustrates the scenario under consideration, with the population data on the left and the sample data on the right. Without loss of generality, we consider a continuous variable of interest Y, with the estimand of interest being the finite population mean Q(Y) = (1/N) Σ_{i∈U} y_i. When the dimensions of Z and X are small, post-stratification, raking, and MRP can be applied by first discretizing the continuous auxiliary variables X as X* using quantiles. Using the joint distribution of the discrete auxiliary variables (Z, X*), post-stratification partitions the population into J disjoint post-strata, U = ∪_{j=1}^J U_j with U_j of size N_j, and correspondingly partitions the sample into subsamples, s = ∪_{j=1}^J s_j with s_j of size n_j for the j-th post-stratum. With respect to the post-strata, the finite population mean can be written as Q = Σ_{j=1}^J (N_j/N) θ_j, where θ_j is the mean of Y in post-stratum j.

Figure 1: Population U and non-random sample s with shared discrete auxiliary variables Z and continuous auxiliary variables X, as well as outcome Y measured only in s.

Under the assumption that the sample units in each post-stratum are representative of the population units in that post-stratum, the post-strata means are estimated using the corresponding subsample means, θ̂_j = ȳ_j = (1/n_j) Σ_{i∈s_j} y_i.
Naturally, the post-stratification (PS) estimator takes the form Q̂_PS = Σ_{j=1}^J (N_j/N) ȳ_j = (1/N) Σ_{i∈s} w_i y_i, where w_i = N_j/n_j for i ∈ s_j is the post-stratification weight assigned to sample unit i in post-stratum j, which is inversely proportional to the sampling fraction n_j/N_j. The post-stratification estimator can be numerically unstable when the partition results in small cells in the sample, in other words, small n_j and large weights w_i. Alternatively, raking generates weights w_i to successively match the marginal (rather than the joint) distributions of (Z, X*) via iterative proportional fitting. The raking weighted estimator takes the form Q̂_rake = Σ_{i∈s} w_i y_i / Σ_{i∈s} w_i, with w_i denoting the raking weights. Raking weights can be highly variable, so the resulting weighted estimators can be inefficient. Also, raking may have convergence issues as the number of auxiliary variables increases. Gelman (2007) reviews a model-based perspective on the PS estimator. In the model-based approach, a regression model is specified for the conditional distribution of the outcome given the discrete auxiliary variables, p(Y|Z, X*). Defining stratum-specific means θ̂_j based on the fitted model leads to the regression and post-stratification (RP) estimator Q̂_RP = Σ_{j=1}^J (N_j/N) θ̂_j. As a special case, specifying a saturated regression model (including all possible interaction terms) allows J post-stratum-specific means, with least squares estimates θ̂_j = ȳ_j = (1/n_j) Σ_{i∈s_j} y_i. As a result, Q̂_RP = Q̂_PS. From the model-based perspective, the problem of unstable estimates due to small cells in post-stratification can be viewed as a model-fitting problem due to model complexity. This perspective motivates using alternative modeling techniques to improve estimation. Instead of using classical saturated regression models, multilevel regression and post-stratification (MRP) utilizes hierarchical regression models to achieve stable estimates.
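The PS estimator and raking weights above can be sketched in a few lines of NumPy. This is an illustrative implementation under our own naming (the functions are not from the paper), with raking shown for two margins via iterative proportional fitting:

```python
import numpy as np

def poststratification_estimate(y, strata_s, strata_pop_counts):
    """Post-stratification estimator: Q_hat = (1/N) * sum_j N_j * ybar_j,
    i.e., each sample unit in stratum j carries weight N_j / n_j."""
    strata_s = np.asarray(strata_s)
    y = np.asarray(y, dtype=float)
    N = sum(strata_pop_counts.values())
    total = 0.0
    for j, N_j in strata_pop_counts.items():
        in_j = strata_s == j
        n_j = in_j.sum()
        if n_j == 0:
            raise ValueError(f"empty sample cell for stratum {j}")
        total += N_j * y[in_j].mean()  # N_j * theta_hat_j
    return total / N

def raking_weights(z1, z2, margins1, margins2, n_iter=100, tol=1e-10):
    """Raking: iterative proportional fitting so the weights match the
    marginal population counts of z1 and z2 (not their joint distribution)."""
    z1, z2 = np.asarray(z1), np.asarray(z2)
    w = np.ones(len(z1))
    for _ in range(n_iter):
        w_old = w.copy()
        for margins, z in ((margins1, z1), (margins2, z2)):
            for level, target in margins.items():
                mask = z == level
                w[mask] *= target / w[mask].sum()  # rescale to match margin
        if np.max(np.abs(w - w_old)) < tol:
            break
    return w
```

As the text notes, the fitting can fail to converge when margins are inconsistent or many variables are raked at once; the `n_iter` cap above simply truncates in that case.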
Both main effects and interaction terms can be specified as multilevel random effects so that information across post-strata is partially pooled in the model fitting procedure (Gelman & Little 1997). MRP improves efficiency in the estimation of the population mean over post-stratification and raking when data are sparse in some post-strata. Still, it is challenging to perform MRP in high-dimensional settings, especially in the presence of a large number of noise variables not associated with Y, because a parametric form needs to be specified for the multilevel regression. Also, continuous auxiliary variables need to be discretized before modeling. The model-based RP approach can also be viewed as a prediction approach, and the RP estimator in (3) can be rewritten as Q̂_RP = (1/N) (Σ_{i∈s} y_i + Σ_{i∈U\s} ŷ_i), where ŷ_i is the model prediction for non-sampled unit i. This perspective motivates the use of modern statistical techniques for generalization of inference via valid predictions of the outcomes in the population. Specifically, the classical regression models in regression and post-stratification can be replaced by any regularized prediction method that achieves stable estimates while including high-dimensional covariates. Such models also allow modeling the continuous X directly. Tree-based methods are appealing techniques for handling high-dimensional problems. Sum-of-trees ensembles achieve high prediction accuracy and better approximate the functional forms of continuous variables, with each single tree regularized to obtain stable predictions and achieve a bias-variance trade-off. Taking a model-based predictive perspective, we extend the RP approach to high-dimensional settings by replacing parametric regression models with regularized additive regression trees. We consider the Bayesian modeling framework, as it is natural for implementing predictive inference and straightforward for quantification of uncertainty.
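The prediction form of the RP estimator, Q̂ = (1/N)(sum of observed sample outcomes plus sum of predictions for the non-sampled units), is model-agnostic: any prediction method can supply the ŷ_i. A minimal sketch (the function name is ours):

```python
import numpy as np

def rp_estimate(y_sample, y_pred_nonsample):
    """Prediction estimator for the finite population mean:
    Q_hat = (1/N) * (sum of observed y_i in the sample
                     + sum of model predictions for the non-sampled units)."""
    y_sample = np.asarray(y_sample, dtype=float)
    y_pred = np.asarray(y_pred_nonsample, dtype=float)
    N = len(y_sample) + len(y_pred)  # assumes sample and non-sample partition U
    return (y_sample.sum() + y_pred.sum()) / N
```

With a saturated regression supplying within-cell sample means as predictions, this reduces to the PS estimator, as the text notes.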
In the current setting, the conditional distribution of a continuous outcome given the high-dimensional auxiliary variables, p(Y|Z, X), can be modeled using Bayesian additive regression trees (BART) (Chipman, George & McCulloch 2010) or soft Bayesian additive regression trees (SBART) (Linero & Yang 2018). For continuous outcomes, BART and SBART assume Gaussian noise and model the location parameter using a non-parametric sum-of-trees structure, allowing both discrete and continuous auxiliary variables: Y = Σ_{m=1}^M g(Z, X; T_m, μ_m) + ε, ε ~ N(0, σ²), where M is a fixed number of trees in the sum-of-trees structure, T_m is the m-th binary tree with μ_m being the parameters associated with its terminal nodes, and g(·) is the function assigning a value in μ_m according to (Z, X). The sum-of-trees structure naturally handles high-dimensional auxiliary variables without specifying a parametric form, accommodating categorical variables, continuous variables, and possible interactions. In the Bayesian framework, quantification of uncertainty is naturally characterized by the posterior and posterior predictive distributions. In BART, g(·) is a deterministic function, and the potential effect of a continuous predictor, either linear or nonlinear, is approximated by step functions generated by cutting the continuous predictor at various splitting points in different trees. Regularization priors are specified on p(T_m), p(μ_m|T_m), and p(σ²) such that each single tree T_m is a weak learner. This specification aims at preventing the individual tree effects from being unduly influential and at achieving stable predictions, with automatic default specifications facilitating easy implementation. For p(T_m), the prior is specified by three aspects: (i) the probability that a node is nonterminal, (ii) the distribution on the splitting variable assignments at each interior node, and (iii) the distribution on the splitting rule assignment in each interior node, conditional on the splitting variable.
For p(μ_m|T_m) and p(σ²), conjugate normal distributions and inverse chi-square distributions are specified. The steps for obtaining the BART and SBART prediction estimators are as follows.

Step 1. Model p(Y|Z, X) using BART or soft BART, Y = G(Z, X) + ε, ε ~ N(0, σ²), with the corresponding Bayesian priors.

Step 2. Obtain the posterior distribution of Q(Y) = (1/N) Σ_{i∈U} y_i using Markov chain Monte Carlo: at each iteration t, compute Q^(t) = (1/N) (Σ_{i∈s} y_i + Σ_{i∈U\s} ŷ_i^(t)), using the observed y_i in the sample and the predicted values for the population units that are not in the sample.

Step 3. Obtain Q̂_(S)BART: point estimates using the posterior median of Q^(t), with credible intervals constructed using quantiles splitting the tails of the posterior distribution equally.

In some cases, inference on subpopulation means is also of interest, which can be obtained via modification of Step 2, restricting the average to the predictions and observed outcomes in the corresponding subpopulation Ω ⊂ U and subsample s ∩ Ω. In the missing data literature, Little & An (2004) proposed including the logit-transformed response propensity score as a covariate, modeled using splines, in the imputation models. Inspired by this, we extend the BART and SBART predictions with a two-step approach. First, we estimate the sample inclusion propensity using a propensity model. If the sample data are linked to the population data, we code the sample inclusion indicator I = 1 for the units in the sample and I = 0 for the rest of the units in the population. The propensity score π̂ can then be estimated by modeling p(I|Z, X) using probit Bayesian additive regression trees (Chipman et al. 2010). If the sample data are unlinked to the population data, we round the continuous X to [X] at a certain precision level and identify K categories with unique values of (Z, [X]). Within each category k = 1, ..., K, the number of units in the population, N_k, and that in the sample, n_k, can be counted.
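Two pieces of the workflow above reduce to simple array operations: the posterior summary of Q in Steps 2-3 (given posterior predictive draws from any BART/SBART implementation) and a count-based propensity estimate for the unlinked case. An illustrative sketch, where the function names, the draw-matrix interface, and the Beta smoothing are our assumptions rather than the paper's code:

```python
import numpy as np

def posterior_population_mean(y_sample, ypred_draws, levels=(0.8, 0.95)):
    """Steps 2-3: summarize posterior draws of the finite population mean.

    y_sample    : observed outcomes for the n sampled units
    ypred_draws : (T, N - n) posterior predictive draws for the non-sampled
                  units, produced by any BART/SBART backend (assumed input)
    Returns the posterior median and equal-tailed credible intervals."""
    y_sample = np.asarray(y_sample, dtype=float)
    ypred_draws = np.asarray(ypred_draws, dtype=float)
    N = len(y_sample) + ypred_draws.shape[1]
    # Q^(t) = (1/N) * (sum of observed y_i + sum of predictions at draw t)
    q_draws = (y_sample.sum() + ypred_draws.sum(axis=1)) / N
    point = float(np.median(q_draws))
    intervals = {lev: (float(np.quantile(q_draws, (1 - lev) / 2)),
                       float(np.quantile(q_draws, 1 - (1 - lev) / 2)))
                 for lev in levels}
    return point, intervals

def category_propensity(N_k, n_k, a=1.0, b=1.0):
    """Unlinked case: estimate the inclusion propensity within each (Z, [X])
    category from the counts (N_k, n_k).  A conjugate Beta(a, b)-Binomial
    posterior mean is used here as a simple stand-in for the binomial models
    mentioned in the text (the smoothing choice is our assumption)."""
    N_k = np.asarray(N_k, dtype=float)
    n_k = np.asarray(n_k, dtype=float)
    return (n_k + a) / (N_k + a + b)
```

The equal-tailed interval construction matches Step 3: the (1-level)/2 and 1-(1-level)/2 quantiles of the Q^(t) draws.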
Once the counts (N_k, n_k) are created for each category, the propensity score π̂ for the units to be included in the sample, given (Z, [X]), can be obtained via models for binomial outcomes. Next, we model p(Y|Z, X, π̂) by additionally including π̂ as a covariate in the BART or SBART model, with the rest of the steps being the same as in Section 2.2.1. The detailed steps for obtaining the BART propensity (BART-P) prediction estimator Q̂_BART-P and the SBART propensity (SBART-P) prediction estimator Q̂_SBART-P are outlined as follows.

Step 1. Model p(I|Z, X) with probit BART and estimate π̂ using the posterior mean.

Step 2. Obtain the (S)BART-P prediction estimator for the finite population mean, Q̂_(S)BART-P: point estimates using the posterior median of Q^(t)_(S)BART-P, with credible intervals constructed using quantiles splitting the tails of the posterior distribution equally.

The BART-P and SBART-P prediction methods are expected to be doubly robust (Long, Hsu & Li 2012). More specifically, as long as either the mean model for the outcome or the propensity model is correctly specified, a consistent estimator of the population mean is obtained. Samples of size n = 600 were drawn from the populations with inclusion probability π = Pr(I = 1|Z, X) as a function of the auxiliary variables X and Z. We considered the following four simulation scenarios.

S1. Low-dimensional auxiliary variables (p = 3, r = 1) with higher inclusion propensity at the lower tail of X_1. The outcomes {Y_i}_{i=1,...,N} were generated using an additive model with N(0, 3²) errors, and the samples were selected with π ∝ logit^{-1}[−13.66 + .5 Z_1 + Z_2 + 1.75 Z_3 + 12.5 (X_1 − .75)²]. Consequently, units with values of X_1 falling between 0.5 and 1 were under-sampled.

S2. High-dimensional auxiliary variables (p = 30, r = 10) with higher inclusion propensity at the lower tail of X_1. Same Y and π models as S1, but adding noise auxiliary variables {Z_l}_{l=4,...,30} and {X_l}_{l=2,...,10} that are not associated with Y or π.
S3. High-dimensional auxiliary variables (p = 30, r = 10) with lower inclusion propensity at the lower tail of X_1. Same as S2, but changing the signs of the coefficients in the model for π to introduce selection bias in the opposite direction. Consequently, units with small values of X_1 were under-sampled, especially among those with X_1 ≤ 0.25.

S4. High-dimensional auxiliary variables (p = 30, r = 10) with an interaction and different relevant continuous predictors for Y and π. The outcome model included an interaction term, and the samples were selected using a propensity model depending on X_3 and X_5. Units at the lower tails of X_3 and X_5 were under-sampled, but X_3 and X_5 were not associated with Y.

Figure 2(a)-(b) show the scatter plots of Y against X_1, the continuous variable that is associated with both Y and π, for the simulated population overlaid with a selected sample in scenarios S1-S3. Population units with lower values of X_1 were more likely to be selected into the samples in scenarios S1/S2 but less likely to be selected in scenario S3. Scenario S4 was designed to assess whether tree-based methods handle interactions well and how they perform when the continuous variables that are associated with under-sampling are not associated with the outcome. Figure 2(c)-(d) show the scatter plots of Y against X_1, the continuous variable related to Y but not π, and of Y against X_3, the continuous variable related to π but not Y, in scenario S4. The plot on the left shows a positive association between Y and X_1, but units with different values of X_1 are equally likely to be included in the sample; the plot on the right shows no association between Y and X_3, but units at the lower tail of X_3 are less likely to be included in the sample. For each scenario, 500 simulation replicates were conducted, with point and interval estimates of the finite population mean computed for each. The Bayesian tree-based methods used all available auxiliary variables, as it is unknown in practice which variables are involved in the true data-generating process.
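The selection mechanism of scenario S1 can be sketched directly from the stated propensity model. The inclusion-probability function below follows the formula given above; the probability-proportional-to-size sampling step is a simplifying assumption of ours, since the paper does not spell out the exact draw mechanism:

```python
import numpy as np

def s1_inclusion_prob(Z1, Z2, Z3, X1):
    """Scenario S1 selection propensity (up to a normalizing constant):
    pi proportional to logit^{-1}[-13.66 + .5 Z1 + Z2 + 1.75 Z3
                                  + 12.5 (X1 - .75)^2].
    The quadratic term is smallest near X1 = .75, so units with X1 between
    0.5 and 1 are under-sampled, as described in the text."""
    lin = -13.66 + 0.5 * Z1 + Z2 + 1.75 * Z3 + 12.5 * (X1 - 0.75) ** 2
    return 1.0 / (1.0 + np.exp(-lin))

def draw_sample(pi, n, rng):
    """Draw a without-replacement sample of size n with probability
    proportional to pi (a simple PPS sketch; this draw scheme is our
    assumption, not necessarily the paper's)."""
    pi = np.asarray(pi, dtype=float)
    return rng.choice(len(pi), size=n, replace=False, p=pi / pi.sum())
```

In the simulations, n = 600 units would be drawn from each generated population this way.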
For scenario S1 with low-dimensional auxiliary variables, the tree-based methods were also compared to the PS and raking estimators using all four available variables, with X_1 discretized using tertiles in PS and using quintiles in raking. Raw estimates were also calculated using sample means. The performance of the point estimates is evaluated with empirical bias and empirical root mean squared error (RMSE), summarized in Table 1. (In Table 1, PS is based on Z_1, Z_2, Z_3 and X_1 discretized using tertiles; raking is based on Z_1, Z_2, Z_3 and X_1 discretized using quintiles. The standard errors of the empirical bias from the 500 simulation replicates are < 7.5 × 10^{-3} for all methods.) Scenarios S1-S3 share the same outcome model, the same outcome values {Y_i}_{i=1,...,N}, and the same ground truth for the finite population mean, defined as Q = (1/N) Σ_{i=1}^N Y_i. The empirical coverage rates and average widths of the 80% and 95% probability intervals are visualized in Figure 3. The raw estimates ignoring selection bias are off the chart, leading to confidence intervals with 0% coverage rates, and are therefore not shown in Figure 3. In scenario S1, where the weighting methods are feasible, raking is less biased as well as more efficient than post-stratification (PS). This is because raking maintains more information from the continuous variable X_1 by discretizing X_1 using quintiles as compared to tertiles in PS, and raking implicitly assumes an additive propensity model while PS assumes an interaction model. Both PS and raking generate confidence intervals with coverage rates lower than the nominal levels, with raking yielding shorter intervals but higher coverage rates. BART and SBART both outperform the weighting methods by utilizing the continuous form of X_1, generating credible intervals with coverage rates close to the nominal levels. BART and SBART perform similarly, as all auxiliary variables are relevant in this low-dimensional setting.
Including the propensity score in BART and SBART leads to a small bias reduction that is offset by efficiency loss, indicated by slightly higher RMSE and slightly wider credible intervals. Scenario S2 differs from scenario S1 by adding irrelevant auxiliary variables. PS and raking are not feasible due to high dimensionality. Units with X_1 falling between 0.5 and 1.0 have lower selection probabilities than those with X_1 between 0 and 0.5, as shown in Figure 2(a). SBART outperforms BART with lower bias, lower RMSE, and better credible interval coverage. Including the propensity score in BART reduces bias and RMSE and fixes the credible interval coverage. However, such improvement is not obvious for SBART. BART-P, SBART, and SBART-P all yield valid credible intervals, but BART does not, with SBART having the shortest intervals. Scenario S3 differs from scenario S2 in the direction of the selection bias. Moreover, because π was negatively associated with (X_1 − .75)², the units in the lower tail of X_1 (e.g., X_1 < .25 in Figure 2(b)) have even smaller inclusion probabilities than the units in the upper tail of X_1 in scenario S2. Consequently, data are sparse in the lower tail of X_1. In this setting, neither BART nor SBART performs well, both having large bias and RMSE, although SBART yields smaller bias and RMSE than BART. The empirical coverage rates for both BART and SBART are lower than the nominal levels due to the bias in estimation. By including the propensity score, BART-P improves credible interval coverage as well as bias and RMSE relative to BART, but does not fix the undercoverage issue. Again, such improvement is not obvious for SBART. In scenario S4, both BART and SBART perform well with small bias and RMSE and close-to-nominal coverage rates. SBART yields slightly smaller bias, smaller RMSE, and a better coverage rate than BART. Including the propensity score slightly reduces bias and improves the coverage rate in BART.
Although data are sparse in the lower tails of X_3 and X_5, these two X variables are not associated with Y, and thus such biased selection did not yield poor performance of the tree-based methods as in scenario S3. In all the scenarios considered here, SBART outperforms the other competing methods and is therefore recommended. However, it should be used with caution, as it still does not perform well when selection bias results in sparse data at the tails of continuous auxiliary variables associated with the outcome. We further investigated the performance of BART and SBART in scenario S3, where neither performs well and BART performs worse than SBART. We consider two random samples from the population. For sample I, data are sparse at the lower tail of X_1, while for sample II, no data are available at the lower tail, X_1 < .2. The top panels I(a) and II(a) of Figure 4 show the population in gray dots, with samples I and II in red, respectively. For a closer examination of the data at the lower tail of X_1, we restrict to a subset with Z_2 = Z_3 = 0 and focus on the lower tail with X_1 < 0.3. The middle panels I(b) and II(b) show the population and corresponding sample data in this restricted subgroup. Finally, the bottom panels I(c) and II(c) plot the population values of Y in this subgroup (gray points) overlaid by the posterior means of the location parameters, G(Z, X), of each population unit estimated using BART and SBART, shown using red pluses and blue crosses, respectively. Panel I(c) shows that both BART and SBART fit the data well within the region X_1 > .2 where sample data are available. However, SBART fits the data much better than BART when X_1 < 0.2, where very sparse data are available. The estimated posterior means of the location parameters based on SBART are also less noisy, due to the sparsity-inducing priors that tend to exclude the noise auxiliary variables in model fitting.
Panel II(c) shows that both models fail to produce valid predictions in the region X_1 < .2 where no sample data are available. In the simulation study, about 5% of the 500 simulated samples do not include units with X_1 < .1, and one third include fewer than 10 units with X_1 < .2. We demonstrate the application of the proposed methods using real data from two different studies. The first application deals with a mental health survey assessing psychiatric disorders among Ohio Army National Guard (OHARNG) service members. The second application is in a clinical setting where it is of interest to generalize inference on COVID-19 patients when clinical outcomes are only available in a subset of patients. The OHARNG survey sample is subject to potential selection bias due to under-coverage of the sampling frame and non-response. Auxiliary information is available at the individual level for the entire study population, including age (17-24 yr, 25-34 yr, 35+ yr), sex (male, female), race (white, black, other), rank (enlisted, officer), marital status (married, non-married), and years of service (in years). We apply the proposed tree-based methods to correct the discrepancy between the sample and the population utilizing the five categorical and one continuous auxiliary variables. For BART-P and SBART-P, the propensity models were built using probit BART. Before modeling, a log(y + 1) transformation was applied to the trauma scores to reduce right skewness such that the normality assumption in BART and SBART is reasonable. The distributions of the only continuous variable, years of service, in the sample and population were checked to avoid prediction failure due to sparse data at the tails (see Figure S1 in the Supplementary Materials). After fitting the models, we performed model checking using posterior predictive graphical checks (Gelman et al.
2014, chapter 6) based on test quantities T_1(·), T_2(·), and T_3(·). The test quantities capture different aspects of the data, with T_1(y) = ȳ and T_2(y) measuring the location and variability of the survey outcome, and T_3(·) measuring the discrepancy between the survey outcome and the fitted distribution. In each MCMC iteration t, the realized test quantities T_i(y, G^(t), σ^(t)) under the observed data and the predictive test quantities T_i(ỹ^(t), G^(t), σ^(t)) under the simulated data were computed and compared, with ỹ^(t) drawn from the posterior predictive distribution. For each quantity T_i(·), a Bayesian posterior predictive p-value can also be computed, defined as the probability that the predictive test quantity is greater than the realized test quantity, evaluated over the posterior distribution. The Bayesian p-value measures the discrepancy between the observed data and the posterior predictive distribution in the aspect characterized by T(·). A Bayesian p-value close to 0.5 indicates good fit, while a Bayesian p-value near 0 or 1 indicates that the observed pattern would be unlikely to happen if the model were true and therefore indicates lack of fit. COVID-19 is a global pandemic caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The first positive case was confirmed in New York City on March 1, 2020, and the city soon had more cases than any country other than the United States. Results are visualized in Figure 5. For estimand (ii), estimates can be readily obtained by restricting the calculation to the predicted and observed prolongation for patients over 80 years old, without additional model fitting. Similar to the estimate among all patients, SBART yields smaller estimates and shorter probability intervals than the raw estimator.
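The posterior predictive p-value used in the model checking above can be computed directly from the MCMC output: compare the test quantity on each replicated data set ỹ^(t) with the same quantity on the observed data, both evaluated at the draw-t parameters. A minimal sketch (function name is ours):

```python
import numpy as np

def bayesian_ppp(realized, predictive):
    """Posterior predictive p-value: the fraction of MCMC iterations in which
    the test quantity on replicated data, T(y_rep^(t), G^(t), sigma^(t)),
    exceeds the realized quantity T(y, G^(t), sigma^(t)).
    Values near 0.5 suggest adequate fit; values near 0 or 1 suggest misfit."""
    realized = np.asarray(realized, dtype=float)
    predictive = np.asarray(predictive, dtype=float)
    return float(np.mean(predictive > realized))
```

Here `realized` and `predictive` are the length-T arrays of realized and predictive test quantities accumulated across iterations.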
Although the mean prolongation estimate is higher among this subgroup of patients than among all patients using the raw estimator, the estimates are similar using SBART. We consider generalization of inference on a descriptive estimand from a non-random sample to a target population in data-rich settings where high-dimensional auxiliary information is available in both the sample and the population, with survey inference being a special case. Existing methods such as post-stratification, raking, and MRP are challenging or infeasible to perform due to high dimensionality, and the need to discretize continuous auxiliary variables before applying such methods leads to loss of information. To address these issues, we propose a regularized prediction approach that models the conditional distribution of the outcomes given the high-dimensional auxiliary variables using Bayesian machine learning techniques. In this paper, we specifically consider BART and soft BART, which handle both discrete and continuous auxiliary variables as well as potential interactions. Besides the auxiliary variables, we also consider modified methods that estimate the propensity score for a unit to be included in the sample and include the estimated propensity score as a covariate in the BART and soft BART models. Simulation studies with artificial data demonstrate that the Bayesian additive-trees-based methods outperform post-stratification (PS) and raking in low-dimensional settings where PS and raking are feasible, as the regularized additive trees better utilize the information in the continuous auxiliary variables and avoid model over-specification. The Bayesian additive-trees-based methods also yield valid inference in high-dimensional settings where PS and raking are not feasible, as long as selection bias does not result in sparse data at the tails of relevant continuous auxiliary variables.
In high-dimensional settings with sparse signals, SBART, with its soft decision trees and sparsity-inducing priors, is less biased and more efficient than BART. In challenging settings where the additive-trees-based methods underperform, including the propensity score in BART can reduce bias and improve credible interval coverage, while such benefit is not obvious for SBART. Therefore, the soft BART prediction method is recommended for generalization of inference with high-dimensional auxiliary variables. Soft BART better utilizes the information in the continuous auxiliary variables and more effectively regularizes the effects of irrelevant noise auxiliary variables. As demonstrated in the OHARNG mental health study and the COVID-19 study, the proposed methods can be applied in both survey and more general settings, with estimands being overall population as well as subpopulation quantities. The Bayesian additive-trees-based methods need to be used with caution. More specifically, both BART and SBART predictions fail when selection bias results in very sparse data at the tails of the continuous covariates. Such prediction failure cannot be fixed via robust methods involving propensity scores. Therefore, for important continuous variables associated with the outcomes, the range and distribution in the sample and the population need to be checked before using the methods. In some cases, transformations of such auxiliary variables can be applied to reduce sparsity at the tails. Although BART and SBART are considered in this paper, the regularized prediction approach is general, and any Bayesian machine learning technique that achieves valid predictions can be applied.
References

Chipman, H. A., George, E. I. & McCulloch, R. E. (2010), 'BART: Bayesian additive regression trees', The Annals of Applied Statistics 4(1), 266-298.
'COVID-19 infection is associated with QTc prolongation'.
Deming, W. E. & Stephan, F. F. (1940), 'On a least squares adjustment of a sampled frequency table when the expected marginal totals are known', The Annals of Mathematical Statistics 11(4), 427-444.
Gelman, A. (2007), 'Struggles with survey weighting and regression modeling', Statistical Science 22(2), 153-164.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A. & Rubin, D. B. (2014), Bayesian Data Analysis, 3rd edn, CRC Press.
Gelman, A. & Little, T. C. (1997), 'Poststratification into many categories using hierarchical logistic regression', Survey Methodology 23(2), 127-135.
Hahn, P. R., Murray, J. S. & Carvalho, C. M. (2020), 'Bayesian regression tree models for causal inference: Regularization, confounding, and heterogeneous effects (with discussion)', Bayesian Analysis 15(3), 965-1056.
Hill, J. L. (2011), 'Bayesian nonparametric modeling for causal inference', Journal of Computational and Graphical Statistics 20(1), 217-240.
Linero, A. R. & Yang, Y. (2018), 'Bayesian regression tree ensembles that adapt to smoothness and sparsity', Journal of the Royal Statistical Society: Series B 80(5), 1087-1110.
Little, R. J. A. & An, H. (2004), 'Robust likelihood-based analysis of multivariate data with missing values', Statistica Sinica 14, 949-968.
Long, Q., Hsu, C.-H. & Li, Y. (2012), 'Doubly robust nonparametric multiple imputation for ignorable missing data', Statistica Sinica 22, 149-172.
Rothwell, P. M. (2005), 'External validity of randomised controlled trials: "to whom do the results of this trial apply?"', The Lancet 365(9453), 82-93.
Tan, Y. V., Flannagan, C. A. C. & Elliott, M. R. (2019), '"Robust-squared" imputation models using BART', Journal of Survey Statistics and Methodology 7(3), 465-497.
Valliant, R. (1993), 'Poststratification and conditional variance estimation', Journal of the American Statistical Association 88(421), 89-96.
Wang, W., Rothschild, D., Goel, S. & Gelman, A. (2015), 'Forecasting elections with non-representative polls', International Journal of Forecasting 31(3), 980-991.

This work was partially supported by NIH grants R01AG067149 and R21ES029668 and ONR grant N00014-17-1-2141. The authors thank Dr. Sandro Galea for sharing the data on the OHARNG service members, and Drs. Elaine Wan and Marc Waase for sharing the data on COVID-19 patients admitted at Columbia University Irving Medical Center.