title: Robust Recursive Partitioning for Heterogeneous Treatment Effects with Uncertainty Quantification
authors: Lee, Hyun-Suk; Zhang, Yao; Zame, William; Shen, Cong; Lee, Jang-Won; Schaar, Mihaela van der
date: 2020-06-14

Abstract. Subgroup analysis of treatment effects plays an important role in applications from medicine to public policy to recommender systems. It allows physicians (for example) to identify groups of patients for whom a given drug or treatment is likely to be effective and groups of patients for whom it is not. Most current methods of subgroup analysis begin with a particular algorithm for estimating individualized treatment effects (ITE) and identify subgroups by maximizing the difference across subgroups of the average treatment effect in each subgroup. These approaches have several weaknesses: they rely on a particular algorithm for estimating ITE, they ignore (in)homogeneity within identified subgroups, and they do not produce good confidence estimates. This paper develops a new method for subgroup analysis, R2P, that addresses all these weaknesses. R2P uses an arbitrary, exogenously prescribed algorithm for estimating ITE and quantifies the uncertainty of the ITE estimation, using a construction that is more robust than other methods. Experiments using synthetic and semi-synthetic datasets (based on real data) demonstrate that R2P constructs partitions that are simultaneously more homogeneous within groups and more heterogeneous across groups than the partitions produced by other methods. Moreover, because R2P can employ any ITE estimator, it also produces much narrower confidence intervals, with a prescribed coverage guarantee, than other methods.

The understanding of treatment effects plays an important role - especially in shaping interventions and treatments - in areas from clinical trials [1, 2] to recommender systems [3] to public policy [4]. In many settings, the relevant population is diverse, and different parts of the population display different reactions to treatment. In such settings, heterogeneous treatment effect (HTE) analysis - also called subgroup analysis - is used to find subgroups consisting of subjects who have similar covariates and display similar treatment responses [5, 6]. The identification of subgroups is informative of itself; it also improves the interpretation of treatment effects across the entire population and makes it possible to develop more effective interventions and treatments and to improve the design of further experiments. In a clinical trial, for example, HTE analysis can identify subgroups of the population for which the studied treatment is effective, even when it is found to be ineffective for the population in general [7]. To identify subjects who have similar covariates and display similar treatment responses, it is necessary to create reliable estimates of the treatment responses of individual subjects, i.e., of individualized treatment effects (ITE). The state-of-the-art work on HTE proceeds by simultaneously estimating ITE and recursively partitioning the subject population [8-11]. In these HTE methods, the criterion for partitioning is maximizing the heterogeneity of treatment effects across subgroups, using a sample mean estimator, under the assumption that treatment effects are homogeneous within subgroups.
In particular, the population (or any previously identified subgroup) would be partitioned into two subgroups provided that the sample means of these subgroups are sufficiently different, ignoring the possibility that treatment effects might be very heterogeneous within the groups identified. Put differently, these methods focus on inter-group heterogeneity but ignore intra-group heterogeneity.

Figure 1: A toy example with two subgroups identified by the HTE method in [10]. The solid red line shows the estimated ITE, and the shaded red region shows its 95% confidence interval.

An important problem with this approach is that, because it relies solely on inter-group heterogeneity based on sample means, it may lead to false discovery. To illustrate, consider the toy example depicted in Fig. 1. In this example, the true treatment effect (shown on the vertical axis) was generated by iid random draws from the normal distribution with mean 0.0 and standard deviation 0.1. In truth, the treatment under consideration is totally ineffective and innocuous; on average, it has no effect at all, and the treatment effects are entirely uncorrelated with the single covariate (shown on the horizontal axis). However, if the observed data - the realization of the random draws - happens to be the one shown in Figure 1, standard methods will typically partition the population as shown in the figure, thereby "discovering" a segment of the population for whom the treatment is effective and a complementary segment where the treatment is dangerous. Obviously, decisions based on such false discovery are useless - or worse. Note that this false discovery occurs because, although the sample means of the two groups are indeed substantially different, the outcome variation within each group is just as large - but the latter variation is entirely ignored in the creation of subgroups.

This paper proposes a robust recursive partitioning (R2P) method that avoids such false discovery. R2P has several distinctive characteristics.

• R2P discovers heterogeneous subgroups in a way that is not tied to any particular ITE estimator. This is in sharp contrast with previous methods [8-12], each of which relies on a specific ITE estimator. R2P can leverage any ITE estimator for subgroup analysis, e.g., an ITE estimator based on random forests [13], on multi-task Gaussian processes [14], or on deep neural networks [15, 16]. This flexibility is important because it allows for the possibility that different ITE estimators might be appropriate in different settings and because it allows R2P to seamlessly integrate ITE estimators that might be developed in the future and found to be superior to current estimators.

• R2P makes a conscious effort to guarantee homogeneity of treatment effects within each of the subgroups while maximizing heterogeneity across subgroups. This is also different from previous methods, e.g., [10, 8], where variation within the subgroup is largely ignored.

• R2P produces confidence guarantees with narrower confidence intervals than previous methods. It accomplishes this by using methods of conformal prediction [17] that produce valid confidence intervals. Quantifying the uncertainty allows R2P to employ a novel criterion we call confident homogeneity in order to create partitions that take account of both heterogeneity across groups and homogeneity within groups.

These characteristics make R2P both more reliable and more informative than previous methods.
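To make the false-discovery mechanism concrete, the following minimal sketch (our illustration, not the authors' code; the sample size and the threshold grid are arbitrary choices) reproduces the spirit of the toy example: the true effect is pure noise, yet a sample-mean splitting criterion still "finds" a subgroup boundary with a nonzero mean gap.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0, 1, n)          # single covariate
tau = rng.normal(0.0, 0.1, n)     # true treatment effect: pure noise, independent of x

# Greedy split maximizing the difference of subgroup sample means,
# mimicking a purely inter-group-heterogeneity criterion.
best_gap, best_phi = 0.0, None
for phi in np.quantile(x, np.linspace(0.1, 0.9, 81)):
    left, right = tau[x <= phi], tau[x > phi]
    gap = abs(left.mean() - right.mean())
    if gap > best_gap:
        best_gap, best_phi = gap, phi

print(f"'discovered' split at x = {best_phi:.2f} with mean-effect gap {best_gap:.3f}")
# The gap is nonzero purely by chance; without accounting for within-group
# uncertainty, the split looks like a real subgroup discovery.
```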
They also distinguish R2P from existing classification methods for continuous labels [18, 19] and model interpretation methods [20], which can be used for subgroup analysis (by viewing the estimated ITE as a continuous label). Extensive experiments using synthetic and semi-synthetic datasets (based on real data) demonstrate that R2P outperforms previous state-of-the-art methods, more robustly identifying subgroups while providing much narrower confidence intervals.

To highlight the core design principles, we begin by introducing robust recursive partitioning (R2P) in the regression setting; we extend it to the more complicated HTE setting in the next section. We consider a standard regression problem with a d-dimensional covariate (input) space $\mathcal{X} \subseteq \mathbb{R}^d$ and a 1-dimensional outcome space $\mathcal{Y} \subseteq \mathbb{R}$. We are given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ where, for the $i$-th sample, $x_i \in \mathcal{X}$ is the vector of input covariates and $y_i \in \mathcal{Y}$ is the outcome. We assume that samples are independently drawn from an unknown distribution $P_{X,Y}$ on $\mathcal{X} \times \mathcal{Y}$. We are interested in estimating $\mu(x) = \mathbb{E}[Y \mid X = x]$, the mean outcome conditional on $x$. We denote the estimator by $\hat{\mu} : \mathcal{X} \to \mathcal{Y}$; $\hat{\mu}$ predicts an outcome $\hat{y} = \hat{\mu}(x)$ on the basis of the covariate information $x$.

To quantify the uncertainty in the prediction, we apply the method of split conformal regression (SCR) [17] to construct a confidence interval $\hat{C}$ that satisfies a rigorous frequentist guarantee in the finite-sample regime. (To the best of our knowledge, SCR is the simplest method that achieves this guarantee.) In SCR, we take as given a miscoverage rate $\alpha \in (0, 1)$. We split the samples in $\mathcal{D}$ into a training set $I_1$ and a validation set $I_2$ that are disjoint and have the same size. We train the estimator $\hat{\mu}_{I_1}$ on $I_1$ and compute the residual of $\hat{\mu}_{I_1}$ on each sample in $I_2$. For a testing sample $x$, the confidence interval is given by

$$\hat{C}_{I_1,I_2}(x) = \big[\hat{\mu}_{I_1}(x) - \hat{Q}_{1-\alpha},\; \hat{\mu}_{I_1}(x) + \hat{Q}_{1-\alpha}\big], \quad (1)$$

where $\hat{Q}_{1-\alpha}$ is the $\lceil (1-\alpha)(|I_2|+1) \rceil$-th smallest value of the residuals $\{|y_i - \hat{\mu}_{I_1}(x_i)| : (x_i, y_i) \in I_2\}$. Assuming that the training and testing samples are drawn exchangeably from $P_{X,Y}$, the confidence intervals defined in (1) satisfy the coverage guarantee $P[y \in \hat{C}_{I_1,I_2}(x)] \geq 1 - \alpha$ [17]. (Recall that assuming exchangeability is weaker than assuming iid.) To illustrate, assume the miscoverage rate is given as $\alpha = 0.05$ and we are given 1000 testing samples. SCR prescribes a confidence interval for each sample in such a way that for at least 950 samples the true outcome lies within the prescribed confidence interval. (We then say the sample is covered.) However, this coverage guarantee is marginal over the entire covariate space. If we perform a subgroup analysis that partitions the covariate space $\mathcal{X}$ into subgroups $\mathcal{X}_1, \mathcal{X}_2$ in such a way that $\mathcal{X}_1$ has 800 samples and $\mathcal{X}_2$ has 200 samples, it might be the case that 790 samples in $\mathcal{X}_1$ are covered but only 160 samples in $\mathcal{X}_2$ are covered; in this case, only 80% of the samples in $\mathcal{X}_2$ would be covered. Such a situation is clearly undesirable for subgroup analysis; we want to achieve the prescribed coverage rate for each subgroup, not just for the population as a whole. R2P overcomes this problem.

We begin by discussing how we use confidence intervals to quantify outcome homogeneity within a subgroup. We then introduce our space partitioning algorithm and provide the required theoretical guarantee of subgroup coverage. Let $\Pi$ be a partition of the covariate space $\mathcal{X}$ into (disjoint) subgroups. Write $|\Pi|$ for the number of subgroups in the partition, $l_j$ for an element of $\Pi$, and $l(x; \Pi)$ for the subgroup that contains the sample $x$. Write $\mathcal{D}_l = \{(x_i, y_i) \in \mathcal{D} : x_i \in l\}$ for the samples whose covariates belong to the subgroup $l$.
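As a concrete reference for the SCR construction in (1), here is a minimal Python sketch (our illustration; the random-forest regressor is an arbitrary stand-in for any estimator $\hat{\mu}$):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def split_conformal(X, y, alpha=0.05, seed=0):
    """Split conformal regression: returns a fitted model and the residual
    quantile Q_{1-alpha}; the interval at x is mu(x) +/- Q."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    half = len(y) // 2
    i1, i2 = idx[:half], idx[half:]          # disjoint training / validation split
    mu = RandomForestRegressor(random_state=seed).fit(X[i1], y[i1])
    resid = np.abs(y[i2] - mu.predict(X[i2]))
    # ceil((1-alpha)(|I2|+1))-th smallest residual (finite-sample correction)
    k = int(np.ceil((1 - alpha) * (len(i2) + 1)))
    q = np.sort(resid)[min(k, len(i2)) - 1]
    return mu, q

# usage: mu, q = split_conformal(X_train, y_train)
#        lower, upper = mu.predict(X_test) - q, mu.predict(X_test) + q
```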
Note that when we restrict to covariates in the subgroup $l$, the samples are drawn from the truncated distribution $P^l_{X,Y}$, i.e., the distribution conditional on the covariate vector lying in the subgroup $l$. We evaluate homogeneity within the subgroup $l$ by the concentration of outcome values for covariate vectors in the subgroup $l$. To do this, we apply SCR to the samples in $\mathcal{D}_l$ by splitting it into two sets, $I^l_1$ and $I^l_2$. Write $\hat{\mu}_l(x)$ for the mean outcome model trained on $I^l_1$. As in (1), we obtain the confidence interval $\hat{C}_l(x)$ for subgroup $l$ by setting the upper and lower endpoints to be $\hat{\mu}^{up}_l(x) = \hat{\mu}_l(x) + \hat{Q}^l_{1-\alpha}$ and $\hat{\mu}^{lo}_l(x) = \hat{\mu}_l(x) - \hat{Q}^l_{1-\alpha}$, respectively. (To avoid notational complications, we omit reference to the subsets $I^l_1$ and $I^l_2$ hereafter; this should not cause confusion. Throughout, we follow the convention that the confidence bounds have been computed on the basis of the split.)

Figure 2: Illustration of partitioning and the impurity of the confident homogeneity. The regions shaded in red and gray (roughly) represent $W_l$ and $S_l$, respectively. Partitioning the heterogeneous covariate space (left panel) reduces its impurity of the confident homogeneity. The partition with the smaller impurity (middle panel) makes the heterogeneity across subgroups and the homogeneity within subgroups both stronger compared to partitions with larger impurity (e.g., right panel).

To estimate the center of the subgroup $l$, we use the average outcome $\hat{\mu}_{l,mean} = \mathbb{E}[\hat{\mu}_l(x)]$. We define the expected absolute deviation in group $l$ to be $S_l = \mathbb{E}[v_l(x)]$, where

$$v_l(x) = \big(\hat{\mu}_{l,mean} - \hat{\mu}^{up}_l(x)\big)\,\mathbb{1}\big[\hat{\mu}^{up}_l(x) < \hat{\mu}_{l,mean}\big] + \big(\hat{\mu}^{lo}_l(x) - \hat{\mu}_{l,mean}\big)\,\mathbb{1}\big[\hat{\mu}^{lo}_l(x) > \hat{\mu}_{l,mean}\big]. \quad (2)$$

By definition, $\hat{\mu}^{up}_l(x)$ is larger than $\hat{\mu}^{lo}_l(x)$ provided the residual quantile $\hat{Q}^l_{1-\alpha} > 0$. When the first indicator function is one, i.e., the average outcome (the group center) $\hat{\mu}_{l,mean}$ is larger than the upper bound $\hat{\mu}^{up}_l(x)$ at $x$, we are confident that the outcome value at $x$ is smaller than the group center (and perhaps smaller than the outcome value for many covariate vectors in $l$). Similarly, when the second indicator function is one, we are confident that the outcome value at $x$ is larger than the group center (and perhaps larger than the outcome value for many covariate vectors in $l$). It is worth noting that when $\hat{C}_l(x) = [\hat{\mu}^{lo}_l(x), \hat{\mu}^{up}_l(x)]$ contains the center $\hat{\mu}_{l,mean}$, both indicator functions are zero and $v_l(x) = 0$. The quantity $v_l(x)$ evaluates the homogeneity in subgroup $l$ on the basis of the confidence interval for each $x$ in $l$. (Using $v_l(x)$ for partitioning is more conservative than using the mean discrepancy $|\hat{\mu}_l(x) - \hat{\mu}_{l,mean}|$, and hence provides greater protection against false discovery due to uncertainty.)

However, minimizing $S_l$ is not enough to maximize subgroup homogeneity. If the intervals $\hat{C}_l(x)$ for all $x \in l$ are very wide and contain the average outcome $\hat{\mu}_{l,mean}$, homogeneity can be very low even though $S_l = \mathbb{E}[v_l(x)]$ is zero. To resolve this issue, when partitioning the covariate space we jointly minimize $S_l$ and the expected confidence interval width $W_l = \mathbb{E}[|\hat{C}_l(x)|]$. We formalize the robust partitioning problem as

$$\underset{\Pi}{\text{minimize}} \sum_{l \in \Pi} \big[\lambda W_l + (1 - \lambda) S_l\big], \quad (3)$$

where $\lambda \in [0, 1]$ is a hyperparameter that balances the impact of $W_l$ and $S_l$. We call the weighted sum, $\lambda W_l + (1 - \lambda) S_l$, the impurity of the confident homogeneity for subgroup $l$. Fig. 2 illustrates how minimizing the impurity of the confident homogeneity improves both homogeneity within subgroups and heterogeneity across subgroups.
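The pieces above combine into the per-subgroup impurity as in the following sketch (our illustration, assuming the endpoint-based form of $v_l$ reconstructed in (2) and intervals of constant width $2\hat{Q}$ within a subgroup, as SCR yields):

```python
import numpy as np

def impurity(mu_hat, q, lam=0.5):
    """Empirical impurity of the confident homogeneity for one subgroup.

    mu_hat : array of predictions mu_l(x_i) on the subgroup's validation samples
    q      : conformal residual quantile Q_{1-alpha} of the subgroup
    lam    : trade-off weight lambda in [0, 1]
    """
    upper, lower = mu_hat + q, mu_hat - q       # interval endpoints per sample
    center = mu_hat.mean()                      # group center mu_{l,mean}
    # v_l(x): a deviation is counted only when the whole interval lies
    # strictly below or above the group center (a "confident" deviation)
    v = np.where(upper < center, center - upper, 0.0) \
      + np.where(lower > center, lower - center, 0.0)
    W = np.mean(upper - lower)                  # expected interval width (= 2q)
    S = np.mean(v)                              # expected confident deviation
    return lam * W + (1 - lam) * S
```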
There may be more than one partition that achieves the minimum; because a larger number of subgroups is harder to interpret, we choose a minimizer with the smallest number of subgroups. We can now describe our robust recursive partitioning (R2P) method for solving the optimization problem (3). We begin with the trivial partition $\Pi = \{\mathcal{X}\}$. We denote by $\Pi_c$ the set of subgroups whose objectives in (3) can potentially be improved. In the initialization step, we set $\Pi_c = \Pi$ and apply SCR on $\mathcal{D}$ to obtain the confidence intervals in (1). Based on these intervals, we compute $\hat{W}_\mathcal{X}$ and $\hat{S}_\mathcal{X}$. More generally, using the confidence intervals for subgroup $l$, we can estimate $W_l$ and $S_l$ by their empirical means on the validation set,

$$\hat{W}_l = \frac{1}{N^l_2} \sum_{(x_i, y_i) \in I^l_2} |\hat{C}_l(x_i)| \quad \text{and} \quad \hat{S}_l = \frac{1}{N^l_2} \sum_{(x_i, y_i) \in I^l_2} v_l(x_i),$$

respectively, where $N^l_2$ is the number of samples in the validation set $I^l_2$.

After initialization, we recursively partition the covariate space by splitting the subgroups in $\Pi_c$ with respect to the criterion in (3). To split each subgroup $l \in \Pi_c$, we first consider the two disjoint subsets of subgroup $l$ given by

$$l^+_k(\phi) = \{x \in l : x_k \geq \phi\} \quad \text{and} \quad l^-_k(\phi) = \{x \in l : x_k < \phi\},$$

where $\phi \in [x^{l,min}_k, x^{l,max}_k]$ is the threshold for splitting, $x_k$ is the $k$-th covariate element, and $x^{l,min}_k$ and $x^{l,max}_k$ are the minimum and maximum values of the $k$-th covariate within the subgroup $l$, respectively. We then apply SCR to each of these subsets. Specifically, we split the samples corresponding to $l^+_k(\phi)$ and $l^-_k(\phi)$ into training and validation sets $I^{l^+_k(\phi)}_1, I^{l^+_k(\phi)}_2$ and $I^{l^-_k(\phi)}_1, I^{l^-_k(\phi)}_2$. To compute residuals, we do not train new estimators for $l^+_k(\phi)$ and $l^-_k(\phi)$; instead we use the previously trained estimator $\hat{\mu}_\mathcal{X}$. This provides consistency of estimators across groups and within groups. (It also avoids the enormous computational burden of training new estimators for all the possible splits.) Using the residuals, we can construct the confidence intervals $\hat{C}_{l^+_k(\phi)}(x)$ and $\hat{C}_{l^-_k(\phi)}(x)$ and the associated quantities in the objective function, $\hat{W}_{l^+_k(\phi)}$, $\hat{S}_{l^+_k(\phi)}$, $\hat{W}_{l^-_k(\phi)}$, and $\hat{S}_{l^-_k(\phi)}$. Then we find the optimal covariate $k^*_l$ and threshold $\phi^*_l$ for splitting subgroup $l$ as

$$(k^*_l, \phi^*_l) = \underset{k, \phi}{\arg\min} \sum_{s \in \{l^+_k(\phi),\, l^-_k(\phi)\}} \big[\lambda \hat{W}_s + (1 - \lambda) \hat{S}_s\big].$$

To improve the objective in (3), we split the subgroup $l$ into $l^+_{k^*}(\phi^*)$ and $l^-_{k^*}(\phi^*)$ only if the reduction in the impurity of the confident homogeneity is sufficiently large:

$$\sum_{s \in \{l^+_{k^*}(\phi^*),\, l^-_{k^*}(\phi^*)\}} \big[\lambda \hat{W}_s + (1 - \lambda) \hat{S}_s\big] \leq (1 - \gamma)\big[\lambda \hat{W}_l + (1 - \lambda) \hat{S}_l\big]. \quad (4)$$

Here, $\gamma \in [0, 1)$ is a tuning hyperparameter for regularization. We refer to (4) as the confident criterion. With an appropriate choice of $\gamma$, this criterion prevents overfitting, prevents the number of subgroups from becoming too large, and prevents the size of each subgroup from becoming too small. After the splitting decision, we remove $l$ from $\Pi_c$; if we have split $l$, we also remove $l$ from $\Pi$ and add the two split sets to both $\Pi$ and $\Pi_c$. We continue recursively until $\Pi_c$ is empty, at which point no further splitting is productive. When the procedure stops, we will have obtained an estimator $\hat{\mu}_\mathcal{X}$ and a partition $\Pi$, and for each $l \in \Pi$ we will have a corresponding confidence interval $\hat{C}_l$.

The following theorem guarantees that the partition produced by Algorithm 1 provides a valid confidence interval $\hat{C}_l$ for each subgroup $l \in \Pi$; this is exactly what a user would want in subgroup analysis. The proof is provided in the supplementary material.

Theorem 1. Given a prescribed miscoverage rate $\alpha$, the created partition $\Pi$, estimator $\hat{\mu}_\mathcal{X}$, and confidence interval function $\hat{C}_l(\cdot)$ have the following property: for each $l \in \Pi$ and for new samples $(x, y)$ drawn from the truncated distribution $P^l_{X,Y}$, we have $P[y \in \hat{C}_l(x)] \geq 1 - \alpha$.
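A compact sketch of the split search and the confident criterion (4) follows (our simplified illustration: it reuses a fixed estimator's predictions as the text prescribes, scans a small quantile grid of thresholds, and uses a minimum leaf size of 10; these simplifications are ours):

```python
import numpy as np

def best_confident_split(X_val, mu_hat, y_val, alpha=0.05, lam=0.5, gamma=0.05):
    """Return (k*, phi*) if splitting passes the confident criterion, else None.

    X_val, y_val : validation samples of the subgroup
    mu_hat       : predictions of the fixed, pre-trained estimator on X_val
    """
    def impurity(mask):
        m = mu_hat[mask]
        resid = np.abs(y_val[mask] - m)
        k = int(np.ceil((1 - alpha) * (len(resid) + 1)))
        q = np.sort(resid)[min(k, len(resid)) - 1]      # residual quantile
        c = m.mean()
        v = np.maximum(c - (m + q), 0) + np.maximum((m - q) - c, 0)
        return lam * (2 * q) + (1 - lam) * v.mean()     # lam*W + (1-lam)*S

    parent = impurity(np.ones(len(y_val), dtype=bool))
    best_k, best_phi, best_val = None, None, np.inf
    for k in range(X_val.shape[1]):
        for phi in np.quantile(X_val[:, k], np.linspace(0.1, 0.9, 17)):
            left = X_val[:, k] < phi
            if left.sum() < 10 or (~left).sum() < 10:   # minimum leaf size
                continue
            total = impurity(left) + impurity(~left)
            if total < best_val:
                best_k, best_phi, best_val = k, phi, total
    # Confident criterion (4): accept only a sufficiently large impurity reduction.
    if best_k is not None and best_val <= (1 - gamma) * parent:
        return best_k, best_phi
    return None
```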
Algorithm 1 Robust Recursive Partitioning (R2P)
1: Input: dataset $\mathcal{D}$, miscoverage rate $\alpha$, hyperparameters $\lambda$ and $\gamma$
2: Initialize $\Pi \leftarrow \{\mathcal{X}\}$, $\Pi_c \leftarrow \{\mathcal{X}\}$
3: Split $\mathcal{D}$ into $I^\mathcal{X}_1$ and $I^\mathcal{X}_2$, train $\hat{\mu}_\mathcal{X}$, compute its confidence interval $\hat{C}_\mathcal{X}$ using the split subsets, and obtain $\hat{W}_\mathcal{X}$ and $\hat{S}_\mathcal{X}$ using $I^\mathcal{X}_2$
4: for each $l \in \Pi_c$ (repeating until $\Pi_c = \emptyset$) do
5:   Find the optimal covariate and threshold $(k^*_l, \phi^*_l)$ for splitting $l$
6:   if the confident criterion (4) is satisfied then
7:     Split $l$ into $l^+_{k^*_l}(\phi^*_l)$ and $l^-_{k^*_l}(\phi^*_l)$
8:     $\Pi \leftarrow (\Pi \setminus \{l\}) \cup \{l^+_{k^*_l}(\phi^*_l), l^-_{k^*_l}(\phi^*_l)\}$; $\Pi_c \leftarrow \Pi_c \cup \{l^+_{k^*_l}(\phi^*_l), l^-_{k^*_l}(\phi^*_l)\}$
9:   end if
10:  $\Pi_c \leftarrow \Pi_c \setminus \{l\}$
11: end for
12: Output: $\Pi$, $\hat{\mu}_\mathcal{X}$, and $\hat{C}_l$ for all $l \in \Pi$

Note that confident homogeneity does not necessarily improve as the group size shrinks, because smaller groups lead to greater uncertainty. This alleviates the issue of generalization to unseen data in HTE analysis [8, 10].

In this section, we extend the R2P method to the setting of HTE estimation, resulting in the R2P-HTE design detailed below.

Heterogeneous Treatment Effect Model. We consider a setup with $n$ units (samples). For unit $i \in \{1, 2, ..., n\}$, there exists a pair of potential outcomes, $Y_i(1)$ and $Y_i(0)$, that are independently drawn from an unknown distribution, where 1 and 0 correspond to receiving and not receiving the treatment, respectively. We define the treatment indicator as $t_i \in \{0, 1\}$, where $t_i = 1$ and $t_i = 0$ indicate that unit $i$ is treated and untreated, respectively. The outcome for unit $i$ is realized as the potential outcome corresponding to its treatment indicator, $y_i = Y_i(t_i)$. Since the ITE is defined as the expected difference between the two potential outcomes $Y(1)$ and $Y(0)$, its estimator $\hat{\tau}(x)$ is given as the contrast between two regression models: $\hat{\mu}_0(x)$ for the conditional untreated outcome and $\hat{\mu}_1(x)$ for the conditional treated outcome, i.e., $\hat{\tau}(x) = \hat{\mu}_1(x) - \hat{\mu}_0(x)$.

R2P-HTE. We adapt R2P to HTE estimation by constructing the quantities $W_l$ and $S_l$ in (3) based on the ITE estimator $\hat{\tau}(x)$. To construct an ITE estimator, many popular machine learning models have been considered in the literature [8, 13-16]. R2P-HTE can use any of these models to parameterize the outcome models $\hat{\mu}_0(x)$ and $\hat{\mu}_1(x)$. We set the target coverage rate of $\hat{\mu}_0(x)$ and $\hat{\mu}_1(x)$ to $\sqrt{1 - \alpha}$. As in the previous section, we can construct a confidence interval for each estimator by using split conformal regression; denote these intervals by $\hat{C}_0(x)$ and $\hat{C}_1(x)$. The combined confidence interval $\hat{C}^\tau(x)$ ensures the coverage rate $1 - \alpha$ for the estimated ITE $\hat{\tau}(x) = \hat{\mu}_1(x) - \hat{\mu}_0(x)$: its upper endpoint is given as the difference between the upper endpoint of $\hat{C}_1$ and the lower endpoint of $\hat{C}_0$, and its lower endpoint is given as the difference between the lower endpoint of $\hat{C}_1$ and the upper endpoint of $\hat{C}_0$. If the coverage rates for $\hat{C}_1$ and $\hat{C}_0$ are each $\sqrt{1 - \alpha}$, the coverage rate for $\hat{C}^\tau$ will be $1 - \alpha$.

From the ITE estimator $\hat{\tau}(x)$ and its confidence interval $\hat{C}^\tau_l(x)$ for each subgroup $l$, we can calculate $W_l$ and $S_l$ and adapt the R2P method to HTE estimation. The robust partitioning problem for HTE in (3) is solved by applying the R2P method in Algorithm 1, with two minor changes: 1) each sample in the HTE dataset is a triple $(x_i, t_i, y_i)$ consisting of the covariate vector, the treatment indicator, and the observed outcome; 2) the outcome model $\hat{\mu}(x)$ in R2P is replaced by the ITE estimator $\hat{\tau}(x)$. As before, we show that this procedure achieves the specified coverage guarantee. The proof is provided in the supplementary material.

Theorem 2. Given a prescribed miscoverage rate $\alpha$, the created partition $\Pi$, estimator $\hat{\tau}_\mathcal{X}$, and confidence interval function $\hat{C}^\tau_l(\cdot)$ have the following property: for each $l \in \Pi$ and for new samples $(x, \tau)$ drawn from the truncated distribution of the subgroup $l$, we have $P[\tau \in \hat{C}^\tau_l(x)] \geq 1 - \alpha$.
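A minimal sketch of the interval combination described above (our illustration; `q1` and `q0` are assumed to be residual quantiles computed at per-arm coverage $\sqrt{1-\alpha}$ on treated and control validation samples, respectively):

```python
import numpy as np

def ite_interval(mu1, mu0, q1, q0):
    """Combine per-arm conformal intervals into an ITE interval.

    mu1, mu0 : predictions of the treated / control outcome models
    q1, q0   : per-arm residual quantiles at coverage sqrt(1 - alpha)
    """
    tau_hat = mu1 - mu0
    lower = (mu1 - q1) - (mu0 + q0)   # lo(C1) - up(C0)
    upper = (mu1 + q1) - (mu0 - q0)   # up(C1) - lo(C0)
    return tau_hat, lower, upper

# With alpha = 0.05, each arm targets coverage sqrt(0.95) ~= 0.975,
# so the combined ITE interval targets coverage 0.95.
alpha = 0.05
per_arm_coverage = np.sqrt(1 - alpha)
```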
Subgroup analysis methods with recursive partitioning have been widely studied based on regression trees (RT) [8-11]. In these methods, the subgroups (i.e., leaves in the tree structure) are constructed, and the individualized outcomes or treatment effects are estimated by the sample mean estimator of the leaf corresponding to the given covariates. To overcome the limitation of traditional trees in representing non-linearity such as interactions between treatment and covariates [21], a parametric model has been integrated into regression trees for subgroup analysis [12]. However, such an approach can be used only with limited types of estimator models, which is particularly undesirable since advanced causal inference models based on deep neural networks or multi-task Gaussian processes have been shown to outperform the traditional estimators [14-16]. The global model interpretation method in [20] can analyze the subgroup structure of arbitrary models, but it depends on local model interpreters and does not consider treatment effects.

For recursive partitioning, various criteria have been proposed. In the traditional RT, the criterion is based on the mean squared error between the sample mean estimates from the training samples and the test samples [11]; it is referred to as the adaptive criterion. Building on the adaptive criterion, an honest criterion is proposed in [8] by splitting the training samples into a training set and an estimation set to eliminate the bias of the adaptive criterion. In addition, a generalization cost is introduced into the adaptive or honest criterion in [10] to encourage generalization of the analysis. The interaction measure between the treatment and covariates is used as a partitioning criterion in [9], and the parameter instability of the parametric models is used in [12]. In [20], the contribution matrix of the samples from local model interpreters is used. Some of these criteria implicitly consider the confidence of the estimation through the variance, but most of them do not provide a valid confidence interval of the estimation for each subgroup. In [11], a conformal regression method based on regression trees that provides confidence intervals is proposed, but the adaptive criterion is used for partitioning without any consideration of the confidence interval.

In this section, we evaluate R2P-HTE by comparing its performance with state-of-the-art HTE methods. Specifically, we compare R2P-HTE with four baselines: standard regression trees for causal effects (CT-A) [22], conformal regression trees for causal effects (CCT) [11], causal trees with the honest criterion (CT-H) [8], and causal trees with generalization costs (CT-L) [10]. We implement CCT and CT-A by modifying the conformal regression tree and conventional regression tree methods for causal effects. Details of the baseline algorithms are provided in the supplementary material. For the ITE estimator of R2P-HTE, hereafter abbreviated as R2P, we use the causal multi-task Gaussian process (CMGP) [14].

Because individual ground-truth treatment effects can never be observed in real data, we use two synthetic and two semi-synthetic datasets. The first synthetic dataset (Synthetic dataset A) is the simple one proposed in [8]. Because dataset A possesses little of the homogeneity within subgroups that is often found in the real world, we offer a second synthetic dataset B that possesses greater homogeneity within subgroups and greater heterogeneity across subgroups and has many more features than A.
B was inspired by the initial clinical trial of remdesivir [23] as a treatment for COVID-19 and uses the patient features listed in that trial, but not the data. Aside from inspiration and features, B is relevant to COVID-19 only in that COVID-19 is known to be a disease that manifests in very heterogeneous ways. The two semi-synthetic datasets are based on real-world data; the first uses the Infant Health and Development Program (IHDP) dataset [24] and the second uses the Collaborative Perinatal Project (CPP) dataset [25]. Details of all these datasets are provided in the supplementary material. For each experiment, we conduct 50 simulations.

We denote the set of test samples by $\mathcal{D}^{te}$ and the test samples that belong to subgroup $l$ by $\mathcal{D}^{te}_l$. We define the mean and variance of the treatment effects of the test samples in subgroup $l$ as $\text{Mean}(\mathcal{D}^{te}_l)$ and $\text{Var}(\mathcal{D}^{te}_l)$, respectively. We define heterogeneity across subgroups to be the variance of the subgroup means of the treatment effects,

$$V_{across}(\mathcal{D}^{te}) = \text{Var}\big(\text{Mean}(\mathcal{D}^{te}_1), \ldots, \text{Mean}(\mathcal{D}^{te}_L)\big),$$

where $L$ is the number of subgroups; we define the average in-subgroup variance to be $V_{in}(\mathcal{D}^{te}) = \frac{1}{L} \sum_{l=1}^{L} \text{Var}(\mathcal{D}^{te}_l)$. We also report the average number of subgroups for a better understanding of the results. We set the miscoverage rate to $\alpha = 0.05$, so we demand a 95% ITE coverage rate. (We do not report actual coverage rates here because all methods achieve the target coverage rate, but they are reported in the supplementary material, along with other details.)

Results. Table 1 reports the performance of R2P and the baselines on all four datasets. Keep in mind that larger $V_{across}$ means greater heterogeneity across subgroups, while smaller $V_{in}$ means greater homogeneity within subgroups. As the table shows, R2P displays by far the best performance on all four datasets: the greatest heterogeneity across subgroups, the greatest homogeneity within subgroups, and the narrowest confidence intervals. It accomplishes all this while identifying comparable numbers of subgroups. We conclude that R2P identifies subgroups more effectively than any of the other methods and allows fewer false discoveries (as illustrated in the Introduction). The performance of R2P reflects one of its strengths: the ability to use any method of estimating ITE.

Figure 3: Distribution of treatment effects for subgroups identified by each algorithm when applied to Synthetic dataset B. Each box represents the range between the 25th and 75th percentiles of the treatment effects of the test samples; each whisker represents the range between the 5th and 95th percentiles.

The effectiveness of R2P can also be seen in Fig. 3, which provides, for R2P and each of the four baseline algorithms, boxplots of the distribution of the treatment effects for each identified subgroup on Synthetic dataset B. We see that R2P identifies subgroups reliably: different subgroups have very different average treatment effects, and their distributions are non-overlapping or well-discriminated. All the other methods are unreliable: false discovery occurs for all four baseline methods, and occurs frequently for three of the four.

Gain from subgroup analysis. To indicate the gain from subgroup analysis obtained by R2P, and hence the effectiveness of recursive partitioning, we compare $V_{in}$, the homogeneity within subgroups, obtained by R2P against the homogeneity within the entire population obtained by R2P without subgroups. We normalize by setting $V_{in}$ of R2P without subgroups to 1.
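For clarity, the two evaluation metrics can be computed as in this short sketch (our illustration; `effects` and `labels` are hypothetical names for the test-sample treatment effects and subgroup assignments):

```python
import numpy as np

def subgroup_metrics(effects, labels):
    """Compute V_across and V_in from test-sample treatment effects.

    effects : true treatment effects of the test samples
    labels  : subgroup index assigned to each test sample
    """
    groups = [effects[labels == l] for l in np.unique(labels)]
    means = np.array([g.mean() for g in groups])
    v_across = means.var()                      # heterogeneity across subgroups
    v_in = np.mean([g.var() for g in groups])   # average in-subgroup variance
    return v_across, v_in
```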
As can be seen in Table 2, subgroup analysis with R2P reduces the average in-subgroup variance by 89% and by more than 95% on Synthetic datasets A and B, respectively. For the semi-synthetic datasets, it reduces the average in-subgroup variance by more than 50% and 30% on the IHDP and CPP datasets, respectively. (Keep in mind that R2P constructs subgroups in a way that produces both strong heterogeneity across subgroups and strong homogeneity within subgroups.)

In this paper, we have studied robust HTE analysis based on recursive partitioning. The proposed algorithm, R2P-HTE, recursively partitions the entire population by taking into account both heterogeneity across subgroups and homogeneity within subgroups, using the novel criterion of confident homogeneity, which is based on the quantification of uncertainty of the ITE estimation. R2P-HTE robustly constructs subgroups and also provides confidence guarantees for each subgroup. Experiments on synthetic and semi-synthetic datasets (the latter based on real data) demonstrate that R2P-HTE outperforms state-of-the-art baseline algorithms in every dimension: greater heterogeneity across subgroups, greater homogeneity within subgroups, and narrower confidence intervals. One of the strengths of R2P-HTE is that it can employ any method for ITE estimation, including improved methods that will undoubtedly be developed in the future.

The understanding of treatment effects plays an important role in many areas, especially in medicine and public policy. In both areas, it is often the case that the same treatment has different effects on different groups; hence subgroup analysis is called for. In medicine, subgroup analysis may make it possible to identify groups of patients (defined by covariates such as age, body mass index, blood pressure, etc.) suffering from a particular disease for whom a particular drug is effective and safe and other groups for whom the same drug is ineffective or unsafe. Similarly, subgroup analysis may make it possible to identify groups of patients for whom one course of treatment (e.g., a particular mode of radiotherapy or chemotherapy) is preferable (more likely to be successful with fewer side effects) to another. In public policy, subgroup analysis may make it possible to identify groups of people or geographic regions for which particular interventions (e.g., providing mosquito nets to combat malaria) are likely to be successful or unsuccessful. The method for subgroup analysis developed in this paper, R2P, is demonstrably a substantial improvement over previous state-of-the-art methods and therefore has the potential for widespread impact. Moreover, because R2P can make use of improvements in the underlying estimation algorithms, this impact may grow over time.

Here, we provide the basic idea of conformal prediction to aid understanding, using the following example from [17]. Let $z_1, ..., z_n, z$ be samples of a scalar random variable drawn exchangeably from a distribution $P_Z$, and let $z_{(1)}, ..., z_{(n)}$ denote the order statistics of $z_1, ..., z_n$. We denote the $(1-\alpha)$-th quantile of $z_{(1)}, ..., z_{(n)}$ as $\hat{Q}_{1-\alpha} = z_{(\lceil (1-\alpha)(n+1) \rceil)}$. The rank of $z$ among $z_1, ..., z_n, z$ is uniformly distributed over the set $\{1, ..., n+1\}$ under the exchangeability assumption that the joint distribution of $z_1, ..., z_n, z$ is invariant to the sampling order. Thus, for a given miscoverage level $\alpha \in [0, 1]$, we have $P[z \leq \hat{Q}_{1-\alpha}] \geq 1 - \alpha$ by summing the uniform rank probabilities up to $\hat{Q}_{1-\alpha}$.
This idea can be used in a regression problem with covariates $x \in \mathcal{X} \subseteq \mathbb{R}^d$, where $d$ is the dimension of the covariate vector, and outcomes $y \in \mathcal{Y}$. Specifically, with the regressor $\hat{\mu}$, the confidence interval for $y$ is given as

$$\hat{C}(x) = \big[\hat{\mu}(x) - G^{-1}_{1-\alpha},\; \hat{\mu}(x) + G^{-1}_{1-\alpha}\big],$$

where $G$ is the empirical distribution of the fitted residuals on the training samples (i.e., $|y_i - \hat{\mu}(x_i)|$, $i = 1, ..., n$) and $G^{-1}_{1-\alpha}$ is the $(1-\alpha)$-th quantile of $G$. However, this method may undercover $y$, since the residuals on the training samples are typically smaller than those on the test samples due to overfitting. To avoid this, split conformal regression (SCR) separates the samples used for training from those used for computing the residuals. In SCR, the training samples are split into two equal-size subsets $I_1$ and $I_2$; the subset $I_1$ is used to fit the regressor $\hat{\mu}_{I_1}$, and the other subset $I_2$ is used to compute the residuals of $\hat{\mu}_{I_1}$, $R_{I_2} = \{|y_i - \hat{\mu}_{I_1}(x_i)| : (x_i, y_i) \in I_2\}$. Based on the regressor and residuals, the confidence interval for $y$ with the regressor $\hat{\mu}_{I_1}$ is given as

$$\hat{C}_{I_1,I_2}(x) = \big[\hat{\mu}_{I_1}(x) - \hat{Q}_{1-\alpha}(R_{I_2}),\; \hat{\mu}_{I_1}(x) + \hat{Q}_{1-\alpha}(R_{I_2})\big],$$

where $\hat{Q}_{1-\alpha}(R_{I_2})$ is the $\lceil (1-\alpha)(|I_2|+1) \rceil$-th smallest value in $R_{I_2}$. Under the single assumption of exchangeability of the training samples $\{(x_i, y_i)\}_{i=1}^n$ and the testing sample $(x, y)$, it satisfies the following theorem.

Theorem 3 [17]. If the samples $\{(x_i, y_i)\}_{i=1}^n$ are exchangeable, then for a new sample $(x_{n+1}, y_{n+1})$ drawn from $P_{X,Y}$, we have $P[y_{n+1} \in \hat{C}_{I_1,I_2}(x_{n+1})] \geq 1 - \alpha$.

Proof of Theorem 1. For subgroup $l$, let $P^l_{X,Y}$ be the distribution on $l \times \mathcal{Y}$, which is the conditional distribution $P_{X,Y \mid X \in l}$. According to Algorithm 1, the samples of subgroup $l$ consist of two disjoint subsets, $\mathcal{D}_l = I^l_1 \cup I^l_2$, where $I^l_1$ contains the samples used for training the estimator $\hat{\mu}_\mathcal{X}$ and $I^l_2$ contains the samples used for constructing the confidence interval $\hat{C}_l$. The samples in $\mathcal{D}_l$ are exchangeable since they are i.i.d. Thus, for a new sample $(x_{n+1}, y_{n+1})$ drawn from $P^l_{X,Y}$, we have $P[y_{n+1} \in \hat{C}_l(x_{n+1})] \geq 1 - \alpha$ from Theorem 3.

Proof of Theorem 2. We have the samples of subgroup $l$ consisting of two disjoint subsets, $\mathcal{D}_{HTE,l} = I^l_1 \cup I^l_2$. The samples in $I^l_1$ are used for training the estimators $\hat{\mu}^1_\mathcal{X}$ and $\hat{\mu}^0_\mathcal{X}$. The samples in $I^l_2$ are used to construct the confidence intervals $\hat{C}^1_l$ and $\hat{C}^0_l$ with the coverage rate $\sqrt{1 - \alpha}$, according to the treatment indicators of the samples. Since the samples are i.i.d., for a new sample with each potential outcome, the corresponding confidence interval satisfies the coverage rate $\sqrt{1 - \alpha}$ from Theorem 3:

$$P[Y(1) \in \hat{C}^1_l(x)] \geq \sqrt{1-\alpha} \quad \text{and} \quad P[Y(0) \in \hat{C}^0_l(x)] \geq \sqrt{1-\alpha}.$$

From the definitions of the ITE estimate $\hat{\tau}_l(x)$ and $\hat{C}^\tau_l(x)$, the events that the potential outcomes and the ITE belong to their corresponding confidence intervals satisfy the relation

$$\{Y(1) \in \hat{C}^1_l(x)\} \cap \{Y(0) \in \hat{C}^0_l(x)\} \subseteq \{\tau \in \hat{C}^\tau_l(x)\},$$

and since the two potential outcomes are drawn independently,

$$P[\tau \in \hat{C}^\tau_l(x)] \geq P[Y(1) \in \hat{C}^1_l(x)] \cdot P[Y(0) \in \hat{C}^0_l(x)] \geq \sqrt{1-\alpha} \cdot \sqrt{1-\alpha} = 1 - \alpha.$$

Subgroup analysis methods with recursive partitioning have been widely studied based on regression trees (RT) [8-11]. In these methods, the subgroups (i.e., leaves in the tree structure) are constructed, and the individualized outcomes or treatment effects are estimated by the sample mean estimator of the leaf corresponding to the given covariates. To overcome the limitation of traditional trees in representing non-linearity such as interactions between treatment and covariates [21], a parametric model has been integrated into regression trees for subgroup analysis [12]. However, such an approach can be used only with limited types of estimator models, which is particularly undesirable since advanced causal inference models based on deep neural networks or multi-task Gaussian processes have been shown to outperform the traditional estimators [14-16].
The global model interpretation method in [20] can analyze the subgroup structure of arbitrary models, but it depends on local model interpreters and does not consider treatment effects. For recursive partitioning, various criteria have been proposed. In the traditional RT, the criterion is based on the mean squared error between the sample mean estimates from the training samples and the test samples [11]; it is referred to as the adaptive criterion. Essentially, the adaptive criterion identifies subgroups with heterogeneous treatment effects by trying to maximize the heterogeneity across the identified subgroups. Building on the adaptive criterion, an honest criterion is proposed in [8]. In that criterion, the training samples are split into two subsets: one is used to build the tree structure, and the other is used to estimate the treatment effects. By this, the honest criterion eliminates the bias of the adaptive criterion. In [10], a generalization cost is introduced to encourage generalization of the analysis; it is defined using another subset of the training samples and added to the adaptive or honest criterion. The interaction measure between the treatment and covariates is used as a partitioning criterion in [9], and the parameter instability of the parametric models is used in [12]. In [20], the contribution matrix of the samples from local model interpreters is used for partitioning. These criteria focus on the heterogeneity across the subgroups, but the variation of the treatment effects within each subgroup (i.e., the homogeneity within each subgroup) is largely ignored. Moreover, some of these criteria implicitly consider the confidence of the estimation through the variance, but most of them do not provide a valid confidence interval of the estimation for each subgroup. In [11], a conformal regression method based on regression trees that provides confidence intervals is proposed, but the adaptive criterion is used for partitioning without any consideration of the confidence interval. Compared with these criteria, the confident criterion in our method considers both the heterogeneity and the homogeneity of subgroups in a confident way, using the confidence intervals of the estimation for each subgroup.

Synthetic dataset A. We first consider a synthetic model for treatment effects proposed in [8]. It describes the potential outcome $y$ for a given treatment $t \in \{0, 1\}$ as

$$y_i = \eta(x_i) + \frac{1}{2}(2t_i - 1)\,\kappa(x_i) + \varepsilon_i,$$

where $\varepsilon_i$ is an i.i.d. noise term and $\eta(\cdot)$ and $\kappa(\cdot)$ are the design functions. The response surface of the outcomes $y_i$ is determined by the design functions: $\eta(\cdot)$ describes mean outcomes for given covariates and $\kappa(\cdot)$ describes treatment effects for given covariates. Based on this synthetic model, we consider the following design with 2-dimensional covariates: $\eta(x) = \frac{1}{2} x_1 + x_2$ and $\kappa(x) = \frac{1}{2} x_1$. There are no noise covariates, i.e., covariates that do not affect outcomes at all. In the experiments, we generate 300 samples for training and 1000 samples for testing.

Synthetic dataset B. Here, we introduce a synthetic model based on the initial clinical trial results of remdesivir for COVID-19 [23], in which it was found that remdesivir results in a faster time to clinical improvement for patients with a shorter time from symptom onset to starting the trial. Since the clinical trial data are not public, we generate a synthetic model in which the treatment effect mainly depends on the time from symptom onset to trial, based on the clinical trial setting and results in the paper [23].
We consider the following 10 baseline covariates: age ~ N(66, 4), white blood cell count (×10^9 per L) ~ N(66, 4), lymphocyte count (×10^9 per L) ~ N(0.8, 0.1), platelet count (×10^9 per L) ~ N(183, 20.4), serum creatinine (U/L) ~ N(68, 6.6), aspartate aminotransferase (U/L) ~ N(31, 5.1), alanine aminotransferase (U/L) ~ N(26, 5.1), lactate dehydrogenase (U/L) ~ N(339, 51), creatine kinase (U/L) ~ N(76, 21), and time from symptom onset to starting the trial (days) ~ Unif(4, 14). We approximate these distributions using the patient characteristics provided in the paper. To construct treatment/control responses, we first adopt the response surface of the IHDP dataset [24] for the covariates other than the time. We then use a logistic function of the time covariate to model the dependence of the effectiveness on the time (i.e., the faster time to clinical improvement with the shorter time from symptom onset to trial). Specifically, the control and treated responses are defined as

$$Y(0) \sim N\big(X_{-0}\beta + (1 + e^{-(x_0 - 9)})^{-1} + 5,\; 0.1\big) \quad \text{and} \quad Y(1) \sim N\big(X_{-0}\beta + 5 \cdot (1 + e^{-(x_0 - 9)})^{-1},\; 0.1\big),$$

where $X_{-0}$ represents the matrix of the standardized (zero-mean and unit standard deviation) covariate values excluding the time covariate $x_0$, and the coefficients in the vector $\beta$ are randomly sampled from the values (0, 0.1, 0.2, 0.3, 0.4) with probabilities (0.6, 0.1, 0.1, 0.1, 0.1), respectively. With this synthetic model, we approximate the response surface provided in [23], in which the time to clinical improvement (i.e., the treatment effect) becomes faster for patients with a shorter time from symptom onset to trial.

We consider two semi-synthetic datasets developed for treatment effects based on real-world data: the Infant Health and Development Program (IHDP) dataset [24] and the Collaborative Perinatal Project (CPP) dataset [25].

IHDP dataset. The Infant Health and Development Program (IHDP) is a randomized experiment intended to enhance the cognitive and health status of low-birth-weight, premature infants through intensive high-quality child care and home visits from a trained provider. Based on the real experimental data about the impact of the IHDP on the subjects' IQ scores at the age of three, the semi-synthetic (simulated) dataset was developed and has been used to evaluate treatment effect estimation in [24, 14, 16, 26]. All outcomes (i.e., response surfaces) are simulated using the real covariates. The dataset consists of 747 subjects (608 untreated and 139 treated), and there are 25 covariates associated with each subject. For generating outcomes, we used the standard non-linear mean outcomes of the "Response Surface B" setting provided in [24] and also used in [14, 16, 26]. We added noise ~ N(0, 0.1) to each potential outcome. In the experiments, we use 80% of the samples for training and 20% for testing.

CPP dataset. For the 2016 Atlantic Causal Inference Conference (ACIC) competition, a semi-synthetic dataset for treatment effects was developed based on data from the Collaborative Perinatal Project (CPP) [25]. It comprises multiple datasets that are generated by distinct data generating processes (causal graphs) and random seeds. Each dataset consists of 4802 observations with 58 covariates, of which 3 are categorical, 5 are binary, 27 are count data, and the remaining 23 are continuous. The factual and counterfactual samples are randomly drawn according to the data generating process, and noise ~ N(0, 0.1) is added to each potential outcome.
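For concreteness, here is a sketch of a dataset B generator (our illustration under stated assumptions: the lab covariates are drawn already standardized rather than from the listed clinical distributions, and the IHDP-style response surface is reduced to the linear term $X_{-0}\beta$; only the logistic time term follows the text above exactly):

```python
import numpy as np

def generate_dataset_b(n, seed=0):
    """Simplified Synthetic dataset B generator (see assumptions in the lead-in)."""
    rng = np.random.default_rng(seed)
    x0 = rng.uniform(4, 14, n)                      # days from symptom onset to trial
    X = rng.normal(0, 1, (n, 9))                    # standardized lab covariates X_{-0}
    beta = rng.choice([0.0, 0.1, 0.2, 0.3, 0.4], size=9,
                      p=[0.6, 0.1, 0.1, 0.1, 0.1])  # sparse random coefficients
    base = X @ beta
    logistic = 1.0 / (1.0 + np.exp(-(x0 - 9)))      # logistic term in the time covariate
    y0 = rng.normal(base + logistic + 5, 0.1)       # control response Y(0)
    y1 = rng.normal(base + 5 * logistic, 0.1)       # treated response Y(1)
    return np.column_stack([x0, X]), y0, y1
```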
In the experiment, we use the dataset with index 1 provided in [25] and drop the rows whose $Y(1)$ or $Y(0)$ lies above the 99% quantile or below the 1% quantile, to avoid outliers. The dataset consists of 35% treated subjects and 65% untreated subjects. We randomly pick 500 samples for training and 300 samples for testing from the dataset.

Here, we describe the algorithms used in the experiments.

Robust recursive partitioning for HTE (R2P-HTE). We implemented R2P-HTE based on Section 4 of our manuscript. For the ITE estimator, we use the causal multi-task Gaussian process (CMGP) in [14]. Using the potential outcomes estimated by CMGP (i.e., $\hat{\mu}_1(x)$ and $\hat{\mu}_0(x)$), we construct the confidence interval for the CATE estimation $\hat{\tau}(x)$. Then, we can apply Algorithm 1 of our manuscript to the CATE estimation. In the experiments, we set $\lambda = 0.5$ and $\gamma = 0.05$. (We used $\lambda = 0$ for the CPP dataset to avoid an excessive effect of the confidence intervals compared with the effect of the heterogeneity in the dataset.) We use $\alpha = 0.1$ and a 0.5 split ratio for the split conformal prediction. We set the minimum number of training samples in each leaf to 10. To avoid excessive conservativeness, we use the confidence interval with miscoverage rate $\beta = 0.8$ for the expected deviation $\hat{S}_l$ in the confident split criterion. We set the hyperparameters manually as above, considering typical values (e.g., 0.95 is a typical value for significance tests in tree-based methods), but if needed, hyperparameter tuning can be done by a grid search as in typical recursive partitioning methods [10].

Standard regression tree for causal effects (CT-A). Since the standard regression tree in [22] is not developed for causal effects, we implement the modified version of the standard regression tree for causal effects introduced in [8]. In the modified version, the regression tree recursively partitions according to a criterion based on the expectation of the mean squared error (MSE) of treatment effects; in the literature, this criterion is called the adaptive criterion. For the details of the method, please refer to [8]. In the experiments, we set the minimum number of training samples in each leaf to 20, since CT-A does not need to split the data samples into two subsets for validation as the other methods do. After building the tree structure, we prune the tree according to the statistical significance gain with threshold 0.05.

Conformal regression tree for causal effects (CCT). We modify the conformal regression tree proposed in [11] for our experiments, since it is not developed for treatment effects. The conformal regression tree is developed by applying conformal prediction methods to a standard regression tree. Hence, we implemented CCT by applying a split conformal prediction method to a standard causal tree (i.e., CT-A). Specifically, the confidence intervals for the estimates of the potential outcomes from CT-A are constructed by split conformal prediction. Then, the confidence interval for the CATE estimation is constructed by combining them as in R2P. In the experiments, we use a 0.5 split ratio for the split conformal prediction and set the minimum number of training samples in each leaf to 10. After building the tree structure, we prune the tree according to the statistical significance gain with threshold 0.05.

Causal tree with honest criterion (CT-H). We implement the causal tree method proposed in [8].
This method is a modified version of the standard regression tree for causal effects in which an honest criterion is used instead of the adaptive criterion. It separates tree-building and treatment effect estimation into two stages. To this end, the data samples are split into two subsets: training samples to build the tree and estimation samples to estimate treatment effects. This makes the tree-building independent of the treatment effect estimation, which eliminates the bias. For the details of the method, please refer to [8]. In the experiments, we use a 0.5 split ratio between the training (tree-building) samples and the estimation samples and set the minimum number of training samples in each leaf to 10. After building the tree structure, we prune the tree according to the statistical significance gain with threshold 0.05.

Causal tree with generalization costs (CT-L). We implement the causal tree with a criterion considering generalization costs in [10]. This method is a modified version of the causal tree in [8]. It splits the data samples into training and validation samples, and builds the tree using the training samples while penalizing splits that generalize poorly to the validation samples. To this end, it introduces a generalization cost that encourages generalization to unseen data by penalizing estimates that do not match the validation samples. For the details of the method, please refer to [10]. In the experiments, we use a 0.5 split ratio between the training (tree-building) samples and the validation samples and set the minimum number of training samples in each leaf to 10. After building the tree structure, we prune the tree according to the statistical significance gain with threshold 0.05.

Results with maximum depth. We provide results with a maximum depth for partitioning to show the effectiveness of the confident criterion more clearly. We set the maximum depth of each method to 2, which limits the maximum number of identified subgroups to 4. On most datasets, R2P has both the highest variance across subgroups and the lowest in-subgroup variance. This implies that each split made by R2P is more effective at identifying subgroups than the splits made by the other methods. On Synthetic dataset B, CT-A has the highest variance across subgroups, but its difference from that of R2P is marginal, and the in-subgroup variance of CT-A is much higher than that of R2P.

We here provide the complete results of Table 1 in the manuscript. From the table, we can see that R2P satisfies the target coverage for all datasets, even with the narrower confidence interval widths.

Impact of hyperparameter γ. We show the impact of the hyperparameter $\gamma \in [0, 1)$ of R2P in Fig. 4. The hyperparameter $\gamma$ controls the regularization in R2P. From the figures, we can see that as $\gamma$ increases, the number of subgroups converges to one, since a group is rarely partitioned with larger $\gamma$. This degrades performance in the following respects: $V_{across}$ decreases, $V_{in}$ increases, and the confidence interval width generally increases. Thus, it might seem that a smaller value is a better choice for $\gamma$. However, this trend comes with an excessive number of subgroups, which implies that if $\gamma$ is too small, the subgroups (i.e., the partition) constructed by R2P become overfitted. This overfitting of the partition results in a loss of generalization ability to unseen data, and the overly large number of subgroups due to overfitting makes the subgroup analysis less informative.
In addition, we can see that our method robustly satisfies the target coverage rate regardless of $\gamma$. Consequently, the hyperparameter $\gamma$ should be chosen appropriately, considering the regularization.

Impact of hyperparameter λ. We show the impact of the hyperparameter $\lambda \in [0, 1]$ of R2P in Fig. 5. The hyperparameter $\lambda$ controls the weight between the homogeneity within each subgroup and the confidence interval width in the confident criterion. With smaller $\lambda$, the homogeneity within each subgroup is more emphasized in the criterion; in that case, the figures show that R2P finds a larger number of subgroups, which results in higher $V_{across}$ and lower $V_{in}$. On the other hand, with larger $\lambda$, the confidence interval width is more emphasized in the criterion, and thus the confidence interval width generally decreases. We can see that our method robustly satisfies the target coverage rate regardless of $\lambda$ in most cases. Consequently, the hyperparameter $\lambda$ should be chosen appropriately, considering the variance over the entire population.

References
[1] Subgroup analysis in randomised controlled trials: importance, indications, and interpretation. The Lancet.
[2] Residual weighted learning for estimating individualized treatment rules.
[3] Observational data for heterogeneous treatment effects with application to recommender systems.
[4] Estimating heterogeneous treatment effects and the effects of heterogeneous treatments with ensemble methods.
[5] Estimating treatment effect heterogeneity in randomized program evaluation.
[6] Subgroup identification from randomized clinical trial data.
[7] Hierarchical commensurate and power prior models for adaptive incorporation of historical information in clinical trials.
[8] Recursive partitioning for heterogeneous causal effects.
[9] Subgroup analysis via recursive partitioning.
[10] Learning triggers for heterogeneous treatment effects.
[11] Interpretable regression trees using conformal prediction. Expert Systems with Applications.
[12] Model-based recursive partitioning for subgroup analyses.
[13] Estimation and inference of heterogeneous treatment effects using random forests.
[14] Bayesian inference of individualized treatment effects using multi-task Gaussian processes.
[15] GANITE: Estimation of individualized treatment effects using generative adversarial nets.
[16] Estimating individual treatment effect: generalization bounds and algorithms.
[17] Distribution-free predictive inference for regression.
[18] A dynamic discretization approach for constructing decision trees with a continuous label.
[19] A novel decision-tree method for structured continuous-label classification.
[20] Global model interpretation via recursive partitioning.
[21] A comparison of five recursive partitioning methods to find person subgroups involved in meaningful treatment-subgroup interactions.
[22] Classification and regression trees.
[23] Remdesivir in adults with severe COVID-19: a randomised, double-blind, placebo-controlled, multicentre trial. The Lancet.
[24] Bayesian nonparametric modeling for causal inference.
[25] Automated versus do-it-yourself methods for causal inference: Lessons learned from a data analysis competition.
[26] Causal effect inference with deep latent-variable models.