title: Distilling Heterogeneity: From Explanations of Heterogeneous Treatment Effect Models to Interpretable Policies
authors: Wu, Han; Tan, Sarah; Li, Weiwei; Garrard, Mia; Obeng, Adam; Dimmery, Drew; Singh, Shaun; Wang, Hanson; Jiang, Daniel; Bakshy, Eytan
date: 2021-11-05

Internet companies are increasingly using machine learning models to create personalized policies which assign, for each individual, the best predicted treatment for that individual. These policies are frequently derived from black-box heterogeneous treatment effect (HTE) models that predict individual-level treatment effects. In this paper, we focus on (1) learning explanations for HTE models and (2) learning interpretable policies that prescribe treatment assignments. We also propose guidance trees, an approach to ensembling multiple interpretable policies without the loss of interpretability. These rule-based interpretable policies are easy to deploy and avoid the need to maintain an HTE model in a production environment.

Unlike conventional A/B testing, where individuals are randomly assigned to a treatment condition, internet companies are increasingly using heterogeneous treatment effect (HTE) models and personalized policies to decide, for each individual, the best predicted treatment (Garcin et al., 2013; Toguç and Kuzu, 2020). However, to flexibly model heterogeneity in treatment and outcome surfaces, state-of-the-art HTE models used for personalization (Shi et al., 2019; Kennedy, 2020) frequently leverage black-box, uninterpretable base learners such as gradient boosted trees and neural networks. To generate an HTE prediction, some of them combine outputs from multiple models via stacking, weighting, etc. (Künzel et al., 2019; Nie and Wager, 2021; Montoya et al., 2021). Moreover, their treatment effect predictions are sometimes used as inputs to additional black-box policy learning models (Imai and Strauss, 2011; McFowland et al., 2020).

There are several challenges with these approaches. First, when there are multiple (not just binary) outcomes, the creation of a policy is no longer as trivial as assigning the treatment group that maximizes a single treatment effect. Second, as dataset size and the number of features used for HTE modeling or policy learning increase, the cost of maintaining such models in deployment increases (Sculley et al., 2015a). Finally, the black-box nature of HTE models and the resulting policies makes them difficult to interpret, which can deter their uptake for critical applications.

This paper describes methods for (1) learning explanations of HTE models, to better understand such models, and (2) generating interpretable policies that can perform on par with or even better than black-box policies. Our explanations and interpretable policies are constructed from if-else rules, using individual features as inputs (Figure 1). We demonstrate a variety of different ways to learn interpretable policies, from methods that act on HTE model predictions to methods that work on conditional average treatment effects. We also propose ways to ensemble multiple interpretable policies while still remaining interpretable. Crucially, all of our proposed methods work in the general setting of more than two treatment groups and multiple outcomes, which is needed in production environments at internet companies.
For example, a new website feature may increase user engagement at the expense of revenue. We handle multiple outcomes by leveraging techniques such as model distillation (Bucilua et al., 2006; Hinton et al., 2014) with multitask (Caruana, 1997) and multiclass student models, and linear scalarization of multiple outcomes from the multi-objective optimization literature (Gunantara, 2018).

We do not attempt to comprehensively survey the state of HTE estimation, since our contribution is agnostic to the choice of HTE model. Most modern HTE estimation methods are compatible with our proposed methods.

Subgroup finding. Our work on interpreting HTE models by surfacing segments with heterogeneity produces a representation analogous to subgroup-finding methods. These methods can be divided into: (1) statistical tests to confirm or deny a set of pre-defined hypotheses about subgroups (Assmann et al., 2000; Song and Chi, 2007); (2) methods that aim to discover subgroups directly from data (Imai and Ratkovic, 2013; Chen et al., 2015; Wang and Rudin, 2021; Nagpal et al., 2020); and (3) methods that act on inferred individual-level treatment effects to discover subgroups (Foster et al., 2011; Lee et al., 2020). However, the comparison of subgroup-finding algorithms is still an open question. Loh et al. (2019) empirically compared 13 different subgroup-finding algorithms, finding that no single algorithm satisfied all the desired properties of such algorithms.

Policy learning. The problem of policy learning from observational or experimental data has received significant attention due to its wide applicability (Beygelzimer and Langford, 2009; Qian and Murphy, 2011; Zhao et al., 2012; Swaminathan and Joachims, 2015; Kitagawa and Tetenov, 2018; Kallus and Zhou, 2018), but the interpretability and scalability of the learned policies have received less attention. Several existing works (Amram et al., 2020; Athey and Wager, 2021; Jo et al., 2021) do construct interpretable tree policies based on counterfactual estimation. However, they focus on constructing exactly optimal trees, which is prohibitively slow at large scale and unsuitable for a production environment.

Interpretability via distillation and imitation learning. Black-box classification and regression models have been distilled into interpretable models such as trees and rule lists (Craven and Shavlik, 1995; Frosst and Hinton, 2017; Tan et al., 2018b,a). Craven and Shavlik (1995) distilled a neural net to a decision tree and evaluated the explanation in terms of fidelity, accuracy, and complexity. Namkoong et al. (2020) distilled a Thompson Sampling policy into policies represented using MLPs. Interpreting reinforcement learning methods has also garnered interest; see Alharin et al. (2020) for a survey. These methods can be roughly divided into those that leverage the interpretable representation provided by trees to approximate states and actions (Bastani et al., 2018; Vasic et al., 2019), saliency and attention methods (Wang et al., 2016; Greydanus et al., 2018; Qureshi et al., 2017), and distillation (Czarnecki et al., 2019). We also use distillation techniques, but in a non-sequential setting.

Suppose we have a dataset $D$ of $N$ points and $P$ features with $K$ treatment groups, 1 control group, and $J$ outcomes. Let $W_i \in \{0, 1, \dots, K\}$ denote the treatment group indicator for the $i$-th individual ($W_i = 0$ means the individual is in the control group). Let $G_k$ denote the set of individuals in the $k$-th treatment group, $k = 0, 1, \dots, K$.
Following the potential outcome framework (Neyman, 1923; Rubin, 1974), we assume for each individual $i$ and outcome $j$ there are $K+1$ potential outcomes $Y_{ij}^{(0)}, Y_{ij}^{(1)}, \dots, Y_{ij}^{(K)}$, of which only $Y_{ij} = Y_{ij}^{(W_i)}$, the potential outcome under the received treatment, is observed. We suppose we have access to predicted outcomes $\hat{Y}_{ij}^{(w)}$, the predicted $j$-th outcome for individual $i$ assuming treatment group $w$. These could be obtained directly from the HTE model or together with estimates of the baseline surface.

We focus on deterministic policies, $\Pi: \mathbb{R}^P \to \{0, 1, \dots, K\}$, which map feature vectors to treatment groups. An individual with feature vector $X_i$ will be given treatment $\Pi(X_i)$. A segment, $S$, is a set of indices, representing a set of points $N_S$ and treatment assignments $W_S$, along with their predicted outcomes $\hat{Y}_{ij}^{(w)}$ for $i \in S$. Throughout this work, we refer to lists of segments, $\{S_1, S_2, \dots, S_n\}$, which form a partition of the input indices, i.e. there is no index overlap between the segments.

Throughout this work, we combine multiple outcomes into a single outcome via linear scalarization, parameterized by a set of weights $(c_1, \dots, c_J)$. Thus, for each individual $i$, the multiple observed outcomes become a single outcome $Y_i = \sum_{j=1}^{J} c_j Y_{ij}$. We combine predicted outcomes in the same way: $\hat{Y}_i^{(w)} = \sum_{j=1}^{J} c_j \hat{Y}_{ij}^{(w)}$. Our proposed methods take $(c_1, \dots, c_J)$ as inputs and assume a value of 1 for every weight (equal weighting on outcomes) if not provided. In practice, it is common to either tune the weights $(c_1, \dots, c_J)$ (Letham and Bakshy, 2019) or handpick weights to manage particular preferences or tradeoffs.

Multi-task decision trees (MTDT) extend single-task decision trees by combining the prediction loss on each outcome, across multiple outcomes, into a single scalar loss function. This simple representation is effective and suitable for locating segments where individuals exhibit heterogeneity across multiple outcomes. In this tree, each task (label) is the predicted treatment effect for an outcome, and each node is a segment identified to have elevated or depressed treatment effects across multiple outcomes. We learn these trees using distillation techniques; the resulting method is called Distilling HTE model predictions (Distill-HTE). Figure 2 provides a concrete example.

Distillation. To learn an MTDT model in an HTE-model-agnostic fashion, we leverage model distillation techniques, taking the HTE model as the teacher and the MTDT model as the student. After obtaining predicted outcomes $\hat{Y}_{ij}^{(w)}$, we construct pairwise predicted treatment effects $\hat{T}_{ij}$, i.e. individual $i$'s predicted treatment effect for outcome $j$ for a particular treatment contrast against control (analogous definitions hold for all treatment contrasts). Let $F$ be the prediction function of an MTDT model trained with the following distillation objective, which minimizes the mean squared error between the HTE model's predicted treatment effects and $F$:

$$\mathcal{L}(F) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{J} c_j \left( \hat{T}_{ij} - F_j(X_i) \right)^2 \quad (1)$$

Here $F_j$ is the MTDT model's prediction for outcome $j$, accompanied by $c_j$, a weight that encodes how much outcome $j$ contributes to the overall loss, as reviewed in Section 3.2. When $c_j = 1$, as we assume by default, each outcome contributes equally to the overall distillation loss.

Figure 2: Distill-HTE: multitask treatment effect decision tree learned by distilling a black-box HTE model. After the tree is learned, we "unroll" the tree such that each terminal node forms a segment visualized in a barplot.
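To make the distillation step concrete, the following is a minimal Python sketch (assuming NumPy and scikit-learn) of fitting a multi-task decision tree student to a teacher's pairwise treatment-effect predictions for one treatment contrast. The function name, the weighting-by-rescaling trick, and the leaf-size setting are our own illustration under the loss in (1), not the authors' implementation, and the counterfactual-reasoning refinements described next are omitted.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def distill_hte(X, teacher_effects, outcome_weights=None, max_depth=3):
    """X: (N, P) features; teacher_effects: (N, J) teacher-predicted treatment
    effects for one treatment-vs-control contrast; outcome_weights: (J,) scalarization
    weights c_j, defaulting to equal weights as in the paper."""
    T_hat = np.asarray(teacher_effects, dtype=float)
    c = np.ones(T_hat.shape[1]) if outcome_weights is None else np.asarray(outcome_weights, dtype=float)
    # Weighted squared error on outcome j equals unweighted squared error on
    # sqrt(c_j)-scaled targets, so pre-scaling reproduces the weighted loss (1).
    student = DecisionTreeRegressor(max_depth=max_depth, min_samples_leaf=100)
    student.fit(X, T_hat * np.sqrt(c))
    # Each leaf of the student tree is one segment; report per-outcome mean effects.
    leaf_id = student.apply(X)
    segments = {int(leaf): T_hat[leaf_id == leaf].mean(axis=0)  # unscaled effects
                for leaf in np.unique(leaf_id)}
    return student, segments
```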
Counterfactual reasoning. To improve the robustness of the subgroups, we introduce three elements based on counterfactual reasoning ideas: (1) terminal nodes without sufficient overlap (i.e. enough treatment AND control points) are post-pruned from the tree, and the prediction regenerated; (2) confidence intervals are provided for treatment effects within each node; (3) honesty (Athey and Imbens, 2016): splits and predictions are determined on different data splits.

From treatment effects to policies. With $K$ treatment groups, 1 control group, and $J$ outcomes, the proposed method learns $K$ MTDT models, where each model predicts treatment effects on the $J$ outcomes for one treatment group compared to control. One MTDT model can then be used to generate a policy that indicates, for each segment, which of the pair of treatment groups should be assigned to maximize treatment effects. However, generating a single policy still requires combining multiple MTDT models, and it is not immediately obvious how best to do so. This motivates our proposal of directly learning a single policy that applies to multiple treatment groups and outcomes.

To generate a single interpretable policy $\Pi$ that applies to multiple treatment groups and outcomes, we now describe different approaches, all using if-else rule representations. An already-trained HTE model from which predicted outcomes can be obtained is a prerequisite for the approaches described in this subsection. We describe two ways of leveraging these predictions: GreedyTreeSearch-HTE and Distill-Policy.

Greedy tree search from HTE model predictions (GreedyTreeSearch-HTE). In this approach, we directly solve the following optimization problem in the space of trees with pre-determined maximum tree depth, assuming without loss of generality that higher outcomes are better:

$$\max_{\Pi} \sum_{i=1}^{N} \hat{Y}_i^{(\Pi(X_i))} \quad (2)$$

This can be seen as a cost-sensitive classification problem, where the cost of assigning a point to a group is the negative value of the predicted potential outcome (Elkan, 2001). To achieve a scalable implementation, we solve the optimization problem greedily instead of resorting to exact tree search over the whole space of possible trees. Define $\bar{M}_j(S) = \sum_{i \in S} \hat{Y}_i^{(j)}$ and $\bar{M}(S) = \max_j \bar{M}_j(S)$. The detailed implementation can be found in Algorithm 1.

Algorithm 1 GreedyTreeSearch-HTE
Input: maximal tree depth $m$, dataset $D$.
Output: a list of segments: segments.
1. Set depth := 0 and segments := {D}.
2. Add segments to segments using greedy search:
while depth ≤ m do
    Set nodes := {}
    for S ∈ segments do
        Split S into $S^*_l$ and $S^*_r$ that maximize $\bar{M}(S_l) + \bar{M}(S_r)$.
        If such $S^*_l$ and $S^*_r$ exist, we add them to nodes; otherwise we add S to nodes.
    end
    Let segments := nodes and depth := depth + 1.
end

This approach is similar to Zhou et al. (2018) and Amram et al. (2020), which respectively solve an exact tree search problem over predicted outcomes and find the optimal tree using coordinate ascent. Due to scalability concerns and the need for faster test-time inference, we take a greedy approach. On a sample of 100k data points with 12 features and three treatment groups, the Zhou et al. (2018) method took 2.5 hours to run. With a quadratic runtime in the number of data points, and the amount of data at large internet companies sometimes exceeding 100 million data points, it is impossible to run it at a scale suitable for internet companies.

Distilling a policy derived from HTE model predictions (Distill-Policy). In this approach, we start from the naive policy $\Pi_{HTE}$ implied by the outcome predictions $\hat{Y}$, i.e. assigning each individual to the treatment group with the highest predicted outcome. We then train a decision tree classifier on the dataset $(X_i, \Pi_{HTE}(X_i))_{i=1}^{N}$ and output the segments from the decision tree classifier. The resulting policy assigns all individuals in the same segment to the majority treatment group of individuals in that segment.
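As a concrete illustration of Distill-Policy, the sketch below (our own hypothetical example; the predicted-outcome matrix `y_hat`, the tree depth, and the leaf-size setting are assumptions) derives the naive argmax policy from an HTE model's predicted potential outcomes and distills it into a shallow decision tree classifier whose leaves are the segments.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

def distill_policy(X, y_hat, max_depth=2):
    """X: (N, P) features; y_hat: (N, K+1) predicted (scalarized) potential
    outcomes, one column per treatment group (column 0 = control)."""
    naive_policy = np.argmax(y_hat, axis=1)          # teacher-implied group per individual
    student = DecisionTreeClassifier(max_depth=max_depth, min_samples_leaf=100)
    student.fit(X, naive_policy)                     # imitate the naive policy
    return student

# Usage: the fitted tree is itself the interpretable policy.
# student = distill_policy(X, y_hat)
# print(export_text(student, feature_names=list(feature_names)))
# assignments = student.predict(X_new)   # majority group of each leaf segment
```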
The non-HTE-model-based methods proposed in this section are useful when we are not able to train an accurate HTE model. Here we aim to generate an interpretable policy without training an HTE model, by leveraging average treatment effects. Concretely, we define the segment average treatment effect for segment $S$ and treatment $j$ as

$$A_j(S) = \frac{1}{|S \cap G_j|} \sum_{i \in S \cap G_j} Y_i - \frac{1}{|S \cap G_0|} \sum_{i \in S \cap G_0} Y_i, \quad (3)$$

where $G_j$ denotes the set of individuals in the $j$-th treatment group. We then define a splitting metric $A(S) = \max_{j=0,1,\dots,K} A_j(S)$, which considers only the most responsive treatment in that segment. We build a tree and consider binary splits that improve this splitting metric. Algorithm 3 and Algorithm 4 describe two different implementations.

In the greedy implementation, a split is only considered if both the left and right child segments improve over the parent segment. In the iterative implementation, we run several iterations of splitting; in each iteration we keep the best segment and exclude it in the next run. A split is considered as long as one of the child segments improves the outcome compared to the parent node. One pitfall of this iterative implementation is that segments are not necessarily disjoint in feature space, so one individual could appear in several segments. We resolve this by always assigning the individual to the first found segment. For both methods, if no segments are found because no eligible split exists, the policy defaults to assigning all individuals to the treatment group with the highest average treatment effect. The resulting policy is $\Pi^*(X_i) = \arg\max_j A_j(S)$ if $i \in S$ for some $S \in$ segments; if $i \notin S$ for all $S \in$ segments, we assign the individual to the control or default treatment.

These non-model-based methods can be viewed as attempts to segment the feature space using empirical conditional average treatment effects; similar efforts have appeared in the literature (e.g. Dwivedi et al. (2020)).
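The sketch below illustrates one greedy splitting step on empirical segment treatment effects, in the spirit of Algorithm 3. The min-over-children split score follows the Algorithm 3 fragment given in the appendix; the definition of $A_j(S)$ as a difference of within-segment sample means and the minimum-cell-size guard are our reconstruction from the description above, not the authors' code.

```python
import numpy as np

def segment_ate(y, w, idx, k, min_cell=30):
    """A_k(S): mean outcome of group k minus mean outcome of control within segment idx (bool mask)."""
    treated = idx & (w == k)
    control = idx & (w == 0)
    if treated.sum() < min_cell or control.sum() < min_cell:
        return -np.inf                      # not enough overlap to estimate the effect
    return y[treated].mean() - y[control].mean()

def splitting_metric(y, w, idx, n_groups):
    """A(S) = max_j A_j(S): the most responsive treatment in segment S."""
    return max(segment_ate(y, w, idx, k) for k in range(n_groups))

def best_split(X, y, w, idx, n_groups):
    """One greedy step: the split maximizing min(A(S_l), A(S_r)),
    accepted only if both children improve on the parent segment."""
    parent = splitting_metric(y, w, idx, n_groups)
    best = None
    for p in range(X.shape[1]):
        for t in np.unique(X[idx, p]):
            left, right = idx & (X[:, p] <= t), idx & (X[:, p] > t)
            a_l = splitting_metric(y, w, left, n_groups)
            a_r = splitting_metric(y, w, right, n_groups)
            if a_l > parent and a_r > parent:         # greedy eligibility rule
                score = min(a_l, a_r)
                if best is None or score > best[0]:
                    best = (score, p, t)
    return best  # None means no eligible split; fall back to one treatment for all
```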
We can already generate interpretable tree-based policies using the methods described above. However, different policies may exhibit different strengths in different regions, and simply training these trees with greater depth does not necessarily improve the resulting policy in our experiments. We leverage ensemble learning to identify such regions, with the hope of generating a better policy. Concretely, suppose we have access to $Q$ policies $\Pi_1, \Pi_2, \dots, \Pi_Q$. We want to train an ensemble policy $\hat{\Pi}$ that uses all or a subset of the policies $\Pi_1, \dots, \Pi_Q$ while still remaining interpretable. We introduce two ways of doing so. For ease of notation, we assume the trained policies $\Pi_1, \dots, \Pi_Q$ were obtained from another split of the dataset, so we can safely use $D$ as the validation set on which we learn the ensemble policy.

Ensemble based on Uniform Exploration (GUIDE-UniformExplore): This method is inspired by the explore-exploit paradigm in the contextual bandits literature (see Slivkins (2021) for a review). To perform this offline, we use HTE outcome predictions when the observed outcome is not revealed in the dataset. Algorithm 2 provides the implementation. The ensemble policy $\hat{\Pi}$ is generated using Algorithm 1, but with the HTE predictions $\hat{Y}$ substituted for outcomes that are not observed.

Algorithm 2 Ensemble based on Uniform Exploration (GUIDE-UniformExplore)
Output: a list of segments: segments.
1. Generate a dataset with randomly selected policies applied to individuals:
for i = 1 to N do
    Randomly select a policy from $\Pi_1, \dots, \Pi_Q$; let $A_i \in \{1, \dots, Q\}$ be the index of this policy.
    Apply this policy to individual $i$; assume the policy assigns treatment group $k \in \{0, 1, \dots, K\}$.

Ensemble using Offline Evaluation (GUIDE-OPE): We aim to find one feature split such that the left and right children use different candidate policies. We do so using offline policy evaluation (OPE; see Algorithm 5). Suppose we split dataset $D$ into $D_l$ and $D_r$, and let $\hat{\Pi}$ be the ensemble policy that applies $\Pi_k$ to $D_l$ and $\Pi_q$ to $D_r$. To create $\hat{\Pi}$, we use exhaustive search to find the optimal feature split and candidate policies $(D^*_l, D^*_r, \Pi^*_k, \Pi^*_q)$ that solve

$$\max_{D_l, D_r} \; \max_{1 \le k \ne q \le Q} \; \text{OPE}(D, \hat{\Pi}).$$

We assign individual $i$ the policy $\Pi^*_k$ if $i \in D^*_l$ and $\Pi^*_q$ otherwise.

Both methods return trees that we call guidance trees; see Figure 3 for an example. Figure 3: Example guidance tree learned using the GUIDE-UniformExplore or GUIDE-OPE methods, to ensemble multiple policies while still remaining interpretable. While there exist many other ways to ensemble policies, such as SuperLearner (Montoya et al., 2021), we do not consider these methods as they result in non-interpretable policies.

We compare the Distill-HTE method proposed in Section 3.3 against several other subgroup-finding methods that take the predictions of HTE models as input: (1) Virtual Twins (VT) (Foster et al., 2011); (2) R2P (Lee et al., 2020). We deliberately restrict to such post-hoc methods, due to the shared motivation of explaining already-trained HTE models. We also compare to a black-box model: a T-Learner that does not find subgroups but rather provides one prediction per individual.

Setup: The first dataset, Synthetic COVID, was proposed by Lee et al. (2020) and uses patient features as in an initial clinical trial for the Remdesivir drug, but generates synthetic outcomes where the drug reduces the time to improvement for patients with a shorter period of time between symptom onset and starting the trial. The second dataset, Synthetic A, was proposed in Athey and Imbens (2016). For each dataset we generate ten train-test splits, on which we compute the mean and standard deviation of estimates. For R2P, we use the implementation provided by the authors; for VT we use our own implementation.

Evaluation: On synthetic data where ground-truth treatment effects are available, we report the Precision in Estimation of Heterogeneous Effect (PEHE) (Hill, 2011), defined as

$$\text{PEHE} = \frac{1}{N} \sum_{i=1}^{N} \left( \left( \hat{Y}_i^{(1)} - \hat{Y}_i^{(0)} \right) - \left( Y_i^{(1)} - Y_i^{(0)} \right) \right)^2.$$

Unlike bias and RMSE metrics computed relative to ground-truth treatment effects, PEHE requires accurate estimation of both counterfactual and factual outcomes (Johansson et al., 2016). We also compute between- and within-subgroup variance.
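A minimal NumPy sketch of the PEHE computation as defined above (array names are illustrative):

```python
import numpy as np

def pehe(y1_hat, y0_hat, y1_true, y0_true):
    """Precision in Estimation of Heterogeneous Effect: mean squared error
    between predicted and true individual treatment effects."""
    tau_hat = np.asarray(y1_hat) - np.asarray(y0_hat)
    tau_true = np.asarray(y1_true) - np.asarray(y0_true)
    return float(np.mean((tau_hat - tau_true) ** 2))
```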
Results: Table 1 presents the results. We make a few observations. (1) Black-box vs. white-box: As expected, GBDT T-Learners perform well as they do not have interpretability constraints, unlike all the other methods (VT, Distill-HTE, R2P), which modify standard decision trees while still remaining visualizable as a tree. Yet Distill-HTE tends to be far more accurate than the other decision tree approaches in terms of PEHE. (2) Optimization criterion: R2P, the only method presented here that considers not only homogeneity within subgroups but also heterogeneity between subgroups, has the lowest between-subgroup variance. Other methods that do not try to reduce heterogeneity between subgroups do not fare as well on this metric. On the other hand, they fare better than R2P at minimizing within-subgroup variance, because, unlike R2P, they do not consider the tradeoff between minimizing within- and between-subgroup variance. However, R2P achieves this at the expense of PEHE. (3) HTE model class: The choice of HTE model matters, with Virtual Twins not performing as well when using an RF HTE model compared to a GBDT HTE model. Similarly, GBDT T-Learners perform better than DT T-Learners. (4) The impact of distillation: While the T-Learner DT model did not perform well, being worst in terms of PEHE on all datasets, Virtual Twins RF, Virtual Twins GBDT and Distill-HTE, which train modified decision trees, show a marked improvement over T-Learner DT. This suggests that distilling a complex GBDT or RF teacher rather than learning a tree directly is beneficial, which agrees with the existing distillation literature.

Table 2: Regret (lower is better) of policy-generation methods on synthetic and semi-synthetic datasets. Methods colored in red do not personalize. Methods colored in black are black-box policies. Methods colored in blue are interpretable policies. Methods colored in green are ensembles that still remain interpretable. The best method's regret is in bold. If the best policy is no personalization (assigning all individuals to one treatment group), the next best policy is also bolded. If the best policy is an ensemble policy, its constituent policies are also bolded. (* denotes that no segments were found and the resulting policy assigned all individuals to one treatment group.)

We compare the methods proposed in Section 3.4 to (1) a black-box policy: training a T-Learner HTE model, then assigning each individual to the treatment group with the best predicted treatment effect; and (2) a policy that chooses the treatment for each unit uniformly at random.

Setup: Besides the synthetic datasets described in Section 4.1, we use other publicly available datasets. The IHDP dataset (Hill, 2011) studied the impact of specialized home visits on infant cognition using mother and child features. Email Marketing (Hillstrom, 2008) is a multiple-outcome dataset with 64k points and visits, conversions, and money-spent outcomes (details on how we constructed potential outcomes are in the Appendix). We combine multiple outcomes as explained in Section 3.2. The ensemble policies (GUIDE-UniformExplore, GUIDE-OPE) are based on GreedyTreeSearch-HTE and Distill-Policy, selected because of their individual performance.

Evaluation: To evaluate the different policies $\Pi$, we define a notion of regret:

$$\text{Regret}(\Pi) = \frac{1}{N} \sum_{i=1}^{N} \left( \max_{k} Y_i^{(k)} - Y_i^{(\Pi(X_i))} \right),$$

where $\Pi(X_i)$ is the treatment group prescribed by that particular policy for individual $i$ and $Y_i^{(k)}$ is individual $i$'s potential outcome under treatment $k$.
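A short NumPy sketch of this regret computation, assuming a matrix of (known or imputed) potential outcomes with one column per treatment group:

```python
import numpy as np

def regret(policy_assignments, potential_outcomes):
    """policy_assignments: (N,) treatment group chosen by the policy for each individual.
    potential_outcomes: (N, K+1) outcomes under every treatment group (column 0 = control)."""
    po = np.asarray(potential_outcomes, dtype=float)
    chosen = po[np.arange(len(po)), np.asarray(policy_assignments)]
    return float(np.mean(po.max(axis=1) - chosen))
```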
Results: Table 2 presents the results. We make a few observations. (1) In many cases, at least one interpretable policy (GreedyTreeSearch-HTE, Distill-Policy, No-HTE, or an ensembled policy) improves over or is not far behind the black-box policy. (2) Personalization: In datasets like IHDP where the benefit from personalization is limited, as the majority of points have positive treatment effects, No-HTE-Greedy and No-HTE-Iterative were able to pick up on this and assign all points to one treatment. (3) The impact of ensembling: In general, the ensembled policies GUIDE-UniformExplore and GUIDE-OPE improved over their constituent policies GreedyTreeSearch-HTE and Distill-Policy. However, ensemble methods with deeper guidance trees are not necessarily better. Hence, the tradeoff between one more level of guidance and reduced interpretability should be considered.

We now display several of the interpretable policies and show how they can be ensembled while remaining interpretable. On Synthetic A, a simple dataset with only two features, the following policies were obtained:

GreedyTreeSearch-HTE: If feature0 < -0.019, assign control. Otherwise, assign treatment.
Distill-Policy: If feature0 <= -0.02, assign control. Otherwise, assign treatment.

We applied the GUIDE-UniformExplore and GUIDE-OPE approaches to learn guidance trees: trees that ensemble these two policies while still remaining interpretable. While the GUIDE-OPE guidance tree suggested simply following Distill-Policy, the GUIDE-UniformExplore guidance tree suggested:

GUIDE-UniformExplore: If feature1 < -0.426, follow Distill-Policy. Otherwise, follow GreedyTreeSearch-HTE.

By leveraging feature1, which had so far not been used in the individual policies, GUIDE-UniformExplore returned the lowest regret of all methods, including a regret lower than that of the individual policies it ensembled, GreedyTreeSearch-HTE and Distill-Policy (Table 2). Interestingly, even when the individual policies GreedyTreeSearch-HTE and Distill-Policy were grown to greater depth, feature1 was still not selected. While we do not always see gains from ensembling, in some datasets guidance trees can correct for the greediness of tree learning. While typical ways of ensembling trees (e.g. random forests) reduce interpretability, a depth-constrained guidance tree added on top of an interpretable policy tree merely makes the policy tree a bit deeper.

We now present the explanations and resulting interpretable policies on the Synthetic COVID data.

Interpreting the HTE model: Figure 4 presents segments found by Distill-HTE on the synthetic COVID data. The segment with the most negative predicted treatment effect (first segment; red color), at -0.0304 ± 0.0010, covers individuals who started taking the drug between 4.5 and 10.5 days after the onset of symptoms and had Aspartate Aminotransferase and Lactate Dehydrogenase levels within normal ranges (MedicineNet, 2021), suggesting that they were not extremely sick. It is therefore not surprising that they were not predicted to benefit as much from the drug. On the other hand, the segment with the most positive predicted treatment effect (last segment; green color) covers individuals who started taking the drug soon (<= 4.5 days) after exhibiting COVID symptoms. These individuals were predicted to benefit the most from the drug, with treatment effect 0.0187 ± 0.0006. This agrees with the finding that the Remdesivir drug results in a faster time to clinical improvement for patients with a shorter time from symptom onset to starting the trial (Wang et al., 2020).

Policy generation: A simple policy can be derived from the Distill-HTE segments, with all red segments assigned to not receiving the drug (as they are predicted not to benefit from it) and all green segments assigned to receiving the drug. This policy is different from the interpretable policies learned directly (Table 2). For example, the GreedyTreeSearch-HTE policy is extremely simple: if days from symptom onset to starting the trial < 11, assign the drug; otherwise, do not assign the drug.
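Such rule-based policies deploy as plain if-else statements, with no HTE model needed at serving time. A minimal illustration of the GreedyTreeSearch-HTE policy just described (the function and argument names are ours):

```python
def covid_policy(days_from_symptom_onset_to_trial: float) -> int:
    """Return 1 to assign the drug, 0 otherwise, following the learned rule."""
    return 1 if days_from_symptom_onset_to_trial < 11 else 0
```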
Another example is No-HTE-Greedy and No-HTE-Iterative, whose rather stringent splitting criterion (Equation 3) did not find enough heterogeneity worth assigning different treatment groups to, and which assigned all individuals to receive the drug.

The motivation for this work was three-fold. (1) HTE models sometimes overfit on extremely large datasets, due to having a large number of data points, extremely noisy outcomes, noisy features, etc. (2) Interpretable policies can avoid much of the tech debt that ML-based policies are known to incur (Sculley et al., 2015b). (3) Scalability constraints exclude many optimal tree-based policy learning algorithms from deployment in production environments, as illustrated in Section 3.4.

The interpretable policy methods proposed in this paper are not theoretically optimal, a shortcoming we acknowledge. Nonetheless, we believe that the comparison to sensible baselines, such as policies derived from T-Learner HTE models, and the ability of some policies to achieve low regret on simple synthetic datasets, demonstrates their merit. In real, large-scale datasets where the potential benefit from personalization was higher, No-HTE-Greedy and No-HTE-Iterative suggested more personalized policies, especially when feature selection was performed beforehand. This is because the splitting criterion (Equation 3) considers all features; if feature selection is not performed, this, together with using sample average treatment effects as the splitting criterion, makes it easier to find a split that looks good on the training dataset by chance but does not hold on the test set. Further work is needed to realize the full potential of these methods.

In practice, we have seen that the choice between deriving policies from black-box HTE models and directly learning ground-up interpretable policies depends on the ability of the individuals using such policies to maintain HTE models in production, and on the need to explain how exactly personalization is happening, especially for applications where personalization based on sensitive features incurs disparate-treatment unfairness (Lal et al., 2020). While HTE models are resource- and maintenance-intensive, the ability to continuously retrain the model allows for adjustment to a dynamic user population. Conversely, interpretable policies are easy to implement and maintain, but may not perform best over the long term without policy regeneration as the features of the user population shift.

Algorithm 4 No-HTE-model policy generation: iterative implementation (No-HTE-Iterative)
Input: maximal tree depth d, number of iterations t, dataset D.
Output: segments
1. Set iterations := 0 and segments := {}.
2. Add to segments iteratively:
while iterations ≤ t do
    Run Steps 1 to 2 of Algorithm 3, but change all min to max. Denote the resulting list of segments as C.
    Add to segments the S in C that maximizes A(S) and remove the data points in S from D.
    Set iterations := iterations + 1.
end

References
Reinforcement learning interpretation methods: A survey
Optimal policy trees
Subgroup analysis and other (mis)uses of baseline data in clinical trials
Recursive partitioning for heterogeneous causal effects
Policy learning with observational data
Verifiable reinforcement learning via policy extraction
The offset tree for learning with partial labels. KDD
Gong Chen, Hua Zhong, Anton Belousov, and Viswanath Devanarayan. A PRIM approach to predictive-signature development for patient stratification
Extracting tree-structured representations of trained networks
Distilling policy distillation
Stable discovery of interpretable subgroups via calibration in causal studies
The foundations of cost-sensitive learning
Subgroup identification from randomized clinical trial data
Distilling a neural network into a soft decision tree
Personalized news recommendation with context trees
Visualizing and understanding Atari agents
A review of multi-objective optimization: Methods and its applications
Bayesian nonparametric modeling for causal inference
MineThatData e-mail analytics and data mining challenge
Distilling the knowledge in a neural network
Estimating treatment effect heterogeneity in randomized program evaluation
Estimation of heterogeneous treatment effects from randomized experiments, with application to the optimal planning of the get-out-the-vote campaign
Learning optimal prescriptive trees from observational data
Learning representations for counterfactual inference
Policy evaluation and optimization with continuous treatments
Optimal doubly robust estimation of heterogeneous causal effects
Who should be treated? Empirical welfare maximization methods for treatment choice
Metalearners for estimating heterogeneous treatment effects using machine learning
Sahin Cem Geyik, and Krishnaram Kenthapadi. Fairness-aware online personalization
Jang-Won Lee, and Mihaela van der Schaar. Robust recursive partitioning for heterogeneous treatment effects with uncertainty quantification
Bayesian optimization for policy search via online-offline experimentation
Subgroup identification for precision medicine: A comparative review of 13 methods
A prescriptive analytics framework for optimal policy deployment using heterogeneous treatment effects. Forthcoming at MIS Quarterly
Liver function tests (normal, low, and high ranges & results)
The optimal dynamic treatment rule superlearner: Considerations, performance, and application
Interpretable subgroup discovery in treatment effect estimation with application to opioid prescribing guidelines
Distilled Thompson sampling: Practical and efficient Thompson sampling via imitation learning
Sur les applications de la théorie des probabilités aux experiences agricoles: Essai des principes
Quasi-oracle estimation of heterogeneous treatment effects
Performance guarantees for individualized treatment rules
Show, attend and interact: Perceivable human-robot social interaction through neural attention Q-network. ICRA
Estimating causal effects of treatments in randomized and nonrandomized studies
Hidden technical debt in machine learning systems
Hidden technical debt in machine learning systems
Adapting neural networks for the estimation of treatment effects
Introduction to multi-armed bandits
A method for testing a prespecified subgroup in clinical trials
Counterfactual risk minimization: Learning from logged bandit feedback
Considerations when learning additive explanations for black-box models
Distill-and-compare: Auditing black-box models using transparent model distillation
Hybrid models of factorization machines with neural networks and their ensembles for click-through rate prediction
Moët: Interpretable and verifiable reinforcement learning via mixture of expert trees
Estimation and inference of heterogeneous treatment effects using random forests
Causal rule sets for identifying subgroups with enhanced treatment effect
Remdesivir in adults with severe COVID-19: a randomised, double-blind, placebo-controlled, multicentre trial
Dueling network architectures for deep reinforcement learning
Estimating individualized treatment rules using outcome weighted learning
Offline multi-action policy learning: Generalization and optimization

The Email Marketing dataset (Hillstrom, 2008) is a real dataset that came from a randomized experiment, where customers were randomized into receiving one of three treatments. To generate potential outcomes, for each individual $i$ in treatment group $k$, we searched for the 5 nearest neighbors in treatment group $k'$ and averaged their outcomes to get the potential outcome for individual $i$ under treatment group $k'$. To compute distance between individuals, we used Euclidean distance in feature space.

Unless otherwise mentioned, the HTE models we train are T-Learners consisting of GBDT base learners. In general, we use 40% of the data as the test set on which we report results, and 30% of the remaining 60% as the validation set. We learn individual policies on the training set. When learning ensemble policies, we learn the ensemble on the validation set.

Algorithm 3 (split step):
    Split S into $S^*_l$ and $S^*_r$ such that $(S^*_l, S^*_r) = \arg\max_{S_l, S_r} \min(A(S_l), A(S_r))$.
    If such $S^*_l$ and $S^*_r$ exist, add them to nodes; otherwise add S to nodes.
end
Let segments := nodes and depth := depth + 1.
end

Algorithm 5 Off-Policy Evaluation (OPE) with Inverse Propensity Scoring
Input: policy $\Pi$, dataset $D$, propensity scores $\{p_i\}_{i=1}^{N}$ for all individuals ($p_i = \frac{1}{K+1}$ for randomized data with $K+1$ treatment groups).
Output: the estimated policy value.
1. Generate the index set where the policy decision matches the observed treatment decision in $D$:
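A minimal sketch of the inverse-propensity-scoring estimator that Algorithm 5 describes, assuming logged data $(X_i, W_i, Y_i, p_i)$ from a randomized assignment; this is the standard IPS value estimate, with variable names of our own choosing rather than the authors' exact implementation.

```python
import numpy as np

def ope_ips(policy, X, w, y, p):
    """Estimate the value of `policy` from logged data via inverse propensity scoring.
    policy: callable mapping an (N, P) feature matrix to (N,) treatment assignments.
    w: (N,) logged treatments; y: (N,) observed (scalarized) outcomes;
    p: (N,) propensity of the logged treatment, e.g. 1/(K+1) under uniform randomization."""
    pi = np.asarray(policy(X))
    match = (pi == np.asarray(w)).astype(float)   # index set where the policy agrees with the log
    return float(np.mean(match * np.asarray(y) / np.asarray(p)))
```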