title: Forest Guided Smoothing
authors: Verdinelli, Isabella; Wasserman, Larry
date: 2021-03-08

We use the output of a random forest to define a family of local smoothers with spatially adaptive bandwidth matrices. The smoother inherits the flexibility of the original forest but, since it is a simple linear smoother, it is highly interpretable and can be used for tasks that would be intractable for the original forest. These tasks include bias correction, confidence intervals, assessing variable importance, and methods for exploring the structure of the forest. We illustrate the method on some synthetic examples and on data related to Covid-19.

Random forests are often an accurate method for nonparametric regression, but they are notoriously difficult to interpret. It is also difficult to construct standard errors, confidence intervals and meaningful measures of variable importance for them. In this paper, we construct a spatially adaptive local linear smoother that approximates the forest. Our approach builds on the ideas in Bloniarz et al. (2016) and Friedberg et al. (2020). The main difference is that we define a one-parameter family of bandwidth matrices, which helps with the construction of confidence intervals and measures of variable importance.

Our starting point is the well-known fact that a random forest can be regarded as a type of kernel smoother (Breiman, 2000; Scornet, 2016; Lin and Jeon, 2006; Geurts et al., 2006; Hothorn et al., 2004; Meinshausen, 2006). We take it as given that the forest is an accurate predictor, and we make no attempt to improve the method. Instead, we want to find a family of linear smoothers that approximates the forest. We then show how to use this family for interpretation, bias correction, confidence intervals, variable importance, and for exploring the structure of the forest.

Related Work. Our work builds on Bloniarz et al. (2016) and Friedberg et al. (2020). Bloniarz et al. (2016) fit a local linear regression using weights from a random forest and show that this often leads to improved prediction. Friedberg et al. (2020) go further: they modify the forest algorithm to account for the fact that a local linear fit will be used and to reduce the bias of the fit. This further improves performance and yields confidence intervals. We also use the forest weights to fit a local linear regression, but we do so by first building a family of bandwidth matrices {hH_x : h > 0, x ∈ R^d} depending on one free parameter h > 0. We use the bandwidth matrices to define a kernel from which we obtain the local linear fit. Creating the bandwidth matrices has several advantages. First, it allows us to use the generalized jackknife to correct the bias and construct confidence intervals. In contrast to Friedberg et al. (2020), this lets us use any off-the-shelf random forest; no adjustments to the forest algorithm are required. Second, the collection of bandwidth matrices can be used to create several summaries of the forest. For example, we can examine how much smoothing is done with respect to different covariates and in different parts of the covariate space. We also define the notion of a typical bandwidth matrix using the Wasserstein barycenter. Third, we can explore variable importance based on local slopes at different resolutions by varying the parameter h, giving a multiresolution measure of variable importance.
Paper Outline. In Section 2 we define the forest guided smoother. In Section 3 we discuss the construction of confidence intervals. In Section 4 we present methods for exploring the structure of the forest. Examples are presented in Section 5. Section 6 contains concluding remarks.

Let (X_1, Y_1), ..., (X_n, Y_n) ∼ P where Y_i ∈ R and X_i ∈ R^d. We assume that d < n and that d is fixed. Let µ(x) = E[Y | X = x] denote the regression function. Recall that the random forest estimator μ̂_RF(x) is
$$\hat\mu_{RF}(x) = \frac{1}{B}\sum_{j=1}^B \hat\mu_j(x),$$
where each μ̂_j is a tree estimator built from a random subsample of the data and a random subsample of features, and B is the number of subsamples. We take, as a starting point, the assumption that μ̂_RF is a good estimator. Our goal is not to improve the random forest or provide explanations for its success. Rather, we construct an estimator that provides a tractable approximation to the forest, which can then be used for other tasks.

As noted by Hothorn et al. (2004) and Meinshausen (2006), the random forest estimator μ̂_RF(x) can be re-written as
$$\hat\mu_{RF}(x) = \sum_{i=1}^n w_i(x)\, Y_i,$$
where the weights w_i(x) are nonnegative and sum to one. As these authors note, these weights behave like a spatially adaptive kernel.

We proceed as follows. As in Friedberg et al. (2020) we split the data into two groups D_1 and D_2. For simplicity, assume each has size n. We construct a random forest μ̂_RF from D_1. Now we define the bandwidth matrix
$$H_x = \sum_i w_i(x)\,(X_i - x)(X_i - x)^T, \qquad (1)$$
where the sum is over D_1. Let K be a spherically symmetric kernel and define
$$K(u; H_x) = |H_x|^{-1/2}\, K\big(H_x^{-1/2} u\big).$$
This yields a kernel centered at x whose scale matches the scale of the forest weights. We then define the one-parameter family of bandwidth matrices Ξ = {hH_x : h > 0, x ∈ R^d}. We define the forest guided local linear smoother, or FGS, to be the local linear smoother
$$\hat\mu_h(x) = e_1^T\big(\mathbb{X}_x^T W_x \mathbb{X}_x\big)^{-1}\mathbb{X}_x^T W_x Y,$$
where W_x is the diagonal matrix with W_x(i,i) = K(X_i − x; hH_x), the matrix 𝕏_x has rows (1, (X_i − x)^T), and e_1 = (1, 0, ..., 0)^T. When h = 1, which can be regarded as a default value, we write μ̂_h(x) simply as μ̂(x). Although we focus on local linear regression, one can also use this construction for kernel regression or higher order local polynomial regression. We shall see that μ̂(x) is often a good approximation to μ̂_RF(x).

Remark: Other approaches for choosing H_x are possible. For example, one could minimize the difference between K(x − X_i; H_x) and w_i(x) over all positive definite matrices H_x. However, (1) is simple and in our experience works quite well. In high dimensional cases, H_x would require regularization, but we do not pursue the high dimensional case in this paper.

Figure 1 shows a one-dimensional example. The top left panel shows the data, the random forest estimator μ̂_RF in black, and the true function in red. The forest guided smoother μ̂(x) is the black line in the top right panel. The bottom left panel shows the weights w_1(x), ..., w_n(x) at x = 0, and the bottom right panel shows our kernel approximation to the weights. We see that the FGS approximates the forest, and the kernel approximates the weights, very well.

For obtaining standard errors and confidence intervals, we will also need to estimate the variance σ²(x) = Var(Y | X = x). We proceed as follows. Let r_i = Y_i − μ̂_RF(X_i) be the residuals from the forest. We regress the r_i²'s on the X_i's, using another random forest, to estimate σ²(x). We find that this approach tends to under-estimate σ²(x) in some cases, so we replace σ̂(x) with c σ̂(x), using c = 1.5 as a default in our examples.
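To make the construction above concrete, here is a minimal sketch in Python, assuming a scikit-learn style forest. The leaf-co-membership weights below are the standard forest weights of Lin and Jeon (2006) and Meinshausen (2006); the paper does not spell out its exact weight computation, and the helper names (`forest_weights`, `bandwidth_matrix`, `fgs_fit`) are ours, not from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def forest_weights(forest, X_train, x):
    """Forest weights w_i(x): average leaf co-membership across trees."""
    leaves_train = forest.apply(X_train)          # (n, n_trees) leaf indices
    leaves_x = forest.apply(x.reshape(1, -1))[0]  # (n_trees,) leaf of the query point
    same = (leaves_train == leaves_x)             # (n, n_trees) co-membership indicators
    # weight 1/|leaf| within each tree, then average over trees
    return (same / same.sum(axis=0, keepdims=True)).mean(axis=1)

def bandwidth_matrix(w, X_train, x):
    """H_x = sum_i w_i(x) (X_i - x)(X_i - x)^T, as in equation (1)."""
    U = X_train - x
    return (w[:, None] * U).T @ U

def fgs_fit(X_train, y_train, x, H_x, h=1.0):
    """Local linear fit at x with a Gaussian kernel and bandwidth matrix h * H_x.
    Returns (mu_hat_h(x), vector of local slopes)."""
    U = X_train - x
    A = np.linalg.inv(h * H_x)
    w = np.exp(-0.5 * np.einsum('ij,jk,ik->i', U, A, U))  # kernel weights (unnormalized)
    D = np.column_stack([np.ones(len(U)), U])              # design rows (1, (X_i - x)^T)
    coef = np.linalg.solve(D.T @ (w[:, None] * D), D.T @ (w * y_train))
    return coef[0], coef[1:]
```

The kernel weights are left unnormalized because the factor |hH_x|^{-1/2} cancels in the weighted least squares fit; if H_x is close to singular, a small ridge term could be added before inverting.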
In this section we construct estimators of the bias of μ̂(x) and then obtain confidence intervals for µ(x). This is difficult to do directly from the forest without delicate modifications of the forest algorithm to undersmooth, as in Friedberg et al. (2020). But bias estimation using standard methods is possible with the FGS.

We start by recalling some basic properties of local linear smoothers. Let μ̂ be the local linear smoother based on bandwidth matrices H_x ≡ H_{n,x}. Let f(x) be the density of X, define µ_2(K) by ∫ u u^T K(u) du = µ_2(K) I, and let R(K) = ∫ K²(u) du. Let Hess be the Hessian of µ. Ruppert and Wand (1994) consider the following assumptions.

(A1) K is compactly supported and bounded. All odd moments of K vanish.
(A2) σ²(x) is continuous at x and f is continuously differentiable. Also, the second order derivatives of µ are continuous. Further, f(x) > 0 and σ²(x) > 0.
(A3) H_{n,x} is symmetric and positive definite. As n → ∞ we have n^{-1}|H_{n,x}|^{-1/2} → 0 and H_{n,x}(i,j) → 0 for every i and j.
(A4) There exists c_λ such that λ_max(H_{n,x})/λ_min(H_{n,x}) ≤ c_λ for all n, where λ_max and λ_min denote the maximum and minimum eigenvalues.

Under these conditions, Ruppert and Wand (1994) showed that the bias B(x, H_x) and the variance V(x, H_x) satisfy
$$B(x, H_x) = \tfrac{1}{2}\mu_2(K)\,\mathrm{tr}\big(H_x\,\mathrm{Hess}(x)\big) + o_P(\mathrm{tr}(H_x)), \qquad
V(x, H_x) = \frac{R(K)\,\sigma^2(x)}{n\,|H_x|^{1/2} f(x)}\,(1 + o_P(1)).$$
It follows that the bias using bandwidth hH_x satisfies B(x, hH_x) = c_n(x)\,h\,(1 + o_P(1)) for some c_n(x).

Assumptions (A3) and (A4) capture the idea that the bandwidth matrix needs to shrink towards 0 in some sense. Assumption (A4) essentially says that H_{n,x} behaves like a scalar tending to 0 times a fixed positive definite matrix. For our results, we make this more explicit and slightly strengthen (A4) to:

(A4') There exists a sequence φ_n → 0 and a positive definite symmetric matrix C_x such that H_{n,x} = φ_n C_x (1 + o(1)).

To construct the bias correction we need to add the following stronger smoothness condition.

(A5) For some t, the t-th order derivatives of µ are continuous and there exist functions c_1(x), ..., c_{t-1}(x) such that the bias admits the expansion B(x, hH_x) = Σ_{s=1}^{t-1} c_s(x) h^s + O(h^t).

We estimate the bias by the generalized jackknife, following Ruppert (1997). In more detail, Ruppert's method works as follows. Choose a set of b bandwidths h_1, h_2, ..., h_b and let m̂ = (μ̂_{h_1}(x), ..., μ̂_{h_b}(x))^T. Under (A5), E[μ̂_{h_j}(x)] ≈ µ(x) + Σ_{s=1}^{t-1} κ_s h_j^s, where κ = (κ_1, ..., κ_{t-1}) collects the unknown bias coefficients at x. We estimate κ by least squares, namely,
$$(\hat a, \hat\kappa) = \operatorname*{argmin}_{a,\,\kappa}\ \sum_{j=1}^b \Big(\hat\mu_{h_j}(x) - a - \sum_{s=1}^{t-1}\kappa_s h_j^s\Big)^2. \qquad (2)$$
We then estimate the bias of μ̂_h(x) by
$$\hat B_h(x) = \sum_{s=1}^{t-1}\hat\kappa_s h^s,$$
and the estimated variance s²(x) is the usual plug-in variance of the linear smoother μ̂_h(x), computed with σ̂²(x) in place of σ²(x).

Ruppert used this bias estimation method as part of a bandwidth selection procedure. We are interested, instead, in obtaining a centered central limit theorem. We now confirm that this indeed works. For the theory, we need to be more specific about the choice of bandwidths in the bias correction procedure. Specifically, let h_j = α_j n^{-γ}, for j = 1, 2, ..., b, with 0 < α_1 < ... < α_b being constants not depending on n.

Theorem 1. Assume that, conditional on D_1, assumptions (A1)-(A5) hold with φ_n ≍ n^{-a} for some a > 0, and that: γ > −a; a < 1/(d − 2t) if t < d/2; Y is bounded; and b > t + 1. Then
$$\frac{\hat\mu_h(x) - \hat B_h(x) - \mu(x)}{s(x)} \rightsquigarrow N(0, 1).$$
Hence, μ̂_h(x) − B̂_h(x) ± z_{α/2} s(x) is an asymptotic 1 − α confidence interval for µ(x). The proof is in the appendix.

It is important to note that γ can be 0 or even negative. Ruppert (1997) requires γ > 0. The difference is that our bandwidth has the form hH_x where H_x is already tending to 0, so we only need the product to go to 0. This significantly simplifies the choice of the grid of bandwidths because the bandwidths can be of constant order and do not need to change with n. For example, one could use a grid like (1/8, 1/4, 1/2, 1, 2, 4, 8). We recommend including h = 1 in the grid as this corresponds to the original FGS.
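The sketch below shows one way the generalized jackknife correction just described could be implemented, reusing the hypothetical `fgs_fit` helper from the earlier sketch. The parameterization of the bias model (powers h, ..., h^{t-1}) and the plug-in sandwich standard error reflect our reading of the text; they are not code from the paper.

```python
import numpy as np
from scipy import stats

def bias_corrected_ci(X_train, y_train, x, H_x, sigma2_x,
                      h=1.0, grid=(1/8, 1/4, 1/2, 1, 2, 4, 8), t=2, alpha=0.10):
    """Generalized-jackknife confidence interval for mu(x) based on the FGS.
    Fits mu_hat_{h_j}(x) ~ a + sum_s kappa_s h_j^s over the bandwidth grid,
    then subtracts the estimated bias at the working bandwidth h."""
    mu_hat = np.array([fgs_fit(X_train, y_train, x, H_x, h=hj)[0] for hj in grid])
    # design: intercept plus powers h, ..., h^{t-1}
    G = np.column_stack([np.ones(len(grid))] +
                        [np.asarray(grid) ** s for s in range(1, t)])
    coef, *_ = np.linalg.lstsq(G, mu_hat, rcond=None)
    bias_hat = sum(coef[s] * h ** s for s in range(1, t))
    # plug-in variance of the linear smoother at bandwidth h
    U = X_train - x
    A = np.linalg.inv(h * H_x)
    w = np.exp(-0.5 * np.einsum('ij,jk,ik->i', U, A, U))
    D = np.column_stack([np.ones(len(U)), U])
    Minv = np.linalg.inv(D.T @ (w[:, None] * D))
    ell = (Minv @ (D * w[:, None]).T)[0]      # smoother weights for mu_hat_h(x)
    se = np.sqrt(sigma2_x * np.sum(ell ** 2))
    z = stats.norm.ppf(1 - alpha / 2)
    est = fgs_fit(X_train, y_train, x, H_x, h=h)[0] - bias_hat
    return est - z * se, est + z * se
```

With the default t = 2 the bias model has a single linear term in h, matching the default used in the examples below.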
A commonly used alternative to confidence intervals for nonparametric regression is to form some sort of interval around the estimate that informally represents uncertainty, but without the coverage claim of a confidence interval. We refer to these as variability intervals. The simplest approach is to use μ̂(x) ± c_α s(x), where s²(x) is the estimated variance of μ̂(x). If μ̂(x) satisfies a central limit theorem and c_α = z_{α/2}, then this is an asymptotic confidence interval for the mean of μ̂(x) rather than for µ(x). In our case, such a variability interval is simply μ̂_h(x) ± z_{α/2} s(x); a grid over the parameter h is not needed since we do not use the generalized jackknife to reduce the bias.

However, it might be useful to construct multiresolution variability intervals at various resolutions h. This is the approach to inference recommended by Chaudhuri and Marron (2000), who refer to it as scale-space inference. In Section 5 we illustrate this multiresolution approach for estimating the gradient as a measure of variable importance.

Variability intervals for forests have been obtained by Mentch and Hooker (2016) and by Wager et al. (2014), who estimate the variance of the forest using the jackknife. The advantage of these approaches is that they do not need sample splitting as we do. Confidence intervals were obtained by Athey et al. (2019) and Friedberg et al. (2020), who also use data splitting. The main difference is that we leave the forest algorithm untouched and use the generalized jackknife to reduce the bias. Instead, they modify the construction of the forest and require that the forest be constructed to satisfy certain assumptions; specifically, they require that the forest is built from subsamples of size s ≍ n^β where
$$\beta > \beta_{\min} = 1 - \Big(1 + \frac{d}{\pi}\,\frac{\log(\alpha^{-1})}{\log((1-\alpha)^{-1})}\Big)^{-1},$$
π/d is a lower bound on the probability of splitting on a feature, and each tree leaves a fraction of points α on each side of every split. The advantage of this approach is that it only requires the regression function to be Lipschitz, whereas the generalized jackknife assumes that µ(x) has at least t + 1 derivatives. The disadvantage is that the conditions on the construction of the forest are rather complicated and non-standard, and one cannot use an off-the-shelf forest. As noted in Friedberg et al. (2020), the tuning of forest parameters in practice can be quite different from what is assumed in the theory. Our main assumption is simply that the local smoother has standard bias and variance properties. Both approaches require assumptions, and it is difficult to say that one set of assumptions is better than the other, as they are quite incomparable: one is an assumption about the algorithm and the other is an assumption about the function and the bandwidth.

Remark: It may be the case that there are irrelevant variables. That is, we may have µ(x) = µ(x_S) for some subset of variables x_S. If the forest is able to discover the relevant variables, then the bandwidth matrix H_x might not shrink in the direction of the irrelevant variables. This is a good thing but, technically, conditions (A3)-(A5) may then be violated. However, the gradient and Hessian of µ(x) vanish in the irrelevant directions and Theorem 1 still holds.

Now we consider some examples. In each case, we use t = 2. The results using t = 3 and t = 4 are similar. The first example is of the form
$$Y_i = \mu(X_i) + \sigma\epsilon_i \qquad (5)$$
with n = 500, σ = 1 and X_i uniform on [0, 1]^5. We take h in an equally spaced grid of size 20 from h = 1 to h = 5, and we construct 90 percent confidence intervals at 10 randomly selected points. The second example is from Friedberg et al. (2020) and is Y_i = µ(X_i) + σε_i where now
$$\mu(x) = \frac{10}{1 + \exp(-10(x_1 - 0.5))} + \frac{5}{1 + \exp(-10(x_2 - 0.5))} \qquad (6)$$
with n = 500, σ = 5 and X_i uniform on [0, 1]^5. We take h in an equally spaced grid of size 20 from h = 1 to h = 30, and again construct 90 percent confidence intervals at 10 randomly selected points. For the examples in (5) and (6), the coverage at the randomly chosen points is close to the nominal value, and the interval lengths are close to those in Friedman and Roosen (1995) and Friedberg et al. (2020).

Remark. Our grids were chosen to achieve good coverage and length for these examples. In practice we suggest a grid ranging from h = 0.1 to h = 10. While this choice cannot be claimed to be optimal, and may not eliminate the bias, it should result in some amount of bias reduction. As pointed out in the discussion of Cattaneo et al. (2013), finding an optimal grid for the generalized jackknife is an unsolved problem.
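As an illustration of the setup for example (6), here is a sketch that generates the data, builds the forest on D_1, and computes one bias-corrected 90 percent interval using the hypothetical helpers from the earlier sketches. Fitting the smoother on D_2 with weights and bandwidths from D_1 is our reading of the splitting scheme, and plugging in the true σ² instead of a second variance forest is a simplification on our part.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def mu6(X):
    """Regression function of example (6)."""
    return (10 / (1 + np.exp(-10 * (X[:, 0] - 0.5)))
            + 5 / (1 + np.exp(-10 * (X[:, 1] - 0.5))))

n, d, sigma = 500, 5, 5.0
X1 = rng.uniform(size=(n, d))                    # D_1: used to build the forest
y1 = mu6(X1) + sigma * rng.normal(size=n)
X2 = rng.uniform(size=(n, d))                    # D_2: used for the local linear fits
y2 = mu6(X2) + sigma * rng.normal(size=n)

forest = RandomForestRegressor(n_estimators=500).fit(X1, y1)
grid = np.linspace(1, 30, 20)                    # bandwidth grid used for example (6)

x0 = rng.uniform(size=d)                         # one of the evaluation points
w = forest_weights(forest, X1, x0)               # hypothetical helpers from the earlier sketches
H_x = bandwidth_matrix(w, X1, x0)
sigma2_x = sigma ** 2                            # in practice: a forest fit to squared residuals
lo, hi = bias_corrected_ci(X2, y2, x0, H_x, sigma2_x, h=1.0, grid=grid, alpha=0.10)
print(f"mu(x0) = {mu6(x0[None, :])[0]:.2f}, 90% CI = [{lo:.2f}, {hi:.2f}]")
```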
In this section we show how the forest guided smoother can be used to examine properties of the forest. A random forest is a complex object and is difficult to interpret. In contrast, the FGS is completely determined by the set of bandwidth matrices Ξ = {H_x : x ∈ R^d}, which is a subset of the manifold of symmetric positive-definite matrices. We now consider a variety of methods for summarizing and exploring the set Ξ. In this section we describe the methods; examples are given in Section 5.

Here we show how to quantify the degree to which H_x varies with x. We take K to be a multivariate Gaussian, so the kernel at x is K(x, H_x), the Gaussian with mean x and covariance H_x. First we define what the kernel looks like on average over x. To do this we find the Wasserstein barycenter of these distributions. The Wasserstein barycenter comes from the theory of optimal transport; a good reference on this area is Peyré and Cuturi (2019). Recall first that the (second order) Wasserstein distance between two distributions P_1 and P_2 is
$$W_2(P_1, P_2) = \Big(\inf_J E_J \|X - Y\|^2\Big)^{1/2},$$
where X ∼ P_1, Y ∼ P_2 and the infimum is over all joint distributions J with marginals P_1 and P_2. In the special case of Normals, where P_1 = N(µ_1, Σ_1) and P_2 = N(µ_2, Σ_2), we have
$$W_2^2(P_1, P_2) = \|\mu_1 - \mu_2\|^2 + \mathrm{tr}\Big(\Sigma_1 + \Sigma_2 - 2\big(\Sigma_2^{1/2}\Sigma_1\Sigma_2^{1/2}\big)^{1/2}\Big).$$
The Wasserstein barycenter of a set of distributions Q_x indexed by x is the distribution Q that minimizes the average of W_2^2(Q, Q_x) over x. This barycenter is useful because it preserves the shape of the distributions. For example, the barycenter of N(µ_1, 1) and N(µ_2, 1) is N((µ_1 + µ_2)/2, 1). The Euclidean average is the mixture (1/2)N(µ_1, 1) + (1/2)N(µ_2, 1), which does not preserve the shape of the original densities.

In our case, we summarize the set of bandwidth matrices by finding the barycenter of the set of distributions {K(0, H_{X_i})}. The barycenter in this case can be shown to be K(0, H), where H is the unique positive definite matrix such that
$$H = \frac{1}{n}\sum_{i=1}^n \big(H^{1/2} H_{X_i} H^{1/2}\big)^{1/2}.$$
In our examples, we will compute H to see what a typical bandwidth matrix looks like. We also compute the Frechet variance
$$V = \frac{1}{n}\sum_{i=1}^n W_2^2\big(K(0, H),\, K(0, H_{X_i})\big),$$
which gives a sense of how much the bandwidth matrices vary over x. If H_x does not vary with x then V = 0.
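Here is a minimal sketch of how the barycenter matrix H and the Frechet variance V could be computed from a list of bandwidth matrices. It uses a standard fixed-point scheme for Gaussian Wasserstein barycenters (the iteration of Álvarez-Esteban et al., which is not something specified in the paper); the function names are ours.

```python
import numpy as np
from scipy.linalg import sqrtm

def wasserstein_barycenter_cov(H_list, tol=1e-8, max_iter=200):
    """Barycenter covariance H of {N(0, H_i)}; H solves
    H = (1/n) * sum_i (H^{1/2} H_i H^{1/2})^{1/2}."""
    H = np.mean(H_list, axis=0)                   # start from the Euclidean average
    for _ in range(max_iter):
        R = np.real(sqrtm(H))
        Rinv = np.linalg.inv(R)
        T = np.mean([np.real(sqrtm(R @ Hi @ R)) for Hi in H_list], axis=0)
        H_new = Rinv @ T @ T @ Rinv               # fixed-point update
        if np.max(np.abs(H_new - H)) < tol:
            return H_new
        H = H_new
    return H

def frechet_variance(H_list, H):
    """Average squared W2 distance between N(0, H) and each N(0, H_i)."""
    R = np.real(sqrtm(H))
    total = 0.0
    for Hi in H_list:
        cross = np.real(sqrtm(R @ Hi @ R))
        total += np.trace(H + Hi - 2 * cross)
    return total / len(H_list)
```

In practice `H_list` would hold the matrices H_{X_i} computed with the earlier hypothetical `bandwidth_matrix` helper, one for each sample point.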
Next we consider another way to summarize the FGS. For each X_i, we find the effective bandwidth with respect to each covariate by finding the length of the ellipsoid {u : (u − X_i)^T H_{X_i}^{-1}(u − X_i) = c²} in the direction of each coordinate axis, for any c > 0. In other words, we compute
$$\Delta_j(X_i) = \sqrt{c^2 \big/ H_{X_i}^{-1}(j, j)}.$$
In the example section we will see that plots of these quantities can be very informative.

How much prediction accuracy is lost by using the smoother instead of the forest? To answer this question we define
$$\Gamma = E\big[(Y - \hat\mu(X))^2\big] - E\big[(Y - \hat\mu_{RF}(X))^2\big].$$
We can get an estimate of Γ using the approach in Williamson et al. (2020). Split the data into four groups D_1, D_2, D_3, D_4, each of size m ≈ n/4. From D_1 get μ̂_RF and from D_2 get μ̂. Let
$$\hat\Gamma = \frac{1}{m}\sum_{i\in D_3} r_i - \frac{1}{m}\sum_{i\in D_4} s_i,$$
where r_i = (Y_i − μ̂(X_i))² for i ∈ D_3 and s_i = (Y_i − μ̂_RF(X_i))² for i ∈ D_4. Then, Williamson et al. (2020) show that √m(Γ̂ − Γ) ⇝ N(0, τ²), and a consistent estimate of τ² is τ̂² = m^{-1}(Σ_i (r_i − r̄)² + Σ_i (s_i − s̄)²). Hence, a 1 − α confidence interval for Γ is Γ̂ ± z_{α/2} τ̂ / √m. (One can repeat this by permuting the blocks and averaging if desired.)

One popular method of assessing local variable importance is to estimate the gradient of µ(x). The local linear fit provides such an estimate: the slope coefficients at x estimate the gradient at resolution h. Let β̂_h(x) = (β̂_{h,1}(x), ..., β̂_{h,d}(x)). Then
$$\hat\beta_{h,j}(x) = e_{j+1}^T\big(\mathbb{X}_x^T W_x \mathbb{X}_x\big)^{-1}\mathbb{X}_x^T W_x Y,$$
where W_x is the diagonal matrix with W_x(i, i) = K(X_i − x; hH_x) and e_{j+1} is the vector that is all 0 except for a 1 in the (j + 1)st position. The standard error of β̂_{h,j}(x) is obtained from the usual plug-in variance of this linear estimator, with σ̂²(x) in place of σ²(x). A plot of the values β̂_{h,j}(X_i) gives a global sense of the local importance of the j-th covariate. A plot of β̂_{h,j}(x) as a function of h, for a fixed x, summarizes local variable importance at various resolutions.
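Below is a sketch of how the local slopes β̂_{h,j}(x) and plug-in standard errors just described could be computed. The sandwich-style standard error and the helper name `local_slopes` are our choices, not taken from the paper.

```python
import numpy as np

def local_slopes(X_train, y_train, x, H_x, sigma2_x, h=1.0):
    """Local slopes beta_hat_{h,j}(x) of the FGS at x, with plug-in standard
    errors based on the linear-smoother form of the estimator."""
    U = X_train - x
    A = np.linalg.inv(h * H_x)
    w = np.exp(-0.5 * np.einsum('ij,jk,ik->i', U, A, U))   # Gaussian kernel weights
    D = np.column_stack([np.ones(len(U)), U])               # design rows (1, (X_i - x)^T)
    Minv = np.linalg.inv(D.T @ (w[:, None] * D))
    L = Minv @ (D * w[:, None]).T        # rows give the weights defining each coefficient
    coef = L @ y_train                   # (intercept, slopes)
    se = np.sqrt(sigma2_x * np.sum(L ** 2, axis=1))
    return coef[1:], se[1:]              # drop the intercept row

# multiresolution view of variable importance at a fixed point x0:
# for h in (0.5, 1, 2, 4):
#     print(h, local_slopes(X2, y2, x0, H_x, sigma2_x, h=h))
```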
In this section, we illustrate the methods from the previous section on two examples. The first is a synthetic example and the second is a data example.

We return to the example given in (6). Figure 4 shows the Wasserstein barycenter of the bandwidth matrices. The barycenter shows that the typical bandwidth for the first two variables is small. This makes sense, as the function in (6) only depends on x_1 and x_2. Also, the small off-diagonal entries suggest that the bandwidth matrix is typically not far from diagonal. The Frechet variance is 0.019, suggesting that the bandwidth matrix does not vary greatly across the sample space. Plots of the effective bandwidths and local slopes show the importance of x_1 and x_2; note that important variables correspond to small bandwidths but large slopes. We also show two scatterplots, of μ̂_RF(X_i) versus μ̂(X_i) and of their residuals. We do see a very slight loss in accuracy for the FGS, but the difference is small, and the two fits appear to be very similar. To formalize this, we estimate Γ as described in Section 4.2 and find that the 95 percent confidence interval is Γ̂ = −0.134 ± 4.9, again suggesting little difference between the two methods. Thus we conclude that the FGS appears to be a good approximation to the forest.

In this section we consider data on Covid-19 obtained from the API of the CMU Delphi group at covidcast.cmu.edu. Our goal is to construct a random forest to predict Y = average daily deaths from a set of county-level covariates, which include measures of social mobility and previous deaths. The variable Y is averaged over December 1, 2020 to December 12, 2020. The covariates are averaged from October 1, 2020 to December 1, 2020. We took logarithms of all variables and then scaled each covariate to have mean 0 and variance 1. The problem of predicting the epidemic is an intensely studied issue and our goal is not to develop a cutting edge prediction method. Rather, we use these data as a vehicle for illustrating our methods.

After fitting the FGS we can summarize the local fit for various counties by reporting the local slopes β̂_1(x), ..., β̂_d(x) and their standard errors. Table 1 and Table 2 report these local slopes and standard errors for selected counties. Table 4 and Figure 9 show the effective bandwidths and local slopes at resolution h = 2. The two most important variables (small bandwidths and large slopes) are x_4 (home) and x_7 (previous deaths). The importance of previous deaths is obvious. The fact that social mobility (home) is important is notable, but we should emphasize that this is a predictive analysis, not a causal analysis. We also see some correlation between the bandwidths for x_5 and x_6. The Frechet variance is 0.817, suggesting that H_x varies quite a bit with x (recall that all the variables are scaled to have variance 1).

Throughout this paper we have assumed that the number of covariates d is fixed. If d increases with n then local linear fitting will not work; instead one will need to include some sort of ridge or ℓ_1 penalty. Furthermore, when d is large, H_x will not be invertible and regularization of H_x is required. We have focused on random forests, but similar ideas can be used for other black box methods such as neural nets. Koh and Liang (2017) show how to compute the influence function for deep nets and other predictors. The influence function can be used to define a spatially adaptive kernel, as we have done using the weights from a forest. In our examples we have not found much difference between the forest and the FGS. But this may be due to the fact that we have not considered complex high dimensional problems. Understanding when a complex predictor can be approximated by a spatially varying local smoother is an interesting but challenging problem.

References
Athey et al. (2019). Generalized random forests.
Bloniarz et al. (2016). Supervised neighborhoods for distributed nonparametric regression.
Breiman (2000). Randomizing outputs to increase prediction accuracy.
Cattaneo et al. (2013). Generalized jackknife estimators of weighted average derivatives.
Chaudhuri and Marron (2000). Scale space view of curve estimation.
Friedberg et al. (2020). Local linear forests.
Friedman and Roosen (1995). An introduction to multivariate adaptive regression splines.
Geurts et al. (2006). Extremely randomized trees.
Hothorn et al. (2004). Bagging survival trees.
Koh and Liang (2017). Understanding black-box predictions via influence functions.
Lin and Jeon (2006). Random forests and adaptive nearest neighbors.
Meinshausen (2006). Quantile regression forests.
Mentch and Hooker (2016). Quantifying uncertainty in random forests via confidence intervals and hypothesis tests.
Asymptotic distributions and rates of convergence for random forests via generalized U-statistics.
Model agnostic supervised local explanations.
"Why should I trust you?" Explaining the predictions of any classifier.
Ruppert (1997). Empirical-bias bandwidths for local polynomial nonparametric regression and density estimation.
Ruppert and Wand (1994). Multivariate locally weighted least squares regression.
Scornet (2016). Random forests and kernel methods.
Wager et al. (2014). Confidence intervals for random forests: the jackknife and the infinitesimal jackknife.
Williamson et al. (2020). A unified approach for inference on algorithm-agnostic variable importance.

Appendix
Here we recall Theorem 1 and give an outline of the proof. Conditional on D_1, assume that (A1)-(A5) hold with φ_n ≍ n^{-a} and that γ > −a; further, if t < d/2 we require a < 1/(d − 2t). Also, assume that Y is bounded and that b > t + 1. Then the conclusion of Theorem 1 holds.

Proof outline. First note that the condition γ > −a ensures that n|hH_x|^{1/2} → ∞, so that the variance of each μ̂_{h_j}(x) tends to zero. Let V be the covariance matrix of m̂. Then, arguing as in the proof of Theorem 2.1 of Ruppert and Wand (1994), there exists a b × b positive definite matrix A, depending on K, α_1, ..., α_b, x, f(x) and σ²(x) but not on n, such that, up to a scalar normalization, V = A(1 + o_P(1)).