Bagging cross-validated bandwidth selection in nonparametric regression estimation with applications to large-sized samples

D. Barreiro-Ures, R. Cao, M. Francisco-Fernández

2021-05-10

Abstract. Cross-validation is a well-known and widely used bandwidth selection method in nonparametric regression estimation. However, this technique has two remarkable drawbacks: (i) the large variability of the selected bandwidths, and (ii) the inability to provide results in a reasonable time for very large sample sizes. To overcome these problems, bagging cross-validation bandwidths are analyzed in this paper. This approach consists in computing the cross-validation bandwidths for a finite number of subsamples and then rescaling the averaged smoothing parameters to the original sample size. Under a random-design regression model, asymptotic expressions up to second order for the bias and variance of the leave-one-out cross-validation bandwidth for the Nadaraya-Watson estimator are obtained. Subsequently, the asymptotic bias and variance and the limit distribution are derived for the bagged cross-validation selector. Suitable choices of the number of subsamples and the subsample size lead to an $n^{-1/2}$ rate for the convergence in distribution of the bagging cross-validation selector, outperforming the $n^{-3/10}$ rate of leave-one-out cross-validation. Several simulations and an illustration on a real dataset related to the COVID-19 pandemic show the behavior of our proposal and its better performance, in terms of statistical efficiency and computing time, when compared to leave-one-out cross-validation.

The study of a variable of interest depending on other variable(s) is a common problem that appears in many disciplines.
To deal with this issue, an appropriate regression model setting up the possible functional relationship between the variables is usually formulated. As part of this analysis, the unknown regression function, describing the general relationship between the variable of interest and the explanatory variable(s), has to be estimated. This task can be carried out using nonparametric methods that do not assume any parametric form for the regression function, providing flexible procedures and avoiding misspecification problems. Among the available nonparametric approaches, kernel-type regression estimators (Wand and Jones, 1995) are perhaps the most popular. To compute this type of estimator, the user has to select a kernel function (typically a density function) and a bandwidth or smoothing parameter that regulates the amount of smoothing to be used, which in turn determines the trade-off between the bias and the variance of the estimator. Although the choice of the kernel function is of secondary importance, the smoothing parameter plays a crucial role. In this regard, numerous contributions have been made over the last decades, providing methods to select the bandwidth. These approaches include, among others, cross-validation methods (Härdle et al., 1988) and plug-in selectors (Ruppert et al., 1995). In Köhler et al. (2014), a complete review and an extensive simulation study of different data-driven bandwidth selectors for kernel regression are presented. Due to their wide applicability and the good performance obtained in this complete comparison, in the present paper, we focus on analyzing cross-validation bandwidth selection techniques. Cross-validation is a popular method of model selection, an early discussion of which can be found in Stone (1974).
In its simplest form, cross-validation consists of splitting the dataset under study into two parts, using one part to fit one or more models, and then predicting the data in the second part with the models so built. In this way, by not using the same data to fit and validate the models, it is possible to objectively compare the predictive capacity of different models. The leave-one-out version of cross-validation, which will be of interest in the present paper, is somewhat more involved. It excludes one datum from the dataset, fits a model from the remaining observations, uses this model to predict the datum left out, and then repeats this process for all the data. The present paper studies the leave-one-out cross-validation bandwidth selection method and the application of bagging (Breiman, 1996) to this procedure. We derive some asymptotic properties of the corresponding selectors when a random-design regression model is considered and the Nadaraya-Watson kernel-type estimator is used. The Nadaraya-Watson estimator can be seen as a particular case of a wider class of nonparametric estimators, the so-called local polynomial estimators (Stone, 1977; Cleveland, 1979; Fan, 1992), when performing a local constant fit. Given a random sample of size n, bagging cross-validation consists of selecting N subsamples of size r < n, each drawn without replacement from the n observations. One then computes a cross-validation bandwidth from each of the N subsets, averages them, and then scales the average down appropriately to account for the fact that r < n. It is well known that the use of bagging can lead to substantial reductions in the variability of an estimator that is nonlinear in the observations (see Friedman and Hall, 2007), as occurs in the case of the cross-validation criterion function.
The use of bagging in conjunction with cross-validation for bandwidth selection has already been studied in the case of kernel density estimation by several authors (see, for example, Barreiro-Ures et al., 2020; Hall and Robinson, 2009). In addition to the potential improvement in statistical precision, even in the case of small sample sizes, the use of bagging (with appropriate choices of r and N) can drastically reduce computation times, especially for very large sample sizes. Note that the complexity of cross-validation is O(n^2), while the complexity of bagging cross-validation is O(N r^2). Larger reductions in computation time can additionally be achieved with the application of binning techniques in the bagging procedure. Apart from the theoretical analysis of the cross-validation bandwidth selection methods, another goal of this study is to apply the techniques studied in the present paper to a dataset related to the current COVID-19 pandemic. In particular, using a moderately large sample, provided by the Spanish Center for Coordinating Sanitary Alerts and Emergencies, consisting of the age and the time in hospital of people infected with COVID-19 in Spain, we are interested in studying the relationship between those two variables by means of the Nadaraya-Watson estimator. Apart from its purely epidemiological interest and due to the considerable size of the sample, this dataset is also useful to put into practice the techniques analyzed in the present paper. The remainder of the paper is organized as follows. In Section 2, the regression model considered, the Nadaraya-Watson regression estimator and the important problem of bandwidth selection are presented. In Section 3, the leave-one-out cross-validation bandwidth selection method is described and some asymptotic properties of the corresponding selector are provided when the Nadaraya-Watson estimator is used.
Section 4 considers the use of bagging for cross-validation in bandwidth selection for the Nadaraya-Watson estimator. Asymptotic expressions for the bias and the variance of the proposed bandwidth selector, as well as for its limit distribution, are also derived in this section. In Section 5, an algorithm is proposed to automatically choose the subsample size for the bandwidth selector studied in Section 4. The techniques proposed are empirically tested through several simulation studies in Section 6 and applied to the previously mentioned COVID-19 dataset in Section 7. Finally, concluding remarks are given in Section 8. The detailed proofs and some additional plots concerning the simulation study are included in the accompanying supplementary materials. Let X = {(X_1, Y_1), . . . , (X_n, Y_n)} be an independent and identically distributed (i.i.d.) sample of size n of the two-dimensional random variable (X, Y), drawn from the nonparametric regression model:

Y = m(X) + ε,     (1)

where m(x) = E(Y | X = x) denotes the regression function, and ε is the error term, satisfying that E(ε | X = x) = 0 and E(ε^2 | X = x) = σ^2(x). The Nadaraya-Watson estimator or local constant estimator (Nadaraya, 1964; Watson, 1964) offers a nonparametric way to estimate the unknown regression function, m. It is defined as:

m̂_h(x) = [Σ_{i=1}^n K((x − X_i)/h) Y_i] / [Σ_{i=1}^n K((x − X_i)/h)],     (2)

where h > 0 denotes the bandwidth or smoothing parameter and K the kernel function. As pointed out in the introduction, the value of the bandwidth is of great importance since it determines the amount of smoothing performed by the estimator and, therefore, heavily influences its behavior. Thus, in practice, data-driven bandwidth selection methods are needed. In the present paper, we deal with this problem, analyzing cross-validation bandwidth selection techniques when using the estimator (2) and considering the general regression model (1).
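As a concrete illustration of estimator (2), the following sketch evaluates a Nadaraya-Watson fit with a Gaussian kernel. The function name and the vectorized layout are our own choices, not taken from the paper; any symmetric density could replace the Gaussian kernel.

```python
import numpy as np

def nadaraya_watson(x, X, Y, h):
    """Nadaraya-Watson (local constant) estimate of m at the points x.

    A minimal sketch of estimator (2) with a Gaussian kernel; the kernel
    choice is illustrative.
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))
    u = (x[:, None] - X[None, :]) / h   # (x - X_i) / h for all pairs
    w = np.exp(-0.5 * u ** 2)           # Gaussian kernel weights
    return (w @ Y) / w.sum(axis=1)      # weighted local average of the Y_i
```

For instance, with a linear regression function and a uniform design, the estimate at an interior point is close to the true regression value for a moderate bandwidth.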
A possible way to select a (global) optimal bandwidth for (2) consists in minimizing a (global) optimality criterion, for instance, the mean integrated squared error (MISE), defined as:

MISE(h) = E[ ∫ (m̂_h(x) − m(x))^2 f(x) dx ],     (3)

where f denotes the marginal density function of X. The bandwidth that minimizes (3) is called the MISE bandwidth and it will be denoted by h_{n0}, that is,

h_{n0} = arg min_{h>0} MISE(h).     (4)

The MISE bandwidth depends on m and f and, since in practice both functions are often unknown, h_{n0} cannot be directly calculated. However, it can be estimated, for example, using the cross-validation method. In the following section, we present the leave-one-out cross-validation bandwidth selection criterion and provide the asymptotic properties of the corresponding selector when using the Nadaraya-Watson estimator (2) and considering the regression model (1). Cross-validation offers an optimality criterion that works as an empirical analogue of the MISE and thus allows us to estimate h_{n0}. The cross-validation function is defined as:

CV_n(h) = (1/n) Σ_{i=1}^n (Y_i − m̂_h^{(−i)}(X_i))^2,     (5)

where m̂_h^{(−i)} denotes the Nadaraya-Watson estimator computed leaving out the i-th observation, that is,

m̂_h^{(−i)}(x) = [Σ_{j≠i} K((x − X_j)/h) Y_j] / [Σ_{j≠i} K((x − X_j)/h)].     (6)

Hence, the cross-validation bandwidth, ĥ_{CV,n}, can be defined as

ĥ_{CV,n} = arg min_{h>0} CV_n(h).     (7)

In order to obtain the asymptotic properties of (7) as an estimator of (4), it is necessary to study certain moments of (5) and its derivatives. However, the fact that the Nadaraya-Watson estimator has a random denominator makes this a very difficult task. To overcome this problem, an unobservable modified version of the Nadaraya-Watson estimator, denoted here by (8), was proposed in Barbeito (2020). Expression (8) does not define an estimator, but a theoretical approximation of (2).
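Criterion (5) and selector (7) admit a direct grid-search sketch. The Gaussian kernel and the bandwidth grid are illustrative assumptions; the O(n^2) cost per bandwidth, discussed later in the paper, is visible in the pairwise weight matrix.

```python
import numpy as np

def loo_cv_bandwidth(X, Y, grid):
    """Return the bandwidth in `grid` minimizing the leave-one-out
    cross-validation criterion (5) for the Nadaraya-Watson estimator."""
    best_h, best_cv = grid[0], np.inf
    for h in grid:
        u = (X[:, None] - X[None, :]) / h
        w = np.exp(-0.5 * u ** 2)        # Gaussian kernel weights (n x n)
        np.fill_diagonal(w, 0.0)         # leave the i-th observation out
        m_loo = (w @ Y) / w.sum(axis=1)  # leave-one-out NW predictions
        cv = np.mean((Y - m_loo) ** 2)   # criterion (5)
        if cv < best_cv:
            best_h, best_cv = h, cv
    return best_h
```

Zeroing the diagonal of the weight matrix implements the leave-one-out device without refitting n separate estimators.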
Of course, (8) can be used to define a modified version of the cross-validation criterion, (9), where m̃_h^{(−i)} denotes the leave-one-out version of (8) without the i-th observation, given in (10). For the sake of simplicity and convenience, from now on, we will denote by CV_n(h) and ĥ_{CV,n} the modified version of the cross-validation criterion defined in (9) and the bandwidth that minimizes (9), respectively. Using Taylor expansions, we can obtain the approximation (11), whose second term involves M_n(h_{n0}); this expansion is used to analyze the bias of the modified cross-validation bandwidth. As for the variance of the modified cross-validation bandwidth, it suffices to calculate the first-order term of var[CV_n(h_{n0})]. It is well known that, under suitable regularity conditions and up to first order,

MISE(h) = B_1 h^4 + V_1 (nh)^{−1} + o(h^4 + (nh)^{−1}),

with R(g) = ∫ g^2(x) dx and µ_j(g) = ∫ x^j g(x) dx, j = 0, 1, . . ., provided that these integrals exist and are finite, where B_1 = (µ_2(K)^2 / 4) ∫ (m''(x) + 2 m'(x) f'(x)/f(x))^2 f(x) dx and V_1 = R(K) ∫ σ^2(x) dx. Then, the first-order term of the MISE bandwidth, h_n, has the following expression:

h_n = (V_1 / (4 B_1))^{1/5} n^{−1/5}.

However, there are no results in the literature that obtain the higher-order terms of the MISE of the Nadaraya-Watson estimator. To undertake this task, let us start by defining a suitable decomposition. Then, m̂_h(x) can be written as in (12), which splits m̂_h(x) as a sum of five terms, A, B, C, D and E, with no random denominator, plus an additional term, F, which has a random denominator. This last term is negligible with respect to the others. This approach will help us study the higher-order terms of the MISE of (2). Let us define S = A + B + C + D + E. The asymptotic bias and variance of the cross-validation bandwidth minimizing (9) are derived in this section. For this, some previous lemmas are proved. The detailed proof of these results can be found in the supplementary material. The following assumptions are needed: A1. K is a symmetric and differentiable kernel function. A2. For every j = 0, . . . , 6, the integrals µ_j(K), µ_j(K') and µ_j(K^2) exist and are finite. A3.
The functions m and f are eight times differentiable. A4. The function σ^2 is four times differentiable. Lemma 3.1 provides expressions for the first and second order terms of both the bias and the variance of S. Lemma 3.1. Under assumptions A1-A4, the bias and the variance of S = A + B + C + D + E satisfy the expansions stated in the supplementary material. Lemma 3.2, which follows from Lemma 3.1, provides expressions for the first and second order terms of both the expectation and variance of CV_n(h) under assumptions A1-A4. Finally, Theorem 3.1, which can be derived from (11), Lemma 3.1 and Lemma 3.2, provides the asymptotic bias and variance of the cross-validation bandwidth that minimizes (9). Theorem 3.1. Under assumptions A1-A4, the asymptotic bias and variance of the bandwidth that minimizes (9) are as stated therein. Corollary 3.1. Under the assumptions of Theorem 3.1, the asymptotic distribution of the bandwidth that minimizes (9) is obtained, where the constant V was defined in Theorem 3.1. While the cross-validation method is very useful to select reliable bandwidths in nonparametric regression, it also has the handicap of requiring a high computing time if the sample size is very large. This problem can be partially circumvented by using bagging (Breiman, 1996), a statistical technique belonging to the family of ensemble methods (Opitz and Maclin, 1999), in the bandwidth selection procedure. In this section, we explain how bagging may be applied in the cross-validation context. Additionally, the asymptotic properties of the corresponding selector are obtained. Apart from the obvious reductions in computing time, the bagging cross-validation selector also presents better theoretical properties than the leave-one-out cross-validation bandwidth. This will be corroborated in the numerical studies presented in Sections 6 and 7. Let a random sample of size r < n be drawn without replacement from the i.i.d. sample X defined in Section 2. This subsample is used to calculate a cross-validation bandwidth, ĥ_r.
A rescaled version of ĥ_r, h̃_r = (r/n)^{1/5} ĥ_r, can be viewed as a feasible estimator of the optimal MISE bandwidth, h_{n0}, for m̂_h. Bagging consists of repeating this resampling procedure independently N times, leading to N rescaled bandwidths, h̃_{r,1}, . . . , h̃_{r,N}. The bagging bandwidth is then defined as:

ĥ(r, N) = (1/N) Σ_{i=1}^N h̃_{r,i}.     (15)

In the case of kernel density estimation, both the asymptotic properties and the empirical behavior of this type of bandwidth selector have been studied in Hall and Robinson (2009) for N = ∞ and generalized in Barreiro-Ures et al. (2020), where the asymptotic properties of the bandwidth selector are derived for the more practical case of a finite N. In the next section, the asymptotic bias and variance of the bagging bandwidth (15) when using the Nadaraya-Watson estimator (2) and considering the regression model (1) are obtained. Moreover, its asymptotic distribution is also derived. Expressions for the asymptotic bias and the variance of (15) are given in Theorem 4.1. The following additional assumption is needed: A5. As r, n → ∞, r = o(n) and N tends to a positive constant or ∞. In the statement of Theorem 4.1, the constant V is the one defined in Theorem 3.1; in particular, the expressions simplify when assuming that r = o(n/√N). It should be noted that, while ĥ_{CV,n} − h_{n0} converges in distribution at the rate n^{−3/10}, this result can be improved with the use of bagging, letting r and N tend to infinity at adequate rates. For example, if both r and N tended to infinity at the rate √n, then ĥ(r, N) − h_{n0} would converge in distribution at the rate n^{−1/2}, which is indeed a faster rate of convergence than n^{−3/10}. In practice, an important step of our approach is, for fixed values of n and N, choosing the optimal subsample size, r_0. A possible optimality criterion could be to consider the value of r that minimizes the main term of the variance of ĥ(r, N).
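The subsample-rescale-average recipe leading to (15) can be sketched as follows. The helper reimplements a grid-search leave-one-out CV selector; all names, the Gaussian kernel and the grid are illustrative assumptions.

```python
import numpy as np

def loo_cv_bandwidth(X, Y, grid):
    """Grid-search leave-one-out CV bandwidth (illustrative helper)."""
    best_h, best_cv = grid[0], np.inf
    for h in grid:
        u = (X[:, None] - X[None, :]) / h
        w = np.exp(-0.5 * u ** 2)
        np.fill_diagonal(w, 0.0)
        cv = np.mean((Y - (w @ Y) / w.sum(axis=1)) ** 2)
        if cv < best_cv:
            best_h, best_cv = h, cv
    return best_h

def bagged_cv_bandwidth(X, Y, r, N, grid, seed=0):
    """Bagging bandwidth (15): average of N rescaled subsample bandwidths."""
    rng = np.random.default_rng(seed)
    n = len(X)
    rescaled = []
    for _ in range(N):
        idx = rng.choice(n, size=r, replace=False)    # subsample w/o replacement
        h_r = loo_cv_bandwidth(X[idx], Y[idx], grid)  # CV bandwidth on subsample
        rescaled.append((r / n) ** 0.2 * h_r)         # h~_r = (r/n)^{1/5} h_r
    return float(np.mean(rescaled))
```

Each subsample costs O(r^2) kernel evaluations per bandwidth, so the overall O(N r^2) complexity mentioned in the introduction is apparent.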
In this case, denoting the resulting choice by r_0^{(1)}, the variance of the bagging bandwidth would converge to zero at the rate var[ĥ(r_0^{(1)}, N)] ∼ n^{−3/5} N^{−9/10}, which is a faster rate of convergence than that of the standard cross-validation bandwidth. The obvious drawback of this criterion is that it would not allow any improvement in terms of computational efficiency, since the complexity of the algorithm would be the same as in the case of standard cross-validation, O(n^2). This makes this choice of r_0 inappropriate for very large sample sizes. Another possible criterion for selecting r_0 would be to minimize, as a function of r, the asymptotic mean squared error (AMSE) of ĥ(r, N), given in (16). Since B, C_1 and V are unknown, we propose the following method to estimate r_0:

Step 1. Consider s subsamples of size p < n, drawn without replacement from the original sample of size n.

Step 2. For each of these subsamples, obtain an estimate, f̂, of the marginal density function of the explanatory variable (using kernel density estimation, for example) and an estimate, m̂, of the regression function (for instance, by fitting a polynomial of a certain degree). Do the same for the required derivatives of both f and m.

Step 3. Use the estimates obtained in the previous step to compute the constants B^{[i]}, C_1^{[i]} and V^{[i]}, for i = 1, . . . , s.

Step 4. Compute the bagged estimates of the unknown constants, that is, B̂ = s^{−1} Σ_{i=1}^s B^{[i]}, Ĉ_1 = s^{−1} Σ_{i=1}^s C_1^{[i]} and V̂ = s^{−1} Σ_{i=1}^s V^{[i]}, and obtain the estimated AMSE of ĥ(r, N) by plugging these bagged estimates into (16).

Step 5. Finally, estimate r_0 by the minimizer, r̂_0, of this estimated AMSE over r.

Additionally, assuming that r = o(n/√N) and denoting by r_0^{(2)} the minimizer of (16), the AMSE of the bagging bandwidth converges to zero at a faster rate; in particular, AMSE[ĥ(r_0^{(2)}, N)] / AMSE(ĥ_{CV,n}) ∼ n^{1/5} N^{−4/3}, and this ratio would tend to zero if N tended to infinity at a rate faster than n^{3/20}. Furthermore, if we let N = n^{3/20} and r = r_0^{(2)}, then the computational complexity of the algorithm would be O(n^{13/20}), much lower than that of standard cross-validation.
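The complexity claim can be checked arithmetically: taking N = n^{3/20}, a total kernel-evaluation cost of order N r^2 matches the stated O(n^{13/20}) exactly when r grows like n^{1/4}, an order implied by, though not stated in, the discussion above.

```python
n = 10 ** 5
N = n ** (3 / 20)   # number of subsamples
r = n ** (1 / 4)    # implied subsample-size order (our inference)

# Bagged CV cost ~ N * r^2 versus the stated O(n^{13/20}):
ratio = (N * r ** 2) / n ** (13 / 20)  # exponents: 3/20 + 2*(1/4) = 13/20
```

The ratio equals one up to floating-point error, and N r^2 is orders of magnitude below the O(n^2) cost of ordinary cross-validation for this n.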
In fact, by selecting r_0 in this way, the complexity of the algorithm will only equal that of standard cross-validation when N tends to infinity at the rate n^{6/13}.

The behavior of the leave-one-out and bagged cross-validation bandwidths is evaluated by simulation in this section. We considered three regression models, M1, M2 and M3, whose regression functions are plotted in Figure 1. The Gaussian kernel was used for computing the Nadaraya-Watson estimator throughout this section. Moreover, to reduce computing time in the simulations, we used binning to select the ordinary and the bagged cross-validation bandwidths. In a first step, we empirically checked how close the bandwidths that minimize the MISE of (2) and (8) are. For this, we simulated 100 samples of different sizes (100, 1000 and 5000) from models M1, M2 and M3 and computed the corresponding MISE curves for the standard Nadaraya-Watson estimator and for its modified version, given in (8). For the sake of brevity, the plot containing these curves is included in the accompanying supplementary materials. That plot shows that the bandwidth that minimizes the MISE of (8) and the bandwidth that minimizes the MISE of the standard Nadaraya-Watson estimator appear to be quite close, even for moderate sample sizes. Of course, the distance between the minima of both curves tends to zero as the sample size increases. Moreover, the standard cross-validation bandwidths and the modified cross-validation selectors (using the standard and the modified version of the Nadaraya-Watson estimator, respectively) were obtained for samples of sizes ranging from 600 to 5000 drawn from model M2. The corresponding figure is also included in the supplementary material. It shows that both bandwidth selectors provide similar results, which in turn get closer as n increases. In a second step, we checked how fast the statistic S_n = n^{3/10} (ĥ_{CV,n} − h_{n0}) approaches its limit distribution.
For this, 1000 samples of size n were simulated from model M2 (with values of n ranging from 50 to 5000) and the corresponding values of S_n were computed. The limit distribution of S_n is also shown (dashed line). In the next part of the study, we focused on empirically analyzing the performance of the bagged cross-validation bandwidth ĥ(r, N), given in (15), for different values of n, r and N. Figure 3 shows the sampling distribution of ĥ_n/h_{n0}, where ĥ_n denotes either the ordinary or the bagged cross-validation bandwidth. For this, 1000 samples of size n = 10^5 from models M1, M2 and M3 were generated, considering in the case of ĥ(r, N) the values r ∈ {100, 500, 1000, 5000, 10000} and N = 25. For all three models, it is observed how the bias and variance of the bagging bandwidth decrease as the subsample size increases and how its mean squared error seems to stabilize for values of r close to 5000. Moreover, the behavior of the bagging selector turns out to be quite positive even when considering subsample sizes as small as r = 100, perhaps excluding the case of model M3, for which the variance of the bagging bandwidth is still relatively high for r = 100, although it undergoes a rapid reduction as the subsample size increases slightly. The effect that r has on the mean squared error of the bagged bandwidth is also illustrated in Table 1, which shows the ratio of the mean squared errors of the bagged bandwidth and the ordinary cross-validation bandwidth, MSE[ĥ(r, N)]/MSE(ĥ_{CV,n}), for the three models. Apart from the better statistical precision of the cross-validation bandwidths selected using bagging, another potential advantage of employing this approach is the reduction of computing times, especially with large sample sizes. To analyze this issue, Figure 4 shows, as a function of the sample size, n, the CPU elapsed times for computing the ordinary and the bagged cross-validation bandwidths. Three possible growth rates for r depending on n were considered: r = n^{0.7}, r = n^{0.8} and r = n^{0.9}.
Calculations were performed in parallel using an Intel Core i5-8600K 3.6GHz CPU. The sample sizes n ∈ {5000, 28750, 52500, 76250, 10^5} and a fixed number of subsamples, N = 25, were used. In this experiment, binning techniques were employed, using a number of bins of 0.1n for standard cross-validation and 0.1r in the case of bagged cross-validation. The time required to compute the bagged cross-validation bandwidth was measured considering the three possible growth rates for r mentioned above. Fitting an appropriate model, these CPU elapsed times could be used to predict the computing times of the different selectors for larger sample sizes. Considering Figure 4, the following log-linear model was used:

log T(n) = log α + β log n,     (17)

that is, T(n) = α n^β, where T(n) denotes the CPU elapsed time as a function of the original sample size, n. In the case of the bagged cross-validation bandwidths, there is a fixed time corresponding to the one required for the setting up of the parallel socket cluster. This time, which does not depend on n, r or N, but only on the CPU and the number of cores used in the parallelization, was estimated to be 0.79 seconds. Using this value, the corrected CPU elapsed times obtained for the bagged bandwidths, T − 0.79, were employed to fit the log-linear model (17), estimating α, β > 0 by least squares, and, subsequently, to make predictions. Table 2 shows the predicted CPU elapsed times for ordinary and bagged cross-validation for large sample sizes. Although we should take these predictions with caution, the results in Table 2 point to very substantial time savings for the bagged selector as the sample size grows. It should also be stressed that, although the quadratic complexity of the cross-validation algorithm is not so critical in terms of computing time for small sample sizes, even in these cases the use of bagging can still lead to substantial reductions in the mean squared error of the corresponding bandwidth selector with respect to the one selected by ordinary cross-validation.
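The log-linear extrapolation just described can be sketched as follows. The CPU times below are hypothetical placeholders, not the measured values from Figure 4; only the model form, log T(n) = log α + β log n fitted by least squares, mirrors the procedure in the text.

```python
import numpy as np

# Hypothetical CPU times (seconds) for the sample sizes used above;
# these numbers are illustrative stand-ins, not the paper's measurements.
n = np.array([5000.0, 28750.0, 52500.0, 76250.0, 100000.0])
T = np.array([0.8, 20.1, 65.0, 135.0, 230.0])

# Least-squares fit of log T(n) = log(alpha) + beta * log(n),
# i.e. T(n) ~ alpha * n^beta.
A = np.column_stack([np.ones_like(n), np.log(n)])
(log_alpha, beta), *_ = np.linalg.lstsq(A, np.log(T), rcond=None)

def predicted_time(m):
    """Extrapolated CPU time for a sample of size m under the fitted model."""
    return np.exp(log_alpha) * m ** beta
```

A fitted slope beta close to 2 would reflect the O(n^2) complexity of ordinary cross-validation.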
In order to show this, 1000 samples from model M1 of sizes n ∈ {50, 500, 5000} were simulated and the ordinary and bagged cross-validation bandwidths for each of these samples were computed. In the case of the bagged cross-validation bandwidth, both the size of the subsamples and the number of subsamples were selected depending on n. Figure 6 shows the sampling distribution of ĥ_n/h_{n0}, where ĥ_n denotes either the ordinary or the bagged cross-validation bandwidth. In the three scenarios, it can be observed that the considerable reductions in variance produced by bagging more than offset the slight increases in bias, thus obtaining significant reductions in mean squared error with respect to the ordinary cross-validation bandwidth selector. Specifically, the relative reductions in mean squared error achieved by the bagged bandwidth turned out to be 69.3%, 90.1% and 93.8% for n = 50, n = 500 and n = 5000, respectively. This experiment was repeated with models M2 and M3, obtaining similar results.

In order to illustrate the performance of the techniques studied in the previous sections, they were applied to the COVID-19 dataset described in the introduction. Since both variables are recorded as integers (age in years and time in hospital in days), the sample contains many ties and, to avoid the problems these cause when performing cross-validation, we decided to remove the ties by jittering the data. In particular, three independent random samples of size n, U_1, U_2 and U_3, drawn from a continuous uniform distribution defined on the interval (0, 1), were generated. Then, U_1 was added to the original explanatory variable and U_2 − U_3 to the original response variable. Figure 7 shows scatterplots for the complete sample as well as for three randomly chosen subsamples of size 1,000. To compute the standard cross-validation bandwidth using binning, the number of bins was set to 10,000, that is, roughly 10% of the sample size. To compute the bagged cross-validation bandwidth, subsamples of size 30,000 were considered. Binning was used again for each subsample, fixing the number of bins to 3,000. The calculations associated with each subsample were performed in parallel using 5 cores.
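The jittering scheme described above can be sketched with synthetic integer-valued data standing in for the real sample (the actual dataset is not reproduced here).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Synthetic integer-valued variables playing the roles of age and
# days in hospital; illustrative stand-ins for the real data.
age = rng.integers(0, 101, size=n).astype(float)
days = rng.integers(1, 31, size=n).astype(float)

# Jittering: add U1 ~ U(0,1) to the explanatory variable and
# U2 - U3 (U2, U3 ~ U(0,1), independent) to the response.
u1 = rng.uniform(0.0, 1.0, n)
u2 = rng.uniform(0.0, 1.0, n)
u3 = rng.uniform(0.0, 1.0, n)
age_jit = age + u1
days_jit = days + (u2 - u3)
```

After jittering, ties in the explanatory variable are almost surely absent, so the leave-one-out criterion is well behaved, while each observation is perturbed by less than one unit.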
The value of the bagged bandwidth was 1.52 and its computing time was 33 seconds. Figure 8 shows the Nadaraya-Watson estimates with both standard and bagged cross-validation bandwidths. For comparative purposes, the local linear regression estimate with direct plug-in bandwidth (Ruppert et al., 1995) was also computed. The fitted curves show a decrease in the expected hospitalization time at the most advanced ages, approaching 100 years. This could be due to the fact that patients in this age group are more likely to die and, therefore, end the hospitalization period prematurely. Finally, the expected hospitalization time grows again very rapidly with age for people over 100 years of age, although this could be caused by some boundary effect, since the number of observations for people over 100 years old is very small, specifically 155, which corresponds to roughly 0.15% of the total number of observations. In order to avoid this possible boundary effect, the estimators were also fitted to a modified version of the sample in which the explanatory variable was transformed using its own empirical distribution function. The resulting estimators are shown in Figure 9, where the explanatory variable was returned to its original scale by means of its empirical quantile function. Finally, the same procedure was followed to estimate the expected time in hospital but splitting the patients by gender, as shown in Figure 10. This figure shows that the expected time in hospital is generally shorter for women, except for ages less than 30 years or between 65 and 85 years. In any case, the difference in mean time in hospital for men and women never seems to exceed one day.

The asymptotic properties of the leave-one-out cross-validation bandwidth for the Nadaraya-Watson estimator considering a regression model with random design have been studied in this paper. Additionally, a bagged cross-validation selector has also been analyzed (theoretically and empirically) as an alternative to standard leave-one-out cross-validation.
The advantage of this bandwidth selector is twofold: (i) to gain computational efficiency with respect to standard leave-one-out cross-validation by applying the cross-validation algorithm to several subsamples of size r < n rather than a single sample of size n, and (ii) to reduce the variability of the selected bandwidth. The methodology presented in this paper can be applied to other bandwidth selection techniques, apart from cross-validation, as mentioned in Barreiro-Ures et al. (2020). Extensions to bootstrap bandwidth selectors are an interesting topic for future research. The bootstrap resampling plans proposed by Cao and González-Manteiga (1993) can be used to derive a closed form for the bootstrap criterion function in nonparametric regression estimation, along the lines presented by Barbeito et al. (2021), who have dealt with matching and prediction. Another interesting future research topic is the extension of the results presented in this paper to the case of the local linear estimator, whose behavior is known to be superior to that of the Nadaraya-Watson estimator, especially in the boundary regions. Supplementary materials include proofs and some plots completing the simulation study.

Supplementary material for "Bagging cross-validated bandwidth selection in nonparametric regression estimation with applications to large-sized samples"

This section includes the proofs of Lemmas 3.1 and 3.2, Theorems 3.1 and 4.1, and Corollaries 3.1 and 4.1 of the main paper. The following assumptions are needed: A1. K is a symmetric and differentiable kernel function. A2. For every j = 0, . . . , 6, the integrals µ_j(K), µ_j(K') and µ_j(K^2) exist and are finite. A3. The functions m and f are eight times differentiable. A4. The function σ^2 is four times differentiable. A5. As r, n → ∞, r = o(n) and N tends to a positive constant or ∞. Proof of Lemma 3.1.
Let us start by defining Therefore, and, hence, Therefore, and, hence, and, so, Adding (1), (2), (3), (4) and (5) we get that Regarding the variance of S, we have that Therefore, and so Hence, The remaining variances and covariances were not explicitly calculated because they are clearly negligible with respect to n −1 h. Namely, Therefore, plugging (7)- (14) into (6) yields Lemma 3.2 provides expressions for the first and second order terms of both the expectation and variance of CV n (h), where CV n (h) is given in equation (9) of the main paper (with notation CV n (h)). Then, under assumptions A1-A4, where B 1 and V 1 are the main terms of the bias and the variance of the MISE of the Nadaraya-Watson estimator, given by: Proof of Lemma 3.2. If we define then CV n (h) can be expressed as follows: where We have that where we have used the fact that After some Taylor developments, we obtain that where we have used the fact that Plugging (19) and (20) into (18) we get the main order terms in (15). However, the modified version of the Nadaraya-Watson estimator, given in equation (8) of the main paper, is a first order approximation ofm h and so the second order terms of (17) do not match those of its "classical" counterpart. Accordingly, we will use the following, more precise, theoretical approximation of the Nadaraya-Watson estimator, and we will denote by CV (h) the corresponding cross-validation function, that is, Therefore, we need to study the second order terms of where , For the sake of simplicity, we will denote by "Z(h, n) 2 =" the second order terms of a function Z(h, n). E Ω 2 1 Ω 2 where we have used the fact that where we have used the fact that Plugging (23) and (24) into (22) E Ω 23 2 Ω 4 3 = I 1 + I 2 + I 3 + I 4 , where f (x 1 − hu)f (x 1 − hv) dx 1 dudv 2 = h 6 6µ 6 (K) mf ϕ 8 + 2µ 2 (K)µ 4 (K) mf ϕ 6 + 1 12 µ 2 (K)µ 4 (K) mf 4) ϕ 7 . Plugging (28), (29), (30) and (31) into (27) we get E Ω 23 2 Ω 4 3 = −µ 2 (K) 3 h 6 f ϕ 1 ϕ 7 . 
E Ω 22 2 Ω 3 3 = I 1 + I 5 + I 3 + I 4 , where where we have used the fact that Plugging (34) into (33) yields E Ω 23 2 Ω 2 3 = I 6 + I 7 + I 3 + I 4 , where Plugging (37) and (38) into (36) we get E Ω 32 2 Ω 3 3 = I 6 + I 7 + I 3 + I 4 = E Ω 23 2 Ω 2 3 . Then, plugging (32), (35), (39) and (40) into (26) E Ω 2 1 Ω 34 Plugging (44), (45), (46) and (47) into (43) we get E Ω 2 1 Ω 34 E Ω 2 1 Ω 23 where Plugging (50), (51) and (52) into (49) we get E Ω 2 1 Ω 23 E Ω 2 1 Ω 32 where Plugging (55), (56) and (57) into (54) yields E Ω 3 1 Ω 22 where Plugging (60) into (59) we get E Ω 3 1 Ω 22 Plugging (48), (53), (58) and (61) E Ω 23 2 Ω 45 where f (x 1 − hv) dx 1 dudv 2 = h 6 6µ 6 (K) mf ϕ 10 + 6µ 2 (K)µ 4 (K) f ϕ 1 ϕ 2 . = h 6 6µ 6 (K) mf ϕ 10 + 2µ 2 (K)µ 4 (K) mf ϕ 2 + 1 12 µ 2 (K)µ 4 (K) mf 4) ϕ 1 . Plugging (65)-(81) into (64) we get E Ω 23 2 Ω 45 4 2 = 1 6 µ 2 (K)µ 4 (K)h 6 mf 4) ϕ 1 . E Ω 22 2 Ω 34 where Plugging (84), (85), (86) and (87) into (83) yields E Ω 23 2 Ω 24 where Plugging (90), (91), (92), (93) and (94) into (89) we get E Ω 23 2 Ω 24 E Ω 23 2 Ω 42 where Plugging (87) and (97) E Ω 32 2 Ω 24 E Ω 32 2 Ω 42 where Plugging (105) E Ω 34 2 Ω 22 where Plugging (113) Plugging (82) Ω jk 2 Ω lr 4 2 = 1 6 µ 2 (K)µ 4 (K)n 4 h 6 mf 4) ϕ 1 Finally, plugging (25), (41), (62) and (118) into (21) we get Let us now define where C ijklrs = cov (P ij , P lr ) − h −1 cov (P ij , Q lrs ) − h −1 cov (P lr , Q ijk ) + h −2 cov (Q ijk , Q lrs ) . 
By counting the possible cases we get var [CV n (h)] = 4 n 2 (n − 1) 4 h 2 [n(n − 1)(n − 2)(n − 3)(n − 4)(n − 5)C 123456 + n(n − 1)(n − 2)(n − 3)(n − 4) (C 123145 + 2C 123415 + 2C 123451 + 2C 123455 + C 123425 + 2C 123452 + C 123453 ) + n(n − 1)(n − 2)(n − 3) (2C 122134 + C 123124 + 2C 123142 + C 123143 + 2C 122314 + C 123214 + 2C 123412 + 2C 123314 + 2C 123413 + 2C 122341 + 2C 123421 + C 123341 + 2C 123431 + C 122344 + C 123423 + C 123432 + + 2C 123411 + 2C 122324 + 2C 122342 ) + n(n − 1)(n − 2) (C 122322 + C 122133 Among the previous covariances, it can be argued that the only ones that contribute to the dominant term of var [CV n (h)] are C 123145 , C 123245 , C 123425 and C 123124 . Before we continue and with the intention of facilitating the calculations of the four C ijklrs that we need, let us obtain general expressions for each of the summands that make up C ijklrs . If r, s = i it is clear that cov (P ij , Q lrs ) = 0. Now, for the cases r = i and s = i (both cases imply i = l), let us define Now, regarding the term C 123245 , since i = l and r, s = i, we have cov (P 12 , P 24 ) = cov (P 12 , Q 245 ) = 0 and cov (P 24 , Making the following change of variables, and using the fact that Therefore, Regarding the term C 123425 , since i = l, j, k = l and r, s = i, then cov (P 12 , P 42 ) = cov (P 12 , Q 425 ) = cov (P 42 , Q 123 ) = 0. We have that where we have made the following change of variables, x 4 = x 2 + hu 4 x 5 = x 2 − hu 5 and used the fact that β 1 (u 2 , u 3 )β 1 (u 4 , u 5 )u i 2 u j 3 u k 4 u l 5 du 2 du 3 du 4 du 5 = jlµ i (K)µ j (K)µ k (K)µ l (K) = 0 ⇐⇒ j = 0 or l = 0 or (i, j, k or l is an odd number). 
Therefore, As for the term C_123124, since r, s ≠ i and j, k ≠ l, we have that i = l and so cov(P_12, where we have used the fact that Therefore, Finally, using similar reasoning and calculations, we get that

Now, from the following decomposition (equation (11) of the main paper): and using Lemmas 3.1 and 3.2, the asymptotic bias and variance of the cross-validation bandwidth that minimizes the modified version of the cross-validation criterion given in equation (9) of the main paper can be obtained. Theorem 3.1 contains this result.

Theorem 3.1. Under assumptions A1-A4, the asymptotic bias and the variance of the bandwidth that minimizes (120) are:

Proof of Theorem 3.1. From equation (119), it follows that, up to first order, Since the first-order terms of M_n(h_{n0}) and E[CV_n(h_{n0})] coincide, we must consider the second-order terms of M_n(h_{n0}) and E[CV_n(h_{n0})] for the bias of ĥ_{CV,n}, while for the variance it will suffice to consider the first-order term of var[CV_n(h_{n0})]. Therefore, to prove Theorem 3.1, we only have to plug the results of Lemmas 3.1 and 3.2 into (121) and (122).

The asymptotic bias and variance of the bagged cross-validation bandwidth are given in Theorem 4.1.

Theorem 4.1. Under assumptions A1-A5, the asymptotic bias and the variance of the bagged cross-validation bandwidth defined in (123) are:

E[ĥ(r, N)] − h_{n0} = (B + C_1) r^{−2/5} n^{−1/5} + o(r^{−2/5} n^{−1/5}),
var[ĥ(r, N)] = V r^{−1/5} n^{−2/5} (1/N + (r/n)²) + o(r^{−1/5} n^{−2/5}/N + r^{9/5} n^{−12/5}),

where the constants B and V were defined in Theorem 3.1 and the constant C_1 is defined in (124).

Proof of Theorem 4.1. If we define then we have

h_{r0} = C_0 r^{−1/5} + C_1 r^{−3/5} + o(r^{−3/5}),
(r/n)^{1/5} h_{r0} = C_0 n^{−1/5} + C_1 r^{−2/5} n^{−1/5} + o(r^{−2/5} n^{−1/5})

and

(r/n)^{1/5} h_{r0} − h_{n0} = C_1 (r^{−2/5} n^{−1/5} − n^{−3/5}) + o(r^{−2/5} n^{−1/5} + n^{−3/5}) = C_1 r^{−2/5} n^{−1/5} + o(r^{−2/5} n^{−1/5}),

where we have used the fact that r = o(n). Therefore,

E[ĥ(r, N)] − h_{n0} = E[(r/n)^{1/5} ĥ_{CV,r,1}] − h_{n0} = (r/n)^{1/5} (E[ĥ_{CV,r,1}] − h_{r0}) + (r/n)^{1/5} h_{r0} − h_{n0} = (B + C_1) r^{−2/5} n^{−1/5} + o(r^{−2/5} n^{−1/5}).
Regarding the variance, we have

var[ĥ(r, N)] = (1/N) (r/n)^{2/5} (var[ĥ_{CV,r,1}] + (N − 1) cov[ĥ_{CV,r,1}, ĥ_{CV,r,2}])   (125)

and

cov[ĥ_{CV,r,1}, ĥ_{CV,r,2}] ≈ M_r(h_{r0})^{−2} cov[CV_1(h_{r0}), CV_2(h_{r0})],   (126)

where CV_q(h) = (2/(r(r − 1)² h)) Σ_{i,j,k∈I_q; j,k≠i} since E[CV_q(h) | I_1, I_2], for q ∈ {1, 2}, does not depend on I_1, I_2 and is therefore not random.

cov[CV_1(h), CV_2(h) | I_1, I_2] = (4/(r²(r − 1)⁴ h²)) Σ_{i,j,k∈I_1; j,k≠i} Σ_{l,s,t∈I_2}   (127)

Following the proof of Lemma 3.2, we only need to count the number of cases associated with C_123124 and C_123425. If we define M = #(I_1 ∩ I_2), which is a random variable, then the number of times C_123124 and C_123425 appear in (127). Plugging these numbers into (127), we get

cov[CV_1(h), CV_2(h) | I_1, I_2] = (4/(r²(r − 1)⁴ h²)) (C_123124 M² r² + C_123425 M² r³) + Z,

where Z = o_p(C_123124 M² r^{−4} + C_123425 M² r^{−3}).

E[M² | I_1] = E[Σ_{i∈I_1} Σ_{j∈I_1} 1_{I_2}(i) 1_{I_2}(j) | I_1] = Σ_{i∈I_1} Σ_{j∈I_1} P(i, j ∈ I_2 | I_1) = r P(1 ∈ I_2) + r(r − 1) P(1 ∈ I_2)² = r (r/n) + r(r − 1) (r²/n²) = r² [n + r(r − 1)] / n² = E[M²],

where 1_{I_2}(·) denotes the indicator function of I_2 and we have used the fact that 1_{I_2}(i) follows a Bernoulli distribution with parameter r/n. Therefore,

cov[CV_1(h), CV_2(h)] = R_1 (n^{−1} r^{−1} h² + r n^{−2} h²) + R_2 n^{−2} h^{−3} + O(n^{−2} h^{−1} + n^{−1} r^{−1} h⁴ + n^{−2} r h⁴)

and

cov[CV_1(h_{r0}), CV_2(h_{r0})] = R_1 C_0² (n^{−1} r^{−7/5} + n^{−2} r^{3/5}) + R_2 C_0^{−3} n^{−2} r^{3/5} + O(n^{−1} r^{−9/5} + n^{−2} r^{1/5}).   (128)

Now, plugging (128) into (126), we get

cov[ĥ_{CV,r,1}, ĥ_{CV,r,2}] = V n^{−2} r^{7/5} + W n^{−1} r^{−3/5} + O(n^{−2} r + n^{−1} r^{−1}),   (129)

where Finally, plugging (129) into (125) yields

var[ĥ(r, N)] = V r^{−1/5} n^{−2/5} (1/N + (r/n)²) + o(r^{−1/5} n^{−2/5}/N + r^{9/5} n^{−12/5}).

In this section, we complete the simulations presented in the main paper, adding two additional plots not included in the paper for reasons of space.
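Before presenting the models, the bagged selector whose bias and variance were derived above can be sketched in a few lines. This is a minimal illustration, assuming a Gaussian kernel and a simple grid search over candidate bandwidths; the function names, the grid, and the without-replacement subsampling are our own choices for the sketch, not prescriptions from the paper.

```python
import numpy as np

def nw_loocv_bandwidth(x, y, grid):
    """Leave-one-out cross-validation bandwidth for the Nadaraya-Watson
    estimator with a Gaussian kernel, minimized over a finite grid."""
    d = x[:, None] - x[None, :]              # pairwise differences X_i - X_j
    best_h, best_cv = grid[0], np.inf
    for h in grid:
        w = np.exp(-0.5 * (d / h) ** 2)      # Gaussian kernel weights
        np.fill_diagonal(w, 0.0)             # leave the i-th observation out
        m_hat = w @ y / w.sum(axis=1)        # leave-one-out NW estimates
        cv = np.mean((y - m_hat) ** 2)
        if cv < best_cv:
            best_h, best_cv = h, cv
    return best_h

def bagged_cv_bandwidth(x, y, r, N, grid, rng):
    """Bagged selector: average the LOO-CV bandwidths of N subsamples of
    size r < n, then rescale to the full sample size by (r/n)^(1/5)."""
    n = len(x)
    hs = []
    for _ in range(N):
        idx = rng.choice(n, size=r, replace=False)
        hs.append(nw_loocv_bandwidth(x[idx], y[idx], grid))
    return (r / n) ** (1 / 5) * np.mean(hs)
```

A call with some r < n and moderate N computes N leave-one-out bandwidths on subsamples of size r and rescales their average by (r/n)^{1/5}, mirroring the definition of ĥ(r, N) analyzed in Theorem 4.1.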
In this simulation study, we considered the regression models M1, M2 and M3, whose regression functions are plotted in Figure 1 of the main paper. Model M3 is:

M3: Y = m(X) + ε, m(x) = x + x² sin²(8πx), X ∼ Beta(3, 3), ε ∼ N(0, 0.1²).

In a first step, we empirically checked how close the bandwidths that minimize the MISE of the Nadaraya-Watson estimator and that of its modified version (given, respectively, in equations (2) and (8) of the main paper) are. To this end, we simulated 100 samples of different sizes (100, 1000 and 5000) from models M1, M2 and M3 and computed the corresponding MISE curves for the standard Nadaraya-Watson estimator and for its modified version. Figure 1 shows these curves. It can be observed that the bandwidth that minimizes the MISE of the standard Nadaraya-Watson estimator and the one that minimizes the MISE of its modified version are quite close, even for moderately small sample sizes. Naturally, the distance between the minima of both curves tends to zero as the sample size increases.

References
Exact bootstrap methods for nonparametric curve estimation
Bandwidth selection for statistical matching and prediction
Bagging cross-validated bandwidths with application to big data
Bagging predictors
Bootstrap methods in regression smoothing
Robust locally weighted regression and smoothing scatterplots
Design-adaptive nonparametric regression
On bagging and nonlinear estimation
Reducing variability of cross-validation for smoothing parameter choice
How far are automatically chosen regression smoothing parameters from their optimum
A review and comparison of bandwidth selection methods for kernel regression
On estimating regression
Popular ensemble methods: an empirical study
An effective bandwidth selector for local least squares regression
Consistent nonparametric regression
Cross-validatory choice and assessment of statistical predictions
Kernel smoothing
Smooth regression analysis
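Model M3, defined earlier in this section, can be reproduced in a few lines. This sketch takes the formula exactly as written, m(x) = x + x² sin²(8πx), with X ∼ Beta(3, 3) and ε ∼ N(0, 0.1²); the function name is ours, chosen only to make the setting concrete.

```python
import numpy as np

def simulate_m3(n, rng):
    """Draw a sample (X, Y) of size n from model M3:
    X ~ Beta(3, 3), Y = m(X) + eps, m(x) = x + x^2 * sin(8*pi*x)^2,
    eps ~ N(0, 0.1^2)."""
    x = rng.beta(3.0, 3.0, size=n)
    eps = rng.normal(0.0, 0.1, size=n)
    y = x + x ** 2 * np.sin(8 * np.pi * x) ** 2 + eps
    return x, y
```

Samples generated this way can be fed directly to any bandwidth selector for the Nadaraya-Watson estimator to replicate the kind of comparison described above.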
On the other hand, Figure 2 shows the standard and modified cross-validation bandwidths (using the standard and modified versions of the Nadaraya-Watson estimator, respectively) obtained for samples of sizes ranging from 600 to 5000 drawn from model M2.

Acknowledgments
The authors would like to thank the Spanish Center for Coordinating Sanitary Alerts and Emergencies for kindly providing the COVID-19 hospitalization dataset. This research has been supported by MINECO grant MTM2017-82724-R, and by Xunta de Galicia (Grupos de Referencia Competitiva ED431C-2020-14 and Centro de Investigación del Sistema Universitario de Galicia ED431G 2019/01), all of them through the ERDF.
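As a final numerical note, the leading-order bias and variance terms of Theorem 4.1 can be evaluated directly. The constants B + C_1 and V are unknown in practice; they are set to 1 below purely for illustration, so only the orders in r, n and N are meaningful.

```python
def leading_bias(n, r, const=1.0):
    # Leading bias term of Theorem 4.1: (B + C1) * r^(-2/5) * n^(-1/5),
    # with the unknown constant B + C1 set to 1 for illustration.
    return const * r ** (-2 / 5) * n ** (-1 / 5)

def leading_var(n, r, N, const=1.0):
    # Leading variance term of Theorem 4.1:
    # V * r^(-1/5) * n^(-2/5) * (1/N + (r/n)^2), with V set to 1.
    return const * r ** (-1 / 5) * n ** (-2 / 5) * (1 / N + (r / n) ** 2)
```

For fixed n and r, increasing N reduces the variance only down to the floor V r^{9/5} n^{−12/5} coming from the (r/n)² term, so there is little to gain from taking N much larger than (n/r)².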