Confidence intervals for nonparametric regression

David Barrera

March 20, 2022

Abstract. We demonstrate and discuss nonasymptotic bounds in probability for the cost of a regression scheme with a general loss function from the perspective of the Rademacher theory, and for the optimality with respect to the average $L^{2}$-distance to the underlying conditional expectations of least squares regression outcomes from the perspective of the Vapnik-Chervonenkis theory. The results follow from an analysis involving independent but possibly nonstationary training samples and can be extended, in a manner that we explain and illustrate, to relevant cases in which the training sample exhibits dependence.

This paper is a companion to [BCG+22], which proposes new methods and error bounds in the nonparametric, distribution-free setting for the approximation of conditional quantiles and expected shortfalls. By these we mean functions x → q_α(x) and x → s_α(x) (1.1) defined on a Polish space S where, for a pair of random variables (X, Y) ∈ S × R, and under appropriate regularity hypotheses on the conditional distribution of Y given X = x (for instance its absolute continuity and a strictly positive density for P_X-a.e. x ∈ S), q_α(·) and s_α(·) are characterized by the corresponding quantile and tail-expectation properties, where P_x[·] and E_x[·] denote the conditional probability of Y given X = x and the expectation with respect to P_x[·].

The approximation of the functions (1.1) is an important problem in statistical inference in general [Koe17] and, nowadays, in particular within the context of machine learning methods [BCG+22]. The present paper arose from the necessity of producing bounds in probability applicable to the convergence analysis of the methods in [BCG+22]: concretely, to the approximation q̂_α(·) of q_α(·) through a scheme based on the "pinball" or "tilted" loss, and to the least squares scheme proposed to produce an approximation r̂_α(·) of the function r_α(·) := s_α(·) − q_α(·) given any approximation q̂_α(·) of q_α(·).

More precisely, it turns out that the analysis (and the construction) of the regression scheme for the approximation of s_α(·) proposed in [BCG+22] is possible via a rather natural continuation of the arguments presented in the papers [BG19] and [BG21], in which nonparametric, distribution-free error bounds associated to learning schemes are discussed in nonstationary settings, mostly in the context of the "Vapnik-Chervonenkis", or "VC", theory ([GKKW02, Vap00]), complemented by a discussion of learning bounds via the Rademacher theory (see [Wol20, MRT18]), applicable to the pinball loss associated to the scheme in [BCG+22] used to approximate q_α(·). The results following from this task go beyond the application in [BCG+22] and are therefore of general interest; they constitute the subject of this paper. From this general perspective, an alternative motivation for the present paper arises from the increasing demand for rigorous expositions of nonasymptotic bounds associated to regression schemes in which the learning sample may be nonstationary.
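For orientation, the following minimal Python sketch illustrates the standard pinball ("tilted") loss and an empirical-risk approximation of a conditional quantile. It is an illustration only: the data-generating process, the linear hypothesis class, the optimizer and all names are choices made here for the example and are not the scheme of [BCG+22].

```python
import numpy as np
from scipy.optimize import minimize

def pinball_loss(u, alpha):
    """Standard pinball ("tilted") loss rho_alpha(u) = u * (alpha - 1_{u < 0})."""
    return u * (alpha - (u < 0).astype(float))

rng = np.random.default_rng(0)
n, alpha = 2000, 0.9

# Synthetic sample (X, Y): Y = 2X + (1 + 0.5X) * N(0, 1), so the conditional
# alpha-quantile is q_alpha(x) = 2x + z_alpha * (1 + 0.5x), z_alpha = Phi^{-1}(alpha).
X = rng.uniform(0.0, 2.0, size=n)
Y = 2.0 * X + (1.0 + 0.5 * X) * rng.standard_normal(n)

def empirical_risk(theta):
    """Average pinball loss of the affine candidate x -> a*x + b over the sample."""
    a, b = theta
    return pinball_loss(Y - (a * X + b), alpha).mean()

res = minimize(empirical_risk, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
a_hat, b_hat = res.x

z_alpha = 1.2816  # Phi^{-1}(0.9)
x_grid = np.linspace(0.0, 2.0, 5)
print("fitted q_alpha:", a_hat * x_grid + b_hat)
print("true   q_alpha:", 2.0 * x_grid + z_alpha * (1.0 + 0.5 * x_grid))
```

The least squares counterpart used in [BCG+22] for r_α(·) is analogous, with the squared loss in place of the pinball loss.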
The continuity with [BG19] and [BG21] is rather evident in this context: [BG19] is basically an extension of the analysis in [GKKW02, Chapters 11 and 12] aimed at providing concentration inequalities associated to least squares which have "the right rate" (and constants tighter than those in [GKKW02]) and which are valid also in the independent, nonstationary case; [BG21] continues these developments by applying coupling ideas that permit, in particular, a set of weak error bounds for least squares regression schemes with β-mixing learning samples. The present paper continues this analysis towards the corresponding bounds in probability, complemented by an application of the coupling idea in [BG21] to the bounds obtained for general loss functions from the Rademacher theory.

We remark that, while we do not think that this paper contains any essential contribution to the bounds developed via the Rademacher theory (beyond their extension to dependent cases), we believe that the inclusion of the corresponding survey-like Section 3 below is justified for at least two reasons: first, the author has not been able to find a presentation of these results which is concise enough to serve as a single source for the analysis performed in [BCG+22]; the presentation below provides such a discussion starting from first principles "modulo folklore". Second, in spite of not being an essentially difficult task, there seems to be a lack of references presenting the Rademacher theory (with independence) in a nonstationary context. We hope therefore that this paper serves to motivate a broadening of perspective in this direction, which is further justified by the emergence in practice of learning scenarios in which the training sample may not be i.i.d.

The paper is organized as follows: Section 2 presents the notation and conventions to be used, together with a general remark on the extension of concentration inequalities from independent to dependent samples. Section 3 is a relatively self-contained presentation of some concentration inequalities via the Rademacher theory and of their application to upper bounds on the probability of large deviations associated to the empirical process, illustrated at the end with examples. Finally, Section 4 presents large deviation estimates via the VC theory akin to those in Section 3 but specialized to the deviation of the quadratic loss, and interprets them in terms of the optimality of the average L^2-distance between the function obtained by empirical minimization and the conditional expectations of the responses given the covariates.

We begin in Section 2.1 by explaining the notation used in the paper: the conventions introduced in this regard are important for a concise presentation of the proofs. Then we present in Section 2.2 some coupling results that allow us to extend our bounds, which are proved under the assumption of independent training samples, to the dependent case (these ideas are applied later, in Sections 3 and 4, in Remarks 3.8 and 4.1).

The following notation and conventions will be used throughout.

State spaces, sequential functions. Given a Polish space S, L_S denotes the set of Borel measurable functions S → R. If (S_k)_{k=1}^n is a sequence of Polish spaces, any subset H_{1:n} ⊂ ∏_{k=1}^n L_{S_k} =: L^⊗_{S_{1:n}} (2.1) will be called a family of sequential functions on ∏_{k=1}^n S_k =: S^⊗_{1:n}. (2.2) Remark 2.1.
If S ⊗ 1:n = S n and H ⊂ L S , we will use often the identification H ≡ diag(H) 1:n where in order to "treat" H as a subset of (L S ) n . A random element is a Borel measurable function Z : Ω → S where (Ω, A, P) is a probability space and S is some Polish space (which will be clear from the context). We will use the usual notation where B is a Borel element of S. If S = R we will refer to Z as a random variable, and we will denote by ||Z|| P,∞ := inf{z ∈ R : P (|Z| > z) = 0} (2.5) the L ∞ P norm of Z, with the convention inf ∅ = ∞, and by All the random elements below will be assumed to be defined in the same fixed probability space, whose existence can be verified a posteriori by standard measure-theoretical methods. This is easy since all of our arguments will involve only finitely many random elements. Remark 2.2. Let H 1:n ⊂ L ⊗ S1:n . If H 1:n : S ⊗ 1:n → [0, ∞] n is the sequential function defined by 1:n ) is the family in (2.24), and if Z 1:n is a random element of S ⊗ 1:n , then the inequalities (2.9) ("≤" follows from (2.8), and "≥" is true under no condition). If the components of Z 1:n are even more independent, then 6 || |H 1:n (Z 1:n )| n,p || P,∞ = | ||H 1:n (Z 1:n )|| P,∞ | n,p . (2.10) Operations with sequential functions. If z 1:n is an element of S ⊗ 1:n , and if h 1:n is a sequential function on S ⊗ 1:n , we will denote by h 1:n (z 1:n ) the vector h 1:n (z 1:n ) := (h k (z k )) n k=1 ∈ R n , (2.11) and we will operate with sequential functions in a component-wise manner, with posible multiplication by scalars. Thus if (a, g 1:n , h 1:n ) ∈ R × L ⊗ S1:n × L ⊗ S1:n (ag 1:n + h 1:n )(z 1:n ) = ag 1:n (z 1:n ) + ah 1:n (z 1:n ) = (ag k (z k ) + h k (z k )) n k=1 (2.12) (g 1:n h 1:n )(z 1:n ) = g 1:n (z 1:n )h 1:n (z 1: (2.13) for every z 1:n ∈ S ⊗ 1:n . We will also make use of the scalar product where the state space of a 1:n or b 1:n will be clear from context. For any two G 1:n ∪ H 1:n ⊂ L ⊗ S1:n , we will use the notation for the direct sum of G 1:n and H 1:n . A similar interpretation defines aH 1:n (where a is a scalar), G 1:n H 1:n , and G 1:n · H 1:n . Operations via a componentwise defined functional. We will define F (a 1:n ) := (F (a 1 ), . . . , F (a n )) (2.16) whenever "F " is an operator well defined on each a k (this is in harmony with the identification F = (F, . . . , F ) in Remark 2.1 and with (2.11)). Thus (for instance) for a random element (Y 1:n , Z 1:n ) of R n × S ⊗ 1:n and h 1:n ∈ L ⊗ S1:n , (2.18) whenever the expectations are well defined. In order to avoid confusions, we will always introduce the dimension "n" for functionals defined on sequences a 1:n rather than on each one of its elements. See for instance (2.19). ℓ p norms. We will use the notation denotes an m−tuple whose elements are n−tuples (this can be thought of as a m×n matrix when convenient), and for consistency we will use the notation t 1:m for any m−tuple that operates against a 1:m 1:n . Given H 1:n ⊂ L ⊗ S1:n we will denote by Pointwise measurability. We will assume that all the sequential families H 1:n considered here are pointwise measurable, meaning that there exists a family for all z 1:n . This implies in particular that, for all z 1:n ∈ S ⊗ 1:n and every continuous ϕ n : R n → R, sup h1:n∈H1:n ϕ n (h 1:n (z 1:n )) = sup l∈N ϕ n (h (l) 1:n (z 1:n )). (2.26) for all z 1:n ∈ S ⊗ 1:n . Sequences indexed by arbitrary sets. Let us finally establish that all the definitions above can and will be extended to cases where, instead of 1, . . . 
, n, the sequences are indexed by other sets, so if (for instance) J ⊂ N is any given set, z J denotes an element of the form (z j ) j∈J . We will also be careful to preserve a consistent notation, so if J 1 and J 2 are given, we will use the notation z J1 and z J2 only if these touples coincide on the indexes in J 1 ∩ J 2 (and therefore the restriction z J1∩J2 is unambiguously defined). We will extend the results in sections 3 and 4 that are obtained under the hypothesis of independent sampling to the dependent case in a manner that is useful when certain β−mixing coefficients decay rapidly enough. The beta-mixing coefficients are defined as follows: Definition 2.1 (β−mixing coefficients). Let A 1 and A 2 be two sub-sigma algebras of A. The β−mixing coefficient β(A 1 , A 2 ) between A 1 and A 2 is defined as where ess sup A1∈A1 denotes the essential supremum indexed by the element of A 1 . If Z 1 , Z 2 are random elements, β(Z 1 , Z 2 ) := β(σ(Z 1 ), σ(Z 2 )). (2.28) See [Nev75, Proposition VI-1-1] for a definition of the essential supremum, from where it follows in particular that there exist a countable family {A 1,n } n ⊂ A 1 such that, Remark 2.4. If A 2 is countably generated, then where P A k (k = 1, 2) denotes the family of finite partitions of Ω by A k −sets. 7 . This representation holds in particular if A k := σ(Z k ) (k = 1, 2) as in (2.28). With this notion, we define the past-to-present β−mixing coefficient of m−dependence as follows: If Z 1:n is a random element of S ⊗ 1:n , then for any t ∈ R, any a 1:n ∈ R n , and any H 1:n ⊂ L ⊗ S1:n , the inequality P sup h1:n∈H1:n {a 1:n · h 1:n (Z 1:n )} > mt holds, where Z * 1:n is an independent sequence with the same marginals as Z 1:n . Proof. This follows from Berbee's lemma [Bra07, Theorem 16.12 ]. See [BG21, Theorem 2.11 and Proposition 2.14] for a detailed proof. The inequality (2.34) permits to extend conveniently the estimates on independent samples that will appear below to cases in which the corresponding sampling sequences satisfy β Z1:n (m n ) → n 0 with an appropriate rate of decay for appropriate m n ≤ n. To illustrate with a classical case, assume that Z 1:n is the finite-dimensional projection of a sequence Z 1:∞ such that the exponential mixing rate for some r > 1 independent of n (2.35) is verified 8 . Then given δ ∈ (nr −n , 1) the choice m = ⌈log r (n/δ)⌉ gives which combined with (2.35) gives a deviation inequality that typically differs from the one for the independent case "only" by (essentially) a logarithmic factor. To be more concrete, if we know a bound of the type giving uniform deviation bounds for sub-samples of size |J| of an independent sequence Z * 1:∞ with the same marginals of Z 1:∞ , then under (2.35) the left-hand side of (2.34) is upper bounded (ignoring divisibility issues) by log r (n/δ)F (t/ log r (n/δ), n/(log r (n/δ))) + δ. (2.38) For further illustration, we will apply this idea in Remarks 3.8 and 4.1 below, assuming that n is divisible by log r (n/δ). The reader is invited to write down the estimate for general δ and to perform analogous estimations for the subpolynomial mixing case in which for some r > 1. (2.39) This section presents some of the main ideas within the so-called "Rademacher theory", interpreted in terms of confidence intervals for regression schemes with general loss functions. 
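Before the formal development, a minimal numerical sketch may help fix the central quantity of this section. It estimates by Monte Carlo the empirical Rademacher complexity of a small finite class, in the unnormalized convention used below (an inner product over the sample rather than an average), and evaluates the generic shape of the resulting confidence radius. The class, the data and the displayed bound are illustrative choices for this sketch, not statements taken from the results below.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Sample z_{1:n} and a small finite class H of [0, 1]-valued functions on [0, 1]
# (both are arbitrary choices for the illustration).
z = rng.uniform(0.0, 1.0, size=n)
H = [lambda x, t=t: np.clip(1.0 - 3.0 * np.abs(x - t), 0.0, 1.0)
     for t in np.linspace(0.0, 1.0, 25)]
values = np.stack([h(z) for h in H])          # shape (|H|, n)

def empirical_rademacher(values, n_draws=2000, rng=rng):
    """Monte Carlo estimate of E sup_h U_{1:n} . h(z_{1:n}) (unnormalized convention)."""
    total = 0.0
    for _ in range(n_draws):
        u = rng.choice([-1.0, 1.0], size=values.shape[1])   # Rademacher signs
        total += np.max(values @ u)                          # sup over the finite class
    return total / n_draws

R_emp = empirical_rademacher(values)
print("empirical Rademacher complexity (sum scale):", R_emp, " per sample:", R_emp / n)

# Loosely, the deviation bounds developed in this section have the shape
#   sup_h | sum_k h(Z_k) - E[sum_k h(Z_k)] | <= 2 * R_ave + envelope * sqrt(2 n log(2/delta))
# with probability >= 1 - delta (constants and envelope terms vary across the statements below).
delta = 0.05
envelope = values.max()
radius = 2.0 * R_emp + envelope * np.sqrt(2.0 * n * np.log(2.0 / delta))
print("illustrative confidence radius (sum scale):", radius, " per sample:", radius / n)
```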
We begin in Section 3.1 by presenting the notion of Rademacher complexity and some of its properties; then, in Section 3.2, we present some deviation inequalities involving the Rademacher complexity which are relatively straightforward applications of McDiarmid's inequality. Section 3.3 presents the classical relation, known as "Massart's lemma", between the Rademacher complexity and the geometric notion of covering numbers, including some applications. Finally, in Section 3.4, we introduce the notion of entropy estimates, which allows us to illustrate via some relevant examples the way in which the estimates previously developed apply to particular sets of hypotheses.

We begin by recalling the following:

Definition 3.1. A Rademacher sequence of length n is an i.i.d. sequence U_{1:n} with P(U_1 = 1) = P(U_1 = −1) = 1/2. (3.1)

The essential notion for what follows is that of Rademacher complexity (for families of sequential functions), defined as:

Definition 3.2. The empirical Rademacher complexity of a family of sequential functions H_{1:n} ⊂ L^⊗_{S_{1:n}} at z_{1:n} is defined as R_emp(H_{1:n}, z_{1:n}) := E[sup_{h_{1:n}∈H_{1:n}} (U_{1:n} · h_{1:n}(z_{1:n}))], (3.2) where U_{1:n} is a Rademacher sequence. If Z_{1:n} is a random element of S^⊗_{1:n}, the Rademacher complexity of H_{1:n} with respect to Z_{1:n} is defined as R_ave(H_{1:n}, Z_{1:n}) := E[sup_{h_{1:n}∈H_{1:n}} (U_{1:n} · h_{1:n}(Z_{1:n}))], (3.3) where U_{1:n} is a Rademacher sequence independent of Z_{1:n}. If S_{1:n} = S^n and H ⊂ L(S) is given, we define R_emp(H, z_{1:n}) and R_ave(H, Z_{1:n}) via the family diag(H)_{1:n}, where diag(H)_{1:n} is defined by (2.3).

Remark 3.1. Notice that in this definition R_emp can be interpreted as a particular instance of R_ave (consider a point measure at z_{1:n}). We keep both definitions separate in order to eventually use the inequality R_ave(H_{1:n}, Z_{1:n}) ≤ sup_{z_{1:n}∈S} R_emp(H_{1:n}, z_{1:n}), where the sup is taken over a set S ⊂ S^⊗_{1:n} supporting Z_{1:n} (P(Z_{1:n} ∈ S) = 1). This permits us to bound R_ave(H_{1:n}, Z_{1:n}) from above uniformly over all distributions supported on S.

Proposition 3.1. With the notation (2.15), (2.22), the inequalities hold for every z_{1:n} ∈ S^⊗_{1:n} and H^{(k)}_{1:n} ⊂ L^⊗_{S_{1:n}} (k = 1, 2). The same inequalities hold (by integration) when R_emp(·, z_{1:n}) is replaced by R_ave(·, Z_{1:n}), where Z_{1:n} is a random element of S^⊗_{1:n}, provided that these complexities are finite.

Remark 3.2. The inequality (3.8) is tight (but see Remark 3.3): if n = 2 and H_{1:2} is chosen suitably, then R_emp(cobal(H_{1:2}), z_{1:2}) = 1 = 2 R_emp(H_{1:2}, z_{1:2}).

Proof (of Proposition 3.1). The key point is the identity sup_{h_{1:n}∈H^{(1)}_{1:n}⊕H^{(2)}_{1:n}} u_{1:n} · h_{1:n}(z_{1:n}) = sup_{h^{(1)}_{1:n}∈H^{(1)}_{1:n}} u_{1:n} · h^{(1)}_{1:n}(z_{1:n}) + sup_{h^{(2)}_{1:n}∈H^{(2)}_{1:n}} u_{1:n} · h^{(2)}_{1:n}(z_{1:n}), (3.10) valid for every u_{1:n} ∈ {−1, 1}^n. The proof of "≤" in (3.10) is obvious, whereas "≥" follows from the following observation: given ε > 0 and u_{1:n}, there exists (h^{(1)}_{1:n}, h^{(2)}_{1:n}) ∈ H^{(1)}_{1:n} × H^{(2)}_{1:n} such that sup_{h^{(1)}_{1:n}∈H^{(1)}_{1:n}} u_{1:n} · h^{(1)}_{1:n}(z_{1:n}) + sup_{h^{(2)}_{1:n}∈H^{(2)}_{1:n}} u_{1:n} · h^{(2)}_{1:n}(z_{1:n}) ≤ u_{1:n} · (h^{(1)}_{1:n} + h^{(2)}_{1:n})(z_{1:n}) + ε, (3.11) which gives the conclusion by letting ε → 0. The proof of (3.7) is left to the reader. To prove (3.8) notice that, by (2.23) and (3.7), if U_{1:n} is a Rademacher sequence, then the desired estimate follows, where we used the fact that −U_{1:n} is also a Rademacher sequence.

Remark 3.3. It is clear from the proof of (3.8) that if for every h_{1:n} ∈ H_{1:n} there exists h′_{1:n} ∈ H_{1:n} with h_{1:n}(z_{1:n}) = −h′_{1:n}(z_{1:n}), (3.13) then R_emp(cobal(H_{1:n}), z_{1:n}) = R_emp(H_{1:n}, z_{1:n}). (3.14)

Remark 3.4. Clearly, (3.7) implies that for any H′_{1:n} ⊂ H_{1:n}, R_emp(H′_{1:n}, z_{1:n}) ≤ R_emp(H_{1:n}, z_{1:n}), (3.15) with the respective analogous consequence for R_ave(H_{1:n}, Z_{1:n}).

The following lemma, which we prove for the sake of completeness, is a typical tool in the nonasymptotic analysis of empirical minimization.

Lemma 3.1. If Z_{1:n} has independent components, then for any fixed u_{1:n} ∈ {−1, 1}^n, E[sup_{h_{1:n}∈H_{1:n}} u_{1:n} · (h_{1:n}(Z_{1:n}) − E[h_{1:n}(Z_{1:n})])] ≤ 2 R_ave(H_{1:n}, Z_{1:n}). (3.16)

Proof.
Pick a sequence Z ′ 1:n of random variables with Z ′ 1:n ∼ Z 1:n and Z ′ 1:n independent of Z 1:n . Notice that for every u ′ 1: by the independence of (Z 1:n , Z ′ 1:n ) and because Z 1:n ∼ Z ′ 1:n (given J ⊂ {1, . . . , n}, exchange of signs in h J for all h 1:n ∈ H 1:n does not affect the finite-dimensional distributions of (h 1:n (Z 1;n ) − h 1:n (Z ′ 1:n )) h∈H ). This implies that for every Rademacher sequence U ′ 1:n independent of (Z 1:n , Z ′ 1:n ), the equality E sup h1:n∈H1:n U ′ 1:n · (h 1:n (Z ′ 1:n ) − h 1:n (Z 1:n )) = E sup h1:n∈H1:n u 1:n · (h 1:n (Z ′ 1:n ) − h 1:n (Z 1:n ) (3.18) holds (condition on U ′ 1:n and use (3.17)). It follows by monotonicity of the (conditional) expectation that E sup h1:n∈H1:n We also remind the following version of McDiarmid's equality [GKKW02, Theorem Lemma 3.2. If Z 1:n is a random element of S ⊗ 1:n with independent components and if K : P-a.s., for P Z k × P Z k -almost every (z k , z ′ k ) and for some c 1:n ∈ [0, ∞] n then, for u = ±1 and every ǫ > 0, (3.21) This has the following straightforward consequence: Theorem 3.1. Let H 1:n ⊂ L ⊗ S1:n , let Z 1:n be a random element of S ⊗ 1:n with independent components, and let H 1:n : S ⊗ 1:n → [0, ∞] n be defined as in (2.7). Then for every (u 1:n , ǫ) ∈ {−1, 1} n × (0, ∞), P sup h1:n∈H1:n Proof. Take In this case |K(Z 1:k−1 , z k , Z k+1:n ) − K(Z 1:k−1 , z ′ k , Z k+1:n )| ≤ 2||H k (Z k )|| P,∞ (3.24) for P Z k × P Z k −a.e. (z k , z ′ k )). Combining Lemma 3.1 with Lemma 3.2 we obtain P (K(Z 1:n ) > ǫ + 2R ave (H 1:n , Z 1:n )) ≤P (K(Z 1:n ) > ǫ + E [K(Z 1:n )]) where the first inequality follows from Lemma 3.1 and the second follows from Lemma 3.2, (3.24) and (2.10). Remark 3.5. 9 If H 1:n ⊂ L ⊗ S1:n is a sequential family of nonnegative functions, in the sense that for every h 1:n ∈ H 1:n and every 1 ≤ k ≤ n, h k (Z k ) ≥ 0, P−a.s. as can easily be seen by an examination of the argument leading to (3.24) (which allows to get rid of the factor 2 under (3.26)) and the steps underneath it: an equivalent formulation of this is that, under (3.26), we can replace ǫ by ǫ/2 at the left-hand side of (3.22). We introduce now, for notational convenience, the functionals defined for (u 1:n , h 1:n , z 1:n ) ∈ {−1, 1} n × H 1:n × S ⊗ 1:n by w h1:n (z 1:n ) =u 1:n · (E [h 1:n (Z ′ 1:n )] − h 1:n (z 1:n )) (3.28) w(z 1:n ) = sup where Z ′ 1:n is an independent copy of Z 1:n 10 . Notice in particular that, if w h1:n (z 1:n ) ≤ 0, then k h1:n (z 1:n ) = 0. In the case in which w h1:n (z 1:n ) > 0, k h1:n (z 1:n ) is just the integer part of n(w h1:n (z 1:n )/w(z 1:n )). Theorem 3.2. Under the hypotheses of Theorem 3.1, let h 1:n ∈ arg min h1:n∈H1:n u 1:n · E [h 1:n (Z 1:n )] ,ĥ 1:n =ĥ 1:n,Z1:n ∈ arg min h1:n∈H1:n u 1:n · h 1:n (Z 1:n ). (3.31) . . , n}, and with the notation (3.28)-(3.30), the bound P u 1:n · E (ĥ 1:n (Z ′ 1:n ) −h 1:n (Z ′ 1:n ))|Z 1:n ≥ ǫ + η + 2R ave (H 1:n , Z 1:n )k/n kĥ 1:n (Z 1:n ) = k ≤ exp −(ǫn/ √ 2k|| |H 1:n (Z 1:n )| n,2 || P,∞ ) 2 1 {k>0} + P −wh 1:n (Z 1:n ) ≥ η . (3.32) holds. Remark 3.6. By considering the case H 1:n = {h 1:n } and exchanging the signs in u 1:n , the bound P −wh 1:n (Z 1:n ) ≥ η ≤ exp(−(η/( √ 2|| |h 1:n (Z 1:n )| n,2 || P,∞ )) 2 ) (3.33) follows from (3.22 Proof. 
(of Theorem 3.2) By the definition ofĥ 1:n ,h 1:n , Z ′ 1:n , and the functionals (3.28)-(3.30), we get u 1:n · E (ĥ 1:n (Z ′ 1:n ) −h 1:n (Z ′ 1:n ))|Z 1:n =wĥ 1:n (Z 1:n ) + u 1:n · ĥ 1:n (Z 1:n ) − E h 1:n (Z ′ 1:n ) ≤wĥ 1:n (Z 1:n ) − wh 1:n (Z 1:n ) ≤kĥ 1:n (Z 1:n )w(Z 1:n )/n − wh 1:n (Z 1:n ). P a 1:n · E (ĥ 1:n (Z ′ 1:n ) −h 1:n (Z ′ 1:n ))|Z 1:n > ǫ + η + 2R ave (H 1:n , Z 1:n )k/n kĥ 1:n (Z 1:n ) = k ≤P kĥ 1:n (Z 1:n )w(Z 1:n )/n > ǫ + 2R ave (H 1:n , Z 1:n )k/n kĥ 1:n (Z 1:n ) = k + P −w(h 1:n , Z 1:n ) > η =P (w(Z 1:n ) > ǫn/k + 2R ave (H 1:n , Z 1:n )) 1 {k>0} + P −w(h 1:n , Z 1:n ) > η ≤ exp −(ǫn/( √ 2k|| |H 1:n (Z 1:n )| n,2 || P,∞ )) 2 1 {k>0} + P −w(h 1:n , Z 1:n ) > η . (3.35) Corollary 3.1. Let H 1:n ⊂ L ⊗ S1:n , let Z 1:n be a random element of S ⊗ 1:n with independent components, let H 1:n : S ⊗ 1:n → [0, ∞] n be defined as in (2.7), and let (3.36) Then for any ǫ > 0 and any independent copy Z ′ 1:n of Z 1:n P n k=1 E (ĥ k (Z ′ k ) −h k (Z ′ k ))|Z 1:n > 2(ǫ + R ave (H 1:n , Z 1:n )) (3.37) In particular, for any δ > 0, the inequality )|Z 1:n ≤ 2 || |H 1:n (Z 1:n )| n,2 || P,∞ 2 log(2/δ) + R ave (H 1:n , Z 1:n ) (3.38) holds with probability at least 1 − δ. Proof. Let us for convenience denote δ(ǫ, k) := exp −(ǫn/( √ 2k|| |H 1:n (Z 1:n )| n,2 || P,∞ )) 2 1 {k>0} . (3.39) In the context of Theorem 3.2, considering the case in which u 1:n := (1, . . . , 1), an application of (3.32) gives that the left-hand side of (3.37) is equal to P u 1:n · E ĥ 1:n (Z ′ 1:n ) −h(Z ′ 1:n )|Z 1:n ≥ 2(ǫ + R ave (H 1:n , Z 1:n )) = n k=0 E P u 1:n · E ĥ 1:n (Z ′ 1:n ) −h(Z ′ 1:n )|Z 1:n ≥ 2(ǫ + R ave (H 1:n , Z 1:n ))|kĥ 1:n = k 1 {kĥ 1:n =k} ≤ n k=0 δ(ǫ, k)P kĥ 1:n = k + P wh(Z 1:n ) > ǫ ≤ 2δ(ǫ, n). (3.40) where in the last inequality we used the fact that (3.39) is increasing in k to upper bound δ(ǫ, k), and where we used the fact that the right hand side of (3.33) is upper bounded by δ(ǫ, n). Notice that (3.37) is the inequality between the extremes of (3.40). The upper bound (3.38) follows by equating 2δ(ǫ, n) to δ and solving for ǫ. > 2 ⌈log r (2n/δ)⌉ (ǫ + max 1≤k≤log r (n/δ) where J k := {k + l ⌈log r (2n/δ)⌉} l ∩ {1, . . . , n} and Z * 1:n is an independent sequence with the same marginals as Z 1:n . This carries further to the fact that, always under (2.35), the conclusion of Corollary 3.1 holds if the right-hand side of (3.38) is replaced by (3.42) (this estimate implicitly requires that r −n ≤ δ/2n ≤ r −1 ). Remember the definition of an r−covering in a semimetric space: If (L, || · ||) is a seminormed vector space, we define coverings with respect to || · || via the semimetric d(l 1 , l 2 ) = ||l 1 − l 2 ||. We proceed now to establish a relationship between the Rademacher complexities of a sequential family H 1:n at Z 1:n and the covering numbers associated to the empirical L 1 -norm of H 1:n at Z 1:n . Definition 3.4. Consider the empirical L 1 −norm of L ⊗ S1:n at z 1:n ∈ S ⊗ 1:n , defined by |f 1:n | z1:n,1 := 1 n n k=1 |f j (z j )| (3.44) for every f 1:n ∈ L ⊗ S1:n . Given a family of sequential functions H 1:n ⊂ L ⊗ S1:n and r ∈ [0, ∞), the r-convering number N 1 (H 1:n , z 1:n , r) is the smallest m ∈ N such that there exists an r−covering of size m of H 1:n with respect to | · | z1:n,1 (with the convention inf ∅ = ∞). If S 1:n = S n and H ⊂ L(S) is given, we define holds. Theorem 3.3. If H 1:n ⊂ L ⊗ S1:n is a family of sequential functions, then for every z 1:n ∈ S ⊗ 1:n and every r > 0, R emp (H 1:n , z 1:n ) ≤ rn + |H 1:n (z 1:n )| n,2 2 log(N 1 (H 1:n , z 1:n , r)). 
(3.47) where H 1:n : S ⊗ 1:n → [0, ∞] n is the sequential function defined by (2.7). Proof. The conclusion is obvious when some H k (z k ) = ∞ or when N 1 (H 1:n , z 1:n , r) = ∞. In any case, it is easy to see that any a 1:n ∈ L ⊗ S1:n satisfying |a 1:n − h 1:n | z1:n,1 ≤ rn (3.48) for some h 1:n ∈ H 1:n , can be assumed to satisfy that . The conclusion follows from Lemma 3.3 and the inequality u 1:n · h 1:n (z 1:n ) =u 1:n · (h 1:n (z 1:n ) − a (k h 1:n ) 1:n (z 1:n )) + u 1:n · a k h 1:n 1:n (z 1:n ) ≤rn + max k (u 1:n · a (k) 1:n (z 1:n )). (3.51) Combined with (3.38), this has the following consequence, fundamental to the application of these ideas to error bounds in learning )|Z 1:n ≤2(r + || |H 1:n (Z 1:n )| n,2 || P,∞ 2 log(2/δ) + E 2 log(N 1 (H 1:n , Z 1:n , r/n)) (3.52) holds for every (r, δ) ∈ (0, ∞) × (0, 1), with probability at least 1 − δ. Proof. The estimate (3.47) with r/n in place of r gives, by integration and Hölder's inequality, the bound R ave (H 1:n , Z 1:n ) = E [R emp (H 1:n , Z 1:n )] ≤r + E |H 1:n (Z 1:n )| n,2 2 log(N 1 (H 1:n , Z 1:n , r/n)) ≤r + || |H 1:n (Z 1:n )| n,2 || P,∞ E 2 log(N 1 (H 1:n , Z 1:n , r/n)) . (3.53) A combination of (3.53) with (3.38) gives (3.52). Remark 3.9. It follows from the inequalities in Proposition 3.1 and an elementary argument that if, in the above, H 1:n is a family satisfying G 1:n ⊂ H 1:n ⊂ co(G 1:n ) [resp. G 1:n ⊂ H 1:n ⊂ cobal(G 1:n )] (3.54) (see (2.21) and (2.22) for definitions), and if G 1:n is defined as in (2.7) with H 1:n replaced by G 1:n (actually it is easy to see that G 1:n = H 1:n ), then all the instances of R ave (H 1:n , Z 1:n ) and H 1:n in the inequalities above can be replaced by instances of R ave (G 1:n , Z 1:n ) and G 1:n [resp. by instances of 2R ave (G 1:n , Z 1:n ) and G 1;n ]. An examination of the proofs above shows that this implies in particular that: (the change when H 1:n ⊂ F 1:n ⊂ cobal(H 1:n ) is evident also from Remark 3.9). For an application related (but not identical) to this fact within the case of neural networks see Example 3.3 below. Of special importance for the applications is the notion of entropy estimates, which permit to give explicit upper bounds to the generalization power of regression schemes via (3.52): Definition 3.5. Let H 1:n ⊂ L ⊗ S1:n be a family of sequential functions and let Z 1:n be a random element of S ⊗ 1:n . An entropy estimate of H 1:n with respect to Z 1:n is a Borelmeasurable, nonincreasing function L H1:n,Z1:n : [0, ∞) → [0, ∞] such that for all r > 0, log(E [N 1 (H 1:n , Z 1:n , r)]) ≤ L H1:n,Z1:n (r). (3.56) If the stronger condition || log(N 1 (H 1:n , Z 1:n , r))|| P,∞ ≤ L H1:n,Z1:n (r) (3.57) holds, we call L H1:n,Z1:n a uniform entropy estimate (of H 1:n with respect to Z 1:n ). If H ⊂ L(S) is given and Z 1:n is a random element of S n we define entropy estimates of H at Z 1:n via the family diag(H) 1:n in (2.3). The following two are instances of uniform entropy estimates that do not depend on the distribution of Z 1:n : Example 3.1. For a function f : R d → R, define the subgraph of f as the set The VC-dimension V F of a family F of functions R d → R is the supremum of the natural numbers l with the following property: there exists a set G ⊂ R d × R with l elements such that every subset G ′ ⊂ G can be written in the form G ′ = G ∩ G + f for some f ∈ F . 
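As a concrete illustration of the subgraph-shattering definition above, the following sketch runs a brute-force check over a finite parameter grid for the affine class on R, assuming the usual convention G_f^+ = {(x, t) : t ≤ f(x)} for the subgraph. Since only finitely many functions are tried, a positive answer certifies shattering (hence a lower bound on V_F); the class, the grid and the point configurations are illustrative choices.

```python
import itertools
import numpy as np

def realized_subsets(points, functions):
    """Subsets of `points` of the form G ∩ G_f^+, with G_f^+ = {(x, t): t <= f(x)}."""
    realized = set()
    for f in functions:
        realized.add(frozenset(i for i, (x, t) in enumerate(points) if t <= f(x)))
    return realized

def is_shattered(points, functions):
    """True if every subset of `points` is cut out by the subgraph of some tried function."""
    return len(realized_subsets(points, functions)) == 2 ** len(points)

# Affine class f(x) = a*x + b, explored on a finite grid of parameters.
grid = np.linspace(-20.0, 20.0, 81)
affine = [lambda x, a=a, b=b: a * x + b for a, b in itertools.product(grid, grid)]

# Two points in generic position: shattered, certifying V_F >= 2 for the affine class.
two_points = [(0.0, 0.0), (1.0, 0.5)]
# Not shattered: picking out only the middle point would need f(0) < 0, f(2) < 0
# and f(1) >= 2, which is impossible for an affine f.
three_points = [(0.0, 0.0), (1.0, 2.0), (2.0, 0.0)]
print("2 points shattered:", is_shattered(two_points, affine))
print("3 points shattered:", is_shattered(three_points, affine))
```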
When F is a family of bounded, nonnegative functions f : R d → [0, B], one has the following uniform L 1 −entropy estimate ([GKKW02, Lemma 9.2 and Theorem 9.4.]) for F : for every r ∈ (0, B/4] and every z 1:n ∈ (R d ) n , log(N 1 (F , z 1:n , r)) ≤ L(r) := log 3 + V F (1 + log 2 + log(B/r) + log(1 + log 3 + log(B/r))), (3.59) which is clearly O(log(1/r)) as r → 0 + when V F < ∞ 11 , in particular when F = T B G is the family of truncated functions from a vector space of dimension d G < ∞, thanks to the bounds ([GKKW02, Theorem 9.5 (and previous paragraph) and Equation (10.23)]). The estimate (3.59) is a consequence of the celebrated Sauer-Shelah lemma (see [Ver18, Theorem 8.3.16 p.193] ). It is therefore a relationship between the complexity of F , as measured by V F , and the notion of uniform entropy estimates. Example 3.2. A second example is given by neural networks with one layer and N "independently powered" units: it is shown in [GKKW02, p.314 ] that if σ : R → [0, 1] is any cumulative distribution function (for instance a "sigmoid" function with asymptotes y = 0 and y = 1) and is the family of functions g : R d → R of the form with N ∈ N fixed, and with To further illustrate the interaction between entropy estimates and Rademacher complexities, let us finally give the following: we arrive at the following: Theorem 3.4. If F (σ, N, B) is as above, (X, Y ) 0:n is an i.i.d. sequence of random variables in R d × R, andf ∈ arg min then for every δ ∈ (0, 1) with probability at least 1 − δ. We will see in the next section that this bound is suboptimal for large samples; in addition, the constants in (3.76) can be improved applying the consequences of Remark 3.5 (essentially replace the 8 by a 4). In any case, since the right-hand side of (3.76) does not depend on the number of units N , Theorem 3.4 is consistent 13 with the popular observation that lasso regularization preserves the generalization power of neural networks as the number of units increases 14 . In this section we present some error estimates in probability for least squares regression schemes that rely on versions of (3.22) adjusted to the case in which Z 1:n = (X, Y ) 1:n is an independent sequence in S × R and the sequential space H 1:n is the space of pointwise deviations of the losses associated to a least square regression scheme on a space of hypotheses G := {g : S → R}. The setting is the following: for some B > 0, consider the truncation operator T B y := min{max{y, −B}, B}, (4.1) and for (X, Y ) 1:n as before, let φ B,k : S → [−B, B] be such that Let G ⊂ L(S), assume that there exist and defineφ We define, for every (g, k) ∈ L(S) × {1, . . . , n}, which in particular satisfies, by (4.2) and the Pythagorean theorem, the equation In this section, we will describe some estimates of the conditional probability where Z ′ 1:n = (X ′ , Y ′ ) 1:n is an independent copy of Z 1:n . Notice that, if we (naturally) denote by y : S × R → R the projection y(x 0 , y 0 ) = y 0 , then this is the same as where || · || k at the left [resp. right]-hand side of (4.8) is the L 2 norm on L(S × R) [resp. L(S)] associated to the law of (X k , Y k ) [resp. X k ], and where we used (4.6) for claiming the equality. The estimates will depend on the functions where T B G := {T B g : g ∈ G} is the family of B−truncated functions from G (see (4.1)). Theorem 4.1. Under the previous setting, for every (c, λ, δ) ∈ (1, ∞) × (1, ∞) × (0, 1), the inequality (log a(c, λ, ǫ n (c, λ)) + log(1/δ))) (4.14) holds with probability at least 1 − δ. Proof. 
Let λ > 1 be given and write where a n (c, λ) := a(c, λ, ǫ n (c, λ)). (4.17) From here on, we keep fixed (c, λ) and therefore we omit for simplicity the arguments of the functions of these parameters (so for instance a n ≡ a n (c, λ)). From the equivalence a n exp(−bnǫ) ≤ δ ⇐⇒ 1 bn (log a n + log(1/δ)) ≤ ǫ, (4.18) and from (4.16), it follows that for any δ ∈ (0, 1) and for ǫ n (δ) := 1 bn (log a n + log(1/δ)) ∨ ǫ n , with probability at least 1 − δ: this and (4.15) imply the claim of the theorem. Remark 4.1. In the setting of Section 2.2, using (for simplicity) the notation m n,δ = log r (n/δ), n m,δ := n/m(n, δ) (4.21) (we assume also for simplicity that m(n, δ) ∈ Z and divides n), and extending the definition (4.10) of A to arbitrary finite sequences (z j ) j∈J by means of the empirical measure for the left-hand side of (4.16), where this time with J k := {k + lm n,δ } l ∩ {1, . . . , n} (1 ≤ k ≤ m n,δ ) and Z * 1:n an independent sequence with the same marginals as Z 1:n . The conclusion (ignoring divisibility issues) is that of Theorem 4.1 with the second line of (4.14) replaced by (n m,δ/2 m n,δ/2 ǫ n m,δ/2 ) ∨ 1 b(c, λ) (log m n,δ/2 + log a * n m,δ/2 + log(2/δ)). (4.24) Let us give explicit (and useful) estimates of the second term at the right hand side of (4.14). We begin by pointing out the estimate ǫ n (c, λ) ≤ 8B 2 λ c(c + 1) 2n which follows from elementary majorizations after multiplying and dividing the sum in (4.12) by its conjugate. Notice now that (this minorization may be suboptimal: see the discussion leading to (4.44) below). Simple manipulations show also that, if =2 5 B 2 (q 1 (λ)p(c) + q 2 (λ)p 2 (c) + q 3 (λ)p 3 (c)). (4.29) Together, (4.25), (4.26) and (4.29) permit to see that, if , (4.30) then the second term at the right-hand side of (4.14) is upper bounded by (4.31) A numerical search shows that the function (1, ∞) × (1, ∞) → R given by (c, λ) → V (c, λ) = 2 5 q 0 (λ)c(c + 1) ∨ 3 k=1 q k (λ)(p(c)) k (log (2(c + 1)(2c + 3))) (4.32) attains a local minimum at some (λ 0 , c 0 ) with 1.29 < λ 0 < 1.3, 11.46 < c 0 < 11.47 (4.33) and with for the left-hand side of (4.14). Behavior as λ → 1. In applications, we may be interested in upper bounding the left hand side of (4.14) for λ close to one 15 . To discuss the behavior of our bounds for this case, notice that This, together with the inequality q 0 (λ) ≤ q 3 (λ), allows us to upper bound (4.31) by (4.38) with C(λ) → λ→1 + 1. The expression (4.38) can be optimized in several directions according to necessity. To illustrate concretely, the restriction 1 < λ ≤ 13/12 (4.39) implies the bound C(λ) < 2. (4.40) Using (4.39) together with (4.40) and with the choice c = 2, we can bound the second term at the right-hand side of (4.14) by 2 6 1 λ − 1 B 2 n log(42) + log 1 δ + log E N 1 T B G, Z 1:n , B 24 n . (4.41) provided (4.39). A refined minorization. Let us finally discuss a case in which a more refined use of the estimate (4.14) gives an interesting consequence. The setting is the same as before but this time we will make more explicit the dependence on n, thus denoting B ≡ B n , Z 1:n ≡ Z (n) 1:n , etc. 
If we have in this setting a sequence (F n (·)) n of entropy estimates for (T Bn G n , Z (n) 1:n ) n (Definition 3.5) and if the following "continuity" property holds for (F n ) n r n r n ′ → n 1 implies F n (r n ) F n (r ′ n ) → n 1, (4.42) (see for instance (3.59) and (3.64)), then from the first equality in (4.26) follow the bounds and an argument similar the one leading to (4.38) shows that if B n ≥ 1, if (λ n ) n is of the form 1 < λ n = 1 + B 2 n n −1/2 , (4.43) and if (c n ) n is bounded away from one, then the second term at the right-hand side of (4.14) is upper bounded by C n 1 n 1/2 2 2 c n (c n + 1) + 2 5 c n c n − 1 3 log 2(c n + 1)(2c n + 3) δ with C n → n 1. The advantage of (4.44) over (4.38) relies on the lower rate of convergence to zero of the radii involved in the covering, which gives room for meaningful estimates under a higher complexity of (T Bn G n , Z (n) 1:n ). Consider for instance the subeuclidean case of order α ∈ (0, 1) (Definition 3.6) in which, assuming that B n n 1/2 → n ∞, we get that the second term at the right-hand side of (4.14) is of the form as n → ∞ for δ fixed; in particular (4.47), and therefore the second term at the right-hand side of (4.14), goes to zero if some choice B n = o(n 1 2 ( 1 α −1) ) makes all this possible 16 : this is related to convergence in probability of the regression function in the L 2 norm (consider the classical i.i.d. case where the hypotheses are as in (2.3); see also [BG19, Theorem 3.14 and Remarks 3.3, 3.5, 3.19]). 16 Notice that Bn is formally involved on the definition of Fn(·). In the case in which G is a set of functions already bounded by B, the argument behind the proof of Theorem 4.1 can be applied also to the large deviation whose behavior in probability we did not analyze before. Concretely, we have the following result (we use an alternative description, akin to (4.8), in the statement). (log a(c, λ, ǫ n (c, λ)) + log(2/δ))) (4.49) where || · || k denotes the L 2 norm in L(S) associated to the law of X k , holds for every (c, λ, δ) ∈ (1, ∞) × (1, ∞) × (0, 1), with probability at least 1 − δ. Proof. First, notice that in this case the truncation operator is indistinguishable from the identity, in particularφ =ĝ. Let us also introduce where the last two equalities follow from (4.6) and the definition ofg in (4.3). By decomposing as in (4.15) and using again (4.6) and (4.50), we arrive at the inequality 1 n n k=1 ||ĝ − φ k || 2 k ≤ inf (4.51) We will seek this time for ǫ(δ) such that P sup ||g − φ k || 2 k + 2ǫ(δ) (4.53) with probability at least 1 − δ. An ǫ 1 (δ) appropriate for the first term in the maximization (4.52) can be found exactly as in the proof of Theorem 4.1: it suffices to exchange δ by δ/2 in that argument. To treat the second term, notice that if where a(c) = 2(c + 1)(2c + 3) (the covering numbers from the argument in [BG19] are equal to one). This and the argument in the proof of Theorem 4.1 permit to conclude that for any ǫ 2 (δ) satisfying ǫ 2 (δ) ≥ (λ − 1)ǫg ∨ ǫ n (c, δ) ∨ 1 nb(c, λ) (log a(c) + log(2/δ)) (4.57) we have the estimate (4.58) The upper bound at the right hand side of (4.49) follows from writing down (4.53) and using (4.50) when ǫ(δ) := 3 (λ − 1)ǫg + (ǫ n (c, δ) ∨ ( 1 nb(c, λ) (log a(c, λ, ǫ n (c, λ)) + log(2/δ)))) (4.59) (which is lower bounded by ǫ 1 (δ) ∨ 3ǫ 2 (δ)). 
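To make the objects of this section concrete, here is a minimal least squares sketch in the spirit of the schemes above: the response is truncated at level B as in (4.1), ĝ is the empirical minimizer over a finite-dimensional class, and the average squared L^2(P_X) distance to the regression function is estimated by Monte Carlo. The data-generating process, the polynomial class and the truncation level are illustrative choices of this sketch, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n, B = 1000, 3.0

# i.i.d. training sample (X, W) with regression function phi(x) = E[W | X = x] = sin(2*pi*x).
X = rng.uniform(0.0, 1.0, size=n)
W = np.sin(2.0 * np.pi * X) + 0.5 * rng.standard_normal(n)

T_B = lambda y: np.clip(y, -B, B)          # truncation operator of (4.1)

def features(x, d=6):
    """Illustrative finite-dimensional class G: polynomials of degree < d."""
    return np.vander(x, N=d, increasing=True)

# Empirical least squares over G against the truncated responses, then truncate the fit.
design = features(X)
coef, *_ = np.linalg.lstsq(design, T_B(W), rcond=None)
g_hat = lambda x: T_B(features(x) @ coef)  # plays the role of the truncated minimizer

# Monte Carlo estimate of the average squared L^2(P_X) distance to phi
# (here |phi| <= 1 < B, so phi and its truncation coincide).
x_test = rng.uniform(0.0, 1.0, size=20000)
phi = np.sin(2.0 * np.pi * x_test)
proj, *_ = np.linalg.lstsq(features(x_test), phi, rcond=None)   # approximate best-in-class fit
print("average L2 error of g_hat      :", np.mean((g_hat(x_test) - phi) ** 2))
print("approximate best-in-class error:", np.mean((features(x_test) @ proj - phi) ** 2))
```

Bounds like those of Theorem 4.1 quantify, with probability at least 1 − δ, by how much the first printed quantity can exceed a multiple of the second, up to the complexity and sample-size terms.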
Using the inequality (a + b)² ≤ ξ(η) a² + ξ(1/η) b², (4.60) valid for every (a, b, η) ∈ R × R × (0, ∞), where ξ(η) = 1 + η (4.61) for every η > 0, it is not difficult to lift the previous result to an estimate of the distance between ĝ as before and the conditional expectations of W (without truncation) given X, namely:

Corollary 4.1. Assume that W_k ∈ L²_P (k = 1, . . . , n) and consider versions Φ_{1:n} of the conditional expectation of W given X, Φ_k(X_k) = E[W_k | X_k], P-a.s., (4.62) for k = 1, . . . , n. Then, with the notation and the hypotheses from Theorem 4.2 and the notation (4.61), the inequality (1/n) Σ_{k=1}^n ||ĝ − Φ_k||²_k ≤ ξ(η) ξ(η′)(6λ − 5) inf_{g∈G} (1/n) Σ_{k=1}^n ||g − Φ_k||²_k + 6 ε_n(c, λ) ∨ ( (1/(n b(c, λ))) (log a(c, λ, ε_n(c, λ)) + log(2/δ)) ) + (ξ(1/η) + ξ(η) ξ(1/η′)(6λ − 5)) (1/n) Σ_{k=1}^n E[(|W_k| − B)² 1_{|W_k| > B}] (4.63) holds for every (c, λ, η, η′, δ) ∈ (1, ∞) × (1, ∞) × (0, ∞) × (0, ∞) × (0, 1), with probability at least 1 − δ.

Remark 4.2. Notice that the bound (4.49) follows from (4.63) by considering W ≡ T_B W and letting η, η′ → 0.

Proof (of Corollary 4.1). For every (k, g, η) ∈ {1, . . . , n} × G × (0, ∞), the estimates (4.64) hold (the last is an application of Jensen's inequality). Taking g = ĝ and averaging over k we get an inequality whose right-hand side is bounded in probability via (4.49). The estimate thus obtained is then upper bounded again via a further application of (4.64) with η′ in place of η and φ_k exchanged with Φ_k.

References

[BCG+22] Learning value-at-risk and expected shortfall.
[BG19] Quantitative bounds for concentration of measure inequalities and empirical regression: the independent case.
[BG21] Generalization bounds for nonparametric regression with β-mixing samples.
[Bra07] Introduction to Strong Mixing Conditions.
Markov chains.
[GKKW02] A distribution-free theory of nonparametric regression.
[Koe17] Quantile regression: 40 years on.
Regularization techniques for learning with matrices.
Mixing properties of ARMA processes.
[MRT18] Foundations of machine learning.
[Nev75] Discrete-Parameter Martingales.
CBMS Regional Conference Series in Probability and Statistics. Institute of Mathematical Statistics and American Statistical Association.
Forecasting non-stationary processes.
[Vap00] The Nature of Statistical Learning Theory. Statistics for Engineering and Information Science.
[Ver18] High-dimensional probability: An introduction with applications in data science.
[Wol20] Mathematical foundations of supervised learning (growing lecture notes).

Acknowledgments. The research leading to this paper was supported by several sources during its different stages, all of which deserve the author's most sincere gratitude. At its initial stage in 2020 the author was supported by a grant from the Chair Stress Test, RISK Management and Financial Steering, led by the École Polytechnique (l'X) and its Foundation and sponsored by BNP Paribas, for the author's participation as a postdoctoral researcher at the aforementioned school, and later by a public allocation from the French employment agency (Pôle Emploi) that was very important to keep the pace during a job transition prolonged by the COVID-19 pandemic. The research continued in 2021 during the author's affiliation with the École Polytechnique Fédérale de Lausanne (EPFL) for participation in a project that also profits from the results presented above: this memorable stay was possible thanks to a grant provided by the EPFL's Chair of Statistical Data Science, under the supervision of Prof. Sofia Olhede. The paper was finalized during the author's first month as an associate professor at the University of the Andes (Uniandes).