Robust and efficient estimation of nonparametric generalized linear models

Gerda Claeskens, Ioannis Kalogridis and Stefan Van Aelst

January 27, 2022

Abstract. Generalized linear models are flexible tools for the analysis of diverse datasets, but the classical formulation requires that the parametric component is correctly specified and that the data contain no atypical observations. To address these shortcomings we introduce and study a family of nonparametric full-rank and lower-rank spline estimators that result from the minimization of a penalized power divergence. The proposed class of estimators is easily implementable, offers high protection against outlying observations and can be tuned for arbitrarily high efficiency in the case of clean data. We show that under weak assumptions these estimators converge at a fast rate and illustrate their highly competitive performance in a simulation study and two real-data examples.

1 Introduction

Based on data (t_1, Y_1), ..., (t_n, Y_n) with fixed t_i ∈ [0, 1], consider the generalized linear model (GLM) stipulating that

Y_i ~ F_{θ_{0,i}, φ_0}, (i = 1, ..., n),   (1)

where F_{θ_{0,i}, φ_0} is a discrete or continuous exponential family of distributions over ℝ. Here, θ_{0,i} is the "canonical" parameter depending on t_i and φ_0 is a common nuisance or scale parameter. Each F_{θ_{0,i}, φ_0} has density with respect to either the Lebesgue or the counting measure of the form

f_{θ_{0,i}, φ_0}(y) = exp{ (y θ_{0,i} − b(θ_{0,i}))/φ_0 + c(y, φ_0) }, (i = 1, ..., n),   (2)

for some functions b: ℝ → ℝ and c: ℝ × ℝ_+ → ℝ that determine the shape of the density. In the classical formulation (see, e.g., McCullagh and Nelder, 1983), the systematic component of the model is defined indirectly through the relation G(b′(θ_{0,i})) = β_0 + β_1 t_i for a known link function G: ℝ → ℝ and some unknown (β_0, β_1) ∈ ℝ². In the present paper, following Green and Silverman (1994), we avoid this heavy parametric assumption and simply require that θ_{0,i} = g_0(t_i) for a smooth function g_0: [0, 1] → ℝ, to be estimated from the data. The limitations of parametric inference in GLMs have been noticed by several authors and a large number of broader alternatives have been proposed through the years. The interested reader is referred to the dedicated monographs of Green and Silverman (1994), Ruppert et al. (2003) and Gu (2013) for extensive relevant discussions and illustrative examples of nonparametric GLM estimators. From a practical standpoint, an important drawback of most of these methods is their reliance on a correctly specified probability model for all the data, whereas, as Huber and Ronchetti (2009, p. 3) note, mixture distributions and aberrant observations not following the model frequently occur in practice. The need for resistant estimators in the nonparametric framework has been acknowledged at least since Hastie and Tibshirani (1990), but the literature in the meantime has remained relatively sparse and overwhelmingly focused on continuous responses with constant variance. For this type of response, resistant penalized spline estimators have been considered by Kalogridis (2021) and Kalogridis and Van Aelst (2021), but unfortunately these ideas do not naturally extend to the setting of GLMs.
A resistant estimation procedure encompassing many generalized linear models was proposed by Azadeh and Salibian-Barrera (2011), who devised a robust backfitting algorithm based on the robust quasi-likelihood estimator of Cantoni and Ronchetti (2001b) together with local polynomial estimation, and who showed the asymptotic unbiasedness of their estimates under a suitable set of conditions. Croux et al. (2012) combined the approach of Cantoni and Ronchetti (2001b) with the P-spline approach of Eilers and Marx (1996) in order to construct robust estimates of both mean and dispersion, but without theoretical support. For smoothing-spline type robust quasi-likelihood estimators, Wong et al. (2014) showed that for smooth score functions and under certain limit conditions the robust estimator inherits the rate of convergence of the non-robust quasi-likelihood estimator. More recently, Aeberhard et al. (2021) proposed obtaining robust estimates by wrapping log-likelihood contributions with a smooth convex function that downweights small log-likelihood contributions, but again with limited theoretical support. A drawback shared by all aforementioned approaches is the lack of automatic Fisher consistency: these methods do not estimate the target quantities without ad-hoc corrections. Such corrections may seem unnatural to researchers and practitioners who are well-acquainted with classical maximum likelihood estimators. Moreover, the estimators of Azadeh and Salibian-Barrera (2011), Croux et al. (2012) and Wong et al. (2014) depend on monotone score functions, which implies that outliers in both the response and the predictor spaces can still exert undue influence on the estimates. Croux et al. (2012) proposed using a weight function to limit the influence of high-leverage observations in the predictor space, but this greatly complicates theoretical investigations and does not help with gross outliers in the response space. To overcome these issues, we introduce a new family of nonparametric spline estimators for GLMs based on the concept of power divergence between distributions developed by Basu et al. (1998). We propose both a full-rank smoothing spline type estimator and a lower-rank estimator based on the principle of penalized splines, which is particularly advantageous for large datasets. The proposed estimators possess a number of notable advantages. First, they are inherently Fisher-consistent for all GLMs and thus do not require ad-hoc corrections. Furthermore, the proposed class of estimators can be tuned to combine high efficiency with robustness against gross outliers in both the response and the predictor spaces. Finally, these estimators can be efficiently computed through a combination of locally supported B-splines and fast iterative algorithms. The rest of the paper is organized as follows. Section 2 provides a detailed description of the proposed estimators. Section 3 is devoted to their asymptotic study. Our results include uniform rates of convergence as well as rates of convergence for derivatives, which, to the best of our knowledge, have not been obtained before for other robust estimators. Our theoretical treatment may be of independent mathematical interest, as for the first time we cast both classes of estimators in a reproducing kernel Hilbert space framework, leading to novel results and useful connections. Section 4 provides a computational algorithm and discusses practical implementation.
Section 5 demonstrates the flexibility of the proposed methods in a simulation study, while two real-data applications are given in Section 6. Section 7 contains a final discussion. All mathematical proofs are collected in the supplementary material accompanying this work.

2 The proposed family of estimators

We begin by reviewing the definition of the density power divergence, as introduced by Basu et al. (1998) for i.i.d. data and modified by Ghosh and Basu (2013) for independent but not identically distributed data. Consider two densities f and h, which for clarity we take to be Lebesgue densities. Densities with respect to the counting measure are also covered by the subsequent arguments, provided essentially that integrals are replaced by sums. For α > 0, the density power divergence d_α(h, f) is defined as

d_α(h, f) = ∫ { f^{1+α}(y) − (1 + 1/α) h(y) f^α(y) + (1/α) h^{1+α}(y) } dy.

It is easy to see that d_α(h, f) ≥ 0 and that d_α(h, f) = 0 if and only if h = f almost everywhere. Moreover, it can be readily verified that d_α(h, f) is continuous as α → 0. In fact, the limit d_0(h, f) is the Kullback-Leibler divergence, which is closely associated with classical maximum likelihood estimators (see, e.g., Claeskens and Hjort, 2008), whereas d_1(h, f) is the L_2-error between the densities, which has been used by Scott (2001) for robust parametric modelling. Hence, as Basu et al. (1998) note, the density power divergence provides a smooth bridge between maximum likelihood estimation and L_2-distance minimization. In the GLM setting we assume that the densities of the Y_i share a possibly infinite-dimensional parameter θ_0 determining the form of their conditional means through the relationship G(μ_i) = ℓ_i(θ_0), where ℓ_i is the evaluation functional for each t_i, i.e., ℓ_i(θ_0) = θ_0(t_i). Moreover, θ_0 belongs to some metric space Θ, which is known a priori. Henceforth, we place emphasis on estimating θ_0 and assume that φ_0, the dispersion parameter, is either known or suitably substituted. For popular GLMs such as the logistic, Poisson and exponential models, φ_0 is indeed known, whereas for other GLMs, e.g., with Gaussian or Gamma responses, φ_0 can be estimated without any model fitting with the resistant Rice-type estimators proposed by Ghement et al. (2008) and Boente et al. (2010). Therefore, we set φ_0 = 1 without loss of generality and drop it from the notation from now on. Furthermore, for convenience we shall from now on write θ_i for ℓ_i(θ). To derive estimators from the density power divergence, we fix an α > 0 and minimize

∑_{i=1}^{n} d_α(f_{θ_{0,i}}, f_{θ_i})   (3)

over all θ ∈ Θ. For this minimization problem we may drop the terms involving only f_{θ_{0,i}}, as they do not depend on θ. An unbiased estimator of the unknown cross term ∫ f_{θ_i}^α dF_{θ_{0,i}} is given by f_{θ_i}^α(Y_i). Thus, we may replace each summand in (3) by the loss

l_α(Y_i, θ_i) = ∫ f_{θ_i}^{1+α}(y) dy − (1 + 1/α) f_{θ_i}^α(Y_i).   (4)

Below we explain how we use this principle to construct a new class of nonparametric GLM estimators with good theoretical properties. Consider now the specific GLM (1) with densities (2), where θ_{0,i} = θ_0(t_i) = g_0(t_i) for i = 1, ..., n, and g_0 needs to be estimated from the data. In this section we only require that g_0 belongs to the Hilbert-Sobolev space W^{m,2}([0,1]) for some m ≥ 1, which is defined as the set of functions f: [0,1] → ℝ such that f, f′, ..., f^{(m−1)} are absolutely continuous and f^{(m)} ∈ L²([0,1]). The space W^{m,2}([0,1]) is well-suited for nonparametric regression problems, as it forms a reproducing kernel Hilbert space (RKHS), so that for each x ∈ [0,1] the evaluation functionals are continuous; see, e.g., Wahba (1990). As a compromise between goodness of fit and complexity, we propose to estimate g_0 by the minimizer ĝ_n of

L_n(g) = n^{-1} ∑_{i=1}^{n} l_α(Y_i, g(t_i)) + λ ∫_0^1 |g^{(m)}(t)|² dt,   (5)

for some λ > 0 that acts as the tuning parameter.
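To make (4) concrete, consider Gaussian responses with φ_0 = 1. A standard Gaussian integral (our computation, not a display from the paper) gives

∫ f_θ^{1+α}(y) dy = (2π)^{−α/2} (1 + α)^{−1/2},  l_α(y, θ) = (2π)^{−α/2} (1 + α)^{−1/2} − (1 + 1/α) (2π)^{−α/2} e^{−α(y−θ)²/2}.

The first term does not depend on θ, so in the Gaussian case minimizing (5) amounts to minimizing a penalized exponential squared loss, a fact exploited later in this section.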
For bounded densities that are continuous with respect to their parameter, the objective function (5) is bounded from below and continuous on W^{m,2}([0,1]). Reasoning along the same lines as in the proof of Theorem 1 of Kalogridis (2021) reveals that for n ≥ m this minimization problem is well-defined and that there exists at least one minimizer in W^{m,2}([0,1]). Arguing in a standard way (see, e.g., Eubank, 1999) shows that this minimizer must be an easily computable n-dimensional natural spline with knots at the unique t_1, ..., t_n. Details are given in Section 4. It is worth noting that for α → 0 the objective function (5) reduces, up to additive constants, to the penalized negative log-likelihood, which leads to the penalized maximum likelihood estimator considered, for example, by Cox and O'Sullivan (1990), Mammen and van de Geer (1997) and Kauermann et al. (2009). An important implication of this limiting relation is that the present smoothing spline type estimator can be made highly efficient relative to maximum likelihood estimators by taking a sufficiently small α. In practice, we aim to balance robustness and efficiency and select an α in (0, 1], although higher values can also be considered. Section 4 outlines a possible strategy in this respect. Even for Gaussian responses the smoothing spline type estimator in (5) has not been considered before. In this case the form of the loss function is rather simple. Indeed, for Gaussian Y_i the first term in (4) is constant as a function of g ∈ W^{m,2}([0,1]). Hence, apart from constants, the objective function becomes

n^{-1} ∑_{i=1}^{n} ρ_α(Y_i − g(t_i)) + λ ∫_0^1 |g^{(m)}(t)|² dt,

with ρ_α(x) = −e^{−αx²/2} and a suitably rescaled λ. This exponential loss function has attractive properties for robust estimation, because it is a bounded loss function that is infinitely differentiable with bounded derivatives of all orders. In the parametric setting the exponential squared loss has been used, e.g., by Wang et al. (2013). In the nonparametric setting considered herein, the penalized exponential squared loss gives rise to a novel estimator that may be viewed as a more robust alternative to the Huber and least absolute deviations smoothing spline estimators studied in van de Geer (2000) and Kalogridis (2021). See Section 5 for interesting comparisons. A noteworthy property of the penalty functional in (5) is the shrinkage of the estimator towards a polynomial of order m. To see this, assume that ĝ_n lies in the null space of the penalty, so that ĝ_n^{(m)} = 0. A Taylor expansion with integral remainder term shows that

ĝ_n(t) = P_m(t) + (1/(m−1)!) ∫_0^t (t − s)^{m−1} ĝ_n^{(m)}(s) ds,

where P_m(t) is the Taylor polynomial of order m (degree ≤ m − 1). The Schwarz inequality shows that the integral remainder vanishes for all t ∈ [0,1], whence sup_{t∈[0,1]} |ĝ_n(t) − P_m(t)| = 0. This implies that letting λ → ∞ will cause the estimator to become a polynomial of order m, as the dominance of the penalty term in (5) forces the estimator to lie in its null space. A crucial property underlying the construction of all our GLM estimators is their inherent Fisher-consistency (Hampel et al., 2011, p. 83). In particular, our previous discussion shows that θ_{0,i} minimizes E{l_α(Y_i, θ_i)} for each i = 1, ..., n. Hence, our estimation method is Fisher-consistent for every model distribution. To the best of our knowledge, this is the first robust nonparametric estimator that enjoys this property without corrections. Since such corrections are model-dependent, this inherent Fisher-consistency yields an important practical bonus.
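Returning to the exponential squared loss above, a small R sketch (ours, not the authors' released implementation) verifies that its score function is bounded and redescending, so gross outliers receive vanishing weight:

```r
# Exponential squared loss rho_alpha(x) = -exp(-alpha * x^2 / 2) and its
# derivative psi_alpha(x) = alpha * x * exp(-alpha * x^2 / 2); psi is bounded
# by sqrt(alpha) * exp(-1/2) and redescends to zero for large residuals.
rho <- function(x, alpha) -exp(-alpha * x^2 / 2)
psi <- function(x, alpha) alpha * x * exp(-alpha * x^2 / 2)
x <- seq(-6, 6, length.out = 401)
max(abs(psi(x, alpha = 1)))  # approx. exp(-0.5) = 0.607
```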
A drawback of smoothing spline type estimators is their dimension, which grows linearly with the sample size. This implies that for large n smoothing spline estimators can be computationally cumbersome. Moreover, as noted by Wood (2017, p. 199), in practice the value of λ is almost always high enough that the effective degrees of freedom of the resulting spline are much smaller than n. Penalized spline estimators offer a compromise between the complexity of smoothing splines and the simplicity of (unpenalized) regression splines (O'Sullivan, 1986; Eilers and Marx, 1996). We now discuss penalized spline estimators for our setting as a simpler alternative to the smoothing spline estimator discussed above. Fix a value K ∈ ℕ_+ and define the knots 0 = x_0 < x_1 < ... < x_K < x_{K+1} = 1, whose interior points do not have to be design points. For any fixed integer p ∈ ℕ_+, let S_K^p([0,1]) denote the (K + p)-dimensional space of spline functions of order p with knots x_1, ..., x_K, and estimate g_0 by the minimizer ĝ_n over S_K^p([0,1]) of

n^{-1} ∑_{i=1}^{n} l_α(Y_i, g(t_i)) + λ ∫_0^1 |g^{(m)}(t)|² dt,   (6)

with λ ≥ 0. Hence, we have replaced W^{m,2}([0,1]) in (5) by a (K + p)-dimensional spline subspace. For K ≪ n this yields sizeable computational gains relative to the smoothing spline estimator. Moreover, it turns out that penalized spline estimators do not sacrifice much in terms of accuracy, provided that K grows at a suitable rate with n; see Theorem 2 below. Penalized spline estimators retain a number of important mathematical properties of their full-rank smoothing spline counterparts. In particular, for λ > 0 and p = 2m it can be shown that the penalized spline estimator is a natural spline of order 2m (Wand and Ormerod, 2008). Moreover, the null space of the penalty consists exactly of polynomials of order ≤ m. In the frequently used setting of equidistant interior knots, the latter property is also retained if one replaces the derivative penalty with the simpler difference (P-spline) penalty ∑_{j=m+1}^{K+p} |Δ^m β_j|² proposed by Eilers and Marx (1996); see the code sketch below. Here, Δ^m is the m-th order backward difference operator and β_j, j = 1, ..., K + p, are the coefficients of the B-spline functions. In this case the two penalties are scaled versions of one another, with the scaling factor depending on K, p and m (see, e.g., Proposition 1 of Kalogridis and Van Aelst, 2021). Thus, P-spline estimators are also covered by the asymptotic results of the following section. As noted before, an essential characteristic of W^{m,2}([0,1]) is that it is an RKHS. The reproducing kernel depends on the inner product that W^{m,2}([0,1]) is endowed with. We shall make use of the inner product

⟨f, g⟩_{m,λ} = ⟨f, g⟩ + λ ⟨f^{(m)}, g^{(m)}⟩,

where ⟨·,·⟩ denotes the standard inner product on L²([0,1]). It is interesting to observe that ⟨·,·⟩_{m,λ} is well-defined and depends on the smoothing parameter λ, which typically varies with n. Eggermont and LaRiccia (2009, Chapter 13) show that there exists a finite positive constant c_0 such that for all f ∈ W^{m,2}([0,1]) and λ ∈ (0,1] we have

sup_{t∈[0,1]} |f(t)| ≤ c_0 λ^{-1/(4m)} ||f||_{m,λ},   (7)

with ||f||_{m,λ} = ⟨f, f⟩_{m,λ}^{1/2}. Hence, for any λ ∈ (0,1], W^{m,2}([0,1]) is indeed an RKHS under the inner product ⟨·,·⟩_{m,λ}. The condition λ ≤ 1 is not restrictive, as for our main result below we assume that λ → 0 as n → ∞ and our results are asymptotic in nature.
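The difference penalty mentioned above is particularly easy to construct; a minimal R sketch with illustrative sizes:

```r
# m-th order P-spline difference penalty for B-spline coefficients beta:
# beta' P_diff beta = sum_{j=m+1}^{K+p} |Delta^m beta_j|^2 (Eilers and Marx, 1996).
# For equidistant knots this is a scaled version of the derivative penalty.
K <- 20; p <- 4; m <- 2
D <- diff(diag(K + p), differences = m)  # rows compute Delta^m beta_j
P_diff <- crossprod(D)
```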
The assumptions needed for our theoretical development are as follows.

(A1) The support χ := {y : f_θ(y) > 0} does not depend on θ ∈ ℝ.

(A2) The densities {f_θ(y), θ ∈ ℝ} are thrice differentiable as functions of their canonical parameter θ, and there exists an ε > 0 such that the first three derivatives of l_α(y, θ) with respect to θ exist for all y ∈ χ and all θ with |θ − θ_{0,i}| < ε, with integrals replaced by sums for mass functions.

(A3) There exists an M > 0 such that sup_{y∈χ, |θ−θ_{0,i}|<ε} max{|l″_α(y, θ)|, |l‴_α(y, θ)|} ≤ M, uniformly in i and n.

(A4) Write u_θ(y) = ∂ log f_θ(y)/∂θ and let u_{θ_{0,i}}(y) denote this derivative evaluated at the true parameter. Then E{u_θ(Y_i) f_θ^α(Y_i)} is differentiable for all θ ∈ ℝ, expectation and differentiation can be interchanged, and the resulting derivatives ξ_i at θ = θ_{0,i}, as well as the second moments E{|l′_α(Y_i, θ_{0,i})|²}, are bounded away from zero and infinity uniformly in i and n.

(A5) The design points t_i, i = 1, ..., n, are quasi-uniform in the sense of Eggermont and LaRiccia (2009).

Condition (A1) is standard in the theory of exponential families and may be viewed as an identifiability condition. Assumptions (A2)-(A3) are regularity conditions, largely reminiscent of classical maximum likelihood conditions (cf. Theorem 5.41 in van der Vaart, 1998). These conditions essentially impose some regularity on the second and third derivatives in a neighbourhood of the true parameter. For any α > 0, conditions (A2) and (A3) are satisfied for a wide variety of GLMs, owing to the rapid decay of f_θ^α(y) for large values of |y|. In particular, they are satisfied for the Gaussian, logistic and Poisson models. Similarly, condition (A4) is an extension of the assumptions underpinning classical maximum likelihood estimation. For α = 0 these moment conditions entail that the Fisher information is strictly positive and finite. The following examples demonstrate that condition (A4) holds for popular GLMs.

Example 1 (Gaussian responses). In this case the quantities in (A4) can be computed in closed form from Gaussian integrals, so that (A4) is satisfied without any additional conditions.

Example 2 (Binary responses). The canonical parameter is θ_i = log(p_i/(1 − p_i)), with p_i denoting the probability of success of the i-th trial, so that p_{0,i} = 1/(1 + e^{−θ_{0,i}}). Here (A4) is satisfied whenever p_{0,i} is bounded away from zero and one, precisely as required by Cox and O'Sullivan (1990) and Mammen and van de Geer (1997) for classical nonparametric logistic regression.

Example 3 (Exponential responses). Assume that f_{θ_{0,i}}(y) = exp[yθ_{0,i} + log(−θ_{0,i})] 1_{(0,∞)}(y) with θ_{0,i} = −λ_{0,i}, where λ_{0,i} is the rate parameter of the exponential distribution. A lengthy calculation shows that (A4) is satisfied provided that λ_{0,i} stays away from zero; by compactness and continuity, λ_0(t) is automatically bounded from above.

Example 4 (Poisson responses). Consider the case of Poisson responses with canonical parameters θ_{0,i} = log(λ_{0,i}) and positive rate parameters λ_{0,i}, i = 1, ..., n. Then it can again be shown that (A4) is satisfied provided that λ_{0,i} stays away from zero.

Finally, condition (A5) ensures that the design points are well spread throughout the interval [0,1]. Call Q_n the distribution function of t_1, ..., t_n. Then, for each f ∈ W^{1,1}([0,1]), an integration by parts argument reveals that

| n^{-1} ∑_{i=1}^{n} f(t_i) − ∫_0^1 f(t) dt | ≤ sup_{t∈[0,1]} |Q_n(t) − t| ∫_0^1 |f′(t)| dt.

Consequently, (A5) is satisfied provided that Q_n approximates the uniform distribution function well, for example if t_i = i/(n + 1) or t_i = 2i/(2n + 1). Theorem 1 contains the main result for the smoothing type spline estimator defined in (5).

Theorem 1. Assume that (A1)-(A5) hold and that λ → 0 in such a way that n^m λ → ∞ as n → ∞. Then there exists a sequence of local minimizers ĝ_n of (5) satisfying

||ĝ_n − g_0||²_{m,λ} = O_P(n^{-1} λ^{-1/(2m)} + λ).

The result in Theorem 1 is local in nature, as the loss function (4) is not convex unless α = 0. Similar considerations exist in Basu et al. (1998), Fan and Li (2001) and Wang et al. (2013). The limit requirements of the theorem are met, e.g., whenever λ ≍ n^{-2m/(2m+1)}, in which case we are led to the optimal rate ||ĝ_n − g_0||²_{m,λ} = O_P(n^{-2m/(2m+1)}). This rate is similar to the rate obtained in nonparametric regression for continuous responses with constant variance (Kalogridis, 2021). A faster n^{-4m/(4m+1)}-rate can be obtained whenever g_0 and its derivatives fulfil certain boundary conditions; see Eggermont and LaRiccia (2009) and Kalogridis (2021). Since ||·|| ≤ ||·||_{m,λ} for all λ > 0, the same rate applies for the more commonly used L²([0,1])-distance.
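For concreteness, the optimal λ balances the two terms in the bound of Theorem 1:

n^{-1} λ^{-1/(2m)} = λ  ⟺  λ = n^{-2m/(2m+1)},

so that both terms, and hence ||ĝ_n − g_0||²_{m,λ}, are of order n^{-2m/(2m+1)}; for m = 2 this is the familiar n^{-4/5} rate of cubic smoothing splines.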
However, the fact that our results are stated in terms of the stronger norm ||·||_{m,λ} leads to two notable consequences. First, the bound in (7) immediately yields

sup_{t∈[0,1]} |ĝ_n(t) − g_0(t)| = O_P( λ^{-1/(4m)} (n^{-1} λ^{-1/(2m)} + λ)^{1/2} ),

which implies that the convergence can be made uniform. Although this uniform rate is slower than the log^{1/2}(n) n^{-m/(2m+1)}-rate obtained in Eggermont and LaRiccia (2009, Chapter 21) for the classical smoothing spline in the case of continuous data with constant variance, this rate is guaranteed in a much broader setting encompassing many response distributions. Secondly, by using Sobolev embeddings we can also describe the rate of convergence of the derivatives of ĝ_n in the L²([0,1])-metric. These rates are given in Corollary 1 below. To the best of our knowledge, these are the first results on uniform convergence and convergence of derivatives outside the classical likelihood framework.

Corollary 1. Assume that the conditions of Theorem 1 hold. Then, for any λ ≍ n^{-2m/(2m+1)}, the sequence of minimizers in Theorem 1 satisfies

||ĝ_n^{(j)} − g_0^{(j)}|| = O_P(n^{-(m−j)/(2m+1)}), j = 1, ..., m.

The assumptions for the penalized spline type estimators are for the most part identical to those for the smoothing spline type estimators. In addition to (A1)-(A4), we require the following assumptions instead of assumption (A5).

(B5) The number of knots K = K_n → ∞ and K = o(n) as n → ∞.

(B6) There exists a distribution function Q with corresponding Lebesgue density bounded away from zero and infinity such that, with Q_K denoting the distribution function of the interior knots, sup_{x∈[0,1]} |Q_K(x) − Q(x)| = o(K^{-1}).

(B7) The empirical distribution function of the design points t_1, ..., t_n converges uniformly to a distribution function with Lebesgue density bounded away from zero and infinity, at the rate required in Lemma 3 of the supplement.

These assumptions have been used extensively in the treatment of penalized spline estimators (see, e.g., Claeskens et al., 2009; Kauermann et al., 2009; Xiao, 2019). Assumption (B5) imposes a weak condition on the rate of growth of the number of interior knots, which is not restrictive in practice, as it is in line with the primary motivation behind penalized spline type estimators. Assumption (B6) concerns the placement of the knots and is met if the knots are equidistant, for example. Finally, assumption (B7) is the lower-rank equivalent of assumption (A5) and holds in similar settings. Recall that S_K^p([0,1]) is a finite-dimensional space and that, for any f: [0,1] → ℝ which is continuous or a step function, ||f|| = 0 implies that f = 0. Consequently, endowing S_K^p([0,1]) with ⟨·,·⟩_{m,λ} makes it a Hilbert space. Proposition 1 shows that S_K^p([0,1]) is an RKHS, allowing us to draw direct parallels between W^{m,2}([0,1]) and its dense subspace S_K^p([0,1]).

Proposition 1. If assumptions (B5)-(B7) hold and p > m ≥ 1, then there exists a positive constant c_0 such that, for all f ∈ S_K^p([0,1]),

sup_{t∈[0,1]} |f(t)| ≤ c_0 min{K^{1/2}, λ^{-1/(4m)}} ||f||_{m,λ}.

Proposition 1 implies the existence of a symmetric function R_{m,K,λ}: [0,1]² → ℝ, the reproducing kernel, which can be derived explicitly in this setting. Let H_p denote the (K+p) × (K+p) matrix consisting of the inner products ⟨B_i, B_j⟩, with B_1, ..., B_{K+p} the B-spline functions of order p, let P_m denote the penalty matrix with elements ⟨B_i^{(m)}, B_j^{(m)}⟩, and set G_λ = H_p + λ P_m. Since for f, g ∈ S_K^p([0,1]) we have ⟨f, g⟩_{m,λ} = f^⊤ G_λ g, with f, g ∈ ℝ^{K+p} the vectors of scalar coefficients, it is easy to see that

R_{m,K,λ}(x, y) = B(x)^⊤ G_λ^{-1} B(y), B(x) = (B_1(x), ..., B_{K+p}(x))^⊤,

satisfies the required properties. Note also that it follows from (7) that sup_{t∈[0,1]} |f(t)| ≤ c_0 λ^{-1/(4m)} ||f||_{m,λ} for all f ∈ S_K^p([0,1]). However, the bound in Proposition 1 is much tighter if K = o(λ^{-1/(2m)}), that is, if the rate of growth of K is outpaced by the rate of decay of λ^{1/(2m)}. This suggests that the rate of growth of K and the rate of decay of λ jointly determine the asymptotic properties of penalized spline estimators. This relation is formalized in Theorem 2.
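The explicit kernel is easy to compute numerically. The following R sketch (ours; Gram matrices by simple grid quadrature, sizes illustrative) builds R_{m,K,λ} for p = 4 and m = 2:

```r
library(splines)
# Reproducing kernel R(x, y) = B(x)' (H_p + lambda * P_m)^{-1} B(y)
# on equidistant interior knots; quadrature on a fine grid.
p <- 4; m <- 2; K <- 10; lambda <- 1e-3
inner <- seq(0, 1, length.out = K + 2)[-c(1, K + 2)]
kn <- c(rep(0, p), inner, rep(1, p))      # open knot vector: K + p basis functions
tt <- seq(0, 1, length.out = 2001); h <- tt[2] - tt[1]
B0 <- splineDesign(kn, tt, ord = p)               # B_j(t)
Bm <- splineDesign(kn, tt, ord = p, derivs = m)   # B_j^{(m)}(t)
H  <- h * crossprod(B0)                           # <B_i, B_j>
P  <- h * crossprod(Bm)                           # <B_i^{(m)}, B_j^{(m)}>
Ginv <- solve(H + lambda * P)
Rk <- function(x, y)
  drop(splineDesign(kn, x, ord = p) %*% Ginv %*% t(splineDesign(kn, y, ord = p)))
# Reproducing property: <R(x, .), f>_{m,lambda} = B(x)' Ginv (H + lambda P) f = f(x).
```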
Theorem 2. Suppose that assumptions (A1)-(A4) and (B5)-(B7) hold, that g_0 ∈ C^j([0,1]) for some 1 ≤ j ≤ p, that n^{δ−1} K² → 0 for some δ > 0, and that λ → 0 in such a way that n^{-1} K min{K, λ^{-1/(2m)}} → 0 and K min{λ² K^{2m}, λ} → 0 as n → ∞. Then there exists a sequence of local minimizers ĝ_n of (6) satisfying

||ĝ_n − g_0||² = O_P( n^{-1} min{K, λ^{-1/(2m)}} + min{λ² K^{2m}, λ} + K^{-2j} ).

Theorem 2 presents the L²([0,1])-error as a decomposition into three terms, accounting for the variance, the modelling bias and the approximation bias of the estimator, respectively. It is interesting to observe that, apart from the term K^{-2j} stemming from the approximation of a generic C^j([0,1])-function with a spline, the error depends simultaneously on K and λ, highlighting the intricate interplay between knots and penalties in the asymptotics of penalized spline estimators. In particular, for g_0 ∈ C^m([0,1]) and K ≥ λ^{-1/(2m)} we obtain

||ĝ_n − g_0||² = O_P( n^{-1} λ^{-1/(2m)} + λ + K^{-2m} ),

which, apart from the approximation error K^{-2m}, corresponds to the L²([0,1])-error decomposition of the smoothing spline estimator in Theorem 1. This fact has important practical implications, as it allows for an effective dimension reduction without any theoretical side effects. Indeed, taking λ ≍ n^{-2m/(2m+1)} and K ≍ n^γ for any γ ≥ 1/(2m+1) leads to ||ĝ_n − g_0||² = O_P(n^{-2m/(2m+1)}), which is the same rate of convergence as for smoothing spline estimators. Moreover, with this choice of tuning parameters, the convergence rates of the derivatives given in Corollary 1 carry over to the present lower-rank estimators, as summarized in Corollary 2.

Corollary 2. Assume that the conditions of Theorem 2 hold and that g_0 ∈ C^m([0,1]). Then, for any λ ≍ n^{-2m/(2m+1)} and K ≍ n^γ with γ ≥ 1/(2m+1), the sequence of minimizers in Theorem 2 satisfies

||ĝ_n^{(j)} − g_0^{(j)}|| = O_P(n^{-(m−j)/(2m+1)}), j = 1, ..., m.

While there have been attempts in the past to cast penalized spline estimators in an RKHS framework (e.g., in Pearce and Wand, 2006), this was done from a computational perspective. To the best of our knowledge, our theoretical treatment of penalized spline estimators based on the theory of RKHS is the first of its kind in both the classical and the robust literature and may be of independent mathematical interest. The interested reader is referred to the accompanying supplementary material for the technical details.

4 Practical implementation

The practical implementation of the smoothing and penalized spline estimators based on the density power divergence requires a computational algorithm as well as the specification of their parameters, namely the tuning parameter α, the penalty parameter λ and, in the case of penalized splines, the dimension K of the spline basis. Practical experience with penalized spline estimators has shown that the dimension of the spline basis is less important than the penalty parameter, provided that K is taken large enough (Ruppert et al., 2003; Wood, 2017). Hence, little is lost by selecting K in a semi-automatic manner. We now discuss a unified computational algorithm for the smoothing and penalized spline type estimators and briefly discuss the selection of K. By using the B-spline basis, the computation of the proposed estimators can be unified. Recall from our discussion in Section 2 that a solution to (5) is a natural spline of order 2m. Assume for simplicity that all the t_i are distinct and that min_i t_i > 0 and max_i t_i < 1. Then the natural spline has n interior knots and we may represent the candidate minimizer as g = ∑_{k=1}^{n+2m} g_k B_k, where the B_k are the B-spline basis functions of order 2m supported by the knots at the interior points t_i and the g_k are scalar coefficients (de Boor, 2001, p. 102).
The penalty term now imposes the boundary conditions. The reasoning is as follows: if that were not the case, it would always be possible to find a 2m-th order natural spline h_n(t), written in terms of B_1, ..., B_{n+2m}, that interpolates ĝ_n(t_i), i = 1, ..., n, leaving the first term in (5) unchanged but, due to it being a natural spline, attaining a strictly smaller value of the penalty; hence any minimizer must itself satisfy the natural boundary conditions. The above discussion shows that to compute either the smoothing spline type or the penalized spline type estimator it suffices to minimize

L_n(g) = n^{-1} ∑_{i=1}^{n} l_α(Y_i, B(t_i)^⊤ g) + λ g^⊤ P_m g   (8)

over coefficient vectors g ∈ ℝ^{K+p}, where in the smoothing spline case the knots satisfy x_i = t_i, i = 1, ..., n, and the order is p = 2m, while in the penalized spline case 0 < x_1 < ... < x_K < 1 and p > m. By default we set p = 2m in our implementation of the algorithm and use all interior knots, i.e., x_i = t_i for i ≤ n, if n ≤ 50. For n > 50 we adopt a thinning strategy motivated by the theoretical results of the previous section as well as by the strategy employed by the popular smooth.spline function in the R language (see Hastie et al., 2009, p. 189). In particular, for n > 50 we employ only K ≍ n^{1/(2m+1)} interior knots, which we spread over the interval [0,1] in an equidistant manner. While this leads to large computational gains, Theorem 2 assures us that no accuracy is sacrificed. For example, with m = 2 and n = 5000 our strategy amounts to using only 83 knots; a dramatic dimension reduction. We solve (8) with a Newton-Raphson procedure. The updating steps of the algorithm can be reduced to penalized iteratively reweighted least-squares updates, in the manner outlined by Green and Silverman (1994, Chapter 5.2.3); see the sketch below. The weights are given by

w_i = l″_α(Y_i, g_i), g_i = B(t_i)^⊤ g,

evaluated at the current iterate, and the vector of "working" data z ∈ ℝⁿ consists of

z_i = g_i − l′_α(Y_i, g_i)/w_i, (i = 1, ..., n).

Alternatively, the weights w_i can be replaced by their simpler expected values E{l″_α(Y_i, g_i)}, i = 1, ..., n, leading to a variant of the Fisher scoring algorithm. For α = 0 these formulae reduce to those for penalized likelihood estimators with canonical links, cf. Green and Silverman (1994).
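A minimal R sketch of one such penalized IRLS scheme (our own schematic, not the authors' released code; lprime and ldouble are user-supplied functions computing l′_α and l″_α):

```r
# Penalized iteratively reweighted least squares for minimizing
#   n^{-1} sum_i l_alpha(y_i, B_i' beta) + lambda * beta' P beta.
dpd_pirls <- function(y, B, P, lambda, lprime, ldouble,
                      beta = rep(0, ncol(B)), maxit = 50, tol = 1e-8) {
  n <- nrow(B)
  for (it in seq_len(maxit)) {
    theta <- drop(B %*% beta)
    w <- ldouble(y, theta)                 # working weights
    z <- theta - lprime(y, theta) / w      # working data
    M <- crossprod(B, w * B) / n + 2 * lambda * P
    beta_new <- drop(solve(M, crossprod(B, w * z) / n))
    if (sum((beta_new - beta)^2) < tol * (1 + sum(beta^2))) return(beta_new)
    beta <- beta_new
  }
  beta
}
```

The update is exactly the Newton-Raphson step for (8): the Hessian is n^{-1} B^⊤ W B + 2λP and the working data absorb the gradient.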
The tuning parameter α determines the trade-off between robustness and efficiency of the estimators. Selecting α independently of the data could lead to a lack of resistance towards atypical observations or to an undesirable loss of efficiency. To determine α in a data-driven way, we modify the strategy of Ghosh and Basu (2015) and rely on a suitable approximation of the mean integrated squared error (MISE) of ĝ_n, i.e., E{||ĝ_n − g_0||²}. To make the dependence of the estimator on α explicit, we now denote it by ĝ_{n,α}. For each α ≥ 0 the MISE can be decomposed as

E{||ĝ_{n,α} − g_0||²} = ||E{ĝ_{n,α}} − g_0||² + E{||ĝ_{n,α} − E{ĝ_{n,α}}||²},   (9)

where the first term on the right-hand side represents the integrated squared bias, while the second term is the integrated variance of the estimator. Neither of these terms can be computed explicitly, since E{ĝ_{n,α}} and g_0 are both unknown. Therefore, we seek approximations. Following Warwick and Jones (2005) and Ghosh and Basu (2015), we replace E{ĝ_{n,α}} in the bias term by ĝ_{n,α} and use a "pilot" estimate instead of the unknown g_0. To limit the influence of aberrant observations on the selection of α, we propose to replace g_0 by the robust estimate ĝ_{n,1}, which minimizes the L²-distance between the densities, as described in Section 2. To approximate the variance term, observe that E{||ĝ_{n,α} − E{ĝ_{n,α}}||²} = Tr{H_p Cov{ĝ_{n,α}}}, with ĝ_{n,α} in the trace understood as the vector of spline coefficients. Write l′_α(Y_i, ĝ_{n,α}(t_i)), i = 1, ..., n, for the first derivatives of the loss and analogously l″_α(Y_i, ĝ_{n,α}(t_i)) for the second derivatives. A first-order Taylor expansion of the score function of (8) then readily yields a sandwich-type approximation of Cov{ĝ_{n,α}} in terms of these derivatives, where B is the n × (K+p) spline design matrix with ij-th element given by B_j(t_i). Inserting the approximations of the bias and the variance in (9), we obtain the approximate MISE

AMISE(α) = (ĝ_{n,α} − ĝ_{n,1})^⊤ H_p (ĝ_{n,α} − ĝ_{n,1}) + Tr{H_p Ĉov{ĝ_{n,α}}}.

We propose to select α by minimizing AMISE(α) over a grid of 20 equidistant candidate values in [0, 1]. This grid includes both the maximum likelihood (α = 0) and the L²-distance (α = 1) estimators as special cases. The bias approximation, and thus the selection of α, depends on the pilot estimate ĝ_{n,1}. To reduce this dependence, Basak et al. (2021) propose to iterate the selection procedure. That is, using ĝ_{n,1} as pilot estimate, determine the value α̂ minimizing AMISE(α) and use the corresponding density power divergence estimate as the new pilot estimate instead of ĝ_{n,1}. This procedure is repeated until α̂ converges, which in our experience takes between one and three iterations in the vast majority of cases in our setting. The computation of the estimator for a given value of α requires an appropriate value of the penalty parameter λ. To determine λ, we utilize the Akaike information criterion (AIC) in the form proposed by Hastie and Tibshirani (1990), which penalizes the attained value of the objective by the effective degrees of freedom of the smoother. To determine the minimizer of AIC(λ) we adopt a two-stage strategy: we first evaluate AIC(λ) on a rough grid and then apply a numerical optimizer based on the Brent algorithm (Nocedal and Wright, 2006, pp. 236-237) in the neighbourhood of the most promising λ value. Implementations and illustrative examples of the power divergence smoothing/penalized spline type estimators are available at github.com/ioanniskalogridis/Robust-and-efficient-estimation-of-nonparametric-GLMs.
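Schematically, the iterated selection rule can be written as follows (a sketch with hypothetical helpers fit_fun(alpha), returning a fit, and amise_fun(fit, pilot), returning the criterion value; not the authors' code):

```r
# Iterated AMISE minimization over a grid of candidate alpha values
# (Warwick and Jones, 2005; Ghosh and Basu, 2015; Basak et al., 2021).
select_alpha <- function(fit_fun, amise_fun,
                         grid = seq(0, 1, length.out = 20), maxit = 5) {
  pilot <- fit_fun(1)                      # L2-distance fit as robust pilot
  a_old <- NA_real_
  for (k in seq_len(maxit)) {
    scores <- vapply(grid, function(a) amise_fun(fit_fun(a), pilot), numeric(1))
    a_new <- grid[which.min(scores)]
    if (identical(a_new, a_old)) break     # alpha has converged
    a_old <- a_new
    pilot <- fit_fun(a_new)                # new pilot for the bias term
  }
  a_new
}
```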
5 Simulation study

We now illustrate the practical performance of the proposed density power divergence spline estimators for GLMs with Gaussian, Binomial and Poisson responses. We compare the following estimators.

• The adaptive density power divergence estimator discussed in the previous section, denoted by DPD(α̂).
• The L²-distance estimator corresponding to α = 1, denoted by DPD(1).
• The standard penalized maximum likelihood estimator corresponding to α = 0, abbreviated as GAM.
• The robust Huber-type P-spline estimator of Croux et al. (2012) with 40 B-spline functions, denoted by RQL, as implemented in the R-package DoubleRobGam.
• The robust local polynomial estimator of Azadeh and Salibian-Barrera (2011), denoted by RGAM, as implemented in the R-package rgam.

For the first three estimators we use the default settings described in Section 4 with B-splines of order p = 4 combined with penalties of order m = 2. For Gaussian data we use the resistant Rice-type estimator discussed in Kalogridis (2021) to estimate φ; that is, we use φ̂ = σ̂², with σ̂ the solution of a Rice-type M-estimating equation based on first differences of the responses, where ρ is Tukey's bisquare loss function with tuning constant equal to 0.704. This estimate is used for the DPD(α̂), DPD(1) and RQL estimators. Since RGAM is not implemented for Gaussian responses, we have replaced it in this case by the robust local linear estimator obtained with the loess function and reweighting based on Tukey's bisquare; see Cleveland (1979). Both RQL and RGAM are by default tuned for a nominal 95% efficiency for Gaussian data. The penalty parameter is selected via a robust AIC for RQL and via robust cross-validation for RGAM. Our numerical experiments assess the performance of the competing estimators on both ideal data and data containing a small number of atypical observations. For our experiments we set t_i = i/(n+1), i = 1, ..., n, and consider the following two test functions:

• g_1(t) = −sin(25t/6)/0.8 − 1,
• g_2(t) = 1.8 sin(3.4t²),

which were also considered by Azadeh and Salibian-Barrera (2011) and Croux et al. (2012). For Binomial and Poisson responses each Y_i has mean μ_j(t_i) = G^{-1}(g_j(t_i)) for j = 1 or 2, with G the canonical link; see the sketch at the end of this section. For each setting we generated 1000 samples of size n = 200, with a fraction ε of contaminated observations. We evaluate the performance of the competing estimators via the mean squared error (MSE), given by n^{-1} ∑_{i=1}^{n} |μ̂_i − μ_i|². Since MSE distributions are right-skewed in general, we report both the average and the median of the MSEs. Tables 1-3 report the results for data with Gaussian, Binomial and Poisson responses, respectively. The simulation experiments lead to several interesting findings. For uncontaminated Gaussian data, Table 1 shows that GAM and RGAM perform slightly better than DPD(α̂) and RQL, while the L²-distance estimator DPD(1) is less efficient. However, even with a small amount of contamination, the advantage of GAM and RGAM evaporates. RQL offers more protection against outlying observations, but it is outperformed by DPD(α̂) and also by DPD(1). For a better understanding of the combination of efficiency and robustness which DPD(α̂) achieves, Figure 1 presents the selected tuning parameters for the set of simulations involving g_1(t) combined with ε = 0 and ε = 0.1 in the left and right panels, respectively. These histograms show that, although the distribution of α̂ has a long tail, for ε = 0 its values are overwhelmingly concentrated near the origin. Most notably, the left panel of Figure 1 indicates that in well over 400 of our 1000 simulations DPD(α̂) is in fact the penalized maximum likelihood estimator. For contaminated data the distribution of α̂ shifts to the right. The value 1 is selected with increased frequency, but smaller values are, in general, preferred. This is not surprising, considering that the vast majority of observations still follows the ideal distribution. As seen in Table 2, for Binomial data DPD(α̂) remains highly efficient and robust. For ideal data, DPD(α̂) even attains a lower mean and median MSE than GAM, although this is likely an artefact of sampling variability. A surprising finding here is the exceptionally good performance of DPD(1) for clean data: it outperforms both RQL and RGAM, performing almost as well as DPD(α̂). This suggests that the efficiency loss for Binomial responses is much less steep as a function of α than in the Gaussian case; see Ghosh and Basu (2016) for efficiency considerations in the parametric case. As evidenced by Table 3, the situation for count responses is somewhat reminiscent of the situation with Gaussian responses. For clean data, GAM exhibits the best median performance, closely followed by DPD(α̂). RQL performs slightly worse than DPD(α̂) on clean data, but the difference between these two estimators becomes more pronounced in favour of DPD(α̂) the heavier the contamination. For ideal data, DPD(1) performs adequately for μ_1(t) but is rather inefficient for μ_2(t). RGAM performs better for μ_2(t) than for μ_1(t), but is outperformed by DPD(α̂) in both cases.
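For reference, the test functions and the error metric of this section in R (data generation shown for the Poisson case under the canonical log link):

```r
# Test functions of the simulation study and the MSE metric reported in
# Tables 1-3 (mean and median over 1000 Monte Carlo replications).
g1 <- function(t) -sin(25 * t / 6) / 0.8 - 1
g2 <- function(t) 1.8 * sin(3.4 * t^2)
n <- 200; t_i <- (1:n) / (n + 1)
y <- rpois(n, lambda = exp(g1(t_i)))      # Poisson responses, log link
mse <- function(mu_hat, mu) mean((mu_hat - mu)^2)
```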
6 Real-data examples

Our first example concerns data from the National Health and Nutrition Examination Survey (NHANES), conducted by the United States Centers for Disease Control and Prevention (CDC), normally on a yearly basis. Due to the Covid-19 pandemic, the cycle of data collection between 2019 and 2020 was not completed and, in order to obtain a nationally representative sample, the CDC combined these measurements with data collected in the 2017-2018 period, leading to a dataset consisting of 9737 individuals. We refer to https://wwwn.cdc.gov/nchs/nhanes and the accompanying documentation for a detailed description of the survey and the corresponding data. For this example, we study the relationship between type 2 diabetes and high-density-lipoprotein (HDL) cholesterol, as well as between type 2 diabetes and glycohemoglobin. Since the way in which these covariates influence the probability of diabetes cannot be specified in advance, we apply our methodology to estimate the two regression functions in a nonparametric manner. The algorithm described in Section 4 selects α̂ = 1 in both cases, indicating the presence of atypical observations. Comparing the robust and the GAM-estimates reveals that, despite some areas of agreement, there is notable disagreement between the estimates. In particular, the estimated probabilities of type 2 diabetes differ significantly for large values of HDL cholesterol and for both small and medium-large concentrations of glycohemoglobin. Inspection of the panels suggests that the GAM-estimates are strongly drawn towards a number of atypical observations, corresponding to individuals with high HDL cholesterol but no diabetes and to diabetes patients with low and medium concentrations of glycohemoglobin, respectively. The results in both cases are counter-intuitive, as for healthy individuals with good levels of HDL cholesterol or low levels of glucose GAM predicts a non-negligible probability of diabetes. By contrast, the robust DPD(α̂)-estimates remain unaffected by these atypical observations, leading to more intuitive estimates. Since robust estimates are less attracted by outlying observations, such observations can be detected from their residuals. For GLMs we may make use of the Anscombe residuals (McCullagh and Nelder, 1983, p. 29), which follow a Gaussian distribution more closely than their Pearson counterparts. For Bernoulli responses these are given by

r_{A,i} = {A(Y_i) − A(μ̂_i)} / {μ̂_i (1 − μ̂_i)}^{1/6}, A(u) = ∫_0^u t^{-1/3} (1 − t)^{-1/3} dt.

We classify an observation as an outlier if |r_{A,i}| ≥ 2.6, which is a conventional cut-off value for the standard Gaussian distribution. The outliers for our examples are shown in Figure 4 with a different shape and colour coding. These plots show that the outliers are largely located in the areas in which the DPD(α̂)- and GAM-estimates differ, thus confirming the sensitivity of the GAM-estimates.
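In R, the function A(u) above can be expressed through the regularized incomplete beta function, so the residuals are straightforward to compute (our sketch):

```r
# Anscombe residuals for Bernoulli fits: A(u) = int_0^u t^(-1/3) (1-t)^(-1/3) dt
# equals pbeta(u, 2/3, 2/3) * beta(2/3, 2/3); flag |r| >= 2.6 as outlying.
A_fun <- function(u) pbeta(u, 2/3, 2/3) * beta(2/3, 2/3)
anscombe_bern <- function(y, mu) (A_fun(y) - A_fun(mu)) / (mu * (1 - mu))^(1/6)
# example: outliers <- which(abs(anscombe_bern(y, fitted_prob)) >= 2.6)
```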
Figure 5: Left: logarithm of LOS+1 versus age for patients with disorders of the respiratory system. Right: logarithm of LOS+1 versus age for patients with disorders of the circulatory system. The lines correspond to the DPD(α̂)- and GAM-estimates, respectively; observations exhibiting large Anscombe residuals according to DPD(α̂) are highlighted.

Our second example concerns the length of hospital stays. Patients admitted into Swiss hospitals are classified into homogeneous diagnosis related groups (DRG). The length of stay (LOS) is an important variable in the determination of the cost of treatment, so the ability to predict LOS from the characteristics of the patient is helpful. Herein, we use the age of the patient (in years) to predict the average LOS (in days) for two DRGs comprising diseases and disorders of the respiratory system and diseases and disorders of the circulatory system, which we henceforth abbreviate as DRG 1 and DRG 2. We assume that LOS can be modelled as a Poisson random variable with log(E{LOS_j}) = g_j(Age), j ∈ {1, 2}, for unknown functions g_1 and g_2 corresponding to DRG 1 and DRG 2, respectively. The data, consisting of 2807 and 3922 observations, are plotted in the left and right panels of Figure 5, together with the DPD(α̂)- and GAM-estimates of the regression functions. The plots show that the GAM-estimates lack smoothness and are shifted upwards relative to the DPD(α̂)-estimates. Both facts are attributable to the lack of resistance of the GAM-estimates towards the considerable number of patients with atypically lengthy hospital stays given their age. It should be noted that, while it is always possible to manually increase the smoothness of the GAM-estimates, non-robust automatic methods are very often affected by outlying observations, resulting in under- or oversmoothed estimates, as observed by Cantoni and Ronchetti (2001a). Thus, in practice, robust methods of estimation need to be accompanied by robust model selection criteria. On the other hand, our algorithm selects α̂ = 1 in both cases, which, combined with the robust AIC proposed in Section 4, leads to reliable estimates of the regression functions, even in the presence of numerous outlying observations. These estimates largely conform to our intuition, as they suggest that older patients are, on average, more likely to experience longer hospital stays. To detect the outlying observations, we may again use the Anscombe residuals of the DPD(α̂)-estimates, which in the Poisson case are given by

r_{A,i} = 3 (Y_i^{2/3} − μ̂_i^{2/3}) / (2 μ̂_i^{1/6}).

Observations with |r_{A,i}| ≥ 2.6 are indicated with a different shape and colour coding in Figure 5. These panels suggest that, while there exist patients with atypically brief stays, the vast majority of outliers lie in the opposite direction, thereby explaining the upward shift of the sensitive GAM-estimates.

7 Discussion

This paper greatly extends penalized likelihood methods for nonparametric estimation in GLMs and derives new and important theoretical properties for this broad class of estimators. In practice, the proposed class of estimators behaves similarly to non-robust GAM estimators in the absence of atypical observations, but exhibits a high degree of robustness in their presence. These properties make the proposed methodology particularly well-suited for the analysis of many complex datasets commonly encountered nowadays, such as the diabetes and length-of-hospital-stay data analysed in Section 6. There are a number of interesting and practically useful extensions that we aim to consider in future work. These include the case of higher-dimensional nonparametric components, modelled, for example, with thin-plate or tensor product penalties (Wood, 2017, Chapter 5), as well as more general semi-parametric models based on the density power divergence that would allow for both parametric and nonparametric components. Currently, our density power divergence estimator depends on the tuning parameter α, for the selection of which we have developed a data-dependent scheme. An intriguing alternative would be a composite penalized estimator involving several values of α, as proposed by Zou and Yuan (2008) in the context of quantile regression. Such an approach has the potential of producing another resistant yet highly efficient competitor to standard maximum likelihood estimators.
Supplementary material

Lemma 1. Let H denote a real Hilbert space of functions with inner product ⟨·,·⟩_H and associated norm ||·||_H, and let L: H → ℝ denote a weakly lower semi-continuous functional whose range is bounded from below, say L: H → [M, ∞). If there exists a g ∈ H such that ||g||_H < 1 and L(g) < inf_{||f||_H = 1} L(f), then L possesses a minimizer in the interior of the unit ball of H.

Proof. Define B = {f ∈ H : ||f||_H ≤ 1} and set y = inf_{f∈B} L(f). Observe that y is finite, as the range of L is bounded from below. Thus, there exists a minimizing sequence {f_n}_n ⊂ B, that is, lim_{n→∞} L(f_n) = y. Now, the ball B is closed, bounded and convex. The space H is reflexive (see, e.g., Rynne and Youngson, 2008); hence B is weakly compact. Therefore, there exists a subsequence {f_{n_k}}_k which converges weakly to some f_0 ∈ B. The weak lower semicontinuity of L now implies

L(f_0) ≤ lim inf_{k→∞} L(f_{n_k}) = y,

and it must be that L(f_0) = y. Our assumptions then yield L(f_0) ≤ L(g) < inf_{||f||_H=1} L(f), which further implies that f_0 is in the interior of the ball, i.e., ||f_0||_H < 1. The proof is complete.

The following lemma allows us to compare sums with integrals and may be viewed as a version of the Euler-Maclaurin formula.

Lemma 2 (Quadrature). Let m ≥ 1. Assuming that the design is quasi-uniform in the sense of (A5), there exists a constant c_m depending only on m such that, for all f ∈ W^{m,2}([0,1]) and all n ≥ 2,

| n^{-1} ∑_{i=1}^{n} f(t_i) − ∫_0^1 f(t) dt | ≤ c_m (n λ^{1/(2m)})^{-1} ||f||_{m,λ}.

Proof. The proof is given in Eggermont and LaRiccia (2009, Lemma 2.27, Chapter 13), with h in their notation equivalent to λ^{1/(2m)} in ours.

We now turn to the proof of Theorem 1. For ease of notation we shall henceforth denote all generic positive constants by c_0; the value of c_0 may thus change from appearance to appearance.

Proof of Theorem 1. Let L_n(g) denote the objective function, i.e.,

L_n(g) = n^{-1} ∑_{i=1}^{n} l_α(Y_i, g(t_i)) + λ ||g^{(m)}||²,

and, as in the text, denote the true function by g_0(t). We will show that for every ε > 0 there exists a sufficiently large D = D_ε such that

lim inf_{n→∞} Pr( inf_{||g||_{m,λ} = D} L_n(g_0 + C_n^{1/2} g) > L_n(g_0) ) ≥ 1 − ε,   (10)

where C_n = n^{-1} λ^{-1/(2m)} + λ. Provided that we can check the conditions of Lemma 1, this would imply the existence of a (local) minimizer in the ball {f ∈ W^{m,2}([0,1]) : ||f − g_0||_{m,λ} ≤ D C_n^{1/2}} with probability at least 1 − ε, which in turn would establish the result of Theorem 1. To apply Lemma 1 we need to check the weak lower semicontinuity of L_n(g), as the fact that the densities are assumed bounded implies that L_n(g) is bounded from below. Let g_k → g weakly in W^{m,2}([0,1]) as k → ∞ and let R_{m,λ} denote the reproducing kernel of W^{m,2}([0,1]) for the chosen inner product. Then the reproducing property and the definition of weak convergence imply

lim_{k→∞} g_k(t_i) = lim_{k→∞} ⟨g_k, R_{m,λ}(·, t_i)⟩_{m,λ} = ⟨g, R_{m,λ}(·, t_i)⟩_{m,λ} = g(t_i), i = 1, ..., n.

At the same time, by the Hahn-Banach theorem (Rynne and Youngson, 2008), norms are weakly lower semicontinuous, and therefore ||g^{(m)}||² ≤ lim inf_{k→∞} ||g_k^{(m)}||². Combining these two observations yields

L_n(g) ≤ lim inf_{k→∞} L_n(g_k),

which is equivalent to weak lower semicontinuity. To establish (10), use the fundamental theorem of calculus to decompose L_n(g_0 + C_n^{1/2} g) − L_n(g_0) as I_1(g) + I_2(g) + I_3(g), with

I_1(g) = n^{-1} ∑_{i=1}^{n} ∫_0^{C_n^{1/2} g(t_i)} {l′_α(Y_i, g_0(t_i) + u) − l′_α(Y_i, g_0(t_i))} du + λ C_n ||g^{(m)}||²,
I_2(g) = C_n^{1/2} n^{-1} ∑_{i=1}^{n} g(t_i) l′_α(Y_i, g_0(t_i)),
I_3(g) = 2 λ C_n^{1/2} ⟨g_0^{(m)}, g^{(m)}⟩.

Here and in the sequel we write l′_α(y, θ) = ∂ l_α(y, θ)/∂θ and define higher-order derivatives in the same manner. Our proof consists of showing that

inf_{||g||_{m,λ}=D} E{I_1(g)} ≥ c_0 D² C_n,   (11)
sup_{||g||_{m,λ}≤D} |I_1(g) − E{I_1(g)}| = o_P(C_n),   (12)
sup_{||g||_{m,λ}≤D} |I_2(g)| = O_P(D C_n),   (13)
sup_{||g||_{m,λ}≤D} |I_3(g)| ≤ c_0 D C_n,   (14)

for a strictly positive c_0. In combination, (11)-(14) imply that for sufficiently large D the term E{I_1(g)} is positive and dominates all other terms; thus (10) holds and the theorem follows. We first show (14).
For this, observe that by the Schwarz inequality we have

|I_3(g)| ≤ 2 λ C_n^{1/2} ||g_0^{(m)}|| ||g^{(m)}|| ≤ 2 ||g_0^{(m)}|| C_n^{1/2} λ^{1/2} ||g||_{m,λ} ≤ c_0 D C_n,

since λ^{1/2} ≤ C_n^{1/2} by the definition of C_n. The bound in (14) now follows. We next show (13). Let us begin by noting that the embedding (7) implies the existence of a symmetric function R_{m,λ}: [0,1]² → ℝ, the reproducing kernel, such that for every y ∈ [0,1] the map x ↦ R_{m,λ}(x, y) belongs to W^{m,2}([0,1]) and f(x) = ⟨R_{m,λ}(x, ·), f⟩_{m,λ} for all f ∈ W^{m,2}([0,1]). Using these facts, the Schwarz inequality and (7), we obtain

||R_{m,λ}(x, ·)||²_{m,λ} = R_{m,λ}(x, x) ≤ c_0 λ^{-1/(4m)} ||R_{m,λ}(x, ·)||_{m,λ}.

Divide both sides by ||R_{m,λ}(x, ·)||_{m,λ} to get

||R_{m,λ}(x, ·)||_{m,λ} ≤ c_0 λ^{-1/(4m)}.   (15)

Now, using the linearity of the inner product,

I_2(g) = C_n^{1/2} ⟨g, n^{-1} ∑_{i=1}^{n} l′_α(Y_i, g_0(t_i)) R_{m,λ}(t_i, ·)⟩_{m,λ}.

Therefore, by the Schwarz inequality,

|I_2(g)| ≤ C_n^{1/2} ||g||_{m,λ} || n^{-1} ∑_{i=1}^{n} l′_α(Y_i, g_0(t_i)) R_{m,λ}(t_i, ·) ||_{m,λ}.

The random variables l′_α(Y_i, g_0(t_i)), i = 1, ..., n, are independent and, since g_0(t_i) minimizes θ ↦ E{l_α(Y_i, θ)}, they have mean zero; moreover, E{|l′_α(Y_i, g_0(t_i))|²} ≤ τ for some τ ∈ (0, ∞) that, by (A4), does not depend on either i or n. Thus,

E{ || n^{-1} ∑_i l′_α(Y_i, g_0(t_i)) R_{m,λ}(t_i, ·) ||²_{m,λ} } = n^{-2} ∑_i E{|l′_α(Y_i, g_0(t_i))|²} ||R_{m,λ}(t_i, ·)||²_{m,λ} ≤ c_0 τ n^{-1} λ^{-1/(2m)},

where the last inequality follows by (15). Markov's inequality now leads to

sup_{||g||_{m,λ} ≤ D} |I_2(g)| = O_P( D C_n^{1/2} (n^{-1} λ^{-1/(2m)})^{1/2} ) = O_P(D C_n),

as, by definition of C_n, n^{-1} λ^{-1/(2m)} ≤ C_n. This establishes (13). We now establish (11). For this, let us first note that, by the Schwarz inequality and (15),

max_{i≤n} C_n^{1/2} |g(t_i)| ≤ c_0 C_n^{1/2} λ^{-1/(4m)} ||g||_{m,λ} → 0,   (16)

as n → ∞, by our limit assumption n^m λ → ∞. Fubini's theorem yields

E{I_1(g)} = n^{-1} ∑_{i=1}^{n} { m_i(C_n^{1/2} g(t_i)) − m_i(0) } + λ C_n ||g^{(m)}||²,

where m_i(u) := E{l_α(Y_i, g_0(t_i) + u)}. Assumptions (A2)-(A3) entail that each m_i admits a first-order Taylor expansion at u = 0 with m_i′(0) = 0, where the interchange of expectation and differentiation is justified by (A2) and (A3). Furthermore, by the mean-value theorem there exists an s_i with |s_i| ≤ 1 such that

m_i(u) − m_i(0) = (u²/2) m_i″(s_i u) = (u²/2) ξ_i {1 + o(1)}, as u → 0,

uniformly in i and n, by conditions (A2), (A3) and (16), where ξ_i denotes the curvature in (A4). Combining these facts and approximating the sum from below with the help of Lemma 2, we find

E{I_1(g)} ≥ (ξ/2) C_n ||g||² {1 + o(1)} + λ C_n ||g^{(m)}||² ≥ c_0 C_n ||g||²_{m,λ} {1 + o(1)},

since nλ^{1/(2m)} → ∞ by our conditions; here we have set ξ = inf_n min_{i≤n} ξ_i and used the definition of ||·||_{m,λ}. Note that, by (A4), ξ > 0. Taking the infimum over the D-sphere,

inf_{||g||_{m,λ}=D} E{I_1(g)} ≥ c_0 D² C_n {1 + o(1)},

for a strictly positive c_0. The lower bound in (11) now follows. To complete the proof we show that the remainder term I_1(g) − E{I_1(g)} is asymptotically negligible under the assumed limit conditions, which we do with an empirical process argument. Recall that the t_i are fixed and Y_i ~ F_{θ_{0,i}}, so that the average distribution of the pairs (t_i, Y_i) is P̄ = n^{-1} ∑_i δ_{t_i} ⊗ F_{θ_{0,i}}; let P_n denote the empirical measure placing mass n^{-1} on each pair (t_i, Y_i). Then, adopting the notation of van de Geer (2000, Chapters 5 and 8), we have

I_1(g) − E{I_1(g)} = ∫ h_g d(P_n − P̄) = n^{-1/2} v_n(h_g),

where v_n(·) denotes the empirical process and h_g is the function [0,1] × ℝ → ℝ given by

h_g(t, y) := ∫_0^{C_n^{1/2} g(t)} {l′_α(y, g_0(t) + u) − l′_α(y, g_0(t))} du,

for each g ∈ B_D := {f ∈ W^{m,2}([0,1]) : ||f||_{m,λ} ≤ D}. This class of functions depends on n, but we suppress this dependence for notational convenience. We will apply Theorem 5.11 of van de Geer (2000) to this empirical process, adapted for independent but not identically distributed random variables (t_i, Y_i); see the remarks in van de Geer (2000, pp. 131-132). For this we first derive a uniform bound on h_g and a bound on its L²(P̄)-norm. For the former, note that since sup_{g∈B_D} ||g||_∞ ≤ c_0 D λ^{-1/(4m)}, we have C_n^{1/2}|g(t)| → 0 uniformly in t and g. Furthermore, for every |u| < ε the mean-value theorem reveals

|l′_α(y, g_0(t) + u) − l′_α(y, g_0(t))| ≤ M |u|,

where ε and M are the constants in (A2) and (A3), since sup_{y∈χ, |θ−θ_{0,i}|<ε} |l″_α(y, θ)| is uniformly bounded in i and n. This observation leads to

sup_{t,y} |h_g(t, y)| ≤ c_0 C_n ||g||²_∞ ≤ c_0 C_n λ^{-1/(2m)},

where for the last inequality we have used (7). It follows that we may take K := c_0 C_n λ^{-1/(2m)} in Lemma 5.8 of van de Geer (2000). Similarly, the Cauchy-Schwarz inequality and the Lipschitz property of the score function yield

∫ |h_g|² dP̄ ≤ c_0 C_n² n^{-1} ∑_{i=1}^{n} |g(t_i)|⁴ ≤ c_0 λ^{-1/(2m)} C_n²,

where in the last step we have used Lemma 2 in order to bound n^{-1} ∑_i |g(t_i)|².
Thus, we may take R = c_0 λ^{-1/(4m)} C_n in Lemma 5.8 of van de Geer (2000). It follows from that lemma that, for our choice of K, ρ_K(h_g)² ≤ c_0 R², where ρ_K(h_g) denotes the Bernstein norm of h_g. Furthermore, writing H_{B,K}(δ, {h_g, g ∈ B_D}, P̄) for the δ-generalized entropy with bracketing in the Bernstein norm ρ_K, Lemma 5.10 of van de Geer (2000) reveals that

H_{B,K}(δ, {h_g, g ∈ B_D}, P̄) ≤ H_B(c_0 δ, {h_g, g ∈ B_D}, P̄),

where H_B(δ, {h_g, g ∈ B_D}, P̄) stands for the usual L²(P̄) δ-entropy with bracketing. We next derive a bound for H_B(c_0 δ, {h_g, g ∈ B_D}, P̄). Observe that for any (g_1, g_2) ∈ B_D × B_D the Lipschitz property of the score function yields

|h_{g_1}(t, y) − h_{g_2}(t, y)| ≤ c_0 C_n λ^{-1/(4m)} ||g_1 − g_2||_∞,

where we have used the uniform boundedness of the local increments of the score function for α > 0. It now follows from Theorem 2.7.11 of van der Vaart and Wellner (1996) that the bracketing entropy of {h_g, g ∈ B_D} is bounded by the uniform-norm entropy of B_D at the correspondingly rescaled radius. Combining these bounds, we may bound the bracketing integral in Theorem 5.11 of van de Geer (2000). Now fix ε > 0 and take a = n^{1/2} C_n in that theorem. We need to check that a dominates both the bracketing integral and c_0 R, for a sufficiently large positive constant c_0. The latter condition is satisfied whenever n^{1/2} λ^{1/(4m)} → ∞, equivalently n^{2m} λ → ∞; our assumption n^m λ → ∞ as n → ∞ ensures that this condition is satisfied. The former condition is also satisfied under this limit assumption. Therefore, setting C_1 = C_0 in Theorem 5.11 of van de Geer (2000) yields, for all large n,

Pr( sup_{g∈B_D} |I_1(g) − E{I_1(g)}| ≥ C_n ) ≤ c_0 exp(−c_0 n λ^{1/(2m)}).

The result follows, as nλ^{1/(2m)} → ∞ for n → ∞. We have thus established (12), completing the proof.

Proof of Corollary 1. The proof may be deduced from Eggermont and LaRiccia (2009, Chapter 13, Lemma 2.17), which establishes the embedding

||f^{(j)}|| ≤ c_0 λ^{-j/(2m)} ||f||_{m,λ}, for all j ≤ m and f ∈ W^{m,2}([0,1]),

with c_0 depending only on j and m. Since W^{m,2}([0,1]) is a vector space, Theorem 1 now implies that, for any 1 ≤ j ≤ m,

||ĝ_n^{(j)} − g_0^{(j)}||² = O_P(λ^{-j/m} C_n) = O_P(n^{-2(m−j)/(2m+1)}),

for λ ≍ n^{-2m/(2m+1)}. The result follows.

For the proofs of this section we will use the analogue of Lemma 2 for spline functions. The proof of Lemma 3 below is given in Zhou et al. (1998, Lemma 6.1).

Lemma 3. Assume (B5)-(B7). Then there exist constants 0 < c_1 < c_2 < ∞ (independent of n and K) such that, for any f ∈ S_K^p([0,1]) and n ≥ n_0,

c_1 ||f||² ≤ n^{-1} ∑_{i=1}^{n} f²(t_i) ≤ c_2 ||f||².

We first provide the proof of Proposition 1.

Proof of Proposition 1. For any f = ∑_{j=1}^{K+p} f_j B_j ∈ S_K^p([0,1]) we have

sup_{t∈[0,1]} |f(t)| ≤ max_j |f_j|,

where we have used the facts that 0 ≤ B_j(t) ≤ 1 and ∑_j B_j(t) = 1 for every t ∈ [0,1]; see de Boor (2001, p. 96). Now, by Lemma 6.1 in Zhou et al. (1998), there exists a positive constant c_1 depending only on Q such that

max_j |f_j|² ≤ c_1 K ∫_0^1 f²(t) dQ(t) ≤ c_0 K ||f||² ≤ c_0 K ||f||²_{m,λ},

where the second inequality follows from (B7) and the third inequality from the positivity of λ. Combining this with (7) establishes the claimed bound sup_{t∈[0,1]} |f(t)| ≤ c_0 min{K^{1/2}, λ^{-1/(4m)}} ||f||_{m,λ}, as asserted.

For the proof of Theorem 2 we will need the following approximation lemma; its proof may be found in de Boor (2001, pp. 145-149).

Lemma 4. For each f ∈ C^j([0,1]) there exists a spline function s_f of order p with p > j such that

sup_{t∈[0,1]} |f(t) − s_f(t)| ≤ c_0 x̄^j ω(f^{(j)}; x̄),

where the x_i are the knots, x̄ = max_i |x_i − x_{i−1}| is the maximum distance between adjacent knots, ω(·;·) denotes the modulus of continuity, and the constant c_0 depends only on p and j.

We now turn to the proof of Theorem 2.

Proof of Theorem 2. As previously, we denote the objective function in (6) by L_n(g), that is,

L_n(g) = n^{-1} ∑_{i=1}^{n} l_α(Y_i, g(t_i)) + λ ||g^{(m)}||²

for every g ∈ S_K^p([0,1]), and we let g_0 ∈ C^j([0,1]) denote the true function. Furthermore, let s_{g_0} denote the spline approximation to g_0 constructed with the help of Lemma 4.
Lemma 4 and assumption (B6) imply that

sup_{t∈[0,1]} |g_0(t) − s_{g_0}(t)| ≤ c_0 K^{-j}.   (18)

Hence, we may write s_{g_0} = g_0 + R with sup_{t∈[0,1]} |R(t)| ≤ c_0 K^{-j}. The theorem will be proven if we establish that for every ε > 0 there exists a sufficiently large D = D_ε such that

lim inf_{n→∞} Pr( inf_{||g||_{m,λ}=D} L_n(s_{g_0} + C_n^{1/2} g) > L_n(s_{g_0}) ) ≥ 1 − ε,   (19)

where C_n = n^{-1} min{K, λ^{-1/(2m)}} + min{λ² K^{2m}, λ} + K^{-2j}, as an application of Lemma 1 would then yield the existence of a ĝ_n ∈ S_K^p([0,1]) such that ||ĝ_n − s_{g_0}||_{m,λ} ≤ D C_n^{1/2} with probability at least 1 − ε. Note that Lemma 1 is applicable here because S_K^p([0,1]) is a finite-dimensional Hilbert space under ⟨·,·⟩_{m,λ}, whence weak and strong continuity as well as weak and strong convergence are equivalent. Now, by (18), ||ĝ_n − g_0||² ≤ 2||ĝ_n − s_{g_0}||² + 2||R||² ≤ c_0 D² C_n for all large D with probability at least 1 − ε. Thus, the result of Theorem 2 follows upon proving (19). To establish (19) we decompose L_n(s_{g_0} + C_n^{1/2} g) − L_n(s_{g_0}) as I_1(g) + I_2(g) + I_3(g), with

I_1(g) = n^{-1} ∑_{i=1}^{n} ∫_{R(t_i)}^{R(t_i) + C_n^{1/2} g(t_i)} {l′_α(Y_i, g_0(t_i) + u) − l′_α(Y_i, g_0(t_i))} du + λ C_n ||g^{(m)}||²,
I_2(g) = C_n^{1/2} n^{-1} ∑_{i=1}^{n} g(t_i) l′_α(Y_i, g_0(t_i)),
I_3(g) = 2 λ C_n^{1/2} ⟨s_{g_0}^{(m)}, g^{(m)}⟩.

We will show that

inf_{||g||_{m,λ}=D} E{I_1(g)} ≥ c_0 D² C_n − c_1 D C_n,   (20)
sup_{||g||_{m,λ}≤D} |I_1(g) − E{I_1(g)}| = o_P(C_n),   (21)
sup_{||g||_{m,λ}≤D} |I_2(g)| = O_P(D C_n),   (22)
sup_{||g||_{m,λ}≤D} |I_3(g)| ≤ c_0 D C_n,   (23)

for strictly positive constants c_0 and c_1. These are sufficient for Theorem 2 to hold since, for large enough D satisfying D ≥ 1, E{I_1(g)} will be positive and dominate all other terms in the decomposition, guaranteeing (19). Beginning with (23), the Schwarz inequality immediately yields

|I_3(g)| ≤ 2 λ C_n^{1/2} ||s_{g_0}^{(m)}|| ||g^{(m)}||.

Here we use the boundedness of ||s_{g_0}^{(m)}||; see de Boor (2001, p. 155). By definition of ||·||_{m,λ}, λ^{1/2} ||g^{(m)}|| ≤ ||g||_{m,λ}. At the same time, for every g ∈ S_K^p([0,1]) with g = ∑_j g_j B_j,

||g^{(m)}|| = ||D_m g|| ≤ c_0 K^m ||g||,

where D_m denotes the m-th order differentiation operator on S_K^p([0,1]); the inequality follows from Lemma 5.2 of Cardot (2002) and the argument in the proof of Lemma 3. Combining these two bounds, we find

λ ||g^{(m)}|| ≤ c_0 min{λ K^m, λ^{1/2}} ||g||_{m,λ},

so that |I_3(g)| ≤ c_0 D C_n, as, by definition of C_n, min{λ K^m, λ^{1/2}} ≤ C_n^{1/2}. Next, using the reproducing property and the Schwarz inequality, we obtain

|I_2(g)| ≤ C_n^{1/2} ||g||_{m,λ} || n^{-1} ∑_i l′_α(Y_i, g_0(t_i)) R_{m,K,λ}(t_i, ·) ||_{m,λ}.

The random variables l′_α(Y_i, g_0(t_i)), i = 1, ..., n, are independent and, since g_0(t_i) minimizes θ ↦ E{l_α(Y_i, θ)}, they have mean zero; moreover, E{|l′_α(Y_i, g_0(t_i))|²} ≤ τ for some τ ∈ (0, ∞) that, by (A4), does not depend on either i or n. Thus, arguing as in the proof of Theorem 1 but with the kernel bound of Proposition 1,

E{ || n^{-1} ∑_i l′_α(Y_i, g_0(t_i)) R_{m,K,λ}(t_i, ·) ||²_{m,λ} } ≤ c_0 τ n^{-1} min{K, λ^{-1/(2m)}},   (24)

and Markov's inequality leads to sup_{||g||_{m,λ}≤D} |I_2(g)| = O_P(D C_n), as, by definition of C_n, n^{-1} min{K, λ^{-1/(2m)}} ≤ C_n. This establishes (22). We now establish the lower bound on E{I_1(g)} required in (20). First note that max_{i≤n} |R(t_i)| = o(1) as n → ∞, by (18) and (B5). Expanding as in the proof of Theorem 1,

E{I_1(g)} ≥ c_0 C_n ||g||² {1 + o(1)} + λ C_n ||g^{(m)}||² − |Rem(g)| ≥ c_0 C_n ||g||²_{m,λ} {1 + o(1)} − |Rem(g)|,

where Rem(g) collects the cross terms involving R, the second-to-last inequality follows from (A4) and Lemma 3, and the last from the definition of ||·||_{m,λ}. To bound the remainder term, use (18) and the Schwarz inequality to obtain

|Rem(g)| ≤ c_0 C_n^{1/2} K^{-j} ( n^{-1} ∑_i |g(t_i)|² )^{1/2} ≤ c_1 D C_n,

for some 0 < c_1 < ∞; to derive the last inequality we have used K^{-j} ≤ C_n^{1/2}, Lemma 3 and the inequality ||g|| ≤ ||g||_{m,λ}. Combining the lower and upper bounds now yields

inf_{g∈S_K^p([0,1]): ||g||_{m,λ}=D} E{I_1(g)} ≥ c_0 D² C_n − c_1 D C_n,

which is precisely (20). To complete the proof we now show (21), largely adopting the notation of the proof of Theorem 1. In this notation we may write I_1(g) − E{I_1(g)} = ∫ h_g d(P_n − P̄) = n^{-1/2} v_n(h_g), where v_n(·) denotes the empirical process and h_g is the function [0,1] × ℝ → ℝ given by

h_g(t, y) := ∫_{R(t)}^{R(t) + C_n^{1/2} g(t)} {l′_α(y, g_0(t) + u) − l′_α(y, g_0(t))} du,

for each g ∈ B_D := {f ∈ S_K^p([0,1]) : ||f||_{m,λ} ≤ D}. This class of functions depends on n, but we suppress this dependence for notational convenience. We derive a uniform bound on h_g and a bound on its L²(P̄)-norm. For the former, note that since sup_{g∈B_D} ||g||_∞ ≤ c_0 D K^{1/2} by Proposition 1, our limit assumptions ensure that C_n^{1/2}|g(t)| → 0 uniformly in t ∈ [0,1] and g ∈ B_D. At the same time, sup_{t∈[0,1]} |R(t)| → 0 by Lemma 4 and (B5).
To complete the proof we now show (12); for this we largely adopt the notation from the proof of Theorem 1. In this notation we may write
\[
I_1(g) - E\{I_1(g)\} = \int h_g\, d(P_n - \bar{P}) = n^{-1/2} v_n(h_g),
\]
where $v_n(\cdot)$ denotes the empirical process and $h_g$ is the function $[0,1] \times \mathbb{R} \to \mathbb{R}$ given by
\[
h_g(t, y) = \int_0^{R(t) + C_n^{1/2} g(t)} \bigl\{ l_\alpha'(y, g_0(t) + u) - l_\alpha'(y, g_0(t)) \bigr\}\, du,
\]
for each $g \in B_D := \{f \in S_K^p([0,1]) : \|f\|_{m,\lambda} \le D\}$. This class of functions depends on $n$, but we suppress this dependence for notational convenience. We derive a uniform bound on $h_g$ and a bound on its $L_2(\bar{P})$-norm. For the former, note that since $\sup_{g \in B_D} \|g\|_\infty \le c_0 D K^{1/2}$, our limit assumptions ensure that $C_n^{1/2}|g(t)| \to 0$ uniformly in $t \in [0,1]$ and $g \in B_D$. At the same time, $\sup_{t \in [0,1]} |R(t)| \to 0$, by Lemma 4 and (B5). Furthermore, for every $|u| < \min\{\delta_1, \delta_2\}$, where $\delta_1$ and $\delta_2$ denote the constants in (B2) and (B3), the mean-value theorem reveals
\[
|l_\alpha'(y, g_0(t) + u) - l_\alpha'(y, g_0(t))| \le |u| \sup_{y \in \chi,\, |\theta - \theta_{0,i}| < \min\{\delta_1, \delta_2\}} |l_\alpha''(y, \theta)|,
\]
and the supremum on the right-hand side is uniformly bounded in $i$ and $n$. It follows that
\[
\sup_{t, y} |h_g(t, y)| \le c_0 \bigl( \|R\|_\infty + C_n^{1/2} \|g\|_\infty \bigr)^2 \le c_0 K C_n,
\]
so that we may take $\bar{K} := c_0 K C_n$ as the uniform bound in Lemma 5.8 of van de Geer (2000). (We write $\bar{K}$ for this bound to avoid confusion with the number of interior knots, $K$, in $S_K^p([0,1])$.) Similarly, the Cauchy–Schwarz inequality and the local Lipschitz property of the score function implied by (25) yield
\[
\int |h_g|^2\, d\bar{P} \le c_0 K^{1/2} C_n^{1/2}\, \frac{1}{n} \sum_{i=1}^n \bigl\{ |R(t_i)|^3 + C_n^{3/2} |g(t_i)|^3 \bigr\} \le c_0 K C_n^2,
\]
where in the last step we have used Proposition 1, Lemma 3 and Lemma 4. Thus, we may take $R = c_0 K^{1/2} C_n$ in Lemma 5.8 of van de Geer (2000). From that lemma, $\rho_{\bar{K}}^2(h_g) \le c_0 R^2$, where $\rho_{\bar{K}}(h_g)$ denotes the Bernstein norm of $h_g$, as in the proof of Theorem 1. Furthermore, using $H_{B,\bar{K}}(\delta, \{h_g : g \in B_D\}, \bar{P})$ to denote the $\delta$-generalized entropy with bracketing in the Bernstein norm $\rho_{\bar{K}}$, Lemma 5.10 of van de Geer (2000) reveals that
\[
H_{B,\bar{K}}(\delta, \{h_g : g \in B_D\}, \bar{P}) \le H_B(c_0 \delta, \{h_g : g \in B_D\}, \bar{P}),
\]
where $H_B(\delta, \{h_g : g \in B_D\}, \bar{P})$ stands for the usual $L_2(\bar{P})$ $\delta$-entropy with bracketing. We next derive a bound for $H_B(c_0\delta, \{h_g : g \in B_D\}, \bar{P})$. Observe that for any $(g_1, g_2) \in B_D \times B_D$ we have
\[
|h_{g_1}(t, y) - h_{g_2}(t, y)| = \Bigl| \int_{R(t) + C_n^{1/2} g_2(t)}^{R(t) + C_n^{1/2} g_1(t)} \bigl\{ l_\alpha'(y, g_0(t) + u) - l_\alpha'(y, g_0(t)) \bigr\}\, du \Bigr| \le c_0 C_n^{1/2} K^{1/2} \|g_1 - g_2\|,
\]
where we have used the uniform boundedness of the local increments of the score function for $\alpha > 0$. It now follows from Theorem 2.7.11 of van der Vaart and Wellner (1996) that
\[
H_B(\delta, \{h_g : g \in B_D\}, \bar{P}) \le H\bigl( c_0 \delta / (K^{1/2} C_n^{1/2}), B_D \bigr),
\]
where, for any $\delta > 0$, $H(\delta, B_D)$ denotes the $\delta$-entropy of $B_D$. Since, by construction, $\|f\| \le \|f\|_{m,\lambda}$, Corollary 2.6 of van de Geer (2000) implies that
\[
H\bigl( c_0 \delta / (K^{1/2} C_n^{1/2}), B_D \bigr) \le (K + p) \log\Bigl( \frac{c_0 K^{1/2} C_n^{1/2}}{\delta} + 1 \Bigr).
\]
Combining these bounds, we may now bound the bracketing integral in Theorem 5.11 of van de Geer (2000) as follows:
\[
\int_0^{C_n K^{1/2}} H_{B,\bar{K}}^{1/2}\bigl(u, \{h_g : g \in B_D\}, \bar{P}\bigr)\, du
\le (K + p)^{1/2} \int_0^{C_n K^{1/2}} \log^{1/2}\Bigl( \frac{c_0 K^{1/2} C_n^{1/2}}{u} + 1 \Bigr)\, du
\le c_0 K C_n^{1/2} \int_0^{c_0 C_n^{1/2}} \log^{1/2}\bigl( 1/u + 1 \bigr)\, du
\le c_0 K C_n \log^{1/2}(n),
\]
as $\int_0^\delta \log^{1/2}(1 + 1/u)\, du = O(\delta \log^{1/2}(1/\delta))$ for $\delta \to 0$ (a short verification is given in the remark after the proof) and $\log(1/C_n^{1/2}) \le c_0 \log(n)$. Now, fix $\epsilon > 0$ and take $a = n^{1/2}C_n$ in that theorem. We need to check the conditions
\[
n^{1/2} C_n \ge c_0 K C_n \log^{1/2}(n) \quad \text{and} \quad n^{1/2} C_n \ge c_0 K^{1/2} C_n,
\]
for a sufficiently large positive constant $c_0$. Since, by assumption (B5), $K/n \to 0$ as $n \to \infty$, the second condition is satisfied. Furthermore, by the assumptions of the theorem, there exists a $\delta > 0$ such that $n^{\delta - 1} K^2 \to 0$ as $n \to \infty$, yielding
\[
n^{-1} K^2 \log(n) = n^{\delta - 1} K^2\, n^{-\delta} \log(n) \to 0, \quad n \to \infty.
\]
This shows that the first condition of the theorem is also satisfied. Therefore, setting $C_1 = C_0$ in Theorem 5.11 of van de Geer (2000) yields
\[
\Pr\Bigl( \sup_{g \in B_D} |I_1(g) - E\{I_1(g)\}| \ge C_n \Bigr) = \Pr\Bigl( \sup_{g \in B_D} |v_n(h_g)| \ge n^{1/2} C_n \Bigr) \le c_0 \exp\bigl[ -c_0\, n/K \bigr]
\]
for all large $n$. The result follows, as, by (B5), $n/K \to \infty$ for $n \to \infty$. The proof is complete.
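The elementary integral bound invoked above (and in the corresponding step of the proof of Theorem 1) can be verified directly for $0 < \delta \le 1$; the derivation below is a supplementary worked display, with all steps standard:

```latex
% Verification that \int_0^\delta \log^{1/2}(1 + 1/u)\,du = O(\delta\log^{1/2}(1/\delta)).
% For 0 < u <= \delta <= 1 we have 1 + 1/u <= 2/u; substituting u = \delta v and
% using \sqrt{a + b} \le \sqrt{a} + \sqrt{b} then gives
\[
  \int_0^\delta \log^{1/2}\!\Bigl(1 + \frac{1}{u}\Bigr)\, du
  \le \delta \int_0^1 \log^{1/2}\!\Bigl(\frac{2}{\delta v}\Bigr)\, dv
  \le \delta\, \log^{1/2}\!\Bigl(\frac{2}{\delta}\Bigr)
    + \delta \int_0^1 \log^{1/2}\!\Bigl(\frac{1}{v}\Bigr)\, dv,
\]
\[
  \int_0^1 \log^{1/2}\!\Bigl(\frac{1}{v}\Bigr)\, dv
  = \int_0^\infty s^{1/2} e^{-s}\, ds
  = \Gamma(3/2) = \frac{\sqrt{\pi}}{2},
\]
% so the left-hand side is at most \delta\log^{1/2}(2/\delta) + \delta\sqrt{\pi}/2,
% which is O(\delta \log^{1/2}(1/\delta)) as \delta -> 0.
```

With $\delta = c_0 C_n^{1/2}$ this yields the factor $C_n^{1/2} \log^{1/2}(n)$ used in the proof, since $\log(1/C_n^{1/2}) \le c_0 \log(n)$.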
Finally, we provide the proof of Corollary 2.

Proof of Corollary 2. Inspection of the proof of Theorem 2 reveals that we have in fact shown the stronger statement $\|\widehat{g}_n - s_{g_0}\|_{m,\lambda}^2 = O_P(C_n)$, for
\[
C_n = n^{-1} \min\{K, \lambda^{-1/2m}\} + \min\{\lambda^2 K^{2m}, \lambda\} + K^{-2m}.
\]
Thus, by Lemma 4 and the triangle inequality,
\[
\|\widehat{g}_n - g_0\| \le \|\widehat{g}_n - s_{g_0}\| + \|s_{g_0} - g_0\| = O_P(C_n^{1/2}) + O(K^{-m}) = O_P(C_n^{1/2}),
\]
and the result follows.

References

Robust fitting for generalized additive models for location, scale and shape
An Outlier-Robust Fit for Generalized Additive Models With Applications to Disease Outbreak Detection
On the "optimal" density power divergence tuning parameter
Robust and efficient estimation by minimising a density power divergence
On a robust local estimator for the scale function in heteroscedastic nonparametric regression
Resistant selection of the smoothing parameter for smoothing splines
Robust Inference for Generalized Linear Models
Spatially Adaptive Splines for Statistical Linear Inverse Problems
Model Selection and Model Averaging
Asymptotic properties of penalised spline estimators
Robust Locally Weighted Regression and Smoothing Scatterplots
Asymptotic Analysis of Penalized Likelihood and Related Estimators
Robust Estimation of Mean and Dispersion Functions in Extended Generalized Additive Models
On the Mathematical Foundations of Learning
A Practical Guide to Splines
Maximum Penalized Likelihood Estimation
Flexible smoothing with B-splines and penalties
Nonparametric Regression and Spline Smoothing
Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties
Robust estimation of error scale in nonparametric regression models
Robust estimation for independent non-homogeneous observation using density power divergence with applications to linear regression
Robust estimation for non-homogeneous data and the selection of the optimal tuning parameter: the density power divergence approach
Robust estimation in generalized linear models: the density power divergence approach
Nonparametric Regression and Generalized Linear Models
Smoothing Spline ANOVA Models
Robust Statistics: The Approach Based on Influence Functions
Generalized Additive Models
The Elements of Statistical Learning: Data Mining, Inference, and Prediction
Robust Statistics
Asymptotics for M-type smoothing splines with non-smooth objective functions
Robust penalized spline estimation with difference penalties
Some asymptotic results on generalized penalized spline smoothing
Penalized Quasi-Likelihood Estimation in Partial Linear Models
Robust Statistics: Theory and Methods
Generalized Linear Models
Numerical Optimization
A statistical perspective of ill-posed problems
Semiparametric regression
Linear functional analysis
Parametric Statistical Modeling by Minimum Integrated Square Error
Empirical Processes in M-Estimation
Weak convergence and empirical processes
Asymptotic Statistics
Spline models for observational data
On semiparametric regression with O'Sullivan penalized splines
Robust Variable Selection With Exponential Squared Loss
Robust Estimation for Generalized Additive Models
Choosing a robustness tuning parameter
Generalized Additive Models
Asymptotic theory of penalized splines
Local Asymptotics for Regression Splines and Confidence Regions
Composite quantile regression and the oracle model selection theory

Acknowledgements

We thank Professor Alfio Marazzi (Lausanne University Hospital) for providing the length-of-hospital-stay data. The research of I. Kalogridis was supported by the Research Foundation–Flanders (project 1221122N); this support is gratefully acknowledged. G. Claeskens acknowledges support from the KU Leuven Research Fund (C1-project C16/20/002).