key: cord-0181167-5ojknauc
authors: Sousa, Alex Rodrigo dos S.; Severino, Magno T.F.; Leonardi, Florencia G.
title: Model selection criteria for regression models with splines and the automatic localization of knots
date: 2020-06-04
journal: nan
DOI: nan
sha: 27bd0f2e47b4206ebde367bc97401b22ca1a0aec
doc_id: 181167
cord_uid: 5ojknauc

In this paper we propose a model selection approach to fit a regression model using splines with a variable number of knots. We introduce a penalized criterion to estimate the number and the positions of the knots at which to anchor the spline basis. The method is evaluated on simulated data and applied to Covid-19 daily reported cases for short-term prediction.

Spline regression is one of the most useful tools for nonparametric and semiparametric regression and has been extensively developed over the last decades. The model works by selecting knots on the function domain to define a partition and then fitting a fixed-degree polynomial on each segment, in such a way that these polynomials match smoothly at the internal knots through continuity conditions imposed on their derivatives. The result is typically a smooth fitted curve, which accurately estimates a large variety of smooth functions. Further, spline regression offers great flexibility in curve estimation problems through convenient choices of the degree and the internal knots and, as generally occurs in nonparametric regression by basis expansion, the infinite-dimensional estimation problem becomes a finite parametric estimation problem, since the spline function is determined by the estimation of its coefficients. For more details about spline-based methods, their theory and applications in Statistics, see Wahba (1990), Ramsay (2004), Friedman et al. (2001), and James et al. (2013). One of the main issues in spline regression is the selection of the number and positions of the knots.
In some real data analyses, wrong specification of the knots can have a high impact on the performance of the spline regression. An excessive number of knots may lead to overfitting, while an insufficient number can lead to underfitting. Moreover, the positions of the knots should be well determined, since a high concentration of knots on a specific interval of the domain overfits the data on that region and underfits the data outside it. Thus, a good knot selection procedure should provide an optimal number and location of knots, in order to obtain a well-performing spline fit. In practice, the knots are often selected manually, and few studies in the literature provide a systematic method to overcome this difficulty. Friedman and Silverman (1989), Friedman (1991) and Stone et al. (1997) studied stepwise methods over a set of candidate knots. Osborne et al. (1998) proposed an algorithm based on the LASSO estimator to estimate the locations of the knots. Ulker and Arslan (2009) proposed an artificial immune system to perform knot selection in curve and surface B-spline estimation, treating the candidate knot locations as antibodies. Bayesian methods were proposed by Denison et al. (1998), who introduced a joint prior distribution for the number and locations of the knots, and by Biller (2000), who gave an adaptive Bayesian approach for knot selection based on reversible jump Markov chain Monte Carlo in semiparametric generalized linear models. In penalized spline regression, Eilers and Marx (1996), who coined the term P-splines, generalized the ideas of O'Sullivan (1986) by penalizing finite differences of the spline coefficients, with the knots set equidistantly. Ruppert (2002) presented two algorithms for the selection of a fixed number of knots in penalized spline regression.
The first one, called the myopic algorithm, is based on the minimization of the generalized cross-validation statistic, and the second one, the full search method, is based on the Demmler-Reinsch type diagonalization. Later, Ruppert et al. (2003) proposed a default choice of the number of knots as the minimum between one quarter of the number of distinct observed domain values and 35. Both Ruppert (2002) and Ruppert et al. (2003) set the knots at the quantiles of the data, which can lead to a lack of knots in regions of the function that contain important features, such as peaks and local maxima and minima. To address this problem, Yao and Lee (2008) proposed complementing the quantile-based knot placement by adding more knots near local features of the function. Kauermann and Opsomer (2011) treated the number of knots in penalized spline regression as a parameter to be estimated by likelihood; again, as in the previous works, the knot locations are the quantiles of the data. An automatic procedure to determine the locations of the knots was recently proposed in an arXiv preprint by Goepp et al. (2018) and was called Adaptive Spline, or A-spline. The method essentially starts with a large number of knots and removes superfluous ones by an adaptive ridge algorithm. Theoretical results about penalized spline regression can be found in Li and Ruppert (2008) and Hall and Opsomer (2005). Although the methods described above have been successful, they work with a fixed number of knots and/or require some prior information about the regions where the knots should be placed, such as the quantiles of the data, which can be a limitation in some cases. In this work we propose an automatic method to estimate both the number of knots and their locations, based on the minimization of a penalized least squares criterion.
The penalty plays the role of avoiding a high number of knots, which would lead to overfitting the data. We study the performance of the method on two simulated scenarios to evaluate its ability to estimate the number of knots and their locations, and we compare the results with other well-established methods. Further, we apply the procedure to daily reported cases of Covid-19 in several countries to fit their epidemiological curves and perform short-term forecasts. Both the simulation studies and the real data application suggest that the proposed method can be considered by practitioners in spline regression problems. This paper is organized as follows: Section 2 describes the spline model and the proposed method of automatic knot selection. Section 3 presents simulation studies on the performance of the method, and Section 4 shows applications to real datasets of daily reported cases of Covid-19. Finally, Section 5 provides conclusions and suggestions for future work. We start with a nonparametric regression problem of the form $y_i = f(x_i) + e_i$, $i = 1, \ldots, n$, where the $e_i$ are zero-mean independent random variables with unknown variance $\sigma^2$. To estimate $f$, we define a sequence of knots $a = t_0 < t_1 < \cdots < t_K < t_{K+1} = b$ that defines a partition of $[a, b]$ and expand $f$ in the truncated power basis of degree $p$,

$$f(x) = \sum_{j=0}^{p} \beta_j x^j + \sum_{k=1}^{K} \beta_{p+k} (x - t_k)_+^p, \qquad (2.1)$$

where $p, K \in \mathbb{Z}_+$, $\beta = [\beta_0, \ldots, \beta_{p+K}]$ is the vector of coefficients, $(u)_+^p = u^p I(u \geq 0)$ and $t_1 < \cdots < t_K$ are fixed internal knots which, from now on, will be referred to simply as knots. In general, a spline regression model is a degree-$p$ polynomial on each interval between two consecutive knots and has $p - 1$ continuous derivatives everywhere (assuming the knots have no multiplicity). In this work we adopt a set of sufficiently spaced knots, i.e., we assume $|t_{k+1} - t_k| > \delta$ for some $\delta > 0$, $k = 0, \ldots, K$.
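For a fixed knot vector, estimating $\beta$ reduces to ordinary least squares on the truncated power basis design matrix. The following minimal Python/NumPy sketch illustrates this step; the function names are ours for illustration and are not part of the splineSelection package:

```python
import numpy as np

def truncated_power_design(x, knots, p=3):
    """Design matrix of the degree-p truncated power basis in (2.1):
    columns 1, x, ..., x^p, followed by (x - t_k)_+^p for each knot t_k."""
    x = np.asarray(x, dtype=float)
    poly = np.vander(x, p + 1, increasing=True)   # 1, x, ..., x^p
    trunc = np.maximum(x[:, None] - np.asarray(knots, dtype=float)[None, :], 0.0) ** p
    return np.hstack([poly, trunc])

def fit_spline_ols(x, y, knots, p=3):
    """OLS estimate of the coefficient vector beta for a fixed knot vector."""
    X = truncated_power_design(x, knots, p)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, X @ beta                          # coefficients, fitted values
```

With $p = 3$ and a single knot this reproduces a cubic spline fit with one interior breakpoint, matching the $p + K + 1$ coefficients of (2.1).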
Thus, the problem of estimating $f$ becomes a finite parametric estimation problem, whose parameters are the vector of coefficients $\beta$, the number of internal knots $K$ and their locations $t_k$, $k = 1, \ldots, K$. Let us denote by $\theta$ the vector of parameters to be estimated, i.e., $\theta = [\beta, K, t_1, \ldots, t_K]$. We assume a fixed value for $p$, the degree of the spline function. The estimation of the parameters takes into account the performance of the model with respect to a loss function, in our case the residual sum of squares. To avoid overfitting caused by an excessive number of knots, we add a roughness penalty defined in terms of the number of intervals established by the knots, i.e., the $K$ knots define $K + 1$ intervals on the domain of $f$. Thus, we define the penalized sum of squares $PSS_\lambda$ as

$$PSS_\lambda(\theta; y) = \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda (K + 1), \qquad (2.2)$$

where $y = [y_1, \ldots, y_n]$, $\lambda > 0$ is a tuning parameter, and $f$ is the spline function in (2.1). Thus, the estimator $\hat{\theta}$ is obtained by

$$\hat{\theta} = \arg\min_\theta PSS_\lambda(\theta; y). \qquad (2.3)$$

Considering the definition above, we search for a curve estimate with good fitting properties, i.e., a low residual sum of squares, which is also smooth enough according to a reasonable number of knots, controlled by the penalty on the number of segments of the domain. The tuning parameter $\lambda$ attributes a weight to the roughness penalty term. High values of $\lambda$ increase the importance of the penalization, that is, they favor smoothness over accuracy. On the other hand, low values of $\lambda$ favor models with a high number of parameters, which can lead to overfitting. The specification of a satisfactory value for $\lambda$ is an issue inherent to any regularized estimator and, as in other similar approaches, it can be chosen by a standard cross-validation procedure. Algorithm 1 describes the procedure to obtain $\hat{\theta}$ in (2.3). The algorithm requires as inputs the minimum distance $\delta > 0$ between consecutive knots and the maximum number of knots $K_{max} \in \mathbb{N}$. For each $k \leq K_{max}$, the algorithm obtains the best vector of knot locations $(t_1, \ldots, t_k)$ in terms of ordinary least squares (OLS). The overall optimal vector is then the one with the smallest $PSS_\lambda(\cdot; y)$. We developed the splineSelection R package to implement Algorithm 1; see Sousa et al. (2020) for details.

Algorithm 1: Penalized least squares estimation of $\theta$.
Input: data $(x, y)$, degree $p$, tuning parameter $\lambda$, minimum knot distance $\delta$, maximum number of knots $K_{max}$.
1 for $k = 0, \ldots, K_{max}$ do
2   find the knot locations $(t_1, \ldots, t_k)$, with $|t_{j+1} - t_j| > \delta$, minimizing the residual sum of squares
3   obtain $\hat{\beta}$ by OLS
4   calculate $PSS_{\lambda,\delta}(\hat{\theta}_k; y)$
5 end
Output: $\hat{\theta} = \arg\min_k PSS_{\lambda,\delta}(\hat{\theta}_k; y)$

Our proposed method for automatic knot selection can also be applied to other spline basis structures. A well-known spline basis is that of the B-splines, introduced by De Boor (1978), which are better suited to computational implementation due to their compact support. The $i$-th B-spline of order $m$ is nonzero over at most $m$ consecutive subintervals, i.e., over $[t_i, t_{i+m}]$, and zero outside it, and can be defined recursively (the Cox-de Boor recursion) by

$$B_{i,1}(x) = I(t_i \leq x < t_{i+1}), \qquad B_{i,m}(x) = \frac{x - t_i}{t_{i+m-1} - t_i}\, B_{i,m-1}(x) + \frac{t_{i+m} - x}{t_{i+m} - t_{i+1}}\, B_{i+1,m-1}(x).$$

Then $f$ is expanded in the B-spline basis instead of (2.1) as $f(x) = \sum_i \beta_i B_{i,m}(x)$. In this section we present two simulation studies to evaluate the performance of the proposed method. Section 3.1 analyses the method's ability to correctly estimate the number and locations of the knots for data generated by cubic B-splines, and Section 3.2 compares the proposed method's performance with two penalized spline regression methods. We apply the proposed method to synthetic data to evaluate its performance in several scenarios. We then compare the performance of our method with two methods that also automatically select the knots, namely the P-spline, or Penalized Spline (Ruppert, 2002; Ruppert et al., 2003), and the recently proposed A-spline, or Adaptive Spline (Goepp et al., 2018). The data were generated from three functions commonly used in spline regression simulation studies: SpaHet (for spatially heterogeneous), Sine and Logit; see Wand (2000), Ruppert (2002) and Goepp et al. (2018). Their definitions are given in Table 3.4 and their graphs can be seen in Figure 3.5.
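As a rough illustration of Algorithm 1, the Python sketch below performs the search by brute force over a small grid of candidate knot locations and picks the knot vector minimizing (2.2). This is a simplified stand-in for the splineSelection implementation: the grid, and the default values of `delta`, `k_max` and `n_candidates`, are arbitrary choices of ours, not the package's:

```python
import numpy as np
from itertools import combinations

def design(x, knots, p):
    # Truncated power basis design matrix: 1, x, ..., x^p, (x - t_k)_+^p
    trunc = np.maximum(x[:, None] - np.asarray(knots, dtype=float).reshape(1, -1), 0.0) ** p
    return np.hstack([np.vander(x, p + 1, increasing=True), trunc])

def pss_knot_selection(x, y, lam, p=3, k_max=3, n_candidates=15, delta=0.05):
    """For each k <= k_max, search candidate knot vectors on a grid for the
    smallest residual sum of squares, then return the knot vector that
    minimizes PSS = RSS + lam * (k + 1), as in (2.2)."""
    grid = np.linspace(x.min(), x.max(), n_candidates + 2)[1:-1]  # interior candidates
    best_pss, best_knots = np.inf, ()
    for k in range(k_max + 1):
        for knots in combinations(grid, k):
            # enforce the minimum spacing |t_{j+1} - t_j| > delta
            if any(b - a <= delta for a, b in zip(knots, knots[1:])):
                continue
            X = design(x, knots, p)
            beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS for fixed knots
            pss = np.sum((y - X @ beta) ** 2) + lam * (k + 1)
            if pss < best_pss:
                best_pss, best_knots = pss, knots
    return best_knots, best_pss
```

The exhaustive grid search grows combinatorially in `k_max`, so it is only feasible for a handful of knots; it nonetheless shows how the penalty $\lambda(K+1)$ trades off fit against the number of segments.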
As a performance metric we considered, for each replicate, the mean squared error (MSE), averaged over replicates to obtain the averaged mean squared error (AMSE). Table 3.5 presents the AMSE and the standard deviation (SD) of the MSEs for the scenarios considered. In general, the proposed method presented a competitive performance in all scenarios. Note that in the scenarios with more noise, SNR = 3, our method outperformed the A-spline and the P-spline in four out of six cases. Furthermore, our method performed better in scenarios with sample size n = 50. Finally, even when the method was not the best, especially for SNR = 9, its performance was quite close to that of the competing methods. Thus, our simulation study indicates that the proposed method can be considered by practitioners for application to real data, with a particular advantage for data with a high noise level. We emphasize that the main aim of this section is solely to present a descriptive analysis of the Covid-19 situation in these countries, based on publicly available datasets, with the proposed automatic knot estimation method as a tool to do so. For public policy guidance in this sad pandemic context, several social and biological factors beyond the scope of our study should be considered in a deeper analysis. In this paper we introduced a method to estimate the number and positions of the knots of a spline regression function. The method is based on the minimization of the sum of squared residuals plus a penalty that depends on the number of knots. We evaluated the performance of the criterion on simulated data and showed that the proposed method performed very well in the simulation studies, even with low SNR values. We also applied the method to perform a descriptive analysis of Covid-19 daily reported cases in several countries.
In this analysis we showed that the penalized least squares estimation produced quite smooth curve fits that accurately estimated the underlying smooth functions automatically, i.e., without requiring the user to specify the number of internal knots and their locations, which is necessary in most spline-based curve fitting methods available in the literature. The penalizing constant $\lambda$ used in our data analyses was set to a fixed value; however, as in many other similar approaches, it could be chosen by a cross-validation procedure. From a theoretical point of view, one open question that remains is the rate of the penalizing constant $\lambda$ required to obtain consistency of the estimator $\hat{\theta}$. Some results in this direction for related penalized models are presented in Castro et al. (2018) and Leonardi et al. (2021), where a consistency result was proved for a penalizing constant of order $n^{-1/2}$. As future work, we will study whether this consistency also holds in the context of our work.

References

Biller (2000). Adaptive Bayesian regression splines in semiparametric generalized linear models.
Castro et al. (2018). A model selection approach for multiple sequence segmentation and dimensionality reduction.
Asymptotic properties of penalized spline estimators.
Calculation of the smoothing spline with weighted roughness measure.
De Boor (1978). A practical guide to splines.
Denison et al. (1998). Automatic Bayesian curve fitting.
Eilers and Marx (1996). Flexible smoothing with B-splines and penalties.
Friedman et al. (2001). The elements of statistical learning.
Friedman (1991). Multivariate adaptive regression splines. The Annals of Statistics.
Friedman and Silverman (1989). Flexible parsimonious smoothing and additive modeling.
Goepp et al. (2018). Spline regression with automatic knot selection.
Hall and Opsomer (2005). Theory for penalised spline regression.
James et al. (2013). An introduction to statistical learning with applications in R.
Kauermann and Opsomer (2011). Data-driven selection of the spline dimension in penalized spline regression.
Some asymptotic results on generalized penalized spline smoothing.
Leonardi et al. (2021). Independent block identification in multivariate time series.
Li and Ruppert (2008). On the asymptotics of penalized splines.
Osborne et al. (1998). Knot selection for regression splines via the lasso.
O'Sullivan (1986). A statistical perspective on ill-posed inverse problems.
Ramsay (2004). Functional data analysis. Encyclopedia of Statistical Sciences.
Ruppert (2002). Selecting the number of knots for penalized splines.
Ruppert et al. (2003). Semiparametric regression.
Sousa et al. (2020). splineSelection: Automatic model selection in spline-based regression.
Stone et al. (1997). Polynomial splines and their tensor products in extended linear modeling: 1994 Wald memorial lecture.
Ulker and Arslan (2009). Automatic knot adjustment using an artificial immune system for B-spline curve approximation.
Wahba (1990). Spline models for observational data.
Wand (2000). A comparison of regression spline smoothing procedures.
Yao and Lee (2008). On knot placement for penalized spline regression.

Funding: São Paulo Research Foundation, Brazil; Coordination of Superior Level Staff Improvement, Brazil; National Council for Scientific and Technological Development, Brazil.