Variational Full Bayes Lasso: Knots Selection in Regression Splines
Larissa Alves, Ronaldo Dias, Helio S. Migon
Date: 2021-02-26

* The authors, (LA) ENCE, (RD) IMECC-UNICAMP and (HSM) IM-UFRJ, contributed equally to the design and implementation of the research, to the analysis of the results and to the writing of the manuscript.

Abstract. We develop a fully automatic Bayesian Lasso via variational inference. This is a scalable procedure for approximating the posterior distribution. Special attention is given to knot selection in regression splines. In order to carry out our proposal, a fully automatic variational Bayesian Lasso, a Jeffreys prior is proposed for the hyperparameters and a decision-theoretic approach is introduced to decide whether a knot is selected or not. Extensive simulation studies were developed to demonstrate the effectiveness of the proposed algorithms. The performance of the algorithms was also tested on some real data sets, including data from the Covid-19 world pandemic. Again, the algorithms showed very good performance in capturing the data structure.

In the recent literature one finds many alternative proposals for modeling and estimating a smooth function. In this article we focus on variants of smoothing splines, called penalized regression splines (Montoya et al. [2014], Eilers and Marx [1996]). This is an attractive approach for modeling the nonlinear smoothing effect of covariates. This work discusses the selection of knots given a fixed maximum number of knots. A roughness penalty is introduced to control the selection of knots and, consequently, to balance the two conflicting goals, goodness of fit and smoothness. Our approach is a full Bayesian Lasso with variational inference. It is related to the work of Osborne et al. [1998], where an efficient algorithm to calculate the classical Lasso estimator was presented. Our contribution, therefore, includes the application of mean field variational inference (Blei et al. [2017], Ormerod and Wand [2010]) to the full Bayesian Lasso penalty (Park and Casella [2008], Mallick and Yi [2014]).

Choosing the ideal number of knots and their positions is a difficult problem. We propose a two-step procedure related to the work of Ruppert [2002]. For regularization and model selection, the proposed procedure starts with a fixed maximum number of knots and then uses a full Bayesian Lasso, which combines characteristics of shrinkage and variable selection, to obtain the most significant knots to recover the unknown smooth function. The number of knots is chosen based on an approximation of the predictive distribution over a grid of knot values. The original formulation of the Bayesian Lasso is based on a hierarchical representation of the Laplace distribution, as a scale mixture of normals with an exponential mixing distribution (Park and Casella [2008]) and, more recently, as a mixture of a uniform with a gamma distribution (Mallick and Yi [2014]). Alternative procedures for selecting the effective number of knots involving least squares and penalized regression splines have been proposed in the recent literature; see Spiriti et al. [2013] and Montoya et al. [2014]. It is well known that MCMC often takes a great deal of computational time and is not scalable.
Therefore, our proposal is to use variational inference (VI) integrated with a decision-theoretic approach to knot selection in regression splines. Both are discussed in detail and are shown to be comparatively better than the alternatives presented in the current literature. The remainder of the paper is organized as follows. Section 2 presents a review of the Bayesian linear model, variational inference and the hierarchical formulation of the Laplace distribution. Section 3 presents the full Bayesian Lasso, including the Jeffreys prior for the hyperparameters and the Bayes factor criterion for knot selection. Section 4 states the knot selection procedure for regression splines as an almost fully automatic algorithm. Section 5 shows a comparative numerical simulation of the performance of the proposed method and other existing approaches in the literature. A data analysis of real datasets is presented in Section 6.

In order to set the notation to be used later, this section presents a brief summary of Bayesian regression models and variational inference techniques, and also establishes the framework for our proposal to select knots in regression spline models, to be developed in Section 4. We summarize the conjugate Bayesian analysis of a linear model. In addition, we present an introduction to variational inference and the hierarchical representation of the Laplace distribution. For more details see Drugowitsch [2019], Denison et al. [1998], Berry et al. [2002], Goepp et al. [2018] and Lang and Brezger [2004].

Following the notation of Migon et al. [2015], let the linear model be y = Xβ + ε, with ε ∼ N(0, φ^{-1} I_n), where y is an n-vector of observed quantities, X is a known n × p matrix, β is a p-vector of parameters and φ is the precision associated with each one of the independent observations. The conjugate prior, a normal-gamma, is defined as β | φ ∼ N(m_0, (φ C_0)^{-1}) and φ ∼ Ga(a_0, b_0), where m_0 and (φ C_0)^{-1} are, respectively, the prior mean and covariance matrix, and a_0, b_0 are the parameters of the precision prior distribution. The posterior distribution is

β | φ, y, X ∼ N(m_1, φ^{-1} C_1^{-1}),   φ | y, X ∼ Ga(a_1, b_1),

with

m_1 = C_1^{-1}(C_0 m_0 + X^T y)  and  C_1 = C_0 + X^T X,
a_1 = a_0 + n/2  and  b_1 = b_0 + (1/2)[(y − X m_1)^T y + (m_0 − m_1)^T C_0 m_0].

A very useful extension of the above regression model is the class of Bayesian hierarchical regression models, which will be used extensively later. It was proposed in the seminal paper of Lindley and Smith [1972], and a dynamic version was introduced in Gamerman and Migon [1993].
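As an illustration of the conjugate update above, the following sketch computes (m_1, C_1, a_1, b_1) numerically; the prior values and the simulated design used in the example are illustrative choices, not taken from the paper.

```python
# Conjugate Normal-Gamma update for the linear model above:
#   beta | phi ~ N(m0, (phi C0)^{-1}),  phi ~ Ga(a0, b0),
#   C1 = C0 + X'X,  m1 = C1^{-1}(C0 m0 + X'y),
#   a1 = a0 + n/2,  b1 = b0 + [(y - X m1)'y + (m0 - m1)'C0 m0] / 2.
# Minimal sketch; prior values below are illustrative.
import numpy as np

def normal_gamma_posterior(y, X, m0, C0, a0, b0):
    n = len(y)
    C1 = C0 + X.T @ X
    m1 = np.linalg.solve(C1, C0 @ m0 + X.T @ y)
    a1 = a0 + n / 2.0
    b1 = b0 + 0.5 * ((y - X @ m1) @ y + (m0 - m1) @ (C0 @ m0))
    return m1, C1, a1, b1

# Example with simulated data and a weakly informative prior:
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=50)
print(normal_gamma_posterior(y, X, m0=np.zeros(2), C0=0.01 * np.eye(2), a0=0.1, b0=0.1))
```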
It is well known that Bayesian inference regarding unknown quantities is entirely based on their probabilistic description. Therefore, variational inference (VI), a method for approximating probability densities, is very useful for Bayesian inference. In fact, these techniques can be traced back to the field of machine learning (Jordan et al. [1999]). Loosely speaking, they exchange sampling, as in MCMC procedures, for optimization. By choosing a flexible family of approximating densities, an attempt is made to find the member of this family which minimizes some optimality criterion, for example the Kullback-Leibler (KL) divergence. Variational inference is useful for quickly comparing alternative models and also for dealing with large data sets. Blei et al. [2017] pointed out that the accuracy of variational inference has not yet been thoroughly studied and many open questions are still to be answered. The basic ideas about variational inference can be easily followed in Blei et al. [2017] and in Ormerod and Wand [2010]; many examples are presented in the book by Bishop [2006].

Let y be a vector of n independent identically distributed observations and z a vector including the latent variables as well as the parameters. The marginal data distribution p(y), also known as the evidence, is often unavailable in closed form or requires exponential time to be evaluated, which makes inference difficult for such models. To avoid calculating the evidence integral, one tries to find a lower bound, known as the evidence lower bound, ELBO(q), and denoted here by L(q). It is easy to verify that

log p(y) = L(q) + KL(q||p),

where L(q) = ∫ q(z) log[p(y, z)/q(z)] dz and KL(q||p) = −∫ q(z) log[p(z|y)/q(z)] dz, since p(y) = [p(y, z)/q(z)] / [p(z|y)/q(z)]. It is clear that maximizing L(q) over q is equivalent to minimizing KL(q||p), and also that KL(q||p) ≥ 0, with equality if and only if p(z|y) = q(z). In general, it is difficult to obtain this posterior distribution; therefore, the approach is to choose a family of tractable densities. Let us assume the mean field factorization

q(z) = ∏_{l=1}^{m} q_l(z_l),

where the z_l denote a partition of z into m disjoint groups. It is worth pointing out that there is no restriction on the functional forms of the variational densities q_l(z_l). The central idea is to maximize each factor (block of z's) of q(z) in turn. We keep the factors q_l, l ≠ h, fixed and maximize L(q) with respect to q_h. Note that, as a function of q_h alone,

L(q) = −KL(q_h || p̃) + const,   (1)

where log p̃(y, z_h) = E_{l≠h}[log p(y, z)] + const. The bound L(q) will be presented for the specific case of the Lasso in Subsection 3.3; note that it depends on the variational parameters. It is worth emphasizing that the problem of approximating the posterior distribution of the parameters of interest has been replaced by a maximization problem. The algorithm to solve this optimization problem was introduced by Bishop [2006] and is denoted CAVI, coordinate ascent variational inference. CAVI optimizes one factor of the mean field variational density at a time. Since (1) equals −KL(·||·) up to a constant, maximizing it is equivalent to minimizing the KL divergence. Therefore, the optimal solution is

q*_h(z_h) ∝ exp{ E_{l≠h}[log p(y, z)] }.

As one can see, q*_h(z_h) depends on the full conditional distributions, as usually denoted in the MCMC literature (Casella and George [1992]). Therefore, there is a natural link with Gibbs sampling, but the proposed approach leads to tractable solutions involving only local operations.

It is well known that the original Lasso formulation (Tibshirani [1996]) is related to the Laplace distribution, which can be represented as a hierarchical mixture of distributions; this hierarchical representation is important for the implementation of Gibbs sampling. One of these representations is a scale mixture of a normal distribution with an exponential mixing distribution (West [1987]), and the other is a mixture of a uniform distribution with a gamma distribution (Mallick and Yi [2014]). Specifically, following Andrews [1974], it is easy to verify that the hierarchical representation

β | τ ∼ N(0, τ)   and   τ | λ ∼ Exp(λ²/2)

leads, by marginalizing over τ, to the standard Laplace distribution, whose density is p(β | λ) = (λ/2) exp(−λ|β|). The above hierarchical representation of the Laplace distribution is important to introduce the full Bayesian Lasso. The penalty term in the classical Lasso can be interpreted as independent Laplace prior distributions over the regression parameters. Moreover, the posterior mode can be seen as the Lasso estimate.
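The scale-mixture representation can be checked by simulation: drawing τ from the exponential mixing distribution and then β | τ from the normal should reproduce a Laplace sample. The sketch below is a minimal Monte Carlo check of this fact; the value of λ and the sample size are arbitrary.

```python
# Monte Carlo check of the scale-mixture representation above:
#   beta | tau ~ N(0, tau)  with  tau | lambda ~ Exp(rate = lambda^2 / 2)
# should marginally give a Laplace(0, 1/lambda) distribution.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
lam = 1.5
n = 200_000

tau = rng.exponential(scale=2.0 / lam**2, size=n)     # Exp with rate lambda^2 / 2
beta = rng.normal(loc=0.0, scale=np.sqrt(tau))         # N(0, tau); tau is a variance

# Compare against Laplace(0, 1/lambda): KS statistic and E|beta| (should be 1/lambda)
print(stats.kstest(beta, stats.laplace(loc=0.0, scale=1.0 / lam).cdf))
print(np.mean(np.abs(beta)), 1.0 / lam)
```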
Following the hierarchical representation of the Laplace distribution in Subsection 2.3, Park and Casella [2008] give a Bayesian formulation of the Lasso regression model. The hierarchical model is defined as

y | β, φ ∼ N(Xβ, φ^{-1} I_n),   β | φ, τ ∼ N(0, φ^{-1} D_τ),   τ_j | λ ∼ Exp(λ²/2),

where D_τ = diag(τ_1, . . . , τ_p) and the τ_j | λ are conditionally independent for all j. The model can be completed with the hyperparameters of the priors φ ∼ Ga(a_0, b_0) and λ ∼ Ga(g_0, h_0). In Subsection 3.1 we propose independent Jeffreys priors for φ and λ to automate the Lasso, which amounts to letting a_0, b_0, g_0 and h_0 tend to zero. Let θ = (β, φ, τ, λ) be the vector of the parameters and latent variables of the model. The posterior distribution is proportional to the model distribution times the prior distribution of the latent components and the parameters:

p(θ | y) ∝ p(y | β, φ) p(β | φ, τ) p(τ | λ) p(φ) p(λ).

As is typical, the above joint posterior is intractable. An almost obvious numerical approach, since the breakthrough paper of Gelfand and Smith [1990], is to use stochastic simulation.

In order to develop an automatic Bayesian Lasso procedure, it is worth introducing noninformative priors for the hyperparameters involved. Following Fonseca et al. [2019] and exploring the conditional independence structure of the Lasso model, the Fisher information for λ can be decomposed as in (3), where I_{β,τ}(λ | y) is the information obtained from the full conditional distribution p(β, τ | y, λ); we also use the conditional independence described by the graph that represents the Bayesian Lasso model. We develop, in turn, each of the components of expression (3). The quantity I_τ(λ) is based on the independent marginal distributions of the τ_j, leading directly to I_τ(λ) = p/λ². In order to obtain I_{β,τ}(λ | y), we take advantage of the known full conditional distribution of (β, τ | λ, y) (see (4)); since (β | τ, λ, y) does not depend on λ, E_y[I_{β,τ}(λ | y)] is easily obtained. Substituting in (3), it follows that I_y(λ) ∝ λ^{-2}, and so the prior for λ is p(λ) ∝ λ^{-1}. This result is similar to the one reported in Fonseca et al. [2019] using the uniform-gamma mixture. It is well known that the Jeffreys prior for φ is proportional to φ^{-1}.

Considering the model and the prior distributions already specified, the posterior distribution has an unknown form. Therefore, we can use MCMC to obtain a sample from the posterior distribution through the complete conditional distributions (Gibbs sampler). Calculations of the complete conditionals, given in (4), are detailed in the appendix, where θ_{−} stands for the entire vector θ without the indicated component and GIG denotes the generalized inverse Gaussian distribution.
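For concreteness, a minimal sketch of a Gibbs sampler along these lines is given below. It is illustrative rather than the authors' implementation: it assumes the Jeffreys priors p(φ) ∝ 1/φ and p(λ) ∝ 1/λ, writes the generalized inverse Gaussian as GIG(p, a, b) with density proportional to x^{p−1} exp{−(a x + b/x)/2}, maps that parametrization onto scipy.stats.geninvgauss, and uses a gamma full conditional for λ² derived here from the exponential mixing distribution.

```python
# Sketch of a Gibbs sampler for the Bayesian Lasso hierarchy above, assuming
#   y | beta, phi ~ N(X beta, phi^{-1} I),  beta | phi, tau ~ N(0, phi^{-1} D_tau),
#   tau_j | lambda ~ Exp(lambda^2 / 2),  p(phi) ~ 1/phi,  p(lambda) ~ 1/lambda.
# Illustrative only; GIG(p, a, b) ~ x^{p-1} exp(-(a x + b/x)/2) is mapped onto
# scipy.stats.geninvgauss via b_scipy = sqrt(a b) and scale = sqrt(b / a).
import numpy as np
from scipy.stats import geninvgauss

def gibbs_lasso(y, X, n_iter=5000, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta, phi, lam2 = np.zeros(p), 1.0, 1.0
    tau = np.ones(p)
    XtX, Xty = X.T @ X, X.T @ y
    draws = []
    for _ in range(n_iter):
        # beta | rest ~ N(C^{-1} X'y, phi^{-1} C^{-1}),  C = X'X + D_tau^{-1}
        C_inv = np.linalg.inv(XtX + np.diag(1.0 / tau))
        beta = rng.multivariate_normal(C_inv @ Xty, C_inv / phi)
        # phi | rest ~ Ga((n + p)/2, [||y - X beta||^2 + beta' D_tau^{-1} beta] / 2)
        rate = 0.5 * (np.sum((y - X @ beta) ** 2) + np.sum(beta ** 2 / tau))
        phi = rng.gamma(shape=0.5 * (n + p), scale=1.0 / rate)
        # tau_j | rest ~ GIG(1/2, lambda^2, phi * beta_j^2)
        a_gig, b_gig = lam2, phi * beta ** 2 + 1e-12
        tau = geninvgauss.rvs(0.5, np.sqrt(a_gig * b_gig),
                              scale=np.sqrt(b_gig / a_gig), random_state=rng)
        # lambda^2 | rest ~ Ga(p, sum(tau)/2) under the Jeffreys prior on lambda
        lam2 = rng.gamma(shape=p, scale=2.0 / np.sum(tau))
        draws.append(np.r_[beta, phi, np.sqrt(lam2)])
    return np.array(draws)
```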
In order to obtain a scalable inference procedure, we introduce an alternative methodology. To make the notation consistent, the vector of latent variables and parameters denoted by z in Subsection 2.2 is represented in this section by the vector θ. Let the independent Jeffreys priors be p(φ) ∝ 1/φ and p(λ) ∝ 1/λ. The joint distribution of the observations, latent components and parameters can easily be read off Figure 1, which summarizes the model. The variational density is factorized as

log q(θ) = log q_1(β, φ) + log q_2(τ | λ) + log q_3(λ).

Following Blei et al. [2017], the optimal q_l(θ_l) is proportional to the exponential of the expected log of the complete conditional distribution calculated in (4).

In the first step, the variational posterior for β and φ, which maximizes the variational bound L(q) while holding q_2(τ | λ) and q_3(λ) fixed, is easily seen to be a normal-gamma distribution with parameters m_β, C_β, a_φ and b_φ (see the appendix). Next, the variational distribution of τ, which maximizes L(q) while holding q_3(λ) fixed, is a product of generalized inverse Gaussian (GIG) distributions, q_2(τ) = ∏_{j=1}^{p} GIG(τ_j | c_τ, d_τ, f_{τ_j}). Finally, the variational distribution of λ is a gamma distribution; its parameters are also given in the appendix. The expected values involved in the definition of the above variational distributions are moments of the GIG distribution and are computed in terms of κ_p(·), the modified Bessel function of the second kind (see the appendix for details and Jørgensen [1982]).

The evidence lower bound (ELBO) for this model consists of the expected log joint density minus the expected log variational density, L(q) = E_q[log p(y, θ)] − E_q[log q(θ)], and each of its terms is evaluated as a function of the variational parameters. A second-order Taylor expansion of log τ_j around E(τ_j) is used to approximate its expected value, E(log τ_j) ≈ log E(τ_j) − Var(τ_j) / (2 E²(τ_j)), where the mean and the variance of τ_j are given in equations (5) and (6). Note that the variational bound depends on the quantities m_β, C_β, b_φ, d_τ, f_{τ_j} and h_λ; the algorithm updates these quantities at each iteration. The ELBO is maximized and hence L(q) reaches a plateau as those quantities stabilize. The algorithm consists of the following steps:

Step 1. Initialize the variational hyperparameters.
while convergence has not been reached do
for each factor of q(θ) do
Step 2. Compute the required expected values under the current variational distributions.
Step 3. Update the variational hyperparameters based on these expected values.
end for
Calculate the ELBO.
end while

Convergence can be assessed by analyzing the changes in the ELBO between consecutive iterations or by monitoring the quantities on which it depends.

We end this section with the predictive distribution. Let y_o and y_p be the observed and the predicted vectors, respectively, and let q_1(β, φ), a normal-gamma distribution, be the variational approximation of the posterior p(β, φ | y_o). Then, after some algebraic calculations, the predictive distribution of y_p given y_o is a Student's t-distribution (St).
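The GIG expectations appearing in the updates above can be evaluated with the modified Bessel function of the second kind. The helper below is a generic sketch, not the authors' code; it assumes the parametrization GIG(p, a, b) ∝ x^{p−1} exp{−(a x + b/x)/2}, under which E(X) = (b/a)^{1/2} κ_{p+1}(√(ab))/κ_p(√(ab)) and E(1/X) = (a/b)^{1/2} κ_{p+1}(√(ab))/κ_p(√(ab)) − 2p/b.

```python
# Moments of X ~ GIG(p, a, b) with density proportional to
#   x^{p-1} exp(-(a x + b / x) / 2),  x > 0,
# written with the modified Bessel function of the second kind (scipy.special.kv).
# These are the expectations E(tau_j), E(1/tau_j) and Var(tau_j) needed in the
# CAVI updates; the parametrization is an assumption made for this illustration.
import numpy as np
from scipy.special import kv

def gig_moments(p, a, b):
    z = np.sqrt(a * b)
    ratio = kv(p + 1.0, z) / kv(p, z)                 # K_{p+1}(z) / K_p(z)
    mean = np.sqrt(b / a) * ratio                     # E(X)
    mean_inv = np.sqrt(a / b) * ratio - 2.0 * p / b   # E(1/X)
    second = (b / a) * kv(p + 2.0, z) / kv(p, z)      # E(X^2)
    return mean, mean_inv, second - mean**2           # (mean, mean of 1/X, variance)

# Example with p = 1/2, the shape arising in the Lasso updates:
print(gig_moments(0.5, a=2.0, b=1.5))
```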
We now discuss, from a Bayesian point of view, three alternative procedures for selecting knots (variables) in penalized regression splines (linear regression). In this work, we propose a new decision criterion based on the Bayes factor, fully described below in 3.4.1. In general, the selection of predictors/knots in penalized regression/penalized regression splines can be seen as a decision problem. Consider the general case where it is necessary to decide for one of two competing models; hereafter, we assume that the prior odds equal one. In particular, we consider the two alternatives M_0: β_j = 0 and M_1: β_j ≠ 0. Assuming that at least moderate evidence against M_1 corresponds to BF(M_0, M_1) ≥ 3, and that β*_j is considered significantly distant from M_0 if and only if it is greater than or equal to the third quartile of the standard normal distribution, that is, 0.67, a quadratic equation must be solved whose root is δ = 2.3. Thus, our proposal to select knots in spline regression is described in the following algorithm:

Step 1. Take β*_j = m_j / s_j, a standardized point estimate of β_j, with m_j = E[β_j | D] and s²_j = var(β_j | D).
Step 2. Compute the Bayes factor BF(M_0, M_1) at β*_j.
Step 3. Compute π = BF(M_0, M_1) / (1 + BF(M_0, M_1)).
Step 4. Define the BF evidence thresholds: choose two positive numbers a and b (with a = 1 and b = 3, corresponding to a Bayes factor equal to 3).
if π < a/(a + b) then M_0 is rejected and the j-th predictor is kept in the model
else the j-th predictor is excluded from the model.
end if

One alternative procedure, due to Li and Lin [2010], is based on a 50% credible interval: if the credible interval of a given coefficient contains zero, the explanatory variable associated with it is removed from the model. A second criterion, named scaled neighborhood (Li and Lin [2010]), evaluates the posterior probability of [−s_j, s_j], where s²_j = var(β_j | y), and excludes a predictor if this probability exceeds a certain threshold; Li and Lin [2010] suggest 1/2 as this limit.

We start with the standard setup for nonparametric regression models. Suppose we have a collection of observations (y_i, x_i), for i = 1, . . . , n, such that

y_i = f(x_i) + ε_i,

where f(x_i) = E[y_i | x_i] are the values of an unknown smooth function f defined on the interval [a, b] ⊂ R, and the ε_i are uncorrelated random variables with mean zero and unknown precision φ. A possible approach to estimating f is to assume that the regression curve can be well approximated by a spline function; see details in Dias [1999]. That is, given a sequence of knots κ = (κ_1, . . . , κ_K) with κ_1 < κ_2 < . . . < κ_{K−1} < κ_K, a spline regression model can be written as

f(x) = β_0 + β_1 x + . . . + β_p x^p + Σ_{k=1}^{K} β_{p+k} (x − κ_k)_+^p,

where p is the degree of the polynomial spline and β is the vector of coefficients, of dimension K + p + 1. The functions (u)_+ form the well-known truncated power basis, with (u)_+^p = [max(0, u)]^p. Note that, for a fixed K, a vector of knots κ and this set of basis functions, an estimate of f, say f̂, can be obtained by estimating the vector β, so that f̂(x) = Σ_{i=0}^{p} β̂_i x^i + Σ_{k=1}^{K} β̂_{p+k} (x − κ_k)_+^p.

It is well known (Dias [1998], Dias and Gamerman [2002], Dias and Garcia [2007], Kooperberg and Stone [1991]) that, when K increases, the bias gets smaller but the variance increases substantially, causing over-fitting. On the other hand, if K decreases toward zero, the variance is drastically reduced but the bias increases considerably, causing under-fitting. Thus, K acts as the smoothing parameter of the regression spline fit and hence balances the trade-off between over-fitting and under-fitting. A good procedure should not only provide an ideal number of knots (or basis functions) but also quantify the uncertainty of adding or removing them. Figure 2 shows the effect of different values of K on a spline regression model. There are other basis functions that can represent a regression function, such as B-splines, wavelets, radial basis functions, etc. For all of them, it is still necessary to balance under-fitting and over-fitting; even in the case of smoothing splines, the regularization parameter needs to be obtained. Specifically, in this work we deal with the following optimization problem: find β(λ), the minimizer of

Σ_{i=1}^{n} (y_i − f(x_i))² + λ Σ_{k=1}^{K} |β_{p+k}|,

where λ is the smoothing parameter. For large values of λ, the solution of this optimization problem tends to the polynomial regression fit, that is, towards under-fitting. Note that the penalty term involves only the coefficients associated with the knot sequence κ_1 < κ_2 < . . . < κ_K. Consequently, selecting knots is equivalent to selecting the coefficients that contribute most to the fit. From the Bayesian point of view, this work presents a novel and scalable procedure for selecting knots.
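A small helper for building the design matrix of the truncated power basis defined above is sketched below; the function name, the quantile-based knot placement and the example values are illustrative only.

```python
# Spline design matrix for the truncated power basis defined above:
#   f(x) = sum_{i=0}^{p} beta_i x^i + sum_{k=1}^{K} beta_{p+k} (x - kappa_k)_+^p.
# X1 holds the polynomial columns (not penalized) and X2 the K truncated power
# columns whose coefficients receive the Lasso prior.  Illustrative helper only.
import numpy as np

def truncated_power_design(x, knots, degree=3):
    x = np.asarray(x, dtype=float)
    X1 = np.vander(x, N=degree + 1, increasing=True)                     # 1, x, ..., x^p
    X2 = np.maximum(x[:, None] - np.asarray(knots)[None, :], 0.0) ** degree
    return X1, X2

# Knots placed at equally spaced quantiles of x:
x = np.linspace(0.0, 1.0, 200)
K = 10
knots = np.quantile(x, np.linspace(0, 1, K + 2)[1:-1])
X1, X2 = truncated_power_design(x, knots, degree=3)
print(X1.shape, X2.shape)   # (200, 4) (200, 10)
```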
Following the idea of the Lasso procedure for variable selection, from the Bayesian point of view the selection of knots in spline regression models can be made by assuming an independent Laplace prior distribution for the coefficients associated with the knots. We denote by β^(1) = (β_0, . . . , β_p)^T the polynomial coefficients, of dimension p + 1, and by β^(2) = (β_{p+1}, . . . , β_{p+K})^T, of dimension K, the penalized coefficients. The hierarchical structure presented in Subsection 2.3 is maintained for β^(2), completing the model defined by equations (7) and (8). The Bayesian inference procedure in this case must be carried out with caution, since the vector of coefficients contains the coefficients of the polynomial, which are not penalized, and the coefficients of the basis functions, to which we assign a Lasso prior for knot selection. The design matrix is then partitioned as X = (X_1, X_2), with X_1 of dimension n × (p + 1) and X_2 of dimension n × K. The i-th row of the matrix X is

X_i = (1, x_i, . . . , x_i^p, (x_i − κ_1)_+^p, . . . , (x_i − κ_K)_+^p).

In this context, we have a prior distribution for the vector θ = (β^(1), β^(2), φ, τ, λ), and the posterior distribution is proportional to the likelihood times the priors of all components. Slight adaptations need to be made in the variational inference method; the main one is that we now have four partitions of the parametric vector, giving rise to the following variational densities:

log q(θ) = log q_1(β^(2), φ) + log q_2(τ) + log q_3(λ) + log q_4(β^(1)),

where log q_4(β^(1)) = log N(β^(1) | m_{β1}, C_{β1}). The calculations for this and the other variational densities are similar to those developed for the regression model in Subsection 3.3 and can be found in the appendix.

An alternative approach to determining the maximum number of knots K would be to treat it as an unknown parameter and estimate it. However, in this work K is not a parameter of direct interest, since the selection of knots is carried out through the Lasso scheme. Despite this, in order to obtain a good fit, it is important to properly define the number of knots and their positions. Section 5 presents some exercises involving knot selection and shows the importance of a correct specification of the value of K. Thus, an algorithm is proposed for the automatic choice of the number of knots K, based both on the VB algorithm for estimating the Lasso model and on the criterion for selecting variables (knots). The algorithm starts by proposing a grid of possible values of K, the maximum number of knots. Naturally, the choice of grid is an issue to be discussed. Note that it is not necessary to propose a grid that covers all the natural numbers, since the computational time could be excessively high; moreover, given a maximum number of knots, the most significant knots will be selected. For instance, start the grid with a maximum number of knots K = 20. By using the Lasso and the selection criteria, it is possible to obtain, say, 6 among these 20 knots as the most significant ones. On the other hand, simulations show that starting with a very large maximum number of knots may cause problems in the selection, since the penalty acts more severely when the number of knots is extremely large relative to the size of the data set; see the numerical simulations in Section 5. Therefore, our proposal is to use a grid that increases 10 units at a time, with the knots placed at quantiles or evenly spaced over the domain of the explanatory variable. This spacing allows us to position knots in different locations before and after selecting the significant knots. A code sketch of this grid search is given below, followed by the detailed algorithm.
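The sketch below illustrates this grid search; fit_fn is a stand-in for the VB Lasso fit of Section 3 combined with the knot selection rule of Subsection 3.4, so only the control flow (increase K in steps of 10 while the ELBO keeps increasing) reflects the procedure described here.

```python
# Sketch of the grid search over the maximum number of knots K.
# `fit_fn(y, x, knots, degree) -> (elbo, selected_knots)` is a hypothetical callable
# standing in for the variational Bayes Lasso fit plus the Bayes-factor selection rule;
# only the control flow reflects the procedure described in the text.
import numpy as np

def choose_number_of_knots(y, x, fit_fn, degree=3, k_start=10, k_step=10, k_max=100):
    best_elbo, best = -np.inf, None
    for K in range(k_start, k_max + 1, k_step):
        knots = np.quantile(x, np.linspace(0, 1, K + 2)[1:-1])   # knots at quantiles
        elbo, selected = fit_fn(np.asarray(y), np.asarray(x), knots, degree)
        if elbo <= best_elbo:
            break                                                # ELBO stopped increasing
        best_elbo, best = elbo, (K, knots, selected)
    return best_elbo, best
```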
In the simulated exercises presented in Section 5, the grid starts with K = 10 knots. In summary, the ELBO is taken as the stopping criterion and the objective is to maximize it. The algorithm proceeds as follows: for a fixed grid of values, start with the lowest value; fit the model via VB together with the selection criterion and then compute the ELBO; move to the next grid value and repeat the procedure. As long as the ELBO increases along the grid values, the algorithm continues. The detailed algorithm is given below:

Step 1. j ← 1. Initialize K_j = 10.
Step 2. Fit the model via the VB algorithm.
Step 3. Compute the ELBO.
Step 4. Apply the BF criterion to select the most significant knots.
Step 5. Set K_{j+1} = K_j + 10 and repeat Steps 2, 3 and 4.
if the ELBO no longer increases then stop and deliver the ELBO and the most significant knots. end if

This section proposes five exercises with artificial data: the first three are based on the Lasso for linear regression models, and the last two focus on the use of the Lasso to select knots in spline regression. The inference procedure assumes, for all exercises, the following prior distributions: φ ∼ Ga(0.1, 0.1), λ ∼ Ga(0.1, 0.1) and β_1 ∼ N(1, 100) (in the case of regression splines). For MCMC, 15,000 iterations were necessary to achieve convergence; the first 5,000 iterations were discarded as burn-in, and one draw in every ten was kept to remove autocorrelation, resulting in a final posterior sample of 1,000 draws. These quantities were obtained by using the criterion of Lewis and Raftery [1997], which provides the number of iterations needed to guarantee convergence of the Gibbs sampler. The VB algorithm is iterated until the changes in m_β, C_β, b_φ, d_τ, f_{τ_j} and h_λ between two consecutive iterations are less than 0.01%. When applied, the classical procedure was implemented using the R package glmnet, which in turn applies 5-fold cross-validation to estimate the penalty parameter λ.

The goal of the exercises applied to the linear regression models is twofold: firstly, to compare the estimation methods VB and MCMC (eventually we also use the classical Lasso in the comparison); secondly, to compare the different variable selection criteria. Since MCMC and VB present similar results, it is worth pointing out the main difference between these estimation methods, which is computational time. For exercise 1, the computational time of the VB was 0.72 seconds, while that of the MCMC was 10.15 seconds. In the following exercises these computational times become even more discrepant, as we deal with simulations with replicates.

In exercise 2, a simulation was developed based on 100 replicates with p = 8 and β = (3, 1.5, 0, 0, 2, 0, 0, 0)^T, and the design matrix is generated from a multivariate normal distribution with zero mean, variance 1 and two different correlation structures between x_i and x_j: 0 and 0.7^{|i−j|}, for all i and j. We consider φ = 1/9 and three nested scenarios varying the sample size, with {n_T, n_V} = {20, 10}, {100, 50} and {200, 100}, where n_T and n_V denote the sizes of the training and validation sets, respectively. Therefore, we have a total of 6 different scenarios. Note that the explanatory variables are standardized to have mean 0 and variance 1. Table 2 summarizes the results of exercise 2. The comparison of the MCMC and VB methods is our main objective in this simulation; however, the frequentist Lasso is also considered through the glmnet package of the R software. For the frequentist Lasso, 5-fold cross-validation is used to select the parameter λ.
In addition, different variable selection criteria are compared, as described previously: the credible interval (CI), the scaled neighborhood (SN) and the Bayes factor (BF). In order to compare the predictive power of the Lasso under the different estimation techniques, MCMC, VB and the frequentist Lasso, the mean absolute error (MAE) was calculated for each replicate of the validation set using

MAE = (1/n_V) Σ_{i=1}^{n_V} |y^P_i − y^V_i|,

where the y^P_i are the predicted values in the validation set, obtained from the fitted model after the selection of the coefficients, the y^V_i are the observed values in the validation set, and n_V is the size of the validation set. Note that MCMC generates a sample from the predictive distribution at each iteration of the method; y^P_i is then obtained by averaging over the AM retained MCMC iterations, where AM is the number of MCMC iterations and θ is the vector of coefficients.

Figure 4 shows the box-plots of the mean absolute errors for each of the six proposed scenarios. As the sample size increases, we observe a smaller difference between the three estimation methods. When the sample is small, similar results are obtained by MCMC and VB; these have a lower median MAE and the lowest dispersion when compared to the frequentist Lasso. Next, we detail the performance of the selection criteria for each β_j. Table 3 shows the frequency with which the predictor x_j, j = 1, . . . , 8, was excluded over the 100 replicates, considering the three variable selection methods and all six scenarios built in exercise 2. We present the proportions only for the VB, because so far its results are similar to those of the MCMC. Note that, for this simulation exercise, the BF presents the best results in all scenarios, with a greater proportion of exclusion when the actual values of β_j are zero and a small proportion when the β's are different from zero. In addition, as the sample size increases, the three criteria tend to correctly identify both the coefficients that are zero and those that are different from zero.

From exercises 1 and 2, one may notice that the approximations of the VB are as good as the results obtained by the MCMC; nevertheless, the gain in computational time provided by VB is far superior. In addition, we saw that the BF is a variable selection criterion that presents superior results when compared with the CI and SN. In the following subsection we show the performance of the VB estimation method and the BF selection criterion in a more complex numerical experiment with greater sparsity.

In exercise 3, the correlation between x_i and x_j is equal to 0.5 for all i ≠ j. We analyze four different scenarios by varying the sample size and the precision parameter φ. The simulated data were analyzed with {n_T, n_V} = {20, 10} and {200, 100}, where n_T and n_V are the sizes of the training and validation sets, respectively. In addition, we set the precision parameter to φ = 1/9 and φ = 1/225. For each scenario we consider 100 replicates. Table 4 summarizes all the scenarios considered in this simulation exercise. It is worth mentioning that in scenarios S3.1 and S3.3 we have n < p. Similarly to exercise 2, the MAE was calculated for each replicate as a predictive measure. Figure 5 shows the box-plots of each scenario for MCMC, VB and Lasso. One may see that MCMC and VB present similar results, superior to the Lasso, when the sample size is small; as the sample increases, the results of the three approaches become similar. The MCMC results are omitted from the following comparison because they are similar to those of the VB. In the Bayesian context, the selection criterion used in all scenarios was the BF.
It is expected that the black bars will be larger when the true coefficients are different from zero and that the gray bars will be larger when the true coefficients are equal to zero. The proportions of errors are represented by the black bars when the coefficients are zero (type I error) and by the gray bars when the coefficients are different from zero (type II error). Thus, it can be seen that, for n < p, both VB and Lasso do not have a good selection and exclusion performance, with a slight advantage for VB. On the other hand, as the sample size increases, the VB presents good results, better than those presented by the Lasso. Also note that when n > p both VB and Lasso have the same type II error; however, for all coefficients, the type I error is considerably smaller for VB than for the Lasso.

Since VB presents results similar to MCMC, but with considerably less computational time, only the VB is used in the two exercises applied to the spline regression models. The goal is to define the maximum number of knots from a grid of values and, in turn, to select the most significant knots and their positions. Exercises 4 and 5 consist of a simulation study of the penalized spline regression model defined in equations (7), (8) and (10). The ELBO is used to indicate the maximum number of knots and, for both exercises, it points to K = 30 knots as the best choice. The BF and CI selection criteria indicate that around 8 of these 30 initial knots have a higher frequency of being selected as the most significant ones.

In exercise 4 we use the smooth function f, called "Bump", given by f(x) = x + 2 exp{−(16(x − 0.5))²}. Observe that the positions of the knots most often selected as significant are on the rise and fall of the bump. It should also be noted that, as K increases, the three selection criteria tend to be more rigorous in the penalty and hence exclude more knots. This occurs more severely with the SN criterion, which yields worse average fits when K = 50. The results of the CI and BF criteria are similar in all cases. The results for the exercises where the maximum number of knots is 20 or 40 have been omitted, as they are similar to those for K = 10 and K = 50, respectively.

Figure 10 exhibits the 100 fitted models for each of the replicates considering K = 10, 30 and 50 and the BF selection criterion. One can see that the variability increases as the value of K increases; the same occurs with the other selection criteria. Figure 11 shows the frequency of the number of knots selected in the 100 replicates for each of the selection criteria (CI, SN and BF) and each maximum number of knots (K = 10, 30 and 50). Note that the CI and BF criteria, when K = 10, most frequently select between 5 and 6 knots. On the other hand, when K = 30, 7 knots among them are selected most frequently. In the case where the maximum number of knots is K = 50, the frequency of the number of selected knots has a bimodal behavior, and one can see that the penalty is more severe as K grows. The ELBO, as a model comparison measure, can be used to propose the maximum number of knots; it is worth mentioning that the larger the ELBO, the more that model is preferred. From Table 5, which presents the average ELBO for each value of K, the model proposed with K = 30 knots, which most frequently selects 7 of these knots as significant, is preferable.
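Data from the "Bump" function of exercise 4 can be simulated as below; the sample size and noise precision are illustrative assumptions, since they are not fixed in this excerpt.

```python
# Simulating data from the "Bump" test function of exercise 4,
#   f(x) = x + 2 exp{-(16 (x - 0.5))^2},  y_i = f(x_i) + e_i,  e_i ~ N(0, 1/phi).
# The sample size and the precision phi below are illustrative choices only.
import numpy as np

def bump(x):
    return x + 2.0 * np.exp(-(16.0 * (x - 0.5)) ** 2)

rng = np.random.default_rng(1)
n, phi = 200, 1.0 / 0.3**2                 # assumed sample size and noise precision
x = np.sort(rng.uniform(0.0, 1.0, n))
y = bump(x) + rng.normal(scale=np.sqrt(1.0 / phi), size=n)
```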
Figure 15 shows the fits of the 100 replicates according to the BF criterion for different numbers of knots. It can be seen that there is less variability than in the case of one bump, possibly due to the increase in the sample size to n = 300. Nonetheless, it is possible to see that, when the maximum number of knots increases, the variability between the fitted models also increases. Our analysis shows that the SN criterion does not provide a good fit for K = 50, and in Figure 16 we see that this criterion tends to underestimate the number of significant knots as K increases. We analyze in more detail the frequency of the number of knots selected in the 100 replicates using the CI and BF criteria; both criteria present similar results. In Figure 16 we observe that, when the maximum number of knots is 10, both the CI and BF criteria most frequently indicate that 7 out of the 10 knots are significant. When K = 30 or K = 50, the CI and BF criteria most frequently indicate 8 knots as significant. In order to define the maximum number of knots, the average ELBO was calculated for each value of K on a fixed grid. For the BF criterion, the average ELBO is -74.36 when K = 10, and this value increases for larger values of K on the grid.

In this section, two applications with real data are analyzed. The first is more usual in the literature of the area, and the second addresses a current issue related to the Covid-19 world pandemic. In both cases, the penalized spline regression model with polynomials of degrees 2 and 3 is fitted to the data, and the maximum number of knots varies among 10, 20 and 30. It is noteworthy that in the first example the knots are positioned equally spaced. The Lasso, through variational inference (VB), is used in both applications together with the Bayes factor (BF) criterion for the selection of the most significant knots; the ELBO is the measure considered for comparing models.

The first data set consists of the income and age of 205 Canadians (Ullah and Zinde-Walsh [1985]). These data have been widely used in applications of nonparametric regression models; see, for example, Ruppert [2002]. A logarithmic transformation was applied to income, as can be seen in the data represented by black dots in the two plots of Figure 17. Table 6 shows the ELBO computed for each fitted model, varying the degree of the polynomial (p) and the maximum number of knots (K). For both p = 2 and p = 3, the ELBO achieves its maximum value at K = 10; comparing these two cases, the largest ELBO occurs for p = 3 and K = 10. The plot on the left of Figure 17 shows the fit of the penalized spline regression model (solid black line) and its 95% credible interval (gray shaded area). Note that the credible interval covers a large part of the observed points. At the bottom of the plot, the symbol "x" marks the 10 positioned knots, and in black is the only knot considered significant among the 10, according to the BF criterion. For comparison, the plot on the right of Figure 17 shows the fitted model obtained with the R function "smooth.spline" (solid red line) together with the fit of the proposed model (solid black line); in this application, the results of the two fitted models are similar.

In the second example, analyzing real data, we consider models with splines of degrees 2 and 3. The maximum number of knots varies in steps of 10 knots. To find the maximum number of knots, we use the ELBO, starting the grid with K = 10 knots. After obtaining the optimal value of K, the BF criterion is applied to select which knots are the most significant and their positions.
The inference procedure was performed from the Bayesian point of view through variational inference. The prior distributions remain the same as in the studies with artificial data. Marked in bold are the cases in which the ELBO achieves the highest values for p = 2 and p = 3. In the North American case, the ELBO is maximal when p = 3 and K = 20; it is worth mentioning that, among the 20 knots, 9 were significant according to the BF criterion. Considering the Brazilian data, the largest ELBO occurs when p = 2 and K = 10, and only 6 of these 10 knots are significant, according to the BF. The following results are presented only for the models with the largest ELBO.

The plot on the left of Figure 19 shows the fit (solid green line) of the penalized spline regression model with a polynomial of degree 3 and 9 significant knots (out of a total of 20 knots) for the US Covid-19 data. The shaded area represents the 95% credible interval and contains most of the observed data (black dots). Note that the significant knots (black asterisks) are located such that they cover the bumps of the curve; the red "x" symbols are the excluded knots, not considered in the fit. The plot on the right compares the fit of the proposed model (solid green line) with that of the R function "smooth.spline" (solid red line); note that the second fit captures, in addition to the signal, the noise contained in the data.

Figure 19: Left: the fit of the penalized spline regression model (solid green line), with p = 3 and 9 significant knots (black asterisks) out of K = 20 knots (red "x"), with the 95% credible interval (shaded area), for the logarithm of the number of daily cases of Covid-19 in the USA (black dots). Right: the fit of the proposed model (solid green line) and the fit of a model using the R function "smooth.spline" (solid red line) to the log of the number of daily cases of Covid-19 in the USA (dots).

Figure 20 shows the fit (solid green line) of the proposed model with p = 2 and 6 significant knots (black asterisks at the bottom of the plot); the excluded knots are represented by red "x" symbols. The 95% credible interval (gray shaded area) covers much of the observed data points. The plot on the right compares the fit of the proposed model with that of the R function "smooth.spline" (solid red line). As in the case of the U.S. data, we observe that the "smooth.spline" fit does not smooth the data as much as the proposed model and follows the random noise of the series more closely.

This article proposes a new scalable procedure for selecting the number of knots in regression splines: a fully automatic Bayesian Lasso through variational inference. Simulation studies have shown the effectiveness of this procedure in modeling different types of data sets. In addition, the numerical exercises show that this approach is much faster than the traditional one based on MCMC-type algorithms. In real data sets the procedure was able to capture the existing trend, thus providing a better understanding of the data dynamics.
Appendix.

Variational densities for the Lasso regression model. The variational posterior for β and φ, while holding q_2(τ | λ) and q_3(λ) fixed, is given by

log q*_1(β, φ) = log p(y | β, φ) + E_τ[log p(β, φ | τ)] + const
= log{ (2π)^{-n/2} |φ^{-1} I_n|^{-1/2} exp[ −(1/2)(y − Xβ)^T (φ I_n)(y − Xβ) ] }
+ E_τ log{ (2π)^{-p/2} |φ^{-1} D_τ|^{-1/2} exp[ −(1/2) β^T (φ D_τ^{-1}) β ] }
+ log[ φ^{a_0 − 1} exp(−φ b_0) ] + const.

It is easy to see that this is a normal-gamma distribution with parameters m_β, C_β, a_φ and b_φ. The variational distribution of τ, while holding q_3(λ) fixed, is given by

log q*_2(τ_j) = E_λ[log p(τ_j | λ)] + E_{β,φ}[log p(β_j, φ | τ_j)] + const,

so that log q*_2(τ) = log ∏_{j=1}^{p} GIG(τ_j | c_τ, d_τ, f_{τ_j}), with GIG being the generalized inverse Gaussian distribution. The variational distribution of λ is

log q*_3(λ) = log p(λ) + E_τ[log p(τ | λ)] + const,

which is a gamma distribution. The expected values can be computed as follows. It is worth pointing out that if X ∼ GIG(p, a, b), then its density is proportional to x^{p−1} exp{−(a x + b/x)/2}, x > 0, with normalizing constant involving κ_p(·), a modified Bessel function of the second kind. Therefore, E_τ[D_τ^{-1}] = diag(E_τ(τ_1^{-1}), . . . , E_τ(τ_p^{-1})). For the calculation of E_{β,φ}[φ β_j²], let x | y ∼ N(µ_x, y^{-1} σ_x) and y ∼ Ga(a, b); then E[y x²] = µ_x² a/b + σ_x.

Variational densities for the spline regression model. The variational posterior for β^(2) and φ, while holding q_2(τ | λ), q_3(λ) and q_4(β^(1)) fixed, is given by

log q*_1(β^(2), φ) = E_{β^(1)}[log p(y | β, φ)] + E_τ[log p(β^(2), φ | τ)] + const
= E_{β^(1)} log{ (2π)^{-n/2} |φ^{-1} I_n|^{-1/2} exp[ −(1/2)(y − X_1 β^(1) − X_2 β^(2))^T (φ I_n)(y − X_1 β^(1) − X_2 β^(2)) ] }
+ E_τ log{ (2π)^{-K/2} |φ^{-1} D_τ|^{-1/2} exp[ −(1/2) β^(2)T (φ D_τ^{-1}) β^(2) ] }
+ log[ φ^{a_0 − 1} exp(−φ b_0) ] + const.

It is easy to see that this is a normal-gamma distribution with parameters m_{β^(2)}, C_{β^(2)},

a_φ = a_0 + n/2  and  b_φ = b_0 + (1/2)[ y^T y − 2 E_{β^(1)}(β^(1))^T X_1^T y + E_{β^(1)}(β^(1)T X_1^T X_1 β^(1)) ] − m_{β^(2)}^T C_{β^(2)}^{-1} m_{β^(2)}.

The variational distribution of τ, while holding q_3(λ) fixed, is given by

log q*_2(τ_j) = E_λ[log p(τ_j | λ)] + E_{β,φ}[log p(β_j, φ | τ_j)] + const,

so that log q*_2(τ) = log ∏_{j=1}^{p} GIG(τ_j | c_τ, d_τ, f_{τ_j}), with GIG being the generalized inverse Gaussian distribution. The variational distribution of λ is

log q*_3(λ) = log p(λ) + E_τ[log p(τ | λ)] + const,

which is a gamma distribution. The expected values can be computed as before: if X ∼ GIG(p, a, b), its density is proportional to x^{p−1} exp{−(a x + b/x)/2}, with normalizing constant involving κ_p(·), a modified Bessel function of the second kind; therefore E_τ[D_τ^{-1}] = diag(E_τ(τ_1^{-1}), . . . , E_τ(τ_p^{-1})). For the calculation of E_{β,φ}[φ β_j²], let x | y ∼ N(µ_x, y^{-1} σ_x) and y ∼ Ga(a, b); then E[y x²] = µ_x² a/b + σ_x.

References
A robust method for multiple linear regression
Bayesian smoothing and regression splines for measurement error problems
Pattern Recognition and Machine Learning
Variational inference: a review for statisticians
Explaining the Gibbs sampler
Automatic Bayesian curve fitting
Density estimation via hybrid splines
Sequential adaptive nonparametric regression via H-splines
A Bayesian approach to hybrid splines nonparametric regression
Consistent estimator for basis selection based on a proxy of the Kullback-Leibler distance
VBLinLogit: Variational Bayesian linear and logistic regression
Flexible smoothing with B-splines and penalties
Reference Bayesian analysis for hierarchical models
Dynamic hierarchical models
Sampling-based approaches to calculating marginal densities
Spline regression with automatic knot selection (working paper or preprint)
An introduction to variational methods for graphical models
Statistical properties of the generalized inverse Gaussian distribution
A study of logspline density estimation
Bayesian P-splines
Estimating Bayes factors via posterior simulation with the Laplace-Metropolis estimator
The Bayesian elastic net
Bayes estimates for the linear model
A new Bayesian lasso
Statistical Inference (Chapman & Hall/CRC Texts in Statistical Science Series)
A simulation study comparing knot selection methods with equally spaced knots in a penalized regression spline
Explaining variational approximations
Knot selection for regression splines via the lasso
The Bayesian Lasso
Selecting the number of knots for penalized splines
Knot selection for least-squares and penalized splines
Regression shrinkage and selection via the lasso
Estimation and testing in a regression model with spherically symmetric errors
On scale mixtures of normal distributions

Acknowledgments. This paper was partially supported by Fapesp grants (RD) 2018/04654, (RD and HSM) 2019/10800-0 and (RD) 2019/00787-7.