Fitting Penalized Logistic Regression Models Using QR Factorization
Jacek Klimaszewski, Marcin Korzeń
Computational Science - ICCS 2020, 2020-06-15. DOI: 10.1007/978-3-030-50417-5_4

Abstract. The paper presents an improvement of a commonly used learning algorithm for logistic regression. In the direct approach, the Newton method needs an inversion of the Hessian, which is cubic with respect to the number of attributes. We study the special case when the number of samples m is smaller than the number of attributes n, and we prove that, using a previously computed QR factorization of the data matrix, the Hessian inversion in each step can be performed significantly faster: in O(m³) or O(m²n) operations instead of O(n³) as in the ordinary Newton optimization. We show formally that this can be adopted very effectively for ℓ2-penalized logistic regression and also, not so effectively but still competitively, for certain types of sparse penalty terms. This approach can be especially interesting for a large number of attributes and a relatively small number of samples, as takes place in so-called extreme learning. We present a comparison of our approach with commonly used learning tools.

We consider a binary classification problem with n inputs and one output. Let X ∈ R^{m×n} be a dense data matrix containing m data samples and n attributes, and let y ∈ {−1, +1}^m be the corresponding targets. We consider the case m < n. In the following, bold capital letters X, Y, ... denote matrices, bold lower-case letters x, w stand for vectors, and normal lower-case x_ij, y_i, λ for scalars. The paper concerns classification, but it is clear that the presented approach can easily be adopted for the linear regression model.
We consider the common logistic regression model in the following form:

  Pr(y = +1 | x, w) ≡ σ(x, w) = 1 / (1 + e^{−Σ_{j=1}^{n} x_j w_j}).  (1)

Learning of this model is typically reduced to the minimization of the penalized negative log-likelihood (regularization is added in order to improve generalization and numerical stability):

  L(w) = Σ_{i=1}^{m} log(1 + e^{−y_i Σ_{j=1}^{n} x_{ij} w_j}) + λ P(w),  (2)

where λ > 0 is a regularization parameter. Here we consider two separate cases:

1. the rotationally invariant case, i.e. P(w) = ½ ‖w‖₂²,
2. other (possibly non-convex) cases, including P(w) = (1/q) ‖w‖_q^q.

The most common approaches include the IRLS algorithm [7, 15] and direct Newton iterations [14]. Both approaches are very similar; here we consider Newton iterations

  w^{(k+1)} = w^{(k)} − α H^{−1} g,  (3)

where the step size α is chosen via backtracking line search [1]. The gradient g and the Hessian H of L(w) have the form

  g = X^T v + λ P′(w),  (4)
  H = X^T D X + E,  (5)

where D is a diagonal matrix whose i-th entry equals σ(x_i, w) · (1 − σ(x_i, w)), and v_i = y_i · (σ(x_i, w) − 1). The Hessian is a sum of the matrix E (the second derivative of the penalty function multiplied by λ) and the matrix X^T D X. Depending on the penalty function P, the matrix E may be: 1) scalar diagonal (λI), 2) non-scalar diagonal, 3) of other than diagonal type. In this paper we investigate only cases 1) and 2).

There are many approaches to learning the logistic regression model, among them direct second-order procedures such as IRLS and Newton (with Hessian inversion using linear conjugate gradient), and first-order procedures, with nonlinear conjugate gradient as the most representative example. A short review can be found in [14]. Another group of methods comprises second-order procedures with Hessian approximation, such as L-BFGS [21], fixed Hessian, or truncated Newton [2, 13]. Some of those techniques are implemented in scikit-learn [17], which is the main environment for our experiments. QR factorization is a common technique for fitting the linear regression model [9, 15]. Here we consider two cases.
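The gradient and Hessian described above can be written down directly in NumPy. The sketch below is our own illustration (the function names are ours, not the authors'); it evaluates σ on the signed margin y_i·x_i^T w, which gives the same D and the stated v_i = y_i(σ_i − 1):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def grad_hess(w, X, y, lam):
    """Gradient and Hessian of the l2-penalized negative log-likelihood.

    Mirrors the decomposition in the text: g = X^T v + lam*w and
    H = X^T D X + E, with E = lam*I for the ridge penalty.
    """
    s = sigmoid(y * (X @ w))       # sigma evaluated on the signed margin
    v = y * (s - 1.0)              # v_i = y_i * (sigma_i - 1)
    g = X.T @ v + lam * w
    d = s * (1.0 - s)              # diagonal of D
    H = X.T @ (d[:, None] * X) + lam * np.eye(X.shape[1])
    return g, H
```

A finite-difference check of `g` against the objective in Eq. (2) confirms the sign conventions.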
The relation between the number of samples and the number of attributes leads to different kinds of factorization: LQ factorization for m < n, QR factorization for m ≥ n. Since we assume m < n, we consider the LQ factorization of the matrix X:

  X = [L̂ 0] Q = L̂ Q̂,  (6)

where L̂ is an m × m lower triangular matrix, Q is an n × n orthogonal matrix, and Q̂ (the first m rows of Q) is an m × n semi-orthogonal matrix (Q̂ Q̂^T = I, and Q̂ Q̃^T = 0 for the remaining rows Q̃ of Q). The result is essentially the same as if the QR factorization of the matrix X^T was performed.

Finding the Newton direction from Eq. (3),

  d = (X^T D X + E)^{−1} g,  (7)

involves a matrix inversion, which has complexity O(n³). A direct inversion of the Hessian can be replaced (and improved) by solving the system of linear equations

  (X^T D X + E) d = g  (8)

with the conjugate gradient method. This Newton method, with Hessian inversion via linear conjugate gradient, is the starting point of our research. We show below how this approach can be improved using the QR decomposition.

In the ℓ2-regularized case the solution has the form

  d = (X^T D X + λI)^{−1} g.  (9)

Substituting L̂Q̂ for X and Q̂^T Q̂ w for w in Eq. (9) leads to

  d = Q̂^T (L̂^T D L̂ + λI)^{−1} Q̂ g.  (10)

First, multiplication by Q̂ transforms w to the smaller space; then the inversion is done in that space; finally, multiplication by Q̂^T brings the solution back to the original space. However, all computation may be done in the smaller space (using L̂ instead of X in Eq. (9)), and only the final solution is brought back to the original space. This approach is summarized in Algorithm 1; in the experimental part it is called L2-QR.

Algorithm 1. Newton method for ℓ2-penalized logistic regression with QR factorization using transformation into a smaller space (L2-QR).
  Input: X ∈ R^{m×n}, y ∈ R^{m×1}, m < n.
  Initialization: [L̂, Q̂] = lq(X), ŵ = 0 (m × 1).
  repeat
    Compute ĝ and D for ŵ^{(k)}.
    Solve (L̂^T D L̂ + λI) · d̂ = ĝ.
    Update ŵ^{(k+1)} = ŵ^{(k)} − α d̂ (α from line search).
  until convergence
  Output: w = Q̂^T ŵ.

This approach is not new [8, 16]; however, the use of this trick does not seem to be common in machine-learning tools. In the case of penalty functions whose Hessian E is a non-scalar diagonal matrix, it is still possible to construct an algorithm that solves a smaller problem via QR factorization.
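The L2-QR idea can be sketched with SciPy, obtaining the LQ factors from the economy QR factorization of X^T as the text describes. This is a minimal illustration under our own assumptions (plain Newton with a crude step-halving line search instead of the paper's backtracking rule; all names are ours):

```python
import numpy as np
from scipy.linalg import qr, solve

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def _loss(w_hat, L, y, lam):
    # Objective (2) restricted to the m-dimensional space (ridge penalty).
    return np.sum(np.log1p(np.exp(-y * (L @ w_hat)))) + 0.5 * lam * np.sum(w_hat ** 2)

def fit_l2_qr(X, y, lam, n_iter=50):
    """Sketch of Algorithm 1 (L2-QR): Newton iterations in the smaller space.

    The LQ factorization X = L_hat @ Q_hat comes from the economy QR
    factorization of X^T; every Newton step works with the m x m matrix
    L_hat, and only the final solution is mapped back via Q_hat^T.
    """
    m, n = X.shape
    Qt, R = qr(X.T, mode='economic')   # X^T = Qt @ R, hence X = R^T @ Qt^T
    L, Qhat = R.T, Qt.T                # L: m x m lower triangular, Qhat: m x n
    w_hat = np.zeros(m)
    for _ in range(n_iter):
        s = sigmoid(y * (L @ w_hat))
        g = L.T @ (y * (s - 1.0)) + lam * w_hat
        if np.linalg.norm(g) < 1e-10:
            break
        d = s * (1.0 - s)
        Hs = L.T @ (d[:, None] * L) + lam * np.eye(m)
        step = solve(Hs, g, assume_a='pos')
        alpha, f0 = 1.0, _loss(w_hat, L, y, lam)
        while _loss(w_hat - alpha * step, L, y, lam) > f0 and alpha > 1e-8:
            alpha *= 0.5               # crude step-halving line search
        w_hat -= alpha * step
    return Qhat.T @ w_hat              # back to the original n-dim space
```

At the returned point, the full-space gradient X^T v + λw vanishes, since ‖Q̂^T ĝ‖ = ‖ĝ‖ for the semi-orthogonal Q̂.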
Consider again Eqs. (5), (6) and (7). Using the Sherman-Morrison-Woodbury formula [5] with A = E and B = X^T D X, we may write

  d = (A + B)^{−1} g = A^{−1} C^{−1} g,  where C = I + B A^{−1}.  (12)

Exploiting the structure of the matrices L̂ and Q̂ in (6), together with the push-through identity (I + UV)^{−1} = I − U (I + VU)^{−1} V applied with U = Q̂^T and V = L̂^T D X E^{−1}, yields

  C^{−1} = I − Q̂^T C₁^{−1} L̂^T D X E^{−1},  (13, 14)

Hence only the matrix C₁ = L̂^T D X E^{−1} Q̂^T + I of size m × m needs to be inverted; the inversion of the diagonal matrix E is trivial. Putting (14) and (13) into (12) and simplifying the obtained expression results in

  d = E^{−1} g − E^{−1} Q̂^T C₁^{−1} L̂^T D X E^{−1} g.  (15)

This approach is summarized in Algorithm 2 (Newton method for logistic regression with QR factorization and a general regularizer).

Application to the Smooth ℓ1 Approximation. Every convex, twice continuously differentiable regularizer can be put in place of the ridge penalty, and the above procedure may be used to optimize such a problem. In this article we focus on the smooth approximation of the ℓ1 norm [12] via the integral of the hyperbolic tangent function:

  P(w) = Σ_{j=1}^{n} ∫₀^{w_j} tanh(a t) dt = (1/a) Σ_{j=1}^{n} log cosh(a w_j),  (16)

and we call this model L1-QR-soft. In this case E is diagonal with entries e_jj = λ a (1 − tanh²(a w_j)).

Application to the Strict ℓ1 Penalty. Fan and Li proposed a unified algorithm for the minimization problem (2) via local quadratic approximations [3]. Here we use the idea presented by Krishnapuram [11], in which the following inequality is used:

  |w_j| ≤ ½ (w_j² / |w̄_j| + |w̄_j|),

which is true for any w̄_j ≠ 0, with equality if and only if |w_j| = |w̄_j|. The cost function has the form

  L(w) = Σ_{i=1}^{m} log(1 + e^{−y_i Σ_j x_{ij} w_j}) + λ Σ_{j=1}^{n} |w_j|.

If we differentiate the (quadratically bounded) penalty term twice, we get a diagonal matrix E with entries

  e_jj = λ / |w̄_j|,

where w̄ is the current iterate. The initial w must be non-zero (we set it to 1), otherwise there is no progress. If |w_j| falls below machine precision, we set it to zero. Applying the idea of the QR factorization leads to the direction (15) with this diagonal E. One can note that when w is sparse, the corresponding diagonal elements of E^{−1} are 0. To avoid unnecessary multiplications by zero, we rewrite the product X E^{−1} Q̂^T as a sum of outer products:

  X E^{−1} Q̂^T = Σ_{j: w_j ≠ 0} e_jj^{−1} x̃_j q̃_j^T,

where x̃_j and q̃_j are the j-th columns of the matrices X and Q̂, respectively. A similar concept is used when multiplying the matrix E^{−1} Q̂^T by a vector z: the j-th element of the result equals e_jj^{−1} q̃_j · z. We refer to this model as L1-QR.
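The key point, that only an m × m system has to be solved when E is diagonal, can be illustrated with the textbook form of the Woodbury identity. The sketch below is our own and inverts D^{−1} + X E^{−1} X^T rather than the paper's C₁, which gives the same direction at the same O(m²n) cost:

```python
import numpy as np

def newton_dir_woodbury(g, X, d, e):
    """Newton direction (E + X^T D X)^{-1} g without any n x n inversion.

    X: m x n data matrix; d: positive diagonal of D (length m); e: positive
    diagonal of E (length n).  Only an m x m system is solved, so the cost
    is O(m^2 n) instead of O(n^3).
    """
    Eg = g / e                          # E^{-1} g
    XEXt = X @ (X / e).T                # X E^{-1} X^T  (m x m, O(m^2 n))
    S = np.diag(1.0 / d) + XEXt         # D^{-1} + X E^{-1} X^T
    t = np.linalg.solve(S, X @ Eg)      # the only dense solve: m x m
    return Eg - (X.T @ t) / e           # E^{-1} g - E^{-1} X^T t
```

For the ℓ1 majorization above one would pass e_jj = λ/|w̄_j| restricted to the non-zero coefficients.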
After obtaining the direction d, we use backtracking line search with the sufficient decrease condition given by Tseng and Yun [19], with one exception: if a unit step already gives a sufficient decrease, we search for a bigger step to ensure faster convergence.

Application to the ℓq<1 Penalty. The idea described above can be directly applied to the ℓq<1 "norms" [10]; we call this model Lq-QR. The cost function has the form

  L(w) = Σ_{i=1}^{m} log(1 + e^{−y_i Σ_j x_{ij} w_j}) + λ Σ_{j=1}^{n} |w_j|^q,

and the same quadratic bounding argument yields a diagonal matrix E with entries e_jj = λ q |w̄_j|^{q−2} (which reduces to λ/|w̄_j| for q = 1).

The cost of each iteration in the ordinary Newton method for logistic regression equals k · (2n² + n), where k is the number of conjugate gradient iterations. In general k ≤ n, so in the worst case the complexity is O(n³). Using data transformed to the smaller space, each step of the Newton procedure is much cheaper: it requires about k·m² operations (the cost of solving the system of linear equations using conjugate gradient, with k ≤ m), i.e. O(m³) in the worst case. As shown in the experimental part, this approach dominates other optimization methods (especially exact second-order procedures). Looking at the above estimates, it is clear that the presented approach is especially attractive when m ≪ n.

In the second case, the dominating operation is the computation of the matrix C₁ in Eq. (15). Due to the dimensions of the matrices (L̂ ∈ R^{m×m}, X ∈ R^{m×n} and Q̂ ∈ R^{m×n}), the complexity of computing C₁ is O(m²n); the cost of inverting C₁ is less important, i.e. O(m³). In the case of the ℓ1 penalty, taking the sparsity of w into account reduces this complexity to O(m² · #nnz), where #nnz is the number of non-zero coefficients. Therefore the theoretical upper bound per iteration for logistic regression with a rotationally variant penalty function is O(m²n), which is better than the direct Newton approach. However, looking at (15), we see that the number of multiplications is large, so the constant factor in this estimate is large.

In the experimental part we present two cases: 1) learning the ordinary logistic regression model, and 2) learning a 2-layer neural network via the extreme learning paradigm. We use the following datasets:
1.
Artificial dataset with 100 informative attributes and 1000 redundant attributes; the informative part was produced by the function make_classification from the package scikit-learn, and the whole set was transformed to introduce correlations. 2. Two micro-array datasets: leukemia [6] and prostate cancer [18]. 3. Artificial non-linearly separable datasets: chessboard 3 × 3, chessboard 4 × 4 and two spirals, used for learning the neural network.

As a reference we use the solvers available in the package scikit-learn for the LogisticRegression model, i.e. for the ℓ2 penalty: LibLinear [4] in two variants (primal and dual), L-BFGS, and L2-NEWTON-CG. For sparse penalty functions we compare our solutions with the two solvers available in scikit-learn: LibLinear and SAGA. For the ℓ2 penalty we provide the algorithm L2-QR presented in Sect. 2.1. In the "sparse" case we compare the three algorithms presented in Sect. 2.2: L1-QR-soft, L1-QR and Lq-QR.

Our approach L2-QR (Algorithm 1) is computationally equivalent to L2-NEWTON-CG, meaning that we solve an identical optimization problem (though in a smaller space). In the case of the ℓ2 penalty all models should theoretically converge to the same solution, so differences in the final value of the objective function are caused by numerical issues (such as numerical errors, approximations, or exceeding the number of iterations without convergence). These differences affect the predictions on a test set. The case of the ℓ1 penalty is more complicated to compare. The L1-QR algorithm is equivalent to L1-Liblinear, i.e. it minimizes the same cost function. The algorithm L1-QR-soft uses the approximated ℓ1 norm, and the algorithm Lq-QR uses a slightly different, non-convex cost function, which gives results similar to ℓ1-penalized regression for q ≈ 1. We should also emphasize that the SAGA algorithm does not directly optimize the penalized log-likelihood on the training set; it is a stochastic optimizer and sometimes gives qualitatively different models.
In the case of L1-QR-soft the final solution is only approximately sparse (and depends on a in Eq. (16)), whereas the other models produce strictly sparse solutions. The measure of sparsity is the number of non-zero coefficients; for L1-QR-soft we check the sparsity with a tolerance of order 10^{−5}. All algorithms were started with the same parameters, the maximum number of iterations (1000) and the tolerance (ε = 10^{−6}), and used the same training and testing datasets. All algorithms depend on the regularization parameter C (= 1/λ); this parameter is selected by cross-validation from the same range. During the experiments with artificial data we change the size of the training subset. Experiments were performed on an Intel Xeon E5-2699 v4 machine, in a single-threaded environment (with parameters n_jobs=1 and MKL_NUM_THREADS=1).

In the first experiment, presented in Fig. 1, we use the artificial highly correlated dataset (1). We used a training/testing procedure for each size of learning data, and for each classifier we selected the optimal value of the parameter C = 1/λ using cross-validation. The number of samples varies from 20 to 300. As we can see, in the case of the ℓ2 penalty our solution using QR decomposition, L2-QR, gives better fitting times than the ordinary solvers available in scikit-learn, and all algorithms perform nearly the same; only L2-lbfgs gives slightly different results. In the case of the sparse penalty, our algorithm L1-QR works faster than L1-Liblinear and obtains comparable but not identical results. For the sparse case, L1-SAGA gives the best predictions (about 1-2% better than the other sparse algorithms), but it produces the densest solutions, similarly to L1-QR-soft.

In the second experiment we used micro-array data with the original train and test sets. For those datasets the ratios (samples/attributes) are fixed (about 0.005-0.01). The results are shown in Table 1 (ℓ2 case) and in Table 2 (ℓ1 case). The tables present mean values of times and cost functions, averaged over the λs.
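A minimal stand-in for this benchmark setup can be written directly against scikit-learn's reference solvers. The sketch below is illustrative only (dataset sizes, C and the solver list are our assumptions; it does not include the paper's L2-QR/L1-QR implementations):

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small stand-in for the benchmark setting: m < n with redundant features.
X, y = make_classification(n_samples=100, n_features=1000,
                           n_informative=100, n_redundant=900,
                           random_state=0)

scores = {}
for solver, penalty in [('lbfgs', 'l2'), ('newton-cg', 'l2'),
                        ('liblinear', 'l2'), ('liblinear', 'l1'),
                        ('saga', 'l1')]:
    clf = LogisticRegression(solver=solver, penalty=penalty,
                             C=1.0, max_iter=1000, tol=1e-6)
    t0 = time.perf_counter()
    clf.fit(X, y)
    scores[(solver, penalty)] = clf.score(X, y)
    print(f"{solver:>10s}/{penalty}: {time.perf_counter() - t0:.3f}s, "
          f"train acc {scores[(solver, penalty)]:.3f}")
```

Using the same max_iter and tol for every solver, as in the experiments above, keeps the timing comparison fair.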
Whole traces over the λs are presented in Fig. 2 and Fig. 3. For the case of the ℓ2 penalty, all tested algorithms give identical results in terms of prediction quality and cost function. However, the fitting time differs, and the best algorithm is the one that uses QR factorization. For the case of sparse penalty functions, only the algorithms L1-Liblinear and L1-QR give quantitatively the same results; however, L1-Liblinear works about ten times faster. The other models give qualitatively different results. The algorithm Lq-QR obtained the best sparsity and the best prediction accuracy, and was also slightly faster than L1-QR. Looking at the cost function with the ℓ1 penalty, we see that L1-Liblinear and L1-QR are the same, while SAGA obtains a worse cost function than even L1-QR-soft. We stress that scikit-learn provides solvers only for the ℓ2 and ℓ1 penalties, not for the general ℓq case.

A random vector functional-link (RVFL) network is a method of learning two- (or more) layer neural networks in two separate steps. In the first step, the coefficients of the hidden neurons are chosen randomly and fixed; in the second step, a learning algorithm is used only for the output layer. The second step is equivalent to learning a logistic regression model (a linear model with a sigmoid output function). Recently this approach has also become known as "extreme learning" (see [20] for more references). The output of a neural network with a single hidden layer is given by

  o(x) = σ( Σ_{z=1}^{Z} w₂,z ϕ(x; w₁,z, b₁,z) + b₂ ),

where Z is the number of hidden neurons and ϕ(x; w, b) = tanh(Σ_{k=1}^{n} w_k x_k + b) is the activation function. In this experiment we choose the hidden-layer coefficients W₁ and b₁ randomly, with the number of hidden neurons Z = 1000, and then we learn the coefficients of the output layer, w₂ and b₂, using the new transformed data matrix X̃ with entries X̃_{iz} = ϕ(x_i; w₁,z, b₁,z). For the experiments we prepared the class ExtremeClassier (in the scikit-learn paradigm), which depends on the number of hidden neurons Z, the kind of linear output classifier, and its parameters.
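The two-step RVFL/extreme-learning scheme described above can be sketched in a few lines: a random, fixed tanh hidden layer followed by a penalized logistic output layer. This is our own toy illustration (the dataset, weight scale and C are assumptions, and scikit-learn's stock solver stands in for the paper's QR-based ones):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
Z = 1000                                  # number of hidden neurons

# A non-linearly separable toy problem stands in for chessboard/two spirals.
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

W1 = rng.normal(size=(X.shape[1], Z))     # random, fixed first-layer weights
b1 = rng.normal(size=Z)
H = np.tanh(X @ W1 + b1)                  # transformed data matrix

# Only the output layer is learned: an ordinary penalized logistic model.
clf = LogisticRegression(penalty='l2', C=1.0, max_iter=1000)
clf.fit(H, y)
print("train accuracy:", clf.score(H, y))
```

Since the hidden layer turns every sample into Z = 1000 features, the output-layer fit is exactly the m < n regime that the QR-based solvers target.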
In the fitting part we ensure the same random part of the classifier. In this experiment we also added a new model: a multi-layer perceptron with two layers and Z hidden neurons, fitted in the standard way using the L-BFGS algorithm (MLP-lbfgs). Results of the experiment are presented in Fig. 4. For each size of learning data and for each classifier we select the optimal value of the parameter C = 1/λ using cross-validation. The number of samples varies from 20 to 300. As we can see, in both cases (ℓ2 and sparse penalties) our solution using QR decomposition always gives better fitting times than the ordinary solvers available in scikit-learn. The fitting time of L1-QR is 2-5 times shorter than that of L1-Liblinear, especially for the cases chessboard 4 × 4 and two spirals. Looking at quality, we see that the sparse models are similar, but slightly different. For two spirals the best one is Lq-QR, which is also the sparsest model. Generally, sparse models are better for two spirals and chessboard 4 × 4. The MLP model has the worst quality and a fitting time comparable to the sparse regressions. The experiment shows that QR factorization can be used to implement learning of an RVFL network with different regularization terms effectively. Moreover, we confirm that such learning works more stably than ordinary neural-network learning algorithms, especially for a large number of hidden neurons. Exemplary decision boundaries, sparsity and the found hidden neurons are shown in the accompanying figures.

In this paper we presented an application of QR matrix factorization to improve the Newton procedure for learning logistic regression models with different kinds of penalties. We presented two approaches: the rotationally invariant case with the ℓ2 penalty, and the general convex, rotationally variant case with sparse penalty functions. Generally speaking, there is strong evidence that the use of QR factorization in the rotationally invariant case can improve the classical Newton-CG algorithm when m < n.
The most expensive operation in this approach is the QR factorization itself, which is performed once at the beginning. Our experiments also showed that, for m ≪ n, this approach surpasses other algorithms that approximate the Hessian, such as L-BFGS and the truncated Newton method (used in Liblinear). In this case we have shown that the theoretical upper bound on the cost of a Newton iteration is O(m³). We also showed that, using the QR decomposition and the Sherman-Morrison-Woodbury formula, we can solve the problem of learning the regression model with different sparse penalty functions. The improvement in this case is not as strong as in the case of the ℓ2 penalty; however, we proved that using QR factorization we obtain a theoretical upper bound significantly better than for the general Newton-CG procedure. In fact, the Newton iterations in this case have the same cost as the initial QR decomposition, i.e. O(m²n). Numerical experiments revealed that for more difficult and correlated data (e.g. for extreme learning) this approach may work faster than L1-Liblinear. However, we should admit that in typical and simpler cases L1-Liblinear may be faster.

References
[1] Convex Optimization
[2] On the nonmonotone line search
[3] Variable selection via nonconcave penalized likelihood and its oracle properties
[4] LIBLINEAR: a library for large linear classification
[5] Matrix Computations. Johns Hopkins Studies in the Mathematical Sciences
[6] Molecular classification of cancer: class discovery and class prediction by gene expression monitoring
[7] Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives (with discussion)
[9] The Elements of Statistical Learning
[10] Learning with Lq<1 vs L1-norm regularisation with exponentially many irrelevant features
[11] Sparse multinomial logistic regression: fast algorithms and generalization bounds
[12] SSVM: a smooth support vector machine for classification
[13] Trust region Newton method for logistic regression
[14] A comparison of numerical optimizers for logistic regression
[15] Machine Learning: A Probabilistic Perspective
[16] Feature selection, L1 vs. L2 regularization, and rotational invariance
[17] Scikit-learn: machine learning in Python
[18] Overexpression of vimentin: role in the invasive phenotype in an androgen-independent model of prostate cancer
[19] A coordinate gradient descent method for nonsmooth separable minimization
[20] Comments on "the extreme learning machine"
[21] Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization