key: cord-0579081-qz9o36t9 authors: Merino, Pedro; Reyes, Juan Carlos De Los title: A second-order method with enriched Hessian information for imaging composite sparse optimization problems date: 2020-09-03 journal: nan DOI: nan sha: 665c2abedc33a8cea4302d71b3baf034c0323d36 doc_id: 579081 cord_uid: qz9o36t9 In this paper we propose a second-order method for solving \emph{linear composite sparse optimization problems}, which consist of minimizing the sum of a differentiable (possibly nonconvex) function and a nondifferentiable convex term. The composite nondifferentiable convex penalizer is given by the $\ell_1$-norm of a matrix multiplied with the coefficient vector. The algorithm that we propose for the linear composite $\ell_1$ problem relies on the three main ingredients that power the OESOM algorithm \cite{dlrlm07}: the minimum norm subgradient, a projection step and, in particular, the second-order information associated with the nondifferentiable term. By extending these devices, we obtain a full second-order method for solving composite sparse optimization problems, which covers a wide range of applications. For instance, problems involving the minimization of a general class of \emph{differential graph operators} can be solved with the proposed algorithm. We present several computational experiments to show the efficiency of our approach on different application examples. In areas like statistics, machine learning, optimal control and image processing, among others, many minimization problems involve a cost function that enforces some sparsity pattern in the solution. In data analysis, for instance, the ability of the $\ell_1$-norm penalization to produce sparse structures is well exploited for variable selection [11, 12]. Also, the composite problem of minimizing the cost $f(x) + \beta\|Cx\|_1$, for some matrix $C$, is relevant in practice when sparsity is affected by a pattern matrix $C$. For example, in the fused lasso problem, $C$ corresponds to the successive difference operator; the term $\|Cx\|_1$ then becomes the anisotropic total variation of $x$, which has several applications in signal and image processing [14]. In addition, higher-order differential operators (e.g., the graph Laplacian) are covered by $C$, e.g., in trend filtering over graphs [15], and novel nonlocal differential operators can also be considered in the spirit of [9]. Moreover, for a sufficiently large $\beta$, this problem can also be interpreted as the exact penalization of the minimization of $f$ subject to the constraint $Cx = 0$. First-order algorithms have been proposed for minimizing special cases of the objective function $f(x) + \beta\|Cx\|_1$; see for instance [3], where a modified monotone version of the FISTA algorithm is proposed for image restoration problems. On the other hand, despite the convergence properties of second-order methods, they have not been exhaustively investigated in the context of nonsmooth optimization. One of the reasons for their lack of popularity is the high storage requirement and computational cost at every iteration. However, second-order methods can be practical and advantageous if combined with cost-reduction and parallelization techniques. One of the first second-order algorithms developed for solving composite problems was introduced in [8] for general compositions of a nonsmooth and a smooth function.
Their approach is to build a local approximation of the smooth function and to minimize the resulting surrogate model of the objective function using a trust-region algorithm. The surrogate model is itself a nonsmooth composite problem, which is solved by expressing the nonsmooth penalization as a polyhedral function, leading to a constrained quadratic optimization problem. However, in the case of the $\ell_1$-norm this procedure needs a dense matrix $H$ of size $m \times 2^m$ in order to express $\|Cx\|_1$ as a polyhedral function using the columns of $H$, which might be prohibitive for large values of $m$. More recently, a primal-dual second-order method was proposed in [7]. There, a new variable $y$, which represents the nonsmooth composite term, is introduced in order to cope with the composite quantity $Cx$. Next, a quadratic penalization is added to the Lagrangian, yielding an augmented Lagrangian method, which handles the composite term as a constraint. This constraint is penalized by introducing an additional dual variable, at the cost of increasing the size of the problem. The variant of this approach formulated in [7] uses the proximal operator in order to represent the Lagrange function by means of its Moreau envelope. Then, a generalized second-order Hessian of the Lagrangian is introduced by computing the Clarke subgradient of the associated proximal operator. This generalized Hessian provides second-order updates for the primal and the dual variables. However, its second-order system requires the computation of the proximal operator of the nonsmooth penalizer, which does not have a closed form in the case of the term $\|Cx\|_1$. Also, the generalized Hessian provided is not symmetric, which is a potential numerical disadvantage. In this paper we devise a new algorithm inspired by the orthant-wise second-order algorithm from [6], which utilizes second-order information from the regular part $f$ and also from the nondifferentiable composite term $\|C\cdot\|_1$. As expected, the transformation of the variable by the matrix $C$ entails new numerical and theoretical challenges, since the sparsity term is no longer separable. Therefore, by extending several devices from [6] we are able to describe a full second-order algorithm for the composite case and to derive the corresponding convergence analysis by applying the ideas from [2] based on the Lojasiewicz condition. We organize this paper as follows. The problem is set in Section 2. In Section 3 we describe the elements that lead to the numerical method. Section 4 is devoted to the convergence analysis and the derivation of the corresponding rate. Finally, we present numerical tests that show how second-order information is relevant for the numerical performance. Let $f : \mathbb{R}^m \to \mathbb{R}$ be a differentiable function and let $\beta > 0$. We are interested in the numerical solution of the unconstrained optimization problem (1) $\min_{x\in\mathbb{R}^m} \varphi(x) := f(x) + \beta\|Cx\|_1$, where $\|\cdot\|_1$ corresponds to the standard $\ell_1$-norm in $\mathbb{R}^n$ and $C$ is a real $n \times m$ matrix with rows $c_i$, for $i = 1, \dots, n$. We shall notice that, by modifying the matrix $C$, problem (1) also covers the so-called fused lasso problem. To obtain existence of solutions for problem (1), the conditions collected in Assumption 1 are assumed hereafter. The existence of solutions then follows from Weierstrass' theorem.
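As an illustration of problem (1), the following minimal Python sketch assembles the composite objective for the fused/anisotropic-TV case, where $C$ is the successive difference operator. The least-squares smooth term, the signal and all parameter values are illustrative assumptions and are not taken from the paper.

```python
# Minimal sketch of the composite objective phi(x) = f(x) + beta * ||C x||_1
# for a fused-lasso-type penalizer; all names and values are placeholders.
import numpy as np

m, beta = 100, 0.5

def difference_matrix(m):
    """Successive difference operator C in R^{(m-1) x m}, so that
    ||C x||_1 is the discrete anisotropic total variation of x."""
    C = np.zeros((m - 1, m))
    C[np.arange(m - 1), np.arange(m - 1)] = -1.0
    C[np.arange(m - 1), np.arange(1, m)] = 1.0
    return C

C = difference_matrix(m)

def f(x, y):
    """Smooth data-fidelity term; a simple least-squares example."""
    return 0.5 * np.sum((x - y) ** 2)

def phi(x, y):
    """Composite objective phi(x) = f(x) + beta * ||C x||_1."""
    return f(x, y) + beta * np.sum(np.abs(C @ x))

y = np.sin(np.linspace(0, 2 * np.pi, m)) + 0.1 * np.random.randn(m)
print(phi(y, y))   # evaluate the objective at the noisy signal itself
```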
2.1. First order optimality conditions. Let us denote by $\bar x$ a solution of (1) and by $\partial\varphi(x)$ the subdifferential of the function $\varphi$ at $x$. Moreover, let us denote by $g$ the convex nondifferentiable part of $\varphi$, that is, $g(x) = \beta\|Cx\|_1$. By using the standard optimality condition (2) $0 \in \nabla f(\bar x) + \partial g(\bar x)$, we can derive first-order necessary optimality conditions for (1). Indeed, from relation (2) we establish the optimality conditions characterizing a solution of problem (1). By using computation rules from subdifferential calculus, we argue that if $\bar x$ is a solution of (1), then there exists $\xi(\bar x)\in\mathbb{R}^n$ such that (3) $\nabla f(\bar x) + \beta C^\top\xi(\bar x) = 0$, where (4) $\xi_i(\bar x) = \operatorname{sign}(\langle c_i, \bar x\rangle)$ if $\langle c_i, \bar x\rangle \neq 0$ and $\xi_i(\bar x)\in[-1,1]$ otherwise. For a given $x$, let us define the index sets $P(x) := \{i : \langle c_i, x\rangle > 0\}$, $N(x) := \{i : \langle c_i, x\rangle < 0\}$ and $A(x) := \{i : \langle c_i, x\rangle = 0\}$. Therefore, if $\bar\xi = \xi(\bar x)$, condition (3) is equivalent to the existence of $\bar\xi_i\in[-1,1]$, for $i\in\bar A$, satisfying the linear system (5), where $\bar P$, $\bar N$ and $\bar A$ are the corresponding index sets associated with $\bar x$. Remark 1. Notice that the linear system (5) is of size $m\times p$, with $p\le n$ being the cardinality of $\bar A$. Let us denote by $C_{\bar A}\in\mathbb{R}^{m\times p}$ the matrix whose columns are formed by the transposed rows indexed in $\bar A$, and by $\tilde C_{\bar A}$ the corresponding augmented matrix, i.e., the matrix with the extra column given by the right-hand side of (5). In the following we will assume that the hypothesis of the Rouché-Capelli theorem holds, that is, that system (5) has at least one solution, which is the case provided that $\operatorname{rank}(C_{\bar A}) = \operatorname{rank}(\tilde C_{\bar A})$. We start with the construction of a descent direction, for which, following [6], we consider a vector of the form $\nabla f(x) + \beta C^\top\xi(x)$, with $\xi(x)$ according to (4). 3.1. Computation of a descent direction. In standard $\ell_1$-norm penalized problems [6], the natural choice for the subgradient element is the one with minimum 2-norm or, equivalently in the convex case, the steepest descent direction. Because of the particular structure of the $\ell_1$-norm, the minimum norm subgradient is also known as the orthant direction; in fact, it characterizes the orthant in which a descent direction has to be found. However, in the case of composite optimization, the term $\|Cx\|_1$ is no longer separable. Therefore, there is no orthant-wise interpretation for the minimum norm subgradient, which is defined in general as (6) $\xi(x)\in\operatorname*{arg\,min}_{\xi\in\partial\|\cdot\|_1(Cx)} \|\nabla f(x) + \beta C^\top\xi\|_2^2$. One of the drawbacks of using the minimum norm subgradient is that its computation requires the solution of an auxiliary quadratic optimization problem with box constraints. However, although an additional optimization subproblem is needed, it is not as expensive as it may appear at first sight. Indeed, since we already know that $\xi_i = \operatorname{sign}(\langle c_i, x\rangle)$ whenever $\langle c_i, x\rangle\neq 0$, we can exclude these components from the optimization problem (6). Let $p := |A|$ and let us denote by $\tilde\xi$ the vector collecting the fixed components $\operatorname{sign}(\langle c_i, x\rangle)$, $i\in N\cup P$. Further, let $C_A$ denote the matrix obtained by removing from $C$ all rows $c_i$ with $i\in N\cup P$. Hence, we may reformulate problem (6) as the following box-constrained quadratic optimization problem: (MinSub) $\min_{\xi_A\in[-1,1]^p} \big\|\nabla f(x) + \beta C_{N\cup P}^\top\tilde\xi + \beta C_A^\top\xi_A\big\|_2^2$. Notice that this problem has the size of the active set cardinality at $x$. In many cases $C_A C_A^\top$ is nonsingular, and thus problem (MinSub) has a unique solution. Moreover, the solution of (MinSub) can be expressed by formula (7), where $P_I$ denotes the projection onto a set $I$. Formula (7), however, does not yield a closed-form solution; indeed, its dual fits into a classical LASSO problem formulation.
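A minimal sketch of the reduced subproblem (MinSub) is given below: the components of $\xi$ on $P\cup N$ are fixed to $\operatorname{sign}(\langle c_i, x\rangle)$, and the remaining components solve a box-constrained least-squares problem. The paper's implementation uses MATLAB's quadprog; here scipy.optimize.lsq_linear is used as an illustrative substitute, and the tolerance `tol` deciding when $\langle c_i, x\rangle$ is treated as zero is our own assumption.

```python
# Hedged sketch of the reduced (MinSub) subproblem: the free components xi_A
# of the minimum-norm subgradient solve a box-constrained least-squares
# problem of the size of the active set.
import numpy as np
from scipy.optimize import lsq_linear

def min_norm_subgradient(grad_f, C, x, beta, tol=1e-10):
    Cx = C @ x
    active = np.abs(Cx) <= tol          # indices i with <c_i, x> = 0
    xi = np.sign(Cx)                    # fixed components on P and N
    # residual contributed by the smooth part and the fixed components
    v = grad_f + beta * C[~active].T @ xi[~active]
    if active.any():
        # min_{xi_A in [-1,1]^p} || v + beta * C_A^T xi_A ||_2^2
        res = lsq_linear(beta * C[active].T, -v, bounds=(-1.0, 1.0))
        xi[active] = res.x
    return xi, grad_f + beta * C.T @ xi   # minimum-norm subgradient element
```

The returned vector $\nabla f(x) + \beta C^\top\xi$ is the quantity used below as the right-hand side of the second-order system.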
3.2. Second order information. Weak second-order information associated with the $\ell_1$-norm was introduced algorithmically in [6] in order to compute generalized-Hessian-based descent directions that incorporate components coming from both the smooth and the nonsmooth terms. There, the regularization of the $\ell_1$-norm by Huber smoothing allowed to obtain the targeted second-order information from the second derivative of the regularization. This procedure is analogous to considering generalized Hessians in the Bouligand subdifferential $\partial_B\operatorname{prox}_{\frac{1}{\gamma}\|\cdot\|_1}$. Here, we generalize this procedure to the case of composite sparse optimization. In the present case, however, the weak second-order derivative of the nondifferentiable term is no longer a diagonal matrix. Indeed, recalling that the Huber regularization of the absolute value, for $\gamma > 0$, is given by $h_\gamma(t) = \frac{\gamma}{2}t^2$ if $|t|\le\frac{1}{\gamma}$ and $h_\gamma(t) = |t| - \frac{1}{2\gamma}$ otherwise, the "weak Hessian" of $\|C\cdot\|_1$ is given by the matrix (10) $\Gamma = \gamma\,C^\top D\,C$, with $D = \operatorname{diag}(d_1,\dots,d_n)$ and $d_i = 1$ if $|\langle c_i, x\rangle|\le\frac{1}{\gamma}$, $d_i = 0$ otherwise. One can realize that $D\in\partial_B(\operatorname{prox}_{\frac{1}{\gamma}\|\cdot\|_1})$. We will write $\Gamma_k$ to specify that (10) is computed at $x = x^k$. Now, the computation of the descent direction is carried out with the help of the matrix (10), requiring the solution of the following linear system: (11) $(B_k + \Gamma_k)\,d^k = -\big(\nabla f(x^k) + \beta C^\top\xi^k\big)$, where $B_k$ stands either for the Hessian of $f$ at $x^k$ or for an approximation of it (e.g., a BFGS matrix). Assumption 2. The matrix $B_k$ is symmetric positive definite and satisfies $d^\top B_k d \ge c\,\|d\|^2$ for all $d\in\mathbb{R}^m$ and all $k$, for some constant $c > 0$ independent of $k$. 3.3. Projection step. In our algorithm, at each iteration the approximate solution $x$ may be close to fulfilling sparsity in the range of $C$, i.e., $\langle c_i, x\rangle\approx 0$ for some of the indices $i$. Thereafter, small perturbations of $x$ may cause undesired sign changes in $\langle c_i, x\rangle$. When, under small perturbations of $x$, we detect a change in the sign of the quantity $\langle c_i, x\rangle$, we would like to keep the updated approximate solution satisfying the sparsity condition. Therefore, we project the perturbation of $x$ onto the closest point $\tilde x$ satisfying $\langle c_i, \tilde x\rangle = 0$. Thus, for a given approximate solution $x$ and a descent direction $y$, we identify those $\langle c_i, x\rangle$ which change sign with respect to the subgradient $\xi(x)$ (recall that $\xi_i(x)$ has the same sign as $\langle c_i, x\rangle$ when the latter is not 0). For the sign identification process we introduce the set (13) $S(y) := \{i : \operatorname{sign}(\langle c_i, x + y\rangle)\neq\operatorname{sign}(\xi_i(x))\}$ and define $C_s := C(S(y), :)$, the submatrix of $C$ formed by the rows indexed in $S(y)$. Then, we consider the projection onto the set $A_S$, defined by (14) $A_S := \{z\in\mathbb{R}^m : \langle c_i, z\rangle = 0 \text{ for all } i\in S(y)\}$. Thus, the projection $P$ onto the set $A_S$ is obtained as the solution of problem (Prj). It is known that (Prj) is a saddle point problem. A particular but important case is when $C_s$ has full rank. Then, (Prj) is equivalent to the linear equation (15) (see [4]). Furthermore, by introducing the projections $\Pi := C_s^\top(C_s C_s^\top)^{-1}C_s$ and $P = I - \Pi$, we can solve (15) explicitly, and the solution of (Prj) reads (16) $\bar x = P(x + y)$. Note that $\Pi x$ is characterized as the solution of (17) $\min_{y\in\operatorname{range}C_s^\top}\|x - y\|^2$. Moreover, feasibility of $\bar x$ implies that $C_s\bar x = 0$. From these relations we realize that the projection step removes, from the current approximation, the part belonging to $\operatorname{range}(C_s^\top)$. In the case that $C_s$ is not of full rank, the existence of $(C_s C_s^\top)^{-1}$ cannot be guaranteed; the common practice is then to consider instead the regularization $C_s C_s^\top + \epsilon I$ for a small $\epsilon > 0$. In analogous fashion to [1, 5], we consider a projected line search based on the projection $P$ given by (16) for choosing a step $s_k$ fulfilling the decrease condition (18). The calculation of a step $s_k$ fulfilling this condition is performed using a backtracking scheme. The main steps of Algorithm 1 (GSOM) are:
3: Compute $\xi^k$ by solving (MinSub).
4: Compute $d^k$ by solving system (11).
5: Compute $s_k$ using a line-search procedure.
6: Set $x^{k+1} = P(x^k + s_k d^k)$.
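The following Python sketch puts the pieces of this section together in one illustrative iteration of a GSOM-type update: the Huber-based weak Hessian (10), the enriched second-order system (11), the projection step of Section 3.3 and a backtracking line search. It is not the authors' MATLAB implementation; the sign-identification rule, the Armijo-type decrease test standing in for condition (18), the regularization parameter `reg` and all tolerances are our own simplifying assumptions, and $\xi$ is assumed to have been computed beforehand (e.g., by the (MinSub) routine sketched above).

```python
# Illustrative sketch of one GSOM-type iteration (not the paper's code).
import numpy as np

def gsom_step(x, f, grad_f, hess_f, C, beta, gamma, xi,
              s0=1.0, rho=0.5, c1=1e-4, max_ls=30, reg=1e-10):
    g = grad_f(x) + beta * C.T @ xi            # subgradient element (Sec. 3.1)

    # Weak Hessian of ||C . ||_1 via Huber smoothing: Gamma = gamma * C^T D C,
    # with D diagonal and D_ii = 1 iff |<c_i, x>| <= 1/gamma  (Sec. 3.2).
    d_diag = (np.abs(C @ x) <= 1.0 / gamma).astype(float)
    Gamma = gamma * C.T @ (d_diag[:, None] * C)

    B = hess_f(x)                              # or a BFGS approximation
    d = np.linalg.solve(B + Gamma, -g)         # system (11)

    def project(z):
        # Projection step (Sec. 3.3): zero the components <c_i, z> whose sign
        # disagrees with the corresponding subgradient component.
        S = np.sign(C @ z) * np.sign(xi) < 0
        if not S.any():
            return z
        Cs = C[S]
        M = Cs @ Cs.T + reg * np.eye(Cs.shape[0])   # regularized if rank-deficient
        return z - Cs.T @ np.linalg.solve(M, Cs @ z)

    phi = lambda z: f(z) + beta * np.sum(np.abs(C @ z))
    s, phi_x = s0, phi(x)
    for _ in range(max_ls):                    # backtracking on the projected point
        x_new = project(x + s * d)
        if phi(x_new) <= phi_x - c1 * s * np.dot(d, d):
            return x_new, s
        s *= rho
    return x, 0.0                              # no acceptable step found
```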
Let $x^k$ be the approximate solution computed by Algorithm 1 at the $k$-th iteration. Moreover, let $C_k := C_{S_k}$, for $k = 1, 2, \dots$, and $\xi^k := \xi(x^k)$. Hence, at every step, $x^{k+1} = P(x^k + s_k d^k)$. In addition, for a vector $y\in\mathbb{R}^m$ and according to (13), we consider the index set $S_k := S(s_k d^k)$. It follows from the definition of $S_k$ that, for $s$ sufficiently small, $x^k$ belongs to the null space of $C_k$, and the index set $S_k$ may be equivalently defined in terms of the indices $i$ with $\langle c_i, x^k\rangle = 0$. Indeed, if $i\in S_k$ and $\langle c_i, x^k\rangle\neq 0$, then $\xi^k_i = \operatorname{sign}\langle c_i, x^k\rangle$ and, for sufficiently small $s$, $\operatorname{sign}\langle c_i, x^k + s d^k\rangle = \operatorname{sign}\langle c_i, x^k\rangle = \operatorname{sign}(\xi^k_i)$, which contradicts $i\in S_k$; therefore, the only possibility is that $\langle c_i, x^k\rangle = 0$. Theorem 1. Let Assumptions 1 and 2 hold, let $x^k$ be the approximate solution of (1) at the $k$-th iteration of Algorithm 1, and let $d^k$ be the corresponding direction computed from (11). Let us assume that the matrix $C_k$ appearing in the projection step (Prj) has full rank. Moreover, let us assume that at every step $\langle c_i, d^k\rangle\neq 0$ for some $i$, and that the parameter $\gamma = \gamma_{k+1}$ is chosen in each iteration according to the rule (32), where the minimum is taken over those $i$ with $\langle c_i, d^k\rangle\neq 0$, and where $\nu^k$ and $\eta^k$ denote the vectors of coefficients of $\Pi\nabla f(x^k)$ and $\Pi d^k$, respectively. Then $d^k$ is a descent direction. Proof. Taking into account that $x^{k+1} = P(x^k + s d^k) = x^k + s d^k - \Pi(x^k + s d^k)$ and that $C_k$ has full rank, by (15) we have, for the indices involved, that $\operatorname{sign}(\langle c_i, x^k + s d^k\rangle)\neq\operatorname{sign}(\xi^k_i) = \operatorname{sign}(\langle c_i, x^k\rangle)$. Then we conclude that $|\langle c_i, x^k\rangle| < s\,|\langle c_i, d^k\rangle|$. Hence, for some constant $c$ depending on the matrix $C$ and independent of $k$, we obtain the estimate (22) $\|x^{k+1} - x^k\| \le c\, s\,\|d^k\|$. Now, using (22) and a first-order Taylor expansion of the regular part of $\varphi$, we get the relation (23). From the second-order system (11) and the positive semidefiniteness of $\Pi$, together with the facts that $\Pi = \Pi^2$ is a symmetric positive semidefinite matrix, that $\Gamma_k = \gamma\,C^\top D_k C$ is symmetric and positive semidefinite by construction, and that, by Assumption 2, there exists a positive constant $c$, independent of $k$, such that $d^{k\top}B_k d^k \ge c\,\|d^k\|^2$, these matrix properties imply (25). Let us now focus on the sum on the right-hand side of (23). Since $\langle c_i, x^{k+1}\rangle = 0$ for all $i\in S_k$, we obtain (26). Inserting (25) and (26) into (23), we obtain the relation (27), where $|S^C_k|$ denotes the cardinality of the complement of the set $S_k$ and $i^*$ is the index at which the term $|\langle c_i, \Pi d^k\rangle|$ attains its maximum over $S_k$. By using again Remark 2, and taking into account that $\Pi$ projects onto $\operatorname{span}\{c_i : i\in S_k\}$, we can estimate the last three terms as in (28). Notice that we have assumed that the set $\{i : |\langle c_i, x^k\rangle|\le 1/\gamma\}$ is nonempty; otherwise the right-hand side of (28) vanishes. Using $\gamma = \gamma_k$ given in (32) in the last relation and inserting the result into (27), we arrive at an estimate which allows us to conclude that $d^k$ is a descent direction. There are nonconvex problems for which Assumption 2 cannot be fulfilled, e.g., when $f$ is concave. In this case, the last proof can be modified to cope with this situation. We will need the following assumption (Assumption 3), which involves a bound with a positive constant $\hat C$. Theorem 2. Let Assumptions 1 and 3 hold. Consider $x^k$, $\nu^k$ and $\eta^k$ as in Theorem 1. Moreover, assume in addition that there exists a constant $\tilde C > 0$ such that the corresponding bound holds for every $k$, and that the parameter $\gamma$ is chosen at each iteration according to the rule (31); then $d^k$ is a descent direction, i.e., $\varphi(x^{k+1}) < \varphi(x^k)$. Proof. Following the same arguments and notation as in the proof of Theorem 1, we obtain a first estimate; by the estimates (28) and (31) we then get a further relation. Finally, the right-hand side of the last relation is negative for sufficiently small $s$. Definition 1.
We will say that a function $f$ is a KL-function if it satisfies the Kurdyka-Lojasiewicz inequality, that is: for every $y\in\mathbb{R}$ and every bounded subset $E\subset\mathbb{R}^m$, there exist three constants $\kappa > 0$, $\zeta > 0$ and $\theta\in[0,1)$ such that, for all $z\in\partial f(x)$ and every $x\in E$ with $|f(x) - y|\le\zeta$, the inequality (35) holds, with the convention $0^0 = 0$. Theorem 3. Suppose that Assumptions 1-2 are satisfied and that $\varphi$ is a KL-function (i.e., it satisfies the Kurdyka-Lojasiewicz condition). Then the sequence $\{x^k\}_{k\in\mathbb{N}}$ generated by Algorithm 1 converges to a point $\bar x$ such that $0\in\nabla f(\bar x) + \beta\,C^\top\partial\|\cdot\|_1(C\bar x)$. Proof. The proof of this convergence result is analogous to the proof of Theorem 2 in [6]. Indeed, notice that the sequence $\{x^k\}_{k\in\mathbb{N}}$ lies in the level set $\{x : \varphi(x)\le\varphi(x^0)\}$, which in view of Assumption 1 is compact. Moreover, by Theorem 1, for $s_k$ sufficiently small there exists $\mu > 0$ such that the sequence $\{\varphi(x^k)\}_{k\in\mathbb{N}}$ enjoys the decrease property (36), and $\varphi(x^k)$ converges to some value $\varphi_\infty$ as $k\to\infty$. By using the Kurdyka-Lojasiewicz condition and Assumption 2, there exist $\kappa > 0$ and $\theta\in[0,1)$ such that (37) holds. Therefore, majorizing (36) by means of (37), the summability of the sequence $\{\|d^k\|\}_{k\in\mathbb{N}}$ can be concluded, which in turn, by (22), implies that $\{x^k\}_{k\in\mathbb{N}}$ is a Cauchy sequence and thus convergent. Let us denote its limit by $\bar x$. Finally, using (37) and taking the limit $k\to\infty$, we obtain that $(\bar x, 0)$ belongs to $\operatorname{Graph}(\nabla f + \partial(\beta\|C\cdot\|_1))$ due to its closedness, which is equivalent to the relation $0\in\nabla f(\bar x) + \partial(\beta\|C\cdot\|_1)(\bar x)$. Theorem 4 (Rate of convergence). Let Assumptions 1-2 hold and assume also that $\varphi$ is a KL-function with Lojasiewicz exponent $\theta\in(0,1)$. Let $\{x^k\}_{k\in\mathbb{N}}$ be a sequence generated by Algorithm 1, converging to a local solution $\bar x$. Then the rates (i) and (ii) hold, the second one involving a constant $c > 0$. Proof. We follow the ideas from [2]. From (22) and the quadratic growth (29), for sufficiently small $s$, there is a positive constant $c$ such that (38) holds. Without loss of generality, we assume that $\varphi(\bar x) = 0$ (we can always replace $\varphi(\cdot)$ by $\varphi(\cdot) - \varphi(\bar x)$). By multiplying relation (38) by $\varphi(x^k)^{-\theta}$ and using the fact that the real function $\mathbb{R}_+\ni t\mapsto t^{1-\theta}$ is a concave differentiable function, and since, on the other hand, $\varphi$ is a KL-function, from the last relation we get (39). Further, $\xi^k$ corresponds to the minimum norm subgradient solving (MinSub); therefore, by feasibility of $\xi^{k-1}$ we have that $\|\nabla f(x^k) + \beta C^\top\xi^k\|_2 \le \|\nabla f(x^k) + \beta C^\top\xi^{k-1}\|_2$, which can be inserted in (39) and combined with (11) and Assumption 1 to obtain a corresponding bound. As before, we invoke Remark 2 to infer that, for sufficiently small $s$, a further estimate holds, which together with the above inequality implies that there exists a constant $c > 0$ such that (40) holds, where $M_\theta$ is a positive constant depending on $\theta$. Let us sum (40) over $k$ from $k = n$ up to $N > n$; hence, recalling from Theorem 3 that $\{\|x^{k+1} - x^k\|_2\}_{k\in\mathbb{N}}$ is summable by virtue of the summability of the sequence $\{\|d^k\|_2\}_{k\in\mathbb{N}}$, and taking $N\to\infty$, we obtain a relation which, in terms of $\Delta_n := \sum_{k=n}^{\infty}\|x^{k+1} - x^k\|_2$, can be rewritten in a more convenient form. Using again that $\varphi$ is a KL-function, we have from (35) and monotonicity that $\varphi(x^n)^{1-\theta} \le \frac{1}{\kappa}\,\|\nabla f(x^n) + \beta C^\top\xi^n\|_2^{\frac{1-\theta}{\theta}}$. Thus, using the same arguments as before, we obtain relation (42), where $M$ is a positive constant. Here we rely on the analysis of a sequence satisfying relation (42) carried out in [2, pp. 13-15], whence (i) and (ii) hold. Second-order methods are known to be expensive when it comes to the computation of the descent direction.
Without any additional strategy regarding the numerical solution of system (11) in the computation of the descent direction, the method would hardly become practical for large problems. Therefore, it is important to look at the structure of the pattern matrix $C$ and to take it into account in order to improve the computation process. In an effort to reduce the numerical cost, we extend the notion of active sets used in [6] in order to define an effective identification of the components of the optimization variable which are known to fulfill the optimality conditions and can therefore be excluded when seeking the descent direction. In this way, the optimization process takes place in a lower-dimensional subspace, resulting in a reduction of the computational cost. A common situation occurs when the matrix $C$ possesses a known structure, e.g., when $C$ is the successive difference matrix or "discrete gradient"; in this case $C$ is a band matrix. We notice that, in the multiplication $Cx^k$, not all the entries of $x^k$ take part in the computation of a particular entry of the product. Observing the optimality condition (5), for each $i\in A_k$ we consider the index set denoted by $I^k_i$, consisting of the indices $j\in\{1,\dots,m\}$ with $c_{ij}\neq 0$ which satisfy condition (43). Then, we define the set of active entries of $x^k$ by (44) $I^k_0 := \bigcup_{i\in A_k} I^k_i$, which corresponds to the set of entries that are close to satisfying the optimality conditions which are active. Thus, we do not move the current approximation $x^k$ in the entries indexed by $I^k_0$. By contrast, we define the set of indices $I^k_F := \{1,\dots,m\}\setminus I^k_0$, in which the variable is free to move. Thus, we consider the reduced system (45), obtained by restricting system (11) to the entries indexed by $I^k_F$. Then, step 4 of Algorithm 1 can be modified by using (45) and by choosing the descent direction $d^k$ whose entries indexed by $I^k_F$ are given by the solution of (45), while the remaining entries are kept fixed.
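As an illustration of this reduction, the sketch below freezes the entries in $I^k_0$ and solves system (11) only on the free index set $I^k_F$. Since condition (43) is not reproduced above, the near-optimality test used to build $I^k_0$ is a hypothetical stand-in, and `tol`, `eps` are illustrative tolerances.

```python
# Hedged sketch of the active-set reduction of Section 5 (stand-in test for (43)).
import numpy as np

def reduced_direction(x, grad_f_x, B, C, beta, gamma, xi, tol=1e-10, eps=1e-6):
    g = grad_f_x + beta * C.T @ xi
    d_diag = (np.abs(C @ x) <= 1.0 / gamma).astype(float)
    H = B + gamma * C.T @ (d_diag[:, None] * C)      # full matrix of system (11)

    active_rows = np.abs(C @ x) <= tol               # A_k: rows with <c_i, x> = 0
    near_opt = np.abs(g) <= eps                      # stand-in near-optimality test
    # I_0: entries touched by an active row and already near-optimal
    if active_rows.any():
        touched = (np.abs(C[active_rows]) > 0).any(axis=0)
    else:
        touched = np.zeros(x.size, dtype=bool)
    I0 = touched & near_opt
    IF = ~I0                                         # free entries

    d = np.zeros_like(x)
    if IF.any():
        d[IF] = np.linalg.solve(H[np.ix_(IF, IF)], -g[IF])   # reduced system
    return d
```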
In this section we conduct exhaustive numerical experiments with the algorithm described above. Several applications of the generalized $\ell_1$-norm penalization are considered in order to illustrate the general class of problems that can be handled by our algorithm. The GSOM algorithm was implemented in Matlab. The (MinSub) problem of step 3 was solved using the quadprog routine from the Optimization Toolbox, whereas the linear system (11) of step 4 was solved using Matlab's direct methods ("backslash" solver). The line search was implemented using a standard backtracking algorithm, checking condition (18). For the stopping criterion we use a tolerance on the difference of consecutive approximate solutions and of their respective costs. The numerical experiments consider different problems together with algorithms designed specifically for each problem structure; therefore, we compare our algorithm with the corresponding specialized algorithm for each problem. 6.1. Anisotropic total variation in function spaces. We consider a problem resembling a simplified version of an anisotropic Bingham fluid flow model. After discretizing the infinite-dimensional problem using finite differences, it is reformulated as an energy minimization problem of the form (1). Hence, the regular part $f(u) = \frac12 u^\top A u - b^\top u$ of our minimization problem is written using the matrix $A$ associated with the discrete Laplacian, and $b$ is the vector corresponding to the discretization of the forcing term $z$. Table 1 shows the effect of using the generalized second-order information introduced in Section 3.2. The cost values of the objective function were computed by varying the regularization parameter $\gamma$ for different values of $\beta$; they are reported in Table 1 for the first 50 iterations of the algorithm. The first row (in red) shows the cost values achieved by the algorithm when no generalized second-order information is used for the computation of the descent direction ($\gamma = 0$). In this case, we notice that without generalized second-order information the cost is larger in all tests. Moreover, this can also be observed in Figure 1(A), where the history of the cost is shown for different values of $\gamma$ with $\beta = 0.5$; there, the curve for $\gamma = 0$ is out of the chart. Next, we test the active-set identification strategy. Here, we compare the execution time with respect to an implementation not using this strategy. We confirm the efficiency of the strategy by measuring the execution time for this particular problem; see Table 2. Table 2. Average time (in seconds) for the numerical solution of system (11), with and without the active-set strategy, within the execution of the GSOM algorithm. 6.2. Graph trend filtering. In [15] the authors introduced a technique for filtering data over graphs; among other applications, this technique was applied to denoising over graphs using the discrete Laplacian as the sparsity-inducing operator. There, it was shown that better results may be achieved than with other denoising techniques. In our setting, $C = \Delta^{(2)}$, where for an integer $k$ the operator $\Delta^{(k)}$ is defined recursively from $\Delta^{(1)}$, the oriented incidence matrix of the graph $G = \{N, E\}$. Notice that $\|\Delta^{(1)}x\|_1 = \sum_{(i,j)\in E}|x_i - x_j|$, and $\Delta^{(2)} = \Delta^{(1)\top}\Delta^{(1)}$ is the graph Laplacian. In this example, we consider the denoising of COVID-19 data over a graph corresponding to the Pichincha province of Ecuador, connecting adjacent areas or tracts; each node corresponds to a particular tract of the province territory. The signal considered at each node consists of the reported number of cases of the tract, denoted by $y$. This data was provided by the team of the app SALVAVIDAS, developed for monitoring COVID-19 infection in Ecuador, which uses the official data provided by the Ministry of Health of Ecuador. The noise sources in this kind of data are due to imprecise assignments within tracts, counting errors, and false positive or negative cases, to name a few. In our example, we assume that the noise induced by these different sources is normally distributed, $y\sim\mathcal{N}(x_0,\sigma^2 I)$. The sparse graph filtering problem aims to minimize the corresponding least-squares fidelity cost penalized by $\beta\|\Delta^{(2)}x\|_1$. Figure 2 shows the expected behavior of a first-order method (ADMM) compared with a second-order method (GSOM): we observe that GSOM is faster and more precise. However, it requires the solution of a linear system, which may be costly. Nevertheless, this computational cost can be offset by utilizing parallelization and other numerical techniques.
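A hedged sketch of the graph trend filtering setup follows: it builds the oriented incidence matrix of a toy graph, forms $C = \Delta^{(2)}$ as the graph Laplacian, and evaluates a least-squares-plus-$\ell_1$ cost of the type used in this subsection. The toy graph, the data vector and the value of $\beta$ are illustrative; the COVID-19 data of the experiment is not reproduced.

```python
# Illustrative graph trend filtering penalizer (Section 6.2).
import numpy as np

def incidence_matrix(n_nodes, edges):
    """Oriented incidence matrix Delta^(1) with one row per edge (i, j)."""
    D = np.zeros((len(edges), n_nodes))
    for k, (i, j) in enumerate(edges):
        D[k, i], D[k, j] = -1.0, 1.0
    return D

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (1, 3)]   # toy graph
D1 = incidence_matrix(4, edges)
C = D1.T @ D1                                      # Delta^(2): graph Laplacian

beta = 1.0
y = np.array([1.0, 4.0, 2.0, 3.0])                 # noisy node signal

def gtf_objective(x):
    """Assumed cost: 0.5*||x - y||^2 + beta*||Delta^(2) x||_1."""
    return 0.5 * np.sum((x - y) ** 2) + beta * np.sum(np.abs(C @ x))

print(gtf_objective(y))
```

With this matrix $C$, the problem fits directly into the composite form (1) and can be handled by the routines sketched in Section 3.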
6.3. Image restoration. Consider the image deconvolution example of [10]. The aim of this problem is to recover an image from one convolved with a random matrix $A$. For instance, such a convolution occurs during the camera exposure, producing a blurred image. If $x$ is the original image, the contaminated one is modeled by $y = Ax + z$, where $A\in\mathbb{R}^{n\times n}$ and $z\in\mathbb{R}^n$. The recovery process consists in choosing the image $x$ which best fits the observation and, at the same time, minimizes the term that accumulates the differences of each pixel with respect to its neighbors by means of a directed graph $G = \{N, E\}$. Thus, we look for the minimizer of a cost function combining a data-fidelity term for the model $y = Ax + z$ with the penalization of these pixel differences. Notice that this function fits our setting: using the incidence matrix $C$ associated with the graph $G$, the penalizing term can be expressed as $\sum_{(i,j)\in E}|x_i - x_j| = \|Cx\|_1$. In the following example we consider the recovery of an image of size $77\times 77$ from its corrupted observation $y = Ax + z$, with random noise $z$ of standard deviation $\sigma = 0.05$. Here $A$ is a random (uniformly distributed) convolution matrix of size $2000\times 5929$. [Figure: original image and the reconstructions obtained with GFL, MPGL and GSOM.] Next, we test the Cauchy denoising model, characterized by its non-Gaussian and impulsive nature, which preserves edges and details of images (see [13] for the formulation of the Cauchy model in function spaces using the TV-norm). Here, the anisotropic version of the discrete Cauchy denoising problem, in a simplified setting, corresponds to the minimization of a nonconvex cost function in which $C$ is the difference operator, $f$ is the observed image perturbed with Cauchy noise and $a > 0$ is the scale parameter of the Cauchy distribution. Notice that the nonconvex structure of the optimization problem prevents the application of standard convex methods. Again, an image of size $77\times 77$ pixels is considered, and Cauchy noise is added to the original image according to the formula suggested in [13], where $\xi > 0$ gives the noise level and $\eta_i$, $i = 1, 2$, follow the Gaussian distribution with mean 0 and variance 1. In the next experiment we chose a fixed value of $\xi$. [Figure: original image and its Cauchy-noised version.] The following set of pictures shows the recovered images for different values of the scale parameter $a$ and of the composite sparsity penalization parameter $\beta$; both play an important role in the restoration process. Indeed, we observe that larger values of $a$ result in a reduced level of Cauchy noise, and the same observation applies to higher values of $\beta$. As usual in this type of problem, there is a compromise between the amount of removed noise and the preservation of the details of the picture.
References
[1] Scalable training of $\ell_1$-regularized log-linear models.
[2] On the convergence of the proximal algorithm for nonsmooth functions involving analytic features.
[3] Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems.
[4] Numerical solution of saddle point problems.
[5] Sample size selection in optimization methods for machine learning.
[6] Second-order orthant-based methods with enriched Hessian information for sparse $\ell_1$-optimization.
[7] A second order primal-dual algorithm for nonsmooth convex composite optimization.
[8] A model algorithm for composite nondifferentiable optimization problems.
[9] Nonlocal operators with applications to image processing.
[10] MPGL: An efficient matching pursuit method for generalized lasso.
[11] Practical applications of sparse modeling.
[12] Sparse modeling: theory, algorithms, and applications.
[13] Variational approach for restoring blurred images with Cauchy noise.
[14] Sparsity and smoothness via the fused lasso.
[15] Trend filtering on graphs.