key: cord-0447366-a0y85738 authors: Hildebrand, Roland title: Optimal step length for the Newton method near the minimum of a self-concordant function date: 2020-03-19 journal: nan DOI: nan sha: d147c8ef75dcafc0cb431bf0067e68e5004bd7dc doc_id: 447366 cord_uid: a0y85738 The quadratic convergence region of the exact Newton method around the minimum of a self-concordant function makes up a fraction of the Dikin ellipsoid. Outside of this region, the Newton method has to be damped in order to ensure convergence. However, the available estimates of both the size of the convergence region and the step length to be used outside of it are based on conservative relations between the Hessians at different points and are hence sub-optimal. In this contribution we use methods of optimal control theory to compute the optimal step length of the Newton method on the class of self-concordant functions, as a function of the Newton decrement. With this step length quadratic convergence can be achieved on the whole Dikin ellipsoid. The exact bounds are expressed in terms of solutions of ordinary differential equations which cannot be integrated explicitly. As an application, the neighbourhood of the central path in which the iterates of short-step interior point methods for conic programming are required to stay can be enlarged, enabling faster progress along the central path during each iteration and hence fewer iterations to achieve a given accuracy. The Newton method is a century-old second order method to find a zero of a vector field or a stationary point of a sufficiently smooth function. It is well-known that it is guaranteed to converge only in a neighbourhood of a solution, even on such well-behaved classes of objective functions as the self-concordant functions. In this neighbourhood the method converges quadratically. Outside the quadratic convergence region the step length has to be decreased to ensure convergence, leading to the damped Newton method. Several rules have been proposed to choose the step length on different classes of cost functions or vector fields. Among these are line searches or path searches until some pre-defined condition is met [1, 6] , strategies imported from gradient descent methods [4, p. 37] , and explicit formulas [5, p. 24 ]. These rules provide sufficient conditions for convergence, but they are optimal only if the order of convergence is concerned [4] . Given the fast convergence rate of the Newton method, the performance estimate of a single iterate is not important if the problem has to be solved to optimality with a starting point already in the quadratic convergence region. However, the situation is different if the Newton method is applied as it is in interior-point methods for solving conic programs. Here we are not interested in finding the exact minimum, rather one or a few iterations are made before changing the cost function, thus moving the iterate again farther away from the minimum. In this regime the iterates permanently stay in the outskirts of the quadratic convergence region, and the overall progress is limited by the requirement that they do not drop out of it. Exact knowledge of the worst-case performance of a single iterate and of the boundaries of the convergence region would allow to significantly enlarge the step size when updating the cost function, thus boosting the overall convergence of the interior-point method. We shall consider the class of self-concordant functions. 
This class was introduced in [5]; due to its invariance under affine coordinate transformations it is especially well suited for analysis in conjunction with the likewise affinely invariant Newton method. Self-concordant functions are locally strongly convex C³ functions F which satisfy the inequality

|F'''(x)[h, h, h]| ≤ 2 (F''(x)[h, h])^{3/2}

for all x in the domain of definition and all h in the tangent space at x. The Dikin ellipsoid of F around a point x̄ is given by the set

W(x̄) = { x : ||x − x̄||_{x̄} < 1 }.

Here the distance between the points is measured in the local metric given by the Hessian F''(x̄).

The basic iteration of a short-step interior-point method consists of a (damped) Newton step

x_{k+1} = x_k − γ_k F''(x_k)^{-1} F'(x_k)

towards the minimum of a composite self-concordant function F, where γ_k = 1 for a full step and γ_k ∈ (0, 1) if the step is damped. This function is the sum of the self-concordant barrier describing the feasible set and the linear cost function of the conic problem multiplied by a positive weight. The function F hence depends on a scalar parameter, given by the weight, which is updated after one or a few Newton steps. The minima of the family of composite functions form a path, the so-called central path of the problem. The iterates must stay in a small neighbourhood of the central path to guarantee inclusion in the quadratic convergence region around the minimum. The parameter must be updated cautiously enough so that the iterate still remains inside the convergence region around the new minimum.

The estimate of the size of the quadratic convergence region and of the performance of the Newton step is based on the following bound on the Hessian of a self-concordant function F at a point x near the current iterate x_k [5, Theorem 2.1.1] (see also [4]):

(1 − ||x − x_k||_k)² F''(x_k) ⪯ F''(x) ⪯ (1 − ||x − x_k||_k)^{-2} F''(x_k).

Here ||x − x_k||_k ∈ (0, 1) is the distance between x and x_k as measured in the local metric at x_k. Based on this bound, a full Newton step can be made safely if its length in this local metric does not exceed (3 − √5)/2 ≈ 0.3820. With a little abuse of notation we shall say that the quadratic convergence region is guaranteed to contain an open ellipsoid of radius λ* times the radius of the Dikin ellipsoid, and call λ* a bound on the radius of the convergence region.

The quantity ρ(x_k) = ||F''(x_k)^{-1} F'(x_k)||_{x_k}, i.e., the length of the full Newton step measured in the local metric at x_k, is called the Newton decrement. At the next iterate, the decrement is guaranteed to be upper bounded by [4]

ρ(x_{k+1}) ≤ (ρ(x_k)/(1 − ρ(x_k)))².

The bound λ* is obtained by equating the upper bound on ρ(x_{k+1}) and ρ(x_k). Thus if the first step is of length strictly smaller than λ*, then the decrement is guaranteed to form a strictly decreasing sequence which tends to zero quadratically.

We now derive an estimate of how far we may advance the minimum of the current composite objective function along the central path by updating its parameter. It serves as a proxy for the overall performance of the path-following method. We have ρ(x_k) − ρ(x_{k+1}) ≥ ρ(x_k) − (ρ(x_k)/(1 − ρ(x_k)))². The right-hand side of this inequality attains its maximum value at ρ(x_k) = ρ* ≈ 0.2291. Therefore, if we guarantee that the Newton decrement before the Newton step is not larger than ρ*, we can be sure that it is not larger than ρ̄* = (ρ*/(1 − ρ*))² ≈ 0.0883 thereafter. We may then move the minimum by a distance of δρ = ρ* − ρ̄* ≈ 0.1408 in order to return to the initial state at the next iterate.

The situation is different if instead of a full Newton step one applies a shorter one, with a damping coefficient γ = 1/(1 + ρ(x_k)) depending on the decrement at the current iterate. Then one can guarantee that [4, Theorem 5.2.2]

ρ(x_{k+1}) ≤ ρ(x_k)² (2 + ρ(x_k)) / (1 + ρ(x_k)).
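The numerical constants quoted here and in the next paragraph follow from elementary one-dimensional computations with the two classical bounds displayed above. The following small script is a convenience check only, not part of the paper's argument; it assumes nothing beyond the two bounds and recovers λ*, ρ* and δρ for the full and the damped step.

```python
import numpy as np
from scipy.optimize import brentq, minimize_scalar

# classical upper bounds on rho(x_{k+1}) as functions of rho(x_k)
full   = lambda r: (r / (1.0 - r)) ** 2                # full Newton step
damped = lambda r: r ** 2 * (2.0 + r) / (1.0 + r)      # step with gamma = 1/(1 + rho)

for name, bound in (("full step", full), ("damped step", damped)):
    # lambda*: fixed point of the bound, radius of the guaranteed convergence region
    lam = brentq(lambda r: bound(r) - r, 1e-9, 1.0 - 1e-9)
    # rho*: maximizer of the guaranteed decrease rho - bound(rho) on (0, lambda*)
    res = minimize_scalar(lambda r: bound(r) - r, bounds=(0.0, lam), method="bounded")
    rho_star, delta_rho = res.x, -res.fun
    print(f"{name:12s} lambda* = {lam:.4f}  rho* = {rho_star:.4f}  delta rho = {delta_rho:.4f}")
```

The script prints λ* ≈ 0.3820, ρ* ≈ 0.2291, δρ ≈ 0.1408 for the full step and λ* ≈ 0.6180, ρ* ≈ 0.2972, δρ ≈ 0.1408 for the damped step.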
Figure 1: Upper bounds on the Newton decrement ρ(x_{k+1}) (left; dashed: full Newton step, solid: optimal damped Newton step) and the optimal damping coefficient γ_k (right) as functions of the current Newton decrement ρ(x_k).

Equating ρ(x_{k+1}) and ρ(x_k) yields the improved bound λ* = (√5 − 1)/2 ≈ 0.6180 on the radius of the convergence region. However, even though ρ* increases to ≈ 0.2972, the bound on the difference δρ remains the same as in the case of a full Newton step.

In this contribution we perform an exact worst-case analysis of the performance of the Newton iterate by reformulating it as an optimal control problem. We show that for a full Newton step the radius of the convergence region is actually λ* ≈ 0.6757, while for a damped Newton step with optimal step size the quadratic convergence region covers the whole open Dikin ellipsoid, λ* = 1. The values of ρ* for the full and the optimal damped Newton step are given by ≈ 0.3943 and ≈ 0.4429, respectively. The respective bounds on δρ are given by ≈ 0.2184 and ≈ 0.2300, which improves the previously known bounds on δρ by more than 50%. The tight bound on ρ(x_{k+1}) and the optimal damping coefficient as functions of ρ(x_k) are depicted in Fig. 1.

The idea of analyzing iterative algorithms by optimization techniques is not new. An exact worst-case analysis of gradient descent algorithms by semi-definite programming has been performed in [2, 7]. In these papers an arbitrary finite number of steps is analyzed, but the function classes are such that the resulting optimization problem is finite-dimensional. In our case the properties of the class of self-concordant functions do not allow one to obtain tight constraints on gradient and Hessian values at different points without taking into consideration all intermediate points, and the problem becomes infinite-dimensional.

The remainder of the paper is structured as follows. In Section 2 we analyze the Newton iteration for self-concordant functions in one dimension. In this case the problem can be solved analytically. In Section 3 we generalize to an arbitrary number of dimensions. It turns out that the general case is no more difficult than the 2-dimensional one due to the rotational symmetry of the problem, and it is described by the solutions of a Hamiltonian dynamical system in a 4-dimensional space. In Section 4 we analyze the solutions of the Hamiltonian system and derive the bound for the full Newton step. In Section 5 we minimize the bound on the Newton decrement ρ(x_{k+1}) with respect to the step length in order to obtain the optimal damped Newton step. It turns out that in this case the system simplifies to an ordinary differential equation (ODE) on the plane. The optimal step length and the optimal bound on the decrement are then described in terms of solutions of this equation. In Section 6 we provide a game-theoretic interpretation of our results.

In this section we analyze the Newton iteration in one dimension. Given a damping coefficient and the value of the Newton decrement at the current iterate, we would like to find the maximal possible value of the decrement at the next iterate. We reformulate this optimization problem as an optimal control problem. The solution of this problem is found by presenting an analytic expression for the Bellman function.
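Before setting up the one-dimensional problem formally, the following sketch illustrates the kind of worst case we optimize over. It treats the scaled third derivative u = F'''/(2 (F'')^{3/2}) ∈ [−1, 1] as a control chosen by an adversary, propagates F' and F'' in closed form on sub-intervals where u is constant, and samples random piecewise-constant controls to obtain a lower bound on the worst-case decrement after a full Newton step; the classical estimate (a/(1 − a))² is printed for comparison. All names and the sampling parameters are ad hoc choices for illustration.

```python
import numpy as np

def advance(g, h, u, dx):
    """Propagate g = F' and h = F'' over a sub-interval of length dx on which the
    scaled third derivative u = F'''/(2 F''^(3/2)) is constant: then h(x) = (c - u*x)**-2
    with c = h(0)**-0.5, and g picks up the integral of h over the sub-interval."""
    c = h ** -0.5
    if c - u * dx <= 0:                      # Hessian blows up inside the piece
        return None
    return g + dx / (c * (c - u * dx)), (c - u * dx) ** -2

def sampled_worst_case(a, gamma=1.0, n_pieces=20, n_samples=5000, seed=0):
    """Lower bound on the worst-case decrement after a damped Newton step of length
    a*gamma, obtained by random search over piecewise-constant controls u in [-1, 1]."""
    rng = np.random.default_rng(seed)
    dx, best = a * gamma / n_pieces, 0.0
    for _ in range(n_samples):
        state = (-a, 1.0)                    # normalization F'(0) = -a, F''(0) = 1
        for u in rng.uniform(-1.0, 1.0, n_pieces):
            state = advance(*state, u, dx)
            if state is None:
                break
        if state is not None:
            g, h = state
            best = max(best, abs(g) / np.sqrt(h))   # decrement |F'| / sqrt(F'') at x_{k+1}
    return best

a = 0.3
print("sampled worst case:", sampled_worst_case(a))
print("classical bound   :", (a / (1.0 - a)) ** 2)
```

The sampled value is only a lower bound on the true worst case, which in turn lies below the classical estimate; the exact value is what the optimal control formulation below delivers.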
Let F : I → R be a self-concordant function on an interval, i.e., a C³ function satisfying F'' > 0 and |F'''| ≤ 2 (F'')^{3/2}. Fix a constant γ ∈ (0, 1] and consider the damped Newton step

x_{k+1} = x_k − γ F'(x_k)/F''(x_k).

The Newton decrement at the next iterate, which we suppose to lie in I, is given by ρ(x_{k+1}) = |F'(x_{k+1})| / √(F''(x_{k+1})). Our goal in this section is to find the maximum of ρ(x_{k+1}) as a function of the parameters a = ρ(x_k) and γ.

First of all, we may use the affine invariance of the problem setup to make some simplifications. We move the current iterate to the origin, x_k = 0, normalize the Hessian at the initial point to F''(0) = 1, and possibly flip the real axis to achieve F'(0) = −a < 0. Then we get x_{k+1} = aγ. Introducing the functions h = F''(x), g = F'(x), we obtain the optimal control problem

g' = h,   h' = 2 u h^{3/2},   u(x) ∈ [−1, 1],   x ∈ [0, aγ],
g(0) = −a,   h(0) = 1,   |g(aγ)| / √(h(aγ)) → sup,

where the control u represents the scaled third derivative of F. Replacing the state variable g by y = h^{-1/2} g and the independent variable x by t = h^{1/2} · (x − aγ), we obtain dt/dx = h^{1/2} · (1 + ut), and the problem becomes

ẏ = (1 − u y)/(1 + u t),   u ∈ [−1, 1],

with initial condition y(−aγ) = −a and objective function |y(0)| → sup. The variable h becomes disconnected from the relevant part of the dynamics and can be discarded.

If the control is bang-bang, i.e., u is piece-wise constant with values in {−1, 1}, then the dynamics can be integrated explicitly, with solutions of the form y(t) = (t + c)/(1 + ut) on the intervals where u is constant. The Bellman function B(t, y) of the problem can be determined from the corresponding Hamilton-Jacobi-Bellman equation, in which the maximization over u involves only the partial derivative ∂B/∂y. After a bit of calculation one obtains an explicit expression for B. One of the curves appearing in this construction is a separator curve: from each of its points two optimal trajectories, with controls u = +1 and u = −1, emanate. The optimal synthesis of the system is depicted in Fig. 2.

Let us now consider the result for the full Newton step. In this case γ = 1, and the initial point is given by (t, y) = (−a, −a). It hence lies between the separator curve and the switching curve, yielding an explicit upper bound on |y(0)| in terms of a. Setting this bound equal to a and resolving with respect to a, we obtain for the radius of the convergence region λ* the unique real root of the polynomial λ³ + 2λ² + 9λ − 8, i.e., λ* ≈ 0.7282. For the damped Newton step, we have to minimize B(t, −a) with respect to t for given a. The minimum is attained at the intersection point of the line y = −a with the separator curve. This yields closed-form expressions for the optimal damping coefficient and for the corresponding upper bound on the Newton decrement. In this case we have ρ(x_{k+1}) < ρ(x_k) for every ρ(x_k) ∈ (0, 1), and the convergence region covers the whole open Dikin ellipsoid.

In this section we perform a similar analysis for the case of self-concordant functions F defined on n-dimensional domains. We reduce the problem to an optimal control problem with two state space dimensions. For this problem the Bellman function cannot be presented in closed form, however, and we have to employ Pontryagin's maximum principle to solve it. This results in a Hamiltonian system in a 4-dimensional extended phase space.

Introduce the vector-valued variable g = F' and the matrix-valued variable h = F'' = W W^T, where W is the lower-triangular factor of h with positive diagonal. Let further P be the set of homogeneous cubic polynomials p(x) = Σ_{i,j,k=1}^{n} p_{ijk} x_i x_j x_k which are bounded by 1 on the unit sphere S^{n−1} ⊂ R^n. This is a compact convex set. The self-concordance condition then expresses the third derivatives of F at a given point in terms of a polynomial from P, scaled by the local metric. We shall also need a projection of the set P onto the space of real symmetric matrices S^n; its image U is a compact convex subset of S^n. Clearly U is overbounded by the set Ū = {U : −I ⪯ U ⪯ I}. We shall also introduce corresponding compact convex sets V ⊂ V̄ of lower-triangular matrices. Using affine invariance, we may achieve the normalization x_k = 0, g(0) = −a e_1, h(0) = W(0) = I, x_{k+1} = aγ e_1.
Here a ∈ (0, 1) is the Newton decrement ρ(x_k), γ ∈ (0, 1] is the damping coefficient, and e_1 is the first canonical basis vector. We consider the evolution of the variables g, h, W only on the line segment joining x_k and x_{k+1}, and may hence pass to a scalar variable τ ∈ [0, aγ] such that x(τ) = τ e_1 and g(τ) := g(x(τ)), h(τ) := h(x(τ)). The dynamics of the resulting control system can be written in terms of the derivatives of g and h with respect to τ, the derivative of h being determined by a control matrix from the set U. Since W^{-1} dW/dτ is lower-triangular, and dW^T/dτ · W^{-T} is its transpose, we finally obtain an evolution equation for the factor W itself, driven by a lower-triangular control matrix V ∈ V.

We now replace g by the variable y = W^{-1} g and introduce a new independent variable t = W_{11} · (τ − aγ). This variable then evolves in the interval t ∈ [−aγ, 0], and we have dt/dτ = W_{11}² V_{11} · (τ − aγ) + W_{11} = W_{11} · (t V_{11} + 1). The dynamics of the system become an evolution equation for y driven by the control V, with initial condition y(−aγ) = −a e_1 and objective function ρ(x_{k+1}) = ||y(0)|| → sup. The matrix-valued variable W becomes disconnected and can be discarded.

Let us apply Pontryagin's maximum principle [3] to this optimal control problem. Introduce an adjoint vector-valued variable p. The Pontryagin function of the system is the inner product of p with the right-hand side of the dynamics, and the Hamiltonian is its maximum over the control set. The transversality condition is non-trivial at the end-point t = 0 and states that p(0) equals the gradient ∂||y(0)||/∂y(0) = y(0)/||y(0)|| of the objective function.

Note that the problem setting is invariant with respect to orthogonal transformations of R^n which leave the distinguished vector e_1 invariant. Suppose that at some point on the trajectory the vectors y, p are located in a plane containing e_1. Then at this point the derivatives of the Pontryagin function in the orthogonal directions vanish due to symmetry, and the derivatives of y, p are also contained in this plane. Therefore these variables will remain in this plane along the whole trajectory. We may hence assume without loss of generality that y, p are contained in the plane spanned by the basis vectors e_1, e_2, or equivalently, that the dimension n equals 2.

Then the set P is given by the set of bi-variate homogeneous cubic polynomials which are bounded by 1 on the unit circle, and it can be expressed via the semi-definite representable set of nonnegative univariate trigonometric polynomials. This allows us to obtain an explicit description of the set V. Its boundary is given by an explicitly parametrized family of matrices. Recall that the set V is overbounded by the set V̄ of lower-triangular matrices (see Fig. 3). Both sets share a circle C of extreme points.

Note that the level sets of the Pontryagin function are planes, because it is a fractional-linear function of V. Hence its maximum over a compact convex set is attained at an extreme point of this set. The maximum over the circle C can be computed explicitly and is given by the expression on the right-hand side of (1) below. Besides C, the set V̄ has the extreme points V = ±I, on which the Pontryagin function evaluates to (p_1 ∓ (p_1 y_1 + p_2 y_2))/(1 ± t). The maximum over C dominates these two values if

√((p_1 y_1 − p_2 y_2 + p_1 t)² + 4 p_2² y_1² (1 − t²)) ≥ ±(p_1 t + p_1 y_1 + p_2 y_2) + 2 p_2 y_2 t.

These conditions then also imply max_{V∈V} H = max_{V∈C} H, and hence

H(t, y, p) = (p_1 + (p_1 y_1 − p_2 y_2) t + √((p_1 y_1 − p_2 y_2 + p_1 t)² + 4 p_2² y_1² (1 − t²))) / (1 − t²).     (1)

It turns out a posteriori that these conditions are satisfied on the relevant solutions, and by virtue of the necessity of Pontryagin's maximum principle for optimality we obtain the following result.

Theorem 3.1. Let a, γ ∈ (0, 1) be given.
Then the upper bound on the Newton decrement ρ(x_{k+1}) after a damped Newton step with damping coefficient γ and initial value ρ(x_k) = a of the decrement is given by the norm ||y(0)||, where (y(t), p(t)), y = (y_1, y_2), p = (p_1, p_2), t ∈ [−aγ, 0], is a solution of the Hamiltonian system defined by (1) which satisfies the boundary conditions p(0) = y(0)/||y(0)||, y(−aγ) = −a e_1.

In the next section we shall analyze the qualitative behaviour of the solutions of this Hamiltonian system.

In this section we analyze the Hamiltonian system obtained in the previous section. It turns out that for small enough damping coefficients the trajectories corresponding to the 1-dimensional solution obtained in Section 2 are optimal. If the initial point of the trajectory lies beyond a critical curve in the (t, y_1)-plane, however, the second variable y_2 is no longer identically zero. This is the case for the full Newton step. The corresponding bound on the Newton decrement can be obtained numerically by integration of the Hamiltonian system.

As in the 1-dimensional case, the Bellman function B(t, y) of the problem is constant on the trajectories of the Hamiltonian system and satisfies the boundary condition B(0, y) = ||y||. In order to construct it, we have to integrate the system in backward time with initial condition p(0) = y(0)/||y(0)||, launching a trajectory from every point of the y-plane. In general, the projections onto y-space of trajectories launched from different points may eventually intersect. In this case the trajectory with the maximal value of the Bellman function along it is retained. Therefore trajectories cease to be optimal at separator surfaces where they meet other trajectories with the same value of the Bellman function. In our case the plane y_2 = 0 acts as a separator surface by virtue of the symmetry (y_1, y_2, p_1, p_2) → (y_1, −y_2, p_1, −p_2) of the system. Indeed, a trajectory launched from a point with y_2(0) ≠ 0 and hitting the plane y_2 = 0 at some time will meet there its image under the symmetry, which necessarily has the same value of the Bellman function.

Thus the trajectories which are relevant for Theorem 3.1 are those which either evolve entirely in the plane y_2 = 0, or which do not cross this plane on the time interval (−aγ, 0]. The first kind of trajectories corresponds to the 1-dimensional system considered in Section 2 and is depicted in Fig. 2. In the regions between the dashed curves and the vertical axis they are given by

y_1(t) = (c + t)/(1 − t),   y_2(t) = 0,   p_1(t) = t − 1,   p_2(t) = 0,     (2)

where c = y_1(0) is a parameter.

We now perturb trajectory (2) by launching it from a nearby point y(0) = (c, ε), and consider its evolution up to first order in ε. This can be done by solving the linearized ODE. The right-hand side of the Hamiltonian system is given by (ẏ, ṗ) = (∂H/∂p, −∂H/∂y). By the symmetry of the system the perturbations δy_2, δp_2 decouple from δy_1, δp_1, and they obey the linear equations

δy_2' = −δy_2/(1 − t) + 4 y_1² δp_2/(p_1 (y_1 + t)),   δp_2' = δp_2/(1 − t),

or, after inserting the values (2),

δy_2' = −δy_2/(1 − t) + 4 (c + t)² δp_2/((1 − t)² (t² − 2t − c)),   δp_2' = δp_2/(1 − t).

Here the first pair of relations is obtained by setting y_2 = p_2 = 0 in the partial derivatives of H, the second one by inserting the values (2). The linearized system has to be integrated backward in time with the initial condition δy_2(0) = 1, δp_2(0) = 1/|c|, which follows from the transversality condition. The equation for δp_2 integrates to δp_2(t) = 1/(|c| (1 − t)), so that the scalar function δy_2 is a solution of the ODE

δy_2' = −δy_2/(1 − t) + 4 (c + t)²/(|c| (1 − t)³ (t² − 2t − c))

with initial condition δy_2(0) = 1. Integrating this ODE, we obtain an explicit expression for δy_2, which takes different forms for c > −1 and for c < −1, since the roots of the polynomial t² − 2t − c are real in the former case and complex in the latter. Setting the variable δy_2 to zero, we obtain a critical value of t on trajectory (2) beyond which other trajectories of the system start to intersect the plane y_2 = 0. At this point trajectory (2) ceases to be optimal.
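For a given value of c the critical time can also be found numerically from the linearized equation displayed above. The following sketch (ad hoc names and tolerances; it presupposes the formulas just derived) integrates δy_2 backward from t = 0 and reports the time at which it vanishes; for c = −1 it reproduces t* = 1 − 2^{2/3} ≈ −0.5874.

```python
import numpy as np
from scipy.integrate import solve_ivp

def critical_time(c):
    """First zero of delta y_2 along trajectory (2), i.e. the critical time beyond
    which the planar synthesis ceases to be optimal (None if no zero is found)."""
    def rhs(t, v):
        return [-v[0] / (1.0 - t)
                + 4.0 * (c + t) ** 2 / (abs(c) * (1.0 - t) ** 3 * (t ** 2 - 2.0 * t - c))]
    def vanish(t, v):
        return v[0]
    vanish.terminal = True
    sol = solve_ivp(rhs, (0.0, -0.999), [1.0], events=vanish, rtol=1e-10, atol=1e-12)
    return sol.t_events[0][0] if sol.t_events[0].size else None

print(critical_time(-1.0), 1.0 - 2.0 ** (2.0 / 3.0))   # both approximately -0.5874
for c in (-1.5, -0.5):
    print(c, critical_time(c))
```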
The ensemble of these critical points for all values c ∈ R forms a curve in the (t, y_1)-plane which marks the limit of optimality of the synthesis obtained in Section 2. In order to obtain an expression for this curve, we use (2) to express c as a function of t, y_1, namely c = y_1 (1 − t) − t. Inserting this expression into the relation δy_2 = 0 yields a relation between t and y_1. For c > −1 and for c < −1 this yields explicit but unwieldy relations, while for c = −1 we obtain the point (t*, −1) with t* = 1 − 2^{2/3} ≈ −0.5874. In particular, both the switching curve and the separator curve lie beyond the critical curve and are not part of the optimal synthesis (see Fig. 4, left).

The trajectories with initial condition y(−aγ) = −a e_1 corresponding to a point (t, y_1) = (−aγ, −a) beyond the critical curve can be computed numerically. To each such initial condition there correspond two solutions, which are taken to each other by the symmetry (y_1, y_2, p_1, p_2) → (y_1, −y_2, p_1, −p_2). In Fig. 4, right, the trajectories corresponding to the value γ = 1 (full Newton step) and different a are depicted. The resulting bound on the decrement after the iteration is depicted in Fig. 1, left. A numerical analysis of the solutions yields the following results.

Numerical values: The radius of the convergence region of the exact Newton method equals λ* ≈ 0.6757. This means that if the Newton method is launched from an initial point x_0 with decrement ρ(x_0) < λ*, then it is guaranteed to converge to the nearby minimum. If at the iterate x_k we have ρ(x_k) ≤ ρ* ≈ 0.3943, then ρ(x_{k+1}) ≤ ρ̄* ≈ 0.1758. These values maximize the lower bound on the difference ρ(x_k) − ρ(x_{k+1}), which evaluates to δρ ≈ 0.2184. Applied to short-step interior-point methods, ρ* gives the radius of the tube around the central path which the iterates should not leave, while δρ is a lower bound on the length of the step by which we may move the target point along the central path at each iteration. (The actual values are a bit different due to second-order effects, but these details are not the subject of this paper.)

In this section we minimize the upper bound on ρ(x_{k+1}) with respect to the damping coefficient γ for a fixed initial value ρ(x_k) = a of the decrement. The minimizer of this problem yields the optimal damping coefficient for the Newton iterate, i.e., the one leading to the largest guaranteed decrease of the decrement. Technically, releasing γ is equivalent to releasing the left end of the time interval on which the trajectory of the Hamiltonian system evolves, while leaving the initial state fixed. It is well known that the partial derivative with respect to time of the Bellman function, i.e., of the objective achieved by the trajectory, equals the value of the Hamiltonian [3]. Therefore we look for trajectories with starting points lying on the surface H = 0.

Let us first evaluate H on the critical curve, more precisely on its arc between the lines y_1 = −1 and y_1 = 0. There we have p_1 < 0 and y_1 + t < 0. Setting y_2 = 0, p_2 = 0, we obtain H = p_1 (1 + y_1)/(1 − t). Thus for a < 1 we have H < 0, and hence the optimal initial time instant t is strictly smaller than the time instant defined by the critical curve. For a = 1, or equivalently y_1 = −1, the trajectory of the 1-dimensional system is optimal. As a consequence, for a → 1 the optimal damping coefficient tends to 2^{2/3} − 1 ≈ 0.5874.

Lemma 5.1. The hyper-surface H = 0 is integral for the Hamiltonian system defined by (1). Indeed, along the trajectories of the system we have dH/dt = ∂H/∂t, and this partial derivative vanishes on the surface H = 0; hence if H = 0 somewhere on a trajectory, then H ≡ 0 everywhere on the trajectory.
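The invariance stated in Lemma 5.1 can also be checked numerically. The sketch below integrates the canonical equations associated with (1) backward in time, using finite-difference gradients, starting from a state with H = 0 at t = 0 (a point of the circle of end-points identified in the next paragraph, with p(0) = y(0)/||y(0)||), and monitors H along the trajectory; the value should stay at the level of the integration error. The names, the step eps and the tolerances are ad hoc.

```python
import numpy as np
from scipy.integrate import solve_ivp

def H(t, y1, y2, p1, p2):
    # Hamiltonian (1)
    rad = (p1 * y1 - p2 * y2 + p1 * t) ** 2 + 4.0 * p2 ** 2 * y1 ** 2 * (1.0 - t ** 2)
    return (p1 + (p1 * y1 - p2 * y2) * t + np.sqrt(rad)) / (1.0 - t ** 2)

def rhs(t, z, eps=1e-6):
    # canonical equations y' = dH/dp, p' = -dH/dy via central finite differences
    grad = np.array([(H(t, *(z + eps * e)) - H(t, *(z - eps * e))) / (2.0 * eps)
                     for e in np.eye(4)])
    return np.concatenate([grad[2:], -grad[:2]])

y0 = np.array([-0.75, np.sqrt(3.0) / 4.0])   # satisfies y1 + ||y||^2 = 0, hence H = 0 at t = 0
p0 = y0 / np.linalg.norm(y0)                 # transversality condition p(0) = y(0)/||y(0)||
sol = solve_ivp(rhs, (0.0, -0.3), np.concatenate([y0, p0]),
                dense_output=True, rtol=1e-9, atol=1e-12)
print(max(abs(H(t, *sol.sol(t))) for t in np.linspace(0.0, -0.3, 31)))   # ~ 0
```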
In particular, it follows that at the end-point t = 0 of the trajectory we also have H = 0. From (1) we then obtain by virtue of the transversality condition p(0) = y(0)/||y(0)|| that

0 = H(0, y(0), p(0)) = (y_1(0) + ||y(0)||²)/||y(0)||.

The locus of the end-points in y-space is hence given by the circle (y_1 + 1/2)² + y_2² = 1/4. From now on we assume without loss of generality that y_2 ≥ 0 on the trajectory of the system.

Using the homogeneity of the dynamics with respect to the adjoint variable p, we may eliminate this variable altogether. On the surface H = 0 we have

(y_1² − 1) p_1² − 2 y_1 y_2 p_1 p_2 + (4 y_1² + y_2²) p_2² = 0,   p_1/p_2 = (y_1 y_2 + s)/(y_1² − 1),

where we abbreviate s = √(−4 y_1⁴ + 4 y_1² + y_2²), and consequently

ẏ_1 = (−p_1 y_1² + p_2 y_2 y_1 + p_1)/(p_1 + p_1 y_1 t − p_2 y_2 t) = s (−y_2 (y_1 + t) + (y_1 t + 1) s)/(4 y_1⁴ t² + 8 y_1³ t + 4 y_1² − y_2² t² + y_2²),

ẏ_2 = −(4 p_2 y_1² − p_1 y_1 y_2 + p_2 y_2²)/(p_1 + p_1 y_1 t − p_2 y_2 t) = s (4 t y_1³ + 4 y_1² + y_2² − t y_2 s)/(4 y_1⁴ t² + 8 y_1³ t + 4 y_1² − y_2² t² + y_2²).

A closer look reveals that the quotient of the two derivatives does not depend on t, and we obtain a planar dynamical system defined by the scalar ODE

dy_2/dy_1 = (y_1 y_2 + s)/(1 − y_1²).     (3)

We obtain the following result: the Riemann surfaces corresponding to the solutions of ODE (3) in the complex plane have an infinite number of quadratic ramification points, and the equation is not integrable in closed form. The optimal bound on ρ(x_{k+1}) as a function of ρ(x_k) can nevertheless be computed numerically and is depicted in Fig. 1, left. In order to obtain the value of the optimal damping coefficient one also has to integrate, along the solution curves of (3), the linear differential equation

dt/dy_1 = ((y_1 y_2 + s) + (y_2 + y_1 s) t)/(s (1 − y_1²))     (4)

for the time variable t.

Theorem 5.3. Let a ∈ (0, 1) be given, and let σ be the solution curve of ODE (3) through the point y_0 = (−a, 0), intersecting the circle (y_1 + 1/2)² + y_2² = 1/4 in the point y* in the upper half-plane y_2 > 0. Then the optimal damping coefficient γ for the Newton step with initial value ρ(x_k) = a of the decrement is given by γ = −t(y_0)/a, where t(y) is the solution of ODE (4) along the curve σ with initial value t(y*) = 0.

In order to compute the optimal value of γ one hence first has to integrate ODE (3) from y_0 to y*, and then ODE (4) back from y* to y_0. The result is depicted in Fig. 1, right. The solution curves of the ODEs are depicted in Fig. 5.

Numerical values: If the optimal damping coefficient is applied, then the Newton decrement is guaranteed to decrease whenever its current value is smaller than 1. The radius of the convergence region hence equals λ* = 1. If at the iterate x_k we have ρ(x_k) ≤ ρ* ≈ 0.4429, then ρ(x_{k+1}) ≤ ρ̄* ≈ 0.2129. These values maximize the lower bound on the difference ρ(x_k) − ρ(x_{k+1}), which evaluates to δρ ≈ 0.2300.

The upper bound on ρ(x_{k+1}) is actually quite close to ρ(x_k)². An asymptotic analysis of ODE (3) for small values of a yields an explicit approximation of the optimal bound; on the whole interval [0, 1] this approximation deviates from the exact bound by at most 7·10⁻³. An asymptotic analysis of ODE (4) for small values of a leads to the expansion γ = 1 − a³/2 + a⁴/4 + O(a⁵ log a) of the optimal damping coefficient. Unlike the situation in one dimension, in multiple dimensions the upper bound on the Newton decrement is a smooth function of the damping coefficient. Therefore a small deviation from the optimal value of γ will result only in an increase of the upper bound on ρ(x_{k+1}) by a term of second order.
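The procedure of Theorem 5.3 is straightforward to carry out numerically. The sketch below integrates ODE (3) from y_0 = (−a, 0) until the circle of end-points is reached and then integrates ODE (4) backward along the stored curve; it returns the optimal damping coefficient γ = −t(y_0)/a and the bound ||y*|| on ρ(x_{k+1}). Function names, tolerances and the event handling are ad hoc; for small a the output should be close to γ ≈ 1 − a³/2 and to the bound a².

```python
import numpy as np
from scipy.integrate import solve_ivp

def s(y1, y2):
    # s = sqrt(-4*y1**4 + 4*y1**2 + y2**2), well defined for |y1| < 1
    return np.sqrt(-4.0 * y1 ** 4 + 4.0 * y1 ** 2 + y2 ** 2)

def ode3(y1, y):                             # planar ODE (3): dy2/dy1
    y2 = y[0]
    return [(y1 * y2 + s(y1, y2)) / (1.0 - y1 ** 2)]

def hit_circle(y1, y):                       # circle of end-points (y1 + 1/2)^2 + y2^2 = 1/4
    return (y1 + 0.5) ** 2 + y[0] ** 2 - 0.25
hit_circle.terminal = True
hit_circle.direction = 1

def optimal_step(a):
    """Optimal damping coefficient and decrement bound for rho(x_k) = a."""
    sol3 = solve_ivp(ode3, (-a, 0.0), [0.0], events=hit_circle,
                     dense_output=True, rtol=1e-10, atol=1e-12)
    y1_star = sol3.t_events[0][0]
    y2_star = sol3.y_events[0][0][0]
    bound = np.hypot(y1_star, y2_star)       # ||y*||, bound on rho(x_{k+1})

    def ode4(y1, t):                         # linear ODE (4) for the time variable t
        y2 = sol3.sol(y1)[0]
        ss = s(y1, y2)
        return [((y1 * y2 + ss) + (y2 + y1 * ss) * t[0]) / (ss * (1.0 - y1 ** 2))]

    sol4 = solve_ivp(ode4, (y1_star, -a), [0.0], rtol=1e-10, atol=1e-12)
    gamma = -sol4.y[0, -1] / a               # t(y0) = -a*gamma
    return gamma, bound

for a in (0.1, 0.3, 0.4429):
    gamma, bound = optimal_step(a)
    print(f"a = {a:.4f}   gamma* = {gamma:.4f}   bound = {bound:.4f}   a^2 = {a * a:.4f}")
```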
We first furnish an interpretation of the results. Let us imagine the worst-case behaviour of the self-concordant function on the segment between the iterates x_k, x_{k+1} as the response of an adversarial player to our choice of the damping coefficient. The goal of this player is to maximize the Newton decrement ρ(x_{k+1}) at the next iterate. Since the control which is at the disposal of the adversarial player affects the third derivative of the function, we can roughly assume that he plays with the acceleration of the gradient.

First consider the case when the function is defined on an interval. Here the adversarial player has two different options. One is to maximally decelerate the gradient in order to prevent it from reaching zero at the end-point of the interval. This strategy pays off more if we choose a smaller step length. The other strategy is to first maximally accelerate the gradient, in order to give it enough velocity to overshoot. At some point, corresponding to the crossing of the switching curve in Fig. 2, the gradient is again decelerated by decreasing the Hessian, because the effect of a smaller denominator F'' in the objective function outweighs the effect of a larger gradient F', which enters in the numerator. This strategy pays off more if we choose a larger step length. Our optimal strategy will therefore be to choose that value of the damping coefficient which results in the same objective for both strategies of the adversarial player, i.e., we choose the initial point (−aγ, −a) on the separator curve in Fig. 2.

In the case of a multi-dimensional domain of definition the adversarial player has more options. In addition to accelerating or decelerating the gradient in the direction of movement, he may boost it in a perpendicular direction. Here he may choose this perpendicular direction arbitrarily, but once it is chosen, it is optimal to keep the acceleration vector in the plane spanned by the direction of movement and this particular direction. If the damping coefficient is large enough, more precisely if it corresponds to an initial point (t, y_1) beyond the critical curve in Fig. 4, left, the optimal strategy of the adversarial player is then indeed a mixture of boosts in the parallel and the perpendicular direction. Here the parallel component may be an acceleration or a deceleration, but the perpendicular component is always increased. For smaller damping coefficients the optimal strategy is a pure deceleration of the gradient.

The optimal value γ of the damping coefficient can be computed numerically by integrating two scalar ODEs, one of which is linear. For small values a of the decrement at the current iterate the correction with respect to the value 1 of the full Newton step is cubic in a. The obtained results enable us to tune the parameters in the machinery of interior-point methods to achieve larger steps along the central path. The optimal choices of these parameters may also be obtained as solutions to optimization problems, but these are beyond the scope of this paper. Besides this, they may serve in general to optimize the performance of the Newton method on self-concordant functions in the vicinity of a minimum.
[1] Some globally convergent modifications of Newton's method for solving systems of linear equations.
[2] On the worst-case complexity of the gradient method with exact line search for smooth strongly convex functions.
[3] The mathematical theory of optimal processes.
[4] Lectures on Convex Optimization.
[5] Interior-point Polynomial Algorithms in Convex Programming.
[6] Global convergence of damped Newton's method for nonsmooth equations via the path search.
[7] Exact worst-case performance of first-order methods for composite convex optimization.

The paper was written in coronavirus quarantine at Moscow Institute of Physics and Technology. The author would like to thank the medical staff for having provided appropriate working conditions.