A Variational Formulation of Accelerated Optimization on Riemannian Manifolds
Valentin Duruisseaux and Melvin Leok
January 16, 2021

Abstract. It was shown recently by Su et al. (2016) that Nesterov's accelerated gradient method for minimizing a smooth convex function $f$ can be thought of as the time discretization of a second-order ODE, and that $f(x(t))$ converges to its optimal value at a rate of $\mathcal{O}(1/t^2)$ along any trajectory $x(t)$ of this ODE. A variational formulation was introduced in Wibisono et al. (2016) which allowed for accelerated convergence at a rate of $\mathcal{O}(1/t^p)$, for arbitrary $p>0$, in normed vector spaces. This framework was exploited in Duruisseaux et al. (2021) to design efficient explicit algorithms for symplectic accelerated optimization. In Alimisis et al. (2020), a second-order ODE was proposed as the continuous-time limit of a Riemannian accelerated algorithm, and it was shown that the objective function $f(x(t))$ converges to its optimal value at a rate of $\mathcal{O}(1/t^2)$ along solutions of this ODE. In this paper, we show that on Riemannian manifolds, the convergence rate of $f(x(t))$ to its optimal value can also be accelerated to an arbitrary convergence rate $\mathcal{O}(1/t^p)$, by considering a family of time-dependent Bregman Lagrangian and Hamiltonian systems on Riemannian manifolds. This generalizes the results of Wibisono et al. (2016) to Riemannian manifolds and also provides a variational framework for accelerated optimization on Riemannian manifolds. An approach based on the time-invariance property of the family of Bregman Lagrangians and Hamiltonians was used to construct very efficient optimization algorithms in Duruisseaux et al. (2021), and we establish a similar time-invariance property in the Riemannian setting. One expects that a geometric numerical integrator that is time-adaptive, symplectic, and Riemannian manifold preserving will yield a class of promising optimization algorithms on manifolds.

1. Introduction. Efficient optimization has become one of the major concerns in data analysis. Many machine learning algorithms are designed around the minimization of a loss function or the maximization of a likelihood function. Due to the ever-growing scale of the data sets and the size of the problems, there has been a lot of focus on first-order optimization algorithms because of their low cost per iteration. The first gradient descent algorithm was proposed in [5] by Cauchy to deal with the very large systems of equations he was facing when trying to simulate orbits of celestial bodies, and many gradient-based optimization methods have been proposed since Cauchy's work in 1847. In 1983, Nesterov's accelerated gradient method was introduced in [21], and was shown to converge at a rate of O(1/k^2) to the minimum of the convex objective function f, improving on the O(1/k) convergence rate exhibited by standard gradient descent methods. This O(1/k^2) convergence rate was shown in [22] to be optimal among first-order methods using only information about ∇f at consecutive iterates. The phenomenon in which an algorithm displays this improved rate of convergence is referred to as acceleration, and other accelerated algorithms have been derived since Nesterov's algorithm, such as accelerated mirror descent [20] and the accelerated cubic-regularized Newton's method [23].
More recently, it was shown in [25] that Nesterov's accelerated gradient method limits to a second-order ODE as the time-step goes to 0, and that the objective function f(x(t)) converges to its optimal value at a rate of O(1/t^2) along the trajectories of this ODE. It was then shown in [27] that in continuous time, the convergence rate of f(x(t)) can be accelerated to an arbitrary convergence rate O(1/t^p) in normed spaces, by considering flow maps generated by a family of time-dependent Bregman Lagrangian and Hamiltonian systems which is closed under time rescaling. This variational framework and the time-invariance property of the family of Bregman Lagrangians were then exploited in [9] using time-adaptive geometric integrators to design efficient explicit algorithms for symplectic accelerated optimization. It was observed that a careful use of adaptivity and symplecticity could result in a significant gain in computational efficiency. In the past few years, there has been some effort to derive accelerated optimization algorithms in the Riemannian manifold setting [2-4; 17; 28; 29]. In [3], a second-order ODE was proposed as the continuous-time limit of a Riemannian accelerated algorithm, and it was shown that the objective function f(x(t)) converges to its optimal value at a rate of O(1/t^2) along solutions of this ODE, generalizing the Euclidean result obtained in [25] to the Riemannian manifold setting. In this paper, we show that in continuous time, the convergence rate of f(x(t)) to its optimal value can be accelerated to an arbitrary convergence rate O(1/t^p) on Riemannian manifolds, thereby generalizing the results of [27] to the Riemannian setting. This is achieved by considering a family of time-dependent Bregman Lagrangian and Hamiltonian systems on Riemannian manifolds. This also provides a variational framework for accelerated optimization on Riemannian manifolds, generalizing the normed vector space variational formulation of accelerated optimization introduced in [27]. We will then illustrate the derived theoretical convergence rates by integrating the Bregman Euler-Lagrange equations using a simple numerical scheme to solve eigenvalue and distance minimization problems on Riemannian manifolds. Finally, we will show that the family of Bregman dynamics on Riemannian manifolds is closed under time rescaling, and we will draw inspiration from the approach introduced in [9] to take advantage of this invariance property via a carefully chosen Poincaré transformation that will allow for the integration of higher-order Bregman dynamics while benefiting from the computational efficiency of integrating lower-order Bregman dynamics on Riemannian manifolds. We first introduce the main notions from Riemannian geometry and Lagrangian and Hamiltonian mechanics that will be used throughout this paper (see [3; 10; 11; 13; 14; 18] for more details). Definition 2.1. Given a manifold Q, the tangent bundle T Q and cotangent bundle T * Q are defined by T Q = {(q, v) ∶ q ∈ Q and v ∈ T q Q} and T * Q = {(q, p) ∶ q ∈ Q and p ∈ T * q Q}. Definition 2.2. Suppose we have a Riemannian manifold Q with Riemannian metric g(⋅, ⋅) = ⟨⋅, ⋅⟩, represented by the positive-definite symmetric matrix (g ij ) in local coordinates. Then, we define the musical isomorphism g ♭ ∶ T Q → T * Q by g ♭ (u)(v) = g p (u, v) for all p ∈ Q and all u, v ∈ T p Q, and its inverse musical isomorphism g ♯ ∶ T * Q → T Q.
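In local coordinates, these musical isomorphisms are simply the standard index-lowering and index-raising operations associated with the metric matrix (g ij ) and its inverse; for concreteness (this is a standard fact, included here only as an illustration),
\[
\big(g^\flat(u)\big)_i = \sum_{j=1}^{n} g_{ij}\, u^j, \qquad \big(g^\sharp(\alpha)\big)^i = \sum_{j=1}^{n} g^{ij}\, \alpha_j ,
\]
for a tangent vector u = u^j ∂/∂x^j and a covector α = α_j dx^j.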
The Riemannian metric g(⋅, ⋅) = ⟨⋅, ⋅⟩ induces a fiber metric g * (⋅, ⋅) = ⟪⋅, ⋅⟫ on T * Q by represented by the positive definite symmetric matrix (g ij ) in local coordinates, which is the inverse of the Riemannian metric matrix (g ij ). Definition 2.3. The Riemannian gradient gradf(q) ∈ T q Q at a point q ∈ Q of a smooth function f ∶ Q → R is the tangent vector at q such that where df is the differential of f . The set of all vector fields on Q is denoted X (Q). The integral curve at q of X ∈ X (Q) is the smooth curve c on Q such that c(0) = q and c ′ (t) = X(c(t)). Definition 2.5. A geodesic in a Riemannian manifold Q is a parametrized curve γ ∶ [0, 1] → Q which is of minimal local length. It can be thought of as a curve having zero "acceleration" or constant "speed", that is as a generalization of the notion of straight line from Euclidean spaces to Riemannian manifolds. Given two points q,q ∈ Q, a vector in T q Q can be transported to TqQ along a geodesic γ by an operation Γ(γ)q q ∶ T q Q → TqQ called parallel transport along γ. We will simply write Γq q to denote the parallel transport along some geodesic connecting the two points q,q ∈ Q, and given A ∈ X (Q), we will denote by Γ(A) the parallel transport along integral curves of A. Note that parallel transport preserves inner products: given a geodesic γ from q ∈ Q toq ∈ Q, where γ is the unique integral curve of X such that γ(0) = q, for any q ∈ Q. Definition 2.7. A function f ∶ Q → R is called L-smooth if for any two points q,q ∈ Q and geodesic γ connecting them, Definition 2.8. The Riemannian Exponential map Exp q ∶ T q Q → Q at q ∈ Q is defined by where γ v is the unique geodesic in Q such that γ v (0) = q and γ ′ v (0) = v, for any v ∈ T q Q. Exp q is a diffeomorphism in some neighborhood U ⊂ T q Q containing 0, so we can define its inverse map, the Riemannian Logarithm map Log p ∶ Exp q (U ) → T q Q. Definition 2.9. Given a Riemannian manifold Q with sectional curvature bounded below by K min , and an upper bound D for the diameter of the considered domain, define Note that ζ ≥ 1 since x coth x ≥ 1 for all real values of x. Definition 2.10. A subset A of a Riemannian manifold Q is called geodesically uniquely convex if every two points of A are connected by a unique geodesic in A. A function f ∶ Q → R is called geodesically convex if for any two points q,q ∈ Q and geodesic γ connecting them, Note that if f is a smooth geodesically convex function on a geodesically uniquely convex subset A of a Riemannian manifold, then A local minimum of a geodesically convex or weakly-quasi-convex function is also a global minimum, and a geodesically strongly convex function either has no minimum or a unique global minimum. Also note that a geodesically convex function is λ-weakly-quasi-convex with λ = 1. Mechanics. Given a n-dimensional Riemannian manifold Q with local coordinates (q 1 , . . . , q n ), a Lagrangian is a function L ∶ T Q × R → R. The corresponding action integral S is defined to be the functional The Legendre transform FL ∶ T Q → T * Q of L is defined fiberwise by FL ∶ (q i ,q i ) ↦ (q i , p i ) where p i = ∂L ∂q i ∈ T * Q is the conjugate momentum of q i . We can then define the associated We can also define a Hamiltonian Variational Principle on the Hamiltonian side in momentum phase space where the variation is induced by an infinitesimal variation δq of the trajectory q that vanishes at the endpoints. 
This is equivalent to Hamilton's equations, given bẏ which can also be shown to be equivalent to the Euler-Lagrange equations (2.3). 3.1. Inspiration. A variational framework was introduced in [27] for accelerated optimization on normed vector spaces. Given a convex, continuously differentiable function h ∶ X → R on a normed vector space X such that ∇h(x) → ∞ as x → ∞, its corresponding Bregman divergence is defined by The Bregman Lagrangian and Hamiltonian are then defined to be which are scalar-valued functions of position x ∈ X , velocity v ∈ R d or momentum r ∈ R d , and of time t. Here, h * ∶ X * → R denotes the Legendre transform (or convex dual function) of h, defined by The Bregman Lagrangian and Hamiltonian family is parametrized by smooth functions of time, α t = α(t), β t = β(t), γ t = γ(t), which are said to satisfy the ideal scaling conditions ifβ t ≤ e αt andγ t = e αt . If the ideal scaling conditions are satisfied, then by Theorem 1.1 in [27] , . Theorem 3.1. If x(t) satisfies the Euler-Lagrange equations corresponding to the Bregman Lagrangian L α,β,γ , then the reparametrized curve y(t) = x(τ (t)) satisfies the Euler-Lagrange equations corresponding to the modified Bregman Lagrangian Lα ,β,γ whereα t = α τ (t) + logτ (t),β t = β τ (t) , and γ t = γ τ (t) . Furthermore α, β, γ satisfy the ideal scaling conditions (3.3) if and only ifα,β,γ do. We will now extend these results to the Riemannian manifold setting. Throughout this paper, we will make the following assumptions on the function f ∶ Q → R to be minimized and on the ambient Riemannian manifold Q, which are standard assumptions in Riemannian optimization [3; 4; 28; 29] : Assumption 3.2. Solutions of the differential equations derived in this paper remain inside a geodesically uniquely convex subset A of a complete Riemannian manifold Q (i.e. any two points in Q can be connected by a geodesic), such that diam(A) is bounded above by some constant D, that the sectional curvature is bounded from below by K min on A, and that Exp q is well-defined for any q ∈ A, and its inverse Log q is well-defined and differentiable on A for any q ∈ A. Furthermore, f is bounded below, geodesically L-smooth and all its minima are inside A. Cases. Suppose that f ∶ Q → R is a given geodesically λ-weakly-quasi-convex function, and that Assumption 3.2 holds true. Since a geodesically convex function is λ-weakly-quasi-convex with λ = 1, the following treatment also applies to the case where f is geodesically convex. We define a family of Bregman Lagrangians L α,β,γ ∶ T Q × R → R parametrized by smooth functions of time α, β, γ by 5) and the corresponding Bregman Hamiltonians H α,β,γ ∶ T * Q × R → R are given by where X ∈ Q denotes position on the manifold Q, V is the velocity vector field, R is the momentum covector field, t is the time variable, and ζ is given by equation (2.1). This family of functions is a generalization of the Bregman Lagrangians and Hamiltonians introduced in [27] for the convex continuously differentiable function h(x) = 1 2 ⟨x, x⟩. Throughout this paper, we will assume that the parameter functions α, β, γ satisfy the ideal scaling conditions (3.3). Theorem 3.3. The Bregman Euler-Lagrange equation corresponding to the Bregman Lagrangian L α,β,γ is given by Proof. See Appendix A.1. Theorem 3.4. Suppose that f ∶ Q → R is a geodesically λ-weakly-quasi-convex function, and that Assumption 3.2 is satisfied. 
Then, any solution X(t) to the Bregman Euler-Lagrange equation (3.7) satisfies f(X(t)) − f(x*) = O(e^{−β_t}), where x* is a minimizer of f. A subfamily of Bregman Lagrangians and Hamiltonians parametrized by p > 0, which is of particular practical interest, is given by the choice of parameter functions α_t = log p − log t, β_t = p log t + log C, and γ_t = p log t, where C > 0 is a constant. This yields the p-Bregman Lagrangian and the p-Bregman Hamiltonian (3.11), and the corresponding p-Bregman Euler-Lagrange equation (3.12). Theorem 3.5. Suppose that f ∶ Q → R is a geodesically weakly-quasi-convex function, and that Assumption 3.2 is satisfied. Then, the p-Bregman Euler-Lagrange equation (3.12) has a solution, and any solution X(t) converges to a minimizer x* of f at a rate of O(1/t^p). Proof. See Appendix C.1 for the existence of a solution to the p-Bregman Euler-Lagrange equations. The O(1/t^p) convergence rate follows directly from Theorem 3.4. Note that this theorem reduces to Theorem 5 from [3] when p = 2 and C = 1/4. Remark 3.6. To construct this variational framework for accelerated optimization, we first constructed candidate p-equations with the desired O(1/t^p) convergence rates, and then designed Lagrangians whose p-Bregman Euler-Lagrange equations matched the candidate p-equations, by inspection. We then used a similar approach to extend these results to the general α, β, γ case presented here. Remark 3.7. In our generalization of the Bregman Lagrangian and Hamiltonian to Riemannian manifolds, we have specialized to the case where h(x) = ½⟨x, x⟩, because its Hessian ∇²h(x) is the identity matrix, which significantly simplifies the Euler-Lagrange equations and the analysis. In addition, it avoids the complication of making intrinsic sense of terms like X + e^{−α_t}V in the vector space Bregman Lagrangians and Hamiltonians, which would require the use of Riemannian geodesics and exponentials since X ∈ Q while V ∈ T X Q. Strongly Convex Case. Suppose f ∶ Q → R is a geodesically µ-strongly-convex function, and that Assumption 3.2 is satisfied. With ζ given by equation (2.1), we define the corresponding Lagrangian L SC ∶ T Q × R → R in equation (3.14), and the corresponding Hamiltonian H SC ∶ T * Q × R → R. Theorem 3.8. The Euler-Lagrange equation corresponding to the Lagrangian L SC is given by equation (3.16). Proof. The derivation of the Euler-Lagrange equation is presented in Appendix A.2. Theorem 3.9. Suppose f ∶ Q → R is a geodesically µ-strongly-convex function, and suppose that Assumption 3.2 is satisfied. Then, the Euler-Lagrange equation (3.16) has a solution, and any solution X(t) converges to a minimizer x* of f at an exponential rate. Proof. See Appendix C.2 for the existence of a solution to the Euler-Lagrange equation (3.16), and Theorem 7 from [3] for the convergence rate. 4. Numerical Experiments. The p-Bregman Euler-Lagrange equation (3.12) can be rewritten as a first-order system in the position X and velocity V for the geodesically λ-weakly-quasi-convex case, and the Euler-Lagrange equation (3.16) corresponding to the Lagrangian L SC can be rewritten as a first-order system for the µ-strongly convex case. As in [3], we can adapt a semi-implicit Euler scheme (explicit Euler update for the velocity V followed by an update for the position X based on the updated value of V) to the Riemannian setting to obtain the following algorithm (Algorithm 4.1). Version I of Algorithm 4.1 corresponds to the usual update for the semi-implicit Euler scheme, while Version II is inspired by the reformulation of Nesterov's method from [26] that uses a corrected gradient ∇f(X_k + h b_k V_k) instead of the traditional gradient ∇f(X_k). Note that the SIRNAG algorithm presented in [3] corresponds to the special case where p = 2 and C = 1/4.
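As a concrete illustration of the structure of this semi-implicit Euler update on a Riemannian manifold, the following Python sketch implements Version I schematically. The callables exp_map, transport, and grad_f stand for the Riemannian exponential, parallel transport, and Riemannian gradient, and the coefficients a_k and b_k below (a friction term proportional to 1/t_k and a gradient weight proportional to C p² t_k^{p−2}) are illustrative assumptions rather than the exact coefficients of Algorithm 4.1.

```python
import numpy as np

def riemannian_semi_implicit_euler(x0, v0, grad_f, exp_map, transport,
                                   p=2, C=0.25, h=1e-2, num_steps=10_000, t0=1.0):
    """Sketch of a Riemannian semi-implicit Euler discretization of p-Bregman dynamics.

    grad_f(x)          -> Riemannian gradient of f at x (an element of T_x Q)
    exp_map(x, u)      -> Riemannian exponential Exp_x(u)
    transport(x, y, u) -> transport of u from T_x Q to T_y Q

    The coefficients a_k and b_k below are illustrative placeholders,
    not the exact coefficients of Algorithm 4.1.
    """
    x, v, t = x0, v0, t0
    for _ in range(num_steps):
        a_k = (p + 1.0) / t                 # assumed friction coefficient
        b_k = C * p**2 * t**(p - 2)         # assumed gradient coefficient
        # explicit Euler update for the velocity (Version I evaluates grad_f at x)
        v = v + h * (-a_k * v - b_k * grad_f(x))
        # position update along the geodesic in the direction of the new velocity
        x_new = exp_map(x, h * v)
        # move the velocity to the tangent space at the new point
        v = transport(x, x_new, v)
        x, t = x_new, t + h
    return x
```

In this sketch's notation, Version II would replace grad_f(x) by a gradient evaluated at a point shifted along the current velocity, e.g. grad_f(exp_map(x, h * b_k * v)), mirroring the corrected gradient ∇f(X_k + h b_k V_k) mentioned above.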
The first problem we have investigated is the problem presented in [3] of minimizing the (strongly convex) distance function f(x) = ½ d(x, q)² for a given point q, on a subset of chosen finite diameter of the hyperbolic plane H², which is a manifold with constant negative curvature K = −1. The second problem we have investigated is Rayleigh quotient optimization. Eigenvectors corresponding to the largest eigenvalue of a symmetric n × n matrix A maximize the Rayleigh quotient v⊺Av / v⊺v. Thus, a unit eigenvector v* corresponding to the largest eigenvalue of the matrix A is a minimizer of the function f(v) = −v⊺Av over the unit sphere Q = S^{n−1}, which can be thought of as a Riemannian submanifold with constant positive curvature K = 1 of R^n endowed with the Riemannian metric inherited from the Euclidean inner product g_v(u, w) = u⊺w. More information concerning the geometry of S^{n−1}, such as its tangent bundle, its orthogonal projection and exponential map, can be found in [1] (a minimal numerical sketch of these operations is given below). Solving the Rayleigh quotient optimization problem efficiently is challenging when the given symmetric matrix A is ill-conditioned and high-dimensional. Note that an efficient algorithm that solves the above minimization problem can also be used to find eigenvectors corresponding to the smallest eigenvalue of A, by applying it to −A, since the eigenvalues of −A are the negatives of the eigenvalues of A. Experiments carried out in [3] showed that SIRNAG (the convex p = 2 Algorithm 4.1) and the strongly convex Algorithm 4.1 were of comparable efficiency to, or more efficient than, the standard Riemannian Gradient Descent (RGD) method, depending on the properties of the objective function and on the geometry of the Riemannian manifold. We have conducted further numerical experiments to investigate how the simple discretization of the higher-order p = 6 Bregman dynamics compares to its p = 2 counterpart, and to see whether it matches the O(k^{−p}) convergence rate. The numerical results obtained for the distance minimization and Rayleigh minimization problems are illustrated in Figure 4.1, where all the algorithms were implemented with the same fixed time-step. We can see that the p = 6 algorithms outperform their p = 2 counterparts, and that the efficiency improvement is substantial. Note however that an increase in the value of p in Algorithm 4.1, which corresponds to an increase in the order of the Bregman dynamics integrated, requires a decrease in the time-step, in agreement with intuitive expectations. This time-step decrease requirement is especially important due to the polynomially growing h(kh)^{p−2} coefficient multiplying the gradient of f in the updates of the algorithm. Such a decrease in the time-step does not significantly affect the convergence rate, but the transition between the initialization and convergence phases takes longer. As a consequence, by using larger time-steps, the algorithm corresponding to a smaller value of p might achieve a desired convergence criterion with fewer iterations than the algorithm corresponding to a larger value of p, despite having a slower convergence rate. Similar issues arise when discretizing the continuous Euler-Lagrange flow associated with accelerated optimization on vector spaces, and in that situation, it was observed that time-adaptive symplectic integrators based on Hamiltonian variational integrators resulted in dramatically improved robustness and stability.
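To make the geometric ingredients of the Rayleigh quotient example concrete, here is a minimal sketch of the sphere operations that such a scheme needs: the exponential map on S^{n−1}, the orthogonal projection onto the tangent space, and the Riemannian gradient of f(v) = −v⊺Av obtained by projecting the Euclidean gradient −2Av. This is only an illustration under simplifying assumptions (it reuses the riemannian_semi_implicit_euler sketch above, approximates parallel transport by tangent projection, and uses arbitrary random data); it is not the implementation used for the experiments in Figure 4.1.

```python
import numpy as np

def sphere_exp(x, u, eps=1e-12):
    # Exponential map on S^{n-1}: follow the great circle from x in direction u.
    norm_u = np.linalg.norm(u)
    if norm_u < eps:
        return x
    return np.cos(norm_u) * x + np.sin(norm_u) * (u / norm_u)

def sphere_project(x, u):
    # Orthogonal projection of u onto the tangent space T_x S^{n-1}.
    return u - np.dot(x, u) * x

def rayleigh_grad(A, x):
    # Riemannian gradient of f(x) = -x^T A x: projection of the Euclidean gradient -2Ax.
    return sphere_project(x, -2.0 * (A @ x))

# Example usage on a random symmetric matrix (illustrative only; parameters chosen
# to match the SIRNAG special case p = 2, C = 1/4 noted above).
rng = np.random.default_rng(0)
B = rng.standard_normal((100, 100))
A = (B + B.T) / 2.0
x0 = rng.standard_normal(100)
x0 /= np.linalg.norm(x0)
x_final = riemannian_semi_implicit_euler(
    x0, np.zeros(100),
    grad_f=lambda x: rayleigh_grad(A, x),
    exp_map=sphere_exp,
    transport=lambda x, y, u: sphere_project(y, u),  # projection instead of exact transport
    p=2, C=0.25, h=1e-2, num_steps=2000)
print("Rayleigh quotient at the final iterate:", x_final @ A @ x_final)
```

This fixed-step discretization is precisely the kind of scheme whose robustness and stability the time-adaptive symplectic integrators mentioned above are intended to improve.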
As such, it will be natural to explore generalizations of time-adaptive symplectic integrators based on Hamiltonian variational integrators applied to Poincaré transformed Hamiltonians, that respect the Riemannian manifold structure in order to yield more robust and stable numerical discretizations of the flows we have studied in this paper in order to construct accelerated optimization algorithms on Riemannian manifolds. We will lay the foundation for such time-adaptive symplectic integrators in Section 5. Finally, Figure 4 .3 shows that the discretization empirically converges to the solution of the ODE as the time-step h goes to 0. Note that although all the discretizations follow the ODE trajectory closely, smaller time-steps result in a larger number of iterations, especially to transition from the initialization plateau to the convergence phase (around time t = 4 in the example presented in Figure 4.3) . A theoretical shadowing result bounding the error between the discrete-time RGD and its continuous-time limiting ODE was obtained in [3] thanks to the uniform contraction property of the dynamical system associated with Riemannian Gradient Descent. It would be desirable to obtain similar shadowing results in the future for discretizations of the class of ODEs considered in this paper, perhaps drawing inspiration from [30] . However, such a result might be very difficult to obtain because momentum methods lack contraction, are nondescending, and are highly oscillatory [3; 24] . While it is hoped that the continuous analysis in this paper will eventually guide the convergence analysis of discrete-time algorithms, this does not appear to be a straightforward exercise, as one would first need to reconcile the arbitrarily fast O(1 t p ) rate of convergence of the continuous-time trajectories with Nesterov's barrier theorem of O(1 k 2 ) for discrete-time algorithms. Even on normed vector spaces, obtaining theoretical guarantees was a challenging task, achieved in [30] in the special case where p > 2 under additional assumptions on the objective function and on its derivatives. Generalizing these results to the general family of α, β, γ Bregman Lagrangians on Riemannian manifolds would be much more challenging since the notions of derivatives become more complicated, and since all the usual vector space operations and objects have to be replaced by their Riemannian generalization which involve geodesics, parallel transport, Riemannian exponentials and Riemannian logarithms. Transformation. Let f ∶ Q → R be a given λ-weakly-quasiconvex function, and suppose Assumption 3.2 is satisfied. In Section 3, we formulated a variational framework for the minimization of f , via Bregman Lagrangians and Hamiltonians. We now extend Theorem 3.1 to Riemannian manifolds. Theorem 5.1. Suppose that Assumption 3.2 is satisfied and that the curve X(t) satisfies the Riemannian Bregman Euler-Lagrange equation (3.7) corresponding to L α,β,γ . Then the reparametrized curve X(τ (t)) satisfies the Bregman Euler-Lagrange equation (3.7) corresponding to the modified Riemannian Bregman Lagrangian Lα ,β,γ whereα t = α τ (t) + logτ (t),β t = β τ (t) , andγ t = γ τ (t) . Furthermore α, β, γ satisfy the ideal scaling conditions (3.3) if and only ifα,β,γ do. Proof. See Appendix D. As a special case, we have the following theorem: Theorem 5.2. Suppose that f ∶ Q → R is a geodesically λ-weakly-quasi-convex function, and that Assumption 3.2 is satisfied. Suppose X(t) satisfies the p-Bregman Euler-Lagrange equation (3.12) . 
Then, the reparametrized curve X(tp p ) satisfies thep-Bregman Euler-Lagrange equation (3.12). Thus, the entire subfamily of Bregman trajectories indexed by the parameter p can be obtained by speeding up or slowing down along the Bregman curve in spacetime corresponding to any specific value of p. Inspired by the computational efficiency of the approach introduced in [9] , it is natural to attempt to exploit the time-rescaling property of the Bregman dynamics together with a carefully chosen Poincaré transformation to transform the p-Bregman Hamiltonian into an autonomous version of thep-Bregman Hamiltonian in extended phase-space, wherep < p. This would allow us to integrate the higher-order p-Bregman dynamics while benefiting from the computational efficiency of integrating the lower-orderp-Bregman dynamics. Explicitly, the time rescaling τ (t) = tp p is associated to the monitor function and generates a Poincaré transformed Hamiltonian in the extended spaceQ = Q × R whereX = X X t andR = R R t . We will make the conventional choice X t = t, with conjugate momentum R t , and R t (0) = −H p (X(0), R(0), 0) = −H 0 , which is chosen so thatH p→p (X,R) = 0 along all integral curves through (X(0),R(0)). The time t shall be referred to as the physical time, while τ will be referred to as the fictive time. The corresponding Hamiltonian equations of motion in the extended phase space are then given bẏ Now, suppose (X(τ ),R(τ )) are solutions to these extended equations of motion, and let (x(t), r(t)) solve Hamilton's equations for the original Hamiltonian H p . Then H p→p (X(τ ),R(τ )) =H p→p (X(0),R(0)) = 0. Thus, the components (X(τ ), R(τ )) in the original phase space of (X(τ ),R(τ )) satisfy Therefore, (X(τ ), R(τ )) and (x(t), r(t)) both satisfy Hamilton's equations for the original Hamiltonian H p with the same initial values, so they must be the same. As a consequence, instead of integrating the p-Bregman Hamiltonian system (3.11), we can focus on the Poincaré transformed HamiltonianH p→p in extended phase-space given by equation (5.2), with H p and g p→p given by equations (3.11) and (5.1), that is The resulting integrator has constant time-step in fictive time τ but variable time-step in physical time t. In our prior work on discretizations of variational formulations of accelerated optimization on normed spaces [9] , we performed a very careful computational study of how time-adaptivity and symplecticity of the numerical scheme improve the performance of the resulting numerical optimization algorithm. In particular, we observed that time-adaptive Hamiltonian variational discretizations, which are automatically symplectic, with adaptive time-steps informed by the time invariance of the family of p-Bregman Lagrangians and Hamiltonians yielded the most robust and computationally efficient numerical optimization algorithms, outperforming fixed-timestep symplectic discretizations, adaptive-timestep non-symplectic discretizations, and Nesterov's accelerated gradient algorithm which is neither time-adaptive nor symplectic. As such, it would be desirable to generalize the time-adaptive Hamiltonian variational integrator framework to Riemannian manifolds, and apply it to the variational formulation of accelerated optimization on Riemannian manifolds. 
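For reference, the generic form of a Poincaré transformation with a monitor function g, of which the Hamiltonian H̄_{p→p̄} above (with g = g_{p→p̄} from (5.1)) is an instance, can be stated schematically as follows; this is the standard construction from geometric numerical integration and from [9], written here in local coordinates and only as a sketch, not as the precise expression (5.2):
\[
\bar{H}(\bar{X},\bar{R}) = g(X_t)\,\big(H(X,R,X_t) + R_t\big), \qquad \bar{X} = (X, X_t), \quad \bar{R} = (R, R_t),
\]
with corresponding equations of motion in the fictive time τ:
\[
\frac{dX}{d\tau} = g(X_t)\,\frac{\partial H}{\partial R}, \qquad
\frac{dR}{d\tau} = -\,g(X_t)\,\frac{\partial H}{\partial X}, \qquad
\frac{dX_t}{d\tau} = g(X_t), \qquad
\frac{dR_t}{d\tau} = -\,g'(X_t)\big(H + R_t\big) - g(X_t)\,\frac{\partial H}{\partial X_t}.
\]
Since R_t(0) is chosen so that H̄ vanishes initially, and H̄ is conserved along its own flow, the quantity H + R_t vanishes along the trajectory, and the equation dX_t/dτ = g(X_t) recovers the time reparametrization dt/dτ = g(t).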
Note that the variational framework for accelerated optimization presented in Section 3 has also been exploited successfully in the special case of Lie groups in subsequent papers [8; 15] , using two different formulations of time-adaptive symplectic Lagrangian integration, with very promising numerical results. Another important case involves Riemannian submanifolds that are embedded in a Riemannian linear manifold and are realized as the level set of a submersion. The characterization of the submanifold as the level set of a submersion, together with the linear space structure of the embedding space, and the variational characterization of the dynamics naturally lends itself to the use of the Lagrange multiplier theorem, which allows one to use Hamiltonian variational integrators defined on the embedding space by including a Lagrange multiplier term involving the submersion in the Lagrangian or Hamiltonian [6] . This is analogous to the derivation of the SHAKE and RATTLE methods as variational integrators for constrained systems (see, for example, Section 3.5 of [19] ). Another practical method can be obtained by projecting the updates of Hamiltonian variational integrators defined on the embedding space onto the constraint manifold [7] . The numerical results in these subsequent papers [6; 7] suggest that the time-adaptive Hamiltonian approach can be very competitive when numerically solving optimization problems on Riemannian manifolds. We have shown that on Riemannian manifolds, the convergence rate in continuous time of a geodesically convex or weakly-quasi-convex function f (x(t)) to its optimal value can be accelerated to an arbitrary convergence rate, which extended the results of [27] from normed vector spaces to Riemannian manifolds. This rate of convergence is achieved along solutions of the Euler-Lagrange and Hamilton's equations corresponding to a family of time-dependent Bregman Lagrangian and Hamiltonian systems on Riemannian manifolds. As was demonstrated in the normed vector space setting, such families of Bregman Lagrangians and Hamiltonians can be used to construct practical, robust, and computationally efficient numerical optimization algorithms that outperform Nesterov's accelerated gradient method by considering geometric structure-preserving discretizations of the continuous-time flows. Numerical experiments implementing a simple discretization of the p-Bregman Euler-Lagrange equations applied to a distance minimization and Rayleigh minimization problems confirmed that the higher-order algorithms outperform significantly their lower-order counterparts and the corresponding O(1 k p ) convergence rates. Numerical results also showed that using a corrected gradient in the update instead of the traditional gradient, as was done in [26] , improved the theoretically predicted polynomial convergence rate to an exponential rate of convergence in practice. While higher values of p result in faster rates of convergence, they usually require smaller time-steps and also appear to be more prone to stability issues under numerical discretization, which can cause the numerical optimization algorithm to diverge, but we anticipate that symplectic discretizations will address these stability issues. Finally, in analogy to what was done in [27] for normed vector spaces, we proved that the family of time-dependent Bregman Lagrangian and Hamiltonians on Riemannian manifolds is closed under time rescaling. 
Inspired by the computational efficiency of the approach introduced in [9], we can then exploit this invariance property via a carefully chosen Poincaré transformation that will allow us to integrate higher-order p-Bregman dynamics while benefiting from the computational efficiency of integrating a lower-orderp-Bregman Hamiltonian system. It was observed in our prior computational experiments in the normed vector space case [9] that geometric discretizations which respect the time-rescaling invariance and symplecticity of the Bregman Lagrangian and Hamiltonian flows were substantially less prone to stability issues, and were therefore more robust, reliable, and computationally efficient. As such, it is natural to develop time-adaptive Hamiltonian variational integrators for the Bregman Hamiltonian introduced in this paper describing accelerated optimization on Riemannian manifolds. Developing an intrinsic extension of Hamiltonian variational integrators to manifolds will require some additional work, since the current approach involves Type II/Type III generating functions q k+1 ) , which depend on the position at one boundary point, and the momentum at the other boundary point. However, this does not make intrinsic sense on a manifold, since one needs the base point in order to specify the corresponding cotangent space, and one should ideally consider a Hamiltonian variational integrator construction based on discrete Dirac mechanics [16] , which would yield a generating function E + d (q k , q k+1 , p k+1 ), E − d (q k , p k , q k+1 ), that depends on the position at both boundary points and the momentum at one of the boundary points. This approach can be viewed as a discretization of the generalized energy E(q, v, p) = ⟨p, v⟩ − L(q, v), in contrast to the Hamiltonian H(q, p) However, a more practical method relies on the fact that we have a Riemannian manifold, which is endowed with a Riemannian exponential and Riemannian logarithm that can be used to construct an extension of Hamiltonian variational integrators using geodesic normal coordinates. For many important matrix manifolds, one can replace the Riemannian exponential in the geodesic normal coordinates by a retraction [1] , which is often constructed using matrix factorizations. We anticipate that applying an appropriate generalization of Hamiltonian variational integrators to the Bregman Hamiltonians introduced in this paper will yield a novel class of robust and efficient accelerated optimization algorithms on Riemannian manifolds. The variational framework for accelerated optimization presented in Section 3 has also been exploited successfully in the special case of Lie groups in subsequent papers [8; 15] , using two different formulations of time-adaptive symplectic Lagrangian integration, with very promising numerical results which illustrate that our framework can be very competitive for optimization problems of interest on Lie groups and more generally on Riemannian manifolds. As mentioned at the end of Section 5, another important case involves Riemannian submanifolds that are embedded in a Riemannian linear manifold and are realized as the level set of a submersion. In [6] , we studied how holonomic constraints can be incorporated into variational integrators to constrain the updates of the numerical optimization algorithm to the Riemannian manifold of interest, and in [7] , the manifold constraints were enforced via projections. 
The numerical results in these two subsequent papers suggest that the time-adaptive Hamiltonian approach introduced in this paper can be the basis for competitive numerical optimization algorithms on Riemannian manifolds. It would be desirable in future work to analyze the resulting discrete-time algorithms and rigorously establish their rates of convergence. Although theoretical shadowing results have already been derived for certain discrete optimization algorithms on Riemannian manifolds, such a result might be very difficult to obtain for the momentum-based algorithms presented in this paper because momentum methods lack contraction, are nondescending and highly oscillatory [3; 24] . It might also be possible to generalize the theoretical guarantees obtained laboriously on normed vector spaces in [30] , but this would be an even more challenging task since the usual vector space operations and objects have to be replaced by their more convoluted Riemannian generalizations. In addition, we would like to better understand how to reconcile the arbitrarily high rate of convergence one expects from the continuous-time analysis, with Nesterov's barrier theorem on the rate of convergence of discrete-time algorithms. Appendix A. Derivation of the Euler-Lagrange Equations. Theorem A.1. The Euler-Lagrange equation corresponding to the Lagrangian is given by Proof. Consider a path on the manifold Q described in coordinates by (x(t),ẋ(t)) = q 1 (t), . . . , q n (t), v 1 (t), . . . , v n (t) . Then, with ⟨⋅, ⋅⟩ = ∑ n i,j=1 g ij dx i dx j , the Bregman Lagrangian L α,β,γ can be written as . Multiplying both terms by e αt−λ −1 ζγt , the Euler-Lagrange equations (2.3) for the Bregman Lagrangian L α,β,γ are given, for k = 1, . . . , n, by Rearranging terms, and multiplying by the matrix (g ij ) which is the inverse of (g ij ), we get, for k = 1, . . . n, the equation where Γ k ij are the Christoffel symbols given by Γ k ij = 1 2 ∑ n l=1 g kl ∂g jl ∂x i + ∂g li ∂x j − ∂g ij ∂x l , which gives the desired Euler-Lagrange equation once we use the ideal scaling equationγ t = e αt . Theorem A.2. The Euler-Lagrange equation corresponding to the Lagrangian L SC is given by Proof. Consider a path on the manifold Q described in coordinates by (x(t),ẋ(t)) = q 1 (t), . . . , q n (t), v 1 (t), . . . , v n (t) . Then, with ⟨⋅, ⋅⟩ = ∑ n i,j=1 g ij dx i dx j , the Lagrangian L SC can be written as . If we multiply both terms by e −ηt , the Euler-Lagrange equations (2.3) for the Lagrangian L SC are given, for k = 1, . . . , n, by Rearranging terms, and multiplying by the matrix (g ij ) which is the inverse of (g ij ), we get, for k = 1, . . . n, the equation where Γ k ij are the Christoffel symbols given by Γ k ij = 1 2 ∑ n l=1 g kl ∂g jl ∂x i + ∂g li ∂x j − ∂g ij ∂x l , which gives the desired Euler-Lagrange equation. Appendix B. Proof of the Convergence Rates. The proofs of the convergence rates of solutions to the Bregman Euler-Lagrange equations are inspired by those of Theorems 5 and 6 from [3] , and make use of Lemmas 2 and 12 therein: Lemma B.1. Given a Riemannian manifold Q with sectional curvature bounded above by K max and below by K min , with ζ given by equation (2.1), and such that Lemma B.2. Given a point q and a smooth curve X(t) on a Riemannian manifold Q, Theorem B.3. Suppose f ∶ Q → R is a λ-weakly-quasi-convex function, and suppose that Assumption 3.2 is satisfied. Then, any solution X(t) of the Bregman Euler-Lagrange equation Proof. 
Let Then, using Lemma B.2, t e βt (f (X) − f (x * )) + λ 2 e βt ⟨gradf(X),Ẋ⟩ + (ζ − 1)⟨Log X (x * ), −Ẋ⟩ + ⟨λe −αtẊ − Log X (x * ), λe −αt −α tẊ + ∇ẊẊ − ∇Ẋ Log X (x * )⟩. −α tẊ + ∇ẊẊ = −λ −1 ζe αtẊ − e 2αt+βt gradf(X). Thus, Canceling the ⟨gradf(X),Ẋ⟩ and ⟨Log X (x * ), −Ẋ⟩ terms out using Lemma B.2, we geṫ Now, since f is geodesically λ-weakly-quasi-convex, we have that λ (f (X) − f (x * )) + ⟨Log X (x * ), gradf(X)⟩ ≤ 0, so the ideal scaling equationβ t ≤ e αt implies that Moreover, Lemma B.1 yields ζ⟨Ẋ,Ẋ⟩ + ⟨Ẋ, ∇Ẋ Log X (x * )⟩ ≥ 0, so −λe −αt ζ⟨Ẋ,Ẋ⟩ + ⟨Ẋ, ∇Ẋ Log X (x * )⟩ ≤ 0. Therefore,Ė(t) ≤ 0, and so which gives the desired rate of convergence Appendix C. Proof of Existence Theorems. C.1. Convex and Weakly-Quasi-Convex Cases. Theorem C.1. Suppose Assumption 3.2 is satisfied, and let C, p > 0 and v > 1 be given constants. Then the differential equation has a global solution X ∶ [0, ∞) → Q under the initial conditions X(0) = x 0 ∈ Q andẊ(0) = 0. Proof. The proof is similar to that of Lemma 3 in [3] , which extended Theorem 1 in [25] to the Riemannian setting. We first define a family of smoothed equations for which we then show existence of a solution for all time. After choosing an equicontinuous and uniformly bounded subfamily of smoothed solutions, we use the Arzela-Ascoli Theorem on the complete Riemannian manifold Q to obtain a subsequence converging uniformly, and argue that the limit of this subsequence solves the original problem. When p = 2, we recover the simpler case considered in Lemma 3 of [3] , so we assume p ≠ 2 in this proof. Consider the following families of smoothed equations for δ > 0: Exp and Log are defined globally on Q by Assumption 3.2, so we can choose geodesically normal coordinates φ = ψ −1 around x 0 defined globally on Q and put c = φ ○ X. Using the smoothness of f and letting u =ċ gives a system of first-order ODEs defining a local representation for a vector field in T Q, and Section IV.3 of [13] guarantees that the smoothed ODE has a unique solution X δ C.2. Strongly Convex Case. Theorem C.2. Suppose that Assumption 3.2 is satisfied, and that η > 0 is a given constant. Then, the differential equation ∇ẊẊ + ηẊ + gradf(X) = 0, has a global solution X ∶ [0, ∞) → Q under the initial conditions X(0) = x 0 ∈ Q andẊ(0) = 0. Proof. Exp and Log are defined globally on Q by Assumption 3.2, so we can choose geodesically normal coordinates φ = ψ −1 around x 0 defined globally on Q and put c = φ ○ X. As in [3] , using the smoothness of f and letting u =ċ gives a system of first-order ODEs which defines a local representation for a vector field in T Q, and results from Section IV.3 of [13] guarantee that the initial-value differential equation has a unique solution locally around 0. It remains to show that this solution actually exists on [0, ∞). Towards contradiction, suppose [0, T ) is the maximal interval of existence of the solution X, for some finite T > 0. Then, Rearranging, integrating both sides and using the Cauchy-Schwarz inequality gives since f is bounded from below by Assumption 3.2. Thus, lim t→T X(t) exists, and since Q is complete, the limit is in Q, contradicting the maximality of [0, T ). This completes the proof. Appendix D. Proof of Invariance Theorem. Theorem D.1. Suppose that Assumption 3.2 is satisfied and that the curve X(t) satisfies the Riemannian Bregman Euler-Lagrange equation (3.7) corresponding to L α,β,γ . 
Then the reparametrized curve X(τ (t)) satisfies the Bregman Euler-Lagrange equation (3.7) corresponding to the modified Riemannian Bregman Lagrangian Lα ,β,γ whereα t = α τ (t) + logτ (t),β t = β τ (t) , andγ t = γ τ (t) . Furthermore α, β, γ satisfy the ideal scaling conditions (3.3) if and only ifα,β,γ do. Proof. Let Y (t) = X(τ (t)). Theṅ Y (t) =τ (t)Ẋ(τ (t)), and ∇Ẏ (t)Ẏ (t) =τ (t)Ẋ(τ (t)) +τ 2 (t)∇Ẋ (τ (t))Ẋ (τ (t)). Inverting these relations giveṡ X(τ (t)) = 1 τ (t)Ẏ (t), and ∇Ẋ (τ (t))Ẋ (τ (t)) = 1 τ 2 (t) The Bregman Euler-Lagrange equation (3.7) at time τ (t) is given by ∇Ẋ (τ (t))Ẋ (τ (t)) + λ −1 ζe α τ (t) −α τ (t) Ẋ (τ (t)) + e 2α τ (t) +β τ (t) gradf(X(τ (t))) = 0. Substituting the expressions for X(τ (t)),Ẋ(τ (t)) and ∇Ẋ (τ (t))Ẋ (τ (t)) in terms of Y (t) and its derivatives, and multiplying byτ 2 (t), we get ∇Ẏ (t)Ẏ (t) −τ (t) τ (t)Ẏ (t) + λ −1 ζe α τ (t) −α τ (t) τ (t)Ẏ (t) +τ 2 (t)e 2α τ (t) +β τ (t) gradf(Y (t)) = 0. Optimization Algorithms on Matrix Manifolds From Nesterov's estimate sequence to Riemannian acceleration A continuous-time perspective for modeling acceleration in Riemannian optimization Practical accelerated optimization on Riemannian manifolds Méthode générale pour la résolution des systèmes d'équations simultanées Accelerated optimization on Riemannian manifolds via discrete constrained variational integrators Accelerated optimization on Riemannian manifolds via projected variational integrators Time-adaptive Lagrangian variational integrators for accelerated optimization on manifolds Adaptive Hamiltonian variational integrators and applications to symplectic accelerated optimization Geometric Numerical Integration Riemannian Geometry and Geometric Analysis General Topology, Graduate Texts in Mathematics Fundamentals of Differential Geometry Introduction to Riemannian Manifolds Variational symplectic accelerated optimization on Lie groups Variational and geometric structures of discrete Dirac mechanics Accelerated first-order methods for geodesically convex optimization on Riemannian manifolds Introduction to mechanics and symmetry Discrete mechanics and variational integrators Problem Complexity and Method Efficiency in Optimization, Wiley -Interscience series in discrete mathematics A method of solving a convex programming problem with convergence rate O(1 k 2 ) of Applied Optimization Accelerating the cubic regularization of Newton's method on convex problems Shadowing properties of optimization algorithms A differential equation for modeling Nesterov's Accelerated Gradient method: theory and insights On the importance of initialization and momentum in deep learning A variational perspective on accelerated methods in optimization First-order methods for geodesically convex optimization An estimate sequence for geodesically convex optimization Direct Runge-Kutta discretization achieves acceleration Acknowledgments. The authors would like to thank the referees for their careful review of this paper and their helpful suggestions. locally around 0. Actually, X δ exists on [0, ∞). Indeed, by contradiction, let [0, T ) be the maximal interval of existence of X δ , for some finite T > 0. Using. Integrating and using the Cauchy-Schwarz inequality for the p < 2 case gives Ẋ δ dt < ∞ for a = 0, δ implies that lim t→T X δ (t) exists. Since Q is complete by Assumption 3.2, the limit is in Q, contradicting the maximality of [0, T ). 
The p > 2 case is similar: the integrand is replaced by t 2−p (max (δ, t)) −1 Ẋ δ , and the integral on [δ, T ) remains unchanged while the integral on [0, δ) can be bounded by the same expression using t < δ. Thus, in both cases, we can find a solution X δ ∶ [0, ∞) → Q to the smooth initial-value ODE, and its corresponding solutionWhen 0 < t ≤ δ, the smoothed ODE can be written asThus, we can use Lemma 4 in [3] to get for p > 2 thatFrom the Lipschitz assumption on f , we have thatThus, since parallel transport preserves inner products,Taking the supremum over 0 < t ≤ δ and rearranging gives for δ < δ M = 2The case p < 2 is done exactly in the same way except that we do not need to bound u p−2 by δ p−2 in the integrals since the t p−2 term in the differential equation is already replaced by δ p−2 .Note that when δ < δ M and δ < t < t M = 2(v+p+1) CL 1 p , the smoothed ODE can be rewritten asTherefore, we can use Lemma 4 in [3] once again to obtainUsing the fact that parallel transport preserves inner products, and dividing by t v+1 givesand since this upper bound is an increasing function of t, we have for any t ′ ∈ (δ, t) thatTaking the supremum over all t ′ ∈ (0, t) gives for δ < δ M and δ < t < t M , Thus, F is equicontinuous and uniformly bounded, and the Riemannian manifold Q is complete by Assumption 3.2, so by the Arzela-Ascoli Theorem (Theorem 17 in [12] ), F contains a subsequence that converges uniformly on [0, T ] to some function X * . The same argument as in part 5 of the proof of Lemma 3 of [3] shows that X * is a solution to the original initial-value ODE on [0, T ] which can then be extended to get a global solution on [0, ∞).Substituting the expressions for α, β, γ in terms ofα,β,γ yieldsτ (t)Ẏ (t) + e 2αt+βt gradf(Y (t)) = 0.This gives the Bregman Euler-Lagrange equation (3.7) corresponding to Lα ,β,γ ,(t) Ẏ (t) + e 2αt+βt gradf(Y (t)) = 0.The fact that the parameters α, β, γ satisfy the ideal scaling conditions (3.3) if and only if the parametersα,β,γ do is established in the proof of Theorem 1.2 of [27] .