Riemannian SOS-Polynomial Normalizing Flows
Jonathan Schwarz, Felix Draxler, Ullrich Köthe, Christoph Schnörr
Pattern Recognition (2021-03-17). DOI: 10.1007/978-3-030-71278-5_16

Abstract. Sum-of-Squares (SOS) polynomial normalizing flows have been proposed recently, without taking into account the convexity property and the geometry of the corresponding parameter space. We develop two gradient flows based on the geometry of the parameter space of the cone of SOS polynomials. A few proof-of-concept experiments using non-Gaussian target distributions validate the computational approach and illustrate the expressiveness of SOS-polynomial normalizing flows.

Optimal transport has become a central topic of mathematical modelling [19, 24] and of computational approaches to data analysis and machine learning [17]. Wasserstein distances based on various transportation cost functions and their dual formulations, parametrized by deep networks, provide a framework for generative data-driven modeling [6]. A prominent current line of research, initiated by [20, 21], concerns the representation and estimation of so-called normalizing flows, in order to model a data distribution ν in terms of an elementary reference measure μ, typically the standard Gaussian μ = N(0, I_n), as the pushforward measure ν = T♯μ with respect to a transportation map (diffeomorphism) T. This framework supports a broad range of tasks like density estimation, exploring posterior distributions, latent variable models, variational inference, uncertainty quantification, etc. See [12, 14, 16] for recent surveys.

A key requirement is the ability to evaluate both T and T^{-1} efficiently, along with the corresponding Jacobians. Based on classical work [11], triangular maps T and their relation to optimal transport have therefore become a focus of research [4, 8]. While the deviation from optimal transport, as defined by [7], can be bounded by transportation inequalities [22], merely regarding triangular maps T as diffeomorphisms (performing non-optimal transport) does not restrict expressiveness [3]. Accordingly, triangular maps parametrized by deep networks are nowadays widely applied.

A basic property that ensures the invertibility of T is monotonicity, which in connection with triangular maps can be achieved by the coordinatewise integration of nonnegative functions. In a recent paper [10], Sum-of-Squares (SOS) polynomials, which are nonnegative by construction, were used for this purpose as part of the standard procedure for training deep networks. However, both the convexity properties and the geometry of the parameter space of the cone of SOS polynomials [2, 13] were completely ignored. In this work, we take this geometry into account and devise computational approaches to the construction of transportation maps T. Specifically, we contribute:

- We introduce basic notions in Sect. 2 and specify the parametrization of triangular transportation maps using SOS polynomials in Sect. 3.
- Based on this parametrization, two algorithms for learning the parameters from given data are developed in Sect. 4. Algorithm 1 directly exploits the Riemannian geometry of the positive definite matrix cone. Algorithm 2 pulls back the objective function to the tangent bundle and performs ordinary gradient descent, using a Krylov subspace method for approximating a related matrix-valued entire function.
- We evaluate both algorithms and the expressiveness of SOS-polynomial flows in Sect. 5, using a few academic non-Gaussian distributions. To enable a clear assessment, we do not use a deep network for additional parametrization. Our findings regarding the first algorithm are quite positive, which stimulates further research on suitable extensions to large problem dimensions.

Notation. Let n ∈ N; then [n] denotes the set {1, 2, ..., n}. We denote the vector space of real multivariate polynomials in n variables of degree at most d ∈ N by R_d[x], and by

    v_d(x)    (1.1)

the corresponding vector of monomials of degree at most d, with dimension s_n(d); e.g.,

    v_2(x) = (1, x_1, \dots, x_n, x_1^2, x_1 x_2, \dots, x_n^2)^\top,  with  s_n(2) = \tfrac{1}{2}(n+1)(n+2).

We set t_n(d) = s_{n-1}(d) = \dim v_d(x_1, \dots, x_{n-1}, 0). S^n, S^n_+ and P^n denote the spaces of symmetric matrices, of symmetric positive semidefinite matrices, and of symmetric positive definite matrices, respectively, of dimension n × n. ⟨a, b⟩ = a^⊤ b denotes the Euclidean inner product of a, b ∈ R^n.

Let μ and ν denote the reference measure and the target measure supported on R^n, respectively. Throughout this paper, we assume that μ = N(0, I_n) is the standard multivariate Gaussian distribution and that ν is absolutely continuous with respect to the Lebesgue measure, such that

    \mu(dx) = p(x)\, dx,  \qquad  \nu(dy) = q(y)\, dy    (2.1)

with density functions p, q. Our objective is to compute a smooth diffeomorphism T : R^n → R^n such that ν = T♯μ becomes the pushforward (or image) measure of μ with respect to T, defined by

    T_\sharp \mu(V) = \mu(T^{-1}(V))    (2.2)

for all measurable subsets V. In terms of the densities (2.1), Eq. (2.2) reads

    q(y) = p(T^{-1}(y))\, |\det dT^{-1}(y)|,  \qquad  p(x) = q(T(x))\, |\det dT(x)|,    (2.3)

with the Jacobian matrices dT, dT^{-1}. As detailed in Sects. 2.2 and 3, we consider a subclass of diffeomorphisms of the form

    T_A = S_A^{-1},    (2.4)

where S_A : R^n → R^n is an increasing triangular map parametrized by A. The target measure ν is only given through i.i.d. samples

    y_i \sim \nu,  \quad  i \in [N].    (2.5)

Minimizing the Kullback-Leibler divergence between ν and the pushforward measure T_A♯μ amounts to minimizing the expected negative log-likelihood

    \mathbb{E}_{y \sim \nu}\big[ -\log p(S_A(y)) - \log \det dS_A(y) \big] + c,    (2.6)

where the constant c collects terms not depending on S_A. Replacing the expectation by the empirical expectation defines the objective function

    J(A) = \frac{1}{N} \sum_{i \in [N]} \big( -\log p(S_A(y_i)) - \log \det dS_A(y_i) \big).    (2.7)

After detailing the class of maps (2.4) in Sects. 2.2 and 3, the Riemannian gradient flow with respect to (2.7) will induce a normalizing flow of q to p (Sect. 4).

A mapping T : R^n → R^n is called triangular and increasing, respectively, if each component function T^k only depends on the variables x_i with i ≤ k,

    T(x) = \big( T^1(x_1),\, T^2(x_1, x_2),\, \dots,\, T^n(x_1, \dots, x_n) \big),    (2.8a)

and if each function

    x_k \mapsto T^k(x_1, \dots, x_k)    (2.8b)

is increasing in x_k. The existence of a triangular map T : C_1 → C_2 for any two open solid convex subsets C_1, C_2 ⊂ R^n was shown by Knothe [11]. More generally, the existence of a unique (up to μ-equivalence) triangular increasing map T that achieves ν = T♯μ was established in [3].

In this section, we adopt the approach from [10], using SOS polynomials for the construction of increasing triangular maps. The key difference is that we will exploit the geometry and convexity of the parameter space in Sect. 4 for deriving normalizing flows.

A polynomial p ∈ R_{2d}[x] is called a Sum-of-Squares (SOS) polynomial if it can be written as a finite sum of squares of polynomials [13]. We denote the subset of SOS polynomials by Σ_{2d}[x] ⊂ R_{2d}[x]. The following basic proposition says that each SOS polynomial corresponds to a parameter matrix A on the positive definite manifold:

    p_A(x) = v_d(x)^\top A\, v_d(x),  \quad  A \in P^{s_n(d)}.    (3.2)

Note that p_A(x) ≥ 0, ∀x ∈ R^n, by construction. Next, we use (2.8) and the representation (3.2) in order to define a family (2.4) of increasing triangular maps. Based on (3.2), define the sequence of SOS polynomials

    p^{[k]}(x_1, \dots, x_k) = v_d(x_1, \dots, x_k)^\top A^{[k]} v_d(x_1, \dots, x_k),  \quad  k \in [n],    (3.3)

and the sequence of linear forms

    \ell^{[k]}(x) = \langle c^{[k]}, (x_1, \dots, x_{k-1}) \rangle,  \quad  k \in [n],    (3.4)

that are parametrized by symmetric positive definite matrices A^{[k]} and vectors c^{[k]}, respectively. We collectively denote these parameters by

    A := \{ c^{[1]}, \dots, c^{[n]}, A^{[1]}, \dots, A^{[n]} \} \in P_{n,d},    (3.5a)

where P_{n,d} denotes the corresponding product space of vectors and positive definite matrices (3.5b). Then the map S_A = (S^{[1]}, ..., S^{[n]}) with components

    S^{[k]}(x_1, \dots, x_k) = \ell^{[k]}(x) + \int_0^{x_k} p^{[k]}(x_1, \dots, x_{k-1}, t)\, dt    (3.6)

is triangular and increasing due to the nonnegativity of the SOS polynomials (3.3). The inverse maps T_A = S_A^{-1} have a similar structure and could be parametrized in the same way. The objective function (2.7) therefore is well defined.
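To make the parametrization (3.2)-(3.6) concrete, the following Python sketch builds an SOS polynomial and one triangular map component for small n and d. This is our illustration, not code from the paper: all function names are ours, the monomial ordering is arbitrary, and the integral in (3.6) is evaluated numerically with scipy for simplicity, although for polynomials it could be computed in closed form.

```python
import itertools
import numpy as np
from scipy.integrate import quad

def monomial_exponents(n, d):
    """All exponent tuples of total degree <= d (a basis of R_d[x])."""
    return [e for e in itertools.product(range(d + 1), repeat=n)
            if sum(e) <= d]

def v(x, exps):
    """Monomial vector v_d(x) for the exponent list `exps`."""
    x = np.asarray(x, dtype=float)
    return np.array([np.prod(x ** np.array(e)) for e in exps])

def sos_poly(x, A, exps):
    """SOS polynomial p_A(x) = v_d(x)^T A v_d(x), cf. (3.2); nonnegative
    for positive (semi)definite A."""
    vx = v(x, exps)
    return vx @ A @ vx

def S_k(x, A_k, c_k, exps):
    """k-th component (3.6): a linear form in x_1,...,x_{k-1} plus the
    integral of an SOS polynomial, hence increasing in x_k."""
    *head, xk = x
    lin = np.dot(c_k, head)  # <c^[k], (x_1, ..., x_{k-1})>
    integral, _ = quad(lambda t: sos_poly([*head, t], A_k, exps), 0.0, xk)
    return lin + integral

# Example: n = 2, degree d = 1, so v_1(x1, x2) = (1, x1, x2), s_2(1) = 3,
# and p_A has degree 2d = 2.
exps = monomial_exponents(2, 1)
rng = np.random.default_rng(0)
B = rng.standard_normal((len(exps), len(exps)))
A_k = B @ B.T + 1e-3 * np.eye(len(exps))   # positive definite parameter
c_k = np.array([0.5])                      # coefficient of x_1
print(S_k([1.0, 2.0], A_k, c_k, exps))     # increasing in x_2
```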
In this section, we develop two different gradient descent flows with respect to the objective function (2.7) that take into account the geometry of the parameter space P_{n,d} (3.5b). Either flow is supposed to transport the target measure ν, which is only given through the samples (2.5), to the reference measure μ. This will be evaluated numerically in Sect. 5. Section 4.1 works out details of the Riemannian gradient flow, leading to Algorithm 1. Section 4.2 develops a closely related flow using different numerical techniques, leading to Algorithm 2. In what follows, the tangent space to (3.5b) at A is given and denoted by

    T_A P_{n,d} \cong S_{n,d},    (4.1)

the corresponding product space of vectors and symmetric matrices.

Consider the open cone of positive definite symmetric n × n matrices P^n. It becomes a Riemannian manifold [1] with the metric

    g_A(U, V) = \mathrm{tr}(A^{-1} U A^{-1} V),  \quad  U, V \in S^n.    (4.2)

The Riemannian gradient of a smooth function J : P^n → R reads

    \mathrm{grad}\, J(A) = A\, \partial J(A)\, A,    (4.3)

where ∂J(A) denotes the Euclidean gradient. The exponential map is globally defined and has the form

    \exp_A(U) = A^{1/2} \operatorname{expm}\!\big( A^{-1/2} U A^{-1/2} \big) A^{1/2},    (4.4)

with the matrix exponential function expm(B) = e^B, B ∈ R^{n×n}. Discretizing the flow using the geometric explicit Euler scheme with step size h and iteration counter t ∈ N yields

    A_{t+1} = \exp_{A_t}\!\big( -h\, \mathrm{grad}\, J(A_t) \big).    (4.5)

Applying this discretization to the respective components of (3.5) yields the natural gradient flow for the objective function (2.7) stated as Algorithm 1.

Consider again the case of a smooth objective function J : P^n → R. We exploit the fact that the exponential map (4.4) is globally defined on the entire tangent space S^n of (P^n, g), which does not hold for general Riemannian manifolds. Using

    \exp_I(U) = \operatorname{expm}(U),  \quad  U \in S^n,    (4.6)

we pull back J to the vector space S^n and perform ordinary gradient descent:

    J(U) := J(\operatorname{expm}(U)),    (4.7)
    U_{t+1} = U_t - h\, \nabla J(U_t).    (4.8)

Denote the canonical inner product on S^n by ⟨U, V⟩ = tr(UV). Then the gradient of J(U) is given by the equation

    \langle \nabla J(U), V \rangle = \langle \partial J(A),\, d\operatorname{expm}(U)[V] \rangle,  \quad  \forall V \in S^n,    (4.9)

where A = expm(U). It remains to evaluate the differential of the matrix exponential on the right-hand side of (4.9). Using the vectorization operator vec(·), which turns matrices into vectors by stacking the column vectors, we have the identity

    \operatorname{vec}(C X B^\top) = (B \otimes C) \operatorname{vec}(X).    (4.10)

Thus, by [9, Thm. 10.13], we conclude

    \operatorname{vec}\!\big( d\operatorname{expm}(U)[V] \big) = K(U) \operatorname{vec}(V),    (4.11a)
    K(U) = \big( I \otimes \operatorname{expm}(U) \big)\, \psi\big( U \oplus (-U) \big),    (4.11b)

where ⊗ denotes the Kronecker matrix product [23], ⊕ denotes the Kronecker sum, and ψ denotes the matrix-valued function given by the entire function ψ(x) = (e^x − 1)/x with matrix argument. Applying vec(·) to the left-hand side of (4.9) and substituting (4.11) in the right-hand side gives

    \operatorname{vec}\!\big( \nabla J(U) \big) = K(U) \operatorname{vec}\!\big( \partial J(A) \big).

As a result, (4.8) becomes

    U_{t+1} = U_t - h\, \operatorname{vec}^{-1}\!\Big( K(U_t) \operatorname{vec}\!\big( \partial J(A_t) \big) \Big).

In order to evaluate this iteration, the matrix K(U) given by (4.11b) is never computed explicitly. Rather, based on [18], the product K(U) vec(∂J(A_t)) is computed by approximating the product ψ(U ⊕ (−U)) vec(∂J(A_t)) as follows. Using the shorthands

    C := U \oplus (-U),  \quad  b := \operatorname{vec}\!\big( \partial J(A_t) \big),  \quad  q_1 := b / \lVert b \rVert,

one computes the Krylov subspace

    K_m(C, q_1) = \operatorname{span}\{ q_1, C q_1, \dots, C^{m-1} q_1 \}

using the basic Arnoldi iteration with initial vector q_1, along with an orthonormal basis V_m = (q_1, ..., q_m) of K_m(C, q_1) and the corresponding Hessenberg matrix H_m = V_m^⊤ C V_m. This yields the approximation

    \psi(C)\, b \approx \lVert b \rVert\, V_m\, \psi(H_m)\, e_1,    (4.19)

where

    e_1 = (1, 0, \dots, 0)^\top,    (4.20)

and ψ(H_m) is evaluated using any available routine [15] for the matrix exponential. Putting things together, the iteration (4.8) is numerically carried out by computing

    U_{t+1} = U_t - h\, \operatorname{vec}^{-1}\!\Big( \big( I \otimes \operatorname{expm}(U_t) \big)\, \lVert b \rVert\, V_m\, \psi(H_m)\, e_1 \Big).    (4.21)

In view of (4.6), we replace the overall parametrization (3.5a) by

    U := \{ c^{[1]}, \dots, c^{[n]}, U^{[1]}, \dots, U^{[n]} \} \in S_{n,d},    (4.22a)

with A^{[k]} = expm(U^{[k]}). Consequently, analogous to (4.7), we denote the pulled-back objective function (2.7) by J(U). Applying the procedure worked out above to each positive definite component of the overall parametrization (4.22) results in Algorithm 2. Both algorithms are initialized by choosing A_0 ∈ P_{n,d} such that T_{A_0} ≈ id.
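The following minimal sketch illustrates the two update rules for a single positive definite parameter block. It is our illustration under simplifying assumptions: the Euclidean gradient comes from a toy objective J(A) = tr(A) − log det(A) (not from the flow objective (2.7)), and, instead of the Krylov scheme (4.19)-(4.21), the Fréchet derivative of the matrix exponential is computed exactly with scipy.linalg.expm_frechet, which is feasible for small matrices.

```python
import numpy as np
from scipy.linalg import expm, expm_frechet, sqrtm, inv

def riemannian_step(A, egrad, h):
    """One step of the geodesic update (4.5) for a single matrix block:
    A <- exp_A(-h * grad J(A)), with grad J(A) = A egrad A, cf. (4.3),
    and exp_A given by (4.4)."""
    rgrad = A @ egrad @ A                  # Riemannian gradient (4.3)
    A_half = sqrtm(A).real
    A_ihalf = inv(A_half)
    inner = A_ihalf @ (-h * rgrad) @ A_ihalf
    return A_half @ expm(inner) @ A_half   # exponential map (4.4)

def pulled_back_step(U, egrad_at_A, h):
    """One step of ordinary gradient descent on U with A = expm(U),
    cf. (4.8). Since d expm(U) is self-adjoint at symmetric U, applying
    the Frechet derivative to the Euclidean gradient yields the pulled-back
    gradient vec^{-1}(K(U) vec(dJ(A))), cf. (4.9)-(4.11)."""
    _, dU = expm_frechet(U, egrad_at_A)
    dU = 0.5 * (dU + dU.T)                 # numerical hygiene: stay in S^n
    return U - h * dU

# Toy objective (ours): J(A) = tr(A) - log det(A), minimized at A = I.
def euclidean_grad(A):
    return np.eye(A.shape[0]) - inv(A)     # dJ(A) = I - A^{-1}

A = np.diag([2.0, 0.5])
for _ in range(50):
    A = riemannian_step(A, euclidean_grad(A), h=0.1)
print(np.round(A, 3))                      # ~ identity

U = np.diag(np.log([2.0, 0.5]))            # so that expm(U) = diag(2, 0.5)
for _ in range(200):
    U = pulled_back_step(U, euclidean_grad(expm(U)), h=0.1)
print(np.round(expm(U), 3))                # ~ identity as well
```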
Remark 1 (polynomial basis). The framework outlined above does not depend on the specific choice of the monomial basis (1.1). For example, replacing v_d(x) by Q v_d(x) for some regular linear transformation Q provides a viable alternative. For instance, a polynomial basis that is orthogonal with respect to a weighted L^2 inner product makes sense, especially if prior information about the support supp ν of the target measure is available.

In this section, we consider the objective function (2.7) for the specific case μ = N(0, I_n) and the task of generating samples y = T_A(x) ∼ ν from the estimated target measure, using samples x ∼ μ that are simple to compute. Taking into account the specific form of μ and the triangular structure of S_A, the objective function (2.7) simplifies (up to an additive constant) to

    J(A) = \frac{1}{N} \sum_{i \in [N]} \sum_{k \in [n]} \Big( \tfrac{1}{2}\, S^{[k]}(y_{i,1}, \dots, y_{i,k})^2 - \log \partial_{y_k} S^{[k]}(y_{i,1}, \dots, y_{i,k}) \Big).    (4.24)

Both Algorithms 1 and 2 can be used to minimize (4.24) numerically. The evaluation of the map T_A = S_A^{-1} makes use of the triangular structure of S_A in order to solve the system of equations

    S^{[1]}(y_1) = x_1,
    S^{[2]}(y_1, y_2) = x_2,
    \quad \vdots
    S^{[n]}(y_1, \dots, y_n) = x_n

for y = T_A(x) by computing y_1, y_2, ..., y_n recursively. Each step involves a few iterations of the one-dimensional Newton method, which converges to the unique solution thanks to the monotonicity of the triangular maps that holds by construction, cf. (3.6).
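A sketch of this recursive inversion with coordinatewise Newton iterations follows. The toy maps are ours, chosen to be increasing and triangular but not SOS-derived; for the maps (3.6), the Newton derivative of the k-th equation is simply the positive SOS polynomial p^[k] itself.

```python
import numpy as np

def invert_triangular(S_list, dS_list, x, newton_iters=20, y0=0.0):
    """Solve S_A(y) = x for y, i.e. evaluate y = T_A(x). The k-th equation
    S^[k](y_1,...,y_k) = x_k is scalar in y_k once y_1,...,y_{k-1} are known.
    dS_list[k] is the partial derivative of S^[k] with respect to its last
    argument; it is positive for increasing maps, so plain Newton converges
    to the unique root."""
    y = []
    for S_k, dS_k in zip(S_list, dS_list):
        yk = y0
        for _ in range(newton_iters):
            r = S_k(*y, yk) - x[len(y)]     # residual of the k-th equation
            yk -= r / dS_k(*y, yk)          # 1-D Newton update
        y.append(yk)
    return np.array(y)

# Toy 2-D example (ours): S^[1](y1) = y1 + y1^3/3,
# S^[2](y1, y2) = y1 + y2 + y2^3/3 -- triangular and increasing.
S_list  = [lambda y1: y1 + y1**3 / 3,
           lambda y1, y2: y1 + y2 + y2**3 / 3]
dS_list = [lambda y1: 1 + y1**2,
           lambda y1, y2: 1 + y2**2]
x = np.array([0.7, -1.2])
y = invert_triangular(S_list, dS_list, x)
print(y, [S_list[0](y[0]), S_list[1](*y)])  # second list ~ x
```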
In this section, we report numerical results as proof of concept and discuss the following two aspects:

- the expressiveness of polynomial SOS maps for measure transport and generative modeling (Sects. 5.2 and 5.3);
- the performance and comparison of the two geometric flows approximated by Algorithms 1 and 2 (Sect. 5.4).

We point out that, unlike the paper [10], no deep network was used for additional parametrization, which would obscure the influence of the SOS-polynomial maps.

We used the three two-dimensional densities open-ring, closed-ring and mixture of two Gaussians for this purpose (Fig. 1), which play the role of the data measure ν. A sample set y_i ∼ ν, i ∈ [N], with N = 2000, was generated as input data. Next, either algorithm was applied in order to estimate numerically the SOS parameters A given by (3.5a) by minimizing the objective function (4.24). We used SOS polynomials of degrees 2d ∈ {2, 4, 6} for parametrizing the maps T_A(x). Taking into account the symmetry of the matrices, the corresponding numbers of variables to be determined are 12, 31 and 70, respectively. Finally, samples x_i ∼ μ were generated and the map T_A = S_A^{-1} was computed (Sect. 4.3) in order to generate samples y_i = T_A(x_i). The corresponding kernel density estimates can then be compared to the plots depicted in Fig. 1.

Both Algorithms 1 and 2 were modified in a stochastic gradient descent like manner: every update was performed using the gradient with respect to a single random index i ∈ [N] of the objective (4.24), such that each index i has been visited after N updates. Thus, even though the considered problem sizes are small, we modified both geometric gradient descent algorithms such that they remain efficient for larger problem sizes [5].

We also checked the influence of changing the polynomial basis according to Remark 1. Specifically, Hermite polynomials, which are orthogonal with respect to a weighted L^2 inner product, were used instead of the canonical monomial basis. Figure 4 illustrates that this did not affect the process in a noticeable way. Neither did the result for the Gaussian mixture density show any noticeable effect.

We repeated all experiments reported in Sect. 5.2 using Algorithm 2, which is based on the parametrization detailed in Sect. 4.2, instead of Algorithm 1. The results are shown in Fig. 3. We generally observed fairly good density approximations even for low-degree polynomial parametrizations, which, however, do not achieve the accuracy of the results obtained using the Riemannian flows (cf. Fig. 2). In particular, we observed that increasing the polynomial degree did not systematically improve the approximation. We attribute this negative finding to two facts: firstly, Algorithm 2 does not exactly respect the geometry of the parameter space P_{n,d} (3.5b); secondly, the Krylov subspace approximation underlying the updates (4.21) may also affect the approximation accuracy. We leave a more detailed analysis for future work.

Fig. 2. Riemannian SOS-polynomial normalizing flows: results obtained with Algorithm 1, using the data of Fig. 1 and samples x_i ∼ N(0, I_n). The columns correspond (from left to right) to the degrees 2d ∈ {2, 6} of the SOS polynomials that were used to compute the increasing triangular maps T_A. Except for the mixture of two Gaussians density, low-degree SOS polynomials suffice to recover the densities quite accurately.

Fig. 3. Exponential SOS-polynomial normalizing flows: results of the experiments obtained using Algorithm 2 on the same data as for the experiments illustrated by Fig. 2. In comparison to the former results, the approximation accuracy deteriorated slightly. In addition, choosing larger polynomial degrees may not improve the result. We attribute this finding to the fact that Algorithm 2 is based on approximating the geometry of the parameter space in various ways (see text).

Fig. 4. Riemannian SOS-polynomial normalizing flows using Hermite polynomials, rather than the canonical monomial basis, on the same data as for the experiments illustrated by Fig. 2.

Comparing the results discussed in Sects. 5.2 and 5.3 suggests that Riemannian SOS-polynomial normalizing flows should be preferred. A striking difference concerns the dependency on the polynomial degree. While the Riemannian SOS flow generally yields improved density approximations when the degree is increased, this is hardly the case for the exponential parametrization. Possible reasons were discussed in the preceding section. In both cases, however, even small polynomial degrees suffice to represent densities by transportation maps quite accurately.

We studied transportation maps for generative modeling using Sum-of-Squares polynomials for the construction of increasing triangular maps. Two parametrizations were studied, along with two numerical algorithms for estimating the parameters by minimizing a sample-based objective function. Experiments show that low-degree polynomials suffice to recover basic non-Gaussian distributions quite accurately. Riemannian SOS-polynomial flows that fully respect the geometry of the parameter space perform best, whereas approximations of the geometry may cause detrimental effects. We regard the reported preliminary experimental results merely as proof of concept, conducted with low-degree parametrizations and a small dimension of the underlying domain. Our future work will be devoted to geometric methods for taming the complexity of large-degree parametrizations and the representation of high-dimensional generative models.
References

[1] Bhatia, R.: Positive Definite Matrices. Princeton University Press
[2] Blekherman, G., Parrilo, P.A., Thomas, R.R.: Semidefinite Optimization and Convex Algebraic Geometry. SIAM
[3] Bogachev, V.I., Kolesnikov, A.V., Medvedev, K.V.: Triangular transformations of measures
[4] Bonnotte, N.: From Knothe's rearrangement to Brenier's optimal transport map
[5] Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning
[6] Bousquet, O., Gelly, S., Tolstikhin, I., Simon-Gabriel, C.-J., Schölkopf, B.: From optimal transport to generative modeling: the VEGAN cookbook
[7] Brenier, Y.: Polar factorization and monotone rearrangement of vector-valued functions
[8] Carlier, G., Galichon, A., Santambrogio, F.: From Knothe's transport to Brenier's map and a continuation method for optimal transport
[9] Higham, N.J.: Functions of Matrices: Theory and Computation. SIAM
[10] Jaini, P., Selby, K.A., Yu, Y.: Sum-of-squares polynomial flow
[11] Knothe, H.: Contributions to the theory of convex bodies
[12] Kobyzev, I., Prince, S.J.D., Brubaker, M.A.: Normalizing flows: an introduction and review of current methods
[13] Marshall, M.: Positive Polynomials and Sums of Squares
[14] Marzouk, Y., Moselhy, T., Parno, M., Spantini, A.: An introduction to sampling via measure transport
[15] Moler, C., Van Loan, C.: Nineteen dubious ways to compute the exponential of a matrix, twenty-five years later
[16] Papamakarios, G., Nalisnick, E., Rezende, D.J., Mohamed, S., Lakshminarayanan, B.: Normalizing flows for probabilistic modeling and inference
[17] Peyré, G., Cuturi, M.: Computational optimal transport
[18] Saad, Y.: Analysis of some Krylov subspace approximations to the matrix exponential operator
[19] Santambrogio, F.: Optimal Transport for Applied Mathematicians
[20] Tabak, E.G., Turner, C.V.: A family of nonparametric density estimation algorithms
[21] Tabak, E.G., Vanden-Eijnden, E.: Density estimation by dual ascent of the log-likelihood
[22] Talagrand, M.: Transportation cost for Gaussian and other product measures
[23] Van Loan, C.F.: The ubiquitous Kronecker product
[24] Villani, C.: Optimal Transport: Old and New