Ultrahyperbolic Representation Learning
Marc T. Law, Jos Stam
2020-07-01

In machine learning, data is usually represented in a (flat) Euclidean space where distances between points are measured along straight lines. Researchers have recently considered more exotic (non-Euclidean) Riemannian manifolds such as hyperbolic space, which is well suited for tree-like data. In this paper, we propose a representation living on a pseudo-Riemannian manifold of constant nonzero curvature. It is a generalization of hyperbolic and spherical geometries where the nondegenerate metric tensor is not positive definite. We provide the necessary learning tools in this geometry and extend gradient-based optimization techniques. More specifically, we provide closed-form expressions for distances via geodesics and define a descent direction that guarantees the minimization of the objective function. Our novel framework is applied to graph representations.

In most machine learning applications, data representations lie on a smooth manifold [14] and the training procedure is optimized with an iterative algorithm such as line search or trust region methods [18]. In most cases, the smooth manifold is Riemannian: it is equipped with a positive definite metric on each tangent space (i.e. every non-vanishing tangent vector has positive squared norm). Due to the positive definiteness of the metric, the negative of the (Riemannian) gradient is a descent direction that can be exploited to iteratively minimize some objective function [1]. The choice of metric on the Riemannian manifold determines how relations between points are quantified.

The most common Riemannian manifold is the flat Euclidean space, which has constant zero curvature and in which distances between points are measured along straight lines. An intuitive example of a non-Euclidean Riemannian manifold is the spherical model (i.e. representations lie on a hypersphere), which has constant positive curvature and is used for instance in face recognition [23, 24]. On the hypersphere, geodesic distances are a function of angles. Similarly, Riemannian spaces of constant negative curvature are called hyperbolic [21]. Such spaces were shown by Gromov to be well suited to represent tree-like structures [8]. The machine learning community has adopted these spaces to learn tree-like graphs [5] and hierarchical data structures [9, 16, 17].

In this paper, we consider a class of non-Riemannian manifolds of constant nonzero curvature [26] not previously considered in machine learning. These manifolds not only generalize the hyperbolic and spherical geometries mentioned above, but also contain hyperbolic and spherical submanifolds and can therefore describe relationships specific to those geometries. The difference is that we consider the larger class of pseudo-Riemannian manifolds where the considered nondegenerate metric tensor is not positive definite. Optimizing a cost function on our non-flat ultrahyperbolic space requires a descent direction method that follows a path along the curved manifold. We achieve this by employing tools from differential geometry such as geodesics and exponential maps. The theoretical contributions of this paper are two-fold: (1) explicit methods to calculate dissimilarities and (2) general optimization tools on pseudo-Riemannian manifolds of constant nonzero curvature.
Notation: We denote points on a smooth manifold M [14] by boldface Roman characters x ∈ M. T_x M is the tangent space of M at x, and we write tangent vectors ξ ∈ T_x M in boldface Greek fonts. The vector dot product is the positive definite bilinear form denoted ⟨·, ·⟩ and defined as ⟨x, y⟩ = x^⊤ y. The ℓ2 norm of x is ‖x‖ = √⟨x, x⟩. R^d denotes the d-dimensional Euclidean space and R^d_* = R^d \ {0} is the Euclidean space with the origin removed.

Pseudo-Riemannian manifolds: A smooth manifold M is pseudo-Riemannian (also called semi-Riemannian [19]) if it is equipped with a pseudo-Riemannian metric tensor (called "metric" for short in differential geometry). The pseudo-Riemannian metric g_x(·, ·) at some point x ∈ M is a non-degenerate smooth symmetric bilinear form. Non-degeneracy means that if, for a given ξ ∈ T_x M and for all ζ ∈ T_x M, we have g_x(ξ, ζ) = 0, then ξ = 0. If the metric is also positive definite (i.e. g_x(ξ, ξ) > 0 iff ξ ≠ 0), then it is Riemannian. Riemannian geometry is thus a special case of pseudo-Riemannian geometry where the metric is positive definite. In general this is not the case, and non-Riemannian manifolds distinguish themselves by having non-vanishing tangent vectors ξ ≠ 0 that satisfy g_x(ξ, ξ) ≤ 0. For more details, we refer the reader to [2, 19, 26].

Pseudo-hyperboloids generalize spherical and hyperbolic geometries to the class of pseudo-Riemannian manifolds. First, we consider the flat Euclidean space R^d with d = p + q + 1 ∈ N, where each vector is written x = (x_0, x_1, ..., x_{q+p})^⊤. R^d becomes a pseudo-Euclidean space, denoted R^{p,q+1}, when equipped with the following scalar product:

∀x, y ∈ R^{p,q+1}:  ⟨x, y⟩_q := x^⊤ G y,   (1)

where G = G^{−1} = I_{q+1,p} is the d × d diagonal matrix with the first q + 1 diagonal elements equal to −1 and the remaining p equal to 1. Since R^{p,q+1} is a vector space, we can identify the tangent space with the space itself by means of the natural isomorphism T_x R^{p,q+1} ≈ R^{p,q+1}. Using the terminology of special relativity, R^{p,q+1} has q + 1 time dimensions and p space dimensions. A pseudo-hyperboloid is the following submanifold of codimension one in R^{p,q+1}:

Q^{p,q}_β := { x = (x_0, x_1, ..., x_{q+p})^⊤ ∈ R^{p,q+1} : ‖x‖²_q = β },   (2)

where β ∈ R_* is a nonzero real value and ‖x‖²_q = ⟨x, x⟩_q is the quadratic form associated with the scalar product at x. It is equivalent to work with either Q^{p,q}_β or Q^{q+1,p−1}_{−β} since they are interchangeable via an anti-isometry [19] (see supp. material for details). For instance, the unit hypersphere S^q ⊂ R^{q+1} is anti-isometric to Q^{0,q}_{−1}, which is then spherical. In the literature, the set Q^{p,q}_β is called a "pseudo-sphere" when β > 0 and a "pseudo-hyperboloid" when β < 0. In the following, we restrict ourselves to the pseudo-hyperbolic case (i.e. β < 0). We can obtain the spherical and hyperbolic geometries by constraining all the elements of the space dimensions of a pseudo-hyperboloid to be zero, or by constraining all the elements of the time dimensions except one to be zero, respectively. Pseudo-hyperboloids are therefore more general. The pseudo-hyperboloids that we consider in this paper are hard to visualize as they live in ambient spaces of dimension higher than 3. In Fig. 1, we show iso-surfaces of a projection of the 3-dimensional pseudo-hyperboloid Q^{2,1}_{−1} (embedded in R^{2,2}) into R^3 along its first time dimension.

Tangent space: Using the isomorphism T_x R^{p,q+1} ≈ R^{p,q+1} mentioned above, the tangent space of Q^{p,q}_β at x can be defined, for all β ≠ 0, as T_x Q^{p,q}_β = { ξ ∈ R^{p,q+1} : ⟨x, ξ⟩_q = 0 }.
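To make these definitions concrete, the following PyTorch snippet sketches the scalar product of Eq. (1) and uses it to check membership in Q^{p,q}_β and tangency. The function name and the toy example are illustrative only and not part of the original implementation.

```python
import torch

def scalar_q(x, y, q):
    """Pseudo-Euclidean scalar product <x, y>_q = x^T G y of Eq. (1), where the
    first q+1 diagonal entries of G equal -1 (time) and the remaining ones equal 1 (space)."""
    return (x[..., q + 1:] * y[..., q + 1:]).sum(-1) - (x[..., :q + 1] * y[..., :q + 1]).sum(-1)

# Example on Q^{2,1}_{-1}: q = 1, p = 2, ambient dimension d = p + q + 1 = 4, beta = -1.
q, beta = 1, -1.0
x = torch.tensor([1.0, 1.0, 1.0, 0.0])
print(scalar_q(x, x, q))    # tensor(-1.) = beta, so x lies on Q^{2,1}_{-1}
xi = torch.tensor([1.0, 0.0, 1.0, 0.0])
print(scalar_q(x, xi, q))   # tensor(0.), so xi is a tangent vector at x
```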
The orthogonal projection of an arbitrary d-dimensional vector z onto T_x Q^{p,q}_β is:

Π_x(z) := z − (⟨z, x⟩_q / ⟨x, x⟩_q) x = z − (⟨z, x⟩_q / β) x.   (3)

This section introduces the differential geometry tools necessary to quantify dissimilarities/distances between points on Q^{p,q}_β. Measuring dissimilarity is an important task in machine learning and has many applications (e.g. in metric learning [27]). Before considering distances on Q^{p,q}_β, we consider distances in the pseudo-Euclidean space in which it is embedded. We recall that R^{p,q+1} is isomorphic to its tangent space; tangent vectors are therefore naturally identified with points. Using the quadratic form of Eq. (1), the squared ambient distance between two points a, b ∈ Q^{p,q}_β is:

d²_q(a, b) := ‖a − b‖²_q = ⟨a, a⟩_q + ⟨b, b⟩_q − 2⟨a, b⟩_q = 2(β − ⟨a, b⟩_q).

This distance is a good proxy for the geodesic distance d_γ(·, ·), which we introduce below, if it preserves distance relations:

∀a, b, c, d ∈ Q^{p,q}_β:  d_γ(a, b) ≤ d_γ(c, d) ⟺ |d²_q(a, b)| ≤ |d²_q(c, d)|.

This relation is satisfied for two special cases of pseudo-hyperboloids for which the geodesic distance is well known: the hyperbolic and spherical cases. The ambient distance has been shown to be a good proxy in hyperbolic geometry [9]. Unfortunately, for the remaining ultrahyperbolic case (i.e. q ≥ 1 and p ≥ 2), the distance relations are not preserved: the equivalence d_γ(a, b) < d_γ(c, d) ⟺ |d²_q(a, b)| < |d²_q(c, d)| does not hold in general. Therefore, we need to compute the geodesics explicitly in these cases. This section contains the first theoretical contribution of this paper, namely closed-form expressions for the geodesic distance on ultrahyperbolic manifolds.

Geodesics: Informally, a geodesic is a curve joining points on a manifold M that minimizes some "effort" depending on the metric. More precisely, let I ⊆ R be a (maximal) interval containing 0. A geodesic γ : I → M maps a real value t ∈ I to a point on the manifold M. It is a curve on M defined by its initial point γ(0) = x ∈ M and its initial velocity γ'(0) = ξ ∈ T_x M, where γ'(t) denotes the derivative of γ at t. By analogy with physics, t is considered as a time value. Intuitively, one can think of the curve as the trajectory over time of a ball pushed from a point x at t = 0 with initial velocity ξ and constrained to roll on the manifold. We denote this curve explicitly by γ_{x→ξ}(t) unless the dependence is obvious from the context. For this curve to be a geodesic, its acceleration has to be zero: ∀t ∈ I, γ''(t) = 0. This condition is a second-order ODE that has a unique solution for a given set of initial conditions [15]. The interval I is said to be maximal if it cannot be extended to a larger interval. In the case of Q^{p,q}_β, we have I = R, and I is then maximal.

Geodesic of Q^{p,q}_β: As we show in the supp. material, the geodesics of Q^{p,q}_β are a combination of the hyperbolic, flat and spherical cases. The nature of the geodesic γ_{x→ξ} depends on the sign of ⟨ξ, ξ⟩_q. For all t ∈ R, the geodesic γ_{x→ξ} of Q^{p,q}_β is written:

γ_{x→ξ}(t) = cosh(t‖ξ‖_q/√|β|) x + (√|β|/‖ξ‖_q) sinh(t‖ξ‖_q/√|β|) ξ    if ⟨ξ, ξ⟩_q > 0,
γ_{x→ξ}(t) = x + tξ                                                    if ⟨ξ, ξ⟩_q = 0,
γ_{x→ξ}(t) = cos(t‖ξ‖_q/√|β|) x + (√|β|/‖ξ‖_q) sin(t‖ξ‖_q/√|β|) ξ      if ⟨ξ, ξ⟩_q < 0,   (5)

where ‖ξ‖_q := √|⟨ξ, ξ⟩_q|. We recall that ⟨ξ, ξ⟩_q = 0 does not imply ξ = 0. The geodesics are an essential ingredient to define a mapping known as the exponential map. See Fig. 2 (left) for a depiction of these three types of geodesics, and Fig. 2 (right) for a depiction of the other quantities introduced in this section.
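The following sketch implements the projection of Eq. (3) and the geodesic of Eq. (5), reusing scalar_q from the previous snippet. The branching tolerance (1e-9) and the helper names are our own additions.

```python
import torch

def project_tangent(x, z, q, beta):
    """Orthogonal projection of an ambient vector z onto T_x Q^{p,q}_beta (Eq. (3)):
    Pi_x(z) = z - (<z, x>_q / <x, x>_q) x, with <x, x>_q = beta."""
    return z - (scalar_q(z, x, q) / beta) * x

def geodesic(x, xi, t, q, beta):
    """Geodesic gamma_{x->xi}(t) of Eq. (5); assumes <x, x>_q = beta < 0 and <x, xi>_q = 0."""
    s, b = scalar_q(xi, xi, q), abs(beta) ** 0.5
    if s > 1e-9:                      # hyperbolic-like geodesic
        n = s.sqrt()
        return torch.cosh(t * n / b) * x + (b / n) * torch.sinh(t * n / b) * xi
    if s < -1e-9:                     # spherical-like geodesic
        n = (-s).sqrt()
        return torch.cos(t * n / b) * x + (b / n) * torch.sin(t * n / b) * xi
    return x + t * xi                 # null (flat) geodesic
```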
Exponential map: Exponential maps are a way of collecting all of the geodesics of a pseudo-Riemannian manifold M into a unique differentiable mapping. Let D_x ⊆ T_x M be the set of tangent vectors ξ such that γ_{x→ξ} is defined at least on the interval [0, 1]. This allows us to uniquely define the exponential map exp_x : D_x → M such that exp_x(ξ) = γ_{x→ξ}(1). For pseudo-hyperboloids, the exponential map is complete, that is D_x = T_x Q^{p,q}_β. Using Eq. (5) with t = 1, we obtain an exponential map defined on the entire tangent space:

exp_x(ξ) = cosh(‖ξ‖_q/√|β|) x + √|β| sinh(‖ξ‖_q/√|β|) ξ/‖ξ‖_q    if ⟨ξ, ξ⟩_q > 0,
exp_x(ξ) = x + ξ                                                 if ⟨ξ, ξ⟩_q = 0,
exp_x(ξ) = cos(‖ξ‖_q/√|β|) x + √|β| sin(‖ξ‖_q/√|β|) ξ/‖ξ‖_q      if ⟨ξ, ξ⟩_q < 0,   (6)

where ‖ξ‖_q := √|⟨ξ, ξ⟩_q|. We make the important observation that the image of the exponential map does not necessarily cover the entire manifold: not all points on a manifold are connected by a geodesic. This is the case for our pseudo-hyperboloids. Namely, for a given point x ∈ Q^{p,q}_β there exist points y that are not in the image of the exponential map (i.e. there does not exist a tangent vector ξ such that y = exp_x(ξ)).

We provide a closed-form expression of the logarithm map for pseudo-hyperboloids. Let U_x ⊆ Q^{p,q}_β be some neighborhood of x. The logarithm map log_x : U_x → T_x Q^{p,q}_β is defined as the inverse of the exponential map on U_x (i.e. log_x = exp_x^{−1}). With α := ⟨x, y⟩_q / β, we propose:

log_x(y) = (arccosh(α)/√(α² − 1)) (y − α x)    if α > 1,
log_x(y) = y − x                               if α = 1,
log_x(y) = (arccos(α)/√(1 − α²)) (y − α x)     if −1 < α < 1.   (7)

By substituting ξ = log_x(y) into Eq. (6), one can verify that these formulas invert the exponential map. The set U_x = { y ∈ Q^{p,q}_β : ⟨x, y⟩_q < |β| } is called a normal neighborhood of x ∈ Q^{p,q}_β since, for all y ∈ U_x, there exists a geodesic from x to y such that log_x(y) = γ'_{x→log_x(y)}(0). We show in the supp. material that the logarithm map is not defined if ⟨x, y⟩_q ≥ |β|.

Proposed dissimilarity: We define our dissimilarity function based on the general notions of arc length and radius function on pseudo-Riemannian manifolds, which we recall in the next paragraph (see details in Chapter 5 of [19]). It corresponds to the geodesic distance in the Riemannian case. Let U_x be a normal neighborhood of x ∈ M with M pseudo-Riemannian. The radius function r_x : U_x → R is defined as r_x(y) = √|g_x(log_x(y), log_x(y))|, where g_x is the metric at x. If σ_{x→ξ} is the radial geodesic from x to y ∈ U_x (i.e. ξ = log_x(y)), then the arc length of σ_{x→ξ} equals r_x(y). We then define the geodesic "distance" between x ∈ Q^{p,q}_β and y ∈ U_x as the arc length of σ_{x→log_x(y)}:

d_γ(x, y) := r_x(y) = √|⟨log_x(y), log_x(y)⟩_q|.   (8)

It is important to note that our "distance" is not a distance metric. However, it satisfies the axioms of a symmetric premetric: ∀x, y ∈ Q^{p,q}_β, d_γ(x, y) = d_γ(y, x) ≥ 0 and d_γ(x, x) = 0. These conditions are sufficient to quantify the notion of nearness via a ρ-ball centered at x: B^ρ_x = { y : d_γ(x, y) < ρ }. In general, topological spaces provide a qualitative (not necessarily quantitative) way to detect "nearness" through the concept of a neighborhood of a point [13]: something is true "near x" if it is true in a neighborhood of x (e.g. in B^ρ_x). Our premetric is similar to metric learning methods [11, 12, 27] that learn a Mahalanobis-like distance pseudo-metric parameterized by a positive semidefinite matrix: pairs of distinct points can have zero "distance" if the matrix is not positive definite. However, unlike metric learning, we can have triplets (x, y, z) for which the triangle inequality does not hold. Since the logarithm map is not defined if ⟨x, y⟩_q ≥ |β|, we propose to use a continuous (linear) approximation D_γ(·, ·) of d_γ(·, ·) that is defined on the whole manifold (Eq. (9)). To the best of our knowledge, the explicit formulation of the logarithm map for Q^{p,q}_β in Eq. (7) and its corresponding radius function in Eq. (8), used to define a dissimilarity function, are novel. Other choices of extension are possible for the case where the logarithm map is not defined; this interesting problem is left for future research.
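A sketch of the exponential map (Eq. (6)), the logarithm map (Eq. (7)) and the dissimilarity of Eq. (8), reusing the helpers above and assuming y lies in the normal neighborhood of x (i.e. ⟨x, y⟩_q < |β|). The continuous extension of Eq. (9) is not reproduced here, and the numerical tolerances are our own.

```python
import torch

def exp_map(x, xi, q, beta):
    """Exponential map of Eq. (6): exp_x(xi) = gamma_{x->xi}(1)."""
    return geodesic(x, xi, 1.0, q, beta)

def log_map(x, y, q, beta):
    """Logarithm map log_x(y) of Eq. (7); requires <x, y>_q < |beta|."""
    alpha = scalar_q(x, y, q) / beta
    if alpha > 1 + 1e-9:                 # hyperbolic-like case
        c = torch.acosh(alpha) / (alpha ** 2 - 1).sqrt()
    elif alpha < 1 - 1e-9:               # spherical-like case (-1 < alpha < 1)
        c = torch.acos(alpha) / (1 - alpha ** 2).sqrt()
    else:                                # alpha = 1: null direction, log_x(y) = y - x
        c = torch.tensor(1.0)
    return c * (y - alpha * x)

def dissimilarity(x, y, q, beta):
    """Geodesic 'distance' of Eq. (8): sqrt(|<log_x(y), log_x(y)>_q|)."""
    xi = log_map(x, y, q, beta)
    return torch.abs(scalar_q(xi, xi, q)).sqrt()
```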
In this section, we present optimization frameworks to optimize any differentiable function defined on Q^{p,q}_β. Our goal is to compute descent directions on the ultrahyperbolic space. We consider two approaches. In the first approach, we map our representation from Euclidean space to ultrahyperbolic space; this is similar to the approach taken by [9] in hyperbolic space. In the second approach, we optimize using gradients defined directly in the pseudo-Riemannian tangent space, and we propose a novel descent direction which guarantees the minimization of some cost function.

Our first method maps Euclidean representations that lie in R^d to the pseudo-hyperboloid Q^{p,q}_β, and the chain rule is exploited to perform standard gradient descent. To this end, we construct a differentiable mapping ϕ : R^{q+1}_* × R^p → Q^{p,q}_β. The image of a point already on Q^{p,q}_β under the mapping ϕ is the point itself: ∀x ∈ Q^{p,q}_β, ϕ(x) = x. Let S^q := { x ∈ R^{q+1} : ‖x‖ = 1 } denote the unit hypersphere of R^{q+1}. We first introduce the following diffeomorphism (Theorem 4.1): for any β < 0, there is a diffeomorphism ψ : Q^{p,q}_β → S^q × R^p. Writing x = (t; s) ∈ Q^{p,q}_β with t ∈ R^{q+1}_* and s ∈ R^p, and z = (u; v) ∈ S^q × R^p with u ∈ S^q and v ∈ R^p, the mapping ψ and its inverse ψ^{−1} are formulated (see proofs in appendix):

ψ(x) = ( t/‖t‖ ; s/√|β| ),    ψ^{−1}(z) = ( √(|β|(1 + ‖v‖²)) u ; √|β| v ).

With these mappings, any vector x ∈ R^{q+1}_* × R^p can be mapped to Q^{p,q}_β via ϕ = ψ^{−1} ∘ ψ. ϕ is differentiable everywhere except when x_0 = · · · = x_q = 0, which should never occur in practice. It can therefore be optimized using standard gradient methods.
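A sketch of the mapping ϕ = ψ^{−1} ∘ ψ under the parametrization written above. Since the explicit expressions of ψ were reconstructed from the appendix proof, the exact formulas in this snippet should be treated as an assumption rather than as the authors' released implementation.

```python
import torch

def phi(z, q, beta):
    """Map z in R^{q+1}_* x R^p onto Q^{p,q}_beta via phi = psi^{-1} o psi:
    normalize the time part onto S^q, rescale the space part, then lift back
    so that <x, x>_q = beta."""
    b = abs(beta)
    t, s = z[..., :q + 1], z[..., q + 1:]
    u = t / t.norm(dim=-1, keepdim=True)                               # psi: time part on S^q
    v = s / b ** 0.5                                                   # psi: rescaled space part
    t_new = (b * (1.0 + (v ** 2).sum(-1, keepdim=True))).sqrt() * u    # psi^{-1}: time part
    return torch.cat([t_new, (b ** 0.5) * v], dim=-1)                  # psi^{-1}: space part = s

# phi sends any admissible vector onto the pseudo-hyperboloid:
x = phi(torch.randn(4), q=1, beta=-1.0)
print(scalar_q(x, x, 1))   # tensor(-1.) = beta
```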
We now introduce a novel method to optimize any differentiable function f : Q^{p,q}_β → R defined on the pseudo-hyperboloid. As we show below, the negative of the pseudo-Riemannian gradient is not always a descent direction. We propose a simple and efficient way to calculate a descent direction.

Pseudo-Riemannian gradient: Since x ∈ Q^{p,q}_β also lies in the ambient Euclidean space R^d, the function f has a well-defined Euclidean gradient ∇f(x) = (∂f(x)/∂x_0, ..., ∂f(x)/∂x_{q+p})^⊤. The pseudo-Riemannian gradient of f at x is the tangent vector Df(x) = Π_x(G∇f(x)) ∈ T_x Q^{p,q}_β, which satisfies df_x(ξ) = ⟨Df(x), ξ⟩_q for all ξ ∈ T_x Q^{p,q}_β. The Euclidean gradient forms the foundation of our descent-method optimizer, as will be shown in Eq. (13).

Iterative optimization: Our goal is to iteratively decrease the value of the function f by following some descent direction. Since Q^{p,q}_β is not a vector space, we cannot "follow the descent direction" by simply adding a scaled version of it to the current point, as the result would not necessarily lie on Q^{p,q}_β. Instead, to remain on the manifold, we use our exponential map defined in Eq. (6). This is the standard way to optimize on Riemannian manifolds [1]. Given a step size t > 0, one step of descent along a tangent vector ζ ∈ T_x Q^{p,q}_β is given by:

x_new = exp_x(tζ).   (13)

Descent direction: We now explain why the negative of the pseudo-Riemannian gradient is not always a descent direction. Our explanation extends Chapter 3 of [18], which gives the criteria for a vector ζ to be a descent direction when the domain of f is a Euclidean space. By the properties of the exponential map and geodesics described in Section 3, for all t ∈ R and all ξ ∈ T_x Q^{p,q}_β we have exp_x(tξ) = γ_{x→tξ}(1) = γ_{x→ξ}(t), so we can equivalently fix t to 1 and choose the scale of ξ appropriately. By exploiting Taylor's first-order approximation, there exists some small enough tangent vector ζ ≠ 0 (i.e. with exp_x(ζ) belonging to a convex neighborhood of x [4, 6]) that satisfies the following conditions: γ_{x→ζ}(0) = x ∈ Q^{p,q}_β, γ'_{x→ζ}(0) = ζ ∈ T_x Q^{p,q}_β, γ_{x→ζ}(1) = y ∈ Q^{p,q}_β, and the function f ∘ γ_{x→ζ} : R → R can be approximated at t = 1 by:

f(y) = f(γ_{x→ζ}(1)) ≈ f(γ_{x→ζ}(0)) + (f ∘ γ_{x→ζ})'(0) = f(x) + ⟨Df(x), ζ⟩_q,

where we use the property (f ∘ γ)'(0) = df_{γ(0)}(γ'(0)) = ⟨Df(γ(0)), γ'(0)⟩_q (see [19]); df is the differential of f and γ is a geodesic (we omit indices). To be a descent direction at x (i.e. so that f(y) < f(x)), the search direction ζ has to satisfy ⟨Df(x), ζ⟩_q < 0.

However, choosing ζ = −ηDf(x), where η > 0 is a step size, might increase the function since the scalar product ⟨·, ·⟩_q is not positive definite. A simple solution would be to choose ζ = ±ηDf(x) depending on the sign of ⟨Df(x), Df(x)⟩_q, but the issue that ⟨Df(x), Df(x)⟩_q = 0 can occur even if Df(x) ≠ 0 would remain: the optimization algorithm might then be stuck on an isocontour of f. Moreover, the sign of the search direction might be hard to choose in the stochastic gradient setting.

Proposed solution: To ensure that ζ ∈ T_x Q^{p,q}_β is a descent direction, we propose a simple expression that satisfies ⟨Df(x), ζ⟩_q < 0 if Df(x) ≠ 0 and ⟨Df(x), ζ⟩_q = 0 otherwise. We propose to formulate ζ = −ηΠ_x(GDf(x)) ∈ T_x Q^{p,q}_β. We then define the following tangent vector χ = −(1/η)ζ:

χ := Π_x(GDf(x)) = Π_x(GΠ_x(G∇f(x))).   (14)

Algorithm 1 Pseudo-Riemannian optimization on Q^{p,q}_β
1: Input: differentiable function f : Q^{p,q}_β → R to be minimized, some initialization of x ∈ Q^{p,q}_β
2: while not converged do
3:   Calculate ∇f(x), i.e. the Euclidean gradient of f at x in the Euclidean ambient space
4:   χ ← Π_x(GΠ_x(G∇f(x)))   (see Eq. (14))
5:   x ← exp_x(−ηχ), where η > 0 is a step size (e.g. determined with line search)
6: end while

The tangent vector ζ is a descent direction because ⟨Df(x), ζ⟩_q = −η⟨Df(x), χ⟩_q is non-positive:

⟨Df(x), χ⟩_q = ⟨Df(x), Π_x(GDf(x))⟩_q = ⟨Π_x(Df(x)), GDf(x)⟩_q = ⟨Df(x), GDf(x)⟩_q = ‖Df(x)‖² ≥ 0,

where we use the facts that Π_x is self-adjoint w.r.t. ⟨·, ·⟩_q and that Π_x(Df(x)) = Df(x), and ‖·‖ denotes the Euclidean norm. We also have ⟨Df(x), ζ⟩_q = 0 iff Df(x) = 0. Our proposed algorithm for the minimization problem min_{x ∈ Q^{p,q}_β} f(x) is illustrated in Algorithm 1. Following generic Riemannian optimization algorithms [1], at each iteration it first computes the descent direction −χ ∈ T_x Q^{p,q}_β, then decreases the function by applying the exponential map defined in Eq. (6). It is worth noting that our proposed descent method can be applied to any differentiable function f : Q^{p,q}_β → R, not only to those that exploit the distance introduced in Section 3.

Interestingly, our method can also be seen as a preconditioning technique [18] where the descent direction is obtained by preconditioning the pseudo-Riemannian gradient Df(x) with the matrix P_x = G − (1/⟨x, x⟩_q) x x^⊤ ∈ R^{d×d}. In other words, we have χ = P_x Df(x) = Π_x(GDf(x)). In the more general setting of pseudo-Riemannian manifolds, another preconditioning technique was proposed in [6]. The method there requires performing a Gram-Schmidt process at each iteration to obtain an (ordered [26]) orthonormal basis of the tangent space at x w.r.t. the induced norm of the manifold. However, the Gram-Schmidt process is unstable and has cubic algorithmic complexity in the dimensionality of the tangent space. Our method, on the other hand, is more stable and has algorithmic complexity linear in the dimensionality of the tangent space.
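As an illustration, the following sketch implements one iteration of Algorithm 1 with PyTorch autograd supplying the Euclidean gradient of line 3, reusing scalar_q, project_tangent and exp_map from the earlier snippets. The wrapper, the toy objective in the usage comment and the names x0, y0 are hypothetical.

```python
import torch

def descent_step(x, f, q, beta, eta=1e-6):
    """One iteration of Algorithm 1: Euclidean gradient, descent direction
    chi = Pi_x(G Pi_x(G grad f(x))) of Eq. (14), then an exponential-map step."""
    x = x.detach().clone().requires_grad_(True)
    f(x).backward()                                               # line 3: Euclidean gradient
    d = x.numel()
    G = torch.cat([-torch.ones(q + 1), torch.ones(d - q - 1)])    # diagonal of G = I_{q+1,p}
    with torch.no_grad():
        Df = project_tangent(x, G * x.grad, q, beta)              # pseudo-Riemannian gradient
        chi = project_tangent(x, G * Df, q, beta)                 # line 4: Eq. (14)
        return exp_map(x, -eta * chi, q, beta)                    # line 5

# Example (hypothetical objective): minimize f(x) = <x, y0>_q starting from x0 on Q^{2,1}_{-1}:
# x1 = descent_step(x0, lambda x: scalar_q(x, y0, q=1), q=1, beta=-1.0)
```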
We now experimentally validate our proposed optimization methods and the effectiveness of our dissimilarity function. Our main experimental results can be summarized as follows:

• Both optimizers introduced in Section 4 decrease some objective function f : Q^{p,q}_β → R at each iteration of our iterative descent method (until reaching a plateau). While both optimizers manage to learn high-dimensional representations that satisfy the problem-dependent training constraints, only the pseudo-Riemannian optimizer satisfies all the constraints in lower-dimensional spaces. This is because it exploits the underlying metric of the manifold.

• Hyperbolic representations are popular in machine learning as they are well suited to represent hierarchical trees [8, 16, 17]. On the other hand, hierarchical datasets whose graph contains cycles cannot be represented using trees. Therefore, we propose to represent such graphs using our ultrahyperbolic representations. Important examples are community graphs, such as Zachary's karate club [28], that contain leaders. Because our ultrahyperbolic representations are more flexible than hyperbolic representations and contain hyperbolic subparts, we believe that our representations are better suited for these non tree-like hierarchical structures.

Graph: Our ultrahyperbolic representations describe graph-structured datasets. Each dataset is an undirected weighted graph G = (V, E) with node set V = {v_i}_{i=1}^n and edge set E = {e_k}_{k=1}^m. Each edge e_k is weighted by an arbitrary capacity c_k ∈ R_+ that models the strength of the relationship between nodes: the higher the capacity c_k, the stronger the relationship between the nodes connected by e_k.

Learned representations: Our problem formulation is inspired by hyperbolic representation learning approaches [16, 17] in which the nodes of a tree (i.e. a graph without cycles) are represented in hyperbolic space. The hierarchical structure of the tree is then reflected by the order of distances between its nodes. More precisely, a node representation is learned so that each node is closer to its descendants and ancestors in the tree (w.r.t. the hyperbolic distance) than to any other node. For example, in a hierarchy of words, ancestors and descendants are hypernyms and hyponyms, respectively. Our goal is to learn a set of n points x_1, ..., x_n ∈ Q^{p,q}_β (embeddings) from a given graph G used as supervision. The presence of cycles in the graph makes it difficult to determine ancestors and descendants. For this reason, we introduce, for each pair of nodes e_k = (v_i, v_j) ∈ E, the set W(e_k) of "weaker" pairs that have lower capacity than e_k (pairs of nodes that are not connected are considered to have zero capacity). Our goal is to learn representations such that pairs (v_i, v_j) with higher capacity have their representations (x_i, x_j) closer to each other than weaker pairs. Following [16], we formulate our problem as:

min_{x_1, ..., x_n ∈ Q^{p,q}_β}  Σ_{e_k = (v_i, v_j) ∈ E}  − log ( exp(−d(x_i, x_j)/τ) / Σ_{(v_a, v_b) ∈ W(e_k) ∪ {e_k}} exp(−d(x_a, x_b)/τ) ),   (17)

where d is the chosen dissimilarity function (e.g. D_γ(·, ·) defined in Eq. (9)) and τ > 0 is a fixed temperature parameter. The formulation of Eq. (17) is classic in the metric learning literature [3, 10, 25] and corresponds to optimizing some order on the learned distances via a softmax function.
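A minimal PyTorch sketch of the objective of Eq. (17) as reconstructed above, reusing dissimilarity from the earlier snippets. The data structures (edges, weaker) and the explicit loop are our own simplification, not the paper's implementation.

```python
import torch

def ranking_loss(X, edges, weaker, q, beta, tau=1e-2):
    """Softmax ranking objective of Eq. (17). X is the n x d matrix of embeddings,
    edges a list of node pairs (i, j), and weaker[k] the pairs in W(e_k)."""
    total = torch.zeros(())
    for k, (i, j) in enumerate(edges):
        pos = dissimilarity(X[i], X[j], q, beta)
        negs = torch.stack([dissimilarity(X[a], X[b], q, beta) for a, b in weaker[k]])
        logits = torch.cat([(-pos / tau).reshape(1), -negs / tau])
        total = total - torch.log_softmax(logits, dim=0)[0]   # -log softmax of the true pair
    return total
```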
Implementation details: We coded our approach in PyTorch [20], which automatically calculates the Euclidean gradient ∇f(x_i). Initially, a random set of vectors {z_i}_{i=1}^n is generated close to the positive pole (√|β|, 0, ..., 0)^⊤ ∈ Q^{p,q}_β, with every coordinate perturbed uniformly by a random value in the interval [−ε, ε], where ε > 0 is chosen small enough so that ‖z_i‖²_q < 0. We set β = −1, ε = 0.1 and τ = 10^{−2}. Initial embeddings are then obtained by mapping these vectors onto the manifold: ∀i, x_i = ϕ(z_i) ∈ Q^{p,q}_β.

Zachary's karate club dataset [28] is a social network graph of a karate club comprised of n = 34 nodes, each representing a member of the club. The club split due to a conflict between the instructor "Mr. Hi" (node v_1) and the administrator "John A" (node v_n); the remaining members then had to decide whether or not to join the new club created by v_1. In [28], Zachary defines a matrix C ∈ {0, 1, ..., 7}^{n×n} of relative strengths of the friendships in the karate club, dependent on various criteria. We note that the matrix is not symmetric and has 7 pairs (v_i, v_j) for which C_ij ≠ C_ji. Since our dissimilarity function is symmetric, we consider the symmetric matrix S = C + C^⊤ instead.

The value of S_ij is the capacity/weight assigned to the edge joining v_i and v_j, and there is no edge between v_i and v_j if S_ij = 0. Fig. 3 (left) illustrates the 34 nodes of the dataset; an edge joining the nodes v_i and v_j is drawn iff S_ij ≠ 0. The level of a node in the hierarchy corresponds approximately to its height in the figure.

Optimizers: We validate that our optimizers introduced in Section 4 decrease the cost function. First, consider the simple unweighted case where every edge weight is 1. For each edge e_k ∈ E, W(e_k) is then the set of pairs of nodes that are not connected. In other words, Eq. (17) learns node representations with the property that every connected pair of nodes is at a smaller distance than any non-connected pair. We use this condition as a stopping criterion of our algorithm. In each test, we vary the number of time dimensions q + 1 while the ambient space is of fixed dimensionality d = 10. We omit the case q = 0 since it corresponds to the (hyperbolic) Riemannian case already considered in [9, 17]. Both optimizers decrease the function and manage to satisfy all the expected distance relations. We note that when we use −Df(x) instead of −χ as a search direction, the algorithm does not converge. Moreover, our pseudo-Riemannian optimizer manages to learn representations that satisfy all the constraints on low-dimensional manifolds such as Q^{4,1}_{−1} and Q^{4,2}_{−1}, while the optimizer introduced in Section 4.1 does not. Consequently, we only use the pseudo-Riemannian optimizer in the following results.

Hierarchy extraction: To quantitatively evaluate our approach, we apply it to the problem of predicting the high-level nodes in the hierarchy, using the weighted matrix S as supervision. We consider the challenging low-dimensional setting where all the learned representations lie on a 4-dimensional manifold (i.e. p + q + 1 = 5). Hyperbolic distances are known to grow exponentially as we get further from the origin. Therefore, the sum of distances δ_i = Σ_{j=1}^n d(x_i, x_j) of a node v_i to all other nodes is a good indication of importance: intuitively, high-level nodes will be closer to most nodes than low-level nodes. We then sort the scores δ_1, ..., δ_n and report the ranks of the two leaders v_1 and v_n (in no particular order) in the first two rows of Table 1, averaged over 5 different initializations/runs. Leaders tend to have a smaller δ_i score with ultrahyperbolic distances than with Euclidean or hyperbolic distances. Instead of using δ_i for hyperbolic representations, the importance of a node v_i can be evaluated by using the Euclidean norm of its embedding x_i as a proxy, since high-level nodes of a tree represented in hyperbolic space are usually closer to the origin than low-level nodes [9, 16, 17]. Not surprisingly, this proxy leads to worse performance (8.6 ± 2.3 and 18.6 ± 4.9), as the relationships are not those of a tree. Since hierarchy levels are hard to compare for low-level nodes, we select the 10 (or 5) most influential members based on the score s_i = Σ_{j=1}^n S_ij. The corresponding nodes are 34, 1, 33, 3, 2, 32, 24, 4, 9, 14 (in that order). Spearman's rank correlation coefficient [22] between the selected scores s_i and the corresponding δ_i is reported in Table 1 and shows the relevance of our representations. Due to lack of space, we report in the supp. material similar experiments on a larger hierarchical dataset [7] that describes co-authorship of papers published at NIPS from 1988 to 2003.
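A sketch of the importance score δ_i and the Spearman evaluation described above, reusing dissimilarity from the earlier snippets; scipy's spearmanr computes the rank correlation, and the variable names in the usage comment (s, selected) are illustrative.

```python
import torch
from scipy.stats import spearmanr

def importance_scores(X, q, beta):
    """delta_i = sum_{j != i} d(x_i, x_j); smaller values indicate higher-level nodes."""
    n = X.shape[0]
    return torch.stack([torch.stack([dissimilarity(X[i], X[j], q, beta)
                                     for j in range(n) if j != i]).sum()
                        for i in range(n)])

# s: capacity-based scores s_i = sum_j S_ij restricted to the selected influential members,
# selected: their indices; Spearman's rho as reported in Table 1:
# rho, _ = spearmanr(s, importance_scores(X, q, beta).numpy()[selected])
```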
We have introduced ultrahyperbolic representations. Our representations lie on a pseudo-Riemannian manifold of constant nonzero curvature which generalizes hyperbolic and spherical geometries and includes them as submanifolds. Any relationship described in those geometries can therefore also be described with our more flexible representations. We have introduced new optimization tools and experimentally shown that our representations can extract hierarchies in graphs that contain cycles.

We introduce a novel way of representing relationships between data points by considering the geometry of non-Riemannian manifolds of constant nonzero curvature. The relationships between data points are described by a dissimilarity function that we introduce and that exploits the structure of the manifold. It is more flexible than the distance metrics of the hyperbolic and spherical geometries often used in machine learning and computer vision. Nonetheless, since the problems involving our representations are not straightforward to optimize, we propose novel optimization algorithms that can potentially benefit the machine learning, computer vision and natural language processing communities. Indeed, our method is application agnostic and could extend existing frameworks. Our contribution is mainly theoretical, but we have included one practical application. Similarly to hyperbolic representations, which are popular for representing tree-like data, we have shown that our representations are well adapted to the more general case of hierarchical graphs with cycles. These graphs appear in many different fields of research such as medicine, molecular biology and the social sciences. For example, an ultrahyperbolic representation of proteins might assist in understanding their complicated folding mechanisms. Moreover, these representations could assist in analyzing features of social media, such as discovering new trends and leading "connectors". The impact of community detection for commercial or political advertising is already known in social networking services. We foresee that our method will have many more graph-based practical applications. We know of very few applications outside of general relativity that use pseudo-Riemannian geometry. We hope that our research will stimulate other applications in machine learning and related fields.

The supplementary material is structured as follows:
• In Section B.1, we study the formulation of the geodesics in Eq. (5); they combine the hyperbolic, spherical and flat cases.
• In Section B.2, we show why log_x(y) (see Eq. (7)) is not defined if ⟨x, y⟩_q ≥ |β|.
• In Section B.3, we explain the anti-isometry between Q^{p,q}_β and Q^{q+1,p−1}_{−β}.
• In Section B.4, we study the curvature of Q^{p,q}_β.
• In Section B.5, we give the proof of Theorem 4.1.
• In Section B.6, we show that the hyperboloid is a Riemannian manifold (i.e. its metric tensor is positive definite) even though its metric matrix G = I_{1,p} is not positive definite.
• In Section C, we report additional experiments.

B Some properties about geodesics and logarithm map

B.1 Geodesics of Q^{p,q}_β

Tangent space of pseudo-Riemannian submanifolds: In the paper, we exploit the fact that our considered manifold M (here Q^{p,q}_β) is a pseudo-Riemannian submanifold of M̄ (here R^{p,q+1}). For any vector space M̄, we have a natural isomorphism between M̄ and its tangent space.
If M is a pseudo-Riemannian submanifold of M̄, we have the following direct sum decomposition:

∀x ∈ M:  T_x(M̄) = T_x(M) ⊕ T_x(M)^⊥,

where T_x(M)^⊥ = { ζ ∈ T_x(M̄) : ∀ξ ∈ T_x(M), g_x(ζ, ξ) = 0 } is the orthogonal complement of T_x(M), called the normal space of M at x. It is a nondegenerate subspace of T_x(M̄), and g_x is the metric at x ∈ M̄. In the case of Q^{p,q}_β, the normal space is the span of x itself: T_x(Q^{p,q}_β)^⊥ = { λx : λ ∈ R } ≈ R, where ≈ denotes isomorphism.

Geodesic of a submanifold: As mentioned in the main paper, a curve γ is a geodesic if its acceleration is zero. However, the acceleration depends on the choice of affine connection while the velocity does not (i.e. the velocity does not depend on the Christoffel symbols whereas the acceleration does, and different connections produce different geodesics). Let us denote by D̄/dt (γ'(t)) (resp. D/dt (γ'(t))) the covariant derivative of γ'(t) along γ(t) in R^{p,q+1} (resp. in Q^{p,q}_β). By using the induced connection (see page 98 of [19]) and the fact that Q^{p,q}_β is isometrically embedded in R^{p,q+1}, the second-order ODE expressing the zero acceleration of the geodesic is equivalent to (see page 103 of [19]):

∀t:  D/dt (γ'(t)) = Π_{γ(t)}(γ''(t)) = 0,

where γ''(t) is the ordinary (ambient) acceleration and Π_{γ(t)}, defined in Eq. (3), orthogonally projects onto T_{γ(t)} Q^{p,q}_β. In other words, γ is "straight" in M, and its curving in M̄ is the one forced by the curving of M itself in M̄. The initial conditions γ_{x→ξ}(0) = x ∈ Q^{p,q}_β and γ'_{x→ξ}(0) = ξ ∈ T_x Q^{p,q}_β are straightforward. With the formulation of γ_{x→ξ} in Eq. (5), one can first verify that, for all t, γ_{x→ξ}(t) lies on Q^{p,q}_β. Indeed, since ⟨x, ξ⟩_q = 0 (so the cross terms vanish), we have in the case ⟨ξ, ξ⟩_q > 0:

⟨γ_{x→ξ}(t), γ_{x→ξ}(t)⟩_q = cosh²(t‖ξ‖_q/√|β|) ⟨x, x⟩_q + (|β|/‖ξ‖²_q) sinh²(t‖ξ‖_q/√|β|) ⟨ξ, ξ⟩_q = β (cosh²(·) − sinh²(·)) = β,

and similarly for the other two cases. The acceleration of γ at t (in the ambient space R^{p,q+1}) lies in N_{γ(t)}(Q^{p,q}_β, R^{p,q+1}) since:

γ''_{x→ξ}(t) = (⟨ξ, ξ⟩_q / |β|) γ_{x→ξ}(t),

which is proportional to γ_{x→ξ}(t) and hence lies in the normal space. We have therefore found a solution of our second-order ODE, which is unique by definition. From its formulation in Eq. (5), the nonconstant geodesic γ_{x→ξ} is similar to the hyperbolic, flat or spherical case if ⟨ξ, ξ⟩_q is positive, zero or negative, respectively.

We show here why log_x(y) (see Eq. (7)) is not defined if ⟨x, y⟩_q ≥ |β|. The proof relies on the fact that the exponential map is complete, that is, the domain of the exponential map is the whole tangent space. We also use the fact that ∀ξ ∈ T_x Q^{p,q}_β, ⟨x, ξ⟩_q = 0 by definition of tangent vectors. Assume that there exists a tangent vector ξ ∈ T_x Q^{p,q}_β such that y = exp_x(ξ) and ⟨y, x⟩_q ≥ |β|. We consider the three possible signs of ⟨ξ, ξ⟩_q.

• Let us first assume that ⟨ξ, ξ⟩_q > 0. For all ξ ∈ T_x Q^{p,q}_β that satisfy ⟨ξ, ξ⟩_q > 0, we have by definition of the corresponding exponential map:

⟨x, exp_x(ξ)⟩_q = cosh(‖ξ‖_q/√|β|) ⟨x, x⟩_q = β cosh(‖ξ‖_q/√|β|) ≤ β < 0 < |β|.

Therefore, there exists no tangent vector ξ ∈ T_x Q^{p,q}_β that satisfies ⟨ξ, ξ⟩_q > 0 and ⟨x, exp_x(ξ)⟩_q ≥ |β| > 0.

• Let us now assume that ⟨ξ, ξ⟩_q = 0. For all ξ ∈ T_x Q^{p,q}_β that satisfy ⟨ξ, ξ⟩_q = 0, we have:

⟨x, exp_x(ξ)⟩_q = ⟨x, x + ξ⟩_q = ⟨x, x⟩_q = β < 0 < |β|.

• Let us now assume that ⟨ξ, ξ⟩_q < 0. For all ξ ∈ T_x Q^{p,q}_β that satisfy ⟨ξ, ξ⟩_q < 0, we have:

⟨x, exp_x(ξ)⟩_q = cos(‖ξ‖_q/√|β|) ⟨x, x⟩_q = β cos(‖ξ‖_q/√|β|) ∈ [β, |β|].

Given the formulation of the geodesic in Eq. (5), ⟨x, y⟩_q = ⟨x, exp_x(ξ)⟩_q ≥ |β| iff y = −x. From our study above, there exists no geodesic (hence no logarithm map) joining x and y if ⟨x, y⟩_q ≥ |β|, except if y = −x. The antipode y = −x is also a special case for which there does not exist a logarithm map, since x and −x are joined by infinitely many minimizing geodesics of equal length (a similar situation arises on the hypersphere). Nonetheless, the geodesic "distance" between x and −x can be defined as π√|β|.

B.3 Anti-isometry between Q^{p,q}_β and Q^{q+1,p−1}_{−β}

In the main paper, we state that there is an anti-isometry between Q^{p,q}_β and Q^{q+1,p−1}_{−β}.
Let us denote by σ : Q^{p,q}_β → Q^{q+1,p−1}_{−β} the mapping that swaps the time and space components:

σ(x) = σ((t; s)) := (s; t),   (30)

where x = (t; s) with t ∈ R^{q+1} (time dimensions) and s ∈ R^p (space dimensions). We have: ∀x, y ∈ Q^{p,q}_β, ⟨x, y⟩_q = −⟨σ(x), σ(y)⟩_{p−1}.

The manifold Q^{p,q}_β is a totally umbilic hypersurface of R^{p,q+1} and has constant sectional curvature 1/β and constant mean curvature κ = |β|^{−1/2} with respect to the unit normal vector field N(x) = −κx. More details can be found in Chapter 3 of [2].

We now give the proof of Theorem 4.1, which we recall below.

Theorem B.1 (Diffeomorphisms). For any β < 0, there is a diffeomorphism ψ : Q^{p,q}_β → S^q × R^p.

Let us write x = (t; s) ∈ Q^{p,q}_β with t ∈ R^{q+1}_* and s ∈ R^p, and z = (u; v) ∈ S^q × R^p with u ∈ S^q and v ∈ R^p. The mapping ψ and its inverse ψ^{−1} are formulated:

ψ(x) = ( t/‖t‖ ; s/√|β| ),    ψ^{−1}(z) = ( √(|β|(1 + ‖v‖²)) u ; √|β| v ).

We need to show that ψ(ψ^{−1}(z)) = z and ψ^{−1}(ψ(x)) = x.

Space dimensions: The mapping of the space dimensions of x to the space dimensions of z simply involves a scaling factor of √|β| (i.e. v = (1/√|β|) s and s = √|β| v). We now show the invertibility of the mappings taking the time dimensions as input.

• Let us first show ψ^{−1}(ψ(x)) = x. We recall that x = (t; s) ∈ Q^{p,q}_β with t ∈ R^{q+1}_* and s ∈ R^p. We need to show that:

√(|β|(1 + ‖s/√|β|‖²)) t/‖t‖ = t.   (33)

To show Eq. (33), it is sufficient to prove that ‖t‖ = √(|β| + ‖s‖²). Since x ∈ Q^{p,q}_β, we have by definition ‖x‖²_q = ‖s‖² − ‖t‖² = β < 0. Therefore, we have ‖t‖² = ‖s‖² − β = ‖s‖² + |β|, which proves Eq. (33).

• Let us now show ψ(ψ^{−1}(z)) = z. We recall that z = (u; v) ∈ S^q × R^p where u ∈ S^q and v ∈ R^p. We need to show:

√(|β|(1 + ‖v‖²)) u / ‖ √(|β|(1 + ‖v‖²)) u ‖ = u.   (36)

By definition of S^q, ‖u‖ = 1, so u satisfies Eq. (36).

We show here that the hyperboloid is a Riemannian manifold even though its metric matrix G = I_{1,p} is not positive definite. This result is already known in the literature and a proof can be found in [21]; we still give it because we think it is important. We recall that a Riemannian manifold is a pseudo-Riemannian manifold with positive definite metric tensor. In other words, a pseudo-Riemannian manifold M is Riemannian iff, for all x ∈ M, every non-vanishing tangent vector ξ ∈ T_x M satisfies g_x(ξ, ξ) > 0. The usual definition of the hyperboloid H^p_β in the literature is the following:

H^p_β := { x = (x_0, x_1, ..., x_p)^⊤ ∈ R^{p,1} : ⟨x, x⟩_0 = β, x_0 > 0 },   with β < 0.

Let us write x = (x_0, x_1, ..., x_p)^⊤ ∈ H^p_β with β < 0 and ξ = (ξ_0, ..., ξ_p)^⊤ a tangent vector at x. For simplicity, we denote u = (x_1, ..., x_p)^⊤ ∈ R^p and v = (ξ_1, ..., ξ_p)^⊤ ∈ R^p, so that x = (x_0; u) and ξ = (ξ_0; v). By definition of H^p_β, we have x_0 = √(−β + ‖u‖²). Moreover, by definition of the tangent space, ξ satisfies 0 = ⟨x, ξ⟩_0 = −x_0 ξ_0 + ⟨u, v⟩, which implies:

ξ_0 = ⟨u, v⟩ / x_0,   i.e.   ξ_0² = ⟨u, v⟩² / (−β + ‖u‖²).   (38)

It is obvious that if ξ = 0, then ⟨ξ, ξ⟩_0 = 0. To show that H^p_β is a Riemannian manifold, we need to show that every non-vanishing tangent vector ξ satisfies ⟨ξ, ξ⟩_0 > 0. By definition of ⟨·, ·⟩_0, we have ⟨ξ, ξ⟩_0 = −ξ_0² + ‖v‖². We then need to show that ξ_0² < ‖v‖². By the Cauchy-Schwarz inequality, we have ⟨u, v⟩² ≤ ‖v‖²‖u‖² ≤ ‖v‖²(−β + ‖u‖²). From Eq. (38), this implies:

ξ_0² = ⟨u, v⟩² / (−β + ‖u‖²) ≤ ‖v‖².   (40)

Equality in Eq. (40) can only hold if v = 0 (since ‖u‖² < −β + ‖u‖²), in which case ξ_0 = 0 and thus ξ = 0. Since every non-vanishing tangent vector ξ of H^p_β satisfies ⟨ξ, ξ⟩_0 > 0, the metric tensor of H^p_β is positive definite. This shows that H^p_β is Riemannian.

We complete here the experimental result section. First, we provide additional details about the Zachary's karate club experiments in Section C.1. In Section C.2, we provide similar experimental results on a larger social network dataset about co-authorship at NIPS conferences.

Complexity: We implemented our approach in PyTorch 1.0 [20], which automatically calculates the Euclidean gradient ∇f(x_i) for each node v_i ∈ V.
Once ∇f(x_i) is computed, the computation of the descent direction χ (see Eq. (14)) is linear in the dimensionality d and is very efficient. The exponential map is also efficient to calculate (i.e. linear algorithmic complexity).

Running time: The step size/learning rate for all our optimizers is fixed to η = 10^{−6} (with no momentum). Step size strategies could have been used to improve the convergence rate, but the purpose of our experiment was only to verify that our solvers decrease the optimized function at each iteration. In the weighted case, we stop after 10,000 iterations of the descent method. The training process takes 182 seconds (∼3 minutes) on an Intel i7-7700k CPU and 254 seconds (∼4 minutes) on an Intel i7-8650U CPU.

We also quantitatively evaluate our approach on a dataset that describes co-authorship information from papers published at NIPS from 1988 to 2003 [7].

Description of the dataset: The graph G = (V, E) is constructed by considering each author as a node, and an edge e_k = (v_i, v_j) is created iff authors i and j are co-authors of at least one paper. The capacity c_k is the number of papers co-authored by the pair of authors e_k = (v_i, v_j). The original number of authors in the dataset is 2865, but only n = |V| = 2715 authors have at least one co-author. The number of edges is m = |E| = 4733, which means that the number of pairs of nodes with no edge joining them is 3,679,522. The capacity c_k of each edge e_k is a natural number in {1, ..., 9}. The "leaders" of this dataset are not the authors with the highest number of papers, but those with the highest number of co-authors.

Implementation details and running times: We ran all our experiments for this dataset on a 12 GB NVIDIA TITAN V GPU. To fit into memory, when constructing each weaker set W(e_k), we take into account all the edges with capacity lower than c_k and randomly sample 42,000 pairs of nodes with no edge joining them. Our fixed hyperparameters are: β = −1, temperature τ = 10^{−5}, and step size η = 10^{−8}. We stop training of 4-dimensional manifolds (see Table 2) after 10,000 iterations, which takes 10 hours (∼1,000 iterations per hour). We stop training of 6-dimensional manifolds (see Table 3) after 25,000 iterations because they take longer to converge.

Quantitative evaluation: The evaluation process is the same as for Zachary's karate club dataset. However, unlike for Zachary's karate club, the factions and their leaders are unknown. We therefore only consider Spearman's rank correlation coefficient scores, as well as the average recall at 1, explained below. The capacity matrix C ∈ {0, 1, ..., 9}^{n×n} is symmetric, so our score matrix is S = C. We select the most influential members based on the score s_i = Σ_{j=1}^n S_ij ∈ {1, ..., 97}. Spearman's rank correlation coefficient [22] between the selected ordered scores s_i and the corresponding δ_i = Σ_{j=1}^n d(x_i, x_j) is reported in Table 2 and shows the relevance of our representations. The most influential members are selected if their score s_i is at least 1 (i.e. the top n members, where n = 2715 is the number of nodes), at least 10 (i.e. the top 232 authors), or at least 20 (i.e. the top 57 authors). We also report the average recall at 1 (in percentage): for each node v_i, we find its nearest neighbor v_j w.r.t. the chosen metric; the recall at 1 is 0 if C_ij = 0, and 1 otherwise. Both Q^{3,1}_{−1} and Q^{2,2}_{−1} outperform the hyperbolic model.
If we use the Euclidean norm of hyperbolic representations (in Q^{4,0}_{−1}) as a proxy, we obtain the following scores: 0.044, -0.113 and 0.264 for the top 2715, top 232 and top 57 members, respectively. These scores are worse than using δ_i as a proxy, due to the presence of cycles in the graph. In conclusion, our proposed ultrahyperbolic representations extract hierarchies better than hyperbolic representations when the hierarchy graph contains cycles.

References
[1] Optimization algorithms on matrix manifolds.
[2] Minimal submanifolds in pseudo-Riemannian geometry.
[3] A theoretical analysis of the number of shots in few-shot learning.
[4] Manfredo Perdigao do Carmo. Riemannian geometry. Birkhäuser.
[5] On the hyperbolicity of small-world and treelike random graphs.
[6] Semi-Riemannian manifold optimization.
[7] Euclidean embedding of co-occurrence data.
[8] Hyperbolic groups.
[9] Lorentzian distance learning for hyperbolic representations.
[10] Dimensionality reduction for representing the knowledge of probabilistic models.
[11] Closed-form training of Mahalanobis distance for supervised clustering.
[12] Efficient multiple instance metric learning using weakly supervised data.
[13] Introduction to topological manifolds.
[14] Introduction to smooth manifolds.
[15] Sur l'application de la méthode des approximations successives aux équations différentielles ordinaires du premier ordre. Comptes rendus hebdomadaires des séances de l'Académie des sciences.
[16] Poincaré embeddings for learning hierarchical representations.
[17] Learning continuous hierarchies in the Lorentz model of hyperbolic geometry.
[18] Numerical optimization.
[19] Semi-Riemannian geometry with applications to relativity.
[20] PyTorch: An imperative style, high-performance deep learning library.
[21] Riemannian geometry.
[22] The proof and measurement of association between two things. The American Journal of Psychology.
[23] Video face clustering with unknown number of clusters.
[24] NormFace: L2 hypersphere embedding for face verification.
[25] Centroid-based deep metric learning for speaker recognition.
[26] Spaces of constant curvature.
[27] Distance metric learning with application to clustering with side-information.
[28] An information flow model for conflict and fission in small groups.

We thank Jonah Philion and Guojun Zhang for helpful feedback on early versions of this manuscript.