title: Matrix factorisation and the interpretation of geodesic distance
authors: Whiteley, Nick; Gray, Annie; Rubin-Delanchy, Patrick
date: 2021-06-02

Given a graph or similarity matrix, we consider the problem of recovering a notion of true distance between the nodes, and so their true positions. We show that this can be accomplished in two steps: matrix factorisation, followed by nonlinear dimension reduction. This combination is effective because the point cloud obtained in the first step lives close to a manifold in which latent distance is encoded as geodesic distance. Hence, a nonlinear dimension reduction tool, approximating geodesic distance, can recover the latent positions, up to a simple transformation. We give a detailed account of the case where spectral embedding is used, followed by Isomap, and provide encouraging experimental evidence for other combinations of techniques.

Assume we observe, or are given as the result of a computational procedure, data in the form of a symmetric matrix A ∈ R^{n×n}, which we relate to unobserved vectors Z_1, ..., Z_n ∈ Z ⊂ R^d, where d ≪ n, by

    A_ij = f(Z_i, Z_j) + E_ij,    (1)

for some real-valued symmetric function f, which will be called a kernel, and a matrix of unobserved perturbations E ∈ R^{n×n}. As illustrative examples, A could be the observed adjacency matrix of a graph with n vertices, a similarity matrix associated with n items, or some matrix-valued estimator of covariance or correlation between n variates.

Under assumptions detailed later, the point cloud obtained by matrix factorisation of A concentrates about a low-dimensional manifold, on which geodesic distance reflects geodesic distance on Z, up to simple transformations of the coordinates, for example scaling. Thus a dimension reduction technique which approximates in-manifold geodesic distances can be expected to recover the Z_i's, up to a simple transformation. If those assumptions fail, as they must with real data, we may still find that those approximate geodesic distances are useful because they reflect a geometry implicit in the kernel.

Many existing works address recovery of the Z_i's when one does consider a particular matrix type, an explicit statistical model, and/or a specific form for f. When A is an adjacency matrix and f(Z_i, Z_j) is the probability of an edge between nodes i and j, the construct (1) is a latent position model [22], and various scenarios have been studied: Z a sphere and f(x, y) a function of ⟨x, y⟩ [10]; estimation by graph distance when f(x, y) is a function of ‖x − y‖ [11]; spectral estimation when f(x, y) is a possibly indefinite inner product [53, 55, 12, 46]; inference under a logistic model via MCMC [22] or variational approximation [48]. The case where Z is a discrete set of points corresponds to the very widely studied stochastic block model [23]; see [5, 33, 44, 53, 8, 14] and references therein. By contrast, the methods we propose are directed at the case where Z is continuous. The converse problem of estimating f, with the Z_i unknown but assumed uniform on [0, 1], is known as graphon estimation [19].

Concerning the case when A is a covariance matrix, Latent Variable Gaussian Process models [31, 30, 59] use f(Z_i, Z_j) to define the population covariance between the ith and jth variates under a hierarchical model. When f(x, y) = ⟨x, y⟩, this reduces to a latent variable optimisation counterpart of Probabilistic PCA [58], and the maximum likelihood estimate of the Z_i's is obtained from the eigendecomposition of the empirical covariance matrix, which in our setting could be A. When f is nonlinear but fixed, e.g. to a radial basis function kernel, gradient-based and variational techniques are available [31, 30, 59]; see the same references for onward connections to kernel PCA [50].
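To make the setup concrete, here is a minimal simulation of one instance of model (1), with A a graph adjacency matrix. The kernel, ranges and variable names below are our own illustrative choices, not anything prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 2
Z = rng.uniform(-1.0, 1.0, size=(n, d))          # latent positions Z_1, ..., Z_n

# A positive-definite kernel taking values in (0, 1), so f(Z_i, Z_j) can serve as an edge probability
sq_dists = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
F = 0.8 * np.exp(-sq_dists)                      # F_ij = f(Z_i, Z_j)

# One draw of A: independent Bernoulli edges above the diagonal, then symmetrised.
# The perturbation in model (1) is then E_ij = A_ij - f(Z_i, Z_j), which is centered.
A = rng.binomial(1, F)
A = np.triu(A, 1)
A = (A + A.T).astype(float)
```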
Our contribution and precursors. The general practice of using dimension reduction and geodesic distance to extract latent features from data is common in data science and machine learning. Examples arise in genomics [38], neuroscience [63], speech analysis [28, 21] and cyber-security [9]. What sets our work apart from these contributions is that we establish a new rigorous basis for this practice. It has been understood for several years that the high-dimensional embedding obtained by matrix factorisation can be related via a feature map to the Z_i of model (1) [26, 15, 34, 56, 62, 32]. However, it was only recently observed that the embedding must therefore concentrate about a low-dimensional set, in the Hausdorff sense [45]. Our key mathematical contribution is to describe the topology and geometry of this set, proving it is a topological manifold and establishing how in-manifold geodesic distance is related to geodesic distance in Z. Riemannian geometry underlying kernels was sketched in [7], but without consideration of geodesic distances or rigorous proofs, and not in the context of latent position estimation. The works [13, 61] were an inspiration to us, suggesting Isomap as a tool for analysing spectral embeddings, under a latent structure model in which Z is a one-dimensional curve in R^d and f is the inner product. A key feature of our problem setup is that f is unknown, in which case the manifold is not available in closed form and typically lives in an infinite-dimensional space. To our knowledge, we are the first to show why spectral embedding followed by Isomap might recover the true Z_i's, or a useful transformation thereof, in general.

We complement our mathematical results with experimental evidence, obtained from both simulated and real data, suggesting that alternative combinations of matrix-factorisation and dimension-reduction techniques work too. For the former, we consider the popular node2vec [20] algorithm, a likelihood-based approach for graphs said to perform matrix factorisation implicitly [43, 64]. For the latter, we consider the popular t-SNE [37] and UMAP [39] algorithms. As predicted by the theory, a direct low-dimensional matrix factorisation, whether using spectral embedding or node2vec, is less successful.

2 Proposed methods and their rationale

2.1 Spectral embedding, as estimating φ(Z_i)

In our mathematical setup (precise details come later), the kernel f will be nonnegative definite and φ will be the associated Mercer feature map. It is well known that each point φ(Z_i) then lives in an infinite-dimensional Hilbert space, which will be denoted ℓ2, and the inner product in ℓ2 of φ(Z_i) and φ(Z_j) is exactly f(Z_i, Z_j).

[Figure 1: Illustration of theory in the case d = 1. Our analysis reveals how geodesic distance, along M, between φ(Z_i) and φ(Z_j), is related to geodesic distance, along Z, between Z_i and Z_j.]

The spectral embedding procedure. For p ≤ n, we define the p-dimensional spectral embedding of A to be X̂ = [X̂_1, ..., X̂_n]^⊤ = Û|Ŝ|^{1/2} ∈ R^{n×p}, where |Ŝ| ∈ R^{p×p} is a diagonal matrix containing the absolute values of the p largest eigenvalues of A, by magnitude, and Û ∈ R^{n×p} is a matrix containing corresponding orthonormal eigenvectors, in the same order. The R packages irlba and RSpectra provide fast implementations which can exploit sparse inputs.
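The spectral embedding step can be sketched in a few lines; this is a plain illustration rather than the authors' code, and the function name is ours. It assumes A is held as a dense NumPy array or a SciPy sparse matrix.

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def spectral_embedding(A, p):
    """p-dimensional spectral embedding X_hat = U_hat |S_hat|^{1/2}.

    Selects the p eigenvalues of largest magnitude (A may be indefinite) and
    scales the corresponding orthonormal eigenvectors by |eigenvalue|^{1/2}.
    """
    vals, vecs = eigsh(A, k=p, which="LM")        # 'LM' = largest magnitude
    order = np.argsort(-np.abs(vals))             # sort by decreasing |eigenvalue|
    vals, vecs = vals[order], vecs[:, order]
    return vecs * np.sqrt(np.abs(vals))           # rows are X_hat_1, ..., X_hat_n

# Example usage, with A the adjacency matrix simulated above:
# X_hat = spectral_embedding(A, p=5)
```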
One should think of X̂_i as approximating the vector of the first p components of φ(Z_i), denoted φ_p(Z_i), up to orthogonal transformation, and this can be formalised to a greater or lesser extent depending on what assumptions are made. There are several situations, e.g. f any polynomial [45], the cosine kernel used in Section 4.1, or the degree-corrected [29] and mixed-membership [6] stochastic block models, in which only the first p_0 (say) components of φ(·) are nonzero, where typically p_0 ≥ d. If, once n ≥ p_0, we embed into p = p_0 dimensions, then, with ‖·‖ denoting the Euclidean norm, we have the uniform bound of [18]:

    max_{i=1,...,n} ‖QX̂_i − φ_{p_0}(Z_i)‖ ≤ c n^{−1/2} log^c n,    (2)

with high probability, for a universal constant c ≥ 1 and an orthogonal matrix Q, under regularity assumptions on the Z_i's, f and E (that the Z_i are i.i.d., f(Z_i, Z_j) has finite expectation, and the perturbations E_ij are independent and centered with exponential tails). This encompasses the case where A is binary, for example a graph adjacency matrix [36, 46]. Similar results are available in the cases where A is a Laplacian [46, 40] or covariance matrix [17]. The methods of this paper are based, in practice, on the distances ‖X̂_i − X̂_j‖, which are invariant to orthogonal transformation, so the unknown matrix Q is immaterial for the purposes of validating our approach. For X̂_i to converge to φ(Z_i) more generally, we must let its dimension p grow with n and, at least given the present state of the literature, accept weaker consistency results, for example convergence in Wasserstein distance between QX̂_1, ..., QX̂_n and φ_p(Z_1), ..., φ_p(Z_n) [32]. Uniform consistency results, in the style of (2), are also available for indefinite [46], bipartite and directed graphs [25]. These are left for future work because of the complications of Q no longer being orthogonal.

Rank selection. In real data, where n is typically fixed, there is no 'best' way of selecting p, as discussed for example in [42]. The method of [65], based on profile likelihood, provides a popular, practical choice, taking as input the spectrum of A, and is implemented in the R package igraph.

We propose to recover Z_1, ..., Z_n through the following procedure.

Algorithm 1 Isomap procedure
input: p-dimensional points X̂_1, ..., X̂_n; radius ε; target dimension d
1: Compute the neighbourhood graph of radius ε: a weighted graph connecting i and j, with weight ‖X̂_i − X̂_j‖, whenever ‖X̂_i − X̂_j‖ ≤ ε
2: Compute the matrix of shortest paths on the neighbourhood graph, D̂_M ∈ R^{n×n}
3: Apply classical multidimensional scaling (CMDS) to D̂_M into R^d
return: d-dimensional points Ẑ_1, ..., Ẑ_n
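A minimal sketch of Algorithm 1, assuming the ε-neighbourhood graph is connected (otherwise increase ε or restrict to the largest connected component, as recommended below); the helper name and the particular routines used (Dijkstra via SciPy, CMDS by double centering) are our own choices.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import radius_neighbors_graph

def isomap_epsilon(X, eps, d):
    """Algorithm 1: epsilon-neighbourhood graph -> shortest paths -> CMDS.

    X: (n, p) array of embedded points; eps: neighbourhood radius; d: target dimension.
    Assumes the neighbourhood graph is connected.
    """
    # Step 1: weighted graph joining i, j with weight ||X_i - X_j|| whenever it is <= eps
    G = radius_neighbors_graph(X, radius=eps, mode="distance")
    # Step 2: matrix of shortest-path (approximate geodesic) distances
    D = shortest_path(G, method="D", directed=False)
    # Step 3: classical multidimensional scaling of D into R^d
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J                   # double centering
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(-vals)[:d]
    return vecs[:, idx] * np.sqrt(np.clip(vals[idx], 0, None))
```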
Theorem 1 below establishes, under general assumptions, that M := φ(Z) is a topological manifold of Hausdorff dimension exactly that of Z (as opposed to an upper bound [45]). It also provides a 'change of metric', often trivial, under which a path on Z and its image on M have the same length. This result explains how Ẑ_1, ..., Ẑ_n can estimate Z_1, ..., Z_n. First, think of a path from X̂_i to X̂_j on the neighbourhood graph as a noisy, discrete version of some corresponding continuous path from φ(Z_i) to φ(Z_j) on M. The length of the first (the sum of the weights of its edges) is approximately equal to the length of the second if measured in the standard metric: an infinitesimal step from x to x + dx has length ⟨dx, dx⟩₂^{1/2}. By inversion of φ we can trace a third path, taking us from Z_i to Z_j on Z. But to make its length agree with the first two, we must pick a non-standard metric: an infinitesimal step from z to z + dz must be regarded as having length ⟨dz, H_z dz⟩^{1/2}, where H_z ∈ R^{d×d} is the matrix with entries H_z^{kl} := ∂²f(x, y)/∂x_k∂y_l evaluated at x = y = z. The exciting news for practical purposes is that, through elementary calculus, we might establish that H_z is constant in z (e.g. if f is translation-invariant) or even proportional to the identity (e.g. if f is just a function of Euclidean distance). In the latter case, if Z is convex, it is not too difficult to see that the length of the shortest path from φ(Z_i) to φ(Z_j) on M must be proportional to the Euclidean distance between Z_i and Z_j. Thus, the matrix of shortest paths obtained by Isomap (Step 2) approximates the matrix of Euclidean distances between the Z_i (up to scaling), from which the Z_i themselves can be recovered (Step 3), up to scaling, rotation and translation. Of course, we have taken a few liberties in this argument, such as assuming φ to be invertible and ⟨dz, H_z dz⟩ to be positive. We will address these rigorously in the next section.

Dimension and radius selection. For visualisation in our examples we will pick d = 1 or d = 2. For other applications, d can be estimated as the approximate rank of D̂_M after double centering [60], e.g. via [65], again. In practice, we suggest picking ε just large enough for the neighbourhood graph to be connected or, if the data have outliers, a fixed quantile (such as 5%) of the pairwise distances ‖X̂_i − X̂_j‖, removing nodes outside the largest connected component. The same recommendations apply if the k-nearest-neighbour graph is used instead of the ε-neighbourhood graph.

The usual inner product and Euclidean norm on R^d are denoted ⟨·, ·⟩ and ‖·‖; ℓ2 is the set of square-summable real sequences, equipped with the inner product ⟨a, b⟩₂ := Σ_{k≥1} a_k b_k and norm ‖·‖₂. With d ≥ 1, let Z be a compact subset of R^d, and let Z̄ be a closed ball in R^d, centered at the origin, such that Z ⊂ Z̄. Let f : Z̄ × Z̄ → R be a symmetric, continuous, nonnegative-definite function. By Mercer's Theorem, e.g. [51, Thm 4.49], there exist nonnegative real numbers (λ_k)_{k≥1} and functions (u_k)_{k≥1}, with each u_k : Z̄ → R, which are orthonormal with respect to the inner product (u_j, u_k) ↦ ∫_{Z̄} u_j(x)u_k(x)dx, and such that

    f(x, y) = Σ_{k≥1} λ_k u_k(x) u_k(y),    (3)

where the convergence is absolute and uniform. The feature map φ : Z̄ → ℓ2 is defined by φ(x) := (λ_k^{1/2} u_k(x))_{k≥1}, so that ⟨φ(x), φ(y)⟩₂ = f(x, y); in particular ‖φ(x)‖₂² = f(x, x), and the latter is finite for any x ∈ Z̄ since f is continuous and Z̄ is compact, hence M = φ(Z) ⊂ ℓ2.

The following definitions are standard in metric geometry [16]. For any a, b ∈ ℓ2, a path in M with end-points a, b is a continuous function γ : [0, 1] → M such that γ_0 = a and γ_1 = b. With n ≥ 1, a non-decreasing sequence t_0, ..., t_n such that t_0 = 0 and t_n = 1 is called a partition. Given a path γ and a partition P = (t_0, ..., t_n), define χ(γ, P) := Σ_{k=1}^n ‖γ_{t_k} − γ_{t_{k−1}}‖₂. The length of γ (w.r.t. ‖·‖₂) is l(γ) := sup_P χ(γ, P), where the supremum is over all possible partitions. The geodesic distance in M between a and b is defined to be the infimum of l(γ) over all paths γ in M with end-points a, b; D_M^{ij} denotes the geodesic distance in M between φ(Z_i) and φ(Z_j). Similarly, a path η in Z with end-points x, y is a continuous function η : [0, 1] → Z such that η_0 = x and η_1 = y, with χ(η, P) := Σ_{k=1}^n ‖η_{t_k} − η_{t_{k−1}}‖ and l(η) := sup_P χ(η, P); D_Z^{ij} denotes the corresponding geodesic distance in Z between Z_i and Z_j.

Assumption 1. For any distinct x, y ∈ Z there exists a ∈ Z̄ such that f(x, a) ≠ f(y, a); that is, the map x ↦ f(x, ·) is injective on Z.

Assumption 2. f is C² on Z̄ × Z̄ and, for every z ∈ Z, the matrix H_z is positive definite.

Assumption 1 is used to show that φ is injective. Assumption 2 has various implications; loosely speaking, it ensures a non-degenerate relationship between path-length in M and path-length in Z. Concerning our general setup, if f is not required to be nonnegative definite, but is still symmetric, then a representation formula like (3) is available under e.g. trace-class assumptions [45]. It is of interest to generalise our results to that scenario, but technical complications are involved, so we leave it for future research.
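Although φ is generally infinite-dimensional and unavailable in closed form, a finite-dimensional surrogate for φ_p on sampled points can be computed by factorising the noise-free kernel matrix. The sketch below does this for a one-dimensional cosine kernel of our own choosing (an illustration, not a kernel analysed in the paper); the rows of the output trace out the curve M.

```python
import numpy as np

def empirical_feature_map(K, p):
    """Approximate the first p Mercer coordinates phi_p(Z_i) from the noise-free
    kernel matrix K_ij = f(Z_i, Z_j): take the top p eigenpairs of K and scale the
    eigenvectors so that inner products of the rows reproduce K."""
    vals, vecs = np.linalg.eigh(K)
    idx = np.argsort(-vals)[:p]                       # largest p eigenvalues (f is PSD)
    return vecs[:, idx] * np.sqrt(np.clip(vals[idx], 0, None))

# Illustration: a rank-3 translation-invariant kernel on a 1-D latent space.
z = np.linspace(-np.pi + 0.25, np.pi - 0.25, 500)
K = (np.cos(z[:, None] - z[None, :]) + 3.0) / 4.0     # f(x, y) = {cos(x - y) + 3}/4, values in (0, 1)
Phi = empirical_feature_map(K, p=3)
# Rows of Phi equal phi_3(z_i) up to an orthogonal transformation; plotted in R^3 they
# trace out a one-dimensional curve, i.e. the manifold M = phi(Z).
```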
Theorem 1. Let Assumptions 1 and 2 hold. Then φ is a bi-Lipschitz homeomorphism between Z and M. Let a, b be any two points in M such that there exists a path in M with end-points a, b of finite length. Let γ be any such path and define η : [0, 1] → Z by η_t := φ^{−1}(γ_t). Then η is a path in Z with l(η) < ∞. For any ε > 0 there exists a partition P_ε such that, for any partition P = (t_0, ..., t_n) satisfying P_ε ⊆ P,

    |χ(γ, P) − Σ_{k=1}^n ⟨η_{t_k} − η_{t_{k−1}}, H_{η_{t_{k−1}}}(η_{t_k} − η_{t_{k−1}})⟩^{1/2}| ≤ ε.    (4)

If γ and η are continuously differentiable, the following two equalities hold:

    l(γ) = ∫_0^1 ‖γ̇_t‖₂ dt = ∫_0^1 ⟨η̇_t, H_{η_t} η̇_t⟩^{1/2} dt.    (5)

The proof is in the appendix. In the terminology of differential geometry, the collection of inner products (x, y) ↦ ⟨x, H_z y⟩, indexed by z ∈ Z, constitutes a Riemannian metric on Z, see e.g. [41, Ch. 1 and p. 121] for background, and the right-hand side of (5) is the length of the path η in Z with respect to this Riemannian metric. The theorem tells us that finding the geodesic distance in M, i.e. minimising l(γ) with respect to γ for fixed end-points, say a = φ(Z_i) and b = φ(Z_j), amounts to minimising the length of the corresponding path η in Z with respect to this Riemannian metric. In Section 3.3 we show how, for various classes of kernels, this minimisation can be performed in closed form, leading to explicit relationships between D_M^{ij} and D_Z^{ij}. In Section 3.3 we focus on (5) rather than (4) only for ease of exposition; it is shown in the appendix that exactly the same conclusions can be derived directly from (4). To avoid repetition, Assumption 1 will be taken to hold throughout Section 3.3 without further mention.

Translation-invariant kernels. Suppose Z ⊂ R^d is compact and convex, and f(x, y) = g(x − y) where g : R^d → R is C². In this situation H_z is constant in z, and equal to the Hessian of −g evaluated at the origin. The positive-definite part of Assumption 2 is therefore satisfied if and only if g has a non-degenerate local maximum at the origin. We now use Theorem 1 to find the geodesic distance in M between generic points a, b ∈ M for this class of translation-invariant kernels. To do this, we obtain a lower bound on l(γ) over all paths γ in M with end-points a, b, then show there exists such a path whose length achieves this lower bound. Let G = VΛ^{1/2}, where VΛV^⊤ is the eigendecomposition of the Hessian of −g evaluated at the origin, so H_z = GG^⊤ for all z. Then from (5), for a generic path γ in M with end-points a, b,

    l(γ) = ∫_0^1 ⟨η̇_t, GG^⊤η̇_t⟩^{1/2} dt = ∫_0^1 ‖G^⊤η̇_t‖ dt,    (6)

    ∫_0^1 ‖G^⊤η̇_t‖ dt ≥ ‖∫_0^1 G^⊤η̇_t dt‖ = ‖G^⊤{φ^{−1}(b) − φ^{−1}(a)}‖,    (7)

where the inequality is due to the triangle inequality for the ‖·‖ norm. The right-most term in (7) is independent of γ, other than through the end-points a, b. To see that there exists a path whose length equals this lower bound, take

    η̃_t := φ^{−1}(a) + t{φ^{−1}(b) − φ^{−1}(a)},    (8)

which is well-defined as a path in Z, since for this class of kernels Z is assumed convex. With γ̃_t := φ(η̃_t), γ̃ is clearly a path in M. Differentiating η̃_t w.r.t. t and substituting into (6) shows l(γ̃) = ‖G^⊤{φ^{−1}(b) − φ^{−1}(a)}‖, as required. The following proposition summarises our conclusions in the case that a = φ(Z_i) and b = φ(Z_j), for any Z_i, Z_j.

Proposition 1. If Z is compact and convex, and f(x, y) = g(x − y) where g : R^d → R is C² and has a non-degenerate local maximum at the origin, then D_M^{ij} = ‖G^⊤(Z_i − Z_j)‖, i.e. D_M^{ij} is equal to the Euclidean distance between Z_i and Z_j, up to their linear transformation by G^⊤. In the particular case where g depends on its argument only through its Euclidean norm, G is proportional to the identity, and D_M^{ij} is simply proportional to ‖Z_i − Z_j‖.
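As a quick numerical sanity check of Proposition 1 (a sketch with our own finite-difference helper, not part of the paper's methodology), G can be recovered from the Hessian of −g at the origin. For the cosine kernel used in Section 4.1 this gives GG^⊤ ≈ I/4, i.e. G can be taken to be I/2, matching the scaling D_M = D_Z/2 quoted there.

```python
import numpy as np

def G_matrix(g, d, h=1e-4):
    """For a translation-invariant kernel f(x, y) = g(x - y), return G with
    G G^T = Hessian of -g at the origin, via central finite differences."""
    H = np.zeros((d, d))
    I = np.eye(d)
    for i in range(d):
        for j in range(d):
            H[i, j] = -(g(h * I[i] + h * I[j]) - g(h * I[i] - h * I[j])
                        - g(-h * I[i] + h * I[j]) + g(-h * I[i] - h * I[j])) / (4 * h ** 2)
    vals, vecs = np.linalg.eigh(H)
    return vecs @ np.diag(np.sqrt(np.clip(vals, 0, None)))

# Kernel from the simulated example of Section 4.1: g(u) = {cos(u1) + cos(u2) + 2}/4
g = lambda u: (np.cos(u[0]) + np.cos(u[1]) + 2.0) / 4.0
G = G_matrix(g, d=2)
print(G @ G.T)   # approximately 0.25 * identity, so G can be taken as I/2 and D_M = D_Z / 2
```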
Inner-product kernels. Suppose Z = {x ∈ R^d : ‖x‖ = 1}, the radius-1 sphere centered at the origin, and f(x, y) = g(⟨x, y⟩), where g : R → R is C² and such that g′(1) > 0. In this case H_z = g′(1)I + g″(1)zz^⊤, and by a result of [27], any kernel of the form f(x, y) = g(⟨x, y⟩) on the sphere is positive-definite if and only if g(x) = Σ_{n≥0} a_n x^n for some nonnegative (a_n)_{n≥0}, implying g″(1) ≥ 0. Assumption 2 then holds. To derive the geodesic distance in M, first write out (5) for a generic path γ in M with end-points a, b:

    l(γ) = ∫_0^1 {g′(1)‖η̇_t‖² + g″(1)⟨η_t, η̇_t⟩²}^{1/2} dt.    (9)

Since Z is a radius-1 sphere centered at the origin, we must have ‖η_t‖ = 1 for all t, so ⟨η_t, η̇_t⟩ = 0. Therefore the r.h.s. of (9) is in fact equal to g′(1)^{1/2} ∫_0^1 ‖η̇_t‖ dt. This is minimised over all possible paths η in Z with end-points φ^{−1}(a), φ^{−1}(b) when η is a shortest (with respect to Euclidean distance) circular arc in Z, in which case ∫_0^1 ‖η̇_t‖ dt = arccos⟨φ^{−1}(a), φ^{−1}(b)⟩. Thus we have: in the case a = φ(Z_i), b = φ(Z_j), D_M^{ij} = g′(1)^{1/2} arccos⟨Z_i, Z_j⟩.

Additive kernels. Suppose Z is a product of compact intervals and f(x, y) = Σ_{i=1}^d f_i(x^{(i)}, y^{(i)}), where each f_i is symmetric, C² and nonnegative definite, and where z^{(i)} is the ith element of the vector z. In this case H_z is diagonal, with ith diagonal element ∂_x∂_y f_i(x, y)|_{x=y=z^{(i)}}. The positive-definite part of Assumption 2 thus holds if these diagonal elements are strictly positive for all z ∈ Z. Let us introduce the vector of monotonically increasing transformations ψ = (ψ_1, ..., ψ_d), where

    ψ_i(s) := ∫_0^s {∂_x∂_y f_i(x, y)|_{x=y=u}}^{1/2} du.    (10)

Then, for a path γ in M with end-points a, b and η := φ^{−1} ∘ γ, writing ζ_t := (ψ ∘ η)_t, where ∘ denotes elementwise composition, (5) gives

    l(γ) = ∫_0^1 ⟨η̇_t, H_{η_t}η̇_t⟩^{1/2} dt = ∫_0^1 ‖ζ̇_t‖ dt ≥ ‖ψ(φ^{−1}(b)) − ψ(φ^{−1}(a))‖,

where the second equality holds due to the definition of ζ_t and the last step is the triangle inequality. The quantity ‖ψ(φ^{−1}(b)) − ψ(φ^{−1}(a))‖ is thus a lower bound on path-length l(γ) for any path γ in M with end-points a, b. To see that there exists a path whose length achieves this lower bound, and hence that it is the geodesic distance in M between a, b, take η̃ such that ψ(η̃_t) traverses the straight line from ψ(φ^{−1}(a)) to ψ(φ^{−1}(b)), so that ‖(ψ ∘ η̃)′_t‖ is a positive constant over all t ∈ [0, 1]. In the case a = φ(Z_i), b = φ(Z_j), we conclude that D_M^{ij} = ‖ψ(Z_i) − ψ(Z_j)‖, i.e. D_M^{ij} is equal to the Euclidean distance between Z_i and Z_j up to their coordinate-wise monotone transformation by ψ.

This section shows the theory at work in a pedagogical example, where A is an adjacency matrix. Consider an undirected random graph following a latent position model with kernel f(x, y) = ρ_n{cos(x^{(1)} − y^{(1)}) + cos(x^{(2)} − y^{(2)}) + 2}/4, operating on R² × R². Here, the sequence ρ_n is a sparsity factor which will either be constant, ρ_n = 1, or shrinking to zero sufficiently slowly, reflecting dense (degree grows linearly) and sparse (degree grows sublinearly) regimes respectively. The kernel is clearly translation-invariant, satisfies Assumptions 1 and 2, and has finite rank p = 5. The true latent positions Z_i ∈ R² are deterministic and equally spaced points on a grid over the region Z = [−π + 0.25, π − 0.25] × [−π + 0.25, π − 0.25], this range chosen to give valid probabilities and an interesting bottleneck in the 2-dimensional manifold M. From Proposition 1, the geodesic distance between φ(Z_i) and φ(Z_j) on M is equal to the Euclidean distance between Z_i and Z_j, up to scaling: specifically, D_M^{ij} = ρ_n^{1/2} D_Z^{ij}/2.

Focusing first on the dense regime, for each n = 100, 400, 1600, 6400, we simulate a graph and its spectral embedding X̂_1, ..., X̂_n into p = 5 dimensions. The first three dimensions of this embedding are shown in Figure 2a), and we see that the points gather about a two-dimensional, curved manifold. To approximate geodesic distance, we compute a neighbourhood graph of radius ε from X̂_1, ..., X̂_n, choosing ε as small as possible subject to maintaining a connected graph. Figure 2b) shows that the approximated geodesic distances roughly agree with the true Euclidean distances in the latent space Z (up to the predicted scaling of 1/2), whereas there is significant distortion between ambient and latent distance (‖X̂_i − X̂_j‖ versus ‖Z_i − Z_j‖). Finally, we recover the estimated latent positions in R² using Isomap, which we align with the original by Procrustes (orthogonal transformation with scaling), in Figure 2c). As the theory predicts, the recovery error vanishes as n increases.

In the appendix, we perform the same experiment in a sparse regime, nρ_n = ω(log⁴ n), chosen to ensure the spectral embedding is still consistent; the recovery error still shrinks, but more slowly. We also implement other related approaches: UMAP, t-SNE applied to the spectral embedding, node2vec directly into two dimensions, and node2vec in five dimensions followed by Isomap (Figure 6). We use default configurations for all other hyperparameters. Together, the results support the central recommendation of this paper: to use matrix factorisation (e.g. spectral embedding, node2vec), followed by nonlinear dimension reduction (e.g. Isomap, UMAP, t-SNE).
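The dense-regime experiment can be sketched end-to-end as follows, reusing spectral_embedding and isomap_epsilon from the earlier sketches. The radius rule below (a small distance quantile) is a convenient stand-in for "the smallest ε keeping the neighbourhood graph connected", and all other choices are ours.

```python
import numpy as np
from scipy.spatial import procrustes            # alignment up to rotation, scaling, translation
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)

# Latent positions on a grid over Z = [-pi + 0.25, pi - 0.25]^2; dense regime, rho_n = 1
m = 40                                          # m^2 = 1600 nodes
grid = np.linspace(-np.pi + 0.25, np.pi - 0.25, m)
Z = np.array([(a, b) for a in grid for b in grid])

# Kernel f(x, y) = {cos(x1 - y1) + cos(x2 - y2) + 2}/4 and one Bernoulli adjacency matrix
diff = Z[:, None, :] - Z[None, :, :]
F = (np.cos(diff[..., 0]) + np.cos(diff[..., 1]) + 2.0) / 4.0
A = rng.binomial(1, F)
A = np.triu(A, 1)
A = (A + A.T).astype(float)

# Step 1: matrix factorisation (spectral embedding into p = 5 dimensions)
X_hat = spectral_embedding(A, p=5)

# Step 2: nonlinear dimension reduction (epsilon-Isomap into d = 2 dimensions);
# the paper picks the smallest epsilon keeping the graph connected, a 5% quantile is a proxy.
eps = np.quantile(pdist(X_hat), 0.05)
Z_hat = isomap_epsilon(X_hat, eps, d=2)

# Align with the truth by Procrustes and report the residual disparity
_, Z_aligned, disparity = procrustes(Z, Z_hat)
print("Procrustes disparity:", disparity)
```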
Our next example uses a publicly available, clean version [52] (with license details therein) of the crowdsourced OpenSky dataset. Six undirected graphs are extracted, for the months November 2019 to April 2020, connecting any two airports in the world for which at least a single flight was recorded (n ≈ 10,000 airports in each). For each graph, the adjacency matrix A is spectrally embedded into p = 10 dimensions and degree-corrected by spherical projection [35, 33, 49], after which we apply Isomap with ε chosen as the 5% distance quantile. The dimension p was chosen as a loose upper bound on the dimension estimate returned by the method of [65] on any of the individual graphs: to facilitate comparison we prefer to use the same dimension throughout (e.g. to avoid artificial differences in variance), and so follow the recommendation of [12] to err on the large side (experiments with different choices of p are in the appendix). In the degree-correction step we aim to remove network effects related to "airport popularity" (e.g. population, economic activity and airport size), after which geographic distance might be expected to be the principal factor deciding whether there should be a flight between two airports. Therefore, after applying Isomap, we might hope that Ẑ_i could estimate the geographical location of airport i. The experiments take a few hours on a standard desktop (and the same is a loose upper bound on compute time for the other experiments in the paper).

The embedding for January is shown in Figures 3a)-b), before and after Isomap. Both recover real geographic features, notably the relative positioning of London, Paris, Madrid, Lisbon and Rome. However, the embedding of the US is warped, and only after using Isomap do we see New York and Los Angeles correctly positioned at opposite extremes of the country. As further confirmation, Figures 3c)-d) show the results restricted to North America, with the points coloured by latitude and longitude. For this continent, the accuracy of longitude recovery is particularly striking. In line with the recommendation to overestimate p [12], one obtains a similar figure using p = 20 before applying Isomap (Figure 10, appendix).

We now turn to an example in which A is a correlation matrix, and demonstrate a simple model-checking diagnostic informed by our theory. The raw data consist of average temperatures over each day, for several thousand days, recorded in cities around the globe. These data are open source, originate from Berkeley Earth [1], and the particular data set analysed is from [3]. We used open-source latitude and longitude data from [4]. See those references for license details. Removing cities and dates for which data are missing yields a temperature time-series over 1450 common days for each of the n = 2211 cities. A is the matrix of Pearson correlation coefficients between the n time-series. Recall from Section 3.3 that, for an additive kernel meeting the basic requirements of Section 3.1 (including Assumptions 1 and 2, of course), D_M^{ij} is equal to the Euclidean distance between Z_i and Z_j up to their monotone transformation by ψ, defined in (10). In turn, this implies that if A did actually follow the model (1) for some f and some "true" latent positions Z_1, ..., Z_n to which we had access, we should observe a monotone relationship between those true latent positions and the estimated positions Ẑ_1, ..., Ẑ_n from Isomap.
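In code, the diagnostic amounts to embedding the correlation matrix, running Isomap into d = 1, and checking for a monotone relationship with the candidate latent variable. The sketch below does this on synthetic stand-in temperatures (not the Berkeley Earth data) and summarises monotonicity by a Spearman rank correlation, which is our own summary choice; the paper reads the relationship off a plot. It reuses spectral_embedding and isomap_epsilon from the earlier sketches.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(2)

# Stand-in data: cities with random latitudes; daily series share a seasonal signal
# whose amplitude depends on latitude, plus independent noise.
n_days, n_cities = 1450, 800
latitude = rng.uniform(-60, 70, size=n_cities)
season = np.sin(2 * np.pi * np.arange(n_days) / 365.25)
temps = (latitude / 90)[None, :] * season[:, None] + 0.5 * rng.standard_normal((n_days, n_cities))
A = np.corrcoef(temps, rowvar=False)             # n_cities x n_cities Pearson correlations

# Matrix factorisation, then nonlinear dimension reduction
X_hat = spectral_embedding(A, p=2)
eps = np.quantile(pdist(X_hat), 0.05)            # 5% distance quantile, as in Section 4
Z_hat = isomap_epsilon(X_hat, eps, d=1)

# Model check: under model (1) with an additive kernel, Z_hat should be a monotone
# transformation of the "true" positions (here proxied by latitude), so |rho| should be near 1;
# the sign is arbitrary because the embedding is only identified up to such transformations.
rho, _ = spearmanr(Z_hat[:, 0], latitude)
print("Spearman rank correlation with latitude:", round(rho, 2))
```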
The empirical monotonicity in Figure 4c) can thus be interpreted as indicating that the latitudes of the cities are plausible as "true" latent positions underlying the correlation matrix A, without us having to specify anything in practice about f. In further analysis (appendix), no such empirical monotone relationship is found between longitude and estimated latent position; this indicates that longitude does not influence the correlations captured in A. A possible explanation is the daily averaging of temperatures in the underlying data: correlations between these average temperatures may be insensitive to longitude because of the rotation of the earth.

Our research shows how matrix factorisation and dimension reduction can work together to reveal the true positions of objects based only on pairwise similarities such as connections or correlations. For the sake of exposition and reproducibility, we have used public datasets which can be interpreted without specialist domain knowledge, but the methods are potentially relevant to any scientific application involving large similarity matrices. Thinking about societal impact, our results highlight the depth of information in principle available about individuals, given network metadata, and we hope to raise awareness of the associated privacy issues.

Concerning the limitations of the methods we have discussed, the bound (2) indicates that n should be "large" in order for QX̂_i to approximate φ_p(Z_i) well. n also has an impact on the performance of Isomap: heuristically, one needs a high density and a large number of points on or near the manifold to get a good estimate of geodesic distance. Thus the methods we propose are likely to perform poorly when n is small, corresponding e.g. to a graph with a small number of vertices. On the other hand, in applications involving large networks or matrices (e.g. cyber-security, recommender systems), the data encountered are often sparse. This is good news for computational feasibility but bad news for statistical accuracy. In particular, for a graph, (2) can only be expected to hold when the degree grows at least logarithmically in n, the information-theoretic limit below which no algorithm can obtain asymptotically exact embeddings [5]. What this means in practice (as illustrated in our numerical results for the simulated data example in the appendix) is that the manifold may be very hard to distinguish, even for a large graph. Missing data may have a similarly negative impact on discerning manifold structure. There could be substantial estimation gains in better integrating the factorisation and manifold-estimation steps to overcome these difficulties. Another limitation is that our theory restricts attention to the case of positive-definite kernels. When A is, say, a correlation or covariance matrix, the positive-definite assumption on the kernel is of course natural, but when A is, say, an adjacency matrix, it is a less natural assumption. The implications of removing the positive-definite assumption are the subject of ongoing research.

A.1 Proof of Theorem 1

Lemma 1. If Assumption 1 holds, then φ is injective.

Proof. We prove the contrapositive. So suppose φ is not injective. Then there must exist x, y, with x ≠ y, such that φ(x) = φ(y), and hence, for any a, f(x, a) = ⟨φ(x), φ(a)⟩₂ = ⟨φ(y), φ(a)⟩₂ = f(y, a).

The inverse of φ on M is denoted φ^{−1}. Let H_{z,z′} ∈ R^{d×d} be the matrix with elements H^{kl}_{z,z′} := ∂²f(x, y)/∂x_k∂y_l evaluated at x = z, y = z′, so that H_z = H_{z,z}.

Lemma 2. If Assumption 2 holds, then for any x, y ∈ Z, there exists z on the line segment with end-points x, y such that

    ‖φ(x) − φ(y)‖₂² = ⟨x − y, {∫_0^1 H_{z, y+t(x−y)} dt}(x − y)⟩,

where the integral is element-wise.

Proof. Fix any x, y in Z.
Observe from (3) and the definition of φ that for any x, y ∈ Z, and so since f is symmetric, 2 2 . By the mean value theorem, there exists z on the line segment with end-points x, y (i.e. z ∈ Z) such that where ∇g is the gradient of u → g(u) (with x, y still considered fixed) and ∇ x f (z, u) is the gradient of x → f (x, u) evaluated at z (with u considered fixed). Now considering the vector-valued mapping u → ∇ x f (z, u) with z fixed, we have Combining the above equalities gives: Proof. Lemma 5. If Assumptions 1 and 2 hold, then for any a, b ∈ M and a path γ in M with end-points a, b such that (γ) < ∞, the mapping η : , and l(η) < ∞. Proof. By Proposition 4, φ −1 is continuous, which combined with the continuity of t → γ t implies continuity of t → φ −1 (γ t ), so η is indeed a path in Z, and the end points of η are clearly φ −1 (a), φ −1 (b). Proposition 4 establishes that moreover φ −1 is Lipschitz continuous, and then l(γ) < ∞ implies l(η) < ∞ due to the definition of path-length. Lemma 6. For any a ≥ 0 and b such that |b| ≤ a, Proof. First prove the lower bound. For any a, b as in the statement, let c = a − |b|, so that c ≥ 0, and set x = |b| 1/2 and y = c 1/2 . Since x, y ≥ 0, application of the Euclidean triangle inequality in R 2 to the pair of vectors [x 0] , [0 y] gives the fact: (x 2 + y 2 ) 1/2 ≤ x + y, hence a 1/2 = (|b| + c) 1/2 ≤ |b| 1/2 + c 1/2 = |b| 1/2 + (a − |b|) 1/2 , or equivalently: (a − |b|) 1/2 ≥ a 1/2 − |b| 1/2 . (16) By the reverse triangle inequality and the assumptions on a and b, (a + b) 1/2 = |a + b| 1/2 ≥ (a − |b|) 1/2 . (17) Combining (16) and (17) completes the proof of the lower bound in the statement. For the upper-bound in the statement, let c = a 1/2 and d = |b| 1/2 . Then which implies (a + b) 1/2 ≤ a 1/2 + |b| 1/2 as required. Proof of Theorem 1. By Lemma 1, φ is injective; by Proposition 4, φ and its inverse on M, namely φ −1 , are Lipschitz; and by Lemma 5, for γ and η as in the statement, η is a path as claimed with l(η) < ∞. For the remainder of the proof, fix any > 0. By the definition of the path-length l(γ), there exists a partition P such that l(γ) − /2 ≤ χ(γ, P ) ≤ l(γ). Let P = (t 0 , . . . , t n ) be any partition, and fix any k such that 1 ≤ k ≤ n. By Lemma 2 there exists z on the line segment with end-points η t k−1 , η t k such that where Under Assumption 2, H ηt k−1 is positive-definite, so a k ≥ 0, and by (19) , a k + b k ≥ 0, so we must have |b k | ≤ a k . Lemma 6 then gives By Lemma 4, there exists δ > 0 such that Since η is a path, it is continuous on the compact set [0, 1], and then in fact uniformly continuous by the Heine-Cantor Theorem. Hence there exists a suitably fine partition P ,δ ⊇ P such that if P = (t 0 , . . . , t n ) ⊇ P ,δ , max k=1,...,n η t k − η t k−1 ≤ δ and in turn from (22), where the final inequality holds due to the definition of l(η) as the length of η. Combining (21) and (23), if again P ⊇ P ,δ , Recalling from (18) the defining property of P and using P ⊇ P ,δ ⊇ P , the triangle inequality for the · 2 norm gives l(γ) − 2 ≤ χ(γ, P ) ≤ χ(γ, P) ≤ l(γ). Combined with (24), we finally obtain that if P ⊇ P ,δ , and the proof of (4) is completed by taking P as appears in the statement to be P ,δ . The first equality in (5) can be proved by a standard argument -e.g., [47, p.137 ]. 
The second inequality in (5) is proved by passing to the limit of the summation in (4) along any sequence of partitions P A.2 Deriving geodesic distances from (4) rather than from (5) Recall from Section 3.3 that the general strategy to derive the geodesic distance associated with each family of kernels (translation invariant, inner-product, additive) is: (i) identify a lower bound on l(γ) which holds over all paths γ in M which have generic end-points a, b ∈ M in common, then (ii) show there exists a path whose length is equal to this lower bound. In Section 3.3 this strategy was executed for each family of kernels starting from the expression for l(γ) given in (5) . In the proofs of Lemmas 7-9 below we show how step (i) is performed if we start not from (5) but rather from (4), the latter being more general because continuous differentiability of the paths is relaxed to continuity. The key message of these three lemmas regarding step (i) is that we obtain exactly the same lower bounds on l(γ) as are derived from (5) in Section 3.3. The reader is directed to Section 3.3 for the details of how Assumption 2 is verified for each family of kernels; to avoid repetition we don't re-state all those details here. Lemma 7. Consider the family of translation invariant kernels described in Section 3.3 and let G be as defined there. For any a, b ∈ M and any path γ ∈ M with end-points a, b, If we defineη to be the path in Z given bỹ Proof. Applying Theorem 1, fix any > 0 and let P be a partition such that for any partition P = (t 0 , . . . , t n ) satisfying P ⊆ P, Recalling from Section 3.3 that for this family of translation invariant kernels H z = GG for all z ∈ Z, the triangle inequality for the · norm combined with (25) gives The proof of the lower bound in the statement is then complete since was arbitrary. To complete the proof of the lemma, observe that from the definition ofη in the statement, and the proof of the lemma is then complete, because in (25) being arbitrary implies l(γ) = Lemma 8. Consider the family of inner-product kernels of the form f (x, y) = g( x, y ) as described in Section 3.3 where g (1) > 0 . For any a, b ∈ M and any path γ ∈ M with end-points a, b, Ifη is a shortest circular arc in Z with end-points φ −1 (a), φ −1 (b), thenγ defined byγ t := φ(η t ) satisfies l(γ) = g (1) 1/2 arccos φ −1 (a), φ −1 (b) . Proof. As usual, let η be the path in Z defined by η t := φ −1 (γ t ). Then from the definition of path-length and the triangle inequality for the · norm, for any δ > 0, there exists a partition P δ such that for any P = (t 0 , . . . , t n ) satisfying P δ ⊆ P , Fix any > 0 and let P be as in Theorem 1 and then take P = (t 0 , . . . , t n ) to be defined by P = P ∪ P δ , so by construction we have simultaneously P ⊆ P and P δ ⊆ P. Then from Theorem 1, Combined with the fact that for this family of kernels H z = g (1)I + g (1)zz where g (1) > 0 and g (1) ≥ 0, we obtain where the penultimate inequality uses g (1) ≥ 0 and the final inequality holds by taking P in (26) to be P. Since and δ were arbitrary, we have shown l(γ) ≥ g (1) 1/2 l(η). Recall that here η is a path in Z = { x ∈ R d : x = 1} with end-points φ −1 (a), φ −1 (b). Hence l(η) is lower-bounded by the Euclidean geodesic distance in Z between φ −1 (a) and φ −1 (b), which is arccos φ −1 (a), φ −1 (b) because Z is a radius-1 sphere centered at the origin. 
Withη andγ as defined in the statement, taking η in (27) to beη, refining P and using η t ,η t = 0 (see discussion in Section 3.3) we find l(γ) = g (1) 1/2 l(η), where by definition ofη, l(η) = arccos φ −1 (a), φ −1 (b) . Lemma 9. Consider the family of additive kernels described in Section 3.3. For any a, b ∈ M and any path γ ∈ M with end-points a, b, Proof. The compactness of Z and the continuity of z → for each i = 1, . . . , d implies the uniform-continuity of the latter by the Heine-Cantor theorem. Hence for any δ 1 > 0, there exists δ 2 > 0 such that for all i = 1, . . . , d and z Codes for the experiments reported in the main part of the paper and those in this appendix are available at https://github.com/anniegray52/graphs. A gap appears to form between North and South America which, as a first hypothesis, we put down to the well-publicised suspension of all immigration into the US in April 2020. To measure this gap we use the Earth Mover's distance between the point clouds belonging to the two continents (with thanks to Dr. Louis Gammelgaard Jensen for the code), using the approximate geodesic distances of the -neighbourhood graph (i.e., before dimension reduction). While this distance does explode in April 2020, as shown in Figure 12 (Appendix), we found the proposed explanation to be incomplete, because Australia and New Zealand imposed similar measures at the time, whereas the distance between Oceania and the rest of the world, computed in the same way, does not explode. Revisiting the facts [2] , while the countries mentioned above closed their borders to nonresidents, the continents of South America and Africa arguably imposed more severe measures, with large numbers of countries fully suspending flights. This explanation seems more likely, as on re-inspection we find a large jump in Earth mover's distance, over April, between both of those continents and the rest of the world, as shown in Figure 13 . Further to the discussion in Section 4, Figure 14 illustrates two important findings in the case p = 2 and d = 1, under the setup for this temperature correlation example described in section 4: firstly that there is no clear relationship between longitude and position along the manifold, and secondly a non-monotone relationship between longitude and estimated latent position. Both these observations are in marked contrast with the results in Figure 4 for latitude. We then consider the case p = 3 and d = 2. Figure 15 shows the spectral embedding. Again the point-cloud is concentrated around a curved manifold. In Figure 16 we plot latitude and longitude against each of the d = 2 coordinates of the estimated latent positions. As in the p = 2, d = 1 case, Graph distance, followed by CMDS Figure 6 : Latent position recovery in a sparse regime using different combinations of techniques with n = 6400. The first row contains spectral embedding followed by different nonlinear dimension reduction techniques and the second row contains node2vec, node2vec followed by Isomap, and we have attempted recovery using graph distances [11] . To aid visualisation, all plots will display a subset of 100 on a fixed sub-grid, coloured according to their true location. we find a clear monotone relationship between latitude and the first component of estimated position but no such relationship between longitude and the second component. Figure 12 : Earth mover's distance between North and South America, as inferred from the approximate geodesic distances given by the -neighbourhood graph. 
In each of 100 Monte Carlo iterations, the Earth Mover's distance is computed based on 100 points randomly selected from each continent. We plot the average, with 2 standard errors in either direction indicated by the vertical bars. Figure 13 : Earth mover's distance from each continent to the rest of the world, as inferred from the approximate geodesic distances given by the -neighbourhood graph. In each of 100 Monte Carlo iterations, the Earth mover's distance is computed based on 100 points (airports) randomly selected from the continent of interest, and 100 points (airports) from the rest of the world. We plot the average, for each continent, with 2 standard errors in either direction indicated by the vertical bars. Countries slammed their borders shut to stop coronavirus. but is it doing any good? Climate change: Earth surface temperature data Simplemaps free entire world database Community detection and stochastic block models: recent developments Mixed membership stochastic blockmodels Improving support vector machine classifiers by modifying kernel functions Pseudo-likelihood methods for community detection in large sparse networks A novel embedding-based framework improving the User and Entity Behav-ior Analysis. working paper or preprint Latent distance estimation for random geometric graphs On the estimation of latent distances using graph distances Statistical inference on random dot product graphs: a survey On estimation and inference in latent structure random graphs Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels The phase transition in inhomogeneous random graphs A course in metric geometry The two-to-infinity norm and singular subspace geometry with applications to high-dimensional statistics Spectral embedding of weighted graphs Rate-optimal graphon estimation node2vec: Scalable feature learning for networks Word re-embedding via manifold dimensionality retention Latent space approaches to social network analysis Stochastic blockmodels: First steps Can smooth graphons in several dimensions be represented by smooth graphons on The multilayer random dot product graph On the representation theorem for exchangeable arrays Random feature maps for dot product kernels Graph-embedding for speaker recognition Stochastic blockmodels and community structure in networks Probabilistic non-linear principal component analysis with Gaussian process latent variable models Gaussian process latent variable models for visualisation of high dimensional data Network representation using graph root distributions Consistency of spectral clustering in stochastic block models Large networks and graph limits Perfect clustering for stochastic blockmodel graphs via adjacency spectral embedding Community detection and classification in hierarchical stochastic blockmodels Visualizing data using t-SNE Population genomics of the viking world Umap: Uniform manifold approximation and projection for dimension reduction Spectral clustering under degree heterogeneity: a case for the random walk Laplacian Riemannian geometry On a two-truths phenomenon in spectral graph clustering Network embedding as matrix factorization: Unifying Deepwalk, LINE, PTE, and node2vec Spectral clustering and the high-dimensional stochastic blockmodel Manifold structure in graph embeddings A statistical interpretation of spectral embedding: the generalised random dot product graph Principles of mathematical analysis Variational bayesian inference for the latent position 
cluster model for network data Spectral clustering on spherical coordinates under the degree-corrected stochastic blockmodel Nonlinear component analysis as a kernel eigenvalue problem Support vector machines Jannis Lübbe, Matthias Schäfer, and Vincent Lenders. Crowdsourced air traffic data from the opensky network 2019-2020 A consistent adjacency spectral embedding for stochastic blockmodel graphs Introduction to metric and topological spaces Limit theorems for eigenvectors of the normalized laplacian for random graphs Universally consistent vertex classification for latent positions graphs A global geometric framework for nonlinear dimensionality reduction. science Probabilistic principal component analysis Bayesian Gaussian process latent variable model Multidimensional scaling: I. Theory and method Learning 1-dimensional submanifolds for subsequent inference on random dot product graphs Rates of convergence of spectral methods for graphon estimation Comparison of brain connectomes using geodesic distance on manifold: A twins study Consistency of random-walk based network embedding algorithms Automatic dimensionality selection from the scree plot via the use of profile likelihood Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? Did you discuss any potential negative societal impacts of your work? Have you read the ethics review guidelines and ensured that your paper conforms to them If you are including theoretical results Did you state the full set of assumptions of all theoretical results? [Yes] See Section 3 and Appendix A a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL Lemma 4. If Assumption 2 holds, then for any > 0 there exists δ > 0 such that for any x, y ∈ Z such that x − y ≤ δ and any ξ, z, z on the line segment with endpoints x, y,Proof. For each i, j, since (z, z ) → H ij z,z is assumed continuous on Z × Z, and Z is compact, by the Heine-Cantor theorem (z, z ) → H ij z,z is in fact uniformly continuous on Z × Z. Fix any > 0. Using this uniform continuity, there exists δ > 0 such that for any x, y ∈ Z, if x − y ≤ δ, then for any ξ, z, z on the line-segment with end-points x, y,and so in turnwhere · F is the Frobenius norm. Observing that H ξ,ξ = H z , the result then follows from Lemma 3.Proposition 4. If Assumptions 1 and 2 hold, φ and φ −1 are each Lipschitz continuous with respect to the norms · on Z and · 2 on M.Proof. As a preliminary note that for any z ∈ Z, H z is symmetric and positive-definite under Assumption 2, and let λ min z , λ max z be the minimum and maximum eigenvalues of the matrix H z . Since λ max z = H z sp , the spectral norm of H z , and the reverse triangle inequality for this norm statesshows that z → λ min z is continuous. Due to the compactness of Z, we therefore find that λ + := sup z∈Z λ max z < ∞, and λ − := inf z∈Z λ min z > 0. Our next objective in the proof of the proposition is to establish the Lipschitz continuity of φ. As a first step towards this, note that it follows from the identity, that the continuity of (x, y) → f (x, y) implies continuity in 2 of x → φ(x). Now fix any 1 > 0 and consider any x, y ∈ Z. 
By combining Lemmas 2 and 4 there exists δ 1 > 0 such that if x − y ≤ δ 1 , there exists z on the line segment with end-points x, y such that:On the other hand if x − y > δ 1 ,where c 1 := sup x,y∈Z φ(x) − φ(y) 2 is finite since Z is compact and φ has already been proved to be continuous in 2 . Combining (14) and (15) we obtainIt remains to prove Lipschitz continuity of φ −1 . Fix 2 ∈ (0, λ − ). Since Z is compact and φ is continuous, φ −1 is continuous on M [54, Prop. 13.26], and then also uniformly continuous by Fix any > 0 and let P be as in Theorem 1. We now claim there exists a partition P = (t 0 , . . . , t n ) satisfying simultaneously P ⊆ P andTo see that such a partition exists, note that with η t := φ −1 (γ t ), t → η t is continuous on the compact set [0, 1] hence uniformly continuous by the Heine-Cantor theorem. Thus for any s, t ∈ [0, 1] sufficiently close to each other, η s − η t can be made less than or equal to δ 2 , which implies max i=1,...,d |ηThus starting from P , if we subsequently add points to this partition until max k=1,...,n |t k − t k−1 | is sufficiently small then we will arrive at a partition P with the required properties, as claimed. Now with this partition P in hand, fix any i = 1, . . . , d and k = 1, . . . , n. We then havewhere the upper bound is due to (28)- (29) . Squaring both sides, summing over i and then applying the triangle inequality giveswhere H(1/2) ηt k−1 is the diagonal matrix with ith diagonal element equal to α. Summing over k = 1, . . . , n and using the definition of l(η) givesCombined with the relationship (4) from Theorem 1 and yet another application of the triangle inequality we thus findThe proof of the lower bound in the statement is complete since and δ 1 are arbitrary. In order to complete the proof of the lemma, observe that (28) combined with the same decomposition in (30)-(31) yields an accompanying lower bound on ζ