Time-inhomogeneous diffusion geometry and topology
Huguet, Guillaume; Tong, Alexander; Rieck, Bastian; Huang, Jessie; Kuchroo, Manik; Hirn, Matthew; Wolf, Guy; Krishnaswamy, Smita
2022-03-28

Abstract. Diffusion condensation is a dynamic process that yields a sequence of multiscale data representations that aim to encode meaningful abstractions. It has proven effective for manifold learning, denoising, clustering, and visualization of high-dimensional data. Diffusion condensation is constructed as a time-inhomogeneous process where each step first computes and then applies a diffusion operator to the data. We theoretically analyze the convergence and evolution of this process from geometric, spectral, and topological perspectives. From a geometric perspective, we obtain convergence bounds based on the smallest transition probability and the radius of the data, whereas from a spectral perspective, our bounds are based on the eigenspectrum of the diffusion kernel. Our spectral results are of particular interest since most of the literature on data diffusion is focused on homogeneous processes. From a topological perspective, we show diffusion condensation generalizes centroid-based hierarchical clustering. We use this perspective to obtain a bound based on the number of data points, independent of their location. To understand the evolution of the data geometry beyond convergence, we use topological data analysis. We show that the condensation process itself defines an intrinsic diffusion homology. We use this intrinsic topology as well as an ambient topology to study how the data changes over diffusion time. We demonstrate both homologies in well-understood toy examples. Our work gives theoretical insights into the convergence of diffusion condensation, and shows that it provides a link between topological and geometric data analysis.

1. Introduction. Graph representations of high-dimensional data have proven useful in many applications such as visualization, clustering, and denoising. Typically, a set of data points is described by a graph using a pairwise affinity measure, stored in an affinity matrix. With this matrix, one can define the random walk operator or the graph Laplacian, and use numerous tools from graph theory to characterize the input data. Diffusion operators are closely related to random walks on a graph, as they describe how heat (or gas) propagates across the vertices. Using powers of this operator yields a time-homogeneous Markov process that has been extensively studied. Most notably, Coifman et al. [10] proved that, under specific conditions, this operator converges to the heat kernel on an underlying continuous manifold. Manifold learning methods like diffusion maps [10] define an embedding via the eigendecomposition of the diffusion operator. Other methods, such as PHATE [28], embed a diffusion-based distance by multidimensional scaling. Various clustering algorithms rely on the eigendecomposition of this operator (or the resulting Laplacian) [35, 25]. However, this homogeneous process requires a fixed bandwidth, which determines the scale at which the data manifold is captured. If we are interested in considering multiple scales of the data [2, 23], or if the data are sampled from a time-varying manifold [26], we need a time-inhomogeneous process.
In this paper, we focus on the time-inhomogeneous diffusion process for a given initial set of data points. This process is known as Diffusion Condensation [2] and yields a representation of the data by a sequence of datasets, each at a different granularity. This sequence is obtained by iteratively applying a diffusion operator. It has proven effective for tasks such as denoising, clustering, and manifold learning [31, 2, 26, 32]. In this work, we study theoretical questions of diffusion condensation. Thus, we define conditions on the diffusion operators such that the process converges to a single point. The convergence to a point is a valuable characteristic, as it guarantees that diffusion condensation sweeps a complete range of granularities of the data. We present this analysis from a geometric and a spectral perspective, addressing different families of operators. We also study how the intrinsic shape of the condensed datasets evolves through condensation time using tools from topological data analysis. In particular, we define an intrinsic filtration based on the condensation process, resulting in the notions of ambient and intrinsic diffusion homology, for studying individual condensation steps or for summarizing the entire process, respectively. Making use of a topological perspective, we also prove the relation between diffusion condensation and types of hierarchical clustering algorithms. The paper is organized as follows. In section 2, we present an overview of diffusion condensation. In section 3, we develop a geometric analysis of the process; most importantly, we prove its convergence to a point. In section 4, we study the convergence of the process from a spectral perspective. In section 5, we present a topological analysis of the process and relate diffusion condensation to existing hierarchical clustering algorithms.

2. Diffusion condensation. In order to establish the setup and scope for our work, we first formalize here the diffusion condensation framework, and provide a unifying view of design choices and algorithms used to empirically evaluate its efficacy in previous and related work.

2.1. Notations and setup. Let X = {x(j) : j = 1, . . . , N} ⊂ R^d be an input dataset of N data points in d dimensions. Given a symmetric nonnegative affinity kernel k : R^d × R^d → R, with 0 ≤ k(x, y) = k(y, x) ≤ 1 for x, y ∈ R^d, we define an N × N kernel matrix K with entries K(i, j) := k(x(i), x(j)), which can be regarded as a weighted adjacency matrix of a graph capturing the intrinsic geometry of the data. Furthermore, the kernel and resulting graph are often considered as providing a notion of locality in the data, which can be tuned by a kernel bandwidth parameter ε. We defer discussion of specific kernels k depending on ε to subsection 2.5, but mention that ε can be regarded as a proxy for the size or (local) radius of the neighborhoods defined by the kernel. The diffusion framework for manifold learning [10, 28] uses this construction to define a Markov process over the intrinsic structure of the data by normalizing the kernel matrix with a diagonal degree matrix D := diag(d(1), d(2), . . . , d(N)), where d(i) := Σ_j K(i, j), resulting in a row-stochastic Markov matrix P := D^{-1} K, known as the (discrete) diffusion operator. Traditionally, time-homogeneous diffusion processes leverage powers P^τ of this diffusion operator, for diffusion times τ ∈ N, to capture underlying data-manifold structure in X and to organize the data along this structure [10, 28].
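To make this construction concrete, the following minimal sketch builds the kernel matrix, degree matrix, and row-stochastic diffusion operator for a Gaussian kernel; the dataset, bandwidth, and diffusion time below are illustrative assumptions rather than values prescribed in the paper.

# Minimal sketch of the (time-homogeneous) diffusion operator of subsection 2.1,
# using a Gaussian kernel. Bandwidth and data are illustrative choices only.
import numpy as np

def diffusion_operator(X, eps=1.0):
    """Return the row-stochastic operator P = D^{-1} K for a Gaussian kernel."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / eps)              # kernel matrix, symmetric, entries in (0, 1]
    d = K.sum(axis=1)                        # degrees d(i) = sum_j K(i, j)
    P = K / d[:, None]                       # Markov (row-stochastic) normalization
    return P

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
P = diffusion_operator(X)
P_tau = np.linalg.matrix_power(P, 3)         # homogeneous diffusion: powers P^tau
assert np.allclose(P_tau.sum(axis=1), 1.0)   # still row-stochastic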
Here, on the other hand, we follow the diffusion condensation approach [2] and use a time-inhomogeneous process, where the diffusion operator (and the underlying finite dataset) vary over time. Therefore, in our case we consider a sequence of datasets X_t = {x_t(j) : j = 1, . . . , N}, ordered along diffusion condensation time t ∈ N, with corresponding diffusion operators P_t, each constructed over the corresponding X_t. With a slight abuse of notation we often refer to X_t as a set or as an N × d matrix, where x_t(j) is the j-th row or, equivalently, the j-th element of the set. At time t = 0 we consider the input dataset, with its (traditional) diffusion operator, while for each t > 0 we take X_t := P_{t-1}^τ X_{t-1}, with the usual matrix multiplication. Then, instead of powers of a single diffusion operator, the t-step diffusion process is defined via P^{(t-1)} := P_{t-1}^τ · · · P_0^τ, and thus we can also directly write X_t = P^{(t-1)} X_0. For simplicity, we keep the diffusion time τ fixed, but it could also depend on the condensation time t. Finally, we use the notation X^{(T)} := X_0, X_1, . . . , X_T for a sequence of datasets up to finite time T, and denote the diameter of the dataset at time t as diam(X_t) := max_{x,y∈X_t} ‖x − y‖_2.

The diffusion condensation algorithm first proposed in Brugnone et al. [2] has been applied for data analysis in a number of areas. Moyle et al. [29] applied diffusion condensation to study neural connectomics between species and identify biologically meaningful substructures. Kuchroo et al. [23] applied diffusion condensation to embed and visualize single-cell proteomic data to explore the effect of COVID-19 on the immune system. Kuchroo et al. [22] applied diffusion condensation on single-nucleus RNA sequencing data from human retinas with age-related macular degeneration (AMD) and found a potential drug target by exploring the topological structure of the resulting diffusion condensation process. van Dijk et al. [32] applied one step of diffusion condensation (t = 1) with high τ to single-cell RNA sequencing data to impute gene expression. They showed that high τ improves the quality of downstream tasks such as gene-gene relationship recovery and visualization. These works demonstrate the empirical utility of diffusion condensation in a number of settings, specifically when multiscale clustering and visualization is needed and the data lies on a manifold. The diffusion condensation process is a particular type of time-inhomogeneous diffusion process. General time-inhomogeneous diffusion processes over time-varying data were studied in [26], where it was proposed to use the singular value decomposition of the operator P^{(t)} to embed an arbitrary sequence of datasets X^{(T)} according to their space-time geometry. Additionally, if those datasets X^{(T)} were sampled from a manifold (M, g(t)) with time-varying metric tensor g(t), it was shown in [26] that, as N, T → ∞, the operator P^{(t)} converges to the heat kernel of (M, g(t)). We also note a resemblance to the mean shift algorithm [16, 7], which relies on a kernel-based estimation of ∇ log p(x), where p(x) is the unknown density from which the points are sampled. The processed dataset is recursively updated via x_{t+1}(i) = x_t(i) + ∇ log p(x_t(i)), which effectively moves all points toward a mode of the distribution, hence creating clusters.
Motivated by these empirical successes and inspired by the general theoretical results on time-inhomogeneous diffusion processes, we consider two open questions specific to the diffusion condensation process: under what conditions does the diffusion condensation algorithm converge, and how can the topology of the diffusion condensation process be understood?

2.3. Theoretical contributions. The main contribution of this paper is to address these open questions and establish the underpinnings of diffusion condensation. Our investigation is divided into three perspectives. First, we investigate the convergence properties of diffusion condensation under various parameter regimes from a geometric perspective in section 3, i.e., in terms of the arrangement of data points in spatial coordinates. This geometric perspective gives an intuitive sense of convergence for a large family of kernels with minimum tail bounds. Next, in section 4, we investigate convergence from a spectral graph theory perspective and prove convergence in terms of the spectral properties of the kernel, viewing diffusion condensation as a non-stationary Markov process. A spectral perspective gives bounds in terms of the eigenvalues of the kernel, which can give better rates of convergence depending on the considered data. Finally, in section 5, we investigate the topological characteristics of diffusion condensation. Here we describe both the structure of points at each condensation step individually, which we refer to as ambient topology, as well as the topology of the condensation process itself, which we refer to as intrinsic topology. Additionally, we link the topology of the diffusion condensation process to hierarchical clustering and prove how it generalizes the centroid linkage method.

2.4. Algorithm. The diffusion condensation algorithm summarizes input data with a series of representations, organized by condensation time, with earlier representations providing low level, microscopic details and later representations providing overall, macroscopic summarizations. Each time-step of diffusion condensation can be broken up into five main steps.
1. Construct a kernel matrix K_t summarizing similarities between points.
2. Construct a Markov normalized diffusion operator P_t.
3. Diffuse the data coordinates τ steps using P_t^τ.
4. Update the kernel bandwidth according to some update function.
5. (Optionally) merge points within distance ζ.
Algorithm 2.1 shows pseudocode for this process.

Algorithm 2.1 (Diffusion condensation)
1: Input: Dataset X_0, initial kernel bandwidth ε_0, diffusion time τ, and merge radius ζ
2: Output: Condensed datasets X^{(T)}
3: for t ∈ {0, 1, . . . , T − 1} do
4:   K_t ← kernel matrix computed on X_t with bandwidth ε_t
5:   P_t ← D_t^{-1} K_t
6:   X_{t+1} ← P_t^τ X_t
7:   ε_{t+1} ← update(ε_t)
8:   for all pairs (i, j) with ‖x_{t+1}(i) − x_{t+1}(j)‖_2 ≤ ζ do
9:     merge x_{t+1}(i) and x_{t+1}(j) into a single representative point
10:  end for
11: end for
12: X^{(T)} ← {X_0, X_1, . . . , X_T}

At each timestep, the positions of points are updated based on the predefined kernel through τ steps of diffusion. Intuitively, this can be thought of as moving each point to a kernel-weighted average of its neighbors. The condensation process will behave differently depending on the choice of kernel, the kernel bandwidth, the diffusion time, and the merging threshold. Figure 1 shows the effect of τ during the condensation process. In these examples, greater values of τ encourage the process to condense along the manifold, in contrast with other hierarchical clustering algorithms that are not able to do so. Figure 1 shows this by comparing the left and right of panel (b): with low τ (left) the manifold is distorted, but by increasing τ (right) diffusion condensation condenses along it.
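A minimal sketch of Algorithm 2.1 with a Gaussian kernel and τ = 1 is given below; the bandwidth-doubling heuristic, merge rule, and stopping criterion are illustrative assumptions rather than the exact choices used in previous work. Note that each iteration recomputes the kernel on the current coordinates, which is what makes the process time-inhomogeneous.

# Illustrative sketch of Algorithm 2.1 (diffusion condensation), tau = 1.
import numpy as np

def condense(X0, eps0=0.1, zeta=1e-4, T=100):
    X, eps = X0.copy(), eps0
    history = [X.copy()]
    for _ in range(T):
        sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        K = np.exp(-sq / eps)                     # step 1: kernel matrix K_t
        P = K / K.sum(axis=1, keepdims=True)      # step 2: diffusion operator P_t
        X_new = P @ X                             # step 3: diffuse coordinates (tau = 1)
        if np.linalg.norm(X_new - X) < 1e-6:      # step 4: slow contraction -> widen kernel
            eps *= 2.0
        X = X_new
        # step 5 (optional): merge points closer than zeta by snapping to a shared location
        d = np.sqrt(np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
        for i, j in zip(*np.where(np.triu(d < zeta, k=1))):
            X[j] = X[i]
        history.append(X.copy())
        if np.ptp(X, axis=0).max() < zeta:        # stop once the data has condensed to a point
            break
    return history

X_T = condense(np.random.default_rng(1).normal(size=(30, 2)))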
For the rest of the paper, we let τ = 1, but our results are valid for any τ ∈ N. Only for the spectral part do we need to consider a slight nuance, which we discuss in Remark 4.6.

2.5. Kernels. The box kernel is arguably the simplest and most interpretable kernel, leading to interesting data summarizations, including an instance of agglomerative clustering depending on the bandwidth as a function of time.

Definition 2.1. The box kernel of bandwidth ε is k_ε(x, y) = 1 if ‖x − y‖_2 ≤ ε and 0 otherwise.

However, first we note some simple cases of bandwidth settings. Consider a case where k_t(x, y) = 1 for all x, y ∈ X_0. This can be thought of as a box kernel with bandwidth greater than diam(X_0). Using this kernel, after a single step of diffusion condensation, all points converge to the mean data point, (1/N) Σ_i x_t(i). This mean data point is a useful, if trivial, summarization of the data. Next, consider the opposite extreme, a box kernel with infinitely narrow bandwidth, k_t(x, y) = 1 if x = y and 0 otherwise. In this case, we have X_t = X_0 for all t > 0, resulting in another trivial result, i.e., no data summarization over diffusion condensation time. Of more interest are bandwidths between these two extremes, providing hierarchical sets of summarizations. Next, we consider smoother kernels.

Definition 2.2. The α-decay kernel [28] of bandwidth ε is k_{ε,α}(x, y) = exp(−‖x − y‖_2^α / ε^α).

The α-decay kernel was used in [23] along with anisotropic density normalization (see Definition 2.5), which was shown to empirically speed up convergence of diffusion condensation.

Definition 2.3. The Gaussian kernel of bandwidth ε is k_ε(x, y) = exp(−‖x − y‖_2^2 / ε).

The Gaussian kernel was used in [2], employing density normalization and a merging threshold of 10^{-4}, with a bandwidth ε_t doubling whenever the change in position of points between t − 1 and t dropped below a separate threshold. This kernel and setting of ε_t ensure that the datasets converge to a single point in a reasonable amount of time in practice. Another kernel that exhibits interesting behavior is the Laplace kernel (Figure 2); it is noteworthy since it is positive definite for all conditionally negative definite metrics [15].

Definition 2.4. The Laplace kernel of bandwidth ε is k_ε(x, y) = exp(−‖x − y‖_2 / ε).

Note that this is the same as the α-decay kernel with α = 1. In fact, the Gaussian and Laplace kernels can be generalized to the α-decay kernel, which interpolates between the Gaussian kernel when α = 2 and the box kernel as α → ∞.

Definition 2.5 (Anisotropic Density Normalized Kernel [10]). For a rotation invariant kernel k_ε(x, y), let q(x) = ∫_X k_ε(x, y) q(y) dy; then a density normalized kernel with normalization factor β is given by k_{ε,β}(x, y) = k_ε(x, y) / (q^β(x) q^β(y)).

3. Geometric properties of the condensation process. We first examine the time-varying nature of the data geometry along the diffusion condensation process. Since this process results in a sequence of finite datasets X_t, organized along condensation time, a pertinent question is how their underlying geometric structure changes as local variability is eliminated by the diffusion process, and whether it eventually converges to a stable one as t → ∞. In this section, we study this question by considering two geometric characteristics of the data (namely, its convex hull and diameter), establish their monotonic convergence, and relate this convergence to the tail behavior of the kernel utilized in the construction of the diffusion process. Our main result here is that, with an appropriate kernel choice, the condensation process converges to a point, in time dependent on the shape of the kernel.
That is, for all ζ > 0, there exists an M ∈ N such that for all t ≥ M we have ‖x − y‖_2 < ζ for all x, y ∈ X_t. Intuitively, we can make all the points arbitrarily close by iterating the process.

Figure 2: Gaussian, Laplace, Box, and α-decay kernels.

It is important to note that this result requires some assumptions on the kernel in order to avoid pathological cases where the process may converge, but not to a single point. For instance, using the box kernel on a dumbbell dataset, each sphere would converge to a point, but for a certain threshold these points would not be connected. Hence, the process would reach a stable state, i.e., there exists M ∈ N such that P_M X_M = X_M, but it would not converge to a point. One of our goals is to define the conditions on the kernels for the process to converge to a single point.

3.1. Diameter and convex hull convergence. Intuitively, one can consider each diffusion condensation iteration as eliminating local variability in data [2, 23], and while empirical results presented in previous work indicate that the condensation process can accentuate separation between weakly connected data regions, they also indicate that the process has a global contraction property due to the elimination of variability in the data. Thus, it appears that diffusion condensation coarse-grains data by sweeping through granularities, from each point being a separate entity to all data points being in a single cluster.

Figure 4: Illustration of Theorem 3.2. We depict four timesteps of the condensation process of a simple dataset. The convex hull of the points is shown in gray, with the diameter of X_t shown as a dotted line. As t progresses, the convex hulls shrink, with conv(X_{t+1}) ⊊ conv(X_t). The rightmost panel ("All convex hulls") shows the convex hulls of all timesteps, with lighter shades indicating later condensation timesteps.

To establish this contractive property, and to formulate a notion of (monotone) convergence of the data geometry associated with it, we characterize the geometry of each X_t via its diameter and convex hull, whose convergence under the condensation process is shown in the following theorem.

Remark 3.1. The diffusion condensation process is not a contractive mapping in the strict sense. During individual iterations, distances are not generally all decreasing, i.e., there may exist points x_t(i) and x_t(j) with ‖x_{t+1}(i) − x_{t+1}(j)‖_2 > ‖x_t(i) − x_t(j)‖_2.

Theorem 3.2. Let (X_t)_{t∈N} and (P_t)_{t∈N} be respectively the sequence of datasets and diffusion operators generated by diffusion condensation. If the kernel used to construct each P_t is strictly (pointwise) positive, then:
1. The convex hulls form a nested sequence conv(X_{t+1}) ⊆ conv(X_t), whose intersection ∩_t conv(X_t) is nonempty and convex.
2. The diameters diam(X_t) form a monotonically decreasing sequence that converges; if the limit is 0, then ∩_t conv(X_t) consists of a single point, i.e., the process converges to a single point.

Prior to proving Theorem 3.2, we first require the following technical lemma about polytopes, which can also be found in standard literature [24]. Its proof is provided in the supplementary material for completeness.

Lemma 3.3. Let X ⊂ R^d be a set of points and C := conv(X) their convex hull. Then every extremal point v_j ∈ C satisfies v_j ∈ X. Thus, the extremal points of C are a subset of X.

With Lemma 3.3, we are now ready to prove Theorem 3.2 as follows (see Figure 4 for an illustration of the arguments in the proof).

Proof of Theorem 3.2. We start by proving the nesting property, distinguishing the cases diam(X_t) > 0 (a) and diam(X_t) = 0 (b). To prove (a), we note that since the entries of P_t are positive and its rows sum to 1, each element of X_{t+1} is a convex combination of the original data points. That is, x_{t+1}(i) = Σ_j P_t(i, j) x_t(j), with P_t(i, j) > 0 for all i, j ∈ {1, . . . , N}, and Σ_j P_t(i, j) = 1.
As a consequence, X_{t+1} will be formed by convex combinations, so all points in X_{t+1} lie in the interior of conv(X_t), which is not empty since diam(X_t) > 0. From Lemma 3.3, we know that the extremal points of conv(X_{t+1}) also lie in the interior of conv(X_t). Hence conv(X_{t+1}) ⊊ conv(X_t), which proves (a). To prove (b), we assume diam(X_t) = 0. By construction of P_t, we get X_{t+1} = P_t X_t = X_t, so that conv(X_{t+1}) = conv(X_t). Steps (a) and (b) show that the convex hulls are a nested sequence, so conv(X_t) → ∩_{t=1}^∞ conv(X_t). Since the intersection of convex sets is convex, the limiting set is also convex. Finally, we use Helly's theorem [24], which states that if a collection of compact convex sets is such that every subcollection of sufficient cardinality has nonempty intersection, then the intersection of the collection is nonempty. Here, because of the nesting property, every subcollection has nonempty intersection. Moreover, the convex hulls of finite sets are compact, hence we conclude that ∩_{t=1}^∞ conv(X_t) ≠ ∅. Thus, the diameters form a monotonically decreasing sequence, which converges since the sequence is nonnegative.

3.2. Convergence rates. Theorem 3.2 applies for all strictly positive kernels. While it is only established in terms of the diameter and convex hull of the data, this result extends to show pointwise convergence of the diffusion condensation process if we make further assumptions on the rate at which the diameter sequence decreases, or establish bounds on this rate based on the specific kernel used in the diffusion construction. We next proceed with such an in-depth analysis, focusing on strictly positive kernels, while noting that later, in subsection 5.3, we will also show a convergence result for a kernel with finite support (i.e., where the discussion here is not valid). We begin with the following result relating the rate of convergence to the minimum value of the kernel over the data.

Lemma 3.4. Suppose the kernel matrix K_t satisfies K_t(i, j) ≥ δ > 0 for all i, j ∈ {1, . . . , N}. Then diam(X_{t+1}) ≤ (1 − δ) diam(X_t).

Proof. Here we present the key ideas of the proof, and we refer to the supplementary material for the detailed version. The assumption on K_t gives the element-wise lower bound P_t(i, j) ≥ δ/N, and we show that d_TV(P_t(i, ·), P_t(j, ·)) ≤ 1 − δ, where d_TV is the total variation distance. Next, using a coupling ξ with marginals P_t(i, ·) and P_t(j, ·), we can write ‖x_{t+1}(i) − x_{t+1}(j)‖_2 as an expectation under ξ of distances between points of X_t, each bounded by diam(X_t). We conclude with the coupling lemma, which guarantees the existence of a coupling such that Pr_ξ[Z_1 ≠ Z_2] = d_TV(P_t(i, ·), P_t(j, ·)) ≤ 1 − δ.

Given this result for general kernels whose tails can be lower bounded by some constant, we can further state a union bound result on kernels that maintain this lower bound over the entire diffusion condensation process. We recall at this point the update step typically used to expedite the condensation process when it reaches a slow-contraction meta-stable state. Previous work implemented this step with heuristics for updating the meta-parameter. Here, we provide further insights into the impact of this update step, to both justify it and suggest an update schedule that provides certain convergence guarantees via the following theorem.

Theorem 3.5. Suppose K_t(i, j) ≥ δ > 0 for all i, j and all t ∈ N. Then diam(X_t) ≤ (1 − δ)^t diam(X_0); in particular, for any ζ > 0, we have diam(X_t) ≤ ζ for all t ≥ t* := ⌈log(ζ / diam(X_0)) / log(1 − δ)⌉. Since t* is an integer, the ceiling suffices.

For specific forms of kernels, this result can be translated to suggest concrete ways of setting the kernel parameters at each timestep such that this bound holds, and diffusion condensation achieves well-behaved linear convergence.

Proposition 3.6. For the following kernels the specified bandwidth update suffices for the result in Theorem 3.5 to hold.
1. For the α-decay kernel exp(−‖x − y‖_2^α / ε^α), the schedule ε_t^α ≥ −diam(X_t)^α / log(δ) suffices. For α = 2, this defines the scheduling for the Gaussian kernel.
For α = 1, this defines a scheduling for the Laplace kernel.
2. For the density normalized kernel k_{ε,β}(x, y) combined with the α-decay kernel, we define ε_t^α ≥ −diam(X_t)^α / log(N^{2β} δ), and ε_t ≥ −diam(X_t)^2 / log(N^{2β} δ) for the Gaussian kernel. Then, Theorem 3.5 holds for all δ ∈ (0, 1/N^{2β}). The same holds if we replace N with q_{max,t} := max_i q(i).

Proof. To show (1), we need to define a scheduling of ε_t such that min_{x,y∈X_t} k_{ε_t}(x, y) ≥ δ. Since ‖x − y‖_2 ≤ diam(X_t), it suffices that exp(−diam(X_t)^α / ε_t^α) ≥ δ, which yields the stated schedule. To show (2), we note that the normalization satisfies q^β(x) q^β(y) ≤ N^{2β}. Therefore, we find ε_t such that min_{x,y} k_{ε_t}(x, y) ≥ N^{2β} δ and conclude in the same way as part (1).

Remark 3.7. We note that to ensure a constant rate of convergence in the diameter, the kernel bandwidth ε_t in Proposition 3.6 shrinks over time proportionally to the square of the diameter. This is in contrast to previous work [2], where a doubling schedule was used.

Remark 3.8. The previous results also hold for P_t^τ with τ > 1, as long as there exists δ > 0 such that all entries satisfy P_t^τ(i, j) > δ/N. Moreover, the rate of convergence with P_t^τ cannot be slower than with P_t, because we can always write X_{t+1} = P_t^τ X_t = P_t^{τ−1} P_t X_t. Finally, for some P_t that include zero probabilities, it is possible for P_t^τ to be strictly pointwise positive, hence Theorem 3.2 could be used for the process defined with P_t^τ instead of P_t.

4. Spectral properties of the condensation process. We complement the geometric perspective of condensation from the previous section with a spectral one, based on the idea of using an orthonormal basis to express any function f : X → R (abbreviated as f ∈ R^N) as a weighted sum of the eigenvectors of P_t for all t. This sum is then divided into two terms: a constant term and a nonconstant one. The former corresponds to the lowest frequency of a function; it is constant for all x ∈ X_t and for all t. The nonconstant term comprises the rest of the frequencies; it can vary depending on the eigenvectors of P_t. We extend this reasoning to the time-inhomogeneous diffusion P^{(t)} f. Our main result is Theorem 4.4, which provides an upper bound on the norm of the nonconstant term. For a specific choice of kernel and using the coordinate function, this bound converges to zero, hence in Corollary 4.5 we show how condensation converges to a single point. Before presenting the main theorem, subsection 4.1 introduces a simpler example to give insight into the structure and the challenges of the proof.

We consider a symmetric transition matrix A_t based on X_t, and the coordinate functions f_i(x), which return the i-th coordinate of x. In that case, A_t is known as a bistochastic matrix, and its stationary distribution is the uniform distribution. Since A_t is symmetric, its eigenvectors {φ_{t,i}}_{i=1}^N form an orthonormal basis of R^N. Moreover, its ordered eigenvalues {λ_{t,i}}_{i=1}^N are less than or equal to one, with λ_{t,1} = 1. Because A_t is row stochastic and λ_{t,1} = 1, we can define φ_{t,1} = N^{-1/2} 1, where 1 is a vector of ones of size N. Given these properties, we can write any function f ∈ R^N as f = Σ_{k=1}^N ⟨f, φ_{t,k}⟩ φ_{t,k}. By splitting this sum into two terms, we define the constant term L_t(f) := ⟨f, φ_{t,1}⟩ φ_{t,1}, which is therefore invariant through the iterations of condensation, and the nonconstant term H_t(f) := Σ_{k=2}^N ⟨f, φ_{t,k}⟩ φ_{t,k}. We note the resemblance with the graph Fourier transform, which uses the eigendecomposition of the Laplacian and treats the eigenvalues as frequencies and the eigenvectors as harmonics. Here, L_t can be thought of as the lowest-frequency term of the function, whereas H_t, which varies depending on the eigenvectors, can be seen as the higher frequencies of the function.
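As a numerical illustration of this decomposition, the following sketch builds an (approximately) symmetric bistochastic operator via Sinkhorn scaling of a Gaussian kernel, splits a coordinate function into its constant and nonconstant parts, and checks the contraction of the nonconstant part by the second largest eigenvalue; the operator construction and all parameters are assumptions made only for this example.

# Illustrative sketch of the L_t / H_t decomposition for a symmetric, bistochastic A.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 2))
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
A = np.exp(-sq / 0.5)
for _ in range(500):                       # Sinkhorn scaling towards a doubly stochastic matrix
    A /= A.sum(axis=1, keepdims=True)
    A /= A.sum(axis=0, keepdims=True)
A = 0.5 * (A + A.T)                        # enforce exact symmetry of the (approximate) limit

N = A.shape[0]
ones = np.ones(N) / np.sqrt(N)             # phi_{t,1}: constant eigenvector with eigenvalue ~1
f = X[:, 0]                                # coordinate function f_1
L = (f @ ones) * ones                      # constant (lowest-frequency) term L_t(f)
H = f - L                                  # nonconstant term H_t(f)

lam2 = np.linalg.eigvalsh(A)[-2]           # second largest eigenvalue of the symmetric operator
Af = A @ f
H_after = Af - (Af @ ones) * ones
print(np.linalg.norm(H_after), abs(lam2) * np.linalg.norm(H))   # ||H(Af)|| <= |lambda_2| ||H(f)||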
Since L_t(f) is constant during condensation, showing the convergence of the process is equivalent to showing that ‖H_t(A^{(t)} f)‖_2 tends to zero as t tends to infinity. Indeed, if this is true, by using the coordinate function f_i we have lim_{t→∞} A^{(t)} f_i = ⟨f_i, φ_{t,1}⟩ φ_{t,1} = ((1/N) Σ_j f_i(x(j))) 1, thus the process converges to the mean of the N data points. What is left to show is that the norm of the term H_t(A^{(t)} f) indeed converges to zero as t goes to infinity. After the first condensation step, we have the bound ‖H_1(A_0 f)‖_2 ≤ |λ_{0,2}| ‖H_0(f)‖_2, and it can be deduced that ‖H_t(A^{(t)} f)‖_2 ≤ (Π_{k=0}^{t−1} |λ_{k,2}|) ‖H_0(f)‖_2. If this product tends to zero as t tends to infinity, we could conclude that the process converges to a point. Our situation is more complex since many kernels are not symmetric, so their eigenvectors do not form an orthonormal basis of R^N. Here, we also benefited from the fact that the kernels were bistochastic, hence they all had the same (uniform) stationary distribution. Generally, each kernel P_t has a different stationary distribution. This will be reflected in the upper bound, as we have to consider the distance between two consecutive stationary distributions.

4.2. A general condensation process. In general, we consider the broader class of diffusion operators defined in subsection 2.1 by P_t = D_t^{-1} K_t. We recall that K_t is symmetric and 0 ≤ K_t(i, j) ≤ 1. Moreover, the stationary distribution associated to P_t is π_t(i) = ‖d_t‖_1^{-1} d_t(i), and P_t is d_t-reversible, i.e., d_t(i) P_t(i, j) = d_t(j) P_t(j, i). Thus, its associated operator is self-adjoint with respect to the inner product ⟨f, g⟩_{d_t} = Σ_i f(x(i)) g(x(i)) d_t(i). Denote by {ψ_{t,i}}_{i=1}^N the eigenvectors of P_t and by {λ_{t,i}}_{i=1}^N its eigenvalues arranged in decreasing order. Because P_t is self-adjoint with respect to ⟨·, ·⟩_{d_t}, its normalized eigenvectors satisfy ⟨ψ_{t,i}, ψ_{t,j}⟩_{d_t} = δ_ij, where δ_ij = 1 if i = j and 0 otherwise. Therefore, we can write any function f ∈ R^N as f = Σ_{k=1}^N ⟨f, ψ_{t,k}⟩_{d_t} ψ_{t,k}. Following the same steps as in subsection 4.1, we want to find a constant term of the function. Since P_t is row stochastic, and because λ_{t,1} = 1, we get ψ_{t,1} = c 1 where c is a constant. We can solve for c using ⟨ψ_{t,1}, ψ_{t,1}⟩_{d_t} = ⟨c1, c1⟩_{d_t} = 1, which yields c = ‖d_t‖_1^{-1/2}. We thus define the constant term as L_t(f) := ⟨f, ψ_{t,1}⟩_{d_t} ψ_{t,1} = ⟨f, 1⟩_{π_t} 1, and the nonconstant term as the rest of the sum, which varies depending on the eigenvectors, H_t(f) := Σ_{k=2}^N ⟨f, ψ_{t,k}⟩_{d_t} ψ_{t,k}. We can write P_t f = ⟨f, 1⟩_{π_t} 1 + Σ_{k≥2} λ_{t,k} ⟨f, ψ_{t,k}⟩_{d_t} ψ_{t,k}, by using the fact that P_t is self-adjoint and λ_{t,1} = 1. Most importantly, we note that the constant term of the function is not affected by condensation, i.e., L_t(P_t f) = L_t(f). Consequently, to show that the condensation process converges to a single point, it is sufficient to show

(4.2) lim_{t→∞} ‖H_t(P^{(t)} f)‖_2 = 0.

Indeed, if (4.2) holds, then for a constant C, we get lim_{t→∞} P^{(t)} f_i(x) = C for all coordinates i ∈ {1, . . . , d}, and every x ∈ X_0. To achieve our goal, in Theorem 4.4 we find an upper bound on ‖H_t(P^{(t)} f)‖_2, which we use in Corollary 4.5 to show convergence of the condensation process. Before presenting these results, we introduce several lemmas, whose proofs are provided in the supplementary materials. Lemma 4.1 is the same as the upper bound we found in subsection 4.1, and will be beneficial when used recursively. Lemma 4.2 is necessary since each P_t possibly has a different degree d_t, and it enables a change of measure to ‖·‖_{d_s} from ‖·‖_{d_t}. Lastly, Lemma 4.3 is beneficial when combined with Lemma 4.1.
Lemma 4.1. For the operator P_t and its second largest eigenvalue λ_{t,2}, we have the following bound on the norm of the nonconstant term: ‖H_t(P_t f)‖_{d_t} ≤ |λ_{t,2}| ‖H_t(f)‖_{d_t}.

Lemma 4.2. For all functions f ∈ R^N, and two operators P_t and P_s, the norms ‖f‖_{d_t} and ‖f‖_{d_s} can be related by inequalities whose correction terms depend on the distance between the degree vectors d_t and d_s.

We are now ready to state and prove the main theorem of this section, providing an upper bound on the norm of the nonconstant part of a function. The upper bound mainly depends on two terms: the second largest eigenvalue of each condensation operator, and the distance between their stationary distributions. The convergence proof relies on this theorem.

Proof. We start by proving the auxiliary inequality (4.3), by induction. For t = 1, using Lemma 4.2 we obtain (4.4). Moreover, using the fact that L_t is unaffected by P_t, we obtain (4.6), where the intermediate inequality (4.5) follows from Lemma 4.3 and Lemma 4.1. Combining equations (4.4) and (4.6) yields (4.3) for t = 1. Now assume the claim is true up to t − 1; we want to show that it holds for t. The proof is similar to the base case: from Lemma 4.2 we obtain the analogue of (4.4), and, following the same steps as (4.6) with Lemma 4.3 and Lemma 4.1, we obtain the corresponding bound; we then prove (4.3) by applying the inductive hypothesis. We can now conclude the proof of the theorem: the inequalities (4.8) and (4.9) are obtained by Lemma 4.2 and Lemma 4.1, the inequality (4.10) is justified by (4.3), which we have just shown, and the last inequality is due to Lemma 4.2.

We recall our initial argument that to show the convergence of the condensation process it is sufficient to use the coordinate function f_i and to show that the norm of the nonconstant term ‖H_t(P^{(t)} f_i)‖_2 converges to zero. This is achieved in the following corollary.

Corollary 4.5. Suppose that λ_{t,2} ≤ 1 − δ for some δ ∈ (0, 1) and all t ∈ N, and that the sequence of degrees (d_t)_{t∈N} converges. Then the condensation process converges to a single point.

Proof. The upper bound of Theorem 4.4 applies with (1 − δ)^t in place of the product of eigenvalues, since λ_{t,2} ≤ 1 − δ for all t. Note that the quantities ‖d_0^{1/2}‖_∞ and ‖f_i‖_2 are both finite. Furthermore, by assumption, the sequence (d_t)_{t∈N} converges, thus ‖d_k − d_{k+1}‖_2 converges to zero as k tends to infinity, so the corresponding terms in the bound vanish. The upper bound therefore converges to zero since lim_{t→∞} (1 − δ)^t = 0, and because lim_{t→∞} P^{(t)} f_i = ⟨f_i, 1⟩_π 1, we conclude that all points have the same i-th coordinate for all i ∈ {1, . . . , d}.

We conclude this section by identifying kernels for which we can find analytic conditions that respect the assumptions of the previous corollary, hence producing a condensation process that converges to a single point. The assumption on the convergence of the sequence of degrees (d_t)_{t∈N} is verified for the kernels we consider in this paper (Laplace, Gaussian, α-decay). Indeed, since the entries of these kernels are all positive, by Theorem 3.2 the convex hulls of the datasets admit a limit; in that case, the sequence (d_t)_{t∈N} must admit a limit. As P_t is reversible with respect to π_t, we can use Prop. 1 of Diaconis and Stroock [11] to find an upper bound on the second largest eigenvalue. They show that λ_{t,2} ≤ 1 − 1/κ_t, where κ_t is an explicit constant depending on the transition probabilities, and d_{max,t} := max_i d_t(i). To respect the assumptions of Corollary 4.5, we define ε_t such that 1/κ_t ≥ δ. Thus, for the α-decay kernel (Definition 2.2), we must define a schedule of the bandwidth parameter ε_t such that ε_t^α ≥ −diam(X_t)^α / log(δ d_{max,t}), which extends the schedules for the Gaussian and Laplace kernels. For these kernels, we always need δ ∈ (0, 1/d_{max,t}), but we note that d_{max,t} ≤ N, thus avoiding the case where δ tends to 0 as t tends to infinity.
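Since P_t = D_t^{-1} K_t shares its spectrum with the symmetric conjugate D_t^{-1/2} K_t D_t^{-1/2}, the second largest eigenvalue λ_{t,2} can be tracked numerically along the process, as in the following sketch; the Gaussian kernel, fixed bandwidth, and number of steps are illustrative assumptions only.

# Illustrative sketch: track lambda_{t,2} of P_t along a short condensation run.
import numpy as np

def second_eigenvalue(K):
    d = K.sum(axis=1)
    S = K / np.sqrt(np.outer(d, d))          # D^{-1/2} K D^{-1/2}, same spectrum as P = D^{-1} K
    return np.linalg.eigvalsh(S)[-2]

rng = np.random.default_rng(3)
X, eps = rng.normal(size=(60, 2)), 0.5
running_product = 1.0
for t in range(20):
    sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
    K = np.exp(-sq / eps)
    lam2 = second_eigenvalue(K)
    running_product *= lam2                  # if lam2 <= 1 - delta uniformly, this tends to 0
    X = (K / K.sum(axis=1, keepdims=True)) @ X
    print(t, round(lam2, 4), round(running_product, 6))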
A similar result is obtained for the density normalized kernel k_{ε,β} (Definition 2.5), since max_{x∈X_t} Σ_y k_{ε,β}(x, y) ≤ N, and min_{x,y∈X_t} k_{ε,β}(x, y) can again be lower bounded by a function of the diameter. Combining these two bounds yields the corresponding requirement on the bandwidth for the density normalized kernel. We can find a similar schedule for each of the previous kernels, since min_{x,y} k_{ε_t}(x, y) can be lower bounded by a function of the diameter. For instance, we find ε_t ≥ −diam(X_t)^2 / log(δ N q_{max,t}^{2β}) for the anisotropic Gaussian kernel. This adaptive parametrization of the bandwidth parameter guarantees that the condensation process will converge to a point for these kernels.

Remark 4.6. These results can be generalized to P_t^τ, for any τ ∈ N. We can write ‖H_t(P_t^τ f)‖_{d_t} ≤ |λ_{t,2}|^τ ‖H_t(f)‖_{d_t} ≤ |λ_{t,2}| ‖H_t(f)‖_{d_t}, since |λ_{t,2}| ≤ 1. Thus, Theorem 4.4 can be used to prove convergence of the process.

Remark 4.7. Both Theorem 4.4 and Corollary 4.5 are valid for a broad class of diffusion operators, in particular those with finite support or, more generally, random walks on a graph. This differs from the geometric Theorem 3.2, which is restricted to strictly positive kernels. It is also possible to leverage information from the underlying structure of the data to characterize the convergence of the condensation process. For example, for a random walk on a graph, the second largest eigenvalue is influenced by the connectivity of the graph; a highly connected graph would yield a small eigenvalue, hence converging faster. For the box kernel, assuming the convergence of degrees, Corollary 4.5 can be used to analyze the convergence of the process by evaluating the second largest eigenvalue at different condensation times.

Remark 4.8. The degree convergence assumption we use (e.g., in Corollary 4.5) assumes that the process converges to a stable representation X_M, without any assumption on X_M. In practice, since the transition operators are contractive, we observe that this assumption is easily respected. Corollary 4.5, bounding the second largest eigenvalue, then guarantees that X_M is a single point.

5. Topological properties of the condensation process. Having previously proved convergence properties, we now take on a coarser perspective and characterize topological, i.e., structural, properties of the diffusion condensation process. To this end, we note that the multiresolution structure provided by diffusion condensation naturally relates to recent advances in using computational topology to understand the "shape" of data geometry at varying scales. To elucidate this connection, subsection 5.1 and subsection 5.2 introduce two perspectives for integrating topological information into the data geometry uncovered by the diffusion condensation process, i.e., (i) intrinsic topology for describing the topology of the diffusion condensation process itself, and (ii) ambient topology for describing each step of the diffusion condensation process. We call both descriptions diffusion homology; intrinsic diffusion homology constitutes a description of homological features of the overall diffusion condensation process, while ambient diffusion homology considers the topology of each step of the diffusion process in terms of analyzing a time-varying metric space (with time corresponding to diffusion iteration steps), thus closing the loop to the previously provided geometrical notions. We provide a brief review of relevant topological data analysis (TDA) notions in the supplementary material. Figure 5 and Figure 6 depict the two types of topological descriptions.
5.1. Intrinsic diffusion homology. The formulation of the diffusion condensation process, with its merge step if points are too close to each other, can be considered to induce changes in the topological structure of the datasets. This will result in one topological descriptor summarizing them. We first define a filtration that is intrinsic to the diffusion condensation process, being compatible with the algorithm in subsection 2.4.

Definition 5.1 (Intrinsic diffusion filtration). Given a merge threshold ζ ∈ R, we define the intrinsic diffusion filtration for t ∈ N as V_t(ζ) := ∪_{s ≤ t} V̂_s(ζ), where V̂_s(ζ) contains a 0-simplex {i} for every point index i and a 1-simplex {i, j} whenever ‖x_s(i) − x_s(j)‖_2 ≤ ζ; each simplex is weighted by the first iteration s at which it appears. The ordering of merge events is not guaranteed to be unique, but we obtain a consistent ordering by using the indices of the respective points. See the supplementary material for the definition of a filtration.

Figure 6: Left: An illustration of ambient diffusion homology for the "double annulus" dataset. Following different steps in the diffusion condensation process (upper row), we obtain a sequence of persistence diagrams (lower row) that summarize the one-dimensional topological features, i.e., the cycles, in the dataset. Right: The intrinsic diffusion homology (top) and the topological activity curve (bottom) of the dataset for comparison purposes.

Intuitively, the intrinsic diffusion filtration measures at which iteration step t two points move into their ζ-neighborhood for the first time. There are two differences to a "standard" Vietoris-Rips filtration. First, we enforce the nesting condition of a filtration by taking the union of all simplicial complexes for previous time steps; this is necessary because, depending on the threshold ζ, we cannot guarantee that points remain within a ζ-neighborhood. The second difference is that we filtrate over diffusion condensation iterations instead of distance thresholds. Since diffusion condensation results in changes of local distances, this filtration captures the intrinsic behavior of the process. For now, we only add 1-simplices and 0-simplices to every V_t(ζ), but the definition generalizes to higher-order simplices. We define the intrinsic diffusion homology as the persistent homology of V_t(ζ) under the weight function defined above. Intuitively, we initially treat each data point x(i) as a 0-simplex, creating its own homology class, and identify homology classes over different time steps, i.e., the homology class of x_t(i) and x_{t'}(i) for t ≠ t' is considered to be the same. As the geometry of the underlying point cloud changes during each iteration, points start to progressively cluster. Whenever a merge event happens (see line 9 in the diffusion condensation algorithm), we let the homology class corresponding to the vertex with the lower index continue, while we destroy the other homology class. Given our weight function, such an event results in a tuple of the form (0, t), where t denotes the diffusion condensation iteration. This can also be considered as a persistence diagram arising from a distance-based filtration of an abstract input dataset (hence, every tuple contains a 0; all homology classes, i.e., all vertices, are present at the start of the diffusion condensation process). Our weight function can be interpreted as a "temporal distance": the distance between a pair of vertices (i, j) is given by the value of t for which their spatial distance falls below the merge threshold ζ for the first time, i.e., d(i, j) := min{t | d(x_t(i), x_t(j)) ≤ ζ}. A convenient representation can be obtained using a persistence barcode [17], i.e., a representation in which the lifespan of each homology class is depicted using a bar.
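A minimal sketch of this construction is given below: the condensation loop records a zero-dimensional bar (0, t) for each merge event, yielding the intrinsic diffusion barcode. The Gaussian kernel, its bandwidth, and the merge threshold ζ are illustrative assumptions for the example only.

# Illustrative sketch of the intrinsic diffusion homology barcode (merge events -> bars).
import numpy as np

def intrinsic_barcode(X, eps=0.5, zeta=1e-2, T=200):
    X = X.copy()
    alive = np.arange(len(X))                 # representative (lowest-index) point of each class
    bars = []
    for t in range(1, T + 1):
        sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
        K = np.exp(-sq / eps)
        X = (K / K.sum(axis=1, keepdims=True)) @ X
        d = np.sqrt(np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))
        reps = np.unique(alive)
        for a in reps:
            for b in reps[reps > a]:
                if alive[a] == a and alive[b] == b and d[a, b] <= zeta:
                    bars.append((0, t))       # the class with the higher index dies at time t
                    X[alive == b] = X[a]      # snap the merged class to the representative point
                    alive[alive == b] = a
        if len(np.unique(alive)) == 1:        # everything has merged into a single class
            break
    return bars                               # barcode of the intrinsic diffusion homology

bars = intrinsic_barcode(np.random.default_rng(4).normal(size=(25, 2)))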
Longer bars indicate more prominent clusters or groupings. Figure 5 illustrates this for a "double annulus" dataset, which does not give rise to a complex set of clusters, as indicated by the existence of few long bars in such a barcode. Notice that the persistence pairing P corresponding to the intrinsic diffusion homology carries all the information about the hierarchy of merges obtained during the diffusion condensation process. Specifically, P consists of pairs of the form ({u}, {v, w}), where {u} is a vertex and {v, w} is an edge between two vertices. We can use these edges to construct a tree of merges, i.e., a dendrogram (see Figure 7). This perspective will be useful later on when we show how diffusion condensation generalizes existing hierarchical clustering methods.

Figure 7: Two dendrograms obtained on the "petals" dataset. The different condensation behavior exhibited by different kernels (see Figure 2) also manifests itself in the dendrograms.

5.2. Ambient diffusion homology. Ambient diffusion homology refers to topological features in each step of the diffusion condensation process. It results in a richer but also more complicated description of the diffusion condensation process. To describe ambient diffusion homology, we calculate a Vietoris-Rips complex for each point cloud X_t of the diffusion condensation process, denoting the Vietoris-Rips complex of X at diffusion time t as V_δ(X, t). In the following, we are primarily interested in proving that the topological features of V_δ(X, t) converge as the diffusion condensation process converges. Using the preceding theorem from section 3, we can bound the topological activity and prove convergence in terms of topological properties. Specifically, we prove that the bottleneck distance between different time steps is upper-bounded by the respective diameters of the point clouds.

Theorem 5.3. Let t ≤ t' refer to two iterations of the diffusion condensation process. Then, the persistence diagrams corresponding to the point clouds X_t and X_{t'} satisfy d_b(D_{X_t}, D_{X_{t'}}) ≤ diam(X_t).

Proof. Letting d_t := diam(X_t) and d_{t'} := diam(X_{t'}), we first observe that d_t ≥ d_{t'} for t ≤ t', i.e., as a consequence of the convergence theorem, the diameter is non-increasing as t increases. From Chazal et al. [4, 5], we obtain d_b(D_{X_t}, D_{X_{t'}}) ≤ 2 d_GH(X_t, X_{t'}), with d_GH(·, ·) denoting the Gromov-Hausdorff distance. Since d_GH(X_t, X_{t'}) ≤ (1/2) max{d_t, d_{t'}} [27, Proposition 5], we can simplify the bound to d_b(D_{X_t}, D_{X_{t'}}) ≤ max{d_t, d_{t'}}. Using t ≤ t' and the fact that d_t ≥ d_{t'}, we obtain d_b(D_{X_t}, D_{X_{t'}}) ≤ d_t.

As diffusion condensation converges to a point, we have lim_{t→∞} diam(X_t) = 0, and the bottleneck distance between consecutive datasets, i.e., d_b(D_{X_t}, D_{X_{t+1}}), also converges to 0. By contrast, d_GH(X_t, X_{t'}) ≥ (1/2)|d_t − d_{t'}| (the bound being tight in certain cases), implying that the bottleneck distance between consecutive time steps is never zero if the diameter changes. Since all point clouds are embedded into the same space, namely R^d, all preceding statements apply with the Hausdorff distance d_H(·, ·) replacing the Gromov-Hausdorff distance [6]. This distance has the advantage that we can easily evaluate it. We require one auxiliary lemma to replace the diameter bound in the previous proof.

Figure 8: Empirical convergence behavior of dataset diameters and the Hausdorff distance between consecutive steps i, i + 1 of the condensation process (as a proxy for the bottleneck distance). Convergence behavior with respect to the Hausdorff distance is not uniform and is characterized by some "jumps", indicating that the datasets change considerably between certain time steps, before achieving a stable configuration.
Lemma 5.4. Let X, Y be subsets of the same metric space, e.g., R^d, with conv(Y) ⊆ conv(X). Then d_H(X, Y) ≤ diam(X).

Proof. The Hausdorff distance is the smallest r such that thickening both X and Y by r makes each a subset of the thickening of the other. Since conv(Y) ⊆ conv(X), we have r ≤ diam(X).

As a consequence of this lemma and the preceding proof, we obtain a bound in terms of the Hausdorff distance and the diameter: for t ≤ t', we have d_b(D_{X_t}, D_{X_{t'}}) ≤ 2 d_H(X_t, X_{t'}) ≤ 2 diam(X_t).

5.3. Hierarchical clustering. While there are many linkage methods for measuring the association between clusters in agglomerative hierarchical clustering, we focus on the centroid method as it is the most relevant to diffusion condensation. Agglomerative clustering, and the centroid method specifically, is widely applied in phylogeny [12], sequence alignment [19], and the analysis of other datatypes [20]. In the centroid method, the distance between any two clusters a and b is defined as the distance between the centroids of the clusters. There are two natural definitions of the centroid for points in Euclidean space, leading to the unweighted pair group method with centroid mean (UPGMC) [30] and to the weighted version (WPGMC), also known as median linkage hierarchical clustering [18]. The unweighted centroid C_UPGMC is the centroid of all points in the cluster, C_UPGMC(a) = (1/|a|) Σ_{x∈a} x. In contrast, the weighted version depends on the parent clusters: suppose cluster a is formed by the merging of clusters b and c; then the centroid of a is defined as C_WPGMC(a) = (C(b) + C(c)) / 2. In either case, the distance between two clusters a and b is defined as the squared Euclidean distance between their centroids, i.e., D(a, b) = ‖C(a) − C(b)‖_2^2. This algorithm is detailed in Algorithm 5.1. For a given dataset, the UPGMC and WPGMC algorithms give a unique sequence of merges provided that at each iteration there exists a unique choice of a* and b* achieving the minimum distance between clusters. These methods are similar to diffusion condensation, and in certain situations equivalent; our next theorem makes this more precise.

Theorem 5.5. Let ζ = 0, ε_t = min_{x,y∈X_t, x≠y} ‖x − y‖_2 > 0, and let k_t(x, y) be the box kernel of Definition 2.1. In this case, diffusion condensation produces the same topological features as centroid agglomerative clustering (UPGMC), in both the intrinsic topology and the ambient topology (for δ = 0), i.e., V_0(X_t) = L_t for all t. Further, if ζ satisfies 0 < ζ < ε_t, then diffusion condensation is equivalent to median linkage agglomerative clustering (WPGMC).

Proof. To show the equivalence of diffusion condensation to the UPGMC algorithm, we show that (i) centroids in the UPGMC algorithm correspond to points in the diffusion condensation algorithm, and (ii) the same clusters, represented by their respective centroids, are merged in each iteration. For the first claim, we proceed by induction, providing a mapping from the diffusion condensation process to UPGMC. The claim is trivially true for t = 0 since all points are singleton clusters and therefore centroids. We now observe the behavior of the box kernel as we go from condensation step t to t + 1. By the induction hypothesis, all points are already centroids for UPGMC iteration t. The distance threshold ε_t is chosen such that, similar to the selection of a* and b* in UPGMC, only the centroids whose distance is minimal will be used in the box kernel calculation.
This ensures that both methods merge the same clusters in every step. Moreover, due to the structure of the box kernel, the kernel matrix K_t will consist of multiple ones in every row, corresponding to those points (centroids) whose distances fall below ε_t. Thus, P_t will shift all of these points to their corresponding midpoint, i.e., their centroid. This is equivalent to a merge between the two clusters in the UPGMC algorithm. Since ζ = 0, this will result in a sequence of merges. Without loss of generality, we assume that only a single pair of points is shifted to its respective centroid. (Footnote 2: We assume that pairwise distances in each step are unique. Said assumption also ensures that the selection of a* and b* in Algorithm 5.1 is unique, so it is a useful requirement. It does not decrease the generality of our argumentation, since a consistent ordering of merges can always be achieved, but it simplifies notation and discussion.) Thus, a tuple of the form (0, t) is created in the intrinsic diffusion homology persistence diagram (and the respective pairs are created in its persistence pairing). Therefore, points in the diffusion condensation process are equivalent to centroids in the UPGMC algorithm, and the same merges happen in each iteration. Hence, V_0(X_t), the Vietoris-Rips complex of X_t at scale 0, is a representation of the hierarchy of UPGMC at iteration t. Ambient diffusion homology, for different scales δ of the Vietoris-Rips complex, can therefore capture and assess inter-cluster relations. For the setting 0 < ζ < ε_t, the only difference is in the setting of the merge threshold. The proof follows the same logic as the previous case (assuming again that pairwise distances in each step are unique), except that the centroid locations are updated as the average of two points, instead of a weighted average of all points in the two clusters.

Remark 5.6. Theorem 5.5 implies that diffusion condensation converges to a point in N − 1 iterations, as each iteration reduces the number of unique point locations by one. Extending this logic to more general settings (where the number of unique points might not strictly decrease in each iteration) is not trivial and is left to future work.

Remark 5.7. This theorem motivates interpreting diffusion condensation as a soft hierarchical clustering method, particularly with other kernels and in situations where the general position assumption does not hold. When points are equally spaced and not naturally clusterable, we find diffusion condensation more appealing: for instance, consider the corners of a k-dimensional simplex, with all distances between points being equal. The only two "sensible" clusterings are k clusters of single points, or one cluster with k points. Performing agglomerative clustering on this dataset will result in an arbitrary binary tree over the data, where all levels of the tree result in meaningless clusters. Diffusion condensation with any radial kernel, schedule, and merge threshold will result in exactly these two clusterings.

6. Discussion. Diffusion condensation is a process that alternates between computing a data diffusion operator and applying the operator back on the data to gradually eliminate variation. In this paper, we analyzed the diffusion condensation process from two main perspectives: its convergence and the evolution of its shape through condensation steps. We found conditions guaranteeing the convergence of the process using both geometric and spectral arguments. The geometric argument shows that the convex hull of each iteration of the data after condensation shrinks in comparison to the previous iteration. The spectral argument shows that the second largest eigenvalue of the data graph bounds the nonconstant part of any function after application of the diffusion operator.
Our spectral results are of particular interest since they are valid for a broad family of diffusion operators creating a time-inhomogeneous process. Further, we used and extended tools from topological data analysis to characterize the evolution of the shape of the datasets during the condensation process. In particular, we defined an intrinsic diffusion filtration that operates on the data manifold, and studied the resulting intrinsic diffusion homology. This provides us with a summary of the topological features during the entire process. Since the process is guaranteed to converge, the filtration will sweep through the different resolutions of the data, hence providing meaningful details. With the ambient diffusion homology, we studied the topological features for a given condensation step, resulting in snapshots of topological characteristics of the process. We also showed that instances of diffusion condensation are equivalent to hierarchical clustering algorithms. Furthermore, we provided experiments showcasing the relevance of our analysis, specifically comparing the intrinsic and ambient homologies, and the usage of condensation for clustering purposes. In future work, the definition of the intrinsic filtration can be extended to a multidimensional filtration, for instance by identifying the cycles or considering path probabilities defined by the diffusion kernel.

Appendix A. Proof of Lemma 3.3. We shall show that no point from C \ X can be extremal. Take an arbitrary point v ∈ C \ X. By definition of C, v can be written as a convex combination of all points, i.e., v = Σ_{i=1}^N α_i x(i) with α_i ≥ 0 and Σ_{i=1}^N α_i = 1. In particular, since v ∉ X, each α_i satisfies α_i < 1. Let w := Σ_{i=2}^N (α_i / (1 − α_1)) x(i). This is a convex combination using points x(2), . . . , x(N), so we may express v as v = α_1 x(1) + (1 − α_1) w. Thus, v can be placed on a line segment between two points of the polytope and it is neither the start point nor the end point. As a consequence, the point v is not extremal.

Appendix B. Proof of Lemma 3.4. Proof. First we show the upper bound on the total variation distance between any two rows of the matrix P_t. The (i, j) entry of P_t is given by P_t(i, j) = K_t(i, j) / d_t(i), where K_t is constructed using some kernel, and 1 ≥ K_t(i, j) ≥ δ > 0. Since d_t(i) ≤ N, a lower bound for the entries is P_t(i, j) ≥ δ/N. The total variation distance between any two rows of P_t is d_TV(P_t(i, ·), P_t(j, ·)) = (1/2) Σ_z |P_t(i, z) − P_t(j, z)| ≤ 1 − δ. After step t, two transformed data points are x_{t+1}(i) = Σ_z P_t(i, z) x_t(z) and x_{t+1}(j) = Σ_z P_t(j, z) x_t(z). Consider a pair of random variables Z_1 and Z_2 distributed according to P_t(i, ·) and P_t(j, ·) respectively, and a joint distribution ξ of (Z_1, Z_2) on [N] × [N]. Then ξ satisfies Σ_{z_2∈[N]} ξ(z_1, z_2) = P_t(i, z_1) and Σ_{z_1∈[N]} ξ(z_1, z_2) = P_t(j, z_2). The distance between the two points after step t can thus be written as ‖x_{t+1}(i) − x_{t+1}(j)‖_2 = ‖Σ_{z_1, z_2} ξ(z_1, z_2) (x_t(z_1) − x_t(z_2))‖_2 ≤ Σ_{z_1≠z_2} ξ(z_1, z_2) diam(X_t). Using the coupling lemma, we choose the optimal coupling (Z_1, Z_2), so that Σ_{z_1≠z_2} ξ(z_1, z_2) = d_TV(P_t(i, ·), P_t(j, ·)). Then we obtain the upper bound for the distance between any two points after step t: ‖x_{t+1}(i) − x_{t+1}(j)‖_2 ≤ (1 − δ) diam(X_t).

Appendix C. Proof of Lemma 4.1. Proof. Recall that we can write P_t f = ⟨f, 1⟩_{π_t} 1 + Σ_{k=2}^N λ_{t,k} ⟨f, ψ_{t,k}⟩_{d_t} ψ_{t,k}, so that H_t(P_t f) = Σ_{k=2}^N λ_{t,k} ⟨f, ψ_{t,k}⟩_{d_t} ψ_{t,k}. By orthonormality of the eigenvectors with respect to ⟨·, ·⟩_{d_t}, and since |λ_{t,k}| ≤ |λ_{t,2}| for k ≥ 2, we obtain ‖H_t(P_t f)‖_{d_t}^2 = Σ_{k=2}^N λ_{t,k}^2 ⟨f, ψ_{t,k}⟩_{d_t}^2 ≤ λ_{t,2}^2 ‖H_t(f)‖_{d_t}^2, which concludes the proof.

Appendix D. Proof of Lemma 4.2. Proof. We start by proving the first inequality, which relates ‖f‖_{d_s} and ‖f‖_{d_t} for two time indices s and t. For the proof of the second inequality, we first need to note that 1/d_t(i) ≤ 1 for all i ∈ {1, . . . , N}. Indeed, since K_t(i, i) = 1, we have d_t(i) = Σ_j K_t(i, j) ≥ 1. The second inequality then follows from (D.1), which is obtained by the fact that 1/d_t(i) ≤ 1.
Thus, we can conclude the second inequality.

Appendix E. Proof of Lemma 4.3. Proof. Since the first eigenvalue of P_t is λ_{t,1} = 1, we have L_t(P_t f) = L_t(f), and we can write P_t f = ⟨f, 1⟩_{π_t} 1 + Σ_{k=2}^N λ_{t,k} ⟨f, ψ_{t,k}⟩_{d_t} ψ_{t,k}. Substituting the previous expression in L_s(P_t f) yields L_s(P_t f) = ⟨f, 1⟩_{π_t} ⟨1, 1⟩_{π_s} 1 + Σ_{k=2}^N λ_{t,k} ⟨f, ψ_{t,k}⟩_{d_t} ⟨ψ_{t,k}, 1⟩_{π_s} 1 = ⟨f, 1⟩_{π_t} 1 + ‖d_s‖_1^{-1} Σ_{k=2}^N λ_{t,k} ⟨f, ψ_{t,k}⟩_{d_t} ⟨ψ_{t,k}, 1⟩_{d_s} 1, where the last equality is due to the fact that ⟨1, 1⟩_{d_s} = ‖d_s‖_1. This last term can be upper bounded by observing that ‖c1‖_{d_t} = |c| ‖d_t‖_1^{1/2} and N ≤ ‖d_t‖_1 ≤ N^2. To finish the proof, we bound the remaining term in brackets and conclude by substituting the previous bound in inequality (E.1).

Appendix F. Topological data analysis. This section provides a brief introduction to the most relevant concepts in the emerging field of topological data analysis, namely (i) simplicial homology, (ii) persistent homology, and (iii) their calculation in the context of point clouds. We refer readers to Edelsbrunner and Harer [13] for a comprehensive description of these topics.

Simplicial homology. Simplicial homology refers to a way of assigning connectivity information to topological objects, such as manifolds, which are represented by simplicial complexes. A simplicial complex K is a set of simplices of various dimensions. These may be considered as subsets of an index set, with nomenclature typically referring to vertices (dimension 0), edges (dimension 1), and triangles (dimension 2). The subsets of a simplex σ ∈ K are referred to as its faces, and every face τ of σ needs to satisfy τ ∈ K. Moreover, any non-empty intersection of two simplices also needs to be part of the simplicial complex, i.e., σ ∩ σ' ≠ ∅ for σ, σ' ∈ K implies σ ∩ σ' ∈ K. Therefore, K is "closed under calculating the faces of a simplex."

Chain groups. To characterize simplicial complexes, it is necessary to imbue them with additional algebraic structure. For a simplicial complex K, let C_d(K) be the vector space generated over Z_2 (the field with two elements), also known as the chain group in dimension d. The elements of C_d(K) are the d-simplices in K and their formal sums, with coefficients in Z_2. For instance, σ + τ is an element of the chain group, also called a simplicial chain. Addition is well-defined and easy to implement, since a simplex can only be present or absent over Z_2 coefficients. The use of chain groups lies in providing the underlying vector space to formalize boundary calculations over a simplicial complex, which in turn are required for defining connectivity.

Boundary homomorphism and homology groups. Given a d-simplex σ = {v_0, . . . , v_d} ∈ K, its boundary is defined in terms of the boundary operator ∂_d as ∂_d σ := Σ_{i=0}^d {v_0, . . . , v_{i−1}, v_{i+1}, . . . , v_d}, i.e., we leave out every vertex v_i of the simplex once. This is a map between chain groups, and since only sum operations are involved, it is readily seen to be a homomorphism. By linearity, we can extend this calculation to C_d(K). The boundary homomorphism gives us a way to precisely define connectivity by means of calculating its kernel and image. Notice that the kernel ker ∂_d contains all d-dimensional simplicial chains that do not have a boundary. Finally, the dth homology group H_d(K) of K is defined as the quotient group H_d(K) := ker ∂_d / im ∂_{d+1}. It contains all topological features, represented using simplicial chains, that have no boundary while also not being the boundary of a higher-dimensional simplex. Colloquially, the homology group therefore measures the "holes" in K.
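As a small worked example (not taken from the paper), the boundary matrix of the hollow triangle, i.e., three vertices and three edges but no 2-simplex, can be reduced over Z_2 to obtain the ranks that determine its homology; the resulting Betti numbers, introduced next, are β_0 = 1 (one connected component) and β_1 = 1 (one cycle).

# Illustrative example: Betti numbers of the hollow triangle via Z_2 boundary matrices.
import numpy as np

def rank_gf2(M):
    """Rank of a binary matrix over Z_2 (Gaussian elimination mod 2)."""
    M = M.copy() % 2
    rank, row = 0, 0
    for col in range(M.shape[1]):
        pivot = next((r for r in range(row, M.shape[0]) if M[r, col]), None)
        if pivot is None:
            continue
        M[[row, pivot]] = M[[pivot, row]]
        for r in range(M.shape[0]):
            if r != row and M[r, col]:
                M[r] = (M[r] + M[row]) % 2
        row += 1
        rank += 1
    return rank

# partial_1: columns are the edges {a,b}, {b,c}, {a,c}; rows are the vertices a, b, c
d1 = np.array([[1, 0, 1],
               [1, 1, 0],
               [0, 1, 1]])
n_vertices, n_edges, rank_d2 = 3, 3, 0        # no 2-simplices, so partial_2 has rank 0
beta_0 = n_vertices - rank_gf2(d1)            # 3 - 2 = 1
beta_1 = (n_edges - rank_gf2(d1)) - rank_d2   # dim ker(partial_1) - rank(partial_2) = 1
print(beta_0, beta_1)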
Betti numbers. The rank of the $d$th homology group is an important invariant of a simplicial complex, known as the $d$th Betti number $\beta_d$, i.e., $\beta_d(K) := \operatorname{rank} H_d(K)$. The sequence of Betti numbers $\beta_0, \ldots, \beta_d$ of a $d$-dimensional simplicial complex is commonly used to discriminate between manifolds. For example, a 2-sphere has Betti numbers $(1, 0, 1)$, while a 2-torus has Betti numbers $(1, 2, 1)$. Betti numbers are limited in expressivity when dealing with real-world data sets because they are highly dependent on a specific choice of simplicial complex $K$. This limitation prompted the development of persistent homology.

Persistent homology. Persistent homology is an extension of simplicial homology. At its core, it employs filtrations to imbue a simplicial complex $K$ with scale information, resulting in multi-scale topological information. We assume the existence of a function $f \colon K \to \mathbb{R}$ that only attains a finite number of function values $f^{(0)} \leq f^{(1)} \leq \cdots \leq f^{(m-1)} \leq f^{(m)}$. This permits us to sort $K$ according to $f$, for example by extending $f$ to higher-dimensional simplices via $f(\sigma) := \max_{v \in \sigma} f(v)$, leading to a nested sequence of simplicial complexes
\[ (F.2) \qquad \emptyset = K^{(0)} \subseteq K^{(1)} \subseteq \cdots \subseteq K^{(m-1)} \subseteq K^{(m)} = K, \]
where $K^{(i)} := \{\sigma \in K \mid f(\sigma) \leq f^{(i)}\}$. Each of these simplicial complexes therefore only contains those simplices whose function value is less than or equal to the threshold. In contrast to simplicial homology, the filtration is more expressive, because it permits us to track changes. For instance, a topological feature might be created (a new connected component might arise) or destroyed (two connected components might merge into one), as we pass from $K^{(i)}$ to $K^{(i+1)}$.
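As a small illustration of how features are created and destroyed along a filtration, the following Python sketch (our own example based on a union-find structure, not the paper's diffusion-homology code) inserts edges in order of their filtration value and records a creation and destruction value whenever two connected components merge; this is the zero-dimensional information that persistence diagrams, described next, summarize.

```python
def zero_dim_persistence(num_vertices, weighted_edges):
    """0-dimensional persistence pairs of a filtration built on a vertex set.

    All vertices are assumed to enter at filtration value 0; each edge (u, v, w)
    enters at value w. A merge of two components yields the pair (0, w)."""
    parent = list(range(num_vertices))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    pairs = []
    for u, v, w in sorted(weighted_edges, key=lambda e: e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:                        # two components merge: one feature dies
            parent[ru] = rv
            pairs.append((0.0, w))
    # Components never merged by the end of the filtration are essential features.
    essential = sum(1 for x in range(num_vertices) if find(x) == x)
    return pairs, essential

# Four points on a line with unit spacing: three merges, one essential component.
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0), (0, 3, 3.0)]
print(zero_dim_persistence(4, edges))   # ([(0.0, 1.0), (0.0, 1.0), (0.0, 1.0)], 1)
```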
Persistent homology provides a principled way of tracking topological features, representing each one by a creation and destruction value $(f^{(i)}, f^{(j)}) \in \mathbb{R}^2$ based on the filtration function, with $i \leq j$. In case a topological feature is still present at the end of the filtration, we refer to the feature as being essential. These features are the ones that are counted for the Betti number calculation. It is also possible to obtain only tuples with finite persistence values, a process known as extended persistence [9], but we eschew this concept in this work for reasons of computational complexity. Every filtration induces an inclusion homomorphism between consecutive complexes $K^{(i)} \subseteq K^{(i+1)}$. The respective boundary homomorphisms in turn induce a homomorphism between corresponding homology groups of the simplicial complexes of the filtration. These are maps of the form $H_d(K^{(i)}) \to H_d(K^{(j)})$ for $i \leq j$. This family of homomorphisms gives rise to a sequence of homology groups; in particular, for $i \leq j$ we obtain the $d$th persistent homology group $H_d^{(i,j)}$ as the image of the map $H_d(K^{(i)}) \to H_d(K^{(j)})$. This group affords an intuitive description: it contains all homology classes created in $K^{(i)}$ that are still present in $K^{(j)}$. We can now define a variant of the aforementioned Betti numbers, the $d$th persistent Betti number, namely $\beta_d^{(i,j)} := \operatorname{rank} H_d^{(i,j)}$. Since the persistent Betti numbers are indexed by $i$ and $j$, we can consider persistent homology as a way of generating a sequence of Betti numbers, as opposed to just calculating one single number. This sequence can be summarized in a persistence diagram.

Persistence diagrams and pairings. Given a filtration induced by a function $f \colon K \to \mathbb{R}$ as described above, each tuple $(f^{(i)}, f^{(j)})$ is stored with a multiplicity $\mu_d^{(i,j)}$ derived from the persistent Betti numbers; for most index pairs $\mu_d^{(i,j)} = 0$, so the practical number of tuples is not quadratic in the number of function values. For a point $(x, y)$ in the $d$th persistence diagram $D_d$, we refer to the quantity $\operatorname{pers}(x, y) := |y - x|$ as its persistence. The idea of persistence arose in multiple contexts [1, 14, 33], but it is nowadays commonly used to analyze functions on manifolds, where high persistence is seen to correspond to features of the function, while low persistence is typically considered noise. Finally, we remark that persistence diagrams keep track of topological features by associating them with tuples. In this perspective, the identity of topological features, i.e., the pair of simplices involved in their creation or destruction, is lost. A persistence pairing $P$ rectifies this by storing tuples of simplices $(\sigma, \tau)$, where $\sigma$ is a $k$-simplex (the creator of the feature) and $\tau$ is a $(k+1)$-simplex (the destroyer of the feature). This pairing is known to be unique in the sense that a simplex can either be a creator or a destroyer, but not both [14]. The persistence pairing and the persistence diagram are equivalent if and only if the filtration is injective on the level of 0-simplices, i.e., there are no duplicate filtration values. In the intrinsic diffusion homology, we make use of the pairing to track the hierarchical information created during the diffusion condensation process.

Distances and stability. Persistence diagrams can be endowed with a metric, known as the bottleneck distance. This metric is used to assess the stability of persistence diagrams with respect to perturbations of their input function. For two persistence diagrams $D$ and $D'$, their bottleneck distance is calculated as
\[ (F.6) \qquad d_B(D, D') := \inf_{\eta \colon D \to D'} \sup_{x \in D} \|x - \eta(x)\|_\infty, \]
where $\eta \colon D \to D'$ denotes a bijection between the point sets of both diagrams, and $\|\cdot\|_\infty$ refers to the $L_\infty$ metric between two points in $\mathbb{R}^2$. Calculating (F.6) requires solving an optimal assignment problem; recent work [21] discusses efficient approximation strategies. The primary appeal of the bottleneck distance is that it can be related to the Hausdorff distance, thus building a bridge between geometry and topology. A seminal stability theorem [8] states that distances between persistence diagrams are bounded by the distance of the functions that give rise to them: given a simplicial complex $K$ and two monotonic functions $f, g \colon K \to \mathbb{R}$, their corresponding persistence diagrams $D_f$ and $D_g$ satisfy
\[ (F.7) \qquad d_B(D_f, D_g) \leq \|f - g\|_\infty, \]
where $\|f - g\|_\infty$ refers to the Hausdorff distance between the two functions. In section 5, we make use of (F.7) and a more generic stability bound when we characterize diffusion condensation in topological terms.

Vietoris-Rips complexes. The formulation of persistent homology hinges on the generation of a simplicial complex. We will use the Vietoris-Rips complex, a classical construction [34] that requires a distance threshold $\delta$ and a metric $d(\cdot, \cdot)$ such as the Euclidean distance. The Vietoris-Rips complex at scale $\delta$ of an input data set is defined as $V_\delta(X) := \{\sigma \subseteq X \mid d(x(i), x(j)) \leq \delta \text{ for all } x(i), x(j) \in \sigma\}$, i.e., $V_\delta(X)$ contains all subsets of the input space whose pairwise distances are less than or equal to $\delta$. In this formulation, each simplex of $V_\delta$ is assigned a weight according to the maximum distance of its vertices, leading to $w(\sigma) := \max_{\{x(i), x(j)\} \subseteq \sigma} d(x(i), x(j))$ and $w(\tau) = 0$ for 0-simplices $\tau$. Other weight assignment strategies are also possible, but this distance-based assignment enjoys stability properties [5], similar to (F.7). Letting $V_\delta(X)$ and $V_\delta(Y)$ refer to the Vietoris-Rips complexes of two spaces $X, Y$, their corresponding persistence diagrams $D_X, D_Y$ satisfy $d_B(D_X, D_Y) \leq 2\, d_{GH}(X, Y)$, with $d_{GH}(\cdot, \cdot)$ denoting the Gromov-Hausdorff distance. This bound, originally due to Chazal et al. [4, 5], is useful in relating geometrical and topological properties of the diffusion condensation process in Theorem 5.3.
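The definition of $V_\delta(X)$ translates directly into code. The sketch below is our own brute-force illustration, practical only for small point clouds and a hypothetical choice of scale; it enumerates candidate simplices up to a fixed dimension, keeps those whose pairwise distances stay within $\delta$, and assigns each the maximum-distance weight $w(\sigma)$.

```python
import numpy as np
from itertools import combinations

def vietoris_rips(X, delta, max_dim=2):
    """Brute-force Vietoris-Rips complex of a point cloud X at scale delta.

    Returns a dict mapping each simplex (tuple of point indices) to its weight
    w(sigma) = max pairwise distance of its vertices (0 for vertices)."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    complex_ = {(i,): 0.0 for i in range(n)}
    for d in range(1, max_dim + 1):
        for sigma in combinations(range(n), d + 1):
            pairwise = [dist[i, j] for i, j in combinations(sigma, 2)]
            if max(pairwise) <= delta:          # all pairwise distances within delta
                complex_[sigma] = max(pairwise)
    return complex_

# Four points at the corners of a unit square.
X = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
vr = vietoris_rips(X, delta=1.0)
print(sorted(s for s in vr if len(s) == 2))     # the four sides; diagonals exceed delta
```

Sweeping $\delta$ upward and feeding these weights into the persistence machinery described above yields the Vietoris-Rips persistence diagrams that appear in the stability bound above.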
References.

The framed Morse complex and its invariants
Coarse Graining of Data via Inhomogeneous Diffusion Condensation
Zigzag persistent homology and real-valued functions
Proximity of persistence modules and their diagrams
Persistence stability for geometric complexes
Subsampling methods for persistent homology
Mean shift, mode seeking, and clustering
Stability of persistence diagrams
Extending persistence using Poincaré and Lefschetz duality
Diffusion maps
Geometric bounds for eigenvalues of Markov chains
Biological sequence analysis: probabilistic models of proteins and nucleic acids
Computational topology: An introduction
Topological persistence and simplification
Geodesic exponential kernels: When curvature and linearity conflict
The estimation of the gradient of a density function, with applications in pattern recognition
Barcodes: The persistent topology of data
A comparison of some methods of cluster analysis
MAFFT: A novel method for rapid multiple sequence alignment based on fast Fourier transform
Finding groups in data: an introduction to cluster analysis
Geometry helps to compare persistence diagrams
Topological analysis of single-cell data reveals shared glial landscape of macular degeneration and neurodegenerative diseases, bioRxiv
Multiscale PHATE identifies multimodal signatures of COVID-19
Convex sets and their applications, Courier Corporation
Learning by unsupervised nonlinear diffusion
Time coupled diffusion maps
On the use of Gromov-Hausdorff distances for shape comparison
Visualizing structure and transitions in high-dimensional biological data
A statistical method for evaluating systematic relationships
Regularization on graphs with function-adapted diffusion processes
Recovering gene interactions from single-cell data using data diffusion
On the use of size functions for shape analysis
Über den höheren Zusammenhang kompakter Räume und eine Klasse von zusammenhangstreuen Abbildungen
A tutorial on spectral clustering