Higher-Order Spectral Clustering of Directed Graphs

Steinar Laenen, He Sun

2020-11-10

Abstract: Clustering is an important topic in algorithms, and has a number of applications in machine learning, computer vision, statistics, and several other research disciplines. Traditional objectives of graph clustering are to find clusters with low conductance. Not only are these objectives applicable only to undirected graphs, they are also incapable of taking the relationships between clusters into account, which could be crucial for many applications. To overcome these downsides, we study directed graphs (digraphs) whose clusters exhibit further "structural" information amongst each other. Based on the Hermitian matrix representation of digraphs, we present a nearly-linear time algorithm for digraph clustering, and further show that our proposed algorithm can be implemented in sublinear time under reasonable assumptions. The significance of our theoretical work is demonstrated by extensive experimental results on the UN Comtrade Dataset: the output clustering of our algorithm exhibits not only how the clusters (sets of countries) relate to each other with respect to their import and export records, but also how these clusters evolve over time, in accordance with known facts in international trade.

1 Introduction

Clustering is one of the most fundamental problems in algorithms and has applications in many research fields including machine learning, network analysis, and statistics. Data can often be represented by a graph (e.g., users in a social network, servers in a communication network), and this makes graph clustering a natural choice for analysing these datasets. Over the past three decades, most studies on undirected graph clustering have focused on the task of partitioning with respect to edge densities, i.e., vertices form a cluster if they are better connected to each other than to the rest of the graph. The well-known normalised cut value [25] and graph conductance [20] capture these classical definitions of clusters, and have become the objective functions of most undirected graph clustering algorithms. While the design of these algorithms has received a lot of attention from both theoretical and applied research areas, these algorithms are usually unable to uncover higher-order structural information among clusters in directed graphs (digraphs).

For example, consider the international oil trade network [29], which employs digraphs to represent how mineral fuels and oils are imported and exported between countries. Although this highly connected digraph presents little cluster structure with respect to a typical objective function of undirected graph clustering, from an economic point of view this digraph clearly exhibits a structure of clusters: there is a cluster of countries mainly exporting oil, a cluster mainly importing oil, and several clusters in the middle of this trade chain. All these clusters are characterised by the imbalance of the edge directions between clusters, and further present a clear ordering reflecting the overall trade pattern. This type of structure is found not only in trade data, but also in many other types of data such as migration data and infectious disease spreading data.
We view these types of patterns as a higher-order structure among the clusters and, in our view, this structural information could be as important as the individual clusters themselves.

Our contribution. In this work we study clustering algorithms for digraphs whose cluster structure is defined with respect to the imbalance of edge densities as well as the edge directions between the clusters. Formally, for any sets of vertices $S_0, \ldots, S_{k-1}$ that form a partition of the vertex set $V(G)$ of a digraph $G$, we define the flow ratio of $\{S_j\}_{j=0}^{k-1}$ by
$$\Phi_G(S_0, \ldots, S_{k-1}) \triangleq \sum_{j=1}^{k-1} \frac{w(S_j, S_{j-1})}{\mathrm{vol}(S_j) + \mathrm{vol}(S_{j-1})},$$
where $w(S, T) \triangleq \sum_{(u,v) \in E,\, u \in S,\, v \in T} w(u, v)$ is the cut value from $S \subset V$ to $T \subset V$ and $\mathrm{vol}(S)$ is the sum of the degrees of the vertices in $S$. We say that $\{S_j\}_{j=0}^{k-1}$ forms an optimal partition if this $\{S_j\}_{j=0}^{k-1}$ maximises the flow ratio over all possible partitions. By introducing a complex-valued representation of the graph Laplacian matrix $L_G$, we show that this optimal partition $\{S_j\}_{j=0}^{k-1}$ is well embedded into the bottom eigenspace of $L_G$. To further exploit this novel and intriguing connection, we show that an approximate partition with a bounded approximation guarantee can be computed in time nearly-linear in the number of edges of the input graph. In the setting where the degrees of the vertices are known in advance, we also present a sub-linear time implementation of the algorithm. The significance of our work is further demonstrated by experimental results on several synthetic and real-world datasets. In particular, on the UN Comtrade dataset our clustering results are well supported by the literature from other research fields. At the technical level, our analysis could be viewed as a hybrid between the proof of the Cheeger inequality [6] and the analysis of spectral clustering for undirected graphs [23], combined with a sequence of recent works on fast constructions of graph sparsifiers (e.g., [26]). We believe our analysis for the new Hermitian Laplacian $L_G$ could inspire future research on studying the clusters' higher-order structure using spectral methods.

Related work. There is a rich literature on spectral algorithms for graph clustering. For undirected graph clustering, the works most related to ours are [23, 25, 30]. For digraph clustering, [24] proposes to perform spectral clustering on the symmetrised matrix $A = MM^\top + M^\top M$ of the input graph's adjacency matrix $M$; [9] initiates the study of spectral clustering on complex-valued Hermitian matrix representations of digraphs, however their theoretical analysis only holds for digraphs generated from the stochastic block model. Our work is also linked to the analysis of higher-order structures of clusters in undirected graphs [4, 5, 31], and to community detection in digraphs [7, 21]. The main takeaway is that there is no previous work which analyses digraph spectral clustering algorithms to uncover the higher-order structure of clusters in a general digraph.

2 Preliminaries

Throughout the paper, we always assume that $G = (V, E, w)$ is a digraph with $n$ vertices, $m$ edges, and weight function $w : V \times V \rightarrow \mathbb{R}_{\geq 0}$. We write $u \rightarrow v$ if there is an edge from $u$ to $v$ in the graph. For any vertex $u$, the in-degree and out-degree of $u$ are defined as $d_u^{\mathrm{in}} \triangleq \sum_{v : v \rightarrow u} w(v, u)$ and $d_u^{\mathrm{out}} \triangleq \sum_{v : u \rightarrow v} w(u, v)$, respectively. We further define the total degree of $u$ by $d_u \triangleq d_u^{\mathrm{in}} + d_u^{\mathrm{out}}$, and define $\mathrm{vol}(S) \triangleq \sum_{u \in S} d_u$ for any $S \subseteq V$. For any sets of vertices $S$ and $T$, the symmetric difference between $S$ and $T$ is defined by $S \triangle T \triangleq (S \setminus T) \cup (T \setminus S)$.
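With this notation in place, the flow ratio of a given ordered partition can be computed directly from a weighted adjacency matrix. The following is a minimal sketch; the function and variable names are our own illustration and do not come from the paper.

```python
import numpy as np

def flow_ratio(W, partition):
    """Flow ratio of an ordered partition S_0, ..., S_{k-1} of a digraph.

    W is an n x n weighted adjacency matrix with W[u, v] = w(u, v) for an
    edge u -> v; `partition` is a list of k integer index arrays.
    """
    d = W.sum(axis=0) + W.sum(axis=1)       # total degree d_u = d_u^in + d_u^out
    vol = [d[S].sum() for S in partition]   # vol(S_j)
    phi = 0.0
    for j in range(1, len(partition)):
        cut = W[np.ix_(partition[j], partition[j - 1])].sum()  # w(S_j, S_{j-1})
        phi += cut / (vol[j] + vol[j - 1])
    return phi
```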
Given any digraph $G$ as input, we use $M \in \mathbb{R}^{n \times n}$ to denote the adjacency matrix of $G$, where $M_{u,v} = w(u,v)$ if there is an edge $u \rightarrow v$, and $M_{u,v} = 0$ otherwise. We use $A \in \mathbb{C}^{n \times n}$ to represent the Hermitian adjacency matrix of $G$, where $A_{u,v} = \overline{A_{v,u}} = w(u,v) \cdot \omega_k$ if $u \rightarrow v$, and $A_{u,v} = 0$ otherwise. Here, $\omega_k \triangleq e^{2\pi i / k}$ is the $k$-th root of unity, and $\overline{x}$ is the conjugate of $x$. The normalised Laplacian matrix of $G$ is defined by $L_G \triangleq I - D^{-1/2} A D^{-1/2}$, where the degree matrix $D \in \mathbb{R}^{n \times n}$ is defined by $D_{u,u} = d_u$, and $D_{u,v} = 0$ for any $u \neq v$. We sometimes drop the subscript $G$ if the underlying graph is clear from the context. For any Hermitian matrix $A \in \mathbb{C}^{n \times n}$ and non-zero vector $x \in \mathbb{C}^n$, the Rayleigh quotient $R(A, x)$ is defined as $R(A, x) \triangleq x^* A x / (x^* x)$, where $x^*$ is the complex conjugate transpose of $x \in \mathbb{C}^n$. For any Hermitian matrix $B \in \mathbb{C}^{n \times n}$, let $\lambda_1(B) \leq \ldots \leq \lambda_n(B)$ be the eigenvalues of $B$ with corresponding eigenvectors $f_1, \ldots, f_n$, where $f_j \in \mathbb{C}^n$ for any $1 \leq j \leq n$.
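For concreteness, the construction of the Hermitian adjacency matrix and the normalised Laplacian can be sketched as follows, assuming $\omega_k = e^{2\pi i/k}$ as above; the handling of isolated vertices and the helper name are our own choices.

```python
import numpy as np

def hermitian_laplacian(W, k):
    """Normalised Hermitian Laplacian L = I - D^{-1/2} A D^{-1/2} of a digraph.

    For every edge u -> v with weight w(u, v), A[u, v] = w(u, v) * omega and
    A[v, u] = w(u, v) * conj(omega), so A is Hermitian by construction.
    """
    omega = np.exp(2j * np.pi / k)
    A = W * omega + W.T * np.conj(omega)
    d = W.sum(axis=0) + W.sum(axis=1)             # total degrees
    inv_sqrt_d = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    return np.eye(W.shape[0]) - inv_sqrt_d[:, None] * A * inv_sqrt_d[None, :]
```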
3 Encoding the flow structure into $L_G$'s bottom eigenspace

Now we study the structure of clusters with respect to their flow imbalance, and their relation to the bottom eigenspace of the normalised Hermitian Laplacian matrix. For any sets of vertices $S_0, \ldots, S_{k-1}$, we say that $S_0, \ldots, S_{k-1}$ form a $k$-way partition of $V(G)$ if it holds that $\bigcup_{0 \leq j \leq k-1} S_j = V(G)$ and $S_j \cap S_\ell = \emptyset$ for any $j \neq \ell$. As discussed in Section 1, the primary focus of the paper is to study digraphs in which there are significant connections from $S_j$ to $S_{j-1}$ for any $1 \leq j \leq k-1$. To formalise this, we introduce the notion of the flow ratio of $\{S_j\}_{j=0}^{k-1}$, which is defined by
$$\Phi_G(S_0, \ldots, S_{k-1}) \triangleq \sum_{j=1}^{k-1} \frac{w(S_j, S_{j-1})}{\mathrm{vol}(S_j) + \mathrm{vol}(S_{j-1})}. \qquad (1)$$
We call a $k$-way partition $\{S_j\}_{j=0}^{k-1}$ an optimal clustering if its flow ratio achieves the maximum
$$\theta_k(G) \triangleq \max_{S_0, \ldots, S_{k-1}} \Phi_G(S_0, \ldots, S_{k-1}). \qquad (2)$$
Notice that, for any two consecutive clusters $S_j$ and $S_{j-1}$, the value $w(S_j, S_{j-1}) \cdot (\mathrm{vol}(S_j) + \mathrm{vol}(S_{j-1}))^{-1}$ evaluates the ratio of the total edge weight in the cut $(S_j, S_{j-1})$ to the total weight of the edges with endpoints in $S_j$ or $S_{j-1}$; moreover, only $k-1$ out of the $2\binom{k}{2}$ different cuts among $S_0, \ldots, S_{k-1}$ contribute to $\Phi_G(S_0, \ldots, S_{k-1})$ according to (1). We remark that, although the definition of $\Phi_G(S_0, \ldots, S_{k-1})$ shares some similarity with the normalised cut value for undirected graph clustering [25], in our setting an optimal clustering is one that maximises the flow ratio. This is in sharp contrast to most objective functions for undirected graph clustering, whose aim is to find clusters of low conductance (see Footnote 2 below). In addition, it is not difficult to show that our problem is NP-hard since, when $k = 2$, it is exactly the MAX DICUT problem studied in [15].

To study the relationship between the flow structure among $S_0, \ldots, S_{k-1}$ and the eigen-structure of the normalised Laplacian matrix of the graph, we define for every optimal cluster $S_j$ ($0 \leq j \leq k-1$) an indicator vector $\chi_j \in \mathbb{C}^n$ by $\chi_j(u) \triangleq \omega_k^j$ if $u \in S_j$ and $\chi_j(u) = 0$ otherwise. We further define the normalised indicator vector of $\chi_j$ by $\tilde{\chi}_j \triangleq D^{1/2}\chi_j / \|D^{1/2}\chi_j\|$, and set
$$y \triangleq \frac{1}{\sqrt{k}} \sum_{j=0}^{k-1} \tilde{\chi}_j. \qquad (3)$$
We highlight that, due to the use of complex numbers, a single vector $y$ is sufficient to encode the structure of $k$ clusters: this is quite different from the case of undirected graphs, where $k$ mutually perpendicular vectors are needed in order to study the eigen-structure of the graph Laplacian and the cluster structure [20, 23, 30]. In addition, by the use of roots of unity in (3), different clusters are separated from each other by angles, indicating that the use of a single eigenvector could be sufficient to approximately recover the $k$ clusters. Our result on the relationship between $\lambda_1(L_G)$ and $\theta_k(G)$ is summarised as follows:

Lemma 3.1. Let $G = (V, E, w)$ be a weighted digraph with normalised Hermitian Laplacian $L_G \in \mathbb{C}^{n \times n}$. Then, it holds that $\lambda_1(L_G) \leq 1 - \frac{4}{k} \cdot \theta_k(G)$. Moreover, $\theta_k(G) = k/4$ if $G$ is a bipartite digraph with all the edges having the same direction, and $\theta_k(G) < k/4$ otherwise.

Footnote 2: It is important to notice that, among the $2\binom{k}{2}$ cuts formed by pairwise different clusters, only $k-1$ cut values contribute to our objective function. If one took all of the $2\binom{k}{2}$ cut values into account, the objective function would involve $2\binom{k}{2}$ terms. However, even if most of these terms are much smaller than the ones along the flow, their sum could still be dominant, leaving little information about the structure of the clusters. Therefore, we should only take $k-1$ cut values into account when the clusters present a flow structure.

Notice that a bipartite graph $G$ with $\theta_k(G) = k/4$ is a trivial case for our problem; hence, without loss of generality we assume $\theta_k(G) < k/4$ in the following analysis. To study how the distribution of eigenvalues influences the cluster structure, similar to the case of undirected graphs we introduce the parameter $\gamma$ defined by
$$\gamma_k(G) \triangleq \frac{\lambda_2(L_G)}{1 - \frac{4}{k} \cdot \theta_k(G)}.$$
Our next theorem shows that the structure of clusters in $G$ and the eigenvector corresponding to $\lambda_1(L_G)$ can be approximated by each other with an approximation ratio inversely proportional to $\gamma_k(G)$.

Theorem 3.2. The following statements hold: (1) there is some $\alpha \in \mathbb{C}$ such that the vector $\hat{f}_1 = \alpha f_1$ satisfies $\|y - \hat{f}_1\|^2 \leq 1/\gamma_k(G)$; (2) there is some $\beta \in \mathbb{C}$ such that the vector $\hat{y} = \beta y$ satisfies $\|f_1 - \hat{y}\|^2 \leq 1/(\gamma_k(G) - 1)$.

In this section we discuss the algorithmic contribution of the paper. In Section 4.1 we describe the main algorithm and its efficient implementation based on nearly-linear time Laplacian solvers; we further present a sub-linear time implementation of our algorithm, assuming the degrees of the vertices are known in advance. The main technical ideas used in analysing the algorithms are discussed in Section 4.2.

Main algorithm. We have seen from Section 3 that the structure of clusters is approximately encoded in the bottom eigenvector of $L_G$. To exploit this fact, we propose to embed the vertices of $G$ into $\mathbb{R}^2$ based on the bottom eigenvector of $L_G$, and apply $k$-means on the embedded points. Our algorithm, which we call SimpleHerm, consists of only a few lines of code and is described as follows: (1) compute the bottom eigenvector $f_1 \in \mathbb{C}^n$ of the normalised Hermitian Laplacian matrix $L_G$; (2) embed every vertex $u$ into $\mathbb{C} \cong \mathbb{R}^2$ through the point $F(u) \triangleq f_1(u)/\sqrt{d_u}$; (3) apply a $k$-means algorithm to the embedded points $\{F(u)\}_{u \in V}$, and return the resulting partition $A_0, \ldots, A_{k-1}$. We remark that, although the entries of $L_G$ are complex-valued, some variants of graph Laplacian solvers can still be applied in our setting. For most practical instances, we have $k = O(\log^c n)$ for some constant $c$, in which regime our proposed algorithm runs in nearly-linear time. We refer the reader to [22] for a technical discussion of approximating $f_1$ in nearly-linear time.
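The pieces of SimpleHerm can be put together as in the sketch below, which uses a standard dense eigensolver in place of the nearly-linear-time Laplacian solvers discussed above; the degree normalisation $f_1(u)/\sqrt{d_u}$ in the embedding step follows our reconstruction of step (2), and the paper's exact scaling may differ.

```python
import numpy as np
from scipy.sparse.linalg import eigsh
from sklearn.cluster import KMeans

def simple_herm(W, k):
    """SimpleHerm sketch: bottom eigenvector of the Hermitian Laplacian,
    viewed as points in R^2, clustered with k-means."""
    L = hermitian_laplacian(W, k)                # from the earlier sketch
    _, vecs = eigsh(L, k=1, which='SA')          # bottom eigenvector f_1
    f1 = vecs[:, 0]
    d = W.sum(axis=0) + W.sum(axis=1)
    emb = f1 / np.sqrt(np.maximum(d, 1e-12))     # one complex point per vertex
    pts = np.column_stack([emb.real, emb.imag])  # identify C with R^2
    return KMeans(n_clusters=k, n_init=10).fit_predict(pts)
```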
Speeding-up the runtime of the algorithm. Since $\Omega(m)$ time is needed for any algorithm to read an entire graph, the runtime of our proposed algorithm is optimal up to a poly-logarithmic factor. However, we will show that, when the vertices' degrees are available in advance, the following sub-linear time algorithm can be applied before the execution of the main algorithm, and this results in a total runtime that is sub-linear in $m$. More formally, our proposed sub-linear time implementation constructs a sparse subgraph $H$ of the original input graph $G$, and runs the main algorithm on $H$ instead. The algorithm for obtaining the graph $H$ works as follows: every vertex $u$ in the graph $G$ checks each of its outgoing edges $e = (u, v)$, and samples each such edge with probability
$$p_u(u, v) \triangleq \min\left\{ \frac{\alpha \cdot w(u,v) \cdot \log n}{\lambda_2(L_G) \cdot d_u^{\mathrm{out}}},\ 1 \right\};$$
at the same time, every vertex $v$ checks each of its incoming edges $e = (u, v)$, and samples each such edge with probability
$$p_v(u, v) \triangleq \min\left\{ \frac{\alpha \cdot w(u,v) \cdot \log n}{\lambda_2(L_G) \cdot d_v^{\mathrm{in}}},\ 1 \right\},$$
where $\alpha \in \mathbb{R}_{>0}$ is some constant which can be determined experimentally. As the algorithm goes through each vertex, it maintains all the sampled edges in a set $F$. Once all the edges have been checked, the algorithm returns a weighted graph $H = (V, F, w_H)$, where each sampled edge $e = (u, v)$ has a new weight defined by $w_H(u, v) = w(u, v)/p_e$. Here, $p_e$ is the probability that $e$ is sampled by at least one of its endpoints and, for any $e = (u, v)$, we can write $p_e$ as $p_e = p_u(u, v) + p_v(u, v) - p_u(u, v) \cdot p_v(u, v)$.
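A minimal sketch of this subroutine follows; the exact form of $p_u$ and $p_v$ and the treatment of $\lambda_2$ as a known input reflect our reading of the description above, with $\alpha$ a tunable constant.

```python
import numpy as np

def sparsify(W, lam2, alpha=1.0, seed=0):
    """Keep each edge (u, v) with probability p_e = p_u + p_v - p_u * p_v and
    reweight kept edges by w(u, v) / p_e, following the sampling scheme above."""
    rng = np.random.default_rng(seed)
    logn = np.log(W.shape[0])
    d_out, d_in = W.sum(axis=1), W.sum(axis=0)
    H = np.zeros_like(W)
    for u, v in zip(*np.nonzero(W)):
        w = W[u, v]
        p_u = min(1.0, alpha * logn * w / (lam2 * d_out[u]))
        p_v = min(1.0, alpha * logn * w / (lam2 * d_in[v]))
        p_e = p_u + p_v - p_u * p_v
        if rng.random() < p_e:
            H[u, v] = w / p_e        # reweighted sampled edge
    return H
```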
Analysis of the main algorithm. Now we analyse the proposed algorithm, and prove that running $k$-means on the embedded points $\{F(u)\}_{u \in V}$ is sufficient to obtain a meaningful clustering with a bounded approximation guarantee. We assume that the output of the $k$-means algorithm is $A_0, \ldots, A_{k-1}$. We define the cost function of the output clustering $A_0, \ldots, A_{k-1}$ by
$$\mathrm{COST}(A_0, \ldots, A_{k-1}) \triangleq \min_{c_0, \ldots, c_{k-1} \in \mathbb{C}} \sum_{j=0}^{k-1} \sum_{u \in A_j} d_u \left\| F(u) - c_j \right\|^2,$$
and define the cost of the optimal clustering by $\Delta_k^2 \triangleq \min_{A_0, \ldots, A_{k-1}} \mathrm{COST}(A_0, \ldots, A_{k-1})$. Although computing the optimal clustering for $k$-means is NP-hard, we will show that the cost value of the optimal clustering can be upper bounded with respect to $\gamma_k(G)$. To achieve this, we define $k$ points $p^{(0)}, \ldots, p^{(k-1)}$ in $\mathbb{C}$, where $p^{(j)}$ is defined by
$$p^{(j)} \triangleq \frac{\beta}{\sqrt{k \cdot \mathrm{vol}(S_j)}} \cdot \omega_k^j,$$
with $\beta$ the scalar from Theorem 3.2. We can view these $p^{(0)}, \ldots, p^{(k-1)}$ as approximate centres of the $k$ clusters, which are separated from each other through different powers of $\omega_k$. Our first lemma shows that the total distance between the embedded points from every $S_j$ and their respective centres $p^{(j)}$ can be upper bounded, which is summarised as follows:

Lemma 4.1. It holds that $\sum_{j=0}^{k-1} \sum_{u \in S_j} d_u \left\| F(u) - p^{(j)} \right\|^2 \leq \frac{1}{\gamma_k(G) - 1}$.

Since the cost value of the optimal clustering is the minimum over all possible partitions of the embedded points, by Lemma 4.1 we have that $\Delta_k^2 \leq 1/(\gamma_k(G) - 1)$. We assume that the $k$-means algorithm used here achieves an approximation ratio of APT. Therefore, the output $A_0, \ldots, A_{k-1}$ of this $k$-means algorithm satisfies $\mathrm{COST}(A_0, \ldots, A_{k-1}) \leq \mathrm{APT} \cdot \Delta_k^2 \leq \mathrm{APT}/(\gamma_k(G) - 1)$. Secondly, we show that the norm of the approximate centre of each cluster is inversely proportional to the volume of that cluster. This implies that larger clusters are closer to the origin, while smaller clusters are further away from the origin. Thirdly, we prove that the distance between different approximate centres $p^{(j)}$ and $p^{(\ell)}$ is inversely proportional to the volume of the smaller cluster, which implies that the embedded points of the vertices from a smaller cluster are far from the embedded points from other clusters. This key fact explains why our algorithm is able to approximately recover the structure of all the clusters. Combining these three lemmas with some combinatorial analysis, we prove that the symmetric difference between every cluster returned by the algorithm and its corresponding cluster in the optimal partition can be upper bounded, since otherwise the cost value of the returned clusters would contradict Lemma 4.1.

Theorem 4.4. Let $S_0, \ldots, S_{k-1}$ be the $k$-way partition of $V(G)$ that maximises the flow ratio $\Phi_G(S_0, \ldots, S_{k-1})$. Then, there is an algorithm that returns a $k$-way partition $A_0, \ldots, A_{k-1}$ of $V(G)$. Moreover, assuming each $A_j$ corresponds to $S_j$ in the optimal partition, it holds that $\mathrm{vol}(A_j \triangle S_j) \leq \varepsilon \cdot \mathrm{vol}(S_j)$ for some $\varepsilon = \frac{48 k^3 \cdot (1 + \mathrm{APT})}{\gamma_k(G) - 1} \leq 1/2$.

We remark that the analysis of our algorithm is similar to the work of [23]. However, the analysis in [23] relies on $k$ indicator vectors of the $k$ clusters, each of which lies in a different dimension of $\mathbb{R}^n$; this implies that $k$ eigenvectors are needed in order to find a good $k$-way partition. In our case, all the embedded points are in $\mathbb{R}^2$, and the embedded points from different clusters are mainly separated by angles; this makes our analysis slightly more involved than that of [23].

Analysis of the speeding-up subroutine. We further analyse the speeding-up subroutine described in Section 4.1. Our analysis is very similar to [27], and the approximation guarantee of the subroutine is as follows:

Theorem 4.5. Given a digraph $G = (V, E)$ as input, the speeding-up subroutine computes a subgraph $H = (V, F)$ of $G$ with $O((1/\lambda_2) \cdot n \log n)$ edges. Moreover, with high probability, the computed sparse graph $H$ satisfies $\theta_k(H) = \Omega(\theta_k(G))$ and $\lambda_2(L_H) = \Omega(\lambda_2(L_G))$.

In this section we present the experimental results of our proposed algorithm SimpleHerm on both synthetic and real-world datasets, and compare its performance against the previous state-of-the-art. All our experiments were conducted on an ASUS ZenBook Pro UX501VW with an Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz and 12GB of RAM. We compare SimpleHerm against the DD-SYM algorithm [24] and the Herm-RW algorithm [9]. Given the adjacency matrix $M \in \mathbb{R}^{n \times n}$ as input, the DD-SYM algorithm computes the matrix $A = MM^\top + M^\top M$, and uses the top $k$ eigenvectors of the random walk matrix $D^{-1}A$ to construct an embedding for $k$-means clustering. The Herm-RW algorithm uses the imaginary unit $i$ to represent directed edges, and applies the top $k/2$ eigenvectors of a random walk matrix to construct an embedding for $k$-means. Notice that both the DD-SYM and Herm-RW algorithms involve the use of multiple eigenvectors, and DD-SYM requires computing matrix multiplications, which makes it computationally more expensive than ours.

We first perform experiments on graphs generated from the Directed Stochastic Block Model (DSBM) introduced in [9]. We introduce a path structure into the DSBM, and compare the performance of our algorithm against the others. Specifically, for given parameters $k, n, p, q, \eta$, a graph randomly chosen from the DSBM is constructed as follows: the overall graph consists of $k$ clusters $S_0, \ldots, S_{k-1}$ of the same size, each of which can initially be viewed as a $G(n, p)$ random graph. We connect pairs of vertices within the same cluster with probability $p$, and pairs of vertices in different clusters with probability $q$. In addition, for any edge $(u, v)$ with $u \in S_j$ and $v \in S_{j+1}$, we set the edge direction as $u \rightarrow v$ with probability $\eta$, and as $v \rightarrow u$ with probability $1 - \eta$. For all other pairs of clusters, which do not lie along the path, we set the edge directions randomly; the directions of edges inside a cluster are also assigned randomly. As graphs generated from the DSBM have a well-defined ground-truth clustering, we apply the Adjusted Rand Index (ARI) [14] to measure the performance of the different algorithms. We further set $p = q$, since this is one of the hardest regimes for studying the DSBM; a sketch of this generative model is given below.
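The following generator follows the construction just described; the names and the uniform tie-breaking for non-consecutive cluster pairs are our own reading of the text.

```python
import numpy as np

def sample_dsbm(k, n, p, q, eta, seed=0):
    """Sample a DSBM digraph with a path structure: k clusters of n vertices,
    edge probability p inside and q across clusters; an edge between S_j and
    S_{j+1} points forward with probability eta, all other directions random."""
    rng = np.random.default_rng(seed)
    N = k * n
    cluster = np.repeat(np.arange(k), n)
    W = np.zeros((N, N))
    for u in range(N):
        for v in range(u + 1, N):
            prob = p if cluster[u] == cluster[v] else q
            if rng.random() >= prob:
                continue                          # no edge between u and v
            if cluster[v] - cluster[u] == 1:      # consecutive clusters on the path
                forward = rng.random() < eta
            else:
                forward = rng.random() < 0.5      # direction chosen uniformly
            if forward:
                W[u, v] = 1.0
            else:
                W[v, u] = 1.0
    return W, cluster
```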
In particular, when $p = q$, the edge density plays no role in characterising the structure of clusters, and the clusters are defined entirely by the edge directions. We set $n = 1000$ and $k = 4$, the value of $p$ between 0.5 and 0.8, and the value of $\eta$ between 0.5 and 0.7. As shown in Figure 1, our proposed SimpleHerm clearly outperforms the Herm-RW and DD-SYM algorithms. Next, we study the case of $n = 2000$ and $k = 8$, where the structure of clusters presents a more significant path topology. Specifically, we assume that any pair of vertices within each cluster is connected with probability $p \in (0.05, 0.1)$; moreover, all the edges crossing different clusters are along the cuts $(S_j, S_{j+1})$ for some $0 \leq j \leq k-2$. Setting $\eta \in (0.65, 1)$, our results are reported in Figure 2. From these results, it is easy to see that, when the underlying graph presents a clear flow structure, our algorithm performs significantly better than both the Herm-RW and DD-SYM algorithms, for which multiple eigenvectors are needed.

We compare our proposed algorithm against the previous state-of-the-art on the UN Comtrade Dataset [29]. This dataset consists of the import-export tradeflow data of 97 specific commodities across $N = 246$ countries and territories over the period 1964-2018. The total size of the data in zipped files is 99.8GB, where every csv file for a single year contains around 20,000,000 lines.

Pre-processing. As the pre-processing step, for any fixed commodity $c$ and any fixed year, we construct a directed graph as follows: the constructed graph has $N = 246$ vertices, which correspond to the 246 countries and territories listed in the dataset. For any two vertices $j$ and $\ell$, there is a directed edge from $j$ to $\ell$ if the export of commodity $c$ from country $j$ to $\ell$ is greater than the export from $\ell$ to $j$, and the weight of that edge is set to the absolute value of the difference in trade, i.e., the net trade value between $j$ and $\ell$. Notice that our construction ensures that all the edge weights are non-negative, and that there is at most one directed edge between any pair of vertices.

Result on the International Oil Trade Industry. We first study the international trade in mineral fuels, oils, and products of their distillation. The primary reason for studying the international oil trade is that crude oil is one of the most heavily traded commodities worldwide [2], and plays a significant role in geopolitics (e.g., the 2003 Iraq War). Many references in international trade and policy making (e.g., [3, 10, 11]) allow us to interpret the results of our proposed algorithm. Following previous studies of the same dataset from a complex networks perspective [13, 32], we set $k = 4$. Our algorithm's output around the period of 2006-2009 is visualised in Figure 3. We choose to highlight the results between 2006 and 2009, since 2008 saw the largest post-World War II oil shock, following the economic crisis [18]. As discussed earlier, our algorithm's output is naturally associated with an ordering of the clusters that optimises the value of $\Phi$, and this ordering is reflected in our visualisation as well. Notice that this ordering corresponds to the chain of oil trade, and indicates the clusters of the main exporting and importing countries in the oil trade. To compare the clustering results of two consecutive years, we map every cluster to its "optimal" correspondence (i.e., the one that minimises the symmetric difference between the two).
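A small sketch of this matching step follows, using brute force over permutations (adequate for $k = 4$) and plain set cardinality for the symmetric difference; weighting by volume would be an equally natural reading.

```python
from itertools import permutations

def match_clusters(prev, curr):
    """Map each cluster of `curr` to its best correspondence in `prev` by
    minimising the total symmetric difference; both arguments are lists of
    k sets of vertex ids. Returns the best permutation and its total."""
    k = len(prev)
    best_perm, best_total = None, float('inf')
    for sigma in permutations(range(k)):
        total = sum(len(curr[j] ^ prev[sigma[j]]) for j in range(k))
        if total < best_total:
            best_perm, best_total = sigma, total
    return best_perm, best_total
```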
We further compute the total symmetric difference between the clustering results for every two consecutive years, and visualise the results.

Result on the International Wood Trade. We also study the international wood trade network (IWTN), which covers the trade of wood and articles of wood. Although the IWTN is less studied in the literature than the international oil trade, it nonetheless reflects an important and traditional industry and deserves detailed analysis. Wood trade depends on a number of factors, such as how much forest a country has left, whether countries are trying to reintroduce forests, and whether countries are deforesting heavily for agriculture (e.g., the Amazon rainforest in Brazil) [17].

The Data Science for COVID-19 Dataset (DS4C) [19] contains information about 3519 South Korean citizens infected with COVID-19. Here, digraphs are essential to represent how the virus is transmitted among individuals, and the clusters with a high ratio of outgoing edges represent the communities worst hit by the virus. We first identify the largest connected component of the infection graph, which consists of 67 vertices and 66 edges, and run our algorithm on this component. Setting $k = 4$, our algorithm manages to identify a super-spreader as a single cluster, as well as the path of infection between groups of people along which most infections lie.

The primary focus of our work is efficient clustering algorithms for digraphs whose clusters are defined with respect to the edge directions between different clusters. We believe that our work could have long-term social impact. For instance, when modelling the transmission of COVID-19 among individuals through a digraph, the cluster (group of people) with the highest ratio of outgoing edges represents the most infectious community. This type of information could aid local containment policy. With the development of many tracing apps for COVID-19 and a significant amount of infection data available in the near future, our studied algorithm could potentially be applied in this context. In addition, as shown by our experimental results on the UN Comtrade Dataset, our work could be employed to analyse many practical datasets for which most traditional clustering algorithms do not suffice.

A Omitted details from Section 3

In this section we present all the technical details omitted from Section 3.

Proof of Lemma 3.1. We prove the statement by analysing the Rayleigh quotient of $L_G$ with respect to $y$, which is defined by $R(L_G, y) = \frac{y^* L_G y}{y^* y}$. Since $\|y\| = 1$, it suffices to analyse $y^* L_G y = 1 - y^* D^{-1/2} A D^{-1/2} y$. First of all, it is easy to see that $y^* y = \frac{1}{k} \sum_{j=0}^{k-1} \tilde{\chi}_j^* \tilde{\chi}_j = 1$, where we use the fact that $\tilde{\chi}_j^* \tilde{\chi}_\ell = 0$ for any $0 \leq j \neq \ell \leq k-1$. On the other hand, by definition every edge with endpoints in two clusters contributes to $y^* D^{-1/2} A D^{-1/2} y$ a term whose real part, denoted $\mathrm{Re}(\cdot)$, is a cosine of an angle determined by the involved roots of unity. Combining these facts, and applying the inequalities $\cos x \geq 1 - x^2/2$ and $2ab \leq a^2 + b^2$ for any $a, b \in \mathbb{R}$ (the latter with $a = \sqrt{\mathrm{vol}(S_j)}$ and $b = \sqrt{\mathrm{vol}(S_{j-1})}$), we have that
$$y^* L_G y \leq 1 - \frac{4}{k} \sum_{j=1}^{k-1} \frac{w(S_j, S_{j-1})}{\mathrm{vol}(S_j) + \mathrm{vol}(S_{j-1})} = 1 - \frac{4}{k} \cdot \Phi_G(S_0, \ldots, S_{k-1}).$$
By the Rayleigh characterisation of eigenvalues we know that $\lambda_1(L_G) \leq R(L_G, y) = y^* L_G y \leq 1 - \frac{4}{k} \cdot \theta_k(G)$, which proves the first statement of the lemma.

Now we prove the second statement. Let $G$ be a digraph, and let $S_0, \ldots, S_{k-1}$ be the $k$ clusters maximising $\Phi_G(S_0, \ldots, S_{k-1})$, i.e., $\Phi_G(S_0, \ldots, S_{k-1}) = \theta_k(G)$. Since adding edges that are not along the path only decreases the value of $\Phi_G$, we assume without loss of generality that all the edges are along the path.
For the base case of $k = 2$, we have that
$$\Phi_G(S_0, S_1) = \frac{w(S_1, S_0)}{\mathrm{vol}(S_0) + \mathrm{vol}(S_1)} \leq \frac{w(S_1, S_0)}{2 \cdot w(S_1, S_0)} = \frac{1}{2} = \frac{k}{4},$$
with equality if and only if every edge goes from $S_1$ to $S_0$, i.e., $G$ is a bipartite digraph with all the edges having the same direction. Next, we prove that $\theta_k(G) < k/4$ for any $k \geq 3$. We set $y_j \triangleq w(S_j, S_{j-1})$ for any $1 \leq j \leq k-1$, and since all the edges are along the path we have that $\mathrm{vol}(S_j) + \mathrm{vol}(S_{j-1}) = y_{j-1} + 2y_j + y_{j+1}$. By introducing $y_0 = 0$ and assuming that all the indices of $\{y_j\}_j$ are taken modulo $k$, we can write $\Phi_G(S_0, \ldots, S_{k-1})$ as
$$\Phi_G(S_0, \ldots, S_{k-1}) = \sum_j \frac{y_j}{y_{j-1} + 2y_j + y_{j+1}}.$$
Next we compute $\partial \Phi_G / \partial y_j$, and have that
$$\frac{\partial \Phi_G}{\partial y_j} = \frac{y_{j-1} + y_{j+1}}{(y_{j-1} + 2y_j + y_{j+1})^2} - \frac{y_{j-1}}{(y_{j-2} + 2y_{j-1} + y_j)^2} - \frac{y_{j+1}}{(y_j + 2y_{j+1} + y_{j+2})^2}.$$
Notice that, when all the $y_j$ ($0 \leq j \leq k-1$) are equal to the same non-zero value, it holds that $\partial \Phi_G / \partial y_j = 0$ for any $j$, and the objective value equals $k/4$; moreover, it is easy to verify that $k/4$ is an upper bound on the objective. Since we effectively assume that $y_0 = 0$, which cannot be equal to all of the non-zero values $y_1, \ldots, y_{k-1}$, we have that $\Phi_G(S_0, \ldots, S_{k-1}) < k/4$.

Proof of Theorem 3.2. We first prove the first statement. We write $y$ as a linear combination of the eigenvectors of $L_G$ by $y = \alpha_1 f_1 + \cdots + \alpha_n f_n$ for some $\alpha_i \in \mathbb{C}$ and $f_i \in \mathbb{C}^n$, and define $\hat{f}_1 \triangleq \alpha_1 f_1$. By the definition of the Rayleigh quotient for Hermitian matrices we have that
$$y^* L_G y = (\alpha_1 f_1 + \cdots + \alpha_n f_n)^* L_G (\alpha_1 f_1 + \cdots + \alpha_n f_n) = |\alpha_1|^2 \lambda_1(L_G) + \cdots + |\alpha_n|^2 \lambda_n(L_G) \geq (1 - |\alpha_1|^2) \cdot \lambda_2(L_G),$$
where the inequality holds by the facts that $0 \leq \lambda_1(L_G) \leq \ldots \leq \lambda_n(L_G)$ and that $|\alpha_2|^2 + \cdots + |\alpha_n|^2 = 1 - |\alpha_1|^2$. We can see that
$$\|y - \hat{f}_1\|^2 = 1 - |\alpha_1|^2 \leq \frac{y^* L_G y}{\lambda_2(L_G)} \leq \frac{1 - \frac{4}{k}\theta_k(G)}{\lambda_2(L_G)} = \frac{1}{\gamma_k(G)}.$$
Setting $\alpha = \alpha_1$ proves the first statement. Next we prove the second statement. By the relationship between $f_1$ and $\hat{f}_1$, we write $f_1 = \beta_1 \hat{f}_1$, where $\beta_1 \triangleq 1/\alpha_1$ is the multiplicative inverse of $\alpha_1$. Then, we define $\hat{y} \triangleq \beta_1 y$, and this implies that
$$\|f_1 - \hat{y}\|^2 = |\beta_1|^2 \cdot \|\hat{f}_1 - y\|^2 = \frac{1 - |\alpha_1|^2}{|\alpha_1|^2} \leq \frac{1/\gamma_k(G)}{1 - 1/\gamma_k(G)} = \frac{1}{\gamma_k(G) - 1},$$
and therefore setting $\beta = \beta_1$ proves the second statement.

B Omitted details from Section 4

In this section we present all the technical details omitted from Section 4.

Proof of Lemma 4.1. By definition, we have that
$$\sum_{j=0}^{k-1} \sum_{u \in S_j} d_u \left\| F(u) - p^{(j)} \right\|^2 = \sum_{j=0}^{k-1} \sum_{u \in S_j} \left| f_1(u) - \hat{y}(u) \right|^2 = \|f_1 - \hat{y}\|^2 \leq \frac{1}{\gamma_k(G) - 1},$$
where the last inequality follows by Theorem 3.2.

Proof of Lemma 4.2. The proof is by direct calculation on $\|p^{(j)}\|^2$.

Proof of Lemma 4.3. By the definition of $p^{(j)}$ and $p^{(\ell)}$, we have that
$$\left\| p^{(j)} - p^{(\ell)} \right\|^2 = \frac{|\beta|^2}{k} \left( \frac{1}{\mathrm{vol}(S_j)} + \frac{1}{\mathrm{vol}(S_\ell)} - \frac{2 \cos\left(\frac{2\pi(j - \ell)}{k}\right)}{\sqrt{\mathrm{vol}(S_j)\,\mathrm{vol}(S_\ell)}} \right).$$
By direct calculation, the fact that $\cos(x) = \cos(-x)$ for any $x \in \mathbb{R}$, and the fact that $\sin x \geq (2/\pi) \cdot x$ holds for any $x \in [0, \pi/2]$, the claimed lower bound on $\|p^{(j)} - p^{(\ell)}\|^2$ in terms of the volume of the smaller cluster follows. This finishes the proof of the lemma.

The following lemma will be used to prove Theorem 4.4. We remark that the proof closely follows a similar one from [23]; however, some constants need to be adjusted for our purpose. We include the proof here for completeness.

Lemma B.1. Let $A_0, \ldots, A_{k-1}$ be a partition of $V$. Assume that, for every permutation $\sigma : \{0, \ldots, k-1\} \rightarrow \{0, \ldots, k-1\}$, there exists an index $j$ such that $\mathrm{vol}(A_j \triangle S_{\sigma(j)}) \geq \varepsilon \cdot \mathrm{vol}(S_{\sigma(j)})$, where $\varepsilon \geq \frac{48 k^3 (1 + \mathrm{APT})}{\gamma_k(G) - 1}$. Then, it holds that $\mathrm{COST}(A_0, \ldots, A_{k-1}) \geq \frac{2 \cdot \mathrm{APT}}{\gamma_k(G) - 1}$.

Proof. We first consider the case where there exists a permutation $\sigma : \{0, \ldots, k-1\} \rightarrow \{0, \ldots, k-1\}$ such that, for any $0 \leq j \leq k-1$,
$$\mathrm{vol}\left(A_j \cap S_{\sigma(j)}\right) \geq \frac{1}{2}\,\mathrm{vol}\left(S_{\sigma(j)}\right). \qquad (10)$$
This assumption essentially says that $A_0, \ldots, A_{k-1}$ is a non-trivial approximation of the optimal clustering $S_0, \ldots, S_{k-1}$ according to some permutation $\sigma$. Later we will show that the statement of the lemma trivially holds if no permutation satisfies (10). Based on this assumption, for some $0 \leq j \leq k-1$ one of the following two cases must hold:

1. A large portion of $A_j$ belongs to clusters different from $S_{\sigma(j)}$, i.e., there exist $\varepsilon_0, \ldots, \varepsilon_{k-1} \geq 0$ such that $\sum_{j=0}^{k-1} \varepsilon_j \geq \varepsilon$ and $\mathrm{vol}\left(A_j \cap \overline{S_{\sigma(j)}}\right) \geq \varepsilon_j \cdot \mathrm{vol}\left(S_{\sigma(j)}\right)$ for any $0 \leq j \leq k-1$.

2. $A_j$ is missing a large portion of $S_{\sigma(j)}$, which must have been assigned to other clusters. Therefore, we can define $\varepsilon_0, \ldots, \varepsilon_{k-1} \geq 0$ such that $\sum_{j=0}^{k-1} \varepsilon_j \geq \varepsilon$ and $\mathrm{vol}\left(S_{\sigma(j)} \setminus A_j\right) \geq \varepsilon_j \cdot \mathrm{vol}\left(S_{\sigma(j)}\right)$.

In both cases, we can define sets
$B_0, \ldots, B_{k-1}$ and $D_0, \ldots, D_{k-1}$ such that $B_j$ and $D_j$ belong to the same cluster of the returned clustering but to two different optimal clusters $S_{j_1}$ and $S_{j_2}$. More precisely, in the first case, for any $0 \leq j \leq k-1$, we define $B_j = A_j \cap S_{\sigma(j)}$, and we define $D_0, \ldots, D_{k-1}$ as an arbitrary partition of the sets $A_j \cap \overline{S_{\sigma(j)}}$ with the constraint that $\mathrm{vol}(D_j) \geq \varepsilon_j \cdot \mathrm{vol}(S_{\sigma(j)})$; this is possible since by (10) we have $\mathrm{vol}(A_j \cap S_{\sigma(j)}) \geq \frac{1}{2}\mathrm{vol}(S_{\sigma(j)})$. In the second case, instead, for any $0 \leq j \leq k-1$, we define $B_j = A_j \cap \overline{S_{\sigma(j)}}$ and $D_j = A_j \cap S_{\sigma(j)}$; note that it also holds by (10) that $\mathrm{vol}(D_j) \geq \varepsilon_j \cdot \mathrm{vol}(S_{\sigma(j)})$. We can then combine the two cases (albeit with different definitions of the sets) and assume that there exist $\varepsilon_0, \ldots, \varepsilon_{k-1} \geq 0$ with $\sum_{j=0}^{k-1} \varepsilon_j \geq \varepsilon$ such that we can find collections of pairwise disjoint sets $\{B_0, \ldots, B_{k-1}\}$ and $\{D_0, \ldots, D_{k-1}\}$ with the following property: for any $j$ there exist indices $j'$ and $j_1 \neq j_2$ such that $B_j$ and $D_j$ are subsets of the same returned cluster $A_{j'}$, while $B_j \subseteq S_{j_1}$ and $D_j \subseteq S_{j_2}$ with $\mathrm{vol}(D_j) \geq \varepsilon_j \cdot \mathrm{vol}(S_{j_2})$.

For any $j$, we define $c_j$ as the centre of the corresponding cluster $A_{j'}$ to which both $B_j$ and $D_j$ belong. We can also assume without loss of generality that $\|c_j - p^{(j_1)}\| \geq \|c_j - p^{(j_2)}\|$, which implies that $\|c_j - p^{(j_1)}\| \geq \frac{1}{2}\|p^{(j_1)} - p^{(j_2)}\|$. As a consequence, points in $B_j$ are far away from $c_j$. Notice that if instead $\|c_j - p^{(j_1)}\| < \|c_j - p^{(j_2)}\|$, we would just need to reverse the roles of $B_j$ and $D_j$ without changing the proof. We now bound $\mathrm{COST}(A_0, \ldots, A_{k-1})$ from below by looking only at the contribution of the points in the $B_j$'s:
$$\mathrm{COST}(A_0, \ldots, A_{k-1}) \geq \sum_{j=0}^{k-1} \sum_{u \in B_j} d_u \left\| F(u) - c_j \right\|^2.$$
By applying the inequality $a^2 + b^2 \geq (a - b)^2/2$, every term $\|F(u) - c_j\|^2$ above is at least $\frac{1}{2}\|c_j - p^{(j_1)}\|^2 - \|F(u) - p^{(j_1)}\|^2$, and the desired lower bound on the cost follows.

It remains to show that removing assumption (10) still implies the lemma. Notice that if (10) is not satisfied, then for every permutation $\sigma$ there exists $0 \leq \ell \leq k-1$ such that $\mathrm{vol}(A_\ell \cap S_{\sigma(\ell)}) \leq \frac{1}{2}\mathrm{vol}(S_{\sigma(\ell)})$. We can also assume the following stronger condition:
$$\mathrm{vol}(A_\ell \cap S_j) \leq \frac{1}{2}\,\mathrm{vol}(S_j) \quad \text{for every } 0 \leq j \leq k-1. \qquad (11)$$
Indeed, if there existed a unique $j \neq \sigma(\ell)$ such that $\mathrm{vol}(A_\ell \cap S_j) > \frac{1}{2}\mathrm{vol}(S_j)$, then it would just mean that $\sigma$ is the "wrong" permutation and we should consider only permutations $\sigma' \neq \sigma$ such that $\sigma'(\ell) = j$. If instead there existed $j_1 \neq j_2$ such that $\mathrm{vol}(A_\ell \cap S_{j_1}) > \frac{1}{2}\mathrm{vol}(S_{j_1})$ and $\mathrm{vol}(A_\ell \cap S_{j_2}) > \frac{1}{2}\mathrm{vol}(S_{j_2})$, then it is easy to see that the lemma would hold, since in this case $A_\ell$ would contain large portions of two different optimal clusters, and, as is clear from the previous part of the proof, this would imply a high $k$-means cost. Therefore, we just need to show that the statement of the lemma holds when (11) is satisfied. For this purpose we define sets $C_0, \ldots, C_{k-1}$, subsets of the vertices in $S_0, \ldots, S_{k-1}$ respectively, that are close in the spectral embedding to $p^{(0)}, \ldots, p^{(k-1)}$. Formally, for any $0 \leq j \leq k-1$,
$$C_j \triangleq \left\{ u \in S_j : \left\| F(u) - p^{(j)} \right\|^2 \leq \frac{100}{(\gamma_k(G) - 1) \cdot \mathrm{vol}(S_j)} \right\}.$$
Notice that by Lemma 4.1 and an averaging argument, $\mathrm{vol}(C_j) \geq \frac{99}{100}\mathrm{vol}(S_j)$. By assumption (11), roughly half of the volume of all the $C_j$'s must be contained in at most $k-1$ sets (all the $A_j$'s different from $A_\ell$). We prove this implies that the $k$-means cost is high, from which the lemma follows. Let $c_0, \ldots, c_{k-1}$ be the centres of $A_0, \ldots, A_{k-1}$. We are trying to assign a large portion of each of the $k$ optimal clusters to only $k-1$ centres (namely, all the centres different from $c_\ell$). Moreover, any centre $c_j \neq c_\ell$ can either be close to $p^{(\ell)}$ or to another optimal centre $p^{(j')}$, but not to both. As a result, there will be at least one $C_j$ whose points are assigned to a centre which is at least $\Omega(1/\mathrm{vol}(S_j))$ far from $p^{(j)}$ (in squared Euclidean distance).
Therefore, by the definition of $C_j$ and the fact that $\mathrm{vol}(C_j) \geq \frac{99}{100}\mathrm{vol}(S_j)$, the $k$-means cost is at least $\Omega\left(\frac{1}{\mathrm{vol}(S_j)}\right) \cdot \mathrm{vol}(C_j) = \Omega(1)$. This concludes the proof.

Proof of Theorem 4.4. Assume for contradiction that, for every permutation $\sigma : \{0, \ldots, k-1\} \rightarrow \{0, \ldots, k-1\}$, there exists $j$ such that $\mathrm{vol}(A_j \triangle S_{\sigma(j)}) \geq \varepsilon \cdot \mathrm{vol}(S_{\sigma(j)})$. Then, by Lemma B.1 we have that $\mathrm{COST}(A_0, \ldots, A_{k-1}) \geq \frac{2 \cdot \mathrm{APT}}{\gamma_k(G) - 1}$, which contradicts the fact that $\mathrm{COST}(A_0, \ldots, A_{k-1}) \leq \frac{\mathrm{APT}}{\gamma_k(G) - 1}$.

Now we prove Theorem 4.5. The following two technical lemmas will be used in our proof.

Lemma B.2 (Bernstein's Inequality, [8]). Let $X_1, \ldots, X_n$ be independent random variables such that $|X_i| \leq M$ for any $1 \leq i \leq n$. Then, it holds that
$$\mathbb{P}\left[ \left| \sum_{i=1}^n X_i - \mathbb{E}\left[\sum_{i=1}^n X_i\right] \right| \geq t \right] \leq 2 \exp\left( - \frac{t^2}{2\sum_{i=1}^n \mathrm{Var}(X_i) + \frac{2}{3} M t} \right).$$

Lemma B.3 (Matrix Chernoff Bound). Let $X_1, \ldots, X_m$ be independent random positive semi-definite $n \times n$ matrices with $\lambda_{\max}(X_i) \leq R$ for all $i$, and let $\mu_{\max} \triangleq \lambda_{\max}\left(\mathbb{E}\left[\sum_i X_i\right]\right)$ and $\mu_{\min} \triangleq \lambda_{\min}\left(\mathbb{E}\left[\sum_i X_i\right]\right)$. Then, for any $\delta \in (0,1)$, it holds that
$$\mathbb{P}\left[ \lambda_{\max}\left(\sum_i X_i\right) \geq (1+\delta)\mu_{\max} \right] \leq n \left( \frac{e^{\delta}}{(1+\delta)^{1+\delta}} \right)^{\mu_{\max}/R} \text{ and } \mathbb{P}\left[ \lambda_{\min}\left(\sum_i X_i\right) \leq (1-\delta)\mu_{\min} \right] \leq n \left( \frac{e^{-\delta}}{(1-\delta)^{1-\delta}} \right)^{\mu_{\min}/R}.$$

Proof of Theorem 4.5. We first analyse the size of $F$. Since $\sum_{(u,v) \in E} w(u,v)/d_u^{\mathrm{out}} \leq n$, it holds by Markov's inequality that the number of edges $e = (u,v)$ with $\frac{w(u,v) \cdot \alpha \log n}{d_u^{\mathrm{out}} \cdot \lambda_2} \geq 1$ or $\frac{w(u,v) \cdot \alpha \log n}{d_v^{\mathrm{in}} \cdot \lambda_2} \geq 1$ is $O((1/\lambda_2) \cdot n \log n)$. Without loss of generality, we assume that these edges are in $F$, and in the remaining part of the proof we assume that $p_u(u,v) < 1$ and $p_v(u,v) < 1$ for any edge $e = (u,v)$. Moreover, the expected number of edges in $H$ equals $\sum_{e \in E} p_e = O((1/\lambda_2) \cdot n \log n)$, and thus by Markov's inequality we have that with constant probability the number of sampled edges satisfies $|F| = O((1/\lambda_2) \cdot n \log n)$.

Proof of $\theta_k(H) = \Omega(\theta_k(G))$. Next we show that the sparsified graph constructed by the algorithm preserves $\theta_k(G)$ up to a constant factor. Without loss of generality, let $S_0, \ldots, S_{k-1}$ be the optimal $k$ clusters such that $\Phi_G(S_0, \ldots, S_{k-1}) = \theta_k(G)$. For any edge $e = (u,v)$ satisfying $u \in S_j$ and $v \in S_{j-1}$ for some $1 \leq j \leq k-1$, we define a random variable $Y_e$ by $Y_e = w(u,v)/p_e$ with probability $p_e$, and $Y_e = 0$ otherwise. We also define random variables $Z_1, \ldots, Z_{k-1}$, where $Z_j$ ($1 \leq j \leq k-1$) is defined by $Z_j \triangleq \sum_{e = (u,v):\, u \in S_j,\, v \in S_{j-1}} Y_e$. By definition, we have that $\mathbb{E}[Z_j] = w(S_j, S_{j-1})$. Moreover, looking at the second moment, we have that the total variance of the $Y_e$'s is $O\left(\frac{\lambda_2 \cdot \Delta_j^{\mathrm{out}}}{\alpha \log n} \cdot w(S_j, S_{j-1})\right)$, where $\Delta_j^{\mathrm{out}}$ is the maximum out-degree of the vertices in $S_j$, and the first inequality follows by the fact that $\mathrm{Var}(Y_e) \leq \mathbb{E}[Y_e^2] = w(u,v)^2/p_e$. In addition, it holds for any $e = (u,v)$ with $u \in S_j$ and $v \in S_{j-1}$ that $Y_e \leq w(u,v)/p_e \leq \frac{\lambda_2 \cdot d_u^{\mathrm{out}}}{\alpha \log n}$. We apply Bernstein's Inequality (Lemma B.2), and obtain for any $1 \leq j \leq k-1$ that $Z_j$ is concentrated around $w(S_j, S_{j-1})$ up to a constant factor with high probability. Hence, with high probability the cut values $w(S_j, S_{j-1})$ for all $1 \leq j \leq k-1$ are approximated up to a constant factor. Using the same technique, we can show that with high probability the volumes of all the sets $S_0, \ldots, S_{k-1}$ are approximately preserved in $H$ as well. Combining this with the definition of $\Phi$, we have that $\Phi_G(S_0, \ldots, S_{k-1})$ and $\Phi_H(S_0, \ldots, S_{k-1})$ are the same up to a constant factor. Since $S_0, \ldots, S_{k-1}$ are the sets maximising the value of $\theta_k(G)$, we have that $\theta_k(H) = \Omega(\theta_k(G))$.

Proof of $\lambda_2(L_H) = \Omega(\lambda_2(L_G))$. Finally, we prove that the top $n-1$ eigenspace is approximately preserved in $H$. Let $\overline{L}_G$ be the projection of $L_G$ on its top $n-1$ eigenspace; we can write $\overline{L}_G$ as $\overline{L}_G = \sum_{i=2}^n \lambda_i f_i f_i^*$. With a slight abuse of notation we write $\overline{L}_G^{-1/2}$ for the square root of the pseudoinverse of $\overline{L}_G$, i.e., $\overline{L}_G^{-1/2} = \sum_{i=2}^n \lambda_i^{-1/2} f_i f_i^*$, and we write $\overline{I}$ for the projection onto $\mathrm{span}\{f_2, \ldots, f_n\}$, i.e., $\overline{I} = \sum_{i=2}^n f_i f_i^*$. We will prove that the top $n-1$ eigenspace of $L_G$ is preserved. To prove this, recall that the probability of any edge $e = (u,v)$ being sampled in $H$ is $p_e = p_u(u,v) + p_v(u,v) - p_u(u,v) \cdot p_v(u,v)$, and it holds that $p_e \geq \frac{\alpha \cdot w(u,v) \cdot \log n}{2\lambda_2} \left( \frac{1}{d_u} + \frac{1}{d_v} \right)$. Now for each edge $e = (u,v)$ of $G$ we define a random matrix $X_e \in \mathbb{C}^{n \times n}$ by $X_e = \frac{w(u,v)}{p_e} \cdot \overline{L}_G^{-1/2} b_e b_e^* \overline{L}_G^{-1/2}$ if $e$ is sampled, and $X_e = 0$ otherwise, where the vector $b_e$ is defined by $b_e \triangleq \omega_k^{1/2} \chi_u - \overline{\omega_k^{1/2}} \chi_v$, and for any vertex $u$ the normalised indicator vector $\chi_u$ is defined by $\chi_u(u) = 1/\sqrt{d_u}$ and $\chi_u(v) = 0$ for any $v \neq u$.
Notice that $\mathbb{E}\left[\sum_{e \in E} X_e\right] = \overline{I}$, and it follows by definition that $\sum_{e \in E} X_e = \overline{L}_G^{-1/2} \widetilde{L}_H \overline{L}_G^{-1/2}$, where $\widetilde{L}_H$ is essentially the Laplacian matrix of $H$ but normalised with respect to the degrees of the vertices in the original graph $G$, i.e., $\widetilde{L}_H \triangleq D_G^{-1/2} (D_H - A_H) D_G^{-1/2}$. Moreover, for any sampled $e = (u,v) \in E$ we have that
$$\lambda_{\max}(X_e) = \frac{w(u,v)}{p_e} \cdot \left\| \overline{L}_G^{-1/2} b_e \right\|^2 \leq \frac{w(u,v)}{p_e} \cdot \frac{\|b_e\|^2}{\lambda_2(L_G)} \leq \frac{2}{\alpha \log n},$$
where the second inequality follows by the min-max theorem of eigenvalues. Now we apply the matrix Chernoff bound (Lemma B.3) to analyse the eigenvalues of $\sum_{e \in E} X_e$, and build a connection between $\lambda_2(L_H)$ and $\lambda_2(L_G)$. By setting the parameters of Lemma B.3 to $\mu_{\max} = \lambda_{\max}\left(\mathbb{E}\left[\sum_{e \in E} X_e\right]\right) = \lambda_{\max}(\overline{I}) = 1$, $R = 2/(\alpha \log n)$ and $\delta = 1/2$, we have that $\lambda_{\max}\left(\sum_{e \in E} X_e\right) \leq 3/2$ with probability $1 - O(1/n^c)$ for some constant $c$. On the other side, since our goal is to analyse $\lambda_2(L_H)$ with respect to $\lambda_2(L_G)$, it suffices to work with the top $(n-1)$ eigenspace of $L_G$. Since $\mathbb{E}\left[\sum_{e \in E} X_e\right] = \overline{I}$, we can assume without loss of generality that $\mu_{\min} = 1$. Hence, by setting $R = 2/(\alpha \log n)$ and $\delta = 1/2$, we have that $\lambda_{\min}\left(\sum_{e \in E} X_e\right) \geq 1/2$ on $\mathrm{span}\{f_2, \ldots, f_n\}$ with probability $1 - O(1/n^c)$ for some constant $c$. Combining these two bounds with the fact that $\sum_{e \in E} X_e = \overline{L}_G^{-1/2} \widetilde{L}_H \overline{L}_G^{-1/2}$, with probability $1 - O(1/n^c)$ it holds for any non-zero $x \in \mathbb{C}^n$ in the space spanned by $f_2, \ldots, f_n$ that
$$x^* \overline{L}_G^{-1/2} \widetilde{L}_H \overline{L}_G^{-1/2} x \geq \frac{1}{2}\, x^* x.$$
By setting $y = \overline{L}_G^{-1/2} x$, we can rewrite this as $y^* \widetilde{L}_H y \geq \frac{1}{2}\, y^* \overline{L}_G y \geq \frac{\lambda_2(L_G)}{2}\, y^* y$. Since $\dim(\mathrm{span}\{f_2, \ldots, f_n\}) = n-1$, we have just proved that there exist $n-1$ orthogonal vectors whose Rayleigh quotients with respect to $\widetilde{L}_H$ are $\Omega(\lambda_2(L_G))$. By the Courant-Fischer Theorem, we have $\lambda_2(\widetilde{L}_H) = \Omega(\lambda_2(L_G))$.

It remains to show that $\lambda_2(L_H) = \Omega(\lambda_2(\widetilde{L}_H))$, which implies that $\lambda_2(L_H) = \Omega(\lambda_2(L_G))$. By the definition of $L_H$, for the Laplacian $L_H = D_H^{-1/2}(D_H - A_H)D_H^{-1/2}$ it holds that $L_H = D_H^{-1/2} D_G^{1/2} \widetilde{L}_H D_G^{1/2} D_H^{-1/2}$, where the degrees in $H$ and $G$ differ by at most a constant multiplicative factor with high probability. Finally, this implies that $\lambda_2(L_H) \geq (1/2) \cdot \lambda_2(\widetilde{L}_H)$: let $S_1 \subseteq \mathbb{C}^n$ be a 2-dimensional subspace achieving the Courant-Fischer characterisation of $\lambda_2(\widetilde{L}_H)$, and let $S_2 \triangleq \{ D_G^{1/2} D_H^{-1/2} x : x \in S_1 \}$. Since $D_G^{1/2} D_H^{-1/2}$ is of full rank, $S_2$ has dimension 2, and the Rayleigh quotients over $S_2$ with respect to $L_H$ are within a constant factor of those over $S_1$ with respect to $\widetilde{L}_H$. Combining everything together gives us that $\lambda_2(L_H) = \Omega(\lambda_2(L_G))$. This concludes the proof.

C Omitted details from Section 5

The API provided by the UN gives a lot of flexibility in the type of selected data. It is possible to set the product type to either trade in goods (e.g., oil, wood, and appliances) or services (e.g., financial services, and construction services). Moreover, the classification code can be selected, which we set to the Harmonised System (HS). The HS categorises goods according to a 6-digit classification code (e.g., in 060240 the first two digits "06" represent "plants", the second two digits "02" represent "alive", and the last two digits "40" code for "roses"). The reporting and partner countries can also be specified, where the reporting country reports its own tradeflow with the partner countries. The settings we used to download the data for our experiments were: goods on an annual frequency, the HS code as reported, over the period from 2002 to 2017, with all reporting and all partner countries, all trade flows, and all HS commodity codes. The total size of the data in zipped files is 99.8GB, where each csv file (for every year) contains around 20,000,000 lines. For every pair of countries $j$ and $\ell$, where $j$ is the reporting country and $\ell$ is the partner country, the database contains the amount that country $j$ imports from country $\ell$ for a specific commodity, and also the amount $j$ exports to $\ell$.
There are several cases where countries $j$ and $\ell$ report different trading amounts with each other. Usually, the larger value is considered more accurate and is therefore used instead of the average [12]. To construct the digraph of the world trade network and its corresponding adjacency matrix, we fill in each entry of the adjacency matrix $M^c$ for commodity $c$ as follows: for each pair of countries $j$ and $\ell$, we compute $d^c_{j\ell} = e^c_{j\ell} - e^c_{\ell j}$, where $e^c_{j\ell}$ is the amount country $j$ exports to country $\ell$ for commodity $c$. If $d^c_{j\ell} > 0$, we set $M^c_{j\ell} = d^c_{j\ell}$ and $M^c_{\ell j} = 0$. If $d^c_{j\ell} < 0$ (and thus $d^c_{\ell j} > 0$), we set $M^c_{\ell j} = d^c_{\ell j}$ and $M^c_{j\ell} = 0$. For our experiments we investigate the trade in "Mineral fuels, mineral oils, and products of their distillation" (HS code 27), and the trade in "Wood and articles of wood" (HS code 44).

We plot the cluster visualisations for the DD-SYM algorithm on the international oil trade network over the period 2006-2009 in Figure 6. The clusters between 2006 and 2007 are almost identical, and then there is a shift in the clustering structure between 2007 and 2008. This change occurs one year before the change found by the SimpleHerm method, and is also one year earlier than the changes reported in the complex network analysis literature [1, 33]. This indicates that the SimpleHerm clustering result is more in line with the other literature. We also plot the total symmetric difference between the clusterings returned by the different algorithms over the consecutive years. Again, we notice that our algorithm finds a peak around the economic crisis of 2008, and another peak between 2005 and 2006. We could not find any literature explaining the peak between 2005 and 2006, but it would be interesting to analyse this further. The symmetric difference returned by the DD-SYM method is noisier.

The Data Science for COVID-19 Dataset (DS4C) [19] contains information about 3519 South Korean COVID-19 cases, and we use directed edges to represent how the virus is transmitted among the individuals. We notice that there are only 831 edges in the graph and that there are many connected components of size 2. To take this into account, we run our algorithm on the largest connected component of the infection graph, which consists of 67 vertices and 66 edges. Applying the complex-valued Hermitian matrix and the eigenvector associated with the smallest eigenvalue, the spectral embedding is visualised in Figure 9. We notice several interesting facts. First of all, we do not see all the individual nodes of the graph in this embedding. This is because many embedded points overlap, which happens when they have the same incoming and outgoing edges. Moreover, from cluster $S_0$ to $S_1$ there is 1 edge, from $S_1$ to $S_2$ there are 51 edges, and from $S_2$ to $S_3$ there are 5 edges. That means there are 1 + 51 + 5 = 57 edges that lie along the path, out of 66 edges in total. This shows that our algorithm has successfully clustered the vertices so that there is a large flow ratio along the clusters. Secondly, due to the limited size of the dataset, it is difficult for us to draw a more significant conclusion from the experiment. However, we do notice that the cluster $S_1$ actually consists of one individual: a super-spreader. This individual infected 51 people in cluster $S_2$. We believe that, with the development of many tracing apps across the world and more data available in the near future, our algorithm could become a useful tool for disease tracking and policy making.
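Returning to the trade-network construction at the start of this appendix, the adjacency-matrix filling rule admits a short vectorised form; the input E is an assumed dense export matrix for a fixed commodity and year, not the raw csv files.

```python
import numpy as np

def net_trade_matrix(E):
    """Adjacency matrix M^c from the export matrix E, where E[j, l] is the
    export value from country j to country l: keep only the positive part of
    the net trade, so each country pair gets at most one directed edge."""
    D = E - E.T                    # d^c_{jl} = e^c_{jl} - e^c_{lj}
    return np.where(D > 0, D, 0.0)
```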
References

- Features and evolution of international crude oil trade relationships: A trading-based network analysis
- Crude oil conservation policy hypothesis in OECD (Organisation for Economic Co-operation and Development) countries: A multivariate panel Granger causality test
- Tensor spectral clustering for partitioning higher-order network structures
- Higher-order organization of complex networks
- Spectral graph theory
- Laplacians and the Cheeger inequality for directed graphs
- Concentration inequalities and martingale inequalities: a survey
- Hermitian matrices for clustering directed graphs: insights and applications
- Embodied energy, export policy adjustment and China's sustainable development: a multi-regional input-output analysis
- Design and impact estimation of a reform program of China's tax and fee policies for low-grade oil and gas resources
- The physical dimension of international trade: Part 1: Direct global flows between 1962 and 2005
- Spatiotemporal dynamics and fitness analysis of global oil market: Based on complex network
- The impact of random models on clustering similarity
- Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming
- Historical oil shocks
- International wood trade and forest change: A global analysis
- Exogenous oil supply shocks: how big are they and how much do they matter for the US economy?
- Korea Centers for Disease Control & Prevention. Data Science for COVID-19
- Multiway spectral partitioning and higher-order Cheeger inequalities
- Community structure in directed networks
- Hermitian Laplacians and a Cheeger inequality for the Max-2-Lin problem
- Partitioning well-clustered graphs: Spectral clustering works!
- Symmetrizations for clustering directed graphs
- Normalized cuts and image segmentation
- Graph sparsification by effective resistances

Acknowledgements. Part of this work was done when Steinar Laenen studied at the University of Edinburgh as a Master's student. He Sun is supported by an EPSRC Early Career Fellowship (EP/T00729X/1).