key: cord-0649245-9xgxdyjc authors: Aizenbud, Yariv; Jaffe, Ariel; Wang, Meng; Hu, Amber; Amsel, Noah; Nadler, Boaz; Chang, Joseph T.; Kluger, Yuval title: Spectral Top-Down Recovery of Latent Tree Models date: 2021-02-26 journal: nan DOI: nan sha: 34698a02b985f7425666e975beff18d00661459d doc_id: 649245 cord_uid: 9xgxdyjc

Modeling the distribution of high dimensional data by a latent tree graphical model is a prevalent approach in multiple scientific domains. A common task is to infer the underlying tree structure, given only observations of its terminal nodes. Many algorithms for tree recovery are computationally intensive, which limits their applicability to trees of moderate size. For large trees, a common approach, termed divide-and-conquer, is to recover the tree structure in two steps. First, separately recover the structure of multiple, possibly random, subsets of the terminal nodes. Second, merge the resulting subtrees to form a full tree. Here, we develop Spectral Top-Down Recovery (STDR), a deterministic divide-and-conquer approach to infer large latent tree models. Unlike previous methods, STDR partitions the terminal nodes in a non-random way, based on the Fiedler vector of a suitable Laplacian matrix related to the observed nodes. We prove that under certain conditions, this partitioning is consistent with the tree structure. This, in turn, leads to a significantly simpler merging procedure of the small subtrees. We prove that STDR is statistically consistent and bound the number of samples required to accurately recover the tree with high probability. Using simulated data from several common tree models in phylogenetics, we demonstrate that STDR has a significant advantage in terms of runtime, with improved or similar accuracy.

Learning the structure of latent tree graphical models is a common task in machine learning [3, 10, 23, 42, 62] and computational biology [29, 30]. A canonical application is phylogenetics, where the task is to infer the evolutionary tree that describes the relationships between a group of biological species based on their nucleotide or protein sequences [18, 43, 48]. Depending on the application, the number of observed nodes ranges from a dozen up to tens of thousands. In latent tree graphical models, every node is associated with a random variable. A key assumption is that the given data corresponds to the terminal nodes of a tree, while the set of unobserved internal nodes determines its distribution. In phylogenetics, the terminal nodes are existing organisms, while the non-terminal nodes correspond to their extinct ancestors. Given a set of nucleotide or amino acid sequences as in Figure 1, the task is to recover the structure of the tree, which describes how the observed organisms evolved from their ancestors. Many algorithms have been developed for recovering latent trees. Distance-based methods, including the classic neighbor joining (NJ) [46] and UPGMA [51], recover the tree based on a distance measure between all pairs of terminal nodes. These methods are computationally efficient and thus applicable to large trees [56]. They also have statistical guarantees for accurate recovery [4, 36]. Since the distance measure does not encapsulate all the information available in the sequences, distance-based methods may perform poorly when the amount of data is limited [60]. A different approach for tree recovery is based on spectral properties of the input data [2, 16].
Several methods work top-down, repeatedly applying spectral partitioning to the terminal nodes until each partition contains a single node [35, 63]. However, there is no theoretical guarantee that the partitions match the structure of the tree. Of direct relevance to this manuscript is the recently proposed spectral neighbor joining (SNJ) [26], which consistently recovers the tree based on a spectral criterion. Similarly to NJ, SNJ is a bottom-up method, which iteratively merges subsets of nodes to recover the tree. Perhaps one of the most accurate approaches to tree recovery is to search for the topology that maximizes the likelihood of the observed data [18]. Since computing the likelihood for every possible topology is intractable, many methods apply a local search to iteratively increase the likelihood function [21, 44, 52, 64]. Though there is no guarantee that such a process converges to the global maximum of the likelihood function, in many settings the resulting tree is more accurate than the one obtained by distance-based methods. The main disadvantage of likelihood-based algorithms is their slow runtime, which limits their applicability to trees of moderate size. With the dramatic increase in the sizes of measured datasets, there is a pressing need for fast tree recovery algorithms able to handle trees with tens of thousands of nodes [47, 56]. For example, the recently developed GESTALT method combines scRNA-seq readouts with CRISPR/Cas9-induced mutations to perform lineage tracing on tens of thousands of cells [45, 50]. For the multispecies coalescent model, recent works recover multiple gene trees, where each tree is composed of thousands of genes [37]. Recently, many works have reconstructed the evolutionary history of the SARS-CoV-2 virus, with over ten thousand variants [40]. Tree recovery problems with thousands of terminal nodes pose a significant computational challenge, as even distance-based methods may prove too slow. To improve the scalability of slow but accurate methods such as maximum likelihood, a common framework known as divide-and-conquer is to recover the tree by a two-step process [39, 58]: (i) infer the tree structure independently for a large number of small, possibly random, subsets of terminal nodes; (ii) compute the full tree by merging the small trees obtained in step (i). In supertree methods, the small subsets of terminal nodes in step (i) overlap. Their merging step requires optimizing a non-convex objective, which is computationally hard [25, 28]. Thus, most supertree methods circumvent global optimization problems by iterative approaches for step (ii) [55, 58]. Recently, several methods were derived to merge subtrees with disjoint terminal nodes [38, 39]. To apply these algorithms in a divide-and-conquer pipeline, the terminal nodes are partitioned according to an initial tree estimate computed by NJ. Despite these works, the problem of reconstructing large trees from a limited amount of data is not yet fully resolved. In particular, there is still a need for fast and scalable approaches that also have strong recovery guarantees.

Contributions and outline. In this work we develop Spectral Top-Down Recovery (STDR), a scalable divide-and-conquer approach, backed by theoretical guarantees, for recovering large trees. In contrast to previous methods, the partitioning of the terminal nodes in step (i) is deterministic. Importantly, we prove that under mild assumptions the partitions are consistent with the unobserved tree structure.
The importance of this consistency is that it considerably simplifies the merging process in step (ii) of the algorithm. Since STDR is recursive, it is instructive to replace the standard two-step divide-and-conquer outline with the following recursive description. (i) Partitioning: split the terminal nodes into two subsets. (ii) Recursive reconstruction: infer the latent tree of each subset. When the partition size falls below a given threshold τ, the tree is recovered by a user-specified algorithm. Above this threshold, the reconstruction is done by recursively applying STDR to each subset. (iii) Merging: reconstruct the full tree by merging the two small trees. Each of the above three steps is explained in detail in Section 3. In step (i) we apply spectral partitioning to a weighted complete graph, whose nodes correspond to the terminal nodes of the tree and whose weights are based on a similarity measure described in Section 3.1. In Section 4.1 we prove that given an accurate estimate of these similarities, step (i) is consistent in the sense that the resulting subsets belong to two disjoint subtrees. For this proof, we derive a novel relation between latent tree models and a classic result from spectral graph theory known as Fiedler's theorem of nodal domains [19]. This theorem is important in various learning tasks such as clustering data [57], graph partitioning [12], and low-dimensional embeddings [27]. To the best of our knowledge, this is the first guarantee for spectral partitioning in the setting of latent tree models. The output of step (ii) is the inner structure of two disjoint subtrees. The task in step (iii) is to merge them into the full tree. In Section 3.4, we show that this task is equivalent to finding the root of an unrooted tree given a reference set of one or more sequences, also known as an outgroup. We derive a novel spectral method to find the root and prove its statistical consistency in Section 4.2. This approach is of independent interest, as finding the root of a tree is a common challenge in phylogenetics [6, 8, 32]. Finite sample guarantees for the Jukes-Cantor model of evolution are derived in Section 5. In Section 6 we compare the accuracy and runtime of various methods when applied to recover the full tree directly versus when used as subroutines in step (ii) of STDR. For example, Figure 6 shows the results of recovering simulated trees with 2000 terminal nodes generated according to the coalescent model [49]. As one baseline, we applied RAxML [52], one of the most popular maximum likelihood software packages in phylogenetics. With 8,000 samples, RAxML took over 5.5 hours to complete. In contrast, STDR with RAxML as a subroutine and a threshold τ = 128 took approximately 21 minutes, more than an order of magnitude faster. Importantly, in this setting, the trees recovered via STDR have similar accuracy to those obtained by applying RAxML directly. These and other simulation results illustrate the potential benefit of STDR in recovering large trees.

Let T be an unrooted binary tree with m terminal nodes. We assume that each node of the tree has an associated discrete random variable over the alphabet {1, . . . , ℓ}. We denote by x = (x_1, . . . , x_m) the vector of the random variables at the m observed terminal nodes of the tree, and by h = (h_1, . . . , h_{m−2}) the random variables at the non-terminal nodes. We assume that these random variables form a Markov random field on T. This means that given the values of its neighbors, the random variable at a node is statistically independent of the rest of the tree [9].
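To make the data model concrete, the following minimal sketch simulates such a data matrix X under the Jukes-Cantor substitution model used later in Section 5. This is our illustration, not code from the paper: the toy tree, the single mutation rate theta, and the function names are assumptions made for the example.

```python
import numpy as np

def jc_transition(theta, ell=4):
    # Jukes-Cantor transition matrix: stay with prob. 1 - theta,
    # switch to each of the other ell - 1 states with prob. theta / (ell - 1)
    P = np.full((ell, ell), theta / (ell - 1))
    np.fill_diagonal(P, 1 - theta)
    return P

def simulate_sequences(children, root, theta, n, ell=4, seed=0):
    # Draw n i.i.d. sites: the root state is uniform over {0, ..., ell - 1};
    # each child's state is drawn from the transition row of its parent's state.
    rng = np.random.default_rng(seed)
    P = jc_transition(theta, ell)
    states = {root: rng.integers(ell, size=n)}
    stack = [root]
    while stack:
        parent = stack.pop()
        for child in children.get(parent, []):
            cum = P[states[parent]].cumsum(axis=1)        # (n, ell) row CDFs
            states[child] = (rng.random((n, 1)) < cum).argmax(axis=1)
            stack.append(child)
    return states

# toy tree with m = 4 terminal nodes: h0 -> (h1, h2), h1 -> (x1, x2), h2 -> (x3, x4)
children = {"h0": ["h1", "h2"], "h1": ["x1", "x2"], "h2": ["x3", "x4"]}
states = simulate_sequences(children, root="h0", theta=0.1, n=1000)
X = np.vstack([states[x] for x in ["x1", "x2", "x3", "x4"]])  # X in {0,...,3}^{m x n}
```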
An edge e(h_i, h_j) connecting a pair of adjacent nodes (h_i, h_j) is equipped with two transition matrices of size ℓ × ℓ,

P(h_i|h_j)_{ab} = Pr(h_i = a | h_j = b),    P(h_j|h_i)_{ab} = Pr(h_j = a | h_i = b).    (1)

Note that every pair of adjacent nodes may in general have different transition matrices. Our observed data is a matrix X = [x^{(1)}, . . . , x^{(n)}] ∈ {1, . . . , ℓ}^{m×n}, where the x^{(j)} are random i.i.d. realizations of x = (x_1, . . . , x_m). Each row of the matrix is a sequence of length n that corresponds to a terminal node in the tree; see the illustration in Figure 1. For example, in phylogenetics, each row of the matrix corresponds to a different species, while each column corresponds to a different location in a DNA sequence; see [14] and references therein. Figure 1 shows an example with m = 7 terminal nodes and n = 23 observations. The support of each node is the DNA alphabet {A, C, G, T}, so ℓ = 4. Given the matrix X, the task at hand is to recover the structure of the hidden tree T. We assume that for every pair of adjacent nodes (h_i, h_j), the corresponding ℓ × ℓ stochastic matrices P(h_i|h_j) and P(h_j|h_i) defined in (1) are full rank, with determinants that satisfy

0 < δ < det(P(h_i|h_j)), det(P(h_j|h_i)) < ξ < 1.    (2)

Eq. (2) implies that the transition matrices are invertible and are not permutation matrices. This assumption is necessary for the tree's topology to be identifiable; see Proposition 3.1 in [9] and [41]. Next, to describe our approach we present several definitions related to unrooted trees, following the terminology of [59].

Definition 1 (clan). A clan is a subset of nodes in T that is connected to the rest of the tree by a single edge.

Definition 2 (the root of a clan). A non-terminal node h is termed the root of a clan C if h ∈ C and it is connected to the edge that separates C from the rest of the tree.

For example, in Figure 1, h_4 and h_5 are the root nodes of the clans C_1 = {x_6, x_7, h_5} and C_2 = {x_4, x_5, x_6, x_7, h_2, h_4, h_5}, respectively. In our work, we will sometimes refer to clans by their terminal nodes only (e.g., {x_6, x_7} and {x_4, x_5, x_6, x_7} for C_1 and C_2).

Definition 3 (adjacent clans). Let C_1 and C_2 be two disjoint subsets of terminal nodes that form two clans. If the union C_1 ∪ C_2 forms a clan, then C_1 and C_2 are adjacent clans.

Two disjoint clans whose respective root nodes share a common neighboring node are adjacent clans. For example, in Figure 1 the clans C_1 = {x_4, x_5} and C_2 = {x_6, x_7} are adjacent: their respective root nodes h_4 and h_5 are both adjacent to h_2. This observation is important for the merging step of STDR.

Here we present the three steps of the Spectral Top-Down Recovery (STDR) algorithm, as outlined in the introduction. Pseudocode for the method appears in Algorithm 1. We begin with the definition and properties of the similarity matrix and similarity graph. Similar to Eq. (1), we define the ℓ × ℓ transition matrix for every pair h_i, h_j of (not necessarily adjacent) nodes by

P(h_i|h_j)_{ab} = Pr(h_i = a | h_j = b).    (3)

Note that due to the Markov assumption, the transition matrix is multiplicative along the edges of the tree. For example, in Figure 1, P(x_1|x_2) = P(x_1|h_3) P(h_3|x_2). In [26], a similarity function between a pair of nodes h_i and h_j was defined as

S(h_i, h_j) = [det(P(h_i|h_j)) · det(P(h_j|h_i))]^{1/2}.    (4)

Like the transition matrix, the similarity is multiplicative along the edges of the tree, and by Eq. (2) it satisfies δ ≤ S(h_i, h_j) ≤ ξ for adjacent nodes. Thus, it exhibits an exponential decay along the tree.
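Given Eq. (4), a natural plug-in estimate of the similarity between two terminal nodes replaces the population quantities by empirical frequencies: since P(x_i|x_j) equals the joint distribution matrix of the pair times the inverse diagonal matrix of marginals, det P(x_i|x_j) = det(J_ij) / prod_a p_j(a), and Eq. (4) becomes |det J_ij| / sqrt(prod_a p_i(a) · prod_a p_j(a)). The sketch below implements this determinant-based estimator; it is our hedged reading of Eq. (4), and the estimator analyzed in [26] may include additional refinements.

```python
import numpy as np
from itertools import combinations

def empirical_similarity(X, ell=4):
    # Plug-in estimate of Eq. (4). Assumes every symbol appears in every
    # row of X; otherwise the marginal product in the denominator vanishes.
    m, n = X.shape
    S = np.eye(m)                     # S_ii = 1 by definition
    marg = [np.bincount(X[i], minlength=ell) / n for i in range(m)]
    for i, j in combinations(range(m), 2):
        J = np.zeros((ell, ell))
        np.add.at(J, (X[i], X[j]), 1.0 / n)   # empirical joint of (x_i, x_j)
        denom = np.sqrt(np.prod(marg[i]) * np.prod(marg[j]))
        S[i, j] = S[j, i] = abs(np.linalg.det(J)) / denom
    return S

S_hat = empirical_similarity(X)   # X from the simulation sketch above
```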
For any two ordered sets of terminal or non-terminal nodes A = {a_1, . . . , a_r} and B = {b_1, . . . , b_s}, we denote by S(A, B) the r × s matrix with entries

S(A, B)_{ij} = S(a_i, b_j),    for all 1 ≤ i ≤ r and 1 ≤ j ≤ s.

To simplify notation, when A and B are both equal to the full set of terminal nodes, we denote the similarity matrix by S, where by definition S_{ii} = 1 for all i. The matrix S is the adjacency matrix of the following graph.

Definition 4 (Similarity graph). The similarity graph G is a complete graph whose vertices are the terminal nodes of T. The weight assigned to every edge e(x_i, x_j) is the similarity S(x_i, x_j).

The relation between the spectral properties of G and the topology of T forms the theoretical basis of our approach. The following result from [26, Lemma 3.1] shows how the spectral structure of the similarity matrix S relates to the structure of the underlying tree.

Algorithm 1: Spectral Top-Down Recovery (STDR)
Input: X ∈ {1, . . . , ℓ}^{m×n}, a matrix containing sequences from m terminal nodes; τ ∈ N, the partition size threshold; Alg, an algorithm for recovering small tree structures.
Output: T, the estimated tree.
1: if the number of terminal nodes m ≤ τ then
2:    return Alg(X)    (recover small tree structures by the user-defined algorithm)
3: end if
4: Compute the similarity matrix S from X via Eq. (4)
5: Compute the Fiedler vector v of the similarity graph (Definition 5)
6: Partition the terminal nodes into two subsets C_1, C_2 via Eq. (5)
7: T_1 ← STDR(X(C_1), τ, Alg)
8: T_2 ← STDR(X(C_2), τ, Alg)
9: Compute u, the first left singular vector of S(C_1, C_2)
10: for all edges e in T_1 do
11:    Compute the edge score d(e) from u via Eq. (8)
12: end for
13: Insert a root node for T_1 into the edge e_1 = argmin_{e∈T_1} d(e)
14: Compute v, the first right singular vector of S(C_1, C_2)
15: for all edges e in T_2 do
16:    Compute the edge score d(e) from v via Eq. (8)
17: end for
18: Insert a root node for T_2 into the edge e_2 = argmin_{e∈T_2} d(e)
19: Connect the roots of T_1 and T_2 to construct the merged tree T
20: return T

Lemma 3.1. Let A and B be a partition of the terminal nodes of an unrooted binary tree T. The matrix S(A, B) is rank one if and only if A and B are clans of T.

Lemma 3.1 implies that given the exact similarity matrix S, one can determine whether a subset A of terminal nodes is a clan in T by computing the rank of S(A, A^c), where A^c = x \ A. In practice, the exact similarity matrix S is unknown. Yet, as shown in [26], a sufficiently accurate estimate Ŝ, which in general is full rank, still allows one to determine whether a subset is a clan. The aim of step (i) of STDR is to partition the terminal nodes into two clans of T. Our approach is based on the similarity graph G of Definition 4. One possible way to partition the graph is by the min-cut criterion. Given the exact similarity, this approach is guaranteed to yield two clans; see Lemma B.1 in the appendix. Though the min-cut problem can be solved efficiently [57], it often leads to unbalanced partitions of the graph, with the smaller part containing only one or two terminal nodes. Since one goal is to reduce the runtime of the reconstruction algorithm in step (ii), we would like to avoid imbalanced partitions. To this end, we propose to partition the terminal nodes via a spectral approach based on the Fiedler vector.

Definition 5 (Graph Laplacian and Fiedler vector). The Laplacian matrix of a graph G with a symmetric weight matrix W is given by L_G = D − W, where D is the diagonal matrix with D_{ii} = Σ_j W_{ij}. The Fiedler vector is the eigenvector of L_G that corresponds to the second smallest eigenvalue.

In the STDR algorithm, we use the Fiedler vector v of the similarity graph G to partition the terminal nodes into two subsets C_1 and C_2 (Algorithm 1, line 6), as follows:

C_1 = {x_i : v(i) ≥ 0},    C_2 = {x_i : v(i) < 0}.    (5)

Importantly, in Section 4.1 we prove that partitioning the nodes of G via Eq. (5) yields two clans of the underlying tree T.
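A minimal sketch of this partitioning step (our illustration, not the paper's implementation): build the Laplacian of the similarity graph and split the terminal nodes by the sign of its second eigenvector, as in Eq. (5). A dense eigendecomposition is used for clarity; for large m the paper's implementation relies on iterative and randomized solvers.

```python
import numpy as np

def fiedler_partition(S):
    # Laplacian of the similarity graph; the unit diagonal of S cancels,
    # so L_ii = sum_{j != i} S_ij, consistent with Definition 5.
    L = np.diag(S.sum(axis=1)) - S
    _, eigvecs = np.linalg.eigh(L)       # eigenvalues in ascending order
    v = eigvecs[:, 1]                    # Fiedler vector
    return np.flatnonzero(v >= 0), np.flatnonzero(v < 0)

C1, C2 = fiedler_partition(S_hat)        # Eq. (5): split by the sign pattern
```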
To illustrate this point, we created a tree graphical model from a symmetric binary tree with m = 128 terminal nodes; see Figure 2a. The transition matrices between adjacent nodes are all identical and were chosen according to the HKY model [24]. We used this model to generate a dataset of nucleotide sequences of length n = 1,000. Figure 2b shows the Fiedler vector of the similarity graph estimated from the dataset. Here, the Fiedler vector exhibits a single dominant gap, and partitioning the terminal nodes by Eq. (5) yields two sets C_1 and C_2 which are indeed clans of T. A similar example is shown in the appendix for a tree generated according to the coalescent model. In Section 5.1 we derive an expression for the number of samples required to obtain two clans with high probability.

Step (i) of STDR outputs two sets of terminal nodes C_1 and C_2. Under certain conditions defined in Section 4.1, these are guaranteed to be two clans in the tree T. The next task is to construct trees T_1 and T_2 that describe their latent internal structure. If |C_1| > τ, then T_1 is recovered by recursively reapplying the three steps of STDR to C_1. When |C_1| ≤ τ, the input is small enough that we consider it tractable to use a direct method for tree reconstruction, even a slow one like maximum likelihood.

The output of step (ii) of STDR consists of the internal unrooted tree structures T_1 and T_2 of two subsets of terminal nodes C_1 and C_2. Assuming steps (i) and (ii) were successful, C_1 and C_2 are adjacent clans, and T_1 and T_2 are indeed their correct internal structures. The remaining challenge in step (iii) is to recover the full tree T by correctly merging T_1 and T_2. Since T_1 and T_2 are unrooted binary trees, to merge them it is necessary to add a root node to each of them. Adding a connecting edge between the two root nodes yields a binary unrooted tree and completes the merging process; see Figure 3 for an illustration. [Figure 3: (a) two unrooted trees, with the placeholder edges marked in red; (b) the merging process is completed by connecting the two root nodes h_1, h_2.] To add a root node to a subtree, we select one of its edges to be the "placeholder edge" (illustrated in red in Figure 3a). Subsequently, the placeholder edge is replaced with two edges connected to the root node. Importantly, as shown in Figure 4 (where the placeholder edges are set, for example, to e(h_A, h_B) and e(h_C, x_5)), changing the placeholder edge in either T_1 or T_2 yields a merged tree with a different topology. Thus, merging T_1 and T_2 reduces to the task of identifying the correct placeholder edge. Here, we derive a novel spectral method for finding these edges. To the best of our knowledge, our approach for merging subtrees is new and may be of independent interest for other applications, such as rooting unrooted trees [6, 8, 32]. In the following lemma, whose proof is in Appendix C, we describe a property of the placeholder edge that motivates our approach.

Lemma 3.2. Let C_1 be a set of terminal nodes that forms a clan in T, and let T_1 be the internal structure of C_1. An edge e ∈ T_1 is the correct placeholder edge if and only if it partitions C_1 into two sets A(e), B(e) such that both form clans in T.

Consider, for example, Figure 3a: the edge e(h_A, h_B) divides the left subtree into the clans {x_1, x_2} and {x_3, x_4}. These subsets also form clans in the full tree depicted in Figure 3b. Next, using Lemma 3.2, we derive a spectral characterization of the correct placeholder edge. Recall that by Lemma 3.1, the matrix S(C_1, C_2) ∈ R^{|C_1|×|C_2|} is rank one, and hence can be written as

S(C_1, C_2) = σ u v^T,    (6)

where u and v are its leading left and right singular vectors. Thus, u is proportional to the vector of similarities S(C_1, h_1) between the terminal nodes of C_1 and the root h_1 of the clan. This leads to the following characterization, proven in Appendix C.

Lemma 3.3. Let u be the leading left singular vector of S(C_1, C_2). An edge e ∈ T_1, which partitions C_1 into A(e) and B(e), is the correct placeholder edge if and only if there exists a constant α ∈ R such that

S(A(e), B(e)) = α u_{A(e)} u_{B(e)}^T,    (7)

where u_{A(e)}, u_{B(e)} denote the entries of u that correspond to A(e) and B(e).

In practice we can only compute an estimate of S. Motivated by Lemma 3.3, we propose to determine the placeholder edge e* by minimizing the following score function,

d(e) = min_{α∈R} ||α u_{A(e)} u_{B(e)}^T − S(A(e), B(e))||_F / ||S(A(e), B(e))||_F.    (8)

The normalizing factor ||S(A(e), B(e))||_F is added since the size of S(A(e), B(e)) changes for every edge e. Note that given the exact matrix S, at the correct placeholder edge d(e*) = 0. In Section 5.2 we derive an expression for the number of samples required to obtain the correct placeholder edge by Eq. (8) with high probability.
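The minimization in Eq. (8) is a one-dimensional least-squares problem with a closed-form minimizer for each candidate edge. The sketch below is our illustration of the scoring step; edge_splits, i.e., the bipartition of C_1 induced by each edge of T_1, is an assumed input, and the function names are ours.

```python
import numpy as np

def edge_score(S_C1, u, A, B):
    # d(e) of Eq. (8): least-squares fit of alpha * u_A u_B^T to S(A(e), B(e)),
    # normalized by ||S(A(e), B(e))||_F.
    M = S_C1[np.ix_(A, B)]
    R = np.outer(u[A], u[B])
    alpha = (R * M).sum() / (R * R).sum()     # closed-form minimizer over alpha
    return np.linalg.norm(alpha * R - M) / np.linalg.norm(M)

def find_placeholder_edge(S_C1, S_C1_C2, edge_splits):
    # u: leading left singular vector of S(C1, C2), as in Algorithm 1, line 9
    u = np.linalg.svd(S_C1_C2)[0][:, 0]
    scores = [edge_score(S_C1, u, A, B) for A, B in edge_splits]
    return int(np.argmin(scores))
```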
We analyze the complexity of each step of STDR separately; we assume that the similarity or distance matrix is given. To simplify the analysis, we assume a balanced binary tree and that the partitioning steps yield m/τ subsets of size τ each. We denote by B(k) the complexity of recovering the topology of a tree with k terminal nodes by the given subroutine Alg. The complexity of merging two subtrees with k terminal nodes each is composed of two parts: (i) computing the leading singular vector of the matrix S(C_1, C_2) ∈ R^{k×k}, which takes O(k²) operations; (ii) computing the score of every edge as in Eq. (8). The number of operations required for the least-squares operation in the numerator of Eq. (8), as well as for computing the Frobenius norms in the numerator and denominator, is proportional to the number of elements in S(A(e), B(e)). Thus, the total complexity of computing the scores of all edges in T_1 (and similarly T_2) is O(Σ_{e∈T_1} |A(e)||B(e)|). For a balanced tree, this term is O(k² log k). We remark that if the two trees are highly imbalanced, the complexity may increase up to O(k³). Let T(m) be the complexity of the partitioning and merging operations of STDR, excluding the complexity of the subroutine algorithm that recovers the structure of the small trees. We have that

T(m) = 2 T(m/2) + O(m² log m).

By the Master theorem [7], T(m) = O(m² log m). Thus, the total complexity of STDR is

O(m² log m + (m/τ) B(τ)).

For example, the complexity of NJ is B(k) = O(k³), so the total complexity of STDR with NJ as a subroutine is O(m² log m + m τ²), which for τ ≪ m is considerably lower than the O(m³) complexity of running NJ to recover the full tree. In the simulation section, we show that STDR+NJ outperforms NJ in accuracy while being about an order of magnitude faster. An important property of the STDR algorithm, in terms of actual runtime, is that it is embarrassingly parallel. Specifically, steps 7 and 8 in Algorithm 1 can be executed as two independent processes. This may result in up to k parallel processes, where k is the number of partitions.

In this section we consider the population setting where the similarity matrix S is known. In this setting we prove that STDR correctly recovers the underlying tree. We do so by analyzing the partitioning step (i) and the merging step (iii) of STDR. Our key results are Theorem 4.2, which states that step (i) is guaranteed to yield disjoint clans, and Theorem 4.5, which states that given accurate trees for two clans, step (iii) recovers the exact structure of the full tree. Combining these two results directly yields the following theorem establishing the correctness of STDR in the population setting.

Theorem 4.1. Given an exact similarity matrix S, and assuming that the subroutine Alg correctly recovers the internal structure of its input, STDR recovers the exact latent tree T.

The following theorem proves that given the exact similarity matrix, partitioning the terminal nodes of the tree by thresholding the Fiedler vector as described in Section 3.1 yields two adjacent clans.

Theorem 4.2. Let T be a binary tree with exact similarity matrix S. Then the subsets C_1, C_2 obtained via Eq. (5) are two adjacent clans of T.

Before proving Theorem 4.2, we would like to put its novelty into the context of related results.
A result similar in nature to Theorem 4.2 was proved for hierarchical block models (HBM) in [5], where the underlying block structure of a given connectivity matrix is recovered by recursive partitioning according to its Fiedler vector. The statistical guarantee, however, is derived under additional assumptions on the structure of the tree as well as its parameters. Theorem 4.2, in contrast, holds for any tree structure and parameters. A different, distance-based approach for tree partitioning was derived in Chapter 4 of [20]. This approach is guaranteed to yield two clans, but only given the exact distance matrix between terminal nodes. In Appendix D we show empirically that our similarity-based approach is more robust than the distance-based approach, specifically in cases where the number of samples is limited. For the proof of Theorem 4.2, we present several preliminaries on graphs. First, we define the Schur complement of a matrix, which plays an important role in graph theory [13].

Definition 6 (Schur complement). Let A, B, C and D be matrices of dimensions p × p, p × q, q × p and q × q, respectively. Assume D is invertible and consider the matrix

M = [A B; C D] ∈ R^{(p+q)×(p+q)}.

The Schur complement of the block D of the matrix M is M/D = A − B D^{−1} C.

Let H be a graph with a set of nodes V and Laplacian matrix L. We denote by L_R the principal sub-matrix of L that corresponds to a subset of nodes R ⊂ V. The Schur complement of L with respect to L_R yields the Laplacian of a different graph, with |V| − |R| nodes [11, 13]. We denote this Laplacian matrix by L_{H/R}. The rows and columns of L_{H/R} correspond to the vertices of H that are not in R. When the graph is a tree T, and R is the set of its non-terminal nodes, then L_{T/R} is the Laplacian of a complete graph G whose nodes are the terminal nodes of T. Equipped with these definitions, we proceed to the proof of Theorem 4.2. The proof consists of two parts, which correspond to Theorem 4.3 and Lemma 4.4. Theorem 4.3, which is a rephrasing of Theorem 3.3 of [54], shows that one can partition the terminal nodes of a tree T into two clans via the Fiedler vector of L_{T/R}, where R is the set of all internal nodes.

Theorem 4.3 (Theorem 3.3 of [54]).¹ Let T be a tree with a node set V and a subset of non-terminal nodes R ⊂ V. We denote by L_T the Laplacian of T and by L_{T/R} the Laplacian of a graph G obtained by the Schur complement of L_T with respect to R. Let v be the Fiedler vector of G, and let C_1, C_2 be the following partition of the terminal nodes:

C_1 = {x_i : v(i) ≥ 0},    C_2 = {x_i : v(i) < 0}.

Then C_1 and C_2 are adjacent clans in T.

¹ For clarity, we rephrased the theorem from [54] according to our terminology.

Theorem 4.3, however, is not directly applicable to our setting, since computing L_{T/R} requires knowledge of the unknown similarities between all nodes of T, including its unobserved nodes. Here, we derive Lemma 4.4, which shows that for any tree T there is a twin tree T̃ with the same topology, such that L_{T̃/R} = L_G. This result, proven in Appendix E, provides the critical missing link required for inference of the latent tree from the similarity matrix, which can be estimated from observed data.

Lemma 4.4. Let T be a tree with a set of non-terminal nodes R. Let G be the similarity graph of T. Then there is a tree T̃ with the same topology as T but different edge weights, such that L_{T̃/R} = L_G.
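The following sketch (ours, not code from the paper) makes the operation L_{H/R} concrete: it eliminates a node subset R from a graph Laplacian by taking the Schur complement of the corresponding principal block. For a connected graph and a proper subset R, the block L_R is invertible, so the operation is well defined; eliminating all internal nodes of a tree in this way produces the Laplacian of a complete graph on its leaves, as used in Theorem 4.3 and Lemma 4.4.

```python
import numpy as np

def laplacian_schur_complement(L, R):
    # Schur complement of the block L[R, R]: A - B D^{-1} C (Definition 6),
    # where A is the block of retained nodes and D the block of eliminated nodes.
    keep = np.setdiff1d(np.arange(L.shape[0]), R)
    A = L[np.ix_(keep, keep)]
    B = L[np.ix_(keep, R)]
    C = L[np.ix_(R, keep)]
    D = L[np.ix_(R, R)]
    return A - B @ np.linalg.solve(D, C)
```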
Step (iii) of STDR merges the two subtrees, T_1 and T_2, that were constructed from the two disjoint subsets of terminal nodes C_1 and C_2. As described in Section 3, this step is done by finding for each tree its placeholder edge, as the edge with the smallest score d(e) of Eq. (8). Here, we prove that this merging step is correct under the following two assumptions on its input (the output of steps (i) and (ii)): the two subtrees T_1, T_2 correspond to adjacent clans in T, and their internal structures were recovered correctly.

Theorem 4.5. Let C_1 and C_2 be the terminal nodes of two adjacent clans that partition a tree T. Let T_1 and T_2 be the internal structures of these clans. Then, given the exact similarity matrix S(C_1, C_2), minimizing the criterion in Eq. (8) yields the correct placeholder edge.

Proof. By Lemma 3.3, for the correct placeholder edge e* there exists an α ∈ R such that S(A(e*), B(e*)) = α u_{A(e*)} u_{B(e*)}^T, and hence d(e*) = 0. If e is an incorrect placeholder edge, then again according to Lemma 3.3 there is no constant α that satisfies the equation, and hence d(e) > 0, which implies e* = argmin_e d(e).

In practice, the true similarity matrix S is unknown, and an estimate Ŝ is computed from sequence data of length n. In this section we show that STDR is still able to correctly recover the tree, provided that Ŝ is sufficiently close to S. Specifically, in Sections 5.1 and 5.2 we derive lower bounds on the number of samples required for the partitioning step and the merging step to succeed with high probability. In Section 5.3 we compare these results to the guarantees available for other tree recovery algorithms. For simplicity, in the finite sample analysis we assume the Jukes-Cantor (JC) model of sequence evolution, where each transition matrix is parameterized by a single mutation rate θ(i, j):

P(h_i|h_j)_{ab} = 1 − θ(i, j) if a = b, and θ(i, j)/(ℓ − 1) if a ≠ b.    (9)

According to this model, the similarity between adjacent nodes defined in Eq. (4) simplifies to

S(h_i, h_j) = (1 − ℓ θ(i, j)/(ℓ − 1))^{ℓ−1}.    (10)

By Eq. (2) the similarity is strictly positive, and hence θ(i, j) < (ℓ − 1)/ℓ. We remark that our analysis can be extended, under minor additional assumptions, to more general models of evolution, as in [26, Lemma 4.8].
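As a quick numerical check of Eq. (10) (our sketch, under the stated JC model): the JC transition matrix has eigenvalue 1 once and eigenvalue 1 − ℓθ/(ℓ − 1) with multiplicity ℓ − 1, so its determinant equals the claimed closed form; and since this matrix is symmetric and doubly stochastic, Eq. (4) reduces to |det P|.

```python
import numpy as np

theta, ell = 0.1, 4
P = np.full((ell, ell), theta / (ell - 1))
np.fill_diagonal(P, 1 - theta)                      # JC transition matrix
closed_form = (1 - ell * theta / (ell - 1)) ** (ell - 1)
assert np.isclose(np.linalg.det(P), closed_form)    # determinant matches Eq. (10)
```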
We present results for the top level of the tree partitioning and merging. Following the proof, we show in Remark 5.10 that the same guarantees hold for multiple partitioning and merging steps.

We first compute the number of samples n required for the partitioning step to yield two clans with high probability. This requires that, in the population setting, the entries of the Fiedler vector are bounded away from zero. To that end, we assume that the similarity matrix S satisfies the hierarchical constant block model (CBM) addressed in [5]: there is a hierarchy of partitions A_C and B_C, such that for each partition there is a (different) constant c with

S(x_i, x_j) = c    for all x_i ∈ A_C, x_j ∈ B_C.    (11)

In phylogenetics, this assumption is satisfied in the molecular clock model [34], where the probability of mutation between adjacent nodes is determined by two factors: (i) the edge length between them and (ii) a mutation rate matrix that is constant throughout the tree. The structure of the rate matrix is determined by the choice of evolutionary model, such as Jukes-Cantor. In addition, the path length between all terminal nodes and the root is constant. This implies that for every ancestor h (internal node), the similarity between the terminal nodes A_C on the left of h and the nodes B_C on the right of h is constant, as in Eq. (11). For the hierarchy of partitions, we denote by η the maximum over all partitions C of the ratio between the sizes of the left and right parts A_C and B_C. This factor serves as a measure of the balancedness of the tree. In addition, we denote by r(T) the diameter of T, which is the maximal distance between pairs of terminal nodes,

r(T) = max_{x_i, x_j} d(x_i, x_j),    where d(x_i, x_j) = −log S(x_i, x_j).    (12)

Finally, we denote by h(T) the depth of T, as defined in [15]:

Definition 7. Let T_1, T_2 be two rooted subtrees with respective roots h_1, h_2, obtained by removing an edge e(h_1, h_2) from T. Let d_1(e), d_2(e) be the distances −log S(h_1, x_i) and −log S(h_2, x_j) from h_1, h_2 to the closest leaves x_i and x_j in T_1, T_2, respectively. Then h(T) = max_{e∈T} max{d_1(e), d_2(e)}.

Note that h(T) < r(T), as the maximal distance between terminal nodes is larger than any distance between a pair of terminal and non-terminal nodes. The following theorem bounds the number of samples n in terms of the tree properties defined above.

Theorem 5.1. Let T be a Jukes-Cantor evolutionary tree with m terminal nodes, whose similarity matrix S satisfies the CBM assumptions above. If the number of samples n satisfies the lower bound (15), then STDR partitions the terminal nodes into two clans with probability at least 1 − ε.

To prove the theorem, we derive a bound on the error in the estimate Ŝ that the partitioning step can tolerate.

Lemma 5.2. Assume a tree with m terminal nodes generated according to the molecular clock model. If the estimate Ŝ of its similarity matrix is sufficiently accurate, in the sense of the entrywise bound used in the proof below, then STDR correctly partitions the terminal nodes into two clans.

In our proof, we use the following lemma regarding the spectrum of the Laplacian. This lemma is a reformulation of Lemma 7 of [5], which addresses the spectrum of the CBM.

Lemma 5.3. Consider a tree with m terminal nodes generated according to the molecular clock model. Let L be the Laplacian of its similarity graph. The first, second and third smallest eigenvalues of L are given by the explicit expressions derived in the proof (Eq. (53)), and the elements v_2(i) of the eigenvector that corresponds to λ_2 satisfy |v_2(i)| ≥ 1/√(mη).

Proof of Lemma 5.2. Let L and L̂ be two symmetric matrices and let v_i and v̂_i be their i-th eigenvectors, respectively. A variant of the Davis-Kahan theorem for the perturbation of eigenvectors states that

||v̂_2 − v_2|| ≤ 2^{3/2} ||L − L̂|| / γ_2,    (16)

where γ_2 = min(λ_2 − λ_1, λ_3 − λ_2) is the eigengap around λ_2. We apply the theorem to the Laplacian matrix L = D − S (see Definition 5) and its Fiedler vector v_2. The spectral norm ||L − L̂|| can be bounded in terms of the maximal entrywise error max_{i,j} |S_{ij} − Ŝ_{ij}| (Eq. (17)). Substituting (17) into (16) yields a bound on ||v̂_2 − v_2|| (Eq. (18)). From Lemma 5.3 it follows that the spectral gap γ_2 is bounded from below (Eq. (19)). Combining Eqs. (18) and (19) proves that if Ŝ is sufficiently accurate, then ||v̂_2 − v_2|| < min_i |v_2(i)|. Hence, partitioning the terminal nodes according to sign(v_2) or sign(v̂_2) yields the same result. As we proved in Theorem 4.2, the resulting subsets are clans of the tree.

Next, we prove Theorem 5.1 under the additional assumption of the Jukes-Cantor model. The theorem is proved by combining Lemma 5.2 with a concentration bound on the similarity matrix estimate Ŝ, derived in [26].

Proof of Theorem 5.1. From Lemma 4.7 of [26], under the JC model of evolution, the entrywise deviation of Ŝ from S satisfies the concentration bound of Eq. (20). Setting t to the right-hand side of (20) yields that the requirements of Lemma 5.2 are satisfied with probability at least 1 − ε, which concludes the proof.

We now derive finite sample bounds for the merging step of STDR. In contrast to the partitioning step, the guarantees for the merging step, presented in the following theorem, hold for any tree topology.

Theorem 5.4. Let T be a tree with m terminal nodes, which consists of two subtrees T_1, T_2 with terminal nodes C_1 and C_2, respectively. Let {A, B} be the partition of C_1 induced by the correct placeholder edge e*, and let D = min{||S(A, B)||_F, ||S(C_1, C_2)||_F}. For any ε > 0, if the number of samples n satisfies the lower bound (21), then STDR finds the correct placeholder edge in T_1 with probability at least 1 − ε.
[Figure 8: runtime and accuracy of STDR as a function of the threshold parameter τ, with RAxML or SNJ as the subroutine. The data consists of n = 1000 samples generated from a binary symmetric tree with m = 2048 terminal nodes. The accuracy of the algorithm degrades for small values of τ, while the runtime improves by approximately half an order of magnitude.]

Our proof of Theorem 5.4 consists of three steps: (i) In Lemma 5.5 we derive a lower bound on the score d(e) of an edge e that is not the correct placeholder edge. (ii) Lemma 5.8 provides a sufficient condition on the accuracy of the similarity matrix estimate Ŝ that guarantees that the merging step will yield the correct placeholder edge. (iii) For the JC model, we derive an expression for the number of samples required for the condition in Lemma 5.8 to hold with high probability.

Step 1: A lower bound on the score d(e) for incorrect edges. In Section 4.2, we showed that d(e) = 0 if and only if e is the correct placeholder edge. Here, for the exact similarity matrix S, we derive a lower bound on d(e) when e is an incorrect placeholder edge in T_1.

Lemma 5.5. Let T be a tree that consists of two subtrees T_1, T_2, and let e ∈ T_1 be an edge that is not the correct placeholder edge. Then d(e) is bounded from below by the strictly positive quantity given in Eq. (22).

For the proof of Lemma 5.5, we introduce new notation, illustrated in Figure 5. The sets of terminal nodes of T_1 and T_2 are denoted by C_1 and C_2, respectively. We denote by e* ∈ T_1 the correct placeholder edge, and by e ∈ T_1 an arbitrary incorrect placeholder edge. The edge e splits the terminal nodes of T_1 into A and B and has endpoints h_A and h_B. We denote by h_0, . . . , h_N the non-terminal nodes on the path between the root node of T_1, denoted h_0, and h_A = h_N. We partition the terminal nodes in A into N + 1 subsets A_0, . . . , A_N according to h_0, . . . , h_N, as follows: every node in A is assigned to the closest non-terminal node on the path h_0, . . . , h_N. In the proof of Lemma 5.5, we use an auxiliary lemma, Lemma 5.6, proven in Appendix F.

Figure 5: Bounding the score d(e) for an incorrect placeholder edge in T_1. The correct placeholder edge e* ∈ T_1 is marked by a dotted blue line. The incorrect placeholder edge e, which partitions the terminal nodes into the subsets A(e) and B(e), is marked by a thick red line. The two non-terminal nodes on the path between the correct and incorrect edges are denoted by h_1, h_2 = h_A, and the root node of C_1 is denoted by h_0. The subset of terminal nodes closest to h_i is denoted by A_i.

Proof of Lemma 5.5. Since h_0 separates C_1 and C_2, by the multiplicative property of the similarity we have S(C_1, C_2) = S(C_1, h_0) S(h_0, C_2). Let β̃ be the proportionality constant between u and S(C_1, h_0), such that u = β̃ S(C_1, h_0). Recall that u_A, u_B in Eq. (8) are the entries of u that correspond to A and B, respectively. Partitioning u into u_A and u_B, and partitioning S(C_1, h_0) into S(A, h_0) and S(B, h_0), gives

u_A u_B^T = β̃² S(A, h_0) S(h_0, B).    (23)

The matrix S(A_i, h_0) S(h_0, B), which appears on the right side of Eq. (23), is proportional to S(A_i, B), with proportionality constant R_i = S(h_0, h_i)² (Eq. (24)). Inserting (24) into (23) gives Eq. (25), and thus the score in Eq. (8) is equivalent to the ratio of sums in Eq. (26). The following lemma, proven in Appendix F, bounds such a ratio of two sums.

Lemma 5.7. For two series of positive numbers a_i, b_i > 0, we have Σ_i a_i / Σ_i b_i ≥ min_i (a_i + a_{i+1})/(b_i + b_{i+1}).

Applying the lemma to Eq. (26) yields Eq. (27). Combining Eq. (27) and Lemma 5.6 gives the lower bound of Eq. (22), which concludes the proof of Lemma 5.5, and with it, Step 1 of the proof of Theorem 5.4.
Step 2: A sufficient condition on the estimate Ŝ. Lemma 5.5 shows that there is a gap between the score of the correct placeholder edge and the scores of all other edges in T_1. In the following lemma we show that if Ŝ is sufficiently close to S, the gap is preserved and STDR selects the correct placeholder edge. For simplicity, we address only the case δ² ≥ 0.5.

Lemma 5.8. If the similarity matrix estimate Ŝ satisfies the accuracy condition in Eq. (29), then STDR selects the correct placeholder edge.

In our proof, we use an auxiliary lemma (Lemma 5.9), proven in Appendix F, which bounds the perturbation |d̂(e) − d(e)| of the score of every edge in terms of the estimation error of Ŝ.

Proof of Lemma 5.8. Suppose e* ∈ T_1 is the correct placeholder edge and e ≠ e* is a different edge in T_1. By Lemma 5.5, d(e) is bounded from below by Eq. (22), while for the correct edge d(e*) = 0. It follows from the triangle inequality that if the perturbation of the score is small enough for all edges e, then d̂(e*) < d̂(e) for every e ≠ e*, and hence the minimizer of d̂ is the correct placeholder edge.

Step 3: The number of samples required for the merging step. By Lemma 5.8, a sufficient condition for STDR to select the correct placeholder edge is the bound of Eq. (31) on the estimation error of Ŝ. Setting t to the right-hand side of Eq. (31) and substituting it into the concentration bound of Eq. (30), we have that if n satisfies the lower bound of Theorem 5.4, then Eq. (31) holds with probability at least 1 − ε, and thus the merging step of STDR selects the correct placeholder edge with high probability.

Combining Theorem 5.1 and Theorem 5.4, for a binary symmetric tree with a fixed similarity between adjacent nodes, the sample complexity of the partitioning and merging steps of STDR is given in Eq. (32), where D_min is the minimum value of D from Theorem 5.4 over all partitions of the tree. We compare this result to three other methods for full recovery of trees. For simplicity, we assume that the similarity between all adjacent nodes is δ. In this case D_min = O(m³/(δτ²)), where τ is the user-given threshold, and for a reasonable setting of τ the sample complexity simplifies to O(m^{2 + 4 log₂(1/δ)}).

For NJ, the sample complexity given in Section 3.3 of [4] is O(exp(−4 min_{i,j} ln S(x_i, x_j))), or equivalently O(δ^{−4 diam(T)}), where diam(T) is the diameter of T. For a binary symmetric tree, diam(T) = 2 log₂(m), and hence the complexity is O(m^{8 log₂(1/δ)}), which is better than (32) for δ close to one, but worse for lower values of δ. However, the diameter of the tree can be as large as m, in which case the sample complexity of NJ is exponential in m, rather than polynomial as in (32). For SNJ, if δ² > 0.5 the sample complexity is O(m²) (by Theorem 4.3 in [26]). This is similar to (32) for δ close to one, but improves upon (32) as δ decreases. For the Dyadic Closure method [15, Theorem 9], the sample complexity is O((1/δ)^{4h(T)}), where recall that h(T) denotes the depth of a tree as in Definition 7. For a binary symmetric tree, h(T) = log₂(m), in which case the complexity is O(m^{4 log₂(1/δ)}), which improves upon Eq. (32) by a factor of m². For highly imbalanced trees, h(T) = O(1), in which case the sample complexity is logarithmic in m. The improved sample complexity, however, comes at the cost of an O(m⁵) computational complexity. Thus, excluding the Dyadic Closure method, the sample complexity of STDR is similar to that of several other distance-based methods with theoretical guarantees.

We illustrate the performance of STDR in comparison to several other algorithms in a variety of simulated settings. To this end, we generated trees according to the coalescent model (Section 6.1) and the birth-death model (Section 6.2), which are common in phylogenetics. In addition, we considered the challenging scenario of the caterpillar tree.
In all experiments, the sequences were generated according to the HKY substitution model [24] with a transition-transversion ratio of 2, a typical value in the human genome [31]. The mutation rate for the HKY model is specified in each simulation. We considered the following reconstruction methods: (i) RAxML [52], a standard tool for maximum-likelihood tree inference; (ii) neighbor joining (NJ); and (iii) spectral neighbor joining (SNJ). Recall that STDR requires as input a subroutine Alg for the reconstruction of the small trees. Thus, for comparison, we applied STDR with each of the aforementioned algorithms as the subroutine. We denote these three methods by (iv) STDR+RAxML, (v) STDR+NJ and (vi) STDR+SNJ. A second input to STDR is the threshold parameter τ, which sets an upper bound on the size of the small trees. This parameter is specified in the description of each experiment. The accuracy of the different algorithms is measured by the normalized Robinson-Foulds (RF) distance, defined as the RF distance [17] between the reconstructed and reference trees divided by 2m − 6.
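One way to compute this metric is with the DendroPy library, as in the sketch below (our illustration; the evaluation code used in the paper's experiments may differ). The RF distance of [17] is the size of the symmetric difference between the bipartition sets of the two trees, and 2m − 6 is its maximal value for two unrooted binary trees on m taxa.

```python
import dendropy
from dendropy.calculate import treecompare

def normalized_rf(newick_ref, newick_est):
    # both trees must share a taxon namespace for the comparison
    taxa = dendropy.TaxonNamespace()
    t_ref = dendropy.Tree.get(data=newick_ref, schema="newick", taxon_namespace=taxa)
    t_est = dendropy.Tree.get(data=newick_est, schema="newick", taxon_namespace=taxa)
    rf = treecompare.symmetric_difference(t_ref, t_est)   # RF distance
    return rf / (2 * len(taxa) - 6)
```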
Each experiment was repeated 5 times to obtain the mean and standard deviation of the performance and runtime of each method. In addition to the above experiments, we compare our merging procedure to TreeMerge [39]. The results for the caterpillar tree and the comparison to TreeMerge are shown in Appendix G. Finally, for a symmetric binary tree, we demonstrate how changes in the threshold τ affect the results of STDR.

Implementation remarks. To improve the results of STDR, we computed two possible partitions C_1, C_2: (i) a partition that corresponds to a threshold at 0 in the Fiedler vector, and (ii) a partition that corresponds to its largest gap. In practice, the partition was chosen between (i) and (ii) as the one that minimizes the second singular value of S(C_1, C_2); see Lemma 3.1. To improve runtime, we apply randomized methods for computing leading singular values and vectors; see [1, 22, 53].

We generated a random tree according to Kingman's coalescent model [49] with m = 2000 terminal nodes (see the example in Figure 9). Figure 6 shows the accuracy (left panel) and the runtime (right panel) of the different methods as functions of the sequence length. The threshold parameter τ was set to 128 in all experiments. Here, STDR+RAxML performs similarly to RAxML in accuracy while achieving more than an order-of-magnitude reduction in runtime. Compared to NJ and SNJ, STDR+NJ and STDR+SNJ show improvement in both accuracy and runtime.

We generated random binary trees with m = 2048 terminal nodes according to the birth-death model [33]. The STDR threshold was set to τ = 256 for all three methods. Figure 7 shows the accuracy and runtime of the different methods as a function of the sequence length n. Using STDR with NJ clearly improves upon standard NJ both in terms of accuracy and runtime. Compared to SNJ and RAxML, STDR+SNJ and STDR+RAxML show similar accuracy but significantly faster runtimes.

Our aim in the next experiment was to test the impact of the threshold parameter τ on the performance of STDR. To that end, we created a binary symmetric tree with m = 2048 terminal nodes and similarity between all adjacent nodes equal to δ = 0.65. The number of samples was set to n = 1000. We then reconstructed the tree via STDR with different subroutines and a range of threshold values. Figure 8 shows the normalized RF distance between the recovered trees and the ground-truth tree as a function of the threshold. For both RAxML and SNJ, accuracy slightly improves for higher values of the threshold. STDR+NJ is not shown in the plot because it is significantly less accurate in this setting. These results are in accordance with our analysis in Section 5, where we show that the task of merging trees becomes challenging for small subsets of terminal nodes.

Lemma B.1. Let G be the similarity graph of a binary tree T. Let A* and B* be a partition of the terminal nodes that minimizes the following min-cut criterion:

Cut(A, B) = Σ_{x_i∈A, x_j∈B} S(x_i, x_j),    {A*, B*} = argmin_{A,B} Cut(A, B).    (33)

Then A* and B* are clans in T.

Proof. Let (x_1, x_2) be a pair of adjacent terminal nodes. Consider an arbitrary partition of the terminal nodes into two non-empty subsets, denoted A and B. The two adjacent nodes (x_1, x_2) can, respectively, be labeled (A, B), (A, A), (B, A) or (B, B). We show that if A and B each contain nodes besides x_1 and x_2, then assigning x_1 and x_2 to the same subset decreases the value of the min-cut criterion. Assume without loss of generality that x_1 ∈ A, x_2 ∈ B. The cut between A and B decomposes into terms that involve x_1 or x_2 and a remainder that does not depend on the assignment of x_1 and x_2 (Eq. (34)). Let h be the unique node that is adjacent to both x_1 and x_2. By the multiplicative property of the similarity, every term in the cut that involves x_1 or x_2 factors through h. Without loss of generality, assume that the contribution associated with x_1 is not larger than that associated with x_2. It then follows (Eq. (35)) that the value of the cut is at least the value of the cut of the same partition but with x_1 moved from A to B. Thus, the min-cut partition {A*, B*} satisfies one of the following: • x_1 and x_2 are in the same subset. • One of A* or B* equals exactly {x_1} or {x_2}.

Next, let C_1 and C_2 be two adjacent clans. Assume that the terminal nodes of each of the clans are homogeneous (i.e., they all belong to the same subset, A or B). The same argument for a pair of terminal nodes carries over to the case of two adjacent homogeneous clans, showing that the min-cut partition {A*, B*} satisfies one of the following: • C_1 and C_2 are in the same subset. • One of A* or B* equals exactly C_1 or C_2.

Let {A, B} be an arbitrary partition of the terminal nodes that does not correspond to two clans in the tree. Since A and B are not clans, there must be at least two disjoint pairs C_1, C_2 and C̃_1, C̃_2 of homogeneous adjacent subsets, where the nodes in C_1 are labeled by A and the nodes in C_2 are labeled by B. By our arguments, Cut(A, B) can be reduced by either changing the labels of C_1 to B or of C_2 to A, which implies that {A, B} is not the min-cut partition. Thus, for any min-cut partition {A*, B*}, A* and B* are clans.

We present here the proofs of Lemma 3.2 and Lemma 3.3, which are used in Section 3.

Proof of Lemma 3.2. Let C_2 be the clan of all the terminal nodes of T that are not in C_1. Consider an edge e(h_A, h_B) in T_1 that partitions C_1 into A(e) and B(e). First, assume that e(h_A, h_B) is the correct placeholder edge of T_1. Then there exists a node h_1 in the full tree T that is connected to h_A, h_B and to the root node of C_2. Removing the edge e(h_A, h_1) in T separates the subset A(e) from the remaining nodes of T, which implies that A(e) is a clan in T. By the same argument, B(e) is also a clan in T. Conversely, assume that A(e), B(e) and C_2 are disjoint clans that partition the terminal nodes of T. Then, there exists a node h_1 that connects to the roots of A(e), B(e) and T_2.
This proves that the edge e(h_A, h_B) in T_1 is the correct placeholder edge, since it is where the root h_1 is inserted.

Proof of Lemma 3.3. Let C_1 = A ∪ B be the terminal nodes of the clan T_1 and let h_1 be its root. We denote by C_2 the terminal nodes of its adjacent clan. By the multiplicative property of the similarity function, S(C_1, C_2) = S(C_1, h_1) S(h_1, C_2). Combining this expression with Eq. (6) implies that the left singular vector u of S(C_1, C_2) is proportional to the vector of similarities S(C_1, h_1); that is, there exists β ∈ R such that S(C_1, h_1) = βu. Let e be an edge in T_1 that partitions the terminal nodes into A(e), B(e). The vector S(C_1, h_1) can be similarly partitioned into S(A(e), h_1) = β u_{A(e)} and S(B(e), h_1) = β u_{B(e)}. We first prove that if e is the correct placeholder edge of T_1, then Eq. (7) holds: in this case h_1 lies on the path between every node of A(e) and every node of B(e), so S(A(e), B(e)) = S(A(e), h_1) S(h_1, B(e)) = β² u_{A(e)} u_{B(e)}^T. Setting α = β² proves Eq. (7). Next, we assume that Eq. (7) holds and prove that e is the correct placeholder edge.

Let D ∈ R^{m×m} be a matrix whose elements are the pairwise phylogenetic distances between all terminal nodes. Given the exact distance matrix, it was shown in [20] that the terminal nodes of a tree can be partitioned into two clans according to the sign pattern of the leading eigenvector of the matrix (I − 11^T/m) D (I − 11^T/m). Figure 10 shows the percentage of times the terminal nodes were correctly partitioned into clans by applying our similarity-based approach vs. the distance-based approach derived in [20]. We generated 200 random trees according to Kingman's coalescent model with m = 128 terminal nodes. Figure 10a shows the fraction of times each method successfully partitioned the tree as a function of the number of samples, with a fixed similarity between adjacent nodes of δ = 0.9. Similarly, Figure 10b shows the performance of both methods as a function of δ with a fixed number of samples n = 100. The advantage of using the similarity matrix over the distance matrix is clear.

We begin with several definitions and notation. We denote by G(v, w) and T(v, w) the weight between nodes v and w in a graph G and a tree T, respectively. For a tree T, we denote by path_T(v, w) the path between v and w in T. Next, we define the multiplicative weight between two nodes in a tree.

Definition 8. The multiplicative weight between v and w in a tree T is equal to

α_T(v, w) = ∏_{e(v', w') ∈ path_T(v, w)} T(v', w').    (37)

For example, let T be a tree whose edge weights are given by the similarity in Eq. (4); then the similarity between two terminal nodes x_1, x_2 is equal to the multiplicative weight α_T(x_1, x_2). The next definition concerns a graph whose nodes correspond to a subset of the nodes of T, with weights computed according to (37).

Definition 9 (Multiplicative subgraph). Let T be a tree with a set of nodes V. We say that a graph G is a multiplicative subgraph with respect to T and a subset of nodes V' ⊂ V if (i) the nodes of G correspond to V' and (ii) the weight assigned to an edge connecting v, w in G is equal to the multiplicative weight between v and w in T, G(v, w) = α_T(v, w).

For convenience, we will sometimes say that G is a multiplicative subgraph of T without explicitly stating which nodes are included in G. By definition, the similarity graph G is a multiplicative subgraph with respect to the terminal nodes of T. Note that we use v and w as nodes both in G and in T interchangeably, since by definition every node in G corresponds to a node in T. The proof of Lemma 4.4 is constructive.
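Definition 8 is straightforward to compute on a weighted tree. The sketch below is ours, using networkx and an assumed "weight" edge attribute; it multiplies the edge weights along the unique path between two nodes.

```python
import networkx as nx

def multiplicative_weight(T, v, w):
    # alpha_T(v, w) of Eq. (37): product of edge weights on path_T(v, w);
    # in a tree, the shortest path is the unique path.
    path = nx.shortest_path(T, v, w)
    out = 1.0
    for a, b in zip(path, path[1:]):
        out *= T[a][b]["weight"]
    return out
```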
Given a tree T and its similarity graph G, we present an iterative procedure that builds a second tree T̃ with the same topology as T but different weights, such that L_{T̃/R} = L_G, where R is the set of all internal nodes of T. Computing T̃ consists of iterative and simultaneous updates of a graph and a tree: (i) a graph G_i with nodes that correspond to a subset of the nodes of T; the initial graph G_0 is set to G, with only the terminal nodes of T; (ii) a tree T_i with the same topology as T; the weights of the initial tree T_0 are set such that T_0 = T. At each iteration i, we add one of the non-terminal nodes h_i of T (that was not previously added) to G_i, creating G_{i+1}. The weights of the new graph G_{i+1} are set such that the Schur complement of its Laplacian matrix with respect to the added node h_i is equal to the Laplacian of the previous graph, L_{G_i}. The steps for computing G_{i+1} given G_i and T_i are described in Algorithm 2. Next, we compute a new tree T_{i+1} with the same topology as T_i. The weights of T_{i+1} are set such that G_{i+1} becomes a multiplicative subgraph with respect to T_{i+1}. The steps for computing T_{i+1} are described in Algorithm 3. At every iteration i, we maintain an active set of nodes, which we denote by A_i. When updating G_i, changes are made only to edges connecting two nodes in A_i ∪ {h_i}. When updating T_i, changes are made only to edges on the path between two nodes in the active set. The initial active set A_0 is equal to the set of all terminal nodes of T. In our proof, we use the following two auxiliary lemmas, which show the correctness of the updating procedures of G_i and T_i. An implementation of Algorithms 2 and 3 is available on GitHub.

The first lemma proves the correctness of Algorithm 2. The input to Algorithm 2 is the tree T_i, a multiplicative subgraph G_i and an active set A_i, all of which were computed in the previous iteration. The output of the algorithm is an updated graph G_{i+1} that contains an additional node h_i. In addition, the algorithm updates the active set A_i to create A_{i+1}.

Lemma E.1. The output of Algorithm 2 is a graph G_{i+1} whose nodes include h_i as well as all the nodes of G_i, such that L_{G_{i+1}/h_i} = L_{G_i}.

The next lemma concerns the updating procedure of T_i. The input to Algorithm 3 consists of the new active set A_{i+1} and the node h_i added to G_{i+1}. Here, the only changes made are to edges on the path between h_i and the nodes in the active set A_{i+1}.

Lemma E.2. The tree T_{i+1} built according to Algorithm 3 is such that G_{i+1} is a multiplicative subgraph of T_{i+1}.

Figure 11 shows two iterations of the aforementioned process for a tree T with four terminal and two non-terminal nodes. For simplicity, all the weights of the tree T are set to 1/2.

Proof of Lemma 4.4. We initialize the updating process with a tree T and its similarity graph G = G_0. By definition, G_0 is a multiplicative subgraph of T, and therefore satisfies the conditions of Lemma E.1. The lemma guarantees that after the first update, we obtain a graph G_1 with a Laplacian that satisfies L_{G_1/h_0} = L_{G_0}, where h_0 is the node added to G_0 at the first iteration. Lemma E.2 guarantees that G_1 is a multiplicative subgraph of T_1. Thus, we can re-apply Algorithm 2 to the pair G_1, T_1, and at each iteration i we obtain a graph G_{i+1} that satisfies L_{G_{i+1}/h_i} = L_{G_i} (Eq. (42)).

Algorithm 2 (updating G_i):
1: Choose a node h_i in T_i that is not in G_i and is adjacent to at least two nodes v_{i,1}, v_{i,2} in the active set A_i.
2: Add h_i to G_{i+1}.
3: Remove edges between the nodes v_{i,1}, v_{i,2} and the rest of the active set A_i.
4: Set the weight between the new node h_i and each node x in the active set according to Eq. (39).
5: Update the weights between every two nodes x, y in the active set (except v_{i,1}, v_{i,2}) according to Eq. (41).
6: Remove the nodes v_{i,1} and v_{i,2} from the active set, and add h_i to form A_{i+1}.

Algorithm 3 (updating T_i):
Input: A_{i+1}, the active set; h_i, the node last added to G_{i+1}; v_{i,1}, v_{i,2}, the nodes that were removed from the active set in the last update.
Output: T_{i+1}, a tree with weights computed such that G_{i+1} is a multiplicative subgraph of T_{i+1}.
1: Set the weights of the edges adjacent to v_{i,1} and v_{i,2} to T_{i+1}(x, y) = d · T_i(x, y), where d is given by Eq. (40).
2: Set the weight of the edge adjacent to the new node h_i.
3: For two adjacent nodes x, y ∈ T, where y is a node in the active set A_{i+1} and x is on the path between y and h_i, set the weight accordingly.
4: For two adjacent nodes x, y ∈ T that are not in the active set: if the edge (x, y) is on the path between a node in the active set and h_i, where x is closer to h_i, set the weight accordingly.

Repeating the updating process for all l non-terminal nodes of T yields the graph G_l, which by construction has the same topology as T. In addition, due to the transitivity of the Schur complement operation, Eq. (42) implies that L_{G_l/R} = L_{G_0} = L_G. Thus, T_l is a tree with the same topology as T but with different weights, such that L_{T_l/R} = L_G, which proves the lemma.

Proof of Lemma E.1. Assume, for simplicity of notation, that the j-th row and column of L_{G_{i+1}} correspond to node h_j, and that the i-th row and column correspond to the added node h_i. We denote by m_j the j-th column of L_{G_{i+1}} after removing the i-th entry, and by 1 the all-ones vector. Since h_i is a single node, the Schur complement L_{G_{i+1}/h_i} defined via Definition 6 simplifies to

L_{G_{i+1}/h_i}(j, k) = L_{G_{i+1}}(j, k) − L_{G_{i+1}}(j, i) L_{G_{i+1}}(k, i) / L_{G_{i+1}}(i, i).    (43)

For a Laplacian matrix, the sum over any row is equal to zero. Since m_j is equal to the j-th column of L_{G_{i+1}} after removing the i-th entry, we have 1^T m_j = −L_{G_{i+1}}(i, j). Using this identity, we rewrite Eq. (43) as Eq. (44). The only edges changed between G_i and G_{i+1} are edges between nodes in the active set A_i. Thus, if either h_k or h_j is not in the active set, then L_{G_{i+1}}(j, k) = L_{G_i}(j, k). In addition, by step 4 of Algorithm 2, the added node h_i is connected only to nodes in the active set A_i. Thus, if either h_k or h_j is not part of A_i, we have L_{G_{i+1}}(j, i) L_{G_{i+1}}(k, i) = 0. It follows that in this case L_{G_{i+1}/h_i}(j, k) = L_{G_i}(j, k), as required. Next, we assume that both h_j and h_k are part of the active set A_i. Eqs. (39) and (41) give Eq. (45). Inserting Eq. (45) into Eq. (44) gives Eq. (46). The denominator of the last term on the right-hand side of Eq. (46) is equal to d², and hence Eq. (47) follows. We conclude that for every pair (j, k), L_{G_{i+1}/h_i}(j, k) = L_{G_i}(j, k).

Proof of Lemma E.2. Here, our task is to prove that the weight assigned to any edge G_{i+1}(x, y) is equal to the multiplicative weight α_{T_{i+1}}(x, y). We address three cases: (i) the node x is in the active set A_{i+1} and y is equal to the node h_i added to the graph in iteration i; (ii) both x and y are in A_{i+1}, and are not equal to h_i; and (iii) x = h_i and y is either v_{i,1} or v_{i,2}. For a pair of nodes (x, y) not covered by cases (i)-(iii), the corresponding edges in G_i and T_i were not changed in the updating steps. For case (i), we assume that x is in A_{i+1} and y = h_i, and hence G_{i+1}(x, h_i) is given by Eq. (39) in Algorithm 2. We denote the nodes on the path between x and h_i in T_i by z_1 = h_i, z_2, . . . , z_K = x. The edge between h_i and z_2 is updated according to step 2 of Algorithm 3. The edge between z_{K−1} and z_K is updated by step 3. The remaining edges are updated by step 4.
Proof of Lemma E.2. Our task is to prove that the weight assigned to any edge $G_{i+1}(x, y)$ equals the multiplicative path weight $\alpha_{T_{i+1}}(x, y)$. We address three cases: (i) the node $x$ is in the active set $A_{i+1}$ and $y$ equals the node $h_i$ added to the graph at iteration $i$; (ii) both $x$ and $y$ are in $A_{i+1}$ and not equal to $h_i$; (iii) $x = h_i$ and $y$ is either $v_{i,1}$ or $v_{i,2}$. For a pair of nodes $(x, y)$ not covered by cases (i)-(iii), the corresponding edges in $G_i$ and $T_i$ were not changed by the updating steps.

For case (i), assume that $x$ is in $A_{i+1}$ and $y = h_i$; the weight $G_{i+1}(x, h_i)$ is then given by Eq. (39) of Algorithm 2. We denote the nodes on the path between $h_i$ and $x$ in $T_i$ by $z_1 = h_i, z_2, \ldots, z_K = x$. The edge between $h_i$ and $z_2$ is updated according to step 2 of Algorithm 3, the edge between $z_{K-1}$ and $z_K$ is updated by step 3, and the remaining edges are updated by step 4. The multiplicative weight $\alpha_{T_{i+1}}(x, h_i)$ in the updated tree $T_{i+1}$, obtained by multiplying these updated edge weights, equals the right-hand side of Eq. (39). Thus $G_{i+1}(x, h_i) = \alpha_{T_{i+1}}(x, h_i)$ for any $x$ in the active set.

In case (ii), $x, y$ are two nodes in the active set not equal to $h_i$, so the weight $G_{i+1}(x, y)$ is given by the update rule of Eq. (41) in Algorithm 2. Denote by $u$ the unique node that connects the three nodes $x$, $y$ and $h_i$. Since the paths from $x$ and from $y$ to $h_i$ both pass through $u$,

$\alpha_{T_i}(x, h_i)\,\alpha_{T_i}(y, h_i) = \alpha_{T_i}(x, y)\,\alpha_{T_i}(u, h_i)^2. \quad (48)$

By the assumption on the input to Algorithm 2 at the previous iteration, the graph $G_i$ is a multiplicative subgraph of $T_i$, and hence $G_i(x, y) = \alpha_{T_i}(x, y)$. Thus, Eqs. (41) and (48) express $G_{i+1}(x, y)$ in terms of $\alpha_{T_i}(x, y)$ and $\alpha_{T_i}(u, h_i)$. Next, we show that $G_{i+1}(x, y)$ equals the multiplicative weight $\alpha_{T_{i+1}}(x, y)$. Let $z_1 = x, \ldots, z_\kappa = u, \ldots, z_K = y$ be the nodes on the path between $x$ and $y$. By steps 2 and 3 of Algorithm 3, the multiplicative weight $\alpha_{T_{i+1}}(x, y)$ equals the product of the updated edge weights along this path; regrouping this product around $u$ and comparing with the expression above shows that $\alpha_{T_{i+1}}(x, y) = G_{i+1}(x, y)$.

Lastly, we consider case (iii), where $x = h_i$ and $y$ is either $v_{i,1}$ or $v_{i,2}$. Recall that $v_{i,1}, v_{i,2}$ are adjacent to $h_i$ in $T$ and were removed from the active set. By step 4 of Algorithm 2 and step 1 of Algorithm 3, the edge $G_i(x, y)$ and its corresponding edge $T_i(x, y)$ have both been updated such that $T_{i+1}(x, y) = G_{i+1}(x, y) = d\,T_i(x, y)$.

Proof of Lemma 5.3. We begin by characterizing all the eigenvectors of $L \in \mathbb{R}^{m \times m}$. For any non-terminal node $h$ in the binary symmetric tree $T$, we denote the set of descendant terminal nodes to the "left" of $h$ by $A$, the set of descendant terminal nodes to the "right" of $h$ by $B$, and the rest of the terminal nodes by $C$. Let $v_h \in \mathbb{R}^m$ be the vector with

$v_h(i) = 1$ for $i \in A$, $\quad v_h(i) = -1$ for $i \in B$, $\quad v_h(i) = 0$ for $i \in C$.

We show that for any choice of non-terminal node $h$, $v_h$ is an eigenvector of $L$. Since there are $m - 1$ non-terminal nodes, this set of eigenvectors, together with the all-ones vector, forms the full set of eigenvectors of $L$. First, we show that $v_h$ is an eigenvector of the similarity matrix $S$ and compute the corresponding eigenvalue. For $i \in A$,

$(S v_h)_i = \sum_{j \in A} S(i, j) - \sum_{k \in B} S(i, k).$

Due to the symmetry of the tree $T$, every terminal node has a similarity of $\delta^2$ to one other terminal node, $\delta^4$ to two other terminal nodes, and so on. Thus,

$\sum_{j \in A} S(i, j) = 1 + \delta^2 + 2\delta^4 + \cdots + \frac{|A|}{2}\,\delta^{2\log_2 |A|} = \frac{\delta^2\left(1 - (2\delta^2)^{\log_2 |A|}\right)}{1 - 2\delta^2} + 1.$

The similarity between a node $i \in A$ and every node $k \in B$ equals $\delta^{2(\log_2 |A| + 1)}$. Thus,

$(S v_h)_i = \frac{\delta^2\left(1 - (2\delta^2)^{\log_2 |A|}\right)}{1 - 2\delta^2} + 1 - |A|\,\delta^{2(\log_2 |A| + 1)}. \quad (50)$

The same result with a negative sign holds for $i \in B$, and if $i \in C$ then by symmetry $(S v_h)_i = 0$. Thus $v_h$ is an eigenvector of $S$ with eigenvalue equal to the right-hand side of (50). The sum of every row of $S$ is equal to

$\sum_j S(i, j) = \frac{\delta^2\left(1 - (2\delta^2)^{\log_2 m}\right)}{1 - 2\delta^2} + 1. \quad (51)$

Let $D$ be the scalar matrix with diagonal elements equal to Eq. (51). Combining Eq. (51) and Eq. (50), we get that $v_h$ is an eigenvector of $L = D - S$ with eigenvalue

$e(h) = \frac{\delta^2\left((2\delta^2)^{\log_2 |A|} - (2\delta^2)^{\log_2 m}\right)}{1 - 2\delta^2} + |A|\,\delta^{2(\log_2 |A| + 1)}.$

For any Laplacian matrix, $0$ is an eigenvalue corresponding to the all-ones vector. Since the eigenvalue $e(h)$ decreases as $|A|$ grows, the two smallest non-zero eigenvalues correspond to $|A| = m/2$ and $|A| = m/4$. The three smallest eigenvalues are thus $\lambda_1 = 0$, $\lambda_2 = e(h)\big|_{|A| = m/2}$ and $\lambda_3 = e(h)\big|_{|A| = m/4}$.
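The eigenvector structure derived above is easy to verify numerically. The sketch below, in Python with NumPy, builds the similarity matrix of a binary symmetric tree with decay $\delta$, forms $L = D - S$, and checks that the $\pm 1$ split vector at the root is an eigenvector; the function lca_levels and the leaf indexing are our own illustration choices, not code from the paper.

```python
import numpy as np

delta, depth = 0.7, 3      # similarity decay per edge pair and tree depth
m = 2 ** depth             # number of terminal nodes

def lca_levels(i, j):
    """Number of levels from leaves i, j up to their lowest common ancestor."""
    lvl = 0
    while i != j:
        i //= 2; j //= 2; lvl += 1
    return lvl

# S(i, j) = delta ** (tree distance); the distance between two leaves is
# twice the number of levels up to their lowest common ancestor.
S = np.array([[delta ** (2 * lca_levels(i, j)) for j in range(m)]
              for i in range(m)])
L = np.diag(S.sum(axis=1)) - S

# v_h for the root split: +1 on the left half (A), -1 on the right half (B).
v = np.concatenate([np.ones(m // 2), -np.ones(m // 2)])
lam = (L @ v) / v          # constant vector iff v is an eigenvector
assert np.allclose(lam, lam[0])
```

An analogous check works for the split vector of any internal node, restricting the division to the support of $v_h$.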
In the following proof, we use notation similar to that in the proof of Lemma 5.5.

Proof of Lemma 5.6. For simplicity, let $x = \|S(A_i, B)\|_F^2$ and $y = \|S(A_{i+k}, B)\|_F^2$. To compute the numerator of Eq. (22), we set the partial derivative with respect to $\beta$ to zero, which gives the optimal value $\beta^*$. Plugging $\beta^*$ back into the numerator of Eq. (22), and observing that $R_{i+k} = R_i\, S(h_i, h_{i+k})^2$, the expression simplifies further. Recall from Eqs. (2) and (3) that $S(h_i, h_{i+k}) < \xi < 1$ for any $1 \le i \le N - 1$; it follows that the numerator satisfies the bound of Eq. (54). Next, we simplify the term $\min(x, y)/\max(x, y)$ in Eq. (54). Note that $h_{i+k}$ separates $A_i$ and $A_{i+k}$ from $B$; see the illustration in Figure 5. Thus, we can rewrite $\min(x, y)$ in terms of $\|S(A_i, h_{i+k})\|^2$ and $\|S(A_{i+k}, h_{i+k})\|^2$, and we provide lower and upper bounds on these two terms. By Eq. (2), the similarity between the nodes in $A_i, A_{i+k}$ and $h_{i+k}$ is bounded by $\xi$, which yields the upper bound of Eq. (55). For a lower bound, we apply [26, Lemma 4.5]: given the terminal nodes of a clan $A$ and the root $h$ of a clan, the lemma bounds the norm of $S(A, h)$ from below. There are $k + 1$ edges between the root of $A_i$ and $h_{i+k}$, and one edge between the root of $A_{i+k}$ and $h_{i+k}$; this yields Eq. (56). Plugging Eqs. (55) and (56) into (54) concludes the proof.

Proof of Lemma 5.7. The lemma is a small variation on the known lower bound for a ratio of sums, $\sum_i a_i / \sum_i b_i \ge \min_i a_i / b_i$. For an even number of elements, we merge non-overlapping pairs of consecutive elements, setting $\tilde a_i = a_{2i} + a_{2i+1}$ and $\tilde b_i = b_{2i} + b_{2i+1}$. Applying the standard bound for a ratio of sums to $\tilde a_i$ and $\tilde b_i$ gives

$\frac{\sum_i a_i}{\sum_i b_i} = \frac{\sum_i \tilde a_i}{\sum_i \tilde b_i} \ge \min_i \frac{\tilde a_i}{\tilde b_i} = \min_i \frac{a_{2i} + a_{2i+1}}{b_{2i} + b_{2i+1}}.$

For an odd number of elements, we merge the first three elements $i = 0, 1, 2$; the rest are merged into consecutive pairs. The ratio for the elements $i = 0, 1, 2$ can be bounded by the minimum ratio over all pairs $i, j \in \{0, 1, 2\}$. Thus,

$\frac{a_0 + a_1 + a_2}{b_0 + b_1 + b_2} \ge \min_{i \ne j \in \{0,1,2\}} \frac{a_i + a_j}{b_i + b_j}.$

Lemma F.1. Let $X, \hat X \in \mathbb{R}^{m \times n}$ and let $y, \hat y > 0$. Assume that $\|\hat X\|_F \le \hat y$. Then

$\Big\| \frac{X}{y} - \frac{\hat X}{\hat y} \Big\|_F \le \frac{\|X - \hat X\|_F + |y - \hat y|}{y}.$

Proof. By definition,

$\frac{X}{y} - \frac{\hat X}{\hat y} = \frac{X - \hat X}{y} + \hat X \cdot \frac{\hat y - y}{y\,\hat y}.$

Since $\|\hat X\|_F \le \hat y$, the lemma follows.

Lemma F.2. Let $X$ and $Y$ be two matrices and let $\hat X$ and $\hat Y$ be their corresponding noisy estimates. Then

$\|XY - \hat X \hat Y\|_F \le \|X - \hat X\|_F\, \|Y\|_F + \|\hat X\|_F\, \|Y - \hat Y\|_F.$

Lemma F.3. Let $X \in \mathbb{R}^{n_1 \times n_2}$, $Y \in \mathbb{R}^{n_2 \times n_3}$, $Z \in \mathbb{R}^{n_3 \times n_4}$ be three matrices and let $\hat X, \hat Y, \hat Z$ be their corresponding estimates. Applying Lemma F.2 twice and combining the two bounds gives

$\|XYZ - \hat X \hat Y \hat Z\|_F \le \|X - \hat X\|_F \|Y\|_F \|Z\|_F + \|\hat X\|_F \|Y - \hat Y\|_F \|Z\|_F + \|\hat X\|_F \|\hat Y\|_F \|Z - \hat Z\|_F.$

Lemma F.4. Let $S$ denote a rank-one matrix and $\hat S$ its noisy estimate, and denote by $u, \hat u$ their respective leading left singular vectors. If $\|S - \hat S\|_F \le 0.5\, \|S\|_F$, then $\|\hat u \hat u^T - u u^T\|_F$ is bounded by a constant multiple of $\|S - \hat S\|_F / \|S\|_F$.

Proof. By a variant of the Davis-Kahan theorem, Eq. (59) bounds $\|\hat u \hat u^T - u u^T\|_F$ in terms of $\|S - \hat S\|_F$ and the gap $\sigma_1(S) - \sigma_2(S)$, where $\sigma_1(S)$ and $\sigma_2(S)$ are the two leading singular values of $S$. Since $S$ is rank one, $\sigma_1(S) = \|S\| = \|S\|_F$ and $\sigma_2(S) = 0$. In addition, we assumed that $\|S - \hat S\|_F \le 0.5\, \|S\|_F$, and hence the denominator of this bound is at least $0.5\, \|S\|_F$ (Eq. (60)). Combining Eqs. (59) and (60) concludes the proof.

Proof of Lemma 5.9. Let $d(e)$ denote the score of the edge $e$ computed from the exact similarity matrix $S$, as defined in (8), and denote by $\hat d(e)$ the score computed from the noisy estimate $\hat S$. The difference between $d(e)$ and $\hat d(e)$ expands into a sum of terms, where the second inequality in the resulting bound is due to the reverse triangle inequality and the definition of $D$. We focus on the term $\|\alpha^* u_A u_B^T - \hat\alpha^* \hat u_A \hat u_B^T\|_F$; a similar expression holds for $\beta^*$. Multiplying $\alpha^*$ and $\beta^*$ by $u_A u_B^T$ and $\hat u_A \hat u_B^T$, respectively, and rearranging introduces error matrices $\varepsilon_A, \varepsilon_B$. The matrices $\varepsilon_A, \varepsilon_B$ are submatrices of $\hat u \hat u^T - u u^T$, and hence $\|\varepsilon_A\|_F, \|\varepsilon_B\|_F \le \|\hat u \hat u^T - u u^T\|_F$. Applying Lemma F.4 bounds this quantity by a constant multiple of $\|S - \hat S\|_F / \|S\|_F$. Combining the above bounds, where the final inequality is due to the reverse triangle inequality and our assumption on $\varepsilon$, concludes the proof.
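Lemma F.4, used in the proof above, concerns the stability of the leading singular vector of a rank-one matrix under perturbation. The following snippet illustrates the effect numerically in Python with NumPy; the matrix sizes and the noise level are our own toy choices, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Rank-one matrix S = 5 * u v^T, and a perturbation scaled so that
# ||S - S_hat||_F = 0.3 * ||S||_F, within the lemma's assumption (<= 0.5).
u = rng.normal(size=16); u /= np.linalg.norm(u)
v = rng.normal(size=16); v /= np.linalg.norm(v)
S = 5.0 * np.outer(u, v)
E = rng.normal(size=S.shape)
E *= 0.3 * np.linalg.norm(S, "fro") / np.linalg.norm(E, "fro")
S_hat = S + E

u_hat = np.linalg.svd(S_hat)[0][:, 0]   # leading left singular vector
err = np.linalg.norm(np.outer(u_hat, u_hat) - np.outer(u, u), "fro")
print(err, np.linalg.norm(E, "fro") / np.linalg.norm(S, "fro"))
# err is of the same order as ||S - S_hat||_F / ||S||_F, as the lemma predicts.
```

Note that $\hat u \hat u^T$ is compared rather than $\hat u$ itself, since singular vectors are defined only up to sign.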
References
Matrix decompositions using sub-gaussian random matrices. Information and Inference: A Journal of the IMA.
Molecular phylogenetics from an algebraic viewpoint.
Learning mixtures of tree graphical models.
The performance of neighbor-joining methods of phylogenetic reconstruction.
Noise thresholds for spectral clustering.
Rooting with multiple outgroups: consensus versus parsimony.
A general method for solving divide-and-conquer recurrences.
Comparison of methods for rooting phylogenetic trees: a case study using Orcuttieae (Poaceae: Chloridoideae).
Full reconstruction of Markov models on evolutionary trees: identifiability and consistency.
Learning latent tree graphical models.
Applications of M-matrices to non-negative matrices.
A min-max cut algorithm for graph partitioning and data clustering.
Kron reduction of graphs with applications to electrical networks.
Biological sequence analysis: probabilistic models of proteins and nucleic acids.
A few logs suffice to build (almost) all trees (I).
Tree construction using singular value decomposition.
Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units.
Inferring phylogenies.
A property of eigenvectors of nonnegative symmetric matrices and its application to graph theory.
Connections between numerical taxonomy and phylogenetics.
A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood.
Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions.
Greedy learning of binary latent trees.
Dating of the human-ape splitting by a molecular clock of mitochondrial DNA.
Molecular systematics.
Spectral neighbor joining for reconstruction of latent tree models.
The spectral underpinning of word2vec.
A polynomial time approximation scheme for inferring evolutionary trees from quartet topologies and its application.
Inference of single-cell phylogenies from lineage tracing data using Cassiopeia.
Evolutionary inference for function-valued traits: Gaussian process regression on phylogenies.
Transition-transversion bias is not universal: a counter example from grasshopper pseudogenes.
Rooting trees, methods for. Encyclopedia of Evolutionary Biology.
Stochastic birth and death processes.
Molecular clocks: four decades of evolution.
Graph splitting: a graph-based approach for superfamily-scale phylogenetic tree reconstruction.
Why neighbor-joining works.
ASTRAL-II: coalescent-based species tree estimation with many hundreds of taxa and thousands of genes.
Statistically consistent divide-and-conquer pipelines for phylogeny estimation using NJMerge.
TreeMerge: a new method for improving the scalability of species tree estimation methods.
Phylogenetic analysis of SARS-CoV-2 data is difficult.
Learning nonsingular phylogenies and hidden Markov models.
A survey on latent tree models and applications.
Molecular evolution and phylogenetics.
FastTree 2: approximately maximum-likelihood trees for large alignments.
Single-cell lineages reveal the rates, routes, and drivers of metastasis in cancer xenografts.
The neighbor-joining method: a new method for reconstructing phylogenetic trees.
The challenge of constructing large phylogenetic trees.
Coalescent theory: an introduction.
Single-cell lineage tracing of metastatic cancer reveals selection of hybrid EMT states.
A statistical method for evaluating systematic relationships.
RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies.
On the Fiedler vectors of graphs that arise from trees by Schur complementation of the Laplacian.
Quartet puzzling: a quartet maximum-likelihood method for reconstructing tree topologies.
Prospects for inferring very large phylogenies by using the neighbor-joining method.
A tutorial on spectral clustering.
Supertree construction: opportunities and challenges.
Of clades and clans: terms for phylogenetic relationships in unrooted trees.
Molecular phylogenetics: principles and practice.
A useful variant of the Davis-Kahan theorem for statisticians.
Latent tree models and diagnosis in traditional Chinese medicine.
Phylogeny inference based on spectral graph clustering.
Evaluating fast maximum likelihood-based phylogenetic programs using empirical phylogenomic data sets.

Acknowledgments
The authors would like to thank Junhyong Kim, Stefan Steinerberger and Ronald Coifman for useful and insightful discussions. Y.K. and Y.A. acknowledge support by NIH grants R01GM131642, UM1DA051410 and R61DA047037. Y.K. and B.N. acknowledge support by NIH grant R01GM135928. Y.K. acknowledges support by NIH grant 2P50CA121974.

We generated random trees with 2000 terminal nodes according to the coalescent model. The trees were recursively partitioned by STDR with a threshold of τ = 128, and the structure of the different partitions was recovered by RAxML. We compared STDR's merging criterion with TreeMerge [39] for various sequence lengths. The results are shown in Figure 13: the merging process of STDR achieved better accuracy than TreeMerge, with a significantly reduced runtime.

We generated a tree with m = 512 terminal nodes according to the coalescent model; see Figure 9a. The transition matrices were set according to the HKY model [24]. We then generated a dataset of nucleotide sequences of length n = 2,000. Figure 9b shows the Fiedler vector of the similarity graph estimated from the dataset. Partitioning the terminal nodes according to the sign pattern of the Fiedler vector yields two clans.
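The partitioning step shown in Figure 9b amounts to computing the Fiedler vector of the Laplacian of the estimated similarity graph and splitting the terminal nodes by its sign pattern. Below is a minimal sketch in Python with NumPy; the block-structured similarity matrix is a stand-in for one estimated from sequence data, and fiedler_partition is our own name.

```python
import numpy as np

def fiedler_partition(S):
    """Split terminal nodes by the sign pattern of the Fiedler vector,
    i.e. the eigenvector of the second-smallest Laplacian eigenvalue."""
    L = np.diag(S.sum(axis=1)) - S
    _, eigvecs = np.linalg.eigh(L)       # eigenvalues sorted ascending
    fiedler = eigvecs[:, 1]
    return np.where(fiedler >= 0)[0], np.where(fiedler < 0)[0]

# Stand-in similarity matrix with two well-separated groups of leaves.
S = np.full((8, 8), 0.05)
S[:4, :4] = S[4:, 4:] = 0.6
np.fill_diagonal(S, 1.0)
left, right = fiedler_partition(S)
print(left, right)   # the groups {0..3} and {4..7}, possibly swapped
```

Here the sign split cleanly recovers the two planted groups.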
Let $T$ be a binary tree and $G$ its similarity graph, as defined in Section 4. The following lemma shows that partitioning the terminal nodes according to the min-cut criterion yields two clans of $T$.

We generated a caterpillar tree with m = 512 terminal nodes, where the non-terminal nodes form a path graph. The similarity between each pair of adjacent nodes was set to δ = 0.81. As in Section 6, we compare NJ, SNJ and RAxML with the STDR variants that use these methods as subroutines. The STDR threshold is set to τ = 64 for all three variants. Figure 12 shows the normalized RF distance (left) and runtime (right) of the different methods as functions of the sequence length n. All three methods improve significantly in both runtime and accuracy when combined with STDR.