key: cord-0612931-yk40fz71
authors: Tahiri, Nadia; Fichet, Bernard; Makarenkov, Vladimir
title: Building alternative consensus trees and supertrees using k-means and Robinson and Foulds distance
date: 2021-03-24
journal: nan
DOI: nan
sha: b7a31d820ec0232f2317f7bae0c92b11fd47f666
doc_id: 612931
cord_uid: yk40fz71

Each gene has its own evolutionary history which can substantially differ from the evolutionary histories of other genes. For example, some individual genes or operons can be affected by specific horizontal gene transfer and recombination events. Thus, the evolutionary history of each gene should be represented by its own phylogenetic tree which may display different evolutionary patterns from the species tree that accounts for the main patterns of vertical descent. The output of traditional consensus tree or supertree inference methods is a unique consensus tree or supertree. We describe a new efficient method for inferring multiple alternative consensus trees and supertrees to best represent the most important evolutionary patterns of a given set of gene phylogenies. We show how an adapted version of the popular k-means clustering algorithm, based on some interesting properties of the Robinson and Foulds distance, can be used to partition a given set of trees into one (for homogeneous data) or multiple (for heterogeneous data) cluster(s) of trees. Moreover, we adapt the popular Cali'nski-Harabasz, Silhouette, Ball and Hall, and Gap cluster validity indices to tree clustering with k-means. A special attention is given to the relevant but very challenging problem of inferring alternative supertrees. The use of the Euclidean property of the objective function of the method makes it faster than the existing tree clustering techniques, and thus perfectly suitable for analyzing large evolutionary datasets. We apply the new method to discover alternative supertrees characterizing the main patterns of evolution of SARS-CoV-2 and genetically related betacoronaviruses.

Most of the conventional consensus and supertree inference methods generate one candidate tree for a given set of input gene phylogenies. However, the topologies of gene phylogenies can be substantially different from each other due to possible horizontal gene transfer, hybridization or intragenic/intergenic recombination events by which the evolution of the related genes could be affected . Each gene phylogeny depicts a unique evolutionary history which does not always coincide with the main patterns of vertical descent depicted by the species tree. In order to infer a reliable species phylogeny, the related gene trees should be merged, while minimizing topological conflicts present in them . In this context, two scenarios can be envisaged: First, trees to be merged are constructed for the same set of taxa (i.e. consensus tree issue), and second, trees to be merged are constructed for different, but mutually overlapping, sets of taxa (i.e. supertree issue). A large variety of methods have been proposed to address the problem of reconciliation of multiple trees defined on the same set of leaves in order to infer a consensus tree. The best known examples of consensus trees are the strict consensus tree, the majority-rule consensus tree, and the extended majority-rule consensus tree . However, in practical evolutionary studies, we rarely deal with phylogenies defined on the same set of taxa, and thus the consensus tree inference problem is transformed into the supertree inference problem (Bininda-Emonds 2004) . Several approaches have been proposed to synthesize collections of small phylogenetic trees with partial taxon overlap into comprehensive supertrees including all taxa found in the input trees . For example, the Matrix Representation with Parsimony (MRP) methods proceed by a matrix-like aggregation of separately inferred partial trees. The trees derived from these independent analyses are then combined to produce a single MRP matrix used to reconstruct the supertree of all sources of taxa . In the parsimony supermatrix methods all systematic characters are integrated into a single phylogenetic matrix which is used to analyze all characters simultaneously in order to build a supermatrix tree. The strict supertree includes the bipartitions (or splits) that agree with all bipartitions present in the input phylogenies, while the rest of the tree consists of unresolved subtrees. The concept of strict and loose supertrees has been well described by , who showed that these types of supertrees are natural generalizations of the corresponding consensus trees. Importantly, most supertree methods are heuristics for NP-hard optimization problems (Warnow 2018) .

The implementation of the famous "Tree of Life" (ToL) project intended to infer the largest possible species phylogeny became feasible due to collaborative efforts of biologists and nature enthusiasts from around the world ). The approach adopted by the project organizers consists in gradual partitioning the complex tree reconstruction problem into several subproblems, followed by merging the obtained sub-trees. Indeed, such an approach produces thousands of small trees which should be assembled to create ToL. Precisely, the problem can be viewed as twofold: First, we have to infer small sub-trees of ToL (i.e. often gene trees representing different evolutionary histories), the most commonly defined on different, but mutually overlapping, sets of taxa, and second, we have to merge these small sub-trees into one or several supertrees using a supertree reconstruction method that allows combining trees inferred for different sets of taxa. In this context, the application of a method that infers multiple supertrees (i.e. a supertree clustering method) would help discover and regroup plausible alternative evolutionary scenarios for several sub-trees of ToL.

Unfortunately, most of the traditional consensus tree and supertree inference methods return as output a single consensus tree or supertree. Thus, in many instances, these methods are not informative enough, as they do not preserve alternative evolutionary scenarios characterizing sub-groups of genes that have undergone similar reticulate evolutionary events (e.g. genes whose evolution has been affected by similar horizontal gene transfers; Boc et al. 2010; Sevillya et al. 2020) . The Phylogenetics Islands method of was the first to use multiple consensus trees. Maddison observed that the consensus trees of the islands can differ substantially from each other, and that they are usually much better resolved than the consensus tree of the whole set of taxa. The most intuitive approach for discovering and regrouping genes sharing similar evolutionary histories is clustering their gene phylogenies. In this context, Stockham et al. (2002) proposed a tree clustering algorithm based on k-means and the quadratic version of the Robinson and Foulds (RF) topological distance . The RF distance between two phylogenetic trees equals the minimum number of elementary operations, consisting of merging or splitting nodes, which are necessary to transform one tree into the other. It is also the number of bipartitions, or Buneman's splits (Buneman 1971) , which belong to exactly one of the two trees. The clustering algorithm proposed by Stockham et al. aims at inferring a set of strict consensus trees that minimizes the information loss. It proceeds by determining the consensus trees for each set of clusters in all intermediate partitioning solutions tested by k-means. Bonnard et al. (2006) introduced a method, called MultiPolar Consensus (MPC), which produces a minimum number of consensus trees displaying all splits of a given set of phylogenies whose support is above a predefined threshold. proposed the Multiple Consensus Trees (MCT) method for partitioning a group of phylogenetic trees into one or several clusters. The method of Guénoche computes a generalized partition score to determine the most appropriate number of clusters for a given set of gene trees. Finally, Tahiri et al. (2018) presented a fast tree clustering method based on k-medoids. Recently, have introduced a revised definition of tree islands based on any treeto-tree metric that usefully extends this notion to any set or multiset of trees and have provided a nice discussion of biological applications of their method. To the best of our knowledge, all the methods proposed for building multiple alternative phylogenies assume that the input trees have identical sets of taxa, thus working in the consensus tree context. Therefore, the relevant and challenging problem of inferring multiple supertrees still needs to be addressed appropriately. It is worth noting that Bansal et al. (2010) described a method for building RF-based supertrees aiming at minimizing the total RF distance between the supertree and the set of input trees. However, Bansal et al. did not consider the possibility of inferring multiple supertrees.

In this paper, we describe a new method that can be used to infer both alternative consensus trees and supertrees. We present some interesting properties of the Robinson and Foulds topological distance and show that it should not be used in its quadratic form in tree clustering. Our main contributions are summarized below: 1. We present a new efficient method for inferring multiple alternative consensus trees which is based on the Euclidean properties of the square root of the Robinson and Foulds distance, and validate it through a comprehensive simulation study; 2. We show how the proposed method can be extended to infer multiple alternative supertrees (this problem has not been addressed yet in the literature); 3. We demonstrate the practical utility of our method and software by applying them to the problem of clustering SARS-Cov-2 gene trees and by comparing our results to those provided by existing tree clustering approaches; 4. We show how to adapt the popular Ball and Hall , , Gap and Silhouette cluster validity indices to tree clustering with kmeans; 5. We discuss further extensions of the proposed tree and supertree clustering approach.

A phylogenetic tree is an unrooted leaf-labeled tree in which each internal node, representing an ancestor of some contemporary species (i.e. taxa), has at least two children and all leaves, representing contemporary species, have different labels. Our method takes as input a set Π of N phylogenetic trees defined on the same (or different, but mutually overlapping) set(s) of leaves and returns as output one or several disjoint clusters of trees from which the related consensus trees (or supertrees) can then be inferred. Our approach relies of the use of a fast version of the popular k-means algorithm adapted for tree clustering. We define and compare different variants of the k-means objective function suitable for clustering trees. K-means (MacQueen 1967) is an unsupervised data partitioning algorithm which iteratively regroups N given objects (i.e. phylogenetic trees in our case) into K disjoint clusters. The content of each cluster is chosen to minimize the sum of intracluster distances. The most commonly used distances in the framework of k-means are the Euclidean distance, the Manhattan distance, and the Minkowski distance. The problem of finding an optimal partitioning according to the k-means least-squares criterion is known to be NP-hard ). This fact has motivated the development of polynomial-time heuristics for finding an approximate clustering solution, most of them having the time complexity of O(KNIM), where I is the number of iterations in k-means and M is the number of variables characterizing each of the N objects.

The Robinson and Foulds distance, also known as the symmetric-difference distance, between two trees is a widely-used metric for comparison of phylogenetic trees defined on the same set of taxa . The RF distance is a topological distance. It does not take into account the length of the tree edges. As shown by , the majority-rule consensus tree of a set of trees is a median tree of this set in the sense of the RF distance. Moreover, even though the RF distance itself is not a Euclidean distance, its square root is Euclidean (see Appendix S1 in Supplementary Material). This fact justifies the use of this distance in tree clustering.

One of the main advantages of the method described in this paper is that it does not recompute the consensus trees or supertrees for all intermediate clusters of trees, but estimates the quality of each intermediate tree clustering using formulas based on the properties of the RF distance and majority-rule consensus trees. This allows for a much faster clustering of a given set of input phylogenies without compromising on the quality of the resulting consensus trees or supertrees.

The traditional k-means algorithm (MacQueen 1967) partitions a given dataset into K (K > 1) disjoint clusters according to its objective function based on a specific distance (e.g. Euclidean or Minkowski distance) and the selected cluster validity index (e.g. Caliński-Harabasz, Silhouette or Dann index) . Most of the traditional cluster validity indices take into consideration both intragroup and intergroup cluster evaluations. However, we cannot use the standard objective functions or cluster validity indices when clustering trees. Here, we discuss the main modifications that should be introduced into the conventional k-means algorithm in order to adapt it to tree clustering.

In case of tree clustering, the objective function of the method can be defined as follows:

where K is the number of tree clusters, Nk is the number of trees in cluster k, RF(Ck, Tki) is the RF distance between tree i of cluster k, denoted Tki, and the majority-rule consensus tree (any other type of consensus tree could be considered here as well) of cluster k, denoted Ck. If the majority-rule consensus tree is used in the objective function of the tree clustering algorithm, the problem is sometimes called the k-median tree clustering problem. Since we do not compute any median tree during the clustering process, we present it as the kmeans tree clustering problem. This notation was also used in the pioneering work of Stockham et al. (2002) , who considered the consensus tree of each cluster as its mean. Still, the computation of the majority-rule, or of the extended majority-rule, consensus tree is time-consuming. The running time of the straightforward algorithm computing any of these consensus trees is O(n 2 + nN 2 ) , where n is the number of leaves in each tree and N is the number of trees. If the trees are defined by their sets of weighted branches, an optimal algorithm proposed by Jansson et al. (2013) , and running in O(nN) time, can be used for computing the majority-rule consensus tree. The time complexity of the version of the tree partitioning algorithm using consensus trees as cluster centers (Equation 1) which was tested in our simulations is O(rnN 2 KI) (i.e. it is a variant of the algorithm by Stockham et al.) , where r is the number of different random starting partitions used in k-means (typically, hundreds of different starting partitions should be considered in order to achieve a good clustering performance with k-means, see ), and I is the number of iterations in the internal loop of k-means. In our simulations, we compared different versions of this algorithm which are as follows: a) The version that uses strict consensus trees (as in Stockham et al.) and the version that uses majority-rule consensus trees -the latter provided better results; b) The version that uses different random starts (i.e. r=100) and the version that uses only one random starting partition (i.e. r=1) -the former provided much better results; c) The version that uses the first-improvement strategy, i.e. any improving relocation is applied without updating the cluster centers (i.e. without recalculating consensus trees of the clusters), and the version that uses the best-improvement strategy when all possible tree relocations are iteratively tested and only the best one is applied -the latter provided better results. Thus, the time complexity of the consensus tree-based k-means that favors speed is O(nNKI), while the time complexity of the version that favors solution quality, at the expense of extra computational time, is O(rnN 2 KI) (here, we consider that O(nN 2 K) time is needed to relocate N trees). It is worth mentioning that Tahiri et al. (2018) have recently proposed a kmedoids-based tree clustering algorithm having the running time of O(nN 2 + rK(N-K) 2 I), where O(nN 2 ) is the time needed to precalculate the matrix of pairwise RF distances of size (N×N) between all trees in Π.

In order to speed up the tree clustering process, we propose to use the following objective function OFEA:

where SSW is the index of intragroup evaluation, v ki T and v k m are, respectively, the tree i of cluster k and the overall mean (i.e. centroid) of cluster k in the tree vector space. In fact, trees can be represented as vectors of their bipartitions or splits. There exist 2 n−1 −1 possible splits for n-leaf trees, and at most 2n−3 splits can occur in a tree. In a tree vector representation, all coordinates that do not correspond to a split of the tree are set to 0; those that correspond to a split are set to 1 (for more details, see . In Equation 2, the RF distance, or more precisely its square root, which has the Euclidean property, replaces the traditional Euclidean distance used in k-means (see also Equation 15 in Appendix S2 in Supplementary Material that provides two equivalent expressions for SSW for the traditional Euclidean distance). Importantly, the centroid vector v k m of a cluster of trees k is not necessarily a consensus tree of the cluster. Moreover, v k m may not correspond to a phylogenetic tree at all. Fortunately, we do not need to calculate the L 2 norm (the middle term in Equation 2) or work with 2 n−1 −1-dimentional vectors in the clustering process. The main advantage of using the precalculated RF distances (the right term in Equation 2) is that we should not calculate a consensus tree, or the cluster centroid, for any intermediate cluster of trees considered by k-means, and that an object relocation operation consisting in finding the best cluster for a given tree T belonging to cluster C (i.e. when we try to relocate T into each of the K-1 clusters that are different from C) can be performed in O(K) time. Indeed, all pairwise RF distances between trees in a given set of input trees Π can be precalculated, and the sums of the RF distances between a given tree T and all elements of any tree cluster can first be precalculated and then updated at each step of k-means. Hence, using Equation 2, an object relocation operation conducted in turn for all N objects (i.e. trees) takes O(NK) time. The RF distance precalculation step can be completed in O(nN 2 ) and should be carried out only once for all random starting partitions used in k-means. This leads to the total running time of O(nN 2 + rNKI) for our extended majority-rule kmeans based on Equation 2. Our clustering approach is especially efficient when the number of leaves n is larger than the number of trees N. It is worth noting that the objective function OFEA defined in Equation 2 can also be viewed as a Euclidean approximation of the objective function OF defined in Equation 1.

Importantly, Equation 2 can also be used to define a version of the Caliński-Harabasz cluster validity index (CH) adapted for clustering trees. The CH index, which is sometimes called the variance ratio criterion, is defined as follows:

where SSB is the index of intergroup evaluation, SSW is the index of intragroup evaluation, K is the number of clusters, and N is the number of objects (i.e. trees in our case). The optimal number of clusters corresponds to the largest value of CH.

In the case of tree clustering, the global variance between groups, SSB, can be calculated as follows:

where Ti and Tj are trees i and j in a given set of trees Π, and SSW is the index of intragroup evaluation defined in Equation 2.

The results of our simulation study suggest that this version of CH index outperforms the popular Silhouette (SH), Ball and Hall (BH) , and Gap cluster validity indices adapted for tree clustering with k-means (see Fig. 1 , and Appendix S2 for their definitions). Moreover, the version of our algorithm based on the objective function OFEA and CH index generally outperforms the tree clustering algorithms of Stockham et al. (2002) , Tahiri et al. (2018) and Bonnard et al. (2006) (see Fig. 2 ). However, in the case of large trees, it provides the results equivalent to those yielded by the Stockham algorithm based on computation of majority-rule consensus trees (see Fig. 2b ).

Approximation by the lower and the upper bounds, and by their mean As discussed above, the Euclidean objective function OFEA (Equation 2) can also be viewed as an approximation of the objective functions OF defined in Equation 1. For instance, the consensus tree of the four trees presented in Supplementary Figure S1 is a star tree, and the value of the objective function OF for them is (2 + 2 + 2 + 2) = 8, whereas the value of OFEA is (2 + 2 + 4 + 4 + 2 + 2)/8 = 4.5. Thus, it would be interesting to find the lower and the upper bounds of the objective function OF, and use them in the clustering process. Theorem 1 below allows us to establish these bounds. Theorem 1 For a given cluster k containing Nk phylogenetic trees (i.e. additive trees or X-trees) the following inequalities hold:

where Nk is the number of trees in cluster k, Tki and Tkj are, respectively, trees i and j in cluster k, and Ck is the majority-rule consensus tree of cluster k. The proof of Theorem 1 is presented in Appendix S3 (see Supplementary Material). It is worth noting that the lower and the upper bounds of ∑ ( , ) =1 defined in Theorem 1 are identical when Nk = 2. Moreover, the value of the objective function defined by the upper bound of (5) divided by 2 (i.e. the function OFEA defined in Equation 2) is smaller than the value of the objective function OF (Equation 1). Thus, according to the terminology of approximation theory, the criterion OFEA is a factor-2 approximation of the criterion OF.

We can use these bounds as well as the middle of the interval defined by them, which is

, as an approximation of the contribution of cluster k to the k-means objective function OF defined by Equation 1. For instance, the following objective function based on the middle of the interval established in Theorem 1 can also be used as an approximation of the original objective function OF:

In a similar way, we can define the objective function based on the lower bound of OF:

and the upper bounds of OF as well:

Clearly, the use of the approximation functions defined by Equations (6-8) does not increase the time complexity of our clustering method, which will remain O(nN 2 + rKNI). The use of these approximation functions should imply the appropriate changes in the formulas of the considered cluster validity indices (see Equations 21-22 in Supplementary Material). Importantly, the Euclidean objective function and the upper bound objective function defined in Equations 2 and 8, respectively, differ by a constant multiplier only. This means that they reach their minimum at the same point (i.e. the same tree clustering). Thus, only one of these functions (i.e.

) was tested in our simulations along with and .

In this section, we explain how the tree clustering method introduced above for the case of trees defined on the same set of leaves could be extended to trees whose sets of leaves can differ, as it is often the case in phylogenetic studies, including the famous Tree of Life project .

Let Π be a set of N unrooted phylogenetic trees that may contain different, but mutually overlapping, sets of labeled leaves. In this case, the original objective function OF (Equation 1) can be reformulated as follows:

where K is the number of clusters, Nk is the number of trees in cluster k, RFnorm(STk,Tki) is the normalized Robinson and Foulds topological distance between tree i of cluster k, denoted Tki, and the majority-rule supertree of this cluster, denoted STk, reduced to a subtree having all leaves in common with Tki. The reduced version of the supertree STk is obtained after removing from it all leaves that do not belong to Tki and collapsing the corresponding branches. The RF distance is normalized by dividing it by its maximum possible value, which is 2n(STk,Tki)-6, where n(STk,Tki) is the number of common leaves in trees STk and Tki. The RF normalization is carried out to account equally the contribution of each tree to clustering. Obviously, Equation 9 can be considered only if the number of common leaves in STk and Tki is greater than 3. We propose to use the following analogue of the Euclidean approximation function (see Equation 2) to avoid supertree recomputations at each step of kmeans:

where n(Tki) is the number of leaves in tree Tki, n(Tkj) is the number of leaves in tree Tkj, n(Tki,Tkj) is the number of common leaves in trees Tki and Tkj, and α is the penalization (tuning) parameter, taking values between 0 and 1, used to prevent from putting to the same cluster trees having small percentages of leaves in common. This penalization parameter is necessary to get well-balanced clusters in which trees have both high topological and species content similarity. Indeed, the normalized RF distance between two large trees can be small only because the trees do not have enough taxa in common. However, such trees should not be necessarily assigned to the same cluster. Equation 10 also implies that two trees belonging to the same cluster have at least four taxa in common, but a higher taxa-similarity threshold can be used as the method's parameter in order to increase the cluster homogeneity. The objective functions reported in Equations (6-8) and the corresponding cluster validity indices should be normalized in a similar way. In case of supertree clustering, the SSW index can be computed as follows:

and the SSB index as follows:

The clustering procedure based on the use of Equation 10 should be carried out with different random input partitions. The best clustering solution can be selected using the value of the adapted CH validity index based on Equations (11-12) and Equations (15-17) in Supplementary Material. Once the best clusters of trees are chosen, any existing supertree reconstruction method (Bininda-Emonds 2004) can be applied to infer the related majority-rule supertrees.

In this section, we first present our simulation design (sub-section 3.1) and then discuss the results of our simulation study conducted with synthetic data (sub-section 3.2). Finally, in sub-section 3.3, we describe the results of our clustering analysis of the SARS-CoV-2 data, originally examined by Lam et al. (2020) and Makarenkov et al. (2021) .

We tested our new method for computing multiple consensus trees and supertrees using a protocol that included three different simulation experiments. The first experiment involved partitioning phylogenies having identical sets of leaves (i.e. multiple consensus trees were constructed) and included a comparison of our new method with some state-of-the-art tree partitioning algorithms. The second experiment consisted in partitioning phylogenies having different sets of leaves (i.e. multiple consensus supertrees were constructed) without using penalization in the objective function (i.e. the value of the penalization parameter α in Equation 10 was set to 0). The third experiment involved partitioning phylogenies having different sets of leaves using the penalization parameter α (its value varied between 0 and 1) in the objective function (10). For reasons of both fairness and efficiency, the three tree partitioning algorithms compared in our simulations, i.e. the Stockham et al. (2002) algorithm, the kmedoids-based tree clustering algorithm by Tahiti et al. (2018) , and our proposed k-means-based clustering algorithm, were carried out with 100 random input partitions for each dataset considered. As we could observe in our simulations, a higher number of input trees (N) required a higher number of random input partitions (r) to obtain quality results. The detailed simulation protocol adopted in our study is presented in Appendix S4 (see Supplementary Material).

The results of the simulations conducted with synthetic data are illustrated in Figure 1 presents the clustering performances of the six compared variants of our partitioning algorithm (see Appendix S4 in Supplementary Material and the legend of Fig. 1 ). The best overall results in this simulation were provided by the variant of our algorithm based on the OFEA objective function (Equation 2) and CH cluster validity index (CH-E). The results obtained with the variant based on the OFMA objective function with approximation by the mean of interval (Equation 6) and CH cluster validity index (CH-MI) were only slightly worse. At the same time, the average ARIs (Adjusted Rand Indices) calculated for the variants using the BH and Gap statistics were generally much lower than those provided by the variants based on the CH and Silhouette cluster validity indices. Among the strategies that were able to deal with homogeneous data (when the number of clusters K was equal to 1), the variant based on BH outperformed that based on Gap for the lower numbers of clusters (K = 1, 2 and 3), but was less effective than it for the higher numbers of clusters (K = 4 and 5). It is worth noting that the results provided by the variants based on the CH index (CH-E, CH-MI and CH-LB) were very stable; they do not vary a lot depending on the number of clusters, the number of leaves, or the coalescence rate. Figure 2 presents the results of comparison of the most stable variant of our algorithm, CH-E, with the state-of-the-art tree clustering methods, including the MPC tree clustering algorithm by Bonnard et al. (2006) , the tree clustering algorithm by Stockham et al. (2002) , which uses the squared RF distance and recomputes the majority-rule consensus trees at each iteration of k-means, and the k-medoids tree clustering algorithm by Tahiri et al. (2018) based on the RF distance (non-squared) and the SH cluster validity index. The curves presented in Figure 2 (a and b) indicate that our CH-E strategy clearly outperformed the three other competing methods in terms of the clustering quality (i.e. average ARI results). The clustering results provided by the methods of Stockham et al. and Tahiri et al. were very close, and both of them generally outperformed the MPC method. The only case where the clustering performances of the four competing methods were almost similar was the case of large phylogenies (trees with 128 leaves). All the methods, but especially MPC, showed an increase in the ARI values as the number of tree leaves increases. Moreover, CH-E and MPC were by far the best methods in terms of the running time for both simulation parameters considered: the number of tree leaves (Fig 2c) and the number of trees (Fig 2d) . These results suggest that our new algorithm, along with MPC, is well suited for analyzing large phylogenetic datasets. However, for smaller phylogenies (with < 128 leaves) our CH-E strategy represents the best choice overall. Figure 3 illustrates the performance of our supertree clustering algorithm applied to gene trees with different numbers and sets of leaves using the OFST_EA objective function (Equation 10) and CH cluster validity index. In this experiment, the value of the penalization parameter α in Equation 10 varied from 0 to 1, with the step of 0.2. Obviously, when no species are removed from the dataset, the penalization term in Equation 10 equals 0 and has no impact on the clustering process. However, in all other cases, the presented ARI curves showed different behaviour and reached its maxima at different values of α (i.e. for the case of 10% of missing species the maximum value of ARI was reached at α = 0, for the case of 25% of missing species the maximum was reached at α = 0.2, with very close ARI results obtained for α = 0.4 and 1.0, whereas for the case of 50% of missing species the maximum was reached at α = 1.0). Thus, we can conclude that the value of the penalization parameter α must be chosen with respect to the number of missing taxa in the input trees. Classification performances of our supertree clustering algorithm based on the OFST_EA objective function used with CH and BH cluster validity indices are presented in Appendix S5 (see Figs. S3 and S4 in Supplementary Material). 

We performed the SARS-CoV-2 gene tree clustering to identify genes having similar evolutionary patterns. Our analysis was conducted using the supertree clustering algorithm described in Section 2.3. It was carried out for 11 main genes of the SARS-Cov-2 genome (i.e. genes ORF1ab, S, ORF3a, E, M, ORF6, ORF7a, ORF7b, ORF8, N and ORF10) as well as for the RD domain of the spike protein because of its key evolutionary importance. Thus, 12 gene phylogenies were inferred and analyzed in our study. It is worth mentioning that some gene annotations were absent in the GenBank and Gisaid databases (i.e. some taxonomic annotations for genes ORF6, ORF7a, ORF7b, ORF8 and ORF10 were unavailable), thus leading to gene phylogenies having different, but mutually overlapping, sets of leaves. Multiple sequence alignments for all gene and genome sequences used in this study as well as all inferred gene and species phylogenies (in the Newick format) are available at: http://www.info2.uqam.ca/~makarenkov_v/Supplementary_Mate-rial_data.zip (the complete data description is available in Appendix S6 in Supplementary Material). In total, 7 gene trees with 43 leaves (the phylogenies of genes ORF1ab, S, ORF3a, E, M, N and that of the RB domain), 4 gene trees with 25 leaves (the phylogenies of genes ORF6, ORF7a, ORF7b and ORF10), and 1 gene tree with 23 leaves (the phylogeny of gene ORF8) were considered. The internal branches of gene phylogenies with bootstrap support lower than 50% were collapsed prior to conducting gene tree clustering. The average number of missing leaves by gene tree was 17.8%. According to our simulations (see Fig.   3 ), the optimal value of the penalization parameter  could vary between 0 and 0.2 for such data. Here, we present the clustering results obtained with the value of  = 0.1 (very similar clusterings were obtained with  = 0 and  = 0.2; only one tree changed its cluster membership with  = 0.2). The obtained clustering solution with 3 disjoint clusters is presented in Figure 4 (a, b and c) . This solution encompasses three main patterns of evolution of betacoronavirus genes. The phylogenies of genes E, M, ORF6, ORF7a, ORF7b, ORF8 and ORF10 were assigned to Cluster 1, those of genes ORF1ab, S, ORF3a and N to Cluster 2, and that of the RB domain to Cluster 3. The consensus supertrees for each cluster were then inferred using the CLANN program (Creeve and McInerney 2005) . The supertree for Cluster 1 (Fig. 4a) was inferred using the heuristic search (hs) and bootstrap (performed with 100 replicates) options available in CLANN. The supertree for Cluster 2 (Fig. S4b ) was obtained using the consensus option of CLANN as all four trees forming this cluster contained 43 taxa. The consensus supertree of Cluster 3, containing a singleton element (i.e. the gene phylogeny of the RB domain) was its gene tree (Fig. 4c) .

Moreover, we also conducted a detailed HGT (Horizontal Gene Transfer) and recombination analysis of the obtained supertrees in order to identify the main HGT and recombination patterns characterizing the evolution of SARS-CoV-2 and related betacoronaviruses. HGT and recombination are widespread reticulate evolutionary processes contributing to diversity of betacoronaviruses, as well as of most other viruses, allowing them to overcome selective pressure and adapt to new environments . The HGT-Detection algorithm of Boc et al. (2010) was carried out independently for each of the three consensus supertrees inferred for our data using the T-Rex web server and the Armadillo 1.1 workflow platform . The obtained gene transfers, representing common HGT and recombination trends of the tree cluster under study, are indicated by red arrows in Figure 4d . We then completed our analysis by conducting an independent gene transfer detection for all individual gene trees included in Clusters 1 and 2 (the detected individual HGTs that were different from the previously recovered common transfers are indicated by blue and green arrows, respectively).

The obtained clustering and HGT detection results highlight the uniqueness of the evolution of the RB domain. They also suggest that the RB domain of SARS-CoV-2 could be acquired by a horizontal transfer of genetic material, followed by intragenic recombination, from the Guangdong pangolin CoV. Furthermore, they emphasize the stability of evolutionary patterns of the longest CoV genes (i.e. ORF1ab, S, ORF3a and N assigned to Cluster 2) as the consensus supertree (i.e. consensus tree in this case) of this cluster has a wellresolved structure. The consensus supertree of Cluster 1 is less resolved of all supertrees. This could be expected since Cluster 1 contains 7 of 12 gene trees considered in our study. The topology of Cluster 1 supertree points out that SARS-CoV-2 could not only be a mosaic created by recombination of the bat RaTG13 and Guangdong pangolin CoVs, but is also a very close relative of the bat ZC45 and ZXC21 CoV strains, which are the most closely located taxa to the clade of the RaTG13 and SARS-CoV-2 viruses in this supertree.

One of the main advantages of our program is that it allows the users to process unresolved phylogenies and trees with different, but mutually overlapping, sets of leaves. It was the case of the considered SARS-CoV-2 dataset. For example, the algorithms of Stockham et al. (2002) , Tahiri et al. (2018) and Bonnard et al. (2006) , which work only with phylogenies defined on the same set of taxa, could be applied to this dataset only when the 12 original phylogenies were reduced to the common set of 23 taxa, thus loosing the opportunity to infer many common HGT and recombination events depicted in Fig. 4d . Moreover, for the MPC method , we needed to transform the unresolved phylogenies into binary trees. For such reduced and transformed data, the k-means majority rule clustering method found 4 clusters of trees, the k-medoids-based method generated 3 such clusters, whereas the MPC method, executed with the Welsh-Powell option, found 9 clusters. All solutions provided by these clustering methods were quite different from the clustering yielded by our method (see Fig. 4 ). Furthermore, none of these methods was able to emphasize the uniqueness of the RB domain evolution by putting it in a separate singleton cluster. were collapsed. Transfer directions are represented by arrows (when the direction is uncertain, the arrow is bidirectional). Gene(s) affected by the transfer and the corresponding tree cluster numbers are indicated on the red arrows. Blue (for Cluster 1) and green (for Cluster 2) arrows represent additional HGT events found by the HGT-Detection program for individual genes of these clusters (these events cannot be inferred directly from consensus supertrees).

Consensus supertree for genes E, M, ORF6, ORF7a, ORF7b, ORF8 and ORF10 (Cluster 1) 

Consensus tree and supertree inference methods synthesize collections of gene phylogenies into comprehensive trees that preserve main topological features of the input phylogenies and include all taxa present in them. In this paper, we introduced a new systematic method for inferring multiple alternative consensus trees and supertrees from a given set of phylogenetic trees, which can be defined either on the same set of taxa (case of multiple consensus trees) or on different sets of taxa with incomplete taxon overlap (case of multiple supertrees). To the best of our knowledge, the problem of building multiple alternative supertrees has not been addressed yet in the literature. The inferred alternative consensus trees and supertrees represent the most important evolutionary patterns characterizing the evolution of genes under study. They are generally much better resolved than a single consensus tree or a single supertree inferred by traditional methods. Thus, a multiple consensus tree or supertree inference approach has the potential to build supertrees that retain much more plausible information from the input set of gene phylogenies. A single consensus tree or supertree could be an appropriate representation of a given set of gene trees only if all of them, or a large majority of them, follow the same evolutionary patterns. For example, our method allows one to identify ensembles of genes that underwent similar horizontal gene transfer, hybridization or intragenic/intergenic recombination events, or those that were affected by similar ancient duplication events during their evolution.

The new method relies on multiple runs of the k-means partitioning algorithm applied to the non-squared Robinson and Foulds distances (original or normalized) between the input trees. A number of efficient approximations of the straightforward objective function (Equation 1), preventing us from computing a consensus tree for any considered cluster of trees in the internal loop of k-means and using some remarkable properties of the RF distance, have been introduced (see Equations 2 and 6-8). These equations allow us to precompute all RF distances prior to carrying out the k-means tree clustering and ensure that a basic k-means object relocation operation, consisting in removing a given tree T from its current cluster C and assigning it to the best possible cluster different from C (if any), can be performed in O(K), where K is the number of tree clusters. This property makes our method perfectly suitable for analysis of large evolutionary datasets. In order to compute the RF distance between pairs of trees defined on different, but mutually overlapping, sets of leaves we propose to reduce them pairwise to common sets of leaves and then to normalize the obtained distance value. In case of supertree clustering, we also added to the objective function of the method the term including the penalization parameter α, which is used to create wellbalanced clusters that contain trees with both high topological and species content similarity. This is a common way of addressing the problem of missing data in clustering, and in machine learning in general .

The use of the RF distance, and not of its quadratic form, in tree clustering is justified by the fact that the majority-rule consensus tree of a set of trees is a median tree of this set in the sense of the RF distance . Moreover, Bansal et al. (2010) showed that RF-based supertrees are supertrees that are consistent with the largest number of clades from the input trees. We also showed how the popular Caliński-Harabasz (CH), Silhouette (SH), Ball and Hall (BH) , and Gap cluster validity indices could be adapted to tree clustering with k-means. The CH and SH indices are suitable for clustering heterogeneous data (when the number of clusters K ≥ 2), whereas the BH and Gap indices can be used to cluster both homogenous (when K = 1) and heterogeneous data. Using simulations, we demonstrated that the version of our method based on Euclidean approximation (Equation 2) typically outperforms the existing methods for building multiple alternative consensus trees, such as MPC , k-means tree clustering with squared RF distance by Stockham et al. (2002) , and k-medoids tree clustering by Tahiri et al. (2018) , in terms of both clustering quality and running time.

The statistical robustness of phylogenetic trees is a very important factor that should not be neglected when inferring and interpreting the results of phylogenetic analysis. We know that the RF distance is twice the number of bipartitions present in one of the trees and absent in the other . Thus, we can incorporate bootstrap scores of the input trees in the computation by considering the following objective function in the framework of consensus tree clustering:

where bsb(Tki) and bsb(Tkj) are the bootstrap scores, expressed in percentages, of internal branch b that induces a common bipartition in trees Tki and Tkj, CB(Tki,Tkj) is the set of all common internal branches (i.e. branches inducing the same bipartitions) in Tki and Tkj, and  is the penalization parameter, taking values between 0 and 1, used to penalize pairs of input trees with different bootstrap scores of their internal branches inducing common bipartitions.

, trees with similar bootstrap support of their internal branches inducing the same bipartitions of taxa will have a greater potential to be assigned to the same cluster. It is worth mentioning that branch lengths of compared branches can be taken into account in a similar way in the objective function of the clustering algorithm. In this case, the absolute difference between bootstrap scores of the two compared branches inducing the same bipartition in Tki and Tkj (see Equation 13) can be replaced by the absolute difference between their lengths, whereas the maximum of the two branch lengths will replace 100% in the fraction denominator. Moreover, we can use the following analogue of the Euclidean approximation function (see Equation 10) to take into account bootstrap support of the internal tree branches and avoid supertree computations at each step of the supertree k-means clustering: Another interesting option for further investigations concerns the use of other popular tree distances in the objective function of the clustering algorithm. In this context, the two most promising of them seem to be the branch score distance Felsenstein 1994, Gambette et al. 2016 ) and the quartet distance .

The branch score distance is defined as follows. Let us consider two phylogenetic trees T and T' defined on the same set of n taxa and the large set (BP1, BP2, … , BPNB) of all possible bipartitions existing for these taxa. For each tree, we can determine a large vector of nonnegative values bp = (b1, b2, … , bNP), in which bi is equal to the branch length of the branch corresponding to bipartition BPi if this bipartition exists in the tree, otherwise it is equal to 0. For two trees T and T', whose bipartition vectors are bp and bp', the branch score distance is defined as the squared Euclidean distance between these vectors:

where NB is the number of all possible existing bipartitions.

The quartet distance (QD) is defined as the number of quartets of tree leaves that induce a subtree topology that occur in only one of the two compared trees. This distance has been extensively used in computational biology not only for inferring phylogenetic trees and networks , but also for building supertrees .

According to its definition, the quartet distance is a symmetric difference distance. Thus, its square root (QD 1/2 ) is Euclidean and an analogue of Equation 2, in which RF is replaced by QD, can be used in the objective function of the clustering algorithm. An advantage of the quartet distance is that it can be computed in O(nlogn), where n is the number of leaves in both trees involved in computation.

Another tree distance which could be suitable for tree clustering is the Billera-Holmes-Vogtmann (BHV) distance . The BHV distance between weighted trees is defined as the geodesic, or shortest path, distance inside treespace in which trees are viewed as (2n−3)-dimensional vectors of their bipartition weights within the larger (2 n−1 −1)-dimensional space of all graphs. The BHV distance between two trees can be computed in O(n 4 ) and approximated in linear time . The continuous treespace introduced by Billera et al. (2001) provides a perfect environment for computation of average trees, while the classical Euclidean mean, when applied to tree vectors, can yield vectors not corresponding to trees. In the BHV framework, the majority consensus tree corresponds to the mode and the Fréchet mean corresponds to the average of a given set of trees. The Fréchet mean tree of a given set of N trees is the tree that minimizes the sum of the squared BHV distances to the given set Π = {T1, T2, …, TN} of N phylogenetic trees defined on the same set of taxa, i.e. ∑ ( , ) 2

. A normalized sum of such minimum values computed independently for each considered cluster of trees can constitute an objective function of a tree clustering method based on the BHV distance. Fortunately, the Fréchet mean tree is unique and its approximation can be calculated in polynomial time by an iterative algorithm. However, the Fréchet mean may also demonstrate a non-Euclidean sticky behaviour, as changing an input tree does not necessarily change the mean tree of the dataset, in contrast to Euclidean space . Alternatively, the median (which is a more robust estimator than the mean) of a set of trees can also be considered when clustering tree. The BHV distance median of a set of trees is the tree that minimizes the sum of non-squared BHV distances to those trees. Another widely-used criterion for assessing differences between trees is the Subtree Prune-and-Regraft (SPR) distance . This distance is integrated in a variety of tree building methods exploring different tree topologies . Moreover, Whidden et al. (2014) determined that the SPR distance can be used to build supertrees and that SPR-based supertrees are significantly more similar to the known species history than RF-based supertrees given biologically plausible rates of simulated horizontal gene transfers. The problem of computing the SPR distance between two trees is NP-hard . However, in practice, its approximation can be computed using a fixed-parameter-bounded search tree algorithm in combination with a linear-time formulation of Linz and Semple's cluster reduction to solve an equivalent maximum agreement forest problem ). Nevertheless, the SPR distance, as well as the other popular tree topology rearrangement distances such as Nearest Neighbor Interchange (NNI) and Tree Bisection and Reconnection (TBR), has no Euclidean properties and new formulas and algorithms should be designed in order to adapt it to tree clustering. Appendix S1. RF is not a Euclidean distance, but its square root is Euclidean.

The following counter-example, involving four trees T1, T2, T3 and T4 with five leaves each (see Supplementary Fig. S1 ), can be used to show that RF is not Euclidean. It is the simplest example possible because for the trees with four leaves, the RF distance has the Euclidean properties. Figure S1 used to show that the Robinson and Foulds distance is not a Euclidean distance.

We will now recall a few mathematical results which will be useful in the sequel. They will allow us to suggest a new clustering strategy via the square root of the RF distance (see the 3 main text). In a series of beautiful papers, Kelly and Deza introduced hypermetric spaces and then extended them to quasi-hypermetric spaces (Kelly 1972; Deza and Laurent 1997) . Quasihypermetric spaces satisfy a so-called inequality of negative type, i.e. given a metric space (X, 

. Then, it has been observed that this inequality corresponds exactly to the Schoenberg condition for the square root (d 1/2 ) of d being of the Euclidean type. Kelly (1972) provided several examples of quasihypermetric spaces, such as normed spaces and lattices, and established that a symmetric difference distance is hypermetric. Therefore, its square root (d 1/2 ) is Euclidean (see for more details). Because the Robinson and Foulds distance is a symmetric difference distance , its square root (RF 1/2 ) is Euclidean.

adapted for tree clustering with k-means.

The first cluster validity index we consider here is the Caliński-Harabasz index . This index, sometimes called the variance ratio criterion, is defined as follows:

where SSB is the index of intergroup evaluation, SSW is the index of intragroup evaluation, K is the number of clusters and N is the number of objects (i.e. trees in our case). The optimal number of clusters corresponds to the greatest value of CH.

In the traditional version of CH, when the Euclidean distance is considered, the SSB coefficient is evaluated by using the L 2 -norm:

where mk (k = 1 ... K) is the centroid of cluster k, m is the overall mean (i.e. centroid) of all objects in the given dataset X and Nk is the number of objects in cluster k. In the context of the 4 Euclidean distance, the SSW index can be calculated using the two following equivalent expressions:

where xki and xkj are the objects i and j of cluster k, respectively .

To use the analogues of Equations 16 and 17 in tree clustering, we need to define the concept of centroid for a given set of trees. The median tree (Barthélemy and Monjardet 1981; plays the role of this centroid in our tree clustering algorithm. The median procedure is defined as follows (Barthélemy and Monjardet 1981) . The 

where Tki and Tkj are trees i and j of cluster k, respectively.

Also, in the case of the Euclidean distance we have:

where i x and j x are two different objects in X .

Thus, the approximation of the global variance between groups, SSB, can be calculated as follows:

where Ti and Tj are trees i and j in a given set of trees Π, and W SS is calculated according to Equation 18 (22)

Another popular index considered in our study is the Silhouette width (SH) . Traditionally, the Silhouette width of cluster k is defined as follows:

where Nk is the number of objects belonging to cluster k, a(i) is the average distance between object i and all other objects belonging to cluster k, and b(i) is the smallest, over all clusters k' different from k, of all average distances between i and all the objects of cluster k'. 6 We used Equations 24 and 25 for calculating a(i) and b(i), respectively, in our tree clustering algorithm (see also Tahiri et al. 2018) . For instance, the quantity a(i) can be determined as follows:

and the formula for b(i) is as follows:

where Tk'j is tree j of cluster k', such that k' ≠ k, and Nk' is the number of trees in cluster k'.

The optimal number of clusters, K, corresponds to the maximum average Silhouette width, SH, which is calculated as follows:

The value of the Silhouette index defined by Equation 26 is located between -1 and +1.

Unfortunately, the CH and SH cluster validity indices defined by Equations 15 and 26 do not allow us to compare the solution consisting of a single consensus tree (K = 1; the calculation of CH and SH is impossible in this case) with clustering solutions involving multiple consensus trees (K ≥ 2). This can be viewed as an important drawback of the CH and SH-based classifications because a good tree clustering method should be able to recover a single consensus tree when the input set of trees is homogeneous (e.g. in case of gene trees that share the same evolutionary history).

The Gap statistic was first used by Tibshirani et al. (2001) to estimate the number of clusters provided by partitioning algorithms. The formulas proposed by Tibshirani et al. were based on the properties of the Euclidean distance. In the context of tree clustering, the Gap statistic can be defined as follows. Consider a clustering of N trees into K non-empty clusters, where K ≥ 1. We first define the total intracluster distance, k D , characterizing the cohesion between the trees belonging to the same cluster k:

Then, the sum of the average total intracluster distances, K V , can be calculated:

Finally, the Gap statistic, which reflects the quality of a given clustering solution with K clusters, can be defined as follows:

where * N E denotes expectation under a sample of size N from the reference distribution. The following formula for the expectation of ) log(

was used in our method:

where n is the number of tree leaves.

The largest value of the Gap statistic corresponds to the best clustering. Ball and Hall (1965) introduced the ISODATA procedure to measure the average dispersion of groups of objects with respect to the mean square root distance, i.e. the intra-group distance, which would lead to the following formula in case of tree clustering: where Nk is the number of trees in cluster k, Tki and Tkj are, respectively, trees i and j in cluster k, and k C is the majority-rule consensus tree of cluster k.

First, the sum 

can be decomposed into the following double sum:

We know that the Robinson and Foulds distance is a metric, and thus satisfies the triangular inequality . Hence, the following inequality holds for any pair of trees (Tki,Tkj):

This means that:

Second, based on the property, proved by Barthélemy and McMorris, that the majority consensus tree of a set of trees is a median tree of this set in the sense of the RF distance Barthélemy and Monjardet 1981) , we have:

, where (n) is the set of all possible phylogenetic trees with n leaves.

Thus, we obtain:

It is easy to see that the upper bound in Equation 35 This section presents the detailed simulation protocol adopted in our study. The data generation procedure used in the first experiment (i.e. with multiple consensus trees) included three main steps. First, we randomly generated a species phylogeny T0 with n leaves (i.e. it played the role of the first consensus tree in our simulations) using the HybridSim program of Woodhams et al. (2016) . Second, still using HybridSim, we generated K-1 gene phylogenies, 9 T1, …, TK, defined on the same set of n leaves (i.e. they played the role of the other consensus trees in our simulations). Each of these phylogenies differed from T0 by a specific number of hybridization events (the value of the hybridization_rate parameter in HybridSim varied from 1 to 4 in our experiments; this value was drawn randomly using a uniform distribution). The number of clusters, K, in this experiment varied from 1 to 5, and the number of tree leaves, n, First, we compared the classification performances (in terms of Adjusted Rand Index, ARI), of the six variants of our tree clustering algorithm applied to trees with identical sets of leaves (see Fig. 1 ). The six evaluated variants of our algorithm were based on: (1) Second, the variant of our algorithm based on the OFEA objective function with approximation by Euclidean distance and CH cluster validity index (CH-E) that showed the best overall per-formance in the first simulation was compared to the state-of-the-art tree clustering methods, including: (1) the MPC method by Bonnard et al. (2006) , (2) the tree clustering algorithm by Stockham et al. (2002) , which is based on k-means clustering with squared RF distance (this method recomputes the majority-rule consensus trees of all clusters at each k-means iteration), and (3) the k-medoids tree clustering algorithm by Tahiri et al. (2018) , which uses the RF distance and SH cluster validity index. The comparison was conducted in terms of the quality of clustering results returned by competing methods (Fig. 2a and b ) and the running time ( Fig.   2c and d).

Our second simulation experiment involved partitioning trees with different sets of leaves with the objective to build multiple consensus supertrees. The data generation protocol for this experiment included an additional step consisting of the random removal of some species (i.e. tree leaves) from the generated gene trees. The branches adjacent to the removed leaves were collapsed. The following intervals of missing data were considered: 0% (no species were removed), 10% (5% to 15% of species were randomly removed), 25% (16% to 35% of species were randomly removed) and 50% (36% to 65% of species were randomly removed).

The exact number of species to be removed from each gene tree and each data interval was drawn randomly using a uniform distribution. We also made sure that every pair of trees in each input dataset had at least 4 species in common. The value of the penalization parameter  in Equation 10 was set to 0 in this experiment. Two independent simulations were carried for the supertree version of our algorithm using the OFST_EA objective function (Equation

) and the CH-E (Equations 15-20) and BH cluster validity indices adapted for supertree partitioning (see Supplementary Figs. S3 and S4). The OFST_EA objective function was used because it provided the best overall performance in our first simulation experiment with consensus trees (see Fig. 1 ), whereas the BH index was used because it slightly outperformed the Gap index in case of heterogeneous data (i.e. when the number of clusters K was equal to 1).

Our third simulation experiment was also conducted to evaluate the ability of our algorithm to cope with incomplete data. As in the second experiment, gene trees with different sets of leaves were considered. The supertree version of our algorithm based on the OFST_EA objec- Our computational experiments were carried out using a 64-bit PC computer equipped with an Intel i7-8750H CPU (2.5 GHz) and 32 Gb of RAM, except for the simulation comparing the performances of the state-of-the-art clustering algorithms, which was conducted on a high-performance parallel computing server of Compute Canada.

Supplementary Figures S3 and S4 illustrate the classification performance of our supertree clustering algorithm based on the OFST_EA objective function used with CH and BH cluster validity indices, respectively. Here, the value of the penalization parameter  in Equation 10 was set to 0.

The algorithm was applied to gene trees containing different numbers and sets of leaves. We can observe that the clustering performances of both tested algorithm's variants gradually decreases as the number of missing species (i.e. tree leaves) increases. The supertree clustering algorithm based on CH generally outperformed that based on BH, but the BH-based variant seems to be less sensitive to the increase in the number of missing species as the number of clusters and the coalescence noise grow (e.g. for the case of 5 tree clusters, the average ARI value provided by the CH-based version decreased from 0.97 for 0% of missing species to 0.81 for 50% of missing species (Fig. S3a) , whereas for the BH-based version it decreased from 0.51 for 0% of missing species to 0.44 for 50% of missing species (Fig. S4a) ).

In this Appendix, we first describe the SARS-CoV-2 dataset examined in our work, then give some details regarding the applied multiple sequence alignment and tree inference methods, and finally present and discuss the results of our supertree clustering analysis (see Fig. 4 and the related description in the main text for the presentation of the obtained results). Several recent studies provide evidence of recombination events in different genes of betacoronavirus organisms. For example, Boni et al. (2020) pointed out that sarbecoviruses (i.e. the viral subgenus containing SARS-CoV and SARS-CoV-2) undergo frequent recombination events and exhibit spatially structured genetic diversity on a regional scale in China. Li et al. (2020) demonstrated that SARS-CoV-2's receptor binding domain has been introduced through recombination with a pangolin coronavirus and indicated that similar purifying selection in different host species, along with frequent recombination among coronaviruses, may represent a common evolutionary mechanism leading to the emergence of human coronaviruses.

To carry out our the supertree clustering analysis, we considered the case of evolution of 43 betacoronavirus organisms, including : (1) CoVs. The first 25 of these CoV genomes (sub-groups 1 to 8 above) include the closest relatives of SARS-CoV-2; they have been originally examined in Lam et al. (2020) . The remaining 18 CoVs (sub-groups 9 to 17 above) comprise betacoronaviruses labeled as common cold

CoVs in the Gisaid coronavirus tree (Shu and McCauley 2017) and those studied by Prabakaran et al. (2006) . Supplementary Table S1 (see Supplementary Material) provides the organism names, host species, and Gisaid or GenBank accession numbers for the 43 coronaviruses considered in our study.

Our analysis was conducted for 11 main genes of the SARS-Cov-2 genome (i.e. genes

ORF1ab, S, ORF3a, E, M, ORF6, ORF7a, ORF7b, ORF8, N and ORF10) as well as for the RD domain of the spike protein because of its key evolutionary importance. Thus, 12 gene phylogenies were inferred and analyzed in our study. It is worth mentioning that some gene annotations were absent in the GenBank and Gisaid databases (i.e. annotations for genes ORF6, ORF7a, ORF7b, ORF8 and ORF10 for taxa from sub-groups 9 to 17 and annotations for gene ORF8 for taxa from sub-group 8 were unavailable), thus leading to gene phylogenies having different, but mutually overlapping, sets of leaves.

The VGAS program (Zhang et al. 2019 ), intended to identify viral genes and carry out gene function annotation, was executed to validate all betacoronavirus genes found in GenBank and Gisaid. We then performed multiple sequence alignments (MSAs) for the 11 considered coronavirus genes (DNA sequences) and for the RB domain (amino acid sequences) by means of the MUSCLE algorithm (Edgar 2004) available at the Phylogeny.fr web server (Dereeper et al. 2008) , was then used with the less stringent correction option to remove MSA sites with large gap ratios. Gene and genome trees that will be further used in clustering and HGT (Horizontal Gene Transfer) and recombination analyses were inferred using the RAxML algorithm (v. 0.9.0; Stamatakis 2006). The most

A clustering technique for summarizing multivariate data

Robinson-Foulds supertrees

Phylogenetic reconstruction and lateral gene transfer

The median procedure for n-trees

Combining trees as a way of combining data sets for phylogenetic inference, and the desirability of combining gene trees

Quartet cleaning: Improved algorithms and simulations

Geometry of the space of phylogenetic trees

Phylogenetic supertrees: combining information to reveal the tree of life

Inferring and validating horizontal gene transfer events using bipartition dissimilarity

T-REX: a web server for inferring, validating and visualizing phylogenetic trees and networks

Multipolar consensus for phylogenetic trees

On the computational complexity of the rooted subtree prune and regraft distance

Computing the quartet distance between evolutionary trees

A classification of consensus methods for phylogenetics. DIMACS series in discrete mathematics and theoretical computer science

The recovery of trees from measure of dissimilarity. Mathematics and the archeological and historical sciences

A dendrite method for cluster analysis

Clann: investigating phylogenetic information through supertree analyses

The partial order by inclusion of the principal classes of dissimilarity on a finite set, and some of their basic properties

The supermatrix approach to systematics

Quartets and unrooted phylogenetic networks

Do branch lengths help to locate a tree in a phylogenetic network

Mathematics of Evolution and Phylogeny

Multiple consensus trees: a method to separate divergent genes

On the complexity of comparing evolutionary trees

An optimal algorithm for building the majority-rule consensus tree

A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates

Identifying SARS-CoV-2 related coronaviruses in Malayan pangolins

Armadillo 1.1: an original workflow platform for designing and conducting phylogenetic analysis and simulations

Some methods for classification and analysis of multivariate observations

The discovery and importance of multiple islands of most-parsimonious trees

The tree of life web project

The planar k-means problem is NP-hard

Horizontal gene transfer and recombination analysis of SARS-CoV-2 genes helps discover its close relatives and shed light on its origin

Comparison of additive trees using circular orders

A view of some consensus methods for trees

Conservative supertrees

Penalized model-based clustering with application to variable selection

Recombination in viruses: mechanisms, methods of study, and evolutionary consequences

Phylogenetic inference based on matrix representation of trees

Comparison of phylogenetic trees

Silhouettes: a graphical aid to the interpretation and validation of cluster analysis

Detecting horizontal gene transfer: a probabilistic approach

On defining and finding islands of trees and mitigating large island bias

Quartets MaxCut: a divide and conquer quartets algorithm

The shape of phylogenetic treespace

Initializing k-means batch clustering: A critical evaluation of several techniques

Statistically based postprocessing of phylogenetic analysis by clustering

An Experimental Analysis of Robinson-Foulds Distance Matrix Algorithms

A new fast method for inferring multiple consensus trees using k-medoids

Estimating the number of clusters in a data set via the gap statistic

An efficient algorithm for computing Ml consensus trees

Supertree construction: opportunities and challenges

Supertrees based on the subtree prune-and-regraft distance

Properties of supertree methods in the consensus setting

was used for each MSA. Precisely, the (GTR+G+I) model was found to be the best-fit substitution model for genes ORF1ab, S, N and for the whole genomes, the (HKY+G) model was the most suitable for genes ORF3a, E, ORF6, ORF7a, the (HKY+I) model for gene ORF7b, the (HKY+G+I) for gene ORF8, the (JC) model for gene ORF10, and the (WAG+G) model for the RB domain

The consensus supertrees (see Supplementary Figures S4a, b and c or Figures 4a, b and c in the main text) for each cluster found by our algorithm were inferred using the CLANN program (Creeve and McInerney

2012) was used to infer directional horizontal gene transfer-recombination network

Multiple sequence alignments for all gene and genome sequences used in this study as well as all inferred gene and species phylogenies

Full virus names, abbreviations, host species and GenBank/GISAID accession numbers for the 43 betacoronavirus genomes analysed in our study

Bat SARS coronavirus Rf1 Rf1 Bat DQ412042

Approximating geodesic tree distance

ISODATA, a novel method of data analysis and pattern classification

A clustering technique for summarizing multivariate data. Behavioral

Robinson-Foulds supertrees

Phylogenetic reconstruction and lateral gene transfer

The median procedure for n-trees

The median procedure in cluster analysis and social choice theory

Combining trees as a way of combining data sets for phylogenetic inference, and the desirability of combining gene trees

Highways of gene sharing in prokaryotes. Proceedings of the National Academy of Sciences

Point estimates in phylogenetic reconstructions

Quartet cleaning: Improved algorithms and simulations

Amalgamating source trees with different taxonomic levels

Geometry of the space of phylogenetic trees

Novel versus unsupported clades: assessing the qualitative support for clades in MRP supertrees

Phylogenetic supertrees: combining information to reveal the tree of life

T-REX: a web server for inferring, validating and visualizing phylogenetic trees and networks

Inferring and validating horizontal gene transfer events using bipartition dissimilarity

Clustering methods: a history of k-means algorithms. Selected contributions in data analysis and classification

Evolutionary origins of the SARS-CoV-2 sarbecoronavirus lineage responsible for the COVID-19 pandemic

Multipolar consensus for phylogenetic trees

On the computational complexity of the rooted subtree prune and regraft distance

Consistency of topological moves based on the balanced minimum evolution principle of phylogenetic inference

Computing the quartet distance between evolutionary trees in time O(nlog n)

Computing the quartet distance between evolutionary trees

A classification of consensus methods for phylogenetics. DIMACS series in discrete mathematics and theoretical computer science

Toward automatic reconstruction of a highly resolved tree of life

A dendrite method for cluster analysis

Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis

Clann: investigating phylogenetic information through supertree analyses

The partial order by inclusion of the principal classes of dissimilarity on a finite set, and some of their basic properties

Majority-rule supertrees

Axiomatic consensus theory in group choice and biomathematics

The supermatrix approach to systematics

Phylogeny.fr: robust phylogenetic analysis for the non-specialist

Geometry of cuts and metrics

Constructing majority-rule supertrees

Prospects for building the tree of life from large sequence databases

MUSCLE: multiple sequence alignment with high accuracy and high throughput

Numerical taxonomy

Inferring phylogenies

Quartets and unrooted phylogenetic networks

Do branch lengths help to locate a tree in a phylogenetic network

Mathematics of Evolution and Phylogeny

Multiple consensus trees: a method to separate divergent genes

On the complexity of comparing evolutionary trees

Application of phylogenetic networks in evolutionary studies

An optimal algorithm for building the majority-rule consensus tree

Hypermetric spaces and metric transforms. Inequalities II

A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates

MEGA X: molecular evolutionary genetics analysis across computing platforms

Identifying SARS-CoV-2 related coronaviruses in Malayan pangolins

Emergence of SARS-CoV-2 through recombination and strong purifying selection

A cluster reduction for computing the subtree distance between phylogenies

Least squares quantization in PCM

Armadillo 1.1: an original workflow platform for designing and conducting phylogenetic analysis and simulations

Some methods for classification and analysis of multivariate observations

The tree of life web project

The discovery and importance of multiple islands of most-parsimonious trees

The planar k-means problem is NP-hard

Comparison of additive trees using circular orders

Improving the additive tree representation of a dissimilarity matrix using reticulations

Optimal variable weighting for ultrametric and additive trees and K-means partitioning: Methods and software

Horizontal gene transfer and recombination analysis of SARS-CoV-2 genes helps discover its close relatives and shed light on its origin

A view of some consensus methods for trees

Conservative supertrees

Polyhedral computational geometry for averaging metric phylogenetic trees

A fast algorithm for computing geodesic distances in tree space

Penalized model-based clustering with application to variable selection

Recombination in viruses: mechanisms, methods of study, and evolutionary consequences. Infection, Genetics and Evolution

Structure of severe acute respiratory syndrome coronavirus receptor-binding domain complexed with neutralizing antibody

Phylogenetic inference based on matrix representation of trees

Comparison of phylogenetic trees

Silhouettes: a graphical aid to the interpretation and validation of cluster analysis

Phylogenetic supertrees: assembling the trees of life

Detecting horizontal gene transfer: a probabilistic approach

GISAID: Global initiative on sharing all influenza data-from vision to reality

On defining and finding islands of trees and mitigating large island bias

Quartets MaxCut: a divide and conquer quartets algorithm

The shape of phylogenetic treespace

RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models

Initializing k-means batch clustering: A critical evaluation of several techniques

Statistically based postprocessing of phylogenetic analysis by clustering

An Experimental Analysis of Robinson-Foulds Distance Matrix Algorithms

Maximum likelihood supertrees

Probability measures on metric spaces of nonpositive curvature

SuperFine: fast and accurate supertree estimation

Modeling gene family evolution and reconciling phylogenetic discord

The inference of gene trees with species trees

A new fast method for inferring multiple consensus trees using k-medoids

Estimating the number of clusters in a data set via the gap statistic

Penalized and weighted k-means for clustering with scattered objects and prior information in high-throughput biological data

An efficient algorithm for computing Ml consensus trees

Supertrees based on the subtree prune-and-regraft distance

Properties of supertree methods in the consensus setting

Simulating and summarizing sources of gene tree incongruence

Vgas: A Viral Genome Annotation System

We would like to thank Dr Nicolas Lartillot for providing us with the MPC tree clustering program, Drs Pierre Legendre and Bogdan Mazoure for helping us with the analysis of SARS-CoV-2 data, and Dr Russell Schwartz and two anonymous reviewers of our manuscript for their helpful comments and suggestions. We also thank Compute Canada for providing access to high-performance computing facilities. Conflict of Interest: none declared.

 Supplementary Fig. S3 . Classification performance of our k-means-based supertree clustering algorithm applied to trees with different numbers and sets of leaves (the following numbers of leaves were randomly removed from each tree: 0% (no species were removed), 10% (5% to 15% of species were removed), 25% (16% to 35% of species were removed) and 50% (36% to 65% of species were removed)) using the OFST_EA objective function with approximation by Euclidean distance and the CH cluster validity index (CH-E). The value of the penaliza-