AncestralClust: Clustering of Divergent Nucleotide Sequences by Ancestral Sequence Reconstruction using Phylogenetic Trees


AncestralClust: Clustering of Divergent Nucleotide
Sequences by Ancestral Sequence Reconstruction
using Phylogenetic Trees
Lenore Pipes 1,∗ and Rasmus Nielsen 1,2,3∗

1Department of Integrative Biology, University of California-Berkeley, Berkeley, 94707, USA,
2Department of Statistics, University of California-Berkeley, Berkeley, CA 94707, USA, and
3Globe Institute, University of Copenhagen, 1350 København K, Denmark

∗To whom correspondence should be addressed.

Abstract

Motivation: Clustering is a fundamental task in the analysis of nucleotide sequences. Despite the expo-
nential increase in the size of sequence databases of homologous genes, few methods exist to cluster
divergent sequences. Traditional clustering methods have mostly focused on optimizing high speed clus-
tering of highly similar sequences. We develop a phylogenetic clustering method which infers ancestral
sequences for a set of initial clusters and then uses a greedy algorithm to cluster sequences.
Results: We describe a clustering program AncestralClust, which is developed for clustering divergent
sequences. We compare this method with other state-of-the-art clustering methods using datasets of
homologous sequences from different species. We show that, in divergent datasets, AncestralClust has
higher accuracy and more even cluster sizes than current popular methods.
Availability and implementation: AncestralClust is an Open Source program available at
https://github.com/lpipes/ancestralclust
Contact: lpipes@berkeley.edu
Supplementary information: Supplementary figures and table are available online.

1 Introduction
Traditional clustering methods such as UCLUST (Edgar, 2010),
CD-HIT (Fu et al., 2012), and DNACLUST (Ghodsi et al., 2011)
use hierarchical or greedy algorithms that rely on user input of
a sequence identity threshold. These methods were developed
for high speed clustering of a high quantity of highly similar se-
quences (Ghodsi et al., 2011; Li et al., 2001; Edgar, 2010) and,
generally, these methods are considered unreliable for identity
thresholds <75% because of either the poor quality of alignments
at low identities (Zou et al., 2018) or because the performance of
the threshold used to count short words drops dramatically with
low identities (Huang et al., 2010). At low identities, these meth-
ods produce uneven clusters where the majority of sequences are
contained in only a few clusters (Chen et al., 2018) and the high
variance in cluster sizes reduces the utility of the clustering step
for many practical purposes. Clustering of divergent sequences is

a fundamental step in genomics analysis because it allows for an
early divide-and-conquer strategy that will significantly increase
the speed of downstream analyses (Zheng et al., 2018) and clus-
tering of divergent sequences is a frequent request of users of at
least one clustering method (Huang et al., 2010). Currently, there
are no clustering methods that can accurately cluster large taxo-
nomically divergent metabarcoding reference databases such as
the Barcode of Life database (Ratnasingham and Hebert, 2007)
in relatively even clusters. Only a few other methods, such as Sp-
Clust (Matar et al., 2019) and TreeCluster (Balaban et al., 2019),
exist for clustering potentially divergent sequences. SpClust cre-
ates clusters based on the use of Laplacian Eigenmaps and the
Gaussian Mixture Model based on a similarity matrix calculated
on all input sequences. While this approach is highly accurate,
the calculation of an all-to-all similarity matrix is a computation-
ally exhaustive step. TreeCluster uses user-specified constraints
for splitting a phylogenetic tree into clusters. However, TreeClus-
ter requires an input tree and thus can also be prohibitively slow

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.426008doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.08.426008
http://creativecommons.org/licenses/by/4.0/


2 Pipes and Nielsen

for large numbers of sequences where a phylogenetic tree is dif-
ficult to estimate reliably. With the increasing size of reference
databases (Schoch et al., 2020), there is a need for new compu-
tationally efficient methods that can cluster divergent sequences.
Here we present AncestralClust that was specifically developed
for clustering of divergent metabarcoding reference sequences in
clusters of relatively even size.

2 Methods
To cluster divergent sequences, we developed AncestralClust
which is written in C (Figure 1). Firstly, k random sequences are
chosen and the sequences are aligned pairwise using the wavefront
algorithm (Marco-Sola et al., 2020). A Jukes-Cantor distance ma-
trix is constructed from the alignments and a neighbor-joining
phylogenetic tree is constructed. The Jukes-Cantor model is cho-
sen for computational speed, but more complex models could
in principle be used to potentially increase accuracy but also in-
crease computational time. The C − 1 longest branches in the
tree are then cut to yield C clusters. These subtrees comprise
the initial starting clusters. The sequences in each starting clus-
ter are aligned in a multiple sequence alignment using kalign3
(Lassmann, 2020). The ancestral sequences at the root of the tree
of each cluster is estimated using the maximum of the posterior
probability of each nucleotide using standard programming algo-
rithms from phylogenetics (see e.g., Yang, 2014). The ancestral
sequences are used as the representative sequence for each cluster.
Next, the rest of the sequences are assigned to each cluster based
on the shortest nucleotide distance from the wavefront alignment
between the sequence and the C ancestral sequences. If the short-
est distance to any of the C ancestral sequences is larger than the
average distance between clusters, the sequence is saved for the
next iteration. We iterate this process until all sequences are as-
signed to a cluster. In each iteration after the first iteration, a cut
of a branch in the phylogenetic tree is chosen if the the branch is
longer that the average length of branches cut in the first iteration.
In praxis, only one or two iterations are needed for most data sets
if k is defined to be sufficiently large.

We compared AncestralClust to five other state-of-the-art
clustering methods: UCLUST (Edgar, 2010), meshclust2 (James
and Girgis, 2018), DNACLUST (Ghodsi et al., 2011), CD-HIT
(Fu et al., 2012), and SpClust (Matar et al., 2019). We used a
variety of measurements to assess the accuracy and evennness of
the clustering. We calculated two traditional measures of accu-
racy, purity and normalized mutual information (NMI), used in
Bonder et al. (2012). The purity of clusters is calculated as:

purity(Ω, C) =
1

N

∑
k

max
j
|ωk ∩ cj| (1)

where Ω = w1, w2, ..., wk is the set of clusters, C =
c1, c2, ..., cj is the set of taxonomic classes and N is the total

number of sequences. NMI is calculated as:

NMI(Ω, C) =
I(Ω, C)

[H(Ω) + H(C)]/2
(2)

where mutual information gain is I(Ω, C) and H is the entropy
function. To measure the evenness of the clusters, we used the
coefficient of variation which is calculated as:

CV =

√∑j
i (ni − m)

2/j

m
(3)

where ni is the number of sequences in cluster i, j is the total
number of clusters, and m is the mean size of the clusters. We
also used a taxonomic incompatibility measure to assess the ac-
curacy of the clusters. Let a,b be a pair of species found in cluster
i. Incompatibility at a given taxonomic rank is calculated by first
identifying the number of times a and b exist in clusters other
than cluster i. The total incompatibility is calculated by summing
over all pairs of sequences (a,b) and all i.

Both NMI and taxonomic incompatibility are very sensitive
to the number of clusters and also to unevenness of cluster sizes.
To allow fair comparison when numbers of clusters and evenness
of cluster sizes vary we, therefore, calculate the relative NMI
and relative incompatibility. These measures are calculated by
scaling them relative to their expected values under random as-
signments given the number of clusters and the cluster sizes. We
estimated relative NMI by dividing the raw NMI score by the
average NMI of 10 clusterings in which sequences have been as-
signed at random with equal probability to clusters, such that the
cluster sizes are same as the cluster sizes produced in the original
clustering. The same procedure was used to convert the taxonomic
incompatibility measure into relative incompatibility.

3 Results
To first assess performance of clustering methods on divergent
nucleotide sequences, we used 100 random samples of 10,000
sequences from three metabarcode reference databases (16, 18S,
and Cytochrome Oxidase I (COI)) from the CALeDNA project
Meyer et al. (2019). We chose to compare our method on this
dataset against UCLUST because it is the most widely used clus-
tering program and it performs better than CD-HIT on low identity
thresholds (Chen et al., 2018).

We first compared AncestralClust against UCLUST using
relative NMI and Coefficient of Variation (Figure 2). We used
k = 300 random initial sequences, which is 3% of the total num-
ber of sequences in each sample and C = 16 cuts in the initial
phylogenetic tree. Notice that the relative NMI tends to be higher
with a lower coefficient of variation for AncestralClust across all
barcodes. This suggests, that for these divergent eDNA sequences,
AncestralClust provides clusterings that are more even in size and
that are more consistent with conventional taxonomic assignment.
As a second measure of accuracy we measured relative incom-
patibility and coefficient of variation using AncestralClust and
UCLUST using for the same datasets under the same running

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.426008doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.08.426008
http://creativecommons.org/licenses/by/4.0/


AncestralClust 3

conditions. Notice in Figure 3, AncestralClust tends to create
balanced clusters with lower relative taxonomic incompatibilities
compared to UCLUST at all taxonomic levels. Similar results
are seen for metabarcode 18S (Fig S1). However, for metabar-
code 16S (Fig S2), AncestralClust performs noticeably better than
UCLUST at the species, genus, and family levels but at the order,
class, and phylum levels it performs either the same or worse.
Also, at the species, genus, and family levels, it is apparent that
as the UCLUST clusters approach a lower coefficient of variation,
the relative incompatibility increases dramatically.

Next, we analyzed two datasets with different properties: one
dataset of diverse species from the same gene and another dataset
of homologous genes from species of the same phyla. In the
first dataset, we expect that the sequences to cluster according to
species. In the second dataset, we expect the sequences to cluster
according to different genes. We compared AncestralClust to four
commonly used clustering programs (UCLUST, meshclust2, CD-
HIT2, and DNACLUST) and one clustering program designed for
divergent sequences, SpClust. The first dataset contained 13,043
sequences from the COI CaleDNA database from 11 divergent
species that were from 7 different phyla and 11 different classes
and the second data set contained sequences from 6 different genes
from taxonomically similar species. First, we compared all meth-
ods using 13,043 COI sequences from the 11 different species
(Table 1). We expect these sequences to form 11 different clus-
ters, each including all the sequences from one species. We chose
identity thresholds to enforce the expected number of clusters for
each method. We were unable to form 11 clusters using CD-HIT
because the program does not allow clustering of sequences with
identity thresholds < 80% at default parameters. For SpClust,
we used the three precision modes available for the method. In
this analysis, AncestralClust achieved a perfect clustering (the
purity was 1 and relative incompatibility was 0) although it was
the second slowest, and had the second lowest memory require-
ments. UCLUST was one of the fastest methods and used the least
amount of memory but had the second lowest purity with third
highest relative NMI values. meshclust2 had no incompatibilities
and the second highest purity and relative NMI values but was the
third slowest method. DNACLUST had the most uneven clusters
and the second lowest relative NMI value with the highest relative
incompatibility. SpClust only identified one cluster, with a com-
putational time of ~2 days. In comparison, AncestralClust took
~5 minutes and UCLUST used < 1 second.

Next, we analyzed ’genomic set 1’ from Matar et al.
(2019), which consists of 39 sequences from 6 homologous
genes (FCER1G, S100A1, S100A6, S100A8, S100A12, and
SH3BGRL3 in Table 2). We expect these sequences to form
6 clusters. We varied the identity thresholds for UCLUST and
meshclust2 using thresholds 0.4, 0.6, and 0.8. For CD-HIT, we
used the lowest identity threshold available on default parameters
which is 0.8. We were unable to use DNACLUST for this anal-
ysis because it cannot handle sequences longer than 4500bp (the
average sequence length was 2,387.9bp and the longest sequence
was 5,363bp). Since this dataset contained 6 different genes, we

calculated relative NMI using genes as the classes and did not
use incompatibility as an accuracy measure. Only AncestralClust,
UCLUST, and meshclust2 produced the expected number of clus-
ters, and among the methods that created the expected number of
clusters, AncestralClust had the highest purity value. Ancestral-
Clust was the second slowest method and had the highest memory
requirements which is due to the wavefront algorithm alignment
which isO(s2) in memory requirements where s is the alignment
score. Since alignments were performed using 6 different genes
that were longer than 1.5kb, this resulted in a high value of s. Sp-
Clust had the highest relative NMI using all precision modes and
the same purity as AncestralClust for its moderate and maximum
precision modes, however, failed to produce the expected number
of clusters.

4 Conclusions
We developed a phylogenetic-based clustering method, Ances-
tralClust, specifically to cluster divergent metabarcode sequences.
We performed a comparative study between AncestralClust and
widely used clustering programs such as UCLUST, CD-HIT,
DNACLUST, meshclust2, and for divergent sequences, SpClust.
UCLUST and DNACLUST are substantially faster than Ances-
tralClust and should be the preferred method if computational
speed is the main concern. However, AncestralClust tends to form
clusters of more even size with lower taxonomic incompatibility
and higher NMI than other methods, for the relatively divergent
sequences analyzed here. We recommend the use of Ancestral-
Clust when sequences are divergent, especially if a relatively even
clustering is also desirable, for example for various divide-and-
conquer approaches where computational speed of downstream
analyses increases faster than linearly with cluster size.

Acknowledgements
This work used the Extreme Science and Engineering Discov-
ery Environment (XSEDE) Bridges system at the Pittsburgh
Supercomputing Center through allocation BIO180028.

References
Balaban, M., Moshiri, N., Mai, U., Jia, X., and Mirarab, S. (2019).

Treecluster: Clustering biological sequences using phylogenetic trees.
PloS one, 14(8), e0221068.

Bonder, M. J., Abeln, S., Zaura, E., and Brandt, B. W. (2012). Compar-
ing clustering and pre-processing in taxonomy analysis. Bioinformatics,
28(22), 2891–2897.

Chen, Q., Wan, Y., Zhang, X., Lei, Y., Zobel, J., and Verspoor, K. (2018).
Comparative analysis of sequence clustering methods for deduplication
of biological databases. J. Data and Information Quality, 9(3).

Edgar, R. C. (2010). Search and clustering orders of magnitude faster than
blast. Bioinformatics, 26(19), 2460–2461.

Fu, L., Niu, B., Zhu, Z., Wu, S., and Li, W. (2012). Cd-hit: accelerated for
clustering the next-generation sequencing data. Bioinformatics, 28(23),
3150–3152.

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.426008doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.08.426008
http://creativecommons.org/licenses/by/4.0/


4 Pipes and Nielsen

Ghodsi, M., Liu, B., and Pop, M. (2011). Dnaclust: accurate and efficient
clustering of phylogenetic marker genes. BMC bioinformatics, 12(1),
1–11.

Huang, Y., Niu, B., Gao, Y., Fu, L., and Li, W. (2010). CD-HIT
Suite: a web server for clustering and comparing biological sequences.
Bioinformatics, 26(5), 680–682.

James, B. T. and Girgis, H. Z. (2018). Meshclust2: Application of
alignment-free identity scores in clustering long dna sequences. bioRxiv,
page 451278.

Lassmann, T. (2020). Kalign 3: multiple sequence alignment of large
datasets.

Li, W., Jaroszewski, L., and Godzik, A. (2001). Clustering of highly
homologous sequences to reduce the size of large protein databases.
Bioinformatics, 17(3), 282–283.

Marco-Sola, S., Moure López, J. C., Moreto Planas, M., and Es-
pinosa Morales, A. (2020). Fast gap-affine pairwise alignment using
the wavefront algorithm. Bioinformatics, (btaa777), 1–8.

Matar, J., Khoury, H. E., Charr, J.-C., Guyeux, C., and Chrétien, S. (2019).
Spclust: Towards a fast and reliable clustering for potentially divergent
biological sequences. Computers in biology and medicine, 114, 103439.

Meyer, R. S., Curd, E. E., Schweizer, T., Gold, Z., Ramos, D. R., Shirazi,
S., Kandlikar, G., Kwan, W.-Y., Lin, M., Freise, A., et al. (2019). The
california environmental dna “caledna” program. bioRxiv, page 503383.

Ratnasingham, S. and Hebert, P. D. (2007). Bold: The barcode of life data
system (http://www. barcodinglife. org). Molecular ecology notes, 7(3),
355–364.

Schoch, C. L., Ciufo, S., Domrachev, M., Hotton, C. L., Kannan, S.,
Khovanskaya, R., Leipe, D., Mcveigh, R., O’Neill, K., Robbertse, B.,
et al. (2020). Ncbi taxonomy: a comprehensive update on curation,
resources and tools. Database, 2020.

Yang, Z. (2014). Molecular evolution: a statistical approach. Oxford
University Press.

Zheng, W., Mao, Q., Genco, R. J., Wactawski-Wende, J., Buck, M., Cai, Y.,
and Sun, Y. (2018). A parallel computational framework for ultra-large-
scale sequence clustering analysis. Bioinformatics, 35(3), 380–388.

Zou, Q., Lin, G., Jiang, X., Liu, X., and Zeng, X. (2018). Sequence clus-
tering in bioinformatics: an empirical study. Briefings in Bioinformatics,
21(1), 1–10.

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.426008doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.08.426008
http://creativecommons.org/licenses/by/4.0/


AncestralClust 5

Figure 1. Overview of AncestralClust. In (1), k random sequences are chosen for the initial clusters. (2) Using the k sequences a distance matrix is constructed. Using
the distance matrix, a neighbor-joining tree is constructed and C − 1 cuts are made to create C clusters. In (4), each cluster is multiple sequenced aligned and the
ancestral sequences are reconstructed in the root node of each tree. The rest of the unassigned sequences are then aligned to the ancestral sequences of each cluster and
the shortest distance to each ancestral sequence is calculated. The process is iterated until all sequences are assigned to a cluster.

Figure 2. Relative NMI against coefficient of variation for AncestralClust and UCLUST for 100 samples of 10,000 randomly chosen 16S, 18S, and COI reference
sequences from the CALeDNA Project (Meyer et al., 2019). The similarity threshold for UCLUST was 0.58. For AncestralClust, we used 300 initial random sequences
with 15 initial clusters. Relative NMI was calculated by dividing NMI by the average of 10 random samples of the same fixed cluster size.

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.426008doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.08.426008
http://creativecommons.org/licenses/by/4.0/


6 Pipes and Nielsen

Figure 3. Relative incompatibility against coefficient of variation for AncestralClust and UCLUST for 100 samples of 10,000 randomly chosen COI reference
sequences. COI reference sequences are from the CALeDNA Project (Meyer et al., 2019). The similarity threshold for UCLUST was 0.58. For AncestralClust, we
used 300 initial random sequences with 15 initial clusters.

Table 1. Comparisons of clustering methods using 13,043 COI sequences from 11 different species. The list of species can be found in Table
S1. Incompatibility was calculated at the taxonomic rank of species. For UCLUST, meshclust2, and DNACLUST, the identity thresholds were
chosen to force the expected 11 number of clusters. For CD-HIT, the lowest possible identity was chosen which is 0.8. In the case of SpClust,
Coefficient of Variation cannot be calculated for 1 cluster. SpClust clusters were created with version 2.

Method
# of

clusters
Time
(sec)

Mem
(MB)

Purity
Relative

Incompat.
(species)

Relative
NMI

Coeff.
of

Var.
AncestralClust 11 293.2 19.3 1 0 551.09 0.8574

UCLUST 11 <1 9.9 0.8717 0.0182 474.63 0.8300
meshclust2 11 108.14 46.5 0.9855 0 498.898 0.1053

CD-HIT 24 5.86 43.9 0.8561 0 241.66 1.2031
DNACLUST 11 <1 170.6 0.9455 0.0545 24.21 1.8987

SpClust
(fast)

1 152046.5 2678.9 1 0 1 -

SpClust
(moderate)

1 188172.9 6457.6 1 0 1 -

SpClust
(maxPrecision)

1 189577.1 6452.5 1 0 1 -

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.426008doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.08.426008
http://creativecommons.org/licenses/by/4.0/


AncestralClust 7

Table 2. Comparisons of clustering methods using 39 sequences from 6 homologous genes from Matar et al. (2019).’id’ refers to the identity
threshold used. We used identity thresholds of 0.4, 0.6, and 0.8 for UCLUST and meshclust2. We used precision levels of fast, moderate, and
maximum for SpClust using version 1 since version 2 only produced 1 cluster for all modes. DNACLUST has a maximum sequence length of
4500bp and could not be used on this dataset.

Method
# of

clusters
Time
(sec)

Memory
(Mb)

Purity
Relative

NMI

Coefficient
of

Variation
AncestralClust 6 370.3 412.0 0.9487 1.8660 0.3982

UCLUST

(id=0.4)
6 1 15.4 0.7436 1.5667 0.5396

UCLUST

(id=0.6)
19 1 20.1 0.7179 1.4379 0.7166

UCLUST

(id=0.8)
29 1.9 20.4 0.5641 1.1717 0.4565

meshclust2
(id=0.4)

6 1.1 7.7 0.8462 1.6625 1.2489

meshclust2
(id=0.6)

10 2.9 8.8 0.7949 1.9257 1.071

meshclust2
(id=0.8)

26 2.4 9.4 0.6410 1.2240 0.6325

SpClust
(fast)

4 44.6 166.2 0.8718 2.2463 0.8432

SpClust
(moderate)

4 112.5 166.1 0.9487 2.4335 0.6453

SpClust
(max

precision)

4 570.1 166.0 0.9487 2.9449 0.6809

CD-HIT

(id=0.8)
31 0.48 39.9 0.4103 1.0950 0.4574

DNACLUST - - - - - -

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.426008doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.08.426008
http://creativecommons.org/licenses/by/4.0/