DeLUCS: Deep Learning for Unsupervised Clustering of DNA Sequences

Pablo Millán Arias, Fatemeh Alipour, Kathleen A. Hill, Lila Kari

bioRxiv, November 2, 2021. DOI: 10.1101/2021.05.13.444008

Abstract

We present a novel Deep Learning method for the Unsupervised Clustering of DNA Sequences (DeLUCS) that does not require sequence alignment, sequence homology, or (taxonomic) identifiers. DeLUCS uses Frequency Chaos Game Representations (FCGR) of primary DNA sequences, and generates "mimic" sequence FCGRs to self-learn data patterns (genomic signatures) through the optimization of multiple neural networks. A majority voting scheme is then used to determine the final cluster assignment for each sequence. The clusters learned by DeLUCS match true taxonomic groups for large and diverse datasets, with accuracies ranging from 77% to 100%: 2,500 complete vertebrate mitochondrial genomes, at taxonomic levels from sub-phylum to genera; 3,200 randomly selected 400 kbp-long bacterial genome segments, into clusters corresponding to bacterial families; three viral genome and gene datasets, averaging 1,300 sequences each, into clusters corresponding to virus subtypes. DeLUCS significantly outperforms two classic clustering methods (K-means++ and Gaussian Mixture Models) for unlabelled data, by as much as 47%. DeLUCS is highly effective: it is able to cluster datasets of unlabelled primary DNA sequences totalling over 1 billion bp of data, and it bypasses common limitations to classification resulting from the lack of sequence homology, variation in sequence length, and the absence or instability of sequence annotations and taxonomic identifiers. Thus, DeLUCS offers fast and accurate DNA sequence clustering for previously intractable datasets.

Introduction

Traditional DNA sequence classification algorithms rely on large amounts of labour-intensive and human expert-mediated annotating of primary DNA sequences, informing origin and function. Moreover, some of these genome annotations are not always stable. […]

DeLUCS is a fully-automated method that determines cluster assignments for its input sequences, independent of any homology or same-length assumptions, and oblivious to sequence taxonomic labels. DeLUCS can thus be used for successful ab initio classification of datasets that were previously unclassifiable by alignment-based methods, as well as datasets with uncertain or fluctuating taxonomy, where supervised machine learning methods are biased by their reliance on current taxonomic labels.

In summary, DeLUCS is the first effective alignment-free method that uses deep learning for unsupervised clustering of unlabelled raw DNA sequences. The main contributions of this paper are:

• DeLUCS clusters large and diverse datasets, such as complete mitochondrial genomes at several taxonomic levels; randomly selected bacterial genome segments into families; and viral genes and viral full genomes into virus subtypes. To date, these are the largest real datasets of genomic data to be clustered, with our largest experiment comprising over 1 billion bp of data, a full order of magnitude larger than previous studies [8-14].
• DeLUCS achieves "classification accuracies" of 77% to 100%, in each case significantly higher than those of classic alignment-free clustering methods (K-means++ and GMM), with double-digit improvements in most cases. For the majority of the computational tests, DeLUCS classification accuracies are also comparable to, or higher than, those of a supervised machine learning algorithm with the same architecture.

• DeLUCS is a highly-effective/light-preparation method for unsupervised clustering of DNA sequences. Its high classification accuracies are a result of combining the novel concept of mimic sequences with the invariant information learning framework and a majority voting scheme. It is termed light-preparation because it does not require sequence homology, sequence-length similarity, or any taxonomic labels/identifiers during the learning process.

Prior Approaches

The time-complexity limitations of alignment-based methods [15], in addition to their reliance on extraneous sequence information such as sequence homology, have motivated the development of numerous alignment-free methodologies [16,17]. Of these, methods based on k-mer counts have been among the fastest and the most widely used [17]. In parallel to alignment-free approaches, machine learning methods have emerged as promising alternatives for solving classification problems both in genomics and biomedicine [18]. (The difference between classification and clustering is that, while in classification methods the cluster labels are given a priori, in clustering methods the clusters are "discovered" by the method.)

Supervised Machine Learning Approaches

Among supervised learning algorithms, Artificial Neural Networks (ANNs) have proven to be the most effective, with ANN architectures comprising several layers of neurons ("deep learning") being the top performers [19].

In the context of genome classification, alignment-free methods that employ supervised machine learning have been shown to outperform alignment-based methods in the construction of high-quality whole-genome phylogenies [20], the profiling of microbial communities [21], and DNA barcoding at the species level [22]. In recent years, alignment-free methods have successfully applied supervised machine learning techniques to obtain accurate classification of HIV subtypes [23], as well as accurate and early classification of the SARS-CoV-2 virus (COVID-19 virus) [24]. The increasing success of machine learning, and in particular deep learning, techniques is partly due to the introduction of suitable numerical representations for DNA sequences and the ability of the methods to find patterns in these representations (see [23,25] and [21], respectively). Other classification tasks in genomics, such as taxonomic classification [26] and the identification of viral sequences among human samples from raw metagenomic segments [27,28], have also been explored from the deep learning perspective.

Fig 1. Machine learning-based alignment-free methods for classification/clustering of DNA sequences. DeLUCS is the first method to use deep learning for accurate unsupervised clustering of unlabelled raw DNA sequences. The novel use of deep learning in this context significantly boosts the classification accuracy (as defined in the Evaluation section), compared to two other unsupervised machine learning clustering methods (K-means++ and GMM).
One limitation of supervised deep learning is that the performance of the ANNs is heavily dependent on the number of labelled sequences that are available during training. This can become a limiting factor, even though raw sequencing data can now be obtained quickly and inexpensively [29]. The reason for this is the intermediate process that lies between obtaining a raw DNA sequence and uploading that sequence onto a public sequence repository, namely the "invisible" work that goes into assigning a taxonomic label and attaching biological annotations. This is a laborious and expensive process. […]

Unsupervised Machine Learning Approaches

To overcome these limitations, one can attempt to use unsupervised learning, which operates with unlabelled sequences and compensates for the absence of labels by […]. One such approach pairs k-mer counts with the K-means clustering algorithm, normally used in the field of image processing. Similar work [10,14,30] also builds on the K-means algorithm and k-mer counts. Another approach is the use of digital signal processing [11-13], whereby Fourier spectra calculated from a numeric representation of a DNA sequence are used as its quantitative description, and the Euclidean distance is used as a measure of dissimilarity to be employed by either the K-means or the GMM clustering algorithm.

Although K-means (and its improved version, K-means++) is a simple and versatile algorithm, it is dependent on several restrictive assumptions about the dataset, such as the need for manual selection of the parameter K, and the assumption that all clusters have the same size and density. It is also heavily dependent on the selection of the initial cluster centroids, meaning that, for large datasets, numerous initializations of the centroids are required for convergence to the best solution and, moreover, that convergence is not guaranteed [31]. Although GMM is more flexible with regard to the distribution of the data and does not assume that all clusters are spherical, the initialization of clusters is still challenging, especially for high-dimensional data [32,33].

A potential solution to these drawbacks could lie in recent developments in the field of computer vision [34-36], specifically in the concepts at the core of invariant information clustering (IIC) [36], one of the successful methods for the clustering of unlabelled images. These methods are effective for visual tasks and, as such, are not directly applicable to genomic data. In this paper, we propose the use of Frequency Chaos Game Representations (FCGR) of DNA sequences and the novel notion of mimic sequences, to leverage the idea behind IIC. In our approach, FCGR pairs of sequences and of their mimics are generated and used as input for a simple but general de novo Artificial Neural Network (ANN) architecture, specifically designed for the purpose of DNA sequence clustering. Finally, majority voting over several independently trained ANN copies is used to obtain the accurate cluster assignment of each sequence.

Materials and Methods

In this section, we first give an overview of our method and the computational pipeline of DeLUCS. We then describe the core concepts of invariant information clustering, and detail how these concepts are adapted to DNA sequence clustering, by introducing the notion of "mimic sequences".
This is followed by a description of the architecture of the neural networks employed, the evaluation scheme used for assessing the performance of DeLUCS, and all of the implementation details. Finally, we give a description of all the datasets used in this study.

DeLUCS employs a graphical representation of DNA sequences introduced by Jeffrey in [37], called Chaos Game Representation (CGR). In this paper, we use a quantized version of CGR, called Frequency CGR with resolution k, and denoted by FCGR_k. The FCGR_k of a DNA sequence is a two-dimensional unit square image, with the intensity of each pixel representing the frequency of a particular k-mer in the sequence [38]. FCGR_k is a compressed representation of the original DNA sequence, with the degree of compression indicated by the resolution k. All computational experiments in this paper use k = 6, which was empirically assessed as achieving the best balance between accuracy and time complexity. Several studies have demonstrated that the CGR of a genomic sequence can serve as its genomic signature, defined by Karlin and Burge [39] as any numerical quantity that is more similar for DNA sequences of closely related organisms, while being dissimilar for DNA sequences of more distantly related organisms; see […].

Unlike some quantization techniques [40,41], DeLUCS does not need an intermediate step of supervised learning to produce a compressed representation of the data that retains only the information needed for the correct classification of the feature vector and the cluster assignment. This is because FCGR_k already is a compressed representation, by virtue of storing only the counts of all k-mers in the sequence for a given value of k. These k-mer counts contain the intrinsic, taxonomically relevant, information used for the unsupervised learning in DeLUCS.

The computational pipeline of DeLUCS comprises three steps:

1. For each DNA sequence in the dataset, several artificial mimic sequences are generated. The k-mer counts for both the original sequence and its mimic sequences are then computed, to produce their respective FCGRs. In this study, k = 6 was empirically assessed as achieving the best balance between high accuracy and speed.

2. Several copies of an Artificial Neural Network are independently trained on the resulting pairs of original and mimic FCGRs.

3. As the training process of the ANNs is a randomized algorithm which produces different outcomes with high variance, a majority voting scheme over the outcomes of the ANNs in Step 2 is used to determine the final cluster assignment for each sequence.

To evaluate the quality of the clusters, an additional step is performed, independent of DeLUCS. This step first utilizes the Hungarian algorithm to determine the optimal correspondence between the cluster assignments learned by DeLUCS and the true taxonomic cluster labels. It then proceeds to determine the accuracy of the DeLUCS cluster predictions, as detailed in the Evaluation section.
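As an illustration of this representation, the sketch below shows one way to compute an FCGR_k matrix of k-mer counts in Python. It is a minimal example, not the DeLUCS implementation itself; the corner convention chosen for the four bases and the handling of ambiguous bases are assumptions of this sketch.

```python
import numpy as np

def fcgr(sequence: str, k: int = 6) -> np.ndarray:
    """Count the k-mers of a DNA sequence into a 2^k x 2^k Frequency CGR grid.

    Each base contributes one bit per axis, so every k-mer maps to a unique
    pixel; the corner convention below (A, C, G, T) is one common choice
    among several equivalent layouts.
    """
    x_bit = {'A': 0, 'C': 0, 'G': 1, 'T': 1}  # left/right half of the square
    y_bit = {'A': 0, 'C': 1, 'G': 1, 'T': 0}  # bottom/top half of the square
    grid = np.zeros((2 ** k, 2 ** k), dtype=np.float64)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if any(base not in x_bit for base in kmer):
            continue  # skip k-mers containing ambiguous bases such as N
        x = y = 0
        for base in kmer:
            x = (x << 1) | x_bit[base]
            y = (y << 1) | y_bit[base]
        grid[y, x] += 1.0
    return grid

# Example: a 64 x 64 FCGR (k = 6) of a toy sequence.
print(fcgr("ACGTACGTACGTACGT", k=6).shape)  # (64, 64)
```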
Invariant Information Clustering (IIC)

Steps 1 and 2 in the DeLUCS pipeline build upon the underlying concepts of IIC [36], which leverages some information theory notions described in this subsection.

Given a discrete random variable X that takes values x ∈ X and has probability mass function p(x) = P(X = x), the entropy H(X) is a measure of the average uncertainty in the random variable and is defined by

H(X) = -∑_{x ∈ X} p(x) log p(x).   (1)

H(X) also represents the average number of bits required to describe the random variable X.

Given a second random variable X̃ that takes values x̃ ∈ X̃, we can also define the conditional entropy H(X | X̃), and the mutual information of X and X̃,

I(X, X̃) = H(X) - H(X | X̃).   (2)

The mutual information measures the dependence between the two random variables, and it represents the amount of information that one random variable contains about another. I(X, X̃) is symmetric, always non-negative, and is equal to zero if and only if X and X̃ are independent.

The information bottleneck principle [42,43], which is part of the information theoretic approach to clustering, suggests that clusters only need to capture relevant information. In order to filter out irrelevant information, IIC aims to learn only from paired data, i.e., from pairs of samples (x, x̃) ∈ X × X̃ taken from a joint probability distribution p(x, x̃). If, for each pair, x̃ is an artificially created copy of x, it is possible to find a mapping Φ that encodes what is common between x and x̃, while dropping all the irrelevant information. If such a mapping Φ is found, the image Y = Φ(X) becomes a compressed representation of the original space X.

To find the best candidate for Φ, one way is to make Φ(x) represent a random variable, and then maximize the predictability of sample x from sample x̃ and vice versa, that is, find a mapping Φ(x) that maximizes I(Φ(x), Φ(x̃)), the mutual information between the encoded variables, over all x ∈ X.

This idea suggests that Φ can be calculated using a deep neural network with a softmax as the output layer. For a dataset with an expected number of c clusters, c ∈ N, the output space will be Y = [0, 1]^c, where for each sample x we have that Φ(x) represents the distribution of a discrete random variable over the c clusters. The mutual information can be modified with the introduction of a hyper-parameter λ ∈ R that weighs the contribution of the entropy term in Eq (2). However, instead of maximizing the weighted mutual information, we use a numerical optimizer to minimize its opposite (mathematically, the negative weighted mutual information) during the training process of the ANN. Hence, the loss function to be minimized becomes

L = H(Φ(x) | Φ(x̃)) - λ H(Φ(x)).   (3)

In Eq (3), the entropy term H(Φ(x)) measures the amount of randomness present at the output of the network, and it is desirable for that value to be as large as possible, in order to prevent the architecture from assigning all samples to the same cluster. The conditional entropy term H(Φ(x) | Φ(x̃)) measures the amount of randomness present in the original sample x, given its correspondent x̃. This conditional entropy should be as small as possible, since the original sample x should be perfectly predictable from x̃.
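To make Eq (3) concrete, the sketch below estimates the loss from a batch of paired network outputs by marginalizing over the batch, as in IIC [36]. It assumes PyTorch and is an illustration rather than the exact DeLUCS implementation; p and p_mimic stand for the softmax outputs Φ(x) and Φ(x̃) of a batch of FCGR pairs, and lam corresponds to the hyper-parameter λ.

```python
import torch

def iic_loss(p: torch.Tensor, p_mimic: torch.Tensor,
             lam: float = 2.5, eps: float = 1e-10) -> torch.Tensor:
    """Negative weighted mutual information, as in Eq (3):
    L = H(Phi(x) | Phi(x~)) - lam * H(Phi(x)).

    p, p_mimic: (batch, c) softmax outputs for originals and mimics.
    """
    # Empirical joint distribution over cluster pairs, marginalized
    # over the batch, then symmetrized (as in IIC).
    joint = p.t() @ p_mimic / p.shape[0]           # (c, c)
    joint = ((joint + joint.t()) / 2).clamp(min=eps)
    marginal = joint.sum(dim=1)                    # P(Phi(x) = c_i)
    entropy = -(marginal * marginal.log()).sum()   # H(Phi(x))
    # H(Phi(x) | Phi(x~)) = -sum_ij P_ij * log(P_ij / P_j)
    cond_entropy = -(joint * (joint.log()
                              - marginal.log().unsqueeze(0))).sum()
    return cond_entropy - lam * entropy
```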
Generation of Mimic Sequences

The success of the method described in the previous section is fundamentally dependent on the way x̃ is artificially generated from x. In the particular case of our application, where the samples x are DNA sequences, we refer to the artificially created x̃ as mimic sequences (sometimes called simply mimics). In this context, the generation of mimic sequences poses the additional challenge that they should be sufficiently similar to the originals so as not to be assigned to a different cluster.

Given a set X = {x_1, ..., x_n} of n DNA sequences, we construct the set of pairs

{(x_i, x̃_i^j) : 1 ≤ i ≤ n, 1 ≤ j ≤ m},

where m ≥ 3 is a parameter representing the number of mimic sequences generated for each original sequence x_i, 1 ≤ i ≤ n. We use a simple probabilistic model based on DNA substitutions (transitions and transversions) to generate the mimics […]; the number of mimics per sequence can be increased if the number of available sequences per cluster is insufficient to obtain a high classification accuracy.

The rationale behind using transition and transversion probabilities to generate sequence mimics is biologically inspired. That being said, we use this method only as a mathematical tool, without attributing any biological significance, to create minimally different sequences through randomly distributed base substitutions. In this paper we use probabilities p_ts = 10^-4 and p_tv = 0.5 × 10^-4, assessed empirically to result in the best classification accuracies. Although the mutation rates used are biologically inspired, they are not biologically precise, given that mutation rates vary regionally, with species [44,45], and with the estimation method [46]. Lastly, in practice, with no taxonomic label, it is impossible to select species-specific mutation rates.
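The following sketch illustrates one way to implement this substitution model in Python. The transition/transversion probabilities are those stated above; the exact sampling scheme (independent per-position decisions, uniform choice between the two possible transversions) is an assumption of this sketch.

```python
import random

TRANSITION = {'A': 'G', 'G': 'A', 'C': 'T', 'T': 'C'}
TRANSVERSIONS = {'A': 'CT', 'G': 'CT', 'C': 'AG', 'T': 'AG'}

def make_mimic(seq: str, p_ts: float = 1e-4, p_tv: float = 0.5e-4) -> str:
    """Create one mimic by randomly distributed base substitutions: each
    position undergoes a transition with probability p_ts, a transversion
    (one of two possibilities, chosen uniformly) with probability p_tv,
    and is left unchanged otherwise."""
    out = []
    for base in seq:
        r = random.random()
        if base in TRANSITION and r < p_ts:
            out.append(TRANSITION[base])
        elif base in TRANSVERSIONS and r < p_ts + p_tv:
            out.append(random.choice(TRANSVERSIONS[base]))
        else:
            out.append(base)  # unchanged (also covers ambiguous bases)
    return ''.join(out)

def make_pairs(sequences, m=3):
    """Pair each original sequence with m mimics (m >= 3 in DeLUCS)."""
    return [(x, make_mimic(x)) for x in sequences for _ in range(m)]
```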
Artificial Neural Network (ANN) Architecture

[…] The output layer of the network is a softmax over the c clusters, where each output unit represents the probability that an input sequence x belongs to a particular cluster c_j. Note that this general architecture was designed so as to be successful for the clustering of all the diverse datasets presented in this study. However, the main pipeline of DeLUCS allows it to be used also with other architectures, including architectures that make use of the two-dimensional nature of the FCGR patterns and are performant for specific types of genomic data (e.g., Convolutional Neural Networks).

Evaluation

Clustering results can be evaluated using both internal and external validation measures. Internal validation methods [47,48] evaluate clustering algorithms by measuring some of the discovered clusters' internal properties (e.g., separation, compactness), while external validation methods [49-51] evaluate clustering algorithms by measuring the similarity of the discovered clustering with the ground truth. We note that many genomic datasets are sparse, incomplete, and subject to sampling bias (more than 86% of existing species on Earth and 91% of species in the oceans have not yet been classified and catalogued [52]). Thus, when taxonomic ground truth is available for an external validation, agreement between discovered clusters and real taxonomic groups is preferable to, and more informative than, internal validation methods.

We include performance comparison results obtained using (unsupervised) classification accuracy (ACC), as ACC uses the optimal mapping between the discovered clusters and the ground-truth clusters, and has been used extensively in recent deep unsupervised learning studies [53]. Comparison results obtained using two other external evaluation methods, normalized mutual information of the partitions (NMI) and adjusted Rand index (ARI), lead to similar conclusions, and can be found in S6 Appendix: Using NMI and ARI to compare DeLUCS with K-means++ and GMM.

In calculating the classification accuracy ACC, we follow the standard protocol that uses the confusion matrix as the cost matrix for the Hungarian algorithm [35,36], to find the optimal mapping f that assigns to each cluster label c_j (1 ≤ c_j ≤ c) found by DeLUCS a taxonomic label f(c_j). We then use this optimal assignment to calculate the classification accuracy of the unsupervised clustering method, defined as

ACC = |{i : l_i = f(c_i), 1 ≤ i ≤ n}| / n,

where n is the total number of sequences and, for each DNA sequence x_i, 1 ≤ i ≤ n, l_i is its true taxonomic label, c_i is the cluster label found by the algorithm, and f(c_i) is the taxonomic label assigned to that cluster by the optimal mapping f.

The two other unsupervised clustering methods we compare against, K-means++ and the Gaussian Mixture Model (GMM), are classic unsupervised learning frameworks, and have been previously used with DNA sequence datasets [8,10,13,14]. Note that the Hungarian algorithm is used to find the mapping that maximizes ACC for all the unsupervised clustering methods considered. In all three cases, the use of the true taxonomic labels is for post hoc evaluation purposes only, as true labels are never used during the training process. Lastly, we compare these three unsupervised clustering methods to a supervised machine learning classifier with the same ANN architecture, trained using the true taxonomic labels.
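The ACC computation described above can be sketched as follows, using the Hungarian algorithm as implemented by scipy.optimize.linear_sum_assignment; the integer label encoding (0 to c-1) is an assumption of this sketch.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def unsupervised_accuracy(true_labels: np.ndarray,
                          cluster_labels: np.ndarray) -> float:
    """ACC: find the cluster-to-taxon mapping f that maximizes agreement
    (Hungarian algorithm on the confusion matrix), then score it."""
    c = int(max(true_labels.max(), cluster_labels.max())) + 1
    confusion = np.zeros((c, c), dtype=np.int64)
    for t, p in zip(true_labels, cluster_labels):
        confusion[p, t] += 1
    # Maximizing matches == minimizing the negated confusion matrix.
    rows, cols = linear_sum_assignment(-confusion)
    f = dict(zip(rows, cols))  # cluster label -> taxonomic label
    hits = sum(f[p] == t for t, p in zip(true_labels, cluster_labels))
    return hits / len(true_labels)
```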
Implementation

The networks are initialized using the Kaiming method [54], to avoid exponential reduction or magnification of the input magnitudes. This is crucial for our method because a poor initialization may lead to degenerate solutions, as one of the terms in the loss function becomes dominant. We use the Adam optimizer [55], with a learning rate of 5 × 10^-5, and the networks were trained for 150 epochs with no early stopping conditions. Another vital consideration during training is the selection of the batch size (empirically determined to be 512), because the marginalization that is performed to find the distribution of the output is done over each batch of pairs. If the batch size is not large enough to represent the real distribution of the data, the entropy term in the loss function becomes dominant, leading to sub-optimal solutions. Lastly, we fix the value of the hyperparameter λ to 2.5 (in Eq (3)).

DeLUCS is fully implemented in Python 3.7, and the source code is publicly available in the GitHub repository https://github.com/pmillana/DeLUCS. Users may reproduce the results obtained in this paper, or use their own datasets for the purpose of clustering new sequences (see S1 Appendix: Instructions for reproduction of the tests using DeLUCS). All of the tests were performed on one of the nodes of the Cedar compute cluster […].
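Putting the stated hyper-parameters together, a training loop along these lines can be sketched as follows (assuming PyTorch, and reusing the iic_loss sketch above). The layer sizes of the network body are hypothetical placeholders, since the exact DeLUCS architecture is not reproduced here; the magnitude of the Gaussian noise added every 50 epochs (discussed in the Results section) is likewise an assumption.

```python
import torch
import torch.nn as nn

def build_net(input_dim: int = 64 * 64, n_clusters: int = 5) -> nn.Sequential:
    # Hypothetical MLP body; only the softmax output over the c clusters
    # and the Kaiming initialization are taken from the text.
    net = nn.Sequential(
        nn.Flatten(),
        nn.Linear(input_dim, 512), nn.ReLU(),
        nn.Linear(512, 64), nn.ReLU(),
        nn.Linear(64, n_clusters), nn.Softmax(dim=1),
    )
    for layer in net:
        if isinstance(layer, nn.Linear):
            nn.init.kaiming_normal_(layer.weight)  # Kaiming initialization [54]
            nn.init.zeros_(layer.bias)
    return net

def train(net, loader, epochs: int = 150, lr: float = 5e-5):
    """Adam, lr = 5e-5, 150 epochs, no early stopping; the loader yields
    batches (512 pairs) of original/mimic FCGRs."""
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for epoch in range(epochs):
        for fcgr_batch, mimic_batch in loader:
            loss = iic_loss(net(fcgr_batch), net(mimic_batch), lam=2.5)
            opt.zero_grad()
            loss.backward()
            opt.step()
        if (epoch + 1) % 50 == 0:  # perturb parameters to escape suboptima
            with torch.no_grad():
                for param in net.parameters():
                    param.add_(1e-4 * torch.randn_like(param))  # scale assumed
```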
Datasets

[…] DeLUCS reaches optimal performance (in terms of classification accuracy) when a dataset consists of balanced clusters, that is, clusters that are similar in size, and when each cluster has more than a minimum number of sequences (herein 20 sequences). These two requirements, together with the number of sequences per cluster available on NCBI, were used to determine the minimum and maximum cluster size for each test. After determining the minimum cluster size for a test (the size of the smallest cluster larger than 20), the clusters that were smaller than this test minimum were discarded. In addition, sequences that belonged to the parent taxon but lacked a sub-taxon identifier (cluster label) were excluded (see S4 Appendix: Detailed description of the mtDNA datasets, for details). Some clusters were larger than the test maximum (a number bigger than, but not more than 20% bigger than, the test minimum, to ensure cluster balance). For such clusters, sequences were randomly selected until the test maximum was reached. Note that, since the minimum and maximum cluster size are test dependent, the number of sequences in a cluster is also test dependent (for example, Test 1 uses 500 Actinopterygii sequences, while Test 2 uses only 113). Moreover, for similar reasons, the number of sequences per cluster is usually smaller than the total number of sequences with that cluster label that are available on NCBI.

[Figure caption] The choice of computational Tests 1 to 6 (datasets within the sub-phylum Vertebrata) follows a decision-tree approach whereby, at each taxonomic level, one particular cluster is selected for further in-depth analysis. Note that, since the minimum and maximum cluster sizes are different for each test, the number of sequences selected from a given taxon can differ from one test to another. For example, in Test 3 only 250 out of the total 2,723 available Ostariophysi sequences were selected, due to the need to achieve cluster balance, while the min/max cluster size parameters of Test 4 allowed for 383 Ostariophysi sequences to be used. The remaining Ostariophysi sequences could not be selected in either test, since most of them belonged to the over-represented order Cypriniformes (2,171 available sequences). Tests 1 and 2 illustrate a different scenario: 500 of the total 7,876 available Actinopterygii sequences were used in Test 1, since this cluster size was sufficient to achieve high accuracy, while in Test 2 only 113 Actinopterygii sequences could be used, due to the under-representation of class Polypteriformes (33 available sequences) and the over-representation of class Neopterygii (7,715 available sequences).

Bacterial genomes (Table 2): To construct a balanced dataset that captured as much diversity as possible, we considered all of the available species per family, according to GTDB (release 95) [56], and first excluded those species for which none of the sequences had a contig of the minimum length (herein, 150 kbp). For these computational tests, the cluster size was selected to be 400 sequences per cluster. This led to the following cases and corresponding experiment design choices:

1. The number of available species in a family is larger than 400 (Rhodobacteraceae and Burkholderiaceae). We selected 400 species at random and, for each of the selected species, we randomly selected one genome. Then we selected a random segment of at most 500 kbp from each genome. If there was only one genome available for a particular species, and the length of its largest contig was between 150 kbp and 500 kbp, the entire contig was selected. The selection strategy was designed to include as many families as possible in the final dataset.

[…] The selection strategies for the remaining cases ensured that all families were represented, and that the shortest segments were selected last.

A description of the final composition of each cluster is presented in Table 2. In addition to the inter-phylum classification of bacterial sequences into families (Test 7), we assessed the performance of DeLUCS for an intra-phylum classification into families, within the Proteobacteria phylum only (Test 8). The dataset for Test 8 was simply the subset of the dataset in Test 7 that included only the segments from genomes in bacterial families from phylum Proteobacteria.

Viral genomes (Table 3): The third group of datasets consists of three different sets of viral genome sequences. The minimum and maximum cluster sizes were determined in a manner similar to that of the mitochondrial DNA datasets, i.e., for all the tests, the minimum cluster size was fixed to the size of the smallest cluster available, and the maximum cluster size was fixed manually to a number exceeding the minimum cluster size by no more than 20%, to ensure cluster balance.

The dataset for Test 9 comprises influenza virus sequences of the gene coding for the neuraminidase protein (average length 1,409 bp). The sequences were downloaded from NCBI [57], as belonging to subtypes H1N1, H2N2, H5N1, H7N3, and H7N9, as per [11]. The dataset of Test 10 comprises 1,633 Dengue virus full genome sequences downloaded from NCBI [58], comprising four different subtypes. Finally, the dataset for Test 11 comprises 1,562 Hepatitis B virus (HBV) full genome sequences, downloaded from the Hepatitis Virus Database [59], comprising six different subtypes. A description of the composition of each dataset is presented in Table 3.
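The cluster-balancing rule described above can be sketched as follows (one reading of it; note that the test maximum was in fact fixed manually for each test, whereas this sketch pins it at exactly 20% above the test minimum):

```python
import random
from typing import Dict, List

def balance_clusters(clusters: Dict[str, List[str]],
                     min_required: int = 20,
                     balance: float = 0.20) -> Dict[str, List[str]]:
    """Discard clusters below the test minimum (the smallest cluster with
    more than `min_required` sequences) and randomly subsample clusters
    above the test maximum (at most `balance` = 20% above the minimum)."""
    sizes = [len(s) for s in clusters.values() if len(s) > min_required]
    if not sizes:
        return {}
    test_min = min(sizes)
    test_max = int(test_min * (1 + balance))
    balanced = {}
    for label, seqs in clusters.items():
        if len(seqs) < test_min:
            continue  # under-represented cluster: discarded
        balanced[label] = (random.sample(seqs, test_max)
                           if len(seqs) > test_max else list(seqs))
    return balanced
```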
Results

We first used a qualitative measure to assess DeLUCS's ability to group DNA sequences into meaningful clusters within the previously described datasets. Fig 5 illustrates how, during the training stage, the ANN discovers meaningful clusters as the learning process evolves. The progress of DeLUCS is demonstrated in terms of the number of epochs (one epoch means a single pass of the ANN through the entire dataset, trying to discover new patterns). In Fig 5, each point represents a sequence, and its position is determined by the output probability vector of the ANN (see Artificial Neural Network (ANN) Architecture), [36]. When the learning process starts, all the sequences are located at the center of the polygon, which means that each sequence is equally likely to be assigned any of the five cluster labels. As the learning process continues, the network starts assigning the sequences to clusters (with similar sequences being grouped together closer to their respective vertex/cluster), with higher and higher probability. Note that if two sequences are assigned the same probability vectors, their corresponding points in Fig 5 will overlap.

Fig 5. Learning process for the clustering of 2,500 vertebrate mtDNA full genomes into c = 5 different clusters, each corresponding to one corner of the pentagon. Each point represents a sequence, and its position indicates the probability that it is assigned to different classes. Hence, a point in the center has equal probabilities of being assigned to all 5 vertices/clusters, while a point located at a vertex has probability 1 of belonging to that vertex/cluster. With successive epochs, the learning progresses until, at epoch 149, all sequences are correctly placed in their respective vertex/cluster. Note that points in the figure overlap if they have the same probability vector.

For a quantitative assessment of DeLUCS, we calculated its "classification accuracy" as defined in the Evaluation section. For the vertebrate mtDNA dataset, Table 4 shows that, at each taxonomic level, down to the genus level, our unsupervised deep learning algorithm outperforms the other two unsupervised clustering methods. Surprisingly, DeLUCS also outperforms a supervised learning algorithm that uses the same architecture in some tests (e.g., in Test 2, Class to Subclass), sometimes by a significant margin (e.g., in Test 6, Family to Genus, by 11%). Note that, in order to obtain a reasonable classification accuracy, the number m of mimics had to be increased for the tests/datasets with fewer than 150 sequences per cluster. (In Table 4, for unsupervised learning, reported accuracy values are the average over 10 runs of the algorithm; for supervised learning, the accuracy is that of classifying the test set.)

To test the ability of DeLUCS to cluster bacterial genome segments, and to achieve a direct accuracy comparison with other unsupervised learning methods, we first attempted clustering a dataset comprising genome segments, averaging 400,000 bp in length, from the eight different bacterial families considered in [11] (Table 5, Test 7). The resulting classification accuracy of 77% is 19% to 21% higher than the accuracies of the other two unsupervised learning methods (GMM 58%, and K-means++ 56%). The relatively low classification accuracies for all three unsupervised methods may be due in part to this dataset having a very heterogeneous evolutionary composition, with recent changes in bacterial taxonomy under which […] and Desulfovibrionaceae (formerly all Proteobacteria) are now split into two different phyla. The heterogeneity of the dataset makes clustering a challenging task, as the algorithm attempts to determine cluster labels simultaneously for both closely related families and distantly related families. The patterns observed in the confusion matrix (see S3 Appendix: Confusion matrices) support the hypothesis of heterogeneity in genetic distance between members of the dataset. Indeed, misclassifications between phyla are a minority, and most of the misclassifications occur among families that were previously placed within the same phylum, but are now placed in different phyla.

To verify that the lower classification accuracy could be partially caused by the heterogeneity of the dataset, we next considered an intra-phylum clustering of a subset of this dataset, comprising only sequences belonging to phylum Proteobacteria (Table 5, Test 8). As predicted, the classification accuracy of DeLUCS shows a significant increase, from 77% to 90%, now outperforming the other two unsupervised learning methods by 40% and 48%, respectively. The majority of the misclassified genome segments in Test 8 belong to the family Desulfovibrionaceae. This may be partly due to this family having the shortest genome segments: the average Desulfovibrionaceae genome segment length is 359,337 bp, significantly shorter than the average genome segment length in Test 7, which is 433,613 bp.

Note that, in both Tests 7 and 8, we used the default value m = 3 for the number of mimic sequences. This is because the number of available sequences (genome segments of the required size) per cluster was large, and the classification accuracy did not increase by increasing the number of mimic sequences.

Table 5. Classification accuracy for the bacterial datasets in Table 2.

[…] The results in Table 6 show that all machine learning methods, supervised or unsupervised, perform well, in spite of the fact that the viral sequences are very similar. In Tests 9, 10, and 11, the default value m = 3 for the number of mimic sequences was used, since this was sufficient to obtain near perfect classification accuracy.

Table 6. Classification accuracy for the viral sequence datasets.

For the datasets in this study, it was empirically determined that, if the cluster sizes are similar, and each cluster has more than 150 sequences, then the default value of m = 3 results in optimal performance (as is the case in Tests 1, 3, 7, 8, 9, 10 and 11).
For the datasets with less than 150 sequences per cluster, we tested the hypothesis that the classification accuracy could be improved by generating more artificial mimic sequences per original DNA sequence. Our observations confirmed this hypothesis in Tests 2, 4, 5, and 6, where increasing the number of mimics to m = 8 resulted in improved classification accuracies. Note that, for the datasets with more than 150 DNA sequences per cluster (e.g., Tests 1, 3, 7, 8), increasing the number of mimics did not always result in increased classification accuracy. Fig 6 illustrates the effect of increasing the number of mimic sequences on classification accuracy for four different tests (Tests 2, 4 and 6, with fewer than 150 sequences/cluster, and Test 3, with more than 150 sequences/cluster). Note also that Tests 9, 10, 11 had more than 150 sequences per cluster, as well as near perfect classification accuracies, and thus no increase in the number of mimic sequences was explored.

In general, the optimal number of mimics per sequence may depend on the number of available sequences per cluster, as well as on other particulars of the dataset being analyzed. […]

We observe an inverse correlation between the classification accuracy and the unsupervised loss function for each epoch (Fig 7, top). The same graph illustrates that the ANN sometimes tends to converge to a suboptimal solution, and this is because the mutual information might still be high for suboptimal solutions where relatively many related sequences are similarly misplaced (i.e., assigned to the wrong cluster, while still being close to each other within their subgroup). To prevent the model from converging to suboptimal solutions, we added Gaussian noise to the network parameters every 50 epochs. We confirmed empirically that the introduction of noise is beneficial, as the accuracy increases after the introduction of noise every 50 epochs (Fig 7, bottom).

Batches of training data are randomly selected in each epoch. The random selection of batches is beneficial for the marginalization that is performed to calculate the loss function. However, we also observed a high variance in the outcomes of training several independent copies of the ANN over the same dataset, likely due to the aforementioned randomness. To overcome this challenge, for each dataset, we independently trained ten ANNs, and integrated the obtained clustering results using a majority voting scheme. Fig 8 compares the classification accuracy of a single ANN with that of an ensemble of ten ANNs whose outcomes are combined by majority voting (Tests 1-6).

Fig 8. Comparison between training a single ANN (yellow), versus training an ensemble of 10 ANNs (blue) and using majority voting to decide the final cluster label for each sequence (Tests 1-6). All tests were repeated 10 times. The minimum, first quartile (Q1), median, third quartile (Q3), and maximum of the distributions are shown. The diamonds outside the boxes represent outliers. Note that the result variance for single ANNs is larger than the result variance for the corresponding ensembles of 10 ANNs. In addition, the classification accuracy is higher for the majority voting, in all cases.
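Because each independently trained ANN labels the clusters in an arbitrary order, the ten outcomes must be aligned before votes can be counted. One way to do this (an assumption of this sketch; the alignment step of the voting scheme is not detailed here) is to map each run's labels onto those of a reference run with the Hungarian algorithm, then take a per-sequence majority vote:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from typing import List

def majority_vote(runs: List[np.ndarray], n_clusters: int) -> np.ndarray:
    """Align each run's arbitrary cluster labels to the first run's labels
    (Hungarian algorithm on the label co-occurrence matrix), then return
    the per-sequence majority vote across all runs."""
    reference = runs[0]
    aligned = [reference]
    for labels in runs[1:]:
        co = np.zeros((n_clusters, n_clusters), dtype=np.int64)
        for a, b in zip(labels, reference):
            co[a, b] += 1
        rows, cols = linear_sum_assignment(-co)  # maximize co-occurrence
        remap = dict(zip(rows, cols))
        aligned.append(np.array([remap[a] for a in labels]))
    stacked = np.stack(aligned)                  # (n_runs, n_sequences)
    return np.apply_along_axis(
        lambda votes: np.bincount(votes, minlength=n_clusters).argmax(),
        0, stacked)
```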
Discussion

DeLUCS is able to accurately cluster large datasets of unlabelled raw DNA sequences. The largest computational experiment in this paper comprises 3,200 randomly selected bacterial genome segments, totalling more than 1 billion bp, a dataset which is a full order of magnitude larger than those of previous studies clustering genomic data [8-14].

In addition, DeLUCS achieves significantly higher classification accuracies compared to other unsupervised machine learning clustering methods (K-means++ and GMM), in comparable time. The running time of the whole pipeline of DeLUCS, considering both training and testing time, is less than 40 minutes for the largest bacterial dataset. Note that for datasets comprising sequences with minimal homology (Tests 1-6, and 10-11), using alignment-based methods would be near impossible, while for datasets […]

This study serves as a proof of concept of the ability of unsupervised deep learning to effectively cluster unlabelled raw DNA sequences. DeLUCS is a fully-automated method that could be used for successful clustering of datasets where traditional methods are not applicable, such as: large datasets of dissimilar and variable-length raw DNA sequences, DNA sequences with incomplete or missing biological annotations, and DNA sequences for which taxonomic information or other sequence identifiers are insufficient or unavailable. Future work can address some of the current limitations of this approach, as described below.

First, in order to achieve accurate clustering, DeLUCS requires its training dataset to have balanced and well-represented clusters, with at least a minimum number of sequences per cluster. The minimum and maximum cluster size were determined individually for each computational test, based on the number of available sequences and the aforementioned requirements. As a result, many available sequences were not included in the training data, either because under-represented clusters were discarded, or because the excess sequences from over-represented clusters were not used. This limitation could be addressed, e.g., by adjusting the loss function to be able to deal with unbalanced training databases, that is, datasets where clusters are of different sizes.

Second, we observe that the classification accuracy of an ANN is heavily dependent on the initialization of the parameters, which is random for each run of the experiment. In other words, the classification accuracy for a dataset can vary from run to run of an ANN, sometimes by a large amount. On the other hand, one of the reasons behind DeLUCS's successful clustering capability lies in this very randomness in the parameter initialization. We attempted to address this trade-off between performance and reproducibility by training several copies of the same ANN, and using majority voting to determine the final cluster labels. The overall classification accuracy stabilized and increased as a result, with the downside being that this approach increased the running time of the training step, since ten ANNs were trained sequentially. More time-efficient solutions, such as training the ANNs in parallel, would lead to a tenfold improvement of the running time.

Third, we observed that for the datasets with fewer than 150 sequences per cluster, an increase in the number of mimics resulted in classification accuracy improvements (Tests 2, 4, 5, and 6). However, this was not the case for some of the datasets with more than 150 sequences per cluster (Tests 1, 3, 7, 8). Some other tests, e.g., Tests 9, 10, 11, resulted in near perfect accuracy from the start and needed no further optimization.
For the experiments in this paper, the number of mimic sequences per original sequence was empirically determined to be optimal with respect to the cluster size. Further exploration is needed to determine the relationship between cluster size and the number of mimics per sequence, as well as to find other mechanisms to boost classification accuracy for specific datasets, such as the use of Convolutional Neural Networks, which make full use of the two-dimensional aspect of the FCGR representation.

Conclusions

This is, to the best of our knowledge, the first effective alignment-free method that utilizes deep ANNs for unsupervised clustering of unlabelled DNA sequences. […] When comparing the clusters discovered by DeLUCS with the true taxonomic groups (via the Hungarian algorithm), we note that the resulting classification is of the highest sensitivity at the species-into-subtypes level (99% to 100%), while in all other tests the classification accuracy is double-digit higher than that of the other two unsupervised clustering methods.

Supporting Information

S6 Appendix: Using NMI and ARI to compare DeLUCS with K-means++ and GMM.

References

Applequist WL. A brief review of recent controversies in the taxonomy and nomenclature of Sambucus nigra sensu lato.
Taxonomic freedom and the role of official lists of species names.
Perspectives from a "Partnerships for Enhancing Expertise in Taxonomy" (PEET) Debate.
A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life.
GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database.
A complete domain-to-species taxonomy for Bacteria and Archaea.
The Hungarian method for the assignment problem.
An improved alignment-free model for DNA sequence similarity metric.
MeShClust: an intelligent tool for clustering DNA sequences.
Application of K-means clustering algorithm in grouping the DNA sequences of hepatitis B virus (HBV).
A new method to cluster DNA sequences using Fourier power spectrum.
Genomic signal processing for DNA sequence clustering.
GMM-based classification of genomic sequences.
An improved K-means algorithm for DNA sequence clustering.
On the complexity of multiple sequence alignment.
Alignment-free sequence comparison: benefits, applications, and tools.
Benchmarking of alignment-free sequence comparison methods.
Deep learning in biomedicine.
Deep learning.
Machine learning with digital signal processing for ultrafast, accurate, and scalable genome classification at all taxonomic levels.
DeepMicrobes: taxonomic classification for metagenomics with deep learning.
Convolutional neural networks improve fungal classification.
An open-source k-mer based machine learning tool for fast and accurate sub-typing of HIV-1 genomes.
Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study.
Classification of eukaryotic organisms through cepstral analysis of mitochondrial DNA.
Viral genome deep classifier.
Deep learning on raw DNA sequences for identifying viral genomes in human samples.
Identifying viruses from metagenomic data using deep learning.
Sequencing technologies: the next generation.
Integrative sparse K-means with overlapping group lasso in genomic applications for disease subtype discovery.
Pattern Recognition and Machine Learning (Information Science and Statistics).
Data Clustering: Algorithms and Applications. 1st ed. Chapman and Hall/CRC.
Deep clustering for unsupervised learning of visual features.
Unsupervised deep embedding for clustering analysis.
Invariant information clustering for unsupervised image classification and segmentation.
Chaos game representation of gene structure.
Genomic signature: characterization and classification of species assessed by chaos game representation of sequences.
Dinucleotide relative abundance extremes: a genomic signature.
Combining image classification and image compression using vector quantization.
Supervised learning of quantizer codebooks by information loss minimization.
The information bottleneck method.
Deep learning and the information bottleneck principle.
Strong variations of mitochondrial mutation rate across mammals: the longevity hypothesis.
Large variation in the ratio of mitochondrial to nuclear mutation rate across animals: implications for genetic diversity and the use of mitochondrial DNA as a molecular marker.
Understanding differences between phylogenetic and pedigree-derived mtDNA mutation rate: a model using families from the Azores Islands (Portugal).
Silhouettes: a graphical aid to the interpretation and validation of cluster analysis.
Understanding and enhancement of internal clustering validation measures.
Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance.
Comparing partitions.
Image clustering using local discriminant models and global integration.
Mapping the space of genomic signatures.
Deep learning-based clustering approaches for bioinformatics.
Delving deep into rectifiers: surpassing human-level performance on ImageNet classification.
Adam: a method for stochastic optimization.
A complete domain-to-species taxonomy for Bacteria and Archaea.
The influenza virus resource at the National Center for Biotechnology Information.
Virus Variation Resource: improved response to emergent viral outbreaks.
HBVdb: a knowledge database for Hepatitis B Virus.