key: cord-1000798-nw3tybw9 authors: Hoang, Tung; Yin, Changchuan; Zheng, Hui; Yu, Chenglong; Lucy He, Rong; Yau, Stephen S.-T. title: A new method to cluster DNA sequences using Fourier power spectrum date: 2015-05-07 journal: J Theor Biol DOI: 10.1016/j.jtbi.2015.02.026 sha: 53cee6ab40cabc83c6a1d71faf668e95d823c33f doc_id: 1000798 cord_uid: nw3tybw9 A novel clustering method is proposed to classify genes and genomes. For a given DNA sequence, a binary indicator sequence of each nucleotide is constructed, and Discrete Fourier Transform is applied on these four sequences to attain respective power spectra. Mathematical moments are built from these spectra, and multidimensional vectors of real numbers are constructed from these moments. Cluster analysis is then performed in order to determine the evolutionary relationship between DNA sequences. The novelty of this method is that sequences with different lengths can be compared easily via the use of power spectra and moments. Experimental results on various datasets show that the proposed method provides an efficient tool to classify genes and genomes. It not only gives comparable results but also is remarkably faster than other multiple sequence alignment and alignment-free methods. In the last few decades, several methods to classify genes and proteins have been proposed. Most of these methods are alignmentbased in which optimal alignments are obtained by using selected scoring systems. These methods provide accurate classification of biological sequences, and several algorithms have been developed and successfully applied (Katoh et al., 2002; Edgar, 2004; Larkin et al., 2007) . Nevertheless, their major drawback is due to significantly high time and memory consumption which is not suitable when a quick clustering needs to be made, for example on a new deadly virus (Marra et al., 2003) . Henceforth, an alignment-free technique is a trending method that often gives much faster classification on the same dataset (Vinga and Almeida, 2003; Yau et al., 2008; Yu et al., 2011 Yu et al., , 2013 . For example, the k-mer method is among the most popular alignment-free methods. In order to measure how different the two sequences are, the set of k-mers, or subsequences of length k, in the two biological sequences are collected and then the evolutionary distance between them is computed (Vinga and Almeida, 2003; Pandit and Sinha, 2010) . The k-mer method gives comparable results to alignment-based methods while being computationally faster (Blaisdell, 1989) . Discrete Fourier Transform (DFT) is a powerful tool in signal and image processing. During recent years, DFT has been increasingly used in DNA research, such as gene prediction, protein coding region, genomic signature, hierarchical clustering, periodicity analysis (Tiwari et al., 1997; Anastassiou, 2000; Kotlar and Lavner, 2003; Vaidyanathan and Yoon, 2004; Afreixo et al., 2004 Afreixo et al., , 2009 Tenreiro Machado et al., 2011) . A DFT power spectrum of a DNA sequence reflects the nucleotide distribution and periodic Contents lists available at ScienceDirect journal homepage: www.elsevier.com/locate/yjtbi patterns of that sequence, and it has been applied to identify protein coding regions in genomic sequences (Fukushima et al., 2002; Yau, 2005, 2007) . In this paper we provide a new alignment-free method to classify DNA sequences based on the DFT power spectrum. The method is tested and compared to other state-of-the-art methods on various datasets for speed and accuracy. In signal processing, sequences in time domain are commonly transformed into frequency domain to make some important features visible. Via that transformation, no information is lost but some hidden properties could be revealed (Oppenheim et al., 1989) . One of the most common transformations is Discrete Fourier Transform (Oppenheim et al., 1989) . For a signal of length N; f ðnÞ; n ¼ 0; …; N À 1, the DFT of the signal at frequency k is for k ¼ 0; …; N À 1. The DFT power spectrum of a signal at frequency k is defined as Notice that by definition, PSð0Þ ¼ j The DFT is often used to find the frequency components of a signal buried in a noisy time domain. For example, let y be a signal containing a 60 Hz sinusoid of amplitude 0.8 and a 140 Hz sinusoid of amplitude 1. This signal can be corrupted by a zeromean random noise: The frequencies can hardly be identified by looking at the original signal as in Fig. 1(a) , but can be seen quite clearly when the signal is transformed to frequency domain by taking the DFT ( Fig. 1(b) ). For a DNA sequence composed of nucleotides adenine (A), cytosine (C), guanine (G), and thymine (T), one typical way to get numerical representation is to use binary indicator sequences. The values of these sequences are either 0 or 1 indicating the absence or presence of a specific nucleotide. Specifically, for a given DNA sequence of length N, we define u A of the same length as follows: For example, for the sequence AGTCTTACGA, the corresponding indicator sequence of nucleotide A is u A ¼1000001001. The The corresponding power spectrum for nucleotides C; G; T is defined similarly. In general, for a gene sequence of length N, let N A ; N C ; N G ; N T be the number of nucleotide A; C; G; T in that sequence, respectively. It is difficult to compare numerical sequences with different lengths, so we cannot cluster genes and genomes based on their power spectra sequences. One common approach to get over this problem is to use mathematical moments, e.g. for nucleotide A defines jth moment M A be scaling factors. We want higher moments to converge to zero, i. e. essential information is kept in the first few moments. Thus, the chosen normalization factors α j A must reflect the nature of the sequences. Let us examine the binary indicator sequence of one nucleotide, A, in more detail. By Parseval's theorem (Oppenheim et al., 1989) , The left side is actually N A , i.e. the number of 1 in the A binary sequence. Hence, So it is reasonable for α j A to be a power of N AN . As stated above, we want moments converge to zero gradually so that information loss is minimal, thus α A j ¼ 1=ðN A NÞ j À 1 is the best choice (which will be verified later). Therefore Our experimental results on various datasets proved that this is a good normalization. However, by re-examining the formula, we find that a slight modification can be made to get better outcomes. From Section 2.1, might hold large weight compared to PS A (k) for other index k, which in turn leads to unnecessary memory consumption and computations for higher moments. Therefore, the terms PS A ð0Þ are removed from the moments, and M 1 A becomes From the first modified moment, we know how to adjust the scaling factor for the j-th moment in general, i.e. α j A must be a power of N A ðN ÀN A Þ. Thus the new normalization is The fact that higher moments tend to zero is verified as follows: This fact also shows that α A j ¼ 1=ðN A ðN À N A ÞÞ j À 1 is the best scaling factor, as for α A j ¼ 1=ðN A ðN À N A ÞÞ j , M A 1 ¼ 1 so moments tend to zero very fast, thus much information can be lost, and for À N A Þ 2 so moments tend to zero much slower, thus more computational time and memory storage are needed. Additionally, due to symmetric property of DFT coefficients (Oppenheim et al., 1989) , we only have to consider the first half of power spectrum. Therefore, the moments are improved as follows: The moments for other nucleotides C; G; T are given similarly. Then the first few moments are used to construct vectors in Euclidean space. Our experimental results show that three moments are sufficient for an accurate clustering. Thus, each gene or genome sequence can be realized as a geometric point in the 12dimensional Euclidean space, i.e. ðM A Pairwise Euclidean distances between these points are calculated to cluster the gene or genome sequences. We call this Power Spectrum Moments method, or PS-M method. Power spectrum has been rarely used to study the phylogenetic analysis of DNA sequences because it is difficult to do comparison on sequences with different lengths. Zhao et al. (2011) came up with the idea of using normalized and centralized moments to compare sequences of different lengths. Motivated by that idea, we discovered a way to scale moments naturally, and only normalized moments are used to construct the Euclidean vectors. Discarding the first coefficient is another novelty of our PS-M method. The first coefficient holds significant weight and is highly dependent on the number of the respective nucleotide, which is redundant information. We also exclude the second half of the power spectrum due to its symmetry, and we give proof as to why the moment vectors converge to zero. Experimental results in the following section show that 12-dimensional moment vectors are enough to cluster genomes correctly. The PS-M method is tested on different datasets that range from small to medium size, as well as short to long genomes. In order to compare and analyze various genomic data, moment vectors are calculated and a matrix of Euclidean pairwise distances between those vectors is constructed. To cluster data into biological groups, a phylogenetic tree is built based on the distance matrix using the UPGMA method (Sokal, 1958) . The running time of our PS-M method is compared to three state-of-the-art methods. The first is the alignment-based method ClustalW (Larkin et al., 2007) implemented on MEGA 6 (Tamura (Katoh et al., 2002) , another alignment-based method using fast Fourier transform. The last method is the alignment-free k-mer method (Vinga and Almeida, 2003) which is implemented on Matlab by Vinga and Almeida (2003) . The running time is recorded in Table 1 . Even though reasonably accurate, ClustalW is significantly time consuming, as it aims to achieve the best possible results neglecting speed and efficiency. MAFFT runs much faster than ClustalW, but it sacrifices some accuracy in exchange. Moreover, errors occurred when we tested the datasets showing limitations of the MAFFT method. The errors are discussed in detail in the next section. Meanwhile, both k-mer and PS-M methods are alignmentfree, and both attempt to improve speed and efficiency with little sacrifice of accuracy. Thus, our phylogeny results are directly compared to the k-mer method in this study. Phylogenies of the ClustalW method are included in the supplementary materials for reference ( Figure S1-S4) . Phylogenies of our method and the k-mer method are drawn using Matlab and MEGA 6 respectively (Tamura et al., 2013) . Computations in this research are implemented on a PC with configuration of Intel Core i7 CPU 2.40 GHz and 8 GB RAM. The mitochondrial genome is not highly conserved and has a rapid mutation rate, thus it is suitable for examining the mode and tempo of molecular evolution (Brown et al., 1982) . The PS-M method was tested on a mitochondrial DNA dataset of 31 mammalian genome sequences from GenBank, each sequence has a length range from 16,300 to 17,500 nucleotides. This dataset was analyzed by Deng et al. (2011) using the natural vector method. In our method, these 31 genomes are clustered correctly into 7 groups: Primates, Cetacea and Artiodactyla, Perissodactyla, Rodentia, Lagomorpha, Carnivore, and Erinaceomorpha (Fig. 2) , which is consistent with the work of Deng et al. (2011) . Meanwhile, as Fig. 3 illustrates, a majority part of Carnivore is misplaced by the k-mer method. We test the efficiency of the PS-M method at a gene level. Influenza A viruses have been a major health threat to both human society and animals (Alexander, 2000) . Influenza A viruses are single-stranded, segmented RNA viruses, which are classified based on the viral surface proteins hemagglutinin (HA or H) and neuraminidase (NA or N) (Webster et al., 1992) . Eighteen H serotypes (H1 to H18) and eleven N (N1 to N11) serotypes of Influenza A viruses have been identified. Influenza A viruses are the most dangerous due to their wide natural host range, including birds, horses, swines, and humans; and they are known to have high degree of genetic and antigenic variability (Palese and Young, 1982; Garten et al., 2009) . Influenza A viruses have caused many pandemic flues, some of the most lethal subtypes are H1N1, H2N2, H5N1, H7N3, and H7N9. These subtypes will be chosen to test the efficiency of our method. Specifically, we will examine segment 6 of Influenza A virus genome, which encodes for neuraminidase (NA). As illustrated by the phylogenetic trees, the virus A/turkey/VA/505477-18/2007(H5N1) is not correctly clustered in the k-mer method (Fig. 5) . On the other hand, the dataset is classified correctly into biological groups by our proposed PS-M method (Fig. 4) . The efficiency of PS-M method was examined on a large size dataset, Human Rhinoviruses (HRV). HRV are associated with upper and lower respiratory diseases, particularly in patients with asthma. They are the predominant cause of the common cold and are responsible for more than one-half of cold-like illnesses. Past works have classified HRV into three genetically distinct groups within the genus Enterovirus and the family Picornaviridae (Palmenberg et al., 2009; Deng et al., 2011) . In their paper, Palmenberg et al. (2009) clustered the complete HRV genomes, consisting of three groups HRV-A, HRV-B, HRV-C of 113 genomes and three outgroup sequences HEV-C. While the genomes were well classified, the running time was quite high due to the use of multiple sequence alignment to construct the evolutionary tree. By our method, the phylogenetic tree is reconstructed and the three groups of HRV are clearly separated and are distinguished from the outgroup HEV-C (Fig. 6) . On the other hand, a small part of HRV-A is grouped on the same clade with HRV-B in the k-mer method (Fig. 7) . Coronaviruses, a genus of the Coronaviridae family, can cause a variety of severe diseases in respiratory and gastrointestinal tract (HCoV-229E and HCoV-OC43), or even life-threatening pneumonia (severe acute respiratory syndrome, or SARS). Thus, identification and classification of coronaviruses, especially human coronaviruses, are important and have been extensively studied so far. We cluster the set of 30 coronaviruses and 4 outgroup non-coronavirus sequences using the PS-M method. This coronaviruses dataset has been used by various authors before, e.g. Woo et al. (2005) and Yu et al. (2010) . In van der Hoek et al. (2004) the authors put the newly discovered human coronavirus NL63 into the same group with human coronavirus 229E (group 1 in our work), which is separated from other two human coronaviruses groups, HCoV-OC43 (group 2 in our work) and SARS (group 4 in our work). In Woo et al. (2005) the authors agreed to include the newfound HCoV-HKU1 in group 2 but also claimed that it is a distinct member within the group and it is not very closely related to the rest of the group. Yu et al. (2010) noticed that HCoV-HKU1 is an individual coronavirus between SARS group 4 and the traditional group 2, thus they proposed HCoV-HKU1 belongs to a new group 5. In our phylogenetic tree below (Fig. 8) , HCoV-NL63 and HCoV-229E are clustered into group 1, which is similar to the work of van der Hoek et al. (2004) . Moreover, group 5 is close to both group 4 and group 2 but separated from them, thus the result is consistent with the work of Woo et al. (2005) Fig. 9 , the HCoV-NL63 is misplaced from group 1 by the k-mer method. Methods like multiple sequence alignment cannot handle large data. The bacteria genome, consisting of millions of base pairs, is a useful dataset for checking a method's efficiency. The PS-M method was tested on the set of 30 bacterial genomes, consisting of 8 families: Enterobacteriaceae, Bacilleceae, Rhodobacteriaceae, Spirochaetaceae, Desulfovibrionaceae, Clostridiaceae, Burkholderiaceae, and Staphylococcaceae. Most of the data has a length between 3 and 5 million base pairs. As illustrated by the phylogenetic tree in Fig. 10 , the dataset is well clustered into 8 groups by the PS-M method in about 10 min. None of the three state-of-the-art methods (ClustalW, MAFFT, k-mer) are able to cluster these genomes. EMBL-EBI (http://www.ebi.ac.uk/Tools/msa/mafft/) is among the most common online tool for MAFFT. We tested our datasets on this site but there were errors for coronavirus and mammals genomes, namely "Raw Tool Output" as it appeared on the EBI website. Since the input DNA sequences worked for all other methods, the error seems to be the tool's problem. Therefore, despite MAFFT's reasonable biological classification and fast processing time, it has some drawbacks that is likely to require some restriction on the underlying dataset. Another obvious disadvantage of this online tool is that it does not allow us to save the phylogenetic tree directly to the computer, and it does not have circular tree option for dataset with a large number of elements. Thus, we do not include the phylogeny of HRV and Influenza A generated by MAFFT in this paper due to these reasons, even though the method can reasonably classify these datasets. However, the running time of MAFFT for HRV and Influenza A is still recorded in Table 1 . The k-mer method is alignment-free and it gives fairly good classification with an acceptable running time. But its major disadvantage is that we do not know which value of k will produce the best classification. In our work, we had to try various values of k and then choose the value with the best outcome. ClustalW is by far among the best alignment-based method. It often gives the most accurate biological classification, but its running time is significantly high in exchange. As a result, it is not able to perform well on large datasets. From the performance comparison (Table 1) , we can see that our method is much faster than other methods while still providing comparable biological classification. Most notably, none of the above three methods were able to cluster large genome like bacteria, while our method could do well in about 10 min. The MATLAB code for our method can be accessed from: http://www. mathworks.com/matlabcentral/fileexchange/49026. A/American black duck/NB/2538/2007(H7N3) A/chicken/Germany/R3234/2007(H5N1) A/domestic duck/Germany/R1772/2007(H5N1) A/chicken/Korea/es/2003(H5N1) A/duck/Guangxi/030D/2009(H1N1) A/duck/Hokkaido/w73/2007(H1N1) A/pintail/Miyagi/1472/2008(H1N1) A/American black duck/NB/2538/2007(H7N3) Genome analysis with inter-nucleotide distances Spectrum and symbol distribution of nucleotide sequences A review of avian influenza in different bird species Frequency-domain analysis of biomolecular sequences Average values of a dissimilarity measure not requiring sequence alignment are twice the averages of conventional mismatch counts requiring sequence alignment for a computer-generated model system Mitochondrial dna sequences of primates: tempo and mode of evolution A novel method of characterizing genetic sequences: genome space with biological distance and applications Muscle: multiple sequence alignment with high accuracy and high throughput Periodicity in prokaryotic and eukaryotic genomes identified by power spectrum analysis Antigenic and genetic characteristics of swine-origin 2009 a (h1n1) influenza viruses circulating in humans Mafft: a novel method for rapid multiple sequence alignment based on fast fourier transform Gene prediction by spectral rotation measure: a new method for identifying protein-coding regions Clustal w and clustal x version 2.0 The genome sequence of the sars-associated coronavirus Discrete-Time Signal Processing Variation of influenza a, b, and c viruses Sequencing and analyses of all known human rhinovirus genomes reveal structure and evolution Using genomic signatures for hiv-1 sub-typing A statistical method for evaluating systematic relationships Mega6: molecular evolutionary genetics analysis version 6.0 Fractional dynamics in dna Prediction of probable genes by fourier analysis of genomic sequences The role of signal-processing concepts in genomics and proteomics Alignment-free sequence comparison-a review Evolution and ecology of influenza a viruses Characterization and complete genome sequence of a novel coronavirus coronavirus, hku1, from patients with pneumonia A protein map and its application A fourier characteristic of coding sequences: origins and a non-fourier approximation Prediction of protein coding regions by the 3-base periodicity analysis of a dna sequence DNA sequence comparison by a novel probabilistic method Real time classification of viruses in 12 dimensions A novel construction of genome space with biological geometry A novel clustering method via nucleotide-based fourier power spectrum analysis Supplementary Material S1 contains tables of accession numbers and names of the underlying datasets (Table S1 -S4) as well as the remaining phylogenetic trees generated by ClustalW ( Figure S1 -S3). Supplementary data associated with this article can be found in the online version at http://dx.doi.org/10.1016/j.jtbi.2015.02.026.