key: cord-0848576-oim1ifw0
authors: Hou, Wenbing; Pan, Qiuhui; Peng, Qianying; He, Mingfeng
title: A new method to analyze protein sequence similarity using Dynamic Time Warping
date: 2016-12-11
journal: Genomics
DOI: 10.1016/j.ygeno.2016.12.002
sha: 5d0e8c4c412a515f5ec5382be9cff5923f015cd6
doc_id: 848576
cord_uid: oim1ifw0

Sequences similarity analysis is one of the major topics in bioinformatics. It helps researchers to reveal evolution relationships of different species. In this paper, we outline a new method to analyze the similarity of proteins by Discrete Fourier Transform (DFT) and Dynamic Time Warping (DTW). The original symbol sequences are converted to numerical sequences according to their physico-chemical properties. We obtain the power spectra of sequences from DFT and extend the spectra to the same length to calculate the distance between different sequences by DTW. Our method is tested in different datasets and the results are compared with that of other software algorithms. In the comparison we find our scheme could amend some wrong classifications appear in other software. The comparison shows our approach is reasonable and effective.

With the advance of sequencing techniques, the database of DNA, RNA and protein has been enlarged rapidly, promoting the development of bioinformatics effectively. It has been increasingly important to develop efficient ways to obtain the information hidden in the gene data. In the last few decades, several methods to classify the genes have been proposed. In 1983, Hamori and Ruskin proposed a visible 3-D curve with the name of H-curve to tell the relations between different DNAs [1] . As the first graphical representation, it motivates other researchers in the following years to develop more graphical representations of DNA sequences including 2D, 3D and even multidimensional representations [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] . Besides the graphical representations, researchers try to combine some techniques from other disciplines into the study of genes and have proposed novel methods. For example, the Discrete Fourier Transform, which is broadly applied in signal process, has been introduced into the process of genes [15, 16] . It is proved effective in the analysis of DNA sequences.

Methods for similarity analysis of proteins also have been proposed recently. Considering a protein sequence consists of 20 kinds of different amino acids while a DNA sequence only consists of four bases, it is much more complex to express a protein than a DNA sequence. However, there are some methods which are generalized from the ways of analyzing the DNA sequences [17] [18] [19] [20] [21] . Yau et al. propose a method with the name of protein map [22] following their previous work. They use the moment vectors to represent proteins and generate a universal protein map [23] . Motivated by the protein map, they also develop a novel method, with the name of protein space, to realize the nature of protein universe [24] . Their method is applied successfully in their following papers and proved effective [25, 26] . He et al. present a new way of generalized Chaos Game Representation (CGR) method to outline a dynamic 3D graphical representation [27] which is analogous to the original CGR method proposed by Jeffrey for graphical representation of DNA [3] . El-Lakkani and Mahran introduce a two dimensional graphical representation of protein sequences. They propose a new mathematical descriptor in their paper to measure the similarity of two protein sequences [28] . Li et al. present a graphical representation with the name of UC-Curve [29] . The amino acids are assigned to the circumference of a unit circle with a cyclic order. Geometric center vectors of UC-Curves and Euclidean distances are extracted to analyze pairwise similarities. Moreover, techniques from other disciplines have been applied in the analysis of proteins successfully. Wąż and Bielińska-Wąż introduce the moments of inertia as new descriptors in the calculation of similarities [30, 31] . Based on their works, Czerniecka et al. propose a 20-D dynamic representation of protein sequences [32] and the scheme is proved reasonable.

In this paper, we outline a new method based on Discrete Fourier Transform (DFT) and Dynamic Time Warping (DTW) to calculate the similarities of proteins. The original symbol sequences are converted to numerical sequences according to their physico-chemical properties and the similarities are calculated based on DFT and DTW. We test our scheme with different datasets and compare our results with some existing softwares. It is demonstrated that the consequences from our test are in agreement with evolutionary relation satisfactorily.

Amino acids are considered as the basic component of proteins. The study of proteins always starts with the study of amino acids. The physico-chemical properties of amino acids are considered to have immense effects on the properties of proteins [33] . It is an effective way to study the similarity of proteins by the properties of amino acids. In our work, we choose two main amino acids properties, namely hydropathy and isoelectric point, to construct a new way to represent the protein sequences. The detail values are declared in Table 1 . All values are cited from reference [23] .

According to the hydropathy value of amino acids, we could list their ranks:

A radian θ i will be assigned to each of the amino acid according to the hydropathy value rank. Based on the ranks, the value of θ i will change from 0 to 2π at the interval of 1 20 π. For example, the radian 0 will be assigned to the amino acid I and 6 20 π will be assigned to the amino acid A. Similarly, we also list another rank based on the isoelectric point values:

Another radian φ i , ranging from 0 to 2π at the interval of 1 20 π, will be assigned to amino acids according to the isoelectric point values. Then we build a three-dimensional representation of the amino acids. The coordinates of amino acids are calculated as follows:

Now, an amino acids sequence S = s 1 s 2 s 3 … s N with the length of N could be represented by a new sequence F = {c 1 , c 2 , c 3 , …c N }, where c i =(x i ,y i ,z i ). The coordinates of different axis are extracted respectively, forming new sequences

It is obvious that every symbol sequence will be represented by three numerical sequences according to our method. The representation will be unique because every amino acid has a unique coordinate, which means our approach could avoid the confusion from similar proteins.

The DFT is a common way in signal processing which is used to transform the signals in time domain into frequency domain. The latent information hidden in the signal in time domain could be discovered in this transformation without any loss. In recent years, the DFT has also been used in DNA sequences analysis. The classical application of DFT including prediction the location of exons in DNA sequences, genomic signature and periodicity analysis [34] [35] [36] [37] .

Considering the signal sequences u 1 (n), u 2 (n) and u 3 (n) defined in Section 2.1, the DFT of signal at frequency kis calculated by In our method, every protein sequences will be represented by three numerical sequences. The DFT power spectrum of the signal at frequen-cy k will be defined as PS k ð Þ ¼

The Dynamic Time Warping has been widely used in the analysis of speech signals. It is first proposed by Sakoe and Chiba in 1978 [38] , aiming to eliminate the nonlinear fluctuation in speech pattern time axis. This property could be used in the analysis of genes if we consider the protein sequences as genomic signal inputs. Recently, researchers have applied the DTW algorithm in the analysis of genetic signals. Skutkova et al. used DTW to classify DNA signals and they have obtained some excellent results [39, 40] .

In Fig. 1 , we give an example to illustrate the function of DTW. Assume data1 and data2, which have similar wave shapes, are same spoken word from different speakers. The subfigure a shows the two original signals has similar shapes, but obviously they are in different time scales. It is hard to tell whether they are from the same word. However, in subfigure b, the two signals have the same wave shapes after the DTW, which means the two signals are from the same word.

In this paper, DTW is applied to calculate the distance of different power spectra. We assume there are two power spectra To simplify the symbol, we use two sequences to represent the two power spectra a 1 ; a 2 ; …a p …; a M b 1 ; b 2 ; …b q …; b N where a p = PS 1 (p − 1) , b q = PS 2 (q − 1) (p =1,2,…M; q =1,2,…N).

Define a distance

as a metric of the difference between feature vectors a p and b q . The accumulated distance is calculated by formula (4).

Apparently, the accumulated distance depends on the pairwise distance d(p,q) and the minimum from the previous values. The values of D(p,q), which will be used as the metric of similarity of two sequences, will form a table. The sequence warping path is derived on the basis of minimization of the backward way from the right upper corner to the left lower corner [39] . For two sequences, the minor D(p,q) is, the more similar they are.

To verify the approach we proposed, we choose different datasets of various species and take several experiments. We construct the phylogenetic trees to get the cluster results and illustrate the distance between species in the evolution.

Our scheme is applied to test 22 kinds of animal first. We choose the NADH dehydrogenase subunit 5 (ND5) sequences from NCBI database as our inputs. All the information of sequences we used is listed in Table 2 . Table 3 reveals our results. Corresponding to every species, a number is assigned in the table: 1-blue whale, 2-bornean orangutan, 3-cat, 4-common chimpanzee, 5-fin whale, 6-gibbon, 7-gorilla, 8-gray seal, 9-habor seal, 10-human, 11-horse, 12-mouse, 13-opossum, 14pigmy chimpanzee, 15-platypus, 16-rat, 17-rhino, 18-sumantran orangutan, 19-wallaroo, 20-tiger, 21-korean bovine, 22-spain bovine. It is noticed in Table 3 the pairs (blue whale, fin whale) (common chimpanzee, pigmy chimpanzee) and (Korean bovine, Spanish bovine) have a shorter distance in our analysis. The homologies revealed in the table are in agreement with evolutionary relation satisfactorily. Moreover, we also construct the phylogenetic tree of the 22 species in Fig. 2 .

In Fig. 2 , some reasonable cluster results are revealed. We find that the primates, such as common chimpanzee, pigmy chimpanzee, human, gorilla, orangutan and gibbon are much closer than other species in the evolutionary distance. Besides, different kinds of whale, bovine and seal are also located in the same branch respectively. All the classifications we've obtained are in agreement with the classical evolution theory. As a comparison, we apply the method in Ref. [21] to analyze the dataset in Table 2 . The results are shown in Fig. 3 . The results obtained in Figs. 2 and 3 have similar clusters. However, there also exists some difference. The Bornean orangutan and Sumatran orangutan should have a closer relation than other species, but obviously they are located in different branches in Fig. 3 . Besides, the results in Fig. 3 also indicate the two kinds of whales and two kinds of bovines are much closer than others in the phylogeny. In Fig. 2 , all the improper classifications are corrected. This experiment indicates our scheme is effective in the similarity analysis of animal ND5 proteins.

The influenza A virus has been a major threat to human and animals [41] . The viruses could be identified to different subtypes according to the different viral surface proteins hemagglutinin and neuraminidase. Until now 18 H serotypes (H1 to H18) and 11 N serotypes (N1 to N11) of influenza A viruses have been identified. The influenza A viruses have caused epidemic among human and animals. Some of the most lethal viruses are H1N1, H2N2, H5N1 and H7N9. We take 28 kinds of influenza A virus as samples in our test. All the sequences are picked from NCBI database. The sequences information is listed in Table 4 .

We use the Mega software (version 6.06) to calculate the distance between 28 kinds of influenza A virus, drawing the phylogenetic tree in Fig. 4. In Fig. 4 , we notice that most of the virus are classified correctly except the virus (A/Duck/Ohio/118C/93(H1N1)), which belongs to H1N1 in virology. Clearly, this is an improper classification. Using the same virus data, we apply our method to calculate the similarity of 28 influenza A virus, getting the results shown in Fig. 5 . The cluster results from our method matches the classification in virology correctly. The viruses from same type are clustered in the same branch respectively. We notice the wrong classification in Mega software has been corrected in our method. Furthermore, it is also noticed that the viruses appeared in adjacent years are much closer in the phylogeny. For example, the virus (A/blue-winged_teal/Ohio/566/2006(H7N9)) is much closer to the virus (A/goose/Czech_Republic/1848-K9/2009(H7N9)) than (A/chicken/Quzhou/2/2015(H7N9)).

As a comparison, another software is also applied in our test. The cluster results from Clustal X software is illustrated in Fig. 6 . The results in Fig. 6 are similar with ours. However, as illustrated in the phylogenetic tree, the viruses which belong to H2N2 are clustered in different branches. We conclude from the figures that the results obtained from different methods have an overall agreement even though there exists some variation between different methods. The phylogenetic trees in different figures reveal similar classification of influenza A virus. Among the three methods, our approach is more accurate in the test.

As a further comparison, we construct a phylogenetic tree for 50 coronavirus spike proteins. The coronavirus could cause some severe epidemics, for example, SARS. We use some coronavirus spike proteins as inputs to test our method. All the data comes from the Table 3 in reference [21] . The relations revealed in Fig. 7 are similar to the phylogeny reported in reference [21] . All the SARS coronavirus gather in the same branch. The coronavirus from same species has a much closer relationships. Due to the discussions above, our method is proved reasonable and effective.

In this work, techniques from signal process have been applied in the analysis of protein sequences. The approach in this paper provides an intuitive solution to analyze the protein sequences. We establish a novel measure based on Discrete Fourier Transform and Dynamic Time Warping to analyze the similarity of protein sequences. Based on the values of hydropathy and isoelectric point, we assign different radians to the amino acids according to their ranks. A three dimensional representation is constructed to represent all the amino acids. With the help of DFT and DTW, we get the power spectra and scale the spectra to the same length. The distances between species are evaluated by constructing phylogenetic trees. We use different datasets including animals and viruses to test our method. Compared to the existing methods and softwares, the computational time of our algorithm is large. However, there still exists ways to improve our method. For example, in the DTW process, a proper filter or sampling method could be considered to pick some important information from results of the DFT instead of keeping all the values of spectra. Also, the DTW algorithm could be improved to reduce the running time of our method. In the test, we find the method in our paper provides accurate classification of different species. An improved DFT-DTW method will be our goal in the future works.

A/chicken/Hubei/01-MA01/1999(H9N2))

A/Adachi/2/1957(H2N2))

A/Berkeley/1/1968(H2N2))

A/California/1/1966(H2N2))

A/Georgia/1/1967(H2N2))

A/Duck/Ohio/118C/93(H1N1))

A/muscovy_duck/Vietnam/LBM66/2011(H5N1))

The phylogenetic tree of 28 influenza A virus by Mega 6

A/Duck/Ohio/118C/93(H1N1)

A/chicken/Dongguan/1096/2014(H7N9)

A/chicken/Hubei/01-MA01/1999(H9N2)

A/chicken/Iran/B263/2004(H9N2)

A/canine/Guangxi/1/2011(H9N2)

A/Adachi/2/1957(H2N2)

A/Georgia/1/1967(H2N2)

A/Berkeley/1/1968(H2N2)

A/California/1/1966(H2N2)

A/muscovy duck/Vietnam/LBM66/2011(H5N1)

A/equine/Prague/1/1956(H7N7)

A/equine/Santiago/77(H7N7)

A/fowl/Weybridge(H7N7)

The phylogenetic tree of 28 influenza A virus calculated by our method

The phylogenetic tree of 28 influenza A virus calculated by Clustal X. Fig. 7. The phylogenetic tree of 50 coronavirus spike proteins

H-curves, a novel method of representation of nucleotide series especially suited for long DNA sequences

A new graphical representation and analysis of DNA sequence structure. 1. Methodology and application to globin genes

Chaos game representation of gene structure

Analysis of similarity/dissimilarity of DNA sequences based on novel 2-D graphical representation

DNA sequence representation without degeneracy

PNN-curve: a new 2D graphical representation of DNA sequences and its application

Analysis of similarity/dissimilarity of DNA sequences based on a condensed curve representation

A group of 3D graphical representation of DNA sequences based on dual nucleotides

A novel graphical and numerical representation for analyzing DNA sequences based on codons, MATCH-Commun

Similarity analysis of protein sequences based on 2D and 3D amino acid adjacency matrices

C-curve: a novel 3D graphical representation of DNA sequence based on codons

A new 2D graphical representation -classification curve and the analysis of similarity/dissimilarity of DNA sequences

Novel, 2D representation of genome sequence and its application

New 2D graphical representation of DNA sequences

An improved model for whole genome phylogenetic analysis by Fourier transform

A new method to cluster DNA sequences using Fourier power spectrum

Protein sequence comparison based on K-string dictionary

A graphical representation of protein based on a novel iterated function system

A 3D graphical representation of protein sequences based on the Gray code

A 3-D graphical method applied to the similarities of protein sequences

An alignment-free method to find similarity among protein sequences via the general form of Chou's pseudo amino acid composition

A protein map and its application

Protein map: an alignment-free sequence comparison method based on various properties of amino acids

Protein space: a natural method for realizing the nature of protein universe

Distinguishing proteins from arbitrary amino acid sequences

Virus classification in 60-dimensional protein space

A generalization of CGR representation for analyzing and comparing protein sequences

An efficient numerical method for protein sequences similarity analysis based on a new two-dimensional graphical representation

UC-Curve: a highly compact 2D graphical representation of protein sequences

3D-dynamic representation of DNA sequences

Descriptors of 2D-dynamic graphs as a classification tool of DNA sequences

20D-dynamic representation of protein sequences

What amino acid properties affect protein evolution?

Prediction of protein coding regions by the 3-base periodicity analysis of a DNA sequence

Frequency-domain analysis of biomolecular sequences

Gene prediction based on DNA spectral analysis: a literature review

Signal processing in sequence analysis: advances in eukaryotic gene prediction

Dynamic-programming algorithm optimization for spoken word recognition

Classification of genomic signals using dynamic time warping

Progressive alignment of genomic signals by multiple dynamic time warping

A review of avian influenza in different bird species