key: cord-0840838-yjon4plr authors: Dai, Qi; Liu, Xiaoqing; Li, Lihua; Yao, Yuhua; Han, Bin; Zhu, Lei title: Using Gaussian model to improve biological sequence comparison date: 2009-05-28 journal: J Comput Chem DOI: 10.1002/jcc.21322 sha: ebf752aaac04bad77a814ca3a6509bbabe7b0281 doc_id: 840838 cord_uid: yjon4plr One of the major tasks in biological sequence analysis is to compare biological sequences, which could serve as evidence of structural and functional conservation, as well as of evolutionary relations among the sequences. Numerous efficient methods have been developed for sequence comparison, but challenges remain. In this article, we proposed a novel method to compare biological sequences based on Gaussian model. Instead of comparing the frequencies of k‐words in biological sequences directly, we considered the k‐word frequency distribution under Gaussian model which gives the different expression levels of k‐words. The proposed method was tested by similarity search, evaluation on functionally related genes, and phylogenetic analysis. The performance of our method was further compared with alignment‐based and alignment‐free methods. The results demonstrate that Gaussian model provides more information about k‐word frequencies and improves the efficiency of sequence comparison. © 2009 Wiley Periodicals, Inc. J Comput Chem, 2010 The abundance of biomolecular sequence information (generated as a result of the ever-increasing number of large-scale sequencing projects), together with a relatively high cost of "wet lab" experimentation, calls for powerful and efficient computational tools as primary means for high-throughput genomic proteomic investigations. Therefore, computational methods of biological sequence analysis become an indispensable part of the modern scientist's research arsenal. 1 In protein studies, the results of sequence similarity searches in databases help generate reasonable hypotheses concerning structural and functional properties of proteins. 2, 3 On the DNA level, sequence analysis techniques make it possible to identify genes and functional elements in newly sequenced genomes. Phylogenetic analysis [4] [5] [6] [7] [8] [9] not only provides us evolutionary relations among the sequences but also provides useful information for pharmaceutical researchers to determine which medicinal species share the same medical qualities. But, these efficient computational methods rely heavily on sequence comparison. Because of the importance of sequence comparison, numerous methods have been developed. A typical approach to sequence comparison is based on sequence alignment. Waterman 10 and Durbin et al. 11 provided comprehensive reviews about this method. The search for optimal solutions using alignment-based method encounters difficulties in: (i) computational load with regard to large databases; 12 (ii) choosing the scoring schemes. 27 Because of the critical limitations of alignment method, the emergence of research into alignment-free methods is apparent and necessary. Many alignment-free methods have been proposed, but they are still in the early development compared with alignment-based method. Comparison methods based on k-word frequencies may be the most well-developed alignment-free methods. Reinert et al. 26 studied the statistical and probabilistic properties of words in sequences, with emphasis on the deductions of exact distributions and evaluation of its asymptotic approximations. Word-based methods were recently reviewed by Vinga and Almeida. 27 Among these word-based methods, each sequence is mapped into an n-dimensional vector according to its k-word frequencies. The similarity score between sequences represented in vector spaces is further defined by Euclidean distance, 28 Mahalanobis distance, 29 Kullback-Leibler discrepancy, 30 Cosine distance 31 between their corresponding vectors. Recently, several novel word-based methods have been designed for sequence comparison, such as D2z( 32 ), Gdis.k, 33 D 2 and D 3 . 34 This work presents a novel method for biological sequence comparison based on Gaussian model. Instead of comparing the k-word frequencies of two sequences directly, we evaluate their k-word frequencies in a probabilistic framework. Our method was evaluated by extensive tests such as similarity search, evaluation on functionally related genes, and phylogenetic analysis. A comparison of performance between the proposed method and several typical alignment-based or alignment-free methods was taken. The results demonstrate that it is a promising word-method for sequence comparison with potential application in improvement on structure and function prediction. There is a large body of literatures on word statistics, 26 where sequences are interpreted as a succession of symbols and are further analyzed by representing the frequencies of its small segments. A k-word is a series of k consecutive letters in a sequence. The k-word statistical analysis consists of counting occurrences of k-words in a given sequence. For a sequence s, the count of a k-word w, denoted by c(w), is the number of occurrence of w in the sequence s. The standard approach for counting k-words in a sequence of length m is to use a sliding window of length k, shifting the frame one base at a time from position 1 to m−k+1. In this method, k-words are allowed to overlap in the sequence. In this way, a sequence can be represented by an n-dimensional vector C s k made up of k-word counts where n is the total number of all possible k-words. The frequencies of k-words, F s k , can he calculated by For example, consider the DNA sequence s = AAAGGA, we can obtain the vectors made up of 2-word counts and frequencies In asymptotic cases, Gaussian, Poisson, and compound Poisson approximations have been derived for word counts; the type of approximation depends on the word length and on the method of counting word occurrences. In particular, if m is the length of the sequence, then, for large m, the distribution of counts of a word can be approximated by the normal distribution; this approximation is good when the length of the word is relatively small compared to the sequence length. 1 The simplest method of assessing normality is to look at the frequency distribution histogram. For example, the 3-word frequency histogram of HSLIPAS (Human mRNA for lipoprotein lipase) sequence appears in Figure 1 . The most important things to look at are the symmetry and peak of the curve. Figure 1 shows that the distribution of 3-word frequencies of HSLIPAS sequence approximately follows the Gaussian distribution. Visual appraisals can only be used as an indication of the distribution and subsequently better methods must be used. To test formally for normality we use a Kolmogorov-Smirnov test. Kolmogorov-Smirnov test is a goodness-of-fit test for any statistical distribution. The test relies on the fact that the value of the sample cumulative density function is asymptotically normally distributed. To apply the Kolmogorov-Smirnov test, the main operations are as follows: (1) calculate the cumulative frequency (normalized by the sample size) of the observations as a function of class; (2) calculate the cumulative frequency for a true distribution (most commonly, the Gaussian distribution); (3) find the greatest discrepancy between the observed and expected cumulative frequencies, which is called the "D-statistic." The Kolmogorov-Smirnov statistic (D) is defined as where F(x) is a given cumulative distribution function, and F n (x) is a empirical distribution function for n iid observations X i , which is defined as . . , f (w k,n ) with some unknown distribution P and we would like to test the hypothesis that P is equal to a Gaussian distribution P 0 , i.e., decide between the following hypotheses: The p value obtained by Kolmogorov-Smirnov test tells us whether the data is significantly different from the Gaussian distribution or not. We reject the hypothesis if the test is significant at the 0.05 level. That is to say, if p < 0.05, we reject H 0 , do not reject H 0 otherwise. We also take the 3-word frequencies of HSLIPAS sequence for example and perform Kolmogorov-Smirnov test. Since the p-value is 0.63857, we accept H 0 . In addition, the way Kolmogorov-Smirnov test work is by generating a normal probability plot, 35 it is a graphical technique for assessing whether or not a data set is approximately normally distributed. Figure 2 is the normal probability plot of 3word frequencies of HSLIPAS sequence. The straight line on Figure 2 is the null hypothesis of normality, the points on this plot form a nearly linear pattern, which indicates that the Gaussian distribution is a good model for the 3-word frequencies of HSLIPAS sequence. It is worthwhile pointing out that Kolmogorov-Smirnov test is designed to test a simple hypothesis P = P 0 for a given normal distribution P 0 . But, if we estimated this distribution, N(μ,σ 2 ) from the data f (w k,1 ), f (w k,2 ), . . . , f (w k,n ), formally, Kolmogorov-Smirnov test is inaccurate in this case. There is a version of Kolmogorov-Smirnov test, called Lilliefors test, 36 that tests normality of the distribution by comparing the data with a fitted Gaussian distribution as we did above, but with a correction to give a more accurate approximation of the distribution of the test statistic. The test proceeds as follows: (1) estimate the population mean and population variance based on the data; (2) find the maximum discrepancy between the empirical distribution function and the cumulative distribution function (CDF) of the normal distribution with the estimated mean and estimated variance, just as in the Kolmogorov-Smirnov test, this will be the test statistic; (3) confront the question of whether the maximum discrepancy is large enough to be statistically significant, thus requiring rejection of the null hypothesis. We also take the 3word frequencies of HSLIPAS sequence for example and perform Lilliefors test. First, we estimate the population meanμ = 0.0156 and population varianceσ 2 = 0.0066 based on the 3-word frequencies of HSLIPAS sequence. Then, we perform Lilliefors test that the 3-word frequencies of HSLIPAS sequence comes from the distribution N(0.0156, 0.0066). At the 0.05 significance level, we accept the normality of 3-word frequencies of HSLIPAS sequence with p-value 0.19537. Many methods for sequence comparison are to fix a short word length k, compute the frequencies of all k-words in each sequence, and assess the similarity of the two frequency vectors. For example, the dissimilarity score between two sequences X and Y are the Euclidian distance 28 or cosine of the angle 31 between their k-word frequency vectors F X k and F Y k . Sometimes, these simple methods are not satisfying for sequence comparison, because (i) they treat all word types equally, despite that they have different background, and (ii) it does not take into account the fact that, for a given k-word, the probability is not a linear function of the number of occurrences. To overcome the problems, the Mahalanobis and standard Euclidean distance, which take into account the data covariance structure, were proposed for sequence comparison. 29 In this article, we treat the above two problems by using a probabilistic model of k-word frequencies. The Kolmogorov-Smirnov test indicates that the k-word frequencies of biological sequences can be approximated by Gaussian distribution. This approximation is good when the length of the word is relatively small compared to the sequence length. 1 In what follows we will explore sequence comparison method on the basis of the Gaussian distribution of word frequencies. The Gaussian distribution, also called the normal distribution, is an important family of continuous probability distributions, applicable in many fields. Each member of the family may be defined by two parameters, location and scale: the mean ("average," µ) and variance (standard deviation squared, σ 2 ) respectively. The standard Gaussian distribution is the Gaussian distribution with a mean of zero and a variance of one. To indicate that a real-valued random variable X is normally distributed with mean µ and variance σ 2 ≥ 0, we write There are various ways to characterize a probability distribution. The most widely used one is probability density function (PDF). The probability density function of the Gaussian distribution is where σ > 0 is the standard deviation, the real parameter µ is the expected value, and is the density function of the "standard" Gaussian distribution: i.e., the Gaussian distribution with µ = 0 and σ = 1. The distribution function of the Gaussian distribution is expressed in terms of the density function as follows: The standard Gaussian distribution function is just the general distribution function evaluated with µ = 0 and σ = 1: A biological sequence s, of length m, is defined as a linear succession of symbols from a finite alphabet A , with size of |A |. All possible sequences of length k with symbol from the alphabet A compose a k-word set, which corresponds to a k-word frequencies set F k . Suppose F k is a sample space and the frequency of each kword is a random variable denoted by f w k,i , the frequency of k-word is approximately followed by the Gaussian distribution with mean µ and variance σ 2 . That is to say, Give two biological sequences X and Y , the frequencies of k-word f w k,i in sequences X and Y follow two different Gaussian models According to distribution function of the Gaussian distribution, we have Note that a word is called highly expressed if its observed frequency is more than its expected frequency, and called low expressed otherwise. In this sense, the probability µX ,σ 2 corresponds to low expression of word w k,i , and large value of We define the probability distance between X and Y as The d Nor (X, Y ) has the following properties: (i) it is a distance measure, because it satisfies positivity, symmetry and triangle inequality; (ii) background information is incorporated into the measure; (iii) kwords with identical frequency in two sequences may have different expression levels. Since the mean µ and variance σ 2 are priori unknown, we have to estimate them according to the observed sequences. Here, we estimate the parameters of Gaussian model by using the maximum likelihood method. Give a biological sequence, its k-word frequencies are are independent and each is normally distributed with expectation µ and variance σ 2 > 0. These observed values of these n random variables make up a "sample of size n from a normally distributed population." It is desired to estimate the "population mean" µ and the "population standard deviation" σ , based on the observed values of this sample. The continuous joint probability density function of these n independent random variables is As a function of µ and σ , the likelihood function based on the with some constant C > 0. In the method of maximum likelihood, the values of µ and σ that maximize the likelihood function are taken as estimates of the population parameters µ and σ . Usually in maximizing a function of two variables, one might consider partial derivatives. But here we will exploit the fact that the value of µ that maximizes the likelihood function with σ fixed does not depend on σ . Therefore, we can find Journal of Computational Chemistry DOI 10.1002/jcc that value of µ, then substitute it for µ in the likelihood function, and finally find the value of σ that maximizes the resulting expression. It is evident that the likelihood function is a decreasing function of the sum So we want the value of µ that minimizes this sum. Let be the "sample mean" based on the n observations. Observe that Only the last term depends on µ and it is minimized bŷ That is the maximum-likelihood estimate of µ based on the n observations f (w k,1 ), f (w k,2 ), . . . , f (w k,n ). When we substitute that estimate for µ into the likelihood function, we get It is conventional to denote the "log-likelihood function," i.e., the logarithm of the likelihood function, by a lower-case , and we have and then The proposed method is evaluated by extensive experiments such as similarity search, evaluation on functionally related genes, and phylogenetic analysis. We presently grouped our experiments into two sets. The first one, performed via ROC (receiver operating curve) analysis, aims at assessing the intrinsic ability of our method to search for similar sequences from a database and discriminate functionally related genes from unrelated sequences. The second one aims at assessing how well our method is used for phylogenetic analysis. The method that will be used here to evaluate performance of the presented method is based on the analysis of ROC curves. ROC goes back to signal detection and classification problems and is now widely used. 37 This approach is employed in binary classification of continuous data, usually categorized as positive (1) A ROC curve is simply the plot of sensitivity versus (1specificity) for different threshold values. The area under a ROC curve (AUC) is a widely employed parameter to quantify the quality of a classificator because it is a threshold independent performance measure and is closely related to the Wilcoxon signed-rank test. 38 For a perfect classifier, the AUC is 1 and for a random classifier the AUC is 0.5. For additional results and comprehensive discussion on AUC measure, see ref. 39 . The proposed method is used to search for similar sequences of a query sequence from a database of 39 library sequences, of which 20 sequences are known to be similar in biological function to the query sequence, and the remaining 19 sequences are known as being not similar in biological function to the query sequence. This data set has been studied in refs. 12,30 and 40. These 39 sequences were selected from mammals, viruses, plants, etc., of which lengths vary from 322 to 14121 bases. The query sequence is HSLIPAS (Human mRNA ROC curves are computed to evaluate and compare the performance of our measure with other measures. The evaluated measures are as follows: the similarity measures based on Clustal W, Euclidean distance (eu), 28 Mahalanobis distance (md), 29 standard Euclidean distance (sd), 29 Kullback-Leibler discrepancy (kld), 30 Cosine distance (cos), 31 D2z, 32 D 2 , 34 D 3 34 and our measure d Nor . All measures based on k-word frequencies run with k from 2 to 5. For each measures, separate tests are done with each combination of parameter values, and the best combination is chosen to represent that score in the performance. The ROC curves obtained for the similarity search are presented in Figure 3 . The AUC value is typically used as a measure of overall discrimination accuracy. Table 1 provides the areas under ROC curves (AUC) obtained from all the measures. Figure 3 and Table 1 Among the similarity measures based on k-word distributions, D2z.5 is clearly more efficient than other measures. The main surprise of this analysis is that when we explore the distribution information of k-word frequencies in our way, d Nor .4 performs better than other similarity measures based on k-word frequencies. The inspection of the ROC curves themselves (Fig. 3) further illustrates this comparison between similarity measures. The proposed Gaussian model of k-word frequencies is further tested to evaluate if functionally or evolutionarily related gene pairs are scored better than unrelated pairs of random sequences. To assess the performance on functionally related genes, we construct data sets as follows. We selected three sets of genes, each involved in a particular pathway: nitrogen metabolism (NIT family, 31 genes), phosphate utilization (PHO family, 13 genes), and methionine biosynthesis (MET family, 20 genes). They are well studied in ref. 41 . We retrieved the 800 bp sequence upstream the start codon of each gene as "positive" sets. As "negative" sets, we generated random sequences with lengths matching the sequence in "positive" sets. Each pair of sequences in the positive set is compared, and so is each pair in the negative set. The evaluation procedure is based on a binary classification of each sequence pair, where 1 corresponds to the pairs from positive set, 0 corresponds to the pairs from negative set. Let n be the number of sequences in the positive set, all the pairs constitute a vector of length 2 n 2 , which is used as prediction. Also, we can get a vector of length 2 n 2 consisting of 1 and 0 as class labels. A perfect measure would completely separate negative from positive set. Of course, this does not happen in practice, and the classes are interspersed. The ROC curves permit to assess the level of accuracy of this separation without choosing any distance threshold for the separation point. In particular, the AUC will give us a unique number of the relative accuracy of each measure. The similarity measures evaluated here are as follows: the similarity measures based on alignment, Euclidean distance (eu), 28 Mahalanobis distance (md), 29 standard Euclidean distance(sd), 29 Kullback Leibler discrepancy(kld), 30 Cosine distance (cos), 31 D2z, 32 D 2 , 34 D 3 , 34 and our measure d Nor , where the similarity measures based on alignment are Needleman-Wunsch (global alignment) or Smith-Waterman (local alignment) raw scores, with no correction for statistical significance, using linear gap penalties or affine gap penalties, with a gap penalty of 2. All measures based on k-word distributions run with k from 2 to 5. For each measures, separate tests are done with each combination of parameter values, and the best combination is chosen to represent that score in the performance. ROC curves are computed to evaluate and compare the performances of our measures and other measures. The ROC curves obtained for NIT, Met, and PHO are presented in Figures 4 and 5. Table 2 summarizes the AUCs obtained from all the measures for three data sets. In the MET experiment, d Nor .4 performs better than other alignment-based and word-based measures, with the area under ROC curve 1.000. The next best measure is kld.2, and the other measures lag behind. In the NIT experiment, d Nor .5 measure is better than all other measures, and its area under ROC curve is 1.000. In the MUSCLE experiment, d Nor .3 outperforms other methods, with the area under ROC curve 1.000. It is followed by kld.5. From the three experiments, we can see that d Nor , exploring the distribution of k-word frequencies, performs better than other measures. The inspection of the ROC curves themselves (Figs. 4 and 5) further illustrates these comparisons among similarity measures. The highly significant results of our method demonstrate that the d Nor measure is successful at detecting the functional similarity of genes from the random sequences. Since the different genes are only functionally related and not orthologous, the gene search algorithm requires a method that can discern functional similarity among candidate genes based on their sequence similarity. From Table 2 , we can note that the alignment-based methods lag behind some alignment-free methods. Since the outbreak of atypical pneumonia referred to as severe acute respiratory syndrome (SARS), more attentions [42] [43] [44] [45] have been paid to the relationships between the SARS-CoVs and the other coronaviruses, which would be helpful to discover drugs and develop vaccines against the virus. Generally, coronaviruses can be divided into three groups according to serotypes. Group I and group II contain mammalian viruses, while group II coronaviruses contain a hemagglutinin esterase gene homologous to that of Influenza C virus. 46 Group III contains only avian. Based on the Gaussian model of k-word frequencies, we next consider to infer the phylogenetic relationships of coronaviruses with the complete coronavirus genomes. The 24 complete coronavirus genomes used in this article were downloaded from GenBank, of which 12 are SARS-CoVs and 12 are from other groups of coronaviruses. The name, accession number, abbreviation, and genome length for the 24 genomes are listed in Table 3 . Given a set of biological sequences, their phylogenetic tree can be obtained through the following main operations: firstly, we construct the Gaussian model for biological sequences; secondly, we calculate their similarity degree by using our measure d Nor . Thirdly, by arranging all the similarity degree into a matrix, we obtain a pair-wise distance matrix. Finally, we put the pair-wise distance matrix into the neighbor-joining program in the PHYLIP package. 47 We obtain the phylogenetic relationships drawn by MEGA program (9) . In Figure 6 , we present the unrooted phylogenetic tree belonging to 24 species. Figure 6 shows that our results are quite consistent with the accepted taxonomy and authoritative ones [42] [43] [44] [45] in the following four aspects. First, all SARS-CoVs are grouped in a separate branch, which appear different from the other three groups of coronaviruses. Secondly, BCOV, BCOVL, BCOVM, BCOVQ, MHV, MHV2, MHVM, and MHVP are grouped into a branch, which is consonant with that they belong to group II. Thirdly, HCoV-229E, TGEV, and PEDV are closely related to each other, which is consistent with the fact that they belong to group I. 28 Finally, IBV forms a distinct branch within the genus Coronavirus, because it belongs to group III. Grigoriev 43 found that the mutational patterns in SARS-CoV genome were strikingly different from the other coronaviruses in terms of mutation rates. Phylogenetic analysis based on codon usage pattern suggested that SARS-CoV was diverged far from all the three known groups of coronavirus. 44 Rota et al. 42 found out that the overall level of similarity between SARS-CoVs and the other coronaviruses is low. Our tree also reconfirms that SARS-CoVs are not closely related to any previously isolated coronaviruses and form a new group, which indicates that the SARS-CoVs have undergone an independent evolution path after the divergence from the other coronaviruses. Whole genome-based phylogenetic analysis is appealing because single gene sequences generally do not possess enough information to construct an evolutionary history of organisms. Now phylogenetic analysis based on sequence alignments is well developed. However, it can hardly be applied to complete genomes, because the computational load of multiple alignment increases with the increasing length of sequence. Being different from the sequence alignment method, the current method is more simple and yields results reasonably. Figure 6 . The unrooted consensus species tree for 24 coronavirus by our distance d Nor at k = 6 using whole genomes. Sequence comparison is rapidly becoming an essential tool for bioinformatics applications. It has been used to support other types of analyses, from searching a database with a query DNA sequence to the phylogenetic tree construction. Despite the prevalence of alignment-based methods, it is noteworthy that alignment-based method is computationally intensive and consequently unpractical for querying large data sets, which forces the use of some heuristics to reduce the running times, as exemplified by BLAST. Alignmentfree comparison method is therefore of great value as it reduces the technical constraints of alignments. A novel alignment-free method for sequence comparison is proposed in this work. We assume that the frequencies of a given k-word in a biological sequence follows the Gaussian distribution. The similarity between two sequences can be evaluated by the difference between their corresponding Gaussian models. In contrast to the traditional word-based methods based on frequencies of fixed k-words, our method takes distribution information of k-word frequencies into account. In other words, our method has the ability to adjust the background information for similarity measure using k-word frequencies. The test of our methods are to perform similarity search and evaluate the functionally related genes. To evaluate this method, we compare it with alignment-based or word-based methods. The comparison demonstrates that our method, intending to explore kword frequency distribution information, gives more competitive results (Tables 1 and 2 ). In addition, the reasonable results of phylogenetic tree construction illustrate the validity of our method for phylogenetic analysis. In summary, this work presented a new and effective computational framework for sequence comparison. It can be used as another useful tool in addition to existing alignment-based and alignment-free methods for the research community of bioinformatics. The results also indicated that it is a necessity for alignment-free methods to extract more information in order to have a good comparison performance. This understanding can then be used to guide development of more powerful sequence comparison method for potential improvement on evolutionary study, structure and function prediction. Introduction to Computational Biology: Maps, Sequences, and Genomes Biological Sequence Analysis Graphical Methods for Data Analysis Signal Detection Theory and ROC-Analysis The authors thank all the anonymous referees for their valuable suggestions and support.