key: cord-0856624-49cq38bp
authors: Holmes, Ian
title: Using evolutionary Expectation Maximization to estimate indel rates
date: 2005-05-15
journal: Bioinformatics
DOI: 10.1093/bioinformatics/bti177
sha: 62d9658978a182b2b42dc8e8e84c396f308876f6
doc_id: 856624
cord_uid: 49cq38bp

Motivation: The Expectation Maximization (EM) algorithm, in the form of the Baum–Welch algorithm (for hidden Markov models) or the Inside-Outside algorithm (for stochastic context-free grammars), is a powerful way to estimate the parameters of stochastic grammars for biological sequence analysis. To use this algorithm for multiple-sequence evolutionary modelling, it would be useful to apply the EM algorithm to estimate not only the probability parameters of the stochastic grammar, but also the instantaneous mutation rates of the underlying evolutionary model (to facilitate the development of stochastic grammars based on phylogenetic trees, also known as Statistical Alignment). Recently, we showed how to do this for the point substitution component of the evolutionary process; here, we extend these results to the indel process. Results: We present an algorithm for maximum-likelihood estimation of insertion and deletion rates from multiple sequence alignments, using EM, under the single-residue indel model owing to Thorne, Kishino and Felsenstein (the ‘TKF91’ model). The algorithm converges extremely rapidly, gives accurate results on simulated data that are an improvement over parsimonious estimates (which are shown to underestimate the true indel rate), and gives plausible results on experimental data (coronavirus envelope domains). Owing to the algorithm's close similarity to the Baum–Welch algorithm for training hidden Markov models, it can be used in an ‘unsupervised’ fashion to estimate rates for unaligned sequences, or estimate several sets of rates for sequences with heterogenous rates. Availability: Software implementing the algorithm and the benchmark is available under GPL from http://www.biowiki.org/ Contact: ihh@berkeley.edu

A cornerstone of stochastic grammar-based profiling of DNA, protein and RNA sequences is the fast and accurate estimation of the probability parameters of the model, given a 'training set'. This is achieved for hidden Markov models (HMMs) using the Baum-Welch algorithm, and for stochastic context-free grammars (SCFGs) using the Inside-Outside algorithm (Durbin et al., 1998) ; both are examples of the Expectation Maximization (EM) algorithm (Dempster et al., 1977) .

Currently, most HMM-and SCFG-based methods are consensusor pair-based; they do not use the underlying phylogenetic tree directly in their training [although consensus methods use it indirectly, when over-represented clades are downweighted in the estimation of the consensus profile (Durbin et al., 1998) ]. The full power of evolutionary methods to model mutation rates and phylogenetic correlations is therefore unused. This may be because evolutionary models relating multiple sequences, using an arbitary phylogeny, are more complicated than consensus models, which effectively assume a star-shaped phylogeny (with the consensus sequence at the centre of the star). Nonetheless, there has been some progress in implementing stochastic evolutionary models for multiple sequence alignment and profiling (Mitchison and Durbin, 1995; Hein et al., 2000; Hein, 2001; Holmes and Bruno, 2001; Holmes and Rubin, 2002; Holmes, 2003; Miklós et al., 2004) . To develop this work further, a natural step is to extend the probabilistic results used for HMMs and SCFGs to the evolutionary domain.

Previous work demonstrated the potential of the EM algorithm as a quick and effective tool for finding maximum-likelihood (ML) rate matrices for point substitution processes (Holmes and Rubin, 2002) . As with all EM algorithms, this algorithm proceeds by repeated application of an E-step followed by an M-step. During the Estep, limited information about the inferred evolutionary history is gathered in the form of certain 'sufficient statistics', which are then used during the M-step to improve the estimate of the rate parameters. Specifically, these sufficient statistics are (1) the expected composition of the root sequenceŝ i , (2) the expected number of times,û ij , that each mutation of type i → j occurred and (3) the expected length of time,ŵ i , that each residue was available for substitution. All expectations are taken over the posterior distribution of substitution histories. In the M-step, equilibrium frequencies are set proportional toŝ i and substitution rates equal toû ij /ŵ i . This process converges extremely rapidly and, assuming the absence of significantly suboptimal local minima in the likelihood function (which in practise is often the case), results in ML substitution rate matrices. This paper presents an EM algorithm for the estimation of indel rates in the TKF91 model of sequence evolution (Thorne et al., 1991) . The main insight is that the analogues of the sufficient statistics (1)ŝ i , (2)û ij and (3)ŵ i for the TKF91 indel process are (1) the expected initial sequence length, (2) the expected number of insertions and deletions and (3) the expected mean sequence length over the evolutionary interval (together with the length of the interval itself, T ). We have implemented the algorithm in freely available software. We show that it accurately and quickly recovers the true indel length on simulated data; in contrast, the 'naive' or 'parsimonious' estimate that associates every observed gap with a single indel event is shown to underestimate the true indel rate by a small amount. Previous ML parameterization approaches have relied on sampling methods, which are significantly slower, albeit less prone to local minima in the likelihood function (Metzler et al., 2001) . We report on an application of the algorithm to estimate indel rates in coronavirus proteins.

Using methods, such as this rate-based EM algorithm, a number of novel sequence analysis approaches become feasible. For example, it is theoretically possible to do multiple sequence alignment without manually setting scoring parameters; instead, unbiased estimates of these parameters can be obtained quickly, directly and accurately from the data. The algorithms described here can be coupled with a recent 'structural EM' algorithm for the estimation of the ML phylogenetic topology (Friedman et al., 2001) . Another application is phylogenetic profiling: previous workers, such as Thorne et al. (1996) , have shown that it is possible to partition heterogenous multiple alignments using the varying signature of the substitution process, e.g. to predict secondary structure in proteins or transcription factor binding sites in promoters (Kellis et al., 2003) . Using methods based on the TKF model, it is now possible to use similar methods to partition the alignment according to the indel rates, and to estimate the relevant heterogenous rate parameters in an unsupervised fashion, from aligned or unaligned sequences, using EM, exactly as one might do with a single-sequence or pair HMM (Durbin et al., 1998) . Thus, this algorithm represents a further step in combining the achievements of HMM-based profiling and evolutionary modelling for multiple-sequence comparative genomics.

We assume some familiarity with continuous-time Markov models for sequence evolution. For a more thorough introduction, the reader is referred to Thorne et al. (1991) or Felsenstein (2003) .

Following Thorne et al. (1991) , we make the simplifying assumption that the evolutionary model can be separated into independent components: the point substitution process, modelled as a continuous-time finite-state Markov chain at each site of the sequence, and the indel process, modelled as a linear birth-death process with immigration.

The two models share certain features, in that they are both continuous-time Markov models. We begin by introducing some notation for such models, then describe each in detail.

A continuous-time Markov chain is specified by parameters ϑ = {π, R}. Here, the initial distribution is π and the transition rate matrix is R.

The probability of the model being in state j at time t, given that it started in state i at time 0, is M ij (t), where M(t) = e Rt is the matrix exponential.

The substitution model is a finite-state continuous-time Markov chain. If the substitution model is reversible, then the rate matrix R obeys detailed balance (π i R ij = π j R ji ) and is related to a symmetric matrix S (by S ij = R ij π i /π j ). The eigenvalues µ (k) and orthonormal eigenvectors v (k) of S (satisfying Sv (k) = µ (k) v (k) and v (k) · v (l) = δ kl ) can easily be found using standard algorithms (Press et al., 1992) , yielding the following result for the entries of M(t)

In contrast to the substitution model, the state space for the TKF91 indel model is infinite (the sequence length is unbounded). However, much of the theory from continuous-time Markov models still applies. The TKF91 model describes the evolution of a single sequence under the action of two kinds of mutation event: (1) point substitutions, which act on a single residue only (this process uses the finite-state Markov chain framework described in the previous section); and (2) single-residue indels, which insert or delete a single residue (Thorne et al., 1991) . Insertions occur at rate λ per available site; deletions at rate µ.

Consider first the simple example where we ignore the alignment and just look at the sequence length. We then have a linear birthdeath process with immigration whose state space is the non-negative integers (Karlin and Taylor, 1975) . The rate matrix is sparse:

We assume that the process is always started at equilibrium.

To find the likelihood of an individual alignment, consider starting with some ancestral sequence of length L and allowing it to evolve for some time t. A feature of the TKF91 model is that this process can be simplified by splitting the ancestral sequence into a series of independently evolving 'links'. For a sequence of length L, there are L + 1 such links, including one immortal link (representing the insertion site at the leftmost end of the sequence) and L mortal links (representing each ancestral residue of the sequence, and the insertion site immediately to its right).

Starting from the initial (ancestral) state, we allow each link to evolve stochastically for some time t, and then examine the new (descendant) state of the link. We may describe this descendant state as follows: first, by the number of newly inserted residues (n); and second (for mortal links) by specifying whether the original ancestral residue survived or was deleted. Let the probability distribution of n be r n (t) for immortal links and p n (t) + q n (t) for mortal links, where p n accounts for the survival of the ancestral residue and q n accounts for its deletion. Then it can be shown (Thorne et al., 1991) that

where

This probabilistic model for pairwise alignments can be expressed as a pairwise HMM using, e.g. the notation of Durbin et al. (1998) , as shown in Figure 1 .

The restriction to single-residue indel events leads to the geometric term β n in the above expressions, and is therefore roughly equivalent to using a linear gap penalty to score a sequence alignment. This is widely acknowledged to be unrealistic (Thorne et al., 1992; Hein et al., 2000; Knudsen and Miyamoto, 2003; Miklós et al., 2004) . Models that incorporate more realistic length distributions over indel sequences are less tractable than TKF91: such models have been analysed using simulation and combinatoric approximations, but have not yet yielded any algebraic expression for the alignment likelihood (Knudsen and Miyamoto, 2003; Miklós et al., 2004) . A tractable alternative is provided by the TKF92 model, which essentially replaces the single residues of TKF91 with artificial, indivisible, multi-residue 'fragments'. The lengths of these fragments are geometrically distributed. Where TKF91 is a birth-death process on residues, TKF92 is a birth-death process on fragments.

We now describe the EM algorithms for estimating model parameters. Again, we start with a general theory for continuous-time Markov models, and then proceed to the specific cases (the point substitution model and the TKF91 indel model).

Let h represent a 'history' of the process: that is, a complete specification of the state of the system at all times. The situation we wish to address is the one where we observe some data O which partially constrains the allowable histories h. For example, we might know what state the process is in at times t = 0 and t = T , but not in between. The state of the process at times 0 < t < T thus constitutes missing information. The EM algorithm (Dempster et al., 1977) consists of maximizing the following sum over possible histories with respect to ϑ

Previous work (Holmes and Rubin, 2002) showed that, if we break the time interval [0, T ] into small intervals t, we obtain

whereŝ i is the expected number of paths that start in state i,ŵ i is the expected wait in state i (i.e. the amount of time spent in i) and u ij is the expected usage of transition i → j . Note that iŝ i = 1 and iŵ i = T .

Here, f (ϑ, t) is a function that does not depend on the 'new' parameters ϑ . In fact f (ϑ, t) ∼ − log( t).

Ultimately, we are interested in the infinitesimal limit t → dt. In this limitŝ,ŵ andû retain their interpretations as the expected initial state occupancy, the expected waiting time and the expected transition usage. The function f is problematic: lim t→0 f (ϑ, t) = −∞. However, f can effectively be dropped from Q, as it disappears when we differentiate with respect to ϑ to maximize the expected log-likelihood of the new parameters.

We want to maximize Q while ensuring that R is a valid rate matrix (i.e. j R ij = 0) and π is a normalized probability vector (i.e. i π i = 1).

Introducing these constraints via Lagrange multipliers {A 1 · · · A m , B} and dropping f , the function to be maximized is

Suppose that the observed data O comprises the initial state a and the final state b. Thenŝ

where

The maximum of L is at π i =ŝ i / jŝ j , R ij =û ij /ŵ i (for j = i) and R ii = − j R ij .

We substitute into Equation (6) the definitions of R in terms of λ , µ , given in Section 2.3.

For the equilibrium distribution π, use independent geometric length parameter g, so that π i = g i (1 − g) (as long as the process stays reversible, we will end up with g = λ/µ)

iŝ i i is the expected initial sequence length,Ẑ = iŵ i i is the expected total time accrued by mortal links, T is the expected time accrued by the immortal link,D = iû i,i−1 is the expected number of deletions,Î = iû i,i+1 is the expected number of insertions and f (ϑ) does not depend on ϑ .

Note the Lagrange multipliers A, B disappear, since our definitions for R and π are already normalized. The maximum of L is g = S/(Ŝ + 1), λ =Î /(Ẑ + T ), µ =D/Ẑ.

The Supplementary information describes a procedure for calcu-latingŜ,Ẑ,D andÎ from sequence data.

We implemented the above-described EM algorithm for estimating indel rates in under 300 lines of C. The implemented code is available free of charge under the GNU Public License from the website www.biowiki.org. [The related EM algorithm for estimating substitution rates was previously implemented in the xrate program, available from the same website (Holmes and Rubin, 2002) .]

To test the EM algorithm, we ran extensive numerical simulations, generating random pairwise alignments under the TKF model and comparing the expected insertion counts (Î ) and site-times (Ẑ) with the actual values of these counts, known from the simulation. We also looked at the following 'naive' estimates of these quantities: for I , we simply counted the number of insertions in the simulated alignments, while for Z, we multiplied the lapsed time T by the average of the ancestor and descendant sequence lengths. Figure 2 shows the results of these simulations. There is extremely close agreement between the computed expectations (Î ,Ẑ) and the actual values of these statistics. As for the 'naive' estimates, we observe that simply counting the number of insertions in the pairwise alignment is a systematically biased underestimate of I , as expected (because some insertions were deleted before being observed). However, perhaps surprisingly, the naive estimate of Z (based on the average of the ancestor and descendant sequence lengths) turned out to be rather good. On closer investigation, we found that the naive Z estimate deteriorates significantly when λ is much less than µ, and also appears to perform worse at larger timescales. In practise, for and accrued site-time (right) obtained by simulation with λ = 0.0457, µ = 0.0458 (mean sequence length 50 residues), T ranging from 0.5 to 2, 10 5 trials per timepoint. The average over each 10 5 trials is plotted; to give an impression of error on the individual estimates, one in every 5 × 10 4 individual trials is also plotted. The 'naive' estimates, taken without correcting for statistical bias, are also shown. The target values are shown as a dashed straight line. long sequences (so λ µ) that are closely related (so T is small), the naive estimator for Z may be appropriate, although it seems likely that I and D will still be underestimates.

To test the application of the indel EM algorithm to experimental data, we estimated the indel rates for multiple alignments of Coronavirus protein domains, including domains from the SARS coronavirus (Marra et al., 2003) . Lengths of the SARS domains in amino acids are reported below. The first domain is GS1 (629 amino acids) and the second is GS2 (626 amino acids); both are derived from peptide cleavage of the 'Spike' surface glycoprotein. The third domain is C16 (277 amino acids), the papain-like peptidase domain from the long RNA polymerase gene product. The analysed proteins were sequenced from viruses responsible for SARS, murine hepatitis, porcine transmissible gastroenteritis, porcine epidemic diarrhoea, equine arteritis and avian infectious bronchitis.

Indel rates were estimated using three different methods: the EM method presented here, a Markov Chain Monte Carlo (MCMC) method and a naive estimate. The MCMC runs included 10 5 datapoints constituting ∼10 3 effective independent samples. Confidence intervals for the MCMC-estimated rate parameters are reported as m ± 2s where m is the mean and s the standard deviation of the marginal posterior distribution over the relevant parameter. The EM and naive methods do not return error estimates.

The domain boundaries were first identified by reference to the Pfam database (Bateman et al., 1999) . We then estimated a separate phylogenetic tree for each domain, first estimating a pairwise distance matrix using a 20 × 20 amino acid substitution rate matrix previously estimated from PFAM alignments by the EM algorithm (Holmes and Rubin, 2002) , then finding an approximate phylogenetic tree using weighted neighbour-joining (Bruno et al., 2000) . The three trees thus estimated are shown in Figures 3-5 .

Next, we used the tkfalign program (Holmes and Bruno, 2001 ) to infer ML alignment paths for the missing ancestral sequences, using that program's default values for the indel rates (λ = 0.049505, µ = 0.050495). We used the alignments containing these inferred ancestral sequences in all subsequent analyses. This strategy is likely to systematically underestimate the true indel rates, since the ML alignment paths will tend to minimize the number of inferred events. Ideally, we should integrate out the ancestral sequences, e.g. by Monte Carlo alignment sampling (Holmes and Bruno, 2001) or multidimensional dynamic programming (Hein, 2001) . However, fixing the alignments simplifies the comparison of the parameter estimation methods.

We proceeded to estimate the indel rates, using the three methods described (naive, EM and MCMC). The results of this analysis are shown in Table 1 . We note that the Spike protein has elevated indel rates compared with the peptidase protein, and that the first (5 ) domain has higher rates than the second (3 ) domain. Some caution is, however, needed in interpreting these results, because the indel rate estimates depend strongly on the estimated tree. Since we have estimated independent phylogenies (including branch lengths) for the GS1 and GS2 domains, and since the branch length estimates are dominated by the substitutions in the alignment, what we are actually measuring here is not the absolute indel rate but the ratio of the indel rate to the substitution rate. If substitutions are generally occurring faster in one of the domains (as indeed they are, in the GS1 domain) this may significantly skew the estimate. An absolute comparison of the rates of evolution of the Spike domains can be made by the following procedure. First, estimate a phylogenetic tree for the full alignment of the entire Spike protein (Fig. 6) ; next, estimate λ and µ independently for the two domains, using the same tree (and branch lengths) for each domain. The results of this test are given in Table 2 . Here, it can be seen that the 'absolute' rate of indel evolution is, in fact, greater in GS1. Thus, the overall picture is that GS1 is mutating faster than GS2, but substitutions are elevated relative to indels. Given the relative mutation rates, we predict that the GS1 domain has more contact interactions with the host immune system than GS2.

We note that although there is a clear systematic bias to the naive rate estimates, this bias ranges from 2 to ∼9% and, as such, is usually slightly smaller than the sampling error (as revealed by MCMC). The exception to this is the GS1 domain, for which the bias is slightly larger (but not by much). These results seem to indicate that the naive estimates would probably be acceptable in most practical situations, although in order to avoid the possibility of downstream or compounded errors [e.g. in more elaborate TKF91-derived models containing more parameters (Holmes, 2004) ] it would be better to use the unbiased estimators derived here. The same phylogenetic tree was used for both domains (Fig. 6) . Key to domain names: GS1 is the 5 domain of the SARS coronavirus Spike surface glycoprotein, while GS2 is the 3 domain of the same protein.

We have shown that the expected initial sequence length (Ŝ), numbers of mortal insertions (Î ) and deletions (D), and accrued time for mortal (Ẑ) and immortal (T ) links, along with the expected initial composition (ŝ i ), substitution counts (û ij ) and accrued times (ŵ i ) for each residue, constitute 'sufficient statistics' for maximizing the expected log-likelihood of a given set of alignments under the TKF91 evolutionary model. This EM algorithm is orders of magnitude faster than brute-force methods of searching the indel/substitution rate space, requiring only an O((T / t) 2 ) overhead for numerical integration, where t is the timestep for the discretized integral. We have implemented such an EM algorithm and tested it by simulation and application to biological data. In the prevalent situation where a multiple alignment is unknown, the calculation of the sufficient statistics by summing over all alignments takes time O(L N ), where L is the (geometric mean) sequence length and N is the number of sequences. However, a stochastic version of the EM algorithm, which uses MCMC alignment sampling on local neighbourhoods of the tree, can approximate these statistics efficiently (and improve alignment accuracy) in O(SL K ) time, where K is the neighbourhood size (i.e. the number of nodes whose mutual alignment is simultaneously resampled at any one step) and S is the number of sampling steps between stochastic EM updates. Naturally, one can also proceed on the assumption that the best alignment found by an alignment algorithm is the 'true' multiple alignment, although such an assumption may lead to systematic biases, such as undercounting indels.

Recent years have seen considerable interest in the derivation of stochastic evolutionary processes for biological sequences and/or structures (Thorne et al., 1992; Knudsen and Miyamoto, 2003; Miklós et al., 2004; Lunter and Hein, 2004; Siepel and Haussler, 2004; Holmes, 2004) . Several of these are so closely related to TKF91 that the theory derived here is directly applicable. For example, the TKF92 model (Thorne et al., 1992 ) (see also Section 2.3) is a birthdeath process on sequence fragments, rather than residues, so that the number of matches (a quantity appearing in the expressions for the EM-sufficient statistics) is found by counting the number of aligned fragments rather than the number of aligned residues. Similarly, an evolutionary model for RNA secondary structure has recently been described, together with a recipe for computing the sufficient statistics using the Inside-Outside algorithm (Holmes, 2004) . Models that depart more substantially from TKF91, such as 'long indel' models (Knudsen and Miyamoto, 2003; Miklós et al., 2004) and context-sensitive substitution models (Lunter and Hein, 2004; Siepel and Haussler, 2004) , have generally not been solved completely (i.e. we still lack exact algebraic expressions for the sequence likelihood) and so the TKF91 theory developed here will be less directly applicable. These models may mandate different approaches to rate estimation, such as enumeration of mutation trajectories (Miklós et al., 2004) , truncated Taylor expansions (Lunter and Hein, 2004) and variational approximations (Siepel and Haussler, 2004) .

The EM algorithm-in the form of the Baum-Welch algorithm-is one cornerstone of the application of profile HMMs in bioinformatics. We hope that the theoretical developments described here may contribute to the efforts to make evolutionary models similarly useful for probabilistic sequence analysis and phylogenomics.

Pfam 3.1: 1313 multiple alignments match the majority of proteins

Weighted neighbor joining: a likelihood-based approach to distance-based phylogeny reconstruction

Maximum likelihood from incomplete data via the EM algorithm

Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids

Inferring Phylogenies

A structural EM algorithm for phylogenetic inference

An algorithm for statistical alignment of sequences related by a binary tree

Statistical alignment: computational properties, homology testing and goodness-of-fit

Using guide trees to construct multiple-sequence evolutionary HMMs

A probabilistic model for the evolution of RNA structure

Evolutionary HMMs: a Bayesian approach to multiple alignment

An Expectation Maximization algorithm for training hidden substitution models

A First Course in Stochastic Processes

Sequencing and comparison of yeast species to identify genes and regulatory elements

Sequence alignments and pair hidden Markov models using evolutionary history

A nucleotide substitution model with nearest-neighbour interactions

The genome sequence of the SARS-associated coronavirus

Assessing variability by joint sampling of alignments and mutation rates

A long indel model for evolutionary sequence alignment

Tree-based maximal likelihood substitution matrices and hidden Markov models

Numerical Recipes in C

Phylogenetic estimation of contextdependent substitution rates by maximum likelihood

An evolutionary model for maximum likelihood alignment of DNA sequences

Inching toward reality: an improved likelihood model of sequence evolution

Combining protein evolution and secondary structure

The author would like to thank Bob Griffiths, Von Bing Yap and Terry Speed for helpful discussions, and two anonymous reviewers for their suggestions. This work was partially supported by grants from EPSRC (code HAMJW) and MRC (code HAMKA).

Supplementary data for this paper are available on Bioinformatics online.