key: cord-0703460-9zfct1yx
authors: Altschul, Stephen F.
title: Amino acid substitution matrices from an information theoretic perspective
date: 1991-06-05
journal: Journal of Molecular Biology
DOI: 10.1016/0022-2836(91)90193-a
sha: 54f98bcddfb59508c278671a79a7d21197c0c536
doc_id: 703460
cord_uid: 9zfct1yx

Abstract Protein sequence alignments have become an important tool for molecular biologists. Local alignments are frequently constructed with the aid of a “substitution score matrix” that specifies a score for aligning each pair of amino acid residues. Over the years, many different substitution matrices have been proposed, based on a wide variety of rationales. Statistical results, however, demonstrate that any such matrix is implicitly a “log-odds” matrix, with a specific target distribution for aligned pairs of amino acid residues. In the light of information theory, it is possible to express the scores of a substitution matrix in bits and to see that different matrices are better adapted to different purposes. The most widely used matrix for protein sequence comparison has been the PAM-250 matrix. It is argued that for database searches the PAM-120 matrix generally is more appropriate, while for comparing two specific proteins with suspected homology the PAM-200 matrix is indicated. Examples discussed include the lipocalins, human α 1B-glycoprotein, the cystic fibrosis transmembrane conductance regulator and the globins.

General methods for protein sequence comparison were introduced to molecular biology 20 years ago and have since gained widespread use. Most early attempts to measure protein sequence similarity focused on global sequence alignments, in which every residue of the two sequences compared had to participate (Needleman & Wunsch, 1970; Sellers, 1974; Sankoff & Kruskal, 1983). However, because distantly related proteins may share only isolated regions of similarity, e.g. in the vicinity of an active site, attention has shifted to local as opposed to global sequence similarity measures. The basic idea is to consider only relatively conserved subsequences; dissimilar regions do not contribute to or subtract from the measure of similarity. Local similarity may be studied in a variety of ways. These include measures based on the longest matching segments of two sequences with a specified number or proportion of mismatches (Arratia et al., 1986; Arratia & Waterman, 1989), as well as methods that compare all segments of a fixed, predefined "window" length (McLachlan, 1971). The most common practice, however, is to consider segments of all lengths, and choose those that optimize a similarity measure (Smith & Waterman, 1981; Goad & Kanehisa, 1982; Sellers, 1984). This has the advantage of placing no a priori restrictions on the length of the local alignments sought. Most database search methods have been based on such local alignments (Lipman & Pearson, 1985; Pearson & Lipman, 1988; Altschul et al., 1990).

To evaluate local alignments, scores generally are assigned to each aligned pair of residues (the set of such scores is called a substitution matrix), as well as to residues aligned with nulls; the score of the overall alignment is then taken to be the sum of these scores. Specifying an appropriate amino acid substitution matrix is central to protein comparison methods and much effort has been devoted to defining, analyzing and refining such matrices (McLachlan, 1971; Dayhoff et al., 1978; Schwartz & Dayhoff, 1978; Feng et al., 1985; Rao, 1987; Risler et al., 1988). One hope has been to find a matrix best adapted to distinguishing distant evolutionary relationships from chance similarities. Recent mathematical results (Karlin & Altschul, 1990; Karlin et al., 1990) allow all substitution matrices to be viewed in a common light, and provide a rationale for selecting particular sets of "optimal" scores for local protein sequence comparison.

Global alignments are of essentially no use unless they can allow gaps, but this is not true for local alignments. The ability to choose segments with arbitrary starting positions in each sequence means that biologically significant regions frequently may be aligned without the need to introduce gaps. While, in general, it is desirable to allow gaps in local alignments, doing so greatly decreases their mathematical tractability.

The results described here apply rigorously only to local alignments that lack gaps, i.e. to segments of equal length from each of the two sequences compared. An MSP (maximal segment pair) may be of any length; its score is the MSP score.

Since any two protein sequences, related or unrelated, will have some MSP score, it is important to know how great a score one can expect to find simply by chance. To address this question one needs some model of chance. The simplest is to assume that in the two proteins compared, the amino acid $a_i$ appears randomly with the probability $p_i$. These probabilities are chosen to reflect the observed frequencies of the amino acids in actual proteins.

For simplicity of discussion we will assume both proteins share the same amino acid probability distribution; more generally, one can allow them to have different distributions. A random protein sequence is simply one constructed according to this model. For the sake of the statistical theory, we need to make two crucial but reasonable assumptions about the substitution scores $s_{ij}$. The first is that there be at least one positive score, and the second is that the expected score $\sum_{i,j} p_i p_j s_{ij}$ be negative. Because we permit the length of a segment pair to be adjusted to optimize its score, both these assumptions are necessary also from a practical perspective.

If there were no positive scores, the MSP would always consist of a single pair of residues (or none at all, if this were permitted), and such an alignment is not of interest.

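If the expected score for two random residues were positive, extending a segment pair as far as possible would always tend to increase its score. Under the two assumptions above, the parameter $\lambda$ is the unique positive solution to the equation:

$$\sum_{i,j} p_i p_j e^{\lambda s_{ij}} = 1 \qquad (1)$$

Equation (1) reveals that multiplying all scores by a constant $a$ also has the effect of dividing $\lambda$ by $a$. The parameter $\lambda$ may, therefore, be viewed as a natural scale for any scoring system; its deeper meaning will be discussed below.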
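The scaling property of the parameter can be checked numerically. The following is a minimal sketch, assuming that the relation referred to as equation (1) is the Karlin-Altschul condition $\sum_{i,j} p_i p_j e^{\lambda s_{ij}} = 1$; the function name `solve_lambda`, the bisection approach and the two-letter alphabet are illustrative assumptions, not part of the original analysis.

```python
import math

def solve_lambda(p, s, tol=1e-12):
    """Positive root of sum_ij p[i]*p[j]*exp(lam*s[i][j]) = 1, found by bisection."""
    n = len(p)
    pairs = [(p[i] * p[j], s[i][j]) for i in range(n) for j in range(n)]

    # The two assumptions stated in the text.
    if max(score for _, score in pairs) <= 0:
        raise ValueError("at least one score must be positive")
    if sum(w * score for w, score in pairs) >= 0:
        raise ValueError("the expected score must be negative")

    def f(lam):
        return sum(w * math.exp(lam * score) for w, score in pairs) - 1.0

    # f(0) = 0; f is negative just above 0 and positive for large lam.
    lo, hi = 0.0, 1.0
    while f(hi) <= 0.0:
        lo, hi = hi, 2.0 * hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical two-letter alphabet: match +1, mismatch -2, equal frequencies.
p = [0.5, 0.5]
s = [[1, -2], [-2, 1]]
lam = solve_lambda(p, s)
lam_doubled = solve_lambda(p, [[2 * x for x in row] for row in s])
print(lam, lam_doubled)
```

For this toy system the root can also be found by hand (it is the natural logarithm of the golden ratio), and doubling every score halves the computed value, as the text states.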

Given two random protein sequences as described above, how many distinct, or "locally optimal" (Sellers, 1984), MSPs with score at least $S$ are expected to occur simply by chance? This number is well approximated by the formula:

$$K N e^{-\lambda S} \qquad (2)$$

where $N$ is the product of the sequences' lengths, and $K$ is an explicitly calculable parameter (Karlin & Altschul, 1990; Karlin et al., 1990). When comparing a single random sequence with all the sequences in a database, setting $N$ to the product of the query sequence length and the database length (in residues) yields an upper bound on the number of distinct MSPs with score at least $S$ that the search is expected to yield.

[...] it has been argued, can the matrix be optimal for distinguishing distant local homologies from similarities due to chance (Karlin & Altschul, 1990). Any substitution matrix has an implicit set of target frequencies for aligned amino acids. Writing the scores of the matrix in terms of its target frequencies, one has:

$$s_{ij} = \frac{1}{\lambda} \ln \frac{q_{ij}}{p_i p_j} \qquad (3)$$

In other words, the score for an amino acid pair can be written as the logarithm to some base of that pair's target frequency divided by the background frequency with which the pair occurs. Such a ratio compares the probability of an event occurring under two alternative hypotheses and is called a likelihood or odds ratio. Scores that are the logarithm of odds ratios are called log-odds scores. Adding such scores can be thought of as multiplying the corresponding probabilities, which is appropriate for independent events, so that the total score remains a log-odds score. Log-odds matrices have been advocated in a number of contexts (Dayhoff et al., 1978; Gribskov et al., 1987; Stormo & Hartzell, 1989). The widely used PAM matrices (Dayhoff et al., 1978), for instance, are explicitly of this form. Other substitution matrices, though based on a wide variety of rationales, are all log-odds matrices, but with implicit rather than explicit target frequencies. Therefore, while one may criticize the method described by Dayhoff et al. for estimating appropriate target frequencies (Wilbur, 1985), the most direct way to derive superior matrices appears to be through the refined estimation of amino acid pair target and background frequencies rather than through any fundamentally different approach.
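The additivity of log-odds scores can be illustrated with a small sketch. The two-letter background and target frequencies below are hypothetical values chosen only for illustration (they are not amino acid data), and `log_odds_bits` is an assumed helper name; base 2 is used so that scores come out in bits.

```python
import math

def log_odds_bits(q, p):
    """Scores s[i][j] = log2(q[i][j] / (p[i] * p[j])), i.e. log-odds in bits."""
    n = len(p)
    return [[math.log2(q[i][j] / (p[i] * p[j])) for j in range(n)]
            for i in range(n)]

# Hypothetical two-letter model (not real amino acid data).
p = [0.5, 0.5]                    # background frequencies
q = [[0.4, 0.1], [0.1, 0.4]]      # target frequencies for aligned pairs
s = log_odds_bits(q, p)

# The score of a gapless alignment is the sum of its pair scores, which
# equals the log of the product of the corresponding odds ratios.
alignment = [(0, 0), (0, 1), (1, 1)]
total = sum(s[i][j] for i, j in alignment)
odds_product = math.prod(q[i][j] / (p[i] * p[j]) for i, j in alignment)
print(total, math.log2(odds_product))
```

Pairs that are more frequent in the target distribution than expected by chance receive positive scores, and the rest receive negative scores.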

Matrices for Global Alignments

While we have been considering substitution matrices in the context of local sequence comparison, they may be employed for global alignment as well (Needleman & Wunsch, 1970; Sellers, 1974; Schwartz & Dayhoff, 1978). There is a fundamental difference, however, between the use of such matrices in these two contexts. For global alignments, as previously, multiplying all scores by a fixed positive number has no effect on the relative scores of different alignments. But adding a fixed quantity $a$ to the score for aligning any pair of residues (and $a/2$ to the score for aligning a residue with a null) likewise has no effect. Scoring systems that may be transformed into one another by means of these two rules are, for all practical purposes, equivalent. Unfortunately, the new transformation means that no unique log-odds interpretation of global substitution matrices is possible, and it is doubtful that any "target distribution" theorem can be proved. It may be possible to make a convincing case for a particular substitution matrix in the global alignment context, but the argument will most likely have to be different from that for local alignments (Karlin & Altschul, 1990). The same applies to substitution matrices used with fixed-length windows for studying local similarities (McLachlan, 1971; Argos, 1987; Stormo & Hartzell, 1989): a fixed quantity can be added to all entries of such a matrix with no essential effect. It is notable that while the PAM matrices were developed originally for global sequence comparison (Dayhoff et al., 1978), their statistical theory has blossomed in the local alignment context.
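The reason the additive transformation leaves global alignments unaffected can be made explicit with a short counting argument. The symbols $m$, $n$, $k$ and $g$ are introduced here only for illustration: $m$ and $n$ are the lengths of the two sequences, $k$ is the number of aligned residue pairs in a given global alignment, and $g$ is the number of residues aligned with nulls. Every residue is used exactly once, so the shift in score is the same for every alignment:

```latex
2k + g = m + n
\quad\Longrightarrow\quad
S' = S + a\,k + \tfrac{a}{2}\,g = S + \tfrac{a}{2}\,(m + n)
```

Since the added term depends only on the sequence lengths, the ranking of alignments is unchanged. No such identity holds for local alignments, whose lengths are free to vary.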

Scores as Measures of Information

Multiplying a substitution matrix by a constant changes $\lambda$ but does not alter the matrix's implicit target frequencies. By appropriate scaling, one may therefore select the parameter $\lambda$ at will. Writing the matrix in log-odds form, such scaling corresponds merely to using a different implicit base for the logarithm. One natural choice for $\lambda$ is 1, so that all scores become natural logarithms. Perhaps more appealing is to choose $\lambda = \ln 2 \approx 0.693$, so that the base for the log-odds matrix becomes 2. This lends a particularly intuitive appeal to formula (2). Setting the expected number of MSPs with score at least $S$ equal to $p$, and solving for $S$, one finds:

$$S = \log_2 \frac{K}{p} + \log_2 N \qquad (4)$$

For typical substitution matrices, $K$ is found to be near 0.1, and an alignment may be considered significant when $p$ is 0.05. Therefore the right-hand side of equation (4) generally is dominated by the term $\log_2 N$. In other words, the score needed to distinguish an MSP from chance is approximately the number of bits needed to specify where the MSP starts in each of the two sequences being compared. (One bit can be thought of as the answer to a single yes-no question; it is the amount of information needed to distinguish between 2 possibilities. It becomes apparent that, in general, $\log_2 N$ bits of information are needed to distinguish among $N$ possibilities.)

For comparing two proteins of length 250 amino acid residues, about 16 bits of information are required; for comparing one such protein to a sequence database containing 4,000,000 residues, about 30 bits are needed. When cast in this light, alignment scores are not arbitrary numbers. By appropriate scaling (multiplying by $\lambda$/0.693) they take on the units of bits, and rough significance calculations can be performed in one's head. Furthermore, when so normalized, different amino acid substitution matrices may be directly compared.
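These figures can be reproduced with a few lines of arithmetic. The sketch below assumes that the expected number of chance MSPs with score at least $S$ bits is $K N 2^{-S}$, with $K = 0.1$ and a significance level of 0.05 as quoted above; the function names are hypothetical.

```python
import math

def expected_msps(score_bits, n, k=0.1):
    """Expected number of chance MSPs with score >= score_bits: K * N * 2**(-S)."""
    return k * n * 2.0 ** (-score_bits)

def score_needed_bits(n, k=0.1, p=0.05):
    """Score S (in bits) at which the expected number of chance MSPs equals p."""
    return math.log2(k / p) + math.log2(n)

pairwise = 250 * 250            # two proteins of 250 residues each
database = 250 * 4_000_000      # one such protein against 4,000,000 residues
for label, n in (("pairwise", pairwise), ("database", database)):
    # Dominant log2(N) term, then the full threshold including log2(K/p).
    print(label, round(math.log2(n), 1), round(score_needed_bits(n), 1))
```

The dominant $\log_2 N$ term gives roughly 16 and 30 bits for the two cases; with these particular values of $K$ and $p$ the remaining term contributes one further bit.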

The above review of previous results has provided us with the necessary tools for the analysis that follows. The ultimate goal is to decide which substitution matrices are the most appropriate for database searching and for detailed pairwise sequence comparison.

Given a random protein model and a substitution matrix, one may calculate the target frequencies $q_{ij}$ characteristic of the alignments for which the matrix is optimized. A useful quantity to consider is the average score (information) per residue pair in these alignments. Assuming the substitution matrix is normalized as described above, this value is simply:

$$H = \sum_{i,j} q_{ij} s_{ij} = \sum_{i,j} q_{ij} \log_2 \frac{q_{ij}}{p_i p_j} \qquad (5)$$

Notice that $H$ depends both on the substitution matrix and on the random protein model. In information theoretic terms, $H$ is the relative entropy of the target and background distributions. The origin of the name need not be of concern. The important point is that, for an alignment characterized by the target frequencies $q_{ij}$, $H$ measures the average information available per position to distinguish the alignment from chance. Intuitively, the higher the value of the relative entropy of target and background distributions, the more easily they are distinguished.

For a high value of $H$, relatively short alignments with the target distribution can be distinguished from chance, while, if the value of $H$ is lower, longer alignments are necessary.
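As a minimal sketch of this quantity, the relative entropy in bits can be computed directly from target and background frequencies. The two-letter distributions are hypothetical and serve only to show the calculation; `relative_entropy_bits` is an assumed helper name.

```python
import math

def relative_entropy_bits(q, p):
    """H = sum_ij q[i][j] * log2(q[i][j] / (p[i] * p[j])), skipping zero entries."""
    n = len(p)
    return sum(q[i][j] * math.log2(q[i][j] / (p[i] * p[j]))
               for i in range(n) for j in range(n) if q[i][j] > 0)

# Hypothetical two-letter model (not real amino acid data).
p = [0.5, 0.5]
q_conserved = [[0.4, 0.1], [0.1, 0.4]]       # matches favored over chance
q_random = [[0.25, 0.25], [0.25, 0.25]]      # identical to the background

h_conserved = relative_entropy_bits(q_conserved, p)
h_random = relative_entropy_bits(q_random, p)
print(h_conserved, h_random)
```

When the target frequencies coincide with the background frequencies the relative entropy is zero, so no alignment length suffices to distinguish such alignments from chance.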

It is interesting to examine the PAM model of molecular evolution (Dayhoff et al., 1978) from this standpoint. From a study of mutations between a large number of closely related proteins, Dayhoff and co-workers proposed a stochastic model of protein evolution. The amount of evolutionary change that yields, on average, one substitution in 100 amino acid residues they called one PAM. Using their model, one may easily calculate the frequency with which any two amino acid residues are paired in an accurate alignment of two homologous proteins that have diverged by any given amount of evolutionary change. These target frequencies may then be used to construct log-odds matrices and, in particular, the widely used PAM-250 matrix. Dayhoff et al. (1978) originally proposed this matrix for the global alignment of two sequences suspected to be homologous, but it has since been used to search protein databases for local alignments to a query sequence (Lipman & Pearson, 1985; Pearson & Lipman, 1988). One may therefore inquire whether 250 PAMs yield reasonable target frequencies for database searches.

Assuming the model described by Dayhoff et al. (1978), Table 1 lists the relative entropy $H$ implicit in a range of PAM matrices. As argued above, distinguishing an alignment from chance in a search of a typical current protein database using an average length protein requires about 30 bits of information. Accordingly, for an alignment of segments separated by a given PAM distance, one can calculate the minimum length necessary to rise above background noise; these lengths are recorded in Table 1. For instance, at a distance of 250 PAMs, on average only 0.36 bit of information is available per alignment position. To be statistically significant, such an alignment would need to have a length greater than about 83 residues. Many biologically interesting regions of protein similarity are much shorter than this, and accordingly need a stronger signal to be detected. A local alignment of length 20 residues will need about 1.5 bits per alignment position, while one of length 40 residues will need about 0.75 bit. Table 1 shows that such alignments will not be detectable if their constituent [...]

For a given alignment, one can attain such a score only by using the appropriate PAM matrix, but, of course, before the alignment is found it will not be known which matrix that is. It has therefore been proposed that a variety of PAM matrices be used for database searches (Collins et al., 1988). We seek here to analyze how many such matrices are necessary, and which should be used.
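The critical lengths follow from dividing the information required by the information available per position. A small sketch, taking the 30-bit requirement for a database search as given and using hypothetical function names:

```python
def minimum_length(bits_per_position, bits_required=30.0):
    """Alignment length needed to accumulate bits_required at the given rate."""
    return bits_required / bits_per_position

def bits_per_position_needed(length, bits_required=30.0):
    """Average information per position needed by an alignment of this length."""
    return bits_required / length

print(minimum_length(0.36))            # relative entropy quoted for 250 PAMs
print(bits_per_position_needed(20))    # a short alignment of 20 residues
```

With 0.36 bit per position the required length comes out just over 83 residues, and a 20-residue alignment needs 1.5 bits per position.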

Suppose one uses a matrix optimized for PAM distance M to compare two homologous protein segments that are actually separated by PAM distance D. For a range of values of M and D, the average score achieved per alignment position is shown in Table 2. Notice that for any given matrix M, the smaller the actual distance D, the higher the score. On the other hand, for a specific distance D, the highest score corresponds to the matrix with PAM distance M = D; this score is just the relative entropy discussed above. Using a PAM matrix with M near D, however, can yield a near-optimal score. For example, the relative entropy for D = 160 is 0.70 bit, but any PAM matrix in the range 120 to 200 yields at least 0.67 bit per position.
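Both observations can be reproduced in miniature. The sketch below does not use the PAM model: it assumes a hypothetical two-letter alphabet in which a "distance" is simply the fraction of mismatched pairs, and all names (`toy_target`, `average_score_bits`) are illustrative. The average score per position is the sum over pairs of the actual pair frequency times the log-odds score of the matrix in use.

```python
import math

P = [0.5, 0.5]   # background frequencies of the hypothetical two-letter alphabet

def toy_target(mismatch):
    """Target frequencies with total mismatch fraction `mismatch` (toy model)."""
    same = (1.0 - mismatch) / 2.0
    diff = mismatch / 2.0
    return [[same, diff], [diff, same]]

def average_score_bits(q_actual, q_matrix, p=P):
    """Mean score per position when alignments follow q_actual but are scored
    with the log-odds matrix built from q_matrix."""
    n = len(p)
    return sum(q_actual[i][j] * math.log2(q_matrix[i][j] / (p[i] * p[j]))
               for i in range(n) for j in range(n))

candidates = [0.1, 0.2, 0.3, 0.4]          # matrices tuned to different distances
actual = toy_target(0.3)
best = max(candidates, key=lambda m: average_score_bits(actual, toy_target(m)))

# For a fixed matrix, closer sequences score higher.
fixed = toy_target(0.3)
closer = average_score_bits(toy_target(0.1), fixed)
farther = average_score_bits(toy_target(0.3), fixed)
print(best, closer, farther)
```

The matrix whose target frequencies match the actual alignment gives the highest average score, and that score is the relative entropy; neighboring matrices lose only a little.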

In practice, how near the optimal is it important to be? As argued above, for a given PAM distance there is a critical length at which alignments are just distinguishable from chance in a typical current database search; these lengths are recorded in Table 1. For the sake of analysis, we will assume that it is worth performing an extra search (using a different PAM matrix) only if it is able to increase the score for such a critical alignment by about two bits, corresponding to a factor of 4 in significance. Since such an alignment has about 30 bits of information, we will therefore be satisfied using a PAM matrix that yields a score greater than 93% of the optimal achievable.
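The 93% figure follows directly from the two-bit criterion, assuming a critical alignment carries about 30 bits: a matrix is acceptable if it loses no more than two of those bits.

```latex
\frac{30 - 2}{30} = \frac{28}{30} \approx 0.933
```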

Using data such as those shown in Table 2, one can calculate for which PAM distances D (and thus for which critical lengths) a given matrix M is appropriate; the results are recorded in Table 3. Our experience has shown that perhaps the most typical lengths for distant local alignments are those for which the PAM-120 matrix gives near-optimal scores, i.e. lengths 19 to 50 residues. Therefore, if one wishes to use a single standard matrix for database searches, the PAM-120 matrix (Table 4) is a reasonable choice. This matrix may, however, miss short but strong or long but weak similarities that contain sufficient information to be found. Accordingly, [...]

Table 4. The PAM-120 matrix with scores in half bits.

In Table 3, we list the range of critical lengths over which various PAM matrices are appropriate for detailed pairwise sequence comparison. As a single matrix, the PAM-200 spans the most typical range of local alignment lengths, i.e. 16 to 62 residues. Alternatively, if two different matrices are to be used, the PAM-80 and PAM-250, which together span alignment lengths 6 to 85 residues, or the PAM-120 and PAM-320 matrices, which span lengths 9 to 124 residues, appear to be appropriate pairs.

Since it is convenient to express substitution matrices as integers, and since a probability factor of 2 between score levels is too rough, the units for the PAM-120 matrix shown in Table 4 are half bits. The scores in the original PAM-250 matrix (Dayhoff et al., 1978) were scaled as $10 \times \log_{10}$. Because $10/(\ln 10) \approx 3/(\ln 2)$ to within 0.4%, a unit score in that matrix can be thought of as approximately one-third of a bit.
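The unit conversion can be verified numerically. The sketch below compares the two scale factors and converts a score expressed in the 10 × log10 scaling to bits; the helper name `dayhoff_units_to_bits` is an assumption made for illustration.

```python
import math

dayhoff_scale = 10 / math.log(10)    # units per nat for 10 * log10 scaling
third_bit_scale = 3 / math.log(2)    # units per nat for third-bit scaling
relative_difference = abs(dayhoff_scale - third_bit_scale) / third_bit_scale

def dayhoff_units_to_bits(score):
    """Convert a score in 10 * log10 units to bits (log2 units)."""
    return score * math.log2(10) / 10

print(relative_difference)
print(dayhoff_units_to_bits(3))      # three units are very nearly one bit
```

Multiplying the result by 2 would express the same score in the half-bit units used for Table 4.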

As discussed, the particular PAM matrix that best distinguishes distant homologies from chance similarities found in a database search depends on the nature of the homologies present, and this cannot be known a priori. However, it is frequently the case that distantly related proteins will share isolated stretches of relatively conserved amino acid residues, corresponding to active sites or other important structural features. It has been observed that in general the mutations along genes coding for proteins are not Poisson-distributed (Uzzell & Corbin, 1971; Holmquist et al., 1983), suggesting that short, conserved regions are to be expected. As shown in Table 3, this means that the widely used PAM-250 matrix generally will not be optimal for locating distant relationships.

In the examples below, we compare the PAM-250 and PAM-120 scores for MSPs representing distant relationships to four different query sequences. In all cases, we consider relationships near the limits of what can be distinguished from chance in a search of the PIR protein sequence database (Release 26.0; 7,348,950 residues). It will be noticed that the highest chance PAM-250 scores are consistently slightly smaller than the highest chance PAM-120 scores. This is primarily attributable to the fact that the parameter $K$ discussed above is about half as large for the former scores as for the latter. Furthermore, since neither the PIR database nor a given query sequence ever precisely fits the random protein model described by Dayhoff et al. (1978), the parameter $\lambda$ varies slightly from one comparison to another. Therefore, while we will treat the PAM-120 scores from Table 4 [...]

[...] (a,-microglobulin) superfamily, which contains proteins that exhibit a wide range of functions related to their ability to bind small hydrophobic ligands. The similarities among these proteins and their biological roles have been analyzed (Peitsch & Boguski, 1990), and crystal structures are available for several members of the superfamily (Cowan et al., 1990). Three proteins in the superfamily are rat androgen-dependent epididymal protein (PIR code [...] Using PAM-250 scores, the maximal segment pair for each of these sequences when compared to LPHUD is shown in Table 5. These local similarities correspond to one of two motifs that are conserved throughout the superfamily (Boguski & States, 1990). The scores for the three alignments are 27.0, 25.7 and 23.0 bits, respectively. However, the highest score from a protein in the database unrelated to LPHUD is 27.0 bits, involving human surface glycoprotein CD16 precursor (PIR code SO&FiX; Simmons & Seed, 1988). The PAM-250 matrix therefore fails to separate the homologous alignments shown from background noise. In contrast, using the PAM-120 matrix of Table 4, the scores for the three alignments jump to 33.5, 33.5 and 30.5 bits, respectively. (The 1st 7 alignment positions for LPHUD-SQIITAI) shown in Table 5 are dropped in an optimal PAM-120 alignment, as are the 1st 3 positions for the LPHUD-A32202 alignment.) This raises their scores above that of the best chance PAM-120 alignment (%!I+ bits), again involving human surface glycoprotein CD16 precursor. Notice that, in both cases, the estimate that about 30 bits are needed clearly to distinguish an MSP from chance is valid. For this query sequence, no relationship is found using the PAM-250 matrix that is missed by the PAM-120.

We searched the PIR database with human α1B-glycoprotein (PIR code OMHrl K; Ishioka et al., 1986), a plasma glycoprotein of unknown function, and a member of the immunoglobulin superfamily. Using the PAM-250 matrix, the only protein in the database with an MSP that rises above background noise is pig PO% F protein (PIR code J'l,OO30; Van de Weghe et al., 1988), which achieves a score of 32.3 bits. As shown in Table 6, the score for this known homology (Van de Weghe et al., 1988) rises to 15.0 bits when the PAM-120 matrix is used instead. In addition, two proteins with immunoglobulin domains, kinase-related transforming protein precursor (PIR code SOO17-t; Qin et al., 1988) and human Ig κ chain precursor V-III region (PIR code KSHUVH; Pech & Zachau, 1984), achieve scores of 29.0 and 28.5 bits, respectively. Table 6 illustrates that both these similarities are only just distinguishable from chance, and that using the PAM-250 matrix both similarities drop in score by at least four bits.

(c) The cystic fibrosis transmembrane conductance regulator

The cause of cystic fibrosis has been traced to mutations in a protein that bears striking similarity to many proteins involved in the transport of substances across the cell membrane (PIR code A30300; Riordan et al., 1989). Characteristic features of the protein are two nucleotide (ATP)-binding folds (Higgins et al., 1986). When the PIR database is searched with A30300, many related proteins may be identified easily using either the PAM-250 or the PAM-120 substitution matrix. However, several distant relationships present are harder to detect. In Table 7 are shown four optimal PAM-250 alignments, representing homologies to each of the two A30300 nucleotide-binding folds. None of these alignments has a PAM-250 score as great as the highest chance score of 31.3 bits. In contrast, when the PAM-120 matrix is used, the alignments jump in score by 4 to almost 12 bits, giving all but one a score greater than the highest chance PAM-120 score of 33.4 bits. (The boundaries of the optimal alignments change slightly under the alternate scoring scheme.) No biologically significant similarity is distinguished by the PAM-250 matrix that is not found using the PAM-120. The relatively high chance scores found in this example are partly attributable to the length of the query sequence (1480 residues), and partly to its composition, which renders the parameter $\lambda$ slightly smaller than in the previous examples.

It is possible to find examples of long alignments representing distant relationships that are better distinguished by the PAM-250 than by the PAM-120 matrix. In practice such examples are rare, for some of the reasons discussed above. The globins are one superfamily in which sequence divergence has been relatively uniform over the length of entire proteins. As a result, some sequence relationships within this superfamily become apparent only with scoring systems tailored for long but very weak alignments.

For example, searching the PIR database with broad bean leghemoglobin I (PIR code GPVF; Richardson et al., 1975), the alignment with sea cucumber hemoglobin I (PIR code S06134; Suzuki, 1989), shown in Figure 1, is found having a PAM-250 score of 25.3 bits. This is almost as high as the score of the best chance MSP (26.7 bits), which involves Salmonella typhimurium cystathionine β-lyase (PIR code JVOO2fJ; Park & Stauffer, 1989). The alignment is 92 residue pairs long; only 14 of these pairs involve identical amino acid residues, and they are spread fairly evenly along the alignment. This particular similarity is totally obscured when PAM-120 scores are used. The best region of the alignment shown then involves residues 100 to 133 of the leghemoglobin sequence and has a score of only 13 bits, while the best chance PAM-120

This paper has analyzed the properties of amino acid substitution matrices in the context of local alignments lacking gaps. This is exactly the sort of alignment sought by the recently developed BLAST database search programs (Altschul et al., 1990; Altschul & Lipman, 1990). We have concluded that for protein databases of typical current size (about 1 × 10^7 residues), the most broadly sensitive substitution matrix should be a log-odds matrix with relative entropy of about one bit, e.g. the PAM-120 matrix. In order to detect short but strong homologies or long but weak ones, this matrix can be complemented by the PAM-40 and PAM-250 matrices; additional matrices should be of only marginal utility. Of course, many database search methods, such as the FASTA programs (Lipman & Pearson, 1985; Pearson & Lipman, 1988), seek local alignments with gaps, and such measures are potentially more sensitive to distant homologies. Unfortunately, if gaps with associated scores are allowed, the specific quantitative discussion above is no longer correct. Nevertheless, the general thrust of the arguments should still apply, and theory and experiment suggest that analogous results will hold for local alignments with gaps (Smith et al., 1985; Waterman et al., 1987; Collins et al., 1988).

There are, of course, many much more involved ways for assessing local alignment than those discussed here. Scores can be assigned to aligned di-residues or tri-residues; they can depend on alignment length (Altschul & Erickson, 1986); or they can be complex combinations of various scoring methods (Argos, 1987). Protein databases may also be searched with position-dependent scores or "profiles" constructed from multiple alignments (Taylor, 1986; Gribskov et al., 1987; Patthy, 1987). In certain contexts such systems may well be more sensitive than the straightforward local scoring system considered here. Two advantages of simple additive scores are their amenability to powerful algorithmic methods (Altschul et al., 1990) and to rigorous statistical analysis (Karlin & Altschul, 1990; Karlin et al., 1990). Such analysis may also yield insight into the properties of more complicated scoring schemes.

The author thanks Drs David Lipman, Mark Boguski and Andrew McLachlan for helpful conversations and suggestions on the manuscript.

A nonlinear measure of subalignment similarity and its significance levels

Protein database searches for multiple alignments

Basic local alignment search tool

Argos, P. (1987). J. Mol. Biol. 193, 385-396.