key: cord-012975-u87ol3fs authors: Ogiwara, Atsushi; Uchiyama, Ikuo; Seto, Yasuhiko; Kanehisa, Minoru title: Construction of a dictionary of sequence motifs that characterize groups of related proteins date: 1992-09-17 journal: Protein Eng DOI: 10.1093/protein/5.6.479 sha: doc_id: 12975 cord_uid: u87ol3fs An automatic procedure is proposed to identify, from the protein sequence database, conserved amino acid patterns (or sequence motifs) that are exclusive to a group of functionally related proteins. This procedure is applied to the PIR database and a dictionary of sequence motifs that relate to specific superfamilies constructed. The motifs have a practical relevance in identifying the membership of specific superfamilies without the need to perform sequence database searches in 20% of newly determined sequences. The sequence motifs identified represent functionally important sites on protein molecules. When multiple blocks exist in a single motif they are often close together in the 3-D structure. Furthermore, occasionally these motif blocks were found to be split by introns when the correlation with exon structures was examined. When the amino acid sequences of two proteins are similar, they probably belong to the same group of functionally related proteins. Thus, when a new protein sequence is determined, it is customary to perform a database search for similar sequences in the hope of obtaining a clue to its biological function. The search involves pairwise comparisons against individual sequences in the database. This is becoming more time-consuming with the rapid growth in database size. An alternative approach is to search a library of signature patterns, each of which uniquely identifies a group of related proteins. Whether all protein groups can be represented by such diagnostic patterns is arguable, but this approach is certainly more effective because the comparison is made against individual groups rather than individual sequences in the database. 
It is common knowledge that functionally important sites are well conserved in the amino acid sequences of related proteins. Conserved regions are not necessarily contiguous in the primary structure, because a functional site in the 3-D structure can be composed of separate pieces of conserved segments. The conserved amino acid patterns, often called consensus patterns or sequence motifs (Taylor, 1988; Hodgman, 1989), are usually identified by the tedious method of multiply aligning and comparing a group of functionally related sequences. These published motifs are then manually collected, verified and organized in a motif library (Bairoch, 1989; Seto et al., 1990). An additional constraint on the conserved regions is introduced in this study: the uniqueness of amino acid patterns when compared with all other sequences outside the group. This has enabled the design of an automatic procedure to define from the protein sequence database a collection of signature patterns that uniquely identify specific protein groups. This procedure is applied to the superfamily grouping of the PIR database and a library of sequence motifs is constructed that identifies specific superfamilies.

The amino acid sequences were obtained from the PIR database release 26.0 (September 1990). The PIR database is divided into three sections (two sections before release 26.0): PIR1, annotated and classified entries; PIR2, preliminary entries; and PIR3, unverified entries. Only the PIR1 section is used when constructing a motif library. Releases 19.0 (December 1988) to 29.0 (June 1991) were also used for comparison purposes. The 3-D coordinates of the protein structures were acquired from the Brookhaven Protein Data Bank (April 1991).
Functional groups of proteins

Suppose that a protein sequence database is divided into groups, each containing functionally related members, and that the diagnostic amino acid patterns that uniquely identify the membership of each functional group are required. The PIR superfamily classification is used to define a protein group, but there may also be other definitions. A superfamily is a group of proteins bearing significant sequence similarity and represents the probable evolutionary relationships of the proteins (Dayhoff, 1978; Dayhoff et al., 1983). It is not always the case that a protein is uniquely assigned to one superfamily, because it can contain multiple domains with different functions. For simplicity, however, the PIR superfamily numbering scheme is used, which assumes that each protein in the database belongs to one, and only one, superfamily.

Dictionary of unique peptide words

A three-step procedure is employed to identify the sequence motifs. The first step involves an exhaustive search for unique peptide words (UPWs) which, in our definition, are short oligopeptide patterns that are well conserved and found exclusively in one protein group. A group is usually a single superfamily, but it can be extended to comprise a few superfamilies. In practice, as illustrated in Figure 1, we make a tally of all possible tetra-, penta- and hexapeptide patterns in the superfamilies of the PIR database. Let $n_S$ and $N_T$ be the numbers of sequences containing a given pattern in a given superfamily and in the entire database, respectively. The pattern is unique to this superfamily when $n_S = N_T$. The pattern is conserved when $n_S = N_T \geq f \cdot m$, where m is the number of members belonging to the superfamily and f is the parameter defining the majority. We consider different cases ranging from f = 1 (100% conservation) to f = 0.7 (70% conservation).
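The screening criteria above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the in-memory dictionary of groups and the toy sequences are all assumptions made for illustration, and a real run over the PIR1 section would need a more economical tally.

```python
from collections import defaultdict

def peptide_words(seq, lengths=(4, 5, 6)):
    """Return the distinct tetra-, penta- and hexapeptide words in one sequence."""
    return {seq[i:i + k] for k in lengths for i in range(len(seq) - k + 1)}

def unique_peptide_words(groups, f=0.8, lengths=(4, 5, 6)):
    """groups maps a group id to its member sequences (hypothetical layout).

    A word is kept for a group when n_S == N_T (unique to the group)
    and n_S >= f * m (conserved in at least a fraction f of the m members).
    """
    n_total = defaultdict(int)                       # N_T: sequences in the whole database
    n_group = defaultdict(lambda: defaultdict(int))  # n_S: sequences within each group
    for gid, seqs in groups.items():
        for seq in seqs:
            for word in peptide_words(seq, lengths):  # each sequence counts once per word
                n_total[word] += 1
                n_group[gid][word] += 1
    result = {}
    for gid, seqs in groups.items():
        m = len(seqs)
        result[gid] = sorted(
            word for word, n_s in n_group[gid].items()
            if n_s == n_total[word] and n_s >= f * m
        )
    return result

# Toy data, invented for illustration only.
toy_groups = {
    "A": ["MKQWYWAG", "TLQWYWSP", "HNQWYWDE"],
    "B": ["CCWHFVRR", "PPWHFVTT", "EEWHFVNN", "IIAGLLKK"],
}
strict = unique_peptide_words(toy_groups, f=1.0, lengths=(4,))
relaxed = unique_peptide_words(toy_groups, f=0.7, lengths=(4,))
```

In this toy input the word QWYW is found at both thresholds, whereas WHFV, which is missing from one member of group B, is reported only when f is lowered to 0.7.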
Although the distinction between 100 and 70% conservation is highly dependent on the superfamily size and variability of its members, the uniqueness is mostly determined by the size and variability of the entire database.

Fig. 1. Screening of unique peptide words. This figure shows the numbers of sequences containing given tetrapeptide patterns. Superfamily 95 has 12 member sequences and all contain the pattern QWYW, while no sequence outside this superfamily possesses this pattern. Thus, this pattern is unique to, and conserved in, superfamily 95. The unique pattern WHFV is not 100% conserved in superfamily 96, but this pattern can be detected by setting a lower threshold value for the conservation.

In the second step the order of unique peptide words in each sequence of a given group is examined and a consensus pattern constructed. As illustrated in Figure 2, each amino acid sequence is converted to an abstract structure, which may be called a unique peptide sentence, consisting of the UPW pattern numbers and the numbers of residues separating the first residues of two successive UPWs. One amino acid mutation is allowed when searching for the occurrence of each UPW pattern. When the separation is smaller than the length of the preceding UPW there is actually an overlap between the two UPWs, as in patterns 3 and 4 in Figure 2. From a set of these sentences, some of which may lack specific UPWs and some of which may contain duplicates of the same UPW, a consensus sentence is constructed. This is a multiple alignment problem and an approximate procedure was devised by combining pairwise alignments. The optimal pairwise alignment can be obtained by the following dynamic programming algorithm, which is similar to the RNA secondary structure prediction algorithm (Waterman and Smith, 1978; Kanehisa and Goad, 1982):

$$S_{ij} = \max_{k<i,\; l<j} \left[ S_{kl} - \mathrm{g.p.}(i,j,k,l) \right] + w(P_i, P_j)$$

where $S_{ij}$ is the score up to the ith pattern $P_i$ and jth pattern $P_j$, g.p.
is the gap penalty and w is the weight for a match of two patterns. The resulting consensus pattern is represented by the order of UPWs with the upper- and lower-bound numbers of residues separating two successive UPWs (Figure 2).

The consensus pattern obtained in the previous step is represented by the blocks of amino acid patterns, which we call motif blocks, separated by the upper- and lower-bound numbers of residues in the spacer region as follows:

<motif block1> [min_spacer, max_spacer] <motif block2>

As shown in Figure 3, this consensus is used again in the last step to compare each sequence in the group, to identify substitution patterns and to determine whether each block is conserved in all sequences. In practice, it is first decided whether a particular block exists or not, given the minimum fraction of matched residues, r, that constitute a block. Then, all substitution patterns are recorded. In the representation of our motif library, the plus sign designates that the block is conserved in all members of the group, while the minus sign indicates that some members lack the block. Substitution patterns are enclosed in braces.

Fig. 2. An illustration of how the sequence motif is constructed from unique peptide words. First, the locations of unique peptide words in a given superfamily are examined for all member sequences. Then the consensus ordering of unique peptide words is obtained by a dynamic programming algorithm.

The PIR1 database release 26.0 contains 7235 sequences, totalling 2 221 416 residues, classified into 2350 superfamilies. Only the relatively large superfamilies, those containing at least a set minimum number of member sequences, were considered. When the minimum size of a superfamily in release 26.0 was defined as three or five members, there were 521 or 283 superfamilies, respectively.
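The pairwise alignment of unique peptide sentences used in the second step of the procedure can be sketched in Python as a chaining dynamic program. This is only an illustration under stated assumptions: the text does not give the form of the gap penalty, so the penalty used here (a charge per skipped UPW plus a charge proportional to the difference in spacing) and the parameters `match_weight`, `skip_penalty` and `spacing_penalty` are hypothetical, as is the representation of a sentence as a list of (UPW number, position) pairs.

```python
def align_sentences(a, b, match_weight=10.0, skip_penalty=1.0, spacing_penalty=0.1):
    """Align two unique peptide sentences given as lists of (upw_id, position).

    Returns (best score, list of aligned index pairs (i, j)).
    The gap penalty is an assumed form, not the one used in the paper.
    """
    n, m = len(a), len(b)
    score = [[None] * m for _ in range(n)]   # S_ij, defined only where the UPWs match
    back = [[None] * m for _ in range(n)]    # predecessor (k, l) of each cell
    best, best_cell = 0.0, None
    for i in range(n):
        for j in range(m):
            if a[i][0] != b[j][0]:
                continue                      # only identical UPWs can be paired
            s, prev = match_weight, None      # option: start a new chain at (i, j)
            for k in range(i):                # (k, l) mirror the indices of g.p.(i, j, k, l)
                for l in range(j):
                    if score[k][l] is None:
                        continue
                    skipped = (i - k - 1) + (j - l - 1)
                    drift = abs((a[i][1] - a[k][1]) - (b[j][1] - b[l][1]))
                    gap = skip_penalty * skipped + spacing_penalty * drift
                    cand = score[k][l] - gap + match_weight
                    if cand > s:
                        s, prev = cand, (k, l)
            score[i][j], back[i][j] = s, prev
            if s > best:
                best, best_cell = s, (i, j)
    pairs = []
    cell = best_cell
    while cell is not None:                   # trace the best chain back to its start
        pairs.append(cell)
        cell = back[cell[0]][cell[1]]
    return best, pairs[::-1]

# Toy sentences: the second one lacks UPW 2 and has slightly different spacing.
sentence_a = [(1, 5), (2, 20), (3, 40)]
sentence_b = [(1, 7), (3, 43)]
best_score, aligned = align_sentences(sentence_a, sentence_b)
```

From the aligned pairs, the lower and upper bounds on the separation between successive UPWs could then be collected over all pairwise alignments to form the consensus sentence.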
As summarized in Table I, our procedure identified sequence motifs that characterized >50% of these superfamilies when the degree of conservation, f, was set at 80 or 70%. The motif library constructed with the minimum superfamily size of five members and f = 80% contained 145 sequence motifs (Table I). Out of the 145 motifs, 35 were characterized by single blocks while the rest contained multiple blocks, as shown in Table II. A complete listing of the 121 motifs containing <10 blocks is shown in the Appendix. Substitution patterns are obtained when r = 80%. As each new release of the PIR1 database is produced, the motif library can be reconstructed by this automatic procedure. However, a long computation time is required because of the calculation of the many hexapeptide patterns in the initial screening of the UPWs. When the libraries shown in Table I were constructed without hexapeptide patterns, ~5% of the superfamilies could not be identified. This was a relatively small loss compared with the gain in computation time.

Superfamily assignment by sequence motifs

A procedure for superfamily assignment was established utilizing our motif library, as follows: (i) begin the search using the first motif block. The criterion for the existence of a motif block is given by the parameter r, which specifies the minimum fraction of matched residues; (ii) if a motif block is found, check if the next motif block exists after the specified spacer length; and (iii) if a motif block is not found, skip this and continue searching for the next block. The search fails if no motif block is found. In the above procedure a sequence is considered assigned to a superfamily if any of the motif blocks match. No distinction is made between the conserved (+) and nonconserved (-) blocks. Table III(a) shows the results of this procedure when applied to the PIR1 database release 26.0, which is the training data set used for constructing the motif library.
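Steps (i) to (iii) of the assignment procedure can be sketched in Python as follows. The sketch makes several assumptions that are not stated in the text: motif blocks are plain strings without substitution patterns, the spacer is counted from the end of one block to the start of the next, and after a skipped block the next block is searched for anywhere in the sequence. The function names and the two-block toy motif are hypothetical; the motif is not an entry of the library.

```python
def block_matches(seq, start, pattern, r):
    """True if at least a fraction r of the pattern residues match seq at start."""
    if start < 0 or start + len(pattern) > len(seq):
        return False
    window = seq[start:start + len(pattern)]
    hits = sum(1 for x, y in zip(window, pattern) if x == y)
    return hits + 1e-9 >= r * len(pattern)    # small tolerance for float rounding

def find_block(seq, pattern, r, lo=0, hi=None):
    """Return the first start position in [lo, hi] where the block matches, else None."""
    last = len(seq) - len(pattern)
    hi = last if hi is None else min(hi, last)
    for start in range(max(lo, 0), hi + 1):
        if block_matches(seq, start, pattern, r):
            return start
    return None

def search_motif(seq, blocks, spacers, r=1.0):
    """blocks: list of block strings; spacers[i] = (min, max) residues between
    block i and block i + 1. Returns the (block index, start) pairs that were found."""
    found = []
    prev_end = None
    for idx, pattern in enumerate(blocks):
        if prev_end is None:
            pos = find_block(seq, pattern, r)              # (i) unconstrained search
        else:
            smin, smax = spacers[idx - 1]
            pos = find_block(seq, pattern, r,              # (ii) respect the spacer
                             prev_end + smin, prev_end + smax)
        if pos is None:
            prev_end = None                                # (iii) skip this block
            continue
        found.append((idx, pos))
        prev_end = pos + len(pattern)
    return found                                           # empty list: search failed

# Toy motif: <GDSGG> [3, 10] <CGSCW>, combined here purely for illustration.
toy_blocks = ["GDSGG", "CGSCW"]
toy_spacers = [(3, 10)]
full_hit = search_motif("MKAAGDSGGPLVCKCGSCWAFS", toy_blocks, toy_spacers, r=1.0)
strict_miss = search_motif("MKAAGDAGGPLVCK", toy_blocks, toy_spacers, r=1.0)
relaxed_hit = search_motif("MKAAGDAGGPLVCK", toy_blocks, toy_spacers, r=0.8)
```

A sequence would be assigned to the superfamily whenever the returned list is non-empty, which mirrors the rule that any matching block suffices.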
When the block detection parameter r = 100%, no entries were falsely assigned (false positives), but 140 entries could not be detected (false negatives) as belonging to one of the 145 superfamilies. At the level of r = 80% there were 70 false positives and 79 false negatives. When false positives were examined in more detail, all resulted from single motif blocks containing substitution patterns. Sequence motifs with multiple blocks or sequence motifs with single blocks without substitution patterns could be used safely for superfamily assignment.

Next, a test data set was prepared from release 29.0 of the PIR1 database by identifying new entries added after release 26.0. There were cases where several entries in multiple superfamilies were combined into a single superfamily or entries in a single superfamily were split into different superfamilies. In such cases, the multiple superfamilies are considered to be related and assignment to a related superfamily is the correct answer. The results using this new data set are summarized in Table III(b). Although the prediction ability (~68%) was not as great as had been expected, the search itself could be performed within a fraction of a second on a small workstation, which is two to three orders of magnitude faster than the FASTA homology search (Pearson and Lipman, 1988).

We then modified the above procedure so that the search stops if any of the conserved (+) blocks is not found. The number of false positives could be decreased without affecting the number of false negatives in Table III(a) because this is how the conserved block was defined in the training set.
However, this additional constraint has more effect on increasing the number of false negatives than on decreasing the number of false positives in the test data set. If the motif library is to be used as an initial step in superfamily assignment, it is desirable to decrease the number of false negatives because false positives can easily be distinguished by sequence similarity in the subsequent step. About 20% false negatives remain in Table III(b), even with low values of r. It is possible to halve this by incorporating amino acid similarity scores, such as the PAM matrix (Dayhoff, 1978), when comparing motif blocks (data not shown).

Because the sequence motifs identified represent well conserved regions within a group of related proteins, they are likely to correspond to functionally important sites. Table IV summarizes the percentages of biological sites, annotated in the PIR1 database, which correspond to motif blocks identified by our procedure. Table V is a listing of the single block sequence motifs that characterize 35 superfamilies, together with any known functional significance. Our procedure identified known consensus patterns, or closely related derivatives, such as the active site sequence GDSGG which is known to be exclusive to the serine protease superfamily (Dayhoff et al., 1983).

The sequence motifs were obtained strictly from 1-D sequence information, the superfamily classification based on sequence similarity and the amino acid pattern searches. Among the 145 superfamilies with identified motifs, 21 superfamilies contained one or more member sequences with known 3-D structures; seven were characterized by single block motifs and 14 by multiple block motifs. Using the coordinate data from the Brookhaven Protein Data Bank (Bernstein et al., 1977), it was found that multiple motif blocks tend to lie close together in the 3-D structure.
Typical examples are: L-lactate dehydrogenase (SF31; see Appendix for actual motifs), phosphoglycerate kinase (SF229), phospholipase A2 (SF281), neutral proteinase (SF385), carbonate dehydratase (SF472) and triose-phosphate isomerase (SF499). Figure 4 shows a stereo drawing of phospholipase A2 with two motif blocks at the active site.

The correlation between conserved sequence patterns and exon structures has also been examined. A popular view suggests that introns existed in ancestral genes and have been removed under the exon shuffling mechanism (Holland and Blake, 1987) where an exon forms a structural or functional unit of a protein. Therefore, it was expected that the identified motif blocks may correspond to exon units. As shown in Table IV, however, quite a few introns were found to split functionally important motif blocks. Figure 5 shows typical examples where exon boundaries appear within the motif blocks. It is also noted that the intron positions around the motif block CGSCW of the papain (cysteine protease) superfamily (Ishidoh et al., 1989) and around the motif block GDSGGP of the trypsin (serine protease) superfamily (Rogers, 1985) are not fixed within the respective member sequences. These observations appear to support the concept of intron insertions (Rogers, 1989), although all introns examined here may not fall into this category.

Information about the functional properties of expressed protein products is often the main concern when DNA sequences are determined. The method presented in this paper is an attempt towards fully computerized interpretations of the sequence data. A collection of sequence motifs with associated biological meanings in evolutionary, functional and structural aspects may be considered a dictionary for such purposes. At the same time, the motif search approach is expected to solve the speed and sensitivity problems in the current homology search approaches.
Because motifs represent more organized information, concentrated and extracted from primary databases, the search against a motif library is much faster than the search against a sequence database. It is also possible to incorporate various types of motif in the library, not only those to identify membership of a superfamily, but also other sequence patterns which are too weak to be detected by standard database search methods.

Until now, sequence motifs have been found by manually examining a set of related sequences, although there have been a few attempts to automate the procedure (Staden, 1989; Smith and Smith, 1990; Smith et al., 1990). The essence of our automatic method is the concept of uniqueness. For a protein with 100 residues there are $20^{100}$ possible amino acid sequences. In nature, however, the repertoire of real amino acid sequences appears to be quite limited in comparison to this theoretical number. The protein sequences determined to date amount to 10 million residues, three times larger than the $20^5$, or 3.2 million, possible pentapeptide patterns. In reality, ~40% of the possible pentapeptide patterns are not used in the known sequences. Thus, actual proteins seem to have evolved from a limited set of amino acid sequences, conserving functionally important residues. This has been the working hypothesis in this study. As expected, motif blocks, constructed from unique peptide words, were found to be well correlated with functionally important sites of protein molecules. In addition, separate blocks tend to be close together in space to form an active site.

For the motif library to be more useful, it is necessary to increase the number of identified superfamilies, i.e. to reduce the number of no opinions (70-80%) in Table III. One approach is to use lower levels of conservation, f, as shown in Table I. Another is to relax the condition of uniqueness which was strictly required in this analysis.
A few exceptions can be allowed in other superfamilies and/or patterns could be identified that are unique to multiple superfamilies. In our preliminary analyses of the latter case, the pattern YGDTDS was found in two superfamilies (DNA-directed DNA polymerases of adenovirus and herpes virus) which share very little sequence homology. The possibility of combining multiple superfamilies based on short sequence motifs is thus inferred. The pattern HPDKGG was found exclusively in the three superfamilies: large T antigen, middle T antigen and small t antigen of polyoma and related viruses. However, this pattern was actually located in the exon shared by the three antigens.

This work was supported by a grant-in-aid for scientific research on the priority area 'Genome Informatics' from the Ministry of Education, Science and Culture, Japan.