key: cord-0009079-98oo0bkl authors: Nakai, Kenta; Kanehisa, Minoru title: A knowledge base for predicting protein localization sites in eukaryotic cells date: 2005-07-25 journal: Genomics DOI: 10.1016/s0888-7543(05)80111-9 sha: a65c8ae43245adba3cfebd0db3c87f0e43b196be doc_id: 9079 cord_uid: 98oo0bkl To automate examination of massive amounts of sequence data for biological function, it is important to computerize interpretation based on empirical knowledge of sequence-function relationships. For this purpose, we have been constructing a knowledge base by organizing various experimental and computational observations as a collection of if—then rules. Here we report an expert system, which utilizes this knowledge base, for predicting localization sites of proteins only from the information on the amino acid sequence and the source origin. We collected data for 401 eukaryotic proteins with known localization sites (subcellular and extracellular) and divided them into training data and testing data. Fourteen localization sites were distinguished for animal cells and 17 for plant cells. When sorting signals were not well characterized experimentally, various sequence features were computationally derived from the training data. It was found that 66% of the training data and 59% of the testing data were correctly predicted by our expert system. This artificial intelligence approach is powerful and flexible enough to be used in genome analyses. Computational approaches are becoming indispensable components of molecular and cellular biology, especially in the analyses of human and other complex genomes for which massive amounts of sequence data must be examined for biological function. Functional information can be obtained from sequence information not by solving equations of first principles, but by inference based on empirical knowledge. 
Although the sequence data are now collected and organized in publicly available databases, functional data are not well organized, except, perhaps, in the brain of a human expert. We have been experimenting with an artificial intelligence approach called an expert system (see Waterman, 1986, for example) for collecting and utilizing experts' specific knowledge, as well as computationally acquired knowledge, for the task of protein sorting (Nakai and Kanehisa, 1991). In contrast to the representation of protein sequence data by 20 letters for which no arguments can be made except for possible sequencing errors, the representation of functional data can be far more controversial because it always requires interpretation of observed phenomena. In eukaryotic cells, the sorting signals that direct proteins to proper subcellular locations are usually encoded in their own amino acid sequences. A growing body of experimental evidence has been clarifying the nature of these signals. There are multiple ways of representing such empirical knowledge for computer processing. A production system utilizes a knowledge base constructed as a collection of "if-then" rules. It is relatively easy to implement knowledge in this manner, and a number of expert systems are of this type. In our previous work, we constructed an expert system for predicting protein localization sites in Gram-negative bacteria by organizing the relationships between amino acid sequence features and functional aspects in the form of if-then rules (Nakai and Kanehisa, 1991). The expert system could discriminate 106 proteins in our database into four localization sites with 83% accuracy. In the present work, our previous system is expanded so that it can also predict the localization sites of eukaryotic proteins, incorporating knowledge of a variety of sorting signals (Verner and Schatz, 1988).
1 Present address: National Institute for Basic Biology, Okazaki 444, Japan.
This is a first attempt to systematically allocate various protein localization sites in eukaryotic cells from a theoretical point of view, which requires consideration of the following. First, because protein-sorting signals are mutually related, it does not seem sufficient to examine each feature separately. Second, because of the variations in how such features are encoded, it is not possible to treat all signals uniformly, say, by a single discriminant function. Third, because of the difficulty of interpreting the functional data mentioned above, any theory should make experimentally testable predictions and be able to evolve as new insights are gained. A knowledge-based approach is suitable for solving problems in these circumstances.
Localization sites. The localization sites considered in this study are cytoplasm (one site), nucleus (one site), mitochondrion (four sites), chloroplast (three sites), peroxisome (one site), endoplasmic reticulum (two sites), Golgi complex (one site), lysosome (two sites), vacuole (one site), plasma membrane (two sites), and extracellular space (one site). They are summarized in Table 1. For proteins from animal cells, the lysosome is added to the repertoire, whereas the vacuole is added for plant and yeast proteins. The chloroplast, which is divided into three sites, is also included for plant proteins. Cytoplasmic proteins include cytoskeletal ones. Peroxisomal proteins include so-called microbody proteins, but those of trypanosomes are excluded. GPI-anchored plasma membrane proteins are treated separately from other integral membrane proteins because the prediction logic is different. There is some ambiguity in the definition of protein localization sites. For example, some proteins have more than one localization site (see Discussion). Peripheral membrane proteins are distinguished from integral membrane proteins and we define their localization sites to be the spaces toward which the membrane surfaces face.
Some proteins become soluble after specific proteolytic processing reactions in the membrane-bound precursor form. In the present work, we regard them as membrane proteins unless the reaction is coupled with the translocation process.
Sequence data. The amino acid sequence data of known localization sites were collected from the NBRF-PIR database, release 27.0 (Barker et al., 1990). The total of 401 sequences can be obtained from Kenta Nakai (e-mail: nakai@nibb.ac.jp). Except for some proteins in which the N-terminal methionine was removed, all sequences were in the genetically coded (unprocessed) form. To eliminate redundant data, these sequences were selected from the original database such that there were no pairs with more than 50% identical residues. Exceptions were a few isozymes localizing at different sites or with apparently different localization mechanisms. We divided our dataset into training data for extracting knowledge and testing data for evaluating the prediction ability. When a localization site contained more than 10 proteins, 30% of them were used as testing data; the remainder were used as training data. When proteins at a certain site were divided into groups sorted by different pathways, testing data were selected proportionally. The number of proteins in each category of our dataset is summarized in Table 1.
Terminology for sorting signals. There is as yet no consensus for the terminology of various sorting signals. We adopted Varshavsky's (1991) proposal for a systematic naming method. This naming convention, as well as part of our own, is shown in Table 2.
Expert system. The architecture of our expert system is the same as that previously reported (Nakai and Kanehisa, 1991). It is a commonly used production system consisting of if-then rules, organized in a manner similar to the design of Tanaka and Shimoi (1987). The core of our system is written in the programming language OPS83, version 3.0 (Forgy, 1989).
The knowledge base is divided into three modules, corresponding to the order of reasoning steps. In the first module, rules are organized for classifying organism names into one of the five categories: Gram-positive bacteria, Gram-negative bacteria, yeast, animal, or plant. The second and third modules contain rules for sorting signals, the second containing general rules and the third containing more specialized rules. This distinction is somewhat arbitrary, but the reasoning with more specialized rules is performed after the general reasoning step. The calculations involving sequence data are written in C language and are called from rules when necessary. The flow of reasoning occurs in a backward fashion; each localization site of the repertoire is activated one by one as a hypothesis and the rules for examining sequence characteristics for each localization site are invoked to verify the hypothesis. Thus, given the amino acid sequence and the organism name, our expert system reports a list of probable localization sites with certainty factors.
Discriminant analysis. Occasionally, we use the method called stepwise discriminant analysis Kanehisa, 1988, 1991) for deriving an optimal combination of various sequence features. Suppose, for example, proteins sorted to a certain localization site have amino acid compositions different from those of proteins sorted to another site. Using 20 amino acid contents as variables, the stepwise discriminant analysis determines the set of coefficients that maximally discriminates the two groups. The derived discriminant function is represented by the formula $y = \sum_i a_i x_i + \mathrm{const}$, where the $x_i$ are the variables to be selected (say, 20 amino acid contents) and the $a_i$ and const are the coefficients. When used for prediction, each unknown sequence is classified into one of the two groups according to the sign of this function.
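As an illustration of how such a discriminant function is applied, the following minimal Python sketch computes amino acid contents for a sequence segment and classifies it by the sign of y. The function names and the coefficient values are hypothetical and chosen only for illustration; they are not the values reported in Table 3, and the stepwise variable-selection step itself is not shown.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(segment):
    """Return the fractional content of each of the 20 amino acids."""
    counts = Counter(segment.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("segment contains no standard residues")
    return {a: counts[a] / total for a in AMINO_ACIDS}

def discriminant_score(segment, coefficients, const):
    """Evaluate y = sum_i a_i * x_i + const over the selected variables only."""
    x = composition(segment)
    return sum(a_i * x[res] for res, a_i in coefficients.items()) + const

def classify(segment, coefficients, const):
    """Assign the segment to one of two groups according to the sign of y."""
    y = discriminant_score(segment, coefficients, const)
    return "group 1" if y > 0 else "group 2"

# Hypothetical coefficients for three selected variables (assumption,
# NOT taken from the paper): R favored, D and E disfavored.
EXAMPLE_COEFFS = {"R": 10.0, "D": -10.0, "E": -10.0}
EXAMPLE_CONST = -0.5

if __name__ == "__main__":
    for seg in ("MRRRLLLLAARRSSAAALLL", "MDEDEAAAAAAAAAAAAAAA"):
        print(seg, classify(seg, EXAMPLE_COEFFS, EXAMPLE_CONST))
```

Only the variables retained by the stepwise procedure appear in the coefficient dictionary, which mirrors the practice of stopping before all 20 coefficients are determined.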
To avoid excessive dependence on training data, the stepwise procedure is usually not repeated until all 20 coefficients are determined. It is possible to regard the order of selection of variables as corresponding roughly to the order of their importance.
Other analytical methods: (i) Hydrophobic moment. The hydrophobic moment is calculated according to Eisenberg's (1984) method. The amplitude of the moment with a given angle is calculated as the average value over 11 residues around each position in the sequence. The maximum value and its position may be used as variables for discriminant analysis. (ii) The 'apolar' algorithm. The algorithm we used to detect an apolar region is as follows. First, we define, somewhat arbitrarily, an index for the apolar value of each amino acid: [...] than -5.0. The output is the segment corresponding to the best sum value. (iii) The 'alom' algorithm. Transmembrane segments in membrane proteins were detected by the alom program (Klein et al., 1985). It is based on a discriminant function between integral and peripheral membrane proteins, applied to all 17-residue segments in the sequence.
Table 2 (signal descriptions)a:
Translocation from cytosol into nucleus (nuclear localization signal)
Translocation from cytosol into peroxisome
Translocation from cytosol into endoplasmic reticulum (signal sequence)
Retention in endoplasmic reticulum (for the lumen or membrane)
Generation of GPI-anchor after the cleavage (also a degron?)
Golgi-mediated generation of Man6P residues; may be identical with the signal for transfer from trans-Golgi to lysosomes
Translocation from cytosol into chloroplast stroma
Translocation from chloroplast stroma (S) into thylakoid space (T)
a Modified and extended version of Varshavsky's (1991) proposal.
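The hydrophobic moment calculation in (i) can be sketched as follows. This is one possible reading of the description, in which the moment amplitude is computed over an 11-residue window centered on each position and normalized by the window length. As a labeled substitution, the sketch uses the Kyte-Doolittle hydropathy values rather than the scale used with Eisenberg's method, so absolute values will differ from those in the paper; the function names are hypothetical.

```python
import math

# Kyte-Doolittle hydropathy values (a swapped-in scale, see the note above).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydrophobic_moment(window, angle_deg):
    """Moment amplitude per residue for a window and a per-residue angle."""
    delta = math.radians(angle_deg)
    s = sum(KYTE_DOOLITTLE[r] * math.sin(n * delta) for n, r in enumerate(window))
    c = sum(KYTE_DOOLITTLE[r] * math.cos(n * delta) for n, r in enumerate(window))
    return math.hypot(s, c) / len(window)

def moment_profile(seq, angle_deg, width=11):
    """Moment at each position that has a full window around it (0-based)."""
    half = width // 2
    return [
        (i, hydrophobic_moment(seq[i - half:i + half + 1], angle_deg))
        for i in range(half, len(seq) - half)
    ]

def max_moment(seq, angle_deg, width=11):
    """Return (position, value) of the maximal moment along the sequence."""
    profile = moment_profile(seq, angle_deg, width)
    if not profile:
        raise ValueError("sequence is shorter than the window")
    return max(profile, key=lambda item: item[1])

if __name__ == "__main__":
    peptide = "LKLKLKLKLKL"  # strictly alternating, so periodic at 180 degrees
    print(max_moment(peptide, 180))
    print(max_moment(peptide, 100))
```

The maximum value and its position returned by `max_moment` correspond to the two quantities that the text says may be used as variables for discriminant analysis. Non-standard residue codes raise a `KeyError` in this sketch.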
To discriminate mitochondrial proteins, we searched for the M-transferon signal (rule 'mtmod'; see Table 4 for a summary of rules in our knowledge base; see also Table 2 for our naming convention of sorting signals), although several mitochondrial proteins without this signal are known to be sorted by other pathways (Hartl and Neupert, 1990; Baker and Schatz, 1991). Since there had been no proposed method of recognizing unknown M-transferons, we searched for an optimal set of sequence features that best discriminate M-transferons. We utilized the features reported previously. Von Heijne et al. (1989) used comparisons of aligned sequences to indicate sequence features of M-transferons. However, the amplitude and the location of maximal hydrophobic moments at both 95° and 75° were not selected as effective variables in the stepwise discriminant analysis. Gavel and von Heijne (1990b) proposed a method for recognizing the cleavage site motif (rule 'exgavel'). However, we could not raise the prediction accuracy by incorporating the information of predicted cleavage sites (data not shown). The best discriminant function for our training data was obtained from the amino acid composition of the 20-residue segment at the N-terminus (rule 'mtdisc'). The variables selected and their corresponding coefficients are shown in Table 3a. Variables are listed in the order of stepwise selection, which roughly corresponds to the order of relative importance. In Table 3a we can see that R tends to be strongly favored (positive coefficient) but other charged or polar residues are disfavored (negative coefficients) in the N-terminal region of M-transferons. In all, 90% of the training data could be correctly discriminated. Since an M-transferon is the signal that brings a newly synthesized protein to the mitochondrial matrix, further sorting signals must exist (Hartl and Neupert, 1990). Indeed, some intermembrane space proteins have a presequence of bipartite structure.
The N-terminal half of the presequence is cleaved off at the matrix space and then the C-terminal half (M/IMS-transferon) is used for the translocation signal into the intermembrane space (rule 'mt2nd'). In our data for intermembrane space proteins, however, those that use this conservative pathway seem to be in the minority (two of five). There were two more examples of the bipartite signal in our inner membrane protein data. All but one of the four proteins with the bipartite signal had high discriminant scores of M-transferons at the N-terminus. In addition, using the 'apolar' algorithm, we found typical apolar stretches in the region from the N-terminus to the 70th residue (rule 'mtit'). However, these stretches were not always located near the second cleavage site. Interestingly, one of the two outer membrane proteins, the 45K protein of yeast, showed features characteristic of intermembrane space proteins (rule 'mtom'). In this case, the apolar region turned out to be near the N-terminus, i.e., starting at the fifth residue. Other proteins that were predicted to have M-transferons were further classified into inner membrane or matrix proteins based on the existence or absence of transmembrane stretches detected by the 'alom' program (rules 'mtim' and 'mtmx'). However, many inner membrane proteins did not have apparent hydrophobic segments.
The sorting mechanism of nuclear proteins differs from that of other proteins (Silver, 1991). The main difference is that proteins do not actually traverse the nuclear membrane at the time of entrance. It seems possible that a protein without its own nuclear localization signal (Nu-transferon) enters the nucleus via cotransport with another protein (Zhao and Padmanabhan, 1988). In addition, the Nu-transferons identified so far are not cleaved off after translocation and their exact positions in the primary sequence are not essential. This situation makes the task of finding Nu-transferons difficult.
The most common Nu-transferon is the SV40 type, which is composed of short stretches rich in basic amino acids and, often, proline residues. We attempted to find sequence patterns that can cover most known SV40-type Nu-transferons: four-residue patterns composed of basic amino acids (K or R) or of three basic amino acids (K or R) and H or P (rule 'nuc1'); and a pattern starting with P and followed within three residues by a basic four-residue segment containing three K or R residues (rule 'nuc2'). Although these patterns match most known SV40-type signals, 8 of 33 cytoplasmic proteins in the training data also had such patterns. However, if we count overlapping patterns separately, most proteins of this type seem to have more than one pattern, giving higher predictability. Recently, another type of Nu-transferon, consisting of two interdependent basic domains, was discovered (Robbins et al., 1991). The authors proposed a simple scheme for the recognition of this bipartite signal: 2 basic residues, a 10-residue spacer, and another basic 5-residue region consisting of at least 3 basic residues (rules 'nuc3' and 'nuc7'). This pattern was rather apparent in our training data; 14 nuclear proteins and only 1 cytoplasmic protein had the pattern. Since nuclear proteins are generally rich in basic residues, we used this heuristic in addition to the knowledge of Nu-transferons. If the sum of K and R compositions is higher than 20%, then the protein is considered to have a higher possibility of being nuclear than of being cytoplasmic (rule 'nuc4'). In addition, it might be essential to predict RNA-binding ability because it is possible for some RNPs to use targeting signals in the bound RNAs (Hamm et al., 1990). However, simple examination of the RNP consensus motif (Query et al., 1989) was not useful for discriminating nuclear proteins because many cytoplasmic proteins seem to have this motif as well (rules 'nuc5' and 'nuc6').
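Three of the nuclear heuristics described above (the four-residue basic patterns, the bipartite pattern, and the K plus R content threshold) can be sketched in Python as follows. This is an illustrative reading of the prose, not the OPS83 rules themselves; the function names are hypothetical, and the P-initiated pattern of the second rule is not covered.

```python
BASIC = set("KR")

def count_four_residue_patterns(seq):
    """Count overlapping 4-mers that are all K/R, or three K/R plus one H or P."""
    hits = 0
    for i in range(len(seq) - 3):
        window = seq[i:i + 4]
        n_basic = sum(r in BASIC for r in window)
        if n_basic == 4 or (n_basic == 3 and any(r in "HP" for r in window)):
            hits += 1
    return hits

def has_bipartite_pattern(seq):
    """2 basic residues, a 10-residue spacer, then 5 residues with >= 3 basic."""
    for i in range(len(seq) - 16):
        if seq[i] in BASIC and seq[i + 1] in BASIC:
            tail = seq[i + 12:i + 17]
            if sum(r in BASIC for r in tail) >= 3:
                return True
    return False

def basic_fraction(seq):
    """Fraction of residues that are K or R."""
    return sum(r in BASIC for r in seq) / len(seq)

def is_basic_rich(seq, threshold=0.20):
    """True when the K + R content exceeds the 20% threshold from the text."""
    return basic_fraction(seq) > threshold

if __name__ == "__main__":
    sv40_like = "PKKKRKV"  # the SV40 large T antigen signal peptide
    print(count_four_residue_patterns(sv40_like))
    synthetic = "KR" + "A" * 10 + "KKKAA"  # synthetic bipartite example
    print(has_bipartite_pattern(synthetic))
```

Counting overlapping windows separately, as `count_four_residue_patterns` does, matches the observation that most proteins of this type carry more than one pattern.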
Some peroxisomal proteins are known to have an uncleavable sorting signal at the C-terminus: the SKL motif (reviewed in Osumi and Fujiki, 1990). Because their sorting pathway seems to be conserved widely throughout eukaryotes (Gould et al., 1990), we included microbody proteins of various organisms except trypanosomes in our data of peroxisomal proteins. It has been further shown that the SKL motif can be tolerated in a looser form: (S/A)(K/R/H)L (Gould et al., 1989). We searched our training data for this motif (rules 'pox1' and 'pox4'); 7 of 13 peroxisomal proteins had the motif at their C-terminus, whereas only one extracellular protein, α-amylase A2, had it at that position. In addition, all but one peroxisomal protein, in contrast to 45% of cytoplasmic and nuclear proteins, in our training data had the motif in at least one position in the entire sequence, suggesting the possibility that internal motifs may also play some role in the sorting mechanism. To supplement the knowledge for prediction, we examined amino acid compositions of different regions of peroxisomal proteins. Although some of them have a cleavable N-terminal portion, it is not known whether it contains any information as a sorting signal (Osumi and Fujiki, 1990). We tested whether the amino acid compositions of the 20-residue N-terminal segment, the 20-residue C-terminal segment, or the entire sequence could be used effectively as variables of the discriminant function. The discrimination accuracy was 78, 79, and 84%, respectively. Because the third case was the best, we used this as a rule for prediction (rules 'pox2', 'pox3', and 'pox5'). It can be seen from the derived discriminant function in Table 3b that the compositions of F and W seem especially important. The overall net positive charge, suggested to be characteristic of peroxisomal proteins (Borst, 1986), was a more prominent feature of some nuclear proteins in our data.
Our training data of peroxisomal proteins contained a 70K membrane protein. It was unclear whether our rule could also be applied to this protein, but it had three internal SKL motifs and was positive with the discriminant score even though this protein was not included in the derivation of the function.
In our prediction scheme, proteins sorted along the nonselective bulk flow (Pfeffer and Rothman, 1987) are recognized as follows: First, a protein having an N-terminal signal sequence (ER-transferon) is transported to the endoplasmic reticulum (rule 'rghl'). Second, if it has any stop-transfer signal, it is integrated into the membrane; if not and if the ER-transferon is cleaved off, it is translocated into a lumen. Third, unless it has any other signals for specific retention or commitment (compartons), it will be transported to the cell surface by default; a luminal protein will be secreted constitutively to the extracellular space (rule 'outr') and a membrane protein will reside at the plasma membrane. It is not necessary, however, for membrane proteins to have N-terminal ER-transferons (see below). As in the case of Gram-negative bacterial proteins (Nakai and Kanehisa, 1991), the methods of McGeoch (1985) and von Heijne (1986) were used for the recognition of N-terminal ER-transferons (rules 'mcg1', 'mcg2', and 'gvhl'). The former, modified by us to be represented as a discriminant function, uses the information from a short N-terminal charged region and a subsequent uncharged region, whereas the latter uses the information from the region around the cleavage site. The combined use of these two methods was rather effective for detecting ER-transferons as well as bacterial signal sequences. On the one hand, most of the proteins that do not use the secretory pathway are predicted to have no ER-transferons; 94% (145 of 155) were correctly predicted. On the other hand, the evaluation of false negatives is not easy because many membrane proteins have internal start-transfer sequences that are not detected by the above method, and there are some soluble proteins sorted through pathways not requiring an ER-transferon. In fact, most extracellular proteins predicted to have no ER-transferons turned out to belong to such an exceptional class. The cleavage sites of ER-transferons predicted by von Heijne's method were rather accurate; of the 62 training proteins whose cleavage sites are recorded in the NBRF-PIR database, 45 (73%) are correctly predicted (data not shown).
FIG. 1. Topology of membrane proteins. The classification is based on the definition by Singer (1990). Here, the cytoplasm (cyt) is below the membrane and the extracytoplasmic space (exo) is above the membrane. Types Ia, II, Ib are membrane proteins with a single transmembrane segment. Type Ia proteins have a cleavable ER-transferon and are in NexoCcyt orientation. Type II proteins do not have a cleavable signal and are in NcytCexo orientation. Type Ib proteins also do not have a cleavable signal but the orientation is NexoCcyt. Here, all membrane proteins with more than one transmembrane segment are classified as type III. The most N-terminal segment is shaded because the charge difference between both of its sides is thought to be important in topogenesis (types IIIa and IIIb).
As a retention signal of ER luminal proteins (ER-comparton), the sequence motif KDEL (HDEL in yeast and some plants) at the C-terminus (rules 'er1' and 'er2') seems essential (Pelham, 1990). Although some variations of this motif are allowed in some organisms and cell types, they were not required for the discrimination of our current data. We could select all ER luminal proteins by the C-terminal KDEL motif with no false positives. Because sequence features of compartons, except ER-compartons, seem to be weak, the reliability of detecting such weak signals would be greatly enhanced if there were additional contextual features.
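The default ("bulk flow") reasoning for secretory-pathway proteins, together with the C-terminal KDEL/HDEL check, can be summarized in a short Python sketch. It assumes that the presence of an ER-transferon, its cleavage, and any stop-transfer signal have already been predicted by other methods and are passed in as booleans. The function names and returned strings are hypothetical, and cases that the passage does not spell out are reported as undetermined rather than guessed.

```python
def ends_with_er_retention(seq, yeast_or_plant=False):
    """Check for a C-terminal KDEL motif (HDEL also accepted for yeast/plant)."""
    motifs = ("KDEL", "HDEL") if yeast_or_plant else ("KDEL",)
    return seq.upper().endswith(motifs)

def default_pathway_site(has_er_transferon, transferon_cleaved,
                         has_stop_transfer, has_comparton=False):
    """Follow the three default steps: ER entry, membrane or lumen, cell surface."""
    # Step 1: without an N-terminal ER-transferon the protein does not
    # enter the ER by this route.
    if not has_er_transferon:
        return "no ER entry by this route"
    # Step 2: a stop-transfer signal integrates the protein into the membrane;
    # otherwise a cleaved ER-transferon yields a luminal protein.
    if has_stop_transfer:
        form = "membrane"
    elif transferon_cleaved:
        form = "luminal"
    else:
        return "undetermined"  # not covered by the passage sketched here
    # Step 3: a retention or commitment signal overrides the default route.
    if has_comparton:
        return "retained or committed (" + form + ")"
    return "plasma membrane" if form == "membrane" else "extracellular space"

if __name__ == "__main__":
    seq = "MKLLLLAAALLAAEEKDEL"  # synthetic sequence ending in KDEL
    retained = ends_with_er_retention(seq)
    print(default_pathway_site(True, True, False, has_comparton=retained))
```

Keeping the retention check separate from the default route reflects the way the text treats compartons as exceptions layered on top of the bulk flow.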
Thus, the topology of membrane proteins was examined with our expert system. We have adopted the latest definition of the membrane topology by Singer (1990) as shown in Fig. 1, although we do not distinguish type IV (channel) proteins from type III (polytopic) proteins. Before the prediction of the membrane topology, we located transmembrane segments by the 'alom' program (Klein et al., 1985). Because it was difficult to set a single appropriate cutoff value between transmembrane and peripheral segments, we adopted a two-way approach. First, the sequence is examined with a high cutoff value of -2.0 (rule 'alom3'); then, if more than two transmembrane segments excluding a cleavable ER-transferon are detected, the calculation is repeated with a low cutoff value of 0.5 (rule 'alom2'). Despite this treatment, it was difficult to locate precisely all transmembrane segments in polytopic proteins, such as the seven transmembrane segments in the rhodopsin family. However, the number of predicted segments was close to that of most models of available polytopic proteins. The sequence determinants for membrane topology have been studied extensively (von Heijne and Gavel, 1988; Hartmann et al., 1989; Parks and Lamb, 1991). Here we used the Hartmann et al. (1989) method for prediction. It is characterized by both the 'first helix' rule and the 'charge difference' rule; the overall topology is determined by the charge difference of both sides of 15 residues flanking the most N-terminal transmembrane segment (rule 'mtopl'). This method could also be applied to usual ER-transferons and gave good results when we changed the critical value from 0 to +1.0. Moreover, it was useful for the detection of internal ER-transferons as shown below. When our prediction scheme was applied, it showed some sequences with their predicted transmembrane segments located near the N-terminus.
Although it was possible to detect uncleavable signal sequences by the combined use of the McGeoch and the von Heijne methods for Gram-negative bacterial proteins (Nakai and Kanehisa, 1991), these ER-transferons were often falsely predicted to be cleaved. To overcome this difficulty, we have added the new rule that if a type Ib protein is predicted to have an NcytCexo configuration, its most N-terminal-sided transmembrane segment is assigned as uncleavable regardless of the von Heijne score (rules 'sig2' and 'sig3'). The predicted membrane topology was not only useful for limiting the search area for compartons but also suggestive for specific prediction. For example, we have noticed a tendency for type Ib proteins to be favored at the endoplasmic reticulum, whereas type II proteins are favored at the Golgi complex and the plasma membrane (rules 'er4', 'glgl', and 'pm3'). The topology of L-gulonolactone oxidase (OXRTGU) is not well studied. If it has a type II topology, as predicted from the charge difference, then it is an unusual protein whose transmembrane segment resides near the center of the sequence. A similar feature can be observed in the type Ib topology of a plasma membrane protein, glycophorin C (GFHUC). We represented these observations as hypothetical rules (rules 'er5' and 'pm4'). In addition, cytochrome b5 (S03373) is an exceptional protein whose hydrophobic segment may not traverse the membrane (Holloway and Buchheit, 1990). As for type III proteins, we could not find any prominent sequence features that could be used to predict their localization sites, partly because of the difficulty of precisely allocating transmembrane segments; type III proteins with more than three predicted segments were tentatively predicted to be plasma membrane proteins (rule 'pm5').
Many studies so far have indicated that compartons of membrane proteins are often found at a cytoplasmic tail, a short terminal region exposed to the cytoplasmic space in type Ia, Ib, and II proteins (Fig. 1). First, two lysines positioned three and four or three and five residues from the C-terminus (rule 'er6') are proposed to constitute the retention signal of ER membrane proteins (Jackson et al., 1990). With the constraint of predicted membrane topology, this simple motif was specific enough to detect one ER membrane protein with no false positives. In addition, the existence of the same motif in a type III protein, HMG CoA reductase, has been noted. Since we could not determine the detailed membrane topology of polytopic proteins, we searched for the motif at the C-terminus of all polytopic proteins with more than three predicted segments. With this additional rule, the reductase was selected with one false positive, sodium channel protein I of plasma membrane. Second, two sequence motifs, NPXY and YXRF, have been identified as signals for rapid internalization into endosomes (Chen et al., 1990; Collawn et al., 1990). The exact position of these signals seems unimportant, provided that there is a spacer from the transmembrane region. We did not distinguish internalized receptors. They are simply treated as plasma membrane proteins and these signals are used as clues for selecting plasma membrane proteins (rules 'pm6' and 'pm7'). In our training data, all proteins that have a single transmembrane segment and either the NPXY or the YXRF motif at the cytoplasmic tail turned out to be plasma membrane proteins. There were two false negatives owing to the failure to predict the correct topology. Although the nature of Golgi localization signals has not been clarified fully, recent studies suggest that they may reside in the transmembrane domain (Hurtley, 1992). In addition, the membrane-flanking sequences also seem to affect the efficiency of Golgi retention.
Apart from these studies, the existence of a consensus motif, (S/T)X(E/Q)(R/K), near the hydrophobic domain (possibly the Golgi lumen) of all Golgi-localized glycosyltransferases has been pointed out, although there is no experimental proof that it is a sorting signal (Bendiak, 1990) . To select Golgi proteins, we searched for this motif in the regions flanking the type II transmembrane segment (rule 'glgl'). Although two Golgi proteins were selected with no false positives, two other Golgi proteins satisfying this condition failed because their N-terminal transmembrane segment was predicted to be cleaved. In the E1 glycoprotein of coronavirus, the first transmembrane domain is required for its retention (Machamer and Rose, 1987) . However, since it is the only example of such a signal, we did not incorporate this knowledge. As stated later, a tyrosine residue in the cytoplasmic tail is also important for sorting lysosomal membrane proteins. We considered two distinct cases in which chemical modifications have primary importance in protein sorting. One is the case in which proteins are anchored at a membrane with covalently bound lipid moieties and the other is the case of lysosomal proteins. There are several ways of lipid anchoring, as illustrated in Fig. 2 (Schultz et al., 1988) . Among them, all proteins linked to glycosyl-phosphatidylinositol (GPI) seem to be localized at the extracellular surface of the plasma membrane (Ferguson and Williams, 1988; Cross, 1990) . Thus, if we can predict this modification, we can simultaneously predict the localization. Although the signal that leads to the GPI attachment (GPI-modon) is not fully understood, all precursors seem to be type Ia proteins and have cleavable C-terminal sequences. Moreover, their cytoplasmic tails, if present at all, are predicted to be very short. 
These features were sufficient for the discrimination of our training data (rule 'pm9'); 12 of 16 members were correctly predicted to have GPI anchors by the criterion of a type Ia protein that has a short cytoplasmic tail (within 10 residues). The false predictions were caused by the failure to predict the topology. There were no type Ia proteins with short cytoplasmic tails in other localization sites. In eukaryotic cells, palmitic and myristic acids are observed to be bound directly to proteins. However, recent studies suggest that many of them may not take part in lipid anchoring (McIlhinney, 1990; Resh, 1990). Thus, although we could make an efficient discriminant function for potential N-myristoylated sequences based on the substrate specificity of yeast N-myristoyltransferase (Towler et al., 1988), it did not work well for the prediction of localization sites (rule 'pm8'). There is another type of lipid linking known as isoprenylation or farnesylation (Maltese, 1990). This modification requires a CaaX motif at the C-terminus; "a" denotes an aliphatic amino acid. Since isoprenylated proteins have been found in the plasma membrane and in the nuclear envelope, another signal is needed for correct sorting (Hancock et al., 1990). In the case of nuclear lamin A, it seems to be a usual Nu-transferon (Holtz, 1989). Scanning our training data with the C-terminal CaaX motif revealed two false positives. With an additional rule that isoprenylated proteins do not have any transmembrane segments or ER-transferons, only one of them, lipocortin I, remained (rules 'caax0', 'caax1', and 'caax2').
Vacuoles, found in plant and yeast cells, have diverse functions, one of which is analogous to that of mammalian lysosomes. Based on our current understanding of the sorting mechanisms, we separated the prediction category into three sites: the lumen of lysosomes in animal cells, the membrane of lysosomes in animal cells, and the vacuoles in yeast and plant cells.
There are at least two distinct mechanisms for sorting lysosomal proteins (Kornfeld and Mellman, 1989). One is dependent on the posttranslational modification of mannose 6-phosphate, used in soluble enzymes. The other is dependent on a tyrosine residue at a particular position in the cytoplasmic tail (Williams and Fukuda, 1990). The formation of mannose 6-phosphate seems conformation-dependent and some sequence segments could contribute to form a recognition domain (Baranski et al., 1990). Thus, we postulated that a soluble lysosomal protein should have a cleavable ER-transferon, have no transmembrane segments, but have at least two potential N-glycosylation sites, i.e., NX(S/T) motifs, in the mature portion (rule 'lys3'). Since many extracellular proteins (23 of 50) share these features, we examined the differences in amino acid composition by stepwise discriminant analysis (rule 'lys1'). The predicted region of the ER-transferon was excluded from the calculation of amino acid composition. As shown in Table 3c, up to five variables were chosen because of the small size of the training data. With the derived discriminant function and the criterion of the potential glycosylation site, only 3 of 50 proteins were falsely discriminated.
The discrimination of lysosomal membrane proteins was accomplished as follows. First, we predicted the membrane topology of a protein; then, if it was a type Ia protein and if it contained a GY motif in the cytoplasmic tail within 17 residues of the boundary with the membrane, it was predicted to be a lysosomal membrane protein (rule 'lys2'). This procedure was sufficient for discriminating the three proteins from all the training data. Interestingly, they were also positive in the discriminant score for lysosomal soluble proteins. It may be that most parts of the sequence are exposed to the lumen and they are similar in amino acid composition.
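The sequence-motif parts of the lysosomal rules above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the ER-transferon and topology results are assumed to be supplied by earlier prediction steps, the cytoplasmic tail is assumed to be written starting at the membrane boundary, and the amino acid composition discriminant (Table 3c) is left out because its coefficients are not given in the text.

```python
import re


def count_n_glycosylation_sites(mature_sequence):
    """Count NX(S/T) motifs; the lookahead lets overlapping motifs count."""
    return len(re.findall(r"(?=N[A-Z][ST])", mature_sequence.upper()))


def is_soluble_lysosomal_candidate(has_cleavable_er_transferon, n_tms,
                                   mature_sequence):
    """Motif part of rule 'lys3': cleavable ER-transferon, no transmembrane
    segments, and at least two potential N-glycosylation sites."""
    return (has_cleavable_er_transferon
            and n_tms == 0
            and count_n_glycosylation_sites(mature_sequence) >= 2)


def is_lysosomal_membrane_candidate(topology, cytoplasmic_tail, window=17):
    """Motif part of rule 'lys2': type Ia with a GY motif lying within the
    first `window` residues of the tail, counted from the membrane."""
    return topology == "Ia" and "GY" in cytoplasmic_tail.upper()[:window]
```

Because 23 of 50 extracellular proteins also satisfy the soluble-protein conditions, such a check would only nominate candidates; the composition-based discriminant score is what separates the two groups in the text.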
The vacuolar sorting signals have been studied in yeast (Rothman et al., 1989) and in plants (Chrispeels and Raikhel, 1992). It is likely that most plant and yeast cells have a common sorting pathway for vacuolar proteins, though they may have other diverse pathways. Some vacuolar proteins have their signals in the preregion, which essentially looks like an ER-transferon, and the proregion, which is needed for specific recognition at the Golgi complex. However, no common sequence features have been discovered in the latter. In addition, there is a protein that does not use even the secretory pathway (Yoshihisa and Anraku, 1990), as well as a membrane protein that uses a distinct pathway (Klionsky and Emr, 1990). We performed a discriminant analysis between vacuolar and extracellular proteins from the amino acid composition of the sequence excluding the preregion (rules 'vac1' and 'vac2'), as shown in Table 3d. The vacuolar sequences that do not have an ER-transferon at the N-terminus were excluded from the analysis because they seemed to have distinct characteristics. Indeed, they could not be correctly discriminated by the derived function. The selected variables were totally different from the ones selected in the analysis of lysosomal proteins. With only a single amino acid content (K), 12 of 13 lysosomal and vacuolar proteins could be correctly distinguished (data not shown), although in our expert system the distinction is made by the organism name.
For plant proteins, the chloroplast is also a possible localization site. We postulated that all stromal proteins and thylakoid membrane proteins have the same kind of stroma-targeting signal (S-transferon; Hand et al., 1989; Keegstra et al., 1989) and searched for an efficient method of detecting it.
Based on the observation that S-transferons have three distinct regions (von Heijne et al., 1989), we performed a stepwise discriminant analysis using as variables the amino acid compositions of the two N-terminal segments, residues 3 to 10 and residues 1 to 30, and the position and amplitude of the maximum hydrophobic moment at an angle of 165° for residues 25 to 70 (rule 'chts0'). As shown in Table 3e, the result of self-discrimination was rather good. In contrast, the loosely conserved motif (V/I)X(A/C)A of S-degrons (Gavel and von Heijne, 1990a) was not as effective. We have also used the knowledge that in most (90.5%) chloroplast targeting peptides the residue at position 2 is alanine (rule 'chmod2').
Like some mitochondrial proteins, thylakoid luminal proteins are known to have a targeting sequence of bipartite structure; the N-terminal half is functionally equivalent to an S-transferon and the C-terminal half is required for translocation from the stroma into the thylakoid lumen (S/T-transferon). These proteins showed positive scores with the discriminant function for S-transferons. To detect the latter half of the bipartite signal, we employed two methods. One was the weight matrix of Howe and Wallace (1990) (rule 'ch2nd2'), which was derived from the data of thylakoid luminal proteins by the same method as von Heijne's (1986). The weight matrix could locate all the cleavage sites of our data, but it was not sufficient for discriminating thylakoid luminal proteins from others. It is probable that the weight matrix could only detect the signal recognized by a specific protease and could not detect the S/T-transferon itself. The second method we used was the 'apolar' algorithm applied to the limited region of residues 40 to 90 (rule 'ch2nd1'). From the 'apolar' score and the length of the apolar region, most thylakoid luminal proteins could be discriminated from other chloroplast proteins. Thylakoid membrane proteins were discriminated by the 'alom' program.
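The hydrophobic moment variable used above for S-transferons can be illustrated with a short sketch. The moment of a segment at a fixed angle per residue is the length of the vector sum of residue hydrophobicities, each rotated by that angle relative to its predecessor. Several choices here are assumptions rather than details from the text: the Kyte-Doolittle hydropathy scale, the 11-residue window, and the function names are all hypothetical; only the 165° angle and the region of residues 25 to 70 are taken from the description.

```python
import math

# Kyte-Doolittle hydropathy values, used here only as an assumed scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_moment(segment, angle_deg=165.0):
    """Length of the vector sum of hydrophobicities rotated by angle_deg
    per residue."""
    angle = math.radians(angle_deg)
    sin_sum = 0.0
    cos_sum = 0.0
    for i, residue in enumerate(segment.upper()):
        h = KYTE_DOOLITTLE.get(residue, 0.0)  # unknown residues add nothing
        sin_sum += h * math.sin(i * angle)
        cos_sum += h * math.cos(i * angle)
    return math.hypot(sin_sum, cos_sum)


def max_hydrophobic_moment(sequence, first=25, last=70, window=11,
                           angle_deg=165.0):
    """Slide a window over residues first..last (1-based, inclusive) and
    return (start position, amplitude) of the window with the largest
    moment, or (None, 0.0) if the region is shorter than one window."""
    region = sequence[first - 1:last]
    best_position, best_amplitude = None, 0.0
    for offset in range(len(region) - window + 1):
        amplitude = hydrophobic_moment(region[offset:offset + window],
                                       angle_deg)
        if best_position is None or amplitude > best_amplitude:
            best_position, best_amplitude = first + offset, amplitude
    return best_position, best_amplitude
```

The returned position and amplitude correspond to the two moment-derived variables that, together with the amino acid compositions of residues 3 to 10 and 1 to 30, entered the stepwise discriminant analysis.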
The remainder of chloroplast proteins were regarded as stromal proteins (rules 'chtm' and 'chst').
The results described above were integrated into a set of rules in our expert system. The list of core rules for reasoning steps 2 and 3 is summarized in Table 4. The number of rules is currently 80, excluding those for bacterial sequences. A simplified reasoning tree is given in Fig. 3. For each possible site, the reasoning procedure is performed roughly following the tree. Each node is a checkpoint for a certain sequence feature and certainty factors are modified according to its result. In principle, the path of reasoning should follow the real pathway of sorting in vivo. It is most probable that the first recognition process for a nascent polypeptide is mediated by an ER-transferon. If the polypeptide has an ER-transferon, it will be committed to a vesicle-mediated pathway and its final localization site will be determined by its comparton signals. Otherwise, it will be sorted according to other transferon signals. If it has no signals at all, it will become a cytoplasmic protein.
Apart from biological reality, we had to modify this simple scheme in some minor points. One was the treatment of internal ER-transferons, which cannot be effectively detected at present. For example, since most plasma membrane proteins do not have N-terminal ER-transferons, they must be examined in the context of the cytoplasmic pathway. Another was the independence of evaluations. As described, the possibility of being sorted to each localization site is evaluated one by one. In general, different evaluations are independent except for those for cytoplasmic proteins. Finding that one protein is unlikely to be sorted to one site does not usually raise the possibility that it may be sorted to any other specific site. The calculated possibilities are stored as certainty factors and the site that has the highest certainty is selected as the most probable site.
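The bookkeeping described above, in which each site carries its own certainty factor and the highest one wins, might be sketched as follows. This is only an illustration: the additive update with clamping to the range -1 to 1 is an assumption (the text does not state how a rule modifies a certainty factor), and the function names are hypothetical.

```python
def apply_rule(certainties, site, adjustment):
    """Return a copy in which one site's certainty factor is shifted by
    `adjustment` and clamped to [-1, 1]; other sites are left untouched,
    reflecting the independence of the per-site evaluations."""
    updated = dict(certainties)
    shifted = updated.get(site, 0.0) + adjustment
    updated[site] = max(-1.0, min(1.0, shifted))
    return updated


def rank_sites(certainties):
    """Sites ordered from highest to lowest certainty factor."""
    return sorted(certainties, key=certainties.get, reverse=True)
```

Taking the first element of the ranking gives the predicted site, and taking the first two gives the kind of alternative suggestion that is useful when the top certainty factors are close.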
However, we had to break this principle for some localization sites with poorly characterized sorting signals (e.g., nuclei and lysosomes), making the evaluation order-dependent, i.e., well-characterized sites first. The assignment of certainty factors was one of the more difficult aspects. For some rules, the distribution of score values over the training data was examined. The score values were usually divided into three classes, positive, not clear, and negative, and the certainty factor was assigned for each class. By trial and error, certainty factors were further adjusted considering the whole prediction accuracy. As summarized in Table 5 , 66% of the training data and 59% of the testing data were correctly predicted. Since the testing data were selected from the localization sites that involved more than 10 members, the composition of the testing data was not proportional to that of the training data. It is difficult to compare this prediction accuracy to other standards, but since each protein has a possibility of 14 to 17 localization sites depending on the organism, a random guess would result in less than 10% accuracy. If we simply assume that all proteins in the testing data belong to the largest site, the extracellular space, the value is 21%. It should be noted that the upper limit of the value is apparently lower than 100% because of the presence of exceptional sorting pathways. There was not a marked difference in the prediction accuracy between animal proteins and plant/ yeast proteins. In the training data, proteins at the lumen of ER and at the lysosomal membrane were perfectly discriminated. However, they are very small in size. The next well-predicted classes were the stroma of chloroplasts, the matrix of mitochondria, and the plasma membrane (integral). GPI-anchored proteins had the highest predictability in the testing data, which was actually higher than that in the training data. 
A significant decrease in the predictability was observed for the stroma of chloroplasts, the matrix of mitochondria, and the peroxisomes, all of which were largely dependent on the results of discriminant analysis, which apparently had the danger of overfitting to the training data. The fine structure of organelles, such as the mitochondrial inner membrane, was relatively difficult to predict. If we simply consider all proteins in each organelle as one group, the discrimination accuracy in the training data is 66% for the mitochondrion and 86% for the chloroplast. The prediction accuracy for the combined mitochondrial proteins becomes 64% in the testing data. Table 5 also contains the number of false positives for each site. We noticed that many exceptional proteins are falsely predicted to be cytoplasmic proteins because they do not have usual targeting signals or other features. Many testing proteins were falsely predicted to be peroxisomal proteins, which implies that current knowledge of peroxisomal targeting signals is not specific enough. For practical use, alternative localization sites with lower certainty factors are also suggestive. When we took two sites with the two best certainty factors, the probability of one of them being correct was 71.9% for the training data and 69.8% for the testing data. Notably, the value for the testing data increased significantly and was close to the value for the training data.
Table 4. Core rules for reasoning steps 2 and 3.
(name missing): If not yet, check the ER-transferon by GvH and store the result.
mcg1: If not yet, check the ER-transferon by McG and store the result.
mcg2: If the result of 'mcg1' is obtained, calculate and store the discriminant value.
alom2: If not yet, find the TMS by ALOM (threshold 0.5) and store the result.
mtop1: If there is at least one TMS, calculate the charge difference around the most N-terminal TMS by MTOP.
sig2: If the charge difference predicts the NcytCexo orientation, determine whether there is an ER-transferon and whether it is cleavable from previous results.
sig3: If the charge difference predicts the NexoCcyt orientation, determine whether there is an uncleavable ER-transferon from previous results.
alom3: If the number of TMSs is less than 3 in the mature sequence, change the threshold of ALOM to -2.0.
alom4: If possible, output the final result of ALOM considering the possibility of cleavage and the variable threshold value.
mtop2: If it has a cleavable ER-transferon and one more TMS, it is type Ia.
mtop3: If it has one TMS, does not have a cleavable ER-transferon, and the charge balance predicts NcytCexo, it is type II.
mtop4: If it has one TMS and does not have a cleavable ER-transferon and the charge balance predicts NexoCcyt, it is type Ib.
mtop5: If it has a cleavable ER-transferon and more than one TMS, it is type IIIa.
mtop6: If it does not have a cleavable ER-transferon and has more than one TMS, it is type IIIa or IIIb according to the charge balance.
aac1: If the examination of ER-transferon is finished, calculate and store the amino acid composition of the mature portion.
rgh1: If it has an ER-transferon, the sites on the vesicular pathway have some possibility of being selected.
exgavel: If it might be a mitochondrial protein, examine the possible cleavage site of M-transferon by GAVEL.
mtdisc: If it might be a mitochondrial protein, examine the existence of M-transferon from the AAC of 20 N-terminal residues.
mtmod: If it has a positive possibility of having an M-transferon and does not have an ER-transferon, it may be targeted to a mitochondrion.
chpm: If it might be a chloroplast protein, calculate the maximum hydrophobic moment in the segment from res. 26 to 70.
chlaa1: If it might be a chloroplast protein, calculate the AAC of res. 3 to 10.
chlaa2: If it might be a chloroplast protein, calculate the AAC of res. 1 to 30.
chldisc: If it might be a chloroplast protein, examine the existence of S-transferon from the results of chpm, chlaa1, and chlaa2.
chlmod: If it has a positive possibility of having an S-transferon and does not have an ER-transferon, it may be targeted to a chloroplast.
chlmod2: If it does not have an ER-transferon and the second res. is Ala, it may be targeted to a chloroplast.
(name missing): If it might be an ER luminal protein, the existence of the KDEL (HDEL in yeast) motif around the C-terminus must be examined.
er2: If an ER luminal protein, it is likely to have the motif, an ER-transferon, and no TMSs.
er4: If an ER membrane protein, it may be a type Ib protein whose TMS is located within the 30% region from the N-terminus.
er5: If an ER membrane protein, it may be a type II protein whose TMS is located within the 70% region from the C-terminus.
er6: If an ER membrane protein, it may be a type Ia protein with a cytoplasmic tail of appropriate length containing a retention signal.
er7: If an ER membrane protein, it may be a type IIIa or IIIb protein but the probability is relatively low.
er8: If an ER membrane protein, it may have an uncleavable ER-transferon.
out1: If an extracellular protein, it has a cleavable ER-transferon and does not have TMSs at all.
pm1: If a plasma membrane protein, its topology may be type Ia.
pm2: If a plasma membrane protein, its topology may be type II.
pm3: If a plasma membrane protein, its topology may be type II, its TMS is located within the 40% region from the N-terminus, and there is no M-transferon.
pm4: If a plasma membrane protein, its topology may be type Ib, and its TMS is located within the 40% region from the C-terminus.
pm5: If a plasma membrane protein, its topology may be type IIIa or IIIb and if the number of TMSs exceeds 10, the possibility rises.
pm6: If a plasma membrane protein, its topology may be type Ia, II, or Ib with an NPXY motif in the cytoplasmic tail.
pm7: If a plasma membrane protein, its topology may be type Ia, II, or Ib with a YXRF motif in the cytoplasmic tail.
pm8: If a plasma membrane protein, its N-terminus may be myristylated.
pm9: If a plasma membrane protein, its topology may be type Ia and the length of the tail is less than 10; that is, it may be GPI-anchored.
caax0: If it might be a plasma membrane or nuclear protein and if it has no TMSs, the existence of the CaaX motif should be searched for at the C-terminus.
caax1: If a plasma membrane protein, it might have the C-terminal CaaX motif but does not have TMSs or Nu-transferons.
caax2: If a nuclear protein, it might have the C-terminal CaaX motif and Nu-transferons but does not have TMSs.
glg1: If a Golgi protein, it is likely a type II protein with the (S/T)X(E/Q)(R/K) motif near the TMS.
glg2: If a Golgi protein, its topology might be type IIIa or IIIb.
lys1: If it might be a lysosomal protein, the discriminant score must be calculated from the amino acid composition of the mature portion.
lys2: If a lysosomal membrane protein, its topology should be type Ia with the GY motif in the tail near the TMS and have characteristic AAC.
lys3: If a lysosomal luminal protein, it should have a cleavable ER-transferon, no TMSs, at least two N-glycosylation motifs, and deviated AAC.
nuc1: If it might be a nuclear protein, the NLS motif of length 4 must be searched for.
nuc2: If it might be a nuclear protein, the NLS motif of length 7 must be searched for.
nuc3: If it might be a nuclear protein, the NLS motif of the Robbins et al. (1991) type must be searched for.
nuc4: If it might be a nuclear protein and has no ER-transferon, its basic residue content must be calculated.
nuc5: If it might be a nuclear or cytoplasmic protein, the RNA-binding protein motif must be searched for.
nuc6: If it is a nuclear or cytoplasmic protein, it may have the RNA-binding protein motif.
nuc7: Discriminate the nuclear proteins with the Robbins et al. (1991) type signal.
nuc9: Judge the existence of the various types of Nu-transferons.
pox1: If it might be a peroxisomal protein, the SKL motif must be searched for.
pox2: If it might be a peroxisomal protein, its AAC must be examined.
pox3: If a peroxisomal protein, it may have the SKL motif and featured AAC.
pox4: If the SKL motif exists in its C-terminus, it is very likely a peroxisomal protein.
Note. TMS, transmembrane segment; AAC, amino acid composition. Program names: GvH (von Heijne, 1986), McG (McGeoch, 1985), ALOM (Klein et al., 1985), MTOP (Hartmann et al., 1989), and GAVEL (Gavel and von Heijne, 1990).
In this work, various experimental and computational observations on protein-sorting signals were organized into a consistent knowledge base that can be used to interpret unknown sequences. The knowledge base was realized as a collection of if-then rules (production rules), and utilized for machine inference based on standard techniques in artificial intelligence. Our system turned out to be flexible enough to incorporate diverse types of sorting signals and could contain ambiguous observations and working hypotheses. Furthermore, its performance could be evaluated by the predictability applied to unknown sequences.
There are, however, still problems to be overcome, especially in knowledge acquisition and maintenance. Although rule-based representation is simple enough to update each piece of knowledge, it is time-consuming to revise certainty factors, which requires a global optimization. It is desirable that they be automatically optimized, say, by the neural network method. One of the difficulties in constructing our knowledge base was the assignment of a single appropriate localization site for each protein. There are many proteins whose localization sites are not confined to a single space. For example, some proteins like NF-κB change their localization sites in a regulated manner (Hunt, 1989).
Ribosomal proteins are first sorted from the cytoplasm to the nucleus according to their Nu-transferons, but after their assembly they are transported back to the cytoplasm possibly by a specific mechanism (Underwood and Fried, 1990). We defined ribosomal proteins as nuclear proteins. Future progress in understanding these mechanisms may enable us to make a more detailed prediction of multiple localization sites. Another difficulty was the presence of nonconservative and specific sorting pathways. Some extracellular proteins such as interleukins 1α and 1β do not have N-terminal ER-transferons and are sorted through distinct pathways (Rubartelli et al., 1990). There is also a nuclear transport by specific interaction with a protein with an Nu-transferon (Zhao and Padmanabhan, 1988). Although it may be possible to introduce exception rules dealing with such specific sorting pathways, it is very likely that the ratio of specific sorting mechanisms involved in the total system determines the upper limit of our prediction accuracy.
Fig. 3. A simplified reasoning tree that illustrates the basic strategy for reasoning and the overall organization of rules. Reasoning processes are performed approximately following this tree downward. At each node, a decision is made according to the result of a certain calculation. "O" and "X" are "yes" and "no", respectively, although results are more precisely evaluated by way of modifying certainty factors. Thus, negative branches can still be followed. Finally, every sorting site has some certainty factor at the end of reasoning and the site with the highest certainty will be selected as the probable target site.
To evaluate the performance of our expert system, it was applied to the testing data not used for its construction. The prediction accuracy was about 60% when more than 10 sites were distinguished.
This result should be considered much better than the accuracy of widely used protein secondary structure prediction, which is also about 60% but which distinguishes only three states (helix, sheet, and coil). Apparently, most sorting signals are confined to limited regions of sequences; if there were many signals that depend on peptide conformation, like M6P-modons, the prediction accuracy would have been much worse. In addition, our combinational approach has advantages over individual prediction of each localization site. For example, the control data used in the von Heijne et al. analysis of M-transferons (1989) were adopted from the mature portion of mitochondrial proteins. It is more natural to use the N-terminal regions of proteins that compete with mitochondrial ones in real cellular recognition processes. In this respect, it may be an important observation that proteins with internal ER-transferons seem to be similar to mitochondrial proteins in N-terminal amino acid composition. Since experimental knowledge on sorting signals was not always complete, we had to rely on computational results characterizing proteins of given localization sites, which sometimes may not be directly related to sorting signals. The stepwise discriminant analysis could be effectively used for extracting characteristic amino acid components (Table 3 ). The derived discriminant function for M-transferons shows that the 20-residue segment at the N-terminus is rich in R, but poor in P and acidic residues. The function for S-transferons shows that the segment of residues 1 to 30 is rich in S and A, and the segment of residues 3 to 10 is rich in C. The first variables in both cases were in agreement with the von Heijne et al. observation (1989) , which had been based on different control data. The discriminant function for peroxisomal proteins shows the abundance in aromatic F and W, as well as in Y and H, although its biological significance is unclear. 
It was also found that lysosomal proteins were poor in K, whereas vacuolar proteins were poor in Q, when compared with extracellular proteins. The content of K was almost sufficient for discriminating lysosomal proteins from vacuolar proteins. It is interesting because lysosomes are acidic organelles.
Several hypotheses were also included in the knowledge base to supplement the experimentally proven knowledge. One hypothesis was that the charge difference between both sides of the most N-terminal transmembrane segment would inhibit its cleavage. It is a natural hypothesis because a segment in the reversed orientation in the membrane may be difficult for the signal peptidase to approach. However, some ER-transferons of extracellular proteins had an unusual charge balance, which should be further studied. Another hypothesis was that there would be a preference of type Ib or II proteins depending on the localization site. Usually type Ib proteins have larger cytoplasmic domains and type II proteins have larger extracytoplasmic domains. It is possible that ER membrane proteins, which favor type Ib, possess large cytoplasmic domains for their function, whereas for plasma membrane proteins extracellular domains are important. In fact, a majority of plasma membrane proteins have type Ia topology with short cytoplasmic tails. More samples should be collected for further consideration.
Although we attempted to incorporate the most up-to-date knowledge, there are certain areas for improvement. In the current system, knowledge of polarized sorting in the plasma membrane (Simons and Wandinger-Ness, 1990) has not been included. Knowledge of cell type will also be required for precise prediction. The distinction between constitutive and regulated secretions will also become possible. The prediction of intraorganelle sorting is still insufficient. One reason is that their member proteins may not have been allocated exactly by experiments.
Another reason is that in both mitochondria and chloroplasts, membrane proteins have relatively low hydrophobicity, reflecting the different natures of organelle membranes. Proteins with higher hydrophobicity may have a higher possibility of being falsely recognized by signal recognition particles. In practice, it may be effective to change the threshold of discrimination of their transmembrane segments. Many comparton signals are still open to future study. Above all, the integration of various kinds of degron (degradation) and modon (modification) signals is a challenging subject, enabling the prediction of the overall metabolic fate of proteins. It seems evident that any functional implication that can be derived from determining sequence data will become even more necessary with the development of large-scale sequencing projects. One such example has been reported recently (Adams et al., 1992) . Of the 2375 partial cDNA sequences that were newly determined, 83% were not related to known sequences in the databases. Our goal is to provide additional clues to the characterization of such unknown sequence data for further investigations. However, there is one difficulty in applying our work to the results of partial cDNA data; our method requires full-length sequences for input because missing parts can involve important targeting signals. Nevertheless, it may be used to find some signals. Sequencing N-terminal halves seems more preferable to C-terminal halves for our analyses since many transferons are found in N-terminal parts. 
Sequence identification of 2,375 human brain genes
Mitochondrial proteins essential for viability mediate protein import into yeast mitochondria
Generation of a lysosomal enzyme targeting signal in the secretory protein pepsinogen
Protein sequence database
A common peptide stretch among enzymes localized to the Golgi apparatus: Structural similarity of Golgi-associated glycosyltransferases
How proteins get into microbodies (peroxisomes, glyoxysomes, glycosomes)
NPXY, a sequence often found in cytoplasmic tails, is required for coated pit-mediated internalization of the low density lipoprotein receptor
Short peptide domains target proteins to plant vacuoles
Transferrin receptor internalization sequence YXRF implicates a tight turn as the structural recognition motif for endocytosis
Glycolipid anchoring of plasma membrane proteins
The endoplasmic reticulum retention signal of the E3/19K protein of adenovirus-2 is microtubule binding
Three-dimensional structure of membrane and surface proteins
Cell-surface anchoring of proteins via glycosyl-phosphatidylinositol structures
The OPS83 User's Manual System Version 3.0
A conserved cleavage-site motif in chloroplast transit peptides
Cleavage-site motifs in mitochondrial targeting peptides
A conserved tripeptide sorts proteins to peroxisomes
Peroxisomal protein import is conserved between yeast, plants, insects and mammals
The trimethylguanosine cap structure of U1 snRNA is a component of a bipartite nuclear targeting signal
A polybasic domain or palmitoylation is required in addition to the CAAX motif to localize p21 ras to the plasma membrane
The transit peptide of a chloroplast thylakoid membrane protein is functionally equivalent to a stromal-targeting sequence
Protein sorting to mitochondria: Evolutionary conservations of folding and assembly
Predicting the orientation of eukaryotic membrane spanning proteins
Topography of the membrane-binding domain of cytochrome b5 in lipids by Fourier-transform infrared spectroscopy
The CaaX motif of lamin A functions in conjunction with the nuclear localization signal to target assembly to the nuclear envelope
Prediction of leader peptide cleavage sites for polypeptides of the thylakoid lumen
Cytoplasmic anchoring proteins and the control of nuclear localization
Golgi localization signals
Identification of a consensus motif for retention of transmembrane proteins in the endoplasmic reticulum
Chloroplastic precursors and their transport across the envelope membranes
The detection and classification of membrane-spanning proteins
A new class of lysosomal/vacuolar protein sorting signals
The biogenesis of lysosomes
A specific transmembrane domain of a coronavirus E1 glycoprotein is required for its retention in the Golgi region
Posttranslational modification of proteins by isoprenoids in mammalian cells
On the predictive recognition of signal peptide sequences
The fats of life: The importance and function of protein acylation
Prediction of in-vivo modification sites of proteins from their primary structures
Expert system for predicting protein localization sites in Gram-negative bacteria
Topogenesis of peroxisomal proteins
Topology of eukaryotic type II membrane proteins: Importance of N-terminal positively charged residues flanking the hydrophobic domain
The retention signal for soluble proteins of the endoplasmic reticulum
Biosynthetic protein transport and sorting by the endoplasmic reticulum and Golgi
A common RNA recognition motif identified within a defined U1 RNA binding domain of the 70K U1 snRNP protein
Membrane interactions of pp60v-src: A model for myristylated tyrosine protein kinases
Two interdependent basic domains in nucleoplasmin nuclear targeting sequence: Identification of a class of bipartite nuclear targeting sequence
Protein targeting to the yeast vacuole
A novel secretory pathway for interleukin-1β, a protein lacking a signal sequence
Fatty acylation of proteins
How proteins enter the nucleus
Polarized sorting in epithelia
The structure and insertion of integral proteins in membranes
Expert System Kochiku-no-hōhō
The biology and enzymology of eukaryotic protein acylation
Characterization of nuclear localizing sequences derived from yeast ribosomal protein L29
Naming a targeting signal
Protein translocation across membranes
A new method for predicting signal sequence cleavage sites
Topogenic signals in integral membrane proteins
Domain structure of mitochondrial and chloroplast targeting peptides
A Guide to Expert Systems
Accumulation of membrane glycoproteins in lysosomes requires a tyrosine residue at a particular position in the cytoplasmic tail
A novel pathway of import of α-mannosidase, a marker enzyme of vacuolar membrane, in Saccharomyces cerevisiae
Nuclear transport of adenovirus DNA polymerase is facilitated by interaction with preterminal protein
We thank Dr. Koreaki Ito for critically reading the manuscript. This work was supported by a grant from the Human Frontier Science Program.