key: cord-0013068-gt2wzajp authors: Gavel, Ylva; von Heijne, Gunnar title: Sequence differences between glycosylated and non-glycosylated Asn-X-Thr/Ser acceptor sites: implications for protein engineering date: 1990-04-03 journal: Protein Eng DOI: 10.1093/protein/3.5.433 sha: 90eb4cfdcfa5b9a19ffebef186bee0ea8643d3ec doc_id: 13068 cord_uid: gt2wzajp

In N-glycosylated glycoproteins, carbohydrate is attached to Asn in the sequence Asn-X-Ser/Thr, where X denotes any amino acid. However, the presence of this consensus peptide does not always lead to glycosylation. We have compiled an extensive collection of glycosylated and non-glycosylated Asn-X-Thr/Ser sites and present a statistical study based on this data set. Our results indicate that non-glycosylated sites tend to be found more frequently towards the C termini of glycoproteins, and that proline residues in positions X and Y in the consensus Asn-X-Thr/Ser-Y strongly reduce the likelihood of N-linked glycosylation. Beyond this, there are no obvious local sequence features that seem to correlate with the absence or presence of N-linked glycosylation. These findings are discussed in terms of the prediction and engineering of glycosylation sites in secretory proteins.

The attachment of N-linked carbohydrates to proteins is thought to occur during or shortly after translocation of the nascent chain into the lumen of the endoplasmic reticulum (Kaplan et al., 1987; Lennarz, 1988; Hubbard and Ivatt, 1981). The oligosaccharide chain is transferred by the enzyme oligosaccharyl transferase to the asparagine in the consensus tripeptide Asn-X-Thr/Ser, where X is any amino acid (Marshall, 1972). Most putative acceptor sites that become exposed on the lumenal side of the endoplasmic reticulum (ER) membrane are efficiently glycosylated, but some are never used.
It has long been known that glycosylation is blocked when X is a proline (Mononen and Karjalainen, 1984), but this rule accounts for only a minor portion of all known non-glycosylated consensus sites. Here, we present a study based on a data set of carefully selected glycosylated (gs+) and non-glycosylated (gs−) Asn-X-Thr/Ser sites. All gs− sites included in this set have been checked in the literature. Sites from homologous proteins have been removed from both the gs+ and gs− sets. Statistical methods have been used to test the significance of the results. Our analysis indicates that glycosylation is strongly inhibited by proline residues in both positions X and Y of the consensus Asn-X-Thr/Ser-Y. Also, non-glycosylated sites tend to be found more frequently towards the C termini of the proteins in our sample, whereas glycosylated sites are rare in this region. These observations allow prediction of gs+ sites to be made with ~95% confidence, whereas only some 25% of all gs− sites can be reliably predicted from the primary sequence.

Sequences of glycoproteins were collected from the literature and from the NBRF-PIR database (George et al., 1986).

Earlier studies of the sequence patterns associated with N-linked glycosylation suffer from a number of methodological shortcomings, e.g. small sample sizes, inclusion of sites from homologous proteins, and lack of statistical analysis. But the most obvious weakness is that too little attention has been paid to the collection of proper sets of both gs+ and gs− sites, thus precluding any useful comparisons between the two classes of sequences. gs− sites must be picked with some care. Many proteins are non-glycosylated merely because they are never exposed to the carbohydrate-transferring enzyme. Asn-X-Thr/Ser sequences from such proteins should not be included in the gs− set.
Since N-glycosylation is thought to occur in the lumen of the ER (Kaplan et al., 1987; Lennarz, 1988), gs− sites from cytoplasmic proteins must be excluded. The same is true of sites from intracellular and transmembrane parts of membrane proteins. Furthermore, sites from proteins produced by cells unable to carry out N-glycosylation must be avoided. For these reasons, we have restricted the data set to gs− sites found in lumenal domains within the sequences of proteins that also contain gs+ sites and thus are certain to have been exposed to the oligosaccharyl transferase.

In an extensive literature search, we found a total of 55 gs− sites in proteins known to be N-glycosylated. A total of 48 gs− sites were included in our final data set. The remaining seven (Robinson and Appella, 1979; Takahashi et al., 1984; Van Den Berg et al., 1976; Beintema, 1985; Havinga and Beintema, 1980) were from highly homologous proteins, mainly pancreatic ribonucleases. We also collected ~600 gs+ sites from the NBRF database and from the literature. Again, obviously homologous proteins were removed in order to avoid distortions of the statistics. The final version of our data set contained 417 gs+ sites.

All the gs+ sites were explicitly stated to be glycosylated in the literature or in the database. Some references reported non-glycosylated sites as well. For the rest of the gs− sites, the absence of sugar could be inferred from experimental data presented in the literature. Some potential glycosylation sites were located in carbohydrate-free tryptic peptides and therefore were not glycosylated. In other cases, it was possible to make assignments based on the results from sequence determinations. With most sequencing methods, a glycosylated residue cannot be detected; instead, a blank appears in the sequence, and the amino acid in this position has to be identified by other means.
Therefore, if some of the asparagines found in the Asn-X-Thr/Ser sequences of a protein show up as blanks whereas others do not, those which give an Asn signal can be assumed to be non-glycosylated. The sites included in the data set are given in Table I. In some sequences there are additional potential glycosylation sites that could not be assigned or were discarded because they were located in transmembrane or cytoplasmic domains of the integral membrane proteins. These sites are not mentioned. Known partially glycosylated sites have been counted in the gs+ set. If a protein contains sites that had to be excluded from the statistics owing to homology, this is noted in the table.

Table I. Sites included in the data set (table body not preserved; surviving reference entries: Kingston and Williams (1975); Baudys and Kostka (1983); Welinder).
Notes to Table I:
(1) According to experimental evidence presented in the reference.
(2) According to experimental evidence cited in the reference.
(3) In the reference, the absence of carbohydrate at this site is explicitly mentioned.
(4) PTH-Asn was detected.
(5) The relevant portion of the protein did not contain carbohydrate.
(6) The Asn(180)-Gly(181) bond was susceptible to cleavage with hydroxylamine.
a Among the other sites listed, there is at least one located in a sequence highly homologous to this one. Therefore, this site has not been included in the sequence statistics.
b In some molecules, due to amino acid substitution.
c In those positions where the amino acid was not identified, the corresponding amino acid of S6-glycoprotein has been used instead.
(NBRF) The NBRF database. See George et al. (1986).
When the protein is known to contain a cleaved N-terminal signal sequence, the residues of the prepeptide are given negative numbers, i.e. amino acid number one corresponds to the N terminus of the mature protein.

In a small number of cases, the sequence around the reported N-glycosylated asparagine did not agree with the Asn-X-Thr/Ser consensus.
As was also noted by Nakai and Kanehisa (1988), three Asn-X-Cys patterns have been reported as gs+ sites in the NBRF database. The possibility of carbohydrate attachment at such sites was predicted by Bause and Legler (1981). The Asn-X-Cys sites were found in bovine and human protein C and in human von Willebrand factor. However, we did not find any experimental evidence for glycosylation of the Asn-X-Cys site in the reference given for human protein C (Foster et al., 1985). We also found another unusual N-glycosylation site: in murine IgM heavy chain, carbohydrate is found bound to asparagine in the sequence Asn-Gly-Gly-Thr. A similar site has been reported for egg yolk phosvitin, a protein derived from vitellogenin. In this case, the sequence at the point of attachment was reported as Asn-Ser-Gly-Psr, where Psr is phosphoserine (Shainkin and Perlmann, 1971). However, the nucleotide sequence (van het Schip et al., 1987; Byrne et al., 1984) indicates that the site is of the normal Asn-Gly-Ser type. Thus, although some of the putative non-standard sites may have been erroneously identified, at least a couple remain that seem to be authentic (Titani et al., 1986; Kehry et al., 1979; Stenflo and Fernlund, 1982). In exceptional cases, then, N-linked glycosylation does not seem to require the Asn-X-Ser/Thr consensus.

As can be seen in Table I, gs+ sites are far more common in glycoproteins than are gs− sites. Apparently, if the oligosaccharyl transferase is present, the Asn-X-Thr/Ser signal leads to glycosylation approximately nine times out of ten.

In order to compare gs+ and gs− sequences, we extracted 33-residue segments centred around the glycosylation signals listed in Table I. Amino acid distributions were calculated for gs+ and gs− sites separately. The results for the residues immediately surrounding the consensus tripeptide are shown in Table II.
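The segment extraction and per-position tallying described above can be illustrated with a minimal Python sketch. It is not the authors' procedure: the function names, the `-` padding character, and the choice to centre each 33-residue window on the Asn (position 0, with 16 residues on either side) are assumptions made for illustration only.

```python
import re
from collections import Counter

# Lookahead pattern so that overlapping sequons (e.g. N-N-T-S) are all reported.
# X is allowed to be any residue, including Pro, as in the plain consensus.
SEQUON = re.compile(r"(?=N[A-Z][ST])")


def find_sequons(seq):
    """Return the 0-based index of the Asn of every Asn-X-Thr/Ser match."""
    return [m.start() for m in SEQUON.finditer(seq)]


def window(seq, asn_index, flank=16, pad="-"):
    """Return the (2 * flank + 1)-residue segment centred on the Asn.

    Positions that fall outside the sequence are filled with `pad`
    (an assumption; the source does not say how termini were handled).
    """
    padded = pad * flank + seq + pad * flank
    return padded[asn_index:asn_index + 2 * flank + 1]


def position_counts(windows, flank=16, pad="-"):
    """Map each relative position (Asn = 0) to a Counter of residues."""
    counts = {pos: Counter() for pos in range(-flank, flank + 1)}
    for w in windows:
        for i, residue in enumerate(w):
            if residue != pad:  # skip padding beyond the termini
                counts[i - flank][residue] += 1
    return counts


if __name__ == "__main__":
    toy = "MKNPTANGSAANNTS"  # made-up sequence, not taken from Table I
    sites = find_sequons(toy)
    segments = [window(toy, i) for i in sites]
    table = position_counts(segments)
    print(sites)
    print(table[1])  # residues observed at position +1 (X)
```

Running the tally separately on windows from gs+ and gs− sites would give two per-position distributions of the kind the text compares.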
According to previous statistical studies, Pro is very rare or even absent in position +1 of gs+ sites (Mononen and Karjalainen, 1984). The statistical significance of this observation is confirmed by our data (Figure 1; P < 2 × 10⁻⁸ as estimated from a binomial distribution with p = 0.0558, i.e. the mean frequency of Pro outside positions 0 to +3). Actually, the frequency of glycosylated Asn-Pro-Thr/Ser sites may be even lower since the Pro-containing site in thyroxine-binding globulin may have been erroneously identified as a gs+ site (an

[Table: amino acid frequencies by position (columns: Amino acid, Position); numeric entries not recoverable.]
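The binomial estimate quoted above for Pro at position +1 can be sketched as a lower-tail probability. This is a minimal illustration, not the authors' calculation. The sample size of 417 gs+ sites and the background frequency of 0.0558 come from the text. The observed number of Pro residues at position +1 is not stated in this passage, so `k_observed` is a hypothetical input, and `binomial_lower_tail` is a name introduced here.

```python
from math import comb


def binomial_lower_tail(n, k, p):
    """Return P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


N_SITES = 417          # gs+ sites in the final data set (from the text)
P_BACKGROUND = 0.0558  # mean Pro frequency outside positions 0 to +3 (from the text)

# Hypothetical observed counts of Pro at position +1; the actual count
# is not given in this passage.
for k_observed in (0, 1, 2, 3):
    tail = binomial_lower_tail(N_SITES, k_observed, P_BACKGROUND)
    print(f"k = {k_observed}: P(X <= k) = {tail:.2e}")
```

Under these inputs, a count of two or fewer Pro residues among the 417 sites falls below the quoted bound of 2 × 10⁻⁸, whereas a count of three does not.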