key: cord-1002227-98uzivsk
authors: Zhang, Zheng; Zhu, Zhaozhong; Chen, Wenjun; Cai, Zena; Xu, Beibei; Tan, Zhiying; Wu, Aiping; Ge, Xingyi; Guo, Xinhong; Tan, Zhongyang; Xia, Zanxian; Zhu, Haizhen; Jiang, Taijiao; Peng, Yousong
title: Membrane proteins with high N-glycosylation, high expression, and multiple interaction partners were preferred by mammalian viruses as receptors
date: 2018-03-08
journal: bioRxiv
DOI: 10.1101/271171
sha: fc6812d62536afc0018f27069e17aba0f4063d09
doc_id: 1002227
cord_uid: 98uzivsk

Receptor-mediated entry is the first step of viral infection. However, the relationship between viruses and receptors is still obscure. Here, by manually curating a high-quality database of 268 pairs of mammalian virus-host receptor interactions, which included 128 unique viral species or sub-species and 119 virus receptors, we found that the viral receptors were structurally and functionally diverse, yet they had several common features when compared to other cell membrane proteins: more protein domains, a higher level of N-glycosylation, a higher ratio of self-interaction, more interaction partners, and higher expression in most tissues of the host. Additionally, the receptors used by the same virus tended to co-evolve. Further correlation analysis between viral receptors and the tissue and host specificity of the virus showed that virus receptor similarity was a significant predictor of mammalian virus cross-species infection. This work could deepen our understanding of viral receptor selection and help evaluate the risk of viral zoonotic diseases.

systematic analysis of the characteristics of the viral receptor could help understand the mechanisms underlying receptor selection by viruses.

The virus-receptor interaction was reported to be a principal determinant of viral host range, tissue tropism and cross-species infection [11, 16, 22].
The existence and expression of the virus receptor in a host (or tissue) should be a prerequisite for viral infection of the host (or tissue) [21]. Usually, a virus mainly infects particular types of hosts or tissues. For example, the influenza virus mostly infects cells of the respiratory tract [23]. However, the virus-receptor interaction is a highly dynamic process. Some viruses can recognize one or more receptors [13, 14, 24], which can also differ among virus variants or during the course of infection [14, 25, 26]. In some cases, a few amino acid mutations in the viral protein or the receptor could abolish or enhance viral infection [27-29]. Besides, the virus-receptor interaction is under

We first investigated the structural characteristics of mammalian virus receptor proteins. As expected, all the mammalian virus receptor proteins were membrane proteins with at least one transmembrane alpha helix (Figure S2A). Twenty-four of them had more than five helices, such as 5-hydroxytryptamine receptor 2A (HTR2A) and NPC intracellular cholesterol transporter 1 (NPC1). The receptor proteins were mainly located in the cell membrane. Besides, more than one third (43/119) of them were also located in the cytoplasm, and thirteen of them were located in the nucleus.

Then, the protein domain composition of the mammalian virus receptor proteins was analyzed. The mammalian virus receptor proteins contained a total of 336 domains based on the Pfam database, with each viral receptor protein containing more than two domains on average (Figure S2B). This was significantly more than that of human proteins or human membrane proteins (p-values < 0.001 in the Wilcoxon rank-sum test). Some viral receptor proteins may contain more than 10 domains, such as complement C3d receptor 2 (CR2) and low density lipoprotein receptor (LDLR).
The human viral receptors were observed to have ten or more N-glycosylation sites, such as complement C3b/C4b receptor 1 (CR1) and lysosomal associated membrane protein 1 (LAMP1). Figure 2B displays the modeled 3D structure of HTR2A, the receptor for JC polyomavirus (JCPyV). Five N-glycosylation sites, which were reported to be important for viral infection [30], are highlighted in red on the structure. For comparison, we also characterized the N-glycosylation level of human cell membrane proteins, human membrane proteins and all human proteins (Figure 2A). They were found to have a significantly lower level of N-glycosylation than human and mammalian virus receptors (p-values < 0.001 in the Wilcoxon rank-sum test), which suggests the importance of N-glycosylation for the viral receptor.

enriched, such as "Regulation of leukocyte activation" and "Lymphocyte activation". For the GO Molecular Function (Table S3), besides the enrichment of terms related to virus receptor activity, the human virus receptors were also enriched in terms of binding to integrin, glycoprotein, cytokine, and so on.

Consistent with the enrichment analysis of GO Cellular Component, the KEGG pathways of "Cell adhesion molecules", "Focal adhesion" and "ECM-receptor interaction" were also enriched. Besides, the pathway of "Phagosome" was enriched (Table S3), which may be associated with viral entry into the host cell. Interestingly, some pathways associated with heart diseases were enriched, including "Dilated cardiomyopathy", "Hypertrophic cardiomyopathy", "Arrhythmogenic right ventricular cardiomyopathy" and "Viral myocarditis".

We next analyzed the protein-protein interactions (PPIs) in which the mammalian virus receptor proteins took part. For the reason mentioned above, we only used the human (Table S3).
When looking at the interactions between viral receptors, we found that 38 of 74 viral receptors interacted with themselves. This ratio (38/74 = 51%) was much higher than that of human proteins (22%), membrane proteins (11%) and human cell membrane proteins (14%). However, we found that the viral receptors tended not to interact with each other (Figure S3D

Since the virus has to compete with other proteins for binding to the receptor, proteins (Table S5).

Since the viral receptor determines the host specificity of the virus to a large extent, it is expected that the more similar the viral receptor is to its homolog in a species, the more likely the virus that uses the receptor is to infect that species. To validate this hypothesis, we first calculated the sequence identities between the viral receptors and their homologs in 108 mammal species (Figure 5 and Table S6). For clarity, only 26 frequently observed mammal species are presented in Figure 5. Then, Table S6.

What is the relationship between glycosylation and viral receptor selection? As we know, glycosylation of proteins is widely observed in eukaryotic cells [32]. It plays an important role in multiple cellular activities, such as the folding and stability of glycoproteins, immune response, cell-cell adhesion, and so on. Glycans are abundant on host cell surfaces. They were probably the primordial and fallback receptors for viruses [11]. To use glycans as their receptors, a large number of viruses have stolen a host galectin and employed it as a viral lectin [11, 33]. For example, the SJR fold, which is mainly responsible for glycan recognition and binding in cellular proteins, was observed in the viral capsid proteins of over one fourth of viruses [33].
Thus, during the process of searching for protein receptors, a protein with a high level of glycosylation could provide a basal attachment ability for the virus, and should be a preferred receptor for the virus.

Secondly, our analysis showed that the viral receptor proteins had a tendency to interact with themselves and had far more interaction partners than other membrane proteins. Besides functioning as a viral receptor, the receptor protein functions in the host cell by interacting with other host proteins, such as signal molecules and ligands. Therefore, the virus has to compete with these proteins for binding to the receptor [15]. Proteins with fewer interaction partners would thus be expected to be preferred by the virus. Why did the virus select proteins with multiple interaction partners as receptors? One possible reason is that the receptor proteins are closely related to the "door" of the cell, so that many proteins have to interact with them to move in and out of the cell. This could be partly validated by the observation that, for the interaction partners of human viral receptors, six of the top ten enriched terms in the domain of GO Biological Process were related to protein targeting or localization (Table S3). For entry into the cell, the virus also selects these proteins as receptors. Another possible reason is that viral entry into the cell needs the cooperation of multiple proteins that have not yet been identified as viral receptors. Besides, previous studies show that viruses can structurally mimic native host ligands [34], which helps them bind to the host receptor.

(Table S1). The number of transmembrane alpha helices of each mammalian virus receptor was derived from the UniProtKB database and the web server TMpred [46].
The location of the viral receptor was inferred from the description of "Subcellular location" for the receptor protein provided by UniProtKB, or from its GO annotations: the viral receptors annotated with GO terms that included the words "cell surface" or "plasma membrane" were considered to be located in the cell membrane; those annotated with GO terms that included the words "cytoplasm", "cytosol" or "cytoplasmic vesicle", or shown to be in the cytoplasm in UniProtKB, were considered to be located in the cytoplasm; those annotated with the GO terms "nucleus" (GO:0005634) or "nucleoplasm" (GO:0005654) were considered to be located in the [49] in R (version 3.4.2) [50].

To identify the homologs of the mammalian virus receptors in other mammal species, the protein sequence of each viral receptor was searched against a database of mammalian protein sequences, which were downloaded from the NCBI RefSeq database [54] on October 10th, 2017, with the help of BLAST (version 2.6.0) [55]. Analysis showed that, in the database of mammalian protein sequences, there were 108 mammal species that were richly annotated and had far more protein sequences than other mammal species (Table S7). Therefore, only these 108 mammal species were considered in the evolutionary analysis. Based on the BLAST results, a homolog of the viral receptor was defined as a hit with an E-value smaller than 1E-10, coverage equal to or greater than 80% and sequence identity equal to or greater than 30%. Only the closest homolog, i.e., the best hit, in each mammal species was used (Table S4). Similar methods as above were utilized to calculate the indicators of conservation level for these proteins.
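The homolog definition given above (E-value smaller than 1E-10, coverage of at least 80%, sequence identity of at least 30%, and only the best hit per species) could be sketched as follows. This is a minimal illustration and not the authors' pipeline: the `BlastHit` record, the function names, and the use of the bit score to rank the "best hit" are assumptions made for the example, and the hits listed are invented.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class BlastHit:
    """One BLAST hit of a receptor query against a mammalian protein (hypothetical record)."""
    species: str
    subject_id: str
    evalue: float
    coverage: float   # percent of the query covered by the alignment
    identity: float   # percent sequence identity
    bit_score: float  # assumed ranking criterion for the "best hit"


def is_homolog(hit: BlastHit,
               max_evalue: float = 1e-10,
               min_coverage: float = 80.0,
               min_identity: float = 30.0) -> bool:
    """Apply the three thresholds stated in the text."""
    return (hit.evalue < max_evalue
            and hit.coverage >= min_coverage
            and hit.identity >= min_identity)


def best_homolog_per_species(hits: Iterable[BlastHit]) -> Dict[str, BlastHit]:
    """Keep, for each species, the qualifying hit with the highest bit score."""
    best: Dict[str, BlastHit] = {}
    for hit in hits:
        if not is_homolog(hit):
            continue
        current = best.get(hit.species)
        if current is None or hit.bit_score > current.bit_score:
            best[hit.species] = hit
    return best


# Invented hits for illustration only.
hits = [
    BlastHit("Mus musculus", "hypothetical_A", 1e-50, 95.0, 82.0, 410.0),
    BlastHit("Mus musculus", "hypothetical_B", 1e-20, 85.0, 45.0, 150.0),
    BlastHit("Sus scrofa", "hypothetical_C", 1e-5, 90.0, 60.0, 50.0),    # fails E-value
    BlastHit("Bos taurus", "hypothetical_D", 1e-30, 70.0, 75.0, 200.0),  # fails coverage
]
homologs = best_homolog_per_species(hits)
```

With these toy hits, only the mouse entry survives the filter, and the higher-scoring of its two hits is retained.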
For the analysis of co-evolution between viral receptors, a phylogenetic tree was first built for each viral receptor based on the protein sequences of the receptor and its homologs in the 108 mammal species with the help of PHYLIP (version 3.68) [56]. The neighbor-joining method was used with the default parameters. Then, the genetic distances between the viral receptor and its homologs were extracted from the phylogenetic tree with a Perl script. Finally, for a pair of viral receptors, the Spearman correlation coefficient (SCC) was calculated between the pairwise genetic distances of the viral receptors and their homologs, which was used to measure the extent of co-evolution between this pair of viral receptors.

The set of human housekeeping genes was adapted from the work of Eisenberg et al. [57]. A total of 3804 genes were identified as housekeeping genes. (Table S5)

The mammalian virus-host interactions were primarily adapted from Olival's work [7]. One hundred and fifteen viruses in our database and 61 of the 108 richly annotated mammal species could be mapped to those in Olival's work (Table S8). These 115 viruses used a total of 116 viral receptors. The sequence identities of these viral receptor proteins to their related homologs in the corresponding mammal species are presented in Table S6.

For comparison, we also extracted the genetic distances (host relatedness) between the mammal species and the viral host with reported receptors based on Olival's work (Table S9). Then, the genetic distance of the mammal species to the viral host with reported receptors, and the sequence identity of the receptor homolog in the mammal species to the viral receptor protein, were each used to predict whether a mammal species could be infected by the virus that infected the host with reported receptors.
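As a rough illustration of how two such predictors could be scored against known infection outcomes, the sketch below computes the area under the ROC curve from its rank-based definition (the probability that a randomly chosen infected species scores higher than a randomly chosen non-infected one, with ties counted as one half). It is a plain-Python stand-in written for this example, not the R code used in the study; the function name `auc` here is our own, the numbers are invented, and no statistical comparison of the two curves is attempted.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    scores: higher values mean "more likely to be infected".
    labels: 1 for species reported as infected, 0 otherwise.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Invented toy data: six mammal species for one virus.
infected = [1, 1, 0, 1, 0, 0]
receptor_identity = [98.0, 95.0, 90.0, 72.0, 65.0, 60.0]  # percent identity to the receptor
host_distance = [0.1, 0.2, 0.5, 0.3, 0.6, 0.7]            # genetic distance to the known host

auc_identity = auc(receptor_identity, infected)
# A smaller distance should mean a higher infection likelihood, so the sign is flipped.
auc_distance = auc([-d for d in host_distance], infected)
```

The sign flip for the genetic distance is the one design choice worth noting: both predictors must be oriented so that larger scores indicate a higher chance of infection before their AUC values can be compared.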
The Receiver Operating Characteristic (ROC) curve was used to evaluate and compare the performance of the two predictors with the functions roc(), auc(), roc.test() and plot.roc() in the "pROC" package [59] in R (version 3.2.5).

Statistical analysis

All the statistical analyses were conducted in R (version 3.2.5) [50]. The Wilcoxon rank-sum test was conducted with the function wilcox.test().

HZ, TJ and YP conceived and designed the study. ZZ and ZZZ did the computational