key: cord-0846980-l3lcob5w authors: Tan, Yongjun; Schneider, Theresa; Shukla, Prakash K.; Chandrasekharan, Mahesh B.; Aravind, L; Zhang, Dapeng title: Unification of the M/ORF3-related proteins points to a diversified role for ion conductance in pathogenesis of coronaviruses and other nidoviruses date: 2020-11-11 journal: bioRxiv DOI: 10.1101/2020.11.10.377366 sha: 4a6508be496c88ebaff411f7dc24f3659b09571b doc_id: 846980 cord_uid: l3lcob5w

The new coronavirus SARS-CoV-2, responsible for the COVID-19 pandemic, has emphasized the need for a better understanding of the evolution of virus-host conflicts. The ORF3a proteins of both SARS-CoV-1 and SARS-CoV-2 are ion channels (viroporins) involved in virion assembly and membrane budding. Using sensitive profile-based homology detection methods, we unify the SARS-CoV ORF3a family with several families of viral proteins, including ORF5 from MERS-CoVs, proteins from beta-CoVs (ORF3c) and alpha-CoVs (ORF3b), most importantly the Matrix (M) proteins from CoVs, and more distant homologs from other nidoviruses. By sequence analysis and structural modeling, we show that these viral families utilize specific conserved polar residues to constitute an ion-conducting pore in the membrane. We reconstruct the evolutionary history of these families and objectively establish the common origin of the M proteins of CoVs and Toroviruses. We show that the divergent ORF3a/ORF3b/ORF5 families represent a duplication stemming from the M protein in alpha- and beta-CoVs. By phyletic profiling of major structural components of primary nidoviruses, we present a model for their role in virion assembly of CoVs, ToroVs and Arteriviruses. The unification of diverse M/ORF3 ion channel families in a wide range of nidoviruses, especially the typical M protein in CoVs, reveals a conserved, previously under-appreciated role of ion channels in virion assembly, membrane fusion and budding.
We show that M and ORF3 are under differential evolutionary pressures; in contrast to the slow evolution of M as a core structural component, the CoV-ORF3 clade is under selection for diversification, which indicates that it is likely at the interface with host molecules and/or immune attack.

IMPORTANCE Coronaviruses (CoVs) have become a major threat to human welfare as the causative agents of several severe infectious diseases, namely Severe Acute Respiratory Syndrome (SARS), Middle East Respiratory Syndrome (MERS), and the recently emerging human coronavirus disease 2019 (COVID-19). The rapid spread and severity of these diseases, as well as the potential re-emergence of other CoV-associated diseases, have created a strong need for a thorough understanding of the function and evolution of these CoVs. By utilizing robust domain-centric computational strategies, we have established homologous relationships between many divergent families of CoV proteins, including SARS-CoV/SARS-CoV-2 ORF3a, MERS-CoV ORF5, proteins from both beta-CoVs (ORF3c) and alpha-CoVs (ORF3b), the typical CoV Matrix proteins, and many distant homologs from other nidoviruses. We present evidence that they are active ion channel proteins and that the CoV-specific ORF3 clade proteins are under selection for rapid diversification, suggesting that they might be involved in interfering with host molecules and/or immune attack.

KEYWORDS SARS-CoV-2, COVID-19, nidovirus, Matrix protein, ORF3a, ion channels, evolution

The recent outbreak of human coronavirus disease 2019 (COVID-19) has generated a global health crisis (1).
It is the seventh human disease caused by coronaviruses, after Severe Acute Respiratory Syndrome (SARS) in 2003 (2), Middle East Respiratory Syndrome (MERS) in 2012 (3), and four less-severe infections caused by human coronaviruses hCoV-229E (4), hCoV-OC43, hCoV-NL63 in 2004 (5), and hCoV-HKU1 in 2004 (6). Of these, SARS-CoV-2, SARS-CoV/SARS-CoV-1, MERS-CoV, hCoV-OC43 and hCoV-HKU1 belong to the beta coronavirus clade, while hCoV-229E and hCoV-NL63 belong to the alpha coronavirus clade. Although the broad genomic structure and core gene composition of these viruses are similar, the pathology and severity of the diseases they cause, including that caused by SARS-CoV-2, are markedly distinct. According to the WHO report, as of October 2020 there have been over 36 million confirmed cases and over 1 million deaths from COVID-19 globally. A better understanding of the biology and evolution of SARS-CoV-2 is therefore needed to combat and prevent the disease.

Coronaviruses possess a large positive-sense single-stranded RNA genome, with two thirds of the genome coding for the ORF1a/ORF1ab polyprotein. This is followed by several ORFs encoding so-called structural and accessory proteins, several of which might be variable between viruses (2). The ORF1 polyprotein is processed into smaller proteins by proteolytic cleavage catalyzed by one of its constituent components (the peptidase domain) (7); the resulting proteins function in viral replication, viral RNA processing (e.g. the xEndoU endoRNase domain) and countering of defenses centered on NAD+/ADP-ribose (Macro domains) (8). The structural and accessory proteins contribute to virion structure and assembly, virulence, and immune manipulation and invasion (2, 9). However, despite concerted experimental studies, a structural understanding of many of these viral proteins is still missing; in SARS-CoV-2, for example, these include ORF3a, ORF3b, M, ORF6, ORF8, ORF9b, ORF9c, ORF10, and certain parts of ORF1a/b.
Here, we utilize a domain-centric computational strategy to systematically study the function and structure of CoV proteins. In our recent work, we demonstrated that the mysterious SARS-CoV-2 protein ORF8 belongs to a novel family of the immunoglobulin fold. We also showed that ORF8 is fast-evolving and that its function is likely to disrupt host immune responses (10). In this study, we present results on the function and evolution of novel ion channel proteins in CoVs and other nidoviruses.

Viral ion channels (viroporins) represent a new functional class of proteins that have been identified in different animal viruses, including human immunodeficiency virus (HIV), hepatitis C virus, and influenza A virus (11). These proteins have been shown to facilitate several steps of the viral life cycle, from genome replication, viroplasm formation, and virion budding to viral infection (11). CoVs also code for their own ion channels. The SARS-CoV ORF3a was found to function as a potassium-specific channel promoting virus release (12). Thereafter, several other CoV proteins were shown to display similar ion channel activities, including porcine epidemic diarrhea virus (PEDV) ORF3 (13), hCoV-229E ORF4a (14), and the SARS-CoV envelope (E) protein (15, 16). Among them, SARS-CoV ORF3a, PEDV ORF3 and hCoV-229E ORF4a appear to utilize their three-transmembrane (3-TM) region to constitute an ion channel in the form of either a dimer or a tetramer, whereas the SARS-CoV E protein, with a 1-TM region, forms an ion channel as a pentamer (15, 16). Recently, a similar ion channel activity was also observed in SARS-CoV-2 ORF3a, and its structure was solved (17). This provides an opportunity to systematically identify other ion channel proteins in coronavirus and related genomes by using sensitive profile-based homology detection and structural modeling methods.
As a result, we have identified several homologous protein families, including ORF5 from MERS-CoV, several proteins from beta-CoVs, and ORF3b from alpha-CoVs. Importantly, we show that the well-known Matrix (M) proteins from CoVs and several other nidoviruses are also homologous to the ORF3a proteins, and we identify the likely ancestral members of the family. We present evolutionary and structural evidence that they utilize distinct conserved residues to constitute the ion-conducting pore in the membrane. Finally, we reconstructed the changes in genome composition during the evolution of CoVs and other related viruses, which suggest an evolutionarily conserved role of ion channels in virion assembly and membrane fusion/budding.

ORF3a has two domains: a leading 3-transmembrane (3-TM) region and a β-sandwich domain. We first identified the viral homologs of ORF3a by conducting iterative sequence searches using PSI-BLAST against the NCBI NR database (18). A multiple sequence alignment was conducted to identify the evolutionarily conserved residues, and the majority of them are located within the ion channel pore of the 3-TM structure (S1 superfamily. Based on both sequence searches and clustering analysis, we divided these proteins into nine families (Fig 1). All of them share a common domain architecture, with the 3-TM region followed by a β-sandwich domain. We carefully generated a super-alignment of the β-sandwich domains of these proteins, which revealed a comparable eight-β-strand arrangement despite their low sequence identity (Fig 2). There are no universally conserved polar residues in these C-terminal β-sandwich domains, indicating that they are not enzymatic modules. As ORF3a of the SARS-CoV clade functions as an ion channel (12, 17), we next asked whether other viral protein families might have similar functions.
We generated dimeric homology models for several representatives of other families (Fig 3) by using the SARS-CoV-2 ORF3a structure as the template (PDB: 6XDC) (17). We then identified the evolutionarily conserved residues for each viral family and evaluated them for any conserved compositional trends. These viral families do not share any conserved residues across the whole superfamily (S2-S7 Figs). However, they display a unique conservation of family-specific polar residues, mostly basic residues, which are located on the internal surface of the ion channel cavity (Fig 3). This indicates that these residues have been conserved to maintain an aqueous pore within the membrane-spanning region. Therefore, all the novel viral families are potential ion channels whose predicted pore-forming regions are defined by family-specific polar residues.

When we examined the genome organizations of the viruses that contain the M and ORF3-like proteins, we found that all ORF3-like proteins are strictly coupled with one typical M protein on the viral genome in the order ORF3-E-M (Fig 5), where ORF3 represents one of the divergent ORF3-like (M2) families, E is the envelope protein, and M is the typical matrix protein of CoVs. The inter-relationships of the alpha- and beta-CoV ORF3-like (M2) families are consistent with the internal relationships of the coupled M proteins on the same viral genomes (Fig 4A). This indicates that both the ORF3-like (M2) and M proteins have been vertically inherited from the common ancestor of the alpha- and beta-CoVs. However, these proteins show striking differences in their evolutionary rates. Within each viral group, the sequence identity between the ORF3-like proteins is always significantly lower than that between the coupled M proteins (Fig 4B).
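As a minimal sketch of how identity comparisons of this kind can be computed from a multiple sequence alignment, the Python snippet below averages pairwise percent identity over all sequence pairs. It is not the procedure used in this study: the function names, the choice to ignore columns with a gap in either sequence, and the toy sequences are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Sequence

GAP = "-"  # assumed gap character in the aligned sequences


def percent_identity(a: str, b: str) -> float:
    """Percent identity over columns where both aligned sequences have a residue."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == GAP or y == GAP:
            continue  # assumption: gapped columns are excluded from the denominator
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def mean_pairwise_identity(aligned: Sequence[str]) -> float:
    """Average percent identity over all unordered pairs of aligned sequences."""
    pairs = list(combinations(aligned, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(percent_identity(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical toy alignments, not real M or ORF3 sequences.
    conserved = ["MKLVAT", "MKLVAS", "MKLIAT"]
    divergent = ["MKLVAT", "MR-FGS", "LQAF-T"]
    print(round(mean_pairwise_identity(conserved), 1))
    print(round(mean_pairwise_identity(divergent), 1))
```

Other denominators are in common use (alignment length, or the length of the shorter sequence), so absolute values depend on the convention chosen; comparisons between protein families are only meaningful when the same convention is applied to both.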
Further, when we compared proteins between different viral clades, we found that the average percent identity of M proteins is about 34.0%, whereas the average percent identity of ORF3-like (M2) proteins is only 10.0% (Fig 4B). This indicates that the ORF3 (M2) protein has been evolving much faster than the genomically linked M protein. To better understand the functional difference between them, we conducted a column-wise Shannon entropy analysis (Fig 4C). We found that, across the length of the whole alignment, ORF3-like proteins have a significantly higher mean column-wise entropy than the M proteins from the same set of viral genomes (2.22 as opposed to 1.52; p=1. In conclusion, these results suggest that the M proteins are undergoing purifying selection and might have been shielded to a degree from attacks by the host immune system, whereas the ORF3-like proteins are likely to be at the interface of the interactions between the host and the virus.

Other than the M and ORF3-like (M2) proteins, several other proteins are also major structural components of coronaviruses, including S, which is a major structural component of the virion involved in cellular receptor binding (23); E, which is a 1-TM ion channel protein critical for envelope formation and membrane budding (24); and N, which is essential for viral genome packaging and virion assembly (25, 26). Hence, we investigated whether they also show a pattern similar to that of the M/ORF3 proteins. We conducted a genomic composition analysis by using both similarity-based clustering and domain analysis for CoVs, ToroVs, and other known nidoviruses such as roniviruses, mesnidoviruses and arteriviruses (Fig 5).
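The column-wise Shannon entropy comparison reported above for the M and ORF3-like alignments can be illustrated with a short Python sketch. This is an illustration under stated assumptions rather than the authors' code: the logarithm base (2), the exclusion of gap characters, and the toy alignments are all choices made here, and the statistical test behind the reported p-value is not reproduced.

```python
import math
from collections import Counter
from typing import List, Sequence


def column_entropy(column: str, gap: str = "-") -> float:
    """Shannon entropy (in bits) of the residues in one alignment column."""
    residues = [c for c in column.upper() if c != gap]  # assumption: gaps ignored
    if not residues:
        return 0.0
    n = len(residues)
    counts = Counter(residues)
    # The trailing "+ 0.0" turns a negative zero into a plain 0.0.
    return -sum((k / n) * math.log2(k / n) for k in counts.values()) + 0.0


def alignment_entropies(aligned: Sequence[str]) -> List[float]:
    """Entropy of every column of an alignment given as equal-length strings."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must have equal length")
    return [column_entropy("".join(col)) for col in zip(*aligned)]


def mean_entropy(aligned: Sequence[str]) -> float:
    """Mean column-wise entropy across the whole alignment."""
    values = alignment_entropies(aligned)
    return sum(values) / len(values)


if __name__ == "__main__":
    # Hypothetical toy alignments standing in for a conserved and a diverging family.
    slow = ["MKLV", "MKLV", "MKLI", "MKLV"]
    fast = ["MKLV", "LRAF", "MQSI", "VKTY"]
    print(mean_entropy(slow), mean_entropy(fast))
```

A higher mean value indicates that, on average, more distinct residues are tolerated per position, which is the sense in which the ORF3-like alignment is described as more variable than the M alignment.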
Consequently, we found that the Spike proteins of three major lineages of nidoviruses, namely CoVs, ToroVs, and mesnidoviruses, are conserved in their C-terminal regions (SC; Pfam domain: PF01601), while their N-terminal regions vary: the CoV Spike contains the characteristic NTD and RBD for receptor binding, whereas the other viral Spikes contain apparently unrelated regions (Fig 5). The M proteins from both CoVs and ToroVs share a similar architecture of 3-TM and β-sandwich domains. Interestingly, we found that Arteriviruses contain two proteins with distinct 3-TM domains, called M and GP5 (S10 and S11 Figs). Both of their N-terminal 3-TM domains share similarity and conserved polar residues with the M/ORF3 3-TM domain, but their C-terminal domain is an unrelated, shorter domain with just 6 β strands (S10 and S11 Figs). This indicates that the 3-TM-type ion channels of the CoVs, ToroVs, and Arteriviruses share a common ancestry, but their coupled β-strand-rich C-terminal domains might have different origins. In the case of the N protein, its C-terminal dimerization domain (N-C in Fig 5) shows a wide distribution among ToroVs and Arteriviruses, in addition to CoVs (S12 Other than this buildup, we also observed additional lineage-specific genes. For example, the SARS-related clade has several new genes, including ORF6, ORF7 and ORF8, which were inserted between the M and N genes (10), while the MERS-CoVs have several genes inserted between the Spike and ORF5 genes. These could be structural components newly introduced during evolution.

To collect protein homologs, iterative sequence profile searches were performed using the programs PSI-BLAST (Position-Specific Iterated BLAST) (18) and JACKHMMER (28) against the non-redundant (nr) protein database of NCBI, with an e-value cut-off of 0.005 serving as the significance threshold.
Similarity-based clustering was conducted by BLASTCLUST, a BLAST score-based single-linkage clustering method (ftp://ftp.ncbi.nih.gov/blast/documents/blastclust.html). Multiple sequence alignments were built using the KALIGN (29), MUSCLE (30) and PROMALS3D (31) programs, followed by careful manual adjustments based on the profile-profile alignment and predicted structural information. Sequence-profile and profile-profile comparisons were conducted using the HHsearch program (19). Secondary structure was predicted using the JPRED program (32). The consensus of the alignment was calculated using a custom Perl script. The alignments were visualized using the CHROMA program (33) and further modified using Adobe Illustrator. The transmembrane regions were predicted using the TMHMM Server v. The sequence identity between the template and the targets is very low: 15% with the SARS-CoV-2 M protein, 19% with the MERS-CoV ORF5, 15% with the NL63-like CoV ORF3b, and 22% with the Bat-CoV HKU9-2 ORF3c. Since sequence alignment is the most important factor affecting the quality of the model in these low sequence-identity cases (36), the alignments used in this analysis were carefully built, cross-validated against the information from HHsearch, and edited manually using the secondary structure information. For each protein, we generated five models and selected the one that had the highest model accuracy p-value (ranging from 0.06 to 0.013) and global model quality score Kullback-Leibler divergence (or relative entropy) was computed as described in (39), then centered by the mean and normalized by the range to identify the functionally constrained and diverging positions. The entropy values thus derived were analyzed using the R language.

Genome composition analysis. Open reading frames of the viral genomes used in this study were extracted from NCBI GenBank files (40).
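The single-linkage principle behind the similarity-based clustering used in these methods can be sketched in a few lines of Python. This is a conceptual illustration only and does not reproduce BLASTCLUST: the input format (precomputed pairwise hits with a similarity score and a length coverage), the function name, and the threshold parameters `min_score` and `min_coverage` are hypothetical stand-ins.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

# (protein_a, protein_b, similarity_score, length_coverage); hypothetical format
Hit = Tuple[str, str, float, float]


def single_linkage_clusters(
    ids: Sequence[str],
    hits: Iterable[Hit],
    min_score: float,
    min_coverage: float,
) -> List[List[str]]:
    """Group proteins so that any pair passing both thresholds shares a cluster.

    Membership is transitive: A and C end up together if A links to B and
    B links to C, even when A and C have no direct hit (single linkage).
    Every protein named in `hits` must also appear in `ids`.
    """
    parent: Dict[str, str] = {i: i for i in ids}

    def find(x: str) -> str:
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score, coverage in hits:
        if score >= min_score and coverage >= min_coverage:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two clusters

    clusters: Dict[str, List[str]] = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    # Largest clusters first, ties broken alphabetically, for stable output.
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: (-len(c), c))


if __name__ == "__main__":
    proteins = ["p1", "p2", "p3", "p4"]  # hypothetical identifiers
    pair_hits = [
        ("p1", "p2", 0.9, 0.8),
        ("p2", "p3", 0.5, 0.6),
        ("p3", "p4", 0.2, 0.9),  # fails the score threshold
    ]
    print(single_linkage_clusters(proteins, pair_hits, 0.4, 0.4))
```

Because a single passing link is enough to merge two groups, single linkage tends to chain divergent sequences into one cluster, which suits the grouping of remote homologs but makes the result sensitive to the chosen thresholds.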
Protein sequences were subjected to similarity-based clustering by BLASTCLUST with -S at 0.4 and -L at 0.4. Protein clusters were further annotated with conserved domains identified by hmmscan searches against Pfam (19, 41) and our own curated profiles. For previously unknown domains, we used sequence searches followed by multiple sequence alignment and further sequence-profile searches to study their sequence and structural features.

COVID-19: consider cytokine storm syndromes and immunosuppression
The genome sequence of the SARS-associated coronavirus
Isolation of a novel coronavirus from a man with pneumonia in Saudi Arabia
The genome of human coronavirus strain 229E
Identification of a new human coronavirus
Characterization and complete genome sequence of a novel coronavirus, coronavirus HKU1, from patients with pneumonia
SARS coronavirus replicase proteins in pathogenesis. Virus research
Crystal structure and mechanistic determinants of SARS coronavirus nonstructural protein 15 define an endoribonuclease family
Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding
Novel Immunoglobulin Domain Proteins Provide Insights into Evolution and Pathogenesis of SARS-CoV-2-Related Viruses
Viroporins: structure and biological functions
Severe acute respiratory syndrome-associated coronavirus 3a protein forms an ion channel and modulates virus release
PEDV ORF3 encodes an ion channel protein and regulates virus production
The ORF4a protein of human coronavirus 229E functions as a viroporin that regulates viral production
Structure of a conserved Golgi complex-targeting signal in coronavirus envelope proteins
Structural model of the SARS coronavirus E channel in LMPG micelles
Cryo-EM structure of the SARS-CoV-2 3a ion channel in lipid nanodiscs
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs
Protein homology detection by HMM-HMM comparison
FastTree: computing large minimum evolution trees with profiles instead of a distance matrix
MEGA7: molecular evolutionary genetics analysis version 7.0 for bigger datasets
Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10. Virus evolution
Structure, function, and antigenicity of the SARS-CoV-2 spike glycoprotein. Cell. 2020.
Ruch TR, Machamer CE
A key role for the carboxy-terminal tail of the murine coronavirus nucleocapsid protein in coordination of genome packaging
Coronavirus genomic RNA packaging
Influenza virus M2 protein and haemagglutinin conformation changes during intracellular transport
A new generation of homology search tools based on probabilistic inference
Kalign - an accurate and fast multiple sequence alignment algorithm
MUSCLE: multiple sequence alignment with high accuracy and high throughput
PROMALS3D: a tool for multiple protein sequence and structure alignments
JPred4: a protein secondary structure prediction server
CHROMA: consensus-based colouring of multiple alignments for publication
Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes
Comparative protein structure modeling using MODELLER. Current protocols in bioinformatics
Relationship between multiple sequence alignments and quality of protein comparative models
ModFOLD6: an accurate web server for the global and local quality estimation of 3D protein models
Foundations of statistical natural language processing
The Pfam protein families database in 2019