key: cord-0814547-0lrre6ph authors: Tan, Yongjun; Schneider, Theresa; Shukla, Prakash K; Chandrasekharan, Mahesh B; Aravind, L; Zhang, Dapeng title: Unification and extensive diversification of M/ORF3-related ion channel proteins in coronaviruses and other nidoviruses date: 2021-02-16 journal: Virus Evol DOI: 10.1093/ve/veab014 sha: 617995bcf8afcc5a9507c7886a251f984a11bc32 doc_id: 814547 cord_uid: 0lrre6ph The coronavirus, SARS-CoV-2, responsible for the ongoing COVID-19 pandemic, has emphasized the need for a better understanding of the evolution of virus-host interactions. ORF3a in both SARS-CoV-1 and SARS-CoV-2 are ion channels (viroporins) implicated in virion assembly and membrane budding. Using sensitive profile-based homology detection methods, we unify the SARS-CoV ORF3a family with several families of viral proteins, including ORF5 from MERS-CoVs, proteins from beta-CoVs (ORF3c), alpha-CoVs (ORF3b), most importantly, the Matrix (M) proteins from CoVs, and more distant homologs from other nidoviruses. We present computational evidence that these viral families might utilize specific conserved polar residues to constitute an aqueous pore within the membrane-spanning region. We reconstruct an evolutionary history of these families and objectively establish the common origin of the M proteins of CoVs and Toroviruses. We also show that the divergent ORF3 clade (ORF3a/ORF3b/ORF3c/ORF5 families) represents a duplication stemming from the M protein in alpha- and beta-CoVs. By phyletic profiling of major structural components of primary nidoviruses, we present a hypothesis for their role in virion assembly of CoVs, ToroVs and Arteriviruses. The unification of diverse M/ORF3 ion channel families in a wide range of nidoviruses, especially the typical M protein in CoVs, reveal a conserved, previously under-appreciated role of ion channels in virion assembly and membrane budding. We show that M and ORF3 are under different evolutionary pressures; in contrast to the slow evolution of M as core structural component, the ORF3 clade is under selection for diversification, which suggests it might act at the interface with host molecules and/or immune attack. The recent outbreak of the human coronavirus disease 2019 has generated a global health crisis (Mehta et al. 2020) . It is the 7 th human disease caused by coronaviruses, after Severe Acute Respiratory Syndrome (SARS) in 2003 (Marra et al. 2003) , Middle Eastern Respiratory Syndrome (MERS) in 2012 (Zaki et al. 2012) , and four less-severe infections caused by human coronaviruses 229E (hCoV-229E) (Macnaughton and Madge 1978 ), hCoV-NL63 in 2004 (Van Der Hoek et al. 2004 , and hCoV-HKU1 in 2004 (Woo et al. 2005) . Of these, SARS-CoV-2, SARS-CoV/SARS-CoV-1, MERS-CoV, hCoV-OC43 and hCoV-HKU1 belong to the beta coronavirus clade while hCoV-229E and hCoV-NL63 belong to the alpha coronavirus clade. Although the broad genomic structure and core gene-composition of these viruses is similar, the pathology and severity of these viruses, including SARS-CoV-2, are markedly distinct. According to the WHO report, as of October 2020, there have been over 360 million of confirmed cases with over 1 million deaths from COVID-19 globally. Therefore, the need for a better understanding of the biology and evolution of SARS-CoV-2 is a major desideratum to combat and prevent the disease. Coronaviruses possess a large positive-sense single-stranded RNA genome with two third of the genome coding for the ORF1a/ORF1ab polyprotein. This is followed by several ORFs encoding so-called structural and accessory proteins, a subset of which might be variable between CoVs (Marra et al. 2003) . The ORF1-derived proteins are involved in polyprotein cleavage (the peptidase domain) (Graham et al. 2008 ), viral replication, viral RNA-processing (e.g. xEndoU endoRNase domain) and countering of defenses centered on NAD+/ADP-ribose (Macro domains) (Ricagno et al. 2006) . The structural and accessory proteins contribute to virion structure and assembly, virulence and immune manipulation and invasion (Marra et al. 2003; Lu et al. 2020 ). However, despite concerted experimental studies, the structural understanding of many of these viral proteins is still lacking; for example, in SARS-CoV-2, these include ORF3a, ORF3b, M, ORF6, ORF8, ORF9b, ORF9c, ORF10, and certain domains of ORF1a/b. Here, we utilize a domain-centric computational strategy to systematically study the function and structure of CoV proteins. In our recent work, we have demonstrated that the mysterious SARS-CoV-2 protein, ORF8, belongs to a novel family of the immunoglobulin fold and it is one of the fast-evolving genes in the SARS-CoV-2 genome (Tan et al. 2020) . Based on its inferred structural fold and enhanced evolutionary rate, we further predicted that ORF8 is likely to be involved in disrupting the host immune responses (Tan et al. 2020) . Our computational predictions of both ORF8 structure and function have been recently confirmed by https://mc.manuscriptcentral.com/vevolu wet-lab studies (Flower et al. 2020; Zhang et al. 2020) . In this study, we present results on the function and evolution of novel ion channel proteins in CoVs and other nidoviruses. Viral ion channels (viroporins) represent an expanding functional class of proteins that have been identified in different animal viruses, such as the human immunodeficiency virus (HIV), hepatitis C virus, and influenza A virus (Nieva, Madan, and Carrasco 2012) . These proteins are shown to operate at several steps in the viral life cycle including regulation of replication compartment, viroplasm formation, and virion budding and viral infection (Nieva, Madan, and Carrasco 2012) . CoVs also code for their own ion channels. The SARS-CoV ORF3a was found to function as a potassium-specific channel promoting virus release (Lu et al. 2006) . Thereafter, several other CoV proteins were shown to display ion channel activities, including porcine epidemic diarrhea virus (PEDV) ORF3 (Wang et al. 2012) , hCoV-229E ORF4a (Zhang, Wang, et al. 2014) , and SARS-CoV envelope (E) protein (Li et al. 2014; Surya, Li, and Torres 2018) . Among them, SARS-CoV ORF3a, PEDV ORF3 and hCoV-229E ORF4a, appear to utilize their three transmembrane (3-TM) region to constitute an ion channel either as a dimer or a tetramer, whereas SARS-CoV E protein with a 1TM region forms an ion channel as a pentamer (Li et al. 2014; Surya, Li, and Torres 2018) . Recently, a similar ion channel activity was also observed in SARS-CoV-2 ORF3a and its structure was determined (Kern et al. 2020) . This prompted us to systematically identify other ion channel proteins in coronavirus and related viruses by using sensitive profile-based homology detection and structural modeling methods. As a result, we have identified several homologous protein families, including ORF5 from MERS-CoV, many proteins from beta-CoVs, ORF3b from alpha-CoVs, and importantly, the well-known Matrix (M) proteins from CoVs, as well as more distant homologs from other nidoviruses. We present evolutionary and structural evidence that the newly identified families have preserved familyspecific residues to constitute the ion conducting pore in the membrane. Using both phylogenetic analysis and phyletic profiling, we show that the M proteins are the most conserved structural components for CoVs, ToroVs and Arteriviruses. By contrast, the ORF3a/ORF5/OR3b families appear to have emerged via a duplication from the M proteins at the base of alpha-and beta-CoVs and have undergone constantly rapid diversification, indicating they might be involved in host-virus interactions. Thus, our results have: 1) expanded the repertoire of ion channel proteins across a large subset of nidoviruses (including all CoVs); 2) suggested an evolutionarily conserved role for the M protein in these viruses that might operate via the formation of a potential aqueous channel in the membrane; and 3) pointed to potential new functions of the ORF3-like families in host-virus interactions which might again result from transmembrane-associate pore formation. Examination of the structure of the SARS-CoV-2 ORF3a (Genbank: YP_009724391.1) reveals two distinct domains, namely a N-terminal 3-transmembrane (3-TM) region and a C-terminal βsandwich domain. We first identified other viral homologs of ORF3a by conducting iterative sequence searches using PSIBLAST against the NCBI NR database (Altschul et al. 1997) . We then prepared a multiple sequence alignment to identify the evolutionarily conserved residues, and the majority of them line the ion channel pore of the 3-TM structure (Supplementary Figure S1 ). Interestingly, when we used HMM profile-based homology detection against Pfam profiles via HHsearch (Söding 2005) , we found two other coronavirus protein families as the significant hits: the coronavirus M protein (Pfam ID: PF01635) and the alpha-coronavirus ORF3b (Pfam ID: PF03053) (Fig. 1 ) which were not known to be related to ORF3a. As many coronavirus proteins are not covered by Pfam profiles, we performed a BLAST bit-score based single-linkage clustering analysis of a comprehensive collection of coronavirus proteins (excluding the ORF1a/b) and selected the representatives, for a series of HHsearch searches. This procedure uncovered three other viral protein families that are related to ORF3a and M families, prototyped by ORF5 of MERS-CoVs (NCBI accession number: YP_009047208.1), ORF4 of 229E-related bat CoV (NCBI accession number: ALK28794.1) and ORF3 of Eidolon bat coronavirus/Kenya/KY24/2006 (NCBI accession number: ADX59467.1) (Fig. 1) . Further, we extended our clustering analysis together with profile searches to other nidoviruses beyond coronaviruses, leading to the identification of numerous M proteins from fish, reptile, and mammal ToroVs as homologs (Fig. 1) . Figure 1 summarizes the above homology detection searches and the viral protein families that were uncovered by them. In this figure, representative viral proteins were clustered using all-against-all BLASTP comparisons and are presented nodes of a two-dimensional graph arranged using the Fruchterman and Reingold algorithm (Fruchterman and Reingold 1991) implemented in the CLANS program (Frickey and Lupas 2004) , which are linked by edges of denoting the detected sequence similarity. The viral proteins belonging to the same family segregate as a dense sub-graph, while the families themselves are linked together by edges mostly derived from profile-profile analysis (blue dashed lines in Fig. 1 ). To examine the relationship of the nine viral protein families identified above more closely, we generated multiple sequence alignments of each family, and predicted their potential secondary structures (Supplementary Figures S2-S7) . Importantly, secondary structure (Drozdetskiy et al. 2015) and transmembrane region prediction (Krogh et al. 2001 ) revealed that all nine viral protein families share a congruent domain architecture, with a N-terminal 3-TM region and a Cterminal β-sandwich domain. As TM regions tend to have a biased composition with an enrichment in hydrophobic residues, we next investigated if the above observed profile-profile similarity between the families could be recapitulated by their globular β-sandwich domains. Accordingly, we extracted these domains from above families and conducted a comparable HHsearch as above. These searches recovered significant similarities between families as with the searches with the full-length proteins (Fig. 1) . This was further supported by a superalignment and secondary structure prediction of the β-sandwich domains from the different families ( Fig. 2 ): despite their low sequence identity (10-20%), they all share a comparable eight β-stranded predicted secondary structure, with several conserved hydrophobic amino acids forming the predicted folding cores of the β-strands (Fig. 2) . Thus, the sensitive sequence comparisons and alignments using either full-length proteins or β-sandwich domains revealed previously undetected relationships between the SARS-CoV ORF3a ion-channel proteins and several distinct viral protein families in CoVs and other nidoviruses. Collectively, we term these unified families the viral M/ORF3 superfamily. Both SARS-CoV and SARS-CoV-2 ORF3a proteins are ion channels that form a polar ionconducting cavity through dimerization or tetramerization (Lu et al. 2006; Kern et al. 2020) . To explore the functions of the other viral families which we unified into the M/ORF3 superfamily, we generated a multiple sequence alignment of the 3-TM regions of several CoV-M/ORF3 families (Fig. 3A) . We used this to examine their sequence conservation by generating both position-specific-score matrices (PSSMs) (Supplementary Figure S8) (Schaffer et al. 1999 ) and conservation consensus ( We next investigated if they might also form transmembrane pores: we reasoned that if the capacity to form aqueous transmembrane pores is present more generally across proteins of this superfamily, then the selective pressure would favor retention of specific polar residues in their inferred ion-conducting cavity. Further, the presence of such residues could also indicate the capacity to transport of ions through the pore thereby resulting in channel activity. However, the above analysis did not identify any universally conserved polar residues across the M/ORF3 superfamily. This is not entirely unexpected for divergence of distinct families with a superfamily, given that they might have acquired distinct family-specific functions (Zhang, Iyer, et al. 2014) . Therefore, using the same consensus and PSSM methods as above, we examined each of the families of M/ORF3 superfamily separately. Notably, this analysis identified several familyspecific, conserved polar positions, mostly basic residues ( Figure 3A ). We inferred their positions using the ORF3a structure (PDB: 6XDC) (Kern et al. 2020) , and found them to be located at the inner surface of the cavity forming the TM pore as well as just outside and inside the membrane. As a proof of concept, we examined the location of the conserved residues we had identified in the ORF3a family ( Fig. 3A) and found them to constitute the inner surface of the ion-transporting cavity in SARS-CoV-2 ORF3a (PDB: 6XDC; Fig. 3B ) (Kern et al. 2020) . Those polar residues that are located just outside or inner side of the membrane might be involved in maintaining TM polarity or in mediating ion movement in the vicinity of the channel (Supplementary Figure S1) As no structure is available for other M/ORF3 families, we utilized the MODELLER program (Webb and Sali 2016) to generate homology models of the dimeric form for representatives of other families (Fig. 3C-3F ) by using the SARS-CoV-2 ORF3a structure as the template. These models strongly supported our proposal that several of the above-identified evolutionarily conserved polar residues specific to each family line a TM cavity equivalent to the ionconducting channel of the ORF3a virporin ( Fig. 3 and Supplementary Figures S2-S5 ). This suggests that even though M and the remaining newly identified viral M/ORF3 families might not possess the same pattern of conserved polar residues as ORF3a, they do possess intra-TM polar residues that line a comparable pore cavity. This implies that in the very least they might form an aqueous pore in the membrane and could also potentially function as ion channels. In contrast to the TM domain, no universally conserved or family-specific polar residues are seen in the C-terminal β-sandwich domains indicating that they neither enzymatic domains nor likely to play a role in ion transport (Fig. 2 ). However, it is possible that they function as adaptor domains that undergo conformational changes induced by ion conduction and might help recruit partners proteins in a structural context. We next examined the evolutionary relationship of the families within the M/ORF3 superfamily. Using a super-alignment of the β-sandwich domains (Supplementary Dataset S1), we conducted phylogenetic analyses using rapid approximately-maximum-likelihood inference (FastTree), robust Maximum Likelihood (ML), and Bayesian inference (BI) methods (Price, Dehal, and Arkin 2009; Kumar, Stecher, and Tamura 2016; Suchard et al. 2018b ). All three methods produced similar tree topologies, especially in terms of the higher-order relationships Examination of the genome organization of the viruses that code for both M1 and M2 clade proteins (see next section for details), showed the genes for the two families to be strictly coupled, in an M2-E-M1 order, where ORF3 represents one of the M2 clade families, E is the envelope protein, and M1 is the typical matrix protein of CoVs (M). The inter-relationships of the alpha-and beta-CoV M2 families are congruent to the relationships of the corresponding coupled M1 clade proteins from the same genome (Fig. 4 ). This indicates that both the M2 (ORF3-like) and M proteins have been vertically inherited from the common ancestor of the alpha-and beta-CoVs. However, we found that the sequence identity between the ORF3-like proteins is always considerably lower than that of the coupled M proteins (Fig. 5A ): the average percent identity of the M1 clade proteins is 34%, whereas the average percent identity of the M2 clade proteins is only 10% (Fig. 5A ). This was also supported by the significantly higher number of above-average conservation positions in the PSSM of the M1 clade as opposed to the M2 clade (t-test, p= 1.23 x10 -5 , Supplementary Figure S9 ). This suggested that the M2 clade proteins are diverging much faster than the cognate M protein in the same genome. To better understand this divergence, we performed a column-wise Shannon entropy analysis ( Fig. 5B and 5C) that objectively quantifies the conservation in each column (Vinga 2014; Krishnan et al. 2018) . When using the regular 20-amino-acid alphabet, we found that across the length of the whole alignment, ORF3-like (M2) proteins have significantly higher mean columnwise entropy than the M1 proteins from the same set of viral genomes (2.28 as opposed to 1.49; p=1.6 x 10 -16 for the H 0 of congruent means by t-test). When using a reduced 8-letter alphabet (where amino acids are grouped based on similar side chain chemistries), we found a similar result (1.52 for the M2 clade proteins as opposed to 0.91 for the coupled M1 clade proteins; p= 2.2 × 10 −11 ). The significantly higher mean column-wise entropy of the ORF3-like (M2) clade in the reduced alphabet indicates that there is a much greater tendency in these proteins to contain positions that differ drastically in side chain chemistry (e.g. substitution of a charged or polar residue for a hydrophobic residue). To explore this further, we computed the normalized Kulback-Leibler entropy (see Methods) which measures the divergence of the observed residue distribution at a column from the background residue distribution across the entire alignment of both the M1 and ORF3-like (M2) clades. Thus, this measure accurately picks out those columns in the multiple sequence alignment of a clade that define its unique sequence conservation profile (Supplementary Figures S10 and S11). We then examined the top ranked quartile of these columns from the M1 and M2 clades to see if they contained amino acids with the same side chain chemistry or not. We found a dramatic difference between the two clades -whereas only 16 out 63 top Kulback-Leibler entropy columns in the M1 clade had residues with different side-chain properties, 37 out of 63 showed this trend in the M2 clade (proportions test p-value=3 x 10 -4 ). In conclusion, these results suggest that the M1 is under stronger selection for retention of conserved positions (purifying selection); conversely, the M2 clade proteins appear to be under diversifying selection. Other than the M1 and M2 clade proteins, there are several key structural proteins of the mature coronavirus particle. These include: 1) the spike protein S, which is involved in cellular receptor binding and internalization (Walls et al. 2020) ; 2) E, which is a 1-TM ion channel protein critical for envelope formation and membrane budding (Ruch and Machamer 2012) ; 3) N which is essential for viral genome packaging and linking the viral ribonucleoprotein to the membrane (Kuo, Koetzner, and Masters 2016; Masters 2019) . Hence, we investigated their phyletic patterns relative to that of the M/ORF3 proteins. We conducted a genomic organization analysis by using sequence similarity-based clustering and domain analysis for CoVs, ToroVs, and other nidoviruses such as roniviruses, mesnidoviruses and arteriviruses (Fig. 6 ). Consequently, we found that the S proteins of three major lineages of nidoviruses such as CoVs, ToroVs, and mesnidoviruses show a conserved C-terminal region (SC; Pfam domain: PF01601) while their N-terminal regions, which are cleaved upon engaging host receptors by peptidase domains, are greatly variable (Fig. 6 ). The M proteins from both CoVs and ToroVs share a similar architecture with N-terminal 3TM and C-terminal β-sandwich domains. Interestingly, we found that Arterivirus contains two proteins with distinct 3TM domains, socalled M and GP5 (Supplementary Figures S12 and S13). Both of their core 3TM domains share similarity with the M/ORF3-3TM domain and display conserved polar residues. However, their C-terminal cytoplasmic domain, while β-strand rich, is shorter with just six β-strands (Supplementary Figures S12 and S13 ). This indicates that the 3TM domains of the CoVs and ToroVs on one hand, and Arterviruses on the other share a common ancestry, but we cannot currently detect statistically significant relationships between their respective coupled β-strandrich C-terminal domains. The N proteins of all CoVs contain a N-terminal RNA-binding domain (RBD) and a C-terminal dimerization domain (NC in Fig. 6 ). Beyond the CoVs, the NC domain of the N protein is present more widely in ToroVs and Arteriviruses but these proteins do not possess the RBD of CoV-N; instead they carry a long N-terminal region that contains a stretch of basic residues (Supplementary Figure S14) . Therefore, we predict that this basic region functions equivalently in binding the negatively-charged viral genomic RNA. Tracing the genomic organization of these genes across nidoviruses indicates that the juxtaposition of the M and NC genes is present in arteriviruses, ToroVs and CoV (Fig. 6 ). This suggests that the M-NC coupling was an ancestral feature present in the ancient nidoviral particle with the dyad playing a role in anchoring the genomic RNA to the membranous envelop. However, these are absent in the currently available roniviruses and mesonidoviruses suggesting that this mechanism was secondarily lost in these viruses (Fig. 6) . Nevertheless, the latter two viral clades share the spike protein S with ToroVs and CoVs. Indeed, in the common ancestor the ToroVs and CoVs, we can infer a S-M-NC gene triad indicating that ancient M-NC dyad was joined by the S protein with rapidly evolving N-terminal regions that became the central to the invasion process. At the base of the CoV clade, the above gene-triad was joined Beyond these, we also observed several cases of accretion of lineage-specific genes in the S-M-N genomic region across diverse nidoviruses (Fig. 6 ). For example, the SARS-related clade acquired several new genes including ORF6, ORF7 and ORF8 that were inserted between M and N genes (Tan et al. 2020) , while the MERS-CoV clade, has several genes that were inserted between Spike and ORF5 genes. Similar uncharacterized genes or genes coding for proteins with domains playing a predicted role in pathogenesis (e.g. the hemagglutinin-esterase domain in Bovine Toroviruses) are seen inserted between the known genes in the corresponding genomic regions of Ateriviruses and Toroviruses. Together, these observations suggest that components of the membrane-structural complex e.g. M, N are frequently coupled to (e.g. S or E and the lineage-specific ORFs) or gave rise to (e.g. M2 clade) proteins that are directly part of the host-virus interface. Coronaviruses have emerged as a major threat to human health in the past two decades as the causative agents of several severe infectious diseases, namely SARS, MERS, and the currently ongoing COVID-19 pandemic. The rapid spread, severity of these diseases, as well as the potential re-emergence of other CoV-associated diseases emphasize the need for a thorough understanding of the biology and evolution of these viruses. Viral ion channel proteins exemplified by ORF3a are structurally distinct from all previously characterized ion channels (Flower et al. 2020 ) and represent a new functional theme in coronavirus biology. In the current study, we use sensitive profile-based sequence analysis and homology modeling methods to study these channels and their homologs. We have unified several divergent families of viral ion channel proteins, including SARS-CoV/SARS-CoV-2 ORF3a, MERS-CoV ORF5, proteins from other beta-CoVs (ORF3c) and alpha-CoVs (ORF3b), the CoV M proteins, and many distant homologs from other nidoviruses such as ToroVs and Arterivirus into single superfamily of viral membrane proteins. We present a natural classification of proteins in this superfamily and provide computational evidence that these viral protein families have preserved several familyspecific polar residues that can form a TM aqueous pore with potential ion-conducting capacity. Thus, our study has greatly expanded the repertoire of viral ion channel-related proteins in CoVs and other nidoviruses and provides several insights into the function and evolution of these viruses. CoVs, ToroVs and arterivirus are nidoviruses, which feature an envelope derived from the host cell membranes with embedded viral proteins (Neuman and Buchmeier 2016) . M proteins are a key structural component of the viral envelope. Although their roles in promoting virus assembly and membrane-budding are well documented (Neuman et al. 2011) , their mechanism of action remain unknown. The results of our study raise the possibility that the M proteins might possess the capacity to form a TM-pore and potentially possess ion-channel activity comparable to their viroporin homologs. If this were the case, then it is conceivable that the establishment of ionic gradients by M plays a role in regulating the interactions of proteins in the process of virion assembly and membrane budding. Indeed, several previous studies have demonstrated that M promotes viral assembly and membrane fusion by multimerization and interactions with E, S, and N proteins (de Haan et al. 1999; de Haan, Vennema, and Rottier 2000; Narayanan et al. 2000; Siu et al. 2008 ). Importantly, these interactions are related to the two conformations of the M protein, as revealed by cryo-electron microscopy (Neuman et al. 2011) . The elongated conformation of the M protein is associated with a rigid structural state that associates with clusters of S proteins and imparts a spherical membrane curvature of about 5-6 degrees per M dimer. On the other hand, the compact conformation is associated with a flexible state, low S protein density, and does not appear to impart membrane curvature. The same study also showed that a conversion between the elongated conformation and the compact conformation can be induced by a transient acidification, weakening the M interactions with other viral proteins. Therefore, it was proposed that the formation of elongated conformation of M drives the membrane budding process (Neuman et al. 2011) . Our results complement this model -the proposal that the M protein itself might mediate ion transport could be key to the observed response to pH changes that might occur in the intracellular membranous compartments during virion assembly. In addition to virus assembly, previous research on several animal viruses have shown that viral ion channels can also promote membrane fusion and regulate viral replication and/or packaging genomic RNA into viral particles (Ciampor et al. 1995; Nieto-Torres et al. 2015) . It is possible that the M and ORF3-like proteins may also contribute to these processes through their ion transport activity (Yount et al. 2005; Castano-Rodriguez et al. 2018) . We unified five highly divergent ORF3-like families from alpha-and beta-CoVs into the M2 clade of the M/ORF3 superfamily. All five families of the M2 clade share a common ancestor that split from the M protein at the base of alpha-and beta-CoVs (Fig. 4) . The incorporation of ORF3-like proteins has been observed in the virions of SARS-CoV (containing the ORF3a family) (Shen et al. 2005; Huang et al. 2006; Ito et al. 2005 ) and human CoV NL63 (containing the ORF3b family) (Muller et al. 2010) . This is consistent with ORF3-like proteins partly retaining the ancestral structural association with envelope as seen with M. Deletion of ORF3a in SARS-CoV is associated with a reduction in virus growth (Yount et al. 2005 ) indicating that it is not redundant with M and has evolved a distinct function. This is also supported by the different metrics that suggest selection for diversification in the M2-like clade as opposed to purifying selection in the M1 clade. This rapid diversification is comparable to that observed in the proteins that are involved in host-virus interactions, such as S, that binds the host receptor, and ORF8, a viral immunoglobulin protein that interferes with host immune system (Tan et al. 2020; Zhang et al. 2020) . This suggests that the M2 clade families, like S and ORF8, might be under diversifying selection due to either the need to interact with correspondingly diversifying host molecules or host immune attack. Two lines of evidence support our hypothesis: 1) Half of patients recovered from SARS have developed antibodies against the OR3a N-terminal peptide (Zhong et al. 2006) , suggesting that ORF3a might be a target of host immune system. 2) During the outbreak of both SARS-CoV and SARS-CoV-2, positive selection was observed in ORF3a along with S and ORF8, suggesting that the it might be evolving in response to the pressure from the host immune attack (Velazquez-Salinas et al. 2020; Yeh et al. 2004) . Therefore, we propose that beyond a structural role in the envelope, the M2 clade proteins might be directly at the interface of CoV-host conflicts. In this context it remains to be seen if their ion channel activity might play a role in directly modulating or hijacking host membrane permeability, glycoprotein trafficking or vesicular transport. Ion channels are major drug targets that account for ~13% of FDA approved drugs (McManus 2014). Inhibition of host ion channels with multiple ion channel modulators has been repeatedly shown to affect virion entry and endosomal fusion (Hover et al. 2017) . The recent identification of viral ion channels has allowed screening for novel virus-specific channel modulators, given the lack of homology between the viral and human ion channels. Successful products include the approved drugs amantadine and rimantadine which target the M2 ion channel in influenza A virus (Kozakov et al. 2010; Jefferson et al. 2004) . As ORF3a and E are potential ion channels for SARS-CoV/SARS-CoV-2, they have been proposed as candidates of drug targets (McClenaghan et al. 2020; Alothaid et al. 2020; Dey, Borkotoky, and Banerjee 2020) . Our results suggest that the M protein might be a potential candidate that might have even better prospects than the M2 clade proteins like ORF3a and E: 1) M2 clade proteins are fast-evolving; hence the virus is predicted to more easily develop resistance against the modulating compounds. In contrast, M appears to be under stronger selective constraint for conservation. 2) The presence of M in all CoVs makes it a better candidate for a wide spectrum drug. This is notable because human pathogenic CoVs have originated from different clades of CoVs. 3) In term of structural organization, M with 3-TM regions and multiple inter-TM loops offers more potential for drug-protein interactions than the other envelope-associated channel E that has only 1-TM forming a loosely packed pentamer (Mandala et al. 2020) . Hence, our findings prioritize M as potential therapeutic target against CoVs. We utilized two protein sequence profile-based methods for homology searches and remote relationship detection. The first one is the iterated PSSM (profile)-based method, PSI-BLAST (Position-Specific Iterated BLAST) (Altschul et al. 1997) . For most searches, a cut-off e-value of 0.005 was used as the significance threshold. In each iteration, the newly detected sequences that had e-values lower than the above cutoff were examined for being false positives and the search was continued with the same e-value threshold only if the profile was uncorrupted. The second method is the HHsearch program (Söding 2005) , which is used for the sequence-profile our own domains. The significance was evaluated by probabilities. The collected sequence homologs of each protein family were subjected to a similarity-based clustering that was conducted by BLASTCLUST, a BLAST score-based single-linkage clustering method (ftp://ftp.ncbi.nih.gov/blast/documents/blastclust.html). This was used to remove highly similar sequences in the dataset. Multiple sequence alignments (MSAs) were built using the KALIGN (Lassmann and Sonnhammer 2005) , MUSCLE (Edgar 2004 ) and PROMALS3D (Pei, Kim, and Grishin 2008) programs. Based on prior benchmarks with divergent proteins, these programs can produce accurate MSA for protein families (as assessed by structural comparisons) when their sequence identity is high (>25%). However, when they are used to make super-alignments that contain multiple highly divergent domain families, they typically tend to introduce obvious mis-alignments. We tackle this problem using multiple steps to make a good MSA. Specifically, we use both KALIGN and MUSCLE to generate the MSA for each domain family. We choose the alignments which have fewer gaps in the core structural elements that determine the fold, i.e., alpha-helices and beta-sheets. We then use PROMALS3D to generate the super-alignment, which will have both errors within each family and between families. For each family, we use the prior MSAs generated by either MUSCLE or KALIGN to correct mis-alignments introduced by PROMALS3D. We correct those misalignments that occur between families by using the predicted secondary structure information and profile-profile alignments generated by HHsearch (Söding 2005) . Finally, the generated super-alignment can be sampled randomly for each family and checked by superimposing the sequence against known structures to ascertain contiguity of secondary structure elements. the superalignment of all members. Each family has its own conservation patterns which we compute as a PSSM profile and more simply a sequence consensus. The PSSM captures the position-specific frequencies of all 20 amino acids across the alignment (Altschul et al. 1997; Schaffer et al. 1999 ) corrected for their background frequency in the alignment, which we present in a two dimensional plot ( Supplementary Figures S8 and S9) . The PSSM is computed with pseudocounts using the following process: 1) Frequency determination with pseudocounts for a MSA column j with n sequences = ( -1) + Where and are the observed frequency of the amino acids in column j and their respective background frequencies across the alignment for pseudocount correction. 2) These are converted to the initial PSSM score thus: The score is then rescaled as , where is the maximum value of the = max ( ) -max ( ) score across all columns. The conservation metric at each column of the PSSM can be calculated taking the inverse of the mean of the PSSM scores and centering them. This metric for each column is then multiplied by , where is the frequency of gaps in that (1 --) column to down-weight gapped columns. For the consensus method, we generate consensus sequence based on different categories of amino acid properties, a classification developed by Taylor in 1986 (Taylor 1986 ). In this method, different amino acids are classified into 11 groups according to their shared physicochemical properties, including: Consensus is calculated by examining each column of the MSA to determine whether an abovethreshold fraction of the amino acids belongs to a group defined previously. For each MSA, we generated a series of consensus sequences based on threshold from 70%, 80%, 90% to 100%. We colored the MSA using the CHROMA program (32) based on the consensus sequence calculated from a threshold (either 90% or 80%) and further modified using Adobe Illustrator or Microsoft Word. Comparison of either the profiles or the consensus between different families in a superfamily allows one to identify the specific versus general conservation patterns. The general conservation pattern captured as a sequence profile or the consensus for the superalignment primarily reveals a pattern of just hydrophobic residues that form the folding core of the two domains. However, the family-specific consensus or profiles reveal residues which are conserved only in a given family and constitute their unique conservation pattern. Position-wise Shannon entropy (H) for a given column of the multiple sequence alignment was calculated using the equation: P is the fraction of residues of amino acid type i, and M is the number of amino acid types. Two distinct alphabets used to calculate the column-wise Shannon entropy. The first is the regular 20 amino acid alphabet. The Shannon entropy for the ith position in the alignment ranges from 0 (only one residue at that position) to 4.32 (all 20 residues equally represented at that position) in a 20 letter alphabet, which is shown on the positive y-Axis with the x-Axis being the position of the column along the multiple alignment ( Figure 4C ). The second is in a reduced alphabet of 8 symbols that groups the amino acids into non-overlapping categories based on related sidechain properties. For example, in the reduced alphabet both D and E are represented by a single alphabet (acidic category). This is plotted downwards, i.e., along the negative y-axis with magnitude equal to the entropy in the reduced alphabet for the same column. Comparing the entropies in the regular 20 aa alphabet versus the reduced alphabet helps discern positions that show genuine diversifying pressures. For instance, hydrophobic positions can show high entropy in the regular entropy if they present multiple hydrophobic residues. However, high entropy in this scenario is not biochemically very meaningful especially for membrane protein because the different hydrophobic residues perform an equivalent role. This is effectively filtered by the reduced alphabet that brings the focus on genuinely variable positions with different sidechain characteristics. The way the graph is to be understood is by looking at the overall tendency for the heights of the bars in each alphabet -it can be seen that the ORF3-like families show higher amplitude of the bars on an average than the genomically coupled M proteins. This is statistically quantified and found to be significantly higher in both alphabets for the former families (shown as boxplot the p-values are provided in the text). The Kullback-Leibler entropy, also called the Kullback-Leibler divergence (or relative entropy), was computed for each column as described in (Manning and Schutze 1999) Analysis of the entropy values was performed using the R language. Secondary structure was predicted using the JPRED program (Drozdetskiy et al. 2015) . The transmembrane regions were predicted using the TMHMM Server v. 2.0 (Krogh et al. 2001 Based on the super-alignment of the β-sandwich domains of nine M/ORF3 families (Supplementary Dataset S1), we conducted phylogenetic analysis using three robust methods, including the Maximum Likelihood (ML) analysis implemented in the MEGA7 program (Kumar, Stecher, and Tamura 2016) , an approximately-maximum-likelihood method implemented in the FastTree 2.1 program (Price, Dehal, and Arkin 2009) , and Bayesian Inference implemented in the BEAST 1.8.3 program (Suchard et al. 2018b) . For ML analysis, initial tree(s) for the heuristic search were obtained automatically by applying Neighbor-Join and BioNJ algorithms to a matrix of pairwise distances estimated using a JTT model, and then selecting the topology with superior log likelihood value. A discrete Gamma distribution was used to model evolutionary rate differences among sites (4 categories). The rate variation model allowed for some sites to be evolutionarily invariable. A bootstrap analysis with 100 repetitions was performed to assess the significance of phylogenetic grouping. For FastTree analysis, default parameters were applied, which include the WAG evolutionary model and the discrete gamma model with 20 rate categories. For Bayesian inference, a JTT amino acid substitution model with a discrete Gamma distribution (four rate categories) was used to model evolutionary rate differences among sites. Markov chain Monte Carlo (MCMC) duplicate runs of 10 million states each, sampling every 10,000 steps was computed. Logs of MCMC runs were examined using Tracer 1.7.1. Burn-ins were set to be 4% of iterations. The tree with the highest log likelihood from the ML analysis was visualized using the FigTree 1.4.4 program of the BEAST package (Suchard et al. 2018a ). The ML-bootstrapping value, FastTree SH-like local support value and Bayesian Posterior value are shown next to the branches. Open reading frames of viral genomes used in this study were extracted from NCBI GenBank files (Benson et al. 2018) . Protein sequences were subjected to similarity-based clustering by we used sequence searches, multiple sequence alignment analysis, and further sequenceprofile searches (Söding 2005) to study their sequence and structural features. All sequence alignments can be found in the supplementary data. Supplemental material is available online only. HHsearch programs. Each node corresponds to a protein sequence. Straight lines indicate significant high-scoring segment pairs (HSPs) detected by all-against-all BLASTP searches with scoring matrix BLOSUM45 and e-value cutoff of 0.02. The graph was generated by CLANS which uses the Fruchterman and Reingold graph drawing algorithm (Frickey and Lupas 2004) . The Bat-CoV HKU9-2 NS3 (ORF3c) (ABN10920.1) A H Y 61 34 1. 1 Y P _ 0 0 9 3 6 1 8 6 1 .1 A IG 1 3 1 0 0 .1 A N I6 9 8 9 3 .1 A C U 3 1 0 3 4 .1 A R O 7 6 3 8 3 .1 A B D 7 5 3 1 6 .1 N P _ 8 2 8 8 5 2 .2 A G C 7 4 1 6 6 .1 Y P _ 0 0 9 7 2 4 3 9 1 .1 A V P 7 8 0 3 2 .1 A P O 4 0 5 8 0 .1 Y P _ 0 0 3 8 5 8 5 8 5 .1 A D Y 1 7 9 1 2 .1 Y P _ 0 0 9 0 7 2 4 4 1 .1 Y P _ 8 0 3 2 1 6 .1 Y P _ 0 0 9 6 6 6 3 0 3 .1 A V M 8 7 5 8 9 .1 Q F U 1 9 7 5 0 .1 Q F U 1 9 7 6 7 .1 A II 0 0 8 2 8 .1 Y P _ 0 0 9 4 0 8 1 7 3 .1 A X P 1 1 7 1 0 .1 Y P _ 0 0 9 6 6 6 2 6 3 .1 A V M 8 7 3 4 9 .1 Q F U 1 9 7 5 7 . Similarities between the effect of SARS-CoV-2 and HCV on the cellular level, and the possible role of ion channels in COVID19 progression: a review of potential targets for diagnosis and treatment Gapped BLAST and PSI-BLAST: a new generation of protein database search programs Role of Severe Acute Respiratory Syndrome Coronavirus Viroporins E, 3a, and 8a in Replication and Pathogenesis', mBio Influenza virus M2 protein and haemagglutinin conformation changes during intracellular transport Relationship between multiple sequence alignments and quality of protein comparative models Mapping of the coronavirus membrane protein domains involved in interaction with the spike protein Assembly of the coronavirus envelope: homotypic interactions between the M proteins Pymol: An open-source molecular graphics tool In silico identification of Tretinoin as a SARS-CoV-2 envelope (E) protein ion channel inhibitor JPred4: a protein secondary structure prediction server Profile hidden Markov models MUSCLE: multiple sequence alignment with high accuracy and high throughput The Pfam protein families database in 2019 Structure of SARS-CoV-2 ORF8, a rapidly evolving coronavirus protein implicated in immune evasion CLANS: a Java application for visualizing protein families based on pairwise similarity Graph drawing by force-directed placement', Software: Practice and Experience SARS coronavirus replicase proteins in pathogenesis', Virus research Viral dependence on cellular ion channels -an emerging anti-viral target? Severe acute respiratory syndrome coronavirus 3a protein is released in membranous structures from 3a proteinexpressing cells and infected cells Severe acute respiratory syndrome coronavirus 3a protein is a viral structural protein Amantadine and rimantadine for preventing and treating influenza A in adults Cryo-EM structure of the SARS-CoV-2 3a ion channel in lipid nanodiscs Where does amantadine bind to the influenza virus M2 proton channel? Diversification of AID/APOBEC-like deaminases in metazoa: multiplicity of clades and widespread roles in immunity Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes MEGA7: molecular evolutionary genetics analysis version 7.0 for bigger datasets A key role for the carboxy-terminal tail of the murine coronavirus nucleocapsid protein in coordination of genome packaging Kalign-an accurate and fast multiple sequence alignment algorithm Structure of a conserved Golgi complex-targeting signal in coronavirus envelope proteins Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding Severe acute respiratory syndrome-associated coronavirus 3a protein forms an ion channel and modulates virus release The genome of human coronavirus strain 229E ModFOLD6: an accurate web server for the global and local quality estimation of 3D protein models Structure and drug binding of the SARS-CoV-2 envelope protein transmembrane domain in lipid bilayers Foundations of statistical natural language processing The genome sequence of the SARS-associated coronavirus Coronavirus genomic RNA packaging Coronavirus Proteins as Ion Channels: Current and Potential Research', Front Immunol HTS assays for developing the molecular pharmacology of ion channels COVID-19: consider cytokine storm syndromes and immunosuppression Human coronavirus NL63 open reading frame 3 encodes a virion-incorporated N-glycosylated membrane protein Characterization of the coronavirus M protein and nucleocapsid interaction in infected cells Supramolecular Architecture of the Coronavirus Particle A structural analysis of M protein in coronavirus assembly and morphology Relevance of Viroporin Ion Channel Activity on Viral Replication and Pathogenesis', Viruses Viroporins: structure and biological functions' PROMALS3D: a tool for multiple protein sequence and structure alignments FastTree: computing large minimum evolution trees with profiles instead of a distance matrix Crystal structure and mechanistic determinants of SARS coronavirus nonstructural protein 15 define an endoribonuclease family The coronavirus E protein: assembly and beyond IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices The severe acute respiratory syndrome coronavirus 3a is a novel structural protein The M, E, and N structural proteins of the severe acute respiratory syndrome coronavirus are required for efficient assembly, trafficking, and release of virus-like particles Protein homology detection by HMM-HMM comparison Bayesian phylogenetic and phylodynamic data integration using BEAST 1 Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10', Virus evolution Structural model of the SARS coronavirus E channel in LMPG micelles Novel Immunoglobulin Domain Proteins Provide Insights into Evolution and Pathogenesis of SARS-CoV-2-Related Viruses The classification of amino acid conservation Identification of a new human coronavirus Positive Selection of ORF1ab, ORF3a, and ORF8 Genes Drives the Early Evolutionary Trends of SARS-CoV-2 During the 2020 COVID-19 Pandemic Information theory applications for biological sequence analysis Structure, function, and antigenicity of the SARS-CoV-2 spike glycoprotein PEDV ORF3 encodes an ion channel protein and regulates virus production Comparative protein structure modeling using MODELLER Characterization and complete genome sequence of a novel coronavirus, coronavirus HKU1, from patients with pneumonia Characterization of severe acute respiratory syndrome coronavirus genomes in Taiwan: molecular epidemiology and genome evolution Severe acute respiratory syndrome coronavirus groupspecific open reading frames encode nonessential functions for replication in cell cultures and mice Isolation of a novel coronavirus from a man with pneumonia in Saudi Arabia Resilience of biochemical activity in protein domains in the face of structural divergence The ORF4a protein of human coronavirus 229E functions as a viroporin that regulates viral production The ORF8 Protein of SARS-CoV-2 Mediates Immune Evasion through Potently Downregulating MHC-I Amino terminus of the SARS coronavirus protein 3a elicits strong, potentially protective humoral responses in infected patients Secondary Structure