key: cord-0876837-ksql7px0
authors: Zmasek, Christian M.; Lefkowitz, Elliot J.; Niewiadomska, Anna; Scheuermann, Richard H.
title: Genomic evolution of the Coronaviridae family
date: 2022-03-30
journal: Virology
DOI: 10.1016/j.virol.2022.03.005
sha: dc5a08ef09b0e41677432759244e6782111abda1
doc_id: 876837
cord_uid: ksql7px0

The current outbreak of coronavirus disease-2019 (COVID-19) caused by SARS-CoV-2 poses unparalleled challenges to global public health. SARS-CoV-2 is a Betacoronavirus, one of four genera belonging to the Coronaviridae subfamily Orthocoronavirinae. Coronaviridae, in turn, are members of the order Nidovirales, a group of enveloped, positive-stranded RNA viruses. Here we present a systematic phylogenetic and evolutionary study based on protein domain architecture, encompassing the entire proteomes of all Orthocoronavirinae, as well as other Nidovirales. This analysis has revealed that the genomic evolution of Nidovirales is associated with extensive gains and losses of protein domains. In Orthocoronavirinae, the sections of the genomes that show the largest divergence in protein domains are found in the proteins encoded in the amino-terminal end of the polyprotein (PP1ab), the spike protein (S), and many of the accessory proteins. The diversity among the accessory proteins is particularly striking, as each subgenus possesses a set of accessory proteins that is almost entirely specific to that subgenus. The only notable exception to this is ORF3b, which is present and orthologous over all Alphacoronaviruses. In contrast, the membrane protein (M), envelope small membrane protein (E), nucleoprotein (N), as well as proteins encoded in the central and carboxy-terminal end of PP1ab (such as the 3C-like protease, RNA-dependent RNA polymerase, and Helicase) show stable domain architectures across all Orthocoronavirinae.
This comprehensive analysis of the Coronaviridae domain architecture has important implications for efforts to develop broadly cross-protective coronavirus vaccines.

turn are divided into four genera, Alpha-, Beta-, Gamma-, and Deltacoronaviruses. Currently, there are seven Orthocoronavirinae species or sub-species, which have been found to infect

new hidden Markov models were developed (see Methods). These are shown in italics.

Homologs are genes that are related by shared ancestry. Orthologs were defined by Fitch in 1970 as homologous genes in different species that diverged by speciation. Genes that diverged by gene duplication, either in the same or different species, have been termed paralogs (Fitch, 1970, 2000). While the terms ortholog and paralog have no functional implications (Jensen, 2001), orthologs are often thought of as more functionally similar than paralogs at the same level of sequence divergence (Altenhoff et al., 2012; Eisen, 1998).

creates an entity whose function can be more than the sum of its constituent parts.

We analyzed complete sets of proteins for all publicly available Nidovirales genomes (for a total of roughly one million sequences, including ~900,000 for SARS-CoV-2) for the presence of

The main finding from this analysis is that, during Coronaviridae evolution, the largest number

Coronaviridae spike proteins are multifunctional proteins that mediate viral entry into host cells. Composed of two subunits, S1 and S2, they first bind to a receptor on the host cell surface through their S1 subunit and then fuse viral and host membranes through their S2 subunit. In SARS-CoV-2 (but not in SARS-CoV1 or MERS-CoV) the two subunits S1 and S2 have been

Table 1). While the carboxy-terminal S2 subunit shows the same two-domain CoV_S2--CoV_S2_C arrangement for all Orthocoronavirinae genomes analyzed here, the amino-terminal S1 subunit differs significantly between Alpha-, Beta-.
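To make the notion of a domain architecture concrete, the following is a minimal Python sketch that treats an architecture as the ordered list of domains along a protein, checks for a shared carboxy-terminal arrangement, and groups genomes by full architecture. It is an illustration only and not the authors' pipeline. The genome labels and the S1 domain names are hypothetical placeholders; only `CoV_S2` and `CoV_S2_C` come from the text.

```python
from collections import defaultdict

# Hypothetical example data: ordered domain hits (N- to C-terminus) per spike protein.
# Only "CoV_S2" and "CoV_S2_C" are taken from the text; genome labels and
# S1 domain names are placeholders invented for this illustration.
SPIKE_DOMAINS = {
    "genome_A": ["S1_placeholder_1", "CoV_S2", "CoV_S2_C"],
    "genome_B": ["S1_placeholder_1", "CoV_S2", "CoV_S2_C"],
    "genome_C": ["S1_placeholder_2", "S1_placeholder_3", "CoV_S2", "CoV_S2_C"],
}


def architecture(domains):
    """Return the architecture as a string, e.g. 'CoV_S2--CoV_S2_C'."""
    return "--".join(domains)


def ends_with(domains, suffix):
    """True if the ordered domain list ends with the given sub-architecture."""
    if not suffix:
        return True
    return len(domains) >= len(suffix) and domains[-len(suffix):] == suffix


def group_by_architecture(proteins):
    """Map each distinct architecture string to the sorted genomes sharing it."""
    groups = defaultdict(list)
    for genome, domains in proteins.items():
        groups[architecture(domains)].append(genome)
    return {arch: sorted(genomes) for arch, genomes in groups.items()}


if __name__ == "__main__":
    s2_arrangement = ["CoV_S2", "CoV_S2_C"]
    shared = all(ends_with(d, s2_arrangement) for d in SPIKE_DOMAINS.values())
    print("All proteins share the S2 arrangement:", shared)
    for arch, genomes in group_by_architecture(SPIKE_DOMAINS).items():
        print(arch, genomes)
```

In this toy data set every protein ends in the same two-domain arrangement while the amino-terminal part varies, which mirrors the pattern described for S2 and S1. Keeping the domains in order, rather than as an unordered set, is what lets two proteins with the same domains in different positions be told apart.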
Table 2). For comparison, we also included two example polyproteins from Table 2.

We also used DAIO (Domain-architecture Aware Inference of Orthologs) to compare the accessory proteins across Orthocoronavirinae subgenera and found that most accessory proteins are subgenus-specific, despite often having been given identical names such as "NS7" (non-structural protein 7) or "ORF7". These names are based on the protein's position in the genome and do not necessarily indicate homology or similarity in biological or molecular function. Therefore, these names cannot be used to compare or relate proteins between different subgenera; for example, Hibecovirus ORF7 is not related to Cegacovirus ORF7 (see also Figure 2B for additional examples).

Table 3 (as are ORF1ab polyprotein, Spike glycoprotein, Membrane protein, Envelope small membrane protein, and Nucleoprotein). HMMs would be the ideal means to systematically classify accessory proteins, since they represent a protein's molecular "signature" and are indifferent to its placement in a genome (Eddy, 2004). For this reason, HMMs were created for all accessory proteins that currently lack one. Domains defined by these new HMMs are shown in italics in Table 3. Since these novel HMMs represent subgenus-specific domains, they do not appear in Figure 3 and Supplementary Tables 1 and 2 as gains and losses.

(iii) The greatest variability is found in the accessory proteins. For these proteins, each subgenus has its own unique set, with very little overlap between subgenera. The only notable exception to this is NS3b, which is present and orthologous over all Alphacoronaviruses.

In addition, we note the following: The establishment of the Orthocoronavirinae subfamily is associated with a large gain of domains.
While it superficially appears as if these domains were gained at the same time, en bloc, the more likely explanation is that they were gained one at a time, and that most viral species emerging from the branch leading from Nidovirales to Coronaviridae either went extinct or have not yet been discovered.

We used a semi-automated software pipeline to analyze amino acid sequences for their protein domain-based architectures and to infer multiple sequence alignments and phylogenetic trees for the molecular sequences corresponding to these architectures, followed by gene duplication

ATPases associated with the assembly, operation, and disassembly of protein complexes.
Resolving the ortholog conjecture: Orthologs tend to be weakly
Detection of recombinant Rousettus bat coronavirus GCCDC1 in lesser dawn bats (Eonycteris spelaea
Rapid diversification of cell signaling phenotypes by modular domain recombination
ViPR: An open bioinformatics database and analysis resource for virology research
Maximum-Likelihood Analysis Using TREE-PUZZLE, in: Current Protocols in Bioinformatics
RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models
Molecular Characterization of the Capsid Gene of Two
Structural basis for translational shutdown and immune evasion by the Nsp1 protein of SARS-CoV-2. Science
A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach
Origins and evolution of the global RNA virome
Homologous 2′,5′-phosphodiesterases from disparate RNA viruses antagonize antiviral innate immunity
RIO: analyzing proteomes by automated phylogenomics using resampled inference of orthologs
A simple algorithm to infer gene duplication and speciation events on a gene tree
This Déjà Vu Feeling-Analysis of Multidomain Protein Evolution in
Strong functional patterns in the evolution of eukaryotic genomes revealed by the reconstruction of ancestral protein domain repertoires
Herpesviridae proteins using Domain-architecture Aware Inference of Orthologs (DAIO)

Human coronavirus OC43 (HCoV_OC43)
(c) Human coronavirus HKU1 [Species] (HCoV-HKU1)
(d) Middle East respiratory syndrome
Severe acute respiratory syndrome coronavirus 2 (2019-nCoV, SARS-CoV-2; Species: SARS-CoV)

The main findings and results of our work are:
• Certain sections of the Orthocoronavirinae genomes are stable and differ only by point mutations and small insertions and deletions.
• The spike proteins and the papain-like peptidases, in contrast, differ in their domain architectures between genera. Similarly, the N-terminal regions of the polyproteins differ in the proteins they encode between genera and, for Betacoronaviruses, even between subgenera.
• The greatest variability is found in the accessory proteins. For these proteins, each subgenus has its own unique set, with very little overlap between subgenera. The only notable exception to this is NS3b, which is present and orthologous over all Alphacoronaviruses.
• The establishment of the Orthocoronavirinae subfamily is associated with a large gain of protein domains.
• From a domain presence/absence perspective, Alpha- and Betacoronaviruses are similar to each other, as are Gamma- and Deltacoronaviruses.
• In the course of this work, we developed a consistent naming scheme for all Coronaviridae proteins as well as numerous novel hidden Markov models (HMMs) representing subgenus-specific accessory proteins. The annotations resulting from this effort will be disseminated via the ViPR (https://www.viprbrc.org) database.

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.