key: cord-0906086-4fota9rl
authors: Medvedev, Kirill E.; Kinch, Lisa N.; Grishin, Nick V.
title: Functional and evolutionary analysis of viral proteins containing a Rossmann‐like fold
date: 2018-06-13
journal: Protein Science
DOI: 10.1002/pro.3438
sha: 4e1dda0217a3227fe682fbfffff9b47e593f48a0
doc_id: 906086
cord_uid: 4fota9rl

Viruses are the most abundant life form and infect practically all organisms. Consequently, these obligate parasites are a major cause of human suffering and economic loss. The Rossmann‐like fold is the most populated fold among α/β‐folds in the Protein Data Bank, and proteins containing a Rossmann‐like fold constitute 22% of all known protein 3D structures. Thus, analysis of viral proteins containing Rossmann‐like domains could provide an understanding of viral biology and evolution and could suggest possible targets for antiviral therapy. We provide functional and evolutionary analysis of viral proteins containing a Rossmann‐like fold found in the evolutionary classification of protein domains (ECOD) database developed in our lab. We identified 81 protein families of bacterial, archeal, and eukaryotic viruses in light of their evolution‐based ECOD classification and Pfam taxonomy. We defined their functional significance using enzymatic EC number assignments as well as domain‐level family annotations.

The Rossmann-like fold 1,2 is the most populated fold among α/β-folds in the Protein Data Bank. 3 It was first found in a wide range of nucleotide-binding proteins that utilize diphosphate-containing cofactors such as NAD(H). These structures included two sets of β-α-β-α-β units (321456 topology), forming a single parallel sheet flanked by helices in a three-layer α/β/α sandwich. 4 An important structural feature of this fold is the crossover between strands 3 and 4. This crossover creates a natural cavity that participates in the binding of the nucleotide ring.
5 We can therefore define a minimal Rossmann-like motif as a three-layer α/β/α sandwich with at least three parallel β-strands and a crossover between the second and third strands. In fact, protein structures containing this minimal defined unit constitute 22% of the Protein Data Bank (see "Materials and methods"). Rossmann domains are linked to a great variety of different catalytic domains and metabolic enzymes and can be found in different viruses. 6

As an abundant life form that infects practically all organisms, viruses are a major cause of human suffering and economic loss. Previous studies showed that in marine, soil, and animal-associated environments, the number of virus particles is typically 10-100 times greater than the number of cells. 7 This so-called "virosphere" probably includes every environment on Earth, from the atmosphere to the deep biosphere. 8 The viromes of the three domains of cellular life (bacteria, archea, and eukaryotes) are fundamentally different. In prokaryotes, most viruses have double-stranded DNA genomes, with a substantial minority of single-stranded DNA viruses and only a limited presence of RNA viruses. In eukaryotes, on the other hand, RNA viruses account for the majority of the virome diversity, although ssDNA and dsDNA viruses are common as well. 9 Although several families of dsDNA viruses are represented in both bacteria and archea, no viruses are known to be shared by eukaryotes with either of the other two cellular domains, even at the family or order level. 10 However, structural analyses of virion architecture and coat protein topology have revealed unexpected similarities, not visible in sequence comparisons, suggesting a common origin for viruses that infect hosts residing in different domains of life.
11 Given the prevalence of Rossmann-like folds in nature, their analysis in the viral structure proteome could provide an understanding of viral biology and evolution and could suggest possible targets for antiviral therapy. In this work, we provide functional and evolutionary analysis of viral proteins containing a Rossmann-like fold that can be found in the Evolutionary Classification of protein Domains (ECOD) database developed in our lab. 12 ECOD is a hierarchical classification based on evolutionary concepts that consists of five levels: architecture (A), possible homology (X), homology (H), topology (T), and family (F). 13 We identified and described protein families of bacterial, archeal, and eukaryotic viruses in the light of their classification in ECOD and defined their functional significance using enzymatic EC number assignments as well as domain-level family annotations. In total, 81 protein families were defined as viral proteins containing a minimal Rossmann fold motif. A few well-populated viral folds tend to distribute across multiple host kingdoms, including P-loop domains-related (2004.1), Rossmann-related (2003.1), and UDP-glycosyltransferase/glycogen phosphorylase (2111.1). Alternatively, numerous fold types tend to be specific to their viral host kingdom, potentially explaining the difference in their viromes and suggesting targets for therapy.

Using the definition of a minimal Rossmann-like folding unit that contains a three-layer α/β/α sandwich with at least three parallel β-strands and a crossover between the second and third strands, structures of 1427 viral Rossmann-like domains were detected in the ECOD database. These domains were found in 512 PDB structures (6.7% of all known viral protein structures) and were assigned by ECOD to 81 protein family groups and 24 homology groups (Fig. 1).
The structures represented gene products from 21 viral taxonomical families with host ranges from all kingdoms of life (http://prodata.swmed.edu/rossmann_fold/viruses/). Four of the 21 viral taxonomical families infect Bacteria, one infects Archea, and eighteen infect Eukaryota (Fig. 1; green, violet, and orange colors, respectively). The biggest taxonomical family in terms of amount

Phages are viruses that infect bacteria; their self-replication depends on access to a bacterial host. The rising tide of antibiotic resistance coupled with the low rate of antibiotic discovery has revived interest in phages as antibacterial agents. 14 Phages have been used not only to treat and prevent human bacterial infections but also to control plant diseases, detect pathogens, and assess food safety. E. coli bacteriophages are the most commonly used for these purposes, 14 which perhaps explains their relative abundance of structures (Fig. 1). Our analysis detected 14 different bacterial virus structure topology types defined by ECOD T-groups that contain a Rossmann-like fold (Fig. 2, 12 topology groups shown).

The largest "Rossmann-related" homology group (2003.1) adopts a classical Rossmann-like topology with additional β-strands at the C-terminal end for several families [Fig. 2(A,B)]. The tubulin family [Fig. 2(B)] of this homology group has a sheet topology 321456 representing a canonical GTP-binding domain. Pseudomonas phiKZ-like bacteriophages encode a group of related tubulin/FtsZ-like proteins believed to be essential for the correct centering of replicated bacteriophage virions within the bacterial host. 15 This phage protein classifies together with bacterial and archeal FtsZ proteins as well as eukaryotic tubulin alpha, beta, and gamma chains. The similar protein sequence and structure, in addition to the GTP-dependent polymerization activities, suggest that all evolved from a common ancestor.

One family from the "P-loop domains-related" [2004.1, Fig. 2(C,D)] homology group, phosphomevalonate kinase [P-mevalo_kinase, PMVK, Fig. 2(C)], contains a subfamily of phage T4 deoxynucleotide kinases that are somewhat distinct from the canonical animal enzymes catalyzing phosphorylation of 5-phosphomevalonate into 5-diphosphomevalonate, an essential step in isoprenoid biosynthesis. Given the idea that PMVK enzymes arose from nonorthologous gene displacement early in animal evolution, 16 perhaps the phage T4 deoxynucleotide kinases played a role in their alternate evolutionary origins. Another family of this H-group, the ATPase P4 from bacteriophage phi12 [NTPase_P4, Fig. 2(D)], contains motor proteins involved in the packaging of the Pseudomonas phage phi12 genome into preformed capsids, using ATP to drive translocation. 17 The central P4 structure core, together with part of the C-terminal region, forms a Rossmann-type domain containing a twisted, eight-stranded β-sheet of mixed parallel and antiparallel topology flanked by five helices. Although the P4 family is limited to phage proteins, their presumed homologous relationship to other P-loop structures suggests a common evolutionary origin. 18

The toprim-like [Toprim_2, Fig. 2(G)] family belongs to the "HAD domain-related" homology group [2006.1, Fig. 2(E-G)] and is represented by DNA primase-helicase proteins. 19 Phage T7 encodes DNA polymerase, primase, helicase, and single-stranded DNA binding activities that catalyze the replication of double-stranded phage DNA. Primase and helicase activities reside in a bifunctional primase-helicase protein that assembles into ring-shaped hexamers. 20 Given their essential role in phage replication, these viral toprim-like proteins are predominantly found in bacteriophages and nucleocytoplasmic large DNA viruses. 21 Two possible evolutionary scenarios have been proposed for primase-helicase proteins.
The first involves fusion of primase and helicase genes, while the second considers the primase-helicase gene as an ancestor that underwent duplication and divergence followed by physical separation of primase and helicase functions.

The phage tail sheath protein domain [Phage_sheath_1, Fig. 2(H)], with the common sheet topology 231456, contains proteins that play crucial roles in the contraction that drives the tail of bacteriophage T4 through the host outer membrane during infection. 22 The phage tail sheath protein resembles the fold of the Type VI secretion system (T6SS) VipA/VipB heterodimer, which intertwines to adopt the Rossmann-like architecture. 23 In fact, components of the T6SS, which delivers effector proteins into a target cell, and the phage tail-associated complex are thought to have arisen from a common ancestor. 24

The T4-glucosyltransferase family [Fig. 2(K)], limited to phage proteins, catalyzes the transfer of glucose from uridine diphosphoglucose to 5-hydroxymethylcytosine. The role of glycosylation in protecting the infecting viral DNA from host restriction enzymes has been reviewed. 25 Comparison of the glucosyltransferase family fold to other families showed that it is completely embedded in the structure of glycogen phosphorylase. All nine α-helices and 13 β-strands of β-glucosyltransferase match elements of glycogen phosphorylase in sequential order. 26 The significance of the match is further supported by the fact that the first and second domains of the common core are architecturally distinct despite topographical similarity to the classical nucleotide-binding fold. It appears paradoxical that the architecture of T4 β-glucosyltransferase is simpler and therefore appears more primitive than that of glycogen phosphorylase, which is the older enzyme by phylogenetic arguments.
26 The paradox may be resolved by either of two evolutionary models that differ in the point of divergence between glycogen phosphorylase and T4 β-glucosyltransferase. The first model postulates that the gene for T4 β-glucosyltransferase diverged rather recently from a fully evolved glycogen phosphorylase and that evolution in T4 rapidly simplified the structure of the protein down to the essential catalytic core. In the second model, both descend along separate lineages from a very ancient common ancestor.

The N-acetylmuramoyl-L-alanine amidase (Amidase_3) family contains viral proteins that exhibit high specificity towards the cell walls of their host bacteria. The core of these domains is formed by a twisted, six-stranded β-sheet flanked by six helices. β-strands 4 and 5 are antiparallel, which makes the catalytic domain unique among the known Listeria phage endolysins. 27 Based on their antimicrobial properties, endolysins from phages infecting Gram-positive pathogens have recently attracted attention as potential therapeutic agents. 28 Initial studies were successfully carried out with oral streptococci in mice 29 as well as with Bacillus anthracis in vitro. 30 In addition, genetically modified lactic acid bacteria that are able to synthesize and secrete active Listeria phage endolysin were constructed to protect food fermentation products. 31

Cutinase [Fig. 2(L)] belongs to the α/β-hydrolase H-group and is another large family that contains proteins from different domains of life. Cutinase is a serine esterase with the classical Ser, His, Asp triad of serine hydrolases. The structure of a cutinase shows an α/β sandwich organization similar to lysin B (LysB) enzymes. Viral LysB proteins are produced by mycobacteriophages to degrade the host peptidoglycan layer. The activity also circumvents a mycolic acid-rich outer membrane covalently attached to the arabinogalactan-peptidoglycan complex in the wall of Mycobacterium.
32 The five most closely related structures of LysB are all cutinases, although none of them shares more than 21% amino acid sequence identity with it. The LysB catalytic mechanism is expected to be similar to that of other serine esterases. Acquisition of LysB by mycobacteriophages throughout their evolution likely confers a substantial selective advantage over those without it by providing faster and more complete lysis. 32

Like the bacterial and eukaryotic branches in the tree of life, the Archea are host to a multitude of viruses. However, compared to viruses infecting the domains Eukaryota and Bacteria, studies of viruses infecting the Archea are still in their infancy. 33 The PDB database includes 305 domains in 55 structures of archeal-specific viral nonchimeric proteins, and only three domains in three structures were defined as Rossmann-like (Fig. 3). In addition to these archea-specific viruses, all types of dsDNA viruses known to infect bacteria can replicate in archea, including head-tailed viruses (families Myoviridae, Siphoviridae, and Podoviridae), icosahedral viruses with internal lipid envelopes (families Sphaerolipoviridae and Turriviridae in archea; Tectiviridae and Corticoviridae in bacteria), and pleomorphic viruses (families Pleolipoviridae in archea and Plasmaviridae in bacteria). 34

One example of an archeal-infecting viral protein is the B116 protein (DUF1874 family) of Sulfolobus turreted icosahedral virus (STIV). 35 While the 37 STIV open reading frames (ORFs) generally lack sequence similarity to other genes, ORF B116 is common to the genomes of three additional hyperthermophilic archeal viral families, the Rudiviridae, the Lipothrixviridae, and the Bicaudaviridae. The B116 polypeptide folds to form a five-stranded, predominantly parallel β-sheet (topology 31452) lined on one side by three α-helices, with strand β5 running antiparallel to the remaining strands.
Interestingly, an intramolecular disulfide bond, contributed by Cys33 and Cys62, is also observed. This covalent link between the α1 and α2 helices is likely to enhance the thermostability of the B116 fold. Two copies of the B116 polypeptide are found in the asymmetric unit, forming a homodimer with a larger 10-stranded unclosed barrel that gives the protein a saddle shape. The authors suggested that this unclosed barrel could be a site of nonspecific DNA binding in which the Rossmann-like fold could also take part. 35

One more example from STIV is the A197 protein (EUF08621 family). The structure of the A197 monomer reveals a six-stranded, predominantly parallel, α/β/α sandwich that is flanked by a four-stranded antiparallel β-sheet with an extended C terminus. Structure similarity data identified members of the glycosyltransferase (GT-A) superfamily as the closest structural homologues of this protein. A197 is one of the smallest known glycosyltransferases, composed of only the core catalytic GT-A fold and lacking additional functional domains. Thus, its structure may define the minimal components necessary for glycosyltransferase activity. 36

The third example is the B204 protein (AAA_22_like family). It is an ATPase belonging to the P-loop containing nucleoside triphosphate hydrolases H-group that is thought to drive packaging of viral DNA during the replication process. The structure of STIV B204 is represented by a central nine-stranded β-sheet decorated with seven α-helices. B204 contains a core Rossmann-like fold (sheet topology 32451) followed by a β-meander with a helical hairpin stemming from one of the loops.
Related P-loop ATPases with this identical topology also function to translocate DNA, including the bacterial conjugation protein TrwB, which transfers bacterial DNA across membranes and between cells, 37 and the VirB4 ATPase of the bacterial type IV secretion system, which mediates the transfer of proteins and DNA across bacterial membranes. 38

In contrast to bacteria and archea, eukaryotes host numerous, diverse RNA viruses, retrotranscribing elements, and retroviruses that typically integrate into the host genome. 39 The giant viruses of the family Mimiviridae are associated with a distinct class of satellite viruses, the virophages, which reproduce within viral "factories" inside their host protist cells and which depend on the latter for their replication. 40 In our dataset, Mimiviridae encodes two protein families. The glucose-methanol-choline oxidoreductase family, represented by the R135 protein, contains an N-terminal FAD-binding Rossmann-related domain followed by a C-terminal substrate recognition domain. The R135 oxidoreductase might participate in degrading the cell walls of their normal hosts, which include some lignin-containing algae. 41 The minimal Rossmann domain is composed of a five-stranded parallel β-sheet with the same β-strand topology typical of a nucleotide-binding fold. The second Mimiviridae protein family is represented by tyrosyl tRNA synthetases (TyrRS). These proteins have a six-stranded β-sheet with strand order 432561 and an antiparallel first strand. TyrRS shares 30% identity over 340 residues with the TyrRS of the hyperthermophilic euryarchaeon Pyrococcus horikoshii, its closest known structural homologue. tRNA synthetases are pivotal in determining how the genetic code is translated into amino acids and in providing the substrate for protein synthesis.
The discovery of four aminoacyl-tRNA synthetases encoded in the genome of mimivirus, together with a full set of translation initiation, elongation, and termination factors, appeared to blur what was once a clear frontier between the cellular and viral worlds. 42

Another eukaryote-infecting virus, vaccinia virus (Poxviridae), is the virus used in the smallpox vaccine. The Poxviridae genome includes eight protein family groups with five different topology types. One example is the H3 envelope protein, an immunodominant antigen that is expressed late in infection and found as a membrane protein on the surface of virion particles. The nine-stranded β-sheet is made up of strands in the order 679584132 and is surrounded by helices on both sides; all of the strands except 7 and 8 are oriented parallel to each other. The fold belongs to glycosyltransferases (GTs) of the GT-A group. 43 H3 is involved in attachment to the host cell, contributes to viral morphogenesis, and plays a role in infection. 44 Since H3 is a major immune system target that is recognized by neutralizing antibodies, it is an important viral protein to include in new vaccines. 45 Another example is subunit D12 of the vaccinia virus capping enzyme, which executes all three steps in m7GpppRNA synthesis. 46 The topology of the stimulatory subunit D12 reveals a class I N7-methyltransferase (MT)-like core, but with a truncated S-adenosyl-homocysteine-binding domain, consistent with its lack of MT activity. This enzyme has a completely unique mode of binding the adenosine moiety of S-adenosyl-homocysteine, a feature that could be exploited for the design of specific antipoxviral compounds. 47

Zika virus (Flaviviridae) includes only one family with a minimal Rossmann structure, nonstructural protein NS1. The protein fold and domain arrangement of NS1 is virtually identical to that of the dengue virus DENV2 and West Nile virus NS1 proteins.
Despite this overall similarity, the Zika virus NS1 crystal structure provides important new information about a flexible loop, located in the domain containing the minimal Rossmann motif, that is not visible in previous structures. 48 The structure of the loop reveals an expanded surface permitting NS1 to associate with membranes during replication, to associate with immature virions during particle morphogenesis, and to facilitate the interactions necessary for formation of the hexameric lipoprotein complex. The Zika virus, which has been implicated in an increase in neonatal microcephaly and Guillain-Barré syndrome, has spread rapidly through tropical regions of the world. NS1 plays a crucial role in the Zika virus life cycle as a multifunctional virulence factor. 48

Reoviridae is a large viral family infecting fish, shellfish, crustacean species, insects, ruminants, and human hosts. The Reoviridae family includes structure representatives from nine different protein families. Rotaviruses are the principal agents of infectious dehydrating diarrhea of infants and the cause of nearly a half-million childhood deaths per year. 49 They have a segmented, double-stranded RNA (dsRNA) genome, packaged within a multishelled virus particle. The outer protein layer of the virion, the molecular machinery for host-cell binding and penetration, contains two protein components, VP4 and VP7. 50 The domain of the VP7 protein that contains the minimal Rossmann motif has five parallel β-strands and is assigned to the "Flavodoxin-like" possible homology group. Rotavirus infection and parenteral immunization with virions both induce a strong VP7-specific neutralizing antibody response. This protein is a principal target of protective antibodies.
51 We examined the taxonomic distribution of the best sequence hits of each family representative in order to define possible evolutionary relationships between viral and host proteins from the same family, which could be the result of horizontal gene transfer (HGT) between virus and host organism. Based on the similarity of the Rossmann motif ECOD family sequences to those in the nonredundant sequence database, all Rossmann-like fold families were divided into six groups according to the appearance of BLAST hits from four taxonomy groups: A (Archea), B (Bacteria), E (Eukaryota), and V (Viruses).

The first group, 31% (25 out of 81) of all protein families under study, has universal BLAST hits from Archea, Bacteria, Eukaryota, and Viruses ("ABEV", see the last two columns of the online table: http://prodata.swmed.edu/rossmann_fold/viruses/). About half (13 out of 25, or 52%) of these families belong to double-stranded DNA (dsDNA) viruses that infect bacteria. BLAST score distributions allow us to assume that HGT took place between virus and host bacteria for 7 out of 13 protein families in this universal group (e3uj3X1, e3u5zE1, e2ocaA8, e2ia5B1, e1juvA1, e4ieeA2, e1xovA2). This assumption is based on high-scoring bacterial hits being among viral hits in the BLAST score distributions. The other half of the protein families with "ABEV" hits (12 out of 25, or 48%) belong to viruses that infect eukaryotes, including double-stranded DNA and positive single-stranded RNA ((+)ssRNA) viruses. Only one protein family seems to have evidence for HGT between virus and eukaryote host: vaccinia virus thymidine kinase (e2j87A3).

The second group, 6% (5 out of 81) of Rossmann-like fold proteins, has BLAST hits from three taxonomical groups: Archea, Bacteria, and Viruses ("ABV"). Two protein families infect bacteria (dsDNA) and three infect archea (dsRNA).
All these protein families except one (e4r2iA1) have strong evidence for HGT between virus and host.

The third group, 16% (13 out of 81), has BLAST hits from three taxonomical groups: Bacteria, Eukaryota, and Viruses ("BEV"). Six protein families infect bacteria (dsDNA), and four of them have evidence of HGT (e3bgwF1, e3hc7A1, e5hd9A1, e2ihnA2). The rest infect eukaryotes and belong to double-stranded DNA and positive single-stranded RNA viruses.

The fourth group, 11% (9 out of 81), has BLAST hits from two taxonomical groups: Bacteria and Viruses ("BV"). Of the seven families that infect bacteria (dsDNA and dsRNA viruses), six have evidence for HGT (e4cu5B1, e4cu2A1, e1dekA1, e1y8zB2, e1y8zB1, e2ia5A2).

The fifth and smallest group, 2.5% (2 out of 81), has BLAST hits from two taxonomical groups: Eukaryota and Viruses ("EV"). Both families belong to human vaccinia dsDNA virus, with BLAST distributions having only low-scoring eukaryotic hits.

The sixth and largest group, 33% (27 out of 81), has BLAST hits limited to Viruses ("V"), with only four families infecting bacteria (dsDNA and dsRNA viruses) and the rest (23 families) infecting eukaryotes (dsRNA, (+)ssRNA, dsDNA, and (-)ssRNA viruses).

Given the relatively high number of viral protein families that contain a Rossmann-like fold (81 families), we sought to examine their evolutionary distributions among folds from the three major host kingdoms. Figure 4 highlights the protein family counts from bacteria-infecting (blue bar), archea-infecting (red bar), and eukaryota-infecting (green bar) viruses that fall within all distinct minimum Rossmann fold types. Importantly, each fold type includes protein families related by a common ancestor (ECOD H-group). The fold types represented by more than one host kingdom tend to display relatively high family counts. These well-populated folds include P-loop domains-related (2004.1), Rossmann-related (2003.1), and UDP-glycosyltransferase/glycogen phosphorylase (2111.1).
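The six-way grouping of families by BLAST hit taxonomy described above reduces to a presence/absence code over four taxa, and the reported percentages follow directly from the family counts. A minimal Python sketch of that bookkeeping is given here; the function names and the input format (a collection of taxon names per family) are illustrative assumptions and do not reproduce the authors' pipeline. Only the per-group family counts are taken from the text.

```python
# Letter order used for the group codes: Archea, Bacteria, Eukaryota, Viruses.
TAXON_ORDER = "ABEV"


def hit_group(taxa_with_hits):
    """Return the group code (e.g. 'BEV') for the taxa that produced BLAST hits.

    `taxa_with_hits` is a hypothetical input format: any iterable of taxon
    names whose first letter is one of A, B, E, or V.
    """
    initials = {name[0].upper() for name in taxa_with_hits}
    return "".join(letter for letter in TAXON_ORDER if letter in initials)


# Family counts per group, as reported in the text (81 families in total).
REPORTED_COUNTS = {"ABEV": 25, "ABV": 5, "BEV": 13, "BV": 9, "EV": 2, "V": 27}


def group_percentages(counts):
    """Express each group's family count as a percentage of all families."""
    total = sum(counts.values())
    return {group: round(100 * n / total, 1) for group, n in counts.items()}


if __name__ == "__main__":
    print(hit_group(["Viruses", "Bacteria"]))
    for group, pct in group_percentages(REPORTED_COUNTS).items():
        print(f"{group}: {REPORTED_COUNTS[group]} families, {pct}%")
```

Run on the reported counts, the six groups sum to 81 families, which is a quick consistency check on the percentages quoted in the text.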
Most of the other fold types include only one family representative, with 8 from eukaryotic hosts, 6 from bacterial hosts, and one from archeal hosts. The fold type nucleotide-diphospho-sugar transferases (2111.6) contains two families from eukaryotes and one from archea, while the fold type SGNH hydrolase (2007.5) contains one family from bacteria and one from archea. This almost biphasic distribution, with one well-populated universal fold set and one less populated unique fold set, suggests that despite the fundamental differences in the viromes from the three domains of cellular life, they survive using a common set of folds. Furthermore, these well-populated folds extend across all life forms, perhaps suggesting an ancient origin. Accordingly, the P-loop domains-like fold, whose nucleotide metabolic enzyme components are thought to form the origin of the protein world, 52 also represents the most populated fold among viral genomes. Such examples of ancient viral proteins should be useful for understanding evolutionary relationships among members of this unique domain of life.

Alternatively, the unique fold types likely drive some of the differences observed in viruses infecting different cellular forms of life. For example, the cystovirus bacteriophage phi12 encodes a unique P7 protein with a minimal Rossmann fold. Phi12 P7 is classified as its own unique X-group in ECOD, implying a lack of evidence for homology to existing folds. P7 serves as a putative virion assembly cofactor thought to bind the unique three-segmented double-stranded cystovirus RNA genome. 53 The avian coronavirus that infects chickens encodes another example (IBV Nsp2a) of a unique Rossmann-like fold domain. The N-terminal domain of Nsp2a adopts a Rossmann-like sheet topology (2314), with β-strand 2 being an insert to the core that is antiparallel to the rest.
Although the function of Nsp2a is unknown, its lack of sequence/structure similarity to other proteins has suggested a role in host specificity. 54

Traditionally defined, helicases use energy derived from NTP hydrolysis to unwind double-stranded nucleic acids. 55 As such, they play roles in numerous cellular processes involving nucleic acids, including DNA replication and repair, transcription, translation, and RNA splicing and maturation, among others. 56 Mechanistically, helicases can bind single-stranded or double-stranded nucleic acid; unwind RNA, DNA, or hybrids; and translocate in either direction (3′-5′ or 5′-3′). 57 Despite these functional distinctions, all helicases bind NTP using two structural elements formed by signature sequence motifs: a phosphate-binding P-loop (motif I/Walker A motif) and a Mg2+ cofactor binding loop (motif II/Walker B). These motifs have helped classify P-loop helicases into superfamilies (SF1-SF5) based on sequence, with the last superfamily also including nonhelicase NTPases. 58 Modular accessory domains or subdomains, including terminal extensions and insertions within the core, can also regulate helicase activity. 58 By shuffling core and regulatory domains, nature has created a diverse range of cellular helicase machinery that also plays key roles in viral function. As such, the relative abundance of helicases in viral genomes has been used in part to assess picornaviral evolution. 59 Their key roles in viral function also suggest helicases as novel targets for treating viral infections. 58 In humans, several debilitating inherited disorders are linked to genetic defects in helicase genes, including Bloom's, Werner's, and Rothmund-Thomson's syndromes. 60 The potential of helicases as antiviral drug targets has recently been reviewed. 61

Among viral protein structures containing the minimal Rossmann fold, 14 protein families are known helicases (http://prodata.swmed.edu/rossmann_fold/viruses/).
Helicase domains fall into two different homologous fold groups (H-groups): P-loop domains-related (13 families) and HAD domain-related (1 family). 19 Given the number of P-loop domains-related representatives and their potential for informing viral evolution, we constructed a structure-based tree of domains that contain P-loop motifs (Fig. 5).

Figure 5. A tree of viral helicases. Structure-based distances between representative P-loop domains from ECOD family viral helicases were estimated using DaliZ scores. Nodes of the tree are labeled by PDB and colored according to superfamily. Structures are colored in rainbow according to the core Rossmann-like topology common to all viral helicase domains. Terminal extensions (white) and insertions (pink) decorate the core fold. Where present, active site molecules are in stick.

The tree reproduces the traditional sequence-based classification, 58 dividing the viral helicase domains into four superfamilies (SF1-SF4). Those in SF1 and SF2 include a Rossmann-like fold duplication, with the second domain helping form the DNA binding site and contributing a motif to the active site (i.e., motif IV in 3upuA3, not included in the tree). All the catalytic viral helicase domains include a core topology of four strands (order 1432) sandwiched by a helix on one side and two helices on the other. The common core binds nucleotide using the Walker A (following core strand 1) and Walker B (following strand 3) motifs. The structure-based tree correctly defines superfamilies based on terminal extensions and insertions to the core, with SF1 having an insertion (Fig. 6, pink cartoon) that extends one side of the β-sheet by a strand and a C-terminal extension that extends the other side. SF2 domains have a longer insertion than SF1 that extends the sheet by two strands.
SF3 domains have a C-terminal extension similar to that of SF1 but lack the insertion, while SF4 domains include both a longer C-terminal extension that adopts a four-stranded β-meander and a longer insertion that extends the sheet and has an additional helical subdomain. Given the ability of the viral P-loop helicase domains with minimal Rossmann folds to recapitulate the traditional sequence-based classification, these domains might prove useful for further analysis of viral evolution.

Viral genomes possess several additional types of P-loop domains-related homologs that do not function as helicases. Their activities include terminase or viral nucleic acid packaging that couples NTP hydrolysis to directional motion along nucleic acids, thymidine and other deoxynucleotide kinases providing DNA precursors for synthesis in the host cytoplasm, polynucleotide kinase functioning in nucleic acid repair, and a recombinase promoting strand exchange. These additional structures bring the total number of P-loop domains-related families to 27 (after removal of incorrect dynein domains), which represents almost one third of the existing Rossmann-like fold domains in viruses.

Looking at similarity to known protein sequences, about one third of the Rossmann motif ECOD families are limited to viral sequences (27 out of 81, or 33%). Most of these viral-specific protein families exhibit characteristics of fast evolution, with their sequences being distinct from homologs (24 out of 27, or 89%). Interestingly, most of these fast-evolving families infect eukaryotes (21 out of 24, or 87%), with many of these classified as diverse methyltransferase domains (12 families). Viral methyltransferase domains tend to function in mRNA 5′ cap biosynthesis. One viral methyltransferase example that has not rapidly diverged from its FtsJ counterparts in other kingdoms (i.e., "ABEV") illustrates a typical methyltransferase fold bound to AdoMet substrate and cap [Fig. 6(A)].
Alternatively, a fast-evolving structure of the vaccinia virus protein VP39 highlights the typical methyltransferase binding sites for AdoMet substrate [Fig. 6(B), black stick] with diversity arising from extensions at the termini, a replacement of the C-terminal helix with a strand, and a unique cap binding pocket [Fig. 6(B), magenta stick] that allows for sensing substrate methylation status. 62 Evolution of viral methyltransferases has been discussed, with the unique sequence features arising from their invention of alternate capping pathways, their intimate interactions with additional viral enzymes functioning in the process, and their inactivation as methyltransferases. 63 Such inactive domains have transformed into RNA-binding modules 64 [Fig. 6(D,E), respectively]. Interestingly, classified viral methyltransferase domains from SARS NSP15 and PRRSV NSP11 endoribonucleases have a largely degraded N-terminus where AdoMet substrate usually binds; however, the distinguishing C-terminal β-hairpin that marks the methyltransferase fold remains intact [Fig. 6(F)]. The function of this domain, like other viral structure additions, appears to be in oligomerization. Given the uniqueness of the folds with respect to host methyltransferases, the rapidly diverging viral methyltransferases might serve as targets for therapy. Indeed, Zika virus NS5, which includes an FtsJ family methyltransferase domain, shows promise for drug design. 66

In contrast to the majority of fast-evolving viral Rossmann domains, only a few families (3 out of 27, or 11%) appear to be unique to viruses. These belong to viruses with diverse hosts, with one infecting bacteria (Pseudomonas phage core protein P7) and two infecting eukaryotic hosts (IBV Nsp2a and Zika virus NS1). The IBV Nsp2a N-terminus contains a minimal Rossmann-like motif with an α/β insertion following the first helix that forms an antiparallel interaction with the crossover strand 2 [Fig. 7(A)].
While NSP2 is one of the first proteins to be translated and processed in the IBV life cycle, its function remains unknown. Zika virus NS1 also adopts a minimal Rossmann-like fold. However, NS1 replaces the C-terminal helix addition with an N-terminal helix addition. It also has an insertion in the same position as Nsp2a, but the β-strand forms a parallel interaction with strand 2 [Fig. 7(B)]. The function of Zika virus NS1 remains unclear. The unique properties of these apparent viral-specific proteins, which could have arisen from degradation of more complete Rossmann domains, preclude functional inference from their structure.

The remaining families include BLAST hits from other domains of life (54 out of 81, or 66%). These could represent either proteins of ancient origin or proteins that have been horizontally transferred (HGT) between viral genomes and their hosts. Of these families, 51% (28 out of 54) belong to viruses that infect bacteria, 44% (23 out of 54) to viruses that infect eukaryotes, and 5% (3 out of 54) to viruses that infect archaea. This nearly equal distribution suggests the potential for viral protein families that derive from ancient origin, as the prevalence of HGT stems from bacterial origins. 67 Although evidence does exist for viral acquisition of eukaryotic proteins 68 (i.e., from HGT, not ancient origins), these viral families of eukaryotic hosts that perform universal functions serve as potential examples of viral proteins with ancient origins. Many of these more universal proteins serve as helicases, which provide important functions to all domains of life. Rossmann-like fold-containing viral helicases are divided into four superfamilies (SF1-SF4), according to the traditional sequence-based classification 58 and our classification of the P-loop-containing domains (Fig. 5). The SF3 helicases, whose viral examples belong to universal families, are thought to have been present at the last universal common ancestor stage as in virus-like "selfish" replicons.
69 The key role of these proteins in viral function also suggests helicases as novel targets for treating viral infections.

Figure 7. Novel viral Rossmann-like motifs. Two viral-specific families retain minimal Rossmann-like motifs colored in rainbow from N-terminus (blue) to C-terminus (red). (A) The N-terminal domain from IBV Nsp2a [3ld1] has a unique β/α insertion (white) after the first helix forming an antiparallel interaction with strand 2 and an additional C-terminal helix (salmon). (B) The N-terminal domain from Flavivirus NS1 includes an N-terminal helix (slate) and a different insertion (white) in the same position as in IBV Nsp2a, but forming a parallel interaction with strand 2.

Only one family, the resolvase protein family (PF00239), forms a homology group that is unique and distant from the others. Resolvases or recombinases are proteins that cause conserved DNA rearrangements, interact with short sequences in the DNA, bring two sites together in a synapse, and then catalyze strand exchange so that the DNA is cleaved and religated to opposite partners. 70 There are several known types of recombinases, but only serine recombinases, or resolvase/invertase family proteins, contain a Rossmann fold motif. This family has emerged from studies of phages, prophages, and transposons from predominantly Gram-positive bacteria and consists of three groups. The structural and evolutionary differences imply that an ancestral catalytic domain has fused to unrelated sequences to result in a family of structurally and functionally diverse proteins. Thus, the modular nature of the serine recombinases resembles that in other recombination enzymes, such as the tyrosine integrases and the DDE superfamily of transposases.
71 Interestingly, serine recombinases have significant sequence similarity to the poxvirus F16 protein, whose function is still unclear; it has been proposed that this protein may affect signaling functions of the nucleoli and is unlikely to have serine recombinase activity. 72 Thus, the most parsimonious evolutionary scenario for these orthologs involves acquisition of a serine recombinase gene by the ancestor of poxviruses from a transposon or a bacteriophage. 72

The minimal Rossmann fold motif was defined as a three-layer α/β/α sandwich motif with a crossover between elements III and V, which contained three parallel β-strands as a middle layer and three variations of the crossover element IV (Fig. 8). Element IV can be represented as an α-helix [Fig. 8(A)], a β-strand [Fig. 8(B)], or a linker [Fig. 8(C)]. Element II was represented only as a helix since it forms the active site of most Rossmann fold proteins. For the motif search, we used the ProSMoS program developed in our lab. 73 We generated a database of PDB domains (ECOD database version: develop159/20161205), with each domain represented by a secondary structure element (SSE) interaction matrix describing the interactions (parallel or antiparallel) and hydrogen bonding between the secondary structure elements of the PDB structure. This database was generated using PALSSE. 74 The structure consensuses of minimal Rossmann fold proteins were represented as query matrices. Query matrices specified the number and types of secondary structure elements in the motif under consideration, the hydrogen bonding and parallel or antiparallel relationships between its elements, and the minimum and maximum lengths of the three component β-strands. We then used the query matrices as input for the ProSMoS program. False positives were removed by visual inspection. Domains were considered to belong to the Rossmann-like fold only when the minimal Rossmann fold motif formed the structural core of the protein domain.
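To make the idea of matching a query matrix against an SSE interaction matrix concrete, the following Python sketch searches a domain for a minimal three-strand motif. It is only an illustration under stated assumptions and is not the ProSMoS or PALSSE implementation: the encoding (`'E'` for β-strand, `'H'` for α-helix, `'P'`/`'A'` for parallel/antiparallel pairing), the function name `find_motif`, and the two example queries are all hypothetical. In particular, the assumed pairing of strand I with strands III and V (sheet order III-I-V) is one reading of the crossover description, and hydrogen-bonding and strand-length constraints are omitted.

```python
from itertools import combinations

# Hypothetical encoding (not the ProSMoS file format):
#   SSE types: 'E' = beta-strand, 'H' = alpha-helix
#   interactions: {(i, j): 'P' or 'A'} for paired SSEs with i < j
#   ('P' = parallel, 'A' = antiparallel)


def find_motif(sse_types, interactions, query_types, query_pairs):
    """Return the first N-to-C ordered assignment of query elements to
    domain SSEs that satisfies all constraints, or None if there is none.

    query_types: list of sets of allowed SSE types, one per query element.
    query_pairs: {(a, b): 'P' or 'A'} required between query elements a < b.
    """
    n, k = len(sse_types), len(query_types)
    # combinations() yields increasing index tuples, so sequence order is
    # preserved while extra SSEs (insertions) may be skipped.
    for idx in combinations(range(n), k):
        if any(sse_types[i] not in allowed
               for i, allowed in zip(idx, query_types)):
            continue
        if all(interactions.get((idx[a], idx[b])) == rel
               for (a, b), rel in query_pairs.items()):
            return idx
    return None


# Assumed queries: elements I (strand), II (helix), III (strand),
# IV (crossover helix or strand), V (strand); strand I pairs in parallel
# with strands III and V.
QUERY_WITH_CROSSOVER_SSE = (
    [{'E'}, {'H'}, {'E'}, {'H', 'E'}, {'E'}],
    {(0, 2): 'P', (0, 4): 'P'},
)
# Linker variant: element IV is not a regular SSE, so it is dropped.
QUERY_WITH_LINKER = (
    [{'E'}, {'H'}, {'E'}, {'E'}],
    {(0, 2): 'P', (0, 3): 'P'},
)

# Toy domain: strand-helix-strand-helix-strand-helix, all strands parallel.
toy_types = ['E', 'H', 'E', 'H', 'E', 'H']
toy_pairs = {(0, 2): 'P', (0, 4): 'P'}
match = find_motif(toy_types, toy_pairs, *QUERY_WITH_CROSSOVER_SSE)
```

Because the search only tests SSE types and pairwise relationships, a hit says nothing about whether the motif forms the structural core of the domain, which is why a manual inspection step such as the one described above would still be needed.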
A full list of protein families and their characteristics can be found online: http://prodata.swmed.edu/rossmann_fold/viruses/. We generated functional information for each PDB representative included in the online table from several databases. Virus taxonomy is from the latest report of the International Committee on Taxonomy of Viruses (ICTV). Topology and evolutionary information are from ECOD. The protein name and structure are from the Protein Data Bank. 3 General functional descriptions are from the Pfam database. 75 Finally, enzyme function, in the form of an EC number, is from the KEGG Enzyme database. 76

The helicase tree was built from structure-based distances of ECOD domains (2004.1.1) with helicase EC function using the FITCH program from the PHYLIP package 77 (available from: http://evolution.genetics.washington.edu/phylip.html) with global rearrangements. Distances between representative structures were estimated by transforming Dali Z scores 78 from pairwise superpositions using the

For each PDB family representative sequence from the online table (http://prodata.swmed.edu/rossmann_fold/viruses/), a search against the NCBI nonredundant protein sequence database (National Center for Biotechnology Information, NIH, Bethesda, MD) was performed using BLAST. 79 The BLAST search settings were as follows: number of hits, 5,000; e-value cutoff, 0.01. All other settings were left at their defaults. All hits were sorted into four broad taxonomic groups (A, Archaea; B, Bacteria; E, Eukaryota; V, Viruses) and plotted as distributions against BLAST score. These plots can be accessed through the online table (http://prodata.swmed.edu/rossmann_fold/viruses/); see the column "BLAST distribution plot". The last column of the online table, entitled "BLAST hits kingdoms", defines taxonomical groups from the distribution plot.
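The grouping of BLAST hits into the four taxonomic groups can be sketched in a few lines of Python. This is a minimal illustration, not the script used for the online table: the input format (tuples of superkingdom name, bit score, and e-value), the function names, the bin width, and the way the group label is assembled are all assumptions made for the example. Only the four group codes, the 0.01 e-value cutoff, and the 5,000-hit limit come from the text above.

```python
from collections import Counter, defaultdict

# Group codes as described in the text; dictionary keys are assumed to be
# the superkingdom names attached to each hit.
GROUP_CODES = {"Archaea": "A", "Bacteria": "B", "Eukaryota": "E", "Viruses": "V"}


def group_hits(hits, max_evalue=0.01, max_hits=5000):
    """Sort hits into taxonomic groups.

    hits: iterable of (superkingdom, bit_score, evalue) tuples
          (a hypothetical, already-parsed input format).
    Returns {group_code: [bit_score, ...]}.
    """
    grouped = defaultdict(list)
    kept = 0
    for superkingdom, score, evalue in hits:
        if kept >= max_hits:
            break
        if evalue > max_evalue:
            continue  # fails the e-value cutoff
        code = GROUP_CODES.get(superkingdom)
        if code is None:
            continue  # e.g. unclassified sequences
        grouped[code].append(score)
        kept += 1
    return dict(grouped)


def kingdoms_label(grouped):
    """Concatenate the codes of all groups that have at least one hit."""
    return "".join(code for code in "ABEV" if grouped.get(code))


def score_histogram(scores, bin_width=50):
    """Count scores per bin; each bin is keyed by its lower edge."""
    return Counter(int(s // bin_width) * bin_width for s in scores)


# Made-up hits for illustration only.
example_hits = [
    ("Viruses", 410.0, 1e-120),
    ("Bacteria", 95.5, 2e-20),
    ("Viruses", 60.2, 0.5),        # dropped: e-value above cutoff
    ("Eukaryota", 48.0, 0.005),
    ("unclassified", 300.0, 1e-80),  # dropped: no group code
]
example_groups = group_hits(example_hits)
example_label = kingdoms_label(example_groups)
```

Under these assumptions, a family whose label contains only `V` would correspond to the viral-specific case, while labels that include `A`, `B`, or `E` would indicate hits from cellular organisms.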
1. Monophyly of class I aminoacyl tRNA synthetase, USPA, ETFP, photolyase, and PP-ATPase nucleotide-binding domains: Implications for protein evolution in the RNA world
2. Predicted class-I aminoacyl tRNA synthetase-like proteins in non-ribosomal peptide synthesis
3. The Protein Data Bank and the challenge of structural genomics
4. Natural history of the E1-like superfamily: Implication for adenylation, sulfur transfer, and ubiquitin conjugation
5. Chemical and biological evolution of a nucleotide-binding protein
6. The geometry of domain combination in proteins
7. Viral metagenomics
8. Marine viruses: Major players in the global ecosystem
9. Origins and evolution of viruses of eukaryotes: The ultimate modularity
10. Ninth report of the International Committee on Taxonomy of Viruses
11. What does structure tell us about virus evolution?
12. ECOD: An evolutionary classification of protein domains
13. Manual classification strategies in the ECOD database
14. Genetically engineered phages: A review of advances over the last decade
15. Structure of the tubulin/FtsZ-like protein TubZ from Pseudomonas bacteriophage ΦKZ
16. Nonorthologous gene displacement of phosphomevalonate kinase
17. Atomic snapshots of an RNA packaging motor reveal conformational changes linking ATP hydrolysis to RNA translocation
18. Comparative genomics of the FtsK-HerA superfamily of pumping ATPases: Implications for the origins of chromosome segregation, cell division and viral capsid packaging
19. Evolutionary genomics of the HAD superfamily: Understanding the structural adaptations and catalytic diversity in a superfamily of phosphoesterases and allied enzymes
20. Modular architecture of the bacteriophage T7 primase couples RNA primer synthesis to DNA synthesis
21. The polyphyletic origins of primase-helicase bifunctional proteins
22. The tail sheath structure of bacteriophage T4: A molecular machine for infecting bacteria
23. Structure of the type VI secretion system contractile sheath
24. Type VI secretion apparatus and phage tail-associated protein complexes share a common evolutionary origin
25. High resolution crystal structures of T4 phage β-glucosyltransferase: Induced fit and effect of substrate and metal binding
26. Evolutionary link between glycogen phosphorylase and a DNA modifying enzyme
27. The crystal structure of the bacteriophage PSA endolysin reveals a unique fold responsible for specific recognition of Listeria cell walls
28. Bacteriophage lytic enzymes: Novel anti-infectives
29. Rapid killing of Streptococcus pneumoniae with a bacteriophage cell wall hydrolase
30. A bacteriolytic agent that detects and kills Bacillus anthracis
31. Gene cloning and expression and secretion of Listeria monocytogenes bacteriophage-lytic enzymes in Lactococcus lactis
32. Mycobacteriophage lysin B is a novel mycolylarabinogalactan esterase
33. The wonderful world of archaeal viruses
34. Archaeal viruses: Living fossils of the ancient virosphere?
35. A new DNA binding protein highly conserved in diverse crenarchaeal viruses
36. Structure of A197 from Sulfolobus turreted icosahedral virus: A crenarchaeal viral glycosyltransferase exhibiting the GT-A fold
37. The bacterial conjugation protein TrwB resembles ring helicases and F1-ATPase
38. Structure of the VirB4 ATPase, alone and bound to the core complex of a type IV secretion system
39. Polintons: A hotbed of eukaryotic virus, transposon and plasmid evolution
40. Virophages or satellite viruses?
41. A mimivirus enzyme that participates in viral entry
42. Virus-encoded aminoacyl-tRNA synthetases: Structural and functional characterization of mimivirus TyrRS and MetRS
43. The vaccinia virus H3 envelope protein, a major target of neutralizing antibodies, exhibits a glycosyltransferase fold and binds UDP-glucose
44. Vaccinia virus envelope H3L protein binds to cell surface heparan sulfate and is important for intracellular mature virion morphogenesis and virus infection in vitro and in vivo
45. Vaccinia virus H3L envelope protein is a major target of neutralizing antibodies in humans and elicits protection against lethal challenge in mice
46. Crystal structure of vaccinia virus mRNA capping enzyme provides insights into the mechanism and evolution of the capping apparatus
47. Structural insights into the mechanism and evolution of the vaccinia virus mRNA cap N7 methyltransferase
48. Extended surface for membrane association in Zika virus NS1 structure
49. Rotavirus and severe childhood diarrhea
50. Atomic model of an infectious rotavirus particle
51. Structure of rotavirus outer-layer protein VP7 bound with a neutralizing Fab
52. The origin, evolution and structure of the protein world
53. Structure and dynamics of the P7 protein from the bacteriophage φ12
54. Purification, crystallization and preliminary X-ray analysis of nonstructural protein 2 (nsp2) from avian infectious bronchitis virus
55. Unwinding the 'Gordian knot' of helicase action
56. Structure of adeno-associated virus type 2 Rep40-ADP complex: Insight into nucleotide recognition and catalysis by superfamily 3 helicases
57. Viral and cellular RNA helicases as antiviral targets
58. Helicases: Amino acid sequence comparisons and structure-function relationships
59. The Big Bang of picorna-like virus evolution antedates the radiation of eukaryotic supergroups
60. RecQ family helicases: Roles as tumor suppressor proteins
61. Current progress in antiviral strategies
62. Structural basis for sequence-nonspecific recognition of 5′-capped mRNA by a cap-modifying enzyme
63. RNA methyltransferases involved in 5′ cap biosynthesis
64. Structural and functional insights into alphavirus polyprotein processing and pathogenesis
65. A putative ATPase mediates RNA transcription and capping in a dsRNA virus
66. Structure and function of Zika virus NS5 protein: Perspectives for drug design
67. Horizontal gene transfer: Essentiality and evolvability in prokaryotes, and roles in evolutionary transitions
68. Viral proteins acquired from a host converge to simplified domain architectures
69. Evolutionary history and higher order classification of AAA+ ATPases
70. Diversity in the serine recombinases
71. Integrating DNA: Transposases and retroviral integrases
72. Vaccinia virus F16 protein, a predicted catalytically inactive member of the prokaryotic serine recombinase superfamily, is targeted to nucleoli
73. Searching for three-dimensional secondary structural patterns in proteins with ProSMoS
74. PALSSE: A program to delineate linear secondary structural elements from protein structures
75. The Pfam protein families database: Towards a more sustainable future
76. Using the KEGG database resource
77. PHYLIP (Phylogeny Inference Package) version 3.69
78. Dali server update
79. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs

The authors declare that they have no conflicts of interest with the contents of this article.