key: cord-0720880-rehr84c7 authors: Cagliani, Rachele; Forni, Diego; Clerici, Mario; Sironi, Manuela title: Computational Inference of Selection Underlying the Evolution of the Novel Coronavirus, Severe Acute Respiratory Syndrome Coronavirus 2 date: 2020-06-01 journal: J Virol DOI: 10.1128/jvi.00411-20 sha: c0fd63c9675b3da42526e55082facdbbff999196 doc_id: 720880 cord_uid: rehr84c7 The novel coronavirus severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) that recently emerged in China is thought to have a bat origin, as its closest known relative (BatCoV RaTG13) was described previously in horseshoe bats. We analyzed the selective events that accompanied the divergence of SARS-CoV-2 from BatCoV RaTG13. To this end, we applied a population genetics-phylogenetics approach, which leverages within-population variation and divergence from an outgroup. Results indicated that most sites in the viral open reading frames (ORFs) evolved under conditions of strong to moderate purifying selection. The most highly constrained sequences corresponded to some nonstructural proteins (nsps) and to the M protein. Conversely, nsp1 and accessory ORFs, particularly ORF8, had a nonnegligible proportion of codons evolving under conditions of very weak purifying selection or close to selective neutrality. Overall, limited evidence of positive selection was detected. The 6 bona fide positively selected sites were located in the N protein, in ORF8, and in nsp1. A signal of positive selection was also detected in the receptor-binding motif (RBM) of the spike protein but most likely resulted from a recombination event that involved the BatCoV RaTG13 sequence. In line with previous data, we suggest that the common ancestor of SARS-CoV-2 and BatCoV RaTG13 encoded/encodes an RBM similar to that observed in SARS-CoV-2 itself and in some pangolin viruses. It is presently unknown whether the common ancestor still exists and, if so, which animals it infects. Our data, however, indicate that divergence of SARS-CoV-2 from BatCoV RaTG13 was accompanied by limited episodes of positive selection, suggesting that the common ancestor of the two viruses was poised for human infection. IMPORTANCE Coronaviruses are dangerous zoonotic pathogens; in the last 2 decades, three coronaviruses have crossed the species barrier and caused human epidemics. One of these is the recently emerged SARS-CoV-2. We investigated how, since its divergence from a closely related bat virus, natural selection shaped the genome of SARS-CoV-2. We found that distinct coding regions in the SARS-CoV-2 genome evolved under conditions of different degrees of constraint and are consequently more or less prone to tolerate amino acid substitutions. In practical terms, the level of constraint provides indications about which proteins/protein regions are better suited as possible targets for the development of antivirals or vaccines. We also detected limited signals of positive selection in three viral ORFs. However, we warn that, in the absence of knowledge about the chain of events that determined the human spillover, these signals should not be necessarily interpreted as evidence of an adaptation to our species. number and sequence even among coronaviruses belonging to the same genus or subgenus (4) . On the other hand, accessory proteins have often been shown to contribute to the modulation of immune responses, as well as to virulence (3, 4) . It is thus conceivable that their limited constraint maintains variability in coronavirus Similarity plot (generated with SimPlot) of BatCoV RaTG13 relative to SARS-CoV-2 (Wuhan-Hu-1 reference strain, NC_045512.2). Similarity (Kimura distance) was calculated within sliding windows of 250 bp moving with steps of 50 bp. A schematic representation of the SARS-CoV-2 genome is also shown. ORF and nsp (nonstructural protein) names, lengths, and relative positions are in accordance with the annotation for the reference Wuhan-Hu-1 sequence. Box colors indicate the level of amino acid identity between the SARS-CoV-2 and BatCoV RaTG13 sequences. Black triangles indicate amino acid changes that are polymorphic in the analyzed SARS-CoV-2 genomes. Asterisks denote positively selected sites, and their sizes are proportional to the number of selected sites/region. Short ORFs with names in red were not analyzed with gammaMap. (B and C) Violin plots (median, white dot; interquartile range, black bar) of selection coefficients (␥) for the longest (more that 80 codons) ORFs (B) and nsp3 subdomains (C) are shown. Nsp3 domains were retrieved from the SARS-CoV annotation (68) . accessory ORFs, eventually facilitating rapid adaptation when the environment (e.g., host) changes. We next wished to determine whether positive selection at specific sites also drove the evolution of SARS-CoV-2. We thus estimated codon-wise posterior probabilities for each selection coefficient. Very strong evidence (defined as a posterior probability Ͼ 0.80 of ␥ Ն 1) of positive selection was detected for seven sites, including six in the S1 region of the spike protein and one in N (Fig. 2) . When the posterior probability cutoff was lowered to a less stringent value of 0.50, five additional sites in ORF8 (n ϭ 4) and in nsp1 (n ϭ 1) were identified (Fig. 2 ). It should be noted that this P value cutoff represents reasonably strong evidence of positive selection. Using these criteria, positively selected sites were estimated to account for the 0.12% of analyzed codons seen using 0.5 as the cutoff (0.07% for a 0.8 cutoff) (34, 45, 46) . The S1 region contains the RBD, and the crystal structure of the SARS-CoV S protein in complex with human ACE2 showed that, in turn, the RBD is formed by two subdomains, a core structure and the receptor-binding motif (RBM, which directly contacts ACE2) (47, 48) . The S2 region includes the fusion machinery (49) . We performed homology modeling of the SARS-CoV-2 S protein onto the SARS-CoV structure, and we analyzed the distribution of selection coefficients (Fig. 3A) . The S2 subunit was characterized by stronger constraint than the S1 portion, and five of six putative positively selected sites were found to be located in the RBM, at the binding interface with ACE2 (Fig. 3A) . Comparing SARS-CoV-2 and BatCoV RaTG13, the RBM stands out as the single most divergent region (Fig. 1A) (8, 16) . Very recent evidence indicated that, although the average level of genome similarity is lower than that seen with BatCoV RaTG13, coronaviruses isolated from pangolins have RBMs almost identical to that of SARS-CoV (14) (15) (16) (17) . This clearly implies that recombination might have inflated the estimation of positive selection in the S1 region. A pangolin virus available in GenBank (isolate MP789) has an RBM with high identity to SARS-CoV-2. Thus, using FIG 2 SARS-CoV-2 positively selected sites. A schematic representation of the nsp1, ORF8, spike (S), and nucleocapsid (N) proteins is presented. Positively selected sites (magenta) and amino acid substitutions between SARS-CoV-2 and BatCoV RaTG13 (red) and between SARS-CoV-2 and pangolin-CoV MP789 (blue) are indicated in the alignments. The location of an insertion (insPRRA) in the spike glycoprotein is also shown. This insertion is predicted to occur in the S1/S2 furin-like cleavage site (69, 70) . the genome sequences of isolate MP789, SARS-CoV-2, and BatCoV RaTG13, we searched for recombination events using RDP4 (50) . No evidence of recombination was detected, but that finding might have been due to the fact that the parental sequence with which BatCoV RaTG13 recombined is presently unsampled. We thus analyzed synonymous substitutions in the RBM alignment for these viruses, and we found that 41% (n ϭ 37) of such substitutions are shared between SARS-CoV-2 and isolate MP789, whereas only 27% (n ϭ 10) are shared between SARS-CoV-2 and BatCoV RaTG13. Overall, these findings strongly suggest that recombination rather than positive selection shaped the genetic diversity at the RBM, as previously suggested (16) . Recombination is known to affect evolutionary inference (51) . In this case, because we used the BatCoV RaTG13 as an outgroup, the spurious signals were generated by considering the selected sites to represent amino acid replacements that arose and became fixed in the SARS-CoV-2 population, whereas they might represent changes that occurred in the outgroup through recombination. We consider that this was not the case for the other signals that we detected, as all of them were located in regions of high overall similarity between BatCoV RaTG13 and SARS-CoV-2, indicating no evidence of recombination (Fig. 1A) . The positively selected site (A267) in the nucleocapsid protein is located in the C-terminal domain. Homology modeling using the SARS-CoV N protein as a template indicated that A267 is located on an exposed loop on the protein surface ( Fig. 3B ) (52) . The N protein is the most abundant protein in coronavirus-infected cells (53, 54) . Its primary function is to package the viral genome into a ribonucleoprotein complex. In addition, the N protein performs nonstructural functions, as it regulates the host cell cycle and the stress response, it acts as a molecular chaperone, and it interferes with the host immune response (53, 54) . Because these activities are mediated by interactions with different cellular proteins, the positively selected site might be evolving to establish, maintain, or avoid the binding of different host molecules. Another positively selected site was detected in the nsp1 region, which also displayed relatively weak selective constraint. In SARS-CoV and other betacoronaviruses, nsp1 is a virulence factor and is essential for viral replication at least in the presence of an intact host interferon (IFN) response (55) (56) (57) . Despite their relevant role for viral fitness in vivo, nsp1 proteins tend to be variable in sequence both within and among coronavirus genera. Detailed analysis of SARS-CoV nsp1 indicated that the protein plays multiple roles during viral infection, including inhibition of host protein synthesis, antagonism of IFN responses, modulation of the calcineurin/NFAT (nuclear factor of activated T cells) pathway, and induction of chemokine secretion (43) . Homology modeling using the SARS-CoV nsp1 structure indicated that the positively selected site (E93) is exposed on the protein surface (Fig. 3C) . Extensive mutagenesis of SARS-CoV nsp1 showed that exposed charged residues, including the positively selected site, mediate inhibition of gene expression and antiviral signaling (58) . Moreover, the N-terminal half of SARS-CoV nsp1 interacts with immunophilins and calcipressins to modulate the calcineurin/NFAT pathway (59) . Overall, these observations suggest that the diversity of coronavirus nsp1 proteins is driven by the need to establish interactions with multiple cellular partners and to evade immune surveillance. This is also likely to explain the positive selection signal that we detected. In general, a better understanding of the evolutionary constraints and forces acting on coronavirus nsp1 proteins may be extremely relevant, as the generation of viruses carrying nsp1 mutations was previously reported to be regarded as a potential strategy to generate attenuated vaccine strains (57, 60) , and inhibitors of cyclophilins were previously reported to be potential antivirals for coronavirus treatment (59) . Finally, all of the selected sites that we identified in ORF8 (F3, I10, A14, are T26) are located in the N-terminal portion of the protein (Fig. 2) . The SARS-CoV-2 ORF8 protein displays 30% identity to the intact ORF8 from the SARS-CoV GZ02 stain. It is presently unclear whether the SARS-CoV ORF8 N terminus is cleaved as a signal peptide or inserted into the endoplasmic reticulum membrane (61, 62) . Using computational methods to predict signal peptides and transmembrane helices, we found evidence for both in the case of the N terminus of SARS-CoV-2 ORF8 (not shown). Clearly, experimental analyses will be required to determine the function of the N-terminal region of ORF8 and, more generally, the relevance of the selected sites for virus fitness or pathogenicity. Overall, our analyses indicate that distinct coding regions in the SARS-CoV-2 genome evolve under conditions of different degrees of constraint and are consequently more or less prone to tolerate amino acid substitutions. In practical terms, the level of constraint can provide indications concerning which specific proteins or protein regions are better suited to being possible targets for the development of antivirals or vaccines. Conversely, the current available knowledge and the analyses reported here allow no inference on the selective events (or lack thereof) that turned SARS-CoV-2 into a human pathogen. Recent analyses paid much attention to changes in the RBM. This is indeed expected to represent a major determinant of host range, and its sequence is highly variable among SARS-CoV-related viruses (as is also evident in the data presented in Fig. 2) . Albeit preliminary and necessarily limited to currently sampled genomes, our analyses suggest that recombination had a role in shaping the diversity of the RBMs in these viruses. Our data also indicate that divergence of SARS-CoV-2 from BatCoV RaTG13 was accompanied by limited episodes of positive selection, suggesting that the common ancestor of the two viruses was poised for human infection. We also emphasize that lack of knowledge about the reservoir host and the chain of events that determined the human spillover prevent us from drawing any conclusion on the selective pressure underlying the limited positive selection events tha twe detected. These will need to be interpreted in the future, by incorporating epidemiological, biochemical, and additional genetic data. Clearly, a caveat of our analyses lies in the quality and paucity of SARS-CoV-2 genomes, as well as in the limited availability of genomes of other coronaviruses closely related to SARS-CoV-2. Available sequences were obtained using different methods and most likely contain errors. This is unlikely to strongly affect inference of positive selection, as the frequency of all selected sites is high in the SARS-CoV-2 population. Also, the SARS-CoV-2 sequences that we analyzed display limited diversity (with only 41 nonsynonymous polymorphisms, most of them present in one or a few sequences). Thus, although the availability of additional genomes may increase the power to detect selective events and the confidence with which evolutionary patterns are inferred, simply increasing the number of genomes is unlikely to change the bulk of our results. However, sustained viral spread in the human population will necessarily introduce new mutations in the viral population. Thus, data reported here can depict only the situation of the early phases of the human epidemic. Follow-up analyses of the SARS-CoV-2 population will be required to determine the evolutionary trajectories of new mutations and to assess whether and how they affect viral fitness in the human hots. Sequences and alignments. Genome sequences were retrieved from the National Center for Biotechnology Information database (NCBI; https://www.ncbi.nlm.nih.gov/). Only complete or almostcomplete genome sequences were included in the analysis (Table 1) . Alignments were generated using MAFFT (63) , setting sequence type as codons. Population genetics-phylogenetic analysis. Analyses were performed with gammaMap, which uses intraspecies variation and interspecies diversity to estimate, along coding regions, the distribution of selection coefficients (␥). In this framework, ␥ is defined as 2PN e s, where P is the ploidy, N e is effective population size, and s is the fitness advantage of any amino acid-replacing derived allele (34) . For the eight longest ORFs in the SARS-CoV-2 genome, the corresponding coding sequence of BatCoV RaTG13 was used as the outgroup. We assumed (neutral mutation rate per site), k (transitions/transversions ratio), and T (branch length) to vary within genes following log-normal distributions, whereas p (probability of adjacent codons to share the same selection coefficient) was assumed to follow a log-uniform distribution. For each ORF, we set the neutral frequencies of non-STOP codons (1/61). For selection coefficients, we considered a uniform Dirichlet distribution with the same prior weight for each selection class. For each ORF, we performed 2 runs with 100,000 iterations each and with a thinning interval of 10 iterations. Runs were merged after checking for convergence. The similarity plot was computed using a Kimura (two-parameter) distance model with SimPlot version 3.5.1 (64) . The strip gap option was set at the 50% default value. Similarity scores were calculated in sliding windows of 250 bp moving with a step of 50 bp. Protein 3D structures and homology modeling. The three-dimensional (3D) structures of SARS-CoV N (PDB identifier [ID]: 2CJR) (65) and S (PDB ID: 6ACG) (48) proteins were obtained from the Protein Data Bank (PDB). Homology modeling analysis was performed through the SWISS-MODEL server (66) . The accuracy of the models was examined through the GMQE (Global Model Quality Estimation) and QMEAN (Qualitative Model Energy ANalysis) scores (67). structures were rendered using PyMOL (The PyMOL Molecular Graphics System The species Severe acute respiratory syndromerelated coronavirus: classifying 2019-nCoV and naming it SARS-CoV-2 China Novel Coronavirus Investigating and Research Team. 2020. A novel coronavirus from patients with pneumonia in China Origin and evolution of pathogenic coronaviruses Molecular evolution of human coronavirus genomes Molecular epidemiology, evolution and phylogeny of SARS coronavirus Middle East respiratory syndrome coronavirus (MERS-CoV): announcement of the Coronavirus Study Group Severe acute respiratory syndrome coronavirus phylogeny: toward consensus A pneumonia outbreak associated with a new coronavirus of probable bat origin A new coronavirus associated with human respiratory disease in China Discovery of a rich gene pool of bat SARS-related coronaviruses provides new insights into the origin of SARS coronavirus Discovery and genetic analysis of novel coronaviruses in least horseshoe bats in southwestern China Genomic characterization and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding Full-genome evolutionary analysis of the novel corona virus (2019-nCoV) rejects the hypothesis of emergence as a result of a recent recombination event Identification of 2019-nCoV related coronaviruses in Malayan pangolins in southern China Isolation and characterization of 2019-nCoV-like coronavirus from Malayan pangolins Evidence of recombination in coronaviruses implicating pangolin origins of nCoV-2019 Switching species tropism: an effective way to manipulate the feline coronavirus genome Retargeting of coronavirus by substitution of the spike glycoprotein ectodomain: crossing the host cell species barrier Lethal infection of K18-hACE2 mice infected with severe acute respiratory syndrome coronavirus Retroviruses pseudotyped with the severe acute respiratory syndrome coronavirus spike protein efficiently infect cells expressing angiotensin-converting enzyme 2 The N-terminal region of the murine coronavirus spike glycoprotein is associated with the extended host range of viruses from persistently infected murine cells Angiotensin-converting enzyme 2 is a functional receptor for the SARS coronavirus Receptor and viral determinants of SARS-coronavirus adaptation to human ACE2 Mechanisms of host receptor adaptation by severe acute respiratory syndrome coronavirus Identification of two critical amino acid residues of the severe acute respiratory syndrome coronavirus spike protein for its variation in zoonotic tropism transition via a double substitution strategy Molecular evolution of the SARS coronavirus during the course of the SARS epidemic in China Severe acute respiratory syndrome (SARS) coronavirus ORF8 protein is acquired from SARSrelated coronavirus from greater horseshoe bats through recombination Attenuation of replication by a 29 nucleotide deletion in SARS-coronavirus acquired during the early stages of human-to-human transmission A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission: a study of a family cluster Feng Z. 2020. Early transmission dynamics in Importation and human-to-human transmission of a novel coronavirus in Vietnam A population genetics-phylogenetics approach to inferring natural selection in coding sequences Time-dependent rates of molecular evolution Purifying selection can obscure the ancient age of viral lineages A case for the ancient origin of coronaviruses The nonstructural proteins directing coronavirus RNA synthesis and processing Sequence and topology of a model intracellular membrane protein, E1 glycoprotein, from a coronavirus The M, E, and N structural proteins of the severe acute respiratory syndrome coronavirus are required for efficient assembly, trafficking, and release of virus-like particles The membrane protein of severe acute respiratory syndrome coronavirus acts as a dominant immunogen revealed by a clustering region of novel functionally and structurally defined cytotoxic T-lymphocyte epitopes Protective humoral responses to severe acute respiratory syndromeassociated coronavirus: implications for the design of an effective proteinbased vaccine Coronavirus nonstructural protein 1: common and distinct functions in the regulation of host and viral gene expression Bioinformatics and functional analyses of coronavirus nonstructural proteins involved in the formation of replicative organelles Molecular evolution at a meiosis gene mediates species differences in the rate and patterning of recombination Holding it together: rapid evolution and positive selection in the synaptonemal complex of Drosophila Structure of SARS coronavirus spike receptor-binding domain complexed with receptor Cryo-EM structure of the SARS coronavirus spike glycoprotein in complex with its host cell receptor ACE2 Recombination, reservoirs, and the modular spike: mechanisms of coronavirus cross-species transmission Detecting and analyzing genetic recombination using RDP4 Analysing recombination in nucleotide sequences Solution structure of the c-terminal dimerization domain of SARS coronavirus nucleocapsid protein solved by the SAIL-NMR method The SARS coronavirus nucleocapsid protein-forms and functions The SARS-CoV nucleocapsid protein: a protein with multifarious activities Severe acute respiratory syndrome coronavirus evades antiviral signaling: role of nsp1 and rational design of an attenuated strain Mutagenesis of the murine hepatitis virus nsp1-coding region identifies residues important for protein processing, viral RNA synthesis, and viral replication Coronavirus non-structural protein 1 is a major pathogenicity factor: implications for the rational design of coronavirus vaccines Identification of residues of SARS-CoV nsp1 that differentially affect inhibition of gene expression and antiviral signaling The SARS-coronavirus-host interactome: identification of cyclophilins as target for pan-coronavirus inhibitors Identification of the mechanisms causing reversion to virulence in an attenuated SARS-CoV for the design of a genetically stable vaccine The 29-nucleotide deletion present in human but not in animal severe acute respiratory syndrome coronaviruses disrupts the functional expression of open reading frame The 8ab protein of SARS-CoV is a luminal ER membrane-associated protein and induces the activation of ATF6 MAFFT multiple sequence alignment software version 7: improvements in performance and usability Full-length human immunodeficiency virus type 1 genomes from subtype C-infected seroconverters in India, with evidence of intersubtype recombination Structure of the SARS coronavirus nucleocapsid protein RNA-binding dimerization domain suggests a mechanism for helical packaging of viral RNA SWISS-MODEL: modelling protein tertiary and quaternary structure using evolutionary information Toward the estimation of the absolute quality of individual protein structure models Nsp3 of coronaviruses: structures and functions of a large multi-domain protein Genomic characterization of the 2019 novel human-pathogenic coronavirus isolated from a patient with atypical pneumonia after visiting Wuhan The spike glycoprotein of the new coronavirus 2019-nCoV contains a furin-like cleavage site absent in CoV of the same clade This work was supported by the Italian Ministry of Health (Ricerca Corrente 2019-2020 to M.S. and Ricerca Corrente 2018-2020 to D.F.).