key: cord-0835708-fs07zdu6
authors: Sharma, Amresh Kumar; Kumari, Priyanka; Som, Anup
title: Recombination in Sarbecovirus lineage and mutations/insertions in spike protein linked to the emergence and adaptation of SARS-CoV-2
date: 2021-10-15
journal: bioRxiv
DOI: 10.1101/2020.05.12.091199
sha: 6cbc457a44567b78f1ab6950219a0743c2489a09
doc_id: 835708
cord_uid: fs07zdu6

The outbreak of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in Wuhan city, China in December 2019 and its subsequent spread across the world have created a global pandemic and public health crisis. Researchers across the world are investigating the origin and evolution of SARS-CoV-2, its transmission route, the molecular mechanism of interaction between SARS-CoV-2 and host cells, and the cause of its pathogenicity. In this paper, we shed light on the origin, evolution and adaptation of SARS-CoV-2 into human systems. Our phylogenetic/evolutionary analysis supports that bat-CoV-RaTG13 is the closest relative of human SARS-CoV-2, that the outbreak of SARS-CoV-2 took place via inter- and intra-species transmission, and that host-specific adaptation occurred in SARS-CoV-2. Furthermore, genome recombination analysis found that Sarbecoviruses, the subgenus containing SARS-CoV and SARS-CoV-2, undergo frequent recombination. Multiple sequence alignment (MSA) of spike proteins revealed the insertion of four amino acid residues “PRRA” (Proline-Arginine-Arginine-Alanine) into the SARS-CoV-2 human strains. Structural modeling of the spike protein of bat-CoV-RaTG13 also shows a high number of mutations at one of the receptor binding domains (RBD). Overall, this study finds that the probable origin of SARS-CoV-2 is the result of intra-species recombination events between bat coronaviruses belonging to the Sarbecovirus subgenus, and that the insertion of amino acid residues “PRRA” and mutations in the RBD of the spike protein are probably responsible for the adaptation of SARS-CoV-2 into human systems.
Thus, our findings add strength to the existing knowledge on the origin and adaptation of SARS-CoV-2, and can be useful for understanding the molecular mechanisms of interaction between SARS-CoV-2 and host cells, which is crucial for vaccine design and predicting future pandemics.

Coronaviruses are single-stranded RNA viruses with genomes of 26 to 32 kilobases (kb) that encode both structural and non-structural proteins. They are known to cause lower and upper respiratory diseases, central nervous system infection and gastroenteritis in a number of avian and mammalian hosts, including humans (Zhu et al., 2019; Gorbalenya et al., 2020). The recent outbreak of the novel coronavirus (SARS-CoV-2), associated with the acute respiratory disease called coronavirus disease 19 (commonly known as COVID-19), has caused a global pandemic. As of 15 June 2021, more than 175 million laboratory-confirmed COVID-19 cases and approximately 3.78 million deaths have been reported, and COVID-19 remains a global threat to public health and to human civilization, as the economic and social disruption caused by the pandemic is devastating (WHO, COVID-19 situation reports).

Coronaviruses are placed within the family Coronaviridae, which has two subfamilies, Orthocoronavirinae and Torovirinae. Orthocoronavirinae has four genera: Alphacoronavirus (average genome size 28 kb), Betacoronavirus (average genome size 30 kb), Gammacoronavirus (average genome size 28 kb), and Deltacoronavirus (average genome size 26 kb) (King et al., 2011). Coronaviruses are typically harbored in mammals and birds. In particular, Alphacoronavirus and Betacoronavirus infect mammals, and Gammacoronavirus and Deltacoronavirus infect avian species (Woo et al., 2009; 2010; Fan et al., 2019). SARS-CoV-2 is a member of the genus Betacoronavirus and subgenus Sarbecovirus. Figure 1 shows the taxonomical origin of SARS-CoV-2.
The previous important coronavirus outbreaks are the severe acute respiratory syndrome coronavirus (SARS-CoV or SARS-CoV-1) outbreak in China in 2002/03 and the Middle East respiratory syndrome coronavirus (MERS-CoV) outbreak in 2012, which resulted in severe epidemics in the respective geographical regions (Eickmann et al., 2003; Vijaykrishna et al., 2007; Zumla et al., 2015; Hayes et al., 2019). The present outbreak of SARS-CoV-2 is the third documented spillover ...

Overall, the phylogenetic analysis comprises 166 complete viral genomes (162 Orthocoronavirinae and four Torovirinae genomes). Details of the genome sequences used in this study can be found in Supplementary File S1. The genome sequences were aligned using the MAFFT alignment tool (Katoh et al., 2002). Genome trees of the Orthocoronavirinae and Betacoronaviruses were reconstructed using the maximum likelihood (ML) method and the GTR+G+I model of nucleotide substitution, as selected by the model test, with 1000 bootstrap replicates. The model test was performed for accurate phylogenetic estimation using ModelFinder, which is implemented in IQ-TREE version 1.5.4 (Kalyaanamoorthy et al., 2017). Phylogenetic trees were reconstructed using the IQ-TREE software (Nguyen et al., 2015). The trees were visualized with the iTOL tool (Letunic et al., 2019). Five gene trees of the Betacoronaviruses were reconstructed using Orf1ab, Spike (S), Envelope (E), Membrane (M), and Nucleocapsid (N) amino acid sequences. The ML method of tree reconstruction and the protein-specific amino acid substitution model selected by ModelFinder were used for gene tree reconstruction. A bootstrap test with 1000 replicates was carried out to check the reliability of the gene trees. Potential recombination events in the history of the Betacoronaviruses were assessed using the RDP5 package (Martin et al., 2015). The RDP5 analysis was conducted on the complete genome sequences using the RDP, GENECONV, BootScan, MaxChi, Chimaera, SiScan, and 3Seq methods.
Putative recombination events were identified with a Bonferroni-corrected P-value cutoff of 0.05 and support from more than four methods. Homology and genetic variation in different genomic regions of the SARS-CoV-2 strain Wuhan-Hu-1 (MN908947) were compared with bat-CoV-RaTG13 (MN996532) and pangolin-CoV-GX-P5E (MT040336) using CLUSTAL W (https://www.genome.jp/toolsbin/clustalw), and multiple sequence alignment (MSA) of the spike proteins was performed using CLUSTAL OMEGA (https://www.ebi.ac.uk/Tools/msa/clustalo/). The structures of the spike proteins of SARS-CoV-2 Wuhan-Hu-1 (PDB: 6XLU) and bat-CoV-RaTG13 (PDB: 6ZGF) were retrieved from the PDB database (Rose et al., 2016). The spike protein structure for the pangolin coronavirus was not available, so it was modeled using the SWISS-MODEL server (https://swissmodel.expasy.org) with 6XR8 as template. These structures were compared using the structure superimposition/structure alignment tool of the Chimera software (Pettersen et al., 2004).

In this study we aim to understand the origin and evolutionary trajectory of SARS-CoV-2 using molecular phylogenetic, genetic recombination and structural analyses. In particular, we study the origin of SARS-CoV-2 from its deep ancestral roots (i.e., from its deeper shared evolutionary history). Accordingly, the molecular phylogenetic analysis was based on a two-stage genome phylogeny followed by gene tree analyses. First, we reconstructed the genome phylogeny of the Orthocoronavirinae and studied the cladistic/evolutionary relationships of its four genera. Second, we reconstructed the Betacoronavirus genome and gene phylogenies, which included its five subgenera (Embecovirus, Hibecovirus, Merbecovirus, Nobecovirus and Sarbecovirus), and studied the evolutionary relationships among these five subgenera.
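The event-filtering criterion stated in the methods (a Bonferroni-corrected P-value cutoff of 0.05 and support from more than four detection methods) can be illustrated with a minimal Python sketch. The function names, event labels, method keys and P-values below are hypothetical, and the sketch assumes that raw P-values are corrected by multiplying them by a caller-supplied number of comparisons. It only illustrates the stated rule and is not how the RDP5 package itself is implemented.

```python
def bonferroni(p_value, n_tests):
    """Bonferroni correction: scale a raw P-value by the number of tests, capped at 1."""
    return min(1.0, p_value * n_tests)


def supported_events(events, n_tests, alpha=0.05, min_methods=5):
    """Return the names of events significant in at least `min_methods` methods.

    events: dict mapping event name -> {method name: raw P-value}.
    "More than four methods" in the text corresponds to min_methods=5.
    """
    kept = []
    for name, p_by_method in events.items():
        n_support = sum(
            bonferroni(p, n_tests) < alpha for p in p_by_method.values()
        )
        if n_support >= min_methods:
            kept.append(name)
    return kept


# Hypothetical raw P-values for two candidate events across seven methods.
events = {
    "event_A": {"m1": 1e-6, "m2": 1e-5, "m3": 2e-4, "m4": 1e-7,
                "m5": 3e-5, "m6": 0.2, "m7": 1e-3},
    "event_B": {"m1": 1e-6, "m2": 1e-6, "m3": 1e-6, "m4": 0.01,
                "m5": 0.01, "m6": 0.5, "m7": 0.5},
}

# With 100 assumed comparisons, event_A is significant in five methods
# and event_B in only three, so only event_A is retained.
print(supported_events(events, n_tests=100))
```

Requiring agreement across several methods is a conservative choice: an event flagged by a single method is more likely to be a method-specific artifact than one detected independently by most of them.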
The genome phylogeny of Orthocoronavirinae shows that Alpha-, Beta-, Delta- and Gammacoronaviruses clustered according to their cladistic relations, with the Alphacoronavirus clade appearing as a basal radiation of the Orthocoronavirinae phylogeny (Fig. 2). This result is consistent with previous results (Luk et al., 2019; Wu et al., 2020). Furthermore, analysis of the clades found that the Gammacoronavirus and Deltacoronavirus clades are monophyletic (originated from a single common ancestor). This result is consistent with their host range, as both genera mostly infect avian species (Wertheim et al., 2013). A deeper analysis of the Orthocoronavirinae genome tree revealed that host-specific strains cluster together irrespective of their geographical locations. This is probably due to host adaptation, which is an important characteristic of viral genomes for their survival and replication (Songa et al., 2005; Fung et al., 2019; Andersen et al., 2020). For example, the Alphacoronavirus strains ferret_Japan and ferret_Netherland are monophyletic. Similarly, cat_UK is monophyletic with cat_Netherland, and human_China is monophyletic with human_Netherland. Further analysis revealed that all Alphacoronavirus camel strains from Saudi Arabia appeared in a distinct subclade with the bat_Ghana strain as outgroup, which indicates that interspecies transmission took place from bat_Ghana to camel. A body of literature has also reported that SARS-CoV-2 transmission to humans took place through intermediate hosts (Montoya et al., 2020; Roy et al., 2021; York, 2020; Zhou et al., 2020). These results reconfirm that coronaviruses are present in a large number of hosts that are widespread across different geographical locations, and that coronaviruses undergo host-specific adaptation (Nakagawa and Miyazawa, 2020).

Phylogenetic analysis of Betacoronavirus genomes revealed that the five subgenera clustered separately (Fig. 3).
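To make the notion of monophyly used above concrete, the following Python sketch checks whether a set of strains forms a clade in a rooted tree, i.e. whether some internal node has exactly those strains as its descendants. The nested-tuple tree encoding and the function names are assumptions made for illustration, and the toy topology merely reuses strain labels from the text; it is not the tree in Fig. 2.

```python
def leaves(node):
    """Return the set of tip labels below a node (a tip is a plain string)."""
    if isinstance(node, str):
        return frozenset([node])
    result = frozenset()
    for child in node:
        result |= leaves(child)
    return result


def clades(node):
    """Yield the tip set of every clade (subtree) of a rooted tree."""
    yield leaves(node)
    if not isinstance(node, str):
        for child in node:
            yield from clades(child)


def is_monophyletic(tree, tips):
    """True if the given tips are exactly the descendants of one node."""
    target = frozenset(tips)
    return any(clade == target for clade in clades(tree))


# Toy rooted tree (hypothetical topology, labels borrowed from the text).
toy_tree = (
    ("ferret_Japan", "ferret_Netherland"),
    (("cat_UK", "cat_Netherland"), "bat_Ghana"),
)

print(is_monophyletic(toy_tree, {"ferret_Japan", "ferret_Netherland"}))  # True
print(is_monophyletic(toy_tree, {"cat_UK", "bat_Ghana"}))                # False
```

The check depends on where the tree is rooted, which is why an outgroup such as the bat_Ghana strain matters when clades are interpreted.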
Furthermore, the Betacoronavirus genome tree shows that host-specific strains from distant geographical locations formed monophyletic clades. For example, in ... descended from a common ancestor). Clade 3 also showed that pangolin-CoV (PCoV-GX-P5E) is the second closest relative of human SARS-CoV-2 after bat-CoV-RaTG13. This result was also reported by other studies (Zhang et al., 2020). Further, deep node analysis in Clade 3 suggested that SARS-CoV-2 strains, pangolin CoVs (strains PCoV-GX-P4L/P3B/P1E/P5E/P2V) and bat-CoVs (strains bat-SL-CoVZXC21 and bat-SL-CoVZC45) shared a single common ancestor (Fig. 3). These clade analyses suggest that bat and pangolin are the natural reservoirs of SARS-CoV-2 and that transmission from bat/pangolin to humans possibly took place through intermediate organisms (Montoya et al., 2021). A comprehensive study based on the codon adaptation index reported that natural selection and host adaptation have occurred in SARS-CoV-2 (Roy et al., 2021). A similar finding was also reported by Lu et al. (2020). Therefore, in summary, this study shows that coronaviruses belonging to Sarbecovirus in bats could be the origin of SARS-CoV-2.

In addition to genome phylogeny, gene tree analysis was also conducted, as it provides a more reliable basis for studying species evolution. Five gene trees of the Betacoronaviruses, namely Orf1ab, Spike, Envelope, Membrane, and Nucleocapsid, were reconstructed for gene tree analysis ... Betacoronavirus genome/species tree might be possible, as a gene tree can differ from the species tree for various analytical and/or biological reasons (Degnan et al., 2009; Som, 2013; 2015). Further analysis of the gene trees found that, except for the Envelope gene tree, the other four gene trees showed bat-CoV-RaTG13 as the closest relative of SARS-CoV-2, followed by pangolin-CoV, as found in the genome tree analysis (Figs. 4, S2, S3, S5).
The different evolutionary pattern of the Envelope gene tree is probably due to stochastic error, as the protein is very short (average length 75 amino acids) (Som, 2015). Further analysis of the gene trees found that, although the four gene trees are similar at the subgenus level, there are widespread phylogenetic incongruences within subgenera (Jeffroy et al., 2006). This result led us to hypothesize that recombination events occurred among Betacoronaviruses in the past and gave rise to new strains, including the emergence of pathogenic lineages like SARS-CoV-2.

In region3 (Spike gene), the bat-CoV-RaTG13 genome shows divergence from the SARS-CoV-2 genome, and there are a good number of recombination events among the bat and pangolin strains. In region4 (E, M, N and ORF3/6-8/10 genes), all strains show high similarity and few recombination events with the SARS-CoV-2 strain. Further, gene-wise recombination analysis found the highest number of recombination events in the spike protein (nine events), followed by the Orf1ab protein (six events). The Membrane and Nucleocapsid proteins showed few recombination events, and the Envelope protein did not show any recombination event. Overall, the recombination results support our phylogenetic inference and suggest that the origin of SARS-CoV-2 is the result of ancestral intra-species recombination events between bat SARS-CoVs (Flores-Alanis et al., 2020). Details of the recombination analysis are given in Table 1.

... sequences with respect to the SARS-CoV-2 Wuhan-Hu-1 strain, and found that the spike protein has the highest genetic variation, 3% and 7%, respectively (Table 2). Major genetic variations in the spike protein seem essential for the transition from animal-to-human transmission to human-to-human transmission of SARS-CoV-2 (Su et al., 2016; Luk et al., 2019; Jaimes et al., 2020; Mondal et al., 2021).
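A percentage of genetic variation between two sequences, of the kind reported above, is typically read off a pairwise alignment. The Python sketch below shows one simple way to compute it, together with a helper that lists segments present in a reference but gapped in the other sequence. The function names, the gap-handling convention (a gap opposite a residue counts as a difference) and the toy sequences are assumptions for illustration; they are not the study's data, and CLUSTAL W may report identity under a different convention.

```python
def percent_variation(ref, other, gap="-"):
    """Percentage of aligned columns at which two aligned sequences differ.

    Columns where both sequences have a gap are ignored; a gap opposite a
    residue counts as a difference (an assumed convention).
    """
    if len(ref) != len(other):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(ref, other):
        if a == gap and b == gap:
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * differences / compared


def inserted_segments(ref, other, gap="-"):
    """List (start_column, residues) runs present in ref but gapped in other."""
    segments = []
    start = None
    for i, (a, b) in enumerate(zip(ref, other)):
        if b == gap and a != gap:
            if start is None:
                start = i
        elif start is not None:
            segments.append((start, ref[start:i]))
            start = None
    if start is not None:
        segments.append((start, ref[start:]))
    return segments


# Toy aligned sequences (hypothetical, not taken from the study).
ref_seq = "ACDEFGHIKL"
other_seq = "ACDE--HIKV"

print(percent_variation(ref_seq, other_seq))   # 3 of 10 columns differ
print(inserted_segments(ref_seq, other_seq))   # one two-residue insertion
```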
We further performed MSA of the spike protein sequences and observed the insertion of the novel amino acids "PRRA" in the spike protein of SARS-CoV-2 (Fig. 6). A number of studies have also reported the insertion of "PRRA" residues in the spike protein ... This cleavage point between the receptor binding domain (S1) and the fusion peptide (S2) mediates cell-cell fusion and entry into human cells ... Thus, structural analysis supports the MSA results and suggests that SARS ...

In quest of the origin, evolution and adaptation of SARS-CoV-2, our analysis suggests that the probable origin of SARS-CoV-2 is the result of ancestral intra-species recombination events between bat coronaviruses belonging to the Sarbecovirus subgenus, and that the insertion of the four amino acids "PRRA" in the spike protein of SARS-CoV-2, along with a high number of mutations at one of its receptor-binding domains, is probably responsible for the adaptation of SARS-CoV-2 into human systems. Thus, our findings add strength to the existing knowledge on the origin and adaptation of SARS-CoV-2.
Further a detailed mechanistic understanding of molecular mechanisms of interaction between SARS-CoV-2 ...

The proximal origin of SARS-CoV-2
The polybasic insert, the RBD of the SARS-CoV-2 spike protein, and the feline coronavirus - evolved or yet to evolve
Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19 pandemic
How SARS-CoV-2 first adapted in humans
The spike glycoprotein of the new coronavirus 2019-nCoV contains a furin-like cleavage site absent in CoV of the same clade
Origin and evolution of pathogenic coronaviruses
Gene tree discordance, phylogenetic inference and the multispecies coalescent
Bat Coronaviruses in China
The receptor binding domain of SARS-CoV-2 spike protein is the result of an ancestral recombination between the bat-CoV RaTG13 and the pangolin-CoV MP789
Human coronavirus: host-pathogen interaction
The species Severe acute respiratory syndrome-related coronavirus: classifying SARS-CoV-2 and naming it SARS-CoV-2
Structural and functional properties of SARS-CoV-2 spike protein: potential antivirus drug development for COVID-19
Evolutionary and structural analyses of SARS-CoV-2 D614G spike protein mutation now documented worldwide
Phylogenetic Analysis and Structural Modeling of SARS-CoV-2 Spike Protein Reveals an Evolutionary Distinct and Proteolytically Sensitive Activation Loop
Phylogenomics: the beginning of incongruence?
ModelFinder: fast model selection for accurate phylogenetic estimates
MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform
Order nidovirales. Virus Taxonomy, Ninth Report of the International Committee on Taxonomy of Viruses
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and coronavirus disease-2019 (COVID-19): The epidemic and the challenges
Are pangolins the intermediate host of the 2019 novel coronavirus (SARS-CoV-2)?
Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding
Molecular epidemiology, evolution and phylogeny of SARS coronavirus
RDP4: Detection and analysis of recombination patterns in virus genomes
Mutations in SARS-CoV-2 viral RNA identified in Eastern India: possible implications for the ongoing outbreak in India and impact on viral structure and host susceptibility
Pattern of genomic variation in SARS-CoV-2 (COVID-19) suggests restricted nonrandom changes: Analysis using Shewhart control charts
Variable routes to genomic and host adaptation among coronaviruses
Genome evolution of SARS-CoV-2 and its virological characteristics
IQ-TREE: A Fast and Effective Stochastic Algorithm for Estimating Maximum-Likelihood Phylogenies
UCSF Chimera - A visualization system for exploratory research and analysis
Evolutionary Trajectory for the Emergence of Novel Coronavirus SARS-CoV-2
How bacterial pathogens colonize their hosts and invade deeper tissues. Microbes Infect
The RCSB protein data bank: integrative view of protein, gene and 3D structural information
Base Composition and Host Adaptation of the SARS-CoV-2: Insight From the Codon Usage Perspective
Population genomics of bacterial host adaptation
Genome-Scale Approach and the Performance of Phylogenetic Methods
Causes, consequences and solutions of phylogenetic incongruence
Cross-host evolution of severe acute respiratory syndrome coronavirus in palm civet and human
SARS-CoV-2 genomics: an Indian perspective on sequencing viral variants
Epidemiology, Genetic Recombination, and Pathogenesis of Coronaviruses
The COVID-19 epidemic
Evolutionary Insights into the Ecology of Coronaviruses
A case for the ancient origin of coronaviruses
A Unique Protease Cleavage Site Predicted in the Spike Protein of the Novel Pneumonia Coronavirus (2019-nCoV) Potentially Related to Viral Transmissibility
Coronavirus Diversity, Phylogeny and Interspecies Jumping
A new coronavirus associated with human respiratory disease in China
Novel coronavirus takes flight from bats?
Structural impact on SARS-CoV-2 spike protein by D614G substitution
Probable Pangolin Origin of SARS-CoV-2 Associated with the COVID-19 Outbreak
Molecular mechanism of interaction between SARS-CoV-2 and host cells and interventional therapy
A pneumonia outbreak associated with a new coronavirus of probable bat origin
A Novel Coronavirus from Patients with Pneumonia in China
Middle East respiratory syndrome

Thanks to Arindam Ghosh for useful discussions. This work was partly supported by the Department of Biotechnology (DBT) and University Grants Commission (UGC), India. The authors declare that they have no conflict of interest. This article does not contain any studies with human participants or animals performed by any of the authors.