key: cord-0689814-uadfehr6
authors: Zhang, X. W.; Yap, Y. L.; Danchin, A.
title: Testing the hypothesis of a recombinant origin of the SARS-associated coronavirus
date: 2004-10-11
journal: Arch Virol
DOI: 10.1007/s00705-004-0413-9
sha: 389e155ca041826b82fd8079d545386856110a1b
doc_id: 689814
cord_uid: uadfehr6

The origin of severe acute respiratory syndrome-associated coronavirus (SARS-CoV) is still a matter of speculation, although more than one year has passed since the onset of the SARS outbreak. In this study, we implemented a 3-step strategy to test the intriguing hypothesis that SARS-CoV might have been derived from a recombinant virus. First, we ran a BLAST search of the whole SARS-CoV genome against a virus database to identify viruses of interest. Second, we employed 7 recombination detection techniques, each well documented as successfully detecting recombination events, to explore the presence of recombination in the SARS-CoV genome. Finally, we conducted phylogenetic analyses to further explore whether recombination has indeed occurred in the course of coronavirus history predating the emergence of SARS-CoV. Surprisingly, we found that 7 putative recombination regions, located in Replicase 1ab and the Spike protein, exist between SARS-CoV and 6 other coronaviruses: porcine epidemic diarrhea virus (PEDV), transmissible gastroenteritis virus (TGEV), bovine coronavirus (BCoV), human coronavirus 229E (HCoV), murine hepatitis virus (MHV), and avian infectious bronchitis virus (IBV). Thus, our analyses substantiate the presence of recombination events in the history that led to the SARS-CoV genome. Like the other coronaviruses used in the analysis, SARS-CoV also has a mosaic structure.

SARS, a new disease characterized by high fever, malaise, rigor, headache and non-productive cough, has spread to over 30 countries with an average mortality rate of around 8%. Sequence analysis of SARS coronavirus (SARS-CoV) [17, 25] showed that it is a novel coronavirus [12]. Anand et al.
[1] reported a three-dimensional model of SARS-CoV main proteinase and suggested that

There are a number of methods and software packages that have been developed for detection of recombination events in DNA sequences. The performance of these methods has been extensively evaluated and compared on simulated and real data [23, 24]. In the present study we applied these methods to RNA viruses. The genomes of SARS-CoV and 6 other coronaviruses (IBV, BCoV, HCoV, MHV, PEDV, TGEV) were first aligned using CLUSTALW [33]. Sites with gaps were removed and a 25077-nt alignment was generated. Subsequently, seven methods were employed to detect the occurrence of recombination (see the corresponding reference in brackets for details of each method): BOOTSCAN [26], GENECONV [28], DSS (Difference of Sums of Squares) [20], HMM (Hidden Markov Model) [8], MAXCHI (Maximum Chi-Square method) [19], PDM (Probabilistic Divergence Measures) [9], and RDP (Recombination Detection Program) [18]. BOOTSCAN, MAXCHI and RDP are implemented in the RDP software package, http://web.uct.ac.za/depts/microbiology/microdescription.htm. GENECONV is implemented in the program available at http://www.math.wustl.edu/~sawyer/geneconv/. DSS, HMM and PDM are implemented in the TOPALi software package, http://www.bioss.sari.ac.uk/software.html. Default parameter settings were used in all the programs, except for the following values: gscale = 1 (GENECONV), internal and external references (RDP), window size = 300 and step = 10 (DSS, HMM and PDM). After potential recombination events were identified by at least 3 of the methods above, separate neighbor joining trees were constructed for each putative recombination region to better evaluate the evidence for conflicting evolutionary histories of different sequence regions. All trees were produced with the TOPALi package mentioned above.

Table 2 summarizes the results of BOOTSCAN analysis with 100% bootstrap support and significant P-value (<0.05 for uncorrected and MC corrected P-value).
Two regions (13151-13299 and 16051-16449, position in alignment) are identified as putative recombination regions, and all 6 coronaviruses are potential parents with SARS-CoV as potential daughter. GENECONV detected 9 putative recombination events across a wide range of positions, 5941-24997 (in alignment), at a significance level of p < 0.05 for two P-values: the simulated P-value (based on 10,000 permutations) and the BLAST-like BC KA P-value (Table 3). All 6 coronaviruses are potential parents with SARS-CoV as potential daughter. MAXCHI identified 15 putative recombination events (Table 4; possible misidentification events are not retained). Most of the breakpoints are significant at about the 0.001 level; their positions in the alignment span from 3534 to 22840, but some beginning or ending breakpoints are not determined. Similarly, all 6 coronaviruses are potential parents with SARS-CoV as potential daughter. RDP revealed 6 putative recombination events in the alignment region 5910-13334 (Table 5), with the uncorrected and MC corrected P-values below 0.002 and 0.05, respectively. In this case, 4 coronaviruses (IBV, BCoV, MHV and PEDV) are potential parents with SARS-CoV as potential daughter.

Figure 1 shows the DSS profiles of putative breakpoints between SARS-CoV and other coronaviruses (the dotted line indicates the 95th percentile under the null hypothesis of no recombination): SARS-CoV, IBV, BCoV and MHV (Fig. 1a), SARS-CoV, MHV, PEDV and TGEV (Fig. 1b), SARS-CoV, IBV, HCoV and TGEV (Fig. 1c). There are about 6 different breakpoints (significant peaks): 13614 and 16085 (Fig. 1a), 11008 and 12850 (Fig. 1b), 12805, 13614 and 16444 (Fig. 1c).

HMM plots for SARS-CoV, IBV, BCoV and HCoV (Fig. 2) revealed putative breakpoints at about positions 5500 and 19000. There is a clear transition from state 1 (SARS-CoV grouped with IBV) (Fig. 2a) into state 3 (SARS-CoV grouped with HCoV) (Fig. 2c).
The region between 5500 and 19000 is noisy, and HMM provides no information for it at present. Figure 3 shows the results of PDM analysis performed on SARS-CoV and other coronaviruses (dotted line indicates the 95% critical region for the null (Fig. 3c, d), 1393, 6111, 16624, 19859 and 20802 (Fig. 3e, f).

Posada [23] suggested that one should not rely too much on a single method for recombination detection. Here we consider the regions identified by at least 3 methods as putative recombination regions. The results are summarized in Table 6. Seven putative recombination regions span a range of positions in SARS-CoV

Phylogenetic trees constructed from the putative recombination regions and non-recombination regions identified by the above techniques are shown in Figure 4. The left panels show non-recombination regions and the right panels show recombination regions. Comparing each row of panels, we found that the phylogenetic tree in the left panel (non-recombination region) had a very different topology from the phylogenetic tree in the right panel (recombination region), which indicates that recombination has occurred. For example, in Fig. 4a, the 7 coronaviruses are divided into 4 groups: group 1 for TGEV, HCoV and PEDV, group 2 for BCoV and MHV, group 3 for IBV, and group 4 for SARS-CoV, consistent with Marra et al. [17]; while in Fig. 4b, 7 coronaviruses are divided and SARS-CoV, suggests that SARS-CoV is most closely related to BCoV and MHV, which is consistent with a recent report [29]. At the same time, SARS-CoV is also most closely related to TGEV (Fig. 4d) and IBV (Fig. 4f). Thus, phylogenetic analysis substantiates the presence of recombination events in the history that led to the SARS-CoV genome.

In this study, seven recombination detection methods and phylogenetic analyses were performed on SARS-CoV and the six coronaviruses identified by BLAST (IBV, BCoV, HCoV, MHV, PEDV and TGEV).
These techniques have successfully identified recombination events in bacteria and viruses [2, 3, 6, 21, 26, 39]. Our analyses concur in suggesting the occurrence of recombination events between ancestors of SARS-CoV and these 6 coronaviruses. Indeed, pairwise alignment showed that many segments of high homology with IBV, BCoV, HCoV, MHV, PEDV and TGEV do exist in the SARS-CoV genome. Table 7 lists the segments with length >20 nt and identity >80%, and Fig. 5 shows the mosaic structure of the region 14930-15908 in the SARS-CoV genome based on the segments with length >50 nt and identity >80%. Of course, the other coronaviruses used in the analysis are also mosaic structures, for more sequence similarities exist among them than with SARS-CoV.

It should be noted that all the sequence comparisons in this study are based on nucleotide sequences. By contrast, the protein sequences in SARS-CoV are largely different from those in the three known groups of coronaviruses [17]. For the S protein, for example, the identity is 25.9% for SARS-CoV and BCoV, 21.7% for SARS-CoV and HCoV, 21.5% for SARS-CoV and IBV, 25.6% for SARS-CoV and MHV, 20.6% for SARS-CoV and PEDV, and 19.4% for SARS-CoV and TGEV. Although SARS-CoV is close to BCoV, MHV, TGEV and IBV, the corresponding protein, replicase 1a, is still different, with an identity of 27.4% for SARS-CoV and BCoV, 24.8% for SARS-CoV and IBV, 32.2% for SARS-CoV and MHV, and 25.0% for SARS-CoV and TGEV.

Naturally, we should take into account the role of convergent evolution, which would leave its mark on the viral genome. The recombination events that we observed in SARS-CoV involve six different viruses, suggesting sequential horizontal transfers and progressive adaptation to new host cells or animals.
Indeed, because viruses must both bind receptors to enter host cells and resist the immune response of the host, their outer layer proteins are subject to extremely strong selection pressure, which may considerably restrict the possible variations of the corresponding proteins (and accordingly of the corresponding genome sequences). It is nevertheless remarkable that, despite the inclusion of all possible types of viruses in our sample set (as well as shuffled genomes from the viruses we have identified as relevant), we find more or less a single category of viruses to be similar to SARS-CoV. This suggests that even if the contribution of convergent evolution is important, it operated on a more or less common phylogenetic background, suggesting several steps of recombination followed by fine adaptation. In this context, we would like to suggest that ancestors of PEDV, MHV or both are the most plausible origin of SARS-CoV. Guan et al. [7]

Based on phylogenetic techniques and BOOTSCAN recombination analysis, Stavrinides and Guttman [32] indicated that the replicase of SARS-CoV has a mammalian-like origin, the M and N proteins have an avian-like origin, and the S protein has a mammalian-avian mosaic origin. In the present study, by contrast, we used phylogenetic analysis and 7 recombination detection methods to conduct a genome-wide recombination analysis. These include MAXCHI and GENECONV, which were among the most powerful of the 14 methods evaluated in [23, 24] (SIMPLOT (BOOTSCAN), GENECONV, HOMOPLASY TEST, PIST, MAXCHI, CHIMAERA, PHYPRO, PLATO, RDP, RECPARS, RETICULATE, RUNS TEST, SNEATH TEST, TRIPLE). We identified seven putative recombination regions, which encompass, in terms of proteins involved, replicase 1A, replicase 1B and the spike glycoprotein.
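The rule of retaining only regions supported by at least 3 of the 7 detection methods can be illustrated with a short sketch. The function name `consensus_regions`, the method labels and the coordinates below are hypothetical and chosen only for illustration; they are not the values reported in Tables 2 to 6, and the paper does not state how the overlap between methods was actually computed.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # inclusive alignment positions (start, end)


def consensus_regions(calls: Dict[str, List[Interval]],
                      min_methods: int = 3) -> List[Interval]:
    """Return maximal intervals covered by at least `min_methods` distinct methods."""
    events = []  # (position, +1) opens an interval, (position, -1) closes it
    for intervals in calls.values():
        # Merge overlapping or adjacent intervals of one method so that the
        # method is counted at most once at any position.
        merged: List[Interval] = []
        for start, end in sorted(intervals):
            if merged and start <= merged[-1][1] + 1:
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        for start, end in merged:
            events.append((start, 1))
            events.append((end + 1, -1))  # coverage stops after `end`

    events.sort()
    regions: List[Interval] = []
    depth = 0            # number of methods covering the current position
    region_start = None  # start of the currently open consensus region
    i = 0
    while i < len(events):
        pos = events[i][0]
        # Apply every event at this position before testing the depth.
        while i < len(events) and events[i][0] == pos:
            depth += events[i][1]
            i += 1
        if depth >= min_methods and region_start is None:
            region_start = pos
        elif depth < min_methods and region_start is not None:
            regions.append((region_start, pos - 1))
            region_start = None
    return regions


# Hypothetical calls from four methods (illustrative coordinates only).
example_calls = {
    "method_a": [(100, 200)],
    "method_b": [(150, 250)],
    "method_c": [(180, 300)],
    "method_d": [(400, 500)],
}
print(consensus_regions(example_calls))      # [(180, 200)]
print(consensus_regions(example_calls, 2))   # [(150, 250)]
```

A sweep over interval endpoints is used here because it handles any number of methods and intervals without scanning every alignment position; requiring agreement among several methods follows the caution, cited from [23], against relying on a single method.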
Stavrinides and Guttman [32] primarily inferred the occurrence of recombination qualitatively, but did not identify the precise recombination region in the proteins involved. The S protein is an exception: they identified a recombination region located between nucleotides 2472 and 2694 of the S protein, i.e. between nucleotides 23963 and 24185 of the SARS-CoV genome, which is essentially covered by the last recombination region for the S protein (Table 6). Most importantly, each of our recombination regions is identified by at least 3 methods, because one should not rely too much on a single method, as suggested in [23]. In general, we believe the two studies lead to the same overall conclusion: the evolution of SARS-CoV has involved recombination.

The recombination event in the replicase is related to the fact that the RNA polymerase of coronaviruses utilizes a discontinuous transcription mechanism to synthesize mRNAs. The viral polymerase must jump between different RNA templates regularly during positive- or negative-strand RNA synthesis and, depending on the rejoining sites, the resultant RNA recombination will be either homologous or nonhomologous. This is the copy-choice model of recombination in RNA viruses [13, 27, 31, 34]. The recombination event in the S protein is certainly important, since it allows the virus to alter surface antigenicity and escape immune surveillance in the animals, thus adapting to a human host.

The existence of SARS-CoV-like viruses (99.8% homology to human SARS-CoV) in several wild animals in a live animal market in Guangdong [7] indicated that interspecies transmission among the human and animal SARS-CoV-like viruses had occurred. Mutation analysis of the sequence variations among these isolates will help identify the genetic signature of SARS virus strains when a sufficient amount of sequence data is available.
The very fact that several species of animals are affected does not allow one to trace the origin of the virus directly to one of these species as its endemic host. Rather, it might indicate that animals and humans were contaminated by a virus from a common origin, presumably located in animal food present in local markets in the Guangdong province. Investigating a wide variety of animal coronaviruses, especially in relation to rodents, birds, snakes and farm animals, would be interesting with regard to the origin of the SARS-CoV that caused disease in humans.

Finally, a challenging question arises. What is the molecular basis of recombination in SARS-CoV? Several requirements must be met for recombination to occur: (1) Two coronaviruses must be able to infect a host simultaneously and continue to replicate without interfering with each other; (2) Sufficient nucleotide identity between these genomes is essential for genome-switching to occur during RNA replication; (3) The proteins arising from recombination must be functional; (4) The recombinant virus must have some selective advantage for its survival. That is, the recombination that creates a successful "new" coronavirus is probably a rare event. So, we must stress that the potential recombination events in SARS-CoV identified in the present study are most likely "old" events, which may have occurred thousands of years ago. Although recent findings indicated that SARS-CoV did exist in a number of wild animals [7], we have not yet determined where these SARS-CoV-like virus strains come from.
Coronavirus main proteinase (3CLpro) structure: basis for design of anti-SARS drugs
Testing the hypothesis of a recombinant origin of human immunodeficiency virus type 1 subtype E
Full-length sequence and mosaic structure of a human immunodeficiency virus type 1 isolate from Thailand
Evolution of avian coronavirus IBV: sequence of the matrix glycoprotein gene and intergenic region of several serotypes
Infectious bronchitis virus: evidence for recombination within the Massachusetts serotype
The heterosexual human immunodeficiency virus type 1 epidemic in Thailand is caused by an intersubtype (A/E) recombinant of African origin
Isolation and characterization of viruses related to the SARS coronavirus from animals in southern China
Detecting recombination with MCMC
Probabilistic divergence measures for detecting interspecies recombination
A novel variant of avian infectious bronchitis virus resulting from recombination among three different strains
Experimental evidence of recombination in coronavirus infectious bronchitis virus
A novel coronavirus associated with severe acute respiratory syndrome
RNA recombination in animal and plant viruses
Recombination in large RNA viruses: coronaviruses
Transmission dynamics and control of severe acute respiratory syndrome
High-frequency RNA recombination of murine coronaviruses
The genome sequence of the SARS-associated coronavirus
RDP: detection of recombination amongst aligned sequences
Analyzing the mosaic structure of genes
A graphical method for detecting recombination in phylogenetic data sets
Recombination in the ompA gene but not the omcB gene of Chlamydia contributes to serovar-specific differences in tissue tropism, immune surveillance, and persistence of the organism
A double epidemic model for the SARS propagation
Evaluation of methods for detecting recombination from DNA sequences: Empirical data
Evaluation of methods for detecting recombination from DNA sequences: Computer simulations
Characterization of a novel coronavirus associated with severe acute respiratory syndrome
Identification of breakpoints in intergenotypic recombinants of HIV type 1 by Bootscanning
A new model for coronavirus transcription
Statistical tests for detecting gene conversion
Unique and conserved features of genome and proteome of SARS-coronavirus, an early split-off from the coronavirus group 2 lineage
Comparison of the genome organization of toro- and coronaviruses: evidence for two nonhomologous RNA recombination events during Berne virus evolution
Transcription strategy of coronaviruses: fusion of non-contiguous sequences during mRNA synthesis
Mosaic evolution of the severe acute respiratory syndrome coronavirus
CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice
Regulation of transcription of coronaviruses
Recombination in SARS-CoV Flood of sequence data yields clues but few answers
Evidence of natural recombination within the S1 gene of infectious bronchitis virus
Evolutionary implications of genetic variations in the S1 gene of infectious bronchitis virus
Experimental confirmation of recombination upstream of the S1 hypervariable region of infectious bronchitis virus
Widespread intra-serotype recombination in natural populations of dengue virus

We wish to thank the Hong Kong Innovation and Technology Fund for supporting the present research.
Beginning in SARS   Ending in SARS   Length   Identity   Match percent (%)   Source
10063               10109            47       41         88                  MHV
10609               10641            33       30         91                  TGEV
12821               12854            34       31         92                  HCoV
13844               13879            36       32         89                  BCoV
13845               13879            35                                      IBV
14808               14835            28       26         93                  HCoV
14913               14947            35       31         89                  HCoV
14933               15070            138      112        82                  BCoV
14982               15091            110      89         81                  IBV
14986               15055            70       64         92                  MHV
15062               15093            32       29         91                  HCoV
15123               15173            51       43         85                  TGEV
15210               15232            23       22         96                  PEDV
15210               15238            29       27         94                  BCoV
15210               15253            44       40         91                  IBV
15417               15482            66       57         87                  BCoV
15417               15457            41       37         91                  IBV
15420               15479            63       55         88                  MHV
15611               15682            72       64         89                  PEDV
15624               15670            47       42         90                  HCoV
15633               15672            40       35         88                  TGEV
15729               15770            42       40         96                  MHV
15765               15817            53       46         87                  HCoV
15852               15908            57       49         86                  MHV
17088               17125            38       35         93                  IBV
17688               17714            27       25         93                  TGEV
17757               17800            44       39         89                  PEDV
17783               17809            27       25         93                  HCoV
18558               18577            20       20         100                 PEDV
18771               18847            77       65         85                  TGEV
18784               18833            50       44         88                  HCoV
19102               19132            31       29         94                  IBV
19113               19132            20       20         100                 HCoV
19146               19252            107      87         82                  MHV
19201               19252            52       45         87                  IBV
19206               19253            48       44         92                  BCoV
19396               19420            25       24         96                  MHV
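The relationship between the columns of the table above can be checked with a small sketch. It assumes that the length is the inclusive span from the first to the last SARS-CoV position and that the match percent is the identity count divided by the length, rounded up to the next integer. The rounding-up rule is an inference from the tabulated values, not something stated in the text, and the function names are hypothetical. Only four rows, copied from the table, are used.

```python
# (start, end, length, identical, reported_percent, source), copied from the table
ROWS = [
    (10063, 10109, 47, 41, 88, "MHV"),
    (10609, 10641, 33, 30, 91, "TGEV"),
    (12821, 12854, 34, 31, 92, "HCoV"),
    (18558, 18577, 20, 20, 100, "PEDV"),
]


def span_length(start: int, end: int) -> int:
    """Number of positions in the inclusive range start..end."""
    return end - start + 1


def match_percent(identical: int, length: int) -> int:
    """Percent identity rounded up, using integer arithmetic (assumed rule)."""
    return (100 * identical + length - 1) // length


for start, end, length, identical, reported, source in ROWS:
    computed = match_percent(identical, length)
    consistent = span_length(start, end) == length and computed == reported
    print(f"{source:5s} {start}-{end}: {identical}/{length} -> {computed}% "
          f"(reported {reported}%, consistent: {consistent})")
```

Integer arithmetic is used for the rounding so that exact ratios such as 20/20 are not affected by floating-point error.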