key: cord-0686062-qlpb4y2y authors: Lohrasbi-Nejad, Azadeh title: Detection of homologous recombination events in SARS-CoV-2 date: 2022-01-17 journal: Biotechnol Lett DOI: 10.1007/s10529-021-03218-7 sha: 6076e95a05ba7a96490655ce3193b0910c160e36 doc_id: 686062 cord_uid: qlpb4y2y PURPOSE: COVID-19, a disease with acute respiratory symptoms, emerged in 2019. Its causal agent, the SARS-CoV-2 virus, is classified in the genus Betacoronavirus. Coronaviruses (CoVs) are a large family of viruses, so homologous recombination studies can help clarify the phylogenetic relationships among them. METHODS: To detect possible recombination events in SARS-CoV-2, the genome sequences of Betacoronaviruses were obtained from GenBank. The nucleotide sequences with ≥ 60% identity to the SARS-CoV-2 genome sequence were selected and then analyzed using different algorithms. RESULTS: The results showed two recombination events, at the beginning and the end of the SARS-CoV-2 genome sequence. Bat-SL-CoVZC21 (GenBank accession number MG772934) was identified as the minor parent for both events, with p-values of 8.66 × 10^-87 and 3.29 × 10^-48, respectively. Furthermore, two recombination regions were detected at the beginning and the middle of the SARS-CoV-2 spike gene. Pangolin-CoV (PCoV_GX-P4L, GenBank accession number MT040333) and Rattus CoV (ChRCoV-HKU24, GenBank accession number KM349742) were identified as the potential parents. Analysis of the spike gene revealed more similarity and less nucleotide diversity between SARS-CoV-2 and pangolin-CoVs. CONCLUSION: Detection of the ancestors of SARS-CoV-2 among the coronaviruses can help identify and define the phylogenetic relationships of the family Coronaviridae. Furthermore, constructing a phylogenetic tree based on the recombination regions changed the inferred phylogenetic relationships of Betacoronaviruses.
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1007/s10529-021-03218-7.

Coronaviruses were discovered in the 1960s and classified as the family Coronaviridae (Woo et al. 2010). The family Coronaviridae includes two subfamilies: Orthocoronavirinae and Torovirinae. The subfamily Orthocoronavirinae consists of four genera: Alphacoronavirus, Betacoronavirus, Gammacoronavirus, and Deltacoronavirus (Woo et al. 2010). Alpha- and Betacoronaviruses infect mammals, for example bats (Woo et al. 2012), and Gammacoronaviruses are common among bird species (Lin et al. 2016; Zhou et al. 2018; Mardani et al. 2008). Avian infectious bronchitis virus (IBV), human coronavirus 229E (HCoV-229E), and human coronavirus OC43 (HCoV-OC43) were the first identified coronaviruses. They caused respiratory illnesses in chickens and colds in humans. Since the discovery of HCoV-229E and HCoV-OC43 (Van der Hoek 2007), several coronaviruses causing acute disease have been discovered, such as those responsible for severe acute respiratory syndrome (SARS) in 2002 and Middle East respiratory syndrome (MERS) in 2012. In December 2019, a report was published about patients with severe viral pneumonia in Wuhan, China. Sequencing of the virus obtained from these patients identified a new CoV as the causative agent of this respiratory disease. The novel coronavirus was subsequently named SARS-CoV-2, and the disease it causes was named COVID-19 by the World Health Organization (WHO). Unlike the other human CoVs, which cause mild respiratory symptoms, SARS-CoV, MERS-CoV, and SARS-CoV-2 are associated with severe respiratory illness (Drosten et al. 2003; Zaki et al. 2012). SARS-CoV-2 appeared in Wuhan, Hubei Province, China, causing fever, severe respiratory infection, and pneumonia (Chan et al. 2020; Huang et al. 2020a). SARS-CoV-2 is a new member of the genus Betacoronavirus and is closely related to bat coronaviruses (Wu et al. 2020).
SARS-CoV appeared in China's Guangdong Province in 2002, infecting 8,098 people and leaving 774 dead. In 2012, MERS-CoV appeared on the Arabian Peninsula, infecting a total of 2,494 people and killing 858 (Walls et al. 2020). SARS-CoV-2 was transmitted from human to human more rapidly than SARS-CoV and spread to several continents (Chan et al. 2020; Chen et al. 2020; Li et al. 2020). CoVs are surrounded by a lipid layer derived from the host cell membrane. They are positive-sense, single-stranded RNA viruses characterized by spike proteins on the surface of the virion (Barcena et al. 2009; Neuman et al. 2006). The CoV genome is the second-largest RNA genome among viruses, at 26 to 32 kb (Lai 1990). Structural proteins and several non-structural proteins with different functions are encoded by the 3' end of the viral genome (Masters 2006). The 5' two-thirds of the RNA strand encodes the non-structural proteins important in viral replication, including RNA-dependent RNA polymerase (RdRP) (Masters 2006). The proteins encoded by the viral RNA include spike proteins (S), membrane proteins (M), envelope proteins (E), and nucleocapsid proteins (N). Some Betacoronaviruses also encode hemagglutinin esterase (HE) (Fehr and Perlman 2015). The homotrimeric S protein carries many N-linked glycans required for proper folding of the protein (Rossen et al. 1998). The S protein consists of two functional subunits. Subunit S1 is responsible for binding to the host cell receptor, and subunit S2 is responsible for fusing the viral and cell membranes (Walls et al. 2016). Fusion of the viral envelope with the host cell membrane releases the viral genome into the cytoplasm (He et al. 2006). Previous studies have shown that bats harbor CoVs that are the ancestors of SARS-CoV. Himalayan palm civets in local Chinese markets were also found to carry SARS-like CoVs (Guan et al. 2003).
Therefore, these animals were proposed as mediators of virus transmission between bats and humans (Lau et al. 2005). At the beginning of the SARS-CoV-2 outbreak, researchers hypothesized that the virus originated at the Huanan Seafood Market in Wuhan, China, where one or more of the animals traded may have been the direct zoonotic source (Lam et al. 2020; Wu et al. 2020; Zhou et al. 2020; Zhu et al. 2020). Several reports, however, claimed that the initial occurrence of infection was unrelated to the Huanan Seafood Market (Huang et al. 2020b). As a result, efforts to track down the source of SARS-CoV-2 should not be confined to animals sold in markets but should include a broad spectrum of wild species not sold in markets (Huang et al. 2020b). As in the case of SARS-CoV and MERS-CoV, the bat is considered a likely species of origin for SARS-CoV-2. In 2020, a study of the SARS-CoV-2 genome using Simplot (similarity plotting) software determined that SARS-CoV-2 was remarkably similar to a bat coronavirus (BatCoV-RaTG13) throughout the genome, with 96.2% genome sequence identity. However, that study found no evidence of recombination events in the SARS-CoV-2 genome. Several other articles published in 2020 point to the close link between the SARS-CoV-2 genome sequence and BatCoV-RaTG13, which was isolated from Rhinolophus affinis (Xiao et al. 2020). Furthermore, one study reported that SARS-CoV-2 was more closely related to two bat-derived coronavirus strains, Bat-SL-CoVZC45 and Bat-SL-CoVZC21. A recombination event in the 1b region of the SARS-CoV-2 genome was detected with Simplot software, and it was suggested that SARS-CoV-2 may have originated in bats.
Since human-to-human transmission during the SARS-CoV-2 outbreak is attributed to the ability of the S protein (especially its receptor-binding domain, RBD) to bind human ACE2, the possibility that the coronavirus was transmitted through one of these animals has been raised. Before infecting humans, SARS-CoV and MERS-CoV typically infected intermediate hosts (Cui et al. 2019). SARS-CoV-2 was therefore most likely spread to humans by other animals. Identifying and isolating the intermediate host of SARS-CoV-2 is critical to preventing interspecies transmission. On March 24, 2019, the Guangdong Wildlife Rescue Center received 21 live Malayan pangolins (Manis javanica) from the Anti-Smuggling Customs Bureau. Most of the animals were in poor health, and despite rescue efforts, 16 of them eventually died (Liu et al. 2019). The majority of the dead pangolins exhibited an enlarged lung filled with a frothy liquid, as well as symptoms of pulmonary fibrosis. A viral metagenomic analysis of lung samples confirmed the presence of a SARS-like CoV in two out of the 11 cases of dead Malayan pangolins (Liu et al. 2019). In another study, during March-August 2019, lung tissues from four Chinese pangolins (Manis pentadactyla) and 25 Malayan pangolins (Manis javanica) were taken from a Wildlife Rescue Center. To identify SARS-related coronaviruses, the authors used RT-PCR with primers targeting a region of Betacoronaviruses. They detected pangolin-CoV in 17 of the Malayan pangolins, while all samples from Chinese pangolins were negative (Xiao et al. 2020). All publicly available metagenome samples of pangolin-CoV were collected and investigated to learn more about the animal hosts of SARS-CoV-2. Researchers assembled a draft genome of the SARS-CoV-like coronavirus, which showed 73% coverage and 91% sequence identity to SARS-CoV-2. It was therefore suggested that pangolins probably play a role in the evolution of SARS-CoV-2 and its transmission from bats to humans.
In another article, researchers re-evaluated previously published data (Liu et al. 2019) on SARS-CoV-like coronaviruses identified in pangolin lung samples to assess the genomic and evolutionary evidence for the pangolin-CoV. Their results showed that pangolin-CoV was 91.02% and 90.55% identical to SARS-CoV-2 and BatCoV-RaTG13, respectively, at the whole-genome level. It was therefore concluded that pangolin species might serve as a natural reservoir for SARS-CoV-2. In this paper, the SARS-CoV-2 genome was investigated with several algorithms to detect recombination events and find its ancestral sequences. Then, the spike gene was examined to detect nucleotide sequences with high similarity to SARS-CoV-2.

Analysis of the SARS-CoV-2 genome sequence to detect recombinant regions

The SARS-CoV-2 genome sequence was obtained from GenBank (GenBank accession number MT049951). Nucleotide sequences belonging to the genus Betacoronavirus were selected for comparison with SARS-CoV-2. Genome sequences of coronaviruses in Apodemus, Bats, Bos, Camelus, Canis, Equus, Felis, Mustela, Giraffa, Erinaceus, Neovison, Murines, Pangolins, Sus, Rabbits, Rattus, Rusa, Shrew, Hydropotes, Kobus, Odocoileus, and Hippotragus were obtained from the GenBank database of the National Center for Biotechnology Information (NCBI). The number of nucleotide sequences used from each group is given in Supplementary Table 1. The whole-genome sequence of SARS-CoV-2 was aligned with the other sequences using MEGA v6.12 software and Clustal Muscle v2.1 server, and the percent identity between the SARS-CoV-2 nucleotide sequence and each of the other sequences was determined. The sequences with over 60% identity were used for the rest of the study. RDP v4.5 software was used to detect recombination events (Martin et al. 2015), and the occurrence of all recombination events was examined using the RDP, GENECONV, BootScan, MaxChi, Chimaera, SiScan, and 3Seq programs.
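The identity filter described above can be sketched in a few lines of Python. This is only an illustration, not the procedure implemented in MEGA: the function names, the choice to skip alignment columns containing a gap, and the toy sequences are all assumptions made for the example. The threshold follows the "≥ 60%" wording of the abstract.

```python
from typing import Dict


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over alignment columns where neither sequence has a gap.

    Assumption: gapped columns are ignored; other tools may count them differently.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" or y == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def select_by_identity(
    reference: str, candidates: Dict[str, str], threshold: float = 60.0
) -> Dict[str, float]:
    """Keep candidates whose identity to the reference is at least `threshold` percent."""
    selected = {}
    for name, seq in candidates.items():
        identity = percent_identity(reference, seq)
        if identity >= threshold:
            selected[name] = identity
    return selected


# Toy aligned sequences (hypothetical, not real genomes).
reference = "ACGTACGTAC"
candidates = {
    "close": "ACGTACGTTC",
    "far": "TGCATGCAAC",
    "gappy": "ACG--CGTAC",
}
kept = select_by_identity(reference, candidates)
```

In this toy input, only the sequences named "close" and "gappy" pass the filter; a real analysis would read the aligned genomes from the alignment file instead.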
A detected recombination region was considered valid at a p-value ≤ 0.05. The beginning and end breakpoints of each recombination event were determined. Phylogenetic trees were constructed with the neighbor-joining (NJ) method to examine the possible relations between different viruses. Moreover, the relations between the parents identified in each recombination event were investigated. Once the recombination regions were detected and specified in the SARS-CoV-2 genome, the coding fragments of the genome were studied independently.

Investigation of the nucleotide sequences of the spike gene

Because two different genetic groups were identified as parents in the SARS-CoV-2 spike gene, the nucleotide sequences of this gene were examined and compared. The sequences were first aligned among the members of each group using the Align Sequences tool on the server of the Virus Pathogen Database and Analysis Resource (Pickett et al. 2012). The Synonymous/Non-synonymous Analysis Program (SNAP v2.1.1) was used to determine the selective pressure that might have acted on the spike gene in the two groups (group 1: SARS-CoV-2 and pangolin-CoVs; group 2: SARS-CoV-2 and Rattus CoVs). SNAP calculates the non-synonymous (dN) and synonymous (dS) substitution rates from a set of codon-aligned nucleotide sequences (Nei and Gojobori 1986). The rate of nucleotide substitution and the pairwise distances between the spike gene sequences within each group were estimated using MEGA v6.0 software. Nucleotide and haplotype diversity of the spike gene, Tajima's D-test, and Fu and Li's F* test statistic were examined using DnaSP v6.12 software (Rozas et al. 2017).
To identify protein sequence polymorphisms, the nucleotide sequences of the spike gene belonging to the coronaviruses of groups 1 and 2 (SARS-CoV-2 and pangolin-CoVs; SARS-CoV-2 and Rattus CoVs) were translated into amino acid sequences with the translate tool at the Expasy website. Multiple sequence alignments were made for all sequences using the Virus Pathogen Database and Analysis Resource, which uses the MUSCLE (Multiple Sequence Comparison by Log-Expectation) algorithm as a preprocessor to enhance the quality and speed of sequence alignment. The output obtained from the server was used to find sequence variations (Pickett et al. 2012). The Metadata-driven Comparative Analysis Tool (meta-CATS) was also used to confirm the substitutions between residues in the protein sequences (Pickett et al. 2012). The meta-CATS tool uses a chi-square test to report amino acid changes with a p-value ≤ 0.05. Potential motifs and their distribution in the spike protein sequences were investigated using the MEME Suite 5.4.1 server (Bailey et al. 2006) for all coronaviruses in groups 1 and 2.

Percentages of identity between the nucleotide sequences of SARS-CoV-2 and other coronaviruses were calculated from the whole-genome alignment (these values are presented in Supplementary Table 1). Based on the results, SARS-CoV-2 had the highest percent identity (96%) to BatCoV-RaTG13 (GenBank accession number MN996532) and the lowest (51%) to Hydropotes CoV (GenBank accession number MG518518). Nucleotide sequences with over 60% identity to the SARS-CoV-2 genome sequence were observed in coronaviruses from shrew, Rusa, Rattus, rabbits, Erinaceus, Bos, pangolin, Apodemus, and bats. These nucleotide sequences were examined using RDP software to detect putative recombination events. Recombination events were observed when comparing the SARS-CoV-2 genome sequence to bat, Rattus, and pangolin coronaviruses.
The details of these coronaviruses are given in Supplementary Table 2. The genome sequences of SARS-CoV-2 and bat CoVs were analyzed using RDP software, and the results are shown in Fig. 1. The whole coronavirus genome is illustrated in two panels (Fig. 1b, c) to give a higher resolution. According to the results, two recombination events were detected, in the beginning and ending segments of the SARS-CoV-2 genome. The first event was observed at positions 1-12,394 of the aligned nucleotide sequence of SARS-CoV-2 with a p-value = 8.66 × 10^-87. As shown in Fig. 1a, this nucleotide region (left hatched area) corresponds to the initial segment of the orf1ab gene. The results showed that this segment of the SARS-CoV-2 genome might have arisen through genetic recombination of bat coronaviruses. Bat-SL-CoVZC21 (GenBank accession number MG772934) and BtRs-BetaCoV-HuB (GenBank accession number KJ473814) were identified as the minor parent (identity 91.6%) and the major parent, respectively. This recombination event was further evaluated using other methods, and the results are shown in Table 1. Comparison of the beginning and ending breakpoints using the 3Seq and LARD methods demonstrated that the same positions (1-12,394) were recombined, with p-values of 1.34 × 10^-96 and 1.72 × 10^-145. The position of the second recombination event, at the end of the SARS-CoV-2 genome, is shown in Fig. 1c. Positions 27,088-32,336 of the whole-genome alignment of SARS-CoV-2 were identified as a recombination event by the RDP method. In this case, the p-value was calculated to be 3.29 × 10^-48. This nucleotide region was also detected by the MaxChi and LARD methods, with almost the same beginning and ending breakpoints and p-values of 8.41 × 10^-30 and 1.20 × 10^-79 (Table 2). The minor and major parents for the event were bat-SL-CoVZC21 (GenBank accession number MG772934) and BtRs-BetaCoV-HuB (GenBank accession number KJ473814), respectively. As illustrated in Fig.
1a , the second recombination region encompassed the ending segments of the spike gene, E gene, M gene, and the beginning Fig. 2a . The sequences were categorized into four major clusters (G1-G4) in this mode. The average between the minimum and maximum percent identity of sequences of each cluster was determined. This value for the G1, G2, and G3 cluster members was 99.81%, 98.54%, and 95.82%. SARS-CoV-2 genome sequence was placed in the G4 cluster. The average sequence identity for the members of this cluster was calculated to be 93.11%. Since recombination can affect phylogenetic reconstruction (Sabella et al. 2018) , the phylogenetic tree was reconstructed based on the nucleotide sequence of the recombination region, and the relations between the identified parents were investigated for every recombination event after detecting those events in the coronavirus genome sequence. The phylogenetic tree constructed for the first recombinant region (at the beginning of the genome sequence) is shown in Fig. 2b . Categorization was performed based on the average between the minimum and maximum percentage of the sequence identity shared among the members. As it can be seen, the SARS-CoV-2 nucleotide sequence was placed in the G2 cluster. MG772934 coronavirus identified as a minor parent was also grouped in this cluster. The average sequence identity for the members of the cluster was calculated to be 93.74%. KJ473814 (from the species Rhinolophus sinicus ), serving as a major parent, was in the G4 cluster. MG772934 coronavirus was first found in Rhinolophus sinicus in 2018 (Hu et al. 2018 ). The phylogenetic tree was constructed based on the recombination region at the end of the SARS-CoV-2 genome sequence, and the resulting phylogram was shown in Fig. 2c . According to the results, SARS-CoV-2 and the minor parent (MG772934) were grouped in the G3 cluster. The average sequence identity for the members of the cluster was calculated to be 93.83%. 
The major parent of this event was grouped in the G4 cluster, and the nucleotide sequence identity for its members was determined to be 91.33%. The results confirmed that MG772934 was the potential parent in both the first and second recombination events. After the SARS-CoV outbreak during 2002-2004, researchers sought the potential source of the virus among other animals. For this purpose, species of Rhinolophus were studied more than other organisms. Forty-seven coronaviruses associated with these species had been identified by 2018 (Luk et al. 2019). Extensive research was conducted on this subject following the pandemic outbreak of SARS-CoV-2 in 2019. Comparing the SARS-CoV-2 genome with other coronaviruses revealed some similarities between human and Rhinolophus CoVs (Wassenaar and Zou 2020). Wassenaar and Zou analyzed 253 nucleotides upstream of the start codons of coronaviruses to find the source of the virus. They reported a close genetic relationship between the Sarbecovirus species that contains SARS-CoV-2 and coronaviruses from species of Rhinolophus (Wassenaar and Zou 2020). Their results indicated a close relationship between the nucleotide sequences of SARS-CoV-2 and MG772933, so MG772933 might be the source of SARS-CoV-2 (Wassenaar and Zou 2020). Our results agreed with theirs and indicated that MG772934, with 97.47% sequence identity to MG772933, might be regarded as a parent of SARS-CoV-2. Furthermore, our results showed that the recombination events at the beginning and the end of the SARS-CoV-2 genome were observed at almost the same positions in BatCoV-RaTG13 (GenBank accession number MN996532). The analysis of the whole-genome sequence showed 96% identity between BatCoV-RaTG13 and SARS-CoV-2. BatCoV-RaTG13 was isolated from Rhinolophus affinis in China's Yunnan Province.
Considering the geographical distance between the location where SARS-CoV-2 emerged and the habitat of the BatCoV-RaTG13 host, it could be hypothesized that another animal acted as a mediator between bats and humans. Such an epidemiological hypothesis can explain why coronaviruses isolated from different species or obtained from different geographical regions appear together in a phylogenetic tree (Sabella et al. 2018). The SARS-CoV-2, pangolin-CoV, and Rattus CoV genome sequences were analyzed using RDP software, and recombination regions were detected in the genome of SARS-CoV-2. These events were unique to SARS-CoV-2 and were not observed in BatCoV-RaTG13. The result of the genome sequence analysis of SARS-CoV-2 and pangolin-CoVs is shown in Fig. 3a. The recombination event in the spike gene encompassed positions 367-916 in the aligned nucleotide sequence of SARS-CoV-2, with a p-value of 7.74 × 10^-18. Recombination in this region of the genome was also analyzed using other methods. The results given in Table 3 showed that this region of the SARS-CoV-2 genome was also identified as a recombination site by the BootScan and SiScan methods. In this case, the minor and major parents were the coronaviruses with GenBank accession numbers MT040333 (81.1% identity) and MT121216 (91.9% identity), respectively. Both coronaviruses were isolated from Manis javanica (Lam et al. 2020). The results obtained by comparing the nucleotide sequence of SARS-CoV-2 with Rattus CoVs are shown in Fig. 4a. The best-supported recombination event (p-value = 5.79 × 10^-5) was detected at positions 3863-4254 of the aligned SARS-CoV-2 sequence. The minor and major parents identified for this event were the coronaviruses with GenBank accession numbers KM349742 (host: Rattus norvegicus; Lau et al. 2015) and KF294371 (host: Rattus losea; Wang et al. 2015), respectively.
Since this region was also identified as a recombination site by other methods (Table 4), the occurrence of this recombination can be confirmed. Because the recombination events were observed only in SARS-CoV-2 and not in BatCoV-RaTG13, it is likely that Manis javanica, Rattus norvegicus, and Rattus losea, which host these coronaviruses, could act as reservoirs of SARS-CoV-2 and as its transmission route from bats to humans. This hypothesis is not yet proven and needs more investigation to be fully confirmed. The phylogenetic trees were constructed by the neighbor-joining (NJ) method with bootstrap support (1000 replicates) (Figs. 3c, 4c).

Figure caption (fragment): ... CoV-2 genome is shown in the plot above. B Phylogenetic tree ignoring recombination events, and C phylogenetic tree based on the recombination event; (square) putative recombinant, (circle) potential minor parent, (triangle) potential major parent.

According to a previous study, the virus may have been transmitted to humans after infecting another mammal (Wassenaar and Zou 2020). As a result, any animals that may have close contact with humans should be investigated. Studies on this subject have proposed pangolins as candidates for transmitting the virus to humans (Liu et al. 2019; Wacharapluesadee et al. 2020; Xiao et al. 2020; Zhang et al. 2020a, 2020b). Previous studies on the primary source of MERS-CoV showed that bats were the primary host of the virus, whereas dromedary camels were a reservoir for the virus and transmitted it to humans (Haagmans et al. 2014; Memish et al. 2013). SARS-CoV and SARS-CoV-2 share similarities in their genetic sequences and originated from bats (Ge et al. 2013; Yang et al. 2016; Zhou et al. 2020). For SARS-CoV, palm civets are believed to be the reservoir and transmitter of the virus (Wang et al. 2005), but the intermediate host of SARS-CoV-2 is still unknown.
Analysis of nucleotide sequence of the spike gene

The rates of nucleotide changes causing synonymous and non-synonymous amino acid changes are presented in Table 5. In both groups, the nucleotide sequence of SARS-CoV-2 was used as the reference sequence for comparison with the other coronaviruses. First, the numbers of nucleotide changes resulting in non-synonymous (dN) and synonymous (dS) substitutions were calculated. Then, the dN/dS ratio was computed. This ratio is a practical and efficient measure for recognizing the pattern of natural selection acting on genes during their evolution (Nei and Kumar 2000). A value of dN/dS > 1 represents positive selection, dN/dS < 1 means purifying selection, and dN/dS = 1 suggests neutral selection (Li 1997). The average value of dN/dS in group 1 (SARS-CoV-2 and pangolin-CoVs) was calculated to be 1.02 ± 0.05, indicating a neutral selection pattern during the evolution of the spike gene. For the second group (SARS-CoV-2 and Rattus CoVs), the dN/dS value was 0.58 ± 0.01. Hence, a purifying selection pattern was inferred for the spike gene in group 2. Measuring the pairwise distance for the spike gene among the members of each group (Table 5) showed a shorter distance between the members of group 1 than between those of group 2. The minimum distance was 0.37 (between MT049951 and MT121216) for group 1 and 0.72 (between MT049951 and JF792616) for group 2. This parameter represents the rate of nucleotide substitutions between the sequences under study; its minimum and maximum values are 0 and 1. The results from examining the nucleotide substitutions in the spike gene in both groups are shown in Table 6. These factors (transition and transversion substitutions) are considered indicators of molecular diversity (Tamura et al. 2004).
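The interpretation rules above can be illustrated with a short Python sketch. It is not the SNAP or MEGA implementation. The decision rule that treats a dN/dS value as neutral when 1 lies within the reported ± range is an assumption (the text does not state how the call was made), and the distance shown is the simple proportion of differing sites (p-distance), which may differ from the distance model actually used.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def selection_regime(dn_ds: float, uncertainty: float = 0.0) -> str:
    """Classify a dN/dS ratio as neutral, positive, or purifying.

    Assumption: the ratio is called neutral when 1.0 lies within
    dn_ds +/- uncertainty; otherwise the sign of (dn_ds - 1) decides.
    """
    if abs(dn_ds - 1.0) <= uncertainty:
        return "neutral"
    return "positive" if dn_ds > 1.0 else "purifying"


def substitution_summary(seq_a: str, seq_b: str) -> dict:
    """Count transitions and transversions between two aligned sequences
    and report the proportion of differing sites (p-distance)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    sites = transitions = transversions = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # skip gaps and ambiguous bases
        sites += 1
        if x == y:
            continue
        pair = {x, y}
        # A<->G and C<->T stay within one base class: transitions.
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    differences = transitions + transversions
    return {
        "sites": sites,
        "transitions": transitions,
        "transversions": transversions,
        "p_distance": differences / sites if sites else 0.0,
    }


# The two group averages reported in the text.
group1 = selection_regime(1.02, 0.05)
group2 = selection_regime(0.58, 0.01)
```

Under this rule the reported group 1 value is classified as neutral and the group 2 value as purifying, matching the interpretation given in the text.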
In group 1, the highest rate of transition substitution belongs to the pyrimidine bases; the rates of thymine-to-cytosine and cytosine-to-thymine exchange were calculated to be 14% and 23.38%, respectively. Accordingly, the highest rate of change was observed for the C → T exchange, attributed to cytosine methylation. This result is consistent with previous studies that reported the highest substitution rate in pyrimidine bases (Picoult-Newberg et al. 1999; Vignal et al. 2002; Zhang et al. 1994). Analysis of the transition substitutions in group 2 showed higher rates of C → T and G → A exchanges. The results showed that transversion substitutions had higher values in group 2 than in group 1. Generally, transversion substitutions have a greater effect on nucleotide changes in a gene than transition substitutions. Therefore, it seems that members of group 2 have higher nucleotide diversity in the spike gene. Nucleotide and haplotype diversity of the spike gene among the members of each group were measured using DnaSP software. In the present study, haplotype diversity was calculated to be 1 ± 0.076 and 1 ± 0.063 for group 1 and group 2, respectively. Haplotype diversity is a suitable marker of the degree of genetic diversity among populations. It can vary from zero (all individuals of a population have the same haplotype) to one (all individuals of a population have different haplotypes) (Aboim et al. 2005). Nucleotide diversity was determined to be 0.09 ± 0.03 and 0.28 ± 0.05 in group 1 and group 2, respectively. Figure 5 shows that nucleotide diversity is higher at the beginning of the gene (5' end) than at the end of the gene in both groups.

Table footnotes (fragment): The rate of nucleotide substitution that caused non-synonymous amino acid changes. * Pairwise distance refers to the amount of difference between nucleotide sequences. a Group I including SARS-CoV-2 (MT049951) and pangolin-CoVs.
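For readers unfamiliar with the two diversity measures, the following Python sketch computes them with the standard textbook estimators: nucleotide diversity as the mean number of pairwise differences per site, and haplotype diversity as n/(n-1) × (1 - Σ p_i²), where p_i is the frequency of haplotype i. It is a minimal illustration, not the DnaSP implementation; it assumes ungapped, equal-length aligned sequences, and the function names and toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Sequence


def _check(seqs: Sequence[str]) -> None:
    if len(seqs) < 2:
        raise ValueError("at least two sequences are needed")
    if any(len(s) != len(seqs[0]) for s in seqs):
        raise ValueError("aligned sequences must have equal length")


def nucleotide_diversity(seqs: Sequence[str]) -> float:
    """Mean number of pairwise differences per site (pi)."""
    _check(seqs)
    pairs = list(combinations(seqs, 2))
    total = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs)
    return total / len(pairs) / len(seqs[0])


def haplotype_diversity(seqs: Sequence[str]) -> float:
    """Haplotype diversity: n/(n-1) * (1 - sum of squared haplotype frequencies)."""
    _check(seqs)
    n = len(seqs)
    counts = Counter(seqs)  # each distinct sequence is one haplotype
    homozygosity = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - homozygosity)


# Toy alignment (hypothetical): three sequences, all different.
toy = ["ACGT", "ACGA", "TCGA"]
pi = nucleotide_diversity(toy)
hd = haplotype_diversity(toy)
```

When every sequence in a sample is a distinct haplotype, as in this toy alignment, the haplotype diversity equals 1, which is the situation reported for both groups in the text.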
High haplotype diversity and low nucleotide diversity were observed in both groups. In an expanding population, haplotype diversity and the number of polymorphism sites increase rapidly while nucleotide diversity is left behind. As time passes, nucleotide diversity increases when the population expansion ceases. Values of Tajima's D-test and Fu-Li's F* test were calculated to be negative numbers for both groups; -0.86 and -0.76 for group 1 and -0.55 and -0.52 for group 2. More negative values of Tajima's D-test and Fu-Li's F* test are present in group 1. Tajima's D is a statistic of population genetics. It is the normalized difference between two estimators, of which one is derived from the average number of pairwise differences and the other from the number of segregating sites (Tajima 1989a ). Tajima's D represents the expansion or contraction of population size, the strength of selection, and population structure. Usually, Tajima's D is used to examine whether the population follows three assumptions: (1) constant population size over time, (2) neutral evolution, (3) lack of population structure (for example, subdivision) (Kim et al. 2016 ). The sign of Tajima's D helps us to interpret natural selection (Biswas and Akey 2006) . Natural selection and population dynamics determine the sign of Tajima's D. A positive and negative value suggests decreasing and increasing population size, respectively (Innan and Stephan 2000; Sano and Tachida 2005; Tajima 1989b; Kim et al. 2016 ). Measurement of the parameter will be highly effective when analyzing pathogens that evolve rapidly, such as RNA viruses, which accumulate random mutations during their epidemic (Duffy et al. 2008) . However, Tajima's D is influenced by both population changes and selective pressure. It is not easy to quantify the effectiveness rate of both components on the Tajima's D values (Innan and Stephan 2000; Kim et al. 2016 ). 
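The definition of Tajima's D given above can be made concrete with a short Python sketch following the formula in Tajima (1989): D = (k - S/a1) / sqrt(e1·S + e2·S·(S - 1)), where k is the mean number of pairwise differences, S is the number of segregating sites, and a1, e1, e2 depend only on the sample size n. This is an illustrative implementation, not the DnaSP code; it assumes ungapped, equal-length aligned sequences, and the toy alignments are hypothetical.

```python
from itertools import combinations
from math import sqrt
from typing import Sequence


def tajimas_d(seqs: Sequence[str]) -> float:
    """Tajima's D for a set of aligned, ungapped sequences."""
    n = len(seqs)
    if n < 4:
        raise ValueError("at least four sequences are needed")
    if any(len(s) != len(seqs[0]) for s in seqs):
        raise ValueError("aligned sequences must have equal length")

    # S: number of alignment columns with more than one distinct base.
    seg_sites = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    if seg_sites == 0:
        raise ValueError("no segregating sites; D is undefined")

    # k: mean number of pairwise differences.
    pairs = list(combinations(seqs, 2))
    k = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs) / len(pairs)

    # Sample-size constants from Tajima (1989).
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (k - seg_sites / a1) / sqrt(variance)


# Toy alignments (hypothetical).
# Every variant is carried by a single sequence: excess of rare variants.
rare_variants = ["AAAAA", "TAAAA", "ATAAA", "AATAA", "AAATA"]
# Variants at intermediate frequency: two equally common haplotypes.
balanced = ["AA", "AA", "TT", "TT"]
d_rare = tajimas_d(rare_variants)
d_balanced = tajimas_d(balanced)
```

The first toy alignment yields a negative D and the second a positive D, in line with the sign interpretation described in the text (negative values with an excess of rare variants, as expected in an expanding population).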
The results obtained from examining the spike protein sequences to find potential motifs among members of group 1 are shown in Fig. 6. The motif regions in each sequence are illustrated with colored blocks. All identified motifs are shared among the members of group 1. The motif sequences identified in the spike protein of SARS-CoV-2 (MT049951) were the same as those in the spike protein of pangolin-CoVs except for motifs 3, 10, 6, and 12 (Supplementary Fig. 1). The analysis of the spike protein sequences in group 2 showed that motifs 16, 9, 19, 18, 22, and 15 were found only in the spike protein of Rattus CoVs, and no similar sequence was observed in the MT049951 protein sequence (Fig. 7). The motif 15 sequence (PKVTIDCAAF) was the same in all studied Rattus CoVs. The absence of this motif from the SARS-CoV-2 sequence makes it a suitable marker for identifying the spike protein of Rattus CoVs. Comparison of the results in Figs. 6 and 7 demonstrates the high similarity between the spike protein sequences of SARS-CoV-2 and pangolin-CoVs. Protein sequence polymorphisms of the spike protein were detected with the Analyze Sequence Variation tool. For further analysis, the meta-CATS tool was used to find polymorphisms. In this case, the chi-square test was applied to confirm the identified positions, and amino acid variations with a p-value ≤ 0.05 were reported. The results revealed that 35 amino acids in the spike protein of MT049951 differed from the amino acids present in the spike protein of the other members of group 1. These altered amino acids resulted from nucleotide changes in their codons. The altered amino acids were located outside the known motifs along the protein sequence.
Comparison of the spike protein sequence of MT049951 with those of the other members of group 2 showed 305 altered amino acids, five of which were located in motifs 13, 17, 20, 4, and 25 (Supplementary Fig. 2). Molecular studies have confirmed that protein adaptation is associated with more nucleotide changes in the genome that alter amino acids (Kryazhimskiy and Plotkin 2008). In general, the results obtained in this paper showed a close relationship between the spike proteins of SARS-CoV-2 and pangolin-CoVs.
The COVID-19 epidemic began in the city of Wuhan, China, in 2019. The outbreak of the disease and its global epidemic status prompted many efforts to understand the structure and function of the virus. SARS-CoV-2 is a member of the genus Betacoronavirus. Detection of the ancestors of SARS-CoV-2 in the coronavirus family helps to identify and define the phylogenetic relationships of the family Coronaviridae. In the present paper, the phylogenetic evidence indicated that SARS-CoV-2 could have developed from bat-SL-CoVZC21. Recombination analysis of the spike gene identified two recombination regions. PCoV_GX-P4L and ChRCoV-HKU24 were determined as the potential parents for the first and second events, respectively. In this study, it was presumed that coronaviruses from pangolins and Rattus could be the parents of SARS-CoV-2. However, this remains a hypothesis and needs further investigation to be proved.
Fig. 6 … (MT049951) and pangolin-CoVs. Motif sequences were acceptable with a p-value ≤ 1 × 10⁻⁵.
Fig. 7 The motif analysis in spike protein. The block diagram shows the sequences of the discovered motifs. Motifs 16, 9, 22, 18, and 19 were found in the spike protein of Rattus CoVs (blue square). Motif 15 was seen in all Rattus CoVs with exactly the same sequence (green square). Motif sequences were acceptable with a p-value ≤ 1 × 10⁻⁵.
Star markers belong to motifs that are not acceptable.
References
Genetic structure and history of population of the deep-sea fish Helicolenus dactylopterus (Delaroche 1809) inferred from mtDNA sequence analysis
MEME: discovering and analyzing DNA and protein sequence motifs
Cryo-electron tomography of mouse hepatitis virus: insights into the structure of the coronavirion
Genomic insights into positive selection
A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission: a study of a family cluster
Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan China: a descriptive study
Origin and evolution of pathogenic coronaviruses
Identification of a novel coronavirus in patients with severe acute respiratory syndrome
Rates of evolutionary change in viruses: patterns and determinants
Coronaviruses: an overview of their replication and pathogenesis
Isolation and characterization of a bat SARS-like coronavirus that uses the ACE2 receptor
Isolation and characterization of viruses related to the SARS coronavirus from animals in southern China
Middle East respiratory syndrome coronavirus in dromedary camels: an outbreak investigation
Identification and characterization of novel neutralizing epitopes in the receptor-binding domain of SARS-CoV spike protein: revealing the critical antigenic determinants in inactivated SARS-CoV
Genomic characterization and infectivity of a novel SARS-like coronavirus in Chinese bats
Clinical features of patients infected with 2019 novel coronavirus in Wuhan China
Identifying the zoonotic origin of SARS-CoV-2 by modeling the binding affinity between the spike receptor-binding domain and host ACE2
The coalescent in an exponentially growing metapopulation and its application to Arabidopsis thaliana
Molecular evolution analysis and geographic investigation of severe acute respiratory syndrome coronavirus-like virus in palm civets at an animal market and on farms
Host-specific and segment-specific evolutionary dynamics of avian and human influenza A viruses: a systematic review
The population genetics of dN/dS
Coronavirus: organization replication and expression of genome
Identifying SARS-CoV-2-related coronaviruses in Malayan pangolins
Severe acute respiratory syndrome coronavirus-like virus in Chinese horseshoe bats
Discovery of a novel coronavirus China Rattus coronavirus HKU24 from Norway rats supports the murine origin of Betacoronavirus
Molecular Evolution
Bats are natural reservoirs of SARS-like coronaviruses
Early transmission dynamics in Wuhan China of novel coronavirus-infected Pneumonia
Evolution antigenicity and pathogenicity of global porcine epidemic diarrhea virus strains
Viral metagenomics revealed sendai virus and coronavirus infection of Malayan pangolins (Manis javanica)
Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding
Molecular epidemiology evolution and phylogeny of SARS coronavirus
Infectious bronchitis viruses with a novel genomic organization
RDP4: detection and analysis of recombination patterns in virus genomes
The molecular biology of coronaviruses
Middle East respiratory syndrome coronavirus in bats Saudi Arabia
Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions
Molecular evolution and phylogenetics
Supramolecular architecture of severe acute respiratory syndrome coronavirus revealed by electron cryomicroscopy
ViPR: an open bioinformatics database and analysis resource for virology research
Mining SNPs from EST Databases
The viral spike protein is not involved in the polarized sorting of coronaviruses in epithelial cells
DNA sequence polymorphism analysis of large datasets
phylogenetic analysis of viruses in tuscan vitis vinifera sylvestris (gmeli) hegi
Gene genealogy and properties of test statistics of neutrality under population growth
Statistical method for testing the neutral mutation hypothesis by DNA polymorphism
The effect of change in population size on DNA polymorphism
Prospects for inferring very large phylogenies by using the neighbor-joining method
Human coronaviruses: What do they cause
A Review on SNP and other types of molecular markers and their use in animal genetics
Evidence for SARS-CoV-2 related coronaviruses circulating in bats and pangolins in Southeast Asia
Cryo-electron microscopy structure of a coronavirus spike glycoprotein trimer
Structure function and antigenicity of the SARS-CoV-2 spike glycoprotein
SARS-CoV infection in a restaurant from palm civet
Discovery, diversity and evolution of novel coronaviruses sampled from rodents in China
2019-ncov/sars-cov-2: rapid classification of betacoronaviruses and identification of traditional chinese medicine as potential origin of zoonotic coronaviruses
Coronavirus genomics and bioinformatics analysis
Discovery of seven novel Mammalian and avian coronaviruses in the genus deltacoronavirus supports bat coronaviruses as the gene source of Alphacoronavirus and Betacoronavirus and Avian Coronaviruses as the Gene Source of Gammacoronavirus and Deltacoronavirus
ORF8-Related genetic evidence for Chinese horseshoe bats as the source of human severe acute respiratory syndrome coronavirus
A new coronavirus associated with human respiratory disease in China
Isolation of SARS-CoV-2-related coronavirus from Malayan pangolins
Isolation and characterization of a novel bat coronavirus closely related to the direct progenitor of severe acute respiratory syndrome coronavirus
Isolation of a novel coronavirus from a man with pneumonia in Saudi Arabia
Positional cloning of the mouse obese gene and its human analogue
Protein structure and sequence reanalysis of 2019-nCoV genome refutes Snakes as its intermediate host and the unique similarity between its spike protein insertions and HIV-1
Probable Pangolin origin of SARS-CoV-2 associated with the COVID-19 outbreak
Fatal swine acute diarrhoea syndrome caused by an HKU2-related coronavirus of bat origin
A pneumonia outbreak associated with a new coronavirus of probable bat origin
A novel coronavirus from patients with Pneumonia in China 2019
Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Ethical approval This article does not contain any studies with human participants or animals performed.