key: cord-1036980-yhap30qx authors: Zhou, Hong; Ji, Jingkai; Chen, Xing; Bi, Yuhai; Li, Juan; Hu, Tao; Song, Hao; Chen, Yanhua; Cui, Mingxue; Zhang, Yanyan; Hughes, Alice C.; Holmes, Edward C.; Shi, Weifeng title: Identification of novel bat coronaviruses sheds light on the evolutionary origins of SARS-CoV-2 and related viruses date: 2021-03-08 journal: bioRxiv DOI: 10.1101/2021.03.08.434390 sha: fcb2bbcfb4a106ae1000a9fd7a5e89993013681f doc_id: 1036980 cord_uid: yhap30qx Although a variety of SARS-CoV-2 related coronaviruses have been identified, the evolutionary origins of this virus remain elusive. We describe a meta-transcriptomic study of 411 samples collected from 23 bat species in a small (~1100 hectare) region in Yunnan province, China, from May 2019 to November 2020. We identified coronavirus contigs in 40 of 100 sequencing libraries, including seven representing SARS-CoV-2-like contigs. From these data we obtained 24 full-length coronavirus genomes, including four novel SARS-CoV-2 related and three SARS-CoV related genomes. Of these viruses, RpYN06 exhibited 94.5% sequence identity to SARS-CoV-2 across the whole genome and was the closest relative of SARS-CoV-2 in the ORF1ab, ORF7a, ORF8, N, and ORF10 genes. The other three SARS-CoV-2 related coronaviruses were nearly identical in sequence and clustered closely with a virus previously identified in pangolins from Guangxi, China, although with a genetically distinct spike gene sequence. We also identified 17 alphacoronavirus genomes, including those closely related to swine acute diarrhea syndrome virus and porcine epidemic diarrhea virus. Ecological modeling predicted the co-existence of up to 23 Rhinolophus bat species in Southeast Asia and southern China, with the largest contiguous hotspots extending from South Lao and Vietnam to southern China. Our study highlights both the remarkable diversity of bat viruses at the local scale and that relatives of SARS-CoV-2 and SARS-CoV circulate in wildlife species in a broad geographic region of Southeast Asia and southern China. These data will help guide surveillance efforts to determine the origins of SARS-CoV-2 and other pathogenic coronaviruses. ). These samples were pooled into 100 libraries (numbered p1 to 116 p100) according to the collection date and host species, with each library containing 117 one to 11 samples. Meta-transcriptomic (i.e. total RNA) sequencing was performed 118 and coronaviruses contigs were identified in 40 libraries (Table S2) remaining 24 genomes were named in the same manner, in which the first two letters 127 represent an abbreviation of the bat species, YN denotes Yunnan, and the final number 128 is a serial number ranging from 03 to 26. In addition, several short contigs related to 129 SARS-CoV-2 were identified in two other libraries -p7 and p11 ( Figure S1 , Table 130 S2). Further Blastn analyses revealed that four of the seven novel sarbecoviruses identified 132 here (RpYN06, RsYN04, RmYN05, and RmYN08) were related to SARS-CoV-2 133 with nucleotide identities ranging from 82.46% to 97.21%, while the remaining three 134 (RsYN03, RmYN07, and RsYN09) were more closely related to SARS-CoV with 135 nucleotide identities ranging from 91.60% to 93.28%. We next designed specific 136 primers and a probe set of quantitative real-time PCR primers (qPCR) ( Table S4) (Table S4 ) and Sanger sequencing. Results from Sanger sequencing 150 were consistent with those obtained from the meta-transcriptomic sequencing. 152 At the scale of the whole genome, RpYN06 exhibited 94.5% sequence identity to 153 SARS-CoV-2, making it, after RaTG13 (96.0%), the second closest relative of SARS- RmYN08 now grouped with the Guangdong pangolin viruses (rather than those from 193 Guangxi; Figure 3B ). A different pattern again was observed in the phylogeny of the 194 entire ORF1ab ( Figure 3C ). RpYN06 and RmYN02 now formed a clade and that was 195 the direct sister-group to SARS-CoV-2, with RaTG13 a little more divergent ( Figure 3C ). In addition, RsYN04, RmYN05 and RmYN08 now clustered with the pangolin Figure 3D ). In addition, RsYN321B, 205 RmYN363B, and RmYN442B did not fall within the SARS-CoV and SARS-CoV-2 206 clades, but instead formed a separate and far more divergent lineage ( Figure 3D ). identities ranging from 85.3% to 99.0%; Figure S4 ). 237 We predicted and compared the three-dimensional structures of RpYN06, RsYN04 (Figures 6D-6F) . Ecological drivers for these species 309 unsurprisingly showed some differences. Specifically, R. affinis was also influenced to which humans may be routinely exposed. Importantly, in addition to Rhinolophids, Thailand and Cambodia. Also see Table S6 . The yellow region represents the predicted range of each species. Also see Figure S7 . of concern' commonly occur in or near indels, https://virological.org/t/spike-protein-mutations-in-504 novel-sars-cov-2-variants-of-concern-commonly-occur-in-or-near-indels/605/1. 505 Holmes, E.C., Andersen, K.G., Rambaut, A. and Garry, R.F. (2021). Spike protein sequences of 506 Cambodian, Thai and Japanese bat sarbecoviruses provide insights into the natural evolution of the 507 Receptor Binding Domain and S1/S2 cleavage site. https://virological.org/t/spike-protein-508 sequences-of-cambodian-thai-and-japanese-bat-sarbecoviruses-provide-insights-into-the-natural-509 evolution-of-the-receptor-binding-domain-and-s1-s2-cleavage-site/622. Further information and requests for resources and reagents should be directed to and 614 will be fulfilled by the Lead Contact, Weifeng Shi (shiwf@ioz.ac.cn). This study did not generate new unique reagents. 617 A total of 23 different bat species were tested in this study (Table S1) (Table S1 ). All bats were sampled alive and subsequently released. All samples 637 were transported on ice and then kept at -80°C until use. Global hotspots and correlates of emerging zoonotic diseases Global patterns in coronavirus diversity BLAST+: architecture and applications Interspecies transmission and 488 emergence of novel viruses: lessons from bats and birds De novo haplotype reconstruction in viral quasispecies using 490 paired-end read guided path finding fastp: an ultra-fast all-in-one FASTQ preprocessor The SARS-CoV-2 Spike protein has a broad 495 tropism for mammalian ACE2 proteins Hosts and Sources of Endemic 497 Bat Coronaviruses in China. Viruses 11. The newly assembled coronavirus genomes were validated by read mapping using To further improve the 666 quality of the genome annotations, SAM files of the reads mapping to SARS-CoV-2 667 were checked manually with Geneious (v2021.0.1), extending the ends as much as 668 possible. The open reading frames (ORFs) of the verified genome sequences were 669 annotated using Geneious (v2021.0.1) and then checked with closed references from 670 NCBI. The taxonomy of these newly assembled coronavirus genome were determined 671 by online BLAST To mitigate the possibility of false 674 positives due to index hopping, coronavirus contigs from different libraries within the 675 same chip and same lane were compared, and if a shorter contig shared >99% quantitative real-time PCR (qPCR), PCR amplification and Sanger 681 sequencing. A TaqMan-based qPCR was first performed to test the feces of pools p19, 682 p35, p44, p46, p52 and p62, as these contained beta-CoVs according to the 683 metagenomic analysis. cDNA synthesis was performed using the ReverTra Ace qPCR 684 RT Kit (TOYOBO). The qPCR reaction was undertaken using a set of probe and 685 primer pairs LightCycler 96 Real-Time PCR System (Roche) The sequences of the 5ʹ and 3ʹ termini 688 were obtained by RACE using the SMARTer RACE Set (Takara), according to the manufacturer's instructions with some minor 690 modifications. Two sets of gene-specific primers (GSPs) and nested-GSPs (NGSPs) The first round of 693 amplification was performed by touchdown PCR, while the second round comprised 694 regular PCR. The PCR amplicons of ~1000 bp fragments of the two regions were 695 obtained separately and sequenced with the amplified primer or gel purified followed 696 by ligation with the pMD18-T Simple Vector (TaKaRa) and transformation into 697 competent Escherichia coli DH5α (Takara) The cDNAs reverse 702 transcribed above were used as templates. The thermal cycling parameters of PCR 703 amplification were as follows: 5 mins at 95°C; followed by 30 s at 95°C, 30 s at 50°C 704 (an exception of 55°C for primers 379SF5/379SR5), 1 min at 72°C for 30 cycles; and 705 10 min at 72°C. A second round PCR was then performed under the same conditions 706 with the corresponding PCR products used as templates. Further confirmation of host 707 species was based on analysis of the cytochrome b (cytb) gene obtained from the 708 assembled contigs. We also amplified and sequenced the fragment of cytochrome c 709 oxidase subunit I (COI) gene using primers VF1/VR1 45°C, 45 s at 72°C for 14 cycles; and followed by 30 s at 95°C, 30 s at 45°C Multiple sequence alignment of the alphacoronavirus and 715 betacoronavirus nucleotide sequences was performed using MAFFT (v7 Phylogenetic analysis of the complete genome and major genes were performed using 717 the maximum likelihood (ML) method available in RAxML Pairwise sequence identity of the 722 complete viral genome and genes between SARS-CoV-2 and representative 723 sarbecoviruses was calculated using Geneious (v2021.0.1). A whole genome sequence 724 similarity plot was performed using Simplot Site and structural analysis of the spike gene. The three-dimensional structures of 727 the S1 protein from RpYN06, RsYN04 and SARS-CoV-2 were modeled using the 2018) employing PDB: 7A94.1 as the 729 template. Molecular images were generated with an open-source program -PyMOL Data was collated using a combination of that from Hughes 2019, from various online 734 repositories (Table S7), and additional GBIF data collated between version2) in 744 addition to productivity and other climate metrics (NDVI, seasonality, actual 745 evapotranspiration, potential evapotranspiration seasonality and mean annual potential 746 evapotranspiration, aridity, Emberger's pluviotonic quotient, continentality, 747 thermicity All variables were clipped to a mask of Tropical Southeast Asia and southern China at 751 a resolution of 0.008 decimal degrees (approximately 1km 2 ) in ArcMap 10.3, then 752 converted to asci format for modelling Five replicates were run 754 for each species, and the average taken before reclassifying with the 10 th percentile 755 cumulative logistic threshold to form binary maps for each species Because of complex regional biogeography, optimal species habitat can exist in areas 759 that have not been colonized. Therefore, we downloaded mapped ranges for 39 of the 760 49 species modelled from the IUCN We then divided the IUCN data into five regions; mainland Southeast Asia This was collated to form a spreadsheet listing each 765 zone each species was listed in, and then the appropriate shapefiles used to determine 766 the ranges of each species (although only 39 of the 49 species could be treated in this 767 way). Each species was then remosaiced with the mask to provide a binary 768 distribution map, removing any potentially suitable areas that were outside the species 769 biogeographic range