key: cord-0804860-6o1h9uyl authors: Zhou, Hong; Ji, Jingkai; Chen, Xing; Bi, Yuhai; Li, Juan; Wang, Qihui; Hu, Tao; Song, Hao; Zhao, Runchu; Chen, Yanhua; Cui, Mingxue; Zhang, Yanyan; Hughes, Alice C.; Holmes, Edward C.; Shi, Weifeng title: Identification of novel bat coronaviruses sheds light on the evolutionary origins of SARS-CoV-2 and related viruses date: 2021-06-09 journal: Cell DOI: 10.1016/j.cell.2021.06.008 sha: 75c369e6f5ff6e224860980af423213c7a8d1dac doc_id: 804860 cord_uid: 6o1h9uyl Despite the discovery of animal coronaviruses related to SARS-CoV-2, the evolutionary origins of this virus are elusive. We describe a meta-transcriptomic study of 411 bat samples collected from a small geographical region in Yunnan province, China, between May 2019 and November 2020. We identified 24 full-length coronavirus genomes, including four novel SARS-CoV-2 related and three SARS-CoV related viruses. Rhinolophus pusillus virus RpYN06 was the closest relative of SARS-CoV-2 in most of the genome, although it possessed a more divergent spike gene. The other three SARS-CoV-2 related coronaviruses carried a genetically distinct spike gene that could weakly bind to the hACE2 receptor in vitro. Ecological modeling predicted the co-existence of up to 23 Rhinolophus bat species, with the largest contiguous hotspots extending from South Laos and Vietnam to southern China. Our study highlights the remarkable diversity of bat coronaviruses at the local scale, including close relatives of both SARS-CoV-2 and SARS-CoV. those obtained from the meta-transcriptomic sequencing. 142 At the scale of the whole genome, RpYN06 exhibited 94.48% sequence identity to 143 SARS-CoV-2, making it, after RaTG13 (96.10%), the second closest relative of RpYN06, RmYN02 and RaTG13 to SARS-CoV-2 were 97. 17%, 96.41% and 96.49%, 155 respectively. 156 In contrast, RsYN04, RmYN05 and RmYN08 exhibited >99.96% nucleotide 157 identities to each other at the scale of the whole genome. Such strong similarity is 158 indicative of viruses from the same species, even though they were sequenced on 159 different lanes and the samples were collected from different bat species at different 160 times. In addition, they shared relatively low sequence similarity with SARS-CoV-2 161 across the whole genome (76.5%), particularly in the spike gene, ORF3a, ORF6, 162 ORF7a, ORF7b and ORF8 with nucleotide identities <70% ( Figure 2) . Interestingly, 163 when using RsYN04 as the query sequence, the closest hit in the Blastn search was 164 the pangolin derived coronavirus MP789 (MT121216.1) with 82.9% nucleotide 165 identity. Also of note was that no complete ORF10 was found in RsYN04, RmYN05, 166 RmYN08 and a number of other SARS-CoV-2 related coronaviruses due to premature 167 J o u r n a l P r e -p r o o f termination (Figure 2A ), consistent with a previous study suggesting that ORF10 is 168 not essential to the replication cycle of SARS- CoV-2 (Pancer et al., 2020) . 169 Evolutionary history of sarbecoviruses 170 Phylogenetic analysis of full-length genome sequences of representative 171 sarbecoviruses revealed that SARS-CoV-2 was most closely related to RaTG13, while 172 RmYN02 and the Thailand strains formed a slightly more divergent clade. Notably, 173 RpYN06 was placed at the basal position of the clade containing SARS-CoV-2 and its 174 closest relatives from bats and pangolins ( Figure 3A , Table S5 ). In contrast, RsYN04, 175 RmYN05 and RmYN08 grouped together and clustered with the pangolin derived 176 viruses from Guangxi, although separated from them by a relatively long branch. ORF1ab ( Figure 3C ). RpYN06 and RmYN02 now formed a clade and that was the 188 direct sister-group to SARS-CoV-2, with RaTG13 a little more divergent ( Figure 3C) . 189 In addition, RsYN04, RmYN05 and RmYN08 now clustered with the pangolin Figure 3D ). In addition, RsYN04, RmYN05, 198 and RmYN08 did not fall within the SARS-CoV and SARS-CoV-2 clades, but instead 199 formed a separate and far more divergent lineage ( Figure 3D ). Finally, in the 218 2021) such that there was no furin cleavage site in these viruses ( Figure 4A ). 219 Interestingly, however, the recently sampled bat virus from Thailand possessed a PVA 220 three amino acid insertion at this site, similar to the PAA insertion found in RmYN02. 221 The QTQTNS motif, proposed as the landing site for furin and other proteases, was 222 J o u r n a l P r e -p r o o f similarly not observed in the newly described viruses ( Figure 4A ). In addition, two 223 indel events have been identified in the RBD of many bat-associated coronaviruses 224 (Holmes et al., 2021) , and RpYN06 was characterized by indel patterns identical to 225 those of ZC45 and ZXC21 ( Figure 4A ). There were no indel events in SARS-CoV-2 226 and the pangolin derived coronaviruses in the RBD, and RsYN04, RmYN05 and Figure S3B ). 235 We predicted and compared the three-dimensional structures of RpYN06, RsYN04 236 and SARS-CoV-2 using homology modeling ( Figures 4B-4D) . In a similar manner to 237 RmYN02 (Zhou et al., 2020a) , the RBD of RpYN06 had two shorter loops than those 238 observed in SARS-CoV-2, while RsYN04 only had one shorter loop ( Figure 4D ). In 239 addition, near the S1/S2 cleavage sites, the conformational loop of RpYN06 and 240 RsYN04 were different from those of SARS-CoV-2 ( Figures 4B-4C) . Notably, 241 RsYN04 exhibited greater amino acid identity (71.28%) and shared more structural 242 similarity with the SARS-CoV-2 RBD than RpYN06 (63.08%). Importantly, the 243 conformational variations caused by these amino acid substitutions and deletions were 244 speculated to interfere with the binding of RpYN06 and RsYN04 RBD to hACE2 245 ( Figure 4D ). However, RsYN04 exhibited lower structural similarity with 246 SARS-CoV-2 in the N-terminal domain (NTD) (36.32% amino acid identity; Figure 247 4C, black arrowheads) than RpYN06 (60.77% amino acid identity). To determine to what extent the deletions in the RBD region might interfere with the 249 binding of the RpYN06 and RsYN04 RBDs to hACE2, RBDs from SARS-CoV-2, 250 J o u r n a l P r e -p r o o f Under Curve (AUC) of 0.96 for training and 0.92 for testing, and all training AUCs 295 were above 0.88. Continentality (reflecting the difference between continental and 296 marine climates) was, on average, the most important factor, contributing an average 297 of 14.91% (based on permutation importance), followed by temperature seasonality at 298 11.7% average contribution, mean diurnal temperature range at 5.69%, and annual 299 potential evapotranspiration at 5.38%. Three additional ecological factors also 300 contributed over 5% on average: minimum precipitation at 5.25%, potential 301 evapotranspiration seasonality at 5.17% and Emberger's pluviothermic quotient (a 302 measure of climate type) at 5%. The next most important factor was the distance to 303 bedrock (an indicator of potential caves and rock outcrops) at 4.46%. Thus, local 304 climate, especially factors that influence diet availability across the year, is seemingly 305 J o u r n a l P r e -p r o o f key to determining bat species distributions across the region. Although we could not accurately model diversity for Indonesia because of limited 307 recently available data and likely high endemism, mainland Southeast Asia was well 308 mapped ( Figures 6 and S5 ). Most of mainland Southeast Asia's remaining tropical 309 forests showed a high diversity of rhinolophid bats, with a maximum of 23 species 310 estimated to exist concurrently ( Figure 6A ). Rhinolophid hotspots occurred in forests 311 throughout much of mainland Southeast Asia, with the largest contiguous hotspots 312 extending from South Laos and Vietnam to Southern China ( Figure 6A ). Hotspots 313 were also identified in the Hengduan mountains, and some parts of northern Myanmar 314 and Nagaland in India ( Figure 6A ). Interestingly, R. affinis ( Figure 6B ) and R. pusillus ( Figure 6C) unsurprisingly showed some differences. Specifically, R. affinis was also influenced 322 by temperature seasonality (16.59%), followed by Emberger's pluviothermic quotient 323 and mean diurnal range (8.79, and 8.7%), while R. malayanus (a smaller species) was 324 mainly influenced by annual potential evapotranspiration mean (33.79%) and 325 seasonality (14.57%). R. pusillus was influenced by temperature seasonality (12.44%) 326 and continentality (9%), and R. shameli was largely influenced by annual potential 327 evapotranspiration seasonality (34.81%) followed by annual evapotranspiration 328 (9.79%). Overall, these factors control the range limits, and food availability for these 329 bat species. It should be noted that the ecological modeling identified several other rhinolophid 331 species with wide geographic distributions: R. huanensis, R. lepidus, R. luctus, R. 332 macrotis, R. marshalli, R. microglobosus, R. pearsoni, R. rouxii, R. stheno, R. thomasi, 333 J o u r n a l P r e -p r o o f and R. yunnanensis ( Figure S5 ). Notably, R. stheno was found to host both 334 SARS-CoV-2 and SARS-CoV-like coronaviruses. To reveal more of the diversity, ecology and evolution of bat coronaviruses, we Of particular note was that one of the novel bat coronavirus identified here -RpYN06 348 -exhibited 94.5% sequence identity to SARS-CoV-2 across the genome as a whole 349 and in some individual gene regions (ORF1ab, ORF7a, ORF8, N, and ORF10) was 350 the closest relative of SARS-CoV-2 identified to date. However, much lower sequence 351 identity in the spike gene, undoubtedly the product of a past recombination event, 352 made it a second closest relative of SARS-CoV-2, next to RaTG13, at the genomic 353 scale. Hence, aside from the spike gene, RpYN06 possessed a genomic backbone that 354 is arguably the closest to SARS-CoV-2 identified to date. Although several SARS-CoV-2-like viruses have been identified from different 356 wildlife species that display high sequence similarity to SARS-CoV-2 in some 357 genomic regions, none are highly similar (e.g. >95%) to SARS-CoV-2 in the spike 358 gene in terms of both the overall sequence identity and the amino acid residues at 359 critical receptor binding sites (Zhou et al., 2020b; Lam et al., 2020; Xiao et al., 2020; 360 J o u r n a l P r e -p r o o f Zhou et al., 2020a; Murakami et al., 2020; Hul et al., 2021; Wacharapluesadee et al., 361 2021) . Indeed, the spike protein sequences of three of the novel coronaviruses 362 described here (RsYN04, RmYN05, and RmYN08) formed an independent lineage 363 separated from known sarbecoviruses by a relatively long branch. In this context, it is 364 interesting that the recently identified bat coronavirus from Thailand carried a 365 three-amino acid-insertion (PVA) at the S1/S2 cleavage site (Wacharapluesadee et al., Small body-size in rhinolophids means that GPS tracking is not possible for the 381 majority of species, although their adaptation to forest habitats makes long-distance 382 migration unlikely. Whilst our monitoring efforts suggest that some larger bat species 383 from other genera may migrate, seasonal abundance changes in species such as R. The authors declare no competing interests. SARS-CoV-2 'variants of concern' commonly occur in or near indels, 558 https://virological.org/t/spike-protein-mutations-in-novel-sars-cov-2-variants-of-concern-commonl 559 y-occur-in-or-near-indels/605/1. 560 Holmes, E.C., Andersen, K.G., Rambaut, A. and Garry, R.F. (2021). Spike protein sequences of 561 Cambodian, Thai and Japanese bat sarbecoviruses provide insights into the natural evolution of 562 the Receptor Binding Domain and S1/S2 cleavage site. 563 https://virological.org/t/spike-protein-sequences-of-cambodian-thai-and-japanese-bat-sarbecovirus 564 es-provide-insights-into-the-natural-evolution-of-the-receptor-binding-domain-and-s1-s2-cleavage 565 -site/622. (Table S1 ). All bats were sampled alive and subsequently released. All samples were 702 transported on ice and then kept at -80°C until use. for the 5ʹ and one set for the 3ʹ RACE PCR amplification were designed based on the 757 assembled genome sequences of six beta-CoVs (Table S3 ). The first round of (Table S6) , and additional GBIF data collated between 2017 and 2021. (Table S7) Variables included a number of bioclimatic parameters (1,2,4,5,11,12,13,14,15: 868 http://worldclim.org/version2) in addition to productivity and other climate metrics AUCs were above 0.88, indicating that the models perform well based on testing data. Using model averages between replicates also prevents stochasticity between models. Maxent also assays the importance of each variable using a number of approaches. Because of complex regional biogeography, optimal species habitat can exist in 891 areas that have not been colonized. Therefore, we downloaded mapped ranges for 39 892 of the 49 species modelled from the IUCN 893 (https://www.iucnredlist.org/resources/spatial-data-download). Bats were extracted 894 from this data, clipped to match the study area. We then divided the IUCN data into results were applied to calculate the value of mean ± SD that displayed in Figure 967 4H-J. • RpYN06 is the closest relative of SARS-CoV-2 in most of the virus genome. • A high diversity of bat coronaviruses was present in a very small geographic area. • Ecological modeling reveals a broad range of rhinolophid bats in parts of Asia. A study of 411 bat samples collected in Yunnan province, China between 2019-2020 yields 24 full length coronavirus genomes, including 4 viruses highly related to SARS-CoV-2 and 3 to SARS. The closest relative to SARS-CoV-2 is infects a species of bats that are found in regions that extend from South Laos and Vietnam to southern China. RsYN04 --MFILLLLPIVLAQQDSCNHIVQLPNSMVRGVYNSGSKVYYPDDINRVSTSILFNDYFLPFNSNLTNRQTLGN---TNNNRFDVFTEPFLDGVYFAATEHSNVLRGWIFGTSFTNTTQS 115 RmYN05 --MFILLLLPIVLAQQDSCNHIVQLPNSMVRGVYNSGSKVYYPDDINRVSTSILFNDYFLPFNSNLTNRQTLGN---TNNNRFDVFTEPFLDGVYFAATEHSNVLRGWIFGTSFTNTTQS 115 RmYN08 --MFILLLLPIVLAQQDSCNHIVQLPNSMVRGVYNSGSKVYYPDDINRVSTSILFNDYFLPFNSNLTNRQTLGN---TNNNRFDVFTEPFLDGVYFAATEHSNVLRGWIFGTSFTNTTQS 115 Yunnan_RmYN02 --MFILLL-IG-YTAATTCV--TGPTIENKQNVSSLMRGVYYPDDIYRSNVNVLFTVPFLRFNSTLTWYNFWNQ-------AYTSRVMEFGDGIYFSTVDKSNAVRGWIFGTTLDNTTQS 107 Thailand_RacCS203 --MFILLL-IG-YTAATTCV--TGPSTENKQNVSSLMRGVYYPDDIYRSNVNVLFTVPFLRFNSTLTWYNFWNQ-------AYTSRVMAFGDGIYFSTVDKSNAVRGWIFGTTLDNTTQS 107 Yunnan_RaTG13 --MFVFLV-LL-PLVSSQCVNLTTRTQLPPAYTNSSTRGVYYPDKVFRSSVLHLTQDLFLPFFSNVTWFHAIHVSGTNGIKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQS 116 Cambodia_RShSTT200 MILLAFLI-PL-VTAQQGCGVISTKPQPKAMTFLSSSRGVYYNDDIFRSSVLHLTQDYFLPFNSTLTQYFSHVI---DTSSYFDNPILNFGDGVYFASTEKSNVIRGWIFGSSFDNTTQS Pangolin_Guangxi_P5L CSFGGVSVITPGTNTSNQVAVLYQDVNCTEVPMAIHAEQLTPAWRVYSAGANVFQTRAGCLVGAEHVNNSYECDIPVGAGICASYHSMSSF----RSVNQRSIIAYTMSLGAENSVAYSN 703 Japan_Rc-o319 CSYGGVSVITPGTNASTQVAVLYQDVNCTDVPTAIHAEQLTPSWRVYSTGTNMFQTQAGCLIGAEHVNNSYDCDIPIGAGICATYHTPSML---RSANNNKRIVAYVMSLGAENSVAYSN 671 **:**********: : :***********:** : :: . **:*: * . ** :****:**:: * : :****:***:**:* : :.** **:** .* **:* Signal peptide S1 Yunnan RmYN02 Thailand RacCS203 Yunnan RaTG13 Cambodia RShSTT200 Japan Rc-o319 RsYN04 658 RmYN05 658 RmYN08 658 Yunnan RmYN02 632 Thailand RacCS203 632 Yunnan RaTG13 672 Cambodia RShSTT200 654 Japan Rc-o319 China BtNv-SC2013|N. velutinus RsYN25|R. stheno China YN2012 Ra13591|Rhinolophus sp China BtRf-YN2012|R. ferrumequinum China HuB2013|R. ferrumequinum Kenya BtKY229E-1|Bat USA CDPHE15/USA/2006|M. lucifugus China Lucheng-19|Rat Kenya BtKYNL63-9a|Bat Viet Nam 17819-17|S. kuhlii Netherlands FRCoV-NL-2010|Ferret China SADS CoV FarmA|Swine CpYN11|C. plicata HK Bat CoV 1A|Miniopterus sp Viet Nam 17819-17|S. kuhlii China BtRf-YN2012|R. ferrumequinum Netherlands FRCoV-NL-2010|Ferret South Africa PML-PHE1/RSA/2011|N. capensis China YN2012 Rs3376|Rhinolophus sp HK Bat CoV 1B|Miniopterus sp USA 79-1146|Feline China SADS CoV FarmA|Swine Netherlands FRCoV-NL-2010|Ferret South Africa PML-PHE1/RSA/2011|N. capensis Netherlands FRCoV-NL-2010|Ferret Saudi Arabia Riyadh/Ry141/2015|Camel China BtCoV/512/2005|Scotophilus sp China Lucheng-19|Rat China YN2012 Rs3376|Rhinolophus sp Viet Nam 17819-17|S. kuhlii Kenya BtKY229E-1|Bat Netherlands FRCoV-NL-2010|Ferret China Anlong-57|M. davidii China YN2012 Ra13591|Rhinolophus sp. China BtMr-SAX2011|M. ricketti Canine CoV China BtCoV/512/2005|Scotophilus sp Human CoV NL63 USA CDPHE15/USA/2006|M. lucifugus China Lucheng-19|Rat Table S2 . Detection of coronaviruses and SARS-CoV-2 related sequences. Related to Figure S1 and the "Genome assembly and annotation" section of the STAR Methods.J o u r n a l P r e -p r o o f Table S4 . Primers used for the detection and amplification of the S gene, and the 5' and 3' terminal sequences of the four SARS-CoV-2 related genomes. Related to Figure 1 and the "Sanger sequencing" section of the STAR Methods.