key: cord-0802204-oczkbjgd authors: Jungreis, Irwin; Sealfon, Rachel; Kellis, Manolis title: SARS-CoV-2 gene content and COVID-19 mutation impact by comparing 44 Sarbecovirus genomes date: 2020-10-01 journal: Res Sq DOI: 10.21203/rs.3.rs-80345/v1 sha: afb78dae0802fdb6f78be31f2545bce0234aa5c3 doc_id: 802204 cord_uid: oczkbjgd Despite its overwhelming clinical importance, the SARS-CoV-2 gene set remains unresolved, hindering dissection of COVID-19 biology. Here, we use comparative genomics to provide a high-confidence protein-coding gene set, characterize protein-level and nucleotide-level evolutionary constraint, and prioritize functional mutations from the ongoing COVID-19 pandemic. We select 44 complete Sarbecovirus genomes at evolutionary distances ideally-suited for protein-coding and non-coding element identification, create whole-genome alignments, and quantify protein-coding evolutionary signatures and overlapping constraint. We find strong protein-coding signatures for all named genes and for 3a, 6, 7a, 7b, 8, 9b, and also ORF3c, a novel alternate-frame gene. By contrast, ORF10, and overlapping-ORFs 9c, 3b, and 3d lack protein-coding signatures or convincing experimental evidence and are not protein-coding. Furthermore, we show no other protein-coding genes remain to be discovered. Cross-strain and within-strain evolutionary pressures largely agree at the gene, amino-acid, and nucleotide levels, with some notable exceptions, including fewer-than-expected mutations in nsp3 and Spike subunit S1, and more-than-expected mutations in Nucleocapsid. The latter also shows a cluster of amino-acid-changing variants in otherwise-conserved residues in a predicted B-cell epitope, which may indicate positive selection for immune avoidance. Several Spike-protein mutations, including D614G, which has been associated with increased transmission, disrupt otherwise-perfectly-conserved amino acids, and could be novel adaptations to human hosts. 
The resulting high-confidence gene set and evolutionary-history annotations provide valuable resources and insights on COVID-19 biology, mutations, and evolution.

SARS-CoV-2, the virus responsible for COVID-19 1 , is a betacoronavirus in the subgenus Sarbecovirus, which also includes SARS-CoV, responsible for the 2003 severe acute respiratory syndrome (SARS) outbreak. Its large 29,903-nucleotide positive-strand RNA genome encodes ~30 known and hypothetical mature proteins (Fig. 1a, Fig. 2, Extended Data Fig. 1). Despite SARS-CoV-2's extreme medical importance, its gene content remains surprisingly unresolved, with several hypothetical open reading frames (ORFs) whose function or even protein-coding status is unknown. Moreover, no systematic resource exists for interpreting the functional impact of SARS-CoV-2 mutations and prioritizing candidate drivers that may underlie phenotypic differences between strains. Table S2).

The last third of the genome encodes named proteins S (Spike surface glycoprotein), composed of S1 (viral attachment to host-cell ACE2 receptor) and S2 (membrane fusion, viral entry), E (Envelope

protein-coding (left) vs. non-coding (right) using evolutionary signatures, including distinct frequencies of amino-acid-preserving (green) vs. amino-acid-disruptive (red) substitutions, and stop codons (cyan/magenta/yellow) in frame-specific alignments, and additional features. c.
PhyloCSF score (x-axis) for all confirmed (green) and rejected (red) ORFs, showing annotated/hypothetical/novel (labeled) and all AUG-initiated ≥25-codons-long locally-maximal ORFs (unlabelled

[Figure residue: frame-specific codon alignments of Sarbecovirus genomes. Rows include SARS-CoV (NC_004718), bat-SL-CoVZC45 (MG772933), bat-SL-CoVZXC21 (MG772934), WIV16 (KT444582), Rs4231 (KY417146), YN2018B (MK211376), Rs7327 (KY417151), Rs9401 (KY417152), Rs4084 (KY417144), WIV1 (KF367457), F46 (KU973692), Rf4092 (KY417145), YN2013 (KJ473816), Anlong-103 (KY770858), and Rs4081 (KY417143). Sequence rows omitted.]

Methods

Genome sequences were obtained from https://www.ncbi.nlm.nih.gov/. The genomes and NCBI annotations for SARS-CoV-2 and SARS-CoV were obtained from the records for accessions NC_045512.2 and NC_004718.3, respectively. The UniProt annotations for SARS-CoV-2 were obtained from the UCSC Genome Browser 48 on April 5, 2020.

The 44 Sarbecovirus genomes used in this study were selected starting from all betacoronavirus and unclassified coronavirus full genomes listed on NCBI via searches https://www.ncbi.nlm.nih.gov/nuccore/?term=txid694002[Organism:exp] and the same with txid1986197 and txid2664420 on 5-Mar-2020, excluding any that differed from NC_045512.2 in more than 10,000 positions in a pairwise alignment computed using NW-align 49 , that cutoff being chosen so as to distinguish Sarbecovirus genomes among those that were classified, and removing near duplicates, including all SARS-CoV and SARS-CoV-2 genomes other than the reference.
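The distance cutoff used for genome selection above can be illustrated with a minimal sketch. It assumes each candidate genome has already been pairwise-aligned to the reference (for example with NW-align) and that the two aligned strings are available. The function names and the choice to count a gap opposite a base as a differing position are assumptions made for illustration, not details stated in the Methods.

```python
from typing import Dict, List, Tuple


def count_differences(ref_aln: str, other_aln: str) -> int:
    """Count alignment columns in which the two aligned sequences differ.

    Assumption: a gap opposite a base counts as a difference.
    """
    if len(ref_aln) != len(other_aln):
        raise ValueError("aligned sequences must have equal length")
    return sum(1 for a, b in zip(ref_aln, other_aln) if a != b)


def select_genomes(
    pairwise: Dict[str, Tuple[str, str]], max_diff: int = 10_000
) -> List[str]:
    """Keep genomes that differ from the reference in at most max_diff positions.

    `pairwise` maps a genome name to (aligned reference, aligned genome).
    """
    return [
        name
        for name, (ref_aln, other_aln) in pairwise.items()
        if count_differences(ref_aln, other_aln) <= max_diff
    ]
```

Removal of near duplicates, which the Methods also describe, is not covered by this sketch.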
Coronavirus genomes in the left half of Extended Data Fig. 2 were those listed by https://www.ncbi.nlm.nih.gov/genomes/GenomesGroup.cgi?taxid=11118 on 11-Feb-2020.

The genomes were aligned using clustalo 50 with the default parameters. The phylogenetic tree was calculated using RAxML 51 using the GTRCATX model.

PhyloCSF (Phylogenetic Codon Substitution Frequencies) 22 determines whether a given nucleotide sequence is likely to represent a functional, conserved protein-coding sequence by determining the likelihood ratio of its multi-species alignment under protein-coding and non-coding models of evolution that use pre-computed substitution frequencies for every possible pair of codons, and codon frequencies for every codon, trained on whole-genome data. PhyloCSF was run using the 29mammals empirical codon matrices but with the Sarbecovirus tree substituted for the mammals tree. Input alignments were extracted from the whole-genome alignment and columns containing a gap in the reference sequence were removed. Browser tracks were created as described previously 26 . Scores listed in Supplementary Table S2 were calculated on the local alignment for each ORF or mature protein, excluding the final stop codon, using the default PhyloCSF parameters, including --strategy=mle.

FRESCo 29 was run using HYPHY version 2.220180618beta(MP) for Linux on x86_64 on 9-codon windows in each of the NCBI annotated ORFs. Alignments were extracted for the ORF excluding the final stop codon, and gaps in the reference sequence were removed. SCEs (synonymous constraint elements) were found by taking all windows having synonymous rate less than 1 and nominal p-value < 10^-5, and combining overlapping or adjacent windows. For the variant analysis, FRESCo was also run on 1-codon windows using codon alignments as described previously 29 .
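The SCE-calling step described above (threshold the windows, then combine overlapping or adjacent ones) amounts to interval merging. The sketch below shows one way to do it, assuming the per-window FRESCo results have already been parsed into records. The `Window` type, its field names, and `find_sces` are hypothetical and are not part of FRESCo or HyPhy.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Window(NamedTuple):
    start: int        # first codon of the window (inclusive)
    end: int          # last codon of the window (inclusive)
    syn_rate: float   # estimated synonymous rate in the window
    p_value: float    # nominal p-value for reduced synonymous rate


def find_sces(
    windows: Iterable[Window], max_rate: float = 1.0, max_p: float = 1e-5
) -> List[Tuple[int, int]]:
    """Merge significant windows that overlap or are adjacent into elements."""
    hits = sorted(
        (w.start, w.end)
        for w in windows
        if w.syn_rate < max_rate and w.p_value < max_p
    )
    merged: List[Tuple[int, int]] = []
    for start, end in hits:
        # Adjacent means the next window starts right after the previous end.
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

Coordinates here are codon indices within a single ORF, so windows from different ORFs would be merged separately.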
Substitutions per site and per neutral site for each annotated ORF and mature protein were calculated by extracting the alignment column for each site or, respectively, 4-fold degenerate site, from the whole-genome alignment and determining the parsimonious number of substitutions using the whole-genome phylogenetic tree. For columns in which some genomes did not have an aligned nucleotide, the number of substitutions was scaled up by the branch length of the entire tree divided by the branch length of the tree of genomes having an aligned nucleotide in that column.

PhastCons and phyloP tracks shown in Fig. 2 are the Comparative Genomics tracks from the UCSC Genome Browser, which were constructed from a multiz 52 alignment of the list of 44 Sarbecovirus genomes that we supplied to UCSC.

Single nucleotide variants were downloaded from the "Nextstrain Vars" track in the UCSC Table Browser on 2020-04-18 at 11:46 AM EDT. Table S3 includes one additional mutation, G24047A, from a later download, in order to represent Korber variant A829T/S. We defined an amino acid to be "conserved" if there were no amino-acid-changing substitutions in the Sarbecovirus alignment of its codon. We defined codons to be "synonymously constrained" if the synonymous rate at that codon calculated by FRESCo using 1-codon windows was less than 1.0 with nominal p-value < 0.034, corresponding to a false discovery rate of 0.125. We defined an intergenic nucleotide to be "conserved" if there were no substitutions of that nucleotide in the Sarbecovirus alignment. We classified SNVs as Synonymous, Nonsynonymous, or Noncoding, relative to the NCBI annotations, so SNVs within ORF10 were classified as coding, and SNVs within overlapping ORFs 3c and 9b were classified relative to the longer containing ORFs 3a and N, respectively.
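The per-column substitution count with branch-length scaling, described earlier in this section, can be sketched as follows. This is only an illustration: it assumes a rooted binary tree, uses Fitch parsimony as the parsimony method (the Methods do not name a specific algorithm), and takes "the tree of genomes having an aligned nucleotide" to mean the branches that separate at least two such genomes. The `Node` class and all function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Node:
    length: float = 0.0                 # length of the branch above this node
    name: Optional[str] = None          # genome name, set for leaves only
    children: List["Node"] = field(default_factory=list)


def fitch(node: Node, column: Dict[str, str]) -> Tuple[Optional[Set[str]], int]:
    """Return (state set, substitution count); leaves absent from column are skipped."""
    if not node.children:
        base = column.get(node.name)
        return ({base} if base else None, 0)
    states: Optional[Set[str]] = None
    total = 0
    for child in node.children:
        child_states, child_subs = fitch(child, column)
        total += child_subs
        if child_states is None:
            continue
        if states is None:
            states = child_states
        elif states & child_states:
            states = states & child_states
        else:
            states = states | child_states
            total += 1              # disjoint sets imply one substitution
    return states, total


def covered_leaves(node: Node, column: Dict[str, str]) -> int:
    """Number of leaves under node that have a nucleotide in this column."""
    if not node.children:
        return 1 if column.get(node.name) else 0
    return sum(covered_leaves(c, column) for c in node.children)


def spanned_length(node: Node, column: Dict[str, str], n: int, is_root: bool = True) -> float:
    """Total length of branches that have covered leaves on both sides."""
    k = covered_leaves(node, column)
    own = node.length if (not is_root and 0 < k < n) else 0.0
    return own + sum(spanned_length(c, column, n, False) for c in node.children)


def scaled_substitutions(root: Node, column: Dict[str, str], all_names: List[str]) -> float:
    """Parsimony count scaled by full tree length over covered-subtree length."""
    _, subs = fitch(root, column)
    full = {name: "N" for name in all_names}      # placeholder: every leaf covered
    total = spanned_length(root, full, len(all_names))
    covered = spanned_length(root, column, covered_leaves(root, column))
    return subs * total / covered if covered > 0 else 0.0
```

Here `column` maps each genome name to its nucleotide and simply omits genomes with no aligned nucleotide at that site.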
However, in Supplementary Table S3, we also classified variants according to our proposed reference gene annotations (fields beginning with New_); when classifying variants in overlapping ORFs 3a/3c and N/9b we classify SNVs relative to the ORF in which the variant is non-synonymous if that is true for only one of the frames, or the ORF for which the amino acid change is more radical (as defined by the BLOSUM62 matrix obtained from Biopython version 1.58 53 ) if it is non-synonymous in both frames, or the larger ORF if the variant is synonymous in both frames.

We determined mature proteins for which the density of amino-acid-changing SNVs differed significantly from the density that would be expected from their level of conservation, by calculating the residual of a linear regression of amino-acid-changing SNV density as a function of the fraction of conserved amino acids, for all mature proteins. The regression line was y=0.235-0.165x. We determined significance using a binomial p-value with a false discovery rate cutoff of 0.05. To further test significance of the SNV depletion in S1, we downloaded a larger set of SNVs from the UCSC Table Browser as above on 2020-05-09.

The 16 Spike-protein variants prioritized were those reported by Korber et al. in their bioRxiv preprint or later Cell publication (ones at greater than 0.3% frequency, or 0.1% if near certain epitopes).

To find regions that were significantly enriched for missense variants in conserved amino acids, we first defined a null model as follows. For each mature protein, we counted the number of missense variants and the number of conserved amino acids and randomly assigned each SNV to a conserved amino acid in the same mature protein, allowing multiplicity.
For any positive integer n, we found the largest number of variants that had been assigned to any set of n consecutive conserved amino acids within the same mature protein across the whole genome. Doing this 100,000 times gave us a distribution of the number of missense variants in the most enriched set of n consecutive conserved amino acids in the genome. Comparing the number of actual missense variants in any particular set of n consecutive conserved amino acids to this distribution gave us a nominal p-value for that n. We applied this procedure for each n from 1 to 100 and multiplied the resulting p-values by a Bonferroni correction of 100 to calculate a corrected p-value for a particular region to be significantly enriched. We note that these 100 hypotheses are correlated because enriched regions of different lengths can overlap, so a Bonferroni correction is overly conservative and our reported p-value of 0.012 understates the level of statistical significance. To find significantly depleted regions we applied a similar procedure with every n from 1 to 1000, but did not find any depleted regions with nominal p-value less than 0.05 even without multiple hypothesis correction.

Ribosome footprints shown in Extended Data Fig. 3

Sarbecovirus conservation and a link to view the alignment of a neighborhood of the SNV in CodAlignView. It is our intention to update this track hub as the list of variants in the UCSC Table Browser is updated. [Note to reviewers: classification is currently with respect to NCBI annotations; we will add a track classifying SNVs with respect to our PhyloCSF Genes annotations once our paper is accepted.]

In this resource, we have augmented variant data made available by UCSC 54 with our own annotations.
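The permutation null model for enriched regions, described in the Methods above, can be sketched as below. It is a simplified illustration under stated assumptions: each mature protein is summarized only by its number of conserved amino acids and its number of missense variants, the empirical p-value is the plain fraction of simulations reaching the observed count, and the default number of trials is kept small for speed (the Methods use 100,000). All function and parameter names are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def max_window_count(counts: List[int], n: int) -> int:
    """Largest total over any n consecutive entries (all entries if fewer than n)."""
    if not counts:
        return 0
    n = min(n, len(counts))
    window = sum(counts[:n])
    best = window
    for i in range(n, len(counts)):
        window += counts[i] - counts[i - n]   # slide the window by one position
        best = max(best, window)
    return best


def simulate_max(proteins: Dict[str, Tuple[int, int]], n: int, rng: random.Random) -> int:
    """One draw of the null: the most enriched n-window across all proteins.

    `proteins` maps a protein name to (conserved amino acids, missense variants).
    """
    best = 0
    for n_conserved, n_variants in proteins.values():
        if n_conserved == 0:
            continue
        counts = [0] * n_conserved
        for _ in range(n_variants):
            counts[rng.randrange(n_conserved)] += 1   # multiplicity allowed
        best = max(best, max_window_count(counts, n))
    return best


def enrichment_p_value(
    proteins: Dict[str, Tuple[int, int]],
    n: int,
    observed: int,
    trials: int = 1000,
    n_tests: int = 100,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (nominal p-value, Bonferroni-corrected p-value capped at 1)."""
    rng = random.Random(seed)
    hits = sum(simulate_max(proteins, n, rng) >= observed for _ in range(trials))
    nominal = hits / trials
    return nominal, min(1.0, nominal * n_tests)
```

Indexing the counts over conserved amino acids only means that a window of n consecutive list entries corresponds to n consecutive conserved amino acids, as in the description.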
UCSC data came from nextstrain.org 55 , which was derived from genome sequences

[Figure residue: genome-browser view with a genomic-coordinate axis (1,000 to 21,000) and a "UniProt mature proteins" track; alignment panel labels include KY352407_SARS_related_CoV_strain_BtKY72, NC_014470_Bat_CoV_BM48_31_BGR_2008, "Same region in main reading frame", and "Optimal bases for start codon recognition".]

likely that ORF9b can be translated by leaky scanning from the same subgenomic RNA as N, as it is only ~2 codons downstream of N's start. Moreover, both the optimal 9b start-codon context, and the less-optimal N start-codon context are fully-conserved features across all Sarbecovirus strains, indicating that leaky-scanning translation may be a conserved feature throughout Sarbecoviruses. In addition, ORF9b shows significant localized synonymous constraint in N in its start and end regions (Fig. 3)

[Figure residue: codon alignment rows omitted.]

Extended Data Figure 6. ORF3b is not protein-coding. Sarbecoviruses alignment of SARS-CoV 154-codon ORF3b
Although start codon is conserved in all but one strain, ORF length is highly variable due to numerous in-frame stop codons (red ovals and red rectangle). The 22-codon ORF in SARS-CoV-2 has strongly negative PhyloCSF score, does not overlap any SCEs, and even among the four strains sharing its stop codon (blue rectangle) all six substitutions are radical amino acid changes, providing no evidence of amino-acid-level purifying selection.
Ribosome profiling did not find translation of ORF3b, transcription studies did not find substantial transcription of an ORF3b-specific subgenomic RNA, and translation by leaky scanning would implausibly require ribosomal bypass of eight AUG codons (green rectangles

The very low score of ORF10 with this adjustment indicates that its only-slightly-negative unscaled-PhyloCSF score in Fig. 1c stems from the high nucleotide conservation of the region, rather than protein-coding constraint. The scores of N-overlapping ORFs 9b and 9c are both reduced, consistent with the high nucleotide conservation of N. Notably, the branch-length-adjusted score for 3c remains high, consistent with its protein-coding nature, and despite the higher overall nucleotide conservation of its dual-coding region. We have manually inspected all other candidates with adjusted scores higher than 9c, and all are rejected (as not protein-coding): two are discussed in Supplementary Figure S4

[Figure legend residue: Change; Synonymous; Conservative; Radical; Ochre Stop Codon; Amber Stop Codon; Opal Stop Codon; In-frame ATG.]

in the Sarbecovirus clade at both the amino acid level and nucleotide level is associated with purifying selection on variants in the SARS-

Alignment of 20 amino acid Nucleocapsid region that is highly enriched for variants disrupting perfectly conserved amino acids (alternate alleles shown in second row

[Figure residue: codon alignments of Sarbecovirus genomes. Rows include SARS-CoV (NC_004718), WIV16 (KT444582), Rs4231 (KY417146), YN2018B (MK211376), Rs7327 (KY417151), Rs9401 (KY417152), RaTG13 (MN996532), bat-SL-CoVZC45 (MG772933), bat-SL-CoVZXC21 (MG772934), Rs4084 (KY417144), WIV1 (KF367457), F46 (KU973692), Rf4092 (KY417145), and BtKY72 (KY352407). Sequence rows omitted.]