key: cord-0786416-drqhuhk4
authors: Guan, Qingtian; Sadykov, Mukhtar; Nugmanova, Raushan; Carr, Michael J.; Arold, Stefan T.; Pain, Arnab
title: The genomic variation landscape of globally-circulating clades of SARS-CoV-2 defines a genetic barcoding scheme
date: 2020-04-23
journal: bioRxiv
DOI: 10.1101/2020.04.21.054221
sha: ed14d64f9a8045bd785a0b78c4d7cf9dff70c52d
doc_id: 786416
cord_uid: drqhuhk4

We describe fifteen major mutation events from 2,058 high-quality SARS-CoV-2 genomes deposited up to March 31st, 2020. These events define five major clades (G, I, S, D and V) of globally-circulating viral populations, representing 85.7% of all sequenced cases, which we can identify using a 10-nucleotide genetic classifier or barcode. We applied this barcode to 4,000 additional genomes deposited between March 31st and April 15th and successfully classified 95.6% of the clades, demonstrating the utility of this approach. An analysis of amino acid variation in SARS-CoV-2 ORFs provided evidence of substitution events in the viral proteins involved in both host entry and genome replication. The systematic monitoring of dynamic changes in the SARS-CoV-2 genomes of circulating virus populations over time can guide therapeutic and prophylactic strategies to manage and contain the virus and, with available efficacious antivirals and vaccines, also aid in the monitoring of circulating genetic diversity as we proceed towards elimination of the agent. The barcode will add the necessary genetic resolution to facilitate tracking and monitoring of infection clusters, to distinguish imported and indigenous cases, and thereby aid public health measures seeking to interrupt transmission chains without the requirement for real-time complete genome sequencing.

harbouring V483A and G476S mutations belong to Clade S. Interestingly, the V367F mutation has appeared independently in Clade V and Clade S, suggesting that this mutation contributes to viral fitness.
We found the V483A substitution in 13 closely-related cases from Washington State. An equivalent amino acid substitution located in a similar position within the RBD in the MERS-CoV spike protein reduces its binding to its cognate receptor

in the RBD of the SARS-CoV S protein (T332I, F460C and L443R) were identified previously [15]. These mutations negatively impact viral fitness through reducing the affinity to the host receptor [16] (Supplementary Table 2). The clade I-defining mutation in the nucleocapsid (N) protein, which has key roles in viral assembly, is synonymous. However, we observed nonsynonymous mutations in the N protein that are predicted to have functional implications. The hotspot mutations S202N, R203K and G204R all cluster in a linker region where they might potentially enhance RNA binding and alter the response to serine phosphorylation events (Supplementary Figure 6).

In this study, we have defined 5 major clades (G, I, S, D and V) and 9 minor clades (H2, L2, G1110A, H, Y, C17373A, S2, I2 and K), which cover ~89% of the 2,058 high-quality genomes available until March 31st in the GISAID database. The clustering of these genomes revealed the spread of clades to diverse geographical regions (Figure 1, Figure 3)

We have observed a distinct distribution of the major clades in different parts of the world (Figure 3). Most of the viral genomes that have not been assigned to a major clade are found in Asia and have earlier detection times in January and February at the start of the epidemic in China (Supplementary Figure 2).
We observed a decrease in the genetic diversity of the virus over time following dissemination from China, especially in Europe and North America, each of which now has a predominant clade type, which we believe to be associated with a founder effect whereby a single clade was introduced and

current sampling of available public genomes likely does not represent the extant genetic diversity of virus populations in circulation, owing to biases in genome data deposits from sequencing laboratories based mainly in the northern hemisphere. New datasets may define new clades in the near future from regions with comparably few genomes available at present, including Africa, the Indian subcontinent and Latin America. In this case, additional identifiers within an evolving barcode scheme can be added to track and monitor future emerging clades with higher resolution. On the other hand, the genetic stability of

However, due to the bias in the representation of countries depositing SARS-CoV-2 genomes, with over-representation of North American and European genomes (28.3% and 47.2%, respectively), and because the available genome data represent only a minute proportion of the total COVID-19 positive cases from each of these regions (America, 0.27%; Europe, 0.31%; China, 0.31%; data collected from GISAID [9] and the COVID-19 dashboard [1]), the genetic barcode described herein may need to be updated in order to be globally representative once sufficient numbers of genomes covering less-represented parts of the world are sequenced and deposited in publicly available databases. We envisage a qPCR-based allelic discrimination approach, such as PCR allele competitive extension (PACE), which would enable rapid turnaround in real time following the identification of a laboratory-confirmed case.
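The barcode-based clade assignment discussed above can be illustrated with a minimal sketch. The barcode positions, clade names and allele signatures below are hypothetical placeholders chosen only for illustration; they are not the 10-nucleotide barcode defined in this study. The sketch also assumes that each genome has already been aligned to a reference, so that position numbers are comparable between genomes.

```python
from typing import Dict, Optional, Sequence

# HYPOTHETICAL 1-based barcode positions (placeholders, not the study's sites).
BARCODE_POSITIONS: Sequence[int] = (5, 12, 20)

# HYPOTHETICAL clade signatures: alleles expected at BARCODE_POSITIONS, in order.
CLADE_BARCODES: Dict[str, str] = {
    "clade_A": "GCT",
    "clade_B": "ACT",
    "clade_C": "GTA",
}


def extract_barcode(aligned_genome: str,
                    positions: Sequence[int] = BARCODE_POSITIONS) -> str:
    """Read the nucleotide at each barcode position of a reference-aligned genome."""
    return "".join(aligned_genome[p - 1].upper() for p in positions)


def classify(aligned_genome: str) -> Optional[str]:
    """Return the clade whose signature matches the genome's barcode, else None."""
    barcode = extract_barcode(aligned_genome)
    for clade, signature in CLADE_BARCODES.items():
        if barcode == signature:
            return clade
    # Unmatched barcodes (including ambiguous calls such as 'N') stay unassigned.
    return None
```

Because only the listed positions are inspected, the same lookup could in principle be driven by allele calls from a genotyping assay rather than by a full genome sequence. A newly emerging clade would be accommodated by adding positions and signatures to the two tables.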
This qPCR-based approach would allow viral genetic epidemiological data to be added to contact tracing information for efficient detection of circulating SARS-CoV-

progressive elimination of the genetic diversity and, ultimately, eradication in all regions. A robust genetic barcoding scheme for SARS-CoV-2 can facilitate this molecular tracking of larger numbers of laboratory-confirmed cases, and implementing such a facile genotyping approach upstream of next-generation sequencing will allow whole genome sequencing to be performed on selected cases. This is of particular relevance when the available genomes represent only a small sample of the over 2.5 million total COVID-19 cases globally to date.

receptor binding domain (RBD), however located too far away from the ACE2-binding site to directly affect receptor binding. V367 is surface exposed, and its substitution would not create clashes with other protein regions. However, the exchange of a small with a bulky hydrophobic residue would alter the surface characteristics of this region, which might influence the efficiency of glycosylation (stick model) of the nearby N343, or the positioning of the sugars. Additionally, the altered RBD surface could potentially interfere with antibody recognition. (D) G476S: G476 is located in the RBD. It is positioned solvent-exposed in a SARS-CoV-2-specific loop. This loop is stabilised by a disulphide bridge (C480:C488; C480 is shown as stick model). In the open, ACE2-bound conformation of the RBD, G476 is close to ACE2 Q24 and E23. The substitution G476S would lead to light clashes with these ACE2 residues (indicated as red spheres) and with the RBD residue N487. However, minor reorientation of the side chains might allow an additional hydrogen bond to be formed between S476 and ACE2 Q24 and E23, thus enhancing the affinity.
(E) V483A: V483 is also located in the RBD, solvent-exposed in the same SARS-CoV-2-specific loop as G476 (C480:C488 are shown as stick models). In the open, ACE2-bound conformation of the RBD, V483 is more than 10 Å away from the ACE2 receptor, and hence does not contribute to direct binding or stability. In the closed conformation, this loop is not modeled in the EM structures (PDB 6vyb), implying that it is flexible in the absence of ACE2. Superimposition of the ACE2-bound conformation of the RBD onto RBDs in a closed conformation shows that this loop region would stick out into the solvent. Substitution of V483 is consequently not predicted to have a strong impact on receptor binding or protein stability. By lowering the hydrophobic surface, this substitution might however reduce the non-specific stickiness of this loop region, and/or affect binding of antibodies. (F) D614G: D614 is located in the SD1. In the trimeric S, D614 engages stabilising interactions within the SD1 (R646 or the backbone of F592, depending on the chain) and with the S1 of the adjacent chain (T859 and K854). Replacement of D614 with a glycine would entail losing these stabilising interactions and increase the dynamics in this region. (G) V1040F: V1040 is located in a loop region that makes hydrophobic contacts between stalk regions of the spike trimer. The V1040F substitution is possible without steric hindrance, and would slightly increase the hydrophobic contacts between the chains. The other mutations are not shown in detail, but are evaluated as follows. M153T: M153 is located in the NTD in a solvent-exposed loop. No electron density was modeled for this loop in the cryo-EM structure (6vyb), suggesting it is flexible. The substitution is expected to be neutral. S940F: S940 is located in the stalk region, in a solvent-exposed turn.
Introducing the bulkier phenylalanine in this position would not destabilise the structure but would locally change the surface characteristics. C1254F: C1254 is the last cysteine in a cysteine-rich unstructured short cytoplasmic region. This region is required for efficient membrane fusion. The exact mechanism remains to be elucidated, and hence we cannot assess the exact impact of the C1254F mutation.

with one nsp7 and two nsp8, which markedly enhances its polymerase activity [4]. The shown SARS-CoV-2 nsp12 structure is composed of a nidovirus-unique N-terminal extension (pale yellow), a linker domain (black) and the RNA-dependent RNA polymerase (RdRp) domain (grey). The structure shows the nsp12 in complex with nsp7 (magenta) and nsp8 (cyan and teal), based on PDB 7btf. (A) A4494 (A97 in nsp12 numbering; shown as a blue sphere model) is located in the N-terminal extension of the polymerase [5,6]. A4494 seals the hydrophobic core of the N-terminal lobe, and its side chain is not solvent exposed. Its replacement with the hydrophobic but slightly bigger valine (yellow) only leads to minor clashes (small red discs) that would not have a significant impact on the function or stability. (B) P4720 (P323 in nsp12 numbering) is located in the 'interface domain' (black). In this position, the P323L substitution (yellow) is not predicted to disrupt the folding or protein interactions and hence is not expected to have strong effects. (C) A4846 (A449 in nsp12 numbering; blue) is located in the finger domain, with its side chain pointing inwards, contributing to a hydrophobic interaction with the adjacent beta strand. The substitution of A4846 by a valine (yellow) is tolerated in this context. Leading only to minor clashes, it might slightly improve the stability of this region. (D) H5269 (H872 in nsp12 numbering; blue) is located in the thumb domain.
It is at the tip of a solvent-exposed turn, where its replacement by a tyrosine (yellow) would be tolerated without functional impact.

Nsp7: S3884 and Q3890 (S25 and Q31 in nsp7 numbering; both in light yellow) are solvent exposed on a helix that makes contact with both nsp8 and nsp12. (E) S25 caps the N-terminal end of this helix. Its substitution with a leucine (yellow) does not cause steric problems, but would lead to loss of the capping hydrogen bond. However, D163 from one of the nsp8 molecules also performs a capping function in the complex, and hence S25L would only have a slightly destabilising effect. (F) Q31 is located on the surface of the helix. Although Q31 is close to nsp8, its substitution with a histidine (yellow) would not influence this interaction measurably.

Nsp8: Only T4031 (T89 in nsp8 numbering; red) is included in the structural model. In both nsp8 molecules, T89 is located solvent exposed on a helix. In neither of the two nsp8 molecules would the substitution T89I create steric problems or affect the interaction with nsp12 or nsp7.

the helicase. The 2A and adjacent 1A domains coordinate together to complete the final unwinding process. The helicase has been modelled based on the 99.8% identical SARS nsp13 (PDB id 6jyt). The domains 1A and 2A are coloured in blue and cyan, respectively. The residues involved in NTP binding are shown in orange. Residues on domain 2A that are involved in RNA binding are shown in magenta. The other domains are grey. Left: overview of the complete structure. Key residues are shown as sphere models. Right: zoom into the mutated area. Key residues are shown as stick models. For Y1464 the in silico mutated cysteine is shown in white. P1427 is located in a solvent-exposed loop region that has not yet been reported to be involved in nucleotide binding. Its substitution with a leucine is not expected to create noticeable effects.
Y1464 is part of a region that contributes to binding and unwinding of duplex oligonucleotides [7]. Its substitution by a cysteine would decrease the stability and enhance the dynamics of this particular region, and might affect RNA binding and processing.

theoretical AlphaFold model. (A) Shown is a model for residues 180-534, colour ramped from blue to yellow. T265 and G392 are shown as blue sphere models. (B) T265 (blue stick model, corresponding to residue T85 in the nsp2 cleavage product numbering) is located at the tip of a loop that is pinned to the core of the N-terminal domain (blue) via hydrophobic residues (two phenylalanines are shown as

structural analysis is therefore based on the proposed theoretical AlphaFold model (Left). Helices are colored in cyan, loops in pale orange. L3606 is shown as blue sphere model. The endoplasmic reticulum membrane is shown in orange; lumen and cytoplasmic sides are indicated. Right: close-up view of the location of L3606 (shown as blue stick model). L3606 is located in a predicted helical region that is partly submerged in the membrane, lying parallel to its luminal surface. According to the structural model, L3606 is exposed to the membrane, and hence its substitution with a larger but still hydrophobic phenylalanine would not impact structure or function.

(C) G392 (G212 in nsp2 numbering

In this position even the non-conservative substitution with an aspartic acid (yellow) is not predicted to have significant impact on protein fold or function. (D) V378 is located in a surface-exposed loop. Its substitution by an isoleucine will not create steric clashes, nor lead to loss of hydrophobic interactions. (E) A second nsp2 fragment is shown, colour ramped from yellow to red, comprising residues 618 to 818. I739 is shown as blue sphere model. (F) I739 (I559 in nsp2 numbering; shown in blue) is part of a hydrophobic core of a small C-terminal domain.
Its substitution with an only slightly smaller hydrophobic valine (yellow)

Helices are coloured in cyan, strands in magenta and loops in pale orange. A876 is shown as blue sphere model. (B) Zoom onto A876. A876 (blue stick model) is placed within a helix, engaging hydrophobic contacts with Y905 (stick model). The substitution with a slightly larger threonine (yellow stick model) can be accommodated with only minor structural adjustments (clashes are shown as red spheres) and hence the A876T mutation is not expected to have a substantial influence on the protein's stability and function

on 4m0w). L1599 is remote from the active site in the Ubl domain. (D) Its substitution does not lead to clashes and is expected to be neutral in terms of function and protein stability

A theoretical model for the Orf3a monomer has been proposed by AlphaFold [9]. The structure-function relationship of this protein remains to be clarified. The mutation G251V is located C-terminal to the β-sandwich domain and the tail (marked by an asterisk)

The SARS coronavirus nucleocapsid protein - Forms and functions
Multiple Nucleic Acid Binding Sites and Intrinsic Disorder of Severe Acute Respiratory Syndrome Coronavirus Nucleocapsid Protein: Implications for Ribonucleocapsid Protein Packaging
Phosphorylation of the arginine/serine dipeptide-rich motif of the severe acute respiratory syndrome coronavirus nucleocapsid protein modulates its multimerization, translation inhibitory activity and cellular localization
One severe acute respiratory syndrome coronavirus protein complex integrates processive RNA polymerase and exonuclease activities
Structure of the RNA-dependent RNA polymerase from COVID-19 virus
Structure of the SARS-CoV nsp12 polymerase bound to nsp7 and nsp8 co-factors
Delicate structural coordination of the Severe Acute Respiratory Syndrome coronavirus Nsp13 upon ATP hydrolysis
Severe acute respiratory syndrome Coronavirus ORF3a protein activates the NLRP3 inflammasome by
Computational predictions of protein structures associated with Available at
Identification of an epitope of SARS-coronavirus nucleocapsid protein
Protein - Implication for Virus Ribonucleoprotein Packaging
Potent binding of 2019 novel coronavirus spike protein by a SARS coronavirus-specific human monoclonal antibody. Emerging Microbes and Infections
Escape from Human Monoclonal Antibody Neutralization Affects In Vitro and In Vivo Fitness of Severe Acute
Spread of mutant middle east respiratory syndrome coronavirus with reduced affinity to human CD26 during the south Korean outbreak