key: cord-0800244-8twdx4g2 authors: Koyama, Takahiko; Platt, Daniel; Parida, Laxmi title: Variant analysis of SARS-CoV-2 genomes date: 2020-07-01 journal: Bull World Health Organ DOI: 10.2471/blt.20.253591 sha: e942ff1d90462875dd3b197377c0ada065045e2f doc_id: 800244 cord_uid: 8twdx4g2 OBJECTIVE: To analyse genome variants of severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2). METHODS: Between 1 February and 1 May 2020, we downloaded 10 022 SARS CoV-2 genomes from four databases. The genomes were from infected patients in 68 countries. We identified variants by extracting pairwise alignment to the reference genome NC_045512, using the EMBOSS needle. Nucleotide variants in the coding regions were converted to corresponding encoded amino acid residues. For clade analysis, we used the open source software Bayesian evolutionary analysis by sampling trees, version 2.5. FINDINGS: We identified 5775 distinct genome variants, including 2969 missense mutations, 1965 synonymous mutations, 484 mutations in the non-coding regions, 142 non-coding deletions, 100 in-frame deletions, 66 non-coding insertions, 36 stop-gained variants, 11 frameshift deletions and two in-frame insertions. The most common variants were the synonymous 3037C > T (6334 samples), P4715L in the open reading frame 1ab (6319 samples) and D614G in the spike protein (6294 samples). We identified six major clades, (that is, basal, D614G, L84S, L3606F, D448del and G392D) and 14 subclades. Regarding the base changes, the C > T mutation was the most common with 1670 distinct variants. CONCLUSION: We found that several variants of the SARS-CoV-2 genome exist and that the D614G clade has become the most common variant since December 2019. The evolutionary analysis indicated structured transmission, with the possibility of multiple introductions into the population. In late 2019, several people in Wuhan, China, were presenting with severe pneumonia at the hospitals. As the number of patients rapidly increased, the Chinese government decided on 23 January 2020 to lock down the city to contain the virus. Unfortunately, the virus had already spread across China and throughout the world. The World Health Organization (WHO) officially declared the outbreak a pandemic on March 11, 2020. As of 23 May 2020, over 5 million cases worldwide had been reported to WHO and the death toll has exceeded 330 000. 1 Researchers isolated the virus causing the pneumonia in December 2019 and found it to be a strain of β-coronavirus (CoV). The virus showed a high nucleotide sequence homology with two severe acute respiratory syndrome (SARS)-like bat coronaviruses, bat-SL-CoVZC45 and bat-SL-CoVZXC21 (88% homology) and with SARS-CoV (79.5% homology), while only 50% homology with the Middle East respiratory syndrome coronavirus (MERS) CoV. 2, 3 The virus, now named SARS-CoV-2, contains a single positive stranded RNA (ribonucleic acid) of 30 kilobases, which encodes for 10 genes. 4 Researchers have shown that the virus can enter cells by binding the angiotensin-converting enzyme 2 (ACE2), through its receptor binding domain in the spike protein. 5 The virus causes the coronavirus disease 2019 (CO-VID-19), with common symptoms such as fever, cough, shortness of breath and fatigue. 6, 7 Early data indicated that about 20% of patients develop severe COVID-19 requiring hospitalization, including 5% who are admitted to the intensive care unit. 8 Initial estimates of the case fatality rates were from 3.4% to 6.6% which is lower than that of SARS or MERS, 9.6% and 34.3% respectively. [9] [10] [11] The mortality from COVID-19 is higher in people older than 65 years and in people with underlying comorbidities, such as chronic lung disease, serious heart conditions, high blood pressure, obesity and diabetes. [12] [13] [14] Community transmission of the virus, as well as anti-viral treatments, can engender novel mutations in the virus, potentially resulting in more virulent strains with higher mortality rates or emergence of strains resistant to treatment. 15 Therefore, systematic tracking of demographic and clinical patient information, as well as strain information is indispensable to effectively combat COVID-19. Here we analysed the SARS-CoV-2 genome from 10 022 samples to understand the variability in the viral genome landscape and to identify emerging clades. In total, we downloaded 15 755 genome sequences from the following databases: the Chinese National Microbiology Data Center on 1 February 2020; the Chinese National Genomics Data Center Genome Warehouse on 4 February 2020; GISAID 16 on 1 May 2020 and GenBank on 1 May 2020. We removed redundant sequences with the China National Center for Bioinformation annotations. To reduce the number of false positive variants, we removed sequences with more than 50 ambiguous bases. For this study, we used the sequence of established SARS-CoV-2 reference genome, NC_045512. 17 This genome was sequenced in December 2019. Each sample was first aligned to the reference genome in a pairwise manner using EMBOSS needle (Hinxton, Cambridge, England), with a default gap penalty of 10 and extension penalty of 0.5. 18 Then, we developed a custom script in Python (Python Software Foundation, Wilmington, United States of America) to extract the differences between the genome variants and the reference genome. Nucleotide variants in the coding regions were converted to corresponding encoded amino acid residues. For the open reading frame 1 (ORF1), we used the protein coordinates from YP_009724389.1 19 for translation. Finally, we carefully Objective To analyse genome variants of severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2). Methods Between 1 February and 1 May 2020, we downloaded 10 022 SARS CoV-2 genomes from four databases. The genomes were from infected patients in 68 countries. We identified variants by extracting pairwise alignment to the reference genome NC_045512, using the EMBOSS needle. Nucleotide variants in the coding regions were converted to corresponding encoded amino acid residues. For clade analysis, we used the open source software Bayesian evolutionary analysis by sampling trees, version 2.5. Findings We identified 5775 distinct genome variants, including 2969 missense mutations, 1965 synonymous mutations, 484 mutations in the non-coding regions, 142 non-coding deletions, 100 in-frame deletions, 66 non-coding insertions, 36 stop-gained variants, 11 frameshift deletions and two in-frame insertions. The most common variants were the synonymous 3037C > T (6334 samples), P4715L in the open reading frame 1ab (6319 samples) and D614G in the spike protein (6294 samples). We identified six major clades, (that is, basal, D614G, L84S, L3606F, D448del and G392D) and 14 subclades. Regarding the base changes, the C > T mutation was the most common with 1670 distinct variants. Conclusion We found that several variants of the SARS-CoV-2 genome exist and that the D614G clade has become the most common variant since December 2019. The evolutionary analysis indicated structured transmission, with the possibility of multiple introductions into the population. 20 Using the identified recurrent variants, we performed hierarchical clustering in SciPy library, Python, to identify clades. First, a binary matrix of samples and distinct variants was created. Then, we did hierarchical clustering using the Ward's method 21 in SciPy library. 22 We investigated the mutation patterns of SARS-CoV-2 to find potential causes of mutations, by looking at the changes in bases. Since coronavirus genomes are positive sense, single stranded RNA, we did not combine C > T with G > A mutations. The spike protein is a key protein for SARS-CoV-2 viral entry and a target for vaccine development. We, therefore, wanted to find amino acid conservation between other coronavirus sequences in the spike protein. We used the basic local alignment search tool BLAST (National Center for Biotechnology Information [NCBI], Bethesda, United States) 23 followed by the constraint-based multiple alignment tool COBALT (NCBI, Bethesda, United States). 24 We carefully investigated mutations within the receptor binding domain and predicted B-cell epitopes. 25, 26 The mutations were further analysed to identify cross species conservation and to understand the nature of amino acid changes. We visualized the aligned sequence using the open source software alv. 27 For the phylogenetic analysis, we used the open source software Bayesian evolutionary analysis by sampling trees (BEAST), version 2.5. 28 BEAST uses a Bayesian Monte-Carlo algorithm generating a distribution of likely phylogenies given a set of priors, based on the probabilities of those tree configurations determined from the viral genomes. This analysis presents a different view than the variant analysis described above and is an independent test of the structure that individual haplogroup markers identify. First, we aligned sequences to NC_045512, using the multiple sequence alignment software, MAFFT. 29 Subsequently, we adjusted for length and sequencing errors, by truncating the bases in the 5'-UTR and 3'-UTR, without losing key sites. We excluded sequences showing a variability higher than 30 bases. For an optimal output of the phylogenetic tree, we randomly selected a subset of 2000 samples by using a random number generator in Python. We ran BEAST using sample collection dates with the Hasegawa-Kishino-Yano mutation model, 30 with the strict clock mode. Finally, we estimated the mutation rate and median tree height from the resulting BEAST trees. In total, we analysed 10 022 SARS CoV-2 genomes (sequences are available from the data repository) 20 Of the 2969 missense variants, 1905 variants are found in ORF1ab, which is the longest ORF occupying two thirds of the entire genome. ORF1ab is transcribed into a multiprotein and subsequently cleaved into 16 nonstructural proteins (NSPs). Of these proteins, NSP3 has the largest number of missense variants among ORF1ab proteins. Of the NSP3 missense variants, A58T was the most common (159 samples) followed by P153L (101 samples; Table 2 ). We also detected mutations in the nonstructural protein RNA-dependent RNA polymerase (RdRp), such as P323L (6319 samples). Deletions are also common in 3'-5' exonuclease (11 deletions) including those resulting in frameshifts. A comprehensive list of variants is available in data repository. 20 Variants with recurrence over 100 samples are shown in Table 3 . The most common variants were the synonymous variant 3037C > T (6334 samples), OR-F1ab P4715L (RdRp P323L; 6319 samples) and SD614G (6294 samples). They occur simultaneously in over 3000 samples, mainly from Europe and the United States. Other variants including ORF3a Q57H (2893 samples), ORF1ab T265I (NSP3 T85I; 2442 samples), ORF8 L84S (1669 samples), N203_204delinsKR (1573 samples), ORF1ab L3606F (NSP6 L37F; 1070 samples) were the key variants for identifying clades. We identified six major clades with 14 subclades ( Fig. 1 and Table 4 (80 samples) and Iceland (76 samples). However, the basal clade now accounts only for a small fraction of genomes (670 samples mainly from China). The remaining two clades D448del and G392D are small and they are without any significant subclades at this point. All non-coding deletions are either located within 3'-UTR, 5'-UTR or intergenic regions. Of the in-frame deletions, ORF1 D448del stands out with 250 samples. In contrast, we only detected two distinct in-frame insertions in our data set. We also detected 11 frameshift deletions and 36 stop-gained variants. The recurrent stop-gained variant Y4379* (NSP10 Y126*) is found in 51 samples in the D614G clade. NSP10 Y126* is located only 13 residues upstream of the stop codon; therefore, a truncation may not significantly affect function of the protein. Most of frameshift variants in ORF1ab do not recur except for S135fs (three samples) and L3606fs (two samples). Although frameshift variants are considered deleterious, for instance, S135fs (more precisely S135Rfs*9) In-frame deletion Non The most common base change is C > T (Fig. 2) . As expected, 31 we observed a strong bias in transition versus transversion ratio (7:3). C > T transitions might be intervened by cytosine deaminases. Surprisingly, G > T transversions, likely introduced by oxo-guanine from reactive oxygen species, 32 were also frequently observed. Assessing variants in the spike protein revealed 427 distinct non-synonymous variants with many variants located within the receptor binding domain and B-cell epitopes (Fig. 3) . Among the variants in the receptor binding domain, V483A (26 samples), G476S (9 samples) and V367F (12 samples) are highly recurrent. Fig. 4 shows the consensus tree from the phylogenetic analysis. The tree has a coalescence centre with exponential expansion identified by haplotype markers. The colour mapped phylogenies largely support the 14 identified subclades. We note that substantial numbers of samples from the United States show affinity with European lineages rather than those directly derived from East Asia. Except for the earliest cases, European clades dominate even in samples from western states in the United States. Further, European samples tend to associate with lineages that expanded through Australia. Estimation of mutation rate showed a median of 1.12 × 10 −3 mutations per site-year (95% confidence interval, CI: 9.86 × 10 −4 to 1.85 × 10 −4 ). The median tree height was 5.1 months (95% CI: 4.8 to 5.52). Here we show the evolution of the SARS-Co-2 genome as it has spread across the world. Although, our methods do not allow us to investigate whether the mutations observed led to a loss or gain of function, we can speculate on the implications of viral function of these mutations. The most common clade identified was the D614G variant, which is located in a B-cell epitope with a highly immunodominant region and may therefore affect vaccine effectiveness. 33 Although amino acids are quite conserved in this epitope, we identified D364Y V367F K378R F338L V341I A348T W353R P384L P330S V289I E309Q L296H E298G V308L G476S V483A/I N439K S438F A520S T547I P561L A570T/V A575S G594S R408* Q409E I434K A435S K458R/N D467V I468F/T C432* I472V P491R Y508H H519P/Q Q414A/E K417* G446V L452R A475V T478I V503F R509K V510L K529E E554D T573I T572I E583D L585F D405V D614G V615F/L T630S L611F V622I/L P631S A626V D627H N603K Q613H P621S D808N P809S P812S/L F817L I818V G889S A892S A893V Q804K I805V N354D/K S359N D364Y V367F K378R F338L V341I A348T W353R P384L P330S G476S V483A/I N439K S438F A520S R408* Q409E I434K A435S K458R/N D467V I468F/T D467V 6 C432* I472V P491R Y508H H519P/Q A520S Q414A/E K417* G446V L452R A475V T478I V503F R509K V510L D405V V289I E309Q L296H E298G V308L G594S D614G V615F/L D614G D614G T630S L611F V622I/L P631S A626V D627H N603K Q613H D614G D614G P621S D808N P809S P812S/L F817L I818V Q804K I805V Q Q G889S A892S A893V T547I P561L A570T/V A575S K529E E554D T573I T572I E583D Severe acute respiratory syndrome coronavirus 2 genomes Takahiko Koyama et al. 14 other variants besides D614G. Almost all strains with D614G mutation also have a mutation in the protein responsible for replication (ORF1ab P4715L; RdRp P323L), which might affect replication speed of the virus. This protein is the target of the anti-viral drugs, remdesivir and favipiravir, and the susceptibility for mutations suggests that treatment resistive strains may emerge quickly. Mutations in the receptor binding domain of the spike protein suggest that these variants are unlikely to reduce binding affinity with ACE2, since that would decrease the fitness of the virus. V483A and G476S are primarily observed in samples from the United States, whereas V367F is found in samples from China, Hong Kong Special Administrative Region, France and the Netherlands. The V367F and D364Y variants have been reported to enhance the structural stability of the spike protein facilitating more efficient binding to the ACE2 receptor. 34 In summary, structural and functional changes concomitant with spike protein mutations should be meticulously studied during therapy design and development. We detected several non-recurring frameshift variants, which can be se-quencing artefacts. The frameshift at Y3 in ORF10, although only detected in one sample, might not be essential for survival of the new coronavirus, since ORF10, a short 38-residue peptide, is not homologous with other proteins in the NCBI repository. The phylogenetic analysis suggest population structuring in the evolution of SARS-CoV-2. The analysis provides an independent test of the major clades we identified, as well as the geographic expansions of the variants. While the earliest samples from the United Stated appear to be derived from China, belonging either to basal or L84S clades, the European clades, such as D614G/ Q57H, tend to associate with most of the subsequent increase in infected people in the United States. D614G was first observed in late January in China and became the largest clade in three months. The mutation rate of 1.12 × 10 −3 mutations per site-year is similar to 0.80 × 10 −3 to 2.38 × 10 −3 mutations per site-year reported for SARS-CoV-1. 35 The rapid increase of infected people will provide more genome samples that could offer further insights to the viral dissemination, particularly the possibility of at least two zoonotic transmissions of SARS-CoV-2 into the human population. An understanding of the biological reservoirs carrying coronaviruses and the modalities of contact with human population through trade, travel or recreation will be important to understand future risks for novel infections. Further, populations may be infected or even re-infected via multiple travel routes. The number of people with confirmed COVID-19 has rapidly increased over the last five months with no sign of decline in the near future. The fight against COVID-19 will be long, until vaccines and other effective therapies are developed. To facilitate rapid therapeutic development, clinicopathological, genomic and other societal information must be shared with researchers, physicians and public health officials. Given the evolving nature of the SARS-CoV-2 genome, drug and vaccine developers should continue to be vigilant for emergence of new variants or sub-strains of the virus. ■ Notes: Each sample is coloured with corresponding subclade. We used the Bayesian evolutionary analysis by sampling trees software. 28 Research Severe acute respiratory syndrome coronavirus 2 genomes Takahiko Koyama et al. Flu Database, GenBank, and NGDC Genome Warehouse, and the National Microbiology Data Center on which this research is based. The list of genomes is available from the data repository. 20 We also thank Jane Snowdon and Dilhan Weeraratne. Цель Проанализировать варианты геномов тяжелого острого респираторного синдрома, вызванного коронавирусом-2 (SARS-CoV-2). Методы В период между 1 февраля и 1 мая 2020 года авторы загрузили данные по 10 022 геномам вируса SARS CoV-2 из четырех баз данных. Геномы принадлежали инфицированным пациентам из 68 стран. Авторы идентифицировали варианты, извлекая и попарно сравнивая последовательности с эталонным геномом NC_045512, используя набор инструментов EMBOSS. Варианты нуклеотидной последовательности в кодирующих участках были преобразованы в соответствующие кодируемые аминокислотные остатки. Для анализа клад использовалось программное обеспечение с открытым кодом для байесовского эволюционного анализа деревьев выборки, версия 2.5. Результаты Было идентифицировано 5775 четких вариантов генома, в том числе 2969 миссенс-мутаций, 1965 синонимичных му таций, 484 му тации в некодирующих учас тк ах, 142 некодирующие делеции, 100 делеций внутри рамки считывания, 66 некодирующих вставок, 36 вариантов изменения последовательности ДНК с новым стоп-кодоном, 11 делеций со сдвигом рамки и две вставки внутри рамки считывания. Чаще всего встречались синонимичная замена 3037C > T (6334 образца), P4715L в открытой рамке считывания 1ab (6319 образцов) и D614G в белке «шипа» (6294 образца). Было выявлено шесть основных клад (базовая, D614G, L84S, L3606F, D448del и G392D) и 14 субклад. Что касается замены оснований, наиболее частой была мутация с заменой цитозина на тимин (C>T), которая встречалась в 1670 вариантах. Вывод Авторы обнаружили существование нескольких вариантов генома SARS-CoV-2 и выяснили, что с декабря 2019 года наиболее распространенным вариантом является клада D614G. Эволюционный анализ продемонстрировал структурированную передачу генетических данных с возможностью многократной интродукции в популяцию. Objetivo Analizar las variantes del genoma del coronavirus tipo 2 del síndrome respiratorio agudo grave (SARS-CoV-2). Métodos Entre el 1 de febrero y el 1 de mayo de 2020, se registraron 10 022 genomas del CoV-2 del SARS en cuatro bases de datos. Los genomas eran de pacientes infectados ubicados en 68 países. Se identificaron variantes al extraer la alineación por pares del genoma de referencia NC_045512, por medio de EMBOSS Needle. Las variantes de los nucleótidos en las regiones codificantes se convirtieron en los correspondientes residuos de aminoácidos codificados. Para analizar los clados, se utilizó el programa informático de código abierto Bayesian evolutionary analysis by sampling trees, versión 2.5. Resultados Se identificaron 5775 variaciones diferentes del genoma, incluidas 2969 mutaciones con cambio de sentido, 1965 mutaciones sinónimas, 484 mutaciones en las regiones no codificantes, 142 supresiones no codificantes, 100 supresiones en la fase, 66 inserciones no codificantes, 36 variaciones de parada prematuras (stop-gained), 11 supresiones de desplazamiento de fase y dos inserciones en la fase. Las variaciones más comunes eran las sinónimas 3037C > T (6334 muestras), P4715L en la fase abierta de lectura 1ab (6319 muestras) y D614G en la proteína S (6294 muestras). Se identificaron seis clados principales, (es decir, basal, D614G, L84S, L3606F, D448del y G392D) y 14 subclados. En relación con los cambios de base, la mutación C > T fue la más común con 1670 variaciones diferentes. Conclusión Se determinó que existen diversas variaciones del genoma del SARS-CoV-2 y que el clado D614G es la variante más común desde diciembre de 2019. El análisis evolutivo indicó una transmisión estructurada, en la que existe la posibilidad de que se realicen múltiples inserciones en la población. Severe acute respiratory syndrome coronavirus 2 genomes Takahiko Koyama et al. Situation Report -124. Geneva: World Health Organization Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding A new coronavirus associated with human respiratory disease in China Wuhan seafood market pneumonia virus isolate Wuhan-Hu-1, complete genome. NCBI Reference Sequence: NC_045512.1. Bethesda: National Center for Biotechnology Information SARS-CoV-2 Cell Entry Depends on ACE2 and TMPRSS2 and Is Blocked by a Clinically Proven Protease Inhibitor Clinical characteristics of 138 hospitalized patients with 2019 novel coronavirus-infected pneumonia in Wuhan. China: JAMA China Medical Treatment Expert Group for Covid-19. Clinical characteristics of coronavirus disease 2019 in China Characteristics of and important lessons from the coronavirus disease 2019 (COVID-19) outbreak in China: summary of a report of 72 314 cases from the Chinese Center for Disease Control and Prevention A novel coronavirus outbreak of global health concern Cumulative Number of Reported Probable Cases of SARS World Health Organization Middle East respiratory syndrome coronavirus (MERS-CoV) World Health Organization and the Northwell COVID-19 Research Consortium. Presenting characteristics, comorbidities, and outcomes among 5700 patients hospitalized with COVID-19 in the New York city area Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China: a descriptive study Clinical course and outcomes of critically ill patients with SARS-CoV-2 pneumonia in Wuhan, China: a single-centered, retrospective, observational study Mechanisms of viral mutation GISAID: global initiative on sharing all influenza data -from vision to reality Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome. NCBI Reference Sequence: NC_045512.2. Bethesda: National Center for Biotechnology Information A general method applicable to the search for similarities in the amino acid sequence of two proteins NCBI Reference Sequence: YP_009724389.1. Bethesda: National Center for Biotechnology Information Variant analysis of SARS-CoV-2 genomes [data repository Hierarchical Grouping to Optimize an Objective Function SciPy 1.0: fundamental algorithms for scientific computing in Python Basic local alignment search tool COBALT: constraint-based alignment tool for multiple protein sequences A sequence homology and bioinformatic approach can predict candidate targets for immune responses to SARS-CoV-2. Cell Host Microbe Composition and divergence of coronavirus spike proteins and host ACE2 receptors predict potential intermediate hosts of SARS-CoV-2 alv: a console-based viewer for molecular sequence alignments BEAST 2.5: An advanced software platform for Bayesian evolutionary analysis MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform Dating of the human-ape splitting by a molecular clock of mitochondrial DNA Evidence for the selective basis of transition-totransversion substitution bias in two RNA viruses RNA damage and surveillance under oxidative stress Emergence of drift variants that may affect COVID-19 vaccine development and antibody treatment. Pathogens Emergence of RBD mutations in circulating SARS-CoV-2 strains enhancing the structural stability and human ACE2 receptor affinity of the spike protein Cold Spring Habor: medRxiv Moderate mutation rate in the SARS coronavirus genome and its implications We gratefully acknowledge the authors, originating and submitting laboratories of the sequences from GISAID's Epi-