key: cord-1011748-mrx0max9 authors: ESKİER, Doğa; AKALP, Evren; DALAN, Özlem; KARAKÜLAH, Gökhan; OKTAY, Yavuz title: Current mutatome of SARS-CoV-2 in Turkey reveals mutations of interest date: 2021-02-09 journal: Turk J Biol DOI: 10.3906/biy-2008-56 sha: 91d97ea23e5a6fa99e65e652d03c3610b94d8d04 doc_id: 1011748 cord_uid: mrx0max9

As the underlying pathogen for the COVID-19 pandemic that has affected tens of millions of lives worldwide, SARS-CoV-2 and its mutations are among the most urgent research topics worldwide. Mutations in the virus genome can complicate attempts at accurate testing or developing a working treatment for the disease. Furthermore, because the virus uses its own proteins to replicate its genome, rather than host proteins, mutations in the replication proteins can have cascading effects on the mutation load of the virus genome. Due to the global, rapidly developing nature of the COVID-19 pandemic, the local demographics of the virus can be difficult to analyze and track accurately, despite the importance of such information. Here, we analyzed available, high-quality genome data of SARS-CoV-2 isolates from Turkey and identified their mutations, in comparison to the reference genome, to understand how the local mutatome compares to the global genomes. Our results indicate that viral genomes in Turkey have one of the highest mutation loads and that certain mutations are remarkably frequent compared to global genomes. We also made the data on Turkey isolates available in an online database to facilitate further research on SARS-CoV-2 mutations in Turkey.

1 World Health Organization (1948). WHO timeline - COVID-19 [online]. Website https://www.who.int/news-room/detail/27-04-2020-who-timeline---covid-19 [accessed 17 August 2020].
2 The GISAID Initiative (2008). GISAID [online]. Website https://www.gisaid.org/ [accessed 27 August 2020].
and Nextstrain 3 (Hadfield et al., 2018) have become vital resources for researchers who seek to track the evolution of the virus during its transmissions. SARS-CoV-2 has a single-stranded RNA genome that codes for the proteins responsible for its own replication, many of which are produced via cleavage of the Orf1ab polyprotein, the largest gene on the genome. Therefore, mutations in the SARS-CoV-2 genome can lead to cascading effects by reducing the fidelity of subsequent replication cycles. Key proteins in the RNA replication complex include nsps 7, 8, and 12 (also known as RNA dependent RNA polymerase or RdRp), which together form the core polymerase complex (Kirchdoerfer and Ward, 2019; Peng et al., 2020), as well as nsp14, a dual function protein which joins the larger replication complex as a 3'-5' error-correcting exonuclease (Subissi et al., 2014; Romano et al., 2020). Our previous findings show that frequently observed mutations in both nsp12 and nsp14 are associated with an increase in mutation density in the SARS-CoV-2 genome (Eskier et al., 2020a, 2020b, 2020c).

In this study, we aimed to analyze the current mutatome of SARS-CoV-2 in Turkey, with three main questions in mind: (i) are there any key reoccurring mutations observed in a large number of isolates? (ii) how does the distribution of mutations among isolates compare to other regions in the world? and finally, (iii) are there any mutations observed in Turkey but not the rest of the world? We focused on the latter two questions in particular, with an emphasis on mutations of interest previously described in the literature. Our findings reveal the presence of three main clades of SARS-CoV-2 in Turkey, roughly analogous to 19A, 20A, and 20B as described in Nextstrain, with a preponderance of high-mutability variants (Eskier et al., 2020a, 2020b, 2020c) compared to international isolates.
Furthermore, we identified several frequently recurrent, previously uncharacterized variants in Turkey isolates not observed in isolates from other countries, which can serve as potential candidates for validation and study. In addition, we collected our analysis of Turkey isolates in a regularly maintained and updated database, which we hope will serve as a resource for future research on the local mutatome of SARS-CoV-2.

2.1. Genome sequence filtering, retrieval, and preprocessing

SARS-CoV-2 isolate genome sequences and the corresponding metadata were obtained from the GISAID EpiCoV database on 28 July 2020 4. These sequences were filtered by location to limit our database to isolates with the location "Europe/Turkey", which resulted in 180 isolate sequences. We applied further quality filters, selecting only isolates obtained from human hosts (excluding environmental samples and animal hosts), those sequenced for the full length of the genome (sequence size of 29 kb or greater), and those with high coverage for the reference genome (<1% N content, <0.05% unique mutations, no unverified indel mutations), which further narrowed down the list to 166 isolates. Because characters other than A, C, G, T, or N are not aligned according to their potential biological meanings, all nonstandard, unverified nucleotide characters were masked as N using the Linux sed command to ensure alignment accuracy, and the isolates were aligned against the SARS-CoV-2 reference genome using the MAFFT (v7.450) alignment software (Katoh et al., 2002). Variant sites in the isolates were annotated using snp-sites (2.5.1), bcftools (1.10.2) 5, and ANNOVAR (release date 24 October 2019) software (Wang et al., 2010; Page et al., 2016), to identify whether a given mutation was synonymous or nonsynonymous.
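The masking and two of the quality filters described above can be sketched in Python. This is a minimal illustration rather than the pipeline used in the study, which performed the masking with sed; the function names, the uppercasing step, and the choice to compute N content after masking are assumptions, and the unique-mutation and indel filters are not covered.

```python
import re

# Thresholds taken from the text; names are hypothetical.
MIN_LENGTH = 29_000      # "sequence size of 29 kb or greater"
MAX_N_FRACTION = 0.01    # "<1% N content"


def mask_nonstandard(seq: str) -> str:
    """Replace every character other than A, C, G, T, or N with N.

    Assumption: input is an unaligned sequence (no gap characters),
    and lowercase bases are treated the same as uppercase ones.
    """
    return re.sub(r"[^ACGTN]", "N", seq.upper())


def passes_quality(seq: str) -> bool:
    """Apply the length and N-content filters to one masked sequence."""
    if len(seq) < MIN_LENGTH:
        return False
    return seq.count("N") / len(seq) < MAX_N_FRACTION
```

For example, `mask_nonstandard("ACGTRYKN")` returns `"ACGTNNNN"`, so ambiguity codes such as R, Y, and K are counted toward the N content when the filter is applied afterwards.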
In addition, the 5' untranslated region of the genome (bases 1-265) and the 100 nucleotides at the 3' end were removed from the alignment and annotation files due to a high number of gaps and unidentified nucleotides. The genome data are stored in a MariaDB 10.3.22 database installed on the Debian Linux 10 operating system. For the web application, the genome data are visualized on a map using jVectorMap with HTML 5 and Ajax web development techniques, using the Django 3.0.5 framework and the Python 3.7.3 programming language. A modified version of TreeTime, an open-source phylogenetic analysis software, is used to create the phylogenetic tree (Sagulenko et al., 2018).

4 The GISAID Initiative (2008)

Our analysis of the genome sequences of 166 isolates from Turkey revealed 258 distinct mutations across the isolates, 87 of which are observed in multiple isolates and 43 of which are found in at least five isolates (hereafter referred to as recurring mutations). Of the 43 recurring mutations, 19 are nonsynonymous, 21 are synonymous, and 3 are found outside of coding regions. C>T transitions are the most common, comprising over half of the mutations, consistent with previous international findings on C>U hypermutations in SARS-CoV-2 (Simmonds, 2020). The most commonly seen mutations are 3037 C>T, 14408 C>T, and 23403 A>G, observed together in 139 of the isolates, with one singleton instance of 23403 A>G, also consistent with previous findings (Pachetti et al., 2020; Yin, 2020). Orf1ab mutations are the most common, comprising 23 of the recurring mutations, consistent with the size of the gene, as Orf1ab makes up two thirds of the SARS-CoV-2 genome. The Orf9 (nucleocapsid; N) gene has the second highest number of recurring mutations (n = 7, although 3 of them belong to the 28881-28883 trinucleotide block mutation), followed by the Orf5 (membrane; M) and S genes (n = 5) (Table 1).
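The three counts reported above (distinct mutations, mutations seen in multiple isolates, and recurring mutations found in at least five isolates) can be derived from a per-isolate mutation list. The following is a minimal sketch under assumed inputs: the dictionary layout, the function name, and the toy data are hypothetical and are not taken from the study's code.

```python
from collections import Counter

RECURRING_MIN_ISOLATES = 5  # "found in at least five isolates"


def summarize_mutations(isolate_mutations):
    """Count distinct, shared, and recurring mutations.

    isolate_mutations maps an isolate ID to an iterable of mutation
    labels such as "14408C>T" (hypothetical layout).
    """
    counts = Counter()
    for mutations in isolate_mutations.values():
        # set() so a mutation is counted once per isolate
        counts.update(set(mutations))
    distinct = len(counts)
    shared = sum(1 for n in counts.values() if n >= 2)
    recurring = sorted(
        m for m, n in counts.items() if n >= RECURRING_MIN_ISOLATES
    )
    return distinct, shared, recurring


# Toy data for illustration only, not real isolates.
toy = {
    "iso1": ["14408C>T", "3037C>T"],
    "iso2": ["14408C>T", "3037C>T"],
    "iso3": ["14408C>T", "100A>G"],
    "iso4": ["14408C>T"],
    "iso5": ["14408C>T"],
}
distinct, shared, recurring = summarize_mutations(toy)
```

Applied to the 166 Turkey isolates, the same tally would correspond to the 258, 87, and 43 figures given in the text.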
To identify which of the recurring mutations are stronger indicators of the Turkey genotype, we compared their frequency in the isolate population from Turkey to frequencies in other geographical regions, using a metric of mutation instances per sequenced isolate. To eliminate the potential confounding effect of earlier isolates having a lower number of mutations on average, and of different regions having started sequencing efforts on different timetables, we selected isolates sequenced after the day when each region of interest had at least ten isolates sequenced. As Turkey was the latest region to have the required number of isolates (19 March 2020), its date was used as the filtering cutoff. Four of the recurring mutations were found only in Turkey isolates, and six more were not

(322)
13620C>T 0% (0) 0% (0) 0.04% (5) 0.03% (2) 0% (0) 0% (0) 11.45% (19) 0.03% (7)
14724C>T 0% (0) 0% (0) 0.07% (9) 0.12% (7) 0.08% (1) 0.36% (1) 11.45% (19)
22444C>T 0% (0) 6.48% (93) 0.02% (2) 0% (0) 0% (0) 0% (0) 5.42% (9) 0.44% (95)
9479G>T 0.35% (1) 0.14% (2) 0.07% (9) 0.05% (3) 0.15% (2) 0% (0) 4.22% (7) 0.08% (17)
16428C>T 0% (0) 0% (0) 0.01% (1) 0.12% (7) 0% (0) 0.36% (1) 3.61% (6) 0.04% (9)
28857G>T 0% (0) 0.07% (1) 0.09% (12) 0.02% (1) 0% (0) 0% (0) 3.61% (6) 0.06% (14)
20268A>G 3.17% (9) 1.04% (15)

Turkey are limited to a single batch of isolates obtained by a single center, therefore pending verification.

Afterwards, we sought to understand how the mutation load of the isolates in Turkey compares to distributions in other regions. Using our previous date filter, we calculated the number of single nucleotide variants (SNVs) per isolate in each region (Table 3). Turkey had the highest number of SNVs per isolate, followed by South America.
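The per-region metrics described here reduce to simple ratios once the date filter is applied. The sketch below is illustrative only: the record layout, function names, and toy records are hypothetical, and treating "after the day" as strictly later than 19 March 2020 is an assumption about how the cutoff was applied.

```python
from datetime import date

CUTOFF = date(2020, 3, 19)  # date by which Turkey had ten isolates


def filter_by_date(isolates, cutoff=CUTOFF):
    """Keep isolates dated strictly after the cutoff (assumption)."""
    return [iso for iso in isolates if iso["date"] > cutoff]


def mutation_frequency(isolates, mutation):
    """Return (percent of isolates carrying the mutation, count)."""
    n = len(isolates)
    count = sum(1 for iso in isolates if mutation in iso["mutations"])
    percent = 100.0 * count / n if n else 0.0
    return percent, count


def mean_snvs_per_isolate(isolates):
    """Average number of SNVs carried per isolate."""
    if not isolates:
        return 0.0
    return sum(len(iso["mutations"]) for iso in isolates) / len(isolates)


# Toy records for illustration only, not real isolates.
toy = [
    {"date": date(2020, 3, 10), "mutations": {"a"}},
    {"date": date(2020, 3, 19), "mutations": {"a", "b"}},
    {"date": date(2020, 3, 25), "mutations": {"13620C>T", "x"}},
    {"date": date(2020, 4, 1), "mutations": {"x"}},
]
kept = filter_by_date(toy)
freq = mutation_frequency(kept, "13620C>T")
mean_snvs = mean_snvs_per_isolate(kept)
```

As a check against the reported values, 19 carriers among 166 isolates gives 100 × 19 / 166 ≈ 11.45%, which matches the percentage listed alongside the count of 19 for 13620C>T.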
In comparison, Africa, another region which started sequencing efforts later than the other regions, had a mean SNV number lower than that of Asia, the region with the earliest sequences available, implying that the mutation numbers are strongly influenced by other factors in addition to the date of introduction of the virus to the region. We also compared the number of SNVs per isolate in each region per gene, normalized by kilobase of gene region (Table 4). Turkey had the most SNVs of any region in the Orf1ab, M, and Orf7a genes, with Orf7a having more than three times as many SNVs as in any other region.

Data regarding Turkey isolates are available as a database comprising an interactive phylogenetic tree of the isolates, a geographical heatmap of sequenced isolates, and tables for both the mutatome of individual isolates and summaries of the mutations observed in the isolates (Figure). The phylogenetic tree can be viewed both in real time and divergence time, and colored according to nucleotide of interest, location, or sequencing date. The tables are generated using the sequencing metadata available from GISAID as well as ANNOVAR variant annotation tables. We aim to regularly validate and update the database as new sequences are made available 7. Future plans include implementation of Nextstrain clade and branch information in the phylogenetic tree to aid the user in comparisons with international sequencing data.

7 The database is freely accessible at http://covid19.ibg.edu.tr.

COVID-19 has been causing tremendous challenges for clinicians, healthcare systems, societies, and governments, and has required development of novel approaches to fight the pandemic. With an unpredictable future course for the ongoing pandemic, close monitoring and characterization of mutations have emerged as top priorities for better understanding of possible genotype-phenotype relations, and therefore better management of healthcare efforts.
Mutations in any viral infection, especially those that have crossed interspecies barriers, have to be considered in the context of natural selection. As the evolution of a virus will likely affect its fitness in a new host, any attempts against such an infection have to consider the causal relationships between genomic variances and the spread of the virus. Previous studies suggest that the selective pressure on mutations in SARS-CoV-2 in human hosts is largely confined to modest positive selection, with very little purifying selection, due to the short span of the pandemic, and that most of the positive selection has occurred in previous hosts (MacLean et al., 2020). Therefore, any investigation of the mutations will need to consider that most of the mutations have to be beneficial or neutral to create true strains of the virus. A comprehensive analysis by Jungreis et al. (2020) showed that SARS-CoV-2 mutations are excluded from the evolutionarily conserved amino acid residues and nucleotides, and the authors concluded that both synonymous and nonsynonymous mutations are under purifying selection. Therefore, not only the nonsynonymous mutations, but also the synonymous ones should be considered as potentially functional. Many studies have already provided lines of evidence that support a role for the S D614G mutation in increased infectivity and likely in transmissibility of SARS-CoV-2 (Daniloski et al., 2020; Korber et al., 2020). It is possible that new mutations that affect viral behavior may arise, and therefore the emergence and spread of such mutations should be monitored closely. However, with tens of millions affected worldwide, monitoring every single mutation is a challenging task. We believe that our database will provide a valuable and practical resource for researchers in Turkey, as well as in other countries, to track the spread of SARS-CoV-2 mutations in Turkey.
Our findings show that the viral isolates in Turkey have accumulated a higher number of mutations on average compared to other regions, even after accounting for isolates sequenced earlier during the pandemic having accumulated fewer mutations. Furthermore, Turkey isolates have more mutations in the Orf1ab gene, which produces the polyprotein that is cleaved into the mature peptides responsible for viral replication, than isolates from any other region. In addition, they have the third highest number of mutations in the S gene, which is responsible for the viral infection of the cells. As these two genes have the highest potential impact on the replication and transmission cycle of the virus, a higher mutation density in these genes can lead to an accelerated mutation rate. Of note, the 18877 C>T mutation in nsp14, the 3'-5' exonuclease responsible for error correction during genomic replication, has the second highest frequency in Turkey of any country 6. Our previous study (Eskier et al., 2020a) shows a strong correlation between increased mutation density and the 18877 C>T mutation, which might be a potential reason for Turkey's increased SNV average per isolate.

Two groups of mutations we identified that are worth further attention are the 3037 C>T, 14408 C>T, 23403 A>G haplotype and the 28881-28883 block mutation. The mutations within each of these groups are found almost exclusively together, both in Turkey and worldwide. In both cases, Turkey has a higher incidence of mutations in these groups than the worldwide average and than four of the major regions (Asia, Europe, North America, Oceania). We previously found that the 14408 C>T and 23403 A>G mutations, when occurring together, are strongly associated with increased mutation density over time (Eskier et al., 2020a), and the prevalence of both these mutations and the 18877 C>T mutation in Turkey isolates may further contribute to a variant-rich mutation landscape (Eskier et al., 2020b).
28881-28883 GGG>AAC is found on the N gene, whose product is responsible for packaging the genome into newly produced virions in cells and for regulating host cell response (McBride et al., 2014). The mutation disrupts an SR-rich motif in the nucleocapsid protein; disruption of this motif was found to cause reduced transmissibility in SARS-CoV, a similar betacoronavirus with high homology to SARS-CoV-2 (Tylor et al., 2009; Ayub, 2020). It is not clear whether the mutation groups are selected together and show homoplasic recurrence across isolates, or if they are a result of a strong founder effect.

A major concern when analyzing the isolate sequences from Turkey is the limited nature of the data. The sequences are few in number, and their geographical and temporal distributions are highly skewed, leading to difficulty in understanding the transmission routes of the virus across the country. Furthermore, new sequences are often made available in large batches by the centers, which can introduce further bias into the samples through batch-specific sequencing or assembly artifacts. Unless verified by multiple centers, in multiple batches, or by other experimental methods, caution is required when studying these mutations. As more genomes are sequenced, a clearer picture of the SARS-CoV-2 mutatome in Turkey will emerge and we will likely be able to draw more solid conclusions.

Finally, it should be noted that mutational profiles of viral genomes may determine whether infected patients will develop lasting immunity and remain protected from re-infection. Although exposure to SARS-CoV-2 protected rhesus macaques from re-infection with the same strain of virus (Deng et al., 2020), there are questions still remaining to be answered related to whether each recovered patient will have lasting immunity.
Recent news reports indicated that four patients from Hong Kong, Belgium, the Netherlands, and the USA, who had earlier recovered from COVID-19, had been reinfected with a different strain of SARS-CoV-2 than the original infection 8 (Tillett et al., 2021). In support of this observation, an earlier study reported that convalescent plasma from some of the COVID-19 patients showed reduced neutralizing activity against pseudoviruses with the D614G mutation in a culture environment (Hue et al., 2020). We do not have a clear understanding of the viral determinants of lasting immunity to SARS-CoV-2; however, it seems that certain viral proteins may be more critical than others, based on analyses of patient plasma samples. Grifoni et al. (2020) suggested that the M, Spike, and N proteins are the major determinants of the CD4+ response, with additional responses to nsp3, nsp4, ORF3a, and ORF8. Hachim et al. (2020) showed that the ORF8, ORF3b, and N proteins of SARS-CoV-2 elicited the strongest specific antibody responses in infected patients. It is plausible that certain mutations within these proteins affect the immune response; however, it remains to be explored whether any of the mutations that are common or more frequently seen in Turkish isolates have any effect on the immune response.

8 Euronews (1993). Euronews [online]. Website https://www.euronews.com/2020/08/25/two-cases-of-covid-19-reinfectionreported-in-europe [accessed 28 August 2020].
Reporting two SARS-CoV-2 strains based on a unique trinucleotide-bloc mutation and their potential pathogenic difference
A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission: a study of a family cluster
The Spike D614G mutation increases SARS-CoV-2 infection of multiple human cell types
Primary exposure to SARS-CoV-2 protects against reinfection in rhesus macaques
Data, disease and diplomacy: GISAID's innovative contribution to global health
Mutation density changes in SARS-CoV-2 are related to the pandemic stage but to a lesser extent in the dominant strain with mutations in spike and RdRp
Mutations of SARS-CoV-2 nsp14 exhibit strong association with increased genome-wide mutation load
RdRp mutations are associated with SARS-CoV-2 genome evolution
Targets of T cell responses to SARS-CoV-2 coronavirus in humans with COVID-19 disease and unexposed individuals
ORF8 and ORF3b antibodies are accurate serological markers of early and late SARS-CoV-2 infection
Nextstrain: real-time tracking of pathogen evolution
Uncontrolled innate and impaired adaptive immune responses in patients with COVID-19 acute respiratory distress syndrome
Sarbecovirus comparative genomics elucidates gene content of SARS-CoV-2 and functional impact of COVID-19 pandemic mutations
MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform
Structure of the SARS-CoV nsp12 polymerase bound to nsp7 and nsp8 co-factors
Cardiac and arrhythmic complications in patients with COVID-19
Tracking changes in SARS-CoV-2 spike: evidence that D614G increases infectivity of the COVID-19 virus
Gastrointestinal and liver manifestations in patients with COVID-19
The neuroinvasive potential of SARS-CoV2 may play a role in the respiratory failure of COVID-19 patients
Natural selection in the evolution of SARS-CoV-2 in bats, not humans, created a highly capable human pathogen
The coronavirus nucleocapsid is a multifunctional protein
Emerging SARS-CoV-2 mutation hot spots include a novel RNA-dependent-RNA polymerase variant
SNPsites: rapid efficient extraction of SNPs from multi-FASTA alignments
Structural and biochemical characterization of the nsp12-nsp7-nsp8 core polymerase complex from SARS-CoV-2
Pattern of early human-to-human transmission of Wuhan
A structural view of SARS-CoV-2 RNA replication machinery: RNA synthesis, proofreading and final capping
TreeTime: Maximum-likelihood phylodynamic analysis
Rampant C>U hypermutation in the genomes of SARS-CoV-2 and other coronaviruses: causes and consequences for their short and long evolutionary trajectories
One severe acute respiratory syndrome coronavirus protein complex integrates processive RNA polymerase and exonuclease activities
Genomic evidence for reinfection with SARS-CoV-2: a case study
The SR-rich motif in SARS-CoV nucleocapsid protein is important for virus replication
ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data
Asymptomatic transmission of SARS-CoV-2 and implications for mass gatherings
Genotyping coronavirus SARS-CoV-2: methods and implications
Cardiovascular complications in patients with COVID-19: consequences of viral toxicities and host immune response

The authors would like to extend their thanks to İzmir Biomedicine and Genome Center (IBG) COVID-19 platform IBG-COVID-19 for their support in implementing the study and the Scientific and Technological Research Council of Turkey (TÜBİTAK) for their financial support of IBG-COVID-19. Yavuz Oktay is supported by the Turkish Academy of Sciences Young Investigator Program (TÜBA-GEBİP).

The authors have no conflicts of interest to disclose.