key: cord-1007447-2wdhy3kq authors: Oulas, Anastasis; Richter, Jan; Zanti, Maria; Tomazou, Marios; Michailidou, Kyriaki; Christodoulou, Kyproula; Christodoulou, Christina; Spyrou, George M. title: In depth analysis of Cyprus-specific mutations of SARS-CoV-2 strains using computational approaches date: 2021-06-16 journal: bioRxiv DOI: 10.1101/2021.06.08.447477 sha: 9fb236adb47b1be9e0b84bf49e89c79ea35e91a5 doc_id: 1007447 cord_uid: 2wdhy3kq This study aims to characterize SARS-CoV-2 mutations which are primarily prevalent in the Cypriot population. Moreover, using computational approaches, we assess whether these mutations are associated with changes in viral virulence. We utilize genetic data from 144 sequences of SARS-CoV-2 strains from the Cypriot population obtained between March 2020 and January 2021, as well as all data available from GISAID. We combine this with countries’ regional information, such as deaths and cases per million, as well as COVID-19-related public health austerity measure response times. Initial indications of selective advantage of Cyprus-specific mutations are obtained by mutation tracking analysis. This entails calculating specific mutation frequencies within the Cypriot population and comparing these with their prevalence world-wide throughout the course of the pandemic. We further make use of linear regression models to extrapolate additional information that may be missed through standard statistical analysis. We report a single mutation found in the ORF1ab gene (S6059F) that appears to be significantly enriched within the Cypriot population. We further analyse this mutation using regression models to investigate possible associations with increased deaths and cases per million. Moreover, protein structure prediction tools show that the mutation infers a conformational change to the protein that significantly alters its structure when compared to the reference protein. Investigating Cyprus-specific mutations for SARS-CoV-2 can not only lead to important findings from which to battle the pandemic on a national level, but also provide insights into viral virulence worldwide. Cyprus' first case of COVID-19 was reported on March 9 th 2020, with a gradual increase after that to formulate the first peak of the virus transmission seen in late March with all major Cypriot cities affected by the virus. This first wave was relatively mild with respect to reported daily new cases and deaths reaching a maximum of 58 cases on April 1 st and 2 deaths. The second wave of virus transmissions hit Cyprus in mid-October and gradually increased to peak in December with a maximum of 907 and 8 reported daily new cases and deaths, respectively. This summary of the SARS-CoV-2 spread in the Cypriot population dictates that Cyprus is one of the least affected European countries during this pandemic (mostly with respect to deaths per million). This can be attributed to the rapid response time for austerity measures and the effective quarantine process for confirmed cases and their contacts. In addition, Cyprus is ranked 3 rd in Europe as for the number of COVID tests performed per million (>2,4M tests). Multiple national studies have been undertaken by a plethora of countries throughout the world, in order to generate and analyse high throughput sequencing country-specific data for SARs-CoV-2 strains [1] [2] [3] [4] [5] [6] [7] [8] . Recently, the Cyprus Institute of Neurology and Genetics (CING) also published a study based on 144 NGS samples obtained from the Cypriot population representing the first documented genomic and epidemiological characterization of these samples [9] . In this study we expand on this work, by initially performing basic lineage analysis, as previously reported [9] , to identify the dominant SARS-CoV-2 lineage(s) in Cyprus as well as a phylogenetic tree analysis to map the Cypriot strains against other strains present throughout the world. We further perform a more thorough, in depth genomic analysis of these samples by implementing variant-calling and displaying an overview of the genomic variation and reported frequencies in the Cypriot strains. We then focus on the 9 spike (S) protein mutations that delineate the UK B.1.1.7 lineage and how their founder effect in Cyprus has impacted the number of cases and deaths in the population. Furthermore, we identify Cypriot-specific/dominant mutations and investigate them using mutation tracking analysis in order to isolate their origin, determine their founder effect in Cyprus and also trace their overall prevalence and propagation in other countries. We capitalize on our previously published generalized linear models [10] to undertake virulence analysis on Cypriot-dominant mutations and show how their presence has affected the number of cases and deaths within the country. Finally, we perform structural modelling of the alternate versus the reference mutation for selected mutations of interest and view their effects at the viral protein level. The Burrows-Wheeler Aligner (BWA) [11] , version: 0.7.15 was used to map the raw reads to Wuhan-Hu-1 (NCBI ID:NC_045512.2). Duplicate reads, which are likely to be the results of PCR bias, were marked using Picard (http://broadinstitute.github.io/picard/) version: 2.6.0. SAMtools [12] The consensus sequences of all 144 Cypriot strains were uploaded to the Pangolin COVID-19 lineage assigner interface [14] (https://pangolin.cog-uk.io/). Further analysis of results and generation of visualizations was performed in R V.3.6.1 [15] . Full genomic SARS-CoV-2 sequences of high sequencing resolution were obtained from GISAID (https://www.gisaid.org/, last accessed 15/01/21). The nextstrain [16] pipeline was downloaded locally and the commands for filtering, aligning and constructing phylogeny were used according to the nextstrain's best practices. MAFFT [17] was used to construct a multiple sequence alignment (MSA). Phylogeny was estimated using the RAxML [18] maximum likelihood algorithm for phylogenetic tree construction. Variant calling for the GISAID strains was achieved using the snp_sites tool available through github (https://github.com/sanger-pathogens/snp-sites). Relative frequencies across countries with at least one occurrence of the selected mutations of interest was visualized as bar plots across time (months). This provides an indication of the spread of the studied mutations across the general population. Analysis was performed using R (packages: dplyr, tidyr, ggplot2, ggtree, phytools, phangorn). A generalized linear model was applied (see formula 1 below), whereby the absence or, presence (values 0 or 1 respectively) of the S6059F mutation (mut variable) was assessed across samples from Cyprus, as well as the rest of the world (as defined by the classes variable, 1 denoting Cyprus and 0 for other countries). model<-glm(mut ~ classes, data=vcf, family = binomial(link = "logit")) An odds ratio analysis was also performed using a simple formula (see 2 below). Whereby, cycounts denotes the number of times the S6059F mutation was observed in Cyprus (0 for absence, 1 for presence) and similarly othercounts denotes the number of times the mutation was observed in other countries. Comparative structural modelling was carried out for unknown protein structures using the template-based web server, I-TASSER [19] . The accuracy of the method was assessed based on the I-TASSER template modeling score (TM-score), which indicates the structural similarity between model and templates. A TM-score higher than 0.5 indicates a model of correct topology, while a TM-score of less than 0.17 means a random topology [19] . The DynaMut webserver [21] , was used to visualize non-covalent molecular interactions, calculated by the Arpeggio algorithm [22] . Protein-protein complexes were constructed using the ClusPro (v2.0) [23] and HDOCK [24] algorithms and binding affinities and dissociation constants (Kd) were calculated using the PRODIGY webserver [25] . RNA-protein docking simulations were carried out using the HDOCK [31] and MPRDock [32] algorithms. For RNA-protein docking simulations, as active residues we selected the active site residues of the SARS-CoV NSP14 protein, since the two proteins share a 99.1% sequence similarity [26] . Structural alignment was performed using the align tool of PyMOL and all-atom RMSD values were calculated without any outliers' rejection, with zero cycles of refinement. All docking simulations were performed in triplicates. One of the most widely used international systems for detecting lineages that contribute most to active spread is the dynamic nomenclature system presented by Rambaut et al [14] . Cyprus-specific mutations were obtained by counting the frequency of all single nucleotide mutations present within the vcf file generated from our 144 sequenced samples. We selected 1 0 for mutations that appeared in at least 30% of the samples and thus identified 18 mutations that met this criterion. These included mutations that were very cosmopolitan (e.g. D614G mutation) but also included mutations that appeared primarily in the Cypriot population according to our sequenced samples. We performed an odds ratio (OR) analysis to investigate which mutations appear to be more frequent in the Cypriot population vs. strains obtained from the rest of the word. We also applied a generalized linear model ( were generated to show how well the glm performs under this scenario. We specifically highlight the S6059F mutation, located in the NSP14 protein of the ORF1ab gene, because it obtained the highest OR statistic (6921.17) and was also deemed most significant according to our glm model (p-value = 6.94E-180) (see Table 1 ). We performed mutation tracking analysis to further investigate the prevalence of this specific mutation throughout the course of the pandemic, both in Cyprus as well as worldwide. The founder effect appeared to have occurred in Russia, as shown by timestamps for the strains as well as phylogenetic tree analysis for the specific strains with the S6059F mutation (see Figure 3 ). This event did not lead to the establishment of this mutation as the dominant form within the Russian population. Similar results appear for other countries with at least one reported strain with this mutation. This is in contrast to Cyprus where the alternate S6059F mutation clearly appears to be replacing the reference form (see Figure 4 ). viable hypermutation phenotype both in cell culture as well as animal models [31] [32] [33] while a SARS-CoV-2 ExoN knockout mutant was found to be unable to replicate [30] . In order to investigate for proofreading dysfunctionality in the strains with the Capitalizing on previously published generalized linear models (glms) that provide a measure of association between specific SARs-CoV-2 mutations and increased or decreased viral cases or deaths [10] , we proceeded to perform virulence analysis on the 18 mutations found to be prevalent in the Cypriot population. We focused on the S6059F mutation due to its exclusive propagation within Cyprus compared to the rest of the world. Our glms are uniquely designed to incorporate mutation frequency, austerity measure response time and mutation occurrence information as predictor variables and cases or deaths per million as response variables into a single model. This allows for a fit that determines whether a given mutation is positively or negatively correlated with number of cases and deaths per country. We focused on geographic regional data reporting deaths and transmissions according to the Worldometers.info website (last accessed 08/03/21). Populations harbouring the mutation show a higher mean of deaths per million (1363) compared to the converse (1206). Statistical analysis of the populations with and without the S6059F mutation shows substantial evidence that the two distributions are significantly different (Wilcoxon test -pvalue 2.2e-16) (see Figure 6A ). Results obtained from using the cases per million parameter to assess the different distributions, show a reversed pattern with samples with the mutation exhibiting a lower mean of cases (49224) compared to the reference samples (64630) (see Figure 6B ). Statistical analysis further provides evidence that the two distributions are significantly different (Wilcoxon test -p-value 2.2e-16). We also included response time separation in the box plots which shows how the mutation segregates across countries that responded differently to the COVID austerity measures (see supplementary Figure S1A ,B). A B 1 7 Applying our previously published glm model [10] , on the 18 Cyprus-specific mutations allows for the analysis of these less studied mutations and how they appear to correlate with death and transmission rates. According to the fitting plot of our glm models, the unique S6059F mutation appears to be positively correlated with deaths per million (R=0.61) and neutral with respect to cases (R=0.15) per million (see supplementary Figure S2 ). However, it should be noted that this mutation did not show significance according to the model obtained p-values (p >0.05). This can most probably be attributed to its low frequency of occurrence across different countries besides Cyprus. In order to investigate potential reasons why the S6059F mutation appears to be associated with increased mutagenesis, we turn to structural prediction and docking tools. The Ser6059 residue is located on the surface of the NSP14 protein encoded by the ORF1ab gene, a bifunctional protein with an N-terminal exonuclease (ExoN) domain with a proofreading function and a C-terminal N7-guanine methyltransferase (N7-MTase) domain, implicated in the methylation of viral RNA cap structures. Both domains are crucial for the maintenance and stability of the viral RNA by lowering its sensitivity to RNA mutagens and evading its degradation from the host immune response [26] . The QHD43415 structure (Estimated TM-score = 0.87) of the NSP14 protein was retrieved from the I-TASSER repository containing the 3D structural models of all proteins encoded by the genome of SARS-CoV-2. The S6059F residue is located in the NSP14 domain directly involved in the physical interaction with NSP10 (residues 1-76 and 119-145), which enhances the ExoN activity by more than 35-fold [26] . The 6059 residue is close to the active site of the enzyme (D90, E92, E191 and D273) [26] , and both amino acids are exposed to the surface with RSA values 24.6% and 51.7% for the uncharged Serine (S) and Phenylalanine (F) residues, respectively. Further, in depth molecular analysis reveals disruption of hydrogen bonds and hampering and shifts in other inter-molecular interactions in the ALT-S6059F compared to the REF-S6059F structure of the protein (see Figure 8A ,B). Since the NSP14 protein is known to be involved in the maintenance and stability of the viral RNA by lowering its sensitivity to RNA mutagens and evading its degradation from the host immune response [26] , the large structural changes observed upon mutagenesis at the protein and RNA-protein complex levels, could ultimately lead to a reduced enzyme activity and could be the functional aetiology for observing a greater mutagenesis rate. Table 2 ). More evident differences in complex stability as denoted by free energy scores are observed for the NSP14-NSP10-RNA complexes, with the ALT forms of the complex attaining overall lower free energy values, indicating a higher affinity for RNA binding for the ALT forms of this complex (see Table 2 ). Investigating mutations for SARS-CoV-2 that are unique to a specific country can not only lead to interesting results for the underlying country but also provide insights into viral virulence worldwide. Countries like Cyprus with a small isolated population can be invaluable in assessing viral transmission and death rates, as well as potentially predicting future trends of the virus in the rest of the world. This work was based on the analysis of a small number (N=144) of viral strains from the Cypriot population. To this end we initially perform basic phylogenetic and lineage analysis as previously reported for these samples [9] . We extend previous work by selecting for 18 Cyprus-specific mutations exhibiting a high frequency of occurrence that is unique to Cyprus compared to the rest of the world. We highlight a single mutation with the highest frequency of occurrence in Cyprus alone and further analyse this by mutation tracking and regressions analysis (glms) in order to obtain a greater understanding of the nature of this mutation. We show that this mutation causes an amino acid change (S6059F) on the NSP14 exoribonuclease of the ORF1ab protein. We furthermore assess whether this mutation affects the molecular functionality of the NSP14 protein, which is known to be implicated in viral mutation proofreading during replication. We provide evidence that it may actually allude to a dysfunctional NSP14 protein that causes hypermutability in the strains with this mutation. This is further supported by structural modelling of the NSP14 protein with and without the S6059F mutation, which clearly points Simplex Virus UL42 protein which binds DNA and plays an essential role in viral DNA replication by acting as the polymerase accessory subunit. It has been shown that engineered viruses expressing mutant forms of the UL42 proteins that increase its affinity for DNA binding, exhibited increased mutation frequencies and elevated ratios of virion DNA copies [34] . These results suggest the SARS-CoV-2 S6059F-generated increased affinity for RNA seen in our structural models, may be the cause of the hypermutation also observed in our data for the Cypriot strains with the ALT form of the S6059F mutation. As previously reported, certain viruses may have evolved so that their RNA or DNA binding proteins, implicated in viral replication, neither bind DNA too tightly nor too weakly to optimize virus production and replication [34] . It is important to stress that our proposed mechanism for the S6059F mutation alluding to a dysfunctional NSP14 protein, which in turn may cause hypermutability, requires additional in-vitro experimental verification before it can be fully validated. We propose a theory whereby the increased mutability caused by decreased proofreading as a consequence of the S6059F mutation, causes a greater diversity of viral strains or quasispecies resident within the same host. This can potentially impact the infection patterns descending from this host by altering transmission rates and possibly death rates. We outline two potential scenarios where this could be of significant impact for downstream viral infection trends. If the "main" nested strain is of high virulence, viral replication with increased mutability will allude to a greater range of viral quasispecies from which to infect downstream hosts, thus less chances of the host transmitting the strain with high virulence. If, on the other hand, the nested strain is one of low virulence, increased mutability could potentially lead to the development and consequent transmission of more virulent strains in downstream infections. One very important aspect that ought to be discussed and investigated further, is the potential impact of such viable mutations that cause hypermutability on vaccination efficacy. It can be argued that infections caused from viral strains with such proofreading dysfunction can act as generators of diverse viral strains that allude current 2 4 vaccination attempts at a faster rate compared to strains with more regular proofreading functionality. First Report on the Latvian SARS-CoV-2 Isolate Genetic Diversity The power and limitations of genomics to track COVID-19 outbreaks: A case study from New Zealand SARS-CoV-2 genomic characterization and clinical manifestation of the COVID-19 outbreak in Uruguay. Emerg Microbes Infect Isolation and phylogenetic analysis of SARS-CoV-2 variants collected in Russia during the COVID-19 outbreak Genomic Analysis of Early SARS-CoV-2 Variants Introduced in Mexico Analysis of Genomic Characteristics and Transmission Routes of Patients with Confirmed SARS-CoV-2 in Southern California during the Early Stage of the US COVID-19 Pandemic Phylogeography of SARS-CoV-2 pandemic in Spain: A story of multiple introductions, micro-geographic stratification, founder effects, and super-spreaders Genome Epidemiological Study of SARS-CoV-2 Introduction into Japan Molecular Epidemiology of SARS-CoV-2 in Cyprus. medRxiv Generalized linear models provide a measure of virulence for specific mutations in SARS-cov-2 strains Ultrafast and memory-efficient alignment of short DNA sequences to the human genome The Sequence Alignment/Map format and SAMtools The genome analysis toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology R: A Language and Environment for Statistical Computing NextStrain: Real-time tracking of pathogen evolution MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform RAxML version 8: A tool for phylogenetic analysis and post-analysis of large phylogenies Can Predicted Protein 3D Structures Provide Reliable Insights into whether Missense Variants Are Disease Associated? DynaMut: predicting the impact of mutations on protein conformation, flexibility and stability Arpeggio: A Web Server for Calculating and Visualising Interatomic Interactions in Protein Structures New additions to the ClusPro server motivated by CAPRI. Proteins Struct Funct Bioinforma The HDOCK server for integrated protein-protein docking PRODIGY: A web server for predicting the binding affinity of protein-protein complexes Structural insights into SARS-CoV-2 proteins Preliminary genomic characterisation of an emergent SARS-CoV-2 lineage in the UK defined by a novel set of spike mutations Estimated transmissibility and severity of novel SARS-CoV-2 Variant of Concern 202012/01 in England Transmission of SARS-CoV-2 Lineage B.1.1.7 in England: Insights from linking epidemiological and genetic data. medRxiv The Enzymatic Activity of the nsp14 Exoribonuclease Is Critical for Replication of MERS-CoV and SARS-CoV-2 High Fidelity of Murine Hepatitis Virus Replication Is Decreased in nsp14 Exoribonuclease Mutants Infidelity of SARS-CoV Nsp14-exonuclease mutant virus replication is revealed by complete genome sequencing A live, impaired-fidelity coronavirus vaccine protects in an aged, immunocompromised mouse model of lethal disease Mutations That Increase DNA Binding by the Processivity Factor of Herpes Simplex Virus Affect Virus Production and DNA Replication Fidelity