key: cord-1016184-aopgzfil authors: Goyal, Manisha; De Bruyne, Katrien; van Belkum, Alex; West, Brian title: Different SARS-CoV-2 haplotypes associate with geographic origin and case fatality rates of COVID-19 patients date: 2021-01-26 journal: Infect Genet Evol DOI: 10.1016/j.meegid.2021.104730 sha: 413df163d80911f58107c81868b74ea432df460a doc_id: 1016184 cord_uid: aopgzfil The current pandemic of COVID-19 is caused by the SARS-CoV-2 virus, for which many variants at the Single Nucleotide Polymorphism (SNP) level have now been identified. We show here that different allelic variants among 692 SARS-CoV-2 genome sequences display a statistically significant association with geographic origin (p < 0.000001) and COVID-19 case severity (p = 0.016). Geographic variation in itself is associated with both case severity and allelic variation, especially in strains of Indian origin (p < 0.000001). Using a new alternative bioinformatics approach we were able to confirm that the presence of the D614G mutation correlates with increased case severity in a sample of 127 sequences from a shared geographic origin in the US (p = 0.018). While leaving open the question of the pathogenesis mechanism involved, this suggests that in specific geographic locales certain genotypes of the virus are more pathogenic than others. We here show that viral genome polymorphisms may have an effect on case severity when other factors are controlled for, but that this effect is swamped out by these other factors when comparing cases across different geographic regions.
The SARS coronavirus-2 (SARS-CoV-2) causes COVID-19 (Kadkhoda, 2020). This disease is now pandemic and is killing hundreds of thousands of people on a global scale (e.g. Potere et al, 2020). Viruses, especially those with an RNA genome, have a tendency to evolve relatively rapidly during episodes of intense geographic spread. Using modern genomic sequencing technologies, the genetic changes associated with global but also more local dissemination can be documented rapidly (Sekizuka et al, 2020). Also, within viral populations variants can be traced due to the quasi-species nature of the SARS-CoV-2 virus (Jary et al, 2020). For SARS-CoV-2 thousands of Single Nucleotide Polymorphisms (SNPs) have already been identified, several of which have become fixed in the more recent, geographically defined viral populations at large (Saha et al, 2020; Sapoval et al, 2020; Kaushal et al, 2020; Yang et al, 2020). Rapid regional spread of SARS-CoV-2 may lead to increased allelic variability during periods of extended transmission (Gudbjartsson et al, 2020).
Although not all of these SNPs translate into amino acid variation in coding sequences (CDS), a significant portion does change the structure of important viral proteins. It is currently not clear what the effect of such variations is on viral phenotypes (e.g. the capacity to adhere to target host cells, efficiency of invasion of host cells, rapidity of replication, disease features in infected hosts etc.), in part because such effects are usually defined in artificial in vitro models. Such models are often cumbersome, carry an intrinsic infectious risk for those working with them and may not adequately represent the real-life in vivo situation (Lamers et al, 2020; Leibel et al, 2020). Modern bioinformatics tools may add flexibility to such laborious assays and are helpful in defining associations between viral genome variation and differential effects that such viral variants have during infection (e.g. Gallego et al, physiologically important mutation changing the amino acid sequence of the RNA-dependent RNA polymerase (RdRp) was noted but the effect on disease severity could not be assessed. Furthermore, it was shown that a 328 basepair deletion in ORF8 was clinically associated with a lower chance of developing hypoxia during COVID-19 (Young et al, 2020). Very recently, however, Toyoshima et al (2020), Nakamichi et al (2020) and Hodcroft et al (2020) reported the first viral mutations associated with fatality rates for COVID-19 and concluded that viral variation, together with host susceptibility and the environment, co-define the course of COVID-19. Using a novel viral typing tool, we here assess SNP-based haplotype variation in a large set of SARS-CoV-2 genome sequences, define the SARS-CoV-2 population structure and dynamics and associate these with clinical findings, including fatality rates, among patients.
SARS-CoV-2 viral genome sequences were collected using the Global Initiative on Sharing Avian Influenza Data (GISAID) database, which combined more than 90,000 genome sequences including phenotypic and disease-related metadata. Over 6400 of these sequences included relatively complete dossiers on patient status information. Sequences and metadata were stored, processed, and analyzed in a BIONUMERICS (v8.0) database with a SQLite backend. Data quality assessment was performed by filtering the GISAID sequences for completeness (>29000 bp) and by comparing genome sequences to the NC_045512 NCBI reference sequence. the plugin tool into subsequences matching the annotated CDSs while ignoring the small fraction of intergenic regions in the NCBI reference sequence for SARS-CoV-2 (NC_045512). Next, each of these sequences was analyzed for SNPs relative to the reference sequence. SNPs were stored in the database as a character type experiment to be used for comparison and strain typing using BIONUMERICS' clustering tools (dendrograms and minimum spanning trees). SNPs were also translated, enabling SNP interpretation based on actual amino acid changes. The "haplotype", as defined in the plugin, was determined by categorization of a set of common missense SNPs translated into amino acids (Sekizuka et al, 2020). This haplotype information was also stored in the database and displayed on the trees and networks for easy group detection.
Tool modules: After being downloaded from GISAID, FASTA-formatted genomic sequences were imported into the database using a dedicated sequence import routine available in BIONUMERICS. The SARS-CoV-2 plugin applied a BLAST approach to extract 26 subsequences from each genome. The subsequences of sample Wuhan-Hu-1 (NC_045512), installed automatically by the plugin, were used as reference sequences for the BLAST searches. The subsequences extracted from the genomic sequences were stored in the corresponding destination sequence type experiments.
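The completeness filter and the comparison against the reference can be illustrated with a minimal sketch. It is not the BIONUMERICS plugin, which relies on BLAST extraction and a built-in SNP tool; the function names are hypothetical, and the sketch assumes each sample subsequence has already been aligned to the reference so that both have the same length.

```python
# Hypothetical sketch: completeness filter and position-wise SNP calling.
# Assumption: sample and reference subsequences are pre-aligned (equal length).

MIN_GENOME_LENGTH = 29000  # completeness threshold given in the text (>29000 bp)


def is_complete(genome: str, min_length: int = MIN_GENOME_LENGTH) -> bool:
    """Return True if the genome is longer than the completeness threshold."""
    return len(genome) > min_length


def call_snps(reference: str, sample: str, strict: bool = False) -> dict:
    """Map 1-based position -> (reference base, sample base) for mismatches.

    With strict=True, positions where the sample base is not A, C, G or T
    are skipped. The default keeps them, loosely mirroring a relaxed filter.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    snps = {}
    pairs = zip(reference.upper(), sample.upper())
    for position, (ref_base, alt_base) in enumerate(pairs, start=1):
        if ref_base == alt_base:
            continue
        if strict and alt_base not in "ACGT":
            continue
        snps[position] = (ref_base, alt_base)
    return snps


if __name__ == "__main__":
    # Toy sequences, not real SARS-CoV-2 data.
    print(call_snps("ACGTAC", "ACTTAN"))
    print(call_snps("ACGTAC", "ACTTAN", strict=True))
```

In relaxed mode an ambiguous base such as N is reported as a difference from the reference; whether that matches the plugin's relaxed template in every detail is not stated in the text.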
These sequence types were identified by ORF and, for ORF1, an additional non-structural protein (nsp) tag. After the BLAST screening, the following detailed results were reported for each destination sequence type (Locus column): whether or not a BLAST hit was found, its position on the genome sequence (Start and Stop), sequence identity (Identity (%)) and sequence overlap (Length (%)), the length of the retrieved subsequence, the number of mismatches with the reference sequence (Mismatches), the number of gaps (Open gaps) and the length correction (if applied). In the second step of the process, the haplotypes were determined for each sample. The haplotype, as defined in the SARS-CoV-2 plugin, consists of a set of high-frequency amino acid substitutions, which are summarized in Table 1. Three pairs of these substitutions were observed to be in linkage
SNP calculation: After extraction, the plugin screened each subsequence for SNPs by automating the built-in BIONUMERICS SNP analysis tool. The resulting SNP set was filtered based on the relaxed (non-ACGT bases allowed) SNP filtering template and the retained SNPs were stored in the SNP character experiment.
Clustering SNP data into dendrograms: Entries to be clustered were selected based on suitability. In the first step, all selected entries were screened for the presence of the subsequences extracted in the prior processing step. Entries for which one or more subsequences were missing had an incomplete SNP character set and were excluded from the comparison. A similarity matrix was calculated based on the SNP experiment, using the categorical (differences) similarity coefficient, and displayed in the similarities panel. A dendrogram was then calculated based on the complete linkage (furthest neighbor) clustering algorithm (Sneath and Sokal, 1973). A minimum spanning tree (MST) was then calculated in the advanced cluster analysis window of BIONUMERICS, using default priority rule settings.
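As an illustration of the clustering step, the sketch below counts categorical differences between SNP profiles and builds a minimum spanning tree with Prim's algorithm. This is a simplification under stated assumptions: profiles are represented as equal-length strings of character states, ties are broken alphabetically rather than by the BIONUMERICS priority rules, and the complete linkage dendrogram is not reproduced. All function names are hypothetical.

```python
# Hypothetical sketch: categorical distance and a minimum spanning tree.
# Assumption: each entry has a complete SNP profile of the same length.


def categorical_distance(a: str, b: str) -> int:
    """Count the positions at which two SNP profiles differ."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same SNP positions")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(profiles: dict) -> dict:
    """Return {(name_i, name_j): distance} for every ordered pair."""
    return {
        (i, j): categorical_distance(profiles[i], profiles[j])
        for i in profiles
        for j in profiles
    }


def minimum_spanning_tree(profiles: dict) -> list:
    """Prim's algorithm; returns edges as (parent, child, distance)."""
    names = sorted(profiles)
    if not names:
        return []
    dist = distance_matrix(profiles)
    in_tree = {names[0]}
    edges = []
    while len(in_tree) < len(names):
        # Pick the shortest edge leaving the tree; ties resolved by name.
        parent, child = min(
            ((u, v) for u in in_tree for v in names if v not in in_tree),
            key=lambda edge: (dist[edge], edge),
        )
        edges.append((parent, child, dist[(parent, child)]))
        in_tree.add(child)
    return edges


if __name__ == "__main__":
    # Toy profiles, not real data.
    toy = {"A": "AAAA", "B": "AAAT", "C": "AATT", "D": "GGGG"}
    for edge in minimum_spanning_tree(toy):
        print(edge)
```

Because several spanning trees can share the same total length, a different tie-breaking rule may yield a different but equally short tree.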
The SNPs stored in the SNP experiment of the selected entries were translated and the amino acids stored in the SNP_TRANSL experiment file.
Case severity: The patient status information for each genome sequence was imported as a category (e.g. "asymptomatic", "hospitalized", "deceased"). Each patient's status was evaluated sometime between when the sample was taken and when it was submitted, and does not necessarily reflect the case's outcome. We created a decision network in BIONUMERICS to convert each category to an integer value representing increasing case severity, on a scale from 1 to 6 (Table 2).
Statistical analysis: Contingency tables between two different categories (e.g. haplotypes and countries) were evaluated for unexpected frequencies with the chi-squared test. Distributions of case severity rankings across three or more categories (e.g. haplotypes or countries) were evaluated with the Kruskal-Wallis H test by ranks. Distributions of case severity rankings across two categories were evaluated with the Mann-Whitney test by sum of ranks.
We extracted 692 SARS-CoV-2 genomic sequences originating from the USA, India, Italy, France and Spain from the GISAID database. These regions were chosen for being well represented among sequences with complete patient status metadata. The MST for these sequences shows a high degree of genotypic heterogeneity within each country, although clusters representing local dissemination of closely related genotypes were observed as well (Figure 2). Figure 2 also illustrates that strains deriving from the USA and India show global representation. Of note, certain types are genuinely pandemic whereas others are more geographically restricted. Overall, there was a significant association between haplotype and case severity (H = 2.360; p = 0.016743) (Figure 3). There was also a strong association (H = 58.285; p < 0.000001) between case severity and country (Figure 4).
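The three tests named in the statistical analysis can be sketched in plain Python to show what each statistic measures. This is an illustrative sketch, not the software used in the study: the function names are hypothetical, only the test statistics are computed (p-values would require the chi-squared and rank distributions from a statistics library), and severity scores are assumed to be the integers 1 to 6 described above.

```python
# Hypothetical sketch of the test statistics; p-values are not computed.
from collections import Counter


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H for a list of samples.

    Raises ZeroDivisionError if every pooled value is identical.
    """
    pooled = [v for group in groups for v in group]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    return h / (1 - ties / (n ** 3 - n))


def mann_whitney_u(x, y):
    """Mann-Whitney U statistic for sample x, from its rank sum."""
    ranks = average_ranks(list(x) + list(y))
    return sum(ranks[:len(x)]) - len(x) * (len(x) + 1) / 2


def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = sum(
        (obs - rt * ct / total) ** 2 / (rt * ct / total)
        for row, rt in zip(table, row_totals)
        for obs, ct in zip(row, col_totals)
    )
    return stat, (len(table) - 1) * (len(col_totals) - 1)
```

For example, `kruskal_wallis_h` would take one list of severity scores per haplotype or per country, while `chi_squared` would take a haplotype-by-country table of counts.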
Furthermore, a contingency table shows a highly significant association (chi-square = 597.170, p < 0.000001) between haplotype and country (Table 3). It shows that L.GL.YP.QT is widespread but predominates in Italy; that L.GL.YP.HT is found primarily in India; that S.DP.YP.QT is prominent mostly in Spain; and that L.GL.YP.HI predominates in the United States. An examination of case severity versus haplotype within each country showed mixed results; only data from Italy and Spain showed a significant association (Table 4). To minimize geographic factors while maximizing genetic diversity, we selected the sequences from California for further analysis. As shown in Table 5, these 133 sequences included all nine haplotypes, and 20 of the sequences were "D" types. A single CA sequence was submitted by Naval Health Research Center. A Kruskal-Wallis test by ranks did not show a statistically significant association between haplotype and case severity (Figure 5). However, there was an apparent trend with regard to the D614G mutation (Figure 6). By grouping the haplotypes into "D" and "G" types, a Mann-Whitney test revealed a significant association between the D614G genotypes and case severity (p = 0.031085). This is once more reflected in the MST (Figure 7), where all of the deceased patients are shown to fall within the G allele group.
Several studies have addressed the relevance of human genetic polymorphism in severity and mortality of COVID-19 (Bosso et al, 2020; Li et al, 2020; Lu et al, 2020; pathogen adaptation and evolution. The relevance of viral variation in this respect has been studied by Parlikar et al (2020), who analyzed 167 SARS-CoV-2, 312 SARS-CoV, and 5 Pangolin CoV genomes to help understand their origin and evolution. The phylogeny of the subgenus Sarbecovirus confirmed that SARS-CoV-2 strains evolved from their common ancestors putatively residing in bat or pangolin hosts.
These authors predicted a few country-specific patterns of relatedness but failed to document any relatedness between genotypes and disease phenotypes in human patients. Two other recent publications again touch upon a lack of viral variation in the development of more or less serious disease. In the review by Callaway et al (2020) it is concluded that viral mutations do not contribute to mortality and that more likely than not environmental conditions have a more significant clinical impact than viral variation. Zhang et al (2020) conclude similarly, based on the bioinformatic analyses of experimentally defined genome sequences. In this study, the number of clinical isolates sequenced may have been a limiting factor. We have here set out to correlate viral genotypes with host phenotypes in more detail using a large number of SARS-CoV-2 genome sequences from a broader geographic origin. We show that genotypic variants across multiple geographic regions are associated with variation in case severity. Given the likelihood that both genotype and case severity are influenced by other geographic factors, we controlled for geographic variation by focusing on one region with a relatively high degree of genotypic variation. Within this region, we showed a significant association between the D614G mutation and case severity. We also demonstrated that controlling for confounding parameters had a big effect on retrieving significant correlations between viral types and pathogenicity within patients. The D614G mutation has received a great deal of attention with respect to its rapid global dissemination (Dearlove et al, 2020) and its significant influence on the spike protein's affinity for the ACE2 receptor. Recent studies demonstrated that in situ images of S trimer conformational changes were affected by the D614G substitution (Ke et al, 2020) . This mutation abolishes a salt bridge to K854 and may reduce folding of the 833-854 loop. 
It has been suggested (Korber et al, 2020) that this mutation increases the virus' transmissibility, without necessarily increasing its virulence, thereby explaining its rapid spread in multiple locations. A counterargument (Grubaugh et al, 2020) has proposed that genetic drift and founder effects could also explain this pattern. More recently, the D614G mutation was identified as a marker associated with fatality rate at a countrywide level (Toyoshima et al, 2020). Our current results support these findings independently, using a completely different set of sequences and an alternative bioinformatic approach, and here show that this mutation could in fact result in increased case severity. However, we cannot rule out the possibility that transmissibility and virulence are not independent. Even if 614G is not more virulent than its D614 ancestor, ease of transmission could lead to higher viral loads in actual patients, thereby increasing the likelihood of severe cases. The polymorphisms we have identified in this project may have an effect on case severity when other factors are controlled for, but this effect is swamped out by these other factors when comparing cases across different geographic regions. Future studies should investigate the relationships among genotype, viral load, and patient outcome to sort out the underlying mechanisms. Although this study focused on genotypes that were of particular interest at the time the data were gathered, our approach could be adapted easily to novel variants such as B.1.1.7, first observed in the UK (Public Health England, 2020). A recent update to the BIONUMERICS SARS-CoV-2 plugin includes a tool to identify mutations relative to the reference sequence that are monomorphic for the samples of interest. For example, a set of known B.1.1.7 samples can be used to define a set of characteristic mutations, which can then be used to identify unknown samples.
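The idea of defining characteristic mutations from known samples can be sketched as a set intersection. This is only an illustration of the principle, not the plugin's implementation: the function names and the mutation labels are hypothetical placeholders, and each sample is assumed to be represented by the set of its mutations relative to the reference.

```python
# Hypothetical sketch: mutations shared by all known samples of a variant,
# then used as a signature to screen unknown samples.


def characteristic_mutations(known_samples: dict) -> set:
    """Return the mutations present in every known sample (monomorphic)."""
    mutation_sets = [set(mutations) for mutations in known_samples.values()]
    if not mutation_sets:
        return set()
    return set.intersection(*mutation_sets)


def matches_signature(sample_mutations, signature: set) -> bool:
    """True if a sample carries every mutation of a non-empty signature."""
    return bool(signature) and signature <= set(sample_mutations)


if __name__ == "__main__":
    # Placeholder labels, not real variant definitions.
    known = {
        "known_1": {"mutA", "mutB", "mutC"},
        "known_2": {"mutA", "mutB", "mutD"},
    }
    signature = characteristic_mutations(known)
    print(sorted(signature))
    print(matches_signature({"mutA", "mutB", "mutE"}, signature))
    print(matches_signature({"mutA", "mutE"}, signature))
```

A strict intersection is sensitive to sequencing gaps in the known samples, so a tolerance for missing positions may be needed in practice.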
Once samples are characterized as variants in this way, they can be compared to other variants in terms of geography, patient outcome, and other epidemiological factors.
provide the software used free of charge to all except for BIONUMERICS evaluation licenses and for a limited period of a month only. We do welcome collaborations in order to expand the current type of analyses and look forward to suggestions to that effect. bioMérieux marketing and sales departments had no part in the design and the written documentation of this work.
We show here that different allelic variants among 692 SARS-CoV-2 genome sequences display a statistically significant association with geographic origin (p < 0.000001) and also COVID-19 case severity (p = 0.016). Geographic variation in itself is associated with both case severity and allelic variation, especially in strains of Indian origin (p < 0.000001). We were able to show that the presence of the D614G mutation correlates with increased case severity in a sample of 127 sequences from a shared geographic origin in the US (p = 0.018). While leaving open the question of the mechanism involved, this suggests that in specific geographic locales certain genotypes of the virus are more pathogenic than others.
ACE2 and TMPRSS2 variants and expression as candidates to sex and country differences in COVID-19 severity in Italy
The Two Faces of ACE2: The Role of ACE2 Receptor and Its Polymorphisms in Hypertension and COVID-19. Version 2. Mol Ther Methods Clin Dev
Six months of coronavirus: the mysteries scientists are still racing to solve
A SARS-CoV-2 vaccine candidate would likely match all currently circulating variants. Online ahead of print
Correlation between rules-based interpretation and virtual phenotype interpretation of HIV-1 genotypes for predicting drug resistance in HIV-infected individuals
Factors Leading to High Morbidity and Mortality of COVID-19 in Patients with Type 2 Diabetes
Genome-wide analysis of Indian SARS-CoV-2 genomes for the identification of genetic mutation and SNP
Hidden genomic diversity of SARS-CoV-2: implications for qRT-PCR diagnostics and transmission
Haplotype networks of SARS-CoV-2 infections in the Diamond Princess cruise ship outbreak. Online ahead of print
Numerical Taxonomy: The Principles and Practice of Numerical Classification
Patient characteristics and predictors of mortality in 470 adults admitted to a district general hospital in England with Covid-19. medRxiv 2020 preprint Doi:
using OpenSAFELY
Analysis of genomic distributions of SARS-CoV-2 reveals a dominant strain type with strong allelic associations
Effects of a major deletion in the SARS-CoV-2 genome on the severity of infection and the inflammatory response: an observational cohort study
Association of hypertension, diabetes, stroke, cancer, kidney disease, and high-cholesterol with COVID-19 disease severity and fatality: A systematic review
We gratefully acknowledge Dr Maud Tournoud (bioMérieux, Data Analytics,