key: cord-0737508-1uckspig
authors: Chan, F.; Ataide, R.; Richards, J. S.; Narh, C. A.
title: Contrasting epidemiology and population genetics of COVID-19 infections defined with 74 polymorphic loci in SARS-CoV-2 genomes sampled globally.
date: 2021-04-26
journal: nan
DOI: 10.1101/2021.04.25.21255897
sha: d3e8c08b48e9cf0eaa30204c07b8037a02f4c495
doc_id: 737508
cord_uid: 1uckspig

SARS-CoV-2, the coronavirus causing COVID-19, has infected and killed several millions of people worldwide. Since the first COVID-19 outbreak in December 2019, SARS-CoV-2 has evolved with a few genetic variants associated with higher infectivity. We aimed to identify polymorphic loci in SARS-CoV-2 that can be used to define and monitor the viral epidemiology and population genetics in different geographical regions. Between December 2019 and September 2020, we sampled 5,959 SARS-CoV-2 genomes. More than 80% of the genomes sampled in Africa, Asia, Europe, North America, Oceania and South America were reportedly isolated from clinical infections in older patients (≥ 20 years). We used the first indexed genome (NC_045512.2) as a reference and constructed multilocus genotypes (MLGs) for each sampled genome based on amino acids detected at 74 polymorphic loci located in ORF1ab, ORF3a, ORF8, matrix (M), nucleocapsid (N) and spike (S) genes. Eight of the 74 loci were informative in estimating the risk of carrying infections with mutant alleles among different age groups, gender and geographical regions. Four mutant alleles (ORF1ab L4715, S G614, and N K203 and R204) reached 90% prevalence globally, coinciding with peaks in transmission but not COVID-19 severity, from March to August 2020. During this period, the MLG genetic diversity was moderate in Asia, Oceania and North America, in contrast to Africa, Europe and South America, where lower genetic diversity and absence of linkage disequilibrium indicated clonal SARS-CoV-2 transmission.
Despite close relatedness to Asian MLGs, MLGs in the global population were genetically differentiated by geographic region, suggesting structure in SARS-CoV-2 populations. Our findings demonstrate the utility of the 74 loci as a genetic tool to study and monitor SARS-CoV-2 transmission dynamics and evolution, which can inform future control interventions.

little is known about how these epidemiological differences relate to infection dynamics.

i.e., RdRP). Other ORFs, including ORF3a (induces apoptosis in host cells), ORF6, ORF7a, ORF8 (mediates immune suppression and evasion) and ORF10 encode accessory proteins that are involved in viral replication and host immune dysregulation [4].

Since first being identified, the virus has evolved, with numerous genetic variants being associated with higher infectivity. The geographical distribution and probable risk factors (e.g., demographics and clinical factors) for infection with mutant genotypes remain unknown. Comparative genomic analysis of SARS-CoV-2 infections collected globally suggests that the virus is adapting to its human host. A few genetic variants harbouring E484K, N501Y and D614G mutations in the S protein have been associated with higher infectivity than the wild-type variant, Wuhan NC_045512.2 [5, 6]. Variants with these mutations rose to predominance in many parts of the United Kingdom and South Africa [7, 8]. Other mutations, including orf1ab P4715L, Orf3a G251V and orf8 L84S, have been associated with higher infectivity and viral density, respectively [9]. Whether these and other unreported mutations are linked, under selection and can be used to source-track infections within and between different geographical regions has not been investigated.
More polymorphic and informative loci, representative of the global SARS-CoV-2 genetic diversity, are needed to accurately differentiate closely related variants and interrogate the virus population genetics in different geographical regions [9, 10]. Comparison of SARS-CoV-2 whole genomes identified phylogenetic clusters, defined by single nucleotide polymorphisms (SNPs) in < 10 codons/loci, that differentiated European and Asian infections [11, 12]; however, these loci lacked the needed resolution to differentiate variants circulating globally.

The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. This version posted April 26, 2021. https://doi.org/10.1101/2021.04.25.21255897

Multilocus genotyping using amino acid changes in SARS-CoV-2 can reduce the complexity in the genomic data and yield informative and virologically relevant data that provide insights into the transmission dynamics and evolution of variants causing COVID-19. Applied to multiple polymorphic loci, this approach can estimate and monitor the genetic diversity of SARS-CoV-2 populations spatiotemporally and in response to control interventions.

This study evaluated the epidemiology and population genetics of 5,959 SARS-CoV-2 genomes sampled globally to identify risk factors associated with infection with mutant variants and gain insights into how the viral population had evolved geographically eight months into the pandemic. Briefly, we identified 74 polymorphic loci, of which eight loci located in orf1ab, orf3a, orf8, N and S genes, were considered informative in explaining the
To determine linkage disequilibrium (LD), i.e., non-random association of alleles at two or

To account for this, the LD was clone-corrected using the dataset consisting of unique MLGs. To determine whether the LD was 'structured' between specific gene pairs, a pairwise LD analysis was performed as described elsewhere [19]. The r̄d score ranges from 0 to 1, with 0 indicating no LD and 1 indicating complete LD. The statistical significance of the score was supported by a P-value < 0.05.

Genetic differentiation (Nei's GST) among MLG populations within and between continents was estimated using mmod [20]. The GST score ranges from 0 (no genetic differentiation, i.e.,

The majority of the SARS-CoV-2 genomes were reportedly isolated from throat swabs and were sequenced using Illumina. Half of the genomes we analyzed had the associated data on the specimen type collected for diagnosis or isolating the virus genome. URT specimens constituted 92.0%. Of these, throat swabs were the majority (65.1%) (Figure S2). This was observed for all the study variables (Figures S3-S5) except in South America, where more than 60.3% of the genomes were isolated from nose swabs (Figure S4).
Globally, more than 68.0% of the genomes we sampled were reportedly sequenced using Illumina except in Asia and South America, where a higher proportion of genomes were reportedly sequenced using Ion Torrent (39.53%) and Nanopore (43.10%), respectively (Figure S6).

i.e. associated with demographic processes [30, 31]. A few others, considered homoplasic (recur independently) and adaptive (associated with viral transmissibility and/or pathogenicity), have been detected in clinical infections circulating worldwide [9, 31, 32]. Based on these reports and our filtering criterion of a minor allele frequency (MAF) ≥ 0.01, 74 polymorphic loci (≥ 2 alleles) were used to construct the MLGs (Figure S7). These loci are located within ORF1ab (NSP1, NSP2, NSP3, NSP4, NSP5, NSP8, NSP12, NSP13 and NSP14), two accessory proteins (ORF3a and ORF8), and three structural proteins (M, N and S) (Table S3). The moderate genetic differentiation (Gst = ~0.10) observed for these loci demonstrates their utility as markers for differentiating SARS-CoV-2 variants (Table S4). The orf1ab gene was the most polymorphic loci (Table S3- Table S4). These alleles, particularly those in the ORF1ab, have been considered neutral [31]. It is worth noting that the N501Y mutation in the S protein, associated with higher infectivity among UK and South African variants [5, 6], was carried by 0.8% of the genomes we sampled from Australia, Oceania. These Table S3).
infections carrying mutant alleles [39]. Interestingly, compared to asymptomatic cases, severe cases in Africa, Asia and North America were more likely to harbour the S D614 allele (Figure 2). Although the S G614 has been associated with enhanced viral transmission, our data indicate that it was not associated with severe COVID-19, as reported elsewhere [40].

(Table S5). South America had the lowest number, 40 (Table 3 and Table 2). Our data are consistent with previous reports indicating that SARS-CoV-2 phylogenies in the latter two regions depicted the diversity that existed worldwide as of July 2020 [37, 42, 43]. There was significant genome-wide LD (non-random association among alleles) both at the global and continental levels (r̄d ≥ 0.034, P-value < 0.001) (Table 2). However, the decay of this LD (r̄d ≤ 0.007, P-value ≥ 0.166) in Africa, Europe and South America after repeating the analysis with the unique MLGs was indicative of clonal SARS-CoV-2 transmission in these geographical regions. Indeed, multiple outbreaks in Europe and South America were due to local transmissions and were largely associated with clusters of closely related infections [43, 44].

Interestingly, we observed that the genome-wide LD was driven by specific gene pairs, i.e., 'structure'. We detected the strongest LD signal (r̄d ≥ 0.3, P-value < 0.001) between NSP12 and S, NSP1 and ORF3a, NSP4 and ORF8, and NSP4 and N (Figure 3B and Figure S12), consistent with previous reports using nucleotide data [39].
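The r̄d values reported above can be illustrated with a minimal Python sketch. It assumes that r̄d denotes the standardized index of association as it is usually defined for haploid multilocus genotypes (the variance of pairwise allelic distances, standardized over locus pairs), which is the quantity reported by the Poppr package listed among the references. The function names `pop_var` and `rbar_d` and the toy genotypes are hypothetical and are not taken from the study; the permutation test behind the P-values is not shown.

```python
from itertools import combinations
from math import sqrt


def pop_var(values):
    """Population variance (divide by n) of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def rbar_d(genotypes):
    """Standardized index of association for haploid multilocus genotypes.

    genotypes: list of equal-length sequences, one allele (for example,
    an amino acid letter) per locus.
    """
    if len(genotypes) < 2:
        raise ValueError("need at least two genomes")
    n_loci = len(genotypes[0])
    pairs = list(combinations(genotypes, 2))
    # Per-locus distance for every pair of genomes: 0 if alleles match, else 1.
    dists = [[int(a[j] != b[j]) for a, b in pairs] for j in range(n_loci)]
    # Total distance for each pair, summed over all loci.
    totals = [sum(per_pair) for per_pair in zip(*dists)]
    var_obs = pop_var(totals)                  # observed variance of distances
    var_loci = [pop_var(d) for d in dists]     # variance contributed by each locus
    var_exp = sum(var_loci)                    # expected variance with no linkage
    denom = 2 * sum(sqrt(vj * vk) for vj, vk in combinations(var_loci, 2))
    if denom == 0:
        raise ValueError("need at least two polymorphic loci")
    return (var_obs - var_exp) / denom


# Two loci whose alleles always co-occur (complete LD in this toy sample).
linked = ["LG", "LG", "PD", "PD"]
print(rbar_d(linked))
```

Clone correction, as described in the text, corresponds to calling the function on the unique MLGs only, for example `rbar_d(sorted(set(mlgs)))` for a list `mlgs` of genotype strings. A drop in the statistic after that step is what the text interprets as a signal of clonal transmission.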
While the former three LD

Table 2: Genetic diversity estimates for SARS-CoV-2 populations circulating globally.

Figure 4B. The majority of MLGs in each continent were predicted to have 20-50% geographical assignment to Asia and Oceania (Figure 4B). These data suggest that the majority of SARS-CoV-2 variants in the world are of Asian descent. They also support contact tracing data that showed that most SARS-CoV-2 cases during the early (March to May) stages of the pandemic were linked to imported cases from Asia [48]. However, a few MLGs were unique to each continent, as indicated by the 40-60% within-continent membership assignments. Prominent among continental clusters were the AC1 and OC1 clusters, detected in Asia and Oceania, respectively (Figure 4A). The ~60% membership assignment of African MLGs to Europe (Figure 4B) suggested that most African infections were more likely to have been imported from Europe, contradicting previous reports of an American source based solely on travel data [49, 50]. This underscores the need to build strong surveillance systems utilizing both travel and genomic data.

SARS-CoV-2 MLGs were genetically differentiated within and between continents. We detected many mutant alleles restricted to each continent (Figure 4D).

Spatial connectivity, including cross-border migrations and international travel, is a major conduit for spreading the virus (i.e., resulting in gene flow) among countries. Hence, we expected to see little genetic differentiation among 'regional blocks' within a continent.
This hypothesis was valid for Europe, where there was little to moderate genetic differentiation (Gst ≤ 0.181) among regional blocks except between Eastern and Northern Europe (Gst = 0.248) (Figure S12). Interestingly, we detected moderate-to-high genetic differentiation (Gst ≥ 0.111) among regional blocks in Africa and Asia (Figure S12). This may reflect the fast and strict travel bans that were put into effect early during the spread of the virus in both regions.

Conclusion. The disproportionate distribution of SARS-CoV-2 genomes among the young and older age groups in this study was representative of the age distribution of COVID-19 cases reported globally. Throat swabs were the preferred specimen for COVID-19 diagnosis, but the invasive nature of the sampling may limit their utility for surveillance. In building a robust and efficient surveillance system, access to affordable sequencing and rapid analysis of complex genomic data will be seminal to inform control efforts. The utility of the 74 polymorphic loci as markers for studying the epidemiology and population genetics of SARS-

Proceedings of the National Academy of Sciences, 2020. 117(49): p. 31519-31526.
The spatiotemporal estimation of the risk and the international transmission of COVID-19: a global perspective
Change in global transmission rates of COVID-19 through
Genomic Cues From Beta-Coronaviruses and Mammalian Hosts Sheds Light on Probable Origins and Infectivity of SARS-CoV-2 Causing COVID-19
Minireview of progress in the structural study of SARS-CoV-2 proteins. Current Research in Microbial Sciences
Covid-19: New coronavirus variant is identified in UK
SARS-CoV-2 spike-protein D614G mutation increases virion spike density and infectivity
Emergence and spread of a SARS-CoV-2 variant through Europe in the summer of 2020. medRxiv
Patient-derived SARS-CoV-2 mutations impact viral replication dynamics and infectivity in vitro and with clinical implications in vivo
The population genetics and evolutionary epidemiology of RNA viruses
Phylogenetic network analysis of SARS-CoV-2 genomes
Geographic and Genomic Distribution of SARS-CoV-2 Mutations. Frontiers in microbiology
Data, disease and diplomacy: GISAID's innovative contribution to global health
Geneious Basic: An integrated and extendable desktop software platform for the organization and analysis of sequence data
Minimap2: pairwise alignment for nucleotide sequences
Poppr: an R package for genetic analysis of populations with clonal, partially clonal, and/or sexual reproduction
vegan: Community Ecology Package. R package version 2.4-3. Vienna: R Foundation for Statistical Computing
Novel measures of linkage disequilibrium that correct the bias due to population structure and relatedness
Clonal interference can cause wavelet-like oscillations of multilocus linkage disequilibrium
MMOD: An R library for the calculation of population differentiation statistics
A Standardized Genetic Differentiation Measure. Evolution
Discriminant analysis of principal components: a new method for the analysis of genetically structured populations
adegenet: a R package for the multivariate analysis of genetic markers
ggplot2: elegant graphics for data analysis
PHYLOViZ: phylogenetic inference and data visualization for sequence based typing methods
RCore, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing
Populations skew older in some of the countries hit hard by COVID-19
Age-dependent effects in the transmission and control of COVID-19 epidemics
Antibody response of patients with severe acute respiratory syndrome (SARS) targets the viral nucleocapsid
Emergence of genomic diversity and recurrent mutations in SARS-CoV-2. Infection, Genetics and Evolution
Mutational spectra of SARS-CoV-2 orf1ab polyprotein and signature mutations in the United States of
Emerging SARS-CoV-2 mutation hot spots include a novel RNA-dependent-RNA polymerase variant
The global population of SARS-CoV-2 is composed of six major subtypes
No evidence for increased transmissibility from recurrent mutations in SARS-CoV-2
Antigenic variation of SARS-CoV-2 in response to immune pressure
Effect of internationally imported cases on internal spread of COVID-19: a mathematical modelling study. The Lancet Public Health
Euro surveillance : bulletin Europeen sur les maladies transmissibles = European communicable disease bulletin
Global analysis of more than 50,000 SARS-CoV-2 genomes reveals epistasis between eight viral genes
Tracking Changes in SARS-CoV-2 Spike: Evidence that D614G Increases Infectivity of the COVID-19 Virus
Review of Burden, Clinical Definitions, and Management of COVID-19 Cases. The American journal of tropical medicine and hygiene
Genomic epidemiology reveals transmission patterns and dynamics of SARS-CoV-2 in Aotearoa New
Genetic grouping of SARS-CoV-2 coronavirus sequences using informative subtype markers for pandemic spread visualization
Genetic variants and source of introduction of SARS-CoV-2 in South America
A persistently replicating SARS-CoV-2 variant derived from an asymptomatic individual
Genome-wide analysis of SARS-CoV-2 virus strains circulating worldwide implicates heterogeneity
Host Immune Response Driving SARS-CoV-2 Evolution. Viruses
Establishment and lineage dynamics of the SARS-CoV-2 epidemic in the UK
Importations of COVID-19 into African countries and risk of onward spread
Tracking the COVID-19 pandemic in Australia using genomics
Infrastructure Support Scheme. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Acknowledgements. We are grateful to the authors from the laboratories responsible for obtaining the specimens and the submitting laboratories where genetic sequence data were generated and shared via the GISAID Initiative, on which this research is based.

and N K203 and R204 were associated with spikes in COVID-19 cases.

The adjusted odds ratio (OR) with the P-value are shown with orange, blue and grey matrices indicating OR > 1 and P-value < 0.05, OR < 1 and P-value < 0.05, and OR < 1 or OR > 1 and P-value > 0.05, respectively. The OR was not estimated for study variables with < 5 samples.