key: cord-0335048-mmr3blyc authors: Silva, B. W. C. d.; Felix, P. T. title: Analysis of the genetic diversity of SARS-CoV-2 genomes carrying the Omicron B.1.1.529 mutation date: 2022-05-25 journal: nan DOI: 10.1101/2022.05.21.22275421 sha: 5599ed147524807c9bf4c32168bfca675ae7db84 doc_id: 335048 cord_uid: mmr3blyc

In this work, we evaluated the levels of genetic diversity in 95 genomes of carriers of the Omicron B.1.1.529 mutation in SARS-CoV-2 from South Africa, Asia, Massachusetts (USA), Rhode Island (USA), the United Kingdom and Germany. All were 29,996 bp long, recovered from GenBank and publicly available at the National Center for Biotechnology Information (NCBI). All gaps and conserved sites were removed before constructing a phylogenetic tree and applying specific methodologies to estimate pairwise FST, molecular variance (AMOVA), genetic distance, mismatch, demographic expansion, molecular diversity and evolutionary divergence time, always with 20,000 random permutations. The results revealed the presence of only 75 parsimony-informative sites among the 29,996 bp analyzed. The analyses based on FST values confirmed the absence of distinct genetic structuring, with a fixation index of 98% and with a greater component of population variation (6%) for p < 0.05. Tau variations (related to the ancestry of the groups) did not reveal significant moments of divergence, supported by the mismatch analysis of the observed distribution (τ = 0%). It is safe to say that the large number of existing polymorphisms reflects major changes in the protein products of viral populations in all countries, and especially in South Africa.
Under this consideration, although there are large differences between the haplotypes studied, these differences are minimal within the geographically defined populations, and it therefore does not seem safe to extrapolate the polymorphism and molecular diversity levels found in the Omicron B.1.1.529 variant of SARS-CoV-2 to wild genomes or other mutants. This warns us that, given the variant's higher speed of transmission and infection, molecular adjustments to vaccines already in use may become necessary in the near future.

Since the discovery of the Coronaviridae family, problems and concerns have been constant in human populations. Initially with the SARS-CoV virus, because of the onset of severe acute respiratory syndrome, and which, a few years later, evolved into [...] vaccines, people with incomplete vaccination regimens show a certain degree of protection against this variant, and their immune response improves substantially with the application of booster doses. This brings us back to the fact that complete vaccination schemes are the main allies in containing all variants, and especially the Omicron variant (Chekol et al., 2022). Thus, to understand a little more about the molecular diversity of the Omicron variant, 95 complete genome sequences of SARS-CoV-2 (6 from South Africa, 55 from Asia, 10 from the state of Massachusetts (USA), 2 from the state of Rhode Island (USA), 20 from the United Kingdom and 2 from Germany) were retrieved from the National Center for Biotechnology Information (NCBI) and submitted to phylogenetic and molecular variance (AMOVA) tests in order to understand, among other things, the variant's demographic and spatial expansion.

The objective was to evaluate the possible levels of genetic diversity and the polymorphisms existing in 95 mutant haplotypes of the Omicron B.1.1.529 variant of SARS-CoV-2.
[...] (02), all 29,996 bp long, were recovered from GenBank (https://www.ncbi.nlm.nih.gov/labs/gquery/all/?term=omicron&utm_source=Datasets) on April 20, 2022. Once the sequences were aligned using the MEGA X program (TAMURA et al., 2018), ambiguous sites, missing data and gaps were excluded. Estimates of molecular variance (AMOVA), genetic distance, mismatch, demographic and spatial expansion, molecular diversity and evolutionary divergence time were obtained with the software Arlequin v. 3.5 (EXCOFFIER et al., 2005) using 1000 random permutations (NEI and KUMAR, 2000). The FST and geographic distance matrices were not compared. All steps of this process are described below.

Among the routines of LaBECom, this test is used to measure genetic diversity, which is equivalent to the heterozygosity expected in the studied groups. For this we used the standard index of genetic diversity H, described by Nei (1987), which can also be estimated by the method proposed by PONS and PETIT (1995).

According to LaBECom protocols, we used this analytical test of the site frequency spectrum (SFS), computed from DNA sequence data, which allows us to estimate the demographic parameters of the frequency spectrum. Simulations are made using the fastsimcoal2 software, available at http://cmpg.unibe.ch/software/fastsimcoal2/.

Molecular diversity indices are obtained from the average number of pairwise differences, as described by Tajima in 1993. In this test we used sequences that do not fit the neutral theory model, which establishes the existence of a balance between mutation and genetic drift.

Theta population parameters are used in our Laboratory when we want to qualify the genetic diversity of the studied populations. These estimates comprise Theta (Hom), which aims to estimate the expected homozygosity in a population at equilibrium between drift and mutation, and the estimates Theta (S) (WATTERSON, 1975), Theta (K) (EWENS, 1972) and Theta (π) (TAJIMA, 1983).
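The diversity indices named above can be illustrated with a minimal sketch. It assumes the usual textbook forms: Nei's gene diversity H = n/(n-1) · (1 - Σ p_i²), Watterson's Theta (S) = S / Σ_{i=1}^{n-1} (1/i), and Theta (π) as the mean number of pairwise differences. The function names and the toy alignment are hypothetical and are not taken from the study, which used Arlequin for these estimates.

```python
from collections import Counter
from itertools import combinations


def gene_diversity(haplotypes):
    """Nei's gene diversity H = n/(n-1) * (1 - sum of squared haplotype frequencies)."""
    n = len(haplotypes)
    counts = Counter(haplotypes)
    return n / (n - 1) * (1 - sum((c / n) ** 2 for c in counts.values()))


def segregating_sites(seqs):
    """Number of alignment columns showing more than one nucleotide (S)."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def theta_s(seqs):
    """Watterson's estimator: S divided by the harmonic sum a1 = 1 + 1/2 + ... + 1/(n-1)."""
    a1 = sum(1 / i for i in range(1, len(seqs)))
    return segregating_sites(seqs) / a1


def theta_pi(seqs):
    """Mean number of pairwise differences between sequences."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return total / len(pairs)


# Hypothetical toy alignment of four aligned, gap-free sequences.
toy = ["ACGTAC", "ACGTAT", "ACGAAC", "ACGTAC"]
print(gene_diversity(toy), theta_s(toy), theta_pi(toy))
```

On real data the sequences would first need to be aligned and stripped of gaps and ambiguous sites, as described in the text.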
In LaBECom, analyses of the mismatch distribution are always performed by relating the observed number of differences between haplotype pairs, in an attempt to define or establish a pattern of population demographic behavior, as already described by ROGERS and HARPENDING (1992), HUDSON and SLATKIN (1991), RAY et al. (2003) and EXCOFFIER (2004).

The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. This version posted May 25, 2022.

This model is always used when we intend to estimate the probability of the differences observed between two non-recombined, randomly chosen haplotypes. In our laboratory this methodology is used when we assume that the expansion, in a haploid population, reached a momentary balance after τ generations, going from size N0 to size N1. In this case, the probability of observing S differences between two non-recombined, randomly chosen haplotypes is given by the probability of observing two haplotypes with S differences in this population (Watterson, 1975).

We use these inferences for haplotypic data or for genotypic data with unknown gametic phase. Following our protocol, inferences are estimated by relating haplotype i to x_i, its number of copies in the sample, generating an estimated frequency (p̂_i). For genotypic data with unknown gametic phase, haplotype frequencies are estimated by the maximum likelihood method, and can also be estimated using the Expectation-Maximization (EM) algorithm.

This method, when used in LaBECom, allows us to estimate a corrected percentage of how different two haplotypes are. The correction allows us to assume that there have been several substitutions per site since the most recent common ancestor of the two haplotypes studied.
Here we also assume identical substitution rates for all four nucleotides A, C, G and T.

The next correction, much like the previous one, allows for multiple substitutions per site, but takes into account different substitution rates for transitions and transversions.

We at LaBECom understand this method as an extension of the Kimura 2-parameter method, which also allows the estimation of frequencies for different haplotypes. However, transition-transversion ratios as well as overall nucleotide frequencies are calculated from the original data.

At this stage, we were also able to produce a corrected percentage of nucleotides for which two haplotypes are different, but this correction is an extension of the Jukes and Cantor method, with the difference of being able to do this from the original data.

As in Kimura's 2-parameter model and the Tajima and Nei distance, this correction allows different rates of transversions and transitions to be inferred, and it can also distinguish transition rates between purines and pyrimidines.

To calculate the distance between OTUs (operational taxonomic units) from the pairwise distance matrix of haplotypes, we used a Minimum Spanning Network (MSN) tree, with a slight modification of the algorithm described in Rohlf (1973). We usually use a free program written in Pascal called MINSPNET.EXE, which runs under DOS and was previously available at: http://anthropologie.unige.ch/LGB/software/win/min-span-net/.

To estimate haplotype frequencies we used the maximum likelihood model with an algorithm that maximizes the expected values. The use of this algorithm in LaBECom allows us to obtain maximum likelihood estimates from multilocus data whose gametic phase is unknown (phenotypic data).
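The first two distance corrections described in this part of the methods read like the Jukes-Cantor and Kimura 2-parameter corrections; this is an interpretation, since the method headings are missing from the text. The sketch below uses their standard forms, d = -(3/4) ln(1 - 4p/3) for equal substitution rates and d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q) when transitions (P) and transversions (Q) are counted separately. Function names and example sequences are hypothetical, and the study itself relied on Arlequin rather than on code like this.

```python
import math

# Purine-purine and pyrimidine-pyrimidine changes are transitions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def difference_proportions(seq1, seq2):
    """Return (P, Q): proportions of transitional and transversional differences."""
    compared = transitions = transversions = 0
    for a, b in zip(seq1, seq2):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous sites
        compared += 1
        if a == b:
            continue
        if (a, b) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return transitions / compared, transversions / compared


def jukes_cantor(seq1, seq2):
    """Corrected distance assuming equal substitution rates for all nucleotides."""
    p_ts, p_tv = difference_proportions(seq1, seq2)
    p = p_ts + p_tv
    if p >= 0.75:
        raise ValueError("sequences too divergent for this correction")
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(seq1, seq2):
    """Corrected distance with separate transition and transversion rates."""
    p_ts, p_tv = difference_proportions(seq1, seq2)
    return -0.5 * math.log(1 - 2 * p_ts - p_tv) - 0.25 * math.log(1 - 2 * p_tv)


# Hypothetical pair differing by a single transition (C -> T) in 10 sites.
seq_a = "ACGTACGTAC"
seq_b = "ACGTACGTAT"
print(jukes_cantor(seq_a, seq_b), kimura_2p(seq_a, seq_b))
```

Both corrected distances are slightly larger than the raw proportion of differences (0.1 here), which is the point of correcting for multiple substitutions per site.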
Estimating haplotype frequencies when the gametic phase is unknown is a slightly more complex procedure, because a simple gene count is not possible: individuals in a population can be heterozygous at more than one locus.

Very similar to the previous algorithm, ELB attempts to reconstruct the (unknown) gametic phase of multilocus genotypes by adjusting the sizes and locations of neighboring loci to explore some rare recombination.

We use this test in LaBECom for both haploid and diploid data. This test is used only as a way to summarize the distribution of allelic frequencies, without taking into account its biological significance. It is based on the sampling theory of neutral alleles of Ewens (1972) and was tested by Watterson (1978). It is currently limited to sample sizes of 2,000 genes or fewer and 1,000 different alleles (haplotypes) or fewer. It is also used to test the hypothesis of selective neutrality and population equilibrium against natural selection or the presence of some advantageous alleles.

This test, created by Slatkin in 1994 and adapted by him in 1996, is used in our protocols when we want to compare the probabilities of random samples with those of observed samples.

This test, proposed by Chakraborty in 1990, serves to calculate the observed probability of a random neutral sample with a number of alleles equal to or greater than that observed. It is based on the infinite-allele model and on the sampling theory for neutral alleles of Ewens (1972).

We use this test in our Laboratory when the DNA sequences or RFLP haplotypes are short. It is based on the 1989 Tajima test, using the infinite-sites model without recombination.
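As an illustration of the statistic behind Tajima's (1989) test, the sketch below contrasts two estimators of theta, the mean number of pairwise differences (π) and S/a1 based on the number of segregating sites, and scales their difference by its expected standard deviation. It assumes the standard published constants (a1, a2, b1, b2, c1, c2, e1, e2); the function name and toy data are hypothetical, and significance testing by simulation, as done in Arlequin, is not shown.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(seqs):
    """Tajima's D for a list of aligned, gap-free sequences of equal length."""
    n = len(seqs)
    if n < 4:
        raise ValueError("at least 4 sequences are needed")

    # S: number of segregating (polymorphic) sites.
    s = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    if s == 0:
        raise ValueError("no segregating sites: D is undefined")

    # pi: mean number of pairwise differences.
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)

    # Constants that depend only on the sample size n.
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    # Difference between the two theta estimators over its standard deviation.
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


# Hypothetical toy alignment of four sequences.
toy_alignment = ["ACGTAC", "ACGTAT", "ACGAAC", "ACGTAC"]
print(tajimas_d(toy_alignment))
```

A negative value means that π is smaller than S/a1, that is, an excess of low-frequency variants relative to the neutral equilibrium expectation.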
Tajima's test compares two estimators of the mutation parameter theta.

Also based on the infinite-sites model without recombination, Fu's test is suitable for short DNA sequences or RFLP haplotypes. However, in this case, it assesses the observed probability of a random neutral sample with a number of alleles equal to or less than the observed value. In this case the estimator used is θ.

This stage is the most used in the LaBECom protocols because it reveals the genetic structure of populations by measuring their variances. This methodology, first defined by Cockerham in 1969 and later adapted by other researchers, is essentially similar to other approaches based on analyses of variance of gene frequencies, but takes into account the number of mutations between haplotypes. Once the population groups are defined, we can define a particular genetic structure to be tested; that is, we can create a hierarchical analysis of variance by dividing the total variance into covariance components, measuring intra-individual differences, inter-individual differences and/or inter-population differences.

In LaBECom, this tree is generated using the operational taxonomic units (OTUs). It is calculated from the matrix of pairwise distances using a modification of the algorithm described in Rohlf (1973).

We performed this analysis for each locus separately, as it is performed at the haplotypic level, and the variance components and F statistics are estimated for each locus separately, generating a more global panorama.

This is the analysis most often present in the work of LaBECom.
The pairwise FST parameters generated are always used, extremely reliably, to estimate the short-term genetic distances between the populations studied. In this model a slight algorithmic adaptation is applied to linearize the genetic distance with the time of population divergence (Reynolds et al., 1983; Slatkin, 1995).

Here we measured how much pairs of haploid populations of fixed size N diverged over t generations, based on FST indices.

We use this test in LaBECom when we want to know how much two haploid populations of size N have diverged since they split, t generations ago, from a population of identical size, having remained isolated and without migration. This is a demographic model and applies very well to the phylogeography work of our Laboratory.

In this test we assumed that the relationship between the gross (D) and net (DA) number of Nei differences between populations is the increase in genetic distance between populations (Nei and Li, 1979).

We use this method in LaBECom when we want to estimate the time of divergence between populations of unequal sizes (Gaggiotti and Excoffier, 2000), assuming that two populations diverged from an ancestral population of size N0 some t generations in the past, and that they have remained isolated from each other ever since. In this method we assume that even though the sizes of the two daughter populations are different, their sum will always correspond to the size of the ancestral population.
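The FST-based distances mentioned in this part of the methods can be sketched as simple transformations. The sketch assumes their commonly used forms: Reynolds' distance D = -ln(1 - FST), Slatkin's linearized distance D = FST / (1 - FST), and Nei's net number of differences D_A = π_XY - (π_X + π_Y) / 2. Function names and input values are hypothetical, and the unequal-size correction of Gaggiotti and Excoffier (2000) is not reproduced here.

```python
import math


def _check_fst(fst):
    """Both transformations are only defined for 0 <= FST < 1."""
    if not 0 <= fst < 1:
        raise ValueError("FST must lie in [0, 1)")


def reynolds_distance(fst):
    """Reynolds' distance, approximately proportional to t/N for short divergence times."""
    _check_fst(fst)
    return -math.log(1 - fst)


def slatkin_linearized_fst(fst):
    """Slatkin's linearized FST, expected to grow linearly with divergence time."""
    _check_fst(fst)
    return fst / (1 - fst)


def nei_net_differences(pi_between, pi_within_x, pi_within_y):
    """Net differences: between-population diversity minus mean within-population diversity."""
    return pi_between - (pi_within_x + pi_within_y) / 2


# Hypothetical values, chosen only to show the calls.
print(reynolds_distance(0.5), slatkin_linearized_fst(0.5))
print(nei_net_differences(3.0, 1.0, 2.0))
```

Both FST transformations are undefined at FST = 1 and grow without bound as FST approaches it, so estimates near 1, such as the 98% fixation index reported in this work, translate into very large distances.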
The divergence-time procedure is based on the comparison of intra- and inter-population diversities (π's), which have a large variance; this means that, for short divergence times, the average diversity found within populations may be higher than that observed between populations. These calculations should therefore be made only if the assumptions of a pure fission model are met and if the divergence time is relatively old. The results of this simulation show that this procedure leads to better results than other methods that do not take into account unequal population sizes, especially when the relative sizes of the daughter populations are in fact unequal.

We at LaBECom understand that this test is an analog of Fisher's exact test on a 2 × 2 contingency table, extended to an r × k contingency table. It was described in Raymond and Rousset (1995) and tests the hypothesis of a random distribution of k different haplotypes or genotypes among r populations.

Inspired by what had been described in Paetkau et al. (1995, 1997) and Waser and Strobeck (1998), this method determines the origin of specific individuals, given a list of potential source populations, and uses the allelic frequencies estimated in each sample from their original constitution.

We use this test when we suspect that natural selection affects genetic diversity among populations. This method builds on a 1966 idea of Cavalli-Sforza, further developed in a 1973 work by Lewontin and Krakauer.

All 95 [...] (Table 1). Using the UPGMA method, for the 75 parsimony-informative sites, it was possible to understand that the 34 haplotypes from the European and American continents
comprised 3 distinct groups, and it is even possible that there is haplotype sharing between the studied countries (Figure 2).

Figure 2. Evolutionary analysis by the maximum likelihood method. The evolutionary history was inferred using the maximum likelihood method and the Tamura 3-parameter model [1]. The tree with the highest log likelihood (-1366.35) is shown. The percentage of trees in which the associated taxa clustered together is shown next to the branches. The initial trees for the heuristic search were obtained automatically by applying the Neighbor-Join and BioNJ algorithms to a matrix of pairwise distances estimated using the Tamura 3-parameter model, and then selecting the topology with the superior log likelihood value. This analysis involved 38 nucleotide sequences. The evolutionary analyses were performed in MEGA X.

Analyses based on FST values confirmed the presence of several distinct genetic entities, with a fixation index of 98% and with a larger component of population variation of 2%, for p < 0.05.
Significant evolutionary divergences were observed within and between groups (Table 2), along with little genetic similarity between the groups that make up the American and European countries, as well as a greater evolutionary divergence between the sequences that make up the United Kingdom group (Table 3, Table 4, Figure 3 and Figure 4).

[...] (Table 3). The Tajima and Fu's FS tests showed disagreements between the estimates of general φ and π, but with positive and weakly significant values, indicating, once again, the presence of population expansions for all groups (Table 4). The irregularity index (R = Raggedness), with parametric bootstrap, simulated new values of φ for before and after a supposed demographic expansion and, in this case, assumed a value greater than zero for all groups, with the exception of the population of South Africa (Table 5), probably due to the small number of haplotypes analyzed (Figures 6 and 7).
With the use of methodologies for studying genetic diversity not yet applied to this customized PopSet of genomes of the Omicron B.1.1.529 variant of SARS-CoV-2, it was possible to detect immense genetic diversity among all haplotypes studied, with no evidence of a relationship among all localities. The few parsimony-informative sites of all viral genomes contain few inter-haplotypic variations, with a few exceptions such as the UK and Massachusetts populations. Perhaps, with the insertion of more haplotypes for all regions studied in the databases, this configuration may change. The groups described here presented null patterns of genetic structuring, diverging from the results obtained by Felix et al. (2020) for wild populations of SARS-CoV-2 from Latin American countries. These data suggest that the lack of structure present in the samples perhaps does not relate directly to gene flow.
Cases like this have also been supported by simple phylogenetic clustering methodologies such as UPGMA, which in this case, given the lack of a discontinuous pattern of genetic divergence between the groups, does not support the idea of possible isolation resulting from past fragmentation events, especially as the tree generated has a not very large number of branches and very few mutational steps. These few mutations have not yet been fixed by drift or by the founder effect, which does not accompany the behavior of dispersion and/or loss of intermediate haplotypes over generations. The values found for the genetic distance support the absence of this continuous pattern and the low divergence between the groups studied, since they treated as important the minimal differences between the groups when haplotypes were exchanged between them, as well as the proportion of these permutations yielding values greater than or equal to those observed, including the p-value of the test. The discrimination of the 95 genetic entities was also perceived through almost null and non-hierarchical inter-haplotypic variations in all components of covariance, whether intra- and inter-individual differences or intra- and intergroup differences. This generated a dendrogram supporting the idea that the significant similarities found among all populations studied have not yet been shared either in their form or in their number, since the estimates of mean evolutionary divergence found within the groups are high. Given the low level of haplotype sharing, tests that measure the relationship between genetic distance and geographic distance, such as the Mantel test, were dispensed with.
The estimator φ, although extremely sensitive to any form of molecular variation (FU, 1997), did not support uniformity among the results found by all the methodologies employed. This can be interpreted as a lack of phylogenetic confirmation for a consensus in the conservation of genomes of the Omicron B.1.1.529 variant of SARS-CoV-2 and, therefore, it is safe to affirm that the large number of existing polymorphisms is reflected in large changes in the protein products of viral populations in all the locations studied. Under this consideration, although there are large differences between the haplotypes studied, these differences are minimal within the geographically defined populations, and it therefore does not seem safe to extrapolate the polymorphism and molecular diversity levels found in the Omicron B.1.1.529 variant of SARS-CoV-2 to wild genomes or other mutants, even supporting the discussion of the existence of great differences in the protein products of this mutant in relation to the others. This warns us that, given the variant's higher speed of transmission and infection, molecular adjustments to vaccines already in use may become necessary in the near future.

COVID-19: A systematic review and update on prevention, diagnosis, and treatment
Evolution, Mode of Transmission, and Mutational Landscape of Newly Emerging SARS-CoV-2 Variants. mBio
Mutational Pattern, Impacts and Potential Preventive Strategies of Omicron SARS-CoV-2 Variant Infection.
Infect Drug Resist
Variance of gene frequencies
Analysis of gene frequencies
The reproduction number of the Omicron variant triples that of
The sampling theory of selectively neutral alleles
Patterns of DNA sequence diversity and genetic structure after a range expansion: lessons from the infinite-island model
Genetic diversity analysis of the D614G mutation in SARS-CoV-2. bioRxiv 2020.10.30.362954
Statistical tests of neutrality of mutations against population growth, hitchhiking and background selection
A simple method of removing the effect of a bottleneck and unequal population sizes on pairwise genetic distances
National Center for Biotechnology Information
Molecular Evolutionary Genetics Analysis across computing platforms
Distribution of gene frequency as a test of the theory of the selective neutrality of polymorphisms
Molecular Evolution and Phylogenetics
Mathematical model for studying genetic variation in terms of restriction endonucleases
Microsatellite analysis of population structure in Canadian polar bears
An empirical evaluation of genetic distance statistics using microsatellite data from bear (Ursidae) populations
Estimation, Variance and Optimal Sampling of Gene Diversity I. Haploid locus
Intra-Deme Molecular Diversity in Spatially Expanding Populations
An exact test for population differentiation
Estimation of the coancestry coefficient: basis for a short-term genetic distance
Algorithm 76. Hierarchical clustering using the minimum spanning tree
A correction to the exact test based on the Ewens sampling distribution
An exact test for neutrality based on the Ewens sampling distribution
A measure of population subdivision based on microsatellite allele frequencies