key: cord-0291500-flxbomsx
authors: Oliver, José L.; Bernaola-Galván, Pedro; Perfectti, Francisco; Gómez-Martín, Cristina; Castiglione, Silvia; Raia, Pasquale; Verdú, Miguel; Moya, Andrés
title: The emergence of variants with increased fitness accelerates the slowdown of genome sequence heterogeneity in the SARS-CoV-2 coronavirus
date: 2022-05-26
journal: bioRxiv
DOI: 10.1101/2022.05.26.493529
sha: 5e753813052ff84187107894c3315893bae7f48e
doc_id: 291500
cord_uid: flxbomsx

Since the outbreak of the COVID-19 pandemic, the SARS-CoV-2 coronavirus has accumulated a considerable amount of genetic and genomic variability through mutation and recombination events. To test evolutionary trends that could inform us about the adaptation of the virus to its human host, we summarize all this sequence variability by computing the Sequence Compositional Complexity (SCC) in more than 23,000 high-quality coronavirus genome sequences from across the globe, covering the period spanning from the start of the pandemic in December 2019 to March 2022. In early samples, we found no statistical support for any trend in SCC values over time, although the virus as a whole appears to evolve faster than Brownian Motion expectation. However, in samples taken after the first Variant of Concern (VoC) with higher transmissibility (Alpha) emerged, and controlling for phylogenetic and sampling effects, we were able to detect a statistically significant trend for decreased SCC values over time. SARS-CoV-2 evolution towards lower values of genome heterogeneity is further intensified by the emergence of successive, widespread VoCs. Concomitantly with the temporal reduction in SCC, its absolute evolutionary rate kept increasing toward the present, meaning that the SCC decrease itself accelerated over time. Compared with the Alpha or Delta variants, the currently dominant VoC, Omicron, shows much stronger trends in both SCC values and rates over time.
These results indicate that the increases in fitness of variant genomes associated with higher transmissibility lead to a reduction of their genome sequence heterogeneity, thus explaining the general slowdown of SCC over the course of the pandemic.

Pioneering work showed that RNA viruses are excellent material for studies of evolutionary genomics (Domingo et al., 1999; Moya et al., 2004; Worobey and Holmes, 1999). Now, with the outbreak of the COVID-19 pandemic, this has become a key research topic. Despite the controversy surrounding the first days and location of the pandemic (Koopmans et al., 2021; Worobey, 2021), the most parsimonious explanation for the origin of SARS-CoV-2 seems to be a zoonotic event (Holmes et al., 2021). Direct bat-to-human spillover events may occur more often than reported, although most remain unrecognized (Sánchez et al., 2021). Bats are known as the natural reservoirs of SARS-like CoVs (Li et al., 2005) and early evidence exists for the recombinant origin of bat (SARS)-like coronaviruses (Hon et al., 2008). A genomic comparison between these coronaviruses and SARS-CoV-2 has led to the proposal of a bat origin of the COVID-19 outbreak (Zhang and Holmes, 2020). Indeed, a recombination event between the bat coronavirus and either a coronavirus of unknown origin (Ji et al., 2020) or a pangolin virus would be at the origin of SARS-CoV-2. Bat RaTG13 virus best matched the overall codon usage pattern of SARS-CoV-2 in orf1ab, spike, and nucleocapsid genes, while the pangolin P1E virus had a more similar codon usage in the membrane gene (Gu et al., 2020). Other intermediate hosts have been identified, such as RaTG15, and knowledge of them is essential to prevent further spread of the epidemic.
Despite its proofreading mechanism and the brief time-lapse since its appearance, SARS-CoV-2 has already accumulated a considerable amount of genetic and genomic variability (Elbe and Buckland-Merrett, 2017; Hadfield et al., 2018; Hamed et al., 2021; Hatcher et al., 2017), which is due both to its recombinational origin (Naqvi et al., 2020) and to mutation and additional recombination events accumulated later (Cyranoski, 2020; Jackson et al., 2021; Patiño-Galindo et al., 2021). Notably, RNA viruses can also accumulate high genetic variation during individual outbreaks (Pybus et al., 2015), showing mutation and evolutionary rates that may be up to a million times higher than those of their hosts (Islam et al., 2020). Synonymous and non-synonymous mutations (Banerjee et al., 2020; Cai et al., 2020), as well as mismatches and deletions in translated and untranslated regions (Islam et al., 2020; Young et al., 2020), have been tracked in the SARS-CoV-2 genome sequence. Particularly interesting changes are those increasing viral fitness (Holmes et al., 2021; van Dorp et al., 2020; Wang et al., 2021; Zhou et al., 2020), such as mutations causing epitope loss and antibody escape. These have been found mainly in evolved variants isolated from Europe and the Americas, and have critical implications for SARS-CoV-2 transmission, pathogenesis, and immune interventions (Gupta and Mandal, 2020). Some studies have shown that SARS-CoV-2 is acquiring mutations more slowly than expected for neutral evolution, suggesting that purifying selection is the dominant mode of evolution, at least during the initial phase of the pandemic. Parallel mutations in multiple independent lineages and variants have been observed (van Dorp et al., 2020), which may indicate convergent evolution and are of particular interest in the context of adaptation of SARS-CoV-2 to the human host (van Dorp et al., 2020).
Other authors reported some sites under positive pressure in the nucleocapsid and spike genes (Benvenuto et al., 2020). All this research effort has allowed these changes to be tracked in real time. The CoVizu project (https://filogeneti.ca/covizu/) provides a visualization of the global diversity of SARS-CoV-2 genomes.

Decreasing evolutionary trend of genome sequence heterogeneity in the coronavirus

Base composition varies at all levels of the phylogenetic hierarchy and throughout the genome, and can be caused by active selection or passive mutation pressure (Mooers and Holmes, 2000). The array of compositional domains in a genome can potentially be altered by most sequence changes (i.e., synonymous and non-synonymous nucleotide substitutions, insertions, deletions, recombination events, chromosome rearrangements or genome reorganizations). Compositional domain structure can be altered either by changing the nucleotides at the borders separating two domains, or by changing nucleotide frequencies in a given region, thus altering the number of domains or their compositional differences and, consequently, the resulting SCC value of the sequence (Bernaola-Galván et al., 1996; Keith, 2008; Oliver et al., 1999; Wen and Zhang, 2003). Ideally, a genome sequence heterogeneity metric should be able to summarize all the mutational and recombinational events accumulated by a genome sequence over time (Fearnhead and Vasilieou, 2009; Oliver et al., 2004, 2002; Román-Roldán et al., 1998). In many organisms, the patchy sequence structure formed by the array of compositional domains with different nucleotide composition has been related to important biological features, such as GC content, gene and repeat densities, timing of gene expression, and recombination frequency (Bernaola-Galván et al., 2008; Bernardi, 2015; Bernardi et al., 1985; Oliver et al., 2004).
Therefore, changes in genome sequence heterogeneity may be relevant on evolutionary and epidemiological grounds. Specifically, evolutionary trends in genome heterogeneity of the coronavirus could reveal adaptive processes of the virus to the human host. To this end, we computed the Sequence Compositional Complexity, or SCC (Román-Roldán et al., 1998), an entropic measure of genome heterogeneity, understood as the number of domains and the nucleotide differences among them, identified in a genome sequence through a proper segmentation algorithm (Bernaola-Galván et al., 1996). By using phylogenetic ridge regression, a method that has been able to reveal evolutionary trends in both macro- (Serio et al., 2019) and micro-organisms (Moya et al., 2020), we present here evidence for a long-term tendency of decreasing genome sequence heterogeneity in SARS-CoV-2. The trend is shared by the virus's most important Variants of Concern (VoCs), Alpha and Delta, and greatly accelerated by the recent rise to dominance of Omicron (Du et al., 2022).

The first SARS-CoV-2 coronavirus genome sequence obtained at the start of the pandemic (2019-12-30) was divided into eight compositional domains by our compositional segmentation algorithm (Bernaola-Galván et al., 2008, 1996; Oliver et al., 2004, 1999), resulting in an SCC value of 5.7 × 10⁻³ bits (Figure 1).

Figure 1. Compositional segmentation of the GISAID reference genome (hCoV-19/Wuhan/WIV04/2019|EPI_ISL_402124|2019-12-30). Using an iterative segmentation algorithm (Bernaola-Galván et al., 1996; Oliver et al., 2004), the RNA sequence was divided into eight compositionally homogeneous segments (i.e., compositional domains) with P value ≤ 0.05. The genome position of domain borders is shown on the horizontal scale. Colors are used only to illustrate the differential nucleotide composition of each domain.
From then on, descendant coronaviruses show considerable variation in the number, length and nucleotide composition of their domains, which is reflected in higher or lower SCC values in individual genomes. The number of segments ranges between 4 and 9, while SCC ranges between 2.71E-03 and 6.8E-03 bits. The strain name, the collection date, and the SCC values for each analyzed genome are shown in Supplementary Tables S1-S18. To characterize the temporal evolution of SCC over the entire time range of the coronavirus pandemic (December 2019 to March 2022), we retrieved random samples of high-quality genome sequences from the GISAID database (Table 1). We then filtered, masked and aligned these sequences to the reference genome (see Methods). For each of these samples, we determined the proportion of variants (Table 1, columns 5-8) and inferred an ML phylogenetic tree by means of IQ-TREE 2 (Minh et al., 2020). Finally, we searched for temporal trends in SCC values and evolutionary rates by using the function search.trend in the R package RRphylo (Castiglione et al., 2018), contrasting the realized slope of the SCC versus time regression to a family of 1,000 slopes generated under the Brownian motion model of evolution, which models evolution with no trend in either the SCC or its evolutionary rate.

We found that SARS-CoV-2 genome sequence heterogeneity did not follow any trend in SCC during the first year of the pandemic, as indicated by the non-significant SCC against time regressions in any sample ending before December 2020 (Table 1). With the emergence of variants in December 2020 (s1573, Table 1), the genome sequence heterogeneity started to decrease significantly over time. In contrast to the decreasing trend observed for SCC, a clear tendency towards faster evolutionary rates takes place throughout the study period, indicating that the virus increased in variability early on, but took on a monotonic trend for decreasing SCC as VoCs appeared.
These results were robust to several sources of uncertainty, including those related to the algorithms used for multiple alignment or to infer phylogenetic trees (see Supplementary Information). In summary, these analyses show that statistically significant trends for declining heterogeneity began between the end of December 2020 (s1573) and March 2021 (s1871), in coincidence with the emergence of the first VoC (Alpha), a path continued over the successive emergence of other variants.

Table 1. Phylogenetic trends in coronavirus random samples downloaded from the GISAID database (Elbe and Buckland-Merrett, 2017; Koehorst et al., 2017; Shu and McCauley, 2017) covering the pandemic time range from December 2019 to March 2022. For each sample, the analyzed time range was from December 2019 to the date shown in the column 'Collecti

We estimated the relative contribution of the three most important VoCs (Alpha, Delta and Omicron) to the trends in SARS-CoV-2 evolution by picking samples both before (s726, s730) and after (s1871, s1990) their appearance. The trends for SCC and its evolutionary rate in the sample s1990, which includes a sizeable number of Omicron genomes, are shown in Figure 2. On all these samples, we tested trends for variants individually (as well as for the samples' trees as a whole) while accounting for phylogenetic uncertainty, by randomly altering the phylogenetic topology and branch lengths 100 times per sample (see Methods and Supplementary Information for details).
In agreement with the previous analysis (seventeen consecutive bins, see Table 1), we found strong support for a decrease in SCC values through time along phylogenies including variants (s1871, s1990) and no support for any temporal trend in older samples. Just 4 out of the 200 random trees produced for samples s726 and s730 showed a trend in SCC evolution. The corresponding figure for the two younger samples is 186/200 significant and negative instances of SCC decrease over time (Table 2). This ~50-fold increase in the likelihood of finding a consistent trend for SCC decline over time is shared unambiguously by all tested variants (Alpha, Delta, and Omicron) (Table 3). Yet, Omicron shows a significantly stronger decline in SCC than the other variants (Table 3), suggesting that the trends initiated with the appearance of the main variants became more intense with the emergence of Omicron by the end of 2021.

Figure 2. Phylogenetic ridge regressions for SCC (left) and its evolutionary rate (right) as detected by the RRphylo R package (Castiglione et al., 2018) on the s1990 sample. For SCC, the estimated value for each tip in the phylogenetic tree is regressed (blue line) against its age (the phylogenetic time distance, determined mainly by the collection date of each virus isolate). The rescaled evolutionary rate was obtained by rescaling the absolute rate in the 0-1 range and then transforming to logs to compare to the Brownian motion expectation. The statistical significance of the ridge regression slopes was tested against 1,000 slopes obtained after simulating a simple Brownian evolution of the SCC in the phylogenetic tree. The 95% confidence intervals around each point produced according to the Brownian motion model of evolution are shown as shaded areas. Dots are colored according to the variant they belong to or left blank for strains collected before the appearance of variants.
We tested the difference between the slope of the SCC versus time regression computed by grouping all the variants under a single group and the same figure for all other strains grouped together. The test was performed by using the function emtrends available within the R package emmeans (Lenth, 2022). We found the slope for the group including all variants to be significantly larger than the slope for the other strains (estimate = -0.772 × 10⁻⁸, P-value = 0.006), further pointing to the decisive effect of VoCs on the SCC temporal trend.

Table 2. Percentages of significant results of SCC and SCC evolutionary rates versus time regressions performed on 100 randomly fixed (and subsampled for s1871 and s1990) phylogenetic trees. Higher/lower than BM = the percentage of simulations producing slopes significantly higher/lower than the Brownian Motion expectation.

SCC evolutionary rates (absolute magnitude of the rate) showed a tendency to increase through time (Table 2). The slope of the SCC rates through time regression for Omicron was always significantly lower than the slope computed for the rest of the tree (Table 3). This is also true for Alpha and Delta, although with much lower support.

Table 3. Percentages of significant results of SCC and SCC evolutionary rates versus time regressions performed on 100 randomly resolved (s1871 and s1990) phylogenetic trees. % slope difference indicates the percentage of simulations producing significantly higher/lower slopes than the rest of the tree.

Here we show that, despite its short length (29,912 bp for the reference genome) and the short time-lapse analyzed (28 months), the coronavirus RNA genomes can be divided into 4-9 compositional domains (~0.27 segments per kbp on average).
Although such segment density is lower than in free-living organisms, like cyanobacteria, where we observed an average density of 0.47 segments per kbp (Moya et al., 2020), it may suffice for comparative evolutionary analyses of compositional sequence heterogeneity in these genomes, which might shed light on the origin and evolution of the COVID-19 pandemic. In early samples (i.e., collected before the emergence of variants) we found no statistical support for any trend in SCC values over time, although the virus as a whole appears to evolve faster than Brownian Motion expectation. However, in samples taken after the first VoC with higher transmissibility (Alpha) appeared in the GISAID database (December 2020), we started to detect statistically significant decreasing trends in SCC (Table 1). Concomitantly with the temporal reduction in SCC, its absolute evolutionary rate kept increasing toward the present, meaning that the SCC decrease itself accelerated over time. In agreement with this notion, although the SCC decrease is an evolutionary path shared by variants, the nearly threefold increase in rates becomes more intense after the appearance of the most recent VoC (Omicron) in late 2021, which shows a much faster decrease in SCC than the other variants (Table 3). These results indicate the existence of a driven, probably adaptive, trend in the variants toward a reduction of genome sequence heterogeneity. Variant genomes have accumulated a higher proportion of adaptive mutations, which allows them to neutralize host resistance or escape host antibodies (Mlcochova et al., 2021; Thorne et al., 2021; Venkatakrishnan et al., 2021), consequently gaining a higher transmissibility (a paradigmatic example is the recent outbreak of the Omicron variant).
The sudden increases in fitness of variant genomes, mainly due to the gathering of co-mutations, which become prevalent world-wide compared to single mutations, are largely responsible for their temporal changes in transmissibility and virulence (Ilmjärv et al., 2021; Majumdar and Niyogi, 2021). In fact, more contagious and perhaps more virulent VoCs share mutations and deletions that have arisen recurrently in distinct genetic backgrounds (Richard et al., 2021). We show here that these increases in fitness of variant genomes associated with higher transmissibility lead to a reduction of their genome sequence heterogeneity, thus explaining the general slowdown of SCC along with the pandemic expansion. We conclude that the accelerated loss of genome heterogeneity in the coronavirus is promoted by the rise of high viral fitness variants, leading to adaptation to the human host, a well-known process in other viruses (Bahir et al., 2009). Further monitoring of the evolutionary trends in current and new co-mutations, variants and recombinant lineages (Callaway, 2022; Ledford, 2022; Straten et al., 2022) by means of the tools used here will help elucidate whether and to what extent the evolution of genome sequence heterogeneity in the virus impacts human health.

We retrieved random samples (see Table 1) of high-quality coronavirus genome sequences from the GISAID/Audacity database (Elbe and Buckland-Merrett, 2017; Koehorst et al., 2017; Shu and McCauley, 2017). MAFFT (Katoh and Standley, 2013) was used to align each random sample to the genome sequence of the isolate Wuhan-Hu-1 (MN908947.3); the alignments were then filtered and masked to avoid sequence oddities (Hodcroft et al., 2021).
The best ML timetree for each random sample in Table 1 was inferred by means of IQ-TREE 2 (Minh et al., 2020), using the GTR nucleotide substitution model (Rodríguez et al., 1990; Tavaré, 1986) and the least-squares dating (LSD2) method (To et al., 2016), finally rooting the timetree to the GISAID coronavirus reference genome (EPI_ISL_402124, hCoV-19/Wuhan/WIV04/2019, WIV04).

To divide the coronavirus genome sequence into an array of compositionally homogeneous, non-overlapping domains, we used a heuristic, iterative segmentation algorithm (Bernaola-Galván et al., 2008, 1996; Oliver et al., 2004, 1999). We chose the Jensen-Shannon divergence as the divergence measure between adjacent segments, as it can be directly applied to symbolic nucleotide sequences. At each iteration, we used a significance threshold (s = 0.95) to split the sequence into two statistically significant segments. The process continues iteratively over the newly resulting segments as long as sufficiently significant splits keep appearing. Once each coronavirus genome sequence was segmented into an array of statistically significant, homogeneous compositional domains, its genome sequence heterogeneity was measured by computing the Sequence Compositional Complexity, or SCC (Román-Roldán et al., 1998). SCC increases/decreases with both the number of segments and the degree of compositional differences among them. In this way, SCC is analogous to other biological complexity measures, particularly to that described by McShea and Brandon (2010), in which an organism is more complex if it has a greater number of parts and a higher differentiation among these parts. It should be emphasized that SCC is highly sensitive to any change in the RNA genome sequence, whether nucleotide substitutions, indels, genome rearrangements or recombination events.
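The entropic quantities described above can be illustrated with a minimal sketch. It assumes that SCC is the length-weighted (generalized) Jensen-Shannon divergence among the domains of a segmentation, which is our reading of the entropic definition and not a formula stated in this text. The function names (`shannon_entropy`, `scc`, `best_split`) are hypothetical, and the significance test against the threshold s that the actual algorithm applies before accepting a cut is deliberately left out.

```python
from collections import Counter
from math import log2


def shannon_entropy(seq):
    """Shannon entropy (bits) of the symbol composition of seq."""
    n = len(seq)
    if n == 0:
        return 0.0
    counts = Counter(seq)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def scc(seq, borders):
    """Length-weighted Jensen-Shannon divergence among the segments
    delimited by `borders` (sorted cut positions inside seq).
    Assumption: this is taken here as the SCC of the segmentation."""
    cuts = [0, *borders, len(seq)]
    segments = [seq[a:b] for a, b in zip(cuts, cuts[1:])]
    n = len(seq)
    weighted = sum(len(s) / n * shannon_entropy(s) for s in segments)
    return shannon_entropy(seq) - weighted


def best_split(seq, min_len=1):
    """Cut position that maximizes the two-segment divergence.
    Naive O(n^2) scan, for illustration only; no significance test."""
    best_pos, best_jsd = None, -1.0
    for pos in range(min_len, len(seq) - min_len + 1):
        d = scc(seq, [pos])
        if d > best_jsd:
            best_pos, best_jsd = pos, d
    return best_pos, best_jsd
```

For a toy sequence of eight A followed by eight C, `best_split` places the cut at position 8 and the resulting two-domain value is 1 bit, whereas a compositionally homogeneous sequence yields 0. In an iterative segmentation, the same search would be repeated inside each new segment for as long as the cuts remain statistically significant.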
To search for trends in SCC values and evolutionary rates over time, phylogenetic ridge regression was applied by using the RRphylo R package, version 2.5.8 (Castiglione et al., 2018). The estimated SCC value for each tip or node in the phylogenetic tree is regressed against its age (the phylogenetic time distance, which represents the time distance between the first sequence ever of the virus and the collection date of individual virus isolates); the regression slope was then compared to the Brownian Motion (BM) expectation (which models evolution with no trend in SCC values and rates over time) by generating 1,000 slopes simulating BM evolution on the phylogenetic tree, using the function search.trend in the RRphylo R package.

In order to test explicitly the effect of variants and to compare variants with each other, we selected 4 different trees and SCC data (s730, a727, s1871, s1990) from the entire dataset (Table 1). On each sample, we accounted for phylogenetic uncertainty by producing 100 dichotomous versions of the initial tree, removing polytomies with the RRphylo function fix.poly. This function randomly resolves polytomous clades by adding non-zero length branches to each new node and equally partitioning the evolutionary time attached to the new nodes below the dichotomized clade. Each randomly fixed tree was used to evaluate the presence of temporal trends in SCC and SCC evolutionary rates occurring on the entire tree and on individual variants, if present, by applying search.trend. Additionally, for the larger phylogenies (i.e., s1871 and s1990 lineage-wise trees) half of the tree was randomly sampled and half of the tips were removed. In this way we avoided biasing the results because of different tree sizes. Additional details regarding the methods used in this study are provided in the Supplementary Information.

All data generated or analyzed during this study are included in this published article (and its Supplementary Information files).
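The logic of contrasting an observed trait-versus-age slope with slopes simulated under Brownian motion, as described in the Methods, can be sketched as follows. This is a simplified illustration and not the RRphylo implementation: it uses ordinary least squares on tip values only, where search.trend relies on ridge-regression estimates. The tree encoding (a dict mapping each node to its parent and branch length, parents listed before children), the rate parameter `sigma`, and all function names are assumptions made for the example.

```python
import random
from statistics import mean


def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx


def node_ages(tree):
    """Time distance of every node from the root.
    tree: {node: (parent, branch_length)}, parents listed first."""
    ages = {}
    for node, (parent, bl) in tree.items():
        ages[node] = 0.0 if parent is None else ages[parent] + bl
    return ages


def simulate_bm(tree, sigma, rng):
    """Trait values evolved along the tree by Brownian motion:
    each branch adds a normal step with variance sigma^2 * length."""
    values = {}
    for node, (parent, bl) in tree.items():
        if parent is None:
            values[node] = 0.0
        else:
            values[node] = values[parent] + rng.gauss(0.0, sigma * bl ** 0.5)
    return values


def bm_slope_test(tree, tips, observed, sigma, n_sim=1000, seed=0):
    """Observed tip slope and two-sided p-value against BM slopes."""
    rng = random.Random(seed)
    ages = node_ages(tree)
    x = [ages[t] for t in tips]
    obs = ols_slope(x, [observed[t] for t in tips])
    null = []
    for _ in range(n_sim):
        sim = simulate_bm(tree, sigma, rng)
        null.append(ols_slope(x, [sim[t] for t in tips]))
    p = sum(abs(s) >= abs(obs) for s in null) / n_sim
    return obs, p
```

Because tips on a time tree are sampled at different dates, their ages differ and a slope can be estimated. A trait that declines with age far more steeply than the simulated random walks do will return a small p-value, which is the sense in which a trend is called significant relative to the BM expectation.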
Viral adaptation to host: A proteome-based analysis of codon usage and amino acid preferences
The Novel Coronavirus Enigma: Phylogeny and Analyses of Coevolving Mutations Among the SARS-CoV-2 Viruses Circulating in India
The 2019-new coronavirus epidemic: Evidence for virus evolution
A standalone version of IsoFinder for the computational prediction of isochores in genome sequences
Quantifying intrachromosomal GC heterogeneity in prokaryotic genomes
Compositional segmentation and long-range fractal correlations in DNA sequences
Chromosome architecture and genome organization
The mosaic genome of warm-blooded vertebrates
Identification of Novel Missense Mutations in a Large Number of Recent SARS-CoV-2 Genome Sequences
Are COVID surges becoming more predictable? New Omicron variants offer a hint
Simultaneous detection of macroevolutionary patterns in phenotypic means and rate of change with and within phylogenetic trees including extinct species
A new method for testing evolutionary rate variation and shifts in phenotypic evolution
Profile of a killer: the complex biology powering the coronavirus pandemic
Origin and evolution of viruses
The mysterious origins of the Omicron variant of SARS-CoV-2
Data, disease and diplomacy: GISAID's innovative contribution to global health
Bayesian Analysis of Isochores
GISAID. 2020. GISAID Initiative
Multivariate analyses of codon usage of SARS-CoV-2 and other betacoronaviruses
Non-synonymous Mutations of SARS-Cov-2 Leads Epitope Loss and Segregates its Variants
Nextstrain: real-time tracking of pathogen evolution
Global dynamics of SARS-CoV-2 clades and their relation to COVID-19 epidemiology
Virus Variation Resource-improved response to emergent viral outbreaks
Emergence in late 2020 of multiple lineages of SARS-CoV-2 Spike protein variants affecting amino acid position 677
The Origins of SARS-CoV-2: A Critical Review
Evidence of the Recombinant Origin of a Bat Severe Acute Respiratory Syndrome (SARS)-Like Coronavirus and Its Implications on the Direct Ancestor of SARS Coronavirus
Concurrent mutations in RNA-dependent RNA polymerase and spike protein emerged as the epidemiologically most successful SARS-CoV-2 variant
Genome-wide analysis of SARS-CoV-2 virus strains circulating worldwide implicates heterogeneity
Generation and transmission of interlineage recombinants in the SARS-CoV-2 pandemic
Cross-species transmission of the newly identified coronavirus 2019-nCoV
MAFFT multiple sequence alignment software version 7: Improvements in performance and usability
Sequence segmentation
GISAID Global Initiative on Sharing All Influenza Data. Phylogeny of SARS-like betacoronaviruses including novel coronavirus (nCoV)
Origins of SARS-CoV-2: window is closing for key scientific studies
The next variant: three key questions about what's after Omicron
emmeans: Estimated Marginal Means, aka Least-Squares Means
Bats are natural reservoirs of SARS-like coronaviruses
Emergence of SARS-CoV-2 through recombination and strong purifying selection
Composition and divergence of coronavirus spike proteins and host ACE2 receptors predict potential intermediate hosts of SARS-CoV-2
SARS-CoV-2 mutations: The biological trackway towards viral fitness
Biology's first law: the tendency for diversity and complexity to increase in evolutionary systems
Macroevolutionary trends of brain mass in Primates
IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era
The evolution of base composition and phylogenetic inference
The population genetics and evolutionary epidemiology of RNA viruses
Driven progressive evolution of genome sequence complexity in Cyanobacteria
Insights into SARS-CoV-2 genome, structure, evolution, pathogenesis and therapies: Structural genomics approach
IsoFinder: computational prediction of isochores in genome sequences
Isochore chromosome maps of the human genome
SEGMENT: identifying compositional domains in DNA sequences
Recombination and lineage-specific mutations linked to the emergence of SARS-CoV-2
Virus evolution and transmission in an ever more connected world
A phylogeny-based metric for estimating changes in transmissibility from recurrent mutations in SARS-CoV-2
The general stochastic model of nucleotide substitution
Sequence compositional complexity of DNA through an entropic segmentation method
Macroevolution of Toothed Whales Exceptional Relative Brain Size
GISAID: Global initiative on sharing all influenza data - from vision to reality
Mapping the antigenic diversification of
Some probabilistic and statistical problems in the analysis of DNA sequences
Evolution of enhanced innate immune evasion by the SARS-CoV-2 B.1.1.7 UK variant
Fast Dating Using Least-Squares Criteria and Algorithms
Emergence of genomic diversity and recurrent mutations in SARS-CoV-2
Antigenic minimalism of SARS-CoV-2 is linked to surges in COVID-19 community transmission and vaccine breakthrough infections
Mechanisms of SARS-CoV-2 Evolution Revealing Vaccine-Resistant Mutations in Europe and America
Identification of isochore boundaries in the human genome using the technique of wavelet multiresolution analysis
Dissecting the early COVID-19 cases in Wuhan
Evolutionary aspects of recombination in RNA viruses
Effects of a major deletion in the SARS-CoV-2 genome on the severity of infection and the inflammatory response: an observational cohort study
Probable Pangolin Origin of SARS-CoV-2 Associated with the COVID-19 Outbreak
A Genomic Perspective on the Origin and Emergence of SARS-CoV-2
Discovery of a novel coronavirus associated with the recent pneumonia outbreak in humans and its potential bat origin

We also gratefully acknowledge both the originating and submitting laboratories for the sequence data in GISAID EpiCoV on which these analyses are based. Supplementary Table S19 shows a complete list acknowledging all originating and submitting laboratories. In the same way, we gratefully acknowledge the authors and the originating and submitting laboratories of the genetic sequences we used for the analysis of the Nextstrain sample.

The authors declare no competing interests.