key: cord-0279350-hqtu9v8l authors: Oliver, José L.; Bernaola-Galván, Pedro; Perfectti, Francisco; Gómez-Martín, Cristina; Verdú, Miguel; Moya, Andrés title: Accelerated decline of genome heterogeneity in the SARS-CoV-2 coronavirus date: 2021-11-08 journal: bioRxiv DOI: 10.1101/2021.11.06.467547 sha: e7d84e38a5be8c303ea6a33c77442aedba11973d doc_id: 279350 cord_uid: hqtu9v8l In the brief time since the outbreak of the COVID-19 pandemic, and despite its proofreading mechanism, the SARS-CoV-2 coronavirus has accumulated a significant amount of genetic variability through recombination and mutation events. To test evolutionary trends that could inform us on the adaptive process of the virus to its human host, we summarize all this variability in the Sequence Compositional Complexity (SCC), a measure of genome heterogeneity that captures the mutational and recombinational changes accumulated by a nucleotide sequence along time. Despite the brief time elapsed, we detected many differences in the number and length of compositional domains, as well as in their nucleotide frequencies, in more than 12,000 high-quality coronavirus genomes from across the globe. These differences in SCC are phylogenetically structured, as revealed by significant phylogenetic signal. Phylogenetic ridge regression shows that SCC followed a generalized decreasing trend along the ongoing process of pathogen evolution. In contrast, SCC evolutionary rate increased with time, showing that it accelerates toward the present. In addition, a low rate set of genomes was detected in all the genome groups, suggesting the existence of a stepwise distribution of rates, a strong indication of selection in favor of different dominant strains. Coronavirus variants reveal an exacerbation of this trend: non-significant SCC regression, low phylogenetic signal and, concomitantly, a threefold increase in the evolutionary rate. Altogether, these results show an accelerated decline of genome heterogeneity along with the SARS-CoV-2 pandemic expansion, a process that might be related to viral adaptation to the human host, perhaps paralleling the transformation of the current pandemic to epidemic. Pioneer works 1, 2 showed that RNA viruses are an excellent material for studies of evolutionary genomics. Now, with the outbreak of the COVID-19 pandemic, this has become a key research topic. Despite its proofreading mechanism and the brief time-lapse, SARS-CoV-2 shows an important amount of genetic variability [3] [4] [5] , which is due to both its recombinational origin 6 as well as mutation and additional recombination events accumulated along with the expansion of COVID-19 pandemic across the globe 7 . An unprecedented research effort has allowed to track in real-time all these changes along with pathogen evolution. Since the COVID-19 pandemic was declared by the World Health Organization (WHO) to be a public health emergency of international concern on March 2020 (https://www.who.int/emergencies/diseases/novel-coronavirus-2019/situation-reports), a massive amount of shared multidisciplinary information has been made available by the scientific community. Genome information about the coronavirus is available on websites as GISAID 4 , Nextstrain 5 , or NCBI virus 3 . The CoVizu e project (https://www.epicov.org/epi3/frontend#28b5af) supplied a near realtime visualization of SARS-CoV-2 global diversity of SARS-CoV-2 genomes. To date, the most parsimonious explanation for the origin of SARS-CoV-2 is a zoonotic event 8 . Direct bat-to-human spillover events may occur more often than reported, although most remain unrecognized because of different causes 9 . Bats are known as the natural reservoirs of SARS-like CoVs 10 . Because of a comparison between these coronaviruses and SARS-CoV-2, a bat derivation for the outbreak was proposed 11 . Indeed, a recombination event between the bat coronavirus and either an origin-unknown coronavirus 12 or a pangolin virus 13, 14 would be at the origin of SARS-CoV-2. Gu and co-workers 14 found that bat RaTG13 virus best matched the overall codon usage pattern of SARS-CoV-2 in orf1ab, spike, and nucleocapsid genes, while the pangolin P1E virus had a more similar codon usage in membrane gene. Other intermediate hosts have been identified, such as RaTG15 15 , knowledge of which is imperative to prevent further spread of the epidemic 16 . RNA viruses can accumulate high genetic variation during an individual outbreak 17 , showing mutation and evolution rates that may be up to a million times higher than those of their hosts 18 . In the brief time since the COVID-19 pandemic appeared, and despite viral genomic proofreading mechanism, recombinations have accumulated 19 over multiple rounds of mutations, many of which have increased viral fitness 8, [20] [21] [22] . Synonymous and non-synonymous mutations [23] [24] [25] , as well as mismatches and deletions in translated and untranslated regions 18 have been tracked. Of particular interest are those non-synonymous mutations provoking epitope loss and antibody escaping found mainly in evolved variants isolated from Europe and the Americas, which have critical implications for SARS-CoV-2 transmission, pathogenesis, and immune interventions 26 . Some studies have shown that SARS-CoV-2 is acquiring mutations more slowly than expected for neutral evolution, suggesting that purifying selection is the dominant mode of evolution, at least during the initial phase of the pandemic 27 . Parallel mutations in multiple independent lineages and variants have been observed 21, 27 , which may indicate convergent evolution and that are of particular interest in the context of adaptation of SARS-CoV-2 to the human host 21 . Other authors reported some sites under positive pressure in the nucleocapsid and spike genes 28 . Finally, genome rearrangements, as nucleotide deletions of different lengths, have been found, the major one affecting 382 nucleotides has been associated with a milder infection 29 . Most sequence changes (i.e., synonymous and non-synonymous nucleotide substitutions, insertions, deletions, recombination events, chromosome rearrangements, genome reorganizations…) can potentially alter the array of compositional domains in a genome. These domains can be changed either by altering nucleotide frequencies in a given region or by changing the nucleotides at the borders separating two putative domains, thus enlarging, or shortening a given domain [30] [31] [32] [33] [34] . A good metric of genome heterogeneity should be able to summarize the mutational and recombinational events accumulated by a genome sequence over time 35-39 . In many organisms, the patchy sequence structure formed by the array of domains with different nucleotide composition has been related to important biological features, i.e., gene and repeat densities, the timing of gene expression, recombination frequency, etc. 37, [40] [41] [42] . Therefore, changes in genome heterogeneity may be relevant on evolutionary and epidemiological grounds. Specifically, evolutionary trends on genome heterogeneity could reveal adaptative processes of the virus to the human host. To this end, we computed the Sequence Compositional Complexity SCC 39 The first SARS-CoV-2 coronavirus genome obtained from the start of the pandemics (2019-12-30) was divided into eight compositional domains by the iterative segmentation algorithm 30, 31, 37, 42 , resulting in a SCC value of 5.7 x 10 -3 bits (Figure 1 ). Since that time, descendent coronaviruses present a lot of variation in each domain's number, length, nucleotide composition, and so in SCC genome heterogeneity values (Supplementary Tables S1-S16). Table S1 ). The strain name, the collection date, the SCC values, and the number of segments for each analyzed genome are shown in detail in Supplementary Tables S2-S16. First, an ML phylogenetic tree was inferred for each sample, then computing the phylogenetic signal 43 for SCC (Table 1) . Interestingly, the phylogenetic signal values (K) were clearly lower in the samples of variant genomes, becoming even nonsignificant in most of them, as compared to the samples from the general virus population. Decreasing trends for genome heterogeneity The ridge regression of SCC against age shows a highly significant, decreasing trend. Figure 2 shows the regression obtained for the sample s1000_1. Similar decreasing trends were seen in the other samples from the general virus population; interestingly, non-significant slopes were obtained in most of the variant samples (Table 2 ). The ridge regression for the evolutionary rate of SCC obtained in the sample s1000_1 is shown in Figure 3 . In contrast to the decreasing trend observed for SCC (Figure 2 ), an increasing trend was observed for its evolutionary rate (higher rate towards the present). Table 2 shows that the slopes for evolutionary rate were positive and highly significant in all the genome groups (except for the Gamma variant sample where there are a marginal significance). Despite the general positive slopes for evolutionary rate (e.g., Figure 3 ), a conspicuous set of genomes with a low evolutionary rate also appears. This set is indicated by an oval in Figure 3 for the sample s1000_1, but similar sets appear in all the samples, even in the variants. However, these genomes never form a separate clade on the tree, appearing instead scattered over different branches of the tree, and always spanning a wide range of ages. In most SARS-CoV-2 samples, the ridge regression of SCC against age shows a highly significant, negative slope ( Figure 2 and Table 2 ), thus indicating a generalized declining trend along the ongoing process of pathogen evolution. However, the evolutionary rate of SCC shows positive slopes, thus indicating that the rates increase (i.e., they are faster than the Brownian motion expectation) toward the present. Therefore, an accelerated decline of sequence heterogeneity over time exists in the coronavirus. The biological meaning of high, increasing evolutionary rates ( Figure 3 and Table 2 ) indicate a fast changing (i.e., still adapting) virus genome. However, the co-existence in all the samples of a set of genomes with a low evolutionary rate (e.g., A caution over our results could be the specific protocol we followed to select, filter, and mask the sampled genomes, as well as on the particular algorithm we used to infer phylogenetic trees. Therefore, we repeat our analyses using collected coronavirus samples and trees inferred by other groups; using these data, we found qualitatively comparable results. An example was the analysis of the SARS-CoV-2 Nextstrain global dataset containing 3059 genomes sampled between December 2019 and October 2021 5 , and using the ML phylodynamic tree obtained by these authors by means of the TreeTime software 53 . Interestingly, we obtained a decreasing, although marginally significant trend, for genome heterogeneity (slope = -0.01, P value ≤ 0.122), as well as a slight, increasing, but highly significant trend for the evolutionary rate (slope = 0.89, P value ≤ 0.001). The accelerated loss of genome heterogeneity in the coronavirus revealed in this work might be related to viral attenuation 54 leading to adaptation to the human host, a well-known process in other viruses 55 , perhaps paralleling the ongoing transformation of the current pandemic into an epidemic. Further monitoring of current and new variants will allow checking these hypotheses to elucidate how virus evolution impacts human health. To search for coronavirus phylogenetic trends, we retrieved coronavirus genome sequences from the GISAID database 4 WIV04 is 12 nucleotides shorter than Wuhan-Hu-1 at the 3' end, the two sequences are identical in practical terms, particularly the 5' UTR is the same length, and the coding regions are identical. Therefore, the coordinates and relative changes are the same whichever sequence is used 58 . We followed the steps that have been recognized so far as useful for filtering and masking alignments of SARS-CoV-2 sequences, in this way avoiding sequence oddities 59 . First, we aligned the dataset of each sample to the genome sequence of the isolate Wuhan-Hu-1 (MN908947.3) using MAFFT 60 , following the detailed protocol at https://virological.org/t/issues-with-sars-cov-2-sequencing-data/473. We then mask the alignment using the Python script 'mask_alignment_using_vcf.py' and following the detailed protocols at https://virological.org/t/masking-strategies-for-sars-cov-2alignments/480 and https://github.com/W-L/ProblematicSites_SARS-CoV2. In this way, we avoid oddities in the SARS-CoV-2 genome sequences from GISAID, such as alignment ends, which are affected by low coverage and high rate of apparent sequencing/mapping errors, recurrent or systematic sequencing errors, or homoplasic, recombination, and hypervariable sites. The masking of some sites in the alignment provokes that some of the sequences in the initial random sample become identical to others. Upon eliminating such duplicates, we realigned each sample with MAFFT 60 using default options. To solve polytomies, we used the function fix.poly from the RRphylo package V. 2.5.8 45, 46 . We then infer the best ML trees for each sample by means of the software IQ-TREE 2 61 , using the GTR nucleotide substitution model 62, 63 and additional options suggested by the software (i.e., GTR+F+R2). We also used the least square dating (LSD2) method 64 We divided a given nucleotide sequence into an array of compositionally homogeneous, non-overlapping domains using a heuristic, iterative segmentation algorithm 30, 31, 37, 42 . In brief, a sliding cursor is moved along the sequence, and the position that optimizes an appropriate measure of compositional divergence between the left and right parts is selected. We choose the Jensen-Shannon divergence (equations (1) and (2) Figure 1 ). Once a sequence is segmented into an array of homogeneous compositional domains, a reliable measure of Sequence Compositional Complexity or SCC 39 was computed: where S denotes the whole genome sequence and G its length, Gi the length of the i th domain Si. (•) = − ∑ is the Shannon entropy of the distribution of relative frequencies of symbol occurrences, f, in the corresponding (sub)sequence. It should be noted that the above expression is the same as the one used in the segmentation process, applying it to the tentative two new subsequences (n = 2) to be obtained in each step. Thus, the two steps of the SCC computation are based on the same theoretical background. Note that 1) this measure is 0 if no segments are found in the sequence (the sequence is compositionally homogeneous, i.e., a random sequence) and 2) increases with both the number of segments and the degree of compositional differences among them. In this way, the SCC measure is analogous to the measure used by McShea and Brandon 67 to obtain complexity estimates on morphological characters: an organism is more complex if it has a greater number of parts and a higher differentiation among these parts. It should also be emphasized the high sensibility of our measure of sequence heterogeneity. Only one nucleotide substitution or one little indel often suffices to alter the number, the length, or the nucleotide frequencies of the compositional domains, and therefore the resulting value for SCC. The phylogenetic signal can be defined as the tendency for related species to resemble each other more than species randomly drawn from a phylogenetic tree. So, high values of the phylogenetic signal indicate that closely related species in the phylogeny tend to be more similar than expected by chance. It measures how trait variation, in our case SCC, is correlated with the phylogenetic relatedness of species 68, 69 . Here we used Blomberg's K 43 to measure the phylogenetic signal. This metric is for continuous characters only and it uses an explicit model of trait evolution, the Brownian motion (BM) model. So, the expected variance behind this scenario is the summed branch length from the root to each species represented by a variance-covariance matrix. Blomberg's K measures phylogenetic signal by quantifying the observed trait variance relative to the trait variance expected under BM. Evolutionary trends for SCC were determined using the RRphylo R package V. 2.5.8 45, 46 . The estimated SCC value for each tip or node in the phylogenetic tree is regressed against its age (the phylogenetic time distance, meant mainly as the collection date of each virus isolate). The statistical significance of the ridge regression slope was tested against 1,000 slopes obtained after simulating a simple (i.e., no-trend) Brownian evolution of SCC in our phylogenetic tree with the search.trend function of this package. The evolutionary rate of SCC was also computed by search.trend function of the RRphylo package 45, 46 . However, to search for trends in evolutionary rate, its comparison to the expectation of the Brownian motion model is needed. To this end, the absolute evolutionary rate needs to be rescaled in the 0-1 range and then transformed to logs (Prof. Pasquale Raia, personal communication). Note that the time distance is expressed as the distance from the tree root (+1 for mathematical reasons). The statistical significance of the ridge regression slope was tested against 1,000 slopes obtained after simulating a simple Brownian evolution in the phylogenetic tree. All data generated or analyzed during this study are included in this published article (and its Supplementary Information files) . Origin and evolution of viruses The population genetics and evolutionary epidemiology of RNA viruses Virus Variation Resource-improved response to emergent viral outbreaks GISAID Global Initiative on Sharing All Influenza Data. Phylogeny of SARS-like betacoronaviruses including novel coronavirus (nCoV) Nextstrain: real-time tracking of pathogen evolution Insights into SARS-CoV-2 genome, structure, evolution, pathogenesis and therapies: Structural genomics approach Recombination and lineage-specific mutations linked to the emergence of SARS-CoV-2 The Origins of SARS-CoV-2: A Critical Review A strategy to assess spillover risk of bat SARS-related coronaviruses in Southeast Asia Bats are natural reservoirs of SARS-like coronaviruses A Genomic Perspective on the Origin and Emergence of SARS-CoV-2 Cross-species transmission of the newly identified coronavirus 2019-nCoV Probable Pangolin Origin of SARS-CoV-2 Associated with the COVID-19 Outbreak Multivariate analyses of codon usage of SARS-CoV-2 and other betacoronaviruses Identification of a novel lineage bat SARS-related coronaviruses that use bat 1 ACE2 receptor 2 Composition and divergence of coronavirus spike proteins and host ACE2 receptors predict potential intermediate hosts of SARS-CoV-2 Virus evolution and transmission in an ever more connected world Genome-wide analysis of SARS-CoV-2 virus strains circulating worldwide implicates heterogeneity Profile of a killer: the complex biology powering the coronavirus pandemic Discovery of a novel coronavirus associated with the recent pneumonia outbreak in humans and its potential bat origin Emergence of genomic diversity and recurrent mutations in SARS-CoV-2. Infection Mutation landscape of SARS-CoV-2 reveals three mutually exclusive clusters of leading and trailing single nucleotide substitutions The novel Coronavirus enigma: Phylogeny and mutation analyses of SARS-CoV-2 viruses circulating in India during early 2020 RdRp mutations are associated with SARS-CoV-2 genome evolution Identification of Novel Missense Mutations in a Large Number of Recent SARS-CoV-2 Genome Sequences Non-synonymous Mutations of SARS-Cov-2 Leads Epitope Loss and Segregates its Variants CoV-2 genome evolution exposes early human adaptations The 2019-new coronavirus epidemic: Evidence for virus evolution Effects of a major deletion in the SARS-CoV-2 genome on the severity of infection and the inflammatory response: an observational cohort study Compositional segmentation and long-range fractal correlations in DNA sequences SEGMENT: identifying compositional domains in DNA sequences Identification of isochore boundaries in the human genome using the technique of wavelet multiresolution analysis Sequence segmentation Delineating relative homogeneous G+C domains in DNA sequences Quantifying intrachromosomal GC heterogeneity in prokaryotic genomes Isochore chromosome maps of the human genome IsoFinder: computational prediction of isochores in genome sequences Bayesian Analysis of Isochores Sequence compositional complexity of DNA through an entropic segmentation method The mosaic genome of warm-blooded vertebrates A standalone version of IsoFinder for the computational prediction of isochores in genome sequences Testing for phylogenetic signal in comparative data: Behavioral traits are more labile Phylosignal: An R package to measure, test, and explore the phylogenetic signal A new, fast method to search for morphological convergence with shape data Simultaneous detection of macroevolutionary patterns in phenotypic means and rate of change with and within phylogenetic trees including extinct species Driven progressive evolution of genome sequence complexity in Cyanobacteria Rapidly evolving traits and the comparative method: how important is testing for phylogenetic signal? Evolution of enhanced innate immune evasion by the SARS-CoV-2 B.1.1.7 UK variant Antigenic minimalism of SARS-CoV-2 is linked to surges in COVID-19 community transmission and vaccine breakthrough infections SARS-CoV-2 B.1.617.2 Delta variant replication and immune evasion A phylogeny-based metric for estimating changes in transmissibility from recurrent mutations in SARS-CoV-2 Maximum-likelihood phylodynamic analysis Evolutionary Dynamics of Viral Attenuation Viral adaptation to host: A proteomebased analysis of codon usage and amino acid preferences Data, disease and diplomacy: GISAID's innovative contribution to global health Global initiative on sharing all influenza datafrom vision to reality CoV-GLUE: A Web Application for Tracking SARS-CoV-2 Genomic Variation Want to track pandemic variants faster? Fix the bioinformatics bottleneck MAFFT multiple sequence alignment software version 7: Improvements in performance and usability IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era Some probabilistic and statistical problems in the analysis of DNA sequences The general stochastic model of nucleotide substitution Fast Dating Using Least-Squares Criteria and Algorithms Segmentation of time series with long-range fractal correlations Information measures, effective complexity, and total information Biology's first law : the tendency for diversity and complexity to increase in evolutionary systems Phylogenetic signal, evolutionary process, and rate Phylogenetic signal and linear regression on species data and co-financed by the European Regional Development Fund (ERDF). Special thanks are due to Professor Pasquale Raia and Dr. Silvia Castiglione for its advice in applying the RRphylo package to coronavirus data, particularly the development of a new function (fix.poly) to resolve polytomies in the tree, and also for sharing code to rescale the evolutionary rate. We also gratefully acknowledge both the originating and submitting laboratories for the sequence data in GISAID EpiCoV on which these analyses are based. Supplementary Table S17 shows a complete list acknowledging all originating and submitting laboratories. In the same way, we gratefully acknowledge the authors, originating and submitting laboratories of the genetic sequences we used for the analysis of the Nextstrain sample; a complete list is shown in Supplementary Table S18. The authors declare no competing interests.