key: cord-0254609-1ma4hp52
authors: Nunes, D. R.; Braconi, C. T.; Ludwig-Begall, L.; Arns, C. W.; Janini, L. M. R.; Duraes-Carvalho, R.
title: Deep phylogenetic-based clustering analysis uncovers new and shared mutations in SARS-CoV-2 variants as a result of directional and convergent evolution
date: 2021-10-14
journal: nan
DOI: 10.1101/2021.10.14.21264474
sha: 56170f0964a17bd53a2fbb7442935245961e1622
doc_id: 254609
cord_uid: 1ma4hp52

Nearly two decades after the last epidemic caused by a severe acute respiratory syndrome coronavirus (SARS-CoV), newly emerged SARS-CoV-2 spread quickly in 2020 and precipitated an ongoing global public health crisis. Both the continuous accumulation of point mutations, owing to the genomic plasticity inherent in SARS-CoV-2 evolutionary processes, and viral spread over time allow this RNA virus to gain new genetic identities, spawn novel variants and enhance its potential for immune evasion. Here, through an in-depth phylogenetic clustering analysis of upwards of 200,000 whole-genome sequences, we reveal previously unreported mutations and recombination breakpoints in Variants of Concern (VOC) and Variants of Interest (VOI) from Brazil, India (Beta, Eta and Kappa) and the USA (Beta, Eta and Lambda). Additionally, we identify sites with shared mutations under directional evolution in the Spike protein of VOC and VOI, tracing a previously undescribed correlation with viral spread in South America, India and the USA. Our analysis provides well-supported evidence of similar evolutionary pathways for such mutations in all SARS-CoV-2 variants and sub-lineages.
This raises two pivotal points: (i) the co-circulation of variants and sub-lineages in close evolutionary environments, which sheds light on their trajectories toward convergent and directional evolution, and (ii) a linear perspective on prospective vaccine efficacy against different SARS-CoV-2 strains.

recombination events, and gained a pervasive ability to rapidly infect and spread around the globe (Corman et al., 2018; Boni et al., 2020; V'kovski et al., 2021). The COVID-19 pandemic precipitated intense genomic surveillance via data repositories and sequencing platforms and led to an unprecedented accumulation of public genomic data on a human pathogenic virus (Boni et al., 2020; Munnink et al., 2021). The sheer amount of available sequencing data has the potential to facilitate higher-precision micro-evolutionary analyses that map escape and point mutations in presumed positively selected sites and residues putatively associated with increased viral fitness and pathogenesis, and it allows inferences concerning the dynamics of SARS-CoV-2 spread (Kosakovsky Pond et al., 2008; Alteri et al., 2021).

Although the analysis of micro-evolutionary mechanisms is of paramount importance and may provide powerful information for predicting vaccination outcomes and tracing SARS-CoV-2 epidemiological chains, data-based investigations examining the presence of shared mutations and their evolutionary characteristics in classified SARS-CoV-2 Variants of Concern (VOC) and Variants of Interest (VOI) are still lacking (CDC 2021a; Peacock et al., 2021). Given the importance of monitoring mutations to track the emergence of novel variants, here we investigate the influence of directional selection and the dynamics of SARS-CoV-2 genomic plasticity in VOC and VOI using large-scale phylogenetic clustering partition and directional evolution (DEPS) approaches.
Additionally, we show the presence of several mutations common to both VOI/VOC and convergently emerged sub-lineages, and provide a perspective on possible effects on vaccination efficacy and the ongoing COVID-19 pandemic.

medRxiv preprint, this version posted October 14, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. https://doi.org/10.1101/2021.10.14.21264474

High-coverage and complete HCoV-229E and HCoV-NL63 (alpha-CoVs), HCoV-OC43, HCoV-HKU1, MERS-CoV, SARS-CoV and SARS-CoV-2 VOC and VOI (beta-CoVs) genome sequences (≥ 29,000 bp), sampled from humans, were retrieved from the Global Initiative on Sharing All Influenza Data-EpiCoV (GISAID-EpiCoV) and GenBank databases at different times: February 12th (MERS-CoV, SARS-CoV and SARS-CoV-2), July 12th (HCoV-229E, HCoV-NL63, HCoV-OC43, HCoV-HKU1 and SARS-CoV-2) and August 26th 2021 (SARS-CoV-2).

A methodological approach to extract large-scale phylogenetic partitions was applied to identify transmission cluster chains on the largest Maximum Likelihood (ML) phylogenetic trees of the SARS-CoV-2 variants, using a depth-first search algorithm that jointly evaluates node reliability, tree topology and patristic distance (Prosperi et al., 2011). Initially, the patristic distance was adjusted to find a representative number of clusters (n = 100) from each large reconstructed ML tree. In addition to this strategy, a second approach used sub-clustering analysis as an indirect way to infer and investigate the possibility of co-circulating sub-lineages. For this, we selected sequences (two per cluster) with ≥ 95% node reliability (statistical support), using a patristic distance threshold of 0.05, which corresponds to the 5th percentile of the whole-tree patristic distance distribution.
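The clustering criteria described above (node reliability plus a percentile threshold on patristic distance, explored depth-first) can be illustrated with a small sketch. This is not the implementation of Prosperi et al. (2011) nor the authors' pipeline; the `Node` class, the toy tree in `demo_tree`, and the rule "median pairwise patristic distance within the subtree at or below the whole-tree percentile" are assumptions made for illustration only.

```python
from itertools import combinations
from statistics import median, quantiles


class Node:
    """Hypothetical minimal tree node (not a real phylogenetics library type)."""

    def __init__(self, name=None, length=0.0, support=None, children=()):
        self.name = name          # leaf label; None for internal nodes
        self.length = length      # branch length to the parent
        self.support = support    # node reliability in [0, 1]; None if absent
        self.children = list(children)


def leaf_depths(node):
    """Map each leaf below `node` to its path length measured from `node`."""
    if not node.children:
        return {node.name: 0.0}
    depths = {}
    for child in node.children:
        for leaf, d in leaf_depths(child).items():
            depths[leaf] = d + child.length
    return depths


def pairwise_patristic(node):
    """All leaf-to-leaf path lengths inside the subtree rooted at `node`."""
    if not node.children:
        return []
    per_child = [(c.length, leaf_depths(c)) for c in node.children]
    dists = []
    # Pairs whose path passes through `node` (leaves under different children).
    for (len1, d1), (len2, d2) in combinations(per_child, 2):
        dists.extend(a + len1 + b + len2 for a in d1.values() for b in d2.values())
    # Pairs entirely contained in one child subtree.
    for child in node.children:
        dists.extend(pairwise_patristic(child))
    return dists


def find_clusters(root, support_min=0.95, percentile=5):
    """Depth-first search: report the first subtree on each path that is both
    well supported and tight relative to the whole-tree distance distribution."""
    whole = pairwise_patristic(root)
    threshold = quantiles(whole, n=100, method="inclusive")[percentile - 1]
    clusters = []

    def dfs(node):
        if not node.children:
            return
        supported = node.support is not None and node.support >= support_min
        if supported and median(pairwise_patristic(node)) <= threshold:
            clusters.append(sorted(leaf_depths(node)))
            return  # do not split an accepted cluster further
        for child in node.children:
            dfs(child)

    dfs(root)
    return threshold, clusters


def demo_tree():
    """Toy tree with made-up branch lengths and supports (illustrative only)."""
    a = Node(length=0.5, support=0.99,
             children=[Node("a1", 0.01), Node("a2", 0.01)])
    b = Node(length=0.5, support=0.97,
             children=[Node("b1", 0.02), Node("b2", 0.02)])
    c = Node(length=0.5, support=0.60,
             children=[Node("c1", 0.3), Node("c2", 0.4)])
    return Node(children=[a, b, c])


if __name__ == "__main__":
    threshold, clusters = find_clusters(demo_tree())
    print(threshold, clusters)
```

On the toy tree, only the tightest well-supported pair passes the 5th-percentile cut; relaxing `percentile` admits the second pair, while the poorly supported clade is never reported. Recomputing pairwise distances at every node is quadratic and only suitable for small examples.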
Before proceeding to directional evolution analysis, all datasets were submitted to the Genetic Algorithm for Recombination Detection (GARD), a likelihood-based tool to pinpoint recombination breakpoints (Kosakovsky Pond et al., 2006). To double-check the outcome of the first of the two strategies described above, an additional test was conducted using the Pairwise Homoplasy Index (PHI; default settings) (Huson and Bryant 2005). Maximum-likelihood phylogenetic analysis was then performed using the Datamonkey web server and the program HyPhy v2.5 to track directional selection in amino acid sequences (DEPS) (Kosakovsky Pond et al., 2020). The DEPS method identifies both the target residue and the sites evolving toward it with great accuracy, and detects frequency-dependent selection scenarios as well as selective sweeps and convergent evolution that can confound most existing tests (Kosakovsky Pond et al., 2008). Further, the DEPS method has shown better performance than traditional substitution rate-based analyses (dN/dS) in detecting transient and frequency-dependent selection and directionally evolving sites and residues.

For the most part, a Beta-Gamma site-to-site rate variation was used to conduct the analysis. The best-fit protein substitution model was chosen according to the corrected Akaike Information Criterion (cAIC). Only target sites and residues with Empirical Bayes Factors of 100 or greater in favour of a directional selection model were considered for further exploration. Certain randomly chosen datasets were run multiple times (more than eight) to confirm the results obtained.
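The two selection rules in the preceding paragraph (lowest corrected AIC for the substitution model, Empirical Bayes Factor of at least 100 for candidate sites) can be expressed compactly. The sketch below is a generic illustration rather than HyPhy or Datamonkey code: the function names, the record layout, and all numeric values are hypothetical, and using the number of alignment columns as the sample size is an assumption.

```python
def aicc(log_likelihood, n_params, n_obs):
    """Small-sample corrected AIC: 2k - 2 lnL + 2k(k + 1) / (n - k - 1)."""
    if n_obs - n_params - 1 <= 0:
        raise ValueError("AICc is undefined when n_obs <= n_params + 1")
    aic = 2 * n_params - 2 * log_likelihood
    return aic + (2 * n_params * (n_params + 1)) / (n_obs - n_params - 1)


def best_model(fits, n_obs):
    """Return the model name with the lowest AICc.

    `fits` maps a model name to (log_likelihood, n_params).
    """
    return min(fits, key=lambda name: aicc(*fits[name], n_obs))


def strong_deps_hits(hits, min_bayes_factor=100.0):
    """Keep only site/residue records whose Bayes factor meets the cutoff."""
    return [h for h in hits if h["bayes_factor"] >= min_bayes_factor]


if __name__ == "__main__":
    # Made-up fits: (log-likelihood, number of free parameters).
    fits = {"model_A": (-1050.0, 3), "model_B": (-1048.5, 6)}
    print(best_model(fits, n_obs=120))

    # Made-up candidate sites; positions and residues are placeholders.
    hits = [
        {"site": 10, "residue": "K", "bayes_factor": 250.0},
        {"site": 20, "residue": "N", "bayes_factor": 99.9},
        {"site": 30, "residue": "Y", "bayes_factor": 100.0},
    ]
    print([h["site"] for h in strong_deps_hits(hits)])
```

In the toy comparison the richer model gains too little likelihood to offset its extra parameters, so the simpler one is preferred; the cutoff is inclusive, matching "equal to or greater than 100".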
Data pertaining to SARS-CoV and MERS-CoV-related cases and deaths were extracted from the National Health Service (NHS, UK) (https://www.nhs.uk/conditions/sars/) and the European Centre for Disease Prevention and Control (ECDC) (https://www.ecdc.europa.eu/en/publications-data/distributionconfirmed-cases-mers-cov-place-infection-and-month-onset-1), respectively. Information concerning SARS-CoV-2 was collected from the World Health Organization.

Recombination is known to be a crucial evolutionary process for many RNA viruses. This process is frequently observed in the Coronaviridae family, where recombination is likely facilitated by discontinuous transcription involving jumps of the replication-transcription complex during minus-strand RNA synthesis. However, the consequences of recombination events occurring in the context of the current SARS-CoV-2 evolutionary landscape are still speculative (Li et al., 2020; Singh and Yi 2021; Pollett et al., 2021). Here we address this knowledge gap, revealing the presence of recombination and shared mutations in the SARS-CoV-2 Spike protein, demonstrating them to be under directional and convergent evolution amongst SARS-CoV-2 VOC/VOI and sub-lineages, and tracing an interconnection with viral spread.
First, endemic and epidemic human coronaviruses (HCoVs) were compared to identify similar evolutionary patterns that could help clarify the evolution of SARS-CoV-2. An initial recombination breakpoint analysis showed that four of the six HCoVs analyzed presented such signals (Fig. 1A). Endemic viruses OC43, NL63 and HKU1 also showed a similar pattern of residue accumulation and directional evolution,

In panels A and B, the symbol ≠ represents the presence of recombination breakpoint signals. In panel C, the number inside the circle represents the amount

The first epidemic wave of SARS-CoV-2 severely affected most countries in South America, probably as a result of multiple viral introductions (Candido 2020); rapid increases in case numbers were reported especially in Brazil, the largest and most populous country in Latin America (Paiva et al., 2020; Stefanelli et al., 2020). The uncontrolled viral spread created a favorable scenario for the emergence of new variants (Voloch et al., 2021; Faria et al., 2021; Resende et al., 2021; Sabino et al., 2021). To identify the impact of sites under positive directional selection on the rate of infections under these particular conditions, we traced the evolutionary scenario of SARS-CoV-2 in South America through analysis of a significant and representative number of genome sequences.
Remarkably, our data showed that an increase in sites under directional selection (DEPS) was correlated with viral spread dynamics, with Brazil exhibiting a lower proportion of COVID-19 cases than French Guiana and the same number of SARS-CoV-2 clusters as inferred in Chile (n = 97) (Fig. 1C), probably due to a higher diversity of circulating viruses. Our results also highlighted a series of mutations: certain mutations have been described previously but had not yet been identified in SARS-CoV-2 VOC and VOI, whereas multiple further mutations are identified for the first time in this study (Table 1).

Analysis of the molecular evolution of SARS-CoV-2 that takes into account the influence of local demography in these specific scenarios has the potential to generate important insights into the spread and infection dynamics of this pathogen. Using SARS-CoV-2 sequences from China (the most populated country in the world) as reference, we analyzed all datasets from Brazil, India, and the USA via a large-scale phylogenetic partitions analysis (Prosperi et al., 2011; Matsuda, Suzuki, and Ogata 2020). Increases in SARS-CoV-2 infections were observed to be proportional to locally circulating variants and, in the scenarios analyzed, were not correlated with any particular demography (Fig. 1D); this indirectly reinforces the importance of measures implemented to avoid viral propagation. Analysis of phylogenetic partition clusters along the length of the circa 30 kb CoV genome revealed several directionally evolving sites under convergent evolution (Table 2). Thus, a possible association can be established between the rate of infections and the number of residues and sites in the Spike protein under DEPS (Fig. 1D). Interestingly, this supports a hypothesis of convergent evolution due to repeated and multiple site-specific substitutions in distinct SARS-CoV-2 VOC and VOI (see Table 1 and Table 2).

We also inferred the possible appearance of SARS-CoV-2 sub-lineages and traced the influence of an environment favoring directional evolution acting on SARS-CoV-2 variants. We showed different patterns among sites in the VOC and VOI, with a particular emphasis on the Kappa VOI currently circulating in the USA. We further demonstrated recombination among SARS-CoV-2 VOC and VOI from India (Beta, Eta and Kappa) and the USA (Beta, Eta and Lambda) (Fig. 2A and Table 2).
As one of the first countries in the world to develop efficient immunizations and implement a vaccination policy (FDA, 2021), the USA vaccinated more than 30% of its population by April 2021. By September 2021, 60% of the booster-immunized population possessed neutralizing antibodies against several viral variants (Ritchie et al., 2021; Pegu et al., 2021). Similar outcomes were observed following widespread vaccination with various SARS-CoV-2 vaccines (produced using different technologies) in many other regions, including South America and India (Li et al., 2021; Bernal et al., 2021). Nonetheless, viral circulation in the face of incomplete immunization has been described as one of the probable causes of the emergence of new variants (Sabino et al., 2021). Accordingly, our own analysis identified SARS-CoV-2 VOC and VOI subclusters (Fig. 2B), indicating co-circulation of variants and sub-lineages under convergent evolution. Surprisingly, the same evolutionary pattern was also observed for the other endemic and epidemic CoVs studied (see Data availability).

This study demonstrates the influence of positive directional evolution on SARS-CoV-2 circulating in South America and in those countries most severely affected by the COVID-19 pandemic. Our methodology allowed for the identification of recombination breakpoints and distinct transmission subclusters. We were able to indirectly infer transmission of a viral epidemiological chain and the generation of new variants. We also identified and classified several convergently emerged shared mutations in different SARS-CoV-2 VOC and VOI. Lastly, we hypothesize that the co-circulation of SARS-CoV-2 variants and their possible sub-lineages takes place within a very close evolutionary environment, which translates to a setting of strong convergent evolution in which the effective viral population has acquired identical site-specific mutations. Our results can help to anticipate a linear perspective with regard to future vaccine efficacy during the pandemic.

Some data on which this paper is based are too large to be retained or publicly archived with available resources. Smaller files which comprise information concerning recombination, convergent evolution, phylogenetic-based clustering analysis (ML trees, transmission clusters/subclusters), as well as the filtered and aligned sequence datasets used to map the shared and unsigned mutations in the
Multiple Introductions Followed by Ongoing Community Spread of SARS-CoV-2 at One of the Largest Metropolitan Areas of Northeast Brazil
SARS-CoV-2 one year on: evidence for ongoing viral adaptation
Durability of mRNA-1273 vaccine-induced antibodies against SARS-CoV-2 variants
Evolutionary dynamics of the SARS-CoV-2 ORF8 accessory gene
SARS-CoV-2 variants lacking a functional ORF8 may reduce accuracy of serological testing
Spike mutation D614G alters SARS-CoV-2 fitness
Platform for Evolutionary Hypothesis Testing Using Phylogenies
GARD: a genetic algorithm for recombination detection
FastTree 2--approximately maximum-likelihood trees for large alignments
A novel methodology for large-scale phylogeny partition
A Potential SARS-CoV-2 Variant of Interest (VOI) Harboring Mutation E484K in the Spike Protein Was Identified within Lineage B.1.1.33 Circulating in Brazil
Coronavirus Pandemic (COVID-19), Our World in Data
Resurgence of COVID-19 in Manaus, Brazil, despite high seroprevalence
On the origin and evolution of SARS-CoV-2
Whole genome and phylogenetic analysis of two SARS-CoV-2 strains isolated in Italy in
Epidemiology, Genetic Recombination, and Pathogenesis of Coronaviruses
Food & Drug Administration (FDA) (2020) Moderna COVID-19 Vaccine
Positive Selection of ORF1ab, ORF3a, and ORF8 Genes Drives the Early Evolutionary Trends of SARS During the 2020 COVID-19 Pandemic
Coronavirus biology and replication: implications for SARS-CoV-2
Genomic characterization of a novel SARS-CoV-2 lineage from Rio de Janeiro, Brazil
The emergence of SARS-CoV-2 in Europe and North America
Full-genome sequences of the first two SARS-CoV-2 viruses from India
Isolation of a novel coronavirus from a man with pneumonia in Saudi Arabia