key: cord-0298955-lnnc3wfu authors: Perico, C. P.; De Pierri, C. R.; Neto, G. P.; Fernandes, D. R.; Pedrosa, F. d. O.; de Souza, E. M.; Raittz, R. T. title: Genomic landscape of SARS-CoV-2 pandemic in Brazil suggests an external P.1 variant origin date: 2021-11-10 journal: nan DOI: 10.1101/2021.11.10.21266084 sha: 6183311f5f18b6e057d16378e1d4c942c9c54953 doc_id: 298955 cord_uid: lnnc3wfu Brazil was the epicenter of worldwide pandemics at the peak of its second wave. The genomic/proteomic perspective of the COVID-19 pandemic in Brazil can bring new light to understand the global pandemics behavior. In this study, we track SARS-CoV-2 molecular information in Brazil using real-time bioinformatics and data science strategies to provide a comparative and evolutive panorama of the lineages in the country. SWeeP vectors represented the Brazilian and worldwide genomic/proteomic data from GISAID between 02/2020-08/2021. Clusters were analyzed and compared with PANGO lineages. Hierarchical clustering provided phylogenetic and evolutionary analysis of the lineages, and we tracked the P.1 (Gamma) variant origin. The genomic diversity based on Chao's estimation allowed us to compare richness and coverage among Brazilian states and other representative countries. We found that epidemics in Brazil occurred in two distinct moments, with different genetic profiles. The P.1 lineages emerged in the second wave, which was more aggressive. We could not trace the origin of P.1 from the variants present in Brazil in 2020. Instead, we found evidence pointing to its external source and a possible recombinant event that may relate P.1 to the B.1.1.28 variant subset. We discussed the potential application of the pipeline for emerging variants detection and the stability of the PANGO terminology over time. The diversity analysis showed that the low coverage and unbalanced sequencing among states in Brazil could have allowed the silenty entry and dissemination of P.1 and other dangerous variants. This comparative and evolutionary analysis may help to understand the development and the consequences of the entry of variants of concern (VOC). The current pandemic of the SARS-CoV-2 virus (Severe Acute Respiratory Syndrome Coronavirus 2), which causes 2 the disease known as Corona Virus Disease 2019 (COVID-19) (1), had its first cases notified in Brazil in February 3 2020. Brazil was the pandemic's epicenter at the peak of its second wave, around April 2021. 4 New variants continually emerge, many of them considered Variants of Concern (VOC) such as the British B.1.1.7 5 (Alpha), the South African B.1.351 (Beta), the Indian B.1.617.2 (Delta), and the P.1 (Gamma), first identified in 6 Brazil in November 2020 (2) . In addition, variants acquire mutations that make them more adapted, transmissible and 7 challenging to detect by the immune system (3; 4; 5) . Therefore, virus monitoring is essential to diagnose, improve 8 treatment, characterize strains and sub-strains, and thus understand their dynamics and dispersion (6) . It is also of 9 utmost importance in health policy decisions. International and domestic travel without quarantine is a significant 10 vehicle for spreading potentially dangerous variants, as occurred at the beginning of the pandemic in 2020 before air 11 travel restrictions (7) . Proper quarantine use positively impacted case reduction, and neglected quarantine caused 12 exponential growth in infected curves (8) . Figure S1 presents the pipeline constructed in R language that is available at https://github.com/CamilaPPerico/ 38 SARS-CoV-2_Brazil_Landscape/, as well as the other results of this research. The sequences used in this paper, 39 excepting the Wuhan reference sequence, were downloaded from the GISAID database and represented into vectors. 40 Euclidean is the adopted metric for distance in this study. We ran the analysis on a Xeon server with 251Gb of RAM 41 and 40 threads. Protein sequences were concatenated (with border delimiters) into proteomes which were represented in vectors using 60 the SWeeP tool (Spaced Words Projection) (14) . The R version of SWeeP tool, used for the proteome vectorization, 61 is available in the Bioconductor Platform 4 for R version 3.12 (25) . Finally, we made the vector projection of the 62 Brazilian genomes (coded in DNA) in the SWeeP tool in Matlab (14) with its default parameters. The 1,000,588 (1M) of SARS-CoV-2 worldwide proteomes were vectorized, totaling 9.97 billion amino acids, 64 including the reference sequence of Wuhan and the spike protein of Brazilian samples separately integrated into the 65 comparative study. The proteomes of Brazil, Germany, India, Italy, England, and World-2020 were vectorized and 66 considered as independent sets. The same orthonormal base, with the SWeeP default parameters (length 600 and 67 mask [1 1 0 1 1]), was employed to project all sequences into compacted vectors. 68 Cluster analysis and visualization 69 Brazilian proteomes were clustered using the ConsensusClusterPlus package version 1.54.0 from Bioconductor (26) 70 and the kmedoids method (Partitioning Around Medoids, PAM), in procedures with 1,000 replicates for each cycle, 71 testing 2-20 as the number of clusters. For the spike proteins, 2-10 sets were tested. As a criterion for selecting the 72 best number of groups, in both cases, was considered the best convergence in the Consensus Cumulative Distribution 73 Function (CDF) with the smallest number of clusters. We visualized and compared the clustering results using two 74 approaches of dimensionality reduction: Principal Component Analysis (PCA) and the t-Distributed Stochastic 75 Neighbor Embedding (t-SNE) (27) . The t-SNE diagrams were constructed in the Rtsne package 5 , with its default 76 parameters. 77 Information on the number of COVID-19 cases in Brazil was available at the official website https://covid. 78 saude.gov.br/. In addition, the mapping of temporal and spatial evolution was carried out based on information 79 obtained from the metadata provided by the GISAID platform. Coverage and richness of viral subvariants (unique and non-redundant sequences) were estimated via the Chao 1 82 richness estimator, given by the equation 1 (28; 29) . In equation 1, S obs is the number of distinct vectorized proteomes observed, F 1 is the number of singletons 84 (single-occurring vectors), and F 2 is the number of doubletons (two-occurring vectors). Thus, the coverage is given by: 85 Phylogenetic analisys 86 All proteomic phylogenetic trees were built using the Neighbor-Joining (NJ) method through the Ape version 87 5.5 package (30) , performing bootstrap with 1000 replicas. Only branches with bootstrap greater than 70% were 88 considered. For previous studies employing bootstrap calculation in tree construction in alignment-free analyses, see 89 references (31; 32). 90 We built a consensus phylogenetic tree for the 8,720 Brazilian proteomes based on the proteome's vectors distance 91 matrix. Also, phylogenetic trees were built for cluster and lineage centroids, selecting the sequence closest to each 92 corresponding centroid and taking it as a representative vector. The centroids were obtained by the average of the 93 vectors within the cluster/lineage. The proteomic results were compared to a phylogenetic tree with the aligned genomes of the clusters and lineages 95 centroids. We also aligned the specific sequences and constructed genome trees to analyze the origin of the P.1 variant. 96 For this step, the Maximum Likelihood method of the MEGAX 10.2.6 (33) tool, with a 500 bootstrap of size, was 97 performed using the Jukes-Cantor nucleotide substitution model (34) . The alignment was made using the Nextclade 98 online tool (24) . All the phylogenetic trees were rooted using the Wuhan reference sequence (NC 045512.2) as the 99 outgroup. Finally, we visualize the trees in the iTOL tool 6 (35) and with the ggtree package (36). Intending to determine the origin of the P.1 variant, whether internal or external to Brazil, we obtained the 70 closest 102 worldwide samples to each of the 50 P.1 Brazilian samples in 2020 by distance, resulting in 91 unique vectors whose 103 phylogeny by alignment was analyzed. We also searched for occurrences of sequences like P.1 in the World before its 104 emergence in Brazil. Finally, we assessed the involvement of the P.1 variant in possible recombination events. Machine Learning for P.1 search 106 An ensemble of 50 feed-forward neural networks (multilayer perceptron, MLP) was trained using the vectors of 107 the Brazilian sequences classified as P.1 and non-P.1 utilizing data from release 609 with classification PANGO 108 v.3.0.5. Data division was 70:30 for training and testing sets, respectively, randomly divided for each neural network 109 training. Each MLP network contained the layers input, middle, and output with 600,3,1 neurons, respectively. Only 110 networks with an f1-score higher than 90% as threshold compose the ensemble. A majority vote decided classification. 111 We tested the model with the complete set of Brazilian vectors of accuracy, f1-score, recall, and precision through 112 cross-validation. Recombinant's detection 114 Possible recombinants were detected from aligned genomes using RAPR (37) and RDP4 (38) tools. RDP4 provides 115 the methods RDP (39), BOOTSCAN (40) , MAXCHI (41) , CHIMAERA (42), 3SEQ (43) , GENECONV (44) , LARD 116 (45) , and SISCAN (46), applied in this task. The confirmation test for the recombinant events was performed by 117 analyzing the phylogenetic trees of genomes. The genomes were split into two parts at the breaking points of the 118 aligned sequences and, we phylogenetically analyzed the relative position between supposed recombinants and their 119 parents. Finally, the recombinants that presented a distinct relative position between the trees of each segment were 120 validated (47) . From the 1,000,558 samples worldwide in release 409 on the GISAID platform (21), 49% of proteomes remained 123 after filtering (493,080 sequences), of which 260,759 from 2020. In Brazil, 65% of samples were considered, a quality 124 percentage higher than the world average and the other countries studied, as shown in Table 1 . The 8,720 vectorized 125 Brazilian sequences were analyzed and discussed below. More detailed information on the results is in Section 2 of 126 Supplementary 1. The complete metadata of the Brazilian sequences and the metadata referring to other countries 127 and the World is available at the Github link. . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The ConsensusClusterPlus analysis returned 15 clusters representing the epidemic proteomes in Brazil from 25 130 February 2020 to the end of May 2021. More than 15 clusters do not provide a considerable increase in the consensus 131 value of the CDF curve (less than 5% - Figure 3,7,9,10,14 and 15) . Rarer variants were mainly grouped in clusters 2 and 6. Cluster 2 is composed of 137 lineages less frequent in Brazil, including the basal lineages A.1, A.2, and B and B.1, which have 1,3,3 and 59 samples, 138 respectively. The Wuhan reference sequence belongs to cluster 2 and is highlighted in Figure 2 . The analysis showed that the clustering approach respects evolutionary similarity among the sequences. Moreover, 140 the clustering results match the PANGO division, as viewed in the t-SNE diagram, PCA, and clusters/lineage centroids 141 heatmap (Figures 1, 2) . Tables 2 and S2 show the relationship between the clusters and their main composition. 142 Other results are presented in the supplementary material (Table S1) . We also vectorized and clustered spike proteins sequences which derived eight consensus clusters ( Figure S3 , 153 Tables S3,S4). The clustering of the spike proteins was similar to those of the complete proteomes but with fewer 154 divisions. Nevertheless, the division into two larger groups is maintained, and clusters 11 and 13 are still differentiable, 155 as shown in the PCA ( Figure S3b ). The consensus mutations for all SARS-CoV-2 Brazilian samples, the characteristic mutations for the TP1 group, 157 and for the other clusters are presented in Tables S5, S6 (Table S2) . These are the older groups that predominated in Brazil 164 in 2020, but are almost extinct, giving way to the TP1 group ( Figure 4 ). There are no consensus mutations 165 characteristic in T0 (Tables S5, S7); each cluster represents an individual lineage or a group of lineages less 166 frequent in Brazil. One example of a variant belonging to T0 is the B.1.1.33, which stood out the most in 167 2020 in Brazil. Franceschi et al. (9) Jul Color Key . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 10, 2021. ; https://doi.org/10.1101/2021.11.10.21266084 doi: medRxiv preprint Figure 2 . Overview of the relationship between SARS-CoV-2 proteomes of Brazilian samples. The colors represent main lineages (above) and clusters (below) for the Brazilian sequences for each image pair. The red asterisk highlights the position of the Wuhan reference sequence. a) components 1 and 2 of PCA graph -note the cluster on the left is the TP1 group, composed of sequences from P.1, P.1.1, P.1.2, and P.4; b) components 3 and 4 of PCA graph -the cluster 13, isolated above, corresponds to the 79 samples identified as P.4 by PANGO (v3.0.5). The TP1 group is more central, and the other clusters are around it; c) t-SNE graph -note clustering coincides with classification by lineages. Groups related to variant P.1 (TP1) 172 TP1 comprises clusters within variants P.1, P.4, P.1.1, and P.1.2 (clusters 3,7,9,10,12,14 and 15). The first P.1 was 173 notified in 12/2020, though previous studies estimate that P.1 origin in Brazil occurred between early October and 174 mid-November 2020 (2). The P.1 sample (EPI ISL 2241496) dated 2020-10-01 from Paraíba State corroborates this 175 hypothesis. October/2020 was one of the months with the lowest sequencing in Brazil, which can be the sake of 176 underreporting of P.1 related cases. Remarkably, this month was a period of flexibilization of international flights in 177 Brazil (48) . Phylogenies show that P.1 and P.4 variants mix themselves among and inside clusters in TP1 ( Figure S6 ). In 179 t-SNE, PCA, and heatmaps, P.1 and P.4 are hardly distinguishable, either in clusters or lineages (Figures 1 and 2 ). 180 Furthermore, the TP1 clusters share many non-synonymous mutations (Table S6 and Figure 3 ). At least five of these 181 mutations are in the spike protein, conferring the adaptive virus advantage (E484K, N501Y, K417T, H655Y, and 182 L18F) (49; 50; 51; 52). The sum of these characteristics suggests that the TP1 group could be seen as a single lineage, 183 divided into sublineages. As stated before, it is also remarkable that clusters within TP1 do not correspond perfectly 184 to the P.1 and P.4 subdivision provided by PANGO. 185 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 10, 2021. ; https://doi.org/10.1101/2021.11.10.21266084 doi: medRxiv preprint Table 2 . Division of lineages into clusters using the complete vectorized proteome. The predominant strains in each cluster are listed with the date of their first and last sample. The complete list of observed lineages by cluster is available in Table S2 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 10, 2021. ; https://doi.org/10.1101/2021.11.10.21266084 doi: medRxiv preprint 4 5 6 7 8 9 10 11 12 13 14 15 16 4 5 6 7 8 9 10 11 12 13 14 15 16 17 5 195 435 182 239 192 129 131 149 322 309 726 860 2145 Although most analyzes of this study were performed based on proteomes samples, complete genome DNA trees were 203 built for comparison. The results showed that the proteome and genome derived trees with 8,720 samples generally 204 agree ( Figure S6 ) -complete trees are available in Supplementary 5 and 6. The consensus tree consistently grouped the monophyletic branch of the TP1 group with 100% bootstrap (BP). 206 The branch containing cluster 12 (P.1.2), internal to the branch of the TP1 group, is monophyletic and obtained a 207 100% BP. The cluster 11 (B.1.1.7) -with BP=87% -and the lineages N.9 and N.10 of cluster 4 -with BP=100% 208 both -are also monophyletic (tree available in Github as SARS NJ Consensus BP.nwk). Centroids-based phylogenies 209 provided a cleaner and more reliable evolutionary overview of the groups with high bootstrap ( Figure 5 ). TP1 group 210 appears together in all tested centroid trees with high bootstrap. The TP1 group is cohesive and monophyletic in all approaches, and the P.4 lineage does not differ from P.1, as 212 there is an alternation of branches in all trees, both in genome and proteome ( Figure S6) . The basal cluster of the 213 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) preprint The copyright holder for this this version posted November 10, 2021. ; https://doi.org/10.1101/2021.11.10.21266084 doi: medRxiv preprint TP1 group is uncertain, varying between cluster 14 in the genomic approach and cluster 9 in the proteomic one. 214 Cluster-based methods reached higher bootstrap values compared to those based on the lineage ( Figure 5 ). Kappa variant B.1.617.1 samples appear together within the TP1 group in the DNA and the protein trees 216 (Figures 5b and d) , which probably consists of annotation errors once these samples have the characteristic mutations 217 of the P.1 variant rather than B.1.617.1 (Table S1 ). 218 P.1 variant origin analisys 219 We investigated three hypotheses for the P.1 variant origin: a) it evolved locally, i.e., from T0, b) it had a later entry 220 external origin (came from abroad); and c) P.1 is derived from some recombination event. Table S8 . This isolated information would indicate that P.1 is closely related to the 224 B.1.1.28 sequences from the Pará (PA) and São Paulo (SP) States, supporting the local ancestry hypothesis previously 225 reported (54) . However, the two samples here nominated as PA-TP1 (EPI ISL 1068256 and EPI ISL 1261122) have 226 as closest sequences only foreign samples Table 3 . These samples also appeared close to all other searched P.1 samples 227 then we deepened the analysis. The PA-TP1 genomes have 10 of the 17 characteristic mutations of the P.1 group, 228 and both instances belong to cluster 9. Thus, the phylogeny suggests that PA-TP1 may be the precursors of the TP1 229 group in Brazil (Figures 6 and S7) , and cluster 9 may be ancestral of the P.1 variant. Table 3 . Worldwide sequences close to P.1 ancestors (PA-TP1) in 2020. The list was obtained using the distance from PA-TP1 to the 2020 world samples. In the tree based on aligned genomes ( Figure S7) We measured the distances between proteomes from Brazilian samples and that of the Wuhan reference proteome 243 over the pandemic period (Figure 1e ). The differences between P.1 and P.4 and the Wuhan reference are much higher 244 than the distance among the T0 group to Wuhan. From the beginning of 2021, the average distance among the 245 Brazilian samples concerning the Wuhan sample leaped with the TP1 emergence as T0 group variants became extinct. 246 Therefore, the distancing of the TP1 group from T0 is abrupt and not gradual. 247 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this this version posted November 10, 2021. ; https://doi.org/10.1101/2021.11.10.21266084 doi: medRxiv preprint Figure 5 . Phylogenetic trees by cluster and lineage centroids (relative to PANGO v3.0.5) with a bootstrap of 1000 replicas. Left, the Neighbor-Joining method with Euclidean distance was used on the proteome centroids corresponding to each a) cluster, b) lineage. Right, centroid trees per cluster by DNA via Maximum Likelihood with JukesCantor method, by c) cluster, d) lineage. The tree a) is divided into two main branches: the TP1 group, with cluster 11 of B.1.1.7 as more basal, followed by cluster 9; and the other main branch, the T0 with cluster 13 as the most basal. All branches reach bootstrap greater than 85%. The b) is less consistent, but the TP1 group reaches a 73% bootstrap, except the branch of P.1.2, which does not get high BP. The c) consistently inserts cluster 14 as basal in the TP1 group and cluster 2 as more basal in Brazil. Finally, tree d) obtains BP of 100% for the branch of the TP1 group and inserts as basal the variants A.1, A.2, B, B.3 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this this version posted November 10, 2021. ; https://doi.org/10.1101/2021.11.10.21266084 doi: medRxiv preprint Figure 6 . Sequences similar to P.1 detected by the neural network ensemble. The two identified ancestral sequences from the P.1 lineage in Brazil (PA-TP1) are highlighted green. The collapsed branch in blue groups 48 Table S9) ). This 271 sample belongs to clade 28-AM-II (A6613G) of B.1.1.28, indicated as the ancestor of lineage P.1 according to Naveca 272 et al. (54) (Figure S7) . Such clade has the A6613G mutation, a characteristic mutation of the TP1 group, present in 273 99.9% of the samples in the group. Therefore, to reinforce the recombination possibility, we built phylogenetic trees 274 with the sequence before the alignment breakpoint and the other after this point. However, the trees did not reach a 275 high enough bootstrap to confirm or rule out any recombination events suggested by the tools (Figure S7 ). 276 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The distribution of variants by state ( Figure S8) shows that only 4 of the 27 states had samples continuously sequenced 278 along with pandemics till May 2021: SP, RJ, RS, and BA. In other states, there were months without sequencing or 279 simply one-off analyses. In the first stage of the pandemic, the B.1.1.28 and B.1.1.33 variants predominated until 280 October 2020, when the P.2 variant became the predominant strain. Thus, from December 2020 until March 2021, the 281 P.1 variant grew to become the primary variant in the country, followed by February 2021 by variant P.4 (Figure 4) . 282 Detailed information for the Brazilian States is available in Section 2.3 of Supplementary 1. The second epidemic wave of SARS-CoV-2 was more significant than the first, and its beginning coincides with 284 the emergence and rise of P.1 (Figure 4c ), as already reported (9; 54) . See also Supplementary 4. Over time, the 285 lineages and clusters graphs illustrate how the T0 group prevalence decreased and was probably extinct (or occurred 286 in small quantity), with variants TP1 and the imported groups, B.1.1.7 and the new variant of cluster 13, became 287 dominants in Brazil (Figure 4 ). This transition is more evident when the evolution of the pandemics over time in the 288 PCA and t-SNE graphs is viewed (3D graph of Figure S9 , and the Supplementaries 3 and 4). Furthermore, looking 289 at the development of the lineages over time, we notice a pattern in the origin of new variants, characterized by the 290 formation of new clusters (discussed later). The study of SARS-CoV-2 diversity permitted both: i) understanding the distribution of the variants in the viral 293 population in Brazil (richness), and ii) verifying and comparing the degree of sequencing in different Brazilian states 294 and to compare with other countries (coverage). Here we exploited the concepts of richness and coverage as defined 295 in Methods. São Paulo State (SP) presents the higher diversity (richness value), followed by RJ and RS with 10,221, 296 4,073, and 3,011 estimated distinct proteomes, respectively. The lower richness is in AC, with 50, followed by DF 297 with 99 estimated proteomes (Table S10) . Regarding coverage, the states of SP, RO, and AM had the highest coverage (30.3%, 28.9%, and 26.9% respec-299 tively), while MA, AP, AL, MG, and SC had the lowest coverage (4.4%, 7.2%, 7.2%, 7.8%, and 8.9% respectively). 300 Unfortunately, we could not analyze the states of RR, PI, RN, and MS due to low sampling. The TP1 group presented almost half of the richness in Brazil, with 13,194 estimated distinct proteomes representing 302 about 47.6% of the total local diversity in Brazil ( Table S11 ). The TP1 also presented a high coverage (27.10% 303 on average), which may be corresponding to the increase in sequencing in 2021; clusters 11 and 13 also had high 304 coverages of 33.67% and 25.56%, respectively. T0 had lower coverages with 19.20% on average due to less sequencing 305 at the beginning of 2020. We observed an increase in sampling between 2020 to 2021 and an increase in proteome 306 quality (proteome complete and without misreadings). In 2020, 4,402 samples were sequenced, of which 52% (2,293 307 sequences) with quality, while in 2021, 8,993, of which 71% have good quality (6,427 instances). Brazilian coverage (23.3%) is low compared to Italy, Germany, and England, but it is superior to India. However, the 309 estimated richness of Brazil (26, 370) is lower than the other countries in this study except for Italy (Table 4, Figure S10 ). 310 Table 4 . Diversity comparison in selected countries. Richness and coverage metrics were calculated by Chao 1 method (28) . Ncases = number of cumulative cases in the country (in millions); Nseq = number of quality collected sequences (analyzed); Nunique = number of non-redundant samples. Ncases ( . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The pandemic of COVID-19 is a catastrophic event with severe consequences, leading to losses in almost all human 312 activities, mainly health and the economy. On the other hand, we have a rare opportunity to observe the evolution 313 process in almost real-time; since it promotes a rush for genomes sequencing of a single virus species never seen before. 314 Recent bioinformatics technology provides resources to analyze the big data provided by these efforts and allows us to 315 draw a panoramic view of the SARS-CoV-2 evolution in Brazil and worldwide. The pandemic in Brazil had two moments ( Figure 4) : 317 1. T0 -the early entry of SARS-CoV-2, which occurred in early 2020, characterizing the T0 group in this study. 318 The T0 group embodied many lineages that disappeared over time, and the prevailing lineage were B. We observed that strains tend to be extinct and replaced by newer and more adapted strains holding more 323 advantageous mutations, as observed in other studies (20; 54) . This lineage substitution process was followed in Brazil 324 on several occasions, as in the emergence of the P.2 variant and later of the TP1 group ( Figure 4 ). Our proposed pipeline 7 ( Figure S1 ) allowed us to recognize the appearance of new variants. New variants emerge 326 by moving away from the parental lineage, in a process called "exploitation of the mutational space", which becomes 327 graphically visible by the methods of PCA and t-SNE ( Figure S4) , followed by the establishment of a new cluster. 328 We observed a remarkable variant emergence event during the analysis, the origin of cluster 7, a sublineage of P.1 329 (cluster 10) composed of 126 samples from the State of São Paulo (called P.1-SP at first moment - Figure S4 ). These 330 samples had their designation updated from P.1 to P.4 between v2.3.8 and v3.0.5 of PANGO. There is no reliable phylogenetic analysis of SARS-CoV-2 in the literature. Furthermore, the high mutation rate 332 associated with the large volume of circulating viruses strains in the world entails frequent cases of parallel and 333 backward mutations, resulting in inconsistencies in the determination of lineages and hindering the reconstruction 334 of their evolutionary relationships (20) , difficulties also reported in the survey by Morel et al. (55) . Therefore, 335 we performed a phylogenetic analysis of complete SARS-CoV-2 genomes/proteomes based on vectorial distances 336 matrices. We compared trees based on representing lineage centroids with those based on representing cluster 337 centroids. The cluster centroid-based phylogenetic trees showed to be consistent (bootstrap greater than 85% in all 338 branches), while the lineage centroid-based tree presented a much lower BP and did not show clear differentiation in 339 the evolutionary history of the lineages. It led us to conclude that the division of specimens by clustering is more 340 reliable to the evolutionary mapping and that some inconsistencies may be present in SARS-CoV-2 classification 341 by PANGO. Therefore, the clustering approach presented in this study may help revise the lineages nomenclature 342 process. Additionally, the use of proteomes (amino acid representation) in the evolutionary analyses and the heatmaps 343 (Figures 1 and 5 ) showed more consistency than use of genomes (DNA) in both cluster and lineage divisions. As a 344 consensus across all methods, the TP1 group consistently clusters in a single branch, away from the other Brazilian 345 variants. The phylogeny in Figure S7 does not support the lineage B.1.1.28 as ancestor of P.1. We cannot, however, 350 conclusively rule out the possibility of a Brazilian origin for P.1 since there is a gap in the sampling in the period of 351 the emergence of P.1 in Brazil around October 2020. However, the accumulating body of evidence consistently points 352 to an external P.1 origin: 3. The distance from the Wuhan reference sample is much higher to P.1 than to the other Brazilian instances in 357 2020 ( Figure 1e) ; 5. The machine learning approach found P.1-like SARS-CoV-2 samples circulating the World before the variant 361 emergence in Brazil ( Figure 6 ). The external VOC P.1 entry in Brazil may have been favored by the flexibilization of measures regarding 363 international flights in Brazil in October 2020 (48) , the period of entry/emergence of P.1 in Brazil, also suggested by 364 Faria et al. (2) . After the entry of P.1 in Brazil, the mutations S:H655Y, S:T1027I, S:R190S, S:T20N, ORF1a:S1188L, 365 ORF8:E92K and ORF1b:E1264D probably originated in Brazil, since they are not present in the ancestral PA-TP1, 366 assuming these samples as reference. Among these mutations, the S:H655Y promotes escape from the immune system 367 (51). Recombination is a common phenomenon in the Coronaviridae family (47) ; however, there are indications that 369 recombinant events between SARS-CoV-2 strains are rarer than expected (56) . Our results indicate no recombination 370 event in the origin of the P.1 variant; however, such an event can relate to B.1.1.28 and P.1 variants. The RAPR tool 371 results indicate that a subgroup of B.1.1.28, a subset of cluster 6 in our study, the same group identified as 28-AM-II 372 (A6613G) clade by Naveca et al. (54) , was originated by recombination between a P.1 and a foreign sample close to 373 the hCoV-19/USA/NC-UNC-0017 (EPI ISL 831339 -B.1.1.1) (Table S9) . Thus, our analysis points to the possibility 374 that clade 28-AM-II comes from recombination, but it is not an ancestor of P.1. We propose Cluster 9 as the probable ancestral cluster of the TP1 group ( Figure 5 ). It contains the PA-TP1 376 samples, the sequenced strains closest to the ancestors of the P.1 lineage. Furthermore, the hypothesis is supported 377 by the PCA (Figure 2a) , which shows cluster 9 as the furthest apart among the TP1 clusters, dispersed as in the 378 described "exploitation of the mutational space" during the origin of new variants, forming a bridge between itself 379 and the other TP1s. The diversity analysis revealed that coverage of viral subvariants is low in all Brazilian states (Table S10) , and 381 13 of the 27 Brazilian states had less than 100 quality samples until May 2021. São Paulo (SP) and Rio de Janeiro 382 (RJ) States present more sequencing and had 4,386 and 1,170 sequenced samples, respectively. The state of SP is the 383 national center of the pandemic, having the highest virus richness. In addition, SP is the the main hub for national 384 and international travel, representing more than 70% of international flights from/to Brazil (7) . Therefore, it was 385 expected a large circulation of different viral variants in this state. Candido et al. (7) indicate that, like SP, the states 386 MG, CE, and RJ are major international travel entry centers. For these states, the estimated richness also presented 387 high values (Table S10) , except CE, that appear to have underestimated richness, likely due to the low sampling. Wealthier countries, such as those in Europe, also presented the highest richness estimations despite the lower 389 number of COVID-19 cases (Table 4) . Thus, we hypothesize that European countries, receiving more international 390 inflows, have a higher chance of variants entering, increasing their viral richness, such as in the State of São Paulo in 391 Brazil. The disproportion in sampling between states (Table S12) makes it difficult to compare the evolutionary history of 393 the virus among states. In addition, the low coverage in regions may hide VOCs, making their tracking hard (9) . Also, 394 globally, there is a concentration of sequencing. Ten countries account for 85% of the GISAID samples and only 35% 395 of the World's cases. Disproportion in sampling between different countries results in strains remaining undetectable 396 until they become widely spread, and then it is no longer possible to effectively control their dispersion (20) . We 397 think that, similarly, the subsampling in Brazilian states corroborated the sudden spreading of the P.1 lineage. Comments on PANGO and GISAID database updates 399 In the current version of PANGO in v3.1.11, all P.4 samples from all clusters have been reclassified as P.1 (excepting 400 those of cluster 13). It came in accord with our findings and explains the overlapping observed in Figure 2c concerning 401 the clusters of the TP1 group, based on the previous terminology (Figures 3, 1a-d) . Also, concerning the T0 group, 402 cluster 5 has 77 samples of B.1.1.28 consistently separated from the others, which had its designation updated to 403 variant P.7, which agrees with our analysis. More about updates in PANGO in Table S13 . 404 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. Here we present a data-science-based overview of the SARS-CoV-2 pandemics in Brazil. Brazil has a scenario divided 406 into two moments. The first, which occurs throughout 2020 to early 2021, is composed of several variants from the 407 initial entry of SARS-CoV-2 strains in Brazil (Figure 4) . The primary early entry samples are the B.1.1.1, B.1.1.28, 408 and B.1.1.33 variants, and later occurs the emergence of P.2 and the recently named P.7. The second moment sees 409 the emerging of P.1 (and correlated variants) and the entry of foreign strains -late 2020 and early 2021 -a period 410 when preventive measures were relaxed (48) . The classification performed by PANGO generally corresponds to our analyses.Most of the recent updates of the 412 PANGO nomenclature (v3.1.11) are supported by our findings; however, constant changes impair the comparability of 413 studies. Our analyses showed that the clustering method groups the sequences by evolutionary similarity, making it 414 suitable to classify tasks even for nomenclature purposes. Additionally, the results of proteomic evolutionary analyses 415 showed to be more consistent than genomic ones and thus ideal for this analysis. Therefore, the proposed pipeline is 416 based on proteomic sequences. The process of emergence and extinction of new lineages occurs continuously and gradually (54) . New strains 418 arise through a process of "genomic exploration", distancing from the parents and consolidating ( Figure S4 ). This 419 process is visible graphically by t-SNE and PCA on the vectorized and clustered sequences, methods that detect 420 emerging variants as demonstrated in the notable case of cluster 7. The considerable distance between the sequences of the TP1 group and the other Brazilian sequences led us to 422 question the Brazilian origin of the lineage P.1. We could not completely discard the Brazilian source from group T0. 423 Still, we have assembled a set of evidence that suggests its external origin (Figures 3, 5) . The recombination analysis 424 did not detect a recombination event in the root of P.1; however, it shows a possible recombination event between 425 a Brazilian P.1 specimen and a B.1.1.1 from the USA, generating a subgroup of B.1.1.28, interpreted in previous 426 studies as the ancestral clade of P.1 (54) . A considerable increase in sequencing has occurred in 2021 in Brazil; however, the disproportion between states 428 remains. Disproportionate monitoring, both nationally and internationally (20) , allows dangerous variants to remain 429 undetectable until they become widespread. In addition, the richness estimations point out the existence of a much 430 larger number of variants that are not yet sequenced (Table S10, Figure S12 ). The analyses indicate that Chao's 431 metric combined with vector representation for proteomes is a suitable method for viral diversity analysis (28) . Worldwide sequencing is scarce, around 12% (Table 4 ). We have seen that the lack of monitoring by sequencing in 433 Brazil has allowed P.1 to spread silently; moreover, we could not trace the origin of its large number of accumulated 434 mutations (17 in all), mutations which make this VOC dangerous. The low sequencing associated with a great richness 435 of variants, observed in countries like India, may lead to the emergence of new VOCs, such as the Indian B.1.617.2 436 (Delta). Therefore, sequencing should increase, and border control measures help control the spread of dangerous 437 variants. The variant Delta is replacing the P.1 lineage in Brazil ( Figure S11 ) (24) . Based on this study, we suspect that 439 the emergence and domination of Delta in the World follow an analogous way that P.1 dominated in Brazil. Moreover, 440 due to some countries' low sampling and vaccination, new VOC can still emerge and even replace Delta itself. We 441 hope our study can bring new lights to better understanding such viral evolving behavior, now and in other eventual 442 future outbreaks. 443 substantial contributions, revisions and approved the final manuscript. R.T.R. supervised the whole project. All 452 authors contributed thoughts and advice, discussed the results, and contributed to writing the final manuscript. A novel bat coronavirus closely related to SARS-CoV-2 contains natural insertions at the S1/S2 cleavage site of the spike protein Genomics and epidemiology of the P. 1 SARS-CoV-2 lineage in Manaus The SARS-CoV-2 spike protein: balancing stability and infectivity Structural and functional analysis of the D614G SARS-CoV-2 spike protein variant Tracking changes in SARS-CoV-2 spike: evidence that D614G increases infectivity of the COVID-19 virus SARS-CoV-2 lineages and sub-lineages circulating worldwide: a dynamic overview Evolution and epidemic spread of SARS-CoV-2 in Brazil Estimating the quarantine failure rate for COVID-19 Mutation hotspots, geographical and temporal distribution of SARS-CoV-2 lineages in Brazil GISAID: Global initiative on sharing all influenza data -from vision to reality Artificial intelligence and machine learning to fight COVID-19 Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study Alignment-free machine learning approaches for the lethality prediction of potential novel human-adapted coronavirus using genomic nucleotide Sweep: representing large biological sequences datasets in compact vectors Comparative Genomics Provides Insights into the Taxonomy of Azoarcus and Reveals Separate Origins of Nif Genes in the Proposed Azoarcus and Aromatoleum Genera Continuous distributed representation of biological sequences for deep proteomics and genomics Viral phylogenomics using an alignment-free method: A three-step approach to determine optimal length of k-mer Prot-spam: Fast alignment-free phylogeny reconstruction based on whole-proteome sequences Lessons learned 1 year after sars-cov-2 emergence leading to covid-19 pandemic One year into the pandemic: Short-term evolution of SARS-CoV-2 and emergence of new lineages Data, disease and diplomacy: GISAID's innovative contribution to global health A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology SARS-CoV-2 variants combining spike mutations and the absence of ORF8 may be more transmissible and require close monitoring Nextstrain: real-time tracking of pathogen evolution rSWeeP: AR/Bioconductor package deal with SWeeP sequences representation. bioRxiv ConsensusClusterPlus: a class discovery tool with confidence assessments and item tracking Visualizing data using t-SNE Nonparametric estimation of the number of classes in a population Estimating terrestrial biodiversity through extrapolation ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R Whole-proteome phylogeny of large dsdna virus families by an alignment-free method An assembly and alignment-free method of phylogeny reconstruction from next-generation sequencing data MEGA X: molecular evolutionary genetics analysis across computing platforms Evolution of protein molecules Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data Tracking HIV-1 recombination to resolve its contribution to HIV-1 evolution in natural infection RDP4: Detection and analysis of recombination patterns in virus genomes RDP: detection of recombination amongst aligned sequences Identification of breakpoints in intergenotypic recombinants of HIV type 1 by bootscanning Analyzing the mosaic structure of genes Evaluation of methods for detecting recombination from DNA sequences: computer simulations An exact nonparametric method for inferring mosaic structure in sequence triplets Possible emergence of new geminiviruses by frequent recombination Phylogenetic evidence for recombination in dengue virus Sister-scanning: a Monte Carlo procedure for assessing signals in recombinant sequences Genomic recombination events may reveal the evolution of coronavirus and the origin of SARS-CoV-2 Coronavírus: na contramão do mundo, Brasil segue sem restriçõesà entrada de estrangeiros por aeroportos Recombinant SARS-CoV-2 genomes are currently circulating at low levels. bioRxiv Structural modeling of the SARS-CoV-2 Spike/human ACE2 complex interface can identify high-affinity variants associated with increased transmissibility Spreading of a new SARS-CoV-2 N501Y spike variant in a new lineage SARS-CoV-2 variant of concern 202012/01 has about twofold replicative advantage and acquires concerning mutations Estimated transmissibility and impact of sars-cov-2 lineage b. 1.1. 7 in england COVID-19 epidemic in the Brazilian state of Amazonas was driven by long-term persistence of endemic SARS-CoV-2 lineages and the recent emergence of the new Variant of Concern P Phylogenetic analysis of SARS-CoV-2 data is difficult Rapid detection of inter-clade recombination in SARS-CoV-2 with Bolotie The authors are deeply grateful to all the researchers and organizations collaborating to maintain and share SARS-445 CoV-2 genomic data on the GISAID Platform (See Supplementary 2). The authors thank the group of Artificial 446 Intelligence Applied to Bioinformatics of Federal University of Paraná and CAPES (Coordination for the Improvement 447 of Higher Education Personnel) for the financial support. The web platform Nextstrain, used for genome alignment, can be accessed at https://clades.nextstrain.org/ 471 (24) . For the recombination analysis, the RDP4 software is available for download at http://web.cbio.uct.ac.za/ 472 darren/rdp.html (38) , and the RAPR web tool is available at https://www.hiv.lanl.gov/content/sequence/ 473 RAP2017/rap.html (37) . The MEGAX 10.2.6 tool, used for the phylogenetic analysis genome-based, can be downloaded 474 from https://www.megasoftware.net/ (33). Finally, for visualization of phylogenetic trees, the iTOL online tool 475 can be accessed at https://itol.embl.de/ (35). 454 The authors declare no competing interests 455 Data availability 456 Genomes and proteomes were obtained at Global Initiative on Sharing Avian Influenza Data (GISAID), available at 457 https://gisaid.org/. Complete information about the Brazilian, Indian, German, Italian, British, and World-2020 458 sequences analyzed has been deposited at https://github.com/CamilaPPerico/SARS-CoV-2_Brazil_Landscape/ 459 tree/main/metadata. The Wuhan reference sequence (NC 045512.2) was obtained from the National Center for 460 Biotechnology Information (NCBI) https://www.ncbi.nlm.nih.gov/nuccore/1798174254. Information regarding 461 the number of COVID-19 cases in Brazil was collected from official website https://covid.saude.gov.br/. Further 462 data and materials produced in this study are available in the supplementary information (the contents of each file 463 are described in Supplementary 1) and at https://github.com/CamilaPPerico/rSWeeP_case_study 464 Code availability 465 The pipeline developed in this study, coded in R language, is available at https://github.com/CamilaPPerico/ 466 SARS-CoV-2_Brazil_Landscape/tree/main/scripts Code and instructions for reproducing our main results are 467 described in Supplementary 1. The R version of the SWeeP tool is deposited at Bioconductor platform, available for 468 version R 3.12 (25) , at https://bioconductor.org/packages/release/bioc/html/rSWeeP.html. Previous versions of PANGOLIN are available at https://github.com/cov-lineages/pangolin/releases (22) . 470