key: cord-1044209-mz7d9kkv authors: Juscamayta‐López, Eduardo; Carhuaricra, Dennis; Tarazona, David; Valdivia, Faviola; Rojas, Nancy; Maturrano, Lenin; Gavilán, Ronnie title: Phylogenomics reveals multiple introductions and early spread of SARS‐CoV‐2 into Peru date: 2021-07-16 journal: J Med Virol DOI: 10.1002/jmv.27167 sha: 875a68ae0210516807c148e30fcb031786e69706 doc_id: 1044209 cord_uid: mz7d9kkv Peru has become one of the countries with the highest mortality rates from the current coronavirus disease 2019 (COVID‐19) pandemic caused by the severe acute respiratory syndrome coronavirus 2 (SARS‐CoV‐2). To investigate early transmission events and the genomic diversity of SARS‐CoV‐2 isolates circulating in Peru in the early COVID‐19 pandemic, we analyzed 3472 viral genomes, of which 149 were from Peru. Phylogenomic analysis revealed multiple and independent introductions of the virus likely from Europe and Asia and a high diversity of genetic lineages circulating in Peru. In addition, we found evidence for community‐driven transmission of SARS‐CoV‐2 as suggested by clusters of related viruses found in patients living in different regions of Peru. We obtained RNA from nasal and pharyngeal swab samples collected from confirmed COVID-19 patients at the National Institute of Health, Peru (NIH-Peru), between March 5, 2020, and July 4, 2020, from departments of Lima, Callao, Ancash, Lambayeque, Ica, Arequipa, and Junin. Cases were randomly selected regardless of disease severity and hospital origin from samples that tested positive for SARS-CoV-2 RNA by quantitative reverse-transcription polymerase chain reaction and with Ct less than 25. The WGS of SARS-CoV-2 isolates (n = 96) were performed on MiSeq (Illumina) at NIH-Peru using the CleanPlex ® SARS-CoV-2 Panel (Paragon Genomics) through amplicon-based target enrichment. The quality of the obtained sequences was evaluated through the software FastQC v0.11.5 (https://www.bioinformatics.babraham.ac. uk/projects/fastqc). Contaminant residues, low-quality, and duplicated reads were trimmed off using the program fqCleaner v.0.5.0, with a Phred quality score of 28. Filtered reads were mapped against SARS-CoV-2 reference (NC_045512) using the Burrows-Wheeler Aligner MEM algorithm (BWA-MEM) v0.7.7. SAMtools v1.9 6 and Geneious Prime were used to sort BAM files, generate alignment statistics, and obtain consensus sequences. Previous to the construction of the consensus sequence, we removed primer adapters with the software fgbio (https://github.com/fulcrumgenomics/fgbio) using the primer sequence coordinates provided by Paragon Genomics. The SARS-CoV-2 genomes from Peru (n = 62) available in GISAID as of August 26, 2020, were selected getting a data set of 149 Peruvian genomes that cover a period from March 5, 2020 (officially first reported case) to July 4, 2020, of 12 Peruvian regions. To place the Peruvian sequences into a global context and avoiding dealing with a large data set computationally intractable, we retrieved sequences from other countries by means of a subsample (n = 3323) from viral genomes publicly available in GISAID randomly choosing one genome per country per day (collection date) between December 24, 2019 (including Wuhan-01 genome) and July 1, 2020. The final genomic data set comprised 3472 sequences to investigate the origins and genomic diversity of SARS-CoV-2 isolates circulating in Peru by Maximum-Likelihood (ML) phylodynamic approaches. The full genomic data set (n = 3472) was aligned using MAFFT v7.1 7 with default parameters. The alignment was manually curated, trimming the 5ʹ and 3ʹ ends and ambiguous regions obtaining an alignment length of 29520 bp. We estimate an ML tree of 3472 aligned sequences using IQ-tree v1.6 8 under the HKY nucleotide substitution model, with a gamma distribution of among-sites rate variation (HKY + G + I) as selected by ModelFinder. 9 The EPI_ISL_406801 sequence (Wuhan/WH04/ 2020), a basal A lineage sequence, was used as an outgroup for the ML tree. 10 To measure branch support, we used the Shimodaira-Hasegawa and approximate likelihood-ratio test with 1000 replicates. TempEst v1.5.1 11 was used to assess the strength of the temporal signal and inspect for outliers in the data set by a root-to-tip regression of genetic distance against sampling date. We used the program TreeTime v0.7. 6 12 to estimate an ML time tree. The analysis was performed using the HKY substitution model and a coalescent Skyline prior under a strict clock. 13 We also passed the flag-confidence to retrieve node dates with 90% confidence intervals (CIs). The clades were analyzed with Next Strain classification for SARS-CoV-2 (https://nextstrain.org/ncov) and clade-specific nonsynonymous mutations were identified. Introductions or local transmission events were designated as nodes (1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12) (13) (14) (15) (16) (17) (18) with more than or equal to 70% of statistical support value in the ML tree and the time to the most recent common ancestor (tMRCA) was estimated for each node with 90% CIs. Introduction events were defined as Peru sequences that clustered with non-Peru sequences across different clades. Local transmission events were defined as Peru-exclusive clusters with at least five or more sequences that were reproduced in the time-scaled inference. 14 Statistical analyses and tree visualization using ggtree package were carried out in R v3.6.2. The lineages were determined using the nomenclature of Phylogenetic Assignment of Named Global Outbreak LINeages (PANGOLIN) (https://pangolin.cog-.io/). 15 Next strain classification helps to reference origins of sequence patterns while the nomenclature of PANGOLIN provides a convenient scheme for genomically detectable introductions of SARS-CoV-2 into new regions. CI) and the inferred ancestral root being Asia. These observations are in line with the known epidemiology of the pandemic. 5 Furthermore, the analysis of genomic data has shown that Peruvian isolates were widely distributed across the phylogenetic tree suggesting multiple and independent introductions, designed as nodes (1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12) (13) (14) (15) (16) (17) (18) with over 70% of statistical support from ML ( Figure 2A and Table 1 ). Similar introductions have been reported in other countries such as Brazil, 16 Colombia, 17 and the USA. 14 Most of these putative introductions of SARS-CoV-2 into Peru occurred between mid-February and early March, likely sourced from Europe, Asia, North America, and South America (Table 1) . Concordantly, the COVID-19 pandemic reached Latin America in February 2020 expanding into the region until March 2020, when the COVID-19 incidence curve started to grow more rapidly. 3 Also, our results coincide with initial analysis based on minimum spanning tree (MST) of 36 Peruvian SARS-CoV-2 genomes that suggested the introduction of multiple isolates from Europe and Asia. 18 However, similar to genome-based analyses, the MST model must be supported by an exhaustive clinical-epidemiological investigation to identify the root of transmission in an outbreak of infectious diseases. 19 Peruvian isolates (57%) were mainly clustered with clade 20B according to the Next strain nomenclature. This clade is predominantly composed of isolates obtained from patients with COVID-19 in Europe (51%), suggesting that introductions from Europe account for the majority of cases found in Peru between February and early March 2020 ( Figure 2B and Table 1) . Surprisingly, as of March 12, 2020, the Ministry of Health reported 11 confirmed cases (out of 22 total cases) in Lima with recent travel history to Spain, France, and Italy. 20 We also identified within clade 20B, five putative introductions (node 1-5) where the most Peruvian isolates were interspersed without grouping by country or geographic regions. Within this clade, we also identified SARS-CoV-2 Peruvian sequences that were reproduced in the time-scaled inference with mutations T1246I and G3278S in the ORF1a gene that distinguish clusters of sequences from Peru and elsewhere, suggesting a local transmission event that likely occurred in early March (tMRCA 90% CI: 27 February-2 March) ( Table 1 , node 1, Figures 2A and S1B ). The SARS-CoV-2 sequence of this patient clustered with other Peruvian isolates (from March 6, 2020, to July 4, 2020) conforming a "Peruvian cluster," all of which were closely related with European isolates confirming its likely origin (Figures 2 and S1B ). The tMRCA of this cluster (node 10) was estimated to be March 4, 2020 (February 25, 2020-March 4, 2020, 90% CI) and contains two specific amino acid substitutions: H604Y in the ORF1b gene and I119V in the spike protein (Table 1) (Figure 2A and Table 1 ). The clades 19A and 19B have mainly included Asian isolates (72% and 54%, respectively) suggesting that likely introductions from Asia account for the majority of cases found in Peru in the early pandemic period (Figure 2A ,B and Table 1 ). To Figure 2A) . April. 16, 17, 24, 26 It is known that the carriage of SARS-CoV-2 mutations into naive populations might be due to the neutral founder effect. Nevertheless, the ratio of nonsynonymous to synonymous mutations in S protein (0.25-0.5) is consistent with an emerging virus undergoing purifying selection. 27 However, any other factors might be involved in the global predominance of the G614 variant including uneven sampling, chance, and epidemiological reasons. 28 Although the G614 variant has not been associated with disease severity, it correlated strongly with the mortality rate of COVID-19 (Pearson's correlation coefficient (r) = 0.43, p = 0.022) during the early pandemic in a global survey of countries. 29, 30 One of the limitations of this study is the difference in the sequenced genomes between countries. It is well known that some locations (e.g., UK) are sequencing SARS-CoV-2 genomes much more intensively than others while other countries are not sequencing at all. This means that some locations may appear to be infection sources simply because they provide more sequences, rather than actually being the source of infection for a particular disease. In addition, due to the relatively long incubation period and low mutation rate of SARS-CoV-2, two genome sequences may be identical despite having last shared a common ancestor 4-6 weeks ago-during which time the pathogen might have been transmitted through one or multiple missed nodes and locations (e.g., Germany to the USA to Peru, rather than directly Germany to Peru). Another limitation is the differences between the number of Peruvian isolates from cases identified within Peru regions. Specifically, there were more SARS-CoV-2 representative' sequences in Lima than in the other Peruvian departments. Moreover, in the absence of epidemiological information such as travel history and contact tracking, it is hard to associate periods of untracked transmissions with any specific regions or countries. Although our estimates might be biased due to the above- The authors declare that there are no conflicts of interest. The use of human biological material was approved by the Ethics A new coronavirus associated with human respiratory disease in China Emergence of SARS-CoV-2 through recombination and strong purifying selection Early transmission dynamics of COVID-19 in a southern hemisphere setting Sala situacional COVID-19 Peru Bayesian phylodynamic inference on the temporal evolution and global transmission of SARS-CoV-2 Rapid genomic characterization of SARS-CoV-2 by direct amplicon-based sequencing through comparison of MinION and Illumina iSeq MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies ModelFinder: fast model selection for accurate phylogenetic estimates Geographical and temporal distribution of SARS-CoV-2 clades in the WHO European Region Exploring the temporal structure of heterochronous sequences using TempEst (formerly Path-O-Gen) TreeTime: maximum-likelihood phylodynamic analysis The coalescent Introductions and early spread of SARS-CoV-2 in the New York City area A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology Evolution and epidemic spread of SARS-CoV-2 in Brazil Genomic epidemiology of severe acute respiratory syndrome coronavirus 2, Colombia. Emerg Infect Dis Genomic analysis reveals local transmission of SARS-CoV-2 in early pandemic phase in Peru Use of the minimum spanning tree model for molecular epidemiological investigation of a nosocomial outbreak of hepatitis C virus infection Prevención y Control de Enfermedades Alerta Epidemiológica Ante La Presencia de Casos Confirmados de COVID-19 En El Perú Facebook Data for Good to Response to the COVID19 Pandemic Genomic epidemiology of SARS-CoV-2 reveals multiple lineages and early spread of SARS-CoV-2 infections in Lombardy, Italy Geographical distribution of genetic variants and lineages of SARS-CoV-2 in Chile. Front Public Health Tracking changes in SARS-CoV-2 Spike: evidence that D614G increases infectivity of the COVID-19 virus Spike mutation D614G alters SARS-CoV-2 fitness The arrival and spread of SARS-CoV-2 in Colombia Implications of SARS-CoV-2 mutations for genomic RNA structure and host microRNA targeting We shouldn't worry when a virus mutates during disease outbreaks SARS-CoV-2 viral spike G614 mutation exhibits higher case fatality rate SARS-CoV-2 genomic variations associated with mortality rate of COVID-19 How to cite this article Phylogenomics reveals multiple introductions and early spread of SARS-CoV-2 into Peru