key: cord-0272145-19pqcxhh
authors: Dezordi, F. Z.; Resende, P. C.; Naveca, F. G.; do Nascimento, V. A.; de Souza, V. C.; Paixao, A. C. D.; Appolinario, L.; Lopes, R. S.; Mendonca, A. C. d. F.; da Rocha, A. S. B.; Venas, T. M. M.; Pereira, E. C.; Salvato, R. S.; Gregianini, T. S.; Martins, L. G.; Pereira, F. M.; Rovaris, D. B.; Fernandes, S. B.; Ribeiro-Rodrigues, R.; Costa, T. O.; Sousa, J. C.; Miyajima, F.; Delatorre, E.; Graf, T.; Bello, G.; Siqueira, M. M.; Wallau, G. L.
title: Unusual SARS-CoV-2 intra-host diversity reveals lineages superinfection
date: 2021-09-23
journal: nan
DOI: 10.1101/2021.09.18.21263755
sha: 1bd671063f5fdfd6ae29c30005ba3f8f6456a426
doc_id: 272145
cord_uid: 19pqcxhh

SARS-CoV-2 had infected almost 200 million people worldwide by July 2021, and the pandemic has been characterized by infection waves of viral lineages showing distinct fitness profiles. The simultaneous infection of a single individual by two distinct SARS-CoV-2 lineages provides a window of opportunity for viral recombination and the emergence of new lineages with distinct phenotypes. Several hundred SARS-CoV-2 lineages are currently well characterized, but two main factors have precluded major coinfection/codetection analysis thus far: i) the low diversity of SARS-CoV-2 lineages during the first year of the pandemic, which limited the identification of the lineage-defining mutations necessary to distinguish coinfecting viral lineages; and ii) the limited availability of raw sequencing data in which the abundance and distribution of intrasample/intrahost variability can be assessed. Here, we assembled a large sequencing dataset from Brazilian samples covering the period from 18 May 2020 to 30 April 2021 and probed it for unexpected patterns of high intrasample/intrahost variability. This enabled us to detect nine cases of SARS-CoV-2 coinfection with well characterized lineage-defining mutations.
In addition, we matched these SARS-CoV-2 coinfections with spatio-temporal epidemiological data, confirming their plausibility given the lineages co-circulating in the timeframe investigated. These coinfections represent around 0.61% of all samples investigated. Although our data suggest that coinfection with distinct SARS-CoV-2 lineages is a rare phenomenon, this is likely an underestimate, and coinfection rates warrant further investigation.

The raw fastq data of codetection cases are deposited on gisaid.org and correlated to [...]

The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

SARS-CoV-2, the etiological agent of the COVID-19 pandemic, has a relatively low mutation rate compared to other RNA viruses 1, and most viral lineages are normally defined by only a few synapomorphic SNPs (n < 10) 2. However, the pervasiveness of SARS-CoV-2 infections during the COVID-19 pandemic provided substantial opportunities for the virus to explore the fitness landscape through single nucleotide substitutions and/or indels, giving rise to a range of more transmissible variants of concern (VOCs). These lineages are characterized by an unusually high number of lineage-defining SNPs along the genome (n > 15) 3,4,5. Coinfection is defined as the simultaneous infection of a single cell/host by more than one virus lineage.
Although it is a rare phenomenon, coinfection may provide an opportunity for genetic recombination, an event known to occur in viruses of the Coronaviridae family 6,7. Recombinant viruses may, in turn, trigger the emergence of new lineages with enhanced biological properties, including the capacity to infect new hosts (expansion of viral host range) 8-11. The frequency of coinfected patients and its role in promoting recombination-driven SARS-CoV-2 evolution and the emergence of SARS-CoV-2 lineages are still poorly understood. The low variability of SARS-CoV-2 lineages and the small number of well-defined lineage-specific SNPs until the second half of 2020 have probably hindered the identification of coinfection and recombination events so far. In contrast, the emergence of VOC lineages carrying a substantial number of additional SNPs may now provide enough markers to detect these events. A number of coinfection cases have been reported for SARS-CoV-2, including lineages B. [...] 14,15.

In this study, we assessed amplicon sequencing reads of 2,263 SARS-CoV-2 samples from Brazilian patients generated by the Fiocruz Genomic Surveillance Network. We [...] were also co-circulating at the time of sampling, thus providing further plausibility for our findings.

The sequencing data were obtained through the genomic survey of SARS-CoV-2 positive samples sequenced by the Fiocruz COVID-19 Genomic Surveillance Network between 18 May 2020 and 30 April 2021. The SARS-CoV-2 genomes were recovered using previously described Illumina protocols 16-18 (Table S1).
The frequency of lineages by Brazilian state was evaluated using data recovered from GISAID (gisaid.org) on 23 July 2021.

The Fastq reads were submitted to an in-house workflow available at https://github.com/dezordi/IAM_SARSCOV2, which performs the following steps: removal of duplicated reads, adapters and read ends with a Phred quality score below 20, using the fastp tool 19; reference-guided genome assembly with BWA 20, mapping reads against the SARS-CoV-2 Wuhan reference genome (NC_045512.2); generation of consensus genomes with samtools mpileup 21 and iVar 22, using a quality score threshold of 30 and calling the SNPs and indels present at major allele frequencies; after consensus generation, the bam-readcount tool 23 [...] positions. The consensus genomes were submitted to the PangoLineage tool v1.1.23 (pangoLEARN update of 28 May 2021) 24 and to Nextclade 25. Only genomes with more than 95% coverage breadth and 100 reads of average coverage depth (Table S2) [...]

A reference alignment was created using MAFFT 27 with the 6,167 genomes, which represent the genomes present in the nextstrain 28 [...]

Our initial analysis revealed that 1,462 out of 2,263 genomes had sufficient sequencing breadth and depth to consistently detect and characterize viral genomic variability at the level of sequencing reads. Of these 1,462 SARS-CoV-2 positive samples, 1,150 showed at least one genomic site with supported intra-host variability, that is, at least one genomic position with more than 100 reads supporting a minimum of two alternative nucleotides.
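The definition of a site with supported intra-host variability can be illustrated with a minimal Python sketch. It is not the code of the IAM_SARSCOV2 workflow: the input layout (a dictionary of per-position base counts, such as could be parsed from bam-readcount output), the function name and the `min_minor_freq` cutoff are assumptions made for illustration; only the depth threshold of more than 100 reads comes from the text.

```python
from typing import Dict, List, NamedTuple


class VariableSite(NamedTuple):
    position: int      # 1-based genomic coordinate
    depth: int         # total reads counted at this position
    major: str         # most frequent nucleotide
    minor: str         # second most frequent nucleotide
    minor_freq: float  # fraction of reads supporting the minor nucleotide


def find_intrahost_sites(
    base_counts: Dict[int, Dict[str, int]],
    min_depth: int = 100,           # from the text: more than 100 reads
    min_minor_freq: float = 0.05,   # hypothetical cutoff, not from the paper
) -> List[VariableSite]:
    """Return positions where two nucleotides are both supported by reads."""
    sites = []
    for position in sorted(base_counts):
        counts = base_counts[position]
        depth = sum(counts.values())
        if depth <= min_depth:
            continue  # not enough reads to trust the variability
        # Rank nucleotides by read support (ties broken alphabetically).
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        if len(ranked) < 2:
            continue
        (major, _), (minor, minor_reads) = ranked[0], ranked[1]
        minor_freq = minor_reads / depth
        if minor_reads > 0 and minor_freq >= min_minor_freq:
            sites.append(VariableSite(position, depth, major, minor, minor_freq))
    return sites


# Toy counts (invented values, not data from the study).
example_counts = {
    241: {"A": 0, "C": 10, "G": 0, "T": 990},     # minor allele too rare
    23012: {"A": 400, "C": 0, "G": 600, "T": 0},  # two well-supported alleles
    28881: {"A": 30, "G": 50},                    # depth below threshold
}
sites = find_intrahost_sites(example_counts)
for site in sites:
    print(site)
```

A frequency cutoff of some kind is needed to separate genuine minor variants from sequencing errors; the value used here is only a placeholder and should be replaced by the threshold of the actual workflow.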
Those samples showed an average coverage depth of 1817.46 (stdev = 908.59) and an average coverage breadth supported by at least 100 reads of 99.66% (stdev = 1.10) (Table S2). In addition, we estimated a mean of 2.57 genomic sites showing intra-host variants (Table S3). Major and Minor consensus sequences were generated for all samples bearing well-supported alternative nucleotides. These alternative consensus genomes, representing the viral genome variability found in each sample, were then assessed for lineage assignment using the PangoLineage tool. If the same lineage was recovered for both genomes, the Major and Minor variants do not differ at lineage-defining SNPs, and the variability observed likely resulted from de novo intra-host variants that emerged during viral replication. Conversely, if the Major and Minor genomic variants were assigned to different lineages, the intra-host variability observed is more likely derived from a codetection event. We detected 16 instances in which Major and Minor variants were assigned to distinct lineages (intra-host sites: mean = 24, stdev = 9.75), including the former Variants for Further Monitoring (VFM) N.9 and P.2 as well as the highly circulating VOC P.1 (Table S4) [...] (Figure 2A, Table S5). Seven out of nine putative coinfection events involve the VOC Gamma (P.1 lineage) (Table 1) [...] SNPs characteristic of this lineage that facilitate the distinction between coinfecting SARS-CoV-2 lineages. As more distinct lineages bearing many lineage-defining SNPs coinfect the same host, it becomes increasingly feasible to objectively distinguish the coinfecting lineages through the reconstruction of alternative intrasample viral genomes.
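The Major/Minor reconstruction and the decision rule described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the function names, the `backbone` argument (a consensus-like sequence shared by both genomes) and the input layout are hypothetical, and lineage assignment itself is left to an external tool such as PangoLineage, whose output is passed in as plain strings.

```python
from typing import Dict, Tuple


def build_major_minor(
    backbone: str,
    site_alleles: Dict[int, Tuple[str, str]],
) -> Tuple[str, str]:
    """Build Major and Minor genomes from a shared backbone sequence.

    site_alleles maps a 1-based position to (major_base, minor_base).
    Positions without intra-host variability keep the backbone base.
    """
    major = list(backbone)
    minor = list(backbone)
    for position, (major_base, minor_base) in site_alleles.items():
        if not 1 <= position <= len(backbone):
            raise ValueError(f"position {position} is outside the backbone")
        major[position - 1] = major_base
        minor[position - 1] = minor_base
    return "".join(major), "".join(minor)


def interpret_lineages(major_lineage: str, minor_lineage: str) -> str:
    """Apply the rule from the text to a pair of lineage assignments."""
    if major_lineage == minor_lineage:
        return "de novo intra-host variation"
    return "putative codetection"


# Toy example with an 8-base backbone and two variable sites (invented).
major_seq, minor_seq = build_major_minor("ACGTACGT", {2: ("T", "C"), 5: ("G", "A")})
print(major_seq, minor_seq)
print(interpret_lineages("P.1", "P.1"))
print(interpret_lineages("P.1", "B.1.1.28"))
```

Treating all major alleles as one genome and all minor alleles as another assumes that the alleles at different sites are linked on the same viral genomes, which becomes more credible as the number of lineage-defining sites separating the two lineages grows.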
To assess whether codetection could result from sample contamination, we reprocessed sample AM-FIOCRUZ-21142481RG starting from RNA extraction, followed by library preparation and sequencing. We confirmed the intrahost variability for 25 out of the 31 sites present in the first sequencing run (Table S6). Moreover, lineage assignment, phylogenetic reconstruction and the detection of lineage-defining mutations confirmed the codetection status of that sample (Table S4).

This study reports that codetection/coinfection events occurred at a low rate in Brazil (0.61%, 9 of 1,462 samples). This is certainly an underestimate, given the difficulty of detecting true coinfection events involving the earlier, low-divergence SARS-CoV-2 lineages that dominated the first year of the pandemic. Even so, considering that around 190 million SARS-CoV-2 cases had been recorded worldwide by July 2021 (https://coronavirus.jhu.edu/map.html), itself a lower bound, we can infer that at least 1.1 million patients have been coinfected across the world, which in turn provides a substantial window of opportunity for SARS-CoV-2 recombination events. Moreover, this estimate is certainly biased downward because asymptomatic infections are largely unaccounted for.

In line with other studies, we showed that SARS-CoV-2 has apparently low intrahost variability overall. Our in-depth analysis revealed at least nine codetection events, which are corroborated by epidemiological data on co-circulating lineages in different Brazilian states. Moreover, the lineages identified revealed the early emergence of cryptically [...]

The authors declare that there are no conflicts of interest.

This preprint is made available under a CC-BY 4.0 International license.
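The worldwide extrapolation given in the Discussion (9 codetections in 1,462 samples, applied to about 190 million recorded cases) can be checked with a few lines of Python. The function name is hypothetical and the calculation simply assumes, as the text does, that the rate observed in the Brazilian dataset applies to all recorded cases.

```python
from typing import Tuple


def extrapolate_coinfections(
    n_codetected: int,
    n_samples: int,
    recorded_cases: int,
) -> Tuple[float, float]:
    """Return the observed codetection rate and the extrapolated case count."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    rate = n_codetected / n_samples
    return rate, rate * recorded_cases


# Figures taken from the text: 9 of 1,462 samples, about 190 million cases.
rate, expected_cases = extrapolate_coinfections(9, 1462, 190_000_000)
print(f"observed rate: {rate:.4%}")
print(f"extrapolated coinfections: {expected_cases:,.0f}")
```

The ratio 9/1,462 is about 0.616%, reported in the text as around 0.61%, and the extrapolation gives roughly 1.17 million cases, consistent with the stated lower bound of 1.1 million.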
Table S5.

Figure 3. SARS-CoV-2 lineage proportion through time in different Brazilian states with codetection cases. Data were recovered from GISAID on 23 July 2021; raw data can be accessed in Table S7. Upper triangles are colored by the lineage of the major consensus genome and lower triangles by the lineage of the minor consensus genome.
The interplay of SARS-CoV-2 evolution and constraints imposed by the structure and functionality of its proteins
Genomics and epidemiology of the P.1 SARS-CoV-2 lineage in Manaus
COVID-19 in Amazonas, Brazil, was driven by the persistence of endemic lineages and P.1 emergence
Evolutionary analysis of the Delta and Delta Plus variants of the SARS-CoV-2 viruses
Co-circulation of three camel coronavirus species and recombination of MERS-CoVs in Saudi Arabia
Emergence of pathogenic coronaviruses in cats by homologous recombination between feline and canine coronaviruses
Ecoepidemiology and complete genome comparison of different strains of severe acute respiratory syndrome-related Rhinolophus bat coronavirus in China reveal bats as a reservoir for acute, self-limiting infection that allows recombination events
Evolutionary Dynamics of MERS-CoV: Potential Recombination, Positive Selection and Transmission
Why do RNA viruses recombine?
Recombination in eukaryotic single stranded DNA viruses
Pervasive transmission of E484K and emergence of VUI-NP13L with evidence of SARS-CoV-2 co-infection events by two different lineages in Rio Grande do Sul, Brazil
Genomic Evidence for Divergent Co-Infections of SARS-CoV-2 Lineages
Generation and transmission of inter-lineage recombinants in the SARS-CoV-2 pandemic
SARS-CoV-2: Possible recombination and emergence of potentially more virulent strains
Genomic and phylogenetic characterisation of an imported case of SARS-CoV-2 in Amazonas State, Brazil
Multiple Introductions Followed by Ongoing Community Spread of SARS-CoV-2 at One of the Largest Metropolitan Areas of Northeast Brazil
SARS-CoV-2 Genomes Recovered by Long Amplicon Tiling Multiplex Approach Using Nanopore Sequencing and Applicable to Other Sequencing Platforms
fastp: an ultra-fast all-in-one FASTQ preprocessor
Fast and accurate short read alignment with Burrows-Wheeler transform
The Sequence Alignment/Map format and SAMtools
An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar
The McDonnell Genome Institute
Assignment of epidemiological lineages in an emerging pandemic using the pangolin tool
Integrative genomics viewer
MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability
Nextstrain: real-time tracking of pathogen evolution
Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences
IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era
Interactive Tree Of Life (iTOL) v4: recent updates and new developments

AM-FIOCRUZ-21142481RG AM Manaus 31 98.9 1105
Pango lineages supported into the phylogeny.
² Date of the first genome of each variant deposited on GISAID for the specific municipality/state
* Coverage breadth supported by 100 reads
AM: Amazonas; BA: Bahia; ES: Espírito Santo; CE: Ceará; RS: Rio Grande do Sul

We thank the Fiocruz COVID-19 Genomic Surveillance Network for sharing this large dataset and embracing such collaborative work. We also thank all the researchers around the world who are working on and generating SARS-CoV-2 data in these difficult times. Acknowledgments for all SARS-CoV-2 genomes obtained from GISAID and used in this work are provided in Supplementary File 2.