key: cord-0991886-gtlcg7ek authors: Gallego-García, Pilar; Varela, Nair; Estévez-Gómez, Nuria; De Chiara, Loretta; Fernández-Silva, Iria; Valverde, Diana; Sapoval, Nicolae; Treangen, Todd J; Regueiro, Benito; Cabrera-Alvargonzález, Jorge Julio; del Campo, Víctor; Pérez, Sonia; Posada, David title: Limited genomic reconstruction of SARS-CoV-2 transmission history within local epidemiological clusters date: 2022-02-04 journal: Virus Evol DOI: 10.1093/ve/veac008 sha: 57bd6a04fcb400c4c979aa7a84c99145c59f2317 doc_id: 991886 cord_uid: gtlcg7ek A detailed understanding of how and when severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) transmission occurs is crucial for designing effective prevention measures. Other than contact tracing, genome sequencing provides information to help infer who infected whom. However, the effectiveness of the genomic approach in this context depends on both (high enough) mutation and (low enough) transmission rates. Today, the level of resolution that we can obtain when describing SARS-CoV-2 outbreaks using just genomic information alone remains unclear. In order to answer this question, we sequenced forty-nine SARS-CoV-2 patient samples from ten local clusters in NW Spain for which partial epidemiological information was available and inferred transmission history using genomic variants. Importantly, we obtained high-quality genomic data, sequencing each sample twice and using unique barcodes to exclude cross-sample contamination. Phylogenetic and cluster analyses showed that consensus genomes were generally sufficient to discriminate among independent transmission clusters. However, levels of intrahost variation were low, which prevented in most cases the unambiguous identification of direct transmission events. After filtering out recurrent variants across clusters, the genomic data were generally compatible with the epidemiological information but did not support specific transmission events over possible alternatives. We estimated the effective transmission bottleneck size to be one to two viral particles for sample pairs whose donor–recipient relationship was likely. Our analyses suggest that intrahost genomic variation in SARS-CoV-2 might be generally limited and that homoplasy and recurrent errors complicate identifying shared intrahost variants. Reliable reconstruction of direct SARS-CoV-2 transmission based solely on genomic data seems hindered by a slow mutation rate, potential convergent events, and technical artifacts. Detailed contact tracing seems essential in most cases to study SARS-CoV-2 transmission at high resolution. In recent years, genomic epidemiology has revealed itself as a powerful tool for tracking viral outbreaks (Grubaugh et al. 2019b) . Particularly for diseases with a high proportion of asymptomatic infections like coronavirus disease 2019 , the use of genomic information might be especially relevant to understand their dissemination. Several methods have been developed to reconstruct infectious disease outbreaks using genomic information (e.g. Didelot, Gardy, and Colijn 2014; Jombart et al. 2014; Worby et al. 2014a; Rambaut 2015, 2016; Lumby, Nene, and Illingworth 2018; Didelot et al. 2021 ). However, these strategies rely on pathogen genomes mutating rapidly between infected individuals (Campbell et al. 2018) . Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), responsible for the COVID-19 pandemic, has spread globally in a very short time. SARS-CoV-2 has a mutation rate in the order of 1 × 10 −3 mutations per site per year (Koyama, Platt, and Parida 2020; van Dorp et al. 2020b) . For MERS-CoV-2, in principle with a similar mutation rate, the prediction is that in most cases, the consensus sequences sampled from a transmission pair (donor and receptor) will be identical, precluding a complete reconstruction of the outbreak (Campbell et al. 2018) . As a counterpart, for SARS-CoV-1, with a mutation rate four times higher, we expect to see several mutations between transmission pairs, which considerably augments the power to resolve transmission history (Campbell et al. 2018) . These considerations are based on consensus sequences that represent the dominant viral lineage within a host. However, pathogens with high rates of evolution, such as RNA viruses, accumulate new mutations more or less rapidly as they replicate within the individuals they infect, generating intrahost genomic variation. The generation of this genomic diversity enables viral populations to evade host immune responses (Hensley et al. 2009; Henn et al. 2012; Parameswaran et al. 2017) , alter disease severity (Vignuzzi et al. 2006) , and adapt to changing environments (Stapleford et al. 2014; Stern et al. 2017) . Notably, the study of the shared intrahost genomic variation among individuals can be critical for identifying contagion events and transmission clusters (Didelot, Gardy, and Colijn 2014; Worby, Lipsitch, and Hanage 2014b, 2017; Park et al. 2015) . Moreover, it also allows for estimating the size of the founding pathogen population transmitted from the donor to the recipient host (i.e. the transmission bottleneck size) (Frise et al. 2016; Sobel Leonard, Weissman, and Greenbaum 2017) . Several studies have already shown that intrahost genomic variation can be detected in most SARS-CoV-2 infections, generally at low levels, but with some variation among individuals (Kuipers et al. 2020; Seemann et al. 2020; Shen et al. 2020; Wölfel et al. 2020; Butler et al. 2021; Lythgoe et al. 2021; Tonkin-Hill et al. 2021; Valesano et al. 2021; Braun et al. 2021b; Wang et al. 2021b ). Most SARS-CoV-2 intrahost mutations appear at low frequencies, often less than 5 per cent, are primarily under purifying selection and display particular biochemical signatures (Graudenzi et al. 2021; Sapoval et al. 2021; Tonkin-Hill et al. 2021; Wang et al. 2021b) . A key question is whether SARS-CoV-2 intrahost variation can be transmitted during contagion. The answer is not straightforward, as shared intrahost variants among unrelated individuals can also result from convergent evolution or mutational hotspots (Tonkin-Hill et al. 2021; Valesano et al. 2021) . So far, a few studies have used shared genomic variants between putative donorreceptor pairs to infer narrow transmission bottlenecks of one to ten virions, in SARS-CoV-2 (Li et al. 2022; Lythgoe et al. 2021; San et al. 2021; Wang et al. 2021a) . Limited genomic diversity can prevent the reconstruction of disease outbreaks (Campbell et al. 2018) . While distinct SARS-CoV-2 transmission clusters might be identified using consensus sequences (Letizia et al. 2020; Popa et al. 2020; Seemann et al. 2020) , its moderate mutation rate and rapid transmission might prevent the detailed reconstruction of the transmission events within these clusters (Tonkin-Hill et al. 2021) . Leveraging intrahost variation, San et al. (2021) studied two nosocomial SARS-CoV-2 outbreaks, showing that potential donorrecipient pairs are supported in some cases but not in others by shared intrahost variants. All in all, it is not clear whether the observed levels of inter and intrahost variation in SARS-CoV-2 and the apparently small size of the transmission bottleneck could limit our capability to reconstruct local SARS-CoV-2 outbreaks in detail using only genomic information. Intrahost mutations, typically at very low frequencies, are sensitive to methodological artifacts like sequencing errors (Turakhia et al. 2020; De Maio et al. 2020a; Kubik et al. 2021 ) and cross-sample contamination, and the occurrence of mutational hotspots can confound the identification of transmission events. Here, we wanted to assess our ability to reconstruct putative transmission chains and to infer reliable transmission bottleneck sizes in SARS-CoV-2. For this, we obtained high-quality genomic data from ten independent epidemiological clusters, with two replicates per sample and with unique oligonucleotide spike-ins to detect potential contamination, leveraging both interhost and intrahost variants and ad hoc phylogenetic techniques. Our results confirm the low levels of intrahost variability and the small transmission bottleneck of SARS-CoV-2, suggesting that genomic data alone might not be sufficient to fully resolve direct SARS-CoV-2 transmissions, revealing the need for additional sources of information like detailed contact tracing. According to the epidemiological records, we identified forty-nine patients infected with SARS-CoV-2 conforming ten independent transmission clusters originated in nursing homes, family households, and birthday parties in the city of Vigo, NW Spain ( Fig. 1 ; Table S1 ). After that, we recovered the corresponding diagnostic nasopharyngeal exudates collected at the Vigo University Hospital Complex (CHUVI). This study was conducted under the approval of the Galician Drug Research Ethics Committee (CEIm-G code 2020-301). Clusters A and B belong to two different nursing homes, and in both cases, the primary case could not be established with confidence ( Fig. 1) . Cluster C is a family in which there was a probable transmission from C2 to C4. Cluster D is a large family spanning four different households. D1 came from another Spanish city and likely started the D transmission at a birthday party. Cluster E is a family in which brothers E1 and E2 were infected abroad before infecting their parent, E3. Cluster F is another family that was likely infected by an unsampled case from another city in Spain. Cluster G originates in two individuals (G1 and G4) that attended the same event and afterward infected their respective families, G1 to G2 and G3, and G4 to G5 (G5 failed at sequencing). Cluster H is another family in which H3 likely infected H1 and H2. Cluster I starts with two children (I6 and I3; I6 failed at sequencing) that got infected at the same birthday party before infecting their families, I6 to I1 and I2, and I3 to I4 and I5. Cluster J is a family in which J1 infected partner J2 and child J3. After that, either J1 or J3 infected J4 and J5. Following the manufacturer's recommendations, we extracted the viral RNA from the nasopharyngeal exudates using the MagNA Pure 24 Total NA Isolation kit (Roche Diagnostics, Basel, Switzerland). Different team members processed each RNA sample independently to obtain two technical replicates for each patient sample, from retrotranscription to library construction. We measured SARS-CoV-2 genome copy concentration for each sample by real-time polymerase chain reaction (RT-PCR) of the E gene with the Sarbecovirus E-gene ModularDx (TIB Molbiol, Berlin, Germany) kit in a LightCycler® z480 System (Roche Molecular Systems Inc, Meylan, France). Viral load was estimated using linear regression (R 2 > 0.99) from the standard curve generated with the Ct values obtained for serial dilutions (log) of RNA standards with known viral RNA genome equivalents/µl (Vogels et al. 2020) . We followed the ARTIC sequencing protocol (v.3) (Quick et al. 2017 ), a multiplex PCR-based target enrichment that produces 400-bp amplicons that span the SARS-CoV-2 genome, with slight modifications. First, we retrotranscribed the RNA samples to cDNA using the SuperScript IV reverse transcriptase (Invitrogen, MA, USA), starting with 10 µl of RNA. Then we ran 30 PCR cycles for all the samples, independently of the Ct value, using the ARTIC primer Pool1 and Pool2 (IDT, CA, USA) and the Q5 Hot Start DNA polymerase (New England Biolabs, MA, USA). Next, we mixed the corresponding PCR products from each sample before cleaning (1.2:1 ratio beads to sample). We eluted the clean PCR products with 35 µl nuclease-free water, recovered 33 µ and performed quantification with the Qubit 3.0 using the dsDNA HS or BR kit (Thermo Fisher Scientific, MA, USA), and checked amplicon size with the 2200 TapeStation D1000 kit (Agilent Technologies, CA, USA). We added 1 µl of an X-mer single-stranded oligonucleotide with a unique barcode sequence at 38 fM to each retrotranscription reaction to detect potential sample cross-contamination. To prepare these barcode spike-ins, we used as a template the alcohol dehydrogenase 1 (adh1) mRNA (XM_008650471.2) from Zea mays, as described in the PrimalSeq v.4.0 protocol (Matteson et al. 2020) . After a cleanup step (2:1 ratio beads to sample), we recovered a final volume of 22 µl and performed QC (Qubit 3.0 and 2200 TapeStation). We added F and R primers with the same barcode sequence at the same concentration as the ARTIC primer pools to amplify the barcodes in the multiplex PCRs. We built ninety-eight whole-genome sequencing libraries employing the DNA Prep (M) Tagmentation kit (Illumina, CA, USA) using ¼ of the recommended volume, with approximately 125 ng of input DNA. Finally, we checked the size of the libraries and quantified them as described above. We sequenced the ninety-eight libraries in two high-output (7.5 Gb) runs (sixty and thirty-eight samples, respectively) on an Illumina MiniSeq (PE150 reads) at the sequencing facility of the University of Vigo. To assess the level of cross-sample contamination, we quantified the specific maize barcode content in each fastq file. For this, we aligned the raw reads against the Zea mays adh1 sequence using BWA-mem (Li 2013) with default settings and demultiplexed the mapped reads with cutadapt (v.2.10) (Martin 2011), specifying a minimum overlap of 15 and a maximum error rate of 0.1. Then, we counted the number of each type of reverse and forward barcodes present in each FASTQ file and checked if they were any different from the ones added to the corresponding sample. We assessed the quality of the fastq files using FastQC (Andrews 2010) and aligned the reads to the reference MN908947.3 from Wuhan using BWA-mem (Li 2013) . We used iVar (Grubaugh et al. 2019a ) to trim primer sequences and low-quality regions of the reads, as well as remove reads with less than 30 bp. We evaluated the quality of the aligned trimmed reads using Picard v2.21.8 (http://broadinstitute.github.io/picard; last accessed 20 December 2021). We used SAMtools depth v1.10 (Li et al. 2009 ) to calculate the sequencing coverage along the genome for each replicate. We only kept samples for which ten or more reads covered more than 75 per cent of the viral genome in the two replicates and with less than 2.5 per cent missing bases on the consensus sequence. We used iVar (Grubaugh et al. 2019a ) to identify single nucleotide changes and indels, with a minimum base quality threshold of 20 and a minimum read depth of 10. The calls obtained were confirmed with LoFreq (Wilm et al. 2012) . We only retained variants that appeared in both replicates with a minimum overall variant allele frequency (VAF) of 2 per cent. Based on their frequency, we labeled the genomic changes detected as fixed (VAF ≥ 0.98; to account for potential sequencing errors) and intrahost variants (0.02 ≤ VAF < 0.98). We masked and removed from further analyses positions containing complex variants (i.e. nucleotide changes plus indels) or those deemed as homoplasic (De Maio et al. 2020b) , including the sites immediately before and after. To build a consensus sequence for each sample, we merged the reads from the two replicates with SAMtools mpileup and fed them to iVar consensus with a minimum VAF threshold of 0.5. We assigned the consensus sequences to a SARS-CoV-2 clade with Nextclade (https://clades.nextstrain.org; last accessed 20 December 2021) and to a SARS-CoV-2 PANGO lineage (Rambaut et al. 2020) with Pangolin (O'Toole et al. 2021). The simplest method for delimiting epidemiological clusters using genomic data alone is estimating a phylogenetic tree using the consensus sequences. For this, we aligned the consensus sequences with the reference using MAFFT v.7 (Katoh and Standley 2013) (mafft-maxiterate 500 ) and ran IQ-TREE (v.2.0.6) (Nguyen et al. 2015 3) with the best-fit nucleotide substitution model and 1,000 bootstrap replicates. We also built a timetree based on the output tree of IQ-TREE and the dates of the samples using TreeTime (v.0.8.1) (Sagulenko, Puller, and Neher 2018) (treetime-aln -tree -dates max-iter 30). In addition, we tried six heuristics developed explicitly for the reconstruction of epidemiological clusters described in Worby, Lipsitch, and Hanage (2017) . The weighted distance tree and the minimum distance tree use the genetic distances among consensus sequences. On the other hand, the weighted and maximum variant tree strategies rely exclusively on shared intrahost variants, while the hybrid weighted tree and maximum tree procedures use intrahost variants and consensus genetic distances. Furthermore, we also estimated transmission clusters using the Transcluster algorithm (Stimson et al. 2019) , assuming a mutation rate of 1 × 10 −3 mutations/site/year. We explored four values for the transmission rate (10, 25, 50, 100 transmissions per year) and six for the transmission cutoff (one to six transmission events). Finally, we also tried a probabilistic approach such as the one implemented in Phydelity (Han et al. 2019) , which leverages the inferred phylogenetic trees and can work without predefined genetic distance thresholds (phydelity.py-tree -collapse_zero_branch_length). Within each cluster, we tried several approaches to estimate which individuals transmitted the virus and in which direction, that is, to learn who infected whom. First, we explored the Worby et al. heuristics, which assume that the donor for each sample has the most similar sequence or more shared intrahost variants. In addition, we implemented a simple approach that leverages the intrahost variation along a minimum spanning tree (MST). First, we computed Euclidean pairwise distances among all individuals within a cluster, with the rdist R package (https://github.com/blasern/rdist; last accessed 20 December 2021), using the VAF distributions. Afterward, we built the MSTs based on those distances with the function mst from the ape R package (Paradis and Schliep 2019) . Then, assuming a single source for each cluster, we inferred the transmission direction that minimized the generation of novel variants in the receptor, meaning that in a pair of individuals, the donor should be the one with a higher number of private mutations. Finally, we also explored TransPhylo (Didelot et al. 2017) , using the dated phylogeny obtained with TreeTime. We ran the algorithm for 150,000 Markov Chain Monte Carlo iterations and assumed a Gamma distribution for the generation time with shape 1 and scale 0.01917 (Perera et al. 2021 ). To estimate the transmission bottleneck size of SARS-CoV-2 (i.e. the size of the viral population transferred from the donor to the recipient host), we used the beta-binomial method of Sobel Leonard, Weissman, and Greenbaum (2017). This method assumes that the intrahost variants detected did not arise de novo in different patients. This calculation includes only intrahost donor variants shared with the recipient (note that they can be fixed in the recipient but not in the donor). We identified putative donor-recipient pairs according to the available epidemiological information (Fig. 1 ). We lacked epidemiological information for clusters D and F, and we identified possible transmission pairs according to the genomic data (see Section 3). For the estimation of the transmission bottleneck size, we used the R code at https://github.com/weissmanlab/BB_bottleneck (last accessed 20 December 2021), under the approximate model (given that the sequencing depth per sample was very high, around 6,000X) and setting the maximum bottleneck size to an arbitrarily large value of 600, and the VAF cutoff to 0.02. The ratio of non-synonymous changes per non-synonymous site (dN) to the number of synonymous substitutions per synonymous site (dS) is one of the most popular statistics for detecting selective pressures at the molecular level. We estimated the dN/dS ratio for each sample using the dNdScv package (Martincorena et al. 2017) , recently adapted for its application to SARS-CoV-2 (Tonkin-Hill et al. 2021). We used the default substitution model with 192 rate parameters. Twenty-seven out of the forty-nine samples had a viral load above 10 3 copies/µl (Table S1 ). Sequencing coverage and breadth were high (mean depth ± sd: 6316.71 ± 2336.99; breadth: 0.999) (Table S2) , except for three samples (D11, G5, and I6, all with a Ct > 32 for gene E), that we excluded from further analyses. We did not detect appreciable cross-contamination between samples (Table S2 ). Most variants were fixed (VAF ≥ 0.98) (Fig. 2, Figure S1 ). The number of differences among consensus sequences was, on average, 2.12, 2.28, and 7.57 variants, within early clusters (A-C), late clusters (D-J), and among early and late clusters (see also Table S3 ). We observed on average 19.76 variants per sample, of which 8.17 were intrahost (Table S4 ). Both fixed and intrahost variants were shared among samples at different VAFs. Several intrahost variants appeared recurrently in multiple samples, often corresponding to indels at low frequency (Table S4 ). These recurrent variants may correspond to potential sequencing errors and mutational hotspots, which might confound our analyses. Therefore, we decided to filter out intrahost variants present in more than one cluster. After filtering, there were 2.13 intrahost variants per sample on average, with a maximum of 11 (Table S4 ). Before and after filtering, the number of intrahost variants detected per sample was unrelated to sequencing depth, Ct values, or viral load ( Figure S2 ). Furthermore, VAFs between sample replicates were significantly correlated (Pearson correlation coefficient = 0.99, P-value = 5.6 e-113) ( Figure S3 ). All samples were assigned to two related clades/lineages (20A/B.1 and 20E(EU1)/B.1.177), which was not particularly surprising as these were the dominant lineages in the area at the time of sampling. We estimated dN/dS values for missense variants consistently below 1, suggesting a predominance of intrahost purifying selection across samples ( Figure S4 ). The maximum likelihood (ML) trees obtained with the consensus sequences showed the epidemiological clusters as distinct groups, mostly monophyletic and with relatively high bootstrap support (Fig. 3) . Remarkably, adding the temporal information with Tree-Time improves the phylogenetic resolution of the clusters, which become all monophyletic (Fig. 3B) . However, standard phylogenetic approaches do not explicitly inform about the number of clusters or the assignment of the different individuals to clusters. In the absence of additional epidemiological information (i.e. colors in our trees), researchers often infer putative transmission clusters using some kind of distance threshold. The weighted distance tree and the minimum distance tree, which use consensus sequences to explicitly delimit clusters, were identical and highly congruent with the epidemiological information (Fig. 4A) . In this case, the only 'error' was that cluster D was divided into two, although we might expect it because D1, D3, and D6 share two consensus mutations that the rest of the D individuals do not present. Indeed, cluster D is large and phylogenetically diverse, and we might not have sampled all the infected individuals in this transmission chain. The weighted variant and maximum variant trees, based exclusively on intrahost variants, were also identical and produced a very complex network in which all individuals seemed related to each other (Fig. 4B) . After removing the recurrent intrahost variants common to multiple individuals and clusters (taking advantage of the epidemiological information), these methods identified three clusters primarily compatible with the epidemiological information, plus thirty-three unconnected individuals (Fig. 4C) . Cluster A was perfectly delimited, while cluster I formed a group with a sample from cluster H. The only other three clusterized samples were from cluster D (again D1, D3, and D6). The hybrid transmission methods, which use the connections established by the weighted variant and maximum variant trees and incorporate consensus information for those samples without a donor or recipient, did not result in any noticeable improvement compared to methods based on consensus sequences (data not shown). Finally, the transmission-based clustering method in Transcluster was able to identify some of the epidemiological clusters but not all (Fig. 4D) . In this case, congruence with the epidemiological data was maximal after setting a transmission rate of 25 and a transmission cutoff of 1. In the case of Phydelity, which tries to delimitate transmission clusters upon a given phylogeny, the inferred clusters were often subsets of the epidemiological clusters, both when using the ML ( Figure S5A ) and the TreeTime tree ( Figure S5B ). For example, Phydelity divided cluster D into two subclusters (one of them composed again by D1, D3, and D6) and three other samples were left unassigned in both trees, while for most of the other clusters (A, B, C, E, F, and I) there were one or more unassigned samples. Phydelity precisely identified clusters G and J when using the ML tree, while H is the only cluster that was correctly delimited in both the ML and the TreeTime tree. For clusters A and B, we had no epidemiological information other than the corresponding nursing home. In cluster A (Fig. 5) , the four samples share what seems to be an intrahost variant (27695-TCTTA). However, given its high VAF (0.93-0.95) and the fact that the individuals do not share other variants, this deletion may be a truly fixed variant, where sequencing or calling errors prevented its identification in all reads. In any case, the genetic data does not help identify the different transmission events in this cluster with confidence. In cluster B, no shared variation was apparent. B2 has two private apparently fixed variants, suggesting it was infected later than the other cluster members or from a different source. Again, it was not possible to infer the transmission network for this cluster. We had partial contact trace information for clusters C, E, G, H, I, and J. However, the lack of shared intrahost variants prevented a detailed reconstruction of their transmission history in most cases (Fig. 5) . The epidemiological record suggests a transmission from C2, the index case, to C4 in cluster C. This event is compatible with the genetic data, as C2 has a single intrahost variant at low frequency (0.05), which could have been lost during transmission to C4, which has no intrahost variants. Private variants with low VAFs in C1 and C3 could have arisen de novo within each individual after transmission, but three of them with higher VAFs (0.27, 0.34, and 0.67) are more difficult to explain in the same way. In cluster E, the genetic information cannot resolve whether E1 or E2 infected E3. In cluster G, we did not observe shared intrahost variants. In contrast, the distribution of the private variants is compatible with the epidemiological information, and it does not help resolve further the transmission history. In cluster H, the three samples share five fixed (VAF ≥ 0.98), or almost fixed, variants. The index case (H3) does not seem to have intrahost variants, contrary to H1 and H2. However, the quality of the sequencing data in the case of H3 is well below average, so it is possible that low-frequency variants were overlooked in this sample. All five members of cluster I share three variants with high VAF-or fixed in several cases. Variants 445C and 25062 could be genuinely fixed in all samples, including cases where the apparent VAF is 0.95-0.97. The distribution of variant 29366T is remarkable, as it appears in all cases with a VAF of 0.68-0.86. Another salient observation is that I3, one of the index cases, has a well-supported variant (9430T) with a VAF of 0.96 that does not appear in the other samples from this cluster. Cluster J lacked shared intrahost variants, so the genetic data neither confirmed nor invalidated the contact tracing information. 3.4.3 .1 Ad hoc approaches. We did not have detailed information about contacts in clusters D and F, so we tried to identify transmission events considering just the genomic data (Fig. 6) . In cluster D, the transmission started at a birthday party where the index case was D1. D1 shares two variants with D3 and D6 (4543T and 18431T), both fixed (VAF ≥ 0.98) in D1 and D6 and close to fixation in D3 (0.96 and 0.94, respectively). Therefore, we hypothesize that D1 → D3 and D1 → D6, but alternatively D3 → D6, could be transmission pairs. These two variants also appear in D2 but at a very low VAF (0.07 and 0.06, respectively). D3 and D2 further share 4142 A, but this variant has a low VAF in D3 (0.07) and a high VAF (0.89) in D2. Furthermore, D2 has 15857T at high VAF (0.88). Given that we assumed that D1 infected D3, we considered that D3 could have infected D2. However, the explanation for the observed VAF patterns might imply recombination and de novo mutation. Finally, D10 shares with D2 variants 4142A and 15857T at high frequency, so we also considered D2 → D10 another likely transmission pair. In cluster F, where we do not have an index case, F1 and F3 share a fixed variant (3737 T). Given that F1 has seven private variants at low frequency, but F3 only two, F1 might be the pair's donor because it seems easier to lose these variants during the F1 → F3 bottleneck than to arise de novo in F1 after an F3 → F1 transmission. F2 also has 3737T fixed. Following the same logic, F2 could have been infected by F1, but also by F3. approaches were not as helpful in inferring transmission as for delimiting clusters. Assigning the source of each sample to the patient with the highest number of shared intrahost variants or the minimum genetic distance (using weights or absolute values) resulted in samples with multiple potential sources and pairs with bidirectional transmission (Fig. 4A, B) . Relying only on intrahost shared variants proved inefficient, as most samples were not connected to any other (Fig. 4C ). Consensus sequences from the same cluster were very similar, so multiple samples were often equidistant, preventing choosing one of them as the source. The MST analysis ( Figure S6 ) was incompatible with the epidemiological information. Apart from tied transmission paths for some of the clusters (clusters D and E, with three and two options, respectively), the starting point of the transmission did not coincide with the epidemiological information in any of the cases. TransPhylo could differentiate the different clusters (Fig. 7, Figure S7 ); however, the inferred direct transmission events within clusters were often not compatible with the epidemiological information. We selected individual pairs representing direct transmission events to estimate the transmission bottleneck size according to the epidemiological and genomic information. We discarded clusters A and B (nursing homes) from this analysis because it was impossible to identify likely transmission pairs in these cases. We had contact information about at least a transmission pair for clusters C, E, G, H, I, and J. The situation was more complex for clusters D and F, so we identified possible transmission pairs considering both the epidemiological and the genomic data, as described above. Across the studied transmission pairs, we found an average of 0.38 (range 0-3) shared intrahost variants (Table S6) . Accordingly, the estimated transmission bottleneck sizes were typically small (one to two viral particles) (Fig. 8, Table S6 ). To ensure that our selection of transmission pairs in clusters D and F was not biasing these estimates downwards, we also calculated the transmission bottleneck sizes for all potential pairs within these two clusters. The estimated bottlenecks were consistently one to two. Note that the bottleneck size can only be estimated when there is at least one variant in the donor (regardless of whether that variant is observed in the recipient). If none of the donor variants appear in the recipient, the estimated bottleneck size will be one, with a variable confidence interval depending on the variant calling threshold. Understanding SARS-CoV-2 transmission is crucial to identify which situations minimize or maximize the risk of infection and, therefore, implement more effective control strategies. Former studies have tried to reconstruct SARS-CoV-2 local transmission chains with more or less success using a combination of epidemiological and genomic data (Popa et al. 2020; Sekizuka et al. 2020; Shen et al. 2020; Hamilton et al. 2021; San et al. 2021) . However, it is unclear whether, in situations for which contact tracing information is limited, we can use SARS-CoV-2 genomic information alone to understand who infected whom. Here, we show that while SARS-CoV-2 genomic variation can be helpful, at least in some cases, to delimit distinct transmission clusters, it is not enough to resolve with confidence transmission chains and direct transmission events. Using the most likely transmission pairs, we infer a narrow effective transmission bottleneck size for SARS-CoV-2 in the order of one to ten viral particles. Unlike most of the previous studies of SARS-CoV-2 intrahost variation, we made use of substantial laboratory and bioinformatic controls to stress variant calling reproducibility, including the use of unique barcodes to discard cross-sample contamination, sequencing replicates, high sequencing depth, multiple variant callers, and curation of recurrent variants. While a few studies have included some of these controls to a different extent (Lythgoe et al. 2021; Tonkin-Hill et al. 2021; Braun et al. 2021b) , this is the first study to consider contamination, replicates, and multiple variant calling strategies directly for each of the samples studied, plus the use of contact information. We found a limited number of intrahost variants (∼8 before filtering recurrent variants and ∼3 after filtering), as reported in earlier studies (Kuipers et al. 2020; Seemann et al. 2020; Shen et al. 2020; Wölfel et al. 2020; Butler et al. 2021; Tonkin-Hill et al. 2021; Valesano et al. 2021; Braun et al. 2021b; Wang et al. 2021b ). Half of our samples (twentyseven/forty-eight) had a viral load above 10 3 copies/µl, which is the threshold determined in Valesano et al. (2021) for reliable identification of intrahost variants with a VAF ≥2 per cent in single replicates. Another novel aspect of this work is the explicit exploration of established methods that use viral interhost and/or intrahost genomic variation to delimitate transmission clusters and to identify transmission chains and direct contagions, and the comparison of these results with the available epidemiological information. In our samples, all from the same city and corresponding to two time points, the level of interhost genomic variation was generally low. However, this did not prevent the distinction among local clusters. When the sampling dates were taken into account, the concordance between genomic and epidemiological clusters was maximized, highlighting the relevance of the temporal information. Methods for cluster delimitation that rely exclusively on intrahost variants did not work well in this regard. In contrast, methods based on differences at the consensus level could differentiate the clusters near perfectly. These results suggest that in SARS-CoV-2, consensus sequences might be enough in some cases, to separate samples belonging to different clusters from the same area. But caution regarding sampling is always necessary. There might be unsampled individuals that do not belong to any of these clusters but that might have identical consensus sequences. At the same time, intrahost SARS-CoV-2 diversity does not seem sufficient for cluster delimitation. Here, transmission history within nursing homes or households, where most SARS-CoV-2 infections occur (Lee et al. 2020) , was complicated to decipher. In general, all the methods we tried, even those relying on intrahost variation, could not identify clear transmission patterns within clusters, as seen before in care homes (Hamilton et al. 2021) or in nosocomial outbreaks (Abbas et al. 2021) . This poor resolution can be explained by a lack of genetic variation but also by homoplasy, as we observed several shared intrahost variants among apparently unrelated samples. In addition, we noticed that if VAF thresholds are relaxed, unique or uncommon mutations appear in many individuals, suggesting these variants are either recurrent artifacts or hotspot mutations (Tonkin-Hill et al. 2021) . Deciding which shared intrahost variants in SARS-CoV-2 are the result of transmission events is not easy. Much care should be taken regarding reliable genotyping and identifying recurrent events, particularly for samples with low viral loads (van Dorp et al. 2020a,b; Kubik et al. 2021; Valesano et al. 2021; Braun et al. 2021b ). Recurrent, low-frequency insertions in SARS-CoV-2 have already been detected elsewhere (Kuipers et al. 2020; Rayko and Komissarov 2020; Turakhia et al. 2020; Tonkin-Hill et al. 2021) . Although not addressed in this study, another potential complication regarding the identification of shared intrahost variants is the occurrence of significant intrahost evolution. Several studies have reported VAF changes within days in SARS-CoV-2 (Jary et al. 2020; Tonkin-Hill et al. 2021; Voloch et al. 2021; Wang et al. 2021b) , even faster in immunocompromised individuals (Avanzato et al. 2020; Kemp et al. 2021 ). On the other hand, more rigorous studies report that diversity does not increase over time, although this does not imply that VAFs cannot change significantly among different time points (Valesano et al. 2021 ). Significant intrahost evolution would imply that the amount of sharing between samples could change depending on the exact sampling dates, so the inferences derived from it. Consistent with previous studies, our estimates indicate that the SARS-CoV-2 transmission bottleneck size is small or very small (Gu et al. 2022; Li et al. 2022; Lythgoe et al. 2021; San et al. 2021; Wang et al. 2021a; Braun et al. 2021b) , with only a few viral particles being responsible for the successful growth within the recipient. Notably, tight bottleneck estimates have also been obtained for a highly transmissible SARS-CoV-2 lineage like Delta (Li et al. 2021) . In contrast, Popa et al. (2020) estimated an average transmission bottleneck size for SARS-CoV-2 of 1,000, but these estimates have been recently revised because of the inclusion of suspicious, highly shared intrahost variants (Martin and Koelle 2021; Nicholson et al. 2021) . A strong transmission bottleneck reduces, in general, the efficacy of selection, impeding the contribution of intrahost diversity to viral adaptation at a global scale (McCrone and Lauring 2018), although exceptions to this rule exist (Zwart and Elena 2015; Braun et al. 2021a) . Still, interhost competition among SARS-CoV-2 variants can maintain and increase viral fitness. On the other hand, a small transmission bottleneck size for SARS-CoV-2 is compatible (but not proof of) with a dominance of aerosol transmission over direct contact, as seen in influenza (Varble et al. 2014; Frise et al. 2016; McCrone and Lauring 2018) . Further studies are necessary to elucidate whether this is really the case for SARS-CoV-2. If only one or a few unique virions are passed during transmission, then most of SARS-CoV-2 intrahost variation has to be due to the accumulation of de novo mutations (Valesano et al. 2021; Voloch et al. 2021 ). These de novo mutations seem to be mainly deleterious. We inferred strong intrahost purifying selection across the genome for missense variants, as in prior studies (Shen et al. 2020; Lythgoe et al. 2021; Tonkin-Hill et al. 2021) . Our results suggest that SARS-CoV-2 genomic diversity is helpful to delimitate different transmission clusters within a relatively small area, but that could be insufficient to fully resolve transmissions within a household or in the same social event. In other words, genomics alone cannot help identify who infected whombut might discard putative contagions. Thus, contact tracing data will be essential to study direct SARS-CoV-2 transmission events, as it occurs in typical slow-evolving pathogens (Campbell et al. 2018 (Campbell et al. , 2019 . Overall, the biological picture that has become apparent after this and preceding studies is that SARS-CoV-2 intrahost variation is low and mainly determined by genetic drift and purifying selection. The transmission bottleneck is very narrow, with only a few virions effectively contributing to the genomic diversity of the infection, so intrahost variants are infrequently transmitted from one host to another. Under this scenario, only a minority of infections, typically prolonged ones, should lead to the appearance of novel variants. Therefore, managing long-term SARS-CoV-2 infections should become a priority. Raw FASTQ files have been deposited at the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) (Leinonen, Sugawara, and Shumway 2011) (Project Accession No. PRJNA778931). Viral consensus genomes are available at the Global Initiative on Sharing All Influenza Data (GISAID) (Shu and McCauley 2017) (accession numbers in Supplementary Table S7 ). Supplementary data is available at Virus Evolution online. This project was funded by grant EPICOVIGAL FONDO SUPERA-COVID19 from Banco Santander, Consejo Superior de Investigaciones Científicas (CSIC) and Conferencia de Rectores de las Universidades Españolas (CRUE), grant CT850A-2 from the Axencia de Coñecemento en Saúde (ACIS) of the Servizo Galego de Saúde (SERGAS) from the Consellería de Sanidade Xunta de Galicia, and grant ED431C2018/54-GRC from the Consellería de Cultura, Educación e Ordenación Universitaria of Xunta de Galicia. NS and TT were supported in part by a C3.ai Digital Transformation Institute award. Funding for open access charge: Universidade de Vigo/CISUG. Explosive Nosocomial Outbreak of SARS-CoV-2 in a Rehabilitation Clinic: The Limits of Genomics for Outbreak Reconstruction FastQC: A Quality Control Tool for High Throughput Sequence Data Case Study: Prolonged Infectious SARS-CoV-2 Shedding from an Asymptomatic Immunocompromised Individual with Cancer Gephi: An Open Source Software for Exploring and Manipulating Networks Acute SARS-CoV-2 Infections Harbor Limited Within-host Diversity and Transmit via Tight Transmission Bottlenecks Shotgun Transcriptome, Spatial Omics, and Isothermal Profiling of SARS-CoV-2 Infection Reveals Unique Host Responses, Viral Diversification, and Drug Interactions' Bayesian Inference of Transmission Chains Using Timing of Symptoms, Pathogen Genomes and Contact Data Masking Strategies for SARS-CoV-2 Alignments Genomic Infectious Disease Epidemiology in Partially Sampled and Ongoing Outbreaks Bayesian Inference of Infectious Disease Transmission from Whole-genome Sequence Data Genomic Epidemiology Analysis of Infectious Disease Outbreaks Using TransPhylo Contact Transmission of Influenza Virus between Ferrets Imposes a Looser Bottleneck than Respiratory Droplet Transmission Allowing Propagation of Antiviral Resistance Mutational Signatures and Heterogeneous Host Response Revealed via Large-scale Characterization of SARS-CoV-2 Genomic Diversity', iScience An Amplicon-based Sequencing Framework for Accurately Measuring Intrahost Virus Diversity Using PrimalSeq and iVar Genomic epidemiology of SARS-CoV-2 under an elimination strategy in Hong Kong Epidemic Reconstruction in a Phylogenetics Framework: Transmission Trees as Partitions of the Node Set Using Genomics Data to Reconstruct Transmission Trees during Disease Outbreaks Genomic Epidemiology of COVID-19 in Care Homes in the Inferring Putative Transmission Clusters with Phydelity Whole Genome Deep Sequencing of HIV-1 Reveals the Impact of Early Minor Variants upon Immune Recognition during Acute Infection Hemagglutinin Receptor Binding Avidity Drives Influenza A Virus Antigenic Drift Evolution of Viral Quasispecies during SARS-CoV-2 Infection Bayesian Reconstruction of Disease Outbreaks by Combining Epidemiologic and Genomic Data MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability SARS-CoV-2 Evolution during Treatment of Chronic Infection Variant Analysis of SARS-CoV-2 Genomes Recommendations for Accurate Genotyping of SARS-CoV-2 Using Amplicon-based Sequencing of Clinical Samples Within-patient Genetic Diversity of SARS-CoV-2 The Engines of SARS-CoV-2 Spread International Nucleotide Sequence Database Collaboration SARS-CoV-2 Transmission among Marine Recruits during Quarantine The Sequence alignment/map (SAM) format and SAMtools Viral infection and transmission in a large, well-traced outbreak caused by the SARS-CoV-2 Delta variant Aligning Sequence Reads A Novel Framework for Inferring Parameters of Transmission from Viral Sequence Data SARS-CoV-2 Within-host Diversity and Transmission Cutadapt Removes Adapter Sequences from Highthroughput Sequencing Reads Comment on Genomic epidemiology of superspreading events in Austria reveals mutational dynamics and transmission properties of SARS-CoV-2 Universal Patterns of Selection in Cancer and Somatic Tissues PrimalSeq: Generation of Tiled Virus Amplicons for MiSeq Sequencing V1 (Protocols.io.bez7jf9n) Genetic Bottlenecks in Intraspecies Virus Transmission', Current Opinion in Virology IQ-TREE: A Fast and Effective Stochastic Algorithm for Estimating Maximum-likelihood Phylogenies Response to Comment on Genomic epidemiology of superspreading events in Austria reveals mutational dynamics and transmission properties of SARS-CoV-2 Assignment of Epidemiological Lineages in an Emerging Pandemic Using the Pangolin Tool Ape 5.0: An Environment for Modern Phylogenetics and Evolutionary Analyses in R' Intrahost Selection Pressures Drive Rapid Dengue Virus Microevolution in Acute Human Infections Ebola Virus Epidemiology, Transmission, and Evolution during Seven Months in Sierra Leone Reconstructing SARS-CoV-2 infection dynamics through the phylogenetic inference of unsampled sources of infection Genomic Epidemiology of Superspreading Events in Austria Reveals Mutational Dynamics and Transmission Properties of SARS-CoV-2 Multiplex PCR Method for MinION and Illumina Sequencing of Zika and Other Virus Genomes Directly from Clinical Samples A Dynamic Nomenclature Proposal for SARS-CoV-2 Lineages to Assist Genomic Epidemiology Quality Control of Lowfrequency Variants in SARS-CoV-2 Genomes TreeTime: Maximum-likelihood Phylodynamic Analysis Transmission Dynamics of SARS-CoV-2 Withinhost Diversity in Two Major Hospital Outbreaks in South Africa SARS-CoV-2 Genomic Diversity and the Implications for qRT-PCR Diagnostics and Transmission Tracking the COVID-19 Pandemic in Australia Using Genomics Haplotype Networks of SARS-CoV-2 Infections in the Diamond Princess Cruise Ship Outbreak Genomic Diversity of Severe Acute Respiratory Syndrome-Coronavirus 2 in Patients with Coronavirus Disease GISAID: Global Initiative on Sharing All Influenza Data -from Vision to Reality Transmission Bottleneck Size Estimation from Pathogen Deepsequencing Data, with an Application to Human Influenza A Virus Emergence and Transmission of Arbovirus Evolutionary Intermediates with Epidemic Potential The Evolutionary Pathway to Virulence of an RNA Virus Beyond the SNP Threshold: Identifying Outbreak Clusters Using Inferred Transmissions Patterns of Within-host Genetic Diversity in SARS-CoV-2', eLife Stability of SARS-CoV-2 Phylogenies Temporal Dynamics of SARS-CoV-2 Mutation Accumulation within and across Infected Hosts Emergence of Genomic Diversity and Recurrent Mutations in SARS-CoV-2', Infection, Genetics and Evolution No Evidence for Increased Transmissibility from Recurrent Mutations in SARS-CoV-2' Influenza A Virus Transmission Bottlenecks are Defined by Infection Route and Recipient Host Quasispecies Diversity Determines Pathogenesis through Cooperative Interactions in a Viral Population Generation of SARS-COV-2 RNA Transcript Standards for qRT-PCR Detection Assays V1', Protocols.io Intra-host Evolution during SARS-CoV-2 Prolonged Infection Population Bottlenecks and Intra-host Evolution during Human-to-Human Transmission of SARS-CoV-2 Intra-host Variation and Evolutionary Dynamics of SARS-CoV-2 Populations in COVID-19 Patients LoFreq: A Sequence-quality Aware, Ultrasensitive Variant Caller for Uncovering Cell-population Heterogeneity from High-throughput Sequencing Datasets Virological Assessment of Hospitalized Patients with COVID-2019' The Distribution of Pairwise Genetic Distances: A Tool for Investigating Disease Transmission Within-host Bacterial Diversity Hinders Accurate Reconstruction of Transmission Networks from Genomic Distance Data Matters of Size: Genetic Bottlenecks in Virus Infection and Their Potential Impact on Evolution D.P. conceived the study and designed the analyses. S.P., J.J.C., V.d.C., and B.R. obtained the patient samples and the epidemiological information. S.P., N.E.G., L.d.C., I.F.S., and D.V. planned and performed the laboratory experiments. P.G.G., N.V., N.S., and T.T. carried out the bioinformatic analyses. D.P. wrote the draft manuscript with the help of P.G.G. and N.V. All authors read the manuscript and contributed to interpretation and discussion.