key: cord-0334350-tg1yebeg
authors: Walter, K. S.; Kim, E.; Verma, R.; Altamirano, J.; Leary, S.; Carrington, Y. J.; Jagannathan, P.; Singh, U.; Holubar, M.; Subramanian, A.; Khosla, C.; Maldonado, Y.; Andrews, J. R.
title: Shared within-host SARS-CoV-2 variation in households
date: 2022-05-27
journal: nan
DOI: 10.1101/2022.05.26.22275279
sha: beb483cc5e9762d59376a137b3a6ea8886346598
doc_id: 334350
cord_uid: tg1yebeg

Background: The limited variation observed among SARS-CoV-2 consensus sequences makes it difficult to reconstruct transmission linkages in outbreak settings. Previous studies have recovered variation within individual SARS-CoV-2 infections but have not yet measured the informativeness of within-host variation for transmission inference.

Methods: We performed tiled amplicon sequencing on 307 SARS-CoV-2 samples from four prospective studies and combined sequence data with household membership data, a proxy for transmission linkage.

Results: Consensus sequences from households had limited diversity (mean pairwise distance, 3.06 SNPs; range, 0-40). Most (83.1%, 255/307) samples harbored at least one intrahost single nucleotide variant (iSNV; median: 117; IQR: 17-208), when applying a liberal minor allele frequency of 0.5% and prior to filtering. A mean of 15.4% of within-host iSNVs were recovered one day later. Pairs in the same household shared significantly more iSNVs (mean: 1.20 iSNVs; 95% CI: 1.02-1.39) than did pairs in different households infected with the same viral clade (mean: 0.31 iSNVs; 95% CI: 0.28-0.34), a signal that increases with increasingly liberal thresholds.

Conclusions: Although only a subset of within-host variation is consistently shared across likely transmission pairs, shared iSNVs may augment the information in consensus sequences for predicting transmission linkages.
medRxiv preprint; this version posted May 27, 2022; https://doi.org/10.1101/2022.05.26.22275279. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

Background

SARS-CoV-2 genomic sequencing has been powerfully used to reconstruct the virus' evolutionary dynamics at broad temporal and spatial scales [1-3]. Yet the virus' relatively slow substitution rate compared with its short serial interval limits the viral diversity observed in many outbreaks, and viral consensus sequences, which represent the most common allele along the viral genome, are often identical or nearly so [4, 5].

In superspreading events, identical consensus sequences have provided important evidence of recent shared transmission. For example, four individuals on the same international flight were infected with identical SARS-CoV-2 consensus genomes, evidence that the virus could be transmitted during air travel [6]. Genomic surveillance in Boston during 2020 reported that 59 out of 83 (71%) genomes sequenced from a skilled nursing facility were identical, implicating transmission within the facility [7]. Similarly, 75% of SARS-CoV-2 consensus sequences from a fishing boat outbreak were identical to at least one other sequence, and the remaining sequences were closely related, suggesting rapid transmission from a single viral introduction [8].

[...] or limited genomic variation resulting in identical, epidemiologically unlinked consensus genomes. In the absence of detailed epidemiological data, such as contact information or spatial information that might be available in hospital-based studies, it is not yet known whether routine sequencing data alone can be used to reconstruct transmission linkages of who-infected-whom or identify locations or individuals that may drive transmission.
Genomic studies of HIV and other viral and bacterial pathogens have begun to harness the pathogen variation within individual infections, or within-host diversity, to reconstruct transmission linkages [11-13]. Previous studies have reported low levels of SARS-CoV-2 diversity within individual hosts and have estimated the size of a narrow transmission bottleneck which limits the viral diversity shared across hosts [8, 9, 14, 15]. However, more research is needed to quantify the informativeness of within-host SARS-CoV-2 variation and evaluate the effects of variant identification approaches on transmission inferences [15].

To investigate the potential for within-host SARS-CoV-2 diversity to be harnessed for studies of transmission, we deep sequenced SARS-CoV-2 samples collected from household members, allowing us to directly compare shared within-host variants among epidemiologically linked individuals and those with no known linkage, providing a test case for the transmission information contained within individual infections. We additionally sequenced artificial mixtures of SARS-CoV-2 variants to examine tradeoffs between sensitivity and specificity in within-host variation identification.

Collection of residual SARS-CoV-2 samples for deep sequencing.

We assembled a collection of samples from four prospective SARS-CoV-2 research studies: (a) a prospective household transmission study, in which index cases with at least one reverse transcription quantitative polymerase chain reaction (RT-qPCR)-confirmed SARS-CoV-2 test were enrolled along with household members. Participants were tested daily for SARS-CoV-2 RNA via RT-qPCR, using self-collected lower nasal swabs, and households were followed until all members tested negative for seven consecutive days [16].
(b) A randomized, single-blind, placebo-controlled trial of Peginterferon Lambda-1a (Lambda) for reducing the duration of viral shedding or symptoms [17] in which oropharyngeal swabs were collected for 28 days following enrollment. (c) A phase 2 double-blind randomized controlled outpatient trial of the antiviral favipiravir for reducing the duration of viral shedding in which participants self-collected daily anterior nasal swabs for 28 days following enrollment [18]. Neither Lambda nor favipiravir was found to shorten the duration of SARS-CoV-2 viral shedding [17, 18].

[...] [23], and called variants with respect to the reference genome with iVar [23]. We also used this pipeline to remove reads [...]

To test whether commonly applied filters would improve overall accuracy, we applied five variant filters: a filter for iSNV quality from iVar [23] (PASS = TRUE), a variant quality score filter (Phred score > 40), a depth filter (of both major and minor alleles > 5X), a filter of false positive iSNVs repeated in more than one sample in the artificial strain mixture experiment (below), and all filters. We additionally excluded iSNVs occurring in primer binding sites (except for the unfiltered variant set).
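As an illustration of how the variant filters described above might be combined, the following is a minimal Python sketch and not the pipeline used in the study. The `ISNV` record, its field names, and the two position sets are hypothetical. The thresholds (Phred score > 40, major and minor allele depth > 5X, minor allele frequency ≥ 0.2%) are taken from the text. The sketch also assumes that recurrent false positives and primer binding sites can both be represented as sets of genome positions.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class ISNV:
    """One candidate within-host variant call (field names are hypothetical)."""
    sample: str
    position: int       # genome coordinate of the variant
    alt: str            # minor allele base
    minor_freq: float   # minor allele frequency as a fraction (0.005 = 0.5%)
    minor_depth: int    # reads supporting the minor allele
    major_depth: int    # reads supporting the major allele
    phred: float        # variant quality score
    ivar_pass: bool     # PASS flag reported by the variant caller


def passes_all_filters(
    v: ISNV,
    min_freq: float = 0.002,
    recurrent_false_positives: FrozenSet[int] = frozenset(),
    primer_sites: FrozenSet[int] = frozenset(),
) -> bool:
    """Return True only if the call survives every filter named in the text."""
    return (
        v.ivar_pass                                        # caller quality flag
        and v.phred > 40                                   # variant quality score
        and v.minor_depth > 5 and v.major_depth > 5        # depth of both alleles
        and v.position not in recurrent_false_positives    # mixture-experiment artifacts
        and v.position not in primer_sites                 # primer binding sites
        and v.minor_freq >= min_freq                       # frequency threshold
    )


def filter_isnvs(variants: Iterable[ISNV], **thresholds) -> List[ISNV]:
    """Keep the calls that pass all filters, preserving input order."""
    return [v for v in variants if passes_all_filters(v, **thresholds)]
```

Loosening `min_freq` or skipping individual checks would correspond to the more liberal variant sets discussed in the text; which combination is appropriate depends on the application.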
To identify shared within-host diversity across samples, we compared each unique pair of [...]

We fit a Poisson regression model for the number of iSNVs identified within a single sample, including sequencing batch and participant as random effects. We additionally fit a Poisson regression model for the number of pairwise shared iSNVs as a function of pair type and distance between consensus sequences, including pair as a random effect. Finally, we fit a binomial regression model for predicting household membership as a function of the number of shared pairwise iSNVs and an indicator variable for close consensus sequences (pairwise distance ≤ 1 SNP), including the earliest samples collected from each pair to exclude multiple pairwise comparisons. We fit all models with the R package lme4 [31], and included the set of variants after applying all filters, including iSNVs with a minor allele frequency of ≥ 0.2%. We excluded samples sequenced in the same sequencing batch.

Replicating analysis in an independent deep sequencing dataset from Wisconsin.

We additionally investigated patterns of shared within-host variation in a previously published dataset from a household transmission study in Wisconsin [9]. Specifically, we re-analyzed variants called by the previous study and filtered to include iSNVs with a minor allele frequency [...]

As previously reported [9, 33, 34], iSNVs are not consistently recovered within serial samples.
Among individuals with recovered within-host diversity, a mean of 8.7% of within-host iSNVs above a minor allele frequency of 1.0% and after applying all filters were recovered one day later; this proportion declined with time between samples, though not significantly (r = -0.11, p = 0.23). When including unfiltered iSNVs above a 0.5% threshold, a mean of 15.4% of within-host iSNVs were recovered one day later [...]

In a generalized linear model for shared within-host diversity, household membership was associated with an increased odds of shared iSNVs (aOR: 19.8; 95% CI: 6.39-61.40) compared to sample pairs within the same clade, after controlling for genetic distance between consensus sequences, consistent with a previous study that found household membership is the strongest predictor of shared iSNVs [9]. Longitudinal samples from an individual were also associated with an increased odds of shared iSNVs (aOR: 60.5; 95% CI: 19.9-184), as were sequencing replicates (aOR: 132; 95% CI: 42.5-411).

After excluding pairs sequenced in the same batch and multiple comparisons between participants, our sample size was small (23 unique household pairs). In a generalized linear model, the number of shared iSNVs was not significantly associated with an increased odds of household membership (aOR: 1.31; 95% CI: 0.87-1.71), while a closely related consensus sequence (within 0-1 SNPs) was significantly associated with household membership (aOR: 30.38; 95% CI: 10.39-129.14).
However, shared diversity, measured as the standardized sum of shared minor allele frequencies between pairs, was associated with an increased odds of household membership (aOR: 1.20; 95% CI: 1.08-1.32) when controlling for closely related consensus sequence.

We tested the replicability of our findings in an independent study conducted in Wisconsin where SARS-CoV-2 was deep sequenced from 133 acutely-infected individuals, including members of 19 households [9]. At a frequency threshold of 0.5%, we found a similar signal that pairs of individuals in the same household shared significantly more iSNVs (mean: 9.52 iSNVs; 95% CI: 8.14-10.89) than did pairs in different households infected with the same viral clade (mean: 4.28 iSNVs; 95% CI: 4.19-4.37) or pairs in different households infected with a different viral clade (mean: 1.42; 95% CI: 1.38-1.47) (Fig. 3a), in variants in filtered VCF files made publicly available from the earlier study [9] (Fig. S6; Methods). Our findings were consistent across minor allele frequency thresholds, though as in the California data, a signal of household membership was strongest when using minor allele frequency thresholds of ≤ 1% (Fig. S6). We found a similar signal when measuring shared population diversity as the sum of shared minor allele frequencies (Fig. S7). However, household pairs did not share significantly more diversity than epidemiologically unrelated pairs when applying all filters and a minor allele frequency threshold ≥ 3% (Fig. S7).
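The pairwise comparisons reported above (number of shared iSNVs by pair type, and the sum of shared minor allele frequencies) could be computed along the following lines. This is a minimal sketch under stated assumptions, not the authors' code. It assumes an iSNV counts as shared when the same position and minor allele appear in both samples. It also assumes the frequency-based measure adds the minor allele frequencies from both samples at each shared site, without the standardization mentioned in the text. The source specifies neither definition, and all function and key names here are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from statistics import mean
from typing import Dict, Set, Tuple

Variant = Tuple[int, str]          # (genome position, minor allele)
Profile = Dict[Variant, float]     # variant -> minor allele frequency


def shared_isnvs(a: Profile, b: Profile) -> Set[Variant]:
    """Variants (same position and allele) present in both samples."""
    return set(a) & set(b)


def shared_maf_sum(a: Profile, b: Profile) -> float:
    """Assumed definition: sum of both samples' frequencies at shared variants."""
    return sum(a[v] + b[v] for v in shared_isnvs(a, b))


def pair_type(meta_a: Dict[str, str], meta_b: Dict[str, str]) -> str:
    """Classify a pair by household membership, then by viral clade."""
    if meta_a["household"] == meta_b["household"]:
        return "same household"
    if meta_a["clade"] == meta_b["clade"]:
        return "different household, same clade"
    return "different household, different clade"


def mean_shared_by_pair_type(
    profiles: Dict[str, Profile], metadata: Dict[str, Dict[str, str]]
) -> Dict[str, float]:
    """Mean number of shared iSNVs over all unique sample pairs, per pair type."""
    counts = defaultdict(list)
    for s1, s2 in combinations(sorted(profiles), 2):
        kind = pair_type(metadata[s1], metadata[s2])
        counts[kind].append(len(shared_isnvs(profiles[s1], profiles[s2])))
    return {kind: mean(values) for kind, values in counts.items()}


# Toy data for illustration only; these are not values from the study.
toy_profiles = {
    "s1": {(100, "T"): 0.01, (200, "G"): 0.02},
    "s2": {(100, "T"): 0.03, (300, "A"): 0.05},
    "s3": {(400, "C"): 0.01},
}
toy_metadata = {
    "s1": {"household": "H1", "clade": "20C"},
    "s2": {"household": "H1", "clade": "20C"},
    "s3": {"household": "H2", "clade": "20C"},
}
toy_summary = mean_shared_by_pair_type(toy_profiles, toy_metadata)
```

The sketch omits steps the text describes, such as excluding pairs sequenced in the same batch and restricting to the earliest sample per participant. Those would be applied before the pairwise loop.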
While most SARS-CoV-2 genomic studies focus on consensus sequences, consensus sequences may not provide the resolution needed to reconstruct transmission linkages and identify potential sources of transmission in outbreak settings, where many cases may be closely genetically related. Here, we report that within-host SARS-CoV-2 genomic variation may contribute information about transmission that may augment the information contained in viral consensus sequences. We focused on household [...] within-host diversity is lower than that identified in other viral pathogens and, as previously reported, we find that within-host viral diversity is frequently lost during transmission [9].

As others have reported [9, 23, 33, 34], excluding sources of noise from within-host pathogen genomic data remains a major challenge. We sequenced artificial strain mixtures of two SARS-CoV-2 variants of concern and found significant tradeoffs between sensitivity and specificity in recovery of true within-host variants as increasingly strict variant filters were applied. Applying strict minor allele frequency thresholds excludes much potential within-host variation. Additionally, in our empirical sequencing data, we find that the signal of shared within-host variation across transmission pairs is strongest when including iSNVs at low minor allele frequency thresholds.

The optimal variant identification approach may differ across applications: for example, measurements of transmission bottleneck are highly sensitive to allele frequency threshold [9, 36] and may
prioritize specificity, while studies of transmission might prioritize sensitivity to identify potential transmission linkages. However, again as others have highlighted, our findings underscore the need to control for other potential explanations for shared iSNVs while still prioritizing sensitivity (Box 2). Our findings suggest that for transmission inference, privileging sensitivity in variant identification may greatly improve recovery of within-host variation, at a small cost in false positive variant calls.

Our study has several limitations. First, we focused on a convenience sample of residual samples with accompanying household information collected in California from March 2020 through May 2021. Replicating these findings in other settings and with more recently emerged SARS-CoV-2 lineages is critical to understand the generalizability of our findings. Second, our study focused on the potential epidemiological value of within-host viral variation. Our focus was on transmission linkage rather than on viral evolutionary dynamics or transmission bottlenecks, which might have different optimal variant identification approaches. Third, many groups have hypothesized that evolution within immune-compromised or immune-suppressed populations may be an important driver of the emergence of new variants of concern or interest [37-41]. Our sample collection did not enable us to test these hypotheses. Fourth, the epidemiological utility of within-host variation depends on SARS-CoV-2 sampling and sequencing. Routine sequencing may not always generate sufficient depth to accurately recover within-host variation.

In conclusion, we find that SARS-CoV-2 variation within individual hosts may be shared across transmission pairs and may contribute information on transmission linkage on a backdrop of limited diversity among consensus sequences.
More broadly, pathogen diversity within individual infections holds largely untapped information that may enhance the resolution of transmission inferences.

• True positive: Transmission of a diverse infecting inoculum.
o Within-host viral diversity can be structured temporally [33, 38, 41]
Ultrafast Sample placement on Existing tRees (UShER) enables real-time phylogenetics for the SARS-CoV-2 pandemic
Accommodating individual travel history and unsampled diversity in Bayesian phylogeographic inference of SARS-CoV-2
Tracking changes in SARS-CoV-2 Spike: evidence that D614G increases infectivity of the COVID-19 virus
Temporal signal and the phylodynamic threshold of SARS-CoV-2
Nosocomial Outbreak of SARS-CoV-2 in a Hospital Ward: Virus Genome Sequencing as a Key Tool to Understand Cryptic Transmission
In-Flight Transmission of SARS-CoV-2
Phylogenetic analysis of SARS-CoV-2 in Boston [...]
[...] host viral diversity during a SARS-CoV-2 outbreak on a fishing boat
Acute SARS-CoV-2 infections harbor limited within-host diversity and transmit via tight transmission bottlenecks
Rapid implementation of SARS-CoV-2 sequencing to investigate cases of health-care associated COVID-19: a prospective genomic surveillance study
Pneumococcal within-host diversity during colonisation, transmission and treatment
Inferring transmission from within- and between-host pathogen genetic diversity
Phylogenetics in HIV transmission: Taking within-host diversity into account
Genomic epidemiology of superspreading events in Austria reveals mutational dynamics and transmission properties of SARS-CoV-2
Transmission dynamics of SARS-CoV-2 within-host diversity [...]
An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar
Ultrafast metagenomic sequence classification using exact alignments
A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data
A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology
Masking strategies for SARS-CoV-2 alignments
Parallelization of MAFFT for large-scale multiple sequence alignments
Ape 5.0: An environment for modern phylogenetics and evolutionary analyses in [...]
IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era
Fitting Linear Mixed-Effects Models Using lme4
[...] American Statistical Association
Measurements of Intrahost Viral Diversity Are Extremely Sensitive to Systematic Errors in Variant Calling
Temporal dynamics of SARS-CoV-2 mutation accumulation within and across infected hosts
Narrow transmission bottlenecks and limited within-host viral diversity during a SARS-CoV-2 outbreak on a fishing boat
SARS-CoV-2 within-host diversity and transmission
Genomic epidemiology of superspreading events in Austria reveals mutational dynamics and transmission properties of SARS-CoV-2
Preliminary genomic characterisation of an emergent SARS-CoV-2 lineage in the UK defined by a novel set of spike mutations - SARS-CoV-2 coronavirus / nCoV-2019 Genomic Epidemiology - Virological [...] characterisation-of-an-emergent-sars-cov-2-lineage-in-the-uk-defined-by-a-novel-set-of-spike [...]
SARS-CoV-2 evolution during treatment of chronic infection
Within-host evolution of SARS-CoV-2 in an immunosuppressed COVID-19 patient as a source of immune escape variants
From one to many: The within-host rise of viral variants
Persistence and Evolution of SARS-CoV-2 in an Immunocompromised Host
Bayesian reconstruction of transmission within outbreaks using genomic variants
Nextclade: clade assignment, mutation calling and quality control for viral genomes
The nf-core framework for community-curated [...]