key: cord-0325888-ds6vtunx
authors: Van Poelvoorde, L. A. E.; Delcourt, T.; Coucke, W.; Herman, P.; De Keersmaecker, S. C. J.; Saelens, X.; Roosens, N.; Vanneste, K.
title: Strategy and performance evaluation of low-frequency variant calling for SARS-CoV-2 in wastewater using targeted deep Illumina sequencing
date: 2021-07-07
journal: nan
DOI: 10.1101/2021.07.02.21259923
sha: 423eea19ba872298ad3b857af922c488dd2f2350
doc_id: 325888
cord_uid: ds6vtunx

The ongoing COVID-19 pandemic, caused by SARS-CoV-2, constitutes a tremendous global health issue. Continuous monitoring of the virus has become a cornerstone to make rational decisions on implementing societal and sanitary measures to curtail the virus spread. Additionally, emerging SARS-CoV-2 variants have increased the need for genomic surveillance to detect particular strains because of their potentially increased transmissibility, pathogenicity and immune escape. Targeted SARS-CoV-2 sequencing of wastewater has been explored as an epidemiological surveillance method for the competent authorities. Few quality criteria are however available when sequencing wastewater samples, and those available typically only pertain to constructing the consensus genome sequence. Multiple variants circulating in the population can however be simultaneously present in wastewater samples. The performance, including detection and quantification of low-abundant variants, of whole genome sequencing (WGS) of SARS-CoV-2 in wastewater samples remains largely unknown. Here, we evaluated the detection and quantification of mutations present at low abundances using the SARS-CoV-2 lineage B.1.1.7 (alpha variant) defining mutations as a case study.
Real sequencing data were in silico modified by introducing mutations of interest into raw wild-type sequencing data, or by mixing wild-type and mutant raw sequencing data, to mimic wastewater samples subjected to WGS using a tiling amplicon-based targeted metagenomics approach and Illumina sequencing. As anticipated, higher variation, lower sensitivity and more false negatives were observed at lower coverages and allelic frequencies. We found that detection of all low-frequency variants at an abundance of 10%, 5%, 3% and 1% requires a sequencing coverage of at least 250X, 500X, 1500X and 10,000X, respectively. Although the variability of estimated allelic frequencies increased at decreasing coverages and lower allelic frequencies, its impact on reliable quantification was limited. This study provides a highly sensitive low-frequency variant detection approach, which is publicly available at https://galaxy.sciensano.be, and specific recommendations for minimum sequencing coverages to detect clade-defining mutations at specific allelic frequencies.

Some of these variants are characterized by potentially enhanced transmissibility, and can cause more severe infections and/or potential vaccine escape [16] [17] [18] [19] [20]. Consequently, monitoring current and potential future variants is crucial to control the epidemic by taking timely measures, because these variants can affect epidemiological dynamics, vaccine effectiveness and disease burden.

To monitor SARS-CoV-2 variants, RT-qPCR methods were designed to detect a selection of the mutations that define specific variants of concern (VOCs). However, VOCs are defined by a combination of multiple mutations, many VOCs are characterized by a high number of specific mutations, and only a few of these mutations can be targeted by RT-qPCR assays.
This approach is also not sustainable because it is likely that the ongoing vaccination and increased herd immunity will result in the selection of new mutations and the emergence of new VOCs [21], as has been observed with other viruses [22, 23]. Since only a few mutations can be targeted by an RT-qPCR assay, an additional step of whole genome sequencing (WGS) is required to fully confirm the variant's sequence [24].

This preprint (medRxiv, version posted July 7, 2021; https://doi.org/10.1101/2021.07.02.21259923), which was not certified by peer review, is made available under a CC-BY-NC-ND 4.0 International license. The copyright holder is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

WGS has been used to understand the viral evolution, epidemiology and impact of SARS-CoV-2, resulting in, as of July 2021, more than 2,000,000 publicly available SARS-CoV-2 genome sequences, mainly derived from respiratory samples, that are frequently submitted to the Global Initiative on Sharing All Influenza Data (GISAID) database [25]. Most of these sequences were obtained using amplicon sequencing in combination with the Illumina or Nanopore technology, with Illumina still being the most commonly used method [25, 26]. This large number of genomes allows reliable detection of variants based on the consensus genome sequence in patient samples [27] [28] [29] [30]. The European Centre for Disease Prevention and Control (ECDC) has defined several quality criteria for clinical samples depending on the application. For most genomic surveillance objectives, a consensus sequence of the (near-)complete genome is sufficient, and a minimal read length of 100 bp and a minimal coverage of 10X across more than 95% of the genome are recommended.
To reliably trace direct transmission and/or reinfection, a higher sequencing coverage of 500X across more than 95% of the genome is recommended for determining low-frequency variants (LFV) that can significantly contribute to the evidence for reinfection or direct transmission. In-depth genome analysis, including recombination, rearrangement, haplotype reconstruction and detection of large insertions and deletions (indels), should be performed using long-read sequencing technologies with a recommended read length of at least 1000 bp and a sequencing coverage of 500X across more than 95% of the genome [31]. Due to the high cost of sequencing large quantities of samples from individual patients, typically only samples that tested positive by RT-qPCR for a selection of mutations related to VOCs and that have a sufficiently high viral load are sequenced. Consequently, only a subset of all circulating variants is detected during routine clinical surveillance. Since wastewater samples contain SARS-CoV-2 RNA from both symptomatic and asymptomatic individuals, sequencing wastewater samples can provide a more comprehensive picture of the genomic diversity of SARS-CoV-2 circulating in the population compared to individual clinical testing and sequencing. Wastewater surveillance of SARS-CoV-2 may therefore be of considerable added value for SARS-CoV-2 genomic surveillance by providing a cost-effective, rapid and reliable source of information on the spread of SARS-CoV-2 variants in the population.

Sequencing of wastewater samples is however currently mainly used to reconstruct the consensus genome sequence of the most prevalent SARS-CoV-2 strain in the sample and
In this study, we evaluate the performance of LFV detection based on targeted SARS-CoV-2 sequencing to detect and quantify mutations present at low abundances. This approach mimics wastewater deep sequencing by means of the Illumina technology. We used mutations that define the B.1.1.7 lineage as a proof-of-concept. Using two real sequencing datasets that were in silico modified by either introducing mutations of interest into raw wild-type sequencing datasets or mixing wild-type and mutant raw sequencing data, we provide guidelines for minimum sequencing coverages to detect clade-defining mutations at specific

were generated for all these samples (Figure 1: Step 2). The workflow was built using the Snakemake workflow management system with Python 3.6.9 [44]. Next, the re-paired paired-end reads were trimmed using Trimmomatic v0.38

Samtools mpileup v1.9 [47] using the options "--count-orphans" and "--VCF". Next, the variants were called with bcftools call v1.9 [47] using the options "-O z", "--consensus-caller", "--variants-only" and "--ploidy 1", and converted to uncompressed VCF files and indexed with, respectively, bcftools view v1.9 [47] using the option "--output-type v" and bcftools index v1.9 [47] using the option "--force". Lastly, a temporary consensus sequence was generated using bcftools consensus v1.9 [47] with default settings, providing the reference genome and the produced VCF file as inputs. Afterwards, the previous steps were repeated once with the same options, using the generated temporary consensus sequence as FASTA reference, to generate the final consensus sequence. These sequences were used to confirm the presence and absence of the clade-defining mutations of the B.1.1.7 mutant in the mutant and wild-type samples, respectively (Table 1)

The first, second, and third columns present respectively the gene name, cDNA-level mutation and protein-level mutation. The last column describes whether the position is covered by one or two amplicons from the enrichment panel (Supplementary Table S1

From the initial 316 samples, ten mutant samples were selected that presented similar coverage depth at the positions of interest after normalization (see below). These samples contained the mutations assigned to the B.1.1.7 variant. Ten wild-type samples that did not contain any of these mutations were also chosen (Table 1, Table 2

(Table 2). Additionally, as suggested by ECDC, more than 95% of the genome was covered by reads with a minimal coverage of 500X [31].
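The ECDC criterion cited above (a minimal coverage of 500X across more than 95% of the genome) can be checked with a few lines of Python. This is a minimal sketch, not part of the authors' workflow: the function names are hypothetical, and the per-position depths are assumed to be available as a plain list (for instance parsed from the output of a depth tool such as `samtools depth -a`).

```python
from typing import Sequence


def breadth_of_coverage(depths: Sequence[int], min_depth: int) -> float:
    """Return the fraction of genome positions with depth >= min_depth."""
    if not depths:
        raise ValueError("depths must not be empty")
    covered = sum(1 for depth in depths if depth >= min_depth)
    return covered / len(depths)


def meets_coverage_criterion(
    depths: Sequence[int],
    min_depth: int = 500,       # 500X, as recommended by ECDC for LFV
    min_breadth: float = 0.95,  # "more than 95% of the genome"
) -> bool:
    """Check whether strictly more than min_breadth of positions reach min_depth."""
    return breadth_of_coverage(depths, min_depth) > min_breadth


# Toy genome of 100 positions (illustrative values, not data from the study):
# 97 positions at 600X and 3 positions at 100X.
toy_depths = [600] * 97 + [100] * 3
print(breadth_of_coverage(toy_depths, 500))
print(meets_coverage_criterion(toy_depths))
```

The comparison is strict because the criterion is phrased as "more than 95%"; a sample at exactly 95% breadth would not pass in this sketch.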
"LB", "PL", "PU" and "SM" set to the arbitrary placeholder value "test". The resulting BAM

files and generate a VCF file using the options "--call-indels" and "--no-default-filter" and using the consensus sequence as reference to call LFV. Next, the unfiltered VCF file was filtered using the filter function of the LoFreq v2.1.3.1 package. The strand bias threshold for reporting a variant was set to the maximum allowed value with the option "--sb-thresh 2147483647", so that highly strand-biased variants were retained, to account for the non-random distribution of reads due to the design of the amplification panel. All employed scripts are available in Supplementary File S2. Additionally, the workflow is available at the public Galaxy instance of our institute at https://galaxy.sciensano.be as a free resource for academic and non-profit usage. The presence of the nucleotides assigned to the B.1.1.7 lineage or the wild-type (Table 1)

(Table 1) were introduced at 26 different AF (mutant: 0%, 0.5%, 1%, 1.5%, 2%, 2.5%, 3%, 3.5%, 4%, 4.5%, 5%, 5.5%, 6%, 6.5%, 7%, 7.5%, 8%, 8.5%, 9%, 9.5%, 10%, 20%, 30%, 40%, 50%, 100%) at the various coverages mentioned above employing biostar404363. This resulted in 10 samples at 364 conditions (i.e. combinations of coverage and AF). Next, all reads containing indels were removed from these samples using samtools view v1.9. Finally, the three deletions associated with the B.1.1.7 lineage were introduced at the 26 AF mentioned above using BAMSurgeon 1.2 [59], which was adapted to decrease runtime, with the options "-p 10", "--force", "-d 0", "--ignorepileup", "--mindepth 1", "--minmutreads 1", "--maxdepth 1000000", "--aligner mem" and "--tagreads". A minority of reads that lacked a mate in the targeted regions were removed using an in-house Python 3.6 script.

Step 3) described in section 2.2 was used on these 10 samples for all 364 conditions using the FASTA file of the wild-type sample as reference with LoFreq.

(Table 2) was normalized to 5000X using BBMap v38.89 bbnorm.sh [43] with the options "target=5000", "mindepth=5", "fixspikes=f", "passes=3" and "uselowerdepth=t". However, because a tiled amplicon approach was used to amplify these samples prior to sequencing, regions where amplicons overlapped had double coverage, resulting in two coverages after normalization, i.e. 5000X and 10,000X (Supplementary Table S1).
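The effect of overlapping amplicons on the normalized coverage can be illustrated with a short sketch. It assumes, as a simplification, that every amplicon contributes the same normalized depth (5000X) at each position it spans; the amplicon coordinates below are hypothetical and do not come from the enrichment panel in Supplementary Table S1.

```python
from typing import List, Tuple

# An amplicon is described by its 1-based inclusive start and end coordinates.
Amplicon = Tuple[int, int]


def amplicons_covering(position: int, amplicons: List[Amplicon]) -> int:
    """Count how many amplicons span the given genomic position."""
    return sum(1 for start, end in amplicons if start <= position <= end)


def expected_depth(
    position: int,
    amplicons: List[Amplicon],
    per_amplicon_depth: int = 5000,  # normalization target used in the text
) -> int:
    """Expected depth if each covering amplicon contributes per_amplicon_depth."""
    return amplicons_covering(position, amplicons) * per_amplicon_depth


# Hypothetical tiled panel with 50 bp overlaps between neighbouring amplicons.
toy_panel = [(1, 400), (351, 750), (701, 1100)]

for pos in (200, 375, 720):
    print(pos, amplicons_covering(pos, toy_panel), expected_depth(pos, toy_panel))
```

Positions inside an overlap are counted twice and therefore end up at twice the normalization target, which matches the two coverage groups (5000X and 10,000X) described above.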
In silico datasets were then generated by mixing the appropriate number of reads for every combination of the ten wild-type and ten mutant samples, resulting in a total of 100 mixed samples, which were down-sampled using "seqtk sample" (with option "-s100") to the appropriate fractions for the required combination of 13 final coverages (100X, 250X, 500X, 750X, 1000X, 1500X, 2000X, 2500X, 3000X, 3500X, 4000X, 4500X and 5000X) and 26 AF (mutant: 0%, 0.5%, 1%, 1.5%, 2%, 2.5%, 3%, 3.5%, 4%, 4.5%, 5%, 5.5%, 6%, 6.5%, 7%, 7.5%, 8%, 8.5%, 9%, 9.5%, 10%, 20%, 30%, 40%, 50%, 100%). This resulted in 100 mixed samples at 338 conditions (i.e. combinations of coverage and AF). Finally, the LFV detection workflow (Figure 1: Step 3) described in section 2.2 was used on these samples for all conditions using the FASTA file of the wild-type sample as reference, except for samples mimicking 100% AF for the mutant positions, where the FASTA file of the mutant sample was used.

Although the second dataset was normalized for total coverage at every genomic position, the tiled amplicon approach resulted in some genomic positions being covered by two overlapping amplicons. Two groups of mutations were therefore obtained for every coverage (Table 2)

Table S1). For further analysis, the results were pooled together per theoretical coverage, resulting in 24 mutations per coverage but only 17 and 7 mutations at the lowest (i.e. 100X) and highest (i.e. 10,000X) coverage, respectively (Supplementary Table S2

considered as 'below the quantification limit', with the quantification limit equal to the lowest recorded value for that condition (i.e. combination of AF and coverage). Outliers were identified for each condition using the Grubbs test, which was applied sequentially by first searching for two outliers on the same side, followed by a search for exactly one outlier. If the p-value of the Grubbs test was below 0.05, outliers were excluded. The standard deviation (SD) and mean value of AF for every condition were estimated by a maximum likelihood model based on the normal distribution that took the FN into account as censored data. If the percentage of FN results was above 75%, the condition was however excluded from quantitative evaluation. Finally, a performance metric describing closeness to the true AF was calculated for each targeted AF individually by dividing each pooled squared SD by the maximal pooled squared SD. This metric ranges between 0 (relatively the closest to the targeted AF) and 1 (relatively the furthest from the targeted AF).

Table 3, for all evaluated coverages and targeted AFs until 20%. Results for all targeted AFs (including higher values) are presented in Supplementary Figure S1 and Supplementary Table S3. All LFV could be detected at an AF of 1% at a median coverage of 10,000X. As the coverage decreased, the AF threshold at which no single FN occurred (i.e. perfect sensitivity) increased to 1.5% at 5000X, 3% at 1000X, 5% at 500X, 9.5% at 250X, and 20% at 100X. When allowing a maximum of 10% FN (i.e. a sensitivity of 90%), the AF thresholds decreased substantially to 1% at 5000X, 1.5% at 1000X, 2.5% at 500X, 4% at 250X, and 7.5% at 100X. No false positive mutations related to the mutant and wild-type were observed at respectively 0% and 100% AF.

A second approach was also considered for mimicking targeted SARS-CoV-2 virus sequencing with a VOC present at low abundances, by in silico mixing real raw sequencing reads from ten B.1.1.7 samples into ten wild-type samples ('Dataset 2') for a total of 100 mixes at well-defined AFs and coverages, while applying coverage normalization so that individual mutations were present at approximately similar coverages for all B.1.1.7 clade-defining positions.

Figure 2B depicts the proportion of FN observations, and actual values are presented in Table 4, for all evaluated coverages and targeted AF until 20%. Results for higher targeted AF are presented in Supplementary Figure S2 and Supplementary Table S4. All LFV could be detected at an AF of 1% at a median coverage of 9792X. As the coverage decreased, the AF thresholds at which no single FN occurred (i.e. perfect sensitivity) increased to 1.5% at 4851X, 3.5% at 969X, 4% at 482X, 7% at 237X, and 20% at 97X. However, when allowing a maximum of 10% FN (i.e. reducing the sensitivity to 90%), the AF thresholds decreased substantially to 1% at 4851X, 2% at 969X, 3% at 482X, 4% at 237X, and 7% at 97X. No false positive mutations related to the mutant and wild-type were observed at 0% and 100% AF, respectively.
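One way to derive such AF thresholds from a table of FN proportions is sketched below. It assumes a specific reading of the threshold definition, namely the lowest targeted AF at which the FN proportion stays within the allowed maximum at that AF and at every higher AF. The function name and the FN values are hypothetical and only serve as illustration; they are not taken from Table 3 or Table 4.

```python
from typing import Dict, Optional


def af_threshold(fn_by_af: Dict[float, float], max_fn: float = 0.0) -> Optional[float]:
    """Return the lowest AF (in %) whose FN proportion is <= max_fn,
    provided the same holds for every higher AF; None if no AF qualifies."""
    threshold = None
    # Walk from the highest AF downwards and stop at the first violation.
    for af in sorted(fn_by_af, reverse=True):
        if fn_by_af[af] <= max_fn:
            threshold = af
        else:
            break
    return threshold


# Hypothetical FN proportions for one coverage (AF in % -> fraction of FN).
toy_fn = {1.0: 0.40, 1.5: 0.20, 2.0: 0.08, 2.5: 0.05, 3.0: 0.0, 5.0: 0.0}

print(af_threshold(toy_fn, max_fn=0.0))   # perfect sensitivity
print(af_threshold(toy_fn, max_fn=0.10))  # sensitivity of at least 90%
```

Relaxing the allowed FN proportion from 0 to 10% lowers the threshold in this toy example, mirroring the pattern reported for both datasets.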
Overall, the results for Dataset 1, using the median coverages, and Dataset 2, using the coverages at the positions of interest, were qualitatively similar.

Table 3 for Dataset 1 and Table 4 for Dataset 2). Results for targeted mutant AF

[Figure 2A. Axis: Coverage, with ticks at 100, 250, 500, 750, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500 and 5000. A further panel has a coverage axis starting at 97.]

Our results can serve as a reference for the scientific community to select appropriate thresholds for the AF and coverage. These could also be context-specific as a smaller or larger degree of false negatives might be warranted for specific applications, and can also be used as a baseline for determining the number of samples that can be multiplexed per run to optimize cost-efficiency of WGS.
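As a rough illustration of how the coverage recommendations reported in this study (250X, 500X, 1500X and 10,000X to detect all LFV at an AF of 10%, 5%, 3% and 1%, respectively) could feed into a multiplexing estimate, consider the following sketch. It rests on several assumptions that are not from the study: coverage is treated as uniform across the genome, which tiled amplicon data do not guarantee; the read length and the number of reads per run are hypothetical inputs; and the genome length is that of the commonly used SARS-CoV-2 reference (29,903 nt).

```python
import math

# Minimum coverage (X) to detect all LFV at a given AF (%), as reported in the study.
RECOMMENDED_COVERAGE = {10.0: 250, 5.0: 500, 3.0: 1500, 1.0: 10000}

GENOME_LENGTH = 29903  # nt, SARS-CoV-2 reference genome (assumption of this sketch)


def required_coverage(target_af: float) -> int:
    """Conservative lookup: use the nearest tabulated AF at or below target_af."""
    eligible = [af for af in RECOMMENDED_COVERAGE if af <= target_af]
    if not eligible:
        raise ValueError("no recommendation available below an AF of 1%")
    return RECOMMENDED_COVERAGE[max(eligible)]


def reads_per_sample(coverage: int, read_length: int) -> int:
    """Reads needed for a uniform coverage, ignoring trimming and off-target loss."""
    return math.ceil(coverage * GENOME_LENGTH / read_length)


def samples_per_run(run_reads: int, coverage: int, read_length: int) -> int:
    """Number of samples that fit in a run producing run_reads reads."""
    return run_reads // reads_per_sample(coverage, read_length)


coverage = required_coverage(4.0)  # falls back to the 3% recommendation
print(coverage, reads_per_sample(coverage, 150), samples_per_run(20_000_000, coverage, 150))
```

In practice the estimate would need a safety margin, because coverage in amplicon data varies strongly between positions and part of the reads is lost during quality filtering.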
Our findings highlight the feasibility of using targeted amplicon-based metagenomics approaches for wastewater surveillance, as such samples comprise a collection of different strains, among which the dominant strain will define the consensus sequence of the sample and the detected LFV will represent the circulating strains present at lower frequencies. Other studies that investigated LFV in wastewater provided limited quality criteria regarding the coverage and AF. Furthermore, the quality criteria in these studies were not evaluated using a defined population [33, 34]. ECDC has provided limited quality criteria regarding the sequencing coverage, namely 500X across 95% of the genome to detect LFV, but has not indicated the AF thresholds to which this corresponds for reliable LFV detection [31]. Based on the results obtained in this study, a coverage of 500X allowed detection of LFV down to an AF of 5% with perfect sensitivity, and would therefore be less suited to detect LFV at lower AFs. Lythgoe et al. recommended a depth of at least 100 reads with an AF of at least 3% to detect LFV in clinical samples with high viral loads (50,000 uniquely mapped reads) [64]. Based on the results in this study, these recommendations appear insufficiently strict, since we observed that an AF of 3% requires a sequencing coverage of at least 1500X to detect all LFV or 500X to detect 90% of

In the presence of multiple VOCs, the VOCs can be identified by composing all possible combinations of LFV as a conservative strategy, although multiple VOCs in one sample will also complicate the estimation of the relative abundance of each VOC. If multiple VOCs with partially overlapping defining mutations were present in a wastewater sample, some mutations of interest would consequently be present at different AFs.
Haplotype reconstruction methods could be used in such situations to delineate VOCs. However, most haplotype reconstruction programmes perform poorly under higher levels of diversity, and haplotype populations with rare haplotypes are often not recovered [65]. Although haplotype reconstruction has been described for short reads, Nanopore sequencing might offer a substantial advantage in such cases because its longer reads, despite their higher error rate, facilitate haplotype estimation to delineate actual VOCs.

In conclusion, there exists a pressing need for recommendations for detecting LFV for wastewater surveillance. Although further work is still required to investigate the specificity and the possibility to detect VOCs instead of just mutations, including for other existing and employed methodologies such as probe-based capture and/or Nanopore sequencing, this study demonstrates the feasibility of a targeted metagenomics approach for highly sensitive LFV detection with acceptable relative abundance estimations using a tiled-amplicon enrichment based on the Illumina technology. This approach enables the detection of mutations associated with specific VOCs. Our approach could also be used to evaluate the potential occurrence of co-infections with different SARS-CoV-2 variants in clinical samples. In future work this approach should be evaluated on real wastewater data, as in this study high-quality data from clinical specimens were used and modified in silico to mimic wastewater data.
In light of the pandemic urgency, and the multiple SARS-CoV-2 wastewater surveillance initiatives that are being established and integrated into overarching coordination and preparedness initiatives such as the recently announced European Health Emergency Preparedness and Response Authority [7], we hope that our results will help establish guidance and recommendations for wastewater surveillance and other relevant applications.

Coronavirus-2 RNA in Sewage and Correlation with Reported COVID-19 Prevalence in the Early Stage of the Epidemic in The Netherlands
SARS-CoV-2 Titers in Wastewater Are Higher than Expected from Clinically Confirmed Cases. mSystems;5. Epub ahead of print 21
Wastewater surveillance of SARS-CoV-2 for population-based health management
Pathogen Surveillance Through Monitoring of Sewer Systems
Wastewater-Based Epidemiology for Early Detection of Viral
SARS-CoV-2: sewage surveillance as an early warning system and challenges in developing countries
Temporal signal and the phylodynamic threshold of SARS-CoV-2
SARS-CoV-2 variants 351 and P.1 escape from neutralizing antibodies. Cell. Epub ahead of print
transmissibility and severity of novel SARS-CoV-2 Variant of Concern
Mitigations to Reduce Transmission of the new variant SARS-CoV-2 virus
UK - Scientific Advisory Group for Emergencies.
NERVTAG: Update note on
Mutations to the SARS-CoV-2 Spike Receptor-Binding Domain that Escape Antibody
Emerging SARS-CoV-2 Variants and Impact in Global Vaccination Programs against SARS-CoV-2/COVID-19
Evolution of Influenza A Virus by Mutation and Re-Assortment
Vaccination and antigenic drift in influenza
Two-step strategy for the identification of SARS-CoV-2 variant of concern 202012/01 and other variants with spike deletion H69-V70
Global initiative on sharing all influenza data - from vision to reality
Evaluation of NGS-based approaches for SARS-CoV-2 whole genome characterisation. Epub ahead of print 1
Genomic surveillance of Nevada patients revealed prevalence of unique SARS-CoV-2 variants bearing mutations in the RdRp gene
Identified Cases of SARS-CoV-2 Variant P.1 in the United States - Minnesota
Emergence of genomic diversity and recurrent mutations in SARS-CoV-2
Genomic monitoring of SARS-CoV-2 uncovers an Nsp1 deletion variant that modulates type I interferon response
Sequencing of Sewage Detects Regionally Prevalent SARS-CoV-2 Variants. Epub ahead of print
Detection of SARS-CoV-2 variants in Switzerland by genomic analysis of wastewater samples
medRxiv 2021
Deep sequencing analysis of viral infection and evolution allows rapid and detailed characterization of viral mutant spectrum
ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets
Sensitive and Specific Detection of Rare Variants in Mixed Viral Populations from Massively Parallel Sequence Data
Measurements of Intrahost Viral Diversity Are Extremely Sensitive to Systematic Errors in Variant Calling
Generation Whole Genome Sequencing Identifies the Direction of Norovirus Transmission in Linked Patients
Sequencing of A(H3N2) Influenza Viruses Reveals Variants Associated with Severity during the 2016-2017 Season
Intrahost dynamics of antiviral resistance in influenza A virus reflect complex patterns of segment linkage, reassortment, and natural selection
The Sequence Read Archive
Trimmomatic: a flexible trimmer for Illumina sequence data
Fast gapped-read alignment with Bowtie 2. Epub ahead of print 16
Preliminary genomic characterisation of an emergent SARS-CoV-2 lineage in the UK defined by a novel set of spike mutations
Public Health England. Variants of concern or under investigation
Centers for Disease Control and Prevention (CDC). SARS-CoV-2 Variants
WHO. Tracking SARS-CoV-2 variants
SARS-CoV-2 variants in Switzerland
Preliminary genomic characterisation of an emergent SARS-CoV-2 lineage in the UK defined by a novel set of spike mutations
Alignment/Map format and SAMtools
Array programming with NumPy
Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data
Combining tumor genome simulation with crowdsourcing to benchmark somatic single-nucleotide-variant detection
BEDTools: a flexible suite of utilities for comparing genomic features
Interactive Web-Based Data Visualization with R, plotly, and shiny
Viral load of SARS-CoV-2 in clinical samples
fate and removal of SARS-CoV-2 in wastewater: Current knowledge and future perspectives
Evaluation of haplotype callers for next-generation sequencing of viruses

Dataset 2 (Supplementary File S4), the SD systematically decreased per target AF as coverage increased. This provisional analysis also indicated that for both datasets, irrespective of coverage, the SD generally increased between a targeted AF of 1% to 10%, after which it plateaued for targeted AFs above 20%. We therefore employed the squared SD per AF divided by the maximal squared SD per target AF to describe closeness of observed AF to the true AF, for which results are presented in Figure 3A for Dataset 1.
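The closeness metric described above (squared SD divided by the maximal squared SD for the same target AF) is straightforward to compute. The sketch below uses a hypothetical function name and made-up SD values for a single target AF; the SDs in the study were estimated with a censored maximum likelihood model, which is not reproduced here.

```python
from typing import Dict


def closeness_metric(sd_by_coverage: Dict[int, float]) -> Dict[int, float]:
    """For one target AF, scale each squared SD by the largest squared SD.

    Values close to 0 indicate observed AFs that are relatively close to the
    target AF; a value of 1 marks the coverage with the largest variation.
    """
    if not sd_by_coverage:
        raise ValueError("sd_by_coverage must not be empty")
    max_variance = max(sd ** 2 for sd in sd_by_coverage.values())
    if max_variance == 0:
        raise ValueError("all SDs are zero; the metric is undefined")
    return {cov: sd ** 2 / max_variance for cov, sd in sd_by_coverage.items()}


# Hypothetical SDs of the observed AF (in percentage points) at three coverages.
toy_sd = {100: 2.0, 500: 1.0, 5000: 0.5}
print(closeness_metric(toy_sd))
```

Because the scaling is done per target AF, the metric compares coverages within one AF and is not meant for comparing absolute variation between different AFs.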
As expected, the variation in AF estimates varied as a function of the median coverage and the targeted AF, with variation decreasing per target AF as coverage increased, and with variation generally being more pronounced at low AFs irrespective of coverage. Notwithstanding, even for regions in Figure 3A [...] 6.26% at an AF of 50%, 0.36%-3.49% at an AF of 10% and 0.27%-2.07% at an AF of 5%, with the highest IQR observed at lower coverages.

Results for the quantitative evaluation of Dataset 2 are presented in Figure 3B, and are in accordance with the trends observed for Dataset 1, with the variation decreasing per target AF as coverage increased, and lower target AFs exhibiting increasing variation irrespective of coverage. Notwithstanding, similarly to Dataset 1, the observed total variation remained small (Supplementary File S4). The IQR (Supplementary File S4D) of the observed AF was limited at the various targeted AFs, ranging from 0.73%-3.93% at an AF of 50%, 0.41%-3.93% at an AF of 10% and 0.29%-2.27% at an AF of 5%, with the highest IQR observed at lower coverages.
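As an illustration of how the IQR of the observed AF can be summarized per coverage level, the sketch below uses only the Python standard library. The quartile method (`inclusive`), the function names and the example values are assumptions made for this example; the source does not state how the quartiles were computed.

```python
import statistics

def iqr(values):
    """Interquartile range (Q3 - Q1) of a list of observed AFs."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

def iqr_by_coverage(observed_af_by_coverage):
    """Return {coverage: IQR of the observed AFs at that coverage}."""
    return {
        coverage: iqr(afs)
        for coverage, afs in observed_af_by_coverage.items()
    }

# Hypothetical observed AFs (in percent) for a target AF of 5%.
example_afs = {
    100: [3.0, 4.0, 5.0, 6.0, 7.0],
    1000: [4.5, 4.8, 5.0, 5.2, 5.5],
}
iqr_table = iqr_by_coverage(example_afs)
# Coverage level at which the spread of the observed AF is largest.
widest = max(iqr_table, key=iqr_table.get)
```

In this made-up example the widest IQR occurs at the lowest coverage, consistent with the pattern described for both datasets.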
In Dataset 2, sequencing coverages were normalized, which allowed evaluating with high precision how reliable AF detection is at specific coverages. Afterwards, the ability to both detect and quantify LFV was evaluated. Results demonstrated that WGS enabled detecting LFV with very high performance. As expected, lower coverages and AFs resulted in lower sensitivity and higher variability of estimated AFs. Employing the most conservative thresholds from either Dataset 1 or Dataset 2, we found that a sequencing coverage of 250X, 500X, 1500X and 10,000X is required to detect all LFV at an AF of 10%, 5%, 3% and 1%, respectively (Table 3 and Table 4). For quantification of variants, the variability remained small overall for all conditions respecting the above thresholds, resulting in reliable abundance estimations, despite the variability of estimated AF increasing at lower coverages [...] at an AF of 10%, 5%, 3% and 1% were still detected at a sequencing coverage of 100X, 250X, 500X and 2500X, respectively (Table 3 and Table 4). This study focused on the sensitivity of LFV detection and did not explore the false positive rates (i.e. specificity). [...] could be investigated using RT-qPCR or RT-ddPCR assays that target those specific positions.

The authors declare that there are no conflicts of interest.

This study was financed by Sciensano through COVID-19 special funding.
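The minimum-coverage recommendations given in the discussion above (250X, 500X, 1500X and 10,000X for AFs of 10%, 5%, 3% and 1%) can be encoded as a small lookup for use when screening samples. This is a hedged sketch: the function and constant names are hypothetical, and the expected-read-count arithmetic (coverage multiplied by AF) is added here purely for illustration and is not a calculation reported in the study.

```python
# Minimum coverage at which all LFV were detected, per AF, as reported in the text.
MIN_COVERAGE_ALL_DETECTED = {0.10: 250, 0.05: 500, 0.03: 1500, 0.01: 10_000}

def lowest_reliable_af(coverage):
    """Lowest tabulated AF whose minimum coverage is met, or None if none is met."""
    eligible = [
        af for af, min_cov in MIN_COVERAGE_ALL_DETECTED.items()
        if coverage >= min_cov
    ]
    return min(eligible) if eligible else None

def expected_variant_reads(coverage, af):
    """Expected number of reads carrying the variant (illustrative arithmetic only)."""
    return coverage * af

# Example: a position covered 600X meets the 10% and 5% thresholds only.
af_floor = lowest_reliable_af(600)
reads_at_thresholds = {
    af: round(expected_variant_reads(min_cov, af))
    for af, min_cov in MIN_COVERAGE_ALL_DETECTED.items()
}
```

A position covered below 250X returns `None`, signalling that none of the tabulated AFs can be assumed to be detected in full at that depth.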
medRxiv preprint, this version posted July 7, 2021; https://doi.org/10.1101/2021.07.02.21259923. This preprint was not certified by peer review. The copyright holder is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.