key: cord-0947272-046z2c69
authors: Bull, Rowena A.; Adikari, Thiruni; Ferguson, James M.; Hammond, Jillian M.; Stevanovski, Igor; Beukers, Alicia G.; Naing, Zin; Yeang, Malinna; Verich, Andrey; Gamaarachchi, Hasindu; Kim, Ki Wook; Luciani, Fabio; Stelzer-Braid, Sacha; Eden, John-Sebastian; Rawlinson, William D.; van Hal, Sebastiaan J.; Deveson, Ira W.
title: Analytical validity of nanopore sequencing for rapid SARS-CoV-2 genome analysis
date: 2020-10-20
journal: bioRxiv
DOI: 10.1101/2020.08.04.236893
sha: c22f16a44dd6638ac71c3ccfd53870cc0c5c168e
doc_id: 947272
cord_uid: 046z2c69

Viral whole-genome sequencing (WGS) provides critical insight into the transmission and evolution of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2). Long-read sequencing devices from Oxford Nanopore Technologies (ONT) promise significant improvements in turnaround time, portability and cost, compared to established short-read sequencing platforms for viral WGS (e.g., Illumina). However, adoption of ONT sequencing for SARS-CoV-2 surveillance has been limited due to common concerns around sequencing accuracy. To address this, we performed viral WGS with ONT and Illumina platforms on 157 matched SARS-CoV-2-positive patient specimens and synthetic RNA controls, enabling rigorous evaluation of analytical performance. Despite the elevated error rates observed in ONT sequencing reads, highly accurate consensus-level sequence determination was achieved, with single nucleotide variants (SNVs) detected at >99% sensitivity and >99% precision above a minimum ~60-fold coverage depth, thereby ensuring suitability for SARS-CoV-2 genome analysis. ONT sequencing also identified a surprising diversity of structural variation within SARS-CoV-2 specimens that was supported by evidence from short-read sequencing on matched samples. However, ONT sequencing failed to accurately detect short indels and variants at low read-count frequencies. This systematic evaluation of analytical performance for SARS-CoV-2 WGS will facilitate widespread adoption of ONT sequencing within local, national and international COVID-19 public health initiatives.

Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) is the causative pathogen of COVID-19 [1,2]. SARS-CoV-2 is a positive-sense single-stranded RNA virus with a ~30 kb poly-adenylated genome [1,2]. Complete genome sequences published in January 2020 [1,3] enabled development of RT-PCR assays for SARS-CoV-2 detection that have served as the diagnostic standard during the ongoing COVID-19 pandemic [4]. Whole-genome sequencing (WGS) of SARS-CoV-2 provides additional data to complement routine diagnostic testing. Viral WGS informs public health responses by defining the phylogenetic structure of disease outbreaks [5]. Integration with epidemiological data identifies transmission networks and can infer the origin of unknown cases [6-11]. Large-scale, longitudinal surveillance by viral WGS may also provide insights into virus evolution, with important implications for vaccine development [12-15].

WGS can be performed via PCR amplification or hybrid-capture of the reverse-transcribed SARS-CoV-2 genome sequence, followed by high-throughput sequencing. Short-read sequencing technologies (e.g., Illumina) enable accurate sequence determination and are the current standard for pathogen genomics. However, long-read sequencing devices from Oxford Nanopore Technologies (ONT) offer an alternative with several advantages. ONT devices are portable, cheap, require minimal supporting laboratory infrastructure or technical expertise for sample preparation, and can be used to perform rapid sequencing analysis with flexible scalability [16]. The use of ONT devices for viral surveillance has been demonstrated during Ebola, Zika and other disease outbreaks [17-19]. Although protocols for ONT sequencing of SARS-CoV-2 have been established and applied in both research and public health settings [20-22], adoption of the technology has been limited due to concerns around its accuracy. ONT devices exhibit lower read-level sequencing accuracy than short-read platforms [23-25]. This may have a disproportionate impact on SARS-CoV-2 analysis, due to the virus' low mutation rate (8 × 10^-4 substitutions per site per year [26]), which means that erroneous (false-positive) or undetected (false-negative) genetic variants have a strong confounding effect. To address concerns regarding ONT sequencing accuracy and evaluate its analytical validity for SARS-CoV-2 genomics, we performed amplicon-based nanopore and short-read WGS on matched SARS-CoV-2-positive patient specimens and synthetic RNA controls, allowing rigorous evaluation of ONT performance characteristics.

Synthetic DNA or RNA reference standards can be used to assess the accuracy and reproducibility of next-generation sequencing assays [27]. We first sequenced synthetic RNA controls that were generated by in vitro transcription of the SARS-CoV-2 genome sequence. The controls matched the Wuhan-Hu-1 reference strain at all positions, allowing analytical errors to be unambiguously identified. To mimic a real-world viral WGS experiment, synthetic RNA was reverse-transcribed then amplified using multiplexed PCR of 98 × ~400 bp amplicons that enabled evaluation of ~95% of the SARS-CoV-2 genome. Eight independent replicates were sequenced on ONT PromethION and Illumina MiSeq instruments (see Methods).

We aligned the resulting reads to the Wuhan-Hu-1 reference genome to assess sequencing accuracy and related quality metrics (Fig. S1a-i). Illumina and ONT platforms exhibited distinct read-level error profiles, with the latter characterised by an elevated rate of both substitution (23-fold) and insertion-deletion (indel) errors (76-fold; Table 1; Fig. S1d,e). Per-base error frequency profiles showed clear correlation between ONT replicates (substitution R² = 0.67; indel R² = 0.82; Fig. S1f,g). This indicates that ONT sequencing errors are not entirely random but are influenced by local sequence context. For example, indel errors were enriched (1.4-fold) at low-complexity sequences within the SARS-CoV-2 genome (i.e., sites with homopolymeric or repetitive content; ~1% of the genome; Fig. S1d). Illumina error profiles showed weaker correlation between replicates (substitution R² = 0.15; indel R² = 0.42), indicating that short-read sequencing errors were less systematic than for ONT libraries (Fig. S1h,i).
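The replicate-to-replicate correlation described above can be illustrated with a small calculation. This is a minimal sketch, assuming that R² denotes the squared Pearson correlation between the per-base error frequencies of two replicates (the source does not state how R² was computed). The function name and the toy profiles are hypothetical.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length numeric sequences.

    Assumption: this mirrors the R^2 quoted in the text; the exact method
    used in the study is not specified there.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return (cov * cov) / (var_x * var_y)


# Hypothetical per-base error frequencies at five genome positions.
replicate_a = [0.01, 0.02, 0.10, 0.03, 0.08]
replicate_b = [0.02, 0.04, 0.20, 0.06, 0.16]  # exactly twice replicate_a

print(r_squared(replicate_a, replicate_b))
```

A value near 1 means the same positions are error-prone in both replicates (systematic, context-dependent errors), whereas a value near 0 means errors fall at unrelated positions.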

Despite their distinct error profiles, both sequencing platforms demonstrated high consensus-level sequencing accuracy across the SARS-CoV-2 genome. We used iVar and Medaka workflows to determine consensus genome sequences for Illumina and ONT libraries, respectively (see Methods). We detected just two erroneous variant candidates in a single ONT library (Table 1). Both of these were single-base insertions occurring at low-complexity sites (Fig. S2), with no erroneous SNVs detected in any replicate (n = 8). All Illumina libraries exhibited perfect accuracy (Table 1). Therefore, the sequencing artefacts affecting both technologies had minimal impact on the accuracy of consensus-level sequence determination, with indel errors in ONT samples being a possible exception.

To further evaluate the suitability of ONT sequencing for SARS-CoV-2 genomics, we conducted rigorous proficiency testing using bona fide clinical specimens. We performed ONT and Illumina WGS on matched, de-identified SARS-CoV-2-positive cases collected at public hospital laboratories in Eastern & Southern New South Wales and Metropolitan Sydney from March-April 2020 (see Methods; Supplementary Table 1). The SARS-CoV-2 genome was enriched by PCR amplification, using a custom set of 14 × ~2.5 kb amplicons that covers 29783/29903 bp (99.6%) of the genome, including 100% of annotated protein-coding positions [6]. Pooled amplicons then underwent parallel library preparation and sequencing on an ONT GridION/PromethION and an Illumina MiSeq instrument (see Methods). Short-read sequencing was performed according to a pathogen genomics accredited diagnostic workflow in a reference NSW Health Pathology laboratory, enabling direct comparison of nanopore sequencing to the established standard for pathogen genomics. In total, we obtained complete (99.6%) genome coverage with both technologies for 157 matched positive cases (Supplementary Table 1). By comparison to the Wuhan-Hu-1 reference strain, Illumina sequencing identified 7.6 consensus single-nucleotide variants (SNVs) and 0.04 indels, on average, per sample. A further 1.0 SNVs and 0.2 indels per sample were detected at sub-consensus read-count frequencies (20-80%), indicative of intra-specimen genetic diversity (see below). Excluding positions with evidence of sub-consensus variation, this provides an overall comparison set of 1201 consensus variants and 4,674,554 positions that match the reference strain in a given sample, against which to assess the accuracy of SARS-CoV-2 nanopore sequencing (Supplementary Table 1).
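The size of the comparison set can be checked with simple arithmetic. The sketch below is one plausible reconstruction, not a calculation stated in the source: it assumes that each of the 157 samples contributes the 29,783 covered positions, and that each consensus variant (1201 in total) and each sub-consensus variant (156 SNVs and 20 indels, as reported in the within-host diversity analysis) removes one position from the reference-matching count. All variable names are hypothetical.

```python
# Values taken from the text.
GENOME_LENGTH_BP = 29903
COVERED_BP = 29783
N_SAMPLES = 157
CONSENSUS_VARIANTS = 1201

# Assumption: sub-consensus variants (156 SNVs + 20 indels) each occupy one
# excluded position; the source does not spell this out.
SUBCONSENSUS_VARIANTS = 156 + 20

covered_fraction = COVERED_BP / GENOME_LENGTH_BP
total_positions = N_SAMPLES * COVERED_BP
reference_matching = total_positions - CONSENSUS_VARIANTS - SUBCONSENSUS_VARIANTS

print(f"covered fraction: {100 * covered_fraction:.1f}%")
print(f"positions across all samples: {total_positions}")
print(f"reference-matching positions: {reference_matching}")
```

Under these assumptions the result is 4,674,554, which agrees with the figure quoted in the text.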

We used each of two best-practice bioinformatics pipelines developed by the ARTIC network to identify consensus variants with ONT sequencing data. The alternative pipelines differed primarily in their use of either Medaka or Nanopolish to call variants (see Methods). In general, ONT variant candidates identified by both pipelines were highly concordant with the Illumina comparison set. Illumina variants were detected with 99.17% sensitivity and 99.58% precision by Nanopolish, compared to 98.33% sensitivity and 99.24% precision by Medaka (Table 2). Undetected variants (false-negatives) were more frequent than erroneous candidates (false-positives), occurring in 14/157 (9%) and 9/157 (6%) of Medaka samples, respectively (Supplementary Table 2). Only 1/7 (14%) of consensus indels in the Illumina comparison set was detected by either Nanopolish or Medaka, while a further five and nine false-positive indels were detected by the respective pipelines (Supplementary Table 2). While the scarcity of consensus indels detected with either sequencing technology prevented a more thorough evaluation of indel accuracy, this indicates that ONT is inadequate for accurate detection of small indels in the SARS-CoV-2 genome. In contrast, SNVs were detected by Nanopolish and Medaka with high accuracy: overall, we found 99.66% and 98.83% concordance between ONT and Illumina SNVs, as measured by Jaccard similarity, with identical results in 145/157 (92%) and 153/157 samples (97%), respectively (Table 2). Inspection of false-positive and false-negative variant candidates detected with ONT sequencing data showed that these tended to occur in low-complexity sequences, which are known to be refractory to ONT base-calling algorithms [23]. For example, false-negative and/or false-positive candidates were found within a 21 bp T-rich site in the orf1ab gene in multiple samples (Fig. S3a,b).
We identified fifteen problematic low-complexity sites in the SARS-CoV-2 genome, ranging from 9 to 42 bp in length, that showed elevated read-level sequencing error rates (Fig. S1d; Supplementary File 1). Exclusion of these positions (~1% of the genome) improved the fidelity of ONT variant detection, with consensus SNVs in the Illumina comparison set being detected with 99.83% and 99.40% sensitivity by Nanopolish and Medaka, respectively, and perfect precision for both. Consensus SNVs detected with the Nanopolish workflow were identical between ONT and Illumina data in 155/157 (99%) of samples (Table 2; Supplementary Table 3). This suggests that the accuracy of nanopore WGS may be improved via the exclusion of a small number of 'blacklist' low-complexity sites in the SARS-CoV-2 genome from downstream analysis. We next assessed the impact of sequencing depth on ONT performance. To do so, we down-sampled nanopore sequencing reads from a uniform 200-fold coverage across the SARS-CoV-2 genome and repeated variant detection across a range of coverage depths (see Methods). Both sensitivity and precision of variant detection were strongly influenced by sequencing coverage, showing a sharp decline below ~50-fold coverage depth, with minimal improvement observed above ~60-fold (Fig. 1a,b). As above, excluding error-prone low-complexity sequences afforded consistent improvements to sensitivity and overall concordance across the range of depths tested (Fig. 1a,b).
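Excluding 'blacklist' sites from downstream analysis amounts to dropping any variant candidate whose position falls inside a masked interval. The following is a minimal sketch of that idea, assuming 1-based inclusive intervals and variants represented as (position, reference base, alternative base) tuples. The interval coordinates and variant calls shown are hypothetical placeholders, not the fifteen sites listed in Supplementary File 1.

```python
def in_masked_region(position, masked_intervals):
    """Return True if a 1-based position lies inside any inclusive interval."""
    return any(start <= position <= end for start, end in masked_intervals)


def exclude_masked(variants, masked_intervals):
    """Keep only variants whose position is outside every masked interval."""
    return [v for v in variants if not in_masked_region(v[0], masked_intervals)]


# Hypothetical low-complexity intervals (start, end), 1-based inclusive.
masked = [(1000, 1020), (5000, 5041)]

# Hypothetical variant candidates: (position, ref, alt).
candidates = [(241, "C", "T"), (1010, "T", "TT"), (5041, "A", "G"), (23403, "A", "G")]

kept = exclude_masked(candidates, masked)
print(kept)
```

For the comparison to remain fair, the same mask would need to be applied to both the ONT calls and the Illumina comparison set before computing sensitivity and precision.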

To verify these observations and assess reproducibility, we re-sequenced twelve specimens to generate triplicate (n = 3) data on both Illumina and ONT platforms (see Methods). We measured reproducibility by performing pairwise comparisons of detected variant candidates between replicates for a given sample (Supplementary Table 4). No discordant variants were detected between Illumina replicates across any of the 36 pairwise sample comparisons (309 variants total), confirming the reliability of short-read WGS. ONT also showed high reproducibility, with 99.36% Jaccard similarity between Medaka replicates for consensus variants (310 total) and perfect concordance for SNVs (Supplementary Table 4).

In summary, ONT sequencing enabled highly accurate and reproducible detection of consensus-level SNVs in SARS-CoV-2 patient isolates but appears generally unsuitable for the detection of small indel variants.

Within-host genetic diversity is a common feature of RNA viruses, with divergent quasi-species present in a single infection. Within-host diversity may help infecting viruses evade the host immune response and adapt to changing environments, and can cause more severe and/or long-lasting disease [28-30]. Resolving this diversity may also better inform studies of virus transmission than consensus-level phylogenetics alone [31-33]. Therefore, we next evaluated the capacity of nanopore sequencing to identify intra-specimen genetic variation by detecting variants present at sub-consensus frequencies (i.e., variants detected in <80% of mapped reads). Analysis of the SARS-CoV-2 synthetic RNA controls (see above) showed that sequencing artefacts in Illumina libraries could be misinterpreted as variants at read-count frequencies below ~20% (Fig. S2b), effectively establishing a lower bound for variant detection. We therefore limited our analysis to variants detected at ≥20% frequency, taking variants detected by Illumina sequencing above this level to be genuine. Overall, short-read sequencing identified sub-consensus variants (20-80%) in 54/157 samples, comprising 156 SNVs and 20 indels (Supplementary Table 5). Using Varscan2, we identified 154 sub-consensus SNV candidates in ONT sequencing libraries (Supplementary Table 5). We detected 119 SNVs (sensitivity = 76.3%) in the Illumina comparison set and 25 false-positives (precision = 82.6%; Supplementary Table 5). Read-count frequencies for variants identified with both technologies were correlated (R² = 0.69), indicating that these were bona fide variants, rather than sequencing artefacts (Fig. 1c). While the overall performance of sub-consensus SNV detection was quite poor, most false-positives and false-negatives were confined to the lower end of the frequency range assessed here (Fig. 1c,d).
For example, SNVs at high (60-80%) and intermediate (40-60%) sub-consensus frequencies were detected with relatively high sensitivity (95.7%, 91.3%) and precision (100%, 97.7%), whereas low-frequency variants (20-40%) were detected with low sensitivity (63.2%) and precision (69.6%; Fig. 1d). Unsurprisingly, the high rate of indel errors in ONT sequencing libraries meant that they were unsuitable for detecting indel diversity, with errors overwhelming true variants (Supplementary Table 5). In summary, ONT sequencing enabled detection of within-specimen SNVs at frequencies from ~40-80% with adequate accuracy but was generally unsuitable for the detection of indels or rare SNVs (<40%).
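The frequency classes used throughout this analysis (consensus >80%; sub-consensus high 60-80%, intermediate 40-60% and low 20-40%; below 20% excluded) can be expressed as a simple binning function. This is a minimal sketch: the function name and labels are hypothetical, and the handling of values that fall exactly on a boundary (for example, 60% counted as "high") is an assumption, since the source does not state which side of each boundary is inclusive.

```python
def classify_read_count_frequency(freq):
    """Assign a variant to a frequency class from its read-count frequency (0-1).

    Assumption: lower bounds of the sub-consensus bins are inclusive and
    exactly 0.80 is treated as sub-consensus (consensus requires > 0.80).
    """
    if not 0.0 <= freq <= 1.0:
        raise ValueError("frequency must lie between 0 and 1")
    if freq > 0.80:
        return "consensus"
    if freq >= 0.60:
        return "sub-consensus (high)"
    if freq >= 0.40:
        return "sub-consensus (intermediate)"
    if freq >= 0.20:
        return "sub-consensus (low)"
    return "excluded (potentially spurious)"


# Hypothetical example: 45 of 100 mapped reads support the variant.
print(classify_read_count_frequency(45 / 100))
```

Binning in this way makes it straightforward to report sensitivity and precision separately for each frequency range, as done in the text.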

Large genomic deletions or rearrangements can have a major impact on virus function and evolution; however, there are currently just a few reported cases of SARS-CoV-2 specimens harbouring structural variants (SVs) [15,34]. Therefore, we next evaluated the detection of SVs in SARS-CoV-2 specimens with ONT sequencing. We used NGMLR-Sniffles to identify potential SVs in ONT libraries and validated these with supporting evidence from short-read sequencing (see Methods). Across all SARS-CoV-2 patient specimens, we detected sixteen candidate deletions ranging in size from 15-1,840 bp (Table 3), while no other SV types were identified. Of these, 13/16 were supported by split short-read alignments and/or discordant read-pairs in matched Illumina libraries (Fig. S4a; Table 3). For 7/16 candidates, short-read evidence confirmed the presence of the deletion but indicated that the breakpoint position was not accurately placed by ONT reads (Fig. S4b; Table 3). Among the thirteen deletions detected by both platforms were examples in genes S, M, N, ORF3, ORF6, ORF8 and orf1ab (Table 3). Only one variant, a 328 bp deletion in ORF8 (Fig. S4c), was detected in multiple specimens, although highly similar (but not identical) 28 bp and 29 bp deletions were also detected in S in two unrelated specimens (Fig. S4d).

Overall, this analysis demonstrates that large deletions can be reliably detected using ONT sequencing and suggests that structural variation in the SARS-CoV-2 genome is more common and diverse than currently appreciated.

Viral WGS can be used to study the transmission and evolution of SARS-CoV-2, and is increasingly recognised as a critical tool for public health responses to COVID-19. Nanopore sequencing offers an alternative to established short-read platforms for viral WGS with several advantages. ONT devices: (i) are relatively inexpensive, highly portable and require minimal associated laboratory infrastructure; (ii) enable rapid generation of sequencing data and even real-time data analysis; (iii) require comparatively simple procedures for library preparation; and (iv) offer flexibility in sample throughput, accommodating single (e.g., Flongle), multiple (e.g., MinION/GridION) or tens/hundreds (e.g., PromethION) of specimens per flow-cell [16,18]. Therefore, ONT sequencing could further empower SARS-CoV-2 surveillance initiatives by enabling point-of-care WGS analysis and improved turnaround time for critical cases, particularly in isolated or poorly resourced settings [35].

Due to the relatively low mutation rate observed in SARS-CoV-2 [26], accurate sequence determination is vital to correctly define the phylogenetic structure of disease outbreaks. With ONT sequencing known to exhibit higher read-level sequencing error rates than short-read technologies [23-25], reasonable concerns exist about the suitability of the technology for SARS-CoV-2 genomics. Moreover, public databases for SARS-CoV-2 data (e.g., GISAID: https://www.gisaid.org/) already contain consensus genome sequences generated via ONT sequencing, potentially confounding investigations that rely on these resources. The present study resolves these concerns, demonstrating accurate consensus-level SARS-CoV-2 sequence determination with ONT data. We report that: (i) variants at consensus-level read-count frequencies (80-100%) were detected with >99% sensitivity and >99% precision across 157 SARS-CoV-2-positive specimens, confirming the suitability of ONT sequencing for standard phylogenetic analyses; (ii) high accuracy and reproducibility were achieved by each of two alternative tools for ONT variant detection, with Nanopolish showing modest improvements over Medaka; (iii) a minimum ~60-fold sequencing depth was required to ensure accurate detection of SNVs, but little or no improvement was achieved above this level; (iv) false-positive and false-negative variants were typically observed at low-complexity sequences, with fidelity improved by excluding these problematic sites; (v) in contrast to consensus SNVs, ONT sequencing performed poorly in the detection of consensus indels or low-frequency variants (such variants should therefore be interpreted with caution); (vi) while the high indel error rate in ONT sequencing impedes accurate detection of small indels, long nanopore reads appear well-suited for the detection of large deletions and potentially other structural variants.
Although SNVs alone are sufficient for routine phylogenetic analysis, small indels and large structural variants can profoundly impact gene function and are, therefore, of interest to studies of virus evolution and pathogenicity [15]. As the first systematic evaluation of nanopore sequencing for SARS-CoV-2 WGS, this study removes an important barrier to its widespread adoption in the ongoing COVID-19 pandemic. While short-read sequencing platforms remain the gold standard for high-throughput viral sequencing, the advantages in portability, cost and turnaround time afforded by nanopore sequencing imply that this emerging technology can serve an important complementary role in local, national and international COVID-19 response strategies.

Synthetic controls used in this study were manufactured by Twist Biosciences and are commercially available (Catalog item 101024). The controls comprise synthetic RNA generated by in vitro transcription (IVT) of the SARS-CoV-2 genome sequence, representing the complete genome in 6 × ~5 kb continuous sequences. The controls used in this study are identical in sequence to the Wuhan-Hu-1 reference strain (MN908947.3), allowing sequencing artefacts to be readily identified. Synthetic controls were prepared for sequencing via a protocol established by the ARTIC network for viral surveillance (https://artic.network/ncov-2019). Briefly, reverse-transcription was performed on aliquots of synthetic RNA (at 10^6 copies per µL) using Superscript IV (Thermo Fisher Scientific) with both random hexamers and oligo-dT primers. Prepared cDNA was then amplified using multiplexed PCR with 98 × ~400 bp amplicons tiling the SARS-CoV-2 genome (ARTIC V3 primer set). Amplification was performed with Q5 Hotstart DNA Polymerase (New England Biolabs) with 1.5 µL of cDNA per reaction. PCR products were cleaned using AMPure XP beads (0.8X bead ratio), quantified using a Qubit fluorometer (Thermo Fisher Scientific) and partitioned into separate aliquots for analysis by short-read and nanopore sequencing. We note that it is not possible to amplify the entire SARS-CoV-2 genome in this way, since amplicons that span boundaries of the 6 × ~5 kb IVT products necessarily fail. Nevertheless, we were able to evaluate ~95% of the SARS-CoV-2 genome sequence.

SARS-CoV-2-positive extracts from 157 cases, tested at NSW Health Pathology East Serology and Virology Division (SaViD), were retrieved from storage and included in this study. All specimens were nasopharyngeal swabs originating from patients in New South Wales during March-April 2020. Specimens underwent total nucleic acid extraction using the Roche MagNA Pure DNA and total NA kit on an automated extraction instrument (MagNA pure 96). Reverse-transcription was performed on viral RNA extracts using Superscript IV VILO Master Mix (Thermo Fisher), which contains both random hexamers and oligo-dT primers. Prepared cDNA was then amplified separately with each of 14 × ~2.5 kb amplicons tiling the SARS-CoV-2 genome, as described elsewhere [6]. Amplification was performed with Platinum SuperFi Green PCR Mastermix (Thermo Fisher) with 1.5 µL of cDNA per reaction. PCR products were cleaned using AMPure XP beads (0.8X bead ratio) and quantified using PicoGreen dsDNA Assay (Thermo Fisher). All 14 amplicon products from a given sample were then pooled at equal abundance and partitioned into separate aliquots for analysis by short-read and nanopore sequencing. This strategy ensured that any sequence artefacts potentially introduced during reverse-transcription and/or PCR amplification were common to matched ONT/Illumina samples, so would not be interpreted as false-positives/negatives during technology comparison.

Pooled amplicons were prepared for short-read sequencing using the Illumina DNA Prep Kit, according to the manufacturer's protocol. Samples were multiplexed using Nextera DNA CD Indexes and sequenced on an Illumina MiSeq. Within each sequencing lane, a blank sample was also prepared and sequenced, in order to monitor for contamination and/or index swapping between samples. The resulting reads were aligned to the Wuhan-Hu-1 reference genome (MN908947.3) using bwa mem (0.7.12-r1039) [36]. Primer sequences were trimmed from the termini of read alignments using iVar (1.0) [37]. Trimmed alignments were converted to pileup format using samtools mpileup (v1.9) [38], with anomalous read pairs retained (--count-orphans), base alignment quality disabled (--no-BAQ) and all bases considered, regardless of PHRED quality (--min-BQ 0). Variants were identified using bcftools call (v1.9) [38], assuming a ploidy of 1 (--ploidy 1), then filtered for a minimum read depth of 30 and minimum quality of 20. Variants were classified according to their read-count frequencies as consensus (>80% reads supporting the variant) or sub-consensus (20-80%) variants, with the latter further divided into high (60-80%), intermediate (40-60%) or low-frequency (20-40%). Variants at read-count frequencies below 20% were considered to be potentially spurious and excluded on this basis.

ARTIC amplicons (~400 bp) from the synthetic RNA controls were prepared for nanopore sequencing using the ONT Native Barcoding Expansion kit (EXP-NBD104). The longer amplicons (~2.5 kb) used on SARS-CoV-2 patient specimens were prepared for nanopore sequencing using the ONT Rapid Barcoding Kit (SQK-RBK004). Both kits were used according to the manufacturer's protocol. Up to twelve samples were multiplexed on a FLO-FLG001, FLO-MIN106D or FLO-PRO002 flow-cell and sequenced on a GridION X5 or PromethION P24 device, respectively.
In addition, a no-template negative control from the PCR amplification step was prepared in parallel and sequenced on each flow-cell (Supplementary Table 6). The RAMPART software package [39] was used to monitor sequencing performance in real-time, with runs proceeding until a minimum ~200-fold coverage was achieved across all amplicons. At this point, the run was terminated and the flow-cell washed using the ONT Flow Cell Wash kit (EXP-WSH003), allowing re-use in subsequent runs. The resulting reads were basecalled using Guppy (4.0.14) and aligned to the Wuhan-Hu-1 reference genome (MN908947.3) using minimap2 (2.17-r941) [40]. The ARTIC tool align_trim was used to trim primer sequences from the termini of read alignments and cap sequencing depth at a maximum of 400-fold coverage. Consensus-level variant candidates were identified using each of two workflows developed by ARTIC (https://github.com/artic-network/artic-ncov2019), using Nanopolish [41] or Medaka (0.11.5) to call variants, respectively. Nanopolish variant candidates were filtered directly with the ARTIC artic_vcf_filter tool, while Medaka candidates were evaluated by LongShot (0.4.1) [42] before filtering. Sub-consensus level variant candidates were identified using Varscan2 (v2.4.3) [43].

For synthetic RNA controls, read-level quality metrics, such as sequencing error rates, were derived from read alignments using pysamstats, with any bases that differed from the Wuhan-Hu-1 reference sequence considered errors.

The accuracy of variant detection by ONT sequencing was evaluated by comparison to the set of variants identified by Illumina sequencing in matched cases. To ensure consistent representation of variants across calls generated by different programs: (i) multi-allelic variant candidates were separated into individual SNVs/indels using bcftools norm (1.9) [38]; (ii) multi-nucleotide variants were decomposed into their simplest set of individual components using rtg-tools vcfdecompose (3.10.1); and (iii) indels at simple repeats were left-aligned using gatk LeftAlignAndTrimVariants (4.0.11.0). Variant candidates identified by Illumina/ONT could then be considered concordant based on matching genome position, reference base and alternative base/s. For a given case, variant candidates identified with both ONT and Illumina were classified as true-positives (TPs), candidates identified by ONT but not Illumina as false-positives (FPs) and candidates identified by Illumina but not ONT as false-negatives (FNs). The following statistical definitions were used to evaluate results:

Sensitivity = TP / (TP + FN)
Precision = TP / (TP + FP)
Jaccard similarity = TP / (TP + FP + FN)
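The comparison described above reduces to set operations on normalised variant calls. The following is a minimal sketch, assuming each call is represented as a (position, reference base, alternative base) tuple after normalisation. The function names are hypothetical and this is not the authors' evaluation code.

```python
def count_outcomes(ont_calls, illumina_calls):
    """Return (TP, FP, FN) counts, treating Illumina calls as the comparison set."""
    ont = set(ont_calls)
    illumina = set(illumina_calls)
    tp = len(ont & illumina)   # called by both platforms
    fp = len(ont - illumina)   # called by ONT only
    fn = len(illumina - ont)   # called by Illumina only
    return tp, fp, fn


def sensitivity(tp, fn):
    return tp / (tp + fn) if (tp + fn) else float("nan")


def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else float("nan")


def jaccard(tp, fp, fn):
    return tp / (tp + fp + fn) if (tp + fp + fn) else float("nan")


# Hypothetical calls for one sample: (position, ref, alt).
illumina_calls = [(241, "C", "T"), (3037, "C", "T"), (23403, "A", "G")]
ont_calls = [(241, "C", "T"), (23403, "A", "G"), (11083, "G", "T")]

tp, fp, fn = count_outcomes(ont_calls, illumina_calls)
print(tp, fp, fn)
print(sensitivity(tp, fn), precision(tp, fp), jaccard(tp, fp, fn))
```

As a check against figures reported in the text, the sub-consensus SNV comparison (119 detected of 156 Illumina SNVs, with 25 false-positives) gives TP = 119, FN = 37 and FP = 25, so sensitivity(119, 37) and precision(119, 25) evaluate to approximately 0.763 and 0.826, matching the stated 76.3% and 82.6%.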

To identify structural variation, nanopore reads were re-aligned to the Wuhan-Hu-1 reference genome (MN908947.3) using the rearrangement-aware aligner NGMLR (v0.2.7) [44]. Sniffles (v1.0.11) [44] was then used to detect candidate variants with a minimum length of 10 bp and ≥20 supporting reads. To validate SVs detected with ONT alignments, split short-read alignments and discordant read-pairs were extracted from matched Illumina libraries using lumpy [45]. Variant candidates were then manually inspected to verify evidence from ONT and short-reads and assess break-point position resolution.
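The two candidate thresholds stated above (minimum length of 10 bp and at least 20 supporting reads) can be illustrated as a filter over candidate records. This is a minimal sketch only: in the study these thresholds were applied by Sniffles itself, and the record type, field names and example values below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SvCandidate:
    """Hypothetical record for a candidate deletion."""
    start: int             # first deleted reference position
    end: int               # position just past the deletion
    supporting_reads: int  # reads whose alignments support the deletion

    @property
    def length(self) -> int:
        return self.end - self.start


def passes_thresholds(candidate, min_length=10, min_support=20):
    """Apply the length and read-support thresholds described in the text."""
    return candidate.length >= min_length and candidate.supporting_reads >= min_support


# Hypothetical candidates.
candidates = [
    SvCandidate(start=27900, end=28228, supporting_reads=150),  # 328 bp, well supported
    SvCandidate(start=21990, end=21996, supporting_reads=300),  # 6 bp, too short
    SvCandidate(start=500, end=560, supporting_reads=5),        # 60 bp, too few reads
]

retained = [c for c in candidates if passes_thresholds(c)]
print(retained)
```

Candidates that pass such thresholds would still require orthogonal support (here, split or discordant short-read alignments) and manual inspection, as described in the text.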

All software used in this study is publicly available and generally open source. Full descriptions, including parameters and version numbers, are provided in the Materials & Methods section, and further detail on the bioinformatics protocols can be found at: https://github.com/Psy-Fer/SARS-CoV-2_GTG

Raw data for SARS-CoV-2 whole genome sequencing experiments (ONT and Illumina) have been deposited to the Sequence Read Archive under Bioproject PRJNA651152. 

A new coronavirus associated with human respiratory disease in China

A Novel Coronavirus from Patients with Pneumonia in China

Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding

Comparison of seven commercial RT-PCR diagnostic kits for COVID-19

A dynamic nomenclature proposal for SARS-CoV-2 to assist genomic epidemiology

An emergent clade of SARS-CoV-2 linked to returned travellers from Iran

Coast-to-coast spread of SARS-CoV-2 during the early epidemic in the United States

Introductions and early spread of SARS-CoV-2 in the New York City area

Spread of SARS-CoV-2 in the Icelandic population

Genomic epidemiology of SARS-CoV-2 in Guangdong province

Revealing COVID-19 transmission in Australia by SARS-CoV-2 genome sequencing and agent-based modeling

Tracking changes in SARS-CoV-2 Spike: evidence that D614G increases infectivity of the COVID-19 virus

The impact of mutations in SARS-CoV-2 spike on viral infectivity and antigenicity

SARS-CoV-2/COVID-19: Viral Genomics, Epidemiology, Vaccines, and Therapeutic Interventions

Effects of a major deletion in the SARS-CoV-2 genome on the severity of infection and the inflammatory response: an observational cohort study

The Oxford Nanopore MinION: delivery of nanopore sequencing to the genomics community

Rapid draft sequencing and real-time nanopore sequencing in a hospital outbreak of Salmonella

Real-time, portable genome sequencing for Ebola surveillance

Multiplex PCR method for MinION and Illumina sequencing of Zika and other virus genomes directly from clinical samples

Rapid implementation of SARS-CoV-2 sequencing to investigate cases of health-care associated COVID-19: a prospective genomic surveillance study

Rapid, sensitive, full-genome sequencing of Severe Acute Respiratory Syndrome Coronavirus 2

Genetic structure of SARS-CoV-2 reflects clonal superspreading and multiple independent introduction events

From squiggle to basepair: computational approaches for improving nanopore sequencing read accuracy

Evaluation of Oxford Nanopore's MinION sequencing device for microbial whole genome sequencing applications

Assessing the performance of the Oxford Nanopore Technologies MinION

Phylodynamic Analysis | 176 genomes | 6

Reference standards for next-generation sequencing

Whole genome deep sequencing of HIV-1 reveals the impact of early minor variants upon immune recognition during acute infection

Quasispecies diversity determines pathogenesis through cooperative interactions in a viral population

The evolutionary pathway to virulence of an RNA virus

Shared genomic variants: identification of transmission routes using pathogen deep-sequence data

Bayesian reconstruction of transmission within outbreaks using genomic variants

PHYLOSCANNER: Inferring transmission from within- and between-host pathogen genetic diversity

SARS-CoV-2 genomic surveillance in Taiwan revealed novel ORF8-deletion mutant and clade possibly associated with infections in Middle East

Genopo: a nanopore sequencing analysis toolkit for portable Android devices

Fast and accurate short read alignment with Burrows-Wheeler transform

An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar

A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data

RAMPART: a workflow management system for de novo genome assembly

Minimap2: pairwise alignment for nucleotide sequences

Detecting DNA cytosine methylation using nanopore sequencing

Longshot enables accurate variant calling in diploid genomes from single-molecule long read sequencing

VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing

Accurate detection of complex structural variations using single-molecule sequencing

LUMPY: a probabilistic framework for structural variant discovery