key: cord-0871979-7he4wd1c authors: Brejova, B.; Borsova, K.; Hodorova, V.; Cabanova, V.; Gafurov, A.; Fricova, D.; Nebohacova, M.; Vinar, T.; Klempa, B.; Nosek, J. title: Nanopore Sequencing of SARS-CoV-2: Comparison of Short and Long PCR-tiling Amplicon Protocols date: 2021-05-13 journal: nan DOI: 10.1101/2021.05.12.21256693 sha: d6ed7a312c297f80ed272e37583345d5e366a18e doc_id: 871979 cord_uid: 7he4wd1c

Surveillance of SARS-CoV-2 variants, including the quickly spreading mutants, by rapid and near real-time sequencing of the viral genome provides an important tool for effective health policy decision making in the ongoing COVID-19 pandemic. Here we evaluated PCR-tiling of short (~400-bp) and long (~2 and ~2.5-kb) amplicons combined with nanopore sequencing on a MinION device for analysis of the SARS-CoV-2 genome sequences. Analysis of several sequencing runs demonstrated that using the long amplicon schemes outperforms the original protocol based on the 400-bp amplicons. It also illustrated common artefacts and problems associated with this approach, such as uneven genome coverage, a variable fraction of discarded sequencing reads, and reads derived from the viral sub-genomic RNAs and/or human and bacterial contamination.

[...] and those evading the immune response (10, 11)). This underscores the importance of genomic epidemiology, although the elucidation of direct links between particular mutation(s) and the virus spreading or clinical implications still represents a challenging task (12-22). Nanopore sequencing of tiled PCR-generated amplicon pools represents a powerful tool for investigating viral genomes. The protocol was developed by the Artic Network (https://artic.network/) for sequencing of Ebola, Zika, and Chikungunya genomes (23, 24).
In January 2020, the original protocol was promptly adjusted for rapid sequence determination of SARS-CoV-2 RNA prepared directly from clinical samples such as nasopharyngeal or oropharyngeal swabs. Additional studies described its modifications, including alternative primer schemes, different amplicon sizes, and different sequencing chemistries (25-36). Further improvements simplified the sequencing library preparation, shortened hands-on time, and increased sample multiplexing (up to 96), which decreased the reagent costs to about £10 per sample and made this approach affordable for epidemiologic surveillance of the pandemic (36). Importantly, a rigorous comparison of nanopore sequencing with Illumina short-read technology demonstrated that, in spite of relatively high error rates in individual nanopore reads, highly accurate consensus single nucleotide variant (SNV) calling with >99% sensitivity and >99% precision can be achieved with a minimum of about 60-fold coverage (37).

In this study, we compare the performance of several PCR-tiling based protocols, which were evaluated as part of our efforts to sequence isolates of SARS-CoV-2 from Slovakia collected between March 2020 and March 2021. Using the generated sequence data, we investigate the nature of common problems and artefacts associated with this approach. We compared the sequencing results obtained from libraries containing multiplexed barcoded SARS-CoV-2 samples made of ~400-bp, ~2-kb, and ~2.5-kb long overlapping amplicon pools, as well as from the combination of short and long amplicons. Our results show that sequencing of [...]

medRxiv preprint, https://doi.org/10.1101/2021.05.12.21256693; this version posted May 13, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
[...] generating either ~400-bp (Artic Network version V3, https://github.com/artic-network/artic-ncov2019), ~2-kb (35), or ~2.5-kb long amplicons (27). In some experiments, we loaded the same sequencing libraries onto both the standard and Flongle flow cells. This allowed us to compare and evaluate the MinION runs and to analyze the problems affecting the performance of nanopore sequencing.

To compare primer sets for short and long amplicons and sequencing devices, three different batches (UKBA-2, UKBA-3 and UKBA-4 in Table 1) consisting of 10-12 multiplexed samples were sequenced using multiple strategies. Sequencing of a mixture of longer and shorter amplicon pools provided results comparable to sequencing longer amplicons alone, perhaps because the mixture was enriched in the long amplicons. Finally, the Flongle and standard flow cells are similarly successful at comparable sequencing volumes. However, there are two disadvantages to using the Flongle flow cells. First, the Flongle cannot be washed and reused; its entire capacity is used for a single experiment. Second, since there is a large variance in the amount of data produced by a single Flongle flow cell (in our experiments, the number of active pores in Flongles ranged between 18 and 67 and the yield between 110 and 830 Mbp; see Table 1), the capacity may be insufficient to completely recover sequences of 10 or more multiplexed samples. We consider it an important advantage that the runs using the standard flow cells can be terminated when
sufficient data is collected, and thus these flow cells can be reused in further experiments after washing with a nuclease-containing buffer (i.e., EXP-WSH003 or EXP-WSH004). Moreover, the standard flow cells allow simultaneous sequencing of a greater number of barcoded samples with a longer run.

[...] variables, namely the Cq values, amplicon concentration, and RNA sample storage time prior to amplification. Although the expected trends are in some cases observable, they are not followed universally.

When the Artic pipeline is used for further analysis, sequencing reads must first pass a series of filters to ensure no barcode bleeding and to remove possible contamination. The number of reads passing these filters and used for the identification of variants in the final step of the pipeline varied between runs. In our experiments, their fraction ranged between 14 and 55% (Fig 2A). The majority of failed reads are discarded because of low quality or incompleteness (groups A-C); these reads comprise 41-78% of all reads. While there are no clear differences between short and long amplicon protocols, with 2-kb amplicons these low-quality reads are apparently more prevalent in the Flongle runs than in the runs on standard flow cells. Interestingly, in some runs, up to 6% of reads that pass the base quality filters do not map to the target reference genome. In particular, four samples in batch UKBA-2 of the 2-kb amplicon run (barcodes 02, 07, 08 and 11) have a very high fraction of non-target reads (Fig 2B).
The majority (82-96%) of these reads map to the human genome, and a smaller fraction (0.3-9%) map to bacterial genomes, including species colonizing the human oral cavity and respiratory tract (e.g., Actinomyces graevenitzii, Haemophilus parainfluenzae, Leptotrichia spp., Prevotella spp., Pseudomonas aeruginosa, Rothia mucilaginosa, Streptococcus pneumoniae, S. mitis, S. parasanguinis, S. salivarius, Tannerella forsythia, Veillonella parvula). All four samples showed a lower viral load (i.e., Cq value > 30) in RT-qPCR assays, and the amplification in the PCR-tiling protocol resulted in lower product yield. Human and bacterial reads represent artefacts apparently resulting from non-specific amplification of contaminating nucleic acids present in clinical samples.

We have also observed that some amplicons originate from sub-genomic RNAs that copurify with the SARS-CoV-2 genomic RNA. It has been demonstrated that the amount of sub-genomic RNAs correlates with the disease severity. As these molecules are strongly repressed in asymptomatic patients (38), their proportion in the sequencing data can serve as a molecular marker. The most abundant reads are derived from the N mRNA (39). The sub-genomic RNAs are generated in the process of virus replication/transcription (5) and start with a leader sequence originating from the untranslated 5' end of the viral genome, followed by a downstream sequence containing a particular open reading frame. The leftmost primer in both the 400-bp and 2-kb primer sets investigated in this study is contained within the leader sequence. This facilitates amplification of sub-genomic RNAs with appropriate right primers (Fig 3).
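
The effect of a left primer located within the leader can be illustrated with a short calculation. The following Python sketch is only an illustration under stated assumptions: the junction coordinates, the primer positions, and the names Junction, sgrna_amplicon_length, and passes_length_filter are hypothetical and are not taken from the primer schemes used in this study. It computes the length of the product expected when the leftmost primer pairs with a right primer located in the body of a sub-genomic RNA, and compares it with the 1500-3000 read-length window reported for the 2-kb amplicons in the Data processing section.

```python
from typing import NamedTuple


class Junction(NamedTuple):
    """Leader-to-body junction of one sub-genomic RNA (1-based genome coordinates)."""
    name: str
    leader_end: int   # last genomic position copied from the 5' leader
    body_start: int   # first genomic position of the downstream body


def sgrna_amplicon_length(left_primer_start: int, right_primer_end: int,
                          junction: Junction) -> int:
    """Length of the PCR product amplified from a sub-genomic RNA template.

    The product consists of the leader part (left primer start to junction)
    and the body part (junction to right primer end); the genomic region
    between leader_end and body_start is absent from the template.
    """
    if left_primer_start > junction.leader_end:
        raise ValueError("left primer must lie within the leader")
    if right_primer_end < junction.body_start:
        raise ValueError("right primer lies upstream of the sgRNA body")
    leader_part = junction.leader_end - left_primer_start + 1
    body_part = right_primer_end - junction.body_start + 1
    return leader_part + body_part


def passes_length_filter(length: int, min_len: int, max_len: int) -> bool:
    """Inclusive read-length window, in the spirit of a length filter."""
    return min_len <= length <= max_len


# Illustrative (hypothetical) coordinates, NOT the real primer positions.
EXAMPLE_JUNCTION = Junction("N mRNA (illustrative)", leader_end=70, body_start=28260)
LEFT_PRIMER_START = 30
NEAR_RIGHT_PRIMER_END = 28700   # right primer close to the junction
FAR_RIGHT_PRIMER_END = 29800    # right primer further downstream

short_product = sgrna_amplicon_length(LEFT_PRIMER_START, NEAR_RIGHT_PRIMER_END,
                                      EXAMPLE_JUNCTION)
long_product = sgrna_amplicon_length(LEFT_PRIMER_START, FAR_RIGHT_PRIMER_END,
                                     EXAMPLE_JUNCTION)
```

With these illustrative coordinates, the first product is 482 bases long and falls below the 1500-3000 window, whereas the second is 1582 bases long and passes it. This mirrors the two outcomes described in the text: some sub-genomic amplicons are discarded as too short, while others are retained and add coverage only to the region contained in the sub-genomic RNA.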
Table 2 lists the fraction of selected sub-genomic RNAs among reads that could be aligned to the SARS-CoV-2 genome. These fractions are relatively low, with the remaining sub-genomic RNAs being even rarer. However, the fractions vary among the samples. In the UKBA-2 run with 2-kb amplicons, the highest fraction of 14.3% was observed for the N mRNA in barcode 07, and a fraction of 7.5% was observed for the ORF3a mRNA in barcode 11. Some of these sub-genomic amplicons are discarded from the analysis as too short, while others lead to lower coverage in areas not covered by the sub-genomic RNA (Fig 4).

From these pilot experiments, we conclude that even though 400-bp amplicons have a lower percentage of discarded reads (Fig 2), they produce fewer finished sequences at a comparable overall amount of sequence data (Fig 1). The reason is a very uneven coverage of individual amplicons (Fig 4). This is observed with both sets of primers, but for the 400-bp amplicons we see a much lower coverage in the worst covered regions (Fig 5). Additional sequencing runs (UKBA-6, UKBA-10, UKBA-11, and UKBA-12) were performed with long 2-kb amplicons on standard MinION flow cells with similar results (Fig 2A; S2 Fig).

[...] also tested the 2.5-kb primer panel (27).
Except for the leftmost primer, the primer positions in this panel differ from those of the 2-kb scheme. We performed two sequencing runs with the 2.5-kb primer set (UKBA-19, UKBA-21). In the first experiment, we noticed an almost complete drop of coverage in the last amplicon derived from the 3' end of the genome; for the second experiment, we replaced the primers for the rightmost amplicon with the rightmost primers from the 2-kb panel, which mitigated the issue. When the coverage of individual amplicons is compared between the 2-kb and 2.5-kb schemes (Fig 6), the coverage in the 2.5-kb scheme indeed appears to be more even. However, we also noticed a higher percentage of failed reads, with only 24% (UKBA-19) and 16% (UKBA-21) of reads passing all filters and being usable for variant identification. Further analysis revealed a notable increase in single-barcode reads (group B) and shorter than expected reads (group E), pointing to difficulties in amplifying and sequencing longer fragments. More experiments are required to determine whether the 2.5-kb scheme yields more fully assembled genomes than the 2-kb scheme.

Interestingly, PCR-tiling protocols were also able to pick up sub-genomic RNA transcripts, and the proportion of these transcripts varied between samples. Since increased levels of sub-genomic transcripts are correlated with severe cases of COVID-19, these protocols could be optimized to detect the levels of sub-genomic transcripts more accurately and used as a biomarker for disease severity.
It is evident that effective epidemiologic surveillance of the pandemic is strongly dependent on systematic sequencing of SARS-CoV-2 isolates. The MinION platform from Oxford Nanopore Technologies is one of the most powerful and versatile means for acquisition of viral sequences. Yet, as demonstrated in this study, the pros and cons of a particular protocol must be taken into account to ensure that the sequencing results will be of the highest quality, which is an essential prerequisite for their utility in fighting the pandemic.

[...] Table) were [...]

For sequencing on the Flongle flow cells (FLO-FLG001), the library preparation was the same, except that one third to one half of the library was loaded compared to the amount used for the standard flow cell.

Data processing. Nanopore sequencing data were base called and demultiplexed using Guppy v.3.4.4. Variant analysis was performed using the Artic analysis pipeline v.1.1.3 (https://artic.network/ncov-2019/ncov2019-bioinformatics-sop.html) with recommended settings.
Minimum and maximum read lengths in the Artic guppyplex filter were set to 350 and 619 for the 400-bp amplicons and to 1500 and 3000 for both the 2-kb and 2.5-kb amplicons. For batch UKBA-2, the final sequences were produced by first combining sequencing reads from both standard and Flongle runs with the same primer set and running the Artic pipeline. Subsequently, the results for the two primer sets were combined so that regions sufficiently covered by at least one amplicon set were considered finished. The same process was used in batch UKBA-3, but only data from standard flow cells were available. Subsequent batches were based on 2-kb or 2.5-kb amplicons sequenced on a standard flow cell.

To compare different primer sets and sequencing devices, reads were also demultiplexed at the less strict default Guppy settings and aligned to various reference genomes by minimap2 v. 2.13-r852-dirty (41). Reference genomes include the SARS-CoV-2 genome MN908947.3 (1), the human genome version hg19 downloaded from the UCSC genome browser (40), and the database for bacterial species typing included in the Japsa software (42). To detect sub-genomic RNAs, reads were aligned to transcripts downloaded from the UCSC genome browser by minimap2 and classified as sub-genomic if the alignment to a sub-genomic RNA had at least 5 more matches than the best alignment to the reference genome. Read coverage was computed using the genomecov tool from BEDTools (43) with options -bga -split. To compare the results for various sequencing data volumes, reads were ordered by the sequencing finish time and the initial portion with the desired total length was selected and [...]
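
The rule for classifying sub-genomic reads and the selection of an initial portion of reads can be sketched in Python as follows. This is a minimal illustration, not the code used in this study: the Read record, its field names, and the function names are hypothetical; match counts are assumed to be already extracted from the alignments (for example, from the number of residue matches reported in minimap2 PAF output); and the read that crosses the target volume is assumed to be included in the selection.

```python
from typing import Iterable, List, NamedTuple


class Read(NamedTuple):
    """Minimal read record; the field names are hypothetical."""
    read_id: str
    finish_time: float  # time at which sequencing of the read finished
    length: int         # read length in bases


def is_subgenomic(sgrna_matches: int, genome_matches: int,
                  min_margin: int = 5) -> bool:
    """Classify a read as sub-genomic.

    A read is called sub-genomic if its alignment to a sub-genomic RNA has
    at least `min_margin` more matches than its best alignment to the
    reference genome. Use 0 for a missing alignment.
    """
    return sgrna_matches - genome_matches >= min_margin


def initial_portion(reads: Iterable[Read], target_bases: int) -> List[Read]:
    """Select the earliest reads until their total length reaches target_bases.

    Reads are ordered by finish time; the read that crosses the target is
    kept (an assumption, since the source does not specify this detail).
    """
    selected: List[Read] = []
    total = 0
    for read in sorted(reads, key=lambda r: r.finish_time):
        if total >= target_bases:
            break
        selected.append(read)
        total += read.length
    return selected


# Small usage example with made-up reads.
example_reads = [
    Read("a", finish_time=30.0, length=500),
    Read("b", finish_time=10.0, length=400),
    Read("c", finish_time=20.0, length=300),
]
subset = initial_portion(example_reads, target_bases=600)
subset_ids = [r.read_id for r in subset]
```

In the usage example, the reads are taken in the order b, c, a, and the selection stops after read c, when the accumulated length (700 bases) reaches the requested 600 bases. Repeating the selection with increasing values of target_bases gives nested subsets of the same run, which is what makes results at different data volumes comparable.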
The architecture of SARS-CoV-2 transcriptome
Direct RNA sequencing and early evolution of SARS-CoV-2
The landscape of SARS-CoV-2 RNA modifications
Rapid SARS-CoV-2 whole-genome sequencing and analysis for informed public health decision-making in the Netherlands
Analysis and forecasting of global RT-PCR primers for SARS
COVID-Diagnosis HCL Study Group. Two-step strategy for the identification of SARS-CoV-2 variant of concern 202012/01 and other variants with spike deletion H69-V70, France
A novel, multiplexed RT-qPCR assay to distinguish lineage B.1.1.7 from the remaining SARS-CoV-2 lineages
Persistence and evolution of SARS-CoV-2 in an immunocompromised host
RdRp mutations are associated with SARS-CoV-2 genome evolution
Emergence and spread of a SARS-CoV-2 variant through Europe in the summer of 2020
SARS-CoV-2 D614G variant exhibits efficient replication ex vivo and transmission in vivo
SARS-CoV-2 evolution during treatment of chronic infection
Recurrent emergence and transmission of a SARS-CoV-2 Spike deletion
Spike mutation D614G alters SARS-CoV-2 fitness
Circulating SARS-CoV-2 spike N439K variants maintain fitness while evading antibody-mediated immunity
Positive selection of ORF1ab, ORF3a, and ORF8 genes drives the early evolutionary trends of SARS-CoV-2 during the 2020 COVID-19 pandemic
Escape from neutralizing antibodies by SARS-CoV-2 spike protein variants
Real-time, portable genome sequencing for Ebola surveillance
Multiplex PCR method for MinION and Illumina sequencing of Zika and other virus genomes directly from clinical samples
CoronaHiT: high-throughput sequencing of SARS-CoV-2 genomes
Evaluation of NGS-based approaches for SARS-CoV-2 whole genome characterisation
SARS-CoV-2 Genome Sequencing Using Long Pooled Amplicons on Illumina Platforms
Rapid and inexpensive whole-genome sequencing of SARS-CoV-2 using 1200 bp tiled amplicons and Oxford Nanopore rapid barcoding
A rapid, cost-effective tailed amplicon method for sequencing SARS-CoV-2
Sequencing of SARS-CoV-2 genome using different Nanopore chemistries
Disentangling primer interactions improves SARS-CoV-2 genome sequencing by multiplex tiling PCR. PLoS
[...] sequencing, random hexamers, and bait capture
Rapid, sensitive, full-genome sequencing of severe acute respiratory syndrome coronavirus 2
SARS-CoV-2 genomes recovered by long amplicon tiling multiplex approach using nanopore sequencing and applicable to other sequencing platforms
Improvements to the ARTIC multiplex PCR method for SARS-CoV-2 genome sequencing using nanopore
Analytical validity of nanopore sequencing for rapid SARS-CoV-2 genome analysis
Subgenomic RNAs as molecular indicators of asymptomatic SARS-CoV-2 infection
The human genome browser at UCSC
Minimap2: pairwise alignment for nucleotide sequences
Streaming algorithms for identification of pathogens and antibiotic resistance potential from real-time MinION(TM) sequencing
BEDTools: a flexible suite of utilities for comparing genomic features