key: cord-0822619-i8nhf1g3 authors: Castañeda-Mogollón, Daniel; Kamaliddin, Claire; Fine, Laura; Oberding, Lisa K.; Pillai, Dylan R. title: SARS-CoV-2 Variant Detection with ADSSpike date: 2021-11-23 journal: Diagn Microbiol Infect Dis DOI: 10.1016/j.diagmicrobio.2021.115606 sha: 904b38474d3f1656e24b32cd4eb0c2d25487f201 doc_id: 822619 cord_uid: i8nhf1g3 The SARS-CoV-2 coronavirus pandemic has been an unprecedented challenge to global pandemic response and preparedness. With the continuous appearance of new SARS-CoV-2 variants, it is imperative to implement tools for genomic surveillance and diagnosis in order to decrease viral transmission and prevalence. The ADSSpike workflow was developed with the goal of identifying signature SNPs from the S gene associated with SARS-CoV-2 variants through amplicon deep sequencing. Seventy-two samples were properly sequenced, and 30 mutations were identified. Among those, signature SNPs were linked to two Zeta-VOI (P.2) samples and one to the Alpha-VOC (B.1.17). An average depth of 700 reads was found to properly identify all SNPs and deletions pertinent to SARS-CoV-2 mutants. ADSSpike is the first workflow to provide a practical, cost-effective, and scalable solution to diagnose SARS-CoV-2 VOC/VOI in the clinical laboratory, adding a valuable tool to public health measures to fight the COVID-19 pandemic for approximately $41.85 USD/reaction. The COVID-19 pandemic's causative agent is the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), first identified at the end of 2019 in Wuhan, China [1] . Following the appearance of mutations in specific SARS-CoV-2 genes, different viral phenotypes have emerged as the main threat to epidemic control. These viral variants, known as variants of concern (VOC), and variants of interest (VOI) are mutants that are associated with either increased transmissibility, changes in epidemiological patterns or clinical presentation, increased virulence, decreased effectiveness of public health and social measures for epidemic control, and/or decreased effectiveness of available diagnostics, vaccines, and/or therapeutics. Genomic surveillance of these mutants is essential for monitoring viral spread and implementing targeted measures to reduce VOC/VOI transmission. Current surveillance strategies rely on whole genome sequencing (WGS) for identification of VOC/VOI, or assays to monitor and detect known mutations. WGS using next generation sequencing (NGS) approaches are based on commercially available sequencing kits or international consortia-published protocols such as those from the ARTIC network [2] . The majority of the targeted assays rely on PCR amplification of a genomic region followed by amplicon Sanger sequencing [3, 4] . Other assays are based on reverse transcription quantitative polymerase chain reaction (RT-qPCR) using TaqMan probes targeting specific SNPs [5] , amplicon melting curve analysis, or high range melting (HRM) analysis [6, 7] . These targeted assays are limited to currently known SNPs related to VOC/VOI and do not have the potential for rapid identification of new mutations, especially during a pandemic where turnaround times are a critical factor for molecular surveillance. The SARS-CoV-2 S gene encodes for the surface glycoprotein of the virus which mediates viral adhesion and entry into human host cells. Mutations in the S-gene sequence in VOC/VOI have resulted in amino acid changes in the S protein that affect the binding to the human cell receptor ACE [8, 9] , antibody recognition, vaccine efficacy, and antibody therapy efficacy [8] . In addition, this gene represents an attractive target for SARS-CoV-2 VOC/VOI identification, as signature SNPs can be easily identified in this region through amplicon deep sequencing (ADS). Here, a novel workflow using ADS of the S gene (ADSSpike) was implemented to provide a scalable, high throughput, and unbiased identification workflow for VOC/VOI from clinical SARS-CoV-2 samples. In addition, ADSSpike simplifies the ARTIC Illumina V3 protocol [2] by incorporating Illumina overhangs into PCR primers targeting 400 bp regions spanning the SARS-CoV-2 S gene, with the aim of reducing PCR cycles and thus chimera formation [10, 11] . To our knowledge, ADSSpike is one of the first workflows to identify SNPs by employing a SARS-CoV-2 S gene-targeted amplicon deep sequencing approach across SARS-CoV-2 specimens. A previous pipeline discusses a similar methodology; however several pitfalls compared to this pipeline were observed [12] . Ethical approval was obtained from the Conjoint Health Research Ethics Board (CHREB) of the University of Calgary (REB 20-0567, REB 20-0402). All archived specimens were de-identified prior to analysis in this study. Clinical nasopharyngeal swab (NPS) and throat swab (TS) specimens (n=72) were collected and tested for SARS-CoV-2 by Alberta Precision Laboratories (APL) between March 2020 and February 2021 as previously described [13] . RNA was extracted from 90 to 120µL of sample using the Qiagen QIAamp® Viral RNA Mini Kit (Cat. No/ID 52906, Qiagen, USA), following the manufacturer's protocol with slight modifications. Briefly, samples were digested with proteinase K for 10 minutes at 56 o C prior to extraction, and treated with DNAse I (Promega) to remove any remaining DNA. Obtained RNA was eluted in two centrifugation rounds of 40 µL of nuclease-free water (NFW). The ADS pipeline ( Figure 1 ) was designed to optimize sample preparation time. Briefly, extracted RNA was subjected to cDNA synthesis before PCR amplification. Fourteen pairs of primers spanning the S gene and its adjacent regions, generating ~400 bp amplicons, were designed and adapted from published protocols [2] (Supplementary table 1 ). The PCR amplification step was designed to include the Illumina adapters [14] with each primer pair, thus allowing PCR products to move directly to library preparation and sequencing while reducing PCR artifacts and chimeric reads. Illumina adaptors fused to the locus-specific primers is a strategy that has been used in the past from previous protocols was adapted from current protocols for SNP calling and haplotype screening [15] . The performance of a pooling strategy was evaluated, comparing the sequencing results from PCR conducted either individually (iPCR, conducting 14 individual PCR reactions each yielding 400 bp amplicons), or as a pooled PCR (pPCR) strategy (one-pot PCR reaction using the 14 primer pairs, generating mixed 400 bp amplicons from each primer set). For each sample, the iPCR products were pooled prior to purification and subsequent analysis (Supplementary figure 1). The cDNA synthesis and PCR amplification of the SARS-CoV-2 S gene are detailed in the Supplementary data. Briefly, the cDNA synthesis was performed using the LunaScript® RT Supermix (New England Biolabs, #E3010L) following the manufacturer's protocol. The S gene and its adjacent regions were amplified using 14 primer pairs, followed by a single PCR run and SPRIselect beads purification (Beckman Life Science, USA). Three NFW negative controls were used to assess potential amplicon contamination in the lab, as well as to help tuning the parameters for SNP calling. Fifteen µL of pooled purified amplicons per sample was sent to the Center for Health Genomics and Informatics (CHGI) at the University of Calgary for library preparation and sequencing. Each sample was indexed using the Nextera XT Index Kit V2 along with KAPA HiFi polymerase. The indexed libraries were pooled and sequenced in an Illumina MiSeq instrument (San Diego, CA) in paired-end mode (2x 250 bp), using the Illumina MiSeq Reagent Kit V2 Nano (500 cycles), for a total of 1 million reads, with a 5% spike of Enterobacteria phage PhiX control. Samples were submitted to the IDseq pipeline for adapter and primer trimming, removal of low-quality reads, and removal of host sequences [16] . Reads were mapped using the Burrows Wheeler Aligner (BWA) [17] against a SARS-CoV-2 reference genome (GenBank MN908947.3 [1] ). Samtools was used for file indexing and sorting [18] , followed by FreeBayes [19] to perform SNP calling and insertion/deletion (indel) identification. To determine which parameters are best to identify SARS-CoV-2 VOC/VOI, a combination of parameters were employed and the positive predictive value (PPV) along with analytical sensitivity were computed. For this a depth of coverage (number of reads supporting a SNP or deletion) between 5 to 40 reads was tested, along with a fraction depth between 15% to 90% (fraction of reads supporting the SNP or deletion); a Phred score of 20 was maintained across all parameter combinations. Assessment details of the PPV and sensitivity are available in the Supplementary data. Read depth and coverage, defined as the percentage of the S gene covered by at least one read, were assessed by BEDTools [20] and visualized with Tablet [21] . Descriptive statistics regarding depth and coverage were performed by including the mean ± standard deviation. Non-parametric Mann-Whitney tests were performed across the continuous variables, and a Fisher's exact test was performed across categories with discrete variables. A non-parametric Kruskal-Wallis test followed by multiple-paired comparisons were carried out to determine difference in significance amongst coverage and Ct value groups. All statistical tests were considered significant for adjusted p-values below 0.05. The statistical analysis and figure design were performed using GraphPad Prism for Mac [22]. Seventy-two samples were selected and tested with ADSSpike (68 were positive for SARS-CoV-2, and 4 were negative). This sample size was selected to aim for an average of table 2 ). Nevertheless, when employing a read depth of 40% and a fraction depth of 60% and 70%, two and three false negatives were observed, decreasing the sensitivity to 98.29% and 97.43%, respectively. This revealed that a read depth of 10 and a depth fraction of 50% had the optimum values for SNP calling and SARS-CoV-2 identification, and were therefore used for variant analysis. The previously sequenced VOC/VOI samples, P739 (Alpha/B. sample only showed the V1176F VOC-flagged mutant, suggesting a minimum average read depth of 700 may be required to identify all SNPs pertinent to VOC/VOI. A depth of 97 was found to be the lowest acceptable for accurate SNP calling, while a depth of 536 was the highest observed for missing a SNP (Figure 2d) . We also observed that 55/76 (72.36%) samples had a 50% coverage, a requirement for submission to the GISAID database [23] Sixty-eight samples from SARS-CoV-2 positive patients were successfully sequenced. The negative samples did not display any SNPs (coverage=33.18%±11.55%; depth: (Figure 3d ). In addition, four independent samples had the A21583G synonymous mutant encoding for L7L in the S gene, a variant that has not been recorded in the literature. An alternative strategy was evaluated in order to test the efficacy of a simplified experimental workflow (running one iPCR for each of the 14 primer pairs vs. pPCR with 14 primers pairs). A highly significant difference between the aligned read depths was observed, with a mean depth of 560±237.88 for the iPCR samples and 312.87±372.72 for the pPCR samples (p-value<0.0001; paired t-test). On the three samples tested with the two approaches, a total of five mutations were identified in the S gene, with a positive percent agreement of 80%. One mutant was missed (Δ144/145) and had a read depth of 422 at the flanking regions of the indel by the iPCR approach (Supplementary table 3 ). The close monitoring of SARS-CoV-2 mutants is imperative, especially for the continuous development of therapeutics, vaccines, and phenotypical. This study reports the development and of ADSSpike, a practical workflow for selective ADS of the SARS-CoV-2 S gene. The overall workflow is an adaptation of the international ARTIC V3 consortium protocol [2] and similar to the HiSpike pipeline [12] . One key difference to HiSpike is the use of its relaxed parameters for SNP calling. Indeed, screening for SNPs pertinent to SARS-CoV VOC/VOI with the parameters employed by HiSpike (read depth of 15 ), a PPV below 25% was observed in each case VOC/VOI analyzed (Table 2 ). In addition, the HiSpike workflow has no mention of the fraction of reads employed for base calling, a parameter of uttermost importance in the identification of multiple SARS-CoV-2 strains in a single host. Despite mentioning HiSpike as a cost-effective tool, no cost description is provided, which adds uncertainty of the Targeted assays are commercially available but only cover existing mutations, meaning new SNPs can be missed [24] . A comparison of cost, time, and overall advantages and limitations of SARS-CoV-2 VOC/VOI methods was performed across current methodologies (Supplementary table 4). SARS-CoV-2 VOC detection using WGS by capillary sequencing or NGS has been widely used for surveillance [25] [26] [27] [28] . In contrast to WGS, amplicon deep sequencing focuses on a specific gene or genomic region, providing a greater sequencing depth and coverage while reducing costs and data burden. Sanger-based approaches have been previously employed [29, 30] . These are either focused on a small window in the S gene, and hence missing relevant SNPs flanking said region, or target SNPs prevalent to specific VOC/VOIs. While SNPs can be identified using Sanger sequencing, mutations present in a minority of the sample population sequenced can be missed [31, 32] . ADSSpike data analysis employs thresholds based on high-quality reads, Phred score, coverage, and cut-offs of 50% of reads supporting the SNPs, as previously described [33] [34] [35] [36] [37] . Careful evaluation and SNP screening showed these parameters to accurately call signature SNPs pertinent to VOC/VOIs while minimizing the number of false negatives. Whilst employing these parameters for variant calling, a total of 15 SARS-CoV-2 positive samples did not show any SNPs, suggesting either an identical gene to the reference employed, or lack of depth to detect mutations. Because of this, the read depth analysis performed across SARS-CoV-2 positive samples suggests that a read depth below 170 is not likely to detect any SNPs. The previous analysis here presented, suggests an average of 700 reads spanning the S gene as a minimum to detect SNPs and correlate them to SARS-CoV-2 variants. Moreover, a workflow is proposed that could increase the likelihood of detecting SNPs, and potentially a VOC/VOI, based on the initial average read depth and coverage (Supplementary figure 2) . A limitation of the study is that the recommended 700 read depth described is based on a small sample size (3 VOC/VOI samples) and is therefore subject to a potential selection bias. To summarize, ADSSpike is one of the first workflows to target the SARS-CoV-2 S gene for VOC-VOI detection by NGS amplicon deep sequencing. Additionally, ADSSpike was able to detect the A21583G SNP in four independent samples, a mutation that has not been recorded in the literature. This suggests that the presented workflow can identify SNPs in particular population clusters that could be assessed for increased risk, spread, and overall transmissibility. Moreover, the method here proposed can be used as a faster and more feasible approach than WGS, with even greater potential possible through use of a one pot primer-pooled approach. The emergence of new pandemic variants calls for molecular surveillance tools. The ADSSpike workflow provides a cost-effective, scalable and practical solution for SARS-CoV-2 variant identification, including VOC/VOIs. Additionally, the experimental design has been validated with the proper controls and parameter tuning to increase its reliability for SNP calling. The primer mixture approach to amplify the S gene may reduce the time to SNP detection, which is imperative during a pandemic. The data that supports the results presented here are available upon reasonable request. This work was funded by Canadian Institutes of Health Research (NFRFR-2019-00015) and Genome261 Canada (Alberta, GC 2020-03-6). A new coronavirus associated with human respiratory disease in China COVID-19 ARTIC v3 Illumina library construction and sequencing protocol 2020 P1 variants and key amino acid mutations at the Spike gene identified using Sanger protocols Sanger Sequencing Solutions for SARS-CoV-2 Research -CA n.d Multi-site Evaluation of SARS-CoV-2 Spike Mutation Detection Using a Multiplex Real-time RT-PCR Assay A Simple RT-PCR Melting temperature Assay to Rapidly Screen for Widely Circulating SARS Genotyping of the Major SARS-CoV-2 Clade by Short-Amplicon High-Resolution Melting (SA-HRM) Analysis Novel SARS-CoV-2 variants: the pandemics within the pandemic SARS-CoV-2 Cell Entry Depends on ACE2 and TMPRSS2 and Is Blocked by a Clinically Proven Protease Inhibitor Reducing the impact of PCR-mediated recombination in molecular evolution and environmental studies using a new-generation high-fidelity DNA polymerase The impact of DNA polymerase and number of rounds of amplification in PCR on 16S rRNA gene sequence data HiSpike: A high-throughput cost effective sequencing method for the SARS-CoV-2 spike gene Development and validation of RT-PCR assays for testing for SARS-CoV-2 16S Metagenomic Sequencing Library Preparation n.d Tag jumps illuminated -reducing sequence-tosample misidentifications in metabarcoding studies IDseq-An open source cloud-based pipeline and analysis service for metagenomic pathogen detection and monitoring Fast and accurate short read alignment with Burrows-Wheeler transform The Sequence Alignment/Map format and SAMtools Haplotype-based variant detection from short-read sequencing BEDTools: a flexible suite of utilities for comparing genomic features Tablet-next generation sequence assembly visualization Global initiative on sharing all influenza data -from vision to reality Mutation-Specific SARS-CoV-2 PCR Screen: Rapid and Accurate Detection of Variants of Concern and the Identification of a Newly Emerging Variant with Spike L452R Mutation SARS-CoV-2 N501Y Introductions and Transmissions in Switzerland from Beginning of Whole Genome Sequencing of SARS-CoV-2: Adapting Illumina Protocols for Quick and Accurate Outbreak Investigation during a Pandemic Full genome viral sequences inform patterns of SARS-CoV-2 spread into and within Israel Full length genomic sanger sequencing and phylogenetic analysis of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) in Nigeria A Sanger-based approach for scaling up screening of SARS-CoV-2 variants of interest and concern New infections by SARS-CoV-2 variants of concern after natural infections and post-vaccination in Rio de Janeiro, Brazil Improving the limit of detection for Sanger sequencing: A comparison of methodologies for KRAS variant detection Precise detection of de novo single nucleotide variants in human genomes Coverage recommendation for genotyping analysis of highly heterologous species using next-generation sequencing technology Development of amplicon deep sequencing markers and data analysis pipeline for genotyping multi-clonal malaria infections Longitudinal tracking and quantification of individual Plasmodium falciparum clones in complex infections Phred Quality Score -an overview | ScienceDirect Topics n.d Empirical evaluation of variant calling accuracy using ultra-deep whole-genome sequencing data Figure 1. Spike gene amplicon deep sequencing (ADS) pipeline ADSSpike The presented workflow was divided among 4 steps: sample processing, PCR and library preparation, sequencing, and data analysis (read filtering, alignment, SNP calling, and variant assessment) Plot displaying the mean read depth distribution by individual nucleotide position. Depth was assessed by SARS-CoV-2 positive iPCR samples, SARS-CoV-2 negative samples, and negative controls (NFWiPCR). The dashed green line represents a 700 read depth. (b) Plot displaying the mean read depth distribution by RT-PCR E gene Ct value (n = 68), mean ±SD Table 2. Parameter iteration for PPV and sensitivity calculation in SARS-CoV-2 VOC/VOI. Parameter combination Alpha/B.1.1.7 VOC PPV and sensitivity .2 VOI (sample 1) PPV and sensitivity 2 VOI (sample2) We would like to thank the participants of the study as well as the genomics facility and the Illumina support team.