key: cord-0925033-h0vmyxs5 authors: Harilal, Divinlal; Ramaswamy, Sathishkumar; Loney, Tom; Al Suwaidi, Hanan; Khansaheb, Hamda; Alkhaja, Abdulmajeed; Varghese, Rupa; Deesi, Zulfa; Nowotny, Norbert; Alsheikh-Ali, Alawi; Tayoun, Ahmad Abou title: SARS-CoV-2 Whole Genome Amplification and Sequencing for Effective Population-Based Surveillance and Control of Viral Transmission date: 2020-07-27 journal: Clin Chem DOI: 10.1093/clinchem/hvaa187 sha: d67406ffc82e2b174cfbbf7c4453106703bb4480 doc_id: 925033 cord_uid: h0vmyxs5 BACKGROUND: With the gradual reopening of economies and resumption of social life, robust surveillance mechanisms should be implemented to control the ongoing COVID-19 pandemic. Unlike RT-qPCR, SARS-CoV-2 whole genome sequencing (cWGS) has the added advantage of identifying cryptic origins of the virus, and the extent of community-based transmissions versus new viral introductions, which can in turn influence public health policy decisions. However, the practical and cost considerations of cWGS should be addressed before it is widely implemented. METHODS: We performed shotgun transcriptome sequencing using RNA extracted from nasopharyngeal swabs of patients with COVID-19, and compared it to targeted SARS-CoV-2 genome amplification and sequencing with respect to virus detection, scalability, and cost-effectiveness. To track virus origin, we used open-source multiple sequence alignment and phylogenetic tools to compare the assembled SARS-CoV-2 genomes to publicly available sequences. RESULTS: We found considerable improvement in whole genome sequencing data quality and viral detection using amplicon-based target enrichment of SARS-CoV-2. With enrichment, more than 99% of the sequencing reads mapped to the viral genome compared to an average of 0.63% without enrichment. Consequently, an increase in genome coverage was obtained using substantially less sequencing data, enabling higher scalability and sizable cost reductions. We also demonstrated how SARS-CoV-2 genome sequences can be used to determine their possible origin through phylogenetic analysis including other viral strains. CONCLUSIONS: SARS-CoV-2 whole genome sequencing is a practical, cost-effective, and powerful approach for population-based surveillance and control of viral transmission in the next phase of the COVID-19 pandemic. human specimens using magnetic bead technology. SARS-CoV-2 positive results were confirmed using a RT-qPCR assay, originally designed by the US Centers for Disease Control and Prevention (CDC), which is currently provided by Integrated DNA Technologies (IDT). This assay consists of oligonucleotide primers and dual-labelled hydrolysis (TaqMan®) probes (5'FAM/3'Black Hole Quencher) specific for two regions (N1 and N2) of the virus nucleocapsid (N) gene. An additional primer/probe set is also included to detect the human RNase P gene (RP) as an extraction control. The reverse transcription and amplification steps are performed using the TaqPath TM 1-Step RT-qPCR Master Mix (ThermoFisher) following manufacturer's instructions. A sample was considered positive if the cycle threshold (Ct) values were less than 40 for each of the SARS-CoV-2 targets (N1 and N2) and the extraction control (RP). To estimate the viral load relative to human RNA, we calculated the ΔCt value for each target as follows: ΔCt = CtNn -CtRP, where Nn is either N1 or N2. The inverse of the average N1 and N2 target ΔCt values (i.e. -ΔCt) was used to estimate the relative viral load which is inversely correlated with Ct value. RNA libraries from all samples were prepared for shotgun transcriptomic sequencing using the TruSeq Stranded Total RNA Library kit from Illumina, following manufacturer's instructions. RNA specific fluorescent dye is used to quantify extracted RNA using the Qubit RNA XR assay kit and the Qubit Flourometer system (ThermoFisher). Then, 1µg of input RNA from each patient sample was depleted for human ribosomal RNA, and the remaining RNA underwent fragmentation, reverse transcription (using the SuperScript II Reverse Transcriptase Kit from Invitrogen), adaptor ligation, and amplification. Libraries were then sequenced using the NovaSeq SP Reagent kit (2 X 150 cycles) from Illumina. RNA extracted (approximately 1µg) from patient nasopharyngeal swabs was used for double stranded cDNA synthesis using the QuantiTect Reverse Transcription Kit (Qiagen) according to manufacturer's protocol. This cDNA was then evenly distributed into 26 PCR reactions for SARS-CoV-2 whole genome amplification using 26 overlapping primer sets covering most of its genome (Figure 2A and online Supplemental Table 1 ). The SARS-CoV-2 primer sets used in this study were modified from Wu et al (7) by adding M13 tails to enable sequencing by Sanger, if needed (online Supplemental Table 1 ). PCR amplification was performed using the Platinum TM SuperFi TM PCR Master Mix (ThermoFisher) and a thermal protocol consisting of an initial denaturation at 98°C for 60 seconds, followed by 27 cycles of denaturation (98°C for 17 seconds), annealing (57°C for 20 seconds), and extension (72°C for 1 minute and 53 seconds). A final extension at 72°C for 10 minutes was applied before retrieving the final PCR products. Amplification was confirmed by running 2µl from each reaction on a 2% agarose gel. All PCR products were then purified using Agencourt AMPure XP beads (Beckman Coulter), quantified by NanoDrop (ThermoFisher), diluted to the same concentration, and then pooled into one tube for next steps. A minimum of 200-800ng of the pooled PCR products in 55µl were then sheared by ultrasonication (Covaris LE220-plus series) to generate a target fragment size of 250-750bp using the following parameters: 20% duty factor, peak power of 150 watts, 900 cycles per burst, 320 seconds treatment time, an average power of 30 watts, and 20°C bath temperature. Target fragmentation were confirmed by the TapeStation automated electrophoresis system TapeStation (Agilent) (Figure 2A) . Subsequently, the fragmented product was purified and then processed to generate sequencing-ready libraries using the SureSelectXT Library Preparation kit (Agilent) following manufacturer instructions. Indexed libraries from multiple patients were pooled and sequenced (2 X 150 cycles) using the MiSeq or the NovaSeq systems (Illumina). A step-by-step SARS-CoV-2 target enrichment and sequencing protocol is provided in the online Appendix. Demultiplexed Fastq reads, obtained through shotgun or target enrichment sequencing, were generated from raw sequencing base call files using BCL2Fastq v2.20.0, and then mapped to the reference Wuhan genome (GenBank accession number: NC_045512.2) by Burrow-Wheeler Aligner, BWA v0.7.17. Alignment statistics, such as coverage and mapped reads, were generated using Picard 2.18.17. Variant calling was performed by GATK v3.8-1-0, and was followed by SARS-CoV-2 genome assembly using BCFtools v.1.3.1 ( Figure 2B ). All tools used in this study are freely accessible. For laboratories without bioinformatics support, several publicly accessible, end-to-end bioinformatics pipelines (INSaFlu and Genome Detective) (8, 9) , composed of the above tools, can be used to generate viral sequences from raw Fastq data. For downstream analysis, a general quality control metric was implemented to ensure assembled SARS-CoV-2 genomes had at least 20X average coverage (sequencing reads >Q30) across most nucleotide positions (56-29,797). For target enrichment and shotgun sequencing comparisons in Table 1 , we used data from 7 samples (UAE/P1/2020, L0287, L1189, L4711, L5857, L6841, L9119) generated using both methods. Data from patient L5630 were not included in this analysis since, unlike the above 7 samples, appreciably more sequencing data was allocated for the target enriched sample in this patient (online Supplemental Table 2 ) which can overestimate the efficiency of the target enrichment protocol. All new SARS-CoV-2 sequences (n=7) generated in this study were submitted to GISAID (Global Initiative on Sharing All Influenza Data) under accession IDs: EPI_ISL_463740 and EPI_ISL_469276 to EPI_ISL_469281. We used Nexstrain (10) , which consists of Augur v6.4.3 pipeline for multiple sequence alignment (MAFFT v7.455) (11) and phylogenetic tree construction (IQtree v1.6.12) (12). Tree topology was assessed using the fast bootstrapping function with 1,000 replicates. Tree visualization and annotations were performed in FigTree v1.4.4 (13). Shotgun transcriptome sequencing was used to fully sequence SARS-CoV-2 RNA extracted from patients (n=17) who tested positive for the virus (4). Analysis of the sequencing data showed that this approach required, on average, 4.71Gb of data per sample yielding 31.7 million total reads, of which approximately 0.63% of the reads (~199,000 reads) mapped to the SARS-CoV-2 genome with an average coverage of ~173x (Table 1) . This is attributed to the fact that most of the shotgun data (~99%) is allocated to the human transcriptome while a minority of the reads align to the SARS-CoV-2 genome (Table 1 ). In addition to cost and storage considerations discussed below, this approach is not highly sensitive for detecting SARS-CoV-2 genomes in general and specifically in samples with low viral abundance. In fact, despite high viral abundance relative to human RNA, most samples had less than 100x sequencing coverage across the SARS-CoV-2 genome. Samples with seemingly very low viral loads failed to yield full SARS-CoV-2 genome sequence using this approach (Table 1 and Figure 3A ). To enrich viral sequences and minimize sequencing cost and data storage issues addressed below, we describe an alternative approach where the entire SARS-CoV-2 genome is first amplified using 26 overlapping primer sets each yielding around 1.5kb long inserts (Figures 2A, 3B and online Supplemental Table 1 ). All inserts were then pooled and fragmented to 250-750bp inserts which were then prepared for short read next generation sequencing (Figures 2A and 3C) . RNA extracted from eight COVID-19 patients, which was first sequenced by shotgun transcriptome ( Table 1 and Supplemental Table 2 ), was sequenced using the enrichment protocol. As expected, we observed significant enhancement in virus detection using this protocol where, on average, 99.3% of the reads now mapped to the SARS-CoV-2 genome leading to tenfold increase in coverage relative to shotgun transcriptome (avg. 440x versus 45.5x, respectively) despite generating two hundred fold less sequencing data (avg. 0.02Gb versus 4.28Gb, respectively, Table 2 and Figure 3D ). On average, 37x coverage per 1Gb of sequencing data was generated using shotgun sequencing ( Table 1 ) compared to ~23,000x per 1Gb using target enrichment ( Table 2) suggesting the latter method is more cost effective and is highly scalable. We calculate the cost of SARS-CoV-2 full genome sequencing to be ~$87 per sample when sequencing 96 samples in a batch at 400x using the target enrichment method. The number of samples in a batch can be doubled (196) while maintaining a low cost (~$104) and a very high coverage of 40,000x per sample (online Supplemental Table 3 ). On the other hand, the cost of sequencing one sample at a much lower coverage (50x) using the shotgun method is $403, while increasing sequencing coverage more than doubled the cost ($1735 at 100x and $1060 at 200x) (online Supplemental Table 3 ). However, using higher throughput sequencing can significantly lower the cost of shotgun sequencing to $232 for 62 samples in a batch at 200x per sample. Nonetheless, using a similar throughput, the per sample cost of enrichment sequencing is $108 for 196 samples in a batch where each sample receives considerably more coverage (~40,000x) (online Supplemental Table 3 ). Therefore, target enrichment sequencing is still more cost-effective and scalable than shotgun transcriptome sequencing even at higher sequencing throughputs. Another factor impeding scalability of the shotgun approach is data storage. Even with higher throughput sequencing (NovaSeq SP flowcell), shotgun sequencing requires an allocation of 1TB of data for ~250 sequenced samples. On the other hand, with 1TB of data, a total of around 80,000 samples can be sequenced using the enrichment method and the MiSeq Micro flowcell (online Supplemental Table 3 ). Therefore, long term data storage allocations, and cost, are substantially higher, and perhaps formidable, when using the shotgun sequencing approach. To illustrate the utility of SARS-CoV-2 whole genome sequencing, we tracked the origin(s) of the virus in seven patients (UAE/P1/2020, L0287, L1189, L4711, L5857, L6841, L9119) by comparing their assembled sequences to virus strains (n=25) identified during the early phase of the pandemic, between January 29 and March 18 2020, in the UAE (4). All seven patient samples were collected between March 28 and April 5 2020, and are therefore good candidates to determine whether transmissions were community-based due to the previously documented 25 strains or were independent external introductions. Multiple sequence alignment and phylogenetic analysis (Figures 2B and 4) showed that the new isolates from patients L0287, L4711, L1189, and L5857 clustered with earlier strains of Iranian origin (or clade A3), while that from patient L9119 belonged to the early European (or clade A2a) cluster (Figure 4) . This information suggests that transmissions for all those five patients were most likely community-based, which we then confirmed from patient medical records where no recent travel history was reported by any of those individuals. SARS-CoV-2 isolates from patients P1/UAE/2020 and L6841 were, on the other hand, closer to the earliest Asian strains, which are more diverse due to fewer but distinct mutations (Figure 4) . Hence, with the available sequencing data it is challenging to ascertain whether the P1/UAE/2020 and L6841 transmissions were community-based or due to early independent introductions. However, patient L6841 did not have any recent travel history before symptoms onset, suggesting the case for community-based transmission related to an Asian strain. On the other hand, travel history in patient P1/UAE/2020 was not known and the corresponding isolate appeared to match closely to five other strains from the United States and Taiwan (Figure 4) . Therefore, transmission in patient P1/UAE/2020 was unlikely to be community-based from the early 25 strains (4), but rather due to an independent travelrelated introduction of the virus. Genomics-based SARS-CoV-2 population-based surveillance is a powerful tool for controlling viral transmission during the next phase of the pandemic. Therefore, it is important to devise efficient methods for SARS-CoV-2 genome sequencing for downstream phylogenetic analysis and virus origin tracking. Towards this goal, we describe a cost-effective, robust, and highly scalable target enrichment sequencing approach, and provide an example to demonstrate its utility in characterizing transmission origin. Our target enrichment protocol is amplicon-based for which oligonucleotide primers can be easily ordered by any molecular laboratory. Next generation sequencing (NGS) has also become largely accessible to most labs, and in our protocol we show that highly affordable, low throughput sequencers, such as the Illumina MiSeq system, can be used efficiently to sequence up to 96 samples at 400x coverage each at a cost of $87 per sample ( Table 3) . This cost is likely comparable to RT-qPCR testing for the virus. Other low throughput, highly affordable semiconductor sequencers can also be used with this protocol (14) . One possible limitation is the use of ultra-sonication for fragmentation of PCR products after SARS-CoV-2 whole genome amplification. Several labs might lack sonication systems due to accessibility and affordability issues. In such situations, our protocol can be easily modified to use enzymatic fragmentation instead provided by commercial kits, such as the Agilent SureSelect QXT kit. Furthermore, we have added M13 tails to all our primer sets making them amenable to Sanger sequencing for those labs not equipped with NGS. However, with this approach, manual analysis of sequencing data limits scalability of the approach. Upon sequence generation, the bioinformatics analysis can be performed using open source scripts. Labs without bioinformatics expertise or support can use online tools (INSaFlu and Genome Detective) (8, 9) which can take raw sequencing (Fastq) files to assemble viral genomes, and to perform multiple sequence alignment and phylogenetic analysis for virus origin tracking. In addition, the described approach does not require large data storage or computational investment as shown by our cost, data, and scalability calculations (Supplemental Table 3 ). Our phylogenetic analysis demonstrates how SARS-CoV-2 genomic sequencing can be used to track origins of virus transmission. However, data should be carefully interpreted, and should be combined with other epidemiological information (such as travel history) to avoid inaccurate conclusions. The major limitation facing genomic-based SARS-CoV-2 surveillance includes the lack of virus sequencing data representing most strains in any country. Nonetheless, SARS-CoV-2 strains are continuously being sequenced by government, private, and academic entities all over the world, and the sequencing data is being shared publicly. This proliferation of sequencing datasets will empower genomic-based surveillance of the virus, and the availability of cost effective sequencing options, like the one described in this study, will contribute to democratizing SARS-CoV-2 sequencing and data sharing. In summary, we show that SARS-CoV-2 whole genome sequencing is a highly feasible and effective tool for tracking virus transmission. Genomic data can be used to determine community-based versus imported transmissions, which can then inform the most appropriate public health decisions to control the pandemic. Author Declaration: A version of this paper was previously posted as a preprint on bioRxiv as https://www.biorxiv.org/content/10.1101/2020.06.06.138339v1. SARS-CoV-2/COVID-19: Viral Genomics, Epidemiology, Vaccines, and Therapeutic Interventions Identifying and interrupting superspreading events -Implications for control of severe acute respiratory syndrome coronavirus 2 Genomic surveillance and phylogenetic analysis reveal multiple introductions of SARS-CoV-2 into a global travel hub in the Middle East Shotgun transcriptome and isothermal profiling of SARS-CoV-2 infection reveals unique host responses, viral diversification, and drug interactions A new coronavirus associated with human respiratory disease in China InSaFLU: an automated open web-based bioinformatics suite "from reads" for influenza whole-genomesequencing-based surveillance Genome detective: An automated system for virus identification from highthroughput sequencing data Nextstrain: real-time tracking of pathogen evolution MAFFT: A Novel Method for Rapid Multiple Sequence Alignment Based on Fast Fourier Transform Terrace Aware Data Structure for Phylogenomic Inference from Supermatrices A comprehensive assay for CFTR mutational analysis using next-generation sequencing No authors declared any potential conflicts of interest. No sponsor was declared.