key: cord-0943841-ii26xnse authors: Donzelli, Sara; Ciuffreda, Ludovica; Pontone, Martina; Betti, Martina; Massacci, Alice; Mottini, Carla; De Nicola, Francesca; Orlandi, Giulia; Goeman, Frauke; Giuliani, Eugenia; Sperandio, Eleonora; Piaggio, Giulia title: Optimizing the Illumina COVIDSeq laboratorial and bioinformatics pipeline on thousands of samples for SARS-CoV-2 Variants of Concern tracking date: 2022-05-19 journal: Comput Struct Biotechnol J DOI: 10.1016/j.csbj.2022.05.033 sha: c2b0ee590f414f8877e94afcf177b41806fd5bae doc_id: 943841 cord_uid: ii26xnse The SARS-CoV-2 Variants of Concern tracking via Whole Genome Sequencing represents a pillar of public health measures for the containment of the pandemic. The ability to track down the lineage distribution on a local and global scale leads to a better understanding of immune escape and to adopting interventions to contain novel outbreaks. This scenario poses a challenge for NGS laboratories worldwide that are pressed to have both a faster turnaround time and a high-throughput processing of swabs for sequencing and analysis. In this study, we present an optimization of the Illumina COVID-seq protocol carried out on thousands of SARS-CoV-2 samples at the wet and dry level. We discuss the unique challenges related to processing hundreds of swabs per week such as the tradeoff between ultra-high sensitivity and negative contamination levels, cost efficiency and bioinformatics quality metrics. time and a high-throughput processing of swabs for sequencing and analysis. In this study, we present an optimization of the Illumina COVID-seq protocol carried out on thousands of SARS-CoV-2 samples at the wet and dry level. We discuss the unique challenges related to processing hundreds of swabs per week such as the tradeoff between ultra-high sensitivity and negative contamination levels, cost efficiency and bioinformatics quality metrics. The SARS-CoV-2 virus pandemic has posed many novel challenges to worldwide health care and laboratorial processes. On the one hand, the logistics in dealing with hundreds of thousands of novel infected individuals per day [1] , [2] , of which a small but steady percentage require hospitalization and intensive care treatment. Among the many epi processes involved with pandemic resilience, an effective viral genomic surveillance is one that heavily challenged laboratories and genomic facilities worldwide. The importance of a robust and resilient genomic and bioinformatics workflow is of utter importance since it permits to 1. discover putative novel Variants of Concern (VoC) 2. Monitor variant distribution at the geographical and population scale. Coupling these capabilities with an international infrastructure of public datasets where sequences are promptly downloaded and shared [3] , [4] allows to generate a valuable amount of data that can share light on human-viral interactions with regard to public containment measures and vaccine distribution efficacy. Several Next Generation Sequencing (NGS) protocols and library standards have been released in order to deal with VoC tracking, such as the Artic which is supported by a wealth of open resources. Biotech companies such as Illumina and Thermo Fisher developed specific kits optimized for their platforms, a few of which have obtained FDA emergency approval for clinical-grade diagnostics [5] , [6] . While for the Artic amplicon a wealth of literature exists [7] - [10] , where severalprotocols and reports were published for the Illumina setting, there is a lack of material for specific high-level optimization of the mentioned protocols, aiming to optimize Turnaround Times (TAT), sensitivity in sequencing low-abundance RNA samples, and specificity for accurate variant calling and noise-contamination reduction [11] , [12] . In this scenario, we present an optimization of the COVIDSeq Test vv3 protocol based on our experience of 3416 total sequences, representing 3% of Italian sequences and 31% of the Lazio region at the time of writing this paper. The overall protocol is depicted in Figure 1 . Data stemming from local laboratory LIMS was finally merged with the lineage report for both patient-level reporting and cohort-level description for region network surveillance (rete Coronet). Bosphore® Novel Coronavirus (2019-nCoV) Detection Kit v4 (Anatoliageneworks, Instanbul, Turkey) was used to detect and characterize 2019-nCoV in human respiratory samples. Fluorescence detection was accomplished using FAM, HEX, Texas RED and Cy5 filters. SARS-CoV-2 was detected by three regions of the virus in a single reaction: E-gene assays are specific for bat-related betacoronaviruses, i.e. they detect both SARS-CoV-1 and SARS-CoV-2 and the ORF1ab target and N gene regions were used to discriminate SARS-CoV-2 specifically (Corman et al., 2020). Before amplification, a fast extraction was performed, which does not require a separate extraction but only a pre-treatment step that takes less than 10 min. Real-time PCR was performed with Montania 4896® thermal cycler. Swabs were tested within 4 h from collection. Samples with a ct value < 25 were re-extracted for the NGS analysis by using the total automated protocol of QIAsymphony technology (Qiagen, Hilden, Germany) with a final elution of 60ul. This technology allows an automated purification of viral nucleic acids and combines the speed and efficiency of silica-based purification of viral nucleic acids with the convenient handling of magnetic particles. During the first phase of the delta wave in Italy (August-September 2021), we reported a run with a particularly high contamination level, reaching a dropout rate of 82%. Not only was the Illumina protocol threshold of 5 amplicons covered in the negative controls higher with more than 50% of the SARS-CoV-2 genome covered, but the absolute coverage of contamination was higher than the positive samples i.e., the noise exceeding the actual signal, as shown from a test run containing 10 negative water samples that were highly enriched with viral coverage (Fig. 2A) . In order to tackle this issue, and having already double-checked all the contamination procedures and the lineage calling in previous known clusters and families, we reasoned that: (1) With current standards we could be over-sequencing SARS-CoV-2 for positive samples, being a ~30kb genome and spending more than 2 million reads on each sample on the standard setups, having a high duplication level (~60%) due to saturation; (2) Ultra-high coverage coupled with high-PCR amplification cycles could result in over-amplification and sequencing of spurious, non-specific, single viral molecules; (3) The dropout level for swab concentration at 35 cycles was low (median 0.23%, Table S1 ) in all previous test runs, making it possible to reduce amplification without compromising sensitivity. Furthermore, the initial design of the CovidSeq protocol was designed to detect viral positivity, with hypersensitivity in mind, despite the fact that this information was always included with previous rtPCR for all our samples. We thus introduced a variation in the Illumina CovidSeq protocol, namely the reduction of PCR cycles for the cDNA amplification and amplicon tag mentation steps, from 35 to 22 cycles and from 7 to 6 cycles, respectively. Finally, we had to align the final concentration of the positive control from 25 to 1500 copies to a novel amplification level. In order to test lineage calling reproducibility of the 35-cycles contamination level, we resequenced the same samples deriving from different pandemic time frames (March 2021 and August 2021). The most striking results depend on the signal-to-noise ratio shown in the 10 negative controls that illustrate a 70% median coverage reduction ( Fig. 2A-B) . The reanalysis confirmed a strong overlap in lineage assignment, even if the negative control contamination always stemmed from the dominant variant in the period (Alpha to Delta), lineage assignment was identical in 87% of all resequenced samples and finally where superlineage assignment showed 100% concordance with all high-coverage samples (Table S2 ). Discrepancies on low coverage samples involved lineage shift to a less specific variant in the same family (2.1%) or either a missing assignment (1.2%), which was caused by an even lower coverage on low-quality samples. Nonetheless, a subset of samples (1.2%) was observed to switch from a non-assignment to a low-coverage assignment (e.g. none to AY.57), resulting from noise made by a consensus cleaner albeit incomplete. Lineage specificity increased for 2.1% of high-coverage samples (e.g., B.1.177.75 from B.1.177), a byproduct of a dramatic reduction in the variability of negative background coverage (from 2748x±32582x to 4704x±3435x, Fig. S1A ). All these results stress the importance of using multiple metrics for low-quality samples lineage calling, composed of a mixture of technical (e.g., median and RBD coverage) and qualitative (lineage class) data. When considering the reproducibility of the process at the mutational level, we found a 100% concordance in 37/82 (45.1%) and >95% for additional 26/82 (31.7%) (Table S4) ; the lower tailed concordance samples reflect the nature of lineage non-assignment or super-class shift as described in Table 2 . Furthermore, we characterized the mutations specifically present only in negative control samples. These 31 mutations are listed in Table S5 In addition, PCR cycle reduction saved 1.5h hours for each amplification process (4h for 35 cycles required against 2.5h for 22 cycles). This validation encouraged us to keep this amplification level as an acceptable signal-to-noise threshold, with a dropout level even lower thanks to cleaner consensuses (Fig. S1B) . To further reduce the possibility of lineage miscalling, we further conducted a Bioinformatics check list at the single mutation level as described in the subsequent section. Sequencing runs had different setups based on the number of samples ( [16] to check for the latest changes in variant assignment. In addition, with the aim to have better control and governance of the analytical processes, we started looking for alternatives for the private cloud. We found a powerful workflow in the viral-recon pipeline of the NF-CORE environment, which was highly supported by the scienitific community at the time of writing this paper [17] . We employed the viral-recon workflow on our NGS systems with 1,2,4,8 PCR plates setups, obtaining results in our HPC node of 48CPU/354GB from 4 to 24 hours. In order to achieve these performances, several non-critical steps of the pipeline were skipped via --skip_kraken2 --skip_fastp --skip_nextclade --skip_snpeff --skip_variants_quast --skip_plasmidid --skip_blast --skip_bandage -skip_asciigenome. An additional performance comparison is provided in Whole Genome Sequencing performed from April 2021 to January 2022 produced 3232 high quality samples. As expected, total coverage analysis showed that amplicons 43, 44, 56 and 57 are specifically highly enriched (coverage > 15000x) (Fig S1C) , whilst in contrast amplicon 64 is consistently lower (median coverage = 800x) than the rest of the viral genome ( Fig. 2C) . Amplicons spanning the Spike protein region [72-83], which are critical for VoC calling (Fig. S1D ), have sufficient coverage on most samples (Median Undedup Coverage > 8000x, Minimum Undedup Coverage > 20x on 100% samples); nevertheless, their median coverage is lower than the rest of the genome overall (Fig. S1E) . These outlier regions call for greater attention to be paid to both over-sequencing (false positives) and putative low-coverage in outlier samples. It is important to note that all these metrics are in deduplicated coverage suitable for amplicon sequencing, while many DRAGEN/Illumina thresholds are indicated in coverage including duplication. When considering the performance of the panel, we found that over 80% samples have a minimum coverage above 100x across the whole genome and that the PCR change does not pan an issue for horizontal coverage (Fig. 2D) . Lineage analysis showed a switch from the beta-delta dominated scenario towards an omicron increase (Table S3) , consistent with what is reported in the international data mining (GISAID) and literature. Accuracy of lineage calling, computed as the proportion of the defining variants having alternative alleles, is consistently high across all samples (Median ScorpioSupport = 0.875). The whole process was split up into four separate units at IFO: (1) multi-center swab collection, processing and extraction carried out at the Virology Unit, 1 to 2 days (2) library prep at the Oncogenomics Laboratory, 1 to 2 days (3) library quantification, cartridge loading, and sequencing run at the Genomics Facility, 1 to 2 days (4) primary and secondary analysis at the Bioinformatics Unit, 1 day. Many steps were processed overnight such as sequencing runs and cloud or High-Performance Computing analyses in order to achieve a turnaround time of seven days. Every sub-process had a group of 2 to 3 people for high redundancy, except for the Virology Unit that features a large with >10 technicians on a 24hshift (ISG COVID Team). The coronavirus pandemic proved once again how protocol sharing and Open Science approaches lead to process optimization and massive savings on human and capital resources. One typical question arises when a VoC tracking optimization is presented: would it be possible to detect Novel viral Variants better? In theory yes it is possible, bearing in mind that a Variant is defined via a mixture of analyses comprising of Viral, Epidemiological and Geographical data. Viral sequences are not the only resources used for detection, the results from the lab need to be continuously compared to public resources such as Nextstrain or the COVID-miner, in order to avoid irreproducible announcements caused by technical artifacts such as inter-sample contamination [19] . An alert module can easily be added to the workflow when lineage assignment scores are particularly low, or the set of single mutations associated with a sample are poorly represented. In conclusion, here we presented an optimized protocol that led to a severe reduction in background noise due to contamination, a modest gain in turnaround time and an integrated Characteristics and Outcomes of Hospitalized Patients in South Africa During the COVID-19 Omicron Wave Compared With Previous Waves Omicron SARS-CoV-2 variant: a new chapter in the COVID-19 pandemic GISAID's Role in Pandemic Response The COVID-19 Data Portal: accelerating SARS-CoV-2 and COVID-19 research through rapid open access data sharing FDA Grants Emergency Use Authorization for Two Next-Generation COVID-19 Assays from Thermo Fisher Scientific Illumina COVIDSeq Test-Instructions for Use Artic Network Evaluation of NGS-based approaches for SARS-CoV-2 whole genome characterisation A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology Phylodynamic Analysis | 176 genomes | 6 Mar 2020 -SARS-CoV-2 coronavirus / nCoV-2019 Genomic Epidemiology -Virological High throughput detection and genetic epidemiology of SARS-CoV-2 using COVIDSeq next-generation sequencing An optimized, amplicon-based approach for sequencing of SARS-CoV-2 from patient samples using COVIDSeq assay on Illumina MiSeq sequencing platforms DRAGEN RNA Pathogen Detection DRAGEN COVIDSeq Test (RUO) Assignment of epidemiological lineages in an emerging pandemic using the pangolin tool nf-core/viralrecon: nf-core/viralrecon v2.3 -Copper Coatimundi Design of a companion bioinformatic tool to detect the emergence and geographical distribution of SARS-CoV-2 Spike protein genetic variants Deltacron: the story of the variant that wasn't We thank dr. Cesare Camma and the whole group at the Istituto Zooprofilattico Sperimentale dell' Abruzzo e del Molise for their valuable insight during our discussions and sharing of