key: cord-1034306-byqkisoz
authors: Bermúdez, Juanjo
title: A Comparison of Performance for Different SARS-Cov-2 Sequencing Protocols
date: 2021-03-01
journal: bioRxiv
DOI: 10.1101/2021.03.01.433428
sha: 0c28e29661bc8068a9141e980b14c28cf6a8dd90
doc_id: 1034306
cord_uid: byqkisoz

SARS-Cov-2 genome sequencing has been identified as a fundamental tool for fighting the COVID-19 pandemic. It is used, for example, to identify new variants of the virus and to build phylogenetic trees that help trace its spread. In the present study we provide a comprehensive comparison of the quality of the assemblies obtained from different sequencing protocols. We show that some protocols actively promoted by different high-level administrations are inefficient, and that less-used alternative protocols perform significantly better. This increase in performance could lead to cheaper sequencing protocols and therefore to an easier scale-up of sequencing efforts around the world.

There are two basic strategies to reconstruct a genome from the data produced by currently available sequencing machines:

1. Reconstruct the genome with no prior knowledge, using de novo sequence assembly.
2. Reconstruct the genome using prior knowledge, with reference-based alignment/mapping.

It is generally accepted that each strategy has its own advantages and drawbacks. The quality of a reference-based assembly depends heavily on the choice of a close-enough reference: some variants can be missed if the sample is not close enough to the reference. On the other hand, de novo genome assembly is more computationally demanding and not always possible from the available data.

"Current variant discovery approaches often rely on an initial read mapping to the reference sequence.
Their effectiveness is limited by the presence of gaps, potential misassemblies, regions of duplicates with a high-sequence similarity and regions of high-sequence divergence in the reference. Also, mapping-based approaches are less sensitive to large INDELs and complex variations" (1)

"We document that 18.6% of SNP genotype calls in HLA genes are incorrect and that allele frequencies are estimated with an error greater than ±0.1 at approximately 25% of the SNPs in HLA genes. We found a bias toward overestimation of reference allele frequency for the 1000G data, indicating mapping bias is an important cause of error in frequency estimation in this dataset." (2)

"Detecting indels is challenging for several reasons: (1) reads overlapping the indel sequence are more difficult to map and may be aligned with multiple mismatches rather than with a gap; (2) irregularity in capture efficiency and nonuniform read distribution increase the number of false positives; (3) increased error rates makes their detection very difficult within microsatellites; and (4) localization, near identical repetitive sequences can create high rates of false positives" (3)

In an ideal scenario, researchers should have both options available: reference mapping and de novo assembly. If one of them is missing, the results do not have the maximum possible reliability. And if both can be had at the same cost, there is no reason not to have both. For that reason it is important that libraries for sequencing SARS-Cov-2 are designed with de novo genome assembly in mind.

Some studies have already assessed the performance of the most commonly used protocols (4), but they focus exclusively on the read coverage obtained and not on the quality of the de novo assemblies.
This study compares protocols based on the quality of the de novo assembly, which is a more demanding metric for assessing protocol performance. The performance of mapping to a reference genome is not analyzed here, because previous studies have already covered it and because superior performance in de novo assembly is strongly correlated with superior performance in reference mapping.

I used different search patterns on the NCBI SRA (5) website to find SARS-Cov-2 sequencing data obtained with different protocols. Although this is not a totally reliable method (some search terms are ambiguous), I think it helps to understand the proportions. Table 2 shows the number of matches found for every sequencing hardware technology. Although some protocols were developed for specific hardware, they are being used with other hardware too. For example, there are many more ARTIC (6) results for Illumina than for Nanopore, even though the protocol was initially designed for Nanopore.

From the results of these queries I randomly selected some runs and downloaded the data sets. I then assembled the data sets using whichever of SPAdes (7), rnaSPAdes (8) and metaSPAdes performed best (I will denote by xSPAdes the best result obtained from these). When the runs contained long reads, Flye and Canu (9) were also applied. Finally, I assembled some of the short-read runs with Contignant s-aligner (10).

SPAdes, rnaSPAdes and metaSPAdes have been shown in previous studies to be the best-performing open-source software for de novo viral genome assembly. Flye and Canu are considered the best-performing assembly tools for long-read data. s-aligner, meanwhile, is a new de novo genome assembler that has recently demonstrated superior performance over previous short-read assemblers for viral genome assembly.

Table 3 shows the results obtained. Some observations can be drawn from these results.
I still have not found a long-read data-set that yields a perfect assembly, regardless of the library design or the technology employed (Nanopore or PacBio). The mean NG50 for long-read data-sets is 7.622, while every protocol using short reads at least doubles that. In addition, the resulting sequences have a higher misassembly rate, which makes the data less suitable for variant detection.

Despite being widely used (41% of runs in the SRA archive), the ARTIC protocol performs poorly, far below the best-performing protocols. Considering only results for short-read data, its mean NG50 is 16.712, which is quite a bad result.

C. The ARTIC protocol doesn't outperform other protocols.

When only open-source assembly software is used, the ARTIC protocol does not significantly outperform other protocols. Its mean NG50 is similar to the overall mean NG50 of all protocols using open-source software (16.712 with ARTIC vs 15.865 overall), and slightly lower than that of protocols using random primers (17.220).

D. Library designs with random primers largely outperform designs with fixed primers when using s-aligner.

When all available software options are used, not only open-source ones, designs with random primer selection largely outperform designs with fixed primer selection, like ARTIC. The mean NG50 for short-read ARTIC data assembled with SPAdes (16.712) is about 42% lower than the mean NG50 for random-primer data assembled with s-aligner (28.654); equivalently, the latter is about 71% higher. Indeed, the combination of s-aligner and random-primer data yields an almost perfect assembly of the virus genome in most cases: thirteen out of fifteen.

This observation is corroborated by the frequent presence of gaps in the reference mapping of runs obtained from fixed-primer designs. This is to be expected from designs based on fixed primers, and the limitation is already recognized by the WHO (11).
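The comparisons above rest on the NG50 metric, which the text uses without defining. As a reminder of what is being compared, here is a minimal sketch using the standard definition: contigs are taken from longest to shortest, and NG50 is the length of the contig at which their running total first reaches half of the reference genome length. The function name, the contig lengths and the 29,903 bp genome length (the length of the NC_045512.2 reference) are illustrative assumptions, not values taken from this study.

```python
from typing import Iterable


def ng50(contig_lengths: Iterable[int], genome_length: int) -> int:
    """Return the NG50 of an assembly, or 0 if the contigs cover less
    than half of the reference genome.

    Contigs are visited from longest to shortest; NG50 is the length of
    the contig at which the running total first reaches half of the
    reference genome length.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    half = genome_length / 2
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= half:
            return length
    return 0


# Hypothetical inputs, for illustration only.
GENOME_LENGTH = 29_903                      # assumed reference length in bp
fragmented = [9_000, 8_000, 7_000, 3_000]   # made-up contig lengths
near_complete = [29_850]                    # made-up single long contig

print(ng50(fragmented, GENOME_LENGTH))      # 8000
print(ng50(near_complete, GENOME_LENGTH))   # 29850
```

Because NG50 is normalized by the reference length rather than by the total assembly size, it allows assemblies of the same genome produced by different protocols to be compared directly.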
S-aligner is, in general, a better tool for viral genome assembly. But even with it, the ARTIC protocol underperforms other protocols. The average NG50 using s-aligner on ARTIC data-sets is 16.757, similar to the average NG50 with open-source software (16.712) but far from the average NG50 obtained with s-aligner for random-primer protocols (28.654).

When s-aligner is used as the assembly software with random-primer library designs, there is no significant difference between paired-end and single-read data: 28.394 (single) vs 28.654 (overall).

There are significant differences in performance between the different protocols for sequencing SARS-Cov-2 (figure 1). The difference in performance between the ARTIC protocol with short-read technologies and a random-primer design with s-aligner is statistically significant, with p-value < 0.00001. The difference in the NG50 metric is 71.5% on average. In addition, when evaluating the perfect-assembly ratio, we find that ARTIC has a 33.3% success rate, while the s-aligner-based protocol has an 86.7% success rate. With long-read data-sets, the success rate

Results in which both assembly methods underperformed were excluded, as this is likely due to problems in the data-set. Empty cells correspond to assemblies that were not attempted because they were not relevant to the study.

These results suggest that the hundreds of thousands of genome sequencing runs being performed around the world to trace the spread of the virus and detect new variants are not using the most reliable and efficient methods. The low NG50 and perfect-assembly ratio suggest that these methods are far from reliable if de novo genome assembly is considered necessary, as suggested by previous studies on the efficacy of mapping-only assembly.
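The percentages quoted in the results can be checked directly from the reported means. The sketch below is only an arithmetic check: the helper names are my own, the NG50 means are copied as printed in the text, and the 13-out-of-15 count is the one given for the s-aligner plus random-primer combination. Note that the size of the gap depends on which mean is taken as the baseline.

```python
def percent_higher(value: float, baseline: float) -> float:
    """How much larger `value` is than `baseline`, as a percentage of `baseline`."""
    return (value - baseline) / baseline * 100.0


def percent_lower(value: float, baseline: float) -> float:
    """How much smaller `value` is than `baseline`, as a percentage of `baseline`."""
    return (baseline - value) / baseline * 100.0


# Mean NG50 values as printed in the text.
artic_spades_ng50 = 16.712       # ARTIC short-read data, open-source assemblers
random_saligner_ng50 = 28.654    # random-primer data, s-aligner

# Random-primer + s-aligner relative to ARTIC + SPAdes.
print(round(percent_higher(random_saligner_ng50, artic_spades_ng50), 1))  # 71.5

# ARTIC + SPAdes relative to random-primer + s-aligner.
print(round(percent_lower(artic_spades_ng50, random_saligner_ng50), 1))   # 41.7

# Perfect-assembly ratio for s-aligner with random-primer data (13 of 15 runs).
print(round(13 / 15 * 100, 1))                                            # 86.7
```

With the ARTIC mean as the baseline the gap is about 71.5%; with the random-primer mean as the baseline it is about 41.7%.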
Mapping the data to a reference genome is usually considered a necessary but insufficient step, and a de novo assembly is always preferable; the only reason not to prefer it is that it is unavailable. We show in this study that there are protocols that reliably allow de novo genome assemblies of SARS-Cov-2 to be obtained: a tool that would improve the quality of current efforts to trace the virus worldwide.

Another factor in choosing a protocol for sequencing SARS-Cov-2 is cost. ARTIC was specifically designed to be low-cost for that reason. When evaluating the costs of different sequencing protocols three aspects should be considered.

Unfortunately, I have neither the necessary experience nor the access to materials needed to evaluate these costs. For that reason I contacted several public-health organizations, warning them of the significant underperformance of some protocols and offering to cooperate in finding better ones. Annex I lists the entities that were contacted. None of them has agreed to cooperate at the time of writing. One can guess at their motivations, but some can be firmly ruled out: they are not declining because they are already carrying out equivalent studies, nor because they already have the answers that such a study would provide.

Even though I lack the experience to make a full analysis of the cost-effectiveness of different protocols for sequencing SARS-Cov-2, some clues can be drawn from the data in this study. Reliable, almost-complete de novo genome assemblies can be obtained from data-sets under 10 MB (and therefore highly multiplexable), produced with less expensive hardware like Ion Torrent or BGI. With Illumina too, cost-effective protocols can be established that use less data and single-read technology.
That suggests that cost-effective protocols are possible that are reliable from a de novo assembly perspective and not only from a reference-mapping one. The increase in performance also suggests that a higher percentage of sequencing efforts will end in conclusive results, eliminating the cost of most inconclusive ones. All this suggests that protocols that are overall more cost-effective than ARTIC are possible and desirable.

The data underlying this article are available as DOI: 10.5281/zenodo.4558343. The s-aligner software is available for free at https://contignant.com for first-time users. It is free to use for 15 days after installation. No personal identification is required, but a contact email must be provided to download it.

(1) Comparative analysis of de novo assemblers for variation discovery in personal genomes
(2) Mapping bias overestimates reference allele frequencies at the HLA genes in the 1000 genomes project phase I data
(3) Accurate de novo and transmitted indel detection in exome-capture data using microassembly
(4) Improvements to the ARTIC multiplex PCR method for SARS-CoV-2 genome sequencing using nanopore
(5) The sequence read archive
(6) Artic protocol
(7) SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing
(8) metaSPAdes: a new versatile metagenomic assembler
(9) Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation
(10) Juanjo Bermúdez. s-aligner: a greedy algorithm for non-greedy de novo genome assembly
(11) Genomic sequencing of SARS-CoV-2: a guide to implementation for maximum impact on public health
(12) Contignant s-aligner

I am the developer and the owner of all rights to the s-aligner software.
Table 6 lists the public institutions that were contacted, either to warn them of a possible inefficiency in the protocols applied for sequencing SARS-Cov-2 (including an offer to cooperate) or to inform them of a new tool that could have an impact on those protocols (also offering cooperation). They did not acknowledge receipt. Several private institutions were also contacted. None has responded either.