key: cord-0301672-u1uz2w7t authors: Hunt, Martin; Swann, Jeremy; Constantinides, Bede; Fowler, Philip W; Iqbal, Zamin title: ReadItAndKeep: rapid decontamination of SARS-CoV-2 sequencing reads date: 2022-01-21 journal: bioRxiv DOI: 10.1101/2022.01.21.477194 sha: 68a19ebadcca415685020632345b4419aa17ab19 doc_id: 301672 cord_uid: u1uz2w7t Summary Viral sequence data from clinical samples frequently contain human contamination, which must be removed prior to sharing for legal and ethical reasons. To enable host read removal for SARS-CoV-2 sequencing data on low-specification laptops, we developed ReadItAndKeep, a fast lightweight tool for Illumina and nanopore data that only keeps reads matching the SARS-CoV-2 genome. Peak RAM usage is typically below 10MB, and runtime less than one minute. We show that by excluding the polyA tail from the viral reference, ReadItAndKeep prevents bleed-through of human reads, whereas mapping to the human genome lets some reads escape. We believe our test approach (including all possible reads from the human genome, human samples from each of the 26 populations in the 1000 genomes data, and a diverse set of SARS-CoV-2 genomes) will also be useful for others. Availability and implementation ReadItAndKeep is implemented in C++, released under the MIT license, and available from https://github.com/GenomePathogenAnalysisService/read-it-and-keep. Since experimental isolation of viral DNA from the host is imperfect, viral sequence data is frequently contaminated with host DNA. Removal of host sequence is a first step for many analyses, and where the host is human this is essential to safeguard patient anonymity. Typical approaches [1] either map reads directly to the host genome (eg using BWA MEM [2] , Bowtie2 [3] ), or use a metagenomics classifier (eg Kraken2 [4] ) to assign each read to a species. However in some circumstances (such as a global pandemic following a recent zoonosis) the viral genome is known and of limited diversity, opening up the possibility of positively identifying viral reads by mapping to a reference. In this paper we develop a simple tool that scans sequence data and retains only that which maps to the viral genome. By rigorously testing both theoretically and with human data from diverse global populations predating the pandemic, we are able to give convincing evidence that mapping to a modified SARS-CoV-2 reference is sufficient to guarantee removal of human data. The tool, named ReadItAndKeep, is extremely fast and requires very little RAM -typically a few MB as compared with around 10GB for methods based on mapping to the human genome. This allows read decontamination locally on a standard laptop before uploading to a shared or public server for analysis, or depositing in read archives. ReadItAndKeep is implemented in C++, using the API of minimap2 to match reads to a target genome. Hits from minimap2 are used without performing full alignment (equivalent to minimap2 default command line options, reporting approximate mappings). A read is retained if it has a match that is at least 50bp or is at least 50% of the length of the read (these are default values of user-specifiable parameters). In the case of paired reads, a pair is kept if either of the reads have a suitable match. ReadItAndKeep uses the minimap2 presets "short read" or "ont" for Illumina and Oxford Nanopore Technology (ONT) reads respectively (same as command line options -x sr or -x map-ont). Retained reads are written to gzipped FASTQ file(s). We compared ReadItAndKeep with a standard approach of removing reads matching the human genome. We benchmarked against the tool Dehumanizer (https://github.com/SamStudio8/ dehumanizer), which wraps mappy/minimap2, with its recommended reference comprising the human genome GRCh38 plus decoy and HLA sequences (details in Supplementary text). We refer to this collection of references as the "human reference" throughout. Complete benchmarking results are shown in Supplementary Table 1, summarised in Table 1 , and described below. Evaluation of human read removal: we first checked that ReadItAndKeep should in principle remove human reads, using all 75-mers and 150-mers of the human reference as "reads", with the target genome SARS-CoV-2 MN908947.3. Only 90,469 (0.001%) 75-mers were retained, all of which matched the 33bp poly-A tail of MN908947.3. Since this tail provides no useful information and is excluded by SARS-CoV-2 amplicon sequencing, we removed if from the viral genome for all further analysis. Using this trimmed sequence as the target, all tested k-mers were removed by ReadItAndKeep. Dehumanizer retained 0.76% of the 75bp reads, and 0.03% of the 150bp reads (Table 1) . We then measured the success at human read removal on 27 Illumina runs from the expanded 1000 genomes project [5] : the well-studied sample NA12878 plus one sample from each of the 26 populations, originating from Africa, Asia, Europe and the Americas (Supplementary Table 2) . A high depth run of ONT reads from NA12878 was also tested [6] . Note that all of these samples were sequenced years before the SARS-CoV-2 virus jumped into humans, and so we assume that all reads in these datasets should be excluded. Across all these samples, ReadItAndKeep retained zero reads, but Dehumanizer kept 1.8% Illumina and 10% ONT reads (Table 1) . Further investigation of the 10% showed they were heavily enriched for very low quality and repetitive reads, with multi-kb softclipped regions. Quantification of SARS-CoV-2 read retention: we confirmed that all 75-mers and 150mers from the SARS-CoV-2 reference genome were retained by Dehumanizer and ReadItAnd-Keep. Next, a set of genetically diverse samples was collated, comprising 246 Illumina and 189 ONT sequencing runs, chosen (see Supplementary text) to maximise unique protein mutations and ensure a range of lineages as assigned by Pangolin [7] . Dehumanizer retained >99.99% of reads, and ReadItAndKeep kept >99.99% of ONT reads and 99.89% of Illumina reads (Table 1) . For diagnostic purposes, those reads excluded by ReadItAndKeep were then mapped to the SARS-CoV-2 genome using Bowtie 2 [3] with the --very-sensitive-local option. The excluded reads were highly enriched for low quality -with either a very short match or high error rate (see Supplementary Figure 1, Supplementary Table 3 ). The greatest loss in mean per-base depth was 0.21% for ONT and 1.87% for Illumina (238/246 Illumina samples had mean loss <1%) (Supplementary Table 3 ). We conclude this loss of a tiny volume of low quality reads would not affect downstream analyses. There are broadly three options for decontaminating SARS-CoV-2 datasets: exclude reads mapping to human (as done by Dehumaniser), keep reads mapping to the virus (as done by Rea-dItAndKeep), or do both (first map to the virus, and then exclude any of that also map to human, as is done by the COG consortium). We have shown that, by trimming the poly-A tail from the SARS-CoV-2 genome used by ReadItAndKeep, we completely remove spurious matches of human reads. Thus ReadItAndKeep offers an approach that is more reliable than just mapping to the human genome, and lighter weight (low RAM, fast) than either of the other two approaches. We also investigated using ReadItOnKeep for Influenza A and HIV-1 samples, which are known to be significantly more diverse than SARS-CoV-2. Although all human reads were removed, the method was not effective in retaining viral reads, in extreme cases rejecting more than half. Therefore we only recommend ReadItAndKeep for viruses with low levels of diversity -our focus was SARS-CoV-2. Finally, one challenge for implementing pathogen sequencing in healthcare systems is justifying what proportion of human reads must be removed to guarantee non-identifiability. By explicitly testing with all possible 75bp and 150bp reads in the (extended) human reference genome, and 27 human genome samples from different global ancestries, we were able to show ReadItAndKeep excluded every single human read. We hope the benchmarking approach itself will be of use, and that the speed and low resource requirements will make ReadItAndKeep of wide utility. MH, ZI designed the study and wrote the paper. MH developed the software and performed analyses. BC, JS added bioconda, docker. PF collected the viral dataset. Walker. Evaluation of methods for detecting human reads in microbial sequencing datasets Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM Fast gapped-read alignment with Bowtie 2 Improved metagenomic analysis with Kraken 2 The Human Genome Structural Variation Consortium Nanopore sequencing and assembly of a human genome with ultra-long reads Assignment of Epidemiological Lineages in an Emerging Pandemic Using the Pangolin Tool