key: cord-0431286-aon463ev
authors: Katz, Kenneth S.; Shutov, Oleg; Lapoint, Richard; Kimelman, Michael; Brister, J. Rodney; O’Sullivan, Christopher
title: A fast, scalable, MinHash-based k-mer tool to assess Sequence Read Archive next generation sequence submissions
date: 2021-04-26
journal: bioRxiv
DOI: 10.1101/2021.02.16.431451
sha: 73d8b0b8613c9ab091e7477ea4ea31ca614b223a
doc_id: 431286
cord_uid: aon463ev

Sequence Read Archive submissions to the National Center for Biotechnology Information often lack useful metadata, which limits the utility of these submissions. We describe a scalable k-mer based tool for fast assessment of taxonomic diversity intrinsic to submissions, independent of metadata. We show that our MinHash-based k-mer tool is accurate and scalable, offering reliable criteria for efficient selection of data for further analysis by the scientific community, at once validating submissions while also augmenting sample metadata with reliable, searchable, taxonomic terms.

inherently large, and improved technologies are exquisitely sensitive to contamination. Submissions must be processed, before either interpretation or quality assessment is possible, to provide submitter feedback and submission verification. The growth of data submission is exponential (doubling approximately every 12 months (2)), rendering use of computationally expensive methods, such as de novo assembly followed by alignment, impractical due to costs and limits of scale, particularly given the time constraint of submission processing.

We considered that questions about the quality of a given NGS run could reasonably be inferred from the taxonomic distribution of reads within that set, whether based on a single organism or of metagenomic design. This is often enough information to answer basic experimental or clinical questions, as well as inform decisions about the merit of subsequent resource-intensive assessment methods. Read sets with organismal tags can be used to select data for further analysis. Moreover, binning reads into taxonomic buckets can identify contaminating reads and reads outside of the stated experimental scope. Such identified reads can be filtered from a sample before downstream processing. This proposed taxonomic analysis is independent of metadata and intrinsic to the run, capable of both validating submissions and augmenting sample metadata with reliable, searchable, taxonomic terms.

Following these principles, we developed a k-mer-based Sequence Taxonomic Analysis Tool (STAT). Based on MinHash (3) , and inspired by Mash (4), STAT employs a reference k-mer database built from available sequenced organisms to allow mapping of query reads to the NCBI taxonomic hierarchy (5) . We use the MinHash principle to compress the representative taxonomic sequences by orders of magnitude into a k-mer database, followed by a process that yields a set of diagnostic k-mer hashes for each organism. This allows for significant coverage of taxa with a minimal set of diagnostic k-mers. Our results show STAT is a reliable method for examining submitted NGS data in a timely, and scalable, manner.

STAT was developed for quality assessment of SRA submissions to be shared with the submitter, requiring that analyses ideally take no more time than that of existing submission processing, while minimizing resource usage. Our design starts from the MinHash principle that a random selection of the lowest valued constituent blocks in a pool after hashing represents a signature of the parent object. In building k-mer databases from the set of sequences assigned a specific NCBI taxonomy id (TaxId), we read the 32 base pair (bp) k-mers as 64-bit hashes, selecting the minimum value representative for a window, then iteratively merging k-mers from taxonomic leaves to roots (see Methods, Figures 1, 2) . Initial analysis using only densely populated k-mer databases performed well.

However, despite being on average over an order of magnitude smaller than the input sequence database size (see below), we determined that loading the entire densely merged "tree_filter.dbs" into memory for analysis unnecessarily incurred long I/O read time and large memory costs since most runs required only a fraction of the complete database. Moreover, STAT jobs, like many computational pipelines, are submitted to either a local computer farm cluster scheduler ("grid engine"), or by dispatching cloud-based virtual machines. In both cases job scheduling typically requires explicit needed resource declarations such as CPU and memory. An initial screen capable of evaluating diversity of the sample and necessary resource requirements for detailed analysis minimizes cost and maximizes computational efficiency. For these reasons we pursued a selective two-step analysis, using a sparse filtering database in the first step to identify the presence of any (a) eukaryote if there are more than 100 biological reads of a species, (b) bacteria, or archaea with more than 10 biological reads, and (c) virus if there are 1-2 biological reads. This first pass is neither qualitative, nor exhaustive, but allows us to quickly identify taxa for focus in the second pass ( Figure 3 ).

To facilitate this two-step process, and further minimize resource requirements, we decreased k-mer database size by 33% by storing only the 8-byte k-mers in a database file, separately storing pairs of TaxId, total TaxId k-mer count for each TaxId respectively in an auxiliary "annotation" file. The k-mer database / k-mer count annotation file pair is designated "dbss", the database sorted by TaxId, with each TaxId set sorted by k-mer. TaxIds identified in the first step against the sparse k-mer database are used in the second step to load into memory only those TaxId k-mer hashes using the counts provided by the annotation file as offsets.

MinHash sampling combined with dynamic loading of only necessary dense TaxId database k-mer hashes yields significant benefits for cpu and memory requirements. Further, the selection of TaxIds to load may be augmented by heuristics, such as purposely withholding TaxIds from contamination detected in the prior filtering step.

STAT reports the distribution of biological reads mapping to specific taxonomic nodes as a percentage of total biological reads mapped within the analyzed run.

Since results are proportional to the size of sequenced genomes, a mixed sample containing several organisms at equal copy number is expected to find more reads originating from the larger genomes. This means that percentages reported likely reflect sample genome size(s), and must be considered by the user against the genomic complexity of the sequenced sample.

Like all sequence-based classification schemes, STAT reflects and depends upon accurately encapsulating evolutionary paths. The significant achievements of adapting both the NCBI RefSeq data model (6) , and internationally accepted taxonomy to incorporate metagenomic viral sequence (7, 8) fundamentally benefit STAT and other similar classification tools.

An important consequence of merging in k-mer database construction is to avert complications caused by biological complexities. For example, most k-mers derived from endogenous retroviruses found in the human input reference genome will likely merge to the root as those k-mers would also be found in the Viruses Super Kingdom.

Further, when analyzing results each level -read, run -requires integration of less than ideal signals. It is common to find multiple TaxIds identified in a single biological read, ideally coherent for a given lineage. Were those Mus musculus, Murinae, and Mammal, there is confidence in declaring the read Mus musculus.

Should a read map to multiple, related taxonomic nodes, it is reported as originating from the most proximal shared taxonomic node. For example, a read with hits to sibling species may be reported as their common genus, conservatively locating the most proximal common node before ambiguity ( Figure 4 ) . Likewise, such conservative heuristics are required when integrating the signals from all biological reads to report the run. If the run subject is a single organism, it is expected that STAT would identify taxonomic nodes across the lineage, and that the number of reads mapping to higher level nodes will be more than those mapping to terminal nodes.

STAT was designed as a tool for assessing the quality of any SRA submission and has matured into a tool that also significantly enhances user comprehension. Many k-mer tools were created for the purpose of metagenomic taxonomic assignment (9) during STAT development. Taxonomic classifiers balance speed, accuracy, and memory requirements. While STAT was neither primarily developed for metagenomic analyses, nor as a tool for distribution, the same concerns apply.

Using MinHash to sample and save at most 1 out of every 64 k-mers generated from input sequences yields k-mer databases 1-2 orders of magnitude smaller than the parental reference nucleotide database from which they were derived. For example, currently the We compare STAT accuracy to Kraken 2 using the strain exclusion test as described by Wood et al. (11) . STAT shows the identical accuracy of Kraken 2 for both bacteria and virus (see Figure 5 ). As expected, STAT sensitivity is notably dampened as we chose to sample the widest taxonomic breadth. Our desire for conservative taxonomic assertions is further reflected by STAT never yielding a false positive in accuracy test results (data not shown).

We found it unnecessary to apply the same selection of k-mer hash minimums from query sequences to compose a similarity index (3, 4) , instead of exact k-mer (hash) matching. We show that accuracy is robust, while still reflecting our conservative bias in taxonomic assignment. Though similar in performance to Kraken 1 input speed (21.6 million reads/minute), and runtime (132.5 seconds) characteristics, STAT (maximum resident set size 830304 kilobytes) required only 8% and 1% of the memory needed by Kraken 1 and Kraken 2, respectively (11) .

Unsurprisingly, the accuracy test (see Methods) required additional time for extracting the requested TaxId k-mers on demand. Maximum resident set size during the accuracy test was approximately an order of magnitude greater than Kraken 2 (11, data not shown), despite loading a k-mer database 20 times the "strain_excluded" FASTA file size (3.9 gb) and over 100 times "strain_excluded.dbs" size (545 megabytes (mb)).

We provide two symmetrical examples of expected and unexpected contamination that illustrate STAT effectiveness. allowing curators to contact submitters to alert, and correct, the source of contamination. 1 We use the word "spot" to reference either the un-split paired biological read, or the single unpaired biological read.

As lower cost significantly expanded human genome sequencing, awareness rose of potential personally identifying information residing in public repositories (14) .

Large efforts employing NGS to diagnose and monitor human health, or detect pathogenic outbreaks such as SARS-CoV-2, caused clinical sample submitters to worry about the inclusion of human sequence. As a counterpart to the previously discussed contamination example, we sought a STAT-based tool to find and remove unavoidable human sequence reads in clinical pathogen samples.

We began by building a k-mer database using human reference sequences withholding the iterative merging previously described. The majority (approximately 80%, see Methods) of k-mers derived represent conserved ancestral sequences, but our goal here is to aggressively identify human sequences. We then subtracted any k-mer also found in the merged kingdom databases Viruses, and Bacteria to protect against spurious false positive hits targeting clinical pathogens. After testing several window sizes, we found optimal performance using a segment of 32 bp (twice as dense as our standard taxonomy database).

Because unintended contamination is never uniform, we chose different ends of the expected spectrum of human content for testing (see Table 1 ). Two RNA_Seq runs were derived from bronchoalveolar lavage fluid taken from suspected SARS-CoV-2 patients. The wash of the lower respiratory tract from a patient suffering an active infection is expected to contain patient immune cells, sloughed patient epithelial cells, lung microbiota, and suspect clinical pathogens. Each run contains over five million spots, and though starting with approximately 85% Eukaryotic content, less than 10% of the spots remain after scrubbing for human sequence.

The observation that a 3% selection of all possible human-derived 32 bp k-mers identifies over 92% of a random selection of likely human spots validates using MinHash and underscores its efficiency. These examples present a difficult test, and we identify 5-6% of the remaining spots as human.

Unlike the previous examples, amplicon-directed sequencing of pathogens is expected to contain less unintended human content, as can be seen in Table 1 . In both cases, 0.1% or less spots were removed, while among those remaining, 0.01% or fewer spots were identified as human. In no case was there any deleterious loss of the intended target signal (see Additional file 2, S5 Taxonomic Table 1 ) are highly significant alignments to related primates with approximately 20% sharing the same low eValue for all members (see Additional file 2, S1-S4). These likely represent conserved regions unfavored for SNP location (18) . Table 1 Summary of results including those found in Additional file 2 (S1-S4).

We define "Human Spots" as those where all hits (up to top five) are identified as human with eValue < 1e -10. "Conserved Lineage Spots" are those containing a human top hit (lowest eValue) though not the exclusive organism of hits with eValue < 1e -10, and where all spot hits have either identical eValue or the greatest has eValue < 1e -14. "Total Length of Human Spot Alignments" is the sum of all the top alignments for all human spots maining.

STAT has provided a successful framework for our SRA NGS submission pipeline. Sometimes actual sample content may be unknown, and submitted metadata are often incomplete and of poor quality (19, 20 we contribute a framework others may find useful, and offer the collection of tools to use freely. to explore this implementation in STAT, and look forward to reporting results in the future.

STAT refers to a collection of tools for building k-mer databases, querying those databases, and reporting results of our SRA submission pipeline using the former.

Details described below are based on our standard pipeline settings.

STAT uses 32 bp k-mers (i.e., k=32) for database generation, and as the unit for comparison. The majority of unaligned SRA data are reads between 60 and 150 bp in length, with mean error rate of 0.18% (16) : such reads can be expected to yield many correct 32 bp k-mers for reliable identification. While reducing from 32 bp k-mers to 16 bp k-mers decreases the size of resulting databases, there is significant loss of specificity (10 ^9) per k-mer. By comparison, using 64 bp kmers are extraordinarily more selective, but database size becomes impractical.

Finally, with each base encoded in 2 bits, 32 bp k-mers fit fully, and compactly in a 64-bit integer.

Two primary types of k-mer databases are constructed (as described below): a dense database that selects one k-mer per 64 bp segment ( "tree_filter"), of input sequence, while a sparse database ("tree_index") selects one k-mer per 64 bp (Virus), 8000 bp (Eukaryota), and 2000 bp (Bacteria, and Archaea) segment respectfully, noting that segment size is roughly proportional to genome size.

k-mers are selected using an iterative approach derived from MinHash (3) . To compose STAT databases, for every fixed length segment ("window") of incoming nucleotide sequence, a list of overlapping k-mers (effectively segment length plus right k-1 bp "wings") is generated from both strands. The 32 bp k-mers are encoded using 2 bits per base into 64 bits (8 bytes), then a minimum k-mer representing this segment is selected based on the 64-bit encoded k-mer value read as 64-bit integer, effectively a 64-bit hash ( see Figure 1 ) .

Construction of k-mer databases is guided by the NCBI Taxonomy Database (5) NCBI Taxonomy Id (TaxId) , and represent leaves on these trees. These lineage relationships are represented in a two-column file referred to as "parents", wherein each node TaxId (first column) reports its parent node TaxId (second column).

All sequences (see Database Input Sequences) attached to a particular TaxId are input to k-mer database generation using segment ("window") sizes as described.

For each input set of sequences assigned a TaxId, the immediate output is a dictionary that contains the set of unique 32 bp k-mers derived as described (we designate this "db" file extension). Each dictionary is further transformed into a binary file that encodes every 32 bp k-mer as an 8-byte (64-bit) integer using 2 bits per base, followed by its TaxId represented in a 4-byte (32-bit) integer. Thus each k-mer record is stored as one 12-byte pair (k-mer, TaxId) in a database file designated with "dbs" file extension, sorted by k-mer for binary search optimization.

Next, using the taxonomic node relationships (found in the "parents" file), starting from the leaves we recursively merge each binary ("dbs") file representing a unique set of k-mers derived from a single TaxId to sibling(s), then parent nodes.

Each sibling leaf is merged such that k-mers specific to a leaf remain as diagnostic of that TaxId, while those found in neighboring (sibling species) leaves are moved up ("merged") to the common parent node TaxId (see Figure 2 ). This process results in a single merged database file ("tree_filter.dbs") representing all k-mers assigned a TaxId.

While it is difficult to generalize, we note that when the process of merging is complete, approximately 20% of the Homo sapiens 32 bp k-mers remain as unique to human; that is, 80% were not diagnostic for the species, and instead merged up the eukaryotic tree.

Database generation can be accomplished using any of the build_index* tools (see github), and each takes parameters for window size, and k-mer size. The process of merging is accomplished using merge_db.

We use NCBI B L A S T ® "refseq_genomes" database (29) , supplemented with viral sequences extracted from the BLAST ® "nt/nr" database as input source for taxonomy identification in both sparse ("index") and dense ("filter") k-mer databases (30) . Viral records are extracted from "nt/nr" by loading only sequences assigned a TaxId whose lineage root is the Super Kingdom "Viruses".

To query a k-mer database, input SRA accessions, or FASTA sequence is used to generate the unique set of query 32 bp k-mers read as 64-bit integer hashes for finding identical hashes (and assigned TaxId) from the designated k-mer database using the tool aligns_to. Proximal results are counts for each specific taxonomic kmer hit (see Results, Figure 3 ). Passed an SRA accession, STAT built with NCBI NGS library support will retrieve query sequences, and aligns_to option -unaligned_only is available to limit analysis to the unaligned reads found in the SRA object.

We determined the need to delete low complexity k-mers composed of >50% homo-polymer or dinucleotide repeats (e.g. AAAAAA or ACACACACACA). This is accomplished using filter_db. We have also investigated "dusting" input sequences (31) and found it complementary to filtering, though it is not used at this time in our pipeline.

STAT performance metrics were gathered as described in Wood et al. (see "Execution of strain exclusion experiments", and "Evaluation of accuracy in strain exclusion experiments" in Methods, 11). A "dense" k-mer database was created using the excluded taxa sequences for input (11) . Briefly, we used Mason 2 (32) to generate 500,000 simulated Illumina 100 bp paired reads for each excluded strain TaxId, and collected cpu, and memory using ram-disk storage of the simulated reads employing 16 threads (16 Intel® Xeon® 2.8 GHz CPUs 64 GB RAM). Accuracy was measured using aligns_to against "tree_filter.dbss" (see Results) with a list of all TaxIds excluding the 50 strains tested (130,769 TaxIds total).

Submissions suggesting contamination with SARS-CoV-2 identified through normal processing were subject to two further verification methods. All identified accessions were rerun using the current SARS-CoV-2 Detection tool (25, DockerHub Tag 1.1.2021-01-25, see Additional file 1). Low level contamination (1 spot, 1 or 0 resolved hits) observed in 31 records was further examined using STAT against a SARS-CoV-2 -specific database ("dbs") composed of 32-bp kmers identified by Wahba et al. (33) . Using these 18582 SARS-CoV-2 -specific k-mers as queries never found a matching k-mer when run against our full tree_filter.dbs (data not shown).

The special-purpose k-mer database uses NCBI BLAST ® "refseq_genomes" limited to Human (TaxId 9606) for input using a "window" segment of 32 bp and filtered as described previously. Any k-mers found also in the merged Kingdom databases of Bacteria, and Viruses were removed. The current database contains 80,143,408 k-mers and is 612 mb in size. The sra-human-scrubber is intended as the last step before submission and takes as input a "fastq file", and outputs a "fastq.clean file" in which all reads identified as potentially of human origin are removed (34) .

Examples discussed in Results and shown in Table 1 were run against the sra- 

Archiving next generation sequencing data

International Nucleotide Sequence Database Collaboration. The Sequence Read Archive: explosive growth of sequencing data

Identifying and filtering near-duplicate documents

Mash: fast genome and metagenome distance estimation using MinHash

NCBI viral genomes resource

Consensus statement: Virus taxonomy in the age of metagenomics

A sea change for virology

A review of methods and databases for metagenomic classification and assembly

Kraken: ultrafast metagenomic sequence classification using exact alignments

Improved metagenomic analysis with Kraken 2

Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2): Emergence, history, basic and clinical aspects

Novel coronavirus disease (Covid-19): The first two patients in the UK with person to person transmission

Re-identifiability of genomic data and the GDPR: Assessing the re-identifiability of genomic data in light of the EU General Data Protection Regulation

Genomic research and human subject privacy

Systematic evaluation of error rates and causes in short samples in nextgeneration sequencing

Perspectives on Human Variation through the Lens of Diversity and Race

SNPs occur in regions with less genomic sequence conservation

MetaSRA: normalized human samplespecific metadata for the Sequence Read Archive

Jupyter notebook-based tools for building structured datasets from the Sequence Read Archive. F1000Res

NIH Office of Data Science Strategy

The FAIR Guiding Principles for scientific data management and stewardship. Sci Data

SRA in the cloud

NCBI National database of antibiotic resistant organisms (NDARO)

NCBI Sequence Read Archive (SRA)

SRA detection tool

How independent are the appearances of n-mers in different genomes?

KrakenUniq: confident and fast metagenomics classification using unique k-mer counts

Taxonomy Statistics

Taxonomy Nodes (all dates). Available from

NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy

A fast and symmetric DUST implementation to mask low-complexity DNA sequences

Mason -a read simulator for second generation sequencing data

An Extensive Meta-Metagenomic Search Identifies SARS-CoV-2-Homologous Sequences in Pangolin Lung Viromes. mSphere

NCBI sra-human-scrubber Docker image

Database indexing for production MegaBLAST searches

pair segment, the series of 64 possible 32 base-pairs k-mers is defined by sequentially shifting the 32-base window by one base. The first four, and last of the possible k-mers is shown schematically. b An example selected kmer sequence is shown followed by: first the two-bit encoding of that same k-mer sequence; finally the 64-bit decimal value of the encoded k-mer

See Methods for details. a Before merging two sibling species are depicted containing both unique and shared k-mers (indicated in bold).. b After merging the two shared k-mers "merge" up to the genus level

 Figure 3 . STAT two phase query. a In the first qualitative phase the input query (an SRA accession, or fasta file) is sequentially rendered into 32 bp k-mers, and matches to the decimal hash values found in the sparse database identifying taxa for deeper analysis. b The TaxIds identified are used to compose the dense database using the same query in the second quantitative pass. 

• Additional file 1.xlsx, Microsoft Excel: The first sheet (S1) contains results from accessions using SARS-CoV-2 detection tool as described in Methods; the second sheet (S2) contains those accessions from S1 subject to verification using STAT as described in Methods.