key: cord-0581862-0c4f9ht8 authors: Ghiasi, Nika Mansouri; Park, Jisung; Mustafa, Harun; Kim, Jeremie; Olgun, Ataberk; Gollwitzer, Arvid; Cali, Damla Senol; Firtina, Can; Mao, Haiyu; Alserr, Nour Almadhoun; Ausavarungnirun, Rachata; Vijaykumar, Nandita; Alser, Mohammed; Mutlu, Onur title: GenStore: A High-Performance and Energy-Efficient In-Storage Computing System for Genome Sequence Analysis date: 2022-02-21 journal: nan DOI: nan sha: 06a58ed462246f145794bd300420272fc1e8681d doc_id: 581862 cord_uid: 0c4f9ht8 Read mapping is a fundamental, yet computationally-expensive step in many genomics applications. It is used to identify potential matches and differences between fragments (called reads) of a sequenced genome and an already known genome (called a reference genome). To address the computational challenges in genome analysis, many prior works propose various approaches such as filters that select the reads that must undergo expensive computation, efficient heuristics, and hardware acceleration. While effective at reducing the computation overhead, all such approaches still require the costly movement of a large amount of data from storage to the rest of the system, which can significantly lower the end-to-end performance of read mapping in conventional and emerging genomics systems. We propose GenStore, the first in-storage processing system designed for genome sequence analysis that greatly reduces both data movement and computational overheads of genome sequence analysis by exploiting low-cost and accurate in-storage filters. GenStore leverages hardware/software co-design to address the challenges of in-storage processing, supporting reads with 1) different read lengths and error rates, and 2) different degrees of genetic variation. Through rigorous analysis of read mapping processes, we meticulously design low-cost hardware accelerators and data/computation flows inside a NAND flash-based SSD. Our evaluation using a wide range of real genomic datasets shows that GenStore, when implemented in three modern SSDs, significantly improves the read mapping performance of state-of-the-art software (hardware) baselines by 2.07-6.05× (1.52-3.32×) for read sets with high similarity to the reference genome and 1.45-33.63× (2.70-19.2×) for read sets with low similarity to the reference genome. Genome sequence analysis, which analyzes the DNA sequences of organisms, is important for many applications in personalized medicine [1-8], outbreak tracing [9-14], and evolutionary studies [15-21]. The information in an organism's DNA is converted to digital data via a process called sequencing. A sequencing machine extracts the sequences of DNA molecules from the organism's sample in the form of strings consisting of four base pairs (bps), denoted by A, C, G, and T. No current sequencing technology has the capability to read a human DNA molecule in its entirety. Instead, state-of-the-art sequencing machines generate randomly sampled, inexact sub-strings of the original genome, called reads. The information about the corresponding location of each read in the complete genome is lost during sequencing in most technologies. State-of-the-art sequencing machines produce one of two kinds of reads. 1) Short read sequencing technologies, such as Illumina [22, 23], produce reads that are highly accurate (e.g., 99.9%) [24-26], but short (e.g., up to a few hundred DNA base pairs [24, 27, 28]).
2) Long read sequencing technologies, such as Pacific Biosciences (PacBio) [29] and Oxford Nanopore Technologies (ONT) [30], produce reads that are less accurate (85-90%) [27, 31-33], but long (e.g., lengths ranging from thousands to millions of base pairs [34]). Many genomics applications that involve the comparison of the genomic reads to a reference genome require a fundamental initial process, called read mapping. Read mapping identifies potential matching locations of reads against a reference genome [35, 36] and is a very computationally-costly process [37-46] due to two key challenges. First, it uses computationally-expensive algorithms involving approximate string matching (ASM) [40-42, 44, 45, 47-52]. A read is aligned to a reference genome if the read is sufficiently similar to one or more subsequences in that reference genome. Reads generated by a sequencing machine might have differences compared to the reference genome due to either errors in the sequencing process or genetic variations [37, 53-55]. ASM is widely used in existing read mappers to accurately account for such potential differences when determining the similarity between each read and the reference genome. Second, read mapping performs large amounts of expensive ASM computation because the genomic read datasets contain many reads (e.g., millions of reads), and each read requires ASM computation on multiple subsequences in the reference genome (see Section 2.1 for more details). Since read mapping is a key performance bottleneck in genome sequence analysis applications, there has been significant effort toward improving read mapping performance via both algorithmic and system optimizations. Many prior works propose efficient heuristics for ASM [56-60], hardware accelerators [35, 39, 40, …], and various filters that try to efficiently and accurately prune reads that do not require expensive computation [35-38, 40-43, 45, 48, 49, 94-98]. For example, filters can be used to quickly prune reads that have exact matches in the reference genome. Pruned reads do not go through the expensive ASM process, which improves read mapping performance and efficiency [35, 36, 40-43, 45, 48, 49, 94, 96]. While prior works improve read mapping performance, to our knowledge, none of them consider the I/O cost that most systems must pay to read the large amount of data from the storage system to main memory and computation units. Read mapping incurs unnecessary data movement from the storage system for large amounts of low-reuse data. For example, while existing filters prune many reads to avoid expensive computation, they still need to first read the entire read set from the storage system, even though a large fraction of the reads would be filtered out and not be reused in the later stages of the read mapping process. The unnecessary data movement from the storage system can bottleneck read mapping performance in both conventional (software-based) and emerging (hardware-accelerated) genomics systems, while having a larger impact on emerging systems that greatly reduce the computation bottlenecks of ASM (e.g., [35, 39, 40, 43, 61-64]). In-storage filtering can be a fundamental solution for reducing the cost of the unnecessary data movement in read mapping. Our motivational study using an ideal in-storage filter for read mapping (Section 3) demonstrates that in-storage processing can greatly accelerate the end-to-end read mapping process.
This is because in-storage filtering not only avoids unnecessary data movement from storage, but also eliminates the computational burden of the filtering process from the rest of the system. Our goal in this work is to improve the performance of genome sequence analysis by effectively reducing unnecessary data movement from the storage system. To this end, we propose GenStore, the first in-storage processing system designed for genome sequence analysis. The key idea of GenStore is to exploit low-cost in-storage accelerators to accurately filter out the reads that do not require the expensive ASM computation in read mapping, thereby significantly reducing unnecessary data movement from the storage system to main memory and processors. We identify two key challenges in designing an efficient in-storage system for read mapping. First, read mapping workloads exhibit fundamentally different behavior due to 1) the varying read properties such as read length and error rates, which highly depend on the sequencing technology, and 2) the genetic variation of reads compared to the reference genome, which highly depends on the genomes that are being compared. Second, existing filtering methods incur a large number of random accesses to large datasets, which is challenging for a modern NAND flash-based solid-state drive (SSD) to cope with due to its poor random-access performance and limited size of internal DRAM. (In this work, we focus on SSDs based on NAND flash memory, the prevalent memory technology in modern storage systems. We expect that GenStore would also provide performance and energy benefits with storage devices that are built using emerging non-volatile memory technologies.) We address these challenges with hardware/software co-design in three key directions. First, based on our detailed analysis of read mapping, we design two different accelerators that can accelerate a wide range of read mapping applications for reads with different properties (lengths and error rates) and genetic variations. Each accelerator filters a large fraction of genomic read datasets using simple operations. Second, we develop storage technology-aware algorithmic optimizations to replace expensive random accesses with more efficient sequential accesses to storage devices (e.g., NAND flash-based SSDs). Third, we carefully design an efficient technique for data placement inside the storage device that takes full advantage of the high internal SSD bandwidth to concurrently access large amounts of genomic data. We design GenStore to support two in-storage filtering mechanisms in a single SSD: 1) GenStore-EM and 2) GenStore-NM. GenStore-EM filters exactly-matching reads, i.e., reads that exactly match subsequences of a reference genome. Due to the low error rates of short reads, a large fraction of short reads map exactly to the reference genome [43, 55, 99]. For example, on average 80% of human short reads map exactly to the human reference genome [43, 55, 99]. However, finding exactly-matching reads in the SSD is challenging, as it incurs a number of random accesses per read to a large index structure that stores unique subsequences of length k (called k-mers) and their positions in the reference genome. Since each read consists of many k-mers, filtering each read requires several random accesses to the index. To avoid such random accesses, we introduce a new sorted, read-sized k-mer index structure, which enables sequential scanning of the read set and the new index, with only one index lookup per read during filtering. GenStore-NM filters most of the non-matching reads, i.e., reads that would not align to any subsequence in the reference genome.
In read mapping, a significant fraction of reads might not align to the reference genome due to 1) the high sequencing error rate (in long reads) and/or 2) high genetic variation (in both short and long reads). For example, both short and long read sets sequenced from rapidly-evolving viral samples (such as SARS-CoV-2) can have high genetic variations compared to the reference genome, leading to, on average, 36% (up to 99.9%) of reads not aligning to the reference [100]. To avoid expensive ASM operations for such non-matching reads, state-of-the-art read mappers commonly employ a step called chaining, which calculates a similarity score for each read (called chaining score) to the reference and filters out reads with a low score. GenStore-NM uses this basic idea of chaining to build an in-storage filter. Calculating a chaining score for a read inside the SSD is challenging since it requires performing an expensive dynamic programming algorithm on the read's k-mers that exactly match the reference. This is particularly challenging for long reads since they have a large number of k-mers per read. To avoid such expensive computation, we selectively perform chaining only on reads with a small number of exactly-matching k-mers and send other reads to the host system for full read mapping (including chaining). Selective chaining is effective because 1) a read with many exactly-matching k-mers most likely aligns to the reference genome and thus does not require in-storage filtering, and 2) selective chaining can filter many non-aligning reads, without requiring costly hardware resources in the SSD. To evaluate GenStore, we use a combination of synthesized Verilog models of our in-storage accelerators and state-of-the-art simulation tools that are widely used for DRAM and SSD research, Ramulator [101] and MQSim [102]. To assess the performance impact of the storage system, we evaluate three GenStore-enabled systems with different SSD configurations (low-end, medium-end, and high-end). We integrate GenStore into a state-of-the-art software read mapper (Minimap2 [58]) and two state-of-the-art hardware read mappers (GenCache [43] for short reads and Darwin [39] for long reads). Our results show that GenStore-EM and GenStore-NM improve the performance of Minimap2 by 2.07-6.05× and 1.45-33.63×, respectively, with no accuracy loss. GenStore-EM improves the performance of GenCache by 1.52-3.32×, and GenStore-NM improves the performance of Darwin by 2.70-19.2×, with no accuracy loss. This work makes the following key contributions: • We introduce GenStore, the first in-storage processing system designed for genome sequence analysis. GenStore fundamentally addresses the high I/O cost of reading low-reuse genomic data from storage systems. • We address the challenges of in-storage filtering for genome sequence analysis by analyzing the read mapping process and performing hardware/software co-design to develop in-storage filtering mechanisms and accelerators for genomic reads with various lengths, error rates, and genetic variations. • We introduce two in-storage accelerators, 1) GenStore-EM for filtering exactly-matching reads and 2) GenStore-NM for filtering most reads that would not align to any subsequence in the reference genome.
GenStore filters out a large fraction of reads with lightweight hardware accelerators and no loss of accuracy, thereby improving the end-to-end performance and energy efficiency of genome sequence analysis. We provide brief background on read mapping and NAND flash-based SSDs, necessary to understand the rest of the paper. End-to-End Workflow of Genome Sequence Analysis. There are three key initial steps in a standard genome sequencing and analysis workflow [35, 36]. The first step is the collection, preparation, and sequencing of a DNA sample in the laboratory. Modern sequencing machines are unable to read an organism's genome as a single complete sequence; instead, they generate shorter subsequences sampled randomly from the genome sequence [103, 104]. The second step is basecalling, which converts the representation of the subsequences generated by the sequencing machine (e.g., images or electric current, depending on the sequencing technology [30, 53]) into reads, which are sequences of nucleotides (i.e., A, C, G, and T in the DNA alphabet). In order to reproduce the complete genome sequence from the shorter read sequences, the third step, called read mapping, identifies potential matching locations of each read with respect to a known reference genome (e.g., a representative genome sequence for a particular species) [4, 36, 40, 58]. Genomic read sets can be obtained by, for example, 1) sequencing a DNA sample and storing the generated read set into the SSD of a sequencing machine [105, 106] or 2) downloading read sets from publicly available repositories [107] and storing them into an SSD. In this work, we focus on optimizing the performance of read mapping because sequencing and basecalling are performed only once per read set, whereas read mapping can be performed many times for the same read set. This is common in many genomic applications for two reasons. First, some applications require analyzing the genetic differences between a read set belonging to an individual and many reference genomes of other individuals. Examples of such applications include measuring the genetic diversity in a population [108, 109] and determining the donor of a sample by quantifying the reads that have a match in each reference genome [110, 111]. Second, some other applications require repeating the read mapping step many times to improve the outcome of read mapping. Examples of such applications are 1) mapping with newer, more up-to-date reference genomes [44], or 2) using different mapping parameter values (such as the maximum number of allowed differences between a read and a subsequence in the reference genome so that they are considered similar) [112]. Improving read mapping performance is critical since it is a fundamental step used in almost all genomic analyses that use sequencing data [35, 39, 40, 45]. The contribution of read mapping to the entire analysis pipeline varies depending on the application. For example, read mapping takes up to 1) 45% of the execution time when discovering sequence variants in cancer genomics studies [113], and 2) 60% of the execution time when profiling the species composition of a multi-species (i.e., metagenomic) read set [110]. Read Mapping Process. Since the sequencing process does not provide location information for both short and long reads in most technologies, read mapping is a fundamental initial process for many genomics applications. The read mapping process identifies subsequences in the reference genome to which the input reads match.
For each matching location, i.e., the location of each matching subsequence in the reference genome, a read mapper computes an alignment score, indicating the degree of similarity between the read and the region of the reference to which the read aligns. Matching base pairs between the read and the reference increase the alignment score, whereas edits (i.e., base pair mismatches, insertions, or deletions relative to the reference) decrease this score. Since each read is much shorter than the reference genome (e.g., the human reference genome contains ∼3.2 billion base pairs), a read mapper typically uses an index of the reference genome to reduce the search space for each read. The index is a dictionary, i.e., a key-value store, where the keys are unique k-length subsequences (called k-mers) extracted from the reference genome, and the values are the exactly-matching locations of these k-mers in the reference genome [37, 114]. The value of k is fixed during indexing and used for all subsequent steps (k is typically between 11 and 31 [58, 115, 116], depending on the application). To greatly reduce the storage overhead of the index and speed up queries against it, without significantly changing the final outcome of read mapping, some read mappers index only a subset of reference genome k-mers called minimizers [117-119]. A minimizer is a representative k-mer of a set of k-mers according to a scoring mechanism. For example, some read mappers [58, 120] calculate hash values for all k-mers in a window of consecutive k-mers from an input sequence, and mark the k-mer with the smallest hash value as the minimizer k-mer. Read mapping is a three-step process. In the first step (seeding), the read mapper queries the index structure to determine potential locations in the reference genome where the read could map. To do so, the read mapper looks up every minimizer k-mer fetched from a read in the reference index. If the minimizer k-mer hits in the reference index, the read mapper marks the locations of such a k-mer in the reference genome as the read's potential matching locations, also called seeds. In the second step (seed filtering and/or chaining), the read mapper prunes those potential matching locations in the reference to which the read would not align. If all of the potential matching locations of the read get filtered, the read mapper discards the read from further analysis. The read mapper uses a dynamic programming (DP) algorithm to 1) merge overlapping seeds into longer regions, called chains [58], and 2) calculate their corresponding chaining scores, which approximate the entire read's alignment score in these regions. If the read mapper finds one or more chains with a sufficiently high chaining score (indicating a high degree of similarity to the reference genome), then the read mapper performs the third step. If the read has no chain with a sufficiently high score, the read mapper prunes the read and skips the third step. In the third step (sequence alignment), the read mapper determines the exact differences between a read and the reference genome at the potential matching locations. Sequence alignment is done with a computationally-expensive DP algorithm [35, 36, 40-42, 45, 48-51] to perform approximate string matching (ASM). Finally, the read mapper returns the locations in the reference genome with the best alignment scores for each read.
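To make the minimizer-based indexing and seeding described above more concrete, the following minimal Python sketch selects minimizer k-mers and looks them up in a reference index. The hash function and the toy k-mer/window sizes are illustrative assumptions, not the exact parameters or scoring scheme of any specific read mapper.

```python
# Minimal sketch of minimizer selection and seeding (illustrative only).
# For every window of w consecutive k-mers, the k-mer with the smallest
# hash value is kept as the window's minimizer.

def kmers(seq, k):
    """Return (position, k-mer) pairs of a DNA string."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]

def minimizers(seq, k=15, w=10, hash_fn=hash):
    """Return the set of (position, k-mer) minimizers of `seq`.

    `hash_fn` stands in for the scoring mechanism mentioned in the text;
    real mappers use carefully chosen hash functions.
    """
    kms = kmers(seq, k)
    selected = set()
    for start in range(len(kms) - w + 1):
        window = kms[start:start + w]
        # Pick the k-mer with the smallest hash value in this window.
        selected.add(min(window, key=lambda pk: hash_fn(pk[1])))
    return selected

# Toy usage: index the minimizers of a reference, then look up a read's
# minimizers to obtain candidate matching locations (seeds).
reference = "ACGTACGTTAGCCAGGCTTACGATCGATCGGATCCA"
index = {}
for pos, km in minimizers(reference, k=5, w=4):
    index.setdefault(km, []).append(pos)

read = "TAGCCAGGCTTACG"
seeds = [(rpos, refpos) for rpos, km in minimizers(read, k=5, w=4)
         for refpos in index.get(km, [])]
print(seeds)  # candidate (read position, reference position) pairs
```

The chaining and alignment steps then operate only on these seeds, which is why seeding is the entry point for all of the filters discussed next.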
Pre-Alignment Read Filtering. To mitigate the high performance overhead of alignment, read filtering approaches are widely used. Read filters can be incorporated at any stage of the process before alignment. There are two main filter types. The first filter type [37, 38, 42, 45, 48, 49, 96] aims to efficiently filter potential matching locations in the reference genome that lead to a large number of edits (larger than a user-defined threshold) between the read and the reference genome at those locations. Doing so avoids a costly alignment step for potential locations at which the read would not match the subsequences of the reference genome. The second filter type [43] aims to detect if a read matches a subsequence of the reference genome with no edits (i.e., exact-match) or very few (e.g., 1-5) edits. Reads that satisfy this requirement are guaranteed to align to at least one location in the reference genome without requiring the costly read alignment process. This filter type is particularly effective for read sets with a large number of exactly-matching reads (e.g., 80% in human short read sets [43, 55, 99]). While both filter types reduce computation overhead, they still require a large number of random memory accesses for each read, similar to the baseline read mapper. In a typical read set of several gigabytes, read filters incur several random accesses per read 1) to the reference index for seeding, and potentially, 2) to the reference genome to compare the read with the subsequence of the reference genome at each potential matching location. Figure 1 depicts the internal organization of a modern NAND flash-based solid-state drive (SSD) that consists of three main components: 1) NAND flash packages, 2) an SSD controller, and 3) DRAM. NAND Flash Memory. A NAND package comprises multiple dies (also called chips) that share the package's I/O pins. One or more packages share command/data busses (called channels) to connect to the SSD controller. Dies sharing the same channel can operate independently of each other, but only one die can communicate with the SSD controller (e.g., for data transfer) at a time via the shared channel. A die has multiple (e.g., 2 or 4) planes. Each plane contains thousands of blocks. A block includes hundreds to thousands of pages, each of which is 4-16 KiB in size. NAND flash memory performs read/write operations at page granularity but erase operations at block granularity. Planes in the same die share the peripheral circuitry used to access pages; as such, they can concurrently operate only when accessing pages (or blocks) at the same offset, which are called multi-plane operations. SSD Controller. An SSD controller has two main components: 1) multiple cores to run SSD firmware, commonly called the flash translation layer (FTL), and 2) per-channel hardware flash controllers for request handling and error-correcting codes (ECC) for the underlying NAND flash chips. The FTL is responsible for communication with the host system, internal I/O scheduling, and various SSD management tasks required for hiding the unique characteristics of NAND flash memory from the host system. For example, a page of NAND flash memory needs to first be erased before it is programmed, so the FTL always performs out-of-place updates by writing the new data of a logical page to a new physical page that was erased previously. To this end, the FTL maintains logical-to-physical (L2P) address mappings for reads and performs garbage collection to reclaim new physical pages for writes.
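The FTL's out-of-place update policy described above can be illustrated with a minimal sketch. The class, data structures, and function names below are hypothetical simplifications for illustration, not an actual FTL implementation (a real FTL also handles block-granularity erases, wear-leveling, mapping persistence, and error handling).

```python
# Minimal sketch of out-of-place updates in a toy FTL (hypothetical structures).

class SimpleFTL:
    def __init__(self, num_physical_pages):
        self.num_pages = num_physical_pages
        self.l2p = {}                                  # logical page -> physical page
        self.free = list(range(num_physical_pages))    # erased physical pages
        self.valid = set()                             # physical pages holding live data

    def write(self, logical_page):
        """Out-of-place update: always program a previously erased physical page."""
        if not self.free:
            self.garbage_collect()
        new_phys = self.free.pop()
        old_phys = self.l2p.get(logical_page)
        if old_phys is not None:
            self.valid.discard(old_phys)               # the old copy becomes stale
        self.l2p[logical_page] = new_phys
        self.valid.add(new_phys)

    def read(self, logical_page):
        """Reads are redirected through the L2P mapping."""
        return self.l2p[logical_page]

    def garbage_collect(self):
        """Reclaim stale pages (grossly simplified: ignores block-granularity erases)."""
        self.free = [p for p in range(self.num_pages) if p not in self.valid]
```

In a real SSD, maintaining this L2P table for the entire device is what consumes most of the internal DRAM, as quantified in the next paragraph.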
Internal DRAM. A modern SSD employs large low-power DRAM (e.g., 4GB LPDDR4 DRAM for a 4TB SSD [121]) to store metadata for SSD management tasks. Most of the DRAM capacity is used to store the L2P mappings for address translation. It is common practice to maintain the L2P mappings at 4KiB granularity to provide high random access I/O performance [122, 123], so in a 32-bit architecture, the memory overhead for the L2P mappings is approximately 0.1% of the SSD capacity (4 bytes per 4KiB of data). SSD I/O Bandwidth. To mitigate the large performance gap between main memory and the storage system, SSD manufacturers increase the external bandwidth of SSDs by employing advanced I/O interfaces between the host system and SSDs. For example, while older SATA3 SSDs provide around 500MB/s sequential-read bandwidth [121, 124], state-of-the-art PCIe-Gen4 SSDs can provide significantly higher sequential-read bandwidth, up to 8 GB/s (e.g., 7 GB/s in Samsung PM1735 [125]). A modern SSD's internal bandwidth (i.e., I/O bandwidth between NAND flash chips and the SSD controller) is usually higher than its external bandwidth (i.e., I/O bandwidth between the host and the SSD). For example, a recent enterprise SSD controller [126] supports 6,550MB/s external bandwidth and 19.2GB/s internal bandwidth (16 channels, each with a bandwidth of 1.2 GB/s). Over-provisioning the internal bandwidth is reasonable since 1) a modern SSD needs to perform various internal management tasks (e.g., garbage collection [127-131] and wear-leveling [129, 132]), and 2) a higher channel count reduces contention between requests by interleaving data between the channels. We perform experimental studies to understand the potential of efficient in-storage accelerators for improving the performance of genome sequence analysis applications. Read Mappers. We evaluate five read mapping systems, each of which adopts different optimization techniques to accelerate read mapping: 1) Base uses Minimap2 [58], a state-of-the-art software tool for read mapping. 2) SW-filter extends Minimap2 to filter out exactly-matching reads (i.e., reads that exactly match subsequences in one or more locations in the reference genome) using simple single-instruction-multiple-data (SIMD) operations, without requiring costly ASM operations. 3) Ideal-ISF uses an ideal In-Storage Filter (ISF) that can concurrently filter out exactly-matching reads inside the SSD while the host CPU performs read mapping for unfiltered reads. 4) ACC uses a state-of-the-art hardware accelerator for short read mapping, GenCache [43]. 5) Ideal-ISF+ACC uses an ideal in-storage filter (ISF) that can concurrently filter out exactly-matching reads inside the SSD while a hardware accelerator (ACC) performs read mapping for unfiltered reads. System Configuration. To assess the impact of the storage subsystem on end-to-end application performance, we evaluate each of the five systems with four different configurations: 1) a low-end SSD (SSD-L) [124] with a SATA3 interface [133], 2) a mid-end SSD (SSD-M) [134] using a PCIe Gen3 M.2 interface [135], 3) a high-end SSD (SSD-H) [125] with a PCIe Gen4 interface [136], and 4) a system where all of the processed data is pre-loaded to DRAM with no performance cost for pre-loading (DRAM), as the idealized case where storage I/O overheads are completely eliminated (we do not evaluate DRAM for Ideal-ISF and Ideal-ISF+ACC since using in-storage processing is contradictory to pre-loading all the data to main memory).
We assume 8 channels for SSD-L and 16 channels for SSD-M and SSD-H, where the maximum bandwidth per channel is 1.2 GB/s. The maximum internal bandwidth is calculated as 1.2 GB/s × channel count. The external bandwidth of SSD-L, SSD-M, and SSD-H for sequential reads is 500 MB/s, 3.5 GB/s, and 7 GB/s, respectively. Hence, the internal bandwidth of SSD-L, SSD-M, and SSD-H for sequential reads is 19.2×, 5.48×, and 2.74× that of its external bandwidth, respectively. We evaluate Base and SW-filter by running Minimap2 on a high-end server (AMD EPYC 7742 CPU [137] with 1TB DDR4 DRAM). We simulate the performance of the other three systems using our simulation environment that faithfully models system components including DRAM and storage devices (see Section 5). We map all reads of a short read dataset against the human reference genome, where 80% of the reads have one or more exactly-matching subsequences in the reference genome [55, 99]. Key Features of an Ideal In-Storage Filter. We assume two key features for Ideal-ISF and Ideal-ISF+ACC. First, I/O overheads due to limited external SSD bandwidth are completely eliminated for filtered reads. Second, the system provides high in-storage filtering performance such that the filtering process can concurrently run in the SSD, and the latency of this filtering process is fully hidden by the read mapping of unfiltered reads in the host CPU (Ideal-ISF) or hardware accelerator (Ideal-ISF+ACC). We assume that the accelerator or the CPU streams through the input reads in batches and analyzes a batch concurrently with reading the next batch. Thus, the execution time of Ideal-ISF (+ACC) can be modeled as follows:

Execution Time (Ideal-ISF(+ACC)) = I/O-Ref + max(I/O-Unfiltered, RM-Unfiltered)    (1)

where I/O-Ref, I/O-Unfiltered, and RM-Unfiltered are the latency of reading the reference genome from the SSD into main memory, the latency of reading the unfiltered genomic reads from the SSD, and the latency of read mapping of the unfiltered reads, respectively. For a given input size, RM-Unfiltered varies depending on the computation unit used for read mapping (i.e., the host CPU or accelerator), while the I/O latency values only depend on the SSD configuration. Figure 2 shows the execution time of read mapping in the five evaluated systems, each with four different storage subsystem configurations. We make four key observations. Observation 1. The ideal in-storage filter provides significant performance improvements over other systems. Ideal-ISF significantly outperforms Base and SW-filter (by 3.12× and 2.21×, respectively), and Ideal-ISF+ACC provides a large speedup (2.78×) over ACC, when they all use SSD-H. These large improvements are due to two key benefits provided by the ideal in-storage filter: 1) mitigation of data movement from the storage devices and 2) removal of the burden of filtering out 80% of the input read set from the rest of the system, including processors and main memory. To distinguish the effects of these two benefits, we analyze an ideal outside-storage filter (Ideal-OSF) that provides only the second benefit; this filter concurrently runs with the read mapper and fully overlaps the filtering process with the read mapping process of unfiltered reads. The execution time of Ideal-OSF (+ACC) can be formulated as follows:

Execution Time (Ideal-OSF(+ACC)) = I/O-Ref + max(I/O-All-Reads, RM-Unfiltered)    (2)

where I/O-All-Reads is the latency of reading all genomic reads from the SSD into main memory. Using SSD-H, Ideal-OSF+ACC leads to an execution time of 1.15 seconds, which is 60% slower than Ideal-ISF+ACC. This is because I/O-All-Reads is significantly larger than both RM-Unfiltered and I/O-Unfiltered (in Equation (1)).
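A minimal sketch of the two latency models above (Equations (1) and (2), as reconstructed from the surrounding definitions) may help make this comparison concrete. All numbers below are illustrative placeholders, not the paper's measured values.

```python
# Minimal sketch of the latency models in Equations (1) and (2).
# All inputs are illustrative placeholder latencies (seconds), not measured values.

def t_ideal_isf(io_ref, io_unfiltered, rm_unfiltered):
    """Equation (1): filtering runs in the SSD; only unfiltered reads cross the I/O
    link, and their transfer overlaps with read mapping on the host CPU/accelerator."""
    return io_ref + max(io_unfiltered, rm_unfiltered)

def t_ideal_osf(io_ref, io_all_reads, rm_unfiltered):
    """Equation (2): filtering runs outside the SSD, so *all* reads must still be
    transferred, even though the filtering itself overlaps with read mapping."""
    return io_ref + max(io_all_reads, rm_unfiltered)

# Example with hypothetical numbers: once an accelerator makes computation cheap,
# the outside-storage filter becomes bound by reading the whole read set.
print(t_ideal_isf(io_ref=0.15, io_unfiltered=0.3, rm_unfiltered=0.5))
print(t_ideal_osf(io_ref=0.15, io_all_reads=1.0, rm_unfiltered=0.5))
```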
The remaining observations dive deeper into the effects of the I/O bottleneck on each read mapping system. Observation 2. In Base and SW-filter, using high-end SSDs significantly improves read mapping performance over low-end SSDs, effectively reducing the storage performance bottleneck that exists in low-end SSDs. For example, using SSD-H instead of SSD-L reduces the execution time of Base and SW-filter by 24% and 38%, respectively, showing comparable performance to DRAM, where all the data is pre-loaded to main memory (i.e., no I/O accesses). This is because, by using SSD-M and SSD-H, the performance bottleneck of the application shifts to parts of the system other than I/O (e.g., CPU or main memory). This observation shows that I/O has a significant impact on application performance, but this impact can be alleviated at the cost of expensive storage devices and interfaces. Note that, while SSD-M and SSD-H provide an order-of-magnitude higher bandwidth for sequential reads compared to SSD-L, it is challenging to scale a storage system's capacity using high-end SSDs due to their significantly higher prices and the relatively small number of PCIe slots in a server. Observation 3. Even though SW-filter outperforms Base, its filtering process is slow. Potentially, SW-filter could provide significant performance benefits over Base for two reasons: 1) as explained, 80% of reads in the dataset exactly match the reference genome, so only 20% of the reads need to undergo the costly ASM computation; 2) exact-match filtering requires only simple computation, i.e., the SIMD XOR operations used by SW-filter. However, even with DRAM, SW-filter's speedup over Base is only 41%. The limited speedup is mainly due to the large number of random memory accesses concurrently issued from all threads to the reference index (explained in Section 2.1). This observation highlights the potential of in-storage filtering. Even though both SW-filter and Ideal-ISF filter out the same fraction of reads, the filtering process outside the SSD must compete with the read mapping process for the resources in the system (e.g., the limited main memory bandwidth). In contrast, filtering of reads inside the SSD (where the reads originally reside) can remove the burden of filtering from the rest of the system. Observation 4. With a hardware accelerator (ACC), using the state-of-the-art SSD (SSD-H) does not fully alleviate the storage bottleneck, showing 23% longer execution time compared to when all the data is pre-loaded to main memory (DRAM). While using SSD-M and SSD-H in Base and SW-filter shifts the bottleneck away from I/O, ACC turns I/O into a bottleneck again. This is because ACC greatly reduces the computational bottleneck, which increases the relative effect of the storage subsystem on the end-to-end execution time. The ACC and Ideal-ISF+ACC results clearly show that data movement between the storage devices and the hardware accelerator, which has not been properly considered in prior read mapping accelerators [39, 40, 43, 61, 62, 65, 70-77], can significantly bottleneck the potential benefits of the accelerator.
Comparison to Other Near-Data Processing Systems. Even though read mapping applications could also benefit from other near-data processing (NDP) approaches such as processing-in-main-memory (PIM) [45, 65, 140, 141] or processing-in-caches [43], in-storage processing can fundamentally address the data movement problem by filtering large, low-reuse data where the data initially resides. As an extreme example, even if an ideal accelerator achieved a zero execution time for read mapping by addressing all of the computation and main memory overheads, there would still exist the need to bring the data from storage to the accelerator. In our motivational study, even SSD-H takes at least 1.55 seconds to read the entire dataset, which is 2.15× slower than the execution time that Ideal-ISF+ACC provides (0.72 seconds). Thus, even though solutions such as processing-in-memory can improve read mapping execution times, they still need to pay the cost of data movement from the storage system to the main memory [39, 40, 45, 61, 65-69, 94]. Therefore, an in-storage filter can be further integrated with any read mapping accelerator, including PIM accelerators, to alleviate their data movement overhead. Based on our observations, we conclude that an efficient in-storage filter can be a key enabler for read mappers to achieve high performance in both conventional software-based (e.g., Base and SW-filter) and new hardware-accelerated (e.g., ACC) genomics systems. In particular, in-storage filtering enables the system to take full advantage of the high computation capability of hardware accelerators by fundamentally addressing the data movement bottleneck. Our goal is to design an in-storage filter for genome sequence analysis in a cost-effective manner. We have three key objectives in designing our new system. First, the system should provide high in-storage filtering performance to overlap the filtering with the read mapping of unfiltered data (as Ideal-ISF does in our motivational study). Second, it should support reads with 1) different properties (e.g., lengths and error rates) and 2) different degrees of genetic variation in the compared genomes. Third, it should not require significant additional hardware overhead, e.g., complicated logic circuits or large SRAM/DRAM memory. We propose GenStore, the first in-storage processing system tailored for genome sequence analysis. GenStore greatly reduces both data movement and computational overheads of genome sequence analysis by exploiting low-cost and accurate in-storage filters. GenStore supports reads with different properties (lengths and error rates) and different degrees of genetic variation in the compared genomes. We primarily design GenStore as an in-storage accelerator, which is an extension of the existing SSD controller and flash translation layer (FTL). GenStore is designed to be integrated into the system such that, when the accelerator is not in use, the entire storage device is available to all other applications, just like in a general-purpose system today. The key idea of GenStore is to exploit low-cost in-storage accelerators to accurately filter out the reads that do not require the expensive alignment step in read mapping and thus significantly reduce unnecessary data movement from the storage system to main memory and processors. Figure 3 shows the overall architecture of GenStore and how it interacts with the host system.
GenStore employs two types of hardware accelerators: 1) a single SSD-level accelerator and 2) channel-level accelerators, each of which is dedicated to a channel. The GenStore-FTL (3) communicates with the host system and manages the metadata and data flow over the SSD hardware components (i.e., NAND flash chips, internal DRAM, and in-storage accelerators). Once the host system indicates that the SSD should start analysis as required by a read mapping application (1 in Figure 3), GenStore prepares for operation as an accelerator (2). It flushes the conventional FTL metadata necessary to operate as a regular SSD (e.g., L2P mappings [128]), while loading the GenStore metadata necessary for each use case (Section 4.4 provides more details on the GenStore FTL). After finishing the preparation, GenStore starts the filtering process. It keeps concurrently reading the data to process from all NAND flash chips (3) via multi-plane operations (i.e., it exploits the SSD's full internal bandwidth, which is much higher than the I/O bandwidth between the SSD and the host system [142, 143]), while filtering out reads (4) that do not have to undergo further analysis (e.g., ASM computation). Doing so is possible due to multiple channel-level accelerators that provide computational throughput matching the SSD's internal bandwidth even for the most complicated computation required for filtering. The host system performs further computation as soon as GenStore sends unfiltered reads (5), which removes GenStore's filtering process almost completely from the critical path of the application. As explained in Section 2.2, most of the internal DRAM is occupied by the regular L2P mapping. Therefore, flushing the regular L2P mapping data into NAND flash memory enables GenStore to exploit most of the GB-scale DRAM (e.g., 4GB DRAM in a 4TB SSD [121]) for its operations, which significantly reduces the overhead of additional internal DRAM that might otherwise be required to store the GenStore metadata necessary for the filtering process. We carefully design the GenStore filtering algorithms to only sequentially access the underlying NAND flash chips, so GenStore requires only a small amount of metadata to access the stored data. Therefore, GenStore can use most of the internal DRAM space for such metadata. We envision that all GenStore metadata are built offline by the host or some other system (e.g., by the sequencing machine when the read set or reference genome are initially stored to the SSD). Constructing GenStore metadata is a one-time preprocessing step that can be performed independently of the read mapping process, while the result of the preprocessing step can be used multiple times for different genomics applications. There exist two main challenges in designing GenStore as an efficient in-storage filter for read mapping. First, the behavior and data-access patterns in read mapping significantly vary depending on the read properties (length and error rate) and genetic variation between the compared genomes. Second, hardware resources (e.g., CPU and DRAM) are quite limited even in modern high-end SSDs. We address these challenges via thorough hardware/software co-design tailored for filtering 1) exactly-matching reads, i.e., reads that exactly match subsequences of the reference genome (Section 4.2), and 2) most of the non-matching reads, i.e., reads that would not align to any subsequence of the reference genome (Section 4.3). Overview.
GenStore-EM accelerates read mapping by using an efficient in-storage filter for reads that have at least one exact match in the reference genome. Due to the low error rates of short reads, combined with low genetic variation between the compared genomes, a large fraction of short reads map exactly to the reference genome [43, 55, 99]. For example, on average 80% of human short reads map exactly to the human reference genome [43, 55, 99]. Since exact-match detection is computationally cheaper than ASM, concurrently filtering exactly-matching reads inside the SSD can significantly improve the runtime of read mapping (as we demonstrate in Section 3). Note that GenStore-EM is not applicable to long reads due to their greater length. For example, for a 10K base pair-long human read, even with a zero sequencing error rate, the probability of the read exactly matching a subsequence in the reference genome is very low (e.g., < 3.6 × 10^-6) due to natural genetic variation. Key Challenges. The key challenge in designing GenStore-EM is the large number of random accesses to large data structures inside the SSD. As explained in Section 2, identifying exact matches for read mapping requires a number of random accesses for each k-mer in a read to two large data structures: 1) a large k-mer index, to find potential matching locations of each k-mer in the reference genome, and 2) the reference genome, to find candidate matching sequences at the candidate matching locations in the reference genome. Handling random accesses to large data structures is challenging for NAND flash-based SSDs for two reasons. First, NAND flash memory exhibits poor performance for random access reads. Second, we cannot use in-SSD DRAM to store the large data structures that are randomly accessed, since even in high-end SSDs, the size of internal DRAM is relatively small (e.g., 4 GB [121]) compared to the size of the data structures that GenStore-EM needs to handle (e.g., 7 GB for the human reference genome and its index [58]). Key Idea. The key idea in GenStore-EM is to sequentialize most data accesses via a carefully designed metadata structure and data layout. To do so, we design a new sorted, read-sized k-mer index structure. This index enables sequential scanning of the read set and the new index, with only one index lookup per read while performing filtering. Figure 4 shows the key idea of GenStore-EM with a simplified example in which each short read consists of three base pairs (bps); in a realistic scenario, the read length is much larger (e.g., 150 bps). Suppose that we have two data structures: 1) a sorted read table (SRTable), each entry of which stores a read and its unique ID, and 2) a sorted k-mer index (SKIndex), which contains all unique read-sized k-mers of the reference genome, along with each k-mer's corresponding locations in the reference genome. Each data structure is sorted by read/k-mer in alphabetical order. With these two data structures, it is possible to identify each read's exactly-matching locations in the reference genome by streaming both reads and k-mers through a simple comparator. We use two pointers, one into SRTable and one into SKIndex, which point to the current entries we are examining in each structure. We sequentially advance the two pointers in three different ways based on the comparison result of the current read and k-mer. First, when the current read and k-mer are identical (1 in Figure 4), we record the read as an exactly-matching read and advance both pointers. Second, if the read is alphabetically larger than the k-mer (2), we conclude that the k-mer does not match any read in SRTable and advance the SKIndex pointer (so that we can examine whether the next k-mer matches the read). Third, if the k-mer is alphabetically larger than the read (3), we conclude that the read does not match any k-mer in SKIndex. We record the read as not an exact match and advance the SRTable pointer (so that we can examine the next read).
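A minimal Python sketch of this sorted-scan comparison, under the simplified string representation of Figure 4 (raw reads and raw read-sized k-mers; the actual design compares fingerprints, as described later), might look as follows.

```python
# Minimal sketch of GenStore-EM's sorted-scan filtering, using the simplified
# representation of Figure 4: reads and read-sized reference k-mers are compared
# directly as strings (the real design compares fingerprints instead).

def exact_match_filter(sorted_reads, sorted_kmers):
    """Return (filtered, unfiltered) read lists given alphabetically sorted inputs."""
    filtered, unfiltered = [], []
    r, k = 0, 0  # pointers into SRTable and SKIndex, respectively
    while r < len(sorted_reads):
        if k >= len(sorted_kmers) or sorted_reads[r] < sorted_kmers[k]:
            # Case 3: the k-mer is larger (or SKIndex is exhausted):
            # the read has no exact match; send it to the host.
            unfiltered.append(sorted_reads[r])
            r += 1
        elif sorted_reads[r] == sorted_kmers[k]:
            # Case 1: exact match; filter the read inside the SSD.
            filtered.append(sorted_reads[r])
            r += 1
            k += 1
        else:
            # Case 2: the read is larger; this k-mer cannot match the current
            # or any later read, so advance the SKIndex pointer.
            k += 1
    return filtered, unfiltered

# Toy usage with 3-bp reads, as in Figure 4.
srtable = sorted(["ACG", "CCT", "GGA", "TTT"])
skindex = sorted(["ACG", "CGT", "GGA", "GTA"])
print(exact_match_filter(srtable, skindex))
# -> (['ACG', 'GGA'], ['CCT', 'TTT'])
```

Because both structures are consumed strictly in sorted order, a hardware comparator can process them as two sequential streams, which is what makes this filter a good fit for the SSD's sequential-read bandwidth.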
GenStore-EM's two data structures and filtering algorithm enable an exact-match filter highly suitable for in-storage processing. First, since these two data structures are only sequentially accessed, the filtering process can be done in a streaming manner, leveraging the high sequential read bandwidth of NAND flash memory. Second, we can easily perform exact-match detection of a read and a read-sized k-mer with simple comparator logic and fully pipeline the filtering process with sequential access to the data structures. A read-sized k-mer index increases the total amount of accessed data for read mapping. The reason is that a read-length value of k (e.g., k=150) significantly increases the number of unique k-mers, compared to the k values commonly used in conventional read mappers (e.g., k=15). For example, the size of an index structure for all unique k-mers in the human reference genome is 21 GB when k=15, while the size increases to 126 GB when k=150. However, our proposal (i.e., a large yet sequentially-accessed SKIndex) is feasible and desirable for in-storage processing due to the large capacity and high internal bandwidth of modern NAND flash-based SSDs. Design of GenStore-EM. Figure 5 illustrates the overall operational flow of GenStore-EM, which consists of two steps: Step 1 (data fetching) and Step 2 (exact-match filtering). GenStore-EM uses the two data structures explained in Section 4.2.1: 1) a sorted read table (SRTable) for storing the read set and 2) a sorted k-mer index (SKIndex) for storing read-sized k-mers from the reference genome. Step 1 reads the two data structures from NAND flash chips to the SSD's internal DRAM in a batched manner (1 in Figure 5). Step 2 performs exact-match filtering within each read batch, using simple comparator logic in the SSD-level accelerator (2). Steps 1 and 2 are performed in a pipelined manner. During filtering, GenStore-EM sends the unfiltered reads to the host system for full read mapping. This enables the concurrent filtering of reads in the SSD and read mapping of the unfiltered reads in the host system. Data Structures. We carefully design SRTable and SKIndex to minimize the performance and storage overheads of GenStore-EM, by extending the two data structures described in Figure 4 in two aspects. First, both SRTable and SKIndex contain a strong hash value (e.g., SHA-1 [145] or MD5 [146]) of each read and read-sized k-mer, respectively, which is used as both the sorting criterion of the data structures and a fingerprint of each read and k-mer that is used by the comparator logic. Second, unlike the data structures described in Figure 4, SKIndex no longer contains the k-mers of the reference genome but only the fingerprints of the k-mers. Using strong hash values enables GenStore-EM to determine exact matches between a read and a k-mer by comparing only their fingerprints, which provides two benefits. First, it reduces the storage overhead of SKIndex by obviating the need to store the raw read-sized k-mers. (We design SRTable to store the raw reads so that we can transfer unfiltered reads to the host for full read mapping after we detect them as non-exactly-matching reads.)
For the human reference genome, the size of the optimized SKIndex (which stores fingerprints instead of read-sized k-mers) is 32 GB when the read size is 150 bps, which is 3.9× smaller than the size of the unoptimized SKIndex. Second, using fingerprints significantly reduces the performance overhead of exact-match detection by avoiding comparisons of reads and k-mers that are hundreds of bytes in size. Note that such exact-match detection does not affect the accuracy of read mapping due to the extremely low collision rate of strong hash functions. Even in an extremely rare case of a hash collision, the impact of the collision on GenStore's accuracy will be negligible [147], since the DNA information loss due to the falsely filtered read will highly likely be compensated by other reads generated from the neighboring locations in the DNA, which almost fully overlap with the falsely filtered read but have totally different hash values. This is because it is common practice to sequence each DNA fragment several times (i.e., with high coverage) to improve the accuracy of downstream genetic analyses [26, 53, 54, 148]. We envision that all GenStore data structures are built offline by the host or some other system (e.g., by the sequencing machine when the read set or reference genome are initially written to the SSD). This preprocessing overhead can be hidden by two essential initial steps of the genome sequence analysis pipeline: 1) sequencing and 2) basecalling. For example, in the current highest-throughput Illumina NovaSeq 6000 sequencer [149], sequencing and basecalling work in a pipelined manner and generate genomic read data at a limited throughput of 18.9 MB/s [149]. We analyze the throughput of GenStore's preprocessing step (i.e., generating hash values and sorting reads) and observe that even a personal laptop [150] can provide 174 MB/s of preprocessing throughput. Therefore, GenStore's preprocessing can be done in a pipelined manner with sequencing/basecalling, without decreasing the overall throughput of these steps. This low preprocessing overhead can be further amortized since the preprocessed data can be reused multiple times in different read mapping experiments.
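The offline preprocessing described above (fingerprinting and sorting) can be sketched as follows. SHA-1 is used here only because the text names it as one possible strong hash, and the record layout (plain tuples) is an illustrative assumption rather than GenStore's exact on-flash format.

```python
# Minimal sketch of GenStore-EM's offline preprocessing: compute a strong hash
# (fingerprint) per read and per read-sized reference k-mer, then sort both
# tables by fingerprint. The record layout is illustrative only.
import hashlib

def fingerprint(seq: str) -> bytes:
    # SHA-1 is one of the strong hashes mentioned in the text.
    return hashlib.sha1(seq.encode()).digest()

def build_srtable(reads):
    """SRTable entries keep the raw read (needed for unfiltered reads) plus its
    fingerprint, sorted by fingerprint."""
    table = [(fingerprint(r), read_id, r) for read_id, r in enumerate(reads)]
    return sorted(table, key=lambda e: e[0])

def build_skindex(reference, read_len):
    """SKIndex entries keep only the fingerprint of each unique read-sized
    reference k-mer and its locations (no raw k-mers), sorted by fingerprint."""
    locations = {}
    for pos in range(len(reference) - read_len + 1):
        fp = fingerprint(reference[pos:pos + read_len])
        locations.setdefault(fp, []).append(pos)
    return sorted(locations.items(), key=lambda e: e[0])

# Toy usage (read length 5 for illustration; ~150 bps in practice).
reference = "ACGTACGTTAGCCAGGCTTACG"
reads = ["TAGCC", "ACGTA", "GGGGG"]
srtable = build_srtable(reads)
skindex = build_skindex(reference, read_len=5)
```

Because both tables are sorted by the same fingerprint order, the in-storage filter can later consume them with the sorted-scan comparison shown earlier, now comparing fixed-size fingerprints instead of full sequences.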
Step 1. Data Fetching. GenStore-EM reads SRTable and SKIndex in batches, while exploiting the full internal bandwidth of the SSD. In Figure 5, we refer to each batch of SRTable as an SRTable batch and each batch of SKIndex as an SKIndex batch. The batch size is equal to the size of data that can be read in parallel by a multi-plane read operation for each chip (i.e., Number of Planes in the SSD × Page Size), which enables 100% utilization of the NAND flash chips while reading a batch. As shown in Figure 5 (bottom), GenStore-EM stores the SRTable and SKIndex in NAND flash chips in an interleaved manner so that each of the data structures can be sequentially and evenly distributed across all the NAND flash chips. GenStore-EM exploits the SSD's full internal bandwidth using double buffering. As shown in Figure 5, GenStore-EM employs two sets of batch buffers in the internal DRAM: SRTable Buffer (for SRTable) and SKIndex Buffer (for SKIndex), each of which can store two batches of the respective data structure. After Step 1 finishes fetching Batch #i, it proceeds to fetching Batch #(i+1), while Step 2 starts working on Batch #i. If the two steps work with the same throughput, we only need to buffer two batches for each data structure. For example, to enable double-buffering in an 8-channel SSD (with four 2-plane dies per channel and 16-KiB pages), the batch buffers require 8MB of DRAM space in total. Step 2. Exact-Match Filtering. Step 2 scans through each batch of SRTable and SKIndex stored in SRTable Buffer and SKIndex Buffer, respectively, comparing the fingerprints (strong hash values) with a simple hardware comparator. When F(R_i) > F(K_j) (where F(x) is the fingerprint of x, and R_i and K_j are the i-th read and j-th k-mer in the current batch, respectively), Step 2 scans SKIndex while increasing j until it finds a K_j such that F(R_i) ≤ F(K_j). If F(R_i) < F(K_j), it is guaranteed that no exact match exists for read R_i, so GenStore-EM sends R_i to the host for the read mapping process. When F(R_i) = F(K_j), i.e., the two fingerprints are identical, GenStore-EM marks the read as an exactly-matching read. Due to the simple computation in Step 2, the execution time of GenStore-EM is bottlenecked by Step 1 (Data Fetching). As explained, Step 1 only streams the data structures in batches from the underlying NAND flash chips to the internal DRAM, leveraging the SSD's full internal bandwidth. Therefore, the performance of GenStore-EM can be easily scaled up by increasing the SSD's internal parallelism (e.g., by deploying more channels or using low-latency NAND flash memory [122, 151, 152]). Overview. GenStore-NM filters most of the non-matching reads, i.e., reads that would not align to any subsequence in the reference genome. This is motivated by the fact that in read mapping, a large fraction of reads might not align to the reference genome due to 1) the high sequencing error rate (in long reads) and/or 2) high genetic variation between the compared genomes (in both short and long reads). To illustrate this, we analyze four read mapping use cases, one with read sets with high sequencing error rates, and three with high genetic variation in the compared genomes. Use cases with high genetic variation include 1) samples from rapidly-evolving species (such as SARS-CoV-2 [100]) that have high genetic variation compared to the reference genome, 2) samples with no known reference genomes, and 3) mapping a read set against the human reference genome to filter out human contamination, i.e., reads that have been sequenced from human-origin contaminant DNA in a non-human sample. For each use case (except for the Contamination use case), we analyze two different combinations of the input read set and the reference. Table 1 summarizes the result of this analysis by showing the input read sets, the properties of each read set (e.g., read length and dataset size), the reference genomes, and the percentage of reads in each read set that align to subsequences in the reference genome. We observe that a large fraction of reads (31.7%-99.6%) within a read set does not align to any subsequence in the reference genome. Quickly filtering this large fraction of non-aligning reads in the SSD can reduce the large data movement from the storage and the expensive ASM computation for reads that would not align. To avoid expensive ASM computation for reads that would not be aligned, state-of-the-art read mappers commonly employ a step called chaining (described in Section 2.1), which calculates each read's similarity score (called chaining score) to the reference genome and filters out reads with a low score. GenStore-NM uses this basic idea of chaining to build an in-storage filter. Key Challenges.
Calculating a chaining score in the SSD is challenging because finding the best chaining score requires performing an expensive dynamic programming (DP) algorithm on every potential matching location (i.e., seed) within a read (explained in Section 2.1). Normally, this operation has O(n^2) time and space complexity, where n is the number of seeds. Even though a chaining score can be approximated accurately in O(h·n) time and space by considering only h < 50 seeds at a time [58], we observe that some long reads can have up to thousands of seeds due to their length. Therefore, designing a chaining accelerator to perform an expensive DP algorithm on these large reads (n > 1000) can incur significant performance or area overheads. Key Ideas. To avoid such expensive chaining, GenStore-NM selectively performs a fast version of chaining only on reads with a small number of seeds and sends other reads to the host system for full read mapping (including complete chaining). This idea is based on our key observation that 1) a read with a large number of seeds most likely aligns to the reference genome and does not require in-storage filtering, and 2) selective chaining can filter many non-aligning long reads, without requiring costly hardware resources in the SSD. Figure 6 shows the alignment probability of a read in a long read dataset (SRR5413248 [161] in Table 1) to subsequences in the reference genome (NZ_NJEX02 [159]), as a function of the number of seeds per read (n). The average read length in this dataset is 10K base pairs, and the number of seeds goes up to several thousands for some reads. We observe that reads with a sufficiently large number of seeds are very likely to align to subsequences in the reference genome (e.g., at least 85% of reads with n ≥ 64 seeds align). Such reads can be directly sent to the CPU for full read mapping (bypassing the in-storage chaining-based filter). We extend this analysis to read datasets from various organisms commonly used in genomics studies: E. coli [159], yeast [162], thale cress [163], fruit fly [164], mouse [165], and human [144]. We observe that when a read has n = 64, 128, or 256 seeds, it aligns with average probabilities of 88.87%, 91.32%, and 93.84%, respectively. Based on these observations, we design GenStore-NM to selectively perform chaining only on reads with fewer seeds than a threshold, while sending reads with at least that many seeds to the host system. This selective chaining significantly reduces the chaining execution time and additional hardware area cost, while filtering most reads that would not align to the reference genome. Figure 7 shows the overview of GenStore-NM, which filters out most of the non-matching reads in three steps. In Step 1, GenStore-NM reads the input read set from the flash chips, generates minimizer k-mers for each read (as explained in Section 2.1), and looks up each minimizer in a K-mer Index (KmerIndex) to find the potential matching locations, i.e., seeds (1 in Figure 7). In Step 2, GenStore-NM counts the number of seeds in each read to decide if the read needs to go through chaining (2). To further improve overall performance, Step 2 also filters out reads with too few seeds (i.e., fewer than a lower bound), which would not align to the reference and thus would be filtered anyway by the baseline read mapper [58]. In Step 3, GenStore-NM filters reads based on their chaining scores using a fast and efficient chaining accelerator (3). If a read has at least one chain with a score above a specified threshold, it is sent to the CPU for mapping. Otherwise, the read is filtered. All three steps run in a pipelined manner.
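The three-step flow above can be summarized in a short sketch. The seed-count bounds and the chaining-score function below are illustrative placeholders (the actual chaining computation is formalized in Equation (3) later in this section).

```python
# Minimal sketch of GenStore-NM's per-read decision flow (Steps 1-3).
# `kmer_index`, the bounds, and `chaining_score` are illustrative placeholders.

def genstore_nm_decide(read_minimizers, kmer_index,
                       lower_bound=3, upper_bound=64,
                       chain_score_threshold=40, chaining_score=None):
    """Return 'filter' (drop in the SSD) or 'host' (send for full read mapping)."""
    # Step 1: seed finding -- collect seeds until the upper bound is reached.
    seeds = []
    for minimizer in read_minimizers:
        seeds.extend(kmer_index.get(minimizer, []))
        if len(seeds) >= upper_bound:
            break

    # Step 2: seed count-based filtering.
    if len(seeds) < lower_bound:
        return 'filter'            # too few seeds: cannot meet the mapper's minimum
    if len(seeds) >= upper_bound:
        return 'host'              # very likely to align: skip in-storage chaining

    # Step 3: chaining-based filtering (selective chaining on low-seed reads).
    if chaining_score is None:
        chaining_score = lambda s: 10 * len(s)   # placeholder scoring function
    return 'host' if chaining_score(seeds) >= chain_score_threshold else 'filter'

# Toy usage with a tiny hypothetical index mapping minimizers to reference positions.
kmer_index = {"ACGTT": [10, 250], "GGTAC": [400]}
print(genstore_nm_decide(["ACGTT", "GGTAC", "TTTTT"], kmer_index))
```

Only reads that fall between the two bounds ever reach the in-storage chaining hardware, which is what keeps the accelerator small.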
Figure 7 shows the overview of GenStore-NM, which filters out most of the non-matching reads in three steps. In Step 1, GenStore-NM reads the input read set from the flash chips, generates minimizer k-mers for each read (as explained in Section 2.1), and looks up each minimizer in a K-mer Index (KmerIndex) to find the potential matching locations, i.e., seeds (1 in Figure 7). In Step 2, GenStore-NM counts the number of seeds in each read to decide whether the read needs to go through chaining (2). To further improve overall performance, Step 2 also filters out reads with too few seeds (i.e., fewer than T_min), which would not align to the reference and thus would be filtered anyway by the baseline read mapper [58]. In Step 3, GenStore-NM filters reads based on their chaining scores using a fast and efficient chaining accelerator (3). If a read has at least one chain with a score above a specified threshold, it is sent to the CPU for mapping. Otherwise, the read is filtered. All three steps run in a pipelined manner.

Data Structures. We carefully design the KmerIndex to reduce the SSD's internal DRAM capacity required for storing it. The KmerIndex, similar to the index used by the baseline read mapping tool (i.e., Minimap2 [58]), is a hierarchical hash table. To reduce the memory capacity required for storing the KmerIndex, we make three modifications: 1) we do not store the reference genome, since it is not needed by our approach; 2) we do not store seeds with many matching locations (e.g., more than 495 locations in the baseline read mapper [58]), 10 since read mappers usually ignore such seeds in the chaining process [58]; and 3) we increase the number of hash table buckets so that each bucket holds one minimizer. Increasing the number of buckets increases the false positive rate in the filter, which leads to finding extra seeds and performing extra chaining operations (without loss of accuracy). We make this trade-off to reduce the KmerIndex's size and the required memory space. These optimizations reduce the size of the index for the human reference genome [58] from 5.8 GB to 2.9 GB, which enables GenStore-NM to easily store the KmerIndex in the limited DRAM space inside the SSD.

Step 1: Seed Finding. This step finds the potential matching locations of a read (i.e., seeds) in the KmerIndex. GenStore-NM first reads the input read set from all flash chips (1 in Figure 7). GenStore-NM exploits the full internal bandwidth of the SSD by reading the read set in parallel via a multi-plane read operation for each chip (same as GenStore-EM, Section 4.2.2). In a shifting buffer (called the K-mer Window) in the channel-level accelerator, GenStore-NM stores the most recently read k-mers of each read. For each sliding window of w k-mers, 11 GenStore-NM finds the minimizer of the window (described in Section 2.1) using a 64-bit integer mix hash function (hash64) [166] in the SSD-level accelerator (2). GenStore-NM queries each minimizer in the KmerIndex until it finds T seeds (e.g., T = 64) or it reaches the end of the read (3). Each index query takes up to two memory accesses (to visit the two levels of the hash table). GenStore-NM stores the seeds in the Location Buffer in the channel-level accelerator (4) and moves to Step 2.

Step 2: Seed Count-Based Filtering. This step compares the number of seeds in the Location Buffer with a lower bound T_min and an upper bound T. If a read has fewer than T_min matching seeds, it is filtered, since it will not meet the minimum chaining score requirement of the baseline read mapper [58]. 12 If the read has more than T seeds, the read will very likely align to the reference (e.g., reads with at least 64 seeds in Figure 6) and thus requires the full read mapping process. GenStore-NM sends such reads to the host system for the full read mapping process. This way, the read mapping process of the unfiltered reads can run concurrently with GenStore's filtering operations. For any other read, GenStore-NM performs chaining in the SSD in Step 3.
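To make Step 1 concrete, the sketch below selects the minimizer of each window of consecutive k-mers and looks it up in a k-mer-to-locations index, stopping once T seeds are found (the cap that feeds Step 2's upper bound). It is a simplified software illustration: the toy hash, the dict-based index, and the parameter values stand in for the hash64 function and the two-level hardware KmerIndex described above.

```python
# Illustrative software sketch of Step 1 (seed finding): pick the minimizer of
# each window of W consecutive k-mers and look it up in a k-mer index.
# K, W, T, the toy hash, and the dict index are illustrative simplifications.

K, W, T = 4, 5, 64          # k-mer length, window size, seed-count cap

def kmers(seq, k=K):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def toy_hash(kmer):
    # Deterministic stand-in for the 64-bit integer mix hash (hash64) used in GenStore-NM.
    h = 0
    for ch in kmer:
        h = (h * 131 + ord(ch)) & 0xFFFFFFFFFFFFFFFF
    return h

def find_seeds(read, kmer_index):
    """Return up to T seeds (reference positions) for one read."""
    seeds, seen = [], set()
    kms = kmers(read)
    for start in range(len(kms) - W + 1):
        window = kms[start:start + W]
        minimizer = min(window, key=toy_hash)      # k-mer with the smallest hash
        if minimizer in seen:
            continue                               # query each distinct minimizer once
        seen.add(minimizer)
        seeds.extend(kmer_index.get(minimizer, []))
        if len(seeds) >= T:
            break                                  # enough evidence; stop early
    return seeds[:T]

# Toy index mapping k-mers to reference positions.
index = {"ACGT": [10, 1200], "GTAC": [55]}
print(find_seeds("AACGTACGTT", index))  # [10, 1200]
```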
Step 3: Chaining-Based Filtering. This step performs chaining (see Section 2.1), a step commonly used in state-of-the-art read mappers [58, 120, 167] to filter out reads with low chaining scores and send reads with high chaining scores to the CPU to undergo the full read mapping process. The key difference in GenStore-NM's chaining process compared to existing read mappers is that GenStore-NM selectively performs chaining only on reads with seed counts lower than a threshold T. Such selective chaining is based on our two key observations: 1) a long read with many seeds most likely aligns to the reference genome and thus does not require in-storage filtering, and 2) selective chaining can filter many non-aligning long reads without requiring costly hardware resources in the SSD (as shown in Section 4.3.1). We design GenStore-NM's chaining unit based on the chaining algorithm used in Minimap2 [58], a state-of-the-art baseline read mapper. A chain is computed from a sequence of seeds s_1, . . . , s_n 13 of lengths l_1, . . . , l_n, respectively. For a given seed s_i, x_i (y_i) denotes the ending position of the seed's matching location in the reference genome (the read). The chaining score acts as an approximation of an alignment score, increasing linearly with the number of matching base pairs between the read and the reference, and decreasing with the size of gaps between the seeds.

11 The default value for w in Minimap2 [58] is 10, but GenStore's design can be tuned for different values. 12 We use T_min = 3, as in [58], but GenStore-NM's design trivially supports different values of T_min. 13 The seeds are sorted based on their locations in the reference genome.

The chaining score is defined as follows (based on [58]):

f(i) = max{ max_{1 ≤ j < i} { f(j) + α(j, i) − β(j, i) }, l_i }    (3)

where f(i) is the best chaining score that can be computed with the seeds s_1, . . . , s_i. Given a chain whose last seed is s_j, α(j, i) (also called the match score) is the number of new base pairs added to the chain after adding s_i. β(j, i) (also called the gap penalty) subtracts from the chaining score based on the distance between s_i and s_j in the reference genome. 14

GenStore-NM's Chaining-Based Filter (in Figure 7) consists of 1) a Chaining Buffer to store the x, y, l, and f(i) values of the seeds and 2) a Chaining Processing Element (PE) based on Equation (3). Figure 8 shows the design of the Chaining PE in the channel-level accelerator. We reduce the latency and size of this unit by approximating multiplication with shift operations. We ensure that our hardware optimizations always over-estimate the chaining score so that we do not filter out any potential read mappings. GenStore-NM does not affect the accuracy of the read mapper because the filter performs the same computations as the baseline chaining filter for reads with fewer than T seeds inside the SSD. The Chaining PE needs to be executed n × h times per read, where n is the number of seeds per read and h is the number of DP iterations per seed. Due to the limited number of seeds that go through the chaining step, only a small number of chaining units is needed to match the full internal bandwidth of the SSD. Similar to GenStore-EM, the steps of GenStore-NM are pipelined. The performance of GenStore-NM can be easily scaled up by increasing the SSD's internal parallelism (e.g., by deploying more channels or using low-latency NAND flash memory) [122, 151, 152].
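The recurrence in Equation (3) can be illustrated with a short software sketch. The α and β functions below (and the h-predecessor cap) are simplified stand-ins that follow the Minimap2-style structure described above; they are not the exact scoring functions of the baseline mapper, and the sketch does not model the Chaining PE's shift-based, over-estimating approximation.

```python
# Minimal software sketch of the chaining recurrence in Equation (3),
# in the style of Minimap2 [58]. Each seed is (x, y, l): ending positions in
# the reference and the read, and the seed length. alpha/beta below are
# simplified stand-ins for the baseline mapper's exact functions.

def chain_score(seeds, h=50):
    """Return the best chaining score over seeds sorted by reference position x."""
    if not seeds:
        return 0
    f = [l for (_, _, l) in seeds]                 # f(i) initialized to l_i
    for i, (xi, yi, li) in enumerate(seeds):
        # Heuristic: consider only up to h predecessors, as in the O(h*n) variant.
        for j in range(max(0, i - h), i):
            xj, yj, lj = seeds[j]
            if xj >= xi or yj >= yi:
                continue                           # seed j must precede seed i
            alpha = min(xi - xj, yi - yj, li)      # new matching bases added by seed i
            beta = abs((xi - xj) - (yi - yj))      # gap penalty: reference/read gap difference
            f[i] = max(f[i], f[j] + alpha - beta)
    return max(f)

# Two co-linear seeds chain well; a distant, inconsistent seed does not help.
seeds = [(100, 50, 15), (130, 80, 15), (900, 95, 15)]
print(chain_score(seeds))  # 30: the best chain links the first two seeds (15 + 15)
```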
GenStore requires simple changes to the existing FTL code.

GenStore FTL Metadata. GenStore metadata includes the mapping information of the data structures necessary for read mapping acceleration, which enables access to the data structures without the L2P mapping table of the regular FTL. In accelerator mode, where GenStore operates as an accelerator, GenStore also keeps in internal DRAM other metadata structures of the regular FTL (e.g., the page status table and block read counts [129, 152]) that need to be updated during the filtering process. We carefully design GenStore to only sequentially access the underlying NAND flash chips while operating as an accelerator, so it requires only a small amount of metadata to access the stored data.

Data Placement. GenStore needs to properly place its data structures to enable full utilization of the internal SSD bandwidth in accelerator mode. When each data structure is initially written to the SSD, GenStore sequentially and evenly distributes it across NAND flash chips. We design GenStore to store every data structure using multiple sets of NAND flash blocks such that each block set consists of NAND flash blocks with the same block offset across the planes in a die. For example, a die's i-th block set includes all NAND flash blocks whose offset is i in the die. Such data placement enables GenStore not only to always perform multi-plane read operations during the filtering process, but also to significantly reduce the size of GenStore metadata. For example, suppose that a GenStore-enabled SSD consists of 128 2-plane NAND flash dies and the block size is 12 MB (i.e., the size of each block set would be 24 MB). In such a case, GenStore can specify the physical location of a 30-GB data structure by maintaining only a list of 1,250 (30 GB/24 MB) physical block addresses. This way, GenStore significantly reduces the size of the necessary mapping information from 300 MB (with conventional 4-KiB page mapping) to only 5 KB (1,250 × 4 bytes), and the saved internal DRAM space can be used for loading data structures in accelerator mode.

SSD Management Tasks. In accelerator mode, GenStore stops working as a regular SSD. It only reads data structures to perform filtering and does not write any new data. Therefore, GenStore does not require any write-related SSD-management tasks such as garbage collection [127-131] and wear-leveling [129, 132, 168]. The other tasks necessary for ensuring data reliability, such as refreshing data to avoid uncorrectable errors due to data retention and read disturb [129, 168-176], can be done before or after the filtering process, for two reasons. First, GenStore significantly limits the amount of data whose retention age would exceed the manufacturer-specified threshold for reliable operation (e.g., 1 year [177]) during the filtering process, since GenStore's filtering process takes a short time. 15 GenStore simply refreshes such data [129, 170, 173, 174] before starting the filtering process. Second, GenStore-FTL can easily avoid read disturbance errors for data with high read counts [172, 175] since GenStore sequentially reads NAND flash blocks only once during filtering. After the filtering process, to prevent read disturb errors in future filtering processes or regular accesses, GenStore refreshes the pages that store a data structure if the structure's read count exceeds a certain threshold. 15 E.g., less than 7 minutes for a 1-TiB dataset even with a low-end SSD (SSD-L) [121].
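For concreteness, the following small calculation reproduces the data-placement arithmetic from the example above (decimal GB/MB units, 4-byte block addresses); the helper name is illustrative.

```python
# Reproduces the data-placement metadata arithmetic above (decimal units).
# Helper name is illustrative; entry sizes follow the example in the text.

def block_mapping_entries(structure_bytes, block_set_bytes):
    """Number of physical block addresses needed to locate a data structure."""
    return structure_bytes // block_set_bytes

structure = 30 * 10**9              # 30-GB data structure
block_set = 2 * 12 * 10**6          # 2 planes/die x 12-MB blocks = 24-MB block set

entries = block_mapping_entries(structure, block_set)
metadata_bytes = entries * 4        # 4-byte physical block address per entry

print(entries, metadata_bytes)      # 1250 entries -> 5000 bytes (5 KB)
```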
Evaluated Systems. We show the benefits of GenStore when it is integrated with state-of-the-art software and hardware read mappers. To this end, we evaluate the following systems:

• Base: Minimap2 [58] is a state-of-the-art software read mapper baseline for both short and long reads. GenCache [43] and Darwin [39] are state-of-the-art hardware read mappers for short and long reads, respectively.

• GS-Ext: Base integrated with an implementation of the GenStore filter without in-storage support (Ext stands for external to storage). GS-Ext concurrently filters reads while using Base to perform read mapping for unfiltered reads. The goal of evaluating GS-Ext is to decouple the effects of GenStore's two major benefits: 1) alleviating I/O bottlenecks via efficient in-storage processing and 2) reducing the workload of the read mapper by filtering reads with simple operations. GS-Ext obtains the second benefit but not the first. For software read mappers, we evaluate a pure software implementation of GenStore that concurrently runs with Base. 16 For hardware read mappers, we evaluate a hardware implementation of GenStore outside the SSD.

• GS: Base integrated with the hardware GenStore filtering accelerators, GenStore-EM and GenStore-NM, as described in Sections 4.2 and 4.3. GS concurrently filters reads inside the SSD while using Base to perform read mapping for unfiltered reads.

The source code of GenStore, along with our scripts and datasets, can be freely downloaded from https://github.com/CMU-SAFARI/GenStore.

Area and Power. We implement GenStore's logic components in Verilog HDL. We synthesize our designs using the Synopsys Design Compiler [178] with a 65nm process technology node to estimate latency, area, and power consumption. We use the SSD power values of a Samsung 3D NAND flash-based SSD [121] and DRAM power values based on a DDR4 model [179, 180].

Performance Modeling. We evaluate hardware configurations using two state-of-the-art simulators to analyze the performance of GenStore. We model DRAM timing with the DDR4 interface [180, 181] in Ramulator [101, 182], a widely-used, cycle-accurate DRAM simulator. We model SSD performance using MQSim [102], a widely-used simulator for modern SSDs. We model the end-to-end throughput of GenStore based on the throughput of each GenStore pipeline stage: accessing NAND flash chips, accessing internal DRAM, accelerator computation, and transferring unfiltered data to the host. We estimate the performance of the GenCache and Darwin accelerators based on the data reported in the original works [39, 43].

Real System Results. We use real systems to evaluate all software configurations. For a given reference genome, separate reference indexes are generated for each sequencing technology using the software read mapper's default settings for each technology [58]. All other read mapping parameters are kept at their default values. We perform all experiments on an AMD® EPYC® 7742 CPU with 1 TB of DDR4 DRAM (available in user space). We measure power in these systems using AMD® µProf [183]. We carefully evaluate GenStore's benefits over the baseline in a conservative manner by using optimized configurations for all baselines. Our baselines fetch data from the SSD by sequentially reading the data in batches [58], while providing sufficiently large DRAM capacity to contain all data that gets reused during read mapping. We use the number of threads that leads to each software configuration's best performance (i.e., execution time): 128 threads for Base and 16 threads for GS-Ext in our experimental setup. 17

SSD Configurations. We analyze the benefits of GenStore on three SSD configurations: 1) a low-end SSD (SSD-L) [124] with a SATA3 interface [133], 2) a mid-end SSD (SSD-M) [134] using a PCIe Gen3 M.2 interface [135], and 3) a high-end SSD (SSD-H) [125] with a PCIe Gen4 interface [136].
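The end-to-end throughput model described under Performance Modeling above can be illustrated with a toy sketch: with fully pipelined stages, the sustained throughput is bounded by the slowest stage. The stage names are the four stages listed in the text; the bandwidth numbers are illustrative placeholders, not measured GenStore parameters.

```python
# Toy version of the pipeline-throughput model: end-to-end throughput of a
# fully pipelined design is bounded by its slowest stage. Numbers are
# illustrative placeholders, not measured GenStore values.

stage_throughput_gbps = {
    "nand_flash_read": 3.2,       # SSD internal (channel-aggregate) read bandwidth
    "internal_dram": 19.2,        # in-SSD DRAM bandwidth
    "accelerator_compute": 6.4,   # filtering-logic throughput
    "host_transfer": 7.0,         # unfiltered data sent over the external interface
}

bottleneck = min(stage_throughput_gbps, key=stage_throughput_gbps.get)
end_to_end = stage_throughput_gbps[bottleneck]
print(f"bottleneck stage: {bottleneck}, end-to-end throughput: {end_to_end} GB/s")
```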
Datasets. For short read experiments, we use the hg38 human reference genome [144]. In Section 3, we use real short reads (SRR2052419 [158], 19.6 GB). In Section 6, to flexibly and controllably analyze the effect of read sets with different features, we simulate read sets with various sizes (up to 440 GB) and different fractions of exactly-matching reads (75% and 85%) using the Mason 2 genomic read simulator [184]. We generate reads with different exact-match rates by introducing sequence mutations (uniformly randomly drawn from the gold-standard mutation list for the human sample NA12878 [158]) into the reference genome. For the long read experiments, we use the reference genome and read set from Table 1 in Section 4.3.1. To flexibly analyze the effect of data size, we generate larger read sets by concatenating the original read sets several times.

16 We do not evaluate a separate GS-Ext configuration for the long read software mapper since Base (Minimap2 [58]) already incorporates the chaining filter used in GenStore-NM. GenStore-NM implements part of this chaining filter at low cost (enabled by the key observations in Section 4.3) to fit within the constraints of in-storage processing. 17 Performance of GS-Ext saturates after 16 threads since it becomes bottlenecked by the external SSD bandwidth.

We propose GenStore as an in-storage processing system that supports both accurate short reads and error-prone long reads with different levels of genetic variation between the compared genomes. Table 2 shows the area and power consumption of each logic unit used in GenStore. The first column of the table lists the different logic units used in GenStore along with the width of each entry. The second column shows the number of instances of each unit in an 8-channel SSD. We determine the number of instances based on the operating frequency of, and the required throughput from, each unit such that GenStore can process data concurrently read from all planes in all SSD channels. GenStore's hardware units work at different clock frequencies, with the lowest frequency being 300 MHz (for each channel-level Chaining PE). While the hardware units can be designed to operate at higher frequencies, their throughput is sufficient when operating at lower frequencies since the end-to-end execution time of GenStore is bottlenecked by NAND flash reads. The third and fourth columns show the area and power of one instance of each unit. The total hardware area needed for GenStore is very small (0.20 mm² at 65 nm and 0.02 mm² at 14 nm, only 0.006% of a 14nm Intel processor [185]). 18 The area overhead of GenStore hardware (0.06 mm² at 32 nm) is less than 9.5% of the three 28nm ARM Cortex-R4 processors [187] in a SATA SSD controller [121]. The area and power of most logic units (except for the comparator, the hash64 accelerator, and the control unit) increase linearly with the channel count. Each instance of the comparator and the hash64 accelerator supports multiple channels (up to 12 and 4 channels, respectively); therefore, these units scale with ⌈#channels/12⌉ and ⌈#channels/4⌉ instances, respectively. The area and power of GenStore's control unit remain the same across different channel counts.
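The instance-count scaling rule just described can be made concrete with a short sketch; which units scale in which way is taken from the text, while the function name and output format are illustrative.

```python
# Reproduces the instance-count scaling rule above: most units scale linearly
# with the channel count, one comparator serves up to 12 channels, and one
# hash64 accelerator serves up to 4. Function name is illustrative.

from math import ceil

def unit_instances(num_channels):
    return {
        "per_channel_units": num_channels,   # e.g., channel-level Chaining PEs
        "comparator": ceil(num_channels / 12),
        "hash64": ceil(num_channels / 4),
        "control_unit": 1,                   # independent of channel count
    }

print(unit_instances(8))   # {'per_channel_units': 8, 'comparator': 1, 'hash64': 2, 'control_unit': 1}
print(unit_instances(16))  # {'per_channel_units': 16, 'comparator': 2, 'hash64': 4, 'control_unit': 1}
```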
We analyze the benefits of GenStore-EM for a 22-GB short read set where 80% of reads exactly match some subsequence in the reference genome (see Section 5 for the input generation methodology), on a system with three different SSD configurations: SSD-L, SSD-M, and SSD-H.

Integration with a Software Read Mapper. Figure 9a shows the execution time of four read mapper configurations: 1) Base (Minimap2 [58]), 2) SIMD, an extension of Base with a SIMD implementation of a baseline exactly-matching read filter using 128-bit SIMD instructions, 3) GS-Ext, and 4) GS. We divide the execution time between Alignment (the contribution of chaining and alignment to the end-to-end execution time) and Other (the contribution of file access, seeding, and exact-match filtering to the end-to-end execution time). We make four observations from Figure 9a. First, for all SSD types, GS significantly outperforms Base and SIMD, by 2.07-2.45× and 1.66-2.09×, respectively, by alleviating the cost of moving data from the SSD to the host CPU and removing the burden of finding exactly-matching reads from the rest of the system (DRAM and the processor). Second, GS-Ext provides significant performance improvements over both Base and SIMD with SSD-M and SSD-H. However, GS-Ext provides limited benefits over SIMD with SSD-L since its performance is bottlenecked by the low external I/O bandwidth. Third, even though SIMD also filters out the same number of reads and reduces the alignment time similarly to GS and GS-Ext, its average performance benefit over Base (1.19×) is quite limited compared to those of GS (2.23×) and GS-Ext (1.83×). This is because 1) both GS and GS-Ext reduce the number of memory accesses per read and convert random memory accesses into more efficient streaming accesses, and 2) GS addresses the I/O bottlenecks due to limited external SSD bandwidth. Based on our observations, we draw two conclusions. First, GenStore-EM leads to large performance improvements in short read genomic analysis by efficiently filtering large amounts of data in storage. Second, even without in-storage processing support, the key idea of GenStore-EM can significantly improve the performance of the state-of-the-art software read mapper, especially when using high-bandwidth SSDs (e.g., SSD-M and SSD-H).

Integration with a Hardware Read Mapper. Figure 9b shows the execution time of three hardware read mapper configurations: 1) Base (GenCache [43]), 2) GS-Ext, and 3) GS. 19 Integrating GenStore with existing accelerators requires no architectural changes to GenStore or the accelerators because GenStore operates directly on the original data stored in the SSD, before any accelerator-specific operation starts. The only factor GenStore needs to consider is to generate its outputs (i.e., the information of unfiltered reads) in the format that the accelerator needs as its input. The host system is responsible for orchestrating the execution and data flows between GenStore and the accelerator. We make two observations based on Figure 9b. First, GS significantly outperforms Base by 3.32×, 2.55×, and 1.52× on systems with SSD-L, SSD-M, and SSD-H, respectively. Second, GS-Ext performs significantly slower than Base (by 1.91-2.28×) on all systems since GenStore-EM requires accessing the large SKIndex data structure (Section 4.2), and GS-Ext suffers from limited SSD external bandwidth. We conclude that the in-storage processing approach of GenStore effectively addresses the I/O bottleneck, which becomes even more significant with hardware accelerators that address the computation bottleneck.

Effect of Read Set Features on Performance. We study the benefits of GenStore-EM depending on the characteristics of input read sets. To this end, we evaluate GenStore-EM while changing two key characteristics: 1) the input read set size and 2) the exactly-matching read rate. These two factors affect the data movement saving (DM_saving) of GenStore, as governed by Equation (4), where S_ref is the size of the reference genome and its index (e.g., 7 GB for humans [58]), S_read is the size of the read set, and p in GenStore-EM is the exactly-matching read rate of the read set.
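As an illustration of how these parameters interact, the sketch below assumes one plausible reading of Equation (4): the data movement saving is the fraction of the baseline's total I/O (reference plus read set) that in-storage filtering removes. The formula and numbers are only meant to illustrate the trends discussed in the text, not to restate the paper's exact formulation.

```python
# Worked example of the parameters in Equation (4). The formula below is an
# assumed, simplified reading of the data movement saving -- filtered read
# bytes over the total bytes the baseline must read -- used only to
# illustrate the trends discussed in the text.

def dm_saving(s_ref_gb, s_read_gb, p):
    """Assumed form: (p * S_read) / (S_ref + S_read)."""
    return (p * s_read_gb) / (s_ref_gb + s_read_gb)

S_REF = 7.0   # reference genome + index for humans, in GB [58]

# Larger read sets and higher exactly-matching rates both increase the saving.
for s_read, p in [(22, 0.75), (220, 0.75), (440, 0.85)]:
    print(f"S_read={s_read} GB, p={p}: DM_saving={dm_saving(S_REF, s_read, p):.2f}")
```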
Figure 10a shows the execution time of Base and GS for the baseline software read mapper (Minimap2 [58]) for input read sets with different sizes (1×, 10×, and 20× larger than the size of our default 22-GB short read set) and different exactly-matching read rates (75% and 85%) on a system with SSD-H. We make two observations. First, GS's performance benefit grows as the input size increases (from 2.62× to 4.75×) due to larger data movement savings. Since S_ref is constant, DM_saving and the performance benefits of GenStore-EM increase with larger S_read (see Equation (4)). Second, GS's performance benefit increases with higher exactly-matching read rates (from 3.46× to 6.05× for the largest read set) because with a larger exactly-matching read rate (i.e., p in Equation (4)), the data movement saving (DM_saving) increases and the time spent on mapping the unfiltered reads decreases. Mapping the unfiltered reads is the key contributor to the end-to-end execution time of GS since it takes longer than the concurrently running GenStore-EM operations in the SSD. We conclude that the benefits of GenStore-EM, when integrated with the software read mapper, increase with larger read sets and with larger exactly-matching read rates.

Figure 10b shows the execution time of Base and GS for the baseline hardware read mapper (GenCache [43]). We make two observations. First, GS's performance benefit increases with larger input sizes (from 1.52× to 3.13×) due to its larger data movement reduction. Second, GS's performance benefit does not increase with higher exactly-matching read rates. This is because, with the hardware read mapper, the end-to-end performance of GS is dominated by the execution time of GenStore-EM's filter operations inside the SSD, which depends only on the input read set size and not on the exact-match rate. We conclude that the benefits of GenStore-EM, when integrated with the hardware read mapper, increase with larger input read set sizes.

We analyze the benefits of GenStore-NM for a 12.4-GB read set (the first No reference use case in Table 1) with 99.65% of reads not aligning, on a system with three different SSD configurations: SSD-L, SSD-M, and SSD-H.

Integration with a Software Read Mapper. Figure 11a shows the execution time of two read mapper configurations: 1) Base (Minimap2 [58]), which already incorporates the chaining filter, and 2) GS. 20 We observe that GS outperforms Base by 22.4×, 29.0×, and 27.9× on systems with SSD-L, SSD-M, and SSD-H, respectively. 21 The reason is that GS alleviates the cost of data movement from the SSD to the processor and removes the burden of mapping a large fraction of reads from the rest of the system (DRAM and the processor).

Integration with a Hardware Read Mapper. Figure 11b shows the execution time of three hardware read mapper configurations: 1) Base (Darwin [39]), 2) GS-Ext, and 3) GS. We make two observations based on Figure 11b. First, GS significantly outperforms Base by 19.2×, 6.86×, and 6.85× on systems with SSD-L, SSD-M, and SSD-H, respectively.
The reason is that GS alleviates the cost of data movement from the SSD to the accelerator and removes the burden of read mapping for the filtered reads from the rest of the system (DRAM and the accelerator). Second, GS-Ext does not provide significant performance benefits compared to Base in systems with SSD-L and SSD-M since the execution time of GS-Ext is dominated by the I/O overhead of bringing reads from the SSD to the accelerator. GS-Ext performs 2.50× faster than Base in systems with SSD-H since the I/O bottlenecks are partially alleviated with SSD-H. However, the benefits of GS-Ext are limited compared to GS since GS significantly alleviates the I/O overhead of bringing reads from the SSD to the accelerator with all three SSD configurations. 20 Unlike Figure 9a, we do not divide the execution time into Alignment and Other in Figure 11a since the execution time of GS is dominated by the filter operations of GenStore-NM. This is because, in this case, a large fraction (e.g., 99.65% in our evaluated dataset) of reads gets filtered in the SSD and the execution time of mapping the unfiltered reads is very small. 21 In this case, GS provides larger benefits for systems with SSD-M and SSD-H since these SSDs have larger internal bandwidth than SSD-L.

Effect of Read Set Features on Performance. We study how the benefits of GenStore-NM vary depending on the characteristics of input read sets. To this end, we evaluate GenStore-NM while changing two key characteristics: 1) the input read set size and 2) the alignment rate (the fraction of reads in the read set that align to the reference genome). We use input sets with different sizes (1×, 10×, and 20× larger than the size of our default 12.4-GB read set) and different alignment rates (0.3% and 37%, corresponding to the first and second No reference use cases in Table 1), on a system with SSD-H. Figure 12 shows the execution time of Base and GS for the baseline software read mapper [58] (Figure 12a) and the baseline hardware read mapper [39] (Figure 12b). Figure 12: GenStore-NM performance versus input size and read alignment rate. We make two main observations from Figure 12. First, for both hardware and software read mappers, GS's performance benefits vary little as the input size changes. The reason is that S_ref is very small (14.6 MB) for this experiment; therefore, the data movement saving (DM_saving in Equation (4)) mainly correlates with the alignment rate (i.e., 1 − p). Second, for both software and hardware read mappers, GS's performance benefit increases as the ratio of non-aligning reads increases (i.e., as the alignment rate decreases), because with higher values of p, 1) DM_saving increases and 2) the time spent on mapping unfiltered reads decreases. We conclude that GenStore-NM provides high performance benefits to both software and hardware read mappers with different input sizes, and that its benefits increase with lower alignment rates.

To demonstrate the energy benefits of the different GenStore modes, we obtain the energy of the host processor, the host-side DRAM, the DRAM inside the SSD, the communication between the SSD and the host, the active and idle energy of the SSD, and the energy of the logic units used in GenStore (Section 6.1). We calculate the energy of each component based on its idle and dynamic power consumption and its execution time.
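The per-component energy methodology described above can be illustrated with a toy model in which each component's energy is estimated from its idle power, its dynamic power, and how long it is active during the run. All component names and numbers below are illustrative placeholders, not measured GenStore values.

```python
# Toy version of the component energy model described above. All numbers
# are illustrative placeholders for a hypothetical 100-second run.

def component_energy_j(p_idle_w, p_dyn_w, t_total_s, t_active_s):
    return p_idle_w * t_total_s + p_dyn_w * t_active_s

T = 100.0
# (idle W, dynamic W, active seconds) per component.
components = {
    "host_cpu":      (25.0, 90.0, 80.0),
    "host_dram":     ( 4.0, 15.0, 80.0),
    "ssd":           ( 1.5,  5.5, 60.0),
    "in_ssd_logic":  ( 0.0,  0.8, 60.0),
}

total = sum(component_energy_j(pi, pd, T, ta) for pi, pd, ta in components.values())
print(f"estimated end-to-end energy: {total:.0f} J")
```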
We observe that by filtering out large amounts of data, GenStore reduces the end-to-end energy consumption of read mapping in all of our evaluations compared to Base (Minimap2 [58]). We measure the energy consumption for experiments on all SSD configurations. By filtering exact matches in a short read set with an 80% exactly-matching read rate (see Section 5), GenStore-EM reduces the energy consumption by 3.92× on average (up to 3.97×) across all storage configurations. By filtering non-matching reads in a long read set with a read alignment rate of 0.35% (the first No reference use case in Table 1), GenStore-NM reduces the energy consumption by 27.17× on average (up to 29.25×) across all storage configurations.

To our knowledge, GenStore is the first in-storage system designed for accelerating genome sequence analysis. GenStore works with both short and long reads, and with different degrees of genetic variation between the compared genomes.

Accelerating Read Mapping. There is a large body of work on accelerating read mapping, which tends to follow two general directions: 1) non-filtering and 2) filtering approaches [35]. Neither of these two approaches is designed for processing in storage. Non-filtering approaches accelerate one or more non-filtering steps of read mapping (e.g., seeding and approximate string matching) using hardware accelerators. Examples of these accelerators include processing-in-memory architectures [40, 61, 65-69], ASICs [39, 62, 70], GPUs [71-73, 78-84, 188, 189], and FPGAs [63, 64, 74-77, 85-93]. Filtering approaches accelerate the pre-alignment filtering step of read mapping. These works provide highly-parallel read filtering heuristics that quickly eliminate dissimilar sequences before invoking computationally-expensive alignment algorithms. Examples of these accelerators include processing-near-memory architectures [43, 94, 95, 97], FPGAs [41, 42, 48, 49, 98], GPUs [41, 96, 98], and traditional CPU-based acceleration [37, 38, 41]. In contrast to GenStore, prior works on read mapping acceleration suffer from two key issues. First, they all still need to access the genomic data that is initially stored in the storage device, which incurs time and energy costs due to data movement from the storage device to main memory and later to the CPU cores or accelerators. Second, most prior works support processing only one type of sequencing reads [37-39, 42, 48, 49, 62, 64, 65, 68, 69, 71-73, 77, 82-84, 86, 90, 91, 96, 98], which limits their applicability. Compared to existing filtering approaches, our work 1) is the first to identify the I/O inefficiencies that are not addressed by existing techniques and 2) introduces new filtering techniques that enable efficient in-storage processing for both short and long reads.

In-Storage Processing Systems. Prior works explore in-storage processing in the form of application-specific accelerators [143, 190-196], general-purpose processing inside storage devices [195-201], SSDs closely integrated with FPGAs [142, 202-206], or SSDs closely integrated with GPUs [207]. Several works propose techniques for general pattern matching in storage [208, 209] without a specific focus on read mapping or support for different sequencing read types and data structures. None of these works performs in-storage filtering for read mapping to accelerate genome sequence analysis.

As sequencing technologies develop in the future, we expect that GenStore will continue to play an important role in genome sequence analysis.
We witness three major trends in sequencing technologies. First, the length and the accuracy of reads are expected to increase [35, 53, 54, 210, 211]. Second, short reads will continue to be widely used due to their very high accuracy and low cost [26, 212, 213]. Third, DNA sequencing machines are increasingly adopting data processing capabilities to perform both sequencing and genomic analysis within the same machine [31, 34, 214-217]. Even with increases in long read accuracy (trend 1), GenStore-NM would continue to filter large numbers of reads that do not align to a reference genome due to genetic differences between the compared genomes [155, 218-222]. Given that accurate short reads are essential for many processes in genome analysis (trend 2), e.g., for polishing [46, 223, 224] and validating [225] long read sequences, GenStore-EM can continue to accelerate short read mapping by filtering out exactly-matching reads. With the push to provide DNA sequencing and preliminary genetic analysis on more portable integrated devices with limited compute resources and available DRAM capacity [34, 35, 214, 226] (trend 3), there will be a growing demand for data-movement-minimizing systems like GenStore that quickly filter out a large fraction of reads at low cost and high efficiency [94, 100, 214, 217, 227, 228]. With the higher availability of portable genome sequencers, genomic analyses will be performed more routinely [1-9, 35, 99, 155, 210, 228, 229] and will need to provide much faster results [35, 94]. In-storage processing is crucial for meeting the huge demand for faster analyses that scale well to large numbers of genomic samples. Examples of these analyses include gene detection [230], alignment of reads from rapidly mutating organisms such as SARS-CoV-2 [9, 100], and studies of as-of-yet undiscovered microbes [155, 218, 219, 229]. In these analyses, GenStore-NM can effectively filter out >99.7% [219, 230], ∼36.1% [100], and ∼47.3% [155, 218, 219] of the reads, respectively, thereby greatly improving the end-to-end throughput and energy efficiency of genome sequence analysis. We hope that our proposed techniques provide a foundation for future efforts to accelerate genome analysis.

We propose GenStore, a new in-storage processing system for genome sequence analysis. GenStore can be integrated with both hardware and software read mappers to improve the end-to-end performance of both short and long read mapping. We address the challenges of in-storage processing for genomic read filtering via new hardware/software co-designed techniques and develop new in-storage filtering accelerators for both short and long reads. Our evaluations show that GenStore provides large performance and energy improvements when integrated into state-of-the-art software and hardware read mappers. We hope that GenStore inspires other in-storage processing and storage system design ideas for genome sequence analysis. A promising avenue for future work is to enable the integration of GenStore-like accelerators into sequencing devices for extreme scalability and efficiency.
Diagnosis of Genetic Diseases in Seriously Ill Children by Rapid Whole-genome Sequencing and Automated Phenotyping and Interpretation Rapid Whole-genome Sequencing Decreases Infant Morbidity and Cost of Hospitalization Rapid Whole Genome Sequencing Impacts Care and Resource Utilization in Infants with Congenital Heart Disease Personalized Copy Number and Segmental Duplication Maps Using Next-Generation Sequencing P4 Medicine: How Systems Medicine Will Transform the Healthcare Sector and Society Genomic and Personalized Medicine: Foundations and Applications Cancer Genomics: From Discovery Science to Personalized Medicine Towards Precision Medicine Massively Scaled-up Testing for SARS-CoV-2 RNA via Next-generation Sequencing of Pooled and Barcoded Nasal and Saliva Samples Multiplexed Detection of SARS-CoV-2 and Other Respiratory Infections in High Throughput by SARSeq Selected Insights from Application of Whole Genome Sequencing for Outbreak Investigations. Current Opinion in Critical Care Whole Genome Sequencing of Mycobacterium Tuberculosis for Detection of Recent Transmission and Tracing Outbreaks: A Systematic Review Whole-genome Sequencing for Tracing the Transmission Link between Two ARD Outbreaks Caused by A Novel HAdV Serotype 7 Variant Whole-genome Sequencing in Outbreak Analysis Finding the Genomic Basis of Local Adaptation: Pitfalls, Practical Solutions, and Future Directions Comparative Population Genomics in Animals Uncovers the Determinants of Genetic Diversity Determinants of Genetic Diversity Human Disease Variation in the Light of Genome Sequencing and Population Genomics in Non-Model Organisms Great Ape Genetic Diversity and Population History Human Disease Variation in the Light of Ten Years of Next-Generation Sequencing Technology Field Guide to Next-Generation DNA Sequencers Coming of Age: Ten Years of Next-generation Sequencing Technologies A Tale of Three Next Generation Sequencing Platforms: Comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq Sequencers Generations of Sequencing Technologies: from First to Next Generation Systematic Evaluation of Error Rates and Causes in Short Samples in Next-generation Sequencing Opportunities and Challenges in Long-read Sequencing Data Analysis Nanopore Sequencing Technology and Tools for Genome Assembly: Computational Analysis of the Current State, Bottlenecks and Future Directions Single Molecule Real-Time (SMRT) Sequencing Comes of Age: Applications and Utilities for Medical Diagnostics Comprehensive Comparison of Pacific Biosciences and Oxford Nanopore Technologies and Their Applications to Transcriptome Analysis The Third Revolution in Sequencing Technology Accelerating Genome Analysis: A Primer on An Ongoing Journey Technology Dictates Algorithms: Recent Developments in Read Alignment Accelerating Read Mapping with FastHASH Shifted Hamming Distance: A Fast and Accurate SIMD-friendly Filter to Accelerate Alignment Verification in Read Mapping Darwin: A Genomics Coprocessor Provides up to 15,000 x Acceleration on Long Read Assembly GenASM: A High-Performance, Low-Power Approximate String Matching Acceleration Framework for Genome Sequence Analysis SneakySnake: A Fast and Accurate Universal Genome Pre-alignment Filter for CPUs, GPUs and FPGAs GateKeeper: A New Hardware Architecture for Accelerating Prealignment in DNA Short Read Mapping Leveraging In-cache Operators for Efficient Sequence Alignment AirLift: A Fast and Comprehensive Technique for Remapping Alignments between Reference Genomes GRIM-Filter: Fast Seed 
Location Filtering in DNA Read Mapping Using Processing-in-memory Technologies A Sequencing-technology-independent, Scalable and Accurate Assembly Polishing Algorithm. Bioinformatics Edlib: A C/C++ Library for Fast, Exact Sequence Alignment Using Edit Distance MAGNET: Understanding and Improving the Accuracy of Genome Pre-alignment Filtering Shouji: A Fast and Efficient Pre-alignment Filter for Sequence Alignment A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins Identification of Common Molecular Subsequences An Improved Algorithm for Matching Biological Sequences Advancements in Next-generation Sequencing Next-Generation Sequencing Technologies: An Overview Genomes Project Consortium et al. A Global Reference for Human Genetic Variation A Greedy Algorithm for Aligning DNA Sequences Automated Generation of Heuristics for Biological Sequence Comparison Minimap2: Pairwise Alignment for Nucleotide Sequences A Fast Bit-vector Algorithm for Approximate String Matching Based on Dynamic Programming Fast Gap-affine Pairwise Alignment Using the Wavefront Algorithm. Bioinformatics RADAR: A 3D-ReRAM based DNA Alignment Accelerator Architecture SeedEx: A Genome Sequencing Accelerator for Optimal Alignments in Subminimal Space ASAP: Accelerated Short-read Alignment on Programmable Hardware GeNVoM: Read Mapping Near Non-Volatile Memory. TCBB RAPID: A ReRAM Processing In-memory Architecture for DNA Sequence Alignment PIM-Align: A Processing-in-Memory Architecture for FM-Index Search Algorithm Aligns: A Processing-inmemory Accelerator for DNA Short Read Alignment Leveraging SOT-MRAM Aligner: A Process-inmemory Architecture for Short Read Alignment in ReRAMs Race Logic: A Hardware Acceleration for Dynamic Programming Algorithms Bitmapper2: A GPU-accelerated All-mapper Based on The Sparse Q-gram Index Hardware Acceleration of BWA-MEM Genomic Short Read Mapping for Longer Read Lengths An Efficient GPU-accelerated Implementation of Genomic Short Read Mapping with BWA-MEM Ultra-fast Next Generation Human Genome Sequencing Data Processing Using DRAGENTM Bio-IT Processor for Precision Medicine When Spark Meets FPGAs: A Case Study for Next-Generation DNA Sequencing Acceleration Accelerating the Next Generation Long Read Mapping with the FPGA-based System A High-Throughput FPGA Accelerator for Short-Read Mapping of the Whole Human Genome High-performance GPU-based X-drop Long-read Alignment GASAL2: A GPU Accelerated Sequence Alignment Library for High-Throughput NGS Data Accelerating the Smith-waterman Algorithm Using Bitwise Parallel Bulk Computation Technique on GPU CUDAlign 4.0: Incremental Speculative Traceback for Exact Chromosome-wide Alignment in GPU Clusters GSWABE: Faster GPU-accelerated Sequence Alignment with Optimal Alignment Retrieval for Short DNA Sequences. 
Concurrency and Computation: Practice and Experience CUDASW++ 3.0: Accelerating Smith-Waterman Protein Database Search by Coupling CPU and GPU SIMD Instructions Arioc: High-throughput Read Alignment with GPU-accelerated Exploration of The Seed-and-extend Search Space FPGASW: Accelerating Large-scale Smith-Waterman Sequence Alignment Application with Backtracking on FPGA Linear Systolic Array Hardwareacceleration of Short-read Alignment Based on the Burrows-wheeler Transform A Novel High-throughput Acceleration Engine for Read Alignment SWIFOLD: Smith-Waterman Implementation on FPGA with OpenCL for Long DNA Sequences An FPGA Accelerator of the Wavefront Algorithm for Genomics Pairwise Alignment PipeBSW: A Two-Stage Pipeline Structure for Banded Smith-Waterman Algorithm on FPGA Genesis: A Hardware Acceleration Framework for Genomic Data Analysis Accelerating Genomic Data Analytics With Composable Hardware Acceleration Framework FPGA Accelerated Indel Realignment in the Cloud FPGA-based Near-Memory Acceleration of Modern Data-Intensive Applications A 1.2 v 12.8 gb/s 2gb Mobile Wide-i/o Dram with 4× 128 i/os Using TSV-based Stacking Gatekeeper-gpu: Fast and accurate pre-alignment filtering in short read mapping ALPHA: A Novel Algorithm-Hardware Co-design for Accelerating DNA Seed Location Filtering Hardware Acceleration of Long Read Pairwise Overlapping in Genome Sequencing: A Race between FPGA and GPU The UK10K Project Identifies Rare Variants in Health and Disease High Throughput Detection and Genetic Epidemiology of SARS-CoV-2 Using COVIDSeq Next-generation Sequencing Ramulator: A Fast and Extensible DRAM Simulator MQSim: A Framework for Enabling Realistic Studies of Modern Multi-queue SSD Devices Next-generation Sequencing Library Preparation: Simultaneous Fragmentation and Tagging Using in Vitro Transposition Next-generation DNA Sequencing Fast Nanopore Sequencing Data Analysis with SLOW5 Oxford Nanopore Technologies. MinION Mk1B IT Requirements Martin Shumway, and International Nucleotide Sequence Database Collaboration. The Sequence Read Archive Pangenomics Reveals Alternative Environmental Lifestyles among Chlamydiae Emergence of Genomic Diversity and Recurrent Mutations in SARS-CoV-2. Infection Metalign: Efficient Alignment-based Metagenomic Profiling Via Containment Min Hash Critical Assessment of Metagenome Interpretation-the second round of challenges. bioRxiv Hidden Biases in Germline Structural Variant Detection BALSA: Integrated Secondary Analysis for Whole-genome and Whole-exome Sequencing Optimal Seed Solver: Optimizing Seed Selection in Read Mapping Basic Local Alignment Search Tool Improved Metagenomic Analysis with Kraken 2 Winnowing: Local Algorithms for Document Fingerprinting Reducing Storage Requirements for Biological Sequence Comparison Improving the Performance of Minimizers and Winnowing Schemes Minimap and Miniasm: Fast Mapping and De Novo Assembly for Noisy Long Sequences Samsung. 
Samsung SSD 860 PRO Improving I/O Performance of Large-page Flash Storage Systems Using Subpage-parallel Reads Improving Performance and Lifetime of Large-page NAND Storages Using Erase-free Subpage Programming Enterprise SSD Controllers Enabling Fairness and Enhancing Performance in Modern NVMe Solid State Drives Architectural Support for Efficient Data Sanitization in Modern Flash-Based Storage Systems Error Characterization, Mitigation, and Recovery in Flash-Memory-Based Solid-state Drives RansomeBlocker: a Low-Overhead Ransomware-Proof SSD Improving Performance and Lifetime of NAND Storage Systems Using Relaxed Program Sequence On Efficient Wear Leveling for Large-scale Flash-memory Storage Systems SATA revision 3.0 specifications Samsung. Samsung SSD 980 PRO PCI Express Base Specification Revision 3 Micron. Micron 9300 SSD Western Digital. WD Blue SATA Internal SSD Hard Drive Seed-and-vote Based In-memory Accelerator for DNA Read Mapping RASSA: Resistive Prealignment Accelerator for Approximate DNA Long Read Mapping Trading Communication with Computing Near Storage Deepstore: In-storage Acceleration for Intelligent Queries Evaluation of GRCh38 and De Novo Haploid Genome Assemblies Demonstrates the Enduring Quality of the Reference Assembly Secure Hash Standard RFC1321: The MD5 Message-digest Algorithm GenStore: A High-Performance and Energy-Efficient In-Storage Computing System for Genome Sequence Analysis Sequencing Depth and Coverage: Key Considerations in Genomic Analyses NovaSeq 6000 System Specifications A Flash Memory Controller for 15µs Ultra-Low-Latency SSD Using High-Speed 3D NAND Flash with 3µs Read Time Reducing Solid-State Drive Read Latency by Optimizing Read-Retry Human Contamination in Bacterial Genomes has Created Thousands of Spurious Proteins Human Microbiome Project Consortium et al. Structure, Function and Diversity of the Healthy Human Microbiome A Global Metagenomic Map of Urban Microbiomes and Antimicrobial Resistance Best Practices for Analysing Microbiomes Database Resources of the National Center for Biotechnology Information Extensive Sequencing of Seven Human Genomes to Characterize Benchmark Reference Materials. Scientific data A New Coronavirus Associated with Human Respiratory Disease in China FDA-ARGOS is A Database with Public Quality-controlled Reference Genomes for Diagnostic Use and Regulatory Science The Reference Genome Sequence of Saccharomyces Cerevisiae: Then and Now The Arabidopsis Information Resource: Making and Mining the "Gold Standard Updates to the Drosophila Melanogaster Knowledge Base Lineage-specific Biology Revealed by a Finished Genome Assembly of the Mouse Integer Hash Function Reliability Issues in Flash-memory-based Solid-state Drives: Experimental Analysis, Mitigation, Recovery. 
In Inside Solid State Drives Vulnerabilities in MLC NAND Flash Memory Programming: Experimental Analysis, Exploits, and Mitigation Techniques Improving 3D NAND Flash Memory Lifetime by Tolerating Early Retention Loss and Process Variation Heat-Watch: Improving 3D NAND Flash Memory Device Reliability by Exploiting Self-recovery and Temperature Awareness Read Disturb Errors in MLC NAND Flash Memory: Characterization, Mitigation, and Recovery Error Analysis and Management for MLC NAND Flash Memory Flash Correct-and-refresh: Retention-aware Error Management for Increased Flash Memory lifetime An Integrated Approach for Managing Read Disturbs in High-density NAND Flash Memory WARM: Improving NAND Flash Memory Lifetime with Write-Hotness Aware Retention Management Micron 3D NAND Flash Memory Micron Technology Inc. 4Gb: x4, x8, x16 DDR4 SDRAM Data Sheet Demystifying Complex Workload-DRAM Interactions: An Experimental Study Ramulator Source Code Mason -A Read Simulator for Second Generation Sequencing Data Scaling Equations for the Accurate Prediction of CMOS Device Performance from 180 Nm to 7 Nm ARM Holdings. Cortex-R4 CUDASW++: Optimizing Smith-Waterman Sequence Database Searches for CUDA-enabled Graphics Processing Units CUDASW++ 2.0: Enhanced Smith-Waterman Protein Database Search on CUDA-enabled GPUs Based on SIMT and Virtualized SIMD Abstractions REGISTOR: A Platform for Unstructured Data Processing inside SSD Storage Using Accelerated Flash Storage for External Graph Analytics Query Processing on Smart SSDs: Opportunities and Challenges A User-Programmable SSD. In USENIX OSDI Sang-Won Lee, and Bongki Moon Active Disks for Large-Scale Data Processing Active Storage for Large-Scale Data Mining and Multimedia Applications A Framework for Near-data Processing of Big Data Workloads. ISCA Enabling Cost-effective Data Processing with Smart SSD Project Almanac: A Time-traveling Solid-state Drive Active Disks: Programming Model A Case for Intelligent Disks (IDISKs). SIGMOD Record Bluedbm: An Appliance for Big Data Analytics Distributed Flash Storage for Big Data Analytics Hosein Bobarshad, Vladimir Alves, and Nader Bagherzadeh. Catalina: In-storage Processing Acceleration for Scalable Big Data Analytics SmartSSD: FPGA Accelerated Near-Storage Data Analytics on SSD CIDR: A Cost-effective In-line Data Reduction System for Terabitper-second Scale SSD Arrays Xsd: Accelerating Mapreduce by Harnessing the GPU inside an SSD REACT: Scalable and High-performance Regular Expression Pattern Matching Accelerator for In-storage Processing In-storage Embedded Accelerator for Sparse Pattern Processing Nanopore Sequencing of the Fungal Intergenic Spacer Sequence as a Potential Rapid Diagnostic Assay Oxford Nanopore Technologies. R10.3: the Newest Nanopore for High Accuracy Nanopore Sequencing -Now Available in Store A Large Genome Center's Improvements to the Illumina Sequencing System Pacific Biosciences Closes Acquisition of Omniome and Establishes San Diego Presence Squigglefilter: An accelerator for portable virus detection Reliable Variant Calling during Runtime of Illumina Sequencing Realtime Mapping of Nanopore Raw Signals Targeted Nanopore Sequencing by Real-time Mapping of Raw Electrical Signal with UNCALLED Geospatial Resolution of Human and Bacterial Diversity with City-scale Metagenomics Urban Transit System Microbial Communities Differ by Surface Type and Interaction with Humans and the Environment. 
Msystems Assembly of A Pan-genome from Deep Sequencing of 910 Humans of African Descent Building a Chinese Pan-genome of 486 Individuals The Need for A Human Pangenome Reference Sequence A Comprehensive Evaluation of Long Read Error Correction Methods Telomere-to-telomere Assembly of A Complete Human X Chromosome The Structure, Function and Evolution of A Complete Human Chromosome 8 Nanopore Base Calling on the Edge Pan-genomic Matching Statistics for Targeted Nanopore Sequencing. iScience Real-time DNA Barcoding in a Rainforest Using Nanopore Sequencing: Opportunities for Rapid Biodiversity Assessments and Local Capacity Building Structure and Function of the Global Ocean Microbiome Longitudinal Analysis of Microbial Interaction between Humans and the Indoor Environment We thank the anonymous reviewers of MICRO 2021 and ASP-LOS 2022 for feedback. We thank the SAFARI group members for feedback and the stimulating intellectual environment. We acknowledge the generous gifts and support provided by our industrial partners: Google, Huawei, Intel, Microsoft, VMware, and the Semiconductor Research Corporation. Jisung Park was in part supported by the National Research Foundation (NRF) of Korea (NRF-2020R1A6A3A03040573).