title: EXMA: A Genomics Accelerator for Exact-Matching
authors: Jiang, Lei; Zokaee, Farzaneh
date: 2021-01-13

Genomics is the foundation of precision medicine, global food security and virus surveillance. Exact-match is one of the most essential operations widely used in almost every step of genomics such as alignment, assembly, annotation, and compression. Modern genomics adopts the Ferragina-Manzini Index (FM-Index), which augments the space-efficient Burrows-Wheeler transform (BWT) with additional data structures to permit ultra-fast exact-match operations. However, FM-Index is notorious for its poor spatial locality and random memory access pattern. Prior works create GPU-, FPGA-, ASIC- and even processing-in-memory (PIM)-based accelerators to boost FM-Index search throughput. Though they achieve the state-of-the-art FM-Index search throughput, FM-Index PIMs, the same as all prior conventional accelerators, process only one DNA symbol after each DRAM row activation, thereby suffering from poor memory bandwidth utilization. In this paper, we propose a hardware accelerator, EXMA, to enhance FM-Index search throughput. We first create a novel EXMA table with a multi-task-learning (MTL)-based index to process multiple DNA symbols with each DRAM row activation. We then build an accelerator to search over an EXMA table. We propose 2-stage scheduling to increase the cache hit rate of our accelerator. We introduce dynamic page policy to improve the row buffer hit rate of DRAM main memory. We also present CHAIN compression to reduce the data structure size of EXMA tables. Compared to state-of-the-art FM-Index PIMs, EXMA improves search throughput by 4.9×, and enhances search throughput per Watt by 4.8×.

I. INTRODUCTION

Because of the huge advancement of sequencing technologies such as Illumina [1], PacBio SMRT [2], and Oxford Nanopore [3], sequencing an entire human genome now requires < 1 day. Big genomic data has become a cornerstone of personalized healthcare [4] and global food security [5]. Recently, genome sequencing has become a powerful tool to fight virus outbreaks, e.g., Ebola [6], Zika [7] and COVID-19 [8]. However, it is challenging to process and analyze the huge volumes of genomic data generated by high-throughput sequencers that scale faster than Moore's Law [9]. For instance, thousands of USB-drive-size Oxford Nanopore MinION sequencers are deployed in the wild to monitor virus outbreaks [6]-[8], generating several terabytes of data per day. Analyzing a single genome may take hundreds of CPU hours [10], [11] even on high-end servers. To overcome the looming crisis of big genomic data, application-specific hardware acceleration has become essential for genomics.

A genome sequencing pipeline [4] sequences organic genomes, archives genomic data, analyzes genome sequences, and generates genetic variants that can be used for patient treatment. Therefore, the latency of genome sequencing is a matter of life and death. Read alignment [12], which aligns reads, i.e., small DNA fragments, against a long genome reference, is identified as one of the most time-consuming steps [10], [11], [13]-[15] in genome analysis. Read alignment adopts the seed-and-extend paradigm [12], [16], and thus includes two major stages, i.e., seeding and seed extension.
During seeding, parts of each read are mapped to their exactly matched positions, i.e., seeds, in the long reference by hash tables [10], [11], [17] or the Ferragina-Manzini Index (FM-Index) [12], [16]. Seed extension pieces together a larger sequence from seeds, tolerating edit-distance errors, i.e., insertions, deletions (indels) and substitutions, by dynamic programming [18]-[20]. State-of-the-art read alignment applications such as BWA-MEM [12], MA [16] and SOAP [21] use FM-Index to build super-maximal exact matches (SMEMs) during seeding, since it augments the space-efficient Burrows-Wheeler transform (BWT) [22] with accessory data structures that permit ultra-fast exact-match operations. SMEMs generated by FM-Index guarantee that each seed does not overlap other seeds and has the maximal length that cannot be further extended. Compared to hash tables, FM-Index substantially reduces not only the number of errors in output genome mappings but also the duration of seed extension [23].

Besides read alignment, FM-Index is widely used for exact-match operations in other time-consuming steps of genome analysis such as genome assembly [24], annotation [25] and compression [26]. Figure 1 shows the execution time breakdown of various genome analysis applications on a human genome. On average, FM-Index searches cost 31% ∼ 81% of the execution time of these genome analysis applications. Since Illumina, Nanopore and PacBio genome sequencers generate reads having different lengths and error rates, aligning and assembling these reads require different amounts of time for FM-Index searches. The reads produced by Illumina machines have lower error rates, so an Illumina dataset invokes FM-Index searches more frequently.

However, FM-Index is notorious for its poor spatial locality and random memory access pattern [27]. The kernel of a conventional FM-Index search is pointer chasing. After activating one DRAM row, FM-Index processes only one DNA symbol, thereby greatly decreasing DRAM bandwidth utilization. Although a recent algorithmic work, LISA [28], uses a learned index [29] to search multiple DNA symbols after each row activation, the accuracy of its learned index is low. LISA has to search many unnecessary entries, and thus achieves only a moderate search throughput improvement. Beyond CPUs [12] and GPUs [21], prior work creates FPGA- [13], [30], ASIC- [31], and even processing-in-memory (PIM)-based [14], [15] designs to accelerate conventional FM-Index searches that process only one DNA symbol after each row activation. Therefore, these accelerators are fundamentally limited by poor off-chip memory bandwidth utilization. Instead of searching multiple DNA symbols after each row activation, a recent DRAM PIM, MEDAL [15], achieves the state-of-the-art search throughput by enabling DRAM chip-level parallelism, where each chip can independently activate a partial row to process a DNA symbol. However, we observe that there are many conflicts on the DDR4 address bus shared by all chips in a rank. The shared address bus seriously limits the search throughput of MEDAL.

In this paper, we propose an algorithm and hardware co-designed accelerator, EXMA, to process EXact-MAtch operations during genome analysis. Our contributions are summarized as follows:
• We create a novel, row-buffer-friendly EXMA table with a multi-task-learning (MTL)-based index to process multiple DNA symbols with each DRAM row activation.
• We build a hardware accelerator to search over an EXMA table, with 2-stage scheduling to increase its cache hit rate and a dynamic page policy to improve the DRAM row buffer hit rate.
• We present CHAIN compression to reduce the data structure size of EXMA tables.

II. BACKGROUND

Seed-and-Extend: As one of the bottlenecks in genome analysis, read alignment may consume hundreds of CPU hours [10], [11], [13].
During read alignment, DNA reads generated by various sequencing machines, e.g., Illumina, PacBio SMRT, and Oxford Nanopore, are mapped to a preexisting genome reference, as shown in Figure 2(a). Read alignment is complicated by the fact that there are genetic variations in the human population, and sequencing machines also introduce sequencing errors [32]. The overall variation of the human population has been estimated at 0.1% [9], while the sequencing error rate of various sequencing machines is 0.2% ∼ 30% [32]. To reduce sequencing errors, a sequencing machine produces 30 ∼ 50 reads to cover every position in the genome. As Figure 2(b) shows, read alignment adopts the seed-and-extend paradigm [12], [16], [24], [33]-[35] to accommodate sequencing errors and genetic variations. During seeding, a read is divided into multiple smaller parts that are aligned against the reference. If a part is exactly matched, it becomes a seed. The computationally expensive Smith-Waterman algorithm [10], [11], [13] is invoked only around seeds to handle sequencing errors and genetic variations.

Exact-Match Operation: The alphabet of DNA is Σ = {A, C, G, T}. Given a genome reference G ∈ Σ* of length |G| and a query Q ∈ Σ* of length |Q|, the seeding, aka exact-match, problem is to find all occurrences of Q in G. A naïve algorithm exhausting all possible positions of Q takes O(|G||Q|) comparisons, which is infeasible for large genomes. It is possible to use a hash table [10], [11], [17] to support exact-match operations with hundreds of gigabytes of DRAM, but hash-table-based seeding not only degrades genome mapping quality but also prolongs seed extension [23]. State-of-the-art alignment algorithms [12], [16], [21] use FM-Index for seeding. To search a query Q over the reference genome G, FM-Index occupies O(|G|log(|G|)) DRAM space and does O(|Q|) comparisons during a search.

FM-Index is built upon the BWT [22]. To compute the BWT of a genome reference G, we can list all its circularly shifted sequences. For instance, Figure 3(a) shows G = CATAGA$, where $ indicates the end of the sequence and is the lexicographically smallest symbol. The circularly shifted sequences of G = CATAGA$ can be listed as $CATAGA, A$CATAG, . . ., and CATAGA$, and sorted in lexicographical order to form a Burrows-Wheeler (BW)-matrix. The last column of the BW-matrix is the BWT of G, i.e., BWT(G) = AGTC$AA. The data structure and search algorithm of FM-Index can be summarized as:
• Occ and Count. FM-Index searches are implemented with two functions, Occ(s, i) and Count(s), over the BWT of G. As Figure 3(b) shows, Occ(s, i) returns the number of occurrences of the symbol s among the first i symbols of BWT(G). Count(s) in Figure 3(c) computes the number of symbols in the BWT that are lexicographically smaller than the symbol s, e.g., Count(T) = 6, which indicates that there are 6 symbols in BWT(G) = AGTC$AA lexicographically smaller than T.
• Backward Search. An exact-match operation is implemented by backward search, whose algorithm can be viewed in Figure 3(d). The interval (low, high) covers a range of indices in the BW-matrix where the suffixes have the same prefix. The pointer low locates the index in the BW-matrix where the pattern is first found as a prefix, while the pointer high provides the index after the one where the pattern is last found. At first, low and high are initialized to the minimum and maximum indexes of the Occ table, respectively. Then, the search iterates from the last symbol in a query Q to the first. The pointer pos is updated by

pos = Count(Q[i]) + Occ(Q[i], pos),

where Q[i] indicates the i-th symbol in the query Q.
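To make this recurrence concrete, the following is a minimal Python sketch of backward search over a plain BWT string. The function names (occ, count, backward_search) and the brute-force Occ/Count implementations are ours for illustration; a real FM-Index replaces them with the sampled tables described above.

```python
def occ(bwt, s, i):
    """Occ(s, i): occurrences of symbol s among the first i symbols of BWT(G)."""
    return bwt[:i].count(s)

def count(bwt, s):
    """Count(s): number of symbols in BWT(G) lexicographically smaller than s."""
    return sum(1 for c in bwt if c < s)   # '$' < 'A' < 'C' < 'G' < 'T' in ASCII

def backward_search(bwt, query):
    """Return the (low, high) BW-matrix interval whose suffixes are prefixed by query."""
    low, high = 0, len(bwt)
    for symbol in reversed(query):        # iterate from the last query symbol to the first
        low = count(bwt, symbol) + occ(bwt, symbol, low)
        high = count(bwt, symbol) + occ(bwt, symbol, high)
        if low >= high:                   # the pattern does not occur in G
            return None
    return low, high

# The paper's example: query TAG over G = CATAGA$, BWT(G) = AGTC$AA.
print(backward_search("AGTC$AA", "TAG"))  # -> (6, 7), matching Figure 3(e)
```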
The pointer pos can be low or high, as shown in lines 2 ∼ 3 of Figure 3(d). The computations of low and high are pointer chasing and thus suffer from poor spatial locality [14], [15], [30]. Finally, the interval (low, high) gives the range of indexes in the BW-matrix where the suffixes have the target query as a prefix. These indexes are converted to reference genome positions using the suffix array (SA). Figure 3(e) illustrates an example of searching a query TAG in the reference G = CATAGA$. Before the search happens, (low, high) is initialized to (0, 7). In iteration 0, the last symbol G is processed, and (low, high) is updated to (5, 6). After three iterations, (low, high) eventually equals (6, 7). By looking up SA[6] = 2 in Figure 3(a), we find that the query TAG in the reference G = CATAGA$ starts from position 2.

During an iteration of the FM-Index backward search, two memory accesses, for low and high, are issued for each symbol in a query Q. In total, 2|Q| memory accesses are required for an exact-match operation on a query. The FM-Index backward search performance is seriously limited by random memory accesses [14], [15], [30], since each access opens a DRAM row but fetches only 64B.

k-step FM-Index [36] is proposed to reduce the number of memory accesses to 2|Q|/k by updating a k-mer, i.e., k DNA symbols, in each search iteration. The idea of k-step FM-Index is to enlarge the alphabet from Σ to Σ^k. For instance, if k = 2, instead of single DNA symbols, as Figure 4 shows, the enlarged alphabet includes 16 2-mers: AA, AC, . . ., TT. We can construct a BWT with the enlarged alphabet and its corresponding FM-Index to perform k-step backward search in the same way. The trade-off for k-step FM-Index is the increase in its size, which is calculated as

F = (⌈log₂(|G|)⌉ · |G| · |Σ|^k) / (8d) + (|G| · ⌈log₂(|Σ|^k + 1)⌉) / 8,   (2)

where k is the number of DNA symbols updated in each search iteration and d is the sampling distance between stored Occ entries. The size of a multi-step FM-Index increases exponentially with an enlarging k.

To support multi-step searches with smaller DRAM overhead, a recent work proposes Learned Indexes for Sequence Analysis (LISA) [28], consisting of an Index-Paired BWT (IP-BWT) array and a learned index.
• IP-BWT. Figure 5(a) shows the IP-BWT, whose entries pair k BWT symbols with their matching indexes and are stored in sorted order. Since the IP-BWT is sorted, LISA adopts binary search for backward searches. As Figure 5(b) shows, to search the query TAG, we first break it into TA and G, since each iteration can process a 2-mer. In the first iteration, we start with G. The padding algorithm [28] of LISA converts G to G$ for low and GT for high. low and high are initialized to 0 and 6, respectively. ❶ To search [G$, 0], a binary search is performed over the IP-BWT. ❷ During the binary search, G$ is first compared against AT, i.e., row 3 of the IP-BWT. ❸ Because G$ > AT, the binary search goes to row 5, i.e., GA, of the IP-BWT. ❹ Finally, it ends at row 4 of the IP-BWT, i.e., CA. ❺ Since G$ > CA, the new low is calculated as 6 + 1 = 7. high can be computed in the same way. Each search iteration requires log₂(|G|) comparisons due to binary search.
• Learned Index. To reduce the number of comparisons during binary searches, LISA adopts a learned index [29], i.e., a model hierarchy consisting of multiple neural network models, as shown in Figure 5.

III. RELATED WORK AND DESIGN MOTIVATION

It is challenging to achieve high-throughput and power-efficient FM-Index searches with state-of-the-art FM-Index algorithms and accelerators.
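Before examining why the k-step data structure becomes intractable, Equation (2) can be evaluated directly. The sketch below is a worked instance under an assumed Occ sampling distance d; the paper does not state its value, and d = 128 merely lands in the same range as the 105GB and 374GB sizes quoted below for 5-step and 6-step indexes.

```python
import math

def kstep_fmindex_bytes(G_len, k, sigma=4, d=128):
    """Equation (2): the size of a k-step FM-Index in bytes.
    d, the sampling distance between stored Occ entries, is an assumed value."""
    occ_term = math.ceil(math.log2(G_len)) * G_len * sigma**k / (8 * d)
    bwt_term = G_len * math.ceil(math.log2(sigma**k + 1)) / 8
    return occ_term + bwt_term

G = 3_000_000_000   # a ~3G-base human genome
for k in range(1, 7):
    print(f"k={k}: {kstep_fmindex_bytes(G, k) / 1e9:.0f} GB")
# The Occ term grows as |Sigma|^k, so each +1 in k quadruples the table,
# reaching the hundred-gigabyte range for k = 5 and k = 6.
```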
Intractable Size of k-step FM-Index. We recorded the row IDs of 200 consecutive 1-step FM-Index search iterations in Figure 6(a), where 197 different rows are accessed. Because 1-step FM-Index processes only 1 DNA symbol in each iteration, in most cases, searching a DNA symbol by 1-step FM-Index requires one row activation. Prior accelerators [14], [15] for 1-step FM-Index expect no row buffer hit and thus adopt close-page policy. Though k-step FM-Index can search k DNA symbols by activating a row, as Figure 6(b) shows, its data structure size increases exponentially with the step number k. For instance, 5-step FM-Index costs 105GB, while 6-step FM-Index occupies 374GB. As a result, 5-step FM-Index (FM-5) improves search throughput by only 1.21× over 1-step FM-Index, as shown in Figure 6(d). Further enlarging the step number of k-step FM-Index (FM-6) decreases its search throughput improvement, due to more TLB misses.

Weakness of LISA Learned Index. LISA can search k DNA symbols after each row activation by its IP-BWT. Moreover, as Figure 6(b) shows, the size of LISA increases linearly with the step number k. However, the learned index of LISA suffers from low accuracy and low cache hit rate. First, the LISA learned index has to index |G| IP-BWT entries, where |G| is the length of the reference genome. For a human genome, |G| = 3G. When the learned index of LISA predicts a wrong position, LISA has to linearly search up to |G| IP-BWT entries. As Figure 6(c) describes, on average, LISA has to search ∼3K extra IP-BWT entries during each iteration, due to the low accuracy of its learned index. Consequently, compared to 1-step FM-Index, 21-step LISA (LISA-21) improves search throughput by only 2.15× in Figure 6(d). But if LISA had a perfect learned index (LISA-21P) that always predicts the right position, it would improve search throughput by 5.1× over 1-step FM-Index. Further increasing the step number of LISA (LISA-32) also introduces more TLB misses, thereby achieving a smaller search throughput improvement. Second, traversing the learned index's hierarchical models is also pointer chasing. The LISA learned index consumes ∼1.5GB. If we assume a perfect cache (100% hit) for LISA-21P (LISA-21PC), LISA-21PC improves search throughput by 8.53× over 1-step FM-Index.

Algorithmic Takeaways. (1) Implementing k-step search with a moderately enlarged data structure is important for FM-Index. However, a larger step number, i.e., k, may not necessarily lead to better search throughput. (2) An accurate learned index with a cache-friendly model hierarchy is critical to translating k-step searches into real throughput gains.

(Fig. 6 caption: In Figure 6(c), we present the maximal and minimal errors, the mean of errors, and the 25th, 50th, and 75th percentiles of errors. In Figure 6(d), FM-X means X-step conventional FM-Index; LISA-X means X-step LISA; LISA-XP means X-step LISA with a 100% accurate index; and LISA-XPC denotes X-step LISA with a 100% accurate index and a 100% hit cache.)

Conventional Accelerators. CPUs [12], GPUs [21], FPGAs [13], and ASICs [37] can accelerate only 1-step FM-Index searches. A recent FPGA design [30] supports 2-step FM-Index searches, while k-step LISA is built merely on CPUs. Existing FM-Index application-specific accelerators, including FPGAs [13], [30] and ASICs [37], can search only 1 or 2 DNA symbols after each DRAM row activation. Therefore, they cannot fully exploit the maximal DRAM bandwidth.

Processing-In-Memories. Though recent works [14], [15], [38], [39] propose PIM accelerators to process FM-Index searches, they cannot fully utilize the available DRAM bandwidth either.
Most NVM-based PIMs [14], [38], [39] focus on processing the simple arithmetic computations of FM-Index in NVM arrays, but ignore optimizing external memory accesses. For instance, a ReRAM-based FM-Index PIM, FindeR [14], has only 2.6GB of memory arrays, so it still suffers from low DRAM bandwidth utilization when fetching FM-Index buckets that cannot fit into its internal arrays from external DRAM. Only a DRAM PIM, MEDAL [15], modifies its DRAM main memory for higher FM-Index search throughput. Instead of processing multiple DNA symbols during a DRAM row activation, MEDAL enables chip-level parallelism, where each chip in a rank can independently process a DNA symbol by opening its partial row. In the ideal case, 16 chips in a rank can enlarge FM-Index search throughput by 16×, and each chip has only 1/16 of the row size. However, we find that the DDR4 address bus shared by all chips in a rank seriously limits the search throughput of MEDAL. The DDR4 address bus is only 17-bit wide [40]. During each access, the row activation and the column access serially pass their addresses via the same 17-bit bus. As Figure 7 shows, MEDAL can sequentially activate a 1/16 partial row in chips 0 ∼ 2. But when activating a 1/16 partial row in chip 3, its row address (R-A3) and the column address of chip 0 (C-A0) compete for the same address bus. The activation in chip 3 has to be delayed to CK12, and idle bubbles appear on the DDR4 data bus. Because of address bus conflicts, although MEDAL claims a 68× search throughput improvement over a multi-core CPU baseline, we observe that it improves search throughput by only 11×.

IV. EXMA

We first create a row-buffer-friendly alternative to FM-Index, the EXMA table, to process multiple DNA symbols in each search iteration. We then present a multi-task-learning (MTL)-based index to accelerate searches over an EXMA table. Compared to the LISA learned index, the MTL-based index is more accurate, although it has fewer parameters.

A. EXMA Table

Data Structure. The major reason why the learned index of LISA is inaccurate is that it has to index |G| IP-BWT entries. To reduce the problem size for a learned index, we propose a novel data structure, the EXMA table. In each row of the Occ table, only one k-mer can increase its value by one. For instance, in the 2-step Occ table shown in Figure 4, only "AC" increases its value, from 0 to 1, in row 1. Based on this observation, as Figure 8 shows, for each k-mer, an EXMA table stores only the row numbers of the Occ table where its value increases. For example, the value of "AA" increases in rows 2, 3, 6, and others. We store these row numbers as the increments of "AA". In total, we have f0 "AA"s, so "AA" has f0 increments. An EXMA table has a space complexity of O(|G|log(|G|)). For a 3G-base human genome, the increments occupy 12GB. We consecutively store the increments of all k-mers to take advantage of row buffer locality. Each k-mer needs a base to point to its first increment. For instance, the base of "AC" is f0 + 1, indicating that its first increment is at position f0 + 1. In total, a k-step EXMA table stores 4^k bases, each of which is related to a k-mer. Even if a k-mer has no increment, e.g., "TC", its base is set to MAX = |G| + 1. The bases of a k-step EXMA table are stored consecutively as an array indexed by k-mer.

To compute Occ(k-mer, pos) in each search iteration, we compare pos against the increments of the k-mer and find the first increment larger than pos, where pos can be low or high. For instance, to compute Occ(AA, 4), we first read the base of "AA", which is 0. Then, we initialize a counter and start a linear search from position 0 of "AA". If an increment is smaller than 4, we increase the counter by 1. When 6 is found, we stop, since 6 > 4. At last, the counter value is 2, i.e., Occ(AA, 4) = 2.
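The lookup just described can be sketched in a few lines of Python. The flat increments array and the base and f dictionaries below are illustrative stand-ins for the consecutively stored increments, the per-k-mer bases, and the per-k-mer increment counts of an EXMA table.

```python
# Toy 2-step EXMA table for the running example: the Occ value of "AA"
# increases at rows 2, 3 and 6, so those rows are stored as its increments.
increments = [2, 3, 6]            # increments of "AA" (other k-mers would follow)
base = {"AA": 0}                  # base: position of a k-mer's first increment
f = {"AA": 3}                     # f_kmer: number of increments of the k-mer

def exma_occ(kmer, pos):
    """Occ(kmer, pos): count the k-mer's increments smaller than pos by
    linearly scanning from the k-mer's base until an increment >= pos."""
    counter = 0
    start = base[kmer]
    for i in range(start, start + f[kmer]):
        if increments[i] >= pos:  # first increment not smaller than pos: stop
            break
        counter += 1
    return counter

print(exma_occ("AA", 4))          # -> 2, matching Occ(AA, 4) = 2 in the text
```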
Naïve Adoption of a Learned Index. Since all increments of each k-mer are sorted, similar to LISA, we can adopt a learned index [29] to perform, in the best case, only one comparison to compute Occ(k-mer, pos). We build a learned index model hierarchy for each k-mer that has > 256 increments. As Figure 9(a) shows, similar to LISA [28], to build learned indexes, we use a fixed ratio between the number of parameters of a learned index model and the number of increments that need to be indexed. As a result, if a k-mer has more increments, its learned index model has more parameters. For instance, the model of "TT" (m_TT) owns more weights and biases than that of "AA" (m_AA), since "TT" has more increments. To compute Occ(AA, pos) in a search iteration, we first read all the parameters of m_AA, and input only pos to the model. If the prediction is not correct, we start a linear search from the predicted position to find the correct position. However, since an EXMA table has to index |G| increments in total, the naïve learned index of EXMA does not have higher accuracy than that of LISA.

Step # of an EXMA Table. We tuned the step number k of an EXMA table to balance DRAM overhead and search throughput. For a genome reference G, the size of the increments in an EXMA table is constant, while the size of the bases in the table is proportional to 4^k. Although a small k requires few bases, its search throughput is low. For instance, for a 3G-base human genome, if k = 2, the bases require only 32 bytes, but only a 2-mer can be processed in each search iteration. In contrast, a large k improves the search throughput but significantly increases the number of bases and thus the size of an EXMA table. For a human genome, as Figure 10 shows, k = 15 strikes a good balance between table size and search throughput, so we adopt a 15-step EXMA table (EXMA-15).

(Fig. 10 caption: The trade-off of an EXMA table (EXMA-X means X-step EXMA): (a) data structure size; (b) normalized throughput on the CPU baseline. Fig. 11 caption: Increment distributions of 15-mers: (a) 15 "A"s; (b) 7 "AC"s and "A"; (c) 7 "AT"s and "G".)

B. MTL-Based Index

As Figure 11 shows, the increments of various k-mers in 15-step EXMA exhibit similar random distributions. Based on Stein's paradox [44], it is more accurate to approximate many independent random distributions using samples from all of them rather than approximating them separately. The MTL-based index is trained with the increments of multiple k-mers to obtain superior accuracy over learning the increments of each k-mer independently. We adopt hard-parameter-sharing MTL [43], which shares a subset of parameters between the learned index models of k-mers having different numbers of increments. For instance, as Figure 9(b) shows, "AA" uses the smallest model (m_AA) to index its increments, since it has the fewest increments among all 2-mers. "TT" has more increments, and thus a larger model (m_TT) which contains the entire m_AA. Besides m_AA, m_TT comprises more levels of nodes to index its increments more accurately. Compared to the naïve learned index, we add more neurons in the hidden layers of each non-leaf node of an MTL-based index to accommodate two inputs, i.e., pos and the k-mer. But the size of an MTL-based index is smaller than that of the naïve learned index, since the k-mers share most parameters of an MTL-based index.
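The following is a minimal, untrained NumPy sketch of hard parameter sharing, assuming a single shared trunk plus small private stacks; the paper's actual index is a tree-shaped model hierarchy with linear-regression leaves, so this flattened structure and all weights are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))

# Shared trunk: one fully-connected layer of 10 sigmoid neurons taking the
# two inputs (k-mer code, pos). Every k-mer reuses these parameters.
W_trunk, b_trunk = rng.normal(size=(2, 10)), np.zeros(10)

# Private parts: "AA" (few increments) owns only a linear-regression leaf,
# while "TT" (many increments) inserts an extra 10-neuron level before its leaf.
private = {
    "AA": ([], (rng.normal(size=(10, 1)), np.zeros(1))),
    "TT": ([(rng.normal(size=(10, 10)), np.zeros(10))],
           (rng.normal(size=(10, 1)), np.zeros(1))),
}

def F(kmer, kmer_code, pos):
    """Estimate the probability of observing an increment <= pos; scaling this
    output by f_kmer yields the predicted position (see Equation (3) below)."""
    h = sigmoid(np.array([kmer_code, pos]) @ W_trunk + b_trunk)
    hidden, (W_leaf, b_leaf) = private[kmer]
    for W, b in hidden:
        h = sigmoid(h @ W + b)
    return (h @ W_leaf + b_leaf).item()   # linear leaf, as in the paper

print(F("AA", 0, 0.4), F("TT", 15, 0.4))  # m_TT evaluates a superset of the shared parameters
```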
Inference. The MTL-based index of an EXMA table effectively approximates Occ(k-mer, pos) as

p = F(k-mer, pos) × f_kmer,   (3)

where p is the predicted position of pos in the increments of the k-mer; F(·, ·) is the neural network model hierarchy estimating the probability of observing an increment ≤ pos; and f_kmer is the number of increments of the k-mer, which can be stored along with the k-mer's base. After each inference, we read both p and p + 1 to check whether the prediction is correct. If not, we start a linear search to find the correct position. Therefore, the accuracy of an MTL-based index decides the search throughput of FM-Index, but has no impact on the quality of the final DNA mapping. An MTL-based index model hierarchy is a tree structure. To keep the index size in check, we deploy simple linear regression models [45] as leaf nodes in the model hierarchy of the MTL-based index. A linear regression model contains only one weight and one bias. Each non-leaf node is a neural network having a fully-connected layer that contains 10 neurons with sigmoid activation.

Training. Assume there are p k-mers T_1, . . ., T_p, where each k-mer T_i has f_i increments {x_{i,j}} and their positions {y_{i,j}}, j = 1, . . ., f_i. The learning function for the k-mer T_i is defined as w_i^T x + b_i. Based on [42], [43], [46], we formulate the loss function to learn the relations between k-mers as

min_{W,b} Σ_{i=1}^{p} β_i Σ_{j=1}^{f_i} l(w_i^T x_{i,j} + b_i, y_{i,j}),

where W = (w_1, . . ., w_p); b = (b_1, . . ., b_p)^T; l(·, ·) is the cross-entropy loss function; and β_i is the importance of k-mer T_i. We trained the MTL-based index to minimize this loss with an Adam optimizer. Similar to LISA [28], the training and testing of an MTL-based index use the same dataset, i.e., the EXMA table of a reference. Training an MTL-based index for a reference typically takes 1 ∼ 2 days. Once an MTL-based index is trained, billions of exact-match operations can run against its reference.

MTL-based Index Performance. We profiled the performance of EXMA-15 equipped with a naïve learned index in Figure 12. As Figure 12(a) shows, 2.5E-5% and 4E-6% of 15-mers have 64K ∼ 256K and > 1M increments, respectively. However, searching the 15-mers having 64K ∼ 256K and > 1M increments consumes 36% and 20.5% of the search time, respectively, as shown in Figure 12(b). This is because the naïve learned index predicts many wrong positions, and the linear search overhead is large. The prediction errors of the naïve learned index for the 15-mers having 64K ∼ 256K (learn-256K) and > 1M (learn-1M) increments are shown in Figure 13. On average, we have to search 917 and 2133 more increments to find the correct one for the 15-mers having 64K ∼ 256K and > 1M increments, respectively. The MTL-based index greatly improves prediction accuracy by simultaneously learning from multiple 15-mers having similar amounts of increments. The MTL-based index (MTL) reduces the mean prediction error to 45 and 182 for the 15-mers having 64K ∼ 256K and > 1M increments, respectively. As a result, the MTL-based index (EXMA-15M) improves search throughput by 75% over LISA-21 with only half the number of parameters, as shown in Figure 10(b).

C. EXMA Accelerator

We propose a hardware accelerator to process search operations over an EXMA table using an MTL-based index. We integrate two caches for the bases and the MTL-based index of an EXMA table to reduce unnecessary DRAM accesses. We then present EXMA scheduling to improve the cache hit rate. We also introduce a dynamic page policy to improve the DRAM row buffer hit rate when searching the increments of an EXMA table. At last, we create CHAIN compression to reduce the EXMA table size.
The architecture of our EXMA accelerator is shown in Figure 14. The kernel of the EXMA accelerator is an inference engine computing predictions of an MTL-based index. We adopt the state-of-the-art Tangram neural network accelerator [47] as the inference engine. The Tangram accelerator consists of a number of processing elements (PEs) organized in a 2D array. Each PE includes a simple ALU for multiply-accumulate (MAC) operations and a small 32B register file. A larger SRAM buffer is shared by all PEs. We add two small caches for the bases and the MTL-based index of an EXMA table. Data fetched and stored by the accelerator goes through a de/compression unit (§IV-C4) that de/compresses the bases and increments of an EXMA table. We integrate a scheduling queue into the EXMA accelerator to implement 2-stage EXMA scheduling (§IV-C2). At last, the dynamic page management (§IV-C3), switching between open- and close-page policies, is implemented in the CPU memory controller.

Poor Locality of the MTL-based Index. The conventional first-ready first-come-first-serve (FR-FCFS) policy is adopted by almost all accelerators [13]-[15], [30], [37], [38] to schedule FM-Index searches. However, FR-FCFS significantly degrades the hit rates of our base cache and index cache, as the example of four requests R0 ∼ R3 scheduled by FR-FCFS in Figure 15(c) shows. When R0 arrives, both caches are empty and thus have a miss. The bases of "TTTG" and "TTTT" are then stored in the base cache, while the index nodes m0, m2 and m18 used by R0 are stored in the index cache. For R1, both caches also suffer a miss. The bases of "AAAA" and "AAAC" are fetched into the base cache, while m0, m1 and m3 used by R1 are installed into the index cache. For R2, the base cache has a miss, but the index cache has a hit. At last, both caches have a miss for R3. In total, four misses happen in the base cache, while three misses occur in the index cache.

2-Stage Scheduling. We propose a 2-stage EXMA scheduling technique to enhance the hit rates of the base and index caches. Unlike FR-FCFS, which schedules requests based on their addresses and arrival order, EXMA re-orders the requests according to their data, including k-mers and positions (pos). In the first stage, EXMA sorts the FM-Index requests based on their k-mers. By consecutively issuing FM-Index requests in the lexicographic order of their k-mers, the hit rate of the base cache increases, since the bases of k-mers sorted in lexicographic order tend to have stronger spatial locality. As Figure 16(a) shows, the first stage of EXMA scheduling issues the four requests in the order of R1, R3, R2, and R0. The base cache has two hits and two misses, but the index cache has four misses. This is why EXMA needs a second stage of scheduling. During the second stage of EXMA scheduling, the four requests are ranked according to their pos values. By consecutively computing inferences for the requests having small differences between their pos values, the index cache can expect more hits. This is because the smaller the difference between the pos values of two requests, the more likely these two requests use the same MTL index nodes during searches. As Figure 16(b) shows, the index cache then has two misses and two hits. In total, 2-stage EXMA scheduling has 4 misses and 4 hits.

Implementation. Our 2-stage EXMA scheduling is implemented with the scheduling queue, which is a content-addressable memory (CAM). A CAM can perform sorting operations [48]. The k-mer and pos of a request are stored in one row of the CAM. Each DNA symbol in a k-mer is denoted by 3 bits, since we need to encode $, A, C, T and G. Requests can be sorted in the CAM based on their k-mers or pos values.
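Functionally, the two stages amount to two sorts over the pending requests, one per pipeline phase. The sketch below is one plausible software rendering of that behavior (the CAM performs the sorting in hardware), with made-up request values.

```python
from typing import NamedTuple

class Request(NamedTuple):
    kmer: str
    pos: int

def two_stage_schedule(requests):
    """Stage 1: issue base-cache lookups in lexicographic k-mer order.
    Stage 2: issue MTL-index inferences in ascending pos order, so requests
    with close pos values tend to reuse the same index nodes."""
    base_order = sorted(requests, key=lambda r: r.kmer)   # first stage
    index_order = sorted(requests, key=lambda r: r.pos)   # second stage
    return base_order, index_order

# Four in-flight requests in arrival order R0..R3 (values are illustrative).
reqs = [Request("TTTG", 905), Request("AAAA", 17),
        Request("AAGT", 12), Request("TTTT", 890)]
for stage in two_stage_schedule(reqs):
    print(stage)
```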
Dynamic Page Policy. Prior FM-Index accelerators [13]-[15], [30], [37], [38] adopt close-page policy in their DRAM main memories. They always pre-charge a DRAM row after each access, since conventional 1-step FM-Index searches have little spatial locality, as shown in Figure 6(a). On the contrary, our EXMA table stores the increments of a k-mer consecutively in DRAM rows. As the FM-Index backward search algorithm (lines 2 ∼ 3 in Figure 3(d)) indicates, each iteration issues two requests searching the same k-mer but with different position values: in a search iteration, we compute both Occ(k-mer, low) and Occ(k-mer, high). Both search the increments of the same k-mer, which are very likely stored in the same row. So our accelerator asks the CPU memory controller (MC) to keep the DRAM row open after the first request in a search iteration is processed, since we expect the second request to hit in the row buffer. The row is pre-charged after the second request is processed.

Implementation. The dynamic page policy is implemented in the CPU MC. When a request is being searched, if another request pending in the scheduling queue of the accelerator searches the same k-mer, the CPU MC keeps the DRAM row open after the ongoing request completes. Otherwise, it pre-charges that DRAM row. The CPU MC maintains a register to indicate whether all rows are closed, and another register per bank to record which row is open.

B∆I. The state-of-the-art cache compression technique, B∆I [49], breaks a 64-byte cache line into eight 8-byte data sections. As Figure 17(a) shows, to compress the cache line, B∆I records the first data section (data_0) and stores only the difference (∆_i) between each data section (data_i) and data_0. To decompress a data section, B∆I simply calculates data_i = data_0 + ∆_i. B∆I typically reduces the data size of SPEC-CPU2006 applications by ∼50%. In these applications, the data sections in a cache line are not sorted.

CHAIN. Since both the increments and the bases of each k-mer are sorted and stored consecutively in a DRAM row, we believe they are more compressible than the data of SPEC-CPU2006 applications. Therefore, we propose a novel compression technique, CHAIN, to compress an EXMA table more aggressively. As Figure 17(b) describes, to compress the increments in a 64B memory line, CHAIN first stores the first increment (incr_0). It then stores the value difference (∆_i) between each two consecutive increments, so that ∆_i = incr_i − incr_{i−1}. To decompress an increment incr_i in a 64B memory line, CHAIN simply computes incr_i = incr_0 + Σ_{j=1}^{i} ∆_j. The bases of an EXMA table can be (de)compressed in the same way.

Implementation. CHAIN compression and decompression require only 64-bit adders. Multiple adders operate concurrently for CHAIN compression, while CHAIN decompression requires only one adder for accumulations. CHAIN decompression slightly prolongs FM-Index search latency but greatly increases FM-Index search throughput.
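A minimal sketch of CHAIN on one memory line follows, assuming the increments arrive sorted as they do in an EXMA table; the hardware version performs the same arithmetic with 64-bit adders.

```python
def chain_compress(line):
    """CHAIN: keep the first increment, then store only the differences
    between consecutive increments (sorted, so deltas are small and non-negative)."""
    deltas = [line[i] - line[i - 1] for i in range(1, len(line))]
    return line[0], deltas

def chain_decompress(incr0, deltas, i):
    """Recover incr_i = incr_0 + sum of delta_1..delta_i with one accumulator."""
    return incr0 + sum(deltas[:i])

incs = [1000, 1003, 1004, 1010, 1018]   # sorted increments in one 64B memory line
incr0, deltas = chain_compress(incs)
assert all(chain_decompress(incr0, deltas, i) == incs[i] for i in range(len(incs)))
# Sorted data yields tiny deltas that need far fewer bits than the 64-bit
# originals, unlike the unsorted data sections that B-Delta-I targets.
```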
As Figure 14 describes, ❶ after receiving FM-Index requests from the CPU, the accelerator stores them in its scheduling queue and performs the first scheduling stage. ❷ Based on their k-mers, the accelerator checks whether the bases of the requests stored in the queue are in the base cache. ❸ If misses occur, DRAM accesses are issued to fetch the bases; otherwise, the accelerator performs the second scheduling stage. ❹ The accelerator checks whether the MTL index nodes required by the requests in the queue are in the index cache, according to both their k-mer and pos values. ❺ If misses happen, DRAM accesses are issued to fetch MTL index nodes; otherwise, the inference engine computes with the MTL index nodes. ❻ Once the inference of a leaf node is finished, the accelerator issues a DRAM access to read the increment at the predicted position. If the increment is correct, the computation of Occ(k-mer, pos) is complete; otherwise, it linearly searches DRAM rows to find the correct increment. ❼ All DRAM accesses from the EXMA accelerator are managed by its DMA controller communicating with the CPU MC, which decides whether to pre-charge opened rows based on the requests in the scheduling queue of the accelerator.

Our EXMA accelerator is connected to a CPU processor as a loosely-coupled non-coherent accelerator [50], [51] by a Network-on-Chip (NoC). EXMA accesses DRAM via two DMA-dedicated planes of the NoC, bypassing the cache hierarchy of the CPU. The EXMA data region must be flushed from the CPU cache hierarchy before FM-Index searches start. We chose the non-coherent model [50] for better performance, since the memory footprint of FM-Index searches is always larger than the CPU LLC capacity.

The training overhead of an MTL-based index is discussed in the Training paragraph of §IV-B. The hardware overhead of the EXMA accelerator is summarized in Table I. From [47], we adopted the inference engine, which runs at 800MHz and is synthesized with a Synopsys 28nm generic library. We quantized the model hierarchy of the MTL index to 8 bits without accuracy degradation. A PE has an 8-bit MAC ALU and a 32B register file. The inference engine contains four 8 × 8 PE arrays, each of which has a 16KB shared SRAM buffer and an activation unit. The EXMA accelerator also includes a scheduling queue (an SRAM-based CAM) with 512 128-bit entries, a 32KB 16-way SRAM index cache, and a 1MB 8-way eDRAM base cache. We modeled the area and power of memory components including registers, buffers and caches with CACTI [53]. We used the same DDR4-2400 DRAM main memory configuration as the recent FM-Index PIM MEDAL [15]. The EXMA accelerator connects to four DRAM channels with a total capacity of 384GB. We adopted DRAMPower [54] to model the power consumption of our DDR4 main memory.

V. EXPERIMENTAL METHODOLOGY

Simulation. We used gem5-aladdin [55] to model our CPU baseline and our EXMA accelerator. The configuration of our CPU baseline is shown in Table I. We used McPAT [56] to calculate the power consumption of the CPU processor. We implemented the main memory system using DRAMsim2 [57].

Accelerator Baselines. Besides the CPU, we also ran the FM-Index search kernel on an Nvidia Tesla P100 GPU [58], a Stratix-V FPGA [30], and a 28nm ASIC design [37]. We compared EXMA against recent FM-Index PIM designs, MEDAL [15] and FindeR [14]. We used gem5-gpu [59] to simulate the GPU, and gem5-aladdin to model the computing units of the FPGA, ASIC and PIMs. All their DRAM main memories are implemented by DRAMsim2. The power data of the GPU, FPGA, ASIC and PIMs are adopted from [14], [15], [30], [37], [58]. The power of their DRAM main memories is also modeled by DRAMPower.
Workloads. To evaluate EXMA, we adopted FM-Index-based genome analysis applications: BWA-MEM [12] for short read alignment, MA [16] for long read alignment, SGA [24] for read assembly, ExactWordMatch [25] for annotation, and a reference-based compression algorithm [26]. SGA for long reads uses the FM-Index-based error correcting scheme [33] to reduce errors.

Datasets. For alignment, annotation and compression, we used human (human, 3G-bp), picea glauca (picea, 20G-bp), and pinus lambertiana (pinus, 31G-bp) genomes as reference genomes. To study short reads, we adopted DWGSim [60] to generate 101-bp short reads with 50× coverage. To evaluate long reads, we created 1K-bp long reads with PBSIM [61]. The error profiles of the reads are summarized in the format of (name, mismatch%, insertion%, deletion%, total error%): (Illumina, 0.18%, 0.01%, 0.01%, 0.2%) [62], (PacBio, 1.50%, 9.02%, 4.49%, 15.01%), and (ONT 2D, 16.50%, 5.10%, 8.40%, 30.0%) [10].

Schemes. The schemes we studied can be summarized as:
• CPU: We ran LISA-21 for FM-Index searches in genome applications on our CPU baseline. We also applied B∆I [49] compression on the LISA data for the three datasets.
• EXMA-15: the 15-step EXMA table with an MTL-based index, running on the CPU baseline.
• EX-acc: EXMA-15 running on our accelerator without the scheduling optimizations.
• EX-2stage: EX-acc enhanced by 2-stage EXMA scheduling.
• EXMA: EX-2stage further enhanced by the dynamic page policy and CHAIN compression.

VI. RESULTS AND ANALYSIS

Throughput Comparison against CPU. As Figure 18 shows, we compare the FM-Index search throughput of EXMA and the CPU by running the seeding stage of short read alignment, since FM-Index searches consume 99% of the seeding time in short read alignment. Compared to the CPU, on average, EXMA-15 improves search throughput by 80%. Our MTL-based index achieves high accuracy on picea, since the increment distributions of its different k-mers are more similar to each other. EX-acc improves search throughput by 7.25× over the CPU: our EXMA accelerator can support more concurrent search operations, while the CPU has only a limited number of LLC MSHRs. Compared to the CPU, EX-2stage increases search throughput by 15×. Pinus with EX-2stage has the smallest throughput improvement, because the pinus MTL-based index is the largest among the 3 datasets, and thus its index cache has the lowest hit rate. On average, EXMA increases search throughput by 23.6× over the CPU.

Performance Comparison against CPU. We report the speedup achieved by EXMA in various genome applications in Figure 19, where we list 3 sets of "alignment and assembly" for reads generated by Illumina, Nanopore, and PacBio, respectively. For each application, although EXMA obtains a smaller FM-Index search throughput improvement on larger datasets (Figure 18), e.g., pinus, EXMA improves the application performance more significantly on larger datasets. This is because the CPU spends a larger portion of the execution time of a genome analysis application on FM-Index searches when processing larger datasets, which introduce more TLB and data cache misses. EXMA achieves a larger performance improvement on alignment and assembly for Illumina, annotation, and compression, since FM-Index searches dominate the execution of these applications. On average, EXMA improves the performance of genome applications by 2.5× ∼ 3.2× when processing the various datasets.

Energy Comparison against CPU. We compare the energy reduction obtained by EXMA in various genome applications in Figure 20, where we list 3 sets of "align(ment) and asse(mbly)" for reads generated by Illumina, Nanopore, and PacBio, respectively. On average, EXMA reduces the total energy consumption of genome analysis by 61% ∼ 70% when processing the different datasets.
The major part of the energy reduction comes from avoiding the use of the CPU processor during FM-Index searches. The more time FM-Index searches consume in a genome analysis application, the more energy reduction EXMA achieves in that application. On average, the EXMA accelerator consumes < 3% of the total energy of the various genome applications. The vast majority of the energy is consumed by the DRAM main memory and by the CPU handling the non-FM-Index-search parts of genome analysis applications.

Comparison against Accelerators. We evaluated EXMA against various hardware accelerators, including a GPU [58], an FPGA [30], an ASIC [37], and two PIMs [14], [15], when processing pinus, as shown in Table II.

Bandwidth Utilization. Figure 21 compares bandwidth utilization, which is defined as the ratio between the data capacity fetched from DRAM and the total DRAM bandwidth. The ASIC running FM-1 uses only 26% of the total DRAM bandwidth, since it adopts close-page policy and fetches only 64B after activating a 2KB row. The GPU implementing LISA-21 fetches entire rows from host memory, so it achieves higher bandwidth utilization. MEDAL increases bandwidth utilization by activating each chip individually. However, due to conflicts on the address bus, MEDAL uses only 67% of the DRAM bandwidth. In contrast, EXMA obtains 91% bandwidth utilization by switching between close-page and open-page policies.

DIMM Number. We studied the sensitivity of EXMA and MEDAL to the DIMM number in Figure 22. With 2 DIMMs, EXMA improves search throughput by 29% over MEDAL. With 3 DIMMs, EXMA scales its throughput up linearly, by 40%, since a single EXMA accelerator can maintain all the DIMMs. However, MEDAL increases its throughput by only 14.5% with 3 DIMMs. Each MEDAL PIM accelerator sits on a DIMM, and more DIMMs bring more ranks, so MEDAL suffers from non-trivial inter-DIMM communication overhead. When the number of DIMMs increases to 4, the search throughput of neither EXMA nor MEDAL increases significantly: the data bus (bandwidth utilization) of EXMA is saturated, while the address bus of MEDAL is saturated.

PE Array Number. We show the search throughput of EXMA with a varying number of PE arrays in Figure 22. Two PE arrays already achieve 89% of the search throughput of the configuration with four PE arrays, because the computations of MTL-based indexes are not intensive. Further increasing the PE array number to 8 increases search throughput by only 3% over the configuration with four PE arrays. So we selected 4 PE arrays in our baseline configuration.

CAM & Cache. We explored the search throughput of EXMA with varying CAM and base cache sizes in Figure 22. We used CAMs consisting of 256, 512, and 1024 entries to serve as the scheduling queue of EXMA. A 256-entry CAM cannot fully satisfy 2-stage scheduling and the scheduling for the dynamic page policy, and thus achieves only 77% of the search throughput of the configuration with a 512-entry CAM. Further increasing the CAM entry number to 1K improves search throughput by only 2% over the configuration with a 512-entry CAM. Compared to the index cache, the search throughput is more sensitive to the capacity of the base cache, since the global buffer and register files of the PE arrays can temporarily store MTL-based index nodes. We tried 512KB, 1MB and 2MB for the base cache. The search throughput stops increasing significantly when the base cache capacity reaches 1MB. So we selected a 512-entry CAM and a 1MB base cache in our baseline configuration.
CHAIN. We show the compression results of CHAIN on pinus in Figure 23. Since the size of the IP-BWT table of LISA-21 is proportional to its step number, the total data size of LISA-21 (original) is 2.2× larger than that of EXMA-15 (original). B∆I compresses the data of LISA-21 to 50% of its original size, i.e., 152GB. In contrast, CHAIN compresses the data of EXMA-15 to only 25%, i.e., 40GB. We observed similar compression rates for B∆I and CHAIN on the other genome datasets.

VII. CONCLUSION

Though state-of-the-art genome analysis adopts FM-Index to process exact-matches, FM-Index is notorious for its random memory access patterns. In this paper, we first present a row-buffer-friendly and highly-compressible EXMA table with an MTL-based index to process multiple DNA symbols with each DRAM row activation during a search iteration. We then build a hardware accelerator to process FM-Index searches over an EXMA table. Compared to the state-of-the-art FM-Index PIM MEDAL, EXMA improves search throughput by 4.9×, and enhances search throughput per Watt by 4.8×.

REFERENCES

Insight into biases and sequencing errors for amplicon sequencing with the Illumina MiSeq platform
Improved performance of the PacBio SMRT technology for 16S rDNA sequencing
Oxford Nanopore announcement sets sequencing sector abuzz
Long-read genome sequencing identifies causal structural variation in a Mendelian disease
Genome editing for global food security
Nanopore sequencing as a rapidly deployable Ebola outbreak tool
Multiplex PCR method for MinION and Illumina sequencing of Zika and other virus genomes directly from clinical samples
A novel coronavirus from patients with pneumonia in China
Short read mapping: an algorithmic tour
Darwin: a genomics coprocessor provides up to 15,000x acceleration on long read assembly
GenAx: a genome sequencing accelerator
Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM
The SMEM seeding acceleration for DNA sequence alignment
FindeR: accelerating FM-index-based exact pattern matching in genomic sequences through ReRAM technology
MEDAL: scalable DIMM-based near data processing accelerator for DNA seeding algorithm
Accurate high throughput alignment via line sweep-based seed processing
GenCache: leveraging in-cache operators for efficient sequence alignment
A resistive CAM processing-in-storage architecture for DNA sequence alignment
Race logic: a hardware acceleration for dynamic programming algorithms
Accelerating Smith-Waterman alignment of long DNA sequences with OpenCL on FPGA
SOAP3-dp: fast, accurate and sensitive GPU-based short read aligner
A block-sorting lossless data compression algorithm
A comparison of seed-and-extend techniques in modern DNA read alignment algorithms
Efficient de novo assembly of large genomes using compressed data structures
Annotating large genomes with exact word matches
Compressing similar biological sequences using FM-index
Boosting the FM-index on the GPU: effective techniques to mitigate random memory access
LISA: towards learned DNA sequence search
The case for learned index structures
Leveraging FPGAs for accelerating short read alignment
A 135-mW fully integrated data processor for next-generation sequencing
A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers
FMLRC: hybrid long read error correction using an FM-index
An efficient error correction algorithm using FM-index
Exploring single-sample SNP and indel calling with whole-genome de novo assembly
n-step FM-index for faster pattern matching
Accelerating FM-index search for genomic data processing
AlignS: a processing-in-memory accelerator for DNA short read alignment leveraging SOT-MRAM
PIM-Aligner: a processing-in-MRAM platform for biological sequence alignment
DDR4 SDRAM standard JESD79-4C
An overview of multi-task learning in deep neural networks
Multi-task learning using uncertainty to weigh losses for scene geometry and semantics
Integrated perception with recurrent multi-task neural networks
Inadmissibility of the usual estimator for the mean of a multivariate normal distribution
Applied linear statistical models
Multi-task feature learning
Tangram: optimized coarse-grained dataflow for scalable NN accelerators
A proposed structure of a 4 Mbit content-addressable and sorting memory
Base-delta-immediate compression: practical data compression for on-chip caches
Determining optimal coherency interface for many-accelerator SoCs using Bayesian optimization
Accelerators and coherence: an SoC perspective
Design and implementation of an advanced DMA controller on AMBA-based SoC
CACTI-IO: CACTI with off-chip power-area-timing models
DRAMPower: open-source DRAM power & energy estimation tool
Co-designing accelerators and SoC interfaces using gem5-aladdin
McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures
DRAMsim2: a cycle accurate memory system simulator
Nvidia Tesla P100
The sequence alignment/map format and SAMtools
PBSIM: PacBio reads simulator - toward accurate genome assembly
Illumina error profiles: resolving fine-scale variation in metagenomic sequencing data