key: cord-0330518-6p5kkgud authors: Zhang, Feng; Pan, Zaifeng; Zhou, Yanliang; Zhai, Jidong; Shen, Xipeng; Mutlu, Onur; Du, Xiaoyong title: G-TADOC: Enabling Efficient GPU-Based Text Analytics without Decompression date: 2021-06-13 journal: nan DOI: 10.1109/icde51399.2021.00148 sha: fff12f3234c8b3a240073bd4000de83089946fe7 doc_id: 330518 cord_uid: 6p5kkgud

Text analytics directly on compression (TADOC) has proven to be a promising technology for big data analytics. GPUs are extremely popular accelerators for data analytics systems. Unfortunately, no work so far shows how to utilize GPUs to accelerate TADOC. We describe G-TADOC, the first framework that provides GPU-based text analytics directly on compression, effectively enabling efficient text analytics on GPUs without decompressing the input data. G-TADOC solves three major challenges. First, TADOC involves a large number of dependencies, which makes it difficult to exploit massive parallelism on a GPU. We develop a novel fine-grained thread-level workload scheduling strategy for GPU threads, which partitions heavily-dependent loads adaptively in a fine-grained manner. Second, having thousands of GPU threads write to the same result buffer leads to inconsistency, while directly using locks and atomic operations leads to large synchronization overheads. We develop a memory pool with thread-safe data structures on GPUs to handle such difficulties. Third, maintaining the sequence information among words is essential for lossless compression. We design a sequence-support strategy, which maintains high GPU parallelism while ensuring sequence information. Our experimental evaluations show that G-TADOC provides a 31.1× average speedup compared to state-of-the-art TADOC.

Text analytics directly on compression (TADOC) [1]-[4] has proven to be a promising technology for big data analytics. Since TADOC processes compressed data without decompression, a large amount of space can be saved. Meanwhile, TADOC reuses both data and intermediate computation results, so the same content appearing in different parts of the original files is processed only once, saving significant computation time. Recent studies show that TADOC can save up to half of the processing time and 90.8% of storage space [1], [2]. On the other hand, the GPU, as a heterogeneous processor, shows promising performance in many real applications, such as artificial intelligence. It is popular to use heterogeneous processors, such as GPUs, to accelerate data analytics systems [5]-[9]. Therefore, it is essential to enable efficient data analytics on GPUs without decompression.

Applying GPUs to accelerate TADOC brings three key benefits. First, GPU performance is much higher than CPU performance, so applying GPUs with a proper design can greatly accelerate TADOC, letting users analyze massive data with little perceived delay. Second, previous TADOC work mainly focuses on distributed systems; if we develop a GPU-based solution that achieves higher performance on a single HPC server, tremendous resources, including equipment and electricity costs, can be saved. Third, many data analytics applications, such as latent Dirichlet allocation (LDA) [10] and term frequency-inverse document frequency (TFIDF) [11], have been ported to GPUs, while TADOC has proven to be suitable for these advanced data analytics applications.
Hence, providing a GPU solution would remove the last barrier to applying TADOC to a wide range of applications.

Although it is both beneficial and essential to develop TADOC on GPUs, building efficient GPU-based TADOC is very challenging. Applying GPUs to accelerate TADOC faces three challenges. First, TADOC organizes data into rules, which can further be represented as a DAG. Unfortunately, the number of dependencies in the rule-structured DAG of TADOC is extremely large, which is unfriendly to GPU parallelism. For example, in our experiments, the generated DAG for each file has 450,704 dependent middle-layer nodes on average, which greatly limits its parallelism. Even worse, a node in the DAG of TADOC can have multiple parents, which makes this problem more complicated. Second, a large number of GPU threads writing to the same result buffer inevitably causes tremendous write conflicts. A straightforward solution is to lock the buffer, but such atomicity sacrifices performance; in the worst case, the parallel performance is lower than that of the sequential CPU TADOC. Third, maintaining and utilizing the sequence information on GPUs is another difficulty: the original TADOC uses recursive calls to perform a sequential traversal of the compressed data, which resembles a depth-first search (DFS) and is extremely hard to parallelize.

Currently, none of the TADOC solutions can address the challenges of enabling TADOC on GPUs mentioned above. Zhang et al. [2] first proposed a TADOC solution, but it is designed in a sequential manner. Although TADOC can be applied in a distributed environment, it adopts coarse-grained parallelism, and the processing of each compressed unit is still sequential. Zhang et al. next developed a domain specific language (DSL), called Zwift, to express TADOC [1], and further realized random accesses on compressed data [3]. However, the parallelism problems still exist. Zhang et al. [4] then provided a parallel TADOC design, which provides much better performance than the sequential TADOC. Unfortunately, such parallelism is still coarse-grained: it only divides the original file into several sub-files, processes the different files separately, and then follows a merge process, which cannot be utilized by GPUs efficiently.

We design G-TADOC, the first framework that provides GPU-based text analytics directly on compression, effectively enabling efficient text analytics on GPUs without decompressing the input data. G-TADOC involves three novel features that address the above three challenges. First, to utilize GPU parallelism, we develop a fine-grained thread-level workload scheduling strategy on GPUs, which adaptively allocates thread resources according to the load of different rules and uses masks to describe the relations between rules (Section IV-B). Second, to solve the challenge of write conflicts among multiple threads, we let G-TADOC maintain its own memory pool and design thread-safe data structures; we use a lock buffer when multiple threads update the global results simultaneously (Section IV-C). Third, to support sequence-sensitive applications in G-TADOC, we develop head and tail data structures in each rule to store the contents at the beginning and end of the rule, which requires only a lightweight DAG traversal (detailed in Section IV-D).
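To make the fine-grained thread-level scheduling idea concrete, below is a minimal CUDA sketch of one way such a strategy can be realized; it is an illustration under assumed names and data layout, not G-TADOC's actual code. Threads are apportioned to rules in proportion to rule load through a host-computed prefix sum, so a heavily loaded rule is shared by many cooperating threads.

```cuda
// Hypothetical sketch of fine-grained thread-level workload scheduling.
// threadOffset[] is an exclusive prefix sum of per-rule thread counts,
// computed on the host in proportion to each rule's load.
#include <cuda_runtime.h>

struct Rule {
    const int *elements;  // numeric elements (words or subrule ids)
    int length;           // number of elements in this rule
};

__global__ void processRules(const Rule *rules, int numRules,
                             const int *threadOffset, int totalThreads)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= totalThreads) return;

    // Binary-search the rule that owns this thread.
    int lo = 0, hi = numRules - 1, rule = 0;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (threadOffset[mid] <= tid) { rule = mid; lo = mid + 1; }
        else                          { hi = mid - 1; }
    }

    int begin   = threadOffset[rule];
    int end     = (rule + 1 < numRules) ? threadOffset[rule + 1] : totalThreads;
    int workers = end - begin;   // threads cooperating on this rule
    int lane    = tid - begin;   // this thread's index within the rule

    // Cooperating threads stride over disjoint slices of the rule's elements,
    // e.g., counting words or recording subrule relations via masks.
    for (int i = lane; i < rules[rule].length; i += workers) {
        int element = rules[rule].elements[i];
        (void)element;
    }
}
```

Proportional assignment keeps one huge rule from serializing the computation while tiny rules still receive at least one thread each.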
We evaluate G-TADOC on three GPU platforms, which involve three generations of Nvidia GPUs (Pascal, Volta, and Turing micro-architectures), and use five real-world datasets of varying lengths, structures, and content. Compared to TADOC on CPUs, G-TADOC achieves a 31.1× speedup. In detail, TADOC can be divided into two phases: initialization and DAG traversal. For the initialization phase, G-TADOC achieves 76.5% time savings, while for the DAG traversal phase, G-TADOC achieves 82.2% time savings. As far as we know, this is the first work enabling efficient text analytics on GPUs without decompression. In summary, we make the following contributions in this work.
• We present G-TADOC, which is the first framework enabling efficient GPU-based text analytics directly on compressed data.
• We unveil the challenges of developing TADOC on GPUs and provide a set of solutions to these challenges.
• We evaluate G-TADOC on three GPU platforms, and demonstrate its significant benefits compared to state-of-the-art TADOC.

II. BACKGROUND

In this section, we introduce TADOC and GPUs, which are the background and premises of our work. TADOC [1]-[4] is a novel lossless compression technique that enables data analytics directly on compressed data without decompression. In detail, TADOC adopts dictionary conversion to encode the original input data with numbers, and then uses a context-free grammar (CFG) to recursively represent the numerically transformed data as rules. Repeated pieces of data are transformed into different rules in the CFG, and data analytics tasks are then represented as rule interpretations. To leverage redundant information between files, TADOC inserts unique splitting symbols at file boundaries. Moreover, the CFG can be represented as a directed acyclic graph (DAG), so the interpretation of the rules for data analytics can be regarded as a DAG traversal problem. Currently, TADOC extends Sequitur [12]-[14] as its core algorithm.

We use Figure 1 to show how TADOC compresses data by CFG representation, which is an example used in [2]. Figure 1(a) shows the original input data, which consists of two files, file A and file B, where "wi" represents a unique word. Figure 1(b) shows the dictionary conversion, which uses an integer to represent each element. Note that the rules "Ri" and file splitters "spti" are also transformed into numerical form. Figure 1(c) shows the TADOC compressed data, which are sequences of numbers. The TADOC compressed data can be viewed as the CFG shown in Figure 1(d), which can be further organized as the DAG shown in Figure 1(e) for traversals.

We use Figure 2 to show how to utilize the TADOC compressed data to perform a simple data analytics application, word count. In Step 1, R2 transmits its accumulated local word frequencies to its parents, which are R0 and R1. In Step 2, R1 receives the word frequencies from R2 and merges these frequencies into R1's local frequency table. In Step 3, R1 transmits its accumulated word frequencies to its parent R0. After R0 receives the word frequencies from all its children, which are R1 and R2, R0 merges all received word frequencies into R0's word count results, which are also the final word counts.

GPUs are specialized devices originally targeting graphics and image processing, and their high parallelism makes them attractive for general-purpose computation.

[Figure 2 residue: CFG relation (R0: R1 R1 R2 w1; R1: R2 w3 R2 w4; R2: w1 w2) and information-propagation Steps 1-3; the intervening text of Sections III and IV is missing from this extraction.]

Figure 5(b) shows the hash table state after inserting <126,1>, assuming the key-value pair is hashed to 1. Because there is no conflict in this insertion, the related value in the next buffer is -1. Figure 5(c) shows the hash table state after inserting <163,1>, assuming the key-value pair is hashed to 3. Accordingly, G-TADOC simply stores the key-value pair <163,1> after the first <126,1>. Figure 5(d) shows a hash conflict situation: the hash table state after inserting <78,1>, assuming the key-value pair is hashed to 1. Because <126,1> has already been inserted into the first entry, we update its "next" buffer to point to a new place for the newly inserted <78,1>. Note that the lock buffer is used only when multiple threads write to the same buffer location. Moreover, if the hash table is private and owned by one thread, we do not need to create the locks.
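To make the thread-safe hash table just described concrete, here is a hedged CUDA sketch of one possible insertion routine; the field names and pool layout are illustrative assumptions, not G-TADOC's actual API. Conflicting keys are chained through the "next" buffer, chain nodes come from a preallocated memory pool, existing keys are merged with atomic additions, and the per-entry lock is taken only when a new node must be appended.

```cuda
#include <cuda_runtime.h>

// Illustrative layout: parallel arrays, with slots [0, numEntries) acting as
// the table entries and later indices as chain nodes from the memory pool.
struct GpuHashTable {
    long long *key;    // key[i]  = stored key, or -1 if the slot is empty
    int *value;        // value[i] = accumulated count
    int *next;         // next[i] = index of the next chained node, or -1
    int *lock;         // lock[i] = 0 free, 1 held (one lock per entry)
    int *poolTop;      // bump pointer into the preallocated node pool
    int numEntries;
};

// Returns false when the entry lock is busy; the caller then clears the stop
// flag so that the host relaunches the kernel for another round.
__device__ bool insertPair(GpuHashTable t, long long k, int v)
{
    int slot = (int)(k % t.numEntries);

    // Fast path: the key already exists somewhere in the chain.
    for (int i = slot; i != -1; i = t.next[i])
        if (t.key[i] == k) { atomicAdd(&t.value[i], v); return true; }

    if (atomicCAS(&t.lock[slot], 0, 1) != 0)
        return false;                      // lock busy: retry next round

    // Re-check under the lock: another thread may have inserted k meanwhile.
    int last = slot;
    bool merged = false;
    for (int i = slot; i != -1; i = t.next[i]) {
        if (t.key[i] == k) { atomicAdd(&t.value[i], v); merged = true; }
        last = i;
    }
    if (!merged) {
        if (t.key[slot] == -1) {           // the entry itself is still empty
            t.value[slot] = v;
            t.key[slot] = k;
        } else {                           // append a node from the pool
            int node = atomicAdd(t.poolTop, 1);
            t.key[node] = k; t.value[node] = v; t.next[node] = -1;
            t.next[last] = node;
        }
    }
    __threadfence();                       // publish writes before unlocking
    atomicExch(&t.lock[slot], 0);
    return true;
}
```

The lock-free fast path reflects the point above that atomic additions suffice when the key already exists; the lock is needed only to serialize structural changes to a chain.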
Head and tail structures for sequence support. The head and tail structures are used to support sequence-sensitive applications, such as sequence count [2]. Because G-TADOC traverses the DAG in parallel, some rules may involve cross-rule sequences (a word sequence spanning multiple nodes in the DAG). We design head and tail data structures for each rule to store the content at the beginning and end of the rule, which are provided to the parents. We show an example in Figure 6. In the root, the first sequence does not span across rules. However, the next three-word sequence spans across the root and R1. For this sequence, we store the partial content from the beginning of R1 in the head buffer of R1, so that this cross-rule sequence can be processed by the parent, which is the root. Similarly, we store the words at the end of R1 in the tail buffer of R1 so that R1's parent can quickly process the sequences containing the words in R1's tail buffer. Note that the first few elements and the last few elements of a subrule can themselves be rules. For example, in Figure 4, the first element of R2 is also a rule, so a sequence from the root can span more than two rules, which is complicated. In our design, each rule can be handled by different threads. If we provide the head and tail buffers of all rules, we can avoid multi-rule scanning by looking only into the head and tail buffers of the direct subrules. In summary, the parents are responsible for processing cross-rule sequences, and the problem can be solved by scanning the head and tail buffers of the direct children. More details are presented in Section IV-D.

In this part, we discuss the sequence support in G-TADOC for sequence-sensitive applications. The sequence support in TADOC [2] is implemented with recursive function calls, which is inefficient and hard to parallelize on GPUs. To improve the sequence support of TADOC and parallelize it on GPUs, we have the following insights. First, to fully parallelize the rule processing, each rule needs to include the head and tail buffers mentioned in Section IV-B to remove the sequence dependencies across rules. Second, a first-round initialization phase is required to fill the head and tail buffers of all rules. Third, the original recursive design in TADOC [2] is inefficient and thus should be abandoned; a more efficient parallel graph traversal needs to be developed. Based on this analysis, we develop a two-phase sequence support design for sequence-sensitive applications.

Initialization phase. The first initialization phase prepares the head and tail buffers of each rule with a lightweight scan. The upper limit of memory space for each rule is shown in Equation 1, where wordSize denotes the size of the word elements, l denotes the sequence length, and subRuleSize denotes the number of subrules.
The detailed process to generate the head and tail buffers of each rule is shown in Figure 7. The CPU side uses a while-loop to continuously check whether all the head and tail buffers have been filled. To generate the head buffers, G-TADOC traverses the rules and puts a given number of consecutive words from the beginning of each rule into its head buffer. During this process, if G-TADOC encounters a subrule, it first checks the related mask. If the mask is set, which implies that the subrule's head buffer is ready, then G-TADOC can copy the content from the subrule's head buffer into the current rule's head buffer; otherwise, the calculation fails and must be retried in the next round. The generation of the tail buffers is similar to the generation of the head buffers.

Graph traversal phase. The second phase of sequence support is shown in Figure 8. Similar to the first phase shown in Figure 7, the CPU part uses a while-loop to control the DAG traversal process. We use sequence count [2] as an example, which uses the hash tables described in Figure 5. For sequence support, we need to reduce the intermediate results in the rules' local tables. We use parallel hash tables to merge these results, as discussed in Section IV-C. First, we assign each key-value pair a mask and each entry a lock. Second, each thread is responsible for one key-value pair. Third, each thread checks whether its key-value pair still needs to be inserted; if not, G-TADOC returns directly. Otherwise, G-TADOC locates the entry based on the hash function and verifies whether the same key already exists in this entry. If the key exists, G-TADOC uses atomic additions directly and then sets the mask to true; otherwise, G-TADOC tries to obtain the lock of the entry. If the lock is occupied by another thread, G-TADOC sets the stop flag to false and returns directly; if G-TADOC obtains the lock, it verifies again whether the same key has been inserted in the meantime. If so, G-TADOC uses atomic additions to avoid duplication; otherwise, G-TADOC obtains a new node and sets the entry accordingly. Finally, G-TADOC unlocks the entry, sets the mask to true, and returns. Note that the CPU part repeatedly launches this process until the stop flag is true.
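The following hedged CUDA sketch mirrors the host-controlled rounds of Figure 7 for the head buffers; the names and the element encoding (word ids below an assumed RULE_BASE) are illustrative, not G-TADOC's real kernels. Each head buffer holds at most l-1 words, since a sequence of length l can overlap a rule boundary by at most l-1 words.

```cuda
#include <cuda_runtime.h>

#define RULE_BASE (1 << 24)  // assumed encoding: ids >= RULE_BASE are subrules
__device__ bool isWord(int e) { return e < RULE_BASE; }

struct SeqRule {
    const int *elements;   // the rule's element sequence (words and subrules)
    int length;            // number of elements
    int *head;             // up to l-1 words copied from the rule's start
    int headLen;           // words currently in the head buffer
};

__global__ void fillHeadBuffers(SeqRule *rules, int *mask, int numRules,
                                int l, int *allDone)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= numRules || mask[r]) return;          // done in an earlier round

    int n = 0;
    for (int i = 0; i < rules[r].length && n < l - 1; ++i) {
        int e = rules[r].elements[i];
        if (isWord(e)) {
            rules[r].head[n++] = e;                // literal word
        } else {
            int sub = e - RULE_BASE;
            if (!mask[sub]) { *allDone = 0; return; } // subrule not ready yet
            for (int j = 0; j < rules[sub].headLen && n < l - 1; ++j)
                rules[r].head[n++] = rules[sub].head[j];
        }
    }
    rules[r].headLen = n;
    __threadfence();                               // publish head before mask
    mask[r] = 1;
}

// Host side: relaunch until a full pass completes with every mask set,
// matching the while-loop of Figure 7.
void buildHeads(SeqRule *dRules, int *dMask, int n, int l, int *dDone)
{
    int done = 0;
    while (!done) {
        done = 1;
        cudaMemcpy(dDone, &done, sizeof(int), cudaMemcpyHostToDevice);
        fillHeadBuffers<<<(n + 255) / 256, 256>>>(dRules, dMask, n, l, dDone);
        cudaMemcpy(&done, dDone, sizeof(int), cudaMemcpyDeviceToHost);
    }
}
```

The tail buffers can be generated by the same round structure scanning from the end of each rule.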
V. IMPLEMENTATION

We integrate G-TADOC into the CompressDirect (CD) [2] library, which is an implementation of TADOC. G-TADOC in CD includes two parts: 1) the CPU part, which reads the input data and the program and manages the GPU module, and 2) the GPU part, which performs GPU-based TADOC acceleration. We use the same interfaces as TADOC in CD, including word count, sort, inverted index, term vector, sequence count, and ranked inverted index, so users do not need to change any code to benefit from the GPU support.

VI. EVALUATION

In this section, we measure the performance of G-TADOC and compare it with TADOC [2].

We show our experimental setup in this part. Methodology. The baseline in our evaluation is TADOC [2], which is the state-of-the-art system for data analytics directly on compression, denoted as "TADOC". Our method that enables TADOC on GPUs is denoted as "G-TADOC". In our evaluation, we measure TADOC [2] performance and G-TADOC performance for comparison. Moreover, we assume that small datasets can be stored and processed in GPU memory directly without PCIe data transmission; large datasets are stored on disk, and the required PCIe data transfers are included in the time measurement.

Platforms. We use three GPU platforms and a 10-node Amazon EC2 cluster in our evaluation, as shown in Table I. We evaluate G-TADOC on three generations of Nvidia GPUs (Pascal, Volta, and Turing micro-architectures), which demonstrates the adaptability of G-TADOC. Since GPU architectures are constantly changing, achieving high performance on all these platforms makes it likely that G-TADOC will also achieve promising results on future GPU products. The 10-node cluster is a Spark cluster on Amazon EC2 [16] used for running TADOC on large datasets.

Datasets. The datasets used in our evaluation are shown in Table II, which include various real-world workloads. Datasets A, B, and C are used in [1]-[4]. Dataset A is NSF Research Award Abstracts (NSFRAA) downloaded from the UCI Machine Learning Repository [17], and is composed of a large number of small files. Dataset B is a collection of four web documents downloaded from Wikipedia [15]. Dataset C is a large Wikipedia dataset [15]. To increase the diversity of the test data, we add datasets D and E compared to previous works [1]-[4]. Dataset D is COVID-19 data from Yelp [18], and dataset E is a collection of DBLP web documents [19]. Note that only dataset C is evaluated on the 10-node cluster.

In this part, we measure the speedups of G-TADOC over TADOC and show their time breakdowns. Overall speedups. We show the speedups that G-TADOC achieves over TADOC [2] on the five datasets in Figure 9. In detail, Figure 9(a) shows the speedups on the Pascal platform, Figure 9(b) shows the speedups on the Volta platform, and Figure 9(c) shows the speedups on the Turing platform. We have the following observations.

First, G-TADOC achieves significant performance speedups over TADOC in all cases. On average, G-TADOC achieves a 31.1× speedup over TADOC. The reason is that the GPU device for G-TADOC provides much higher computing power and bandwidth than the CPU device for TADOC. For example, on the Pascal platform, the theoretical peak performance of the GPU is about 185.3× that of the CPU, and the bandwidth provided by the GPU memory is about 8.3× that of the CPU memory. The speedups achieved by G-TADOC further prove the effectiveness of our solutions for handling dependencies in our parallel GPU design.

Second, the speedups of G-TADOC over TADOC on single nodes for processing small datasets are higher than those on clusters for processing the large dataset C. The average speedup of G-TADOC over TADOC on a single node is 57.5×, while the average speedup of G-TADOC over TADOC on a ten-node cluster is 2.7×. The reason is that when processing the large dataset, TADOC adopts coarse-grained parallelism in the distributed environment to improve data processing efficiency; even so, due to the data exchange overhead between nodes, G-TADOC is still more efficient than TADOC.

Third, the speedups G-TADOC achieves for sequence count and ranked inverted index are much higher than those of the other applications in most cases. In detail, the average speedups of sequence count and ranked inverted index are 111.3× and 112.0×, which are much higher than the overall average speedup.
The reason is that sequence count and ranked inverted index in TADOC [2] perform poorly: as described in [2], their performance is close to that of the original implementations on uncompressed data. In G-TADOC, sequence count and ranked inverted index reuse the partial results of duplicate data and execute in parallel on GPUs.

We analyze the G-TADOC acceleration in different phases and the traversal strategies on GPUs in this part. Speedups in different phases. We show the separate speedups of G-TADOC over TADOC in different phases in Figure 10. First, the average speedup in the second phase is 64.1×, which implies that most of G-TADOC's performance benefits come from accelerating the second phase, DAG traversal. The second phase also has a relatively long execution time in TADOC [2], which provides more opportunities for parallel optimization. Second, for the first phase, although the execution time is relatively short, G-TADOC still achieves clear performance speedups: on average, G-TADOC is 9.5× faster than TADOC on CPUs. Third, for the large dataset C, the speedups in the first (initialization) phase are extremely high, which shows that data structure preparation for massive files is time-consuming in TADOC [2] and is an essential target for G-TADOC acceleration.

Top-down vs. bottom-up traversals. We develop top-down and bottom-up traversals in G-TADOC, but the optimal traversal strategy for each application can be input-dependent. For example, for term vector on dataset A, the top-down traversal takes 14.04 seconds, but the bottom-up traversal takes only 1.56 seconds. In contrast, for term vector on dataset B, the bottom-up traversal takes 0.43 seconds, but the top-down traversal takes only 0.11 seconds. In detail, dataset B involves only four files. If we traverse the DAG with a top-down strategy, we only need to maintain a small buffer of 16 bytes in each rule indicating its file information, and the transmission of file information during DAG traversal is also marginal. In contrast, for dataset A, which involves a large number of small files, the top-down traversal with file information would be time-consuming and would drag down the overall performance. Therefore, we should select the bottom-up traversal strategy for dataset A, and the top-down strategy for dataset B. We apply the TADOC adaptive traversal strategy selector on GPUs, as discussed in Section IV-B, which helps select the optimal traversal strategy (a simple illustrative selector is sketched below).

We summarize our findings and insights as follows. First, we find that GPUs are very suitable for text analytics directly on compression, but require special optimizations. For example, G-TADOC needs fine-grained thread-level workload scheduling for GPU threads, thread-safe data structures for parallel updates, and head and tail structures for sequence-sensitive applications. Second, the GPU platform is both cost-effective and energy-efficient, and can be applied to a wide range of data analytics applications directly on compression, especially in large data centers. Experiments show that a GPU server can achieve much higher performance on data analytics directly on compressed data than a ten-node cluster does. Third, although GPU memory is limited, our work helps place much larger content directly in GPU memory. Frequent data transmission between the CPU and GPU drags down the performance advantages of GPUs when large workloads cannot be loaded into GPU memory at once. Our work sheds light on the GPU acceleration design for such big data applications.
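As an illustration of the adaptive selection above, the choice can be reduced to the per-rule cost of carrying file information: roughly 4 bytes per file, given the 16-byte buffer reported for dataset B's four files. The threshold below is a hypothetical rule of thumb, not G-TADOC's actual selector.

```cuda
enum Traversal { TOP_DOWN, BOTTOM_UP };

// numFiles drives the per-rule cost of top-down traversal: each rule carries
// file-membership information of roughly bytesPerFile * numFiles bytes.
Traversal chooseTraversal(long numFiles)
{
    const long bytesPerFile  = 4;    // assumed, from the 16 B / 4 files above
    const long perRuleBudget = 64;   // hypothetical tolerated bytes per rule
    return (numFiles * bytesPerFile <= perRuleBudget) ? TOP_DOWN : BOTTOM_UP;
}
```

Under this rule of thumb, dataset B (4 files, 16 bytes per rule) selects top-down, while dataset A, with its large number of small files, selects bottom-up, matching the measurements above.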
We next discuss the importance of our work and future work.

Importance of our work. As the first work enabling efficient GPU-based text analytics without decompression, G-TADOC provides insights that are of interest to a wide range of readers. Currently, G-TADOC involves only the applications in TADOC [2], but other data analytics tasks can also benefit from G-TADOC. Furthermore, the series of optimizations on GPUs for TADOC can be directly applied to other advanced data analytics scenarios.

Comparison with GPU-accelerated uncompressed analytics. In our evaluation of the six data analytics tasks on the five datasets, G-TADOC reaches 31.1× the performance of the state-of-the-art TADOC on CPUs. A common question is how G-TADOC's performance compares with that of GPU-accelerated analytics on uncompressed data. Currently, there is no existing GPU implementation of the six analytics tasks, so we develop efficient GPU-accelerated uncompressed analytics for comparison. Experiments show that G-TADOC still achieves an average 2× speedup.

Applicability. G-TADOC has the same applicability as TADOC [4]. In general, G-TADOC targets analytics tasks that can be expressed as a DAG traversal problem, which involves scanning the whole DAG.

How far is the performance from the optimal? Although G-TADOC already achieves high performance, it still has room for improvement. The reasons include 1) dependencies in DAG traversal, 2) random accesses to a large memory space, and 3) atomic operations on global buffers. If these issues were solved, G-TADOC could achieve at least 20% additional performance improvement.

Future work. Currently, G-TADOC supports data analytics directly on compression on GPUs. This research is a step toward high-performance and efficient data analytics methods. Possible future avenues of exploration include architecture optimizations and multi-GPU environments, which can further accelerate G-TADOC. As far as we know, G-TADOC is the first work that enables efficient GPU-based text analytics without decompression.

VII. RELATED WORK

In this section, we discuss related work on grammar compression, compression-based data analytics, and GPU data analytics.

Grammar compression. There are plenty of works on grammar compression [1]-[4], [20]-[28]. The closest work to G-TADOC is TADOC, which is text analytics directly on compression in single-node and distributed environments [2]. TADOC extends Sequitur [12]-[14] as its compression algorithm for data analytics. After TADOC was proposed, Zhang et al. [1] proposed Zwift, the first TADOC programming framework, including a domain specific language, a TADOC compiler and runtime, and a utility library. Then, Zhang et al. [4] applied TADOC as the storage layer to support advanced document analytics, such as word co-occurrence [29], [30], term frequency-inverse document frequency (TFIDF) [11], word2vec [31], [32], and latent Dirichlet allocation (LDA) [10]. Furthermore, Zhang et al. [3] enabled random accesses to TADOC compressed data and, at the same time, supported insert and append operations. In this work, we enable TADOC on GPUs, which improves the performance of TADOC significantly.

Index compression. Compression-based data analytics is an active research domain in recent years. However, typical approaches mainly use suffix trees and indexes [33]-[38], [38]-[41].
Suffix trees are traditional representations for data compression [33], [42] but incur huge memory usage [43], [44]. Suffix arrays [45] and the Burrows-Wheeler Transform [35], [36] are developments of these compression formats, but still incur high memory consumption [43]. Compressed suffix arrays [46]-[50] and FM-indexes [36], [51]-[54] are more efficient than the previous compression techniques. Furthermore, Agarwal et al. proposed Succinct [34], which targets queries on compressed data. Moreover, there are many works on inverted index compression [55]-[61]. For example, Petri and Moffat [55] developed compression tools for compressed inverted indexes. Different from these works, G-TADOC targets text analytics directly on compressed data on GPUs.

GPU data analytics. GPUs have been applied to various aspects of data analytics, including structured data analytics, stream data analytics, graph analytics, and machine learning analytics [5]-[9], [62]-[66]. For example, MapD (Massively Parallel Database) [5] is a popular big data analytics platform powered by GPUs. Most current analytics frameworks, such as Spark, support GPUs [6]. SABER [7] is a stream system that schedules queries on both CPUs and GPUs, and Zhang et al. [9] further developed FineStream, which enables fine-grained stream analytics on CPU-GPU integrated architectures. Gunrock [8] is an efficient graph library for graph analytics on GPUs, and for large graphs, multi-GPU graph analytics have been explored [62]. For machine learning data analytics, parallel technologies have been extensively applied, especially for deep learning applications [63]. Currently, most machine learning frameworks, such as TensorFlow [64], support GPUs.

VIII. CONCLUSION

In this paper, we have presented G-TADOC, which enables efficient GPU-based text analytics without decompression. We show the challenges of parallelism, result-update conflicts among threads, and sequence sensitivity in developing TADOC on GPUs, and present a series of solutions to these challenges. By developing an efficient parallel execution engine with thread-safe data structures and sequence support on GPUs, G-TADOC achieves a 31.1× speedup on average compared to state-of-the-art TADOC.
REFERENCES

Zwift: A Programming Framework for High Performance Text Analytics on Compressed Data
Efficient document analytics on compressed data: Method, challenges, algorithms, insights
Enabling efficient random access to hierarchically-compressed data
TADOC: Text analytics directly on compression
MapD: A GPU-powered big data analytics and visualization platform
Spark-GPU: An accelerated in-memory data processing engine on clusters
SABER: Window-based hybrid stream processing for heterogeneous architectures
Gunrock: GPU graph analytics
FineStream: Fine-Grained Window-Based Stream Processing on CPU-GPU Integrated Architectures
Latent Dirichlet allocation
A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization
Inferring sequential structure
Identifying hierarchical structure in sequences: A linear-time algorithm
Linear-time, incremental hierarchy inference for compression
Wikipedia HTML data dumps
Amazon elastic compute cloud (Amazon EC2)
UCI machine learning repository
COVID-19 Data from Yelp Open Dataset
Grammar compression, LZ-encodings, and string algorithms with implicit input
The smallest grammar problem
A faster grammar-based self-index
Random access to grammar-compressed strings and trees
Finger search in grammar-compressed strings
GraCT: a grammar-based compressed index for trajectory data
Balancing straight-line programs
A space-optimal grammar compression
POSTER: Exploring Deep Reuse in Winograd CNN Inference
Keyword extraction from a single document using word co-occurrence statistical information
GloVe: Global vectors for word representation
Compact Data Structures: A Practical Approach
Succinct: Enabling queries on compressed data
A block-sorting lossless data compression algorithm
Indexing compressed text
When indexing equals compression: Experiments with compressing suffix arrays and applications
Compressed text indexes: From theory to practice
Bicriteria data compression: efficient and usable
On the bit-complexity of Lempel-Ziv compression
From theory to practice: Plug and play with succinct data structures
Compressed suffix trees with full functionality
Practical aspects of Compressed Suffix Arrays and FM-Index in Searching DNA Sequences
Reducing the space requirement of suffix trees
Suffix arrays: a new method for on-line string searches
High-order entropy-compressed text indexes
Compressed suffix arrays and suffix trees with applications to text indexing and string matching
Compressed text databases with efficient query algorithms based on the compressed suffix array
Succinct representations of lcp information and improvements in the compressed suffix arrays
New text indexing functionalities of the compressed suffix arrays
FM-index
Opportunistic data structures with applications
An experimental study of a compressed index
An experimental study of an opportunistic index
Compact inverted index storage using general-purpose compression libraries
Index compression using byte-aligned ANS coding and two-dimensional contexts
Fast dictionary-based compression for inverted indexes
Techniques for Inverted Index Compression
Compressed Indexes for Fast Search of Semantic Data
The potential of learned index structures for index compression
Compressing inverted indexes with recursive graph bisection: A reproducibility study
Multi-GPU graph analytics
Parallel approaches to machine learning - a comprehensive survey
TensorFlow: Large-scale machine learning on heterogeneous distributed systems
An Efficient Parallel Secure Machine Learning Framework on GPUs
Mining hard samples globally and efficiently for person re-identification