key: cord-0258432-cx8bhh7r authors: Gaihre, Anil; Zheng, Da; Weitze, Scott; Li, Lingda; Song, Shuaiwen Leon; Ding, Caiwen; Li, Xiaoye S; Liu, Hang title: Dr. Top-k: Delegate-Centric Top-k on GPUs date: 2021-09-16 journal: nan DOI: 10.1145/3458817.3476141 sha: d68beb54431476edf2fa90f621887baa31591826 doc_id: 258432 cord_uid: cx8bhh7r Recent top-$k$ computation efforts explore the possibility of revising various sorting algorithms to answer top-$k$ queries on GPUs. These endeavors, unfortunately, perform significantly more work than needed. This paper introduces Dr. Top-k, a delegate-centric top-$k$ system on GPUs that can reduce the top-$k$ workloads significantly. Particularly, it contains three major contributions: First, we introduce a comprehensive design of the delegate-centric concept, including maximum delegate, delegate-based filtering, and $\beta$ delegate mechanisms to help reduce the workload for top-$k$ by up to more than 99%. Second, due to the difficulty and importance of deriving a proper subrange size, we perform a rigorous theoretical analysis, coupled with thorough experimental validation, to identify the desirable subrange size. Third, we introduce four key system optimizations to enable fast multi-GPU top-$k$ computation. Taken together, this work consistently outperforms the state-of-the-art. Formally, top-k algorithms find the top k elements from an input vector V. Here, the criterion could be the k largest or smallest elements, or any other condition of interest. For simplicity, we assume the default criterion in this paper to be the k largest. The k-selection algorithm differs slightly from the top-k algorithm, as k-selection only identifies the k-th largest element of V. These two algorithms serve as building blocks for a variety of applications, such as High Performance Computing (HPC) [26, 51], Information Retrieval (IR) [8, 11], deep learning training [3, 48, 49], big data [14, 15, 27], and data mining [25, 34, 54]. A textbook implementation of top-k exploits a priority queue, i.e., a min-heap: a priority queue of size k slides through the input vector, and each encountered element that is larger than the minimum of the priority queue replaces that minimum (sketched in the code below). Eventually, this priority queue captures the k largest elements of the input vector V. Recently, the interest in deploying top-k computation on GPUs has surged for three major reasons. First, GPUs offer superior processing power and memory throughput compared to other processing hardware [17, 23, 47]. For instance, the recent A100 GPU [28] features an astonishing 312 Tera Floating Point Operations Per Second (TFLOPS) of computing capability and 2,039 GB/s of memory throughput. Second, both the existing leading supercomputers [45] and future exascale ones (e.g., Aurora [46], Frontier [32], and El Capitan [41]) use GPUs as the major computing resources.
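To make the textbook min-heap baseline above concrete, here is a minimal single-threaded C++ sketch; the function name and types are illustrative and not taken from the paper's code.

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

// Returns the k largest elements of v using a size-k min-heap.
// One pass over v; each element larger than the heap's current
// minimum replaces that minimum.
std::vector<uint32_t> topk_min_heap(const std::vector<uint32_t>& v, size_t k) {
    std::priority_queue<uint32_t, std::vector<uint32_t>,
                        std::greater<uint32_t>> heap;  // min-heap
    for (uint32_t x : v) {
        if (heap.size() < k) {
            heap.push(x);
        } else if (x > heap.top()) {
            heap.pop();
            heap.push(x);
        }
    }
    std::vector<uint32_t> result;
    while (!heap.empty()) { result.push_back(heap.top()); heap.pop(); }
    return result;  // ascending order; the last element is the maximum
}
```

This runs in O(|V| log k) time on one core; as the paper discusses shortly, exposing this pattern to massive GPU parallelism requires many thread-local queues and an expensive global merge.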
Third, the majority of the applications that exploit top-, such as IR [24] , deep learning [1] , data mining [36, 53] , and database applications, e.g., PG Strom [35] , Ocelot [7] , and MapD [38] are offloaded atop GPUs, deploying top-on GPUs could avoid copying data back and forth between GPU and CPU for top-computation. While priority queue-based top-is the most efficient approach for single-or multi-core systems [55] , it requires to maintain many local priority queues to expose massive parallelism to GPUs. Unfortunately, maintaining such many priority queues would experience expensive global synchronization overhead when merging these local priority queues into a final global one. Consequently, pertinent top-applications do not adopt priority queue-based top-. Instead, they use sort-and-choose approach for top-computing on GPUs [6, 18, 33, 44, 48] . However, as shown in Figure 17 , the GPU-based sort-and-choose top- [6] takes much longer time than GPU-based top-k algorithms. Revising sorting algorithms to compute top-becomes a popular trend because, at most, only a subset of data needs to be sorted in the top-k problem. Along this direction, bitonic top- [42] presents a revised bitonic sort algorithm [20] that focuses on the top-elements when merging 2 elements together. Since this rudimentary design can only reduce the workload by half, [42] further proposes to read 8 elements and reduce it to while using GPU shared memory to cache the intermediate results. Due to the limited capacity of shared memory on GPUs, bitonic top-can only work for very small (i.e., ≤ 256). Another notable attempt [2] revises bucket sort by discarding all buckets that do not include the ℎ elements at each iteration, similarly for radix top-. Despite these designs in [2] have the chance of reducing more workloads, they would suffer from unstable workload reductions (see Figure 6 ). To reduce more workload in a stable manner, we introduce Dr. Top-k, a delegate-centric system that partitions the input vector into subranges, extracts the delegate from each subrange, and uses the top-of the delegates to rapidly reduce the workload for the overall top-computation on the input vector. It is essential to note that the popular IR algorithm, i.e., Block Maximum WAND (BMW) [11] also uses the delegate concept for search engine designs. In contrast, Dr. Top-k has a more comprehensive design and innovative usage for the delegate concept. Taken together, it can help state-of-the-art top-algorithms to improve their performance significantly with the following three contributions: First, we introduce a comprehensive delegate-centric design, which includes maximum delegate, top-delegate-based filtering, and delegate mechanisms to help reduce the workload for top-up to more than 99%. Specifically, we (i) partition the input vector into a collection of subranges and extract the maximum delegate from each subrange to construct a delegate vector, and (ii) perform topon the delegate vector. Since only those subranges whose maximum delegates belong to the top-of the delegate vector can contribute to the top-for the input vector, we further (iii) concatenate those qualified subranges to construct a concatenated vector, and (iv) perform top-on the concatenated vector. To further reduce the workload for step (iv), we use the minimum of the top-of the delegate vector to filter out smaller elements from the qualified subranges. 
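As a summary of steps (i)-(iv) and the filtering rule above, the following simplified, sequential C++ sketch walks through the delegate-centric pipeline; in the actual system each step is a GPU kernel, and topk() here stands in for any underlying top-k routine (radix, bucket, or bitonic). All function and variable names are ours.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

// Placeholder for any underlying top-k routine; returns the k largest, descending.
static std::vector<uint32_t> topk(std::vector<uint32_t> v, size_t k) {
    k = std::min(k, v.size());
    std::partial_sort(v.begin(), v.begin() + k, v.end(), std::greater<uint32_t>());
    v.resize(k);
    return v;
}

std::vector<uint32_t> delegate_centric_topk(const std::vector<uint32_t>& V,
                                            size_t k, size_t subrange) {
    size_t nsub = (V.size() + subrange - 1) / subrange;
    // (i) Extract the maximum delegate of every subrange.
    std::vector<uint32_t> delegates(nsub, 0);
    for (size_t s = 0; s < nsub; ++s)
        for (size_t i = s * subrange; i < std::min((s + 1) * subrange, V.size()); ++i)
            delegates[s] = std::max(delegates[s], V[i]);
    // (ii) First top-k on the delegate vector.
    std::vector<uint32_t> topDelegates = topk(delegates, k);
    uint32_t threshold = topDelegates.back();        // k-th largest delegate
    // (iii) Concatenate qualified subranges, filtering out elements below the threshold.
    std::vector<uint32_t> concat;
    for (size_t s = 0; s < nsub; ++s) {
        if (delegates[s] < threshold) continue;      // skip the whole subrange
        for (size_t i = s * subrange; i < std::min((s + 1) * subrange, V.size()); ++i)
            if (V[i] >= threshold) concat.push_back(V[i]);
    }
    // (iv) Second top-k on the (much smaller) concatenated vector.
    return topk(concat, k);
}
```

The filtering is safe because the final top-k can never contain an element smaller than the k-th largest delegate; Rules 1 and 2 in Section 4 formalize this. The β delegate extension described next generalizes the single maximum per subrange to the top β elements of each subrange.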
Not limited there, we extend the maximum delegate to delegate to reduce the workload for concatenation and second top-. In particular, we will extract the top delegates, instead of merely the maximum, from each subrange. Afterward, we introduce a new rule using which we only concatenate subranges whose entire delegates are taken. Second, we deduce the optimal subrange size with both theoretical soundness and experimental validation. Note, a proper subrange size is crucial for Dr. Top-k to achieve a good performance; on the one hand, a small subrange size would lead to too many subranges. In this context, the delegate vector construction and first topwould suffer from heavy workloads. On the other hand, when the subrange size is too large, we would have too few subranges. In this case, the majority of these subranges will be eligible for the second top-. We hence skip too few subranges, leading to limited workload reduction for concatenation and second top-. In Section 5.2, our theoretical analysis derives that the total time consumption of Dr. Top-k is a convex function of subrange size, which we also verify in our experiment. We further extract the optimal subrange size for a wide range of | | and . Third, we deploy Dr. Top-k atop multiple GPUs with four key system optimizations. First, we introduce a warp-centric delegate vector construction mechanism to achieve near-peak GPU global memory throughput. Second, although our delegate-centric design can help all existing top-algorithms, we identify that the best Dr. Top-k assisted top-algorithm changes along with the climbing of . We further introduce a flag-based strategy to avoid random memory access during in-place radix top-. Third, we identify that delegate vector construction suffers from low thread utilization and an exceeding usage of CUDA (an acronym for Compute Unified Device Architecture) shuffle instructions when becomes relatively large. As a result, we introduce a novel coalesced load to shared memory and strided compute approach to improve the thread utilization, as well as curb the usage of CUDA shuffle instructions. This optimization has reduced the delegate vector construction time consumption from 31.4 ms in Figure 10 to 9.5 ms in Figure 15 for = 2 24 and | | = 2 30 . Finally, we scale top-across multiple GPUs to handle gigantic input vectors, and achieve sustained scalability. During evaluation, we notice that the recent top-efforts [2, 42] only test their systems on synthetic datasets, limiting the impacts of top-. This paper hence builds a benchmark that contains three real-world applications, i.e., -nearest neighbor search [21] , website degree centrality [12] , and COVID fear related Twitter dataset [19] for top-. Our evaluation shows that Dr. Top-k can outperform the state-of-the-art on both synthetic and real-world datasets. The remainder of this paper is organized as follows: Section 2 discusses the background and related work which motivate the overview in Section 3. Section 4 presents the delegate-centric topdesign and compares it against BMW. Section 5 deploys our topon multi-GPU systems with GPU-specific optimizations. Section 6 evaluates Dr. Top-k and we conclude in Section 7. Streaming processors and threads. Designed with NVIDIA Volta architecture, V100S [30, 31] is powered by 80 streaming processors (SMs). Each SM is equipped with 64 CUDA cores, yielding a total of 5,120 cores running at 1.5 GHz. 
During execution, a GPU thread runs on one CUDA core, and an SM schedules a group of 32 consecutive threads known as warp in a Single Instruction Multiple Thread (SIMT) manner. Note, all the threads in a warp can use shuffle instructions to exchange data. A collection of consecutive warps further formulate a Cooperative Thread Arrays (CTAs) or a block. All the CTAs are called a grid. Memory architecture. V100S is equipped with 32 GB global memory with 1,134 GB/s as the peak throughput. All the SMs share an L2 cache of 6,144 KB. Each SM owns a private 96KB configurable shared memory, also used as the L1 cache. All the threads in a CTA can use shared memory to communicate with the help of the CUDA __syncthreads() primitive. It is desirable to use shared memory to cache intermediate data because it is around one order of magnitude faster than the global memory [42] . This section discusses the closely related projects for Dr. Top-k that includes priority queue-based top- [42] , sorting-based top- [6, 48] , bucket top- [2] , radix top- [2] and bitonic top- [42] . Priority queue approach. A natural way to compute topwould be to maintain a priority queue that only keeps the topelements while scanning through the input vector. While this idea is well-suited for single-or multi-threaded systems, implementing it on massively parallel GPUs remains elusive. Mainly, a parallel implementation would involve the maintenance of many local priority queues and the eventual merging of these priority queues into a global one. This adds the challenge of frequent read and write and global synchronization across the threads when merging them. Sort-and-choose is an alternative approach that is more friendly for parallel implementation. Basically, we sort the input vector elements using sorting algorithms, like in THRUST [6] and choose the top-elements. But this implementation turns out to do more work than necessary. At least, there is no need to sort the elements that are outside of the range of 1 − ℎ elements. Alabi et. al. [2] also show their top-algorithms, e.g., radix and bucket top-outperforms the sort-and-choose designs. Top-algorithms. Bucket, radix, and bitonic top-are introduced to alleviate the aforementioned inefficiency problems faced by sorting. In contrast to their corresponding sort-and-choose approach, the top-algorithm distributes the input vector into different subranges, like a bucket in bucket top-, and only focuses on the subrange that will lead to the ℎ element of the input vector. Below we explain how these designs work with examples. I. Bucket topfirst obtains the and values from the input vector. Afterward, it divides this − value range into several buckets, with each of which accounting for a disjoint equal value range. In the second step, this method scans through the input vector, puts each element into the corresponding bucket, and tracks the number of elements in each bucket. This way, one can easily figure out which bucket contains the ℎ elements. As mentioned earlier, top-operation discards the buckets that do not contain the ℎ element. This method continues until the bucket of interest only has one element, i.e., the ℎ one. Figure 1 only includes three iterations due to space constraints, this process is supposed to continue until the bucket of interest only contains one element. II. Radix topis similar to bucket top-but exploits the digits (i.e., radixes) of each element to determine which bucket a value belongs to. 
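Before the detailed radix walkthrough that follows, here is a minimal host-side C++ sketch of the shared idea behind bucket and radix selection: histogram the current candidates into buckets, then keep only the bucket that must contain the k-th largest element. The 3-bit digit version shown here mirrors the example below; function and variable names are ours, not from [2], and a real GPU implementation performs this pass in parallel.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// One pass of radix k-selection: examine 3 bits starting at `shift`,
// count how many candidates fall into each of the 8 buckets, and return
// the bucket that must contain the k-th largest element (updating k to
// the rank of that element inside the chosen bucket).
int radix_pass(const std::vector<uint32_t>& candidates, int shift, size_t& k) {
    std::array<size_t, 8> count{};                 // bucket histogram
    for (uint32_t x : candidates)
        count[(x >> shift) & 0x7]++;
    // Walk buckets from the largest digit (111) down to the smallest (000).
    for (int b = 7; b >= 0; --b) {
        if (k <= count[b]) return b;               // the k-th largest lives here
        k -= count[b];                             // skip this whole bucket
    }
    return 0;                                      // unreachable for valid k
}
```

Only the elements whose current digit equals the returned bucket survive to the next, less significant digit position; this per-iteration discarding is exactly what both bucket and radix top-k rely on, as the walkthrough below details.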
The key is that the position of a bucket needs to indicate its order so that we can derive the bucket of interest for the next iteration. Consequently, radix top-k proceeds from the Most Significant Digit (MSD) to the Least Significant Digit (LSD). For instance, if we process 3 bits per iteration, we need eight buckets, that is, 000, 001, 010, 011, 100, 101, 110, and 111, and all the elements in bucket '111' are larger than those in '110'; similarly for the other buckets. At the end of each iteration, we only keep the bucket that contains the k-th element and proceed. III. Bitonic top-k. Improving on the traditional bitonic sorting algorithm, bitonic top-k [42] proposes to discard elements by selecting the top-k elements from a bitonic sequence of size 2k. Therefore, the workload is always reduced by half. Figure 2 demonstrates how bitonic top-2 behaves for the same input vector as Figure 1. Particularly, this algorithm sorts every two consecutive elements in the input vector, as shown in Iteration 1. Afterward, it merges the two adjacent sequences, {101, 2001} and {3012, 1323}, and gets the top-2 from these four elements, that is, {2001, 3012}; similarly for the remaining sequences. This process continues until Iteration 3, where we obtain the final top-2, {3012, 3210}. Other related projects worth mentioning include a GPU-based bucket sorting [10] that takes samples from different regions of the input vector to achieve good workload balancing. That work partitions the input vector into several subranges, performs a local sort in each subrange, and selects multiple samples from each subrange. These samples are collectively processed to guide data reordering on the original vector so that each bucket ends up with a similar amount of work. The top-k in [22] performs a priority queue-based k-selection algorithm in GPU register memory. As the number of registers per thread on a GPU is limited, similar to [42], its performance degrades for k ≥ 1024. A recent work [37] uses sampling to make bucket select more immune to skewed data distributions. Particularly, this work samples splitters from the original vector; these splitters are then sorted and used to assign bucket ranges. While that work adjusts the bucket boundaries to reduce the workload at each level, our work directly reduces the original input vector for not only bucket top-k but also other algorithms, e.g., radix and bitonic top-k. The state-of-the-art GPU-based top-k designs, as shown in Figure 3(a), work directly on the input vector when reducing the elements. This data reduction process continues until a desirable condition is met. Although such designs can outperform the traditional priority queue and sorting-based approaches, they still face the following two challenges: • The performance of bucket and radix top-k is unstable. That is, they are sensitive to the value distribution of the data. For instance, the radixes of interest might carry most of the elements from one iteration over to the next. Figure 4 presents the performance variation of Dr. Top-k on three different distributions, Uniform (UD), Normal (ND), and Customized Distribution (CD), where the data distributions are rigorously defined in Section 6. We observe that both radix and bucket top-k [2] exhibit noticeable performance variations across these distributions. • Their workload reductions are limited. For example, bitonic top-k can only halve the workload at each merge step, so a large fraction of the input vector still has to be processed repeatedly. Dr. Top-k, as shown in Figure 3(b), introduces the delegate-centric concept, where top-k computation only happens on the delegate and concatenated vectors, which are a small fraction of the original input vector.
This warrants both stable and larger workload reductions during top-k computation in Dr. Top-k. (i) Our workload reduction is stable regardless of the value distribution of the input vector; that is, for a given k and |V|, the workload is deterministic (detailed in Section 5.2). (ii) Dr. Top-k on average reduces a greater portion of the workload compared to top-k algorithms such as bucket, radix, and bitonic top-k. This is because the total size of the delegate and concatenated vectors is smaller than the input vector, which is what the bucket/radix/bitonic top-k algorithms operate on. This is evident in Section 6.2. It should be noted that, instead of being regarded as an alternative to the existing top-k algorithms, Dr. Top-k can help reduce the workloads of all existing top-k algorithms, including bucket, radix, and bitonic top-k, as long as we implement the first and second top-k (top-k_1 and top-k_2) with these algorithms. 4 DR. TOP-K: DELEGATE-CENTRIC TOP-K Rule 1. For a given vector V partitioned into subranges, let D be the delegate vector containing the maximum element of every subrange S of V. If the maximum of a subrange S is not among the top-k elements of D, then S will not contribute any elements to the top-k of V. Rule 1 indicates that we can use the delegate of a subrange to decide whether to omit the entire subrange during top-k computation. Figure 5 presents an example of how to use Rule 1 to find the top-k (i.e., k = 2) elements from the same input vector as Figure 1. Essentially, we first divide the input vector into four subranges, each containing four elements. Second, we extract the delegate, i.e., the maximum element, from each subrange to formulate a delegate vector, i.e., {3012, 2313, 3210, 2321}. The first top-k finds 3012 and 3210 as the top-2 elements of the delegate vector. This implies that only the subranges that contain 3012 and 3210 are qualified for concatenation. Therefore, we concatenate these two subranges and conduct the second top-k on the concatenated vector, {2001, 101, 1323, 3012, 3000, 3010, 1002, 3210}. Our second top-k derives the final top-2 as {3012, 3210}. Leveraging Rule 1, we implement the initial version of Dr. Top-k. Figure 6 demonstrates the time consumption breakdown of Dr. Top-k accelerated radix top-k for |V| = 2^30 unsigned integers with k ranging from 2^0 to 2^24. For k ≤ 2^15, the time consumption of delegate vector construction is ∼4.2 ms, which means we achieve 84% of the peak throughput of the V100S GPU, even though delegate vector construction also performs additional shuffle instructions. When k > 2^15, the time consumption of Dr. Top-k increases, which is reflected in all four steps of Dr. Top-k. Even if the maximum delegate of a subrange is in the top-k of the delegate vector, not all the elements of that qualified subrange will be eligible for the second top-k. This section uses the top-k of the delegate vector to remove elements from the qualified subranges during the concatenation, leading to further workload reduction for the second top-k through the following rule. Rule 2. The k-th element of the delegate vector is the minimum possible value the final k-th element can take. This rule can be derived as follows: the minimum of the top-k of the input vector V is no less than that of the delegate vector D, i.e., min(top-k(D)) ≤ min(top-k(V)). Therefore, only the elements that are larger than min(top-k(D)) can possibly get into top-k(V), and hence only these are qualified for the concatenation. Here, top-k(V) denotes the top-k elements of V, and similarly for top-k(D). We can use the example from Figure 5 to assist the understanding of Rule 2.
Here, the minimum element from the top-2 of the delegate vector is 3012. Our prior Dr. Top-k design takes into consideration the entire subranges whose maximums are in the first top-k. This is, in fact, wasteful. For instance, the elements that are smaller than 3012 in subranges 0 and 2, that is, 2001, 101, 1323, 3000, 3010, and 1002, will never become one of the elements in the final top-2. Hence, none of them should be copied to the new concatenated vector. Eventually, the concatenated vector is merely {3012, 3210}. To implement this delegate top-k enabled filtering approach, we disseminate the minimum of the top-k from the delegate vector across all threads. Afterward, the threads are dispatched to work on the qualified subranges identified by the first top-k. When scanning those qualified subranges, only the elements that are larger than the minimum of the top-k of the delegate vector are stored in the concatenated vector. As the number of eligible elements from each subrange is unknown beforehand, each thread needs to use an atomic operation [13, 16] to obtain the position at which to store the eligible element (a kernel sketch of this filtered concatenation is given at the end of this subsection). Figure 7 demonstrates the effectiveness of delegate top-k enabled filtering for the same dataset as Figure 6. Comparing Figures 7 and 6, the workload reduction for the second top-k is substantial, especially when k ≥ 2^19. Using k = 2^24 as an example, we reduce the second top-k time consumption from 28.7 ms in Figure 6 to 6.1 ms. While delegate top-k enabled filtering can tremendously reduce the workload for the second top-k, it still has two weaknesses that require further improvements: First, one might need to perform extensive atomic operations to build the concatenated vector. Second, we still need to scan through the qualified subranges to omit the elements that are smaller than the minimum of the top-k of the delegate vector. Now we introduce the β delegate, which allows Dr. Top-k to safely skip entire subranges, without scanning any of their elements, that would have qualified for the second top-k if β delegates were not used. Below, we formally introduce the β delegate rule. Rule 3. In a subrange S, we select the top β elements as delegates. If not all of the β delegates of S qualify as top-k of the delegate vector D, the rest of the elements from this subrange will not qualify for the top-k of the input vector V. Note, S ∈ V and β > 1. In the example of Figure 8(a), with k = 3 and β = 2, since we only take one delegate from subrange 0 and both delegates from subrange 2, we obtain the concatenated vector as {3012, 3010, 3210}. Note, even though the concatenated vector is the same as the top 3 delegates, we still need to scan through the entire subrange 2 to omit the ineligible elements. Finally, the second top-k computes on the concatenated vector to figure out the final top-3. Figure 8(b) presents a special benefit of the β delegate: we might not need the concatenation and second top-k computation at all. In this case, since the top-2 of the delegate vector does not take all the delegates from any subrange, Rule 3 suggests that neither concatenation nor the second top-k is necessary. Note, the β delegate will lead to more workload for the first top-k and delegate vector construction. To reduce the workload for the first top-k, we let the first top-k skip the final iteration when locating the exact bucket or radix of interest. Because the β delegate and delegate top-k enabled filtering can substantially reduce the workload for concatenation and the second top-k, this skipping, which noticeably helps the first top-k, leads to a negligible performance drop for the subsequent concatenation and second top-k steps.
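The filtered concatenation referenced above can be sketched as the following CUDA kernel. It is a simplified thread-per-element version under our own naming; the paper's implementation is warp-centric (Section 5.1) and also exploits β-delegate information, which is omitted here.

```cuda
// Copy the eligible elements of the qualified subranges into the
// concatenated vector. `subIds` holds the IDs of the subranges whose
// delegates survived the first top-k; `threshold` is the k-th element
// of the delegate vector (Rule 2). All names are illustrative.
__global__ void filteredConcat(const unsigned* V, const unsigned* subIds,
                               int numQualified, int subrangeSize,
                               unsigned threshold, unsigned* concat,
                               unsigned* concatLen) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int total = numQualified * subrangeSize;
    for (int i = tid; i < total; i += gridDim.x * blockDim.x) {
        int sub = subIds[i / subrangeSize];          // which qualified subrange
        unsigned val = V[sub * subrangeSize + i % subrangeSize];
        if (val >= threshold) {                      // Rule 2 filtering
            unsigned pos = atomicAdd(concatLen, 1u); // claim an output slot
            concat[pos] = val;
        }
    }
}
```

Because the number of eligible elements per subrange is unknown in advance, atomicAdd is the simplest way to assign output positions; the paper identifies this atomic traffic as one of the weaknesses that the β delegate mechanism then mitigates.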
For performance improvements on delegate vector construction, Section 5.3 will introduce our novel optimizations shortly. An appropriate is important for Dr. Top-k, our empirical study in Figure 9 suggests that = 2 performs the best. For better visualization, we normalize the performance of various tests towards = 1. In Figure 9 (a), we find that = 2 is the desirable configuration which increases the performance up to 1.41 when = 2 24 from = 1. Although Figure 9 (b) observes slightly better performance when = 3 for smaller | | = 2 29 and 2 30 , we find = 2 always yield good performance across both figures. This section compares our Dr. Top-k algorithm with a closely related IR algorithm, BMW [11] , which is a variant of the popular Weak AND (WAND) algorithm [8] . Briefly, the BMW algorithm aims to find the top-most related documents for a query term. Figure 11 presents an example for BMW algorithm. For clarity, we first describe the settings of our example: (i) there are ten documents, i.e., d 0 -d 9 ; (ii) the query contains three terms: "the search engine", and (iii) the score of a term in a document is the number of occurrence of the query term in that document. BMW first puts the documents that contain each term together, subsequently sorts them by the document ID and partitions them into blocks, e.g., the term "the" contains two blocks b 0 = {d 0 -d 4 }, and b 1 = {d 6 -d 8 }. For each block, BMW stores the maximum score, e.g., the maximum score of block b 0 is 5. Assuming BMW is working on document d 3 , Figure 11 : BMW algorithm for a query {"the search engine"}. the right side of Figure 11 is the pseudocode of BMW. Specifically, BMW first evaluates whether the block maximums of the three blocks that contain document d 3 would be bigger than the threshold . If this condition is true, BMW will perform a full evaluation on document d 3 and move on to the next document d 4 . Otherwise, BMW will skip all the documents that exactly share the same block maximum with document d 3 . In this context, using each block's maximum to estimate whether we should skip a document is similar to Dr. Top-k, which uses the maximum of a subrange for delegate. Distinction. While BMW leverages the block maximum to skip computations when extracting the top-documents, Dr. Top-k designs and exploits the delegate concept more comprehensively from three aspects. First, Dr. Top-k introduces a delegate-centric processing concept while BMW still uses a regular element-centric concept. Here a regular element is a document. Particularly, Dr. Top-k uses the delegate to decide whether an entire subrange (i.e., a block in BMW) is eligible or not. However, BMW processes one document at a time. Using document d 3 in Figure 11 as an example, even d 3 is qualified, BMW still needs to perform the eligibility check for d 4 . Further, for ineligible documents, BMW can only skip the documents that share the same or fewer terms than d 3 . Second, we further introduce delegate to help remove more subranges and delegate top-enabled filtering to reduce some regular elements from the qualified subranges. Both of these designs are novel compared to BMW. Third, as we will discuss shortly in Section 5, Dr. Top-k also includes subrange size tuning and GPU-aware optimizations, which are also novel compared to the BMW algorithm. Warp-centric delegate vector construction first divides the entire input vector into smaller subranges at the length of 2 , where is an integer. 
Afterward, each warp of GPU threads is assigned to extract each subrange delegate in three phases. Using maximum delegate as an example, every thread first records the maximum element when scanning through a specific subrange. Second, all the threads in each warp use shuffle instruction, i.e., __shfl_sync(), to communicate and derive the maximum element in the current subrange. During the third phase, Dr. Top-k writes each subrange's maximums and the subrange IDs to the delegate vector in the global memory. The size of the delegate vector is | | Warp-centric concatenation. This step concatenates the eligible subranges into a new concatenated vector where the second top-performs on. Particularly, a warp of threads is responsible for moving the subrange elements into the concatenated vector. Because Dr. Top-k uses delegate top-enabled filtering, the eligible elements per subrange are unclear. We resort to atomic operations to calculate the location for each eligible element. First and second top-. Once Dr. Top-k formulates delegate or concatenated vector, it will perform top-on them. While both top-algorithms work on a relatively small vector, the first toppresents two unique features. First, this top-algorithm has to work on a delegate vector that comes in the format of (key, value) pair. Here, the key is the delegate element from each subrange, and the value indicates which subrange this delegate element comes from, which is essential for the concatenation step. Second, the first topalgorithm has to be a top-operation instead of -selection because one needs to extract all the top-subranges for concatenation. We hence have to revise the radix and bucket -selection algorithms of [2] to support top-. Choice of top-algorithms. Despite the fact that Dr. Topk can help all existing top-algorithms, we notice that the best Dr. Top-k will favor different top-algorithms when changes. Particularly, (i) when is small, all top-algorithms will enjoy comparable performance gains over their baseline algorithms. However, for radix and bucket top-, they prefer in-place designs that always work on the input vector V as instead of out-place variants that copy the derived candidates to a new array for the follow-up iteration. (ii) When is large, the performance of Dr. Top-k assisted bitonic top-will lag behind. Specifically, bitonic top-needs a large shared memory space to cache the intermediate results and achieve desirable performance, which will experience low occupancy hence poor performance when > 256. As shown in Figure 4 , when goes beyond 256, the performance of bitonic top-degrades significantly. This makes Dr. Top-k assisted bitonic perform worse than other Dr. Top-k assisted ones. Optimized in-place radix top-. Since existing in-place radix top-algorithm [2] requires to modify the ineligible element from the input vector into a value that is assured to fall out of the value range of interest (e.g., zero), this results in excessive random memory accesses. We introduce a single flag variable to indicate the radixes of interest. This flag tracks the radixes that are eligible for the next iteration. Subsequently, once an element is loaded from global memory, we will perform flag == (flag & loadedElement) between the loaded element and the flag variable. Only when the condition is evaluated as true, we consider this loaded element as a qualified element. As shown in Figure 12 , our optimized in-place radix top-is on average 10.7× faster than the state-of-the-art [2] . 
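To illustrate the warp-centric delegate vector construction described at the beginning of this section, the following simplified CUDA kernel assigns one warp per subrange: each lane keeps a running maximum over a strided slice of the subrange, a __shfl_down_sync reduction derives the subrange maximum, and lane 0 writes it out together with the subrange ID. Kernel and variable names are ours; the real implementation fuses further optimizations (Section 5.3).

```cuda
// One warp extracts the maximum delegate of one subrange of size 2^x.
// Keys are the delegates; values are the subrange IDs needed later for
// concatenation. Assumes subrangeSize is a multiple of 32 for brevity.
__global__ void buildDelegates(const unsigned* V, int subrangeSize,
                               int numSubranges, unsigned* delegateKey,
                               unsigned* delegateVal) {
    int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane   = threadIdx.x & 31;
    if (warpId >= numSubranges) return;

    // Phase 1: each lane scans a strided slice of the subrange.
    unsigned localMax = 0;
    for (int i = lane; i < subrangeSize; i += 32)
        localMax = max(localMax, V[warpId * subrangeSize + i]);

    // Phase 2: warp-wide max reduction via shuffles (log2(32) = 5 steps).
    for (int offset = 16; offset > 0; offset >>= 1)
        localMax = max(localMax, __shfl_down_sync(0xffffffffu, localMax, offset));

    // Phase 3: lane 0 writes the (delegate, subrange ID) pair.
    if (lane == 0) {
        delegateKey[warpId] = localMax;
        delegateVal[warpId] = warpId;
    }
}
```

With this layout, consecutive lanes touch consecutive global addresses, so the reads stay coalesced, which is how the construction step approaches peak global memory throughput for moderate k.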
A proper subrange size is crucial for Dr. Top-k to achieve good performance. On the one hand, a small subrange size would lead to too many subranges. In this context, the delegate vector construction and the first top-k would suffer from heavy workloads. On the other hand, when the subrange size is too large, there are too few subranges. In this case, the majority of these subranges will be eligible for the second top-k. We hence skip too few subranges, leading to limited workload reduction for concatenation and the second top-k. Rule 4 helps Dr. Top-k derive an optimal subrange size. Rule 4. For Dr. Top-k, x = (log_2(|V|/k) + const)/2 leads to the optimal subrange size 2^x, where |V| is the number of elements in the input vector V and k is the number of top elements Dr. Top-k aims to find. Here const = log_2(6 · t_g + 31 · t_shfl) − log_2(6 · t_g), where t_g and t_shfl are the clock cycles for one global memory access and one CUDA shuffle instruction, respectively. The total time consumption can be decomposed as T_{Dr.Top-k} = T_construct + T_first + T_concat + T_second, (1) where T_construct, T_first, T_concat, and T_second are the time consumption of delegate vector construction, first top-k, concatenation, and second top-k, respectively. Global memory access and intra-warp communication are the key factors determining the time consumption of Dr. Top-k for two reasons. First, one global memory access or intra-warp shuffle operation takes a much longer time than a single arithmetic or logic operation on GPUs, according to the NVIDIA profiler [29]. Second, the number of arithmetic and logical operations is similar to that of global memory accesses across all four stages of Dr. Top-k. Using delegate vector construction as an example, each thread loads one element from global memory and compares it, i.e., a logic operation, with the current maximum in a register. In this case, one memory access leads to one arithmetic or logic operation. As a result, we mainly use global memory accesses and shuffle instructions to estimate the time consumption. We perform our analysis for the maximum delegate for simplicity and assume all global memory accesses have equal latency, i.e., t_g. T_construct: Delegate vector construction reads |V| elements and writes |V|/2^x delegates. After the thread-local comparisons, each subrange resorts to the CUDA __shfl_sync instruction to derive the maximum of the entire subrange. Since one warp contains 32 threads, Σ_{1≤i≤5} 32/2^i = 31 shuffle instructions are needed, so the communication cost is 31 · (|V|/2^x) · t_shfl, where t_shfl is the cost of a shuffle instruction. Together, the delegate vector construction time is: T_construct = (|V| + |V|/2^x) · t_g + 31 · (|V|/2^x) · t_shfl. (2) Figure 13: The runtime of Dr. Top-k with respect to the change of x, where k = 2^13, on the UD dataset from Section 6. T_first: Now, we analyze the time consumption of our optimized in-place radix top-k. According to our study, 8 bits per digit yields the optimal performance for the optimized in-place radix top-k. A 32-bit unsigned integer hence experiences four iterations of scans, and at each iteration we always load all the elements. An additional iteration over the vector is used to identify the top-k elements. Finally, we write k elements, which are also the indices of the eligible subranges. Therefore, the time consumption of the first top-k is: T_first = (5 · |V|/2^x + k) · t_g. (3) T_concat: The concatenation step reads k indices for the subranges that are eligible for the second top-k and copies those subranges from the input to the concatenated vector for the second top-k. The time consumption for concatenation is: T_concat = (k + 2 · k · 2^x) · t_g. (4) T_second: The second top-k takes as input the output of the concatenation step and conducts in-place radix top-k to derive the eventual top-k. Consequently, this step mainly reads the entire output of the concatenation. Similar to the analysis for the first top-k, it reads the concatenated vector four times, so the time consumption of the second top-k is: T_second = 4 · k · 2^x · t_g. (5) Taking Equations (2)-(5) together, we arrive at the total time consumption of Dr. Top-k shown in Equation (6): T = (|V| + 2k) · t_g + (6 · t_g + 31 · t_shfl) · (|V|/2^x) + 6 · k · 2^x · t_g. (6) Given that Equation (6) ignores various hardware scheduling, arithmetic, and logical operation latencies, we introduce Δ(x, k, |V|) (Equation (7)), a positive function of x, k, and |V|, to account for these effects. We assume the magnitude of Δ(x, k, |V|) is smaller than that of T: T_{Dr.Top-k} = T + Δ(x, k, |V|). (7) We first prove that T_{Dr.Top-k} is a convex function of x, which makes it easy to obtain the optimal x for Dr. Top-k. According to [50], in order to demonstrate the convex nature of T_{Dr.Top-k}, the second derivative of T_{Dr.Top-k} with respect to x should be positive: d^2 T_{Dr.Top-k} / dx^2 = (31 · t_shfl + 6 · t_g) · |V| · ln^2(2) · 2^{-x} + 6 · k · t_g · ln^2(2) · 2^x + Δ''(x, k, |V|) > 0. Hence, T_{Dr.Top-k} is a convex function of x. Our study in Figure 13 also suggests that T_{Dr.Top-k} is a convex function of x. Particularly, the time consumption of delegate vector construction and the first top-k decreases along with the increase of x, while that of concatenation and the second top-k increases. Altogether, the total time consumption first decreases and then increases with respect to the increase of x. Finally, since T_{Dr.Top-k} is convex, the optimal value of x can be obtained by solving: d T_{Dr.Top-k} / dx = 0. (10) Solving Equation (10), we obtain: x = (log_2(|V|/k) + const)/2, (11) where const = log_2(6 · t_g + 31 · t_shfl) − log_2(6 · t_g) + Δ'(x, k, |V|). Figure 14, from an evaluation perspective, exhibits the performance alignment of the auto-tuned x and the oracle x across a wide range of k for the |V| = 2^30 unsigned integer dataset, where we set const = 3 according to performance tuning. □
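As a numerical illustration of Rule 4 / Equation (11), the small C++ helper below computes the subrange exponent x for given |V|, k, and measured costs t_g and t_shfl; the Δ' correction is ignored, the cost ratio in main() is a hypothetical example, and all names are ours.

```cpp
#include <cmath>
#include <cstdio>

// Optimal subrange exponent per Equation (11): x = (log2(|V|/k) + const)/2.
double optimalSubrangeExponent(double n, double k, double t_g, double t_shfl) {
    double c = std::log2(6.0 * t_g + 31.0 * t_shfl) - std::log2(6.0 * t_g);
    return 0.5 * (std::log2(n / k) + c);
}

int main() {
    double n = std::pow(2.0, 30), k = std::pow(2.0, 13);
    // Hypothetical cost ratio: one shuffle assumed ~1/10 of a global access.
    double x = optimalSubrangeExponent(n, k, 10.0, 1.0);
    std::printf("suggested subrange size: 2^%d\n", (int)std::lround(x));
    return 0;
}
```

With the paper's tuned const = 3, |V| = 2^30 and k = 2^24 give x ≈ 4.5, consistent with the small optimal subrange size discussed next.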
After we optimize the first and second top-k computations and the concatenation step, delegate vector construction becomes the next bottleneck for Dr. Top-k. This is especially true when k is relatively large. According to Equation (11), x decreases with respect to the increase of k. For instance, when |V| = 2^30 and k = 2^24, the optimal x = 4. This implies that the input vector is partitioned into a large number of small subranges, which leads to two problems: (i) the small subrange size fails to saturate a GPU warp; and (ii) too many subranges lead to an overwhelming number of shuffle instructions for delegate communication. We introduce a novel coalesced loading to shared memory and strided computing approach to remedy this small subrange size problem (x ≤ 5). This method consists of two phases: (i) one warp moves 32 subranges into shared memory for delegate extraction. Here, each subrange is loaded from global memory into shared memory by the warp in a coalesced manner. Since the subrange size is small, the shared memory pressure remains low. Subsequently, (ii) each thread of the warp individually works on one entire subrange to extract the delegate. This design ensures that all the threads of a warp have work, and no shuffle instruction is needed to communicate and decide the subrange delegates. This design helps the β delegate tremendously, which would otherwise need approximately β× more shuffle instructions to extract the delegates. We use padding to avoid shared memory bank conflicts. Figure 15 shows the improvement brought by the delegate vector construction optimization for different values of k.
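The following CUDA kernel sketches this coalesced-load/strided-compute idea for the maximum delegate (the β-delegate variant and further tuning are omitted): one warp stages 32 small subranges in padded shared memory, then each lane privately scans one subrange. Kernel, macro, and variable names are ours, and the launch configuration (one warp per block) is chosen only for clarity.

```cuda
// Delegate extraction for small subranges (size S = 2^x, x <= 5): a warp
// cooperatively stages 32 subranges in shared memory with coalesced loads,
// then each lane scans one whole subrange, so no shuffle is needed to
// decide the delegates. The +1 padding avoids shared-memory bank conflicts.
#define WARP 32
#define MAX_S 32
#define SMEM_PAD 1

__global__ void buildDelegatesSmall(const unsigned* V, int S,
                                    int numSubranges, unsigned* delegateKey,
                                    unsigned* delegateVal) {
    __shared__ unsigned tile[WARP * (MAX_S + SMEM_PAD)];
    int lane = threadIdx.x & (WARP - 1);
    int groupFirstSub = blockIdx.x * WARP;        // 32 subranges per warp/block
    int base = groupFirstSub * S;                 // first global element of the group

    // Phase (i): coalesced copy of 32*S consecutive elements into shared memory,
    // storing element e of subrange s at tile[s*(S+1) + e] (padded layout).
    for (int t = lane; t < WARP * S; t += WARP) {
        int s = t / S, e = t % S;
        unsigned v = (base + t < numSubranges * S) ? V[base + t] : 0;
        tile[s * (S + SMEM_PAD) + e] = v;
    }
    __syncthreads();

    // Phase (ii): each lane scans one entire subrange; no shuffles needed.
    int sub = groupFirstSub + lane;
    if (sub < numSubranges) {
        unsigned best = 0;
        for (int e = 0; e < S; ++e)
            best = max(best, tile[lane * (S + SMEM_PAD) + e]);
        delegateKey[sub] = best;
        delegateVal[sub] = sub;
    }
}
```

Because every lane now owns a whole subrange, the per-subrange __shfl_sync reduction of Section 5.1 disappears entirely while the global loads remain coalesced, which is the effect Figure 15 quantifies.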
Comparing to Figure 10 , one can find out that the delegate vector construction time is dramatically reduced for larger values of , making the sampling time always close to merely the time consumption of scanning the input vector. Especially, for = 2 24 , we observe the time consumption of delegate vector construction decreases from 31.4 ms to 9.4 ms. And the total time consumption is reduced from 46.7 ms of Figure 10 to 24.7 ms. In the distributed GPU setting, we partition the input vector into disjoint sub-vectors of equal length. To fit in the GPU memory, we require the length of each sub-vector to be no longer than 2 30 : (i) when (# ) × 2 30 ≥ | |, we partition V into #GPU number of sub-vectors and let each GPU account for one sub-vector. (ii) When (# ) × 2 30 < | |, we partition into | | 2 30 number of sub-vectors. In this case, one GPU accounts for more than one subvector; hence will load the unloaded sub-vectors from outside of GPUs. We schedule each GPU to compute the top-for its own sub-vectors to arrive at one top-per GPU. Subsequently, each GPU sends its top-to the primary GPU to calculate the final top-. Figure 16 presents the workflow of multi-GPU Dr. Top-k which contains three major steps: 1 It enables all the participating GPUs to work on their local sub-vectors to compute local top-. 2 It gathers these locally computed top-'s to the primary GPU. 3 It enables primary GPU to compute the global top-. For inter-GPU communication, we use Message Passing Interface (MPI) [43] . Particularly, we use MPI asynchronous (← Asynch. MPI in the figure) communication among the processes to gather local top-'s from all the GPUs to the primary GPU. While relying on the primary GPU to compute the final topworks for a small number of GPUs, e.g., 16 in our evaluation, we anticipate hierarchical reduction [52] would excel when Dr. Top-k scales to a large number of GPUs. Particularly, for a multi-node setting, where each node installs multiple GPUs, the hierarchical scheduling method first derives the top-across all the GPUs in each node. Afterward, all the nodes will send their top-to the primary GPU to compute the final top-. It is also worthy of noting that we attempt to reduce the workload for the second top-by enabling an MPI communication for the top-ℎ element of the first top-(the ↔ symbol in Figure 16 ). With the ℎ delegate across all GPUs, we anticipate this will help filter out more unpromising elements hence reduce the workload. However, since this method requires all GPUs to have the maximum of ℎ elements before launching the concatenation kernel, this introduces synchronization overhead. Additionally, in Figure 15 , we also notice the cost of second top-remains low throughout for a wide range of , leaving relatively small room for improvement. In summary, the overhead of synchronizing the ℎ elements from first top-s' exceeds the benefits of a smaller concatenated vector, we disable this technique in our final version of distributed Dr. Top-k. We implement Dr. Top-k with ∼1,500 lines of C++ and CUDA code, extending the state-of-the-art bitonic, bucket and radix topprojects [2, 42] . We compile the source code using NVIDIA CUDA 10.1 nvcc compiler with the optimization level as O3. We use two platforms to evaluate the performance of Dr. Top-k. Platform I is a server with two Intel Xeon "Cascade Lake-SP" CPUs (@3.8 GHz) and 4 Tesla V100S GPU running Ubuntu Server 18.04. Platform II consists of i7-8700 CPU @ 3.20GHz with one Titan Xp running Ubuntu Server 16.04. 
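Returning to the multi-GPU workflow of Figure 16, step 2 (gathering the per-GPU top-k candidates at the primary process) can be sketched with plain MPI as below. The paper uses asynchronous MPI communication; this simplified version uses a blocking MPI_Gather for clarity, and the buffer and function names are ours.

```cpp
#include <mpi.h>
#include <vector>

// Each MPI rank drives one GPU and has already copied its local top-k
// (k unsigned ints) back to host memory. Rank 0 gathers all candidates and
// then runs one more top-k on its GPU (step 3 in Figure 16).
std::vector<unsigned> gather_local_topk(const std::vector<unsigned>& localTopk) {
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int k = static_cast<int>(localTopk.size());
    std::vector<unsigned> all;
    if (rank == 0) all.resize(static_cast<size_t>(size) * k);

    // k candidates per rank, concatenated on rank 0 (size * k values in total).
    MPI_Gather(localTopk.data(), k, MPI_UNSIGNED,
               rank == 0 ? all.data() : nullptr, k, MPI_UNSIGNED,
               0, MPI_COMM_WORLD);
    return all;   // non-empty only on rank 0
}
```

For the modest GPU counts evaluated in the paper, this single-level gather suffices; the hierarchical reduction mentioned above would replace it when scaling to many nodes.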
All the reported execution time is an average of five runs. The default size of the input vector is | | = 2 30 , and each data entry is an unsigned integer. is generated by the following distributions. Unless stated differently, we present the experiments on platform I for the UD dataset. This section reports the performance gains brought by Dr. Top-k to the state-of-the-art with respect to the change of | | and . Dr. Top-k for different input vector sizes. Figure 17 demonstrates the time consumption of different versions of top-for = 1024 on whose sizes vary from 2 26 to 2 30 . The general trend is that Dr. Top-k becomes more beneficial when the input vector size is bigger, because delegate vectors can help reduce more workloads when gets larger. Particularly, when | | = 2 30 , radix, bucket, bitonic and sort and choose top-consume 41.3 ms, 38.4 ms, 127.0 ms and 243.2 ms, respectively. Our Dr. Top-k assisted radix, bucket and bitonic top-designs reduce the time to 6.4 ms, 7.0 ms and 7.0 ms, respectively. Dr. Top-k assisted radix top-. As shown in Figure 18 , in general, Dr. Top-k yields bigger performance gains on the normal and customized distribution datasets. Particularly, we observe 1.7× -10× and 1.1× -10.1× speedups, respectively on normal and customized distribution, while 1.7× -6.6× on uniform distribution. It is also important to note that the impact of Dr. Top-k decreases with respect to the increase of . For instance, when k=2 24 , Dr. Top-k only gives 1.7× speedups for both the uniform and normal distribution datasets and 1.1× speedup for customized distribution. This is caused by the fact that Dr. Top-k requires more delegates to figure out the useful subranges when becomes larger, leading the first top-to consume significant time. Section 6.2 conducts a thorough study on the workload reduction trend for varying . Dr. Top-k assisted bucket top-. Figure 18 also shows the speedup of Dr. Top-k assisted bucket top-over the bucket topalone algorithm on the normal, customized and uniform distributions. The trends are analogous to radix top-but with two differences. First, bucket top-performs fairly well when = 1 because bucket top-first finds the maximum value. Then the query is completed. Thanks to near bandwidth performance of delegate vector construction and single delegate needed to be selected for first top-for = 1, Dr. Top-k assisted bucket top-only needs to work on the first top-, leading Dr. Top-k assisted bucket top-to perform faster, i.e., by 1.1× for all normal, customized and uniform distributions. Second, in bucket top-, the speedup of Dr. Top-k on customized distribution outpaces the uniform and normal distribution. Particularly, we observe the speedups from 1.1× -118.6× for customized distribution while 1.1× -6.1× and 1.1× -6.2× for normal and uniform distribution respectively. Dr. Top-k assisted bitonic top-. Figure 18 further includes the speedup of Dr. Top-k assisted bitonic top-over the bitonic topstand alone algorithm [42] . Note, the original source code [5] from bitonic top-k project [42] experiences shared memory overflow when goes beyond 256. We modify the source code to enable it for > 256. Particularly, the speedup of Dr. Top-k climbs from 1.1× when k=2 0 to 473× when k=2 24 . Since the performance of bitonic top-is independent from the data distribution, the speedups over the normal, uniform and customized distributions are the same. Note, for visualization, we limit the y-axis to [0, 128] in Figure 18 . 
Hence the speedups that are beyond 128 for bitonic top-are marked as numbers in Figure 18 . Real-world datasets contains three datasets: ANN_SIFT1B [21] , ClueWeb09 [39] and TwitterCOVID-19 [19] . (i) ANN_SIFT1B dataset contains 1 billion vectors, each of which is at 128 dimensions and describes an image. We use the first vector from the ANN_SIFT1B dataset to calculate the euclidean distances between this vector and the 1 billion vectors. Afterward, the distance array is the input vector for top-. (ii) The ClueWeb09 is a webpage graph which contains 4,780,950,910 webpages and 7,939,635,651 links. We derive the degrees of the webpages and use that as the input vector for top-. (iii) TwitterCOVID-19 [19] dataset consists of COVID-fear related scores of the tweets related to the COVID-19 pandemic from 28 January 2020 to 1 January 2021. The original 132 million public twitter posts are duplicated on the vector of 1 billion size to achieve same distribution. Top-computation can help (i) derive the -NN of a query vector [21] , (ii) rank the vertices by degree [12] and (iii) k least fearful tweets related to the COVID19 pandemic in [19] dataset. Since bitonic top-cannot work on | | that is not at size of power of 2, and GGKS radix top-suffers from | | ≥ 2 31 , we cut the sizes of (i) and (ii) datasets into 536,870,912 and 1,073,741,824, respectively. Figure 19 shows the speedup of Dr. Top-k assisted top-algorithms over the state-of-the-art projects on the real-world datasets. In general, for the same top-algorithm, Dr. Top-k enjoys higher speedups on CW dataset than AN, because CW is larger. This aligns with our finding in Figure 20 . On average, Dr. Top-k assisted radix, bucket and bitonic top-, respectively, perform 6.7×, 4.6 × and 173.7× faster than their corresponding top-algorithms on CW. AN dataset observes an average speedup of respectively 4.2×, 3.3× and 127.1× over the state-of-the-art top-algorithms. Similarly, TR dataset observes an average speedup of respectively 4.8×, 4.1× and 170.2× over the state-of-the-art top-algorithms. Figure 20 plots the workload dynamics for the first top-, second top-and their sum with respect to varying sizes of the input vector | |. Particularly, the workloads are the sizes of the delegate vector and the concatenated vector for the first and second top-, respectively. We observe that the ratio of the delegate vector over | | decreases significantly when | | increases, so does that for concatenated vector. Specifically, the sum of the delegate and concatenated vector sizes is 76.06% of | | at | | = 2 22 and 0.83% at | | = 2 30 . This workload reduction trend demonstrates the scalable nature of Dr. Top-k, that is, Dr. Top-k performance improves when the problem size |V| increases. will lead to larger vector sizes for both the first and second top-. As the vector sizes increase to a higher ratio of the input vector, the speedup of Dr. Top-k over the state-of-the-art also decreases. Another fact is that the workload of the first top-dominates the entire workload for Dr. Top-k because delegate will lead to more delegates. Further, delegate and delegate-based filtering together can significantly reduce the workload for the second top-. Particularly, the ratio of the sum of both vectors over the input vector climbs from 0.0015% to 15.91% with the increase of . Here, we include delegate vector construction optimization. When is small, one can observe that delegate top-enabled filtering yields better performance gains over delegate, = 2 20 in particular. 
However, once k becomes bigger, the β delegate optimization starts picking up momentum. Overall, delegate top-k enabled filtering combined with the β delegate always offers the best performance. Particularly, for k = 2^24, the time consumption of delegate top-k enabled filtering, the β delegate, and the combined version are 54.2 ms, 35.9 ms, and 24.7 ms, respectively. Table 3 showcases the number of global memory load and store transactions of different versions of top-k on the UD dataset with |V| = 2^30 and k = 2^7. We use the nvprof [29] profiler to collect these results. From the table, we observe a reduction of load transactions by 2.3×, 3.1×, and 8.5×, respectively, when Dr. Top-k assists radix, bucket, and bitonic top-k. Similarly, Dr. Top-k helps reduce the global memory store transactions by 766.8×, 516.9×, and 298.6×, respectively, on radix, bucket, and bitonic top-k. The performance trends of Dr. Top-k on both GPUs are similar across a range of k. Overall, the performance of Dr. Top-k on the V100S is better than on the Titan Xp by a factor of 1.3× - 1.8×. This roughly aligns with the ratio of the reported peak memory throughputs of the V100S [31] and Titan Xp [4], which are 1,134 GB/s and 547.7 GB/s. 6.6 Dr. Top-k vs BMW Figure 24 presents the ratio r = (BMW workload) / (Dr. Top-k workload), where we use the sum of the workloads of the first and second top-k as the Dr. Top-k workload. Overall, we observe the ratio to be, on average, 212× in ND and 6× in UD, which suggests that Dr. Top-k reduces 212× and 6× more workload than BMW. The reason is that BMW still works on each regular item even after deriving the delegate, while Dr. Top-k uses a delegate to skip an entire subrange directly.
7 CONCLUSION
We introduce Dr. Top-k with three contributions: First and foremost, Dr. Top-k introduces a comprehensive delegate-centric concept to help tremendously reduce the workload for top-k computations. Second, we introduce a practical way to partition the input vector into properly sized subranges with theoretical support. Finally, we deploy our project atop distributed GPUs to handle extremely large input vectors along with various system optimizations. Taken together, Dr. Top-k assisted top-k algorithms constantly outperform the state-of-the-art.
REFERENCES
Tensorflow: A System for Large-scale Machine Learning
Fast k-Selection Algorithms for Graphics Processing Units
Department of Energy and Cray to Deliver Record-Setting Frontier Supercomputer at ORNL
Bitonic select source code
Thrust: A Productivity-Oriented Library for CUDA
Optimized Data Processing on Heterogeneous Hardware. Proceedings of the VLDB Endowment
Efficient Query Evaluation Using a Two-level Retrieval Process
Parallel Sorting for GPUs
Faster Top-k Document Retrieval Using Block-Max Indexes
Extracting Top-Most Influential Nodes by Activity Analysis
Scalable Sparse LU Symbolic Factorization on GPUs. IEEE Transactions on Parallel and Distributed Systems
Do Bitcoin Users Really Care About Anonymity? An Analysis of the Bitcoin Transaction Graph
Deanonymizing Cryptocurrency with Graph Learning: The Promises and Challenges
XBFS: eXploring Runtime Optimizations for Breadth-First Search on GPUs
Hardware Technologies for High-performance Data-intensive Computing
A Memory Access Reduced Sort on Multi-core GPU
Global Reactions to COVID-19 on Twitter: A Labelled Dataset with Latent Topic, Sentiment and Emotion Attributes
Optimizing Parallel Bitonic Sort
Product Quantization for Nearest Neighbor Search
Billion-scale Similarity Search with GPUs
Warp-consolidation: A Novel Execution Model for GPUs
Uniting CPU and GPU in Information Retrieval Systems for Intra-Query Parallelism
Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs
Trustworthy Answers for Top-k Queries on Uncertain Big Data in Decision Making
Available at https
Nvidia Inc. Nvidia Profiler nvprof
America's First Exascale Supercomputer to be Built by 2021
Theoretically-efficient and Practical Parallel In-place Radix Sorting
A Framework For Graph Sampling and Random Walk on GPUs
PG-Strom v3.0 Official Documentation
Efficient Billion-Point Nearest Neighbor Search on Heterogeneous Memory
MapD: A GPU-powered Big Data Analytics and Visualization Platform
Available at http://networkrepository.com/web-ClueWeb09
The Network Data Repository with Interactive Graph Analytics and Visualization
El Capitan Supercomputer to Feature AMD Chips, Break 2 Exaflops Barrier
Efficient Top-K Query Processing on Massively Parallel Hardware
A Memory Bandwidth-Efficient Hybrid Radix Sort on GPUs
"World's Fastest Supercomputer" for US Government
Americas First Exascale Supercomputer to be Built by 2021
Extreme Heterogeneity 2018 - Productive Computational Science in the Era of Extreme Heterogeneity: Report for DOE ASCR Workshop on Extreme Heterogeneity. Tech. rep., USDOE Office of Science
FFT-based Gradient Sparsification for the Distributed Training of Deep Neural Networks
Efficient and Scalable LDA on GPUs
Second Derivative Test
Available at CalculatetopKonGPUwithkselectionalgos
Tree-based Allreduce Communication on MXNet
Approximate Nearest Neighbor Search on GPU
A Fast and Accurate LSH Framework for High-Dimensional Approximate NN Search. Proceedings of the VLDB Endowment
Efficient Main-memory Top-K Selection for Multicore Architectures