title: Sparse Linear Algebra on AMD and NVIDIA GPUs – The Race Is On
authors: Tsai, Yuhsiang M.; Cojean, Terry; Anzt, Hartwig
date: 2020-05-22
journal: High Performance Computing
DOI: 10.1007/978-3-030-50743-5_16

Efficiently processing sparse matrices is a central and performance-critical part of many scientific simulation codes. Recognizing the adoption of manycore accelerators in HPC, we evaluate in this paper the performance of the currently best sparse matrix-vector product (SpMV) implementations on high-end GPUs from AMD and NVIDIA. Specifically, we optimize SpMV kernels for the CSR, COO, ELL, and HYB formats, taking the hardware characteristics of the latest GPU technologies into account. We compare, for 2,800 test matrices, the performance of our kernels against AMD's hipSPARSE library and NVIDIA's cuSPARSE library, and ultimately assess how the GPU technologies from AMD and NVIDIA compare in terms of SpMV performance.

The sparse matrix-vector product (SpMV) is a heavily-used and performance-critical operation in many scientific and industrial applications such as fluid flow simulations, electrochemical analysis, or Google's PageRank algorithm [11]. Operations involving sparse matrices are typically memory bound on virtually all modern processor technology. With an increasing number of high performance computing (HPC) systems featuring GPU accelerators, significant resources are spent on finding the best way to store a sparse matrix and on optimizing the SpMV kernel for different problems. In this paper, we present and compare four SpMV strategies (COO, CSR, ELL, and HYB) and their realization on AMD and NVIDIA GPUs. We furthermore assess the performance of each format for 2,800 test matrices on high-end GPUs from AMD and NVIDIA. We also derive performance profiles to investigate how well the distinct kernels generalize. All considered SpMV kernels are integrated into the Ginkgo open-source library, a modern C++ library designed for the iterative solution of sparse linear systems, and we demonstrate that these kernels often outperform their counterparts available in the AMD hipSPARSE and the NVIDIA cuSPARSE vendor libraries. Given the long list of efforts covering the design and evaluation of SpMV kernels on manycore processors (see [2, 7] for a recent and comprehensive overview of SpMV research), we highlight that this work contains the following novel contributions:
- We develop new SpMV kernels for COO, CSR, ELL, and HYB that are optimized for AMD and NVIDIA GPUs and outperform existing implementations. In particular, we propose algorithmic improvements and tuning parameters to enable performance portability.
- We evaluate the performance of the new kernels against SpMV kernels available in AMD's hipSPARSE library and NVIDIA's cuSPARSE library.
- Using the 2,800 test matrices from the Suite Sparse Matrix Collection, we derive performance profiles to assess how well the distinct kernels generalize.
- We compare the SpMV performance limits of high-end GPUs from AMD and NVIDIA.
- To the best of our knowledge, Ginkgo is the first open-source sparse linear algebra library based on C++ that features multiple SpMV kernels suitable for irregular matrices with back ends for both AMD's and NVIDIA's GPUs.
- We ensure full result reproducibility by making all kernels publicly available as part of the Ginkgo library, and by archiving the performance results in a public repository.
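Throughout the paper, SpMV computes y = A·x for a sparse matrix A. As a minimal point of reference (our own illustration, not Ginkgo code; the array names are hypothetical), a sequential COO-based SpMV makes explicit why the operation is memory bound: every nonzero contributes two flops but moves at least 12 bytes of matrix data (an 8-byte double value plus a 4-byte index).

    // Minimal sequential COO SpMV sketch: y += A*x for a matrix with nz nonzeros.
    // Assumes y has been zero-initialized by the caller.
    void spmv_coo_reference(int nz, const int* row_idx, const int* col_idx,
                            const double* val, const double* x, double* y) {
      for (int k = 0; k < nz; ++k) {
        // 2 flops (one multiply, one add) per nonzero -> 2*nz flops in total,
        // the flop count also underlying the GFLOP/s rates reported in Sect. 4
        y[row_idx[k]] += val[k] * x[col_idx[k]];
      }
    }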
Before providing more details about the sparse matrix formats and the processing strategy of the related SpMV routines in Sect. 3, we recall some basics about sparse matrix formats in Sect. 2. In Sect. 3.4, we combine several basic matrix storage formats into the so-called "hybrid" format (HYB) that splits the matrix into parts to exploit the performance niches of the various basic formats. In a comprehensive evaluation in Sect. 4, we first compare the performance of Ginkgo's SpMV functionality with the SpMV kernels available in NVIDIA's cuSPARSE library and AMD's hipSPARSE library, then derive performance profiles to characterize all kernels with respect to specialization and generalization, and finally compare the SpMV performance of AMD's RadeonVII GPU with NVIDIA's V100 GPU. We conclude in Sect. 5 with a summary of the observations.

For matrices where most elements are zero, which is typical for, e.g., finite element discretizations or network representations, storing all values explicitly is expensive in terms of memory and computational effort. In response, sparse matrix formats reduce the memory footprint and the computational effort by focusing on the nonzero matrix values [3]. In some cases, additionally storing some zero elements can improve memory access and data-parallel processing [4].

[Fig. 1. Different storage formats for a sparse matrix of dimension m × n containing nz nonzeros, along with the memory consumption [6].]

While there exists a long and still expanding list of sparse matrix formats (some of them tailored towards specific problems), we illustrate some of the most common basic formats (DENSE, COO, CSR, ELL) in Fig. 1. The optimization of the SpMV kernel for manycore GPUs remains a topic of major interest [5, 9, 12]. Many of the most recent algorithm developments increase the efficiency by using prefix-sum computations [13] and intra-warp communication [10] on modern manycore hardware.

We realize all SpMV kernels in the vendors' native languages: CUDA for NVIDIA GPUs and HIP for AMD GPUs. Given the different hardware characteristics, see Table 1, we optimize kernel parameters like the group size for the distinct architectures. More importantly, for the CSR, ELL, and HYB kernels, we modify the SpMV execution strategy for the AMD architecture from the strategy that was previously realized for NVIDIA architectures [2].

Flegar et al. [6] introduced a load-balancing COO SpMV based on the idea of parallelizing across the nonzeros of a sparse matrix. This way, all threads have the same workload, and coalesced access to the column indexes and the values of the sparse matrix is enabled. At the same time, parallelizing across nonzeros requires the use of atomicAdd operations to avoid race conditions, see Algorithm 1. Flegar et al. [6] also introduced an "oversubscribing" parameter ω that controls the number of threads allocated to each physical core. When increasing the oversubscription, we have more active threads to hide the latency of data access and atomicAdds [1]. At the same time, it increases the number of atomicAdd invocations and the overhead of context switching. Using an experimental assessment on all of the 2,800 matrices from the Suite Sparse Matrix Collection, Flegar et al. [6] identify oversubscribing parameters ω_NVIDIA that draw a good balance between these aspects.
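To illustrate the nonzero-parallel idea, the following stripped-down CUDA kernel (our own sketch with hypothetical names, not Ginkgo's implementation) parallelizes directly across nonzeros and resolves row collisions with atomicAdd; Ginkgo's actual kernel (Algorithm 1 below) additionally uses warp-local segmented scans so that an atomic is only issued once per row segment. The launch grid would provide ω threads per physical core.

    // Flat nonzero-parallel COO SpMV sketch. Double-precision atomicAdd
    // requires compute capability >= 6.0 (available on the V100).
    __global__ void coo_spmv_flat(int nz, const int* __restrict__ row,
                                  const int* __restrict__ col,
                                  const double* __restrict__ val,
                                  const double* __restrict__ x, double* y) {
      // total thread count = omega * number of physical cores
      int stride = gridDim.x * blockDim.x;
      for (int k = blockIdx.x * blockDim.x + threadIdx.x; k < nz; k += stride) {
        atomicAdd(&y[row[k]], val[k] * x[col[k]]);  // race-free accumulation
      }
    }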
Similarly to Flegar et al. [6], we use experiments to identify good choices ω_AMD for AMD architectures by considering a range of oversubscribing factors.

Algorithm 1. Ginkgo's load-balancing COO kernel.
1-4: [initialize c = 0, the index ind of the first element to be processed, and current_row; start the main loop]
5: Compute next_row = row index of the next element to be processed
6: if any thread in the warp has next_row != current_row, or it is the final iteration, then
7: Compute the segmented scan according to current_row
8: if first thread in segment then
9: atomicAdd c (the first entry of each segment) onto the output vector
10: end if
11: Reinitialize c = 0
12: end if
13: Get the next index ind
14: Compute c += val[ind] · b[col[ind]]
15: Update current_row to next_row
16: end for

In the Ginkgo library and our experiments, we use the setting ω = 32 for matrices with 10^7 ≤ n_z (with smaller oversubscribing factors for matrices with fewer nonzeros).

The most basic CSR SpMV kernel (basic CSR) assigns only one thread to each row, which results in notoriously low occupancy of the GPU. In Algorithm 2, we assign a "subwarp" (multiple threads) to each row, and use warp reduction mechanisms to accumulate the partial results before writing to the output vector. This classical CSR kernel, which assigns multiple threads to each row, is inspired by the performance improvement of the ELL SpMV in [2]. We adjust the number of threads assigned to each row to the maximum number of nonzeros in a row. We select subwarp_size = 2^k (0 ≤ k ≤ 5 on NVIDIA, 0 ≤ k ≤ 6 on AMD) as the largest power of two smaller than or equal to the maximum number of nonzeros in a row, i.e.,

    subwarp_size = max { 2^t : 2^t ≤ max_row_nnz, t ∈ Z, 0 ≤ t ≤ log2(device_warpsize) }.

In Fig. 5 in Sect. 4, we visualize the performance improvements obtained from assigning multiple threads to each row and observe that the basic CSR SpMV is not always slower. In particular, for very unbalanced matrices, assigning the same parallel resources to each row turns out to be inefficient. In response, we design a load-balancing CSRI kernel which follows the strategy of the COO SpMV described in Sect. 3.1 to balance the workload across the compute resources.

Algorithm 2. Ginkgo's classical CSR kernel.
1: Get row = the row index
2: Compute subrow = the step size to the next row
3: Get step_size = the step size to the next element of value
4: Initialize value c = 0
5: for row = row .. #rows, row += subrow do
6: for each element ind of the row, stepping by step_size do
7: Compute c += val[ind] · b[col[ind]]
8: end for
9: Perform warp reduction of c on the subwarp
10: if thread 0 in subwarp then
11: Write c to the output vector
12: end if
13: end for

For the automatic strategy selection in Algorithm 3, we define two variables, nnz_limit and row_len_limit, to control the kernel selection on NVIDIA and AMD GPUs: nnz_limit reflects the limit on the total nonzero count, and row_len_limit reflects the limit on the maximum number of stored elements in a row. For AMD GPUs, nnz_limit is 10^8 and row_len_limit is 768. For NVIDIA GPUs, nnz_limit is 10^6 and row_len_limit is 1024.

Algorithm 3. Ginkgo's CSR strategy.
1: Compute max_row_nnz = the maximal number of stored elements per row
2: if #nnz > nnz_limit or max_row_nnz > row_len_limit then
3: Use the load-balance CSR kernel
4: else
5: Use the classical CSR kernel
6: end if
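The subwarp idea of Algorithm 2 can be condensed into the following CUDA sketch (our own code with a fixed compile-time subwarp size SW; Ginkgo's kernel additionally implements the runtime strategy selection of Algorithm 3 and handles arbitrary launch configurations). It assumes blockDim.x is a multiple of SW.

    // Subwarp-per-row classical CSR SpMV sketch.
    template <int SW>
    __global__ void csr_classical(int m, const int* __restrict__ row_ptr,
                                  const int* __restrict__ col,
                                  const double* __restrict__ val,
                                  const double* __restrict__ x,
                                  double* __restrict__ y) {
      const int tid   = blockIdx.x * blockDim.x + threadIdx.x;
      const int lane  = tid % SW;                       // position inside the subwarp
      const int sub   = tid / SW;                       // global subwarp index
      const int nsub  = (gridDim.x * blockDim.x) / SW;  // total number of subwarps
      const int niter = (m + nsub - 1) / nsub;          // uniform trip count per warp
      for (int it = 0; it < niter; ++it) {
        const int row = sub + it * nsub;
        double c = 0.0;
        if (row < m) {
          for (int k = row_ptr[row] + lane; k < row_ptr[row + 1]; k += SW) {
            c += val[k] * x[col[k]];                    // coalesced across the subwarp
          }
        }
        // reduce the SW partial sums inside the subwarp via warp shuffles
        for (int off = SW / 2; off > 0; off /= 2) {
          c += __shfl_down_sync(0xffffffffu, c, off, SW);
        }
        if (row < m && lane == 0) y[row] = c;
      }
    }

For example, a matrix with max_row_nnz between 4 and 7 would use csr_classical<4>, matching the power-of-two selection rule above.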
In [2], the authors demonstrated that the ELL SpMV kernel can be accelerated by assigning multiple threads to each row and by using an "early stopping" strategy to terminate thread blocks early once they reach the padding part of the ELL format. Porting this strategy to AMD architectures, we discovered that the non-coalesced global memory access that can occur when assigning multiple threads to the rows of the ELL matrix (stored in column-major format) can result in low performance. The reason is that the strategy in [2] uses threads of the same group to handle one row, which results in adjacent threads always reading matrix elements that are m (the matrix size, or stride) memory locations apart. To overcome this problem, we rearrange the memory access by assigning the threads of the same group to handle one column, like the classical ELL kernel, but assigning several groups to each row to increase GPU utilization. Because the threads handling the elements of a row may belong to the same thread block but are no longer part of the same warp, we cannot use warp reductions for the partial sums but need to invoke atomicAdds on shared memory. Figure 2 visualizes the different memory access strategies. In our experiments, we set the "group size" to a multiple of 32 for both AMD and NVIDIA architectures. The group size is the number of contiguous elements read by a thread block, and num_group is the number of threads in the thread block accessing the same row. We use block_size = 512 in the ELL kernel. To keep the group size a multiple of 32, we bound num_group by block_size/min_group_size = 512/32 = 16. We visualize in Fig. 8 the improvement of the new ELL SpMV kernel over the kernel previously employed [2].

[Algorithm 4. Ginkgo's ELL SpMV kernel; the line numbers referenced below refer to this listing. Line 7: Compute ind = index of this element in the ELL format.]

In Algorithm 4, we present the ELL SpMV kernel implemented in Ginkgo for SIMD architectures like GPUs. The number of groups assigned to a row is computed via Algorithm 5. Generally, the number of groups increases with the number of nonzero elements accumulated in a single row. However, if num_group = 16, multiple thread blocks may be assigned to the same row, see line 8 in Algorithm 5. This strategy aims at increasing the occupancy of the GPU multiprocessors when targeting short-and-wide matrices that accumulate many elements in few rows. After the group is determined, the start index for a specific thread is computed in line 2 of Algorithm 4, with a step size equal to the total number of threads accessing the same row. The threads process the data in the loop in lines 6-12. This kernel still uses the early stopping in lines 8-10 introduced in [2].

Algorithm 5. Ginkgo's automatic ELL kernel configuration.
1: Initialize num_group = 1
2: Initialize nblock_per_row = 1
3: Compute ell_ncols = maximum number of nonzero elements per row
4: Get nwarps = total number of warps available on the GPU
5: if ell_ncols / nrows > 1e-2 then
6: Compute num_group = min(16, 2^ceil(log2(ell_ncols)))
7: if num_group == 16 then
8: Compute nblock_per_row = max(min(ell_ncols/16, nwarps/nrows), 1)
9: end if
10: end if

After completion of the matrix-vector multiplication step, the partial sums accumulated in thread-local variables are reduced (line 13) and added to the output vector in global memory, see line 15. Even though this operation requires an atomic operation, as multiple groups (part of distinct thread blocks) may operate on the same row, the chance of atomic collisions is small due to the previous reduction in line 13.
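Algorithm 5 translates almost directly into host code. The following sketch uses our own naming; Ginkgo's actual implementation differs in interface details.

    // Host-side transcription of Algorithm 5 (automatic ELL configuration).
    #include <algorithm>
    #include <cmath>

    void ell_kernel_config(long nrows, long ell_ncols, long nwarps,
                           int& num_group, long& nblock_per_row) {
      num_group = 1;                 // threads-per-row groups (line 1)
      nblock_per_row = 1;            // thread blocks cooperating on a row (line 2)
      if (static_cast<double>(ell_ncols) / nrows > 1e-2) {          // line 5
        // num_group = min(16, 2^ceil(log2(ell_ncols)))                line 6
        int e = static_cast<int>(
            std::ceil(std::log2(static_cast<double>(ell_ncols))));
        num_group = (e >= 4) ? 16 : (1 << std::max(e, 0));
        if (num_group == 16) {
          // short-and-wide case: several blocks per row               line 8
          nblock_per_row =
              std::max(std::min(ell_ncols / 16, nwarps / nrows), 1L);
        }
      }
    }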
Ginkgo's hybrid ("HYB") format splits the matrix into two parts, storing the regular part in the ELL format and the irregular part in the COO format. Flegar et al. [6] demonstrated that Ginkgo's COO SpMV achieves good performance for irregular matrices on NVIDIA architectures, and the results in Sect. 4 confirm that Ginkgo's COO SpMV performs well also on AMD architectures. How the HYB format partitions a matrix into the ELL and the COO part impacts the memory requirements and the performance. Anzt et al. [2] derived strategies basing the partitioning on the nonzeros-per-row distribution of the matrix. We modify this strategy by adding a condition based on the ratio between the maximum nonzeros-per-row and the number of rows. For R being the multiset of nonzeros-per-row values, we define the cumulative distribution function F_R and the quantile function Q_R:

    F_R(x) = |{r ∈ R : r ≤ x}| / |R|,    Q_R(p) = min { x : F_R(x) ≥ p }.

We recall that Anzt et al. [2] introduced hybrid{n}, which takes the nonzeros of the row at the n%-quantile in the ascending ordering of the nonzeros-per-row values, Q_R(n%). A variant denoted "hybridminstorage" selects according to the (bit-)size of the value and index arrays; i.e., hybridminstorage coincides with hybrid25 when storing the values in 64-bit doubles and the indexes in 32-bit integers [2]. In this paper, we enhance the hybrid{n} partitioning from Anzt et al. [2] by enforcing the limitation that the maximum nonzeros-per-row of the ELL part can be at most 0.0001 · #rows. We call the resulting strategy "hybridlimit{n}" and select hybridlimit33 (label "HYB") as our default strategy, according to the performance evaluation in Fig. 11 in Sect. 4.

In this paper, we consider NVIDIA's V100 (SXM2, 16 GB) GPU with support for compute capability 7.0 [14] and AMD's RadeonVII GPU (architecture gfx906). See Table 1 for some hardware specifications [16]. We note that the AMD RadeonVII is not a server-line GPU, but it provides the same memory bandwidth as AMD's HPC GPU MI50 and should thus be comparable for memory-bound operations such as the SpMV kernels. We use the major programming ecosystems for the distinct architectures: CUDA for NVIDIA GPUs and HIP for AMD GPUs. CUDA GPU kernels were compiled using CUDA version 9.2, and HIP GPU kernels were compiled using HIP version 2.8.19361. The performance evaluation covers more than 2,800 test matrices of the Suite Sparse Matrix Collection [15]. Some matrices contain dense rows, which makes the conversion to the ELL format virtually impossible; we ignore those matrices in the performance evaluation of the ELL SpMV kernel. All experiments are performed in IEEE double precision arithmetic, and the GFLOP/s rates are computed under the assumption that the number of flops is always 2·n_z, where n_z is the number of nonzeros of the test matrix (ignoring padding).

We first evaluate the performance of the load-balancing COO SpMV kernel. In Fig. 3a, we compare against cuSPARSE's COO kernel (cusparseDhybmv with CUSPARSE_HYB_PARTITION_USER and a threshold of 0); in Fig. 3b, we compare against hipSPARSE's COO kernel (hipsparseDhybmv with HIPSPARSE_HYB_PARTITION_USER and a threshold of 0). Each dot reflects one test matrix from the Suite Sparse collection; the x-axis shows the nonzero count of the matrix, and the y-axis the performance in GFLOP/s. In Fig. 4, we show the corresponding performance of Ginkgo's COO SpMV relative to the vendor COO kernels.

In the CSR SpMV performance analysis, we first demonstrate the improvement of assigning multiple threads to each row (classical CSR) over the implementation assigning only one thread to each row (basic CSR); see Fig. 5 for the CUDA and the HIP back end, respectively. For a few matrices with many nonzeros, the basic CSR is 5x-10x faster than the classical CSR. To overcome this problem, we use Algorithm 3 in Ginkgo, which chooses the load-balancing CSRI algorithm for problems with large nonzero counts.
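Algorithm 3's selection logic, with the thresholds given in Sect. 3.2, is compact enough to transcribe directly (a host-side sketch under our own naming; Ginkgo's implementation wraps this in its strategy classes):

    // Decide between the load-balancing (CSRI) and the classical CSR kernel.
    bool use_load_balance_csr(long nnz, long max_row_nnz, bool on_amd) {
      const long nnz_limit     = on_amd ? 100000000L : 1000000L;  // 1e8 vs. 1e6
      const long row_len_limit = on_amd ? 768        : 1024;
      return nnz > nnz_limit || max_row_nnz > row_len_limit;
    }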
Next, we compare the performance of the Ginkgo CSR SpMV (which automatically interfaces to either the load-balancing CSRI kernel or the classical CSR kernel, see Sect. 3.2) with the vendors' CSR SpMV. Anzt et al. [2] identified the cusp csr kernel (cusparseDcsrmv) as the overall performance winner among the different NVIDIA CSR implementations. For the AMD CSR SpMV kernel, we use the CSR kernel (hipsparseDcsrmv) provided in hipSPARSE. For completeness, we mention that the rocSPARSE library (outside the HIP ecosystem) contains a CSR kernel that renders better SpMV performance for irregular matrices on AMD GPUs. We refrain from considering it, as we want to stay within the HIP ecosystem, which is anticipated to serve as the primary dissemination tool for AMD's sparse linear algebra technology. In Fig. 6, we compare the Ginkgo CSR SpMV with the cusparseDcsrmv CSR kernel available in NVIDIA's cuSPARSE library and the hipsparseDcsrmv CSR kernel available in AMD's hipSPARSE library, respectively. In the relative performance analysis in Fig. 7, we use the ratio max(row_nnz)/num_rows for the x-axis, as this is the parameter used in Ginkgo's CSR SpMV to decide which CSR algorithm is selected. Ginkgo's CSR achieves significant speedups for large x-values (up to 900x speedup on the V100 and 700x speedup on the RadeonVII). At the same time, there are a few cases where the Ginkgo CSR SpMV is slower than the library implementations (up to 20x slowdown on the V100 and 5x slowdown on the RadeonVII).

First, we investigate the performance improvement we obtain by changing the memory access strategy for the ELL SpMV kernel, see Sect. 3. Interestingly, moving to the new ELL SpMV algorithm does not render noteworthy performance improvements on NVIDIA's V100 GPU, as can be seen in Fig. 8a. At the same time, the performance improvements are significant for AMD's RadeonVII, as shown in Fig. 8b. In the new ELL SpMV algorithm, we improve the global memory access at the cost of atomicAdd operations on shared memory (which are more expensive than warp reductions). In consequence, the new ELL SpMV is not always faster than the previous ELL SpMV. In Fig. 9, we compare Ginkgo's ELL SpMV kernel against the cuSPARSE ELL kernel (cusparseDhybmv with CUSPARSE_HYB_PARTITION_MAX) and the hipSPARSE ELL kernel (hipsparseDhybmv with HIPSPARSE_HYB_PARTITION_MAX), respectively. The hipSPARSE ELL kernel employs a limitation that excludes matrices having more than (#nnz − 1)/#rows + 1 elements in a row; thus, we have much fewer data points for the hipSPARSE ELL SpMV (the blue points in Fig. 9b). In Fig. 10, Ginkgo's ELL is faster than its counterparts available in the vendor libraries if the ratio max(row_nnz)/num_rows > 10^-2; for the other cases, Ginkgo and the vendor libraries are comparable in their ELL SpMV performance.

Before comparing against the vendor implementations, we investigate the performance of our HYB SpMV kernel for the different partitioning strategies denoted by hybrid{n}, hybridlimit{n}, and hybridminstorage (which is the same as hybrid25), as introduced in Sect. 3.4. We use a performance profile [8] on all Suite Sparse matrices to compare the strategies with respect to specialization and generalization. A performance profile identifies, for every maximum acceptable slowdown compared to the fastest algorithm (x-axis), the share of test problems for which an algorithm stays within that slowdown (y-axis). In Fig. 11, we visualize the performance profiles for the V100 and RadeonVII architectures.
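For concreteness, a performance profile as used in Figs. 11, 14, and 15 can be computed as follows (our own transcription of the standard construction; variable names are hypothetical):

    // times[i][j] = runtime of format j on matrix i. Returns, for each
    // slowdown threshold 1 + s*step, the share of problems on which the
    // given format stays within that slowdown of the per-matrix best.
    #include <algorithm>
    #include <vector>

    std::vector<double> performance_profile(
        const std::vector<std::vector<double>>& times, int format,
        double step, int nsteps) {
      std::vector<double> share(nsteps, 0.0);
      for (const auto& t : times) {
        double best = *std::min_element(t.begin(), t.end());
        double slowdown = t[format] / best;   // 1.0 means fastest on this matrix
        for (int s = 0; s < nsteps; ++s) {
          if (slowdown <= 1.0 + s * step) share[s] += 1.0;
        }
      }
      for (auto& v : share) v /= times.size();  // problem share in [0, 1]
      return share;
    }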
Although the hybrid strategy (which corresponds to hybridlimit33) does not win in terms of specialization (maximum slowdown of 1), we favor this strategy since it provides the best generality: when considering a maximum acceptable slowdown factor of less than 1.75, this format wins in terms of problem share. In Fig. 12, we see that Ginkgo's HYB SpMV achieves peak performance similar to cuSPARSE's cusparseDhybmv HYB SpMV and hipSPARSE's hipsparseDhybmv HYB SpMV, but Ginkgo has much higher performance averages than cuSPARSE or hipSPARSE. Figures 13a and 13b visualize the HYB SpMV performance relative to the vendor libraries, and we identify significant speedups for most problems and moderate slowdowns for a few cases.

In Fig. 14, we use the performance profile to assess the specialization and generalization of all matrix formats we consider. In Fig. 14, Ginkgo's CSR is the fastest for about 30% of the test cases, and Ginkgo's HYB is the winner in terms of generality (if the acceptable slowdown factor is larger than 1.0625). Very similarly, in Fig. 15, Ginkgo's CSR is the fastest kernel for roughly 30% of the test cases, and Ginkgo's HYB is the generalization winner if the acceptable slowdown factor is larger than 1.375. We note that the hipSPARSE ELL kernel stays at a low problem ratio as it employs the limitation excluding matrices that have more than (#nnz − 1)/#rows + 1 elements in a row. We already noticed in the analysis comparing Ginkgo's different SpMV kernels to the vendor libraries that AMD's hipSPARSE library generally features much better-engineered kernels than NVIDIA's cuSPARSE library. In consequence, the performance profiles of AMD's SpMV kernels are also much closer to Ginkgo's SpMV kernel profiles than NVIDIA's SpMV kernel profiles.

We finally compare the SpMV performance limits of the RadeonVII and the V100 in Fig. 16. We consider both Ginkgo's back ends for the two architectures and the SpMV kernels available in the vendor libraries (labeled "Sparselib"). In most cases, the V100 is faster than the RadeonVII, but the speedup factors are moderate, with an average around 2x. The RadeonVII shows better performance for matrices that contain many nonzeros. The higher memory bandwidth of the RadeonVII might be one reason for these performance advantages, but as there are typically many factors (such as context switching, warp size, the number of multiprocessors, etc.) affecting the performance of SpMV kernels, identifying the origin of the performance results is difficult. While NVIDIA's V100 outperforms AMD's RadeonVII in most tests, we acknowledge that the price for a V100 (16 GB SXM2) is currently more than an order of magnitude higher than for a RadeonVII.

In this paper, we have presented a comprehensive evaluation of SpMV kernels for AMD and NVIDIA GPUs, including routines for the CSR, COO, ELL, and HYB formats. We have optimized all kernels for the latest GPU architectures from both vendors, including new algorithmic developments and parameter tuning. All kernels are part of the Ginkgo open-source library and typically outperform their counterparts available in the vendor libraries NVIDIA cuSPARSE and AMD hipSPARSE. We accompany the kernel release with a performance database and a web tool that allows investigating the performance characteristics interactively. We also conducted an extensive SpMV performance comparison on AMD RadeonVII and NVIDIA V100 hardware.
We show that despite NVIDIA's V100 providing better performance in many cases, AMD's RadeonVII with the hipSPARSE library is able to compete with NVIDIA's V100, in particular for matrices with a high number of nonzero elements. In addition, we note that due to the price discrepancy between the two devices (AMD's RadeonVII costs roughly 6.6% of the price of an NVIDIA V100), the AMD hardware provides a much better performance-per-dollar ratio. This may indicate that, after a long period of NVIDIA dominating the HPC GPU market, AMD is stepping up to recover a serious competitor position.

References
1. On block-asynchronous execution on GPUs
2. Load-balancing sparse matrix vector product kernels on GPUs
3. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods
4. Implementing sparse matrix-vector multiplication on throughput-oriented processors
5. Optimizing sparse matrix operations on GPUs using merge path
6. Overcoming load imbalance for irregular sparse matrices
7. A survey of sparse matrix-vector multiplication performance on large matrices
8. MATLAB Guide
9. Adaptive sparse tiling for sparse matrix multiplication
10. Accelerating CUDA graph algorithms at maximum warp
11. Google's PageRank and Beyond: The Science of Search Engine Rankings
12. Merge-based parallel sparse matrix-vector multiplication
13. High-performance and scalable GPU graph traversal
14. Whitepaper: NVIDIA Tesla V100 GPU Architecture
15. The SuiteSparse Matrix Collection

Acknowledgment. This work was supported by the "Impuls und Vernetzungsfond" of the Helmholtz Association under grant VH-NG-1241, and by the US Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration.