key: cord-0058356-vr3ru8c0
authors: Aliaga, José Ignacio; Anzt, Hartwig; Quintana-Ortí, Enrique S.; Tomás, Andrés E.; Tsai, Yuhsiang M.
title: Balanced and Compressed Coordinate Layout for the Sparse Matrix-Vector Product on GPUs
date: 2021-02-15
journal: Euro-Par 2020: Parallel Processing Workshops
DOI: 10.1007/978-3-030-71593-9_7
sha: 9d2518c578c29554a1277a97b823257e7b3d7bf2
doc_id: 58356
cord_uid: vr3ru8c0

We contribute to the optimization of the sparse matrix-vector product on graphics processing units by introducing a variant of the coordinate sparse matrix layout that compresses the integer representation of the matrix indices. In addition, we employ a look-up table to avoid the storage of repeated numerical values in the sparse matrix, yielding a more compact data representation that is easier to maintain in the cache. Our evaluation on the two most recent generations of NVIDIA GPUs, the V100 and the A100 architectures, shows considerable performance improvements over the kernels for the sparse matrix-vector product in cuSPARSE (CUDA 11.0.167).

The sparse matrix-vector product (SpMV) is a fundamental operation for the iterative solution of sparse linear systems, since it is usually the computationally most expensive building block in stationary schemes as well as in Krylov subspace methods [10]. The SpMV is, in general, a memory-bound operation, which means that its performance is strongly determined by the memory access volume and the access pattern dictated by the algorithmic realization of the kernel, together with the memory bandwidth of the target computer architecture. In this context, the irregularity of the memory accesses turns the parallel optimization of SpMV into a challenging task.

A particular factor that directly influences the implementation and (parallel) performance of SpMV is the data layout of the sparse matrix. The coordinate format (COO) [10] is likely the most intuitive layout: for each non-zero matrix entry, this scheme maintains a 3-tuple with the entry's row and column indices and its numerical value. The compressed sparse row format (CSR) [10] is a flexible alternative that reduces the indexing overhead with respect to COO by storing only starting/ending indices (pointers) for each matrix row, while keeping the same information for the column indices and values as COO. A plethora of application-specific sparse matrix layouts have been proposed over the past decades; see [2-5, 8, 9] among many others. In general, these solutions deliver high performance for some problem domains and/or computer architectures, but perform poorly and/or require expensive transformations of the matrix format for others.

In [7] we introduced a balancing parallelization scheme for GPUs optimized for matrices with an irregular row distribution of the non-zero entries. In brief, this scheme: 1) is based on the standard CSR format; 2) requires an inexpensive pre-processing step; and 3) consumes only a minor amount of additional memory compared with significantly more expensive GPU-specific sparse matrix layouts. The balancing approach departs from the conventional parallelization across matrix rows by instead distributing the workload evenly among the thread teams, while avoiding race conditions via atomic transactions that are efficiently supported by hardware in recent GPU architectures. In [6], we extended the idea to the COO format, showing that the resulting kernel is superior to some of the most popular SpMV implementations based on both COO and CSR.
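As a small illustration of the two baseline layouts discussed above, the following sketch (with illustrative array names that are not part of the paper) stores a 4 x 4 matrix with 6 non-zero entries in COO and CSR:

/* A 4 x 4 sparse matrix with 6 non-zero entries:
       | 4 0 1 0 |
   A = | 0 3 0 0 |
       | 0 2 5 0 |
       | 0 0 0 7 |                                            */

/* COO: one (row, column, value) triplet per non-zero entry.  */
int    coo_i[6] = {0, 0, 1, 2, 2, 3};   /* row indices        */
int    coo_j[6] = {0, 2, 1, 1, 2, 3};   /* column indices     */
double coo_a[6] = {4, 1, 3, 2, 5, 7};   /* numerical values   */

/* CSR: row pointers replace the explicit row indices.        */
int    csr_p[5] = {0, 2, 3, 5, 6};      /* start of each row in csr_j/csr_a */
int    csr_j[6] = {0, 2, 1, 1, 2, 3};   /* column indices     */
double csr_a[6] = {4, 1, 3, 2, 5, 7};   /* numerical values   */

The indexing overhead of COO is one integer per non-zero entry, whereas CSR only needs one pointer per row.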
In this paper, we continue our effort towards the optimization of SpMV on GPUs by making the following contributions:

- We propose orthogonal (independent) enhancements of the balancing COO-based scheme in [7] that result in a compressed storage format for the matrix data (indices and values), thus reducing the memory traffic and improving performance.
- We develop a high-performance realization of this scheme for the most recent generations of NVIDIA GPUs (Volta and Ampere).
- We provide a complete evaluation of the new kernel in comparison with highly optimized implementations of SpMV, based on COO and CSR, from NVIDIA cuSPARSE (those in CUDA 11.0.167). Following standard practice, this analysis is performed both from the perspective of memory consumption and of GFLOPS (billions of floating-point arithmetic operations, or flops, per second).

The idea of compressing the indexing information to reduce the pressure on memory bandwidth is not new. In this sense, our approach is loosely related to the compressed sparse blocks (CSB) format [3], which partitions the sparse matrix into a regular grid of sparse blocks, each of which is stored in CSR format with the block indices compressed as offsets to a reference. In comparison, we also maintain the indices as offsets, encoded using fewer bits. However, our scheme is based on COO instead of CSR; we divide the nonzero matrix entries (instead of the matrix itself) into regular chunks; we couple this partitioning with a balanced workload distribution for GPUs; and we also explore the compression of the numerical data using a look-up table.

The rest of the paper is organized as follows. In Sect. 2, we review the COO format and introduce our new balanced and compressed variant for GPUs based on it. In Sect. 3, we evaluate a standalone implementation of the new scheme for SpMV in comparison with the GPU kernels in NVIDIA cuSPARSE. In addition, in that section, we also assess the impact of the scheme when the SpMV kernel is integrated into the biconjugate gradient stabilized method (BiCGSTAB) [10]. Finally, in Sect. 4, we offer some concluding remarks and a brief discussion of open research lines.

Consider the SpMV y := A · x, where A is an n × n sparse matrix with n_z non-zero entries and x, y are both vectors with n components. The COO format employs three vectors, say a, i and j, each of dimension n_z, to maintain the values of the non-zero elements of the matrix and their row and column index coordinates, respectively.

In a direct parallelization of the COO-based SpMV on a GPU, each thread operates on a single nonzero element of the matrix, performing the multiplication with the corresponding entry of x and using atomic operations to accumulate the partial result on the appropriate component of y. The performance of this initial approach can be improved if each thread computes several elements of the result vector, as typically 2 or 4 elements suffice for the compiler to aggregate enough memory access operations to overlap transfers and arithmetic. The excerpt of CUDA-like code in Listing 1.1 illustrates this approach for a COO-based SpMV with A stored using vectors a, i, and j. There, each thread computes K accumulations of the form y_i := y_i + a_ij · x_j, involving K nonzero matrix elements. Note that, for simplicity, we assume that n_z is an exact multiple of B · K, where B denotes the number of threads per block. Otherwise, the matrix can be padded with explicit zero elements.
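Listing 1.1 itself is not reproduced here; a minimal sketch of such a kernel, assuming that each thread handles K consecutive elements of a chunk of B · K nonzeros and using illustrative names (not the authors' code), could look as follows:

// Sketch of the direct COO SpMV kernel described above (illustrative only):
// each thread performs K accumulations y_i += a_ij * x_j via atomic updates.
template <int K>
__global__ void spmv_coo(const int *i, const int *j, const double *a,
                         const double *x, double *y) {
  // n_z is assumed to be padded to an exact multiple of B * K,
  // with B = blockDim.x threads per block.
  int first = (blockIdx.x * blockDim.x + threadIdx.x) * K;
  #pragma unroll
  for (int k = 0; k < K; ++k) {
    int e = first + k;
    atomicAdd(y + i[e], a[e] * x[j[e]]);   // atomic to avoid races on y
  }
}

Under these assumptions, the kernel would be launched with n_z/(B · K) thread blocks of B threads each, which matches the implicit "loops" discussed next.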
In practice, the number of iterations in the loop of the SpMV kernel in Listing 1.1 is small, and the whole loop should be unrolled to attain high performance. For that purpose, it is convenient to pad each matrix row with explicit zeros so that its number of stored elements becomes an exact multiple of K. A second "loop" is implicit in the GPU code, as the B threads of a block perform the operations for a chunk of B · K matrix elements. In current NVIDIA GPUs, the number of threads in a block is limited to 1,024 and must be over 192 for good performance. The compromise value B = 256 is close to optimal and provides some advantages from the perspective of the compression technique introduced in the next subsection. Finally, a third (outermost) "loop" is also implicitly present, corresponding to the n_z/(B · K) thread blocks.

With this approach, the GPU hardware scheduler dynamically assigns thread blocks to the chunks of the matrix. This is important because the execution time of the threads can be quite different, given the variations in the access cost to the vectors x and y. The reason is that, although each thread processes the same number of elements, the matrix pattern can result in very different cache hit/miss ratios during the accesses to the input vector x. Also, the ordering of the matrix elements can introduce a significant number of cache misses in the update of the result vector y. In addition, atomic operations must be used to avoid race conditions in this update. Although atomic primitives have efficient support in modern GPU hardware, they introduce contention among the threads, adding further variability to the execution time.

In principle, the COO format does not enforce any specific ordering of the matrix elements. However, a random ordering will result in poor locality during the accesses to the result y. In contrast, a row-major ordering (such as that used in CSR) renders excellent locality during the same accesses, but with higher contention among threads. To avoid this, a segmented scan is implemented using the warp-level communication primitives available on NVIDIA GPUs. The fragment of CUDA code in Listing 1.2 shows this reduction. There, the variable v stores the value that has to be accumulated and the variable row its corresponding row index. This solution mimics the highly parallel variant of the classic prefix sum: each thread communicates its accumulated value, as well as the row index for that value, to the thread in the next level of the hierarchy. The accumulation continues if the received row index matches the index of the row assigned to the receiving thread. Assuming a row-major ordering (i.e., consecutive row indices), for each row the thread with the lowest identifier among those participating in the accumulation for that row gathers the partial products of all the elements in that row. Only this thread issues a global memory access operation to write the final value to main memory.

// Excerpt; the loop below reconstructs the omitted upper part of the listing.
for (int l = 1; l < W; l *= 2) {                      // W: number of threads cooperating in the scan
  double t = __shfl_down_sync(0xffffffff, v, l);      // partial value received from thread x + l
  int    s = __shfl_down_sync(0xffffffff, row, l);    // row index received from thread x + l
  if (row == s && threadIdx.x + l < W) v += t;
}
int prev = __shfl_up_sync(0xffffffff, row, 1);
if (threadIdx.x == 0 || row != prev) atomicAdd(Y + row, v);

Listing 1.2. CUDA code that performs the accumulation on y.

Fig. 1. Diagram of a segmented scan of 8 elements using 8 threads.

Figure 1 shows a reduced example using only 8 threads. The first column represents the contents of the row variable in each thread and the last column corresponds to the result vector y. The columns in between show the value of v at each step of the loop.
Each arrow represents a message sent between threads, using a dotted line when the received value is not added because it crosses a row boundary. The solid arrows are atomic additions to y in main memory. This reduction scheme requires 11 communication operations for adding the values of all rows, compared with 5 communications in a regular reduction, which only computes one sum. Therefore, it is only sub-optimal when all the matrix elements processed by a warp pertain to the same row. However, this case is avoided by the compression technique introduced in the next subsection.

Following the balanced thread distribution, each block of threads processes exactly one chunk of B · K nonzero elements of the sparse matrix. If the matrix elements are ordered row-wise and, within each row, by column (as is the case in the CSR format), each chunk will likely present a significant number of repeated row indices in vector i as well as clustered column indices in vector j. In addition, for some applications, many of the matrix values are repeated. For these reasons, it may be beneficial to use a different encoding for each chunk, reducing (compressing) the amount of memory required to store the sparse matrix. This approach avoids thread divergence, as the same format is used for all the elements in a chunk. At the same time, the compression level may not be optimal, as it needs to account for the values accessed by several threads.

To implement this compression, a handful of auxiliary vectors are required, all of length n_z/(B · K) (that is, vectors with one element per chunk). The first vector contains a 1-byte entry per chunk to specify which particular format is used for that chunk. Table 1 shows a summary of the possible encodings for a matrix with double precision (DP) floating point data. The row index and element value alternatives can each be represented by a single bit, while the column index requires two bits in each 1-byte entry of the vector. Two additional integer vectors then contain the baseline (reference) row and column indices of the elements in the chunk, which correspond to those of the top-leftmost nonzero entry of the sparse matrix in the chunk. Finally, as the space occupied by distinct chunks will often differ, a vector of integers is used to point to the start of each chunk.

Instead of the three original COO vectors (i, j and a), the data of the matrix elements in a chunk are maintained in a blob (Binary Large OBject), with the B row indices first, followed by the B · K column indices, and finally the B · K values. These blobs are stored contiguously in memory with no alignment issues, provided B and B · K are both integer multiples of 8 for DP data (or 4 for single precision values). The values of i and j are stored as offsets relative to the baseline element of the chunk. For B ≤ 256, the row index is encoded using one byte only, as most matrices contain at least one element per row. If this is not the case, each empty row is padded with an explicit zero element. If the whole chunk corresponds to a single row, it is not necessary to store any row offset for the individual elements, and a regular sum reduction is used instead of the segmented scan. Depending on the nonzero pattern of the matrix, the column index is encoded using 8, 16, or 32 bits. For sparse matrices arising in non-graph applications, the nonzero entries in a row usually appear in clusters, making it possible to use fewer bits to encode the column indices.
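To make the chunk layout concrete, the following sketch shows one possible way of decoding a single element of a compressed chunk. The structure, the field and function names, and the specific combination of 8-bit row offsets, 16-bit column offsets, and table-indexed values are illustrative assumptions rather than the actual implementation; the look-up table (LUT) for repeated values is described next.

// Illustrative per-chunk metadata for the compressed COO format (names are ours).
struct chunk_meta {
  unsigned char format;    // 1-byte encoding selector (cf. Table 1)
  int           base_row;  // baseline row index of the chunk
  int           base_col;  // baseline column index of the chunk
};

// Decode element k (0 <= k < B*K) of a chunk whose blob stores the B row
// offsets first, then the B*K column offsets, and finally the B*K values.
// Only the 8-bit row / 16-bit column / LUT-value combination is sketched;
// each thread is assumed to own K consecutive elements, so element k uses
// the row offset of thread k / K.
__device__ void decode_element(const unsigned char *blob,   // start of this chunk's blob
                               chunk_meta m, const double *lut,
                               int B, int K, int k,
                               int *row, int *col, double *val) {
  const unsigned char  *row_off = blob;                                  // B bytes
  const unsigned short *col_off = (const unsigned short *)(blob + B);    // B*K x 16 bits
  const unsigned char  *val_idx = blob + B + 2 * B * K;                  // B*K bytes
  *row = m.base_row + row_off[k / K];
  *col = m.base_col + col_off[k];
  *val = lut[val_idx[k]];             // one byte indexing the 256-entry look-up table
}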
While converting the matrix, a look-up table (LUT) is built containing the 256 most frequent values. If all value entries in a chunk are covered by the LUT, only one byte per element is used to index the corresponding entry of the LUT instead of storing the actual floating point values.

Figure 2 shows an example corresponding to a small chunk (B = 8 and K = 1) in compressed COO format. The original COO data is represented to the left of the arrow and the different elements of the compressed COO format to the right. In this figure, each column of the blob corresponds to the respective original COO vector. The first column contains the (row) i indices as offsets to the row baseline. Similarly, the second column contains the (column) j indices as offsets to the column baseline. Finally, the third column contains an index into the LUT where the double precision values are stored. The row and column offsets are specific to each chunk/element, but the LUT is common to the whole matrix.

For the experimental evaluation of the new compressed realization of SpMV, we selected 60 test matrices from the SuiteSparse Matrix Collection [1]. The chosen benchmarks have row/column dimensions larger than 900,000 and arise in a variety of scientific problems excluding graph applications. (Although the adjacency matrices associated with graphs have excellent compression properties, we do not consider them to be interesting use cases for the SpMV kernel, as there are more efficient algorithms for graph manipulation.) The test matrices along with some key properties are listed in Table 2.

Figure 3 visualizes the memory overhead of COO and Compressed COO with respect to CSR, assuming a DP floating point representation for the numerical values in all three formats and a 32-bit integer representation for the indices in CSR and COO. There are some matrices with clustered indices/repeated numerical entries where the compression schemes are especially efficient and, as a result, Compressed COO uses less memory than CSR. For the rest of the matrices, except in two cases, the overhead of Compressed COO over CSR is always smaller than that of regular COO.

We ran all the following experiments in this section using DP arithmetic on two distinct generations of NVIDIA accelerators:

- A V100 GPU with compute capability 7.0, furnished with 16 GB of main memory, 128 KB of L1 cache per streaming multiprocessor, and 6 MB of L2 cache. The memory bandwidth is 900 GB/s and the theoretical peak performance is 7.8 DP TFLOPS.
- An A100 GPU with compute capability 8.0, equipped with 40 GB of memory, 1.5 TB/s of main memory bandwidth, and a theoretical peak performance of 19.5/9.7 DP TFLOPS with/without DP tensor cores, respectively.

All the codes were compiled using CUDA version 11.0.167.

We first compare the computational efficiency of our realization of SpMV against the codes in NVIDIA cuSPARSE. This native library from NVIDIA offers three routines for this computational kernel: two based on CSR and one based on COO. In the following comparisons, we include only the default CSR SpMV algorithm from cuSPARSE, as the second CSR-based variant delivers very similar performance for the chosen test matrices. We do not include results for other formats, such as ELL or Hybrid-ELL, which were available in earlier versions of cuSPARSE but are no longer included in the latest version of the library.
Figure 4 compares the performance of NVIDIA's codes with that of our Compressed COO implementation, which applies the memory-reduction techniques described in Sect. 2 to diminish the indexing overhead for the row/column indices as well as the data values. The results in the figure, in terms of GFLOPS, show a large performance improvement using Compressed COO for matrices with clustered indices/repeated values. Concretely, we are able to achieve up to 170/250 GFLOPS on the V100/A100 GPUs, respectively. While the Compressed COO kernel almost always outperforms the cuSPARSE COO and the cuSPARSE CSR kernels (except for a few outliers where the performance is on par or negligibly lower), the median speed-up over its competitors is 1.4× on the V100 and 1.25-1.3× on the A100. Even though the median speed-ups over cuSPARSE CSR and cuSPARSE COO are almost identical, we note that the performance ratios for the distinct problems are more consistent when comparing the COO formats.

Fig. 4. Performance of the new compressed realization of SpMV against those in cuSPARSE on NVIDIA V100 and A100 GPUs (left and right, resp.).

We next evaluate the impact of the new compressed kernels for SpMV when integrated into an iterative solver for sparse linear systems based on a Krylov subspace method. For this purpose, we select a BiCGSTAB implementation based on CUDA. In the comparison, the BiCGSTAB solver employs the three different SpMV realizations analyzed in the previous subsection: Compressed COO, cuSPARSE CSR, and cuSPARSE COO. For the performance comparison, we execute a fixed number of iterations and measure the GFLOPS for the linear systems constructed from the same test cases selected for the standalone evaluation of SpMV.

Our experiments with the BiCGSTAB solver using the three SpMV kernels show that the performance benefits of the faster SpMV kernel execution carry over to the BiCGSTAB solver. The acceleration of the BiCGSTAB solver depends on the specific problem and how much the SpMV kernel contributes to the overall runtime cost. In that sense, the speed-ups of BiCGSTAB correspond to a scaled version of the SpMV speed-up values reported in Fig. 4, damped by the problem-specific ratio between the SpMV kernel cost and the total BiCGSTAB solver cost. In the end, equipping the BiCGSTAB solver with the Compressed COO SpMV kernel improves the overall iterative solver performance for virtually all problems, with a median speed-up of about 1.2× on both architectures.

We have adopted our previous balancing approach for SpMV to (virtually) divide the matrix contents into chunks (blocks) of nonzero entries of the same size; map these to the thread blocks; and prevent race conditions via efficient atomic operations. On top of this technique, in this work we have proposed a new compression scheme that reduces the amount of indexing information associated with a COO-based realization of SpMV while maintaining the balanced distribution.
For this purpose, the indices of each entry inside the same chunk are maintained as offsets with respect to a baseline row/column index pair, allowing the use of 8-bit encodings for the row indices and 8-/16-/32-bit encodings for the column indices, depending on the chunk. In addition, the observation that the numerical values of the sparse matrices arising in scientific applications present a considerable number of repetitions motivates the design of a compression scheme that employs a look-up table. The experimental results show the benefits of the new format, demonstrating a consistent advantage over the native implementation of the SpMV kernel in NVIDIA's cuSPARSE (CUDA 11.0.167) on the V100 and A100 GPUs.

The matrix format in this paper can be extended to support more efficient encodings. For example, the matrix values could be stored in different precisions, or not stored at all for graph adjacency matrices that contain a large number of entries equal to one. Furthermore, the presented format is suitable for very large-scale matrices that require 64-bit indices.

References

Efficient sparse matrix-vector multiplication on CUDA
Reduced-bandwidth multithreaded algorithms for sparse matrix-vector multiplication
Model-driven autotuning of sparse matrix-vector multiply on GPUs
Sparse matrix-vector multiplication on GPGPUs
Overcoming load imbalance for irregular sparse matrices
Balanced CSR sparse matrix-vector product on graphics processors
A survey of sparse matrix-vector multiplication performance on large matrices
CSR5: an efficient storage format for cross-platform sparse matrix-vector multiplication
Iterative Methods for Sparse Linear Systems

Acknowledgements. This work was partially sponsored by the EU H2020 project 732631 OPRECOMP and project TIN2017-82972-R of the Spanish MINECO. Hartwig Anzt and Yuhsiang M. Tsai were supported by the "Impuls und Vernetzungsfond" of the Helmholtz Association under grant VH-NG-1241 and by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. The authors would like to thank the Steinbuch Centre for Computing (SCC) of the Karlsruhe Institute of Technology for providing access to an NVIDIA A100 GPU.