key: cord-0045641-25siwrtq
authors: Ayala, Alan; Tomov, Stanimire; Haidar, Azzam; Dongarra, Jack
title: heFFTe: Highly Efficient FFT for Exascale
date: 2020-05-26
journal: Computational Science - ICCS 2020
DOI: 10.1007/978-3-030-50371-0_19
sha: 8ad6f082312df5cc5ee5c6b1712ff5dcc2a7e8b0
doc_id: 45641
cord_uid: 25siwrtq

Exascale computing aspires to meet the increasing demands of large scientific applications. Software targeting exascale is typically designed for heterogeneous architectures; hence, it is important not only to develop well-designed software, but also to make it aware of the hardware architecture and able to exploit its power efficiently. Currently, several diverse applications, such as those in the Exascale Computing Project (ECP) in the United States, rely on efficient computation of the Fast Fourier Transform (FFT). In this context, we present the design and implementation of the heFFTe (Highly Efficient FFT for Exascale) library, which targets the upcoming exascale supercomputers. We provide highly (linearly) scalable GPU kernels that achieve more than 40× speedup with respect to local kernels from state-of-the-art CPU libraries, and over 2× speedup for the whole FFT computation. A communication model for parallel FFTs is also provided to analyze the bottleneck for large-scale problems. We show experiments obtained on the Summit supercomputer at Oak Ridge National Laboratory, using up to 24,576 IBM Power9 cores and 6,144 NVIDIA V100 GPUs.

Considered one of the top 10 algorithms of the 20th century, the Fast Fourier Transform (FFT) is widely used by applications in science and engineering. Such is the case for applications targeting exascale, e.g. LAMMPS (EXAALT-ECP) [14], and diverse software ranging from particle applications [21] and molecular dynamics, e.g. HACC [7], to applications in machine learning, e.g. [16]. For all these applications, it is critical to have access to a heterogeneous, fast, and scalable parallel FFT library, with an implementation that can take advantage of novel hardware components and efficiently exploit their benefits. Highly efficient implementations for computing FFTs on a single node have been developed over a long time. One of the most widely used libraries is FFTW [10], which has been tuned to perform optimally on several architectures. Vendor libraries for this purpose have also been highly optimized; such is the case for MKL (Intel) [13], ESSL (IBM) [8], clFFT (AMD) [1], and CUFFT (NVIDIA) [18]. Novel libraries are also being developed to further optimize single-node FFT computation, e.g. FFTX [9] and Spiral [22]. Most of the previous libraries have been extended to distributed-memory versions, some by the original developers and others by different authors. In the realm of distributed-CPU libraries, FFTW supports MPI via slab decomposition; however, its scalability is limited, restricting it to a small number of nodes. P3DFFT [19] extends FFTW functionalities and supports both pencil and slab decompositions. Large-scale applications have built their own FFT libraries, such as FFTMPI [20] (built into LAMMPS [14]) and SWFFT [23] (built into HACC [7]). These libraries are currently used by several molecular-dynamics applications. Concerning distributed-GPU libraries, the slab approach introduced in [17] is one of the first heterogeneous codes for large FFT computation on GPUs.
Its optimization approach is limited to a small number of nodes and focuses on reducing the tensor transposition cost (known to be the bottleneck) by exploiting the InfiniBand interconnect through the IBverbs library, which makes it not portable. Further improvements to scalability have been presented in the FFTE library [26], which supports pencil decompositions and includes several optimizations, although with limited features and limited improvements on communication. Also, FFTE relies on the commercial PGI compiler, which may limit its usage. Finally, one of the most recent libraries is AccFFT [11]; its approach consists of overlapping computation with blocking collective communication while reducing the PCIe overhead, and it provides good (sublinear) scalability results for large real-to-complex transforms using NVIDIA K20 GPUs. Even though the fast development of GPUs has enabled great speedups for local computations, the cost of communication between CPUs/GPUs in large-scale computations remains the bottleneck; this is a major challenge supercomputing has faced over the last decade [6]. Large parallel FFTs are well known to be communication bound; experiments and models have shown that for large node counts the communication cost must be efficiently managed to properly target exascale systems [5, 15]. In this context, we introduce heFFTe (pronounced "hefty"), which provides very good (linear) scalability for large node counts. It is open source and consists of C++ and CUDA kernels with a (CUDA-aware) MPI and OpenMP interface for communication. It has a user-friendly interface and does not require any commercial compiler. Wrappers to interface with C, Fortran, and Python are available. It is publicly available in [2] and documented in [24, 27-29]. Its main objective is to become the standard for large FFT computations on the upcoming exascale systems. Figure 1 shows how heFFTe is positioned in the ECP software stack, together with some of its target exascale applications (gray boxes).

This paper is organized as follows. Sect. 2 describes the classical multidimensional FFT algorithm and its implementation phases within heFFTe; we then present heFFTe's main features and functionalities. Next, Sect. 3 presents a multi-node communication model for parallel FFTs and an approach to reduce its computational complexity for a small number of nodes. Section 4 presents numerical experiments on the Summit supercomputer and evaluates the impact of multi-GPU communication on performance. Finally, Sect. 5 concludes the paper.

Multidimensional FFTs can be performed by a sequence of low-dimensional FFTs (see e.g. [12]). Typical approaches used by parallel libraries are the pencil and slab decompositions. Algorithm 1 presents the pencil decomposition approach, which computes 3D FFTs by means of three 1D FFTs; this approach is schematically shown in Fig. 2. On the other hand, slab decomposition relies on computing sets of 2D and 1D FFTs.
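To make the underlying principle concrete, the following minimal single-node sketch computes a small 3D transform as three batches of 1D transforms, one per direction; it uses plain O(n^2) DFTs for brevity (a real library would call FFTW, CUFFT, etc. for each batch), and the array layout and helper names are purely illustrative, not heFFTe code. The parallel pencil and slab decompositions distribute exactly these batches of 1D (or 2D) transforms across processors, inserting a data reshape before each batch.

// Minimal single-node sketch (not heFFTe code): a 3D transform computed as
// three rounds of 1D transforms, one per direction. Plain O(n^2) DFTs are
// used for brevity; a real library calls an optimized 1D/batched FFT kernel.
#include <complex>
#include <vector>
#include <cmath>
#include <cstdio>

using cplx = std::complex<double>;

// In-place 1D DFT of a line of length n starting at `start` with stride `stride`.
static void dft_1d(std::vector<cplx> &a, int start, int stride, int n) {
    const double pi = std::acos(-1.0);
    std::vector<cplx> out(n);
    for (int k = 0; k < n; ++k) {
        cplx s(0.0, 0.0);
        for (int j = 0; j < n; ++j) {
            double ang = -2.0 * pi * k * j / n;
            s += a[start + j * stride] * cplx(std::cos(ang), std::sin(ang));
        }
        out[k] = s;
    }
    for (int k = 0; k < n; ++k) a[start + k * stride] = out[k];
}

int main() {
    const int n0 = 4, n1 = 4, n2 = 4;                 // small N0 x N1 x N2 grid
    std::vector<cplx> x(n0 * n1 * n2);
    for (std::size_t i = 0; i < x.size(); ++i) x[i] = cplx(double(i % 7), 0.0);
    auto idx = [&](int i, int j, int k) { return (i * n1 + j) * n2 + k; };

    // 1D transforms along direction 2 (contiguous lines), then 1, then 0;
    // in a pencil decomposition each round is preceded by a data reshape.
    for (int i = 0; i < n0; ++i)
        for (int j = 0; j < n1; ++j) dft_1d(x, idx(i, j, 0), 1, n2);
    for (int i = 0; i < n0; ++i)
        for (int k = 0; k < n2; ++k) dft_1d(x, idx(i, 0, k), n2, n1);
    for (int j = 0; j < n1; ++j)
        for (int k = 0; k < n2; ++k) dft_1d(x, idx(0, j, k), n1 * n2, n0);

    std::printf("X[0,0,0] = (%g, %g)\n", x[0].real(), x[0].imag());
    return 0;
}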
Figure 2 schematically shows the steps of a parallel 3D FFT computation using a 3D partition of processors. The top part of this figure presents the pencil methodology, as described in Algorithm 1, in which N̂_i denotes the output data obtained from applying a 1D FFT of size N_i in the i-th direction. This approach can be summarized as follows. The input data of size N0 × N1 × N2 is initially distributed onto a grid of processors, Pi0 × Pi1 × Pi2, in what is known as a brick decomposition. Then, a reshape (transposition) puts the data into pencils in the first direction, where the first set of 1D FFTs is performed. These two steps are repeated for the second and third directions. Observe that the intermediate reshaped data is handled in new processor grids, which must be appropriately created to ensure load balancing; for simplicity, a single Q0 × Q1 grid is used in Algorithm 1. Finally, a last data reshape takes the pencils in the third direction into the output brick decomposition. Several applications provide input data already in a pencil distribution in the first direction and require the output written as pencils in the third direction; in this case, only two data reshapes are required, and this is the default for the FFTE [26] and AccFFT [11] libraries. On the other hand, heFFTe can treat input and output shapes with high flexibility, generalizing the features of modern libraries, with a friendly interface as presented in Sect. 2.3. Finally, the bottom part of Fig. 2 shows the slab approach, which saves one data reshape by performing 2D FFTs; this has a considerable impact on performance for a small number of nodes [25].

Two main sets of kernels intervene in a parallel FFT computation:
1. Computation of low-dimensional FFTs, which can be obtained from optimized single-node FFT libraries, such as those described in Sect. 1.
2. Data reshape, which essentially consists of a tensor transposition and takes a large part of the computation time.

To compute the low-dimensional FFTs, heFFTe supports several open-source and vendor libraries for a single node, such as those described in Sect. 1, and it also provides templates to select the type and precision of the data. For direct integration into applications, heFFTe provides example wrappers and templates to help users easily link it with their libraries. Data reshape is essentially built from two sets of routines. The first consists of packing and unpacking kernels, which respectively manage the data to be sent and received among processors. Generally, these kernels account for less than 10% of the reshaping time. Several options for packing and unpacking data are available in heFFTe, and there is an option to tune and select the best one for a given problem on a given architecture. The second set of routines corresponds to the communication kernels; heFFTe supports binary (point-to-point) and collective communications, as presented in Table 1. The main performance gain of heFFTe over standard libraries comes from kernel optimization (>40× speedup w.r.t. CPU libraries, cf. Fig. 6), and also from a novel, efficient asynchronous management of packing and communication, shown in Fig. 3, where we can observe the classical packing process supported by most libraries and a novel approach via the routine heFFTe_alltoallv, introduced in [4], which overlaps packing/unpacking with MPI communication.
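As a rough sketch of this overlap idea (an illustrative non-blocking point-to-point scheme, not the actual heFFTe_alltoallv implementation; pack() and the buffer layout are invented for the example), receives are posted up front and each per-destination buffer is sent as soon as it is packed, so packing of the next block proceeds while earlier messages are in flight:

// Illustrative overlap of packing and communication with non-blocking MPI.
// This is a sketch, not heFFTe's heFFTe_alltoallv; pack() stands in for the
// tensor-transposition packing kernel and the buffer layout is simplified.
#include <mpi.h>
#include <vector>

// Toy packing kernel: copy the block of `local` destined to rank `peer`.
static void pack(const std::vector<double> &local, int peer, int chunk,
                 std::vector<double> &sendbuf) {
    for (int i = 0; i < chunk; ++i) sendbuf[i] = local[peer * chunk + i];
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int chunk = 1024;                          // elements exchanged per peer
    std::vector<double> local(chunk * nprocs, rank), recv(chunk * nprocs);
    std::vector<std::vector<double>> sendbufs(nprocs, std::vector<double>(chunk));
    std::vector<MPI_Request> reqs(2 * nprocs, MPI_REQUEST_NULL);

    // Post all receives first so incoming messages can land at any time.
    for (int peer = 0; peer < nprocs; ++peer)
        MPI_Irecv(recv.data() + peer * chunk, chunk, MPI_DOUBLE, peer, 0,
                  MPI_COMM_WORLD, &reqs[peer]);

    // Pack the block for one destination and send it immediately; packing of
    // the next block overlaps with the transfers already in flight.
    for (int peer = 0; peer < nprocs; ++peer) {
        pack(local, peer, chunk, sendbufs[peer]);
        MPI_Isend(sendbufs[peer].data(), chunk, MPI_DOUBLE, peer, 0,
                  MPI_COMM_WORLD, &reqs[nprocs + peer]);
    }
    MPI_Waitall(2 * nprocs, reqs.data(), MPI_STATUSES_IGNORE);
    // Unpacking into the new (pencil) layout would follow here.
    MPI_Finalize();
    return 0;
}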
Parallel FFT libraries typically handle communication by moving data structures in the shape of pencils, bricks, and slabs. For each of these options the total amount of data communicated is always the same; hence, decreasing the number of messages between processors means increasing the size of the messages they send. On the other hand, for modern hardware architectures it is well known that latency and bandwidth improvements do not grow as quickly as the arithmetic computation power [6]. Therefore, it is important to choose the appropriate communication scheme. For instance, reshaping brick to pencil data requires O(P^{1/3}) messages, which can be verified by overlapping both grids. Analogously, the number of messages for reshaping pencil to pencil is O(P^{1/2}), while it is O(P^{2/3}) for brick to slab, and O(P) for slab to pencil. Choosing the right communication scheme highly depends on the problem size and hardware features. heFFTe supports standard MPI_Alltoallv within subgroups, which generally yields better performance than point-to-point communication. However, optimizations of all-to-all routines on heterogeneous clusters are still not available (e.g. in the NVIDIA Collective Communications Library [3]), even though, as can be seen in Fig. 6, improvements to all-to-all communication are critical. For this reason, we developed a routine called heFFTe_alltoallv [4], which includes several all-to-all communication kernels and can be used to tune and select the best one for a given architecture. This routine also aims to provide better management of multi-rail communication. The asynchronous approach of heFFTe_alltoallv proved efficient for up to 32 nodes, while the multi-rail management, although promising, negatively impacts performance as the node count increases, degrading the potential benefit of the multi-rail optimization [4]. Work on communication-avoiding frameworks targeting large node counts on heterogeneous clusters is ongoing.

heFFTe's software design is built on C and C++, and it provides wrappers for Fortran and Python. The API aims to be user-friendly and portable across several architectures, allowing users to easily link it to their applications. The API follows the styles of the FFTW3 and CUFFT libraries, adding novel features such as templates for multitype and multiprecision data. A distributed FFT requires data split over a processor grid. In 3D, input/output arrays are typically defined by the six vertices of a brick, e.g. (i_lo, i_hi), as shown in Fig. 2; heFFTe allows this definition using any MPI sub-communicator, fft_comm, which has to be provided by the user together with the data definition.

FFT Plan Definition. Once the data and processor grids are locally defined, the user can create an FFT plan by simply providing the following parameters:
- dim: problem dimension, e.g., dim = 3
- N: array size, e.g., N = [nx,ny,nz]
- permute: permutation storage of the output array.
Next, we show how a heFFTe plan is created; memory requirements are returned to the user as a workspace array.
...
/* Create FFT plan */
heffte_plan_create(dim, work, fft, N, i_lo, i_hi, o_lo, o_hi, permute, workspace);
Note that a single plan can be used for several FFT computations, which is typical for many applications where the grids are fixed.

FFT Execution. One of heFFTe's most important kernels is the one in charge of the FFT computation. This kernel has the same syntax for any type of data, and its usage follows the APIs of CUFFT and FFTW3.
...
/* Compute an in-place complex-to-complex (C2C) forward FFT */
heffte_execute(fft, work, work);
A similar execution function, heffte_execute_r2c, is available for real-to-complex (R2C) transforms.
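Putting these pieces together, a schematic usage could look as follows. This is only a sketch based on the calls shown above: the header name, the concrete types of the fft / work / workspace handles, and the brick setup are assumptions for illustration and may differ from the released heFFTe API.

/* Schematic usage of the calls shown above (a sketch, not a verbatim
 * heFFTe program).                                                        */
#include <mpi.h>
#include "heffte.h"                     /* header name assumed */

void fft_example(MPI_Comm fft_comm, int nx, int ny, int nz,
                 int i_lo[3], int i_hi[3], int o_lo[3], int o_hi[3])
{
    int dim = 3;                        /* problem dimension                 */
    int N[3] = { nx, ny, nz };          /* global FFT size                   */
    int permute = 0;                    /* storage permutation of the output */

    /* The sub-communicator is supplied together with the data definition.  */
    (void) fft_comm;

    /* Placeholder handles: the real types are defined by the library.      */
    void *fft = NULL, *work = NULL, *workspace = NULL;

    /* Create the plan once; memory requirements come back via `workspace`,
     * and the same plan can be reused for every transform on a fixed grid. */
    heffte_plan_create(dim, work, fft, N, i_lo, i_hi, o_lo, o_hi,
                       permute, workspace);

    /* In-place complex-to-complex forward FFT; heffte_execute_r2c is the
     * real-to-complex counterpart.                                         */
    heffte_execute(fft, work, work);
}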
To analyze the bottleneck of exascale FFT computation, communication models can be deduced and experimentally verified for different types of computer clusters [5]. These models can be built for specific frameworks, such as pencil and slab data exchanges [25], or they can be oriented to the hardware architecture [11]. In this section, we propose an inter-node communication model for large FFTs. We focus on inter-node effects since fast interconnection, e.g. NVLINK, is typically available intra-node, and properly scheduled intra-node communications can overlap their cost with the inter-node communications. Table 2 summarizes the parameters used in the communication model. To create the model, we analyze the computational intensity ϕ in Flops/Byte. For an FFT of size N, the number of flops is 5N log(N) and the volume of data moved at each of the r reshapes is αN bytes (α being the number of bytes per element); hence, for the total FFT computation using P nodes, we get

ϕ = 5N log(N) / (r α N) = 5 log(N) / (r α), (1)

and the peak (roofline) performance in GFlops, attainable with aggregate inter-node bandwidth P · B, is defined as

Ψ = ϕ P B = 5 P B log(N) / (r α). (2)

For the Summit supercomputer, the node interconnect bandwidth is B = 25 GB/s; considering r = 4 reshapes (cf. Fig. 2) and double-precision complex data (i.e. α = 16), we obtain

Ψ_Summit = 5 P log(N) · 25 / (16 · 4) ≈ 1.953 P log(N). (3)

Figure 8 shows heFFTe's performance for a typical FFT of size N = 1024^3 and compares it to this roofline peak for an increasing number of nodes, reaching about 90% of the peak value.

In this section we present numerical experiments on the Summit supercomputer, which has 4,608 nodes, each composed of 2 IBM Power9 CPUs and 6 NVIDIA V100 GPUs. For our experiments, we use the pencil decomposition approach, which is commonly available in classical libraries and can be shown to be faster than the slab approach for large node counts [25]. In Fig. 4, we first show a strong scalability comparison between the heFFTe GPU and CPU implementations, the former being ∼2× faster than the latter; we observe very good linear scalability in both curves. Also, since the heFFTe CPU version is based on improved versions of kernels from the FFTMPI and SWFFT libraries [29], its performance is at least as good as theirs; therefore, heFFTe GPU is also ∼2× faster than the FFTMPI and SWFFT libraries. The drop in performance of the CPU implementation after 512 nodes (12,288 cores) is due to latency impact, which we verified with several experiments: for a 1024^3 size, the number of messages becomes very large while their size becomes very small. When increasing the problem size, the CPU version keeps scaling very well, as shown in Fig. 5. Next, Fig. 5 shows a weak scalability comparison of the heFFTe GPU and CPU implementations for different 3D FFT sizes, showing over 2× speedup and very good scaling. To show the impact of local kernel acceleration, Fig. 6 presents a profile of a single 3D FFT using both the CPU and GPU versions of heFFTe, where the over 40× speedup of local kernels and the large impact of communication are clearly displayed. Next, in Fig. 7, we compare the strong and weak scalability of the heFFTe and FFTE libraries, concluding that heFFTe outperforms FFTE (by a factor >2) and has better scalability. We do not include results with the AccFFT library, since its GPU version did not verify correctness in several experiments performed on Summit; however, AccFFT reported a fairly constant speedup of ∼1.5 compared with FFTE, while having very similar scalability [11]. In Fig. 8, we numerically analyze how closely we approach the roofline peak performance described in Sect. 3.
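For concreteness, here is a worked evaluation of Eq. (3), assuming the logarithm is base 2 as in the standard 5N log2(N) flop count (an assumption on our part): for N = 1024^3 we have log2(N) = 30, so on P = 1024 nodes the model predicts

Ψ_Summit ≈ 1.953 · 1024 · 30 ≈ 6.0 · 10^4 GFlops ≈ 60 TFlops,

which is the kind of per-node-count peak against which the measured performance in Fig. 8 is compared.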
We observe that by appropriately choosing the transform size and the number of nodes, we approach the proposed peak; hence, a correlation can be established between these two parameters to ensure that the maximum resources are being used, while still leaving GPU resources available to simultaneously run other computations needed by applications.

Fig. 8. Roofline performance from Eq. (3) and heFFTe performance on a 3D FFT of size 1024^3, using 40 MPI processes per node with 1 MPI process per core (blue), and 6 MPI processes per node with 1 MPI process per V100 GPU (red). (Color figure online)

Diverse applications targeting exascale make use of FFTs within their models. In this section, we consider LAMMPS [14], part of the EXAALT ECP project. Its KSPACE package provides a variety of long-range Coulombic solvers, as well as pair styles which compute the corresponding pairwise Coulombic interactions. This package relies heavily on efficient FFT computations to compute the energy of a molecular system.

Fig. 9. LAMMPS Rhodopsin protein benchmark on a 128^3 FFT grid, using 2 nodes and 4 MPI processes per node. For FFTMPI we use 1 MPI process per core plus 16 OpenMP threads, and for heFFTe we use 1 MPI process per GPU.

In Fig. 9, we present results from a LAMMPS benchmark, comparing the performance obtained with its built-in FFTMPI library against the GPU version of the heFFTe library. As shown in Fig. 4, it is expected that even for large runs, using LAMMPS with heFFTe would provide a 2× speedup of its KSPACE routine.

In this article, we presented the methodology, implementation, and performance results of the heFFTe library, which performs FFT computations on heterogeneous systems targeting exascale. We have provided experiments showing considerable speedups compared to state-of-the-art libraries and demonstrating that linear scalability is achievable. We have greatly sped up the local kernels, getting very close to the experimental roofline peak on Summit (a large heterogeneous cluster that ranks first in the TOP500 list of supercomputers). Our results show that further optimizations would require better hardware interconnection and/or new communication-avoiding algorithmic approaches.
References
Impacts of multi-GPU MPI collective communications on large FFT computation
On the communication complexity of 3D FFTs and its implications for exascale
Communication-avoiding algorithms for linear algebra and beyond
Arrival of first Summit nodes: HACC testing on phase I system
The IBM parallel engineering and scientific subroutine library
FFTX and SpectralPack: a first look
The design and implementation of FFTW3
AccFFT: a library for distributed-memory FFT on CPU and GPU architectures
Accuracy and Stability of Numerical Algorithms
Intel: Intel Math Kernel Library
Large-scale atomic/molecular massively parallel simulator
Petascale direct numerical simulation of turbulent channel flow on up to 786K cores
FFT-based deep learning deployment in embedded systems
Scalable multi-GPU 3-D FFT for TSUBAME 2.0 supercomputer
P3DFFT: a framework for parallel computations of Fourier transforms in three dimensions
fftMPI, a library for performing 2d and 3d FFTs in parallel
FFTs for (mostly) particle codes within the DOE Exascale Computing Project
Large bandwidth-efficient FFTs on multicore and multi-socket systems
Quantitative performance assessment of proxy apps and parents
GPUDirect MPI communications and optimizations to accelerate FFTs on exascale systems
Implementation of parallel 3-D real FFT with 2-D decomposition on Intel Xeon Phi clusters
FFTE: a fast Fourier transform package
Design and implementation for FFT-ECP on distributed accelerated systems
FFT-ECP implementation optimizations and features phase
Evaluation and design of FFT for distributed accelerated systems