Breaking the Exascale Barrier for the Electronic Structure Problem in Ab-Initio Molecular Dynamics

Robert Schade, Tobias Kenter, Hossam Elgabarty, Michael Lass, Thomas D. Kühne, Christian Plessl

May 24, 2022

Abstract: The non-orthogonal local submatrix method applied to electronic-structure based molecular dynamics simulations is shown to exceed 1.1 EFLOP/s in FP16/FP32 mixed floating-point arithmetic when using 4,400 NVIDIA A100 GPUs of the Perlmutter system. This is enabled by a modification of the original method that pushes the sustained fraction of the peak performance to about 80%. Example calculations are performed for SARS-CoV-2 spike proteins with up to 83 million atoms.

I. INTRODUCTION

Electronic-structure based ab-initio molecular dynamics simulations (AIMD) [1]-[3] are an important tool in solid-state physics, chemistry and materials science. The explicit treatment of quantum-mechanical effects in the electronic structure is required in situations where the empirical model potentials used in classical molecular dynamics fail to describe the relevant physical or chemical phenomena. To derive the forces acting on the atoms, the electronic-structure problem has to be solved in every time step during the propagation of the atoms. To make this feasible for very large systems, linear-scaling methods have been developed, whose computational complexity scales only linearly with the number of atoms in the system [4]-[8]. We have proposed the non-orthogonal local submatrix method (NOLSM) [9] as a massively parallel method that solves the electronic-structure problem via an approximate evaluation of the required matrix functions. The local nature of the method avoids inter-node communication in the solution phase, and the method has been shown to scale extremely well to more than one thousand GPUs while efficiently using mixed-precision tensor cores for the linear algebra operations. All of these developments together have made it possible, for the first time, to break the exaflop barrier with a computational science application. The only other instance we are aware of where more than one exaflop of sustained performance has been achieved is the real-time simulation of a random quantum circuit [10].

This paper builds on the implementation described in [9]. Since we focus on improvements that increase the sustained fraction of the peak performance, aspects like the compensation of noise from numerical approximations with an appropriately modified Langevin-type equation to obtain accurate thermodynamic expectation values are not treated here; they have been discussed in previous work [7], [11]. Instead, Section II summarizes the tackled problem, whereas Section III puts the achievement in relation to the performance of related large-scale electronic-structure based structure relaxations or AIMD simulations. The innovations beyond those presented in [9] are described in Section IV. In Section V-B, we discuss our evaluation and define the performance measurements. Finally, Section VI discusses the achieved performance.

II. OVERVIEW OF THE PROBLEM

Molecular dynamics calculations simulate the movement of atoms in molecules, surfaces or solids by integrating Newton's equation of motion,

$$M_i \ddot{\mathbf{R}}_i = \mathbf{F}_i, \tag{1}$$

where $M_i$ is the mass of atom $i$, $\mathbf{R}_i$ its position in space and $\mathbf{F}_i$ the force acting on it. Thus, the evaluation of the forces acting on the atoms is required in every time step.
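As a concrete illustration of this propagation loop, the following minimal velocity Verlet sketch integrates Eq. 1. It is a sketch under stated assumptions: the callback `forces` and all other names are ours, with `forces` standing in for the force evaluation that the rest of this section derives from the electronic structure.

```python
import numpy as np

def velocity_verlet(R, V, M, forces, dt, n_steps):
    """Integrate Newton's equation of motion (Eq. 1) with velocity Verlet.

    R: (N, 3) positions, V: (N, 3) velocities, M: (N,) masses.
    forces(R) -> (N, 3); in AIMD this callback triggers an
    electronic-structure calculation for the current positions.
    """
    F = forces(R)
    for _ in range(n_steps):
        V += 0.5 * dt * F / M[:, None]   # half-step velocity update
        R += dt * V                      # full-step position update
        F = forces(R)                    # the expensive step in AIMD
        V += 0.5 * dt * F / M[:, None]   # second half-step velocity update
    return R, V
```

In classical molecular dynamics, `forces` would evaluate an empirical potential; the following paragraphs describe how AIMD obtains the forces from the electronic structure instead.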
In electronic-structure based molecular dynamics, these forces are evaluated on-the-fly from the total energy $E$ of the system via

$$\mathbf{F}_i = -\frac{\partial E}{\partial \mathbf{R}_i}. \tag{2}$$

In turn, the total energy is not obtained from an empirical model as in classical molecular dynamics, but directly from the quantum-mechanical problem of electrons in the electrostatic field of the nuclei. The total energy of a system can be written as

$$E = E_{\mathrm{elec}} + E_{\mathrm{dc}} + E_{\mathrm{ion}}, \tag{3}$$

where $E_{\mathrm{elec}}$ is the electronic energy, $E_{\mathrm{dc}}$ the double-counting terms and $E_{\mathrm{ion}}$ the nuclear Coulomb repulsion energy of the atoms. The bulk of the computational effort is required to obtain the electronic energy. While this can be done efficiently for small and medium-sized systems by solving a high-dimensional eigenvalue problem [12], [13], very large systems require methods that scale at most linearly with the size of the system in terms of the number of atoms. Such linear-scaling electronic-structure methods have been developed [4]-[7], for example based on the one-particle reduced density matrix $D$ [14], which allows one to obtain the electronic energy via

$$E_{\mathrm{elec}} = \mathrm{Tr}(D H_0). \tag{4}$$

At zero electronic temperature, the density matrix can be written as a matrix function,

$$D = \frac{1}{2}\left(I - \mathrm{sign}\left(S^{-1} H_0 - \mu I\right)\right) S^{-1}, \tag{5}$$

where $H_0$ is the one-particle Hamiltonian matrix of the system and $S$ is the overlap matrix between the basis functions that are used to describe the wave functions of the electrons. Furthermore, $\mu$ denotes the chemical potential. The evaluation of the matrix-sign function in Eq. 5 can be performed iteratively, for example with the Newton-Schulz iteration [15] or other iteration schemes [16] that only require scalings, additions and multiplications of matrices.

The submatrix method [17], [18] instead views the density matrix as a matrix function to be evaluated. Therein, the evaluation of a matrix function $f(A)$ is performed in three steps. The steps are carried out for every column $i$ of the matrix $A$ independently and are schematically shown in Fig. 1:

1) A submatrix is generated for column $i$ of the input matrix $A$ by removing all rows and corresponding columns for which the matrix $A$ has vanishing or negligible elements in column $i$. The result is a smaller and much denser submatrix $T_i(A)$.
2) The matrix function is evaluated for the dense submatrix, yielding $f(T_i(A))$.
3) The matrix elements of $f(T_i(A))$ that correspond to column $i$ are written to the matrix $B$, i.e., the submatrix construction of step 1) is applied in reverse.

The resulting matrix $B$ is an approximation of the matrix function $f(A)$ and by construction has the same sparsity pattern as $A$. The non-orthogonal submatrix method, where $D = D(H_0, S)$ is a matrix function of two matrices, combines the sparsity patterns of $H_0$ and $S$ before the submatrix construction. Thus, it builds two submatrices, $T_i(H_0)$ and $T_i(S)$, and in step 2) the matrix function is evaluated as $D(T_i(H_0), T_i(S))$. Please note that the efficiency and accuracy of the submatrix method can be improved by generating one submatrix for multiple columns instead of just a single column, as described in detail in [9] together with the GPU implementation. Moreover, the errors introduced by the submatrix approximation and low-precision numerics can be compensated by a modified Langevin-type equation that replaces Newton's equation of motion, so that exact ensemble-averaged expectation values can be obtained [7], [11].
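As a minimal sketch of this three-step procedure, the following code applies the single-column submatrix method to a densely stored symmetric matrix with a sparse pattern, using the Newton-Schulz iteration named above for the matrix-sign function. All function names, the thresholding and the rescaling strategy are our own illustrative assumptions, not the CP2K implementation.

```python
import numpy as np

def newton_schulz_sign(X, n_iter=50):
    """Matrix sign of a symmetric matrix via the Newton-Schulz
    iteration X_{k+1} = X_k (3I - X_k^2) / 2, which needs only
    matrix scalings, additions and multiplications. Convergence
    requires the spectrum of X_0 inside [-1, 1], so we rescale by
    the Frobenius norm first (a crude but safe bound)."""
    X = X / np.linalg.norm(X, ord='fro')
    I = np.eye(X.shape[0])
    for _ in range(n_iter):
        X = 0.5 * X @ (3.0 * I - X @ X)
    return X

def submatrix_apply(A, f, threshold=1e-8):
    """Approximate f(A) column by column with the submatrix method."""
    N = A.shape[0]
    B = np.zeros_like(A)
    for i in range(N):
        # step 1: keep only the rows/columns with non-negligible
        # entries in column i -> small, dense submatrix T_i(A)
        idx = np.nonzero(np.abs(A[:, i]) > threshold)[0]
        if i not in idx:                 # always keep column i itself
            idx = np.sort(np.append(idx, i))
        T = A[np.ix_(idx, idx)]
        # step 2: evaluate the matrix function on the dense submatrix
        fT = f(T)
        # step 3: write back only the entries that belong to column i
        pos = int(np.where(idx == i)[0][0])
        B[idx, i] = fT[:, pos]
    return B
```

For the density matrix of Eq. 5 in an orthogonal basis (S = I), one could for instance pass `f = lambda T: 0.5 * (np.eye(len(T)) - newton_schulz_sign(T - mu * np.eye(len(T))))` for a chosen chemical potential `mu` (an illustrative assumption); the non-orthogonal method of [9] evaluates D(T_i(H_0), T_i(S)) on pairs of submatrices instead, with the dense work running in FP16/FP32 on GPU tensor cores.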
III. CURRENT STATE OF THE ART

Table I lists previous attempts to extend the boundaries of electronic-structure based structure relaxation or molecular dynamics simulations.

Table I: Performance of previously conducted electronic structure-based structure relaxation or AIMD simulations. Therein, the employed electronic-structure method is abbreviated by DFT, NSC-DFT, LS-DFT and SS-DFT, which stand for density functional theory and its non-self-consistent, linear-scaling and subsystem variants, respectively. The corresponding basis sets used to represent the single-particle orbitals are denoted by PW for conventional plane waves, RMG-PW for real-space multigrid plane waves, GPW for Gaussian and plane waves, GTO for Gaussian-type orbitals, FD for finite differences, RS-FD for real-space finite differences, FEM for the finite-element method, NGWF for non-orthogonal generalized Wannier functions and PAO for polarized atomic orbitals. If a calculation was conducted involving trivial k-point parallelism, the total number of atoms is given as the product of the number of independent instances times the number of atoms in any one of them. The sustained efficiency is either given with respect to the corresponding …

IV. INNOVATIONS REALIZED

This work uses the previously reported algorithmic innovations, namely the use of approximate-computing techniques, the non-orthogonal local submatrix method and its communication-minimizing realization on GPUs, as well as the heuristic combination of columns in the submatrix creation described in [9]. A new development beyond the implementation innovations already shown in [9], such as the efficient iterative evaluation of matrix functions for dense matrices on GPU tensor cores, is introduced in Section IV-B: the matrix-size dependence of the GPU performance is now also taken into account when combining submatrices and yields an additional speedup.

B. Implementation Innovations

1) Submatrix Combination Heuristics: The combination of columns for the generation of submatrices introduced in [9] used a cubic metric, i.e., the combination of two columns yields an improvement if and only if

$$(n_i + n_j - n_{i\wedge j})^3 < n_i^3 + n_j^3, \tag{9}$$

where $n_i$ is the dimension of the submatrix for column $i$, $n_j$ for column $j$ and $n_{i\wedge j}$ the number of overlapping columns. This cubic metric represents the number of floating-point operations during the evaluation of the matrix function for the submatrices, which is based on dense matrix multiplications.

We propose to modify the cubic metric by including the performance characteristics of the GPUs used. For this purpose, the matrix-multiplication performance $p(n)$ of the GPUs is measured for different matrix sizes $n$ and interpolated. The combination criterion then compares the predicted runtimes $n^3/p(n)$ for the matrix functions of the submatrices instead of the numbers of floating-point operations, i.e.,

$$\frac{(n_i + n_j - n_{i\wedge j})^3}{p(n_i + n_j - n_{i\wedge j})} < \frac{n_i^3}{p(n_i)} + \frac{n_j^3}{p(n_j)}. \tag{10}$$

This criterion effectively increases the dimension of the submatrices and the achievable fraction of the peak performance. Results are shown for the SARS-CoV-2 spike protein in Table II and histograms of the submatrix dimensions in Fig. 2.

Fig. 2: Histogram of the submatrix sizes for the SARS-CoV-2 spike protein in aqueous solution with approx. 1.7 million atoms: without combining submatrices (blue), combination using the criterion of Eq. 9 (orange) and using the criterion of Eq. 10 (green).

The criterion of Eq. 10 approximately doubles the average submatrix dimension and slightly increases the total number of floating-point operations, while drastically increasing the estimated floating-point throughput and leading to an overall estimated speedup compared to the previous criterion of Eq. 9.
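A small sketch of this performance-aware criterion is given below. The measured throughput table and the example numbers are placeholders of our own choosing, not Perlmutter measurements; a real implementation would fill them from a GEMM microbenchmark.

```python
import numpy as np

# hypothetical measured GEMM throughput p(n) in TFLOP/s for a few
# matrix dimensions; placeholder values, not measured data
sizes  = np.array([256, 512, 1024, 2048, 4096, 8192])
tflops = np.array([ 20,  60,  140,  220,  280,  300])

def p(n):
    """Interpolated matrix-multiplication performance p(n)."""
    return np.interp(n, sizes, tflops)

def runtime(n):
    """Predicted time for one dense matrix-function evaluation,
    dominated by O(n^3) GEMM work: t(n) = n^3 / p(n)."""
    return n**3 / p(n)

def should_combine(n_i, n_j, n_overlap):
    """Eq. 10: merge two submatrices if the predicted runtime of
    the combined submatrix beats the sum of the separate runtimes."""
    n_merged = n_i + n_j - n_overlap
    return runtime(n_merged) < runtime(n_i) + runtime(n_j)

# two 600x600 submatrices sharing 400 columns -> one 800x800 submatrix
print(should_combine(600, 600, 400))   # True with the table above
```

With these placeholder numbers, the pure FLOP criterion of Eq. 9 would reject the merge (800^3 > 2 × 600^3), while the runtime criterion of Eq. 10 accepts it because the larger GEMM runs at a much higher rate; this is exactly the trade-off that increases the submatrix dimensions and the sustained throughput.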
V. HOW PERFORMANCE WAS MEASURED

A. Computational Details

1) SARS-CoV-2 Spike Protein in Aqueous Solution: As our benchmark system, we have used the full-length SARS-CoV-2 spike protein in aqueous solution, in a simulation cell of 204.7 × 199.5 × 408.5 Å including 1,693,134 atoms. A discussion of the further structure can be found in [9] and also applies to the spike-protein system. The single cell shown in Fig. 3 can easily be repeated in a two-dimensional grid of spike proteins as a scalable benchmark system.

2) Simulation Details: The electronic structure is simulated with the GFN-xTB approach in conjunction with a London dispersion correction based on the rational Becke-Johnson damping function [34]. Further details can be found in [9]. To conserve benchmark resources, we have restricted each simulation run to one SCF iteration in the spirit of the second-generation Car-Parrinello AIMD method [12], [35], but included the iterations for finding an appropriate chemical potential that produces a charge-neutral system.

B. Performance Measurements

The main measurements presented here are:

1) Wall clock time of the NOLSM method $T_{\mathrm{NOLSM}}$: The wall clock time $T_{\mathrm{NOLSM},i}$ of the NOLSM method on node $i$ is measured for each iteration of the chemical potential. Each iteration includes all transfers between host and GPU. The overall wall clock time $T_{\mathrm{NOLSM}}$ is defined as the maximum over all node wall clock times.
2) FLOPs in the NOLSM method $\mathrm{FLOPs}_{\mathrm{NOLSM},i}$: The per-node floating-point operations $\mathrm{FLOPs}_{\mathrm{NOLSM},i}$ in the FP16/FP32 mixed-precision matrix iterations of the NOLSM method are estimated as $2n^3$ for a GEMM operation $C = \alpha A \cdot B + \beta C$ with $A, B, C \in \mathbb{R}^{n \times n}$, accumulated over each iteration of the chemical potential. The construction of the matrix elements of the submatrices and other operations scaling like $O(n^2)$ are neglected in the count.
3) Node performance of the NOLSM method $P_{\mathrm{NOLSM},i}$: The node performances of the NOLSM method are defined as $P_{\mathrm{NOLSM},i} = \mathrm{FLOPs}_{\mathrm{NOLSM},i} / T_{\mathrm{NOLSM},i}$ for each node $i$.
4) Performance of the NOLSM method $P_{\mathrm{NOLSM}}$: The performance of the NOLSM method is defined as the sum of the node performances.

The benchmark runs presented here have been performed on the Perlmutter system at the National Energy Research Scientific Computing Center (NERSC), whose nodes are each equipped with four NVIDIA A100 GPUs. The software environment used in this work consisted of GCC 11.2.0, Cray-MPICH 8.1.10, CUDA NVCC 11.5.119, and cuBLAS 11.5. One MPI rank per node and 64 CPU threads per rank were used, as well as four CUDA streams per GPU. Each stream was controlled by a single CPU thread.

VI. PERFORMANCE RESULTS

We have performed calculations for three different grid sizes of spike proteins: 6×5 (51 million atoms), 6×6 (61 million atoms) and 7×7 (83 million atoms). All three example calculations have been performed with 1,100 nodes of the Perlmutter system, i.e., 4,400 NVIDIA A100 GPUs. The wall clock time of the NOLSM method $T_{\mathrm{NOLSM}}$ is shown in Figure 4. The distribution of the performances of individual nodes is shown in Figure 5 for 7×7 spike proteins (83 million atoms) in relation to the peak performance of the GPUs. The performances of the nodes with four NVIDIA A100 GPUs mainly fall in the range between 1 PFLOP/s and 1.07 PFLOP/s, with an average of 1.03 PFLOP/s. This represents about 80% of the peak performance of 1.248 PFLOP/s = 4 × 0.312 PFLOP/s per node. Finally, Figure 6 shows the floating-point performance $P_{\mathrm{NOLSM}}$ in mixed FP16/FP32 of the NOLSM method. The floating-point throughput of 1.106 to 1.127 EFLOP/s with 4,400 NVIDIA A100 GPUs achieves about 80% of the theoretical peak performance of the tensor cores.
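As a closing illustration, the metrics defined in Section V-B combine as in the following sketch; the GEMM dimensions, iteration counts and timing are placeholders of our own choosing, not measured data.

```python
# FLOPs_NOLSM,i counts 2*n^3 per n x n GEMM, summed over all matrix
# iterations on node i; O(n^2) work is neglected as described above
def node_flops(gemm_dims, mults_per_dim):
    return sum(2 * n**3 * k for n, k in zip(gemm_dims, mults_per_dim))

dims   = [1500, 2300, 3100]   # submatrix dimensions n (placeholders)
mults  = [30, 30, 30]         # matrix multiplications per dimension
T_node = 0.85                 # wall clock time in seconds (placeholder)

flops_i = node_flops(dims, mults)
P_i = flops_i / T_node        # node performance P_NOLSM,i

# P_NOLSM is the sum of P_NOLSM,i over all nodes, while T_NOLSM is the
# maximum of the per-node wall clock times; a production node processes
# far more submatrices than this toy example
print(f"node performance: {P_i / 1e12:.2f} TFLOP/s")
```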
VII. IMPLICATIONS

To the best of our knowledge, the achieved ∼1.1 EFLOP/s in FP16/FP32 mixed floating-point arithmetic positions electronic-structure based molecular dynamics calculations with the non-orthogonal local submatrix method in CP2K [37] as the first algorithm in the computational natural sciences that has broken the exaflop barrier within a scientific application. The massively parallel nature of the method allows for the efficient use of many thousands of GPUs. The method can be applied not only to electronic-structure based molecular dynamics, but also in other situations where a matrix function needs to be evaluated for a large sparse matrix, or to problems that can be transformed into such an operation.

REFERENCES

Unified Approach for Molecular Dynamics and Density-Functional Theory
Iterative minimization techniques for ab initio total-energy calculations: molecular dynamics and conjugate gradients
Second generation Car-Parrinello molecular dynamics
Linear scaling electronic structure methods
Direct calculation of electron density in density-functional theory
Large scale electronic structure calculations
Self-consistent field theory based molecular dynamics with linear system-size scaling
Graph-based linear scaling electronic structure theory
Towards electronic structure-based ab-initio molecular dynamics simulations with hundreds of millions of atoms
Closing the "quantum supremacy" gap: Achieving real-time simulation of a random quantum circuit using a new Sunway supercomputer
Accurate Sampling with Noisy Forces from Approximate Computing
Efficient and Accurate Car-Parrinello-like Approach to Born-Oppenheimer Molecular Dynamics
Disordered crystals from first principles II: Transport coefficients
Some recent advances in density matrix theory
Iterative Berechung der reziproken Matrix
A General Algorithm to Calculate the Inverse Principal p-th Root of Symmetric Positive Definite Matrices
A Massively Parallel Algorithm for the Approximate Calculation of Inverse p-th Roots of Large Sparse Matrices
A Submatrix-Based Method for Approximate Matrix Function Evaluation in the Quantum Chemistry Code CP2K
Dual-level parallelism for ab initio molecular dynamics: Reaching teraflop performance with the CPMD code
Large-scale electronic structure calculations of high-Z metals on the BlueGene/L platform (ACM/IEEE Conference on Supercomputing)
The linearly scaling 3D fragment method for large scale electronic structure calculations
Calculations for millions of atoms with density functional theory: linear scaling shows its potential
Linear scaling self-consistent field calculations with millions of atoms in the condensed phase
Hybrid MPI-OpenMP parallelism in the ONETEP linear-scaling electronic structure code: Application to the delamination of cellulose nanofibrils
Large-scale DFT simulations with a linear-scaling DFT code CONQUEST on the K computer
Performance evaluation of ultra-large-scale first-principles electronic structure calculation code on the K computer
Combining linear-scaling DFT with subsystem DFT in Born-Oppenheimer and Ehrenfest molecular dynamics simulations: from molecules to a virus in solution
Metascalable quantum molecular dynamics simulations of hydrogen-on-demand
OpenAtom: Scalable ab-initio molecular dynamics with diverse capabilities
Modeling dilute solutions using first-principles molecular dynamics: computing more than a million atoms with over a million cores
Fast, scalable and accurate finite-element based ab initio calculations using mixed precision computing: 46 PFLOPS simulation of a metallic dislocation system
Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation
AI-driven multiscale simulations illuminate mechanisms of SARS-CoV-2 spike dynamics
Effect of the damping function in dispersion corrected density functional theory
Disordered crystals from first principles I: Quantifying the configuration space
CP2K: An electronic structure and molecular dynamics software package - Quickstep: Efficient and accurate electronic structure calculations

ACKNOWLEDGMENTS

This research used resources of the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility located at Lawrence Berkeley National Laboratory, operated under Contract No. DE-AC02-05CH11231, using NERSC award DDR-ERCAP0022240. Additionally, we would like to thank the Paderborn Center for Parallel Computing (PC2) for supporting this project with computing time. This work is partially funded by Paderborn University's research award for "GreenIT", as well as by the Federal Ministry of Education and Research (BMBF) and the state of North Rhine-Westphalia as part of the NHR program. T.D.K. received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (Grant Agreement No. 716142).