title: GPU-Accelerated Drug Discovery with Docking on the Summit Supercomputer: Porting, Optimization, and Application to COVID-19 Research
authors: LeGrand, Scott; Scheinberg, Aaron; Tillack, Andreas F.; Thavappiragasam, Mathialakan; Vermaas, Josh V.; Agarwal, Rupesh; Larkin, Jeff; Poole, Duncan; Santos-Martins, Diogo; Solis-Vasquez, Leonardo; Koch, Andreas; Forli, Stefano; Hernandez, Oscar; Smith, Jeremy C.; Sedova, Ada
date: 2020-07-06

Protein-ligand docking is an in silico tool used to screen potential drug compounds for their ability to bind to a given protein receptor within a drug-discovery campaign. Experimental drug screening is expensive and time consuming, and it is desirable to carry out large-scale docking calculations in a high-throughput manner to narrow the experimental search space. Few of the existing computational docking tools were designed with high performance computing in mind. Therefore, optimizations that maximize use of the high-performance computational resources available at leadership-class computing facilities enable these facilities to be leveraged for drug discovery. Here we present the porting, optimization, and validation of the AutoDock-GPU program for the Summit supercomputer, and its application to initial compound-screening efforts targeting proteins of the SARS-CoV-2 virus responsible for the current COVID-19 pandemic.

The United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

The binding of a drug to a protein target in vivo can elicit a molecular response, such as the inhibition of a cellular function, and forms the basis of targeted drug discovery. Computational protein-ligand docking can be utilized for the rapid structure-based screening of small-molecule drug candidates for target binding [5, 21]. Three-dimensional structural models of both a protein receptor and a set of small-molecule compounds, or ligands, are employed to computationally predict the ability of a ligand to bind to the receptor, using an optimization algorithm within a scoring function, which can be either a physics-based empirical energy potential or a statistical potential [10, 16, 31]. Advances in high-throughput experimental screening, both cell-based [18, 26] and molecular [7, 12], have allowed tens of thousands of chemical compounds to be tested simultaneously. However, these assays are expensive, and the ability to computationally filter chemical compounds for their propensity to bind to a target protein can significantly reduce the overall cost and time to solution. The data to be evaluated are plentiful, as the relevant chemical search space consists of billions of compounds. For instance, the Enamine REAL database (https://enamine.net) contains over a billion small molecules, and the ZINC 15 database contains 230 million ready-to-dock compounds [29]. Thus the number of compounds that can potentially be screened with in silico docking to complement experimental efforts is now on the order of billions.
Increases in processor speeds and in the number of cores on large nodes of computing clusters and cloud resources have made meeting this docking challenge theoretically attainable. Unfortunately, many docking programs were not designed with massive high-throughput screens in mind. Most open-source docking programs commonly applied in academic research are CPU-based, single-node, and use file-based input and output [16, 31]. Such codes are often designed to perform a single ligand-docking calculation per run instance of the executable.

A variety of programs exist for ligand docking, with some using empirical but physics-based potential energy descriptions and others using fully empirical statistical potentials trained on a set of experimental structures of ligands bound to proteins [16, 17, 31]. The widely employed AutoDock program saw its inception in 1990 [8] and has undergone several major changes, with the latest version being AutoDock4. This program incorporates a physics-based energy function that includes entropic and solvation terms in addition to van der Waals, hydrogen-bond, and electrostatic interactions [15, 16]. The energy minimization procedure consists of a Lamarckian genetic algorithm (LGA), whereby a local search accompanies the standard genetic optimization, and the improved ligand pose obtained from this local optimization is the input for the crossover portion of the algorithm. The original local search method is based on the Solis-Wets random optimization algorithm [27].

General-purpose graphics processing units (GPUs) are used to accelerate dense numerical calculations in a variety of settings, from high-performance computing (HPC) facilities to data centers and cloud resources. Several application programming interfaces (APIs) are commonly used to program NVIDIA GPUs, including CUDA and OpenCL. A few docking programs have made use of GPUs to accelerate their calculations in the past five years [4, 14, 28]. Recently, Scripps Research, in collaboration with TU Darmstadt, developed an accelerated version of the AutoDock4 program using OpenCL, which provided up to 50× speedup over the single-threaded CPU version [24, 25, 28]. OpenCL was chosen as the programming model for the accelerator because it provides code portability across various types of architectures from multiple vendors, for instance GPUs, CPUs, and FPGAs. This program has been named AutoDock-GPU; while it uses the same physics-based empirical potential energy function, it includes new algorithmic additions and changes that can improve both performance and the quality of the results [24, 25, 28]. In particular, within AutoDock-GPU there are now two possible algorithms for the local search: (1) the Solis-Wets (SW) method of random optimization, and (2) the gradient-based ADADELTA method [33]. The ADADELTA method was added to improve the local refinement of results, especially for ligands with larger numbers of torsions.

AutoDock-GPU uses run parameters similar to those of AutoDock4, with some minor changes, including the renaming of some parameters, changes to some defaults, and some different hyperparameters. The run parameters important to the current study include the nrun parameter, which designates the number of independent, complete LGA calculations performed in one instance of the executable. Each run produces a separate final pose, which is written to the output files; a drug screen would take the best-scoring pose from these independent runs.
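As background for these two local-search options, the Solis-Wets method perturbs the current ligand genotype (translation, orientation, and torsion genes) by a random deviation, keeps only improving moves (also trying the reflected move), and adapts its step size to the recent success rate. The sketch below is a minimal, generic illustration of that scheme and not the AutoDock-GPU implementation; the scoring function, genotype layout, and adaptation constants are placeholder assumptions.

```cpp
#include <cstddef>
#include <functional>
#include <random>
#include <vector>

// Minimal sketch of a Solis-Wets style random local search.
// 'score' is a placeholder for the docking scoring function; in AutoDock-GPU
// this role is played by the grid-based AutoDock4 energy evaluation.
std::vector<double> solisWets(std::vector<double> genotype,
                              const std::function<double(const std::vector<double>&)>& score,
                              int maxIters = 300, double rho = 1.0) {
    std::mt19937 rng(42);
    std::normal_distribution<double> gauss(0.0, 1.0);
    double best = score(genotype);
    int successes = 0, failures = 0;

    for (int it = 0; it < maxIters && rho > 1e-3; ++it) {
        // Propose a random deviation scaled by the current step size rho.
        std::vector<double> delta(genotype.size());
        for (auto& d : delta) d = rho * gauss(rng);

        auto tryMove = [&](double sign) {
            std::vector<double> trial = genotype;
            for (std::size_t i = 0; i < trial.size(); ++i) trial[i] += sign * delta[i];
            double e = score(trial);
            if (e < best) { best = e; genotype = trial; return true; }
            return false;
        };

        // Accept the move, or its reflection, only if the score improves.
        if (tryMove(+1.0) || tryMove(-1.0)) { ++successes; failures = 0; }
        else                                { ++failures;  successes = 0; }

        // Adapt the step size: expand after repeated successes, contract after failures.
        if (successes >= 5)     { rho *= 2.0; successes = 0; }
        else if (failures >= 3) { rho *= 0.5; failures = 0; }
    }
    return genotype;
}
```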
In addition, a recently developed feature is the autostop parameter, which allows a docking run to finish early (with respect to the value set by the nev parameter, which limits the total number of evaluations of the LGA per run) if the energy has not changed by a sufficient amount, as measured by the standard deviation over the past 10 generations.

The Oak Ridge Leadership Computing Facility (OLCF) at Oak Ridge National Laboratory (ORNL) is an open science computing facility that supports HPC research. The OLCF houses the Summit supercomputer, an IBM AC922 system consisting of 4,608 large nodes, each with six NVIDIA Volta V100 GPUs and two POWER9 CPU sockets providing 42 usable cores per node. Currently there is no NVIDIA driver support for OpenCL on GPUs attached to the POWER9 architecture. Summit is harnessed by hundreds of research groups through its multiple open-science allocation programs, including the INCITE, ALCC, and Director's Discretionary awards and, most recently, the COVID-19 HPC Consortium created to support computational research aimed at combating the COVID-19 pandemic (https://www.xsede.org/covid19-hpc-consortium). Our groups recognized that the ability to leverage Summit's GPUs for massive high-throughput docking screens for antivirals against the SARS-CoV-2 virus was an important resource that was not accessible to AutoDock-GPU because of this lack of OpenCL support. We therefore worked to create a new version of the program using the CUDA API and, in addition, to design programmatic changes that facilitate optimized docking screens over large ligand datasets. Here we describe this port, optimizations including latency hiding facilitated by overlapping set-up with calculation, and the inclusion of a convenient input option for processing hundreds to thousands of ligand files more efficiently.

Several previous efforts have enabled the deployment of docking codes such as AutoDock4 and AutoDock Vina on large compute resources, from clusters to supercomputers. VinaMPI used an MPI wrapper to enable the simultaneous launching of thousands of parallel AutoDock Vina executables and was run on the OLCF Jaguar and Titan supercomputers [3]. A similar effort at Lawrence Livermore National Laboratory resulted in a program called VinaLC [34]. Recently, a GNU Bash-based pipeline was created for docking millions of ligands on clusters [9]. GPU acceleration has also been included in docking programs: GeauxDock introduced a novel GPU-accelerated energy function consisting of physics-based and statistical terms [4], but has seen limited application or updates. To our knowledge, the version of AutoDock-GPU described in this paper is the only docking program using the AutoDock4 potential that is optimized for the most recent GPUs and also adds the capability to efficiently process thousands of ligand input files sequentially with a single executable. The ability to rapidly dock millions to billions of ligands to protein receptors to help find inhibitors for viral proteins could be invaluable to experimental drug discovery efforts. Correspondingly, harnessing leadership computing resources for drug discovery efforts against the SARS-CoV-2 virus could provide an invaluable resource with which to battle the current COVID-19 pandemic [22].
Applying this approach is part of the collaborative international research effort recently formed around discovering therapeutics to mitigate the morbidity and mortality caused by the disease. The work presented here takes the first step by facilitating the deployment of a GPU-enabled, open-source docking program on the Summit supercomputer and optimizing it for high-throughput docking of large numbers of ligands in a single program instance. With this capability in place, future work involving the deployment of efficient platforms for running AutoDock-GPU at scale on all of Summit's 27,648 GPUs can be pursued.

A number of additions and changes were made to the AutoDock-GPU program to create a version that makes effective use of Summit's nodes and supports massive, high-throughput docking screens. In particular, we focused on the case where a very large number of ligands (millions to billions) is docked to a single receptor. Fig. 1 illustrates the design of this version of the program, showing the addition of OpenMP threading to create a pipeline that hides file I/O latency by staging ligand files while the GPU is busy docking. In addition, the receptor data is now explicitly reused, further reducing I/O when docking thousands of compounds to the same protein target. These changes are detailed in the following subsections.

The original AutoDock-GPU program developed by Scripps and TU Darmstadt was written in OpenCL to improve portability across GPU vendors, and it also enables threaded execution on CPUs. However, OpenCL is not supported on all platforms, including the combination of NVIDIA GPUs and POWER9 CPUs present on Summit. Therefore, porting AutoDock-GPU to CUDA was necessary to leverage Summit's V100 GPUs and run the program at scale. This provides not only large speedups over CPU-based code but also a reduction in power consumption for the same compute task on Summit [6]. The port is not effortless, due to philosophical and design differences between OpenCL and CUDA. OpenCL was created in 2009 as an open and portable alternative to CUDA [30] and has maintained its commitment to remain free of hardware-specific programming elements. While the OpenCL API resembles the CUDA API in many ways, it is a vendor-neutral solution with a focus on portability across CPUs, GPUs, FPGAs, and embedded devices, and thus cannot expose architecture-specific hardware features. CUDA, on the other hand, has been developed in tandem with NVIDIA GPU hardware and exposes new architectural features with each release. While OpenCL has the advantage of running on any device that supports it, porting OpenCL code to CUDA provides significant opportunities to exploit vendor-specific hardware advances that would otherwise go unused on NVIDIA GPUs.

In order to create an initial port that could reproduce the OpenCL results, we transcribed the AutoDock-GPU OpenCL version (forked April 3rd from the "develop" branch) by hand to CUDA. This produced an initial executable that performed at roughly half the speed of the OpenCL version. Next, we applied basic optimization techniques to the code, such as replacing calls to pow with intrinsics and dynamically allocating shared memory to allow more threads per multiprocessor. These first optimizations enabled the CUDA version to run 2.8× faster than the OpenCL version on an NVIDIA RTX 2080 Ti GPU for the ADADELTA algorithm.
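The snippets below illustrate, in generic CUDA, the kinds of changes just described; they are sketches under assumed names and data layouts, not the actual AutoDock-GPU kernels. The first replaces a pow call with explicit multiplications for a small integer exponent, and the second uses dynamically sized shared memory so the per-block allocation can be chosen at launch time.

```cpp
#include <cuda_runtime.h>

// Example 1: avoid pow() for small integer exponents.
// powf(inv_r, 12.0f) is far more expensive than explicit multiplication.
__device__ float lj_repulsion(float inv_r) {
    float inv_r2 = inv_r * inv_r;
    float inv_r6 = inv_r2 * inv_r2 * inv_r2;   // replaces powf(inv_r, 6.0f)
    return inv_r6 * inv_r6;                    // 1/r^12 term
}

// Example 2: dynamically allocated shared memory. The size is chosen per
// launch instead of being fixed at compile time, which lets the runtime fit
// more thread blocks per multiprocessor when the ligand is small.
__global__ void scoreKernel(const float* genotypes, float* energies, int genesPerPose) {
    // Byte count supplied as the third <<<...>>> launch argument.
    extern __shared__ float workspace[];
    float* pose = &workspace[threadIdx.x * genesPerPose];   // hypothetical per-thread slice
    int poseId = blockIdx.x * blockDim.x + threadIdx.x;
    for (int g = 0; g < genesPerPose; ++g)
        pose[g] = genotypes[poseId * genesPerPose + g];
    // ... evaluate the scoring function on 'pose' (omitted in this sketch) ...
    energies[poseId] = pose[0];                              // placeholder result
}

// Launch example: the third configuration argument sets the dynamic shared-memory bytes.
// scoreKernel<<<numBlocks, threadsPerBlock,
//               threadsPerBlock * genesPerPose * sizeof(float)>>>(g, e, genesPerPose);
```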
Back-porting these optimizations to the OpenCL code showed little to no benefit there, for the following reasons: OpenCL already has intrinsics for pow(float, int), and the shared-memory size is known to OpenCL at compile time.

Although the OpenCL and CUDA implementations of AutoDock-GPU were able to accelerate docking calculations significantly, AutoDock-GPU was designed to take a single receptor and a single ligand as command-line inputs. If one wished to dock multiple ligands with a single receptor (or vice versa), one would have to run the executable separately for each "job." For a large ligand set, this is both tedious for the user and inefficient in terms of resources. Here we describe several steps taken to maximize performance in the context of massive docking calculations. To improve the code for this use case, we enabled the user to provide a list of protein and ligand files, accessible through a -filelist flag. The program then loops over the provided list and performs the docking calculations for each entry. To further optimize GPU-based high-throughput docking of thousands of ligands to a single receptor, we enabled the program to reuse the 16 receptor maps, which can total hundreds of megabytes and contain the interaction grid information for the ligand atom types, rather than re-reading the map files for every new ligand.

OpenMP threading for overlapping set-up. After enabling multiple jobs to run consecutively within the same executable, we could begin performance optimization of this new pipeline. Depending on the parameters specified, a significant portion of the run time is spent in the setup and post-processing phases. These phases are performed on the host and include I/O, allocations, and some initial and final calculations less suitable for the GPU. If jobs are executed serially, the GPU is idle during these phases and thus underutilized. To reduce this latency, we introduced OpenMP directives so that CPU-based threads can work in parallel to load ligand input files, transfer them onto the GPU, launch the GPU-based docking kernels, and process and write the output of the calculation. This is illustrated in Fig. 1. To ensure that jobs do not conflict, all CPU-GPU communication (besides the initial receptor map setup described above) and the docking algorithm itself (consisting of multiple CUDA kernels) are placed in a function called from an omp critical section. As a result, the OpenMP threads perform the setup and post-processing phases of each job in parallel, while queuing up to use the GPU one at a time for the docking simulation itself. Combined with the reduction in setup time from the optimizations detailed below, this threading approach is sufficient to hide almost all setup and post-processing time and to ensure the GPU is rarely idle.

We also improved the handling of GPU memory and host-device communication. Instead of creating and destroying the CUDA context anew for each ligand input, the context is now created once, and GPU memory is preallocated and reused for each ligand. Similarly, rather than transferring the receptor grid maps onto the GPU for every ligand, all of the receptor's grids are sent once during initialization and left on the device. Because each ligand requires a different subset of these grids, a mapping was required to point to the grids needed for a specific ligand docking. Historically, creation of CUDA contexts, GPU memory allocation, and transfer of data to the GPU have been bottlenecks in GPU programming.
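The sketch below shows the overall structure of this pipeline: OpenMP threads prepare ligand inputs and post-process results in parallel, while an omp critical section serializes access to the GPU. The type and function names (Ligand, load_ligand, dock_on_gpu, write_results, runFileList) are illustrative stand-ins for the corresponding AutoDock-GPU routines, not its actual API.

```cpp
#include <omp.h>
#include <string>
#include <vector>

// Placeholder types and routines standing in for the corresponding
// AutoDock-GPU code paths; these are illustrative stubs, not the real API.
struct Ligand { std::string path; /* parsed ligand data would live here */ };
struct Result { double bestScore; };

static Ligand load_ligand(const std::string& file) { return Ligand{file}; }  // host-side parsing/setup
static Result dock_on_gpu(const Ligand&) { Result r; r.bestScore = -7.0; return r; }  // host<->device copies + CUDA kernels
static void write_results(const Ligand&, const Result&) {}                   // post-processing and file output

void runFileList(const std::vector<std::string>& ligandFiles) {
    // Receptor grids are read and copied to GPU global memory once, before
    // this loop, and reused for every ligand (omitted in this sketch).
    #pragma omp parallel for schedule(dynamic)
    for (long i = 0; i < static_cast<long>(ligandFiles.size()); ++i) {
        // Setup and post-processing run concurrently across OpenMP threads,
        // overlapping file I/O with docking work on the GPU.
        Ligand lig = load_ligand(ligandFiles[i]);

        Result res;
        #pragma omp critical(gpu)
        {
            // Only one thread at a time uses the GPU: per-ligand host-device
            // transfers plus the docking kernels themselves.
            res = dock_on_gpu(lig);
        }

        write_results(lig, res);  // back on the host, in parallel with other jobs
    }
}
```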
On Summit, with the improved NVLink2 interconnect, data transfer onto the GPU is less expensive, but still considerable. The V100 GPUs on Summit contain 16 GB of global memory, which is enough to store all possible receptor interaction grids for the supported atom types.

We tested the performance of the new CUDA version of AutoDock-GPU, with and without the OpenMP-based pipeline, and compared it to the performance of the OpenCL version using an NVIDIA DGX-2 appliance hosted at ORNL. The benchmark inputs were derived from 42 experimentally determined protein-ligand structures obtained from the Protein Data Bank [1] and converted to input files for AutoDock-GPU using the OpenBabel program [20]. The set of inputs and the PDB IDs for these structures can be found at https://github.com/diogomart/AD-GPU_set_of_42. This set is hereafter referred to as S42. The S42 dataset contains a variety of ligand sizes and numbers of torsions, or rotatable bonds, and a range of search-box sizes defined by the input grid coordinates.

We then tested the performance of the CUDA/pipeline version of AutoDock-GPU on a large benchmark set of ligands on Summit. For the Summit tests we used GNU compiler collection version 6.4 and CUDA version 10.1.243; Summit currently runs Red Hat Enterprise Linux Server version 7.6 (Maipo). The ligand dataset consisted of 9,000 ligands taken from the full SWEETLEAD database [19] (as acquired in March of 2020), with ligands containing atom types unsupported by the AutoDockTools utilities [16, 28] removed, and supplemented with ligands from the NCI. This dataset will be referred to as SN9000. Performance was measured using both local search algorithms (Solis-Wets and ADADELTA) and compared to the performance of both AutoDock4 and AutoDock Vina on Summit. The SARS-CoV-2 endoribonuclease protein (NendoU) crystal structure recently deposited in the RCSB PDB, PDB ID 6VWW [13], was the target receptor.

Finally, we explored in detail the different optimizations that were added, namely the application of the OpenMP pipeline, the addition of context reuse, and the reuse of the receptor grids by storing all of them in the GPU's global memory. We performed these tests on Summit using a single resource set on a single compute node, consisting of one GPU and either 1 (for the non-pipelined version) or 7 (for the pipelined version) physical CPU cores with one thread per core, with the pipeline using 1 and 7 OpenMP threads, respectively. For this test we picked a random subset of about 400 ligands from the SN9000 ligand set and docked them to the NendoU receptor with a search box of 22.5 Å per side using both local search methods (Solis-Wets and ADADELTA). We chose ligands with small numbers of torsions (≤ 10) to test the performance of the Solis-Wets method, and ligands with more than 10 torsions for ADADELTA. We performed the test with the number-of-runs (nrun) parameter set to 10 and to 100 for both cases, and with the newly implemented autostop functionality both on and off.

For all receptors, any bound small-molecule ligands, ions, and water molecules were removed from the structures using MOE [32]. Hydrogen atoms were then added, and each structure was saved in its apo form. The AutoDockTools script prepare_receptor4.py was then applied to convert the PDB files to PDBQT format, and AutoGrid4 was employed to generate the atom-specific affinity, electrostatic potential, and desolvation potential maps for each receptor. Details of the experiments on the viral Mpro (section 4) are given in that section. Molecular images were made with VMD [11] and UCSF Chimera [23].
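As a point of reference for the grid maps generated by AutoGrid4 above, the search-box edge lengths quoted in this work correspond to grid dimensions through the grid spacing. The conversion below is a minimal sketch that assumes the commonly used default spacing of 0.375 Å; the spacing actually used in these experiments is not stated here.

```cpp
#include <cmath>
#include <cstdio>

// Convert a cubic search-box edge length (in angstroms) into the number of
// grid intervals per side at a given grid spacing; AutoGrid maps then hold
// one more grid point than intervals in each dimension.
int gridIntervals(double boxEdgeAngstrom, double spacingAngstrom = 0.375) {
    return static_cast<int>(std::round(boxEdgeAngstrom / spacingAngstrom));
}

int main() {
    // e.g. a 22.5 A focused box at the assumed 0.375 A spacing:
    // 22.5 / 0.375 = 60 intervals, i.e. 61 grid points per side.
    std::printf("22.5 A box -> %d intervals per side\n", gridIntervals(22.5));
    // A 40 A blind-docking box would need roughly 107 intervals per side.
    std::printf("40.0 A box -> %d intervals per side\n", gridIntervals(40.0));
    return 0;
}
```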
Here we describe the results of performance testing and validation of the CUDA/pipeline version of AutoDock-GPU on several potential use cases for high-throughput docking. The parameters varied across these situations include the number of ligands docked against a receptor, the size of the allowed search box, the number of torsions in the ligands, the local search algorithm used, and the number of replicas of the calculation performed for each docking (the nrun parameter). We break down the performance gains imparted by the different improvements detailed in the Methods section. We also confirmed that the results obtained with the new version are consistent with those obtained using the original OpenCL version.

Results for running the S42 benchmark on the DGX-2 are shown in Fig. 2. In this benchmark, each ligand is docked back into the receptor structure with which it was crystallized; this is known as re-docking. Because the S42 dataset consists of experimental structures of ligands bound to receptors, it can be harnessed not only to validate the ability of the AutoDock-GPU program to reproduce experimental ligand poses, but also to validate the consistency of outputs after major changes to the program. Here we show the cumulative root-mean-square (Euclidean) distance (RMSD) distribution between the final best pose from docking and the crystallographic pose for the S42 set (Fig. 3). The RMSD is a standard measure of the three-dimensional similarity between conformations of a molecule. Note that some ligands within this test set are intentionally difficult, with a large number of torsions and a commensurately large search space, and so not all poses found are near the experimental position. Despite these challenges, the median RMSD over the full 42-ligand set ranges from 1.28 to 1.85 Å for the CUDA implementation of AutoDock-GPU, depending on the local search algorithm. This indicates useful agreement with the experimental poses. As expected, the docking quality improves with increased computational effort, although the improvement is not linear, and usable results can be obtained with a comparatively small number of runs, particularly with the ADADELTA local search algorithm. The small differences between the OpenCL and CUDA implementations within Fig. 3 are not indicative of qualitatively different results, but are instead consistent with the stochasticity inherent in docking algorithms.

For these tests we used the SARS-CoV-2 endoribonuclease protein (NendoU) crystal structure recently deposited in the RCSB PDB, PDB ID 6VWW [13]. In addition, we tested both the Solis-Wets local search method and the ADADELTA method introduced with AutoDock-GPU. We created two test cases for evaluating two extremes of search-space size. The first is an exhaustive search over the protein, known as "blind" docking, together with a complete set of ligand torsional degrees of freedom. The second uses a focused, small search box limited to the active-site region, which must be pre-determined, together with a subset of the ligand dataset containing only those ligands with ≤ 10 torsions. Fig. 4 shows the different search spaces on the receptor. In the large search space test the ligands were allowed to bind anywhere in a large search volume (gray region, Fig. 4), a box 40 Å per side centered on the active site. In addition, we used the full set of ligands, which included those with 10 or more torsional degrees of freedom, or rotatable bonds. We set nrun = 100 in order to explore this larger search space sufficiently, and performed these dockings with both the SW and ADADELTA methods.
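For reference, the RMSD used in this validation compares corresponding atom positions between a docked pose and the crystallographic pose. The sketch below is a minimal version that assumes the two poses have matching atom ordering and share the same coordinate frame; production tooling typically also accounts for molecular symmetry, which this sketch does not.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

// Root-mean-square deviation between two poses of the same molecule,
// assuming atom i in 'a' corresponds to atom i in 'b' and both poses are
// already in the same coordinate frame (no fitting is performed).
double rmsd(const std::vector<std::array<double, 3>>& a,
            const std::vector<std::array<double, 3>>& b) {
    double sumSq = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double dx = a[i][0] - b[i][0];
        const double dy = a[i][1] - b[i][1];
        const double dz = a[i][2] - b[i][2];
        sumSq += dx * dx + dy * dy + dz * dz;
    }
    return std::sqrt(sumSq / static_cast<double>(a.size()));
}
```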
For the small search space test, we docked to the known binding site. In this case, only a small region (22.5 Å per side) of the protein was explored for ligand binding (red box, Fig. 4), selected based on the final binding location for the majority of ligands in the blind docking trials, and only those ligands with ≤ 10 torsions were used. The smaller search box, coupled with the selection of simpler ligands, means that fewer runs were needed to saturate the space, and we therefore set nrun = 10.

Other Docking Programs. The two previous test cases were also used to compare the performance of AutoDock-GPU to equivalent calculations using the CPU-only AutoDock4 [16] or AutoDock Vina [31] programs performed on Summit. For the large search space test, we compared the time to solution of AutoDock Vina against that of AutoDock-GPU on Summit. AutoDock4 was excluded from this analysis, as the high number of runs required to sample the space (nrun = 100) would routinely exceed the allowed walltime on Summit. For the small search space test, the lower nrun setting permitted AutoDock4 to be added to the comparison set. The results of these runtime tests under the two conditions are summarized in Fig. 5. For the large space test, AutoDock Vina runtimes are observed to grow rapidly as a function of ligand complexity. As a consequence, while the median docking times are only a factor of 16 slower in the large-space case, the aggregate run times for all AutoDock Vina calculations are a factor of 25 slower due to particularly poor performance for the largest ligands. By contrast, AutoDock-GPU tackles even these challenging ligands in under two minutes (Fig. 5). The general runtime distributions of AutoDock4 and AutoDock-GPU on Summit are similar, with additional ligand complexity only modestly increasing the runtime for an individual ligand (Fig. 5). However, whereas AutoDock4 has significantly longer runtimes even for the simple ligand set selected, AutoDock-GPU leverages the parallelism inherent to the GPUs to bring the runtimes down substantially. The large reduction in runtimes and the narrower time distribution may permit large ligand sets to be screened on Summit on a routine basis.

Table 1: Performance Improvements from Sequential Optimizations: Solis-Wets. Shown is the mean time in seconds over ligands run sequentially on one GPU using the pipeline with either 1 or 7 threads. Ligands in this set contained 10 or fewer torsions. CR: context reuse; F/RR: receptor file and receptor grid reuse on GPU. Each optimization is added on top of the previous ones going from left to right.

We also performed a systematic test of the incremental contributions of each of the optimizations to the overall speedup of the new HPC-centric version of AutoDock-GPU. 412 randomly selected ligands from the SN9000 dataset were picked for this analysis. We computed the average time in seconds per ligand for values of the nrun parameter set to 10 and 100, for the program with and without threading (by setting the OMP_NUM_THREADS environment variable to 7 and 1, respectively), and tested the different stages of optimization of the program, added incrementally: first the addition of CUDA context reuse, and then the reuse of the receptor input files and of their data by employing GPU global memory to store all required grids for all ligands in a batch. Tables 1 and 2 show these effects for the SW and ADADELTA algorithms, respectively.
Each new optimization is added on top of the previous one in these tables. For simple ligands with 10 or fewer torsions, for both SW and ADADELTA, the use of 7 threads with the OpenMP pipeline provided about a 3 to 3.4× speedup for nrun = 10, but slightly less than a 2× speedup for nrun = 100. With nrun = 10, for both low-torsion sets, the file and receptor reuse optimization provided a significant further speedup of about 2 to 2.8×, while CUDA context reuse did not provide any additional speedup; in contrast, for SW with nrun = 100, context reuse provided a 2.2× speedup and file and receptor reuse did not add performance. Fig. 6 summarizes the total speedup, with all optimizations, for all six cases.

The autostop parameter was added recently to the development version of AutoDock-GPU (OpenCL) and ported to the CUDA version. It allows the program to stop the search early if no progress is being made. Table 3 shows the speedup provided by using autostop, measured without the pipeline (threads set to 1). This feature helps most for larger nrun values combined with smaller molecules.

Many groups from around the world have been studying the different SARS-CoV-2 proteins and solving their structures. The main protease (Mpro) in particular has attracted much attention as a potential drug target, and as such over 90 crystallographic structures containing a bound ligand have been released on the RCSB PDB, many with the ligand bound to the protease's active-site region and others with it bound elsewhere on the protein. We therefore performed some preliminary docking calculations on ligands taken from this set. These calculations allow us to examine calculated binding energies and docking poses. Fig. 7 shows an overlay of some of these ligands bound to Mpro. From this set, 68 ligands are observed to interact with the enzyme active site, of which 22 are non-covalent ligands that have been crystallized. The remaining 46 ligands bound to the active site are covalently bound to the receptor, and this interaction cannot be represented in AutoDock Vina or AutoDock-GPU. Fig. 7 also indicates that 25 crystallographically determined ligands bind non-covalently to Mpro well outside of the active-site region. All of these putative ligands are fragments, and therefore are relatively small, with no more than 4 torsions.

A hierarchical chemical clustering over all non-covalent binders to the protease was performed in chemical fingerprint space using MDL keysets [2], the Tanimoto similarity coefficient between compounds, and the Ward clustering linkage method with a clustering threshold of 0.8. The small molecules were divided into 24 clusters (clades) based on structural similarity (Fig. 8). All molecules in this set, both those bound to the active site and those outside of it, were found in similar clusters. Furthermore, many of the external binders were in the same clades as active-site binders. It is possible, considering their closeness in chemical fingerprint space as shown by the clustering analysis, that the external binders also occupied the Mpro active site for some time before being displaced during crystallization.

We docked both the 22 active-site binders and the 25 external binders to the Mpro receptor. For these tests all docking calculations were performed with SW and nrun = 100. Interestingly, the scores for both sets (active-site and external binders) had means between -6.4 and -7.4 kcal/mol for all of the Mpro crystal structures tested, with the best scores over each ligand set extending below -8.5 kcal/mol.
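For context on the hierarchical clustering described above, the Tanimoto coefficient between two binary fingerprints is the number of keys set in both divided by the number set in either. The sketch below uses a fixed-length bitset as a stand-in for the MDL keyset fingerprints (166 keys assumed) and illustrates only the similarity measure, not the Ward linkage step.

```cpp
#include <bitset>
#include <cstddef>

// Tanimoto similarity between two binary fingerprints:
// T(A, B) = |A AND B| / |A OR B|, in [0, 1]; 1 means identical key sets.
constexpr std::size_t kNumKeys = 166;   // assumed fingerprint length (MDL public keys)

double tanimoto(const std::bitset<kNumKeys>& a, const std::bitset<kNumKeys>& b) {
    const std::size_t common = (a & b).count();
    const std::size_t either = (a | b).count();
    return either == 0 ? 0.0 : static_cast<double>(common) / static_cast<double>(either);
}
```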
Docking of the 22 non-covalent ligands to Mpro with AutoDock Vina was also performed (Fig. 9). The Vina scores are shifted to higher values and have a more one-sided and less dispersed distribution than the AutoDock-GPU scores. These differences reflect the different potentials and algorithms used in the two programs, and can be useful information when choosing a cut-off for score-based virtual screening.

Figure 9: Histogram of the best binding free energies (scores) of ligands experimentally found to bind non-covalently to the Mpro active site, docked to Mpro structure PDB ID 5RE9 using AutoDock-GPU and AutoDock Vina.

4.0.1 Effect of box size and active-site structure. We docked the 22 active-site binders to several available Mpro crystal structures, PDB IDs 5R7Y, 5R80, 5R81, 5R84, 5RE9, 5RF3, 5RGI, 6WQF, and 6Y2E. Using a box fit very tightly about the active site, of dimensions 18.75 × 24.75 × 22.5 Å, the scores obtained were high: not lower than -6.5 kcal/mol, with a mean of around -5 kcal/mol. Using a box of 26.25 Å per side (centered on the Pro 39 C-alpha atom) improved scores by approximately 2 kcal/mol, bringing them below -8 kcal/mol in several cases. We also noted that, using the same search box, the resulting scores were somewhat sensitive to small changes in the protein active site. These include the positions of several coils or small helix regions at the opening of the active site and of flanking side chains. Fig. 10 illustrates this difference, showing scores for docking the 22 active-site binder ligands to two Mpro structures crystallized without bound ligands, PDB IDs 6Y2E and 6WQF, and to 5R7Y, which was crystallized with a bound ligand. 6Y2E and 6WQF differ by a large change in the position of the GLN 189 side chain, seen upon close inspection of the 6WQF active site. This conformation, in addition to small changes in the positions of surrounding loops, gives the active site a reduced volume, thus limiting the search space for docking and potentially contributing to the slightly higher energy values obtained from docking to this structure. Fig. 11 shows an overlay of five of the 22 structures crystallized with ligand bound that were found to be most different from the others and from the unbound Mpro structures (PDB IDs 5RGK, 5R84, 5R80, 5R81, 5RE9). Notable is the small helix that becomes a coil in several structures and adopts a wide range of positions, changing the active-site volume.

Re-docking refers to the docking of a ligand, crystallized in complex with a receptor, to the same receptor. We performed re-docking experiments on four of the 22 non-covalent binders, the ligands from PDB IDs 5R7Y, 5R84, 5RF3, and 5RGI. After docking, the top-ranked (lowest-score) docked pose of the ligand was superimposed on the crystal-structure ligand pose to compare them. The 5RF3 ligand (residue name T5V) is a very small fragment and can dock in many locations in the large active site of Mpro, as shown in Fig. 12; it also receives a high value for the binding free energy, -4.75 kcal/mol. For two of the four ligands tested, from PDB IDs 5R84 (ligand residue name GWS) and 5RGI (ligand residue name U0P), the re-docking results were remarkably good, as illustrated in Fig. 13. The 5R7Y ligand was docked in a less accurate position than these two. Fig. 13 also shows a comparison to the same re-dockings performed with Vina, demonstrating that AutoDock-GPU can calculate re-docked poses of comparable closeness to the crystallized position as Vina.
Using a smaller search box could help to better locate very small ligands such as T5V; however, the smaller search box described above prevented the GWS ligand from being docked correctly, as shown in Fig. 14. This demonstrates some of the difficulties encountered in high-throughput docking: it is difficult to find a search box that ensures the best docking result for all ligand sizes, even within a much-reduced subset containing only ligands with 4 torsions or fewer.

We have presented a new version of AutoDock-GPU that has been ported to CUDA and contains a number of optimizations enabling more efficient processing of thousands of ligands per receptor, to facilitate high-throughput structure-based in silico drug discovery. This version has enabled AutoDock-GPU to be deployed on the Summit supercomputer, which opens up this resource for massive computational efforts toward therapeutics, especially for combating the current COVID-19 pandemic. The total speedup for flexible docking of small compounds with a moderate number of search iterations, using the new pipeline, is close to 10× relative to the previous single-receptor/single-ligand workflow. This pipeline will be back-ported to the OpenCL version in future work, allowing the method to also be employed on non-NVIDIA systems. For deployment on Summit, next steps will involve integrating this program into workflow management systems to efficiently manage large-scale docking campaigns.

References
[1] The Protein Data Bank
[2] Reoptimization of MDL keys for use in drug discovery
[3] VinaMPI: Facilitating multiple receptor high-throughput virtual docking on high-performance computers
[4] GeauxDock: Accelerating Structure-Based Virtual Screening with Heterogeneous Computing
[5] Computational protein-ligand docking and virtual drug screening with the AutoDock suite
[6] The energy case for graph processing on hybrid CPU and GPU systems
[7] Enzyme assays for high-throughput screening
[8] Automated docking of substrates to proteins by simulated annealing
[9] An open-source drug discovery platform enables ultra-large virtual screens
[10] Empirical Scoring Functions for Structure-Based Virtual Screening: Applications, Critical Aspects, and Challenges
[11] VMD - Visual Molecular Dynamics
[12] Structure of Mpro from SARS-CoV-2 and discovery of its inhibitors
[13] Crystal structure of Nsp15 endoribonuclease NendoU from SARS-CoV-2
[14] High performance in silico virtual drug screening on many-core processors
[15] Automated docking using a Lamarckian genetic algorithm and an empirical binding free energy function
[16] AutoDock4 and AutoDockTools4: Automated docking with selective receptor flexibility
[17] Combining Docking Pose Rank and Structure with Deep Learning Improves Protein-Ligand Binding Mode Prediction over a Baseline Docking Approach
[18] Cell-Based Assay Design for High-Content Screening of Drug Candidates
[19] SWEETLEAD: an In Silico Database of Approved Drugs, Regulated Chemicals, and Herbal Isolates for Computer-Aided Drug Discovery
[20] Open Babel: An open chemical toolbox
[21] Software for molecular docking: a review
[22] How to Discover Antiviral Drugs Quickly
[23] UCSF Chimera - a visualization system for exploratory research and analysis
[24] D3R Grand Challenge 4: prospective pose prediction of BACE1 ligands with AutoDock-GPU
[25] Accelerating AUTODOCK4 with GPUs and Gradient-Based Local Search
[26] High-throughput screening and identification of potent broad-spectrum inhibitors of coronaviruses
[27] Minimization by Random Search Techniques
[28] A Performance and Energy Evaluation of OpenCL-accelerated Molecular Docking
[29] ZINC 15 - Ligand Discovery for Everyone
[30] OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems
[31] AutoDock Vina: Improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading
[32] Medicinal Chemistry and the Molecular Operating Environment (MOE): Application of QSAR and Molecular Docking to Drug Discovery
[33] ADADELTA: an adaptive learning rate method
[34] Message passing interface and multithreading hybrid for parallel molecular docking of large databases on petascale high performance computing machines