key: cord-0014103-xuguky8i
authors: Acharya, Atanu; Agarwal, Rupesh; Baker, Matthew; Baudry, Jerome; Bhowmik, Debsindhu; Boehm, Swen; Byler, Kendall; Coates, Leighton; Chen, Sam Yen-Chi; Cooper, Connor J.; Demerdash, Omar; Daidone, Isabella; Eblen, John; Ellingson, Sally R.; Forli, Stefano; Glaser, Jens; Gumbart, James C.; Gunnels, John; Hernandez, Oscar; Irle, Stephan; Larkin, Jeffery; Lawrence, Travis J; LeGrand, Scott; Liu, Shih-Hsien; Mitchell, Julie C.; Park, Gilchan; Parks, Jerry M.; Pavlova, Anna; Petridis, Loukas; Poole, Duncan; Pouchard, Line; Ramanathan, Arvind; Rogers, David; Santos-Martins, Diogo; Scheinberg, Aaron; Sedova, Ada; Shen, Shawn; Smith, Jeremy C.; Smith, Micholas; Soto, Carlos; Tsaris, Aristides; Thavappiragasam, Mathialakan; Tillack, Andreas F.; Vermaas, Josh V; Vuong, Van Quan; Yin, Junqi; Yoo, Shinjae; Zahran, Mai; Zanetti-Polzi, Laura
title: Supercomputer-Based Ensemble Docking Drug Discovery Pipeline with Application to Covid-19
date: 2020-07-29
journal: ChemRxiv
DOI: 10.26434/chemrxiv.12725465
sha: a2951bfb119ede84b7a01bad3d5b6ccd54c48ac5
doc_id: 14103
cord_uid: xuguky8i

We present a supercomputer-driven pipeline for in-silico drug discovery using enhanced sampling molecular dynamics (MD) and ensemble docking. We also describe preliminary results obtained for 23 systems involving eight protein targets of the proteome of SARS CoV-2. THe MD performed is temperature replica-exchange enhanced sampling, making use of the massively parallel supercomputing on the SUMMIT supercomputer at Oak Ridge National Laboratory, with which more than 1ms of enhanced sampling MD can be generated per day. We have ensemble docked repurposing databases to ten configurations of each of the 23 SARS CoV-2 systems using AutoDock Vina. We also demonstrate that using Autodock-GPU on SUMMIT, it is possible to perform exhaustive docking of one billion compounds in under 24 hours. Finally, we discuss preliminary results and planned improvements to the pipeline, including the use of quantum mechanical (QM), machine learning, and AI methods to cluster MD trajectories and rescore docking poses.

A typical drug takes 10-15 years and $1-1.5B to develop, due mainly to the need for extensive preclinical testing and high failure rates in clinical trials. 1 Accelerating the drug discovery process is therefore of major concern, and this need has now been placed in even sharper relief by the Covid-19 pandemic. When a target is known and a specific assay developed, experimental high-throughput screening of compounds synthesized using combinatorial chemistry can be performed, and this is a mainstay of drug discovery. However, this approach has had a checkered past, with notable failures, for example, in antimicrobial efforts. 2 A logical alternative is to mimic what happens in nature, using 'structure-based' drug discovery. This can involve structural biology experiments, such as crystallography and cryo-EM, combined with a variety of assays. Moreover, the availability of many experimental protein structures combined with massive increases in computational power and methodological advances have led to a resurgence of computational studies in which trial compounds are docked into binding sites in three-dimensional models of the protein targets and then ranked according to their strength of binding. Computational docking has been particularly useful in early stages of molecular discovery in order to identify initial hits to be prioritized for experimental validation.

Early docking studies were performed with static target crystal structures and rigid ligands. These were quite successful in some cases, such as in the discovery of antivirals for HIV and influenza 3, 4 . Unfortunately, though, at that time, structures for few targets existed, and the process was relatively inefficient: calculations were relatively inaccurate, and computers could dock only ~100 compounds in a reasonable timeframe. Since the 1990s, the power of supercomputers has increased by a factor of a million or so. Rigid docking of over a billion compounds has been performed in a few days. Thus, virtual highthroughput screening has outperformed equivalent experimental high-throughput screening and has been shown to rapidly identify very tightly binding compounds 5 .

Strictly rigid docking does not often take place in protein: ligand interactions, as both ligands and proteins, undergo thermally-driven internal motions, which lead to fluctuating binding site conformations 6 . Therefore, a particularly important development has been the recognition that incorporating target flexibility into drug discovery protocols can improve the drug discovery process 7 . Ensemble docking uses different conformations of the protein targets of interest, and combinatorially performs the docking of databases of compounds against each of the protein target conformations. This process models the conformational selection binding mechanism, as opposed to a more limited induced-fit mechanism. The method requires the generation of an ensemble of protein conformations to be used in the docking calculations.

Ensemble docking of small probe molecules for flexible pharmacophore modeling was introduced in 1999. It was shown that consensus pharmacophore models, based on multiple MD structures or on multiple crystallographic structures, were more successful than models based on single conformations in yielding successful predictions of binding 8 . In our own labs, ensemble docking has produced experimentally validated hits against each of the 16 protein targets presented to us over the past few years. Our groups have increasingly used an ensemble approach to perform docking [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] [25] . In addition, we have shown that the clustering of protein target MD trajectories usually brings a large improvement in the quality of ensemble docking compared to what is obtained using single structure docking 26 .

Ensemble generation using MD and docking both require significant computational power -to perform MD simulations of sufficient duration and to dock large databases of compounds. This combinatorially large computational time requirement essentially limits this approach to high-performance computing (HPC) architectures for large database screenings, even when only a subset of protein conformations is used in docking, for example, following clustering of the target's MD configurations. HPC involves the use of specialized, large supercomputing systems to perform large calculations that are parallelized over many compute nodes, each consisting of dozens of cores. Traditionally, the use of a high-speed interconnect allows rapid communication between separate compute units and clever parallelization schemes to enable rapid calculations on problems too large to fit on a single compute unit. These schemes have historically involved specialized programs that focus efforts to optimize communication overlap. The use of graphics processing units (GPUs) has helped to accelerate many types of calculations. The Summit supercomputer, housed at the Oak Ridge Leadership Computing Facility (OLCF), is currently the fastest supercomputer in the United States. Summit is an IBM AC922 system consisting of 4608 large nodes, each with six NVIDIA Volta V100 GPUs per node. Each node also contains two POWER9 CPU sockets for a total of 42 cores per node. The GROMACS MD program [27] [28] [29] is able to make use of all aspects of the Summit supercomputer's HPC utilities, including the GPUs and the interconnect, providing for both strong and weak scaling, which dramatically decreases time per MD step and increases the size of the system that can be simulated efficiently. The temperature replica exchange molecular dynamics (T-REMD) routine [30] [31] [32] [33] which was chosen here for the MD calculations (see below) uses the interconnect not only to allow for parallelization of a single simulated biomolecule, but also to communicate between separate replicas of the system, each carried out at a different temperature, and performs exchanges between replicas to accelerate the conformational sampling of the biomolecule 34 .

Protein-ligand docking has hitherto not been considered a traditional HPC task, as each docking calculation is short and does not require multiple nodes to complete. In fact, many docking programs can run on a single CPU core. However, the number of cores on a supercomputer or cluster can provide a resource to perform many simultaneous docking calculations that greatly decrease the time-to-solution for screening a large dataset of ligands. Cloud and distributed computing resources also provide this type of completely parallel solution for high-throughput screening 35, 36 . The use of GPUs has recently been made possible for the widely-used program AutoDock4 37, 38 resulting in the program AutoDock-GPU, which provides up to 50X speedup over AutoDock4 (available at https://github.com/ccsb-scripts/AutoDock-GPU) [39] [40] [41] . Thus, the use of leadership HPC facilities for ensemble docking can provide the ability to screen billions of ligands to a full set of conformations generated with HPC-based MD simulations. Quantum mechanical refinement of classical docking ranking based on fragment molecular orbital (FMO) techniques also naturally benefits greatly from massively parallel supercomputer capabilities 42 .

The need for rapid time-to-solution in drug discovery has become accentuated by the Covid-19 pandemic. Combating viral illnesses such as Covid-19 encompasses two main modalities: vaccination for prevention and drug therapy for those already infected. Unfortunately, vaccine development can be a long process due to the need to ensure immunity while also determining effective booster schedules to reinforce immunity and safety 43 . Regarding the latter, there can be a paradoxical enhancement of infection after vaccine administration wherein viruses hijack the host's immune system and actually utilize antibodies to facilitate entry into host cells 44 . Partly for these reasons, there is an impetus to develop in parallel drugs that target the virus directly. These are typically small molecules that bind viral proteins, abolishing their function and, in turn, inhibiting viral replication, although drugs targeting human proteins can also be considered.

Given the urgency of the pandemic, repurposing established antivirals appears to be an expedient and reasonable approach. Among the present candidates, the antiviral remdesivir, a nucleoside analog that acts by interfering with RNA synthesis, is a leading candidate for repurposing, as it is active against coronaviruses related to SARS-CoV-2, including the coronavirus responsible for MERS [45] [46] [47] [48] [49] . Recent publications have indicated that remdesivir shortens recovery from Covid-19 in hospitalized patients. 50 Other promising candidates, such as dexamethasone, may modulate the host response 51 .

When surveying the existing clinical trials for Covid-19, one is struck by the number of trials that are not based on knowledge of the drug interacting with a known target. There are several examples. As one illustration, baloxavir is a specific inhibitor of the cap-snatching endonuclease of influenza virus, which is a member of the PD-(D/E)xK two-metal nuclease superfamily 52 . Coronaviruses have an endonuclease but of a completely different fold (NendoU) and different active site residues 53 . NendoU also oligomerizes into a hexamer. Although it would seem unlikely that baloxavir would bind to NendoU, it nevertheless is in clinical trials. Similarly, lopinavir and ritonavir are also undergoing testing, even though they target proteases of the unrelated HIV 48 . Although in principle, enzyme active sites with similar chemical functions may bind similar ligands, the steric and physicochemical substrates of drug-protein binding are nuanced. The frequent trial of drugs specific for targets known to be absent in SARS-CoV-2 seems to us to be symptomatic of a lack of precision in combating this pandemic. For further information on other ongoing trials on small molecule drugs, biologics, passive immunization with antibodies, and vaccines, the reader is referred to the comprehensive review by Liu et al. 54 . The aforementioned challenges, taken together, demonstrate that there exists an unequivocal need for de novo drug discovery campaigns as well as repurposing studies.

There are a number of events in the SARS-CoV-2 viral replication cycle 55 to target for antiviral therapeutic development; from viral entry to membrane fusion, travel to the host endoplasmic reticulum where translation of the viral genome occurs, to formation of the viral replication complex and formation from host membranes of double-membrane vesicles (DMVs) 56, 57 , the passage of the replicon through the Golgi and the release of the virion from the cell. Each of these steps involves key viral proteins and occurs in a different compartment of the host cell. For example, the binding of the virus to the ACE2 receptor involves the receptor-binding domain (RBD) of the virus S (spike) protein, pre-fusion cleavage involves the binding of host TMPRSS and furin proteases to the S1/S2 dibasic domain 58 formation of the replication complex and the DMVs involves the non-structural proteins (NSPs), and the N protein is required for packaging of the viral genome into the newly assembled virion 59 . The replication complex is made up of 15 mature NSPs, which are encoded by orf1ab and orf1a genes as the pp1ab and pp1a polyproteins 60 . Currently, many efforts are targeting the main protease, MPro 61 , which is required for cleavage of the large viral polypeptide into its functional proteins, the RNA-dependent RNA polymerase (RdRp) 62 responsible for the production of new viral RNA, and some efforts target prevention of S cleavage 63 . In addition, viral proteins also function to impede the host's defense mechanisms: both proteases have been shown to inhibit the human immune response by interacting with immune proteins in SARS-CoV 64 . It is, therefore, important to understand regions on these proteins that act as binding sites for both substrates (as in the case of the proteases) and for protein-protein interaction.

There are 16 NSPs that are involved in SARS-CoV-2 viral replication and the circumvention of the immune system. The NSPs vary in size and complexity, and the function of some of them is not currently known. For instance, NSP3 is a large multi-domain protein 65 composed of around 1900 amino acids that form multiple domains, making it a particularly difficult target for structural biology studies. To make structural studies more tractable, individual domains of NSP3, such as the papain-like protease domain and the ADP ribose phosphatase domain, can be overexpressed in bacteria and then purified and crystallized for X-ray diffraction studies.

Besides importance to the viral replication cycle 66 , there are other factors to consider when choosing targets for in-silico drug discovery. One important concept is the idea of druggability 67 , i.e., the likelihood that a drug will be able to change the behavior of the target in a way that brings a therapeutic response. Traditionally druggability has been based on previous responses to drugs of proteins in similar families or with similar structures 66 . Newer metrics include analyses of the structure of the target's active site and the desolvation that occurs when the drug binds to the target 68 . For antivirals, especially for coronaviruses such as SARS, MERS, and SARS-CoV-2 viruses, certain targets such as the RdRp have been considered to be more druggable than others 69 .

Another consideration for the medium to longer-term is the ability of the pathogen's proteome to mutate and cause drug resistance 70, 71 . It is possible that less conserved proteins in the proteome may withstand more mutations and thus could confer increased viral resistance to a drug than more conserved proteins [72] [73] [74] . For more rapidly mutating viruses, combination drugs may be used in an attempt to counteract drug resistance that occurs as the virus mutates in response to drugs 70, 71 . Incorporating information about mutation frequencies when considering drug targets can help to prevent resistance--it is possible that ancestrally conserved proteins across taxa may be less likely to mutate and cause resistance. Therefore, sequence analyses can be helpful when deciding which viral proteins should be targeted by in-silico efforts or experimental screenings that make use of in-silico filtering. An analysis of the current mutations in the SARS-CoV-2 virus and their frequencies can help to elucidate emerging trends that may affect the propensity for the resistance of the various targets and help to focus choices for lead optimization.

In previous work, very early in the pandemic, we combined restrained temperature replica-exchange molecular dynamics (restrained T-REMD) simulations with virtual high-throughput screening in a supercomputer-based ensemble docking campaign to identify well-characterized drugs, metabolites, or natural products that bind to either the S-protein: ACE2 receptor interface or the RBD of the S-protein 25 . From this ensemble docking campaign, we provided a ranking of the predicted binding affinities of over 8000 drugs, metabolites, and natural products (and their isomers) with regards to the SARS-CoV-2 Sprotein and the S-protein: ACE2 receptor complex. The ranked list has been incorporated into experimental testing using a high throughput screen that was implemented in the SARS CoV outbreak, and new compounds will be added as discovered. Two of the top compounds in this screen are now in clinical trials (Trial Numbers: NCT04357990 and NCT04377789).

In this work, we describe our efforts establishing a supercomputing-based pipeline for ensemble docking and preliminary results on its application to discovering therapeutics that target viral proteins of SARS-CoV-2. The pipeline and results presented here represent both the continuation of the above work and our contribution to date to the work of the USA HPC Covid-19 Consortium that was created on March 29 th, 2020. We describe the choice of 8 targets and the preparation of protein models from experimental data. We report on T-REMD simulations performed for the targets totaling about half a millisecond of simulation time. We have docked repurposing databases to ten configurations of each protein simulated using the popular docking program Autodock Vina. We also describe efforts deploying Autodock-GPU [39] [40] [41] at scale on Summit that demonstrate the docking of over a billion compounds in 24 hours with full structural optimization of the ligand. Future developments involving the use of AI and quantum chemistry in rescoring and clustering are also outlined. The pipeline described here can also be used in future work to target human proteins 75, 76 known to interact with viral proteins, or in disease-causing responses in Covid-19 and more generally in computational structure-based drug discovery.

Computational methods in drug discovery narrow a vast chemical search space to a tractable set of compounds suitable for experimental testing. Experimental work can involve a variety of tests, including live virus testing as well as target engagement studies, and will not be considered further here. Rather, we discuss the procedures of structural modeling, MD simulation, and docking.

Multiple groups have been using structural biology techniques, including X-ray crystallography, smallangle scattering (SAS), and cryogenic electron microscopy (cryoEM), to investigate the structure of proteins and protein complexes from SARS-CoV-2 [77] [78] [79] [80] . However, obtaining a structure from the Protein Data Bank (PDB) or perhaps a revised model from another resource is only the starting point. Often structures are missing residues that were not resolved, and a determination must be made whether and how to model them. Also, as structural models are rapidly being released to aid in the fight against COVID-19, the potential inclusion of a few structural errors is an unfortunate reality. In particular, the identification of metal cations in protein structures requires careful thought and examination of its local coordination environment.

Even with perfectly assigned and complete experimental structures, it may not be enough to consider viral protein targets as chemically invariant structures for modeling and binding calculations. Large differences in pH in various cell compartments as the virus travels through the host cell [81] [82] [83] can qualitatively change the protein's structure and function. Differences in pH also affect the protonation states of the proteins and the small molecules being tested as drugs, altering drug binding preferences. Finally, the oligomerization states of the target proteins are important to consider as the interactions between protein monomers may influence the shapes of the active sites. Another important factor to consider when performing in silico screens using ensemble docking is the ability to construct a useful model of a particular protein for MD simulation. For instance, certain metal-containing regions of a protein may not have an existing classical mechanics model (force field parameters), or existing models may be inadequate. In addition, highly charged, disordered, and ion-dependent biomolecules have been known to have less accurate force fields and may perform poorly in an MD simulation [84] [85] [86] [87] .

Proteins chosen for ensemble docking in this study were those that had a crystal structure available with a reasonable resolution, were amenable to accurate simulation with classical MD force fields, and were also known to be important for viral pathogenicity based on either recent studies or those on SARS-CoV. The 23 systems studied comprise nine protein domains. Two of these, RBD of the S (spike) protein and the Nterminal region of the N (nucleocapsid) protein, are domains in structural proteins found attached/within the virion. The N protein is used for packaging the viral genome and is essential for the assembly of the virion 88 . The remaining seven domains come from non-structural proteins (NSPs) 3, 5, 9, 10, 15, and 16, which form the replication complex and are involved in a number of key tasks leading to the creation of new virus particles 89 . Two domains come from NSP3 65 , the ADP ribose phosphatase (ADRP, also known as macro-or "X") domain, and the papain-like protease domain (PLPro) 90 . The ADRP seems to be involved in ADP-ribosylation, which is used in cell signaling and thus may act to inhibit the host immune response 91 . PLPro cleaves regions of the polyprotein to release non-structural proteins and also is involved in the mechanisms the virus uses to counteract the host immune response, for instance by interaction with host immune proteins 89, 92 Nsp 5 is the main protease (MPro), which self-cleaves and also cleaves other regions along the polyprotein, releasing essential proteins to perform their tasks in the assembly of the replication complex, and is also involved in interacting with and preventing the action of host immune proteins 92, 93 . The exact function of NSP 9 is unclear, but has been found in SARS-CoV to be required for viral replication and has been shown to bind to RNA oligonucleotides 94 , Nsp 15 is an endoribonuclease specific for uridine whose exact function is also unknown, but has been implicated in interfering with host immune response both through direct interaction and by cleaving viral RNA to prevent detection by the host 95 . NSP 16 is thought to be a methyltransferase that requires NSP 10 as a co-factor, and acts to disguise viral mRNA from the host immune response by adding a methylation onto the RNA cap which host cells use to mark RNA as belonging to "self" versus "pathogen." [96] [97] [98] .

The explosion in research and literature fueled by the Covid-19 pandemic, together with the need for searching through related literature on other coronaviruses, has created a challenge for researchers needing to understand the structural details and cellular contexts of the SARS-CoV-2 proteins. To help navigate this challenging landscape, we have been developing new tools based on natural language processing for enabling a more robust search for specific questions required for our modeling, simulation, and ligand docking work 99-101 featuring targeted filtering and exploiting external resources (e.g., Wikidata, ChEBI, PubChem) to expand our search capability. For example, after we generate a set of related keywords, the service will screen for the terms referring to a chemical substance and fetch the chemical information (e.g., SMILES string) from the PubChem automatically. In addition, using this keyword search enables the ontologies (e.g., Wikidata, ChEBI) to be used to link related chemicals and their properties for document annotations in query results. The main data resource of the system is a collection of scientific papers, which are collated from major publications. The full-text article access and download from the publishers' archives are performed under the publishers' agreements, and the internal article corpus in our system is updated on a weekly basis.

To provide a diverse survey of the conformational ensembles of the SARS CoV-2 viral proteome, we performed T-REMD simulations of 23 different model systems listed in Table 1 . An additional supplementary table (SI Table I ) is also provided, which summarizes the PDB entry simulated, complete protonation state choices (where applicable), and the number of replicas used for the T-REMD. 

To engage in the use of MD for the rapid generation of conformational ensembles for drug discovery one requires that the input for MD be generated in a semiautomated fashion by which the atomic coordinates, obtained from experimental or in silico protein structure prediction methods, can be quickly processed into MD input files. To facilitate this semiautomated approach, CHARMM-GUI was used for most model building 102 . The general system building method used here involves the direct retrieval of structures from the PDB and processing to model missing residues, assign protonation states, add disulfide bonds (where noted in the PDB annotation), add glycosylation (where resolved in the crystal structures of the S-protein receptor binding domain and ACE2), neutralize the charge of the system (using Na+ and Cl-ions), and solvate (with TIP3P water). Many proteins have coordinated ions that serve structural roles, such as the Zn 2+ cations in Nsp10, or catalytic roles. Thus, the treatment of Zn-complexes in fixed-charged classical MD force-fields is a challenge, and for some systems, it may result in the failure to maintain Zn-protein coordination 103 104 105 , and when found necessary (as noted below and also summarized in SI Table I ) an explicit bond representation was used. All of these considerations mandate an abundance of care when preparing a biologically accurate model for simulation. Below we discuss considerations taken into account when modeling some of the proteins simulated.

Presently available structures of the S protein have nine gaps totaling approximately 150 residues, in addition to over 20 and over 100 missing residues at the N-and C-termini, respectively. Current models also lack post-translational modifications, including glycosylation and formation of disulfide bonds. The S protein is heavily glycosylated, with roughly 20% of its mass in glycan chains, yet at most, a few mono or disaccharides are present in the structure.

In our preliminary study 25 , we made use of a homology model of the entire spike with restraints applied such that only the human ACE2-Spike interface was unconstrained. Here, using crystal structures of the ACE2-S protein complex, simulations of the receptor-binding domain of the S Protein (Spike) both in complex with the human ACE2 receptor and on its own (referred to within the text as the "Apo" RBD) were performed. The viral spike receptor-binding domain was chosen to provide insight into the details of the initial viral-host recognition process. Glycosylation resolved from crystallographic imaging was used, and an annotated disulfide bond was also included. The main protease, MPro, is an attractive target for the development of antiviral drugs. There is compelling evidence that the enzymatically active species is the dimeric assembly of MPro. A dimer is observed in most crystal structures of CoV MPro, as well as in solution at sufficiently high concentrations. In addition, a linear increase in the enzyme activity at increasing concentration suggests catalytic incompetence of the monomer 106 . Therefore, the full dimer was considered in the present MD simulations for MPro, using as starting coordinates the apo-homodimer in the crystal structures 6Y2E and 6WQF 107, 108 .

The crystal structures show that SARS-CoV-2 MPro, similarly to other MPro's 93, 107, 109, 110 , is composed of three domains: Domains I (residues 8-101) and II (residues 102-184) are arranged in an antiparallel βbarrel structure, and domain III (residues 201-303) contains five α-helices arranged in a globular cluster. Domain III is a specific feature of CoV MPro proteins and was suggested to be essential in the proteolytic activity by keeping domain II and the long loop connecting domains II and III (residues 185-200) in the proper orientation, and/or by orienting the N-terminal residues that are essential for the dimerization 109 . Dimerization occurs through interactions between the helical domains of the two monomers and through hydrogen bonding interactions between the N-terminal residues of one monomer and key residues in the other monomer. In particular, the salt bridge between the N-terminal Ser1 of one monomer and Glu166 of the other monomer has been suggested to be essential to maintaining the catalytically competent conformation 109, 111 . The substrate-binding site is located in a cleft between domains I and II and contains a highly conserved catalytic dyad formed by Cys145 and His41. Comparison among the two apo crystal structures and the crystal structure obtained in the presence of an inhibitor reveals 93 only minor structural differences in the position of a few side-chains and no relevant changes in the substrate-binding site, except for the rotation of the side chain of Met165, which is in the proximity of His41.

Although the catalytic mechanism is not fully understood, there is a general agreement in considering that the proteolytic activity of CoV MPros is initiated by activation of the enzyme through a proton transfer reaction in the catalytic dyad, leading to a charge-separated state with a highly reactive thiolate. It has also been suggested that such a proton transfer reaction is induced by the presence of the native substrate 112, 113 . Therefore, in the present MD simulations of the apoenzyme, Cys145 and His41 were simulated in their neutral state, with His41 protonated at Nδ (i.e., HSD). This choice is based on the observation that the His41-Nε appears to be the best candidate as proton acceptor from Cys145 because in the crystal structures the His41-Nε is closer than the His41-Nδ to the Cys145-S and the His41-Nδ is already involved in a hydrogen bond to a highly conserved water molecule, which is considered the third element of the catalytic site. A recent QM/MM study 114 also supports this proton transfer mode and the role of water in catalysis. However, the ε-nitrogen protonation state for His41 (HIE) cannot be decisively ruled out, and MD simulations were performed also considering this alternative, although less probable, protonation state.

The protonation states of two additional His residues, namely His163 and His172, were also highlighted as being crucial for the enzymatic activity of CoV MPro. In particular, the doubly protonated (cationic) state of His163 at pH 6.0 was suggested to modulate relevant conformational variations involving Glu166, Phe140, and His172, leading to a catalytically inactive conformation 109 . At higher pH values, both His163 and His172 should be uncharged, and, on the basis of the hydrogen-bonding pattern that can be inferred from the crystal structures, the HSE protonation state was used for both His163 and His172 in the present MD simulations. All other His residues were also simulated in their neutral state, assigning the Nδ or Nε protonation state on the basis of their chemical environment and hydrogen-bonding patterns. The selected protonation states are as follows: HSD64, HSD80, HSE164, HSE246.

PLPro. For PLPro (PDB entry 6W9C), the Zn cation was coordinated to C189, C192, C224, and C226. Similar to MPro, the protonation states of the His residues in PLPro were not readily available. These were obtained with the use of the PropKa 3.0 server assuming pH ~5, corresponding to its presumed cellular (lysosome) environment 115 , with 6W9C being assigned to pH 5 based on its physiological role in acidic environments. TopoGromacs 116 was used to convert the system and associated force field to GROMACS format, allowing Zn-coordinating proteins to be incorporated into the overall temperature-replica exchange workflow.

NSP15. Large (His) tags were added during crystallization for NSP15. Prior to simulating the monomeric and hexameric forms, these His tags were removed from NSP15 using MOE2019 and subjected to a "quick prep" with the prepare protein module of MOE to resolve potential issues in the resulting structure. The resulting truncated PDBs were then uploaded to CHARMM-GUI for neutralization and solvation.

NSP10. For NSP10, both in its monomeric and a complexed form with one Zn cation liganded to C4370, C4373, C4381, and C4383, while the other bound Zn was liganded to C4327, C4330, H4336, and C4343. For 6VYO, H59 and H145 were liganded.

All simulations were performed using the GROMACS 27, 117 software suite, and the CHARMM36m force field 118 , which is the default of choice using the CHARMM-GUI. For all systems, the protein was solvated in water-boxes with edge-distances of 1nm, and only neutralizing Na + and Clions were used. Short-range interactions were treated with a smooth force-switch cutoff of 1.2 nm, and long-range electrostatics were treated using the particle-mesh Ewald (PME) formalism, as implemented in GROMACS 119 . To facilitate the use of a 2-fs MD timestep, all covalent bonds to hydrogen were restrained with the LINCS algorithm 120 in all simulations. Following system preparation, all solvated models generated were subject to steepestdescent energy minimization with a stopping condition of either reaching the force-convergence criteria of 1000 kJ mol -1 nm -1 or a maximum of 5000 iterations. Energy minimization was performed primarily to remove potential clashes between the solvent, ions, and the protein (or protein complex) of interest. Postclash removal minimization, short (250 ps) NPT relaxation simulations (with default positions restrains generated from CHARMM-GUI) were performed to relax the simulation box dimensions for each replica (at different temperatures) independently (see T-REMD Protocol). For these relaxation simulations, the Berendsen baro/thermostats 121 (as implemented in GROMACS) and an integration time step of 1 fs were used.

MD simulations provide a means to study the conformational dynamics of proteins. However, frequently MD becomes 'trapped,' resulting in the need for many long simulations to effectively sample a protein's conformational landscape 122 . To overcome this sampling challenge, enhanced sampling techniques can be used. For the present work, temperature replica-exchange molecular dynamics (T-REMD) was employed, whereby multiple copies of a target system are simulated simultaneously with each copy (replica) at a different temperature, with periodic coordinate swapping (performed in such a manner as that preserves detailed balance) between the copies [31] [32] [33] . By running at multiple temperatures, with exchanges, the dynamics of the system avoids 'kinetic traps' and provides a robust sampling of the protein free-energy landscape, and thus the protein conformational diversity 123 . T-REMD was chosen for several reasons:

1) it guarantees an increase in sampling efficiency over straightforward MD 124 , 2) it does not require the assignment of reaction coordinates (or collective-variables) a priori to accelerate conformational sampling

3) it does not require direct modification to the system Hamiltonian 125 .

T-REMD simulations for each system were performed with the GROMACS simulation suite. A limited temperature range of 310 K to ~350 K was chosen to maintain physiological configuration space. For each protein system, the number of replicas and temperatures for each replica was chosen using the temperature predictor server by Patriksson and van der Spoel 126 with a target exchange probability set at 0.2 though the actual exchange was found to be ~0.3 for all systems. All simulations were performed for a total of 750 ns per replicate.

After relaxation, production T-REMD simulations were performed with a frame saving rate of 10 ps and an integration time step of 2 fs. Production simulations were performed, similar to the relaxation simulations, in the NPT ensemble. Unlike the relaxation simulations, the V-rescale (Bussi) thermostat 127 and the Parrinello-Rahman barostat 128, 129 , were used. Regardless of the temperature window, the target pressure for each replicate was set to 1 bar.

For all systems, the measures of the gyration tensor (from which shape anisotropies are derived), solventaccessible surface area (SASA), and pairwise simulation frame versus simulation frame RMSD matrices, and RMSD based clusters, were obtained using a combination in-house VMD 130 scripts, NumPy, and SciPy 131 . The RMSD clustering specifically only considered the lowest temperature replica (310 K), and rapidly generated the pairwise RMSD matrices using the QCP algorithm 132 . Clustering was performed using hierarchical clustering with a complete linkage method, as implemented within SciPy. For generality in evaluating structural diversity, clustering was initially performed based on the RMSD of all heavy protein atoms, and where additional diversity of active sites was of interest for subsequent docking, a second round of clustering was performed based on binding site residues and protein-protein interfaces. VMD atom selections for the docking specific clustering are summarized in SI Table II .

Although T-REMD is an efficient simulation method, and the 310K data do correspond to a formal statistical mechanical ensemble generated at this temperature, as with other enhanced sampling methods the risk is always present that the enhancement of the sampling takes the system to regions of configurational space beyond that that would be significantly sampled by the protein physiologically; for example, to partially or wholly unfolded states. We, therefore, take care to identify these and to not perform docking screens on such configurations.

Two different docking databases were used.

1) A smaller database of potential ligands was built merging together the content of the SWEETLEADS 133, 134 repurposing database SuperDRUG2 135, 136 , and the NCI-diversity database 137 , yielding 13,757 unique compounds. This database has been ensemble docked to all systems, as listed in Table II , with noted targeted binding sites. This database was docked using local HPC clusters using Autodock Vina. 2) Supercomputing docking runs were performed involving billion-plus compound screens of the Enamine database using an accelerated version of Autodock: Autodock-GPU. To date, these runs have been performed on two crystal structures of MPro.

Docking to the target structures obtained from the MD simulations (as listed in Table I ) was performed on various HPC clusters using Vina MPI 138 and MOE. Two sets of structures were used in the ensemble docking. In the first series of docking calculations, only the first 100ns of the T-REMD trajectories were used, and the results of the docking simulations were passed on to collaborators for experimental testing.

In the second series of docking, as the MD trajectories were expanded beyond their initial first 100ns, the clustering was performed on the entire 750s ns trajectories, as described in the results section below.

For the VinaMPI 138 calculations, the "Exhaustiveness" parameter was kept at its default value of 10.

Databases of potential ligands were built merging together the content of the SWEETLEADS 133, 134 , SuperDRUG2 135, 136 , and NCI-diversity databases 137 , yielding 13,757 unique compounds.

Using the program MOE, compounds with more than 49 rotatable bonds were deleted from the database, and only one stereoisomer was included for each compound. Very low molecular weight (<58) compounds (single atoms, ions, very small functional groups). The resulting database included ~9K unique molecules. The compounds were protonated at pH 7 and energy-minimized using the MOE software to obtain low energy 3D structures. The compounds were saved on disk in sdf format and then converted to PDBQT format using AutoDock Tools 139, 140 .

The ligands were docked to 10 clusters per receptor, each cluster corresponding to a different configuration of the binding pocket. The clusters corresponding to the first 100ns of the MD simulations have been uploaded on the publicly available structure repository https://cmbcovid19.flywheelsites.com/data/ additional data from the complete 750ns T-REMD simulations is forthcoming. The residues used to determine the clusters fall into one of three categories: the protein active site, residues at the protein-protein interfaces (for complexes), and all the protein non-hydrogen atoms. Tables II and SI Table II list the receptors and binding sites we have screened so far. 

In general, we have two classes of potential binding sites: 1) catalytic pocket or substrate-binding site and 2) PPI. The first aims at identifying potential competitive inhibitors of the viral proteins, and the second aims at finding compounds potentially disrupting a viral protein-protein complex. Binding site definition requires manual intervention and cannot be easily automated. Examples of definitions are listed below for three viral proteins.

a) In the main protease dimer (PDB: 6WQF), the docking box contains catalytic sites of chain A and PPI residues. The docking box was constructed to align with the peptide-binding groove on either side of the catalytic dyad of chain A, which extends outward to include the S3, S2, S1, S1', and S2' catalytic pockets.

b) In the NSP10-NSP16 complex (PDB: 6W4H), the S-adenosyl methionine (SAM) binding site Asp6928 in NSP16 was considered 97 . In addition, PPI residues such as Tyr4349, Val4295 to Leu4298 in NSP10, and Gln6885 in NSP16 were included. Tyr4349 and Gln6885 interact with each other in SARS-CoV virus 97 , and Val4295 to Leu4298 are hot spot residues in the SARS-CoV-2 virus computationally predicted using the crystal structure along with the KFC2 method 141 , which is based on a machine learning predictive model (https://kfc.mitchell-lab.org). Hot spot residues are the fraction of PPI residues that account significantly for the overall protein-protein binding affinity, and they are typically determined experimentally using alanine scanning mutagenesis 142 . c) In the N-terminal domain of nucleocapsid protein tetramer (PDB: 6VYO), three critical RNAbinding residues on the beta-sheet core were included in docking: Arg88, Arg92, and Arg107 80, 143 144 .

A major aim of this exercise was to see whether it would be possible to dock a billion compounds with full ligand optimization on the OLCF Summit supercomputer in 24 hours of wall-clock time. To perform efficient ensemble docking, we modified AutoDock-GPU 39, 41 , to enable it to run at peak efficiency on the Summit system. For compatibility, OpenCL kernels were re-written in CUDA, and file input and output were streamlined to enable it to keep up with the GPU's speed. These modifications, together with the size of the Summit supercomputer, indeed allow over 1 billion compounds to be docked within 24 hours. This capability will enable giga-compound docking for a number of proteins in the viral proteome and beyond.

We performed initial docking tests using this framework on NSP15 (NendoU) and the main protease (MPro). For NSP15 we used a 9,000 compound dataset composed of the SWEETLEADS 133 database with additional ligands, and also a trimmed version of this dataset containing only ligands containing less than 11 rotatable bonds, consisting of about 5,000 ligands. For tests with MPro, we used a 90,000 ligand subset of the Enamine REAL database 145 . All ligands were prepared with AutoDockTools 140, 146 , and the receptor grids were generated with the program autogrid with a grid spacing of 0.375 Å. We tested a set of search box sizes: 40, 25, 20 and 15 Å 3 , and different settings for the number of runs, nruns, which defines how many separate instances of the genetic algorithm are executed. For the trimmed dataset, we also performed docking with AutoDock Vina with exhaustiveness of 10 to compare results. These results provided us with the confidence to dock over 1 billion compounds from the Enamine real database to two different MPro crystal structures, 5R84 and 6WQF 108 , with a search space 25 Å large on each side, centered on the active site. The analysis of this dataset is ongoing. Due to the documented inaccuracies of force field-based scoring functions in the task of screening and affinity prediction of compounds, 147 rescoring of at least 1 percent of the billion compounds is being performed using the accurate, yet highly computationally efficient machine learning-based rescoring method known as RF-Score-3 148 . Also, at least 50% of those compounds re-scored with RF-Score-3 will be further filtered using recently developed rescoring described below in Future Directions and Preliminary Results from New Methodologies, subsection Protein-ligand rescoring using machine learning.

We performed an analysis of available sequences of the SAR-CoV-2 virus to look for numbers of mutations and map these locations on the proteins we were using as drug discovery targets. All complete, highcoverage genomes labeled as human host SARS-CoV-2 were downloaded from GISIAID 149, 150 on May 5, 2020, yielding a total of 16,252 genomes. Sequences were filtered to remove any genomes with greater than 3% ambiguous (N) nucleotides or were less than 29,000 nucleotides in length, resulting in 14,284 genomes. Multiple sequence alignment of the 14,284 genomes was performed using MAFFT 151 v.7.464 with the --addfragments method using NC_045512.2 (EPI_ISL_402125) as the reference genome and removing insertions relative to the reference. Mature protein-coding sequences for each protein were extracted from the alignment using coordinates from the reference genome and translated using FAST 152 v1.6, with protein sequences containing internal stop codons discarded from further analyses. Shannon entropy 153 was calculated for every column of each protein alignment using a custom script, disregarding ambiguous and gap characters using a custom script. Additionally, the frequency and types of substitutions with respect to the reference were recorded. For visualization of the mutation entropy per residue of the proteins studied in this paper, entropy values were color-coded in protein PDB structures. Known SARS-CoV and SARS-CoV-2 structures were downloaded from the Protein Data Bank, their sequences were aligned with the SARS-CoV-2 reference genome (NC_045512.2) using BLASTP, and the calculated entropy of the sequences was embedded in the PDB file in the place of the B factor column using a custom Python script.

Along with ML-based approaches, quantum mechanics (QM)-based refinement of classical docking results is being developed here as a tool to narrow down the list of promising inhibitor candidates 154 . Until recently, the inclusion of QM electronic structure in high-throughput drug screening was deemed computationally intractable due to the enormous computational resources required even for density functional theory (DFT) calculations. The poor scaling of most quantum chemical methods further exacerbates the situation. A viable emergent alternative is the recently developed linear-scaling version of an approximate, yet remarkably accurate DFT method called "fragment molecular orbital density-functional tight-binding (FMO-DFTB) 155 . This method is implemented in the widely-utilized GAMESS quantum chemistry code 156 . We here report preliminary calculations of FMO-DFTB with the so-called polarizable continuum model (PCM) of the solvent 157 for quantum mechanics-based evaluation of potential COVID-19 spike protein inhibitor drugs identified by re-clustering and re-docking to an extended simulation of the S protein, similar to the initial work by Smith & Smith 25 . In addition, to improve the accuracy in describing non-covalent interactions, the D3 dispersion correction was employed. To obtain the refined binding energy of a given candidate, its unbound geometry, the unbound protein, and its corresponding complex were optimized using FMO-DFTB/PCM. While the unbound ligands were completely optimized, only selective residues in the binding pocket of the unbound protein, and in the protein-ligand complexes were locally optimized. The QM-refined binding energy is defined here as the difference between FMO-DFTB/PCM total energy of the complex and the sum of the total energy of unbound protein plus total energy of unbound ligand. In preliminary work, the QM-refinement was carried out for the Vina top-10 best binders of each spike protein cluster. In total, 15 spike protein clusters were investigated, and the binding energies of 150 protein-ligands complexes were refined.

We present here preliminary results obtained for members of the SARS-CoV-2 proteome. Naturally, ongoing refinements of the results are continually being undertaken, and the results are incomplete. However, they give a snapshot report We found that performance was maximized when running all bonded and nonbonded calculations on the GPUs (interatomic and both particle-mesh Ewald and pairwise Lennard-Jones). With the noted choices, performance saturates at around 100 ns/day for 250,000 atoms and above, even if more nodes are allocated per replica, for two reasons. First, the GPUbased fast Fourier transform is limited to a single GPU, and communication latencies between nodes slow down the calculation. However, throughput around 100 ns/day can still be achieved for simulations above 250,000 atoms if nodes are increased proportionately to system size.

T-REMD simulations were performed with the number of replicas ranging from 20 to 60 for 750 ns each for an aggregate sampling of over 0.6 ms (Table I & SI Table I ). Given the scaling data noted above, for the total 816 replicas simulated, the calculations (is performed simultaneously) used the equivalent of ~18% of the entire Summit supercomputer for ~3 days. The performance thus scales up to ~1 ms/day was the whole machine to be used for different proteins at the same time. For all systems, the replicate temperatures range from 310 K to ~350 K, and the average exchange probabilities were near 0.3.

From the simulations, structural diversity was quantified by calculating, when a binding site is known, the gyration tensor of the binding site residues, the solvent-accessible surface area (SASA) of the binding sites, and the construction of pairwise snapshot-snapshot root-mean-squared deviation (RMSD) matrices for the target temperature replica, i.e., the replica with the temperature set to 310 K (see Methods for calculation details). Additionally, using the gyration tensor, the shape anisotropy of the pockets was also obtained.

Linkage-based RMSD clustering, using the pairwise RMSD matrices was performed to gauge the overall structural diversity of the proteins. 3C and 3E indicate the existence of a number of dominant conformational states. Figs. 3D and 3C further suggest that, although six dominant states exist, these states could be grouped into two 'super-states,' which may indicate a switching like behavior or the potential existence of a 'hinge.' Finally, subfigure B shows a significant amount of sampling of rod-like geometries (anisotropies near 1); however, there are states that have a correlated reduction in SASA and shape anisotropy, which would correspond with a nearly continuous transition between rod-like structures and spheroid-like structures. The general conformation variation highlighted by Figs. 2 and S1-22, to some degree, masks the conformational variation within binding sites; however, when for docking to the individual binding sites, clusters within the T-REMD trajectory are identified and demonstrate significant variability within the active site region (Fig. 4) . While not specifically active site residues, residue variability at the tip of the loop centered on Y266 and the charged residue pair R164-E165 near the active site imply that accounting for the protein conformational ensemble is essential. Otherwise, the docking calculations would be strongly biased by the rotameric states present in the single static structure used in typical single-structure docking calculations.

A preliminary analysis was performed of general trends seen in docking the smaller database to the 23 SARA-CoV-2 protein systems. For each protein target, all the docking results from each of the 10 cluster configurations were combined, and the top 500 scoring compounds extracted. The selectivity of the compounds for any given target varies considerably (Table III) with the number of compounds present on any two different top 500 lists as low as 171 or as high as 283. In comparison, from two random selections of 500 items out of 9,014 items (see SI Figure 23 , 5% percentile = 19 compounds, 95% percentile = 36 compounds), 27 identical compounds would be expected on average. Thus, the high number of identical top-scoring compounds observed between any two targets indicates a non-random selection of these duplicate compounds. For any particular target, the number of non-duplicate compounds is relatively low, ranging from 8 to 52 (Table III) . The majority the compounds selected bind to a single target ( Figure 4) . However, of all the compounds that are found in the top 500 lists, over half are calculated to bind to 3 or more targets. Molecular weight was found to be only weakly related to the number of protein targets a compound is calculated to bind to (See SI Figure 24 ). Therefore, the overlap in the top-scoring compounds is not an artifact of the size of the ligand. In the absence of a systematic experimental assay on each of these compounds, it is difficult to assess the significance of the overlap in the top 500 compounds. It is important to note that the "top compounds" are assembled based on relative docking scores between compounds against the same target, and not based on their absolute docking scores, which could artifactually inflate the number of duplicates. A high number of duplicates between the lists obtained on two different proteins could also indicate a computational bias of some compounds based on other criteria than their good fit to the targets. On the other hand, such high numbers could correctly identify promiscuous binding sites that do not display marked structural specificity and hence could be indeed targeted by similar compounds. It is outside of the scope of the present work to assertively differentiate between these two possibilities. However, the number of duplicates varies greatly across several pairs of targets, which renders unlikely a systematic bias in the docking (because of, say, molecular weight or other ligand properties independent of the target's binding site).

Only ~55% of the top 500 compounds were the same in the docking results from the 100 ns and 750 ns clusters. Thus, extending the T-REMD simulation time by a factor of 7.5 nearly doubled the chemical diversity. Future analysis will be needed to indicate if the compounds that are identical in both sets of docking calculations are promiscuous compounds that would bind to many protein structures or if many of the clusters from the MD trajectories end up being selected by the same compounds.

We found that for the ligands with fewer numbers of rotatable bonds, such as found in the Enamine dataset, a docking calculation using 20 repeated runs could be performed in 0.5-2.5 seconds when using the Summit GPU (Fig. 5) . The same set of ligands docked with Vina on Summit's CPUs showed a large spread of timings, with some ligands requiring nearly five minutes to complete (Fig. 5 ). In practice, this means that with GPU-enabled docking, it is feasible to flexibly dock a billion compounds in about a day on modern supercomputers, whereas with Vina, a similar calculation would require a multi-year effort on a university cluster. We confirmed that for ligands with less than 11 rotations, the Solis-Wets algorithm in AD-GPU provides equivalent results to the new ADADELTA algorithm 40 . For the trimmed dataset, the top 5% of scores obtained with AD-GPU using the Solis-Wets algorithm formed an intersection with the top 5% of scores from Vina consisting of 18% of each top 5% set. Analysis of the full billion ligand sets is currently ongoing.

The mutation frequency of the proteins simulated in this study is generally low. It should be noted, however, that it is as yet early in the history of SAR-CoV-2, and thus increased relative variability of residues along the proteome may indicate the propensity of those residues for future mutations. We did find higher variability, given by the entropy values, in other SARS-CoV-2 proteins not included in this study; in particular, the Spike mutation D614G noted in other reports continues to be seen with high frequency since being described in a recent preprint that performed an analysis on GISAID through April 13 158 . We counted 9107 D614G mutations (up from 3577 found April 13) and calculated entropy of 0.94 for this residue. The NSP12 RdRp protein also shows a large mutation entropy at residue 323, with a mutation entropy of 0.95. This residue, P, has mutated to L 9078 times (and F 3 times). Note that not every protein was represented in all sequences used for entropy calculations. Other regions of the genome with higher entropy values (greater than 0.5) are residues 203 and 204 of orf9 (entropy 0.70 and 0.69, respectively), residue 85 of NSP2 (0.74), 37 of NSP6 (0.57), 57 of orf3a (0.81), and 84 of orf8 (0.56).

The highest entropy found among the structures in this study was in the main protease, with an entropy of 0.13 for residue 15, a glycine. We found 261 G15S mutations and one G15D mutation in our dataset. The MPro also has a number of other residues with relatively high entropies, including residue 90, with entropy 0.07 and 117 K90R mutations, and 266 with entropy 0.04 and 64 A266V mutations. After this, the next highest entropy was 0.06 for the N-terminal region of the N protein and also for NSP15. These are displayed in Figure 6 . An entropy of 0.04 was also found for domain X of NSP3 (glycine 76). A lower mutation entropy was found in PLPro, compared to MPro, with the highest value being 0.03. These mutations are important to consider when choosing targets for drug discovery, in that a protein that seems to be more rapidly mutating could potentially lead to an ineffective therapeutic if mutations alter the shape of the drugbinding site. In the case of MPro, the highest entropy mutations were not found in the active site; however, it is possible that they may still affect its conformation indirectly. The reduced mutation entropy for PLPro may indicate that an effort to target a protease could meet with fewer mutation-related problems if targeting PLPro rather than MPro. 

As emphasized above, this article is a progress report on an ongoing project. The development of the pipeline is continuing with advances being made in several directions. Notably, we are incorporating artificial intelligence and machine learning into rescoring ligand ranking and clustering the MD trajectories. Further, we are developing methods to rescore docking using quantum chemical approaches. Although these developments have not been incorporated into the pipeline at the time of writing and were not applied to generate the results described above, we report on progress with them here.

The deluge of data generated from simulations such as the T-REMD runs reported here can make traditional approaches of machine learning and clustering approaches (based on measures of similarity in the RMSDspace, or other metrics) quite challenging. Often, practical aspects of computing dictate the use of subsample tracts of the MD data itself or use of prior knowledge about these datasets (e.g., knowing that the ligand binds only in a certain orientation) to filter such datasets. Deep learning techniques can be particularly valuable in 'sifting' through large datasets and can be powerful for clustering T-REMD simulations. We are investigating the use of a variational autoencoder with convolutional filters (CVAE), previously developed to cluster protein folding trajectories 159, 160 , to cluster the T-REMD simulations of NSP15. As shown in Fig. 7 , we find that the latent dimensions learned from the simulations indeed cluster the simulation data into a small number of conformational states. These states correspond to transitions observed in the simulations, as seen from various measured observables from the data such as the binding site RMSD, SASA, and the radius of gyration. The outcomes from the clustering provide insights into aspects of how the T-REMD simulations have sampled the conformational landscape -for example, in the case of this protein, as observed in Panel C, there is only one conformational state which has sampled a large SASA, indicating a potentially open state (which has only a minor change in the overall RMSD, Panel B). This information can be particularly helpful for selecting conformations and identifying metastable states for docking simulations 161 .

The computational identification of drug compounds, and small molecules in general, that bind to a protein consists of three distinct tasks: 1) identifying a putative conformation of the protein-ligand complex (the docking problem); 2) given a docked conformation, determining whether or not the ligand is a true binder (the screening problem); and 3) is determined to be a true binder, ascertaining a relative, or better yet, absolute binding affinity (the affinity, or scoring, problem). In principle, one could perform the screening and affinity prediction problems using molecular dynamics techniques such as free energy perturbation, thermodynamic integration, or more approximate methods such as MM/PB(GB)SA (molecular mechanics/Poisson-Boltzmann[generalized Born]-surface area). However, this is computationally intractable for large numbers of compounds, even with supercomputers, and the accuracy can often be poor. Furthermore, these rigorous first principles-based methods assume a putative binding site, and cannot be applied to cases where the binding site is unknown. While the score or energy given by computational docking programs such as AutoDock Vina is reasonably well-suited for docking pose prediction, improvements are possible on the screening and affinity problems, and for this, we use here machine learning.

There is an ongoing need for the development of computationally tractable models that can be easily validated on benchmark docking data sets. To this end, accurate, physics-based, machine-learned models for the docking and affinity have been trained using the PDBbind database, a dataset consisting of experimentally determined protein-ligand complex structures with accurate experimental binding affinities [162] [163] [164] [165] . On an independent data set, the CASF-2013 benchmark 166, 167 , affinity prediction (random forestbased) models achieve, at best, a Pearson correlation (R2) of 0.86, and docking pose prediction classifiers achieve an area under the curve of the receiver operator characteristic (AUC of ROC) of 0.91 using support vector machines (Demerdash et al., In Review) . While the random forest model trained on unnormalized features achieved the best R2 at 0.86, a range of additional models (trained with random forest, gradient boosted trees, or support-vector machines using normalized or unnormalized features) achieved R2 of 0.81-0.85. Regarding the docking pose prediction, the model used here achieves greater enrichment for nativelike structures (78%) than AutoDock Vina (63%) (Demerdash et al., In Review).

A model dedicated to virtual screening as a first step in triaging candidate molecules was developed. Once again, as with affinity and pose prediction, this model requires docked structures as input. This model is trained to discriminate between active and inactive compounds, and affinity ranking is performed as a second step only on the true active compounds. To this end, a support-vector machine-based model using the Dataset of Useful Decoys-Enhanced, a database of 102 proteins with experimentally verified active and inactive compounds, has been trained 168, 169 . Preliminary performance on an independent validation set is encouraging, achieving AUC of ROC of 0.80 and recall of 0.76; that is, 76% of experimentally validated true positives were predicted positively by the model. This model is currently being subjected to further optimized, primarily through the calculation of additional physicochemical descriptors (features) and the optimization of hyperparameters.

Due to the urgent nature of the Covid-19 drug discovery campaign, computational expediency precluded calculating features on all docked structures for a given compound and, in turn, precluded running the docking pose classifier on the output of AutoDock Vina. Therefore, we relied on AutoDock Vina's ranking and not the machine-learned docking pose classifier, thereby reducing the number of feature calculations that must be performed and increasing the throughput. (Parallelization efforts and code optimization are underway, so that feature calculation on all docked poses and subsequent application of the docking pose classifier becomes less onerous.) The virtual screening model was applied to these top-scoring structures from AutoDock Vina, generating a "binder" vs. "non-binder" classification. Subsequently, affinity prediction models were applied to just those complex structures classified as "binder." The affinity prediction was performed using the range of high-performing models on docked structures corresponding to each MD cluster representative used in ensemble docking (See Ensemble Docking with HPC Methods for details.). This results in affinity predictions on typically 10 cluster representatives, each with affinity predictions from 5 machine-learned models (1 SVM, 2 boosted tree, and two random forest approaches), resulting in 50 "cases." For each case, the top-500 ligands in terms of predicted affinity, were obtained. Molecules that appeared in the top 500 in at least 25 of the 50 cases were deemed hits and are presently undergoing experimental testing.

In a preliminary evaluation of the accuracy of FMO-DFTB/PCM in describing the interactions between ligands and the S-protein, we compare FMO-DFTB/PCM pair interaction energy (PIE) to that of the higherlevel, but more expensive FMO-MP2/PCM method. The PIEs were calculated for ligands binding to the Sprotein in the binding pocket. Figure 7 shows that FMO-DFTB/PCM interaction energies agree very well with high-level ab initio FMO-MP2/PCM data with the R correlation coefficient; in this case, it is 0.984. The high correlation between FMO-DFTB/PCM PIE and FMO-MP2/PCM PIE indicates that FMO-DFTB/PCM may be a fast and reliable QM-based method for interaction energy calculations.

An updated preliminary homology model and T-REMD simulation similar to that reported by Smith & Smith of the S-protein RBD and re-docking to new clusters from followed by a re-evaluation of the top scoring 150 complexes with FMO-DFTB/PCM was performed as follows: 15 protein conformations and their ten strongest binding ligands predicted by AutoDock Vina were selected, and geometry optimizations were performed at the FMO-DFTB3-D3(BJ)/3ob/PCM level of theory. The binding energies of the top-3 best binders ranked by FMO-DFTB/PCM are listed in Table IV . According to our preliminary results, although FMO-DFTB/PCM agrees well with Vina in categorizing the strong binders, with all Vina, top-10 ligands have considerably stronger FMO-DFTB/PCM binding energies ( E bind < -12 kcal/mol) and the QMbased ranking is significantly different from the Vina ranking. It is important to note that the current FMO-DFTB/PCM energy, which is based on solvent-corrected binding interaction energies is not the binding free energy. Various additional contributions to the binding free energy can be separately evaluated, and work is underway in this regard. For example, an entropic contribution can be estimated from vibrational frequencies once the requisite Hessian matrix is available.

The present manuscript reports on the establishment of a supercomputer-based virtual high-throughput screening ensemble-docking pipeline that takes into account the dynamic properties of protein targets, as well as preliminary results on simulations and docking screens to a number of protein targets from SARS-CoV-2.

The speed at which structural data have been derived experimentally for the SARS-CoV-2 proteome means that several of the simulations reported above were 'out of date' almost immediately. By this, we mean that the simulations were performed using models derived from experiments that had been superseded by higher-resolution or more complete data. Examples of these are the S-protein, MPro, N Protein, and NSP9. Clearly, as information on structures increases in quality, simulations will be further repeated. Furthermore, the complexity of the structural models derived is expected to increase. For example, models of the S protein interacting with the viral envelope or extending up to the complete virion can be envisaged and, in principle at least, incorporated into drug screening protocols.

The present results provide comprehensive simulation models for 8 of the viral proteins in 23 molecular systems. T-REMD is well suited for massively parallel supercomputing because many replicas are run simultaneously, and they need to communicate with each other. In the present tests, 350ns/day/replica was obtained for the smaller, and this, therefore, scales up to about 1.5ms/day of aggregate MD time, given the hypothetical situation that one had about 100 different proteins to run of roughly the same size. For bigger systems, with 10 5 atoms, the throughput is lower, about 1.0 ms/day. Nevertheless, it is clear that extensive simulation data can be obtained on many proteins with a short time-to-solution on this machine. As one possible future direction, one might envisage running T-REMD on the 44 drug targets that have been suggested as a minimal screen for the toxicity effects in human drug trials 170 .

The ensemble docking performed so far mostly involves repurposing databases and therefore is limited to about 10k compounds. Many of these compounds are predicted to be quite promiscuous in binding to the targets. Two of the compounds identified in the top 1% of our preliminary S-protein screen have been reported to be in two registered clinical trials (quercetin and hypericin). Further, several compounds from the screens reported above show activity in reducing live viral infectivity: these results will be reported elsewhere.

The docking results using the smaller database were not run on Summit, because of the fact that for Summit code running on GPUs is preferred. However, as COVID-19 therapeutic research moves beyond repurposing to the discovery of novel compounds, there is a need to quickly screen many more compounds. Therefore, we have installed Autodock-GPU and demonstrated that it is capable of screening 1 billion compounds on Summit in 12 hours when scaled to the whole machine. Although several other groups have reported billion-compound screens, these have been using AI approaches or rigid docking without pose optimization 171, 172 . The present billion-compound screen calculations, therefore, represent a potential supercomputer-driven paradigm shift in computational drug discovery and can be envisaged to be performed on dozens of proteins in a single day when the exascale era of supercomputing arrives, as planned for 2021. Chain1: SER1 GLY2 PHE3 ARG4  LYS5 MET6 ALA7 VAL125 TYR126  GLN127 CYS128 ALA129 ARG131  THR135 ILE136 LYS137 GLY138  SER139 GLY170 VAL171 HSD172  ASP197 THR198 THR199 VAL204  TRP207 ILE213 ASN214 ASP216  TRP218 TYR239 ILE281 

Innovation in the pharmaceutical industry: New estimates of R&D costs

Challenges of antibacterial discovery

Viracept (nelfinavir mesylate, AG1343): a potent, orally bioavailable inhibitor of HIV-1 protease

Rational design of potent sialidase-based inhibitors of influenza virus replication

An open-source drug discovery platform enables ultra-large virtual screens

Ensemble Docking in Drug Discovery

Implications of protein flexibility for drug discovery

Method for Including the Dynamic Fluctuations of a Protein in Computer-Aided Drug Design

Ensemble Docking in Drug Discovery

Structural and Functional Evidence for Testosterone Activation of GPRC6A in Peripheral Tissues

Evidence for osteocalcin binding and activation of GPRC6A in β-cells

Ensemble-based docking: From hit discovery to metabolism and toxicity predictions

A computationally identified compound antagonizes excess FGF-23 signaling in renal tubules and a mouse model of hypophosphatemia

Reviving Antibiotics: Efflux Pump Inhibitors That Interact with AcrA, a Membrane Fusion Protein of the AcrAB-TolC Multidrug Efflux Pump

Structure-based design of broadly protective group a streptococcal M proteinbased vaccines

Identification and Structure-Activity Relationships of Novel Compounds that Potentiate the Activities of Antibiotics in Escherichia coli

Ensemble docking to difficult targets in early-stage drug discovery: Methodology and application to fibroblast growth factor 23

Polycystin-1 interacts with TAZ to stimulate osteoblastogenesis and inhibit adipogenesis

Computationally identified novel agonists for GPRC6A

Ensemble Docking in Drug Discovery: How Many Protein Configurations from Molecular Dynamics Simulations are Needed To Reproduce Known Ligand Binding

Repurposing Therapeutics for COVID-19: Supercomputer-Based Docking to the SARS-CoV-2 Viral Spike Protein and Viral Spike Protein-Human ACE2 Interface

Repurposing Therapeutics for COVID-19: Supercomputer-Based Docking to the SARS-CoV-2 Viral Spike Protein and Viral Spike Protein-Human ACE2 Interface

Ensemble Docking in Drug Discovery: How Many Protein Configurations from Molecular Dynamics Simulations are Needed To Reproduce Known Ligand Binding?

GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers

Best bang for your buck: GPU nodes for GROMACS biomolecular simulations

GROMACS: Fast, flexible, and free

Parallel tempering: Theory, applications, and new perspectives

Parallel tempering algorithm for conformational studies of biological molecules

Replica-exchange molecular dynamics method for protein folding

Multidimensional replica-exchange method for freeenergy calculations

Enhanced sampling techniques in molecular dynamics simulations of biological systems

iScreen: world's first cloud-computing web server for virtual screening and de novo drug design based on TCM database@Taiwan

HPC Cloud Technologies for Virtual Screening in Drug Discovery

A semiempirical free energy force field with charge-based desolvation

Grid-based hydrogen bond potentials with improved directionality

Comparison of affinity ranking using AutoDock-GPU and MM-GBSA scores for BACE-1 inhibitors in the D3R Grand Challenge 4

Evaluating the Energy Efficiency of OpenCL-accelerated AutoDock Molecular Docking

Accelerating autodock4 with gpus and gradient-based local search

The fragment molecular orbital method: theoretical development, implementation in GAMESS , and applications. WIREs Computational Molecular Science

News Feature: Avoiding pitfalls in the pursuit of a COVID-19 vaccine

Coronavirus susceptibility to the antiviral remdesivir (GS-5734) is mediated by the viral polymerase and the proofreading exoribonuclease

Broad spectrum antiviral remdesivir inhibits human endemic and zoonotic deltacoronaviruses with a highly divergent RNA dependent RNA polymerase

Comparative therapeutic efficacy of remdesivir and combination lopinavir, ritonavir, and interferon beta against MERS-CoV

SARS-CoV-2 vaccines: status report

Prophylactic and therapeutic remdesivir (GS-5734) treatment in the rhesus macaque model of MERS-CoV infection

Dexamethasone in the management of covid -19

Characterization of influenza virus variants induced by treatment with the endonuclease inhibitor baloxavir marboxil

Crystal structure of Nsp15 endoribonuclease NendoU from SARS-CoV-2

Research and Development on Therapeutic Agents and Vaccines for COVID-19 and Related Human Coronavirus Diseases

Neutralizing Antibodies against SARS-CoV-2 and Other Human Coronaviruses

Virus factories, double membrane vesicles and viroplasm generated in animal cells

A Multibasic Cleavage Site in the Spike Protein of SARS-CoV-2 Is Essential for Infection of Human Lung Cells

Biogenesis and dynamics of the coronavirus replicative structures

Genome composition and divergence of the novel coronavirus (2019-nCoV) originating in China

Structure of Mpro from SARS-CoV-2 and discovery of its inhibitors

Structural basis for inhibition of the RNA-dependent RNA polymerase from SARS-CoV-2 by remdesivir

The anti-influenza virus drug, arbidol is an efficient inhibitor of SARS-CoV-2 in vitro

RNA-virus proteases counteracting host innate immunity

Nsp3 of coronaviruses: Structures and functions of a large multi-domain protein

The nonstructural proteins directing coronavirus RNA synthesis and processing

Determining druggability

FragLites-Minimal, Halogenated Fragments Displaying Pharmacophore Doublets. An Efficient Approach to Druggability Assessment and Hit Generation

Analysis of therapeutic targets for SARS-CoV-2 and discovery of potential drugs by computational methods

Antiviral drug resistance

Viral mutation rates

Ancestral and compensatory mutations that promote antiviral resistance in influenza N1 neuraminidase revealed by a phylonumerics approach

Phylogenetic Evidence for Deleterious Mutation Load in RNA Viruses and Its Contribution to Viral Evolution

Potential therapeutic targets and promising drugs for combating SARS-CoV-2

Crystal structure of Nsp15 endoribonuclease NendoU from SARS-CoV-2

Structure of M pro from SARS-CoV-2 and discovery of its inhibitors

Structure of the SARS-CoV-2 spike receptor-binding domain bound to the ACE2 receptor

Crystal structure of SARS-CoV-2 nucleocapsid protein RNA binding domain reveals potential unique drug targeting sites

The acidic environment of the Golgi is critical for glycosylation and transport

Mechanisms of pH regulation in the regulated secretory pathway

Coronaviruses -drug discovery and therapeutic options

Parameters of Monovalent Ions in the AMBER-99 Forcefield: Assessment of Inaccuracies and Proposed Improvements

Improved Parametrization of Li+, Na+, K+, and Mg2+ Ions for All-Atom Molecular Dynamics Simulations of Nucleic Acid Systems

Computer simulations of alkali-acetate solutions: Accuracy of the forcefields in difference concentrations

Polarizable Force Fields for Biomolecular Simulations: Recent Advances and Applications

The SARS coronavirus nucleocapsid protein-forms and functions

Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2): An overview of viral structure and host response

The SARS-coronavirus papain-like protease: structure, function and inhibition by designed antiviral compounds

Crystal structures of SARS-CoV-2 ADP-ribose phosphatase (ADRP): from the apo form to ligand complexes

RNA-virus proteases counteracting host innate immunity

Structure of Mpro from SARS-CoV-2 and discovery of its inhibitors

Non-structural Protein 9

Crystal structure of Nsp15 endoribonuclease NendoU from SARS-CoV-2

The crystal structure of nsp10-nsp16 heterodimer from SARS-CoV-2 in complex with Sadenosylmethionine

Biochemical and Structural Insights into the Mechanisms of SARS Coronavirus RNA Ribose 2′-O-Methylation by nsp16/nsp10 Protein Complex

Coronavirus non-structural protein 16: Evasion, attenuation, and possible treatments

Efficient Estimation of Word Representations in Vector Space

The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends® in Information Retrieval

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

CHARMM-GUI: a web-based graphical user interface for CHARMM

Modeling Structural Coordination and Ligand Binding in Zinc Proteins with a Polarizable Potential

CysxHisy-Zn2+ interactions: Possibilities and limitations of a simple pairwise force field

Molecular dynamics study of human carbonic anhydrase II in complex with Zn2+ and acetazolamide on the basis of all-atom force field simulations

Biosynthesis, purification, and substrate specificity of severe acute respiratory syndrome coronavirus 3C-like proteinase

Crystal structure of SARS-CoV-2 main protease provides a basis for design of improved α-ketoamide inhibitors

Structural plasticity of SARS-CoV-2 3CL M(pro) active site cavity revealed by room temperature X-ray crystallography

The crystal structures of severe acute respiratory syndrome virus main protease and its complex with an inhibitor

Structure of Main Protease from Human Coronavirus NL63: Insights for Wide Spectrum Anti-Coronavirus Drug Design. Sci Rep

Structure of coronavirus main proteinase reveals combination of a chymotrypsin fold with an extra αhelical domain

Evidence for substrate binding-induced zwitterion formation in the catalytic Cys-His dyad of the SARS-CoV main protease

Revealing the Molecular Mechanisms of Proteolysis of SARS-CoV-2 Mpro from QM/MM Computational Methods

Revealing the molecular mechanisms of proteolysis of SARS-CoV-2 Mpro by QM/MM computational methods

PDB2PQR: expanding and upgrading automated preparation of biomolecular structures for molecular simulations

TopoGromacs: Automated Topology Conversion from CHARMM to GROMACS within VMD

GROMACS 4: Algorithms for Highly Efficient, Load-Balanced, and Scalable Molecular Simulation

CHARMM36m: an improved force field for folded and intrinsically disordered proteins

Optimization of parameters for molecular dynamics simulation using smooth particle-mesh Ewald in GROMACS 4.5

A parallel linear constraint solver for molecular simulation

Molecular dynamics with coupling to an external bath

Enhanced sampling techniques in molecular dynamics simulations of biological systems

Parallel tempering algorithm for conformational studies of biological molecules

How Efficient Is Replica Exchange Molecular Dynamics? An Analytic Approach

Enhanced sampling in molecular dynamics

A temperature predictor for parallel tempering simulations

Canonical sampling through velocity rescaling

polymorphic transitions in single-crystalls -a new molecular dynamics method

Constant pressure molecular dynamics for molecular systems

VMD: visual molecular dynamics

SciPy and NumPy: an overview for developers

Rapid calculation of RMSDs using a quaternion-based characteristic polynomial

SWEETLEAD: an in silico database of approved drugs, regulated chemicals, and herbal isolates for computer-aided drug discovery

SWEETLEAD: an in silico database of approved drugs, regulated chemicals, and herbal isolates for computer-aided drug discovery

SuperDrug: a conformational drug database

SuperDRUG2: a one stop resource for approved/marketed drugs

Diversity Set Information

VinaMPI: Facilitating multiple receptor highthroughput virtual docking on high-performance computers

AutoDock and AutoDockTools for protein-ligand docking: Beta-site amyloid precursor protein cleaving enzyme 1 (BACE1) as a case study

Using autodock for ligand-receptor docking

KFC2: a knowledge-based hot spot prediction method based on interface solvation, atomic density, and plasticity features

Evolution of In Silico Strategies for Protein-Protein Interaction Drug Discovery

Repurposing naproxen as a potential antiviral agent against SARS-CoV

Structural basis of RNA recognition by the SARS-CoV-2 nucleocapsid phosphoprotein

Enamine real database: Making chemical diversity real. Chemistry today

Evaluation of AutoDock and AutoDock Vina on the CASF-2013 Benchmark

Improving AutoDock Vina Using Random Forest: The Growing Accuracy of Binding Affinity Prediction by the Effective Exploitation of Larger Data Sets

GISAID

Parallelization of the MAFFT multiple sequence alignment program

FAST: FAST Analysis of Sequences Toolbox

A Mathematical Theory of Communication

Accurate Scoring in Seconds with the Fragment Molecular Orbital and Density-Functional Tight-Binding Methods

Density-functional tight-binding combined with the fragment molecular orbital method

Recent developments in the general atomic and molecular electronic structure system

The polarizable continuum model (PCM) interfaced with the fragment molecular orbital method (FMO)

Spike mutation pipeline reveals the emergence of a more transmissible form of SARS-CoV-2

Deep clustering of protein folding simulations

Mechanism of glucocerebrosidase activation and dysfunction in Gaucher disease unraveled by molecular dynamics and deep learning

DeepDriveMD: Deep-Learning Driven Adaptive Molecular Simulations for Protein Folding

The PDBbind database: Collection of binding affinities for protein− ligand complexes with known three-dimensional structures

The PDBbind database: methodologies and updates

Comparative assessment of scoring functions on a diverse test set

Comparative assessment of scoring functions on an updated benchmark: 2. Evaluation methods and general results

Comparative assessment of scoring functions on an updated benchmark: 2. Evaluation methods and general results

Comparative assessment of scoring functions on an updated benchmark: 1. Compilation of the test set

Benchmarking Sets for Molecular Docking

Directory of Useful Decoys, Enhanced (DUD-E): Better Ligands and Decoys for Better Benchmarking

Reducing safety-related drug attrition: the use of in vitro pharmacological profiling

Rapid identification of potential inhibitors of SARS-CoV-2 main protease by deep docking of 1.3 billion compounds

An open-source drug discovery platform enables ultra-large virtual screens

This work was made possible in part by a grant of high-performance computing resources and technical support from the Alabama Supercomputer Authority to JB and KB.