Using Bayesian Optimization to Accelerate Virtual Screening for the Discovery of Therapeutics Appropriate for Repurposing for COVID-19

Edward O. Pyzer-Knapp

2020-05-11

Abstract

The novel coronavirus known as SARS-CoV-2, which emerged in Wuhan, has had an almost unprecedented impact for a non-wartime setting, hitting social, economic and health systems hard. Being able to bring pharmaceutical interventions to bear against its effects would represent a major turning point in this ongoing battle. Recently, the world's most powerful supercomputer, Summit, was used to identify existing small-molecule pharmaceuticals which may have the desired activity against SARS-CoV-2 through a high-throughput virtual screening approach. In this communication, we demonstrate how Bayesian optimization can provide a valuable service for the prioritisation of these calculations, leading to the accelerated identification of high-performing candidates, and thus expanding the scope of the utility of HPC systems for time-critical screening.

Introduction

The virus now known as SARS-CoV-2 appears to have originated in Wuhan, in Hubei Province, China, at the end of 2019. [12, 24] It has since become a global pandemic, impacting the economic, social and health systems of almost every country in the world. In a race to discover potential pharmaceutical interventions, some researchers have turned to high-throughput virtual screening as an avenue to prioritise the screening of existing commercially available, or pharma-internal proprietary, compounds for activity against the virus. Screening at scale has been enabled by an unprecedented communal activity known as the COVID-19 HPC Consortium. [1]

It is anticipated that there will be high demand for HPC resources such as those provided by the consortium, and so an ability to 'power up' smaller compute clusters to accelerate the (re-)discovery of pharmaceutical interventions through in silico methods could prove very valuable. In this paper we propose the use of Bayesian optimization as such a technology, and demonstrate its potential on an early dataset collected on the IBM Summit supercomputer by Smith et al. [21]

High Throughput Virtual Screening

High throughput virtual screening (HTVS) has emerged as a powerful methodology for the identification of candidate molecules for materials and pharmaceutical applications. [19, 11, 8] Its benefits over traditional high-throughput methods, such as robotic assays, stem mainly from two key areas:

1. Once there is a working 'virtual protocol' for an experiment, scaling that experiment is driven mainly by access to computational resources.
2. A virtual high-throughput method is not limited to considering molecules which already exist in the lab, and thus does not differentiate between molecules on the basis of synthetic complexity or stability.

The cloud, in particular, has given more and more individuals and organisations access to computational resources sufficient to perform HTVS, at least on reasonable timescales. [17, 6, 5, 4] It is worth noting, however, that the performance of the cloud is still insufficient for some HPC applications. [15] For some problems, the urgency of the outcome dictates that a scale of computing and performance not yet feasible in the cloud be used.
One such example is the work of Jeremy Smith et al., who used the world's fastest supercomputer, Summit, [2] to screen over 8,000 known biologically active compounds, including many approved drugs, for potential activity against the SARS-CoV-2 virus through a study of their binding to the viral S-protein and to the S-protein-human ACE2 interface. [21] Smith used restrained-temperature molecular dynamics to locate six different configurations of the host protein, and then used the molecular docking code AutoDock Vina [23] to score potential poses of the small-molecule pharmaceuticals included in the SWEETLEAD library. [16] The use of the Summit supercomputer was deemed necessary to reduce the time required to score the entire library of more than 8,000 compounds to days, rather than the months which would normally be expected.

Bayesian Optimization

Bayesian optimization is a black-box optimization algorithm commonly used in cases where the evaluation of data points is computationally or financially expensive, or in other situations where the number of evaluations of potential configurations must be minimized. It has recently become popular in the machine learning community for the tuning of hyper-parameters of machine learning models, but has also demonstrated a significant ability to accelerate scientific tasks such as the discovery of new materials [10] and pharmaceutically active ingredients [18], and the parameterisation of classical force fields. [13]

At a high level, the algorithm follows a relatively simple flow, shown in Figure 1. The lack of access to an analytical form of the objective is mitigated through the use of a probabilistic surrogate model, often a Gaussian process [22, 3], although other Bayesian models can be used. [10] Once the model has been used to calculate predictive means and variances (uncertainties), it is possible to calculate a value commonly termed the improvement:

\gamma = \frac{f^* - y_{\mathrm{pred}}}{\sigma}

where y_pred represents a prediction from the Bayesian model, \sigma^2 represents the corresponding model uncertainty, and f^* represents the best value found thus far. This value can then be treated with a number of strategies, known as acquisition functions, which determine how valuable each data point is to acquire. Perhaps the most commonly used acquisition function is known as Expected Improvement [14] and takes the following form:

\mathrm{EI} = \sigma \left( \gamma \, \Phi(\gamma) + \phi(\gamma) \right)

where \Phi(\gamma) and \phi(\gamma) represent the CDF and PDF of the standard normal distribution applied to the improvement function.

A challenge with the Expected Improvement methodology is that it is an inherently serial algorithm, since in order to generate a new posterior, new information must be added. Several methodologies have evolved to tackle this problem, whether by partitioning the acquisition function, for example using K-means, [9] or by penalising the acquisition function against selecting certain points. [7] In this study we use a batch variant of Thompson sampling, PDTS, [10] to achieve the parallel acquisition necessary to exploit high-performance computing clusters. PDTS was chosen in part for its excellent performance at HTVS over large materials databases, and in part for its excellent scalability to input spaces with large numbers of dimensions, which are frequently found in chemical problems.

For this proof-of-concept study, simulations were not run on the Summit supercomputer; instead, the openly released results of Jeremy Smith et al. [21] were used as an oracle.
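For concreteness, the following is a minimal sketch of the improvement and Expected Improvement calculations defined above, written for the minimization setting used in this study; the function and variable names are illustrative and do not come from any released code.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(y_pred, sigma, f_best):
    """Expected Improvement for minimization.

    y_pred : array of predictive means from the surrogate model
    sigma  : array of predictive standard deviations (model uncertainty)
    f_best : best (lowest) objective value observed so far
    """
    sigma = np.maximum(sigma, 1e-12)          # guard against zero variance
    gamma = (f_best - y_pred) / sigma         # the 'improvement' defined above
    return sigma * (gamma * norm.cdf(gamma) + norm.pdf(gamma))
```

In a serial loop, the point maximising this value is evaluated next and the posterior is refit; that refit-per-point requirement is exactly the seriality which batch methods such as PDTS relax.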
There is, of course, no reason why this could not be repeated on new problems using a resource such as Summit, especially as Smith et al. have open-sourced the necessary deployment scripts. For the purposes of this study, we cast the two situations described in [21] as two minimization challenges, the aim being to locate particularly low VINA scores efficiently. We understand that VINA scores are not a perfect indicator of performance, and thus entreat the reader to consider the optimization as a prioritisation exercise: there is nothing to stop the entire library being calculated, but clearly the order in which these calculations happen can have a significant effect on the speed of discovery of candidates for repurposing.

Each molecule was described by an extended-connectivity fingerprint (ECFP) with a radius of 2, hashed to 512 bits, as this was sufficient to ensure that there were no clashes between molecules in the library. [20] Clashes arise when two molecules are given the same fingerprint, so choosing a fingerprint length is a trade-off between collision risk (longer fingerprints are less likely to clash) and computational cost (longer fingerprints make the machine learning more expensive, and harder).

A Gaussian process model was used for the construction of the acquisition function, using a Matérn kernel of the form:

k(x, x') = \theta_0 \left( 1 + \frac{\sqrt{5}\,r}{l} + \frac{5 r^2}{3 l^2} \right) \exp\!\left( -\frac{\sqrt{5}\,r}{l} \right), \qquad r = \lVert x - x' \rVert

where the hyperparameters of the kernel are optimized to maximise the log marginal likelihood of the model.

The PDTS algorithm was used to generate batches of molecules to send to the oracle. For this application, a batch size of 10 was used to balance the granularity of regular posterior updates against the obvious efficiency benefit of operating in parallel. The sampler was seeded with 15 randomly selected molecules and then run for 50 epochs, resulting in a set of just over 500 molecules. This set size was chosen because it was estimated that more commonly available compute capability could process a set of this size in the same number of days that Summit took to run the full set of over 8,000 molecules.

In order to evaluate the effectiveness of the Bayesian optimization approach, we believe there are two key axes to evaluate:

1. How rapidly the approach discovers highly active molecules (as represented by highly negative VINA scores).
2. How enriched the samples taken by the Bayesian optimization approach are, compared to the exhaustive set (i.e. how many more samples fall in the 'highly active' bins than would be expected given the known distribution of VINA scores).

We attack this problem for both situations outlined by Smith et al., namely docking into the S-protein-ACE2 interface and into the isolated viral S-protein. After the budget of 50 epochs has been exhausted, we interrogate the set which would have been calculated and compare it to randomly sampling from the full set for the same amount of time. This experiment is repeated 5 times, both to generate uncertainty estimates resulting from different seeding of the sampler and to demonstrate the volatility of the random sampling approach.

For the first task, the simple challenge of locating molecules with a desirable VINA score is considered. It is understood that this is not the only consideration by which a successful active pharmaceutical may be identified for further efficacy testing against SARS-CoV-2, but rapidly identifying candidates with this property is a reasonable way of prioritising the spend of compute resource.
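To make the loop described above concrete, the sketch below combines ECFP featurisation, a Matérn-kernel Gaussian process, and batch selection of 10 molecules per epoch. It uses RDKit and scikit-learn, and approximates PDTS by filling each batch slot with the minimizer of an independent posterior sample; this is a simplified stand-in for the published PDTS algorithm, not the implementation used in this study, and all names are illustrative.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def featurize(smiles_list, n_bits=512, radius=2):
    """ECFP fingerprints (radius 2, hashed to 512 bits), as described above."""
    X = np.zeros((len(smiles_list), n_bits))
    for i, smi in enumerate(smiles_list):
        fp = AllChem.GetMorganFingerprintAsBitVect(
            Chem.MolFromSmiles(smi), radius, nBits=n_bits)
        arr = np.zeros((n_bits,))
        DataStructs.ConvertToNumpyArray(fp, arr)
        X[i] = arr
    return X

def select_batch(X_train, y_train, X_pool, batch_size=10, seed=0):
    """One acquisition round: fit the GP surrogate, then fill each batch slot
    with the minimizer of an independent posterior sample (a simplified
    stand-in for PDTS)."""
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X_train, y_train)
    chosen = []
    for s in range(batch_size):
        draw = gp.sample_y(X_pool, random_state=seed + s).ravel()
        draw[chosen] = np.inf                # do not re-pick within this batch
        chosen.append(int(np.argmin(draw)))  # minimization: lowest VINA score
    return chosen
```

Seeding with 15 random molecules and repeating select_batch for 50 epochs, moving each chosen batch from the pool to the training set after querying the oracle, reproduces the roughly 500-molecule budget described above.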
Additionally, if it can be shown that the Bayesian optimization methodology is able to optimize for this task, it is reasonable to assume that the same methodology will offer an advantage on similar derived tasks, which may map more closely to the user's actual needs.

Figure 2: Optimization progress as a function of molecules tested for the combined S-protein-ACE2 interface target (left) and the isolated S-protein target (right). Confidence intervals were calculated from five repeats using the bootstrap methodology.

It can be seen from Figure 2 that the PDTS sampler, the Bayesian optimization methodology selected for this study, significantly outperformed the random search. It achieved this both in terms of the targets obtained, locating the global optimum in fewer than 500 steps in almost all of the 10 runs (2 sets of 5 replicates), and in doing so reliably, with significantly lower uncertainties. This is strong evidence that Bayesian optimization meets the first criterion for an effective prioritisation tool for HTVS activities.

It is also important to consider not just the single optimum point, but a set of molecules with desirable properties, which affords the user some options for further investigation. There are a number of ways to consider the enrichment, but here we use two metrics:

1. How the average score of the top 10 ranked molecules sampled thus far evolves as the sampling progresses.
2. Where the 'sampling density' is concentrated after sampling 500 molecules, as compared to the overall distribution of molecules.

If the Bayesian optimization methodology shows significantly more desirable behaviour than the random search method, this is further evidence that it can be used to prioritise HTVS screens successfully.

Figure 3: Evolution of the average VINA score of the top 10 ranked molecules for the PDTS sampler (blue) and a random sample (red), for the combined S-protein-ACE2 interface target (left) and the isolated S-protein target (right). Confidence intervals were calculated from five repeats using the bootstrap methodology.

Figure 3 shows the evolution of the average VINA score of the top 10 candidates over the 50-epoch optimization (500 molecules sampled). As can be seen, in both tasks the PDTS sampler robustly achieves a lower top-10 score than the random sampler, indicating that higher-quality molecules are being sampled.

It can be seen from the distribution of VINA scores for the whole set of molecules considered by Smith et al. (Figure 4) that there are simply not many molecules in the set which display a desirable VINA score. Additionally, the drop-off in the number of molecules displaying a desirable VINA score is more marked for the isolated S-protein target than for the combined S-protein-ACE2 interface target. Intuitively, therefore, it is unreasonable to expect a randomly drawn subset of these molecules to display strong potential as candidates against SARS-CoV-2. By comparing the distribution of the subset of the overall set which corresponds to desirable VINA scores against the distribution of molecules selected by the Bayesian PDTS sampler, we can determine the 'enrichment' the Bayesian sampler offers. Here, we plot the 'mean fraction of sample' as a scoring metric. This is simply the mean value (over the five PDTS runs) of the fraction of the sample which falls within a given bin, compared against the fraction observed within the same bin for the full set of molecules.
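Both metrics are straightforward to compute; a minimal sketch follows, in which the bin edges are placeholders rather than the exact binning used for Figure 4.

```python
import numpy as np

def top_k_average(sampled_scores, k=10):
    """Mean of the k best (most negative) VINA scores sampled so far."""
    return float(np.mean(np.sort(np.asarray(sampled_scores))[:k]))

def mean_fraction_in_bin(runs, lower=-9.0, upper=-8.0):
    """'Mean fraction of sample': the fraction of each run's selections whose
    VINA score falls in [lower, upper), averaged over the repeated runs.
    The default bin edges here are illustrative only."""
    fractions = [np.mean((np.asarray(r) >= lower) & (np.asarray(r) < upper))
                 for r in runs]
    return float(np.mean(fractions))
```

Applying mean_fraction_in_bin to the full library instead of the sampled runs gives the baseline fraction against which enrichment is judged.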
The fact that in all cases the blue bar (representing the PDTS approach) is greater than the orange bar (representing the overall distribution) demonstrates the effectiveness of the PDTS methodology at drawing enriched sets of molecules for testing, and is further evidence of the efficiency of Bayesian optimization for prioritising HTVS tasks.

Table 1 summarises numerically the performance of the PDTS algorithm on these three tasks. It can be seen that on every task the PDTS method of Bayesian optimization outperforms random sampling, and thus shows great potential for the efficient prioritisation of HTVS testing of pharmaceuticals for activity against SARS-CoV-2. This is demonstrated both by the ability of the PDTS sampler to locate the global optima in the two tasks efficiently, and to do so reliably, as indicated by the low standard deviations over runs. We also demonstrate the ability of the PDTS sampler to create enriched samples on limited budgets, as shown by its greater scores in both the top-10 average test and the top-bin percentage test. Again, both of these tests display reasonable standard deviations across runs, demonstrating the reproducibility of the methodology.

Table 1: Overall results of the three metrics used to investigate the potential of Bayesian optimization to act as an efficient prioritisation engine for HTVS. For 'Percentage in top bin', the 'Random' value refers to the overall percentage, rather than a particular random sample. For the 'Best VINA' and 'Top10' metrics, PDTS and random results are averaged over 5 runs, with the standard deviation of those runs also reported.

Conclusions

When concluding this study, it is first important to emphasise what this paper is not. It is not an exhaustive benchmark of optimization methodologies for application to HTVS; indeed, the no-free-lunch theorem says that no single methodology can be globally optimal anyway. We have shown in the past [18] that Bayesian optimization offers significant benefits over a greedy methodology, which is prone to catastrophic local optimization, and so we do not seek to reproduce those results here. What this study demonstrates is that, for problems which are both time-critical and compute-intensive, Bayesian optimization (specifically, here, the PDTS algorithm) has significant benefits when applied to HTVS screens. With a global problem such as the SARS-CoV-2 pandemic, speed to solution is necessary, but must be tempered with a pragmatic approach to resource allocation: there are simply not enough Summit supercomputers in the world for everyone who might want to use them. Here, we demonstrate that for the challenge attempted by Smith et al., determining a set of molecules for further investigation using the HTVS methodology, promising results could be reached with a smaller compute cluster if paired with a prioritisation mechanism powered by Bayesian optimization.

References

[1] COVID-19 HPC Consortium. XSEDE.
[2] TOP500 List, November 2019. TOP500 Supercomputer Sites.
[3] A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning.
[4] Large-scale virtual screening on public cloud resources with Apache Spark.
[5] HPC Cloud Technologies for Virtual Screening in Drug Discovery. In Intelligent Information and Database Systems.
[6] High-throughput virtual molecular docking with AutoDockCloud.
[7] Batch Bayesian optimization via local penalization.
[8] High-throughput and virtual screening: core lead discovery technologies move towards integration.
[10] Parallel and Distributed Thompson Sampling for Large-scale Accelerated Exploration of Chemical Space.
[11] Virtual Screening: An Effective Tool for Lead Structure Discovery.
[13] Utilizing Machine Learning for Efficient Parameterization of Coarse Grained Molecular Force Fields.
[14] On Bayesian methods for seeking the extremum.
[15] HPC Cloud for Scientific and Business Applications.
[16] SWEETLEAD: an In Silico Database of Approved Drugs, Regulated Chemicals, and Herbal Isolates for Computer-Aided Drug Discovery.
[17] Cloud-Based High Throughput Virtual Screening in Novel Drug Discovery.
[18] Bayesian optimization for accelerated drug discovery.
[19] What Is High-Throughput Virtual Screening? A Perspective from ...
[20] Extended-connectivity fingerprints.
[21] Repurposing Therapeutics for COVID-19: Supercomputer-Based Docking to the SARS-CoV-2 Viral Spike Protein and Viral Spike Protein-Human ACE2 Interface.
[22] Practical Bayesian optimization of machine learning algorithms.
[23] AutoDock Vina: Improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading.
[24] Evolution of the novel coronavirus from the ongoing Wuhan outbreak and modeling of its spike protein for risk of human transmission.

Acknowledgements

The author would like to thank Jeremy Smith for making his data publicly available, and Dave Turek, Kirk Jordan and Chris Porter of IBM Corp. for their helpful discussions and insight. Additionally, the author would like to take this opportunity to thank those who are working on the front line to respond to the SARS-CoV-2 global pandemic.