key: cord-1018531-vm4721yd authors: Ton, Anh‐Tien; Gentile, Francesco; Hsing, Michael; Ban, Fuqiang; Cherkasov, Artem title: Rapid Identification of Potential Inhibitors of SARS‐CoV‐2 Main Protease by Deep Docking of 1.3 Billion Compounds date: 2020-03-23 journal: Mol Inform DOI: 10.1002/minf.202000028 sha: c6d069ed9776bfdb6416aa24c09d3c3e25221d75 doc_id: 1018531 cord_uid: vm4721yd

The recently emerged 2019 novel coronavirus (SARS‐CoV‐2) and the associated COVID‐19 disease cause serious or even fatal respiratory tract infections, yet no approved therapeutic or effective treatment is currently available to combat the outbreak. This urgent situation is pressing the world to respond with the development of a novel vaccine or small-molecule therapeutics for SARS‐CoV‐2. As part of these efforts, the structure of the SARS‐CoV‐2 main protease (Mpro) has been rapidly resolved and made publicly available to facilitate global efforts to develop novel drug candidates. Recently, our group developed a novel deep learning platform, Deep Docking (DD), which provides fast prediction of docking scores from Glide (or any other docking program) and hence enables structure‐based virtual screening of billions of purchasable molecules in a short time. In the current study, we applied DD to all 1.3 billion compounds from the ZINC15 library to identify the top 1,000 potential ligands for the SARS‐CoV‐2 Mpro protein. The compounds are made publicly available for further characterization and development by the scientific community.

Coronaviruses (CoVs) are enveloped viruses that contain a single positive-stranded RNA and cause a wide array of respiratory, gastrointestinal, and neurological diseases in human hosts. [1, 2] It has been established that CoV strains were the source of the 2002 severe acute respiratory syndrome (SARS) and 2012 Middle East respiratory syndrome (MERS) epidemics.
[3] In late December 2019, a novel CoV, SARS-CoV-2, was identified as the cause of an outbreak of atypical pneumonia in Wuhan, China, named COVID-19. [4] The rapidly increasing number of infected patients worldwide prompted the World Health Organization to declare a global health emergency to coordinate scientific and medical efforts to rapidly develop a cure for patients. [5] While drug repurposing may be a short-term and non-specific solution to treat COVID-19 patients, [6] development of more targeted inhibitors is highly desirable. Previous research efforts to develop antiviral agents against members of the Coronaviridae family demonstrated that the angiotensin-converting enzyme II (ACE2) entry receptor, the RNA-dependent RNA polymerase (RdRp) and the main protease (Mpro) may represent suitable drug targets. [7] Although initially promising, inhibitors targeting ACE2 (and hence aiming to block critical coronavirus-host interactions) did not advance clinically because of significant side effects. [8] Likewise, RdRp inhibitors appeared not to be very specific and showed lower overall potency, which also translated into common side effects in patients. [1, 9] Nevertheless, rapid drug repurposing efforts have identified remdesivir, an RdRp inhibitor, as a promising antiviral drug against COVID-19. [10, 11] Clinical trials are currently ongoing to determine the full efficacy spectrum of the compound in patients (clinicaltrials.gov, NCT04280705 [12]). Concurrently, CoV-infected patients administered the protease inhibitors lopinavir/ritonavir have shown improved outcomes, [1, 13] demonstrating the potential of the main protease (Mpro) as the most promising drug target in CoVs. [14, 15] Hence, a recently published X-ray crystal structure of the SARS-CoV-2 Mpro provides an excellent basis for structure-based drug discovery efforts.
[16] Earlier efforts to target SARS-CoV resulted in the identification of several covalent Mpro inhibitors targeting the catalytic dyad of the protein, defined by residues His41 and Cys145. [17] However, covalent inhibitors are often marked by adverse drug responses, off-target side effects, toxicity and lower potency. [18] [19] [20] [21] [22] Therefore, noncovalent protease inhibitors may have advantages for treating this kind of infection. Still, the majority of approved drugs administered as anti-SARS agents were designed for other viral strains (Table S1 in supplementary material). Notably, no CoV protease-specific inhibitor has yet successfully completed a clinical development program. [19, 23]

The impact of the current COVID-19 outbreak and the likelihood of future CoV epidemics strongly advocate for rapid development of new treatments and fast intervention protocols. A few research groups have already suggested potential repurposing strategies for clinically approved drugs [24] [25] [26] or proposed de novo agents [27] as therapeutic solutions for SARS-CoV-2. However, previously reported docking (virtual screening) campaigns with Mpro targets were able to process only a few million or even a few thousand compounds. [6, 28-30] The main reason is that conventional docking is too computationally expensive and slow, while the libraries of available chemicals are growing exponentially. [31] To address this general challenge, we have recently developed a novel deep learning-based approach for accelerated screening of large chemical libraries consisting of billions of entities. This Deep Docking (DD) platform utilizes quantitative structure-activity relationship (QSAR) models trained on docking scores of database subsets to iteratively approximate the docking outcome for the remaining entries. Importantly, DD does not provide any novel scoring function for docking; its accuracy therefore relies entirely on the docking program used.
The development of deep learning scoring functions has already been attempted, with varying degrees of success, possibly owing to a lack of appropriate datasets. [32, 33] Because docking is approximate by nature, improvements are likely to come from better approximation of physicochemical processes, including solvation and enthalpic and entropic factors, rather than from a better training base and procedures. [34, 35] Thus, our method represents not just a feasible but also a practical option for utilizing deep learning in virtual screening. Herein we have used DD for large-scale virtual screening against the SARS-CoV-2 Mpro active site.

To assess the performance of the fast Glide SP protocol [36] in virtual screening against the Mpro target, we collected 81 known SARS Mpro small-molecule inhibitors reported by Pillaiyar et al. [37] and Turlington et al. [38] Then, we generated 50 molecular decoys for each active molecule using the methodology implemented in the Database of Useful Decoys: Enhanced (DUD-E). [39] All compounds were prepared for docking with the OpenEye package. The most probable tautomer and ionization states at pH 7.4 were calculated with the OpenEye QUACPAC package [40] and starting 3D conformations were generated using the Omega pose routine. [41] The structure of SARS Mpro bound to a noncovalent inhibitor (PDB 4MDS, 1.6 Å resolution) was obtained from the Protein Data Bank (PDB), [42] and prepared using the Protein Preparation Wizard. [43] Docking was performed using the Glide SP module. [36] Areas under the receiver operating characteristic curve (ROC AUC) were then calculated.

We used DD to virtually screen all of ZINC15 (1.36 billion compounds) [44] against the SARS-CoV-2 Mpro. The model was initialized by randomly sampling 3 million molecules and dividing them evenly into training, validation and test sets.
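The initialization step just described (a random sample divided evenly into three sets) can be illustrated with a minimal sketch. The function name, the use of integer indices in place of ZINC15 entries, the seed, and the toy sizes are assumptions made for illustration and are not taken from the study's code.

```python
import random

def sample_and_split(library_size, sample_size, seed=0):
    """Draw `sample_size` distinct indices from a library and split them
    evenly into training, validation and test sets."""
    if sample_size % 3 != 0:
        raise ValueError("sample_size must be divisible by 3 for an even split")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    picked = rng.sample(range(library_size), sample_size)  # no replacement
    third = sample_size // 3
    return picked[:third], picked[third:2 * third], picked[2 * third:]

# Toy scale: 3,000 of 1,360,000 indices stand in for 3 million of 1.36 billion.
train, valid, test = sample_and_split(1_360_000, 3_000)
print(len(train), len(valid), len(test))
```

Because the sample is drawn without replacement and then sliced, the three sets are disjoint by construction.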
The structure PDB 6LU7 (resolution 2.16 Å) [45] of the SARS-CoV-2 Mpro bound to the N3 covalent inhibitor was obtained from the PDB and prepared as described above. Molecule preparation and docking were also performed as described above, and the computed scores were used for DNN initialization. We then ran 4 iterations, each time adding 1 million docked molecules sampled from the previous predictions to the training set, with the recall of top-scoring compounds set to 0.75. At the end of the 4th iteration, the top 3 million molecules predicted to have favorable scores were docked to the protease site. The set of protease inhibitors (7,800 compounds) from the BindingDB repository was also docked to the same site. [46] Our computational setup consisted of 13 Intel(R) Xeon(R) Gold 6130 CPUs @ 2.10GHz (a total of 390 cores) for docking, and 40 Nvidia Tesla V100 GPUs with 32GB memory for deep learning.

Although drug repurposing and high-throughput screening have identified potential hit compounds with strong antiviral activity against COVID-19, [47] no noncovalent inhibitors of SARS-CoV-2 Mpro have been reported to date. Glide protocols were recently deployed to identify potential hit compounds as protease inhibitors, notably against FP-2 and FP-3 (P. falciparum cysteine proteases), [48] nsP2 (Chikungunya virus protease), [49] and more recently against SARS-CoV-2 Mpro. [47] Glide has thus been shown to be adequate and effective in docking ligands with high fidelity compared with other available academic and commercial docking software. [50, 51] Nonetheless, we performed our own benchmarking study to evaluate the viability of using Glide SP to screen the SARS-CoV-2 Mpro. We first evaluated the feasibility of virtual screening using a closely related protein, the SARS Mpro (96% sequence identity), for which different series of noncovalent inhibitors with low micromolar to nanomolar activity have been discovered.
[37] Our benchmarking study revealed a good ability of Glide SP to dock known inhibitors. First, the co-crystallized ligand (SID 24808289 from Turlington et al. [38]) was accurately redocked to its binding site (root mean square deviation (r.m.s.d.) of 0.86 Å between the Glide and X-ray poses, Figure 1a). Second, the ROC AUC for Glide SP on the 81 Mpro inhibitors and ~4,000 decoys was 0.72, similar to that of the more computationally expensive Glide XP protocol (Figure 1b), and 0.74 when the active molecules were diluted in 1 million random compounds extracted from ZINC15 (Figure S1 in supplementary material). Thus, in light of recent studies advocating for extending virtual screening to large chemical libraries when docking works well at smaller scales, [31] we decided to use Glide SP as the docking program for DD to screen ZINC15 against SARS-CoV-2 Mpro.

DD relies on a deep neural network trained on the docking scores of small random samples of molecules extracted from a large database to predict the scores of the remaining molecules and, therefore, to discard low-scoring molecules without investing time and resources in docking them. The combination of an iterative process to improve model training and the use of simple 2D QSAR descriptors such as Morgan fingerprints makes DD particularly suited for fast virtual screening of emerging giga-sized chemical libraries using standard computational resources. We have recently shown the wide applicability of DD by using the method to dock all ZINC15 compounds to 12 targets representing major protein families of therapeutic interest. [52] The DD platform enabled us to screen 1.3 billion compounds from the ZINC15 database [44] against the SARS-CoV-2 Mpro active site using standard Glide SP protocols in one week.
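The discard step described above can be sketched as follows. This is a simplified illustration and not the DD implementation: it assumes that the trained network outputs a hit probability per molecule and that the cutoff is chosen on a docked validation set so that a target fraction of true top-scoring molecules is retained (0.75 is the recall value quoted in the methods). All function names, molecule names, and numbers are hypothetical.

```python
import math

def threshold_for_recall(probabilities, is_hit, recall=0.75):
    """Return the highest probability cutoff that still keeps at least
    `recall` of the true hits in a validation set."""
    hit_probs = sorted((p for p, h in zip(probabilities, is_hit) if h),
                       reverse=True)
    if not hit_probs:
        raise ValueError("validation set contains no hits")
    # Number of validation hits that must stay above the cutoff.
    needed = max(1, math.ceil(recall * len(hit_probs)))
    return hit_probs[needed - 1]

def keep_predicted_hits(library_probs, cutoff):
    """Keep molecules whose predicted hit probability reaches the cutoff."""
    return [mol for mol, p in library_probs.items() if p >= cutoff]

# Hypothetical validation data: predicted probability and docking-derived label.
val_probs = [0.95, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
val_hits = [True, True, False, True, False, True, False, False]
cutoff = threshold_for_recall(val_probs, val_hits, recall=0.75)

# Hypothetical undocked molecules with model predictions.
library = {"mol_a": 0.90, "mol_b": 0.45, "mol_c": 0.35, "mol_d": 0.02}
print(cutoff, keep_predicted_hits(library, cutoff))
```

Lowering the recall target raises the cutoff and discards more of the library, at the price of losing a larger share of true top scorers.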
In our benchmark study on SARS Mpro, the ROC AUC for Glide SP improved from 0.72 to 0.78 when a ligand efficiency (LE) cutoff of −0.20 kcal/mol was introduced before ranking molecules by their docking scores (Figure S2 in supplementary material). Thus, the top 1,000 hits from the DD run were selected following the same strategy.

The SARS Mpro cleaves the replicase polyproteins, pp1a and pp1ab, at 11 specific positions, using core sequences in the polyprotein substrate to determine cleavage sites. [53] The residues on the polyproteins are named according to their position relative to the cleavage site. Position P1 corresponds to the residue just before the cleavage site, followed by P2, P3, P4, P5, and so on toward the N-terminus. Position P1' corresponds to the residue immediately following the cleavage site, followed by P2', P3', P4', P5', and so on toward the C-terminus. [54] The protease recognizes specific residues at each position of the polyproteins to determine a cleavage site and initiate the replication-transcription complex necessary for viral replication. [55] Based on the consensus recognition sequence of the polyproteins, a substrate-analogue inhibitor, CMK, was designed to mimic positions P1 to P6 of the substrate in the SARS Mpro substrate-binding sites. The compound is characterised by its chloromethyl ketone warhead and its core sequence of Val(P6)-Asn(P5)-Ser(P4)-Thr(P3)-Leu(P2)-Gln(P1), which occupies the same volume as the residues of the recognition sequence. [56] An X-ray crystal structure of the SARS Mpro with the CMK inhibitor revealed the mode of inhibitor binding to the substrate-binding sites of the main protease, providing a crucial structural basis for rational drug design and guiding drug discovery efforts against the SARS Mpro (PDB 1UK4 [57]).
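The P/P' naming convention described above can be made concrete with a short sketch. The function name and the `cleavage_index` parameter are illustrative choices; the six-residue core is the CMK sequence quoted in the text, while the two prime-side residues are placeholders ("Xaa") added only so that the P' labels have something to point at.

```python
def label_cleavage_positions(sequence, cleavage_index):
    """Label residues around the bond between sequence[cleavage_index - 1]
    and sequence[cleavage_index]: P1, P2, ... run toward the N-terminus,
    P1', P2', ... run toward the C-terminus."""
    if not 0 < cleavage_index < len(sequence):
        raise ValueError("cleavage site must lie inside the sequence")
    labels = {}
    # Walk backwards from the residue just before the cleavage site.
    for offset, residue in enumerate(reversed(sequence[:cleavage_index]), start=1):
        labels[f"P{offset}"] = residue
    # Walk forwards from the residue just after the cleavage site.
    for offset, residue in enumerate(sequence[cleavage_index:], start=1):
        labels[f"P{offset}'"] = residue
    return labels

# Val-Asn-Ser-Thr-Leu-Gln is P6..P1; "Xaa" entries are hypothetical placeholders.
peptide = ["Val", "Asn", "Ser", "Thr", "Leu", "Gln", "Xaa", "Xaa"]
labels = label_cleavage_positions(peptide, cleavage_index=6)
print(labels["P1"], labels["P2"], labels["P6"], labels["P1'"])
```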
Therefore, the pocket of the Mpro can be partitioned into different sections, depending on the volume occupied by the polyprotein residues at each position. Using insight gained from the crystal structure, Turlington et al. first developed a moderately potent noncovalent inhibitor, SID 24808289, with an IC50 of 6.2 μM, [38] and demonstrated that additional positions could be explored for efficient inhibition of SARS Mpro. The binding pose of the compound is shown in Figure 2; it occupies the same volume as positions P1 to P4 and P1'. The compound still maintained key interactions: hydrophobic contacts with Gln189 and Met49 (P2), and hydrogen bonds to Cys145, His163, and Glu166 (P1).

The top series of compounds identified from our virtual screening is presented in Table 1. They are predicted to have a consistent binding pose, similar to that of the noncovalent compound SID 24808289, as shown in Figure 3a. The predicted interaction between ZINC000541677852 and SARS-CoV-2 Mpro is shown in Figure 3b. This series of compounds occupies the same volume as the P1, P2 and P3 groups, with the commonly favored hydrophobic interactions of the phenyl ring (P2) and two hydrogen bonds, to Cys145 and Leu141 (P1).

We have also analyzed the origin of the top 1,000 ZINC hits (selected by LE), and observed that 99% of them are not present in the ZINC15 in-stock library (~11 million molecules) commonly used in routine docking campaigns, indicating that the DD methodology can access a more complete and diverse chemical space than classical docking. The Glide SP scores of the top 1,000 candidates we selected were significantly better than those of the top 1,000 molecules from a 1 million random sample of ZINC15 entries, and even better than those of the top candidates from the BindingDB protease inhibitor library, which were docked to the same site (Figure 4).
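The LE-based selection of hits mentioned above can be sketched as a filter followed by a ranking. The text does not define LE explicitly; the sketch assumes the common definition of docking score divided by the number of heavy atoms, and assumes that molecules with LE less favorable than the cutoff are removed before ranking. The record type, function name, and all molecule data are hypothetical.

```python
from typing import NamedTuple

class Docked(NamedTuple):
    mol_id: str
    score: float       # docking score in kcal/mol (more negative is better)
    heavy_atoms: int   # number of non-hydrogen atoms

def select_top_hits(molecules, le_cutoff=-0.20, top_n=1000):
    """Drop molecules whose ligand efficiency (score / heavy atoms, an
    assumed definition) is worse than `le_cutoff`, then rank the rest by
    docking score and keep the best `top_n`."""
    efficient = [m for m in molecules if m.score / m.heavy_atoms <= le_cutoff]
    return sorted(efficient, key=lambda m: m.score)[:top_n]

# Made-up molecules for illustration only.
pool = [
    Docked("mol_1", -9.0, 30),  # LE = -0.30, kept
    Docked("mol_2", -9.5, 60),  # LE of about -0.16, dropped despite best score
    Docked("mol_3", -8.0, 25),  # LE = -0.32, kept
    Docked("mol_4", -6.0, 40),  # LE = -0.15, dropped
]
hits = select_top_hits(pool, top_n=2)
print([m.mol_id for m in hits])
```

Filtering on LE before ranking penalizes large molecules that reach good scores mainly through size, which is one plausible reading of why the cutoff improved enrichment in the benchmark.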
We also evaluated the chemical diversity of the newly identified set of inhibitors compared with the protease library. Calculation of Murcko frameworks [58] for the hits from that library and for the DD hits revealed a similar number of frameworks in the two sets (603 and 587 scaffolds, respectively). Encouragingly, we observed just two common frameworks, clearly indicating that screening 1.36 billion compounds enables identification of new chemical classes that can potentially inhibit SARS-CoV-2 Mpro. Thus, DD allowed us to rapidly narrow down ZINC15 to a smaller dataset enriched with high-scoring compounds, which consists of novel molecules with highly favourable docking scores and structures significantly different from those of known protease inhibitors. Our DD screening identified 585 new scaffolds for SARS-CoV-2 that are not shared with known protease inhibitors but are predicted to establish all the critical interactions with the protease active site, thus providing a completely new set of chemicals for testing and optimization. Collectively, our results strongly support docking the largest available compound libraries to identify novel potent scaffolds or chemicals, as concluded by Lyu et al. [31]

The use of the DD methodology in conjunction with Glide allowed rapid estimation of docking scores for 1.3 billion chemical structures in the active site of the novel SARS-CoV-2 Mpro. The candidate inhibitors in the top-1,000 hit list are chemically diverse, exhibit superior docking scores compared with known protease inhibitors, and can be readily sourced from established vendors.

The structures of the identified compounds are made publicly available. The list of the top 1,000 identified compounds, as well as docking results in SDF format, is publicly available at https://drive.google.com/drive/folders/1xgA8ScPRqIunxEAX-FrUEkavS7y3tLIMN?usp=sharing.

Conflict of interest: none declared.
Table 1 (fragment; compound, substituent, score): ZINC001627499877, Br, -9.32; ZINC001362111980, Cl, -9.13.

This work was funded by the CIHR Canadian 2019 Novel Coronavirus (2019-nCoV) Rapid Research grant # DC0190GP.