key: cord-0934118-2pv68ju7 authors: Pirolli, Davide; Righino, Benedetta; De Rosa, Maria Cristina title: Targeting SARS‐CoV‐2 Spike Protein/ACE2 Protein‐Protein Interactions: a Computational Study date: 2021-04-27 journal: Mol Inform DOI: 10.1002/minf.202060080 sha: 7a01f283aca630bd8bbb6e2d345e481e93adf723 doc_id: 934118 cord_uid: 2pv68ju7 The spike glycoprotein (S) of the SARS‐CoV‐2 virus surface plays a key role in receptor binding and virus entry. The S protein uses the angiotensin converting enzyme (ACE2) for entry into the host cell and binding to ACE2 occurs at the receptor binding domain (RBD) of the S protein. Therefore, the protein‐protein interactions (PPIs) between the SARS‐CoV‐2 RBD and human ACE2, could be attractive therapeutic targets for drug discovery approaches designed to inhibit the entry of SARS‐CoV‐2 into the host cells. Herein, with the support of machine learning approaches, we report structure‐based virtual screening as an effective strategy to discover PPIs inhibitors from ZINC database. The proposed computational protocol led to the identification of a promising scaffold which was selected for subsequent binding mode analysis and that can represent a useful starting point for the development of new treatments of the SARS‐CoV‐2 infection. The spread of the COVID-19 outbreak, declared pandemic on 11 th March 2020 by the WHO, [1] can be considered as the worst public health emergence in the last century, as, in few weeks, it affected more than 200 countries with millions of cases and almost half a million of deaths. The pathogen responsible for this pandemic is a novel coronavirus, known as severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) and described for the first time in the Chinese city of Wuhan. [2] SARS-CoV-2 is the last of the three zoonotic outbreaks of beta-coronavirus in the last 20 years, namely SARS-CoV, [3] MERS-CoV [4] and SARS-CoV-2, [2] and, because of its extremely rapid diffusion, forced the entire world into lockdown, the only measure that, currently, is able to block or slowdown the spread of epidemic. The spike glycoprotein (S) of the SARS-CoV-2 virus surface plays a key role in receptor binding and virus entry. [5, 6] The S protein uses the angiotensin converting enzyme-2 (ACE2), for entry into the host cell and binding to ACE2 occurs at the receptor binding domain (RBD) of the S protein S1 domain. [7] At a molecular level then, the infection is triggered by the protein-protein interactions (PPIs) between the spike RBD and the ACE2 surface areas. Screening for a molecule targeting this interface can therefore help to prevent from ACE2 binding and cell entry. PPIs are of the utmost importance in biological processes and their regulation represents an attractive target for designing novel therapeutic approaches and developing small molecule inhibitors. [8] [9] [10] Despite the difficulties, several small molecule modulators of PPIs have been identified and have reached clinical trials or have been approved by several regulatory government agencies. [9, 11, 12] Drug design strategies for the identification of anti-COVID-19 agents targeting the SARS-CoV-2 spike RBD/ACE2 interface have been reported mainly consisting in virtual screening studies of already approved drugs or natural products. [13] [14] [15] In an effort to widen the chemical space accessible to virtual screening and to identify more effective compounds able to interfere with the SARS-CoV-2 spike RBD/ACE2 interaction, we have set up a computational strategy that, combining deep learning-based QSAR techniques and molecular docking calculations, allowed us to screen a large focused library of PPI modulators. A specific class of deep neural network, namely convolutional neural network (cNN), [16, 17] was used, which has the advantage to efficiently filter, from huge virtual libraries, those molecules with the desired chemical, pharmacodynamic or pharmacokinetic properties, and allows to explore a wider chemical space with respect to standard screening methods in a significantly reduced amount of time. [18] In this study, the recently determined crystal structure of the complex of SARS-CoV-2 spike RBD with ACE2 [19] was used for structure based virtual screening. A small set of compounds was identified representing a useful starting point for the development of new antiviral drugs against the SARS-CoV-2 infection. A custom virtual screening strategy was set up by employing, in combination, different in silico ligand-and structurebased approaches (see Figure 1 ). The first step of the VS strategy was represented by the selection of a PPI focused virtual library of small molecules from a dataset of 2 million compounds, by a ligand-based approach able to recognize chemical features and scaffolds common to the known PPImodulators. For this purpose, a convolutional neural network (cNN) was trained in order to obtain a QSAR model capable to identify potential PPI-modulators among a virtual library of unknown molecules. Molecules classified by the cNN-based QSAR as potential PPI-modulators were further filtered by predicted toxicological properties in order to discard compounds harmful to human health. The resulting virtual library was docked against ACE2 to identify compounds with the best binding affinity for the spike protein interaction surface. The drug-like library of~2 M compounds was obtained from ZINC15 database [20, 21] and filtered by an artificial neural network-based QSAR model. In particular, a convolutional neural network (cNN), using a recently proposed SMILESbased molecular fingerprint, [22] was employed and trained to obtain a binary classifier QSAR model, capable to identify potential PPI-modulators. The fingerprint represents each molecule as a bi-dimensional matrix in which a string of symbols is encoded as binary vector of 42 bits. [22] By default, the maximum length of the SMILES string is set to 400, resulting in a feature matrix of dimension (400,42) that represents the input data for the cNN. The cNN architecture is composed of two subsequent 2D-convolutional and average pooling layers followed by a global pooling layer of 1x1xN that is connected to a fully connected layer with N neurons. [22] The hidden layers converge to the output layer that consists in one output neuron ( Figure 2 ). The cNN-based QSAR model was trained using a data set consisting of 1747 PPI-modulators from the iPPI-DB database, that are active on 17 different targets [23] and 4600 decoys automatically generated with the Enhanced Directory of Useful Decoys resource (DUDÀ E) [24] (http://dude.docking.org). The compounds were randomly divided into training set (80 %) and test set (20 %), the first set for model generation and the latter for external validation (Supporting Information, training-dataset.csv file). Several cNN architectures were evaluated using a different number of filters in the convolution operations and different neurons in the fully connected layer. Three models for every cNN architecture were generated to avoid artifacts due to a random distribution of compounds between training set and test set. Different random seeds for splitting the data into training and test set where then used, and the generated models were used in consensus for screening the ZINC drug-like database. The cNN was implemented using Python v3.7.7 and the Keras v2.3.1 library for deep learning, whereas the feature matrix was generated by the Python script "feature.py" [22] available at http://www.dna.bio.keio.ac.jp/smiles/feature.py. In silico toxicity filtering of identified PPI modulators within the ZINC drug-like database was carried out using the Cramer method [25, 26] as implemented on VEGA platform (https://www.vegahub.eu/). According to the Cramer decision tree, chemical compounds are classified into three categories based on their predicted high (Class III), medium (Class II) or low (Class I) level of toxicological concern. Only substances with low toxicity profile (Class I) were picked. Different structures of SARS-CoV-2 spike RBD bound to human ACE2 are available in the Protein Data Bank (PDB). Recently, the PDB structure 6M17, solved at 2.9 Å resolution by cryo-EM [27] was used for virtual screening studies. [28] However, considering that a resolution of at least~2.5 Å is needed to enable structural information to be used in structure-based drug design applications, the complex structure 6M0J, with a resolution of 2.45 Å, was retrieved from the PDB and used for docking calculations. [19] The structure was imported into the Protein Preparation Wizard of Maestro (Schrödinger Release 2020-1: Maestro, Schrödinger, LLC, New York, NY, 2020) and, following removal of the spike protein, the water molecules and chlorine ions from the complex, the remaining ACE2 target was submitted to protein preparation process which included adding hydrogen atoms, assigning bond orders, hydrogen bond optimization, and restrained energy minimization using OPLS3 force field. [29] The SiteMap tool from Schrödinger was used to identify druggable sites on the ACE2 surface according to the SiteScore provided by the program. Pockets with a Site-Score > 1.0 were analyzed and filtered based on the hot spot residues at the protein-protein interface found by the Robetta Server (http://robetta.bakerlab.org/alaninescan). [30] The focused virtual library of 20025 PPI-modulators previously built was prepared by the LigPrep module of Schrodinger (Schrodinger 2015: LigPrep, version 3.1, Schrodinger, LLC) and submitted to the Virtual Screening Workflow (VSW) using the Glide module in standard precision (SP) and extra precision (XP) mode (Schrödinger Release 2020-1: Glide, Schrödinger, LLC, New York, NY, 2020). [31] The ligands were prepared at pH 7.0 � 2.0 generating possible ionized compounds and tautomer states by Epik (Schrödinger Release 2020-1: Epik, Schrödinger, LLC, New York, NY, 2020). [32] Ligands were then docked into the identified site of the ACE2 structure using inner and outer receptor grid boxes of 15 Å and 35 Å centered on the centroid of the predicted binding site. The top 50 % of the ranked database from docking was submitted to a rescoring procedure using MM-GBSA [33] as implemented in Prime (Schrödinger Release 2020-1: Prime, Schrödinger, LLC, New York, NY, 2020). 9730 compounds at a distance less than 4.5 Å from ACE2 Gln24 and Tyr83 were then selected and filtered for diversity selection based on the Tanimoto metric using Canvas (Schrödinger Release 2020-1: Canvas, Schrödinger, LLC, New York, NY, 2020), where the diverse subset size was set to 973 (10 %). The resulting 973 compounds were clustered by binary interaction fingerprints, as implemented in Maestro (Schrödinger Release 2020-1: Maestro, Schrödinger, LLC, New York, NY, 2020), and compounds, corresponding to clusters' representative, with the lowest MM-GBSA score were identified and visually inspected. Visual inspection of three-dimensional structures has been carried out by Discovery Studio (BIOVIA, Dassault Systèmes, Discovery Studio, 2020-4, San Diego: Dassault Systèmes, 2020) and Maestro. The best performing cNN architecture was identified using a dataset composed of 1747 active compounds and 4600 decoys. It consisted of 96 and 64 filters in the first and second convolution layer, respectively, and of 64 and 16 neurons in the first and second dense layers, respectively (Supporting Information, Figure S1 ). This architecture exhibited an accuracy of 0.95, a precision of 0.92, a F1-score of 0.91 and an area under the ROC (Receiver Operating Characteristic) curve of 0.99 (Supporting Information, Table S1 and Figure S2 ). As the confusion matrix indicates, our model identifies 2.7 % of false positives and 9.6 % of false negatives, demonstrating better ability to recognize non-PPI-modulators probably due to the larger number of decoys with respect to the actual PPI-modulators, in the training set ( Figure 3 ). On the whole, these findings indicate a good predictive capability of our model. The drug-like library of 2400464 molecules, retrieved from the ZINC15 database, was screened against the cNNbased QSAR model. The resulting 423565 potential PPImodulators were submitted to a further screening by toxicological properties in order to discard those compounds that may display acute toxicity, according to Cramer classification. [25, 26] In silico toxicological analysis indicated that 400453 compounds were assigned to Class III (highly toxic), 20025 to Class I (non-toxic), 3086 to Class II (intermediately toxic). The 20025 chemicals predicted to belong to Class I were submitted to docking-based virtual screening. The recently resolved crystallographic structure of the binary complex between SARS-CoV-2 spike RBD and ACE2, (PDB code 6M0J, [19] was used as target structure for docking-based virtual screening. The site recognition software SiteMap was run on the structure after removal of the spike RBD. The algorithm located potential binding sites evaluating cavity size, exposure to solvent, hydrophobic/ hydrophilic balance, and hydrogen bonding. A SiteScore of 0.80 has been identified to accurately distinguish between drug-binding and non-drug-binding sites (Maestro, Schrödinger, LLC). [34] Five pockets were identified onto the ACE2 surface with comparable SiteScores, ranging between 1.00 and 1.03. Among the five putative sites explored, the docking site was selected on the basis of those residues important for the stabilization of the spike RBD/ACE2 complex as identified by computational alanine scanning. Residues with ΔΔG > 1 kcal/mol are called ''hotspots'' [30] and are listed in Table 1 and shown in Figure 4 . Notably, ACE2 residues Gln24, Tyr83, Tyr41, Lys353 have been already reported as main contact residues that contribute to the binding energy of the complex in a dynamical context, in solution and at room temperature. [35] [36] [37] Site-4 (SiteScore 1.022), which is delimited by 25 residues (Supporting Information, Table S3 ), was the only one to include hot spot residues identified by Robetta server, namely ACE2 Gln24, Tyr83, and spike RBD Phe486, Asn487, Tyr489. Site-4 is located between the N-terminal α1 helix (residues 22-53) and the loop region between α2 and 3 10 H1 helices (residues 79-84). [38] Its site points are concentrated in two regions shown as green and violet dots in Figure 4 (Supporting Information, Figure S3 ), and the latter, delimited by ACE2 Gln24 and Tyr83, was targeted for Full Paper www.molinf.com docking to identify candidate inhibitors of SARS-CoV-2 spike RBD/ACE2 interaction. To screen effective inhibitors of SARS-CoV-2 spike RBD/ ACE2 protein-protein interactions we used the dataset of 30029 ligands obtained from QSAR modeling and toxicity analysis against the region of the druggable pocket Site-4. The Glide virtual screening workflow was employed, and all the molecules were initially screened by SP mode and in the subsequent stage by XP docking. The top ranked 15015 ligands (50 %), on the basis of the docking score, were selected and rescored with Prime MM-GBSA, to estimate their free energy of binding. The compounds were then filtered by distance constraints selecting only the small molecules within 4.5 Å distance of any atom of residues Tyr83 and Gln24. The remaining 9730 molecules were then clustered based on their diversity and the resulting 973 virtual hit compounds further assembled in 66 clusters by interaction fingerprints (Supporting Information, hit-list.csv file). The compound similarity between the diverse subset and the training set was assessed by the Tanimoto coefficient. Results indicated that most of the screened compounds share a high level of similarity (0.6-0.7) with the training set (Supporting Information, Figure S4 ). The representative structures, selected for each cluster as the centroid structure, were ranked by MM-GBSA score and analyzed for their interactions with the target. The interaction matrix relative to the molecular fingerprints ( Figure 5 ) highlights that interactions with Gln24, Thr27, Phe28 and Lys31 of the α1 helix and those with Gln76, Leu79, Met82 and Tyr83 of the α1-3 10 H1 loop occur at high frequency. In particular, α1 helix residues of ACE2 interact with spike RBD Phe486, which corresponds to Leu472 in the spike protein from SARS-CoV, and that, as already reported, plays a key role in the formation of the complex. [39] Four top-ranking compounds were selected as potential ACE2 surface binders able to prevent from spike RBD recognition and therefore from infection ( Table 2) . Interestingly, the four compounds share the same interaction pattern at the ACE2 binding site (Figure 6 ). The presence of an aromatic moiety facilitates the interaction with ACE2 Phe28 (compounds 2 and 3) and Tyr83 (compounds 1 and ACE2 and spike RBD are represented as dark and light grey ribbons, respectively. ACE2 and spike RBD "hotspot" residues are displayed as orange and blue sticks, respectively. The geometric center of the docking grid is shown as a violet sphere. Interaction Fingerprints (IFPs) matrix calculated for the representative structures of the 66 identified clusters using the method by Singh [40] as implemented in Maestro (Schrödinger Release 2020-1: Maestro, Schrödinger, LLC, New York, NY, 2020). Xaxis represents the amino acid residues in the binding pocket, the Y-axis represents the ligand index ranked by the MM-GBSA score. www.molinf.com 4) . Furthermore, ACE2 Gln24 and Tyr83 contribute to the stabilization of ligands 1, 2 and 4 within the binding site by forming hydrogen bonds to their hydroxyl groups. In particular, compound 1 is stabilized by the hydrogen bonds formed by the catechol moiety with the Gln24 OE1 oxygen atom and by the cycloheptyl moiety with Tyr83. In addition, hydrophobic interactions with Lys31, Leu79 and Tyr83 residues assist in positioning the substrate (Fig. 6 A) . Analogously compound 2 is hydrogen bonded to Tyr83 OH and Gln24 OE1 oxygen atoms by its hydroxyl and carbonyl oxygen atoms, respectively. The complex structure also revealed extensive hydrophobic interactions with Phe28 and Lys31 side chain atoms ( Figure 6B ). In the case of compound 3 hydrogen bonds are formed with the NZ and NE1 atoms of Lys31 and Gln76, respectively, whereas hydrophobic interactions are established with Phe28 and Leu39 ( Figure 6C ). Similarly to compound 1, compound 4 binding is driven by hydrophobic interactions and hydrogen bonding with Tyr83 and Gln24, respectively. Notably the identified small molecules establish main non-bonding interactions with those residues, namely Gln24, Phe28, Lys31, Met82, and Tyr83, which have emerged as playing a key role in SARS-CoV spike RBD/ACE2 interaction. These virtual hits, able to directly bind the interaction surface of spike RBD, could therefore help to prevent from ACE2 binding and cell entry. On the whole, our findings indicate that the N-benzyl carbamoyl group, found in three of the top four compounds, may be a promising scaffold for the design of new drug candidates for the treatment of COVID-19. In the current study, a combined ligand-and structurebased virtual screening approach has been effective in identifying hits against ACE2 recognition of spike protein RBD from SARS-CoV-2. We built a cNN-based QSAR model able to capture the chemical patterns of known PPI modulators and performed the screening of a virtual library of more than 2 million of drug-like compounds based on the findings from the QSAR model. The identification of "hotspot" residues at the protein interfaces allowed for structure-based virtual screening and the focused library of www.molinf.com PPI modulators was docked into the predicted druggable site of the ACE2 structure. We targeted the identified ACE2 "hotspot" residues Gln24 and Tyr83, which have been already reported in the literature as key residues for the design of specific anti-SARS-CoV-2 drugs. [37] Few potential hits were identified based on their good binding affinities and toxicological profile. Our analysis led to the identification of a promising scaffold for the design of new therapeutic agents targeting SARS-CoV-2 spike protein/ ACE2 protein-protein interactions. The manuscript was written through contributions of all authors. All authors have given approval to the final version of the manuscript. Proc. Natl. Acad. Sci. USA 2020 Repurposing Therapeutics for COVID-19: Supercomputer-Based Docking to the SARS-CoV-2 Viral Spike Protein and Viral Spike Protein-Human ACE2 Interface Expert Opin. Drug Discovery Drug Discovery Today Analysis of the Cramer Classification Scheme for Oral Systemic Toxicity: Implications for Its Implementation in Toxtree None declared. The data that supports the findings of this study are available in the supplementary material of this article