key: cord-0859999-n8is814a
authors: Jukič, Marko; Škrlj, Blaž; Tomšič, Gašper; Pleško, Sebastian; Podlipnik, Črtomir; Bren, Urban
title: Prioritisation of Compounds for 3CL(pro) Inhibitor Development on SARS-CoV-2 Variants
date: 2021-05-18
journal: Molecules
DOI: 10.3390/molecules26103003
sha: c8753cb6f1d12f0ab7b9d3d5c888851e9da8aa0c
doc_id: 859999
cord_uid: n8is814a

COVID-19 represents a new potentially life-threatening illness caused by severe acute respiratory syndrome coronavirus 2 or SARS-CoV-2 pathogen. In 2021, new variants of the virus with multiple key mutations have emerged, such as B.1.1.7, B.1.351, P.1 and B.1.617, and are threatening to render available vaccines or potential drugs ineffective. In this regard, we highlight 3CL(pro), the main viral protease, as a valuable therapeutic target that possesses no mutations in the described pandemically relevant variants. 3CL(pro) could therefore provide trans-variant effectiveness that is supported by structural studies and possesses readily available biological evaluation experiments. With this in mind, we performed a high throughput virtual screening experiment using CmDock and the “In-Stock” chemical library to prepare prioritisation lists of compounds for further studies. We coupled the virtual screening experiment to a machine learning-supported classification and activity regression study to bring maximal enrichment and available structural data on known 3CL(pro) inhibitors to the prepared focused libraries. All virtual screening hits are classified according to 3CL(pro) inhibitor, viral cysteine protease or remaining chemical space based on the calculated set of 208 chemical descriptors. Last but not least, we analysed if the current set of 3CL(pro) inhibitors could be used in activity prediction and observed that the field of 3CL(pro) inhibitors is drastically under-represented compared to the chemical space of viral cysteine protease inhibitors. We postulate that this methodology of 3CL(pro) inhibitor library preparation and compound prioritisation far surpass the selection of compounds from available commercial “corona focused libraries”.

Coronavirus disease is an infectious disease caused by a novel severe acute respiratory syndrome coronavirus 2 or SARS-CoV-2. COVID-19 was initially reported in Wuhan province in China and was declared as a global pandemic [1] . COVID-19 is a severe illness similar to flu, with major symptoms being cough, fever and breathing difficulty. Furthermore, the illness can cause systemic inflammation [2, 3] . The pathogen SARS-CoV-2 belongs to the Coronaviridae family, an enveloped positive-sense single-stranded (+ssRNA) RNA virus, and is closely related to the previously described SARS-CoV and MERS-CoV coronaviruses [4] . The SARS-CoV-2 genome shares 82% sequence identity with SARS-CoV and 90% identity with MERS-CoV and shares common pathogenesis mechanisms [5] . It is of key interest that no mutations have been observed on SARS-CoV-2 3CL pro main protease (location 3292 → 3582 on ORF1ab polyprotein; YP_009724389.1) and only one mutation on SARS-CoV-2 PL pro protease (location 1564 → 1882 on ORF1ab polyprotein; YP_009724389.1) for each variant [22] . It should be stressed, however, that this does not mean that these proteins cannot mutate in the future. In this context, 3CL pro is an attractive target for novel antiviral discovery with a potential for trans-variant activity [23] [24] [25] [26] 3CL pro (Picornain 3C-like protease, also referred to as M pro for main protease) is a homodimeric cysteine protease (EC 3.4.22 .69) and is 96% sequence identical with the SARS-CoV M pro [27] . The enzyme belongs to the family C30 peptidases in the PA peptidases clan. It consists of two 306 residues long polypeptide chains, which fold into three domains (I, II and III). Domains I and II have an antiparallel β-barrel structure, while domain III is composed of 5 α-helices, which connect to domain II by a long loop region. Judging by its dimer interface, it seems the dimer (comprised of protomers A and B) is an active form which is considerably less efficient when isolated in its monomer form. The active site is readily accessible to the solvent and is located distal to the dimer interface [28] . The substrate binding site is comprised of pockets P1, P1', P2 and P3. The P1 subsite is formed with Phe140, Asn142, Glu166, His163 and His172 residues ( Figure 1 ) and two conserved water molecules. P2 is a deep pocket comprised of His41, Met49, Tyr54, Met165 and Asp187 residues while P3 is defined by Leu168 and flanked by Glu166, Pro168 and Gly170 [29] . Proteolysis occurs via a catalytic dyad defined by Cys145 and His41 [30] . The enzyme is responsible for cleavage on no less than 11 sites on the large viral polyprotein 1ab. Cleavage generally follows the pattern Leu/Phe/Met-Gln ↓ Gly/Ser/Ala (↓ denotes the cleavage site). Glutamine at the P1 position is crucial for proteolysis to occur. As there are no known native human enzymes with such cleavage sites, M pro looks to be an ideal drug target, since there is a low risk for toxic effects on host cells [31] [32] [33] . Structural data on the enzyme is available and reporter assays developed making this target suitable for novel antiviral design [34, 35] . by its dimer interface, it seems the dimer (comprised of protomers A and B) is an active form which is considerably less efficient when isolated in its monomer form. The active site is readily accessible to the solvent and is located distal to the dimer interface [28] . The substrate binding site is comprised of pockets P1, P1', P2 and P3. The P1 subsite is formed with Phe140, Asn142, Glu166, His163 and His172 residues ( Figure 1 ) and two conserved water molecules. P2 is a deep pocket comprised of His41, Met49, Tyr54, Met165 and Asp187 residues while P3 is defined by Leu168 and flanked by Glu166, Pro168 and Gly170 [29] . Proteolysis occurs via a catalytic dyad defined by Cys145 and His41 [30] . The enzyme is responsible for cleavage on no less than 11 sites on the large viral polyprotein 1ab. Cleavage generally follows the pattern Leu/Phe/Met-Gln ↓ Gly/Ser/Ala (↓ denotes the cleavage site). Glutamine at the P1 position is crucial for proteolysis to occur. As there are no known native human enzymes with such cleavage sites, M pro looks to be an ideal drug target, since there is a low risk for toxic effects on host cells [31] [32] [33] . Structural data on the enzyme is available and reporter assays developed making this target suitable for novel antiviral design [34, 35] . In this study, we perform a high-throughput virtual screening (HTVS) experiment coupled with machine learning classification to offer a prioritisation approach for compounds with potential activity on SARS-CoV-2 3CL pro . We employ fast methodologies to cover a comprehensive chemical space based on molecular docking scores in order to offer prioritisation lists of compounds for further free energy calculations and suitable for biological evaluation [36] . We focus on identifying novel potential non-covalent protease inhibitors, as we firmly believe they offer the flexibility of optimisation and synthetic or commercial availability [37, 38] .

For our HTVS (high-throughput virtual screening) experiment, we employed the ZINC 15 library [39] . To produce robust results that would enable downstream experimental support and biological evaluation, we specifically selected the "In-Stock" library In this study, we perform a high-throughput virtual screening (HTVS) experiment coupled with machine learning classification to offer a prioritisation approach for compounds with potential activity on SARS-CoV-2 3CL pro . We employ fast methodologies to cover a comprehensive chemical space based on molecular docking scores in order to offer prioritisation lists of compounds for further free energy calculations and suitable for biological evaluation [36] . We focus on identifying novel potential non-covalent protease inhibitors, as we firmly believe they offer the flexibility of optimisation and synthetic or commercial availability [37, 38] .

For our HTVS (high-throughput virtual screening) experiment, we employed the ZINC 15 library [39] . To produce robust results that would enable downstream experimental support and biological evaluation, we specifically selected the "In-Stock" library subset that includes commercially readily-available compounds. The subset was trenched to exclude small fragments below the molecular weight of 200 g/mol and included 9,232,022 Figure 2 ). The next step was compound ionisation at the pH of 7.4, the calculation of initial 3D conformations for the whole database, the enumeration of undefined chiral centres, and the removal of structural faults. Smiles strings were syntactically validated (all rings/branches closed, no illegal atom types) with RDKit software (MolFromSmiles). Ionisation was performed using OpenEye QUACPAC 2.1.1.0 software (OpenEye Scientific Software, Inc., Santa Fe, NM, USA; www.eyesopen.com; accessed on 8 May 2021) with FixpKa module and ionise 7.4 parameters. The initial 3D conformation was calculated with OpenEye OMEGA 2.5.1.4 software (OpenEye Scientific Software, Inc., Santa Fe, NM, USA; www.eyesopen.com). The following parameters were used: maxconfs 1 and flipper: true [40] . 

Upon examination of available SARS-CoV-2 3CL pro crystal structures at the time, we selected the complex (PDB ID: 6Y7M) with an excellent resolution of 1.9 Å deposited by Zhang L. et al. [41] . This complex contained a relatively large peptidomimetic inhibitor OEW (MW = 585.69 g/mol) occupying major pockets at the enzyme active site, essentially keeping the S1 pocket in the correct shape and the enzyme in the active conformation. After superimposition to a reference structure (PDB ID:6LU7) by Yang et al., the catalytic binding pocket was defined around Cys145 as published previously by Jukic et al. [29, 42] . Finally, the target was prepared as a sphere of 7 Å around the ligand for docking volume calculation using the CmDock docking package (https://gitlab.com/Jukic/cmdock/; accessed on 8 May 2021; Figure 3 ; details are in Supplementary Materials). 

Upon examination of available SARS-CoV-2 3CL pro crystal structures at the time, we selected the complex (PDB ID: 6Y7M) with an excellent resolution of 1.9 Å deposited by Zhang L. et al. [41] . This complex contained a relatively large peptidomimetic inhibitor OEW (MW = 585.69 g/mol) occupying major pockets at the enzyme active site, essentially keeping the S1 pocket in the correct shape and the enzyme in the active conformation. After superimposition to a reference structure (PDB ID:6LU7) by Yang et al., the catalytic binding pocket was defined around Cys145 as published previously by Jukic et al. [29, 42] . Finally, the target was prepared as a sphere of 7 Å around the ligand for docking volume calculation using the CmDock docking package (https://gitlab.com/Jukic/cmdock/; accessed on 8 May 2021; Figure 3 ; details are in Supplementary Materials). 

For the HTVS campaign, the complete pre-prepared database was docked using CmDock into the prepared receptor binding site to afford 1‰ of top-scoring compounds with the Z-score cutoff of -7.7 ( Figure 4 ). In order to inspect all chemical space, filtering was done post-docking, where subse quently PAINS [43] [44] [45] and REOS structures were filtered out [46, 47] . KNIME software with RDKit nodes was applied to compare all structures in the library to the selection o SMARTS-formatted PAINS and to remove flagged hits from the database followed by REOS filtering with Schrödinger SMD software (Schrödinger LLC., New York, NY, USA) In this way, approximately 40% of the pan-assay interfering and reactive compounds (with labile functional groups) were filtered out, and the top 200 scoring compounds were examined by cluster analysis and classified in the current 3CL pro -actives space. Hierar . Generated receptor volume with the 3CL pro PDB ID: 6Y7M, the active site near the Cys145 residue and the docking volume defined as a sphere of 7 Å around the reference ligand OEW. Protein is depicted as a pink-magenta-coloured cartoon model with residues emphasised in the line model and coloured atoms (carbon in green, oxygen in red, nitrogen in blue and hydrogen in white colour) with a blue-white transparent active site surface. Docking volume boundary mesh is depicted in blue colour. Isomesh (0.99) was constructed using PyMol 2.1.0 using a grid calculated with cmgrid software (v 0.1.1).

For the HTVS campaign, the complete pre-prepared database was docked using CmDock into the prepared receptor binding site to afford 1‰ of top-scoring compounds with the Z-score cutoff of -7.7 ( Figure 4 ). 

For the HTVS campaign, the complete pre-prepared database was docked using CmDock into the prepared receptor binding site to afford 1‰ of top-scoring compounds with the Z-score cutoff of -7.7 ( Figure 4 ). In order to inspect all chemical space, filtering was done post-docking, where subsequently PAINS [43] [44] [45] and REOS structures were filtered out [46, 47] . KNIME software with RDKit nodes was applied to compare all structures in the library to the selection of SMARTS-formatted PAINS and to remove flagged hits from the database followed by REOS filtering with Schrödinger SMD software (Schrödinger LLC., New York, NY, USA). In this way, approximately 40% of the pan-assay interfering and reactive compounds (with labile functional groups) were filtered out, and the top 200 scoring compounds were examined by cluster analysis and classified in the current 3CL pro -actives space. Hierarchical clustering was performed using Schrödinger SMD (Schrödinger LLC., New York, In order to inspect all chemical space, filtering was done post-docking, where subsequently PAINS [43] [44] [45] and REOS structures were filtered out [46, 47] . KNIME software with RDKit nodes was applied to compare all structures in the library to the selection of SMARTS-formatted PAINS and to remove flagged hits from the database followed by REOS filtering with Schrödinger SMD software (Schrödinger LLC., New York, NY, USA). In this way, approximately 40% of the pan-assay interfering and reactive compounds (with Molecules 2021, 26, 3003 6 of 14 labile functional groups) were filtered out, and the top 200 scoring compounds were examined by cluster analysis and classified in the current 3CL pro -actives space. Hierarchical clustering was performed using Schrödinger SMD (Schrödinger LLC., New York, NY, USA) using Molprint2D hashed fingerprints, Tanimoto similarity metric and average cluster linkage method; the 17 clusters have been estimated by the Kelley criterion [48] . The final compound selection was performed with top-scoring hits with consideration of QuickProp QPlogS (Schrödinger LLC., New York, NY, USA) descriptor (>−6.5) in order to focus on compounds that have a greater potential to be soluble. Complete QuickProp descriptor set was calculated to flag all un-/desired properties and enable further custom prioritisation of compounds, and the whole set of top 200 hits is supplied in the Supplementary Materials for the convenience of future research. Upon examination of all top-scoring compounds, we identified that they all conform to the classic P1-P2-P3 binding pockets, as described previously [42] . Predicted bound conformations for the top-scoring hit as well as for the first ten hit compounds are analogous (and in accordance to the crystal ligand OEW), and compounds interact with key Thr25, Leu27, Gly143, Ser144, Cys145, His163, His164, Met165, Glu166, Asp187, Thr190, Gln189 and Gln192 residues at the 3CL pro active site (Table 2, Figure 5 ). hit as well as for the first ten hit compounds are analogous (and in accordance to the crystal ligand OEW), and compounds interact with key Thr25, Leu27, Gly143, Ser144, Cys145, His163, His164, Met165, Glu166, Asp187, Thr190, Gln189 and Gln192 residues at the 3CL pro active site (Table 2, Figure 5 ).

(A) (B) pockets, as described previously [42] . Predicted bound conformations for the top-scoring hit as well as for the first ten hit compounds are analogous (and in accordance to the crystal ligand OEW), and compounds interact with key Thr25, Leu27, Gly143, Ser144, Cys145, His163, His164, Met165, Glu166, Asp187, Thr190, Gln189 and Gln192 residues at the 3CL pro active site (Table 2, Figure 5 ).

(A) (B) 

We also investigated whether state-of-the-art machine learning algorithms could be of use for compound prioritisation. In this work, we implemented four different machine learning algorithms capable of detecting various complexities and evaluated their performance on the training data. The training data for the regression experiment consisted of all compounds with activity on the 3CL pro enzyme (ChEMBL standard value IC50 in nM, 133 compounds with 1k decoy set, available in Supplementary Materials) present in the ChEMBL database, while the training data for classification consisted of chemical libraries active on 3CL pro as well as actives on all viral cysteine proteases (3145 compounds with 1k decoy set, available in Supplementary Materials) [49] . For all compounds, we used RDKit.Chem package. RDKit.Chem.Descriptors module calculated a set of 208 chemical descriptors in order to capture the respective chemical space. In this manner, we could effectively classify compounds in the context of their representative chemical space: similar to known 3CL pro inhibitors, similar to known viral cysteine protease inhibitors, or belonging to a different chemical space to help in future design directions for individual scaffolds. The learning algorithms were (from simplest to more complex ones): a majority classifier (prediction of the most common class), a logistic regression classifier (linear classifier), and two non-linear classifiers, namely a deep feedforward neural network, designed specifically for this task, as well as the extreme gradient boosting machines (XGB), which are considered a strong learner in contemporary chemoinformatics [50] [51] [52] [53] . 

We also investigated whether state-of-the-art machine learning algorithms could be of use for compound prioritisation. In this work, we implemented four different machine learning algorithms capable of detecting various complexities and evaluated their performance on the training data. The training data for the regression experiment consisted of all compounds with activity on the 3CL pro enzyme (ChEMBL standard value IC50 in nM, 133 compounds with 1k decoy set, available in Supplementary Materials) present in the ChEMBL database, while the training data for classification consisted of chemical libraries active on 3CL pro as well as actives on all viral cysteine proteases (3145 compounds with 1k decoy set, available in Supplementary Materials) [49] . For all compounds, we used RDKit.Chem package. RDKit.Chem.Descriptors module calculated a set of 208 chemical descriptors in order to capture the respective chemical space. In this manner, we could effectively classify compounds in the context of their representative chemical space: similar to known 3CL pro inhibitors, similar to known viral cysteine protease inhibitors, or belonging to a different chemical space to help in future design directions for individual scaffolds. The learning algorithms were (from simplest to more complex ones): a majority classifier (prediction of the most common class), a logistic regression classifier (linear classifier), and two non-linear classifiers, namely a deep feedforward neural network, designed specifically for this task, as well as the extreme gradient boosting machines (XGB), which are considered a strong learner in contemporary chemoinformatics [50] [51] [52] [53] . Upon examination of all topscoring compounds, we identified that they all conform to the classic P1-P2-P3 binding pockets, as described previously [42] . Predicted bound conformations for the top-scoring hit as well as for the first ten hit compounds are analogous (and in accordance to the crystal ligand OEW), and compounds interact with key Thr25, Leu27, Gly143, Ser144, Cys145, His163, His164, Met165, Glu166, Asp187, Thr190, Gln189 and Gln192 residues at the 3CL pro active site (Table 2, Figure 5 ).

(A) (B) 

We also investigated whether state-of-the-art machine learning algorithms could be of use for compound prioritisation. In this work, we implemented four different machine learning algorithms capable of detecting various complexities and evaluated their performance on the training data. The training data for the regression experiment consisted of all compounds with activity on the 3CL pro enzyme (ChEMBL standard value IC50 in nM, 133 compounds with 1k decoy set, available in Supplementary Materials) present in the ChEMBL database, while the training data for classification consisted of chemical libraries active on 3CL pro as well as actives on all viral cysteine proteases (3145 compounds with 1k decoy set, available in Supplementary Materials) [49] . For all compounds, we used RDKit.Chem package. RDKit.Chem.Descriptors module calculated a set of 208 chemical descriptors in order to capture the respective chemical space. In this manner, we could effectively classify compounds in the context of their representative chemical space: similar to known 3CL pro inhibitors, similar to known viral cysteine protease inhibitors, or belonging to a different chemical space to help in future design directions for individual scaffolds.

The learning algorithms were (from simplest to more complex ones): a majority classifier (prediction of the most common class), a logistic regression classifier (linear classifier), and two non-linear classifiers, namely a deep feedforward neural network, designed specifically for this task, as well as the extreme gradient boosting machines (XGB), which are considered a strong learner in contemporary chemoinformatics [50] [51] [52] [53] . [54, 55] . For the docking receptor preparation, the covalent bond was cleaved, a small molecule removed, and the Cys145 amino-acid residue regenerated (Open Source PyMOL, release 2.1). The target was prepared using Schrödinger Small-Molecule Discovery Suite (Schrödinger LLC., New York), protein preparation module. Missing hydrogens were added, h-bonding network was optimised using PROPKA tool at pH 7.4, waters were removed, and restrained minimisation was performed with a convergence of heavy atoms towards 0.3 Å. Finally, using the cmcavity program from CmDock docking package (https://gitlab.com/Jukic/cmdock/; accessed on 8 May 2021), we generated a docking receptor [56] [57] [58] . The reference ligand method was employed for cavity calculation (receptor definition), where we used the OEW-cleaved regenerated ligand as a reference and a sphere of 7 Å around the ligand for the docking volume (cavity volume) calculation. We calculated a total cavity volume of 3106. 25 

For the virtual screening experiment (HTVS), a docking approach with a robust CmDock software, the prepared compound database, and the docking receptor (Cavity #1), as described in the previous section, were employed [56, 57] . Firstly, we conducted a redocking experiment. The reference-regenerated OEW peptidomimetic ligand (PDB ID: 6Y7M) was prepared as a SMILES string and energy-minimised in Ligprep tool from Schrödinger SMD using the OPLS 3e forcefield. The minimised structure was subsequently used as an input for the redocking experiment into the prepared receptor in a non-covalent manner. Applied parameters for the CmDock software (v 0.1.1) were standard docking protocol (dock.prm) with 100 runs, no constraints, and no score filters. We successfully retrieved the crystal-complex binding conformation of the OEW ligand with an RMSD of 1.34 Å. Furthermore, we calculated the receiver operating characteristic (ROC) curve to validate the performance of the classifier docking method. We selected a set of known SARS-CoV 3CL pro inhibitors from the ChEMBL database with experimental IC 50 < 100 µM values and created a testing database by the addition of negative control compounds that were calculated decoys based on employed actives using DUD-E: A Database of Useful (Docking) Decoys [59] . Upon using 1% and 10% of actives in the test database, we obtained a ROC AUC of 0.80 and 0.79, respectively. We also employed the activity data from the PostEra Covid Moonshot project (https://covid.postera.ai/covid/activity_data; accessed on 8 May 2021). We selected compounds with pIC 50 above 7 as true actives and compounds with pIC 50 up to 4.00436 as inactives or experimental decoys (compounds with no data were left out). When using 2% and 10% of actives in the test set, we obtained ROC AUCs of 0.61, indicating that our docking protocol can indeed identify active compounds and produce enriched libraries.

In order to effectively utilise CPU-time in downstream calculations, we analysed the chemical library performance in HTVS. We sampled a random 10% population of the designed library and performed an exhaustive docking experiment on 977,600 molecules (dock.prm protocol, 100 runs per molecule, no constraints, and no score filter). We then analysed the docking results using the sdreport script (part of CmDock package) and KNIME software, where the mean docking score was −12.317, and the standard deviation was 4.273. Upon passthrough analysis using rbhtfinder script (part of CmDock package, parameters: −15 and −20), the HTVS analysis afforded an optimal HTVS experiment workflow where up to 5 runs were performed for molecules that possessed docking score of −18 and up to 15 runs for molecules that were found with docking scores of −25. In the case of docking scores lower than −25, 50 runs were performed. The first filter was passed by 25.7% of molecules, the second filter was passed by 1.71% of molecules, and the run was on average 7.8% CPU-time compared to the exhaustive docking run-time, thereby effectively improving the HTVS efficiency.

The deep neural network consisted of six hidden layers, where each layer was activated with the ReLU activation, followed by dropout-based regularisation, set to 0.3 [60] . The final layer of the neural network was activated with a softmax function to obtain a distribution across the space of possible classes. During prediction, the most probable class is selected (argmax across class probabilities). The extreme gradient boosting machines are a type of tree-based ensemble classifier capable of fast learning and robust and accurate predictions. The evaluation regime was designed as follows: we randomly sampled ten different stratified splits, where the training data was used to learn the models, followed by the evaluation of their predictions on the test data. The splits were in ratio 8:2. We report the average performance with a standard deviation for the macro F1 score (the task considered is a multiclass classification). Apart from the classification task, the point was to differentiate between chemical spaces occupied by known 3Clpro inhibitors reported in ChEMBL, known viral cysteine protease inhibitors in the ChEMBL database, and other chemical space. We also conducted a similar series of experiments to assess to what extent machine learning can predict the inhibition of 3CL pro directly (on the basis of the ChEMBL standard value IC50 in nM), which is a regression task. The regression variants of the algorithms mentioned above were considered, namely the XGB with the "req:squarederror" loss, the neural network's classification head was replaced with a regression one, and instead of logistic regression, we used a simple linear regressor. The neural network that performed best consisted of seven (regularized) hidden layers; other hyperparameters were the same as in the classification scenario. As the neural network's expected output is a positive real number, the final activation used was a ReLU as well, as we observed faster convergence compared to using no activation at all. Mean squared errors with standard deviation across ten random splits (see the previous section) are reported (Table 3) . The neural network and XGB performed, as expected, the best, followed by a simpler linear learner. Compared to the majority baseline, e.g., the neural network's performance is substantially higher, indicating that it learned to differentiate between the classes. On the contrary, the average-training baseline (averaged train predictions) performed better than the linear classifier, indicating that the regression problem at hand is relatively hard and potentially not linearly separable. The XGB and neural networks performed better than this baseline, albeit not by a large margin, indicating the actives 3CL pro dataset was insufficient for effective training. However, the regression task similarly indicates that it is possible to use non-linear models for direct activity prediction necessitating good data collection on 3CL pro actives in the future. Furthermore, the authors are aware of the generalisation of the idea on other well-represented datasets and postulate its great value in future in silico drug design.

Herein, we presented a novel in silico approach towards compound prioritisation in the design of novel 3CL pro inhibitors. Rather than relying on commercial "corona-focused libraries", we performed a robust and efficient HTVS experiment to obtain virtual hit compounds. We then coupled the method to a machine learning classification experiment where each compound is classified into the chemical space of 3CL pro inhibitors, into general viral cysteine protease inhibitors, or into a completely novel unrelated chemical space. In this way, medicinal chemists can turn their attention towards compounds that would otherwise escape and derive insight into the possible mechanism of action. Therefore, we employed a complete library of 3CL pro inhibitors and viral cysteine protease inhibitors obtained from the ChEMBL library using the calculated ensemble of 208 descriptors for each compound in the library. We reported ten top-scoring compounds as viable binders supported by CmDock docking calculations considering the QuickProp QPlogS descriptor that indicates possible soluble compounds. Namely, solubility forms one of the serious problems when transitioning from in silico enrichment lists towards physical samples for in vitro evaluation. We supply full lists of hit compounds in the Supplementary Materials for the reader's benefit, especially if alternative compound selection criteria are desired, all with the aim of providing the medicinal chemistry community with a viable prioritisation library of potential 3CL pro inhibitors tailored for further molecular dynamics (MD) studies [61] as well as experimental design and development. Namely, reported enriched lists of compounds can serve as starting points, whereas compounds can be purchased commercially or be a subject of a synthetic campaign. Compounds can thus serve as experimental decoys or, if successful, progress as leads or probes.

Supplementary Materials: Details on target preparation and virtual screening. References [62] and [63] are cited in the supplementary materials. 

Clinical features of patients infected with 2019 novel coronavirus in Wuhan

Clinical characteristics of 138 hospitalized patients with 2019 novel coronavirus-infected pneumonia in Wuhan, China

A novel coronavirus outbreak of global health concern

Emerging coronaviruses: Genome structure, replication, and pathogenesis

Genomic characterisation and epidemiology of 2019 novel coronavirus: Implications for virus origins and receptor binding

Durability of responses after SARS-CoV-2 mRNA-1273 vaccination

vaccines: Status report. Immunity

SARS-CoV-2 vaccines in development

The SARS-CoV-2 vaccine pipeline: An overview

Analysis of therapeutic targets for SARS-CoV-2 and discovery of potential drugs by computational methods

Candidate drugs against SARS-CoV-2 and COVID-19

Updated approaches against SARS-CoV-2

A SARS-CoV-2 protein interaction map reveals targets for drug repurposing

Discovery of SARS-CoV-2 antiviral drugs through large-scale compound repurposing

Insights into SARS-CoV-2 genome, structure, evolution, pathogenesis and therapies: Structural genomics approach

Transmission of SARS-CoV-2 Lineage, B. 1.1. 7 in England: Insights from linking epidemiological and genetic data

Emergence and rapid spread of a new severe acute respiratory syndrome-related coronavirus 2 (SARS-CoV-2) lineage with multiple spike mutations in South Africa

Will SARS-CoV-2 variants of concern affect the promise of vaccines?

Adaptation of SARS-CoV-2 in BALB/c mice for testing vaccine efficacy

SARS-CoV-2 501Y. V2 escapes neutralization by South African COVID-19 donor plasma

Genomic characterisation of an emergent SARS-CoV-2 lineage in Manaus: Preliminary findings

Mechanism and inhibition of SARS-CoV-2 PLpro

Discovery of new hydroxyethylamine analogs against 3CLpro protein target of SARS-CoV-2: Molecular docking, molecular dynamics simulation, and structure-activity relationship studies

Structural basis of SARS-CoV-2 3CLpro and anti-COVID-19 drug discovery from medicinal plants

Drug repurposing studies targeting SARS-CoV-2: An ensemble docking approach on drug target 3C-like protease (3CLpro)

Inhibition of the main protease 3cl-pro of the coronavirus disease 19 via structurebased ligand design and molecular modeling

The picornaviral 3C proteinases: Cysteine nucleophiles in serine proteinase folds

Mechanism for Controlling the Dimer-Monomer Switch and Coupling Dimerization to Catalysis of the Severe Acute Respiratory Syndrome Coronavirus 3C-Like Protease

Structure of M pro from COVID-19 virus and discovery of its inhibitors

Coronavirus Main Proteinase (3CLpro) Structure: Basis for Design of Anti-SARS Drugs

Crystallographic studies on coronaviral proteases enable antiviral drug design. FEBS

Ketoamides as Broad-Spectrum Inhibitors of Coronavirus and Enterovirus Replication: Structure-Based Design

Potent Noncovalent Inhibitors of the Main Protease of SARS-CoV-2 from Molecular Sculpting of the Drug Perampanel Guided by Free Energy Perturbation Calculations

Development of a fluorescence-based, high-throughput SARS-CoV-2 3CLpro reporter assay

Crystallographic models of SARS-CoV-2 3CLpro: In-depth assessment of structure quality and validation

Virtual Double-System Single-Box: A Nonequilibrium Alchemical Technique for Absolute Binding Free Energy Calculations: Application to Ligands of the SARS-CoV-2 Main Protease

Covalent inhibitors in drug discovery: From accidental discoveries to avoided liabilities and designed therapies

Targeted covalent inhibitors for drug design

ZINC 15-ligand discovery for everyone

Conformer Generation with OMEGA: Algorithm and Validation Using High Quality Structures from the Protein Databank and the Cambridge Structural Database

Crystal structure of SARS-CoV-2 main protease provides a basis for design of improved α-ketoamide inhibitors

Ensemble Docking Coupled to Linear Interaction Energy Calculations for Identification of Coronavirus Main Protease (3CL pro ) Non-Covalent Small-Molecule Inhibitors

Interpreting steep dose-response curves in early inhibitor discovery

New substructure filters for removal of pan assay interference compounds (PAINS) from screening libraries and for their exclusion in bioassays

KNIME workflow to assess PAINS filters in SMARTS format. Comparison of RDKit and Indigo cheminformatics libraries

Virtual screening-An overview

Hit identification and optimization in virtual screening: Practical recommendations based on a critical literature analysis: Miniperspective

An automated approach for clustering an ensemble of NMR-derived protein structures into conformationally-related subfamilies

ChEMBL: A large-scale bioactivity database for drug discovery

Deep Learning

Scikit-learn: Machine learning in Python

Xgboost: A Scalable Tree Boosting System

Anti-SARS-CoV-2 activities in vitro of Shuanghuanglian preparations and bioactive ingredients

3C-like protease inhibitors block coronavirus replication in vitro and improve survival in MERS-CoV-infected mice

Validation of an empirical RNA-ligand scoring function for fast flexible docking using RiboDock

rDock: A Fast, Versatile and Open Source Program for Docking Ligands to Proteins and Nucleic Acids

Pymol: An open-source molecular graphics tool. CCP4 Newsl. Protein Crystallogr

Directory of useful decoys, enhanced (DUD-E): Better ligands and decoys for better benchmarking

Dropout: A simple way to prevent neural networks from overfitting

Molecular dynamics simulations give insight into D-glucose dioxidation at C2 and C3 by Agaricus meleagris pyranose dehydrogenase

Bringing the MMFF force field to the RDKit: Implementation and validation

We gratefully acknowledge NVIDIA Corporation's support with the donation of GPU hardware used in this research. We thank OpenEye for the academic licencing of their software and their support. We thank Rita Podzuna and support from Schrödinger LLC. Thanks to Boštjan Laba for support in the community project. Last but not least: a heartfelt thanks from all authors to all participants in the COVID.SI community (www.covid.si and www.sidock.si) for supporting our work. This represents a wonderful example of citizen science, and we are grateful for all joint scientific and community work. Thank You All!

The authors declare no conflict of interest.