key: cord-0303637-vfd0su0w
authors: Gawriljuk, Victor O.; Kyaw Zin, Phyo Phyo; Foil, Daniel H.; Bernatchez, Jean; Beck, Sungjun; Beutler, Nathan; Ricketts, James; Yang, Linlin; Rogers, Thomas; Puhl, Ana C.; Zorn, Kimberley M.; Lane, Thomas R.; Godoy, Andre S.; Oliva, Glaucius; Siqueira-Neto, Jair L.; Madrid, Peter B.; Ekins, Sean
title: Machine Learning Models Identify Inhibitors of SARS-CoV-2
date: 2020-06-16
journal: bioRxiv
DOI: 10.1101/2020.06.16.154765
sha: d22e080257056d45f782199c16bbb20ad4d4e252
doc_id: 303637
cord_uid: vfd0su0w

With the ongoing SARS-CoV-2 pandemic there is an urgent need for the discovery of a treatment for the coronavirus disease (COVID-19). Drug repurposing is one of the most rapid strategies for addressing this need, and several groups have already selected numerous compounds for in vitro testing. These efforts have led to a growing database of molecules with in vitro activity against the virus. Machine learning models can assist drug discovery by predicting the best compounds based on previously published data. Herein we have implemented several machine learning methods to develop predictive models from recent SARS-CoV-2 in vitro inhibition data and used them to prioritize additional FDA-approved compounds, selected from our in-house compound library, for in vitro testing. Of the compounds predicted with a Bayesian machine learning model, CPI1062 and CPI1155 showed antiviral activity in HeLa-ACE2 cell-based assays and represent potential repurposing opportunities for COVID-19. This approach can be greatly expanded to exhaustively screen available molecules in silico for predicted activity against this virus, and it can serve as a prioritization tool for SARS-CoV-2 antiviral drug discovery programs. The very latest model for SARS-CoV-2 is available at www.assaycentral.org.

In December 2019, several cases of pneumonia of unknown etiology started to arise in Wuhan, China.
A new betacoronavirus was identified and named SARS-CoV-2 owing to its high similarity to the earlier SARS-CoV 1,2. This virus causes the disease that has been called COVID-19 3. Since then, SARS-CoV-2 has rapidly spread worldwide, prompting the World Health Organization to declare the outbreak a pandemic, with more

The first criterion of the applicability assessment is whether the query molecule fits within the range of the key molecular descriptors of the training set (MW, MolLogP, NRB, TPSA, HBA, HBD). If at least four properties lie within the minimum and maximum values of the model's data, the molecule is considered similar and proceeds to the next criterion. The second criterion relies on structural fragment-based similarity, measured with the Tanimoto coefficient using MACCS fingerprints. The similarity of the MACCS fingerprints of the query compound to all training data is computed using the Tanimoto score. Only the 5% of training set compounds that are most similar to the query compound are used for evaluation (i.e., if the training set has 100 molecules, only the 5 molecules most similar to the query compound are used for the next evaluation). If the Tanimoto score against this 5% of the training set compounds exceeds 0.5, the model is considered to have enough structural fragment overlap with the query compound, and the compound goes on to the reliability assessment.

The reliability domain assessment implements k-means clustering methods based on ECFC6 fingerprints to classify the predictions from very high to low reliability. The reliability class depends on four criteria: distance from the major central point of the training data, distance from the closest cluster, closest cluster density, and closest cluster distance within the chemical space. Each criterion has different weights and scores, with the second and third having higher priority.
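As an aside, the two applicability criteria described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, fingerprints are represented as plain sets of on-bit indices (a stand-in for MACCS keys that a cheminformatics toolkit would normally generate), and the text does not state how the Tanimoto scores of the nearest 5% are aggregated, so the sketch assumes their mean must exceed 0.5.

```python
from statistics import mean

# Descriptor names taken from the text; the values would come from a toolkit.
DESCRIPTORS = ("MW", "MolLogP", "NRB", "TPSA", "HBA", "HBD")


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def in_descriptor_range(query, training, min_hits=4):
    """Criterion 1: at least `min_hits` descriptors lie within the training min/max."""
    hits = 0
    for name in DESCRIPTORS:
        values = [row[name] for row in training]
        if min(values) <= query[name] <= max(values):
            hits += 1
    return hits >= min_hits


def similar_to_nearest(query_fp, training_fps, fraction=0.05, threshold=0.5):
    """Criterion 2: similarity to the most similar `fraction` of the training set.

    Assumption: the mean Tanimoto score of those neighbours must exceed the
    threshold; the aggregation rule is not specified in the text.
    """
    scores = sorted((tanimoto(query_fp, fp) for fp in training_fps), reverse=True)
    if not scores:
        return False
    k = max(1, round(len(scores) * fraction))  # e.g. 5 of 100 molecules
    return mean(scores[:k]) > threshold


def in_applicability_domain(query_desc, query_fp, train_desc, train_fps):
    """A compound proceeds to the reliability assessment only if both criteria pass."""
    return in_descriptor_range(query_desc, train_desc) and similar_to_nearest(
        query_fp, train_fps
    )
```

Keeping the two criteria as separate functions makes it straightforward to swap in a different aggregation rule (for example, requiring every neighbour to exceed 0.5) if that is what the original method intends.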
If the compound scores 1 in each criterion, it is classified as very highly reliable; otherwise, only the two higher-priority criteria are considered for the remaining classes. The compound is classified as highly reliable if it scores a total of 2, moderately reliable if it scores between -1 and 2, or low reliability if it scores less than or equal to -1 in the two higher-priority criteria. The scores for each criterion, as well as their definitions, are extensively described in the Supplemental Methods.

Compounds were tested in a 10-point serial dilution experiment to determine the 50% inhibitory concentration (IC50) and 50% cytotoxicity concentration (CC50). 1,000 HeLa-ACE2 cells/well were added into 384-well plates with compounds in a volume of 25 nl.

The performance of the machine learning models on the external testing data is shown in Table 2. The external validation was used to measure model performance on data from different studies outside the training set. The svc and knn models had slightly better statistics than all other models, with the best balance between recall and specificity.

The PCA of the model training set alone shows that the SARS-CoV-2 chemical space is well distributed, with active and inactive molecules well mixed when analyzed using either molecular or fingerprint descriptors. When compared with the Prestwick Chemical Library (PwCL), a library of predominantly FDA-approved drugs, the SARS-CoV-2 data lie within a big cluster with molecular descriptors and are more widely distributed with the fingerprint descriptors.

suggesting it is likely more conservative. Indeed, in the PCA of the external test and training sets we can see that most molecules superimpose, with few of them distant from each other (Figure S1).
Therefore, similarity combined with clustering methods is more suitable for applicability and reliability assessment than structural similarity alone, as seen in the PCA.

A selection of FDA-approved drugs available to us in our relatively small in-house compound collection of hundreds of molecules was scored with the AC Bayesian model. A selection of some of the best scoring molecules (Table 3)

(Table 1). Compared with the other machine learning methods, AC outperforms all of them on the SARS-CoV-2 training set, but this may be because the threshold for all models was set to the value that is optimal for AC. However, choosing different values could imbalance the training set and remove important compounds from the active

More important than good performance on the training set is the performance on external data, since most prospective predictions will be made for molecules outside the training data. For external validation all models had intermediate performance, with ROC of 0.6. Taking into account the small number of molecules and that some test set molecules lie outside the applicability domain, the performance is acceptable. In contrast to the training set performance, svc had the highest overall score, predicting 60% of the active molecules despite its modest statistics in five-fold cross validation. The good performance of svc in predicting biological activity is in accordance with several studies that show good performance on different datasets 28,32,35,39. Therefore, the models described here are suitable for initial prospective predictions.
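To make the external-validation statistics discussed above concrete, the following minimal Python sketch computes recall, specificity and a rank-based ROC AUC from binary labels. The function names and the toy data in the usage note are illustrative assumptions, not the study's code or results, and the sketch assumes that the reported "ROC" refers to the area under the ROC curve.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 = active, 0 = inactive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def recall(y_true, y_pred):
    """Fraction of true actives that the model predicts as active."""
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(y_true, y_pred):
    """Fraction of true inactives that the model predicts as inactive."""
    _, fp, tn, _ = confusion_counts(y_true, y_pred)
    return tn / (tn + fp) if (tn + fp) else 0.0


def roc_auc(y_true, scores):
    """Probability that a random active outscores a random inactive (ties count 0.5)."""
    actives = [s for t, s in zip(y_true, scores) if t == 1]
    inactives = [s for t, s in zip(y_true, scores) if t == 0]
    if not actives or not inactives:
        raise ValueError("need at least one active and one inactive molecule")
    wins = sum(
        1.0 if a > i else 0.5 if a == i else 0.0
        for a in actives
        for i in inactives
    )
    return wins / (len(actives) * len(inactives))
```

With hypothetical labels such as `y_true = [1, 1, 1, 1, 1, 0, 0, 0]` and `y_pred = [1, 1, 1, 0, 0, 0, 0, 1]`, `recall` reports the fraction of the five actives that were recovered, which is the quantity the text refers to when it says a model predicts a given percentage of the active molecules.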
The applicability and reliability assessment shows that 73% of the test set

The molecules of the dataset do not have a common scaffold, but several common structural features that occur in active or inactive molecules can be highlighted, such as tertiary amines and aliphatic chains in active molecules, and phenyl rings and peptide-like features in inactive molecules (Figure S3). The most common active features appear in chloroquine, triparanol and tilorone, while the inactive features appear in darunavir, amprenavir and ritonavir (Figure 3). The lack of common scaffolds, and of features that appear in more than 30% of the active or inactive molecules, shows how diverse the active molecules are, which makes classification of these molecules a relatively difficult task.

or not the expression of TMPRSS2 affects compound activity 45.

As new data are continually being published, the machine learning models can be updated to increase performance in terms of both training and external test set validation. The very latest model for SARS-CoV-2 is available at www.assaycentral.org. In the meantime, we have shown that these models perform well in internal cross validation, external validation and prospective prediction, enabling us to find additional active molecules. These models should be used to prioritize compounds that have both a high prediction score and high reliability, as described herein. This is expected to return more reliable predictions that, together with drug discovery expertise, can help prioritize compounds for future in vitro testing.

We would like to kindly acknowledge Dr. Nancy Baker and Ms. Natasha Baker for their help in collating recently published SARS-CoV-2 data.
We also thank Biovia for

A new coronavirus associated with human respiratory disease in China
The species Severe acute respiratory syndrome-related coronavirus: classifying 2019-nCoV and naming it SARS-CoV-2
Race to find COVID-19 treatments accelerates
Coronavirus puts drug repurposing on the fast track
A bibliometric review of drug repurposing
Identification of antiviral drug candidates against SARS-CoV-2 from FDA-approved drugs
FDA approved drugs with broad anti-coronaviral activity inhibit SARS-CoV-2 in vitro
An orally bioavailable broad-spectrum antiviral inhibits SARS-CoV-2 and multiple endemic, epidemic and bat coronavirus
The FDA-approved Drug Ivermectin inhibits the replication of SARS-CoV-2 in vitro
Structure of Mpro from COVID-19 virus and discovery of its inhibitors
Teicoplanin potently blocks the cell entry of 2019-nCoV
Remdesivir and chloroquine effectively inhibit the recently emerged novel coronavirus (2019-nCoV) in vitro
chloroquine, is effective in inhibiting SARS-CoV-2 infection in vitro
Atazanavir inhibits SARS-CoV-2 replication and pro-inflammatory cytokine production
QSAR modeling: Where have you been? Where are you going to
Use of machine learning approaches for novel drug discovery
Exploiting machine learning for end-to-end drug discovery and development
Vanquishing the Virus: 160+ COVID-19 Drug and Vaccine Candidates in
A Large-scale Drug Repositioning Survey for SARS-CoV-2 Antivirals
Discovery of baicalin and baicalein as novel, natural product inhibitors of SARS-CoV-2 3CL protease in vitro
An orally bioavailable broad-spectrum antiviral inhibits SARS-CoV-2 in human airway epithelial cell cultures and multiple coronaviruses in mice
In vitro screening of a FDA approved chemical library reveals potential inhibitors of SARS-CoV-2 replication
Indomethacin has a potent antiviral activity against SARS CoV-2 in vitro and canine coronavirus in vivo
Abstract Approved Drugs as Inhibitors of K(v)7.1 and Na(v)1.8 to Treat Pitt Hopkins
High-throughput screening and Bayesian machine learning for copper-dependent inhibitors of Staphylococcus aureus
Multiple Machine Learning Comparisons of HIV Cell-based and Reverse Transcriptase Data Sets
Models Enable New in Vitro Leads
Substitution Influences Ketamine Metabolism by Cytochrome P450 2B6: In Vitro and Computational Approaches
Repurposing for Neglected Diseases
Comparing Multiple Machine Learning Algorithms and Metrics for Estrogen Receptor Binding Prediction
Comparing and Validating Machine Learning Models for Mycobacterium tuberculosis Drug Discovery
Assessment of Substrate-Dependent Ligand Interactions at the Organic Cation Transporter OCT2 Using Six Model Substrates
Comparing and Validating Machine Learning Models for Mycobacterium tuberculosis Drug Discovery
Scikit-learn
FDA approved drugs with broad anti-coronaviral activity inhibit SARS-CoV-2 in vitro
Multiple Machine Learning Methods and Metrics Using Diverse Drug Discovery
Trust, but Verify II: A Practical Guide to Chemogenomics Data Curation
Testing therapeutics in cell-based assays: Factors that influence the apparent potency of drugs
SARS-CoV-2 Cell Entry Depends on ACE2 and TMPRSS2 and Is Blocked by a Clinically Proven Protease
Characterization of spike glycoprotein of SARS-CoV-2 on virus entry and its immune cross-reactivity with SARS-CoV
Enhanced isolation of SARS-CoV-2 by TMPRSS2-expressing cells
Coronavirus 229E Bypass the Endosome for Cell Entry