key: cord-0303637-vfd0su0w
authors: Gawriljuk, Victor O.; Kyaw Zin, Phyo Phyo; Foil, Daniel H.; Bernatchez, Jean; Beck, Sungjun; Beutler, Nathan; Ricketts, James; Yang, Linlin; Rogers, Thomas; Puhl, Ana C.; Zorn, Kimberley M.; Lane, Thomas R.; Godoy, Andre S.; Oliva, Glaucius; Siqueira-Neto, Jair L.; Madrid, Peter B.; Ekins, Sean
title: Machine Learning Models Identify Inhibitors of SARS-CoV-2
date: 2020-06-16
journal: bioRxiv
DOI: 10.1101/2020.06.16.154765
sha: d22e080257056d45f782199c16bbb20ad4d4e252
doc_id: 303637
cord_uid: vfd0su0w

With the ongoing SARS-CoV-2 pandemic there is an urgent need for the discovery of a treatment for the coronavirus disease (COVID-19). Drug repurposing is one of the most rapid strategies for addressing this need, and numerous compounds have already been selected for in vitro testing by several groups. This has led to a growing database of molecules with in vitro activity against the virus. Machine learning models can assist drug discovery through prediction of the best compounds based on previously published data. Herein we have implemented several machine learning methods to develop predictive models from recent SARS-CoV-2 in vitro inhibition data and used them to prioritize additional FDA approved compounds, selected from our in-house compound library, for in vitro testing. From the compounds predicted with a Bayesian machine learning model, CPI1062 and CPI1155 showed antiviral activity in HeLa-ACE2 cell-based assays and represent potential repurposing opportunities for COVID-19. This approach can be greatly expanded to exhaustively screen available molecules in silico for predicted activity against this virus, and it can serve as a prioritization tool for SARS-CoV-2 antiviral drug discovery programs. The very latest model for SARS-CoV-2 is available at www.assaycentral.org.

In December 2019, several cases of pneumonia with unknown etiology started to arise in Wuhan, China. A new betacoronavirus was identified and named SARS-CoV-2 due to high similarity with previous SARS-CoV 1,2. This virus causes the disease which has been called COVID-19 3. Since then, SARS-CoV-2 has rapidly spread worldwide prompting the World Health Organization to declare the outbreak a pandemic, with more

The first criterion for the applicability assessment is whether the query molecule fits within the range of the key molecular descriptors of the training set (MW, MolLogP, NRB, TPSA, HBA, HBD). If at least four properties lie within the maximum and minimum values of the model's data, the molecule is considered similar and goes to the next criterion. The second criterion relies on structural fragment-based similarity, measured as the Tanimoto coefficient between the MACCS fingerprints of the query compound and those of all training data. Only the 5% of training set compounds that are most similar to the query compound are used for evaluation (i.e., if the training set has 100 molecules, only the 5 molecules most similar to the query compound are used for the next evaluation). If the Tanimoto score against this 5% of the training set compounds exceeds 0.5, the model is considered to have enough structural fragment overlap with the query compound, and the compound goes on to the reliability assessment.
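The two applicability criteria above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: descriptors are assumed to be precomputed and passed as dictionaries, fingerprints are represented as sets of on-bit indices (standing in for MACCS keys), and the Tanimoto score over the top 5% is assumed to be the mean over those compounds, which the text does not specify. All function names are hypothetical.

```python
DESCRIPTORS = ("MW", "MolLogP", "NRB", "TPSA", "HBA", "HBD")


def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def in_descriptor_range(query, training, min_hits=4):
    """Criterion 1: at least `min_hits` descriptors lie within the training min/max."""
    hits = 0
    for name in DESCRIPTORS:
        values = [mol[name] for mol in training]
        if min(values) <= query[name] <= max(values):
            hits += 1
    return hits >= min_hits


def top_fraction_similarity(query_fp, training_fps, fraction=0.05):
    """Mean Tanimoto score over the most similar `fraction` of the training set.

    Assumption: the mean is used to aggregate the top 5%; at least one
    training compound is always kept.
    """
    scores = sorted((tanimoto(query_fp, fp) for fp in training_fps), reverse=True)
    n_top = max(1, int(fraction * len(scores)))
    return sum(scores[:n_top]) / n_top


def in_applicability_domain(query, query_fp, training, training_fps, threshold=0.5):
    """Apply criterion 1 (descriptor ranges), then criterion 2 (similarity > threshold)."""
    if not in_descriptor_range(query, training):
        return False
    return top_fraction_similarity(query_fp, training_fps) > threshold
```

A compound that passes both checks would then move on to the reliability assessment; one that fails either check is treated as outside the applicability domain.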

The reliability domain assessment implements k-means clustering based on ECFC6 fingerprints to classify the predictions from very high to low reliability. The reliability class depends on four criteria: distance from the major central point of the training data, distance from the closest cluster, closest cluster density and closest cluster distance within the chemical space. Each criterion has different weights and scores, with the second and third having higher priority. If the compound scores 1 in each criterion, it is classified as very highly reliable; otherwise, only the two higher-priority criteria are considered for the remaining classes. The compound is classified as highly reliable if it scores a total of 2, moderately reliable if it scores between -1 and 2, or of low reliability if it scores less than or equal to -1 in the two higher-priority criteria. The scores for each criterion, as well as their definitions, are described in detail in the Supplemental Methods.
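The class assignment described above reduces to a small decision rule once the per-criterion scores are known. The sketch below assumes those four scores have already been computed (their definitions and weights are in the Supplemental Methods and are not reproduced here), treats "between -1 and 2" as exclusive on both ends, and assumes a priority total of 2 is the maximum. The function and argument names are hypothetical.

```python
def reliability_class(center_dist, cluster_dist, cluster_density, cluster_spread):
    """Map four per-criterion scores to a reliability label.

    Arguments follow the order of the criteria in the text:
    distance from the training set's central point, distance from the
    closest cluster, closest cluster density, and closest cluster distance
    within the chemical space. The second and third are the
    higher-priority criteria.
    """
    scores = (center_dist, cluster_dist, cluster_density, cluster_spread)
    if all(s == 1 for s in scores):
        return "very high"

    # Only the two higher-priority criteria decide the remaining classes.
    priority_total = cluster_dist + cluster_density
    if priority_total >= 2:  # assumption: 2 is the highest attainable total
        return "high"
    if priority_total > -1:
        return "moderate"
    return "low"
```

For example, a compound scoring 1 on both higher-priority criteria but 0 on the first criterion would be labeled "high" rather than "very high".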

Compounds were tested in a 10-point serial dilution experiment to determine the 50% inhibitory concentration (IC50) and 50% cytotoxicity concentration (CC50). 1,000 HeLa-ACE2 cells/well were added into 384-well plates with compounds in a volume of 25 nl.
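As a rough illustration of how a 50% point can be read off a 10-point dilution series, the sketch below builds a concentration series and interpolates the 50% crossing on a log-concentration scale. This is not the authors' curve-fitting procedure, which is not described in this text; the dilution factor, the top concentration and the function names are all assumptions made for the example.

```python
import math


def serial_dilution(top_conc, factor=3.0, points=10):
    """Concentrations of a serial dilution, highest first.

    The 3-fold factor is a placeholder; the text states only that
    10 points were used.
    """
    return [top_conc / factor ** i for i in range(points)]


def interpolate_half_max(concs, responses, level=50.0):
    """Concentration at which the response crosses `level` (percent).

    Uses linear interpolation on log10(concentration) between the two
    neighboring points that bracket the crossing. Returns None if the
    response never crosses `level`.
    """
    pairs = sorted(zip(concs, responses))
    for (c_lo, r_lo), (c_hi, r_hi) in zip(pairs, pairs[1:]):
        if (r_lo - level) * (r_hi - level) <= 0 and r_lo != r_hi:
            frac = (level - r_lo) / (r_hi - r_lo)
            log_c = math.log10(c_lo) + frac * (math.log10(c_hi) - math.log10(c_lo))
            return 10 ** log_c
    return None
```

The same helper could be applied to percent inhibition to estimate an IC50 and to percent cytotoxicity to estimate a CC50; a full dose-response fit would normally be preferred when enough points define the curve.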

The performance of the machine learning models on the external testing data is shown in Table 2. The external validation was used to measure model performance on data from different studies outside the training set. svc and knn had slightly better statistics than all other models, with the best balance between recall and specificity.
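For readers less familiar with these statistics, recall and specificity can be computed directly from the confusion counts of a binary classifier. The sketch below is a generic illustration with labels coded as 1 (active) and 0 (inactive); it does not reproduce the values in Table 2, and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded as 1 (active) / 0 (inactive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def recall(y_true, y_pred):
    """Fraction of true actives that the model predicts as active."""
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn)


def specificity(y_true, y_pred):
    """Fraction of true inactives that the model predicts as inactive."""
    _, fp, tn, _ = confusion_counts(y_true, y_pred)
    return tn / (tn + fp)
```

A model with a good balance between the two recovers most actives without labeling most inactives as active, which matters when only a few compounds can be sent for testing.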

The PCA of the model training set alone shows that the SARS-CoV-2 chemical space is well distributed, with active and inactive molecules well mixed when analyzed using either molecular or fingerprint descriptors. When compared with the Prestwick Chemical Library (PwCL), a library of predominantly FDA approved drugs, the SARS-CoV-2 data lie within one large cluster with molecular descriptors and are more widely distributed when using the fingerprint descriptors.

suggesting it is likely more conservative. Indeed, with the external test and training set PCA we can see that most molecules superimpose, with few of them distant from each other (Figure S1). Therefore, similarity together with clustering methods is more suitable for applicability and reliability assessment than structural similarity alone, as seen by the PCA.

A selection of FDA approved drugs available to us in our relatively small in-house compound collection of hundreds of molecules was scored with the AC Bayesian model.

A selection of some of the best scoring molecules (Table 3)

(Table 1). When compared with the other machine learning methods, AC outperforms all of them on the SARS-CoV-2 training set, but this may be because the threshold for all models was set to the value that is optimal for AC. However, choosing different values could imbalance the training set and remove important compounds from the active

More important than good performance on the training set is the performance on external data, since most prospective predictions will be made for molecules outside the training data. For external validation all models had intermediate performance, with ROC of 0.6. Taking into account the small number of molecules and that some test set molecules lie outside the applicability domain, the performance is acceptable. In contrast to the training set performance, svc had the highest overall score, predicting 60% of the active molecules despite its modest statistics in five-fold cross validation. The good performance of svc in predicting biological activity is in accordance with several studies that show good performance on different datasets 28,32,35,39. Therefore, the models described here are suitable for initial prospective predictions.
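To make the ROC figure concrete: the area under the ROC curve equals the probability that a randomly chosen active receives a higher score than a randomly chosen inactive, with ties counted as one half. The sketch below computes it by direct pairwise comparison. It is a generic illustration, not the evaluation code used in the study, and the function name is hypothetical.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of actives and inactives.

    y_true uses 1 for active and 0 for inactive; scores are the model's
    predicted scores, where a higher value means more likely active.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so a value near 0.6 on a small external set indicates modest but non-random enrichment.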

The applicability and reliability assessment shows that 73% of the test set

The molecules of the dataset do not have a common scaffold, but several common structural features that occur in active/inactive molecules can be highlighted, such as tertiary amines and aliphatic chains in active molecules and phenyl rings and peptide molecule features in inactive molecules (Figure S3). These most common active features appear in chloroquine, triparanol and tilorone, while the inactive features appear in darunavir, amprenavir and ritonavir (Figure 3). The lack of common scaffolds and of features that appear in more than 30% of the active or inactive molecules shows how diverse the active molecules are, which makes building classification models for these molecules a relatively difficult task.

or not the expression of TMPRSS2 affects compound activity. 45

As new data are continually being published, the machine learning models can be updated to increase performance in terms of both training and external test set validation. The very latest model for SARS-CoV-2 is available at www.assaycentral.org.

In the meantime, we have shown that these models perform well in internal cross validation, external validation and prospective prediction, enabling us to find additional active molecules. These models should be used to prioritize compounds that have both a high prediction score and high reliability as described herein. This is expected to return more reliable predictions that, together with drug discovery expertise, can help prioritize compounds for future in vitro testing.
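The recommended use of the models, keeping compounds with both a high prediction score and high reliability, can be sketched as a simple filter and sort. The score cutoff of 0.5, the minimum reliability class, the record layout and the compound names are illustrative assumptions, not values taken from the study.

```python
RELIABILITY_RANK = {"very high": 3, "high": 2, "moderate": 1, "low": 0}


def prioritize(predictions, min_score=0.5, min_reliability="high"):
    """Keep predictions meeting both cutoffs, best first.

    Each prediction is a dict with "name", "score" and "reliability" keys
    (a hypothetical layout). Results are sorted by score, with the
    reliability class as a tie-breaker.
    """
    floor = RELIABILITY_RANK[min_reliability]
    kept = [
        p for p in predictions
        if p["score"] >= min_score and RELIABILITY_RANK[p["reliability"]] >= floor
    ]
    return sorted(
        kept,
        key=lambda p: (p["score"], RELIABILITY_RANK[p["reliability"]]),
        reverse=True,
    )
```

The resulting shortlist is only a starting point; as the text notes, it is meant to be combined with drug discovery expertise before compounds are selected for testing.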

We would like to kindly acknowledge Dr. Nancy Baker and Ms. Natasha Baker for their help in collating recently published SARS-CoV-2 data. We also thank Biovia for

A new coronavirus associated with human respiratory disease in China

The species Severe acute respiratory syndrome-related coronavirus: classifying 2019-nCoV and naming it SARS-CoV-2

Race to find COVID-19 treatments accelerates

Coronavirus puts drug repurposing on the fast track

A bibliometric review of drug repurposing

Identification of antiviral drug candidates against SARS-CoV-2 from FDA-approved drugs

FDA approved drugs with broad anti-coronaviral activity inhibit SARS-CoV-2 in vitro

An orally bioavailable broad-spectrum antiviral inhibits SARS-CoV-2 and multiple endemic, epidemic and bat coronavirus

The FDA-approved Drug Ivermectin inhibits the replication of SARS-CoV-2 in vitro

Structure of Mpro from COVID-19 virus and discovery of its inhibitors

Teicoplanin potently blocks the cell entry of 2019-nCoV

Remdesivir and chloroquine effectively inhibit the recently emerged novel coronavirus (2019-nCoV) in vitro

chloroquine, is effective in inhibiting SARS-CoV-2 infection in vitro

Atazanavir inhibits SARS-CoV-2 replication and pro-inflammatory cytokine production

QSAR modeling: Where have you been? Where are you going to

Use of machine learning approaches for novel drug discovery

Exploiting machine learning for end-to-end drug discovery and development

Vanquishing the Virus: 160+ COVID-19 Drug and Vaccine Candidates in

A Large-scale Drug Repositioning Survey for SARS-CoV-2 Antivirals

Discovery of baicalin and baicalein as novel, natural product inhibitors of SARS-CoV-2 3CL protease in vitro

An orally bioavailable broad-spectrum antiviral inhibits SARS-CoV-2 in human airway epithelial cell cultures and multiple coronaviruses in mice

In vitro screening of a FDA approved chemical library reveals potential inhibitors of SARS-CoV-2 replication

Indomethacin has a potent antiviral activity against SARS CoV-2 in vitro and canine coronavirus in vivo

Approved Drugs as Inhibitors of K(v)7.1 and Na(v)1.8 to Treat Pitt Hopkins

High-throughput screening and Bayesian machine learning for copper-dependent inhibitors of Staphylococcus aureus

Multiple Machine Learning Comparisons of HIV Cell-based and Reverse Transcriptase Data Sets

Models Enable New in Vitro Leads

Substitution Influences Ketamine Metabolism by Cytochrome P450 2B6: In Vitro and Computational Approaches

Repurposing for Neglected Diseases

Comparing Multiple Machine Learning Algorithms and Metrics for Estrogen Receptor Binding Prediction

Comparing and Validating Machine Learning Models for Mycobacterium tuberculosis Drug Discovery

Assessment of Substrate-Dependent Ligand Interactions at the Organic Cation Transporter OCT2 Using Six Model Substrates

Comparing and Validating Machine Learning Models for Mycobacterium tuberculosis Drug Discovery

Scikit-learn

FDA approved drugs with broad anti-coronaviral activity inhibit SARS-CoV-2 in vitro

Multiple Machine Learning Methods and Metrics Using Diverse Drug Discovery

Trust, but Verify II: A Practical Guide to Chemogenomics Data Curation

Testing therapeutics in cell-based assays: Factors that influence the apparent potency of drugs

SARS-CoV-2 Cell Entry Depends on ACE2 and TMPRSS2 and Is Blocked by a Clinically Proven Protease

Characterization of spike glycoprotein of SARS-CoV-2 on virus entry and its immune cross-reactivity with SARS-CoV

Enhanced isolation of SARS-CoV-2 by TMPRSS2-expressing cells

Coronavirus 229E Bypass the Endosome for Cell Entry