key: cord-0838691-g1qxo01i
authors: Kowalewski, Joel; Ray, Anandasankar
title: Predicting novel drugs for SARS-CoV-2 using machine learning from a >10 million chemical space
date: 2020-08-06
journal: Heliyon
DOI: 10.1016/j.heliyon.2020.e04639
sha: b966d377a5ac89cf1d9824941a1eac0636a93294
doc_id: 838691
cord_uid: g1qxo01i

There is an urgent need for the identification of effective therapeutics for COVID-19 and we have developed a machine learning drug discovery pipeline to identify several drug candidates. First, we collect assay data for 65 target human proteins known to interact with the SARS-CoV-2 proteins, including the ACE2 receptor. Next, we train machine learning models to predict inhibitory activity and use them to screen FDA registered chemicals and approved drugs (∼100,000) and ∼14 million purchasable chemicals. We filter predictions according to estimated mammalian toxicity and vapor pressure. Prospective volatile candidates are proposed as novel inhaled therapeutics since the nasal cavity and respiratory tracts are early bottlenecks for infection. We also identify candidates that act across multiple targets as promising for future analyses. We anticipate that this theoretical study can accelerate testing of two categories of therapeutics: repurposed drugs suited for short-term approval, and novel efficacious drugs suitable for a long-term follow up.

SARS-CoV-2 is a novel coronavirus responsible for COVID-19, a rapidly evolving global pandemic. Coronaviruses primarily target the upper respiratory tract and the lungs, with varying degrees of severity. Related coronaviruses, such as SARS-CoV, which emerged in China in 2002, and MERS-CoV, which emerged in the Middle East in 2012, result in severe respiratory conditions. SARS-CoV-2 produces similarly severe respiratory conditions, albeit at a lower rate but with higher contagiousness [1]. Alarmingly, infected individuals may be asymptomatic carriers, presumably harboring the viral infection in the upper airway tract, increasing the likelihood of infecting populations that are most susceptible to severe complications [2, 3].

Although the mechanisms underlying SARS-CoV-2 infection are not completely understood, select human proteins are targets for the virus, including ACE2 [4]. The SARS-CoV-2 receptor binding domain (RBD) interacts strongly with the human ACE2 receptor and TMPRSS2 to enter a human cell [5]. In addition to ACE2, a recent systems-level analysis of protein-protein interactions with peptides encoded in the SARS-CoV-2 genome identified ~300 additional human proteins, of which 66 were considered suitable candidates for identification of therapeutics [6]. Gordon et al. performed an in vitro assay with human cells expressing 26 SARS-CoV-2 proteins, which was followed by an analysis for high-confidence interactions. Of the hundreds of reported interactions, 66 were prioritized, and the authors subsequently mined and tested FDA approved drugs that were known or suspected to target these human proteins. Most of the human target proteins are overexpressed in the respiratory tract. Of particular note is the entry receptor ACE2, which is expressed at high levels in a few cell types of the nasal epithelium, as well as elsewhere [6, 7]. This could be an unusual opportunity for volatile inhaled therapeutics and prophylactics that will have direct access to the cells that are infected by the virus.

The Gordon et al. study also identified FDA-approved drugs that have known activity against these human protein targets or are structurally related to chemicals with known activity on the targets. While these drugs have yet to be tested directly on the virus, another study performed high-throughput testing of ~12,000 FDA-approved or clinical stage drugs on viral replication in cell lines [8]. This study identified at least 6 potential leads that include a kinase inhibitor, a CCR1 inhibitor and 4 cysteine protease inhibitors that are candidates for testing in clinical trials.

Since the regulatory process for the approval of new drugs can take several years, the repurposing of FDA approved drugs for COVID-19 offers a potential fast-track to approval. One of the more promising candidates being tested is the antiviral Remdesivir, which has been effective in vitro [9] as well as in non-human primates [10], with human trials currently ongoing. The other drug being tested is the antimalarial hydroxychloroquine, which showed some promise alongside the antibiotic azithromycin in small clinical trials [11, 12]. However, hydroxychloroquine has shown less promise in larger trials for treating COVID-19 [13].

While drug repurposing is expedient, it is possible that drugs designed for other diseases will not be as well suited to respiratory organs, where a large percentage of putative human proteins targeted by the virus are enriched [6], or to the nervous system, implicated by neurological symptoms as well as prior evidence that coronaviruses can cross the blood brain barrier [14, 15]. Drug-development strategies are also often guided by minimizing off-target interactions. Repurposed drugs might have to be used in combination, and the side effects and interactions that this entails are presently not well defined. While there are recent efforts exploring novel, directed therapies from small molecule libraries [16], it is desirable to identify hundreds to thousands of putative chemicals, as the majority may be difficult to synthesize at scale, prove toxic at therapeutic concentrations, or yield inconsistent benefits across patients due to genetic variability. These shortcomings have significantly increased the demand for additional drugs or small molecules that might interfere with viral entry and replication. Additionally, if prophylactics or non-toxic, easy-to-use therapeutics were available even for mild cases that do not require hospitalization and experimental drug treatments, they could nevertheless affect long-term health and community transmission [17].

There are consequently unmet needs in COVID-19 research, including identification of compounds that target the relevant SARS-CoV-2 human proteins from (1) approved drugs, (2) FDA registered chemicals or (3) a large repository of ~14 million purchasable chemicals from the ZINC 15 database [18], for which we computed additional properties such as mammalian toxicity, vapor pressure, and logP. For 65 human protein targets that SARS-CoV-2 interacts with and that had publicly available bioassay and chemical data [6], we first generated a database of predictions based on structural similarity to chemicals that interact with the targets and then on machine learning models (34). Many chemicals we have identified have little or no known biological activity and are predicted to have low toxicity in addition to a wide range of vapor pressures. These data are a resource to rapidly identify and test novel, safe treatment strategies for COVID-19 and other diseases where the target proteins are relevant.

In order to test whether there is a structural basis for inhibitors of the target proteins identified previously [5, 6], we used two complementary approaches to evaluate each target's training set of compounds with known activity, compiled from the literature. First, we performed an exhaustive search for maximum common substructures among active chemicals. In some cases, enriched substructures were apparent among known ligands, with slight variation in the substructure based on the sensitivity to the targets, suggesting physicochemical features may be relevant in predicting activity against these targets (Supplementary Table 1). Next, we used a machine learning pipeline for predicting chemicals that interfere with SARS-CoV-2 targets. It involves selection of important physicochemical features for each target, followed by fitting support vector machines (SVM) with these features and then evaluating the predictions using various computational validation methods (Figure 1A). The chemical features that best predicted activity for the different targets included simple 2D information, describing the type and number of bonds, but also more abstract 3D geometries (Tables 1 and 2). Identification of each target-specific feature set provides a foundation to better understand the physicochemical basis of the activity. To that end, Supplementary Tables 2-3 include more comprehensive rank ordered lists of the physicochemical features that optimally predict activity against the targets (details about the feature ranking algorithms are in Materials and Methods).

We identified 24 targets with training sets large enough to model the log IC50, Ki, or AC50 (Figure 2A). Rigorous computational validation was performed and the results on training (Figure 2B, left) and test data that had been set aside (Figure 2C ...

For some of the viral targets, we noticed that assay data included additional inhibitory measurements. Some of the available data, such as % inhibition, are less quantitative. However, to include as much of the available data as possible, we created models to identify physicochemical features that might broadly contribute to inhibition. We therefore assigned binary (active and inactive) labels to the chemicals, then trained models as outlined before (Figure 2A; Materials and Methods). The models that were developed using this classification approach similarly proved successful, validating over partitions of the training data (avg. AUC = 0.87, avg. Shuffle AUC = 0.50, p < 10^-19) (Figure 2B, right), as well as over sets of external test chemicals (avg. AUC = 0.83, avg. Shuffle AUC = 0.51, p < 10^-8) (Figure 2C, right) (Supplementary Information 1). Collectively, these results suggested the models provided accurate predictions and could be used to screen approved drug libraries as well as databases of commercially available chemicals for novel therapeutics.

Repurposing of existing FDA approved drugs offers a path towards rapid deployment of therapeutics against SARS-CoV-2. Approved drugs may have activity that extends beyond the original target protein. Accordingly, we used the machine learning models to predict activities of 100,000 FDA registered chemicals (UNII database) [19] as well as the DrugBank [20] and Therapeutic Targets [21, 22] databases, which include information on drug interactions, pathways, and approval status. Interestingly, some of the approved drugs are predicted to have high activity against the SARS-CoV-2 targets (Figure 3A). In order to identify more efficacious candidates, we isolated the drugs scoring in the top 25 for multiple targets and found a few of high priority (Figure 3B). The structural analysis suggested that hits also visually display 2D similarity to known active chemicals (Supplementary Information 2).
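The multi-target prioritization described above can be illustrated with a simple tally. This is a minimal sketch rather than the pipeline's actual code: the function name `multi_target_hits`, the dictionary layout, the toy drug names and scores, and the assumption that higher scores mean stronger predicted activity are all hypothetical; only the idea of a top-25 cutoff per target comes from the text.

```python
from collections import Counter

def multi_target_hits(scores_by_target, top_n=25, min_targets=2):
    """Count, per chemical, the number of targets where it ranks in the top_n.

    scores_by_target maps target name -> {chemical: predicted score}.
    Higher score = stronger predicted activity (an assumption of this sketch).
    """
    counts = Counter()
    for scores in scores_by_target.values():
        ranked = sorted(scores, key=scores.get, reverse=True)[:top_n]
        counts.update(ranked)
    # Keep only chemicals that recur across at least min_targets targets.
    return {chem: n for chem, n in counts.items() if n >= min_targets}

# Toy example with made-up scores; the cutoff is 2 here instead of 25.
toy_scores = {
    "target_A": {"drug1": 0.9, "drug2": 0.8, "drug3": 0.1},
    "target_B": {"drug1": 0.7, "drug3": 0.6, "drug2": 0.2},
    "target_C": {"drug2": 0.95, "drug1": 0.5, "drug3": 0.4},
}
hits = multi_target_hits(toy_scores, top_n=2, min_targets=2)
print(hits)
```

In this toy case `drug1` is in the top two for all three targets and `drug2` for two of them, so both would be flagged as multi-target candidates.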

Figure 1. Machine learning pipeline to identify chemicals that interfere with SARS-CoV-2 targets. a) Overview of the pipeline to predict chemicals for 65 SARS-CoV-2 human targets selected from Gordon et al., 2020 and using bioassay data from publicly available databases. b) Graphically depicts the pipeline details. Available bioassay data on the viral targets were mined for information to use in machine learning or structural analysis. This resulted in 24 targets that could be modeled using values for the most abundant inhibitory assay measure (e.g. Ki or IC50) and 21 targets modeled by classifying broad inhibition (34 unique targets in total). The remaining targets with limited data were funneled into a structural similarity analysis, which aids in developing more bioassay data and helps clarify the chemical features contributing to bioactivity. For targets modeled with supervised machine learning, optimal chemical features were identified on subsets of training data. The top features were sampled by support vector machines (SVM). These models were then aggregated. External chemicals were used to verify successful predictions. Models trained for the 34 targets predicted large chemical databases including FDA registered chemicals and approved drugs, as well as 10+ million purchasable chemicals from the ZINC database. Top scoring predicted chemicals were subsequently assigned theoretical toxicity, log vapor pressure, and MLOGP, which estimates membrane permeability.

Given that many of the human target proteins are overexpressed in the respiratory tract, including the entry receptor ACE2 in only a few cell types of the nasal epithelium, the upper airways and lungs [7, 23], we reasoned that volatile chemicals may offer a unique opportunity as inhaled therapeutics that will have direct access to the cells and tissues that are infected by the virus. We used the machine learning models to search a large database of ~14 million commercially available chemicals (ZINC) for volatile candidates. We initially isolated the top 1% of the predicted scoring distribution (Figure 4A, left), which resulted in >1 million chemicals in total (Figure 4A, right). To prioritize the hits for potential human use, we next developed machine learning models to predict volatility (vapor pressure) (Supplementary Figure 1) and mammalian toxicity (LD50) (Supplementary Figure 2). The toxicity and vapor pressure estimates helped identify smaller priority sets (Figure 4B). Although the vapor pressures were not especially high, we rank ordered the top candidates according to the best values (Figure 4C; Supplementary Information 3). Chemicals with suspected odorant properties, however, represent only a fraction of the chemical space, and these chemicals may not have the activity levels suited for COVID-19 cases. Volatile compounds, for instance, may be biased towards structurally simple chemicals that do not resemble drugs. We therefore also focused on additional chemicals with the highest predicted activities for their targets and low estimated toxicities regardless of vapor pressure. We identified numerous candidates with potential activity against multiple viral targets (Figure 5A) and many others with significant activity against a single target (Figure 6A; Supplementary Information 4).
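The prioritization steps described above (keep the top-scoring fraction, filter on estimated toxicity, then rank by estimated vapor pressure) can be sketched as a small filter. This is an illustrative sketch only: the function name `prioritize`, the field names, the numeric cutoffs and the toy records are hypothetical, and it assumes that a higher log LD50 means lower toxicity. Only the overall order of the steps and the top 1% cutoff are taken from the text.

```python
import math

def prioritize(candidates, top_fraction=0.01, min_log_ld50=None, min_log_vp=None):
    """Keep the top-scoring fraction, apply optional filters, rank by vapor pressure.

    Each candidate is a dict with hypothetical keys: 'score' (predicted
    activity), 'log_ld50' (higher = less toxic, an assumption) and
    'log_vp' (estimated log vapor pressure).
    """
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    n_keep = max(1, math.ceil(len(ranked) * top_fraction))
    kept = ranked[:n_keep]
    if min_log_ld50 is not None:
        kept = [c for c in kept if c["log_ld50"] >= min_log_ld50]
    if min_log_vp is not None:
        kept = [c for c in kept if c["log_vp"] >= min_log_vp]
    # Most volatile candidates first.
    return sorted(kept, key=lambda c: c["log_vp"], reverse=True)

# Toy records with made-up values; a 75% cutoff stands in for the top 1%.
toy_candidates = [
    {"id": "c1", "score": 0.99, "log_ld50": 3.5, "log_vp": -2.0},
    {"id": "c2", "score": 0.95, "log_ld50": 2.0, "log_vp": 0.5},
    {"id": "c3", "score": 0.90, "log_ld50": 3.8, "log_vp": -0.5},
    {"id": "c4", "score": 0.20, "log_ld50": 4.0, "log_vp": 1.0},
]
priority = prioritize(toy_candidates, top_fraction=0.75, min_log_ld50=3.0)
print([c["id"] for c in priority])
```

Leaving `min_log_vp` unset corresponds to the second strategy in the text, where candidates are retained on predicted activity and low toxicity regardless of vapor pressure.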

SARS-CoV-2 is a significant world health crisis. The full scope of COVID-19 disease and any long-term health complications following infection remain unclear. Although vaccines are the best long-term solution, treatments will be necessary to mitigate disease severity in the short term. What is concerning is that, while several repurposed drugs have already been tested in some form of clinical trial, only one drug, Remdesivir, has shown a clear benefit in randomized clinical trials. Additionally, there is no guarantee that an effective vaccine can be found for the SARS-CoV-2 virus, and therefore drug candidate pipelines are extremely important to pursue for the long-term research effort against COVID-19. A vaccine against SARS-CoV-2 would likely need to stimulate local immunity, since the infection is limited to mucosal surfaces, and these could be short-lived immunities.

We have therefore taken a comprehensive approach to try to provide a pipeline for short and long-term use, and for a potentially local application route via inhalation. Existing FDA approved drugs that target a single protein important for viral replication and host entry are currently the highest priority for repurposing as new COVID-19 drugs. However, we think that there are compelling reasons to create pipelines to explore many putative targets, and chemical spaces that are far larger and more diverse than the known approved drugs. We have therefore screened ~14 million potentially purchasable compounds from the ZINC database and also predicted toxicity values for the numerous candidates. In addition, we have identified chemicals that are predicted to affect more than one of the host proteins, suggesting these may have more efficacy. One unusual category we have emphasized is volatiles, as these compounds may be biologically sourced, and therefore microbes could be genetically engineered to produce them at scale [24]. This would in turn reduce the strain on global supply chains for chemicals that are necessary in synthesizing certain pharmaceuticals. These chemicals are also intriguing options for drug cocktails. If present in metabolic pathways, they possibly already interact in vivo. Therefore, short-term therapeutic concentrations may be better tolerated in humans.

It is nevertheless important to note that machine learning depends on available data. Because the size and diversity of publicly available bioassay data are limited, caution is required in interpreting the predictions. It is common to find past bioassays focused on similarly shaped chemicals, limiting the scope of the machine learning approach to find new chemistries. Importantly, apart from ACE2, the other human proteins that were identified to interact with SARS-CoV-2 are yet to be tested in vivo for druggability. And although some of the candidate chemicals we identified may be biologically sourced, the concentrations are not well defined or unknown, nor is there any understanding of a therapeutic concentration in this scenario. These data are presented as a forward-looking resource and a pipeline to evaluate chemical data with additional research. While our motivation was the evolving COVID-19 pandemic, the 64 SARS-CoV-2 targets are relevant to a range of other diseases and conditions. We therefore anticipate that the AI-based predictions of purchasable compounds from 10+ million chemicals will accelerate drug discovery in general and facilitate research on these chemicals in the future for a number of diseases. In general, the use of AI-driven tools could provide additional valuable solutions for tackling COVID-19 [25].

ZINC is a free database comprised of 230 million chemicals for in silico analyses. It was developed as a resource for non-commercial research. Chemicals predicted here are from a purchasable subset; however, availability is subject to change and pricing may vary widely [18, 26] .

Table 2. Important chemical features for classification models. Top three chemical features for viral targets where the models classified chemicals as active vs inactive relative to broad inhibition rather than a specific assay value (e.g. Ki, IC50, and AC50).

Bioassay data were retrieved from ChEMBL 25 using the associated Python module, which enables access to the API services via Python [27, 28]. The various inhibitory measures/endpoints, wherever possible, were standardized to nM units; the logarithm of the standardized values was used for machine learning. Regression models were fit for a single endpoint. For classification machine learning models, however, 'active' class chemicals were defined using the activity comments, endpoints with values up to 10,000 nM (Ki and IC50) and, for the semi-quantitative % inhibition, values greater than 10%. The majority class was downsampled during the training and model tuning phases to adjust for possible class imbalances. Training for the regression and classification approaches was done on 85% of the total data. Notably, in a small number of cases the remaining 15% was insufficient to effectively estimate performance using an external test set. To reduce bias, feature selection (recursive feature elimination (RFE) algorithm) was always run on 85% of the data over 250-300 different partitions (iteratively running the 10-fold cross validation 25-30 times). However, for these cases, the held-out portion (15%) was then incorporated back into the dataset to better estimate performance of the trained model by 10-fold cross-validation (repeated 5 times). We also fit 3 different radial basis function (RBF) support vector machine (SVM) models, wherein the chemical features (predictors) were randomly sampled (50%) from the top 70. This makes the performance estimates more conservative (see Key Resources Table for machine learning algorithm source files).
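The data preparation steps above (thresholding into active/inactive, log-transforming the endpoint, holding out 15%, and downsampling the majority class in the training portion) can be sketched as follows. This is a minimal sketch under stated assumptions, not the original implementation: the function names, record layout and toy values are hypothetical, the logarithm is assumed to be base 10, and the use of ChEMBL activity comments is not modeled. Only the 10,000 nM and 10% cutoffs and the 85/15 split are taken from the text.

```python
import math
import random

ACTIVE_MAX_NM = 10_000   # Ki / IC50 cutoff stated in the text (nM)
ACTIVE_MIN_PCT = 10      # % inhibition cutoff stated in the text

def label_active(endpoint, value):
    """Return 1 (active) or 0 (inactive) for a standardized measurement."""
    if endpoint in ("Ki", "IC50"):
        return int(value <= ACTIVE_MAX_NM)
    if endpoint == "% inhibition":
        return int(value > ACTIVE_MIN_PCT)
    raise ValueError(f"unsupported endpoint: {endpoint}")

def split_train_test(records, rng, train_fraction=0.85):
    """Shuffle and split records into training and held-out test portions."""
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

def downsample_majority(records, rng):
    """Randomly drop majority-class records until both classes are equal in size."""
    actives = [r for r in records if r["label"] == 1]
    inactives = [r for r in records if r["label"] == 0]
    minority, majority = sorted((actives, inactives), key=len)
    return minority + rng.sample(majority, len(minority))

# Toy data: 5 potent and 15 weak IC50 values in nM (made up).
rng = random.Random(0)
records = [{"endpoint": "IC50", "value_nM": v} for v in [100.0] * 5 + [50_000.0] * 15]
for r in records:
    r["label"] = label_active(r["endpoint"], r["value_nM"])
    r["log_value"] = math.log10(r["value_nM"])  # base 10 is an assumption

train, test = split_train_test(records, rng)
balanced_train = downsample_majority(train, rng)
print(len(train), len(test), len(balanced_train))
```

Downsampling is applied only to the training portion here so that the held-out chemicals keep their original class balance; whether the original pipeline did the same is not stated.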

Training and testing data are curated by various government agencies and provided freely to the general public as databases (see Key Resources Table) [29, 30, 31].

Figure 4. Predicting activity against SARS-CoV-2 targets among theoretical volatile chemicals. a) Left, count of chemicals per target after initially filtering based on predicted scores. Right, chemical counts across all viral targets for the models predicting general inhibitory scores (Classification) and those for specific inhibitory endpoints (Regression) (e.g. IC50). b) Pipeline for further prioritizing chemical sets according to estimated vapor pressure and low mammalian toxicity (LD50). c) Top ranking predictions of general inhibitory activity (Score) and/or specific inhibitory endpoints (Predicted Assay Value) against SARS-CoV-2 targets from the ZINC database, filtered to the highest estimated log vapor pressures.

4.1.4. Vapor pressure data

Training and testing data are from EPI Suite [32], which is developed and maintained by the Environmental Protection Agency (EPA) (see Key Resources Table). Methods for fitting these models are as outlined in the Figure 1 pipeline. To compare the vapor pressure model predictions with respect to different machine learning methods as well as EPI Suite, data were split into train/test partitions as defined in a previous study [33].

Chemical features were computed with ~5300 AlvaDesc descriptors, from the developers of DRAGON software, and 3D coordinate generation and optimization were performed using RDKit in Python [34].

Chemical feature ranking and importance

4.2.2.1. Cross-validated recursive feature elimination (CV-RFE)

Recursive feature elimination iteratively selects subsets of features to identify optimal sets. The algorithm is a "wrapper" and therefore relies on an additional algorithm to supply predictions and quantify importance. We used two different algorithms, depending on the size and composition of data: (1) Random Forest and (2) Support Vector Machine (SVM). Random forest determines the importance in relation to the % increase in error when permuting a feature or predictor. There is no equivalent method for computing importance with the SVM. Accordingly, the importance is based on fitting a model between the response and each predictor or feature as compared to null. If the response is numeric, importance is derived from the pseudo R^2 (non-linear regression). If, however, the response is binary, the AUC is instead computed for each predictor or feature (see Key Resources Table for algorithm source files).

Including cross-validation with the recursive feature elimination (RFE) partitions the training data into multiple folds. This step avoids biasing performance estimates but results in lists of top predictors over the cross-validation folds such that importance of a predictor is based on a selection rate.
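The notion of a selection rate across cross-validation partitions can be illustrated with a short tally. This is a sketch under stated assumptions and not the CV-RFE procedure itself: a simple stand-in selector (ranking features by the gap between class means) replaces recursive feature elimination, and all names (`k_fold_indices`, `mean_gap_selector`, `selection_rates`) and the toy data are hypothetical. The 10 folds repeated 25 times mirror the lower end of the 250-300 partitions mentioned in the Methods.

```python
import random
from collections import Counter

def k_fold_indices(n, k, rng):
    """Shuffle 0..n-1 and deal the indices into k held-out folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def mean_gap_selector(X, y, train_idx, top_k):
    """Stand-in for RFE: rank features by |mean(actives) - mean(inactives)|."""
    gaps = {}
    for name, column in X.items():
        act = [column[i] for i in train_idx if y[i] == 1]
        ina = [column[i] for i in train_idx if y[i] == 0]
        gaps[name] = abs(sum(act) / len(act) - sum(ina) / len(ina))
    return sorted(gaps, key=gaps.get, reverse=True)[:top_k]

def selection_rates(X, y, k=10, repeats=25, top_k=1, seed=0):
    """Fraction of training partitions in which each feature is selected."""
    rng = random.Random(seed)
    counts, runs = Counter(), 0
    for _ in range(repeats):
        for held_out in k_fold_indices(len(y), k, rng):
            held = set(held_out)
            train_idx = [i for i in range(len(y)) if i not in held]
            counts.update(mean_gap_selector(X, y, train_idx, top_k))
            runs += 1
    return {name: counts[name] / runs for name in X}

# Toy data: one perfectly informative feature and two noise features.
noise_rng = random.Random(1)
y = [1] * 10 + [0] * 10
X = {
    "signal": [1.0] * 10 + [0.0] * 10,
    "noise_a": [noise_rng.random() for _ in y],
    "noise_b": [noise_rng.random() for _ in y],
}
rates = selection_rates(X, y)
print(rates)
```

Because selection happens only on the training side of each partition, a feature's rate reflects how consistently it is chosen rather than its fit to any single split.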

Selecting features or predictors on the same dataset used for cross validation results in models that have already "seen" possible partitions of the data and therefore performance metrics will be biased. Selection bias [35] was addressed by bootstrapping and cross validation, which ensure some separation between predictor/feature selection and model-fitting/validation. In addition to these methods, we used hidden test sets.

The support vector machine (SVM) with the radial basis function (RBF) kernel outperformed regularized Random Forest (regRF) or performed comparably. Rather than utilize many different approaches, we aggregated multiple SVM models to improve generalizability. However, in the case of the classification model for EIF4H, we included the regularized random forest algorithm, as the aggregated prediction (SVM and regRF) was clearly optimal on the test data. Algorithm selection and training were done using the classification and regression training package in R [36], caret [37], and the implementation of the Support Vector Machine (SVM) algorithm in kernlab [38].
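The aggregation of several models, each fit on a random half of the top 70 ranked features as described in the Methods, might be organized as below. This is a hedged sketch, not the original R/caret code: the function names are hypothetical, the per-model predictions are placeholder numbers rather than SVM outputs, and combining models by a simple mean is an assumption since the aggregation rule is not specified in the text.

```python
import random
from statistics import mean

def sample_feature_subsets(ranked_features, n_models=3, top_n=70, fraction=0.5, seed=0):
    """Draw one random feature subset per model from the top_n ranked features."""
    rng = random.Random(seed)
    pool = ranked_features[:top_n]
    size = int(len(pool) * fraction)
    return [rng.sample(pool, size) for _ in range(n_models)]

def aggregate(predictions_per_model):
    """Average the per-chemical predictions across models (simple mean; an assumption)."""
    return [mean(per_chemical) for per_chemical in zip(*predictions_per_model)]

# Hypothetical ranked feature names; real descriptors would come from feature selection.
ranked = [f"feat_{i}" for i in range(100)]
subsets = sample_feature_subsets(ranked)

# Placeholder predictions from three models for two chemicals.
combined = aggregate([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
print(len(subsets), len(subsets[0]), combined)
```

Each model sees a different 35-feature view of the same chemicals, so no single descriptor dominates the aggregated prediction.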

Enriched cores were analyzed using RDKit through Python [34]. The algorithm performs an exhaustive search for a maximum common substructure among a set of chemicals. In practice, larger sets often yield fewer substantive cores. To remedy this, the algorithm includes a threshold parameter that relaxes the proportion of chemicals containing the core. We used a threshold of 0.55, which ensures that the majority of the chemicals contained the core.

4.5. Chemical fingerprinting

Extended Connectivity Fingerprints (ECFP) are a class of cheminformatic algorithms that iteratively combine chemical features that are present within a predefined radius/diameter, representing them by a set of integer values. Typically, the fingerprint is converted into a binary string of fixed length using a hash function. Here, the bit length was set at 1024 and the radius at 2 (diameter = 4 or ECFP4). This structural representation was preferred as it is strongly associated with activity [39]. Accordingly, it is a suitable alternative to identify drug candidates in the absence of machine learning models. We used the ECFP algorithm in RDKit (Morgan or circular fingerprint) [34].
The similarity between the fingerprints of chemicals with known activity against the SARS-CoV-2 targets and prospective chemicals was computed using the Tanimoto index. This index is a similarity coefficient (0-1; 1 = max similarity). It is the overlap of the "on-bits" divided by the sum of the unique "on-bits". Notably, coefficients of 1 need not imply identical chemicals.

\[ T(A, B) = \frac{c}{a + b - c} \]

where c = overlapping "on-bits"; a = "on-bits" in A; b = "on-bits" in B.
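As a worked illustration of the Tanimoto index defined above, the coefficient can be computed directly from the sets of "on-bit" positions. This is a minimal sketch with made-up bit positions; the function name `tanimoto` and the convention of returning 0.0 for two empty fingerprints are assumptions, and no fingerprint is actually generated here.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto index c / (a + b - c) computed from two collections of on-bit positions."""
    a_set, b_set = set(bits_a), set(bits_b)
    c = len(a_set & b_set)               # overlapping on-bits
    union = len(a_set) + len(b_set) - c  # unique on-bits across both fingerprints
    return c / union if union else 0.0   # empty-vs-empty convention is an assumption

# Made-up on-bit positions within a 1024-bit fingerprint.
fp_known = {1, 5, 9, 200}
fp_query = {5, 9, 300}
similarity = tanimoto(fp_known, fp_query)
print(similarity)  # 2 shared bits out of 5 unique bits
```

Because fingerprints are hashed into a fixed number of bits, two different structures can set exactly the same bits, which is why a coefficient of 1 does not guarantee identical chemicals.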

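Training the support vector machine (SVM) involves identifying a set of parameters (θ) that optimize a cost function, which in its standard form is

\[ \min_{\theta} \; C \sum_{i=1}^{m} \left[ y^{(i)} \, \mathrm{cost}_{1}\!\left(\theta^{T} f^{(i)}\right) + \left(1 - y^{(i)}\right) \mathrm{cost}_{0}\!\left(\theta^{T} f^{(i)}\right) \right] + \frac{1}{2} \sum_{j=1}^{n} \theta_{j}^{2} \]

where y^(i) is the label of training chemical i (1 = "Active", 0 = "Inactive"), C is the regularization constant, and cost_1 and cost_0 correspond to training chemicals labeled as "Active" and "Inactive," respectively. θ^T f is the scoring function or output of the support vector machine. If the output is ≥ 0, the prediction is "Active." The function (f) is a kernel function.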

The kernel determines the shape of the decision boundary between the active and inactive chemicals from the training set. The radial basis function (RBF) or Gaussian kernel enables the learning of more complex, non-linear boundaries. It is therefore well suited for problems in which the biologically active chemicals cannot be properly classified as a linear function of physicochemical properties. This kernel computes the similarity for each chemical (x) and a set of landmarks (l),

\[ f = \exp\!\left( -\frac{\lVert x - l \rVert^{2}}{2\sigma^{2}} \right) \]

where σ^2 is a tunable parameter determined by the problem and data. The similarity with respect to these landmarks is used to predict new chemicals ("Active" vs. "Inactive").
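The role of the Gaussian kernel and the decision rule described above can be sketched in a few lines. This is an illustrative sketch, not the kernlab implementation used in the study: the function names, the two-dimensional toy "chemicals", the landmarks and the weights `theta` are all made up, and no training is performed. It assumes the common form in which the first weight is an intercept and each remaining weight multiplies the similarity to one landmark.

```python
import math

def rbf_similarity(x, landmark, sigma2):
    """Gaussian (RBF) similarity exp(-||x - l||^2 / (2 * sigma^2))."""
    sq_dist = sum((xi - li) ** 2 for xi, li in zip(x, landmark))
    return math.exp(-sq_dist / (2.0 * sigma2))

def svm_score(x, landmarks, theta, sigma2):
    """Weighted sum theta^T f, where f = [1, similarity to each landmark]."""
    features = [1.0] + [rbf_similarity(x, l, sigma2) for l in landmarks]
    return sum(t * f for t, f in zip(theta, features))

def predict(x, landmarks, theta, sigma2):
    """Label a chemical 'Active' when the score is >= 0, else 'Inactive'."""
    return "Active" if svm_score(x, landmarks, theta, sigma2) >= 0 else "Inactive"

# Made-up landmarks and weights: the first landmark behaves like a known active,
# the second like a known inactive.
landmarks = [(0.0, 0.0), (3.0, 4.0)]
theta = [-0.5, 1.0, -1.0]
sigma2 = 1.0

near_active = predict((0.0, 0.0), landmarks, theta, sigma2)
near_inactive = predict((3.0, 4.0), landmarks, theta, sigma2)
print(near_active, near_inactive)
```

A smaller `sigma2` makes the similarity fall off faster with distance, which yields a more tightly curved decision boundary around each landmark.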

The Area under the ROC Curve (AUC) assesses the true positive rate (TPR or sensitivity) as a function of the false positive rate (FPR or 1-specificity) while varying the probability threshold (T) for a label (Active/Inactive). If the computed probability score (x) is greater than the threshold (T), the observation is assigned to the active class. Integrating the curve provides an estimate of classifier performance, with the top left corner giving an AUC of 1.0, denoting maximum sensitivity to detect all targets or actives in the data without any false positives. The theoretical random classifier is reported at AUC = 0.5.

\[ \mathrm{TPR}(T) = \int_{T}^{\infty} f_{1}(x)\,dx, \qquad \mathrm{FPR}(T) = \int_{T}^{\infty} f_{0}(x)\,dx, \qquad \mathrm{AUC} = \int_{0}^{1} \mathrm{TPR}\!\left(\mathrm{FPR}^{-1}(u)\right) du \]

where T is a variable threshold, x is a probability score, and f_1 and f_0 denote the score densities of the active and inactive classes, respectively.
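For a finite set of scored chemicals, the AUC described above can be computed without tracing the curve explicitly, using the equivalent interpretation as the probability that a randomly chosen active receives a higher score than a randomly chosen inactive. The sketch below uses this pairwise form; the function name `roc_auc`, the tie-handling convention (ties count one half) and the toy scores are assumptions for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (active, inactive) pairs ranked correctly; ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute an AUC")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels (1 = Active, 0 = Inactive) with made-up probability scores.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
auc = roc_auc(toy_labels, toy_scores)
print(auc)  # 3 of the 4 active/inactive pairs are ordered correctly
```

The pairwise form is quadratic in the number of chemicals, which is fine for small validation sets; a rank-based formula would be preferable for large ones.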

However, we generated control classifiers that are more realistic than the theoretical random classifier by shuffling the chemical feature values in the models and statistically comparing the mean AUCs across multiple partitions of the data. This controls against optimally tuned algorithms predicting well simply because of specific predictor attributes (e.g. range, mean, median, and variance) or models that are of a specific size (number of predictors) performing well even with shuffled values. Additionally, biological data sets are often small, with stimuli or chemicals that reflect research biases rather than random selection, possibly leading to optimistic validation estimates without the proper controls.
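The logic of the shuffle control can be illustrated with a deliberately simplified example. This sketch is not the study's procedure: instead of refitting models on shuffled descriptors, it treats a single made-up feature as the model score and permutes its values across chemicals, and it reports only the mean shuffled AUC without the statistical comparison. The function names and toy data are hypothetical.

```python
import random
from statistics import mean

def roc_auc(labels, scores):
    """AUC as the fraction of (active, inactive) pairs ranked correctly; ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_shuffled_auc(labels, scores, n_shuffles=200, seed=0):
    """Mean AUC after repeatedly permuting the score values across chemicals."""
    rng = random.Random(seed)
    aucs = []
    for _ in range(n_shuffles):
        permuted = list(scores)
        rng.shuffle(permuted)  # breaks the link between feature values and labels
        aucs.append(roc_auc(labels, permuted))
    return mean(aucs)

# Toy data: ten actives with high values and ten inactives with low values.
labels = [1] * 10 + [0] * 10
scores = [0.90 - 0.01 * i for i in range(10)] + [0.40 - 0.01 * i for i in range(10)]

real_auc = roc_auc(labels, scores)
shuffle_auc = mean_shuffled_auc(labels, scores)
print(real_auc, round(shuffle_auc, 2))
```

The shuffled values keep the same range, mean and variance as the real ones, so any drop in AUC toward 0.5 can be attributed to losing the association between feature values and activity.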

We used the AUC for evaluating classification models. For the classification-based training, we initially converted the inhibitory data into a binary label (Active/Inactive). For predictions of quantitative bioassay measures (e.g. Ki, IC50, AC50, log LD50), we computed the mean absolute error (MAE), the correlation coefficient (R) and the squared correlation coefficient (R^2). MAE: Mean absolute error is the mean of the absolute difference between predicted and observed values. It therefore assigns equal weight to all prediction errors, whether large or small.

\[ \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{y}_{i} - y_{i} \right| \]

where \hat{y}_i and y_i are the predicted and observed values for chemical i and n is the number of chemicals.

Declarations

Author contribution statement

Joel Kowalewski: Conceived and designed the experiments; Performed the experiments; Analyzed and interpreted the data; Contributed reagents, materials, analysis tools or data; Wrote the paper.

Anandasankar Ray: Conceived and designed the experiments; Wrote the paper.

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Reagent or Resource Source Identifier DSSTox Richard and Williams

The authors declare no conflict of interest.

Supplementary content related to this article has been published online at https://doi.org/10.1016/j.heliyon.2020.e04639.

Supplementary Table 2: Top 50 physicochemical features to predict broadly inhibiting activity for each SARS-CoV-2 target (source: this paper).
Supplementary Table 3: Top predicted drug and FDA registered chemicals. Structural similarity between drugs and chemicals with bioassay activities for SARS-CoV-2 targets.