key: cord-1037052-3p5l1v1w
authors: Gaudêncio, Susana P.; Pereira, Florbela
title: A Computer-Aided Drug Design Approach to Predict Marine Drug-Like Leads for SARS-CoV-2 Main Protease Inhibition
date: 2020-12-10
journal: Mar Drugs
DOI: 10.3390/md18120633
sha: 1003805f6df81314fdaa991f085c0ff5014dd8d8
doc_id: 1037052
cord_uid: 3p5l1v1w

The investigation of marine natural products (MNPs) as key resources for the discovery of drugs to mitigate the COVID-19 pandemic is a developing field. In this work, computer-aided drug design (CADD) approaches comprising ligand- and structure-based methods were explored for predicting SARS-CoV-2 main protease (M(pro)) inhibitors. The CADD ligand-based method used a quantitative structure–activity relationship (QSAR) classification model that was built using 5276 organic molecules extracted from the ChEMBL database with SARS-CoV-2 screening data. The best model achieved an overall predictive accuracy of up to 67% for an external and internal validation using test and training sets. Moreover, based on the best QSAR model, a virtual screening campaign was carried out using 11,162 MNPs retrieved from the Reaxys(®) database, 7 in-house MNPs obtained from marine-derived actinomycetes by the team, and 14 MNPs that are currently in the clinical pipeline. All the MNPs from the virtual screening libraries that were predicted as belonging to class A were selected for the CADD structure-based method. In the CADD structure-based approach, the 494 MNPs selected by the QSAR approach were screened by molecular docking against M(pro) enzyme. A list of virtual screening hits comprising fifteen MNPs was assented by establishing several limits in this CADD approach, and five MNPs were proposed as the most promising marine drug-like leads as SARS-CoV-2 M(pro) inhibitors, a benzo[f]pyrano[4,3-b]chromene, notoamide I, emindole SB beta-mannoside, and two bromoindole derivatives.

To date, over 68.2 million people worldwide have been infected by coronavirus disease 2019 , resulting in over 1.5 million deaths caused by severe acute respiratory syndrome (SARS-CoV-2) and in addition, severe economic costs, national health systems crises, and massive unemployment [1, 2] . Despite the enormous human and financial efforts, there is still no appropriate treatment or prevention for COVID-19 and the number of deaths keeps increasing, which makes the discovery of drugs for infection treatment of foremost importance and an emergency. New and efficient therapeutic agents will relieve social, economic, and management burdens, improving patients' quality of life, and reducing hospitals pressure due to ineffective/long hospitalizations.

to the class A, were selected to proceed to the CADD structure-based method. Where 494 MNPs selected by QSAR approach were screened by molecular docking against M pro enzyme. In this CADD approach, a list of virtual screening hits comprising fifteen MNPs was assented on the basis of some established limits, such as: confidence value (3), probability of being active against SARS-CoV-2 in the best model, prediction of the affinity between the M pro of the selected MNPs through molecular docking. Five MNPs, a benzo[f]pyrano [4,3-b] chromene, notoamide I, emindole SB beta-mannoside, and two bromoindole derivatives were proposed as the most promise marine drug-like leads as SARS-CoV-2 M pro inhibitors.

The whole data set of 5272 organic molecules from the ChEMBL database with SARS-CoV-2 screening data (antiviral activity determined as inhibition of SARS-CoV-2 induced cytotoxicity of Caco-2 cells) was randomly divided into a training set of 3499 molecules (comprising 302 molecules from class A with inhibition % ≥50%, 265 molecules from class B with 50% > inhibition % ≥ 30%, and 2932 molecules from class C with inhibition % < 30%), a test set of 1533 molecules (comprising 145 molecules from class A with inhibition % ≥ 50%, 99 molecules from class B with 50% > inhibition % ≥ 30%, and 1288 molecules from class C with inhibition % < 30%), and an additional test set, test 2 set, of 241 molecules, which were used for the development and external validation of the QSAR classification models, respectively. The whole data set was clustered into ten structural classes or scaffold types (I-X) using the ward tool in JChem. The ten structural clusters are represented in Table 1 along with their average of molecular weight (MW), the octanol-water partition coefficient estimation (AlogP), and activity class counts.

The percentage of molecules belonging to class A, i.e., active against SARS-CoV-2, in the training set is relatively low, 9%, and these molecules are distributed within the ten structural groups (I-X), with percentages ranging from 3-13%. There are three structural clusters (III, VII, and VIII) in which class A is less represented compared to its representation in the training set, with percentages between 3-5%. Moreover, in the other seven structural clusters, class A has a representation equal to or higher than that obtained for class A in the training set, with percentages between 9-13%. The well-known, Lipinski rule, informs mainly if a molecule is more likely to be an orally administrated active drug and if it is easily absorbed by the body. One of the most important parameters is the LogP, which is highly correlated with lipophilicity, thus, highly lipophilic molecules are often discontinued from drug development and are frequently related to toxicity issues [30] . To explore the chemical and biological diversity of the training set, the active, intermediate, and inactive molecules against SARS-CoV-2 in the training set were analyzed, according to the ten structural clusters, using MW and ALogP. The analysis of these data indicates that the active, intermediate, and inactive molecules against SARS-CoV-2 in the training set are distributed over a wide range of MW (i.e., 84-4187 Da) and ALogP (i.e., −8.9-17.85).

Interestingly, approximately 84% of the molecules present in the training set have a MW bellow 500 Da. This MW interval contains approximately 78%, 82%, and 85% of all active, intermediate, and inactive molecules against SARS-CoV-2 in the training set, respectively. In spite of this, using this rule (MW < 500 Da), it is possible to discriminate actives molecules in relation to intermediate and inactive molecules in three structural clusters, namely in the clusters III, IV, and X, which comprises 80%, 82%, and 86% of inactive molecules, respectively, when compared to 78% of active molecules in the overall training set. In addition, more than 68%, 82%, and 86% of the active, intermediate, and inactive molecules against SARS-CoV-2 in the training set have an ALogP that is lower than 5, respectively. In the same way, using the ALogP < 5 rule, it is possible to prioritize active molecules in relation to the other molecules in three clusters, namely in the clusters: VII, VIII and IX, which comprises 85%, 81%, and 79% of active molecules when compared to 68% of active molecules in the overall training set, respectively. Table 1 . Structural clusters and SARS-CoV-2 activity class counts within the ten structural clusters.

Tr Set Te Set Tr Set Te Set Tr Set Te Set I-Indole derivative molecules from class A with inhibition % ≥ 50%, 99 molecules from class B with 50% > inhibition % ≥ 30%, and 1288 molecules from class C with inhibition % < 30%), and an additional test set, test 2 set, of 241 molecules, which were used for the development and external validation of the QSAR classification models, respectively. The whole data set was clustered into ten structural classes or scaffold types (I-X) using the ward tool in JChem. The ten structural clusters are represented in Table  1 along with their average of molecular weight (MW), the octanol-water partition coefficient estimation (AlogP), and activity class counts. molecules from class A with inhibition % ≥ 50%, 99 molecules from class B with 50% > inhibition % ≥ 30%, and 1288 molecules from class C with inhibition % < 30%), and an additional test set, test 2 set, of 241 molecules, which were used for the development and external validation of the QSAR classification models, respectively. The whole data set was clustered into ten structural classes or scaffold types (I-X) using the ward tool in JChem. The ten structural clusters are represented in Table  1 along with their average of molecular weight (MW), the octanol-water partition coefficient estimation (AlogP), and activity class counts. III-γ-Lactone derivative molecules from class A with inhibition % ≥ 50%, 99 molecules from class B with 50% > inhibition % ≥ 30%, and 1288 molecules from class C with inhibition % < 30%), and an additional test set, test 2 set, of 241 molecules, which were used for the development and external validation of the QSAR classification models, respectively. The whole data set was clustered into ten structural classes or scaffold types (I-X) using the ward tool in JChem. The ten structural clusters are represented in Table  1 along with their average of molecular weight (MW), the octanol-water partition coefficient estimation (AlogP), and activity class counts. 

30%, and 1288 molecules from class C with inhibition % < 30%), and an additional test set, test 2 set, of 241 molecules, which were used for the development and external validation of the QSAR classification models, respectively. The whole data set was clustered into ten structural classes or scaffold types (I-X) using the ward tool in JChem. The ten structural clusters are represented in Table  1 along with their average of molecular weight (MW), the octanol-water partition coefficient estimation (AlogP), and activity class counts. V-α-Amino acid ester derivative 30%, and 1288 molecules from class C with inhibition % < 30%), and an additional test set, test 2 set, of 241 molecules, which were used for the development and external validation of the QSAR classification models, respectively. The whole data set was clustered into ten structural classes or scaffold types (I-X) using the ward tool in JChem. The ten structural clusters are represented in Table  1 along with their average of molecular weight (MW), the octanol-water partition coefficient estimation (AlogP), and activity class counts. 30%, and 1288 molecules from class C with inhibition % < 30%), and an additional test set, test 2 set, of 241 molecules, which were used for the development and external validation of the QSAR classification models, respectively. The whole data set was clustered into ten structural classes or scaffold types (I-X) using the ward tool in JChem. The ten structural clusters are represented in Table  1 along with their average of molecular weight (MW), the octanol-water partition coefficient estimation (AlogP), and activity class counts. The percentage of molecules belonging to class A, i.e., active against SARS-CoV-2, in the training set is relatively low, 9%, and these molecules are distributed within the ten structural groups (I-X), with percentages ranging from 3-13%. There are three structural clusters (III, VII, and VIII) in which class A is less represented compared to its representation in the training set, with percentages between 3-5%. Moreover, in the other seven structural clusters, class A has a representation equal to or higher than that obtained for class A in the training set, with percentages between 9-13%. The well-known, Lipinski rule, informs mainly if a molecule is more likely to be an orally administrated active drug and if it is easily absorbed by the body. One of the most important parameters is the LogP, which is highly correlated with lipophilicity, thus, highly lipophilic molecules are often discontinued from drug development and are frequently related to toxicity issues [30] . To explore the chemical and biological diversity of the training set, the active, intermediate, and inactive molecules against SARS-CoV-2 in the training set were analyzed, according to the ten structural clusters, using MW and 279 (13) 131 ( The percentage of molecules belonging to class A, i.e., active against SARS-CoV-2, in the training set is relatively low, 9%, and these molecules are distributed within the ten structural groups (I-X), with percentages ranging from 3-13%. There are three structural clusters (III, VII, and VIII) in which class A is less represented compared to its representation in the training set, with percentages between 3-5%. Moreover, in the other seven structural clusters, class A has a representation equal to or higher than that obtained for class A in the training set, with percentages between 9-13%. The well-known, Lipinski rule, informs mainly if a molecule is more likely to be an orally administrated active drug and if it is easily absorbed by the body. One of the most important parameters is the LogP, which is highly correlated with lipophilicity, thus, highly lipophilic molecules are often discontinued from drug development and are frequently related to toxicity issues [30] . To explore the chemical and biological diversity of the training set, the active, intermediate, and inactive molecules against SARS-CoV-2 in the training set were analyzed, according to the ten structural clusters, using MW and 588 (31) 254 ( The percentage of molecules belonging to class A, i.e., active against SARS-CoV-2, in the training set is relatively low, 9%, and these molecules are distributed within the ten structural groups (I-X), with percentages ranging from 3-13%. There are three structural clusters (III, VII, and VIII) in which class A is less represented compared to its representation in the training set, with percentages between 3-5%. Moreover, in the other seven structural clusters, class A has a representation equal to or higher than that obtained for class A in the training set, with percentages between 9-13%. The well-known, Lipinski rule, informs mainly if a molecule is more likely to be an orally administrated active drug and if it is easily absorbed by the body. One of the most important parameters is the LogP, which is highly correlated with lipophilicity, thus, highly lipophilic molecules are often discontinued from drug development and are frequently related to toxicity issues [30] . To explore the chemical and biological diversity of the training set, the active, intermediate, and inactive molecules against SARS-CoV-2 in the training set were analyzed, according to the ten structural clusters, using MW and 285 (29) 136 ( The percentage of molecules belonging to class A, i.e., active against SARS-CoV-2, in the training set is relatively low, 9%, and these molecules are distributed within the ten structural groups (I-X), with percentages ranging from 3-13%. There are three structural clusters (III, VII, and VIII) in which class A is less represented compared to its representation in the training set, with percentages between 3-5%. Moreover, in the other seven structural clusters, class A has a representation equal to or higher than that obtained for class A in the training set, with percentages between 9-13%. The well-known, Lipinski rule, informs mainly if a molecule is more likely to be an orally administrated active drug and if it is easily absorbed by the body. One of the most important parameters is the LogP, which is highly correlated with lipophilicity, thus, highly lipophilic molecules are often discontinued from drug development and are frequently related to toxicity issues [30] . To explore the chemical and biological diversity of the training set, the active, intermediate, and inactive molecules against SARS-CoV-2 in the training set were analyzed, according to the ten structural clusters, using MW and 266 (35) 98 (12) 

Two extensive sets of descriptors were explored, one with 6 different types of fingerprints (FPs) with different sizes (166 MACCS, MACCS keys; 307 Substructure, presence and count Sub and SubC, respectively; 881 PubChem fingerprints; 1024 CDK, circular fingerprints; and 1024 CDKExt, extended circular fingerprints with additional bits describing ring features) and another with a total of 1442 1D&2D descriptors (including electronic, topological, and constitutional descriptors). The FPs and the Mar. Drugs 2020, 18, 633 5 of 17 molecular descriptors were calculated by PaDEL-Descriptor [31] . Random forests (RF) [32] machine learning (ML) technique was used for building SARS-CoV-2 activity prediction classification models, and the performance of the models was successfully evaluated by internal validation (out-of-bag, OOB, estimation on the training set), Table 2 . Table 2 . Evaluation of the predictive performance of FPs and 1D&2D molecular descriptors for modeling the antiviral activity against SARS-CoV-2 of organic molecules using the RF algorithm for the training set in an OOB estimation. 4 False A that was B. 5 False A that was C. 6 False B that was A. 7 False B that was C. 8 False C that was A. 9 False C that was B. 10 Sensitivity, the ratio of true A to the sum of true A and false A. 11 Specificity B, the ratio of true B to the sum of true B and false B. 12 Specificity C, the ratio of true C to the sum of true C and false C. 13 Overall predictive accuracy, the ratio of the sum of true A, true B, and true C to the sum of true A, true B, true C, false A, false B, and false C. 14 MCC, Matthews correlation coefficient.

In Table 2 , were highlighted in bold the three selected models, MACCS model-best model built with sets of fragment FPs (MACCS, Sub, SubC, and PubChem), ExtCDK model-best model built with sets of circular FPs (CDK and ExtCDK), and 1D&2D descriptors in accordance with Q and MCC parameters. For the two best models for the training set, ExtCDK, and 1D&2D (Table 2) , the descriptor selection was evaluated based on the importance assigned by the RF model with the R program- Table 3 . For the MACCS model, descriptors were not selected since these comprised only 166 FPs. The best models for each set of ExtCDK FPs and 1D&2D descriptors were accomplished with the RF algorithm using the 150 selected FPs and descriptors, respectively, (the best models were highlighted in bold- Table 3 ) for the training set. Majority voting predictions (consensus) obtained by ExtCDK, 1D&2D, and MACCS models (consensus model, CM) further improved the results with a Q and MCC of 0.68 and 0.31 for the training set, respectively, Table 4 . The results obtained by the CM for the test set in accordance with the ten structural categories (I-X), which were set up using the ward tool in JChem, are shown in Table 5 . 3 Specificity C, the ratio of true C to the sum of true C and false C. 4 Overall predictive accuracy, the ratio of the sum of true A, true B, and true C to the sum of true A, true B, true C, false A, false B, and false C. 5 MCC, Matthews correlation coefficient. 6 OOB estimation. 3 Specificity C, the ratio of true C to the sum of true C and false C. 4 Overall predictive accuracy, the ratio of the sum of true A, true B, and true C to the sum of true A, true B, true C, false A, false B, and false C. 5 MCC, Matthews correlation coefficient.

In general, the predictions obtained for the structural clusters are better than those obtained for the overall test set simultaneously taking into account the Q and MCC values, except for Clusters IV-V and IX-X (bold highlighted in Table 5 ). An improvement in the CM model prediction accuracies (Q = 0.67-0.72 and MCC = 0.25-0.38) was achieved for the other clusters (Clusters I-III and VI-VIII) of the test set, when compared with the prediction accuracy obtained for all the molecules of the test set (Q = 0.67 and MCC = 0.19). For the clusters IV-V and IX-X inferior prediction accuracies were obtained, Q = 0.58-0.68 and MCC = 0.05-0.21. Interestingly, the best achieved predictions for structural clusters I and VI are related to the best performance obtained for the class A (active) prediction with SE values of 0.71 and 0.67, respectively, compared to the SE value of 0.48 for all test sets. Furthermore, an additional predicting criterion can be generated according to the vote count of each of the models (i.e., MACCS, ExtCDK, and 1D&2D). Thus, if classes A, B, or C are predicted by the three models, these have the maximum confidence value (3), by two models (2) , and by only one model (1) , in this case, the CM prediction is obtained by the prediction of the best model, ExtCDK. An additional parameter, probability of being of class A (Prob_A) or probability of being active, was assigned by the RF algorithm and it can be used as a predicting criterion. For instance, the percentage of TA with the maximum confidence value (3) was 57%, which compares with the percentages of 42% and 43% obtained for false A, FA_B (false A which was B) and FA_C (false A which was C), respectively. If the criterion Prob_A ≥ 0.5 was added for the percentage of TA, the percentage of 33% was obtained and it compares with percentages of 13% and 15% obtained for FA_B and FA_C, respectively.

There are thirty-seven molecules from test set 2 that were predicted as being for class A, from those fourteen, were predicted with the maximum confidence value (3) and only five were predicted with Prob_A ≥ 0.5. Chemical structures of the five molecules in test set 2 that were predicted as belonging to class A with the confidence value (3) and Prob_A ≥ 0.5 are described in Figure 1 .

each of the models (i.e., MACCS, ExtCDK, and 1D&2D). Thus, if classes A, B, or C are predicted by the three models, these have the maximum confidence value (3), by two models (2) , and by only one model (1) , in this case, the CM prediction is obtained by the prediction of the best model, ExtCDK. An additional parameter, probability of being of class A (Prob_A) or probability of being active, was assigned by the RF algorithm and it can be used as a predicting criterion. For instance, the percentage of TA with the maximum confidence value (3) was 57%, which compares with the percentages of 42% and 43% obtained for false A, FA_B (false A which was B) and FA_C (false A which was C), respectively. If the criterion Prob_A ≥ 0.5 was added for the percentage of TA, the percentage of 33% was obtained and it compares with percentages of 13% and 15% obtained for FA_B and FA_C, respectively.

There are thirty-seven molecules from test set 2 that were predicted as being for class A, from those fourteen, were predicted with the maximum confidence value (3) and only five were predicted with Prob_A ≥ 0.5. Chemical structures of the five molecules in test set 2 that were predicted as belonging to class A with the confidence value (3) and Prob_A ≥ 0.5 are described in Figure 1 . NCATS released, via OpenData Portal (https://opendata.ncats.nih.gov/covid19/assays) the quantitative HTS data on drugs approved for clinical use tested in the SARS-CoV-2 cytopathic effect (CPE) assay. We used the chemical structures and data of 3957 molecules in SMILES format of this SARS-CoV-2 CPE evaluation from the work of Alves et al. [17] . From those 3957 molecules, there are 1401 molecules that were used to build (910 molecules in the training set) and validate (409 and 82 molecules in the test and test 2 sets, respectively) the SARS-CoV-2 QSAR model. Only the KRCA-0008 molecule in Figure 1 , which was proposed as the most promising active molecules against SARS-CoV-2 for test set 2, was tested by NCATS and had no activity. However, there is a proposed hit of the test 2 set, quizartinib (Figure 2) , predicted with the maximum confidence value (3) but with NCATS released, via OpenData Portal (https://opendata.ncats.nih.gov/covid19/assays) the quantitative HTS data on drugs approved for clinical use tested in the SARS-CoV-2 cytopathic effect (CPE) assay. We used the chemical structures and data of 3957 molecules in SMILES format of this SARS-CoV-2 CPE evaluation from the work of Alves et al. [17] . From those 3957 molecules, there are 1401 molecules that were used to build (910 molecules in the training set) and validate (409 and 82 molecules in the test and test 2 sets, respectively) the SARS-CoV-2 QSAR model. Only the KRCA-0008 molecule in Figure 1 , which was proposed as the most promising active molecules against SARS-CoV-2 for test set 2, was tested by NCATS and had no activity. However, there is a proposed hit of the test 2 set, quizartinib (Figure 2) , predicted with the maximum confidence value (3) but with Prob_A < 0.5 by the QSAR model that was experimentally validated by the SARS-CoV-2 CPE assay as active with AC 50 of 2.51 µM.

All six molecules predicted to be the most promising against SARS-CoV-2 from the test set 2 in Figures 1 and 2 are N-heterocyclic molecules, four of which have a pyrimidine scaffold (BIBU-1361, CHEMBL214796, NGD-94-1, KRCA-0008), one, OXA-06, has a pyrrolo[2¨C3-b]pyridine scaffold, and quizartinib has a benzo[d]imidazo[2,1-b]thiazole scaffold. Better predictions compared to those obtained for the test set (Table 4 ) were obtained taking into account the results of the SARS-CoV-2 CPE assay for the 82 molecules of the test set 2 and the values of 0.21, 0.94, and 0.74 for sensitivity, specificity C, and general predictive precision, respectively. Likewise, the other 2556 molecules of the NCATS data were predicted by the CM model, obtaining 0.01, 0.01, 0.98, and 0.91 for sensitivity, specificity B, specificity C, and general predictive precision, respectively. Only one molecule, vorapaxar (Figure 3 Prob_A < 0.5 by the QSAR model that was experimentally validated by the SARS-CoV-2 CPE assay as active with AC50 of 2.51 µM. All six molecules predicted to be the most promising against SARS-CoV-2 from the test set 2 in Figure 1 and Figure 3 . Chemical structure of vorapaxar from the test set 2 that was predicted as class A with the confidence value (1) and Prob_A of 0.17, which is active in the SARS-CoV-2 CPE assay.

A comparison of the best twenty FPs (i.e., MACCS keys) and descriptors (i.e., 1D_2D) selected by descriptor importance of RF used to build the QSAR classification models is comprised in Table 2 and these were analyzed and presented in Figure 4 . All six molecules predicted to be the most promising against SARS-CoV-2 from the test set 2 in Figure 1 and Figure 2 are N-heterocyclic molecules, four of which have a pyrimidine scaffold (BIBU-1361, CHEMBL214796, NGD-94-1, KRCA-0008), one, OXA-06, has a pyrrolo[2¨C3-b]pyridine scaffold, and quizartinib has a benzo[d]imidazo[2,1-b]thiazole scaffold. Better predictions compared to those obtained for the test set (Table 4 ) were obtained taking into account the results of the SARS-CoV-2 CPE assay for the 82 molecules of the test set 2 and the values of 0.21, 0.94, and 0.74 for sensitivity, specificity C, and general predictive precision, respectively. Likewise, the other 2556 molecules of the NCATS data were predicted by the CM model, obtaining 0.01, 0.01, 0.98, and 0.91 for sensitivity, specificity B, specificity C, and general predictive precision, respectively. Only one molecule, vorapaxar (Figure 3 Figure 3 . Chemical structure of vorapaxar from the test set 2 that was predicted as class A with the confidence value (1) and Prob_A of 0.17, which is active in the SARS-CoV-2 CPE assay.

A comparison of the best twenty FPs (i.e., MACCS keys) and descriptors (i.e., 1D_2D) selected by descriptor importance of RF used to build the QSAR classification models is comprised in Table 2 and these were analyzed and presented in Figure 4 . 

A comparison of the best twenty FPs (i.e., MACCS keys) and descriptors (i.e., 1D_2D) selected by descriptor importance of RF used to build the QSAR classification models is comprised in Table 2 and these were analyzed and presented in Figure 4 .

Of the twenty most important MACCS FPs for modeling the antiviral activity against SARS-CoV-2 in the set, there are more MACCS FPs (i.e., eighteen MACCS key) that are relevant in discriminating the class C than the classes A or B. For example, only the 11th, which codifies the presence of N-heterocyclic scaffold, and the 19th, which encodes the existence of a 4-membered ring, MACCS FPs out of the twenty most important FPs of MACCs are more relevant in discriminating the A and B classes, respectively. The importance of the N-heterocyclic scaffold in class A discrimination has already been highlighted in the proposed list of the six most promising molecules as antiviral against SARS-CoV-2 for test set 2, namely four molecules containing a pyrimidine scaffold, a molecule with a pyrrolo[2,3-b]pyridine scaffold, and a benzo[d]imidazo[2,1-b]thiazole scaffold. In the same way, of the twenty most important descriptors for modeling the antiviral activity against SARS-CoV-2, there are more 1D&2D descriptors that are relevant in discriminating the class C than the classes A or B (i.e., eighteen descriptors) in the set. There are two 1D&2D descriptors that are more relevant in discriminating the class A than the others classes in the set of the twenty most important 1D&2D descriptors for modelling the antiviral activity against SARS-CoV-2, Figure 4 , both are topological descriptors, VR3_Dzp (logarithmic Randic-like eigenvector-based index from Barysz matrix/weighted by polarizabilities) and piPC4 (conventional bond order ID number of order 4, log scale). Despite the two most relevant 1D&2D descriptors for class B, an autocorrelation (ATSC0e, centered Broto-Moreau autocorrelation-lag 0/weighted by Sanderson electronegativities) and a topological (ETA_dEpsilon_A, a measure of contribution of unsaturation and electronegative atom count) descriptors, these are more relevant in class C discrimination. Of the twenty most important MACCS FPs for modeling the antiviral activity against SARS-CoV-2 in the set, there are more MACCS FPs (i.e., eighteen MACCS key) that are relevant in discriminating the class C than the classes A or B. For example, only the 11th, which codifies the presence of N-heterocyclic scaffold, and the 19th, which encodes the existence of a 4-membered ring, MACCS FPs out of the twenty most important FPs of MACCs are more relevant in discriminating the A and B classes, respectively. The importance of the N-heterocyclic scaffold in class A discrimination has already been highlighted in the proposed list of the six most promising molecules as antiviral against SARS-CoV-2 for test set 2, namely four molecules containing a pyrimidine scaffold, a molecule with a pyrrolo[2,3-b]pyridine scaffold, and a benzo[d]imidazo[2,1-b]thiazole scaffold. In the same way, of the twenty most important descriptors for modeling the antiviral activity against SARS-CoV-2, there are more 1D&2D descriptors that are relevant in discriminating the class C than the classes A or B (i.e., eighteen descriptors) in the set. There are two 1D&2D descriptors that are more relevant in discriminating the class A than the others classes in the set of the twenty most important 1D&2D descriptors for modelling the antiviral activity against SARS-CoV-2, Figure 4 , both are topological descriptors, VR3_Dzp (logarithmic Randic-like eigenvector-based index from Barysz matrix/weighted by polarizabilities) and piPC4 (conventional bond order ID number of order 4, log scale). Despite the two most relevant 1D&2D descriptors for class B, an autocorrelation (ATSC0e, centered Broto-Moreau autocorrelation-lag 0/weighted by Sanderson electronegativities) and a topological (ETA_dEpsilon_A, a measure of contribution of unsaturation and electronegative atom count) descriptors, these are more relevant in class C discrimination. 

A virtual screening campaign was carried out to search for new lead-like inhibitors against SARS-CoV-2. The best model, the CM, was selected for the virtual screening procedure using 11,162 MNPs retrieved from the Reaxys ® database, 7 in-house MNPs obtained by the team from marine-derived actinomycetes, and 14 MNPs in the pharmaceutical pipeline (eight approved drugs and six MNPs in Phase II and III of clinical trials). All the MNPs from the virtual screening libraries that were predicted as belonging to the class A, were selected to the CADD structure-based approach. In the CADD structure-based method, the 494 MNPs selected by QSAR classification approach were screened by molecular docking against M pro enzyme.

The 494 MNPs from the three MNPs libraries (491, 1, and 2 from MNPs from Reaxys ® database, in-house MNPs, MNPs pharmaceutical pipeline, respectively) selected by QSAR classification approach were screened by molecular docking against M pro enzyme (PDB ID: 6LU7) [3] . The antiviral drugs nelfinavir and lopinavir were used as positive controls and allicin was used as negative control in molecular docking experiments [7] . A list of virtual screening hits comprising thirteen MNPs was assented on the basis of some limits established in this CADD ligand-and structure-based approach, such as: confidence value (3) and probability of being active against SARS-CoV-2 in the best model and prediction of the affinity between the M pro and the selected MNPs through molecular docking (∆G B ≤ −8 kcal/mol). The Autodock Vina software [33] (http://vina.scripps.edu/) was used to perform the flexible virtual screening of the 494 MNPs from the three MNPs libraries to find the most favorable binding interactions, and the calculated free binding energies by the two sets of search space coordinates were reported in Table 6 for the 15 MNPs selected, one from in-house MNPs (hydroxydebromomarinone), two from MNPs clinical trials (nelarabine and fludarabine), and the positive (nelfinavir and lopinavir) and the negative (allicin) controls. Table 6 . Structures and calculated free binding energies (∆G B , in kcal/mol) of the fifteen selected MNPs, one from in-house MNPs (hydroxydebromomarinone), two from MNPs pharmaceutical pipeline (nelarabine and fludarabine), and the positive (nelfinavir and lopinavir) and negative (allicin) controls, using two sets of search space coordinates.

Chemical Structure Structural Category Natural Source Prob_A ∆G B (kcal/mol)

The 494 MNPs from the three MNPs libraries (491, 1, and 2 from MNPs from Reaxys ® database, in-house MNPs, MNPs pharmaceutical pipeline, respectively) selected by QSAR classification approach were screened by molecular docking against M pro enzyme (PDB ID: 6LU7) [3] . The antiviral drugs nelfinavir and lopinavir were used as positive controls and allicin was used as negative control in molecular docking experiments [7] . A list of virtual screening hits comprising thirteen MNPs was assented on the basis of some limits established in this CADD ligand-and structure-based approach, such as: confidence value (3) and probability of being active against SARS-CoV-2 in the best model and prediction of the affinity between the M pro and the selected MNPs through molecular docking (ΔGB ≤ −8 kcal/mol). The Autodock Vina software [33] (http://vina.scripps.edu/) was used to perform the flexible virtual screening of the 494 MNPs from the three MNPs libraries to find the most favorable binding interactions, and the calculated free binding energies by the two sets of search space coordinates were reported in Table 6 for the 15 MNPs selected, one from in-house MNPs (hydroxydebromomarinone), two from MNPs clinical trials (nelarabine and fludarabine), and the positive (nelfinavir and lopinavir) and the negative (allicin) controls. Table 6 . Structures and calculated free binding energies (∆GB, in kcal/mol) of the fifteen selected MNPs, one from in-house MNPs (hydroxydebromomarinone), two from MNPs pharmaceutical pipeline (nelarabine and fludarabine), and the positive (nelfinavir and lopinavir) and negative (allicin) controls, using two sets of search space coordinates.

Chemical The 494 MNPs from the three MNPs libraries (491, 1, and 2 from MNPs from Reaxys ® database, in-house MNPs, MNPs pharmaceutical pipeline, respectively) selected by QSAR classification approach were screened by molecular docking against M pro enzyme (PDB ID: 6LU7) [3] . The antiviral drugs nelfinavir and lopinavir were used as positive controls and allicin was used as negative control in molecular docking experiments [7] . A list of virtual screening hits comprising thirteen MNPs was assented on the basis of some limits established in this CADD ligand-and structure-based approach, such as: confidence value (3) and probability of being active against SARS-CoV-2 in the best model and prediction of the affinity between the M pro and the selected MNPs through molecular docking (ΔGB ≤ −8 kcal/mol). The Autodock Vina software [33] (http://vina.scripps.edu/) was used to perform the flexible virtual screening of the 494 MNPs from the three MNPs libraries to find the most favorable binding interactions, and the calculated free binding energies by the two sets of search space coordinates were reported in Table 6 for the 15 MNPs selected, one from in-house MNPs (hydroxydebromomarinone), two from MNPs clinical trials (nelarabine and fludarabine), and the positive (nelfinavir and lopinavir) and the negative (allicin) controls. Table 6 . Structures and calculated free binding energies (∆GB, in kcal/mol) of the fifteen selected MNPs, one from in-house MNPs (hydroxydebromomarinone), two from MNPs pharmaceutical pipeline (nelarabine and fludarabine), and the positive (nelfinavir and lopinavir) and negative (allicin) controls, using two sets of search space coordinates.

Chemical To prioritize the best marine drug-like leads as SARS-CoV-2 M pro inhibitors from the list of fifteen selected MNPs by QSAR model against SARS-CoV-2 and molecular docking of M pro enzyme, the absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties were predicted via in silico methods using the pKCSM software (http://biosig.unimelb.edu.au/pkcsm/) [34] . Five MNPs, a benzo[f]pyrano [4,3-b] chromene (Reaxys ID 7450892), notoamide I (Reaxys ID 19384758), emindole SB beta-mannoside (Reaxys ID 26845562), and two bromoindole derivatives (Reaxys IDs 10714788 and 10720065) were proposed as marine drug-like leads as SARS-CoV-2 M pro inhibitors. In accordance with Lipinski rule of five, 3 molecules failure in only one rule (one from indoloditerpene and two bromoindole derivatives). Regarding the desirable ADMET properties, no AMES toxicity, hERG I inhibition, or skin sensitization was predicted by the first three hits, and no hERG I inhibition, hepatotoxicity, or skin sensitization by the other two hits, Table 6 . The prediction of ADMET properties of the fifteen selected MNPs by QSAR model against SARS-CoV-2 and molecular docking of M pro enzyme was presented in Table S1 , in the Supplementary Materials. In Figure 5 , the interaction profiles of the best-docked poses for the five lead-like SARS-CoV-2 M pro inhibitors were represented. 

In total, 5466 organic molecules were extracted from the ChEMBL (https://www.ebi.ac.uk/chembl/) database [35] , searching by antiviral activity determined as 

In total, 5466 organic molecules were extracted from the ChEMBL (https://www.ebi.ac.uk/chembl/) database [35] , searching by antiviral activity determined as inhibition of SARS-CoV-2 induced cytotoxicity of Caco-2 cells. Their chemical structures were saved in the simplified molecular input line entry specification (SMILES) data format. The Caco-2 cell line is widely used with in vitro assays to predict the absorption rate of lead drug molecules across the intestinal epithelial cell barrier and it is extensively used to study infection by SARS-CoV and can be used for SARS-CoV-2 infection [36] . The antiviral activity was classified using three activity classes: (A)-inhibition % ≥ 50%; (B)-50% > inhibition % ≥ 30%; (C)-inhibition % < 30%. After calculating the molecular descriptors and the fingerprints, the final data set comprises 5273 organic molecules, the molecules that showed errors in the calculation were removed. From these 447 molecules are recorded as active (class A), 364 as intermediate active (class B), 4220 as inactive (class C), and 241 without defined recorded (label as "Outside typical range"). The data set was randomly divided into a training set of 3499 molecules (class A: 302 molecules, class B: 265 molecules, and class C: 2932 molecules), and a test set of 1533 molecules (class A: 145 molecules, class B: 99 molecules, and class C: 1288 molecules), a partition of approximately 56:44 for the training and test sets. The test 2 set comprises the 241 molecules without activity recorded. The SMILES strings of the data set, the corresponding experimental, and the predicted activities are available as Supplementary Materials.

The virtual library comprises 11,162 MNPs set retrieved from the Reaxys ® database (Elsevier Information Systems GmbH)) in the MDL SDF data format, 7 in-house MNPs set obtained by the team from marine-derived actinomycetes, and 14 MNPs from the pharmaceutical pipeline set (eight approved drugs and six MNPs in Phase II and III of clinical trials). The SMILES strings of the in-house MNPs and MNPs clinical trials sets, the corresponding predicted activities are available as Supplementary Materials.

JChem Standardizer tool version 5.7.13.0 (ChemAxon Ltd., Budapest, Hungary) was used to standardize the molecular structures of all data sets by normalizing tautomeric and mesomeric groups and by removing small disconnected fragments. Empirical molecular fingerprints (FPs) and 1D&2D molecular descriptors were calculated by PaDEL-Descriptor version 2.21 (http://www. yapcwsoft.com/dd/padeldescriptor/) [31] . Different types of FPs, with different sizes, were calculated and explored: 166 MACCS (MACCS keys), 307 Substructure (presence and count of SMARTS patterns for Laggner functional group classification-Sub and SubC, respectively), 881 PubChem FPs (ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt), 1024 CDK (circular fingerprints), and 1024 CDK extended (Ext circular fingerprints with additional bits describing ring features). The 1D&2D molecular descriptors comprise 1443 descriptors, including electronic, topological, and constitutional descriptors.

In the quest for QSAR models with the minimum possible number of descriptors, descriptor selection was performed based on the importance of descriptors assessed by random forest (RF) (computeAttributeImportance) [32] . Selection of descriptors was accomplished using this procedure with the importance of descriptors assessed by RF within an OOB methodology using the 50, 100, 150, and 200 most important descriptors and RF algorithm as ML technique employing the following statistical metrics: true A (TA), true B (TB), true C (TC), false A that was B (FA_B), false A that was C (FA_C), false B that was A (FB_A), false B that was C (FB_C), false C that was A (FC_A), false C that was B (FC_B), sensitivity (SE, prediction accuracy for class A, active against SARS-CoV-2), specificity B (SP_B, prediction accuracy for class B, intermediate activity against SARS-CoV-2), specificity C (SP_C, prediction accuracy for class C, inactive against SARS-CoV-2), overall predictive accuracy (Q), and Matthews correlation coefficient (MCC).

In general, class imbalance introduces a bias in the performance of the ML algorithms due to their preference towards the majority class [37] . Our SARS-CoV-2 activity training set is unbalanced, and the imbalance ratio is 9:8:84 for the A: active/ B: intermediate/ C: inactive classes, respectively. To address this issue, a base-classifier with RF in R program version 3.6.1 [38] , was used using the sampsize, which is a parameter in the RF with R specified for balancing the classes. This parameter was set to be of the same size as the minority class (class B). With this parameter, some molecules belonging to the minority class were used more than once.

A RF [32, 39] is implemented as an ensemble of unpruned classification trees which are created using bootstrap samples of the training set. For each individual tree, the best split at each node is defined using a randomly selected subset of descriptors. Each individual tree is created using a different training and validation set. Prediction is made by a majority vote of the classification trees in the forest. Performance is internally assessed with the prediction error for the objects left out in the bootstrap procedure (OOB estimation). The method quantifies the importance of a descriptor by the increase in misclassification occurring when the values of the descriptor are randomly permuted, correlated with the mean decrease in accuracy parameter. RFs also assign a probability to every prediction based on the number of votes obtained by the predicted class. RFs were grown with the R program [38] , version 3.6.1, using the random forest library [40] . As a result of the nature of three-class imbalance, this problem was alleviated by setting the class weight ranges from 1-265, 1-265, and 1-265 for class A, B, and C, respectively, using the sampsize parameter.

A list of 494 virtual screening hits were prioritized by the QSAR approach from the three MNPs libraries (491, 1, and 2 from MNPs from Reaxys ® database, in-house MNPs, MNPs clinical pipeline, respectively). The software program OpenBabel (version 2.3.1) [41] was used to convert the mol2 files to PDBQT files. PDBQT files were used for docking to M pro enzyme (PDB ID: 6LU7) [3] with AUTODOCK VINA (version 1.1) [33] . Water molecules and N3 ligand (https://www.tocris.com/ products/mpro-n3_7230) were removed from 6LU7 [3] prior to docking using the AutoDockTools (http://mgltools.scripps.edu/). An experiment to evaluate the importance of water molecules in the binding of ligands to the active site of the protease was carried out and it was concluded that water molecules do not seem to have a relevant role, since a Pearson correlation of 0.885 between the calculated free binding energies with and without water molecules was obtained. The AutoDockTools is graphical front-end for setting up and running AutoDock-an automated docking software designed to predict how small molecules bind to a receptor of known 3D structure. The coordinates of the search space for M pro enzyme were maximized to allow the entire macromolecule to be considered for docking. The search space coordinates were: M pro enzyme; Center X: -36.149 Y: -3.796 Z: 45.045, and X: -12.806 Y: 18.646 Z: 65.607, Dimensions X: 40.000 Y: 40.000 Z: 40.000. Ligand tethering of the M pro enzyme was performed by regulating the genetic algorithm (GA) parameters, using 10 runs of the GA criteria. The docking binding poses were visualized with PyMOL Molecular Graphics System, Version 2.0 Schrödinger, LLC.

The current results suggest that CADD approaches relying on ligand-and structure-based methodologies could be used with success to predict new inhibitory MNPs against SARS-CoV-2 M pro . Five MNPs, a benzo[f]pyrano [4,3-b] chromene (Reaxys ID 7450892), notoamide I (Reaxys ID 19384758), emindole SB beta-mannoside (Reaxys ID 26845562), and two bromoindole derivatives (Reaxys IDs 10714788 and 10720065) were proposed as the most promise marine drug-like leads as SARS-CoV-2 M pro inhibitors. To our knowledge, the CADD ligand-based using a QSAR classification model, that was developed here, is the largest study ever performed with regard both to the number of molecules involved and to the number of structural families involved in the modeling of the antiviral activity against SARS-CoV-2 and the best model achieved an overall predictive accuracy of up to 67% for both test and training sets. In future works, the proposed five marine drug-like leads against SARS-CoV-2 M pro can be validated experimentally. Based on these results, virtual libraries of marine-derived drug-like leads can be built, which can be virtual screened using the best QSAR model and molecular docking against SARS-CoV-2 M pro .

Supplementary Materials: The following data are available online at http://www.mdpi.com/1660-3397/18/12/633/ s1, Tables S1-S6. The following files are available free of charge. Predictions of ADMET properties with in silico methods, using the pKCSM software for a list of fifteen selected MNPs by QSAR model against SARS-CoV-2 and molecular docking of Mpro enzyme are available as Supplementary Materials, Table S1 . SMILES strings of the data set (training, test, and test 2 sets), the corresponding experimental and predicted activities are available as Supplementary Materials, Tables S2, S3, and S4, respectively. Moreover, SMILES strings of the in-house MNPs and MNPs clinical pipeline sets, for the virtual screening data set, the corresponding predicted activities are available as Supplementary Materials, Tables S5 and S6, respectively (XLSX). 

Covid-19-Implications for the health care system

Structure of Mpro from SARS-CoV-2 and discovery of its inhibitors

Novel 2019 coronavirus structure, mechanism of action, antiviral drug promises and rule out against its treatment

Putative inhibitors of SARS-CoV-2 main protease from A Library of Marine Natural Products: A virtual screening and molecular modeling study

Marine natural compounds as potents inhibitors against the main protease of SARS-CoV-2-a molecular dynamic study

Potential inhibitor of COVID-19 main protease (Mpro) from several medicinal plant compounds by molecular docking study

Both Boceprevir and GC376 efficaciously inhibit SARS-CoV-2 by targeting its main protease

Investigating the potential antiviral activity drugs against SARS-CoV-2 by molecular docking simulation

Computational methodologies in the exploration of marine natural product leads

Have marine natural product drug discovery efforts been productive and how can we improve their efficiency?

Computational biophysical characterization of the SARS-CoV-2 spike protein binding with the ACE2 receptor and implications for infectivity

Structural and functional properties of SARS-CoV-2 spike protein: Potential antivirus drug development for COVID-19

Enriching cancer pharmacology with drugs of marine origin

Ten-Year Research Update Review: Antiviral Activities from Marine Organisms

Chemical-informatics approach to COVID-19 drug discovery: Exploration of important fragments and data mining based prediction of some hits from natural origins as main protease (Mpro) inhibitors

QSAR modeling of SARS-CoV Mpro inhibitors identifies Sufugolix, Cenicriviroc, Proglumetacin and other drugs as candidates for repurposing against SARS-CoV-2

Development of a simple, interpretable and easily transferable QSAR model for quick screening antiviral databases in search of novel 3C-like protease (3CLpro) enzyme inhibitors against SARS-CoV diseases

Inhibition of SARS-CoV-2 main protease 3CLpro by means of α-ketoamide and pyridone-containing pharmaceuticals using in silico molecular docking

Marine Viruses: Key Players in Marine Ecosystems

Re-examination of the relationship between marine virus and microbial cell abundances

Distribution of marine viruses in the Central and South Adriatic Sea

The Madeira Archipelago As a significant source of marine-derived Actinomycete diversity with Anticancer and Antimicrobial Potential

Intra-clade metabolomic profiling of MAR4 Streptomyces from the Macaronesia Atlantic region reveals a source of anti-biofilm metabolites

Antifouling Napyradiomycins from marine-derived Actinomycetes Streptomyces Aculeolatus

A chemoinformatics approach to the discovery of lead-like molecules from marine and microbial sources en route to antitumor and antibiotic drugs

A computational approach in the discovery of lead-like compounds for anticancer drugs. Front

A computer-driven approach to discover natural product leads for Methicillin-resistant Staphylococcus Aureus infection therapy

In silico HCT116 human colon cancer cell-based models en route to the discovery of lead-like anticancer drugs

Generation of a set of simple, interpretable ADMET rules of thumb

PaDEL-descriptor: An open source software to calculate molecular descriptors and fingerprints

Random Forests

AutoDock Vina: Improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading

Predicting small-molecule pharmacokinetic and toxicity properties using graph-based signatures

Proteomics of SARS-CoV-2-infected host cells reveals therapy targets

Comparing the performance of meta-classifiers-a case study on selected imbalanced data sets relevant for prediction of liver toxicity

R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing

Random Forest: A classification and regression tool for compound classification and QSAR modeling

Classification and regression by randomforest

Open Babel: An open chemical toolbox

We thank ChemAxon Ltd. for access to JChem and Marvin.

The authors declare no conflict of interest.