key: cord-0861247-7l1o95gn
authors: Kleandrova, Valeria V.; Scotti, Marcus T.; Speck-Planche, Alejandro
title: Computational Drug Repurposing for Antituberculosis Therapy: Discovery of Multi-Strain Inhibitors
date: 2021-08-19
journal: Antibiotics (Basel)
DOI: 10.3390/antibiotics10081005
sha: 1aa96246c34738ccc0f647ace8e605b710ab0ef3
doc_id: 861247
cord_uid: 7l1o95gn

Tuberculosis remains the most afflicting infectious disease known by humankind, with one quarter of the population estimated to have it in the latent state. Discovering antituberculosis drugs is a challenging, complex, expensive, and time-consuming task. To overcome the substantial costs and accelerate drug discovery and development, drug repurposing has emerged as an attractive alternative to find new applications for “old” drugs and where computational approaches play an essential role by filtering the chemical space. This work reports the first multi-condition model based on quantitative structure–activity relationships and an ensemble of neural networks (mtc-QSAR-EL) for the virtual screening of potential antituberculosis agents able to act as multi-strain inhibitors. The mtc-QSAR-EL model exhibited an accuracy higher than 85%. A physicochemical and fragment-based structural interpretation of this model was provided, and a large dataset of agency-regulated chemicals was virtually screened, with the mtc-QSAR-EL model identifying already proven antituberculosis drugs while proposing chemicals with great potential to be experimentally repurposed as antituberculosis (multi-strain inhibitors) agents. Some of the most promising molecules identified by the mtc-QSAR-EL model as antituberculosis agents were also confirmed by another computational approach, supporting the capabilities of the mtc-QSAR-EL model as an efficient tool for computational drug repurposing.

Tuberculosis (TB) constitutes the deadliest infectious disease that afflicts humankind. The causative agent of TB, Mycobacterium tuberculosis (Mtb), was responsible in 2018 for causing more than 10 million cases of active TB, resulting in 1.5 million deaths [1] . Despite the efforts of the scientific community in providing anti-TB therapies through the process known as drug discovery, several aspects pose great challenges for the safe and successful eradication of TB. On one hand, the thick cell wall formed by mycolic acids, the ability of certain enzymes to modify/inactivate drug molecules, the presence of drug efflux systems, and the occurrence of spontaneous mutations in the bacterial genome are the main biological attributes that make Mtb considerably resistant (multidrug-resistant TB) to current anti-TB treatments [2] . On the other hand, the FDA-approved anti-TB drugs are associated with a low diversity of mechanisms of action [3] and exhibit a wide range of side effects [4] . All these ideas, together with the complexity and remarkable expenditure of time and financial resources during the drug discovery process [5] , demonstrate that TB is hard to treat and that efficacious anti-TB agents are urgently needed.

A plausible way to tackle the TB problem is to apply the drug repurposing philosophy, which focuses on finding new applications for "old" drugs [6] . In this sense, in the context Antibiotics 2021, 10, 1005 2 of 18 of identifying novel anti-TB agents, computational approaches have played an essential role by filtering the chemical space, providing different tools that can speed up the discovery of molecules able to inhibit the growth of Mtb [7] [8] [9] [10] [11] [12] [13] [14] . However, all the computational approaches reported to date have at least one of the following aspects that prevent their full exploitation in virtual screening scenarios: (a) the anti-TB activity is predicted in a generic manner or by considering either only one target protein or Mtb strain, (b) small datasets of structurally related chemicals are used to build the computational models, (c) there is no information regarding the experimental protocol or the assay time employed to assess the inhibitory activity against Mtb, and (d) no insights are provided concerning the physicochemical properties and structural features that are required in a molecule to have anti-TB activity.

To solve the aforementioned limitations, several researchers have emphasized the use of interpretable in silico models focused on a combination of perturbation theory concepts and machine learning techniques (PTML) [15] [16] [17] , which can integrate different sources of chemical and biological data, enabling the simultaneous prediction of multiple biological endpoints against many targets of varying degrees of complexity. Seminal works on PTML models have found successful applications in diverse research areas such as infectious diseases [18, 19] , oncology [20, 21] , neuroscience [22] [23] [24] [25] , proteomics [26] , metabolomics [27] , nanotechnology [28] [29] [30] [31] , toxicology [32] , and immunology and immunotoxicity [33, 34] .

Bearing in mind all the previous ideas, we report in this work a special case of the PTML modeling methodology in the context of the search for novel anti-TB chemical therapies, establishing the theoretical foundations for the computational repurposing of drugs as anti-TB agents. Thus, we have developed here a multi-condition model based on quantitative structure-activity relationships and an ensemble of neural networks (mtc-QSAR-EL) able to perform virtual screening for the identification of multi-strain inhibitors of Mtb. We also provide the physicochemical and fragment-based structural interpretation of the different molecular descriptors used to build the mtc-QSAR-EL model. Finally, we performed a virtual screening of a large and heterogeneous dataset of agency-regulated chemicals, identifying both already proven anti-TB agents and promising chemical structures with the potential to inhibit Mtb.

The best mtc-QSAR-EL model found by us has twelve D[LQI]cj descriptors, six of them based on hydrophobic factors, five containing steric information, and one focused on polar features of the molecules. Additionally, in terms of atom types, five D[LQI]cj descriptors were based on the effect of halogen atoms, three indicating the influence of heteroatoms (mainly N, O, S, and P), two characterizing the presence of methyl groups, and two accounting for the importance of aromatic carbons. The details regarding each D[LQI]cj descriptor appear in Table 1 , while all the molecular descriptors of the chemicals and the corresponding biological data are reported in Supplementary Material 1.

The most appropriate mtc-QSAR-EL model developed here was an ensemble formed by three MLP networks whose parameters are represented in Table 2 . These MLP networks have different numbers of neurons in the hidden layer and require diverse error functions and different numbers of epochs to be trained. Two of these MLP networks (first and third) have the same type of activation function (hyperbolic tangent), with the other one having a logistic function. In the output layer, the first and second MLP networks use a softmax function, while the third one uses a logistic function. The combination of these aspects resulted in a difference in performance among these MLP networks, particularly in the case of the sensitivities [Sn(%)]ta, [Sn(%)]bs, and [Sn(%)]ap, as well as the specificities [Sp(%)]ta, [Sp(%)]bs, and [Sp(%)]ap. These six local metrics were of paramount importance in assessing the classification performance of the mtc-QSAR-EL model in both trained and unseen data (training and test sets, respectively) when considering the dissimilar experimental conditions cj (see Section 3 for full explanation) under which the molecules were assayed. 

Deviation of the atom-based stochastic quadratic index of order 3, weighted by the hydrophobicity of the halogens and their chemical environments, and depending on the chemical structure and the assay time.

Deviation of the atom-based stochastic quadratic index of order 0, weighted by the polarizability of the halogens, and depending on the chemical structure and the assay time.

Deviation of the atom-based stochastic quadratic index of order 0, weighted by the hydrophobicity of the heteroatoms, and depending on the chemical structure and the assay time.

Deviation of the atom-based stochastic quadratic index of order 1, weighted by the polar surface area of the heteroatoms and their adjacent atoms, and depending on the chemical structure and the assay time.

Deviation of the atom-based stochastic quadratic index of order 0, weighted by the hydrophobicity of the methyl groups, and depending on the chemical structure and the Mtb strain against which the molecule was tested.

Deviation of the atom-based stochastic quadratic index of order 0, weighted by the atomic weight of the aromatic carbons, and depending on the chemical structure and the Mtb strain against which the molecule was tested.

Deviation of the atom-based stochastic quadratic index of order 2, weighted by the hydrophobicity of the heteroatoms and their chemical environments, and depending on the chemical structure and the Mtb strain against which the molecule was tested.

Deviation of the atom-based stochastic quadratic index of order 2, weighted by the hydrophobicity of the halogens and their chemical environments, and depending on the chemical structure and information regarding the assay protocol.

Deviation of the atom-based stochastic quadratic index of order 5, weighted by the atomic weight of the halogens and their chemical environments, and depending on the chemical structure and information regarding the assay protocol.

Deviation of the atom-based stochastic quadratic index of order 3, weighted by Kupchick's vertex degrees of the halogens and their chemical environments, and depending on the chemical structure and information regarding the assay protocol.

Deviation of the atom-based stochastic quadratic index of order 2, weighted by the hydrophobicity of the methyl groups and their chemical environments, and depending on the chemical structure and information regarding the assay protocol.

Deviation of the atom-based stochastic quadratic index of order 2, weighted by the atomic weight of the aromatic carbons and their chemical environments, and depending on the chemical structure and information regarding the assay protocol. Last, we would like to highlight that, from a physicochemical and structural point of view, the present mtc-QSAR-EL model classified anti-TB drugs belonging to a wide Last, we would like to highlight that, from a physicochemical and structural point of view, the present mtc-QSAR-EL model classified anti-TB drugs belonging to a wide spectrum of chemical families (Figures 1 and 2 ) such as polyketides, ethylenediamine derivative, aminoglycosides, nitroimidazopyrans, fluoroquinolones, and diarylquinolines. Last, we would like to highlight that, from a physicochemical and structural point of view, the present mtc-QSAR-EL model classified anti-TB drugs belonging to a wide spectrum of chemical families (Figures 1 and 2 ) such as polyketides, ethylenediamine derivative, aminoglycosides, nitroimidazopyrans, fluoroquinolones, and diarylquinolines. Regarding the AD of the mtc-QSAR-EL model, as reported in seminal references (see Section 3.3), we calculated the so-called total score of applicability domain (TSAD), which is derived from the descriptors' space approach. Since our mtc-QSAR-EL model contained twelve D[LQI]cj descriptors, only those chemicals with TSAD = 12 were considered to be inside the AD (Supplementary Material 3). By carefully inspecting our dataset, we found that only 15 out of 1571 molecules/cases in the dataset were outside the AD of the mtc-QSAR-EL model, 14 of them with TSAD = 11 and one with TSAD = 10. However, Antibiotics 2021, 10, 1005 6 of 18 13 out of these 15 seemingly atypical chemicals were correctly classified by considering the consensus prediction approach performed by the mtc-QSAR-EL model. We decided to keep these chemicals in the dataset since the consensus predictions constituted the priority over the descriptors' space approach to define the AD.

The sensitivity values SV of the D[LQI]cj descriptors are depicted in Figure 3 , where the highest values indicate those D[LQI]cj descriptors with the greatest influence and discriminatory power in the mtc-QSAR-EL model. Simultaneously, the comparison between the class-based means for each D[LQI]cj descriptor represented in Table 4 shows us the type of variation that the value of a given D[LQI]cj descriptor should undergo to potentiate the anti-TB activity against the different Mtb strains. Regarding the AD of the mtc-QSAR-EL model, as reported in seminal references (see Section 3.3), we calculated the so-called total score of applicability domain (TSAD), which is derived from the descriptors' space approach. Since our mtc-QSAR-EL model contained twelve D[LQI]cj descriptors, only those chemicals with TSAD = 12 were considered to be inside the AD (Supplementary Material 3). By carefully inspecting our dataset, we found that only 15 out of 1571 molecules/cases in the dataset were outside the AD of the mtc-QSAR-EL model, 14 of them with TSAD = 11 and one with TSAD = 10. However, 13 out of these 15 seemingly atypical chemicals were correctly classified by considering the consensus prediction approach performed by the mtc-QSAR-EL model. We decided to keep these chemicals in the dataset since the consensus predictions constituted the priority over the descriptors' space approach to define the AD.

The sensitivity values SV of the D[LQI]cj descriptors are depicted in Figure 3 , where the highest values indicate those D[LQI]cj descriptors with the greatest influence and discriminatory power in the mtc-QSAR-EL model. Simultaneously, the comparison between the class-based means for each D[LQI]cj descriptor represented in Table 4 shows us the type of variation that the value of a given D[LQI]cj descriptor should undergo to potentiate the anti-TB activity against the different Mtb strains. As mentioned before, in our mtc-QSAR-EL model, we have six D[LQI]cj descriptors characterizing information regarding atomic hydrophobic contributions. In this sense, we would like to highlight that such contributions are based on the hydrophobicity scale proposed by Ghose and co-workers [35] . According to this scale, aliphatic carbon atoms present negative hydrophobicity values, excluding the fragments of the form CHX3, CR2X2, CRX3, and CX4 (X = O, N, S, P, Se, or halogen). Nitrogenated and oxygenated functional groups also have negative hydrophobic contributions except for the pyrrolic nitrogen (oxygen in furan) atom, Ar-NH-Ar (and its oxygenated counterpart), with Ar being an aromatic (or heteroaromatic ring), and all the tertiary amines. As mentioned before, in our mtc-QSAR-EL model, we have six D[LQI]cj descriptors characterizing information regarding atomic hydrophobic contributions. In this sense, we would like to highlight that such contributions are based on the hydrophobicity scale proposed by Ghose and co-workers [35] . According to this scale, aliphatic carbon atoms present negative hydrophobicity values, excluding the fragments of the form CHX3, CR2X2, CRX3, and CX4 (X = O, N, S, P, Se, or halogen). Nitrogenated and oxygenated functional groups also have negative hydrophobic contributions except for the pyrrolic nitrogen (oxygen in furan) atom, Ar-NH-Ar (and its oxygenated counterpart), with Ar being an aromatic (or heteroaromatic ring), and all the tertiary amines.

With that being said, the results presented in Table 4 indicate that the anti-TB activity through the inhibition of different Mtb strains can be enhanced by increasing the value of D[LASSq3(HYD)G]ta, which describes the augmentation of the product of the hydrophobic contributions of any two atoms (with at least one of them being a halogen) that are placed at the topological distance (number of bonds between two atoms without considering bond multiplicity) of three. Examples of fragments with this characteristic are the 5-(halomethyl)pyrimidines, including those with substitutions in positions 4 and/or 6. Notice that D[LASSq3(HYD)G]ta is the eleventh most important D[LQI]cj descriptor in the mtc-QSAR-EL model. We also have D[LASSq0(HYD)Y]ta whose increase is directly proportional to the number of heteroatoms in a molecule. In this sense, the presence of fragments such as Ar-NO 2 (Ar = aromatic or heteroaromatic ring), primary amines, amides, hydroxyl groups (alcohol), thiols, thioethers, functional groups containing sulfur (with sp2 hybridization) attached to a carbon atom, phosphite, and R3-P = X (R = any group link though carbon, Finally, we have D[LASSq1(PSA)Y]ta (with the ninth highest significance) characterizing the increase in the polar surface area of any two adjacent heteroatoms, and, therefore, only pyridazine and 1,2,3-triazine rings, as well as the sulfonamide and phosphoruscontaining functional groups (phosphorus forming bonds with only oxygen and/or sulfur), will considerably augment the value of D[LASSq1(PSA)Y]ta.

We performed the virtual screening of a dataset formed by 8898 agency-regulated chemicals (Supplementary Material 4), which included (but was not limited to) investigational and FDA-approved drugs. These were predicted by considering the 24 experimental conditions cj reported in the dataset used to build the mtc-QSAR-EL model, yielding 213,552 predictions (Supplementary Materials 5 and 6). In terms of the reliability of the predictions using the AD of the mtc-QSAR-EL model according to the descriptors' space approach, we generated the TSAD values for each of these 8898 chemicals ( Supplementary  Material 7) . Then, the metrics FA(%) and S(TSAD) were calculated (see Section 3.5 for full explanation). In any case, we would like to highlight that FA(%) describes the anti-TB potential of a molecule because it expresses the frequency in which a chemical is predicted as active against Mtb. A high FA(%) value (maximum value is 100%) for a chemical means that it has a higher probability of having anti-TB activity by inhibiting multiple Mtb strains with MIC 90 < 7622.22 nM, which is the highest of the MIC 90 cutoffs reported in this work (see Section 3.1). At the same time, a high S(TSAD) value (the ideal value = 12 × 24 experimental conditions cj = 288, with 12 being the maximum TSAD value) indicates the general tendency of a chemical to be inside the AD according to the descriptors' space approach, which, together with the consensus predictions performed by the mtc-QSAR-EL model, helps to assess the degree of reliability of such predictions.

The combined use of FA(%) and S(TSAD) (Supplementary Material 8) allowed us to rank the 8898 agency-regulated chemicals, and those with FA(%) > 80% and S(TSAD) ≥ 276 were the top ranked, giving priority to those exhibiting the highest FA(%) values. Notice that there is no "rule of thumb" in terms of the criteria used to prioritize chemicals. Therefore, in the present study, we employed these rigorous cutoff values for FA(%) and

Antibiotics 2021, 10, 1005 9 of 18 S(TSAD) to achieve a virtual screening hit rate of 0.49% (44 out of 8898 chemicals) which is higher than the greatest hit rate value of 0.14% for high-throughput screening but lower than smallest hit rate value for virtual screening campaigns (1%) [36] .

By using the aforementioned metrics for compound prioritization, the mtc-QSAR-EL model identified three chemicals with experimentally proven anti-TB activity ( Figure 4) : macozinone (PBTZ-169), BTZ-043, and niclosamide. Notice that macozinone is a remarkably potent piperazine-containing benzothiazinone, being able to inhibit multiple Mtb strains at MIC 99 ≤ 1 nM [37] . In the case of BTZ-043, although structurally related to macozinone, it lacks the 2-(4-(cyclohexylmethyl)piperazin-1-yl) moiety, which decreases its activity. Still, BTZ-043 is a nanomolar inhibitor of several Mtb strains [37, 38] . On the other hand, niclosamide offers very attractive opportunities for anti-TB therapies because, in addition to being a recognized antihelminthic drug, it also has a wide-spectrum antimicrobial profile, which includes nanomolar to micromolar potency against diverse viruses (including coronaviruses) [39] , as well as anti-TB activity in the low micromolar range [40, 41] .

the general tendency of a chemical to be inside the AD according to the descriptors' space approach, which, together with the consensus predictions performed by the mtc-QSAR-EL model, helps to assess the degree of reliability of such predictions.

The combined use of FA(%) and S(TSAD) (Supplementary Material 8) allowed us to rank the 8898 agency-regulated chemicals, and those with FA(%) > 80% and S(TSAD) ≥ 276 were the top ranked, giving priority to those exhibiting the highest FA(%) values. Notice that there is no "rule of thumb" in terms of the criteria used to prioritize chemicals. Therefore, in the present study, we employed these rigorous cutoff values for FA(%) and S(TSAD) to achieve a virtual screening hit rate of 0.49% (44 out of 8898 chemicals) which is higher than the greatest hit rate value of 0.14% for high-throughput screening but lower than smallest hit rate value for virtual screening campaigns (1%) [36] .

By using the aforementioned metrics for compound prioritization, the mtc-QSAR-EL model identified three chemicals with experimentally proven anti-TB activity ( Figure 4) : macozinone (PBTZ-169), BTZ-043, and niclosamide. Notice that macozinone is a remarkably potent piperazine-containing benzothiazinone, being able to inhibit multiple Mtb strains at MIC99 ≤ 1 nM [37] . In the case of BTZ-043, although structurally related to macozinone, it lacks the 2-(4-(cyclohexylmethyl)piperazin-1-yl) moiety, which decreases its activity. Still, BTZ-043 is a nanomolar inhibitor of several Mtb strains [37, 38] . On the other hand, niclosamide offers very attractive opportunities for anti-TB therapies because, in addition to being a recognized antihelminthic drug, it also has a wide-spectrum antimicrobial profile, which includes nanomolar to micromolar potency against diverse viruses (including coronaviruses) [39] , as well as anti-TB activity in the low micromolar range [40, 41] . We should recall that the cutoff values of the metrics FA(%) and S(TSAD) used by us are very rigorous. If, for instance, we slightly relax these two metrics (e.g., FA(%) > 60% and S(TSAD) ≥ 270), other agency-regulated chemicals with experimentally proven anti-TB activity pop up. These are the cases of biapenem (FA(%) = 66.67% and S(TSAD) = 288) and TBA-354 (FA(%) = 91.67% and S(TSAD) = 270), whose anti-TB profile has been demonstrated in vitro (low micromolar range) and in vivo [42, 43] . Notice that further relaxing FA(%) and/or S(TSAD) will allow the mtc-QSAR-EL model to identify more anti-TB agents, but a larger number of false positives may also be predicted. In the end, it is up to the analyst to select the adequate values of the metrics FA(%) and S(TSAD), which will depend on the number of drugs available for testing, with this being a key element when planning the expenditure of financial resources for experimental validation of virtually selected chemicals. In any case, we advise the use of FA(%) > 29% and S(TSAD) ≥ 250 since with these cutoffs, most of the known FDA-approved and investigational anti-TB drugs (not included in the dataset used to build our mtc-QSAR-EL model) were identified in the virtual screening analysis. We would like to highlight that the FA(%) value suggested by us is in the range reported for the hit rate in the prospective virtual screening [36, 44] .

Returning to the top 44 molecules predicted by the mtc-QSAR-EL model from the 8898 agency-regulated chemicals, we ran an experiment. To obtain a deeper insight regarding the new molecular patterns with promising anti-TB potential, we used the webserver mycoCSM [45] , which is an online tool with the capability to predict the antimycobacterial profile of any molecule, including the anti-TB activity. The results of the top 44 agencyregulated chemicals identified as potential anti-TB agents (multi-strain inhibitors of Mtb) by our mtc-QSAR-EL model together with the predictions derived from mycoCSM appear in Table 5 . It can be seen that the mtc-QSAR-EL model and the webserver mycoCSM converge in 10 out of 44 chemicals (22.73%). In our opinion and experience, such a convergence is very good taking into account that the mtc-QSAR-EL model and the webserver mycoCSM employed dissimilar approaches to characterize the molecular structure, different machine learning algorithms, and distinct ways to consider experimental information when building the computational models. In the end, given all the experimental and theoretical evidence, we can conclude that our mtc-QSAR-EL model can be efficiently used alone or in combination with other in silico tools for virtual screening of anti-TB molecules, which may inhibit several Mtb strains. 

All the chemical and biological data associated with the anti-TB activity were retrieved from the public database known as ChEMBL [46] [47] [48] . The dataset used in the present study was formed by 1237 molecules belonging to different chemical families. These molecules were experimentally tested for their inhibitory activity against Mtb and where the MIC 90 (minimum inhibitory concentration that prevents the visible growth in 90% of the Mtb isolates) was measured. More specifically, each molecule was assayed by considering at least 1 out of 7 assay times (ta), against at least 1 out of 8 Mtb strains (bs), and involving at least 1 out of 4 assay protocols (ap). Notice that each combination of the elements ta, bs, and ap represents a unique experimental condition cj, which can be expressed as cj(ta, bs, ap). If a molecule was found to be assayed more than one time by considering the same experimental condition cj, the duplicate data were deleted, and we kept the lowest MIC 90 value for that molecule. If two stereoisomers were tested under the same experimental condition cj, we kept only the stereoisomer with the lowest MIC 90 value. In any case, in our dataset, most of the molecules were tested by considering only one cj, and, therefore, after removing entries with lacking SMILES codes, values, units, duplicates, and unclear information regarding the assay time or the test protocol, the dataset ended up having 1571 cases. Each case/molecule in the dataset was annotated as active (ATB i (c j ) = 1) or inactive (ATB i (c j ) = −1), where ATB i (c j ) was a dichotomous variable indicating the anti-TB activity of the ith case/molecule under the experimental condition c j . Assignments of active and inactive cases were realized by considering the different MIC 90 cutoff values depending on the assay time (Table 6 ). Mtb (H37Rv) Broth dilution method 7d

Mtb (RMP-R) AlamarBlue/Resazurin/MABA method 7d

Mtb (H37Rv) Spectrophotometric assay (OD580-OD600) 7d

Mtb (MC2 6220_RF) Spectrophotometric assay (OD580-OD600) 7d

Mtb (MC2 6220_NRF) Spectrophotometric assay (OD580-OD600) a Activity value from which a molecule was considered active. b ta-assay time. c bs-bacterial strain belonging to Mtb. d The following abbreviations were used: RF (non-replicative form), RF (replicative form), INH-R (isoniazid-resistant strain), and RMP-R (rifampicinresistant strain). e ap-information regarding the assay protocol.

Several reasons justified the selection of the MIC 90 cutoffs in Table 6 . First, by inspecting our dataset formed by the 1571 cases, one can see that there are FDA-approved anti-TB drugs (isoniazid, rifampicin, ethambutol, etc.) with MIC 90 values that fall either above or below the cutoffs, which means that if a query chemical is predicted as active (or inactive) by the mtc-QSAR-EL model, it will be possible to assess the differences between the inhibitory potential of that query molecule and the actual activity of the current anti-TB drugs as a consequence of their differences in their chemical structures. Second, the annotations of the molecules as active or inactive employed a classification approach instead of a regression one. Notice that, in contrast to their regression counterparts, classification approaches do not need to predict the exact values of a biological property in a dataset, and, therefore, they do not need to deal so much with the potentially great uncertainty of the data. Third, the chosen MIC 90 cutoffs prevented (as much as it was possible) the imbalance between the number of molecules labeled as active and the number of molecules considered inactive. Fourth, the MIC 90 cutoffs are in the range (some of them are lower and, therefore, more rigorous) of the cutoffs selected in high-throughput screening campaigns to prioritize chemicals with potent anti-TB activity [49] . Fifth, from a phenomenological point of view, even though several anti-TB drugs appear in our dataset, the purpose of our mtc-QSAR-EL was to perform virtual screening of large and heterogenous external datasets to identify novel molecular patterns different from those present in the current anti-TB drugs.

The SMILES codes of the 1571 cases were stored in a file of the type *.smi. Following this, we used the software OpenBabel v2.4.0 [50] to convert this file to *.sdf, obtaining the connectivity table for each chemical present in our dataset; no additional standardization actions were applied. Then, using the sdf file as input, we employed the software QuBiLS-MAS v1.0 to compute the molecular descriptors known as local atom-based stochastic quadratic indices LQI [51, 52] . These are topological descriptors with successful applications in medicinal chemistry and drug discovery [53, 54] . We calculated the LQI descriptors of order k (with k from 0 to 5) by using predefined parameters such as algebraic form (quadratic), constraints (atom-based), matrix type (stochastic), cutoff (keep all), groups (local-referring to specific atom types such as aliphatic and aromatic carbons, methyl groups, halogens, and heteroatoms), and aggregation operator (Manhattan distance). These LQI descriptors were weighted by atomic properties such as hydrophobicity (HYD), atomic weight (AW), polarizability (POL), polar surface area (PSA), and Kupchick vertex degree (KU).

Notice that the LQI descriptors are not able to discriminate the effect of the chemical structure of a molecule on the anti-TB activity when the experimental condition cj changes, e.g., use of different assay times (ta), Mtb strains (bs), and/or test protocols (ap). This means that the LQI descriptor for a molecule will have the same value regardless of the experimental condition cj used to assess the anti-TB activity of that molecule. To solve this inconvenience, we applied the adaptation of the Box-Jenkins approach, which is a distinctive characteristic of all the PTML models [18] [19] [20] [21] [22] [23] [24] [25] [26] [27] [28] [29] [30] [31] [32] [33] [34] 55] :

In Equation (1), the term avg[LQI]cj is the average of the LQI i values for all the molecules in the training set, which were annotated as active by considering the same experimental condition cj. Consequently, n(cj) refers to the number of molecules labeled as active (also in the training set) that were tested under the same cj. Notice that, as mentioned before, the experimental condition cj depends on the elements ta, bs, and ap, and, therefore, Equation (1) was applied to each of the elements of cj separately. For instance, if cj = ta, then avg[LQI]cj = avg[LQI]ta and n(cj) = n(ta), meaning that, in this case, Equation (1) was employed for calculations depending only on the assay times. The same line of thinking was applied to the elements bs and ap. In the second step of the Box-Jenkins approach,

In Equation (2), D[LQI]cj is a multi-target descriptor that considers both the chemical structure of a molecule and a specific element of the experimental condition cj. Therefore, as in the case of Equation (1), Equation (2) was applied to each element of cj separately. The descriptor D[LQI]cj characterizes how much any molecule physicochemically and structurally deviates from a group of molecules annotated as active, having been tested by considering the same element of cj. Additionally, SDev[LQI] is the standard deviation calculated from all the values of each LQI descriptor for the molecules/cases present only in the training set. Last, the term p(cj) is the a priori probability calculated as the quotient of the number of molecules/cases in the training set assayed by involving a given element of cj and the total number of compounds present in the training set.

When generating the mtc-QSAR-EL model, we followed a series of steps. First, we divided our dataset (containing 1571 cases) into training and test sets. In this sense, for each assay time, we sorted the molecules in ascending order of their MIC 90 values. After, within each assay time, the first three molecules/cases were annotated as members of the training set, while the fourth molecule/case was labeled to belong to the test set. Such a ratio of 3:1 was repeated in the whole dataset. Thus, the training set was formed by 1184 molecule/cases (75.37% of the dataset), 602 active and 582 inactive; the training set was employed to find the best mtc-QSAR-EL model. The test set (the remaining 24.63%) contained 387 molecules/cases, 201 considered active and 186 defined as inactive; the test set was used to estimate the predictive power of the mtc-QSAR-EL model.

We employed the software IMMAN v1.0 [56] , which allowed us to prioritize those D[LQI]cj descriptors that were likely to exhibit the highest discriminatory power. In doing so, we used two information theory metrics: the differential Shannon entropy [57] and the information gain ratio [58] . We ranked the D[LQI]cj descriptors according to their decreasing values of the geometric mean between the two aforementioned metrics. While selecting the D[LQI]cj descriptors with the highest discriminatory power (high geometric mean values), we estimated the redundancy among them by computing the pairwise Pearson correlation coefficient (PCC) [59] . Only the D[LQI]cj descriptors with pairwise correlation values in the interval −0.7 < PCC < 0.7 were chosen to construct the mtc-QSAR-EL model.

We should recall that the mtc-QSAR-EL model is an ensemble of artificial neural networks (ANNs), and, thus, to form such an ensemble, we searched for the best ANNs, all of them being based on the multi-layer perceptron (MLP) architecture. When selecting the most appropriate MLP networks, we examined, in a first step, global statistical metrics such as accuracy [Ac(%)], Matthews' correlation coefficient (MCC) [60] [61] was employed to generate the MLP networks and subsequently build the mtc-QSAR-EL model.

When defining the applicability domain (AD) of the mtc-QSAR-EL model, we used two approaches. One of them was the consensus predictions since our mtc-QSAR-EL model was an ensemble of different MLP networks [62, 63] . This means that for each molecule present in our dataset, the predicted probabilities given by each of the MLP networks were averaged, resulting in the probability of the ensemble (mtc-QSAR-EL) model for that molecule. As the second method to estimate the AD, we used a modification of the descriptors' space approach reported by Speck-Planche [64] . In doing so, for each molecule present in the dataset containing the 1571 cases/molecules, we generated the local scores (LS) of the applicability domain for each D[LQI]cj descriptor in the following manner. If for a specific D[LQI]cj descriptor, the D[LQI]cj value of a query molecule fell between the maximum and minimum D[LQI]cj values, the LS took the value of one; otherwise, the LS was equal to zero. This operation was repeated for each D[LQI]cj descriptor that entered in the mtc-QSAR-EL model. Then, the metric known as the total score of applicability domain (TSAD) was calculated for the aforementioned query molecule as the sum of the LS values. In the end, in magnitude, the maximum value of TSAD was equal to the number of D[LQI]cj descriptors present in the mtc-QSAR-EL model.

Due to the non-linear nature of the mtc-QSAR-EL model, we provided the physicochemical and structural interpretation of the D[LQI]cj descriptors by strictly following the guidelines recently reported by Speck-Planche and co-workers [21, 64] . In this sense, we used the sensitivity values SV calculated by the ANN package of the software STATISTICA v13.5.0.17 to rank the different D[LQI]cj descriptors in terms of their importance in the mtc-QSAR-EL model while calculating the class-based means to help us gain insights regarding how the values of the D[LQI]cj descriptors should vary (increase or decrease) to enhance both the anti-TB activity and the versatility of a molecule to inhibit more than one Mtb strain.

We virtually screened a large and heterogeneous dataset formed by 8898 agencyregulated chemicals. None of these molecules were present in the dataset used to build the mtc-QSAR-EL model (1571 chemicals/cases). We predicted each of the 8898 agencyregulated chemicals under the 24 experimental conditions cj reported in Table 6 . Then, we generated the metric FA(%), which expressed the percentage of experimental conditions cj in which each of these 8898 agency-regulated chemicals was predicted as active (anti-TB agent) by the mtc-QSAR-EL model (Supplementary Material 8) . We also defined a second metric symbolized as S(TSAD); this was calculated as the sum of all the TSAD values reported for a single agency-regulated chemical by considering the 24 experimental conditions cj. The meaning of S(TSAD) was that the higher its value, the more reliable the predictions associated with that agency-regulated chemical in terms of whether it fell within the AD of the mtc-QSAR-EL model by considering the 24 experimental conditions cj (Supplementary Material 8). By using FA(%) and S(TSAD), we could rank the agencyregulated chemicals according to their predicted anti-TB activity and the reliability of the predictions, respectively.

The search for novel and more efficient anti-TB therapies can be greatly accelerated by employing in silico models as tools in the context of computational drug repurposing, where the prioritized hits could then be experimentally validated. Current computeraided drug discovery approaches should, however, focus on including more information regarding the experimental conditions under which the molecules are assayed. By doing so, computational models will be able to make more accurate predictions while also providing a deeper phenomenological understanding of the physicochemical properties and structural requirements associated with the potentiation of the anti-TB activity and versatility to inhibit different Mtb strains. The mtc-QSAR-EL model created in this work constitutes an advance in early drug discovery against TB, demonstrating that it is possible to identify anti-TB agents and/or multi-strain inhibitors of Mtb via virtual screening while proposing new molecular patterns that could be promising as starting points for anti-TB drug development. The present report confirms the promising applications of the PTML modeling methodology, which can be extended to diverse research areas devoted to finding treatments for unmet needs. 

A balancing act: Efflux/influx in mycobacterial drug resistance

Strategies in anti-Mycobacterium tuberculosis drug discovery based on phenotypic screening

New agents for the treatment of drug-resistant Mycobacterium tuberculosis

Innovation in the pharmaceutical industry: New estimates of R&D costs

Drug repurposing: Progress, challenges and recommendations

Simultaneous modeling of antimycobacterial activities and ADMET profiles: A chemoinformatic approach to medicinal chemistry

New insights toward the discovery of antibacterial agents: Multi-tasking QSBER model for the simultaneous prediction of anti-tuberculosis activity and toxicological profiles of drugs

In silico discovery and virtual screening of multi-target inhibitors for proteins in Mycobacterium tuberculosis

Design of novel antituberculosis compounds using graph-theoretical and substructural approaches

Global Bayesian models for the prioritization of antitubercular agents

QSAR studies on chalcones and flavonoids as anti-tuberculosis agents using genetic function approximation (GFA) method

Antibiotics 2021

Antimycobacterial pyrroles: Synthesis, anti-Mycobacterium tuberculosis activity and QSAR studies

Yildiz, I. QSAR of antimycobacterial activity of benzoxazoles by optimal SMILES-based descriptors

Label Algorithms: Models, Software, and Applications

Multitasking models for quantitative structure-biological effect relationships: Current status and future perspectives to speed up drug discovery

The urgent need for pan-antiviral agents: From multitarget discovery to multiscale design

Multioutput Perturbation-Theory Machine Learning (PTML) Model of ChEMBL Data for Antiretroviral Compounds

Modeling Antibacterial Activity with Machine Learning and Fusion of Chemical Structure Information with Microorganism Metabolic Networks

PTML Combinatorial Model of ChEMBL Compounds Assays for Multiple Types of Cancer

Multi-Target Drug Discovery via PTML Modeling: Applications to the Design of Virtual Dual Inhibitors of CDK4 and HER2

Big Data Challenges Targeting Proteins in GPCR Signaling Pathways; Combining PTML-ChEMBL Models and [(35)S]GTPgammaS Binding Assays

PTML Modeling for Alzheimer's Disease: Design and Prediction of Virtual Multi-Target Inhibitors of GSK3B, HDAC1, and HDAC6

Perturbation Theory/Machine Learning Model of ChEMBL Data for Dopamine Targets: Docking, Synthesis, and Assay of New l-Prolyl-l-leucyl-glycinamide Peptidomimetics

Multi-Target Mining of Alzheimer Disease Proteome with Hansch's QSBR-Perturbation Theory and Experimental-Theoretic Study of New Thiophene Isosters of Rasagiline

PTML Model of Enzyme Subclasses for Mining the Proteome of Biofuel Producing Microorganisms

Predicting Metabolic Reaction Networks with Perturbation-Theory Machine Learning (PTML) Models

Designing nanoparticle release systems for drug-vitamin cancer co-therapy with multiplicative perturbation-theory machine learning (PTML) models

Predicting coated-nanoparticle drug release systems with perturbation-theory machine learning (PTML) models

PTML Model for Selection of Nanoparticles, Anticancer Drugs, and Vitamins in the Design of Drug-Vitamin Nanoparticle Release Systems for Cancer Cotherapy

Computational MitoTarget Scanning Based on Topological Vacancies of Single-Walled Carbon Nanotubes with the Human Mitochondrial Voltage-Dependent Anion Channel (hVDAC1)

In silico assessment of the acute toxicity of chemicals: Recent advances and new model for multitasking prediction of toxic effect

PTML Model for Proteome Mining of B-Cell Epitopes and Theoretical-Experimental Study of Bm86 Protein Sequences from Colima

Model for high-throughput screening of drug immunotoxicity-Study of the anti-microbial G1 over peritoneal macrophages using flow cytometry

Prediction of Hydrophobic (Lipophilic) Properties of Small Organic Molecules Using Fragmental Methods: An Analysis of ALOGP and CLOGP Methods

Hit identification and optimization in virtual screening: Practical recommendations based on a critical literature analysis

Towards a new combination therapy for tuberculosis with next generation benzothiazinones

Clinical isolates of Mycobacterium tuberculosis in four European hospitals are uniformly susceptible to benzothiazinones

Broad Spectrum Antiviral Agent Niclosamide and Its Therapeutic Potential

Dual activity of niclosamide to suppress replication of integrated HIV-1 and Mycobacterium tuberculosis (Beijing)

Antituberculosis activity of certain antifungal and antihelmintic drugs. Tuber. Lung Dis

In vitro and in vivo activity of biapenem against drug-susceptible and rifampicin-resistant Mycobacterium tuberculosis

In vitro and in vivo activities of the nitroimidazole TBA-354 against Mycobacterium tuberculosis

Evaluating virtual screening methods: Good and bad metrics for the "early recognition" problem

Using Graph-Based Signatures to Identify Safe Potent Hits against Mycobacteria

ChEMBL: A large-scale bioactivity database for drug discovery

Mining the ChEMBL database: An efficient chemoinformatics workflow for assembling an ion channelfocused screening library

An interview with John Overington, team leader, chemogenomics at the European Bioinformatics Institute Outstation of the European Molecular Biology Laboratory (EMBL-EBI). Interview by Wendy A. Warr

A high-throughput whole cell screen to identify inhibitors of Mycobacterium tuberculosis

Open Babel: An open chemical toolbox

QuBiLS-MAS, open source multi-platform software for atom-and bond-based topological (2D) and chiral (2.5D) algebraic molecular descriptors computations

QUBILs-MAS: Free Software for Molecular Descriptors Calculator from Quadratic, Bilinear and Linear Maps Based on Graph-Theoretic Electronic-Density Matrices and Atomic Weightings

QuBiLs-MAS method in early drug discovery and rational drug identification of antifungal agents

Discovery of novel anti-inflammatory drug-like compounds by aligning in silico and in vivo screening: The nitroindazolinone chemotype

De novo computational design of compounds virtually displaying potent antibacterial activity and desirable in vitro ADMET profiles

IMMAN: Free software for information theory-based chemometric analysis

Differential Shannon Entropy as a sensitive measure of differences in database variability of molecular descriptors

Induction of decision trees

Notes on regression and inheritance in the case of two parents

Comparison of the predicted and observed secondary structure of T4 phage lysozyme

Data Analysis Software System); v13.5.0.17

Comparison of different approaches to define the applicability domain of QSAR models

Current status of methods for defining the applicability domain of (quantitative) structure-activity relationships. The report and recommendations of ECVAM Workshop 52

Combining Ensemble Learning with a Fragment-Based Topological Approach to Generate New Molecular Diversity in Drug Discovery: In Silico Design of Hsp90 Inhibitors

The authors would like to acknowledge all the experimental scientists working in the field of antituberculosis research, whose data made the creation of our mtc-QSAR-EL model possible.

The authors declare no conflict of interest.