key: cord-278456-gsv6dh36
authors: Qureshi, Abid; Kaur, Gazaldeep; Kumar, Manoj
title: AVCpred: an integrated web server for prediction and design of antiviral compounds
date: 2016-09-09
journal: Chem Biol Drug Des
DOI: 10.1111/cbdd.12834
sha: 
doc_id: 278456
cord_uid: gsv6dh36

Viral infections constantly jeopardize global public health due to the lack of effective antiviral therapeutics. Therefore, there is an imperative need to speed up the drug discovery process to identify novel and efficient drug candidates. In this study, we have developed quantitative structure–activity relationship (QSAR)‐based models for predicting antiviral compounds (AVCs) against deadly viruses like human immunodeficiency virus (HIV), hepatitis C virus (HCV), hepatitis B virus (HBV), human herpesvirus (HHV) and 26 others using publicly available experimental data from the ChEMBL bioactivity database. Support vector machine (SVM) models achieved a maximum Pearson correlation coefficient of 0.72, 0.74, 0.66, 0.68, and 0.71 in regression mode and a maximum Matthews correlation coefficient of 0.91, 0.93, 0.70, 0.89, and 0.71, respectively, in classification mode during 10‐fold cross‐validation. Furthermore, similar performance was observed on the independent validation sets. We have integrated these models in the AVCpred web server, freely available at http://crdd.osdd.net/servers/avcpred. In addition, the datasets are provided in a searchable format. We hope this web server will assist researchers in the identification of potential antiviral agents. It would also save time and cost by prioritizing new drugs against viruses before their synthesis and experimental testing.

Antiviral compounds (AVCs) inhibit the development of viruses in the host cell and are relatively harmless to the host. [1] They can be natural, for example, antivirals found in turmeric [2] and eucalyptus oil, [3] or synthetic, for example, zidovudine (a nucleoside analog) [4] and Tamiflu (neuraminidase inhibitor). [5] Many compounds and drugs have also been tested and found to be useful in restricting the growth of certain viruses. [6, 7] Scientists are endeavoring to broaden the range of antivirals to other families of viruses. [8] However, designing safe and effective antiviral drugs is a difficult task due to the high genetic diversity and consequent drug resistance in viruses. [9] Initially, antivirals were discovered using traditional trial-and-error methods. [10] However, it was a very lengthy process for discovering effective antivirals. [10, 11] Later, research on virology helped to identify many target pathways to block viral multiplication. [12, 13] Scientists are now using rational drug design strategies for developing antivirals that target the viruses at different stages of their life cycles. [14] During the past decade, many new drugs have been successfully identified in controlling the viral replication in host cells, for example, maraviroc (inhibits human immunodeficiency virus or HIV entry), pleconaril (inhibits picornavirus uncoating), acyclovir (inhibits herpesvirus replication), and oseltamivir (inhibits influenza release). [9, 15] To save time and money for discovering a new drug, researchers have widely used various computational methods to screen virtual libraries of compounds before the synthesis and animal testing of chemicals. Among the different approaches, quantitative structure-activity relationship (QSAR) is mostly used. [16] [17] [18] In this approach, relationships connecting molecular descriptors and activity are used to predict the property of other molecules. 
[19] Molecular descriptors transform the chemical information (structure and linking of groups) of a molecule into simple numbers. [20] QSAR-based virtual screening is an effective computational technique leading toward identification and design of novel antiviral agents. [21] Lately, many dedicated bioinformatic resources have been developed for antivirals. For example, in the area of RNA interference, published resources include VIRsiRNAdb, an antiviral siRNA resource covering about 42 disease-causing viruses, [22] HIVsirDB, an anti-HIV siRNA database, [23] VIRsiRNApred, an antiviral siRNA inhibition efficacy predictor, [24] and VIRmiRNA, a database of virus-encoded miRNAs including antiviral miRNAs. [25] Similarly, for peptide-based antivirals, a few web servers have been created, such as AVPdb, a collection of antiviral peptides targeting more than 60 medically important viruses, [26] HIPdb, an HIV-inhibiting peptide repository, [27] and AVPpred, a predictor of the antiviral activity of peptides. [28] Many general repositories also provide information on antiviral molecules, for example, ChEMBL, [29] PubChem, a database of molecules and their activities, [30] ZINC, a database of commercial compounds for virtual screening, [31] and DrugBank, a knowledgebase for drugs and drug targets. [32] In addition, there are a few QSAR studies targeting specific viral proteins. [33] [34] [35] [36] [37] [38] [39] [40] [41] However, to date there is no web server or software that can predict, in regression mode, the percentage inhibition value of a compound against different human viruses on a single platform.

To meet this need, we developed AVCpred, a web server for prediction and design of antiviral compounds. In this method, we used previously known AVCs against HIV, hepatitis C virus (HCV), hepatitis B virus (HBV), human herpesvirus (HHV) and 26 other viruses with experimentally validated percentage inhibition from ChEMBL, a large-scale bioactivity database for drug discovery. [29] This was followed by descriptor calculation and selection of the best performing molecular descriptors. The latter were then used as input for a support vector machine (in regression mode) to develop QSAR models for different viruses as well as a general model for other viruses. We have integrated these models in the AVCpred web server, which will be helpful for virtual screening of AVCs and designing new compounds to target the viruses.

In this study, we have used different datasets of AVCs having experimentally verified percent inhibition values against HIV, HCV, HHV, and HBV, and a general dataset having AVCs against 26 human viruses. The data were obtained from the ChEMBL resource (https://www.ebi.ac.uk/chembl/). The desired data were fetched using the target browser (taxonomy tree) as well as target search with keywords such as HIV, HCV, HBV, HHV, virus, viral, and viruses. Initially, among the AVCs, the majority of data belonged to HIV (1383 compounds), HCV (803 compounds), HHV (473 compounds), HBV (416 compounds), and other viruses (1635 compounds). After filtering entries with the desired information and removing redundant entries, we were left with 389 compounds for HIV, 467 for HCV, 124 for HHV, 112 for HBV, and 1391 AVCs targeting the 26 viruses (Table 1 and Table S1). These datasets were used for descriptor selection and model development. The datasets are available along with references on the web server and can be downloaded from this URL: http://crdd.osdd.net/servers/avcpred/datasets.php.

To develop virus-specific as well as general QSAR models, we computed about 18000 chemical descriptors (1D, 2D, and 3D), including geometric, constitutional, electrostatic, topological, and hydrophobic descriptors and binary fingerprints, using PaDEL, an open-source software to calculate molecular descriptors and fingerprints. [42]

Table 1. Creation of datasets for the development of prediction models. Column headings: S. no.; data filter (footnote a): percent inhibition [1], reference [2], non-redundant.

Footnote a: The general dataset comprises the following viruses, with the number of unique AVCs in parentheses: dengue virus 1 (1), dengue virus 2 (16), enterovirus (30), human adenovirus 5 (41), human cox B1 (4), human cox B5 (21), human echovirus 13 (3), human echovirus 9 (2), human enterovirus 71 (19), human enterovirus C (1), human polio virus 1 (4), human rhinovirus (1), human rhinovirus 14 (29), human rhinovirus 1B (18), human rhinovirus 2 (2), human T lymphotropic virus (42), influenza A (36), influenza A (H1N1) (16), influenza B (1), monkeypox virus (1), respiratory syncytial virus (4), Rift Valley fever virus (Cercopithecidae) (1), sandfly fever Sicilian virus (2), SARS coronavirus (23), simian virus 40 (45), Sindbis virus (4), vaccinia virus (12), vaccinia virus WR (22), variola virus (1), vesicular stomatitis virus (63), West Nile virus (17), yellow fever virus (51).

To improve the speed of calculation, we selected the most essential descriptors using the 'RemoveUseless' filter followed by ClassifierSubsetEval (attribute evaluator) with the BestFirst (search method) module available in the Weka package. [43] ClassifierSubsetEval evaluates attribute subsets on training/testing data, using a classifier to estimate the merit of a set of attributes. [44, 45] The selected descriptors were then used to develop the QSAR models (Table S3).
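The two-stage idea (drop uninformative descriptors, then search for a subset that a learner scores well) can be illustrated with a minimal, standard-library Python sketch. This is not the Weka implementation: dropping constant columns is only a rough analogue of part of 'RemoveUseless', a plain greedy forward search without backtracking stands in for BestFirst, and a leave-one-out 1-nearest-neighbour error stands in for the learner wrapped by ClassifierSubsetEval. All function names are hypothetical.

```python
def remove_constant_columns(X):
    """Return indices of descriptor columns that take more than one value."""
    n_cols = len(X[0])
    return [j for j in range(n_cols) if len({row[j] for row in X}) > 1]


def loo_1nn_error(X, y, cols):
    """Leave-one-out mean absolute error of a 1-nearest-neighbour
    regressor that only looks at the descriptor columns in `cols`."""
    total = 0.0
    for i, xi in enumerate(X):
        best_d, best_y = None, None
        for k, xk in enumerate(X):
            if k == i:
                continue  # hold out the compound being predicted
            d = sum((xi[j] - xk[j]) ** 2 for j in cols)
            if best_d is None or d < best_d:
                best_d, best_y = d, y[k]
        total += abs(best_y - y[i])
    return total / len(X)


def greedy_forward_selection(X, y, candidates, score=loo_1nn_error):
    """Add one descriptor at a time while the wrapped error keeps falling.

    Returns (selected column indices, best error reached)."""
    selected, best = [], float("inf")
    while True:
        best_col = None
        for j in candidates:
            if j in selected:
                continue
            err = score(X, y, selected + [j])
            if err < best:
                best, best_col = err, j
        if best_col is None:  # no remaining descriptor improves the score
            return selected, best
        selected.append(best_col)
```

In practice the scoring function would be replaced by the cross-validated error of the model that is finally trained, so that the subset is tuned to the learner that will use it.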

We developed individual QSAR models for each of the four viruses (HIV, HCV, HHV, and HBV) as well as a general model comprising 26 different viruses using the SMOreg algorithm [46] in the Weka machine learning software, [43] freely available at http://www.cs.waikato.ac.nz/ml/weka. SMOreg implements the support vector machine in regression mode. In SMOreg, the Pearson VII function-based universal kernel (Puk) and the RegSMOImproved optimizer were used along with parameters such as (i) the regularization constant/complexity value (c), which sets the trade-off between training error and margin, (ii) the omega exponent value (ω), which controls the tailing factor of the peak, and (iii) the sigma bandwidth value (σ), which controls the peak half-width. [47, 48] Simultaneously, the SVMlight software (freely available at http://svmlight.joachims.org) was employed for machine learning in classification mode. In SVMlight, the radial basis function (RBF) kernel was used with the parameters (i) gamma (g), which defines how far the influence of a single training example reaches, and (ii) the complexity constant (c), which sets the trade-off between training error and margin. [49] Selected molecular descriptors and fingerprints were used as input features for the development of the QSAR models.
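To make the roles of the kernel parameters concrete, the two kernel functions can be written out directly. The following standard-library Python sketch uses the usual closed forms of the Pearson VII universal kernel and the RBF kernel; it is an illustration only and does not reproduce the Weka or SVMlight code, and the function names and default values are assumptions. In this form the Puk value falls to 0.5 at a distance of σ/2 whatever the value of ω, while ω changes how quickly the tails decay.

```python
import math


def squared_distance(x, y):
    """Squared Euclidean distance between two descriptor vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def puk_kernel(x, y, omega=1.0, sigma=1.0):
    """Pearson VII universal kernel (Puk).

    K(x, y) = 1 / [1 + (2 * ||x - y|| * sqrt(2**(1/omega) - 1) / sigma)**2]**omega
    """
    d = math.sqrt(squared_distance(x, y))
    scaled = 2.0 * d * math.sqrt(2.0 ** (1.0 / omega) - 1.0) / sigma
    return 1.0 / (1.0 + scaled ** 2) ** omega


def rbf_kernel(x, y, gamma=1.0):
    """Radial basis function kernel: K(x, y) = exp(-gamma * ||x - y||**2)."""
    return math.exp(-gamma * squared_distance(x, y))
```

Both kernels return 1 for identical vectors and decay toward 0 as the vectors move apart; larger σ (Puk) or smaller g (RBF) lets a single training compound influence predictions over a wider region of descriptor space.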

In order to evaluate the performance of our models, we employed a number of statistical parameters, including Pearson's correlation coefficient, coefficient of determination, mean absolute error, root-mean-square error, sensitivity, specificity, accuracy, and Matthews correlation coefficient, as briefly described below. Pearson's correlation coefficient (R) is a measure of correlation between two variables.
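In its usual computational form, Pearson's correlation coefficient between predicted and actual efficacies can be written as below. This is the standard textbook expression, given as a reference sketch in the notation of this section; the equation number is an assumption based on the later reference to Eqs. (4-7).

```latex
\[
R = \frac{n\sum_{i=1}^{n} E_i^{\mathrm{pred}} E_i^{\mathrm{act}}
          - \sum_{i=1}^{n} E_i^{\mathrm{pred}} \sum_{i=1}^{n} E_i^{\mathrm{act}}}
         {\sqrt{n\sum_{i=1}^{n} \left(E_i^{\mathrm{pred}}\right)^{2}
                - \left(\sum_{i=1}^{n} E_i^{\mathrm{pred}}\right)^{2}}\;
          \sqrt{n\sum_{i=1}^{n} \left(E_i^{\mathrm{act}}\right)^{2}
                - \left(\sum_{i=1}^{n} E_i^{\mathrm{act}}\right)^{2}}}
\tag{1}
\]
```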

where n is the size of the test set, and $E_i^{\mathrm{pred}}$ and $E_i^{\mathrm{act}}$ are the predicted and actual efficacies of the AVCs, respectively. A value of 1 denotes total positive correlation, 0 no correlation, and −1 total negative correlation.

The coefficient of determination (R²) indicates how well data fit a statistical model. An R² of 1 indicates that the model perfectly fits the data, while an R² of 0 means that the model does not fit the data at all.

The mean absolute error (MAE) measure indicates how close the predictions are to the eventual outcomes.
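With n test compounds, the standard definition of the MAE in the notation of this section is as follows (a reference sketch; the equation number is an assumption based on the later reference to Eqs. (4-7)):

```latex
\[
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}
               \left| E_i^{\mathrm{pred}} - E_i^{\mathrm{act}} \right|
\tag{2}
\]
```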

where $E_i^{\mathrm{pred}}$ is the prediction and $E_i^{\mathrm{act}}$ the true value.

MAEs are negatively oriented scores; that is, lower values are better.

The root-mean-square error (RMSE) measures the average magnitude of the error.
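In the same notation, the standard definition of the RMSE is (again a reference sketch, with an assumed equation number):

```latex
\[
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}
                \left( E_i^{\mathrm{pred}} - E_i^{\mathrm{act}} \right)^{2}}
\tag{3}
\]
```

Because the errors are squared before averaging, the RMSE penalizes large individual errors more heavily than the MAE does.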

RMSEs are also negatively oriented scores; that is, lower values are better. Sensitivity (Sn) or the true positive rate measures the percentage of correctly identified positives.

An ideal predictor would be 100% sensitive. Specificity (Sp) or the true negative rate measures the percentage of correctly identified negatives. An ideal predictor would be 100% specific. Accuracy (Ac) is the percentage of correct results (i.e. both true positives and true negatives) among the total number of cases.

An ideal predictor would be 100% accurate. The Matthews correlation coefficient (MCC) is used in machine learning to evaluate the performance of binary classifications.
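The four classification measures have the following standard definitions in terms of the confusion-matrix counts, with Sn, Sp, and Ac expressed as percentages. These are the textbook forms, given as a reference sketch and numbered to match the reference to Eqs. (4-7):

```latex
\begin{align}
\mathrm{Sn} &= \frac{TP}{TP + FN} \times 100 \tag{4} \\
\mathrm{Sp} &= \frac{TN}{TN + FP} \times 100 \tag{5} \\
\mathrm{Ac} &= \frac{TP + TN}{TP + FP + TN + FN} \times 100 \tag{6} \\
\mathrm{MCC} &= \frac{TP \times TN - FP \times FN}
                    {\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{7}
\end{align}
```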


In Eqs. (4-7) above, TP, FP, TN, and FN represent the true positives, false positives, true negatives, and false negatives, respectively.

The MCC ranges from −1 to 1, and a value close to 1 means a better prediction.
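The evaluation measures described in this section can be computed with a few lines of code. The following standard-library Python sketch uses the textbook definitions; the function names are hypothetical, and returning an MCC of 0 when its denominator is zero is a common convention assumed here rather than something stated in the study.

```python
import math


def pearson_r(pred, act):
    """Pearson's correlation coefficient between predicted and actual values."""
    n = len(pred)
    sum_p, sum_a = sum(pred), sum(act)
    sum_pa = sum(p * a for p, a in zip(pred, act))
    sum_pp = sum(p * p for p in pred)
    sum_aa = sum(a * a for a in act)
    numerator = n * sum_pa - sum_p * sum_a
    denominator = (math.sqrt(n * sum_pp - sum_p ** 2)
                   * math.sqrt(n * sum_aa - sum_a ** 2))
    return numerator / denominator


def mae(pred, act):
    """Mean absolute error."""
    return sum(abs(p - a) for p, a in zip(pred, act)) / len(pred)


def rmse(pred, act):
    """Root-mean-square error."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, act)) / len(pred))


def classification_metrics(tp, fp, tn, fn):
    """Return (Sn %, Sp %, Ac %, MCC) from confusion-matrix counts."""
    sn = 100.0 * tp / (tp + fn)
    sp = 100.0 * tn / (tn + fp)
    ac = 100.0 * (tp + tn) / (tp + fp + tn + fn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: MCC is reported as 0 when it is undefined.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return sn, sp, ac, mcc
```

For regression models, `pearson_r`, `mae`, and `rmse` would be applied to the predicted and experimental percent inhibition values; for classification models, the confusion-matrix counts come from comparing predicted and actual class labels.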

In order to identify the most effective features or descriptors of antiviral drugs, we computed the correlation between selected chemical features of antiviral drugs and their percent inhibition using comprehensive pharmacological screening datasets from ChEMBL [29] (Figure 1).

After attribute selection, the relevant descriptors were 45 for HIV, 52 for HCV, 15 for HBV, 20 for HHV, and 65 for the rest of the viruses. A combination of selected chemical descriptors like partial charge, atom-type electrotopological state, extended topochemical atom, chi cluster, weighted path, and fingerprints based on substructure, graph, path, and extended features including PubChem and Klekota-Roth were found to be useful in prediction. The selected descriptors were then used to develop the QSAR models (Table S3) (Table 2). Other statistical parameters used in the development of QSAR models are depicted in Table S2. A scatter plot between actual and predicted efficacy in each case is shown in Figure 2.

In addition, we also checked the performance of our models developed using the classification mode of machine learning. Receiver operating characteristic (ROC) plots illustrating the performance of the QSAR models are shown in Figure 3.

The QSAR models have been integrated into a freely available and easy to use web server, 'AVCpred', where users can predict the antiviral potential of their query molecules against the different viruses in terms of percent inhibition value. AVCpred web server includes the following modules:

This allows users to submit one or more molecules at a time.

Users have to choose the viruses on which they want to test their query chemical compounds. On submission, the server returns percent inhibition values against the selected viruses. Users can also view different properties of the query molecule, such as structure, charge, molecular weight, logP value, hydrogen bond and Lipinski donors/acceptors, and rigid and rotatable bonds, to identify drug-like molecular structures (Figure 4).

Abbreviations: Puk, Pearson VII function-based universal kernel; RegSMOImproved, optimizer for algorithm speed improvement; c, regularization constant/complexity parameter that sets the trade-off between training error and margin; ω, omega exponent value (controls the tailing factor of the peak); σ, sigma bandwidth value (controls the half-width of the peak); RBF, radial basis function; g, parameter gamma in the RBF kernel.

It has been found that analogs of known chemical compounds are sometimes more effective than the parent molecule. [50] In order to identify potent analogs of an existing AVC, we have included the 'Design analogs' tool, where users can design analogs based on given building blocks and predict their inhibition of the viruses.

Using the 'Draw tool', one can sketch the structure of the query molecule using the Marvin editor (Figure 5). This tool also gives the predicted percent inhibition values against the different viruses. In addition, one can view the various properties of the query structure.

AVCpred also provides the users a search tool to browse the compounds used in our datasets. In this module, different compounds targeting the viruses are stored in a database. The records can be readily searched, filtered/sorted, and downloaded via the web interface.

AVCpred has been developed using the open-source LAMP (Linux-Apache-MySQL-PHP) system. The prediction software runs in a Red Hat Enterprise Linux 5 environment using the Apache httpd server.

To inhibit viral growth, antiviral molecules or drugs target different phases of the viral life cycle, such as fusion, integration, replication, and maturation, and should be relatively non-toxic to the host organism. [51, 52] Each stage can be targeted using AVCs that can, for example, inhibit entry receptors (CD4, CCR5) or viral enzymes (protease, neuraminidase). [53] [54] [55]

Figure 4. AVCpred submission form with output.


Various AVCs are currently in medical use, and new ones are in clinical trials. [56, 57] Finding new and improved viral inhibitors is a major concern in the treatment of deadly human viruses. [58, 59] However, discovery of novel AVCs is a tedious process. [60] To speed up the identification of new AVCs, a computational approach using the QSAR method is a rational strategy to reduce cost and time in the wet laboratory. [20] QSAR techniques have been widely used in drug design and in the identification of lead molecules. [17] Although there are many QSAR studies pertaining to different types of viral protein inhibitors, they are very specific in their approach and each deals with a particular class of inhibitors, such as endonuclease inhibitors [33] (40 compounds, correlation of 0.76), thiourea derivatives [34] (85 compounds, correlation of 0.92), protease inhibitors [39] (170 compounds, correlation of 0.60-0.83), and flavonoid inhibitors [38] (20 compounds, correlation of 0.75-0.97) (Table 5). In most cases, the studies were carried out on a limited number of inhibitors. For this reason, they predict inhibitors similar to the training compound type with a high correlation but do not work on dissimilar inhibitors of the same target virus. To address these limitations, the AVCpred models have been developed using a large and diverse set of inhibitors. In the current algorithm, we have employed antiviral compound datasets from different studies, because of which the overall correlation is lower than in the above studies, yet the models are comparatively more robust in predicting different classes of inhibitors. However, as new high-throughput screening data on antiviral drugs tested under homogeneous conditions become available, the performance of the QSAR method can be improved.

In this study, we developed virus-specific as well as general prediction models to identify the likelihood of a compound being antiviral using selected chemical attributes of experimentally validated AVCs. PaDEL, an open-source software, was used to calculate molecular descriptors and fingerprints. However, the software calculates a large number of descriptors, and hence, we used an attribute selection approach to reduce their number by eliminating unrelated and extraneous descriptors to get a highly correlated descriptor set. Our analysis revealed that several chemical descriptors are important in predicting the compound inhibition activity, for example, partial charge, atom-type electrotopological state, extended topochemical atom, chi cluster, weighted path, and fingerprints.

We employed machine learning to train the QSAR models on different sets of experimentally validated data. These models were validated on independent datasets, not used during training, and were found to have satisfactory performance. We used the pharmacological data from the ChEMBL resource for training/testing the models developed for general as well as specific viruses. These models were integrated in an open-source web server for evaluation and screening of antiviral compounds.

Table 5 (row fragment): 9; thymidine kinase; N2-phenylguanine inhibitors; 20; 0.85-0.98; HSV; No; 2000; [41]

Figure 5. Web interface of the 'AVCpred Draw' tool.

The applicability domain of the QSAR models was demonstrated using Williams plots (Figure 6), in which standardized residuals are plotted against leverages. [61] If the standardized residual of a compound exceeds three standard deviation units (±3σ), the compound is treated as an outlier. The warning value of leverage (h*) is taken as 3p/n, where p is the number of model descriptors plus one and n is the number of training compounds. [62, 63] If the leverage of a compound exceeds h*, it is regarded as lying outside the applicability domain. The plots demonstrate that the leverages of the majority of the compounds do not surpass the critical value (h*) in the regression models, and hence, the compounds are within the chemical domain, implying that the predictivity of the models is reliable. The web server also provides useful services like designing analogs based on given building blocks and drawing structures to sketch novel compounds and predict their inhibition potential against multiple viruses. We hope the AVCpred algorithm will assist researchers in discovering novel antiviral compounds as well as in virtually checking the effect of modifications on existing drugs.
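The two decision rules behind a Williams plot (the warning leverage h* = 3p/n and the ±3 limit on standardized residuals) can be sketched in a few lines of Python. This is an illustration only, not the code used to produce Figure 6; the function names are hypothetical, and the leverages and standardized residuals are assumed to have been computed beforehand from the fitted regression model.

```python
def warning_leverage(n_descriptors, n_training):
    """Warning leverage h* = 3p/n, with p = number of descriptors + 1."""
    p = n_descriptors + 1
    return 3.0 * p / n_training


def williams_flags(leverages, std_residuals, n_descriptors, n_training,
                   residual_limit=3.0):
    """Flag each compound as a response outlier and/or a high-leverage point.

    Returns a list of (is_outlier, is_high_leverage) tuples, one per compound.
    """
    h_star = warning_leverage(n_descriptors, n_training)
    return [(abs(r) > residual_limit, h > h_star)
            for h, r in zip(leverages, std_residuals)]
```

A compound flagged on neither criterion lies inside the applicability domain; one that exceeds h* is structurally unusual relative to the training set, so its prediction should be treated with caution even if its residual is small.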

AVCpred is the first web-based algorithm for prediction of AVCs based on experimentally validated datasets. Five prediction models pertaining to HIV, HCV, HHV, HBV, and a general one were implemented in the web server to make comprehensive predictions. In addition, tools for drug design, virtual screening, and collection of existing AVCs have also been integrated. This web server would be helpful for researchers working for the development of antiviral therapeutics.

The authors are thankful to the Council of Scientific and Industrial Research (CSIR) (GENESIS-BSC0121), the Department of Biotechnology (GAP001), and the CSIR-Institute of Microbial Technology for providing infrastructure and financial support.

The authors declare that they have no competing interests.