key: cord-0860858-oz8eml9h
authors: Huang, Kexin; Fu, Tianfan; Glass, Lucas M; Zitnik, Marinka; Xiao, Cao; Sun, Jimeng
title: DeepPurpose: a deep learning library for drug–target interaction prediction
date: 2020-12-12
journal: Bioinformatics
DOI: 10.1093/bioinformatics/btaa1005
sha: 5df61c89fe413fde5bbded344c7850b02e16bd3a
doc_id: 860858
cord_uid: oz8eml9h

SUMMARY: Accurate prediction of drug–target interactions (DTI) is crucial for drug discovery. Recently, deep learning (DL) models have shown promising performance for DTI prediction. However, these models can be difficult to use for both computer scientists entering the biomedical field and bioinformaticians with limited DL experience. We present DeepPurpose, a comprehensive and easy-to-use DL library for DTI prediction. DeepPurpose supports training of customized DTI prediction models by implementing 15 compound and protein encoders and over 50 neural architectures, along with providing many other useful features. We demonstrate state-of-the-art performance of DeepPurpose on several benchmark datasets. AVAILABILITY AND IMPLEMENTATION: https://github.com/kexinhuang12345/DeepPurpose. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Drug-target interactions (DTI) characterize the binding of compounds to protein targets (Santos et al., 2017). Accurate identification of molecular drug targets is fundamental for drug discovery and development (Rutkowska et al., 2016; Zitnik et al., 2019) and is especially important for finding effective and safe treatments for new pathogens, including SARS-CoV-2 (Velavan and Meyer, 2020).

Deep learning (DL) has advanced traditional computational modeling of compounds by offering an increased expressive power in identifying, processing and extrapolating complex patterns in molecular data (Lee et al., 2019; Öztürk et al., 2018). There are many DL models designed for DTI prediction (Lee et al., 2019; Nguyen et al., 2020; Öztürk et al., 2018). However, to generate predictions, deploy DL models in practice, and test and evaluate model performance, one needs considerable programming skills and extensive biochemical knowledge. Prevailing tools are designed for experienced interdisciplinary researchers. They are challenging to use for both computer scientists entering the biomedical field and domain bioinformaticians with limited experience in training and deploying DL models. Furthermore, each open-sourced tool has a different programming interface and is coded differently, which prevents easy integration of outputs from various methods for model ensembles (Yang et al., 2019).

Here, we introduce DeepPurpose, a DL library for encoding and downstream prediction of proteins and compounds. DeepPurpose allows rapid prototyping via a programming framework that implements over 50 DL models, seven protein encoders and eight compound encoders. Empirically, we find that models implemented in DeepPurpose achieve state-of-the-art prediction performance on DTI benchmark datasets.

DL models for DTI prediction can be formulated as encoder-decoder architectures (Cho et al., 2014). The DeepPurpose library implements a unifying encoder-decoder framework, which makes the library uniquely flexible. By merely specifying an encoder's name, the user can automatically connect an encoder of interest with the relevant decoder. DeepPurpose then trains the corresponding encoder-decoder model in an end-to-end manner. Finally, the user accesses the trained model either programmatically or via a visual interface and uses the model for DTI prediction.
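The name-based wiring described above can be pictured as a registry lookup. The following is a minimal sketch in plain Python, not DeepPurpose's implementation: the encoder names (`char_counts`, `aa_composition`), the toy encoders themselves and the `build_model` helper are hypothetical, and real encoders would be trained neural networks rather than fixed counting functions.

```python
from typing import Callable, Dict, List

Vector = List[float]
Encoder = Callable[[str], Vector]

# Hypothetical toy vocabularies; real encoders learn their representations.
SMILES_CHARS = "CNOScno=#()"
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def char_counts(smiles: str) -> Vector:
    """Toy compound encoder: count a few characters of a SMILES string."""
    return [float(smiles.count(ch)) for ch in SMILES_CHARS]


def aa_composition(sequence: str) -> Vector:
    """Toy protein encoder: fraction of each of the 20 standard amino acids."""
    n = max(len(sequence), 1)
    return [sequence.count(aa) / n for aa in AMINO_ACIDS]


# Registries that map an encoder name to its implementation.
COMPOUND_ENCODERS: Dict[str, Encoder] = {"char_counts": char_counts}
PROTEIN_ENCODERS: Dict[str, Encoder] = {"aa_composition": aa_composition}


def build_model(compound_encoding: str,
                protein_encoding: str) -> Callable[[str, str], Vector]:
    """Look up both encoders by name and return a function that concatenates
    their outputs; a decoder would consume this joint vector."""
    encode_compound = COMPOUND_ENCODERS[compound_encoding]
    encode_protein = PROTEIN_ENCODERS[protein_encoding]

    def embed(smiles: str, sequence: str) -> Vector:
        return encode_compound(smiles) + encode_protein(sequence)

    return embed


embed = build_model("char_counts", "aa_composition")
joint = embed("CCO", "MKV")  # ethanol and a made-up three-residue peptide
print(len(joint))
```

Swapping an encoder in this sketch means changing only the name passed to `build_model`, which is the kind of flexibility the unified framework is meant to provide.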

DeepPurpose takes the compound's simplified molecular-input line-entry system (SMILES) string and protein amino acid sequence pair as input.

© The Author(s) 2020. Published by Oxford University Press.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. 

DeepPurpose feeds the learned protein and compound embeddings into an MLP decoder to generate predictions. Output scores include both continuous binding scores, such as the median inhibitory concentration (IC50), and binary outputs indicating whether a protein binds to a compound. The library detects whether the task is regression or classification and switches to the correct loss function and evaluation metrics. In the case of regression, we use the mean squared error (MSE) as the loss function and MSE, concordance index and Pearson correlation as performance metrics. In the classification case, we use binary cross entropy as the loss function and the area under the receiver operating characteristic curve (AUROC), the area under the precision-recall curve (AUPRC) and the F1 score as performance metrics. At inference, given new proteins and new compounds, DeepPurpose returns prediction scores representing predicted probabilities of binding between compounds and proteins.
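To make the loss functions and some of the metrics named above concrete, here is a minimal sketch in plain Python. It is not the library's code: the `is_classification` heuristic (labels drawn only from {0, 1}) is an assumption about how task detection could work, the example values are made up, and the pairwise concordance index shown is one common definition. AUROC, AUPRC and F1 are omitted for brevity.

```python
import math
from typing import Sequence


def is_classification(labels: Sequence[float]) -> bool:
    """Assumed heuristic: labels drawn only from {0, 1} mean a binary task."""
    return set(labels) <= {0, 1}


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Pearson correlation coefficient."""
    n = len(y_true)
    mean_t, mean_p = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def concordance_index(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Fraction of pairs with different true values that the predictions
    order correctly; tied predictions count as half."""
    concordant, comparable = 0.0, 0
    for i in range(len(y_true)):
        for j in range(i + 1, len(y_true)):
            if y_true[i] == y_true[j]:
                continue  # pairs with equal true values are not comparable
            comparable += 1
            diff_true = y_true[i] - y_true[j]
            diff_pred = y_pred[i] - y_pred[j]
            if diff_pred == 0:
                concordant += 0.5
            elif (diff_true > 0) == (diff_pred > 0):
                concordant += 1.0
    return concordant / comparable


def binary_cross_entropy(y_true: Sequence[float], p_pred: Sequence[float],
                         eps: float = 1e-12) -> float:
    """Mean binary cross entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)


# Made-up affinity values, for illustration only.
affinity_true = [5.0, 6.0, 7.0, 8.0]
affinity_pred = [5.5, 5.8, 7.4, 7.9]
if is_classification(affinity_true):
    print("BCE:", binary_cross_entropy(affinity_true, affinity_pred))
else:
    print("MSE:", mse(affinity_true, affinity_pred))
    print("CI:", concordance_index(affinity_true, affinity_pred))
    print("Pearson:", pearson(affinity_true, affinity_pred))
```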

First, DeepPurpose includes repurposing and virtual_screening functions.

Using only a few lines of code that specify a library of compounds to be screened and an optional training dataset, DeepPurpose trains five DL models, aggregates prediction results and generates a descriptive ranked list in which compound candidates with the highest predicted binding scores are placed at the top. If the user does not specify a training dataset, DeepPurpose uses a pre-trained deep model for prediction. This list can then be examined to identify promising compound candidates for further experiments. Second, DeepPurpose also supports user-friendly programming frameworks for other modeling tasks, including drug and protein property prediction, drug-drug interaction prediction and protein-protein interaction prediction (see Supplementary Material). Third, DeepPurpose provides an interface to many types of data, including a large public binding affinity dataset (Liu et al., 2007), bioassay data (Kim et al., 2019) and a drug repurposing library (Corsello et al., 2017).
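The aggregate-and-rank step can be illustrated with a short sketch. The text does not state how the five models' predictions are combined, so averaging is an assumption here; the `rank_candidates` helper, the compound names and the scores are all hypothetical.

```python
from statistics import mean
from typing import Dict, List, Tuple


def rank_candidates(predictions: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """Average each compound's scores across models (an assumed aggregation
    rule) and sort so the highest predicted binding score comes first."""
    averaged = {name: mean(scores) for name, scores in predictions.items()}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


# Made-up binding scores from five models for three placeholder compounds.
predictions = {
    "compound_A": [6.1, 5.9, 6.3, 6.0, 6.2],
    "compound_B": [7.4, 7.1, 7.6, 7.2, 7.7],
    "compound_C": [5.0, 5.2, 4.9, 5.1, 4.8],
}

ranked = rank_candidates(predictions)
for position, (name, score) in enumerate(ranked, start=1):
    print(f"{position}. {name}: {score:.2f}")
```

Other aggregation rules, such as a median or a weighted vote, would fit the same interface.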

The functionality of DeepPurpose is modularized into six key steps where a single line of code can invoke each step: (i) Load the dataset from a local file or load a DeepPurpose benchmark dataset. (ii) Specify the names of compound and protein encoders. (iii) Split the dataset into training, validation and testing sets using the data_process function, which implements a variety of data-split strategies. (iv) Create a configuration file and specify model parameters. If needed, DeepPurpose can automatically search for optimal values of hyperparameters. (v) Initialize a model using the configuration file. Alternatively, the user can load a pre-trained model or a previously saved model. (vi) Finally, train the model using the train function and monitor the progress of training and performance metrics.

DeepPurpose is OS-agnostic and uses the Jupyter Notebook interface. It can be run in the cloud or locally. All datasets, models, documentation, installation instructions and tutorials are provided at https://github.com/kexinhuang12345/DeepPurpose.
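As an illustration of what a data-split step such as (iii) involves, the sketch below contrasts a random split with a split by compound. It is written in plain Python and does not reproduce the data_process function: the helper names, the 70/10/20 fractions, the "split by compound" strategy and the toy records are all assumptions chosen for the example.

```python
import random
from typing import List, Sequence, Tuple

Record = Tuple[str, str, float]  # (SMILES, protein sequence, label)
Split = Tuple[List[Record], List[Record], List[Record]]


def random_split(records: Sequence[Record],
                 frac: Tuple[float, float, float] = (0.7, 0.1, 0.2),
                 seed: int = 1) -> Split:
    """Shuffle all pairs and cut them into train/validation/test subsets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(frac[0] * len(shuffled))
    n_val = round(frac[1] * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def split_by_compound(records: Sequence[Record],
                      frac: Tuple[float, float, float] = (0.7, 0.1, 0.2),
                      seed: int = 1) -> Split:
    """Assign whole compounds to subsets so that no SMILES string appears in
    more than one subset (tests generalization to unseen compounds)."""
    compounds = sorted({r[0] for r in records})
    random.Random(seed).shuffle(compounds)
    n_train = round(frac[0] * len(compounds))
    n_val = round(frac[1] * len(compounds))
    train_set = set(compounds[:n_train])
    val_set = set(compounds[n_train:n_train + n_val])
    train = [r for r in records if r[0] in train_set]
    val = [r for r in records if r[0] in val_set]
    test = [r for r in records if r[0] not in train_set and r[0] not in val_set]
    return train, val, test


# Toy data: 10 linear alkanes paired with 2 made-up peptide sequences.
records: List[Record] = [
    ("C" * k, seq, float(k))
    for k in range(1, 11)
    for seq in ("MKT", "MAV")
]

train, val, test = random_split(records)
cold_train, cold_val, cold_test = split_by_compound(records)
print(len(train), len(val), len(test))
print(len(cold_train), len(cold_val), len(cold_test))
```

A random split lets the same compound appear in both training and test pairs, whereas the split by compound keeps test compounds entirely unseen, which usually gives a harder and more realistic evaluation.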

To demonstrate the use of DeepPurpose, we compare DeepPurpose with KronRLS (Pahikkala et al., 2015), a popular DTI method, and GraphDTA (Nguyen et al., 2020) and DeepDTA (Öztürk et al., 2018), state-of-the-art DL methods. We find that many DeepPurpose models achieve comparable prediction performance on two benchmark datasets, DAVIS (Davis et al., 2011) and KIBA (He et al., 2017) (Fig. 1D). A complete script to generate the results is provided in Supplementary Material.

Fig. 1. [...] The learned embeddings are then concatenated and fed into a decoder to predict DTI binding affinity. (C) DeepPurpose provides a simple but flexible programming framework that implements over 50 state-of-the-art DL models for DTI prediction. (D) DeepPurpose models achieve comparable performance with three other DTI prediction algorithms on two benchmark datasets. (E) Finally, DeepPurpose has many functionalities, including monitoring the training process, debugging and generation of ranked lists for repurposing and screening. Further, DeepPurpose supports other downstream prediction tasks (e.g. drug-drug interaction prediction, compound property prediction).

In addition to rapid model prototyping, DeepPurpose also provides utility functions to load a pre-trained model and make predictions for new drug and target inputs. This functionality allows domain scientists to examine predictions quickly, modify the inputs based on predictions, and iterate on the process until finding a drug or target with desired properties. We leverage Gradio (Abid et al., 2019) to create a web interface programmatically. We use a user-trained DeepPurpose model in the backend and create a custom web interface in fewer than ten lines of code. This web interface takes the SMILES and amino acid sequence as the input and returns prediction scores with less than 1-second latency. We provide examples in the Supplementary Material.

Conflict of Interest: none declared.

Abid et al. (2019) Gradio: hassle-free sharing and testing of ML models in the wild.

Cho et al. (2014) On the properties of neural machine translation: encoder-decoder approaches.

Corsello et al. (2017) The drug repurposing hub: a next-generation drug library and information resource.

Davis et al. (2011) Comprehensive analysis of kinase inhibitor selectivity.

He et al. (2017) SimBoost: a read-across approach for predicting drug-target binding affinities using gradient boosting machines.

Kim et al. (2019) PubChem 2019 update: improved access to chemical data.

Lee et al. (2019) DeepConv-DTI: prediction of drug-target interactions via deep learning with convolution on protein sequences.

Liu et al. (2007) BindingDB: a web-accessible database of experimentally determined protein-ligand binding affinities.

Nguyen et al. (2020) GraphDTA: predicting drug-target binding affinity with graph neural networks.

Öztürk et al. (2018) DeepDTA: deep drug-target binding affinity prediction.

Pahikkala et al. (2015) Toward more realistic drug-target interaction predictions.

Rutkowska et al. (2016) A modular probe strategy for drug localization, target identification and target occupancy measurement on single cell level.

Santos et al. (2017) A comprehensive map of molecular drug targets.

Velavan and Meyer (2020) The COVID-19 epidemic.

Yang et al. (2019) Analyzing learned molecular representations for property prediction.

Zitnik et al. (2019) Machine learning for integrating data in biology and medicine: principles, practice, and opportunities.