key: cord-0775251-la0im97e
authors: Zheng, Shuyu; Aldahdooh, Jehad; Shadbahr, Tolou; Wang, Yinyin; Aldahdooh, Dalal; Bao, Jie; Wang, Wenyu; Tang, Jing
title: DrugComb update: a more comprehensive drug sensitivity data repository and analysis portal
date: 2021-06-01
journal: Nucleic Acids Res
DOI: 10.1093/nar/gkab438
sha: 7b76f93388acf355932ff9e73c9a058eac30987b
doc_id: 775251
cord_uid: la0im97e

Combinatorial therapies that target multiple pathways have shown great promise for treating complex diseases. DrugComb (https://drugcomb.org/) is a web-based portal for the deposition and analysis of drug combination screening datasets. Since its first release, DrugComb has received continuous updates on the coverage of data resources, as well as on the functionality of the web server to improve the analysis, visualization and interpretation of drug combination screens. Here, we report significant updates of DrugComb, including: (i) manual curation and harmonization of more comprehensive drug combination and monotherapy screening data, not only for cancers but also for other diseases such as malaria and COVID-19; (ii) enhanced algorithms for assessing the sensitivity and synergy of drug combinations; (iii) network modelling tools to visualize the mechanisms of action of drugs or drug combinations for a given cancer sample and (iv) state-of-the-art machine learning models to predict drug combination sensitivity and synergy. These improvements are delivered with a more user-friendly graphical interface and a faster database infrastructure, which make DrugComb the most comprehensive web-based resource for the study of drug sensitivities for multiple diseases.

Despite the scientific advances in the understanding of complex diseases such as cancer, there remains a major gap between the vast knowledge of molecular biology and effective treatments.
Next-generation sequencing has revealed intrinsic heterogeneity across cancer samples, which partly explains why patients respond differently to the same therapy (1). For patients who lack common oncogenic drivers, multi-targeted drug combinations are urgently needed to block the emergence of drug resistance and thereby achieve sustainable efficacy (2). To facilitate the discovery of drug combination therapies, high-throughput drug screening techniques have been developed that allow large numbers of drug combinations to be tested in vitro for their sensitivity (percentage inhibition of cell growth) and synergy (degree of interaction) (3). Furthermore, patient-derived cancer cell cultures and xenograft models have been developed, which bring drug discovery closer to actual patients (4-6). Despite the increasing amount of drug sensitivity screening data, the challenge of translating them into actual drug discovery remains: recent studies have shown that most clinically approved drug combinations work independently (7), and that the efficacy and synergy observed in a pre-clinical setting may not translate into a clinical trial (8, 9). The challenge of utilizing the results from drug combination screens stems largely from unharmonized metrics for synergy and sensitivity, which are derived from different mathematical models and are often incompatible on the same datasets (10). Another limitation is the lack of standardization of drug combination experimental design and the insufficient level of data curation and deposition to publicly available databases (11). Furthermore, drug combination data have not been harmonized with single drug screening data, partly due to a lack of computational tools that enable a systematic comparison of drug combination efficacy against single drug efficacy (12).
To initiate the efforts for curating drug combination datasets, and to facilitate a community-driven standardization of how the synergy and sensitivity of drug combinations are evaluated, we provided DrugComb as the very first data portal to harbour manually curated datasets together with a web server to analyse them (13). The original version of DrugComb consisted of four major high-throughput studies, which served as a reference dataset for developing machine learning algorithms to predict drug combination sensitivity and synergy (14). Unlike other recent databases, including DrugCombDB (15) and SYNERGxDB (16), DrugComb is a unique resource in that it combines a database with a web application, not only for depositing deeply curated public datasets but also for the analysis and annotation of user-uploaded data. Furthermore, DrugComb provides detailed visualization of drug combination sensitivity and synergy, which greatly facilitates the understanding of drug interactions at specific dose levels. The data from DrugComb have been used to develop machine learning models for drug combination prediction (17, 18) and a synthetic lethality knowledge graph (14). The analysis tools provided by DrugComb have also helped to explore the mechanism of the replication stress response in colorectal cancer stem cells (19). With the development of high-throughput screening techniques, the number of data points for drug combinations has greatly increased. For example, the recent DREAM Challenge on drug combination prediction provided more than 20 000 drug combinations in cancer cell lines (20). Furthermore, drug combination screening has been extended to other disease models such as malaria and Ebola (21). More recently, drug combination screening studies on COVID-19 have been conducted, providing important clues for the treatment of the ongoing pandemic (22).
In the new version of DrugComb, we aim to expand our manual curation from cancer to other diseases to improve the data coverage. In addition, drug combinations need to be harmonized with the monotherapy drug screening data, since both treatment options are evaluated using the same endpoint metrics (such as progression-free survival and overall survival) in clinical trials. Therefore, we aim to harmonize drug combination screening with monotherapy drug screening by providing informatics tools to evaluate their overall sensitivity in a more systematic manner. For this reason, in the new version of DrugComb we do not limit ourselves to curating drug combination data, but also include monotherapy drug sensitivity screening data. More importantly, we provide a robust metric that enables a direct comparison of drug combinations and single drugs, as monotherapy drug screening can be considered a subset of drug combination experiments. The new data harmonization framework thus allows a more systematic evaluation of a drug combination in comparison to a single drug. In addition, we implement several new modules for the analysis of these datasets, including the integration of drug targets and gene expression of neighbouring proteins in a signalling network, such that the mechanisms of action of a drug or a drug combination can be annotated systematically in a specific cellular context. We also provide a baseline model based on CatBoost to predict the sensitivity and synergy of drug combinations, with which the machine learning community may develop novel algorithms to improve our understanding of drug responses in cancer cells. Taken together, the new version of DrugComb features an enhanced web portal that makes drug screening data more interpretable and reusable for various applications such as machine learning, network modelling and experimental validation.
The DrugComb portal consists of two major components: a database harbouring the most recent drug screening datasets, and a web server to analyse and visualize the sensitivity and synergy in these datasets or in user-uploaded datasets. To search the database, users can query by drug names, cell line names or study names. To analyse user-uploaded datasets with the web server, users need to import the data according to the format of an example file, and the results are shown as both tables and images, which are also downloadable. When users plan a drug combination experiment, they may use the web server to predict the sensitivity and synergy and use this information to guide the selection of drugs. The drug targets as well as the gene expression of the signalling pathways for a given cancer cell line can also be annotated as a network model. In the following, we describe how we have improved the coverage of the database as well as the data analysis modules of the web server with a range of algorithms, and the new implementation techniques to accelerate data curation and harmonization (Figure 1).

The initial version of DrugComb consisted of four drug combination screening studies, covering 437 923 drug combination experiments. We have since curated many more drug combination experiments for cancer cell lines. Furthermore, we have incorporated monotherapy drug screening datasets and treated them as a special case of a drug combination experiment in which the other drug is absent. We have also included the drug screening results from patient-derived cancer samples in haematological malignancies (5). In addition to multiple cancer types, we have extended the curation efforts to other diseases such as Ebola, malaria and COVID-19.
Figure 1. A schematic overview of the DrugComb database and web server pipeline. Drug combination and monotherapy drug screening datasets are curated from public databases, publications or user-upload. After quality control and pre-processing, the cell information is retrieved from Cellosaurus (23), while the drug information is retrieved from multiple databases including PubChem (24), ChEMBL (25), UniChem (26), DrugBank (27), KEGG (28) and DrugTargetCommons (29). The degree of synergy in drug combinations, as well as the sensitivity of drug combinations and single drugs, are determined using the SynergyFinder R package (3). For inferring the mechanisms of action of drugs or drug combinations, their targets as well as interacting proteins are visualized in a signalling network, retrieved from STITCH (30) and UniProt (31). Furthermore, the gene expression of these proteins in the given cancer cells is obtained from DepMap (32) and Cell Model Passports (33), and from BeatAML where the cancer samples were derived from AML (Acute Myeloid Leukaemia) patients (5). Machine learning algorithms utilize chemical structural and gene expression features to predict drug combination synergy and sensitivity. The DrugComb portal enables the query and download of curated raw datasets and analysis results, as well as the contribution of new datasets.

The manual curation is subject to strict quality control: only studies that report the raw dose-response results are considered, and studies that report only summary-level results such as IC50, AUC (area under the dose-response curve) or synergy scores (e.g. combination index) are excluded. We utilized SynergyFinder (3) to determine the synergy scores directly from the raw dose-response data and compared them with those reported in the original publications. Only datasets for which the two sets of scores have a correlation higher than 0.6 are included.
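The correlation-based inclusion rule can be illustrated with a minimal sketch. It is not the DrugComb curation code: the text does not state which correlation coefficient was used, so Pearson correlation is an assumption here, and the function names `pearson` and `passes_synergy_check` as well as the example scores are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (assumed metric)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def passes_synergy_check(recomputed, reported, threshold=0.6):
    """True if recomputed synergy scores agree with the published ones."""
    return pearson(recomputed, reported) > threshold


# Made-up scores for illustration only (not taken from any study)
recomputed_scores = [1.0, 2.0, 3.0, 4.0]
reported_scores = [1.1, 1.9, 3.2, 3.8]
include_dataset = passes_synergy_check(recomputed_scores, reported_scores)
```

A dataset whose recomputed scores track the published ones would be kept, while one with weak or inverted agreement would be set aside for review.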
Furthermore, dose-response matrices containing abnormal response values, for example percentage inhibition of cell growth less than −200% or larger than 200%, were marked as poor-quality data points, for which the data analysis results are not shown in the web interface. We have also standardized the metadata about the experimental protocols of these studies so that their differences can be evaluated more systematically. The annotation of the bioassay protocols is based on BAO (BioAssay Ontology) (34), which is commonly adopted by major chemical biology databases including ChEMBL (25), PubChem (24) and DrugTargetCommons (29). For the drugs and cell lines, we provide cross-database references such that their pharmacological and clinical information can be easily accessed (Figure 2A). Table S1 shows the summary of the data points from the individual studies that are curated and harmonized in DrugComb.

DrugComb utilizes the SynergyFinder R package to analyse drug combination sensitivity and synergy. The single drug sensitivity is characterized as a dose-response curve with its IC50 and RI (relative inhibition) values. RI is the normalized area under the log10-transformed dose-response curve, which has shown enhanced robustness in characterizing drug sensitivity (35). Moreover, RI can be interpreted as a percentage inhibition, summarizing the overall drug inhibition effect relative to positive controls. With the RI metric, drug responses measured over different concentration ranges can be compared, in contrast to IC50 or EC50, which are usually relative terms that depend on the tested concentration ranges. For drug combination sensitivity, we provide a metric called CSS (Combination Sensitivity Score), which is based on the normalized area under the log10-transformed combination dose-response curve when one of the two drugs is fixed at its IC50 concentration (12).
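As a rough illustration of the idea behind RI, the sketch below integrates percentage inhibition over log10 concentration with the trapezoidal rule and divides by the width of the tested log10 range, so the result stays on the percentage-inhibition scale. This is a simplification under stated assumptions: it works on the measured points directly and does not reproduce the curve fitting or any baseline handling in SynergyFinder, and the function name `relative_inhibition` and the example values are hypothetical.

```python
import math


def relative_inhibition(concentrations, inhibitions):
    """Normalized area under the log10-dose vs. % inhibition curve.

    Simplified sketch: trapezoidal integration over the measured points,
    divided by the width of the tested log10 concentration range.
    """
    if len(concentrations) != len(inhibitions) or len(concentrations) < 2:
        raise ValueError("need at least two matched dose-response points")
    if any(c <= 0 for c in concentrations):
        raise ValueError("concentrations must be positive to take log10")
    # Pair each log10 dose with its response and order by dose
    points = sorted(zip((math.log10(c) for c in concentrations), inhibitions))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0  # trapezoid between neighbours
    width = points[-1][0] - points[0][0]
    if width == 0:
        raise ValueError("need at least two distinct concentrations")
    return area / width


# Hypothetical single-drug curve: doses in arbitrary units, responses in %
ri_example = relative_inhibition([0.01, 0.1, 1, 10], [0, 20, 60, 100])
```

In the same spirit, applying such a function to the combination curve obtained when one drug is held at a fixed concentration would give a CSS-like value on the same scale, which is what makes the two quantities comparable.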
CSS and RI use the same principle to characterize the overall drug response efficacy, such that their values can be directly compared (Figure 2C). For evaluating drug synergy, we implement four major mathematical models, namely Bliss, Loewe, HSA and ZIP (36), and provide the visualization of these scores in the dose-response matrices. Furthermore, we provide a synergy score called the S score, which is derived from the difference between the CSS of the combination and the RI scores of the single drugs (12). Drug combinations with synergy scores of zero are considered additive, while a positive synergy score suggests synergy and a negative score suggests antagonism. The five synergy scores are based on different mathematical assumptions, so they do not necessarily agree with each other (Figure 2D). For example, the Bliss model assumes probabilistic independence when drugs are non-interactive, while the Loewe model assumes that the efficacy of a non-synergistic drug combination is identical to that of a drug combined with itself. The ZIP model, on the other hand, can be considered an ensemble model, as it combines the assumptions of Bliss and Loewe (36). In actual clinical trials, approval of a drug combination is often based on the HSA model, which simply requires that the drug combination improves patient survival compared to monotherapies. To ensure the clinical translation of drug combinations, we encourage the use of all the major synergy scoring metrics, such that the top hits that pass the thresholds of all of them can be prioritized (37). On the other hand, focusing solely on synergy introduces a bias, as the sensitivity of a drug combination may be understudied. A drug combination may well produce strong synergy while its overall efficacy does not reach therapeutic relevance.
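To make the difference between reference models concrete, the sketch below computes the expected combination response under the Bliss and HSA models for a single dose pair, and the excess of the observed response over that expectation. The Bliss expectation (y1 + y2 - y1*y2 on a fractional scale) and the HSA expectation (the larger of the two single-drug effects) are the standard textbook forms; Loewe and ZIP require fitted dose-response curves and are omitted. The function names and input values are hypothetical, and this is not the SynergyFinder implementation.

```python
def bliss_expected(inh_a, inh_b):
    """Expected % inhibition if the two drugs act independently (Bliss).

    Inputs are single-drug % inhibitions, assumed to lie within 0-100.
    """
    frac_a, frac_b = inh_a / 100.0, inh_b / 100.0
    return 100.0 * (frac_a + frac_b - frac_a * frac_b)


def hsa_expected(inh_a, inh_b):
    """Expected % inhibition under HSA: the better of the two single drugs."""
    return max(inh_a, inh_b)


def excess(observed, expected):
    """Positive values suggest synergy, negative values antagonism."""
    return observed - expected


# Hypothetical dose pair: drug A alone 30%, drug B alone 40%, combination 50%
observed = 50.0
bliss_excess = excess(observed, bliss_expected(30.0, 40.0))  # about -8
hsa_excess = excess(observed, hsa_expected(30.0, 40.0))      # 10
```

In this made-up case the same observation is scored as mildly antagonistic under Bliss and synergistic under HSA, which illustrates why hits that pass the thresholds of several models are more convincing than hits supported by one model only.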
Therefore, we provide an SS (Synergy-Sensitivity) plot to ensure that both scores are given equal weight when interpreting the relevance of a drug combination (Figure 2E, Supplementary Figure S1). As a unique feature of DrugComb, we visualize the synergy scores of a drug combination at each tested dose. This so-called synergy landscape provides a rich information display that facilitates the interpretation of the data, allowing the most synergistic and the most antagonistic doses to be identified separately (Figure 2F). For a given drug or a given cell line, we provide boxplots and histograms showing the distributions of the synergy and sensitivity scores, such that users may assess the general trend. For example, users may evaluate whether drug combinations involving a particular drug tend to be more synergistic, or whether a cell line tends to be more sensitive to drug treatment. Note that the majority of the data points (93.2%) that we curated from the literature do not contain replicates. We therefore decided not to provide the statistical significance of the synergy and sensitivity over a dose-response matrix, as the significance of individual doses contributing to the overall synergy cannot be systematically assessed. We would like to highlight this lack of replicates in a typical drug combination screen, as it is likely to hinder the translation of the results into clinical trials.

Once a drug combination experiment has been conducted and its results have been analysed with sensitivity and synergy scoring, the next question concerns the mechanisms of action of the drug combination. Network modelling of drug combinations has recently been introduced as an efficient approach for the interpretation of drug combinations, as well as for the identification of predictive biomarkers from molecular profiles of cancer (39-42).
In DrugComb, the drugs are annotated with their target profiles, and these profiles are further mapped onto the signalling networks of cancer cells, such that their first and second neighbour proteins can also be retrieved. We use databases including ChEMBL, PubChem and DrugTargetCommons for the primary and secondary targets, and STITCH for the signalling networks. Furthermore, we have incorporated the transcriptomics profiles of the cancer cell lines into the network, such that the gene expression values can also be displayed (Figure 3A). In addition, we provide the correlation between gene expression and drug sensitivity, such that neighbouring genes whose expression is highly correlated with drug sensitivity can be identified as potential biomarkers (Figure 3B). For user-uploaded drug combinations or single drugs, the InChIKeys of the drugs should ideally be provided. This allows the web server to query the drug STITCH IDs from the major drug databases. If only the drug names are provided, the web server queries the major drug databases by name, and the resulting target profiles are visualized in a generic cancer signalling network. If the cell line names can be matched with the existing gene expression data, the gene expression values are displayed as coloured nodes. The network modelling results should be interpreted together with the actual drug screening profiles, such that drug resistance or sensitivity can be related to the expression of the targets or neighbouring genes (Figure 3B).

Building on the large volume of drug combination data curated in DrugComb, we provide state-of-the-art machine learning algorithms to predict the sensitivity and synergy of a user-selected drug combination in a given cancer cell line. We use the ONEIL data (43) to train a CatBoost model, which has been considered a reference algorithm for many machine learning tasks (44).
The ONEIL data consists of 583 drug combinations involving 38 drugs tested in 39 cell lines, resulting in 92 208 drug combination experiments comprising 2 305 200 data points. The ONEIL data has been considered a high-quality dataset, as it contains multiple replicates and has been utilized in previous machine learning development (45-47). The CatBoost model is based on decision trees, which facilitate the integration of different types of features including textual, categorical and numerical values. To build our model, the names of drugs and cell lines are specified as categories in our feature vectors. Additionally, the drug concentrations are considered both as numeric values and as categories. The cell line's gene expression and the compound's structural fingerprints (MACCS) are considered as numerical values. Moreover, to accelerate the training process we consider only the top 5% most variant genes (n = 153) across the 39 cell lines (Figure 4). Among all the CatBoost hyperparameters, only four show high importance for obtaining the best model: the number of iterations, which determines the number of trees used in the model; the maximum depth of the trees; the learning rate used for the gradient steps; and the L2 regularization of the loss function. These four hyperparameters are tuned to their best values, and the rest are left at their default values. Separate models have been trained for drug combination inhibition and for each synergy score, and the validation accuracy is presented in Table 1. To obtain a prediction, users need only specify the names and the maximal concentrations of the two drugs, and a cell line name. After receiving the user input, the MACCS fingerprints of the drugs are obtained with the RCDKlibs package in R, and the cell line gene expression data are retrieved internally from DrugComb.
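The gene-filtering step used to speed up training can be sketched as follows: compute the variance of each gene across cell lines and keep the top fraction. The data layout (a dictionary mapping a gene name to its expression values across cell lines), the function name `top_variable_genes`, the use of the population variance and the rounding rule are all assumptions made for illustration; the text states only that the top 5% most variant genes (n = 153) were retained.

```python
import statistics


def top_variable_genes(expression, fraction=0.05):
    """Return gene names with the highest variance across cell lines.

    expression: dict mapping gene name -> list of expression values,
                one value per cell line (hypothetical layout).
    fraction:   share of genes to keep, e.g. 0.05 for the top 5%.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    variances = {gene: statistics.pvariance(values)
                 for gene, values in expression.items()}
    # Keep at least one gene; truncate rather than round (an assumption)
    n_keep = max(1, int(len(variances) * fraction))
    ranked = sorted(variances, key=variances.get, reverse=True)
    return ranked[:n_keep]


# Toy expression table with made-up values for three cell lines
toy_expression = {
    "G1": [1.0, 1.0, 1.0],
    "G2": [0.0, 5.0, 10.0],
    "G3": [2.0, 3.0, 4.0],
    "G4": [0.0, 100.0, 200.0],
}
selected = top_variable_genes(toy_expression, fraction=0.5)
```

The selected expression columns would then be passed to the model as numerical features alongside the fingerprint bits, with drug and cell line names kept as categorical features.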
The pre-processed data are then loaded into the trained models to predict the inhibition values and synergy scores for a 10×10 equally spaced dose matrix within the given maximal concentrations.

To facilitate data curation, we have provided a web server for users to upload their drug combination data into the database. The 'Contribute' panel asks for the annotation information of the drug combination screening results, after which the actual data points are entered in a tabular format. We have used the contribution module to curate the majority of the literature datasets and found that it greatly reduces the burden on data contributors as well as data curators. For example, autofill functions are available when users input the literature citation and drug names. Cell line annotation is also available by querying the Cellosaurus website for the disease classification and other cross-reference links. Furthermore, data contributors are guided to provide critical information about assay protocols, such as detection technologies and culture time. When the data have been successfully uploaded, we first manually check the format, completeness and validity of the uploaded information, and then integrate them into the database via the data analysis and annotation functions (Figure 5A). In addition to the actual data points as an outcome of such a data curation effort, we can also systematically evaluate the differences in the assay protocols (Figure 5B), which may provide more insight into the reproducibility of drug sensitivity screens (48). Taken together, we believe that data contribution may greatly facilitate open access to drug screening data, and we therefore encourage the users of DrugComb to become part of the community-driven data curation team in the future.
DrugComb is built using PHP 7.4.14 [Laravel Framework 6.20.7] for server-side data processing, Javascript EC-

The data portal has been designed in a straightforward manner to maximize the flexibility for users to retrieve the existing datasets as well as to analyse their own datasets. We provide API access at http://api.drugcomb.org such that users can request data as JSON files. The API is implemented using the PHP Laravel framework. Instructions for each of the modules are provided on their associated web pages, and an overview of the data portal is summarized in a tutorial video available on the home page. We aim to continue accommodating new features such as cloud-based computing and data infrastructure to facilitate the FAIRness (Findable, Accessible, Interoperable and Reusable) of drug screening data analysis. Meanwhile, community-based features such as data contribution and quality control can be developed further.

Making cancer treatment more effective is what a combination therapy aims to achieve. With the advances in high-throughput drug screening technologies, an increasing number of drug combinations have been tested. However, before we can develop robust machine learning and network modelling algorithms to predict and understand potential drug combinations, the datasets need to be systematically curated and harmonized. Here we report the major updates of DrugComb, a comprehensive data portal for the drug discovery community to access current high-throughput drug combination as well as monotherapy drug screening datasets. These datasets have been deeply curated, standardized and harmonized with data analysis tools including synergy and sensitivity scoring, such that their potential can be maximized within a unified framework.
Furthermore, we have updated the network modelling of drug combinations, such that the transcriptomics profiles of the cancer cell line and the drug target profiles can be integrated in a signalling network, where the protein-protein interactions may provide deeper insights into the mechanisms of drugs and drug combinations. In addition, we have provided a machine learning model to predict the response to a given drug combination in a cell line at the single dose level. To the best of our knowledge, this is the first drug combination prediction tool that has been made available online with easy accessibility for drug discovery users. The four basic modules of DrugComb, i.e. (i) data curation, (ii) synergy and sensitivity scoring, (iii) network modelling and (iv) machine learning, constitute a workflow of network pharmacological approaches from which we may gain a deeper understanding of drug-drug interactions.

Currently, DrugComb focuses on small molecule drugs such as cytotoxic drugs and kinase inhibitors, while immunotherapy and gene therapy drugs are largely missing. Future steps of DrugComb will involve continued improvement of the data coverage, for example by including drugs from other classes. Moreover, we will include higher-order combinations that involve more than two drugs (e.g. (21)). In addition, we will consider datasets from more recent techniques such as microfluidic-based drug screening (49), as well as from patient-derived samples such as 3D organoid-based drug screening (50) and patient-derived xenograft mouse models (51). These datasets may help identify drug combinations that are more translatable to the clinic than those from cell line-based studies (9). Meanwhile, the data analysis tools will also be updated to incorporate the new data types. For example, we will develop mathematical and statistical methods for analysing and visualizing higher-order drug combinations.
Taken together, we envisage that the high-quality data in DrugComb will serve as a benchmark for the development of more robust and predictive machine learning models, for example to improve transfer learning from one study to another or to an under-studied tissue (18), as well as accurate network-based models that capture the mechanisms of drug combinations and may eventually lead to predictive biomarkers that warrant patient stratification for maximizing the efficacy of combinatorial therapies.

The synergy and sensitivity scores in DrugComb are freely available for download. Larger batch downloads of raw data are permitted by contacting the authors. The AstraZeneca drug combination datasets are proprietary, and a separate agreement is needed, available at https://openinnovation.astrazeneca.com/. The visualization results for sensitivity, synergy and network models are downloadable as images. The source code for analysing the drug combination datasets is available as the R package SynergyFinder version 2.2.4 (https://bioconductor.org/packages/release/bioc/html/synergyfinder.html). We are committed to open data and welcome any researchers to participate in the development of data curation and harmonization tools for drug discovery.

Supplementary Data are available at NAR Online.

1. Pan-cancer analysis of whole genomes
2. On the design of combination cancer therapy
3. Methods for high-throughput drug combination screening and synergy scoring
4. Survey of ex vivo drug combination effects in chronic lymphocytic leukemia reveals synergistic drug effects and genetic dependencies
5. Functional genomic landscape of acute myeloid leukaemia
6. A proof of concept for biomarker-guided targeted therapy against ovarian cancer based on patient-derived tumor xenografts
7. A curative combination cancer therapy achieves high fractional cell killing through low cross-resistance and drug additivity. eLife
8. You cannot have your synergy and efficacy too
9. Combination cancer therapy can confer benefit via patient-to-patient variability without drug additivity or synergy
10. Applying synergy metrics to combination screening data: agreements, disagreements and pitfalls
11. Charting the fragmented landscape of drug synergy
12. Drug combination sensitivity scoring facilitates the discovery of synergistic and efficacious drug combinations in cancer
13. DrugComb: an integrative cancer drug combination data portal
14. The tumor therapy landscape of synthetic lethality
15. DrugCombDB: a comprehensive database of drug combinations toward the discovery of combinatorial therapy
16. SYNERGxDB: an integrative pharmacogenomic portal to identify synergistic drug combinations for precision oncology
17. The Aurora kinase/β-catenin axis contributes to dexamethasone resistance in leukemia
18. Anticancer drug synergy prediction in understudied tissues using transfer learning
19. Control of replication stress and mitosis in colorectal cancer stem cells through the interplay of PARP1, MRE11 and RAD51
20. Community assessment to advance computational prediction of cancer drug combinations in a pharmacogenomic screen
21. Modulation of triple artemisinin-based combination therapy pharmacodynamics by Plasmodium falciparum genotype
22. Synergistic and Antagonistic Drug Combinations against SARS-CoV-2
23. The Cellosaurus, a cell-line knowledge resource
24. PubChem in 2021: new data content and improved web interfaces
25. ChEMBL: towards direct deposition of bioassay data
26. UniChem: extension of InChI-based compound mapping to salt, connectivity and stereochemistry layers
27. DrugBank 5.0: a major update to the DrugBank database
28. KEGG: integrating viruses and cellular organisms
29. Drug Target Commons: a community effort to build a consensus knowledge base for drug-target interactions
30. STITCH 5: augmenting protein-chemical interaction networks with tissue and affinity data
31. UniProt: a worldwide hub of protein knowledge
32. Defining a cancer dependency map
33. Cell Model Passports-a hub for clinical, genetic and functional datasets of preclinical cancer models
34. BioAssay Ontology (BAO): a semantic description of bioassays and high-throughput screening results
35. A community challenge for pancancer drug mechanism of action inference from perturbational profile data
36. Searching for drug synergy in complex dose-response landscapes using an interaction potency model
37. What is synergy? The Saariselkä agreement revisited
38. Sorafenib and vorinostat kill colon cancer cells by CD95-dependent and -independent mechanisms
39. Network-based prediction of drug combinations
40. Network-based drug repurposing for novel coronavirus 2019-nCoV/SARS-CoV-2
41. Network pharmacology modeling identifies synergistic Aurora B and ZAK interaction in triple-negative breast cancer
42. TranSynergy: mechanism-driven interpretable deep neural network for the synergistic prediction and pathway deconvolution of drug combinations
43. An unbiased oncology compound screen to identify novel combination strategies
44. CatBoost for big data: an interdisciplinary review
45. DeepSynergy: predicting anti-cancer drug synergy with Deep Learning
46. Computationally predicting clinical drug combination efficacy with cancer cell line screens and independent drug action
47. Leveraging multi-way interactions for systematic prediction of pre-clinical drug combination effects
48. Reproducible pharmacogenomic profiling of cancer cell line panels
49. A microfluidics platform for combinatorial drug screening on cancer biopsies
50. Development of a miniaturized 3D organoid culture platform for ultra-high-throughput screening
51. High-throughput screening using patient-derived tumor xenografts to predict clinical trial drug response

We thank the authors of the drug combination studies for sharing their datasets, especially AstraZeneca for agreeing to make the DREAM Challenge data part of DrugComb. We also thank the DepMap consortium and the Cell Model Passports for making the transcriptomics profiles of cancer cell lines freely available. We thank the NCATS and other institutions for making their drug screening datasets easily accessible. The data portal is located at the CSC-IT Center for Science in Finland.