key: cord-0650274-9v3mapmu
authors: Samart, Kewalin; Tuyishime, Phoebe; Krishnan, Arjun; Ravi, Janani
title: Reconciling Multiple Connectivity Scores for Drug Repurposing
date: 2020-09-19
journal: nan
DOI: nan
sha: 9f5b692412092e52ceecc7eb6d05d0d307d73207
doc_id: 650274
cord_uid: 9v3mapmu

The basis of several recent methods for drug repurposing is the key principle that an efficacious drug will reverse the disease molecular 'signature' with minimal side-effects. This principle was defined and popularized by the influential 'connectivity map' study in 2006 regarding reversal relationships between disease- and drug-induced gene expression profiles, quantified by a disease-drug 'connectivity score.' Over the past 14 years, several studies have proposed variations in calculating connectivity scores towards improving accuracy and robustness in light of massive growth in reference drug profiles. However, these variations have been formulated inconsistently using varied notations and terminologies even though they are based on a common set of conceptual and statistical ideas. Therefore, we present a systematic reconciliation of multiple disease-drug connectivity scores by defining them using consistent notation and terminology. In addition to providing clarity and deeper insights, this coherent definition of connectivity scores and their relationships provides a unified scheme that newer methods can adopt, enabling the computational drug-development community to compare and benchmark different approaches easily. To facilitate the continuous and transparent integration of newer methods, this review will be available as a live document at https://jravilab.github.io/connectivity_score_review coupled with a GitHub repository https://github.com/jravilab/connectivity_score_review that any researcher can build on and push changes to.
The manifestation of a disease or perturbation by a small molecule in a tissue leaves a characteristic imprint (a "signature") in its gene expression profile [1]. These signatures, recorded for thousands of diseases and drugs, form the basis of a powerful and widely adopted method for drug repurposing called "drug-disease connectivity analysis" [2]. In this analysis, novel drug indications for a specific disease of interest are identified based on the extent to which the ranked drug-gene profile is a "reversal" of the disease gene signature ([3, 4]; Fig. 1). Connectivity-based drug repurposing has been used to discover drugs in various cancers and non-cancer diseases [5].

Figure 1. A. $R$ is the drug gene-expression profile containing the rank-ordered list of genes going from the most significantly up-regulated gene to the most significantly down-regulated gene. $D$ is the gene set for the disease of interest, with $D_{up}$ containing the set of up-regulated genes and $D_{down}$ containing the set of down-regulated genes. B. Connectivity. Positions of $D_{up}$ and $D_{down}$ disease genes in the ranked drug list, $R$, determine the signs and magnitudes of the enrichment scores ($ES$; $ES_{up}$, $ES_{down}$). Positive connectivity is defined as the case when the disease signature and drug profile show similar perturbations, i.e., when $ES_{up}$ is positive and/or when $ES_{down}$ is negative. This happens when $D_{up}$ predominantly appears towards the top of the drug profile or when $D_{down}$ predominantly appears towards the bottom of the drug profile (scenarios 1 and 4). Conversely, negative connectivity is defined as the case when the disease signature and drug profile show dissimilar perturbations, i.e., when $ES_{up}$ is negative and/or when $ES_{down}$ is positive. This happens when $D_{up}$ predominantly appears towards the bottom of the drug profile or when $D_{down}$ predominantly appears towards the top of the drug profile (scenarios 2 and 3). Negative connectivity indicates drug reversal of the disease signature.
From its inception in 2006, the exact method for connectivity analysis has evolved, with a series of proposed modifications over the past decade and a half (Fig. 2A). The first method for connectivity analysis [6] builds on the seminal paper by Subramanian et al., 2005 [7] that proposed the Gene Set Enrichment Analysis (GSEA) method. GSEA uses a modified Kolmogorov-Smirnov statistic [8], referred to as the "enrichment statistic" ($ES$), to evaluate if genes in a certain pathway appear towards the top or bottom of a gene (differential) expression profile. Lamb et al., 2006 [6] built a reference database (CMap, which we refer to as CMap 1.0 in this review) with gene expression profiles for thousands of small molecules and proposed the first method for connectivity analysis based on GSEA. This method compares a query signature (disease) to each of the ranked drug-gene expression profiles in their reference database and ranks all the drugs based on their connectivity scores. A connectivity score ranges between -1 (indicating a complete 'drug-disease' reversal) and +1 (indicating perfect 'drug-disease' similarity). Another study adapted this connectivity score calculation and used it to find compounds in the L1000 LINCS collection [9] that could be repurposed for three cancer types [10]. This study quantified the reversal relationship between the drug and disease by computing the proposed Reverse Gene Expression Score ($RGES$). Finally, CMap 1.0 itself was further updated by expanding the LINCS L1000 collection to more than 1.3 million profiles [11] (referred to as CMap 2.0 in this review). Along with the expansion of data, the CMap 2.0 study also proposed another variation of the connectivity score, called the weighted connectivity score, that uses GSEA's weighted Kolmogorov-Smirnov enrichment statistic, along with ways to normalize the resulting scores and to correct them further to account for background associations.

Figure 2. A taxonomy of connectivity scores. A. Relationship between connectivity scores. The main formulations discussed here are the GSEA enrichment score ($ES$) [7], the CMap 1.0 connectivity score ($CS$) [6], $RGES$ and $sRGES$ [10], and the CMap 2.0 weighted connectivity score ($WTCS$), normalized connectivity score ($NCS$), and tau score ($\tau$) [11]. B. Detailed definitions of connectivity scores in A.

Connectivity scores and methodologies have been evaluated in the past to assess their performance in predicting drug-drug relationships or drug-disease relationships. The performance of CMap 1.0 was evaluated in predicting drug-drug relationships using the Anatomical Therapeutic Chemical classification [12, 13], and in predicting drug-disease relationships [14]. Furthermore, a recent review [15] assessed advances that have been made in CMap 1.0 and computational tools that have been applied in the drug repurposing and discovery fields. Lin et al., 2019 [16] further evaluated connectivity approaches that use L1000 data, including six different scores that are used to predict drug-drug relationships.

All these proposed variations of the connectivity score share a common set of conceptual and statistical ideas. Yet, they have been formulated inconsistently using varied notations and terminologies in the original papers and in the aforementioned evaluation studies. This lack of consistency in notation makes it difficult to understand the subtle differences between the scores and the intuition underlying each of them. For example, the Reverse Gene Expression Score [10] directly builds on the CMap 1.0 connectivity score [6]. Similarly, the weighted connectivity score in [11] is a bi-directional weighted version of the enrichment score used in GSEA [7]; in this case, the two are named and notated quite differently even though they are essentially direct, simple variants of each other. In this review, we develop a systematic scheme that defines the aforementioned methodologies using consistent notations and terms.
Additionally, we provide summary tables throughout the article to relate our consistent scheme to the previously published ones.

We begin by creating a standardized set of notations and terms to denote the various concepts and quantities required to define the different connectivity scores. A connectivity score between a disease and a drug is computed by comparing the genes up-regulated ($D_{up}$) and down-regulated ($D_{down}$) by the disease (compared to a healthy control) to a ranked list of genes ($R$) ordered based on their differential expression in response to a drug. A good connectivity score is usually a strongly negative value since it is designed to indicate a reversal relationship between the disease and the drug. Such a score is usually achieved when genes in $D_{up}$ appear at the bottom of $R$ and/or when genes in $D_{down}$ appear at the top of $R$. When there is no relationship, or when $D_{up}$ appears at the top and/or $D_{down}$ appears at the bottom of $R$ (i.e., similarity between the disease and drug signatures), the drug is unlikely to be efficacious in treating that disease. These scenarios are depicted in Figure 1, and the general notations, which we use throughout this work, are presented in Table 1 and Figure 2. Building on these general notations and terms, in the rest of this review we present a systematic scheme that defines four formulations of the drug-disease connectivity score. For each, we give a detailed formulation in consistent notation and a summary table that will enable researchers to relate our scheme back to the notations and terminology used in the original publications.

Table 1. General notations.
$D$: disease gene set (i.e., query) (Fig. 1). Without any loss of generality, only the subset of disease genes that are also part of $R$ is considered throughout (i.e., $D \subseteq R$).
number of genes in

All connectivity scores described here begin with the calculation of some form of an Enrichment Score ($ES$) that captures the relationship between a drug and a disease.
The basis of all these formulations is Gene Set Enrichment Analysis (GSEA) [7], which was originally developed to assess the enrichment (over-representation) of predefined biological gene sets (e.g., pathways, targets of a regulator, etc.) at the top or bottom of a list of genes ranked by their extent of differential expression in response to an experimental factor of interest. Enriched gene sets are then hypothesized to be biologically relevant to that experimental factor. When adapted to the question of drug repurposing, a method like GSEA can be used to assess the enrichment of sets of genes associated with a disease at the top or bottom of a list of genes ranked by their extent of differential expression in response to a drug (Fig. 1).

GSEA is a weighted signed version of the classical Kolmogorov-Smirnov test. It takes two inputs: i) a disease gene set composed of genes significantly perturbed in response to a disease (denoted $D$), and ii) a rank-ordered list ($R$) of drug genes (in decreasing order of a drug-response score $r_i$ for each gene $g_i$). Using these two inputs, GSEA quantifies the level of association between the disease and the drug by calculating an enrichment score ($ES$) based on the following steps, where $n_D$ and $n_R$ are the numbers of genes in $D$ and $R$, and $p$ is the weight of each step:

1. For each position $i$ in the rank-ordered list $R$, from top to bottom,
1.1. if the gene is in $D$, calculate:
$$P_{hit}(D, i) = \sum_{g_j \in D,\; j \le i} \frac{|r_j|^{p}}{N_D}, \qquad N_D = \sum_{g_j \in D} |r_j|^{p}$$
1.2. if the gene is not in $D$, calculate:
$$P_{miss}(D, i) = \sum_{g_j \notin D,\; j \le i} \frac{1}{n_R - n_D}$$
2. Finally, calculate the final enrichment score ($ES$), the maximum positional enrichment score:
$$ES = P_{hit}(D, i^{*}) - P_{miss}(D, i^{*}), \qquad i^{*} = \arg\max_{i} \left| P_{hit}(D, i) - P_{miss}(D, i) \right|$$

Thus, $P_{hit}(D, i)$ and $P_{miss}(D, i)$ are both empirical distribution functions of the positions of the disease genes (i.e., $D$) and the positions of the non-disease genes (i.e., $R - D$), respectively, in the drug gene list $R$.
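The running-sum steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the reference GSEA implementation: the function name `enrichment_score`, its arguments, and the toy gene names are hypothetical, and the ranked list is assumed to contain no duplicate genes.

```python
def enrichment_score(ranked_genes, scores, disease_genes, p=1.0):
    """Signed GSEA-style enrichment score of a disease gene set in a ranked drug list.

    ranked_genes: genes ordered from most up- to most down-regulated by the drug.
    scores: drug-response score of each gene, in the same order.
    disease_genes: disease gene set; genes absent from the ranked list are ignored.
    p: weight exponent (p=0 gives the unweighted, KS-like statistic).
    """
    in_list = set(ranked_genes)
    disease = {g for g in disease_genes if g in in_list}
    n_r, n_d = len(ranked_genes), len(disease)
    if n_d == 0 or n_d == n_r:
        raise ValueError("need at least one disease gene and one non-disease gene")

    # Normaliser so that the weighted hit increments sum to 1.
    norm = sum(abs(s) ** p for g, s in zip(ranked_genes, scores) if g in disease)
    if norm == 0:
        raise ValueError("all disease-gene scores are zero; use p=0")
    miss_step = 1.0 / (n_r - n_d)

    p_hit = p_miss = 0.0
    es = 0.0
    for gene, score in zip(ranked_genes, scores):
        if gene in disease:
            p_hit += abs(score) ** p / norm
        else:
            p_miss += miss_step
        deviation = p_hit - p_miss
        if abs(deviation) > abs(es):  # keep the largest deviation from zero
            es = deviation
    return es


# Toy drug profile (hypothetical genes and scores).
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
response = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
es_up = enrichment_score(ranked, response, {"g5", "g6"}, p=0)    # disease up-genes at the bottom
es_down = enrichment_score(ranked, response, {"g1", "g2"}, p=0)  # disease down-genes at the top
```

In this toy case the disease up-regulated genes sit at the bottom of the drug list and the down-regulated genes at the top, so the two scores take opposite signs in the direction expected for a reversal.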
When $p = 0$, $ES$ (the signed maximum distance between the two functions) therefore reduces to a signed two-sample Kolmogorov-Smirnov (KS) statistic:
$$ES = F_{D}(i^{*}) - F_{R-D}(i^{*}), \qquad i^{*} = \arg\max_{i} \left| F_{D}(i) - F_{R-D}(i) \right|$$
where $\sup_{i} \left| F_{D}(i) - F_{R-D}(i) \right|$ is the classical two-sample KS statistic, with $F_{D}$ and $F_{R-D}$ being the empirical distribution functions of $D$ and $R - D$, respectively, defined as follows:
$$F_{D}(i) = \frac{1}{n_D} \sum_{j \le i} \mathbb{1}[g_j \in D], \qquad F_{R-D}(i) = \frac{1}{n_R - n_D} \sum_{j \le i} \mathbb{1}[g_j \notin D]$$
When $p = 1$, $ES$ becomes a weighted signed two-sample KS statistic with each position in the drug gene list weighted by the drug-response score ($r_i$). Setting $p$ to one is recommended for GSEA. We point the reader to the original GSEA publication for a discussion of the statistic when $p$ is set to less than or greater than one.

• Enrichment score, $ES$, ranges from -1 to +1 (Fig. 3).
• $ES$ is the maximum deviation from zero encountered between the empirical distributions of the disease and non-disease genes in drug gene list $R$.
  - A positive $ES$ indicates disease gene set enrichment towards the top of drug gene list $R$.
  - A negative $ES$ indicates disease gene set enrichment at the bottom of $R$.
• When $D$ is randomly distributed in $R$, the magnitude of $ES$ is small, but if a large proportion of genes in $D$ is concentrated at the top or bottom of $R$, the magnitude of $ES$ is large (Fig. 4).
• When calculated separately for genes up-regulated ($D_{up}$) and down-regulated ($D_{down}$) by the disease, good drug candidates that show a reversal relationship with the disease profile have a negative $ES_{up}$ and a positive $ES_{down}$ (Fig. 4, Table 2).
• Revised notations used in this GSEA section are summarized in Table 2.

Table 2. $p$: the weight of the step in the enrichment score calculation.

The Connectivity Map 1.0 (CMap 1.0) project pioneered the identification of drug candidates based on their ability to reverse disease gene expression profiles [6]. Key to this project was the creation of a large collection of reference gene expression profiles of multiple human cell lines treated with 164 small molecules, including approved drugs. The expression profiles were generated using Affymetrix microarrays.
The original CMap 1.0 study and several others focused on cancer [18], inflammatory bowel disease [3], and skeletal muscle atrophy [19] have used this reference library of drug profiles for drug repurposing. In all these cases, the starting point is a disease "signature" defined by the sets of genes up- and down-regulated in the disease. This signature is compared to each drug profile in the reference library using a GSEA-like analysis that results in an enrichment score ($ES$) for each of the up- and down-regulated disease gene sets separately. The $ES$ captures the level and direction of association of the disease gene set with that drug. Then, the 'up' and 'down' $ES$ are combined into a single connectivity score ($CS$) for the disease with respect to that drug. Finally, for the given disease, drug candidates are identified as those that have strongly negative $CS$.

The drug-disease enrichment score ($ES$) in CMap 1.0 is adapted from GSEA. Instead of using GSEA's signed two-sample KS test formulation that compares the positions of $D$ genes to those of $R - D$ genes, CMap 1.0 uses a KS-like statistic computed from the positions of the $D$ genes alone [6]. With $V(j)$ denoting the position in $R$ of the $j$-th disease gene (ordered by position), $n_D$ the number of genes in the disease gene set, and $n_R$ the number of genes in $R$:
$$a = \max_{j=1}^{n_D} \left[ \frac{j}{n_D} - \frac{V(j)}{n_R} \right], \qquad b = \max_{j=1}^{n_D} \left[ \frac{V(j)}{n_R} - \frac{j-1}{n_D} \right]$$
$$ES = \begin{cases} a, & \text{if } a > b \\ -b, & \text{if } b > a \end{cases}$$
This formulation is used to calculate an $ES_{up}$ and an $ES_{down}$ value for the genes up-regulated ($D_{up}$) and down-regulated ($D_{down}$) by the disease, respectively. These two scores are then used to calculate a raw connectivity score $cs$:
$$cs = \begin{cases} 0, & \text{if } \operatorname{sgn}(ES_{up}) = \operatorname{sgn}(ES_{down}) \\ ES_{up} - ES_{down}, & \text{otherwise} \end{cases}$$
The final connectivity score $CS$ is calculated by normalizing the raw score, dividing it by the maximum or minimum of the raw scores across treatment instances, depending on the sign of $cs$, bringing it back to the range between -1 and +1:
$$CS = \begin{cases} \dfrac{cs}{\max_i(cs_i)}, & \text{if } cs > 0 \\ -\dfrac{cs}{\min_i(cs_i)}, & \text{if } cs < 0 \end{cases}$$

• $ES_{up}$ and $ES_{down}$ represent the association between the up-regulated ($D_{up}$) and down-regulated ($D_{down}$) disease genes in the disease of interest ($D$) and the ranked drug gene list ($R$).
• $CS$ is the connectivity score that combines $ES_{up}$ and $ES_{down}$ per drug treatment and normalizes them across treatments. Similar to $ES$, $CS$ ranges from -1 to +1 (Fig. 3).
• Lower $CS$ indicates a better reversal relationship between the disease and the drug.
• Revised notations used in this CMap 1.0 section are summarized in Table 3.
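The CMap 1.0 calculation can be illustrated with a short Python sketch. It assumes the KS-like statistic and the same-sign zeroing rule as described by Lamb et al. [6]; the function names, the 1-based position convention, and the handling of the tie a = b (returned here as -b) are choices made for this illustration only.

```python
def sign(x):
    """Return -1, 0 or +1 according to the sign of x."""
    return (x > 0) - (x < 0)


def cmap1_enrichment(positions, n_r):
    """KS-like enrichment score from the 1-based positions of disease genes in R.

    positions: ranks of the disease genes in the drug list (1 = most up-regulated).
    n_r: total number of genes in the drug list.
    """
    v = sorted(positions)
    t = len(v)
    a = max((j + 1) / t - v[j] / n_r for j in range(t))
    b = max(v[j] / n_r - j / t for j in range(t))
    return a if a > b else -b  # tie (a == b) is resolved as -b in this sketch


def raw_connectivity(es_up, es_down):
    """Raw score: zero when both enrichment scores share a sign, else their difference."""
    if sign(es_up) == sign(es_down):
        return 0.0
    return es_up - es_down


def normalise_scores(raw_scores):
    """Rescale raw scores across treatment instances to the range [-1, +1]."""
    highest, lowest = max(raw_scores), min(raw_scores)
    normalised = []
    for s in raw_scores:
        if s > 0:
            normalised.append(s / highest)
        elif s < 0:
            normalised.append(-(s / lowest))
        else:
            normalised.append(0.0)
    return normalised


# Toy example: disease up-genes at the bottom and down-genes at the top of a 10-gene list.
es_up = cmap1_enrichment([9, 10], n_r=10)
es_down = cmap1_enrichment([1, 2], n_r=10)
scores = normalise_scores([raw_connectivity(es_up, es_down), 0.0, 0.85])
```

The first treatment in `scores` receives the most negative normalized value, which is the pattern a reversal candidate would show.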
The Connectivity Map project was subsequently expanded into the NIH Library of Integrated Network-based Cellular Signatures (LINCS) program by using a cost-effective gene-expression assay called L1000 [11]. The L1000 platform measures only about 1000 carefully chosen genes, with the rest of the transcriptome estimated by an imputation model trained using publicly available genome-scale expression data [20]. The pilot phase of the LINCS program included data for about 20,000 compounds assayed on about 50 human cell lines across a range of doses, resulting in over one million L1000 profiles.

The focus of the study by Chen et al., 2017 [10] was to use this LINCS data not only to capture expression-based drug-disease reversal relationships but also to evaluate if these reversals correlate with independently measured drug efficacies. Towards this goal, the authors selected compounds with both efficacy data in ChEMBL [21] and gene expression data in LINCS. Using these two datasets, this study showed that the distribution of connectivity scores ($CS$) from CMap 1.0 [6] is enriched at 0 and that these scores do not correlate well with $IC_{50}$ values. To address this issue, the authors proposed a new connectivity score called the Reverse Gene Expression Score ($RGES$). In CMap 1.0, the connectivity score for a drug is set to zero if $ES_{up}$ and $ES_{down}$, the enrichment scores for the up- and down-regulated disease gene sets, have the same sign. $RGES$, on the other hand, is computed as the difference between the absolute values of the two $ES$ values.

• The connectivity score $RGES$ is based on the difference between the absolute values of the $ES$ scores of the up- and down-regulated disease genes, regardless of whether they are enriched at the top or the bottom of the drug gene list $R$.
• Similar to $ES$ and $CS$, $RGES$ ranges from -1 to +1 (Fig. 3).
• $RGES$ is inversely correlated with drug efficacy.
• Revised notations used in this section are summarized in Table 4.
Since the LINCS dataset contains multiple profiles corresponding to the same drug assayed on multiple cell lines, concentrations, and time points, the study also proposed summarizing a drug's $RGES$ values across these various conditions into a single score called the Summarization of Reverse Gene Expression Score ($sRGES$). $sRGES$ is estimated by first setting the condition that corresponds to 10 µM and 24 hours (the most common in the LINCS database) as the 'reference' condition and setting all other conditions as 'target' conditions. Then, for a specific cell line, a drug's $RGES$ in a target condition is assumed to be dependent on the target condition's dose and time relative to the reference condition, quantified using a heuristic "awarding function" ($f$). Target conditions are first divided into four groups according to their dose and time relative to the reference condition, and the value of the function $f$ for each target group (e.g., $dose(i) < 10$ µM and $time(i) < 24$ h) is estimated by averaging the difference in $RGES$ between the target group and the reference group across all the drugs in the reference database that were profiled in the same cell line in both that target condition and the reference condition.

Then, to combine $RGES$ values across cell lines, a weight ($w$) is calculated for each treatment $i$ that reflects how similar that treatment's corresponding cell line, $cell(i)$, is to the disease under study. Here, the correlation between cell line $cell(i)$ and the disease, $cor(cell(i), D)$, is the average of the Spearman correlations between the expression profiles of the cell line and the disease of interest, normalized by the maximum correlation between all cell lines and the disease. Finally, $sRGES$ is defined by combining a drug's $RGES$ values across treatments using the awarding function $f$ and the weights $w$.

This study shows that these new formulations of the connectivity scores, $RGES$ and $sRGES$, correlate with drug $IC_{50}$ values: drugs with strongly negative $RGES$ or $sRGES$ tend to have low $IC_{50}$ values.
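The normalized cell-line-to-disease correlation described above can be sketched as follows. This is only an illustration of that one ingredient: the function names and the toy expression vectors are hypothetical, the Spearman correlation is computed as the Pearson correlation of average ranks, and how the resulting value enters the weight and the final summarized score is not reproduced here.

```python
from statistics import mean


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


def normalised_cell_line_correlation(cell_profiles, disease_profiles):
    """Average Spearman correlation per cell line, divided by the maximum over cell lines.

    cell_profiles: dict mapping cell-line name -> list of expression vectors.
    disease_profiles: list of disease expression vectors over the same genes.
    """
    averages = {
        cell: mean(spearman(c, d) for c in profiles for d in disease_profiles)
        for cell, profiles in cell_profiles.items()
    }
    top = max(averages.values())
    if top <= 0:
        raise ValueError("no cell line is positively correlated with the disease")
    return {cell: value / top for cell, value in averages.items()}


# Toy expression vectors over four genes (hypothetical values).
disease = [[1.0, 2.0, 3.0, 4.0]]
cells = {"A": [[10.0, 20.0, 30.0, 40.0]], "B": [[1.0, 3.0, 2.0, 4.0]]}
similarity = normalised_cell_line_correlation(cells, disease)
```

With these toy vectors, cell line "A" is perfectly rank-correlated with the disease and therefore receives the largest normalized value.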
• The connectivity score $sRGES$ is designed to combine the values of $RGES$, which are based on the difference between the absolute values of the $ES$ scores of the up- and down-regulated disease genes regardless of whether they are enriched at the top or the bottom of the drug gene list $R$.
• Like the preceding scores, $sRGES$ ranges from -1 to +1 (Fig. 3).
• $sRGES$ is inversely correlated with drug efficacy.
• Revised notations used in this sub-section are summarized in Table 4.

CMap 2.0 is a massive expansion of the L1000 dataset to ~1.3 million profiles, which represent 42K genetic and small-molecule perturbagens profiled across multiple cell lines [11]. As part of the release of this data, the study also proposed new connectivity score calculations (Weighted Connectivity Score, Normalized Connectivity Score, and Tau Score). Similar to the other scenarios outlined above, the CMap 2.0 methodology works by comparing the disease gene set ($D$), containing the up-regulated ($D_{up}$) and down-regulated ($D_{down}$) genes, to reference drug profiles in the L1000 database to get a rank-ordered list of all drugs. The ranking is based on a slightly new formulation of the connectivity score, along with new proposals for normalizing the scores across cell lines and drug types and for correcting the resulting normalized score against the background of the entire reference library.

The disease-drug enrichment score ($ES$) in CMap 2.0 is based directly on GSEA's weighted signed two-sample KS statistic that compares the positions of $D$ genes to those of $R - D$ genes with the weight $p$ set to 1. $ES$ is then used to calculate a Weighted Connectivity Score ($WTCS$) that represents a nonparametric disease-drug similarity measure. $WTCS$ is defined as follows:
$$WTCS = \begin{cases} \dfrac{ES_{up} - ES_{down}}{2}, & \text{if } \operatorname{sgn}(ES_{up}) \ne \operatorname{sgn}(ES_{down}) \\ 0, & \text{otherwise} \end{cases}$$

• The disease-drug similarities ($ES_{up}$ and $ES_{down}$) are computed using the two-sided weighted KS statistic.
• $WTCS$ ranges from -1 to +1 (Fig. 3).
• A positive $WTCS$ indicates that $D$ and $R$ are positively related (similar).
• A negative $WTCS$ indicates that $D$ and $R$ are negatively related (dissimilar).
• A zero $WTCS$ indicates that $D$ and $R$ are unrelated.
• Revised notations used in this sub-section are summarized in Table 5.

The Normalized Connectivity Score ($NCS$) was developed to enable the comparison of $WTCS$ across cell lines and drug types. Given the $WTCS$ for a disease in relation to a specific drug of type $t$, tested in cell line $c$, the corresponding $NCS$ is computed by rescaling the $WTCS$, dividing it by the mean value across all the drugs of the same type tested in the same cell line:
$$NCS_{c,t} = \begin{cases} \dfrac{WTCS_{c,t}}{\mu^{+}_{c,t}}, & \text{if } WTCS_{c,t} > 0 \\ \dfrac{WTCS_{c,t}}{\mu^{-}_{c,t}}, & \text{otherwise} \end{cases}$$
where $\mu^{+}_{c,t}$ and $\mu^{-}_{c,t}$ are the means of the magnitudes of the positive and negative $WTCS_{c,t}$ values, respectively. This procedure is identical to that used in the original GSEA for normalizing $ES$ scores to make them comparable across gene sets of different sizes.

Finally, the Normalized Connectivity Score for a disease to a specific drug (i.e., the $NCS$ for a given disease-drug pair) is converted to a tau ($\tau$) score by comparing it to the $NCS$ values of that disease to all the drugs in the reference database (referred to as "touchstone" in CMap 2.0) of the same type $t$ tested in the same cell line $c$, expressed as a signed percentage value between -100 and +100:
$$\tau = \operatorname{sgn}(NCS_{c,t}) \, \frac{100}{N} \sum_{i=1}^{N} \mathbb{1}\left[ \, |NCS_{i}| < |NCS_{c,t}| \, \right]$$
Thus, a $\tau$ of 95 indicates that only 5% of drugs in the reference database of the same type and tested in the same cell line (containing $N$ drugs) showed stronger connectivity to the disease than the drug of interest. Since any disease is queried against the same fixed drug reference database, $\tau$ values are comparable across diseases.

Another way to calculate a $\tau$ score corresponding to the $NCS$ value for a disease-drug pair is to compare it to the $NCS$ values of that specific drug to all the perturbation signatures in a reference database. This comparison yields a $\tau$ that represents the signed percentage of reference signatures that are less connected to the drug than the disease of interest. In other words, based on this comparison, a $\tau$ of 95 indicates that only 5% of signatures in the reference database showed stronger connectivity to the drug than the disease of interest. Similarly, $\tau$ values in this setting are comparable across drugs in the reference database.
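The three CMap 2.0 quantities can be illustrated with a short Python sketch. It is a minimal example under stated assumptions rather than the CMap implementation: the function names and toy values are hypothetical, the weighted score is taken as half the difference of the two enrichment scores when their signs differ (as in the CMap 2.0 publication [11]), and negative scores are divided by the magnitude of the mean negative value so that the sign is preserved.

```python
from statistics import mean


def sign(x):
    """Return -1, 0 or +1 according to the sign of x."""
    return (x > 0) - (x < 0)


def weighted_connectivity(es_up, es_down):
    """Half the difference of the enrichment scores if their signs differ, else zero."""
    if sign(es_up) != sign(es_down):
        return (es_up - es_down) / 2
    return 0.0


def normalised_connectivity(score, reference_scores):
    """Rescale a weighted score by the mean magnitude of same-signed reference scores.

    reference_scores: weighted scores of all drugs of the same type in the same cell line.
    """
    if score > 0:
        return score / mean(x for x in reference_scores if x > 0)
    if score < 0:
        return score / abs(mean(x for x in reference_scores if x < 0))
    return 0.0


def tau_score(query, reference):
    """Signed percentage of reference normalized scores with a smaller magnitude than the query."""
    weaker = sum(1 for x in reference if abs(x) < abs(query))
    return sign(query) * 100.0 * weaker / len(reference)


# Toy values (hypothetical): a drug whose profile reverses the disease signature.
w = weighted_connectivity(es_up=-0.6, es_down=0.4)
n = normalised_connectivity(w, reference_scores=[-0.5, -0.25, 0.5, 1.0])
t = tau_score(-2.0, reference=[-2.0, -1.0, 0.5, 1.5, 3.0])
```

Because the comparison in `tau_score` is on magnitudes, a strongly negative query is ranked against both strongly positive and strongly negative reference values, and the sign is restored at the end.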
• The normalized connectivity score ($NCS$) was developed to enable the comparison of $WTCS$ across cell lines and drug types.
• The tau score ($\tau$) further corrects for non-specific associations by expressing the $NCS$ of a given disease-drug pair in terms of the fraction of signatures/profiles in a reference database that exceed this value.
• Tau ($\tau$) ranges from -100 to +100 (Fig. 3), and a lower negative score reveals a better disease-drug reversal relationship.
• Good tau scores ($\tau$) should range between -95 and -100. A $\tau$ of 95 indicates that only 5% of reference signatures/profiles in the reference database showed stronger connectivity.
• Revised notations used in the $NCS$ and $\tau$ sub-sections are summarized in Table 5.

Table 5. $WTCS$, $WTCS_{c,t}$: weighted connectivity score; also used to refer to a specific instance of the weighted connectivity score of a given cell line and perturbagen type.

In this review, we have reconciled four key formulations of drug-disease connectivity scores by defining them using consistent notation and terminology. This unified scheme will foster long-term adoption and potential collaboration within the growing computational drug-repurposing community. Our coherent definition of connectivity scores and their relationships will allow researchers to better understand the current state of the art, including by expressing all other existing methods using the same notation and terminology. The drug-repurposing community can adopt this consolidated framework to develop, compare, and benchmark new computational drug-repurposing metrics in the context of existing methods. To facilitate the continuous and transparent integration of newer methods, this review is hosted in a GitHub repository (https://github.com/jravilab/connectivity_score_review) that can be edited by the research community to include new methods for connectivity score calculation.
The review document has been written using RMarkdown [22, 23] and distill [24], and rendered as a living document at https://jravilab.github.io/connectivity_score_review.

Perturbational Gene-Expression Signatures for Combinatorial Drug Discovery
Systematic evaluation of drug-disease relationships to identify leads for novel drug uses
Computational repositioning of the anticonvulsant topiramate for inflammatory bowel disease
Discovery and preclinical validation of drug indications using compendia of public gene expression data
Drug repurposing: a promising tool to accelerate the drug discovery process
The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease
Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles
LINCS Canvas Browser: interactive web app to query, browse and interrogate LINCS L1000 gene expression signatures
Reversal of cancer gene expression correlates with drug efficacy and reveals therapeutic targets
A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles
Drug-induced regulation of target expression
Evaluation of analytical methods for connectivity map data
Systematic evaluation of connectivity map for disease indications
A review of connectivity map and computational approaches in pharmacogenomics
A comprehensive evaluation of connectivity methods for L1000 data
Human primary liver cancer-derived organoid cultures for disease modeling and drug screening
Inhibitors Preferentially Target CD15+ Cancer Stem Cell Population in SHH Driven Medulloblastoma
Systems-based Discovery of Tomatidine as a Natural Small Molecule Inhibitor of Skeletal Muscle Atrophy
NCBI GEO: mining tens of millions of expression profiles-database and tools update
ChEMBL: towards direct deposition of bioassay data
Distill for R Markdown

This work was primarily supported by US National Institutes of Health (NIH) grants R35 GM128765 to A.K., MSU Diversity Research Network Launch Awards Program to J.R., MSU College of Natural Science Scholarships to K.S., and, in part, by MSU start-up funds to A.K. and J.R.