key: cord-0310362-aj5pksfi authors: Thorman, Alexander W.; Reigle, James; Chutipongtanate, Somchai; Shamsaei, Behrouz; Pilarczyk, Marcin; Fazel-Najafabadi, Mehdi; Adamczak, Rafal; Kouril, Michal; Morrow, Ardythe L.; Czyzyk-Krzeska, Maria F.; McCullumsmith, Robert; Seibel, William; Nassar, Nicolas; Zheng, Yi; Hildeman, David; Herr, Andrew B.; Medvedovic, Mario; Meller, Jarek title: Accelerating Drug Discovery and Repurposing by Combining Transcriptional Signature Connectivity with Docking date: 2020-11-26 journal: bioRxiv DOI: 10.1101/2020.11.25.399238 sha: 5516431249465e3d1cc779064a00937a78a7940a doc_id: 310362 cord_uid: aj5pksfi The development of targeted treatment options for precision medicine is hampered by a slow and costly process of drug screening. While small molecule docking simulations are often applied in conjunction with cheminformatic methods to reduce the number of candidate molecules to be tested experimentally, the current approaches suffer from high false positive rates and are computationally expensive. Here, we present a novel in silico approach for drug discovery and repurposing, dubbed connectivity enhanced Structure Activity Relationship (ceSAR) that improves on current methods by combining docking and virtual screening approaches with pharmacogenomics and transcriptional signature connectivity analysis. ceSAR builds on the landmark LINCS library of transcriptional signatures of over 20,000 drug-like molecules and ~5,000 gene knock-downs (KDs) to connect small molecules and their potential targets. For a set of candidate molecules and specific target gene, candidate molecules are first ranked by chemical similarity to their ‘concordant’ LINCS analogs that share signature similarity with a knock-down of the target gene. An efficient method for chemical similarity search, optimized for sparse binary fingerprints of chemical moieties, is used to enable fast searches for large libraries of small molecules. 
A small subset of candidate compounds identified in the first step is then re-scored by combining signature connectivity with docking simulations. On a set of 20 DUD-E benchmark targets with LINCS KDs, the consensus approach significantly reduces false positive rates, improving the median precision 3-fold over docking methods at the extreme library reduction. We conclude that signature connectivity and docking provide complementary signals, offering an avenue to improve the accuracy of virtual screening while reducing run times by multiple orders of magnitude.

Accelerating the pace of drug discovery and repurposing is paramount for the development of treatments for rare diseases or personalized treatment options for precision medicine, and for the ability to respond to public health crises, such as the COVID-19 pandemic. Systematic efforts for drug discovery have used high throughput in vitro or ex vivo screening approaches, often in conjunction with an initial in silico screening of small molecule libraries. These efforts have resulted in a large number of candidate compounds targeting the druggable part of the genome 1-3. Parallel advances in pharmacogenomics and large-scale candidate drug profiling in cell lines and other model systems, such as Connectivity Map 4, NCI60 5, Cancer Cell Line Encyclopedia 6, and GDSC 7, 8, have further revolutionized drug discovery, target and mode of action prediction, and repurposing. For example, transcriptional signature connectivity analysis has been used to identify drugs that may reverse a signature of a disease state or that may have the same mode of action because of the similarity of their signatures 4, 9-11. The LINCS consortium has recently compiled a library of transcriptional signatures for over 40,000 drug-like molecules as well as over 6,000 gene knockdown (KD) and overexpression constructs in multiple cell lines 12, 13.
As a result, LINCS transcriptional signatures can be used to directly correlate downstream transcriptional responses induced by chemical perturbations with those induced by loss or gain of function of the target protein. By enabling the direct exploration of drug-gene relationships on a previously unattainable scale, using significant subsets of both the drug-like universe of small molecules and the druggable genome, LINCS provides a unique big data resource for pharmacogenomics 13-15 that is explored here with the goal of identifying candidate inhibitors of a specific protein target. It should be emphasized that similar downstream transcriptional signatures may result from the loss of function of multiple upstream proteins in signaling cascades or pathways converging on the same transcriptional targets. Of particular relevance are signaling cascades involving multiple kinases and phosphorylation events between a growth receptor and a transcription factor in many types of cancer 16. Thus, the analysis of concordance between signatures of small molecules and the target gene knock-down can identify candidate molecules that effectively lead to the loss of function as pathway inhibitors, and not necessarily a specific target inhibitor, as illustrated in Figure 1. The left panel in the figure pictorially represents a signature connectivity analysis to identify putative inhibitors of SRC by searching for candidate small molecules whose signatures are concordant, i.e., positively correlated with the SRC KD signature. Note that all 3 compounds targeting the EGFR-SRC-JUN signaling cascade are 'concordant', although only one of them targets SRC directly. To achieve the specificity to a target in a pathway, the predicted binding affinity to a target protein may be used to complement the signature connectivity-based approach. This is illustrated in the right panel in Figure 1, where the actual SRC inhibitor is shown as predicted to have the lowest binding energy, and is thus selected as the consensus candidate.

Figure 1: The overall principle of the new connectivity enhanced Structure Activity Relationship (ceSAR) approach that ranks candidate molecules by their similarity to LINCS analogs with signatures concordant to those of the target gene KDs (left panel), and can be subsequently combined with docking simulations to assess the shape complementarity with specific protein targets (right panel). Transcriptional signatures are defined as down- and up-regulated genes with the corresponding differential expression values, represented as blue and yellow boxes in the left panel, respectively. The fictitious SRC KD signature consists of 6 genes, with genes 1, 3, and 6 down-regulated and genes 2, 4, and 5 up-regulated. All 3 compounds targeting the EGFR-SRC-JUN cascade result in signatures concordant with that of SRC KD, but only the actual SRC inhibitor fits the binding pocket in docking.

In this context, various in silico docking techniques have been widely used to computationally predict binding affinities between small molecules and their (structurally resolved) targets, often coupled with Structure-Activity Relationship (SAR) analysis of chemical analogs for top ranking candidate molecules 17, 18.

Here, we present a novel approach for accelerating drug discovery and repurposing, dubbed connectivity enhanced Structure Activity Relationship (ceSAR), that combines these two principles. Capitalizing on the LINCS library of transcriptional signatures (denoted as $L$), ceSAR combines drug and target transcriptional signature connectivity analysis with efficient chemical similarity search and virtual screening approaches.
For a gene target and a library of candidate compounds to be screened, a subset of 'concordant' LINCS small molecules is first identified to include only those compounds that have signatures concordant with a target gene knock-down (or over-expression) signature. The library of candidate compounds is then reduced by using a fast chemical similarity search, optimized for sparse binary fingerprints of chemical moieties, to identify those compounds that are, with some Jaccard similarity 19 threshold, structural analogs to a LINCS molecule with transcriptional concordance to the genetic knock-down (or overexpression). The resulting small subset of compounds can be subsequently re-scored in conjunction with docking, using a consensus ranking to filter out compounds that are likely pathway inhibitors but not inhibitors of the target protein. In order to assess the new method and test the hypothesis that combining the principles of signature connectivity and shape complementarity can improve drug discovery by reducing the level of false positives and reducing the computational cost of virtual screening, we systematically evaluate the performance of ceSAR and compare it with the results of AutoDock 20 and MTiOpenScreen 21, using a subset of targets from the DUD-E benchmark, which is widely used in the assessment of docking and virtual screening methods 22.

Candidate molecule ranking using ceSAR. For a library of small molecules, $Q$, and a target gene $t$ with at least one consensus shRNA knock-down transcriptional signature available in LINCS, $t \in L_{KD}$, ceSAR ranks candidate compounds by identifying their closest chemical analogs in the LINCS library of transcriptionally profiled chemical perturbagens, $l \in L_{CP}$, that result in signatures concordant to those of the target KDs. For each $q \in Q$, the following similarity score is computed as a basis for ranking:

$$S(q) = \max_{l \in L_{CP} :\; C^{*}(l,t) > C_0} T(q,l)$$

where $T(q,l)$ is the Tanimoto coefficient (Jaccard similarity measure) 19 between compounds $q$ and $l$ represented as binary fingerprints, and $C^{*}(l,t)$ is the maximum concordance (over all cell lines for $t$, and cell line, concentration, exposure time tuples for $l$) between the signatures of chemical perturbagen $l$ and genetic knock-downs of $t$. The iLINCS correlation-based concordance measure is used here 23, and the threshold for significant concordance is set to $C_0 = 0.2$, as discussed in the Methods section. Note that the similarity score, $S(q)$, is in fact the Tanimoto coefficient for the closest 'concordant' LINCS analog of $q$, and thus is a real number between 0 and 1. By increasing the similarity threshold, $S_0 \in [0,1]$, one can reduce the initial library to a subset (ideally enriched in true positives) that can be used for further analysis and validation, by taking only those compounds that receive a score larger than $S_0$. While different forms of combining chemical similarity and concordance measures into a composite score can be considered to potentially improve the performance, such as machine learning-based ensemble consensus classifiers discussed in Supplementary Materials, we deliberately use here this very simple form of the method, which will be referred to as S throughout the manuscript, to evaluate the advantages of ceSAR.

[Figure caption fragment: ... and consensus approaches that combine the initial library reduction by Sig2Lead with AutoDock for the top 1%, 5% and 100% library subsets ($C_1$, $C_5$ and $C_{100}$, respectively), compared with a simple baseline method (B) that ignores signature connectivity and uses only chemical similarity to LINCS compounds for library reduction.]

Fast exact chemical similarity search.
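For orientation, the score S from the preceding subsection can be computed naively by scanning every concordant LINCS compound for each candidate; this per-candidate scan is the cost that the fast search described in this subsection is meant to reduce. The Python sketch below is illustrative only. Fingerprints are represented as sets of on-bit positions, and the names `tanimoto`, `cesar_score`, `lincs_fps` and `max_concordance` are hypothetical, not part of Sig2Lead or iLINCS. The strict inequality at the 0.2 concordance threshold and the score of 0 when no concordant analog exists are assumptions.

```python
from typing import Dict, FrozenSet

Fingerprint = FrozenSet[int]  # positions of the bits set to 1


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Jaccard similarity of two binary fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def cesar_score(
    candidate: Fingerprint,
    lincs_fps: Dict[str, Fingerprint],
    max_concordance: Dict[str, float],
    c0: float = 0.2,
) -> float:
    """Tanimoto coefficient to the closest 'concordant' LINCS analog.

    max_concordance[name] is assumed to hold the maximum concordance between
    the signatures of LINCS compound `name` and the target gene knock-downs.
    Returns 0.0 if no LINCS compound passes the threshold (an assumption).
    """
    best = 0.0
    for name, fp in lincs_fps.items():
        if max_concordance.get(name, 0.0) > c0:  # keep concordant analogs only
            best = max(best, tanimoto(candidate, fp))
    return best


# Toy data (hypothetical): compound B is the closest analog but is not concordant.
lincs_fps = {
    "A": frozenset({1, 2, 3, 4}),
    "B": frozenset({1, 2, 3}),
    "C": frozenset({7}),
}
max_concordance = {"A": 0.5, "B": 0.1, "C": 0.9}
score = cesar_score(frozenset({1, 2, 3}), lincs_fps, max_concordance)
```

In this toy example the score is determined by compound A, because B fails the concordance filter even though it is chemically identical to the candidate.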
The LINCS library of drug-like molecules comprises over 40,000 compounds, while the user-defined library of small molecules to be ranked and reduced by identifying 'concordant' LINCS analogs can be quite large in the context of virtual screening. An efficient solution for computing the Jaccard similarity measure (Tanimoto coefficient) and retrieving the closest matches for the case of sparse binary fingerprints is used here to address this computational bottleneck and accelerate the ceSAR search. As shown in the Methods section, by pre-processing the reference data set of compounds (here the LINCS library), the computation of similarity scores between a query compound $q$ and database compounds $d$ can be limited to only those columns in the binary fingerprint where $q$ is in the minority state, which is assumed to be 1. The sparsity in each column of the fingerprint across database compounds is exploited as well, by pre-computing indexes of the database compounds that are in the minority state at that column. The resulting algorithm, dubbed minSim (for minority Sim), optimally exploits the sparse nature of binary fingerprints commonly used for fast chemical similarity search without using approximate techniques, such as those based on hashing 24-26. For the retrieval from the LINCS library, minSim provides a 60- to 150-fold speed-up for different DUD-E datasets compared with traditional approaches (see Supplemental Table 1).

Consensus re-ranking using docking to improve ceSAR. The initial ceSAR search, as defined above, can be subsequently combined with docking simulations to achieve higher specificity for a target at hand.
For the purpose of systematic benchmarking and assessment of robustness of the new method, we consider several forms of consensus. One can start from the entire library and perform the initial ceSAR search (S) and docking simulations for all compounds in the library to derive the consensus ranking, referred to as $C_{100}$, or start from a subset of the library, first reduced using the signature connectivity-based filter S. When the library is first reduced to the top 5% or 1% of the library, the consensus form of ceSAR is referred to as $C_5$ and $C_1$, respectively. We would like to emphasize that, given the negligible computational cost compared with docking (see Figure 7) and the low barrier to applying ceSAR, even modest levels of enrichment or increased precision as the library is reduced might be considered a success. The other measure of success is the level of improvement over the baseline, which measures the signal due to signature connectivity. As can be seen from Figure 2, the simple form of ceSAR (S) in fact performs significantly better than the baseline (p-value of $6.1 \times 10^{-10}$ using the Kolmogorov-Smirnov test), while the consensus form of ceSAR significantly outperforms both docking with AutoDock (A) and S (including $C_1$, which yields Kolmogorov-Smirnov p-values of $2.8 \times 10^{-3}$ and $1.1 \times 10^{-8}$ in comparison with A and S, respectively). Importantly, these improvements are obtained for the most reduced, and thus arguably most relevant, library sizes, where the consensus approaches achieve median precision of more than 25% (35% for $C_1$), compared with just 10% for docking. Thus, on the DUD-E benchmark (using the original target conformations and binding sites), docking is successful in eliminating the most unlikely binders by using shape complementarity and the predicted binding energies, leading to initial success and higher precision (and enrichment) at the level of 5 or 10% of the original library.
However, docking struggles to correctly rank true positives and the remaining (more challenging) true negatives, resulting in a drop of accuracy as the size of the library is reduced further. Note that DUD-E datasets comprise tens of thousands of molecules, so reduction to less than 1% of the library size is desirable to reduce the number of compounds for testing. In terms of the distribution of precision values over 20 DUD-E targets, the results of the consensus-based ceSAR method ($C_1$) and docking (A) become statistically indistinguishable at 0.5% library size, while $C_1$ provides significant speed-ups, since only a small fraction (1%) of the library needs to be re-scored using consensus with docking (see Figure 7). On the other hand, the simple ceSAR search (S), which can be performed on a personal laptop within minutes, achieves results statistically indistinguishable from docking at 0.1% library size, with both methods yielding a median precision (or positive predictive value) of about 10% at this furthest library reduction (see also Figure 2). Note that 0.1% library size on average corresponds to only about 20 compounds to be tested. At this furthest library reduction, the fraction of true binders among the candidate compounds selected by using the $C_1$ consensus ceSAR approach, which combines signature connectivity analysis with docking for the top 1% of the library ranked by S, is equal to or greater than 35% for half of the DUD-E targets. We conclude that the $C_1$ consensus method provides the best trade-off between speed and accuracy on the DUD-E benchmark, while the performance of the consensus-based ceSAR methods ($C_1$, $C_5$ and $C_{100}$) is robust with respect to the choice of the top library subset for integration with docking. This is further illustrated by the distribution of the top true positive rank for individual DUD-E targets (see Figure 3), which shows good performance of the consensus methods.
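The precision values at a given library size discussed above are the fraction of true binders among the top-ranked compounds once the library is reduced to that size. A minimal Python sketch of this measure follows, under stated assumptions: the function name `precision_at_fraction`, the rounding of the subset size to the nearest integer (with at least one compound retained), and the stable handling of ties are illustrative choices, not taken from the benchmark protocol.

```python
from typing import Sequence


def precision_at_fraction(
    scores: Sequence[float],
    is_active: Sequence[bool],
    fraction: float,
    higher_is_better: bool = True,
) -> float:
    """Fraction of true binders among the top `fraction` of a ranked library.

    Use higher_is_better=True for similarity scores such as S, and
    higher_is_better=False for predicted binding energies from docking.
    """
    n = len(scores)
    k = max(1, round(fraction * n))  # size of the reduced library (assumption)
    order = sorted(range(n), key=lambda i: scores[i], reverse=higher_is_better)
    top = order[:k]
    return sum(1 for i in top if is_active[i]) / k


# Toy example (hypothetical data): 10 compounds ranked by a similarity score.
sim_scores = [0.9, 0.8, 0.1, 0.7, 0.2, 0.3, 0.4, 0.5, 0.6, 0.05]
actives = [True, False, False, True, False, False, False, False, False, False]
p_sim = precision_at_fraction(sim_scores, actives, 0.2)

# Docking energies: lower (more negative) is better.
energies = [-9.0, -5.0, -7.0, -1.0]
binders = [True, False, True, False]
p_dock = precision_at_fraction(energies, binders, 0.5, higher_is_better=False)
```

With this convention, a library of 20,000 compounds reduced to a fraction of 0.001 leaves 20 compounds, in line with the figure of about 20 compounds quoted above for 0.1% library size.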
Similar results are obtained on a subset of DUD-E targets (as well as on the original DUD benchmark) using the MTiOpenScreen docking server (see Supplementary Materials). Importantly, ceSAR is more robust compared to docking, which performs very well for some targets while also failing completely for several targets at that library size. This is illustrated in Figures 4 and 5 using precision curves for individual targets and comparison of the area under the precision curve at different library sizes, respectively. As can also be seen from Figure 6, at the most extreme library reduction considered here (0.1% library size), AutoDock fails to retain any true positives and thus yields precision of 0% in 8 out of 20 cases, compared with 7 for the simple ceSAR search, and only 4 such failures for the $C_1$ consensus method. These trends also hold in terms of the number of targets for which none of the true positives is ranked among the top 100 candidates (Figure 3). It should be emphasized that the success of ceSAR is not due to overrepresentation of known binders from DUD-E datasets among the LINCS compounds. As can be seen from Supplemental

Accelerating the identification of BCL2A1 inhibitors using ceSAR. Even though such an approach is less efficient, the ceSAR approach can be extended to incorporate signature connectivity-based re-scoring after first using docking for virtual screening to reduce the library size (this differs from the consensus approaches C considered above in that the order of library reduction is reversed). This form of combined approach is tested here in the context of an effort to identify specific inhibitors of an important anti-apoptotic target, namely BCL2A1 (A1). A1 has been implicated in a wide array of diseases, ranging from inflammation associated with pre-term birth 27 to chemotherapeutic resistance in melanoma 28.
To date, no inhibitors specific to A1 have been identified, and most of those that target the BCL2 protein family are unable to effectively block A1 activity. Most anti-apoptotic proteins prevent apoptosis by physical binding and sequestration of pro-apoptotic proteins, achieved via binding to their "BH3" domain 29. A major success in targeting this family was the development of a Bcl-2 inhibitor ABT-737 30, which was modified to a bioavailable version called ABT-263 or navitoclax. Unfortunately, ABT-263 also bound Bcl-xL, whose role in promoting survival of platelets led to thrombocytopenia in humans 31, 32. This observation spurred a biochemical tour de force that resulted in the development of ABT-199, which lost specificity for Bcl-xL 33.

Sig2Lead, an R Shiny implementation of ceSAR, was then applied to re-score the tested compounds (Figure 8), demonstrating an improvement in the overall precision when ranking the top in vitro validated compounds.

Figure 8: Retrospective re-scoring of a drug-discovery pipeline for BCL2A1 yields an improved enrichment in experimentally validated inhibitors using a combined ceSAR approach (AutoDock followed by Sig2Lead re-scoring, yellow), as compared to docking alone (AutoDock, blue).

Thus, re-scoring candidate compounds obtained using docking simulations can yield further enrichment in true positives and limit the number of compounds that need to be tested experimentally.
Conversely, the observed enrichment in true positives for an important and challenging target (with available LINCS KD signatures) illustrates how a set of experimentally identified weak binders can be used to seed the signature connectivity-based ceSAR search with the goal of identifying additional candidate compounds, i.e., the 'concordant'

Accelerating drug discovery, development and repurposing is paramount for advancing treatment options and for improving response to public health crises, such as the SARS-CoV-2 pandemic, and for further progress in personalized precision medicine. In silico screening of small molecule libraries for their predicted interaction with and inhibition of protein targets is often used to reduce the time and cost requirements in drug discovery and repurposing projects. The adage that structure dictates function has been applied in relation to small molecule inhibitors to enhance virtual screening by searching similar compounds and exploring structure activity relationships (SAR) 17, 18, 34, 35. In this contribution, we introduce an efficient in silico method to accelerate drug discovery and repurposing, dubbed connectivity enhanced Structure Activity Relationship (ceSAR). ceSAR improves on existing approaches by combining small molecule docking simulations with signature connectivity analysis to reduce both false positive rates and the computational cost of virtual screening, and thus allows one to overcome two major limitations of current virtual screening approaches. Over the last two decades, transcriptional and other profiles of drug activity have been increasingly used in drug design, mode of action identification and SAR type analyses 4, 12, 34, 36, 37. For example, identifying targets for small molecules (and thus identifying these molecules as novel inhibitors) can be facilitated by comparing bioactivity profiles or transcriptional signatures of a compound to known inhibitors 34, 38.
Another important example is the use of the connectivity map approach to connect gene expression profiles of disease states (such as drug resistant forms of cancer) with discordant drug signatures, allowing one to identify drugs that can potentially be used to reverse the disease signature 4, 12, 15. The critical difference compared to these previous efforts is that ceSAR directly connects the transcriptional signatures of a small molecule with the signature of a gene knockdown for the purpose of identifying antagonists (or overexpression to identify agonists, an option not benchmarked in this manuscript) of a specific target rather than a pathway, and to that end combines signature connectivity analysis with atomistic docking simulations to use predicted binding energies to filter out likely pathway inhibitors. For the signature connectivity analysis, ceSAR capitalizes on the LINCS library of signatures, which is at present the most comprehensive big data resource for pharmacogenomics 12, 13.

AR, FXa) on which the simple ceSAR search (S) performs poorly. Importantly, the consensus approach is also more robust, as indicated by the failure rate at the extreme library reduction, defined here as the fraction of targets for which the precision is reduced to 0% at 0.1% library size. As illustrated in Figure 6, the failure rate so defined is 40% (8 out of 20 targets) for AutoDock (A), 35% (7 out of 20 targets) for the simple signature connectivity approach (S), and 20% (4 out of 20 targets) for the consensus approach ($C_1$). Another measure of failure is the number of targets for which none of the true positives is ranked among the top 100 candidates, which is 6 for AutoDock as opposed to 2 for Sig2Lead and 4 for $C_1$ (it is worth noting that this number is zero for the other consensus approaches; see Figure 3). Taken together, these results strongly indicate the complementarity of signature connectivity and docking-based approaches for drug discovery.
On the other hand, AutoDock (A) clearly outperforms signature connectivity enhanced methods (S and C) in 3 cases: HMGCR, Thrombin and PNP. For these targets, the true binders included in the respective DUD-E datasets have no close analogs in LINCS (see Supplemental Table 2), and/or the concordance between LINCS small molecule and KD signatures is weak,

antagonists, including kinase inhibitors, are well represented in LINCS, contributing to the high accuracy of ceSAR on the 5 kinases included in our evaluation. For these kinases, ceSAR (both S and $C_1$) yields improvements over docking already at 5% library size and, for the most reduced library size, achieves about a 2-fold increase in median precision, which is about 50% for Sig2Lead alone compared to about 25% for AutoDock (see Supplemental Figure 8). Using ceSAR, through the integration of signature connectivity analysis, fast exact chemical similarity search for sparse binary fingerprints, and virtual screening approaches, a dramatic increase in speed is obtained while improving accuracy, thus providing a fast, robust and accurate platform for drug discovery and repurposing. We believe that the performance of ceSAR adds significantly to the utility of LINCS as a big data resource for pharmacogenomics and provides a strong argument in favor of further large-scale transcriptional profiling of drug-like molecules and druggable parts of the genome. We anticipate that with further advances in the CRISPR technology, more accurate gene signatures will be obtained, leading to increased performance of the new approach. At the same time, continued advances in determining 3D structures of proteins and their complexes by using cryo-electron microscopy and other techniques will expand the protein targetable space, adding to the importance of accelerating the speed of virtual screening approaches.

Candidate molecule ranking using ceSAR.
ceSAR ranks candidate molecules by combining signature connectivity analysis and chemical similarity search to identify the most similar 'concordant' LINCS analogs of candidate compounds. Here, 'concordant' is defined as having a signature that is significantly positively correlated with a target gene knock-down signature. For a target gene $t$ with at least one knock-down transcriptional signature available in LINCS, $t \in L_{KD}$, and for a library of small molecules to be ranked, $Q$, the following similarity score is computed for each $q \in Q$ as a basis for ranking:

$$S(q) = \max_{l \in L_{CP} :\; C^{*}(l,t) > C_0} T(q,l)$$

where $T(q,l)$ is the Tanimoto coefficient between $q$ and a LINCS chemical perturbagen $l \in L_{CP}$, $C^{*}(l,t)$ is the maximum concordance between the signatures of $l$ and the knock-downs of $t$, and $C_0$ is the threshold for significant concordance.

ChemBank: a small-molecule screening and cheminformatics resource database
ChEMBL: a large-scale bioactivity database for drug discovery
The discovery of first-in-class drugs: origins and evolution
The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease
Development of human tumor cell line panels for use in disease-oriented drug screening
The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity
Systematic identification of genomic markers of drug sensitivity in cancer cells
Genomics of Drug Sensitivity in Cancer (GDSC): a resource for therapeutic biomarker discovery in cancer cells
Reproducible pharmacogenomic profiling of cancer cell line panels
Computational Drug Repurposing: Current Trends
Integrative cancer pharmacogenomics to establish drug mechanism of action: drug repurposing
A Next Generation Connectivity Map: L1000 Platform and the First 1
The Library of Integrated Network-Based Cellular Signatures NIH Program: System-Level Cataloging of Human Cells Response to Perturbations
Leveraging Big Data to Transform Drug Discovery
Systems Pharmacogenomic Landscape of Drug Similarities from LINCS data: Drug Association Networks
Large-scale integration of small molecule-induced genome-wide transcriptional responses, Kinome-wide binding affinities and cell-growth inhibition profiles reveal global trends characterizing systems-level drug action
Structure-Based Virtual Screening: From Classical to Artificial Intelligence
The Light and Dark Sides of Virtual Screening: What Is There to Know?
Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations
Computational protein-ligand docking and virtual drug screening with the AutoDock suite
MTiOpenScreen: a web server for structure-based virtual screening
Directory of Useful Decoys, Enhanced (DUD-E): Better Ligands and Decoys for Better Benchmarking
Connecting omics signatures of diseases, drugs, and mechanisms of actions with iLINCS
Binary Hashing for Approximate Nearest Neighbor Search on Big Data: A Survey
Index Structures for Fast Similarity Search for Binary Vectors
Hashing for Similarity Search: A Survey
IL-1 signaling mediates intrauterine inflammation and chorio-decidua neutrophil recruitment and activation
Role of the pro-survival molecule Bfl-1 in melanoma
Dying to protect: Cell Death and the control of T cell Homeostasis
An inhibitor of Bcl-2 family proteins induces regression of solid tumours
Phase I study of Navitoclax (ABT-263), a novel Bcl-2 family inhibitor, in patients with small-cell lung cancer and other solid tumors
Navitoclax, a targeted high-affinity inhibitor of BCL-2, in lymphoid malignancies: a phase 1 dose-escalation study of safety, pharmacokinetics, pharmacodynamics, and antitumour activity
ABT-199, a potent and selective BCL-2 inhibitor, achieves antitumor activity while sparing platelets
Identifying Compound-Target Associations by Combining Bioactivity Profile Similarity Search and Public Databases Mining
Increased Accuracy with Structure- and Ligand-Based Shotgun Drug Repurposing
Direct and indirect approaches to identify drug modes of action
'Omics'-Informed Drug and Biomarker Discovery: Opportunities, Challenges and Future Perspectives
Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool
Modern Computational Strategies for Designing Drugs to Curb Human Diseases: A Prospect
PDBe: improved findability of macromolecular structure data in the PDB
An overview of molecular fingerprint similarity search in virtual screening
A generalizable definition of chemical similarity for read-across
Similarity searching using 2D structural fingerprints
Similarity-based virtual screening using 2D fingerprints
Molecular fingerprint similarity search in virtual screening
ChemmineR: a compound mining framework for R
Open chemoinformatic resources to explore the structure, properties and chemical space of molecules

This work was supported in part by the National Institutes of Health grants U54 HL127624, P30 ES006096, R01 MH107487, R01CA122346, R01GM128216, 1T32CA236764, R01 CA237016, R21 HD090856 and UL1TR001425, 2I01BX001110 BLR&D VA Merit award, and Cincinnati Children's Innovation Fund award (to DAH and ABH).

Fast exact chemical similarity search using minSim. Consider now a search for a query compound $q \in Q$ against database compounds $d \in D$ using the binary fingerprints described above. The formula for the Tanimoto coefficient, $T(q,d)$, which is defined for two binary fingerprints $q$ and $d$ as the ratio of the number of positions with ones in both $q$ and $d$ to the number of positions with ones in either $q$ or $d$, can be written in the following form:

$$T(q,d) = \frac{N(q,d)}{N(q) + N(d) - N(q,d)}$$

where $N(q)$ and $N(d)$ are the numbers of ones in $q$ and $d$, which can be pre-computed for all database molecules $d$, while $N(q,d)$ is the number of ones in common for $q$ and $d$. Note that the computation of $N(q,d)$ can be limited to only those columns in the binary fingerprint where $q$ is in the minority state, which is assumed to be 1. Furthermore, by pre-processing the reference data set of compounds (here the LINCS library) one can optimally exploit the sparsity in each column by pre-computing indexes of database compounds in the minority state at each column, as illustrated in Supplemental Figure 2.
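The pre-processing and counting described here can be sketched in Python as follows. This is a minimal illustration of the idea rather than the authors' minSim implementation: it assumes that fingerprints are given as sets of on-bit positions and that 1 is the minority state in every column, and the names `build_index` and `tanimoto_all` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Set, Tuple


def build_index(db: Sequence[Set[int]]) -> Tuple[Dict[int, List[int]], List[int]]:
    """Pre-process the database once.

    Returns, for each fingerprint column, the indexes of database compounds
    with a one at that column, plus the number of ones of every compound.
    """
    postings: Dict[int, List[int]] = defaultdict(list)
    for i, fp in enumerate(db):
        for column in fp:
            postings[column].append(i)
    counts = [len(fp) for fp in db]
    return postings, counts


def tanimoto_all(
    query: Set[int], postings: Dict[int, List[int]], counts: List[int]
) -> List[float]:
    """Exact Tanimoto coefficients between the query and all database compounds.

    Only columns where the query has a one are visited, and within each such
    column only the database compounds that also have a one there.
    """
    common = [0] * len(counts)  # number of ones shared with the query
    for column in query:
        for i in postings.get(column, ()):
            common[i] += 1
    n_query = len(query)
    sims = []
    for n_db, n_common in zip(counts, common):
        denom = n_query + n_db - n_common  # ones in either fingerprint
        sims.append(n_common / denom if denom else 0.0)
    return sims


# Toy database (hypothetical fingerprints); the last compound has no bits set.
db = [{1, 2, 3, 4}, {1, 2, 3}, {7}, set()]
postings, counts = build_index(db)
sims = tanimoto_all({1, 2, 3}, postings, counts)
```

The result is exact, since every shared one is counted once; the saving comes from never touching columns where the query has a zero or database compounds that have a zero at a visited column.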
Namely, the following list of database vectors is pre-computed for each column $j$ in the fingerprint:

$$I_j = \{\, d \in D : d_j = 1 \,\}$$

that is, the database compounds whose fingerprint is in the minority state at column $j$. The minSim (for minority Sim) algorithm computes all Tanimoto coefficients for a query molecule $q$. Note also that minSim computes the exact Jaccard similarity, without using approximate techniques, such as those based on hashing 25, 26, 46. As can be seen from Supplemental

Authors declare no conflict of interest.