94119894 A global cancer data integrator reveals principles of synthetic lethality, sex disparity and immunotherapy. Christopher Yogodzinski1,2,#*, Abolfazl Arab1-3, Justin R. Pritchard4, Hani Goodarzi1-3, Luke A. Gilbert1,2,5* 1 Department of Urology, University of California, San Francisco, San Francisco, CA, USA 2 Helen Diller Family Comprehensive Cancer Center, San Francisco, San Francisco, CA, USA 3 Department of Biochemistry and Biophysics, University of California, San Francisco, CA, USA 4 Department of Biomedical Engineering, Pennsylvania State University, University Park, PA 5 Department of Cellular & Molecular Pharmacology, University of California, San Francisco, CA, USA # Current Address: University of North Carolina Chapel Hill School of Medicine, Chapel Hill, NC, USA *Corresponding authors Correspondence: cyogodzi@unc.edu (C.Y.), luke.gilbert@ucsf.edu (L.A.G) (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425918doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425918 Abstract Advances in cancer biology are increasingly dependent on integration of heterogeneous datasets. Large scale efforts have systematically mapped many aspects of cancer cell biology; however, it remains challenging for individual scientists to effectively integrate and understand this data. We have developed a new data retrieval and indexing framework that allows us to integrate publicly available data from different sources and to combine publicly available data with new or bespoke datasets. Beyond a database search, our approach empowered testable hypotheses of new synthetic lethal gene pairs, genes associated with sex disparity, and immunotherapy targets in cancer. Our approach is straightforward to implement, well documented and is continuously updated which should enable individual users to take full advantage of efforts to map cancer cell biology. Introduction Large scale but often independent efforts have mapped phenotypic characteristics of more than one thousand human cancer cell lines. Despite this, static lists of univariate data generally cannot identify the underlying molecular mechanisms driving a complex phenotype. We hypothesized that a global cancer data integrator that could incorporate many types of publicly available data including functional genomics, whole genome sequencing, exome sequencing, RNA expression data, protein mass spectrometry, DNA methylation profiling, ChIP- seq, ATAC-seq, and metabolomics data would enable us to link disease features to gene products 1–15. We set out to build a resource that enables cross platform correlation analysis of multi-omic data as this analysis is in and of itself is a high-resolution phenotype. Multi-omic analysis of (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425918doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425918 functional genomics data with genomic, metabolomic or transcriptomic profiling can link cell state or specific signaling pathways to gene function 2,3,13,15–18. Lastly, co-essentiality profiling across large panels of cell lines has revealed protein complexes and co-essential modules that can assign function to uncharacterized genes 19. Problematically, in many cases publicly available data are poorly integrated when considering information on all genes across different types of data and the existing data portals are inflexible. For example, lists of genes cannot be queried against groups of cell lines stratified by mutation status or disease subtype. Furthermore, one cannot integrate new data derived from individual labs or other consortia. We created the Cancer Data Integrator (CanDI) which is a series of python modules designed to seamlessly integrate genomic, functional genomic, RNA, protein and metabolomic data into one ecosystem. Our python framework operates like a relational database without the overhead of running MySQL or Postgres and enables individual users to easily query this vast dataset and add new data in flexible ways. This was achieved by unifying the indices of these datasets via index tables that are automatically accessed through CanDI’s biologically relevant Python Classes. We highlight the utility of CanDI through four types of analysis to demonstrate how complex queries can reveal previously unknown molecular mechanisms in synthetic lethality, sex disparity and immunotherapy. These data nominate new small molecule and immunotherapy anti-cancer strategies in KRAS-mutant colon, lung and pancreatic cancers. Results CanDI is a global cancer data integrator. We set out to integrate three types of data by creating programmatic and biologically relevant abstractions that allow for flexible cross referencing across all datasets. Data from the (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425918doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425918 Cancer Cell Line Encyclopedia (CCLE) for RNA expression, DNA mutation, DNA copy number and chromosome fusions across more than 1000 cancer cells lines was integrated into our database with the functional genomics data from the Cancer Dependency Map (DepMap) (Fig. 1a,b and Supplementary Fig. 1) 1,12,20. We also integrated protein-protein interaction data from the CORUM database along with three additional distinct protein localization databases 4,7,11,21. CanDI by default will access the most recent release of data from DepMap although users can also specify both the release and data type that is accessed. The key advantage to this approach is that CanDI enables one to easily input user defined queries with multi-tiered conditional logic into this large integrated dataset to analyze gene function, gene expression, protein localization and protein-protein interactions. CanDI identifies genes that are conditionally essential in BRCA-mutant ovarian cancer. The concept that loss-of-function tumor suppressor gene mutations can render cancer cells critically reliant on the function of a second gene is known as synthetic lethality. Despite the promise of synthetic lethality, it has been challenging to predict or identify genes that are synthetic lethal with commonly mutated tumor suppressor genes. While there are many underlying reasons for this challenge, we reasoned that data integration through CanDI could identify synthetic lethal interactions missed by others. A paradigmatic example of synthetic lethality emerged from the study of DNA damage repair (DDR)22. Somatic mutations in the DNA double-strand break (DSB) repair genes, BRCA1/2, create an increased dependence on DNA single strand break (SSB) repair. This dependence can be exploited through small molecule inhibition of PARP1 mediated SSB repair. Inhibition of PARP provides significant clinical responses in advanced breast and ovarian cancer (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425918doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425918 patients but they ultimately progress22. Thus, new synthetic lethal associations with BRCA1/2 are a potential path towards therapeutic development PARP refractory patients. To illustrate the flexibility of CanDI to mine context specific synthetic sick lethal (SSL) genetic relationships we hypothesized that the genes that modulate response to a PARP1 inhibitor might be enriched for selectively essential proliferation or survival of BRCA1/2-mutant cancer cells. To test this hypothesis, we integrated the results of an existing CRISPR screen that identified genes that modulate response to the PARP inhibitor olaparib23. We then tested whether any of these genes are differentially essential for cell proliferation or survival in ovarian cancer and in breast cancer cell models that are either BRCA1/2 proficient or deficient (Fig. 1c,d). This query revealed that the Fanconi Anemia pathway is selectively essential in BRCA1/2-mutated ovarian cancer models but not in BRCA1/2-wild type ovarian cancer, BRCA1/2-mutated breast cancer or BRCA1/2-wildtype breast cancer models (Fig. 1e and Supplementary Table 1). To our knowledge a SSL phenotype between FANCM and BRCA1/2 has never been reported although a recent paper nominated a role for FANCM and BRCA1 in telomere maintenance24. Importantly, FANCM is a helicase/translocase and thus considered to be a druggable target for cancer therapy25. Clinical genomics data support this SSL hypothesis although this remains to be tested in ovarian cancer patient samples26. Because the DepMap currently only allows single genes to be queried and does not enable users to easily stratify cell lines by mutation such analysis would normally take a user several days to complete manually. Our approach enabled this analysis to be completed using a desktop computer in less than two hours, which includes the visualization of data presented here (Fig. 1e). Figure 1. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425918doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425918 Figure 1. (A) A schematic showing human cell models integrated by CanDI. (B) A schematic illustrating types of data integrated by CanDI. (C) A cartoon of a genome-scale CRISPRi screen to identify genes that modulate response to PARP inhibition by Olaparib. (D) A schematic depicting data feature inputs parsed by CanDI. (E) Essentiality of Fanconi Anemia genes in ovarian and breast cancer cell lines separated by BRCA mutation status. A Bayes Factor score of gene essentiality is displayed by a heat map. N=4 BRCA1/2-mutant ovarian cancer, N=27 BRCA-wildtype ovarian cancer, N=5 BRCA1/2-mutant breast cancer, N=19 BRCA1/2-wildtype breast cancer. Conditional genetic essentiality in KRAS- and EGFR- mutant NSCLC cells. Beyond TSGs, many common driver oncogenes such as KRASG12D are currently undruggable, which motivates the search for oncogene specific conditional genetic dependencies. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425918doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425918 We reasoned that CanDI enables us to rapidly search functional genomics data for genes that are conditionally essential in lung cancer cells driven by KRAS- and EGFR-mutations. We stratified non-small cell lung cancer cell (NSCLC) models by EGFR and KRAS mutations and then looked at the average gene essentiality for all genes within each of these 4 subtypes of NSCLC. We observed that KRAS is conditionally self-essential in KRAS-mutant cell models but that no other genes are conditionally essential in KRAS-mutant, EGFR-mutant, KRAS-wildtype or EGFR-wildtype cell models (Fig. 2a,b and Supplementary Table 2). This finding demonstrates that very few---if any--- genes are synthetic lethal with KRAS- or EGFR- in KRAS- and EGFR- mutant lung cancer cell lines. It may be that these experiments are underpowered or it may be that when the genetic dependencies of diverse cell lines representing a disease subtype are averaged across a single variable (e.g. a KRAS-mutation) very few common synthetic lethal phenotypes are observed27. CanDI provides potential solutions for both of these hypotheses. CanDI enables a global analysis of conditional essentiality in cancer. It is thought that data aggregation across vast landscapes of unknown co-variates does not necessarily increase the statistical power to identify rare associations27. Thus, the global analyses of aggregated cancer data sometimes lies in systematically sub setting data based on key co- variates post aggregation. This has been observed in driver gene identification28. Inspired by our analysis of TSG and oncogene conditionally essentiality above, we next used CanDI to identify genes that are conditionally essential in the context of several hundred cancer driver mutations. We first grouped driver mutations (e.g. nonsense or missense) for each driver gene. For this analysis, we selected several thousand genes that are in the 85-90th percentile of essentiality within the DepMap data and therefore conditionally essential, meaning these genes are required (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425918doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425918 for cell growth or survival in a subset of cell lines. Importantly, it is not known why these several thousand genes are conditionally essential. We then tested whether each of these conditionally essential genes has a significant association with individual driver mutations. Our analytic approach does not weight the number of cell models representing each driver mutation nor does this give information on phenotype effect sizes. Our analysis nominates a large number of conditionally dependent genetic relationships with both TSG and oncogenes (Fig. 2c,d and Supplementary Table 3). A number of the conditional genetic dependencies identified in our independent variable analysis above are represented by a limited number of cell models and so further investigation is needed to validate these conditional dependencies, but this data further suggests that averaging genetic dependencies across diverse cell lines with un-modeled covariates obscures conditional SSL relationships. To further investigate this hypothesis, we analyzed these same conditional genetic relationships with a second analytic approach that weights the number of cell models representing each driver mutation. We observed a limited number of conditional genetic dependencies that largely consists of oncogene self-essential dependencies as previously highlighted for KRAS-mutant cell lines (Fig. 2e-g and Supplementary Table 4)13,29. Thus, analysis that averages each conditional phenotype across diverse panels of cell lines with unknown covariates masks interesting conditional genetic dependencies. Figure 2. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425918doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425918 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425918doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425918 Figure 2. (A) Average gene essentiality for KRAS and EGFR in groups of NSCLC cell lines stratified by KRAS mutation status or by both KRAS and EGFR mutation status. N=38 for KRAS-wildtype shown in blue N=19 for KRAS-mutant shown in blue. N=30 for KRAS- wildtype EGFR-wildtype shown in grey and N=16 for KRAS-mutant EGFR-wildtype shown in grey. Gene essentiality is an averaged Bayes Factor score for each group of cell lines. (B) Average gene essentiality for KRAS and EGFR in groups of NSCLC cell lines stratified by EGFR mutation status or by both EGFR and KRAS mutation status. N=46 for EGFR-wildtype shown in blue, N=11 for EGFR-mutant shown in blue. N=30 for EGFR-wildtype KRAS- wildtype shown in grey and N=8 for EGFR-mutant KRAS-wildtype shown in grey. Gene essentiality is an averaged Bayes Factor score for each group of cell lines. (C) P-values from Chi2 tests of gene essentiality and nonsense mutations. (D) P-values from Chi2 tests of gene essentiality and missense mutations. (E) A scatter plot showing effect size of the change in gene essentiality with select missense mutations and the -Log10(P-value) of each essentiality/mutation pair. (F) A scatter plot showing effect size of the change in gene essentiality with select nonsense mutations and the -Log10(P-value) of each essentiality/mutation pair. (G) A scatter plot showing effect size of the change in gene essentiality with all mutations and the -Log10(P-value) of each essentiality/mutation pair. CanDI reveals female and male context specific essential genes in colon, lung and pancreatic cancer. Cancer functional genomics data is often analyzed without consideration for fundamental biological properties such as the sex of the tumor from which each cell line is derived. It is well established that biological sex influences cancer predisposition, cancer progression and response (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425918doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425918 to therapy30. We hypothesized that individual genes may be differentially essential across male and female cell lines. This hypothesis to our knowledge has never been tested in an unbiased large-scale manner. To maximize our statistical power to identify such differences we chose to test this hypothesis in a disease setting with large number of relatively homogenous cell lines and fewer unknown covariates. Using CanDI, we stratified all KRAS-mutant NSCLC, pancreatic adenocarcinoma (PDAC), and colorectal cancer (CRC) by sex and then tested for conditional gene essentiality. This analysis identified a number of genes that are differentially essential in male or female KRAS-mutant NSCLC, PDAC and CRC models (Fig. 3a-f and Supplementary Table 5). The genes that we identify are not common across all three disease types suggesting as one might expect that the biology of the tumor in part also determines gene essentiality. To test whether any association between differentially essential genes could be identified from expression data (e.g essential genes encoded on the Y chromosome) we first used CanDI to identify genes that are differentially expressed between male and female cell lines within each disease 31. We then plotted the set of differentially essential genes against the differentially expressed genes in KRAS-mutant NSCLC, PDAC and CRC models (Fig. 3a,c,e and Supplementary Table 6) and found little overlap between these gene lists. A number of genes that are more essential in male cells, such as AHCYL1, ENO1, GPI and PKM, regulate cellular metabolism. This finding is consistent with previous literature on sex and metabolism32. Our analysis demonstrates that stratifying groups of heterogeneous cancer models by three variables, in this case tumor type, KRAS mutation status and sex, reveals differentially essential genes. CanDi enables biologically principled stratification of data in the CCLE and DepMap by any feature associated with a group of cell models. This stratification allows us to identify genes associated with sex, which is not possible with other covariates included. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425918doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425918 Figure 3. Figure 3. (A) Differential gene expression and differential gene essentiality in male and female CRC cell lines. N=7 male cell lines and N=3 female cell lines. (B) The distribution of Bayes factor gene essentiality scores in male and female CRC cell lines. The top seven and bottom (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425918doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425918 three differentially essential genes are shown in violin plots split by the sex of the cell lines. (C) Differential gene expression and differential gene essentiality in male and female NSCLC cell lines. N=9 male cell lines and N=5 female cell lines. (D) The distribution of Bayes factor gene essentiality scores in male and female NSCLC cell lines. The top seven and bottom three differentially essential genes are shown in violin plots split by the sex of the cell lines. (E) Differential gene expression and differential gene essentiality in male and female PDAC cancer cell lines. N=13 male cell lines and N=5 female cell lines. (F) The distribution of Bayes factor gene essentiality scores in male and female PDAC cell lines. The top seven and bottom three differentially essential genes are shown in violin plots split by the sex of the cell lines. CanDI enables rapid integration of external datasets to reveal new immunotherapy targets. An emerging challenge in the cancer biology is how to robustly integrate larger “resource” datasets like CCLE with the vast amount of published data from individual laboratories. For example, a big challenge in antibody discovery is identifying specific surface markers on cancer cells. To approach these big questions we utilized CanDIs ability to rapidly take new datasets, such as raw RNA-seq counts data in a disparate study of interest, then normalize and integrate this data into the CCLE, DepMap and protein localization databases previously described. Specifically, we rapidly integrated an RNA-seq expression dataset that measured the set of transcribed genes in primary lung bronchial epithelial cells from 4 donors 33. Classes within CanDI enable rapid application of DESeq2 to assess the differential expression between outside datasets and the CCLE. We used this feature to identify genes that are differentially expressed between primary lung bronchial epithelial cells and KRAS-mutant NSCLC, EGFR-mutant NSCLC or all NSCLC models in CCLE. We then used CanDI to identify (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425918doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425918 genes that are upregulated in cancer cells over normal lung bronchial epithelial cells with protein products that are localized to the cell membrane. This analysis of KRAS-mutant, EGFR-mutant and pan-NSCLC generated highly similar lists of differentially expressed surface proteins (Fig. 4a-f and Supplementary Table 7). Notably, overexpression of several of these genes, such as CD151 and CD44, has been observed in lung cancer and is associated with poor prognosis 34–36. These proteins represent potential new immunotherapy targets in KRAS-driven NSCLC. Figure 4. Figure 4. (A) A graph showing genes that are upregulated in KRAS-mutant NSCLC cell lines relative to primary human bronchial epithelial cells. A cell membrane protein localization score (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425918doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425918 is shown for each gene. Higher protein localization scores indicate higher confidence annotations. (B) A scatter plot showing gene expression for genes that encode cell surface proteins in KRAS-mutant NSCLC cell lines and primary human bronchial epithelial cells. N=46 for KRAS-mutant NSCLC cell lines and N=4 for primary human bronchial epithelial cells. (C) A graph showing genes that are upregulated in EGFR-mutant NSCLC cell lines relative to primary human bronchial epithelial cells. A cell membrane protein localization score is shown for each gene. Higher protein localization scores indicate higher confidence annotations. (D) A scatter plot showing gene expression for genes that encode cell surface proteins in EGFR-mutant NSCLC cell lines and primary human bronchial epithelial cells. N=21 for EGFR-mutant NSCLC cell lines and N=4 for primary human bronchial epithelial cells. (E) A graph showing genes that are upregulated in NSCLC cell lines relative to primary human bronchial epithelial cells. A cell membrane protein localization score is shown for each gene. Higher protein localization scores indicate higher confidence annotations. (F) A scatter plot showing gene expression for genes that encode cell surface proteins in NSCLC cell lines and primary human bronchial epithelial cells. N=141 for NSCLC cell lines and N=4 for primary human bronchial epithelial cells. Discussion Data integration is a critical requirement in biology research in the era of genomics and functional genomics. Large scale efforts such as the CCLE have revealed genomic features of more than 1000 cell line models. This data has not to our knowledge previously been integrated with functional genomics data in a manner that individual users can enter batched queries that are stratified by disease subtype or mutation status. This is not just a small improvement in functionality, but rather it is an enabling format that makes possible the types of conditional (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425918doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425918 genomics analyses that drive discovery. Moreover, it fills a fundamental gap in the cancer research community that integrates large scale projects with investigator initiated studies Our data framework enables biologists without specialized expertise in bioinformatics to use the full spectrum of data in the CCLE and DepMap in a higher throughput and precise manner. Using CanDI, we identified genes that are selectively essential in male versus female KRAS-mutant NSCLC, PDAC and CRC models. To our knowledge, such analysis has never been performed to begin to query the biologic basis of sex disparity in cancer or cancer therapy. We illustrate another feature of our framework by analyzing a list of hit genes nominated by a bespoke CRISPR drug screen for gene essentiality in BRCA1/2-wild type and BRCA1/2- mutated breast and ovarian cancer. In a third application, we analyzed the principle of synthetic lethality for 17427 genes in 19 KRAS-mutant and 11 EGFR-mutant NSCLC models. We then used CanDI to globally identify genes that are conditionally essential in the context of common cancer driver mutations. Finally, we nominated 12 potential new immunotherapy targets in KRAS-mutant, EGFR-mutant and pan -NSCLC models by using CanDI to identify genes that are differentially expressed in normal bronchial epithelial cells versus NSCLC models that are localized at the plasma membrane. Our data reveal a wealth of new hypotheses that can be rapidly generated from publicly available cancer data. By sharing data flows and use cases with a CanDI community we illustrate the ways in which individual research groups can interact with massive cancer genomics projects without reinventing tools or relying upon DepMap tool releases. We anticipate that CanDI will be widely used in cell biology, immunology and cancer research. Methods (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425918doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425918 CanDI The CanDI data integrator is available at https://github.com/Yogiski/CanDI. CanDI Module Structure The CanDI data integrator is a python library built on top of the Pandas that is specialized in integrating the publicly available data from The Cancer Dependency Map (DepMap Release: 2019 Quarter 3)12, The Cancer Cell Line Encyclopedia (CCLE Release: 2019 Quarter 3) 1, The Pooled In-Vitro CRISPR Knockout Essentiality Screens Database (PICKLES Library: Avana 2018 Quarter 4) 20, The Comprehensive Resource of Mammalian Protein Complexes (CORUM)8 and protein localization data from The Cell Atlas4, The Map of the Cell11, and The In Silico Surfaceome7,21. Data from DepMap and CCLE used in the following analyses are from the 2019Q3 release. Data from PICKLES is from the 2018 Quarter 4 release of DepMap using the Avana library. Access to all datasets is controlled via a python class called Data. Upon import the data class reads the config file established during installation and defines unique paths to each dataset and automatically loads the cell line index table and the gene index table. Installation of CanDI, configuration, and data retrieval is handled by a manager class that is accessed indirectly through installation scripts and the Data class. Interactions with this data are controlled through a parent Entity class and several handlers. The biologically relevant abstraction classes (Gene, CellLine Cancer, Organelle, GeneCluster, CellLineCluster) inherit their methods from Entity. Entity methods are wrappers for hidden data handler classes who perform specific transformations, such as data indexing and high throughput filtering. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425918doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425918 Differential Expression In all cases where it is mentioned differential expression was evaluated using the DESeq2 R package (Release 3.10) 31. Significance was considered to be an adjusted p-value of less than 0.01. Differential Essentiality Essentiality scores are taken from the PICKLES database (Avana 2018Q4). To reduce the number of hypotheses posed during this analysis the mutual information of gene essentiality was calculated using the mutual information metric from the python package SciKitLearn (Version 0.22.0). Genes with mutual information scores greater than one standard devation above the median were removed from consideration. Differential essentiality was evaluated by performing a Mann-Whitney u-test between two groups on every gene that passed the mutual information filter. Significance was considered to be a p-value of less than 0.01. Magnitude of differential essentiality of a given gene was shown as the difference in mean Bayes factors between two groups of cell lines. Protein Localization Confidence Protein localization data was assembled from The Cell Atlas4, The Map of the Cell11, and The In Silico Surfaceome7,21. Confidence annotations were taken from the supplemental data of each paper and put on a number scale from 0 to 4 and summed for a total confidence score for each localization annotation for every gene where across all three papers. The analysis shown in Figure 4 represents a gene list that was further manually curated to remove the genes that are (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425918doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425918 localized to the intracellular space at the cell membrane revealing cell surface protein targets that are highly expressed in NSCLC cancer models over normal lung bronchial epithelial cells 4,7,11,21. DepMap Creative Commons License When an individual user runs CanDI they are downloading DepMap data and thus are agreeing to a CC Attribution 4.0 license (https://creativecommons.org/licenses/by/4.0/). Synthetic Lethality of Fanconi Anemia Genes in Ovarian and Breast Cancer Models We made a list of the top 50 gene hits that confer sensitivity to PARP inhibition in HeLa cells23. Using CanDI the essentiality scores of these top hits were visualized across all ovarian cancer cell models in PICKLES (Avana 2018Q4). FANCA and FANCE showed selective essentiality in the BRCA1/2 mutant ovarian cancer cell lines. Following this observation CanDI was used to gather the gene essentiality for all FANC genes in the fanconi anemia pathway. CanDI was then used to visualize these data across all ovarian and breast cancer cell lines, sorting by BRCA1/2 mutation status. Synthetic Lethality in KRAS and EGFR mutant Cell Lines CanDI was leveraged to bin NSCLC cell lines present in both CCLE (Release: 2019Q3) and PICKLES (Avana 2018Q4) into 8 groups. KRAS mutant and KRAS wild type cell lines with and without EGFR mutants removed as well as EGFR mutant and EGFR wild type cell lines with and without KRAS mutants removed. The mean essentiality score for every gene in the genome was calculated for every group of cell lines. Synthetic lethality score per gene is defined as the change in mean essentiality from the mutant groups to the wild type groups. (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425918doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425918 Pan Cancer Synthetic Lethality Analysis A set of 299 core oncogenes and tumor suppressor driver mutations was chosen for analysis37. To test the effect of these gene’s mutations on gene essentiality CanDI was leveraged to split into two groups: a nonsense mutation group containing genes annotated as tumor suppressors (N=153) and a missense mutation group containing genes annotated as oncogenes with specific driver protein changes (N=53). CanDI was then used to collect a core set of genes with highly variable essentiality. To do this the Bayes factors from the PICKLES database (Avana 2018Q4) were converted to binary numeric variables. Bayes factors over 5 were assigned a 1=essential and Bayes factors under 5 were assigned a 0=non-essential. Genes were then sorted buy their variance across cell lines and genes between the 85th and 95th percentile were used for this analysis (N=2340). To determine a short list of genes with which to follow up on Chi2 tests were applied to the 95940 gene pairs in the missense group and the 603720 gene pairs in the tumor suppressor group. Three new groups were formed for further analysis: the first consisted of the significant gene/mutation pairs from the oncogenic group, the second consisted of the significant gene/mutation pairs from the tumor suppressor group, and the third was a combination of the significant pairs from both groups with no discrimination on the type of mutations considered. These groups were further analyzed for differential essentiality via the Mann Whitney method described above and the Cohens D effect size were calculated to measure the extent of the phenotype. Differential Expression and Essentiality of Male and Female KRAS driven cancers (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425918doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425918 We used CanDI to gather all cell lines that are present in both PICKLES (Avana 2018Q4) and CCLE (Release 2019Q3). CanDI was then leveraged to put these cell lines into the following tissue groups: KRAS mutant Colon/Colorectal, PDAC, and NSCLC. Each tissue group was then split into male and female sub-groups. Differential expression was analyzed by applying the methods described above to raw RNA-seq counts data from CCLE (Release: 2019Q3). Genes with adjusted p-values less than 0.01 were considered significantly differentially expressed. Differential essentiality was analyzed using the methods described above on the previously described sex-subgroups for each tissue type. Genes with p-values less than 0.01 were considered significantly differentially essential between male and female cell models. For each tissue type the distributions of the top 7 significantly differentially essential genes were highlighted in comparison with the bottom 3 as a negative control. Differential expression of benign and malignant cancer cell lines We downloaded human bronchial epithelial (HBE) RNA-seq data from Gillen et al via the European Nucleotide Archive to use as a benign lung tissue model33. This 4 data set contains gene expression data for primary HBE cells cultured from three different donors and also NHBE cells (Lonza CC-2541, a mixture of HBE and human tracheal epithelial cells). We then used CanDI to put NSCLC models into three different groups: KRAS mutant, EGFR mutant, and all cell lines. For our benign model raw counts were quantified via kallisto38. Raw counts for our malignant cell lines were queried via CanDI. DESeq2 was then applied to evaluate the differential expression between our normal lung tissue model and our three malignant lung tissue groups. The results from DESeq2 were then filtered by significance (adjusted p-value < 0.01). To filter based on potential immunotherapy targets we removed all genes not annotated as being (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425918doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425918 localized to the plasma membrane, and genes with localization confidence scores lower than six. Genes that were obviously mis-annotated as surface proteins were also manually removed. Supplementary Figure/Table Legends Supplementary Figure 1. Supplementary Figure 1. An Object-oriented schema diagram showing core structure of CanDI software. Supplementary Table 1. A table containing raw PICKLES Bayes factors displayed in the heat map of Fig. 1e. Supplementary Table 2. A table containing mean PICKLES Bayes factors for each series displayed in Fig. 2a,b. A (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425918doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425918 Supplementary Table 3. A table containing the data for all chi2 tests performed to generate Fig. 2c,d. Supplementary Table 4. A table containing the data for scatter plots shown in Fig. 2e,f,g. Supplementary Table 5. A table containing the data from the differential essentiality analysis for all three tissues in Fig. 3a-f. Supplementary Table 6. A table containing the data from the differential expression analysis for all three tissues in Fig. 3a,c,e. Supplementary Table 7. A table containing the differential expression analysis data merged with the location data for all three tissues shown in Fig. 4. Acknowledgements We thank everyone in the Gilbert lab for helpful comments and discussion. LAG is supported by K99/R00 CA204602 and DP2 CA239597 as well as the Goldberg-Benioff Endowed Professorship in Prostate Cancer Translational Biology. Conflicts of Interest None Bibliography 1. Ghandi, M. et al. Next-generation characterization of the Cancer Cell Line Encyclopedia. Nature 569, 503–508 (2019). 2. Li, H. et al. The landscape of cancer cell line metabolism. Nat. Med. 25, 850–860 (2019). 3. Tsherniak, A. et al. Defining a Cancer Dependency Map. Cell 170, 564-576.e16 (2017). (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425918doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425918 4. Thul, P. J. et al. A subcellular map of the human proteome. Science 356, (2017). 5. Cancer Cell Line Encyclopedia Consortium & Genomics of Drug Sensitivity in Cancer Consortium. Pharmacogenomic agreement between two cancer cell line data sets. Nature 528, 84–87 (2015). 6. Barretina, J. et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483, 603–607 (2012). 7. Bausch-Fluck, D. et al. The in silico human surfaceome. PNAS 115, E10988–E10997 (2018). 8. Giurgiu, M. et al. CORUM: the comprehensive resource of mammalian protein complexes- 2019. Nucleic Acids Res. 47, D559–D563 (2019). 9. Nusinow, D. P. et al. Quantitative Proteomics of the Cancer Cell Line Encyclopedia. Cell 180, 387-402.e16 (2020). 10. Szklarczyk, D. et al. The STRING database in 2017: quality-controlled protein-protein association networks, made broadly accessible. Nucleic Acids Res. 45, D362–D368 (2017). 11. Itzhak, D. N., Tyanova, S., Cox, J. & Borner, G. H. Global, quantitative and dynamic mapping of protein subcellular localization. Elife 5, (2016). 12. Meyers, R. M. et al. Computational correction of copy number effect improves specificity of CRISPR-Cas9 essentiality screens in cancer cells. Nat. Genet. 49, 1779–1784 (2017). 13. Behan, F. M. et al. Prioritization of cancer therapeutic targets using CRISPR–Cas9 screens. Nature 568, 511–516 (2019). 14. Wang, T. et al. Identification and characterization of essential genes in the human genome. Science 350, 1096–1101 (2015). 15. Hart, T. et al. High-Resolution CRISPR Screens Reveal Fitness Genes and Genotype- Specific Cancer Liabilities. Cell 163, 1515–1526 (2015). (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425918doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425918 16. Wang, T. et al. Gene Essentiality Profiling Reveals Gene Networks and Synthetic Lethal Interactions with Oncogenic Ras. Cell 168, 890-903.e15 (2017). 17. Chan, E. M. et al. WRN helicase is a synthetic lethal target in microsatellite unstable cancers. Nature 568, 551–556 (2019). 18. Adamson, B. et al. A Multiplexed Single-Cell CRISPR Screening Platform Enables Systematic Dissection of the Unfolded Protein Response. Cell 167, 1867-1882.e21 (2016). 19. Wainberg, M. et al. A genome-wide almanac of co-essential modules assigns function to uncharacterized genes. http://biorxiv.org/lookup/doi/10.1101/827071 (2019) doi:10.1101/827071. 20. Lenoir, W. F., Lim, T. L. & Hart, T. PICKLES: the database of pooled in-vitro CRISPR knockout library essentiality screens. Nucleic Acids Res 46, D776–D780 (2018). 21. Bausch-Fluck, D. et al. A Mass Spectrometric-Derived Cell Surface Protein Atlas. PLoS One 10, (2015). 22. O’Connor, M. J. Targeting the DNA Damage Response in Cancer. Mol. Cell 60, 547–560 (2015). 23. Zimmermann, M. et al. CRISPR screens identify genomic ribonucleotides as a source of PARP-trapping lesions. Nature 559, 285–289 (2018). 24. Pan, X. et al. FANCM, BRCA1, and BLM cooperatively resolve the replication stress at the ALT telomeres. PNAS 114, E5940–E5949 (2017). 25. Lou, K., Gilbert, L. A. & Shokat, K. M. A Bounty of New Challenging Targets in Oncology for Chemical Discovery. Biochemistry 58, 3328–3330 (2019). 26. Narayan, G. et al. Promoter Hypermethylation of FANCF: Disruption of Fanconi Anemia- BRCA Pathway in Cervical Cancer. Cancer Res 64, 2994–2997 (2004). (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425918doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425918 27. Ideker, T., Dutkowski, J. & Hood, L. Boosting signal-to-noise in complex biology: prior knowledge is power. Cell 144, 860–863 (2011). 28. Chang, M. T. et al. Identifying recurrent mutations in cancer reveals widespread lineage diversity and mutational specificity. Nat. Biotechnol. 34, 155–163 (2016). 29. Lou, K. et al. KRASG12C inhibition produces a driver-limited state revealing collateral dependencies. Sci Signal 12, (2019). 30. Cancer Disparities - National Cancer Institute. https://www.cancer.gov/about- cancer/understanding/disparities (2016). 31. Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 15, 550 (2014). 32. Rubin, J. B. et al. Sex differences in cancer mechanisms. Biol Sex Differ 11, (2020). 33. Gillen, A. E. et al. Molecular characterization of gene regulatory networks in primary human tracheal and bronchial epithelial cells. J. Cyst. Fibros. 17, 444–453 (2018). 34. Mj, K. et al. Prognostic Significance of CD151 Overexpression in Non-Small Cell Lung Cancer. Lung cancer (Amsterdam, Netherlands) vol. 81 https://pubmed.ncbi.nlm.nih.gov/23570797/ (2013). 35. Ko, Y. H. et al. Prognostic significance of CD44s expression in resected non-small cell lung cancer. BMC Cancer 11, 340 (2011). 36. Penno, M. B. et al. Expression of CD44 in Human Lung Tumors. Cancer Res 54, 1381–1387 (1994). 37. Bailey, M. H. et al. Comprehensive Characterization of Cancer Driver Genes and Mutations. Cell 173, 371-385.e18 (2018). (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425918doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425918 38. Bray, N. L., Pimentel, H., Melsted, P. & Pachter, L. Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol 34, 525–527 (2016). (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425918doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425918 Count sgRNAs abundance by deep sequencing to measure gene/drug phenotypes T0 SampleCRISPR Hela cell line Lentiviral transduction of genome-scale CRISPR sgRNA library Olaparib Untreated 1 1 3 2 Hela Cell Line CAL51 Cell Line KPL1 Cell Line ZR751 Cell Line ... COV362 Cell Line JHOS2 Cell Line TOV31G Cell Line ... Breast cancer Cervical cancer Ovarian cancer CA B D E CanDI Integration Cancer Data Integrator Essentiality Mutation ... CanDI Cellular Genomics Functional Genomics Transcriptomics Proteomics Vs. 2 3 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425918doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425918 −40 −20 0 20 40 60 Differential Essentiality (Δ Average BF) −10.0 −7.5 −5.0 −2.5 0.0 2.5 5.0 7.5 10.0 PPP1R15B CFLAR NXT1 CTNNB1 SLC4A7 MANSC1 AHCYL1 ARHGEF10L MRPL20 EFCAB11 C ol on Non-Sigfnificant Differentially Expressed Differentially Essential Shown in Violin Plots PP P1 R1 5B CF LA R NX T1 CT NN B1 SL C4 A7 MA NS C1 AH CY L1 AR HG EF 10 L MR PL 20 EF CA B1 1 Gene −60 −40 −20 0 20 40 60 80 100 B ay es F ac to r Top Hit Female Top Hit Male −30 −20 −10 0 10 20 30 Differential Essentiality (Δ Average BF) −10.0 −7.5 −5.0 −2.5 0.0 2.5 5.0 7.5 D iff er en ti al E xp re ss io n ( Lo g2 (F C )) BCL2L1 GPI ENO1 RTCB PKM WAC PCID2 ARHGAP12 SLC19A2 GPR137 BC L2 L1 GP I EN O1 RT CB PK M W AC PC ID 2 AR HG AP 12 SL C1 9A 2 GP R1 37 Gene −50 −25 0 25 50 75 100 B ay es F ac to r −30 −20 −10 0 10 20 30 Differential Essentiality (Δ Average BF) −10 −5 0 5 10 15 20 CHMP3 CHMP5 HAUS6 WLS KATNB1 ID1 ACSL3 KCNE1 RUFY1 KRT16 Pa nc re as CH MP 3 CH MP 5 HA US 6 W LS KA TN B1 ID 1 AC SL 3 KC NE 1 RU FY 1 KR T1 6 Gene −50 −25 0 25 50 75 100 B ay es F ac to r Lu ng Negative Control Female Negative Control Male Essential Gene ThresholdM or e Es se nt ia l Le ss E ss en tia l M or e Es se nt ia l Le ss E ss en tia l M or e Es se nt ia l Le ss E ss en tia l Female Cell LinesMale Cell Lines More Essential In More Essential In Male Cell Lines More Essential In Female Cell Lines More Essential In Male Cell Lines More Essential In Female Cell Lines More Essential In U p re gu la te d In U p re gu la te d In D iff er en ti al E xp re ss io n ( Lo g2 (F C )) U p re gu la te d In M al e C el l L in es U p re gu la te d In Fe m al e C el l L in es D iff er en ti al E xp re ss io n ( Lo g2 (F C )) U p re gu la te d In U p re gu la te d In M al e C el l L in es Fe m al e C el l L in es M al e C el l L in es Fe m al e C el l L in es A B C D E F (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425918doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425918 0 2 4 6 8 10 12 14 16 Log2(Fold Change) 0 10 20 30 40 50 60 70 80 -L og 10 (Q V al ue ) CD151 SLC4A2 B2M ITGA3 SLC3A2 HLA-C CD44 LRPAP1 DDR1 VDAC2 SLC29A1 SLCO4A1 KRAS Mutant CD151 SLC4A2 B2M ITGA3 SLC3A2 HLA-C CD44 LRPAP1 DDR1 VDAC2 SLC29A1 SLCO4A1 Gene 0 2 4 6 8 10 12 14 Lo g2 ( TP M + 1 ) KRAS Mutant Cell Line Type Benign Bronchial Malignant 0 2 4 6 8 10 12 14 16 Log2(Fold Change) 0 10 20 30 40 50 -L og 10 (Q V al ue ) B2M SLC4A2 CD151 ITGA3 ATP1A1 SLC3A2 CD44DDR1 HLA-CLRPAP1 ITGA5 TFPI EGFR Mutant B2M SLC4A2 CD151 ITGA3 ATP1A1 SLC3A2 CD44 DDR1 HLA-C LRPAP1 ITGA5 TFPI Gene 0 2 4 6 8 10 12 14 Lo g2 ( TP M + 1 ) EGFR Mutant 0 5 10 15 20 25 Log2(Fold Change) 0 10 20 30 40 -L og 10 (Q V al ue ) B2M CD151 THY1 SLC3A2 SLC4A2 LRPAP1 HLA-C DDR1 SLC29A1 ITGA3 PTGFRN VDAC2 All Lung Cancer B2M CD151 THY1 SLC3A2 SLC4A2 LRPAP1 HLA-C DDR1 SLC29A1 ITGA3 PTGFRN VDAC2 Gene 0 2 4 6 8 10 12 14 Lo g2 ( TP M + 1 ) All Lung Cancer Location Confidence 6 7 8 9 10 A B C D E F (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425918doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425918 Gene Essentiality in KRAS MT Cell Lines (Average BF) G en e Es se nt ia lit y in K R AS W T C el l L in es ( Av er ag e BF ) KRAS EGFR KRAS EGFR More EssentialLess Essential M ore Essential Less Essential Essential Gene Threshold EGFR MT Included EGFR MT Removed Gene Essentiality in EGFR MT Cell Lines (Average BF) G en e Es se nt ia lit y in E G FR W T C el l L in es ( Av er ag e BF ) KRAS EGFR KRAS EGFR More EssentialLess Essential M ore Essential Less Essential Essential Gene Threshold KRAS MT Included KRAS MT Removed A B C Es se nt ia lit y Nonsense Tumor Supressor Genes Context Speci�c 0 Effect Size 0.0 BRAF/BRAF NRAS/NRAS KRAS/KRAS HRAS/HRAS 0 Effect Size 0 Effect Size 0 KRAS/KRAS NRAS/NRAS BRAF/BRAF HRAS/HRAS NRAS/KRAS Non-Hit Signi�cant Hit Essentiality/Mutation Missense All Mutations Nonsense E F G More Essential Less Essential 0.00 0.05 1.00 P-value D Missense Oncogenes Tumor Supressor Genes Context Speci�c Mutations (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425918doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425918 A (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2021.01.08.425918doi: bioRxiv preprint https://doi.org/10.1101/2021.01.08.425918