Integrated cross-study datasets of genetic dependencies in cancer

Clare Pacini (1,2), Joshua M. Dempster (3), Isabella Boyle (3), Emanuel Gonçalves (1), Hanna Najgebauer (1,2,4), Emre Karakoc (1,2), Dieudonne van der Meer (1), Andrew Barthorpe (1), Howard Lightfoot (1), Patricia Jaaks (1), James M. McFarland (3), Mathew J. Garnett (1,2), Aviad Tsherniak (3), Francesco Iorio (1,2,5,*)

1 Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
2 Open Targets, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
3 Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, MA 02142, USA
4 European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Cambridge CB10 1SA, UK
5 Human Technopole, Via Cristina Belgioioso 147, 20157 Milano, Italy
* Corresponding author: francesco.iorio@sanger.ac.uk

Abstract

CRISPR-Cas9 viability screens are increasingly performed at a genome-wide scale across large panels of cell lines to identify new therapeutic targets for precision cancer therapy. Integrating the datasets resulting from these studies is necessary to adequately represent the heterogeneity of human cancers and to assemble a comprehensive map of cancer genetic vulnerabilities. Here, we integrated the two largest public independent CRISPR-Cas9 screens performed to date (at the Broad and Sanger institutes) by assessing, comparing, and selecting methods for correcting biases due to heterogeneous single guide RNA efficiency, gene-independent responses to CRISPR-Cas9 targeting originating from copy number alterations, and experimental batch effects. Our integrated datasets recapitulate findings from the individual datasets, provide greater statistical power to cancer- and subtype-specific analyses, unveil additional biomarkers of gene dependency, and improve the detection of common essential genes.
We provide the largest integrated resources of CRISPR-Cas9 screens to date and the basis for harmonizing existing and future functional genetics datasets.

bioRxiv preprint doi: https://doi.org/10.1101/2020.05.22.110247; this version posted January 6, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Cancer is a complex disease that can arise from multiple different genetic alterations. The alternative mechanisms by which cancer can evolve result in considerable heterogeneity between patients, with the vast majority of them not benefiting from approved targeted therapies [1]. In order to identify and prioritize new potential therapeutic targets for precision cancer therapy, analyses of cancer vulnerabilities are increasingly performed at a genome-wide scale and across large panels of in vitro cancer models [2–11]. This has been facilitated by recent advances in genome editing technologies allowing unprecedented precision and scale via CRISPR-Cas9 screens. Of particular note are two large pan-cancer CRISPR-Cas9 screens that have been independently performed by the Broad and Sanger institutes [2,12]. The two institutes have also joined forces with the aim of assembling a joint comprehensive map of all the intracellular genetic dependencies and vulnerabilities of cancer: the Cancer Dependency Map (DepMap) [13,14]. The two generated datasets collectively contain data from over 1,000 screens of more than 900 cell lines. However, it has been estimated that the analysis of thousands of cancer models will be required to detect cancer dependencies across all cancer types [3].
Consequently, the integration of these two datasets will be key for the DepMap and other projects aiming at systematically probing cancer dependencies. These integrated datasets will provide a more comprehensive representation of heterogeneous cancer types and form the basis for the development of effective new therapies with associated biomarkers for patient stratification [15]. Further, designing robust standards and computational protocols for the integration of these types of datasets will mean that future releases of data from CRISPR-Cas9 screens can be integrated and analyzed together, paving the way to even larger cancer dependency resources.

We have previously shown that the pan-cancer CRISPR-Cas9 datasets independently generated at the Broad and Sanger institutes are consistent on the domain of 147 commonly screened cell lines [16]. The reproducibility of these CRISPR screens holds despite extensive differences in the experimental pipelines underlying the two datasets, including distinct CRISPR-Cas9 sgRNA libraries.

Here we investigate the integrability of the full Broad/Sanger gene dependency datasets, yielding the most comprehensive cancer dependency resource to date, encompassing dependency profiles of 17,486 genes across 908 different cell lines that span 26 tissues and 42 different cancer types. We compare different state-of-the-art data processing methods to account for heterogeneous single-guide RNA (sgRNA) on-target efficiency, and to correct for gene-independent responses to
CRISPR-Cas9 targeting [12,17,18], evaluating their performance on common use cases for CRISPR-Cas9 screens (Figure 1a, 1b and 1c).

Figure 1: Schematic of the integration strategy. a. Broad and Sanger gene dependency datasets (raw count data of single-guide RNAs) are downloaded from the respective web portals. b. The datasets from each institute are pre-processed with three different methods, accounting for gene-independent responses to CRISPR-Cas9 targeting (arising from copy number amplifications) and heterogeneous sgRNA efficiency, providing gene-level corrected depletion fold changes. Then, four different batch-correction pipelines are applied to the gene-level fold changes across the two institute datasets for each of the pre-processing methods. c. Twelve different integrated datasets resulting from applying three different pre-processing methods (as indicated by the border colors) and four different batch-correction pipelines (as indicated by the fill colors) are benchmarked. d. Advantages provided by the final integrated datasets and conservation of analytical outcomes from the individual ones are investigated.

We show that our integration strategy accounts for and corrects technical biases whilst preserving gene dependency heterogeneity, and that it recapitulates established associations between molecular features and gene dependencies.
We highlight the benefits of the integrated dataset over the two individual ones in terms of improved coverage of the genomic heterogeneity across different cancer types, identification of new biomarker/dependency associations, and increased reliability of human core-fitness/common-essential genes (Figure 1d). Finally, we estimate the minimal size (in terms of the number of screened cell lines) required in order to effectively correct batch effects when integrating a new dataset. Collectively, this study presents a robustly benchmarked framework for integrating independently generated CRISPR-Cas9 datasets, providing the most comprehensive resource for the exploration of cancer dependencies and the identification of new oncology therapeutic targets.

Results

Overview of the integrated CRISPR-Cas9 screens

The Sanger’s Project Score CRISPR-Cas9 dataset (part of the Sanger DepMap) [19] and the Broad’s 20Q2 DepMap dataset [20,21] contain data for 317 and 759 cell lines, respectively. Overall, these represent screens for 908 unique cell lines (Figure 2a, Supplementary Table 1). Together these cell lines spanned 26 different tissues (Figure 2b), and for 16 of these the number of cell lines covered increased when considering both datasets together. Similarly, the integrated dataset provided richer coverage of specific cancer types and clinically relevant subtypes (Figure 2c).
These preliminary observations highlight the first benefit of combining these resources to increase statistical power for tissue-specific as well as pooled pan-cancer analyses.

Between the two datasets, there was an overlap of 168 cell lines screened by both institutes, encompassing 16 different tissue types (median = 8, min 1 for Soft Tissue, Biliary Tract and Kidney, max 28 for Lung, Figure 2a and 2b). The set of overlapping cell lines enabled the estimation of batch effects due to differences in the experimental protocols underlying the two datasets [16], without biasing the correction toward specific cell line lineages.

Figure 2. Overview of CRISPR-Cas9 screened cancer cell lines. a. Number of cell lines screened by the Broad and the Sanger institutes and their overlap. b. Overview of the number of cell lines screened for each tissue type across the two datasets. c. Number of screened Lung cancer and Breast cancer cell lines split according to cancer types and PAM50 subtypes, respectively, across the two datasets.

Data Pre-processing

Known biases in CRISPR screens arise due to nonspecific cutting toxicity that increases with copy number amplifications (CNAs) [22,23] and heterogeneous levels of on-target efficiency across sgRNAs targeting the same gene [24]. Multiple methods exist to correct for these biases.
Here, we evaluate three: CRISPRcleanR, an unsupervised nonparametric CNA effect correction method for individual genome-wide screens [17]; CCR-JACKS, which combines CRISPRcleanR with JACKS, a Bayesian method that accounts for differences in guide on-target efficacy through joint analysis of multiple screens [18]; and CERES, a method that simultaneously corrects for CNA effects and accounts for differences in guide efficacy [12], also analyzing screens jointly.

Batch effect correction

Technical differences in screening protocols, reagents and experimental settings can cause batch effects between datasets. These batch effects can arise from factors that vary within institute screens (for example, differences in control batches and Cas9 activity levels) as well as between institutes (such as differences in assay lengths and employed sgRNA libraries). When focusing on the set of cell lines screened at both institutes, a Principal Component Analysis (PCA) of the cell line dependency profiles across genes (DPGs) highlighted a clear batch effect determined by the origin of the screen, irrespective of the pre-processing method, consistent with previous results (Figure 3a) [16]. We quantile-normalized each cell line DPG and adjusted for differences in screen quality in the individual Broad/Sanger datasets.
The combined Broad/Sanger dataset was then batch corrected using ComBat [25] (Methods). Following ComBat correction, the combined datasets on the overlapping cell lines showed reduced yet persistent residual batch effects, clearly visible along the first two principal components (Supplementary Figure 1). Analysis of the first two principal components (using MSigDB gene signatures [26] and all cell lines, Methods) showed enrichment for metabolic processes (phosphorus metabolic process q-value = 1.06e-08, protein metabolic process q-value = 8.70e-07, hypergeometric test) in the first principal component. The enrichment of metabolic processes is consistent with differences identified across these datasets due to different media conditions employed in the underlying experimental pipelines [27,28]. The second principal component contained significant enrichments for protein complex organisation and assembly (q-value = 1.57e-16 and 5.28e-11 respectively, hypergeometric test) (Supplementary Table 2), which have no obvious associations with technical biases found in CRISPR-Cas9 screens.

Based on these results, we considered four different batch correction pipelines and evaluated their use in our integrative strategy. In the first pipeline, we processed the combined Broad/Sanger DPG dataset using ComBat alone (ComBat). In the second, we applied a second round of quantile normalization following ComBat correction (ComBat+QN) to account for different phenotype intensities across experiments, which result in different ranges of gene dependency effects. In the third and fourth pipelines we also removed the first one or the first two principal components, respectively (ComBat+QN+PC1 and ComBat+QN+PC1-2). The final 12 datasets contained data from unique screens of 908 cell lines, one dataset for each combination of the three pre-processing methods outlined in the previous section and the four batch correction pipelines.
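The quantile normalization step used in the ComBat+QN pipelines can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `quantile_normalize`, the list-of-lists input layout and the tie-breaking rule are assumptions made for illustration only, and the ComBat and principal component removal steps are not shown.

```python
from statistics import mean


def quantile_normalize(profiles):
    """Quantile-normalize equal-length profiles (one list of gene scores per screen).

    Each profile is mapped onto a shared reference distribution, defined here as
    the mean across profiles of the values found at each sorted rank.
    Ties are broken by gene position (an assumption of this sketch).
    """
    n_genes = len(profiles[0])
    if any(len(p) != n_genes for p in profiles):
        raise ValueError("all profiles must cover the same genes")

    # Reference distribution: mean of the r-th smallest value across profiles.
    sorted_profiles = [sorted(p) for p in profiles]
    reference = [mean(sp[r] for sp in sorted_profiles) for r in range(n_genes)]

    normalized = []
    for p in profiles:
        # Gene indices ordered from most to least depleted in this profile.
        order = sorted(range(n_genes), key=lambda i: p[i])
        out = [0.0] * n_genes
        for rank, gene_index in enumerate(order):
            out[gene_index] = reference[rank]
        normalized.append(out)
    return normalized


# Two toy screens with different dynamic ranges end up on the same scale.
toy = [[-2.0, 0.0, -1.0], [-4.0, -1.0, 0.0]]
print(quantile_normalize(toy))
```

After normalization every profile contains the same set of values, so differences in phenotype intensity between screens no longer translate into different score ranges, while the gene ranking within each screen is preserved.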
To assess the performance of different batch correction pipelines we estimated, using the overlapping cell lines, the extent to which each cell line DPG from one study matched that of its counterpart (derived from the same cell line) from the other study following batch correction. To quantify the agreement, we calculated for each DPG its similarity to all other screen DPGs using a weighted Pearson’s (wPearson) correlation (Methods). We then calculated the proximity of a cell line to its counterpart compared to all other cell lines using the wPearson as a metric (Recall of cell line identity) (Figure 3b). The best performances were obtained when removing either the first or the first two principal components following ComBat and quantile normalization, i.e. ComBat+QN+PC1 or ComBat+QN+PC1-2. Across pre-processing methods, CERES performed best with 302 (90%) of the cell lines being closest to their counterpart from the other study (k = 1), followed by CRISPRcleanR with 272 cell lines (81%) and CCR-JACKS with 215 (64%). The Recall of cell line identity was high for each integration pipeline, with normalized area under the curve (nAUC) values of 0.98 for CCR-JACKS and 0.99 for CRISPRcleanR and CERES when considering the best performing ComBat+QN+PC1-2 batch correction method.
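The recall of cell line identity described above can be sketched as follows. This is a hedged illustration rather than the published code: the helper names `weighted_pearson` and `counterpart_rank` are hypothetical, and the gene weights are passed in as plain numbers instead of being derived from gene selectivity.

```python
import math


def weighted_pearson(x, y, w):
    """Weighted Pearson correlation between two profiles with per-gene weights w."""
    total = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / total
    my = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    return cov / math.sqrt(vx * vy)


def counterpart_rank(query, counterpart, others, weights):
    """Rank (1 = closest) of the same-cell-line profile among all candidates.

    `others` holds the profiles of all remaining screens; a rank of 1 means the
    counterpart from the other study is the most correlated profile.
    """
    target = weighted_pearson(query, counterpart, weights)
    closer = sum(1 for o in others if weighted_pearson(query, o, weights) > target)
    return closer + 1


# Toy example: the counterpart profile is the nearest neighbour of the query.
query = [0.0, -1.0, -2.0]
counterpart = [0.0, -1.1, -1.9]
others = [[-2.0, -1.0, 0.0], [0.0, 0.0, -1.0]]
print(counterpart_rank(query, counterpart, others, [1.0, 1.0, 1.0]))
```

The fraction of screens with a counterpart rank of at most k, traced over increasing k, gives the kind of recall curve from which a k = 1 percentage and an nAUC can be read.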
Figure 3: Batch effect assessment and correction. a. Principal component plots of the dependency profile across genes (DPGs) for cell lines screened in both Broad and Sanger studies and pre-processing methods. Screens are colored by the institute of origin. b. Percentages of cell line DPGs that have the corresponding (same cell line) DPG screened at the other institute among their k most correlated DPGs (the k-neighborhood). Results are shown across different pre-processing methods (in different plots) and different batch correction pipelines (as indicated by the different colors). Correlations between DPGs are computed using a weighted Pearson correlation metric. Genes with higher selectivity have a larger weight in the correlation calculation. As a measure of selectivity we used the average (across the two individual datasets) skewness of a gene’s dependency profile across cell lines. The proportion of cell lines closest to their counterpart from the other study (k = 1) is shown and the normalised areas under the curves (nAUC) are shown in brackets. The x-axis values are restricted to the range 1-100 to highlight the range over which performance differences are visible between datasets.
Performance of the integration pipelines

We evaluated the performance of each of the 12 integrated datasets, containing 908 cell lines, under four use cases: the identification of i) essential and non-essential genes, ii) lineage subtypes, iii) biomarkers of selective dependencies, and iv) functional relationships.

Identification of essential and non-essential genes

A cell line DPG with a large separation between the dependency scores (DS) of common essential and non-essential genes should yield lower misclassification rates when identifying dependencies specific to that cell line. For each cell line we measured the separation of DS between known common essential and non-essential genes [11] across all integrated datasets. As a measure of separation we used the null-normalized mean difference (NNMD) [29], defined as the difference between the mean DS of the common essential genes and that of the non-essential genes, divided by the standard deviation of the DS of the non-essential genes. By analysing multiple screens jointly, CERES and JACKS borrow essentiality signal information across screens. As a consequence, these methods better identify consistent signals across cell line DPGs (i.e. for common essential and non-essential genes), especially for DPGs derived from lower quality experiments, or reporting weaker depletion phenotypes [18,23]. Consistently, CERES (median NNMD range [-5.78, -5.88]) showed better NNMD values than CRISPRcleanR (median NNMD range [-5.02, -5.12], Wilcoxon test (WT) p-value < 2.2e-16) and CCR-JACKS (median NNMD range [-5.14, -5.23], WT p-value < 2.2e-16), and similarly CCR-JACKS had better NNMD values than CRISPRcleanR (largest WT p-value < 0.0005) (Figure 4a).
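The NNMD definition above translates directly into a few lines of code. The following is a minimal sketch under stated assumptions: the function name `nnmd` is illustrative, and the use of the sample (n - 1) standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev


def nnmd(essential_scores, nonessential_scores):
    """Null-normalized mean difference for one cell line profile.

    Difference between the mean dependency score of common essential genes and
    that of non-essential genes, divided by the standard deviation of the
    non-essential scores. More negative values indicate better separation.
    The sample standard deviation is an assumption of this sketch.
    """
    difference = mean(essential_scores) - mean(nonessential_scores)
    return difference / stdev(nonessential_scores)


# Toy profile: essentials centred near -1, non-essentials centred near 0.
essential = [-1.0, -1.2, -0.8]
nonessential = [0.1, -0.1, 0.0, 0.2, -0.2]
print(round(nnmd(essential, nonessential), 3))
```

Because the denominator only uses the non-essential genes, the score measures how far the essential genes sit from the null distribution in units of null spread, which makes values comparable across screens with different noise levels.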
Comparing the batch correction methods, ComBat+QN+PC1-2 had marginally better performance across all pre-processing methods.

Next, we evaluated the gene dependency false-positive rates across all integrated datasets. For each cell line DPG, we defined a set of putative negative controls composed of genes not expressed at the basal level in that cell line (Methods). The false-positive rate was calculated as the number of negative controls identified as significant dependencies (in the top 15% most depleted genes), normalized by the total number of negative controls in the DPG. There was little difference in false-positive rates across the four different batch correction pipelines, with a slight improvement when two principal components were removed (Figure 4b). CERES outperformed CCR-JACKS significantly for all batch correction methods (largest χ² contingency table p-value 1.87 × 10^-11, N = 1.43 × 10^7) and CCR-JACKS outperformed CRISPRcleanR (p-value below machine precision). Comparing the correction methods, the differences between ComBat and ComBat+QN and between ComBat+QN+PC1 and ComBat+QN+PC1-2 were generally not significant across preprocessing methods, while the differences between either ComBat or ComBat+QN and either ComBat+QN+PC1 or ComBat+QN+PC1-2 were generally significant (largest p-value 1.42 × 10^-5).
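The false-positive rate described above can be sketched for a single cell line profile. This is an illustration under assumptions: the function name `false_positive_rate`, the dictionary input and the rounding of the top-15% cutoff are choices made here, not details taken from the Methods.

```python
def false_positive_rate(scores, unexpressed, top_fraction=0.15):
    """Fraction of unexpressed (negative control) genes called as dependencies.

    scores: dict mapping gene -> dependency score (more negative = more depleted).
    unexpressed: iterable of genes not expressed in this cell line.
    A gene is called a dependency if it falls in the `top_fraction` most
    depleted genes of the profile (cutoff rounded to the nearest gene count).
    """
    ranked = sorted(scores, key=scores.get)      # most depleted first
    n_top = round(len(ranked) * top_fraction)
    called = set(ranked[:n_top])

    controls = [g for g in unexpressed if g in scores]
    if not controls:
        raise ValueError("no negative controls present in this profile")
    return sum(g in called for g in controls) / len(controls)


# Toy profile of 20 genes; g0 is the most depleted, g19 the least.
toy_scores = {f"g{i}": -2.0 + 0.1 * i for i in range(20)}
print(false_positive_rate(toy_scores, {"g1", "g10", "g15", "g19"}))
```

In this toy profile the three most depleted genes are called as dependencies, and one of the four negative controls falls among them.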
As a final test of control separation, we used the unexpressed genes as an empirical null distribution for each DPG to estimate p-values for all DS and thus false discovery rates (FDRs) within each DPG. We calculated the recall of a reference set of common essential genes [11] at 10% FDR (Figure 4c). Again CERES outperformed CCR-JACKS, which outperformed CRISPRcleanR, and increasing the number of steps in the batch correction pipeline monotonically improved essential recall for all preprocessing methods. All differences between preprocessing methods and batch correction methods were significant, with the largest observed t-test (related) p-value 1.96 × 10^-3 (N = 830).

Figure 4: Use case recall of essential genes and lineage identification. a. Null-normalized mean difference (NNMD, a measure of separation between dependency scores of prior-known essential and non-essential genes): defined as the difference in means between dependency scores of essential and non-essential genes divided by the standard deviation of dependency scores of the non-essential genes. Lower values of NNMD indicate better separation of essential genes and non-essential genes. b. False positive rates across all pre-processing methods and batch-correction pipelines. In the gene dependency profile of a given cell line, a significant dependency gene was called a false positive if that gene was not expressed in that cell line. c.
Recall of known essential genes across all pre-processing methods and batch-correction pipelines at 10% FDR. d. Agreement between cell line clusters based on DPG correlation and tissue lineage labels of corresponding cell lines, across pre-processing methods and batch-correction pipelines. e. Agreement of Lung CRISPR-Cas9 fitness profiles according to the Lung cancer subtypes. For each query Lung cancer cell line in turn we computed correlation scores to all other Lung cancer cell lines (responses). We then ranked the response cell lines according to these correlations. For each query cell line, the rank position k of the most correlated response cell line from the same cancer subtype (matching response) was identified. A rank of k = 1 indicates that the query cell line was closest to another cell line from the same cancer subtype. The curves show the ratio of query cell lines with a matching response within a given rank position. The proportion of query cell lines with a matching response at k = 1 is also shown as a percentage for each dataset. The normalised area under the curve (nAUC) for each dataset is shown in brackets. The figure shows the x-axis zoomed in to between 0 and 60.

Identification of lineage subtypes

Many dependencies are context specific, reducing cellular fitness only in a subset of lineages; such dependencies can be used to elucidate gene function and identify cancer type specific vulnerabilities.
To evaluate the ability of the integrated datasets to recapitulate tissue lineages and clinical subtypes we first estimated the extent of conserved similarity between screens of cell lines derived from the same tissue lineage. We evaluated the tendency of screens of cell lines from the same lineage to yield similar results by comparing unsupervised clusterings of the batch-corrected cell line DPGs to the lineage labels of the cell lines. To this aim, we performed one hundred k-means clusterings of each of the 12 datasets, with k equal to the number of tissue lineages screened in at least one study. We then calculated the adjusted mutual information (AMI, Methods) between each DPG clustering and the partition of the cell lines induced by their lineage labels. We observed higher than chance AMI between the obtained k clusters and the tissue lineages of the cell line DPGs, regardless of the starting batch-corrected dataset (largest single-sample t-test p-value of 3.59 × 10^-135, N = 100, Figure 4d). Under each pre-processing method the removal of one or two principal components resulted in an increased AMI between cell line DPG clusters and tissue lineages.

We next measured the ability of each of the integrated datasets to separate cell lines according to lineage subtypes. The integrated datasets contain over 100 Lung cell lines. These cell lines can further be stratified into subtypes such as Small cell lung carcinoma and Mesothelioma, whilst clinical subtypes such as PAM50 classifications are available for the Breast cancer cell lines (Figure 2c). To quantify the clustering of cell lines by subtype we calculated the correlation between all cell line DPGs and, for a given query cell line, the rank of the most correlated cell line DPG from the same subtype as the query (k-rank).
For the Lung cancer cell lines, the percentage of cell lines whose closest neighbour was from the same subtype (k = 1) was greatest for CERES (64-65% across batch correction methods), followed by CRISPRcleanR (61-64%) and CCR-JACKS (50-57%), with slight improvement with the removal of 1 or 2 principal components (Figure 4e). The normalised area under the curve (nAUC) values showed little variation across batch correction methods and were broadly similar between the pre-processing methods: CERES (Lung = 0.96, Breast = 0.91-0.92), CCR-JACKS (Lung = 0.95-0.96, Breast = 0.84-0.85) and CRISPRcleanR (Lung = 0.96-0.97, Breast = 0.89-0.90) (Supplementary Figure 2).

Identification of biomarkers

Genes that show a pattern of selective dependency, i.e. that exert a strong reduction of viability upon CRISPR-Cas9 targeting in a subset of cell lines, are of particular interest as potential novel therapeutic targets. Furthermore, these selective dependencies are often associated with molecular features that may explain their dependency profiles (biomarkers). We investigated each of the integrated datasets’ ability to reveal tissue-specific biomarkers of dependencies. As potential biomarkers we used a set of 676 clinically relevant cancer functional events (CFEs [30]), across 17 different tissue types. The CFEs encompass mutations in cancer driver genes, amplifications/deletions of chromosomal segments recurrently altered in cancer, hypermethylated gene promoters and microsatellite instability status.
For each CFE and tissue type, we performed a Student’s t-test for each selective gene dependency (SGD, Methods) contrasting two groups of cell lines based on the status of the CFE under consideration (present/absent), for a total number of 2,142,162 biomarker/dependency pairs tested. The total number of significant biomarker/dependency associations showed little variation across batch-correction methods at 5% FDR. However, a significantly larger number of biomarker/dependency associations were identified when using CRISPRcleanR compared to CCR-JACKS (largest p-value 1.0e-14, proportion test) or CERES (largest p-value 3.60e-10, proportion test), whilst little significant difference was found between CCR-JACKS and CERES (smallest p-value 0.038, proportion test) (Figure 5a, Supplementary Table 3). Similar results were seen when the CFEs were split according to whether the biomarker was a mutation, recurrent copy number alteration or hypermethylated region (Supplementary Figure 3).

We next examined the ability of each dataset to recover known selective dependencies in individual cell lines. We downloaded a set of oncogenic gene alterations from OncoKB [31,32].
After filtering for genes that tend to be common essentials (mean dependency score lower than -0.5 in the CRISPRcleanR-ComBat dataset, where -1 is the median of scores of known common essentials), we considered the oncogenes as positive controls in cell lines where they had indicated oncogenic or likely-oncogenic gain-of-function alterations, and negative controls in all others. For each oncogene, we measured the NNMD between positive and negative cell lines (Figure 5b). We found little difference in median performance by either preprocessing method or batch correction method. We then collected the dependency scores of all oncogenes in cell lines with a corresponding oncogenic alteration and measured the receiver operating characteristic (ROC) AUC between them and the dependency scores of the same genes in cell lines without oncogenic alterations (Figure 5c). By this measure, CRISPRcleanR outperformed CERES by 2.2% and CCR-JACKS by 4.0%, with minimal variations across batch correction methods.

Recovery of functional relationships

We tested the ability of each dataset to identify expected dependency relations between paralogs, gene pairs coding for interacting proteins, or members of the same complex using gene pair annotations from publicly available databases [33–35] (Methods). For each pair of genes known to have a functional relationship, we selected a random pair of genes with similar mean dependency scores across cell lines to serve as null examples. We calculated the false discovery rate for the known pairs using the absolute Pearson correlation of their dependency profiles versus those of the null examples. Recovery of known relationships was unsurprisingly low, since many genes with known functional relationships do not exhibit selective viability phenotypes. ComBat+QN+PC1 or PC1-2 recovered the greatest number of expected gene dependency relations at 10% FDR (Figure 5d).
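The ROC AUC comparison of oncogene dependency scores described earlier in this section can be illustrated with a short sketch. It is not the authors' code: the function name `roc_auc` is hypothetical, and treating a lower (more depleted) score as the positive direction is an assumption based on the convention that dependencies have negative scores.

```python
def roc_auc(positive_scores, negative_scores):
    """ROC AUC computed as a pairwise rank statistic.

    Returns the probability that a randomly chosen positive score (oncogene in a
    cell line carrying the oncogenic alteration) is lower, i.e. more depleted,
    than a randomly chosen negative score. Ties count as half a win.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p < n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Toy example: altered cell lines tend to show stronger depletion.
altered = [-1.5, -0.9, -0.2]
unaltered = [-0.3, 0.0, 0.1, -1.0]
print(roc_auc(altered, unaltered))
```

An AUC of 0.5 would indicate no separation between altered and unaltered cell lines, and 1.0 would indicate that every altered cell line is more dependent on the oncogene than every unaltered one. The pairwise loop is quadratic, which is acceptable for a sketch but would normally be replaced by a rank-based computation for large inputs.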
Figure 5: Use case: biomarkers and functional relationships. a. For each tissue, pairs of Cancer Functional Events (CFEs) and dependencies were tested for significant associations between the gene dependency and the absence/presence of the biomarker (CFE). The bar chart shows the total number of significant associations at 5% FDR across tissue types for each of the integrated datasets. b. The per-oncogene NNMD between cell lines with and without an indicated oncogenic gain-of-function alteration (more negative is better). c. For all identified oncogenes collectively, the receiver operating characteristic (ROC) AUC between oncogene scores in cell lines where they have an indicated gain-of-function mutation and cell lines where they do not. d. For each dataset, the number of known gene-gene relationships recovered at 10% FDR.

Final selection of pre-processing methods and batch-correction pipelines

Comparing the performance of batch correction methods across the use-cases, we found that ComBat+QN outperformed ComBat alone, and that additionally removing one or two principal components gave similar or noticeably better performance than ComBat+QN.
The principal component analysis indicated that ComBat+QN+PC1 corrected for linear and non-linear effects of technical confounders including assay length, guide library and media conditions. Removing the first two principal components offered little improvement over removing the first principal component alone, and we found no attributable technical bias in the gene sets enriched in the second principal component. Overall, we selected ComBat+QN+PC1 as the batch correction pipeline, as it performed well over all metrics and had a reduced impact on the data with respect to ComBat+QN+PC1-2, whilst still correcting for multiple technical biases.

Comparing the pre-processing methods, we found that CERES outperformed the other methods in identifying essential genes and lineage subtypes, that CRISPRcleanR showed higher performance in the biomarker association use case, and that these two methods performed comparably, and better than CCR-JACKS, in identifying known gene-gene relationships. In conclusion, we selected both CERES and CRISPRcleanR as processing methods and considered the two corresponding integrated datasets as the final results of our pipeline.

Advantages of the integrated datasets over the individual ones

In line with the results from all the use-cases, we estimated the benefits of the integrated datasets with respect to the individual ones in terms of increased capacity to unveil reliable sets of common essential genes (using CERES), as well as increased diversity of genetic dependencies and biomarker associations (using CRISPRcleanR). To evaluate the increased coverage of molecular diversity and genetic dependencies in the integrated dataset, we first estimated the increase in the number of detected gene dependencies with respect to the two individual datasets.
To this aim, using the CRISPRcleanR processed dataset, we quantified the number of genes significantly depleted in n cell lines (at 5% FDR, Methods), with n = 1, 3, 5 or n ≥ 10, in the integrated dataset as well as in the individual Broad and Sanger datasets. The integrated dataset identified more dependencies, indicating greater coverage of molecular features and dependencies than in the individual datasets (Supplementary Figure 4a).

We then evaluated the ability of the CERES processed integrated dataset to predict common essential genes, comparing its performance to that of the individual datasets and of two existing sets of common essential genes from recent publications: Behan2 and Hart36. We predicted common essential genes using two methods: the 90th-percentile method16 and the Adaptive Daisy Model (ADaM)2. The majority of genes called common essential by one of the ADaM or 90th-percentile methods were also identified by the other (1,482 out of 2,103, Supplementary Figure 4b). We assigned to each of the 2,103 common essential genes a tier based on the amount of evidence supporting their common essentiality. Tier 1, the highest confidence set, comprised the 1,482 genes found by both methods. Tier 2 comprised the 621 genes found by only one method (Supplementary Table 4).
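The tier assignment described above amounts to two set operations on the gene lists returned by the two methods. The following is a minimal sketch, not the code used in the study: the function name and the placeholder gene identifiers are hypothetical, and only the counts in the final comment come from the text.

```python
def assign_tiers(adam_genes, percentile_genes):
    """Split predicted common essentials into two confidence tiers.

    Tier 1: genes called by both methods (set intersection).
    Tier 2: genes called by exactly one method (symmetric difference).
    """
    tier1 = adam_genes & percentile_genes
    tier2 = adam_genes ^ percentile_genes
    return tier1, tier2


# Hypothetical placeholder identifiers, for illustration only.
adam_genes = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
percentile_genes = {"GENE_A", "GENE_B", "GENE_C", "GENE_E"}

tier1, tier2 = assign_tiers(adam_genes, percentile_genes)

# The two tiers partition the union of both calls, which is why the
# reported counts add up: 1,482 (Tier 1) + 621 (Tier 2) = 2,103.
assert len(tier1) + len(tier2) == len(adam_genes | percentile_genes)
```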
For each predicted set of common essential genes, we calculated the recall of known essential gene sets obtained from KEGG37 and Reactome38 pathways. These pathways included ribosomal protein genes, genes involved in DNA replication and components of the spliceosome (Methods). The integrated set of common essentials (Tier 1 and 2) showed greater recall of known essential genes compared to Behan and Hart, and increased recall over the individual datasets for 5 out of the 6 gene sets (Figure 6a). We next generated a set of 647 genes that were never expressed across the panel of cell lines, to serve as high confidence negative controls (Methods). We calculated the proportion of negative controls in each set of common essential genes. The best performance was for the Hart gene set (0%), followed by the integrated dataset (0.33%) (Figure 6b).

As the positive and negative controls did not cover all genes, we further investigated the genes predicted to be common essentials. The integrated dataset predicted the largest number of common essentials, with 233 genes found in the integrated dataset alone. These 233 genes were enriched for cell cycle genes (FDR 3.06e-9) and mitochondrial gene expression (FDR 3.66e-7), indicative of essential cellular processes. Similar results were observed for the 1,159 genes found in the integrated set of common essentials but in neither of the existing gene sets (Behan and Hart) (Supplementary Table 5).

We next asked whether the CRISPRcleanR processed integrated dataset was able to unveil additional significant gene dependencies and CFE/gene-dependency statistical interactions compared to either of the individual (Broad or Sanger) datasets.
Performing systematic biomarker analysis using CFEs on cell lines from individual tissue lineages unveiled 52 additional significant associations in the integrated dataset (when considering only CFE/gene-dependency pairs testable in the individual datasets at 1% FDR) with respect to those using the Sanger dataset alone, and 68 with respect to the Broad dataset (Supplementary Table 6). Examples included decreased dependency on MDM2 in TP53 mutant Lung cell lines for the Sanger dataset, and increased dependency on STAG1 in STAG2 mutated Central Nervous System cancer cell lines for the Broad dataset (Figure 6c). Furthermore, 19 tissue-specific significant associations identified in the integrated dataset were tested but not found significant in either the Broad or the Sanger dataset (Figure 6d).

Figure 6: Advantages of an integrated dataset. a. Recall of essential gene sets for the integrated dataset, across different tiers, compared to two previously published gene sets (Behan and Hart). b. Proportion of genes in the common essential gene sets that are constitutively not expressed across the panel of cell lines and therefore likely to be false positive results. c. Examples of significant associations between genes and features found in the integrated dataset compared to the individual datasets. d. Examples of significant associations found in the integrated dataset that were not significant in either of the individual datasets. e. The boxplots contain 50 random samples of between 5% and 90% of the 168 overlapping cell lines (number of cell lines in each sample indicated on the x-axis). For each sample, the Pearson correlation of the DPGs following ComBat correction compared to the integrated dataset was calculated for each pre-processing method. f. The average silhouette width (ASW) for each downsampled dataset was calculated using the institute of origin as the cluster label. An ASW close to zero indicates near-random clustering performance, meaning that the samples do not cluster by the origin of the screen and batch effects have been removed.

Sample size requirements for efficient data integration

To further increase the coverage of a cancer dependency map, new CRISPR-Cas9 screens should be integrated into the existing datasets as they are generated. To aid in this integration, we estimated the minimum number of overlapping cell lines that should be screened to efficiently calculate and correct batch effects. We performed a downsampling analysis on the 168 cell lines screened at both Sanger and Broad, with subsets ranging from 5% to 90% of these cell lines, and used each subset to estimate and correct batch effects using ComBat. Following this, for each cell line DPG generated at either institute, we computed the Pearson correlation between its profile after correction with the downsampled subset and its profile after batch correction using all 168 overlapping cell lines (Figure 6e). We found a high degree of correlation between datasets at all levels of downsampling, with the minimum of 8 samples still reducing batch effects when compared to no batch correction (N = 0) (Supplementary Figure 4c). We next evaluated the batch correction using the average silhouette width (ASW) of the clustering induced by the institute of origin of the cell lines, as a measure of the extent to which cell lines from the same institute clustered together. As expected, as the number of samples used to estimate and correct the batch effect decreases, the DPGs increasingly cluster by the batch of origin (Figure 6f). The ASW and Pearson correlation metrics both showed clear convergence with increasing sample size, and at the same rate. Given the convergence of these metrics, the results showed that the 168 overlapping cell lines used were sufficient to maximise the batch correction using ComBat. Further, the downsampling analysis showed that convergence was reached at 90 cell lines and that between 30 and 40 cell lines would be sufficient to provide a batch corrected dataset that is highly correlated (over 0.995) with that obtained when estimating and correcting batch effects using more than 90 cell lines.

The 168 overlapping cell lines contained cell lines from 16 different lineages. To investigate the impact of lineage composition of the cell lines on the batch correction we also
used a single lineage to estimate the batch effects. In the overlapping cell lines, the Lung lineage had the most cell lines (28 in total). We subsampled the Lung cell lines to include 8, 17 or 25 cell lines (Supplementary Figure 4d,e) and found little difference in performance between using a single lineage and a mixture of lineages, indicating that lineage composition is not a major factor for estimating batch effects.

Discussion

The integration of data from different high-throughput functional genomics screens is becoming increasingly important in oncology research to adequately represent the diversity of human cancers. Integrating CRISPR-Cas9 screens performed independently and/or using distinct experimental protocols requires correction and benchmarking strategies to account for technical biases, batch effects and differences in data-processing methods. Here, we proposed a strategy for the integration of CRISPR-Cas9 screens and evaluated methods accounting for biases within and between two dependency datasets generated at the Broad and Sanger institutes. Our results show that established batch correction methods can be used to adjust for linear and non-linear study-specific biases. Our analyses and assessment yielded two final integrated datasets of cancer dependencies across 908 cell lines. In contrast to existing databases of CRISPR-Cas9 screens39,40, our integrated datasets are corrected for batch effects, allowing for their joint analysis.

Following integration, dependency profiles of cell lines from the same tissue lineage and cancer specific subtypes show good concordance. Our integrated datasets cover a greater number of genetic dependencies, and the increased diversity of screened models allows additional associations between biomarkers and dependencies to be identified. The integrated datasets were the output of two orthogonal pre-processing methods, CRISPRcleanR and CERES. The use-case analysis showed that CERES (which borrows information across screens) yields a final dataset better able to identify prior known essential and non-essential genes and to cluster cell lines by lineage. In contrast, CRISPRcleanR (a per sample method) was better able to detect associations between selective dependencies and potential biomarkers, and had better recall of known oncogenic addictions. Therefore, results from both processing methods provide the best overall data-driven functional Cancer Dependency Map.

The data integration strategies and sample size guidelines outlined here can be used with future and additional CRISPR-Cas9 datasets to increase coverage of cancer dependencies. This will be important for oncological functional genomics, for the identification of novel cancer therapeutic targets, and for the definition of a global cancer dependency map. Further, as library design improves24,41,42 we would expect the coverage and accuracy of the integrated datasets to also improve.
Data availability

The final integrated datasets are available for download at https://figshare.com/projects/Integrated_CRISPR/78252. The data will also be made accessible through the DepMap (https://depmap.org) and Score (https://score.depmap.sanger.ac.uk) web portals in early 2021.

Code availability

Scripts and software packages implementing the integration pipeline described in this manuscript and needed to reproduce results and figures are available on GitHub at https://github.com/DepMap-Analytics/IntegratedCRISPR with data sources available on Figshare: https://figshare.com/projects/Integrated_CRISPR/78252.

Acknowledgments

This work was partially funded by Open Targets [project OTAR0255] and by the Wellcome Trust [grant 206194]. We thank Leo Parts for a number of insightful discussions.

Author Contributions

CP conceived the study, designed, implemented and performed analyses, assembled figures, curated data, and wrote the manuscript. JMD conceived the study, designed, implemented and performed analyses, assembled figures, and contributed to manuscript writing. IB contributed to pipeline implementation. EG performed analyses, assembled figures, and revised the manuscript. HN assembled figures and revised the manuscript. EK, DvdM, AB, HL, PJ contributed to data curation. JMM, MJG, and AT revised the manuscript and contributed to study supervision. FI conceived the study, designed analyses, contributed to figure production, wrote the manuscript, acquired funds and supervised the study.
Competing interests

MJG and FI receive funding from Open Targets, a public-private initiative involving academia and industry. MJG receives funding from AstraZeneca and performs consultancy for Sanofi. FI performs consultancy for the joint CRUK - AstraZeneca Functional Genomics Centre. AT is a consultant for Tango Therapeutics and Cedilla Therapeutics. JMD, JM and AT receive funding from the Cancer Dependency Map Consortium, but no consortium member was involved in or influenced this study. All the other authors declare no competing interests.

Methods

Preprocessing data

Sanger data processed with CRISPRcleanR were obtained from the Score website (https://score.depmap.sanger.ac.uk/). The CRISPRcleanR corrected counts were used as input into JACKS for the CCR-JACKS processing method. Raw counts and copy number profiles downloaded for the Sanger dataset were processed with CERES20. The Broad data processed with CERES (unscaled gene effect scores, version 20Q2) were downloaded from the Broad DepMap portal20. The raw counts for the Broad 20Q2 data were processed with CRISPRcleanR, and the CRISPRcleanR corrected counts were processed with JACKS. Gene names were matched across the Broad and Sanger datasets by updating both to the current version of HUGO gene symbols from the HGNC website.
Missing entries were mean imputed for the principal component removal and then re-assigned as NA in the final matrix. Cell lines processed by both CERES and CRISPRcleanR were used for analysis. Tissue annotations for each cell line were obtained from the Cell Model Passports (https://cellmodelpassports.sanger.ac.uk/)43.

Batch correction pipelines

The dependency profiles across genes (DPGs) for overlapping cell lines from each institute were first quantile normalized using the preprocessCore package in R44. Screen quality adjustments were made by fitting a spline to the average gene fold change across cell line DPGs. Each DPG was then adjusted to remove the difference between the fitted spline and the diagonal. The overlapping cell lines were then batch corrected using three different methods. A standard least squares model was fitted in R. The ComBat correction was performed using the sva package in R45.

Batch correction pipelines’ assessment and weighted Pearson correlation metric

Cell lines’ rank neighborhoods were based on a weighted Pearson correlation (wPearson) metric. The weights were defined as the absolute mean (over the Broad and Sanger datasets) of the skewness of each gene’s dependency signal across the 168 overlapping cell lines. Using skewness upweights genes with a variable and sufficiently selective fitness profile, whilst downweighting those that show weak or no signal, or unselective dependencies. Then, for each query DPG, we ranked all the other DPGs in decreasing order of their similarity to the query DPG, according to the wPearson scores. For each position k in the resulting rank, we then defined a k-neighborhood of the query DPG, composed of all the other DPGs whose rank position was ≤ k.
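The weighted Pearson similarity and the rank of a matching DPG can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' R implementation: the function names, the toy profiles and the weights are hypothetical, and the weights are simply taken as given (in the study they derive from per-gene skewness).

```python
import math


def weighted_pearson(x, y, w):
    """Pearson correlation between x and y with non-negative per-gene weights w."""
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y))
    return cov / math.sqrt(var_x * var_y)


def match_rank(query, profiles, weights, matching):
    """Rank position (1 = most similar) of the matching DPG among all other
    DPGs, ordered by decreasing weighted Pearson similarity to the query."""
    others = [name for name in profiles if name != query]
    others.sort(
        key=lambda name: weighted_pearson(profiles[query], profiles[name], weights),
        reverse=True,
    )
    return others.index(matching) + 1


# Toy DPGs over four genes for two cell lines screened at both institutes
# (hypothetical values, for illustration only).
profiles = {
    "A_broad": [-1.0, 0.1, -0.8, 0.0],
    "A_sanger": [-0.9, 0.0, -0.7, 0.1],
    "B_broad": [0.1, -1.0, 0.0, -0.6],
    "B_sanger": [0.0, -0.8, 0.1, -0.7],
}
weights = [1.0, 2.0, 0.5, 1.5]  # hypothetical per-gene weights

rank = match_rank("A_broad", profiles, weights, "A_sanger")
```

With this helper, a matching DPG lies in the k-neighborhood of its query exactly when the returned rank is ≤ k.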
Finally, we determined the number of cell line DPGs that had the DPG derived from screening the same cell line in the other dataset (a matching DPG) in their k-neighborhood. The final rank for each cell line was based on the minimum of the ranks obtained when comparing its DPG from the Broad dataset to all DPGs and, similarly, its DPG from the Sanger dataset to all DPGs.

Analysis of Principal Components

The first two principal components (PCs) were extracted from ComBat corrected CRISPRcleanR data using the prcomp function in R. The top 500 genes (according to the absolute value of their PC loadings) were selected for enrichment analysis. The gene lists were used as input into the GSEA website (https://www.gsea-msigdb.org/) and were tested against the Gene Ontology Biological Processes, Hallmark and Canonical Pathway databases. The top 10 significantly enriched (q-value < 0.05) gene sets were downloaded from the website.

Batch correction extended to 908 cell lines

The ComBat estimates (pooled mean, variance and empirical Bayes adjustments, i.e. mean and standard deviation, for each batch) were computed from the analysis of the 168 cell lines common to both initial datasets.
The ComBat correction using these estimates was then applied to all screens, i.e. the union of the two initial datasets. In particular, each individual cell line DPG was shifted and scaled gene-wise using the batch correction vectors output by ComBat. Further adjustments were then applied to all screens, including quantile normalization and the removal of either the first principal component of the joint datasets or the first two. Finally, DPGs for overlapping cell lines passing a similarity threshold (detailed below) were averaged.

Across the three pre-processing methods, the proportion of cell lines that matched their counterparts exactly after ComBat correction ranged from 51% to 86% (Figure 3b), suggesting that under all pre-processing methods there remained cell lines whose DPGs diverged between studies. For each of the cell lines that matched their counterpart as the first neighbor, we considered their distances (1-wPearson) as a measure of the variability in distance profiles between DPGs of the same cell line across institutes. We called divergent DPGs those with a distance greater than the 95th percentile of distances from matching cell lines. For 16 cell lines with divergent DPGs across all three processing methods, we selected the DPG from the screen with the highest quality to be included in the integrated datasets. As a quality metric we used the Null-normalized mean difference (NNMD, defined in the
main text) and took its consensus value across the three datasets (resulting from applying CERES, CCR-JACKS and CRISPRcleanR).

Agreement between dependency profile clusterings and cell line tissue labels

We selected the 500 genes with the highest variance in the CERES ComBat integrated dataset and performed 100 repeated k-means clusterings of the cell lines using these high variance genes for each pre-processing and batch-correction method. For each clustering, we calculated the adjusted mutual information between the obtained clusters and the cell line tissue labels, as specified in the annotation provided by the sample_info file of the DepMap_public_20Q2 dataset20, using scikit-learn’s Python function adjusted_mutual_info_score (https://scikit-learn.org/stable/).

Recall of known gene relationships

We assembled a set of functionally related gene pairs using paralogs identified by EnsemblCompara33, protein-protein interactions identified by Li et al.34, and CORUM complex co-memberships35. For a given dataset, for each pair of related genes, we calculated the Pearson correlation coefficient between those genes’ dependency scores across cell lines. We then binned each gene that appeared in the list of known gene relationships according to its mean gene score, using 20 equally spaced bins. For each pair in the set of related gene pairs, we chose one gene as the query gene and replaced its related partner with another randomly selected gene of similar mean score, i.e. belonging to the same bin, excluding genes known to be related to the query gene. We calculated Pearson’s correlation coefficients between these randomly selected gene pairs to generate a null distribution, from which we calculated empirical p-values and Benjamini-Hochberg FDRs for known related gene pairs.
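The null-pair construction and the resulting significance estimates can be sketched as follows. This is a minimal stdlib-only illustration under several assumptions that are not stated in the text: the function names are hypothetical, the null partner is drawn from the bin of the partner it replaces, p-values are one-sided on the absolute correlation, and an add-one correction is used to keep empirical p-values above zero.

```python
import random


def mean_score_bin(value, lo, hi, n_bins=20):
    """Index of the equally spaced bin (0 .. n_bins-1) containing value."""
    if hi == lo:
        return 0
    return min(int((value - lo) / (hi - lo) * n_bins), n_bins - 1)


def sample_null_partner(query, partner, gene_bin, related, rng):
    """Random gene from the partner's bin that is not known to be related to query."""
    candidates = [
        g for g, b in gene_bin.items()
        if b == gene_bin[partner] and g != query and g not in related.get(query, set())
    ]
    return rng.choice(candidates) if candidates else None


def empirical_p_values(observed, null):
    """Share of null |r| values at least as large as each observed |r|."""
    null_abs = [abs(v) for v in null]
    return [
        (1 + sum(v >= abs(o) for v in null_abs)) / (len(null_abs) + 1)
        for o in observed
    ]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy usage with hypothetical gene identifiers and correlations.
rng = random.Random(0)
gene_bin = {"G1": 0, "G2": 0, "G3": 0, "G4": 1, "G5": 0}
related = {"G1": {"G2"}}
null_partner = sample_null_partner("G1", "G2", gene_bin, related, rng)
p_values = empirical_p_values([0.9], [0.1, -0.2, 0.95, 0.3])
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

Matching the bins of null and known pairs is what guards against the screen-quality artifact discussed in the Methods text.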
Ensuring that the pairs of genes used in the null distribution have similar distributions of mean gene effect as the pairs of known related genes is necessary because variable screen quality can produce a high but artifactual correlation between any pair of common essential genes, and CORUM is highly biased towards common essentials. This is discussed further in the comparisons of batch corrections in Dempster et al.29.

Unexpressed false positives

We defined a gene as unexpressed in a cell line if the log2(transcripts per million + 1) of its DepMap expression was less than 0.0146. Any score of an unexpressed gene in a cell line was called a false positive if it fell in the bottom 15% of gene scores for that cell line.

Identifying selective dependencies

NormLRT and the likelihood of the normal distribution were calculated in R using the MASS package47. For the skew t-distribution, the st.mple function from the sn package was used to calculate the likelihood. If the fitting procedure failed, different degrees of freedom were used iteratively until a solution was found. The degrees of freedom used, in order, were 2, 5, 10, 25, 50 and 100.
Systematic association test between molecular features and gene dependencies

We performed a systematic two-sample unpaired Student’s t-test (with the assumption of equal variance between compared populations) to assess the differential essentiality of each gene across a dichotomy of cell lines defined by the status (present/absent) of each CFE in turn. We tested genes whose NormLRT values were greater than 200 in any integrated dataset. From these tests, we obtained p-values against the null hypothesis that the two compared populations had an equal mean, with the alternative hypothesis indicating an association between the tested CFE/gene-dependency pair. P-values were corrected for multiple hypothesis testing using Benjamini–Hochberg (method ‘fdr’ using the p.adjust function in R). We also estimated the effect size of each tested association using Cohen’s Delta (ΔFC), i.e. the difference in population means divided by their pooled standard deviation.

Evaluating known selective dependencies

A table of all annotated oncogene variants was downloaded from OncoKB32. The table was filtered first for genes that were (likely) oncogenic and alterations that were (likely) gain-of-function or switch-of-function. For each alteration, the DepMap public 20Q2 20 mutation and fusion calls were used to identify which cell lines had the alteration. These cell lines were treated as positive controls for the gene in question, with all other cell lines treated as negative controls. Only oncogenes with at least one positive cell line were retained. For each integrated dataset, we calculated the ROC AUC between all positive oncogene-cell line pairs and negative pairs. Then, for each oncogene with at least two positive cell lines, we calculated the NNMD between its positive and negative cell lines.
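The two evaluation metrics used here can be sketched with the standard library alone. This is an illustrative sketch, not the study's code. It assumes that NNMD is the difference between the mean score of positive and negative controls divided by the standard deviation of the negative controls (the definition is given in the main text, not in this section), that the sample standard deviation is used, and that more negative scores indicate stronger dependency; function names and scores are hypothetical.

```python
import statistics


def nnmd(positive_scores, negative_scores):
    """Null-normalized mean difference (more negative means better separation).

    Assumed definition: (mean of positives - mean of negatives) divided by
    the sample standard deviation of the negatives.
    """
    diff = statistics.mean(positive_scores) - statistics.mean(negative_scores)
    return diff / statistics.stdev(negative_scores)


def roc_auc_lower_is_positive(positive_scores, negative_scores):
    """ROC AUC computed as the probability that a positive pair has a lower
    (more dependent) score than a negative pair; ties count as one half."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p < n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical dependency scores for one oncogene.
positives = [-1.2, -0.9]            # cell lines carrying the oncogenic alteration
negatives = [-0.1, 0.0, 0.1, -1.0]  # all other cell lines

auc = roc_auc_lower_is_positive(positives, negatives)
separation = nnmd(positives, negatives)
```

The pairwise formulation of the AUC is quadratic in the number of scores, which is fine for illustration; a rank-based computation would be the usual choice for the full dataset.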
Identification of common essential genes via the 90th percentile method

The 90th percentile method27 finds, for each gene, the cell line that lies on the boundary of its 90th percentile of least dependent cell lines. It then calculates the rank of that gene in that cell line by sorting all the genes in increasing order of their dependency score. A mixture of two normal distributions is then fitted to the rank positions of all genes. Those genes with ranks below the crossover point of these two distributions are labeled as common essentials.

ADaM method

Binary depletion matrices for the integrated datasets were calculated as outlined in the next section and used with the ADaM method as described in Behan et al.2 The ADaM method determines the number of cell lines dependent on a gene that is required to call that gene a common essential. This number of cell lines is calculated by maximizing the tradeoff between the true positive rate (using a set of prior known essential genes) and the deviance from the null expected rate (calculated using random permutations of the binary depletion matrix). Common essential genes were identified for each tissue separately (according to the cell line annotation from the Cell Model Passports43) and were then used as input into ADaM to determine pan-cancer common essential genes.
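The ranking step of the 90th percentile method described earlier in this section can be illustrated with a short Python sketch. It is a simplified reading of the text, not the ADaM2 implementation: the function name, the nested-dictionary input and the toy data are hypothetical, and the way the boundary cell line is indexed (rounding 0.9 × (n − 1)) is an assumption. The subsequent fit of a two-component normal mixture to the ranks is not shown.

```python
def ninetieth_percentile_ranks(scores, percentile=0.9):
    """Rank of each gene in its boundary (90th percentile) cell line.

    scores: dict gene -> dict cell_line -> dependency score
            (lower score = stronger dependency)
    Returns dict gene -> 0-based rank of the gene in its boundary cell line.
    """
    genes = list(scores)
    cell_lines = list(next(iter(scores.values())))

    # Rank every gene within every cell line, scores in increasing order.
    rank_in_line = {}
    for cl in cell_lines:
        ordered = sorted(genes, key=lambda g: scores[g][cl])
        rank_in_line[cl] = {g: r for r, g in enumerate(ordered)}

    result = {}
    for g in genes:
        # Cell lines ordered from most to least dependent on gene g.
        by_dependency = sorted(cell_lines, key=lambda cl: scores[g][cl])
        # Assumed convention for locating the boundary cell line.
        idx = int(round(percentile * (len(by_dependency) - 1)))
        result[g] = rank_in_line[by_dependency[idx]][g]
    return result


# Hypothetical toy data: 3 genes screened in 11 cell lines.
lines = [f"c{i}" for i in range(11)]
toy = {
    "CORE": {cl: -2.0 for cl in lines},                          # always depleted
    "SEL": {cl: (-3.0 if cl == "c0" else 0.1) for cl in lines},  # one line only
    "NON": {cl: 0.0 for cl in lines},                            # never depleted
}
ranks = ninetieth_percentile_ranks(toy)
print(ranks)
```

A gene that is depleted in nearly every cell line keeps a low rank even in its boundary cell line, whereas a selective dependency such as the toy gene SEL does not, which is what separates the two components of the mixture.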
Binary depletion calls

Binary depletion calls were computed by considering each cell line DPG as a rank-based classifier of essential/non-essential genes11 (with gene rank positions determined by their fitness effect, i.e. the average depletion fold-change, at the end of the assay with respect to plasmid counts, of the abundance of the single guide RNAs targeting the gene). The fitness effect threshold was then fixed as that corresponding to the largest rank position r guaranteeing a false discovery rate (FDR) < 5%, when the predicted essential genes are those with a rank position ≤ r. This allowed us to assign to each gene in each cell line, in each of the two datasets, a binary dependency score.

To identify significantly depleted genes for a given cell line at a 5% FDR, we ranked all the genes in the cell line DPG in increasing order of their depletion log fold-changes. We used the ranked list to calculate the precision curve using sets of prior known essential (E) and non-essential (N) genes, derived from Hart et al.11 To estimate the rank position corresponding to the 5% FDR threshold, we calculated for each rank position k a set of predicted essential genes P(k) = {s ∈ E ∪ N : r(s) ≤ k}, with r(s) indicating the rank position of s, and the corresponding positive predictive value (or precision) PPV(k) as:
PPV(k) = |P(k) ∩ E| / |P(k)|

We then determined the largest rank position k* with PPV(k*) ≥ 0.95 (equivalent to an FDR ≤ 0.05). The 5% FDR logFC threshold F* was defined as the logFC of the gene s such that r(s) = k*. We called all genes with a logFC < F* significantly depleted at 5% FDR. Binary dependency matrices were defined as gene-by-cell-line matrices with non-null entries corresponding to significant dependency genes at 5% FDR for each cell line, i.e. column.

Positive controls for common essentials

To generate sets of prior known common essential genes, we downloaded gene sets from MSigDB (v7.2) using the R package qusage. The gene sets used from KEGG were KEGG_SPLICEOSOME, KEGG_RIBOSOME, KEGG_PROTEASOME, KEGG_RNA_POLYMERASE and KEGG_DNA_REPLICATION. For the histones gene set, we combined two Reactome gene sets, REACTOME_HATS_ACETYLATE_HISTONES and REACTOME_HDACS_DEACETYLATE_HISTONES, with the curated histones gene set from Behan et al.2

Negative controls for common essentials

We compiled a set of negative controls for the common essential genes as those genes that were not expressed across all cell lines. We defined a gene as unexpressed across the panel of cell lines using the log2(transcripts per million + 1) value of its CCLE expression20 and the 90th percentile method. The input into the ADaM2 package (available at https://github.com/DepMap-Analytics/ADAM2), which performs the 90th percentile method, was -1*log2(TPM+1) to ensure correct ranking. A gene defined as constitutively unexpressed was therefore one that was still lowly expressed in its highly ranked (90th percentile) most expressed cell line.
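The 5% FDR thresholding used for the binary depletion calls can be sketched in Python for a single cell line. This is an illustrative reading of the procedure rather than the authors' code: the function name, the inputs and the toy gene labels are hypothetical. Following the definition of P(k), only genes in E ∪ N contribute to the precision, and the final calls use the strict inequality logFC < F* as stated in the text.

```python
def fdr_threshold(logfc, essential, nonessential, min_ppv=0.95):
    """Return (k_star, f_star), or None if no rank position reaches min_ppv.

    logfc:        dict gene -> depletion log fold-change in one cell line
    essential:    set of prior known essential genes (E)
    nonessential: set of prior known non-essential genes (N)
    """
    ranked = sorted(logfc, key=logfc.get)  # most depleted gene first
    tp = fp = 0
    best = None
    for k, gene in enumerate(ranked, start=1):
        if gene in essential:
            tp += 1
        elif gene in nonessential:
            fp += 1
        # PPV(k) = |P(k) ∩ E| / |P(k)|, with P(k) restricted to E ∪ N.
        if tp + fp > 0 and tp / (tp + fp) >= min_ppv:
            best = (k, logfc[gene])  # keep the largest qualifying k
    return best


# Hypothetical toy profile: E* are reference essentials, N* non-essentials,
# X1 is a gene outside both reference sets.
logfc = {"E1": -3.0, "E2": -2.5, "X1": -2.0, "E3": -1.5,
         "N1": -1.0, "E4": -0.5, "N2": 0.0}
k_star, f_star = fdr_threshold(logfc, {"E1", "E2", "E3", "E4"}, {"N1", "N2"})
# Binary calls with the strict inequality logFC < F* used in the text.
depleted = sorted(g for g, v in logfc.items() if v < f_star)
print(k_star, f_star, depleted)
```

In the toy profile the precision drops below 0.95 as soon as the first non-essential gene is reached, so the threshold is set at the rank just before it.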
Downsampling for batch correction sample sizes

We downsampled the overlapping cell lines 50 times at different levels between 5% and 90%. Random samples were generated using probabilities of selecting a cell line based on the relative proportions of each cell line lineage in the overlapping dataset. Using the downsampled set of overlapping cell lines, ComBat was used to calculate the batch adjustment vectors. The batch adjustment vectors were then applied to all 1,074 cell lines. The correlation between a cell line's fold changes batch corrected using the downsampled datasets and those corrected using the full 168 overlapping cell lines was calculated and compared to the correlation obtained with no batch correction. To evaluate the batch correction we also used the average silhouette width as a measure of clustering. We calculated the average silhouette width for each batch-corrected dataset (using samples of the overlapping cell lines) using the institute of origin as the cluster label. The average silhouette width is 1 for perfect clustering (or complete separation of cell lines by the institute of origin), with 0 indicating random performance of the clusters.

References

1. Prasad, V. Perspective: The precision-oncology illusion. Nature 537, S63 (2016).
2. Behan, F. M. et al. Prioritization of cancer therapeutic targets using CRISPR-Cas9 screens. Nature 568, 511–516 (2019).
3. Tsherniak, A. et al. Defining a Cancer Dependency Map. Cell 170, 564–576.e16 (2017).
4. McDonald, E. R., 3rd et al. Project DRIVE: A Compendium of Cancer Dependencies and Synthetic Lethal Relationships Uncovered by Large-Scale, Deep RNAi Screening. Cell 170, 577–592.e10 (2017).
5. Shalem, O. et al. Genome-scale CRISPR-Cas9 knockout screening in human cells. Science 343, 84–87 (2014).
6. Koike-Yusa, H., Li, Y., Tan, E.-P., Velasco-Herrera, M. D. C. & Yusa, K. Genome-wide recessive genetic screening in mammalian cells with a lentiviral CRISPR-guide RNA library. Nat. Biotechnol. 32, 267–273 (2014).
7. Wang, T., Wei, J. J., Sabatini, D. M. & Lander, E. S. Genetic screens in human cells using the CRISPR-Cas9 system. Science 343, 80–84 (2014).
8. Steinhart, Z. et al. Genome-wide CRISPR screens reveal a Wnt-FZD5 signaling circuit as a druggable vulnerability of RNF43-mutant pancreatic tumors. Nat. Med. 23, 60–68 (2017).
9. Shi, J. et al. Discovery of cancer drug targets by CRISPR-Cas9 screening of protein domains. Nat. Biotechnol. 33, 661–667 (2015).
10. Tzelepis, K. et al. A CRISPR Dropout Screen Identifies Genetic Vulnerabilities and Therapeutic Targets in Acute Myeloid Leukemia. Cell Rep. 17, 1193–1205 (2016).
11. Hart, T. et al. High-Resolution CRISPR Screens Reveal Fitness Genes and Genotype-Specific Cancer Liabilities. Cell 163, 1515–1526 (2015).
12. Meyers, R. M., Bryan, J. G., McFarland, J. M. & Weir, B. A. Computational correction of copy number effect improves specificity of CRISPR–Cas9 essentiality screens in cancer cells. Nature (2017).
13. Wellcome Sanger Institute. Cancer Dependency Map. https://depmap.sanger.ac.uk/.
14. Broad Institute of Harvard and MIT. Cancer Dependency Map. https://depmap.org/.
15. Feng, F. Y. & Gilbert, L. A. Lethal clues to cancer-cell vulnerability. Nature 568, 463–464 (2019).
16. Dempster, J. et al. Agreement between two large pan-cancer genome-scale CRISPR knock-out datasets. Nature Communications (in press).
17. Iorio, F. et al. Unsupervised correction of gene-independent cell responses to CRISPR-Cas9 targeting. BMC Genomics 19, 604 (2018).
18. Allen, F. et al. JACKS: joint analysis of CRISPR/Cas9 knockout screens. Genome Res. 29, 464–471 (2019).
19. Project Score. https://score.depmap.sanger.ac.uk/.
20. DepMap, B. DepMap 20Q2 Public. (2020) doi:10.6084/M9.FIGSHARE.12280541.V4.
21. Project Achilles. https://figshare.com/articles/DepMap_19Q3_Public/9201770.
22. Aguirre, A. J. et al. Genomic Copy Number Dictates a Gene-Independent Cell Response to CRISPR/Cas9 Targeting. Cancer Discov. 6, 914–929 (2016).
23. Gonçalves, E. et al. Structural rearrangements generate cell-specific, gene-independent CRISPR-Cas9 loss of fitness effects. Genome Biol. 20, 27 (2019).
24. Doench, J. G. et al. Rational design of highly active sgRNAs for CRISPR-Cas9-mediated gene inactivation. Nat. Biotechnol. 32, 1262–1267 (2014).
25. Leek, J. T., Johnson, W. E., Parker, H. S., Jaffe, A. E. & Storey, J. D. The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics 28, 882–883 (2012).
26. Liberzon, A. et al. Molecular signatures database (MSigDB) 3.0. Bioinformatics 27, 1739–1740 (2011).
27. Dempster, J. M. et al. Agreement between two large pan-cancer CRISPR-Cas9 gene dependency data sets. Nat. Commun. 10, 5817 (2019).
28. Lagziel, S., Lee, W. D. & Shlomi, T. Inferring cancer dependencies on metabolic genes from large-scale genetic screens. BMC Biol. 17, 37 (2019).
29. Dempster, J. M., Rossen, J., Kazachkova, M. & Pan, J. Extracting Biological Insights from the Project Achilles Genome-Scale CRISPR Screens in Cancer Cell Lines. bioRxiv (2019).
30. Iorio, F. et al. A Landscape of Pharmacogenomic Interactions in Cancer. Cell 166, 740–754 (2016).
31. Chakravarty, D. et al. OncoKB: A Precision Oncology Knowledge Base. JCO Precis Oncol 2017, (2017).
32. OncoKB. All Annotated Variants. OncoKB.org http://oncokb.org/api/v1/utils/allAnnotatedVariants (2020).
33. Aken, B. L. et al. Ensembl 2017. Nucleic Acids Res. 45, D635–D642 (2017).
34. Li, T. et al. A scored human protein-protein interaction network to catalyze genomic interpretation. Nat. Methods 14, 61–64 (2017).
35. Ruepp, A. et al. CORUM: the comprehensive resource of mammalian protein complexes--2009. Nucleic Acids Res. 38, D497–501 (2010).
36. Hart, T. et al. Evaluation and Design of Genome-Wide CRISPR/SpCas9 Knockout Screens. G3 7, 2719–2727 (2017).
37. Kanehisa, M. et al. KEGG for linking genomes to life and the environment. Nucleic Acids Res. 36, D480–4 (2008).
38. Fabregat, A. et al. The Reactome Pathway Knowledgebase. Nucleic Acids Res. 46, D649–D655 (2018).
39. Lenoir, W. F., Lim, T. L. & Hart, T. PICKLES: the database of pooled in-vitro CRISPR knockout library essentiality screens. Nucleic Acids Res. 46, D776–D780 (2018).
40. Rauscher, B., Heigwer, F., Breinig, M., Winter, J. & Boutros, M. GenomeCRISPR - a database for high-throughput CRISPR/Cas9 screens. Nucleic Acids Res. 45, D679–D686 (2017).
41. Gonçalves, E., Thomas, M., Behan, F. M., Picco, G. & Pacini, C. Minimal genome-wide human CRISPR-Cas9 library. bioRxiv (2019).
42. Elmentaite, R., Noell, G., Turner, G., Iyer, V. & Parts, L. Minimized double guide RNA libraries enable scale-limited CRISPR/Cas9 screens. bioRxiv (2019).
43. van der Meer, D. et al. Cell Model Passports—a hub for clinical, genetic and functional datasets of preclinical cancer models. Nucleic Acids Res. 47, D923–D929 (2019).
44. Bolstad, B. M. preprocessCore: A collection of pre-processing functions. 2016. R package version 1.
45. Leek, J. T. et al. sva: Surrogate Variable Analysis. R Package Version 30. 2017.
46. DepMap, B. DepMap 19Q4 Public. (2020) doi:10.6084/m9.figshare.11384241.v2.
47. Ripley, B. et al. Package 'mass'. Cran R 538, (2013).