key: cord-0317386-2dz6xq5c authors: Salvadores, Marina; Fuster-Tormo, Francisco; Supek, Fran title: Matching cell lines with cancer type and subtype of origin via mutational, epigenomic and transcriptomic patterns date: 2019-10-17 journal: bioRxiv DOI: 10.1101/809400 sha: b393f5a2c159aa984f13858135d74e6fb5f4a0ad doc_id: 317386 cord_uid: 2dz6xq5c Cell lines are commonly used as cancer models. Because the tissue and/or cell type of origin provide important context for understanding mechanisms of cancer, we systematically examined whether cell lines exhibit features matching the cancer type that supposedly originated them. To this end, we aligned the mRNA expression and DNA methylation data between ∼9,000 solid tumors and ∼600 cell lines to remove the global differences stemming from growth in cell culture. Next, we created classification models for cancer type and subtype using tumor data, and applied them to cell line data. Overall, the transcriptomic and epigenomic classifiers consistently identified 35 cell lines which better matched a different tissue or cell type than the one the cell line was originally annotated with; we recommend caution in using these cell lines in experimental work. Six cell lines were identified as originating from the skin, of which five were further corroborated by the presence of a UV-like mutational signature in their genome, strongly suggesting mislabelling. Overall, genomic evidence additionally supports that 22 (3.6% of all considered) cell lines may be mislabelled because we predict they originate from a different tissue/cell type. Finally, we cataloged 366 cell lines in which both transcriptomic and epigenomic profiles strongly resemble the tumor type of origin, designating them as ‘golden set’ cell lines. We suggest these cell lines are better suited for experimental work that depends on tissue identity and propose tentative assignments to cancer subtypes. 
Finally, we show that accounting for the uncertain tissue-of-origin labels can change the interpretation of drug sensitivity and CRISPR genetic screening data. In particular, in brain, lung and pancreatic cancer cell lines, many novel determinants of drug sensitivity or resistance emerged by focussing on the cell lines that are best matched to the cancer type of interest.

An overview of the HyperTracker methodology applied in the manuscript. First, we systematically identified possibly mislabelled cell lines using GE and MET data, independently. Second, we used various types of genomic data to corroborate the hits. Third, we further validated the cell lines suspected to originate from skin using independent data.

whole-exome sequencing data sets (Fig S4; our past work (24) suggests whole genome sequences are more powerful). Finally, we included an additional classifier based on copy number alteration (CNA) profiles, which were also shown to yield accurate predictive models of tissue specificity (24,25).

For the 43 cell lines whose tissue identity was suspect, we first derived one-versus-one classification models separately for GE and MET. If a cell line is truly mislabelled, then when testing the original versus the suspected cancer type, the same reassignment should be observed robustly across multiple runs of the classification algorithm, which use different random initializations. Out of 20 iterations of the algorithm, a score of 20 indicates that the cell line is consistently predicted as the suspected cancer type, and a score of 0 means that the cell line is consistently assigned to the original cancer type. We randomized the labels to obtain a background model of expected values (Fig 3b; Fig S5a).
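The run-to-run reassignment score and its label-randomized background can be illustrated with a minimal sketch. It is not the classifier used in the manuscript: a nearest-centroid rule stands in for the one-versus-one model, bootstrap resampling of the tumors stands in for the different random initializations, and the function names `reassignment_score` and `background_scores` are hypothetical.

```python
import random
from statistics import fmean

def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [fmean(col) for col in zip(*rows)]

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def reassignment_score(original, suspected, cell_line, n_runs=20, seed=0):
    """Count the runs (0..n_runs) that assign the cell line to the suspected type.

    Assumption: each run bootstraps the tumors of both classes so that runs
    differ; nearest-centroid is a stand-in for the real classifier.
    """
    rng = random.Random(seed)
    score = 0
    for _ in range(n_runs):
        orig_boot = rng.choices(original, k=len(original))
        susp_boot = rng.choices(suspected, k=len(suspected))
        d_orig = sq_dist(cell_line, centroid(orig_boot))
        d_susp = sq_dist(cell_line, centroid(susp_boot))
        score += d_susp < d_orig  # True counts as 1
    return score

def background_scores(original, suspected, cell_line, n_perm=50, seed=1):
    """Scores obtained after shuffling the tumor labels (null model)."""
    rng = random.Random(seed)
    pooled = original + suspected
    scores = []
    for i in range(n_perm):
        shuffled = rng.sample(pooled, k=len(pooled))
        fake_orig = shuffled[:len(original)]
        fake_susp = shuffled[len(original):]
        scores.append(
            reassignment_score(fake_orig, fake_susp, cell_line, seed=seed + i + 1)
        )
    return scores
```

Under these assumptions, a cell line lying well inside the suspected class scores 20 and one inside the original class scores 0, while the shuffled-label scores indicate what range of values to expect by chance.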
From the 43 suspected cell lines, 35 are consistently reassigned to the other tissue (score>10), irrespective of the variability in the predictive models introduced by resampling the data (Fig 3a; Fig S5b). Next, we calculated the same score for the genomic classifiers (based on mutations and CNA, as described above) on these 35 suspected cell lines (Fig 3a).

by one or more genomic classifiers (Fig 3a; score>=15, corresponding to FDRs of 0%, 0% and 18% for the CNA, OGM and MS96 respectively, based on randomized data; Fig 3b). These data suggest that 22 cell lines are candidates for assignment to another cancer type; the converging evidence from the genome, epigenome and transcriptome lends confidence to these calls. Reassuringly, this list contains two cell lines which have previously been shown to be misclassified: SW626, which was initially annotated as ovarian cancer but later discovered to be derived from colon cancer (26), and COLO741, which was originally thought to be a colon adenocarcinoma cell line but later shown to originate from a melanoma (27). The fact that these two known examples were detected and reassigned to the correct cancer type provides evidence that our method is overall reliable.

The two plausible reasons why a cell line thought to originate from one cell type would need to be reassigned to a different cell type are (i) that at the time of isolation, the cell line was not of the type that it was thought to be (mislabelling), or (ii) that during prolonged cell culture, the cell line diverged greatly and now resembles another cell type (transdifferentiation).
Our data allow us to examine how prevalent each case is: mislabelling is expected to be reflected equally in both the epigenome and the genome, while transdifferentiation is expected to be reflected more strongly in the (presumably more malleable) epigenome, and less so in the genome, which retains the mutations from the original tumor. We suggest that mislabelling at isolation is a much more common scenario (

From the previous analysis, we identified a total of six cell lines which are reassigned from various cancer types to skin cancer. We note that, of skin cancers, the TCGA study contains only melanoma but not the non-melanoma skin cancers, so we are currently not able to distinguish between cell type identities of different types of skin cancer.

analysis based on mutational signatures to confirm the mislabelling. Large-scale analyses of trinucleotide mutation spectra across human tumors have revealed at least 30 different types of mutational signatures (28). Of these, Signature 7 (C>T changes in CC and TC contexts) was associated with exposure to UV light and is highly abundant in sun-exposed melanoma tumors (29). The same signatures were recently estimated in cancer cell lines by two related methods (30,31), which enabled us to use the presence of the UV-linked Signature 7 to examine whether these cell lines originated from the skin. Based on the mutational burden of Signature 7, the known melanoma cell lines (turquoise dots) are clearly separated from the rest (Fig 3e), meaning the approach can distinguish skin-derived cells. Among the melanoma cell lines with a high mutational burden of Signature 7, we found four out of five of the suspected cell lines (Fig 3e)

indeed not a stomach cell line (Fig 3a). Past work based on gene expression suggested that RF48 is indeed not representative of stomach; instead, a lymphoid origin was proposed for RF48 (32).
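As a rough illustration of the Signature 7 criterion described above (C>T changes in CC and TC contexts), the sketch below computes the fraction of substitutions that are C>T with a C or T immediately 5' of the mutated base. This is only a crude proxy and not the signature estimation used in (30,31); the input format (a trinucleotide context centred on the mutated base, plus the alternate allele) and the function names are assumptions made for the example.

```python
# Complement table used to re-express purine-reference mutations on the pyrimidine strand.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def to_pyrimidine_strand(trinuc, alt):
    """Report a substitution with C or T as the mutated (middle) base."""
    if trinuc[1] in "CT":
        return trinuc, alt
    # Reverse-complement the context and complement the alternate allele.
    return trinuc.translate(COMPLEMENT)[::-1], alt.translate(COMPLEMENT)

def uv_like_fraction(mutations):
    """Fraction of substitutions that are C>T with a C or T on the 5' side.

    `mutations` is a list of (trinucleotide, alt) pairs, e.g. ("TCA", "T").
    """
    if not mutations:
        return 0.0
    hits = 0
    for trinuc, alt in mutations:
        context, new_base = to_pyrimidine_strand(trinuc.upper(), alt.upper())
        if context[1] == "C" and new_base == "T" and context[0] in "CT":
            hits += 1
    return hits / len(mutations)
```

A cell line with a high value of this fraction would be consistent with UV exposure, although a proper analysis would fit the full 96-channel spectrum against the known signatures rather than rely on a single ratio.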
Next, we sought to substantiate these findings using drug sensitivity data. In particular, two drugs (dabrafenib and trametinib) that target mutant BRAF are approved for treating melanoma in the clinic. These drugs are known to have poor efficacy in other cancer types bearing BRAF mutations, such as colon cancers (33), and therefore sensitivity to these drugs adds confidence that we are in fact looking at a melanoma cell line (note that the converse does not necessarily hold here: resistance does not imply it is not a melanoma). Therefore, we compared the IC50 of these two drugs for all cell lines (Fig 3d). As expected, many melanoma cell lines cluster at low values of IC50 for the two drugs, meaning these cells are sensitive to the drugs. Within this cluster we observed two of our five suspected cell lines (ES2 and MDST8), providing further evidence that these are of skin origin, likely melanoma.

For the majority of the cancer types, we observed that one of the filtered subsets recovered a higher number of significant (at FDR<=25%) associations of CFE with drug sensitivity or resistance than were recovered using all cell lines (Fig S7). For instance, for glioblastoma, using the 'golden set' cell lines we found 23 new associations, which were not recovered from the entire cell line panel nor from the random-subset controls (Fig 4b). For example, this recovers the positive association of CDKN2A loss with camptothecin sensitivity (Fig 4c), which was previously reported in an independent analysis of the NCI-60 cell line panel screening data (41). Similarly, for pancreatic adenocarcinoma, benefits were observed by focusing on cell lines that better resemble the corresponding cancer type: using only the 'golden set' plus 'silver set' cell lines, 10 new significant associations were found (Fig 4b).
For instance, we detected that SMAD4-mutant cell lines are more resistant to piperlongumine, a natural product claimed to have antitumor properties

For the random subsets, the number of significant associations is calculated 10 times and the median is taken. P-values for a sign test (one-tailed) between the associations in the G/G&S and the associations in the 10 runs of r_G/r_G&S are shown. See Fig S8 for remaining cancer types. (b

in the CL dataset. In particular, we calculated the Area Under the Receiver Operating Characteristic curve (AUC) and the Area Under the Precision Recall curve (AUPRC) for each cancer type vs the rest (all the rest of cancer types combined).

FDR Score. For each cell line, we calculated an FDR score of belonging to a particular cancer type. For this, we divided the TCGA data into two datasets (training and testing) of the same size, keeping the cancer type proportions. For each cancer type, we trained classifiers on the TCGA training dataset, added the cell lines one at a time to the testing data, and calculated the precision recall (PR) curve (TCGA testing + 1CL). We set the cell line FDR score for that specific cancer type as (1 - precision) at the threshold where the cell line is situated in the PR curve. Overall, for every cell line we obtained 17 FDR scores, 1 for each possible cancer type. We repeated this procedure 5 times and calculated the median FDR for every case to get more robust values. In addition, when training for 1 cancer type (label = 1) versus the rest of cancer types combined (label = 0), we made some exceptions and removed those cancer types that are similar and that the classifier therefore separates poorly (e.g.
when we calculated

Tumor-Derived Cell Lines as Molecular Models of Cancer Pharmacogenomics
Genomics of Drug Sensitivity in Cancer (GDSC): a resource for therapeutic biomarker discovery in cancer cells
The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity
An interactive resource to identify cancer genetic and lineage dependencies targeted by small molecules
Guidelines for the use of cell lines in biomedical research
Widespread Use of Misidentified Cell Line KB (HeLa): Incorrect Attribution and Its Impact Revealed through Mining the Scientific Literature
Line of attack
Comprehensive Analysis of Hypermutation in Human Cancer
Integrated classification of lung tumors and cell lines by expression profiling
A Landscape of Pharmacogenomic Interactions in
Differences in Signaling Patterns on Inhibition Reveal Context Specificity in KRAS-Mutant Cancers
Evaluating cell lines as tumour models by comparison of genomic profiles
Analysis of renal cancer cell lines from two major resources enables genomics-guided cell line selection
Tumour lineage shapes BRCA-mediated phenotypes
Phase II Pilot Study of Vemurafenib in Patients With Metastatic BRAF-Mutated Colorectal Cancer
Epigenetic epistatic interactions constrain the evolution of gene expression
A network of human functional gene interactions from knockout fitness screens in cancer cells
A global map of human gene expression
Dynamic DNA methylation across diverse human cell lines and tissues
Adjusting batch effects in microarray expression data using empirical Bayes methods
Removing Batch Effects in Analysis of Expression Microarray Data: An Evaluation of Six Batch Adjustment Methods
The ghosts of HeLa: How cell line misidentification contaminates the scientific literature. PLOS ONE
Passenger mutations accurately classify human tumors
A deep learning system can accurately classify primary and metastatic cancers based on patterns of passenger mutations. bioRxiv
Evidence for the colonic origin of ovarian cancer cell line SW626
The molecular landscape of colorectal cancer cell lines unveils clinically actionable kinase targets
Clock-like mutational processes in human somatic cells
Whole-genome landscapes of major melanoma subtypes
Mutation Signatures Including APOBEC in Cancer Cell Lines
Comprehensive analysis of the gene expression profiles in human gastric cancer cell lines
Dabrafenib and Trametinib in BRAF V600-Mutant Colorectal Cancer
Non-oncology drugs are a source of previously unappreciated anti-cancer activity. bioRxiv
The Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumours
Discrete Subsets and Pathways of Progression in Diffuse Glioma
Comparative Molecular Analysis of Gastrointestinal Adenocarcinomas
A collection of breast cancer cell lines for the study of functionally distinct cancer subtypes
Evaluation of colorectal cancer subtypes and cell lines using deep learning
Comprehensive transcriptomic analysis of cell lines as models of primary tumors across 22 tumor types
In vitro differential sensitivity of melanomas to phenothiazines is based on the presence of codon 600 BRAF mutation
Piperlongumine rapidly induces the death of human pancreatic cancer cells mainly through the induction of ferroptosis
Transcriptome Analysis of Piperlongumine-Treated Human Pancreatic Cancer Cells Reveals Involvement of Oxidative Stress and Endoplasmic Reticulum Stress Pathways
Prioritization of cancer therapeutic targets using CRISPR-Cas9 screens
EGFR as a Target for Glioblastoma Treatment: An Unfulfilled Promise
Assessing breast cancer cell lines as tumour models by comparison of mRNA expression profiles
Check your cultures! A list of cross-contaminated or misidentified cell lines
Pathway-specific differences between tumor cell lines and normal and tumor tissue cells. Mol Cancer
Distinct patterns of somatic genome alterations in lung adenocarcinomas and squamous cell carcinomas
Toward a Shared Vision for Cancer Genomic Data
Cancer Cell Line Encyclopedia Consortium, Genomics of Drug Sensitivity in Cancer Consortium
Comprehensive Characterization of Cancer Driver Genes and Mutations. Cell
Scalable Open Science Approach for Mutation Calling of Tumor Exomes Using Multiple Genomic Pipelines
Variation across 141,456 human exomes and genomes reveals the spectrum of loss-of-function intolerance across human protein-coding genes. bioRxiv
TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data
GDSCTools for mining pharmacogenomic interactions in cancer
Computational correction of copy number effect improves specificity of CRISPR-Cas9 essentiality screens in cancer cells