Evaluating the transcriptional fidelity of cancer models

Da Peng1*, Rachel Gleyzer2*, Wen-Hsin Tai2, Pavithra Kumar2, Qin Bian2, Bradley Issacs2, Edroaldo Lummertz da Rocha3, Stephanie Cai1, Kathleen DiNapoli4,5, Franklin W Huang6, Patrick Cahan1,2,7

1 Department of Biomedical Engineering, Johns Hopkins University School of Medicine, Baltimore MD 21205 USA
2 Institute for Cell Engineering, Johns Hopkins University School of Medicine, Baltimore MD 21205 USA
3 Department of Microbiology, Immunology and Parasitology, Federal University of Santa Catarina, Florianópolis SC, Brazil
4 Department of Cell Biology, Johns Hopkins University School of Medicine, Baltimore, MD 21205 USA
5 Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore MD 21218 USA
6 Division of Hematology/Oncology, Department of Medicine; Helen Diller Family Cancer Center; Bakar Computational Health Sciences Institute; Institute for Human Genetics; University of California, San Francisco, San Francisco, CA
7 Department of Molecular Biology and Genetics, Johns Hopkins University School of Medicine, Baltimore MD 21205 USA

* These authors made equal contributions.

Correspondence to: patrick.cahan@jhmi.edu

Article type: Research

Website: http://www.cahanlab.org/resources/cancerCellNet_web

Code: https://github.com/pcahan1/cancerCellNet

bioRxiv preprint doi: https://doi.org/10.1101/2020.03.27.012757; this version posted January 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

ABSTRACT

Background: Cancer researchers use cell lines, patient derived xenografts, engineered mice, and tumoroids as models to investigate tumor biology and to identify therapies. The generalizability and power of a model derive from the fidelity with which it represents the tumor type under investigation; however, the extent to which this is true is often unclear. The preponderance of models and the ability to readily generate new ones have created a demand for tools that can measure the extent and ways in which cancer models resemble or diverge from native tumors.

Methods: We developed a machine learning based computational tool, CancerCellNet, that measures the similarity of cancer models to 22 naturally occurring tumor types and 36 subtypes, in a platform and species agnostic manner. We applied this tool to 657 cancer cell lines, 415 patient derived xenografts, 26 distinct genetically engineered mouse models, and 131 tumoroids. We validated CancerCellNet by application to independent data, and we tested several predictions with immunofluorescence.

Results: We have documented the cancer models with the greatest transcriptional fidelity to natural tumors, we have identified cancers underserved by adequate models, and we have found models with annotations that do not match their classification. By comparing models across modalities, we report that, on average, genetically engineered mice and tumoroids have higher transcriptional fidelity than patient derived xenografts and cell lines in four out of five tumor types. However, several patient derived xenografts and tumoroids have classification scores that are on par with native tumors, highlighting both their potential as faithful model classes and their heterogeneity.
Conclusions: CancerCellNet enables the rapid assessment of transcriptional fidelity of tumor models. We have made CancerCellNet available as freely downloadable software and as a web application that can be applied to new cancer models and that allows for direct comparison to the cancer models evaluated here.

INTRODUCTION

Models are widely used to investigate cancer biology and to identify potential therapeutics. Popular modeling modalities are cancer cell lines (CCLs) [1], genetically engineered mouse models (GEMMs) [2], patient derived xenografts (PDXs) [3], and tumoroids [4]. These classes of models differ in the types of questions that they are designed to address. CCLs are often used to address cell intrinsic mechanistic questions [5], GEMMs to chart progression of molecularly defined disease [6], and PDXs to explore patient-specific response to therapy in a physiologically relevant context [7]. More recently, tumoroids have emerged as relatively inexpensive, physiological, in vitro 3D models of tumor epithelium with applications ranging from measuring drug responsiveness to exploring tumor dependence on cancer stem cells. Models also differ in the extent to which they represent specific aspects of a cancer type [8]. Even with this intra- and inter-class model variation, all models should represent the tumor type or subtype under investigation, and not another type of tumor, and not a non-cancerous tissue.
Therefore, cancer models should be selected not only based on the specific biological question but also based on the similarity of the model to the cancer type under investigation [9,10].

Various methods have been proposed to determine the similarity of cancer models to their intended subjects. Domcke et al. devised a 'suitability score' as a metric of the molecular similarity of CCLs to high grade serous ovarian carcinoma based on a heuristic weighting of copy number alterations, mutation status of several genes that distinguish ovarian cancer subtypes, and hypermutation status [11]. Other studies have taken analogous approaches by either focusing on transcriptomic or ensemble molecular profiles (e.g. transcriptomic and copy number alterations) to quantify the similarity of cell lines to tumors [12–14]. These studies were tumor-type specific, focusing on CCLs that model, for example, hepatocellular carcinoma or breast cancer. Notably, Yu et al. compared the transcriptomes of CCLs to The Cancer Genome Atlas (TCGA) by correlation analysis, resulting in a panel of CCLs recommended as most representative of 22 tumor types [15]. Most recently, Najgebauer et al. [16] and Salvadores et al. [17] have developed methods to assess CCLs using molecular traits such as copy number alterations (CNA), somatic mutations, DNA methylation and transcriptomics. While all of these studies have provided valuable information, they leave two major challenges unmet.
The first challenge is to determine the fidelity of GEMMs, PDXs, and tumoroids, and whether there are stark differences between these classes of models and CCLs. The other major unmet challenge is to enable the rapid assessment of new, emerging cancer models. This challenge is especially relevant now as technical barriers to generating models have been substantially lowered [18,19], and because new models such as PDXs and tumoroids can be derived on a patient-specific basis, so each should be considered a distinct entity requiring individual validation [4,20].

To address these challenges, we developed CancerCellNet (CCN), a computational tool that uses transcriptomic data to quantitatively assess the similarity between cancer models and 22 naturally occurring tumor types and 36 subtypes in a platform- and species-agnostic manner. Here, we describe CCN’s performance, and the results of applying it to assess 657 CCLs, 415 PDXs, 26 GEMMs, and 131 tumoroids. This has allowed us to identify the most faithful models currently available, to document cancers underserved by adequate models, and to find models with inaccurate tumor type annotation. Moreover, because CCN is open-source and easy to use, it can be readily applied to newly generated cancer models as a means to assess their fidelity.

RESULTS

CancerCellNet classifies samples accurately across species and technologies

Previously, we had developed a computational tool using the Random Forest classification method to measure the similarity of engineered cell populations to their in vivo counterparts based on transcriptional profiles [21,22]. More recently, we elaborated on this approach to allow for classification of single cell RNA-seq data in a manner that allows for cross-platform and cross-species analysis [23].
Here, we used an analogous approach to build a platform that would allow us to quantitatively compare cancer models to naturally occurring patient tumors (Fig 1A). In brief, we used TCGA RNA-seq expression data from 22 solid tumor types to train a top-pair multi-class Random Forest classifier (Fig 1B). We combined training data from Rectal Adenocarcinoma (READ) and Colon Adenocarcinoma (COAD) into one COAD_READ category because READ and COAD are considered to be virtually indistinguishable at a molecular level [24]. We included an ‘Unknown’ category trained using randomly shuffled gene-pair profiles generated from the training data of 22 tumor types to identify query samples that are not reflective of any of the training data. To estimate the performance of CCN and how it is impacted by parameter variation, we performed a parameter sweep with a 5-fold 2/3 cross-validation strategy (i.e. 2/3 of the data sampled across each cancer type was used to train, 1/3 was used to validate) (Fig 1C). The performance of CCN, as measured by the mean area under the precision recall curve (AUPRC), did not fall below 0.945 and remained relatively stable across parameter sets (Supp Fig 1A). The optimal parameters resulted in 1,979 features. The mean AUPRCs exceeded 0.95 in most tumor types with this optimal parameter set (Fig 1D, Supp Fig 1B).
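The idea behind a gene-pair ("top-pair") representation can be illustrated with a minimal Python sketch. It assumes that each feature is a binary comparison of two genes within the same sample, which is our reading of the approach in [23] and is not spelled out in the text above. The gene names, expression values, and function names are hypothetical, and the shuffling helper is only one plausible way to produce a randomized profile for an 'Unknown' class.

```python
import random

def top_pair_transform(expression, gene_pairs):
    """Encode one sample as binary features.

    Each feature is 1 if the first gene of the pair is expressed more highly
    than the second gene in the same sample, and 0 otherwise (assumed encoding).
    """
    return [1 if expression[a] > expression[b] else 0 for a, b in gene_pairs]

def make_unknown_profile(profile, rng):
    """Return a randomly shuffled copy of a gene-pair profile (hypothetical helper)."""
    shuffled = list(profile)
    rng.shuffle(shuffled)
    return shuffled

# Hypothetical gene pairs, for illustration only.
gene_pairs = [("GENE_A", "GENE_B"), ("GENE_A", "GENE_C"), ("GENE_B", "GENE_C")]

# The same hypothetical sample measured on two platforms with different scales.
rnaseq_sample = {"GENE_A": 120.0, "GENE_B": 15.0, "GENE_C": 48.0}
microarray_sample = {"GENE_A": 9.1, "GENE_B": 6.3, "GENE_C": 7.8}

rnaseq_features = top_pair_transform(rnaseq_sample, gene_pairs)
microarray_features = top_pair_transform(microarray_sample, gene_pairs)

# A shuffled profile keeps the same values but scrambles which pair they belong to.
unknown_profile = make_unknown_profile(rnaseq_features, random.Random(0))
```

Because only the within-sample ordering of two genes is used, the two hypothetical measurements yield identical features despite their different scales, which is the intuition for why such a representation can be compared across platforms and species.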
The AUPRCs of CCN applied to independent RNA-seq data from 725 tumors across five tumor types from the International Cancer Genome Consortium (ICGC) [25] ranged from 0.93 to 0.99, supporting the notion that the platform is able to accurately classify tumor samples from diverse sources (Fig 1E).

As one of the central aims of our study is to compare distinct cancer models, including GEMMs, our method needed to be able to classify mouse and human samples equivalently. We used the Top-Pair transform [23] to achieve this, and we tested the feasibility of this approach by assessing the performance of a normal (i.e. non-tumor) cell and tissue classifier trained on human data as applied to mouse samples. Consistent with prior applications [23], we found that the cross-species classifier performed well, achieving a mean AUPRC of 0.97 when applied to mouse data (Supp Fig 1C).

To evaluate cancer models at a finer resolution, we also developed an approach to perform tumor subtype classifications (Supp Fig 1D). We constructed 11 different cancer subtype classifiers based on the availability of expression or histological subtype information [24,26–36]. We also included non-cancerous, normal tissues as categories for several subtype classifiers when sufficient data was available: breast invasive carcinoma (BRCA), COAD_READ, head and neck squamous cell carcinoma (HNSC), kidney renal clear cell carcinoma (KIRC) and uterine corpus endometrial carcinoma (UCEC).
The 11 subtype classifiers all achieved high overall average AUPRCs ranging from 0.80 to 0.99 (Supp Fig 1E).

Fidelity of cancer cell lines

Having validated the performance of CCN, we then used it to determine the fidelity of CCLs. We mined RNA-seq expression data of 657 different cell lines across 20 cancer types from the Cancer Cell Line Encyclopedia (CCLE) and applied CCN to them, finding a wide classification range for cell lines of each tumor type (Fig 2A, Supp Tab 1). To verify the classification results, we applied CCN to expression profiles from CCLE generated through microarray expression profiling [37]. To ensure that CCN would function on microarray data, we first tested it by applying a CCN classifier created to test microarray data to 720 expression profiles of 12 tumor types. The cross-platform CCN classifier performed well, based on the comparison to study-provided annotation, achieving a mean AUPRC of 0.91 (Supp Fig 2A). Next, we applied this cross-platform classifier to microarray expression profiles from CCLE (Supp Fig 2B). From the classification results of 571 cell lines that have both RNA-seq and microarray expression profiles, we found a strong overall positive association between the classification scores from RNA-seq and those from microarray (Supp Fig 2C). This comparison supports the notion that the classification scores for each cell line are not artifacts of profiling methodology. Moreover, this comparison shows that the scores are consistent between the times that the cell lines were first assayed by microarray expression profiling in 2012 and by RNA-seq in 2019. We also observed a high level of correlation between our analysis and the analysis done by Yu et al. [15] (Supp Fig 2D), further validating the robustness of the CCN results.

Next, we assessed the extent to which CCN classifications agreed with their nominal tumor type of origin, which entailed translating quantitative CCN scores to classification labels. To achieve this, we selected a decision threshold that maximized the Macro F1 measure, the harmonic mean of precision and recall, across 50 cross validations. Then, we annotated cell lines based on their CCN score profile as follows. Cell lines with CCN scores > threshold for the tumor type of origin were annotated as 'correct'. Cell lines with CCN scores > threshold in the tumor type of origin and at least one other tumor type were annotated as 'mixed'. Cell lines with CCN scores > threshold for tumor types other than that of the cell line's origin were annotated as 'other'. Cell lines that did not receive a CCN score > threshold for any tumor type were annotated as 'none' (Fig 2B). We found that the majority of cell lines originally annotated as Breast invasive carcinoma (BRCA), Cervical squamous cell carcinoma and endocervical adenocarcinoma (CESC), Skin Cutaneous Melanoma (SKCM), Colorectal Cancer (COAD_READ) and Sarcoma (SARC) fell into the 'correct' category (Fig 2B). On the other hand, no Esophageal carcinoma (ESCA), Pancreatic adenocarcinoma (PAAD) or Brain Lower Grade Glioma (LGG) cell lines were classified as 'correct', demonstrating the need for more transcriptionally faithful cell lines that model those general cancer types.

There are several possible explanations for cell lines not receiving a 'correct' classification.
One possibility is that the sample was incorrectly labeled in the study from which we harvested the expression data. Consistent with this explanation, we found that the colorectal cancer line NCI-H684 [38,39], a cell line labelled as liver hepatocellular carcinoma (LIHC) by CCLE, was classified strongly as COAD_READ (Supp Tab 1). Another possible explanation for a low CCN score is that cell lines were derived from subtypes of tumors that are not well-represented in TCGA. To explore this hypothesis, we first performed tumor subtype classification on CCLs from 11 tumor types for which we had trained subtype classifiers (Supp Tab 2). We reasoned that if a cell line were a good model for a rarer subtype, then it would receive a poor general classification but a high classification for the subtype that it models well. Therefore, we counted the number of lines that fit this pattern. We found that of the 188 lines with no general classification, 25 (13%) were classified as a specific subtype, suggesting that derivation from rare subtypes is not the major contributor to the poor overall fidelity of CCLs.

Another potential contributor to low scoring cell lines is intra-tumor stromal and immune cell impurity in the training data. If impurity were a confounder of CCN scoring, then we would expect a strong positive correlation between mean purity and mean CCN classification scores of CCLs per general tumor type.
However, the Pearson correlation coefficient between the mean purity of general tumor type and mean CCN classification scores of CCLs in the corresponding general tumor type was low (0.14), suggesting that tumor purity is not a major contributor to the low CCN scores across CCLs (Supp Fig 2E).

Comparison of SKCM and GBM CCLs to scRNA-seq

To more directly assess the impact of intra-tumor heterogeneity in the training data on evaluating cell lines, we constructed a classifier using cell types found in human melanoma and glioblastoma scRNA-seq data [40,41]. Previously, we have demonstrated the feasibility of using our classification approach on scRNA-seq data [23]. Our scRNA-seq classifier achieved a high average AUPRC (0.95) when applied to held-out data and a high mean AUPRC (0.99) when applied to a few purified bulk testing samples (Supp Fig 3A-B). Comparing the CCN scores from the bulk RNA-seq general classifier and the scRNA-seq classifier, we observed a high level of correlation (Pearson correlation of 0.89) between the SKCM CCN classification scores and scRNA-seq SKCM malignant CCN classification scores for SKCM cell lines (Fig 2C, Supp Fig 3C). Of the 41 SKCM cell lines that were classified as SKCM by the bulk classifier, 37 were also classified as SKCM malignant cells by the scRNA-seq classifier. Interestingly, we also observed a high correlation between the SARC CCN classification scores and scRNA-seq cancer associated fibroblast (CAF) CCN classification scores (Pearson correlation of 0.92). Six of the seven SKCM cell lines that had been classified as exclusively SARC by CCN were classified as CAF by the scRNA-seq classifier (Fig 2D, Supp Fig 3C), which suggests the possibility that these cell lines were derived from CAF or other mesenchymal populations, or that they have acquired a mesenchymal character through their derivation. The high level of agreement between scRNA-seq and bulk RNA-seq classification results shows that heterogeneity in the training data of the general CCN classifier has little impact on the classification of SKCM cell lines.

In contrast, we observed a weaker correlation between GBM CCN classification scores and scRNA-seq GBM neoplastic CCN classification scores (Pearson correlation of 0.72) for GBM cell lines (Fig 2E, Supp Fig 3D). Of the 31 GBM lines that were not classified as GBM with CCN, 25 were classified as GBM neoplastic cells with the scRNA-seq classifier. Among the 22 GBM lines that were classified as SARC with CCN, 15 cell lines were classified as CAF (Fig 2F), 10 of which were classified as both GBM neoplastic and CAF by the scRNA-seq classifier. Similar to the situation with SKCM lines that classify as CAF, this result is consistent with the possibility that some GBM lines classified as SARC by CCN could be derived from mesenchymal subtypes exhibiting both strong mesenchymal signatures and glioblastoma signatures or that they have acquired a mesenchymal character through their derivation.
The lower level of agreement between scRNA-seq and bulk RNA-seq classification results for GBM models suggests that the heterogeneity of glioblastomas [42] can impact the classification of GBM cell lines, and that the use of an scRNA-seq classifier can resolve this deficiency.

Immunofluorescence confirmation of CCN predictions

To experimentally explore some of our computational analyses, we performed immunofluorescence on three cell lines that were not classified as their labelled categories: the ovarian cancer line SK-OV-3 had a high UCEC CCN score (0.246), the ovarian cancer line A2780 had a high Testicular Germ Cell Tumors (TGCT) CCN score (0.327), and the prostate cancer line PC-3 had a high bladder cancer (BLCA) score (0.307) (Supp Tab 1). We reasoned that if SK-OV-3, A2780 and PC-3 were classified most strongly as UCEC, TGCT and BLCA, respectively, then they would express proteins that are indicative of these cancer types.

First, we measured the expression of the uterine-associated transcription factor HOXB6 [43,44], and the UCEC serous ovarian tumor biomarker WT1 [45] in SK-OV-3, in the OV cell line Caov-4, and in the UCEC cell line HEC-59. We chose Caov-4 as our positive control for OV biomarker expression because it was determined by our analysis and others [11,15] to be a good model of OV. Likewise, we chose HEC-59 to be a positive control for UCEC. We found that SK-OV-3 has a small percentage (5%) of cells that expressed the uterine marker HOXB6 and a large proportion (73%) of cells that expressed WT1 (Fig 3A). In contrast, no Caov-4 cells expressed HOXB6, whereas 85% of cells expressed WT1. This suggests that SK-OV-3 exhibits biomarkers of both ovarian tumor and uterine tissue. From our computational analysis and experimental validation, SK-OV-3 is most likely an endometrioid subtype of ovarian cancer. This result is also consistent with prior classification of SK-OV-3 [46], and with the facts that SK-OV-3 lacks p53 mutations, which are prevalent in high-grade serous ovarian cancer [47], and harbors an endometrioid-associated mutation in ARID1A [11,46,48]. Next, we measured the expression of markers of OV and germ cell cancers (LIN28A [49]) in the OV-annotated cell line A2780, which received a high TGCT CCN score. We found that 54% of A2780 cells expressed LIN28A whereas it was not detected in Caov-4 (Fig 3B). The OV marker WT1 was also expressed in fewer A2780 cells as compared to Caov-4 (48% vs 85%), which suggests that A2780 could be a germ cell derived ovarian tumor. Taken together, our results suggest that SK-OV-3 and A2780 could represent OV subtypes that are not well represented in the TCGA training data, which resulted in a low OV score and higher CCN scores in other categories.

Lastly, we examined PC-3, annotated as a PRAD cell line but classified to be most similar to BLCA. We found that 30% of the PC-3 cells expressed PPARG, a contributor to urothelial differentiation [50] that is not detected in the PRAD VCaP cell line but is highly expressed in the BLCA RT4 cell line (Fig 3C). PC-3 cells also expressed the PRAD biomarker FOLH1 [51], suggesting that PC-3 has a PRAD origin and gained urothelial or luminal characteristics through the derivation process. In short, our limited experimental data support the CCN classification results.

Subtype classification of cancer cell lines

Next, we explored the subtype classification of CCLs from three general tumor types in more depth. We focused our subtype visualization (Fig 4A-C) on CCL models with a general CCN score above 0.1 in their nominal cancer type, as this allowed us to analyze those models that fell below the general threshold but were classified as a specific subtype (Supp Tab 1-2). Focusing first on UCEC, the histologically defined subtypes of UCEC, endometrioid and serous, differ in prevalence, molecular properties, prognosis, and treatment. For instance, the endometrioid subtype, which accounts for approximately 80% of uterine cancers, retains estrogen receptor and progesterone receptor status and is responsive to progestin therapy [52,53]. Serous, a more aggressive subtype, is characterized by the loss of estrogen and progesterone receptor and is not responsive to progestin therapy [52,53]. CCN classified the majority of the UCEC cell lines as serous except for JHUEM-1, which is classified as mixed, with similarities to both endometrioid and serous (Fig 4A). The preponderance of CCLE lines of serous versus endometrioid character may be due to properties of serous cancer cells that promote their in vitro propagation, such as upregulation of cell adhesion transcriptional programs [54]. Some of our subtype classification results are consistent with prior observations.
For example, HEC-1A, HEC-1B, and KLE were previously characterized as type II endometrial cancer, which includes a serous histological subtype [55]. On the other hand, our subtype classification results contradict prior observations in at least one case. For instance, the Ishikawa cell line was derived from type I endometrial cancer (endometrioid histological subtype) [55,56]; however, CCN classified a derivative of this line, Ishikawa 02 ER-, as serous. The high serous CCN score could result from a shift in phenotype of the line concomitant with its loss of estrogen receptor (ER), as this is a distinguishing feature of type II endometrial cancer (serous histological subtype) [52]. Taken together, these results indicate a need for more endometrioid-like CCLs.

Next, we examined the subtype classification of Lung Squamous Cell Carcinoma (LUSC) and Lung adenocarcinoma (LUAD) cell lines (Fig 4B-C). All the LUSC lines with at least one subtype classification had an underlying primitive subtype classification. This is consistent either with the ease of deriving lines from tumors with a primitive character, or with a process by which cell line derivation promotes similarity to a more primitive subtype, which is marked by increased cellular proliferation [28]. Some of our results are consistent with prior reports that have investigated the resemblance of some lines to LUSC subtypes.
For example, HCC-95, previously characterized as classical [28,57], had a maximum CCN score in the classical subtype (0.429). Similarly, LUDLU-1 and EPLC-272H, previously reported as classical [57] and basal [57] respectively, had maximal tumor subtype CCN scores for these subtypes (0.323 and 0.256) (Fig 4B, Supp Tab 2) despite being classified as Unknown. Lastly, the LUAD cell lines that were classified as a subtype were either classified as proximal inflammation or proximal proliferation (Fig 4C). RERF-LC-Ad1 had the highest general classification score and the highest proximal inflammation subtype classification score. Taken together, these subtype classification results have revealed an absence of cell line models for basal and secretory LUSC, and for the Terminal respiratory unit (TRU) LUAD subtype.

Cancer cell lines’ popularity and transcriptional fidelity

Finally, we sought to measure the extent to which cell line transcriptional fidelity related to model prevalence. We used the number of papers in which a model was mentioned, normalized by the number of years since the cell line was documented, as a rough approximation of model prevalence. To explore this relationship, we plotted the normalized citation count versus general classification score, labeling the highest cited and highest classified cell lines from each general tumor type (Fig 4D).
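The prevalence measure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cell line names, paper counts, years, and scores are hypothetical, the reference year is an arbitrary choice, and clamping the elapsed time to at least one year is our own guard against division by zero rather than something described in the text.

```python
from dataclasses import dataclass

@dataclass
class CellLine:
    name: str
    tumor_type: str
    n_papers: int          # papers mentioning the line (hypothetical counts)
    year_documented: int   # year the line was first documented (hypothetical)
    ccn_score: float       # general classification score for the nominal type

def normalized_citations(line, reference_year):
    """Papers per year since the line was documented (at least one year)."""
    years = max(reference_year - line.year_documented, 1)
    return line.n_papers / years

def label_lines(lines, reference_year):
    """For each tumor type, return (highest cited line, highest classified line)."""
    labels = {}
    for tumor_type in sorted({line.tumor_type for line in lines}):
        group = [line for line in lines if line.tumor_type == tumor_type]
        most_cited = max(group, key=lambda l: normalized_citations(l, reference_year))
        best_scored = max(group, key=lambda l: l.ccn_score)
        labels[tumor_type] = (most_cited.name, best_scored.name)
    return labels

# Hypothetical records, for illustration only.
lines = [
    CellLine("LINE_X", "TYPE_1", n_papers=4000, year_documented=1980, ccn_score=0.20),
    CellLine("LINE_Y", "TYPE_1", n_papers=300, year_documented=2000, ccn_score=0.70),
]
labels = label_lines(lines, reference_year=2020)
```

In this toy input the most cited line (100 papers per year) is not the line with the highest classification score, which is the kind of mismatch the comparison is meant to surface.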
For most of the general tumor types, the highest cited cell line is not the highest classified cell line; the exceptions are Hep G2, AGS and ML-1, representing liver hepatocellular carcinoma (LIHC), stomach adenocarcinoma (STAD), and thyroid carcinoma (THCA), respectively. On the other hand, the general scores of the highest cited cell lines representing BLCA (T24), BRCA (MDA-MB-231), and PRAD (PC-3) fall below the classification threshold of 0.25. Notably, each of these tumor types has other lines with scores exceeding 0.5, which should be considered as more faithful transcriptional models when selecting lines for a study (Supp Tab 1 and http://www.cahanlab.org/resources/cancerCellNet_results/).

Evaluation of patient derived xenografts

Next, we sought to evaluate a more recent class of cancer models: PDX. To do so, we subjected the RNA-seq expression profiles of 415 PDX models from 13 different types of cancer generated previously20 to CCN. Similar to the results of CCLs, the PDXs exhibited a wide range of classification scores (Fig 5A, Supp Tab 3). By categorizing the CCN scores of PDXs based on the proportion of samples associated with each tumor type that were correctly classified, we found that SARC, SKCM, COAD_READ and BRCA have a higher proportion of correctly classified PDXs than other cancer categories (Fig 5B). In contrast to CCLs, we found a higher proportion of correctly classified PDXs in STAD, PAAD and KIRC (Fig 5B). However, similar to CCLs, no ESCA PDXs were classified as such. This held true when we performed subtype classification on PDX samples: none of the ESCA PDXs were classified as any of the ESCA subtypes (Supp Tab 4). UCEC PDXs included endometrioid, serous, and mixed subtypes, which provided a broader representation than CCLs (Fig 5C).
Several LUSC PDXs that were classified as a subtype were also classified as Head and Neck squamous cell carcinoma (HNSC) or as a mix of HNSC and LUSC (Fig 5D). This could be due to the similarity in expression profiles of basal and classical subtypes of HNSC and LUSC28,58, which is consistent with the observation that these PDXs were also subtyped as classical. No LUSC PDXs were classified as the secretory subtype. In contrast to LUAD CCLs, four of the five LUAD PDXs with a discernible subtype were classified as proximal inflammatory (Fig 5E). On the other hand, similar to the CCLs, there were no TRU subtypes in the LUAD PDX cohort. In summary, we found that while individual PDXs can reach extremely high transcriptional fidelity to both general tumor types and subtypes, many PDXs were not classified as the general tumor type from which they originated.

Evaluation of GEMMs

Next, we used CCN to evaluate GEMMs of six general tumor types from nine studies for which expression data was publicly available59–67. As was true for CCLs and PDXs, GEMMs also had a wide range of CCN scores (Fig 6A, Supp Tab 5). We next categorized the CCN scores based on the proportion of samples associated with each tumor type that were correctly classified (Fig 6B). In contrast to LGG CCLs, LGG GEMMs, generated by Nf1 mutations expressed in different neural progenitors in combination with Pten deletion66, were consistently classified as LGG (Fig 6A-B).
The GEMM dataset included multiple replicates per model, which allowed us to examine intra-GEMM variability. Both at the level of CCN score and at the level of categorization, GEMMs were invariant. For example, replicates of UCEC GEMMs driven by Prg(cre/+)Pten(lox/lox) received almost identical general CCN scores (Fig 6C, Supp Tab 6). GEMMs sharing genotypes across studies, such as LUAD GEMMs driven by Kras mutation and loss of p53 59,65,67, also received similar general and subtype classification scores (Fig 6A,B,E).

Next, we explored the extent to which genotype impacted subtype classification in UCEC, LUSC, and LUAD. Prg(cre/+)Pten(lox/lox) GEMMs had a mixed subtype classification of both serous and endometrioid, consistent with the fact that Pten loss occurs in both subtypes (albeit more frequently in endometrioid). We also analyzed Prg(cre/+)Pten(lox/lox)Csf3r-/- GEMMs. Polymorphonuclear neutrophils (PMNs), which play anti-tumor roles in endometrioid cancer progression, are depleted in these animals. Interestingly, Prg(cre/+)Pten(lox/lox)Csf3r-/- GEMMs had a serous subtype classification, which could be explained by differences in PMN involvement in endometrioid versus serous uterine tumor development that are reflected in the respective transcriptomes of the TCGA UCEC training data. We note that the tumor cells were sorted prior to RNA-seq and thus the shift in subtype classification is not due to contamination of GEMMs with non-tumor components.
In short, this analysis supports the argument that tumor-cell extrinsic factors, in this case a reduction in anti-tumor PMNs, can shift the transcriptome of a GEMM so that it more closely resembles a serous rather than endometrioid subtype.

The LUSC GEMMs that we analyzed were Lkb1fl/fl and they either overexpressed Sox2 (via two distinct mechanisms) or were also Ptenfl/fl 65. We note that the eight lenti-Sox2-Cre-infected;Lkb1fl/fl and Rosa26LSL-Sox2-IRES-GFP;Lkb1fl/fl samples that classified as 'Unknown' had LUSC CCN scores only modestly lower than the decision threshold (Fig 6D) (mean CCN score = 0.217). Thirteen of the 17 Sox2 GEMMs classified as the secretory subtype of LUSC. The consistency is not surprising given that both models overexpress Sox2 and lose Lkb1. On the other hand, the Lkb1fl/fl;Ptenfl/fl GEMMs had substantially lower general LUSC CCN scores and our subtype classification indicated that this GEMM was mostly classified as 'Unknown', in contrast to prior reports suggesting that it is most similar to a basal subtype68. None of the three LUSC GEMMs have strong classical CCN scores. Most of the LUAD GEMMs, which were generated using various combinations of activating Kras mutation, loss of Trp53, and loss of Smarca4L59,65,67, were correctly classified (Fig 6E). Those that were not classified have CCN scores modestly lower than the decision threshold (mean CCN score = 0.214). There were no substantial differences in general or subtype classification across driver genotypes. Although the subtype of all LUAD GEMMs was 'Unknown', they tended to have a mixture of high CCN proximal proliferation, proximal inflammation and TRU scores. Taken together, this analysis suggests that there is a degree of similarity, and perhaps plasticity, between the primitive and secretory (but not basal or classical) subtypes of LUSC.
On the other hand, while the LUAD GEMMs classify strongly as LUAD, they do not have a strong classification for any particular subtype, a result that does not vary by genotype.

Evaluation of Tumoroids

Lastly, we used CCN to assess a relatively novel cancer model: tumoroids. We downloaded and assessed 131 distinct tumoroid expression profiles spanning 13 cancer categories from The NCI Patient-Derived Models Repository (PDMR)69 and from three individual studies70–72 (Fig 7A, Supp Tab 7). We note that several categories have three or fewer samples (BRCA, CESC, KIRP, OV, LIHC, and BLCA from PDMR). Among the cancer categories represented by more than three samples, only LUSC and PAAD have fewer than 50% classified as their annotated label (Fig 7B). In contrast to GBM CCLs, all three induced pluripotent stem cell-derived GBM tumoroids72 were classified as GBM with high CCN scores (mean = 0.53). To further characterize the tumoroids, we performed subtype classification on them (Supp Tab 8). UCEC tumoroids from PDMR contain a wide range of subtypes, with two endometrioid, two serous and one mixed type (Fig 7C). On the other hand, LUSC tumoroids appear to be predominantly of the classical subtype, with one tumoroid classified as a mix between classical and primitive (Fig 7D). Lastly, similar to their CCL and PDX counterparts, LUAD tumoroids are classified as proximal inflammatory and proximal proliferation, with no tumoroids classified as the TRU subtype (Fig 7E).

Comparison of CCLs, PDXs, GEMMs and tumoroids

Finally, we sought to estimate the comparative transcriptional fidelity of the four cancer model modalities. We compared the general CCN scores of each model on a per tumor type basis (Fig 8). In the case of GEMMs, we used the mean classification score of all samples with shared genotypes. We also used the mean classification score of technical replicates found in LIHC tumoroids70. We evaluated models based on both the maximum CCN score, as this represents the potential for a model class, and the median CCN score, as this indicates the current overall transcriptional fidelity of a model class. PDXs achieved the highest CCN scores in three (UCEC, PAAD, LUAD) out of the five cancer categories in which all four modalities were available (Fig 8), despite having low median CCN scores. Notably, PDXs have a median CCN score above the 0.25 threshold in PAAD while none of the other three modalities have any samples above the threshold. In LIHC, the highest CCN score for PDX (0.9) is only slightly lower than the highest CCN score for tumoroid (0.91). This suggests that certain individual PDXs most closely mimic the transcriptional state of native patient tumors despite a portion of the PDXs having low CCN scores. Similarly, while the majority of the CCLs have low CCN scores, several lines achieve high transcriptional fidelity in LUSC, LUAD and LIHC (Fig 8).
Collectively, GEMMs and tumoroids had the highest median CCN scores in four of the five cancer categories (LUSC and LUAD for GEMMs and UCEC and LIHC for tumoroids). Notably, both of the LIHC tumoroids achieved CCN scores on par with patient tumors (Fig 8). In brief, this analysis indicates that PDXs and CCLs are heterogeneous in terms of transcriptional fidelity, with a portion of the models highly mimicking native tumors and the majority of the models having low transcriptional fidelity (with the exception of PAAD for PDXs). On the other hand, GEMMs and tumoroids displayed a consistently high fidelity across different models.

Because the CCN score is based on a moderate number of gene features (i.e. 1,979 gene pairs consisting of 1,689 unique genes) relative to the total number of protein-coding genes in the genome, it is possible that a cancer model with a high CCN score might not have a high global similarity to a naturally occurring tumor. Therefore, we also calculated the GRN status, a metric of the extent to which a tumor-type specific gene regulatory network is established21, for all models (Supp Fig 4). We observed a high level of correlation between the two similarity metrics, which suggests that although CCN classifies on a selected set of genes, its scores are highly correlated with a global assessment of transcriptional similarity.

We also sought to compare model modalities in terms of the diversity of subtypes that they represent (Supp Fig 5).
As a reference, we also included in this analysis the overall subtype incidence, as approximated by incidence in TCGA. Replicates in GEMMs and tumoroids were averaged into one classification profile. In models of UCEC, there is a notable difference between endometrioid incidence and the proportion of models classified as endometrioid, with only PDXs and tumoroids having any representatives (Supp Fig 5). All of the CCL, GEMM, and tumoroid models of PAAD have an unknown subtype classification and no correct general classification. However, the majority of PDXs are subtyped as either a mixture of basal and classical, or classical alone. LUAD has proximal inflammation and proximal proliferation subtypes modelled by CCLs and PDXs (Supp Fig 5). Likewise, LUSC has basal, classical and primitive subtypes modelled by CCLs and PDXs, and the secretory subtype modelled by GEMMs exclusively (Supp Fig 5). Taken together, these results demonstrate the need to carefully select different model systems to more suitably model certain cancer subtypes.

DISCUSSION

A major goal in the field of cancer biology is to develop models that mimic naturally occurring tumors with enough fidelity to enable therapeutic discoveries. However, methods to measure the extent to which cancer models resemble or diverge from native tumors are lacking. This is especially problematic now because there are many existing models from which to choose, and it has become easier to generate new models. Here, we present CancerCellNet (CCN), a computational tool that measures the similarity of cancer models to 22 naturally occurring tumor types and 36 subtypes. While the similarity of CCLs to patient tumors has already been explored in previous work, our tool introduces the capability to assess the transcriptional fidelity of PDXs, GEMMs, and tumoroids.
Because CCN is platform- and species-agnostic, it provides a consistent framework to compare models across modalities including CCLs, PDXs, GEMMs and tumoroids. Here, we applied CCN to 657 cancer cell lines, 415 patient derived xenografts, 26 distinct genetically engineered mouse models and 131 tumoroids. Several insights emerged from our computational analyses that have implications for the field of cancer biology.

First, PDXs have the greatest potential to achieve transcriptional fidelity in three out of five general tumor types for which data from all modalities was available, as indicated by the high scores of individual PDXs. Notably, PDXs are the only modality with samples classified as PAAD. At the same time, the median CCN scores of PDXs were lower than those of GEMMs and tumoroids in the other four tumor types. It is unclear what causes such a wide range of CCN scores within PDXs. We suspect that some PDXs might have undergone selective pressures in the host that distort the progression of genomic alterations away from what is observed in natural tumors73. Future work to understand this heterogeneity is important so as to yield consistently high fidelity PDXs, and to identify intrinsic and host-specific factors that so powerfully shape the PDX transcriptome.

Second, in general GEMMs and tumoroids have higher median CCN scores than those of PDXs and CCLs.
This is also consistent with the fact that GEMMs are typically derived by recapitulating well-defined driver mutations of natural tumors, and thus this observation corroborates the importance of genetics in the etiology of cancer74. Moreover, in contrast to most PDXs, GEMMs are typically generated in immune replete hosts. Therefore, the higher overall fidelity of GEMMs may also be a result of the influence of a native immune system on GEMM tumors75. The high median CCN scores of tumoroids can be attributed to several factors including the increased mechanical stimuli and cell-cell interactions that come from 3D self-organizing cultures76,77.

Third, we have found that none of the samples that we evaluated here are transcriptionally adequate models of ESCA. This may be due to an inherent lability of the transcriptome of ESCA, a tumor type that is often preceded by a metaplasia that has obscured the determination of its cell type(s) of origin78. Therefore, this tumor type requires further attention to derive new models.

Fourth, we found that in several tumor types, GEMMs tend to reflect mixtures of subtypes rather than conforming strongly to single subtypes. The reasons for this are not clear but it is possible that in the cases that we examined the histologically defined subtypes have a degree of plasticity that is exacerbated in the murine host environment.

Lastly, we recognize that many CCLs are not classified as their annotated labels.
While we have suggested that the lack of an immune component is not a major confounder, we suspect that the CCLs could undergo genetic divergence due to a high number of passages, chemotherapy before biopsy, culture conditions and genetic instability79–82, which could all be factors that drive CCLs away from their labelled tumors.

Currently, there are several limitations to our CCN tool, and caveats to our analyses, which indicate areas for future work and improvement. First, CCN is based on transcriptomic data but other molecular readouts of tumor state, such as profiles of the proteome83, epigenome84, non-coding RNA-ome84, and genome74, would be equally, if not more, important to mimic in a model system. Therefore, it is possible that some models reflect tumor behavior well, and because this behavior is not well predicted by transcriptome alone, these models have lower CCN scores. To both measure the extent that such situations exist, and to correct for them, we plan in the future to incorporate other omic data into CCN so as to make more accurate and integrated model evaluation possible. As a first step in this direction, we plan to incorporate DNA methylation and genomic sequencing data as additional features for our Random forest classifier as this data is becoming more readily available for both training and cancer models. We expect that this will allow us both to refine our tumor subtype categories and to make more accurate predictions of how models respond to perturbations such as drug treatment.

A second limitation is that in the cross-species analysis, CCN implicitly assumes that homologs are functionally equivalent. The extent to which they are not functionally equivalent determines how confounded the CCN results will be.
This possibility seems to be of limited consequence based on the high performance of the normal tissue cross-species classifier and based on the fact that GEMMs have the highest median CCN scores (in addition to tumoroids).

A third caveat to our analysis is that there were many fewer distinct GEMMs and tumoroids than CCLs and PDXs. As more transcriptional profiles for GEMMs and tumoroids emerge, this comparative analysis should be revisited to assess the generality of our results.

Finally, the TCGA training data is made up of RNA-Seq from bulk tumor samples, which necessarily includes non-tumor cells, whereas the CCLs are by definition cell lines of tumor origin. Therefore, CCLs theoretically could have artificially low CCN scores due to the presence of non-tumor cells in the training data. This problem appears to be limited as we found no correlation between tumor purity and CCN score in the CCLE samples. However, this problem is related to the question of intra-tumor heterogeneity. We demonstrated the feasibility of using CCN and single cell RNA-seq data to refine the evaluation of cancer cell lines contingent upon availability of scRNA-seq training data. As more training single cell RNA-seq data accrues, CCN would be able to not only evaluate models on a per cell type basis, but also based on cellular composition.

We have made the results of our analyses available online so that researchers can easily explore the performance of selected models or identify the best models for any of the 22 general tumor types and the 36 subtypes presented here. To ensure that CCN is widely available we have developed a free web application, which performs CCN analysis on user-uploaded data and allows for direct comparison of their data to the cancer models evaluated here. We have also made the CCN code freely available under an Open Source license and as an easily installed R package, and we are actively supporting its further development. Included in the web application are instructions for training CCN and reproducing our analysis. The documentation describes how to analyze models and compare the results to the panel of models that we evaluated here, thereby allowing researchers to immediately compare their models to the broader field in a comprehensive and standard fashion.

Online Methods

Training General CancerCellNet Classifier

To generate training data sets, we downloaded RNA-seq expression count matrices for 8,991 patient tumors and their corresponding sample tables across 22 different tumor types from TCGA using the TCGAWorkflowData, TCGAbiolinks85 and SummarizedExperiment86 packages. We used all the patient tumor samples for training the general CCN classifier.
We limited training and analysis of RNA-seq data to the 13,142 genes in common between the TCGA dataset and all the query samples (CCLs, PDXs, GEMMs, and tumoroids). To train the top pair Random forest classifier, we used a method similar to our previous method23. CCN first normalized the training counts matrix by down-sampling the counts to 500,000 counts per sample. To significantly reduce the execution time and memory required to generate gene pairs from all possible genes, CCN then selected n up-regulated genes, n down-regulated genes and n least differentially expressed genes (CCN training parameter nTopGenes = n) for each of the 22 cancer categories using template matching87 as the genes from which to generate top scoring gene pairs. In short, for each tumor type, CCN defined a template vector that labelled the training tumor samples in the cancer type of interest as 1 and all other tumor samples as 0. CCN then calculated the Pearson correlation coefficient between the template vector and the expression of each gene. The genes that strongly matched the template, as either upregulated or downregulated, had a large absolute Pearson correlation coefficient. CCN chose the upregulated, downregulated and least differentially expressed genes based on the magnitude of the Pearson correlation coefficient.

After CCN selected the genes for each cancer type, CCN generated gene pairs among those genes. Gene pair transformation was a method inspired by the top-scoring pair classifier88 to allow compatibility of the classifier with query expression profiles that were collected through different platforms (e.g. microarray query data applied to RNA-seq training data).
In brief, the gene pair transformation compares two genes within an expression sample and encodes the “gene1_gene2” gene pair as 1 if the first gene has higher expression than the second gene. Otherwise, the gene pair transformation encodes the gene pair as 0. Using all the gene pair combinations generated from the gene sets per cancer type, CCN then selected the top m discriminative gene pairs (CCN training parameter nTopGenePairs = m) for each category using the template matching (with large absolute Pearson correlation coefficient) described above. To prevent any single gene from dominating the gene pair list, we allowed each gene to appear a maximum of three times among the gene pairs selected as features per cancer type.

After the top discriminative gene pairs were selected for each cancer category, CCN grouped all the gene pairs together and gene pair transformed the training samples into a binary matrix with all the discriminative gene pairs as row names and all the training samples as column names. Using the binary gene pair matrix, CCN randomly shuffled the binary values across rows then across columns to generate random profiles that should not resemble training data from any of the cancer categories. CCN then sampled 70 random profiles, annotated them as “Unknown” and used them as training data for the “Unknown” category.
Using the gene pair binary training matrix, CCN constructed a multi-class Random Forest classifier of 2000 trees and used stratified sampling with a sample size of 60 to ensure balance of training data in constructing the decision trees.

To identify the best set of genes and gene-pair parameters (n and m), we used a grid-search cross-validation89 strategy with 5 cross-validations at each parameter set. The specific parameters for the final CCN classifier using the function “broadClass_train” in the package cancerCellNet are in Supp Tab 9. The gene-pairs are in Supp Tab 10.

Validating General CancerCellNet Classifier

Two thirds of patient tumor data from each cancer type were randomly sampled as training data to construct a CCN classifier. Based on the training data, CCN selected the classification genes and gene-pairs and trained a classifier. After the classifier was built, 35 held-out samples from each cancer category were sampled and 40 “Unknown” profiles were generated for validation. The process of randomly sampling a training set from 2/3 of all patient tumor data, selecting features based on the training set, training the classifier and validating was repeated 50 times to obtain a more comprehensive assessment of the classifier trained with the optimal parameter set. To test the performance of the final CCN on independent testing data, we applied it to 725 profiles from ICGC spanning 6 projects that do not overlap with TCGA (BRCA-KR, LIRI-JP, OV-AU, PACA-AU, PACA-CA, PRAD-FR).
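
The template matching and gene pair transformation steps described under “Training General CancerCellNet Classifier” can be illustrated with a minimal Python sketch. This is not the cancerCellNet R implementation: the function names `template_match` and `gene_pair_transform` and the toy expression matrix are hypothetical and chosen only for illustration, and the sketch omits down-sampling, the selection of the top n genes and m gene pairs, and the cap of three appearances per gene.

```python
import numpy as np

def template_match(expr, labels, target):
    """Pearson correlation of each gene (row of expr) with a 0/1 template
    that is 1 for samples labelled `target` and 0 for all other samples."""
    template = (np.asarray(labels) == target).astype(float)
    centered_expr = expr - expr.mean(axis=1, keepdims=True)
    centered_tmpl = template - template.mean()
    numerator = centered_expr @ centered_tmpl
    denominator = np.sqrt((centered_expr ** 2).sum(axis=1) * (centered_tmpl ** 2).sum())
    # Genes with zero variance get a correlation of 0 instead of NaN.
    return np.divide(numerator, denominator,
                     out=np.zeros_like(numerator), where=denominator > 0)

def gene_pair_transform(expr, genes, pairs):
    """Encode each (gene1, gene2) pair as 1 in samples where gene1 has
    higher expression than gene2, and as 0 otherwise."""
    index = {gene: i for i, gene in enumerate(genes)}
    rows = [(expr[index[a]] > expr[index[b]]).astype(int) for a, b in pairs]
    return np.vstack(rows)

# Toy data (hypothetical): 3 genes x 4 samples, two samples per tumor type.
genes = ["G1", "G2", "G3"]
expr = np.array([
    [9.0, 8.0, 1.0, 2.0],   # G1: high in type A
    [1.0, 2.0, 9.0, 8.0],   # G2: low in type A
    [5.0, 5.0, 5.0, 5.0],   # G3: not differentially expressed
])
labels = ["A", "A", "B", "B"]

scores = template_match(expr, labels, "A")
pair_matrix = gene_pair_transform(expr, genes, [("G1", "G2"), ("G3", "G1")])
```

In this toy example, G1 correlates strongly and positively with the type A template, G2 strongly and negatively, and G3 not at all, which mirrors the up-regulated, down-regulated and least differentially expressed gene sets. Because each gene pair feature depends only on the within-sample ordering of two genes, the resulting binary matrix is insensitive to platform-specific scaling of expression values.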

Selecting Decision Thresholds

Our strategy for selecting a decision threshold was to find the value that maximizes the average Macro F1 measure90 for each of the 50 cross-validations that were performed with the optimal parameter set, testing thresholds between 0 and 1 with a 0.01 increment. The F1 measure is defined as:

$$\text{Macro } F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$

We selected the most commonly occurring threshold above 0.2 that maximized the average Macro F1 measure across the 50 cross-validations as the decision threshold for the final classifier (threshold = 0.25). The same approach was applied for the subtype classifiers. The thresholds and the corresponding average precision, recall and F1 measures are recorded in Supp Tab 11.

Classifying Query Data into General Cancer Categories

We downloaded the RNA-seq cancer cell line expression profiles and sample table from https://portals.broadinstitute.org/ccle/data, and microarray cancer cell line expression profiles and sample table from Barretina et al 37. We extracted two WT control NCCIT RNA-seq expression profiles from Grow et al91. We received PDX expression estimates and sample annotations from the authors of Gao et al 20. We gathered GEMM expression profiles from nine different studies59–67. We downloaded tumoroid expression profiles from The NCI Patient-Derived Models Repository (PDMR)69 and from three individual studies70–72.
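
The threshold search described under “Selecting Decision Thresholds” can be sketched in Python as follows. Several details here are assumptions rather than statements about the cancerCellNet code: a sample is called positive for a category when its score exceeds the threshold, precision and recall are averaged over categories before being combined into the Macro F1 measure, ties are resolved in favour of the lowest threshold, and the consensus threshold is taken as the most frequent per-run optimum above 0.2. The function names and the toy scores are hypothetical.

```python
from collections import Counter
import numpy as np

def macro_f1(scores, truth, classes, threshold):
    """scores: samples x classes array of classifier scores;
    truth: the true label of each sample."""
    truth = np.asarray(truth)
    precisions, recalls = [], []
    for j, cls in enumerate(classes):
        called = scores[:, j] > threshold   # samples called as this class
        actual = truth == cls               # samples that truly are this class
        true_pos = np.sum(called & actual)
        precisions.append(true_pos / called.sum() if called.sum() else 0.0)
        recalls.append(true_pos / actual.sum() if actual.sum() else 0.0)
    precision, recall = np.mean(precisions), np.mean(recalls)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def best_threshold(scores, truth, classes):
    """Scan thresholds from 0 to 1 in steps of 0.01; np.argmax returns the
    first (lowest) threshold that attains the maximum Macro F1."""
    grid = np.round(np.linspace(0.0, 1.0, 101), 2)
    f1_values = [macro_f1(scores, truth, classes, t) for t in grid]
    return float(grid[int(np.argmax(f1_values))])

def consensus_threshold(per_run_thresholds, floor=0.2):
    """Most commonly occurring per-run optimum above `floor`."""
    eligible = [t for t in per_run_thresholds if t > floor]
    return Counter(eligible).most_common(1)[0][0]

# Toy validation run (hypothetical scores for two categories).
classes = ["A", "B"]
truth = ["A", "A", "B", "B"]
scores = np.array([
    [0.90, 0.12],
    [0.60, 0.28],
    [0.305, 0.70],
    [0.10, 0.80],
])
best = best_threshold(scores, truth, classes)
consensus = consensus_threshold([0.25, 0.31, 0.25, 0.15, 0.15, 0.15])
```

In practice `best_threshold` would be applied to each of the 50 validation runs and the resulting list passed to `consensus_threshold`.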
To use the CCN classifier on GEMM data, the mouse genes from GEMM expression profiles were converted into their human homologs. The query samples were classified using the final CCN classifier. Each query classification profile was labelled as one of four classification categories: “correct”, “mixed”, “none” and “other”. If a sample has a CCN score higher than the decision threshold only in its labelled cancer category, we assigned it as “correct”. If a sample has CCN scores higher than the decision threshold in its labelled cancer category and in other cancer categories, we assigned it as “mixed”. If a sample has no CCN score higher than the decision threshold in any cancer category, or has its highest CCN score in the ‘Unknown’ category, we assigned it as “none”. If a sample has CCN scores higher than the decision threshold in one or more cancer categories that do not include the labelled cancer category, we assigned it as “other”. We analyzed and visualized the results using R and the R packages pheatmap92 and ggplot293.

Cross-Species Assessment

To assess the performance of cross-species classification, we downloaded 1003 labelled human tissue/cell type and 1993 labelled mouse tissue/cell type RNA-seq expression profiles from GitHub (https://github.com/pcahan1/CellNet). We first converted the mouse genes into their human homologs. Then we found the intersecting genes between the mouse tissue/cell expression profiles and the human tissue/cell expression profiles. Limiting the input of human tissue RNA-seq profiles to the intersecting genes, we trained a CCN classifier with all the human tissue/cell expression profiles. The parameters used for the function “broadClass_train” in the package cancerCellNet are in Supp Tab 9.
We randomly sampled 75 samples from each tissue category in the mouse tissue/cell data and applied the classifier to those samples to assess performance.

Cross-Technology Assessment

To assess the performance of CCN in applications to microarray data, we gathered 6,219 patient tumor microarray profiles across 12 different cancer types from more than 100 different projects (Supp Tab 12). We found the intersecting genes between the microarray profiles and the TCGA patient RNA-seq profiles. Limiting the input of RNA-seq profiles to the intersecting genes, we created a CCN classifier with all the TCGA patient profiles using the parameters for the function “broadClass_train” listed in Supp Tab 9. After the microarray-specific classifier was trained, we randomly sampled 60 microarray patient samples from each cancer category and applied the CCN classifier to them to assess cross-technology performance (Supp Fig 2A). The same CCN classifier was used to assess microarray CCL samples (Supp Fig 2B).

Training and validating scRNA-seq Classifier

We extracted labelled human melanoma and glioblastoma scRNA-seq expression profiles40,41 and compiled the two datasets, excluding 3 cell types (T.CD4, T.CD8 and Myeloid) because they had too few cells for training. 60 cells from each of the 11 cell types were sampled for training a scRNA-seq classifier. The parameters for training a general scRNA-seq classifier using the function “broadClass_train” are in Supp Tab 9.
25 cells from each of the 11 cell types in the held-out data were selected to assess the single-cell classifier. Using maximization of the average Macro F1 measure, we selected a decision threshold of 0.255. The gene-pairs that were selected to construct the classifier are in Supp Tab 10. To assess the cross-technology capability of applying the scRNA-seq classifier to bulk RNA-seq, we downloaded 305 expression profiles spanning 4 purified cell types (B cells, endothelial cells, monocyte/macrophage, fibroblast) from https://github.com/pcahan1/CellNet.

Training Subtype CancerCellNet

We found 11 cancer types (BRCA, COAD, ESCA, HNSC, KIRC, LGG, PAAD, UCEC, STAD, LUAD, LUSC) that have meaningful subtypes based on either histology or molecular profile and have sufficient samples to train a subtype classifier with high AUPR. We also included normal tissue samples from BRCA, COAD, HNSC, KIRC and UCEC to create a normal tissue category in the construction of their subtype classifiers. Training samples were labelled either as a cancer subtype for the cancer of interest or as “Unknown” if they belonged to other cancer types. Similar to general classifier training, CCN performed gene pair transformation and selected the most discriminative gene pairs for each cancer subtype.
In addition to the gene pairs selected to discriminate cancer subtypes, CCN also performed general classification of all training data and appended the resulting classification profiles to the gene pair binary matrix as additional features. The rationale for using the general classification profile as additional features is that many general cancer types may share similar subtypes, so the general classification profile can be an important feature for discriminating the general cancer type of interest from other cancer types before performing finer subtype classification. The specific parameters used to train individual subtype classifiers using the “subClass_train” function of the CancerCellNet package can be found in Supp Tab 9 and the gene pairs are in Supp Tab 10.

Validating Subtype CancerCellNet

Similar to validating the general classifier, we randomly sampled 2/3 of all samples in each cancer subtype as training data and sampled an equal number across subtypes from the 1/3 held-out data for assessing subtype classifiers. We repeated the process 20 times for a more comprehensive assessment of the subtype classifiers.

Classifying Query Data into Subtypes

We assigned a subtype to a query sample if the query sample had a CCN score higher than the decision threshold. The decision thresholds for the subtype classifiers are in Supp Tab 11.
If no CCN score exceeded the decision threshold in any subtype, or if the highest CCN score was in the ‘Unknown’ category, we assigned that sample as ‘Unknown’. Analysis was performed in R and visualizations were generated with the ComplexHeatmap package94.

Cell culture, Immunohistochemistry and histomorphometry

Caov-4 (ATCC® HTB-76™), SK-OV-3 (ATCC® HTB-77™), RT4 (ATCC® HTB-2™), and NCCIT (ATCC® CRL-2073™) cell lines were purchased from ATCC. HEC-59 (C0026001) and A2780 (93112519-1VL) were obtained from Addexbio Technologies and Sigma-Aldrich, respectively. Vcap and PC-3. SK-OV-3, Vcap, and RT4 were cultured in Dulbecco's Modified Eagle Medium (DMEM, high glucose, 11960069, Gibco) with 1% Penicillin-Streptomycin-Glutamine (10378016, Life Technologies); Caov-4, PC-3, NCCIT, and A2780 were cultured using RPMI-1640 medium (11875093, Gibco) while HEC-59 was cultured in Iscove's Modified Dulbecco's Medium (IMDM, 12440053, Gibco). Both media were supplemented with 1% Penicillin-Streptomycin (15140122, Gibco). All media included 10% Fetal Bovine Serum (FBS).

Cells cultured in 48-well plates were washed twice with PBS and fixed in 10% buffered formalin for 24 hrs at 4 °C. Immunostaining was performed using a standard protocol. Cells were incubated with primary antibodies to goat HOXB6 (10 µg/mL, PA5-37867, Invitrogen), mouse WT1 (10 µg/mL, MA1-46028, Invitrogen), rabbit PPARG (1:50, ABN1445, Millipore), mouse FOLH1 (10 µg/mL, UM570025, Origene), and rabbit LIN28A (1:50, #3978, Cell Signaling) in Antibody Diluent (S080981-2, DAKO) at 4 °C overnight, followed by three 5 min washes in TBST.
The slides were then incubated with fluorescence-conjugated secondary antibodies at room temperature for 1 h, protected from light, followed by three 5 min washes in TBST and nuclear staining with mounting medium containing DAPI. Images were captured with a Nikon Eclipse Ti-S, DS-U3 and DS-Qi2.

Histomorphometry was performed using ImageJ (Version 2.0.0-rc-69/1.52i). The percentage of positive cells (% N.positive cells) was calculated as the number of positively stained cells divided by the number of DAPI-positive nuclei within three randomly chosen areas. The data are expressed as means ± SD.

Tumor Purity Analysis

We used the R package ESTIMATE95 to calculate the ESTIMATE scores from the TCGA tumor expression profiles that we used as training data for the CCN classifier. To calculate tumor purity we used the equation described in Yoshihara et al., 201395:

$$\text{Tumour purity} = \cos(0.6049872018 + 0.0001467884 \times \text{ESTIMATE score})$$

Extracting Citation Counts

We used the R package RISmed96 to extract the number of citations for each cell line through a query search of “cell line name[Text Word] AND cancer[Text Word]” on PubMed. The citation counts were normalized by dividing them by the number of years since the cell line was first documented.

$$\text{Normalized citation counts} = \frac{\text{citation counts}}{\text{\# years since first documented}}$$

GRN construction and GRN Status

GRN construction was extended from our previous method21.
80 samples per cancer type were randomly sampled and normalized through down sampling as training data for the CLR GRN construction algorithm. Cancer type specific GRNs were identified by determining the differentially expressed genes for each cancer type and extracting the subnetwork using those genes.

To extend the original GRN status algorithm21 across different platforms and species, we devised a rank-based GRN status algorithm. Like the original GRN status, rank-based GRN status is a metric for assessing the similarity of the cancer type specific GRN between training data in the cancer type of interest and query samples. Hence, a high GRN status represents a high level of establishment or similarity of the cancer specific GRN in the query sample compared to that of the training data. The expression profiles of training data and query data were transformed into rank expression profiles by replacing the expression values with the rank of the expression values within a sample (the highest expressed gene has the highest rank and the lowest expressed gene has a rank of 1). The cancer type specific mean and standard deviation of every gene’s rank expression were learned from training data.
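The rank transformation and the per-gene rank statistics described above can be sketched in Python as follows. The sketch is illustrative only and is not taken from the cancerCellNet package: the function names are hypothetical, ties are broken by gene name, and the sample standard deviation is used, neither of which is specified in the text.

```python
from statistics import mean, stdev


def rank_transform(expression):
    """Map gene -> expression value to gene -> within-sample rank.

    The lowest expressed gene receives rank 1 and the highest expressed
    gene receives the highest rank. Ties are broken by gene name
    (an assumption of this sketch).
    """
    ordered = sorted(expression, key=lambda gene: (expression[gene], gene))
    return {gene: rank for rank, gene in enumerate(ordered, start=1)}


def learn_rank_stats(training_profiles):
    """Return gene -> (mean rank, standard deviation of rank).

    training_profiles: list of dicts (gene -> expression) for one cancer
    type, all sharing the same genes; at least two profiles are required.
    """
    ranked = [rank_transform(profile) for profile in training_profiles]
    stats = {}
    for gene in ranked[0]:
        gene_ranks = [sample[gene] for sample in ranked]
        stats[gene] = (mean(gene_ranks), stdev(gene_ranks))
    return stats


# Hypothetical toy profiles for a single cancer type.
toy_training = [
    {"A": 5.0, "B": 0.0, "C": 9.0},
    {"A": 1.0, "B": 2.0, "C": 3.0},
]
toy_stats = learn_rank_stats(toy_training)
```

Because only within-sample ranks are used, profiles measured on different platforms or scales can be compared without further normalization, which is the motivation the text gives for the rank-based variant.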
The modified Z-score values for genes within the cancer type specific GRN were calculated from the query sample’s rank expression profile to quantify how dissimilar the expression values of genes in the query sample’s cancer type specific GRN are from those of the reference training data:

$$
Zscore(\text{gene } i)_{mod} =
\begin{cases}
0, & \text{if } Zscore \text{ is positive and the gene is found to be upregulated} \\
0, & \text{if } Zscore \text{ is negative and the gene is found to be downregulated} \\
abs(Zscore), & \text{otherwise}
\end{cases}
$$

If a gene in the cancer type specific GRN is found to be upregulated in the specific cancer type relative to other cancer types, then we consider the query sample’s gene to be similar if its rank is equal to or greater than the mean rank of the gene in the training samples. Because it is similar, we assign that gene a Z-score of 0. The same principle applies to cases where the gene is downregulated in the cancer specific subnetwork.

GRN status for a query sample is calculated as the weighted mean of $(1000 - Zscore(\text{gene } i)_{mod})$ across the genes in the cancer type specific GRN. 1000 is an arbitrary large number; greater dissimilarity between the query’s cancer type specific GRN and that of the training data yields high Z-scores for the GRN genes and a low GRN status.

$$RGS = \sum_{i=1}^{n} \left(1000 - Zscore(\text{gene } i)_{mod}\right) \times weight_{\text{gene } i}$$

$$GRN\ Status = \frac{RGS}{\sum_{i=1}^{n} weight_{\text{gene } i}}$$

The weight of individual genes in the cancer specific network is determined by the importance of the gene in the Random Forest classifier.
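A minimal Python sketch of the modified Z-score and the weighted GRN status defined above is given below. It is a hedged illustration rather than the package implementation: the function names, the `"up"`/`"down"` direction labels, and the toy numbers are hypothetical, and the sketch assumes that every gene has a positive standard deviation of rank in the training data.

```python
def modified_zscore(rank, mean_rank, sd_rank, direction):
    """Penalize only deviations that run against the expected direction.

    direction is "up" if the gene is upregulated in the cancer type of
    interest and "down" if it is downregulated. sd_rank must be positive.
    """
    z = (rank - mean_rank) / sd_rank
    if direction == "up" and z > 0:
        return 0.0
    if direction == "down" and z < 0:
        return 0.0
    return abs(z)


def grn_status(query_ranks, rank_stats, directions, weights, offset=1000.0):
    """Weighted mean of (offset - modified Z-score) over the GRN genes.

    rank_stats: gene -> (mean rank, sd of rank) learned from training data.
    weights: gene -> importance of the gene in the classifier.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for gene, direction in directions.items():
        mean_rank, sd_rank = rank_stats[gene]
        z_mod = modified_zscore(query_ranks[gene], mean_rank, sd_rank, direction)
        weighted_sum += (offset - z_mod) * weights[gene]
        total_weight += weights[gene]
    return weighted_sum / total_weight


# Hypothetical two-gene GRN.
toy_rank_stats = {"G1": (100.0, 10.0), "G2": (50.0, 5.0)}
toy_directions = {"G1": "up", "G2": "down"}
toy_weights = {"G1": 2.0, "G2": 1.0}
status = grn_status({"G1": 120, "G2": 60}, toy_rank_stats, toy_directions, toy_weights)
```

In the toy query, G1 ranks above its training mean and is expected to be upregulated, so it contributes no penalty, whereas G2 ranks above its mean despite being expected to be downregulated and is penalized by its absolute Z-score.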
Finally, the GRN status is normalized with respect to the GRN status of the cancer type of interest and the cancer type with the lowest mean GRN status.

$$\text{Normalized GRN status} = \frac{GRN\ status_{\text{query}} - avg(GRN\ status_{\text{min cancer}})}{avg(GRN\ status_{\text{cancer type interest}})}$$

Here “min cancer” denotes the cancer type whose training data have the lowest mean GRN status in the cancer type of interest, so $avg(GRN\ status_{\text{min cancer}})$ is the lowest average GRN status in the cancer type of interest, and $avg(GRN\ status_{\text{cancer type interest}})$ is the average GRN status of the cancer type of interest in the training data.

Code availability

CancerCellNet code and documentation are available at GitHub: https://github.com/pcahan1/cancerCellNet

Acknowledgements

This work was supported by the National Institutes of Health NCI Ovarian Cancer SPORE P50CA228991 via a Development Research Program award to PC. FWH was supported by a Prostate Cancer Foundation Young Investigator Award, Department of Defense W81XWH-17-PCRP-HD (F.W.H.), the National Institutes of Health/National Cancer Institute P20 CA233255-01 (F.W.H.), and U19 CA214253 (F.W.H.). We would like to thank John Powers, Hao Zhu, Tian-Li Wang, Charles Eberhart, and Kaloyan Tsanov for comments on the manuscript and helpful discussions. Some figures were created in part with Biorender.com.

FIGURE LEGENDS

Fig. 1 CancerCellNet (CCN) workflow, training, and performance. (A) Schematic of CCN usage.
CCN was designed to assess and compare the expression profiles of cancer models such as CCLs, PDXs, GEMMs, and tumoroids with native patient tumors. To use a trained classifier, CCN takes the query samples as input (e.g. expression profiles from CCLs, PDXs, GEMMs, tumoroids) and generates a classification profile for the query samples. The column names of the classification heatmap represent sample annotations and the row names represent different cancer types. Each grid cell is colored from black to yellow, representing the lowest classification score (e.g. 0) to the highest classification score (e.g. 1). (B) Schematic of CCN training process. CCN uses patient tumor expression profiles of 22 different cancer types from TCGA as training data. First, CCN identifies n genes that are upregulated, n that are downregulated, and n that are relatively invariant in each tumor type versus all of the others. Then, CCN performs a pair transform on these genes and subsequently selects the most discriminative set of m gene pairs for each cancer type as features (or predictors) for the Random Forest classifier. Lastly, CCN trains a multi-class Random Forest classifier using gene-pair transformed training data. (C) Parameter optimization strategy. 5 cross-validations of each parameter set, in which 2/3 of TCGA data was used to train and 1/3 to validate, were used to search for the values of n and m that maximized performance of the classifier as measured by area under the precision recall curve (AUPRC). (D) Mean and standard deviation of classifiers based on 50 cross-validations with the optimal parameter set. (E) AUPRC of the final CCN classifier when applied to independent patient tumor data from ICGC.

Fig. 2 Evaluation of cancer cell lines. (A) General classification heatmap of CCLs extracted from CCLE. Column annotations of the heatmap represent the labelled cancer category of the CCLs given by CCLE and the row names of the heatmap represent different cancer categories. CCLs’ general classification profiles are categorized into 4 categories: correct (red), correct mixed (pink), no classification (light green) and other classification (dark green) based on the decision threshold of 0.25. (B) Bar plot representing the proportion of each classification category in CCLs across cancer types, ordered from the cancer type with the highest proportion of correct and correct mixed CCLs to the lowest. (C) Comparison between SKCM general CCN scores from the bulk RNA-seq classifier and SKCM malignant CCN scores from the scRNA-seq classifier for SKCM CCLs. (D) Comparison between SARC general CCN scores from the bulk RNA-seq classifier and CAF CCN scores from the scRNA-seq classifier for SKCM CCLs. (E) Comparison between GBM general CCN scores from the bulk RNA-seq classifier and GBM neoplastic CCN scores from the scRNA-seq classifier for GBM CCLs. (F) Comparison between SARC general CCN scores and CAF CCN scores from the scRNA-seq classifier for GBM CCLs. The green lines indicate the decision thresholds for the scRNA-seq classifier and the general classifier.

Fig. 3 Immunofluorescence of selected cell lines. (A) Classification profiles (left) and IF expression (middle) of Caov-4 (OV positive control), HEC-59 (UCEC positive control) and SK-OV-3 for WT1 (OV biomarker) and HOXB6 (uterine biomarker).
The bar plots quantify the average percentage of positive cells for WT1 (top-right) and HOXB6 (bottom-right). (B) Classification profiles (left) and IF expression (middle) of Caov-4, NCCIT (germ cell tumor positive control) and A2780 for WT1 and LIN28A (germ cell tumor biomarker). Classification of NCCIT was performed using RNA-seq profiles of WT control NCCIT duplicates from Grow et al91. The bar plots quantify the average percentage of positive cells for WT1 (top-right) and LIN28A (bottom-right). (C) Classification profiles (left) and IF expression (middle) of Vcap (PRAD positive control), RT4 (BLCA positive control) and PC-3 for FOLH1 (prostate biomarker) and PPARG (urothelial biomarker). The bar plots quantify the average percentage of positive cells for FOLH1 (top-right) and PPARG (bottom-right).

Fig. 4 Subtype classification of CCLs and CCL prevalence. The heatmap visualizations represent subtype classification of (A) UCEC CCLs, (B) LUSC CCLs and (C) LUAD CCLs. Only samples with CCN scores > 0.1 in their nominal tumor type are displayed. (D) Comparison of normalized citation counts and general CCN classification scores of CCLs. Labelled cell lines have either the highest CCN classification score in their labelled cancer category or the highest normalized citation count. Each citation count was normalized by the number of years since first documented on PubMed.

Fig. 5 Evaluation of patient derived xenografts. (A) General classification heatmap of PDXs.
Column annotations represent annotated cancer type of the PDXs, and row names represent cancer categories. (B) Proportion of classification categories in PDXs across cancer types is visualized in the bar plot and ordered from the cancer type with the highest proportion of correct and mixed correct classified PDXs to the lowest. Subtype classification heatmaps of (C) UCEC PDXs, (D) LUSC PDXs and (E) LUAD PDXs. Only samples with CCN scores > 0.1 in their nominal tumor type are displayed.

Fig. 6 Evaluation of genetically engineered mouse models. (A) General classification heatmap of GEMMs. Column annotations represent annotated cancer type of the GEMMs, and row names represent cancer categories. (B) Proportion of classification categories in GEMMs across cancer types is visualized in the bar plot and ordered from the cancer type with the highest proportion of correct and mixed correct classified GEMMs to the lowest. Subtype classification heatmaps of (C) UCEC GEMMs, (D) LUSC GEMMs and (E) LUAD GEMMs. Only samples with CCN scores > 0.1 in their nominal tumor type are displayed.

Fig. 7 Evaluation of tumoroid models. (A) General classification heatmap of tumoroids. Column annotations represent annotated cancer type of the tumoroids, and row names represent cancer categories.
(B) Proportion of classification categories in tumoroids across cancer types is visualized in the bar plot and ordered from the cancer type with the highest proportion of correct and mixed correct classified tumoroids to the lowest. Subtype classification heatmaps of (C) UCEC tumoroids, (D) LUSC tumoroids and (E) LUAD tumoroids. Only samples with CCN scores > 0.1 in their nominal tumor type are displayed.

Fig. 8 Comparison of CCLs, PDXs, and GEMMs. Box-and-whiskers plot comparing general CCN scores across CCLs, GEMMs, PDXs of five general tumor types (UCEC, PAAD, LUSC, LUAD, LIHC).

Supplementary Information

Supplementary Figure 1 Assessment of CCN general classifier and subtype classifier. (A) Mean AUPRC of repeated grid-search cross-validation for each parameter grid. (B) Mean and range of the CCN classifier’s PR curves from 50 cross-validations based on the optimal feature selection parameters n and m. (C) AUPRC of the CCN human tissue classifier when applied to mouse tissue data. (D) Schematic of training a subtype classifier in CCN. CCN uses patient tumor expression profiles from the cancer of interest as training data. CCN performs gene-pair transformation and selects the most discriminative gene pairs among the cancer subtypes from training data as features. CCN then applies the general classification to the training data and uses the general classification profile as features in addition to gene pairs for training a Random Forest classifier. The weight of the general classification profiles as features can be tuned to improve AUPRC. (E) The mean and standard deviation of AUPRC for 11 subtype classifiers based on 20 iterations of random sampling of training and held-out data, training the subtype classifier using training data, classification of held-out data, and calculation of recall and precision.

Supplementary Figure 2 Further validation of CCN and classification results. To validate the cross-platform classification performance of CCN, a new classifier for microarray data was trained using RNA-seq data from TCGA as training data, restricted to the intersecting genes between the RNA-seq data and the microarray data. (A) AUPRC of the CCN classifier when applied to tumor profiles assayed on microarrays. (B) Classification heatmap of CCLs using microarray expression data. (C) Pearson correlation between CCN scores of CCLE lines generated from RNA-seq data and microarray data. (D) Comparison between CCLs’ CCN scores and the similarity metric from Yu et al15, the median correlation of transcriptional profiles between CCLs and TCGA tumors from the CCLs’ labelled cancer category. (E) Comparison of mean tumor purity of training data and mean CCN scores of CCLs for each cancer category.

Supplementary Figure 3 Single-cell classification of SKCM and GBM cell lines. (A) AUPRC of the single-cell classifier when applied to scRNA-seq held-out data. (B) AUPRC of the scRNA-seq classifier when applied to purified bulk RNA samples. (C) Single-cell classification of SKCM CCLs. Red bar-plot (top) represents general CCN scores in SARC and blue bar-plot (bottom) represents general CCN scores in SKCM. (D) Single-cell classification of GBM CCLs. Red bar-plot (top) represents general CCN scores in SARC and yellow bar-plot (bottom) represents general CCN scores in GBM.

Supplementary Figure 4 Correlation between cancer type specific network GRN status and general CCN scores.

Supplementary Figure 5 Proportion of cancer subtypes in different cancer models and TCGA tumor data across 11 general cancer types.

Supplementary Table 1 General classification profiles of CCLs.

Supplementary Table 2 Subtype classification profiles of CCLs.

Supplementary Table 3 General classification profiles of PDXs.

Supplementary Table 4 Subtype classification profiles of PDXs.

Supplementary Table 5 General classification profiles of GEMMs.

Supplementary Table 6 Subtype classification profiles of GEMMs.

Supplementary Table 7 General classification profiles of tumoroids.

Supplementary Table 8 Subtype classification profiles of tumoroids.

Supplementary Table 9 Specific parameters used for training of all classifiers.

Supplementary Table 10 Gene-pairs selected for final training of CCN general, subtype classifiers and single-cell classifier.

Supplementary Table 11 Decision thresholds and the corresponding precision and recall for the general classifier and subtype classifier.

Supplementary Table 12 Accessions of tumor microarray data used in validation.

REFERENCES

1. Sharma, S. V., Haber, D. A. & Settleman, J. Cell line-based platforms to evaluate the therapeutic efficacy of candidate anticancer agents. Nat. Rev. Cancer 10, 241–253 (2010).
2. Kersten, K., de Visser, K.
E., van Miltenburg, M. H. & Jonkers, J. Genetically engineered mouse models in oncology research and cancer medicine. EMBO Mol. Med. 9, 137–153 (2017).
3. Hidalgo, M. et al. Patient-derived xenograft models: an emerging platform for translational cancer research. Cancer Discov. 4, 998–1013 (2014).
4. Drost, J. & Clevers, H. Organoids in cancer research. Nat. Rev. Cancer 18, 407–418 (2018).
5. Klijn, C. et al. A comprehensive transcriptional portrait of human cancer cell lines. Nat. Biotechnol. 33, 306–312 (2015).
6. Koren, S. et al. PIK3CA(H1047R) induces multipotency and multi-lineage mammary tumours. Nature 525, 114–118 (2015).
7. DeRose, Y. S. et al. Tumor grafts derived from women with breast cancer authentically reflect tumor pathology, growth, metastasis and disease outcomes. Nat. Med. 17, 1514–1520 (2011).
8. Sharpless, N. E. & Depinho, R. A. The mighty mouse: genetically engineered mouse models in cancer drug development. Nat. Rev. Drug Discov. 5, 741–754 (2006).
9. Mouradov, D. et al. Colorectal cancer cell lines are representative models of the main molecular subtypes of primary cancer. Cancer Res. 74, 3238–3247 (2014).
10. Stuckelberger, S. & Drapkin, R. Precious GEMMs: emergence of faithful models for ovarian cancer research. J. Pathol. 245, 129–131 (2018).
11. Domcke, S., Sinha, R., Levine, D. A., Sander, C. & Schultz, N. Evaluating cell lines as tumour models by comparison of genomic profiles. Nat. Commun.
4, 2126 1013 (2013). 1014 12. Jiang, G. et al. Comprehensive comparison of molecular portraits between cell lines 1015 and tumors in breast cancer. BMC Genomics 17 Suppl 7, 525 (2016). 1016 13. Chen, B., Sirota, M., Fan-Minogue, H., Hadley, D. & Butte, A. J. Relating 1017 hepatocellular carcinoma tumor samples and cell lines using gene expression data 1018 in translational research. BMC Med. Genomics 8 Suppl 2, S5 (2015). 1019 14. Vincent, K. M., Findlay, S. D. & Postovit, L. M. Assessing breast cancer cell lines as 1020 tumour models by comparison of mRNA expression profiles. Breast Cancer Res. 1021 17, 114 (2015). 1022 15. Yu, K. et al. Comprehensive transcriptomic analysis of cell lines as models of 1023 primary tumors across 22 tumor types. Nat. Commun. 10, 3574 (2019). 1024 16. Najgebauer, H. et al. CELLector: Genomics-Guided Selection of Cancer In Vitro 1025 Models. Cell Syst. 10, 424–432.e6 (2020). 1026 17. Salvadores, M., Fuster-Tormo, F. & Supek, F. Matching cell lines with cancer type 1027 and subtype of origin via mutational, epigenomic, and transcriptomic patterns. Sci. 1028 Adv. 6, (2020). 1029 18. Guernet, A. & Grumolato, L. CRISPR/Cas9 editing of the genome for cancer 1030 modeling. Methods 121-122, 130–137 (2017). 1031 19. Gargiulo, G. Next-Generation in vivo Modeling of Human Cancers. Front. Oncol. 8, 1032 429 (2018). 1033 20. Gao, H. et al. High-throughput screening using patient-derived tumor xenografts to 1034 predict clinical trial drug response. Nat. Med. 21, 1318–1325 (2015). 1035 21. Cahan, P. et al. CellNet: network biology applied to stem cell engineering. Cell 158, 1036 903–915 (2014). 1037 22. Radley, A. H. et al. Assessment of engineered cells using CellNet and RNA-seq. 1038 Nat. Protoc. 12, 1089–1102 (2017). 1039 23. Tan, Y. & Cahan, P. SingleCellNet: A Computational Tool to Classify Single Cell 1040 RNA-Seq Data Across Platforms and Across Species. Cell Syst. 9, 207–213.e2 1041 (2019). 1042 24. Cancer Genome Atlas Network. 
Comprehensive molecular characterization of 1043 human colon and rectal cancer. Nature 487, 330–337 (2012). 1044 25. Zhang, J. et al. International Cancer Genome Consortium Data Portal--a one-stop 1045 shop for cancer genomics data. Database (Oxford) 2011, bar026 (2011). 1046 26. Cancer Genome Atlas Network. Comprehensive molecular portraits of human 1047 breast tumours. Nature 490, 61–70 (2012). 1048 27. Parker, J. S. et al. Supervised risk predictor of breast cancer based on intrinsic 1049 subtypes. J. Clin. Oncol. 27, 1160–1167 (2009). 1050 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2020.03.27.012757doi: bioRxiv preprint https://doi.org/10.1101/2020.03.27.012757 http://creativecommons.org/licenses/by-nc-nd/4.0/ 39 28. Wilkerson, M. D. et al. Lung squamous cell carcinoma mRNA expression subtypes 1051 are reproducible, clinically important, and correspond to normal cell types. Clin. 1052 Cancer Res. 16, 4864–4875 (2010). 1053 29. Cancer Genome Atlas Research Network. Electronic address: 1054 andrew_aguirre@dfci.harvard.edu & Cancer Genome Atlas Research Network. 1055 Integrated genomic characterization of pancreatic ductal adenocarcinoma. Cancer 1056 Cell 32, 185–203.e13 (2017). 1057 30. Cancer Genome Atlas Research Network et al. Integrated genomic characterization 1058 of endometrial carcinoma. Nature 497, 67–73 (2013). 1059 31. Cancer Genome Atlas Research Network et al. Integrated genomic characterization 1060 of oesophageal carcinoma. Nature 541, 169–175 (2017). 1061 32. Cancer Genome Atlas Network. Comprehensive genomic characterization of head 1062 and neck squamous cell carcinomas. Nature 517, 576–582 (2015). 1063 33. Cancer Genome Atlas Research Network. 
Comprehensive molecular 1064 characterization of clear cell renal cell carcinoma. Nature 499, 43–49 (2013). 1065 34. Verhaak, R. G. W. et al. Integrated genomic analysis identifies clinically relevant 1066 subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, 1067 and NF1. Cancer Cell 17, 98–110 (2010). 1068 35. Cancer Genome Atlas Research Network. Comprehensive molecular profiling of 1069 lung adenocarcinoma. Nature 511, 543–550 (2014). 1070 36. Hu, B. et al. Gastric cancer: Classification, histology and application of molecular 1071 pathology. J. Gastrointest. Oncol. 3, 251–261 (2012). 1072 37. Barretina, J. et al. The Cancer Cell Line Encyclopedia enables predictive modelling 1073 of anticancer drug sensitivity. Nature 483, 603–607 (2012). 1074 38. Medico, E. et al. The molecular landscape of colorectal cancer cell lines unveils 1075 clinically actionable kinase targets. Nat. Commun. 6, 7002 (2015). 1076 39. Park, J.-G. et al. Characteristics of Cell Lines Established from Human Colorectal 1077 Carcinoma. Cancer Res. (1987). 1078 40. Jerby-Arnon, L. et al. A cancer cell program promotes T cell exclusion and 1079 resistance to checkpoint blockade. Cell 175, 984–997.e24 (2018). 1080 41. Darmanis, S. et al. Single-Cell RNA-Seq Analysis of Infiltrating Neoplastic Cells at 1081 the Migrating Front of Human Glioblastoma. Cell Rep. 21, 1399–1410 (2017). 1082 42. Patel, A. P. et al. Single-cell RNA-seq highlights intratumoral heterogeneity in 1083 primary glioblastoma. Science 344, 1396–1401 (2014). 1084 43. Xu, B. et al. Regulation of endometrial receptivity by the highly expressed HOXA9, 1085 HOXA11 and HOXD10 HOX-class homeobox genes. Hum. Reprod. 29, 781–790 1086 (2014). 1087 44. Raines, A. M. et al. Recombineering-based dissection of flanking and paralogous 1088 Hox gene functions in mouse reproductive tracts. Development 140, 2942–2952 1089 (2013). 1090 45. Netinatsunthorn, W., Hanprasertpong, J., Dechsukhum, C., Leetanaporn, R. 
& 1091 Geater, A. WT1 gene expression as a prognostic marker in advanced serous 1092 epithelial ovarian carcinoma: an immunohistochemical study. BMC Cancer 6, 90 1093 (2006). 1094 46. Kelly, Z. et al. The prognostic significance of specific HOX gene expression patterns 1095 in ovarian cancer. Int. J. Cancer 139, 1608–1617 (2016). 1096 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2020.03.27.012757doi: bioRxiv preprint https://doi.org/10.1101/2020.03.27.012757 http://creativecommons.org/licenses/by-nc-nd/4.0/ 40 47. Cancer Genome Atlas Research Network. Integrated genomic analyses of ovarian 1097 carcinoma. Nature 474, 609–615 (2011). 1098 48. Wiegand, K. C. et al. ARID1A mutations in endometriosis-associated ovarian 1099 carcinomas. N. Engl. J. Med. 363, 1532–1543 (2010). 1100 49. Murray, M. J. et al. LIN28 Expression in malignant germ cell tumors downregulates 1101 let-7 and increases oncogene levels. Cancer Res. 73, 4872–4884 (2013). 1102 50. Biton, A. et al. Independent component analysis uncovers the landscape of the 1103 bladder tumor transcriptome and reveals insights into luminal and basal subtypes. 1104 Cell Rep. 9, 1235–1245 (2014). 1105 51. Fair, W. R., Israeli, R. S. & Heston, W. D. Prostate-specific membrane antigen. 1106 Prostate 32, 140–148 (1997). 1107 52. Black, J. D., English, D. P., Roque, D. M. & Santin, A. D. Targeted therapy in 1108 uterine serous carcinoma: an aggressive variant of endometrial cancer. Womens 1109 Health (Lond. Engl.) 10, 45–57 (2014). 1110 53. Yang, S., Thiel, K. W. & Leslie, K. K. Progesterone: the ultimate endometrial tumor 1111 suppressor. Trends Endocrinol. Metab. 22, 145–152 (2011). 1112 54. Huszar, M. et al. 
Up-regulation of L1CAM is linked to loss of hormone receptors and 1113 E-cadherin in aggressive subtypes of endometrial carcinomas. J. Pathol. 220, 551–1114 561 (2010). 1115 55. Kozak, J., Wdowiak, P., Maciejewski, R. & Torres, A. A guide for endometrial 1116 cancer cell lines functional assays using the measurements of electronic 1117 impedance. Cytotechnology 70, 339–350 (2018). 1118 56. Korch, C. et al. DNA profiling analysis of endometrial and ovarian cell lines reveals 1119 misidentification, redundancy and contamination. Gynecol. Oncol. 127, 241–248 1120 (2012). 1121 57. Wu, D. et al. Gene-expression data integration to squamous cell lung cancer 1122 subtypes reveals drug sensitivity. Br. J. Cancer 109, 1599–1608 (2013). 1123 58. Walter, V. et al. Molecular subtypes in head and neck cancer exhibit distinct 1124 patterns of chromosomal gain and loss of canonical cancer genes. PLoS One 8, 1125 e56823 (2013). 1126 59. Adeegbe, D. O. et al. BET Bromodomain Inhibition Cooperates with PD-1 Blockade 1127 to Facilitate Antitumor Response in Kras-Mutant Non-Small Cell Lung Cancer. 1128 Cancer Immunol Res 6, 1234–1245 (2018). 1129 60. Blaisdell, A. et al. Neutrophils oppose uterine epithelial carcinogenesis via 1130 debridement of hypoxic tumor cells. Cancer Cell 28, 785–799 (2015). 1131 61. Fitamant, J. et al. YAP inhibition restores hepatocyte differentiation in advanced 1132 HCC, leading to tumor regression. Cell Rep. 10, 1692–1707 (2015). 1133 62. Jia, D. et al. Crebbp loss drives small cell lung cancer and increases sensitivity to 1134 HDAC inhibition. Cancer Discov. 8, 1422–1437 (2018). 1135 63. Kress, T. R. et al. Identification of MYC-Dependent Transcriptional Programs in 1136 Oncogene-Addicted Liver Tumors. Cancer Res. 76, 3463–3472 (2016). 1137 64. Li, L. et al. GKAP acts as a genetic modulator of NMDAR signaling to govern 1138 invasive tumor growth. Cancer Cell 33, 736–751.e5 (2018). 1139 65. Mollaoglu, G. et al. 
The Lineage-Defining Transcription Factors SOX2 and NKX2-1 1140 Determine Lung Cancer Cell Fate and Shape the Tumor Immune 1141 Microenvironment. Immunity 49, 764–779.e9 (2018). 1142 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2020.03.27.012757doi: bioRxiv preprint https://doi.org/10.1101/2020.03.27.012757 http://creativecommons.org/licenses/by-nc-nd/4.0/ 41 66. Pan, Y. et al. Whole tumor RNA-sequencing and deconvolution reveal a clinically-1143 prognostic PTEN/PI3K-regulated glioma transcriptional signature. Oncotarget 8, 1144 52474–52487 (2017). 1145 67. Lissanu Deribe, Y. et al. Mutations in the SWI/SNF complex induce a targetable 1146 dependence on oxidative phosphorylation in lung cancer. Nat. Med. 24, 1047–1057 1147 (2018). 1148 68. Xu, C. et al. Loss of Lkb1 and Pten leads to lung squamous cell carcinoma with 1149 elevated PD-L1 expression. Cancer Cell 25, 590–604 (2014). 1150 69. NCI-Frederick, Frederick, MD. National Laboratory for Cancer Research. The NCI 1151 Patient-Derived Models Repository (PDMR). (2019). at 1152 70. Broutier, L. et al. Human primary liver cancer-derived organoid cultures for disease 1153 modeling and drug screening. Nat. Med. 23, 1424–1435 (2017). 1154 71. Lee, S. H. et al. Tumor Evolution and Drug Response in Patient-Derived Organoid 1155 Models of Bladder Cancer. Cell 173, 515–528.e17 (2018). 1156 72. Ogawa, J., Pao, G. M., Shokhirev, M. N. & Verma, I. M. Glioblastoma model using 1157 human cerebral organoids. Cell Rep. 23, 1220–1229 (2018). 1158 73. Ben-David, U. et al. Patient-derived xenografts undergo mouse-specific tumor 1159 evolution. Nat. Genet. 49, 1567–1575 (2017). 1160 74. Stratton, M. R., Campbell, P. J. & Futreal, P. A. The cancer genome. 
Nature 458, 1161 719–724 (2009). 1162 75. Balkwill, F. R., Capasso, M. & Hagemann, T. The tumor microenvironment at a 1163 glance. J. Cell Sci. 125, 5591–5596 (2012). 1164 76. Lancaster, M. A. & Knoblich, J. A. Organogenesis in a dish: modeling development 1165 and disease using organoid technologies. Science 345, 1247125 (2014). 1166 77. Bregenzer, M. E. et al. Integrated cancer tissue engineering models for precision 1167 medicine. PLoS One 14, e0216564 (2019). 1168 78. Wang, D. H. & Souza, R. F. Biology of Barrett’s esophagus and esophageal 1169 adenocarcinoma. Gastrointest Endosc Clin N Am 21, 25–38 (2011). 1170 79. Lee, J. et al. Tumor stem cells derived from glioblastomas cultured in bFGF and 1171 EGF more closely mirror the phenotype and genotype of primary tumors than do 1172 serum-cultured cell lines. Cancer Cell 9, 391–403 (2006). 1173 80. Wenger, S. L. et al. Comparison of established cell lines at different passages by 1174 karyotype and comparative genomic hybridization. Biosci. Rep. 24, 631–639 (2004). 1175 81. Ben-David, U. et al. Genetic and transcriptional evolution alters cancer cell line drug 1176 response. Nature 560, 325–330 (2018). 1177 82. Cooke, S. L. et al. Genomic analysis of genetic heterogeneity and evolution in high-1178 grade serous ovarian carcinoma. Oncogene 29, 4905–4913 (2010). 1179 83. Hristova, V. A. & Chan, D. W. Cancer biomarker discovery and translation: 1180 proteomics and beyond. Expert Rev Proteomics 16, 93–103 (2019). 1181 84. Dawson, M. A. & Kouzarides, T. Cancer epigenetics: from mechanism to therapy. 1182 Cell 150, 12–27 (2012). 1183 85. Silva, T. C. et al. TCGA Workflow: Analyze cancer genomics and epigenomics data 1184 using Bioconductor packages. [version 2; peer review: 1 approved, 2 approved with 1185 reservations]. F1000Res. 5, 1542 (2016). 1186 86. Morgan, M., Obenchain, V., Hester, J. & Pag`es, H. SummarizedExperiment: 1187 SummarizedExperiment container. (2018). 
1188 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 9, 2021. ; https://doi.org/10.1101/2020.03.27.012757doi: bioRxiv preprint https://doi.org/10.1101/2020.03.27.012757 http://creativecommons.org/licenses/by-nc-nd/4.0/ 42 87. Pavlidis, P. & Noble, W. S. Analysis of strain and regional variation in gene 1189 expression in mouse brain. Genome Biol. 2, RESEARCH0042 (2001). 1190 88. Geman, D., d Avignon, C., Naiman, D. Q. & Winslow, R. L. Classifying gene 1191 expression profiles from pairwise mRNA comparisons. Stat Appl Genet Mol Biol 3, 1192 Article19 (2004). 1193 89. Krstajic, D., Buturovic, L. J., Leahy, D. E. & Thomas, S. Cross-validation pitfalls 1194 when selecting and assessing regression and classification models. J. Cheminform. 1195 6, 10 (2014). 1196 90. Lipton, Z. C., Elkan, C. & Naryanaswamy, B. Optimal Thresholding of Classifiers to 1197 Maximize F1 Measure. Mach. Learn. Knowl. Discov. Databases 8725, 225–239 1198 (2014). 1199 91. Grow, E. J. et al. Intrinsic retroviral reactivation in human preimplantation embryos 1200 and pluripotent cells. Nature 522, 221–225 (2015). 1201 92. Kolde, R. pheatmap: Pretty Heatmaps. (CRAN, 2019). 1202 93. Wickham, H. ggplot2 - Elegant Graphics for Data Analysis . (Springer-Verlag New 1203 York, 2016). doi:10.1007/978-0-387-98141-3 1204 94. Gu, Z., Eils, R. & Schlesner, M. Complex heatmaps reveal patterns and correlations 1205 in multidimensional genomic data. Bioinformatics 32, 2847–2849 (2016). 1206 95. Yoshihara, K. et al. Inferring tumour purity and stromal and immune cell admixture 1207 from expression data. Nat. Commun. 4, 2612 (2013). 1208 96. Kovalchik, S. RISmed: Download Content from NCBI Databases. (CRAN.R-project, 1209 2017). 
Figure 1 (panels A-E). Schematic of CancerCellNet. Overview panel: cancer types by cancer models, colored by classification score (low to high); the cancer models are cancer cell lines (CCL), patient derived xenografts (PDX), genetically engineered mouse models (GEMM), and tumoroids. Training process: labeled RNA-seq data (genes by samples), select n genes, gene pair transform, select m gene pairs, train Random Forest classifier. Parameter selection: (1) set parameters n, m; (2) randomly select 2/3 of the TCGA data and run the training process; (3) assess performance on the 1/3 held out data; (4) repeat steps (2-3) 5 times; (5) repeat steps (1-4) for each parameter set. Select the parameter set with maximum mean AUPRC, then train on all TCGA data.

Figure 2 (panels A-F). Scale label: CCN score.

Figure 3 (panels A-C). Scale label: CCN score.

Figure 4 (panels A-D). Annotation tracks: general classification, general CCN score (UCEC, LUSC, LUAD), sub-type classification. Sub-type legends: UCEC: Endometrioid, Serous, Normal, Unknown; LUSC: basal, classical, primitive, secretory, Unknown; LUAD: prox.-inflam, prox.-prolif, TRU, Unknown.

Figure 5 (panels A-E). Scale label: CCN score. Annotation tracks and sub-type legends as in Figure 4.

Figure 6 (panels A-E). Scale label: CCN score. Annotation tracks and sub-type legends as in Figure 4, plus a genotype track.

Figure 7 (panels A-E). Scale label: CCN score. Annotation tracks and sub-type legends as in Figure 4.

Figure 8.

Supplemental Figure 1 (panels A-E). Training data: TCGA RNA-seq (genes by samples). Training process: gene pair transform, feature selection, train Random Forest classifier. The CCN scores from the CancerCellNet broad class classification are added to the gene pairs as additional features.

Supplemental Figure 2 (panels A-E).

Supplemental Figure 3 (panels A-D).

Supplemental Figure 4.

Supplemental Figure 5.
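The Figure 1 schematic names a gene pair transform and a repeated 2/3 versus 1/3 hold-out procedure for choosing the parameters n and m. The Python sketch below illustrates those two steps only and is not taken from the CancerCellNet code base. It assumes that a gene-pair feature is 1 when the first gene is expressed above the second. The function names and the `train_and_score` callback, which stands in for the training process and the AUPRC calculation, are hypothetical.

```python
import random
from itertools import combinations
from statistics import mean


def gene_pair_transform(expr, genes):
    """Turn expression values into binary gene-pair features.

    expr maps gene name -> list of expression values (one per sample).
    Feature "A_B" is 1 where gene A is expressed above gene B, else 0.
    (Assumed comparison rule; the figure only names the step.)
    """
    return {
        f"{a}_{b}": [int(x > y) for x, y in zip(expr[a], expr[b])]
        for a, b in combinations(genes, 2)
    }


def holdout_splits(sample_ids, n_repeats=5, train_frac=2 / 3, seed=0):
    """Yield (train, held_out) lists: a random 2/3 vs 1/3 split, repeated."""
    rng = random.Random(seed)
    n_train = round(len(sample_ids) * train_frac)
    for _ in range(n_repeats):
        shuffled = list(sample_ids)
        rng.shuffle(shuffled)
        yield shuffled[:n_train], shuffled[n_train:]


def select_parameters(sample_ids, parameter_sets, train_and_score):
    """Return the parameter set with the highest mean held-out score.

    parameter_sets is a list of hashable (n, m) tuples.
    train_and_score(params, train, held_out) is a hypothetical callback that
    runs the training process on `train` and returns the AUPRC on `held_out`.
    """
    mean_scores = {}
    for params in parameter_sets:                        # step (5)
        scores = [
            train_and_score(params, train, held_out)     # steps (2)-(3)
            for train, held_out in holdout_splits(sample_ids)  # step (4)
        ]
        mean_scores[params] = mean(scores)
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores
```

For example, `gene_pair_transform({"G1": [5.0, 1.0], "G2": [2.0, 3.0]}, ["G1", "G2"])` yields a single feature `G1_G2` with one binary value per sample. In a full pipeline the selected parameter set would then be used to retrain on all of the training data, as the schematic indicates.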