Evaluating the transcriptional fidelity of cancer models

Da Peng1*, Rachel Gleyzer2*, Wen-Hsin Tai2, Pavithra Kumar2, Qin Bian2, Bradley Issacs2, Edroaldo Lummertz da Rocha3, Stephanie Cai1, Kathleen DiNapoli4,5, Franklin W Huang6, Patrick Cahan1,2,7

1Department of Biomedical Engineering, Johns Hopkins University School of Medicine, Baltimore MD 21205 USA

2Institute for Cell Engineering, Johns Hopkins University School of Medicine, Baltimore MD 21205 USA

3Department of Microbiology, Immunology and Parasitology, Federal University of Santa Catarina, Florianópolis SC, Brazil

4Department of Cell Biology, Johns Hopkins University School of Medicine, Baltimore, MD 21205 USA

5Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore MD 21218 USA

6Division of Hematology/Oncology, Department of Medicine; Helen Diller Family Cancer Center; Bakar Computational Health Sciences Institute; Institute for Human Genetics; University of California, San Francisco, San Francisco, CA

7Department of Molecular Biology and Genetics, Johns Hopkins University School of Medicine, Baltimore MD 21205 USA

* These authors made equal contributions.

Correspondence to: patrick.cahan@jhmi.edu

Article type: Research

Website: http://www.cahanlab.org/resources/cancerCellNet_web

Code: https://github.com/pcahan1/cancerCellNet

bioRxiv preprint doi: https://doi.org/10.1101/2020.03.27.012757; this version posted January 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license (http://creativecommons.org/licenses/by-nc-nd/4.0/).



ABSTRACT

Background: Cancer researchers use cell lines, patient-derived xenografts, engineered mice, and tumoroids as models to investigate tumor biology and to identify therapies. The generalizability and power of a model derive from the fidelity with which it represents the tumor type under investigation; however, the extent to which a given model does so is often unclear. The preponderance of models and the ability to readily generate new ones have created a demand for tools that can measure the extent to which, and the ways in which, cancer models resemble or diverge from native tumors.

Methods: We developed a machine-learning-based computational tool, CancerCellNet, that measures the similarity of cancer models to 22 naturally occurring tumor types and 36 subtypes in a platform- and species-agnostic manner. We applied this tool to 657 cancer cell lines, 415 patient-derived xenografts, 26 distinct genetically engineered mouse models, and 131 tumoroids. We validated CancerCellNet by application to independent data, and we tested several predictions with immunofluorescence.

Results: We have documented the cancer models with the greatest transcriptional fidelity to natural tumors, identified cancers underserved by adequate models, and found models with annotations that do not match their classification. By comparing models across modalities, we report that, on average, genetically engineered mice and tumoroids have higher transcriptional fidelity than patient-derived xenografts and cell lines in four out of five tumor types. However, several patient-derived xenografts and tumoroids have classification scores that are on par with native tumors, highlighting both their potential as faithful model classes and their heterogeneity.

Conclusions: CancerCellNet enables the rapid assessment of the transcriptional fidelity of tumor models. We have made CancerCellNet available as freely downloadable software and as a web application that can be applied to new cancer models and that allows for direct comparison to the cancer models evaluated here.


INTRODUCTION

Models are widely used to investigate cancer biology and to identify potential therapeutics. Popular modeling modalities are cancer cell lines (CCLs) [1], genetically engineered mouse models (GEMMs) [2], patient-derived xenografts (PDXs) [3], and tumoroids [4]. These classes of models differ in the types of questions that they are designed to address. CCLs are often used to address cell-intrinsic mechanistic questions [5], GEMMs to chart progression of molecularly defined disease [6], and PDXs to explore patient-specific response to therapy in a physiologically relevant context [7]. More recently, tumoroids have emerged as relatively inexpensive, physiological, in vitro 3D models of tumor epithelium with applications ranging from measuring drug responsiveness to exploring tumor dependence on cancer stem cells. Models also differ in the extent to which they represent specific aspects of a cancer type [8]. Even with this intra- and inter-class model variation, all models should represent the tumor type or subtype under investigation, and not another type of tumor or a non-cancerous tissue. Therefore, cancer models should be selected based not only on the specific biological question but also on the similarity of the model to the cancer type under investigation [9,10].

Various methods have been proposed to determine the similarity of cancer models to their intended subjects. Domcke et al. devised a 'suitability score' as a metric of the molecular similarity of CCLs to high-grade serous ovarian carcinoma based on a heuristic weighting of copy number alterations, mutation status of several genes that distinguish ovarian cancer subtypes, and hypermutation status [11]. Other studies have taken analogous approaches by focusing on either transcriptomic or ensemble molecular profiles (e.g. transcriptomic and copy number alterations) to quantify the similarity of cell lines to tumors [12–14]. These studies were tumor-type specific, focusing on CCLs that model, for example, hepatocellular carcinoma or breast cancer. Notably, Yu et al. compared the transcriptomes of CCLs to The Cancer Genome Atlas (TCGA) by correlation analysis, resulting in a panel of CCLs recommended as most representative of 22 tumor types [15]. Most recently, Najgebauer et al. [16] and Salvadores et al. [17] have developed methods to assess CCLs using molecular traits such as copy number alterations (CNA), somatic mutations, DNA methylation and transcriptomics. While all of these studies have provided valuable information, they leave two major challenges unmet. The first challenge is to determine the fidelity of GEMMs, PDXs, and tumoroids, and whether there are stark differences between these classes of models and CCLs. The other major unmet challenge is to enable the rapid assessment of new, emerging cancer models. This challenge is especially relevant now because technical barriers to generating models have been substantially lowered [18,19], and because new models such as PDXs and tumoroids can be derived on a patient-specific basis; each should therefore be considered a distinct entity requiring individual validation [4,20].

To address these challenges, we developed CancerCellNet (CCN), a computational tool that uses transcriptomic data to quantitatively assess the similarity between cancer models and 22 naturally occurring tumor types and 36 subtypes in a platform- and species-agnostic manner. Here, we describe CCN’s performance and the results of applying it to assess 657 CCLs, 415 PDXs, 26 GEMMs, and 131 tumoroids. This has allowed us to identify the most faithful models currently available, to document cancers underserved by adequate models, and to find models with inaccurate tumor type annotation. Moreover, because CCN is open-source and easy to use, it can be readily applied to newly generated cancer models as a means to assess their fidelity.

RESULTS

CancerCellNet classifies samples accurately across species and technologies

Previously, we had developed a computational tool using the Random Forest classification method to measure the similarity of engineered cell populations to their in vivo counterparts based on transcriptional profiles [21,22]. More recently, we elaborated on this approach to allow for classification of single-cell RNA-seq data in a manner that allows for cross-platform and cross-species analysis [23]. Here, we used an analogous approach to build a platform that would allow us to quantitatively compare cancer models to naturally occurring patient tumors (Fig 1A). In brief, we used TCGA RNA-seq expression data from 22 solid tumor types to train a top-pair multi-class Random Forest classifier (Fig 1B). We combined training data from Rectal Adenocarcinoma (READ) and Colon Adenocarcinoma (COAD) into one COAD_READ category because READ and COAD are considered to be virtually indistinguishable at a molecular level [24]. We included an ‘Unknown’ category, trained using randomly shuffled gene-pair profiles generated from the training data of the 22 tumor types, to identify query samples that are not reflective of any of the training data. To estimate the performance of CCN and how it is impacted by parameter variation, we performed a parameter sweep with a 5-fold 2/3 cross-validation strategy (i.e. 2/3 of the data sampled across each cancer type was used to train, 1/3 was used to validate) (Fig 1C). The performance of CCN, as measured by the mean area under the precision-recall curve (AUPRC), did not fall below 0.945 and remained relatively stable across parameter sets (Supp Fig 1A). The optimal parameters resulted in 1,979 features. The mean AUPRCs exceeded 0.95 in most tumor types with this optimal parameter set (Fig 1D, Supp Fig 1B). The AUPRCs of CCN applied to independent RNA-seq data from 725 tumors across five tumor types from the International Cancer Genome Consortium (ICGC) [25] ranged from 0.93 to 0.99, supporting the notion that the platform is able to accurately classify tumor samples from diverse sources (Fig 1E).
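
As an illustration of the performance measure used above, the following minimal Python sketch computes a one-vs-rest AUPRC for each class and averages the results. It approximates the area with the standard average-precision sum; the exact procedure used by CCN (for example, how ties or curve interpolation are handled) is not specified here, so this should be read as an assumption. The function names, sample labels, and scores are hypothetical.

```python
from typing import Dict, List, Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Approximate the area under the precision-recall curve.

    labels: 1 for samples of the class of interest, 0 otherwise.
    scores: classifier scores for that class (higher = more confident).
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive sample is required")
    # Rank samples from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision at the rank of each positive
    return total / n_pos


def mean_auprc(truth: List[str], score_table: Dict[str, List[float]]) -> float:
    """Mean one-vs-rest AUPRC over all classes in score_table."""
    per_class = []
    for cls, scores in score_table.items():
        labels = [1 if t == cls else 0 for t in truth]
        per_class.append(average_precision(labels, scores))
    return sum(per_class) / len(per_class)


# Hypothetical held-out samples and classifier scores, for illustration only.
truth = ["BRCA", "LIHC", "BRCA"]
score_table = {
    "BRCA": [0.9, 0.2, 0.4],
    "LIHC": [0.1, 0.7, 0.3],
}
print(mean_auprc(truth, score_table))  # prints 1.0: every positive outranks every negative
```

In a cross-validation setting such as the one described above, this quantity would be computed on each held-out third of the data and then summarized across repetitions.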

As one of the central aims of our study is to compare distinct cancer models, including GEMMs, our method needed to be able to classify mouse and human samples equivalently. We used the Top-Pair transform [23] to achieve this, and we tested the feasibility of this approach by assessing the performance of a normal (i.e. non-tumor) cell and tissue classifier trained on human data as applied to mouse samples. Consistent with prior applications [23], we found that the cross-species classifier performed well, achieving a mean AUPRC of 0.97 when applied to mouse data (Supp Fig 1C).
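
The idea behind a top-pair style transform can be sketched as follows: each sample is re-encoded as a set of binary features, one per gene pair, recording whether the first gene is expressed above the second. Because only within-sample ordering matters, the encoding is unchanged by any monotone rescaling of the measurements, which is one plausible reason the approach transfers across platforms. The gene pairs and expression values below are hypothetical, the function name is ours, and the way informative pairs are selected is not shown; for cross-species use, we assume mouse genes would first be mapped to their human orthologs.

```python
from typing import Dict, List, Tuple

GenePair = Tuple[str, str]


def top_pair_transform(expr: Dict[str, float], pairs: List[GenePair]) -> List[int]:
    """Encode a sample as 1/0 flags: is gene A expressed above gene B?"""
    return [1 if expr[a] > expr[b] else 0 for a, b in pairs]


# Hypothetical gene pairs; a real classifier would select informative
# pairs from the training data.
pairs = [("KRT8", "VIM"), ("ALB", "KRT5"), ("VIM", "ALB")]

# The same hypothetical sample on two measurement scales.
rnaseq = {"KRT8": 120.0, "VIM": 15.0, "ALB": 0.5, "KRT5": 3.0}
microarray = {gene: 2.0 * value + 7.0 for gene, value in rnaseq.items()}  # monotone rescaling

print(top_pair_transform(rnaseq, pairs))
print(top_pair_transform(microarray, pairs))
```

Both calls yield the same binary profile, so a classifier trained on such features never sees the platform-specific scale of the raw values.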


To evaluate cancer models at a finer resolution, we also developed an approach to perform tumor subtype classifications (Supp Fig 1D). We constructed 11 different cancer subtype classifiers based on the availability of expression or histological subtype information [24,26–36]. We also included non-cancerous, normal tissues as categories for several subtype classifiers when sufficient data were available: breast invasive carcinoma (BRCA), COAD_READ, head and neck squamous cell carcinoma (HNSC), kidney renal clear cell carcinoma (KIRC) and uterine corpus endometrial carcinoma (UCEC). The 11 subtype classifiers all achieved high overall average AUPRCs, ranging from 0.80 to 0.99 (Supp Fig 1E).

Fidelity of cancer cell lines

Having validated the performance of CCN, we then used it to determine the fidelity of CCLs. We mined RNA-seq expression data of 657 different cell lines across 20 cancer types from the Cancer Cell Line Encyclopedia (CCLE) and applied CCN to them, finding a wide classification range for cell lines of each tumor type (Fig 2A, Supp Tab 1). To verify the classification results, we applied CCN to expression profiles from CCLE generated through microarray expression profiling [37]. To ensure that CCN would function on microarray data, we first tested it by applying a CCN classifier created to test microarray data to 720 expression profiles of 12 tumor types. The cross-platform CCN classifier performed well, based on the comparison to study-provided annotation, achieving a mean AUPRC of 0.91 (Supp Fig 2A). Next, we applied this cross-platform classifier to microarray expression profiles from CCLE (Supp Fig 2B). From the classification results of 571 cell lines that have both RNA-seq and microarray expression profiles, we found a strong overall positive association between the classification scores from RNA-seq and those from microarray (Supp Fig 2C). This comparison supports the notion that the classification scores for each cell line are not artifacts of profiling methodology. Moreover, this comparison shows that the scores are consistent between the times that the cell lines were first assayed by microarray expression profiling in 2012 and by RNA-seq in 2019. We also observed a high level of correlation between our analysis and the analysis done by Yu et al. [15] (Supp Fig 2D), further validating the robustness of the CCN results.

Next, we assessed the extent to which CCN classifications agreed with their nominal tumor type of origin, which entailed translating quantitative CCN scores to classification labels. To achieve this, we selected a decision threshold that maximized the Macro F1 measure, the harmonic mean of precision and recall, across 50 cross-validations. Then, we annotated cell lines based on their CCN score profile as follows. Cell lines with CCN scores > threshold for the tumor type of origin were annotated as 'correct'. Cell lines with CCN scores > threshold in the tumor type of origin and at least one other tumor type were annotated as 'mixed'. Cell lines with CCN scores > threshold for tumor types other than that of the cell line's origin were annotated as 'other'. Cell lines that did not receive a CCN score > threshold for any tumor type were annotated as 'none' (Fig 2B). We found that the majority of cell lines originally annotated as Breast invasive carcinoma (BRCA), Cervical squamous cell carcinoma and endocervical adenocarcinoma (CESC), Skin Cutaneous Melanoma (SKCM), Colorectal Cancer (COAD_READ) and Sarcoma (SARC) fell into the 'correct' category (Fig 2B). On the other hand, no Esophageal carcinoma (ESCA), Pancreatic adenocarcinoma (PAAD) or Brain Lower Grade Glioma (LGG) cell lines were classified as 'correct', demonstrating the need for more transcriptionally faithful cell lines that model those general cancer types.
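
The annotation rules above can be expressed compactly. The sketch below is one reading of them, in which 'correct' is reserved for lines whose only above-threshold category is the tumor type of origin; the function name, the score profiles, and the threshold value are hypothetical and chosen only to exercise each rule.

```python
from typing import Dict


def annotate(scores: Dict[str, float], origin: str, threshold: float) -> str:
    """Translate a CCN score profile into 'correct', 'mixed', 'other' or 'none'."""
    # Tumor types whose score exceeds the decision threshold.
    above = {tumor_type for tumor_type, s in scores.items() if s > threshold}
    if not above:
        return "none"
    if origin in above:
        # Origin alone -> 'correct'; origin plus at least one other type -> 'mixed'.
        return "correct" if above == {origin} else "mixed"
    return "other"


# Hypothetical score profiles for four lines nominally of type "OV".
threshold = 0.25  # illustrative value
profiles = {
    "line_1": {"OV": 0.60, "UCEC": 0.10},
    "line_2": {"OV": 0.40, "UCEC": 0.35},
    "line_3": {"OV": 0.05, "UCEC": 0.50},
    "line_4": {"OV": 0.10, "UCEC": 0.12},
}
for name, scores in profiles.items():
    print(name, annotate(scores, origin="OV", threshold=threshold))
```

Under this reading the four categories are mutually exclusive, so every cell line receives exactly one annotation.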

There are several possible explanations for cell lines not receiving a 'correct' classification. One possibility is that the sample was incorrectly labeled in the study from which we harvested the expression data. Consistent with this explanation, we found that the colorectal cancer line NCI-H684 [38,39], a cell line labelled as liver hepatocellular carcinoma (LIHC) by CCLE, was classified strongly as COAD_READ (Supp Tab 1). Another possible explanation for a low CCN score is that cell lines were derived from subtypes of tumors that are not well represented in TCGA. To explore this hypothesis, we first performed tumor subtype classification on CCLs from 11 tumor types for which we had trained subtype classifiers (Supp Tab 2). We reasoned that if a cell line was a good model for a rarer subtype, then it would receive a poor general classification but a high classification for the subtype that it models well. Therefore, we counted the number of lines that fit this pattern. We found that of the 188 lines with no general classification, 25 (13%) were classified as a specific subtype, suggesting that derivation from rare subtypes is not the major contributor to the poor overall fidelity of CCLs.

Another potential contributor to low-scoring cell lines is intra-tumor stromal and immune cell impurity in the training data. If impurity were a confounder of CCN scoring, then we would expect a strong positive correlation between mean purity and mean CCN classification scores of CCLs per general tumor type. However, the Pearson correlation coefficient between the mean purity of each general tumor type and the mean CCN classification scores of CCLs in the corresponding general tumor type was low (0.14), suggesting that tumor purity is not a major contributor to the low CCN scores across CCLs (Supp Fig 2E).
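
For readers who wish to repeat this kind of check on their own data, the Pearson correlation between two per-tumor-type summaries can be computed as in the sketch below. The purity and score values are hypothetical placeholders, not values from this study.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-tumor-type summaries: mean tumor purity of the training
# samples and mean CCN score of the corresponding cell lines.
mean_purity = [0.90, 0.70, 0.80, 0.60]
mean_ccn_score = [0.30, 0.40, 0.20, 0.35]
print(round(pearson(mean_purity, mean_ccn_score), 3))
```

A coefficient near zero on real data would argue, as above, against purity being a major driver of the scores; a coefficient near one would suggest the opposite.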

Comparison of SKCM and GBM CCLs to scRNA-seq

To more directly assess the impact of intra-tumor heterogeneity in the training data on evaluating cell lines, we constructed a classifier using cell types found in human melanoma and glioblastoma scRNA-seq data [40,41]. Previously, we have demonstrated the feasibility of using our classification approach on scRNA-seq data [23]. Our scRNA-seq classifier achieved a high average AUPRC (0.95) when applied to held-out data and a high mean AUPRC (0.99) when applied to a few purified bulk testing samples (Supp Fig 3A-B). Comparing the CCN scores from the bulk RNA-seq general classifier and the scRNA-seq classifier, we observed a high level of correlation (Pearson correlation of 0.89) between the SKCM CCN classification scores and the scRNA-seq SKCM malignant CCN classification scores for SKCM cell lines (Fig 2C, Supp Fig 3C). Of the 41 SKCM cell lines that were classified as SKCM by the bulk classifier, 37 were also classified as SKCM malignant cells by the scRNA-seq classifier. Interestingly, we also observed a high correlation between the SARC CCN classification score and the scRNA-seq cancer-associated fibroblast (CAF) CCN classification scores (Pearson correlation of 0.92). Six of the seven SKCM cell lines that had been classified as exclusively SARC by CCN were classified as CAF by the scRNA-seq classifier (Fig 2D, Supp Fig 3C), which suggests the possibility that these cell lines were derived from CAF or other mesenchymal populations, or that they have acquired a mesenchymal character through their derivation. The high level of agreement between scRNA-seq and bulk RNA-seq classification results shows that heterogeneity in the training data of the general CCN classifier has little impact on the classification of SKCM cell lines.

In contrast, we observed a weaker correlation between GBM CCN classification scores and scRNA-seq GBM neoplastic CCN classification scores (Pearson correlation of 0.72) for GBM cell lines (Fig 2E, Supp Fig 3D). Of the 31 GBM lines that were not classified as GBM with CCN, 25 were classified as GBM neoplastic cells with the scRNA-seq classifier. Among the 22 GBM lines that were classified as SARC with CCN, 15 cell lines were classified as CAF (Fig 2F), 10 of which were classified as both GBM neoplastic and CAF by the scRNA-seq classifier. Similar to the situation with SKCM lines that classify as CAF, this result is consistent with the possibility that some GBM lines classified as SARC by CCN could be derived from mesenchymal subtypes exhibiting both strong mesenchymal signatures and glioblastoma signatures, or that they have acquired a mesenchymal character through their derivation. The lower level of agreement between scRNA-seq and bulk RNA-seq classification results for GBM models suggests that the heterogeneity of glioblastomas [42] can impact the classification of GBM cell lines, and that the use of an scRNA-seq classifier can resolve this deficiency.

Immunofluorescence confirmation of CCN predictions

To experimentally explore some of our computational analyses, we performed immunofluorescence on three cell lines that were not classified as their labelled categories: the ovarian cancer line SK-OV-3 had a high UCEC CCN score (0.246), the ovarian cancer line A2780 had a high Testicular Germ Cell Tumors (TGCT) CCN score (0.327), and the prostate cancer line PC-3 had a high bladder cancer (BLCA) score (0.307) (Supp Tab 1). We reasoned that if SK-OV-3, A2780 and PC-3 were classified most strongly as UCEC, TGCT and BLCA, respectively, then they would express proteins that are indicative of these cancer types.

First, we measured the expression of the uterine-associated transcription factor HOXB6 [43,44] and the UCEC serous ovarian tumor biomarker WT1 [45] in SK-OV-3, in the OV cell line Caov-4, and in the UCEC cell line HEC-59. We chose Caov-4 as our positive control for OV biomarker expression because it was determined by our analysis and by others [11,15] to be a good model of OV. Likewise, we chose HEC-59 as a positive control for UCEC. We found that SK-OV-3 has a small percentage (5%) of cells that expressed the uterine marker HOXB6 and a large proportion (73%) of cells that expressed WT1 (Fig 3A). In contrast, no Caov-4 cells expressed HOXB6, whereas 85% of cells expressed WT1. This suggests that SK-OV-3 exhibits biomarkers of both ovarian tumor and uterine tissue. From our computational analysis and experimental validation, SK-OV-3 is most likely an endometrioid subtype of ovarian cancer. This result is also consistent with the prior classification of SK-OV-3 [46], and with the facts that SK-OV-3 lacks p53 mutations, which are prevalent in high-grade serous ovarian cancer [47], and that it harbors an endometrioid-associated mutation in ARID1A [11,46,48]. Next, we measured the expression of markers of OV and germ cell cancers (LIN28A [49]) in the OV-annotated cell line A2780, which received a high TGCT CCN score. We found that 54% of A2780 cells expressed LIN28A, whereas it was not detected in Caov-4 (Fig 3B). The OV marker WT1 was also expressed in fewer A2780 cells as compared to Caov-4 (48% vs 85%), which suggests that A2780 could be a germ cell-derived ovarian tumor. Taken together, our results suggest that SK-OV-3 and A2780 could represent OV subtypes that are not well represented in TCGA training data, which resulted in a low OV score and a higher CCN score in other categories.

Lastly, we examined PC-3, annotated as a PRAD cell line but classified to be most similar to BLCA. We found that 30% of the PC-3 cells expressed PPARG, a contributor to urothelial differentiation [50] that is not detected in the PRAD Vcap cell line but is highly expressed in the BLCA RT4 cell line (Fig 3C). PC-3 cells also expressed the PRAD biomarker FOLH1 [51], suggesting that PC-3 has a PRAD origin and gained urothelial or luminal characteristics through the derivation process. In short, our limited experimental data support the CCN classification results.

Subtype classification of cancer cell lines

Next, we explored the subtype classification of CCLs from three general tumor types in more depth. We focused our subtype visualization (Fig 4A-C) on CCL models with a general CCN score above 0.1 in their nominal cancer type, as this allowed us to analyze those models that fell below the general threshold but were classified as a specific subtype (Supp Tab 1-2).

We focused first on UCEC, whose histologically defined subtypes, endometrioid and serous, differ in prevalence, molecular properties, prognosis, and treatment. For instance, the endometrioid subtype, which accounts for approximately 80% of uterine cancers, retains estrogen receptor and progesterone receptor status and is responsive to progestin therapy [52,53]. Serous, a more aggressive subtype, is characterized by the loss of estrogen and progesterone receptors and is not responsive to progestin therapy [52,53]. CCN classified the majority of the UCEC cell lines as serous, except for JHUEM-1, which is classified as mixed, with similarities to both endometrioid and serous (Fig 4A). The preponderance of CCLE lines of serous versus endometrioid character may be due to properties of serous cancer cells that promote their in vitro propagation, such as upregulation of cell adhesion transcriptional programs [54]. Some of our subtype classification results are consistent with prior observations. For example, HEC-1A, HEC-1B, and KLE were previously characterized as type II endometrial cancer, which includes a serous histological subtype [55]. On the other hand, our subtype classification results contradict prior observations in at least one case. The Ishikawa cell line was derived from type I endometrial cancer (endometrioid histological subtype) [55,56]; however, CCN classified a derivative of this line, Ishikawa 02 ER-, as serous. The high serous CCN score could result from a shift in phenotype of the line concomitant with its loss of estrogen receptor (ER), as this is a distinguishing feature of type II endometrial cancer (serous histological subtype) [52]. Taken together, these results indicate a need for more endometrioid-like CCLs.

Next, we examined the subtype classification of Lung Squamous Cell Carcinoma (LUSC) and Lung adenocarcinoma (LUAD) cell lines (Fig 4B-C). All the LUSC lines with at least one subtype classification had an underlying primitive subtype classification. This is consistent either with the ease of deriving lines from tumors with a primitive character, or with a process by which cell line derivation promotes similarity to the more primitive subtype, which is marked by increased cellular proliferation [28]. Some of our results are consistent with prior reports that have investigated the resemblance of some lines to LUSC subtypes. For example, HCC-95, previously characterized as classical [28,57], had a maximum CCN score in the classical subtype (0.429). Similarly, LUDLU-1 and EPLC-272H, previously reported as classical [57] and basal [57], respectively, had maximal tumor subtype CCN scores for these subtypes (0.323 and 0.256) (Fig 4B, Supp Tab 2) despite being classified as Unknown. Lastly, the LUAD cell lines that were classified as a subtype were classified as either proximal inflammation or proximal proliferation (Fig 4C). RERF-LC-Ad1 had the highest general classification score and the highest proximal inflammation subtype classification score. Taken together, these subtype classification results have revealed an absence of cell line models for basal and secretory LUSC, and for the Terminal respiratory unit (TRU) LUAD subtype.

Cancer cell lines’ popularity and transcriptional fidelity

Finally, we sought to measure the extent to which cell line transcriptional fidelity related 340 

to model prevalence. We used the number of papers in which a model was mentioned, 341 

normalized by the number of years since the cell line was documented, as a rough 342 

approximation of model prevalence. To explore this relationship, we plotted the normalized 343 

citation count versus general classification score, labeling the highest cited and highest 344 


classified cell lines from each general tumor type (Fig 4D). For most of the general tumor types,
the highest cited cell line is not the highest classified cell line; the exceptions are Hep G2, AGS and
ML-1, representing liver hepatocellular carcinoma (LIHC), stomach adenocarcinoma (STAD), and
thyroid carcinoma (THCA), respectively. On the other hand, the general scores of the highest
cited cell lines representing BLCA (T24), BRCA (MDA-MB-231), and PRAD (PC-3) fall below
the classification threshold of 0.25. Notably, each of these tumor types has other lines with
scores exceeding 0.5, which should be considered as more faithful transcriptional models when
selecting lines for a study (Supp Tab 1 and
http://www.cahanlab.org/resources/cancerCellNet_results/).
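The prevalence measure described above is simple arithmetic, and a minimal sketch may make it concrete. The function names, the example line names and all numbers below are hypothetical and are not taken from our data; only the 0.25 threshold comes from the text.

```python
# Sketch of the normalized citation count: papers mentioning a model divided
# by the number of years since the line was documented. All inputs below are
# made-up illustrations, not values from this study.

def normalized_citations(n_papers: int, year_documented: int, current_year: int) -> float:
    """Papers per year since the cell line was first documented."""
    years = max(current_year - year_documented, 1)  # avoid division by zero
    return n_papers / years

def below_threshold(ccn_score: float, threshold: float = 0.25) -> bool:
    """True if a general CCN score falls below the classification threshold."""
    return ccn_score < threshold

# Hypothetical lines: (papers, year documented, general CCN score).
lines = {"LINE-A": (1000, 2000, 0.10), "LINE-B": (120, 2010, 0.62)}
for name, (papers, year, score) in lines.items():
    rate = normalized_citations(papers, year, 2020)
    print(name, rate, "below threshold" if below_threshold(score) else "classified")
```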

 354 

Evaluation of patient derived xenografts 355 

 Next, we sought to evaluate a more recent class of cancer models: PDX. To do so, we 356 

subjected the RNA-seq expression profiles of 415 PDX models from 13 different cancer
types generated previously20 to CCN. Similar to the results of CCLs, the PDXs exhibited a wide
range of classification scores (Fig 5A, Supp Tab 3). By categorizing the CCN scores of PDXs
based on the proportion of samples associated with each tumor type that were correctly
classified, we found that SARC, SKCM, COAD_READ and BRCA have a higher proportion of

correctly classified PDX than those of other cancer categories (Fig 5B). In contrast to CCLs, we 362 

found a higher proportion of correctly classified PDX in STAD, PAAD and KIRC (Fig 5B). 363 

However, similar to CCLs, no ESCA PDXs were classified as such. This held true when we 364 

performed subtype classification on PDX samples: none of the PDX in ESCA were classified as 365 

any of the ESCA subtypes (Supp Tab 4). UCEC PDXs had both endometrioid subtypes, serous 366 

subtypes, and mixed subtypes, which provided a broader representation than CCLs (Fig 5C). 367 

Several LUSC PDXs that were classified as a subtype were also classified as Head and Neck 368 

squamous cell carcinoma (HNSC) or as a mix of HNSC and LUSC (Fig 5D). This could be due to the

similarity in expression profiles of basal and classical subtypes of HNSC and LUSC28,58, which is 370 


consistent with the observation that these PDXs were also subtyped as classical. No LUSC 371 

PDXs were classified as the secretory subtype. In contrast to LUAD CCLs, four of the five LUAD 372 

PDXs with a discernible sub-type were classified as proximal inflammatory (Fig 5E). On the 373 

other hand, similar to the CCLs, there were no TRU subtypes in the LUAD PDX cohort. In 374 

summary, we found that while individual PDXs can reach extremely high transcriptional fidelity 375 

to both general tumor types and subtypes, many PDXs were not classified as the general tumor 376 

type from which they originated. 377 

 378 

Evaluation of GEMMs 379 

 Next, we used CCN to evaluate GEMMs of six general tumor types from nine studies for 380 

which expression data was publicly available59–67. As was true for CCLs and PDXs, GEMMs 381 

also had a wide range of CCN scores (Fig 6A, Supp Tab 5). We next categorized the CCN 382 

scores based on the proportion of samples associated with each tumor type that were correctly 383 

classified (Fig 6B). In contrast to LGG CCLs, LGG GEMMs, generated by Nf1 mutations 384 

expressed in different neural progenitors in combination with Pten deletion66, consistently were 385 

classified as LGG (Fig 6A-B). The GEMM dataset included multiple replicates per model, which 386 

allowed us to examine intra-GEMM variability. Both at the level of CCN score and at the level of 387 

categorization, GEMMs were invariant. For example, replicates of UCEC GEMMs driven by 388 

Prg(cre/+)Pten(lox/lox) received almost identical general CCN scores (Fig 6C, Supp Tab 6). 389 

GEMMs sharing genotypes across studies, such as LUAD GEMMs driven by Kras mutation and 390 

loss of p5359,65,67, also received similar general and subtype classification scores (Fig 6A,B,E).   391 

 Next, we explored the extent to which genotype impacted subtype classification in 392 

UCEC, LUSC, and LUAD. Prg(cre/+)Pten(lox/lox) GEMMs had a mixed subtype classification of 393 

both serous and endometrioid, consistent with the fact that Pten loss occurs in both subtypes 394 

(albeit more frequently in endometrioid). We also analyzed Prg(cre/+)Pten(lox/lox)Csf3r-/- 395 

GEMMs. Polymorphonuclear neutrophils (PMNs), which play anti-tumor roles in endometrioid 396 


cancer progression, are depleted in these animals. Interestingly, Prg(cre/+)Pten(lox/lox)Csf3r-/- 397 

GEMMs had a serous subtype classification, which could be explained by differences in PMN 398 

involvement in endometrioid versus serous uterine tumor development that are reflected in the 399 

respective transcriptomes of the TCGA UCEC training data.  We note that the tumor cells were 400 

sorted prior to RNA-seq and thus the shift in subtype classification is not due to contamination of 401 

GEMMs with non-tumor components. In short, this analysis supports the argument that tumor-402 

cell extrinsic factors, in this case a reduction in anti-tumor PMNs, can shift the transcriptome of 403 

a GEMM so that it more closely resembles a serous rather than endometrioid subtype.   404 

 The LUSC GEMMs that we analyzed were Lkb1fl/fl and either overexpressed
Sox2 (via two distinct mechanisms) or were also Ptenfl/fl 65. We note that the eight
lenti-Sox2-Cre-infected;Lkb1fl/fl and Rosa26LSL-Sox2-IRES-GFP;Lkb1fl/fl samples that classified as
'Unknown' had LUSC CCN scores only modestly lower than the decision threshold (Fig 6D)
(mean CCN score = 0.217). Thirteen of the 17 Sox2 GEMMs classified as the

secretory subtype of LUSC. The consistency is not surprising given both models overexpress 410 

Sox2 and lose Lkb1. On the other hand, the Lkb1fl/fl;Ptenfl/fl GEMMs had substantially lower 411 

general LUSC CCN scores and our subtype classification indicated that this GEMM was mostly 412 

classified as 'Unknown', in contrast to prior reports suggesting that it is most similar to a basal 413 

subtype68. None of the three LUSC GEMMs have strong classical CCN scores. Most of the 414 

LUAD GEMMs, which were generated using various combinations of activating Kras mutation, 415 

loss of Trp53, and loss of Smarca4L59,65,67, were correctly classified (Fig 6E). Those that were 416 

not classified had CCN scores modestly lower than the decision threshold (mean CCN score =
0.214). There were no substantial differences in general or subtype classification across driver

genotypes. Although the sub-type of all LUAD GEMMs was 'Unknown', the subtypes tended to 419 

have a mixture of high CCN proximal proliferation, proximal inflammation and TRU scores. 420 

Taken together, this analysis suggests that there is a degree of similarity, and perhaps plasticity 421 

between the primitive and secretory (but not basal or classical) subtypes of LUSC. On the other 422 


hand, while the LUAD GEMMs classify strongly as LUAD, they do not have strong particular 423 

subtype classification -- a result that does not vary by genotype. 424 

 425 

Evaluation of Tumoroids  426 

 Lastly, we used CCN to assess a relatively novel cancer model: tumoroids. We 427 

downloaded and assessed 131 distinct tumoroid expression profiles spanning 13 cancer 428 

categories from The NCI Patient-Derived Models Repository (PDMR)69 and from three individual 429 

studies70–72 (Fig 7A, Supp Tab 7). We note that several categories have three or fewer samples 430 

(BRCA, CESC, KIRP, OV, LIHC, and BLCA from PDMR). Among the cancer categories 431 

represented by more than three samples, only LUSC and PAAD have fewer than 50% classified 432 

as their annotated label (Fig 7B). In contrast to GBM CCLs, all three induced pluripotent stem 433 

cell-derived GBM tumoroids72 were classified as GBM with high CCN scores (mean =  0.53). To 434 

further characterize the tumoroids, we performed subtype classification on them (Supp Tab 8). 435 

UCEC tumoroids from PDMR contain a wide range of subtypes, with two endometrioid, two
serous and one mixed type (Fig 7C). On the other hand, LUSC tumoroids appear to be

predominantly of classical subtypes with one tumoroid classified as a mix between classical and 438 

primitive (Fig 7D). Lastly, similar to the CCL and PDX counterparts, LUAD tumoroids are 439 

classified as proximal inflammatory and proximal proliferation with no tumoroids classified as 440 

TRU subtype (Fig 7E).  441 

 442 

Comparison of CCLs, PDXs, GEMMs and tumoroids  443 

 Finally, we sought to estimate the comparative transcriptional fidelity of the four cancer
model modalities. We compared the general CCN scores of each model on a per tumor type

basis (Fig 8). In the case of GEMMs, we used the mean classification score of all samples with 446 

shared genotypes. We also used mean classification of technical replicates found in LIHC 447 

tumoroids70. We evaluated models based on both the maximum CCN score, as this represents 448 


the potential for a model class, and the median CCN score, as this indicates the current overall 449 

transcriptional fidelity of a model class. PDXs achieved the highest CCN scores in three (UCEC, 450 

PAAD, LUAD) out of the five cancer categories in which all four modalities were available (Fig 451 

8), despite having low median CCN scores. Notably, PDXs have a median CCN score above 452 

the 0.25 threshold in PAAD while none of the other three modalities have any samples above 453 

the threshold. In LIHC, the highest CCN score for PDX (0.9) is only slightly lower than the 454 

highest CCN score for tumoroid (0.91). This suggests that certain individual PDXs most closely

mimic the transcriptional state of native patient tumors despite a portion of the PDXs having low 456 

CCN scores. Similarly, while the majority of the CCLs have low CCN scores, several lines 457 

achieve high transcriptional fidelity in LUSC, LUAD and LIHC (Fig 8). Collectively, GEMMs and 458 

tumoroids had the highest median CCN scores in four of the five model classes (LUSC and 459 

LUAD for GEMMs and UCEC and LIHC for tumoroids). Notably, both of the LIHC tumoroids 460 

achieved CCN scores on par with patient tumors (Fig 8). In brief, this analysis indicates that 461 

PDXs and CCLs are heterogeneous in terms of transcriptional fidelity, with a portion of the

models highly mimicking native tumors and the majority of the models having low transcriptional 463 

fidelity (with the exception of PAAD for PDXs). On the other hand, GEMMs and tumoroids 464 

displayed a consistently high fidelity across different models.  465 
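The two summary statistics used in this comparison can be illustrated with a short sketch: the maximum score stands for the potential of a model class and the median for its current overall fidelity. The function name and the scores below are hypothetical and are not values from Fig 8.

```python
from statistics import median

def summarize_modalities(scores_by_modality):
    """Return the max and median general CCN score for each modality.

    scores_by_modality maps a modality name to a list of CCN scores for one
    tumor type. Replicates are assumed to have been averaged beforehand.
    """
    return {
        modality: {"max": max(scores), "median": median(scores)}
        for modality, scores in scores_by_modality.items()
        if scores  # skip modalities with no samples for this tumor type
    }

# Hypothetical scores for a single tumor type.
example = {
    "CCL": [0.05, 0.10, 0.70],
    "PDX": [0.10, 0.20, 0.90],
    "GEMM": [0.50, 0.60],
    "tumoroid": [0.55, 0.65, 0.60],
}
for modality, stats in summarize_modalities(example).items():
    print(modality, stats)
```

In this made-up example the PDX class has the highest maximum but a low median, which is the pattern described in the text for several tumor types.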

Because the CCN score is based on a moderate number of gene features (i.e. 1,979 466 

gene pairs consisting of 1,689 unique genes) relative to the total number of protein-coding 467 

genes in the genome, it is possible that a cancer model with a high CCN score might not have a 468 

high global similarity to a naturally occurring tumor. Therefore, we also calculated the GRN
status, a metric of the extent to which a tumor-type specific gene regulatory network is
established21, for all models (Supp Fig 4). We observed a high level of correlation between the

two similarity metrics, which suggests that although CCN classifies on a selected set of genes, 472 

its scores are highly correlated with global assessment of transcriptional similarity. 473 


 We also sought to compare model modalities in terms of the diversity of subtypes that 474 

they represent (Supp Fig 5). As a reference, we also included in this analysis the overall 475 

subtype incidence, as approximated by incidence in TCGA. Replicates in GEMMs and 476 

tumoroids were averaged into one classification profile. In models of UCEC, there is a notable
difference between endometrioid incidence and the proportion of models classified as endometrioid,
with only PDXs and tumoroids having any representatives (Supp Fig 5). All of the CCL, GEMM, and

tumoroid models of PAAD have an unknown subtype classification and no correct general 480 

classification. However, the majority of PDXs are subtyped as either a mixture of basal and 481 

classical, or classical alone. LUAD have proximal inflammation and proximal proliferation 482 

subtypes modelled by CCLs and PDX (Supp Fig 5). Likewise, LUSC have basal, classical and 483 

primitive subtypes modelled by CCLs and PDXs, and secretory subtype modelled by GEMMs 484 

exclusively (Supp Fig 5). Taken together, these results demonstrate the need to carefully select 485 

different model systems to more suitably model certain cancer subtypes.  486 

 487 

DISCUSSION 488 

A major goal in the field of cancer biology is to develop models that mimic naturally occurring 489 

tumors with enough fidelity to enable therapeutic discoveries. However, methods to measure 490 

the extent to which cancer models resemble or diverge from native tumors are lacking. This is 491 

especially problematic now because there are many existing models from which to choose, and 492 

it has become easier to generate new models. Here, we present CancerCellNet (CCN), a 493 

computational tool that measures the similarity of cancer models to 22 naturally occurring tumor 494 

types and 36 subtypes. While the similarity of CCLs to patient tumors has already been 495 

explored in previous work, our tool introduces the capability to assess the transcriptional fidelity 496 

of PDXs, GEMMs, and tumoroids. Because CCN is platform- and species-agnostic, it 497 

represents a consistent platform to compare models across modalities including CCLs, PDXs, 498 

GEMMs and tumoroids. Here, we applied CCN to 657 cancer cell lines, 415 patient derived 499 


xenografts, 26 distinct genetically engineered mouse models and 131 tumoroids. Several 500 

insights emerged from our computational analyses that have implications for the field of cancer 501 

biology. 502 

 First, PDXs have the greatest potential to achieve transcriptional fidelity with three out of 503 

five general tumor types for which data from all modalities was available, as indicated by the 504 

high scores of individual PDXs. Notably PDXs are the only modality with samples classified as 505 

PAAD. At the same time, the median CCN scores of PDXs were lower than that of GEMMs and 506 

tumoroids in the other four tumor types.  It is unclear what causes such a wide range of CCN 507 

scores within PDXs. We suspect that some PDXs might have undergone selective pressures in 508 

the host that distort the progression of genomic alterations away from what is observed in
natural tumors73. Future work to understand this heterogeneity is important so as to yield

consistently high fidelity PDXs, and to identify intrinsic and host-specific factors that so 511 

powerfully shape the PDX transcriptome.   512 

 Second, in general GEMMs and tumoroids have higher median CCN scores than those 513 

of PDXs and CCLs. This is also consistent with the fact that GEMMs are typically derived by

recapitulating well-defined driver mutations of natural tumors, and thus this observation 515 

corroborates the importance of genetics in the etiology of cancer74. Moreover, in contrast to 516 

most PDXs, GEMMs are typically generated in immune replete hosts. Therefore, the higher 517 

overall fidelity of GEMMs may also be a result of the influence of a native immune system on 518 

GEMM tumors75. The high median CCN scores of tumoroids can be attributed to several factors 519 

including the increased mechanical stimuli and cell-cell interactions that come from 3D self-520 

organizing cultures76,77.  521 

 Third, we have found that none of the samples that we evaluated here are 522 

transcriptionally adequate models of ESCA. This may be due to an inherent lability of the ESCA 523 

transcriptome that is often preceded by a metaplasia that has obscured determining its cell 524 

type(s) of origin78. Therefore, this tumor type requires further attention to derive new models. 525 


 Fourth, we found that in several tumor types, GEMMs tend to reflect mixtures of 526 

subtypes rather than conforming strongly to single subtypes. The reasons for this are not clear 527 

but it is possible that in the cases that we examined the histologically defined subtypes have a 528 

degree of plasticity that is exacerbated in the murine host environment.  529 

 Lastly, we recognize that many CCLs are not classified as their annotated labels. While 530 

we have suggested that the lack of immune component is not a major confounder, we suspect 531 

that the CCLs could undergo genetic divergence due to a high number of passages,

chemotherapy before biopsy, culture condition and genetic instability79–82, which could all be 533 

factors that drive CCLs away from their labelled tumors.    534 

 Currently, there are several limitations to our CCN tool, and caveats to our analyses 535 

which indicate areas for future work and improvement. First, CCN is based on transcriptomic 536 

data, but other molecular readouts of tumor state, such as profiles of the proteome83,
epigenome84, non-coding RNA-ome84, and genome74 would be equally, if not more, important to

mimic in a model system. Therefore, it is possible that some models reflect tumor behavior well, 539 

and because this behavior is not well predicted by transcriptome alone, these models have 540 

lower CCN scores. To both measure the extent that such situations exist, and to correct for 541 

them, we plan in the future to incorporate other omic data into CCN so as to make more 542 

accurate and integrated model evaluation possible. As a first step in this direction, we plan to 543 

incorporate DNA methylation and genomic sequencing data as additional features for our 544 

Random forest classifier as this data is becoming more readily available for both training and 545 

cancer models. We expect that this will both allow us to refine our tumor subtype categories and
enable more accurate predictions of how models respond to perturbations such as drug

treatment.   548 

 A second limitation is that in the cross-species analysis, CCN implicitly assumes that 549 

homologs are functionally equivalent. The extent to which they are not functionally equivalent 550 

determines how confounded the CCN results will be. This possibility seems to be of limited 551 


consequence based on the high performance of the normal tissue cross-species classifier and 552 

based on the fact that GEMMs have the highest median CCN scores (in addition to tumoroids).  553 

 A third caveat to our analysis is that there were many fewer distinct GEMMs and 554 

tumoroids than CCLs and PDXs. As more transcriptional profiles for GEMMs and tumoroids 555 

emerge, this comparative analysis should be revisited to assess the generality of our results. 556 

 Finally, the TCGA training data is made up of RNA-Seq from bulk tumor samples, which 557 

necessarily includes non-tumor cells, whereas the CCLs are by definition cell lines of tumor 558 

origin. Therefore, CCLs theoretically could have artificially low CCN scores due to the presence 559 

of non-tumor cells in the training data. This problem appears to be limited as we found no 560 

correlation between tumor purity and CCN score in the CCLE samples. However, this problem 561 

is related to the question of intra-tumor heterogeneity. We demonstrated the feasibility of using 562 

CCN and single cell RNA-seq data to refine the evaluation of cancer cell lines contingent upon 563 

availability of scRNA-seq training data. As more training single cell RNA-seq data accrues, CCN 564 

would be able to not only evaluate models on a per cell type basis, but also based on cellular 565 

composition. 566 

  We have made the results of our analyses available online so that researchers can 567 

easily explore the performance of selected models or identify the best models for any of the 22 568 

general tumor types and the 36 subtypes presented here. To ensure that CCN is widely 569 

available we have developed a free web application, which performs CCN analysis on user-570 

uploaded data and allows for direct comparison of their data to the cancer models evaluated 571 

here.  We have also made the CCN code freely available under an Open Source license and as 572 

an easily installed R package, and we are actively supporting its further development. Included 573 

in the web application are instructions for training CCN and reproducing our analysis. The 574 

documentation describes how to analyze models and compare the results to the panel of 575 

models that we evaluated here, thereby allowing researchers to immediately compare their 576 

models to the broader field in a comprehensive and standard fashion.  577 


 578 

Online Methods 579 

Training General CancerCellNet Classifier 580 

To generate training data sets, we downloaded RNA-seq expression count matrices for 8,991
patient tumors and their corresponding sample tables across 22 different tumor types from TCGA
using the TCGAWorkflowData, TCGAbiolinks85 and SummarizedExperiment86 packages. We used

all the patient tumor samples for training the general CCN classifier. We limited training and 584 

analysis of RNA-seq data to the 13,142 genes in common between the TCGA dataset and all 585 

the query samples (CCLs, PDXs, GEMMs, and tumoroids). To train the top pair Random forest 586 

classifier, we used a method similar to our previous method23. CCN first normalized the training 587 

counts matrix by down-sampling the counts to 500,000 counts per sample. To significantly 588 

reduce the execution time and memory of generating gene pairs for all possible genes, CCN 589 

then selected n up-regulated genes, n down-regulated genes and n least differentially 590 

expressed genes (CCN training parameter nTopGenes = n) for each of the 22 cancer 591 

categories using template matching87 as the genes to generate top scoring gene pairs. In short, 592 

for each tumor type, CCN defined a template vector that labelled the training tumor samples in
the cancer type of interest as 1 and all other tumor samples as 0. CCN then calculated the Pearson
correlation coefficient between the template vector and gene expression for all genes. The genes

with strong match to template as either upregulated or downregulated had large absolute 596 

Pearson correlation coefficient. CCN chose the upregulated, downregulated and least 597 

differentially expressed genes based on the magnitude of Pearson correlation coefficient.  598 
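The template matching step above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the cancerCellNet R implementation: the function names (`pearson`, `select_genes`) and the toy expression values are hypothetical, and ties or overlaps between the three gene sets are not handled specially.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def select_genes(expr, labels, target, n):
    """Pick n up-regulated, n down-regulated and n least differential genes.

    expr maps gene name -> expression values (one per training sample);
    labels gives the tumor type of each sample; target is the type of interest.
    """
    # Template: 1 for samples of the target type, 0 for all other samples.
    template = [1.0 if lab == target else 0.0 for lab in labels]
    corr = {gene: pearson(values, template) for gene, values in expr.items()}
    ascending = sorted(corr, key=corr.get)
    down = ascending[:n]                # most negative correlation
    up = ascending[-n:][::-1]           # most positive correlation
    least = sorted(corr, key=lambda g: abs(corr[g]))[:n]  # closest to zero
    return up, down, least

# Toy data: 4 samples, 3 genes (hypothetical values).
expr = {"A": [10, 9, 1, 2], "B": [1, 2, 9, 10], "C": [5, 5, 5, 5]}
labels = ["LUAD", "LUAD", "LUSC", "LUSC"]
print(select_genes(expr, labels, "LUAD", 1))
```

With small gene sets or large n the three lists can overlap; a fuller implementation would need to decide how to resolve that.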

After CCN selected the genes for each cancer type, CCN generated gene pairs among 599 

those genes. Gene pair transformation was a method inspired by the top-scoring pair classifier88 600 

to allow compatibility of classifier with query expression profiles that were collected through 601 

different platforms (e.g. microarray query data applied to RNA-seq training data). In brief, the 602 

gene pair transformation compares 2 genes within an expression sample and encodes the 603 


“gene1_gene2” gene-pair as 1 if the first gene has higher expression than the second gene. 604 

Otherwise, gene pair transformation would encode the gene-pair as 0. Using all the gene pair 605 

combinations generated through the gene sets per cancer type, CCN then selected top m 606 

discriminative gene pairs (CCN training parameter nTopGenePairs = m) for each category using 607 

template matching (with large absolute Pearson correlation coefficient) described above. To 608 

prevent any single gene from dominating the gene pair list, we allowed each gene to appear at
most three times among the gene pairs selected as features per cancer type.
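A minimal sketch of the gene pair transformation and of the per-gene cap may help. It assumes the candidate pairs have already been ranked from most to least discriminative (for example by template matching); the function names and the toy data are hypothetical, and this is not the code of the cancerCellNet package.

```python
from itertools import combinations

def gene_pair_transform(sample, pairs):
    """Encode each "gene1_gene2" pair as 1 if gene1 > gene2 in this sample, else 0."""
    return {f"{g1}_{g2}": 1 if sample[g1] > sample[g2] else 0 for g1, g2 in pairs}

def select_pairs(ranked_pairs, m, max_per_gene=3):
    """Take the top m pairs from a ranked list, using each gene at most max_per_gene times."""
    counts = {}
    chosen = []
    for g1, g2 in ranked_pairs:
        if counts.get(g1, 0) >= max_per_gene or counts.get(g2, 0) >= max_per_gene:
            continue  # this pair would over-use one of its genes
        chosen.append((g1, g2))
        counts[g1] = counts.get(g1, 0) + 1
        counts[g2] = counts.get(g2, 0) + 1
        if len(chosen) == m:
            break
    return chosen

# Toy expression sample (hypothetical values).
sample = {"A": 5.0, "B": 2.0, "C": 7.0}
pairs = list(combinations(sorted(sample), 2))
print(gene_pair_transform(sample, pairs))
```

Because the encoding depends only on the within-sample ordering of two genes, it is unaffected by monotone differences in scale between platforms, which is the motivation given above.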

After the top discriminative gene pairs were selected for each cancer category, CCN 611 

grouped all the gene pairs together and gene pair transformed the training samples into a binary 612 

matrix with all the discriminative gene pairs as row names and all the training samples as 613 

column names. Using the binary gene pair matrix, CCN randomly shuffled the binary values 614 

across rows then across columns to generate random profiles that should not resemble training 615 

data from any of the cancer categories. CCN then sampled 70 random profiles, annotated them 616 

as “Unknown” and used them as training data for the “Unknown” category. Using the gene pair
binary training matrix, CCN constructed a multi-class Random Forest classifier of 2000 trees
and used stratified sampling with a sample size of 60 to ensure balance of training data in constructing
the decision trees.
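The generation of random "Unknown" profiles can be sketched as below. This is one reading of the description, assuming that values are permuted within each row and then within each column of the binary matrix; the function name, the seed handling and the toy matrix are hypothetical.

```python
import random

def make_unknown_profiles(matrix, k, seed=0):
    """Shuffle a binary gene-pair matrix and sample k random profiles.

    matrix is a list of rows (gene pairs) by columns (training samples).
    Returns k profiles, each a list with one value per gene pair.
    k must not exceed the number of columns.
    """
    rng = random.Random(seed)
    # Step 1: permute the values within each row.
    shuffled = [rng.sample(row, len(row)) for row in matrix]
    n_rows, n_cols = len(shuffled), len(shuffled[0])
    # Step 2: permute the values within each column.
    for j in range(n_cols):
        column = [shuffled[i][j] for i in range(n_rows)]
        rng.shuffle(column)
        for i in range(n_rows):
            shuffled[i][j] = column[i]
    # Step 3: sample k columns as random profiles labelled "Unknown".
    picked = rng.sample(range(n_cols), k)
    return [[shuffled[i][j] for i in range(n_rows)] for j in picked]

# Toy 4 x 6 binary matrix (hypothetical).
toy = [
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [1, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 1],
]
unknown = make_unknown_profiles(toy, 3)
print(unknown)
```

In the procedure described above, k would be 70 and the profiles would be added to the training set under the "Unknown" label.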

To identify the best set of genes and gene-pair parameters (n and m), we used a grid-621 

search cross-validation89 strategy with 5 cross-validations at each parameter set. The specific 622 

parameters for the final CCN classifier using the function “broadClass_train” in the package 623 

cancerCellNet are in Supp Tab 9. The gene-pairs are in Supp Tab 10. 624 

 625 

Validating General CancerCellNet Classifier 626 

Two thirds of patient tumor data from each cancer type were randomly sampled as 627 

training data to construct a CCN classifier. Based on the training data, CCN selected the 628 

classification genes and gene-pairs and trained a classifier. After the classifier was built, 35 629 


held-out samples from each cancer category were sampled and 40 “Unknown” profiles were 630 

generated for validation. The process of randomly sampling training set from 2/3 of all patient 631 

tumor data, selecting features based on the training set, training classifier and validating was 632 

repeated 50 times to have a more comprehensive assessment of the classifier trained with the 633 

optimal parameter set. To test the performance of final CCN on independent testing data, we 634 

applied it to 725 profiles from ICGC spanning 6 projects that do not overlap with TCGA (BRCA-635 

KR, LIRI-JP, OV-AU, PACA-AU, PACA-CA, PRAD-FR).  636 

 637 

Selecting Decision Thresholds

Our strategy for selecting a decision threshold was to find the value that maximizes the average Macro F1 measure90 in each of the 50 cross-validations that were performed with the optimal parameter set, testing thresholds between 0 and 1 in increments of 0.01. The F1 measure is defined as:

$$\text{Macro F1} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$

We selected the most commonly occurring threshold above 0.2 that maximized the average Macro F1 measure across the 50 cross-validations as the decision threshold for the final classifier (threshold = 0.25). The same approach was applied to the subtype classifiers. The thresholds and the corresponding average precision, recall and F1 measures are recorded in Supp Tab 11.
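
The threshold search can be illustrated with a minimal Python sketch. It is not the cancerCellNet implementation (the package is written in R), and it rests on assumptions that are not stated in the text: a sample is called positive for a class when its score for that class exceeds the threshold, precision and recall are averaged over classes before the F1 formula is applied, each cross-validation contributes its own best threshold, and ties are resolved in favor of the lowest threshold. All function names are hypothetical.

```python
from collections import Counter


def macro_f1(scores, labels, classes, threshold):
    """Macro F1 from class-averaged precision and recall at one threshold.

    scores: list of dicts mapping class name -> classification score.
    labels: list of true class names, aligned with scores.
    """
    precisions, recalls = [], []
    for c in classes:
        tp = sum(1 for s, y in zip(scores, labels) if s[c] > threshold and y == c)
        fp = sum(1 for s, y in zip(scores, labels) if s[c] > threshold and y != c)
        fn = sum(1 for s, y in zip(scores, labels) if s[c] <= threshold and y == c)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    p = sum(precisions) / len(classes)
    r = sum(recalls) / len(classes)
    return 2 * p * r / (p + r) if p + r else 0.0


def best_threshold(scores, labels, classes, floor=0.2):
    """Scan thresholds 0.00, 0.01, ..., 1.00 and keep the best one above `floor`."""
    grid = [i / 100 for i in range(101)]
    candidates = [t for t in grid if t > floor]
    # max() returns the first maximum, so ties go to the lowest threshold (assumption).
    return max(candidates, key=lambda t: macro_f1(scores, labels, classes, t))


def select_decision_threshold(folds, classes, floor=0.2):
    """Most frequently selected threshold across cross-validations.

    folds: list of (scores, labels) tuples, one per cross-validation.
    """
    picks = [best_threshold(s, y, classes, floor) for s, y in folds]
    return Counter(picks).most_common(1)[0][0]
```

Calling `select_decision_threshold` on the (scores, labels) pairs of the repeated cross-validations would return one threshold for the final classifier under this reading of the procedure.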

Classifying Query Data into General Cancer Categories

We downloaded the RNA-seq expression profiles and sample table of cancer cell lines from CCLE (https://portals.broadinstitute.org/ccle/data), and the microarray expression profiles and sample table of cancer cell lines from Barretina et al37. We extracted two WT control NCCIT RNA-seq expression profiles from Grow et al91. We received PDX expression estimates and sample annotations from the authors of Gao et al20. We gathered GEMM expression profiles from nine different studies59–67. We downloaded tumoroid expression profiles from the NCI Patient-Derived Models Repository (PDMR)69 and from three individual studies70–72. To use the CCN classifier on GEMM data, the mouse genes in the GEMM expression profiles were converted into their human homologs. The query samples were classified using the final CCN classifier. Each query sample was labelled with one of four classification categories, “correct”, “mixed”, “none” or “other”, based on its classification profile. If a sample has a CCN score higher than the decision threshold only in the labelled cancer category, we assigned it as “correct”. If a sample has CCN scores higher than the decision threshold in the labelled cancer category and in other cancer categories, we assigned it as “mixed”. If a sample has no CCN score higher than the decision threshold in any cancer category, or has its highest CCN score in the “Unknown” category, we assigned it as “none”. If a sample has a CCN score higher than the decision threshold in one or more cancer categories that do not include the labelled cancer category, we assigned it as “other”. We analyzed and visualized the results using R and the R packages pheatmap92 and ggplot293.
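
The four-way labelling rule can be written as a short function. This is a hedged Python sketch, not code from the cancerCellNet package. The function name and the input layout (a dict of CCN scores per category) are hypothetical, and the order in which the rules are checked (the “none” condition first) is an assumption, since the text does not state a precedence.

```python
def classification_category(scores, labelled, threshold=0.25, unknown="Unknown"):
    """Label one query sample as 'correct', 'mixed', 'none' or 'other'.

    scores: dict mapping category name (including `unknown`) -> CCN score.
    labelled: the cancer category the model is annotated as.
    """
    top_category = max(scores, key=scores.get)
    # Cancer categories (excluding "Unknown") whose score exceeds the threshold.
    above = {c for c, s in scores.items() if c != unknown and s > threshold}

    if not above or top_category == unknown:
        return "none"
    if above == {labelled}:
        return "correct"
    if labelled in above:
        return "mixed"
    return "other"
```

For example, a sample labelled OV with scores above the threshold in both OV and UCEC would be returned as "mixed" under these rules.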

Cross-Species Assessment

To assess the performance of cross-species classification, we downloaded 1003 labelled human tissue/cell type and 1993 labelled mouse tissue/cell type RNA-seq expression profiles from GitHub (https://github.com/pcahan1/CellNet). We first converted the mouse genes into their human homologs. We then found the genes shared between the mouse and human tissue/cell expression profiles. Limiting the input of human tissue RNA-seq profiles to the intersecting genes, we trained a CCN classifier with all the human tissue/cell expression profiles. The parameters used for the function “broadClass_train” in the package cancerCellNet are in Supp Tab 9. We randomly sampled 75 samples from each tissue category in the mouse tissue/cell data and applied the classifier to those samples to assess performance.
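
The two preprocessing steps described above (homolog conversion, then restriction to shared genes) can be sketched in Python as follows. This is an illustration only. The homolog table, the function names and the rule for many-to-one mappings (the first mouse gene encountered is kept) are assumptions and are not taken from the text or from the cancerCellNet package.

```python
def to_human_genes(mouse_profile, homologs):
    """Rename mouse genes to their human homologs.

    mouse_profile: dict mapping mouse gene symbol -> expression value.
    homologs: dict mapping mouse gene symbol -> human gene symbol (hypothetical table).
    Genes without a homolog are dropped; if several mouse genes map to the same
    human gene, only the first one encountered is kept (assumption).
    """
    converted = {}
    for mouse_gene, value in mouse_profile.items():
        human_gene = homologs.get(mouse_gene)
        if human_gene is not None and human_gene not in converted:
            converted[human_gene] = value
    return converted


def restrict_to_shared_genes(profile_a, profile_b):
    """Return the sorted shared genes and both profiles limited to those genes."""
    shared = sorted(set(profile_a) & set(profile_b))
    return (
        shared,
        {g: profile_a[g] for g in shared},
        {g: profile_b[g] for g in shared},
    )
```

A classifier would then be trained on the human profiles limited to `shared` and applied to the converted mouse profiles limited to the same genes.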

Cross-Technology Assessment

To assess the performance of CCN on microarray data, we gathered 6,219 patient tumor microarray profiles across 12 different cancer types from more than 100 different projects (Supp Tab 12). We found the intersecting genes between the microarray profiles and the TCGA patient RNA-seq profiles. Limiting the input of RNA-seq profiles to the intersecting genes, we created a CCN classifier with all the TCGA patient profiles using the parameters for the function “broadClass_train” listed in Supp Tab 9. After the microarray-specific classifier was trained, we randomly sampled 60 microarray patient samples from each cancer category and applied the CCN classifier to them to assess cross-technology performance (Supp Fig 2A). The same CCN classifier was used to assess microarray CCL samples (Supp Fig 2B).

Training and Validating scRNA-seq Classifier

We extracted labelled human melanoma and glioblastoma scRNA-seq expression profiles40,41 and compiled the two datasets, excluding 3 cell types (T.CD4, T.CD8 and Myeloid) because they had too few cells for training. 60 cells from each of the 11 cell types were sampled to train a scRNA-seq classifier. The parameters for training a general scRNA-seq classifier using the function “broadClass_train” are in Supp Tab 9. 25 cells from each of the 11 cell types in the held-out data were selected to assess the single-cell classifier. By maximizing the average Macro F1 measure, we selected a decision threshold of 0.255. The gene pairs that were selected to construct the classifier are in Supp Tab 10. To assess the cross-technology capability of applying the scRNA-seq classifier to bulk RNA-seq, we downloaded 305 expression profiles spanning 4 purified cell types (B cells, endothelial cells, monocyte/macrophage, fibroblast) from https://github.com/pcahan1/CellNet.

Training Subtype CancerCellNet

We found 11 cancer types (BRCA, COAD, ESCA, HNSC, KIRC, LGG, PAAD, UCEC, STAD, LUAD, LUSC) that have meaningful subtypes based on either histology or molecular profile and have sufficient samples to train a subtype classifier with high AUPR. We also included normal tissue samples from BRCA, COAD, HNSC, KIRC and UCEC to create a normal tissue category in the construction of their subtype classifiers. Training samples were labelled either as a cancer subtype of the cancer of interest or as “Unknown” if they belonged to other cancer types. As in general classifier training, CCN performed the gene-pair transformation and selected the most discriminative gene pairs for each cancer subtype. In addition to the gene pairs selected to discriminate cancer subtypes, CCN also performed general classification of all training data and appended the resulting classification profiles to the gene-pair binary matrix as additional features. The rationale for using the general classification profile as additional features is that many general cancer types may share similar subtypes, so the general classification profile could be an important feature for discriminating the general cancer type of interest from other cancer types before performing finer subtype classification. The specific parameters used to train individual subtype classifiers with the “subClass_train” function of the CancerCellNet package are in Supp Tab 9, and the gene pairs are in Supp Tab 10.
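
How the subtype feature vector is assembled can be illustrated with a small Python sketch. It is not the “subClass_train” implementation. It assumes that a gene-pair feature is 1 when the first gene of the pair is expressed above the second and 0 otherwise, which is a common form of pair transform but is not spelled out in this section, and it leaves out the tunable weight of the general classification profile. All names are hypothetical.

```python
def gene_pair_features(expression, gene_pairs):
    """Binary gene-pair transform of one sample.

    expression: dict mapping gene -> expression value.
    gene_pairs: list of (gene_a, gene_b) tuples.
    Returns 1 where gene_a is expressed above gene_b, else 0 (assumed definition).
    """
    return [1 if expression[a] > expression[b] else 0 for a, b in gene_pairs]


def subtype_features(expression, gene_pairs, general_profile, categories):
    """Gene-pair features followed by the general classification profile.

    general_profile: dict mapping general cancer category -> CCN score for this sample.
    categories: fixed ordering of the general categories, so that every sample
    yields features in the same column order.
    """
    pair_part = gene_pair_features(expression, gene_pairs)
    profile_part = [general_profile[c] for c in categories]
    return pair_part + profile_part
```

Stacking the vectors returned by `subtype_features` for all training samples gives the feature matrix on which a subtype classifier could be trained.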

Validating Subtype CancerCellNet

As in validating the general classifier, we randomly sampled 2/3 of all samples in each cancer subtype as training data and sampled an equal number of samples across subtypes from the remaining 1/3 held-out data to assess the subtype classifiers. We repeated the process 20 times for a more comprehensive assessment of the subtype classifiers.


Classifying Query Data into Subtypes

We assigned a subtype to a query sample if the sample had a CCN score higher than the decision threshold in that subtype. The decision thresholds for the subtype classifiers are in Supp Tab 11. If no CCN score exceeded the decision threshold in any subtype, or if the highest CCN score was in the “Unknown” category, we assigned the sample as “Unknown”. Analysis was performed in R and visualizations were generated with the ComplexHeatmap package94.

Cell Culture, Immunohistochemistry and Histomorphometry

Caov-4 (ATCC® HTB-76™), SK-OV-3 (ATCC® HTB-77™), RT4 (ATCC® HTB-2™), and NCCIT (ATCC® CRL-2073™) cell lines were purchased from ATCC. HEC-59 (C0026001) and A2780 (93112519-1VL) were obtained from Addexbio Technologies and Sigma-Aldrich, respectively. Vcap and PC-3. SK-OV-3, Vcap, and RT4 were cultured in Dulbecco's Modified Eagle Medium (DMEM, high glucose, 11960069, Gibco) with 1% Penicillin-Streptomycin-Glutamine (10378016, Life Technologies); Caov-4, PC-3, NCCIT, and A2780 were cultured in RPMI-1640 medium (11875093, Gibco), while HEC-59 was cultured in Iscove's Modified Dulbecco's Medium (IMDM, 12440053, Gibco). Both media were supplemented with 1% Penicillin-Streptomycin (15140122, Gibco). All media included 10% Fetal Bovine Serum (FBS).

Cells cultured in 48-well plates were washed twice with PBS and fixed in 10% buffered formalin for 24 hrs at 4 °C. Immunostaining was performed using a standard protocol. Cells were incubated with the primary antibodies goat anti-HOXB6 (10 µg/mL, PA5-37867, Invitrogen), mouse anti-WT1 (10 µg/mL, MA1-46028, Invitrogen), rabbit anti-PPARG (1:50, ABN1445, Millipore), mouse anti-FOLH1 (10 µg/mL, UM570025, Origene), and rabbit anti-LIN28A (1:50, #3978, Cell Signaling) in Antibody Diluent (S080981-2, DAKO) at 4 °C overnight, followed by three 5 min washes in TBST. The slides were then incubated with fluorescence-conjugated secondary antibodies at room temperature for 1 h, protected from light, followed by three 5 min washes in TBST, and nuclei were stained with mounting medium containing DAPI. Images were captured with a Nikon Eclipse Ti-S, DS-U3 and DS-Qi2.

Histomorphometry was performed using ImageJ (Version 2.0.0-rc-69/1.52i). The percentage of positive cells was calculated as the number of positively stained cells divided by the number of DAPI-positive nuclei within three randomly chosen areas. The data are expressed as mean ± SD.

Tumor Purity Analysis

We used the R package ESTIMATE95 to calculate ESTIMATE scores from the TCGA tumor expression profiles that we used as training data for the CCN classifier. To calculate tumor purity, we used the equation described in Yoshihara et al., 201395:

$$\text{Tumor purity} = \cos(0.6049872018 + 0.0001467884 \times \text{ESTIMATE score})$$
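
As a worked illustration, the purity equation translates directly into code. The Python function below is a sketch with a hypothetical name; the analysis itself was carried out with the R package ESTIMATE.

```python
import math


def tumor_purity(estimate_score):
    """Tumor purity from an ESTIMATE score, using the cosine formula given above."""
    return math.cos(0.6049872018 + 0.0001467884 * estimate_score)
```

Within the range where the cosine argument stays between 0 and π, higher ESTIMATE scores give lower purity values.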

Extracting Citation Counts

We used the R package RISmed96 to extract the number of citations for each cell line by querying PubMed with “cell line name[Text Word] AND cancer[Text Word]”. The citation counts were normalized by dividing them by the number of years since the cell line was first documented:

$$\text{Normalized citation counts} = \frac{\text{citation counts}}{\text{number of years since first documented}}$$
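
The query string and the normalization can be sketched in a few lines of Python. This does not reproduce the RISmed calls used in the study. The function names are hypothetical, and computing the number of years as a difference of calendar years is an assumption.

```python
def pubmed_query(cell_line_name):
    """Build the PubMed query string in the form described above."""
    return f"{cell_line_name}[Text Word] AND cancer[Text Word]"


def normalized_citation_count(citation_count, first_documented_year, current_year):
    """Citations per year since the cell line was first documented.

    The year difference is an assumed way of counting "years since first documented".
    """
    years = current_year - first_documented_year
    if years <= 0:
        raise ValueError("need at least one full year since first documentation")
    return citation_count / years
```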

GRN Construction and GRN Status

GRN construction was extended from our previous method21. 80 samples per cancer type were randomly sampled and normalized through down-sampling as training data for the CLR GRN construction algorithm. Cancer type-specific GRNs were identified by determining the differentially expressed genes for each cancer type and extracting the subnetwork containing those genes.

To extend the original GRN status algorithm21 across different platforms and species, we devised a rank-based GRN status algorithm. Like the original GRN status, the rank-based GRN status is a metric of the similarity of a cancer type-specific GRN between the training data of the cancer type of interest and a query sample. Hence, a high GRN status indicates that the cancer-specific GRN is established in the query sample to a degree similar to that in the training data. The expression profiles of the training data and query data were transformed into rank expression profiles by replacing the expression values with their ranks within a sample (the highest expressed gene has the highest rank and the lowest expressed gene has a rank of 1). The cancer type-specific mean and standard deviation of every gene’s rank expression were learned from the training data. Modified Z-scores for the genes within the cancer type-specific GRN were then calculated from the query sample’s rank expression profile to quantify how dissimilar the expression of those genes in the query sample is from that of the reference training data:

$$\mathrm{Zscore}(\text{gene } i)_{mod} = \begin{cases} 0, & \text{if Zscore is positive and the gene is found to be upregulated} \\ 0, & \text{if Zscore is negative and the gene is found to be downregulated} \\ |\mathrm{Zscore}|, & \text{otherwise} \end{cases}$$

If a gene in the cancer type-specific GRN is upregulated in that cancer type relative to other cancer types, we consider the query sample’s gene to be similar when its rank in the query sample is equal to or greater than the mean rank of the gene in the training samples. Because the gene is similar, we assign it a modified Z-score of 0. The same principle applies, with the direction reversed, to genes that are downregulated in the cancer-specific subnetwork.

The GRN status of a query sample is calculated as the weighted mean of $(1000 - \mathrm{Zscore}(\text{gene } i)_{mod})$ across the genes in the cancer type-specific GRN. 1000 is an arbitrary large number; greater dissimilarity between the query’s cancer type-specific GRN and that of the training data yields higher Z-scores for the GRN genes and a lower GRN status.

$$RGS = \sum_{i=1}^{n} \left(1000 - \mathrm{Zscore}(\text{gene } i)_{mod}\right) \times \mathrm{weight}_{\text{gene } i}$$

$$\text{GRN status} = \frac{RGS}{\sum_{i=1}^{n} \mathrm{weight}_{\text{gene } i}}$$

The weight of each gene in the cancer-specific network is determined by the importance of the gene in the Random Forest classifier. Finally, the GRN status is normalized with respect to the GRN status of the cancer type of interest and that of the cancer type with the lowest mean GRN status:

$$\text{Normalized GRN status} = \frac{\text{GRN status}_{\text{query}} - avg(\text{GRN status}_{\text{min cancer}})}{avg(\text{GRN status}_{\text{cancer type interest}})}$$

where “min cancer” is the cancer type whose training data have the lowest mean GRN status in the cancer type of interest, $avg(\text{GRN status}_{\text{min cancer}})$ is that lowest average GRN status, and $avg(\text{GRN status}_{\text{cancer type interest}})$ is the average GRN status of the cancer type of interest in the training data.
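
The rank-based GRN status calculation can be followed step by step in the Python sketch below. It is an illustration under stated assumptions, not the package's R code. The data layout (one tuple of mean rank, standard deviation, direction and weight per GRN gene), the function names and the handling of tied expression values (broken by sort order) are all hypothetical.

```python
def rank_transform(expression):
    """Replace expression values with within-sample ranks.

    The lowest expressed gene gets rank 1 and the highest expressed gene gets
    the highest rank. Ties are broken by sort order (assumption).
    """
    ordered = sorted(expression, key=expression.get)
    return {gene: rank for rank, gene in enumerate(ordered, start=1)}


def modified_z(rank, mean_rank, sd_rank, direction):
    """Modified Z-score: 0 when the deviation agrees with the gene's direction."""
    z = (rank - mean_rank) / sd_rank
    if direction == "up" and z > 0:
        return 0.0
    if direction == "down" and z < 0:
        return 0.0
    return abs(z)


def grn_status(query_ranks, grn, offset=1000):
    """Weighted mean of (offset - modified Z) over the genes of one GRN.

    grn: dict mapping gene -> (mean_rank, sd_rank, direction, weight), where the
    mean and sd come from training data and direction is "up" or "down".
    """
    weighted_sum = sum(
        (offset - modified_z(query_ranks[g], m, sd, d)) * w
        for g, (m, sd, d, w) in grn.items()
    )
    total_weight = sum(w for (_, _, _, w) in grn.values())
    return weighted_sum / total_weight


def normalized_grn_status(status_query, avg_min_cancer, avg_cancer_of_interest):
    """Normalization as written in the equation above."""
    return (status_query - avg_min_cancer) / avg_cancer_of_interest
```

In use, `rank_transform` would be applied to each query profile, `grn_status` evaluated for the GRN of each cancer type, and the result normalized with the two training-set averages.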

Code availability

CancerCellNet code and documentation are available on GitHub: https://github.com/pcahan1/cancerCellNet

Acknowledgements

This work was supported by the National Institutes of Health NCI Ovarian Cancer SPORE P50CA228991 via a Development Research Program award to PC. FWH was supported by a Prostate Cancer Foundation Young Investigator Award, Department of Defense W81XWH-17-PCRP-HD (F.W.H.), and the National Institutes of Health/National Cancer Institute P20 CA233255-01 (F.W.H.) and U19 CA214253 (F.W.H.). We would like to thank John Powers, Hao Zhu, Tian-Li Wang, Charles Eberhart, and Kaloyan Tsanov for comments on the manuscript and helpful discussions. Some figures were created in part with Biorender.com.

FIGURE LEGENDS

Fig. 1 CancerCellNet (CCN) workflow, training, and performance. (A) Schematic of CCN usage. CCN was designed to assess and compare the expression profiles of cancer models such as CCLs, PDXs, GEMMs, and tumoroids with native patient tumors. To use a trained classifier, CCN takes the query samples (e.g. expression profiles from CCLs, PDXs, GEMMs, tumoroids) as input and generates a classification profile for each query sample. The column names of the classification heatmap represent sample annotations and the row names represent different cancer types. Each cell is colored from black to yellow, representing the lowest classification score (e.g. 0) to the highest classification score (e.g. 1). (B) Schematic of the CCN training process. CCN uses patient tumor expression profiles of 22 different cancer types from TCGA as training data. First, CCN identifies n genes that are upregulated, n that are downregulated, and n that are relatively invariant in each tumor type versus all of the others. Then, CCN performs a pair transform on these genes and subsequently selects the most discriminative set of m gene pairs for each cancer type as features (or predictors) for the Random Forest classifier. Lastly, CCN trains a multi-class Random Forest classifier using the gene-pair transformed training data. (C) Parameter optimization strategy. 5 cross-validations of each parameter set, in which 2/3 of the TCGA data was used to train and 1/3 to validate, were used to search for the values of n and m that maximized the performance of the classifier as measured by the area under the precision-recall curve (AUPRC). (D) Mean and standard deviation of classifiers based on 50 cross-validations with the optimal parameter set. (E) AUPRC of the final CCN classifier when applied to independent patient tumor data from ICGC.


Fig. 2 Evaluation of cancer cell lines. (A) General classification heatmap of CCLs extracted from CCLE. Column annotations of the heatmap represent the labelled cancer category of the CCLs given by CCLE and the row names of the heatmap represent different cancer categories. CCLs’ general classification profiles are categorized into 4 categories: correct (red), correct mixed (pink), no classification (light green) and other classification (dark green) based on the decision threshold of 0.25. (B) Bar plot of the proportion of each classification category in CCLs across cancer types, ordered from the cancer type with the highest proportion of correct and correct mixed CCLs to the lowest. (C) Comparison between SKCM general CCN scores from the bulk RNA-seq classifier and SKCM malignant CCN scores from the scRNA-seq classifier for SKCM CCLs. (D) Comparison between SARC general CCN scores from the bulk RNA-seq classifier and CAF CCN scores from the scRNA-seq classifier for SKCM CCLs. (E) Comparison between GBM general CCN scores from the bulk RNA-seq classifier and GBM neoplastic CCN scores from the scRNA-seq classifier for GBM CCLs. (F) Comparison between SARC general CCN scores and CAF CCN scores from the scRNA-seq classifier for GBM CCLs. The green lines indicate the decision thresholds for the scRNA-seq classifier and the general classifier.

Fig. 3 Immunofluorescence of selected cell lines. (A) Classification profiles (left) and IF expression (middle) of Caov-4 (OV positive control), HEC-59 (UCEC positive control) and SK-OV-3 for WT1 (OV biomarker) and HOXB6 (uterine biomarker). The bar plots quantify the average percentage of positive cells for WT1 (top-right) and HOXB6 (bottom-right). (B) Classification profiles (left) and IF expression (middle) of Caov-4, NCCIT (germ cell tumor positive control) and A2780 for WT1 and LIN28A (germ cell tumor biomarker). Classification of NCCIT was performed using the RNA-seq profiles of the duplicate WT control NCCIT samples from Grow et al91. The bar plots quantify the average percentage of positive cells for WT1 (top-right) and LIN28A (bottom-right). (C) Classification profiles (left) and IF expression (middle) of Vcap (PRAD positive control), RT4 (BLCA positive control) and PC-3 for FOLH1 (prostate biomarker) and PPARG (urothelial biomarker). The bar plots quantify the average percentage of positive cells for FOLH1 (top-right) and PPARG (bottom-right).

Fig. 4 Subtype classification of CCLs and CCL prevalence. The heatmap visualizations represent subtype classification of (A) UCEC CCLs, (B) LUSC CCLs and (C) LUAD CCLs. Only samples with CCN scores > 0.1 in their nominal tumor type are displayed. (D) Comparison of normalized citation counts and general CCN classification scores of CCLs. Labelled cell lines either have the highest CCN classification score in their labelled cancer category or the highest normalized citation count. Each citation count was normalized by the number of years since the cell line was first documented on PubMed.

Fig. 5 Evaluation of patient derived xenografts. (A) General classification heatmap of PDXs. Column annotations represent the annotated cancer type of the PDXs, and row names represent cancer categories. (B) Proportion of classification categories in PDXs across cancer types, visualized in the bar plot and ordered from the cancer type with the highest proportion of correct and mixed correct classified PDXs to the lowest. Subtype classification heatmaps of (C) UCEC PDXs, (D) LUSC PDXs and (E) LUAD PDXs. Only samples with CCN scores > 0.1 in their nominal tumor type are displayed.

Fig. 6 Evaluation of genetically engineered mouse models. (A) General classification heatmap of GEMMs. Column annotations represent the annotated cancer type of the GEMMs, and row names represent cancer categories. (B) Proportion of classification categories in GEMMs across cancer types, visualized in the bar plot and ordered from the cancer type with the highest proportion of correct and mixed correct classified GEMMs to the lowest. Subtype classification heatmaps of (C) UCEC GEMMs, (D) LUSC GEMMs and (E) LUAD GEMMs. Only samples with CCN scores > 0.1 in their nominal tumor type are displayed.


Fig. 7 Evaluation of tumoroid models. (A) General classification heatmap of tumoroids. Column annotations represent the annotated cancer type of the tumoroids, and row names represent cancer categories. (B) Proportion of classification categories in tumoroids across cancer types, visualized in the bar plot and ordered from the cancer type with the highest proportion of correct and mixed correct classified tumoroids to the lowest. Subtype classification heatmaps of (C) UCEC tumoroids, (D) LUSC tumoroids and (E) LUAD tumoroids. Only samples with CCN scores > 0.1 in their nominal tumor type are displayed.

Fig. 8 Comparison of CCLs, PDXs, and GEMMs. Box-and-whiskers plot comparing general CCN scores across CCLs, GEMMs, and PDXs of five general tumor types (UCEC, PAAD, LUSC, LUAD, LIHC).

Supplementary Information

Supplementary Figure 1 Assessment of CCN general classifier and subtype classifier. (A) Mean AUPRC of repeated grid-search cross-validation for each parameter grid. (B) Mean and range of the CCN classifier’s PR curves from 50 cross-validations based on the optimal feature selection parameters n and m. (C) AUPRC of the CCN human tissue classifier when applied to mouse tissue data. (D) Schematic of training a subtype classifier in CCN. CCN uses patient tumor expression profiles from the cancer of interest as training data. CCN performs the gene-pair transformation and selects the most discriminative gene pairs among the cancer subtypes from the training data as features. CCN then applies the general classification to the training data and uses the general classification profile as features in addition to the gene pairs for training a Random Forest classifier. The weight of the general classification profiles as features can be tuned to improve AUPRC. (E) The mean and standard deviation of AUPRC for 11 subtype classifiers based on 20 iterations of random sampling of training and held-out data, training the subtype classifier using the training data, classification of the held-out data, and calculation of recall and precision.

Supplementary Figure 2 Further validation of CCN and classification results. To validate the cross-platform classification performance of CCN, a new classifier for microarray data was trained using RNA-seq data from TCGA as training data, restricted to the genes shared between the RNA-seq data and the microarray data. (A) AUPRC of the CCN classifier when applied to tumor profiles assayed on microarrays. (B) Classification heatmap of CCLs using microarray expression data. (C) Pearson correlation between CCN scores of CCLE lines generated from RNA-seq data and microarray data. (D) Comparison between CCLs’ CCN scores and the similarity metric from Yu et al15, the median correlation of transcriptional profiles between CCLs and TCGA tumors from the CCLs’ labelled cancer category. (E) Comparison of the mean tumor purity of the training data and the mean CCN scores of CCLs for each cancer category.

Supplementary Figure 3 Single-cell classification of SKCM and GBM cell lines. (A) AUPRC of the single-cell classifier when applied to scRNA-seq held-out data. (B) AUPRC of the scRNA-seq classifier when applied to purified bulk RNA samples. (C) Single-cell classification of SKCM CCLs. Red bar-plot (top) represents general CCN scores in SARC and blue bar-plot (bottom) represents general CCN scores in SKCM. (D) Single-cell classification of GBM CCLs. Red bar-plot (top) represents general CCN scores in SARC and yellow bar-plot (bottom) represents general CCN scores in GBM.

Supplementary Figure 4 Correlation between cancer type specific network GRN status and general CCN scores.

Supplementary Figure 5 Proportion of cancer subtypes in different cancer models and TCGA tumor data across 11 general cancer types.

Supplementary Table 1 General classification profiles of CCLs.

Supplementary Table 2 Subtype classification profiles of CCLs.

Supplementary Table 3 General classification profiles of PDXs.

Supplementary Table 4 Subtype classification profiles of PDXs.

Supplementary Table 5 General classification profiles of GEMMs.

Supplementary Table 6 Subtype classification profiles of GEMMs.

Supplementary Table 7 General classification profiles of tumoroids.

Supplementary Table 8 Subtype classification profiles of tumoroids.

Supplementary Table 9 Specific parameters used for training of all classifiers.

Supplementary Table 10 Gene-pairs selected for final training of CCN general, subtype classifiers and single-cell classifier.

Supplementary Table 11 Decision thresholds and the corresponding precision and recall for the general classifier and subtype classifier.

Supplementary Table 12 Accessions of tumor microarray data used in validation.

REFERENCES

1. Sharma, S. V., Haber, D. A. & Settleman, J. Cell line-based platforms to evaluate the therapeutic efficacy of candidate anticancer agents. Nat. Rev. Cancer 10, 241–253 (2010).

2. Kersten, K., de Visser, K. E., van Miltenburg, M. H. & Jonkers, J. Genetically engineered mouse models in oncology research and cancer medicine. EMBO Mol. Med. 9, 137–153 (2017).

3. Hidalgo, M. et al. Patient-derived xenograft models: an emerging platform for translational cancer research. Cancer Discov. 4, 998–1013 (2014).

4. Drost, J. & Clevers, H. Organoids in cancer research. Nat. Rev. Cancer 18, 407–418 (2018).

5. Klijn, C. et al. A comprehensive transcriptional portrait of human cancer cell lines. Nat. Biotechnol. 33, 306–312 (2015).

6. Koren, S. et al. PIK3CA(H1047R) induces multipotency and multi-lineage mammary tumours. Nature 525, 114–118 (2015).

7. DeRose, Y. S. et al. Tumor grafts derived from women with breast cancer authentically reflect tumor pathology, growth, metastasis and disease outcomes. Nat. Med. 17, 1514–1520 (2011).


8. Sharpless, N. E. & DePinho, R. A. The mighty mouse: genetically engineered mouse models in cancer drug development. Nat. Rev. Drug Discov. 5, 741–754 (2006).

9. Mouradov, D. et al. Colorectal cancer cell lines are representative models of the main molecular subtypes of primary cancer. Cancer Res. 74, 3238–3247 (2014).

10. Stuckelberger, S. & Drapkin, R. Precious GEMMs: emergence of faithful models for ovarian cancer research. J. Pathol. 245, 129–131 (2018).

11. Domcke, S., Sinha, R., Levine, D. A., Sander, C. & Schultz, N. Evaluating cell lines as tumour models by comparison of genomic profiles. Nat. Commun. 4, 2126 (2013).

12. Jiang, G. et al. Comprehensive comparison of molecular portraits between cell lines and tumors in breast cancer. BMC Genomics 17 Suppl 7, 525 (2016).

13. Chen, B., Sirota, M., Fan-Minogue, H., Hadley, D. & Butte, A. J. Relating hepatocellular carcinoma tumor samples and cell lines using gene expression data in translational research. BMC Med. Genomics 8 Suppl 2, S5 (2015).

14. Vincent, K. M., Findlay, S. D. & Postovit, L. M. Assessing breast cancer cell lines as tumour models by comparison of mRNA expression profiles. Breast Cancer Res. 17, 114 (2015).

15. Yu, K. et al. Comprehensive transcriptomic analysis of cell lines as models of primary tumors across 22 tumor types. Nat. Commun. 10, 3574 (2019).

16. Najgebauer, H. et al. CELLector: Genomics-Guided Selection of Cancer In Vitro Models. Cell Syst. 10, 424–432.e6 (2020).

17. Salvadores, M., Fuster-Tormo, F. & Supek, F. Matching cell lines with cancer type and subtype of origin via mutational, epigenomic, and transcriptomic patterns. Sci. Adv. 6, (2020).

18. Guernet, A. & Grumolato, L. CRISPR/Cas9 editing of the genome for cancer modeling. Methods 121-122, 130–137 (2017).

19. Gargiulo, G. Next-Generation in vivo Modeling of Human Cancers. Front. Oncol. 8, 429 (2018).

20. Gao, H. et al. High-throughput screening using patient-derived tumor xenografts to predict clinical trial drug response. Nat. Med. 21, 1318–1325 (2015).

21. Cahan, P. et al. CellNet: network biology applied to stem cell engineering. Cell 158, 903–915 (2014).

22. Radley, A. H. et al. Assessment of engineered cells using CellNet and RNA-seq. Nat. Protoc. 12, 1089–1102 (2017).

23. Tan, Y. & Cahan, P. SingleCellNet: A Computational Tool to Classify Single Cell RNA-Seq Data Across Platforms and Across Species. Cell Syst. 9, 207–213.e2 (2019).

24. Cancer Genome Atlas Network. Comprehensive molecular characterization of human colon and rectal cancer. Nature 487, 330–337 (2012).

25. Zhang, J. et al. International Cancer Genome Consortium Data Portal--a one-stop shop for cancer genomics data. Database (Oxford) 2011, bar026 (2011).

26. Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumours. Nature 490, 61–70 (2012).

27. Parker, J. S. et al. Supervised risk predictor of breast cancer based on intrinsic subtypes. J. Clin. Oncol. 27, 1160–1167 (2009).


28. Wilkerson, M. D. et al. Lung squamous cell carcinoma mRNA expression subtypes are reproducible, clinically important, and correspond to normal cell types. Clin. Cancer Res. 16, 4864–4875 (2010).

29. Cancer Genome Atlas Research Network. Integrated genomic characterization of pancreatic ductal adenocarcinoma. Cancer Cell 32, 185–203.e13 (2017).

30. Cancer Genome Atlas Research Network et al. Integrated genomic characterization of endometrial carcinoma. Nature 497, 67–73 (2013).

31. Cancer Genome Atlas Research Network et al. Integrated genomic characterization of oesophageal carcinoma. Nature 541, 169–175 (2017).

32. Cancer Genome Atlas Network. Comprehensive genomic characterization of head and neck squamous cell carcinomas. Nature 517, 576–582 (2015).

33. Cancer Genome Atlas Research Network. Comprehensive molecular characterization of clear cell renal cell carcinoma. Nature 499, 43–49 (2013).

34. Verhaak, R. G. W. et al. Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell 17, 98–110 (2010).

35. Cancer Genome Atlas Research Network. Comprehensive molecular profiling of lung adenocarcinoma. Nature 511, 543–550 (2014).

36. Hu, B. et al. Gastric cancer: Classification, histology and application of molecular pathology. J. Gastrointest. Oncol. 3, 251–261 (2012).

37. Barretina, J. et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483, 603–607 (2012).

38. Medico, E. et al. The molecular landscape of colorectal cancer cell lines unveils clinically actionable kinase targets. Nat. Commun. 6, 7002 (2015).

39. Park, J.-G. et al. Characteristics of Cell Lines Established from Human Colorectal Carcinoma. Cancer Res. (1987).

40. Jerby-Arnon, L. et al. A cancer cell program promotes T cell exclusion and resistance to checkpoint blockade. Cell 175, 984–997.e24 (2018).

41. Darmanis, S. et al. Single-Cell RNA-Seq Analysis of Infiltrating Neoplastic Cells at the Migrating Front of Human Glioblastoma. Cell Rep. 21, 1399–1410 (2017).

42. Patel, A. P. et al. Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science 344, 1396–1401 (2014).

43. Xu, B. et al. Regulation of endometrial receptivity by the highly expressed HOXA9, HOXA11 and HOXD10 HOX-class homeobox genes. Hum. Reprod. 29, 781–790 (2014).

44. Raines, A. M. et al. Recombineering-based dissection of flanking and paralogous Hox gene functions in mouse reproductive tracts. Development 140, 2942–2952 (2013).

45. Netinatsunthorn, W., Hanprasertpong, J., Dechsukhum, C., Leetanaporn, R. & Geater, A. WT1 gene expression as a prognostic marker in advanced serous epithelial ovarian carcinoma: an immunohistochemical study. BMC Cancer 6, 90 (2006).

46. Kelly, Z. et al. The prognostic significance of specific HOX gene expression patterns in ovarian cancer. Int. J. Cancer 139, 1608–1617 (2016).


47. Cancer Genome Atlas Research Network. Integrated genomic analyses of ovarian carcinoma. Nature 474, 609–615 (2011).

48. Wiegand, K. C. et al. ARID1A mutations in endometriosis-associated ovarian carcinomas. N. Engl. J. Med. 363, 1532–1543 (2010).

49. Murray, M. J. et al. LIN28 Expression in malignant germ cell tumors downregulates let-7 and increases oncogene levels. Cancer Res. 73, 4872–4884 (2013).

50. Biton, A. et al. Independent component analysis uncovers the landscape of the bladder tumor transcriptome and reveals insights into luminal and basal subtypes. Cell Rep. 9, 1235–1245 (2014).

51. Fair, W. R., Israeli, R. S. & Heston, W. D. Prostate-specific membrane antigen. Prostate 32, 140–148 (1997).

52. Black, J. D., English, D. P., Roque, D. M. & Santin, A. D. Targeted therapy in uterine serous carcinoma: an aggressive variant of endometrial cancer. Womens Health (Lond. Engl.) 10, 45–57 (2014).

53. Yang, S., Thiel, K. W. & Leslie, K. K. Progesterone: the ultimate endometrial tumor suppressor. Trends Endocrinol. Metab. 22, 145–152 (2011).

54. Huszar, M. et al. Up-regulation of L1CAM is linked to loss of hormone receptors and E-cadherin in aggressive subtypes of endometrial carcinomas. J. Pathol. 220, 551–561 (2010).

55. Kozak, J., Wdowiak, P., Maciejewski, R. & Torres, A. A guide for endometrial cancer cell lines functional assays using the measurements of electronic impedance. Cytotechnology 70, 339–350 (2018).

56. Korch, C. et al. DNA profiling analysis of endometrial and ovarian cell lines reveals misidentification, redundancy and contamination. Gynecol. Oncol. 127, 241–248 (2012).

57. Wu, D. et al. Gene-expression data integration to squamous cell lung cancer subtypes reveals drug sensitivity. Br. J. Cancer 109, 1599–1608 (2013).

58. Walter, V. et al. Molecular subtypes in head and neck cancer exhibit distinct patterns of chromosomal gain and loss of canonical cancer genes. PLoS One 8, e56823 (2013).

59. Adeegbe, D. O. et al. BET Bromodomain Inhibition Cooperates with PD-1 Blockade to Facilitate Antitumor Response in Kras-Mutant Non-Small Cell Lung Cancer. Cancer Immunol Res 6, 1234–1245 (2018).

60. Blaisdell, A. et al. Neutrophils oppose uterine epithelial carcinogenesis via debridement of hypoxic tumor cells. Cancer Cell 28, 785–799 (2015).

61. Fitamant, J. et al. YAP inhibition restores hepatocyte differentiation in advanced HCC, leading to tumor regression. Cell Rep. 10, 1692–1707 (2015).

62. Jia, D. et al. Crebbp loss drives small cell lung cancer and increases sensitivity to HDAC inhibition. Cancer Discov. 8, 1422–1437 (2018).

63. Kress, T. R. et al. Identification of MYC-Dependent Transcriptional Programs in Oncogene-Addicted Liver Tumors. Cancer Res. 76, 3463–3472 (2016).

64. Li, L. et al. GKAP acts as a genetic modulator of NMDAR signaling to govern invasive tumor growth. Cancer Cell 33, 736–751.e5 (2018).

65. Mollaoglu, G. et al. The Lineage-Defining Transcription Factors SOX2 and NKX2-1 Determine Lung Cancer Cell Fate and Shape the Tumor Immune Microenvironment. Immunity 49, 764–779.e9 (2018).


66. Pan, Y. et al. Whole tumor RNA-sequencing and deconvolution reveal a clinically-prognostic PTEN/PI3K-regulated glioma transcriptional signature. Oncotarget 8, 52474–52487 (2017).

67. Lissanu Deribe, Y. et al. Mutations in the SWI/SNF complex induce a targetable dependence on oxidative phosphorylation in lung cancer. Nat. Med. 24, 1047–1057 (2018).

68. Xu, C. et al. Loss of Lkb1 and Pten leads to lung squamous cell carcinoma with elevated PD-L1 expression. Cancer Cell 25, 590–604 (2014).

69. NCI-Frederick, Frederick, MD. National Laboratory for Cancer Research. The NCI Patient-Derived Models Repository (PDMR). (2019). at <https://pdmr.cancer.gov/>

70. Broutier, L. et al. Human primary liver cancer-derived organoid cultures for disease modeling and drug screening. Nat. Med. 23, 1424–1435 (2017).

71. Lee, S. H. et al. Tumor Evolution and Drug Response in Patient-Derived Organoid Models of Bladder Cancer. Cell 173, 515–528.e17 (2018).

72. Ogawa, J., Pao, G. M., Shokhirev, M. N. & Verma, I. M. Glioblastoma model using human cerebral organoids. Cell Rep. 23, 1220–1229 (2018).

73. Ben-David, U. et al. Patient-derived xenografts undergo mouse-specific tumor evolution. Nat. Genet. 49, 1567–1575 (2017).

74. Stratton, M. R., Campbell, P. J. & Futreal, P. A. The cancer genome. Nature 458, 719–724 (2009).

75. Balkwill, F. R., Capasso, M. & Hagemann, T. The tumor microenvironment at a glance. J. Cell Sci. 125, 5591–5596 (2012).

76. Lancaster, M. A. & Knoblich, J. A. Organogenesis in a dish: modeling development and disease using organoid technologies. Science 345, 1247125 (2014).

77. Bregenzer, M. E. et al. Integrated cancer tissue engineering models for precision medicine. PLoS One 14, e0216564 (2019).

78. Wang, D. H. & Souza, R. F. Biology of Barrett’s esophagus and esophageal adenocarcinoma. Gastrointest Endosc Clin N Am 21, 25–38 (2011).

79. Lee, J. et al. Tumor stem cells derived from glioblastomas cultured in bFGF and EGF more closely mirror the phenotype and genotype of primary tumors than do serum-cultured cell lines. Cancer Cell 9, 391–403 (2006).

80. Wenger, S. L. et al. Comparison of established cell lines at different passages by karyotype and comparative genomic hybridization. Biosci. Rep. 24, 631–639 (2004).

81. Ben-David, U. et al. Genetic and transcriptional evolution alters cancer cell line drug response. Nature 560, 325–330 (2018).

82. Cooke, S. L. et al. Genomic analysis of genetic heterogeneity and evolution in high-grade serous ovarian carcinoma. Oncogene 29, 4905–4913 (2010).

83. Hristova, V. A. & Chan, D. W. Cancer biomarker discovery and translation: proteomics and beyond. Expert Rev Proteomics 16, 93–103 (2019).

84. Dawson, M. A. & Kouzarides, T. Cancer epigenetics: from mechanism to therapy. Cell 150, 12–27 (2012).

85. Silva, T. C. et al. TCGA Workflow: Analyze cancer genomics and epigenomics data using Bioconductor packages. [version 2; peer review: 1 approved, 2 approved with reservations]. F1000Res. 5, 1542 (2016).

86. Morgan, M., Obenchain, V., Hester, J. & Pagès, H. SummarizedExperiment: SummarizedExperiment container. (2018).


87. Pavlidis, P. & Noble, W. S. Analysis of strain and regional variation in gene expression in mouse brain. Genome Biol. 2, RESEARCH0042 (2001).

88. Geman, D., d'Avignon, C., Naiman, D. Q. & Winslow, R. L. Classifying gene expression profiles from pairwise mRNA comparisons. Stat Appl Genet Mol Biol 3, Article19 (2004).

89. Krstajic, D., Buturovic, L. J., Leahy, D. E. & Thomas, S. Cross-validation pitfalls when selecting and assessing regression and classification models. J. Cheminform. 6, 10 (2014).

90. Lipton, Z. C., Elkan, C. & Naryanaswamy, B. Optimal Thresholding of Classifiers to Maximize F1 Measure. Mach. Learn. Knowl. Discov. Databases 8725, 225–239 (2014).

91. Grow, E. J. et al. Intrinsic retroviral reactivation in human preimplantation embryos and pluripotent cells. Nature 522, 221–225 (2015).

92. Kolde, R. pheatmap: Pretty Heatmaps. (CRAN, 2019).

93. Wickham, H. ggplot2: Elegant Graphics for Data Analysis. (Springer-Verlag New York, 2016). doi:10.1007/978-0-387-98141-3

94. Gu, Z., Eils, R. & Schlesner, M. Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics 32, 2847–2849 (2016).

95. Yoshihara, K. et al. Inferring tumour purity and stromal and immune cell admixture from expression data. Nat. Commun. 4, 2612 (2013).

96. Kovalchik, S. RISmed: Download Content from NCBI Databases. (CRAN.R-project, 2017).



Figure 1 [Image not reproduced; panels A-E. Text recoverable from the schematics follows.]

Overview: RNA-seq from cancer models (cancer cell lines (CCL), patient derived xenografts (PDX), genetically engineered mouse models (GEMM), and tumoroids) is passed to CancerCellNet. The output is drawn as a heatmap of classification score (low to high), with cancer types against cancer models.

Training process: labeled RNA-seq data (genes by samples); select n genes; gene pair transform; select m gene pairs (gene pairs by samples); train Random Forest classifier; CancerCellNet.

Parameter selection: (1) set parameters n, m; (2) randomly select 2/3 of the TCGA data and run the training process; (3) assess performance on the 1/3 held-out data; (4) repeat steps (2-3) 5 times; (5) repeat steps (1-4) for each parameter set. Select the parameter set with maximum mean AUPRC, then train on all TCGA data.
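The parameter-selection steps shown in the Figure 1 schematic can be illustrated with a minimal Python sketch. It is not the CancerCellNet implementation, which is distributed as the R package linked on the title page. Two assumptions are made here: the gene pair transform is taken to be a binary comparison of the expression of two genes (the schematic does not state the exact rule), and `train_and_score` is a hypothetical callback that trains a classifier on the training split and returns its mean AUPRC on the held-out split.

```python
import random
from itertools import combinations
from statistics import mean


def gene_pair_transform(expr, genes):
    """Binarize one sample (assumed rule): feature "a_b" is 1 if gene a
    is expressed above gene b in this sample, else 0."""
    return {f"{a}_{b}": int(expr[a] > expr[b]) for a, b in combinations(genes, 2)}


def select_parameters(samples, param_sets, train_and_score, n_repeats=5, seed=0):
    """Return the (n, m) parameter set with the highest mean held-out score,
    together with the mean score of every parameter set.

    train_and_score(train, held_out, n, m) is a hypothetical callback that
    trains a classifier and returns its mean AUPRC on the held-out samples.
    """
    rng = random.Random(seed)
    mean_scores = {}
    for n, m in param_sets:                  # step 5: every parameter set
        scores = []
        for _ in range(n_repeats):           # step 4: repeat the split
            shuffled = samples[:]
            rng.shuffle(shuffled)
            cut = (2 * len(shuffled)) // 3   # step 2: random 2/3 for training
            train, held_out = shuffled[:cut], shuffled[cut:]
            scores.append(train_and_score(train, held_out, n, m))  # step 3
        mean_scores[(n, m)] = mean(scores)
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores
```

After the best `(n, m)` is chosen, the schematic indicates that the final classifier is trained on all TCGA data with those parameters; that final fit is left to the caller.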



Figure 2 [Image not reproduced; panels A-F. Legend label: CCN Score.]



Figure 3 [Image not reproduced; panels A-C. Legend label: CCN Score.]


Figure 4 [Image not reproduced; panels A-D. Annotation tracks: general classification; general CCN score (UCEC, LUSC, LUAD); sub-type classification. Sub-type labels: UCEC (Endometrioid, Serous, Normal, Unknown); LUSC (basal, classical, primitive, secretory, Unknown); LUAD (prox.-inflam, prox.-prolif, TRU, Unknown).]



Figure 5 [Image not reproduced; panels A-E. Legend label: CCN Score. Annotation tracks: general classification; general CCN score (UCEC, LUSC, LUAD); sub-type classification. Sub-type labels: UCEC (Endometrioid, Serous, Normal, Unknown); LUSC (basal, classical, primitive, secretory, Unknown); LUAD (prox.-inflam, prox.-prolif, TRU, Unknown).]



Figure 6 [Image not reproduced; panels A-E. Legend label: CCN Score. Annotation tracks: general classification; general CCN score (UCEC, LUSC, LUAD); sub-type classification; genotype. Sub-type labels: UCEC (Endometrioid, Serous, Normal, Unknown); LUSC (basal, classical, primitive, secretory, Unknown); LUAD (prox.-inflam, prox.-prolif, TRU, Unknown).]



Figure 7 [Image not reproduced; panels A-E. Legend label: CCN Score. Annotation tracks: general classification; general CCN score (UCEC, LUSC, LUAD); sub-type classification. Sub-type labels: UCEC (Endometrioid, Serous, Normal, Unknown); LUSC (basal, classical, primitive, secretory, Unknown); LUAD (prox.-inflam, prox.-prolif, TRU, Unknown).]



Figure 8 [Image not reproduced.]


Supplementary Figure 1 [Image not reproduced; panels A-E. Text recoverable from the schematic: training data (RNA-Seq TCGA, genes by samples); training process (gene pair transform, feature selection, train Random Forest classifier); CCN scores from the broad class classification are added on to the gene pairs as additional features; CancerCellNet.]



Supplementary Figure 2 [Image not reproduced; panels A-E.]



Supplementary Figure 3 [Image not reproduced; panels A-D.]



Supplementary Figure 4 [Image not reproduced.]


Supplementary Figure 5 [Image not reproduced.]