Predicting chemotherapy response using a variational autoencoder approach


i
i

“output” — 2021/1/4 — 20:53 — page 1 — #1 i
i

i
i

i
i

Bioinformatics
doi.10.1093/bioinformatics/xxxxxx

Advance Access Publication Date: Day Month Year
Original Paper

Topic Area: Biomedical Informatics

Predicting chemotherapy response using a
variational autoencoder approach
Qi Wei 1∗ and Stephen A. Ramsey 2∗

1School of EECS, Oregon State University, Corvallis, Oregon 97333, USA
2Department of Biomedical Sciences and School of EECS, Oregon State University, Corvallis, Oregon 97333, USA.

∗To whom correspondence should be addressed.

Associate Editor: XXXXXXX

Received on XXXXX; revised on XXXXX; accepted on XXXXX

Abstract

Motivation: Multiple studies have shown the utility of transcriptome-wide RNA-seq profiles as features
for machine learning-based prediction of response to chemotherapy in cancer. While tumor transcriptome
profiles are publicly available for thousands of tumors for many cancer types, a relatively modest number
of tumor profiles are clinically annotated for response to chemotherapy. The paucity of labeled examples
and high dimension of the feature data limit performance for predicting therapeutic response using
fully-supervised classification methods. Recently, multiple studies have established the utility of a deep
neural network approach, the variational autoencoder (VAE), for generating meaningful latent features
from original data. Here, we report first study of a semi-supervised approach using VAE-encoded
tumor transcriptome features and regularized gradient boosted decision trees (XGBoost) to predict
chemotherapy drug response for five cancer types: colon adenocarcinoma, pancreatic adenocarcinoma,
bladder carcinoma, sarcoma, and breast invasive carcinoma.
Results: We found: (1) VAE-encoding of the tumor transcriptome preserves the cancer type identity
of the tumor, suggesting preservation of biologically relevant information; and (2) as a feature-set for
supervised classification to predict response-to-chemotherapy, the unsupervised VAE encoding of the
tumor’s gene expression profile leads to better area under the receiver operating characteristic curve
(AUROC) classification performance than either the original gene expression profile or the PCA principal
components of the gene expression profile, in four out of five cancer types that we tested.
Availability: github.com/ATHED/VAE_for_chemotherapy_drug_response_prediction
Contact: ramseyst@oregonstate.edu
Supplementary information: Supplementary data are available at Bioinformatics online.

1 Introduction
Although chemotherapy is a mainstay of treatment for aggressive
cancers, many agents have serious side effects (Airley, 2009). Whether
or not chemotherapy will provide a net benefit to a patient depends
in large part on whether the malignancy responds to the treatment.
Chemotherapy is often administered in cycles (Skeel, 2003), leading
to multiple opportunities where treatment appropriateness may be (re-
)assessed (Chabner and Longo, 2005). Currently, the medical cost-benefit
of chemotherapy (versus a non-pharmaceutical approach) is assessed in
light of patient health status, expected therapeutic tolerance, and tumor
pathological classification (Kaestner and Sewell, 2007; Gurney, 2002).

For many cancer types, there is a broad spectrum of cases where the
decision of whether or not to undergo or continue chemotherapy is
difficult (Corrie, 2008; Whelan et al., 2003; Malfuson et al., 2008).
The development of a quantitative model that could predict—based on
a specific tumor’s molecular signature—whether or not the tumor will
respond to chemotherapy would have significant clinical utility and would
potentially improve patient quality-of-life. Moreover, an advance in
machine-learning methods for the response-to-chemotherapy prediction
problem (Chiu et al., 2019; Geeleher et al., 2014) would have potential
crossover benefits for other prediction problems in precision medicine.

Oncogenesis is driven by alterations in the somatic genome and
epigenome in cancer cells (Weir et al., 2004); however, the somatic genetic
or epigenetic determinants of response to chemotherapy are also thought

© The Author 2021. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com 1

.CC-BY 4.0 International licensereview) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a
The copyright holder for this preprint (which was not certified by peerthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425288doi: bioRxiv preprint 

https://github.com/ATHED/VAE_for_chemotherapy_drug_response_prediction
ramseyst@oregonstate.edu
Weiqi


https://doi.org/10.1101/2021.01.04.425288
http://creativecommons.org/licenses/by/4.0/


i
i

“output” — 2021/1/4 — 20:53 — page 2 — #2 i
i

i
i

i
i

to exert measurable effects on gene expression in the tumor. Consistent
with this theory, studies of various cancer types have demonstrated that
biomarkers identified from systematic measurement of the patient’s cancer
transcriptome or proteome correlate with the probability that a tumor will
respond to chemotherapy, for example, a five-protein signature in breast
cancer (Gámez-Pozo et al., 2017), 13- and 14-gene signatures in rectal
cancer (Casado et al., 2011; Del Rio et al., 2007), and a 63-gene signature
in liver cancer (Kurokawa et al., 2004). Taken together, the findings from
such “omics” biomarker studies suggest that RNA sequencing- (RNA-
seq (Wang et al., 2009))-based transcriptome measurements of tumor
samples labeled with clinical response can be used to train machine-
learning classifiers for predicting response to chemotherapy. However,
the accuracy of such models is presently limited by the small number
of available training cases that are labeled for clinical outcome, given
the large size of the transcriptome (∼60k genes Frankish et al., 2018)
and the significant intertumoral variance of gene expression. For typical
cancers, most of the profiled tumor transcriptomes are not labeled
for chemotherapeutic response; the ratio of such unlabeled to labeled
tumor datasets in the Cancer Genome Atlas (TCGA) dataset (Hutter and
Zenklusen, 2018) ranges from 10–20, depending on the cancer type.
While using (exclusively) supervised learning methods for the response-to-
chemotherapy prediction problem has been a sensible first step, unlabeled
data are a substantial resource that could—in the context of a semi-
supervised approach—reveal multivariate structure or patterns that could
ultimately improve predictive accuracy. Semi-supervised approaches that
fuse unsupervised data reduction methods (such as principal components
analysis, or PCA) for low-dimensional embedding with supervised
methods (such as decision trees) for prediction have proved beneficial
in problems where large unlabeled datasets are available, for example,
a PCA-XGBoost method has been previously used in finance (Wen and
Huang, 2020), and an independent components analysis-based method has
been used to classify electroencephalographic signals (Qin et al., 2006).

Multiple studies (An and Cho, 2015; Li and She, 2017; Bouchacourt
et al., 2017; Kipf and Welling, 2016) have established the power of the
variational autoencoder (VAE; Kingma and Welling (2013); Jimenez
Rezende et al. (2014))—an unsupervised nonlinear data embedding
model with two deep neural networks oppositely connected through a
low-dimensional probabilistic latent space—for finding meaningful and
useful latent features in high-dimensional data. In the context of cancer
bioinformatics, VAEs have been variously used to (i) model cancer gene
expression and capture biologically-relevant features using the TCGA
Pan-cancer Project RNA-seq dataset (Way and Greene, 2018); (ii) find
encodings that correlate with biological features such as patient sex and
tumor type (Titus et al., 2018); (iii) find encodings that can be used to
predict gene inactivation in cancer (Way and Greene, 2017); and (iv) obtain
an encoding that is predictive of chemotherapy resistance (George and
Lio, 2019). Based on their exploration of multiple VAE architectures
for predicting gene inactivation in a pan-cancer dataset, Way & Greene
reported (2017) biological insights obtained from the latent-space
embeddings learned by VAEs. George and Lio (2019) used a VAE-based,
fully unsupervised approach to encode ovarian tumor transcriptomes
and obtained latent-space features that were associated with response to
chemotherapy. These studies suggest that a tumor transcriptome VAE may
be broadly useful for the response-to-chemotherapy prediction problem
and they set the stage for the present multi-cancer investigation of the
utility of the tumor transcriptome VAE in precision oncology.

Given previous reports of success using a VAE to obtain useful
low-dimensional encodings of transcriptome data (Dong et al., 2020;
Way and Greene, 2018; Way and Greene, 2017), in this work, we
first sought to ascertain to what extent a VAE encoding of tumor
transcriptome data would preserve biological characteristics—spanning

multiple genes at a time that have coordinated variation across tumors—
that are associated with distinct cancer types. To answer this question,
we trained a pan-cancer transcriptome VAE and used it to encode TCGA
tumor RNA-seq data from 9,310 tumors comprising 32 different cancer
types, focusing on the top 5,000 most variable genes. We trained the
VAE using an efficient contemporary optimization engine (Adam) to find
the VAE coefficient values that together balance reconstruction loss and
desired latent-space distributional shape. We applied an unsupervised
two-dimensional embedding method (t-distributed stochastic neighbor
embedding, or t-SNE) directly to tumor transcriptome and to the VAE-
embedded tumor transcriptome data, and mapped clusters of tumors by
cancer type across the two t-SNE embeddings. We found (Sec. 2.1) that the
VAE preserves the clustering of tumors of the same cancer type, suggesting
biological fidelity in the components of the VAE embedding.

Next, to set the stage for a semi-supervised approach for predicting
cancer response to chemotherapy, we selected five cancer types (breast,
bladder, colon, pancreatic, and sarcoma) based on sufficient availability of
clinically labeled data and then defined three different VAE architectures:
VAE-1, which we used to obtain feature data for bladder, breast, and
pancreatic cancer; VAE-2, for sarcoma; and VAE-3, for colon cancer.
In order to train a VAE, it is necessary to specify a reconstruction loss
function; both L2 and L1 reconstruction loss have been used for training
VAEs in machine-learning, and we sought to clarify which is best for
this application. Thus, we trained each of the three VAE architectures
on 2,606 tumor transcriptomes from TCGA, in an unsupervised fashion,
separately using L1 loss and L2 loss. Next, in order to label tumors
for response to chemotherapy, we analyzed the available TCGA clinical
data regarding the outcome of pharmaceutical therapy (in most cases
including chemotherapy) for each of the patients, and thereby assigned
a label “responded” or “progressive” to 806 out of the 2,606 tumors
(Sec. 2.2); the remainder of the tumors were unlabeled and thus used
only during VAE training. For the 806 labeled tumors, we used the VAE-
encoded latent vectors as feature data for supervised prediction of the
binary label using gradient boosted decision trees (XGBoost; Chen and
Guestrin (2016)). Using this semi-supervised “VAE-XGBoost” approach,
we found (Sec. 2.3) that a VAE trained using L1 reconstruction loss yields
features that result in better classification performance (by area under the
receiver operating characteristic, AUROC) than a VAE trained using L2.

In the main part of this work, using XGBoost, we measured
response-to-chemotherapy prediction performance for each of three tumor
transcriptome-derived feature sets: (i) expression levels of the top 20% of
genes, by intertumoral variance (a fully supervised approach); (ii) the first
387 principal components of expression levels of “top 20%” genes (“semi-
supervised PCA-XGBoost”); and (iii) VAE-encoded expression levels of
the top 20% genes (“semi-supervised VAE-XGBoost”, our new method,
Fig. 1). Within a cross-validation framework for AUROC performance
evaluation, we found (Sec. 2.4) that for four out of five cancer types,
the semi-supervised VAE-XGBoost approach outperformed the fully-
supervised approach. Moreover, for four out of the five cancer types, semi-
supervised VAE-XGBoost outperformed semi-supervised PCA-XGBoost.
Finally, for the one cancer type for which PCA-XGBoost outperformed
VAE-XGBoost, we investigated their relative performance through the
lens of XGBoost feature importance (Sec. 2.5). Below, we describe our
results (Sec. 2) and the VAE-XGBoost method in detail (Sec. 5).

2 Results

2.1 VAE encoding preserves cancer type features

Given multiple reports (Dolezal et al., 2018; Esteva et al., 2017) that
t-SNE can be used to visualize the grouping of cancer types from high-
dimensional molecular tumor data, we investigated the extent to which

.CC-BY 4.0 International licensereview) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a
The copyright holder for this preprint (which was not certified by peerthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425288doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425288
http://creativecommons.org/licenses/by/4.0/


i
i

“output” — 2021/1/4 — 20:53 — page 3 — #3 i
i

i
i

i
i

Gene 
Expression 

Data 

(Original Input, 
)x

Reconstructed 

Gene Expression  

Data 

( Output, )x̃

Encoder 
Network 

Eq. 2 

mean vector ( )μ ̂θ

Variance vector ( )σ ̂θ Sampled latent vector ( )z

Decoder 
Network 

g ̂ϕ

Add labeled input 
( )y

Latent vector  

+ Label as  

input (z, y)
XGBoost Classifier 

Eq. 12 & Eq. 13

Probability of 
predicated 

Label 

P(ỹ | z)

Reparameterize 

Sampling 

Eq. 5 & Eq.6 

Fig. 1: Overview of the VAE-XGBoost method that we used for predicting tumor response to chemotherapy. For each tumor t, the encoder’s input vector
xt contains the levels of the top 20% of genes by intertumoral gene expression variance (Sec. 5.1). Each network has multiple fully connected
dense layers (Sec. 5.5). The encoder outputs two vectors of configurable latent variable dimension h � m (Sec. 5.5): a vector of means µ and a
vector of standard deviations σ that parameterize the multivariate normal latent-space vector Z|xt (Sec. 5.3). The sampled encoding Z|xt = zt
is passed to the decoding neural network (decoder), whose architecture is identical to (with inversion) that of the encoder network. The sampled
latent-space vector zt is passed to XGBoost for supervised classification to predict response to chemotherapy (training label y, prediction ỹ).

VAE encoding of tumor transcriptomes preserves data-space features
that determine cancer type-specific groupings. In order to do so, we
obtained (Sec. 5.1) from the TCGA data portal RNA-seq transcriptome
data for 9,310 tumors labeled for 32 different cancer types (listed in Fig. 2).
As a baseline view of transcriptome-based cancer type groupings, we
generated a two-dimensional embedding of the 9,310 tumor samples by
applying t-SNE (Sec. 5.2) to the expression levels of the top 5,000 most
variable genes, yielding 32 distinct clusters (Fig. 2A). Next, we trained
(Sec. 5.3) a VAE to encode the expression levels of the 5,000 most variable
genes in each of 9,310 tumors into 9,310 points in a 50-dimensional latent
space. An unsupervised t-SNE visualization (Fig 2B) of the VAE-encoded
tumor transcriptome data was remarkably similar in structure to the t-SNE
visualization of the 5,000-dimensional original dataset, with intercluster
distances for all pairs of clusters correlated between of the two t-SNE plots
(R = 0.49; see Fig. S1). This analysis indicated that the VAE encoding
preserves data-space features that distinguish individual cancer types.

2.2 Obtaining a labeled tumor transcriptome dataset

Having demonstrated that the VAE can efficiently encode tumor
transcriptomes while preserving features that distinguish different cancer
types, and to set the stage for implementing a semi-supervised approach
for predicting response to chemotherapy, we obtained a five-cancer-
type tumor transcriptome dataset with a significant subset of the tumors
labeled for “response to chemotherapy”, as described below. We
obtained transcriptomes of 806 tumors across five cancer types [colon
adenocarcinoma (COAD), pancreatic adenocarcinoma (PAAD), bladder
carcinoma (BLCA), sarcoma (SARC), and breast invasive carcinoma
(BRCA); see Table 1] that we selected based on availability of a sufficient
amount of labeled data in TCGA (see Sec. 5.1) and generated binary
clinical labels for them corresponding to “responded” or “progressive”
(see Sec. 5.4). Among these tumors, the class balance ratio, i.e., the ratio
of responding tumors to progressive disease tumors, ranged from a low of
0.77 for pancreatic cancer to a high of 8.61 for breast cancer.

2.3 L1 loss is better than L2 loss for this application

Having obtained 2,606 tumor transcriptomes across five cancer types with
806 of the tumors labeled for response to chemotherapy, we next sought

to determine which type of VAE reconstruction loss function—L1 loss or
L2 loss—would yield transcriptome encodings that are most amenable to
accurate XGBoost-based prediction of response to chemotherapy. On the
2,606 tumor transcriptomes, we trained two sets of cancer type-specific
VAEs (see Sec. 5.5) using L1 and L2 loss functions, respectively. We used
the L1 and L2 VAEs to encode the 806 labeled tumor transcriptomes (the
top 20% most variable genes in each cancer type, merged across the five
cancers, for a total of 13,584 genes) spanning the five cancer types, yielding
(for each cancer type) two feature matrices (one for L1 loss and one for
L2 loss) that we separately evaluated for XGBoost prediction (Sec. 5.6)
of the binary response-to-chemotherapy class label. By test-set area under
the receiver operating characteristic (AUROC; Sec. 5.7), averaged across
the five cancers, we found (Fig. 3) that the features that were generated by
the L1 VAEs led to 6.2% better (p < 10−9, Welch’s t-test) classification
performance than the features generated by the L2 VAEs, and thus, for all
subsequent analyses, we used VAEs trained with L1 loss.

2.4 Chemotherapy drug response classification result

Having selected L1 reconstruction loss for training VAEs to encode tumor
transcriptomes for predicting response-to-chemotherapy, we focused on
the key question of whether (and to what extent) a semi-supervised
approach using the VAE can outperform (in terms of predictive accuracy)
a fully supervised approach or a semi-supervised approach based on
a traditional dimensional reduction technique (principal components
analysis, PCA). In brief, our VAE-based semi-supervised approach
involves three steps: (i) training a VAE to encode clinically unlabeled
tumor transcriptomes (for the top 20% most variable genes) for a single
cancer type, into a low-dimensional space (Sec. 5.5); (ii) using that VAE to
obtain latent-space encodings for the tumor transcriptomes that are labeled
for a relevant clinical endpoint (in this work, response to chemotherapy);
and (iii) training and testing a supervised classifier (in this work, XGBoost
binary classification) using the latent-space encodings as feature data.
To address the question of whether this VAE-based, semi-supervised
(VAE-XGBoost) approach can outperform a fully supervised approach,
we compared the performance of the VAE-XGBoost method to a fully
supervised approach in which we applied XGBoost directly to the tumor
expression levels of the top 20% most variable genes (13,584 genes) as
feature data. In the same analysis, to address the question of whether

.CC-BY 4.0 International licensereview) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a
The copyright holder for this preprint (which was not certified by peerthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425288doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425288
http://creativecommons.org/licenses/by/4.0/


i
i

“output” — 2021/1/4 — 20:53 — page 4 — #4 i
i

i
i

i
i

Table 1. Table of numbers of samples with chemotherapy response record for each cancer type (n.b., the total number of labeled tumor samples exceeds the total
number of patients because some patients had multiple tumors). After each cancer type, its TCGA abbreviation is shown in parentheses.

cancer type
total number of samples
(labeled and unlabeled)

number of
labeled samples

proportion of
labeled samples

class balance ratio
(responding/progressive)

breast invasive carcinoma (BRCA) 1,217 394 0.324 8.61
colon adenocarcinomas (COAD) 512 117 0.229 1.72
bladder carcinoma (BLCA) 430 115 0.267 0.95
pancreatic adenocarcinoma (PAAD) 182 115 0.632 0.77
sarcoma (SARC) 265 65 0.245 0.82

sum 2,606 806

Table 2. Quantitative AUROC performances of XGBoost (“Raw data”), PCA-XGBoost (“PCA”), and VAE-XGBoost (“VAE”), along with pairwise comparisons.

AUROC (mean) p (Welch’s t-test) p (Wilcoxon signed-rank test)

Cancer type VAE PCA Raw data VAE versus Raw data VAE versus PCA VAE versus Raw data VAE versus PCA

BRCA 0.674 0.614 0.649 8.07 × 10−4 3.80 × 10−12 6.08 × 10−4 5.59 × 10−9

COAD 0.694 0.726 0.674 3.74 × 10−3 1.38 × 10−3 6.64 × 10−3 2.37 × 10−3

BLCA 0.630 0.593 0.626 4.26 × 10−1 1.13 × 10−4 5.05 × 10−1 1.53 × 10−4

PAAD 0.738 0.710 0.694 6.99 × 10−6 5.04 × 10−3 3.24 × 10−6 4.67 × 10−3

SARC 0.704 0.682 0.679 3.49 × 10−2 2.91 × 10−2 6.06 × 10−2 3.82 × 10−2

the VAE-XGBoost method could outperform a semi-supervised approach
based on PCA dimensional reduction, we compared the VAE-XGBoost
method to the PCA-XGBoost method. We carried out this analysis for
each of the five cancer types separately, using the set of cancer type-specific
labeled tumors (totaling 806 labeled tumors). We measured performance
using test-set AUROC in a cross-validation framework (Sec. 5.7).

For four out of five cancer types (breast, colon, pancreatic, and
sarcoma), in terms of test-set AUROC, the VAE-XGBoost approach
outperformed the fully-supervised approach of applying XGBoost directly
to the expression levels of the tumors’ top 20% most variable genes (Fig. 4),
by both Welch’s t-test and Wilcoxon’s signed-rank test (Table 2); for
BLCA, the semi-supervised VAE-XGBoost and fully-supervised models’
performances were statistically indistinguishable. Additionally, for four
out of five cancer types (bladder, breast, pancreatic, and sarcoma), the
semi-supervised VAE-XGBoost method significantly outperformed the
semi-supervised PCA-XGBoost method (Fig. 4 and Table 2). The five-
cancer average AUROC for VAE-XGBoost was 0.682, a performance
gain of 5.4% over the five-cancer average AUROC for PCA-XGBoost
(0.646) and a gain of 3.6% over the fully-supervised model’s average
(0.658). Notably, a single deep VAE architecture (VAE-1, which had a 50-
dimensional latent space and six layers in the encoder; see Sec. 5.5) yielded
latent-space encodings that outperformed semi-supervised PCA-XGBoost
for three cancer types (bladder, breast, and pancreatic).

2.5 PCA & VAE feature importance scores, for COAD

Having established that the semi-supervised VAE-XGBoost outperforms
the semi-supervised PCA-XGBoost approach for tumor transcriptome-
based prediction of response to chemotherapy for four out of five cancer
types, we sought to understand the basis for the higher performance
of PCA-XGBoost over VAE-XGBoost on the fifth cancer type, colon
adenocarcinoma (COAD). Specifically, we investigated whether the strong
performance of PCA-XGBoost on COAD is attributable to differences
in the distributions of XGBoost feature importance scores (Sec. 5.6) of
the PCA features versus VAE latent-space features. We found that the
distribution of feature importance scores (as a function of rank) was more
sharply peaked at lowest-ranked features in the VAE than in the PCA
(Fig. 5), suggesting that the performance gain with PCA reflects a broader
spectrum of informative features for that particular cancer type.

3 Discussion
As far as we are aware, this work is the first report of a broad (five-
cancer) investigation of the potential for a VAE-based, semi-supervised
approach for predicting response to chemotherapy. Across the five cancer
types that we studied, the ratio of responding tumors to progressive
disease tumors ranged from a low of 0.77 for pancreatic cancer to a
high of 8.61 for breast cancer, reflecting a broad range of resistances
to standard-of-care chemotherapy. Our results clearly demonstrate the
utility of the VAE for compressing high-dimensional data to a continuous,
low-dimensional latent space while retaining features that are essential
for distinguishing different cancer types and for predicting response to
chemotherapy. Nevertheless, three limitations of this work bear noting.

The first limitation concerns the type(s) of tumor “omics” data from
which features are derived for the predictive model. While in this work
we focused on tumor transcriptome data which can be measured with high
precision over a wide dynamic range of transcript abundances by RNA-
seq, we note that TCGA datasets of tumor somatic mutations and copy
number alteration events are also available (Hutter and Zenklusen, 2018).
Given the voluminous literature on the use of tumor somatic genomic data
for precision cancer diagnosis (Mitchel et al., 2019; Zhang et al., 2020; Lee
et al., 2019), tumor DNA datasets are fertile ground for developing a semi-
supervised, multi-omics model for predicting response to chemotherapy.

Second, we noted for decision tree-based response-to-chemotherapy
prediction, the performance of VAE-encoded transcriptome features is
somewhat sensitive to the type of normalization used for the input data
(data not shown). We explored various types of normalization for the RNA-
seq data including standardization of log counts and using FPKM data,
we ultimately chose min-max-normalized log2 total-count-normalized
counts (Sec. 5.1) for the gene expression levels to be used to derive features.
However, there are additional transcript quantification methods (Evans
et al., 2017) that could be explored in the context of finding optimal tumor
transcriptome VAE encodings for precision oncology. A similar comment
applies to the specific form of the reconstruction loss function: in our
analysis, features from the VAE trained with L1 loss clearly (across five
cancers) outperformed those from the VAE trained with L2 loss, and thus,
consistent with Way and Greene (2017), we used L1 loss for the VAE that
we used to address the main question of this work (Sec. 2.4) as well as the
pan-cancer t-SNE analysis (Sec. 2.1)

.CC-BY 4.0 International licensereview) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a
The copyright holder for this preprint (which was not certified by peerthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425288doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425288
http://creativecommons.org/licenses/by/4.0/


i
i

“output” — 2021/1/4 — 20:53 — page 5 — #5 i
i

i
i

i
i

Fig. 2: Marks represent tumor transcriptomes visualized using t-SNE, with
colors representing cancer types. (A) Original gene expression
data of the top 5,000 most variable genes. (B) VAE compressed
gene expression data. Red rectangles denote the five cancer types
selected for chemotherapy response classification (Sec. 2.4).

The third limitation relates to the VAE architecture. While it is
promising that a single deep VAE architecture (VAE-1, with a 50-
dimensional latent space and six fully-connected layers) yielded features
that outperformend PCA and the original RNA-seq feature data for
three different cancer types (bladder, breast, and pancreatic), for

0.61

0.62

0.63

0.64

0.65

0.66

0.67

0.68

0.69

0.70

L1_loss L2_loss

A
U

R
O

C

Fig. 3: Average AUROC results over five different types of cancer, by loss
type. Squares, mean values; bars, 95% confidence interval (c.i.).

colon cancer and sarcoma, it was necessary to use shallower (two-
layer) VAE architectures with bigger latent space dimensions (650
and 500, respectively). Of the five cancers studied, colon cancer and
sarcoma had the lowest proportions of labeled samples (0.229 and 0.245
respectively; see Table 1). Our findings suggest that for some cancers, a
deep, low-latent-dimension VAE architecture yields optimal features for
predicting response, while for other cancers, a shallow, medium-sized-
latent-dimension VAE architecture is more effective. More study with
larger datasets will be required in order to determine whether a single
VAE architecture could be successfully used for general-purpose tumor
transcriptome feature extraction for precision oncology.

While our results show promise for the VAE in the context of a semi-
supervised approach for response-to-chemotherapy prediction, for colon
cancer, the VAE-XGBoost method did not outperform PCA-XGBoost
(though it did outperform the fully supervised approach of XGBoost
trained on the unencoded gene expression data). One possible explanation
for the colon cancer-specific superior performance of PCA features over
VAE features for predicting response to chemotherapy may be due to the
fact that while (for COAD) feature importance for the VAE features is
sharply peaked for the first few features and falls off fairly rapidly with
feature rank, the PCA features have a much flatter distribution of relative
feature importance (Fig. 5). Follow-on studies with larger datasets will
be required to delineate under what circumstances transcriptome VAE
encodings will prove superior to linear principal components.

4 Conclusions
For four of the five cancer types that we studied, the semi-supervised
VAE-XGBoost approach significantly outperformed a semi-supervised
PCA-XGBoost approach for tumor transcriptome-based prediction of
response to chemotherapy, reaching a top AUROC of 0.738 for pancreatic
adenocarcinoma. For four out of five cancer types, the semi-supervised
VAE-XGBoost approach significantly outperformed a fully-supervised
approach consisting of XGBoost applied to the expression levels of the
top 20% most variably expressed genes. Given high-dimensional “omics”
data, the VAE is a powerful tool for obtaining a nonlinear low-dimensional
embedding; it yields features that retain biological patterns that distinguish
between different types of cancer and that enable more accurate tumor
transcriptome-based prediction of response to chemotherapy than would
be possible using the original data or their principal components.

.CC-BY 4.0 International licensereview) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a
The copyright holder for this preprint (which was not certified by peerthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425288doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425288
http://creativecommons.org/licenses/by/4.0/


i
i

“output” — 2021/1/4 — 20:53 — page 6 — #6 i
i

i
i

i
i

SARC

COAD PAAD

BLCA BRCA

Raw(13,584) PCA(387) VAE(500)

Raw(13,584) PCA(387) VAE(650) Raw(13,584) PCA(387) VAE(50)

Raw(13,584) PCA(387) VAE(50) Raw(13,584) PCA(387) VAE(50)
0.52
0.54
0.56
0.58
0.60
0.62
0.64
0.66
0.68
0.70
0.72
0.74
0.76
0.78
0.80
0.82

0.52
0.54
0.56
0.58
0.60
0.62
0.64
0.66
0.68
0.70
0.72
0.74
0.76
0.78
0.80
0.82

0.52
0.54
0.56
0.58
0.60
0.62
0.64
0.66
0.68
0.70
0.72
0.74
0.76
0.78
0.80
0.82

A
U

R
O

C

Fig. 4: Test-set performance of the three models for predicting response
to chemotherapy, across five cancer types. Group abbreviations:
“PCA(387)”, the PCA-XGBoost semi-supervised method (387:
number of principal components used as features); “Raw(13,584)“,
the fully-supervised XGBoost method (13,584: number of genes
used as features); and “VAE(n)”, the VAE-XGBoost semi-
supervised method (n: dimension of the latent feature space).
Marks correspond to individual replications of five-fold cross-
validation; solid squares denote mean; bars indicate 95% c.i; colors
denote the type of feature-set (Sec. 5.5): red, “PCA”; olive, “Raw”;
cyan, VAE-1; magenta, VAE-2; green, VAE-3.

5 Methods
We carried out all data processing and machine-learning tasks on a Dell
XPS 8700 workstation equipped with Nvidia Titan RTX GPU and running
the Ubuntu GNU/Linux operating system version 16.04. All of the analysis
code that we implemented was executed in Python version 3.5.5 except that
we used R version 3.3.3 for statistical analysis of AUROC values (Sec. 5.7),
gene-level MAD calculations (Sec. 5.1) and plotting (Sec. 5.2). We carried
out all statistical tests using the R computing environment (version 3.3.3)
and using the R software package stats version 3.4.4.

5.1 Gene expression data

From the Xena data portal (Goldman et al., 2019), we obtained
TCGA Level 3 tumor RNA-seq transcriptome data of 32 cancer types
(totaling 9, 310 tumors) and, for the response-to-chemotherapy prediction
problem, five cancer types [colon adenocarcinomas (COAD), pancreatic
adenocarcinoma (PAAD), bladder carcinoma (BLCA), sarcoma (SARC),
and breast invasive carcinoma (BRCA)] totaling 2, 606 tumors. We
selected the five cancer types based on two criteria: (i) a sufficient
number (at least 65) of paired tumor-transcriptome and clinical data

 1
 2
 3
 4
 5
 6
 7
 8
 9

10
11
12
13
14
15
16
17
18
19
20

0 200 400 600

Sum of importance

R
a

n
k 

o
f 

fe
a

tu
re

s

Group
PCA
VAE

Fig. 5: Bars indicate the sum (over 30 replications) of XGBoost
feature importance scores. “Group” indicates the low-dimensional
embedding method used (VAE or PCA). Bars separately ordered
from highest to lowest (only top 20 most important features shown).

samples available for the cancer type; and (ii) a sufficient number (at
least 180) of tumor transcriptome samples available (regardless of the
clinical data availability) for the cancer type. We obtained both the RNA-
seq (gene-level) total-read-count-normalized log2(1 +C) read counts and
normalized (fragments per kilobase of transcript per million mapped reads,
FPKM (Dillies et al., 2013)) expression data for for 60,483 human genes.
To focus the machine-learning on the portion of the tumor transcriptome
that had the most variation from tumor to tumor, we identified the
top 20% most variable genes as measured by the median absolute deviation
(MAD) across tumors, of gene expression in terms of FPKM (we used
FPKM for this purpose in order to mitigate bias due to read length
and tumor-specific depth of sequencing). For deriving feature-sets for
XGBoost prediction directly based on transcript abundances or based on
VAE- or PCA encoding, the 20% criterion applied to each of the five
cancer types yielded a set of 13,584 genes. We computed MAD using
the R package stats version 3.4.4 (R Core Team, 2013) with default
parameters. After the variance-filtering step, we used the log2(1 + C)
of total-count-normalized count values for the top-20% highest-variance
genes (that were selected as described above) to obtain (or encode) feature
values. We compared the performance—in terms of minimizing the VAE
reconstruction loss (see Sec. 5.3)—of different feature scaling methods
(no scaling, min-max normalization, and standardization (Kreyszig et al.,
2011)) and selected min-max normalization as the method that we used to
rescale gene-level count data for input into the VAE.

5.2 t-distributed stochastic neighbor embedding (t-SNE)

We computed t-SNE embedding components of the tumors using
the function sklearn.decomposition.manifold.TSNE from
the python software package scikit-learn version 0.19.1 with
parameters init = “pca′′, perplexity = 20, learning_rate = 300,
and n_iter = 400. For plotting the tumor transcriptome t-SNE
embeddings, we used the R software package ggplot2 version 3.1.1.

5.3 Variational autoencoder (VAE)

An autoencoder is a type of model that combines “encoder” and “decoder”
neural networks to learn a low-dimensional continuous data encoding from
which the input signal can be approximately reconstructed (Kramer, 1991).
A key advantage of an autoencoder is that it is unsupervised, i.e., it can

.CC-BY 4.0 International licensereview) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a
The copyright holder for this preprint (which was not certified by peerthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425288doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425288
http://creativecommons.org/licenses/by/4.0/


i
i

“output” — 2021/1/4 — 20:53 — page 7 — #7 i
i

i
i

i
i

be trained without labeled examples. Unlike classical autoencoders (e.g.,
sparse or denoising autoencoders), the variational autoencoder (VAE) is a
generative probabilistic model which maps an input vector to a latent-space
random variable (r.v.). Below, we mathematically define the VAE.

Let T denote the set of tumors for which the VAE is to be fit to the
tumor transcriptomes (with n ≡ |T|) and let m denote the number of
genes for which transcript abundances are used to represent the tumor
transcriptome. After min-max transformation of the tumor transcriptome
measurements (Sec. 5.1), each tumor’s transcriptome is represented as a
vector x ∈ [0, 1]m. Let X denote the random variable representing the
population distribution from which tumor transcriptomes are sampled, and
let X ∈ [0, 1]m×n represent the composite matrix of all sampled tumor
transcriptomes). We aim to learn a VAE that will comprise an encoder
and decoder, with the encoder consisting of mean and variance functions
µ : [0, 1]m → Rh and σ : [0, 1]m → Rh+, respectively. Together, µ and
σ map the tumor transcriptome vector xt to a h-dimensional r.v. Z|xt,

Z|xt ∼N(µ(xt), diag(σ(xt))), (1)

where diag(m) is a matrix whose diagonal elements are the elements of
the vector m. The decoder is a function g : Rh → [0, 1]m that, for an
outcome Z|xt = zt ∈ Rh, maps

g : zt 7→ g(zt) ≡ x̃t; (2)

the tilde on x̃t denotes that it is the decoded data for the tumor
transcriptome xt. A good autoencoder should have low reconstruction
error L, which is convenient to define in terms of the p-norm of the
difference between the tumor transcriptome data xt and the reconstructed
data x̃t, i.e., ||xt−x̃t||

p
p , where || ||p denotes the p-norm. However, this

definition of the reconstruction error is only deterministic in the context
of a specific outcome Z|xt = zt. Thus, it is conventional to define the
reconstruction error as an expectation value over outcomes of Z|xt,

L|(X =xt) ≡ E Z|xt=zt (||xt −g(zt)||
p
p ), (3)

where EΩ represents an expectation value over a space of outcomes Ω. It
should be noted the above representation of the reconstruction error is in
terms of the outcome, zt, of a r.v. (Z|xt) whose distributional parameter
functions µ and σ have hyperparameters (neural network coefficients) that
will be fitted. Because Eq. 3 is ill-suited to backpropagation, it is helpful
to recast it in terms of a new random variable Et that depends on Z|xt by

Et ≡ (diag(σ(xt)))−
1
2 (Zt|xt −µ(xt)). (4)

It follows from Eq. 4 and Eq. 1 that Et is standard multivariate normal,

Et ∼N(0,I), (5)

where I is the h×h identity matrix, and thus, Et does not depend on µ,
σ, or t. We therefore drop the subscript t and simply denote the rescaled
latent-space random variable as E. Solving Eq. 4 for Z|xt and applying
it to Eq. 3, the reconstruction error L|(X =xt) can be represented by

L|(X =xt) = EE
(∣∣∣∣∣∣xt−g(µ(xt) +√diag(σ(xt)) E)∣∣∣∣∣∣p

p

)
, (6)

which is amenable to backpropagation because the only r.v. in it is E, whose
distributional parameters do not depend on the neural network coefficients
that we will be varying. In practice, rather than computing the multivariate
integral over outcomes of E, L|(X = xt) is typically approximated by
averaging over a limited number J of samples from E,

L|(X =xt) '
〈(∣∣∣∣∣∣xt−g(µ(xt)+√diag(σ(xt)) �j))∣∣∣∣∣∣p

p

)〉
j
, (7)

where 〈〉j denotes average over j ∈{1, . . . ,J} and �j is sample j from
E. Following Way and Greene (2017), we used a number of samples

that is equivalent to the dimension of the transcriptome, i.e., J = m.
For the case of p = 2 (i.e., L2 norm), minimizing L|(X = xt) as
defined above is equivalent to maximizing the expectation value of the log-
likelihood log(P(g(Z) = xt | X = xt)). However, following Way
and Greene (2017) and consistent with empirical evidence (Sec. 2.3), for
our five-cancer study of the utility of a VAE-based approach for response-
to-chemotherapy prediction, as well as for the pan-cancer t-SNE analysis
(Sec. 2.1), we chose to use L1 reconstruction loss, i.e., p = 1 in Eq. 3.

The reconstruction loss measures bias error, whose minimization
must be balanced against the simultaneous goal of controlling variance
error through regularization. In the VAE, regularization requires
incentivizing (in the learning of µ, σ, and g) the latent space
distributions of Z|x to be close to standard multivariate normal. This
is accomplished by assigning a penalty based on the Kullback-Leibler
divergence between the distribution of Z|xt and the target distribution
E, represented by DKL(P(Z|xt) ||P(E)). This regularization is
analytically tractable (Duchi, 2007), and for a given tumor t yields (see
Supplementary Note, Eq. S2) the following regularization function:

DKL

(
P(Zt|xt)

∣∣∣∣ P(E)) =
||µ(xt)||22 + ||σ(xt)||

2
2 −|| log(σ(xt))||1 − 1, (8)

where log(σt) denotes an element-wise log and || ||1 is the L1 norm.
Fitting the VAE to X requires selecting µ, σ, and g from their

respective function spaces; in practice, we search over functions that can
be represented using a neural network for µ and σ (parameterized by
the vector θ)1 and a neural network for the function g (parameterized
by the vector φ). Exploring the space of functions µθ, σθ, and gφ
corresponds to computationally searching for the vector pair (θ̂,φ̂) that
together minimize the joint (over all tumors) sum of the tumor-specific
reconstruction loss and the regularization penalty,

(θ̂,φ̂) = argmin
(θ,φ)

∑
t∈T

[
L|(X = xt)+DKL

(
P(Z|xt)

∣∣∣∣P(E))]. (9)
Applying Eqs. 6, 7, and 8, and setting p = 1 as discussed above, we obtain
the explicit formula for fitting a VAE to X,

(θ̂,φ̂) = argmin
(θ,φ)

∑
t∈T

[

1

J

J∑
j=1

(∣∣∣∣∣∣xt −gφ(µθ(xt) + √diag(σθ(xt)) �j)∣∣∣∣∣∣
1

)

+ ||µθ(xt)||
2
2 + ||σθ(xt)||

2
2 −|| log(σθ(xt))||1 − 1

]
. (10)

We implemented Eq. 10 in Tensorflow version 1.4.1 with Keras version
2.1.3 as the model-level library. We solved Eq. 10 using the Adam
optimization algorithm (Kingma and Ba, 2014) (with batch normalization)
from the python package keras-gpu version 2.1.3 with parameters
learning_rate = 2 × 10−3, beta_1 = 0.9, and beta_2 = 0.999, to
obtain (θ̂,φ̂). Then, for each tumor t, we used a single sample Z|xt = zt
from the distribution N(µ

θ̂
(xt), diag(σθ̂(xt))) as the final latent-space

encoding of the tumor to be used for supervised learning (Sec. 5.6).

5.4 Labeling tumors based on response to chemotherapy

From Xena and cBioPortal (Cerami et al., 2012; Gao et al., 2013),
we obtained and combined TCGA clinical data (where available) for

1 Note, functions µ and σ are just two different outputs of the encoding
neural network, differing only at the final layer, and thus for simplicity of
notation we represent them as having a common parameter vector θ.

.CC-BY 4.0 International licensereview) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a
The copyright holder for this preprint (which was not certified by peerthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425288doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425288
http://creativecommons.org/licenses/by/4.0/


i
i

“output” — 2021/1/4 — 20:53 — page 8 — #8 i
i

i
i

i
i

the patients whose tumor transcriptomes we acquired (see Sec. 5.1).
From Xena, we extracted the variables submitter_id.samples,
therapy_type, and measure_of_response; from cBioPortal, we
extracted the variables Sample_ID, Disease.Free.Status, and
Pharmaceutical.Therapy.Indicator. We co-analyzed the Xena- and
cBioPortal-obtained clinical data to label tumors “responded” (y =
0) or ”progressive” (y = 1), by assigning y = 0 when the
clinical record had Complete response or partial response in
the measure_of_response column of the clinical data from Xena,
or with value DiseaseFree in the Disease.Free.Status column
of the clinical data from cBioPortal while therapy type is recorded
as Chemotherapy in both. We assigned y = 1 to tumors whose
clinical records had values Radiographic progressive disease,
Clinical progressive disease, or stable disease in the Xena
clinical data column measure_of_response, or had value Recurred/
progressed in the cBioPortal data column Disease.Free.Status while
the therapy_type is recorded as Chemotherapy in both files. This
yielded 806 labeled tumors out of 2,606 total. A total of 39 different drugs
were used to treat the 794 patients (see Supplementary Note, Table S1).

5.5 VAE model architectures

We trained six transcriptome-encoding VAEs based on four VAE
architectures, the pan-cancer VAE architecture (for the 32-cancer
unsupervised analysis, see Sec. 2.1) and three cancer type-specific VAE
architectures for response-to-chemotherapy prediction (Sec. 2.4) (one of
which was used for three different cancer types, BLCA, BRCA, and PAAD,
and the others of which were cancer type specific for COAD and SARC).
For the pan-cancer VAE, we used a latent space dimension h = 50 and
three fully connected layers each for the encoder and decoder. For the
cancer type-specific VAE architectures, we again used the same number
of fully-connected layers in the encoder as in the decoder (Table 3).

Table 3. VAE architectures used for predicting chemotherapy response (h, latent
space dimension; “layers”, # of layers used in the encoder/decoder).

Name Cancer types h Layers

VAE-1 BLCA, BRCA, PAAD 50 Six
VAE-2 COAD 650 Two
VAE-3 SARC 500 Two

5.6 Regularized gradient boosted decision trees (XGBoost)

For predicting whether or not (based on its transcriptome-derived feature-
set: raw, PCA, or VAE) a tumor would respond to chemotherapy, we
used XGBoost (Chen and Guestrin, 2016), an efficient implementation
of regularized gradient boosted decision trees. We used the binary
classifier function XGBClassifier from the python software package
xgboost version 0.80, with gamma=0. We tuned eight hyper-
parameters (Table 4) by exhaustive grid-search with five-fold cross-
validation, using sklearn.model_selection.GridSearchCV
from scikit-learn version 0.19.1. To obtain feature importance
scores, we used get_score with importance_type = cover.

5.7 Area Under ROC Curve (AUROC)

For computing the AUROC (i.e., sensitivity versus false positive error
rate curve), we used the function metrics.roc_auc_score from the
python software package scikit-learn version 0.19.1 with parameter
average=“weighted”. We logit-transformed AUROC values before
testing (using two-tailed Welch’s t-test and the Wilcoxon signed rank test)

For the L1 vs. L2 analysis (Fig. 2.3), we carried out 30 replications of
five-fold cross-validation; within each replication, across the five folds,
we obtained prediction scores for each tumor from the fold in which the
tumor was in the test set, enabling us to compute an overall AUROC within
each replication. For each training data set, we have done 30 replications of
five-fold cross-validation by altering the random seed used for assign split
of data during cross-validation. We have conducted the same procedure
for five different cancer types (BLCA, BRCA, COAD, PAAD, SARC) as
shown in the panel names of Figure 4.

5.8 Principal component analysis (PCA)

For PCA, we used the function decomposition.PCA (with parameters
svd_solver = “full′′) and n_components = 0.9 (90% variance,
yielding 387 components) from the python package scikit-learn
version 0.19.1. For plotting, we used matplotlib version 2.1.2.

Funding
SAR acknowledges support from the Animal Cancer Foundation.

References
Airley, R. (2009). Cancer chemotherapy. Wiley-Blackwell, NY, NY.
An, J. and Cho, S. (2015). Variational Autoencoder based Anomaly

Detection using Reconstruction Probability. Technical Report SNUDM-
TR-2015-03, Seoul National University.

Bouchacourt, D. et al. (2017). Multi-Level Variational Autoencoder:
Learning Disentangled Representations from Grouped Observations.
arXiv:1705.08841.

Casado, E. et al. (2011). A Combined Strategy of SAGE
and Quantitative PCR Provides a 13-Gene Signature that Predicts
Preoperative Chemoradiotherapy Response and Outcome in Rectal
Cancer. PLOS ONE, 17, 4145–54.

Cerami, E. et al. (2012). The cBio Cancer Genomics Portal: An
Open Platform for Exploring Multidimensional Cancer Genomics Data.
Cancer Discovery, 2, 401.

Chabner, B. A. and Longo, D. L. (2005). Cancer Chemotherapy and
Biotherapy: Principles and Practice. Lippincott Willians & Wilkins,
Philadelphia, PA, fourth edition.

Chen, T. and Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting
System. arXiv:1603.02754.

Chiu, Y.-C. et al. (2019). Predicting drug response of tumors from
integrated genomic profiles by deep neural networks. BMC Medical
Genomics, 12(1), 18.

Corrie, P. G. (2008). Cytotoxic chemotherapy: clinical aspects. Medicine,
36(1), 24–28.

Del Rio, M. et al. (2007). Gene expression signature in advanced colorectal
cancer patients select drugs and response for the use of leucovorin,
fluorouracil, and irinotecan. Journal of clinical oncology : official
journal of the American Society of Clinical Oncology, 25(7), 773–78.

Dillies, M.-A. et al. (2013). A comprehensive evaluation of normalization
methods for Illumina high-throughput RNA sequencing data analysis.
Briefings in Bioinformatics, 14(6), 671–683.

Dolezal, J. M. et al. (2018). Diagnostic and prognostic implications of
ribosomal protein transcript expression patterns in human cancers. BMC
Cancer, 18(1), 275.

Dong, H. et al. (2020). Variational Autoencoder for Anti-Cancer Drug
Response Prediction. arXiv:2008.09763.

Duchi, J. (2007). Derivations for linear algebra and optimization.
Technical report, Standford University.

.CC-BY 4.0 International licensereview) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a
The copyright holder for this preprint (which was not certified by peerthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425288doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425288
http://creativecommons.org/licenses/by/4.0/


i
i

“output” — 2021/1/4 — 20:53 — page 9 — #9 i
i

i
i

i
i

Table 4. XGBoost classification algorithm hyperparameters and hyperparameter ranges used in grid-search tuning.

Hyperparameter name Hyperparameter description Hyperparameter range

n_estimators number of trees to fit (1, 2, 3, . . ., 40)
max_depth maximum tree depth (1, 2, 3, . . ., 10)
learning_rate boosting learning rate (0.05, 0.1, 0.2, 0.4, 0.6, 0.8)
min_child_weight minimum sum of instance weight needed in a child (1, 2, 3, . . ., 10)
subsample sub-sample ratio of the training instance (0.1, 0.2, 0.3, . . ., 1.0)
colsample_bytree sub-sample ratio of columns when constructing each tree (0.1, 0.2, 0.3, . . ., 1.0)
reg_alpha coefficient of L1 regularization for the node weights (0, 1, 2, 3)
reg_lambda coefficient of L2 regularization for the node weights (1, 2, . . ., 100)

Esteva, A. et al. (2017). Dermatologist-level classification of skin cancer
with deep neural networks. Nature, 542(7639), 115–118.

Evans, C. et al. (2017). Selecting between-sample RNA-Seq normalization
methods from the perspective of their assumptions. Briefings in
Bioinformatics, 19(5), 776–792.

Frankish, A. et al. (2018). GENCODE reference annotation for the human
and mouse genomes. Nucleic Acids Research, 47, D766–D773.

Gao, J. et al. (2013). Integrative analysis of complex cancer genomics and
clinical profiles using the cBioPortal. Science Signaling, 6, 11.

Geeleher, P. et al. (2014). Clinical drug response can be predicted using
baseline gene expression levels and in vitrodrug sensitivity in cell lines.
Genome Biology, 15(3), R47.

George, T. M. and Lio, P. (2019). Unsupervised Machine Learning for
Data Encoding applied to Ovarian Cancer Transcriptomes. bioRxiv;
doi:10.1101/855593.

Goldman, M. et al. (2019). The UCSC Xena platform for public and
private cancer genomics data visualization and interpretation. bioRxiv;
doi:10.1101/326470.

Gurney, H. (2002). How to calculate the dose of chemotherapy. British
Journal of Cancer, 86, 1297–1302.

Gámez-Pozo, A. et al. (2017). Prediction of adjuvant chemotherapy
response in triple negative breast cancer with discovery and targeted
proteomics. PLOS ONE, 12, 6.

Hutter, C. and Zenklusen, J. C. (2018). The Cancer Genome Atlas:
Creating Lasting Value beyond Its Data. Cell, 173(2), 283–285.

Jimenez Rezende, D. et al. (2014). Stochastic Backpropagation and
Approximate Inference in Deep Generative Models. arXiv:1401.4082.

Kaestner, S. A. and Sewell, G. J. (2007). Chemotherapy Dosing Part I:
Scientific Basis for Current Practice and Use of Body Surface Area.
Clinical Oncology, 19, 23–37.

Kingma, D. P. and Ba, J. (2014). Adam: A Method for Stochastic
Optimization. arXiv:1412.6980.

Kingma, D. P. and Welling, M. (2013). Auto-Encoding Variational Bayes.
arXiv, page arXiv:1312.6114.

Kipf, T. N. and Welling, M. (2016). Variational Graph Auto-Encoders.
arXiv:1611.07308.

Kramer, M. A. (1991). Nonlinear principal component analysis using
autoassociative neural networks. AIChE Journal, 37(2), 233–243.

Kreyszig, E. et al. (2011). Advanced Engineering Mathematics. Wiley,
Hoboken, NJ, tenth edition.

Kurokawa, Y. et al. (2004). Molecular Prediction of Response to 5-
Fluorouracil and Interferon-α Combination Chemotherapy in Advanced
Hepatocellular Carcinoma. AACR, 10(18), 6029–38.

Lee, K. et al. (2019). CPEM: Accurate cancer type classification based
on somatic alterations using an ensemble of a random forest and a deep

neural network. Scientific Reports, 9(1), 16927.
Li, X. and She, J. (2017). Collaborative Variational Autoencoder for

Recommender Systems. In Proceedings of the 23rd ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining,
pages 305–314, New York, NY. ACM.

Malfuson, J.-V. et al. (2008). Risk factors and decision criteria for
intensive chemotherapy in older patients with acute myeloid leukemia.
Haematologica, 93(12), 1806–1813.

Mitchel, J. et al. (2019). A translational pipeline for overall survival
prediction of breast cancer patients by decision-level integration of multi-
omics data. In 2019 IEEE International Conference on Bioinformatics
and Biomedicine (BIBM), pages 1573–1580.

Qin, J. et al. (2006). ICA based semi-supervised learning algorithm for
BCI systems. In J. Rosca, D. Erdogmus, J. C. Príncipe, and S. Haykin,
editors, Independent Component Analysis and Blind Signal Separation,
pages 214–221, Berlin. Springer.

R Core Team (2013). R: A Language and Environment for Statistical
Computing. R Foundation, Vienna, Austria. ISBN 3-900051-07-0.

Skeel, R. T. (2003). Handbook of Cancer Chemotherapy. Lippincott
Williams & Wilkins, Philadelphia, PA, sixth edition.

Titus, A. J. et al. (2018). Unsupervised deep learning with variational
autoencoders applied to breast tumor genome-wide DNA methylation
data with biologic feature extraction. bioRxiv; doi:10.1101/433763.

Wang, Z. et al. (2009). RNA-Seq: a revolutionary tool for transcriptomics.
Nature Reviews Genetics, 10(1), 57–63.

Way, G. P. and Greene, C. S. (2017). Evaluating deep variational
autoencoders trained on pan-cancer gene expression. arXiv:1711.04828.

Way, G. P. and Greene, C. S. (2018). Extracting a biologically relevant
latent space from cancer transcriptomes with variational autoencoders.
Pacific Symposium on Biocomputing, 23, 80–91.

Weir, B. et al. (2004). Somatic alterations in the human cancer genome.
Cancer Cell, 6(5), 433–438.

Wen, H. and Huang, F. (2020). Personal loan fraud detection based
on hybrid supervised and unsupervised learning. In 2020 5th IEEE
International Conf. on Big Data Analytics (ICBDA), pages 339–343.

Whelan, T. et al. (2003). Helping Patients Make Informed Choices: A
Randomized Trial of a Decision Aid for Adjuvant Chemotherapy in
Lymph Node-Negative Breast Cancer. JNCI: Journal of the National
Cancer Institute, 95(8), 581–587.

Zhang, Y. et al. (2020). A novel xgboost method to identify cancer tissue-
of-origin based on copy number variations. Front Genet, 11, 1319.

.CC-BY 4.0 International licensereview) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a
The copyright holder for this preprint (which was not certified by peerthis version posted January 5, 2021. ; https://doi.org/10.1101/2021.01.04.425288doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.04.425288
http://creativecommons.org/licenses/by/4.0/

	Introduction
	Results
	VAE encoding preserves cancer type features
	Obtaining a labeled tumor transcriptome dataset
	L1 loss is better than L2 loss for this application
	Chemotherapy drug response classification result
	PCA & VAE feature importance scores, for COAD

	Discussion
	Conclusions
	Methods
	Gene expression data
	t-distributed stochastic neighbor embedding (t-SNE)
	Variational autoencoder (VAE)
	Labeling tumors based on response to chemotherapy
	VAE model architectures
	Regularized gradient boosted decision trees (XGBoost)
	Area Under ROC Curve (AUROC)
	Principal component analysis (PCA)