scAEspy: a tool for autoencoder-based analysis of single-cell RNA sequencing data


scAEspy: a tool for
autoencoder-based analysis of
single-cell RNA sequencing

data

Andrea Tangherloni[∗], E-mail : andrea.tangherloni@unibg.it

Wellcome Trust-Medical Research Council Cambridge Stem Cell Institute
CB2 0AW, Cambridge, UK

Department of Haematology, University of Cambridge
CB2 0AW, Cambridge UK

Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus
CB10 1SA, Hinxton, UK

Current address: Department of Human and Social Sciences,
University of Bergamo
24129, Bergamo, Italy

Federico Ricciuti, E-mail : f.ricciuti@campus.unimib.it

Department of Informatics, Systems and Communication,
University of Milano-Bicocca

20126, Milan, Italy

Daniela Besozzi, E-mail : daniela.besozzi@unimib.it

Department of Informatics, Systems and Communication,
University of Milano-Bicocca

20126, Milan, Italy
Bicocca Bioinformatics, Biostatistics and Bioimaging Centre (B4),

20900, Milan, Italy

Pietro Liò[†],[∗], E-mail : pl219@cam.ac.uk

Department of Computer Science and Technology, University of Cambridge
CB3 0FD, Cambridge, UK

Ana Cvejic[†],[∗], E-mail : as889@cam.ac.uk

Wellcome Trust-Medical Research Council Cambridge Stem Cell Institute
CB2 0AW, Cambridge, UK

Department of Haematology, University of Cambridge
CB2 0AW, Cambridge, UK

Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus
CB10 1SA, Hinxton, UK

[∗]Corresponding author.
[†]These authors contributed equally.

.CC-BY-NC-ND 4.0 International licenseunder a
not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available 

The copyright holder for this preprint (which wasthis version posted February 13, 2021. ; https://doi.org/10.1101/727867doi: bioRxiv preprint 

https://doi.org/10.1101/727867
http://creativecommons.org/licenses/by-nc-nd/4.0/


Tangherloni et al.

RESEARCH

scAEspy: a tool for autoencoder-based analysis of
single-cell RNA sequencing data
Andrea Tangherloni1,2,3,6*, Federico Ricciuti4, Daniela Besozzi4,7, Pietro Liò5† and Ana Cvejic1,2,3†

*Correspondence:
andrea.tangherloni@unibg.it
1Wellcome Trust-Medical
Research Council Cambridge
Stem Cell Institute, Cambridge,
UK
Full list of author information is
available at the end of the article
†Equal contributor

Abstract

Background: Single-cell RNA sequencing (scRNA-Seq) experiments are gaining
ground to study the molecular processes that drive normal development as well as
the onset of different pathologies. Finding an effective and efficient
low-dimensional representation of the data is one of the most important steps in
the downstream analysis of scRNA-Seq data, as it could provide a better
identification of known or putatively novel cell-types. Another step that still poses
a challenge is the integration of different scRNA-Seq datasets. Though standard
computational pipelines to gain knowledge from scRNA-Seq data exist, a further
improvement could be achieved by means of machine learning approaches.
Results: Autoencoders (AEs) have been effectively used to capture the
non-linearities among gene interactions of scRNA-Seq data, so that the
deployment of AE-based tools might represent the way forward in this context.
We introduce here scAEspy, a unifying tool that embodies: (1) four of the most
advanced AEs, (2) two novel AEs that we developed on purpose, (3) different
loss functions. We show that scAEspy can be coupled with various batch-effect
removal tools to integrate data by different scRNA-Seq platforms, in order to
better identify the cell-types. We benchmarked scAEspy against the most used
batch-effect removal tools, showing that our AE-based strategies outperform the
existing solutions.
Conclusions: scAEspy is a user-friendly tool that enables using the most recent
and promising AEs to analyse scRNA-Seq data by only setting up two
user-defined parameters. Thanks to its modularity, scAEspy can be easily
extended to accommodate new AEs to further improve the downstream analysis
of scRNA-Seq data. Considering the relevant results we achieved, scAEspy can be
considered as a starting point to build a more comprehensive toolkit designed to
integrate multi single-cell omics.

Keywords: Autoencoders; scRNA-Seq; Dimensionality reduction; Clustering;
Batch correction; Data integration

Background
Single-cell RNA sequencing (scRNA-Seq) was named the “Method of the Year” in
2013, and it is currently used to investigate cell-to-cell heterogeneity since it allows
to measure the transcriptome-wide gene expression at single-cell resolution, enabling
the identification of different cell-types. scRNA-Seq data are prevalent generated in
studies that aim at understanding the molecular processes driving normal develop-
ment and the onset of pathologies [1, 2]. This field of research continuously poses
new computational questions that have to be addressed [3].

.CC-BY-NC-ND 4.0 International licenseunder a
not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available 

The copyright holder for this preprint (which wasthis version posted February 13, 2021. ; https://doi.org/10.1101/727867doi: bioRxiv preprint 

mailto:andrea.tangherloni@unibg.it
https://doi.org/10.1101/727867
http://creativecommons.org/licenses/by-nc-nd/4.0/


Tangherloni et al. Page 2 of 27

One of the most important steps in scRNA-Seq analysis is the clustering of cells
into groups that correspond to known or putatively novel cell-types, by considering
the expression of common sets of signature genes. However, this step still remains a
challenging task because applying clustering approaches in high-dimensional spaces
can generate misleading results, as the distance between most pairs of points is
similar [4]. As a consequence, finding an effective and efficient low-dimensional rep-
resentation of the data is one of the most crucial steps in the downstream analysis of
scRNA-Seq data. A common workflow of downstream analysis, depicted in Figure
1, includes two dimensionality reduction steps: (1) Principal Component Analysis
(PCA) [5] for an initial reduction of the dimensions based on the Highly Vari-
able Genes (HVGs), and (2) a non-linear dimensionality reduction approach—e.g.,
t-distributed Stochastic Neighbour Embedding (t-SNE) [6] or Uniform Manifold
Approximation and Projection (UMAP) [7, 8]—on the PCA space for visualisa-
tion purposes (e.g., showing the labelled clusters) [9, 10]. In addition, when mul-
tiple scRNA-Seq datasets have to be combined for further analyses, the technical
non-negligible batch-effects that may exist among the datasets must be taken into
account [3, 9, 11–14], making the dimensionality reduction even more complicated
and fundamental. Indeed, finding a salient batch corrected and low dimensional
embedding space can help to better partition and distinguish the various cell-types.
Although commonly used approaches for dimensionality reduction achieved good

performance when applied to scRNA-Seq data [9], novel and more robust dimension-
ality reduction strategies should be used to account for the sparsity, intrinsic noise,
unexpected dropout, and burst effects [3, 15], as well as the low amounts of RNA
that are typically present in single-cells. Ding et al. showed that low-dimensional
representations of the original data learned using latent variable models preserve
both the local and global neighbour structures of the original data [16]. Autoen-
coders (AEs) showed outstanding performance in this regard due to their ability
to capture the strong non-linearities among the gene interactions existing in the
high-dimensional expression space.

Autoencoders for denoising and dimensionality reduction
Deep Count AE network (DCA) was one of the first AE-based approach proposed
to denoise scRNA-Seq datasets [17] by considering the count distribution, overdis-
persion, and sparsity of the data. DCA relies on a negative binomial noise model,
with or without zero-inflation, to capture nonlinear gene-gene dependencies. Start-
ing from the vanilla version of the Variational AE (VAE) [18], several approaches
have been proposed. Among them, single-cell Variational Inference (scVI) was the
first scalable framework that allowed for a probabilistic representation and anal-
ysis of gene expression datasets [19]. scVI was built upon Deep Neural Networks
(DNNs) and stochastic optimization to consider the information across similar cells
and genes to approximate the distributions underlying the analysed gene expression
data. This computational tool allows for coupling low-dimensional probabilistic rep-
resentation of gene expression data with the downstream analysis to consider the
measurement of uncertainty through a statistical model. Svensson et al. integrated
a Linearly Decoded VAE (LDVAE) into scVI [20], enabling the identification of re-
lationships among the cell representation coordinates and gene weights via a factor

.CC-BY-NC-ND 4.0 International licenseunder a
not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available 

The copyright holder for this preprint (which wasthis version posted February 13, 2021. ; https://doi.org/10.1101/727867doi: bioRxiv preprint 

https://doi.org/10.1101/727867
http://creativecommons.org/licenses/by-nc-nd/4.0/


Tangherloni et al. Page 3 of 27

mode. Single-cell VAE (scVAE) was introduced to directly model the raw counts
from RNA-seq data [21, 22]. More importantly, the authors proposed a Gaussian-
mixture model to better learn biologically plausible groupings of scRNA-Seq data
on the latent space.
Decomposition using Hierarchical AE (scDHA) is a hierarchical AE composed of

two modules [23]. The first module is a non-negative kernel AE able to provide a non-
negative, part-based denoised representation of the original data. During this step,
the genes and the components having an insignificant contribution to the denoised
representation of the data are removed. The second module is a stacked Bayesian
self-learning network built upon the VAE. This specific module is used to project the
denoised data into a low-dimensional space used during the downstream analysis.
scDHA outperformed PCA, t-SNE, and UMAP in terms of silhouette index [24] on
the tested datasets.
AEs coupled with disentanglement methods have been used to both improve the

data representation and obtain better separation of the biological factors of vari-
ation in gene expression data [25]. In addition, a graph AE, consisting of graph
convolutional layers, was developed to predict relationships between single-cells.
This framework can be used to identify the cell-types in the dataset under analysis
and discover the driver genes for the differentiation process. Wang et al. proposed
a deep VAE for scRNA-Seq data named VASC [26], a deep multi-layer genera-
tive model that improves the dimensionality reduction and visualisation steps in
an unsupervised manner. Thanks to its ability to model dropout events—which
can hinder various downstream analysis steps (e.g., clustering analysis, differential
expression analysis, inference of gene-to-gene relationships) by introducing a high
number of zero counts in the expression matrices—, and to find nonlinear hierarchi-
cal representations of the data, VASC obtained superior performance with respect
to four state-of-the-art dimensionality reduction and visualisation approaches [26].
Dimensionality Reduction with Adversarial VAE (DR-A) has been recently pro-

posed to fulfil the dimensionality reduction step from a data-driven point of view
[27]. Compared to the previous approaches, DR-A exploits an adversarial VAE-
based framework, which is a recent variant of generative adversarial networks. DR-
A generally obtained more accurate low-dimensional representation of scRNA-Seq
data compared to state-of-the-art approaches (e.g., PCA, scVI, t-SNE, UMAP),
leading to better clustering performance. Geddes et al. proposed an AE-based clus-
ter ensemble framework to improve the clustering step [28]. As a first step, random
subspace projections of the data are compressed onto a low-dimensional space by
exploiting an AE, obtaining different encoded spaces. Then, an ensemble cluster-
ing approach is applied across all the encoded spaces to generate a more accurate
clustering of the cells.

Autoencoders for the imputation of missing data
AutoImpute was proposed to deal with the insufficient quantities of starting RNA in
the individual cells, a problem that generally leads to significant dropout events. As
a consequence, the resulting gene expression matrices are sparse and contain a high
number of zero counts. AutoImpute is an AE-based imputation method that works
on sparse gene expression matrices, trying to learn the inherent distribution of the

.CC-BY-NC-ND 4.0 International licenseunder a
not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available 

The copyright holder for this preprint (which wasthis version posted February 13, 2021. ; https://doi.org/10.1101/727867doi: bioRxiv preprint 

https://doi.org/10.1101/727867
http://creativecommons.org/licenses/by-nc-nd/4.0/


Tangherloni et al. Page 4 of 27

input data to assign the missing values [29]. scSVA was also proposed to identify
and recover dropout events [30], which are imputed by fitting a mixed model of each
possible cell-type. In addition, it performs an efficient feature extraction step of the
high-dimensional scRNA-Seq data, obtaining a low-dimensional embedding. In the
tests showed by the authors, scSVA was able to outperform different state-of-the-art
and novel approaches (e.g., PCA, t-SNE, UMAP, VASC).
Other two methods based on nonparametric AEs were proposed to address the

imputation problem [31]. Learning with AuToEncoder (LATE) relies on an AE that
is directly trained on a gene expression matrix with parameters randomly generated,
while TRANSfer learning with LATE (TRANSLATE) takes into consideration a
reference gene expression dataset to estimate the parameters that are then used
by LATE on the new gene expression matrix. LATE and TRANSLATE were able
to obtain outstanding performance on both real and simulated data by recovering
nonlinear relationships in pairs of genes, allowing for a better identification and
separation of the cell-types.
GraphSCI combines Graph convolution network and AE to systematically in-

tegrate gene-to-gene relationships with the gene expression data. It is the first
approach that integrates gene-to-gene relationships into a deep learning frame-
work. GraphSCI is able to impute the dropout events by taking advantage of low-
dimensional representations of similar cells and gene-gene interactions [32].

Generally, in the existing AEs the input data are usually codified in a specific for-
mat, making their integration into the existing scRNA-Seq analysis toolkits (e.g.,
Scanpy [33] and Seurat [34]) a difficult task. In addition, the existing tools are im-
plemented in Keras[1], TensorFlow [35] or PyTorch [36], and all the three libraries
are thus required to run them. Finally, the currently available AEs cannot be di-
rectly exploited to obtain the latent space or to generate synthetic cells. In order
to overcome the described limitations, we developed scAEspy, which is a unifying,
user-friendly, and standalone tool that relies only on TensorFlow and allows easy
access to different AEs by setting up only two user-defined parameters. scAEspy
can be used on High-Performance Computing (HPC) infrastructures to speed-up its
execution. It can be easily run on clusters of both Central Processing Units (CPUs)
and Graphics Processing Units (GPUs). Indeed, it was designed and developed to
be executed on multi- and many-core infrastructures. In addition, scAEspy gives
access to the latent space, generated by the trained AE, which can be directly used
to show the cells in this embedded space or as a starting point for other dimension-
ality reduction approaches (e.g., t-SNE and UMAP) as well as downstream analyses
(e.g., batch-effect removal).
In this work, we show how scAEspy can be used to deal with the existing batch-

effects among samples. Indeed, the application of batch-effect removal tools into the
latent space allowed us to outperform state-of-the-art methods as well as the same
batch-effect removal tools applied on the PCA space. Finally, scAEspy implements
different loss functions, which are fundamental to deal with different sequencing
platforms.

[1]https://github.com/fchollet/keras

.CC-BY-NC-ND 4.0 International licenseunder a
not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available 

The copyright holder for this preprint (which wasthis version posted February 13, 2021. ; https://doi.org/10.1101/727867doi: bioRxiv preprint 

https://github.com/fchollet/keras
https://doi.org/10.1101/727867
http://creativecommons.org/licenses/by-nc-nd/4.0/


Tangherloni et al. Page 5 of 27

Results
We tested PCA and AEs to address the integration of different datasets. Specifically,
we used all the AEs implemented in our scAEspy tool: VAE [18], an AE only based
on the Maximum Mean Discrepancy (MMD) distance (called here MMDAE) [37],
MMDVAE, Gaussian-mixture VAE (GMVAE), and two novel Gaussian-mixture
AEs that we developed, called GMMMD and GMMMDVAE, respectively. In all the
performed tests, the constrained versions of the following loss functions were used:
Negative Binomial (NB), Poisson, zero-inflated NB (ZINB), zero-inflated Poisson
(ZIP). We used a number of Gaussian distributions equal to the number of datasets
to integrate for GMVAE, GMMMD, and GMMMDVAE. In addition, we tested the
following configurations of hidden layer and latent space to understand how the
dimension of the AEs might potentially affect the performance: (256,64), (256,32),
(256,16), (128,64), (128,32), (128,16), (64,32), and (64,16); where (H,L) repre-
sents the number of neurons composing the hidden layer (H neurons) and latent
space (L neurons).
In order to deal with the possible batch-effects, we applied the following ap-

proaches, as suggested in [9,12] and being the most used batch-effect removal tools
in the literature: Batch Balanced k-Nearest Neighbours (BBKNN) [38, 39], Har-
mony [40], ComBat [41–43], and the Seurat implementation of the Canonical Cor-
relation Analysis (CCA) [14]. Thus, we compared vanilla PCA and AEs, PCA and
AEs followed by either BBKNN or Harmony, ComBat, and CCA.
The proposed strategies were compared on three publicly available datasets,

namely: Peripheral Blood Mononuclear Cells (PBMCs), Pancreatic Islet Cells
(PICs), and Mouse Cell Atlas (MCA) by using well-known clustering metrics (i.e.,
Adjusted Rand Index, Adjusted Mutual Information Index, Fowlkes Mallows In-
dex, Homogeneity Score, and V-Measure). It is worth mentioning that generally
the cell-types are manually identified by expert biologists starting from an over or
under clustering of the data, eventually followed by different steps of sub clustering
of some clusters. Here, we evaluate how the different strategies are able to auto-
matically separate the cells by fixing the number of clusters equal to the number of
cell-types manually identified by the authors of the papers.

Datasets
Peripheral Blood Mononuclear Cells
PBMCs from eight patients with systemic lupus erythematosus were collected and
processed using the 10× Chromium Genomics platform [44]. The dataset is com-
posed of a control group (6573 cells) and an interferon-β stimulated group (7466
cells). We considered the 8 distinct cell-types identified by the authors following
a standard workflow [44]. The count matrices were downloaded from Seurat’s tu-
torial “Integrating stimulated vs. control PBMC datasets to learn cell-type specific
responses" [2].

Pancreatic islet cells
PIC datasets were generated independently using four different platforms: CEL-
Seq [45] (1004 cells), CEL-Seq2 [46] (2285 cells), Fluidigm C1 [47] (638 cells), and
[2]https://satijalab.org/seurat/v3.0/immune_alignment.html

.CC-BY-NC-ND 4.0 International licenseunder a
not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available 

The copyright holder for this preprint (which wasthis version posted February 13, 2021. ; https://doi.org/10.1101/727867doi: bioRxiv preprint 

https://satijalab.org/seurat/v3.0/immune_alignment.html
https://doi.org/10.1101/727867
http://creativecommons.org/licenses/by-nc-nd/4.0/


Tangherloni et al. Page 6 of 27

Smart-Seq2 [48] (2394 cells). For our tests, we considered the 13 different cell-types
across the datasets identified in [49] by applying PCA on the scaled integrated data
matrix. The count matrices were downloaded from Seurat’s tutorial “Integration
and Label Transfer" [3].

Mouse cell atlas
MCA is composed of two different datasets. The former was generated by Han et
al. [50] using Microwell-Seq (4239 cells) [50], while the latter by the Tabula Muris
Consortium [51] using Smart-Seq2 (2715 cells). The 11 distinct cell-types with the
highest number of cells, which were present in both datasets, have been taken into
account as in [12]. The count matrices were downloaded from the public GitHub
repository related to [12] [4].

Metrics
Adjusted Rand Index
The Rand Index (RI) is a similarity measure between the results obtained from the
application of two different clustering methods. The first clustering method is used
as ground truth (i.e., true clusters), while the second one has to be evaluated (i.e.,
predicted clusters). RI is calculated by considering all pairs of samples appearing
in the clusters, namely, it counts the pairs that are assigned either to the same or
different clusters in both the predicted and the true clusters. The Adjusted RI (ARI)
[52] is the “adjusted for chance" version of RI. Its values vary in the range [−1,1]:
a value close to 0 means a random assignment, independently of the number of
clusters, while 1 indicates that the clusters obtained with both clustering approaches
are identical. Negative values are obtained if the index is less than the expected
index.

Adjusted Mutual Information Index
The Mutual Information Index (MII) [53] represents the mutual information of two
random variables, which is a similarity measure of the mutual dependence between
the two variables. Specifically, it is used to quantify the amount of information that
can be gained by one random variable observing the other variable. MII is strictly
correlated with the entropy of a random variable, which quantifies the expected
“amount of information" that is contained in a random variable. This index is used
to measure the similarity between two labels of the same data. Similarly to ARI,
the Adjusted MII (AMII) is “adjusted for chance" and its values vary in the range
[0,1].

Fowlkes Mallows Index
The Fowlkes Mallows Index (FMI) [54] measures the similarity between the clusters
obtained by using two different clustering approaches. It is defined as the geometric
mean between precision and recall. Assuming that the first clustering approach is
the ground truth, the precision is the percentage of the results that are relevant,
while the recall refers to the percentage of total relevant results correctly assigned
by the second clustering approach. The index ranges from 0 to 1.
[3]https://satijalab.org/seurat/v3.0/integration.html
[4]https://github.com/JinmiaoChenLab/Batch-effect-removal-benchmarking

.CC-BY-NC-ND 4.0 International licenseunder a
not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available 

The copyright holder for this preprint (which wasthis version posted February 13, 2021. ; https://doi.org/10.1101/727867doi: bioRxiv preprint 

https://satijalab.org/seurat/v3.0/integration.html
https://github.com/JinmiaoChenLab/Batch-effect-removal-benchmarking
https://doi.org/10.1101/727867
http://creativecommons.org/licenses/by-nc-nd/4.0/


Tangherloni et al. Page 7 of 27

Homogeneity Score
The result of the tested clustering approach satisfies the Homogeneity Score (HS)
[55] if all of its clusters contain only cells which are members of a single cell-type.
Its values range from 0 to 1, where 1 indicates perfectly homogeneous labelling.
Notice that by switching true cluster labels with the predicted cluster labels, the
Completeness Score is obtained.

Completeness Score
The result of the tested clustering approach satisfies the Completeness Score (CS)
[55] if all the cells that are members of a given cell-type are elements of the same
cluster. Its values range from 0 to 1, where 1 indicates perfectly complete labelling.
Notice that by switching true cluster labels with the predicted cluster labels, the
HS is obtained.

V-Measure
The V-Measure (VM) [56] is the harmonic mean between HS and CS; it is equivalent
to MII when the arithmetic mean is used as aggregation function.

Integration of multiple datasets obtained with the same sequencing platforms
Nowadays, various scRNA-Seq platforms are currently available (e.g., droplet-based
and plate-based [57–66]) and their integration is often challenging due to the differ-
ences in biological sample batches as well as to the used experimental platforms. To
test whether AEs can be effectively applied to combine multiple datasets, generated
using the same platform but under different experimental conditions, we used the
PBMC datasets.
We merged the control and treated datasets by using vanilla PCA and AEs, PCA

and AEs followed by either BBKNN or Harmony, ComBat, and CCA. After the
construction of the neighbourhood graphs, we performed a clustering step by using
the Leiden algorithm [67]. Since in the original paper 8 different cell-types were
manually identified [44], we selected Leiden’s resolutions that allowed us to obtain
8 distinct clusters and calculated all the metrics described above. In what follows,
the calculated values of all metrics are given in percentages. For each metric, the
higher the value the better the result.
Our analysis showed that the CCA-based approach, proposed in the Seurat li-

brary, achieved a mean ARI equal to 73.49% (with standard deviation equal to
±1.52%), ComBat reached a mean ARI of 72.84% (±0.78%), vanilla PCA had a
mean ARI of 68.90% (±0.86%), PCA followed by BBKNN was able to obtain a
mean ARI of 83.65% (±0.81%), while followed by Harmony reached a mean ARI
of 82.83% (±1.20%), as shown in Figure 3A. Among all the tested AEs, MMDAE
followed by Harmony (using the NB loss function and 256 neurons for the hidden
layer and 32 neurons for the latent space) achieved the best results, with a mean
ARI equal to 87.18% (±0.49%). In order to assess whether any of the results ob-
tained by the best AE were different from a statistical point of view, we applied the
Mann–Whitney U test with the Bonferroni correction [68–70]. In all the compar-
isons, MMDAE followed by Harmony had a p-value lower than 0.0001, confirming
that the achieved results are statistically different compared to those achieved by
the other approaches.

.CC-BY-NC-ND 4.0 International licenseunder a
not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available 

The copyright holder for this preprint (which wasthis version posted February 13, 2021. ; https://doi.org/10.1101/727867doi: bioRxiv preprint 

https://doi.org/10.1101/727867
http://creativecommons.org/licenses/by-nc-nd/4.0/


Tangherloni et al. Page 8 of 27

Regarding the AMII, CCA had a mean value of 66.46% (±0.50%), ComBat
achieved a mean value of 70.95% (±0.82%), vanilla PCA obtained a mean value
of 68.44% (±1.00%), PCA followed by BBKNN reached a mean value of 75.22%
(±0.76%), while followed by Harmony a mean value of 74.55% (±1.19%). MM-
DAE followed by Harmony had better results, with a mean value equal to 78.61%
(±0.29%). MMDAE followed by Harmony outperformed the other strategies also in
terms of of FMS, HS, CS, and VM (see Additional file 2 and Figure 4).
We also compared the results obtained by the best AE for each of the tested di-

mension (H,L) in terms of ARI (Figure 3B). GMMMD followed by Harmony (using
the NB loss function) obtained the best results for the dimension (64,16), GMMMD
followed by Harmony (using the Poisson loss function) reached the best results for
the dimensions (128,16) and (256,64), and GMMMD followed by BBKNN (using
the NB loss function) achieved the best results for the dimension (128,32). MMDAE
followed by Harmony (using the NB loss function) was able to reach the best results
for the dimensions (64,32), (256,16), and (256,32), while MMDAE followed by Har-
mony (using the Poisson loss function) obtained the best result for the dimensions
(128,64). Notice that we used two Gaussian distributions because we merged two
different datasets.
In order to visually assess the quality of the separation of the manually annotated

cell-type and the found clusters, we plotted them in the UMAP space generated
starting from the MMDAE followed by Harmony space (Figures 3C and D). Finally,
we also plotted the two samples in the same UMAP space to visually see the quality
of the alignment between the two samples them-self (Figure 5A). This plot confirms
that the batch-effects were completely removed.
Our analysis showed that clustering the neighbourhood graph generated from AE

spaces allowed for a better identification of the existing cell-types when compared
to other approaches, thus confirming the ARI results.

Integration of multiple datasets obtained with different sequencing platforms
Combining datasets from different studies and scRNA-Seq platforms can be a pow-
erful approach to obtain complete information about the biological system under
investigation. However, when datasets generated with different platforms are com-
bined, the high variability in the gene expression matrices can obscure the existing
biological relationships. For example, the gene expression values are much higher
in data acquired with plate-based methods (i.e., up to millions) than in those ac-
quired with droplet-based methods (i.e., a few thousands). Thus, combining gene
expression data that spread across several orders of magnitude is a difficult task
that cannot be tackled by using linear approaches like PCA. To examine how well
AEs perform in resolving this task, we combined four PIC datasets acquired with
CEL-Seq [59], CEL-Seq2 [60], Fluidigm C1 [66], and Smart-Seq2 protocols [65].
We integrated the datasets by using vanilla PCA and AEs, PCA and AEs fol-

lowed by either BBKNN or Harmony, ComBat, and CCA. Since in the original
paper 13 cell-types were manually annotated for the PIC datasets [49], we clustered
the neighbourhood graphs using the Leiden algorithm considering only the resolu-
tions that allowed us to obtain 13 distinct clusters. We then calculated ARI, AMII,
FMS, HS, CS, and VM metrics. The calculated values of all metrics are given in
percentages; for each metric, the higher the value the better the result.

.CC-BY-NC-ND 4.0 International licenseunder a
not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available 

The copyright holder for this preprint (which wasthis version posted February 13, 2021. ; https://doi.org/10.1101/727867doi: bioRxiv preprint 

https://doi.org/10.1101/727867
http://creativecommons.org/licenses/by-nc-nd/4.0/


Tangherloni et al. Page 9 of 27

CCA had a very low mean ARI, i.e., 5.45% (±0.22%), ComBat obtained a mean
ARI of 76.20% (±3.06%), vanilla PCA achieved a mean ARI of 61.38% (±0.09%),
PCA followed by BBKNN reached a mean ARI of 71.49% (±0.58%), while followed
by Harmony was able to obtain a mean ARI of 94.00% (±0.36%), see Figure 6A.
GMMMD followed by Harmony (using the NB loss function and 256 neurons for
the hidden layer and 32 neurons for the latent space) outperformed the other AEs,
achieving a mean ARI equal to 94.23% (±0.12%). In all the comparisons, expect
for the one against PCA followed by Harmony, GMMMD followed by Harmony had
a p-value lower than 0.0001, confirming that the achieved results are statistically
different with respect to those obtained by the other approaches.
Similar results were achieved for the AMII metric, CCA reached a mean value

equal to 16.57% (±0.33%), ComBat obtained a mean value of 76.11% (±1.14%),
vanilla PCA reached a mean value of 71.70% (±0.11%), PCA followed by BBKNN
achieved a mean value of 77.55% (±0.65%) and PCA followed by Harmony a mean
value of 91.17% (±0.47%), while GMMMD followed by Harmony was able to reach
a mean value equal to 89.37% (±0.02%). Considering the other measures, both PCA
and GMMMD followed by Harmony obtained very similar results, outperforming
the other strategies (see Additional file 3 and Figure 7).
Considering the best AE for each of the tested dimension (H,L) in terms of

ARI (see Figure 6B), GMMMD followed by Harmony (using the NB loss function)
resulted the best choice for the dimensions (128,64) and (256,32), while it obtained
the best results for the dimension (64,32) when the Poisson loss function was used.
GMVAE followed by Harmony (using the Poisson loss function) reached the best
results for the dimensions (128,32) and (256,16). MMDAE followed by Harmony
achieved the best results for the dimensions (256,64) and (64,16), exploiting the
NB loss function and Poisson loss function, respectively. Finally, VAE followed by
Harmony obtained the best results with the Poisson function for the dimension
(128,16). Note that we exploited four Gaussian distributions because we merged
four different datasets.
The quality of the separation of the manually annotated cell-type and found

clusters can be visually evaluated in Figures 6C and D) We finally visualised the
cells (coloured by platform) using the UMAP space generated from the GMMMD
followed by Harmony space (Figure 5B) to confirm that the batch-effects among the
samples sequenced with different platforms were correctly removed. Taken together,
our analysis shows that GMMMD followed by Harmony can efficiently identify the
“shared" cell-types across the different platforms due to its ability to deal with
the high variability in the gene expression matrices. We would like to highlight
that PCA followed by Harmony was capable of achieving good results because the
original clusters were obtained by applying a similar pipeline [49].
As a final test, we combined two MCA datasets acquired with Microwell-Seq [50]

and Smart-Seq2 protocols [65]. We integrated the datasets in the same way we did
in the other two tests. We clustered the neighbourhood graphs using the Leiden
algorithm considering only the resolutions that allowed us to obtain 11 distinct
clusters because 11 distinct cell-types were manually annotated for the PIC datasets
[50]. We then calculated all metrics.

.CC-BY-NC-ND 4.0 International licenseunder a
not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available 

The copyright holder for this preprint (which wasthis version posted February 13, 2021. ; https://doi.org/10.1101/727867doi: bioRxiv preprint 

https://doi.org/10.1101/727867
http://creativecommons.org/licenses/by-nc-nd/4.0/


Tangherloni et al. Page 10 of 27

In such a case, MMDVAE followed by Harmony (using the Poisson loss function
and 256 neurons for the hidden layer and 64 neurons for the latent space) outper-
formed the other AEs as well as the other strategies, obtaining a mean ARI equal
to 79.50% (±0.02%), as shown in Figure 8A. ComBat achieved the worst mean
ARI, i.e., 54.13% (±4.22%), CCA reached a mean ARI of 57.62% (±0.75%), vanilla
PCA obtained a mean ARI of 67.29% (±11.55%), PCA followed by BBKNN had a
similar mean ARI, that is, 67.73% (±3.98%), while followed by Harmony achieved
a mean ARI of 66.08% (±0.11%). MMDVAE followed by Harmony had a p-value
lower than 0.0001 in all the tested comparisons. Considering the other metrics,
MMDVAE followed by Harmony generally obtained better results compared to the
other strategies (see Additional file 3 and Figure 9).
Comparing the best AE for each of the tested dimension (H,L) in terms of ARI,

the vanilla GMMMD with the NB loss function obtained the best results for the di-
mension (128,16), while GMMMD followed by Harmony reached the best results for
the dimensions (256,32) and (64,16), exploiting the Poisson loss function and NB
loss function, respectively. MMDAE followed by BBKNN (using the ZIP loss func-
tion) achieved the best results for the dimensions (256,16) and (128,64), exploiting
the NB loss function and Poisson loss function, respectively. MMDVAE followed by
Harmony resulted the best choice for the dimensions (128,16) and (256,64) when
coupled with the NB loss function and Poisson loss function, respectively. Finally,
VAE followed Harmony with the Poisson loss function obtained the best results for
the dimension (64,32). As for the the integration of the PBMC datasets, we used
two Gaussian distributions because we merged two different datasets.
Figures 8C and D show the UMAP generated from the MMDVAE followed by

Harmony space coloured by the manually annotated cell-type and found clusters,
respectively, while Figure 5C depicts the cells coloured by platform on the same
UMAP space, confirming that the batch-effects between the two samples were cor-
rectly removed. In this case, the achieved results show that MMDVAE followed
by Harmony was able to better identify the “shared" cell-types across the different
platforms.

Discussion
Non-linear approaches for dimensionality reduction can be effectively used to cap-
ture the non-linearities among the gene interactions that may exist in the high-
dimensional expression space of scRNA-Seq data [16]. Among the different non-
linear approaches, AEs showed outstanding performance, outperforming other ap-
proaches like UMAP and t-SNE. Several AE-based methods have been developed
so far, but their integration with the common single-cell toolkits results a difficult
task because they usually require input data codified in a specific format. In addi-
tion, three different machine learning libraries are required to use them (i.e., Keras,
TensorFlow, and PyTorch).
Here, we proposed scAEspy, a unifying and user-friendly tool that allows the user

to use the most recent and promising AEs (i.e., VAE, MMDAE, MMDVAE, and
GMVAE). We also designed and developed GMMMD and GMMMDVAE, two novel
AEs that combine MMDAE and MMDVAE with GMVAE to exploit more than one
Gaussian distribution. We introduced a learnable prior distribution in the latent

.CC-BY-NC-ND 4.0 International licenseunder a
not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available 

The copyright holder for this preprint (which wasthis version posted February 13, 2021. ; https://doi.org/10.1101/727867doi: bioRxiv preprint 

https://doi.org/10.1101/727867
http://creativecommons.org/licenses/by-nc-nd/4.0/


Tangherloni et al. Page 11 of 27

space to model the dimensionality of the subpopulations of cells composing the
data or to combine multiple samples.

We integrated AEs with both Harmony and BBKNN to remove the existing batch-
effects among different datasets. Our results showed that exploiting the latent space
to remove the existing batch-effects permits for a better identification of the cell
subpopulations. As a batch-effect removal tool, Harmony allowed for achieving bet-
ter results than BBKNN in the majority of the cases. When different droplet-based
data have to be combined, our GMMMD and the MMDAE, coupled with the con-
strained NB and Poisson loss functions, obtained the highest results compared to
all the other AEs. In order to combine and analyse multiple datasets, generated by
using different scRNA-Seq platforms, both our GMMMD and MMDVAE, mainly to-
gether with the NB and Poisson loss functions, outperformed the other strategies.
However, also GMVAE and the simple VAE obtained outstanding performance,
highlighting that the Kullback-Leibler divergence function can become fundamen-
tal to handle data spreading various orders of magnitude, especially the high values
(up to millions) introduced by plate-based methods. It is clear that using more than
one Gaussian distribution allow for obtaining a better integration of the datasets
and separation of the cell-types when more than two datasets have to be integrated,
as clearly shown by the results reached on the PIC datasets.

Considering the achieved results on the identification of the clusters, scAEspy
can be used at the basis of methods that aim to automatically identify the cell-
types composing the scRNA-Seq datasets under analysis [71]. As a matter of fact,
scAEspy coupled with BBKNN was successfully applied to integrate 15 different
foetal human samples, enabling the identification of rare blood progenitor cells [72].

Conclusions
In this study, we proposed a AE-based and user-friendly tool, named scAEspy, which
allows for using the most recent and promising AEs to analyse scRNA-Seq data.
The user can select the desired AE by only setting up two user-defined parame-
ters. Once the selected AE has been trained, it can be used to generate synthetic
cells to increase the number of data for further downstream analyses (e.g., training
classifiers). In scAEspy, the latent space is easily accessible and thus allows the
user to perform different analyses, such as the correction of possible batch-effect
in a reduced non-linear space or the inference of differentiation trajectories. In this
case, the latent space can be utilised to generate the “pseudotime” that measures
transcriptional changes that a cell undergoes during the dynamic process.

Thanks to its modularity, scAEspy can be extended to accommodate new AEs
so that the user will be always able to utilise the latest and cutting-edge AEs [73],
which can improve the downstream analysis of scRNA-Seq data. It is worth noticing
that scAEspy can be used on HPC infrastructures, both based on CPUs and GPUs,
to speed-up the computations. This is a crucial point when datasets composed of
hundreds of thousands of cells are analysed. In such cases, the required running
time drastically increases, so relying on HPC infrastructures is the best solution to
incredibly reduce the prohibitive running time.

.CC-BY-NC-ND 4.0 International licenseunder a
not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available 

The copyright holder for this preprint (which wasthis version posted February 13, 2021. ; https://doi.org/10.1101/727867doi: bioRxiv preprint 

https://doi.org/10.1101/727867
http://creativecommons.org/licenses/by-nc-nd/4.0/


Tangherloni et al. Page 12 of 27

Future improvements
As an improvement, prior biological knowledge about genes from ontologies can be
incorporated into scAEspy. Ontologies can introduce useful information into ma-
chine learning systems that are used to solve biological problems. They allow for
integrating data from different omics (e.g., genomics, transcriptomics, proteomics,
and metabolomics) as structured representations of semantic knowledge, which is
commonly used for the representation of biological concepts. This approach has
been successfully applied to predict the clinical targets from high-dimensional low-
sample data [74]. Specifically, ontology embeddings are able to capture the semantic
similarities among the genes, which can be exploited to sparsify the network con-
nections. In addition, the Gene Ontology (GO) [75] can be exploited to interpret
the extracted features from the latent spaces generated by the AEs, allowing for
bringing an explanation to the learned representations of the gene expression data.
As a possible example, g:Profiler [76] focusing on GO terms, Kyoto Encyclopedia
of Genes and Genomes (KEGG), and Reactome can be used on the learned embed-
dings to investigate the joint effects of different gene sets within specific biological
pathways. This approach can help the interpretability and explainability of the
learned embeddings of the used AEs.

Integration of multi-omics data
Since AEs showed outstanding performance in the integration of multi-omics of
cancer data [73], we plan to extend scAEspy to analyse other single-cell omics. For
instance, AEs can be applied to analyse scATAC-seq, where the identification of the
cell-types is still more difficult due to technical challenges [11, 77]. scAEspy could
be effectively applied to analyse disparate types of single-cell data from different
points of view. The latent representations of different or combined single-cell omics
can be used for further and more in-depth analyses. For instance, the application
of other machine learning techniques (e.g., deep neural networks) to the latent
representations could facilitate the identification of interesting patterns on gene
expression or methylation data, as well as relationships among genomics variants.
In that regard, scAEspy can be the starting point to build a more comprehensive
toolkit designed to integrate multi single-cell omics as an integration and extension
of the work proposed in [73].

Methods
We developed scAEspy so that it can be easily integrated into both Scanpy and Seu-
rat pipelines, as it directly works on a gene expression matrix (see Figure 2). We
integrated into a single tool the latest and most powerful AEs designed to resolve
the problems underlying scRNA-Seq data (e.g., sparsity, intrinsic noise, dropout
events [3]). Specifically, scAEspy is comprised of six AEs, based on the VAE [18]
and InfoVAE [37] architectures. The following most advanced AEs are included in
scAEspy: VAE, MMDAE, MMDVAE, GMVAE, and two novel Gaussian-mixture
AEs that we developed, called GMMMD and GMMMDVAE. GMMMD is a modifi-
cation of the MMDAE where more than one Gaussian distribution is used to model
different modes and only the MMD function is used as divergence function. GMM-
MDVAE is a combination of MMDVAE and GMVAE where both the MMD func-
tion [78] and the Kullback-Leibler divergence function [79] are used. scAEspy allows

.CC-BY-NC-ND 4.0 International licenseunder a
not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available 

The copyright holder for this preprint (which wasthis version posted February 13, 2021. ; https://doi.org/10.1101/727867doi: bioRxiv preprint 

https://doi.org/10.1101/727867
http://creativecommons.org/licenses/by-nc-nd/4.0/


Tangherloni et al. Page 13 of 27

the user to exploit these six different AEs by setting up two user-defined parameters,
α and λ, which are needed to balance the MMD and the Kullback-Leibler diver-
gence functions. We designed and developed GMMMD and GMMMDVAE starting
from InfoVAE [37] and scVAE [21]. In addition, a learnable mixture distribution
was used for the prior distribution in the latent space, and also the marginal condi-
tional distribution was defined to be a learnable mixture distribution with the same
number of components as the prior distribution. Finally, the user can also select the
following loss functions: NB, constrained NB, Poisson, constrained Poisson, ZINB,
constrained ZINB, ZIP, constrained ZIP, and Mean Square Error (MSE).

The tested batch-effect removal tools
Originally proposed to deal with batch-effects in microarray gene expression data
[41], ComBat has been successfully applied to analyse scRNA-Seq data [80]. Briefly,
given a gene expression matrix, it is firstly standardised so that all genes have
similar means and variances. Then, starting from the obtained standardised matrix,
standard distributions are fitted using a Bayesian approach to estimate the existing
batch-effects in the data. Finally, the original expression matrix is corrected using
the computed batch-effect estimators. In our tests, we used the default parameter
settings provided by the Scanpy function combat. We then applied PCA on the
space obtained by the top k (here, we set k = 1000) HVGs calculated by using
the function provided by Scanpy (v.1.4.5.1), where the top HVGs are separately
selected within each batch and merged to avoid the selection of batch-specific genes.
We calculated the first 50 components and applied the so-called “elbow method”
to select the number of components for the downstream analysis. We used the
first 18, 13, and 18 components for PBMC, PIC, and MCA datasets, respectively.
After that, we calculated the neighbourhood graph by using the default parameter
settings proposed in Scanpy. We clustered the obtained neighbourhood graphs with
the Leiden algorithm by selecting the values of the resolution parameter such that
the number of clusters was equal to the manually annotated clusters. Finally, all
the metrics for each found resolution have been calculated.
As another batch-effect removal tool, we used the CCA-based approach proposed

in the Seurat package (v.2.3.4) [14]. We applied both RunCCA and MultiCCA Seurat
functions to integrate two batches and more than two batches, respectively. Firstly,
we normalised and log-transformed the counts. Then, we calculated the top 1000
HVGs by using the function provided by Scanpy (v.1.4.5.1). We also scaled the log-
transformed data to zero mean and unit variance. In both RunCCA and MultiCCA
Seurat functions, as a first step, the CCA components (here, we exploited the first
20) are used to compute the linear combinations of the genes with the maximum
correlation between the batches. A dynamic time warping (AlignSubspace Seurat
function), which accounts for population density changes, is then used to align the
calculated vectors and obtain a single low-dimensional subspace where the batch-
effects are corrected. We calculated the neighbourhood graph, using the default
parameter settings proposed in Scanpy, starting from the aligned low-dimensional
subspace. We clustered the built neighbourhood graphs with the Leiden algorithm
as explained before. Finally, we calculated all the metrics for each found resolution.

.CC-BY-NC-ND 4.0 International licenseunder a
not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available 

The copyright holder for this preprint (which wasthis version posted February 13, 2021. ; https://doi.org/10.1101/727867doi: bioRxiv preprint 

https://doi.org/10.1101/727867
http://creativecommons.org/licenses/by-nc-nd/4.0/


Tangherloni et al. Page 14 of 27

We also applied Harmony [40] to remove the batch-effects. Starting from a reduced
space (e.g., PCA space or latent space), Harmony exploits an iterative clustering-
based procedure to remove the multiple-dataset-specific batch-effects. In each itera-
tion, the following 4 steps are applied: (i) the cells are grouped into multiple-dataset
clusters by exploiting a variant of the soft k-means clustering, which is a fast and
flexible method developed to cluster single-cell data; (ii) a centroid is calculated
for each cluster and for each specific dataset; (iii) using the calculated centroids,
a correction factor is derived for each dataset; (iv) the correction factors are then
used to correct each cell with a cell-specific factor.
As a further batch-effect removal tool, we applied BBKNN [39]. Polanski et al. [38,

39] showed that BBKNN has comparable or better performance in removing batch-
effects with respect to the CCA-based approach proposed in the Seurat package,
Scanorama [81] and mnnCorrect [82]. In addition, BBKNN is a lightweight graph
alignment method that requires minimal changes to the classical workflow. Indeed, it
computes the k-nearest neighbours in a reduced space (e.g., PCA or latent space),
where the nearest neighbours are identified in a batch-balanced manner using a
user-defined distance (in our tests, we used the Euclidean distance). The neighbour
information is transformed into connectivities to build a graph where all cells across
batches are linked together. We used both Harmony and BBKNN to correct the
PCA and AE spaces.
As a final step, we calculated the UMAP spaces starting from the built neighbour-

hood graphs and using the default parameter settings proposed in Scanpy, except for
the initialisation of the low dimensional embedding (i.e., init_pos equal to random,
and random_state equal to 10 of the umap function).

The proposed pipeline
We modified the workflow shown in Figure 1 by replacing PCA with AEs (Figure
2). We merged the gene expression matrices of E different samples (E = 2, E = 4,
and E = 2 for PBMC, PIC, and MCA datasets, respectively). We applied both
PCA and AEs on the space obtained by the top 1000 HVGs calculated by using the
latest implementation of Scanpy function.
For what concerns PCA, we firstly normalised and log-transformed the counts,

then we applied a classic standardisation, that is, the distribution of the expression
of each gene was scaled to zero mean and unit variance. We calculated the first 50
components; after that, we used the “elbow method” to select the first 12, 14, and
19 components for PBMC, PIC, and MCA datasets, respectively.
Regarding AEs, we used the original counts since AEs showed to achieve better

results when applied using the raw counts [21]. Indeed, using the counts allows for
exploiting discrete probability distributions, such as Poisson and NB distributions,
which obtained the best results in our tests. In all the tests presented here, we used a
single hidden layer. In addition, we set 100 epochs, sigmoid activation functions, and
a batch equal to 100 samples (i.e., cells). In all tests we used the Adam optimizer [83].
After that, we applied three different strategies (Figure 2): (i) we calculated the

neighbourhood graph in both PCA and AE spaces by using the default parameter
settings proposed in Scanpy. Then, we clustered the obtained neighbourhood graphs
with the Leiden algorithm as described before. Finally, we calculated all the metrics

.CC-BY-NC-ND 4.0 International licenseunder a
not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available 

The copyright holder for this preprint (which wasthis version posted February 13, 2021. ; https://doi.org/10.1101/727867doi: bioRxiv preprint 

https://doi.org/10.1101/727867
http://creativecommons.org/licenses/by-nc-nd/4.0/


Tangherloni et al. Page 15 of 27

Table 1 Setting of α, λ, and K to obtain the desired AE.

α λ K

VAE 0 1 1
MMDAE 1 1 1

MMDVAE 0 2 1
GMVAE 0 1 > 1

GMMMD 1 1 > 1
GMMMDVAE 0 2 > 1

for each found resolution. (ii) we performed a similar analysis where we firstly
corrected the PCA and AE spaces using Harmony [40] with the default parameter
settings proposed in https://github.com/slowkow/harmonypy. (iii) we performed
the same analysis described in (i) by replacing the neighbourhood graphs with those
generated using BBKNN, using the default parameter settings.

The generalised formulation of scAEspy
In this work, we used the notation proposed in [37] to extend MMDVAE with
multiple Gaussian distributions as well as to introduce a learnable prior distribution
in the latent space. The idea behind the introduction of learnable coefficients is that
they might be suitable to model the diversity among the subpopulations of cells
composing the data or to combine multiple samples or datasets.
We consider p∗(x) as the unknown probability in the input space over which

the optimisation problem is formulated, z is the latent representation of x with
|z| ≤ |x|. The encoder is identified by a function eφ : x 7→ z, while the decoder by
a function dθ : z 7→ x.
We remind that in VAEs, the input x is not mapped into a single point in the

latent space, but it is represented by a probability distribution over the latent
space. q(z) can be any possible distribution in the latent space and y ∈{1, . . . ,K}
is a categorical random variable, where K corresponds to the number of desired
Gaussian distributions. As general strict divergence function, we considered the
MMD(·) divergence function [78].
The ELBO term proposed in this work, which is the measure maximised during

the training of AEs, is:

ELBO = E[log(d(x|z,y))] (1)
− (α + λ−1)MMD(pe(z)||q(z))
− (1−α)E[KL(pe(z,y|x)||q(z,y))],

where KL(·) is the Kullback-Leibler divergence [79] between two distributions. All
the mathematical details required to derive the generalised formula shown in Equa-
tion 1 can be found in the Additional file 1.
Equation 1 allows the user to easily exploit VAE, MMDAE, MMDVAE, GMVAE,

GMMMD, and GMMMMDVAE (see Table 1).

Availability and requirements
scAEspy is written in Python programming language (v.3.6.5) and it relies on
TensorFlow (v.1.12.0), an open-source and massively used machine learning li-
brary [35]. scAEspy requires the following Python libraries: NumPy, SciKit-Learn,

.CC-BY-NC-ND 4.0 International licenseunder a
not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available 

The copyright holder for this preprint (which wasthis version posted February 13, 2021. ; https://doi.org/10.1101/727867doi: bioRxiv preprint 

https://github.com/slowkow/harmonypy
https://doi.org/10.1101/727867
http://creativecommons.org/licenses/by-nc-nd/4.0/


Tangherloni et al. Page 16 of 27

Matplotlib, and Seaborn. scAEspy’s open-source code is available on GitLab:
https://gitlab.com/cvejic-group/scaespy under the GPL-3 license.
The repository contains all the scripts, code and Jupyter Notebooks used to obtain

the results shown in the paper. In the provided Jupyter Notebooks, we show how
it is easy to integrate scAEspy and Scanpy, and how the data can be visualised
and explored by using both scAEspy and Scanpy’s functions. We also provide a
detailed description of scAEspy’s parameters so that it can be used by both novice
and expert researchers for downstream analyses.

Competing interests
The authors declare that they have no competing interests.

Author’s contributions
AT conceived the project. AT and FR developed the software. AC, DB, and PL supervised the project and
helped to interpret and present the results. AT performed all the tests and analysed the results. AT wrote the
manuscript. AC, DB, and PL edited the manuscript. All authors read and approved the final manuscript.

Acknowledgements
This research was supported by Cancer Research UK grant number C45041/A14953 (AC and AT), European
Research Council project 677501 – ZF_Blood (AC) and a core support grant from the Wellcome Trust and
MRC to the Wellcome Trust – Medical Research Council Cambridge Stem Cell Institute.
We thank Dr. Leonardo Rundo (Department of Radiology, University of Cambridge) for their critical comments.

Author details
1Wellcome Trust-Medical Research Council Cambridge Stem Cell Institute, Cambridge, UK. 2Department of
Haematology, University of Cambridge, Cambridge, UK. 3Wellcome Trust Sanger Institute, Wellcome Trust
Genome Campus, Hinxton, UK. 4Department of Informatics, Systems and Communication, University of
Milano-Bicocca, Milan, Italy. 5Department of Computer Science and Technology, University of Cambridge,
Cambridge, UK. 6Current address: Department of Human and Social Sciences, University of Bergamo,
Bergamo, Italy. 7Bicocca Bioinformatics, Biostatistics and Bioimaging Centre (B4), Milan, Italy.

References
1. Gladka, M.M., Molenaar, B., De Ruiter, H., Van Der Elst, S., Tsui, H., Versteeg, D., Lacraz, G.P.,

Huibers, M.M., Van Oudenaarden, A., Van Rooij, E.: Single-cell sequencing of the healthy and diseased
heart reveals cytoskeleton-associated protein 4 as a new modulator of fibroblasts activation. Circulation
138(2), 166–180 (2018). doi:10.1161/CIRCULATIONAHA.117.030742

2. Keren-Shaul, H., Spinrad, A., Weiner, A., Matcovitch-Natan, O., Dvir-Szternfeld, R., Ulland, T.K., David,
E., Baruch, K., Lara-Astaiso, D., Toth, B., et al.: A unique microglia type associated with restricting
development of alzheimer’s disease. Cell 169(7), 1276–1290 (2017). doi:10.1016/j.cell.2017.05.018

3. Lähnemann, D., Köster, J., Szczurek, E., McCarthy, D.J., Hicks, S.C., Robinson, M.D., Vallejos, C.A.,
Campbell, K.R., Beerenwinkel, N., Mahfouz, A., et al.: Eleven grand challenges in single-cell data science.
Genome Biol. 21(1), 1–35 (2020). doi:10.1186/s13059-020-1926-6

4. Steinbach, M., Ertöz, L., Kumar, V.: The challenges of clustering high dimensional data. In: New
Directions in Statistical Physics: Econophysics, Bioinformatics, and Pattern Recognition, pp. 273–309.
Springer, Berlin, Heidelberg (2004). doi:10.1007/978-3-662-08968-2_16

5. Wold, S., Esbensen, K., Geladi, P.: Principal component analysis. Chemom Intell Lab Syst. 2(1-3), 37–52
(1987). doi:10.1016/0169-7439(87)80084-9

6. Maaten, L.v.d., Hinton, G.: Visualizing data using t-SNE. J Mach Learn Res. 9(Nov), 2579–2605 (2008)
7. McInnes, L., Healy, J., Melville, J.: UMAP: Uniform manifold approximation and projection for dimension

reduction. arXiv preprint arXiv:1802.03426 (2018)
8. Becht, E., McInnes, L., Healy, J., Dutertre, C.-A., Kwok, I.W., Ng, L.G., Ginhoux, F., Newell, E.W.:

Dimensionality reduction for visualizing single-cell data using UMAP. Nat Biotechnol. 37(1), 38 (2019).
doi:10.1038/nbt.4314

9. Luecken, M.D., Theis, F.J.: Current best practices in single-cell rna-seq analysis: a tutorial. Mol. Syst. Biol.
15(6), 8746 (2019). doi:10.15252/msb.20188746

10. Hwang, B., Lee, J.H., Bang, D.: Single-cell RNA sequencing technologies and bioinformatics pipelines. Exp
Mol Med. 50(8), 96 (2018). doi:10.1038/s12276-018-0071-8

11. Luecken, M.D., Buttner, M., Chaichoompu, K., Danese, A., Interlandi, M., Müller, M.F., Strobl, D.C.,
Zappia, L., Dugas, M., Colomé-Tatché, M., et al.: Benchmarking atlas-level data integration in single-cell
genomics. BioRxiv (2020). doi:10.1101/2020.05.22.111161

12. Tran, H.T.N., Ang, K.S., Chevrier, M., Zhang, X., Lee, N.Y.S., Goh, M., Chen, J.: A benchmark of
batch-effect correction methods for single-cell rna sequencing data. Genome Biol. 21(1), 1–32 (2020).
doi:10.1186/s13059-019-1850-9

13. Leek, J.T., Scharpf, R.B., Bravo, H.C., Simcha, D., Langmead, B., Johnson, W.E., Geman, D., Baggerly,
K., Irizarry, R.A.: Tackling the widespread and critical impact of batch effects in high-throughput data.
Nat. Rev. Genet. 11(10), 733–739 (2010). doi:10.1038/nrg2825

.CC-BY-NC-ND 4.0 International licenseunder a
not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available 

The copyright holder for this preprint (which wasthis version posted February 13, 2021. ; https://doi.org/10.1101/727867doi: bioRxiv preprint 

https://gitlab.com/cvejic-group/scaespy
http://dx.doi.org/10.1161/CIRCULATIONAHA.117.030742
http://dx.doi.org/10.1016/j.cell.2017.05.018
http://dx.doi.org/10.1186/s13059-020-1926-6
http://dx.doi.org/10.1007/978-3-662-08968-2$_$16
http://dx.doi.org/10.1016/0169-7439(87)80084-9
http://dx.doi.org/10.1038/nbt.4314
http://dx.doi.org/10.15252/msb.20188746
http://dx.doi.org/10.1038/s12276-018-0071-8
http://dx.doi.org/10.1101/2020.05.22.111161
http://dx.doi.org/10.1186/s13059-019-1850-9
http://dx.doi.org/10.1038/nrg2825
https://doi.org/10.1101/727867
http://creativecommons.org/licenses/by-nc-nd/4.0/


Tangherloni et al. Page 17 of 27

14. Butler, A., Hoffman, P., Smibert, P., Papalexi, E., Satija, R.: Integrating single-cell transcriptomic data
across different conditions, technologies, and species. Nat Biotechnol. 36(5), 411 (2018).
doi:10.1038/nbt.4096

15. Bacher, R., Kendziorski, C.: Design and computational analysis of single-cell rna-sequencing experiments.
Genome Biol. 17(1), 63 (2016). doi:10.1186/s13059-016-0927-y

16. Ding, J., Condon, A., Shah, S.P.: Interpretable dimensionality reduction of single cell transcriptome data
with deep generative models. Nat Commun. 9(1), 2002 (2018). doi:10.1038/s41467-018-04368-5

17. Eraslan, G., Simon, L.M., Mircea, M., Mueller, N.S., Theis, F.J.: Single-cell RNA-seq denoising using a
deep count autoencoder. Nat Commun. 10(1), 390 (2019). doi:10.1038/s41467-018-07931-2

18. Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)
19. Lopez, R., Regier, J., Cole, M.B., Jordan, M.I., Yosef, N.: Deep generative modeling for single-cell

transcriptomics. Nat Methods 15(12), 1053 (2018). doi:10.1038/s41592-018-0229-2
20. Svensson, V., Gayoso, A., Yosef, N., Pachter, L.: Interpretable factor models of single-cell rna-seq via

variational autoencoders. Bioinformatics 36(11), 3418–3421 (2020). doi:10.1093/bioinformatics/btaa169
21. Grønbech, C.H., Vording, M.F., Timshel, P.N., Sønderby, C.K., Pers, T.H., Winther, O.: scVAE: Variational

auto-encoders for single-cell gene expression data. bioRxiv, 318295 (2018). doi:10.1101/318295
22. Grønbech, C.H., Vording, M.F., Timshel, P.N., Sønderby, C.K., Pers, T.H., Winther, O.: scVAE:

Variational auto-encoders for single-cell gene expression data. Bioinformatics (2020).
doi:10.1093/bioinformatics/btaa293

23. Tran, D., Nguyen, H., Tran, B., Nguyen, T.: Fast and precise single-cell data analysis using hierarchical
autoencoder. bioRxiv, 799817 (2019). doi:10.1101/799817

24. Rousseeuw, J.: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl.
Math. 20, 53–65 (1989)

25. Bica, I., Andrés-Terré, H., Cvejic, A., Liò, P.: Unsupervised generative and graph representation learning
for modelling cell differentiation. Sci. Rep. 10(1), 1–13 (2020). doi:10.1038/s41598-020-66166-8

26. Wang, D., Gu, J.: VASC: dimension reduction and visualization of single-cell RNA-seq data by deep
variational autoencoder. Genom Proteom Bioinf. 16(5), 320–331 (2018). doi:10.1016/j.gpb.2018.08.003

27. Lin, E., Mukherjee, S., Kannan, S.: A deep adversarial variational autoencoder model for dimensionality
reduction in single-cell rna sequencing analysis. BMC bioinformatics 21(1), 1–11 (2020).
doi:10.1186/s12859-020-3401-5

28. Geddes, T.A., Kim, T., Nan, L., Burchfield, J.G., Yang, J.Y., Tao, D., Yang, P.: Autoencoder-based cluster
ensembles for single-cell rna-seq data analysis. BMC bioinformatics 20(19), 660 (2019).
doi:10.1186/s12859-019-3179-5

29. Talwar, D., Mongia, A., Sengupta, D., Majumdar, A.: AutoImpute: Autoencoder based imputation of
single-cell RNA-seq data. Sci. Rep. 8(1), 16329 (2018). doi:10.1038/s41598-018-34688-x

30. Sun, S., Liu, Y., Shang, X.: Deep generative autoencoder for low-dimensional embeding extraction from
single-cell rnaseq data. In: Proceedings of the IEEE International Conference on Bioinformatics and
Biomedicine (BIBM), pp. 1365–1372 (2019). doi:10.1109/BIBM47256.2019.8983289. IEEE

31. Badsha, M.B., Li, R., Liu, B., Li, Y.I., Xian, M., Banovich, N.E., Fu, A.Q.: Imputation of single-cell gene
expression with an autoencoder neural network. Quant. Biol., 1–17 (2020).
doi:10.1007/s40484-019-0192-7

32. Rao, J., Zhou, X., Lu, Y., Zhao, H., Yang, Y.: Imputing single-cell rna-seq data by combining graph
convolution and autoencoder neural networks. bioRxiv (2020). doi:10.1101/2020.02.05.935296

33. Wolf, F.A., Angerer, P., Theis, F.J.: SCANPY: large-scale single-cell gene expression data analysis.
Genome Biol. 19(1), 15 (2018). doi:10.1186/s13059-017-1382-0

34. Satija, R., Farrell, J.A., Gennert, D., Schier, A.F., Regev, A.: Spatial reconstruction of single-cell gene
expression data. Nat Biotechnol. 33(5), 495 (2015). doi:10.1038/nbt.3192

35. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard,
M., et al.: Tensorflow: A system for large-scale machine learning. In: Proceedings of the Symposium on
Operating Systems Design and Implementation), pp. 265–283 (2016)

36. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L.,
Lerer, A.: Automatic differentiation in PyTorch. In: Proceedings of the Conference on Advances in Neural
Information Processing Systems (2017)

37. Zhao, S., Song, J., Ermon, S.: Infovae: Information maximizing variational autoencoders. arXiv preprint
arXiv:1706.02262 (2017)

38. Park, J.-E., Polański, K., Meyer, K., Teichmann, S.A.: Fast batch alignment of single cell transcriptomes
unifies multiple mouse cell atlases into an integrated landscape. bioRxiv, 397042 (2018)

39. Polański, K., Young, M.D., Miao, Z., Meyer, K.B., Teichmann, S.A., Park, J.-E.: Bbknn: fast batch
alignment of single cell transcriptomes. Bioinformatics 36(3), 964–965 (2020).
doi:10.1093/bioinformatics/btz625

40. Korsunsky, I., Millard, N., Fan, J., Slowikowski, K., Zhang, F., Wei, K., Baglaenko, Y., Brenner, M., Loh,
P.-r., Raychaudhuri, S.: Fast, sensitive and accurate integration of single-cell data with Harmony. Nat.
Methods 16, 1289–1296 (2019). doi:10.1038/s41592-019-0619-0

41. Johnson, W.E., Li, C., Rabinovic, A.: Adjusting batch effects in microarray expression data using empirical
bayes methods. Biostatistics 8(1), 118–127 (2007). doi:10.1093/biostatistics/kxj037

42. Leek, J.T., Johnson, W.E., Parker, H.S., Jaffe, A.E., Storey, J.D.: The sva package for removing batch
effects and other unwanted variation in high-throughput experiments. Bioinformatics 28(6), 882–883
(2012). doi:10.1093/bioinformatics/bts034

43. Pedersen, B.: Python implementation of ComBat. GitHub (2012)
44. Kang, H.M., Subramaniam, M., Targ, S., Nguyen, M., Maliskova, L., McCarthy, E., Wan, E., Wong, S.,

Byrnes, L., Lanata, C.M., et al.: Multiplexed droplet single-cell rna-sequencing using natural genetic
variation. Nat Biotechnol. 36(1), 89 (2018). doi:10.1038/nbt.4042

.CC-BY-NC-ND 4.0 International licenseunder a
not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available 

The copyright holder for this preprint (which wasthis version posted February 13, 2021. ; https://doi.org/10.1101/727867doi: bioRxiv preprint 

http://dx.doi.org/10.1038/nbt.4096
http://dx.doi.org/10.1186/s13059-016-0927-y
http://dx.doi.org/10.1038/s41467-018-04368-5
http://dx.doi.org/10.1038/s41467-018-07931-2
http://dx.doi.org/10.1038/s41592-018-0229-2
http://dx.doi.org/10.1093/bioinformatics/btaa169
http://dx.doi.org/10.1101/318295
http://dx.doi.org/10.1093/bioinformatics/btaa293
http://dx.doi.org/10.1101/799817
http://dx.doi.org/10.1038/s41598-020-66166-8
http://dx.doi.org/10.1016/j.gpb.2018.08.003
http://dx.doi.org/10.1186/s12859-020-3401-5
http://dx.doi.org/10.1186/s12859-019-3179-5
http://dx.doi.org/10.1038/s41598-018-34688-x
http://dx.doi.org/10.1109/BIBM47256.2019.8983289
http://dx.doi.org/10.1007/s40484-019-0192-7
http://dx.doi.org/10.1101/2020.02.05.935296
http://dx.doi.org/10.1186/s13059-017-1382-0
http://dx.doi.org/10.1038/nbt.3192
http://dx.doi.org/10.1093/bioinformatics/btz625
http://dx.doi.org/10.1038/s41592-019-0619-0
http://dx.doi.org/10.1093/biostatistics/kxj037
http://dx.doi.org/10.1093/bioinformatics/bts034
http://dx.doi.org/10.1038/nbt.4042
https://doi.org/10.1101/727867
http://creativecommons.org/licenses/by-nc-nd/4.0/


Tangherloni et al. Page 18 of 27

45. Grün, D., Muraro, M.J., Boisset, J.-C., Wiebrands, K., Lyubimova, A., Dharmadhikari, G., van den Born,
M., van Es, J., Jansen, E., Clevers, H., et al.: De novo prediction of stem cell identity using single-cell
transcriptome data. Cell stem cell 19(2), 266–277 (2016). doi:10.1016/j.stem.2016.05.010

46. Muraro, M.J., Dharmadhikari, G., Grün, D., Groen, N., Dielen, T., Jansen, E., van Gurp, L., Engelse,
M.A., Carlotti, F., de Koning, E.J., et al.: A single-cell transcriptome atlas of the human pancreas. Cell
Syst. 3(4), 385–394 (2016). doi:10.1016/j.cels.2016.09.002

47. Lawlor, N., George, J., Bolisetty, M., Kursawe, R., Sun, L., Sivakamasundari, V., Kycia, I., Robson, P.,
Stitzel, M.L.: Single-cell transcriptomes identify human islet cell signatures and reveal cell-type–specific
expression changes in type 2 diabetes. Genome Res. 27(2), 208–222 (2017). doi:10.1101/gr.212720.116

48. Segerstolpe, Å., Palasantza, A., Eliasson, P., Andersson, E.-M., Andréasson, A.-C., Sun, X., Picelli, S.,
Sabirsh, A., Clausen, M., Bjursell, M.K., et al.: Single-cell transcriptome profiling of human pancreatic
islets in health and type 2 diabetes. Cell Metab. 24(4), 593–607 (2016). doi:10.1016/j.cmet.2016.08.020

49. Stuart, T., Butler, A., Hoffman, P., Hafemeister, C., Papalexi, E., Mauck III, W.M., Hao, Y., Stoeckius,
M., Smibert, P., Satija, R.: Comprehensive integration of single-cell data. Cell (2019).
doi:10.1016/j.cell.2019.05.031

50. Han, X., Wang, R., Zhou, Y., Fei, L., Sun, H., Lai, S., Saadatpour, A., Zhou, Z., Chen, H., Ye, F., et al.:
Mapping the mouse cell atlas by microwell-seq. Cell 172(5), 1091–1107 (2018).
doi:10.1016/j.cell.2018.02.001

51. Consortium, T.M., et al.: Single-cell transcriptomics of 20 mouse organs creates a tabula muris. Nature
562, 367–372 (2018). doi:10.1038/s41586-018-0590-4

52. Hubert, L., Arabie, P.: Comparing partitions. J Classif. 2(1), 193–218 (1985). doi:10.1007/BF01908075
53. Strehl, A., Ghosh, J.: Cluster ensembles—a knowledge reuse framework for combining multiple partitions. J

Mach Learn Res. 3(Dec), 583–617 (2002)
54. Fowlkes, E.B., Mallows, C.L.: A method for comparing two hierarchical clusterings. J Am Stat Assoc.

78(383), 553–569 (1983)
55. Vinh, N.X., Epps, J., Bailey, J.: Information theoretic measures for clusterings comparison: Variants,

properties, normalization and correction for chance. J Mach Learn Res. 11(Oct), 2837–2854 (2010)
56. Rosenberg, A., Hirschberg, J.: V-measure: A conditional entropy-based external cluster evaluation

measure. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing and
Computational Natural Language Learning, pp. 410–420 (2007)

57. Macosko, E.Z., Basu, A., Satija, R., Nemesh, J., Shekhar, K., Goldman, M., Tirosh, I., Bialas, A.R.,
Kamitaki, N., Martersteck, E.M., et al.: Highly parallel genome-wide expression profiling of individual cells
using nanoliter droplets. Cell 161(5), 1202–1214 (2015). doi:10.1016/j.cell.2015.05.002

58. Klein, A.M., Mazutis, L., Akartuna, I., Tallapragada, N., Veres, A., Li, V., Peshkin, L., Weitz, D.A.,
Kirschner, M.W.: Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell
161(5), 1187–1201 (2015). doi:10.1016/j.cell.2015.04.044

59. Hashimshony, T., Wagner, F., Sher, N., Yanai, I.: Cel-seq: single-cell rna-seq by multiplexed linear
amplification. Cell Rep. 2(3), 666–673 (2012). doi:10.1016/j.celrep.2012.08.003

60. Hashimshony, T., Senderovich, N., Avital, G., Klochendler, A., de Leeuw, Y., Anavy, L., Gennert, D., Li,
S., Livak, K.J., Rozenblatt-Rosen, O., et al.: Cel-seq2: sensitive highly-multiplexed single-cell rna-seq.
Genome Biol. 17(1), 77 (2016). doi:10.1186/s13059-016-0938-8

61. Zheng, G.X., Terry, J.M., Belgrader, P., Ryvkin, P., Bent, Z.W., Wilson, R., Ziraldo, S.B., Wheeler, T.D.,
McDermott, G.P., Zhu, J., et al.: Massively parallel digital transcriptional profiling of single cells. Nat
Commun. 8, 14049 (2017). doi:10.1038/ncomms14049

62. Gierahn, T.M., Wadsworth II, M.H., Hughes, T.K., Bryson, B.D., Butler, A., Satija, R., Fortune, S., Love,
J.C., Shalek, A.K.: Seq-well: portable, low-cost rna sequencing of single cells at high throughput. Nat
Methods 14(4), 395 (2017). doi:10.1038/nmeth.4179

63. Islam, S., Kjällquist, U., Moliner, A., Zajac, P., Fan, J.-B., Lönnerberg, P., Linnarsson, S.: Characterization
of the single-cell transcriptional landscape by highly multiplex rna-seq. Genome Res. 21(7), 1160–1167
(2011). doi:10.1101/gr.110882.110

64. Ramsköld, D., Luo, S., Wang, Y.-C., Li, R., Deng, Q., Faridani, O.R., Daniels, G.A., Khrebtukova, I.,
Loring, J.F., Laurent, L.C., et al.: Full-length mrna-seq from single-cell levels of rna and individual
circulating tumor cells. Nat Biotechnol. 30(8), 777 (2012). doi:10.1038/nbt.2282

65. Picelli, S., Faridani, O.R., Björklund, Å.K., Winberg, G., Sagasser, S., Sandberg, R.: Full-length rna-seq
from single cells using smart-seq2. Nat Protoc. 9(1), 171 (2014). doi:10.1038/nprot.2014.006

66. Jaitin, D.A., Kenigsberg, E., Keren-Shaul, H., Elefant, N., Paul, F., Zaretsky, I., Mildner, A., Cohen, N.,
Jung, S., Tanay, A., et al.: Massively parallel single-cell rna-seq for marker-free decomposition of tissues
into cell types. Science 343(6172), 776–779 (2014). doi:10.1126/science.1247651

67. Traag, V.A., Waltman, L., van Eck, N.J.: From Louvain to Leiden: guaranteeing well-connected
communities. Sci Rep. 9 (2019). doi:10.1038/s41598-019-41695-z

68. Mann, H.B., Whitney, D.R.: On a test of whether one of two random variables is stochastically larger than
the other. Ann. of Math. Stat., 50–60 (1947)

69. Wilcoxon, F.: Individual comparisons by ranking methods. In: Breakthroughs in Statistics, pp. 196–202.
Springer, New York, NY (1992). doi:10.1007/978-1-4612-4380-9_16

70. Dunn, O.J.: Multiple comparisons among means. J. Am. Stat. Assoc. 56(293), 52–64 (1961)
71. Ma, F., Pellegrini, M.: ACTINN: Automated identification of cell types in single cell RNA sequencing.

Bioinformatics (2019). doi:10.1093/bioinformatics/btz592
72. Ranzoni, A.M., Tangherloni, A., Berest, I., Riva, S.G., Myers, B., Strzelecka, P.M., Xu, J., Panada, E.,

Mohorianu, I., Zaugg, J.B., et al.: Integrative single-cell rna-seq and atac-seq analysis of human foetal liver
and bone marrow haematopoiesis. BioRxiv (2020). doi:10.1101/2020.05.06.080259

73. Simidjievski, N., Bodnar, C., Tariq, I., Scherer, P., Andres Terre, H., Shams, Z., Jamnik, M., Liò, P.:
Variational autoencoders for cancer data integration: design principles and computational practice. Front.

.CC-BY-NC-ND 4.0 International licenseunder a
not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available 

The copyright holder for this preprint (which wasthis version posted February 13, 2021. ; https://doi.org/10.1101/727867doi: bioRxiv preprint 

http://dx.doi.org/10.1016/j.stem.2016.05.010
http://dx.doi.org/10.1016/j.cels.2016.09.002
http://dx.doi.org/10.1101/gr.212720.116
http://dx.doi.org/10.1016/j.cmet.2016.08.020
http://dx.doi.org/10.1016/j.cell.2019.05.031
http://dx.doi.org/10.1016/j.cell.2018.02.001
http://dx.doi.org/10.1038/s41586-018-0590-4
http://dx.doi.org/10.1007/BF01908075
http://dx.doi.org/10.1016/j.cell.2015.05.002
http://dx.doi.org/10.1016/j.cell.2015.04.044
http://dx.doi.org/10.1016/j.celrep.2012.08.003
http://dx.doi.org/10.1186/s13059-016-0938-8
http://dx.doi.org/10.1038/ncomms14049
http://dx.doi.org/10.1038/nmeth.4179
http://dx.doi.org/10.1101/gr.110882.110
http://dx.doi.org/10.1038/nbt.2282
http://dx.doi.org/10.1038/nprot.2014.006
http://dx.doi.org/10.1126/science.1247651
http://dx.doi.org/10.1038/s41598-019-41695-z
http://dx.doi.org/10.1007/978-1-4612-4380-9$_$16
http://dx.doi.org/10.1093/bioinformatics/btz592
http://dx.doi.org/10.1101/2020.05.06.080259
https://doi.org/10.1101/727867
http://creativecommons.org/licenses/by-nc-nd/4.0/


Tangherloni et al. Page 19 of 27

Genet. 10, 1205 (2019). doi:10.3389/fgene.2019.01205
74. Trębacz, M., Shams, Z., Jamnik, M., Scherer, P., Simidjievski, N., Terre, H.A., Liò, P.: Using ontology

embeddings for structural inductive bias in gene expression data analysis. arXiv preprint arXiv:2011.10998
(2020)

75. Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K.,
Dwight, S.S., Eppig, J.T., et al.: Gene ontology: tool for the unification of biology. Nat. Genet. 25(1),
25–29 (2000)

76. Raudvere, U., Kolberg, L., Kuzmin, I., Arak, T., Adler, P., Peterson, H., Vilo, J.: g:Profiler: a web server
for functional enrichment analysis and conversions of gene lists (2019 update). Nucl. Acids Res. 47(W1),
191–198 (2019). doi:10.1093/nar/gkz369

77. Chen, X., Miragaia, R.J., Natarajan, K.N., Teichmann, S.A.: A rapid and robust method for single cell
chromatin accessibility profiling. Nat Commun. 9(1), 5345 (2018). doi:10.1038/s41467-018-07771-0

78. Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.J.: A kernel method for the
two-sample-problem. In: Proceedings of the Conference on Advances in Neural Information Processing
Systems, pp. 513–520 (2007)

79. Kullback, S., Leibler, R.A.: On information and sufficiency. Ann Math Statist. 22(1), 79–86 (1951).
doi:10.1214/aoms/1177729694

80. Risso, D., Perraudeau, F., Gribkova, S., Dudoit, S., Vert, J.-P.: A general and flexible method for signal
extraction from single-cell rna-seq data. Nat. Commun. 9(1), 1–17 (2018).
doi:10.1038/s41467-017-02554-5

81. Hie, B.L., Bryson, B., Berger, B.: Panoramic stitching of heterogeneous single-cell transcriptomic data.
BioRxiv, 371179 (2018)

82. Haghverdi, L., Lun, A.T., Morgan, M.D., Marioni, J.C.: Batch effects in single-cell rna-sequencing data are
corrected by matching mutual nearest neighbors. Nat Biotechnol. 36(5), 421 (2018). doi:10.1038/nbt.4091

83. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

Additional Files
Additional file 1 — Mathematical formulation of the proposed autoencoders
We provide the mathematical derivation of GMMMD and GMMMDVAE as well as the generalised formulation
that we derived by following the notation proposed in [37].

Additional file 2 — Excel file of the metrics calculated for the PBMC datasets
Each tab is related to a tested approach and shows the calculated metrics and used method.

Additional file 3 — Excel file of the metrics calculated for the PIC datasets
Each tab is related to a tested approach and shows the calculated metrics and used method.

Additional file 4 — Excel file of the metrics calculated for the MCA datasets
Each tab is related to a tested approach and shows the calculated metrics and used method.

.CC-BY-NC-ND 4.0 International licenseunder a
not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available 

The copyright holder for this preprint (which wasthis version posted February 13, 2021. ; https://doi.org/10.1101/727867doi: bioRxiv preprint 

http://dx.doi.org/10.3389/fgene.2019.01205
http://dx.doi.org/10.1093/nar/gkz369
http://dx.doi.org/10.1038/s41467-018-07771-0
http://dx.doi.org/10.1214/aoms/1177729694
http://dx.doi.org/10.1038/s41467-017-02554-5
http://dx.doi.org/10.1038/nbt.4091
https://doi.org/10.1101/727867
http://creativecommons.org/licenses/by-nc-nd/4.0/


Tangherloni et al. Page 20 of 27

Figures

PCA on standardised data

Steps (iv) and (v)

ClusteringMarker genes

Step (vii)

Data visualisation

Quality Control

Step (i) 

Log-transformationNormalisation

Step (ii) 

Step (vi)

Highly variable genes

Step (iii) 

Figure 1 A common workflow for the downstream analysis of scRNA-Seq data. The workflow
includes the following seven steps: (i) quality control to remove low-quality cells that may add
technical noise, which could obscure the real biological signals; (ii) normalisation and
log-transformation; (iii) identification of the HVGs to reduce the dimensionality of the dataset by
including only the most informative genes; (iv) standardisation of each gene to zero mean and
unit variance; (v) dimensionality reduction generally obtained by applying PCA; (vi) clustering of
the cells starting from the low-dimensional representation of the data that are used to annotate
the obtained clusters (i.e., identification of known and putatively novel cell-types); (vii) data
visualisation on the low-dimensional space generated by applying a non-linear approach (e.g.,
t-SNE or UMAP) on the reduced space calculated in step (v).

.CC-BY-NC-ND 4.0 International licenseunder a
not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available 

The copyright holder for this preprint (which wasthis version posted February 13, 2021. ; https://doi.org/10.1101/727867doi: bioRxiv preprint 

https://doi.org/10.1101/727867
http://creativecommons.org/licenses/by-nc-nd/4.0/


Tangherloni et al. Page 21 of 27

m1	genes

 n
1 

ce
lls Gene expression matrix

Sample 1

k	genes

 n
1	
+
	..
.	+
	n
E
 c

el
ls

   HVGs per
  sample

 k
 n

od
es

 k nodes

t-SNE

UMAPLeiden
algorithm

Label
assignment

scAEspy

mE	genes

 n
E
 c

el
ls Gene expression matrix

Sample E

...

Corrected
neighbourhood

graph

BBKNN
(Corrected

neighbourhood
graph)

z-
sc

or
e

Rank

Marker
genes

Dimensionality
reduction

Sample 1

Sample E

Graph
clustering

UMAP1

U
M
A
P
2

tSNE1

tS
N
E
2 12

3
4
5
6
7

1
2

3

45

6

6

...

HARMONY
(corrected

latent space)

Neighbourhood
graph

Corrected
latent space

Figure 2 The proposed workflow to integrate different samples. Given E different samples, their
gene expression matrices are merged. Then, the top k HVGs are selected by considering the
different samples. Specifically, they are selected within each sample separately and then merged to
avoid the selection of batch-specific genes. scAEspy is used to reduce the HVG space (k
dimensions), and the obtained latent space can be (i) used to calculate a t-SNE space, (ii)
corrected by Harmony, and (iii) used to infer an uncorrected neighbourhood graph. The corrected
latent space by Harmony is then used to build a neighbourhood graph, which is clustered by using
the Leiden algorithm and used to calculate a UMAP space. Otherwise, BBKNN is applied to
rebuild a uncorrected neighbourhood graph by taking into account the possible batch-effects. The
corrected neighbourhood graph by BBKNN is then clustered by using the Leiden algorithm and
used to calculate a UMAP space. In order to assign the correct label to the obtained clusters, the
marker genes are calculated by using the Mann–Whitney U test. Finally, the annotated clusters
can be visualised in both t-SNE and UMAP space.

.CC-BY-NC-ND 4.0 International licenseunder a
not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available 

The copyright holder for this preprint (which wasthis version posted February 13, 2021. ; https://doi.org/10.1101/727867doi: bioRxiv preprint 

https://doi.org/10.1101/727867
http://creativecommons.org/licenses/by-nc-nd/4.0/


Tangherloni et al. Page 22 of 27

C
C

A

C
om

B
at

M
M

D
A

E
-H

ar
m

on
y-

25
6×

32

P
C

A

P
C

A
-B

B
K

N
N

P
C

A
-H

ar
m

on
y

70

75

80

85

90

95

100

A
R

I

****

****

****

****

****

(A)

G
M

M
M

D
-B

B
K

N
N

-1
28
×

32

G
M

M
M

D
-H

ar
m

on
y-

64
×

16

G
M

M
M

D
-H

ar
m

on
y-

12
8×

16

G
M

M
M

D
-H

ar
m

on
y-

25
6×

64

M
M

D
A

E
-H

ar
m

on
y-

25
6×

16

M
M

D
A

E
-H

ar
m

on
y-

25
6×

32

M
M

D
A

E
-H

ar
m

on
y-

64
×

32

M
M

D
A

E
-H

ar
m

on
y-

12
8×

64

84

86

88

90

92

94

****
****

****
****

****

****

***

(B)

UMAP1

U
M

A
P

2

(C)

B cells
CD14+ Monocytes 
CD4 T cells
CD8 T cells 
Dendritic cells 
FCGR3A+ Monocytes 
Megakaryocytes 
NK cells

UMAP1

U
M

A
P

2

(D)

0
1
2
3
4
5
6
7

Figure 3 Results obtained on the PBMC datasets. (A) Boxplot showing the ARI values achieved
by CCA, ComBat, PCA, MMDAE followed by Harmony with dimension (256, 32), PCA followed
by BBKNN, and PCA followed by Harmony on the PBMC datasets. (B) Boxplot showing the ARI
values achieved by the best AE for each of the tested dimension (H, L) of the hidden layer (H
neurons) and latent space (L neurons). (C) UMAP visualisation of the cell-type manually
annotated in the original paper. (D) UMAP visualisation of clusters identified by the Leiden
algorithm using the resolution corresponding by the best ARI achieved by MMDAE followed by
Harmony. p-value≤ 0.0001 (****); 0.0001 <p-value≤ 0.001 (***); 0.001 <p-value≤ 0.01 (**);
0.01 <p-value≤ 0.05 (*); p-value> 0.05 (ns)

.CC-BY-NC-ND 4.0 International licenseunder a
not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available 

The copyright holder for this preprint (which wasthis version posted February 13, 2021. ; https://doi.org/10.1101/727867doi: bioRxiv preprint 

https://doi.org/10.1101/727867
http://creativecommons.org/licenses/by-nc-nd/4.0/


Tangherloni et al. Page 23 of 27

65

70

75

80

85

90

A
M
II

****

****

****

****

****

(A)

76

78

80

82

84

**

****

ns

****

****

****

ns

(B)

80

85

90

95

100

FM
S

****

****

****

****

****

(C)

89

90

91

92

93

94

95

96

****
****

****

****

****

****

****

(D)

65

70

75

80

85

90

H
S

****

****

****

****

****

(E)

76

78

80

82

84

**

****

ns

****

****

****

ns

(F)

70

75

80

85

90

95

C
S

****

****

****

****

****

(G)

82

84

86

88

90

****

****

****

****

*

ns

****

(H)

C
C
A

C
om
B
at

M
M
D
A
E
-H
ar
m
on
y-
25
6×
32

P
C
A

P
C
A
-B
B
K
N
N

P
C
A
-H
ar
m
on
y

70

75

80

85

90

V
M

****

****

****

****

****

(I)

G
M
M
M
D
-B
B
K
N
N
-1
28
×
32

G
M
M
M
D
-H
ar
m
on
y-
64
×
16

G
M
M
M
D
-H
ar
m
on
y-
12
8×
16

G
M
M
M
D
-H
ar
m
on
y-
25
6×
64

M
M
D
A
E
-H
ar
m
on
y-
25
6×
16

M
M
D
A
E
-H
ar
m
on
y-
25
6×
32

M
M
D
A
E
-H
ar
m
on
y-
64
×
32

M
M
D
A
E
-H
ar
m
on
y-
12
8×
64

79

80

81

82

83

84

85

86

****
****

****

****

****

****

****

(J)

Figure 4 Boxplot showing the values of the calculated metrics using CCA, ComBat, PCA,
MMDAE followed by Harmony with dimension (256, 32), PCA followed by BBKNN, and PCA
followed by Harmony as well as by the best AE for each of the tested dimension (H, L), analysing
the PBMC datasets. (A) AMII achieved by the different strategies. (B) AMII achieved by the best
AE for each of the tested dimension. (C) FMS achieved by the different strategies. (D) FMS
achieved by the best AE for each of the tested dimension. (E) HS achieved by the different
strategies. (F) HS achieved by the best AE for each of the tested dimension. (G) CS achieved by
the different strategies. (H) CS achieved by the best AE for each of the tested dimension. (I) VM
achieved by the different strategies. (J) VM achieved by the best AE for each of the tested
dimension. p-value≤ 0.0001 (****); 0.0001 <p-value≤ 0.001 (***); 0.001 <p-value≤ 0.01 (**);
0.01 <p-value≤ 0.05 (*); p-value> 0.05 (ns)

.CC-BY-NC-ND 4.0 International licenseunder a
not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available 

The copyright holder for this preprint (which wasthis version posted February 13, 2021. ; https://doi.org/10.1101/727867doi: bioRxiv preprint 

https://doi.org/10.1101/727867
http://creativecommons.org/licenses/by-nc-nd/4.0/


Tangherloni et al. Page 24 of 27

UMAP1

U
M

A
P

2

(A)
Control cells 
Treated cells

UMAP1

U
M

A
P

2

(B)
CEL-Seq 
CEL-Seq2 
Fluidigm C1 
Smart-Seq2

UMAP1

U
M
A
P
2

(C)
Microwell-Seq
Smart-Seq2

Figure 5 UMAP visualisation showing the sample alignment performed by Harmony into the
latent space obtained by MMDAE with dimension (256, 32) for the PBMC datasets (A), by
GMMMD with dimension (256, 32) for the PIC datasets (B), and by MMDVAE with dimension
(256, 64) for the MCA datasets (C).

C
C

A

C
om

B
at

G
M

M
M

D
-H

ar
m

on
y-

25
6×

32

P
C

A

P
C

A
-B

B
K

N
N

P
C

A
-H

ar
m

on
y

20

40

60

80

100

120

140

A
R

I

****

****

****

****

ns

(A)

G
M

M
M

D
-H

ar
m

on
y-

12
8×

64

G
M

M
M

D
-H

ar
m

on
y-

25
6×

32

G
M

M
M

D
-H

ar
m

on
y-

64
×

32

G
M

V
A

E
-H

ar
m

on
y-

12
8×

32

G
M

V
A

E
-H

ar
m

on
y-

25
6×

16

M
M

D
-H

ar
m

on
y-

25
6×

64

M
M

D
-H

ar
m

on
y-

64
×

16

V
A

E
-H

ar
m

on
y-

12
8×

16

90

92

94

96

98

****

****

****

****

****

(B)

UMAP1

U
M

A
P

2

(C)
Acinar cells 
Activated stellate 
Alpha cells 
Beta cells 
Delta cells 
Ductal cells  
Endothelial cells

Epsilon cells 
Gamma cells 
Macrophages 
Mast cells 
Quiescent stellate 
Schwann cells

UMAP1

U
M

A
P

2

(D)
0
1
2
3
4
5
6

7
8
9
10
11
12

Figure 6 Results obtained on the PIC datasets. (A) Boxplot showing the ARI values achieved by
CCA, ComBat, PCA, GMMMD followed by Harmony with dimension (256, 32), PCA followed by
BBKNN, and PCA followed by Harmony on the PBMC datasets. (B) Boxplot showing the ARI
values achieved by the best AE for each of the tested dimension (H, L) of the hidden layer (H
neurons) and latent space (L neurons). (C) UMAP visualisation of the cell-type manually
annotated in the original paper. (D) UMAP visualisation of clusters identified by the Leiden
algorithm using the resolution corresponding by the best ARI achieved by GMMMD followed by
Harmony. p-value≤ 0.0001 (****); 0.0001 <p-value≤ 0.001 (***); 0.001 <p-value≤ 0.01 (**);
0.01 <p-value≤ 0.05 (*); p-value> 0.05 (ns)

.CC-BY-NC-ND 4.0 International licenseunder a
not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available 

The copyright holder for this preprint (which wasthis version posted February 13, 2021. ; https://doi.org/10.1101/727867doi: bioRxiv preprint 

https://doi.org/10.1101/727867
http://creativecommons.org/licenses/by-nc-nd/4.0/


Tangherloni et al. Page 25 of 27

20

40

60

80

100

120

A
M
II

****

****

****

****

****

(A)

82

84

86

88

90

92

94

96

****

****

****

****

****

(B)

20

40

60

80

100

120

140

FM
S

****

****

****

****

ns

(C)

92

93

94

95

96

97

98

99

100

FM
S

****

****

****

****

****

(D)

20

40

60

80

100

120

140

H
S

****

****

****

****

****

(E)

86

88

90

92

94

****

****

****

****

****

(F)

20

40

60

80

100

120

C
S

****

****

****

****

****

(G)

82

84

86

88

90

92

94

96

****

****

****

****

****

(H)

C
C
A

C
om
B
at

G
M
M
M
D
-H
ar
m
on
y-
25
6×
32

P
C
A

P
C
A
-B
B
K
N
N

P
C
A
-H
ar
m
on
y

20

40

60

80

100

120

V
M

****

****

****

****

****

(I)

G
M
M
M
D
-H
ar
m
on
y-
12
8×
64

G
M
M
M
D
-H
ar
m
on
y-
25
6×
32

G
M
M
M
D
-H
ar
m
on
y-
64
×
32

G
M
V
A
E
-H
ar
m
on
y-
12
8×
32

G
M
V
A
E
-H
ar
m
on
y-
25
6×
16

M
M
D
-H
ar
m
on
y-
25
6×
64

M
M
D
-H
ar
m
on
y-
64
×
16

V
A
E
-H
ar
m
on
y-
12
8×
16

86

88

90

92

94

****

****

****

****

****

(J)

Figure 7 Boxplot showing the values of the calculated metrics using CCA, ComBat, PCA,
GMMMD followed by Harmony with dimension (256, 32), PCA followed by BBKNN, and PCA
followed by Harmony as well as by the best AE for each of the tested dimension (H, L), analysing
the PIC datasets. (A) AMII achieved by the different strategies. (B) AMII achieved by the best
AE for each of the tested dimension. (C) FMS achieved by the different strategies. (D) FMS
achieved by the best AE for each of the tested dimension. (E) HS achieved by the different
strategies. (F) HS achieved by the best AE for each of the tested dimension. (G) CS achieved by
the different strategies. (H) CS achieved by the best AE for each of the tested dimension. (I) VM
achieved by the different strategies. (J) VM achieved by the best AE for each of the tested
dimension. p-value≤ 0.0001 (****); 0.0001 <p-value≤ 0.001 (***); 0.001 <p-value≤ 0.01 (**);
0.01 <p-value≤ 0.05 (*); p-value> 0.05 (ns)

.CC-BY-NC-ND 4.0 International licenseunder a
not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available 

The copyright holder for this preprint (which wasthis version posted February 13, 2021. ; https://doi.org/10.1101/727867doi: bioRxiv preprint 

https://doi.org/10.1101/727867
http://creativecommons.org/licenses/by-nc-nd/4.0/


Tangherloni et al. Page 26 of 27

C
C

A

C
om

B
at

M
M

D
V

A
E

-H
ar

m
on

y-
25

6×
64

P
C

A

P
C

A
-B

B
K

N
N

P
C

A
-H

ar
m

on
y

50

60

70

80

90

100

A
R

I

****

****

****

****

****

(A)

G
M

M
M

D
-1

28
×

16

G
M

M
M

D
-H

ar
m

on
y-

64
×

16

G
M

M
M

D
-H

ar
m

on
y-

25
6×

32

M
M

D
-B

B
K

N
N

-1
28
×

64

M
M

D
-B

B
K

N
N

-2
56
×

16

M
M

D
V

A
E

-H
ar

m
on

y-
12

8×
16

M
M

D
V

A
E

-H
ar

m
on

y-
25

6×
64

V
A

E
-H

ar
m

on
y-

64
×

32

40

50

60

70

80

90

****

****

****

(B)

UMAP1

U
M

A
P

2

(C)

B cells 
Dendritic cells 
Endothelial cells 
Epithelial cells 
Macrophages 
Monocytes

NK cells
Neutrophils 
Smooth-muscle cells 
Stromal cells
T cells

UMAP1

U
M

A
P

2

(D)

0
1
2
3
4
5

6
7
8
9
10

Figure 8 Results obtained on the MCA datasets. (A) Boxplot showing the ARI values achieved
by CCA, ComBat, PCA, MMDVAE followed by Harmony with dimension (256, 64), PCA followed
by BBKNN, and PCA followed by Harmony. (B) Boxplot showing the ARI values achieved by the
best AE for each of the tested dimension (H, L) of the hidden layer (H neurons) and latent space
(L neurons). (C) UMAP visualisation of the cell-type manually annotated in the original paper.
(D) UMAP visualisation of clusters identified by the Leiden algorithm using the resolution
corresponding by the best ARI achieved by MMDVAE followed by Harmony. p-value≤ 0.0001
(****); 0.0001 <p-value≤ 0.001 (***); 0.001 <p-value≤ 0.01 (**); 0.01 <p-value≤ 0.05 (*);
p-value> 0.05 (ns)

.CC-BY-NC-ND 4.0 International licenseunder a
not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available 

The copyright holder for this preprint (which wasthis version posted February 13, 2021. ; https://doi.org/10.1101/727867doi: bioRxiv preprint 

https://doi.org/10.1101/727867
http://creativecommons.org/licenses/by-nc-nd/4.0/


Tangherloni et al. Page 27 of 27

70

75

80

85

A
M
II

****

ns

****

****

****

(A)

65

70

75

80

85

****

****

****

(B)

60

70

80

90

100

FM
S

****

****

****

****

****

(C)

50

60

70

80

90

****

****

****

(D)

70

75

80

85

90

H
S

****

ns

****

ns

****

(E)

65

70

75

80

85

****

****

****

(F)

70

75

80

85

90

C
S

****

ns

****

****

****

(G)

65

70

75

80

85

****

****

****

(H)

C
C
A

C
om
B
at

M
M
D
V
A
E
-H
ar
m
on
y-
25
6×
64

P
C
A

P
C
A
-B
B
K
N
N

P
C
A
-H
ar
m
on
y

67.5

70.0

72.5

75.0

77.5

80.0

82.5

85.0

87.5

V
M

****

ns

****

ns

****

(I)

G
M
M
M
D
-1
28
×
16

G
M
M
M
D
-H
ar
m
on
y-
64
×
16

G
M
M
M
D
-H
ar
m
on
y-
25
6×
32

M
M
D
-B
B
K
N
N
-1
28
×
64

M
M
D
-B
B
K
N
N
-2
56
×
16

M
M
D
V
A
E
-H
ar
m
on
y-
12
8×
16

M
M
D
V
A
E
-H
ar
m
on
y-
25
6×
64

V
A
E
-H
ar
m
on
y-
64
×
32

65

70

75

80

85

****

****

****

(J)

Figure 9 Boxplot showing the values of the calculated metrics using CCA, ComBat, PCA,
MMDVAE followed by Harmony with dimension (256, 64), PCA followed by BBKNN, and PCA
followed by Harmony, as well as by the best AE for each of the tested dimension (H, L), analysing
the MCA datasets. (A) AMII achieved by the different strategies. (B) AMII achieved by the best
AE for each of the tested dimension. (C) FMS achieved by the different strategies. (D) FMS
achieved by the best AE for each of the tested dimension. (E) HS achieved by the different
strategies. (F) HS achieved by the best AE for each of the tested dimension. (G) CS achieved by
the different strategies. (H) CS achieved by the best AE for each of the tested dimension. (I) VM
achieved by the different strategies. (J) VM achieved by the best AE for each of the tested
dimension. p-value≤ 0.0001 (****); 0.0001 <p-value≤ 0.001 (***); 0.001 <p-value≤ 0.01 (**);
0.01 <p-value≤ 0.05 (*); p-value> 0.05 (ns)

.CC-BY-NC-ND 4.0 International licenseunder a
not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available 

The copyright holder for this preprint (which wasthis version posted February 13, 2021. ; https://doi.org/10.1101/727867doi: bioRxiv preprint 

https://doi.org/10.1101/727867
http://creativecommons.org/licenses/by-nc-nd/4.0/

	Abstract