Triku: a feature selection method based on nearest neighbors for single-cell data


Ascensión et al.

SOFTWARE

Triku: a feature selection method based on nearest
neighbors for single-cell data
Alex M. Ascensión1,2†, Olga Ibañez-Solé1,2†, Inaki Inza3, Ander Izeta2 and Marcos J. Araúzo-Bravo1,4*

*Correspondence:
mararabra@yahoo.co.uk
1Biodonostia Health Research
Institute, Computational Biology
and Systems Biomedicine Group,
Paseo Dr. Begiristain, s/n, 20014,
Donostia-San Sebastian, Spain
Full list of author information is
available at the end of the article
†Equal contributor

Abstract
Feature selection is a relevant step in the analysis of single-cell RNA sequencing
datasets. Triku is a feature selection method that favours genes defining the main
cell populations. It does so by selecting genes expressed by groups of cells that
are close in the nearest neighbor graph. Triku efficiently recovers cell populations
present in artificial and biological benchmarking datasets, based on mutual
information and silhouette coefficient measurements. Additionally, gene sets
selected by triku are more likely to be related to relevant Gene Ontology terms,
and contain fewer ribosomal and mitochondrial genes. Triku is available at
https://gitlab.com/alexmascension/triku.
Keywords: scRNAseq; feature selection; bioinformatics; python

1 Background
Single-cell RNA sequencing (scRNA-seq) is a powerful technology to study the bi-
ological heterogeneity of tissues at the individual cell level, allowing the characteri-
zation of new cell populations and cell states–i.e. cell types responding to different
environmental stimuli– previously undetected due to their low frequency within the
tissue and the lack of individual resolution of bulk methods [1, 2].

scRNA-seq datasets are multidimensional, i.e. the expression profile per cell con-
sists of multiple genes. Two common characteristics of multidimensional datasets
is their high dimensionality and their sparsity, which are worsened in single-cell
datasets due the high proportion of zeros from low signal recovery [3]. This spar-
sity affects downstream methods such as cell type detection or differential gene
expression [4].

A common task when working with multidimensional datasets is feature selection
(FS). FS, alongside with feature extraction (FE), responds to the need of obtaining a
reduced dataset with a smaller dimensionality [5]. While FE methods like Principal
Component Analysis (PCA) extract new features based on combinations of the
original features, FS methods aim to select a subset of the features that best explains
the original dataset.

There are three main types of FS methods: filter, wrapper and embedded methods
[5]. Current FS methods in scRNA-seq analysis are filter methods because common
downstream analysis steps do not embed the FS within the pipeline [6]. FS methods
represent a key step in processing pipelines of bioinformatic datasets and provide
several advantages [5]: they reduce model overfitting risk, improve clustering qual-
ity, and favour a deeper insight into the underlying processes that generated the
data (features –genes– that contain random noise do not contribute to the biology of

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430764doi: bioRxiv preprint 

mailto:mararabra@yahoo.co.uk
https://doi.org/10.1101/2021.02.12.430764
http://creativecommons.org/licenses/by/4.0/


Ascensión et al. Page 2 of 18

the dataset and are removed). Specifically, in scRNA-seq, removing non-informative
features can improve results in downstream analyses such as differential gene ex-
pression.

Early methods for FS in scRNA-seq data were based on the idea that genes
whose expression show a greater dispersion across the dataset are the ones that
best capture the biological structure of the dataset. Conversely, genes that are
evenly expressed across cells are unlikely to define cell types or cell functions in a
heterogeneous dataset. The most straightforward way of selecting genes that are
not evenly expressed is to look at a measure of dispersion of the counts of each gene
and to select those genes that have a dispersion over a threshold.

However, the correlation between mean expression and dispersion introduces a
bias whereby genes with higher expression are more likely to be selected by FS
methods. However, biological gene markers that define minor cell types are usually
expressed in a medium to small subset of cells. Therefore, new FS methods based
on dispersion are designed to correct for this dispersion/expression correlation to
select genes with a broader expression spectrum.

Brennecke et al. [7] developed a FS method that introduces a correction over the
dispersion that accounts for differences in the mean expression of genes. It does so
by setting a threshold to the correlation between the average gene expression and
its coefficient of variation across cells. Newer FS methods have arisen after different
corrections, like the one originally described by Satija et al. [8] implemented in
Seurat, later adapted to scanpy [9], or the one implemented in scry [10].

A new generation of FS methods emerged when Svensson discovered that the
proportion of zeros in droplet-based scRNA-seq data, originally assumed to be
dropouts, was tightly related to the mean expression of genes, following a nega-
tive binomial (NB) curve [11]. Genes with an expected lower percentage of zeros
tend to have an even expression across the entire set of cells. Conversely, genes
with a higher than expected percentage of zeros might possess biological relevance
because they are expressed in fewer cells than expected, and these cells might be
associated to a specific cell type or state.

This finding opened the path for new FS methods that would rely on genes that
showed a greater than expected proportion of zeros, according to their mean ex-
pression. These methods are based on a null distribution of some property of the
dataset, and genes whose behavior differs from the expected are selected. The FS
method nbumi, a negative binomial method based on M3Drop [12], works under
this premise. Nbumi fits the NB zero-count probability distribution to the dataset,
and selects genes of interest calculating p-values of observed dropout rates. M3Drop
works similarly by fitting a Michaelis-Menten model instead of the NB from nbumi.

In summary, existing FS methods assume that an unexpected distribution of
counts for a particular gene in a dataset is explained by cells belonging to different
cell types. However, we observe that there are three main patterns of expression
according to the distribution of zeros of a particular gene and overall transcriptional
similarity (expression of all genes), as explained in detail in Figure 1: a) a gene evenly
expressed across cells, or a gene expressed by a subset of cells, which can be b1)
transcriptionally separate or b2) transcriptionally similar. Thus, in some cases a
particular gene shows an unexpected distribution of counts because a subset of cells
are expressing it but those cells might not be transcriptionally similar.

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430764doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430764
http://creativecommons.org/licenses/by/4.0/


Ascensión et al. Page 3 of 18

Here we present triku, a FS method that selects genes that show an unexpected
distribution of zero counts and whose expression is localized in cells that are tran-
scriptomically similar. Figure 2 summarizes the feature selection process. Triku
identifies genes that are locally overexpressed in groups of neighboring cells by
inferring the distribution of counts in the vicinity of a cell and computing the ex-
pected distribution of counts. Then, the Wasserstein distance between the observed
and the expected distributions is computed and genes are ranked according to that
distance. Higher distances imply that the gene is locally expressed in a subset of
transcriptionally similar cells. Finally, a subset of relevant features is selected using
a cutoff value for the distance. Triku outperforms other feature selection methods
on benchmarking and artificial datasets, using unbiased evaluation metrics such as
Normalized Mutual Information (NMI) or Silhouette. Of note, features selected by
triku are more biologically meaningful.

2 Results
The objective of FS methods is to select the features that are the most relevant
in order to understand and explain the structure of the dataset. In the context of
single-cell data, this means finding the subset of genes that, when given as input
to a clustering method, will yield a clustering solution where each cluster can be
annotated as a putative cell type.

Initially, we generated artificial datasets with the splatter package [13], so that
cells belonging to the same cluster have a similar gene expression. All datasets
contained the same number of genes, cells and populations, but differed in the
de.prob parameter value. This parameter was set so that higher values indicate a
higher probability of genes being differentially expressed, resulting in more resolved
populations. A combination of 8 de.prob values, from 0.0065 to 0.3 were used (see
Methods). In addition, we tested triku on two biological benchmarking datasets
by Ding et al. [14] and Mereu et al. [15] that have been expert-labeled using a
semi-supervised procedure. Both benchmarking datasets are composed of individual
subsets of data with different library preparation methods (10X, SMART-seq2, etc.)
in human Peripheral Blood Mononuclear Cells (PBMCs) (Mereu and Ding) and
mouse colon (Mereu) and cortex (Ding) cells.

We have evaluated the relevance of the features selected by triku by comparing
them to the ones selected using other feature selection methods. The relevance of
the features was first measured using metrics associated to the efficacy of clustering,
and then using metrics to evaluate the quality of the genes selected.

We made six types of comparisons between the subsets of genes selected by each
feature selection method: 1) the ability to recover basic dataset structure (main cell
types) in artificial and biological datasets, 2) the ability to obtain transcriptomically
distinct cell clusters, 3) the overlap of features between different FS methods, 4)
the localized pattern of expression of the features selected, 5) the ability to avoid
the overrepresentation of mitochondrial and ribosomal genes and 6) the biological
relevance of the genes by studying the composition and quality of the gene ontology
(GO) terms obtained.

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430764doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430764
http://creativecommons.org/licenses/by/4.0/


Ascensión et al. Page 4 of 18

2.1 triku efficiently recovers cell populations present in sc-RNAseq datasets
The first set of metrics evaluates the ability to recover the original cell types based
on the NMI index, and the cluster separation and cohesion using the Silhouette
coefficient.

2.1.1 NMI
NMI measures the correspondence between a labelling considered as the ground
truth and the clustering solution that we obtained using the genes selected by triku
and other FS methods (scanpy, std, scry, brennecke, m3drop, nbumi).

First, we evaluated how well the clustering using the genes selected by the FS
methods was able to recover the same populations that were defined when gen-
erating the artificial datasets. Figure 3 shows that triku is among the best three
feature selection methods for a wide range of de.prob values. For low values of
de.prob –below 0.05–, where the selection of genes that lead to a correct recovery
of cell populations is more challenging, triku notably outperforms the rest of the
FS methods. NMI values obtained with triku are 0.1 to 0.2 higher than the second
and third best FS methods. In addition, the results obtained when using the first
250 selected genes were comparable to those obtained when selecting 500 genes.

We also studied how well the genes selected led to a clustering solution that was
similar to the manually-assigned cell labels in the biological benchmarking datasets,
as shown in Figure 4. For each dataset, the variability between NMI scores was quite
low, meaning that features selected with the different methods yielded clustering so-
lutions that were quite similar to the manually-labeled cell types, although there are
some exceptions to this rule–e.g. Brennecke in Ding datasets, which showed notably
reduced NMI values–. In some datasets, for instance, 10X human, QUARTZseq hu-
man and SMARTseq2 human from Mereu’s benchmarking set, features selected
by FS methods did not lead to increased NMI values as compared with randomly
selected genes.

Despite the differences in NMI between methods being small for each particular
dataset, post-hoc analysis revealed that triku is significantly the best ranked method
across all datasets. To do the post-hoc analysis, we ranked for each dataset the
NMI of each FS method. Figure 4 (left) shows the mean rank of each FS method
across datasets. Triku is the best-ranked FS method in both Mereu’s and Ding’s
benchmarking datasets, with a mean rank of 2.7 and 2.8, respectively. M3drop is the
second best-ranked FS method and triku is in both cases statistically significantly
better (Quade test, p < 0.05).

2.1.2 Silhouette coefficient
Another important aspect of the genes selected by FS methods in scRNA-seq data
analysis is their ability to cluster data into well-separated groups that are transcrip-
tomically similar. We used the Silhouette coefficient to measure the compactness
and separation-degree of cell communities obtained with a clustering method. When
the same clustering algorithm is used on a dataset but using two different FS meth-
ods, the differences in the resulting Silhouette coefficients can be entirely attributed
to the features selected by those methods. We assume that FS methods that increase
the separation between clusters and the compactness within clusters are better at
recovering the cell types present in the dataset.

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430764doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430764
http://creativecommons.org/licenses/by/4.0/


Ascensión et al. Page 5 of 18

Figure 5 shows the Silhouette coefficients obtained with the different FS meth-
ods. For the Mereu and Ding datasets, we observed that triku was the best-ranked
method–mean rank of 1.8 and 1.1–, and the second best-ranked methods were
m3drop and scanpy with a mean rank of 3.8 and 2.2, respectively. In both cases, the
difference between triku and the second-ranked method was statistically significant
(Quade test, p < 0.05).

We performed an additional analysis using the labels obtained with leiden clus-
tering instead of the manually curated cell types (Figure S1). Again, triku outper-
formed the rest of the FS methods showing a statistically significant best mean-rank.

2.2 Genes selected by different FS methods show limited overlap
Next, we studied the characteristics of the genes selected by triku and compared
them to the genes selected by other methods.

Initially, we studied the level of consistency between the results obtained using
different FS methods by studying their degree of overlap, as shown in Figure 6.
In order to compare between equally sized gene lists, we ranked the genes based
on p-values or scoring value from each FS method and set the number of genes
selected by triku as a cutoff to select the first genes. Although the genes selected
by the different methods yielded clustering solutions that are highly consistent, as
shown in the previous section, we did not see any clear gene overlap pattern between
pairs of FS methods. In fact, there is no correlation between the degree of overlap
between the genes selected by the different methods and the clustering solutions
that are obtained when using those genes as input.

For instance, we found an overlap of 11% between the genes selected by scanpy and
std for the 10x mouse dataset, yet the NMI between the clustering solutions obtained
with each of them and the expert-labeled cell types was 0.7. On the other hand, the
overlap between scanpy and Brennecke is one of the highest across datasets (ranging
from 26 to 67%), yet the differences between their corresponding NMI scores are
0.45.

2.3 triku selects genes that are biologically relevant
Based on these results, we studied the biological relevance of the genes selected by
different FS methods in three alternative ways.

Genes whose expression, or lack thereof, is limited to a single population are
more likely to be cell-type specific and thus might be better candidates as positive
or negative cell population markers. Therefore, we studied which are the best FS
methods to select genes showing a localized expression pattern.

Mitochondrial and ribosomal genes are usually highly expressed and many FS
methods tend to overselect them despite them not being particularly relevant in
most single-cell studies and are commonly excluded from downstream analysis [16,
17, 18]. Assuming that in these benchmarking datasets ribosomal and mitochondrial
genes are not as relevant to the biology of the dataset, we measured the percentage
of these genes in the subset of genes selected by triku and compared it to other FS
methods.

Lastly, we analyzed the biological pertinence of the selected genes by performing
Gene Ontology Enrichment Analysis (GOEA) on a dataset of immune cell popu-
lations whose underlying biology is well understood, as a robust indicator of FS
quality.

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430764doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430764
http://creativecommons.org/licenses/by/4.0/


Ascensión et al. Page 6 of 18

2.3.1 Selection of locally-expressed genes
We first studied the expression pattern of genes selected by triku and other methods,
as shown in Figure S2. We observed that out of the 9 populations of the artificial
dataset, when a gene is selected by triku–exclusively or together with other FS
methods–, one of the populations had a markedly higher or lower expression com-
pared to the rest. On the other hand, when a gene is selected by other FS methods
and not by triku, we do not observe any population-specific expression pattern. For
instance, genes exclusively selected by scanpy had a wide expression variation across
clusters, but they were not exclusive of one or two clusters. Features selected by std
and scry showed some variation, but it was overshadowed by the high expression of
the gene, and therefore not relevant under the previous premise.

To evaluate the cluster expression of selected genes in benchmarking datasets, for
each gene we scaled its expression to the 0-1 range, and sorted the clusters so that
the first one had the greatest expression. Figure S3 shows the expression patterns
for several benchmarking datasets. We see that, in most datasets, triku showed
more biased expression patterns, that is, genes selected by triku were expressed,
on average, on fewer clusters than the genes selected by other FS methods. The
second and third best methods were scanpy and brennecke, with similar or slightly
less biased expression patterns as compared to triku. With these methods, up to
80% of the expression of the gene was usually restricted to the 2 to 3 clusters that
most express it.

m3drop and nbumi performed similarly, and showed an expression distribution
across clusters similar to a random selection of genes, which was slightly biased
towards 3 to 5 clusters accumulating up to 80 % of the expression of the gene.
Lastly, std and scry methods were the least biased, and showed almost a linear
decrease of expression percentage across clusters, with 4 to 6 clusters accumulating
up to 80 % of the expression of the gene.

2.3.2 Avoidance of mitochondrial and ribosomal genes
Table 1 shows the percentage of genes that code for ribosomal and mitochondrial
proteins within the genes selected by different FS methods in the two sets of bench-
marking datasets. We observed that std and scry were the only methods that tended
to overselect mitochondrial and ribosomal genes. Among the rest of the methods,
triku showed percentages that were comparable to the rest of the methods, and
slightly lower for the Ding datasets.

2.3.3 Selection of genes based on gene ontologies
We assessed the quality of the GO output by studying its term composition. We se-
lected two PBMC datasets from the Ding datasets: the 10X human and the Dropseq
human. We used PBMC datasets for this analysis because their cell-to-cell variabil-
ity has been extensively studied using single-cell technologies as Fluorescence Acti-
vated Cell Sorting (FACS) and scRNA-seq [19, 20, 21, 22, 23]. Using these datasets,
we measured the proportion of GO terms obtained in the output that were tightly
related to the biological system under study.

Figures 7 and S4 show the first 25 GO terms obtained with the genes selected
by each FS method on the two PBMC datasets, where the terms tightly related to

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430764doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430764
http://creativecommons.org/licenses/by/4.0/


Ascensión et al. Page 7 of 18

immune processes–chosen by three independent assessors–have been highlighted. We
observed that triku was the FS method that yielded the most terms directly related
to immune processes, with 23/25 and 19/25 related terms in the Ding Dropseq
and 10X datasets, respectively. Examples of terms that we considered to be tightly
related to immune processes included B cell receptor signalling pathway, neutrophil
degranulation and T cell proliferation. The next methods were scanpy and m3drop,
whose performances were comparable to that of triku for the 10X dataset (23/25)
but less robust for the Dropseq dataset (10/25 and 9/25 related terms). The rest
of the FS methods mainly selected genes that were related to general cell functions
such as RNA processing, protein processing and cell-cycle regulation.

3 Discussion
FS methods are a key step in any scRNA-seq sequencing analysis pipeline as they
help us obtain a dimensionally reduced version of the dataset that captures the most
relevant information and eases the interpretation and understanding of its under-
lying biology. However, every FS method relies on a set of assumptions regarding
what characteristics make a gene relevant. FS methods that sort genes according to
their dispersion assume that gene expression variability is indicative of its biological
relevance. FS methods like nbumi and m3drop assume that genes showing a propor-
tion of zero-counts that is greater than expected (according to a null distribution)
are more likely to be informative. Triku assumes that genes that have a localized
expression in a subset of cells that share an overall transcriptomic similarity are
more likely to define cell types. A general trend in FS method design has been to
refine the requirements that a gene must meet in order for it to be selected, from the
more general dispersion-based to more sophisticated formulations. It is noteworthy
that the requirements in triku are consistent with the previous dispersion-based and
zero-count-based formulations, but involve a new aspect that we consider essential
for an accurate gene selection: a localized expression in neighboring cells. Another
important advantage of triku over FS methods that consider the zero-count dis-
tribution is that, unlike m3drop and nbumi, triku does not assume gene counts to
follow any particular distribution, since it estimates the null distribution from the
dataset, thus extending the range of single-cell technologies that it can use beyond
droplet-based technologies.

We verified the locality of the genes selected by triku in different artificial and
real scRNA-seq datasests and concluded that, on average, the expression of triku-
selected genes is restricted to fewer, well-defined clusters. In addition, the clusters
obtained when using triku-selected genes as input for unsupervised clustering in
both artificially generated and biological datasets have a better resolved pattern
structure, as shown by their increased Silhouette coefficients. In the case of artificial
datasets, where the degree of mixture between clusters can be predefined, triku
proved to be able to recover the originally-defined cell populations. In fact, we
found that the higher the degree of mixture between clusters, the more obvious the
advantage of triku over the rest of the FS methods tested.

An important difficulty in the interpretation of single-cell data is that we must
consider that cell-to-cell variability has both technical and biological components.
I.e., it is difficult to know whether a set of genes is differentially expressed between

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430764doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430764
http://creativecommons.org/licenses/by/4.0/


Ascensión et al. Page 8 of 18

cell clusters due to technical reasons (differences in the efficiency of mRNA cap-
ture, amplification and sequencing) or if it constitutes a biological signal. Moreover,
there is a wide range of sources of biological variability within a dataset, some of
which might not be of interest depending on the experimental context. For instance,
fluctuations in genes that regulate the cell cycle constitute a source of biological vari-
ability that is often disregarded. This has been extensively studied and addressed
in a number of ways: normalization, regression of unwanted sources of variation,
etc. [24, 25, 26, 27].

The expression of genes whose variability is associated with technical reasons
tend to have a high dispersion but their expression is usually not restricted to a few
clusters. A good example of these genes are the ribosomal and mitochondrial genes,
which are expressed across all cell types at different levels. Our results show that
these genes are in fact selected by the majority of compared FS methods due to
their high expression and cell-to-cell variability, but are less likely to be selected by
triku, since they do not usually meet the locality requirement. Additionally, when
performing GOEA, we observed that the list of genes obtained with triku were more
enriched for terms that are specifically related to a biological process of the system
under study.

In our work, we have observed that the genes selected by different FS methods
might show little overlap between them. This phenomenon has been described else-
where [28]. In fact, gene covariation and redundancy is a well characterized effect
that has been observed in omics studies. The effect of redundancy arises from the
fact that different cell types must have a common large set of pathways to be ac-
tive. The difference between cell type and cell state is that two cell types might have
large sets of pathways that are different between each other, and two cell states will
only differ in a few pathways. Since pathways are composed of many genes, only
choosing a reduced set of genes from a set of pathways from cell type A and B
might be enough to differentiate them, and we might not need to select all genes
from all pathways. This “paradigm” explains several effects. Qiu et al. described
that scRNA-seq datasets could preserve basic structure after gene expression bina-
rization [29] or by conducting very shallow sequencing experiments [4]. This can be
explained by the fact that only a few genes are necessary to describe the main cell
populations in a single-cell dataset, and the presence/absence of a certain marker is
often more informative than its expression level. This is related to the notion that
despite the high dimensionality of omics studies, most biological systems can be ex-
plained in a reduced number of dimensions. Moreover, some authors have claimed
this low dimensionality to be a natural and fundamental property of gene expres-
sion data [4]. This highlights the importance of designing accurate FS methods that
extract the fundamental information from single-cell datasets.

Triku Python package is available at https://gitlab.com/alexmascension/
triku and can be downloaded using PyPI. Triku has been designed to be com-
patible with scanpy syntax, so that scanpy users can easily include triku into their
pipelines.

4 Methods
The triku workflow is further described in Suplementary Methods.

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430764doi: bioRxiv preprint 

https://gitlab.com/alexmascension/triku
https://gitlab.com/alexmascension/triku
https://doi.org/10.1101/2021.02.12.430764
http://creativecommons.org/licenses/by/4.0/


Ascensión et al. Page 9 of 18

4.1 Artificial and benchmarking datasets
In order to perform the evaluation of the FS methods we used a set of artificial and
biological benchmarking datasets.

Artificial datasets were constructed using splatter R package (v 1.10.1). Each
dataset contains 10,000 cells and 15,000 genes, and consists of 9 populations with
abundances in the dataset of {25%, 20%, 15%, 10%, 10%, 7%, 5.5%, 4%, 3.5%} of the
cells. Each dataset contains a parameter, de.prob, that controls the probability that
a gene is differentially expressed. Lower de.prob values (< 0.05) imply that different
populations have fewer differentially expressed genes between them and, therefore,
are more difficult to be differentiated. Selected values of de.prob are {0.0065, 0.008,
0.01, 0.016, 0.025, 0.05, 0.1, 0.3}. Populations in datasets with de.prob values above
0.05 are completely separated in the low-dimensionality representation with UMAP,
even without feature selection (Figure S5).

Regarding biological datasets, two benchmarking datasets have been recently pub-
lished by Mereu et al. [15] and Ding et al. [14]. The aim of these two works is
to analyze the diversity of library preparation methods, e.g. 10X, SMART-seq2,
CEL-seq2, single nucleus or inDrop. Mereu et al. use mouse colon cells and human
PBMCs to perform the benchmarking, whereas Ding et al. use mouse cortex and
human PBMCs. There are a total of 14 datasets in Mereu et al. and 9 in Ding et
al. An additional characteristic of these datasets is that they have been manually
annotated, and this annotation is useful as a semi ground truth. Ding dataset files
were downloaded from Single Cell Portal (accession numbers SCP424 and SCP425),
and cell type metadata is located within the downloaded files. Mereu datasets were
downloaded from GEO database (accession GSE133549), and cell type metadata
was obtained under personal request.

4.2 FS methods
Triku is compared to the following FS methods:

• Standard deviation (std). Computed directly using Numpy (v 1.18.3).
• brennecke [7]: fits a curve based on the square of the coefficient of variation

(CV 2) versus the mean expression (µ) of each gene and selects the features
with higher CV 2 and µ. The features are selected with the BrenneckeGetVari-
ableGenes function from M3Drop R package (v 1.12.0).

• scry [10]: computes a deviance statistic for counts based on a multinomial
model that assumes each feature has a constant rate. The features are selected
with the devianceFeatureSelection function from scry R package (v 0.99.0).

• scanpy [9]: selects features based on a z-scored deviation, adapted from Seu-
rat’s method. The features are selected with the sc.pp.highly_variable_genes
function from scanpy (v 1.6.0).

• M3Drop [12]: fits a Michaelis-Menten equation to the percentage of zeros ver-
sus µ, and selects features with higher percentages of zeros than expected. The
features are selected with the M3DropFeatureSelection function from M3Drop
R package.

• nbumi: it acts in the same manner as M3Drop, but fitting a negative binomial
equation instead of a Michaelis-Menten equation. The features are selected
with the NBumiFeatureSelectionCombinedDrop function.

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430764doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430764
http://creativecommons.org/licenses/by/4.0/


Ascensión et al. Page 10 of 18

4.2.1 FS and dataset preprocessing
To make the comparison between FS methods, each feature is ranked based on the
score provided by each FS method. Calculating the ranking instead of just selecting
the features allow us to select different numbers of features when needed. By default,
the number of features is the one automatically selected by triku. Additionally, in
some contexts, analyses are performed with all features or with a random selection
of features.

After the ranking of genes is computed, dataset processing is performed equally
for all methods, in artificial and benchmarking datasets. Datasets are first log trans-
formed –if required by the method–, and PCA with 30 components is calculated.
Then, the k-Nearest Neighbors (kNN) matrix is computed setting k as

√
ncells. Uni-

form Manifold Approximation and Projection (UMAP) (v 0.3.10) is then applied
to reduce the dimensionality for plotting. If community detection is required, leiden
(v 0.7.0) is applied selecting the resolution that matches the number of cell types
manually annotated in the dataset. This procedure is repeated with 10 different
seeds. This conditions the output of triku, random FS, PCA projection, neighbor
graph, leiden community detection, and UMAP.

4.2.2 NMI calculation in artificial and benchmarking datasets
In order to compare the leiden community detection results with the ground-truth
labels from artificial and biological datasets, we used the Normalized Mutual Infor-
mation (NMI) score [30].

If T and L are the labels of the cell types (true populations) and leiden commu-
nities respectively, the NMI between T and L is:

NMI(T, L) =
2I(T ; L)

H(T) + H(L)

Where H(X) is the entropy of the labels, and I(T ; L) is the mutual information
between the two sets of labels. This value is further described in [31]. We used scikit-
learn (v 0.23.1) implementation of NMI, sklearn.metrics.adjusted_mutual_info_score.

One of the advantages of NMI against other mutual information methods is that
it performs better with label sets with class imbalance, which are common in single-
cell datasets, where there are differences in the abundance of cell types.

On artificial datasets, leiden was applied using the first 250 and 500 selected
features, and the resulting community labels were compared with the population
labels from the dataset. On benchmarking datasets, leiden was applied with the
manually-curated cell types.

4.2.3 Silhouette coefficient in benchmarking datasets
In order to assess the clustering performance of the communities obtained with
benchmarking datasets we used the Silhouette coefficient. The Silhouette coefficient
compares the distances of the cells within each cluster (intra-cluster) and between
clusters (inter-cluster) within a measurable space. The distance between two cells is
the cosine distance between their gene expression vectors, considering only the genes
selected by each FS method. The greater the distance between cells that belong to

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430764doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430764
http://creativecommons.org/licenses/by/4.0/


Ascensión et al. Page 11 of 18

different clusters and the smaller the distance between cells from different cluster,
the greater the Silhouette score.

In order to calculate the Silhouette coefficient for a cell c within cluster Ci (out
of n clusters), the mean distance between the cell and the rest of the cells within
the cluster is computed using the gene expression:

a(c) =
1

|Ci| − 1
∑

j∈Ci,c̸=j

d(c, j)

Then, the minimum mean distance between that cell and the rest of cells from other
clusters is computed:

b(c) = min
Ck ̸=Ci

{ 1
Ck

∑
j∈Ck

d(i, j)
}

k ∈ 1, · · · , n

Then the Silhouette coefficient is computed as

s(c) =
b(c) − a(c)

max b(c), a(c)

Higher Silhouette scores imply a better separation between clusters and, therefore,
a better performance of the FS method. We used scikit-learn implementation of
Silhouette, sklearn.metrics.silhouette_score.

4.2.4 Overlap between gene lists
In order to calculate the overlap between selected features for each FS method, we
applied the Jaccard index [32]: jaccard(i, j) = |i∩j||i∪j| , where i, j are the sets of genes
selected by the two FS methods.

4.2.5 Performance of gene selection and locality measures
In order to assess the performance of different FS methods selecting genes that
are relevant for the dataset, we applied two different strategies for artificial and
biological datasets.

For artificial datasets, we selected 4 representative genes of each of the combi-
nations of genes shown in Figure S2. Then we calculated the mean expression of
each of the for genes in each population, and we represent this information in the
barplots.

For benchmarking datasets, in order to represent the Figure S3, for each dataset
and FS method we used the following procedure: for each gene, the expression was
scaled to sum 1 across all cells. Then, leiden clustering was run with resolution pa-
rameter value 1.2. For each cluster, the proportion of the expression was calculated,
and the clusters were ordered so that the first cluster is the one that concentrates the
majority of the expression. To create Figure S3, the average value of the proportion
of expression is calculated.

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430764doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430764
http://creativecommons.org/licenses/by/4.0/


Ascensión et al. Page 12 of 18

4.2.6 Proportion of ribosomal and mitochondrial genes
When calculating the proportion of mitochondrial and ribosomal genes, the list
of existing ribosomal and mitochondrial proteins was calculated by extracting the
genes starting with RPS, RPL or MT-. The proportion of mitochondrial or riboso-
mal genes is the quotient between the genes of the previous list that appear selected
by that FS method, and the genes in the list.

4.2.7 GO enrichment analysis
In order to calculate the sets of gene ontologies enriched for the selected features of
each FS method, we used python gseapy (v 0.9.17) module gseapy.enrichr function
with the list of the first 1000 selected features against the GO_Biological_Process_2018
ontology. From the list of enriched ontologies, the 25 with the smallest adjusted
p-value were selected.

4.2.8 Ranking and CD
During calculation of NMI and Silhouette coefficients, to evaluate the overall per-
formance of the FS methods across different datasets, the FS methods are ranked
–where 1 is the best rank–. The methodology proposed by Demšar [33] is used
to test for significant differences among FS methods in the datasets: The Fried-
man rank test is applied to test whether the mean rank values for all FS methods
are similar (null hypothesis). If the Friedman rank test rejects the null hypothesis
(α < 0.05), this implies a statistically significant difference among at least two FS
methods. If the null hypothesis is refuted we apply the Quade post-hoc test be-
tween all pairs of FS methods to check which pairs of FS methods are significantly
different (α < 0.05). These results are then plotted in a critical difference diagram.

5 Abbreviations
Single-cell RNA sequencing: scRNA-seq; Feature Selection: FS; Feature Extraction:
FE, Principal Component Analysis: PCA, Negative Binomial: NB, Normalized Mu-
tual Information (NMI); Fluorescence Activated Cell Sorting: FACS; Gene Ontol-
ogy: GO; Gene Ontology Enrichment Analysis: GOEA; Peripheral Blood Mononu-
clear Cells: PBMC; Uniform Manifold Approximation and Projection: UMAP; k-
Nearest Neighbors: kNN.

Declarations
Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Availability of data and software
Ding dataset files were downloaded from Single Cell Portal (accession numbers SCP424 and SCP425), and cell type
metadata is located within the downloaded files. Mereu datasets were downloaded from GEO database (accession
GSE133549), and cell type metadata was obtained under personal request. Triku software and analysis notebooks
are available at https://www.gitlab.com/alexmascension/triku.

Competing interests
The authors declare that they have no competing interests.

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430764doi: bioRxiv preprint 

https://www.gitlab.com/alexmascension/triku
https://doi.org/10.1101/2021.02.12.430764
http://creativecommons.org/licenses/by/4.0/


Ascensión et al. Page 13 of 18

Funding
This work was supported by grants from Instituto de Salud Carlos III (AC17/00012 and PI19/01621), cofunded by
the European Union (European Regional Development Fund/ European Science Foundation, Investing in your
future) and the 4D-HEALING project (ERA-Net program EracoSysMed, JTC-2 2017); Diputación Foral de
Gipuzkoa, and the Department of Economic Development and Infrastructures of the Basque Government
(KK-2019/00006, KK-2019/00093). AMA was supported by a Basque Government Postgraduate Diploma
fellowship (PRE_2020_2_0081), and OIS was supported by a Postgraduate Diploma fellowship from la Caixa
Foundation (identification document 100010434; code LCF/BQ/IN18/11660065).

Author’s contributions
Conceptualization: AMA; Funding Acquisition: MJA-B, AMA, OI-S; Investigation: AMA, OI-S, MJA-B, AI;
Methodology: AMA, OI-S, II; Project Administration: AI, MJA-B; Resources: MJA-B; Software: AMA, OI-S;
Supervision: II, AI, MJA-B; Visualization: AMA, OI-S; Writing - Original Draft Preparation: AMA, OI-S; Writing -
Review and Editing: AMA, OI-S, II, MJA-B, AI.

Acknowledgements
We would like to thank Amaia Elícegui, Ainhoa Irastorza and Paula Vázquez for the assessment of the immune Gene
Ontology terms.

Author details
1Biodonostia Health Research Institute, Computational Biology and Systems Biomedicine Group, Paseo Dr.
Begiristain, s/n, 20014, Donostia-San Sebastian, Spain. 2Tissue Engineering Group, Biodonostia Health Research
Institute, Paseo Dr. Begiristain, s/n, 20014, Donostia-San Sebastian, Spain. 3Intelligent Systems Group, Computer
Science Faculty, University of the Basque Country, Donostia-San Sebastian, Spain. 4Max Planck Institute for
Molecular Biomedicine, Roentgenstr. 20, 48149, Muenster, Germany.

References
1. Trapnell, C.: Defining cell types and states with single-cell genomics. Genome Research 25(10), 1491–1498

(2015). doi:10.1101/gr.190595.115
2. Maclean, A.L., Hong, T., Nie, Q.: Exploring intermediate cell states through the lens of single cells. Current

Opinion in Systems Biology 9, 32–41 (2018). doi:10.1016/j.coisb.2018.02.009
3. Bzdok, D., Altman, N., Krzywinski, M.: Statistics versus machine learning. Nature Methods 15(4), 233–234

(2018). doi:10.1038/nmeth.4642
4. Heimberg, G., Bhatnagar, R., El-Samad, H., Thomson, M.: Low dimensionality in gene expression data enables

the accurate extraction of transcriptional programs from shallow sequencing. Cell Systems 2(4) (2016).
doi:10.1016/j.cels.2016.04.001

5. Saeys, Y., Inza, I., Larrañaga, P.: A review of feature selection techniques in bioinformatics. Bioinformatics
23(19), 2507–2517 (2007). doi:10.1093/bioinformatics/btm344.
https://academic.oup.com/bioinformatics/article-pdf/23/19/2507/452322/btm344.pdf

6. Luecken, M.D., Theis, F.J.: Current best practices in single‐cell rna‐seq analysis: a tutorial. Molecular Systems
Biology 15(6) (2019). doi:10.15252/msb.20188746

7. Brennecke, P., Anders, S., Kim, J.K., Kołodziejczyk, A.A., Zhang, X., Proserpio, V., Baying, B., Benes, V.,
Teichmann, S.A., Marioni, J.C., et al.: Accounting for technical noise in single-cell rna-seq experiments. Nature
Methods 10(11), 1093–1095 (2013). doi:10.1038/nmeth.2645

8. Stuart, T., Butler, A., Hoffman, P., Hafemeister, C., Papalexi, E., Mauck, W.M., Hao, Y., Stoeckius, M.,
Smibert, P., Satija, R., et al.: Comprehensive integration of single-cell data. Cell 177(7) (2019).
doi:10.1016/j.cell.2019.05.031

9. Wolf, F.A., Angerer, P., Theis, F.J.: Scanpy: large-scale single-cell gene expression data analysis. Genome
Biology 19(1) (2018). doi:10.1186/s13059-017-1382-0

10. Townes, F.W., Hicks, S.C., Aryee, M.J., Irizarry, R.A.: Feature selection and dimension reduction for single-cell
rna-seq based on a multinomial model. Genome Biology 20(1) (2019). doi:10.1186/s13059-019-1861-6

11. Svensson, V.: Droplet scrna-seq is not zero-inflated. Nature Biotechnology 38(2), 147–150 (2020).
doi:10.1038/s41587-019-0379-5

12. Andrews, T.S., Hemberg, M.: M3drop: dropout-based feature selection for scrnaseq. Bioinformatics 35(16),
2865–2867 (2018). doi:10.1093/bioinformatics/bty1044

13. Zappi, L., Phipson, B., Oshlack, A.: Splatter: simulation of single-cell rna sequencing data. Genome Biology
18(174) (2017). doi:10.1186/s13059-017-1305-0

14. Ding, J., Adiconis, X., Simmons, S.K., Kowalczyk, M.S., Hession, C.C., Marjanovic, N.D., Hughes, T.K.,
Wadsworth, M.H., Burks, T., Nguyen, L.T., et al.: Systematic comparison of single-cell and single-nucleus
rna-sequencing methods. Nature Biotechnology (2020). doi:10.1038/s41587-020-0465-8

15. Mereu, E., Lafzi, A., Moutinho, C., Ziegenhain, C., Mccarthy, D.J., Álvarez-Varela, A., Batlle, E., Sagar, Grün,
D., Lau, J.K., et al.: Benchmarking single-cell rna-sequencing protocols for cell atlas projects. Nature
Biotechnology (2020). doi:10.1038/s41587-020-0469-4

16. Freytag, S., Tian, L., Lönnstedt, I., Ng, M., Bahlo, M.: Comparison of clustering tools in r for medium-sized 10x
genomics single-cell rna-sequencing data. F1000 Research 7(1297) (2018). doi:10.12688/f1000research.15809.2

17. Lun, A.T.L., Mccarthy, D.J., Marioni, J.C.: A step-by-step workflow for low-level analysis of single-cell rna-seq
data with bioconductor. F1000 Research 5(2122) (2016). doi:10.12688/f1000research.9501.2

18. Senabouth, A., Lukowski, S.W., Hernandez, J.A., Andersen, S.B., Mei, X., Nguyen, Q.H., Powell, J.E.: ascend:
R package for analysis of single-cell rna-seq data. Gigascience 8(8) (2019). doi:10.1093/gigascience/giz087

19. Chen, J., Cheung, F., Shi, R., Zhou, H., Lu, W.: Pbmc fixation and processing for chromium single-cell rna
sequencing. Journal of Translational Medicine 16(1) (2018). doi:10.1186/s12967-018-1578-4

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430764doi: bioRxiv preprint 

http://dx.doi.org/10.1101/gr.190595.115
http://dx.doi.org/10.1016/j.coisb.2018.02.009
http://dx.doi.org/10.1038/nmeth.4642
http://dx.doi.org/10.1016/j.cels.2016.04.001
http://dx.doi.org/10.1093/bioinformatics/btm344
http://arxiv.org/abs/https://academic.oup.com/bioinformatics/article-pdf/23/19/2507/452322/btm344.pdf
http://dx.doi.org/10.15252/msb.20188746
http://dx.doi.org/10.1038/nmeth.2645
http://dx.doi.org/10.1016/j.cell.2019.05.031
http://dx.doi.org/10.1186/s13059-017-1382-0
http://dx.doi.org/10.1186/s13059-019-1861-6
http://dx.doi.org/10.1038/s41587-019-0379-5
http://dx.doi.org/10.1093/bioinformatics/bty1044
http://dx.doi.org/10.1186/s13059-017-1305-0
http://dx.doi.org/10.1038/s41587-020-0465-8
http://dx.doi.org/10.1038/s41587-020-0469-4
http://dx.doi.org/10.12688/f1000research.15809.2
http://dx.doi.org/10.12688/f1000research.9501.2
http://dx.doi.org/10.1093/gigascience/giz087
http://dx.doi.org/10.1186/s12967-018-1578-4
https://doi.org/10.1101/2021.02.12.430764
http://creativecommons.org/licenses/by/4.0/


Ascensión et al. Page 14 of 18

20. Massoni-Badosa, R., Iacono, G., Moutinho, C., Kulis, M., Palau, N., Marchese, D., Rodríguez-Ubreva, J.,
Ballestar, E., Rodriguez-Esteban, G., Marsal, S., et al.: Sampling time-dependent artifacts in single-cell
genomics studies. Genome Biology 21(1) (2020). doi:10.1186/s13059-020-02032-0

21. Villani, A.-C., Satija, R., Reynolds, G., Sarkizova, S., Shekhar, K., Fletcher, J., Griesbeck, M., Butler, A.,
Zheng, S., Lazo, S., et al.: Single-cell rna-seq reveals new types of human blood dendritic cells, monocytes, and
progenitors. Science 356(6335) (2017). doi:10.1126/science.aah4573

22. Zheng, G.X.Y., Terry, J.M., Belgrader, P., Ryvkin, P., Bent, Z.W., Wilson, R., Ziraldo, S.B., Wheeler, T.D.,
Mcdermott, G.P., Zhu, J., et al.: Massively parallel digital transcriptional profiling of single cells. Nature
Communications 8(1) (2017). doi:10.1038/ncomms14049

23. Zhu, L., Yang, P., Zhao, Y., Zhuang, Z., Wang, Z., Song, R., Zhang, J., Liu, C., Gao, Q., Xu, Q., et al.:
Single-cell sequencing of peripheral mononuclear cells reveals distinct immune response landscapes of covid-19
and influenza patients. Immunity 53(3) (2020). doi:10.1016/j.immuni.2020.07.009

24. Hafemeister, C., Satija, R.: Normalization and variance stabilization of single-cell rna-seq data using regularized
negative binomial regression. Genome Biology 20(1) (2019). doi:10.1186/s13059-019-1874-1

25. Lytal, N., Ran, D., An, L.: Normalization methods on single-cell rna-seq data: An empirical survey. Frontiers in
Genetics 11 (2020). doi:10.3389/fgene.2020.00041

26. Nestorowa, S., Hamey, F.K., Sala, B.P., Diamanti, E., Shepherd, M., Laurenti, E., Wilson, N.K., Kent, D.G.,
Göttgens, B.: A single-cell resolution map of mouse hematopoietic stem and progenitor cell differentiation.
Blood 128(8) (2016). doi:10.1182/blood-2016-05-716480

27. Tran, H.T.N., Ang, K.S., Chevrier, M., Zhang, X., Lee, N.Y.S., Goh, M., Chen, J.: A benchmark of batch-effect
correction methods for single-cell rna sequencing data. Genome Biology 21(1) (2020).
doi:10.1186/s13059-019-1850-9

28. Yip, S.H., Sham, P.C., Wang, J.: Evaluation of tools for highly variable gene discovery from single-cell RNA-seq
data. Briefings in Bioinformatics 20(4), 1583–1589 (2018). doi:10.1093/bib/bby011.
https://academic.oup.com/bib/article-pdf/20/4/1583/30119512/bby011.pdf

29. Qiu, P.: Embracing the dropouts in single-cell rna-seq analysis. Nature Communications 11 (2020).
doi:10.1038/s41467-020-14976-9

30. Kvalseth, T.O.: On normalized mutual information: Measure derivations and properties. Entropy 19(11) (2017).
doi:10.3390/e19110631

31. Liu, X., Cheng, H.-M., Zhang, Z.-Y.: Evaluation of Community Detection Methods (2019). 1807.01130
32. Jaccard, P.: The distribution of the flora in the alpine zone. The New Phytologist 11(2) (1912).

doi:10.1111/j.1469-8137.1912.tb05611.x
33. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research

7, 1–30 (2006)

Tables

Table 1 Percentage of ribosomal protein (RBP) and mitochondrial (MT) genes appearing within
the selected genes by each FS method.

Mereu Ding
% RBP % MT % RBP % MT

triku 1.94 0.06 0.07 0.00
m3drop 2.50 0.46 0.62 0.18
nbumi 1.45 0.11 0.44 0.12
scanpy 1.46 0.11 0.36 0.09
std 8.30 0.97 3.57 0.52
scry 6.34 1.14 1.99 0.46
brennecke 0.54 0.06 0.03 0.00

Figures

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430764doi: bioRxiv preprint 

http://dx.doi.org/10.1186/s13059-020-02032-0
http://dx.doi.org/10.1126/science.aah4573
http://dx.doi.org/10.1038/ncomms14049
http://dx.doi.org/10.1016/j.immuni.2020.07.009
http://dx.doi.org/10.1186/s13059-019-1874-1
http://dx.doi.org/10.3389/fgene.2020.00041
http://dx.doi.org/10.1182/blood-2016-05-716480
http://dx.doi.org/10.1186/s13059-019-1850-9
http://dx.doi.org/10.1093/bib/bby011
http://arxiv.org/abs/https://academic.oup.com/bib/article-pdf/20/4/1583/30119512/bby011.pdf
http://dx.doi.org/10.1038/s41467-020-14976-9
http://dx.doi.org/10.3390/e19110631
http://arxiv.org/abs/1807.01130
http://dx.doi.org/10.1111/j.1469-8137.1912.tb05611.x
https://doi.org/10.1101/2021.02.12.430764
http://creativecommons.org/licenses/by/4.0/


Ascensión et al. Page 15 of 18

Ex
pr

es
sio

n

� �� ��

�

�� ��

log10 mean expression

p
er

ce
n

ta
g

e 
o

f z
er

o
s

Figure 1 Distribution of gene expression in three scenarios. There are three main patterns of
expression for any particular gene in a single-cell dataset: a) The gene is expressed evenly across
cells in the dataset, which probably means it does not define any particular cell type. b) A gene
shows an unexpected distribution of zeros, because it is only expressed by a subset of cells. Within
case b, there are two possible patterns. b1) The gene is highly expressed by a subset of
transciptionally different cells (i.e. cells that are not collocalized in the dimensionally reduced
map) and b2) the gene is highly expressed by cells that share an overall transcriptomic profile.
Triku preferentially selects the genes shown in the b2 pattern. When looking at the proportion of
zeros, genes in cases b1 and b2 show an increased proportion of zeros with respect to a, but they
are indistinguishable from each other by that metric.

�

�
�

��

�

�

�

�

����������� �����������
�������	���������

�
������������
����

�
�������������
����

�����������������������������������������
����

�����

��
�
��
��
��
�

� ��
����������

������������	������

�

��
��

� ��
����������

��
��

� �

��
��

� ��
����������

��
������������

��	�����������������������������	������

�

������������	������

�

���������������

��� ���������

�

­����������������������������

� �
��

��

�

��

Figure 2 Graphical abstract of triku workflow. a) DR representation of the gene expression from
the count matrix from a dataset, where each dot represents a cell. b) kNN graph representation
with 3 neighbors. For each cell the k transcriptomically most similar cells are selected (3 in this
example). c1) Considering the graph in b) for each cell with positive expression, the expression of
its k neighbors is summed to yield the kNN distribution in blue. c2) With the distribution of reads
(blue line), the null distribution is estimated by sampling k random cells. d) The null and kNN
distributions of each gene are compared using the Wasserstein distance. e) For each gene, its
distance is plotted against the log mean expression, and divided into w windows (4 in this
example). For each window, the median of the distances is calculated and subtracted to the
distances in that window. f) All corrected distances are ranked and the cutoff point is selected.

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430764doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430764
http://creativecommons.org/licenses/by/4.0/


Ascensión et al. Page 16 of 18

Figure 3 Comparison of NMI for FS methods on artificial datasets. Barplots of the NMI for all
FS methods with different artificial datasets, using the top 250 (top) and 500 (bottom) features
of each FS method. The probability of the selected genes being differentially expressed between
clusters (de.prob) is shown in the X axis. Higher NMI values mean better recovery of the cell
populations. Note that in category all, all features are selected, not the top 250 or 500, therefore
their NMI values are the same in both graphs.

Figure 4 NMI for annotated cell types in Mereu and Ding datasets. Barplots of Silhouette
coefficient for Mereu (top) and Ding (bottom) datasets. Each barplot represents the mean over 5
runs, and the vertical bar is the standard deviation. The plot on the left is a critical difference
diagram, where each horizontal bar represents the mean rank for all datasets. If two or more bars
are linked by a vertical bar, the mean ranks for those FS methods are not significantly different
(Quade test, α = 0.05).

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430764doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430764
http://creativecommons.org/licenses/by/4.0/


Ascensión et al. Page 17 of 18

Figure 5 Silhouette coefficients for annotated cell types in Mereu and Ding datasets. Barplots
of Silhouette coefficient for Mereu (top) and Ding (bottom) datasets. Each barplot represents the
mean of 5 seeds, and the vertical bar is the standard deviation. The plot on the left is a critical
difference diagram, where each horizontal bar represents the mean rank for all datasets and all
seeds. If two or more bars are linked by a vertical bar, the mean ranks for those FS methods are
not significantly different (Quade test, α = 0.05).

Figure 6 Heatmaps of overlap of features between pairs of methods. For each pair of methods,
the value represents the proportion of features that are shared between the two methods. The
number of genes selected in each method is the automatic cutoff by triku.

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430764doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430764
http://creativecommons.org/licenses/by/4.0/


Ascensión et al. Page 18 of 18

triku
scanpy
std
scry
brennecke
m3drop
nbumi

Figure 7 Barplot of p-values of GOEA. Each bin represents the number of features selected for
each method, in Mereu et al. mouse Dropseq dataset. The y value is the -log10 adjusted p-value
for the best 25 ontologies. On the bottom, the bar plot shows the names of the ontology terms for
the case with the best 1000 features. In immune datasets, gray dots at the left of each term
represent that that term is directly-related to an immune process. Non-dotted terms refer to more
general processes that may or may not be related to immune processes.

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430764doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430764
http://creativecommons.org/licenses/by/4.0/

	Abstract
	Background
	Results
	triku efficiently recovers cell populations present in sc-RNAseq datasets
	NMI
	Silhouette coefficient

	Genes selected by different FS methods show limited overlap
	triku selects genes that are biologically relevant
	Selection of locally-expressed genes
	Avoidance of mitochondrial and ribosomal genes
	Selection of genes based on gene ontologies


	Discussion
	Methods
	Artificial and benchmarking datasets
	FS methods
	FS and dataset preprocessing
	NMI calculation in artificial and benchmarking datasets
	Silhouette coefficient in benchmarking datasets
	Overlap between gene lists
	Performance of gene selection and locality measures
	Proportion of ribosomal and mitochondrial genes
	GO enrichment analysis
	Ranking and CD


	Abbreviations