A like-for-like comparison of lightweight-mapping pipelines for single-cell RNA-seq data pre-processing


A like-for-like comparison of lightweight-mapping
pipelines for single-cell RNA-seq data pre-processing

Mohsen Zakeri1, Avi Srivastava2, Hirak Sarkar3, and Rob Patro1,�

1Department of Computer Science and Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA
2New York Genome Center and NYU Center for Genomics and Systems Biology, New York City, NY, USA

3Harvard Medical School, Boston, Massachusetts, USA

Abstract: Recently, Booeshaghi and Pachter (1) published a
benchmark comparing the kallisto-bustools pipeline (2) for single-cell
data pre-processing to the alevin-fry pipeline (3). Their benchmarking
adopted drastically dissimilar configurations for these two tools, and
overlooked the time- and space-frugal configurations of alevin-fry
previously benchmarked by Sarkar et al. (3). In this manuscript,
we provide a small set of modifications to the benchmarking scripts
of Booeshaghi and Pachter that are necessary to perform a like-for-like
comparison between kallisto-bustools and alevin-fry. We also address
some misuses of the alevin-fry commands and include important data
on the exact reference transcriptomes used for processing1. Using
the same benchmarking scripts of Booeshaghi and Pachter (1), we
demonstrate that, when configured to match the computational com-
plexity of kallisto-bustools as closely as possible, alevin-fry processes
data faster (∼2.08 times as fast on average) and uses less peak memory
(∼0.34 times as much on average) compared to kallisto-bustools, while
producing results that are similar when assessed in the manner done
by Booeshaghi and Pachter (1). This is a notable inversion of the
performance characteristics presented in the previous benchmark.

RNA-seq, single-cell RNA-seq, quantification

Correspondence: rob@cs.umd.edu

Introduction
Alevin-fry (3) is a new pipeline for single-cell RNA-seq

pre-processing, which is currently being developed. While there are
many relevant design decisions and performance implications we
hope to discuss in detail in the preprint describing alevin-fry, one
crucial aspect motivating the development of the alevin-fry pipeline
is to allow testing the effect of different algorithmic choices on
the gene expression estimates eventually produced by the pipeline.
For example, alevin-fry exposes both a selective-alignment (4, 5)
mode and pseudoalignment (6) with structural constraints mode
for mapping reads. Further, after read mapping, the alevin-fry tool
exposes multiple algorithms for generating a permit list (sometimes
called a “whitelist”) of barcodes corresponding to what are believed to
be high-confidence cells, and for resolving UMIs into counts. When
applying any of the probabilistic methods it implements for UMI
resolution, alevin-fry also allows assessing quantification uncertainty
in the estimated counts via a bootstrapping procedure that can output
either the bootstrap samples, or their summary statistics.

Exploring these different algorithms in a unified framework
is an important task to optimize the pre-processing of single-cell

1The authors later updated their repository to contain a link to a deposition with
the reference data they used, but that information was not available in the original
repository commit d0549e0351c4875428b153c7804f21eed7fa82eb at
the time the preprint was published, and our framework was in place by the time
this update was made. This is further described in "Methods."

sequencing data, and there may not be a single algorithm that is best
suited to all different single-cell technologies. For example, while the
benefits of selective-alignment and the use of an expanded index in the
processing of bulk RNA-seq data have been highlighted in a growing
number of scenarios (4, 7, 8), these tradeoffs have not been thoroughly
explored in the context of single-cell (and particularly tagged-end)
data. Given that the majority of common tagged-end single-cell
analyses are performed at the gene rather than transcript level, and in
light of the extensive use of techniques like unique molecular identifier
(UMI) tagging, it may be the case that different tradeoffs in mapping
specificity versus speed are appropriate or desirable — indeed, an
argument for simpler but faster methods in this space has been made
by Melsted et al. (2) and in subsequent work by the same authors.
Likewise, the effect of different approaches for UMI error correction
and UMI resolution (and how they may interact with different read
mapping strategies) has not been thoroughly evaluated across many
different single-cell technologies, to understand if, and when, different
approaches may lead to different results in downstream analysis.

In the alevin-fry poster (3), we described the results of
benchmarking STARsolo (9), kallisto-bustools (2) and alevin-fry (3),
running the latter tool with a number of different configurations of read
mapping algorithm and UMI resolution algorithm. We observed that
the “fast” configurations of alevin-fry tested in (3), which adopt some
of the major simplifications argued for by Melsted et al. (2), are faster
than kallisto-bustools, and that all of the configurations tested there use
less peak memory. The recent preprint of Booeshaghi and Pachter (1)
omits all of the fast and memory-frugal configurations tested in Sarkar
et al. (3), and instead compares the time and memory requirements of
only the most computationally- and memory-intensive configuration
of the alevin-fry pipeline to the kallisto-bustools pipeline.

We are encouraged that others in the community are eager to try
out new tools like alevin-fry for the pre-processing of single-cell data,
and we recognize that fairly comparing new pipelines to existing ones
can be a difficult task in the absence of sufficient documentation and
tutorials. Admittedly, we have not yet produced sufficient tutorials
or documentation for alevin-fry given that our efforts have been
in continuing to develop the tool itself. At the same time, it is not
possible to “faithfully” follow recommended practice (1) when the
best practices have not yet been established for a fledgling method;
in such a case, benchmarking multiple configurations (especially
those that have already been tested in previous benchmarks (3))
may be a reasonable approach. Spurred by Booeshaghi and Pachter
(1), we have now created a simple-to-follow tutorial for speed-
optimized single-cell pre-processing using alevin-fry (https://
combine-lab.github.io/alevin-fry-tutorials/
2021/running-alevin-fry-fast/). Here, we benchmark

Zakeri et al. | bioRχiv | February 10, 2021 | 1–7

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.10.430656doi: bioRxiv preprint 

https://combine-lab.github.io/alevin-fry-tutorials/2021/running-alevin-fry-fast/
https://combine-lab.github.io/alevin-fry-tutorials/2021/running-alevin-fry-fast/
https://combine-lab.github.io/alevin-fry-tutorials/2021/running-alevin-fry-fast/
https://doi.org/10.1101/2021.02.10.430656
http://creativecommons.org/licenses/by-nc-nd/4.0/


this workflow for alevin-fry (using the same versions of salmon
(1.4.0) and alevin-fry (0.1.0)2, and kallisto(0.46.2)-bustools(0.40.0)
adopted in (1)).

Methods
In order to assess how the results of the benchmark proposed

by Booeshaghi and Pachter (1) change when a like-for-like com-
parison between alevin-fry and kallisto-bustools is carried out, we
start with the experimental framework introduced in that preprint
and describe here the necessary modifications to the benchmarking
scripts that were made.

The first difference we note is the versions of
the references used for quantification. The reposi-
tory provided by Booeshaghi and Pachter (1) (https:
//github.com/pachterlab/BP_2021/ at commit
d0549e0351c4875428b153c7804f21eed7fa82eb,
which was the version available when the preprint was published)
lacked both the specific reference sequences used and the URLs
from which these reference sequences were obtained3. The paper
refers the reader to (2), wherein the relevant metadata is contained
in an Excel file, which still lacks adequate specificity (e.g. it lists
the Caenorhabditis elegans transcriptome used only as “modified
ws260”). Thus, in this manuscript, we have adopted the following
procedure for normalizing the reference transcriptomes. For human,
mouse, and combined human/mouse data, we have used the latest
reference bundles provided by 10x Genomics as of Jan 29, 2021
(named as “2020-A”), and extracted the transcriptomes from the
provided genomes and respective GTF files using gffread (10). For all
other organisms, we have adopted the latest Ensembl (11) reference
transcriptomes for each organism. For Danio rerio, C. elegans,
Drosophila melanogaster and Rattus norvegicus this is from release
102; for Arabidopsis thaliana it is from release 49.

These updated reference transcriptomes lead, in some cases,
to quite different memory usages from those reported in (1). For
the alevin-fry pipeline, this is largely explained by the fact that we
index the same reference sequences as used for kallisto (6) (that
is, we do not compare indexing the transcriptome in kallisto to
indexing the transcriptome and genome in salmon (12)). However,
the increased memory usage of kallisto-bustools likely stems
from variation in the specific reference transcriptomes used. For
example, using the current 10x reference transcriptome for GRCh38
(2020-A, from https://cf.10xgenomics.com/supp/
cell-exp/refdata-gex-GRCh38-2020-A.tar.gz),
the peak memory usage of kallisto becomes ∼7GB during mapping
( Fig. 1), rather than the ∼4GB reported in (1) (while the peak memory
usage of salmon during mapping reaches ∼1.7GB). Our version of
the repository contains a file called gather_refs.sh with the
commands used to obtain these reference transcriptomes.

Furthermore, the following additional modifications have been
made to the benchmarking, which otherwise remains the same

2We do not use the modified version of alevin-fry 0.1.0 that Booeshaghi and
Pachter (1) altered to convert encoded barcode identifiers in the generate-
permit-list step into character strings, but instead the tagged 0.1.0 release with
the rand crate additionally pinned at 0.7.3 to enable compilation.
3A subsequent commit included a link to a deposition of the references they used,

but our framework was in place and benchmarking underway by the time that commit
was made. Further, it is informative to see how even modest changes in the specific
reference used can lead to large changes in the memory requirements of a tool.

as was performed in (1). We run alevin with the --sketch
flag when producing the mapping file (called a RAD file); this
uses pseudoalignment (6) with structural constraints rather than
selective-alignment4. We do not consider a configuration of either
kallisto-bustools or alevin-fry that corrects to or uses the full 10x
permit list. In the original benchmark, the authors used alevin-fry
with the -b flag, treating the full list of 10x barcodes as a filtered
permit list; however the -b flag is meant to accept a list of barcodes
corresponding to high-confidence cells that have passed external
filtering. Passing the full 10x barcode list to the -b flag is neither
intended nor currently supported in alevin-fry (though we are
planning to add this functionality), which we have now clarified
in the documentation — as we had previously clarified this same
point in the alevin (13) documentation. We have both methods
generate their own permit list, and perform quantification on their
corresponding filtered cells. We pass the -d fw flag to alevin-fry’s
generate-permit-list step rather than -d either, as
the RAD file records the orientation of each read with respect to the
target transcript, and all the technologies evaluated here expect the
second read to map to the transcriptome in the forward orientation.
Mappings in an unexpected orientation should be filtered. We have
used the cr-like resolution strategy when invoking the alevin-fry
quant command; this implements a simple but fast UMI resolution
algorithm that breaks ties by UMI frequency alone and discards reads
for which a most frequent unique gene cannot be determined.

We have also removed the step of the pipeline that converts the
RAD (respectively BUS) file into a text format. The binary to text
conversion may be useful for debugging purposes, but is not a standard
or necessary part of these pre-processing pipelines, as the BUS and
RAD files are primarily intended for the storage and processing of
data rather than human inspection. Further, contrary to the supposition
of Booeshaghi and Pachter (1), this conversion is likely a case where
language choice, and usage of standard language idioms, leads to
different performance characteristics. Unlike C++, Rust places the
standard output stream behind a lock to ensure threadsafe access, a
decision that imposes a cost for programs that are heavy on writing
to the standard output stream in a line-oriented manner when standard
idioms are used. While we do not view the optimization of the
command that dumps a RAD file to text as particularly high-priority,
we will nonetheless explore making use of unsafe C system calls in
this command until a comparable solution is exposed natively in Rust.

The benchmarking scripts used to produce the results described
here can be found at https://github.com/COMBINE-lab/
BP_2021-lfl (these are the same as the benchmarking scripts of
https://github.com/pachterlab/BP_2021 at commit
d0549e0351c4875428b153c7804f21eed7fa82eb with
the modifications described above). We encourage users to run these
benchmarks for themselves, and welcome feedback and suggestions.

Despite the additions and modifications we describe here,
neither our repository nor the original repository of Booeshaghi
and Pachter (1) enable full reproducibility without non-trivial effort
or investigation. One complication is that there existed multiple
candidate scripts for performing specific steps of the data analysis
within different directories of the repository, and none had complete

4This sketch mode was evaluated in detail in the poster of Sarkar et al. (3), where
its scalability was assessed and its mappings were paired with a number of different
UMI resolution strategies.

2 | bioRχiv Zakeri et al. | lightweight single-cell RNA-seq pre-processing

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.10.430656doi: bioRxiv preprint 

https://github.com/pachterlab/BP_2021/
https://github.com/pachterlab/BP_2021/
https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz
https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz
https://github.com/COMBINE-lab/BP_2021-lfl
https://github.com/COMBINE-lab/BP_2021-lfl
https://github.com/pachterlab/BP_2021
https://doi.org/10.1101/2021.02.10.430656
http://creativecommons.org/licenses/by-nc-nd/4.0/


$ kallisto index

$ salmon index

Purpose: Build mapping index

command

0

5

10

15

Ti
m

e 
[m

in
]

time

kallisto
alevin

0

5

10

M
em

or
y 

[G
B]

memory

kallisto
alevin

$ kallisto bus

$ salmon alevin --rad --sketch

Purpose: Perform mapping

0

20
Ti

m
e 

[m
in

] kallistoalevin

0

2

4

6

M
em

or
y 

[G
B]

kallisto
alevin

BUS file

RAD file

Purpose: Store result of mapping

0

5

10

Si
ze

 [G
B]

kallisto
alevin

$ bustools sort + bustools whitelist
$ alevin-fry generate-permit-list
             --knee-distance
Purpose: Error correct of barcodes

0

2

4

Ti
m

e 
[m

in
] kallistoalevin

0

2

4

M
em

or
y 

[G
B]

kallisto
alevin

$ bustools correct + bustools sort

$ alevin-fry collate

Purpose: Enable streaming

0.0

2.5

5.0

7.5

Ti
m

e 
[m

in
] kallistoalevin

0

2

4

M
em

or
y 

[G
B]

kallisto
alevin

$ bustools count

$ alevin-fry quant

Purpose: Generate count matrix

0.0

0.5

1.0

Ti
m

e 
[m

in
] kallistoalevin

0

1

2

M
em

or
y 

[G
B]

kallisto
alevin

$ kallisto + bustools pipeline

$ salmon + alevin pipeline

Purpose: Process single cell data

107 108
Number of reads

0

20

40

Ti
m

e 
[m

in
] kallistoalevin

107 108
Number of reads

0

2

4

6

M
em

or
y 

[G
B]

kallisto
alevin

mouse-SRR8599150_v2

worm-SRR8611943_v2

mouse-SRR6998058_v2

human_mouse-hgmm1k_v3

human-pbmc1k_v3

human_mouse-hgmm1k_v2

mouse-heart1k_v3

mouse-SRR8206317_v2

mouse-heart1k_v2

human-SRR8524760_v2

rat-SRR7299563_v2

fly-SRR8513910_v2

zebrafish-SRR6956073_v2

arabidopsis-SRR8257100_v2

human-SRR8327928_v2

mouse-EMTAB7320_v2

mouse-neuron10k_v3

mouse-SRR8639063_v2

human-pbmc10k_v3

human_mouse-hgmm10k_v3

mouse-SRR8599150_v2

worm-SRR8611943_v2

mouse-SRR6998058_v2

human_mouse-hgmm1k_v3

human-pbmc1k_v3

human_mouse-hgmm1k_v2

mouse-heart1k_v3

mouse-SRR8206317_v2

mouse-heart1k_v2

human-SRR8524760_v2

rat-SRR7299563_v2

fly-SRR8513910_v2

zebrafish-SRR6956073_v2

arabidopsis-SRR8257100_v2

human-SRR8327928_v2

mouse-EMTAB7320_v2

mouse-neuron10k_v3

mouse-SRR8639063_v2

human-pbmc10k_v3

human_mouse-hgmm10k_v3

Fig. 1. The time and memory used by the relevant steps of the alevin-fry and kallisto-bustools pipelines for pre-processing the 20 diverse tagged-end single-cell RNA-seq datasets used in (1).
The plots are generated using the analysis/notebooks/memtime.ipynb notebook.

or adequate instructions for generating the plots. For example,
there exist multiple versions of the run_gsea_bar_full.R
script for performing gene set enrichment analysis, which
each required building certain sub-directories in the main di-
rectory of the repository in order to be executed without any
errors. Eventually, we used the run_gsea_bar_full.R
script located within analysis/notebooks rather than
the one located in analysis/scripts/code, since the
latter version had hard-coded paths and no central way to
uniformly and globally change the working directory (e.g.
https://github.com/pachterlab/BP_2021/blob/

e87e98713bf7967d2fa22716dbbebd10609c1dd9/
analysis/scripts/code/gsea_bar_full.R#L39).
After providing the required data, we ran mkdata.py and mk-
plot.py within the analysis/scripts/code directory
to prepare the plots for comparing the gene count estimates provided
by both tools. Furthermore, since we benchmarked an unmodified
version of alevin-fry, we had to modify the mkdata.py script
to load a single column file as alevin’s permit list (which we took
from the quants_mat_rows.txt file accompanying each
cell by gene count matrix), and also to remove the lines which were
intended for dealing with decoy aware results. For producing the

Zakeri et al. | lightweight single-cell RNA-seq pre-processing bioRχiv | 3

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.10.430656doi: bioRxiv preprint 

https://github.com/pachterlab/BP_2021/blob/e87e98713bf7967d2fa22716dbbebd10609c1dd9/analysis/scripts/code/gsea_bar_full.R#L39
https://github.com/pachterlab/BP_2021/blob/e87e98713bf7967d2fa22716dbbebd10609c1dd9/analysis/scripts/code/gsea_bar_full.R#L39
https://github.com/pachterlab/BP_2021/blob/e87e98713bf7967d2fa22716dbbebd10609c1dd9/analysis/scripts/code/gsea_bar_full.R#L39
https://doi.org/10.1101/2021.02.10.430656
http://creativecommons.org/licenses/by-nc-nd/4.0/


Fig. 2. A comparison of the resulting count matrices obtained from alevin-fry and kallisto-bustools, as run in this manuscript, for the pbmc_10k_v3 dataset. Panels A-H have the same inter-
pretation as in Fig. 2 of Booeshaghi and Pachter (1), and compare the count matrices at the gene and cell levels. The plots are generated using the analysis/scripts/mkplots.py,
analysis/scripts/mkdata.py and analysis/notebooks/run_gsea_bar_full.R scripts.

4 | bioRχiv Zakeri et al. | lightweight single-cell RNA-seq pre-processing

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.10.430656doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.10.430656
http://creativecommons.org/licenses/by-nc-nd/4.0/


time and memory plots, we used the memtime.ipynb notebook
located in analysis/notebooks after making the required
modifications to compare the time and memory of the relevant steps
in both tools. While we have addressed any issues as we encountered
them, and have documented how we have run this pipeline, we have
not undertaken the effort of fully removing all barriers to “trivial”
reproducibility, as it is outside the scope of the current work.

Finally, we also note that the benchmark of Booeshaghi and
Pachter (1) focuses only on comparing kallisto-bustools to a single
configuration of alevin-fry, excluding other relevant tools like
STARsolo (9), which is a fast, flexible, and popular tool for the
pre-processing of tagged-end single-cell data. The benchmark also
omits another recently-published, lightweight-mapping based tool,
Raindrop (14), from the benchmark (though, seemingly, this would
currently have to be restricted to 10x chromium v2 data). A more
extensive benchmark, including other tools, is likely to provide
greater value to the broader community. However, the primary focus
in this manuscript is to highlight the effect on the original benchmark
that results from running the tools considered therein in a like-for-like
configuration. Thus, we have not added STARsolo or Raindrop to
the current benchmark, though it may provide a useful perspective
on these tools to the broader community.

All experiments were performed on a server with dual Intel Xeon
CPUs (E5-2699 v4), each with 22 cores clocked at 2.20 GHz, 512
GB of 2.4GHz DDR4 memory, and an array of 8 3.6TB Toshiba
MG03ACA4 HDDs configured as independent disks.

Results

Fig. 1 shows the overall time and peak memory taken by both
alevin-fry and kallisto-bustools when pre-processing the 20 diverse
tagged-end single-cell 10x chromium datasets evaluated in (1). Alevin-
fry is faster than kallisto-bustools on all datasets (between ∼1.28 and ∼
2.87 times as fast, and ∼2.08 times as fast on average). Also, alevin-fry
uses less peak memory than kallisto-bustools on 19 of the 20 datasets
tested, with the peak memory of kallisto-bustools ranging from ∼91%
of that used by alevin-fry to ∼8 times that used by alevin-fry (kallisto-
bustools used ∼2.92 times as much peak memory on average).

In addition to the overall runtime and peak memory usage
(bottom row of Fig. 1), the figure also shows the time and memory
required for the main steps of the pipelines. While there is not a perfect
correspondence between the specific set of commands used by the two
tools, the fundamental steps include mapping the reads, generating a
permit-list of valid filtered barcodes (with each method using its own
algorithm to infer the filtered set of corrected barcodes), rearranging
the mapping information for all records having the same corrected
barcode so that they are adjacent in the resulting file, and applying
a UMI resolution algorithm to obtain a gene-by-cell count matrix.

Looking across the datasets, some general characteristics emerge.
If one evaluates the ratio of the total runtime of kallisto-bustools to the
total runtime of alevin-fry, one observes that alevin-fry is faster in the
processing of every dataset, with a speedup (i.e. runtime of kallisto-
bustools/ runtime of alevin-fry ratio) ranging from ∼1.28 up to ∼2.87
(with an average runtime speedup of ∼2.08). If one evaluates the same
ratio in terms of peak memory usage instead of total runtime, a similar
trend emerges. In 19 of the 20 datasets tested here, the kallisto-bustools
pipeline exhibits a higher peak memory usage than alevin-fry. In the
mouse-SRR8639063_v2 dataset, kallisto-bustools’ peak memory

usage reached 91% of that of alevin-fry (with alevin-fry requiring a
maximum of ∼4.6GB5 of memory and kallisto-bustools requiring ∼
4.2GB of memory). In every other dataset, kallisto-bustools used more
peak memory than alevin-fry, with the kallisto-bustools pipeline using
at most ∼8 times as much peak memory and, on average, ∼2.92 times
as much peak memory as alevin-fry. The peak memory usage of both
tools reached their respective maxima on the hybrid human-mouse
dataset, where the peak memory usage of kallisto-bustools is ∼7.5GB
(which occurs during pseudoalignment) and the peak memory usage
of alevin-fry is ∼5GB (which occurs during mapping record collation).

While the step of indexing only has to be done once per reference
sequence (i.e. with each new organism, or when the reference anno-
tation is updated), we also evaluate the time and memory required to
build all indices used in these experiments. This is important, since the
peak memory usage during indexing may dictate whether the index
can be built on the same machine used for subsequent quantification,
or if it must be constructed on a machine with more memory. Fig. 1
shows that, as with the pre-processing of reads, alevin-fry is faster
and uses less memory for index construction for each reference con-
sidered. The slowest index construction for both tools was for the
human-mouse combined transcriptome, where the kallisto-bustools
pipeline took ∼ 15.6 minutes and required ∼ 10.2GB of memory,
while indexing this transcriptome with the alevin-fry pipeline took ∼3
minutes (a ∼ 5.2 times speedup compared to kallisto) and ∼ 1.3GB
of memory (∼13% of the the memory usage of kallisto). When eval-
uating the time differences, it is important to note that the alevin-fry
pipeline can make use of multiple threads when indexing (here we
used 10 as in (1)), while the indexing in the kallisto-bustools pipeline is
currently restricted to a single thread. The memory usage in the alevin-
fry pipeline does not vary considerably with the number of threads
used during indexing. The peak memory reduction during indexing
and mapping in alevin-fry arise primarily due to alevin-fry’s use of the
pufferfish (15) index, while a number of different factors at both the im-
plementation and design level contribute to the runtime improvements.

When assessing the same summary statistics and count com-
parisons considered by Booeshaghi and Pachter (1) to evaluate the
similarity of the resulting quantifications, we find that the cell by gene
count matrices produced by both tools are similar under these metrics
(Fig. 2). As is expected, these evaluations show that the data sum-
maries are more similar than in the configuration tested in (1). In that
comparison, Booeshaghi and Pachter (1) claim that differences in re-
sulting gene expressions between the configurations of the tools tested
therein are “irrelevant for downstream analysis” (presumably implying
all possible downstream analyses). It is not clear how these compar-
isons justify such a sweeping claim. Yet, while these comparisons do
not necessarily imply that no differences will manifest in downstream
processing of the alevin-fry quantified data compared to the
kallisto-bustools quantified data, they do suggest that the differences
that may arise under this configuration of alevin-fry are likely to be
less extreme than differences that may arise in the configuration tested
in (1). We also note that, while Booeshaghi and Pachter (1) observe
no significant gene sets found when comparing the quantifications of
kallisto-bustools and the configuration of alevin-fry that they tested on
the pbmc_10k_v3 data, we do observe some genes as detected.

5The alevin-fry peak memory usage in this dataset happens during the collate
step, which can easily be made to operate within a strict desired RAM budget; a
feature on which we are currently working.

Zakeri et al. | lightweight single-cell RNA-seq pre-processing bioRχiv | 5

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.10.430656doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.10.430656
http://creativecommons.org/licenses/by-nc-nd/4.0/


This appears to stem from our use of the most recent release of Seu-
rat (16) (currently version 4.0.0), which modified the default behavior
of the FindMarkers() function to “prefilter genes and report
fold change using base 2, as is commonly done in other differential ex-
pression packages, instead of natural log” (https://satijalab.
org/seurat/articles/v4_changes.html).

For the purposes of keeping this benchmark like-for-like, we have
run both tools in configurations where they must generate their own
permit list without the input of an external set of valid but unfiltered
(i.e. possible) barcodes. We choose this configuration for two reasons.
First, unlike 10x chromium v2 and v3 experiments, many single-cell
technologies supported by both pipelines do not provide an external
list of barcodes, and so, this is indicative of the general case where
pipelines must provide their own method for generating a permit list
of barcodes. Second, since alevin-fry does not yet support unfiltered
external permit lists, there is not a way to fairly compare it against
a method that can take advantage of this information. We consider it a
development priority to support this feature in alevin-fry for single-cell
technologies where this information is available. Nonetheless, we
tested the effect that requiring a second sort of a (sorted and filtered
permit list corrected) BUS file had on the overall runtime, compared
to correcting the initial unsorted file with an unfiltered permit list,
and processing the unfiltered file in the remainder of the pipeline.
To do this, we also ran a configuration of kallisto-bustools where
the raw BUS file was first corrected with the external, unfiltered 10x
permit list, then the file was sorted, then a permit list was extracted
from this sorted file (to allow subsequent filtering of an unfiltered
count matrix), and finally, the count step was performed. This process
results in an unfiltered matrix which may then be filtered using the
generated permit list. Alevin-fry was, on average, ∼ 2.15 times as
fast as kallisto-bustools under this configuration, rather than ∼2.08 as
fast; in other words, the runtime costs of the different kallisto-bustools
configurations were very similar for these data.

Finally, though we have retained the parallelism settings used in
the original benchmark for the purposes of the main results reported
in this manuscript, we also evaluated, on one of the larger datasets
(pbmc_10k_v3), how both tools scale to a higher thread count of
16. In this case, we found that the total runtime for alevin-fry dropped
from 19.7 minutes with 10 threads to 15.6 minutes with 16 threads,
and the total runtime for kallisto-bustools went from 38.8 minutes with
10 threads to 37.8 minutes with 16 threads. So, in this case, increasing
the thread count by 6 lead to a ∼1.26 times increase in the speed of
alevin-fry and a ∼1.03 times increase in the speed of kallisto-bustools.

Conclusions

We find that when alevin-fry is benchmarked in a like-for-like
comparison with kallisto-bustools, it is both faster and uses less
memory while producing similar results. Of course, in this
configuration, alevin-fry (3), unlike the original alevin (13) or other
configurations of alevin-fry, is adopting some of the computational
simplifications for which Melsted et al. (2) argue, and the similarity
of these results is fully expected.

In their manuscript, Booeshaghi and Pachter (1) repeatedly refer
to alevin-fry as a “reimplementation” of bustools. This characteriza-
tion is untrue both in detail and in spirit. The alevin-fry tool has not
been designed to reimplement the bustools commands or interface, or
specifically to match the implementation of bustools. It has been de-

signed as a way to allow the exploration and configuration, in a unified
framework, of a variety of different algorithms for single-cell data pre-
processing, many of which don’t currently exist in kallisto-bustools.
For example, it implements multiple different methods for generating
permit lists, and multiple different algorithms for UMI resolution,
including some that correct for UMI sequencing errors, resolve multi-
gene UMIs by parsimony, probabilistically (which kallisto-bustools
subsequently implemented after it was introduced in alevin (13)), or
both, as well as the functionality to quantify the uncertainty of proba-
bilistic resolution through bootstrapping. Of course it is the case that, in
designing such a tool after the work of Melsted et al. (2) was published
and in widespread use, one should learn from the design decisions of
that work that proved to be effective and useful. The main such design
decisions in this case are first, the separation of the read mapping from
the subsequent processing of barcodes and UMIs via intermediate files
(as is also done internally by STARsolo), and second, the arrangement
of mapping records relevant to a given corrected barcode subsequently
so that cells can be processed in an effectively independent manner.
The alevin-fry tool adopts these choices described by Melsted et al. (2),
as we see no reason to avoid relevant design decisions demonstrated
by prior tools, that seem to work, when building new tools. We look
forward to discussing these design decisions, as well as some novel
design choices we have made, when we publish the alevin-fry preprint.

We have not completed, to our satisfaction, a thorough
investigation of the effect of different mapping approaches, permit
list generating methods and UMI correction and resolution strategies
provided by alevin-fry across a wide range of tagged-end single-cell
RNA-seq data and technologies (which have, in general, distinct
characteristics compared to both bulk RNA-seq data and full-length
single-cell RNA-seq data). Once we have adequately explored this
algorithmic parameter space, we plan to publish a full manuscript
describing the design and implementation of the alevin-fry pipeline,
highlighting where it derives design decisions from kallisto-bustools
and other tools, and where it differs, as well as the effect that different
configurations have on runtime and memory performance, the raw
count matrices and common downstream analyses, and how those
effects may vary in different single-cell technologies.

We have described in this manuscript, and demonstrated in the
associated code repository and tutorial, how alevin-fry can optionally
be configured so as to match the computational complexity of kallisto-
bustools as closely as possible. In this like-for-like comparison of
these two pipelines, we have shown that, while the estimated gene
expressions are similar — at least when assessed in the manner done
by Booeshaghi and Pachter (1) — the runtime and memory character-
istics are not. Rather, while using the same benchmarking framework
as Booeshaghi and Pachter (1), instead of alevin-fry taking ∼3 times
as long to pre-process data (on average) than kallisto-bustools and
using many times as much memory in the worst case (1), we find that
alevin-fry is both faster and uses less memory than kallisto-bustools.
Specifically, alevin-fry is on average ∼2.08 times as fast as kallisto-
bustools and consumes, on average, only ∼0.34 as much peak memory.
According to the formulae used in the jupyter (17) notebooks of Booe-
shaghi and Pachter (1) to estimate costs for performing processing
on Amazon Web Services compute instances, pre-processing the
pbmc_10k_v3 dataset using the configuration of the alvein-fry
pipeline we have tested in this manuscript costs $0.05, which is half
of the cost of running the kallisto-bustools pipeline ($0.1). Further-

6 | bioRχiv Zakeri et al. | lightweight single-cell RNA-seq pre-processing

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.10.430656doi: bioRxiv preprint 

https://satijalab.org/seurat/articles/v4_changes.html
https://satijalab.org/seurat/articles/v4_changes.html
https://doi.org/10.1101/2021.02.10.430656
http://creativecommons.org/licenses/by-nc-nd/4.0/


more, if one needs to first build the reference index, the peak runtime
memory for kallisto-bustools exceeds 8 GB and so a more expensive
instance would be necessary. In that case, building the reference
index and processing the pbmc_10k_v3 would cost $0.23 using
kallisto-bustools (while the cost would remain at $0.05 using alevin-
fry, even if index construction is included). The cost of the alevin-fry
pipeline we have benchmarked in this manuscript is 39 times smaller
than what was reported in (1) for this dataset, while the cost of the
kallisto-bustools pipeline is twice as large due to the increased memory
requirements when using the newer human transcriptome annotation.
If one is comfortable with the simplifying assumptions being made,
the performance profiles observed in this manuscript provide a com-
pelling case for the use of this configuration of alevin-fry for the rapid
and lightweight pre-processing of single-cell RNA-seq data.

Finally, it is important to note that alevin-fry is still undergoing
active development and improvement, which is, in part, why no full
preprint has yet been published describing the tool and underlying
methods and implementation in detail. Of course, one can use the
tool today to obtain gene expression counts for single-cell data, but
we expect that alevin-fry will continue to advance and expand to offer
more capabilities and to be further optimized.

Disclosure.RP is a co-founder of Ocean Genomics Inc.

Funding. This work is supported by the US National Institutes
of Health [R01HG009937], and the National Science Foundation
[CCF-1750472, CNS-1763680]. The funders had no role in this
research, or the decision to publish.

References
1. A. Sina Booeshaghi and Lior Pachter. Benchmarking of lightweight-

mapping based single-cell RNA-seq pre-processing. bioRxiv, 2021. doi:
10.1101/2021.01.25.428188.

2. Páll Melsted, A. Sina Booeshaghi, Fan Gao, Eduardo Beltrame, Lambda
Lu, Kristján Eldjárn Hjorleifsson, Jase Gehring, and Lior Pachter. Modular
and efficient pre-processing of single-cell RNA-seq. bioRxiv, 2019. doi:
10.1101/673285.

3. Hirak Sarkar, Avi Srivastava, Mohsen Zakeri, Scott Van Buren, Naim U Rashid,
Michael Love, and Rob Patro. Accurate, efficient, and uncertainty-aware
expression quantification of single-cell RNA-seq data. 11 2020. doi:
10.6084/m9.figshare.13198100.v1.

4. Avi Srivastava, Laraib Malik, Hirak Sarkar, Mohsen Zakeri, Fatemeh
Almodaresi, Charlotte Soneson, Michael I Love, Carl Kingsford, and Rob Patro.
Alignment and mapping methodology influence transcript abundance estimation.
Genome Biology, 21(1):1–29, 2020.

5. Hirak Sarkar, Mohsen Zakeri, Laraib Malik, and Rob Patro. Towards
selective-alignment: Bridging the accuracy gap between alignment-based and
alignment-free transcript quantification. In Proceedings of the 2018 ACM
International Conference on Bioinformatics, Computational Biology, and
Health Informatics, pages 27–36, 2018.

6. Nicolas L Bray, Harold Pimentel, Páll Melsted, and Lior Pachter. Near-optimal
probabilistic RNA-seq quantification. Nature Biotechnology, 34(5):525–527,
2016.

7. Matthew D. Shirley, Viveksagar K. Radhakrishna, Javad Golji, and Joshua M.
Korn. PISCES: a package for rapid quantitation and quality control of large
scale mRNA-seq datasets. bioRxiv, 2020. doi: 10.1101/2020.12.01.390575.

8. Avi Srivastava, Mohsen Zakeri, Hirak Sarkar, Charlotte Soneson, Carl Kingsford,
and Rob Patro. Accounting for fragments of unexpected origin improves
transcript quantification in RNA-seq simulations focused on increased realism.
bioRxiv, 2021. doi: 10.1101/2021.01.17.426996.

9. Ash Blibaum, Jonathan Werner, and Alexander Dobin. STARsolo:
single-cell RNA-seq analyses beyond gene expression. 2019. doi:
10.7490/F1000RESEARCH.1117634.1.

10. Geo Pertea and Mihaela Pertea. GFF utilities: GffRead and GffCompare.
F1000Research, 9:304, September 2020. doi: 10.12688/f1000research.23297.2.

11. Andrew D Yates, Premanand Achuthan, Wasiu Akanni, James Allen, Jamie

Allen, Jorge Alvarez-Jarreta, M Ridwan Amode, Irina M Armean, Andrey G
Azov, Ruth Bennett, et al. Ensembl 2020. Nucleic acids research, 48(D1):
D682–D688, 2020.

12. Rob Patro, Geet Duggal, Michael I Love, Rafael A Irizarry, and Carl Kingsford.
Salmon provides fast and bias-aware quantification of transcript expression.
Nature Methods, 14(4):417–419, 2017.

13. Avi Srivastava, Laraib Malik, Tom Smith, Ian Sudbery, and Rob Patro. Alevin
efficiently estimates accurate gene abundances from dscRNA-seq data. Genome
Biology, 20(1):1–16, 2019.

14. Stefan Niebler, André Müller, Thomas Hankeln, and Bertil Schmidt. Raindrop:
Rapid activation matrix computation for droplet-based single-cell rna-seq reads.
BMC Bioinformatics, 21(1):1–14, 2020.

15. Fatemeh Almodaresi, Hirak Sarkar, Avi Srivastava, and Rob Patro. A space and
time-efficient index for the compacted colored de Bruijn graph. Bioinformatics,
34(13):i169–i177, 2018.

16. Yuhan Hao, Stephanie Hao, Erica Andersen-Nissen, William M. Mauck, Shiwei
Zheng, Andrew Butler, Maddie J. Lee, Aaron J. Wilk, Charlotte Darby, Michael
Zagar, Paul Hoffman, Marlon Stoeckius, Efthymia Papalexi, Eleni P. Mimitou,
Jaison Jain, Avi Srivastava, Tim Stuart, Lamar B. Fleming, Bertrand Yeung,
Angela J. Rogers, Juliana M. McElrath, Catherine A. Blish, Raphael Gottardo,
Peter Smibert, and Rahul Satija. Integrated analysis of multimodal single-cell
data. bioRxiv, 2020. doi: 10.1101/2020.10.12.335331.

17. Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Pérez, Brian Granger,
Matthias Bussonnier, Jonathan Frederic, Kyle Kelley, Jessica Hamrick, Jason
Grout, Sylvain Corlay, Paul Ivanov, Damián Avila, Safia Abdalla, Carol Willing,
and Jupyter development team. Jupyter notebooks - a publishing format for
reproducible computational workflows. In Fernando Loizides and Birgit Scmidt,
editors, Positioning and Power in Academic Publishing: Players, Agents and
Agendas, pages 87–90, Netherlands, 2016. IOS Press.

Zakeri et al. | lightweight single-cell RNA-seq pre-processing bioRχiv | 7

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.10.430656doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.10.430656
http://creativecommons.org/licenses/by-nc-nd/4.0/