Learning Sparse Log-Ratios for High-Throughput Sequencing Data

Elliott Gordon-Rodriguez 1   Thomas P. Quinn 2   John P. Cunningham 1

Abstract

The automatic discovery of interpretable features that are associated with an outcome of interest is a central goal of bioinformatics. In the context of high-throughput genetic sequencing data, and Compositional Data more generally, an important class of features are the log-ratios between subsets of the input variables. However, the space of these log-ratios grows combinatorially with the dimension of the input, and as a result, existing learning algorithms do not scale to increasingly common high-dimensional datasets. Building on recent literature on continuous relaxations of discrete latent variables, we design a novel learning algorithm that identifies sparse log-ratios several orders of magnitude faster than competing methods. As well as dramatically reducing runtime, our method outperforms its competitors in terms of sparsity and predictive accuracy, as measured across a wide range of benchmark datasets.

1. Introduction

Much recent work has been devoted to designing differentiable relaxations of discrete latent variables. These relaxations can be used to learn class membership (Jang et al., 2016; Maddison et al., 2017; Potapczynski et al., 2020), permutations (Linderman et al., 2018; Mena et al., 2018), subsets (Xie & Ermon, 2019; Yang et al., 2019), and rankings (Cuturi et al., 2019; Blondel et al., 2020). Depending on their use case, existing methods range in complexity, from the simple-but-effective straight-through estimator (Bengio et al., 2013) to mathematically intricate schemes based on optimal transport (Xie et al., 2020). However, the driving principle is always the same: to enable efficient gradient-based optimization on an otherwise intractable discrete space. The goal of our work is to extend this principle to a novel setting where, to the best of our knowledge, no differentiable relaxations have yet been proposed.

1 Department of Statistics, Columbia University  2 Applied Artificial Intelligence Institute (A2I2), Deakin University. Correspondence to: Elliott Gordon-Rodriguez.

Motivated by domain applications, our objective is to select log-ratios from a set of covariates, a problem that is equivalent to a discrete optimization over pairs of disjoint subsets (where the pair represents the numerator and denominator of the ratio, respectively). Our novel relaxation will result in dramatic speedups over several recent state-of-the-art learning algorithms from the field of bioinformatics, thereby enabling the analysis of much larger datasets than previously possible.

Log-ratios are an important class of features for analyzing high-throughput sequencing (HTS) metagenomic data (Wooley et al., 2010; Gloor & Reid, 2016; Gloor et al., 2017; Quinn et al., 2018). For example, in microbiome count data, the relative weight between two sub-populations of related microorganisms can serve as a clinically useful biomarker (Rahat-Rozenbloom et al., 2014; Crovesy et al., 2020; Magne et al., 2020). More generally, log-ratios are of fundamental importance to the field of Compositional Data (CoDa), of which HTS data can be seen as a special case (Pawlowsky-Glahn & Egozcue, 2006; Pawlowsky-Glahn & Buccianti, 2011). CoDa can be defined as simplex-valued data, or equivalently, non-negative vectors whose totals are uninformative, i.e., relative data.
Due to the nature of the recording technique, HTS data represents the relative abun- dance of different microbial signatures in a given sample, and therefore is an instance of CoDa (Gloor & Reid, 2016; Gloor et al., 2017; Quinn et al., 2018). Indeed, the ap- plication of CoDa methodology to HTS data has become increasingly popular in recent years (Fernandes et al., 2013; 2014; Rivera-Pinto et al., 2018; Quinn et al., 2019; Calle, 2019), with log-ratios serving as the basic building blocks for statistical analysis. But why do log-ratios form the basis of CoDa methodology? Unlike unconstrained real-valued data, the relative nature of HTS data and CoDa results in each covariate becoming neg- atively correlated to all others (increasing one component of a composition implies a relative decrease of the other components). It is well known that, as a result, the usual measures of association and feature attribution are problem- atic when applied to CoDa (Pearson, 1896; Filzmoser et al., 2009; Van den Boogaart & Tolosana-Delgado, 2013; Lovell et al., 2015). Log-ratios account for this idiosyncratic struc- ture by transforming CoDa onto unconstrained feature space, where the usual tools of statistical learning apply (Aitchison, 1982; Pawlowsky-Glahn & Egozcue, 2006). The choice Learning Sparse Log-Ratios for High-Throughput Sequencing Data of the log-ratio transform offers the necessary property of scale invariance, but in the CoDa literature it holds primacy for a variety of other technical reasons, including so-called subcompositional coherence (Aitchison, 1982; Pawlowsky- Glahn & Buccianti, 2011; Egozcue & Pawlowsky-Glahn, 2019). Log-ratios can be taken over pairs of individual covariates (Aitchison, 1982; Greenacre, 2019b) or aggrega- tions thereof, typically geometric means (Aitchison, 1982; Egozcue et al., 2003; Egozcue & Pawlowsky-Glahn, 2005; Rivera-Pinto et al., 2018; Quinn & Erb, 2019) or summa- tions (Greenacre, 2019a; 2020; Quinn & Erb, 2020). The resulting features work well empirically, but also imply a clear interpretation: a log-ratio is a single composite score that expresses the overall quantity of one sub-population as compared with another. When the log-ratios are sparse, meaning they are taken over a small number of covariates, they define biomarkers that are particularly intuitive to un- derstand, a key desiderata for predictive models that are of clinical relevance (Goodman & Flaxman, 2017). Thus, learning sparse log-ratios is a central problem in CoDa. This problem is especially challenging in the context of HTS data, due to its high dimensionality (ranging from 100 to over 10,000 covariates). Existing methods rely on stepwise search or evolutionary algorithms (Rivera-Pinto et al., 2018; Greenacre, 2019b; Quinn & Erb, 2020; Prifti et al., 2020), which scale very poorly with the dimension of the input. These algorithms are prohibitively slow for most HTS datasets, and thus there is a new demand for sparse and interpretable models that scale to high dimensions (Li, 2015; Cammarota et al., 2020; Susin et al., 2020). This demand motivates the present work, in which we present CoDaCoRe,1a novel learning algorithm for Compositional Data via Continuous Relaxations. The key idea behind CoDaCoRe is to approximate a combinatorial optimization over the set of log-ratios (equivalent to the set of pairs of disjoint subsets of the covariates), by means of a continuous relaxation that can be optimized efficiently using gradient descent. 
To the best of our knowledge, CoDaCoRe is the first CoDa method that scales to high dimensions, and that simultaneously produces sparse, interpretable, and accurate models. The main contributions of our method can be summarized as follows: • Computational efficiency. CoDaCoRe scales linearly with the dimension of the input. It runs several orders of magnitude faster than its competitors. • Interpretability. CoDaCoRe identifies a set of log- ratios that are sparse, biologically meaningful, and ranked in order of importance. Our model is highly interpretable, and much sparser, relative to compet- ing methods of similar accuracy and computational 1Our implementation can be downloaded from https://github.com/cunningham-lab/codacore. complexity. • Predictive accuracy. CoDaCoRe achieves better out- of-sample accuracy than existing CoDa methods, and performs similarly to state-of-the-art black-box classi- fiers (which are neither sparse nor interpretable). • Optimization robustness. We leverage the functional form of our continuous relaxation to identify an adap- tive learning rate that enables CoDaCoRe to converge reliably, requiring no additional hyperparameter tuning when deployed on novel datasets. 2. Background Our work focuses on the supervised learning problem with compositional predictors. Namely, we are given data {xi,yi}ni=1, where xi is compositional (e.g., HTS data), and our goal is to learn an association xi 7→ yi. For many mi- crobiome applications, xi represents a vector of frequencies of the different species of bacteria that compose the micro- biome of the ith subject. In other words, xij denotes the abundance of the jth species (of which there are p total) in the ith subject. The response yi is a binary variable indicat- ing whether the ith subject belongs to the case or the control groups (e.g., sick vs. healthy). Due to the nature of HTS, the input frequencies xij arise from an inexhaustive sampling procedure, so that the totals ∑p j=1 xij are arbitrary and the components should only be interpreted in relative terms (i.e., as CoDa) (Gloor & Reid, 2016; Gloor et al., 2017; Quinn et al., 2018; Calle, 2019). While we mainly consider applications to microbiome data, our method applies more generally to any high-dimensional CoDa, including those produced by Liquid Chromatography Mass Spectrometry (Filzmoser & Walczak, 2014). In order to account for the compositional nature of xi, we seek log-ratio transformed features that can be passed to a regression function downstream. As discussed, these log- ratios will result in interpretable features and scale-invariant models (that are also subcompositionally coherent). The simplest such choice is to take the pairwise log-ratios be- tween input variables, i.e., log(xij+/xij−), where (j +,j−) indexes a pair of covariates (Aitchison, 1982). Note that the ratio cancels out any scaling factor applied to xi, preserv- ing only the relative information in the data, while the log transform ensures the output is (unconstrained) real-valued. In order to select a good pair (j+,j−) from the input co- variates, Greenacre (2019b) proposed a step-wise algorithm for identifying pairwise log-ratios that explain the most vari- ation in a dataset. This algorithm produces a sparse and interpretable set of features, but it is prohibitively slow on high-dimensional datasets, as a result of the step-wise search scaling poorly in the dimension of the input. 
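To make the combinatorial burden concrete, the sketch below scores every pairwise log-ratio against a binary outcome by brute force. This is our own illustration of why pairwise search scales poorly, not Greenacre's stepwise procedure (which targets explained variation rather than association with a response); the correlation-based score and function names are placeholders, and zero counts are assumed to have been imputed beforehand.

```python
import itertools
import numpy as np

def best_pairwise_log_ratio(X, y):
    """Exhaustively score every pairwise log-ratio log(x_{j+} / x_{j-}) against a
    binary outcome y, returning the best pair. Illustrative only: the loop visits
    all p*(p-1)/2 pairs, which is what makes this style of search impractical for
    high-dimensional HTS data (p in the thousands or more)."""
    n, p = X.shape
    log_x = np.log(X)  # assumes zeros were imputed upstream
    best_pair, best_score = None, -np.inf
    for j_plus, j_minus in itertools.combinations(range(p), 2):
        ratio = log_x[:, j_plus] - log_x[:, j_minus]
        # absolute correlation with the binary response as a simple stand-in score
        score = abs(np.corrcoef(ratio, y)[0, 1])
        if score > best_score:
            best_pair, best_score = (j_plus, j_minus), score
    return best_pair, best_score
```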
A heuristic search algorithm that is less accurate but computationally faster has been developed as part of Quinn et al. (2017), though its computational cost is still troublesome (as we shall see in Section 4).

2.1. Balances

Recently, a class of log-ratios known as balances (Egozcue & Pawlowsky-Glahn, 2005) have become of interest in microbiome applications, due to their interpretability as the relative weight between two sub-populations of bacteria (Morton et al., 2019b; Quinn & Erb, 2019). Balances are defined as the log-ratios between geometric means of two subsets of the covariates:2

$$
B(x_i; J^+, J^-) = \log \frac{\big(\prod_{j \in J^+} x_{ij}\big)^{1/p^+}}{\big(\prod_{j \in J^-} x_{ij}\big)^{1/p^-}} = \frac{1}{p^+} \sum_{j \in J^+} \log x_{ij} \;-\; \frac{1}{p^-} \sum_{j \in J^-} \log x_{ij}, \qquad (1)
$$

where J+ and J− denote a pair of disjoint subsets of the indices {1, ..., p}, and p+ and p− denote their respective sizes. For example, in microbiome data, J+ and J− are groups of bacteria species that may be related by their environmental niche (Morton et al., 2017) or genetic similarity (Silverman et al., 2017; Washburne et al., 2017). Note that when p+ = p− = 1 (i.e., J+ and J− each contain a single element), B(x; J+, J−) reduces to a pairwise log-ratio. By allowing for the aggregation of more than one covariate in the numerator and denominator of the log-ratio, balances provide a richer set of features that allows for more flexible models than pairwise log-ratios. Insofar as the balances are taken over a small number of covariates (i.e., J+ and J− are sparse), they also provide highly interpretable biomarkers.

2 Note that the original definition of balances includes a "normalization" constant, which we omit for clarity. This constant is in fact unnecessary, as it will get absorbed into a regression coefficient downstream.

The selbal algorithm (Rivera-Pinto et al., 2018) has gained popularity as a method for automatically identifying balances that predict a response variable. However, this algorithm is also based on a step-wise search through the combinatorial space of subset pairs (J+, J−), which scales poorly in the dimension of the input and becomes prohibitively slow for HTS data (Susin et al., 2020).

2.2. Amalgamations

An alternative to balances, known as amalgamations, is defined by aggregating components through summation:

$$
A(x_i; J^+, J^-) = \log \left( \frac{\sum_{j \in J^+} x_{ij}}{\sum_{j \in J^-} x_{ij}} \right), \qquad (2)
$$

where again J+ and J− denote disjoint subsets of the input components. Amalgamations have the advantage of reducing the dimensionality of the data through an operation, the sum, that some authors argue is more interpretable than a geometric mean (Greenacre, 2019a; Greenacre et al., 2020). On the other hand, amalgamations can be less effective than balances for identifying components that are statistically important but small in magnitude, e.g., rare bacteria species (since small terms will have less impact on a summation than on a product). Recently, Greenacre (2020) has advocated for the use of expert-driven amalgamations, using domain knowledge to construct the relevant features. On the other hand, Quinn & Erb (2020) proposed amalgam, an evolutionary algorithm to automatically identify amalgamated log-ratios (Eq. 2) that are predictive of a response variable. However, this algorithm does not scale to high-dimensional data (albeit comparing favorably to selbal), nor does it produce sparse models (hindering interpretability of the results).
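To make Equations 1 and 2 concrete, the following minimal sketch computes both features for a given pair of index sets. The function and variable names are our own, and strictly positive inputs are assumed (i.e., zero counts handled upstream); note that scaling x by any constant leaves both features unchanged, which is the scale invariance discussed above.

```python
import numpy as np

def balance(x, j_plus, j_minus):
    """Eq. 1: log-ratio of geometric means over two disjoint index sets."""
    # mean of logs = log of geometric mean, so this is (1/p+)*sum(log x) - (1/p-)*sum(log x)
    return np.mean(np.log(x[j_plus])) - np.mean(np.log(x[j_minus]))

def amalgamation(x, j_plus, j_minus):
    """Eq. 2: log-ratio of summed components over two disjoint index sets."""
    return np.log(x[j_plus].sum() / x[j_minus].sum())

# toy composition over 5 parts; multiplying x by 100 gives identical feature values
x = np.array([3.0, 10.0, 2.0, 25.0, 60.0])
print(balance(x, [0, 1], [3]), amalgamation(x, [0, 1], [3]))
```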
2.3. Other Related Work

CoDa methodology has also recently attracted interest from the machine learning community (Tolosana-Delgado et al., 2019; Quinn et al., 2020; Gordon-Rodriguez et al., 2020a;b; Templ, 2020). Relevant to us is DeepCoDA (Quinn et al., 2020), which combines self-explaining neural networks with log-ratio transformed features. In particular, DeepCoDA learns a set of log-contrasts, in which the numerator and denominator are defined as unequally weighted geometric averages of components. As a result of this weighting, DeepCoDA loses much of the interpretability and intuitive appeal of balances (or amalgamations), which is exacerbated by its lack of sparsity (in spite of regularization). Moreover, like most deep architectures, DeepCoDA is sensitive to initialization and optimization hyperparameters (which limits its ease of use) and is susceptible to overfitting (which can further compromise the interpretability of the model).

The special case of a linear log-contrast model has been referred to as Coda-lasso, and was separately proposed by Lu et al. (2019). While Coda-lasso scales better than selbal, it has been found to perform worse in terms of predictive accuracy (Susin et al., 2020). More importantly, Coda-lasso is still prohibitively slow on the high-dimensional HTS data that we wish to consider. Last, we highlight another common set of features that are also a special case of log-contrasts: centered log-ratios, where individual covariates are divided by the geometric mean of all input variables (Aitchison, 1982). Models using these features, such as Susin et al. (2020), can be accurate and computationally efficient; however, they are inherently not sparse and are difficult to interpret scientifically (Greenacre, 2019a).

Table 1. Qualitative comparison of the methods discussed, ordered from most sparse (top) to least (bottom). CoDaCoRe is the only learning algorithm that performs well on all of our criteria. See Table 2 for a corresponding quantitative comparison.

                                            SCALABILITY   INTERPRETABILITY   SPARSITY   ACCURACY
CODACORE (OURS)                                  +               +               +          +
PAIRWISE LOG-RATIOS (GREENACRE, 2019B)           −               +               +          −
SELBAL (RIVERA-PINTO ET AL., 2018)               −               +               +          ·
LASSO                                            +               ·               +          −
CODA-LASSO (LU ET AL., 2019)                     −               ·               ·          ·
AMALGAM (QUINN & ERB, 2020)                      −               +               −          ·
DEEPCODA (QUINN ET AL., 2020)                    ·               ·               −          ·
CLR-LASSO (SUSIN ET AL., 2020)                   +               −               −          +
BLACK-BOX (RANDOM FOREST, XGBOOST)               +               −               −          +

3. Methods

We now present CoDaCoRe, a novel learning algorithm for HTS data, and more generally, high-dimensional CoDa. Unlike existing methods, CoDaCoRe is simultaneously scalable, interpretable, sparse, and accurate. We compare the relative merits of CoDaCoRe and its competitors in Table 1.

3.1. Continuous Relaxation

In its basic formulation, CoDaCoRe learns a regression function of the form:

$$
f(x) = \alpha + \beta \cdot B(x; J^+, J^-), \qquad (3)
$$

where B denotes a balance (Eq. 1), and α and β are scalar parameters. For clarity, we will restrict our exposition to this formulation, but note that our algorithm can be applied equally to learn amalgamations instead of balances (see Section 3.5), as well as generalizing straightforwardly to nonlinear functions (provided they are suitably parameterized and differentiable).

Let L(y, f) denote the cross-entropy loss, with f ∈ R given in logit space. The goal of CoDaCoRe is to find the balance that is maximally associated with the response. Mathematically, this can be written as an empirical risk minimization:

$$
\min_{(J^+, J^-, \alpha, \beta)} \; \sum_i L\big(y_i, \, \alpha + \beta \cdot B(x_i; J^+, J^-)\big). \qquad (4)
$$
This objective involves a discrete optimization over pairs (J+, J−) of disjoint subsets, a combinatorially hard problem. The key insight of CoDaCoRe is to approximate this combinatorial optimization with a continuous relaxation that can be trained efficiently by gradient descent.

Our relaxation is parameterized by an unconstrained vector of "assignment weights", w ∈ Rp, with one scalar parameter per input dimension (e.g., one weight per bacteria species). The weights are mapped to a vector of "soft assignments" via:

$$
\tilde{w} = 2 \cdot \mathrm{sigmoid}(w) - 1 = \frac{2}{1 + \exp(-w)} - 1, \qquad (5)
$$

where the sigmoid is applied component-wise. Eq. 5 maps onto the interval (−1, 1), which can be understood straightforwardly as a relaxation of the set {−1, 1, 0}, denoting membership to J−, J+, or neither, respectively. Let us write w̃+ = ReLU(w̃) and w̃− = ReLU(−w̃) for the positive and negative parts of w̃, respectively. We approximate balances (Eq. 1) with the following relaxation:

$$
\tilde{B}(x_i; w) = \frac{\sum_j \tilde{w}^+_j \log x_{ij}}{\sum_j \tilde{w}^+_j} - \frac{\sum_j \tilde{w}^-_j \log x_{ij}}{\sum_j \tilde{w}^-_j} \qquad (6)
$$
$$
\phantom{\tilde{B}(x_i; w)} = \frac{\tilde{w}^+ \cdot \log x_i}{\|\tilde{w}^+\|_1} - \frac{\tilde{w}^- \cdot \log x_i}{\|\tilde{w}^-\|_1}. \qquad (7)
$$

In other words, we approximate geometric averages over subsets of the inputs by weighted geometric averages over all components (compare Equations 1 and 6). Crucially, this relaxation is differentiable in w, allowing us to construct a surrogate objective function that can be optimized jointly in (w, α, β) by gradient descent:

$$
\min_{(w, \alpha, \beta)} \; \sum_i L\big(y_i, \, \alpha + \beta \cdot \tilde{B}(x_i; w)\big). \qquad (8)
$$

We defer the details of our implementation of gradient descent to the Supplement (Section A), but we highlight two observations. First, the computational cost of the gradient of Eq. 8 is linear in the dimension of w. As a result, our algorithm scales linearly with the dimension of the input, and is fast to fit on large datasets (see Section 4.3). Second, knowledge of the functional form of our relaxation (Eq. 6) can be exploited in order to select the learning rate adaptively (i.e., without tuning), resulting in robust convergence across all real and simulated datasets that we considered.

3.2. Discretization

While a set of features in the form of Eq. 6 may perform accurate classification, a weighted geometric average over all covariates is much harder for a biologist to interpret (and less intuitively appealing) than a bona fide balance over a small number of covariates. On these grounds, CoDaCoRe implements a "discretization" procedure that exploits the information learned by the soft assignment vector w̃, in order to efficiently identify a pair of sparse subsets (Ĵ+, Ĵ−). The most straightforward way to convert the (soft) assignment w̃ into a (hard) pair of subsets is by fixing a threshold t ∈ (0, 1):

$$
\tilde{J}^+ = \{ j : \tilde{w}_j > t \}, \qquad (9)
$$
$$
\tilde{J}^- = \{ j : \tilde{w}_j < -t \}. \qquad (10)
$$

Note that given a trained w̃ and a fixed threshold t, we can evaluate the quality of the corresponding balance B(x; J̃+, J̃−) (resp. amalgamation) by optimizing Eq. 4 over (α, β) alone, i.e., fitting a linear model. Computationally, fitting a linear model is much faster than optimizing Eq. 8, and can be done repeatedly for a range of values of t with little overhead. In CoDaCoRe, we combine this strategy with cross-validation in order to select the threshold, t̂, that optimizes predictive performance (see Section A of the Supplement for full detail).
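The following sketch illustrates the relaxation (Eqs. 5-8) and the thresholding step (Eqs. 9-10) for a single balance. It is our own minimal illustration, not the reference implementation at github.com/cunningham-lab/codacore: it uses plain SGD with a fixed learning rate rather than the adaptive scheme described above, a small random initialization of w, and an eps term to guard against empty soft subsets; strictly positive inputs are assumed.

```python
import torch

def train_relaxed_balance(X, y, epochs=200, lr=0.1, eps=1e-8):
    """Fit the continuous relaxation of a single balance (Eqs. 5-8) by gradient descent.
    X: (n, p) tensor of strictly positive relative abundances; y: (n,) float tensor of 0/1 labels."""
    n, p = X.shape
    log_x = torch.log(X)
    w = (0.1 * torch.randn(p)).requires_grad_()   # assignment weights (small random init)
    alpha = torch.zeros(1, requires_grad=True)    # intercept
    beta = torch.ones(1, requires_grad=True)      # slope
    opt = torch.optim.SGD([w, alpha, beta], lr=lr)
    for _ in range(epochs):
        w_tilde = 2 * torch.sigmoid(w) - 1                        # Eq. 5: soft assignments in (-1, 1)
        w_pos, w_neg = torch.relu(w_tilde), torch.relu(-w_tilde)  # positive / negative parts
        b_tilde = (log_x @ w_pos) / (w_pos.sum() + eps) \
                - (log_x @ w_neg) / (w_neg.sum() + eps)           # Eq. 6: relaxed balance
        logits = alpha + beta * b_tilde                           # Eq. 3 on the relaxed feature
        loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, y)  # Eq. 8
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()

def discretize(w, t):
    """Eqs. 9-10: threshold the soft assignments into hard index sets."""
    w_tilde = 2 * torch.sigmoid(w) - 1
    j_plus = torch.where(w_tilde > t)[0].tolist()
    j_minus = torch.where(w_tilde < -t)[0].tolist()
    return j_plus, j_minus
```

In the full method, the threshold t is then chosen by cross-validation, refitting only (α, β) for each candidate value, which is what makes the threshold sweep cheap.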
Finally, the trained regression function is:

$$
\hat{f}(x) = \hat{\alpha} + \hat{\beta} \cdot B(x; \hat{J}^+, \hat{J}^-), \qquad (11)
$$

where (Ĵ+, Ĵ−) are the subsets corresponding to the optimal threshold t̂, and (α̂, β̂) are the coefficients obtained by regressing yi against B(xi; Ĵ+, Ĵ−) on the entire training set.

3.3. Regularization

Note from Equations 9 and 10 that larger values of t result in fewer covariates assigned to the balance B(x; J̃+, J̃−), i.e., a sparser model. Thus, CoDaCoRe can be regularized simply by making t̂ larger. Similarly to lasso regression, our implementation of CoDaCoRe uses the 1-standard-error rule: namely, to pick the sparsest model (i.e., the highest t) with mean cross-validated score within 1 standard error of the optimum (Friedman et al., 2001). Trivially, this rule can be generalized to a λ-standard-error rule, where λ becomes a regularization hyperparameter that can be tuned by the practitioner if so desired (with lower values trading off some sparsity in exchange for predictive accuracy). For consistency, we restrict our experiments to λ = 1; however, our results can be improved further by tuning λ on each dataset. In practice, we recommend choosing a lower value (e.g., λ = 0) when the emphasis is on predictive accuracy rather than interpretability or sparsity, though our benchmarks still show competitive performance with the choice of λ = 1.

Algorithm 1 CoDaCoRe
  Inputs: Training data (xi, yi), i = 1, ..., n.
  Initialize ĝ(x) = 0.
  repeat
    Initialize a new relaxation (w, α, β).
    Train (w, α, β) by gradient descent.
    Use cross-validation to find the optimal threshold, t̂.
    Retrain (α, β) using (Ĵ+, Ĵ−).
    Update ensemble ĝ(x) ← ĝ(x) + f̂(x).
  until Ĵ+ = ∅ or Ĵ− = ∅.
  Return ĝ(x).

3.4. CoDaCoRe Algorithm

The computational efficiency of our continuous relaxation allows us to train multiple regressors of the form of Eq. 11 within a single model. In the full CoDaCoRe algorithm, we ensemble multiple such regressors in a stage-wise additive fashion, where each successive balance is fitted on the residual from the current model. Thus, CoDaCoRe identifies a sequence of balances, in decreasing order of importance, each of which is sparse and interpretable. Training terminates when an additional relaxation (Eq. 6) cannot improve the cross-validation score relative to the existing ensemble (equivalently, when we obtain t̂ = 1). Typically, only a small number of balances is required to capture the signal in the data, and as a result CoDaCoRe produces very sparse models overall, further enhancing interpretability. Our procedure is summarized in Algorithm 1.

3.5. Amalgamations

CoDaCoRe can be used to learn amalgamations (Eq. 2) much in the same way as for balances (the choice of which to use depending on the goals of the biologist). In this case, our relaxation is defined as:

$$
\tilde{A}(x_i; w) = \log \left( \frac{\sum_j \tilde{w}^+_j x_{ij}}{\sum_j \tilde{w}^-_j x_{ij}} \right) \qquad (12)
$$
$$
\phantom{\tilde{A}(x_i; w)} = \log \left( \frac{\tilde{w}^+ \cdot x_i}{\tilde{w}^- \cdot x_i} \right), \qquad (13)
$$

i.e., we approximate summations over subsets of the inputs with weighted summations over all components (compare Eq. 2 and Eq. 12). The rest of the argument follows verbatim, replacing B(·) with A(·) and B̃(·) with Ã(·) in Equations 3, 4, 8, and 11.
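For completeness, the relaxed amalgamation (Eq. 12) only changes the feature computation; the training and thresholding steps are unchanged. A minimal sketch, reusing the soft-assignment mapping from the balance sketch above (the eps term is our own numerical safeguard, not part of Eq. 12):

```python
import torch

def relaxed_amalgamation(X, w, eps=1e-8):
    """Eq. 12: log-ratio of weighted sums, using soft assignments in (-1, 1)."""
    w_tilde = 2 * torch.sigmoid(w) - 1
    w_pos, w_neg = torch.relu(w_tilde), torch.relu(-w_tilde)
    return torch.log(X @ w_pos + eps) - torch.log(X @ w_neg + eps)
```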
3.6. Extensions

Our model allows for a number of extensions:

• Unsupervised learning. By means of a suitable unsupervised loss function, CoDaCoRe can be extended to unlabelled datasets, {xi}, i = 1, ..., n, as a method for identifying log-ratios that provide a useful low-dimensional representation. Such a method would automatically provide a scalable alternative to several existing dimensionality reduction techniques for CoDa (Pawlowsky-Glahn et al., 2011; Mert et al., 2015; Martín-Fernández et al., 2018; Greenacre, 2019b; Martino et al., 2019).

• Incorporating confounders. In addition to (xi, yi), in some applications the effect of additional (non-compositional) predictors, zi, is also of interest. In this case, the effect of zi can be "partialled out" a priori by first regressing yi on zi alone, and using this regression as the initialization of the CoDaCoRe ensemble. Alternatively, zi can also be modeled jointly in Equations 3 and 11 (e.g., by adding a linear term γ · zi) (Forslund et al., 2015; Noguera-Julian et al., 2016; Rivera-Pinto et al., 2018).

• Nonlinear regression functions. Our method extends naturally to nonlinear regression functions of the form f(x) = hθ(B(x; J+, J−)), where hθ is a parameterized differentiable family. These functions include neural networks, which have recently become of interest in microbiome research (Morton et al., 2019a; Quinn et al., 2020).

• Applications to non-compositional data. Aggregations of parts can be useful outside the realm of CoDa; for example, an amalgamation applied to a categorical variable with many levels represents a grouping of the categories (Bondell & Reich, 2009; Gertheiss & Tutz, 2010; Tutz & Gertheiss, 2016).

4. Experiments

We evaluate CoDaCoRe on a collection of 25 benchmark datasets, including 13 datasets from the Microbiome Learning Repo (Vangay et al., 2019) and 12 microbiome, metabolite, and microRNA datasets curated by Quinn & Erb (2019). These data vary in dimension from 48 to 3,090 covariates (see Section B of the Supplement for a full description). For each dataset, we fit CoDaCoRe on 20 random 80/20 train/test splits, sampled with stratification by case-control (He & Ma, 2013). We compare against:

• Interpretable models (Sections 2.1 and 2.2): pairwise log-ratios (Greenacre, 2019b)3, selbal (Rivera-Pinto et al., 2018), and amalgam (Quinn & Erb, 2020). We also consider lasso logistic regression (with regularization parameter chosen by cross-validation with the 1-standard-error rule).

• Other CoDa models (Section 2.3): Coda-lasso (Lu et al., 2019), DeepCoDA (Quinn et al., 2020), and Susin et al. (2020). Note that these methods learn (weighted) geometric averages over a large number of input variables, which are evidently not as straightforward to interpret as simple balances or amalgamations.

• Black-box classifiers: Random Forest and XGBoost, where we tune the model complexity parameters by cross-validation (subsample size and early stopping, respectively).

3 Implemented using a heuristic search for improved computational efficiency (Quinn et al., 2017).

Figure 1. Classification accuracy (over baseline) against runtime. Each point represents one of 25 datasets, with size proportional to the input dimension. Note the x-axis is drawn on the log scale. CoDaCoRe (with balances) is the only method that scales effectively to our larger datasets, while consistently achieving high predictive accuracy. Moreover, its performance is broadly consistent across smaller and larger datasets.
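The evaluation protocol described above (repeated stratified 80/20 train/test splits, scored by accuracy and ROC AUC) might be implemented along the following lines; this is a sketch of the protocol only, where `model_fn` is a placeholder for CoDaCoRe or any of the baselines, not the authors' benchmarking code.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

def evaluate(model_fn, X, y, n_splits=20, test_size=0.2, seed=0):
    """Repeat stratified 80/20 train/test splits and report mean accuracy and AUC.
    `model_fn(X_train, y_train)` is any fitting routine returning an object with
    a predict_proba method; it stands in for CoDaCoRe or any baseline."""
    accs, aucs = [], []
    for split in range(n_splits):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=seed + split)
        model = model_fn(X_tr, y_tr)
        prob = model.predict_proba(X_te)[:, 1]
        accs.append(accuracy_score(y_te, prob > 0.5))
        aucs.append(roc_auc_score(y_te, prob))
    return np.mean(accs), np.mean(aucs)
```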
4.1. Results

We evaluate the quality of our models across the following criteria: computational efficiency (as measured by runtime), sparsity (as measured by the percentage of input variables that are active in the model), and predictive accuracy (as measured by out-of-sample accuracy and ROC AUC). Table 2 provides an aggregated summary of the results; CoDaCoRe (with balances) is performant on all metrics. Indeed, our method provides the only interpretable model that is simultaneously scalable, sparse, and accurate. Detailed performance metrics on each of the 25 datasets are provided in Section C of the Supplement.

Table 2. Evaluation metrics shown for each method, averaged over 25 datasets × 20 random train/test splits. Standard errors are computed independently on each dataset, and then averaged over the 25 datasets. The models are ordered by sparsity, i.e., percentage of active input variables. CoDaCoRe (with balances) is the only learning algorithm that is simultaneously fast, sparse, and accurate.

                                          RUNTIME (S)         ACTIVE VARS (%)   ACCURACY (%)   AUC (%)
MAJORITY CLASS                            0.0±0.0             0.0±0.0           62.5±0.0       50.0±0.0
CODACORE - BALANCES (OURS)                4.8±0.4             1.9±0.3           75.2±2.4       79.5±2.6
CODACORE - AMALGAMATIONS (OURS)           4.4±0.4             1.9±0.3           71.8±2.4       74.5±2.8
SELBAL (RIVERA-PINTO ET AL., 2018)        79,033.7±2,094.1    2.4±0.2           61.2±1.9       80.0±2.4
PAIRWISE LOG-RATIOS (GREENACRE, 2019B)    3,283.0±214.1       2.5±0.4           73.3±1.7       75.2±2.4
LASSO                                     1.6±0.1             4.4±0.6           72.4±1.7       75.2±2.3
CODA-LASSO (LU ET AL., 2019)              1,043.0±55.4        19.7±2.7          72.5±2.3       78.0±2.4
AMALGAM (QUINN & ERB, 2020)               7,360.5±209.8       87.6±2.1          74.4±2.5       78.2±2.7
DEEPCODA (QUINN ET AL., 2020)             296.5±21.4          89.3±0.6          70.6±2.9       77.6±2.9
CLR-LASSO (SUSIN ET AL., 2020)            2.0±0.2             100.0±0.0         77.5±1.8       81.6±2.2
RANDOM FOREST                             10.6±0.4            ·                 78.0±2.2       82.2±2.2
XGBOOST                                   3.9±0.2             ·                 78.4±2.1       82.4±2.1

Figure 1 shows the average runtime of our classifiers on each dataset, with larger points denoting larger datasets. CoDaCoRe trains orders of magnitude faster and scales better than existing interpretable CoDa methods. On our larger datasets (3,090 inputs), selbal runs in ∼100 hours, pairwise log-ratios and amalgam both run in ∼10 hours, and CoDaCoRe runs in under 10 seconds (full runtimes are provided in Table 2 in the Supplement). All runs, including those involving gradient descent, were performed on identical CPU cores; CoDaCoRe can be accelerated further using GPUs, but we did not find it necessary to do so. It is also worth noting that the outperformance of CoDaCoRe is not merely a result of the other methods failing on high-dimensional datasets. The consistent performance of CoDaCoRe across smaller and larger datasets is demonstrated in Supplementary Tables 3, 4, and 5, which show a full breakdown of results across each dataset.

Not only is CoDaCoRe sparser and more accurate than other interpretable models, it also performs on par with state-of-the-art black-box classifiers. By simply reducing the regularization parameter from λ = 1 to λ = 0, CoDaCoRe (with balances) achieved an average out-of-sample accuracy of 77.6% and an AUC of 82.0%, on par with Random Forest and XGBoost (bottom rows of Table 2), while only using 5.9% of the input variables on average. This result indicates, first, that CoDaCoRe provides a highly effective algorithm for variable selection in high-dimensional HTS data.
Second, the fact that CoDaCoRe achieves similar predictive accuracy to best-in-class black-box classifiers suggests that our model may have captured a near-complete representation of the signal in the data. At any rate, we take this as evidence that log-ratio transformed features are indeed of biological importance in the context of HTS data, corroborating previous microbiome research (Rahat-Rozenbloom et al., 2014; Crovesy et al., 2020; Magne et al., 2020).

4.2. Interpretability

The CoDaCoRe algorithm offers two kinds of interpretability. First, it provides the analyst with sets of covariates whose aggregated ratio predicts the outcome of interest. These sets are easy to understand because they are discrete, with each component making an equivalent (unweighted) contribution. They are also sparse, usually containing fewer than 10 features per ratio, and can be made sparser by adjusting the regularization parameter λ. Such ratios have a precedent in microbiome research; for example, the Firmicutes-to-Bacteroidetes ratio is used as a biomarker of gut health (Crovesy et al., 2020; Magne et al., 2020). Second, CoDaCoRe ranks predictive ratios hierarchically. Due to the ensembling procedure, the first ratio learned is the most predictive, the second ratio predicts the residual from the first, and so forth. Like principal components, the balances (or amalgamations) learned by CoDaCoRe are naturally ordered in terms of their explanatory power. This ordering aids interpretability by decomposing a multivariable model into comprehensible "chunks" of information.

Notably, we find a high degree of stability in the log-ratios selected by the model. We repeated CoDaCoRe on 10 independent training set splits of the Crohn disease data provided by Rivera-Pinto et al. (2018), and found consensus among the learned models. Figure 2 shows which bacteria were included for each split, in both versions of CoDaCoRe (balances and amalgamations). Importantly, most of the bacteria that were selected consistently by CoDaCoRe – notably Dialister, Roseburia and Clostridiales – were also identified by Rivera-Pinto et al. (2018). Differences between the sets selected by CoDaCoRe with balances vs. CoDaCoRe with amalgamations can be explained by differences in how the geometric mean vs. summation operations impact the log-ratio. The geometric mean, being more sensitive to small numbers, is more affected by the presence of rarer bacteria species like Dialister and Roseburia (as compared with the more common bacteria species like Haemophilus and Faecalibacterium).

Figure 2. CoDaCoRe variable selection for the first (most explanatory) log-ratio on the Crohn disease data (Rivera-Pinto et al., 2018). For each of 10 independent training set splits (80% of the data), we show which variables are selected in the numerator (blue) and denominator (orange) of the log-ratio. Both versions of CoDaCoRe, with balances (top panel) or amalgamations (bottom panel), learn remarkably consistent log-ratios across independent training sets.
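A stability check of this kind can be reproduced in a few lines, given any fitting routine that returns the selected index sets. The sketch below is an illustration only, not the original analysis code: `fit_fn` is a hypothetical helper (for instance, a single CoDaCoRe balance built from the training and discretization functions sketched in Section 3).

```python
from collections import Counter
from sklearn.model_selection import train_test_split

def selection_stability(fit_fn, X, y, n_repeats=10, seed=0):
    """Count how often each covariate enters the first learned log-ratio across
    independent 80% training splits. `fit_fn(X, y)` is any routine returning
    (j_plus, j_minus) index lists for the fitted log-ratio."""
    counts = Counter()
    for r in range(n_repeats):
        X_tr, _, y_tr, _ = train_test_split(
            X, y, train_size=0.8, stratify=y, random_state=seed + r)
        j_plus, j_minus = fit_fn(X_tr, y_tr)
        counts.update({j: 1 for j in list(j_plus) + list(j_minus)})
    return counts  # counts.most_common(10) lists the most stably selected covariates
```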
4.3. Scaling to Liquid Biopsy Data

HTS data generated from clinical blood samples can be described as a "liquid biopsy" that can be used for cancer diagnosis and surveillance (Best et al., 2015; Alix-Panabières & Pantel, 2016). These data can be very high-dimensional, especially when they include all gene transcripts as covariates. In a clinical context, the use of log-ratio predictors is an attractive option because they automatically correct for inter-sample sequencing biases that might otherwise limit the generalizability of the models (Dillies et al., 2013). Unfortunately, existing log-ratio methods like selbal and amalgam simply cannot scale to liquid biopsy datasets that contain as many as 50,000 or more input variables. The large dimensionality of such data has restricted its analysis to overly simplistic linear models, black-box models that are scalable but not interpretable, or suboptimal hybrid approaches where covariates must be pre-selected based on univariate measures (Best et al., 2015; Zhang et al., 2017; Sheng et al., 2018). Owing to its linear scaling, CoDaCoRe can be fitted to these data at a similar computational cost to a single lasso regression, i.e., under a minute on a single CPU core. Thus, CoDaCoRe can be used to discover interpretable and predictive log-ratios that are suitable for liquid biopsy cancer diagnostics, among other similar applications.

We showcase the capabilities of CoDaCoRe in this high-dimensional setting by applying our algorithm to the liquid biopsy data of Best et al. (2015). These contain p = 58,037 genes sequenced in n = 288 human subjects, 60 of whom were healthy controls, the others having been previously diagnosed with cancer. Averaging over 20 random 80/20 train/test splits of this dataset, we found that CoDaCoRe achieved the same predictive accuracy as competing methods (within error), but obtained a much sparser model. Remarkably, CoDaCoRe identified log-ratios involving just 3 genes that were equally predictive to both black-box classifiers and linear models with over 20 covariates. This case study again illustrates the potential of CoDaCoRe to derive novel biological insights, and also to develop learning algorithms for cancer diagnosis, a domain in which model interpretability – including sparsity – is of paramount importance (Wan et al., 2017).

Table 3. Evaluation metrics for the liquid biopsy data (Best et al., 2015), averaged over 20 independent 80/20 train/test splits. CoDaCoRe (with balances) achieves equal predictive accuracy to competing methods, but with much sparser solutions. Note that sparsity is expressed as an (integer) number of active variables in the model (not as a percentage of the total, as was done in Table 2). Running time is shown in seconds (standard errors were small and are omitted for brevity).

            TIME    # VARS   ACC. (%)    AUC (%)
BASELINE    0       0±0      79.1±0.0    50.0±0.0
CODACORE    31      3±1      91.0±1.9    93.6±2.6
LASSO       23      22±4     87.8±1.3    94.7±1.5
RF          383     ·        89.0±1.6    94.1±1.8
XGBOOST     108     ·        90.6±1.9    95.9±1.5

5. Conclusion

Our results corroborate the summary in Table 1: CoDaCoRe is the first sparse and interpretable CoDa model that can scale to high-dimensional HTS data. It does so convincingly, with linear scaling that results in runtimes similar to linear models. Our method is also competitive in terms of predictive accuracy, performing comparably to powerful black-box classifiers, but with interpretability.
Our findings suggest that CoDaCoRe could play a significant role in the future analysis of high-throughput sequencing data, with broad implications in microbiology, statistical genetics, and more generally, in the field of CoDa. Learning Sparse Log-Ratios for High-Throughput Sequencing Data References Aitchison, J. The statistical analysis of compositional data. Journal of the Royal Statistical Society: Series B (Method- ological), 44(2):139–160, 1982. Alix-Panabières, C. and Pantel, K. Clinical applications of circulating tumor cells and circulating tumor dna as liquid biopsy. Cancer discovery, 6(5):479–491, 2016. Bengio, Y., Léonard, N., and Courville, A. Estimating or propagating gradients through stochastic neurons for con- ditional computation. arXiv preprint arXiv:1308.3432, 2013. Best, M. G., Sol, N., Kooi, I., Tannous, J., Westerman, B. A., Rustenburg, F., Schellen, P., Verschueren, H., Post, E., Koster, J., et al. Rna-seq of tumor-educated platelets en- ables blood-based pan-cancer, multiclass, and molecular pathway cancer diagnostics. Cancer cell, 28(5):666–676, 2015. Blondel, M., Teboul, O., Berthet, Q., and Djolonga, J. Fast differentiable sorting and ranking. In International Con- ference on Machine Learning, pp. 950–959. PMLR, 2020. Bondell, H. D. and Reich, B. J. Simultaneous factor selec- tion and collapsing levels in anova. Biometrics, 65(1): 169–177, 2009. Calle, M. L. Statistical analysis of metagenomics data. Genomics & informatics, 17(1), 2019. Cammarota, G., Ianiro, G., Ahern, A., Carbone, C., Temko, A., Claesson, M. J., Gasbarrini, A., and Tortora, G. Gut microbiome, big data and machine learning to promote precision medicine for cancer. Nature Reviews Gastroen- terology & Hepatology, 17(10):635–648, 2020. Crovesy, L., Masterson, D., and Rosado, E. L. Profile of the gut microbiota of adults with obesity: a systematic review. European journal of clinical nutrition, 74(9):1251–1262, 2020. Cuturi, M., Teboul, O., and Vert, J.-P. Differentiable rank- ing and sorting using optimal transport. In Advances in Neural Information Processing Systems, pp. 6861–6871, 2019. Dillies, M.-A., Rau, A., Aubert, J., Hennequet-Antier, C., Jeanmougin, M., Servant, N., Keime, C., Marot, G., Cas- tel, D., Estelle, J., et al. A comprehensive evaluation of normalization methods for illumina high-throughput rna sequencing data analysis. Briefings in bioinformatics, 14 (6):671–683, 2013. Egozcue, J. J. and Pawlowsky-Glahn, V. Groups of parts and their balances in compositional data analysis. Mathe- matical Geology, 37(7):795–828, 2005. Egozcue, J. J. and Pawlowsky-Glahn, V. Compositional data: the sample space and its structure. TEST, 28(3): 599–638, 2019. Egozcue, J. J., Pawlowsky-Glahn, V., Mateu-Figueras, G., and Barcelo-Vidal, C. Isometric logratio transformations for compositional data analysis. Mathematical Geology, 35(3):279–300, 2003. Fernandes, A. D., Macklaim, J. M., Linn, T. G., Reid, G., and Gloor, G. B. Anova-like differential expression (aldex) analysis for mixed population rna-seq. PLoS One, 8(7):e67019, 2013. Fernandes, A. D., Reid, J. N., Macklaim, J. M., McMur- rough, T. A., Edgell, D. R., and Gloor, G. B. Unifying the analysis of high-throughput sequencing datasets: charac- terizing rna-seq, 16s rrna gene sequencing and selective growth experiments by compositional data analysis. Mi- crobiome, 2(1):15, 2014. Filzmoser, P. and Walczak, B. What can go wrong at the data normalization step for identification of biomarkers? Journal of Chromatography A, 1362:194–205, 2014. 
Filzmoser, P., Hron, K., and Reimann, C. Univariate sta- tistical analysis of environmental (compositional) data: problems and possibilities. Science of the Total Environ- ment, 407(23):6100–6108, 2009. Forslund, K., Hildebrand, F., Nielsen, T., Falony, G., Le Chatelier, E., Sunagawa, S., Prifti, E., Vieira-Silva, S., Gudmundsdottir, V., Pedersen, H. K., et al. Disentangling type 2 diabetes and metformin treatment signatures in the human gut microbiota. Nature, 528(7581):262–266, 2015. Friedman, J., Hastie, T., Tibshirani, R., et al. The elements of statistical learning, volume 1. Springer series in statistics New York, 2001. Gertheiss, J. and Tutz, G. Sparse modeling of categorial explanatory variables. The Annals of Applied Statistics, pp. 2150–2180, 2010. Gloor, G. B. and Reid, G. Compositional analysis: a valid approach to analyze microbiome high-throughput sequencing data. Canadian journal of microbiology, 62 (8):692–703, 2016. Gloor, G. B., Macklaim, J. M., Pawlowsky-Glahn, V., and Egozcue, J. J. Microbiome datasets are compositional: and this is not optional. Frontiers in microbiology, 8: 2224, 2017. Goodman, B. and Flaxman, S. European union regulations on algorithmic decision-making and a “right to explana- tion”. AI magazine, 38(3):50–57, 2017. Learning Sparse Log-Ratios for High-Throughput Sequencing Data Gordon-Rodriguez, E., Loaiza-Ganem, G., and Cunning- ham, J. The continuous categorical: a novel simplex- valued exponential family. In International Conference on Machine Learning, pp. 3637–3647. PMLR, 2020a. Gordon-Rodriguez, E., Loaiza-Ganem, G., Pleiss, G., and Cunningham, J. P. Uses and abuses of the cross-entropy loss: case studies in modern deep learning. arXiv preprint arXiv:2011.05231, 2020b. Greenacre, M. Comments on: Compositional data: the sample space and its structure. TEST, 28(3):644–652, 2019a. Greenacre, M. Variable selection in compositional data anal- ysis using pairwise logratios. Mathematical Geosciences, 51(5):649–682, 2019b. Greenacre, M. Amalgamations are valid in compositional data analysis, can be used in agglomerative clustering, and their logratios have an inverse transformation. Ap- plied Computing and Geosciences, 5:100017, 2020. Greenacre, M., Grunsky, E., and Bacon-Shone, J. A compar- ison of isometric and amalgamation logratio balances in compositional data analysis. Computers & Geosciences, pp. 104621, 2020. He, H. and Ma, Y. Imbalanced learning: foundations, algo- rithms, and applications. 2013. Jang, E., Gu, S., and Poole, B. Categorical repa- rameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016. Li, H. Microbiome, metagenomics, and high-dimensional compositional data analysis. Annual Review of Statistics and Its Application, 2:73–94, 2015. Linderman, S., Mena, G., Cooper, H., Paninski, L., and Cunningham, J. Reparameterizing the birkhoff polytope for variational permutation inference. In International Conference on Artificial Intelligence and Statistics, pp. 1618–1627. PMLR, 2018. Lovell, D., Pawlowsky-Glahn, V., Egozcue, J. J., Marguerat, S., and Bähler, J. Proportionality: a valid alternative to correlation for relative data. PLoS Comput Biol, 11(3): e1004075, 2015. Lu, J., Shi, P., and Li, H. Generalized linear models with lin- ear constraints for microbiome compositional data. Bio- metrics, 75(1):235–244, 2019. Maddison, C. J., Mnih, A., and Teh, Y. W. The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Rep- resentations, 2017. 
Magne, F., Gotteland, M., Gauthier, L., Zazueta, A., Pe- soa, S., Navarrete, P., and Balamurugan, R. The firmi- cutes/bacteroidetes ratio: a relevant marker of gut dysbio- sis in obese patients? Nutrients, 12(5):1474, 2020. Martı́n-Fernández, J., Pawlowsky-Glahn, V., Egozcue, J., and Tolosona-Delgado, R. Advances in principal balances for compositional data. Mathematical Geosciences, 50 (3):273–298, 2018. Martino, C., Morton, J. T., Marotz, C. A., Thompson, L. R., Tripathi, A., Knight, R., and Zengler, K. A novel sparse compositional technique reveals microbial perturbations. MSystems, 4(1), 2019. Mena, G., Snoek, J., Linderman, S., and Belanger, D. Learn- ing latent permutations with gumbel-sinkhorn networks. In International Conference on Learning Representations, 2018. Mert, M. C., Filzmoser, P., and Hron, K. Sparse principal balances. Statistical Modelling, 15(2):159–174, 2015. Morton, J. T., Sanders, J., Quinn, R. A., McDonald, D., Gonzalez, A., Vázquez-Baeza, Y., Navas-Molina, J. A., Song, S. J., Metcalf, J. L., Hyde, E. R., et al. Balance trees reveal microbial niche differentiation. MSystems, 2 (1), 2017. Morton, J. T., Aksenov, A. A., Nothias, L. F., Foulds, J. R., Quinn, R. A., Badri, M. H., Swenson, T. L., Van Goethem, M. W., Northen, T. R., Vazquez-Baeza, Y., et al. Learn- ing representations of microbe–metabolite interactions. Nature methods, 16(12):1306–1314, 2019a. Morton, J. T., Marotz, C., Washburne, A., Silverman, J., Zaramela, L. S., Edlund, A., Zengler, K., and Knight, R. Establishing microbial composition measurement stan- dards with reference frames. Nature communications, 10 (1):1–11, 2019b. Noguera-Julian, M., Rocafort, M., Guillén, Y., Rivera, J., Casadellà, M., Nowak, P., Hildebrand, F., Zeller, G., Par- era, M., Bellido, R., et al. Gut microbiota linked to sexual preference and hiv infection. EBioMedicine, 5:135–146, 2016. Pawlowsky-Glahn, V. and Buccianti, A. Compositional data analysis: Theory and applications. John Wiley & Sons, 2011. Pawlowsky-Glahn, V. and Egozcue, J. J. Compositional data and their analysis: an introduction. Geological Society, London, Special Publications, 264(1):1–10, 2006. Pawlowsky-Glahn, V., Egozcue, J. J., Tolosana Delgado, R., et al. Principal balances. Proceedings of CoDaWork, pp. 1–10, 2011. Learning Sparse Log-Ratios for High-Throughput Sequencing Data Pearson, K. Vii. mathematical contributions to the theory of evolution.—iii. regression, heredity, and panmixia. Philo- sophical Transactions of the Royal Society of London. Series A, containing papers of a mathematical or physi- cal character, (187):253–318, 1896. Potapczynski, A., Loaiza-Ganem, G., and Cunningham, J. P. Invertible gaussian reparameterization: Revisiting the gumbel-softmax. Advances in Neural Information Processing Systems, 33, 2020. Prifti, E., Chevaleyre, Y., Hanczar, B., Belda, E., Danchin, A., Clément, K., and Zucker, J.-D. Interpretable and accurate prediction models for metagenomics data. Giga- Science, 9(3):giaa010, 2020. Quinn, T., Nguyen, D., Rana, S., Gupta, S., and Venkatesh, S. Deepcoda: personalized interpretability for composi- tional health data. In International Conference on Ma- chine Learning, pp. 7877–7886. PMLR, 2020. Quinn, T. P. and Erb, I. Using balances to engineer fea- tures for the classification of health biomarkers: a new approach to balance selection. bioRxiv, pp. 600122, 2019. Quinn, T. P. and Erb, I. Amalgams: data-driven amalga- mation for the dimensionality reduction of compositional data. 
NAR Genomics and Bioinformatics, 2(4):lqaa076, 2020. Quinn, T. P., Richardson, M. F., Lovell, D., and Crowley, T. M. propr: an r-package for identifying proportion- ally abundant features using compositional data analysis. Scientific reports, 7(1):1–9, 2017. Quinn, T. P., Erb, I., Richardson, M. F., and Crowley, T. M. Understanding sequencing data as compositions: an out- look and review. Bioinformatics, 34(16):2870–2878, 2018. Quinn, T. P., Erb, I., Gloor, G., Notredame, C., Richardson, M. F., and Crowley, T. M. A field guide for the compo- sitional analysis of any-omics data. GigaScience, 8(9): giz107, 2019. Rahat-Rozenbloom, S., Fernandes, J., Gloor, G. B., and Wolever, T. M. Evidence for greater production of colonic short-chain fatty acids in overweight than lean humans. International journal of obesity, 38(12):1525– 1531, 2014. Rivera-Pinto, J., Egozcue, J. J., Pawlowsky-Glahn, V., Pare- des, R., Noguera-Julian, M., and Calle, M. L. Balances: a new perspective for microbiome analysis. MSystems, 3 (4), 2018. Sheng, M., Dong, Z., and Xie, Y. Identification of tumor- educated platelet biomarkers of non-small-cell lung can- cer. OncoTargets and therapy, 11:8143, 2018. Silverman, J. D., Washburne, A. D., Mukherjee, S., and David, L. A. A phylogenetic transform enhances analysis of compositional microbiota data. Elife, 6:e21887, 2017. Susin, A., Wang, Y., Lê Cao, K.-A., and Calle, M. L. Vari- able selection in microbiome compositional data analysis. NAR Genomics and Bioinformatics, 2(2):lqaa029, 2020. Templ, M. Artificial neural networks to impute rounded zeros in compositional data. arXiv preprint arXiv:2012.10300, 2020. Tolosana-Delgado, R., Talebi, H., Khodadadzadeh, M., and Van den Boogaart, K. On machine learning algorithms and compositional data. In Proceedings of the 8th In- ternational Workshop on Compositional Data Analysis, Terrassa, Spain, pp. 3–8, 2019. Tutz, G. and Gertheiss, J. Regularized regression for cate- gorical data. Statistical Modelling, 16(3):161–200, 2016. Van den Boogaart, K. G. and Tolosana-Delgado, R. Ana- lyzing compositional data with R, volume 122. Springer, 2013. Vangay, P., Hillmann, B. M., and Knights, D. Microbiome Learning Repo (ML Repo): A public repository of micro- biome regression and classification tasks. GigaScience, 8 (5), 04 2019. Wan, J. C., Massie, C., Garcia-Corbacho, J., Mouliere, F., Brenton, J. D., Caldas, C., Pacey, S., Baird, R., and Rosen- feld, N. Liquid biopsies come of age: towards implemen- tation of circulating tumour dna. Nature Reviews Cancer, 17(4):223, 2017. Washburne, A. D., Silverman, J. D., Leff, J. W., Bennett, D. J., Darcy, J. L., Mukherjee, S., Fierer, N., and David, L. A. Phylogenetic factorization of compositional data yields lineage-level associations in microbiome datasets. PeerJ, 5:e2969, 2017. Wooley, J. C., Godzik, A., and Friedberg, I. A primer on metagenomics. PLoS Comput Biol, 6(2):e1000667, 2010. Xie, S. M. and Ermon, S. Reparameterizable subset sam- pling via continuous relaxations. In International Joint Conferences on Artificial Intelligence, 2019. Xie, Y., Dai, H., Chen, M., Dai, B., Zhao, T., Zha, H., Wei, W., and Pfister, T. Differentiable top-k operator with optimal transport. Advances in Neural Information Processing Systems, 33, 2020. Yang, J., Zhang, Q., Ni, B., Li, L., Liu, J., Zhou, M., and Tian, Q. Modeling point clouds with self-attention and gumbel subset sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3323–3332, 2019. 
Learning Sparse Log-Ratios for High-Throughput Sequencing Data Zhang, Y.-H., Huang, T., Chen, L., Xu, Y., Hu, Y., Hu, L.-D., Cai, Y., and Kong, X. Identifying and analyzing different cancer subtypes using rna-seq data of blood platelets. Oncotarget, 8(50):87494, 2017.