StrainFLAIR: Strain-level profiling of metagenomic samples using variation graphs


StrainFLAIR: Strain-level profiling of1
metagenomic samples using variation2
graphs3
Kévin Da Silva1,2,*, Nicolas Pons2, Magali Berland2, Florian Plaza Oñate2,4
Mathieu Almeida2, and Pierre Peterlongo15
1Univ Rennes, Inria, CNRS, IRISA - UMR 6074, F-35000 Rennes, France6
2Université Paris-Saclay, INRAE, MGP, 78350 Jouy-en-Josas, France7

Corresponding author:8
∗Kévin Da Silva kevin.da-silva@inria.fr9

Email address:10

ABSTRACT11

Current studies are shifting from the use of single linear references to representation of multiple genomes
organised in pangenome graphs or variation graphs. Meanwhile, in metagenomic samples, resolving
strain-level abundances is a major step in microbiome studies, as associations between strain variants
and phenotype are of great interest for diagnostic and therapeutic purposes.

12

13

14

15

We developed StrainFLAIR with the aim of showing the feasibility of using variation graphs for indexing
highly similar genomic sequences up to the strain level, and for characterizing a set of unknown sequenced
genomes by querying this graph.

16

17

18

On simulated data composed of mixtures of strains from the same bacterial species Escherichia coli,
results show that StrainFLAIR was able to distinguish and estimate the abundances of close strains, as
well as to highlight the presence of a new strain close to a referenced one and to estimate its abundance.
On a real dataset composed of a mix of several bacterial species and several strains for the same species,
results show that in a more complex configuration StrainFLAIR correctly estimates the abundance of
each strain. Hence, results demonstrated how graph representation of multiple close genomes can be
used as a reference to characterize a sample at the strain level.

19

20

21

22

23

24

25

Availability: http://github.com/kevsilva/StrainFLAIR26

INTRODUCTION27
The use of reference genomes has shaped the way genomics studies are currently conducted. Reference28
genomes are particularly useful for reference guided genomic assembly, variant calling or mapping29
sequencing reads. For the later, they provide a unique coordinate system to locate variants, allowing30
to work on the same reference and easily share information. However, the usage of reference genomes31
represented as flat sequences reaches some limits (Ballouz et al., 2019).32

Close reference genomes or genomes of strains from the same species show a high sequence similarity.33
Mapping sequencing reads on similar reference genomes results in mis-mapped reads or ambiguous34
alignments generating noise in the downstream analysis, that has yet to be clarified (Na et al., 2016). This35
has led recent methods to provide a representation of multiple genomes as genome graphs, also called36
variation graphs, in which each path is a different known variation. Such graph representations are well37
defined, and tools to build and manipulate graphs are under active development (Garrison et al., 2017;38
Kim et al., 2019; Rakocevic et al., 2019; Li et al., 2020).39

This graph structure provides obvious advantages such as the reduction of the data redundancy, while40
highlighting variations (Garrison et al., 2018). However, it also introduces novel difficulties. Updating41
a graph with novel sequences, adapting existing efficient algorithms for read mapping, and, mainly,42
developing new ways to analyse sequence-to-graph mapping results for downstream analyses are among43
those new challenges. The work presented here primarily focuses on this latest point and proposes to44
show the feasibility of using a variation graph for identifying and estimating abundances, at the strain45

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430979doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430979
http://creativecommons.org/licenses/by/4.0/


level, from an unknown metagenomic read set.46
In the context of metagenomics, representing genomes in graphs is of particular interest for indexing47

microorganism genomes. Microorganisms are predominant in almost every ecosystems from ocean48
water (Sunagawa et al., 2015) to human body (Clemente et al., 2012), and play major functioning roles49
in them (New and Brito, 2020). While studies in microbial ecology are facing a bottleneck due to the50
difficulty of isolating and cultivating most of those microbes in laboratory, preventing the analysis of51
the complex structure and dynamics of the microbial communities (Stewart, 2012), high-throughput52
sequencing in metagenomics offers the opportunity to study a whole ecosystem. In particular, shotgun53
sequencing allows a resolution up to the species level (Jovel et al., 2016), and enable samples analysis in54
terms of population stratification, microbial diversity or bio-markers identification (Quince et al., 2017).55
Understanding of microbial communities structure and dynamics is usually revealed by resolving the56
species present in samples and their relative abundances, which can then be associated with phenotypes,57
notably in the field of human health (Ehrlich, 2011; Vieira-Silva et al., 2020; Solé et al., 2021). Now,58
characterizing samples at the strain level has a growing interest, as it may highlight new associations with59
phenotypes, and a better understanding of the functional impact of strains in host-microbe interactions60
is crucial to new therapeutic strategies and personalized medicine. Escherichia coli, which has a highly61
variable genome, is a well-known example since some strains are harmless commensals in the human62
gut microbiota while others are harmful pathogens (Rasko et al., 2008; Loman et al., 2013). Current63
approaches to handle multiple similar genomes as with strains use gene clustering and then select the64
representative sequence of each cluster, getting rid of the redundancy but also the variations, yet crucial65
to distinguish the strains of a species (Qin et al., 2010). Hence, indexation of a set of known strains is a66
good framework for testing the ability of a variation graph to capture the diversity while offering a way to67
correctly assign sequenced data to the strains they belong to.68

In this work, we present StrainFLAIR, a novel method and its implementation that uses variation69
graph representation of gene sequences for strain identification and quantification. We proposed novel70
algorithmic and statistical solutions for managing ambiguous alignments and computing an adequate71
abundance metric at the graph node level. Results have shown that we could correctly identify and quantify72
strains present in a sample. Notably, we could also identify close strains not present in the reference.73

StrainFLAIR is available at http://github.com/kevsilva/StrainFLAIR.74

METHODS75
We propose here a description of our tool StrainFLAIR (STRAIN-level proFiLing using vArIation76
gRaph). This method exploits various state-of-the-art tools and proposes novel algorithmic solutions77
for indexing bacterial genomes at the strain-level. It also permits to query metagenomes for assessing78
and quantifying their content, in regards to the indexed genomes. An overview of the index and query79
pipelines are presented on Fig. 1.80

Rational for the choice of third-party tools and their detailed usages are given in Supplementary81
Materials, Section S1.1.82

Indexing strains83
Gene prediction84
As non-coding DNA represents 15% in average of bacterial genomes and is not well characterized in85
terms of structure, StrainFLAIR focuses on protein-coding genes in order to characterize strains by86
their gene content and nucleotidic variations of them. Moreover, non-coding DNA regions can be highly87
variable (Thorpe et al., 2017) and taking into account complete genomes would then lead to highly88
complex graphs, and combinatorial explosions when mapping reads. Additionally, complete genomes89
are not always available. Focusing on the genes allows to use also drafts and metagenome-assembled90
genomes or a pre-existing set of known genes (Qin et al., 2010; Li et al., 2014). Hence, StrainFLAIR91
indexes genes instead of complete genomes in graphs.92

Genes are predicted using Prodigal, a tool for prokaryotic protein-coding genes prediction (Hyatt93
et al., 2010).94

Knowing that some reads map at the junction between the gene and intergenic regions, by conserving95
only gene sequences, mapping results are biased towards deletions and drastically lower the mapping96
score. In order to alleviate this situation, we extend the predicted gene sequences at both ends. Hence,97

2/20

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430979doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430979
http://creativecommons.org/licenses/by/4.0/


Figure 1. StrainFLAIR overview. a. Indexation. Input is a set of known reference genomes of
various bacterial species and strains. StrainFLAIR uses a graph for indexing genes of those reference
genomes. b. Read mapping on the previously mentioned graph. c. Mapped reads analysis.
StrainFLAIR assigns and estimates species and strain abundances of a bacterial metagenomic sample
represented as short reads.

StrainFLAIR conserves predicted genes plus their surrounding sequences. By default, and if the98
sequence is long enough, we conserve 75 bp on the left and on the right of each gene.99

Gene clustering100
Genes are clustered into gene families using CD-HIT (Li and Godzik, 2006). For the clustering step, the101
genes without extensions are used in order to strictly cluster according to the exact gene sequences and102
no parts of intergenic regions. CD-HIT-EST is used to realize the clustering with an identity threshold103
of 0.95 and a coverage of 0.90 on the shorter sequence. The local sequence identity is calculated as the104
number of identical bases in alignment divided by the length of the alignment. Sequences are assigned to105
the best fitting cluster verifying these requirements.106

Graph construction107
Each gene family is represented as a variation graph (Fig. 2). Variation graphs are bidirected DNA108
sequence graphs that represents multiple sequences, including their genetic variation. Each node of the109
graph contains sub-sequences of the input sequences, and successive nodes draw paths on the graph.110
Paths corresponding to reference sequences are specifically called “colored paths”. Each colored path111
corresponds to the original sequences of a gene in the cluster.112

Figure 2. Illustration of a variation graph structure and colored paths. Each node of the graph
contains a sub-sequence of the input sequences and is integer-indexed. A path corresponding to an input
sequence is called a colored path, and is encoded by its succession of node ids, e.g. 1,3,5,6 for the colored
path 1 in this example.

3/20

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430979doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430979
http://creativecommons.org/licenses/by/4.0/


In the case of a cluster composed of only one sequence, vg toolkit (Garrison et al., 2017)113
is used to convert the sequence into a flat graph. Alternatively, when a cluster is composed of two114
sequences or more, minimap2 (Li, 2018) is used to generate a multiple sequence alignment. Then115
seqwish (Garrison, 2021) is used to convert this multiple sequence alignment into a variation graph.116
All the so-computed graphs (one per input cluster) are then concatenated to produce a single variation117
graph where each cluster of genes is a connected component.118

The index is created once for a set of reference genomes. Afterward, any set of sequenced reads can119
to be profiled at the strain-level based on this index.120

Querying variation graphs121
Mapping reads122
For mapping reads on the previously described reference graph, we use the sequence-to-graph mapper vg123
mpmap from vg toolkit. It produces a so-called “multipath alignments”. A multipath alignment is a124
graph of partial alignments and can be seen as a sub-graph (a subset of edges and vertices) of the whole125
variation graph (see Fig. 3 for an example). The mapping result describes, for each read, the nodes of the126
variation graph traversed by the alignment and the potential mismatches or indels between the read and127
the sequence of each traversed node.128

Reads attribution129
When mapping a read on a graph with colored path, two key issues arise, as illustrated Fig. 3. As mapping130
generates a sub-graph per mapped read, the most probable mapped path(s) has / have to be defined. In the131
meanwhile, the most probable mapped path(s) corresponding to a colored path also have to be defined.132
Hence we developed an algorithm to analyse and convert, when possible, a mapping result into one or133
several continuous path(s) (successive nodes joined by only one edge) per mapped read. In addition we134
propose an algorithm to attribute such path to most probable colored path(s).135

Path attribution136
A breadth first search on the multipath alignment is proposed. It starts at each node of the alignment137

with a user-defined threshold on the mapping score. A single path alignment with a mapping score138
below this threshold is ignored, and the single path alignment with the best mapping score is retained.139
Additionally, for each alignment, nodes are associated with a so-called “horizontal coverage” value. The140
horizontal coverage of a node by a read corresponds to the proportion of bases of the node covered by the141
read. Hence, a node has an horizontal coverage of 1 if all its nucleotides are covered by the read with or142
without mismatches or indels.143

Because of possible ties in mapping score, the search can result in multiple single path alignments, as144
illustrated Fig. 3(A). This situation corresponds to a read which sequence is found in several different145
genes or to a read mapping onto the similar region of different versions of a gene.146

To take into account ambiguous mapping affectations, as shown below, the parsing of the mapping147
output is decomposed into two steps. The first step processes the reads that mapped only a unique colored148
path (called “unique mapped reads” here), corresponding to a single gene. The second step processes the149
reads with multiple alignments (called “multiple mapped reads” here).150

Colored path attribution151
Once a read is assigned to one or several path alignment(s), it still has to be attributed, if possible, to a152

colored path. The following process attributes each mapped read to a colored path and various metrics for153
downstream analyses are computed. In particular, an absolute abundance for each node of the variation154
graph, called the “node abundance”, is computed, first focusing on unique mapped reads (first step). For a155
given alignment, the successive nodes composing the path are compared to the existing colored paths of156
the variation graph. If the alignment matches part of a colored path, the number of mapped reads on this157
path is incremented by one (i.e. reads raw count). The node abundance for each node of the alignment is158
incremented with its horizontal node coverage defined by this alignment. Alignments with no matching159
colored paths are skipped.160

Then, we focus on multiple mapped reads (second step), as illustrated Fig. 3(B). During this step, the161
alignment matches multiple colored paths. Hence, the abundance is distributed to each matching colored162
path relatively to the ratio between them. This ratio is determined from the reads raw count of each path163
from the first step. For example, if 70 unique mapped reads were found for path1 and 30 for path2 during164
the first step, a read matching ambiguously both path1 and path2 during the second step counts as 0.7 for165

4/20

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430979doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430979
http://creativecommons.org/licenses/by/4.0/


Figure 3. Illustration of the multipath alignment concept and the read attribution process. (A)
Path attribution. The region of the read in blue aligns un-ambiguously to a node of the graph while the
dark and light red parts can either align to the top or the bottom nodes of their respective mapping
localization (due to mismatches that can align on both nodes for example), drawing an alignment as a
sub-graph of the reference variation graph, and thus opening the possibility of four single path alignments.
(B) Colored path attribution. First, from the multipath alignment (all four read sub-paths), the breadth
search finds the possible corresponding single path alignments while respecting the mapping score
threshold imposed by the user. Here, for the example, all four possible paths are considered valid. Second,
each single path is compared to the colored paths from the reference variation graph. Two single path
alignments matched the colored paths (4-6-8 and 5-6-7). As it mapped equally more than one colored
path, this read falls in the multiple mapped reads case and is processed during the second step of the
algorithm.

path1 and 0.3 for path2. This ratio is applied to increment both the raw count of reads and the coverage of166
the nodes.167

Gene-level and strain-level abundances168
StrainFLAIR output is decomposed into an intermediate result describing the queried sample and169
gene-level abundances, and the final result describing the strain-level abundances.170

Gene-level171
After parsing the mapping result, the first output provides information for each colored path, i.e.172

each version of a gene. Thereby, this first result proposes gene-level information including abundances.173
Exhaustive description of these intermediate results is provided in Section S1.2 in Supplementary Materials.174
We describe here three major metrics outputted by StrainFLAIR:175

The mean abundance of the nodes composing the path. Instead of solely counting reads, we make176
full use of the graph structure and we propose abundances computation for each node as previously177
explained, and as already done for haplotype resolution (Baaijens et al., 2019). Hence, for each colored178
path, the gene abundance is estimated by the mean of the nodes abundance.179

In order to not underestimate the abundance in case of a lack of sequencing depth (which could result180
in certain nodes not to be traversed by sequencing reads), the mean abundance without the nodes of181

5/20

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430979doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430979
http://creativecommons.org/licenses/by/4.0/


the path never covered by a read is also outputted.182
The mean abundance with and without these non-covered nodes are computed using unique mapped183

reads only or all mapped reads.184
The ratio of covered nodes, defined as the proportion of nodes from the path which abundance is185

strictly greater than zero.186

Strain-level187
Strain-level abundances are then obtained by exploiting the specific genes of each reference genome188

from these intermediate results. First, for each genome, the proportion of detected genes is computed,189
as the proportion of specific genes on which at least one read maps. Then, the global abundance of the190
genome is computed as the mean or median of all its specific gene abundances. However, if the proportion191
of detected genes is less than a user-defined threshold, the genome is considered absent and hence its192
abundance is set to zero.193

StrainFLAIR final output is a table where each line corresponds to one of the reference genomes,194
containing in columns the proportion of detected specific genes, and our proposed metrics to estimate their195
abundances (using mean or median, with or without never covered nodes as described for the gene-level196
result).197

Results presented Section S1.3 in Supplementary Materials validate and motivate the proposed198
abundance metric by comparing it to the expected abundances and other estimations using linear models.199

RESULTS200
We validated our method on both a simulated and a real dataset. All computations were performed using201
StrainFLAIR, version 0.0.1, with default parameters. The relative abundances estimation was based202
on the mean of the specific gene abundances, computed by taking into account all the nodes (including203
non-covered nodes), and using a threshold on the proportion of detected specific genes of 50%.204

Results were compared to Kraken2 (Wood et al., 2019) considered as one of the state-of-the-art tool205
dedicated to the characterization of read set content, and based on flat sequences as references. Read206
counts given by Kraken2 were normalized by the genome length and converted into relative abundances.207

Computing setup and performances are indicated in Supplementary Materials, Section S1.4.208

Validation on a simulated dataset209
We first validated our method on simulated data, focusing on a single species with multiple strains. Our210
aim was to validate the StrainFLAIR ability to identify and quantify strains given sequencing data211
from a mixture of several strains of uneven abundances, and with one of them absent from the index.212

Reference variation graph213
We selected complete genomes of Escherichia coli, a predominant aerobic bacterium in the gut micro-214
biota (Tenaillon et al., 2010), and a species known for its phenotypic diversity (pathogenicity, antibiotics215
resistance) mostly resulting from its high genomic variability (Dobrindt, 2005).216

Eight strains of E. coli were selected for this experiment from the NCBI1. Seven were used to construct217
a variation graph (E. coli IAI39, O104:H4 str. 2011C-3493, str. K-12 substr. MG1655, SE15, O157:H16218
str. Santai, O157:H7 str. Sakai, O26 str. RM8426), and one was used as an unknown strain in a strains219
mixture (E. coli BL21-DE3).220

Mixtures and sequencing simulations221
Our aim was to simulate the co-presence of several E. coli strains. Two simulations with sequencing222
errors were conducted in order to highlight the detection and quantification of strains in a mixture. For223
each one, we tested our approach with various read coverage, as described below.224

We simulated the sequencing of three strains to mimic complex single species composition in225
metagenomic samples. One of the strain was in equal abundance of one of the two others, potentially226
making it more difficult to distinguish, or in lower abundance, potentially making it more difficult to227
detect at all. The first simulation was a mixture composed of three strains contributing in the reference228
graph: E. coli O104:H4 2011c-3493, IAI39, and K-12 MG1655. The second simulation was a mixture229
composed of three strains: E. coli O104:H4 2011c-3493, IAI39, and BL21-DE3. The later being absent230
from the reference variation graph thus simulating a new strain to be identified and quantified.231

1https://www.ncbi.nlm.nih.gov/genome/?term=txid562[orgn]

6/20

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430979doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430979
http://creativecommons.org/licenses/by/4.0/


For both simulations, short sequencing reads of 150 bp were simulated using vg sim from vg232
toolkit with a probability of errors set to 0.1% : 300,000 reads for E. coli O104:H4 2011c-3493233
(representing ≈8.5x), 200,000 reads for E. coli IAI39 (representing ≈5.8x). For both simulations, various234
quantities of reads were generated for K-12 MG1655 or BL21-DE3: 200,000, 100,000, 50,000, 25,000,235
10,000, 5,000 or 1,000 reads, representing approximately 6.5x, 3x, 1.6x, 0.8x, 0.3x, 0.2x, and 0.03x236
respectively for these two strains.237

Strain-level abundances238
As explained in Methods, we computed the strain-level abundances using the specific gene-level abundance239
table obtained by mapping the simulated reads onto the variation graph. We compared our results to the240
expected simulated relative abundances.241

#reads
K-12

Method O104:H4 IAI39 K-12 Sakai SE15 Santai RM8426

Expected 59.88 39.92 0.2 0 0 0 0
1,000 StrainFLAIR 56.45 43.55 0 0 0 0 0

Kraken2 38.91 60.72 0.22 0.04 0.07 0.03 0.02
Expected 57.14 38.1 4.76 0 0 0 0

25,000 StrainFLAIR 52.1 40.58 7.32 0 0 0 0
Kraken2 37.23 58.1 4.51 0.04 0.07 0.03 0.02
Expected 42.86 28.57 28.57 0 0 0 0

200,000 StrainFLAIR 38.12 29.83 32.05 0 0 0 0
Kraken2 28.31 44.18 27.35 0.04 0.08 0.03 0.02

Table 1. Reference strains relative abundances expected and computed by StrainFLAIR or
Kraken2 for each simulated experiment with variable coverage of the K-12 MG1655 strain. Best
results are shown in bold. Complete results are presented Section S1.6 in Supplementary Materials.

Simulation 1: mixtures with K-12 MG1655, present in the reference graph242
StrainFLAIR successfully estimated the relative abundances of the three strains present in the243

mixture (Table 1), the sum of squared errors between the estimation given by our tool and the expected244
relative abundance was between 25 and 45 for all the experiments. However, it did not detect the very245
low abundant strain in the case of the mixture with 1,000 simulated reads for K-12 MG1655 (coverage of246
≈0.03x). With our methodology, the threshold on the proportion of detected genes (see Methods) lead247
to set relative abundance to zero of likely absent strains. This reduces both the underestimation of the248
relative abundances of the present strains and the overestimation of the absent strains.249

In comparison, Kraken2 did not provide this resolution. Applied to our simulated mixtures, while250
Kraken2 was slightly better for K-12 MG1655 abundance estimation, it overestimated IAI39 relative251
abundance and underestimated O104’s one, leading to an overall higher sum of squared errors (between252
456 and 872) compared to the expected abundances. Moreover, it set relative abundances to all the seven253
reference strains whereas four of them were absent from the mixture. This was expected as some reads254
(from intergenic regions for example) can randomly be similar to regions of genes from absent strains.255

Simulation 2: mixtures with BL21-DE3, absent from the reference graph256
Here, BL21-DE3 was considered an unknown strain, not contributing to the variation graph. The closest257

strain of BL21-DE3 in the graph, according to fastANI (Jain et al., 2018), was K-12 MG1655 (98.9%258
of identity, see Supplementary Materials, Section S1.5). Thus we expected to find signal of BL21-DE3259
through the results on K-12 MG1655.260

As with the K-12 MG1655 mixtures, StrainFLAIR successfully estimated the relative abundances261
of the two known strains present in the mixture (Table 2), the sum of squared errors between the estimation262
given by our tool and the expected relative abundance was between 22 and 180 for all the experiments.263
Labelled as K-12, it also gave close estimations for BL21-DE3. Again, it did not detect the very low264
abundant strain in the case of the mixture with 1,000, 5,000, and 10,000 simulated reads for BL21-DE3.265
Also similarly to the K-12 MG1655 mixtures experiments, Kraken2 overestimated IAI39 relative266
abundance and underestimated O104’s one (sum of squared errors between 751 and 873), even less267

7/20

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430979doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430979
http://creativecommons.org/licenses/by/4.0/


#reads
BL21-DE3

Method O104:H4 IAI39 K-12 Sakai SE15 Santai RM8426

Expected 59.88 39.92 (0.2) 0 0 0 0
1,000 StrainFLAIR 56.47 43.53 0 0 0 0 0

Kraken2 38.93 60.76 0.11 0.05 0.08 0.04 0.03
Expected 57.14 38.1 (4.76) 0 0 0 0

25,000 StrainFLAIR 54.09 41.71 4.2 0 0 0 0
Kraken2 37.75 58.93 2.16 0.28 0.34 0.25 0.29
Expected 42.86 28.57 (28.57) 0 0 0 0

200,000 StrainFLAIR 46.95 35.34 17.72 0 0 0 0
Kraken2 31.14 48.83 13.53 1.57 1.67 1.58 1.68

Table 2. Reference strain relative abundances expected and computed by StrainFLAIR or
Kraken2 for each simulated experiment with variable coverage of the BL21-DE3 strain, absent
from the reference variation graph. BL21-DE3 strain expected abundances are given in parentheses in
the K-12 column. Best results are shown in bold. Complete results are presented Section S1.6 in
Supplementary Materials.

precisely than in the previous experiment. With sufficient coverage (here from the 0.8x for BL21-DE3),268
StrainFLAIR was closer to the expected values for all the reference strains than Kraken2.269

Interestingly, the proportion of detected specific genes for each strain (Fig. 4) seems to highlight a270
pattern allowing to distinguish present strains, absent strains and likely new strains close to the reference271
in the graph. According to the experiments with enough coverage (from 25,000 simulated reads for272
BL21-DE3), three groups of proportions could be observed: proportion of almost 100% (O104:H4 and273
IAI39 : strains present in the mixtures and in the reference graph), proportion under 30-35% (Sakai, SE15,274
Santai, and RM8426 : strains absent from the mixtures), and an in-between proportion around 60-70% for275
K-12 MG1655 (closest strain to BL21-DE3).276

It was expected that an absent strain would have specific genes detected as StrainFLAIR detects a277
gene once only one read mappped on it. However, all absent strains had a proportion at around 30% except278
K-12 MG1655 which proportion was twice higher. Conjointly with the non-null abundance estimated for279
the reference K-12 MG1655, this suggests the presence of a new strain whose genome is highly similar to280
K-12 MG1655.281

Validation on a real dataset282
We used a mock dataset available on EBI-ENA repository under accession number PRJEB42498, in order283
to validate our method on real sequencing data from samples composed of various species and strains.284
The mock dataset is composed of 91 strains of bacterial species for which complete genomes or sets of285
contigs are available, including plasmids. Among the species, two of them contained each two different286
strains. Three mixes had been generated from the mock, and we used the “Mix1A” in the following287
results.288

Even though 20 out of 91 strains were absents in this mix, we indexed the full set of 91 genomes.289
This was done in order to mimic a classical StrainFLAIR use case where the queried data is mainly290
unknown, and the reference graph contains species or strains not existing in these queried data. The291
metagenomic sample was sequenced using Illumina HiSeq 3000 technology and resulted in 21,389,196292
short paired-end reads.293

We compared our results to the expected abundances of each strain in the sample defined as the294
theoretical experimental DNA concentration proportion. As such, it has to be noted that potential295
contamination and/or experimental bias could have occurred and affected the expected abundances.296

Strain detection297
Among the 91 strains used in the reference variation graph, StrainFLAIR detected 65 strains. All of298

these 65 strains were indeed sequenced in Mix1A. Hence, StrainFLAIR produced no false positive.299
From the 26 strains considered absent by StrainFLAIR, 20 were not present in the sample (true300
negatives) and 6 should have been detected (false negatives). However, the term false negative has to be301
soften as the ground truth remains uncertain. Among those 6 undetected strains, all of them had theoretical302
abundance below 0.1%.303

8/20

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430979doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430979
http://creativecommons.org/licenses/by/4.0/


Figure 4. Proportion of detected specific genes for each simulated experiment with variable
coverage of the BL21-DE3 strain, absent from the reference graph.

More precisely, among the 6 strains undetected by StrainFLAIR, 5 had some detected genes,304
but below the 50% threshold. In this case, by default, StrainFLAIR discards these strains. Finally,305
only one of the undetected strains (Desulfovibrio desulfuricans ND 132) should have been theoretically306
detected (even if its expected coverage was below 0.1%), but no specific gene was identified. Considering307
that StrainFLAIR uses a permissive definition of detected gene (at least one read maps on the gene),308
having strictly no specific genes detected for Desulfovibrio desulfuricans ND 132 suggests that this strain309
might in fact be absent from Mix1A. This is also supported by the result from Kraken2 which estimated310
a relative abundance of ≈ 9e−5, almost 500 times lower than the theoretical result.311

As in the simulated dataset validation, Kraken2 affected non-null abundances to all the references312
and thus could not be used to definitely conclude on presence/absence of strains in the sample.313

Strain relative abundances314
For the estimated relative abundances, StrainFLAIR gave more similar results compared to the315

state-of-the-art tool Kraken2 than the experimental values (Fig. 5). The sum of squared error between316
StrainFLAIR and Kraken2 was around 11. StrainFLAIR and Kraken2 gave similar results317
compared to the experimental values, with sum of squared errors of around 209 and 211 respectively.318

Interestingly, Thermotoga petrophila RKU-1 is the only case where results from StrainFLAIR319
and Kraken2 differs greatly, with, in addition, the theoretical abundance being in-between. Moreover,320
Thermotoga sp. RQ2 is the strain expected to be absent that Kraken2 estimates with the highest relative321
abundance among the other expected absent strains, and the only one exceeding the relative abundances322
of two present strains. Considering the previous results on the simulated mixtures and that Thermotoga323

9/20

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430979doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430979
http://creativecommons.org/licenses/by/4.0/


Figure 5. Experimental relative abundance compared to relative abundance as computed by
StrainFLAIR and Kraken2. A selection of relevant results is shown here, see Supplementary
Materials (Section S1.7) for the complete results. (A) Represents a case where StrainFLAIR and
Kraken2 give similar results to the experimental value (18 cases over 91). (B) Represents a case where
StrainFLAIR and Kraken2 give similar results, but lower than the experimental value (26 cases over
91). (C) Represents a case where StrainFLAIR and Kraken2 give similar results, but greater than
the experimental value (16 cases over 91). (D, E, F, G) Represent the two species represented by two
strains each. (H, I) Represent two atypical cases.

petrophila RKU-1 and Thermotoga sp. RQ2 are close species (fastANI around 96.6%) it could be an324
additional indicator of how tools like Kraken2 can be mislead by too close species or strains.325

In the sample, the species Methanococcus maripaludis was represented by two strains (S2 and C5) and326
the species Shewanella baltica likewise (OS223 and OS185). StrainFLAIR successfully distinguished327
and estimated the relative abundances of each strain of these two genomes. In this very situation and328
contrary to results on E. coli strains, Kraken2 was also able to correctly estimate the abundances.329

DISCUSSION330
Recent advances in sequencing technologies have provided large reference genome resources. Represen-331
tation and integration of those multiple genomes, often highly similar, are under active development and332
led to genome graphs based tools. Integrating multiple genomes from the same species is particularly333
interesting as it provides new opportunities to characterize strains, a key resolution, for instance opening334
the field of precision medicine (Albanese and Donati, 2017; Marchesi et al., 2016).335

In this context, we developed StrainFLAIR, a new computational approach for strain level profiling336
of metagenomic samples, using variation graphs for representing all reference genomes. Our intention was337

10/20

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430979doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430979
http://creativecommons.org/licenses/by/4.0/


in the one hand to test whether or not indexing highly similar genomes in a graph enables to characterize338
queried samples at the strain level, and, in the other hand, to provide a end-user tool able to perform the339
indexation of genomes and the query of reads including the analyses of mapping results.340

The method exploits state-of-the art-tools additionally to novel algorithmic and statistical solutions.341
By indexing microbial species and/or strains in a graph, it enables the identification and quantification of342
strains from a sequenced sample, mapped onto this graph.343

We have demonstrated on simulated and on real datasets the ability of our method to identify and cor-344
rectly estimate the abundance of microbial strains in metagenomic samples. In addition, StrainFLAIR345
was able to highlight the presence and also to estimate a relative abundance for a strain similar to existing346
references, but absent from these references.347

We also showed that StrainFLAIR tended to set to zero the predicted abundance of low abundant348
strains, while a tool like Kraken2 was able detect them. As a result, it seemed that StrainFLAIR349
looses the ability to detect very low abundant strains. However, in our simulations, this situation350
corresponded to coverages of 0.03x or less, hence simulating a strain for which not all genomic content351
was present. Eventually, it might be more relevant to define this strain as absent. Overall, there is a need to352
distinguish between low abundant strains, insufficient sequencing depth, and reads from intergenic regions353
or other genes randomly matching genes. In this regard, StrainFLAIR integrated a threshold on the354
proportion of specific genes detected that can be further explored to refine which strain abundances are set355
to zero. Importantly, results also showed that our graph-based tool had no false positive call, contrary to356
general purpose tool Kraken2 that detected 100% of strains that were indexed but absent from queried357
reads.358

From the validation on real datasets, we showed that StrainFLAIR was still able to correctly359
estimate the relative abundances in a more complex context mixing both different species and different360
strains, without being biased by references absent in the sample.361

Our methodology taking into account all mapped reads and imposing a threshold that sets some strains362
abundances to zero seems more adequate and closer to what is expected in reality. Moreover, being able363
to detect some queried strains as absent is particularly interesting in the metagenomics context. Unlike364
mock datasets that are of controlled and known compositions, no prior knowledge is available for real365
metagenomic samples. They require the most exhaustive references - including unnecessary genomes -366
hence strains absent from the sample. StrainFLAIR is a new step towards the objective to take into367
account those unnecessary genomes without biasing the downstream analysis.368

Measured computation time performances show that StrainFLAIR enables to analyse million reads369
in a few hours. Even if this opens the doors to routine analyses of small read sets, new development370
efforts will have to be made for reducing computation time in order to scale-up to very large datasets.371

While StrainFLAIR focuses on profiling metagenomic samples at the strain level based on genes, it372
opens the way to pangenomic studies. Genome graphs are used to capture all the information on variation373
or similarity of sequences, which is particularly adapted to represent the gene repertoire diversity and the374
set of nucleotidic variations found between the different genomes of a species. This work highlights the375
importance to keep up working on pangenome graph representation.376

The presence of queried unknown strain(s) is revealed both by reads mapping non-colored paths and377
by the amount of nucleotidic variations (indels and substitutions). The natural continuation will be related378
to the dynamical update of the graph when novel strains are detected in this way. This dynamicity will also379
be particularly useful considering the future flow of new sequenced metagenomes and the development of380
clinical metagenomics that will help to quickly and efficiently characterize in silico emerging strains of381
human health interest.382

ACKNOWLEDGMENTS383
This work used the GenOuest bioinformatics core facility (https://www.genouest.org).384

We acknowledge Mircea Podar for the providing of the mock dataset in premium access. Finally, we385
thank Mahendra Mariadassou, Rayan Chikhi, Olivier Jaillon and David Vallenet for all their advice along386
this work.387

11/20

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430979doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430979
http://creativecommons.org/licenses/by/4.0/


REFERENCES388

Albanese, D. and Donati, C. (2017). Strain profiling and epidemiology of bacterial species from metage-389
nomic sequencing. Nature Communications, 8(1):1–14.390

Baaijens, J. A., der Roest, B. V., Köster, J., Stougie, L., and Schönhuth, A. (2019). Full-length de novo391
viral quasispecies assembly through variation graph construction. bioRxiv, page 287177.392

Ballouz, S., Dobin, A., and Gillis, J. (2019). Is it time to change the reference genome? bioRxiv, page393
533166.394

Clemente, J. C., Ursell, L. K., Parfrey, L. W., and Knight, R. (2012). The impact of the gut microbiota on395
human health: An integrative view.396

Dobrindt, U. (2005). (Patho-)Genomics of Escherichia coli.397
Ehrlich, S. D. (2011). MetaHIT: The European Union project on metagenomics of the human intestinal398

tract. In Metagenomics of the Human Body, pages 307–316. Springer New York.399
Garrison, E. (2021). ekg/seqwish: alignment to variation graph inducer. https://github.com/400
ekg/seqwish.401

Garrison, E., Novak, A., Hickey, G., Eizenga, J., Dawson, E., Jones, W., Buske, O., and Lin, M. (2017).402
Sequence variation aware references and read mapping with vg : the variation graph toolkit. bioRxiv.403

Garrison, E., Sirén, J., Novak, A. M., Hickey, G., Eizenga, J. M., Dawson, E. T., Jones, W., Garg, S.,404
Markello, C., Lin, M. F., Paten, B., and Durbin, R. (2018). Variation graph toolkit improves read405
mapping by representing genetic variation in the reference.406

Hyatt, D., Chen, G. L., LoCascio, P. F., Land, M. L., Larimer, F. W., and Hauser, L. J. (2010). Prodigal:407
Prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics, 11:119.408

Jain, C., Rodriguez-R, L. M., Phillippy, A. M., Konstantinidis, K. T., and Aluru, S. (2018). High throughput409
ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nature Communications,410
9(1):1–8.411

Jovel, J., Patterson, J., Wang, W., Hotte, N., O’Keefe, S., Mitchel, T., Perry, T., Kao, D., Mason, A. L.,412
Madsen, K. L., and Wong, G. K. (2016). Characterization of the gut microbiome using 16S or shotgun413
metagenomics. Frontiers in Microbiology, 7(APR):459.414

Kim, D., Paggi, J. M., Park, C., Bennett, C., and Salzberg, S. L. (2019). Graph-based genome alignment415
and genotyping with HISAT2 and HISAT-genotype. Nature Biotechnology, 37(8):907–915.416

Li, H. (2018). Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34(18):3094–417
3100.418

Li, H., Feng, X., and Chu, C. (2020). The design and construction of reference pangenome graphs with419
minigraph. Genome Biology, 21(1):265.420

Li, J., Wang, J., Jia, H., Cai, X., Zhong, H., Feng, Q., Sunagawa, S., Arumugam, M., Kultima, J. R.,421
Prifti, E., Nielsen, T., Juncker, A. S., Manichanh, C., Chen, B., Zhang, W., Levenez, F., Wang, J., Xu,422
X., Xiao, L., Liang, S., Zhang, D., Zhang, Z., Chen, W., Zhao, H., Al-Aama, J. Y., Edris, S., Yang,423
H., Wang, J., Hansen, T., Nielsen, H. B., Brunak, S., Kristiansen, K., Guarner, F., Pedersen, O., Doré,424
J., Ehrlich, S. D., and Bork, P. (2014). An integrated catalog of reference genes in the human gut425
microbiome. Nature Biotechnology, 32(8):834–841.426

Li, W. and Godzik, A. (2006). Cd-hit: a fast program for clustering and comparing large sets of protein or427
nucleotide sequences. Bioinformatics, 22(13):1658–1659.428

Loman, N. J., Constantinidou, C., Christner, M., Rohde, H., Chan, J. Z.-M., Quick, J., Weir, J. C., Quince,429
C., Smith, G. P., Betley, J. R., Aepfelbacher, M., and Pallen, M. J. (2013). A Culture-Independent430
Sequence-Based Metagenomics Approach to the Investigation of an Outbreak of Shiga-Toxigenic431
Escherichia coli O104:H4. JAMA, 309(14):1502.432

Marchesi, J. R., Adams, D. H., Fava, F., Hermes, G. D., Hirschfield, G. M., Hold, G., Quraishi, M. N.,433
Kinross, J., Smidt, H., Tuohy, K. M., Thomas, L. V., Zoetendal, E. G., and Hart, A. (2016). The gut434
microbiota and host health: A new clinical frontier. Gut, 65(2):330–339.435

Na, J. C., Kim, H., Park, H., Lecroq, T., Léonard, M., Mouchard, L., and Park, K. (2016). FM-index of436
alignment: A compressed index for similar strings. Theoretical Computer Science, 638:159–170.437

New, F. N. and Brito, I. L. (2020). What Is Metagenomics Teaching Us, and What Is Missed?438
Paten, B., Eizenga, J. M., Rosen, Y. M., Novak, A. M., Garrison, E., and Hickey, G. (2018). Superbubbles,439

Ultrabubbles, and Cacti. In Journal of Computational Biology, volume 25, pages 649–663. Mary Ann440
Liebert Inc.441

Qin, J., Li, R., Raes, J., Arumugam, M., Burgdorf, K. S., Manichanh, C., Nielsen, T., Pons, N., Levenez,442

12/20

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430979doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430979
http://creativecommons.org/licenses/by/4.0/


F., Yamada, T., Mende, D. R., Li, J., Xu, J., Li, S., Li, D., Cao, J., Wang, B., Liang, H., Zheng, H., Xie,443
Y., Tap, J., Lepage, P., Bertalan, M., Batto, J.-M., Hansen, T., Le Paslier, D., Linneberg, A., Nielsen,444
H. B., Pelletier, E., Renault, P., Sicheritz-Ponten, T., Turner, K., Zhu, H., Yu, C., Li, S., Jian, M., Zhou,445
Y., Li, Y., Zhang, X., Li, S., Qin, N., Yang, H., Wang, J., Brunak, S., Doré, J., Guarner, F., Kristiansen,446
K., Pedersen, O., Parkhill, J., Weissenbach, J., MetaHIT Consortium, M., Bork, P., Ehrlich, S. D.,447
and Wang, J. (2010). A human gut microbial gene catalogue established by metagenomic sequencing.448
Nature, 464(7285):59–65.449

Quince, C., Walker, A. W., Simpson, J. T., Loman, N. J., and Segata, N. (2017). Shotgun metagenomics,450
from sampling to analysis.451

Rakocevic, G., Semenyuk, V., Lee, W. P., Spencer, J., Browning, J., Johnson, I. J., Arsenijevic, V., Nadj, J.,452
Ghose, K., Suciu, M. C., Ji, S. G., Demir, G., Li, L., Toptaş, B., Dolgoborodov, A., Pollex, B., Spulber,453
I., Glotova, I., Kómár, P., Stachyra, A. L., Li, Y., Popovic, M., Källberg, M., Jain, A., and Kural, D.454
(2019). Fast and accurate genomic analyses using genome graphs. Nature Genetics, 51(2):354–362.455

Rasko, D. A., Rosovitz, M. J., Myers, G. S., Mongodin, E. F., Fricke, W. F., Gajer, P., Crabtree, J.,456
Sebaihia, M., Thomson, N. R., Chaudhuri, R., Henderson, I. R., Sperandio, V., and Ravel, J. (2008).457
The pangenome structure of Escherichia coli: Comparative genomic analysis of E. coli commensal and458
pathogenic isolates. Journal of Bacteriology, 190(20):6881–6893.459

Solé, C., Guilly, S., Da Silva, K., Llopis, M., Le-Chatelier, E., Huelin, P., Carol, M., Moreira, R.,460
Fabrellas, N., De Prada, G., Napoleone, L., Graupera, I., Pose, E., Juanola, A., Borruel, N., Berland,461
M., Toapanta, D., Casellas, F., Guarner, F., Doré, J., Solà, E., Ehrlich, S. D., and Ginès, P. (2021).462
Alterations in Gut Microbiome in Cirrhosis as Assessed by Quantitative Metagenomics: Relationship463
With Acute-on-Chronic Liver Failure and Prognosis. Gastroenterology, 160(1):206–218.e13.464

Stewart, E. J. (2012). Growing unculturable bacteria.465
Sunagawa, S., Coelho, L. P., Chaffron, S., Kultima, J. R., Labadie, K., Salazar, G., Djahanschiri, B., Zeller,466

G., Mende, D. R., Alberti, A., Cornejo-Castillo, F. M., Costea, P. I., Cruaud, C., D’Ovidio, F., Engelen,467
S., Ferrera, I., Gasol, J. M., Guidi, L., Hildebrand, F., Kokoszka, F., Lepoivre, C., Lima-Mendez, G.,468
Poulain, J., Poulos, B. T., Royo-Llonch, M., Sarmento, H., Vieira-Silva, S., Dimier, C., Picheral, M.,469
Searson, S., Kandels-Lewis, S., Boss, E., Follows, M., Karp-Boss, L., Krzic, U., Reynaud, E. G., Sardet,470
C., Sieracki, M., Velayoudon, D., Bowler, C., De Vargas, C., Gorsky, G., Grimsley, N., Hingamp, P.,471
Iudicone, D., Jaillon, O., Not, F., Ogata, H., Pesant, S., Speich, S., Stemmann, L., Sullivan, M. B.,472
Weissenbach, J., Wincker, P., Karsenti, E., Raes, J., Acinas, S. G., and Bork, P. (2015). Structure and473
function of the global ocean microbiome. Science, 348(6237).474

Tenaillon, O., Skurnik, D., Picard, B., and Denamur, E. (2010). The population genetics of commensal475
Escherichia coli.476

Thorpe, H. A., Bayliss, S. C., Hurst, L. D., and Feil, E. J. (2017). Comparative analyses of selection477
operating on nontranslated intergenic regions of diverse bacterial species. Genetics, 206(1):363–376.478

Vieira-Silva, S., Falony, G., Belda, E., Nielsen, T., Aron-Wisnewsky, J., Chakaroun, R., Forslund, S. K.,479
Assmann, K., Valles-Colomer, M., Nguyen, T. T. D., Proost, S., Prifti, E., Tremaroli, V., Pons, N.,480
Le Chatelier, E., Andreelli, F., Bastard, J. P., Coelho, L. P., Galleron, N., Hansen, T. H., Hulot, J. S.,481
Lewinter, C., Pedersen, H. K., Quinquis, B., Rouault, C., Roume, H., Salem, J. E., Søndertoft, N. B.,482
Touch, S., Alves, R., Amouyal, C., Galijatovic, E. A. A., Barthelemy, O., Batisse, J. P., Berland, M.,483
Bittar, R., Blottière, H., Bosquet, F., Boubrit, R., Bourron, O., Camus, M., Cassuto, D., Ciangura,484
C., Collet, J. P., Dao, M. C., Debedat, J., Djebbar, M., Doré, A., Engelbrechtsen, L., Fellahi, S.,485
Fromentin, S., Giral, P., Graine, M., Hartemann, A., Hartmann, B., Helft, G., Hercberg, S., Hornbak,486
M., Isnard, R., Jaqueminet, S., Jørgensen, N. R., Julienne, H., Justesen, J., Kammer, J., Kerneis, M.,487
Khemis, J., Krarup, N., Kuhn, M., Lampuré, A., Lejard, V., Levenez, F., Lucas-Martini, L., Massey,488
R., Maziers, N., Medina-Stamminger, J., Moitinho-Silva, L., Montalescot, G., Moutel, S., Le Pavin,489
L. P., Poitou-Bernert, C., Pousset, F., Pouzoulet, L., Schmidt, S., Silvain, J., Svendstrup, M., Swartz, T.,490
Vanduyvenboden, T., Vatier, C., Verger, E., Walther, S., Dumas, M. E., Ehrlich, S. D., Galan, P., Gøtze,491
J. P., Hansen, T., Holst, J. J., Køber, L., Letunic, I., Nielsen, J., Oppert, J. M., Stumvoll, M., Vestergaard,492
H., Zucker, J. D., Bork, P., Pedersen, O., Bäckhed, F., Clément, K., and Raes, J. (2020). Statin therapy493
is associated with lower prevalence of gut microbiota dysbiosis. Nature, 581(7808):310–315.494

Wood, D. E., Lu, J., and Langmead, B. (2019). Improved metagenomic analysis with Kraken 2. Genome495
Biology, 20(1):257.496

13/20

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430979doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430979
http://creativecommons.org/licenses/by/4.0/


S1 SUPPLEMENTARY MATERIALS497
S1.1 Third-party tools usage and rational498
We propose here a the motivations and precise usage of the third-party tools that are employed in499
StrainFLAIR.500

S1.1.1 Graph construction501
vg toolkit allows to modify the graph including a normalization step. Normalization consists in502
deleting redundant nodes (nodes containing the same sub-sequence and having the same parent and child503
nodes), removing edges that do not introduce new paths, and merging nodes separated by only one edge.504

For each cluster, if the colored paths of the corresponding graph still describe their respective input505
sequences, the graph is normalized.506

After the concatenation of all computed graphs (one for each cluster), the final single variation graph507
is indexed using vg toolkit. Indexing a graph allows a fast querying of the graph when mapping508
reads. Indexation uses two file formats: XG, which is a succinct graph index which presents a static509
index of nodes, edges and paths of a variation graph, and GCSA, a generalized FM-index to directed510
acyclic graphs. A SNARLS file is also generated, describing snarls (a generalization of the superbubble511
concept (Paten et al., 2018)) in the variation graph and similarly allowing faster querying.512

S1.1.2 Mapping reads513
vg toolkit offers two sequence-to-graph mappers. The first one, vg map, outputs one or several514
final paths for each alignment. However, in case of several alignments with equal mapping scores, only515
one is randomly chosen. In order to get more exhaustive and accurate results, StrainFLAIR uses vg516
mpmap to map reads on the variation graph.517

The mapping results are given in GAMP format, then converted into JSON format with vg toolkit,518
describing, for each read, the nodes of the graph traversed by the alignment.519

S1.2 Gene-level output by StrainFLAIR520
Here we present the exhaustive description of information provided by StrainFLAIR at the gene level521
(before strain-level computations). For each colored path StrainFLAIR provides the following items:522

• The corresponding gene identifier.523

• For each reference genome, the number of copies of the gene. Since each unique version of a gene524
is represented once in the graph, whereas it can exist in several copies in the genome (duplicate525
genes), the counts and abundances computed correspond to the sum of those copies. Keeping track526
of the number of copies is important to normalize the counts.527

• The cluster identifier to which the colored path belongs.528

• For unique mapped reads: their raw number and their number normalized by the sequence length529
(see Section Querying variation graphs in Methods).530

• For unique plus multiple mapped reads: their raw number and their number normalized by the531
sequence length (see Section Querying variation graphs in Methods).532

• The mean abundance of the nodes composing the path, as defined in the manuscript.533

• The mean abundance without the nodes of the path never covered by a read, as defined in the534
manuscript.535

• The ratio of covered nodes, as defined in the manuscript.536

S1.3 Abundance metrics validation537
The output of StrainFLAIR provides several metrics to estimate the abundance of the genes detected538
in the sample.539

For validation, we used a combination of LASSO (least absolute shrinkage and selection operator)540
model and linear model on the simulated dataset to estimate the abundances at the strain-level, as the541
abundance of a gene is a linear combination of the abundances of the strains it belongs to. As such,542

14/20

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430979doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430979
http://creativecommons.org/licenses/by/4.0/


we expect no intercept value for those models and have forced the intercept at zero for the following543
modeling.544

First, a LASSO model was used to perform strain selection. The response variable of the model was545
the presence or absence of the genes according to the selected metric while the strains, described as their546
genes content (number of copies), were the predictors. Then, a linear model was constructed with the547
raw selected metric as the response variable, and only the strains selected by the LASSO model as the548
predictors. The estimate of the strains relative abundance was thus the coefficients of the linear model549
associated to the strains and transformed into relative values. For each metric, the sum of squared errors550
between the real relative abundances and the estimated relative abundances from the linear model was551
computed. The best metric was then defined as the one minimizing this sum of squared errors.552

For the mixtures containing E. coli K-12 MG1655, the three expected strains were selected and thus553
detected using LASSO, except for the mixture containing only 1,000 reads of K-12 MG1655 (representing554
0.002% of the mixture, hence very negligible). For all the mixtures, the best metric was the mean555
abundance computed from the node abundances and by taking into account the multiple mapped reads.556

For the mixtures containing E. coli BL21-DE3, BL21-DE3 being absent from the reference but very557
close to K-12 MG1655, we expected to get some detection of K-12 in the results. The three expected558
strains were selected and thus detected using LASSO, except for the mixture containing only 1,000 reads559
of BL21-DE3 (representing 0.002% of the mixture, hence very negligible). For the mixtures at 200,000,560
100,000, and 50,000 reads of BL21-DE3, the best metric was the mean abundance computed from the561
node abundances without the abundances at zero, and by taking into account the multiple mapped reads.562
While for the others, the best metric was the mean abundance computed from the node abundances563
(including the abundances at zero), and by taking into account the multiple mapped reads.564

This approach using linear models was particularly appropriate for this situation where the reference565
variation graph and the sample contained a small number of strains and thus a small number of predictors566
for the model. However, this can hardly transpose to a whole metagenomic sample with various species567
and various strains that would lead to too many predictors and probably confusing the heuristics behind568
the models. This was confirmed by applying the same methodology above on the mock dataset leading569
to abundances estimation hardly comparable to expected. Compared to Kraken2 results, the sum of570
squared errors of our methodology was approximately 6 whereas for the results with the LASSO model it571
was around 236. Nevertheless, those results highlighted the relevance of (i) using a metric taking into572
account the multiple mapped reads and not only the unique mapped reads, and (ii) using our metric of573
abundance based on the node abundances over raw read counts.574

15/20

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430979doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430979
http://creativecommons.org/licenses/by/4.0/


S1.4 Performances575
Our benchmarks were performed on the GenOuest platform on a machine with 48 Xeon E5-2670 2.30576
GHz with 500 GB of memory and 16 CPUs. Time results (Table S1) are the wall-clock times. We577
provided rough computation time, mainly in the purpose to show that StrainFLAIR can be applied on578
usual datasets.579

Dataset Step Items processed Time Disk used (GB) Max mem. (GB)
Gene prediction 7 genomes 0m20 0 1.2
Gene clustering 34,011 genes 0m22 0 0.36
Graph construction 8,596 clusters 2m44 0.04 1.31
Graph concatenation 8,596 graphs 0m51 0 0.25

Simulated Graph indexation 1 graph 6m23 0.16 4.24
Mapping reads 350,000 short reads 15m15 0.16 0.99
JSON conversion 1 GAMP file 3m58 4.2 0.03
JSON parsing 1 JSON file + 1 GFA file + 1 pickle file 12m44 0 0.55
Abundance computing 1 Gene abundances table 0m2 0 0.04
Gene prediction 91 genomes 1m43 1.02 6.7
Gene clustering 280,174 genes 3m38 0.14 0.98
Graph construction 270,712 clusters 41m54 1.12 9.1
Graph concatenation 270,712 graphs 14m38 0 1.05

Mock Graph indexation 1 graph 75m19 1.98 30.4
Mapping reads 21,389,196 short read pairs 147m28 7 17.5
JSON conversion 1 GAMP file 53m21 75 0.12
JSON parsing 1 JSON file + 1 GFA file + 1 pickle file 110m44 0 5.7
Abundance computing 1 Gene abundances table 0m4 0 0.68

Table S1. StrainFLAIR performances on simulated and mock datasets.

16/20

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430979doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430979
http://creativecommons.org/licenses/by/4.0/


S1.5 Distance between the selected genomes in the simulated experiment580
We estimated the distance between the complete genomes of the selected strains using fastANI (Average581
Nucleotide Identity). FastANI uses an alignment-free algorithm to estimate the average nucleotide identity582
between pairs of sequences.583

K-12 IAI39 O104:H4 Sakai SE15 Santai BL21-DE3 RM8426
K-12 100 97.0652 98.3769 97.8703 96.8716 98.0362 98.9365 98.3657
IAI39 97.037 100 96.9742 96.7417 97.1289 96.9295 97.0197 96.8987

O104:H4 98.3059 96.9521 100 97.4788 96.8007 97.8896 98.249 98.7212
Sakai 97.7497 96.8627 97.5094 100 96.6657 98.1523 97.7455 97.6125
SE15 96.8453 97.1064 96.9211 96.7362 100 96.7575 96.8141 96.7763
Santai 98.0073 97.0372 97.9584 98.1797 96.8199 100 97.9279 97.9077

BL21-DE3 98.9983 97.1721 98.4048 97.8227 96.8448 97.9616 100 98.3204
RM8426 98.306 96.9037 98.6801 97.5815 96.6907 97.8353 98.2567 100

Table S2. Distance between each pair of complete genome sequences from eight strains of E. coli
as computed by fastANI.

All pairs showed a distance at least greater than 95%, highlighting the strong similarities between584
the strains. As a threshold, we although considered that beyond 99%, sequences were too similar to be585
considered and distinguished, additionally to the effect of sequencing errors. The fastANI results showed586
that none of the pairs exceeded this similarity threshold.587

The strain E. coli BL21-DE3 was chosen as the unknown strain while the seven others would be used588
to build the reference pangenome graph. According to the results of fastANI, the strain BL21-DE3 closest589
genome in the present references is the strain K-12 with a similarity of 98.9%. Hence we expected to find590
evidences of the strain K-12 while analyzing a sample containing the unknown strain BL21-DE3.591

17/20

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430979doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430979
http://creativecommons.org/licenses/by/4.0/


S1.6 Detailed results from simulated datasets592

#reads
K-12

Method O104:H4 IAI39 K-12 Sakai SE15 Santai RM8426

Expected 59.88 39.92 0.2 0 0 0 0
1,000 StrainFLAIR 56.45 43.55 0 0 0 0 0

Kraken2 38.91 60.72 0.22 0.04 0.07 0.03 0.02
Expected 59.41 39.6 0.99 0 0 0 0

5,000 StrainFLAIR 54.89 42.46 2.65 0 0 0 0
Kraken2 38.61 60.25 0.99 0.04 0.07 0.03 0.02
Expected 58.82 39.22 1.96 0 0 0 0

10,000 StrainFLAIR 54.08 41.96 3.96 0 0 0 0
Kraken2 38.26 59.69 1.9 0.04 0.07 0.03 0.02
Expected 57.14 38.1 4.76 0 0 0 0

25,000 StrainFLAIR 52.1 40.58 7.32 0 0 0 0
Kraken2 37.23 58.1 4.51 0.04 0.07 0.03 0.02
Expected 54.55 36.36 9.09 0 0 0 0

50,000 StrainFLAIR 49.23 38.51 12.26 0 0 0 0
Kraken2 35.63 55.6 8.62 0.04 0.07 0.03 0.02
Expected 50 33.33 16.67 0 0 0 0

100,000 StrainFLAIR 44.66 35.05 20.29 0 0 0 0
Kraken2 32.8 51.19 15.85 0.04 0.07 0.03 0.02
Expected 42.86 28.57 28.57 0 0 0 0

200,000 StrainFLAIR 38.12 29.83 32.05 0 0 0 0
Kraken2 28.31 44.18 27.35 0.04 0.08 0.03 0.02

Table S3. Reference strains relative abundances expected and computed by StrainFLAIR or
Kraken2 for each simulated experiment with variable coverage of the K-12 MG1655 strain. Best
results are shown in bold.

Table S3 provides exhaustive results on simulated datasets when all queried strains are indexed in the593
variation graph. Table S4 provides exhaustive results on simulated datasets when one of the queried strain594
(BL21-DE3) is not indexed and highly similar to strain K-12.595

18/20

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430979doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430979
http://creativecommons.org/licenses/by/4.0/


#reads
BL21-DE3

Method O104:H4 IAI39 K-12 Sakai SE15 Santai RM8426

Expected 59.88 39.92 (0.2) 0 0 0 0
1,000 StrainFLAIR 56.47 43.53 0 0 0 0 0

Kraken2 38.93 60.76 0.11 0.05 0.08 0.04 0.03
Expected 59.41 39.6 (0.99) 0 0 0 0

5,000 StrainFLAIR 56.45 43.55 0 0 0 0 0
Kraken2 38.72 60.42 0.5 0.09 0.13 0.08 0.07
Expected 58.82 39.22 (1.96) 0 0 0 0

10,000 StrainFLAIR 56.45 43.55 0 0 0 0 0
Kraken2 38.47 60.05 0.92 0.14 0.19 0.12 0.13
Expected 57.14 38.1 (4.76) 0 0 0 0

25,000 StrainFLAIR 54.09 41.71 4.2 0 0 0 0
Kraken2 37.75 58.93 2.16 0.28 0.34 0.25 0.29
Expected 54.55 36.36 (9.09) 0 0 0 0

50,000 StrainFLAIR 52.74 40.62 6.65 0 0 0 0
Kraken2 36.59 57.17 4.15 0.51 0.57 0.48 0.53
Expected 50 33.33 (16.67) 0 0 0 0

100,000 StrainFLAIR 50.47 38.64 10.89 0 0 0 0
Kraken2 34.53 54.03 7.68 0.91 0.98 0.91 0.96
Expected 42.86 28.57 (28.57) 0 0 0 0

200,000 StrainFLAIR 46.95 35.34 17.72 0 0 0 0
Kraken2 31.14 48.83 13.53 1.57 1.67 1.58 1.68

Table S4. Reference strains relative abundances expected and computed by StrainFLAIR or
Kraken2 for each simulated experiment with variable coverage of the BL21-DE3 strain, absent
from the reference graph. BL21-DE3 being similar at 98.9% to K-12 strain (highest similarity
compared to the other references), we expect that reads from BL21-DE3 will map this strain, hence its
expected values are given in parentheses, as they correspond to BL21-DE3 strain abundances and not
K-12. Best results are shown in bold.

19/20

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430979doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430979
http://creativecommons.org/licenses/by/4.0/


S1.7 Detailed results for validation on mock datasets596

Figure S1. Experimental relative abundance compared to relative abundance computed by
StrainFLAIR and Kraken2.

Figure S1 shows full results obtained on the mock dataset.597

20/20

.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted February 13, 2021. ; https://doi.org/10.1101/2021.02.12.430979doi: bioRxiv preprint 

https://doi.org/10.1101/2021.02.12.430979
http://creativecommons.org/licenses/by/4.0/