key: cord-0274707-6spqyhgf
authors: Dias, Fernando H. C.; Williams, Lucia; Mumey, Brendan; Tomescu, Alexandru I.
title: Fast, Flexible, and Exact Minimum Flow Decompositions via ILP
date: 2022-01-26
journal: nan
DOI: 10.1007/978-3-031-04749-7_14
sha: 0dc74c60e09452910309faaa90c8857f284b046e
doc_id: 274707
cord_uid: 6spqyhgf

Minimum flow decomposition (MFD) (the problem of finding a minimum set of paths that perfectly decomposes a flow) is a classical problem in Computer Science, and variants of it are powerful models in multiassembly problems in Bioinformatics (e.g. RNA assembly). However, because this problem and its variants are NP-hard, practical multiassembly tools either use heuristics or solve simpler, polynomial-time solvable versions of the problem, which may yield solutions that are not mini-mal or do not perfectly decompose the flow. Many RNA assemblers also use integer linear programming(ILP) formulations of such practical variants, having the major limitation they need to encode all the potentially exponentially many solution paths. Moreover, the only exact solver for MFD does not scale to large instances and cannot be efficiently generalized to practical MFD variants. In this work, we provide the first practical ILP formulation for MFD (and thus the first fast and exact solver for MFD), based on encoding all of the exponentially many solution paths using only a quadratic number of variables. On both simulated and real flow graphs, our approach solves any instance in under 13 seconds. We also show that our ILP formulation can be easily and efficiently adapted for many practical variants, such as incorporating longer or paired-end reads or minimizing flow errors. We hope that our results can remove the current tradeoff between the complexity of a multi assembly model and its tractability and can lie at the core of future practical RNA assembly tools.

Flow decomposition (FD), the problem of decomposing a network flow into a set of source-to-sink paths and associated weights that perfectly explain the flow values on the edges, is a classical and well-studied concept in Computer Science. For example, it is a standard result that any flow in a directed acyclic graph (DAG) with m edges can be decomposed into at most m weighted paths (see, e.g., [1] ). However, finding an FD with a minimum number of paths (MFD) is NP-hard [53] , even on DAGs. This result was later strengthened by [16] who proved that MFD is hard to approximate (i.e., there is some > 0 such that MFD cannot be approximated to within a (1 + ) factor, unless P=NP). More recent work has shown that the problem is FPT in the size of the minimum decomposition [22] , and that it can be approximated with an exponential factor [33] . It is also possible to decompose all but a ε-fraction of the flow within a O(1/ε) factor of the optimal number of paths [16] . Heuristic approaches to the problem have also been developed, particularly greedy methods based on choosing the widest or longest paths [53] , which can be improved by making iterative modifications to the flow graph before finding a greedy decomposition [45] . But despite this history of work on algorithms for MFD, an exact solver that is fast for instances with large optimal solutions or large flow values remains elusive.

FD is also a key step in numerous applications. For example, some network routing problems (e.g. [17, 9, 16, 33] ) and transportation problems (e.g. [35, 36] ) require FDs that are optimal with respect to various measures. MFDs in particular are used to reconstruct biological sequences such as RNA transcripts [38, 50, 11, 5, 49, 59] , and viral quasispecies [4] . However, because MFD is NP-hard, all of these tools in fact use heuristics or solve some simpler version of the problem ignoring some information that is available from the sequencing process, resulting in tools that may not reconstruct the correct sequence, even if no other errors are present. More broadly, it has been noted [34] that the lack of exact solvers for many of the sub-problems involved in DNA sequencing has led to heuristic and ad hoc tools with no provable guarantees on the quality of solutions. Additionally, some authors [5, 7] have noted that there is a tradeoff between the complexity of the model for RNA assembly (i.e., how much of the true possible solution space that it supports) and its tractability. But if a fast exact solver for MFD exists, this tradeoff may not be necessary.

The main bioinformatics motivation for this paper is multiassembly [61] . In this problem, we seek to reconstruct multiple genomic sequences from mixed samples using short substrings (called reads) generated cheaply and accurately from next-generation sequencing technology. The two major multiassembly problems are RNA assembly and viral quasispecies assembly, which we describe in more detail below.

One mechanism by which complex organisms create a vast array of proteins is alternative splicing of gene sequences, where multiple different RNA transcripts (that are then used to produce different proteins) can be created from the same gene [46] . In humans, over 90% of genes are believed to produce multiple transcripts [57] . Reconstructing the specific RNA transcripts has proved essential in characterizing gene regulation and function, and in studying development and diseases, including cancer; see, e.g., [21, 43] . A second multiassembly problem is the reconstruction of viral quasispecies, for example, the different HIV or hepatitis strains present in a single patient sequencing sample, or the different SARS-CoV-2 strains present in a sewage water sample. Because viruses evolve quickly, there can be many distinct strains present at one time, and this diversity can be an important factor in the success or effect of the virus [54] .

While the biological realities underlying the different multiassembly problems may yield some differences in how the problems can be solved, at their heart many approaches contain the algorithmic step of decomposing a network flow into weighted paths. The basic setup and approach for multiassembly is as follows. Given a sample of unknown sequences, each with some unknown abundance (for example, a set of RNA transcripts or virus strains), all sequences are multiplied and then broken into fragments that can be read by next-generation sequencers to produce millions of sequence reads ranging from hundreds of tens of thousands DNA characters in length. Many approaches are reference-based (e.g., [50, 52, 32, 38, 23, 5, 26] for RNA assembly and [62, 51] for viral quasispecies assembly), meaning that they use a previously-constructed reference genome to guide the assembly process. These approaches construct a graph using the sequences contained in the reads where nodes are strings, edges represent overlaps, and weights on edges give the counts of reads that support each overlap. Because a reference is used, these graphs are always DAGs. In the non-reference case (called de novo), graphs may have cycles; we address this further at the end of the paper. If errors are minimal, the weights on the edges should form a flow on the network, and the underlying sequences and their abundances must be some decomposition of the flow into weighted paths. For RNA assembly, recent work [22, 60] has confirmed the common assertion (e.g., by [50, 44, 23, 31, 64, 28, 30] ) that the true transcripts and abundances should be minimum flow decomposition. No such study has been done for viral quasispecies assembly, but existing tools seek minimum-sized decompositions [4, 58] . However, while the abovementioned tools seek minimum-sized flow decompositions, since MFD is NP-hard, they in fact compute decompositions that are not guaranteed to be minimum (and thus may not give the correct assembly, even when no other errors are present).

One promising direction for fast exact solvers for MFD is integer linear programming (ILP). Existing ILP solvers like Gurobi [13] and CPLEX [47] incorporate optimizations that allow for fast runtimes in practice for problems that should be hard in general (see also [14] for various applications of ILP in Bioinformatics). Indeed, many existing multiassembly tools do use ILP to solve MFD as one step in their process. The basic idea behind these existing formulations is to consider some set of source-to-sink paths through the graph and assign each a binary variable indicating whether or not it is selected in the optimal solution, along with constraints to fully encode the FD problem (i.e. that the selected set of paths-with the weights derived for them by the ILP-form an FD) and to model further practical aspects of the specific multiassembly problem. However, the number of paths in a DAG is exponential, meaning that if the tools enumerate all paths (and thus can be guaranteed to find the true optimal solution) they are impractical for larger instances (e.g., Toboggan [22] ). The most common strategy is to pre-select some set of paths, either for all instances (e.g., vg-flow [4] and CLIIQ [28] ), or only when the input is large (e.g., MultiTrans [64] and SSP [40] ). But by pre-selecting paths, these formulations may not find the optimal MFD solution for the instance.

While the conference version of this paper was in print, the recent transcript assembly method JUMPER [41] was brought to our attention. JUMPER appears to be, to our knowledge, the only prior method incorporating the search for paths in a DAG into an ILP. However, their solution is slightly less general, since it works only for DAGs having a Hamiltonian path. If Hamiltonicity holds, any source-to-sink path can be encoded as a subset of edges that do not pairwise overlap in the Hamiltonian path (i.e., the tail of an edge does not appear before the head of another edge in the Hamiltonian path). As such, to avoid such pairwise edge overlaps they require a number of constraints that is quadratic in the size of the graph.

We give a new ILP approach to the MFD problem on DAGs, and we show that it can be used on both simulated and real RNA assembly graphs under conditions used in many reference-based multiassembly tools. Our approach is:

Fast and exact: We show in Section 3.1 that it is not necessary to enumerate all paths in order to encode them in an ILP. The key idea is that any path must have a conserved (unit) flow from its start to its end, and that this concept can be encoded using only a number of variables and constraints that is linear in the size of the graph (rather than exponential, as is the case when the model enumerates all possible paths). This is a standard integer programming method for expressing paths in DAGs, used for example in [48] . An implementation of our ILP formulation using CPLEX finds optimal flow decomposition solutions on RNA assembly graphs (simulated and assembled from real reads) in under 13 seconds on average, over all the datasets tested. This is several times faster than the state-of-the-art MFD solver Toboggan [22] , depending on the dataset. While heuristic solvers such as Catfish [45] or CoasterHeuristic [60] finish withing a few seconds, we show that they do not provide optimum solutions. Another benefit of our ILP solutions is that all optimum solutions can be reported by the ILP solver, thus potentially helping in "identifying" the correct RNA multiassembly solution (a practical issue acknowledged by e.g. [29, 19] ).

Flexible: In practice, many multiassembly tools in fact solve variants of MFD. For example, many tools account for paired-end reads by requiring that they be included in the same path. Another common strategy is to incorporate longer reads as subpath constraints or phasing paths [38, 44, 60] which again must be covered by some predicted transcript (i.e. flow path). In Section 3.2, we give additional constraints that are expressive enough to not only encode paired-end reads and subpath constraints, but also any generic set of edges that must be covered by a single path (e.g., as when modelling the recent Smart-seq3 protocol producing RNA multi-end reads [15] ). Additionally, due to sequencing or read mapping errors, the weights on edges may not be a flow (i.e. flow conservation might not hold). One approach in this case is to consider intervals of edge weights instead, as in [40, 59] . We give a formulation to handle this approach in Section 3. 3 . Our implementation solves subpath constraint instances in similar time to standard instances, while the existing exact solver could not complete on many instances in under 60 seconds. Moreover, while the existing interval heuristic is fast, it finds decompositions that are far from optimum. While all these additional constraints are naturally expressed in ILP (further underlining the flexibility of our approach), the novelty here is their integration with the ILP encoding of all possible paths in the DAG from Section 3. 1. In Appendix B, we give MFD formulations dealing with the total error over all edges. We can consider an upper bound on the total error, or seek a minimum decomposition that also achieves the minimum error, as studied in [49] and used in RNA assemblers such as [26, 24, 5, 50] . Finally, we note that our formulation could also be used to find decompositions for any of the above variants using a fixed, or upper bounded, number of paths, which is useful if further information is available.

Given a graph G = (V, E), with vertex set V and edge set E ⊆ V × V , we say that s ∈ V is a source if s has no in-coming edges. Analogously, we say that t ∈ V is a sink if t has no out-going edges. Moreover, we say that G is a directed acyclic graph (DAG) if G contains no directed cycles.

is a DAG with unique source s and unique sink t, where for every edge (u, v) ∈ E we have an associated positive integer flow value f uv , satisfying conservation of flow for every v ∈ V \ {s, t}, namely:

Given a flow network, a flow decomposition for it consists of a set of source-to-sink flow paths, and associated weights strictly greater than 0, such that the flow value of each edge equals the sum of the weights of the paths passing through that edge. In other words, the superposition of the weighted paths of the flow decomposition equals the flow of the network (see also Fig. 1) . Formally:

Our above definitions assume integer flow values in the network and integer weights of the flow paths, as is natural since these values count the number of sequenced reads traversing the edges, and are also consistent with previous works such as [22] . However, in practical applications, one could have both fractional flow values and flow path weights, as in e.g., [38] . Note also that the integer and fractional decompositions to the problem may differ. For example, [53] observes that are integer flow networks which admit a k-flow decomposition with fractional weights, but no k -flow decomposition with integer weights, for any k ≤ k. 

In this section we consider the following problem of finding a minimum-size flow decomposition.

Problem 1 (Minimum flow decomposition (MFD)). Given a flow network G = (V, E, f ), the minimum flow decomposition (MFD) problem is to find a flow decomposition (P, w) such that |P| is minimized.

Our solution for Problem MFD is based on an ILP formulation of a flow decomposition with a given number k of paths (a k-flow decomposition). Using this, one can easily solve the MFD problem by finding smallest k such that the flow network admits a k-flow decomposition. Notice that any DAG admits a flow decomposition of size at most |E|, see e.g., [1] (since one can iteratively take the edge with smallest flow value and create an s-t path of weight equaling this flow value). Moreover, if assuming integer weights, another trivial upper bound on the size of any flow decomposition is |f |, namely the flow exiting s, and there is always a flow decomposition with |f | paths of weight one. Thus, if there is a k-flow decomposition, there is also a k -flow decomposition, for all k < k ≤ min{|E|, |f |} (just duplicate a path of weight greater than one, and move weight one from the old copy to the new one). This shows that when searching for the smaller k such that the graph admits a k-flow decomposition we can either do a linear scan in increasing order, or binary search. Since k is usually small in our applications, we just do a linear scan. As mentioned at the end of Section 2, the problem can also be defined as allowing real flow values and/or weights. Our ILP formulation can also handle this variant by just changing the domain of the corresponding variables (in which case we will obtain a Mixed Integer Linear Program (MILP)) 3 .

We start by recalling the standard formulation of a path used for example by [48] for the shortest path problem. If an s-t path repeats no edge (which is always the case if the graph is a DAG) then we can interpret it simply as the set of edges belonging to the path. If we assign value 1 for each edge on the path, and value 0 for each edge not on the path, then these binary values correspond to a conceptual flow in the graph (V, E) (different from the input flow). Moreover, this conceptual flow induced by the (single) path is such that the flow out-going from s is 1 and the flow in-coming to t is 1. It can be easily checked (cf. e.g., [48] ) that if the graph is a DAG, then this is a precise characterization of an s-t path.

Thus, for every path i ∈ {1, . . . , k}, and every edge (u, v) ∈ E, we can introduce a binary variable x uvi indicating whether the edge (u, v) belongs to the i-th path. The above characterization of a path can be expressed by the following equations (see also Fig. 2 ): Having expressed a set of k s-t paths with already known ILP constraints, we need to introduce the new constraints tailored for the k-flow decomposition problem. That is, we need to state that the superposition of their weights equals the given flow in the network (2). Thus, for each path i we introduce a positive integer variable w i corresponding to its weight, and add the constraint: i∈{1,...,k}

To get the ILP formulation, it remains to linearize equation (4), which is nonlinear because it involves a product of two decision variables. Let us remark that even though non-linear programming solvers exist (such as IPOPT [56] ), they are inefficient, do not scale to a large number of variables and are non-professional grade. Instead, having an ILP formulation means that we can make use of popular solvers such as CPLEX [47] and Gurobi [6] .

Since the decision variables involved in the product in Eq. (4) are bounded (x uvi is binary and w i is at most the largest flow value of any edge), this equation can be linearized by standard techniques as in e.g., [10] and [27] . For that, we introduce the integer decision variable π uvi which represents the product between w i and x uvi , and a constant w that is a large enough upper bound for any variable w i (e.g., the largest flow value of any edge). As such, Eq. (4) can be replaced by the following equations:

In these constraints, Eq. (5b) ensures that π uvi is 0 if x uvi is 0, and Eqs. (5c) and (5d) ensure that π uvi is w i if x uvi is 1. For completeness, see Appendix A for the full ILP formulation for k-Flow Decomposition.

In this section we consider the flow decomposition variant where we are also given a set of subpath constraints that must appear (as a subpath of some path) in any flow decomposition. Among all such decompositions we must find of one with the minimum number of paths. In multiassembly, subpath constraints represent longer reads that span three or more vertices; they are used in popular RNA assembly tools such as StringTie [23] and Scallop [44] and their usefulness for that problem was confirmed empirically in [60] . Such subpath constraints can also naturally model long RNA-seq reads, and we note that, as several authors also acknowledge [63, 2, 55] , long reads do not render the RNA assembly problem obsolete, because they do not always capture full-length transcripts (due to the conversion from RNA to cDNA), and do not fully capture low-expressed transcripts.

Formally, the problem can be defined as follows (see also Fig. 3a) . ∀R j ∈ R, ∃P i ∈ P such that R j is a subpath of P i .

Problem 2 (Minimum flow decomposition with subpath constraints (MFDSC)). Given a flow network G = (V, E, f ) and subpath constraints R, the minimum flow decomposition with subpath constraints problem is to determine if there exists, and if so, find a flow decomposition (P, w) satisfying (6) such that |P| is minimized. (b) Constraint R1 is satisfied because for the i th path we can set ri1 = 1 (and satisfy Eq. (7b)) so that xaci + xcti ≥ 2ri1 holds (and satisfy Eq. (7a)). Fig. 1 with a subpath constraint (which is satisfied by the 4-flow decomposition from Fig. 1c , but not by the one in Fig. 1b) , and example of a path satisfying the constraint.

We can expand the previous ILP formulation for k-Flow Decomposition to incorporate the conditions necessary to represent the subpath constraints. Let R be the set of simple paths that are required to be part of at least one path of the flow decomposition. For each R j ∈ R, we introduce an additional binary variable r ij denoting the presence of the subpath R j in the i th path. It clearly holds that r ij = 1 if and only if for each edge (u, v) in R j we have that x uvi = 1. Let |R j | denote the length (i.e., number of edges) of subpath constraint R j , which is a parameter (i.e. constant). The following inequalities guarantee that each subpath constraint is satisfied by the flow decomposition (see also Fig. 3b ):

i∈{1,...,k}

Remark 1. In the above ILP formulation we do not use the fact that the edges of subpath constraint R j are consecutive (i.e., form a path). Thus, the same formulation applies also if the constraint consists of a pair of edge-disjoint paths that must all occur in the same transcript, modelling paired-end Illumina reads, or if it consists of a set of edge-disjoint paths (or simply of a set of edges), modelling multi-end Smart-seq3 RNA reads [15] . More specifically, Eq. (7a) simply characterizes when all edges of constraint R j are covered by some flow path i, and Eq. (7b) requires that at least one flow path satisfies the constraint R j .

While for MFD we could modify the ILP to allow also real positive path weights by setting their lower bound to be 0 (because we solve MFD by increasing k, as discussed at the beginning of Section 3.1), this is no longer possible here, since the resulting model could allow as feasible optimum solution a set of k paths decomposing the flow, plus one 0-weight path added just to satisfy some subpath constraints.

Another variant of the flow decomposition problem is when the given values on the edges of the flow network do not satisfy the conservation of flow property. Instead, they are required to belong to a given interval, for each edge. Thus, we are looking for an inexact flow decomposition, namely one such that the superposition of its weights belongs to the given interval of each edge. This model was studied in [59] and is used in the practical RNA assembler SSP [40] , which seeks a set of transcripts explaining the read coverage within some user-defined error tolerance (i.e., interval around the observed weights) on all edges. The problem is formally stated as follows. . . , w k ) with w i ∈ Z + such that for each edge (u, v) ∈ E it holds that:

In this variant, the same formulation as presented k-Flow Decomposition can be expanded to accommodate the inexact flow component. By simply replacing the flow conservation expressed in Eq. (4) (in the linearized form in Eq. (5a)), with the following two constraints: 

Solvers. We denote by StandardILP, SubpathConstraintsILP, and InexactFlowILP our ILP formulations for Problems 1 (MFD), 2 (MFDSC) and 3 (MIFD), respectively. We implemented these using the Cplex Python API under default settings. We compare StandardILP with Toboggan, the implementation by [22] for their exact FPT algorithm for MFD, and with Catfish, the implementation by [45] of their heuristic algorithm for MFD. We compare SubpathConstraintsILP with Coaster, the implementation by [60] for MFDSC, which is an exact FPT algorithm extending Toboggan, and also with CoasterHeuristic, which is a heuristic for MFDSC also by [60] . We compare InexactFlowILP with IFDSolver, which is an implementation of a heuristic algorithm for MIFD by [59] . Given the size of the datasets, we set a time limit for each graph, as also done by [22, 60] (we use 1 minute in all cases, except that we also include a run of Toboggan with a 5 minute time limit). The runtimes of our ILP implementations include the linear scan in increasing order to find the smallest k for which there is a k-flow decomposition.

To test the performance of the solvers under a range of biologically-occurring graph topologies and flows weights, we used three human transcriptomic datasets containing a perfect (i.e., the edge weights satisfy conservation of flow) splice graph for each gene of the human genome. The first dataset, produced by the authors of [44] and also used in a number of flow decomposition benchmarking studies [22, 60] , was built using publicly available RNA transcripts from the Sequence Read Archive with quantification using the tool Salmon [37] . We use one of the larger transcriptomes 4 and call this dataset SRR020730-Salmon. We also produce perfect splice graphs by running HiSat2 [20] with the provided GRCh38 reference index and then popular RNA assembly tool StringTie [23] on real RNA reads from SRR307903, and superimposing the resulting transcripts and abundances (after rounding abundances to the nearest integer). We call this dataset SRR307903-StringTie. Finally, we create another dataset by directly simulating expression values for all reference transcripts of all genes in the reference genome GRCh.104 homo sapiens by sampling weights from the lognormal distribution with mean −4 and variance 4, as in the default setting of the RNASeqReadSimulator tool [25] . We multiply the simulated values by 1000 and round to the nearest integer.

We call this dataset Reference-Sim. For both the Reference-Sim and SRR307903-StringTie datasets, we use only genes on the positive strand.

For the subpath constraint experiments, we simulate four subpath constraints in each graph as in [60] . For four of the groundtruth paths, we take the prefix of the path that includes three nontrivial junctions (equivalent to three edges in the contracted graph described in [22, Lemma 13] ) as a subpath constraint. If a splice graph has fewer than four groundtruth paths, it is excluded from this experiment.

For the inexact flow experiments, we simulate interval flows as follows, similar to what was done in [59] . For each true edge flow f uv , we independently sample a perturbed flow f uv from N (f uv , ( f uv ) 2 ), the Gaussian distribution with mean f uv and standard deviation f uv For this experiment we fixed = 0.05. We then create intervals as [0.9f uv , 1.1f uv ] with values rounded to the nearest integer, corresponding to a 10% error tolerance from the observed values. As described in [59] , it is possible that an inexact flow decomposition instance created in this way is infeasible; if an infeasible instance is created, we re-create it until a feasible instance is found.

From all datasets, the trivial graphs made up of a single path (i.e. admitting a trivial flow decomposition) are excluded.

For each dataset and each FD variant, we report min k, the number of paths in a minimum flow decomposition for each problem variant; Amount, namely the number of graphs having that specifc value of min k; Avg., the average time (in seconds) for each instance solved within the time limit; Σ, the total time (in seconds) required to solve all instances (this included also the running time of the instances that did not finish within the time limit); Solved, the percentage all instances solved within the time limit; Diff., the average difference between the number of paths obtained with a heuristic algorithm and the optimum one. 

The results for Problem MFD are shown in Table 1 . For all three datasets, the average time and the total time of Toboggan and Catfish outperform StandardILP for less complex genes, where the number of flow-paths is at most 10 or 15. However, as the genes becomes more complex (larger optimum flow decompositions), StandardILP is capable of solving all instances within an average of 10 seconds, while Toboggan and Catfish require on average 16 and 11 seconds for the solved instances, respectively. In addition, Toboggan does not solve all instances even within the 5 minute time limit. Recall also that Catfish is a heuristic, and thus it does not always return optimum solutions (see column Diff. ) .

Among the different datasets, SRR020730-Salmon has fewer complex genes and most instances are solved more easily. However for SRR307903-StringTie (constructed from real RNA reads) and Reference-Sim datasets, there is a larger amount of complex genes and consequently fewer instances can be solved by Toboggan and Catfish, while StandardILP remains efficient and scalable. In these results, although StandardILP does not perform as fast as on SRR020730-Salmon, its runtime is still competitive, it can be scaled to graphs with larger k without compromising its efficiency. On the other hand, Toboggan's runtime is exponential in the size of the optimum decomposition, which hinders its usage on larger instances. Moreover, notice that in some applications (e.g. cancer transcriptomics [18] ) the graphs of interest do have a large number of RNA transcripts because of the genetic mechanism driving the disease. Hence, in such applications the need to find a flow decomposition is even greater for large k.

Lastly, one of the key steps in the Toboggan implementation is a reduction of the graph (to simplify nodes with in-degree or out-degree equal to one, see [22] ), which is a key insight behind its efficiency. However, this observation is highly tailored to the MFD problem, and cannot be easily extended to other FD variants (in fact, it is not used by real RNA assemblers).

The results for Problem MFDSC are shown in Table 2 . For all three datasets, SubpathConstraintsILP is capable of solving instances of any size within a few seconds. As an ILP formulation, the addition of the constraints corresponding to the subpath constraints do not hinder its scalability or efficiency. On the other hand, Coaster is both slow on small instances, and does not solve large instances. This shows that the Toboggan implementation is optimized to use many properties of the standard MFD problem, that are not generalizable to variants of it of practical applicability, such as Problem MFDSC. Moreover, similarly to the Catfish heuristic, CoasterHeuristic does not return optimum solutions.

The results for Problem MIFD are shown in Table 3 . For all three datasets, both formulations run on any instance in a small amount of time. In fact, InexactFlowILP generally has the same running time as StandardILP, which further underscores the flexibility and efficiency of our formulations. However, IFDSolver is a heuristic solver, having a significant difference with respect to the size of a minimum decomposition even for small k. 

Flow decomposition is a key problem in Computer Science, with applications in various fields, including the major multiassembly problems from Bioinformatics. Despite this, the only exact solution for MFD is the FPT algorithm of [22] , which does not scale to large values of k, and cannot be efficiently extended to model practical features of real data (such as long reads, or inexact flows). In fact, a large number of practical RNA assemblers use an ILP formulation at their core, thanks to their flexibility in modeling various aspects of real data. However, such formulations are based either on an impractical exhaustive enumeration of all possible s-t paths, or on a greedy heuristic to select a smaller set of candidate s-t paths that might be part of an optimum solution.

In this paper we show an efficient quadratic-size ILP for MFD and variants, avoiding for the first time the current limitation of (exhaustively) enumerating candidate s-t paths. We also show that many constraints inside state-of-the-art RNA assemblers can be easily modeled on top of our basic ILP (i.e. subpath constraints, inexact and imperfect flows). Further flexibility also comes from the fact that all our ILPs are based on modeling a specific type of flow decomposition with a given, or upper bounded number k of paths (thus, they do not need to solve the minimum version of the problem). On both simulated and real datasets, we show that our ILP formulations finish within 13 seconds on any instance, and within a few seconds on most instances.

On the practical side, we hope that our flexible ILP formulations can lie at the core of future referencebased RNA assemblers employing exact solutions. Thus, the current tradeoff between the complexity of the model and its tractability might not be necessary anymore. On the theoretical side, our ILP formulation represents the first exact solver for MFD scaling to large values of k, and it could be a reference when e.g. benchmarking various other heuristic or approximation algorithms.

Given the maturity of ILP solvers and Toboggan's intrinsic exponential dependence on k, it is not surprising that an ILP for MFD using a quadratic number of variables performs significantly better than Toboggan for larger k values. However, since for small k values our ILP formulations are still slower, as future work it would be interesting to further devise more efficient MFD solvers (e.g., as a start, run Toboggan when the instance is detected as being "small enough").

It would also be interesting to extend our ILP formulations to flow networks with cycles. While in this work we focus on reference-based approaches for multiassembly, de novo approaches (e.g., [12, 42] for RNA assembly and [3, 4, 39, 8] for viral quasispecies assembly) may yield graphs with cycles. In this context, any flow in such a network can be decomposed into at most |E| weights s-t paths and cycles.

Network flows

Opportunities and challenges in longread sequencing data analysis

Full-length de novo viral quasispecies assembly through variation graph construction

Strain-aware assembly of genomes from mixed samples using flow variation graphs

Efficient RNA isoform identification and quantification from RNA-Seq data with network flows

The Gurobi Optimizer. Transp. Re-search

CIDANE: comprehensive isoform discovery and abundance estimation

De novo haplotype reconstruction in viral quasispecies using paired-end read guided path finding

On the effect of forwarding table size on SDN network utilization

Theoretical and computational study of several linearisation techniques for binary quadratic problems

Ryūtō: network-flow based transcriptome reconstruction

Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data

Gurobi Optimization

Integer linear programming in computational and systems biology: an entry-level text and course

Single-cell RNA counting at allele and isoform resolution using Smart-seq3

How to split a flow?

Achieving high utilization with software-driven wan

Long-read transcriptome sequencing reveals abundant promoter diversity in distinct molecular subtypes of gastric cancer

Safety and Completeness in Flow Decompositions for RNA Assembly

Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype

Analysis of copy number variants and segmental duplications in the human genome: Evidence for a change in the process of formation in recent evolutionary history

A practical fpt algorithm for flow decomposition and transcript assembly

Transcriptome assembly from long-read RNA-seq alignments with StringTie2

Sparse linear modeling of next-generation mRNA sequencing (RNA-Seq) data for isoform discovery and abundance estimation

RNASeqReadSimulator: a simple RNA-seq read simulator

IsoLasso: a LASSO regression approach to RNA-Seq based transcriptome assembly

Compact linearization for binary quadratic problems

Cliiq: Accurate comparative detection and quantification of expressed isoforms in a population

Finding ranges of optimal transcript expression quantification in cases of non-identifiability. bioRxiv

An integer programming approach to novel transcript reconstruction from paired-end RNA-Seq reads

Refshannon: A genome-guided transcriptome assembler using sparse flow decomposition

Bayesian transcriptome assembly

Parity balancing path flow decomposition and routing

Sequence assembly demystified

On the Construction of Optimal Paths from Flows and the Analysis of Evacuation Scenarios

A study on flow decomposition methods for scheduling of electric buses in public transport based on aggregated time-space network models

Salmon: accurate, versatile and ultrafast quantification from RNA-seq data using lightweight-alignment

StringTie enables improved reconstruction of a transcriptome from RNA-seq reads

V-pipe: a computational pipeline for assessing viral genetic diversity from high-throughput data

SSP: An interval integer linear programming for de novo transcriptome assembly and isoform discovery of RNA-seq reads

Jumper enables discontinuous transcript assembly in coronaviruses

Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels

The clonal and mutational evolution spectrum of primary triple-negative breast cancers

Accurate assembly of transcripts through phase-preserving graph decomposition

Theory and a heuristic for the minimum path flow decomposition problem

Function of alternative splicing

Cplex users manual, version

Integer programming formulations for the elementary shortest path problem

Explaining a weighted dag with few paths for solving genome-guided multi-assembly

A novel min-cost flow method for estimating transcript expression with RNA-Seq

Probabilistic inference of viral quasispecies subject to recombination

Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation

Simple bounds and greedy algorithms for decomposing a flow into a minimal set of paths

Quasispecies diversity determines pathogenesis through cooperative interactions in a viral population

Next-generation transcriptome assembly: strategies and performance analysis

On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming

Alternative isoform regulation in human tissue transcriptomes

HCV quasispecies assembly using network flows

RNA Transcript Assembly Using Inexact Flows

Flow decomposition with subpath constraints

The multiassembly problem: reconstructing multiple transcript isoforms from est fragment mixtures

Shorah: estimating the genetic diversity of a mixed sample from next-generation sequencing data

Scallop2 enables accurate assembly of multiple-end rna-seq data

Multitrans: an algorithm for path extraction through mixed integer linear programming for transcriptome assembly

w k ) with w i ∈ Z + such that for

A Full ILP formulation for k-Flow Decomposition 

An alternative approach to handle a graph whose weights to not satisfy the flow conservation property flow consists in directly taking the observed read coverages, and trying to find a set of path whose superposition best explains the observed coverages under some error model, penalizing the difference between the observed coverage of an edge and the sum of the weights of the paths going through that edge. This problem has been formalized in [49] and also proven NP-hard. To formalize this problem, we denote by imperfect flow network any DAG (V, E) with unique source s and unique sink t, where for every edge we have an associated integer positive value f uv (not necessarily satisfying the flow conservation property). A first formulation of such an MFD variant imposes a fixed bound on the total error of all of the edges.Problem 4 (Minimum imperfect flow decomposition (bounded error)). Given an imperfect flow network G = (V, E, f ), and an error bound B ≥ 0, find (if it exists) a minimum-sized set of s-t paths P = (P 1 , . . . , P k )Notice that Problem 4 is a strict generalization of the MFD problem, which is obtained by taking B = 0. As done in Section 3.3, we can obtain an ILP for it by extending the ILP formulation for k-Flow Decomposition to express Eq. (11) by the following two sets of linear equations:This model is for a fixed k value, and a full solution for Problem 4 is obtained by trying all values of k in increasing order until the ILP formulation admits a solution. Notice that the same upper bound k ≤ |E|, since any solution to Problem 4 (i.e. any set of weighted s-t paths) induces a flow, which is decomposable into at most |E| weighted paths.Another formulation, defined by [49] and at the core of RNA multiassembly tools such as [26, 24, 5, 50] , asks to minimize the total sum of squared errors with a minimum number of paths.Problem 5 (Minimum imperfect flow decomposition (minimum total error) [49] ). Given an imperfect flow network G = (V, E, f ), find a set of s-t paths P = (P 1 , . . . , P k ) and associated weights w = (w 1 , . . . , w k ), minimizingand among all such sets of paths, find one with minimum k (i.e. with minimum cardinality).For a given number k of path, Eq. (12) can be used as an objective function in an Integer Quadratic Problem (IQP), which can solved by commercial solvers such as CPLEX and Gurobi. The main requirements is that the objective function is quadratic and convex, such as:As before, to fully solve Problem 5, one can iterate over k from 1 to |E| (upper bound holding by the same reasoning as above), and choose the smallest one attaining Eq. (13) .