key: cord-1028687-o1cjdi0a
authors: Orlovich, Yury; Kukharenko, Kirill; Kaibel, Volker; Skums, Pavel
title: Scale-Free Spanning Trees and Their Application in Genomic Epidemiology
date: 2021-10-01
journal: J Comput Biol
DOI: 10.1089/cmb.2020.0500
sha: c06cd83353e3476689c7bc3da92ba688a3cb0fcc
doc_id: 1028687
cord_uid: o1cjdi0a

We study the algorithmic problem of finding the most “scale-free-like” spanning tree of a connected graph. This problem is motivated by the fundamental problem of genomic epidemiology: given viral genomes sampled from infected individuals, reconstruct the transmission network (“who infected whom”). We use two possible objective functions for this problem and introduce the corresponding algorithmic problems termed m-SF (-scale free) and s-SF Spanning Tree problems. We prove that those problems are APX- and NP-hard, respectively, even in the classes of cubic and bipartite graphs. We propose two integer linear programming (ILP) formulations for the s-SF Spanning Tree problem, and experimentally assess its performance using simulated and experimental data. In particular, we demonstrate that the ILP-based approach allows for accurate reconstruction of transmission histories of several hepatitis C outbreaks.

V iral outbreaks continue to be major causes of morbidity and mortality. The ongoing pandemic of the coronavirus SARS-CoV-2 (Huang et al., 2020 ) is a vivid example, but long-standing epidemics of HIV, hepatitis B virus, and hepatitis C virus (HCV) are hardly less damaging (Kilmarx, 2009; Hajarizadeh et al., 2013) . Viral epidemics are complex processes defined by evolutionary dynamics of pathogens and social dynamics of susceptible populations (e.g., individual behaviors, social interactions, and mobility patterns).

Recent advances in sequencing technologies invigorated the field of genomic epidemiology (Armstrong et al., 2019; Knyazev et al., 2020 ) that aims to use viral genomic data to understand the epidemiological dynamics of pathogens. The fundamental algorithmic problem of genomic epidemiology could be formulated as follows:

Given viral genomes sampled from n infected individuals, infer a transmission network indicating who of them infected whom (Knyazev et al., 2020) . If each individual is supposed to be infected only once, then a transmission network is a tree called a transmission tree.

This problem has been approached by a variety of methods ( Jombart et al., 2011 ( Jombart et al., , 2014 Sledzieski et al., 2019; Wertheim et al., 2014; Campo et al., 2016; De Maio et al., 2016; Klinkenberg et al., 2017; Skums et al., 2018) . One family of methods is based on the so-called network approach. It is particularly popular among researchers of HIV and HCV and has been adopted as a standard methodology for outbreak investigations carried out by the CDC (Wertheim et al., 2014; Campo et al., 2016; Campbell et al., 2017; Kosakovsky Pond et al., 2018; Ramachandran et al., 2018; Ragonnet-Cronin et al., 2019) . This approach usually consists of two stages. First, a weighted relatedness graph G R is constructed. Its vertices represent infected hosts, and edges connect the hosts whose viral populations are close to each other according to a selected population genetics measure. Often G R itself supplies enough information for epidemiologists and provides a fast and scalable alternative to phylogenetic trees when applied to next-generation sequencing (NGS) data (Wertheim et al., 2014; Campo et al., 2016; Ragonnet-Cronin et al., 2019) . However, usually it contains many edges that do not represent actual transmissions. Thus, at the second stage, the transmission tree is inferred as the spanning tree of G R .

Under the maximum parsimony criterion, the most likely transmission network is a minimum spanning tree of G R ( Jombart et al., 2011) . However, experiments demonstrated that this approach is not accurate ( Jombart et al., 2014) . Furthermore, genomic data alone often do not allow to resolve ambiguities in transmission tree inference, and incorporation of additional evidence is necessary ( Jombart et al., 2014; Villandre et al., 2016; Jha et al., 2017) . Such evidence usually comes in the form of epidemiological information, such as sample collection times and exposure intervals. However, HIV, HCV, and many other infections tend to be initially asymptomatic, and consequently, sampling times may not accurately reflect the infection times. In addition, in outbreaks with high transmission rates (e.g., HIV/HCV among injection drug users), susceptible hosts are almost constantly exposed to the virus, which makes exposure intervals useless. Another important drawback of many existing methods is their implicit assumption that transmission tree edges are independent. In reality, it is not the case, as, for example, certain hosts (so-called superspreaders) infect more people than an average person (Galvani and May, 2005) . Skums et al. (2018) proposed an alternative approach. It is known that for viruses, whose transmissions are associated with behavioral risk factors, their transmission trees have properties of so-called scale-free graphs (Leigh Brown et al., 2011; Wertheim et al., 2014) . Those graphs have specific features, including power-law degree distribution, small diameter, and the presence of high-degree vertices (hubs). This observation gives rise to the following informally defined algorithmic problem (scale-free spanning tree problem): find the most ''scale-free-like'' spanning tree T of the graph G R . In addition, constraints on the weight of T could be imposed. This approach was the basis of the Bayesian framework and the Markov Chain Monte Carlo algorithm for the transmission network inference described by Skums et al. (2018) and implemented as a tool called QUENTIN. Although QUENTIN is efficient in practice, it is a heuristic, and the questions about computational complexity and possibility of the exact solution of the problem were left open.

In this article, we present the first detailed study of the scale-free spanning tree problem. Our major contributions are as follows.

(1) We propose two rigorous formulations of the scale-free spanning tree problem further referred to as m-SF Spanning Tree and s-SF Spanning Tree problems. They are based on two related objective functions and, to the best of our knowledge, have not been previously studied.

(2) We establish the computational complexity of both problems by demonstrating that they are NP-hard or APX-hard, even when restricted to cubic graphs and bipartite graphs. (3) We propose two integer linear programming (ILP) formulations for the problems, and perform computational experiments to assess their performance using simulated data. Then we apply an ILP approach to real genomic data from several epidemiologically curated HCV outbreaks investigated by the CDC (Campo et al., 2016; Skums et al., 2018) and demonstrate that it allows for accurate inference of transmission trees.

Several definitions of scale-free graphs of different degrees of mathematical rigor are known in a literature. We utilize the rigorous combinatorial characterization that has been introduced by Li et al. (2005) using the so-called s-metric of a graph. This graph invariant is defined as follows:

The same parameter is known in mathematical chemistry as second Zagreb index (Das and Gutman, 2004; Borovicanin et al., 2017) . Li et al. (2005) demonstrated that a higher s-metric indicates with high probability the presence of most of the expected properties of scale-free graphs. The intuition behind these results is that in a graph with a high s-metric, a large number of edges should be incident to high-degree vertices, thus forcing them to resemble preferential attachment graphs-a standard Barabási and Albert (1999) model for scale-free networks. Therefore, another mathematical chemistry parameter called the first Zagreb index (Borovicanin et al., 2017) or m-metric also can serve as a measure of ''scale-freeness'' of a graph:

Thus, we can formulate m-SF Spanning Tree and s-SF Spanning Tree problems: given a connected graph G, find the spanning tree T of G such that m(T) (respectively, s(T)) is maximal. The respective maximum values of m(T) and s(T) are called first and second SF-dimensions of G and denoted by s 1 (G) and s 2 (G). By T sopt and T mopt , we denote an s-optimal tree and an m-optimal tree of G, respectively.

A somehow related problem has been studied by Kincaid et al. (2016) : find a spanning subgraph with prescribed vertex degrees such that its s-metric is maximum. This problem is polynomially solvable in general, but becomes NP-hard, when the output spanning subgraph is required to be connected. Proof. We prove only the second equality, the first one can be proved similarly. Let A = [a ij ] be the adjacency matrix of G and d be its degree vector. We have s(G) = 1 2 d T Á A Á d and d = A Á 1, where 1 = (1‚ . . . ‚ 1) T 2 R n . Therefore

ij equals the number of walks of length 3 between vertices i and j. Thus, s(G) is equal to one-half of the total number of three-walks in G. An edge v 1 v 2 produces exactly two such walks:

. As every threewalk of G has one of these forms, the statement of the lemma follows.

, v laying on P T (u‚ v) by u + and v -, respectively. In case u and v are not adjacent, the neighbor of u + distinct from u and laying on P T (u‚ v) is denoted by u + + . Let A = N T (u)nfu + g = fa 1 ‚ . . . ‚ a p -1 g, and let the set N T (v)nfvg be partitioned into two subsets B = fb 1 ‚ . . . ‚ b q g and C = fc 1 ‚ . . . ‚ c r g, where B 6 ¼ . Furthermore, let deg T u + = a and deg T v -= b. Define numbers D A , D B , and D C as follows:

Given the pair (u‚ v), the neighbor switch S B v!u is a transformation producing a new tree e T from T by replacing the edges vb 1 ‚ . . . ‚ vb q with new edges ub 1 ‚ . . . ‚ ub q (Fig. 1 ). This operation changes only degrees of the vertices u and v, namely deg e Proof. We prove lemma when u and v are not adjacent, that is, u 6 ¼ vand v 6 ¼ u + (the proof for the other case is similar). Define by X (resp., Y) the set of edges of T (resp., e T) incident to u or v. Let us denote by k(X) (resp., e k(Y)) the contribution to s(T) (resp., s( e T)) from the edges of X (resp., Y). Then

Using Equation (3) one can easily calculate

After substituting t = q + r + 1, we obtain

Similarly,

Using equalities (4)-(6) we obtain

Since a ! b and p ! r + 1, it follows that q(a -b) + D B (pr -1) ! 0. On the contrary, since q ! 1 and

then the neighbor switch produces a tree e T with v being a leaf. In this case S B v!u is a total neighbor switch. For our goals it suffices to prove the following corollary.

Corollary 3. If e T is obtained from T by a total neighbor switch S B v!u and, in case of u and v not being adjacent, additionally a ! b or p ! b, then s( e T) > s(T). 

Proof. We check that all conditions of Lemma 2 are satisfied for the total neighbor switch. Indeed, since D A ! p -1 ! 1 (recall deg T u = p ! 2) and D C = r = 0, we have D A > D C and p ! r + 1. If u and v are not adjacent, we still require that a ! b, as in Lemma 2. However, this condition can be replaced if we rewrite Equation (7) as follows:

Note that the latter expression is positive in case of p ! b, since a ! 2 and D A ! p -1 ! 1.

In the same way we can compare the trees T and e T = S B v!u (T) in terms of their m-metrics.

Proof. The idea is similar to the proof of Lemma 2. Since the neighbor switch changes only degrees of vertices u and v, m( e

, For further results we need weaker modifications of Lemmas 2 and 4 for the case deg T u = p ! 1 (and therefore D A ! 0). Recall deg T v = t ! 2 since we still require at least one vertex to switch.

Lemma 5. Suppose e T is obtained from T by a total neighbor switch S B v!u , then the following propositions hold:

Now we consider a special case, when u is a vertex of maximum degree in T and all vertices in

. The reason to treat this two-step switch as a single operation is that the first switch itself might cause the descend of s-metric, however, the decrease would be compensated by the second switch.

Proof. In case u and v are adjacent, that is, u + = v, a double switch S B‚ C v!u‚ u + (T) gets reduced to the first neighbor switch S B v!u (T), which produces a tree with a higher s-metric due to Lemma 2. Therefore, assume u and v are not adjacent. Consider the first switch and let e T = S B v!u (T). From Equation (7), since D B = q and D C = r, we obtain Again from Equation (7) we get

Summation of Equations (8) and (9) gives

Furthermore, since u is a vertex of maximum degree in T, p ! b and p ! q + r + 1 > r + 1 (recall q‚ r > 0), which proves the lemma. ,

There exist bounds for both SF-dimensions of a graph in terms of its order only (de Caen, 1998; Das, 2003; Das and Gutman, 2004) . However, they are not particularly efficient, when used as ILP cuts. Here we provide the adjusted upper bounds that turned out to be more useful for that purpose. Let D(G) denote the maximum vertex degree of G and S m‚ k denote a double star, that is, a tree obtained from two disjoint stars K 1‚ m and K 1‚ k with m and k leaves, respectively, by adding an edge joining their central vertices.

Theorem 7. For any graph G of order n ! 2,

Proof. We provide the proof for the second SF-dimension only (the other proof is similar). Suppose T sopt is an s-optimal tree of G and T sopt 6 ¼ S D(G) -1‚ n -D(G) -1 . We prove the statement by performing a sequence of neighbor switches on T sopt , with each of them increasing s-metric, so that the resulting tree is

Let u be a vertex of maximum degree in T sopt . Then for every v in T sopt follows deg T sopt v deg T sopt u deg G u D(G). Let T : = T sopt . We divide the sequence of neighbor switches into three stages.

Stage 1: For each vertex v with all vertices in N T (v)nfvg (where v -2 P T (u‚ v)) being leaves, we either perform the total neighbor switch T : = S B v!u (T) or double neighbor switch T : = S B‚ C v!u‚ u + (T) until the degree of u is not equal to D(G):

One can observe that a double neighbor switch is needed to ensure that deg T u can be increased exactly to D(G). Since deg T u increases after each switch, only the finite number of switches is required. In case the tree T obtained after the first stage differs from S D(G) -1‚ n -D(G) -1 , we perform the second stage if there exist at least two vertices w 1 and w 2 in N T (u) with deg T w 1 ! deg T w 2 ! 2 or jump directly to Stage 3 otherwise.

Stage 2: For each distinct w 1 and w 2 in N T (u) with deg T w 1 ! deg T w 2 ! 2 perform a total neighbor switch T : = S B w 2 !w 1 (T). After each iteration, the number of vertices in N T (u) with degree at least two decreases by one. Thus, Stage 2 terminates after a finite number of switches leaving at most one vertex w 2 N T (u) with degree at least two. Finally if T still differs from S D(G) -1‚ n -D(G) -1 , the third stage is required.

Stage 3: While there exists vertex v in N T (w)nfug with deg T v ! 2 perform a total neighbor switch T : = S B v!w (T). Since the number of neighbors of w with degrees at least two decreases after each switch, Stage 3 terminates after finite number of steps with all neighbors of w‚ except for u, being leaves, that is, T = S D(G) -1‚ n -D(G) -1 . Note that each iteration of Stages 1-3 produces a tree with a higher s-metric due to Lemmas 2, 6 and Corollary 3. ,

In this section, we study the computational complexity of both the m-SF and the s-SF Spanning Tree problem. The following known fact is used:

Theorem 8 (Kleitman and West, 1991) . Any connected graph of order n with minimum vertex degree at least 3 has a spanning tree with at least n=4 + 2 leaves.

We start by investigating the complexity of our problems for cubic graphs. Theorem 9. The m-SF Spanning Tree problem is APX-hard for cubic graphs.

Proof. Let G be a cubic graph on n vertices and T be a spanning tree with ' = '(T) leaves and n i = n i (T) vertices of degree i, i 2 f2‚ 3g. Then

with the numbers n i satisfying the equalities ' + n 2 + n 3 = n and ' + 2n 2 + 3n 3 = 2(n -1): Deriving n 2 and n 3 from these equalities gives us

After substituting these expressions into Equation (10), we get m(T) = 2' + 4n -10:

Thus, finding a spanning tree with maximum m-metric in this case is polynomially equivalent to finding a spanning tree with maximum number of leaves (MaxLeaf problem). For cubic graphs, the latter problem 950 ORLOVICH ET AL.

was shown to be APX-hard by Bonsma (2012) . Thus, we prove the APX-hardness of the m-SF Spanning Tree problem by providing an L-reduction (Papadimitriou and Yannakakis, 1991) from MaxLeaf.

Given an optimization problem P and an instance I of this problem, we use opt P (I) to denote the optimum value of I, and val P (I‚ S) to denote the value of a feasible solution S of instance I. Let A and B be two optimization problems. The problem A is said to be L-reducible to B if there exist polynomial-time computable functions f, g and constants a‚ b > 0 such that (L1) f maps any instance I of A to an instance f (I) of B such that opt B (f (I)) a Á opt A (I); (L2) for any instance I of A and a solution S 0 of the instance f (I), g maps S 0 to a solution S for I such that

Let T mopt be an m-optimal spanning tree of G and ' Ã be the maximum number of leaves in spanning trees of G. Note that ' Ã ! n=4 + 2 by Theorem 8, and therefore, n 4' Ã -8. Then using Equation (12) we get s 1 (G) = m(T mopt ) = 2'(T mopt ) + 4n -10 2' Ã + 16' Ã -32 18' Ã :

Moreover, for every spanning tree T of G we have 1 2 jm(T)m(T mopt )j = j'(T) -' Ã j. As a result, Equation (12) implies an L-reduction with identity mappings f and g and constants a = 18 and b = 1 2 , thus proving the theorem.

Theorem 10. The s-SF Spanning Tree problem is NP-hard for cubic graphs.

Proof. For the reduction, we use the following problem proved to be NP-complete by Lemke (1988) : Instance: A connected cubic graph G of order n. Question: Is there a spanning tree of G without vertices of degree 2? According to Equation (11), n 2 = n 2 (T) = n + 2 -2'(T). Thus, the answer for the problem's question is negative if n is odd. Hence, we concentrate only on the case when n ! 4 is even, thus n 2 is even as well. We show that among all trees T of order n with D(T) 3, the trees without vertices of degree 2 have the highest s-metric. Indeed, the following claim holds:

Claim 11. If D(T) 3 and n ! 4 are even, then s(T) 6n -15. The equality holds if and only if T has no vertices of degree 2.

Proof. If T has no vertices of degree 2, then Equation (11) implies ' = '(T) = n + 2 2 . Furthermore, s(T) = 3m 1 + 9m 3 , where m 1 is the number of edges incident to a leaf and m 3 is the number of edges with both ends of degree 3. Obviously, m 1 = ' and m 3 = n -1 -', thus yielding s(T) = 6n -15. Now suppose that T has n 2 ! 2 vertices of degree 2. Let u and v be two vertices of degree 2 lying on a path P T (u‚ v) and deg T u + ! deg T v -. Iteratively applying a total neighbor switch S B v!u for all pairs of vertices u and v of degree 2, we obtain a tree with higher s-metric (due to Corollary 3) and without vertices of degree 2. This proves the claim. , Thus, s 2 (G) = 6n -15 if and only if G has a spanning tree without vertices of degree 2. This concludes the proof. , Next, we consider bipartite graphs.

Theorem 12. The m-SF Spanning Tree and s-SF Spanning Tree problems are NP-hard for bipartite graphs.

Proof. We present a polynomial-time reduction from the NP-complete 3-Dimensional Matching (3-DM) problem (Garey and Johnson, 1979) :

Instance: Pairwise disjoint sets X, Y, Z of cardinality n, and a collection M of m three-element sets, where each M 2 M includes exactly one element from each of X, Y, and Z.

Question: Is there a set of pairwise disjoint members of M (a perfect 3-dimensional matching), whose union is X [ Y [ Z?

Let Q = (X‚ Y‚ Z‚ M) be an instance of 3-DM. We construct a graph G = G Q on 3n + m + 1 vertices as follows. The vertex set of G is the disjoint union frg [ A [ B, where A = M, B = X [ Y [ Z, and r is the special root vertex. The edge set includes all edges ra, a 2 A, as well as the edges Mx, My, and Mz for each M = fx‚ y‚ zg 2 A (Fig. 2) . We may assume that G is connected. Note also that G is a bipartite graph with the parts A and frg [ B.

For a vertex v of G and a subset W V(G) let us denote by (v : W) the set of edges connecting v to vertices in W.

Lemma 13. There are spanning trees T 1 and T 2 in G, both containing all edges of (r : A), with m(T 1 ) = s 1 (G) and s(T 2 ) = s 2 (G).

Proof. We provide the proof for the s-metric, the proof for the m-metric is similar. Among the optimal spanning trees of G, let T 2 be the one with the maximum number of edges from (r : A). We claim that T 2 contains all these edges.

Suppose for a contradiction that the set C A of all vertices that are adjacent to r in T 2 differs from A. Then there must be a vertex b 2 B with P T 2 (r‚ b) having two edges, such that set D = N T 2 (b) \ fAnCg is nonempty. By Lemma 5, since deg T 2 r + = deg T 2 band deg T 2 r ! 1, we can apply total neighbor switch S D b!r to construct a spanning tree T 0 2 from T 2 with s(T 0 2 ) ! s(T 2 ), and the root r having more neighbors in T 0 2 than it has in T 2 . , Any spanning tree T of G containing all edges of (r : A) has m + 3n edges, 3n(m -1) paths of length three (each of the 3n edges of the tree connecting A and B induces exactly m -1 such paths), and m(m -1)=2 + 3n paths of length two that are not formed by a pair of edges between A and B. There are 3d 4 + d 3 remaining paths of length two, where d i is the number of vertices in A that have degree i in the tree. Indeed, a vertex v 2 A with j 2 f0‚ 1‚ 2‚ 3g neighbors from B in the tree contributes no such path in case of j 2 f0‚ 1g, one such path in case of j = 2, and three such paths in case of j = 3. Thus by Proposition 1 m(T) = m 2 + m + 12n + 6d 4 + 2d 3 ‚ s(T) = m 2 + 3mn + 6n + 6d 4 + 2d 3 :

Since jBj = 3n, we have 3d 4 + 2d 3 3n and 6d 4 + 2d 3 6d 4 + 4d 3 6n. Hence, 6d 4 + 2d 3 6n with equality holding if and only if d 3 = 0 and d 4 = n.

A perfect 3-DM M Ã = fM 1 ‚ . . . ‚ M n g induces the spanning tree T M Ã that contains all edges from (r : A) and edges ax‚ ay‚ az for each a = fx‚ y‚ zg 2 M Ã . For this tree we have d 4 = n and m(T) = m 2 + m + 18n : = t 1 (n‚ m)‚ s(T) = m 2 + 3mn + 12n = : t 2 (n‚ m):

Conversely, every spanning tree T that contains all edges from (r : A) and m(T) = t 1 (n‚ m) or s(T) = t 2 (n‚ m) (and thus d 4 = n) arises from a perfect 3-DM.

By Lemma 13, the graph G satisfies s 1 (G) ! t 1 (n‚ m) (resp., s 2 (G) ! t 2 (n‚ m)) if and only if there is a spanning tree T of G that contains all edges from (r : A) and whose m-metric (resp., s-metric) is equal to t 1 (n‚ m) (resp., t 2 (n‚ m)). The latter is true if and only if Q has a perfect 3-DM. ,

Here we describe two ILP models for the s-SF Spanning Tree problem (for the m-SF Spanning Tree problem the approach is similar). For a given spanning tree T of a graph G = (V‚ E) of order n, consider the indicator variables (x e ) e2E :

1‚ e 2 E(T); 0‚ otherwise:

2. An example of the graph G for n = 3, X = fx 1 ‚ x 2 ‚ x 3 g, Y = fy 1 ‚ y 2 ‚ y 3 g, Z = fz 1 ‚ z 2 ‚ z 3 g, and M = ffx 1 ‚ y 2 ‚ z 1 g‚ fx 3 ‚ y 2 ‚ z 3 g‚ fx 2 ‚ y 1 ‚ z 1 g‚ fx 1 ‚ y 2 ‚ z 3 g‚ fx 3 ‚ y 1 ‚ z 2 g‚ fx 2 ‚ y 3 ‚ z 1 gg. Here each vertex labeled fp‚ q‚ rg represents a set fx p ‚ y q ‚ z r g.

ORLOVICH ET AL.

Using Proposition 1, we can represent s(T) as

x e i x e j x e k + 2 X fe i ‚ e j g 2 G 2 (G)

x e i x e j + X e 2 E(G)

where G i (G) denotes the set of all paths of length i in G. To linearize (14), we introduce Boolean variables y ijk and y ij and the following constraints:

for every fe i ‚ e j ‚ e k g 2 G 3 (G) and fe i ‚ e j g 2 G 2 (G), which are equivalent to y ijk = x e i x e j x e k and y ij = x e i x e j . Thus the objective function (14) can be rewritten as

We use two types of constraints to describe the spanning trees. The first type is the extended formulation of Martin (1991) , which uses auxiliary variables

where z r (v‚ r) = 0 for every r 2 V(G) and vr 2 E(G). A 0/1-vector x describes a spanning tree of G if and only if these variables satisfy the constraints

The second type exploits the Miller-Tucker-Zemlin (MTZ) constraints (Miller et al., 1960) . We introduce the auxiliary variables

and constraints

where r 2 V(G) is some fixed vertex. Finally we add the additional constraint

defined by Theorem 7, which turns out to significantly improve the algorithm running times. Maximization of the objective (16) 

In this section, we investigate the practical aspects of scale-free spanning tree problems by conducting computational experiments for various simulated and experimental data sets to evaluate the performance of the ILP models. All computations below were performed on a standard laptop with 2.0 GHz dual core processor and 16 GB of RAM, and ILP problems were solved using Gurobi 8.1. 5.1. Synthetic data 5.1.1. Synthetic graphs. We used graphs from the following synthetic data sets:

Erd} os-Re´nyi graphs constructed by adding each possible edge uniformly and independently with the probability p = 4:25=n. The number of nodes n in our experiments varied from 10 to 40 (corresponding to the sizes of HCV outbreaks analyzed later).

n · m grid graphs (Cartesian products of paths P n and P m ) with n‚ m = 4‚ . . . ‚ 7.

Scale-free graphs of two types generated using NetworkX library (Hagberg et al., 2008) : those based on the classical Barabási and Albert (1999) model and those constructed with NetworkX default parameters. The latter graphs are usually denser.

For all synthetic data sets except for grid graphs, we generated 10 graphs per node number. Figures 3 and  4 illustrate the running times of the ILP solver on both the MTZ formulation and the Martin formulation compared with the published tool QUENTIN (Skums et al., 2018) runtimes for all four simulated graph classes. 1 The results demonstrate that for those graph classes, the ILP algorithms in average perform much better than in the worst case and are able to produce optimal results in a reasonable amount of time. Moreover, for considered graph sizes, they outperform QUENTIN. For Erd} os-Rényi graphs and grids (Fig. 3) , which are characterized by relatively large sets of feasible solutions, the Martin formulation was superior to MTZ and QUENTIN, while for Barabási-Albert scale-free graphs (Fig. 4a) , the MTZ formulation was leading to the faster algorithm. In general, the ILP approach allows to solve the problem within minutes or few hours for small-to-medium-sized problems (up to several dozens of vertices) on Erd} os-Rényi graphs and grids, and for medium-sized problems (several hundred vertices) on scale-free graphs.

We simulated outbreaks over scale-free Barabási-Albert contact networks of n = 10 -30 nodes using the following model. The infection spreads over each network according to the susceptible infected (SI) model (Newman, 2010) with the transmission rate q = 10 -2 . Each infected individual is assumed to carry a viral sequence of length m = 13200, and at each transmission event, the source's sequence is transmitted to the recipient. Sequence evolution is described by a skyline model with the piecewise constant decreasing mutation rate, that is, viral sequences mutate at the basic rate of l = 10 -5 changes/position/time unit, and the mutation rate is decreasing by 30% every s = 100 time units. This model captures the decrease of the speed of intrahost evolution as the infection progresses from an acute to a persistent stage (De Maio et al., 2016; Icer et al., 2020) .

For each simulated outbreak, we compared the performance of the ILP algorithm for the Martin formulation, with the standard approach based on the phylogenetic trait inference (Sagulenko et al., 2018) . First, we constructed a maximum likelihood phylogeny using MEGA (Kumar et al., 2018) . Each patient was encoded by a discrete trait, and the marginal likelihood ancestral traits were reconstructed using the Felsenstein pruning algorithm (Felsenstein, 2004) with the pairwise between-trait transition rates equal to q. Inferred transmission links then correspond to trait changes along the phylogeny branches. The genetic relatedness network G R used as an input for the ILP was constructed using a threshold-based approach suggested by Kosakovsky Pond et al. (2018) . A pair of vertices of G R are adjacent, if the Hamming distance between the corresponding sequences does not exceed a threshold t that was estimated as the minimal integer such that the graph G R is connected. The obtained graph was further sparsed out by applying the same procedure to each of its biconnected component.

The results of algorithms' comparison are shown in Figure 5 . We measured algorithm accuracy by the proportion of correctly inferred transmission links and transmission ancestries (i.e., pairs, ancestor/ descendant). s-SF-based ILP clearly outperformed the phylogenetic approach: the average transmission link detection accuracy was 82:44% for the former and 72:61% for the latter, while the average transmission ancestry detection accuracies were 97:48% and 73:96%, respectively.

We applied the concept of scale-free spanning trees to the graphs arising from the benchmark data set consisting of several epidemiologically curated HCV outbreaks investigated by the CDC (Campo et al., FIG. 3 . Running times of ILP solver and QUENTIN on Erd} os-Rényi graphs (a) and grids (b). ILP, integer linear programming.

2016; Glebova et al., 2017; Skums et al., 2018) . This data set comprises HCV quasispecies populations sampled from 81 infected individuals involved in 10 viral outbreaks. Each population consists of RNA sequences of HCV hypervariable region 1 (HVR1) of length 264 bp. Transmission histories of the outbreaks (''who infected whom'') are known as a result of epidemiological investigations. In this case, we are dealing with intrahost viral populations rather than single sequences, and therefore, we compared the proposed approach with QUENTIN, which has been specifically designed to handle such data (Skums et al., 2018) .

For each outbreak, the genetic relatedness network G R was constructed using the threshold-based approach suggested by Campo et al. (2016) . The vertices of G R are adjacent, if the minimal Hamming distance 

ORLOVICH ET AL.

between the sets of sequences sampled from these patients does not exceed the threshold t. The threshold value was estimated as described in Subsection 5.1.2. Next, the ILP algorithm for the Martin formulation has been applied to G R . For all outbreaks, the ILP problem has been solved to optimality. We tested the accuracy of inference of transmission links and identification of the superspreaders (the sources of majority of infections). The results are reported in Table 1 . The superspreaders correspond to vertices of highest degrees in s-optimal and m-optimal trees for 9 out of 10 outbreaks. It should be noted that all algorithms incorrectly identified a superspreader for the same outbreak. It is the only outbreak where the virus was transmitted via a nonsocial interaction (namely, through blood transfusions), while all other outbreaks were associated with unsafe injection practices or sexual contacts. For those outbreaks, both ILP approaches correctly recovered 92% of transmission links and all ancestor/descendant pairs, thus outperforming QUENTIN.

In genomic epidemiology, reconstruction of viral transmission histories from genomic data is fundamental for the investigation of outbreaks and understanding of epidemic spread. Genomic analysis has become one of the major tools for the investigation of outbreaks and surveillance of transmission dynamics (Armstrong et al., 2019; Knyazev et al., 2020) . Naturally, graphs are the primary models used in such studies (Wertheim et al., 2014; Campo et al., 2016; Ragonnet-Cronin et al., 2019) . In many settings, graphbased methods have been shown to be more efficient to ascertain transmission links compared with methods based on binary phylogenies (Wertheim et al., 2014) , as phylogenetic clades are not easily resolvable into transmission clusters and pairs (Lewis et al., 2008; Hughes et al., 2009; Kouyos et al., 2010) , while the statistical support for a clade does not necessarily indicate the statistical support for a relationship between individual genomes inside a clade (Volz et al., 2012; Wertheim et al., 2014) . However, in many cases, transmission links cannot be inferred using the genomic data alone ( Jombart et al., 2014; Villandre et al., 2016) . It leads to the need to introduce additional constraints on the reconstructed transmission networks or utilize more complicated objectives.

As a result, the associated algorithmic problems become harder. In this article, we studied one such problem-scale-free spanning tree problem-that arises in epidemiological studies of viruses whose spread is highly influenced by social networks of contacts between susceptible individuals. This includes HIV, HCV, and other pathogens transmitted through sexual contact or needle sharing. We demonstrated that this problem in its two possible algorithmic formulations is NP-hard, even if restricted to relatively simple graph classes. However, it admits an ILP formulation allowing to efficiently solve the problem for small-tomedium networks. It is often enough for the vast majority of outbreaks of HIV and HCV that involve dozens of infected individuals.

However, some outbreaks involve hundreds or even thousands of hosts, and in such cases, more scalable algorithmic solutions are needed. Thus, an important open problem is to establish whether constant or logarithmic approximation exists for the m-SF Spanning Tree and s-SF Spanning Tree problems. In this context, it would be interesting to explore the relationships between scale-free spanning tree problems and max-leaf spanning tree problems. The latter is a well-studied combinatorial problem (Griggs et al., 1989; Galbiati et al., 1994) , which seems to be the closest to our problem. Indeed, both problems aim to find a ''star-like'' spanning tree; furthermore, several reduction schemes for the proof of NP-completeness used by us exploit this relationship. Importantly, Lu and Ravi (1998) and Reich (2016) showed that the max-leaf spanning tree problem is approximable within a constant factor. Although the problems are far from being equivalent, it may seem reasonable for future studies to try to adopt algorithmic machinery developed for the max-leaf spanning tree problem to the scale-free spanning tree problem.

Pathogen genomics in public health

Emergence of scaling in random networks

Max-leaves spanning tree is APX-hard for cubic graphs

Bounds for Zagreb indices

Detailed transmission network analysis of a large opiate-driven outbreak of HIV infection in the United States

Accurate genetic detection of hepatitis C virus transmissions in outbreak settings

Graphs & Digraphs

Sharp bounds for the sum of the squares of the degrees of a graph

Some properties of the second Zagreb index

An upper bound on the sum of squares of degrees in a graph

SCOTTI: Efficient reconstruction of transmission within outbreaks with the structured coalescent

Inferring phylogenies

A short note on the approximability of the maximum leaves spanning tree problem

Dimensions of superspreading

Computers and Intractability: A Guide to the Theory of NP-Completeness

Inference of genetic relatedness between viral quasispecies from sequencing data

Spanning trees with many leaves in cubic graphs

Exploring network structure, dynamics, and function using NetworkX

Epidemiology and natural history of HCV infection

Clinical features of patients infected with 2019 novel coronavirus in Wuhan

Molecular phylodynamics of the heterosexual HIV epidemic in the United Kingdom

Quantitative differences between intra-host HCV populations from persons with recently established and persistent infections

Modeling the spread of HIV and HCV infections based on identification and characterization of high-risk communities using social media, 425-430

Bayesian reconstruction of disease outbreaks by combining epidemiologic and genomic data

Reconstructing disease outbreaks from genetic data: A graph approach

Global epidemiology of HIV

Algorithms and complexity results for finding graphs with extremal Randić index

Spanning trees with many leaves

Simultaneous inference of phylogenetic and transmission trees in infectious disease outbreaks

Epidemiological data analysis of viral quasispecies in the nextgeneration sequencing era

HIV-TRACE (TRansmission Cluster Engine): A tool for large scale molecular epidemiology of HIV-1 and other rapidly evolving pathogens

Molecular epidemiology reveals long-term changes in HIV type 1 subtype B transmission in Switzerland

MEGA X: Molecular evolutionary genetics analysis across computing platforms

Transmission network parameters estimated from HIV sequences for a nationwide epidemic

The maximum leaf spanning tree problem for cubic graphs is NP-complete

Episodic sexual transmission of HIV revealed by molecular phylodynamics

Towards a theory of scale-free graphs: Definition, properties, and implications

Approximating maximum leaf spanning trees in almost linear time

Using separation algorithms to generate mixed integer model reformulations

Integer programming formulation of traveling salesman problems

Networks

Optimization, approximation, and complexity classes

A phylogenetic analysis of surveillance data

A large HCV transmission network enabled a fast-growing HIV outbreak in rural Indiana

Complexity of the maximum leaf spanning tree problem on planar and regular graphs

TreeTime: Maximum-likelihood phylodynamic analysis

QUENTIN: Reconstruction of disease transmissions from viral quasispecies genomic data

TreeFix-TP: Phylogenetic error-correction for infectious disease transmission network inference

Assessment of overlap of phylogenetic transmission clusters and communities in simple sexual contact networks: Applications to HIV-1

Simple epidemiological dynamics explain phylogenetic clustering of HIV from patients with recent infection

The global transmission network of HIV-1

The authors declare they have no conflicting financial interests.

Y.O. was partially supported by the BRFFR grant (Project F20UKA-005). The work of V.K. and K.K. was supported by the German National Science Foundation via DFG-Research Training Group 2297 (Mathematical Complexity Reduction-MathCoRe). P.S. was supported by the National Institutes of Health grant 1R01EB025022 and by the National Science Foundation grant 2047828.