key: cord-0036519-5bnrpoiz authors: Vu, Hoa; Chin, Francis; Hon, W. K.; Leung, Henry; Sadakane, K.; Sung, Ken W. K.; Yiu, Siu-Ming title: Reconstructing k-Reticulated Phylogenetic Network from a Set of Gene Trees date: 2013 journal: Bioinformatics Research and Applications DOI: 10.1007/978-3-642-38036-5_14 sha: cbacba9a7ad8917f10e5050fe2198e215983c440 doc_id: 36519 cord_uid: 5bnrpoiz The time complexity of existing algorithms for reconstructing a level-x phylogenetic network increases exponentially in x. In this paper, we propose a new classification of phylogenetic networks called k-reticulated network. A k-reticulated network can model all level-k networks and some level-x networks with x > k. We design algorithms for reconstructing k-reticulated network (k = 1 or 2) with minimum number of hybrid nodes from a set of m binary trees, each with n leaves in O(mn (2)) time. The implication is that some level-x networks with x > k can now be reconstructed in a faster way. We implemented our algorithm (ARTNET) and compared it with CMPT. We show that ARTNET outperforms CMPT in terms of running time and accuracy. We also consider the case when there does not exist a 2-reticulated network for the input trees. We present an algorithm computing a maximum subset of the species set so that a new set of subtrees can be combined into a 2-reticulated network. The study of evolutionary history of a species plays a crucial role in biomedical research. For example, understanding the evolutionary history of a virus (e.g. SARS) may help us deduce the natural reservoirs of the virus, thus identifying the source of the virus. The details of how the virus evolves may help to uncover clues to treat or vaccinate the virus and understand how it evolves resistance to existing drugs. A traditional representation of evolutionary history is phylogenetic tree (a rooted, unordered tree with distinctly labeled leaves, each represents a species or a strain of the species). To construct a phylogenetic tree, a common practice is to select a group of genes, which are believed to be critical for evolution, to represent the species. However, selecting a different set of genes may end up with a different phylogenetic tree (called a gene tree). To deal with this issue, researchers may try to extract the subtrees which are common in all trees (known as the maximum agreement subtree problems, see [1] [2] [3] for examples) and ignore the other non-common parts. This may result in a small tree. Also, information not in the common subtree will be lost. It is now well-known that the differences in the gene trees are not due to errors. There exist evolutionary events (known reticulation events), such as hybridization, horizontal gene transfer, and recombination, that may cause the genes to evolve differently and a phylogenetic tree is not powerful enough to model the resulting evolutionary history [4] . To model the evolutionary history better, phylogenetic network is proposed. Phylogenetic network is a generalization of phylogenetic tree (note that in this paper, we focus on rooted bifurcating (each node has at most 2 descends) phylogenetic tree/network). Phylogenetic network is defined as a rooted, directed acyclic graph in which (1) exactly one node has indegree 0 (the root), and all other nodes have indegree 1 or 2; (2) all nodes with indegree 2 (hybrid nodes or reticulation nodes) have outdegree 1, and all other nodes have outdegree 0 or 2; and (3) all nodes with outdegree 0 (leaves) are distinctly labeled. For a hybrid node h in a phylogenetic network, every ancestor s of h such that h can be reached using two disjoint directed paths starting from the children of s is called a split node of h (and h is called a hybrid node of s). The edges attached to a hybrid node is called hybrid edges. Figure 1 shows an example. Typically, a split node is used to represent a speciation event (two different species are evolved) while a hybrid node is used to represent the reticulation event between the two descendants of the split node. We say that a phylogenetic network N is compatible with (induces or displays) a set of gene trees if each tree can be obtained from N by deleting one of the hybrid edges of each hybrid node and contracting all nodes with outdegree and indegree equal 1 (see Figure 3 for an example). If there is no restriction, for any given set of trees, we can always have a phylogenetic network that induces the trees. However, reticulation events are hard to occur, so a more biological meaningful question is to ask for such a phylogenetic network with the minimum number of hybrid nodes. A common classification of phylogenetic network is the level-x network [11 -12] . A level-x network is one in which each biconnected component (also known as blob [5] ) of the network contains at most x hybrid nodes. A level-0 network is a phylogenetic tree, a level-1 network is also known as a galled tree [5] or a galled network [6] . There are algorithms that reconstruct a level-x network, however, the time complexity increases exponentially in x even if we only consider some restricted cases. Thus, in practice, if x > 2, the algorithm is not fast enough. On the other hand, the evolutionary history of quite many viruses can only be modeled by high level networks (with x > 2). For example, to capture all known reticulation events of HIV [7] , we need to use a level-4 network ( Figure 2 shows the network). In this paper, we propose to consider a new classification of networks by restricting the maximum number of hybrid nodes each split node may have, namely a kreticulated network is one in which each split node can correspond to at most k hybrid nodes. This new classification is also supported by evidence in real life cases. Several studies of recombination in bacteria have shown that recombination rates decrease as sequence divergence increases [8] [9] . These studies imply that the number of recombination events of a split node will be limited as the descendants from the same split node will diverge more as the number of generations increases. This observation is also supported by a computer simulation study [10] . Therefore, networks with limited reticulation events for each split node while no limit on the total number of reticulation events in each blob seem to be more biologically relevant and can model the recombination events in nature more appropriately. Our Contributions. This new classification of phylogenetic networks is more powerful than level-x networks. By definition, every level-x network is also an xreticulated network. And some level-x network can be modeled by a k-reticulated network with k < x (see Figure 2 for an example of a level-4 network which is also a 2-reticulated network). So, even solving the problem for k-reticulated network with k as small as 2, some of the meaningful high level networks can be constructed efficiently. We show that given a set of binary gene trees, one can reconstruct an 1reticulated or 2-reticulated network (if one exists) with minimum number of hybrid nodes compatible with all trees in O(mn 2 ) time where m is the number of trees and n is the number of leaves. We also consider the problem that when a compatible 2reticulated network does not exist, compute a subset of species with maximum size so that a 2-reticulated network exists. This problem is believed to be NP-hard and we provide an O(2 mm n 3m ) algorithm to solve it. We implement the 2-reticulated network reconstruction algorithm (ARTNET) and compare it with the program CMPT [13] that reports a phylogenetic network with the smallest number of hybrid nodes. We only consider the case when a 2-reticulated network exists for the input set of trees. The experiments show that ARTNET is more efficient than CMPT. When the number of hybrid nodes increases, the running time of ARTNET only increases slightly while that of CMPT increases rapidly. Regarding accuracy evaluation, ARTNET also outperforms CMPT. Related Work. Several methods of constructing phylogenetic networks have been proposed. Nakhleh et al. [6] have developed an algorithm for constructing a level-1 phylogenetic network from two phylogenetic trees running in polynomial time. However, Nakhleh et al.'s algorithm can handle two trees only. Huynh et al. [12] have succeeded in providing a O(|T| 2 n 2 ) algorithm reconstructing a galled network from a set T of multiple phylogenetic trees of arbitrary degree. Huson and Klopper [14] gave an O(n k ) algorithm constructing restricted level-k network from a set of trees. A rooted phylogenetic tree can be uniquely represented by the set of triplets obtained by taking all combinations of three leaves in the tree [12] . It takes O(n 3 ) running time to construct a galled network in the algorithm designed by Jansson, Nguyen and Sung [15] . Extending to level-2 network, Van Iersel et al. [11] developed an O(n 8 ) time algorithm. Habib and To [16] have solved the general problem of constructing level-k network from a dense triplet set T in exponential running-time | | . Gambette et al. [17] have shown that we can decide in optimal O(n 4 ) time whether there exists a simple unrooted level-1 network for a set of all quartets. = the subtree of T rooted at u, and L(T) = the leaf label set of T. If u is a node in network N, a subnetwork N[u] is obtained from N by only retaining all nodes and their incident edges which are reachable from u, and L(N) is the set of leaf labels of N. Given a subtree t of T, T is a subtree obtained by removing t from T. Similarly, with a subnetwork N' of N, N\N' is a network obtained by removing N' from N. Given a tree T with the leaf set L, and . T|L' denotes a subtree obtained by first deleting all nodes which are not on any directed path from the root to a leaf in L' along with their incident edges, and then, for every node with outdegree 1 and indegree less than 2, contracting its outgoing edge. Denote P(N, T i , L) the procedure to reconstruct a k-reticulated network N compatible with T 1 , T 2 , …, T m , where k = 1 or 2. We employ the divide-and-conquer technique. Base Case: if each input tree is a single node with the same label, return a network which is that single node of the same label; otherwise consider the following cases: If h i ≠ r i1 and h i ≠ r i2 for i = 1…m, the problem can be divided into 3 subproblems: Network N can be combined from N 1 , N 2 and N h by first creating a new node r to be the parent of the roots of N 1 and N 2 . Find node u 1 in N 1 and u 2 in N 2 such that for i = 1…m, either u 1 or u 2 corresponds to h i 's sibling s i . Let v 1 and v 2 be the parent of u 1 and u 2 respectively, create nodes p 1 and p 2 on edges (v 1 , u 1 ) and (v 2 , u 2 ) respectively. A new hybrid node h is created, and let h be a child of p 1 and p 2 , and h be the parent of N h 's root (Fig. 5) . Given network N compatible with tree T, a node u in N is said to correspond to a node s in T if T can be converted from N by a series of cuts in which any edge contraction related to node u will create a new node that is labeled u, then u becomes s. If there is a tree T i in which h i is a child of the root r i , the network constructed in this case is skew (i.e., there is a split node such that the path from the split node to its hybrid node is 1). The problem can be divided into 2 sub-problems: P(N', T i ', L 1 }; and P(N h , T i [h i ], L h ). If N' and N h can be constructed, N can be obtained by first creating a node r and making r become the parent of the root of N'. Find a node u in N' such that for every tree T i in which h i is not a child of the root r i , u corresponds to s i in T i ', which is the sibling of h i before removing T i [h i ]. Let v be the parent of u, and a new node p on edge (v, u), create a hybrid node h that is the child of p and r, and h is the parent of the root of N h (Fig. 6) . To solve this problem, we also consider the base case, Case I and Case II as in the above. In addition, we need to consider Case III -Quadripartition as follows. The tree set {T 1 , T 2 , …, T m } is said to admit a leaf-set-quadripartition (L 1 , L h1 , L h2 L 2 ) if for every tree T i with root r i and its children r i1 and r i2 , there exists a node h i2 ∉ If N' is a non-skew network, N' is created by combining three 2-reticulated networks N 1 , N 2 and N h1 (as case II). Find two nodes a and b in two distinct networks out of three networks N 1 , N 2 and N h1 such that either a or b corresponds to node s i in T i ', which is the sibling of h i2 in T i , for i = 1, 2,…, m. Attaching N h2 to N' is done similarly to case II by creating a new hybrid node h 2 (Fig. 7) . If N' is a skew-network, N' is created by combining two 2-reticulated networks N 1 and N h1 . Find nodes a and b in N 1 and N h1 respectively such that for i = 1…m, either a or b corresponds to node s i in T i ', which is the sibling of h i2 in T i . Attaching N h2 to N' by creating a new hybrid node h 2 . (Fig.8 respectively. We have L 1 ∩L 2 = ∅ and L 1 ∪L 2 = L, so in N and every tree T i , their root is the only common ancestor of any node in L 1 and any node in L 2 . This means the input tree set admit a leaf set bipartition, corresponding to case I in the algorithm. does not correspond to any hybrid node, the argument can be turn back to Case 2. This implies that the tree set admit a leaf set tripartition, corresponding to case 2 of the algorithm. Case 4: r corresponds to two hybrid nodes h 1 and h 2 Let p 1 and q 1 be the parents of h 1 , and p 2 and q 2 be the parents of h 2 , then either h 1 lies on one of the merge paths from the root r to h 2 , or none of the merge paths from r to h 2 (resp. h 1 ) go through h 1 (resp. . It takes O(mn) to divide L c into two disjoint subsets L c1 and L c2 , and L d into two disjoint subsets L d1 and L d2 (Fig. 9) . Pick one leaf node v in L c , then For every leaf node u ∈ L c : If LCA(v, u) is not the root of every tree T i ; u is put in L c1 ; else, u is put in L c2 . For every leaf node w ∈ L d : If LCA(v, w) is the root of every tree T i ; w is put in L d1 ; else w is put in L d2 . Claim 3: If the tree set admits a leaf set quadripartition (L 1 , L h1 , L h2 , L 2 ), one of two sets L c1 ∪L d1 or L c2 ∪L d2 can be either (1) L h1 , or (2) L h2 , or (3) L h1 ∪L h2 . Pick any tree, say T 1 , to find the p 1 = LCA(L c1 ∪L d1 ) and p 2 = LCA(L c2 ∪ L d2 ). 1. One of two nodes p 1 or p 2 is the root r 1 and the other is not. Assume p 1 = r 1 , and p 2 is a proper descendant of r 1 , then L c2 ∪ L d2 = L h1 or L c2 ∪ L d2 = L h2 . 2. If both p 1 and p 2 are r 1 , find j = 1 or 2 such that LCA(L cj ) and LCA(L dj ) are the children of the root r 1 . If j does not exist, return "null"; otherwise, assume j = 1, then L h1 ∪L h2 = L c2 ∪L d2 . It takes O(n) time to determine L' which is either L h1 , or L h2 or L h1 ∪ L h2 . , m} admit a leaf set tripartition (L 1 , L h , L 2 ). If yes, L h2 = L' and L h = L h1 ; else return "null". If there is a tree T j in which there does not exist any node w such that L(T j [w]) = L' (Fig.10) . If h j is LCA(L'), then L' = L h1 ∪ L h2 , which is examined as case 2 below. Fig. 10 . With a leaf set L h1 , find a node w in T j such that L(T j [w]) = L h1 ∪ L h2 Let T k be a tree in which there is no node satisfying L' = L(T k [w] In total, checking whether the input trees admit a leaf set bipartition or tripartition or quadripartition, and partition every tree into proper subtrees takes O(mn).  Claim: Node u exits iff u i is u or a descendant of u such that all nodes on the path from u to u i are either hybrid node of a skew split node or non-hybrid nodes whose siblings are hybrid nodes. It takes O(mn) to find the set {u 1 , u 2 , …, u m } from N (note that u x can be u y ), and O(n) time to check (i) all nodes {u 1 , u 2 , …, u m } lie on the same directed path; and (ii) The siblings of u i , i = 1, 2, …, m, are all hybrid nodes. If two conditions are satisfied, the node u will be the starting node x of the path created by {u 1 , u 2 ,…, u m } or x's sibling if x's sibling is the hybrid node of a skew split node ; otherwise, return "null". We evaluate and compare the performance of our method, namely ARTNET, with the program CMPT [13] which constructs a network with the smallest number in reticulation from a set of binary trees. NETGET [10] is used to generate random networks. For every 2-reticulated network simulated, we produce a certain number of induced binary trees which are the input of both programs ARTNET and CMPT. We use n (number of leaf node) = 40. Figure 11 shows that we run faster than CMPT. Following [11] , we use split-based false negative (FN) and false positive (FP) rates to measure the error rates of the methods. Figure 12 shows that ARTNET produces fewer false positives than CMPT. On the other hand, CMPT and ARTNET have similar performance in false negative rates. Computing the unrooted maximum agreement subtree in sub-quadratic time Sparse dynamic programming for evolutionary-tree comparison Kaikoura tree theorems: Computing the maximum agreement subtree Phylogenetic classification and the universal tree A fundamental decomposition theory for phylogenetic networks and incompatible characters Reconstructing reticulate evolution in speciestheory and practice RB-finder: An improved distance-based sliding window method to detect recombination breakpoints Mismatch induced speciation in salmonella: model and data Sexual isolation in bacteria Recombination and the nature of bacterial speciation Constructing level-2 phylogenetic networks from triplets Constructing a smallest refining galled phylogenetic network Algorithms for Reticulate Networks of Multiple Phylogenetic Trees Beyond galled trees -decomposition and computation of galled networks Algorithms for combining rooted triplets into a galled phylogenetic network Constructing a minimum phylogenetic network from a dense triplet set Quartets and Unrooted Phylogenetic Networks Given a set of binary trees {T 1 Using brute-force approach by considering all possible subsets of the leaf set, the problem can be done in O(2 n mn 2 ). However, when m ≪ n, the following algorithm produces better time complexity. , if there is a tree whose root is pointed by any _ , 1 (resp. _ , 2 with a specific value (resp. ), then for every other tree containing a node that is pointed by _ , 1 (resp. _ , 2 ), replace in computing .Time Complexity: By applying dynamic programming, and backtracking on the recursive equations, the problem can be computed in O(2 m mn 3m ). 