key: cord-0701206-fezuzaoc authors: Wertheim, Joel O; Steel, Mike; Sanderson, Michael J title: Accuracy in near-perfect virus phylogenies date: 2021-08-16 journal: Syst Biol DOI: 10.1093/sysbio/syab069 sha: da6dab33ecb4a7759668de14fd3e802e584894e9 doc_id: 701206 cord_uid: fezuzaoc Phylogenetic trees from real-world data often include short edges with very few substitutions per site, which can lead to partially resolved trees and poor accuracy. Theory indicates that the number of sites needed to accurately reconstruct a fully resolved tree grows at a rate proportional to the inverse square of the length of the shortest edge. However, when inferred trees are partially resolved due to short edges, “accuracy” should be defined as the rate of discovering false splits (clades on a rooted tree) relative to the actual number found. Thus, accuracy can be high even if short edges are common. Specifically, in a “near-perfect” parameter space in which trees are large, the tree length ξ (the sum of all edge lengths) is small, and rate variation is minimal, the expected false positive rate is less than ξ∕3; the exact value depends on tree shape and sequence length. This expected false positive rate is far below the false negative rate for small ξ and often well below 5% even when some assumptions are relaxed. We show this result analytically for maximum parsimony and explore its extension to maximum likelihood using theory and simulations. For hypothesis testing, we show that measures of split “support” that rely on bootstrap resampling consistently imply weaker support than that implied by the false positive rates in near-perfect trees. The near-perfect parameter space closely fits several empirical studies of human virus diversification during outbreaks and epidemics, including Ebolavirus, Zika virus, and SARS-CoV-2, reflecting low substitution rates relative to high transmission/sampling rates in these viruses. (C-iii) the split described by f is compatible with k characters that are perfectly evolved on T (by the Markovian process described above). 139 In other words, we are interested in 'false splits' (i.e., splits in the reconstructed MP tree (2) The probability of a false split is then given by the following theorem (see SI for 150 proof). Theorem 1 For each T ∈ B(n), and k 1 we have: Theorem 1 shows that for fixed k and n, the shape of T plays a significant role in In general, it is mathematically complicated to describe F P T in terms of these 165 parameters. However, when the number of leaves in a tree grows faster than the number of 166 perfectly compatible characters, it is possible to state a limit result to provide an 167 approximation to F P T for large trees. 168 In the following theorem, we consider the following setting: 169 (I) mξ = Θ(n β ) for some 0 < β < 1 2 , and 170 (II) mξ 2 = O(1), 171 where O(1) refers to dependence on n (thus mξ 2 is not growing with n). Note that 172 Condition (I) implies that the number of perfectly evolved characters grows with the 173 number of leaves, but at a rate that is slower than linearly. Conditions (I) and (II) imply 174 that ξ decreases as n increases. 175 In this setting, we show that the false positive rate is (asymptotically) of the form ξ 3 176 times a function Ω that involves T (via its shape), m, and ξ. If we now treat ξ as a 177 variable, then for ξ = 0, the function Ω is close to 1 (for large n) and so F P T initially 178 grows like ξ/3. However, as ξ increases, Ω begins to decline at an increasing rate, resulting 179 in the false positive rate reaching a maximum value before starting to decrease. To describe this result, we need to define this function Ω. Let ). Notice that Ω(T n , ξ, m) depends on T n only via the coefficientsφ Tn (i), and this 183 dependence is linear. Thus, if D is a distribution on trees (e.g. the PDA or YH), then the 184 expected value of Ω(T n , ξ, m) is given by: For the PDA distribution, the term E P DA [φ Tn (i)] has an explicit exact value, 186 namely, for all i between 1 and n − 3 (see SI for proof). Theorem 2 For each n 1, let T n be a binary phylogenetic tree with n leaves, and 189 suppose that Conditions (I) and (II) hold. (i) where o(1) is a term that tends to 0 as n grows. (ii) If T n is sampled from a distribution D (e.g. PDA, YH), then the expected value of Remarks: Note that F P Tn depends only on the shape of the tree T n (and not on 192 how its leaves are labelled), thus for a tree distribution D on either the class of caterpillar 193 trees, or symmetric trees, we have F P Tn = E D [F P Tn ]. Notice also from Fig. 3 that as ξ increases from 0 the estimate of E D [F P Tn ] given by ξ 3 · Ω(T n , ξ, m) for the YH, PDA distributions and for symmetric trees initially increases (approximately linearly) with ξ but then begins to decrease with increasing ξ. By contrast, when T n has the caterpillar tree shape, the estimate of F P Tn appears to be constant as ξ increases from 0 (see Fig. 3 ). Indeed, when T n is a caterpillar tree, the expression for F P Tn in Theorem 2(i) reduces to the following remarkably simple expression as n becomes large: which is independent of ξ (and n). Details are provided in the SI. JC+FQ -nt 1 -redo -mredo -polytomy -blmin 1e-9', replicated r times, followed by strict IQTree2 with a JC69 model, minimum branch lengths of 10 −9 , and collapsed polytomies. simulation in Seq-Gen, as described above. Edge length (per site) variation was assumed to follow a gamma distribution: Overview of Results on Accuracy Simulations of tree inference with MP over a large range of tree lengths, ξ, and 288 other parameters, illustrate several known results (Fig. 2) and perhaps a few less well 289 known ones. First, resolution of the inferred tree increases with tree length. Second, "overall" accuracy, as measured by the RF distance, is optimal at an intermediate tree and insensitive to the presence of rate variation; but, when ξ > 1, the false positive rate is 300 much more sensitive to rate variation-high when variation is present and low when absent 301 (contrast Fig. 2A and Fig. 2B ). In real-world data, as ξ increases, we expect that evidence 302 of rate variation will become more apparent. Key elements of these findings can be shown analytically in a "near-perfect" zone 304 described by a simple evolutionary model. Overview of the Mathematical Theory q q q q q q q q q q q q q q q q q q q ξ q q q q q q q q q q q q q q q q q q q meaning multiple changes at a site occur on distinct edges. Though these conditions will 314 generate alignments dominated by "perfect" sites exhibiting no homoplasy, a few sites may 315 exhibit homoplasy even with ξ 1, which motivates the term "near-perfect". Under these 316 conditions, tree reconstruction methods will tend to infer relatively unresolved trees unless 317 the number of sites is very large. Rare sites that exhibit homoplasy can introduce false positive splits on the inferred tree ( Fig. 1) . A naïve argument using Equation 1 might suggest that FP T would depend on , namely the ratio of the expected numbers of sites having changes on two edges (i.e., those that are potentially homoplastic and misleading) to those sites having only a single change (those that are reliable), for sufficiently small ξ. But because only one-third of those two-edge sites are actually homoplastic in a JC69 model, which implies FP T is small when ξ is small enough (e.g., FP T < 0.05 whenever ξ < 0.15). This approximation can be improved further by recognizing that not all two-edge of Theorem 2, we see that, for a JC69 model and trees inferred with MP, the following 337 approximation holds increasingly well as n increases: given the assumption that ξ is sufficiently small and the number of sites does not grow too Methods I, is monotonically decreasing in ξ and m, and depends on the shape of T . Simulations indicate that the approximation is close for ξ 1 (Fig. 3) , but if many equally 342 parsimonious trees are present, the search algorithm should take a strict consensus of a 343 broad sample of those solutions (Fig. S3 ). E D [FP Tn ] is better on average for PDA than YH 344 trees, and both are bounded between a theoretical worst case error rate for symmetric and 345 best case error rate for caterpillar trees. In fact, the expected false positive rate for the 346 latter is just 4/(3m) in the limit of large n, which is independent of ξ. well below 5% under our typical simulation conditions when ξ 1 (Fig. S5 ). Thus, the near-perfect tree length of ξ 1 is a region in which rate variation 365 appears to have less of an impact on false positive rates than when tree lengths are longer. This suggests that the definition of near-perfect zone in practice can include substantial 367 rate variation. We estimated key parameters from the trees and underlying data for 11 empirical 370 virus phylogenies (Table 1, Table S1 ) and used simulation to estimate expected false Trees with lowest levels of resolution (Table 1) had the highest expected false four DENV serotypes had an even higher tree length and expected false positive rate. The measles virus tree was an outlier with E D [FP Tn ] above 5%, even though its tree 391 lengths was below one. Notable, MeV had the fewest taxa of any virus analyzed (Table 1) 392 and subsequently lower phylogenetic resolution. This combination of factors implies 393 sensitivity to the assumption of large n in our results. (Table 1) for maximum parsimony (MP) inference, estimated by simulation using parameters estimated from the data (Table S1 ). Abbreviations given in Table 1 (Fig. S6) . Nonetheless, some differences were observed, which tended to imply 404 better accuracy for MP. These differences could largely be attributed to technical or 405 implementation issues in ML software. First, the computational expense of ML searches 406 makes it tempting to undertake fewer replicate searches for local optima, but this was as 407 critical to improve the fit to Equation 6 for ML as it was for MP (Fig. S6) . Second, ML 408 programs set hard numerical lower bounds strictly greater than zero on edge lengths, often (by default) on the same order asλ for the virus datasets, so these must be reset downward 410 to obtain correct tree likelihoods (Morel et al., 2021) . Finally, inferred edge lengths that 411 are larger than these programs' lower bounds but still smaller than about 1/m tend to be 412 included in the ML tree despite weak evidence (IQ-TREE 2 issues a warning about this). 413 We saw this in ML searches roughly when ξ 0.1, when three-state sites become more 414 common in alignments than they were at lower values of ξ. Even without homoplasy, ML 415 tends to over-resolve trees in a way that elevates E D [FP Tn ]. By collapsing short edge 416 lengths inferred by ML to be less than 1/m, this behavior can be mitigated (Fig. S6) . In general, ML is expected to be more accurate than MP under more realistic We explored other factors impacting support in the boundary case of perfect trees. For sequence length, we computed standard support metrics in an ML framework in 434 perfect four-taxon datasets, in which each branch was defined by a single change, and 24 WERTHEIM, STEEL, AND SANDERSON alignments range between 40 nt and 30,000 nt (Fig. S7) integrate relevant information about low tree lengths. On the other hand, in perfect trees from 8-128 taxa, in which the mean edge length 448 remained the same (but therefore ξ grew with n), mean SH-aLRT and aBayes support was 449 unchanged, but mean ultrafast bootstrap support increased (Fig. S8) . The TBE method 450 was developed to correct for a downward bias of bootstrap values often seen in large trees. As expected, TBE exceeds conventional bootstrap support as taxon number increases. However, this increase is modest in perfectly symmetrical trees compared with perfectly 453 asymmetric trees and only surpasses 95% in the largest asymmetric trees (Fig. S8 ). In this paper, we study a "near-perfect" parameter space for phylogenetic inference 456 on large trees with small tree lengths and no rate variation within or between sites or 457 edges. The "near-perfect" tree length of ξ 1 means that few sites exhibit homoplasy and, 458 for MP inference, the false positive rate can be much better than the false negative rate For example, with no rate variation, the false positive rate can be very good even when 462 ξ > 1 ( Fig. 2A, S5) , and, if ξ < 1, a substantial level of rate variation can be present 463 without elevating the false positive rate by nearly as much as when ξ > 1 (Fig. 2,4, S4) . The second case is clearly more relevant in real-world data. The 11 empirical virus 465 datasets all had substantial rate variation and showed a general increase in false positive 466 rate with ξ, with almost all rates below 5% occurring when ξ 1, much like the predicted 467 patterns seen in Fig. 2B and Fig. 4 . This observation accords with our simulation results 468 suggesting that the good "near-perfect" false positive rates may emerge even when relaxing 469 the strict near-perfect assumption of no rate variation-as long as ξ 1. Roch, 2019), implying again that short edges tend to degrade accuracy when accuracy is 503 defined in terms of total agreement between T andT , in contrast to our findings here. A cryptic factor affecting the false positive rate is tree shape. Highly asymmetric 505 trees have better expected false positive rates than highly symmetric trees, because 506 expected path lengths are longer and it is harder to induce false positive splits by chance 507 (Fig. 1) . Thus, a random sample of PDA trees will have a better E D [FP Tn ] than more 508 symmetrical YH trees. Differences in tree shape among RNA virus phylogenies have long 509 been noted (Grenfell et al., 2004) , such as the typically more asymmetric influenza trees. theory, both in terms of its absolute value and its shape as a function of tree length. As ξ 527 increases, especially above ξ * , ML consistently has better accuracy than MP, but we 528 conjecture that the false positive rates of MP and ML differ much less as ξ gets very small. Further work is needed to test this conjecture. The connection between the false positive rate as a measure of accuracy and 531 conventional measures of phylogenetic support appears to be sensitive to the choice of 532 support method when ξ << 1 (Fig. 6) draw firm conclusions without a formal analysis of support in the "near-perfect" parameter 539 space, but we do note the variability in support estimates we found and suspect that 540 Bayesian measures may be better reflections of false positive accuracy in practice (Fig. 6) . If individual clade support needs to be invoked in near-perfect viral phylogenies, we 542 recommend Bayesian approaches that do not rely on bootstrap resampling of sparse 543 substitutions. In near-perfect trees, Bayesian approaches can make use of the limited 544 amount of genetic diversity to draw strongly supported inference, as opposed to 545 bootstrapping approaches which require multiple sites supporting a clade before inferring 546 similarly strong support. When phylogenetically informative data are limited, as in 547 near-perfect trees, the consistency of the data supporting a clade appears more relevant 548 than their prevalence. The low false positive rate in near-perfect trees suggests that phylogenies describing (Table 1) for maximum parsimony (MP) inference, estimated by simulation using 809 parameters estimated from the data (Table S1 ). Abbreviations given in Table 1 . Stochastic models and descriptive statistics for phylogenetic trees, from REFERENCES Virus genomes reveal factors 628 that spread and sustained the Ebola epidemic Intra-and 630 interpatient evolution of enterovirus D68 analyzed by whole-genome deep sequencing Bootstrap confidence levels for phylogenetic 633 trees A few logs suffice to 635 build (almost) all trees (I) Confidence limits on phylogenies: An approach using the bootstrap A polynomial-time algorithm for near-perfect 644 phylogeny CoVizu: Rapid analysis and visualization of the 647 global diversity of SARS-CoV-2 genomes Genomic surveillance elucidates Ebola 657 virus origin and transmission during the 2014 outbreak Unifying the epidemiological and evolutionary dynamics of pathogens Public 664 health actions to control new SARS-CoV-2 variants Tracking virus outbreaks in the twenty-first century New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the 670 performance of PhyML 3.0 Algorithms on strings, trees and sequences Twenty years of West Nile virus 675 spread and evolution in the Americas visualized by Nextstrain Nextstrain: real-time tracking of pathogen evolution An empirical test of bootstrapping as a method for 681 assessing confidence in phylogenetic analysis UFBoot2: improving the ultrafast bootstrap approximation ModelFinder: fast model selection for accurate phylogenetic estimates MAFFT multiple sequence alignment software 690 version 7: improvements in performance and usability A signal-to-noise analysis of phylogeny estimation by 692 neighbor-joining: insufficiency of polynomial length sequences A global phylogeny of SARS-CoV-2 sequences from GISAID Renewing Felsenstein's phylogenetic bootstrap in 697 the era of big data Modern phylogenomics: building 699 phylogenetic trees using the multispecies coalescent model 2020. A fast and memory-efficient 702 implementation of the transfer bootstrap IQ-TREE 2: new models and efficient methods for 705 phylogenetic inference in the genomic era Phylogenetic analysis of SARS-CoV-2 data is difficult A. Timen, M. Koopmans, and Dutch-Covid-19 response team CoV-2 whole-genome sequencing and analysis for informed public health 715 decision-making in the Netherlands Assignment of epidemiological lineages in an emerging pandemic using the Pangolin 720 tool Timing the 722 SARS-CoV-2 index case in Hubei province Seq-Gen: an application for the Monte Carlo 724 simulation of dna sequence evolution along phylogenetic trees Pybus. 2020. A dynamic nomenclature proposal for SARS-CoV-2 lineages to 728 assist genomic epidemiology Comparison of phylogenetic trees Hands-on introduction to sequence-length requirements in phylogenetics Pages 47-86 in Bioinformatics and Phylogenetics: Seminal Contributions of Bernard MrBayes 3: Bayesian phylogenetic inference Rouet, F Massive 742 iatrogenic outbreak of Human Immunodeficiency Virus Type 1 in rural Cambodia Divergent maximum-likelihood-branch-support 745 values for polytomies Phylogeny estimation given sequence length 747 heterogeneity ILS-aware 749 analysis of low-homoplasy retroelement insertions: inference of species trees and 750 introgression using quartets Sufficient conditions for two tree reconstruction techniques to succeed on 752 sufficiently long sequences The optimal rate for resolving a near-polytomy in a 754 phylogeny On the distributions of bootstrap support and posterior distributions for a 756 star tree Bootstrap support is not first-order correct Links between maximum likelihood and maximum 761 parsimony under a simple model of site substitution Ultrafast sample placement on existing trees 764 (UShER) enables real-time phylogenetics for the SARS-CoV-2 pandemic Homoplasy: from detecting pattern to 767 determining process and mechanism of evolution Large-scale multiple sequence alignment and phylogeny estimation Pages 85-146 in Models and algorithms for genome evolution The emergence of SARS-CoV-2 in 773 Europe and North America On the best evolutionary rate for phylogenetic analysis