key: cord-0319792-6k0f3orp authors: Wong, Thomas K. F.; Li, Teng; Ranjard, Louis; Wu, Steven; Sukumaran, Jeet; Rodrigo, Allen G. title: An assembly-free method of phylogeny reconstruction using short-read sequences from pooled samples without barcodes date: 2021-04-09 journal: bioRxiv DOI: 10.1101/2021.04.09.439138 sha: 2510085c0188885ce0bc8fc5964e86222c8212c8 doc_id: 319792 cord_uid: 6k0f3orp A current strategy for obtaining haplotype information from several individuals involves short-read sequencing of pooled amplicons, where fragments from each individual is identified by a unique DNA barcode. In this paper, we report a new method to recover the phylogeny of haplotypes from short-read sequences obtained using pooled amplicons from a mixture of individuals, without barcoding. The method, AFPhyloMix, accepts an alignment of the mixture of reads against a reference sequence, obtains the single-nucleotide-polymorphisms (SNP) patterns along the alignment, and constructs the phylogenetic tree according to the SNP patterns. AFPhyloMix adopts a Bayesian model of inference to estimates the phylogeny of the haplotypes and their relative frequencies, given that the number of haplotypes is known. In our simulations, AFPhyloMix achieved at least 80% accuracy at recovering the phylogenies and frequencies of the constituent haplotypes, for mixtures with up to 15 haplotypes. AFPhyloMix also worked well on a real data set of kangaroo mitochondrial DNA sequences. Molecular phylogenetic reconstruction is the mainstay of modern evolutionary 2 biology [1, 2] . To use a particularly relevant and recent example, tracing the spread of 3 the COVID-19 pandemic, and understanding the emergence of new variants, has 4 required the use of reliably constructed phylogenies of SARS-CoV-2 genomes [3] . DNA 5 sequencing is used to produce the data from which such valuable phylogenies can be 6 inferred. However, because modern sequencing technologies can produce several exist as a collection of genetically diverse genomes within an infected individual. To 23 sequence one or more target genes from a collection of these viruses using barcoding, 24 one would need to isolate individual viral genomes before library preparation. This can 25 be done, but again, is time-consuming and laborious. 26 In this paper, we describe a novel approach, AFPhyloMix (Assembly-Free 27 Phylogenetics for Mixtures) to reconstruct the phylogeny of non-barcoded amplicons in 28 a mixture that has been sequenced using short-read sequencing. More precisely, the 29 input sample consists of mixtures of anonymous (i.e., non-barcoded) amplicons of a 30 targeted locus, obtained from multiple individuals, each amplicon longer than the 31 read-length of sequenced fragments. We assume that all short-reads can be aligned to 32 the same reference sequence. We have developed our method to work on samples drawn 33 from a population of closely related individuals (i.e., from individuals within a species). 34 In any mixture of individuals drawn from such populations, some amplicons may be 35 identical to others. We refer to a group of identical amplicons as a haplotype [6] . The 36 mixture, therefore, contains a collection of haplotypes, each haplotype being represented 37 by a relative frequency between 0 and 1 (non-inclusive). AFPhyloMix estimates the 38 phylogeny of the haplotypes and their relative frequencies. To validate our approach, we 39 evaluate the efficiency of the method on simulated and real data, and we discuss the 40 conditions under which the method performs well, and its limitations. The algorithm, AFPhyloMix, proceeds as follows. Given a mixture of n haplotypes, 44 with relative frequencies (f 1 , f 2 , ..., f n ), short-read sequencing generates k sequences 45 that can be aligned to a reference sequence. AFPhyloMix then identifies the potential 46 sites with single-nucleotide-polymorphisms (SNPs) from this alignment of reads. Under 47 an infinite-sites model of evolution [7] , where each mutation occurs at a new site and 48 any given SNP can have a maximum of two nucleotides, we distinguish between the 49 frequency of a given nucleotide at a given SNP, and the number of SNPs with the same 50 frequency distribution of nucleotides. We refer to these two types of frequencies as the 51 SNP ratio and the SNP frequency, respectively. For example, assume that in an 52 alignment with three SNPs, site i has nucleotides A and G with frequencies 0.75, 0. 25, 53 respectively; site j has nucleotides C and T , with frequencies 0.6 and 0.4, respectively; 54 and site k has nucleotides G and T with frequencies 0.75 and 0.25, respectively. We will 55 adopt the convention of using the smaller nucleotide frequency when identifying the 56 value of a SNP ratio. Therefore, the SNP ratio for site i is 0.25. Sites i and k have the 57 same frequency distribution of nucleotides, even though they may have different 58 constituent nucleotides. In this case, the SNP frequency for the nucleotide distribution 59 instantiated in sites i and k is 0.67 or 2/3. (We note that the SNP ratios and 60 frequencies are related to the Site Frequency Spectrum [8] ; however, because coverage of 61 short-reads vary across the alignment, nucleotide frequencies at each SNP vary as a 62 continuous rational variable rather than as an integer). 63 In AFPhyloMix, a likelihood function computes the probability of observing the 64 April 6, 2021 2/22 distributions of SNP ratios (data, D) along the alignment given their expected 65 distributions, which is itself conditional on a specified tree topology, haplotype 66 frequencies, and sequencing error, assuming an infinite-sites model of evolution. A Bayesian approach is used to compute the posterior probability P (F, T, e|D) of a set of 68 parameters: the frequencies of haplotypes (F ), the tree topology (T ), and the 69 sequencing error (e), given the observed pattern of the data (D), as follows. 2B) . Similar to the compatibility problem of 103 two sets of binary characters [9] , since the infinite site model only allows one mutation 104 along the tree for every SNP site, two SNP sites can create either two or three, but not 105 four patterns of nucleotides. Moreover, how often these patterns of nucleotides happen 106 depends on the product of the lengths of the edges on which the two SNP sites occur. Considering the possible three patterns of nucleotides (with frequencies f 1 , f 2 , Estimation of tree topology and tip frequencies Assume that there are n haplotypes. If there is no sequencing error, there should be 128 only two types of nucleotides on each SNP site; for convenience, we will refer to the two 129 allowable states at a given SNP location canonically as '0' and '1'. Considering two sites 130 i and j, let s(ij) be the nucleotides of the same read covering the sites i and j. Also let 131 the states of the root of the tree be r i and r j , where r i , r j ∈ {0, 1}, on the site i and the 132 site j, respectively. Given a n-tip rooted tree topology T , a set of n tip frequencies F , 133 and the edges of T : {e 1 , · · · , e 2n−2 }, letP (s(ij) = pq|u, v, r i , r j ), where p, q ∈ {0, 1} 134 and u, v ∈ {ε, e 1 , · · · , e 2n−2 } (the empty string ε represents no mutation on the site), be 135 the expected probability of the same sequence having nucleotide p on the site i and , or three. However, in a 144 data set with sequencing error, the number of combinations observed may well be more 145 (up to a maximum of 16). We will compute the expected probabilities taking account of 146 sequencing errors. With the sequencing errors, each SNP site may contain 4 nucleotide 147 types, say 0, 1, 2, and 3. Without loss of generality, we assume 0 and 1 are the major respectively. We assume that the ratios of the reads having different patterns of 165 nucleotides for the sites i and j follow the multinomial distribution. Practically, when performing an analysis on the alignment of the reads, for each site 167 of the alignment, AFPhyloMix assigns the nucleotide supported by the largest number 168 of reads to 0, the second largest to 1, the third largest to 2, and the one with the least 169 supports to 3. There are three reasons to observe two or more nucleotides at a site: • A site truly has a single mutational event only in its evolutionary history, 171 sequencing errors and other technical artifacts can introduce more than two 172 nucleotides in the alignment of short-reads; • A site is truly invariable over the evolutionary tree, but sequencing errors/artifacts 174 introduce more than a single observed nucleotide in the alignment at that site; or 175 • A site truly has experienced multiple mutational events in its events in its 176 evolutionary history (and thus, violates the assumption of an infinite sites model). 177 We deal with the second and third of these cases below, but if a site truly has only a 178 single mutational event in its history, then nucleotides 0 and 1 should dominate, while 179 nucleotides 2 and 3 will be due to sequencing errors. Let the n SNP sites be {S 1 , S 2 , S 3 , S 4 , S 5 , S 6 , · · · , S n }. One approach is to consider 181 the patterns observed with pairs of adjacent SNP sites [i.e., if n is even, then consider ]. This approach allows, at most, only n/2 pairs of SNP 183 sites to be considered. On the other hand, if we nominate a reference site, and pair each 184 non-reference site with the reference, we can use ≈ n pairs of SNP sites. We have used 185 this approach, as follows: the whole alignment is partitioned into m non-overlapping 186 windows (W 1 , W 2 , · · · , W k , · · · , W m ) of size d (d was set to 100 in our implementation). 187 In each window W k a reference position c k ∈ W k is selected. Let the average of the read 188 coverage along the alignment be cov avg . The reference site is selected arbitrarily among 189 those sites covered by at least max{50, r * cov avg } reads (where r was set to 0.2). Thus, 190 the selected reference sites will have reasonably high levels of support. For every site i 191 inside the window W k , its association with the reference position c k is considered. This 192 approach allows us to consider n − m pairs of SNP sites. Note that if such reference The likelihood of the whole alignment (A) given the tree topology (T ), the tip 197 frequencies (F ) and the sequencing error rate (e) is: The patterns obtained from the pair of site i and the reference c k depends on the 199 tree topology, the tip frequencies, the sequencing error rate, the root states, and the conditioning on the tree topology, the tip frequencies, the sequencing error rate, the root 203 state and the edge on which the mutation occurs for the reference c k . Therefore, the likelihood of the window (W k ) given the tree topology (T ), the tip 205 frequencies (F ), the sequencing error (e), the root state (r c k ) and the edge on which the 206 mutation occurs (v) for the reference c k is: Using this approach, the likelihood essentially integrates across all observed patterns, 208 thus negating the need to identify observed nucleotide ratios that are the consequence of 209 sequencing errors. During the experiments, the following modified formula was found to have a better 211 accuracy on the estimation of the tree topology and the tip frequencies: where P r(r i ) = the observed frequency of the nucleotide r i in A Define P (T, F, e|A) as the posterior probability of the tree topology T , the tip frequencies F , and the sequencing error e given the alignment A. where P (T ), P (F ), P (e) are the prior probabilities of T, F, e, respectively Identifying invariable sites 213 As noted above, a truly invariable site may appear to be a SNP because of sequencing 214 errors. To speed up the computation, AFPhyloMix skips all the sites which are 215 identified as invariable sites. Let the maximum value of the sequencing error rate be 216 e max (in our application of AFPhyloMix, e max is set to 0.01). A site is regarded as an 217 invariable site if there exists only one nucleotide such that the percentage of the reads 218 having that nucleotide covering the site is higher than e max . Let I be the set of these 219 sites which are identified as invariable sites; then AFPhyloMix adopts a Bayesian model of inference to obtain an estimate of the joint 222 posterior probability of phylogenies, haplotype frequencies, and sequencing error, using 223 Markov chain Monte Carlo (MCMC) sampling [10, 11] . The MCMC approach has been 224 extensively used in phylogenetic analysis [12, 13] , but sampling chains may not mix as 225 well as they should. To overcome this, the Metropolis-coupled Markov chain Monte 226 Carlo (MCMCMC) approach was developed by [14] . Although MCMCMC requires 227 multiple parallel sampling chains to be run simultaneously (and thus, has demanding 228 computational overheads), the approach has been demonstrated to improve mixing and 229 convergence to a stationary posterior probability distribution [15] . We implemented 230 MCMCMC in AFPhyloMix, which reports a the tree topology and the tip frequencies calculates the edge lengths in T (with edges e 1 , · · · , e 2n−2 ) by the following method. Let length(u) be the length of edge e u . In AFPhyloMix, length(u) is approximated 238 as the probability of having mutation on edge u along the tree. Note that length(u = 0) 239 is the probability of no mutation along the tree (i.e. 0≤u≤2n−2 length(u) = 1). 240 length(u) = P r(edge u) where P r(edge u at site i) is the probability of a mutation on edge u of the tree at site i. 241 As described above, the whole genome is partitioned into non-overlapping windows 242 (W 1 , W 2 , · · · , W k , · · · ) of size d (d was set to 100), and inside each window W k a 243 reference site c k ∈ W k is selected. For every site i (say it is inside the window W k ), we 244 consider the connection between the site i and the reference position c k . For i = c k , let 245 n ic k (p, q), where p, q ∈ {0, 1, 2, 3}, be the number of reads observed having nucleotide p 246 at site i and nucleotide q at site c k on the same read. Define A ic k = {n ic k (p, q)|p, q ∈ For case 1, P r(edge u at site i) For case 2, P r(r i ) and P r(r c k ) are the probabilities of the root states r i and r c k , which are set 255 to the observed frequencies of r i and r c k in A, respectively. L(A ic k |u, v, r i , r c k ) is the using the similar approach. Whereas it is theoretically possible to simultaneously infer 262 edge lengths and topologies, we have found that this does not deliver accurate results. 263 This is because, in the absence of sequence information at the tips, mutations will 264 naturally favour long branches, thus lowering the probability of seeing short branches. 265 But estimating edge lengths after the topology has been estimated, we overcome this 266 bias. An alternative is to define a suitable prior (e.g., a coalescent prior [16] ) that will 267 override the tendency of the likelihood to favour long branches. Removal of the sites that violate the infinite sites model 269 AFPhyloMix uses an infinite site model for modelling the evolution of the genomic 270 sequences between the haplotypes, and thus assumes that every site has no or only one 271 mutation in its evolutionary history. Before AFPhyloMix proceeds to estimate the 272 topology and the tip frequencies, AFPhyloMix examines the read alignments and filters 273 out the sites likely having more than one mutation. The procedure to identify these 274 sites is as follows: 275 1. If the SNP site has more than two different characters supported by at least e max 276 (i.e. 1%) of the reads, then the SNP site is ignored. Both simulated and real data were used to evaluate AFPhyloMix. Simulated data Six hundred data sets were simulated and each data set was a mixture of various 293 numbers (n) of haplotypes, where n ∈ {5, 7, 9, 11, 13, 15} (100 data sets each). In each 294 data set, the n haplotypes were mixed in different random concentrations (with the 295 difference between any two haplotypes ≥ 0.0033). Paired-end reads of length 150bp 296 with total coverage of 15,000x were simulated by ART [17] with the Illumina HiSeq 2500 297 error model -HS25, from n 50k-long haplotype sequences, which were generated by 298 INDELible [18] using JC model [19] from a n-tip tree with 0.03 root-to-tip distance 299 randomly created by Evolver [20] from PAML package [21] . The root sequence reported by INDELible [18] was used as the reference sequence. After using BWA [22] to align the reads against the reference sequence, we ran Figure 5B and Figure 5C show the summary on the distributions of the The pooled 95% highest posterior density of the MCMCMC estimate of error rates 337 on these simulated data sets was between 0.00199 and 0.00204 with an average 0.00202. 338 ART [17] reports the errors for the reads simulated by the error model HS25; in our simulations, we obtained a simulated error rate of 0.00190. AFPhyloMix relies on the 340 alignment of reads against a root sequence to obtain the marginal posterior probability 341 distribution of error rates. We expect, therefore, that the slightly higher value 342 (≈ +0.00012) of the MCMCMC estimate of error rates on these simulated data sets, 343 compared with the simulated error rate reported by ART, is likely due to imperfect 344 alignment between simulated reads and the root sequences. highest quota in New South Wales, Australia [23] . Approximately 30 mg of meat was excised and homogenized for each individual. Genomic DNA was then extracted using the DNeasy Blood & Tissue Kits (Qiagen) 361 following the manufacturer's protocol. Amplification and sequencing 363 Long PCR amplification of complete kangaroo mitochondrial genomes was carried out 364 using the pair of primers (Lt12cons: 5'-GGGATTAGATACCCCACTAT -3', HtPhe: 365 5'-CCATCTAAGCATTTTCAGT -3'), which was selected from a previous study [24] . PCR reactions was performed using Takara PrimeSTAR GXL DNA Polymerase under 367 the following conditions: 1 min initial denaturation at 95 • C, followed by 30 cycles of 10 368 s at 98 • C, 15 s at 55 • C, and 15 min at 68 • C. The PCR products were electrophoresed 369 in 1% agarose gel, purified the fragments, and then randomly fragmented to 650 bp by 370 sonication (Covaris S220). Library preparation and sequencing were performed by GENEWIZ. Amplified 372 fragments of all 10 individuals were sequenced under the same run. In order to obtain a 373 highly reliable phylogenetic tree of these sub-samples for evaluating our method, each 374 individual was barcoded with unique indices before multiplexing and sequencing, so that 375 each short read sequence could be identified to the corresponding sub-sample. The 376 relative concentrations of the sub-samples are listed in the Table 1 (2 nd column). Sequencing was performed on an Illumina MiSeq machine with paired-end read length 378 of 2 x 300 bp. The relative concentration between different kangaroo sub-samples. To start with, a reliable phylogenetic tree between the haplotypes was constructed, so 381 that this gold-standard result could be used to evaluate our method. First, all the short 382 read sequences were demultiplexed into sub-samples according to the barcodes 383 appended on the sequences. Then de-novo assembly was performed on the short reads 384 for each sub-sample separately by SOAPdenovo-Trans-127mer from SOAPdenovo-Trans 385 April 6, 2021 14/22 package [25] with parameter: kmer-size=91. After the mitochondria DNA sequences of 386 the 10 haplotypes were constructed, a multiple sequences alignment was computed by 387 MAFFT [4] with G-INS-i strategy in Geneious 11.1.5 [26] . The phylogenetic analysis 388 was conducted using IQ-TREE [27] with the evolutionary model HKY+F+I, and the 389 ML phylogenetic tree (shown in Fig 6B) was used as the reference tree to evaluate our 390 method. The short read sequences from 10 haplotypes were then mixed together, and the 392 barcode of each haplotype was removed. Then Trimmomatic [28] was run on the obtained. The tip labels inside the brackets were added manually after comparing with 404 the tree reported by IQ-Tree (in Fig 6B) AFPhyloMix and IQ-TREE analyses resulted in the same topology associated with tip 416 frequencies well matched the concentrations of those 9 haplotypes (in Fig 7) . The 95% highest posterior density of the MCMCMC estimate of error rates on the 418 real data sets was between 0.000722 and 0.000735 with an average 0.000729. In order to 419 compute the underlying actual error rate on the real data set, reads which had been 420 processed by Trimmomatic [28] , were compared with the corresponding assembled 421 haplotype and the error rate of 0.000713 was obtained, after discarding the read bases 422 with base quality scores lower than 25 (the same criteria AFPhyloMix used to filter out 423 the low-quality read bases). Again, the slightly higher value (≈ +0.000016) of the 424 MCMCMC estimate of error rates on the real data sets compared with the obtained 425 error rate is consistent with what we observed with simulated data, and is likely due to 426 the imperfect alignment between the reads and the reference sequence. This research demonstrates the feasiblity of reconstructing a phylogenetic tree directly 429 from the short read sequences obtained from a mixture of closely related amplified 430 sequences, without barcoding. The results indicate that our methods work well on the 431 simulated data set for a mixture of reads generated from up to 15 haplotypes and on a 432 real data set of a mixture with 10 haplotypes. Perhaps unsurprisingly, AFPhyloMix worked better in the simulated data sets than 434 in the real data sets when it came to estimating haplotype concentrations. The [27] . Note that the tip labels inside the brackets in (A) were added manually after comparing with the tree reported by IQ-Tree, indicating the corresponding sample that each tip should be assigned to according to the topology and the tip frequencies. root-mean-square difference between the estimated sub-samples concentrations and the 436 expected concentrations in the real data sets was 0.0059 (from Fig 7) , which was larger 437 than those in the simulated data sets (i.e. all were below 0.0027 in Fig 5B) . Of course, 438 the expected haplotype concentrations may have differed from the true concentrations 439 in the mixture: the physical act of mixing small volumes could have led to differences in 440 the relative concentrations of haplotypes, and this may have contributed to a 441 higher-than-expected root-mean-square. Another factor affecting the performance of the method is the varying coverage of 443 short read sequences along the genome. Fig 8A shows the actual distribution of the read 444 alignments of 10 haplotypes along the genome. We expected read coverages to vary 445 along the genome randomly, without any association to haplotypes. Surprisingly, we 446 found that, when all the haplotypes were sequenced under the same run (i.e. all 447 haplotypes were pooled into the same library before sequencing), the read coverages of 448 the haplotypes had similar trends: all had relatively high (or low) read coverages at the 449 same regions of the genome. As shown in Fig 8B, the similar trends of the read 450 coverages along the genome between the haplotypes led to a steady distribution of the 451 ratios on the read coverages between the haplotypes along the genome. The consistent 452 read coverage ratios along the genome worked in our method's favor; on the other hand, 453 as shown in Fig 8B, we noticed that few short regions on the genome had sudden 454 changes in the ratios of the read coverage amongst haplotypes. In those regions, some 455 reads were trimmed after the alignment against the reference sequence or could not be 456 aligned to the reference sequence due to the dissimilarity between the read sequences 457 and the reference sequence. Another preprocessing step was therefore developed in Result on a real data set with mixture of 9 sub-samples The haplotype K01 was removed from the mixture and the experiment was repeated. (A) The tree with tip frequencies reported by AFPhyloMix. (B) The tree reported by IQ-TREE [27] . Note that the tip labels inside the brackets in (A) were added manually after comparing with the tree reported by IQ-Tree, indicating the corresponding sample that each tip should be assigned to according to the topology and the tip frequencies. AFPhyloMix, in which read alignments were examined and problematic regions removed 459 if there existed over r (r was set to 2% of the read coverage) reads being trimmed 460 around the same sites on the genome. AFPhyloMix only considers the association between two SNP sites, but it is sensible 462 to consider the association between more SNP sites in order to acquire higher sensitivity 463 of the methods especially when the number of haplotypes increases. The time 464 complexity of the algorithm is O(mn d ) where n is the number of haplotypes, m is the 465 number of potential SNP sites, and d is the number of SNP sites associated to construct 466 patterns of nucleotides. In our current implementation of AFPhyloMix, d = 2. The 467 running time of the algorithm will increase exponentially as d increases. It will be a 468 challenge to come up with a faster algorithm and consider the association between more 469 number of SNP sites, so that the method can work more effectively even for a mixture 470 with a large number of sub-samples. Finally, it is worth noting that we have applied AFPhyloMix to sequences of closely 472 related individuals -in our simulations, we set a root-to-tip distance of 0.03. The 473 assumption of an infinite sites model that is applied in AFPhyloMix is appropriate for 474 closely-related individuals. Amongst other things, the infinite sites model allows us to 475 constrain the number of site patterns we expect to see, and use deviations from these 476 expected patterns to error-correct. To extend our algorithm to more divergent 477 sequences will require a different model of mutation. This remains a work in progress. 478 April 6, 2021 The prior of haplotype frequency is a gamma distribution with rate parameter 0.1 524 and shape parameter 2. The prior of the tree topology is uniform across all the possible 525 topologies, and the prior of the error rate a uniform distribution with maximum value of 526 e max (which is set to 0.01 for the reads produced from Illumina sequencing machines). 527 AFPhyloMix runs 8 Markov chain Monte Carlo processes in parallel: one cold chain 529 and seven hot chains. The i-th hot chain's temperature is set to (1 − i/8). A hot chain 530 is randomly selected and its posterior probability is compared with that of the cold 531 chain for every x iterations (where the value of x equals to the number of possible 532 moves times 5, for example, x = 12 × 5 = 60 for 5 haplotypes; while x = 22 × 5 = 110 533 for 15 haplotypes). If the posterior probability of the hot chain is higher than that of 534 the cold chain, two chains will be swapped. When swapping between two chains, we 535 followed [29] 's implementation that the temperature of the two chains were exchanged 536 instead of the states. Exchanging their temperatures are more efficient than exchanging 537 their states, because the states which include many parameters are usually large in size. 538 Availability of software and materials 539 The software AFPhyloMix and the materials are available in OSF repository: Phylogenetics and speciation Molecular phylogenetics: principles and practice Phylogenetic network analysis of SARS-CoV-2 genomes MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform IQ-TREE: A Fast and Effective Stochastic Algorithm for Estimating Maximum-Likelihood Phylogenies Haplotype-resolved genome sequencing: experimental methods and applications The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations Site frequency spectra from genomic SNP surveys On the compatibility of binary qualitative taxonomic characters Equation of State Calculations by Fast Computing Machines Monte Carlo sampling methods using Markov chains and their applications Bayesian inference of phylogenetic trees BEAST: Bayesian evolutionary analysis by sampling trees Markov chain Monte Carlo maximum likelihood Bayesian Inference of Phylogeny and Its Impact on On the genealogy of large populations ART: a next-generation sequencing read simulator INDELible: A Flexible Simulator of Biological Sequence Evolution CHAPTER 24 -Evolution of Protein Molecules PAML: a program package for phylogenetic analysis by maximum likelihood Phylogenetic Analysis by Maximum Likelihood Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM State of New South Wales and Office of Environment and Heritage Radiation of Extant Marsupials After the K/T Boundary: Evidence from Complete Mitochondrial Genomes SOAPdenovo-Trans: de novo transcriptome assembly with short RNA-Seq reads Geneious Basic: An integrated and extendable desktop software platform for the organization and analysis of sequence data ModelFinder: fast model selection for accurate phylogenetic estimates Trimmomatic: a flexible trimmer for Illumina sequence data Adaptive parallel tempering for BEAST 2 The authors thank David Bryant for constructive comments on the manuscript. This