key: cord-0259316-g62fwh9x
authors: Baumdicker, Franz; Bisschop, Gertjan; Goldstein, Daniel; Gower, Graham; Ragsdale, Aaron P.; Tsambos, Georgia; Zhu, Sha; Eldon, Bjarki; Ellerman, Castedo E.; Galloway, Jared G.; Gladstein, Ariella L.; Gorjanc, Gregor; Guo, Bing; Jeffery, Ben; Kretzschmar, Warren W.; Lohse, Konrad; Matschiner, Michael; Nelson, Dominic; Pope, Nathaniel S.; Quinto-Cortés, Consuelo D.; Rodrigues, Murillo F.; Saunack, Kumar; Sellinger, Thibaut; Thornton, Kevin; van Kemenade, Hugo; Wohns, Anthony W.; Wong, H. Yan; Gravel, Simon; Kern, Andrew D.; Koskela, Jere; Ralph, Peter L.; Kelleher, Jerome
title: Efficient ancestry and mutation simulation with msprime 1.0
date: 2021-09-21
journal: bioRxiv
DOI: 10.1101/2021.08.31.457499
sha: 9602c987d47cdd917835efd9e1a0e94246158dc9
doc_id: 259316
cord_uid: g62fwh9x

Stochastic simulation is a key tool in population genetics, since the models involved are often analytically intractable and simulation is usually the only way of obtaining ground-truth data to evaluate inferences. Because of this necessity, a large number of specialised simulation programs have been developed, each filling a particular niche, but with largely overlapping functionality and a substantial duplication of effort. Here, we introduce msprime version 1.0, which efficiently implements ancestry and mutation simulations based on the succinct tree sequence data structure and tskit library. We summarise msprime’s many features, and show that its performance is excellent, often many times faster and more memory efficient than specialised alternatives. These high-performance features have been thoroughly tested and validated, and built using a collaborative, open source development model, which reduces duplication of effort and promotes software quality via community engagement.

The coalescent process (Kingman, 1982a,b; Hudson, 1983b; Tajima, 1983 ) models the ancestry of a 63 set of sampled genomes, providing a mathematical description of the genealogical tree that relates 64 the samples to one another. It has proved to be a powerful model, and is now central to population 65 genetics (Hudson, 1990; Hein et al., 2004; Wakeley, 2008) . The coalescent is an efficient framework 66 for population genetic simulation, because it allows us to simulate the genetic ancestry for a sample 67 from an idealised population model, without explicitly representing the population in memory or 68 stepping through the generations. Indeed, Hudson (1983b) independently derived the coalescent 69 in order to efficiently simulate data, and used these simulations to characterise an analytically 70 intractable distribution. This inherent efficiency, and the great utility of simulations for a wide 71 range of purposes, has led to dozens of different tools being developed over the decades (Carvajal Two technological developments of recent years, however, pose major challenges to most ex-75 isting simulation methods. Firstly, fourth-generation sequencing technologies have made complete 76 chromosome-level assemblies possible (Miga et al., 2020) , and high quality assemblies are now 77 available for many species. Thus, modelling genetic variation data as a series of unlinked non-78 recombining loci is no longer a reasonable approximation, and we must fully account for recombi-79 nation. However, while a genealogical tree relating n samples in the single-locus coalescent can be 80 simulated in O(n) time (Hudson, 1990) , the coalescent with recombination is far more complex, and 81 programs such as Hudson's classical ms (Hudson, 2002) can only simulate short segments under the 82 influence of recombination. The second challenge facing simulation methods is that sample sizes in 83 genetic studies have grown very quickly in recent years, enabled by the precipitous fall in genome 84 sequencing costs. Human datasets like the UK Biobank (Bycroft et al., 2018) and gnomAD (Kar-85 czewski et al., 2020) now consist of hundreds of thousands of genomes and many other datasets on 86 a similar scale are becoming available (Tanjo et al., 2021) . Classical simulators such as ms and even 87 fast approximate methods such as scrm (Staab et al., 2015) simply cannot cope with such a large 88 number of samples. 89 The msprime simulator (Kelleher et al., 2016; Kelleher and Lohse, 2020) has greatly increased 90 the scope of coalescent simulations, and it is now straightforward to simulate millions of whole 91 chromosomes for a wide range of organisms. The "succinct tree sequence" data structure (Kelleher 92 (B) This ancestry is provided as the input to sim_mutations, which adds mutations. Graphics produced using tskit's draw_svg method.

but lacks flexibility, for example, when running simulations with parameters drawn from a distri-116 bution. In practice, for any reproducible simulation project users will write a script to generate 117 the required command lines or input parameter files, invoke the simulation engine, and process the 118 results in some way. This process is cumbersome and labour intensive, and a number of packages 119 have been developed to allow simulations to be run directly in a high-level scripting language (Staab 120 and Metzler, 2016; Parobek et al., 2017; Gladstein et al., 2018) . 121 The more recent trend has been to move away from this file and command-line driven approach 122 and to instead provide direct interfaces to the simulation engines via an Application Programming 123 Interface (API) (e.g. Thornton 

A major change for the msprime 1.0 release is the introduction of a new set of APIs, designed in 135 part to avoid sources of error (see the Demography section) but also to provide more appropriate 136 defaults while keeping compatibility with existing code. In the new APIs, ancestry and mutation 137 simulation are fully separated (see Fig. 1 

0 1 2 3 4 5 6 7 8 9 3 C T G G T G C G C C 2 Figure 2 : An example tree sequence describing genealogies and sequence variation for four samples at ten sites on a chromosome of twenty bases long. Information is stored in a set of tables (the tables shown here include only essential columns, and much more information can be associated with the various entities). The node table stores information about sampled and ancestral genomes. The edge table describes how these genomes are related along a chromosome, and defines the genealogical tree at each position. The site and mutation tables together describe sequence variation among the samples.

replacing the legacy simulate function. Among other changes, the new APIs default to discrete 139 genome coordinates and finite sites mutations, making the default settings more realistic and resolv-140 ing a major source of confusion and error. The previous APIs are fully supported and tested, and 141 will be maintained for the foreseeable future. The msp program has been extended to include new 142 commands for simulating ancestry and mutations separately. A particularly useful feature is the 143 ability to specify demographic models in Demes format (Gower et al., 2021) from the command line, 144 making simulation of complex demographies straightforward. We also provide an ms compatible 145 command line interface to support existing workflows.

Tree sequences 147 One of the key reasons for msprime's substantial performance advantage over other simulators (Kelle-148 her et al., 2016) is its use of the "succinct tree sequence" data structure to represent simulation 149 results. The succinct tree sequence (usually abbreviated to "tree sequence") was introduced by 150 Kelleher et al. (2016) to concisely encode genetic ancestry and sequence variation and was originally 151 implemented as part of msprime. We subsequently extracted the core tree sequence functionality 152 from msprime to create the tskit library, which provides a large suite of tools for processing genetic 153 ancestry and variation data via APIs in the Python and C languages (Tskit developers, 2021 , 2021) to take advantage of the same efficient data structures used in msprime, and we hope 157 that many more will follow. While a full discussion of tree sequences and the capabilities of tskit 158 is beyond the scope of this article, we summarise some aspects that are important for simulation. 159 Let us define a genome as the complete set of genetic material that a child inherits from one 160 parent. Thus, a diploid individual has two (monoploid) genomes, one inherited from each par-161 ent. Since each diploid individual lies at the end of two distinct lineages of descent, they will be 162 represented by two places (nodes) in any genealogical tree. In the tree sequence encoding a node 163 therefore corresponds to a single genome, which is associated with its creation time (and other op-164 tional information), and recorded in a simple tabular format (Fig. 2) . Genetic inheritance between 165 genomes (nodes) is defined by edges. An edge consists of a parent node, a child node and the left 166 and right coordinates of the contiguous chromosomal segment over which the child genome inher-167 ited genetic material from the parent genome. Parent and child nodes may correspond to ancestor 168 and descendant genomes separated by many generations. Critically, edges can span multiple trees 169 along the genome (usually referred to as "marginal" trees), and identical node IDs across different 170 trees corresponds to the same ancestral genome. For example, in Fig. 2 the branch from node 171 0 to 4 is present in both marginal trees, and represented by a single edge (the first row in the 172 edge table). This simple device, of explicitly associating tree nodes with specific ancestral genomes 173 and recording the contiguous segments over which parent-child relationships exist, generalises the 174 original "coalescence records" concept (Kelleher et al., 2016) , and is the key to the efficiency of tree 175 sequences (Kelleher et al., , 2019 Ralph et al., 2020) . See the Ancestral Recombination Graphs 176 section below for a discussion of this closely related concept.

The final output of most population genetic simulations is some representation of sequence 178 variation among the specified samples. For coalescent simulations, we usually have three steps: 179 (1) simulate the genetic ancestry, and optionally output the resulting marginal trees; (2) simu-180 late sequence evolution conditioned on this ancestry by generating mutations (see the Simulating 181 mutations section); and (3) output the resulting nucleotide sequences by percolating the effects of 182 the mutations through the trees. Information about the mutations themselves-e.g., where they 183 have occurred on the trees-is usually not retained or made available for subsequent analysis. In 184 msprime, however, we skip step (3), instead using tskit's combined data model of ancestry and 185 mutations to represent the simulated sequences. As illustrated in Fig. 2 , mutations are a fully 186 integrated part of tskit's tree sequence data model, and genetic variation is encoded by recording 187 sites at which mutations have occurred, and where each mutation at those sites has occurred on the 188 marginal tree. Crucially, the genome sequences themselves are never stored, or indeed directly rep-189 resented in memory (although tskit can output the variant matrix in various formats, if required). 190 It may at first seem inconvenient to have only this indirect representation of the genome sequences, 191 but it is extremely powerful. Firstly, the storage space required for simulations is dramatically 192 reduced. For a simulation of n samples with m variant sites, we would require O(nm) space to 193 store the sequence data as a variant matrix. However, if this simulation was of a recombining 194 genome with t trees, then the tskit tree sequence encoding requires O(n + t + m) space, assuming 195 we have O(1) mutations at each site (Kelleher et al., 2016) . For large sample sizes, this difference 196 is profound, making it conceivable, for example, to store the genetic ancestry and variation data 197 for the entire human population on a laptop (Kelleher et al., 2019) . As well as the huge difference 198 in storage efficiency, it is often far more efficient to compute statistics of the sequence data from 199 the trees and mutations than it is to work with the sequences themselves. For example, computing 200 Tajima's D from simulated data stored in the tskit format is several orders of magnitude faster 201 than efficient variant matrix libraries for large sample sizes (Ralph et al., 2020) .

The vast genomic datasets produced during the SARS-CoV-2 pandemic have highlighted the ad-203 vantages of storing genetic variation data using the underlying trees. propose 204 the Mutation Annotated Tree (MAT) format (consisting of a Newick tree and associated mutations 205 in a binary format) and the matUtils program as an efficient way to store and process large viral 206 datasets (McBroome et al., 2021), achieving excellent compression and processing performance. 207 Similarly, phastsim (De Maio et al., 2021) was developed to simulate sequence evolution on such 208 large SARS-CoV-2 phylogenies, and also outputs a Newick tree annotated with mutations (not in 209 MAT format) to avoid the bottleneck of generating and storing the simulated sequences. While 210 these methods illustrate the advantages of the general approach of storing ancestry and mutations 211 rather than sequences, they do not generalise beyond their immediate settings, and no software 212 library support is available. The standard way of representing simulation data is to render the results in a text format, which 224 must subsequently be parsed and processed as part of some analysis pipeline. For example, ms 225 outputs a set of sequences and can also optionally output the marginal trees along the genome in 226 Newick format, and variants of this approach are used by many simulators. Text files have many 227 advantages, but are slow to process at scale. The ability to efficiently process simulation results 228 is particularly important in simulation-based inference methods such as Approximate Bayesian 229 of weakly informative statistics, making the need to efficiently compute statistics from simulation 243 results all the more acute. By using the stable APIs and efficient data interchange mechanisms 244 provided by tskit, the results of an msprime simulation can be immediately processed, without 245 format conversion overhead. The tskit library has a rich suite of population genetic statistics 246 and other utilities, and is in many cases orders of magnitude faster than matrix-based methods for 247 large sample sizes (Ralph et al., 2020) . Thus, the combination of msprime and tskit substantially 248 increases the overall efficiency of many simulation analysis pipelines.

Classical text based output formats like ms are inefficient to process, but also lack a great deal of 250 important information about the simulated process. The tree-by-tree topology information output 251 by simulators in Newick format lacks any concept of node identity, and means that we cannot 252 reliably infer information about ancestors from the output. Because Newick stores branch lengths 253 rather than node times, numerical precision issues also arise for large trees (McGill et al., 2013) . 254 Numerous forks of simulators have been created to access information not provided in the output. 255 For example, ms has been forked to output information about migrating segments (Rosenzweig et al., 256 2016), ancestral lineages (Chen and Chen, 2013), and ms's fork msHOT (Hellenthal and Stephens, 257 2007) has in turn been forked to output information on local ancestry (Racimo et al., 2017) . All 258 of this information is either directly available by default in msprime, or can be optionally stored 259 via options such as record_migrations or record_full_arg (see the Ancestral Recombination 260 Graphs section) and can be efficiently and conveniently processed via tskit APIs.

Simulating mutations 262 Because coalescent simulations are usually concerned with neutral evolution (see the Selective 263 sweeps section, however) the problem of generating synthetic genetic variation can be decomposed 264 into two independent steps: firstly, simulating genetic ancestry (the trees), then subsequently sim-265 ulating variation by superimposing mutation processes on those trees (see Fig. 1 ). A number of 266 programs exist to place mutations on trees: for instance, the classical Seq-Gen program (Rambaut 267 and Grassly, 1997) supports a range of different models of sequence evolution, and various exten- Part of the reason for this poor record of software reuse and modularity is the lack of standardised 277 file formats, and in particular, the absence of common library infrastructure to abstract the details 278 of interchanging simulation data. Although msprime also supports simulating both ancestry and 279 mutations, the two aspects are functionally independent within the software; both ancestry and 280 mutation simulators are present in msprime for reasons of convenience and history, and could be split 281 into separate packages. The efficient C and Python interfaces for tskit make it straightforward to 282 add further information to an existing file, and because of its efficient data interchange mechanisms, 283 there is no performance penalty for additional operations in a different software package. Thanks 284 to this interoperability, msprime's mutation generator can work with any tskit tree sequence, be 285 it simulated using SLiM : Time required to run sim_mutations on tree sequences generated by sim_ancestry (with a population size of 10 4 and recombination rate of 10 −8 ) for varying (haploid) sample size and sequence length. We ran 10 replicate mutation simulations each for three different mutation rates, and report the average CPU time required (Intel Core i7-9700). (A) Holding sequence length fixed at 10 megabases and varying the number of samples (tree tips) from 10 to 100,000. (B) Holding number of samples fixed at 1000, and varying the sequence length from 1 to 100 megabases.

As well as providing a new API that emphasises the logical split between ancestry and mutation 290 simulation, we have greatly extended the sophistication of msprime's mutation generation engine 291 for version 1.0, achieving near feature-parity with Seq-Gen. We support a large number of muta-292 tion models, including the JC69 (Jukes et al., 1969), F84 (Felsenstein and Churchill, 1996) , and 293 GTR (Tavaré et al., 1986 ) nucleotide models and the BLOSUM62 (Henikoff and Henikoff, 1992) and 294 PAM (Dayhoff et al., 1978) amino acid models. Other models, such as the Kimura two and three 295 parameter models (Kimura, 1980 (Kimura, , 1981 , can be defined easily and efficiently in user code by spec-296 ifying a transition matrix between any number of alleles (which can be arbitrary unicode strings). 297 Mutation rates can vary along the genome, and multiple mutation models can be imposed on a 298 tree sequence by overlaying mutations in multiple passes. We have extensively validated the results 299 of mutation simulations against both theoretical expectations and output from Seq-Gen (Rambaut 300 and Grassly, 1997) and Pyvolve (Spielman and Wilke, 2015).

Simulating mutations in msprime is efficient. Fig. 3 shows the time required to generate muta-302 tions (using the default JC69 model) on simulated tree sequences for a variety of mutation rates 303 as we vary the number of samples (Fig. 3A ) and the sequence length (Fig. 3B ). For example, the 304 longest running simulation in Fig. 3B required less than 2 seconds to generate an average of 1.5 305 million mutations over 137,081 trees in a tree sequence with 508,125 edges. This efficiency for large 306 numbers of trees is possible because the tree sequence encoding allows us to generate mutations on 307 an edge-by-edge basis (see Fig. 2 and the Mutation generation appendix), rather than tree-by-tree 308 and branch-by-branch as would otherwise be required. In the above example from Fig. 3B , if we 309 generated mutations tree-by-tree, we would have to iterate over 273,887,838 branches (since there 310 are 137,081 trees and 1,998 branches in each tree) rather than 508,125 edges, resulting in ∼500 times 311 more work. Even if we have a tree sequence consisting of a single tree (negating the advantage of 312 working edge-by-edge), msprime's mutation generator is still very efficient. For example, we simu-313 lated mutations under the BLOSUM62 amino acid model for a tree with 10 6 leaves over 10 4 sites 314 (resulting in ∼260,000 mutations) in about 0.8 seconds, including the time required for file input 315 and output. We do not attempt a systematic benchmarking of msprime's mutation generation code 316 against other methods, because at this scale it is difficult to disentangle the effects of inefficient 317 input and output formats from the mutation generation algorithms. Given these timings, it seems 318 unlikely that generating mutations with msprime would be a bottleneck in any realistic analysis.

There are many ways in which the mutation generation code in msprime could be extended. For 320 example, we intend to add support for microsatellites (Mailund et al., 2005) , codon models (Arenas 321 and Posada, 2007) and indels (Cartwright, 2005; Fletcher and Yang, 2009 ), although changes may 322 be required to tskit's data model which is currently based on the assumption of independent sites. 323

Crossover recombination is implemented in msprime using Hudson's algorithm, which works back-325 wards in time, generating common ancestor and recombination events and tracking their effects 326 on segments of ancestral material inherited from the sample (Hudson, 1983a, 1990; Kelleher et al., 327 2016). Common ancestor events merge the ancestral material of two lineages, and result in coa-328 lescences in the marginal trees when ancestral segments overlap. Recombination events split the 329 ancestral material for some lineage at a breakpoint, creating two independent lineages. Using the 330 appropriate data structures (Kelleher et al., 2016) , this process is much more efficient to simulate 331 than the equivalent left-to-right approach (Wiuf and Hein, 1999b,a). In msprime 1.0, recombi-332 nation rates can vary along a chromosome, allowing us to simulate recombination hotspots and 333 patterns of recombination from empirical maps. The implementation of recombination in msprime 334 is extensively validated against analytical results (Hudson, 1983a; Kaplan and Hudson, 1985) and 335 simulations by ms, msHOT and SLiM. . The SMC and SMC' models are supported in msprime 1.0. 344 However, they are currently implemented using a naive rejection sampling approach, and are some-345 what slower to simulate than the exact coalescent with recombination. These models are therefore 346 currently only appropriate for studying the SMC approximations themselves, although we intend 347 to implement them more efficiently in future versions. the population scaled recombination rate ρ = 4N e L, where L is the length of the genome in units 353 of recombination distance. This is because Hudson's algorithm tracks recombinations not only 354 in segments ancestral to the sample, but also between ancestral segments. As mentioned above, 355 common ancestor events in which the ancestral material of two lineages is merged only result in 356 coalescences in the marginal trees if their ancestral segments overlap. If there is no overlap, the 357 merged segments represent an ancestral chromosome that is a genetic ancestor of the two lineages, 358 but not the most recent common genetic ancestor at any location along the genome. When this 359 happens, the merged lineage carries "trapped" genetic material that is not ancestral to any samples, 360 but where recombinations can still occur (Wiuf and Hein, 1999b) . The SMC approximations work 361 by disallowing common ancestor events that generate trapped material, greatly simplifying the 362 process. However, this also removes subtle long-range correlations in the trees since there are many 363 ways in which ancestry segments can merge without overlapping.

For large ρ, recombination events in trapped ancestral material will dominate, and so we can use 365 this as a proxy for the overall number of events in Hudson's algorithm. Hein et al. (2004, Eq. 5 .10) 366 gave 367 ρ(ρ + 1)

as an upper bound on the number of recombination events within trapped ancestral material in 368 Hudson's algorithm for n samples. Fig. 4 shows the observed run time for simulations with a variety 369 of population size, chromosome length and sample sizes, and demonstrates that Eq. (1) correctly 370 predicts the quadratic dependence on ρ, as previously conjectured (Kelleher et al., 2016, Fig. 2) . 371 We also see that the dependence on n is quite weak, since increasing sample size 100-fold only 372 increases run time by a factor of 2 or so. However, the log 2 n factor implied by Eq. (1) (the sum 373 is a harmonic number and can be approximated by log n) is not well supported by observed run 374 times (or numbers of events) except possibly at very large values of ρ. It therefore appears that a 375 different dependence on n is required to accurately predict simulation time for a given ρ and n. 376 Fig. 4 is a useful yardstick, allowing us to predict how long simulations should take for a wide 377 range of species. Taking a typical chromosome to be 1 Morgan in length, these plots show, roughly, 378 that simulating chromosome-length samples from a population of thousands of individuals takes 379 seconds, while samples from a population of tens of thousands take minutes. Simulating whole 380 chromosomes for many species is very fast, with 1000 samples of chromosome 1 for Arabidopsis 381 thaliana taking less than a second, and a few minutes for dogs and humans. However, the depen-382 dence on ρ is quadratic, and if ρ is sufficiently large simulations may not be feasible. For example, 383 the Drosophila melanogaster chromosome 2L is about 23.5Mb long with an average recombination 384 rate of around 2.4 × 10 −8 , so L ≈ 0.57, and with N e = 1.7 × 10 6 (Li and Stephan, 2006) , N e L ≈ 10 6 , 385 so extrapolating the curve in Fig. 4B predicts that simulation would require around 177 hours for 386 1000 samples. For such large values of ρ we recommend users consider approximate simulations. 387 Since msprime does not currently have efficient implementations of approximate coalescent with 388 recombination models, in these cases we recommend using SMC based methods such as scrm, par-389 ticularly if sample sizes are small. In practice, to predict the running time of a given simulation in 390 msprime, we recommend that users measure run time in a series of simulations with short genome 391 lengths and the desired sample size, and then predict run time by fitting a quadratic curve to 392 genome length as in Fig. 4 . It is important to note that the quadratic curves in the two panels 393 of Fig. 4 are different, and predicting the run times of days-long simulations using the timing of 394 seconds-long runs is unlikely to be very accurate.

What about simulations with changing population size? To understand how run time depends 396 on demography it helps to consider why run time is quadratic in ρ. At any point in time, msprime 397 must keep track of some number of lineages, each of which contains some number of chunks of 398 genetic material. Common ancestor events reduce the number of lineages, and recombination events 399 increase their number. However, with long genomes, only a small fraction of the common ancestor 400 events will involve overlapping segments of ancestry and lead to coalescence in the marginal trees. 401 Such disjoint segments are often far apart (on average, about distance L/2), and so recombine apart 402 again immediately; it is these large numbers of rapid and inconsequential events that lead to the 403 quadratic run time. The maximum number of lineages occurs when the increase and decrease in 404 numbers of lineages due to common ancestor and recombination events balance out. To get an 405 idea of run time we can estimate when this balance occurs. Suppose that the maximum number 406 of lineages is M ; at this time the rate of common ancestor events is M (M − 1)/(4N e ) and the 407 total rate of recombination is M , where is the mean length of genome carried by each lineage 408 (including "trapped" non-ancestral material). At the maximum, coalescence and recombination 409 rates are equal, so a typical segment of ancestry will spend roughly half its time in a lineage with 410 at least one other such segment-and, since such lineages carry at least two segments, at most 411 one-third of the lineages carry long trapped segments of ancestry. Since the maximum number of 412 lineages is reached very quickly (Nelson et al., 2020) , this implies that ≈ L/6. Setting the rates 413 of recombination and common ancestor events to be equal and solving for M , we find that M is 414 roughly equal to LN e . The number of lineages then decreases gradually from this maximum on the 415 coalescent time scale, and therefore over roughly 2N e generations. Since the total rate of events 416 when the maximum number of lineages is present is roughly L 2 N e /6, then the total number of 417 events is proportional to (LN e ) 2 -i.e., proportional to ρ 2 .

What does this tell us about run time for simulating time-varying population sizes? The ar-419 gument above implies that the work is spread out relatively evenly on the coalescent time scale. 420 Suppose that population size today is N 1 , while T generations ago it was N 2 . Does the run time 421 depend more on 4N 1 L or 4N 2 L? The answer depends on how T compares to N 1 : if T /N 1 is large, 422 then run time will be similar to a population of size N 1 ; while if T /N 1 is small, it will be similar to 423 a population of size N 2 . For instance, in many agricultural species N 1 ∝ 100, while N 2 ∝ 10 5 , and 424 the run time will depend critically on T -in other words, simulation will be quick in a species with 425 a strong domestication bottleneck, and slow otherwise. Figure 5A 441 shows that msprime is far more efficient than both SimBac and the SMC-based approximation 442 fastSimBac. Figure 5B shows that msprime requires somewhat more memory than fastSimBac, (as 443 expected since fastSimBac uses a left-to-right SMC approximation) but is still reasonably modest 444 at around 1GiB for a simulation of 500 whole E. coli genomes. However, msprime is currently 445 lacking many of the specialised features required to model bacteria, and so an important avenue 446 for future work is to add features such as circular genomes and bacterial gene transfer (Baumdicker 447 and Pfaffelhuber, 2014).

In terms of predicting the run time for a simulation including gene conversion, we recommend 449 following the same approach as discussed in the previous section: run a number of simulations 450 for short genome lengths, fit a quadratic to the observed CPU times, and use this to predict 451 run time for larger simulations. Depending on the relative contributions of gene conversion and 452 crossover recombination, this may be an over-estimate since gene conversion events tend to generate 453 less trapped ancestral material than crossovers. Thus, simulations using mammalian-like gene 454 conversion parameters may run faster than simulations in which an equivalent amount of crossover 455 recombination is imposed. Since each gene conversion creates two breakpoints and a crossover 456 creates only one, we expect the output tree sequence for a given rate of gene conversion to be 457 roughly twice the size of the output from a simulation with the same rate of crossovers.

One of the key applications of population genetic simulations is to generate data for complex de-460 mographies. Beyond idealised cases such as stepping-stone or island models, or specialised cases such 461 as isolation-with-migration models, analytical results are rarely possible. Simulation is therefore 462 integral to the development and evaluation of methods for demographic inference. The demogra-463 phy model in msprime is directly derived from the approach used in ms, and supports an arbitrary 464 number of randomly mating populations exchanging migrants at specified rates. A range of demo-465 graphic events are supported, which allow for varying population sizes and growth rates, changing 466 migration rates over time, as well as population splits, admixtures and pulse migrations. The loca-467 tion of sampled lineages can be tracked through time in as much detail as required: each tree node 468 is automatically associated with the population in which it arose, the location of lineages can be 469 recorded at any given time via census events, or every lineage migration can be recorded. Large 470 demographic models can be simulated efficiently in version msprime 1.0, since we only consider pop-471 ulations that contain lineages and have non-zero migration rates when generating migration event 472 waiting times. This is a considerable improvement over version 0.x, which scaled quadratically with 473 the number of populations.

A major change for msprime 1.0 is the introduction of the new Demography API, designed to 475 address a design flaw in the msprime 0.x interface which led to a number of avoidable errors in 476 downstream simulations (Ragsdale et al., 2020). Briefly, the 0.x API required three separate pa-477 rameters be provided to the simulate function to describe a demographic model, making it easy to 478 accidentally omit information. The 1.0 API resolves this issue by creating a new Demography class, 479 which encapsulates all information about the demographic model, and fully decouples the definition 480 from other simulation details. An instance of this class is then provided as a parameter to the new 481 sim_ancestry function, substantially reducing the potential for error. Another improvement over 482 the 0.x APIs is the introduction of explicit population split and admixture events, and a popula-483 tion state machine that ensures that lineages cannot migrate into (or be sampled from) inactive 484 populations. This demography model is compatible with the Demes standard (Gower et al., 2021) , 485 and the Demography class supports importing and exporting Demes models. Models previously 486 constructed using the 0.x API can be seamlessly imported into the Demography class, and we also 487 support importing demographic models from Newick species trees and the output of programs like 488 *BEAST (Heled and Drummond, 2009 ).

The DemographyDebugger provides detailed information about demographic models as well as 490 numerical methods to make predictions about these models. For example, we can compute the 491 coalescence rates for two or more lineages drawn from populations at specified times in the past, 492 which can be inverted to obtain the "inverse instantaneous coalescence rate" of Chikhi et al. The Ancestral Recombination Graph (ARG) was introduced by Griffiths (Griffiths, 1991; Griffiths 553 and Marjoram, 1997) to represent the stochastic process of the coalescent with recombination as 554 a graph. This formulation is complementary to Hudson's earlier work (Hudson, 1983a) , and sub-555 stantially increased our theoretical understanding of recombination. In Griffiths' ARG formulation, 556 a realisation of the coalescent with recombination is a graph in which vertices represent common 557 ancestor or recombination events, and edges represent lineages. There is the "big" ARG, in which 558 we track lineages arising out of recombinations regardless of whether they carry ancestral mate-559 rial (Ethier and Griffiths, 1990) , and the "little" ARG in which we only track genetic ancestors. 560 Over time, usage of the term has shifted away from its original definition as a stochastic process, 561 to being interpreted as a representation of a particular genetic ancestry as a graph, without neces-562 sarily following the specific details of the Griffiths formulation (e.g. Minichiello and Durbin, 2006; 563 Mathieson and Scally, 2020). Under the latter interpretation, the tree sequence encoding of genetic 564 ancestry (described above) clearly is an ARG: the nodes and edges define a graph in which edges 565 are annotated with the set of disjoint genomic intervals through which ancestry flows. the equivalent topology depicted as a tree sequence, including the recombination node; (C) the same tree sequence topology "simplified" down to its minimal tree sequence representation. Note that the internal nodes have been renumbered in the simplified representation, so that, e.g., node 5 in (C) corresponds to node 6 in (A) and (B).

partially present through the extant lineages and the ancestral material they carry over time. We do 570 not output the graph directly, but rather store the information required to recover the genealogical 571 history as nodes and edges in a tree sequence. This is far more efficient than outputting the 572 simulated ARG in its entirety. For a given scaled recombination rate ρ (setting aside the dependency 573 on the sample size n) we know from Eq. (1) that the number of nodes in an ARG is O(ρ 2 ), whereas 574 the size of the tree sequence encoding is O(ρ) (Kelleher et al., 2016) . This difference between a 575 quadratic and a linear dependency on ρ is profound, and shows why large simulations cannot output 576 an ARG in practice.

Although by default msprime outputs tree sequences that contain full information about the 578 genealogical trees, their correlation structure along the chromosome, and the ancestral genomes on 579 which coalescences occurred, some information is lost in this mapping down from ARG space to 580 the minimal tree sequence form. In particular, we lose information about ancestral genomes that 581 were common ancestors but in which no coalescences occurred, and also information about the 582 precise time and chromosomal location of recombination events. In most cases, such information is 583 of little relevance as it is in principle unknowable, but there are occasions such as visualisation or 584 computing likelihoods (see below) in which it is useful. We therefore provide the record_full_arg 585 option in msprime to store a representation of the complete ARG traversed during simulation. This 586 is done by storing extra nodes (marked with specific flags, so they can be easily identified later) and 587 edges in the tree sequence (Fig. 6) . One situation in which a record of the full ARG is necessary 588 is when we wish to compute likelihoods during inference. The likelihood is a central quantity in 589 evaluating the plausibility of a putative ancestry as an explanation of DNA sequence data, both 590 directly through e.g. approaches based on maximum likelihood, and as an ingredient of methods 591 such as Selective sweeps are implemented in the coalescent as a two step-process: first generating an allele frequency trajectory, and then simulating a structured coalescent process conditioned on that trajectory. Following discoal, we generate sweep trajectories in msprime using a jump process approximation to the conditional diffusion of an allele bound for fixation (Coop and Griffiths, 2004) . The jump process moves back in time following the beneficial allele frequency, p, from some initial frequency (e.g., p = 1) back to the origination of the allele at p = 1/(2N ), tracking time in small increments δt. Then, given the frequency p at time t, the frequency p at time t + δt is given by p = p + µ(p)δt + p(1 − p)δt with probability 1/2 p + µ(p)δt − p(1 − p)δt with probability 1/2 (1 − p) ) .

Here, α = 2N s and s is the fitness advantage in homozygotes. This model assumes genic selection 608 (i.e., that the dominance coefficient h = 0.5), but can be generalised straightforwardly to include 609 arbitrary dominance. We can also define trajectories to model neutral alleles and soft selective 610 sweeps, which we plan as future additions to msprime.

Then, given a randomly generated allele frequency trajectory under the above model, the simu-612 lation of a sweep works by assigning lineages to two different structured coalescent "labels", based 613 on whether they carry the beneficial allele. The allele frequency trajectory determines the relative 614 sizes of the "populations" in these labels over time, and therefore the rates at which various events 615 occur. Common ancestor events can then only merge lineages from within a label, but lineages can 616 transfer from one label to the other (i.e., from the advantageous to disadvantageous backgrounds, 617 and vice versa) as a result of recombination events. Once we have reached the end of the simulated 618 trajectory the sweep is complete, and we remove the structured coalescent labels. Simulation may 619 then resume under any other ancestry model. Fig. 8 shows that msprime simulates the DTWF more quickly and requires substantially less 639 memory than ARGON (Palamara, 2016), a specialised DTWF simulator. However, the generation-by-640 generation approach of the DTWF is less efficient than the coalescent with recombination when the 641 number of lineages is significantly less than the population size (the regime where the coalescent 642 is an accurate approximation), which usually happens in the quite recent past ( . We simulated ancestry for a sample of 1000 haploids from a population of 10000, and report the (A) total CPU time and (B) maximum memory usage for varying sequence lengths, and a per-base recombination rate of 10 −8 . Each point is the average over 5 replicate simulations. We show observations for ARGON, msprime's DTWF implementation ("DTWF") and a hybrid simulation of 100 generations of the DTWF followed by the standard coalescent with recombination model ("DTWF + Hudson"). Memory usage for msprime's DTWF and hybrid simulations are very similar. We ran ARGON with a mutation rate of 0 and with minimum output options, to ensure we are measuring only ancestry simulation time.

2014). We therefore support changing the simulation model during a simulation so that we can 644 run hybrid simulations, as proposed by Bhaskar et al. (2014) . Any number of different simulation 645 models can be combined, allowing for the flexible choice of simulation scenarios. As the discrete 646 time Wright-Fisher model improves accuracy of genealogical patterns in the recent past, we can 647 simulate the recent history using this model and then switch to the standard coalescent to more 648 efficiently simulate the more ancient history.

Integration with forward simulators 650 A unique feature of msprime is its ability to simulate genetic ancestries by extending an existing 651 partial genetic ancestry. Given a tree sequence that is complete up until time t ago as input 652 (where marginal trees may or may not have fully coalesced), msprime can efficiently obtain the 653 segments of ancestral material present at this time, and then run the simulation backwards in time 654 from there. This allows a simulated ancestry to be produced by any number of different processes 655 across disjoint time slices. In practice this feature is used to "complete" forwards-time ancestry . The authorship of our paper reflects these 697 trends, with a skew towards men and affiliations in the USA and Europe. We know the importance of 698 creating and strengthening networks to develop and maintain a diverse community of contributors, 699 and we are committed to fostering a supportive and collaborative environment that helps to address 700 these inequalities in our field.

The 1.0 release of msprime marks a major increase in the breadth of available features and the 703 potential biological realism of simulations. These abilities will allow researchers to perform more 704 robust power analyses, more reliably test new methods, carry out more reliable inferences, and more 705 thoroughly explore the properties of theoretical models. Despite this complexity and generality, 706 msprime's performance is state-of-the-art and all features are extensively tested and statistically 707 validated. These advances have only been possible thanks to a distributed, collaborative model of 708 software development, and the work of many people.

Even though simulation has long been a vital tool in population genetics, such collaborative 710 software development has historically been uncommon. A huge proliferation of tools have been 711 published (the references here are not exhaustive) and only a small minority of these are actively 712 developed and maintained today. The ecosystem is highly fragmented, with numerous different 713 ways of specifying parameters and representing results, and there are significant software quality 714 issues at all stages. This is unsurprising, since the majority of simulation software development is 715 performed by students, often without formal training in software development. The result resembles 716 Haldane's sieve for new mutations: many new pieces of software stay permanently on a dusty shelf 717 of supplementary materials, while some of those that prove particularly useful when new (like 718 dominant alleles) are quickly adopted. Although this has produced many good tools and enabled 719 decades of research, it also represents a missed opportunity to invest as a community in shared 720 infrastructure and mentorship in good software development practice.

Scientific software is vital apparatus, and must be engineered to a high quality if we are to 722 trust its results. There is a growing realisation across the sciences (e.g. Siepel, 2019; Harris et al., 723 2020; Gardner et al., 2021) that investing in shared community infrastructure produces better 724 results than a proliferation of individually maintained tools, allowing scientists to focus on their 725 specific questions rather than software engineering. Msprime 1.0 is the result of such a community 726 process, with features added by motivated users, taking advantage of the established development 727 practices and infrastructure. Software development in a welcoming community, with mentorship 728 by experienced developers, is a useful experience for many users. The skills that contributors learn 729 can lead to greatly increased productivity in subsequent work (e.g., through more reliable code and 730 better debugging skills). We hope that users who find that features they require are missing will 731 continue to contribute to msprime, leading to a community project that is both high quality and 732 sustainable in the long term.

The succinct tree sequence data structure developed for msprime provides a view of not only 734 genetic variation, but also the genetic ancestry that produced that variation. Recent breakthroughs 735 in methods to infer genetic ancestry in recombining organisms ( ancestors based on inferred trees, and other uses are sure to follow. Since the inferred genetic 741 ancestry becomes the input for other downstream inferences, it is vitally important that these 742 primary inferences are thoroughly validated, with the detailed properties of the inferred ancestries 743 catalogued and understood. Msprime will continue to be an important tool for these inferences 744 and validations, and in this context the ability to interoperate with other methods-particularly 745 forwards simulators-through the succinct tree sequence data structure and tskit library will be 746 essential.

Availability 748 Msprime is freely available under the terms of the GNU General Public License v3.0, and can be in-749 stalled from the Python Package Index (PyPI) or the conda-forge conda channel. Development is 750 conducted openly on GitHub at https://github.com/tskit-dev/msprime/. The documentation 751 for msprime is available at https://tskit.dev/msprime/docs/. The source code for all the evalua-752 tions and figures in this manuscript is available at https://github.com/tskit-dev/msprime-1.0-paper/. 753 on the same edge. Otherwise (i.e., for an infinite-sites model), positions are rejection sampled to 1240 obtain a unique floating-point number. If an edge spans a region of the genome with more than one 1241 mutation rate, this is done separately for each sub-region on which the mutation rate is constant. 1242 Since each edge is processed independently, the algorithm scales linearly with the number of edges 1243 in the tree sequence.

Next, alleles are chosen for each mutation. If the site was not previously mutated, then a 1245 new ancestral allele is chosen for the site, according to an input distribution of ancestral state 1246 allele probabilities. Then, each mutation on the tree is considered in turn, and a derived allele 1247 is randomly chosen based on the parental allele (which may be the ancestral allele or the derived 1248 allele of a previous mutation). Finally, information about the mutations are recorded in the site 1249 and mutation tables of the tree sequence.

A mutation model must, therefore, provide two things: a way of choosing an ancestral allele 1251 for each new variant site, and a way of choosing a derived allele given the parental allele at each 1252 mutation. Perhaps the simplest mutation model implemented in msprime is the InfiniteAlleles 1253 mutation model, which keeps an internal counter so that the requested alleles are assigned subse-1254 quent (and therefore unique) integers.

The distribution of ancestral alleles is used to choose the allele present at the root of the tree 1256 at each mutated site, i.e., the root_distribution. Mutation models with a finite possible set 1257 of alleles have a natural choice for this distribution-the stationary distribution of the mutation 1258 process. (All mutation models are Markovian, so this may be found as the top left eigenvector of 1259 the mutation matrix.) This is the default in most models, except, e.g., the BinaryMutationModel, 1260 whose alleles are 0 and 1 and always labels the ancestral allele "0". However, mutational processes 1261 are not in general stationary, so we often allow a different root distribution to be specified.

Since the general algorithm above applies mutations at a single rate independent of ancestral 1263 state, a model in which different alleles mutate at different rates must necessarily produce some 1264 silent mutations, i.e., mutations in which the derived allele is equal to the parental allele. To 1265 illustrate this, consider a mutation model in which A or T mutates to a randomly chosen different 1266 nucleotide at rate α and C or G mutates at rate β, with β < α. To implement this, first place 1267 mutations at the largest total rate, which is α. Then, at each site, choose an ancestral allele from 1268 the root distribution, and for each mutation, choose a derived allele as follows: if the parental allele 1269 is A or T , then choose a random derived allele different to the parental allele; if the parental allele 1270 is C or G, then choose the derived allele to be equal to the parent allele with probability β/(α + β), 1271 and randomly choose a different nucleotide otherwise. This produces the correct distribution by 1272 Poisson thinning: a Poisson process with rate α in which each point is discarded independently 1273 with probability β/(α + β) is equivalent to a Poisson process with rate β. All finite-state models 1274 (implemented under the generic MatrixMutationModel class) work in this way: mutations are 1275 placed at the maximum mutation rate, and then some silent mutations will result.

In previous versions of msprime, silent mutations were disallowed, and we could have removed 1277 them from the output entirely. However, we have chosen to leave them in, so that for instance simu-1278 lating with the HKY mutation model will result in silent mutations if not all equilibrium frequencies 1279 are the same. The presence of silent mutations may at first be surprising but there is a good reason 1280 to leave them in: to allow layering of different mutation models. Suppose that we wanted to model 1281 the mutation process as a mixture of more than one model, e.g., Jukes-Cantor mutations at rate µ 1 , 1282 and HKY mutations occur at rate µ 2 . Layering multiple calls to sim_mutations is allowed, so we 1283 could first apply mutations with the JC69 model at rate µ 1 and then add more with the HKY model 1284 at rate µ 2 . However, there is a small statistical problem: suppose that after applying Jukes-Cantor 1285 mutations we have an A → C mutation, but then the HKY mutations inserts another mutation in 1286 the middle, resulting in A → C → C. If neither mutation model allows silent transitions, then this 1287 is clearly not correct, i.e., it is not equivalent to a model that simultaneously applies the two models. 1288 (The impact is small, however, as it only affects sites with more than one mutation.) The solution 1289 is to make the Jukes-Cantor model state-independent (also called "parent-independent"), by placing 1290 mutations at rate 4/3µ 1 and then choosing the derived state for each mutation independently of the 1291 parent (so that 1/4 of mutations will be silent). If so-and, more generally, if the first mutational 1292 process put down is state-independent-then the result of sequentially applying the two mutation 1293 models is equivalent to the simultaneous model. To facilitate this, many mutation models have 1294 a state_independent option that increases the number of silent mutations and makes the model 1295 closer to state-independent.

Silent mutations are fully supported by tskit, which correctly accounts for their presence when 1297 computing statistics and performing other operations. For example, silent mutations have no effect 1298 on calculations of nucleotide site diversity.

Likelihood calculations 1300 We provide two functions to facilitate likelihood-based inference. Both are implemented only for 1301 the simplest case of the standard ARG with a constant population size, and require tree sequences 1302 compatible with the record_full_arg option as their arguments.

The msprime.log_arg_likelihood(ts, r, N) function returns the natural logarithm of the sampling probability of the tree sequence ts under the ARG with per-link, per-generation recombination probability r and population size N (e.g. Kuhner et al., 2000, equation (1)). Specifically, the function returns the logarithm of

where t i is the number of generations between the (i − 1)th and ith event, k i is the number of 1304 extant ancestors in that interval, l i is the number of links in that interval that would split ancestral 1305 material should they recombine, q is the total number of events in the tree sequence ts, q c is the 1306 number of coalescences, R is the set of indices of time intervals which end in a recombination, 1307 and g i is the corresponding gap: the length of contiguous non-ancestral material around the link 1308 at which the recombination in question took place. The gap indicates the number of links (or 1309 length of genome in a continuous model) at which a recombination would result in exactly the 1310 observed pattern of ancestral material in the ARG. For a continuous model of the genome and a 1311 recombination in ancestral material, we set g i = 1 and interpret the result as a density.

The msprime.unnormalised_log_mutation_likelihood(ts, m) function returns the natural logarithm of the probability of the mutations recorded in the tree sequence ts given the corresponding ancestry, assuming the infinite sites model, up to a normalising constant which depends on the pattern of mutations, but not on the tree sequence or the per-site, per-generation mutation probability m. Specifically, the function returns the logarithm of

where T and M are the total branch length and set of mutations in ts, respectively, and for a 1313 mutation γ, h γ is the total branch length on which γ could have arisen while appearing on all 1314 of the leaves of ts it does, and on no others. Unary nodes on marginal trees arising from the 1315 record_full_arg option mean that, in general h γ corresponds to the length of one or more edges. 1316

Multiple merger coalescent model 1317 Multiple merger coalescents, in which a random number of ancestral lineages may merge into a 1318 common ancestor at a given time, are referred to as Λ-coalescents. The rate at which a given group 1319 of k out of a total of b lineages merges is 1320

where 1 {A} := 1 if A holds, and zero otherwise, a ≥ 0 is a constant, and Λ is a finite measure on the unit interval without an atom at zero (Donnelly and Kurtz, 1999; Pitman, 1999; Sagitov, 1999) . There is also a larger class of simultaneous multiple merger coalescents involving simultaneous mergers of distinct groups of lineages into several common ancestors (Schweinsberg, 2000) . These are commonly referred to as Ξ-coalescents, and often arise from population models incorporating diploidy or more general polyploidy ( x j ≤ 1 .

The rate of mergers is determined by Ξ = Ξ 0 + aδ 0 , where a ≥ 0 is a constant, δ 0 is the Dirac delta measure, and Ξ 0 is a finite measure on ∆ with no atom at (0, 0, . . . ). For an initial number of blocks b ≥ 2 and r ∈ {1, 2, . . . , b − 1}, let k 1 ≥ 2, . . . , k r ≥ 2 be the sizes of r merger events and s = b − k 1 − · · · − k r be the number of blocks not participating in any merger. The rate of each possible set of mergers with sizes (k 1 , . . . , k r ) is where j := #{i ∈ {1, . . . , r} : k i = j} is the number of mergers of size j ≥ 2 (Schweinsberg, 2000) . 1321 Viewing coalescent processes strictly as mathematical objects, it is clear that the class of Ξ-1322 coalescents contains Λ-coalescents as a specific example in which at most one group of lineages can 1323 merge at each time, and the class of Λ-coalescents contain the Kingman-coalescent as a special case. 1324 However, viewed as limits of ancestral processes derived from specific population models they are not 1325 nested. For example, one can obtain Λ-coalescents from haploid population models incorporating 1326 sweepstakes reproduction and high fecundity, and Ξ-coalescents for the same models for diploid 1327 populations (Birkner et al., 2013a). One should therefore apply the models as appropriate, i.e. Λ-1328 coalescents to haploid (e.g. mtDNA) data, and Ξ-coalescents to diploid or polyploid (e.g. autosomal) 1329 data (Blath et al., 2016).

In msprime we have incorporated two examples of multiple-merger coalescents. One is a diploid 1331 extension (Birkner et al., 2013a) of the haploid Moran model adapted to sweepstakes reproduction 1332 considered by Eldon and Wakeley (2006) . Let N denote the population size, and take ψ ∈ (0, 1] 1333 to be fixed. In every generation, with probability 1 − ε N a single individual (picked uniformly at 1334 random) perishes. With probability ε N , ψN individuals picked uniformly without replacement 1335 perish instead. In either case, a surviving individual picked uniformly at random produces enough 1336 offspring to restore the population size back to N . Taking ε N = 1/N γ for some γ > 0, Eldon 

The interpretation of (3) is that 'small' reproduction events in which two lineages merge occur at 1347 rate 1/(1 + cψ 2 /4), while large reproduction events with the potential to result in simultaneous 1348 multiple mergers occur at rate (cψ 2 /4)/(1 + cψ 2 /4).

The other multiple merger coalescent model incorporated in msprime is the haploid population 1350 model considered by Schweinsberg (2003) , as well as its diploid extension (Birkner et al., 2018) . 1351 In the haploid version, in each generation of fixed size N , individuals produce random numbers of 1352 juveniles (X 1 , . . . , X N ) independently, each distributed according to a stable law satisfying 1353 lim k→∞ Ck α P (X ≥ k) = 1 (4) with index α > 0, and where C > 0 is a normalising constant. If the total number of juveniles 1354 S N := X 1 + . . . + X N produced in this way is at least N , then N juveniles are sampled uniformly at 1355 random without replacement to form the next generation. As long as E [X 1 ] > 1, one can show that 1356 {S N < N } has exponentially small probability in N , and does not affect the resulting coalescent 1357 as N → ∞ (Schweinsberg, 2003) . If α ≥ 2 the ancestral process converges to the Kingman-1358 coalescent; if 1 ≤ α < 2 the ancestral process converges to a Λ-coalescent with Λ in (2) given by 1359 the Beta(2 − α, α) distribution, i.e. 1360 Λ(dx) = 1 {0<x≤1} 1 B(2 − α, α)

x

where B(a, b) = Γ(a)Γ(b)/Γ(a+b) for a, b > 0 is the beta function (Schweinsberg, 2003) . This model 1361 has been adapted to diploid populations by Birkner et al. (2018) , and the resulting coalescent is 1362

771 A community-maintained standard library of population genetic models

Predicting the landscape of recombination 773 using deep learning

Simulation of molecular data under diverse evolutionary scenarios

Recodon: coalescent simulation of coding DNA sequences with 777 recombination, migration and demography

Mitochondrial cytochrome b DNA variation in the high-fecundity Atlantic cod: 779 trans-Atlantic clines and shallow gene genealogy

A new model for extinction and 781 recolonization in two dimensions: quantifying phylogeography. Evolution: International journal 782 of organic evolution

The infinitely many genes model with horizontal gene 784 transfer

Approximate Bayesian computation in 786 population genetics

Occupancy spectrum distribution: application for coales-788 cence simulation with generic mergers

The quetzal coalescence template library: 790 A C++ programmers resource for integrating distributional, demographic and coalescent models

Mitochondrial haplotype frequencies in oysters: neutral alternatives to 793 selection models

Coalescent processes when the distribution of offspring number 876 among individuals is highly skewed

On the two-locus sampling distribution

MSMS: a coalescent simulation program including re-880 combination, demographic structure, and selection at a single locus

Fastsimcoal: a continuous-time coalescent simulator of ge-883 nomic diversity under arbitrarily complex evolutionary scenarios

A hidden markov model approach to variation among 886 sites in rate of evolution

The unreasonable effectiveness of convolutional 888 neural networks in population genetic inference

INDELible: a flexible simulator of biological sequence evolution. 891 Molecular biology and evolution

Cannings models, population size changes and multiple-merger coalescents

Detecting bottlenecks and selective sweeps 895 from DNA sequence polymorphism

Sustained software develop-898 ment, not number of citations or journal choice, is indicative of accurate bioinformatic software

AlphaSimR: An R-package for breeding 901 program simulations

Genetic drift in an infinite population: the pseudohitchhiking model

Simprily: A Python framework to simplify high-throughput genomic 906 simulations

Demes: a standard format for demographic models

The two-locus ancestral graph

An ancestral recombination graph

Sampling 915 theory for neutral alleles in a varying environment

Nemo: an evolutionary and population genetics 918 programming framework

SLiM 3: forward genetic simulations beyond the Wright-920 Fisher model

Tree-922 sequence recording in SLiM opens new horizons for forward-time simulation of whole genomes. 923 Molecular ecology resources

Array 926 programming with numpy

From a database of genomes to a forest of evolutionary trees

Does variance in reproductive success limit effective population sizes of marine 930 organisms? Genetics and evolution of aquatic organisms

Sweepstakes reproductive success in highly fecund 932 marine fish and shellfish: a review and commentary

Gene genealogies, variation and evolution: a 935 primer in coalescent theory

Bayesian inference of species trees from multilocus data. 937 Molecular biology and evolution

mshot: modifying Hudson's ms simulator to incorporate 939 crossover and gene conversion hotspots

Amino acid substitution matrices from protein blocks

msBayes: pipeline for testing comparative 943 phylogeographic histories using hierarchical approximate bayesian computation. BMC bioinfor-944 matics

Computer simulations: tools for population 946 and evolutionary genetics

Markovian approximation to the finite loci coalescent with 948

Coalescent simulation in continuous 985 space

Coalescent simulation in continuous 987 space: Algorithms for large neighbourhood size

Efficient coalescent simulation and 989 genealogical analysis for large sample sizes

Efficient pedigree 991 recording for fast population genetics simulation

Inferring whole-genome histories in large population datasets

Discoal: flexible coalescent simulations with selection

Detecting a local signature of genetic hitchhiking along a 999 recombining chromosome

A simple method for estimating evolutionary rates of base substitutions through 1001 comparative studies of nucleotide sequences

Estimation of evolutionary distances between homologous nucleotide sequences

The coalescent. Stochastic processes and their applications

On the genealogy of large populations

Gene conversion and linkage: effects on genome 1009 evolution and speciation

Multi-locus data distinguishes between population growth and multiple merger coa-1011 lescents. Statistical applications in genetics and molecular biology

Robust model selection between population growth and 1013 multiple merger coalescents

Maximum likelihood estimation of recom-1015 bination rates from population data

1017 The impact of selection, gene conversion, and biased sampling on the assessment of microbial 1018 demography

Inferring the demographic history and rate of adaptive substi-1020 tution in Drosophila

Inference of human population history from individual whole-genome 1022 sequences

A survey of genetic simulation software 1024 for population and epidemiological studies

Popabc: a program to infer historical 1026 demographic parameters

CoaSim: a flexible environment for simulating genetic data under 1029 coalescent models

Fast "coalescent" simulation

The allele frequency 1032 spectrum in genome-wide human variation data reveals signals of differential demographic history 1033 in three large world populations

Human demo-1036 graphic history impacts genetic risk prediction across diverse populations

Erratum: Hu-1040 man demographic history impacts genetic risk prediction across diverse populations (the amer-1041 ican journal of human genetics

What is ancestry?

Coalescent 1045 processes with skewed offspring distributions and nonequilibrium demography

A daily-updated database and tools for 1049 comprehensive SARS-CoV-2 mutation-annotated trees. bioRxiv

GraphML specializations to codify 1051 ancestral recombinant graphs

ipcoal: An interactive Python package for simulating 1053 and analyzing genealogies and sequences on a species tree or network

Approximating the coalescent with recombination

Telomere-to-telomere 1059 assembly of a complete human X chromosome

Mapping trait loci by use of inferred ancestral recombi-1061 nation graphs

A classification of coalescent processes for haploid exchangeable 1063 population models

Revis-1065 iting the Out of Africa event with a novel deep learning approach. bioRxiv

Genealogies of rapidly adapting populations

Accounting for long-range correlations in genome-wide simulations of large cohorts

Estimation of population parameters and recombination rates from single nu-1072 cleotide polymorphism

Estimating dispersal rates and locating genetic ancestors 1074 with genome-wide genealogies. bioRxiv

ARGON: fast, whole-genome simulation of the discrete time Wright-1076 Fisher process

skelesim: an extensible, general framework for population genetic 1079 simulation in r

msABC: a modification of Hudson's ms to 1081 facilitate multi-locus ABC analysis

A sequential coalescent algorithm 1083 for chromosomal inversions

Genetic data simulators and their applications: an overview. Genetic epidemiology

Coalescents with multiple collisions

Reliable ABC model choice via random forests

Mod-1092 eling SNP array ascertainment with Approximate Bayesian Computation for demographic infer-1093 ence

Archaic adaptive 1096 introgression in TBX15/WARS2

Lessons learned from bugs 1098 in models of human history

Efficiently summarizing relationships in large 1100 samples: a general duality between statistics of genealogies and genomes

Seq-Gen: an application for the Monte Carlo simulation 1103 of DNA sequence evolution along phylogenetic trees

Genome-wide inference 1105 of ancestral recombination graphs

ABC random forests for Bayesian parameter inference

Simulation with RADinitio 1110 improves RADseq experimental design and sheds light on sources of missing data. Molecular 1111 ecology resources

Powerful 1113 methods for detecting introgressed regions from population genomic data

The general coalescent with asynchronous mergers of ancestral lines

Deep learning for population size 1118 history inference: Design, comparison and combination with approximate bayesian computation

An ancestral recombination graph of 1121 human, Neanderthal, and Denisovan genomes

Inferring human population size and separation history from 1123 multiple genome sequences

Supervised machine learning for population genetics: a 1125 new paradigm

Coalescents with simultaneous multiple collisions

Coalescent processes obtained from supercritical Galton-Watson processes. 1129 Stochastic processes and their Applications

Rigorous results for a population model with selection II: genealogy of the 1131 population

Aloyce Odhi-1133 ambo, Alie Eleveld, and Jenevieve Mannell. Gender equality in science, medicine, and global 1134 health: where are we at and why does it matter? The Lancet

Deep learning for population genetic inference

Estimating variable effective population sizes from 1138 multiple genomes: a sequentially markov conditional sampling distribution approach

Cosi2: an efficient simulator of exact 1141 and approximate coalescent with selection

Challenges in funding and developing genomic software: roots and remedies

A method for genome-wide genealogy 1145 estimation for thousands of samples

Inferring population histories for ancient genomes using genome-wide genealogies. Molec-1148 ular Biology and Evolution

Inference and analysis of population-specific fine-scale recombi-1150 nation maps across 26 diverse human populations

SelSim: a program to simulate population genetic data with 1152 natural selection and recombination

Pyvolve: a flexible Python module for simulating 1154 sequences along phylogenies

Coala: an R framework for coalescent simulation

scrm: Efficiently simulating long se-1158 quences using the approximated coalescent with recombination

Evolutionary relationship of DNA sequences in finite populations

Genealogy at the genome scale

Prac-1164 tical guide for managing large-scale human genome data in research

Some probabilistic and statistical problems in the analysis of DNA sequences. 1167 Lectures on mathematics in the life sciences

Geonomics: forward-time, spatially 1169 explicit, and arbitrarily complex landscape genomic simulations

Robust and scalable inference of population 1172 history from hundreds of unphased whole genomes

mbs: modifying Hudson's ms software to generate samples 1174 of DNA sequences with a biallelic site under selection

Approximate Bayesian inference reveals evidence for a 1176 recent, severe bottleneck in a Netherlands population of Drosophila melanogaster

A C++ template library for efficient forward-time population genetic simulation 1179 of large populations

Women's par-1181 ticipation in open source software: A survey of the literature

Tskit: a portable library for population scale genealogical analysis. In preparation

Ultrafast sample placement on existing trees 1187 (UShER) enables real-time phylogenetics for the SARS-CoV-2 pandemic

Gspace: an exact 1190 coalescence simulator of recombining genomes under isolation by distance

Coalescent theory: an introduction. Roberts and Company

Gene genealogies within a 1194 fixed pedigree, and the robustness of Kingman's coalescent

Tracking human population 1196 structure through time from whole genome sequences

Bayesian inference of fine-scale recombination rates using popula-1198 tion genomic data

A new 1201 method for modeling coalescent processes with recombination

ABC-1204 toolbox: a versatile toolkit for approximate Bayesian computations

Women in evolution-highlighting the changing face of evo-1207 lutionary biology

The SMC' is a highly accurate approximation to 1209 the ancestral recombination graph

The ancestry of a sample of sequences subject to recombination

Recombination as a point process along sequences

The coalescent with gene conversion

A unified genealogy of modern and 1218 ancient genomes. bioRxiv

Critical assessment of coalescent simulators in 1220 modeling recombination hotspots in genomic sequences

An overview of 1222 population genetic data simulation

Hybrid-Lambda: simulation 1224 of multiple merger and Kingman gene genealogies in species networks and species trees

We would like to thank Iain Mathieson 

Mutation generation 1228 The algorithm that msprime uses to simulate mutations on a tree sequence proceeds in two steps: 1229 first, mutations are "placed" on the tree sequence (i.e., sampling their locations in time, along the 1230 genome, and on the marginal tree), and then the ancestral and derived alleles of each mutation are 1231 generated. All mutation models share the code to place mutations, but choose alleles in different 1232 ways.

First, mutations are placed on the tree sequence under an inhomogeneous Poisson model by 1234 applying them independently to each edge. If an edge spans a region [a, b) of the genome and 1235 connected parent and child nodes with times s < t, and the mutation rate locally is µ, then the 1236 number of mutations on the edge is Poisson with mean µ(t − s)(b − a), and each mutation is placed 1237 independently at a position chosen uniformly in [a, b) and a time uniformly in [s, t). In a discrete 1238 genome, all positions are integers and so more than one mutation may occur at the same position 1239 Ξ-coalescent with merger ratewhere k := k 1 +. . .+k r (Blath et al., 2016; Birkner et al., 2018). The interpretation of (6) is that the 1364 random number of lineages participating in a potential merger is governed by the Λ-coalescent with 1365 rate (5), and all participating lineages are randomly allocated into one of four groups corresponding 1366 to the four parental chromosomes, giving rise to up to four simultaneous mergers.

The stable law (4) assumes that individuals can produce arbitrarily large numbers of juveniles. 1368 Since juveniles are at least fertilised eggs, it may be desirable to suppose that the number of 1369 juveniles surviving to reproductive maturity cannot be arbitrarily large. Hence we also consider 1370 an adaptation of the Schweinsberg (2003) model, where the random number of juveniles has a 1371 deterministic upper bound φ(N ), and the distribution of the number of juveniles produced by a 1372 given parent (or pair of parents in the diploid case) is