title: Adaptive dating and fast proposals: revisiting the phylogenetic relaxed clock model
authors: Douglas, Jordan; Zhang, Rong; Bouckaert, Remco
date: 2020-09-09
journal: bioRxiv
DOI: 10.1101/2020.09.09.289124

Abstract

Uncorrelated relaxed clock models enable estimation of molecular substitution rates across lineages and are widely used in phylogenetics for dating evolutionary divergence times. In this article we delved into the internal complexities of the relaxed clock model in order to develop efficient MCMC operators for Bayesian phylogenetic inference. We compared three substitution rate parameterisations, introduced an adaptive operator which learns the weights of other operators during MCMC, and explored how relaxed clock model estimation can benefit from two cutting-edge proposal kernels: the AVMVN and Bactrian kernels. This work has produced an operator scheme that is up to 65 times more efficient at exploring continuous relaxed clock parameters compared with previous setups, depending on the dataset. Finally, we explored variants of the standard narrow exchange operator which are specifically designed for the relaxed clock model. In the most extreme case, this new operator traversed tree space 40% more efficiently than narrow exchange. The methodologies introduced are adaptive and highly effective on short as well as long alignments. The results are available via the open source optimised relaxed clock (ORC) package for BEAST 2 under a GNU licence (https://github.com/jordandouglas/ORC).

Author summary

Biological sequences, such as DNA, accumulate mutations over generations. By comparing such sequences in a phylogenetic framework, the evolutionary tree of lifeforms can be inferred. With the overwhelming availability of biological sequence data, and the increasing affordability of collecting new data, the development of fast and efficient phylogenetic algorithms is more important than ever. In this article we focus on the relaxed clock model, which is very popular in phylogenetics. We explored how a range of optimisations can improve the statistical inference of the relaxed clock. This work has produced a phylogenetic setup which can infer parameters related to the relaxed clock up to 65 times faster than previous setups, depending on the dataset. The methods introduced adapt to the dataset during computation and are highly efficient when processing long biological sequences.

Introduction

In this paper we will only consider uncorrelated relaxed clock models. Finally, although not the focus of this article, the class of correlated clock models assumes some form of auto-correlation between rates over time. The correlation itself can invoke a range of stochastic models, including compound Poisson [21] and CIR processes [17], or it can exist as a series of local clocks [22]. However, due to the correlated and discrete nature of such models, the time required for MCMC to achieve convergence can be cumbersome, particularly for larger datasets [22].

With the overwhelming availability of biological sequence data, the development of efficient Bayesian phylogenetic methods is more important than ever. The performance of MCMC depends not only on computational speed but also on the efficacy of an MCMC setup at achieving convergence.
A critical task therein lies in the further development of efficient MCMC operators.

Models and Methods

Preliminaries

Let T be a binary rooted time tree with N taxa, and data D associated with the tips, such as a multiple sequence alignment with L sites, morphological data, or geographic locations. The posterior density of a phylogenetic model is described by

p(T, R, σ, µ_C, θ | D) ∝ p(D | T, r(R), µ_C, θ) p(T | θ) p(R | σ) p(σ) p(µ_C) p(θ)    (Eq 1)

where σ and µ_C represent clock model related parameters, and p(T | θ) is the tree prior, where θ describes further unspecified parameters. The tree likelihood p(D | T, r(R), µ_C, θ) has µ_C as the overall clock rate, and R is an abstracted vector of branch rates which is transformed into real rates by the function r(R). Branch rates have a mean of 1 under the prior to avoid non-identifiability with the clock rate µ_C. Three methods of representing rates as R are presented in Branch rate parameterisations.

Let t_i be the height (time) of node i. Each node i in T, except for the root, is associated with a parental branch length τ_i (the height difference between i and its parent) and a parental branch substitution rate r_i = r(R_i). In a relaxed clock model, each of the 2N − 2 elements in R is independently distributed under the prior p(R | σ).

The posterior distribution is sampled by the Metropolis-Hastings-Green MCMC algorithm [7, 8, 32], under which the probability of accepting a proposed state x' from the current state x is equal to:

α(x' | x) = min( 1, [p(x' | D) q(x | x')] / [p(x | D) q(x' | x)] × |J| )

where q(a | b) is the transition kernel: the probability of proposing state a from state b. The ratio q(x | x') / q(x' | x) is known as the Hastings ratio [8]. The determinant of the Jacobian matrix |J|, known as the Green ratio, solves the dimension-matching problem for proposals which operate on multiple terms across one or more spaces [32, 33].

Branch rate parameterisations

In Bayesian inference, the way parameters are represented in the model can affect both the mixing ability of the model and the meaning of the model itself [34]. Three methods for parameterising substitution rates are described below. Each parameterisation is associated with i) an abstraction of the branch rate vector R, ii) a function for transforming this abstraction into unabstracted branch rates r(R), and iii) a prior density function of the abstraction p(R | σ). The three methods are summarised in Fig 1.

1. Real rates

The natural (and unabstracted) parameterisation of a substitution rate is a real number R_i ∈ R, R_i > 0, which is equal to the rate itself. Under the real parameterisation:

r(R_i) = R_i.

Under a log-normal clock prior p(R | σ), rates are distributed with a mean of 1:

R_i ~ Log-normal(µ, σ), with µ = −0.5σ²,

where µ = −0.5σ² is set such that the expected value is 1. In this article we only consider log-normal clock priors; however, the methods discussed are general.

Zhang and Drummond 2020 introduced a series of tree operators which propose node heights and branch rates such that the resulting genetic distances (r_i × τ_i) remain constant [31]. These operators account for correlations between branch rates and branch times. By keeping the genetic distance of each branch constant, the likelihood is unaltered by the proposal.
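To make the mean-one constraint concrete, the following minimal sketch (illustrative Java, not code from the ORC package; the class and method names are hypothetical) evaluates the log-density of a branch rate under a log-normal prior with µ = −0.5σ²:

    // Sketch: log-normal prior density of a branch rate under the real
    // parameterisation, with mu = -0.5 * sigma^2 so that E[R_i] = 1.
    public class RealRatePrior {

        // Log of the log-normal density f(r | mu, sigma) for r > 0
        static double logDensity(double r, double sigma) {
            double mu = -0.5 * sigma * sigma;       // enforces a prior mean of 1
            double z = (Math.log(r) - mu) / sigma;
            return -Math.log(r * sigma * Math.sqrt(2 * Math.PI)) - 0.5 * z * z;
        }

        public static void main(String[] args) {
            // Density of rate r = 1.2 under clock standard deviation sigma = 0.5
            System.out.println(Math.exp(logDensity(1.2, 0.5)));
        }
    }

Setting µ = −0.5σ² follows from the log-normal mean exp(µ + σ²/2), which equals 1 exactly when µ = −σ²/2.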
2. Rate categories

The category parameterisation cat is an abstraction of the real parameterisation. Each of the 2N − 2 branches is assigned an integer from 0 to n − 1:

R_i ∈ {0, 1, ..., n − 1}.

These integers correspond to n rate categories (Fig 1). Let f(x | σ) be the probability density function (PDF) and let F(x | σ) = ∫₀ˣ f(t | σ) dt be the cumulative distribution function (CDF) of the prior distribution used by the underlying real clock model (a log-normal distribution for the purposes of this article). In the cat parameterisation, f(x | σ) is discretised into n bins and each element within R points to one such bin. The rate of each bin is equal to its median value:

r(R_i) = F⁻¹((R_i + 0.5)/n | σ),

where F⁻¹ is the inverse cumulative distribution function (i-CDF). The domain of R is uniformly distributed under the prior:

R_i ~ Uniform{0, 1, ..., n − 1}.

The key advantage of the cat parameterisation is the removal of a term from the posterior density (Eq 1), or more accurately the replacement of a non-trivial p(R | σ) term with that of a uniform prior. This may facilitate efficient exploration of the posterior distribution by MCMC. This parameterisation has been widely used in BEAST and BEAST 2 analyses [3]. However, the recently developed constant distance operators (which are incompatible with the cat parameterisation) can yield an increase in mixing rate under real by up to an order of magnitude over that of cat, depending on the dataset [31].

3. Rate quantiles

Finally, rates can be parameterised as real numbers describing the rate's quantile with respect to some underlying clock model distribution. Under the quant parameterisation, each of the 2N − 2 elements in R is uniformly distributed:

R_i ~ Uniform(0, 1).

Transforming these quantiles into rates invokes the i-CDF of the underlying real clock model distribution. Evaluation of the log-normal i-CDF has a high computational cost and therefore an approximation is used instead:

r(R_i) = F̂⁻¹(R_i | σ),

where F̂⁻¹ is a linear piecewise approximation with 100 pieces. While this approach has clear similarities with cat, the domain of rates here is continuous instead of discrete. In this project we extended the family of constant distance operators [31] so that they are compatible with quant. Further details on the quant piecewise approximation and constant distance operators can be found in S1 Appendix.
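Both abstractions reduce to evaluating the i-CDF of the underlying log-normal distribution. The sketch below (illustrative Java using Apache Commons Math for the i-CDF; the class name is hypothetical, and the exact grid of the ORC piecewise approximation described in S1 Appendix may differ) maps a category index or a quantile to a real rate:

    // Sketch: mapping cat and quant abstractions to real rates via the
    // inverse CDF of the underlying mean-one log-normal clock model.
    import org.apache.commons.math3.distribution.LogNormalDistribution;

    public class RateTransforms {

        final LogNormalDistribution logNormal;
        final double[] qGrid, rGrid;    // piecewise-linear cache for quant

        RateTransforms(double sigma, int pieces) {
            logNormal = new LogNormalDistribution(-0.5 * sigma * sigma, sigma);
            qGrid = new double[pieces + 1];
            rGrid = new double[pieces + 1];
            for (int i = 0; i <= pieces; i++) {
                qGrid[i] = (i + 0.5) / (pieces + 1);    // avoid q = 0 and q = 1
                rGrid[i] = logNormal.inverseCumulativeProbability(qGrid[i]);
            }
        }

        // cat: the rate of category c out of n bins is the median of bin c
        double catRate(int c, int n) {
            return logNormal.inverseCumulativeProbability((c + 0.5) / n);
        }

        // quant: linear interpolation between cached i-CDF evaluations,
        // avoiding a costly exact i-CDF call for every proposal
        double quantRate(double q) {
            if (q <= qGrid[0]) return rGrid[0];
            for (int i = 1; i < qGrid.length; i++) {
                if (q <= qGrid[i]) {
                    double w = (q - qGrid[i - 1]) / (qGrid[i] - qGrid[i - 1]);
                    return rGrid[i - 1] + w * (rGrid[i] - rGrid[i - 1]);
                }
            }
            return rGrid[rGrid.length - 1];   // clamp the upper tail
        }
    }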
Clock model operators

The weight of an operator determines the probability of the operator being selected. Weights are typically fixed throughout MCMC. In BEAST 2, operators can have their own tunable parameter s, which determines the step size of the operator. This term is learned over the course of the MCMC [10, 35]. We define clock model operators as those which generate proposals for either R or σ. Pre-existing clock model operators are summarised in Table 1.

Table 1. Summary of pre-existing BEAST 2 operators, which apply to either branch rates R or the clock standard deviation σ, and the substitution rate parameterisations they apply to. ConstantDistance and SimpleDistance also adjust node heights in the tree T.

Operator         | Description                                                                                                      | Applies to | Parameterisations
ConstantDistance | Adjusts an internal node height and recalculates all incident branch rates such that the genetic distances remain constant [31]. | R, T | real, quant
SimpleDistance   | Applies ConstantDistance to the root node [31].                                                                  | R, T | real, quant
SmallPulley      | Proposes new branch rates incident to the root such that their combined genetic distance is constant [31].       | R, T | real, quant
CisScale         | Applies Scale to σ, then recomputes all rates such that their quantiles are constant (for real [31]) or recomputes all quantiles such that their rates are constant (for quant). | R, σ | real, quant

The family of constant distance operators (ConstantDistance, SimpleDistance, and SmallPulley [31]) is best suited for larger datasets (or datasets with strong signal) where the likelihood distribution is peaked (Fig 2). While simple one-dimensional operators such as RandomWalk or Scale must take small steps in order to stay "on the ridge" of the likelihood function, the constant distance operators "wander along the ridge" by ensuring that genetic distances are constant after the proposal.

Fig 2. Traversing likelihood space. The z-axes are the log-likelihoods of the genetic distance r × τ between two simulated nucleic acid sequences of length L, under the Jukes-Cantor substitution model [36]. Two possible proposals from the current state (white circle) are depicted. These proposals are generated by the RandomWalk (RW) and ConstantDistance (CD) operators. In the low signal dataset (L = 0.1 kb), both operators can traverse the likelihood space effectively. However, the exact same proposal by RandomWalk incurs a much larger likelihood penalty in the L = 0.5 kb dataset by "falling off the ridge", in contrast to ConstantDistance, which "walks along the ridge". This discrepancy is even stronger for larger datasets and thus necessitates the use of operators such as ConstantDistance which account for correlations between branch lengths and rates.

Scale and CisScale both operate on the clock model standard deviation σ; however, they behave differently in the real and quant parameterisations (Fig 3). In real, large proposals σ → σ' made by Scale could incur large penalties in the clock model prior density p(R | σ') and thus may be rejected quite often. This led to the development of the fast clock scaler [31] (herein referred to as CisScale). This operator recomputes all branch rates R → R' such that their quantiles under the new clock model prior remain constant, F(R_i | σ) = F(R'_i | σ'). In contrast, a proposal made by Scale σ → σ' under the quant parameterisation implicitly alters all branch rates r(R) while leaving the quantiles unchanged.

Fig 3. Clock standard deviation scale operators. The two operators propose a clock standard deviation σ → σ'. Then, either the new quantiles are such that the rates remain constant ("New quantiles") or the new rates are such that the quantiles remain constant ("New rates"). In the real parameterisation, these two operators are known as Scale and CisScale, respectively. In quant, they are known as CisScale and Scale, respectively.

Adaptive operator weighting

It is not always clear which operator weighting scheme is best for a given dataset. In this article we introduce AdaptiveOperatorSampler: a meta-operator which learns the weights of other operators during MCMC and then samples these operators according to their learned weights. This meta-operator undergoes three phases. In the first phase (burn-in), AdaptiveOperatorSampler samples from its set of sub-operators uniformly at random. In the second phase (learn-in), the meta-operator starts learning several terms detailed below whilst continuing to sample operators uniformly at random. In its final phase, AdaptiveOperatorSampler samples operators (denoted by ω) using the following distribution:

p(ω) = Ω/K + (1 − Ω) × [D(ω)/T(ω)] / Σ_ω' [D(ω')/T(ω')]    (Eq 11)

where K is the number of sub-operators, T(ω) is the cumulative computational time spent on each operator, D is a distance function (defined below), and we use Ω = 0.01 to allow any sub-operator to be sampled regardless of its performance.
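The final-phase weighting can be sketched as follows (illustrative Java; the class and method names are hypothetical, and the exact normalisation of the published equation may differ slightly from this reconstruction). Operators earn weight in proportion to the average distance they move the parameters of interest per unit of runtime, with the Ω floor keeping every sub-operator reachable:

    // Sketch: final-phase sampling weights of AdaptiveOperatorSampler.
    // sumDistance[i] accumulates D over accepted proposals of operator i,
    // time[i] is its cumulative runtime, accepts[i] its acceptance count.
    public class AdaptiveWeights {

        static double[] operatorWeights(double[] sumDistance, double[] time, long[] accepts) {
            final double OMEGA = 0.01;
            int k = sumDistance.length;
            double[] gain = new double[k];
            double total = 0;
            for (int i = 0; i < k; i++) {
                double meanDist = accepts[i] > 0 ? sumDistance[i] / accepts[i] : 0;
                gain[i] = time[i] > 0 ? meanDist / time[i] : 0;   // distance per second
                total += gain[i];
            }
            double[] p = new double[k];
            for (int i = 0; i < k; i++) {
                p[i] = OMEGA / k + (1 - OMEGA) * (total > 0 ? gain[i] / total : 1.0 / k);
            }
            return p;   // sampling probabilities; sums to 1
        }
    }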
The parameters of interest (POI) may be either a set of numerical parameters (such as branch rates or node heights) or the tree itself, but it cannot be both in its current form. The distance between state x_p and its (accepted) proposal x'_p with respect to parameter p is determined by

D(x_p, x'_p) = RF(x_p, x'_p) if p is the tree, and D(x_p, x'_p) = (1/|p|) Σ_i |x_{p,i} − x'_{p,i}| otherwise,

where RF is the Robinson-Foulds tree distance [37], and |p| is the number of dimensions of numerical parameter p (1 for σ, 2N − 2 for R, and 2N − 1 for node heights t).

Datasets which contain very poor signal (or small L) are likely to mix better when more weight is placed on bold operators (Fig 2). We therefore introduce the SampleFromPrior(x) operator. This operator resamples ψ randomly selected elements within vector x from their prior distributions, where ψ ~ Binomial(n = |x|, p = s/|x|) for tunable term s. SampleFromPrior is included among the set of operators under AdaptiveOperatorSampler and serves to make the boldest proposals for datasets with poor signal.

In this article we apply three instances of the AdaptiveOperatorSampler meta-operator to the real, cat, and quant parameterisations. These are summarised in Table 2.

Table 2. Summary of AdaptiveOperatorSampler operators and their parameters of interest (POI). Different operators are applicable to different substitution rate parameterisations (Table 1). AdaptiveOperatorSampler(root) applies the root-targeting constant distance operators only [31], while AdaptiveOperatorSampler(R) targets all rates and all node heights t. These two operators are weighted proportionally to the contribution of the root node to the total node count.

Meta-operator                 | POI  | Sub-operators
AdaptiveOperatorSampler(σ)    | σ    | CisScale(σ, R), RandomWalk(σ), Scale(σ), SampleFromPrior(σ)
AdaptiveOperatorSampler(R)    | R, t | ConstantDistance(R, T), RandomWalk(R), Scale(R), Interval(R), Swap(R), SampleFromPrior(R)
AdaptiveOperatorSampler(root) | R, t | SimpleDistance(R, T), SmallPulley(R, t)

Bactrian proposal kernels

The step size of a proposal kernel q(x' | x) should be such that the proposed state x' is sufficiently far from the current state x to explore vast areas of parameter space, but not so large that the proposal is rejected too often [38]. Operators which attain an acceptance probability of 0.234 are often considered to have arrived at a suitable midpoint between these two extremes [10, 38]. The standard uniform distribution kernel has recently been challenged by the family of Bactrian kernels [29, 30]. The (Gaussian) Bactrian(m) distribution is defined as the equal mixture of two normal distributions:

Σ ~ (1/2) N(−m, 1 − m²) + (1/2) N(+m, 1 − m²)

where 0 ≤ m < 1 describes modality. When m = 0, the Bactrian distribution is equivalent to the standard normal distribution. The operators which these proposal kernels apply to are described in Table 3.

Table 3. Proposal kernels q(x' | x) of clock model operators. In each operator, Σ is drawn from either a Bactrian(m) or uniform distribution. The scale size s is tunable. ConstantDistance and SimpleDistance propose tree heights t. The Interval operator applies to rate quantiles and respects its domain, i.e. 0 < x, x' < 1.
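Sampling from a Bactrian(m) kernel amounts to picking one of the two humps and drawing a Gaussian step; each component has variance 1 − m² around a mean of ±m, so the draw has mean 0 and unit variance. The sketch below (illustrative Java, not the BEAST 2 implementation) draws such a step and applies it as a random walk:

    // Sketch: drawing a step from a Gaussian Bactrian(m) kernel,
    // the equal mixture of N(-m, 1 - m^2) and N(+m, 1 - m^2).
    import java.util.Random;

    public class BactrianKernel {

        static double draw(double m, Random rng) {
            double sd = Math.sqrt(1 - m * m);
            double mean = rng.nextBoolean() ? m : -m;    // choose a hump
            return mean + sd * rng.nextGaussian();
        }

        public static void main(String[] args) {
            Random rng = new Random(1);
            double s = 0.1;                              // tunable scale size
            double x = 2.0;                              // current value
            double xNew = x + s * draw(0.95, rng);       // random walk proposal
            System.out.println(xNew);
        }
    }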
Narrow Exchange Rate

The NarrowExchange operator [39], used widely in BEAST [9, 40] and BEAST 2 [10], is similar to nearest neighbour interchange [41] and works as follows (Fig 5):

Step 1. Sample an internal/root node E from tree T, where E has grandchildren.
Step 2. Identify the child of E with the greater height. Denote this child as D and its sibling as C (i.e. t_D > t_C). If D is a leaf node, then reject the proposal.
Step 3. Randomly identify the two children of D as A and B.
Step 4. Relocate the B − D branch onto the C − E branch, so that B and C become siblings and their parent is D. All node heights remain constant.

We hypothesised that if NarrowExchange was adapted to the relaxed clock model by ensuring that genetic distances remain constant after the proposal (analogous to the constant distance operators [31]), then its ability to traverse the state space may improve.

Here, we present the NarrowExchangeRate (NER) operator. Let r_A, r_B, r_C, and r_D be the substitution rates of nodes A, B, C, and D, respectively. In addition to the modest topological change applied by NarrowExchange, NER also proposes new branch rates r'_A, r'_B, r'_C, and r'_D. While NER does not alter t_D (i.e. t'_D ← t_D), we also consider NERw, a special case of the NER operator which embarks t_D on a random walk:

t'_D ← t_D + sΣ

for random walk step size sΣ, where s is tunable and Σ is drawn from a uniform or Bactrian distribution. NER (and NERw) are compatible with both the real and quant parameterisations. Analogous to the ConstantDistance operator, the proposed rates ensure that the genetic distances between nodes A, B, C, and E are constant. Let D_ij be the constraint defined by a constant genetic distance between nodes i and j before and after the proposal. There are six pairwise distances between these four nodes and therefore there are six such constraints: D_AB, D_AC, D_AE, D_BC, D_BE, and D_CE. Further constraints are imposed by the model itself: all proposed rates must remain positive and the proposed node heights must yield a valid time tree.

Unfortunately, there is no solution to all six D_ij constraints unless non-positive rates or illegal trees are permitted. Therefore, instead of conserving all six pairwise distances, NER conserves a subset of distances. It is not immediately clear which subset should be conserved. The simplest NER kernel, the null operator denoted by NER{}, does not satisfy any distance constraints and is equivalent to NarrowExchange. To determine which NER variants have the best performance, we developed an automated pipeline for generating and testing these operators.

1. Solving rate constraints. Each subset of the six D_ij constraints was solved analytically where possible; 54 of the 64 subsets were found to be solvable, and the unsolvables were discarded.

2. Solving Jacobian determinants. The determinant of the Jacobian matrix J is required for computing the Green ratio of this proposal. J is defined as the matrix of partial derivatives of the proposed terms (r'_A, r'_B, r'_C, r'_D, t'_D) with respect to the current terms (r_A, r_B, r_C, r_D, t_D). Solving the determinant |J| invokes standard analytical differentiation and linear algebra libraries of MATLAB. 6 of the 54 solvable operators were found to have |J| = 0, corresponding to irreversible proposals, and were discarded. Each generated operator proposes new rates, calculates its Jacobian determinant, and returns (r'_A, r'_B, r'_C, r'_D, t'_D, |J|).

3. Initial screening. Each remaining operator variant was compared against NER{} across 300 simulated datasets (Fig 6).
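To make the role of these distance constraints and of |J| concrete, consider the analogous but simpler ConstantDistance operator of [31], acting on an internal node D with children A and B and parent E (this worked example follows [31]; it is not the NER solution itself). When t_D is perturbed to t'_D, conserving the three incident genetic distances r_A(t_D − t_A), r_B(t_D − t_B), and r_D(t_E − t_D) forces

r'_A = r_A (t_D − t_A)/(t'_D − t_A),  r'_B = r_B (t_D − t_B)/(t'_D − t_B),  r'_D = r_D (t_E − t_D)/(t_E − t'_D).

Each proposed rate is linear in exactly one current rate, so the Jacobian matrix is diagonal and

|J| = [(t_D − t_A)(t_D − t_B)(t_E − t_D)] / [(t'_D − t_A)(t'_D − t_B)(t_E − t'_D)].

The NER variants are derived in the same manner, except that the conserved paths, and hence the solved rates and |J|, change with the proposed topology.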
These experiments showed that NER variants which satisfied the genetic distances between nodes B and A (i.e. D_AB) or between B and C (i.e. D_BC) usually performed worse than the standard NarrowExchange operator (Fig 6). This is an intuitive result: if there is high uncertainty in the positioning of B with respect to A and C, then there is no value in respecting either of these distance constraints, and the proposals made to the rates may often be too extreme, or the Green ratio |J| too small, for the proposal to be accepted.

Finally, this initial screening showed that applying a (Bactrian) random walk to the node height t_D made the operator worse. This effect was most dominant for the NER variants which satisfied distance constraints (i.e. the operators which are not NER{}).

Although there were several operators which behaved equivalently during this initial screening process, we selected NER{D_AE, D_BE, D_CE} to proceed to benchmarking (Table 4).

Table 4. The adaptive NER operator. The Robinson-Foulds distance between trees before and after every accepted proposal is used to train the operator weights. In the special case of NER proposals, the RF distance is always equal to 1.

Meta-operator                | POI | Sub-operators
AdaptiveOperatorSampler(NER) | T   | NER{}, NER{D_AE, D_BE, D_CE}

Fig 6. The number of times each operator is proposed and accepted is compared with that of NER{}, and one-sided z-tests are performed to assess the statistical significance of the difference between the two acceptance rates (p = 0.001). This process is repeated across 300 simulated datasets. The axes of each plot are the proportion of these 300 simulations for which there is evidence that the operator is significantly better than NER{} (x-axis) or worse than NER{} (y-axis).

An adaptive leaf rate operator

The AVMVN kernel assumes its parameters live in R^N for taxon count N and that these parameters follow a multivariate normal distribution with covariance matrix Σ_N. Hence, the kernel operates on the logarithmic or logistic transformation of the N leaf branch rates, depending on the rate parameterisation:

x_i = ln(r_i) under real, or x_i = ln(q_i / (1 − q_i)) under quant,

where r is a real rate and q is a rate quantile. The AVMVN probability density is

q(x' | x) = (1 − β) MVN(x'; x, Σ_N) + β MVN(x'; x, I_N)

where MVN is the multivariate normal probability density. β = 0.05 is a constant which determines the fraction of the proposal determined by the identity matrix I_N, as opposed to the covariance matrix Σ_N, which is trained during MCMC. Our BEAST 2 implementation of the AVMVN kernel is adapted from that of BEAST [40].

LeafAVMVN has the advantage of operating on all N leaf rates simultaneously (as well as learning their correlations), as opposed to ConstantDistance, which operates on at most 2, or Scale, which operates on at most 1 leaf rate at a time. As the size of the covariance matrix Σ_N grows with the number of taxa N, LeafAVMVN is likely to be less efficient with larger taxon sets. Therefore, the weight behind this operator is learned by AdaptiveOperatorSampler.

To prevent the learned weight behind LeafAVMVN from dominating the AdaptiveOperatorSampler weighting scheme and therefore inhibiting the mixing of internal node rates, we introduce the AdaptiveOperatorSampler(leaf) and AdaptiveOperatorSampler(internal) meta-operators, which operate exclusively on leaf node rates R_leaf and internal node rates R_int, respectively (Table 5).

Table 5. Leaf rate R_leaf and internal node rate R_int operators. This division enables the two meta-operators to be weighted proportionally to the number of nodes (leaves or internal) which they apply to. This facilitates incorporation of the LeafAVMVN operator, which is only applicable to leaf nodes. In this setup, the RandomWalk(x), Scale(x), and SampleFromPrior(x) operators apply to the corresponding set of branch rates x, whereas ConstantDistance(x, T) is only applicable to internal nodes which have at least one child of type x ∈ {R_leaf, R_int}.

Meta-operator                     | POI    | Sub-operators
AdaptiveOperatorSampler(leaf)     | R_leaf | LeafAVMVN(R_leaf), ConstantDistance(R_leaf, T), RandomWalk(R_leaf), Scale(R_leaf), SampleFromPrior(R_leaf)
AdaptiveOperatorSampler(internal) | R_int  | ConstantDistance(R_int, T), RandomWalk(R_int), Scale(R_int), SampleFromPrior(R_int)
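The flavour of the AVMVN proposal can be sketched as follows (illustrative Java; the class is hypothetical, the Cholesky factor of the learned covariance is assumed to be maintained elsewhere, and the 2.38²/N and 0.1²/N scalings are the standard adaptive-Metropolis choices rather than constants taken from the BEAST implementation):

    // Sketch: an adaptive multivariate normal (AVMVN) proposal. With
    // probability 1 - beta the step uses the covariance learned during
    // MCMC; with probability beta it uses a fixed isotropic kernel.
    import java.util.Random;

    public class AvmvnSketch {

        // x: current transformed leaf rates (log-rates or logit-quantiles)
        // chol: lower-triangular Cholesky factor of the learned covariance
        static double[] propose(double[] x, double[][] chol, double beta, Random rng) {
            int n = x.length;
            double[] xNew = x.clone();
            if (rng.nextDouble() < beta) {
                // identity component: N(x, (0.1^2 / n) I)
                double sd = 0.1 / Math.sqrt(n);
                for (int i = 0; i < n; i++) xNew[i] += sd * rng.nextGaussian();
            } else {
                // learned component: N(x, (2.38^2 / n) Sigma), Sigma = chol chol^T
                double scale = 2.38 / Math.sqrt(n);
                double[] z = new double[n];
                for (int i = 0; i < n; i++) z[i] = rng.nextGaussian();
                for (int i = 0; i < n; i++)
                    for (int j = 0; j <= i; j++)
                        xNew[i] += scale * chol[i][j] * z[j];
            }
            return xNew;
        }
    }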
In all phylogenetic analyses presented here, we use a Yule [43] tree prior p(T | λ) with birth rate λ ~ Log-normal(1, 1.25). Here and throughout the article, a Log-normal(a, b) distribution is parameterised such that a and b are the mean and standard deviation in log-space. The clock standard deviation has a σ ~ Gamma(0.5396, 0.3819) prior. Datasets are partitioned into subsequences, where each partition is associated with a distinct HKY substitution model [44] with its own transition-transversion ratio. For benchmarking of larger datasets we use BEAGLE for high-performance tree likelihood calculations [45] and coupled MCMC with four chains for efficient mixing [27]. The neighbour joining tree [46] is used as the initial state in each MCMC chain.

Table 6. Summary of clock model operators introduced throughout this article. Pre-existing clock model operators are summarised in Table 1.

Operator                | Description
AdaptiveOperatorSampler | Samples sub-operators proportionally to their weights, which are learned (see Adaptive operator weighting).
SampleFromPrior         | Resamples a random number of elements from their prior (see Adaptive operator weighting).
NarrowExchangeRate      | Moves a branch and recomputes branch rates so that their genetic distances are constant (see Narrow Exchange Rate).
LeafAVMVN               | Proposes new rates for all leaves in one move (see An adaptive leaf rate operator) [28].

Table 7. Operator configurations and the substitution rate parameterisations which each operator is applicable to. Within each configuration (and substitution rate parameterisation), the weight behind R sums to 30, the weight of σ is equal to 10, and the weight of NER is equal to 15. Operators which apply to specific node sets (root, internal, leaf, or all) are weighted according to leaf count N. The adaptive operators are further broken down in Tables 2, 4, and 5. All other operators (i.e. those which apply to other terms in the state, such as the nucleotide substitution model) are held constant within each dataset.

To avoid a cross-product explosion, the five targets for clock model improvement were evaluated sequentially in the following order: Adaptive operator weighting, Branch rate parameterisations, Bactrian proposal kernels, Narrow Exchange Rate, and An adaptive leaf rate operator. The four operators introduced in these sections are summarised in Table 6. The setting which was considered to be the best in each step was then incorporated into the following step. This protocol and its outcomes are summarised in Fig 7.

Methodologies were benchmarked using one simulated and eight empirical datasets. The latter were compiled [47] and partitioned [48] as summarised in Table 8.

Table 8. Benchmark datasets, sorted in increasing order of taxon count N. The number of partitions P, total alignment length L, and number of unique site patterns L_unq in the alignment are also specified. Clock standard deviation estimates σ̂ are of moderate magnitude, suggesting that most of these datasets are not clock-like.
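Throughout the following rounds, mixing is reported as effective samples per hour (ESS/hr). As a reminder of what is being measured, the sketch below (illustrative Java; a simplified version of the usual autocorrelation-based ESS estimator, not code from any of the packages used here) computes the ESS of a single parameter trace:

    // Sketch: effective sample size of an MCMC trace,
    // ESS = n / (1 + 2 * sum of positive lag autocorrelations).
    public class EssSketch {

        static double ess(double[] trace) {
            int n = trace.length;
            double mean = 0;
            for (double v : trace) mean += v / n;
            double var = 0;
            for (double v : trace) var += (v - mean) * (v - mean) / n;
            double rhoSum = 0;
            for (int lag = 1; lag < n; lag++) {
                double acf = 0;
                for (int i = 0; i + lag < n; i++)
                    acf += (trace[i] - mean) * (trace[i + lag] - mean);
                acf /= n * var;
                if (acf < 0) break;     // simple truncation rule
                rhoSum += acf;
            }
            return n / (1 + 2 * rhoSum);
        }
    }

Dividing the ESS by wall-clock runtime gives the ESS/hr figures reported below.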
Round 1: A simple operator-weight learning algorithm greatly improved performance

We compared the nocons, cons, and adapt operator configurations (Table 7). nocons contained all of the standard BEAST 2 operator configurations and weightings for real, cat, and quant. cons additionally contained (cons)tant distance operators and employed the same operator weighting scheme used previously [31] (real and quant only). Finally, the adapt configuration combined all of the above applicable operators, as well as the simple-but-bold SampleFromPrior operator, and learned the weights of each operator using the AdaptiveOperatorSampler.

This experiment revealed that nocons usually performed better than cons on smaller datasets (i.e. small L), while cons consistently performed better on larger datasets (Fig 8 and S1 Fig). This result is unsurprising (Fig 2). Furthermore, the adapt setup dramatically improved mixing for real by finding the right balance between cons and nocons. This yielded an ESS/hr (averaged across all 9 datasets) 95% faster than cons and 520% faster than nocons with respect to leaf branch rates, and 620% and 190% faster for σ. Similar results were observed with quant. However, adapt neither helped nor harmed cat, suggesting that the default operator weighting scheme was sufficient.

This experiment also revealed that the standard Scale operator was preferred over CisScale for the real configuration; averaged across all datasets, the learned weights favoured Scale.

Round 2: The real parameterisation outperformed rate categories and quantiles

We compared the three rate parameterisations described in Branch rate parameterisations. adapt (real) and adapt (quant) both employed constant distance tree operators [31], and both used the AdaptiveOperatorSampler operator to learn clock model operator weights. Clock model operator weights were also learned in the adapt (cat) configuration.

This experiment showed that the real parameterisation greatly outperformed cat on most datasets and most parameters (Fig 9). This disparity was strongest for long alignments. In the most extreme case, leaf substitution rates r and clock standard deviation σ both mixed around 50× faster on the 15.5 kb seed plant dataset (Ran et al. 2018 [49]) for real than they did for cat. The advantages in using constant distance operators would likely be even stronger for larger L. Furthermore, real outperformed quant on most datasets, but this was mostly due to the slow computational performance of quant compared with real, as opposed to differences in mixing prowess (Fig 10). Irrespective of mixing ability, the adapt (real) configuration had the best computational performance and generated samples 40% faster than adapt (cat) and 60% faster than adapt (quant).

Overall, we determined that real, and its associated operators, made the best parameterisation covered here, and it proceeded to the following rounds of benchmarking.

Round 3: Bactrian proposal kernels were around 15% more efficient than uniform kernels

We benchmarked the adapt (real) configuration with a) standard uniform proposal kernels and b) Bactrian(0.95) kernels [29]. These kernels applied to all clock model operators (Table 3). These results confirmed that the Bactrian kernel yields faster mixing than the standard uniform kernel (Fig 11). All relevant continuous parameters considered had an ESS/hr, averaged across the 9 datasets, between 15% and 20% faster compared with the standard uniform kernel.

Round 4: Narrow exchange rate operators traversed tree space more efficiently on large datasets

We then benchmarked the NER operators (Table 7). The benchmark datasets are fairly non clock-like and therefore could potentially benefit from NER (Table 8). Our experiments confirmed that NER{D_AE, D_BE, D_CE} was indeed superior on larger datasets (where L > 5 kb; Fig 12), although it was slightly less efficient on smaller ones.

Round 5: The AVMVN leaf rate operator was computationally demanding and improved mixing very slightly

We tested the applicability of the AVMVN kernel to leaf rate proposals. This operator exploits any correlations which exist between leaf branch substitution rates.
To do this, we wrapped the LeafAVMVN operator within an AdaptiveOperatorSampler (Table 5). The two configurations compared here were a) adapt + Bactrian + NER (real) and b) AVMVN + NER + Bactrian + adapt (real) (Table 7).

Fig 13. Learned operator weight as a function of sequence length L (kb). The dataset of [53] received the strongest boost from the NER{D_AE, D_BE, D_CE} operator, likely due to its high topological uncertainty and branch rate variance. Tree generated by UglyTrees [57].

These results showed that the AVMVN operator yielded slightly better mixing (around 6% faster) for the tree likelihood, the tree length, and the mean branch rate (Fig 14). However, it also produced slightly slower mixing for κ, reflecting the high computational costs associated with the LeafAVMVN operator (Fig 10). The learned weight of the LeafAVMVN operator was quite small (ranging from 1 to 8% across all datasets), again reflecting its costly nature but also reinforcing the value of having an adaptive weight operator which penalises slow operators. The LeafAVMVN operator provided some, but not much, benefit in its current form.

Fig 14. Comparison of AVMVN + NER + Bactrian + adapt (real) and NER + Bactrian + adapt (real).

Overall, we determined that the AVMVN operator configuration was the final winner of the tournament; however, its performance benefits were minor.

In conjunction with all settings which came before it, the tournament winner outperformed both the historical cat configuration [3] as well as the recently developed cons (real) scheme [31]. Averaged across all datasets, this configuration yielded a relaxed clock mixing rate between 1.2 and 13 times as fast as cat and between 1.8 and 7.8 times as fast as cons (real), depending on the parameter. For the largest dataset considered (seed plants by Ran et al. 2018 [49]), the new settings were up to 66 and 37 times as fast, respectively. This is likely to be even more extreme for larger alignments.

Discussion

Modern operator design

Adaptability and advanced proposal kernels, such as Bactrian kernels, are increasingly prevalent in MCMC operator design [58-61]. Adaptive operators undergo training to improve their efficiency over time [62]. In previous work, the conditional clade probabilities of neighbouring trees have served as the basis of adaptive tree operators [24, 25]. Proposal step sizes can be tuned during MCMC [35]. The mirror kernel learns a target distribution which acts as a "mirror image" of the current point [30]. The AVMVN operator learns correlations between numerical parameters in order to traverse the joint posterior distribution efficiently [28].

Here, we introduced an adaptive operator which learns the weights of other operators, by using a target function that rewards operators which bring about large changes to the state and penalises operators which exhibit poor computational runtime (Eq 11). We demonstrated how learning the operator weightings, on a dataset-by-dataset basis, can improve mixing by up to an order of magnitude. We also demonstrated the versatility of this operator by applying it to a variety of settings. Assigning operator weights is an important task in Bayesian MCMC inference, and the use of such an operator can relieve some of the burden from the person making this decision. However, this operator is no silver bullet, and it must be used in a way that maintains ergodicity within the search space [62].

We also found that a Bactrian proposal kernel quite reliably increased mixing efficiency by 15-20% (Fig 11). Similar observations were made by Yang et al. [29]. While this may only be a modest improvement, incorporation of the Bactrian kernels into pre-existing operators is a computationally straightforward task, and we recommend implementing them in Bayesian MCMC software packages.

Traversing tree space

In this article we introduced the family of narrow exchange rate operators (Fig 5). "Branch-rearrangement" operators relocate a branch and thus alter tree topology. Members of this class include narrow exchange, nearest neighbour interchange, and subtree-prune-and-regraft [41]. In contrast, "branch-length" operators propose branch lengths but can potentially alter the tree topology as a side-effect. Such operators include subtree slide [63], LOCAL [64], and continuous change [65]. Lakner et al. 2008 observed that topological proposals made by the former class consistently outperformed topological changes invoked by the latter [66].

We hypothesised that the increased efficiency behind narrow exchange rate operators could facilitate proposing internal node heights in conjunction with branch rearrangements. This would enable the efficient exploration of both topology and branch length spaces with a single proposal. Unfortunately, by incorporating a random walk on the height of the node being relocated, the acceptance rate of the operator declined dramatically (Fig 6). This decline was greater when more genetic distances were conserved. These findings support Lakner's hypothesis. The design of operators which are able to efficiently traverse topological and branch length spaces simultaneously remains an open problem.

Larger datasets require smarter operators

As signal within the dataset becomes stronger, the posterior distribution becomes increasingly peaked (Fig 2). This change in the posterior topology necessitates the use of operators which exploit known correlations in the posterior density: operators such as AVMVN [28], constant distance [31], and the narrow exchange rate operators introduced in this article.

We have shown that while the latter two operators are efficient on large alignments, they are also quite frequently outperformed by simple random walk operators on small alignments. For instance, we found that constant distance operators outperformed standard operator configurations by up to two orders of magnitude on larger datasets but were up to three times slower on smaller ones (Fig 8). Similarly, our narrow exchange rate operators were up to 40% more efficient on large datasets but up to 10% less efficient on smaller ones (Fig 12).

This emphasises the value of our adaptive operator weighting scheme, which can ensure that operator weights are suitable for the size of the alignment. Given the overwhelming availability of sequence data, high performance on large datasets is more important than ever.
Conclusion

In this article, we delved into the highly correlated structure between substitution rates and divergence times of relaxed clock models in order to develop MCMC operators which traverse its posterior space efficiently. We introduced a range of relaxed clock model operators and compared three molecular substitution rate parameterisations. These methodologies were compared by constructing phylogenetic models from several empirical datasets and comparing their abilities to converge in a tournament-like protocol (Fig 7). The methods introduced are adaptive, treat each dataset differently, and rarely perform worse than without adaptation. This work has produced an operator configuration which is highly effective on large alignments and can explore relaxed clock model space up to two orders of magnitude more efficiently than previous setups.

References

- Molecular disease, evolution, and genetic heterogeneity. Horizons in Biochemistry.
- Local molecular clocks in three nuclear genes: divergence times for rodents and other mammals and incompatibility among fossil calibrations.
- Relaxed phylogenetics and dating with confidence.
- Estimating effective population size and mutation rate from sequence data using Metropolis-Hastings sampling.
- Markov chain Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees. Molecular Biology and Evolution.
- Bayesian phylogenetic inference via Markov chain Monte Carlo methods.
- Equation of state calculations by fast computing machines.
- Monte-Carlo sampling methods using Markov chains and their applications.
- Bayesian phylogenetics with BEAUti and the BEAST 1.7. Molecular Biology and Evolution.
- BEAST 2.5: An advanced software platform for Bayesian evolutionary analysis.
- MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space.
- RevBayes: Bayesian phylogenetic inference using graphical models and an interactive model-specification language.
- Evolutionary divergence and convergence in proteins.
- The causes of molecular evolution.
- Effective population size and the rate and pattern of nucleotide substitutions.
- Optimization of DNA polymerase mutation rates during bacterial evolution.
- A general comparison of relaxed molecular clock models. Molecular Biology and Evolution.
- Model averaging and Bayes factor calculation of relaxed molecular clocks in Bayesian phylogenetics.
- Establishment and cryptic transmission of Zika virus in Brazil and the Americas.
- The first two cases of 2019-nCoV in Italy: Where they come from?
- A compound Poisson process for relaxing the molecular clock.
- Bayesian random local clocks, or one rate to rule them all.
- Using parsimony-guided tree proposals to accelerate convergence in Bayesian phylogenetic inference.
- Adaptive Tree Proposals for Bayesian Phylogenetic Inference.
- Guided tree topology proposals for Bayesian phylogenetic inference. Systematic Biology.
- Parallel Metropolis coupled Markov chain Monte Carlo for Bayesian phylogenetic inference.
- Adaptive Metropolis-coupled MCMC for BEAST 2.
- Adaptive MCMC in Bayesian phylogenetics: an application to analyzing partitioned data in BEAST.
- Searching for efficient Markov chain Monte Carlo proposal kernels.
- Designing simple and efficient Markov chain Monte Carlo proposal kernels.
- Improving the performance of Bayesian phylogenetic inference under relaxed clock models.
- Reversible jump Markov chain Monte Carlo computation and Bayesian model determination.
- The Metropolis-Hastings-Green algorithm.
- Parameterization and Bayesian modeling.
- Optimal proposal distributions and adaptive MCMC. Handbook of Markov Chain Monte Carlo.
- Evolution of protein molecules. Mammalian Protein Metabolism.
- Comparison of phylogenetic trees.
- Weak convergence and optimal scaling of random walk Metropolis algorithms. The Annals of Applied Probability.
- Estimating mutation parameters, population history and genealogy simultaneously from temporally spaced sequence data.
- Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10. Virus Evolution.
- A mathematical theory of evolution, based on the conclusions of Dr. J. C. Willis, F.R.S. Philosophical Transactions of the Royal Society of London, Series B, Containing Papers of a Biological Character.
- Dating of the human-ape splitting by a molecular clock of mitochondrial DNA.
- BEAGLE: an application programming interface and high-performance computing library for statistical phylogenetics.
- The neighbor-joining method: a new method for reconstructing phylogenetic trees.
- PartitionFinder 2: new methods for selecting partitioned models of evolution for molecular and morphological phylogenetic analyses. Molecular Biology and Evolution.
- Phylogenomics resolves the deep phylogeny of seed plants and indicates partial convergent or homoplastic evolution between Gnetales and angiosperms.
- Molecular phylogenetics of squirrelfishes and soldierfishes (Teleostei: Beryciformes: Holocentridae): Reconciling more than 100 years of taxonomic confusion.
- Exploring Data Interaction and Nucleotide Alignment in a Multiple Gene Analysis of Ips (Coleoptera: Scolytinae).
- Testing the Impact of Calibration on Molecular Divergence Times Using a Fossil-Rich Group: The Case of Nothofagus (Fagales).
- Multi-locus phylogenetic analysis reveals the pattern and tempo of bony fish evolution.
- Convergent evolution of morphology and habitat use in the explosive Hawaiian fancy case caterpillar radiation.
- Phylogeny and systematics of the bee genus Osmia (Hymenoptera: Megachilidae) with emphasis on North American Melanosmia: subgenera, synonymies and nesting biology revisited.
- Tectonic collision and uplift of Wallacea triggered the global songbird radiation.
- UglyTrees: a browser-based multispecies coalescent tree visualiser.
- An adaptive Metropolis algorithm.
- Robust adaptive Metropolis algorithm with coerced acceptance rate.
- Adaptive MCMC for multiple changepoint analysis with applications to large datasets.
- Blocking borehole conductivity logs at the resolution of above-ground electromagnetic systems.
- Coupling and ergodicity of adaptive Markov chain Monte Carlo algorithms.
- Clock-constrained tree proposal operators in Bayesian phylogenetic inference.
- Bayesian analysis in molecular biology and evolution.
- Bayesian phylogenetics using an RNA substitution model applied to early mammalian evolution.
- Efficiency of Markov chain Monte Carlo tree proposals in Bayesian phylogenetics.