title: Penalised t-walk MCMC
authors: Medina-Aguayo, Felipe J; Christen, J Andrés
date: 2020-12-03

Handling multimodality that commonly arises from complicated statistical models remains a challenge. Current Markov chain Monte Carlo (MCMC) methodology tackling this subject is based on an ensemble of chains targeting a product of power-tempered distributions. Despite the theoretical validity of such methods, practical implementations typically suffer from bad mixing and slow convergence due to the high computational cost involved. In this work we study novel extensions of the t-walk algorithm, an existing MCMC method that is inexpensive and invariant to affine transformations of the state space, for dealing with multimodal distributions. We acknowledge that the effectiveness of the new method will be problem dependent and might struggle in complex scenarios; for such cases we propose a post-processing technique based on pseudo-marginal theory for combining isolated samples.

Performing Bayesian analyses has become routine in many areas of application and research and, in turn, these are usually carried out using Monte Carlo methodology [see e.g. 27]. A popular and effective method belonging to this class is Markov chain Monte Carlo (MCMC), which is based on the construction of a Markov chain that converges to the desired target or posterior distribution [see 18, for example]. However, naive implementations of MCMC may result in chains that require a prohibitively large number of iterations in order to be of any use. For example, implementing the celebrated Metropolis-Hastings (MH) algorithm [23, 17] using highly local proposals may result in chains that do not explore the target fully in a reasonable amount of time. Thus, constructing informative proposals is essential for obtaining useful samplers. This may be achieved by incorporating gradient information from the target, as done by Langevin and Hamiltonian algorithms [30, 25]. Another possibility, which avoids computing derivatives, is adaptive MCMC [16, 29], aiming at learning the shape of the target in an online manner and modifying the proposal accordingly. Other approaches rely solely on extending the state space, and consequently extending the target distribution, for constructing good proposals [12]. Another example of the latter is the t-walk algorithm [19], which will be our subject of study. It is based on 4 different moves that are invariant under affine transformations [the "traverse", the "walk", the "hop" and the "blow" moves, see 19, for details]. The t-walk is implemented in an official R package (rtwalk), and in Python, Matlab, C and C++, see https://www.cimat.mx/~jac/twalk/.

Nevertheless, and similarly to other MCMC methods, in the presence of multimodality the t-walk may become trapped in a subset of modes, especially when two or more modes are separated by large regions (valleys) of very small probability. The chain becomes trapped since the proposals are not bold enough for jumping into regions that are far relative to where the chain currently lives. In this respect, from a Bayesian perspective one may naively argue that multimodal posteriors are merely the result of a bad modelling process or of a poor choice of priors.
However, multimodality is not an artificial phenomenon resulting from poor decisions by the user; it commonly arises and is unavoidable in many well-studied Bayesian problems. In particular, inference involving non-linear regressors, as in Bayesian inverse problems, may result in complex multimodal posteriors. Some examples include the FitzHugh-Nagumo model [11, 24], a soil organic matter decomposition model from [35], and the epidemiological SIR model describing a black plague outbreak in 1666 [22, 20]. There is indeed a need to develop better MCMC methods to address multimodality in posterior distributions arising from modern Bayesian inferences.

Currently, the state-of-the-art method for dealing with multimodality is parallel tempering (PT) [see 36, 14, for example]. PT is based on several interacting chains that converge to a product of tempered distributions, one of which is the desired posterior. The chains interact via a series of swap moves that in principle help exploring multiple modes. The challenge, however, is the computational cost involved, as one usually requires a large number of tempered distributions and iterations [40] for the method to work correctly. Hence, this and similar approaches might not deal well with multimodality, since the computational expense may render them unfeasible.

The t-walk can sample from posteriors with several modes of different scales [see 19, figs. 2 and 3] but, as with most other methods, once these modes are separated by areas of very low density it may get trapped in one particular region of the sample space. Various starting points should always be used in the practice of MCMC, increasing the chance of discovering whether and where our chains got trapped in different modes. How to join these samples from different regions into one that has the correct target is not obvious. The t-walk has proved its value and has been used in several studies [9, 39, 5, 10, 38, 32, 31, 34, 33]; it is currently used routinely by several users, in data analysis software in paleoecology [7], and recently in COVID-19 epidemic modelling [8, 1]. Certainly, it is not redundant to improve the performance of the t-walk in the presence of multimodal targets. Moreover, the ideas presented here could in principle be imported to improve the performance of other MCMC algorithms.

In this work we propose studying the t-walk algorithm in the presence of multimodality by including an additional 5-th move to the standard 4 moves already mentioned. This additional move is constructed using a "penalised proposal". The idea is to construct proposals, easy to sample from, that penalise a region where the chain currently stands and that, ideally, favour regions with meaningful posterior mass located further away. Such a penalisation is developed in Section 2 and followed by some illustrative examples in Section 3. In case the penalisation fails, in Section 4 our aim is to combine two different chains, targeting the same posterior, that were unable to jump between modes; that is, they got stuck in two separate regions of the state space. Hence, these pseudo-samples are distributed according to different sub-posteriors that need to be combined in a particular way for obtaining a true sample from the target. The procedure we propose is based on pseudo-marginal methodology [3], which introduces unbiased estimators of intractable quantities within an MCMC algorithm.
Finally, in Section 5 we provide a final discussion and possible extensions of this work.

Our interest is to describe a posterior or target distribution π on some measurable space (X, B(X)), X ⊆ R^d. We also assume throughout that the distributions presented have densities with respect to the Lebesgue measure. As commonly arises in Bayesian inference, the target density has the following generic form

π(x) = γ(x) / Z,

where γ is the unnormalised version of π and where Z = ∫_X γ(dx) is the intractable normalising constant. An effective method for describing the posterior π is through simulation using Markov chain Monte Carlo (MCMC) methods. These methods work by constructing a Markov chain (X_t)_{t≥0}, on X, such that the target π is the limiting stationary distribution. Thus, expectations of the form π(h) := ∫_X h(x) π(dx) can be approximated using ergodic averages,

π(h) ≈ (T − b)^{-1} ∑_{t=b+1}^{T} h(X_t).

The integer b in the previous equation defines the "burn-in period" of the chain, meaning that the portion (X_t)_{t≤b} is discarded when estimating π(h), as a pseudo-transient stage of the chain.

The t-walk algorithm introduced in [19] belongs to the aforementioned class of MCMC methods, but differs from standard implementations by exploring an extended target distribution π̄, defined on (X², B(X²)), with density given by

π̄(x, y) = π(x) π(y), for (x, y) ∈ X².

Consequently, the resulting chain (X_t, Y_t)_{t≥0} takes values on X² and converges marginally towards π, in the sense that either (X_t)_t or (Y_t)_t converges to π. In this case, working on a larger space is useful as one is able to construct informative proposals combining both points, that is proposals of the kind q(· | X_t, Y_t) or q(· | Y_t, X_t), which are key for exploring the state space efficiently and thus for converging rapidly towards π.

At the moment, the 4 different moves implemented in the t-walk algorithm may not be able to overcome the complexity that arises from a target having multiple modes. For such cases we are proposing a 5-th move, which we have termed the "penalty" move, that proposes values from regions that are relatively distant to the current state of the chain (X_t, Y_t). As the name suggests, this is achieved by penalising a neighbourhood close to (X_t, Y_t), which creates a flattening effect on some standard density from a distribution that is easy to sample from. The right plot of Figure 1 shows an example of this effect. Therefore, we introduce a family of penalty functions {ϕ_xy}_{(x,y)∈X²}, with ϕ_xy : X → [0, 1], that modify any given proposal in such a way that sampling near a region informed by (x, y) is not very likely. For instance, assuming the current state of the t-walk chain is (X_t, Y_t) = (x, y), the penalised version of a proposal distribution g(· | x, y) defined on (X, B(X)) will have a density

g̃(w | x, y) ∝ g(w | x, y) ϕ_xy(w).   (1)

Notice that the t-walk chain lies on X² whereas the proposals g and g̃ are defined on X. This apparent problem is overcome in the following section, where we describe a simple transformation for obtaining two components, each belonging to X, out of a single draw from g̃. In addition, it is worth mentioning that the choice of g is in principle arbitrary; in fact one could consider penalising one of the four moves in the t-walk. However, we do not follow this approach as the existing moves in the t-walk algorithm only update one of the two components at a time (either x or y), which could be problematic since components may get stuck in different modes. Instead, for reasons that will become clear in the following sections, our choice for g will be simply a symmetric distribution that is easy to sample from, e.g. a Gaussian or t-distribution. Moreover, we will discuss feasible approaches for defining and sampling from g̃, both using the gradient of log π and not.
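To make the flattening effect in (1) concrete, the following minimal Python sketch evaluates an unnormalised penalised density g̃ ∝ g · ϕ_xy on a grid, using a Gaussian proposal g and a flipped-Gaussian penalty of the kind introduced next; all parameter values here are illustrative assumptions rather than settings taken from the paper.

```python
import numpy as np
from scipy.stats import norm

def penalty(w, mu, sigma):
    # Flipped-Gaussian penalty: exactly zero at the centre mu,
    # increasing towards 1 as w moves away from mu.
    return 1.0 - np.exp(-0.5 * ((w - mu) / sigma) ** 2)

def penalised_density(w, mu, sigma_g, sigma_phi):
    # Unnormalised version of (1): proposal g times penalty phi.
    return norm.pdf(w, loc=mu, scale=sigma_g) * penalty(w, mu, sigma_phi)

# Illustrative values (assumed): centre 0, broad proposal, narrower penalty.
grid = np.linspace(-10.0, 10.0, 1001)
g_tilde = penalised_density(grid, mu=0.0, sigma_g=3.0, sigma_phi=1.0)
g_tilde /= np.trapz(g_tilde, grid)  # normalise numerically, e.g. for plotting
```

The resulting density vanishes at the centre and places most of its mass away from it, reproducing the flattening effect illustrated in Figure 1.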
We finish this section introducing possible choices for ϕ_xy. The general idea of the penalty function ϕ_xy is that at its centre, say µ_xy, the value of the resulting penalised density g̃ will be exactly zero, and as we start moving away from µ_xy the value of the density will increase. As an example we could consider flipping and renormalising a Gaussian density, i.e.

ϕ_xy(w) = 1 − exp{−(w − µ_xy)^T Σ_xy^{-1} (w − µ_xy) / 2},

for some Σ_xy. Notice that ϕ_xy ∈ [0, 1) and, when multiplied by some g, the resulting penalised proposal g̃ will be flattened around µ_xy, as shown in Figure 1. Other penalties can be constructed from continuously differentiable functions with location and scale parameters µ_xy and Σ_xy determined by the current state (x, y). In general terms we will restrict to penalty functions ϕ_xy of the form

ϕ_xy(w) = 1 − ρ(Σ_xy^{-1/2} (w − µ_xy)) / ρ(0),   (2)

where ρ is some standard symmetric density and therefore ρ(0)^{-1} rescales the expression so that ϕ_xy takes values in [0, 1).

We now construct a density g̃ that we can easily simulate from and that is informed by the gradient of log π. To define g̃ consider a symmetric distribution g(· | x, y) centred at µ̃_xy with dispersion parameter Σ̃_xy, for example a Gaussian or t-distribution with central parameter µ̃_xy = (x + y)/2 and dispersion Σ̃_xy = diag(|x − y|²). Notice that the previous parameters may be different to those from the penalty function ϕ_xy introduced in the previous section. In any case, we assume that it is computationally straightforward to simulate from g. The following proposition is an extension of [21, Proposition 3.1] and will be useful for constructing, and simulating from, g̃_xy.

Proposition 1. Suppose X ⊆ R^d, and assume that a penalty function φ̃_xy satisfies, for some quantity µ̃_xy ∈ X,

φ̃_xy(µ̃_xy + u) + φ̃_xy(µ̃_xy − u) = 1, for all u ∈ R^d;   (3)

then, for any symmetric distribution g(· | x, y) around µ̃_xy and defined on R^d,

∫_{R^d} g(w | x, y) φ̃_xy(w) dw = 1/2.

Proof. For simplicity we omit the dependence on (x, y) throughout the proof. First notice that, substituting w = µ̃ + u,

∫_{R^d} g(w) φ̃(w) dw = ∫_{R^d} g(µ̃ + u) φ̃(µ̃ + u) du.

Consider now one of the 2^d hyperoctants defined on R^d and, without loss of generality, take the hyperoctant R^{+d} where all the components are positive. Using (3) and the symmetry of g, the above integral restricted to R^{+d} and its opposite hyperoctant −R^{+d} becomes

∫_{R^{+d}} [g(µ̃ + u) φ̃(µ̃ + u) + g(µ̃ − u) φ̃(µ̃ − u)] du = ∫_{R^{+d}} g(µ̃ + u) [φ̃(µ̃ + u) + φ̃(µ̃ − u)] du = ∫_{R^{+d}} g(µ̃ + u) du.

Following the same logic for the remaining pairs of opposite hyperoctants, and adding all integrals to integrate over the whole space, we obtain

∫_{R^d} g(w) φ̃(w) dw = (1/2) ∫_{R^d} g(µ̃ + u) du = 1/2. □

A direct implication from the previous result is that

g̃_xy(w) = 2 g(w | x, y) φ̃_xy(w),

which does not involve an intractable normalising constant; moreover, one can simulate from the associated distribution easily using Algorithm 1, as proven in [21, Proposition 3.2].

Algorithm 1 Penalised proposal sampler using φ̃_xy
INPUT: Current value for the chain (x, y) ∈ X² and centre µ̃_xy.
OUTPUT: Draw from g̃ involving the penalty φ̃_xy.
1. Simulate and store w ← W ∼ g(· | x, y) and ω ← Ω ∼ Unif(0, 1);
2. If ω ≤ φ̃_xy(w) return w; otherwise return the reflected point 2µ̃_xy − w.

In order to see why this is a valid procedure, suppose the centre µ̃_xy = 0; then returning a value w as proposed by g occurs with probability φ̃_xy(w). In contrast, according to the algorithm, the sign of w will be flipped with probability 1 − φ̃_xy(w) = φ̃_xy(−w), but (due to the symmetry of g) proposing −w is equally as likely as proposing w. Hence, instead of discarding w after not being accepted, the value can be "reused" by returning −w, which had the same probability of being proposed. The problem, however, is still how to choose a suitable φ̃_xy.
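As a quick sanity check on Proposition 1, the Python sketch below verifies by quadrature that a penalty satisfying the symmetry condition (3) integrates against a symmetric g to 1/2, and implements the sign-flip "reuse" step of Algorithm 1; the logistic penalty used here anticipates the Barker-type choice (4) of the next section, and all numerical values are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

mu_tilde = 0.0   # assumed common centre of g and of the penalty
grad_l = 1.5     # assumed value of the gradient of log-pi at the centre

def phi_tilde(w):
    # Logistic (Barker-type) penalty; satisfies condition (3) around mu_tilde.
    return 1.0 / (1.0 + np.exp((w - mu_tilde) * grad_l))

# Quadrature check of Proposition 1: integral of g * phi_tilde equals 1/2.
grid = np.linspace(-12.0, 12.0, 200001)
val = np.trapz(norm.pdf(grid, loc=mu_tilde, scale=2.0) * phi_tilde(grid), grid)
print(f"integral = {val:.4f}")  # prints approximately 0.5000

def sample_g_tilde(rng, n):
    # Algorithm 1: keep w with probability phi_tilde(w); otherwise
    # reuse it by reflecting through the centre instead of discarding.
    w = rng.normal(loc=mu_tilde, scale=2.0, size=n)
    flip = rng.uniform(size=n) > phi_tilde(w)
    w[flip] = 2.0 * mu_tilde - w[flip]
    return w

rng = np.random.default_rng(0)
draws = sample_g_tilde(rng, 100_000)  # draws from g_tilde = 2 * g * phi_tilde
```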
Leaving the formality aside for a moment, suppose we could sample from the reciprocal of π, i.e. a distribution with density proportional to 1/π(x). Incorporating such a move into an MCMC sampler could be useful, as one would be able to escape from the attraction of a local mode. We pursue this idea, in an approximate manner, by defining the penalty

φ̃_xy(w) = (1 + exp{(w − µ̃_xy)^T ∇l(µ̃_xy)})^{-1},   (4)

where l(u) := log(π(u)). The choice in (4) approximates Barker's acceptance probability since, using a Taylor approximation,

(1 + exp{l(w) − l(µ̃_xy)})^{-1} ≈ (1 + exp{(w − µ̃_xy)^T ∇l(µ̃_xy)})^{-1}.

Hence φ̃_xy approximates the acceptance probability of an MCMC sampler targeting the reciprocal of π. Finally, notice that (4) also satisfies the symmetry condition in (3); therefore, a sample from the resulting proposal g̃ could be seen as a crude approximation from the reciprocal of π.

Having simulated W in Algorithm 1, we can propose new values (U, V) for the chain using either

(U, V) = (W + (x − y)/2, W − (x − y)/2)   (5)

or

(U, V) = (W − (x − y)/2, W + (x − y)/2).   (6)

We now present a simple toy example that illustrates the advantage of introducing a penalised move for a target with very well-separated modes.

Example 1. Consider a 2-dimensional Gaussian mixture target with 2 components and with density given by

π(x₁, x₂) = 0.5 N((x₁, x₂) | µ₁, Σ₁) + 0.5 N((x₁, x₂) | µ₂, Σ₂),

where µ₁ = (0, 0) and µ₂ = (20, −20). The target involves two very different components, one with low and another with high correlation. Exploring these components separately would not represent a challenge for the t-walk algorithm; however, the relatively large separation can be problematic. Figure 2 shows trace plots of the t-walk algorithm (left plot) and the penalised version using the gradient described above (right plot). Notice that the t-walk chain is 10 times longer (500 thousand iterations) than the penalised version; the reason for this was to take into account that the latter is more computationally expensive due to gradient computations. Nonetheless, observe how the t-walk only jumps to the second Gaussian component once, whereas the penalised version visits the two modes regularly.

We now turn to a penalised version which in essence is much simpler and will not require the computation of gradients. Having gradient information is beneficial but not widely applicable; e.g. Bayesian inverse problems commonly involve numerical solutions to differential equations, so obtaining gradients with respect to fixed parameters is not straightforward. In this section we explore much simpler penalised proposals that are easy to evaluate and to sample from. The idea is to propose a point that is distant to the current location of the chain, which is summarised by the single point µ_xy ∈ X, e.g. µ_xy = (x + y)/2. We consider once more a penalised proposal as in (1), now with explicit density

g̃(w | x, y) = Z_xy^{-1} g(w | x, y) ϕ_xy(w),

where Z_xy = ∫_X g(dw | x, y) ϕ_xy(w) is the normalisation constant and g is an easy to simulate, heavy-tailed distribution, typically a multivariate t-distribution. Simulating from g̃ can then proceed via rejection sampling, as described in Algorithm 2, due to the fact that Z_xy ≤ 1 since ϕ_xy ∈ (0, 1) for any (x, y) ∈ X².

Algorithm 2 Penalised proposal sampler using ϕ_xy
INPUT: Current value for the chain (x, y) ∈ X².
OUTPUT: Draw from g̃ involving the penalty ϕ_xy.
1. Simulate and store w ← W ∼ g(· | x, y) and ω ← Ω ∼ Unif(0, 1);
2. If ϕ_xy(w) < ω go back to step 1; else go to step 3;
3. Return w.
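The following minimal Python sketch implements the rejection step of Algorithm 2, with a heavy-tailed multivariate t proposal and a penalty of the form (2); the degrees of freedom and the factor κ follow the default configuration reported below, while the remaining values (the states x and y) are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_t

def penalty(w, mu, scale_diag):
    # Penalty of the form (2): 1 - rho(standardised w) / rho(0),
    # with rho a standard multivariate t density with 2 df.
    d = len(mu)
    rho = multivariate_t(loc=np.zeros(d), shape=np.eye(d), df=2)
    z = (w - mu) / scale_diag               # Sigma_xy^(-1/2) (w - mu)
    return 1.0 - rho.pdf(z) / rho.pdf(np.zeros(d))

def sample_penalised(rng, x, y, kappa=3.0):
    # Algorithm 2: rejection sampling from g * phi with g a multivariate t
    # (1 df) centred at mu_xy and with scale kappa^2 * Sigma_xy.
    mu = 0.5 * (x + y)
    scale_diag = np.abs(x - y)              # Sigma_xy^(1/2) = diag(|x - y|)
    g = multivariate_t(loc=mu, shape=(kappa**2) * np.diag(scale_diag**2), df=1)
    while True:
        w = g.rvs(random_state=rng)
        if rng.uniform() <= penalty(w, mu, scale_diag):
            return w

rng = np.random.default_rng(1)
w = sample_penalised(rng, x=np.array([0.0, 0.0]), y=np.array([1.0, -1.0]))
```

By Proposition 2 below, the expected number of trials in the while-loop does not depend on the current state (x, y).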
Notice that if g also belongs to a location-scale family, i.e.

g(w | x, y) = |Υ_xy|^{-1/2} g₀(Υ_xy^{-1/2} (w − µ_xy)),   (7)

where Υ_xy^{1/2} = κ Σ_xy^{1/2} for some κ > 0, then the normalising constant Z_xy satisfies

Z_xy = ∫_{R^d} g₀(u) [1 − ρ(κu)/ρ(0)] du =: Z,   (8)

which does not depend on the current state (x, y).

Regarding Algorithm 2, we now compute the acceptance probability within the rejection sampling step, which will also be independent of the current state (x, y). This is crucial for controlling the extra computational cost of the penalised t-walk algorithm, as the expected number of iterations before terminating Algorithm 2 is the same irrespective of where the chain is located.

Proposition 2. Suppose the chain is at state (x, y), assume the penalty ϕ_xy satisfies (2) and the proposal g belongs to a location-scale family as stated in (7). Then the overall acceptance probability in the rejection sampling from Algorithm 2 is independent of (x, y).

Proof. Let N be the number of trials needed for obtaining a sample from g̃ in the rejection sampling procedure. All the variables generated up to the n-th trial are U_{1:n} iid ∼ Unif(0, 1) and W_{1:n} iid ∼ g(· | x, y); therefore

P(N > n) = P(ϕ_xy(W_1) < U_1, …, ϕ_xy(W_n) < U_n) = (1 − E[ϕ_xy(W_1)])^n = (1 − Z_xy)^n = (1 − Z)^n,

where the last equality follows from (8), due to the fact that g belongs to a location-scale family. Therefore, the probability of terminating the rejection sampling procedure at the n-th trial is

P(N = n) = Z (1 − Z)^{n−1},

which is independent of the current state (x, y). □

When both g and ρ are d-dimensional Gaussian distributions we have a closed expression for Z, namely

Z = 1 − (1 + κ²)^{−d/2}.

Notice that as d → ∞ then Z → 1; therefore one might want to scale κ with the dimension of X.
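Assuming both g₀ and ρ are standard Gaussian densities, the closed form 1 − (1 + κ²)^{−d/2} follows from (8) by direct integration, and the short sketch below checks it by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(2)
d, kappa = 3, 3.0

# Monte Carlo estimate of Z = E[1 - rho(kappa * U) / rho(0)], U ~ g0 = N(0, I_d).
# For Gaussian rho: rho(kappa * u) / rho(0) = exp(-kappa^2 |u|^2 / 2).
u = rng.standard_normal((1_000_000, d))
z_mc = np.mean(1.0 - np.exp(-0.5 * kappa**2 * np.sum(u**2, axis=1)))

z_closed = 1.0 - (1.0 + kappa**2) ** (-d / 2)  # closed form for Gaussian g0, rho
print(f"Monte Carlo: {z_mc:.5f}, closed form: {z_closed:.5f}")  # both ~0.96838
```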
Table 1 shows the estimated acceptance probabilities in the rejection sampling procedure for different choices of the proposal g and the distribution ρ involved in the penalty. The default configuration of the penalised t-walk sampler was set to κ = 3, a multivariate t-distribution with 2 df for the penalty, and a multivariate t-distribution with 1 df for g.

As discussed earlier, having simulated W ∼ g̃(· | x, y) using Algorithm 2, which can be thought of as a new value for the centre µ_xy, we can obtain values for (U, V) through the simple transformation in (5) or (6). Recalling from (8) that the constant Z_xy does not depend on (x, y), which implies that Z_xy/Z_uv = 1, the MH ratio in the corresponding iteration of the t-walk becomes

r(x, y; u, v) = [π(u) π(v) g(µ_xy | u, v) ϕ_uv(µ_xy)] / [π(x) π(y) g(w | x, y) ϕ_xy(w)],   (9)

where w = (u + v)/2 is the simulated centre and µ_xy = (x + y)/2 is the centre recovered under the reverse move. For completeness, and before considering further numerical examples, we present Algorithm 3, which contains a full description of the penalised move implemented in the t-walk.

Algorithm 3 Penalised move in the t-walk
INPUT: Current value for the t-walk chain (x, y) ∈ X².
OUTPUT: New value for the chain.
1. Using the location µ_xy and scale Σ_xy, simulate and store w ← W ∼ g̃(· | x, y) (e.g. via Algorithm 1 or 2);
2. Simulate and store ω ← Ω ∼ Unif(0, 1);
3. Obtain (u, v) using either (5) or (6);
4. Compute r(x, y; u, v) from (9);
5. If ω ≤ r(x, y; u, v) return (u, v); otherwise return (x, y).

In this section we present several examples that, despite being artificial, are very illustrative for assessing the utility of the penalty move introduced to the t-walk algorithm. As mentioned before, all examples consider the default setting of the penalised t-walk sampler with µ_xy = (x + y)/2, Σ_xy = diag(|x − y|²), κ = 3, a multivariate t-distribution with 2 df for the penalty ϕ_xy, and a multivariate t-distribution with 1 df for g.

Example 2. We consider the same bi-modal target introduced in Example 1. Figure 3 shows the trace plots, similar to those from Figure 2, and the corresponding kernel density estimation (KDE) plots using 5 million iterations for both the original and penalised t-walk. Observe that, similarly to the case with a penalty involving gradients, the penalised t-walk is able to jump often between modes. The estimated integrated autocorrelation time (IAT), which is commonly regarded as a measure of efficiency [15], was 387 for the original t-walk and 955 for the penalised version. These results are clearly misleading, since one could wrongly conclude that the t-walk without penalisation is performing much better given that its IAT is less than half that of the penalised version. In order to compare these quantities properly we would need to run the chains for much longer; nonetheless, this simple experiment shows that relying solely on the IAT for assessing the quality of a sample from a multimodal target may lead to incorrect conclusions. The number of penalised moves was set, on average, to 10 percent of the total number of iterations. Of these penalised moves only 3 percent were accepted on average, which produced a decrease in the global acceptance rate from 33 percent (using the standard t-walk) to 16 percent (when introducing the penalised move).

Example 3. Now consider a 3-dimensional Gaussian mixture target with 9 components, where the means for the first 8 components are located at each of the 8 vertices of a cube of side length 20 centred at the origin. Regarding the variance terms, the diagonal values σ²_ii range from 0.25 to 10. Figure 4 shows once more how the standard t-walk struggles to visit all the different modes. In fact the t-walk gets stuck in a single mode, whereas the penalised version is able to visit all 9 modes within the first million iterations. Clearly, the run for the penalised version requires more iterations, as the global acceptance rate is only 4 percent. However, despite the unappealingly slow convergence shown by the penalised version, the method is able to expose the multimodality issue.

The shape of this type of densities can be very challenging for MCMC samplers as they may get stuck in the tails of the distribution. From Figure 5 we can observe that the standard t-walk is able to explore the 3 modes with a global acceptance rate of 13 percent. In contrast, the penalised version showed a global acceptance rate of 10 percent, with an acceptance rate for the penalty move of only 0.3 percent. The IAT for the t-walk was 630 versus the marginally lower value of 578 for the penalised version. One can argue that for this example the benefit from the penalised t-walk is minimal, but it does not interfere with the t-walk's performance.

One can always find a sufficiently pathological example in which modes are too isolated for our MCMC sampler to visit them all; even the penalised t-walk may not work correctly in such an example. This will be the case, for instance, if the two modes in Example 1 are placed extremely far apart from each other. As already mentioned, in the practice of MCMC one should always experiment with different starting values, to corroborate that our chain is not getting trapped in some regions of the sample space. If it does get trapped, and within each separated region the chain appears to be mixing well, it becomes apparent that we have several modes and the chain cannot jump from one to the next. How could one combine the two samples in order to obtain a new sample approximately distributed as π? Certainly, one cannot simply mix the samples, since this does not consider the probability accumulated in each region by the target π. Note that the discussion that follows considers any MCMC sampler: the t-walk, the penalised t-walk or otherwise.
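To see numerically why naively mixing the two samples fails, consider the following sketch: pooling two trapped samples of equal size implicitly assigns each mode weight 1/2 regardless of its true mass, whereas a correct combination must weight each region by its accumulated probability (assumed known here for illustration, which is precisely the knowledge the next section avoids requiring).

```python
import numpy as np

rng = np.random.default_rng(5)

# Two "trapped" samples: N(0, 1) and N(20, 1), with true weights 0.1 and 0.9.
x1 = rng.normal(0.0, 1.0, size=5000)
x2 = rng.normal(20.0, 1.0, size=5000)

pooled = np.concatenate([x1, x2])   # naive mixing: implicit weights 1/2, 1/2
print((pooled > 10).mean())         # ~0.50, but the correct mass is 0.90

# A correct combination reweights the regions by their masses Z_1, Z_2:
w = np.array([0.1, 0.9])
modes = rng.choice(2, size=10_000, p=w)
combined = np.where(modes == 0, rng.choice(x1, 10_000), rng.choice(x2, 10_000))
print((combined > 10).mean())       # ~0.90, as desired
```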
Suppose that the target π has "significant" mass on two disjoint regions X₁ and X₂, and also assume the MCMC sampler produces a ϕ-irreducible and aperiodic chain, but one with a very small probability of reaching X₂ from X₁ and vice versa. Sampling from π relying on a single chain X_{0:T} := (X_t)_{t=0}^{T} becomes challenging since

P(X_{0:T} visits both X₁ and X₂) = ε_T,

with ε_T ≪ 1. Despite the fact that ε_T → 1 as T → ∞, in practice it is usually the case that X_{0:T} ≈ π_i(·) for some i ∈ {1, 2}, where

π_i(x) := γ(x) 1_{X_i}(x) / Z_i, with Z_i := ∫_{X_i} γ(dx).

Additionally, notice that the target can be expressed as follows:

π(x) = Z^{-1} [Z₁ π₁(x) + Z₂ π₂(x)].

Hence, suppose we have access to a sample from π₁ and one from π₂; how could we obtain an approximate sample from π? We now discuss a solution that only requires unbiased estimates of the Z_i^{-1}.

Suppose we have two different proposals q₂₁ and q₁₂, acting on the disjoint subsets X₁ and X₂, respectively. A simple asymmetric MCMC kernel can be constructed as follows:

P(x, dy) = ∑_{i≠j} 1_{X_i}(x) P_{ij}(x, dy),

where the sub-kernel P_{ij}(x, dy) = q_{ij}(x, dy) α_{ij}(x, y) + δ_x(dy) ρ_{ij}(x), considering

α_{ij}(x, y) = min{1, [π(y) q_{ji}(y, x)] / [π(x) q_{ij}(x, y)]},

and ρ_{ij}(x) denoting the corresponding rejection probability. It is not difficult to show that the kernel P is invariant under the target distribution π. More general implementations of this idea are possible, for which there are more than 2 proposals {q_ij}_{i,j} and sub-kernels are also selected at random [see e.g. 37, 2].

In our context, we consider two samples X_{1:N} ∼ π₁(·) and Y_{1:N} ∼ π₂(·), and we would like to use the sub-targets {π_i}_i as proposals. Sampling is possible; however, evaluating the corresponding densities is a challenge since the normalising constants Z_i are not known. The acceptance probability for this scenario, when x ∈ X_i, becomes

α_{ij}(x, y) = min{1, [π(y) π_i(x)] / [π(x) π_j(y)]} = min{1, Z_j / Z_i}.

We require a procedure for estimating the ratio of normalising constants in order to obtain an exact algorithm, if possible. One approach is to estimate each term Z_i^{-1} unbiasedly and resort to pseudo-marginal (PM) theory [6, 3]. Notice that for any density h_i defined on X_i,

E_{π_i}[h_i(X) / γ(X)] = ∫_{X_i} [h_i(x) / γ(x)] [γ(x) / Z_i] dx = Z_i^{-1}.

This type of estimator is not new in the literature [26, 13], and has proved useful for estimating ratios of marginal likelihoods or Bayes factors, see e.g. [9] and references therein. The resulting PM algorithm will target an extended distribution π̄ defined on X² with density

π̄(x, u) ∝ γ(x) 1_{X_i}(x) h_j(u) 1_{X_j}(u), for x ∈ X_i and j ≠ i,

proposing moves using q̄_{ij}(x, u; dy, dv) = π_j(dy) π_i(dv). Notice that the extended target π̄ admits π as one of its marginals, since ∫ π̄(x, u) du = π(x), and the corresponding acceptance probability ᾱ_{ij} (when x ∈ X_i) simplifies to

ᾱ_{ij}(x, u; y, v) = min{1, [h_i(v) / γ(v)] / [h_j(u) / γ(u)]}.

Additionally, one could use multiple instances of the auxiliary variables U and V discussed above, and modify the acceptance ᾱ_{ij} accordingly for using ratios of arithmetic averages instead. It is well known that doing this does not affect the exactness of the algorithm [6, 3], and in fact increasing the accuracy of the unbiased estimators within the acceptance ratio will typically reduce the asymptotic variance of the resulting chain [4].

Up to this point we have not discussed possible choices for h_i and h_j. In principle they can be arbitrary, but ideally one would like h_i to resemble π restricted to X_i. A feasible option, as discussed in the following section, is to set h_i equal to the resulting density from a KDE algorithm using a sample from π_i. Using KDE one can construct an unbiased estimator for the reciprocal of the normalising constants Z₁ and Z₂.
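The identity E_{π_i}[h_i(X)/γ(X)] = Z_i^{-1} is easy to verify numerically; in the sketch below γ is an unnormalised Gaussian restricted to X_i = [0, ∞) and h_i is the half-normal density, both illustrative assumptions. Since this particular h_i is exactly proportional to π_i, the estimator has zero variance, illustrating why h_i should resemble π restricted to X_i.

```python
import numpy as np

rng = np.random.default_rng(3)

# Region X_i = [0, inf); unnormalised density gamma(x) = exp(-x^2 / 2) there,
# so Z_i = integral of gamma over X_i = sqrt(pi / 2).
gamma = lambda x: np.exp(-0.5 * x**2)
z_true = np.sqrt(np.pi / 2.0)

# Draws from pi_i (a half-normal) and an auxiliary density h_i on X_i,
# here also the half-normal density: a deliberately good match to pi_i.
x = np.abs(rng.standard_normal(1_000_000))
h = lambda t: np.sqrt(2.0 / np.pi) * np.exp(-0.5 * t**2)

estimate = np.mean(h(x) / gamma(x))  # unbiased for 1 / Z_i
print(f"estimate = {estimate:.5f}, true 1/Z_i = {1.0 / z_true:.5f}")
```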
Having samples X_{1:N} and Y_{1:N} coming from π₁ and π₂, respectively, if we construct an estimated kernel density leaving out the i-th element of the sample, say X_i, we obtain the approximate density π̂₁^{(−i)} : X₁ → R⁺₀, where the superscript (−i) indicates that we are not considering X_i. Similarly, using the sample Y_{1:N} one can construct π̂₂^{(−i)} for any i ∈ {1, …, N}. The reason for leaving out a variable becomes clear when computing the following expectation using the tower property, here for j = 1:

E[π̂₁^{(−i)}(X_i) / γ(X_i)] = E[ E[π̂₁^{(−i)}(X_i) / γ(X_i) | X_{1:N} \ {X_i}] ] = E[ Z₁^{-1} ∫_{X₁} π̂₁^{(−i)}(x) dx ] = Z₁^{-1}.

Therefore, the ratio

R_j^{(i)} := π̂_j^{(−i)}(W_i) / γ(W_i), with W_{1:N} denoting X_{1:N} (if j = 1) or Y_{1:N} (if j = 2),   (10)

is an unbiased estimator of Z_j^{-1} for any i ∈ {1, …, N} and any j ∈ {1, 2}. Consequently, we will use ratios from (10) for approximating the pseudo-marginal process discussed in the previous section, meaning that h_j = π̂_j^{(−i)} for some i ∈ {1, …, N}. Algorithm 4 describes the process for obtaining an approximate draw from π, using the samples X_{1:N} ∼ π₁ and Y_{1:N} ∼ π₂.

Algorithm 4 Jumping modes sampler
INPUT: Samples x_{1:N} ← X_{1:N} ∼ π₁ and y_{1:N} ← Y_{1:N} ∼ π₂, and ratios R_j^{(i)} as in (10).
OUTPUT: Chain of indices (m, i), indicating the mode and the element within the corresponding sample.
1. If m = 1 then: propose a jump to mode 2 and accept it with probability ᾱ₁₂ computed from the ratios R₁^{(·)} and R₂^{(·)}; otherwise proceed analogously with ᾱ₂₁;
2. Return (m, i).

We must stress that this algorithm is not exact, since the estimators R_j^{(i)} are computed by recycling the same finite samples throughout the run rather than being refreshed with independent draws at each iteration. Notice also that the algorithm produces a chain of indices (M_t, I_t)_{t≥0}, rather than a chain of values in X. The first index (M_t) indicates which mode is being sampled, whereas the second one (I_t) corresponds to the variable sampled within X_{1:N} (if M_t = 1) or Y_{1:N} (if M_t = 2). With this in mind it is straightforward to obtain the actual chain of values, which is a sample approximately coming from π.

Example 5. We consider a modified version of Example 1, where the two components are still the same but now the weights are different, i.e.

π(x₁, x₂) = 0.1 N((x₁, x₂) | µ₁, Σ₁) + 0.9 N((x₁, x₂) | µ₂, Σ₂).

In this case we set π_i = N(µ_i, Σ_i) and draw 10 thousand samples from each distribution. Using Algorithm 4 we aim at combining these samples, which lie on well-separated regions of the state space. Figure 6 compares the KDE plots resulting from the proposed method (left) and a simple resampling process with the correct weights (right). The jumping modes algorithm was run for 100 thousand iterations, showing a global acceptance rate of 20 percent. Notice that the resulting plots are almost identical; hence, even though the proposed algorithm is not exact, it may prove useful for post-processing samples.

Figure 6: KDE plots from Example 5 for the jumping modes algorithm (left) and a resampling process using the correct weights (right).
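A runnable Python sketch of the jumping-modes procedure used in Example 5, assembled from the description above; scipy's gaussian_kde stands in for the leave-one-out densities in (10), and the target, sample sizes and acceptance form follow the pseudo-marginal construction of the previous section, so the details should be read as assumptions rather than as the authors' exact implementation.

```python
import numpy as np
from scipy.stats import gaussian_kde

def loo_ratio(sample, i, log_gamma):
    # Leave-one-out KDE ratio (10): hat-pi^(-i)(W_i) / gamma(W_i),
    # an unbiased estimate of 1/Z_j given the remaining points.
    rest = np.delete(sample, i, axis=0)
    return gaussian_kde(rest.T)(sample[i])[0] / np.exp(log_gamma(sample[i]))

def jumping_modes(samples, log_gamma, n_iter, rng):
    # samples = [X_{1:N} ~ pi_1, Y_{1:N} ~ pi_2]; chain over indices (m, i).
    m, i = 0, rng.integers(len(samples[0]))
    chain = []
    for _ in range(n_iter):
        j = 1 - m
        i_prop = rng.integers(len(samples[j]))    # proposed point from pi_j
        i_aux = rng.integers(len(samples[m]))     # auxiliary point from pi_m
        r_curr = loo_ratio(samples[m], i_aux, log_gamma)   # estimates 1/Z_m
        r_prop = loo_ratio(samples[j], i_prop, log_gamma)  # estimates 1/Z_j
        # Accept the mode jump with min{1, (1/Z_m) / (1/Z_j)} = min{1, Z_j/Z_m}.
        if rng.uniform() < min(1.0, r_curr / r_prop):
            m, i = j, i_prop
        else:
            # Refresh the within-mode index; the stored points are draws
            # from pi_m, so a uniform index keeps the within-mode law.
            i = rng.integers(len(samples[m]))
        chain.append((m, i))
    return chain

# Illustrative usage: two well-separated 1-d samples with unequal true weights.
rng = np.random.default_rng(4)
X = rng.normal(0.0, 1.0, size=(2000, 1))
Y = rng.normal(20.0, 1.0, size=(2000, 1))
log_gamma = lambda w: np.logaddexp(np.log(0.1) - 0.5 * (w[0] - 0.0) ** 2,
                                   np.log(0.9) - 0.5 * (w[0] - 20.0) ** 2)
chain = jumping_modes([X, Y], log_gamma, n_iter=2000, rng=rng)
frac_mode_2 = np.mean([m for m, _ in chain])  # approaches 0.9, the true weight
```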
The introduction of penalised moves into the t-walk algorithm can be useful for tackling multimodality in complex inference scenarios. In principle one could implement the aforementioned moves, either gradient-based or not, into other MCMC samplers. In this respect, modifying the penalty in an adaptive manner may sound appealing, although one must be careful of not breaking the ergodicity of the chain. The penalised t-walk works well in various examples but, as observed, the algorithm may still struggle to jump between modes in complex or high-dimensional settings. This is because the chance of finding a new mode, and hence of accepting a penalised move, may become quite small. For such cases, we presented an approach for combining samples from two different samplers that in theory should be able to visit all modes, but realistically will get stuck in two separate regions of the state space. We want to emphasise that combining samples, as presented here, should be regarded as a last resort since its validity relies on strong assumptions. For instance, the two samples should be of decent quality, meaning that the corresponding MCMC samplers may need to run for a large number of iterations. Additionally, the samples need to lie on disjoint (or almost disjoint) regions of the state space, which may be difficult to assess in higher dimensions. Nevertheless, extending the method for non-disjoint sets and for more than two samples should be possible, but has been postponed for future research.

The authors were partially funded by CONACYT CB-2016-01-284451 and COVID19 312772 grants, and a RDCOMM grant.

References
[1] The SARS-CoV-2 epidemic outbreak: a review of plausible scenarios of containment and mitigation for Mexico. medRxiv.
[2] On the utility of Metropolis-Hastings with asymmetric acceptance ratio.
[3] The pseudo-marginal approach for efficient Monte Carlo computations.
[4] Convergence properties of pseudo-marginal Markov chain Monte Carlo algorithms.
[5] Bayesian Analysis of 210Pb Dating.
[6] Estimation of population growth or decline in genetically monitored populations.
[7] Flexible paleoclimate age-depth models using an autoregressive gamma process.
[8] Forecasting hospital demand in metropolitan areas during the current COVID-19 pandemic and estimates of lockdown-induced 2nd waves. medRxiv.
[9] Bayesian Analysis of ODEs: Solver Optimal Accuracy and Bayes Factors.
[10] Towards uncertainty quantification and inference in the stochastic SIR epidemic model.
[11] Impulses and Physiological States in Theoretical Models of Nerve Membrane.
[12] emcee: The MCMC Hammer.
[13] Bayesian Model Choice: Asymptotics and Exact Calculations.
[14] Computing Science and Statistics.
[15] Practical Markov Chain Monte Carlo.
[16] An Adaptive Metropolis Algorithm.
[17] Monte Carlo Sampling Methods Using Markov Chains and Their Applications.
[18] Statistical Inference: An Integrated Approach.
[19] Andrés Christen. A General Purpose Sampling Algorithm for Continuous Distributions (the t-walk).
[20] Incremental Mixture Importance Sampling With Shotgun Optimization.
[21] The Barker proposal: combining robustness and efficiency in gradient-based MCMC.
[22] The Eyam plague revisited: did the village isolation change transmission from fleas to pulmonary? Medical Hypotheses.
[23] Equation of state calculations by fast computing machines.
[24] An Active Pulse Transmission Line Simulating Nerve Axon.
[25] MCMC Using Hamiltonian Dynamics. In Handbook of Markov Chain Monte Carlo.
[26] Approximate Bayesian Inference with the Weighted Likelihood Bootstrap.
[27] Monte Carlo Statistical Methods. Springer Texts in Statistics.
[28] General state space Markov chains and MCMC algorithms.
[29] Examples of adaptive MCMC.
[30] Exponential Convergence of Langevin Distributions and Their Discrete Approximations.
[31] Bayesian linear regression with skew-symmetric error distributions with applications to survival analysis.
[32] Inference in two-piece location-scale models with Jeffreys priors.
[33] Bayesian modelling of skewness and kurtosis with two-piece scale and shape distributions.
[34] Flexible objective Bayesian linear regression with applications in survival analysis.
[35] Models of soil organic matter decomposition: the SoilR package, version 1.0. Geoscientific Model Development.
[36] Replica Monte Carlo Simulation of Spin-Glasses.
[37] Mode Jumping Proposals in MCMC.
[38] Objective priors for the number of degrees of freedom of a multivariate t distribution and the t-copula.
[39] Identification of Aquifer Parameters from Pumping Test Data with Regard for Uncertainty.
[40] Sufficient Conditions for Torpid Mixing of Parallel and Simulated Tempering.