key: cord-0581697-zooj4joh authors: Jankowiak, Martin; Phan, Du title: Surrogate Likelihoods for Variational Annealed Importance Sampling date: 2021-12-22 journal: nan DOI: nan sha: d39a7bd358e205d98e163c1f14b14a9530dd5f45 doc_id: 581697 cord_uid: zooj4joh Variational inference is a powerful paradigm for approximate Bayesian inference with a number of appealing properties, including support for model learning and data subsampling. By contrast MCMC methods like Hamiltonian Monte Carlo do not share these properties but remain attractive since, contrary to parametric methods, MCMC is asymptotically unbiased. For these reasons researchers have sought to combine the strengths of both classes of algorithms, with recent approaches coming closer to realizing this vision in practice. However, supporting data subsampling in these hybrid methods can be a challenge, a shortcoming that we address by introducing a surrogate likelihood that can be learned jointly with other variational parameters. We argue theoretically that the resulting algorithm permits the user to make an intuitive trade-off between inference fidelity and computational cost. In an extensive empirical comparison we show that our method performs well in practice and that it is well-suited for black-box inference in probabilistic programming frameworks. Bayesian modeling and inference is a powerful approach to making sense of complex datasets. This is especially the case in scientific applications, where accounting for prior knowledge and uncertainty is essential. For motivating examples we need only look to recent efforts to study COVID-19, including e.g. epidemiological models that incorporate mobility data (Miller et al., 2020; Monod et al., 2021) or link differentiable transmissibility of SARS-CoV-2 lineages to viral mutations (Obermeyer et al., 2021) . Unfortunately for many application areas the broader use of Bayesian models is hampered by the difficulty of formulating scalable inference algorithms. One regime that has proven particularly challenging is the large data regime. This is essentially because powerful MCMC methods like Hamiltonian Monte Carlo (HMC) (Duane et al., 1987; Neal et al., 2011) are, at least on the face of it, incompatible with data subsampling (Betancourt, 2015) . While considerable effort has gone into developing MCMC methods that accommodate data subsampling, 1 developing generic MCMC methods that yield high fidelity posterior approximations while scaling to very large datasets remains challenging. For this reason, among others, recent years have seen extensive development of approximate inference methods based on variational inference (Jordan et al., 1999; Blei et al., 2017) . Besides 'automatic' support for data subsampling (Hoffman et al., 2013; Ranganath et al., 2014) , at least for suitable model classes, variational inference has several additional favorable properties, including support for amortization (Dayan et al., 1995) , log evidence estimates, and model learning, motivating researchers to combine the strengths of MCMC and variational inference (Salimans et al., 2015; Hoffman, 2017; Caterini et al., 2018; Ruiz & Titsias, 2019) . Recent work from Geffner & Domke (2021) and Zhang et al. (2021) offers a particularly elegant formulation of a variational inference method that leverages (unadjusted) HMC as well as annealed importance sampling (AIS) (Neal, 2001) . 
Although these are fundamentally variational methods, they rely on HMC, which does not itself readily accommodate data subsampling, and so they do not automatically inherit support for data subsampling. This is unfortunate because they do inherit many of the other appealing features of variational inference, including support for amortization and model learning. In this work we set out to extend the approach in Geffner & Domke (2021) and Zhang et al. (2021) to support data subsampling, thus making it applicable to the large data regime. Our basic strategy is simple and revolves around introducing a surrogate log likelihood that is cheap to evaluate. The surrogate log likelihood is used to guide HMC dynamics, resulting in a flexible variational distribution that is implicitly defined via the gradient of the surrogate log likelihood. The corresponding variational objective accommodates unbiased mini-batch estimates, at least for the large class of models with appropriate conditional independence structure that we consider. As we show in experiments in Sec. 7, our method performs well in practice and is well-suited for black-box inference in probabilistic programming frameworks. We are given a dataset D and a model of the form

p_θ(D, z) = p_θ(z) ∏_{n=1}^N p_θ(y_n | z, x_n)    (1)

where the latent variable z ∈ R^D is governed by a prior p_θ(z), the likelihood factorizes into N terms, one for each data point (y_n, x_n) ∈ D, and θ denotes any additional (non-random) parameters in the model. Eqn. 1 encompasses a large and important class of models and includes e.g. a wide variety of multi-level regression models. The log density corresponding to Eqn. 1 is given by

log p_θ(D, z) = log p_θ(z) + ∑_n log p_θ(y_n | z, x_n) ≡ Ψ_0(z) + Ψ_L(D, z)    (2)

where we define the log prior Ψ_0(z) and the log likelihood Ψ_L(D, z); we also use the notation Ψ_L(D_I, z) ≡ ∑_{n∈I} log p_θ(y_n | z, x_n) for a set of indices I. We are interested in the regime where N is large or individual likelihood terms are costly to evaluate, so that computing the full log likelihood Ψ_L(D, z) and its gradients is impractical. We aim to devise a flexible variational approximation to the posterior p_θ(z|D) that can be fit with an algorithm that supports data subsampling. Additionally we would like our method to be generic in nature so that it can readily be incorporated into a probabilistic programming framework as a black-box inference algorithm. Finally we would like a method that supports model learning, i.e. one that allows us to learn θ in conjunction with the approximate posterior. For simplicity in this work we primarily focus on global latent variable models with the structure in Eqn. 1. The approach we describe can also be extended to models with local latent variables, i.e. those local to each data point; for more discussion see Sec. 7.6 and Sec. A.1 in the appendix. Before describing our method in Sec. 4, we first review the relevant background. The simplest variants of variational inference introduce a parametric variational distribution q_φ(z) and choose the parameters φ to minimize the Kullback-Leibler (KL) divergence between the variational distribution and the posterior p_θ(z|D), i.e. KL(q_φ(z) | p_θ(z|D)). This can be done by maximizing the Evidence Lower Bound or ELBO,

ELBO ≡ E_{q_φ(z)}[ log p_θ(D, z) − log q_φ(z) ],    (3)

which satisfies ELBO ≤ log p_θ(D). Thanks to this inequality the ELBO naturally enables joint model learning and inference, i.e. we can maximize Eqn. 3 w.r.t. both θ and φ simultaneously.
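To make the model class in Eqn. 1 concrete, the following is a minimal sketch of how a logistic regression instance of Eqn. 1 might be written in NumPyro with a subsample plate, so that Ψ_L(D, z) admits unbiased mini-batch estimates. The logistic regression likelihood, the standard Normal prior, and the mini-batch size are illustrative assumptions rather than a specification taken from our experiments.

```python
import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist

def model(X, y=None, subsample_size=256):
    N, D = X.shape
    # Global latent variable z with prior p_theta(z).
    z = numpyro.sample("z", dist.Normal(jnp.zeros(D), 1.0).to_event(1))
    # The likelihood factorizes over data points, so a subsample plate
    # yields unbiased mini-batch estimates of Psi_L(D, z).
    with numpyro.plate("data", N, subsample_size=subsample_size):
        X_batch = numpyro.subsample(X, event_dim=1)
        y_batch = numpyro.subsample(y, event_dim=0) if y is not None else None
        numpyro.sample("obs", dist.Bernoulli(logits=X_batch @ z), obs=y_batch)
```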
A potential shortcoming of the fully parametric approach described here is that it can be difficult to specify appropriate parameterizations for q_φ(z).

Annealed importance sampling (AIS) (Neal, 2001) is a method for estimating the evidence p_θ(D) that leverages a sequence of K bridging densities {f_k(z)}_{k=1}^K that connect a simple base distribution q_0(z) to the posterior. In more detail, AIS can be understood as importance sampling on an extended space. That is, we can write

p_θ(D) = E_{q_fwd(z_{0:K})} [ q_bwd(z_{0:K}) / q_fwd(z_{0:K}) ]    (4)

where the proposal distribution q_fwd and the (un-normalized) target distribution q_bwd are given by

q_fwd(z_{0:K}) = q_0(z_0) ∏_{k=1}^K T_k(z_k | z_{k-1})    and    q_bwd(z_{0:K}) = p_θ(D, z_K) ∏_{k=1}^K T̃_k(z_{k-1} | z_k).    (5)

Here each T_k is a MCMC kernel that leaves the bridging density f_k(z) invariant and T̃_k is the reverse MCMC kernel corresponding to T_k. While there is considerable freedom in the choice of q_0(z) and {f_k(z)}, it is natural to let q_0(z) = p_θ(z) and f_k(z) = p_θ(z) p_θ(D|z)^{β_k} for a sequence of inverse temperatures 0 < β_1 < ... < β_K = 1. Conceptually, the kernels {T_k} move samples from q_0(z) towards the posterior via a sequence of moves, each of which is 'small' for appropriately spaced {β_k} and sufficiently large K. Indeed AIS is consistent as K → ∞ (Neal, 2001) and has been shown to achieve accurate estimates of log p_θ(D) empirically.

Hamiltonian Monte Carlo (HMC) is a powerful gradient-based MCMC method (Duane et al., 1987; Neal et al., 2011). HMC proceeds by introducing an auxiliary momentum v ∈ R^D and a log joint (un-normalized) density log π(z, v) = −H(z, v), where the Hamiltonian is defined as H(z, v) = V(z) + T(v). Here V(z) = −log p_θ(D, z) is the potential energy, T(v) = ½ vᵀ M⁻¹ v is the kinetic energy, and M is the mass matrix. Hamiltonian dynamics is then defined as:

dz/dt = ∂H/∂v = M⁻¹ v        dv/dt = −∂H/∂z = −∇_z V(z).    (6)

Evolving trajectories according to Eqn. 6 for any time τ yields Markovian transitions that target the stationary distribution π(z, v). Alternating Hamiltonian evolution with momentum updates of the form

v → γ v + √(1 − γ²) ε,    ε ∼ N(0, M),    (7)

and disregarding the momentum yields a MCMC chain targeting the posterior p_θ(z|D). Here γ ∈ [0, 1) controls the level of momentum refreshment; e.g. the γ ≈ 1 regime suppresses random-walk behavior (Horowitz, 1991). In general we cannot simulate Hamiltonian trajectories exactly and instead use a symplectic integrator like the so-called 'leapfrog' integrator. Combined with momentum refreshment, a single leapfrog step takes the form

v_{1/2} = v − (η/2) ∇_z V(z),    z' = z + η M⁻¹ v_{1/2},    v' = v_{1/2} − (η/2) ∇_z V(z'),    (8)

where η is the step size. Since the leapfrog integrator is inexact, a Metropolis-Hastings accept/reject step is used to ensure asymptotic correctness. For a comprehensive introduction to HMC see e.g. (Betancourt, 2017).

A notable feature of HMC is that the HMC kernel T(z_1, v_1 | z_0, v_0) can generate large moves in z-space with large acceptance probabilities, which makes it an attractive kernel choice for AIS (Sohl-Dickstein & Culpepper, 2012). Moreover, as an importance sampling framework AIS naturally gives rise to a variational bound, since applying Jensen's inequality to the log of Eqn. 4 immediately yields a lower bound to log p_θ(D). Geffner & Domke (2021) and Zhang et al. (2021) note that the utility of such an AIS variational bound is severely hampered by the fact that typical MCMC kernels include a Metropolis-Hastings accept/reject step that makes the bound non-differentiable. DAIS restores differentiability by removing the accept/reject step, which makes it straightforward to optimize the variational bound using gradient-based methods. While dropping the Metropolis-Hastings correction invalidates detailed balance, the importance sampling construction in Eqn. 4, and thus the resulting variational bound, remains intact.
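The two HMC ingredients above, partial momentum refreshment (Eqn. 7) and the leapfrog integrator (Eqn. 8), can be sketched in a few lines of JAX. This is a minimal illustration assuming a diagonal mass matrix; the function names and signatures are our own and do not correspond to a specific library API.

```python
import jax
import jax.numpy as jnp

def refresh_momentum(rng_key, v, gamma, M_diag):
    # Partial momentum refreshment: v <- gamma * v + sqrt(1 - gamma^2) * eps,
    # with eps ~ N(0, M) and a diagonal mass matrix M.
    eps = jnp.sqrt(M_diag) * jax.random.normal(rng_key, v.shape)
    return gamma * v + jnp.sqrt(1.0 - gamma ** 2) * eps

def leapfrog(z, v, potential_fn, eta, M_diag):
    # One leapfrog step targeting exp(-V(z) - T(v)) with T(v) = 0.5 v^T M^{-1} v.
    grad_V = jax.grad(potential_fn)
    v = v - 0.5 * eta * grad_V(z)
    z = z + eta * v / M_diag
    v = v - 0.5 * eta * grad_V(z)
    return z, v
```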
Moreover, since DAIS employs a HMC kernel and since we expect HMC moves to have high acceptance probabilities (at least for a well-chosen step size η and mass matrix M), we expect the K DAIS transitions to efficiently move samples from q_0(z) towards the posterior. In more detail DAIS operates on an extended space (z_0, ..., z_K, v_0, ..., v_K), with proposal and target distributions q_fwd(z_{0:K}, v_{0:K}) and q_bwd(z_{0:K}, v_{0:K}) given by momentum-augmented analogs of Eqn. 5 (Eqn. 9; see Sec. A.5 in the supplement for explicit formulae). Here each kernel T_k performs a single leapfrog step as in Eqn. 8, without including a Metropolis-Hastings correction, using the annealed potential energy

V_k(z) = −Ψ_0(z) − β_k Ψ_L(D, z).    (10)

Notably, Geffner & Domke (2021) and Zhang et al. (2021) show that the DAIS variational objective is easy to compute, as clarified by the following lemma.

Lemma 1 The DAIS bound L_DAIS given by proposal and target distributions as in Eqn. 9 is differentiable and admits a closed-form expression in terms of the samples (z_{0:K}, v_{0:K}) and kinetic energy differences. Moreover, gradient estimates of L_DAIS w.r.t. θ and φ can be computed using reparameterized gradients provided the base distribution q_0(z) is reparameterizable.

For a proof see Sec. A.2 in the supplemental materials, Geffner & Domke (2021) or Zhang et al. (2021). Note that T_k and T̃_k do not appear explicitly in Lemma 1; instead it suffices to compute kinetic energy differences.

The DAIS variational objective in Lemma 1 can lead to tight bounds on the model log evidence log p_θ(D), especially for large K. However, for a model like that in Eqn. 1, optimizing L_DAIS can be prohibitively expensive, with a O(NK) cost per optimization step for a dataset with N data points. This is because sampling q_fwd requires computing the gradient of the full log likelihood, i.e. ∇Ψ_L(D, z), K times, which is expensive whenever N is large or individual likelihood terms are costly to compute. In order to speed up training and sampling in this regime we first observe that the proof of Lemma 1 holds for any choice of potential energies {V_k(z)} (see Sec. A.2 in the supplement for details). The annealed ansatz in Eqn. 10 is a natural choice, but any other choice that efficiently moves samples from q_0(z) towards the posterior p_θ(z|D) can lead to tight variational bounds. This observation naturally leads to two scalable variational bounds that are tractable in the large data regime. In the first method, a variant of which was considered theoretically in Zhang et al. (2021) and which we refer to as NS-DAIS, we replace V_k(z) with a stochastic estimate that depends on a mini-batch of data. In the second method, which we refer to as SL-DAIS, we replace the full log likelihood Ψ_L(D, z) with a surrogate log likelihood Ψ̂_L(z) that is cheap to evaluate. Below we argue theoretically and demonstrate empirically that SL-DAIS is superior to NS-DAIS.

In NS-DAIS the HMC dynamics are guided by the mini-batch potential energy −Ψ_0(z) − β_k (N/B) Ψ_L(D_J, z), where D_J is a mini-batch of size B; in writing the corresponding forward and backward distributions we omit the common factor of q(I)q(J) for brevity. See Algorithm 2 and Sec. A.4 in the supplement for details.

In SL-DAIS we proceed as in NS-DAIS except we plug a fixed non-stochastic surrogate log likelihood Ψ̂_L(z) into the potential energy in Eqn. 10. More formally, the proposal and target distributions q_fwd and q_bwd in Eqn. 9 are replaced with mini-batch-indexed counterparts (see Sec. A.5 for explicit formulae), where q(I) encodes sampling mini-batches of B indices without replacement, and data subsampling occurs solely in the p_θ(D_I | z_K) term.

Algorithm 1 SL-DAIS: Surrogate Likelihood Differentiable Annealed Importance Sampling. We highlight in blue where the algorithm differs from DAIS. To recover DAIS we substitute Ψ̂_L(z) → Ψ_L(D, z) and B → N. Note that Ψ̂_L is only used to guide HMC dynamics and that a stochastic estimate of Ψ_L(D, z) still appears on the final line.
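As an illustration of how the surrogate enters the HMC dynamics, the sketch below implements the annealed potential of Eqn. 10 with Ψ_L(D, z) replaced by Ψ̂_L(z), together with a single uncorrected SL-DAIS transition (one partial momentum refreshment plus one leapfrog step), roughly corresponding to the inner loop of Algorithm 1. The kinetic-energy bookkeeping needed for the ELBO (Lemma 1) is omitted, and all function names are our own.

```python
import jax
import jax.numpy as jnp

def annealed_potential(z, beta_k, log_prior_fn, surrogate_loglik_fn):
    # V_k(z) = -Psi_0(z) - beta_k * Psi_hat_L(z). The surrogate log likelihood
    # replaces the full-data log likelihood, so each gradient costs O(N_surr)
    # instead of O(N). (NS-DAIS would instead plug in the mini-batch estimate
    # (N / B) * Psi_L(D_J, z) here.)
    return -(log_prior_fn(z) + beta_k * surrogate_loglik_fn(z))

def sl_dais_transition(rng_key, z, v, beta_k, eta, gamma, M_diag,
                       log_prior_fn, surrogate_loglik_fn):
    # Partial momentum refreshment followed by one leapfrog step under the
    # annealed surrogate potential; no Metropolis-Hastings correction.
    eps = jnp.sqrt(M_diag) * jax.random.normal(rng_key, v.shape)
    v_hat = gamma * v + jnp.sqrt(1.0 - gamma ** 2) * eps
    grad_V = jax.grad(lambda z_: annealed_potential(
        z_, beta_k, log_prior_fn, surrogate_loglik_fn))
    v_half = v_hat - 0.5 * eta * grad_V(z)
    z_new = z + eta * v_half / M_diag
    v_new = v_half - 0.5 * eta * grad_V(z_new)
    # v_hat and v_new can be used to accumulate the kinetic energy differences
    # that enter the variational objective.
    return z_new, v_new, v_hat
```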
See Algorithm 1 for the complete algorithm and Sec. A.5 in the supplement for a more complete formal description.

There are multiple possibilities for how to parameterize Ψ̂_L(z). We briefly describe the simplest possible recipe, which we refer to as RAND, leaving a detailed ablation study to Sec. 7.1. In RAND we randomly choose N_surr ≪ N surrogate data points {(ỹ_n, x̃_n)} ⊂ D and introduce a N_surr-dimensional vector of learnable (positive) weights ω. The surrogate log likelihood is then given by

Ψ̂_L(z) = ∑_{n=1}^{N_surr} ω_n log p_θ(ỹ_n | z, x̃_n).

Evidently computing Ψ̂_L(z) is O(N_surr). Besides its simplicity, an appealing feature of this parameterization is that it leverages the known functional form of the likelihood.

Before taking a closer look at the theoretical properties of NS-DAIS and SL-DAIS in Sec. 5, we make two simple observations. First, as we should expect from any bona fide variational method, maximizing the variational bound does indeed lead to tighter posterior approximations, as formalized in the following proposition:

Proposition 1 The NS-DAIS and SL-DAIS approximate posterior distributions, each of which is given by the marginal q_fwd(z_K), both satisfy the inequality KL(q_fwd(z_K) | p_θ(z_K|D)) ≤ log p_θ(D) − L, where L denotes L_NS-DAIS or L_SL-DAIS, respectively.

See Sec. A.3 for a proof. Thus as L increases for fixed θ, the KL divergence decreases and q_fwd(z_K) becomes a better approximation to the posterior. Second, sampling from q_fwd in the case of NS-DAIS requires the entire dataset D. Conveniently, in the case of SL-DAIS we only require the surrogate log likelihood Ψ̂_L(z), so that the dataset can be discarded after training.

As in Zhang et al. (2021), to make our analysis tractable we consider linear regression with a prior p_θ(z) = N(μ_0, Λ_0⁻¹) and a likelihood ∏_n N(y_n | z · x_n, σ²_obs), where each y_n ∈ R and x_n ∈ R^D. Furthermore we work under the following set of simplifying assumptions:

Assumption 1 We use γ = 0, equally spaced inverse temperatures {β_k}, a step size that varies as η ∼ K^{−1/4}, and the prior as the base distribution (i.e. q_0(z) = p_θ(z)).

What kind of posterior approximation do we expect from NS-DAIS? As clarified by the following proposition, NS-DAIS does not directly target the posterior p_θ(z|D).

Proposition 2 The approximate posterior for linear regression given by running NS-DAIS with K steps under Assumption 1 converges to the 'aggregate pseudo-posterior' E_{q(J)}[p_θ(z|D_J)] as K → ∞. Here p_θ(z|D_J) denotes the posterior corresponding to the data subset D_J with appropriately scaled likelihood.

See Sec. A.4 for a proof and additional details. Thus we generally expect NS-DAIS to provide a good posterior approximation when the aggregate pseudo-posterior is a good approximation to the posterior. The poor performance of NS-DAIS in experiments in Sec. 7 suggests that this is not the case for moderate mini-batch sizes in typical models.

We now analyze SL-DAIS assuming the surrogate log likelihood Ψ̂_L(z) differs from the full log likelihood. As is well known, the exact posterior for linear regression is given by N(μ_post, Λ_post⁻¹) with μ_post = Λ_post⁻¹ (Λ_0 μ_0 + (1/σ²_obs) Xᵀy) and Λ_post = Λ_0 + (1/σ²_obs) XᵀX. The gradient of the full log likelihood is given by

∇_z Ψ_L(D, z) = a − B z    (13)

where we have defined a ≡ (1/σ²_obs) Xᵀy and B ≡ (1/σ²_obs) XᵀX. We suppose that Ψ̂_L(z) is likewise a quadratic function of z but that Ψ̂_L(z) ≠ Ψ_L(D, z), so that we can write

∇_z Ψ̂_L(z) = (a + δa) − (B + δB) z.    (14)

For ∇_z Ψ̂_L(z) as in Eqn. 14 we can prove the following:

Proposition 3 Running SL-DAIS for linear regression with K steps under Assumption 1 and using ∇_z Ψ̂_L(z) as in Eqn. 14 results in a variational gap that can be bounded in terms of δa, δB, and K, where we have dropped higher order terms in δa and δB.
Furthermore, the KL divergence between q_fwd(z_K) and the posterior p_θ(z_K|D) is bounded by the same quantity.

This intuitive result can be proven by suitably adapting the results in Zhang et al. (2021); see Sec. A.5 in the supplemental materials for details. Proposition 3 suggests that SL-DAIS offers an intuitive trade-off between inference speed and fidelity. For example, for a fixed computational cost, SL-DAIS can accommodate a value of K that is a factor ∼ N/B larger than DAIS. For a sufficiently good surrogate log likelihood, the benefits of larger K can more than compensate for the error introduced by an imperfect Ψ̂_L(z). See Sec. 7.3 for concrete examples.

Many variational objectives that leverage importance sampling (IS) have been proposed. These include the importance weighted autoencoder (IWAE) (Burda et al., 2015; Cremer et al., 2017), the thermodynamic variational objective (Masrani et al., 2019), and approaches that make use of Sequential Monte Carlo (Le et al., 2017; Maddison et al., 2017; Naesseth et al., 2018). For a general discussion of IS in variational inference see Domke & Sheldon (2018). Early combinations of MCMC methods with variational inference were proposed by Salimans et al. (2015) and Wolf et al. (2016). A disadvantage of these approaches is the need to learn reverse kernels, a shortcoming that was later addressed by Caterini et al. (2018). Bayesian coresets enable users to run MCMC on large datasets after first distilling the data into a smaller number of weighted data points (Huggins et al., 2016; Campbell & Broderick, 2018). These methods do not support model learning. Finally, a number of authors have explored MCMC methods that enable data subsampling (Maclaurin & Adams, 2015; Quiroz et al., 2018; Zhang et al., 2020), including those that leverage stochastic gradients (Welling & Teh, 2011; Chen et al., 2014; Ma et al., 2015).

We set out to compare the performance of SL-DAIS and NS-DAIS to various MCMC and variational baselines. All our experiments are implemented using JAX (Bradbury et al., 2020) and NumPyro (Phan et al., 2019; Bingham et al., 2019).

We compare four ansätze for the surrogate log likelihood Ψ̂_L used in SL-DAIS. For concreteness we consider a logistic regression model with a Bernoulli likelihood p(y_n|z, x_n) governed by logits x_n · z, i.e. p(y_n = 1 | z, x_n) = σ(x_n · z), where σ(·) is the logistic function. In RAND we randomly choose N_surr surrogate data points {(ỹ_n, x̃_n)} ⊂ D, introduce a N_surr-dimensional vector of learnable weights ω, and let Ψ̂_L(z) = ∑_n ω_n log p(ỹ_n | z, x̃_n). In CS-INIT we proceed similarly but use a Bayesian coreset algorithm (Huggins et al., 2016) to choose the N_surr surrogate data points. In CS-FIX we likewise use a coreset algorithm to choose the surrogate data points but instead of learning ω we use the weights provided by the coreset algorithm. Finally, in NN we parameterize Ψ̂_L(z) as a neural network. See Table 1 for the results.

We can read off several conclusions from Table 1. First, using a neural ansatz works extremely poorly. This is not surprising, since a neural ansatz does not leverage the known likelihood function. Second, for the three methods that rely on surrogate data points {(ỹ_n, x̃_n)}, results improve as we increase N_surr, although the improvements are somewhat modest for the two inference problems we consider. Third, the simplest ansatz, namely RAND, works quite well; a minimal sketch of this ansatz is given below.
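The following is a minimal sketch of one possible JAX implementation of the RAND ansatz for the logistic regression model above. The exp transform used to keep ω positive and the initialization ∑_n ω_n = N (described in the appendix) are choices we make for the sketch; the helper names are illustrative.

```python
import jax
import jax.numpy as jnp

def make_rand_surrogate(rng_key, X, y, N_surr=256):
    # RAND ansatz: pick N_surr random data points and attach learnable
    # (positive) weights omega; the surrogate reuses the model's own
    # likelihood, here a Bernoulli likelihood with logits x_n . z.
    N = X.shape[0]
    idx = jax.random.choice(rng_key, N, shape=(N_surr,), replace=False)
    X_surr, y_surr = X[idx], y[idx]
    # Initialize so the total weight matches the dataset size, sum(omega) = N.
    log_omega_init = jnp.full((N_surr,), jnp.log(N / N_surr))

    def surrogate_loglik(z, log_omega):
        omega = jnp.exp(log_omega)          # positivity via an exp transform
        logits = X_surr @ z
        log_probs = y_surr * jax.nn.log_sigmoid(logits) + \
                    (1 - y_surr) * jax.nn.log_sigmoid(-logits)
        return jnp.sum(omega * log_probs)   # Psi_hat_L(z), O(N_surr) to evaluate

    return surrogate_loglik, log_omega_init
```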
This is encouraging because this ansatz is simple, generic, and easily automated, which makes it an excellent candidate for probabilistic programming frameworks. Consequently we use RAND in all subsequent experiments. Finally, the poor performance of CS-FIX makes it clear that the notion of Bayesian coresets, while conceptually similar, is not congruent with our use case for surrogate likelihoods. In particular the coreset notion in (Huggins et al., 2016) aims to approximate Ψ_L(D, z) across latent space as a whole. However, the zero-avoiding behavior of variational inference implies that reproducing the precise tail behavior of Ψ_L(D, z) is less important than accurately representing the vicinity of the posterior mode. Additionally, the Hamiltonian 'integration error' resulting from using finite η and K can be partially corrected for by learning ω jointly with q_0(z), something that a two-stage coreset-based approach is unable to do.

To better understand the limitations of NS-DAIS and SL-DAIS we consider a binary classification problem with imbalanced data. We expect both methods to find this regime challenging, since the full log likelihood Ψ_L(D, z) exhibits elevated sensitivity to the small number of terms involving the rare class. See Fig. 1 for the results. Test accuracies decrease substantially as the class imbalance increases; this is expected, since there is considerable overlap between the two classes in feature space. Strikingly, SL-DAIS with the RAND surrogate ansatz outperforms NS-DAIS across the board, with the gap increasing as the class imbalance increases. Indeed NS-DAIS barely outperforms a mean-field baseline (not shown for legibility). This suggests that SL-DAIS + RAND can be a viable strategy even in difficult regimes like the one explored here.

We use five logistic regression datasets to systematically compare seven variational methods and two MCMC methods. For each dataset we consider a somewhat moderate number of training data (N = 5 × 10⁴) so that we can include baseline methods that are impractical to scale to large N. Our baselines include three variational methods that use parametric distributions: mean-field Normal (MF); multivariate Normal (MVN); and a block neural autoregressive flow (Flow). We also include two MCMC baselines: HMCECS (Dang et al., 2019) and NUTS-CS, where for the latter we first form a Bayesian coreset consisting of 1024 weighted data points and then run NUTS (Hoffman et al., 2014). We emphasize that these two MCMC methods do not offer some of the benefits of variational methods (e.g. support for model learning) but we include them so that we can better assess the quality of the approximate posterior predictive distributions returned by the variational methods. See Fig. 2 and Table 2 for results. We find that SL-DAIS consistently outperforms NS-DAIS and in many cases it matches the performance of DAIS. Among the scalable methods Flow and SL-DAIS perform best, although SL-DAIS can be substantially faster than Flow, see Fig. 3. Fig. 3 also clarifies the computational benefits of SL-DAIS: SL-DAIS with K = 8 steps handily outperforms DAIS with K = 2 at about 2% of the computational cost. Finally, we find that the best variational methods yield predictive log likelihoods that are competitive with the MCMC baselines.

We model a geospatial precipitation dataset (Lyon, 2004; Lyon & Barnston, 2005) using a Gaussian process regressor with a heavy-tailed Student's t likelihood; see Sec. A.6.5 for details on the modeling setup.
We compare SL-DAIS with K = 4 and a multivariate Normal (MVN) base distribution q_0(z) to a baseline with a MVN variational distribution. Note that the dimension of the latent space D is equal to the number of inducing points. See Table 3 for the results. There is significant variability in the data and consequently the posterior function values exhibit considerable uncertainty. Due to the non-conjugate likelihood, the posterior over the inducing points features significant non-Gaussianity, especially as the number of inducing points increases and the inducing point locations become closer together.

We train Gaussian process classifiers on the classification datasets considered in Sec. 7.3. We compare SL-DAIS with K = 4 and a MVN base distribution to a baseline with a MVN variational distribution. We use 128 inducing points so the latent dimension is D = 128. See Table 4 for results. We find that SL-DAIS leads to consistently higher ELBOs and that this translates to small but non-negligible gains in test accuracy for the datasets with the largest ELBO gains.

What about models with global and local latent variables? For simplicity we evaluate the performance of the simplest DAIS-like inference procedure that can simultaneously accommodate local latent variables and data subsampling. In brief, we introduce a parametric distribution for the global latent variable and use DAIS to define a distribution over the local latent variables. Assuming the local latent variables are conditionally independent once we condition on the global latent variable, this results in N non-interacting DAIS chains. Consequently the ELBO is amenable to data subsampling; see Sec. A.1 for an extended discussion. To evaluate this approach we consider a robust regression model that uses a Student's t likelihood. We use the well-known Pol dataset from the UCI repository (see Sec. A.6.6). We compare two variational approaches, both of which use a mean-field Normal distribution for the global latent variable. To define an oracle baseline we integrate out the local latent variables before performing variational inference. This oracle represents an upper performance bound on the semi-parametric approach defined above, which we refer to as Semi-DAIS. See Table 5 for results. We find that for K = 16 Semi-DAIS 'recovers' about half of the gap between a fully mean-field baseline and the oracle. We expect this gap would decrease further for larger K.

Figure 2. We report ELBO improvements and test log likelihoods for the logistic regression experiment in Sec. 7.3. ELBO improvements are with respect to the mean-field (MF) Normal baseline. Circles denote variational methods and squares denote MCMC methods. Metrics are averaged over 7 independent replications and error bars denote standard errors (we do 3 independent replications for the most expensive methods, namely DAIS and HMCECS). See Sec. A.6.3 in the supplement for test accuracies. Note that numerals in method names indicate the number of HMC steps K used.

Table 2. We report performance ranks w.r.t. ELBO and test log likelihood across 5 train/test splits and 5 datasets for the logistic regression experiment in Sec. 7.3. Lower is better for all metrics. The rank satisfies 1 ≤ rank ≤ 7, since we compare 7 scalable variational methods.

Figure 3. We report the time per optimization step for each variational method in Fig. 2.

Table 5. We report results for the experiment in Sec. 7.6. In each case the reported ELBO improvement is above a mean-field baseline. Results are averaged over 10 replications.
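For concreteness, the following NumPyro sketch shows one way the kind of robust regression model with a global latent variable and per-datapoint local latent variables considered in Sec. 7.6 might be written, using a Gamma scale-mixture representation so that integrating out w_n recovers a Student's t likelihood. The scale-mixture parameterization, the priors, and ν = 4 are illustrative assumptions.

```python
import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist

def robust_regression(X, y=None, nu=4.0, subsample_size=256):
    N, D = X.shape
    # Global latent variables: regression coefficients and observation scale.
    z = numpyro.sample("z", dist.Normal(jnp.zeros(D), 1.0).to_event(1))
    sigma = numpyro.sample("sigma", dist.LogNormal(0.0, 1.0))
    # Local latent variables w_n: a Gamma scale mixture recovers a
    # Student's t likelihood with nu degrees of freedom when w_n is
    # integrated out.
    with numpyro.plate("data", N, subsample_size=subsample_size):
        X_b = numpyro.subsample(X, event_dim=1)
        y_b = numpyro.subsample(y, event_dim=0) if y is not None else None
        w = numpyro.sample("w", dist.Gamma(nu / 2.0, nu / 2.0))
        numpyro.sample("obs", dist.Normal(X_b @ z, sigma / jnp.sqrt(w)), obs=y_b)
```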
In this work we have focused on models with global latent variables. The experiment in Sec. 7.6 only scratches the surface of what is possible for the richer and more complex case of models that contain both global and local latent variables. Exploring hybrid scalable inference strategies for this class of models that combine gradient-based MCMC with variational methods is an interesting direction for future work.

The appendix is organized as follows. In Sec. A.1 we discuss local latent variable models. In Sec. A.2 we provide a proof of Lemma 1. In Sec. A.3 we discuss general properties of NS-DAIS and SL-DAIS. In Sec. A.4 we discuss NS-DAIS and in Sec. A.5 we discuss SL-DAIS. In Sec. A.6 we provide details about our experimental setup. In Sec. A.7 we present additional figures and tables that accompany those in the main text.

Another important class of models includes local latent variables {w_n} in addition to a global latent variable z, i.e. models with joint densities of the form

p_θ(D, W, z) = p_θ(z) ∏_{n=1}^N p_θ(w_n | z) p_θ(y_n | w_n, z, x_n),    W ≡ {w_n}_{n=1}^N.

A viable DAIS-like inference strategy for this class of models that would be scalable to large N is to adopt a semi-parametric approach. First, we introduce a flexible, parametric variational distribution q_φ(z) using, for example, a normalizing flow. After conditioning on a sample z ∼ q_φ(z) the N inference problems w.r.t. {w_n} effectively decouple. Consequently we can apply DAIS to each subproblem, resulting in an ELBO that accommodates unbiased mini-batch estimates.

In more detail we proceed as follows. Introduce a parametric variational distribution q_φ(z) and a mean-field distribution q_φ(W) that factorizes as q_φ(W) = ∏_n q_φ(w_n). We then construct the approximate posterior by running DAIS over W conditioned on z ∼ q_φ(z) (in this section we suppress momenta terms in the MCMC kernels for brevity), where the forward MCMC kernels target a log density of the form (here for β = 1)

∑_{n=1}^N ( log p_θ(w_n | z) + log p_θ(y_n | w_n, z, x_n) )

where z is kept fixed throughout the HMC dynamics. Since this density splits up into a sum over N terms, we actually have N independent DAIS chains (assuming our mass matrix is appropriately factorized). Since the log p_θ(D, W_K, z) term that enters into L also splits up into such a sum, this ELBO admits mini-batch sampling. That is, we choose a mini-batch of size B ≪ N and run B DAIS chains forward at each optimization step, so that e.g. most {w_n} are not instantiated. This of course is exactly how subsampling works with vanilla parametric mean-field variational distributions.

We can follow the same basic strategy to make this variational distribution more fully DAIS-like while still supporting subsampling. The basic idea is that we will have N + 1 DAIS chains. The N DAIS chains for each w_n will be as above. But we will also introduce a DAIS chain that is responsible for producing the z sample. To enable subsampling this z-chain must target a surrogate likelihood. In more detail we proceed as follows. To simplify the equations we assume K = 1. Here q_φ(z_0, W̃_0) and q_φ(W_0) are (distinct) simple parametric variational distributions. T_surr is a DAIS chain that targets a surrogate likelihood with learnable weights that includes N_surr data points and also N_surr local latent variables W̃. As above, T(W_1 | W_0, z̃_1) consists of N independent DAIS chains, all of which are conditioned on z̃_1. Note that since these N DAIS chains factorize there is no need for a surrogate here.
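The key computational point above, namely that the B mini-batch DAIS chains are independent conditioned on z and can be run in parallel, can be sketched with jax.vmap. This is a simplified illustration (identity mass matrix, momentum drawn once per chain, no ELBO bookkeeping), and the local_potential_fn signature is an assumption.

```python
import jax
import jax.numpy as jnp

def make_batched_local_chains(local_potential_fn, betas, eta):
    # local_potential_fn(w, z, x_n, y_n, beta) should return
    # -log p_theta(w|z) - beta * log p_theta(y_n|w, z, x_n).
    grad_V = jax.grad(local_potential_fn)

    def single_chain(rng_key, w0, z, x_n, y_n):
        # One uncorrected DAIS chain for a single local latent variable w_n,
        # with the global sample z held fixed.
        w, v = w0, jax.random.normal(rng_key, w0.shape)
        for beta in betas:
            v = v - 0.5 * eta * grad_V(w, z, x_n, y_n, beta)
            w = w + eta * v
            v = v - 0.5 * eta * grad_V(w, z, x_n, y_n, beta)
        return w

    # B independent chains, one per mini-batch data point, run in parallel.
    return jax.vmap(single_chain, in_axes=(0, 0, None, 0, 0))
```

The returned function expects B random keys, B initial values w_0, a single global sample z, and B covariate/target pairs; each chain only touches its own data point, which is what makes mini-batch estimation of the ELBO possible.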
We note that the output of the N + 1 DAIS chains isz 1 ,W 1 and W 1 .W 1 is a N surr -dimensional 'auxiliary variable' and so we essentially throw it out (nothing depends on it directly)-its only purpose is to 'integrate out' W-uncertainty in the surrogate likelihood. 8 By contrast W 1 is the 'actual' W sample that enters into p θ (D, W 1 ,z 1 ), i.e. our approximate posterior sample is (W 1 ,z 1 ). As above we can still subsample the N DAIS chains and get an ELBO that supports data subsampling. Basically we've replaced the parametric distribution for z above with a single DAIS chain that targets a surrogate likelihood. Yet another (somewhat simpler) variant of this approach would foregoW transitions in building up the z sample and instead parameterize the surrogate likelihood using a learnable point parameterW. As should now be evident, the space of models with mixed local/global latent variables opens up lots of algorithmic possibilities. We provide a proof of Lemma 1 in the main text. We reproduce the argument from Zhang et al. (2021); see Geffner & Domke (2021) for a similar derivation. The forward kernel T k in Eqn. 9 can be decomposed as Similarly the backward kernelT k in Eqn. 9 can be decomposed as where we flip the sign ofv k to account for time reversal. Since we have and Furthermore, since the leapfrog step is volume preserving and reversible we also have that From these equations the only tricky part of Lemma 1, i.e. the derivation of the kinetic energy difference correction, follows. Importantly, Eqn. 24 is true regardless of the target density used to define T k leap andT k leap ; in particular using a surrogate log likelihood leaves Lemma 1 intact. We prove Proposition 1 in the main text. In particular we show that the NS-DAIS and SL-DAIS approximate posteriors q fwd (z K ) satisfy the inequality where L is L NS−DAIS or L SL−DAIS . First consider SL-DAIS and letq bwd (z 0:K , v 0:K ) be the normalized distribution corresponding to q bwd (z 0:K , v 0:K ) so that log q bwd (z 0:K , v 0:K ) = log p θ (D) + logq bwd (z 0:K , v 0:K ) and write log p θ (D) = E q fwd log q bwd (z 0:K , v 0:K )−log q fwd (z 0:K , v 0:K ) + KL(q fwd (z 0:K , v 0:K )|q bwd (z 0:K , v 0:K )) = L SL−DAIS + KL(q fwd (z 0:K , v 0:K )|q bwd (z 0:K , v 0:K )) (27) = L SL−DAIS + KL(q fwd (z K )|p θ (z K |D)) + (28) E q fwd (z K ) [KL(q fwd (z 0:K , v 0:K |z K )|q bwd (z 0:K , v 0:K |z K ))] where we appealed to the chain rule of KL divergences which reads KL(q(a, b)|p(a, b)) = KL(q(a)|p(a)) + E q(b) [KL(q(a|b)|p(a|b))] The result then follows since the second KL divergence in Eqn. 28 is non-negative. The result for NS-DAIS follows by the same argument, with the difference that J is now one of the 'auxiliary' variables {z 0:K , v 0:K }. Note that the I index plays no role in this proof, since its role is to facilitate unbiased mini-batch estimates, i.e. it has no impact on the value of L NS−DAIS or L SL−DAIS (see the next section for details). We first elaborate how NS-DAIS is formulated. The forward and backward kernels are defined as: where J controls the potential energy used to guide the HMC dynamics and the corresponding ELBO is given by are the forward and backward kernels without the I mini-batch index. 9 Note that here and elsewhere q 0 is the momentum distribution. To be more specific, for each J the potential energy that guides each T k is given by See Algorithm 2 for the complete procedure and see Algorithm 3 to make a direct comparison to DAIS. Now we prove Proposition 2 in the main text. 
As in the main text let q(J ) denote the distribution that corresponds to sampling mini-batches of B indices J ⊂ {1, ..., N } without replacement. Let p θ (z|D J ) denote the posterior that corresponds to the data subset D J with appropriately scaled likelihood term, i.e. Further denote the 'aggregate pseudo-posterior' by We have (appealing to the convexity of the KL divergence) where in the last line we used that q(J ) has finite support and where we have appealed to the convergence result in Zhang et al. (2021) to conclude that for each J we have that KL(q fwd (z K |J )|p θ (z K |D J )) = O(K −1/2 ). We note that, as is well known, convergence in KL divergence implies convergence w.r.t. the total variation distance. Algorithm 2 NS-DAIS: Naive Subsampling Differentiable Annealed Importance Sampling. We highlight in blue where the algorithm differs from DAIS. To recover DAIS we substitute N B Ψ L (D J , z) → Ψ L (D, z), N B Ψ L (D I , z K ) → Ψ L (D, z K ), and B → N (or in other words remove all mini-batch sampling). Input: model log prior density Ψ 0 (z), model log likelihood Ψ L (D, z), base variational distribution q 0 (z), number of steps K, inverse temperatures {β k }, step size η, momentum refresh parameter γ, mass matrix M, dataset D of size N , mini-batch size B Initialize: Before we turn to a proof of Proposition 3 in the main text, we give a more formal description of SL-DAIS. The forward and backward kernels are defined as: and the corresponding ELBO is given by L SL−DAIS ≡ E q fwd log q bwd (z 0:K , v 0:K , I)−log q fwd (z 0:K , v 0:K , I) = E q fwd log q bwd (z 0:K , v 0:K )−log q fwd (z 0:K , v 0:K ) are the forward and backward kernels without the I mini-batch index. We now provide a proof of Proposition 3 in the main text. First we restate some of the equations and notation from the main text. We consider Bayesian linear regression with a prior p θ (z) = N (µ 0 , Λ −1 0 ) and a likelihood n N (y n |z · x n , σ 2 obs ), where each y n ∈ R and x n ∈ R D . The exact posterior is given by N (µ post , Λ −1 post ) where µ post = Λ −1 post (Λ 0 µ 0 + 1 σ 2 obs X y) and Λ post = Λ 0 + 1 σ 2 obs X X. Throughout we work under the following set of simplifying assumptions: Assumption 1 We use full momentum refreshment (γ = 0), equally spaced inverse temperatures {β k }, a step size that varies as η ∼ K −1/4 , and the prior as the base distribution (i.e. q 0 (z) → p θ (z)). Next we define the parameters of the 'surrogate posterior' N (μ post ,Λ post ) that corresponds to the surrogate log likelihood Ψ L (z) specified by δa and δB, where δa and δB control the degree to whichΨ L (z) is incorrectly specified. Given our Algorithm 3 DAIS: Differentiable Annealed Importance Sampling. We provide a complete description of DAIS (Geffner & Domke, 2021; Zhang et al., 2021) in our notation. Input: model log prior density Ψ 0 (z), model log likelihood Ψ L (D, z), base variational distribution q 0 (z), number of steps K, inverse temperatures {β k }, step size η, momentum refresh parameter γ, mass matrix M, it is easy to verify thatΛ Expanding these expressions in powers of δa and δB we have: Next we restate some results from Sec. 4.1 and the supplementary materials in Zhang et al. (2021), suitably adapting them to our setting and notation. Where appropriate we use subscripts/superscripts to distinguish DAIS and SL-DAIS terms. 
Lemma 2 In the case of Bayesian linear regression under Assumption 1 each of the K + 1 random variables {z k } that enter DAIS are normally distributed as z k ∼ N (µ k , Λ −1 k ). If we run DAIS for K steps the ELBO gap is given by where ||µ K − µ post || 2 Λpost = (µ K − µ post ) T Λ post (µ K − µ post ). Moreover, as is evident from the derivation in Sec. B.2 in Zhang et al. (2021) , this equation also holds for the gap log p(D) − L SL−DAIS if we substitute the SL-DAIS forward kernel q fwd into the expectation in (III) and µ k and Λ k are defined in terms of the {z k } that enter SL-DAIS. Additionally for SL-DAIS 10 we have that With these ingredients we can now proceed with the proof. From Eqn. 41 and Eqn. 45 we have Hence for SL-DAIS we have that Once again appealing to Eqn. 41 and Eqn. 45 we have so that we get the following estimate for the second term of Eqn. 44: Finally, from Eqn. 43 and Eqn. 45 we have which ends the first part of the proof for Proposition 3. The fact that the KL divergence between q fwd (z K ) and the posterior p θ (z K |D) is bounded by the same quantity easily follows from standard arguments, see e.g. the proof of Proposition 1 in Sec. A.3. All experiments use 64-bit floating point precision. We use the Adam optimizer with default momentum hyperparameters in all experiments (Kingma & Ba, 2014) . In all optimization runs we do 3 × 10 5 optimization steps with an initial learning rate of 10 −3 that drops to 10 −4 and 10 −5 at iteration 10 5 and 2 × 10 5 , respectively. Similar to Geffner & Domke (2021) we parameterize the step size η k in the k th iteration of DAIS/NS-DAIS/SL-DAIS as whereη and κ are learnable parameters and we choose η max = 0.25. The inverse temperatures {β k } are parameterized using the exponential transform to enforce positivity and a cumulative summation to enforce monotonicity. For SL-DAIS the learnable weights ω are uniformly initialized so that their total weight is equal to the number of data points, i.e. n ω n = N . In all experiments we use a diagonal mass matrix M. We initializeη to be small, e.g.η ∼ 10 −4 − 10 −2 , and initialize κ to κ = 0. We initialize γ to γ = 0.9. Unless noted otherwise (see GP models), we learn all parameters that define the model and variational distribution jointly in a single optimization run, i.e. without any kind of pre-training (in contrast to Geffner & Domke (2021) ). We found that this generally worked better, although in a few cases pre-training led to better results when the base distribution was a multivariate Normal. We set N surr = B = 256 and K = 8. We use the SUSY dataset with 20k data points held out for testing. The rare class corresponds to signal events. For each ratio of rare to non-rare class we keep the dataset fixed so that replications only differ with respect to the random number seed that controls optimization. A.6.3. LOGISTIC REGRESSION When running HMCECS (Dang et al., 2019) , we use the 'perturbed' approach with a NUTS kernel and use a control variate based on a second-order Taylor approximation around the maximum a posteriori estimate. In addition, to facilitate mixing Table 7 . We report performance ranks w.r.t. three metrics across 5 train/test splits and 5 datasets for the logistic regression experiment in Sec. 7.3. Lower is better for all metrics. The rank satisfies 1 ≤ rank ≤ 7, since we compare 7 scalable variational methods. locations Z with k-means and treat them as learnable parameters. We use a mini-batch size B = 128. 
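The parameterization of the inverse temperatures described above (an exponential transform for positivity followed by a cumulative sum for monotonicity) might look as follows; normalizing by the final entry so that β_K = 1 is our assumption.

```python
import jax.numpy as jnp

def inverse_temperatures(raw_params):
    # Map K unconstrained parameters to monotonically increasing inverse
    # temperatures 0 < beta_1 < ... < beta_K, normalized so that beta_K = 1.
    increments = jnp.exp(raw_params)   # positivity via the exponential transform
    betas = jnp.cumsum(increments)     # monotonicity via cumulative summation
    return betas / betas[-1]           # assumed normalization so that beta_K = 1
```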
After training ELBOs are estimated with 20k MC samples. The MVN base distribution of SL-DAIS is initialized using the result from the MVN baseline. A.6.5. ROBUST GAUSSIAN PROCESS REGRESSION We proceed exactly as in Sec. A.6.4 except we use a Student's t likelihood p(y n |ν, f n , σ obs ) = StudentT(y n |ν, f n , σ obs ). Here ν > 0 is the overdispersion parameter, with small ν corresponding to large overdispersion. We constrain ν to satisfy ν > 2, which is equivalent to requiring that the observation noise have finite variance. The Precipitation dataset is available from the IRI/LDEO Climate Data Library. 11 This spatio-temporal dataset represents the 'WASP' index (Weighted Anomaly Standardized Precipitation) at various latitudes and longitudes. Each data point corresponds to the WASP index for a given year (which itself is the average of monthly WASP indices). We use the data from 2010, resulting in a dataset with 10127 data points. In each train/test split we randomly choose N = 8000 data points for training and use the remaining 2127 data points for testing. We use a mini-batch size B = 128. We estimate the final ELBO using 5000 samples and test log likelihoods and accuracies using 1000 posterior samples. We use N surr = 256 surrogate data points with the RAND surrogate parameterization (see Sec. A.6.1). We use an initial step size η init = 10 −4 and initialize SL-DAIS using the model and variational parameters obtained with a MVN variational distribution. Figure 5 . We report the time per optimization step for each variational method in Fig. 2 on the CovType-Fir dataset. We compare CPU runtime (24 cores; Intel Xeon Gold 5220R 2.2GHz) to GPU runtime (NVIDIA Tesla K80). We note that the runtime comparison against DAIS would be increasingly favorable to NS-DAIS and SL-DAIS as the size of the dataset increases. A.6.6. LOCAL LATENT VARIABLE MODELS We use the Pol UCI dataset with 15k training and 15k test data points. We use 10k MC samples to estimate ELBOs after training. We set B = 256. For additional results pertaining to the logistic regression experiment in Sec. 7.3 see Fig. 4, Fig. 5 , and Table 7 . For an expanded version of the table in Sec. 7.1 see Table 6 . 
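The test log likelihoods reported above are computed from posterior samples; a standard Monte Carlo estimator for the posterior predictive log density is the log-mean-exp of per-sample log likelihoods, as in the minimal sketch below (the exact evaluation pipeline is an assumption).

```python
import jax.numpy as jnp
from jax.scipy.special import logsumexp

def predictive_log_likelihood(per_sample_loglik):
    # per_sample_loglik has shape (S, N_test), with entry (s, n) equal to
    # log p(y_n | z_s) for posterior sample z_s. The estimator averages
    # log (1/S) sum_s p(y_n | z_s) over the N_test held-out points.
    S = per_sample_loglik.shape[0]
    return jnp.mean(logsumexp(per_sample_loglik, axis=0) - jnp.log(S))
```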
References

Uci machine learning repository
Searching for exotic particles in high-energy physics with deep learning
The fundamental incompatibility of scalable hamiltonian monte carlo and naive data subsampling
A conceptual introduction to hamiltonian monte carlo
Deep universal probabilistic programming
Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables
Variational inference: A review for statisticians
JAX: composable transformations of Python+NumPy programs
Bayesian coreset construction via greedy iterative geodesic ascent
Automated scalable bayesian inference via hilbert coresets
Stochastic gradient hamiltonian monte carlo
Reinterpreting importance-weighted autoencoders
Hamiltonian monte carlo with energy conserving subsampling
The helmholtz machine
Block neural autoregressive flow
Importance weighting and variational inference
Mcmc variational inference via uncorrected hamiltonian annealing
Sandwiching the marginal likelihood using bidirectional monte carlo
Learning deep latent gaussian models with markov chain monte carlo
Stochastic variational inference
The no-u-turn sampler: adaptively setting path lengths in hamiltonian monte carlo
A generalized guided monte carlo algorithm
Coresets for scalable bayesian logistic regression
An introduction to variational methods for graphical models
Adam: A method for stochastic optimization
Auto-encoding sequential monte carlo
The strength of el niño and the spatial extent of tropical drought
Enso and the spatial extent of interannual precipitation extremes in tropical land areas
A complete recipe for stochastic gradient mcmc
Exact mcmc with subsets of data
The thermodynamic variational objective
Mobility trends provide a leading indicator of changes in sars-cov-2 transmission
Age groups that sustain resurging covid-19 epidemics in the united states
Variational sequential monte carlo
Annealed importance sampling
Mcmc using hamiltonian dynamics. Handbook of markov chain monte carlo
Analysis of 2.1 million sars-cov-2 genomes identifies mutations associated with transmissibility. medRxiv
Composable effects for flexible and accelerated probabilistic programming in numpyro
A unifying view of sparse approximate gaussian process regression
Speeding up mcmc by efficient data subsampling
Black box variational inference
Boosted decision trees as an alternative to artificial neural networks for particle identification
A contrastive divergence for combining variational inference and mcmc
Markov chain monte carlo and variational inference: Bridging the gap
Sparse gaussian processes using pseudo-inputs
Hamiltonian annealed importance sampling for partition function estimation
Monte carlo variational auto-encoders

Acknowledgements. We thank Tomas Geffner for answering questions about Geffner & Domke (2021). We thank Ola Rønning for contributing the open source NumPyro-based HMCECS implementation we used in our experiments.

A.6.1. SURROGATE LOG LIKELIHOODS

All methods use K = 8. We use 50k MC samples to estimate ELBOs after training. The training set has 50k data points and all replications are for the same training set. We consider eight different strategies for defining a surrogate log likelihood in the case of logistic regression:

1. RAND: We randomly choose N_surr surrogate data points {(ỹ_n, x̃_n)} ⊂ D, introduce a N_surr-dimensional vector of learnable weights ω, and let Ψ̂_L(z) = ∑_n ω_n log p(ỹ_n | z, x̃_n).
2. We randomly choose N_surr surrogate covariates {x̃_n} and introduce two N_surr-dimensional vectors of learnable weights, ω⁻ and ω⁺. We then write Ψ̂_L(z) = ∑_n ω⁺_n log p(y_n = 1 | z, x̃_n) + ∑_n ω⁻_n log p(y_n = 0 | z, x̃_n).

3. CS-INIT: We use a Bayesian coreset algorithm (Huggins et al., 2016) to choose N_surr surrogate data points, introduce a N_surr-dimensional vector of learnable weights ω, and let Ψ̂_L(z) = ∑_n ω_n log p(ỹ_n | z, x̃_n).

4. We proceed as in CS-INIT but ignore the {ỹ_n} returned by the coreset algorithm and instead introduce two N_surr-dimensional vectors of learnable weights, ω⁻ and ω⁺. We then write Ψ̂_L(z) = ∑_n ω⁺_n log p(y_n = 1 | z, x̃_n) + ∑_n ω⁻_n log p(y_n = 0 | z, x̃_n).

5. CS-FIX: As in CS-INIT we use a coreset algorithm to choose the surrogate data points but instead of learning ω we use the weights provided by the coreset algorithm.

6. KM-INIT±: We use k-means++ to extract N_surr cluster centroids from the set of N covariates in D and introduce two N_surr-dimensional vectors of learnable weights, ω⁻ and ω⁺. We then write Ψ̂_L(z) = ∑_n ω⁺_n log p(y_n = 1 | z, x̃_n) + ∑_n ω⁻_n log p(y_n = 0 | z, x̃_n) and treat the surrogate covariates {x̃_n} as learnable parameters.

7. KM-FIX±: We proceed as in KM-INIT± except the surrogate covariates returned by the clustering algorithm remain fixed.

8. NN: We parameterize Ψ̂_L(z) as a neural network with two hidden layers, ELU non-linearities, and 100 neurons per hidden layer.

We use block pseudo-marginal updates for data subsample indices with 25 blocks (Tran et al., 2016). We run for a total of 15 × 10³ iterations, discard the first 5 × 10³ samples for burn-in, thin the remaining 10 × 10³ samples by a factor of 10, and use these 10³ thinned samples to compute posterior predictive distributions on held-out test data. For NUTS we use the default settings in NumPyro, in particular a diagonal mass matrix that is adapted during warm-up and a maximum tree depth of 10.

For NUTS-CS we identify 1024 coresets using the coreset algorithm described in Huggins et al. (2016). Note that in most cases the total number of coresets returned by the algorithm is somewhat less than 1024, e.g. 1009 or 1017. We use the settings recommended in Huggins et al. (2016). In particular we use k-means++ to construct a clustering with k = 6 clusters. We set the hyperparameter a to a = 6. We then run NUTS (Hoffman et al., 2014) on a logistic regression model conditioned on the weighted coresets using the same number of NUTS iterations, mass matrix, etc., as for HMCECS.

The Higgs and SUSY datasets are described in (Baldi et al., 2014) and available from the UCI repository (Asuncion & Newman, 2007). The MiniBooNE dataset is from (Roe et al., 2005) and is likewise available from UCI. For the two CovType datasets (Blackard & Dean, 1999), which are also available from UCI, we reduce the 7-way classification problem to two binary classification problems: i) fir versus rest; and ii) pine versus rest.

Figure 4. As in Fig. 2 we report test accuracies for the logistic regression experiment in Sec. 7.3. Circles denote variational methods and squares denote MCMC methods. Metrics are averaged over 7 independent replications (except for DAIS and HMCECS where we do 3 independent replications), and error bars denote standard errors.

We consider a Gaussian process model for binary classification with a logistic link function. The covariates are assumed to be d-dimensional, x_n ∈ R^d.
To make GP inference scalable we use the FITC approximation (Snelson & Ghahramani, 2006; Quinonero-Candela & Rasmussen, 2005), which results in a model density of the following form:

p(D, f, u) = p(u | 0, K_ZZ) ∏_{n=1}^N N(f_n | m_n(u), v_n) Bernoulli(y_n | σ(f_n))

where Z ∈ R^{M×d} are the M inducing point locations, K refers to the RBF kernel parameterized by a (scalar) kernel scale and a d-dimensional vector of length scales, σ(·) is the sigmoid function, K_ZZ ∈ R^{M×M} is the prior covariance, u ∈ R^M are the inducing point values, f_n is the Gaussian process function value corresponding to covariate x_n, and m_n(u) and v_n are given by:

m_n(u) = k_nZ K_ZZ⁻¹ u,    v_n = k(x_n, x_n) − k_nZ K_ZZ⁻¹ k_nZᵀ.

Here k_nZ ∈ R^M with (k_nZ)_m = k(x_n, z_m) for m = 1, ..., M, where k(·, ·) is the kernel function. Since by construction the {f_n} are uncorrelated with one another, we can approximately integrate out {f_n} using Gauss-Hermite quadrature with Q quadrature points. In all our experiments we use Q = 16, M = 128, and N_surr = 512. This results in a model density of the form

p(D, u) = p(u | 0, K_ZZ) ∏_n p(y_n | u, x_n)    (53)

where the cost of computing the N likelihoods p(y_n | u, x_n) is O(N M² + M³ + N Q). We use variational inference to infer an approximate posterior over u for the model density in Eqn. 53. In practice N ≫ M and the cost of computing the full likelihood is dominated by the O(N M²) term. In all our Gaussian process experiments we initialize the inducing point locations Z with k-means and treat them as learnable parameters.
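The FITC quantities m_n(u) and v_n defined above can be computed as in the following sketch, which uses an ARD RBF kernel and a Cholesky solve; the jitter term and helper names are our own, and the Gauss-Hermite integration over f_n is omitted.

```python
import jax.numpy as jnp
from jax.scipy.linalg import cho_factor, cho_solve

def rbf_kernel(A, B, scale, lengthscales):
    # ARD RBF kernel between the rows of A and B.
    diff = (A[:, None, :] - B[None, :, :]) / lengthscales
    return scale ** 2 * jnp.exp(-0.5 * jnp.sum(diff ** 2, axis=-1))

def fitc_conditional(X, Z, u, scale, lengthscales, jitter=1e-6):
    # FITC conditional for f_n given inducing values u:
    #   m_n(u) = k_nZ K_ZZ^{-1} u
    #   v_n    = k(x_n, x_n) - k_nZ K_ZZ^{-1} k_nZ^T
    # Cost is O(N M^2 + M^3) for N data points and M inducing points.
    K_ZZ = rbf_kernel(Z, Z, scale, lengthscales) + jitter * jnp.eye(Z.shape[0])
    K_XZ = rbf_kernel(X, Z, scale, lengthscales)
    chol = cho_factor(K_ZZ, lower=True)
    m = K_XZ @ cho_solve(chol, u)
    v = scale ** 2 - jnp.sum(K_XZ * cho_solve(chol, K_XZ.T).T, axis=-1)
    return m, v
```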