Projected Wasserstein gradient descent for high-dimensional Bayesian inference

Yifei Wang, Peng Chen, Wuchen Li

February 12, 2021

Abstract. We propose a projected Wasserstein gradient descent method (pWGD) for high-dimensional Bayesian inference problems. The underlying density function of a particle system of WGD is approximated by kernel density estimation (KDE), which faces the long-standing curse of dimensionality. We overcome this challenge by exploiting the intrinsic low-rank structure in the difference between the posterior and prior distributions. The parameters are projected into a low-dimensional subspace to alleviate the approximation error of KDE in high dimensions. We formulate a projected Wasserstein gradient flow and analyze its convergence property under mild assumptions. Several numerical experiments illustrate the accuracy, convergence, and complexity scalability of pWGD with respect to parameter dimension, sample size, and processor cores.

1. Introduction. Bayesian inference is of central importance for inverse problems across many fields in machine learning, uncertainty quantification, and data assimilation for computational science and engineering. It provides a powerful tool and principled framework for quantifying the uncertainty of a given system by fusing complex data and models. Given observations of a system with uncertain parameters, Bayesian inference characterizes the optimal distribution of the parameters by drawing samples from the posterior distribution. The key task is to efficiently draw samples from the posterior distribution, especially for high-dimensional parameter spaces.

Classical approaches for Bayesian inference include Markov chain Monte Carlo (MCMC) based methods and transport-based variational inference. Many MCMC methods (Hwang et al., 2005; Neal et al., 2011; Welling and Teh, 2011; Duncan et al., 2017; Ma et al., 2019; Garbuno-Inigo et al., 2020) simulate a continuous-time diffusion process that keeps the posterior distribution as its invariant distribution. It is known that the Kolmogorov forward generator of overdamped Langevin dynamics in MCMC is equivalent to the gradient descent direction of the Kullback-Leibler (KL) divergence in probability space with the Wasserstein metric (Jordan et al., 1998; Ambrosio et al., 2008). From this viewpoint, these MCMC-related diffusion processes can also be viewed as modified Wasserstein gradient flows. Closely related to this perspective, transport-based variational inference methods (Liu and Wang, 2016; Liu, 2017; Detommaso et al., 2018; Duncan et al., 2019) seek to minimize the KL divergence between the transported distribution and the posterior distribution. They can also be viewed as gradient descent methods or Newton's methods in generalized Wasserstein spaces. Similarly, the Wasserstein Newton's method has been discussed in (Wang and Li, 2020). However, all of these sampling methods may face the curse of dimensionality for high-dimensional Bayesian inference problems: to achieve a certain level of accuracy, the computational complexity of the Wasserstein gradient direction of the Boltzmann-Shannon entropy grows rapidly (usually exponentially) with respect to the parameter dimension.
To alleviate the curse of dimensionality, different approaches have been developed by exploiting the sparsity (Marzouk et al., 2007; Schwab and Stuart, 2012; Schillings and Schwab, 2013; Schwab, 2015, 2016; Zech and Marzouk, 2020) or the intrinsic low dimensionality of the posterior distribution with respect to the parameter (Bui-Thanh et al., 2012; Cui et al., 2016; Chen et al., 2017; Beskos et al., 2017; Zahm et al., 2018; Brennan et al., 2020). In fact, the intrinsic low dimensionality, as revealed by derivative information with respect to the parameter, e.g., gradient, Hessian, or higher-order derivatives, has been observed in many uncertainty quantification problems such as model reduction for sampling and deep learning (Bashir et al., 2008; Chen et al., 2019a, 2021), and Bayesian optimal experimental design (Alexanderian et al., 2016; Crestel et al., 2017; Wu et al., 2020). In particular, the low dimensionality detected by gradient and Hessian information has been used in the projected Stein variational gradient descent and projected Stein variational Newton (Chen et al., 2019b) methods, respectively, to solve high-dimensional Bayesian inference problems. The complexity of these methods grows slowly or remains the same with increasing parameter dimension.

Contributions: In this paper, we present a projected version of the Wasserstein gradient descent method (WGD) to sample from high-dimensional posterior distributions. It combines the fast convergence of the Wasserstein gradient flow, which comes with theoretical guarantees, and favorable scaling to high dimensions with certified projection errors. Specifically, in implementing WGD we use kernel density estimation (KDE) to approximate a log-density term, the gradient of the entropy. However, it is well known that KDE suffers from the curse of dimensionality, i.e., its approximation accuracy quickly deteriorates with respect to the parameter dimension, which renders the WGD samples an inaccurate representation of the posterior in high dimensions. To address this challenge, we project the parameters onto a low-dimensional subspace constructed from gradient information of the log-likelihood; the KDE is then built only in this low-dimensional subspace. To further reduce the subspace dimension, we propose a batched version of KDE, which allows us to use KDE in batched subspaces of (arbitrarily) small dimension. Based on the projection of the parameters and the KDE, we develop a projected WGD (pWGD) and formulate a corresponding Wasserstein gradient flow (WGF) in the subspace. Under suitable assumptions, we show that this projected WGF can be embedded in the full space as a special WGF, which we further prove to converge as fast as (if not faster than) the original WGF without projection. Furthermore, thanks to the parallel structure of the ensemble transport of the samples, we also develop a parallel implementation of pWGD that distributes samples across multiple processors. Finally, we demonstrate the (higher) accuracy and (faster) convergence of pWGD compared to WGD, SVGD, and pSVGD, as well as the scalability of its computational complexity with respect to parameter dimension, sample size, and processor cores. A variety of numerical experiments are conducted with both low- and high-dimensional parameters, synthetic and real-world data, and both toy models and simplified differential equation models arising in environmental engineering, geoscience, and the epidemiology of COVID-19. The rest of the paper is organized as follows.
In Section 2, we introduce three major sampling methods, Langevin MCMC, WGD, and SVGD, with an analysis of their differences and relations. The proposed pWGD algorithm is presented and analyzed through the lens of Wasserstein gradient flows (WGF) in Section 3. Numerical experiments are presented in Section 4.

Let x ∈ R^d denote a random parameter with prior density p_0 : R^d → R, and let f(x) = p(y|x) denote a continuous likelihood of the data y at the point x. Then the posterior density of x conditioned on y is given by Bayes' rule

    π(x) = f(x) p_0(x) / Z,

where Z is a normalization constant defined as

    Z = ∫ f(x) p_0(x) dx,

which is often intractable to compute. The central task of Bayesian inference is to draw samples following the posterior distribution with density π(x). The samples can then be used to compute statistical quantities, e.g., mean, variance, and quantiles, to facilitate system optimization and decision making under uncertainty.

Langevin MCMC is a classical approach for sampling from a Bayesian posterior. Briefly, to sample from a target distribution with density π, we evolve a particle system {X^n_l}_{n=1}^N according to the update rule

    X^n_{l+1} = X^n_l + α_l ∇ log π(X^n_l) + √(2α_l) Z^n_l,    (1)

where Z^n_l ∼ N(0, I_d) follows the standard Gaussian distribution and α_l > 0 is the step size.

From another perspective, to sample from a target density, we can also consider a deterministic update rule given by Wasserstein gradient descent (WGD):

    x^n_{l+1} = x^n_l + α_l (∇ log π(x^n_l) − ξ_l(x^n_l)),    (2)

where ξ_l : R^d → R^d is a vector-valued function that approximates ∇ log ρ_l. Here we assume that {x^n_l}_{n=1}^N follows an underlying distribution ρ_l. Compared to Langevin MCMC, WGD has a deterministic update rule with particle interactions, and the randomness only comes from the initial positions of the particle system. In this paper, we use the kernel density estimation (KDE)

    ρ̂_l(x) = (1/N) Σ_{n=1}^N k(x, x^n_l)

to approximate ρ_l(x), where k(x, x') : R^d × R^d → R is a given positive kernel function. Using the KDE as an approximation of ρ_l, the mapping ξ_l(x) is given by

    ξ_l(x) = ∇ log ρ̂_l(x) = ( Σ_{n=1}^N ∇_x k(x, x^n_l) ) / ( Σ_{n=1}^N k(x, x^n_l) ).

One key drawback of KDE is that its approximation of the gradient of the log density, ∇ log ρ_l, often deteriorates significantly in high dimensions, a manifestation of the curse of dimensionality; see Scott (1991); Gramacki (2018).
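As a concrete illustration of the WGD update (2) with the KDE-based estimate of ∇ log ρ_l described above, here is a minimal Python sketch (not the authors' released implementation; it assumes a Gaussian kernel with a given bandwidth h and a user-supplied callable grad_log_post for ∇ log π):

```python
import numpy as np

def kde_grad_log_density(X, h):
    """Estimate grad log rho at each particle with a Gaussian-kernel KDE.

    X : (N, d) array of particles; h : kernel bandwidth (assumed given).
    Returns xi with xi[n] approximating grad log rho(X[n]).
    """
    diff = X[:, None, :] - X[None, :, :]            # (N, N, d), X[n] - X[m]
    sqdist = np.sum(diff ** 2, axis=-1)             # (N, N) squared distances
    K = np.exp(-sqdist / (2.0 * h))                 # kernel values k(x_n, x_m)
    grad_K = -diff / h * K[:, :, None]              # grad_x k(x_n, x_m)
    # Normalization constants of the KDE cancel in grad log rho-hat:
    return grad_K.sum(axis=1) / K.sum(axis=1, keepdims=True)

def wgd_step(X, grad_log_post, h, step_size):
    """One WGD update: x <- x + alpha * (grad log pi(x) - xi(x))."""
    xi = kde_grad_log_density(X, h)
    return X + step_size * (grad_log_post(X) - xi)
```

Iterating wgd_step on particles drawn from the prior reproduces the particle system described above; the same estimator reappears later in the projected, low-dimensional setting.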
Another deterministic update rule was proposed by Liu and Wang (2016) as a variational inference method, called Stein variational gradient descent (SVGD). In each iteration, it updates the particles as

    x^n_{l+1} = x^n_l + (α_l / N) Σ_{m=1}^N ( k(x^m_l, x^n_l) ∇ log π(x^m_l) + ∇_{x^m_l} k(x^m_l, x^n_l) ),    (3)

where k(x, y) is a positive kernel function.

2.1. Relation. We present a continuous-time dynamical viewpoint to illustrate how the particle systems generated by Langevin MCMC (1), WGD (2), and SVGD (3) approximate the posterior distribution. As α_l → 0, Langevin MCMC (1) can be viewed as a time discretization of the overdamped Langevin dynamics

    dX_t = ∇ log π(X_t) dt + √2 dW_t,    (4)

where W_t is the standard Brownian motion. The probability density ρ_t of the overdamped Langevin dynamics (4) satisfies the Fokker-Planck equation

    ∂_t ρ_t = −∇ · (ρ_t ∇ log π) + Δρ_t = ∇ · (ρ_t ∇ log(ρ_t / π)).    (5)

In continuous time, WGD approximates the following deterministic mean-field particle system, namely the mean-field Wasserstein dynamics:

    dx_t/dt = ∇ log π(x_t) − ∇ log ρ_t(x_t).    (6)

Here the terminology mean-field indicates that the evolution of the dynamics (6) depends on the current density function ρ_t. The Liouville equation of the dynamical system (6) writes

    ∂_t ρ_t = −∇ · (ρ_t (∇ log π − ∇ log ρ_t)).    (7)

We note that equations (5) and (7) are identical, by the identity ∇ · (ρ_t ∇ log ρ_t) = Δρ_t. From an optimization perspective, they all correspond to the gradient flow of the KL divergence under the Wasserstein metric.

To be concrete, consider the following optimization problem in the probability space:

    min_{ρ ∈ P(Ω)} D_KL(ρ ‖ π).    (8)

Here P(Ω) = {ρ ∈ F(Ω) : ∫ ρ dx = 1, ρ ≥ 0}, F(Ω) denotes the set of all smooth functions on Ω, and D_KL(ρ ‖ π) evaluates the KL divergence from ρ to π. The Wasserstein gradient flow of D_KL(ρ ‖ π) writes

    ∂_t ρ_t = ∇ · (ρ_t ∇ (δ/δρ) D_KL(ρ_t ‖ π)).    (9)

Here δ/δρ represents the L² first-order variation operator. A brief introduction to metrics and gradient flows in the probability space can be found in (Ambrosio et al., 2008). Substituting the KL divergence into the Wasserstein gradient flow (9), we immediately recover the Fokker-Planck equation (5) and the Liouville equation (7). As a result, as t → ∞, ρ_t following the gradient flow (9) asymptotically converges to the unique minimizer π of (8).

On the other hand, SVGD approximates the following particle dynamics, called the mean-field Stein dynamics:

    dx_t/dt = φ_t(x_t),    (10)

where φ_t : R^d → R^d is a transport mapping given by

    φ_t(x) = ∫ ( ∇_y log π(y) k(y, x) + ∇_y k(y, x) ) ρ_t(y) dy.

Indeed, φ_t approximates ∇ log π − ∇ log ρ_t in the reproducing kernel Hilbert space (RKHS) (Liu et al., 2019, Theorem 2). The Liouville equation of the dynamics (10) writes

    ∂_t ρ_t = −∇ · (ρ_t φ_t).

In short, the above equation is an approximation of the Liouville equation (7) in the RKHS. The connections among Langevin MCMC, WGD, and SVGD are illustrated in Figure 1.

As previously mentioned, the high dimensionality d may impede the approximation of ∇ log ρ_l in WGD. To tackle this curse of dimensionality, we employ a projection method for dimension reduction and evolve particles in an r-dimensional subspace of R^d, where r is usually significantly smaller than d. To find such a data-informed parameter subspace, we use the gradient information of the log-likelihood by defining a matrix H ∈ R^{d×d} as

    H = ∫ ∇ log f(x) (∇ log f(x))^T π(x) dx.    (12)

Suppose that the prior density p_0 is given in the general form

    p_0(x) ∝ exp(−V(x) − W(x)),    (13)

where ∇²V(x) ⪰ Γ for a positive semi-definite matrix Γ and W is a bounded function. For instance, Γ can be the covariance of a Gaussian prior. Let (λ_i, ψ_i)_{i=1}^d denote the generalized eigenpairs of (H, Γ), where (λ_i, ψ_i) corresponds to the i-th largest generalized eigenvalue,

    H ψ_i = λ_i Γ ψ_i,  with λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_d ≥ 0.    (14)

An important observation is that the eigenvalue λ_i evaluates the sensitivity of the likelihood to variations of the parameter along the direction ψ_i. In other words, the likelihood function f does not change much in directions ψ_i with small eigenvalues λ_i. In practice, we truncate at r such that λ_r ≥ ε ≥ λ_{r+1} for a given tolerance ε. Moreover, we apply a randomized SVD algorithm (Saibaba et al., 2016) to compute (λ_i, ψ_i)_{i=1}^r, which takes O(r) matrix-vector products in computational cost.

Define a linear projection operator

    P_r x := Σ_{i=1}^r ψ_i ⟨ψ_i, x⟩ = Ψ_r w.

Here we denote Ψ_r = (ψ_1, ..., ψ_r) ∈ R^{d×r} and w ∈ R^r with entries w_i = ⟨ψ_i, x⟩. Based on this projection operator, we seek a profile function g : R^r → R such that g(P_r x) is a good approximation of the likelihood f(x). For a given profile function g, we define a projected density π_r : R^d → R as

    π_r(x) = (1/Z_r) g(P_r x) p_0(x),

where Z_r is a normalization constant defined by Z_r = ∫ g(P_r x) p_0(x) dx. According to (Zahm et al., 2018), there exists an optimal profile function g* that minimizes the KL divergence from π to the projected density, i.e., D_KL(π ‖ π*_r) ≤ D_KL(π ‖ π_r) for every profile function g, where π*_r is the projected density defined by the optimal profile function g*. This profile function can be given explicitly as the conditional expectation of the likelihood over the complement space,

    g*(P_r x) = ∫_{X^⊥} f(P_r x + x^⊥) p^⊥_0(x^⊥ | P_r x) dx^⊥,

where X^⊥ is the complement of the subspace X_r spanned by ψ_1, ..., ψ_r, and p^⊥_0(· | P_r x) is the conditional density of the prior on X^⊥ given P_r x. Moreover, under the assumption that the prior satisfies (13), it is shown in (Zahm et al., 2018) that

    D_KL(π ‖ π*_r) ≤ (γ/2) Σ_{i=r+1}^d λ_i,

where the constant is γ = exp(sup W(x) − inf W(x)).
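As an illustration of how the basis Ψ_r in (14) can be computed in practice, here is a minimal dense linear-algebra sketch (the paper uses a randomized algorithm for scalability; the Monte Carlo form of H anticipates the sample approximation introduced below, and all names are illustrative):

```python
import numpy as np
from scipy.linalg import eigh

def build_projection_basis(grads_log_f, Gamma, tol=1e-4):
    """Build Psi_r from gradients of the log-likelihood at N samples.

    grads_log_f : (N, d) array, rows are grad log f evaluated at samples.
    Gamma       : (d, d) symmetric positive definite matrix from (13).
    Returns (Psi_r, lam_r): basis columns and eigenvalues above `tol`.
    """
    N, d = grads_log_f.shape
    H = grads_log_f.T @ grads_log_f / N        # Monte Carlo estimate of H
    lam, Psi = eigh(H, Gamma)                  # solves H psi = lam * Gamma * psi
    order = np.argsort(lam)[::-1]              # sort by decreasing eigenvalue
    lam, Psi = lam[order], Psi[:, order]
    r = max(1, int(np.sum(lam >= tol)))        # truncate at the tolerance
    return Psi[:, :r], lam[:r]
```

Note that eigh returns a Γ-orthonormal basis (Ψ_r^T Γ Ψ_r = I_r); if a Euclidean-orthonormal basis is preferred for the projection P_r, the columns can be re-orthonormalized, e.g., with a QR factorization.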
By the projection operator P_r, we can decompose the parameter as x = x_r + x^⊥ (with x_r := P_r x = Ψ_r w) and the prior as p_0(x) = p^r_0(x_r) p^⊥_0(x^⊥ | x_r). Because p^r_0 only depends on x_r = Ψ_r w, a prior density for w can be defined as p̃_0(w) = p^r_0(Ψ_r w). Based on this prior for w and the optimal profile function g = g*, the posterior for w writes

    π̃(w) = (1/Z̃) g(Ψ_r w) p̃_0(w),

where Z̃ = ∫ g(Ψ_r w) p̃_0(w) dw is a normalization constant. With π̃(w), we can rewrite the projected density π_r as π_r(x) = π̃(w) p^⊥_0(x^⊥ | Ψ_r w). Hence, to sample x from π_r, it suffices to sample w from π̃ and x^⊥ from p^⊥_0(x^⊥ | Ψ_r w). To sample from π̃, we apply WGD to w:

    w^n_{l+1} = w^n_l + α^n_l (∇_w log π̃(w^n_l) − ξ_r(w^n_l)).    (16)

Here ξ_r is an approximation of ∇_w log ρ^r_l(w), where we assume that {w^n_l}_{n=1}^N follow the distribution ρ^r_l. As before, we can use a KDE in the subspace to compute this approximation,

    ξ_r(w) = ( Σ_{n=1}^N ∇_w k_r(w, w^n_l) ) / ( Σ_{n=1}^N k_r(w, w^n_l) ),

where k_r : R^r × R^r → R is a positive kernel function. A common choice is a Gaussian kernel whose bandwidth h is computed from the current samples w^n_l, e.g., as the median of their squared pairwise distances (Liu and Wang, 2016) or through optimization (Wang and Li, 2019). For the step size α^n_l in (16), we use a line search technique (Chen et al., 2019b; Chen and Ghattas, 2020).

In practice, to construct the basis Ψ_r, instead of computing H in (12) with samples from the posterior, which are not available, we approximate it using the particles at step l by

    Ĥ_l = (1/N) Σ_{n=1}^N ∇ log f(x^n_l) (∇ log f(x^n_l))^T.    (17)

Instead of computing the matrix Ĥ_l and the projection basis at every step, we recompute them every L steps to save computational cost. Besides, we approximate the optimal profile function g*(x) by f(P_r x), since f has negligible change in the complement space (Zahm et al., 2018). The overall projected Wasserstein gradient descent (pWGD) algorithm is summarized in Algorithm 1, where the convergence criterion can be a small particle move and/or a maximum number of iterations. Note that for convenience we fix x^{n,⊥}_0 (computed in line 5) in the complement space of each reconstructed subspace with basis Ψ_r, since it is not informed by the data. Thanks to the parallel structure of the sample updates, we also develop a parallel implementation of this algorithm.

Algorithm 1: projected Wasserstein gradient descent (pWGD)
Require: initial particles {x^n_0}_{n=1}^N drawn from the prior.
1: Set l = 0.
2: while not converged do
3:    if l mod L = 0 then
4:        Construct the basis Ψ_r by (14) with Ĥ_l in (17).
5:        Compute w^n_l = Ψ_r^T x^n_l and x^{n,⊥}_0 = x^n_l − Ψ_r w^n_l.
6:    end if
7:    Update the WGD particle system by (16).
8:    Set x^n_{l+1} = Ψ_r w^n_{l+1} + x^{n,⊥}_0 and l = l + 1.
9: end while
10: return {x^n_l}_{n=1}^N

3.1. Batched kernel density estimate. For some problems, the dimension r after projection can still be relatively large for KDE to accurately approximate the density. Hence, we can further consider hyper-projection operators P^(1), P^(2), ..., P^(s) : R^r → R^r which satisfy P^(1) + ⋯ + P^(s) = I_r. For j = 1, ..., s, we denote

    ξ_{r,j}(w) = ( Σ_{n=1}^N ∇ k_{r,j}(P^(j) w, P^(j) w^n_l) ) / ( Σ_{n=1}^N k_{r,j}(P^(j) w, P^(j) w^n_l) ),

where k_{r,j} is a Gaussian kernel whose bandwidth is calculated based on {P^(j) w^n_l}_{n=1}^N. In each iteration, we iteratively apply the batched KDE to update

    w^n_{l,j+1} = w^n_{l,j} + α_l (P^(j+1) ∇_w log π̃(w^n_{l,j}) − ξ_{r,j+1}(w^n_{l,j})),

with w^n_{l,0} = w^n_l, for j = 0, ..., s−1. Then we set w^n_{l+1} = w^n_{l,s}.
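Putting the pieces of Algorithm 1 together, the following sketch (reusing kde_grad_log_density and build_projection_basis from the earlier sketches; line search, batching, and convergence checks are omitted, and Ψ_r is treated as if it had orthonormal columns so that w = Ψ_r^T x) shows the basic pWGD iteration:

```python
import numpy as np
# kde_grad_log_density and build_projection_basis are defined in the
# earlier sketches; grad_log_post_w is a problem-specific callable for
# the gradient of the log posterior of the coefficients w.

def pwgd(X0, grad_log_f, grad_log_post_w, Gamma, step_size,
         n_iters=200, rebuild_every=10, tol=1e-4, h=1.0):
    """Projected WGD sketch: evolve particles in a data-informed subspace.

    X0         : (N, d) initial particles drawn from the prior.
    grad_log_f : callable, (N, d) -> (N, d), gradients of log f at particles.
    """
    X = X0.copy()
    for l in range(n_iters):
        if l % rebuild_every == 0:
            Psi_r, _ = build_projection_basis(grad_log_f(X), Gamma, tol)
            W = X @ Psi_r                      # w^n = Psi_r^T x^n
            X_perp = X - W @ Psi_r.T           # frozen complement components
        xi_r = kde_grad_log_density(W, h)      # KDE estimate of grad log rho_r
        W = W + step_size * (grad_log_post_w(W, Psi_r) - xi_r)
        X = W @ Psi_r.T + X_perp               # lift back to the full space
    return X
```

The batched variant of Section 3.1 would replace the single KDE call with a loop over the hyper-projections P^(j).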
3.2. Projected Wasserstein gradient flow in full space. For a fixed projection operator, pWGD in continuous time corresponds to the projected Wasserstein gradient flow

    ∂_t ρ̃_t = ∇_w · (ρ̃_t ∇_w log(ρ̃_t / π̃)),  with ρ̃_t|_{t=0} = p̃_0,

where ρ̃_t is the density of w_t. In continuous time, the particle dynamics of pWGD in terms of w follows

    dw_t/dt = ∇_w log π̃(w_t) − ∇_w log ρ̃_t(w_t).

To characterize the dynamics of pWGD in the full space, we consider the situation in which

    x_t = Ψ_r w_t + (I − P_r) z,

where z ∈ R^d is a fixed vector. Under this assumption, P_r x_t = Ψ_r w_t and (I − P_r) x_t = x^⊥_0 are independent. Denote by ρ_t the density function of x_t. We obtain the following Wasserstein gradient flow for ρ_t in the full space, whose proof is provided in Appendix A.

Proposition 1. The density ρ_t satisfies

    ∂_t ρ_t = ∇ · (ρ_t ∇ log(ρ_t / π̂_r)),

where the density π̂_r(x) is defined as

    π̂_r(x) ∝ π̃(Ψ_r^T x) p_0((I − P_r) x + P_r z).

Proof. We note that

    ∇ log π̂_r(x) = Ψ_r ∇_w log π̃(Ψ_r^T x) + (I − P_r) ∇ log p_0((I − P_r) x + P_r z).

Based on this observation, along the dynamics x_t = Ψ_r w_t + (I − P_r) z the transport direction of x_t coincides with ∇ log π̂_r(x_t) − ∇ log ρ_t(x_t). Hence, as the density function of x_t, ρ_t satisfies the Wasserstein gradient flow stated above.

3.3. Convergence analysis in continuous time. Suppose that −log π is µ-strongly convex, where µ > 0. From the classical analysis (Villani, 2008, Theorem 24.7), the Wasserstein gradient flow (9) has the exponential convergence rate

    D_KL(ρ_t ‖ π) ≤ e^{−2µt} D_KL(ρ_0 ‖ π).

For a fixed projection operator, pWGD in continuous time corresponds to the projected Wasserstein gradient flow above with ρ̃_t|_{t=0} = p̃_0. Assume that −log π̃ is µ̃-strongly convex, where µ̃ > 0. Similarly, the projected Wasserstein gradient flow has the exponential convergence rate

    D_KL(ρ̃_t ‖ π̃) ≤ e^{−2µ̃t} D_KL(ρ̃_0 ‖ π̃).

Under suitable assumptions we can show that the convergence of the projected Wasserstein gradient flow is at least as fast as the convergence of the Wasserstein gradient flow in the full space. The proof is provided in Appendix B.

Proposition 2. Suppose that the prior p_0 follows an isotropic Gaussian distribution, so that −∇² log p_0(x) = ξ I, and that −log π is µ-strongly convex. Assume that σ_min(−∇²f(x)/f(x)) ≥ µ − ξ for all x ∈ R^d. Then µ̃ ≥ µ.

Proof. First, we note that log π = log f + log p_0 − log Z. Because −log π is µ-strongly convex, for all x ∈ R^d,

    σ_min(−∇² log f(x) − ∇² log p_0(x)) ≥ µ,

where σ_min(A) denotes the smallest eigenvalue of a symmetric matrix A. We note that −∇² log f(x) = −∇²f(x)/f(x) + ∇f(x)∇f(x)^T/f(x)², and since ∇f(x)∇f(x)^T/f(x)² is a positive semi-definite rank-1 matrix, we can assume that σ_min(−∇²f(x)/f(x)) ≥ µ − ξ. On the other hand, we note that log π̃(w) = log g(Ψ_r w) + log p^r_0(Ψ_r w) − log Z̃. Thus, µ̃ shall satisfy, for all w ∈ R^r,

    σ_min(−Ψ_r^T (∇² log g(Ψ_r w) + ∇² log p^r_0(Ψ_r w)) Ψ_r) ≥ µ̃.

We note that −∇² log g(Ψ_r w) admits the same decomposition as −∇² log f, with a positive semi-definite rank-one correction. Recalling that g is the conditional average of f over the complement space, this indicates that, for v ∈ R^r, σ_min(−Ψ_r^T ∇² log g(Ψ_r w) Ψ_r) ≥ µ − ξ. Besides, because −∇² log p_0(x) = ξ I, we also have −∇² log p^r_0(Ψ_r w) = ξ I. Hence we have σ_min(−Ψ_r^T (∇² log g(Ψ_r w) + ∇² log p^r_0(Ψ_r w)) Ψ_r) ≥ µ, and thus µ̃ ≥ µ.

3.4. Approximation error of the optimal profile function. Suppose that the information matrix H defined in (12) has a sharp eigenvalue decay. This can make r significantly smaller than d, and it also yields that, for z ∈ X^⊥ with ‖z‖_Γ = 1,

    z^T H z = ∫ ⟨∇ log f(x), z⟩² π(x) dx ≤ λ_{r+1},    (19)

so the log-likelihood varies little along directions in X^⊥. Hence, we can assume that the following statement holds.

Assumption 1. Given the projection operator P_r, for all w ∈ R^r and z_1, z_2 ∈ X^⊥, there exists a small constant ε_1 ≥ 0 such that

    |f(Ψ_r w + z_1) − f(Ψ_r w + z_2)| ≤ ε_1 f(Ψ_r w + z_2).

Based on this assumption, we have the following estimate of the difference between the optimal profile function g(Ψ_r w) and f(Ψ_r w), whose proof is provided in Appendix C.

Proposition 3. Under Assumption 1, there exist constants 0 < δ_2 < 1 < δ_1, determined by ε_1, such that for all w ∈ R^r,

    δ_2 f(Ψ_r w) ≤ g(Ψ_r w) ≤ δ_1 f(Ψ_r w).

Proof. Under Assumption 1, for all w ∈ R^r and z ∈ X^⊥ we have f(Ψ_r w + z) ≤ δ_1 f(Ψ_r w), where δ_1 > 0 is a constant, and likewise f(Ψ_r w + z) ≥ δ_2 f(Ψ_r w), where δ_2 ∈ (0, 1) is a constant. Since g(Ψ_r w) is the average of f(Ψ_r w + z) over z ∈ X^⊥ with respect to the conditional prior p^⊥_0, the same bounds carry over to g(Ψ_r w).

From (19), ε_1 in Assumption 1 can be close to zero. We also note that for small ε_1, the constants δ_1, δ_2 are close to 1.

In this section, we present a variety of numerical experiments to demonstrate the accuracy, convergence, and scalability of pWGD compared to WGD, SVGD, and pSVGD. The code for all the results is available at https://github.com/cpempire/pWGD. We first present two toy examples.
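Both toy examples compare two bandwidth choices for the kernel used in the KDE. For concreteness, here is a minimal sketch of the median (MED) heuristic; the log(N+1) scaling is one common variant of the rule of Liu and Wang (2016) and may differ from the exact implementation used in the experiments:

```python
import numpy as np

def median_bandwidth(X):
    """Median (MED) heuristic for the Gaussian kernel bandwidth.

    X : (N, d) particles, N >= 2. Uses the median of the pairwise squared
    distances, scaled by log(N + 1) (a common variant of the heuristic).
    """
    sqdist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    med2 = np.median(sqdist[np.triu_indices_from(sqdist, k=1)])
    return med2 / np.log(X.shape[0] + 1)
```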
The first example is a bi-modal posterior distribution with a Gaussian prior. WGD-MED and WGD-BM denote WGD with the kernel bandwidth calculated by the MED method (Liu and Wang, 2016) and the BM method (Wang and Li, 2019), respectively. We compare WGD-MED and WGD-BM with SVGD. The results are presented in Figure 2. We note that WGD converges much faster than SVGD, and that WGD-BM captures the variance of the posterior distribution better than WGD-MED. The second example is a double-banana-shaped posterior distribution with a Gaussian prior. The results are presented in Figure 3. WGD is again faster than SVGD, and WGD-MED and WGD-BM have similar performance.

Next, we consider a linear Bayesian inference problem governed by the diffusion-reaction equation

    −κ Δu + ν u = x in D,  u = 0 on ∂D,

where D is a physical domain, x is an infinite-dimensional contaminant source field parameter to be inferred, u is the contaminant concentration, which we can observe at some locations, and κ and ν are diffusion and reaction coefficients. Since the observable u depends linearly on the parameter x, we have a linear Bayesian inference problem. Under the assumption of a Gaussian prior x ∼ N(x_0, C) and Gaussian observation noise, the posterior is Gaussian, with mean and covariance that can be given explicitly. For simplicity, we set κ = ν = 1, D = (0, 1), u(0) = u(1) = 0, and consider 15 pointwise observations of u with 1% noise, equidistantly distributed in D. We set x_0 = 0 and use a covariance given by the differential operator C = (−δΔ + γI)^{−α} with δ, γ, α > 0 representing the correlation length and variance, which is commonly used in geoscience (Lindgren et al., 2011). We set δ = 0.1, γ = 1, α = 1. We solve the forward model by a finite element method, with meshes of increasing resolution leading to parameter dimensions d = 17, 65, 257 after discretization.

We compare pWGD, pWGD with batched KDE (pWGD-batch), WGD, SVGD, and pSVGD. For all compared methods, we use a (small) sample size N = 16 and a projection tolerance of 10^{-4} (leading to a subspace of dimension 8). A batch size of 5 (smaller than the subspace dimension 8) is used for pWGD-batch. We evaluate the accuracy of the sampling methods by the L²-norm errors of the sample mean and pointwise variance of the parameter x with respect to its posterior distribution, shown in Figure 4, which displays the convergence of the root mean square error (RMSE) of the sample mean (left) and variance (right) for dimensions d = 17, 65, 257. We can observe from the right figures that, with increasing dimension, WGD and SVGD cannot capture the variance as their samples collapse to the mean, while pWGD and pSVGD preserve the accuracy of the variance. Both WGD and pWGD converge faster and achieve higher accuracy than SVGD and pSVGD. From the left figures we see that pWGD-batch produces a more accurate sample mean than pWGD, with comparable accuracy for the variance.

In the next experiment, we consider a nonlinear Bayesian inference problem constrained by the following PDEs for subsurface (Darcy) flow:

    v + e^x ∇u = 0,  ∇ · v = h in D,

where u is the pressure, v is the velocity, h is the force, and e^x is the uncertain permeability field, equipped with a Gaussian prior x ∼ N(x_0, C) with C = (−δΔ + γI)^{−α}, where we set δ = 0.1, γ = 1, α = 2, and x_0 = 0. We set D = (0, 1)² and use a finite element method for the discretization of the problem. The data are generated as pointwise observations at 49 points equidistantly distributed in (0, 1)², corrupted with additive 5% Gaussian noise. We run the WGD and pWGD algorithms for 200 iterations with parameter dimensions d = 9², 17², 33², 65², sample sizes N = 64, 128, 256, 512, and K = 1, 2, 4, 8, 16, 32 CPU processor cores for their parallel implementation.
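The parallel implementation distributes the N samples evenly over the K cores; within each projection rebuild, the cores only need to combine their local contributions to Ĥ_l in (17) and gather the projected particles needed for the subspace KDE. A minimal mpi4py sketch of this communication pattern (function names are illustrative, and the averaged-gradient form of Ĥ_l is assumed):

```python
import numpy as np
from scipy.linalg import eigh
from mpi4py import MPI

def parallel_projection_step(X_local, grad_log_f_local, Gamma, tol=1e-4):
    """Per-core work for one projection rebuild in a parallel pWGD step.

    Each of the K cores owns X_local of shape (N_local, d). The global
    H-hat in (17) is an average of per-core Gram matrices, and the
    subspace KDE needs the projected particles of all cores.
    """
    comm = MPI.COMM_WORLD
    G = grad_log_f_local(X_local)                    # local gradients of log f
    H_local = G.T @ G                                # local sum of outer products
    H_global = np.zeros_like(H_local)
    comm.Allreduce(H_local, H_global, op=MPI.SUM)    # sum contributions of all cores
    N_total = comm.allreduce(X_local.shape[0], op=MPI.SUM)
    H_hat = H_global / N_total                       # Monte Carlo average, cf. (17)
    lam, Psi = eigh(H_hat, Gamma)                    # H psi = lam * Gamma * psi
    order = np.argsort(lam)[::-1]
    r = max(1, int(np.sum(lam[order] >= tol)))       # truncate at the tolerance
    Psi_r = Psi[:, order[:r]]
    W_local = X_local @ Psi_r                        # project local particles
    W_all = np.concatenate(comm.allgather(W_local), axis=0)   # shared for the KDE
    return Psi_r, W_local, W_all
```

For large d, forming H-hat densely as above becomes the bottleneck; the randomized algorithm mentioned in Section 3 avoids this by using O(r) matrix-vector products, but the reduction-and-gather communication pattern is the same in spirit.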
The results are shown in Figure 5. From the top-left figure we see that both pWGD and pWGD-batch (with batch size 5) preserve the accuracy of the sample mean with 256 samples (compared to a reference value computed by a DILI-MCMC algorithm (Cui et al., 2016) with 10000 samples), with the latter giving slightly more accurate results, while WGD leads to increasing errors with respect to the parameter dimension. This can be explained by the similarly fast decay of the eigenvalues shown in the middle-left figure, where the projection dimension does not change much across parameter dimensions. Moreover, pWGD has a similar convergence of the averaged sample step norm (the norm of the sample updates from one step to the next) for different parameter dimensions, as can be seen from the middle-right figure. Similar behavior of the decay of the eigenvalues and step norms can be observed with respect to increasing sample size, as shown in the bottom two figures. Finally, as we increase the number of CPU cores, as seen from the top-right figure, the cost of the different computational parts is reduced linearly. These results demonstrate the scalability of the parallel pWGD algorithm with respect to the parameter dimension, sample size, and processor cores, which indicates its feasibility for using supercomputers to solve high-dimensional Bayesian inference problems and produce many posterior samples.

Bayesian inference for COVID-19. Finally, we consider a real-world problem of Bayesian inference for the dynamics of the transmission and severity of COVID-19, using recorded data for New York from https://github.com/COVID19Tracking, as studied in prior work on Bayesian inference of heterogeneous epidemic models. We use a compartmental model for the epidemic and the number of hospitalized cases as the observation data to infer a social distancing parameter of 96 dimensions with a transformed Gaussian prior; more details on the setup of the model, parameter, and data can be found in that work. We run WGD and pWGD using 128 samples, with 8 samples in each of 16 processor cores. We update the projection bases for pWGD every 10 of 200 iterations. The eigenvalues decay very fast, as shown in Figure 6, indicating an intrinsic low dimensionality of the data-informed parameter subspace. We can also see from the middle figure that pWGD produces posterior samples that recover the data better than those of WGD in terms of the 90% credible interval, which is due to the collapse of the WGD samples (especially before April), as can be seen from the bottom figure.

In this paper, we develop a pWGD method for high-dimensional Bayesian inference, which effectively alleviates the curse of dimensionality faced when using KDE to approximate the sample density. We analyze the convergence property of pWGD through the lens of Wasserstein gradient flows under suitable assumptions. Moreover, we demonstrate the accuracy and convergence of pWGD (compared to WGD, SVGD, and pSVGD), as well as the scalability of its complexity (with respect to parameter dimension, sample size, and processor cores), through a variety of experiments. Further analysis of the convergence of pWGD and its application to other high-dimensional problems are of great interest.
Acknowledgment. Wang is supported by a department fellowship from the Department of Electrical Engineering at Stanford University.

References
A fast and scalable method for A-optimal design of experiments for infinite-dimensional Bayesian nonlinear inverse problems.
Mean-variance risk-averse optimal control of systems governed by PDEs with random parameter fields using quadratic approximations.
Tensor train construction from tensor actions, with application to compression of large high order derivative tensors.
Gradient flows: in metric spaces and in the space of probability measures.
Hessian-based model reduction for large-scale systems with initial condition inputs.
Geometric MCMC for infinite-dimensional inverse problems.
Greedy inference with structure-exploiting lazy maps.
Extreme-scale UQ for Bayesian inverse problems governed by PDEs.
A computational framework for infinite-dimensional Bayesian inverse problems Part I: The linearized case, with application to global seismic inversion.
Stochastic gradient MCMC with stale gradients.
Hessian-based sampling for high-dimensional model reduction.
Projected Stein variational gradient descent.
Optimal design of acoustic metamaterial cloaks under uncertainty.
Sparse-grid, reduced-basis Bayesian inversion.
Sparse-grid, reduced-basis Bayesian inversion: Nonaffine-parametric nonlinear equations.
Hessian-based adaptive sparse quadrature for infinite-dimensional Bayesian inverse problems.
Taylor approximation and variance reduction for PDE-constrained optimal control under uncertainty.
Projected Stein variational Newton: A fast and scalable Bayesian inference method in high dimensions.
Bayesian inference of heterogeneous epidemic models: Application to COVID-19 spread accounting for long-term care facilities.
A-optimal encoding weights for nonlinear inverse problems, with application to the Helmholtz inverse problem.
Dimension-independent likelihood-informed MCMC.
A Stein variational Newton method.
On the geometry of Stein variational gradient descent.
Nonreversible Langevin samplers: Splitting schemes, analysis and implementation.
Interacting Langevin diffusions: Gradient structure and ensemble Kalman sampler.
Nonparametric kernel density estimation and its computational aspects.
Accelerating diffusions.
The variational formulation of the Fokker-Planck equation.
An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach.
Understanding and accelerating particle-based variational inference.
Stein variational gradient descent as gradient flow.
Stein variational gradient descent: A general purpose Bayesian inference algorithm.
Is there an analog of Nesterov acceleration for MCMC? arXiv preprint.
Stochastic spectral methods for efficient Bayesian solution of inverse problems.
MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo.
Derivative-informed projected neural networks for high-dimensional parametric maps governed by PDEs.
Randomized algorithms for generalized Hermitian eigenvalue problems with application to computing Karhunen-Loève expansion.
Sparse, adaptive Smolyak quadratures for Bayesian inverse problems.
Sparse deterministic approximation of Bayesian inverse problems.
Feasibility of multivariate density estimates.
Optimal transport: old and new.
Accelerated information gradient flow.
Information Newton's flow: second-order optimization method in probability space.
Bayesian learning via stochastic gradient Langevin dynamics.
A fast and scalable computational framework for large-scale and high-dimensional Bayesian optimal experimental design.
Certified dimension reduction in nonlinear Bayesian inverse problems.