PriorVAE: Encoding spatial priors with VAEs for small-area estimation
Elizaveta Semenova, Yidan Xu, Adam Howes, Theo Rashid, Samir Bhatt, Swapnil Mishra, Seth Flaxman
2021-10-20

Gaussian processes (GPs), implemented through multivariate Gaussian distributions for a finite collection of data, are the most popular approach in small-area spatial statistical modelling. In this context they are used to encode correlation structures over space and can generalise well in interpolation tasks. Despite their flexibility, off-the-shelf GPs present serious computational challenges which limit their scalability and practical usefulness in applied settings. Here, we propose a novel, deep generative modelling approach to tackle this challenge, termed PriorVAE: for a particular spatial setting, we approximate a class of GP priors through prior sampling and subsequent fitting of a variational autoencoder (VAE). Given a trained VAE, the resultant decoder allows spatial inference to become highly efficient due to the low-dimensional, independently distributed latent Gaussian space representation of the VAE. Once trained, inference using the VAE decoder replaces the GP within a Bayesian sampling framework. This approach provides a tractable and easy-to-implement means of approximately encoding spatial priors and facilitates efficient statistical inference. We demonstrate the utility of our two-stage VAE approach on Bayesian small-area estimation tasks.

Spatially referenced data come in a variety of forms, including exact geographical coordinates, such as latitude and longitude, or predefined geographical areal units, such as a village, administrative unit or pixel of a raster image. The latter are known as areal unit data and are found in fields such as epidemiology, environmental science and political science; a variety of relevant methods come under the banner of small-area statistics (Rao and Molina, 2015). There are many motivations for modelling such data, from surveillance program evaluation to identifying environmental risk factors for disease. Small-area statistics are particularly relevant to informing policy decisions, which are often made at the areal unit level (Clements et al., 2006). Statistical modelling of spatial data is routinely performed using Gaussian process (GP) priors (Williams and Rasmussen, 2006). GPs have gained popularity in a variety of applied fields due to their flexibility, ease of implementation, and inherent ability to characterise uncertainty. However, GPs also present a number of practical challenges. For example, basic inference and prediction using a GP requires matrix inversions and determinants, both of which scale cubically with data size. This makes applications of GPs prohibitive for large datasets. Moreover, kernel design of a GP requires substantial domain knowledge in order to reflect characteristics of the process of interest (Stephenson et al., 2021). Hence, the choice of an inference method is of great importance when it comes to dealing with GPs. The asymptotic convergence guarantees and diversity of Markov chain Monte Carlo (MCMC) approaches make them the most reliable tool for Bayesian inference. However, MCMC scales poorly and struggles to deal with the high degree of auto-correlation inherent to spatial data, limiting its utility in large spatial settings (Rue et al., 2009).
A range of more tractable approximate methods have been developed, such as the Laplace approximation, variational inference (Hoffman et al., 2013), expectation propagation (Minka, 2013), and inducing variables in sparse GPs (Titsias, 2009). However, few of these approximate methods have asymptotic guarantees or yield consistently accurate posterior estimates (Yao et al., 2018) over a wide range of challenging datasets. User-friendly software packages, such as R-INLA (Martins et al., 2013), provide extremely convenient interfaces to a large set of predefined models, but do not provide enough flexibility for custom model development and hence have limitations for specific classes of applied research. There is an unmet need for approaches that can be easily implemented and customised in popular probabilistic programming languages such as Stan (Carpenter et al., 2017), NumPyro (Phan et al., 2019), PyMC3 (Salvatier et al., 2016) or Turing.jl (Ge et al., 2018), while still scaling favourably to large datasets. Here, we propose a novel computational technique which leverages the variational autoencoder model from deep learning and combines it with Bayesian inference (Mishra et al., 2020a; Fortuin et al., 2020) with the goal of small-area estimation.

An autoencoder (Hinton and Salakhutdinov, 2006) is a neural network architecture used for unsupervised dimensionality reduction and representation learning. It is comprised of three components: an encoder network, a decoder network, and a low-dimensional layer containing a latent representation. The first component, the encoder E_γ(·) with parameters γ, maps an input x ∈ R^p into the latent variable z ∈ R^d, where d is typically lower than p. The decoder network D_ψ(·) with parameters ψ aims to reconstruct the original data x from the latent representation z. Therefore, an autoencoder imposes a 'bottleneck' layer in the network which enforces a compressed representation z of the original input x. This type of dimensionality reduction is particularly useful when some structure exists in the data, such as spatial correlations. This structure can be learnt and consequently leveraged when forcing the input through the network's bottleneck. Parameters of the encoder and decoder networks are learnt through the minimisation of a reconstruction loss p(x|x̂), expressed as the likelihood of the observed data x given the reconstructed data x̂. A variational autoencoder (VAE) extends this formulation to a probabilistic generative model (Kingma and Welling, 2013). Rather than learning a latent representation z, it learns a probabilistic distribution p(z|x) from which a latent representation can be sampled. This distribution is chosen in such a way that the latent variables z are independent and normally distributed. To achieve this, the encoder outputs a pair of d-dimensional vectors µ_γ, σ²_γ to characterise the mean and variance of z, respectively. The latent variable then follows the Gaussian distribution z ∼ N(µ_γ, σ²_γ I). The loss function is modified to include a KL-divergence term implied by a standard N(0, I) prior on z: L(x, x̂) = p(x|x̂) + KL(N(µ_γ, σ²_γ I) || N(0, I)). The KL term can be interpreted as a regulariser ensuring parsimony in the latent space. New data can be generated by sampling from the latent space with the trained decoder network.
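As a minimal illustration of this loss, the sketch below (our own, in plain NumPy; the function names are ours and the encoder and decoder networks themselves are omitted) computes the two terms for a Gaussian likelihood, using the closed form of the KL divergence between N(µ_γ, σ²_γ I) and N(0, I):

```python
import numpy as np

def kl_to_std_normal(mu, sigma2):
    # closed-form KL( N(mu, diag(sigma2)) || N(0, I) )
    return 0.5 * np.sum(sigma2 + mu**2 - 1.0 - np.log(sigma2))

def gaussian_recon_loss(x, x_hat, noise_var=1.0):
    # negative log-likelihood of x under N(x_hat, noise_var * I), up to an additive constant
    return 0.5 * np.sum((x - x_hat) ** 2) / noise_var

def vae_loss(x, x_hat, mu, sigma2):
    # negative ELBO: reconstruction term plus the KL regulariser described in the text
    return gaussian_recon_loss(x, x_hat) + kl_to_std_normal(mu, sigma2)
```

In practice this quantity is minimised with respect to the encoder and decoder parameters by stochastic gradient descent.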
This feature of VAEs has led to a series of successful applications aimed at generating new samples from the approximated distribution of real data (Liu et al., 2018). In this paper we propose a novel use of VAEs, termed PriorVAE: we learn uncorrelated representations of spatial GP priors for a predefined spatial setting, and then use the trained decoder to perform Bayesian inference on new data. Our contributions can be summarised as follows.

• We introduce a two-stage process for inference. First, we train a VAE to create an uncorrelated representation of complex spatial priors. Next, we use the learnt latent distribution in the model instead of the GP prior to perform MCMC inference on new data, while keeping the trained decoder fixed.

• We demonstrate the usage of VAE priors on a range of simulated and real-life datasets to show their performance in small-area estimation tasks.

The rest of this paper is organised as follows. In Section 2, we lay out the methodology of the two-stage approach. In Section 3, we demonstrate the approach on synthetic and real datasets. In Section 4, we conclude with a discussion and provide a broader outlook on the potential impact of this work.

There are three types of spatial data: areal (or lattice), geostatistical (or point-referenced) and point patterns. Models of all three rely on the notion of GPs and their evaluations in the form of the multivariate normal distribution. Areal data arise when a fixed domain is partitioned into a finite number of subregions at which outcomes are aggregated. The number of disease cases recorded per county, regardless of the exact locations within the county where the cases occurred or were recorded, is an example of areal data. State-of-the-art models of areal data rely on 'borrowing strength' from neighbours and use a hierarchical Bayesian formulation to do so. Widely used models of such data capture spatial structure via the adjacency matrix A, whose element a_ij equals 1 if areas B_i and B_j are neighbours, and 0 otherwise. Point-referenced (geostatistical) data represent measurements of a spatially continuous phenomenon at a set of fixed locations. The number of cases recorded at a set of hospitals with known geographical positions is an example of such data. Modelling of the underlying continuous process relies on pairwise distances between observed and unobserved locations, and GPs in such models use continuous kernels. Point pattern data consist of precise locations of events. An example of a point pattern is a collection of GPS coordinates of the households of newly occurred malaria cases. One way of modelling point pattern data is to cover the entire study area with a fine computational grid and view each grid cell as a location; modelling of the continuous underlying process is then typically done using continuous GP kernels. All three types of data can be modelled using the proposed novel approach.

Suppose we are given outcome data y_1, ..., y_n, corresponding to a set of observed disjoint areas {B_i}, i = 1, ..., n, covering the domain of interest G = ∪_{i=1}^n B_i. These areas may be the cells of a computational grid, pixels of a raster image, or they may correspond to existing administrative units. Outcome data {y_i}_{i=1}^n can represent counts aggregated over an area, such as the number of disease cases, continuous bounded data, such as disease prevalence (i.e.
a number between zero and one), or continuous unbounded data, such as land surface temperature in degrees Celsius (°C). Hence, in applied fields, it is common to use generalised linear models to unify the modelling approach for all such types of outcome. Accounting for mixed effect model structure, their Bayesian hierarchical formulation can be expressed as follows:

θ ∼ p(θ), (1)
f ∼ N(µ, Σ(θ)), (2)
η = Xβ + f, (3)
y ∼ p(u(η), θ). (4)

Here, (1) describes the hyperparameters θ of the model, f in (2) denotes the latent Gaussian field defined by mean µ and covariance Σ(θ), X in (3) is the fixed effects design matrix (a set of covariates), β are the fixed effects, and η is the linear predictor combining fixed and random effects. Equation (4) provides the observational model, where u is a link function characterising the mean of the distribution (e.g. logit for binomial data, exponential for positive data). A common modelling assumption is that the observations y_i are conditionally independent, p(y|f, θ) = ∏_{i=1}^n p(y_i | η(f_i), θ), and the spatial structure is captured latently by the function f. It is common to choose a GP prior over f, and as a consequence, finite realisations f_GP are jointly normally distributed with mean µ and covariance matrix Σ. Since f_GP always enters the model via the linear predictor, without loss of generality (affine transformation) we can consider f_GP to have zero mean (µ = 0) and be distributed as f_GP ∼ N(0, Σ). The structure of the covariance matrix Σ depends on the spatial setting of the problem and the model chosen for the random effect. Once the model for f_GP has been defined, the linear predictor can be computed as

η = Xβ + f_GP (5)

and then linked to the observed data via the likelihood. Unless the random effect is chosen to be trivial (i.e. an i.i.d. set of variables resulting from Σ = I), the random effect f_GP represents the computational challenge. Below we describe some options for spatial random effect priors in the context of small-area estimation, and propose a method to substitute their calculation at the inference stage with another variable, leading to increased inference efficiency.

A widely adopted group of approaches relies on the neighbourhood/adjacency structure and defines Σ based on the connectivity of the adjacency graph. Such methods leverage the tendency of adjacent areas to share similar characteristics. Conditional Auto-Regressive (CAR) and Intrinsic Conditional Auto-Regressive (ICAR) models were first proposed by Besag (1974) and later extended to the Besag-York-Mollié (BYM) model (Besag et al., 1991). The general form of the spatial prior for this group of models is

φ ∼ N(0, Q⁻),

where the precision matrix Q defines which specific model is used. Under the CAR model, the random effect f_CAR = φ is defined by the prior φ ∼ N(0, τ⁻¹R⁻). Here by Q⁻ = τ⁻¹R⁻ we denote the generalised inverse (James, 1978) of the precision matrix Q, which is calculated as Q = τ(D − αA). Here A is the adjacency matrix, and D is a diagonal matrix with elements d_ii given by the total number of neighbours of area B_i. The parameter τ is the marginal precision, and the parameter α controls the amount of spatial dependence; for instance, α = 0 implies complete independence and i.i.d. random effects. The condition |α| < 1 ensures that the joint distribution of φ is proper (Gelfand and Vounatsou, 2003). It is not uncommon, however, to use α = 1, leading to a degenerate precision matrix (hence the Q⁻ notation) and the ICAR model.
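To make the CAR construction concrete, here is a small self-contained NumPy sketch (ours, for illustration only) that builds Q = τ(D − αA) for a toy chain of areas and draws one CAR realisation; with |α| < 1 the matrix Q is positive definite, so a plain inverse exists:

```python
import numpy as np

def car_precision(A, alpha, tau):
    # Q = tau * (D - alpha * A), with D = diag(number of neighbours)
    D = np.diag(A.sum(axis=1))
    return tau * (D - alpha * A)

# adjacency matrix of a chain of 5 areas (toy example)
A = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
Q = car_precision(A, alpha=0.9, tau=1.0)

# draw phi ~ N(0, Q^{-1}); valid because |alpha| < 1 makes Q positive definite
rng = np.random.default_rng(0)
phi = rng.multivariate_normal(np.zeros(len(A)), np.linalg.inv(Q))
```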
In practice, ICAR models are supplied with additional constraints, such as the sum-to-zero constraint ∑_i φ_i = 0, to ensure identifiability. The BYM model combines an i.i.d. component φ_1 ∼ N(0, τ_1⁻¹ I) to account for non-spatial heterogeneity and an ICAR component φ_2 ∼ N(0, Q_2⁻), Q_2 = τ_2 R_2, for spatial auto-correlation. Hence, the total random effect can be calculated as f_BYM = φ_1 + φ_2. Some reparameterisations of BYM have recently been proposed in the literature (Riebler et al., 2016) to improve the interpretability of the inference. The advantage of the set of models presented above is that they take the neighbourhood structure into account; however, neither the shapes (irregularity) nor the sizes of areas are captured. Another natural way to model areal data, especially gridded surfaces with small area sizes, is to model covariance between areas as an approximately spatially continuous process, for example based on the pairwise distances between centroids. Typical kernels for distance-based covariance matrices include the squared exponential, exponential, periodic and Matérn kernels.

An autoencoder is a neural network that is trained by unsupervised learning, with the aim of dimensionality reduction and feature learning. It consists of an encoder network E_γ(·) with parameters γ, a decoder network D_ψ(·) with parameters ψ, and a bottleneck layer containing a latent vector z. The encoder maps input x ∈ R^p into the latent vector z ∈ R^d, and the decoder network maps z back into the input space by creating a reconstruction of the original data, x̂ = D_ψ(E_γ(x)). The network is trained by optimisation to learn reconstructions x̂ that are close to the original input x. An autoencoder can borrow any neural network architecture, including multilayer perceptrons, convolutional or recurrent layers. A VAE is a directed probabilistic graphical model whose posterior is approximated by a neural network, forming an autoencoder-like architecture. The goal of VAEs is to train a probabilistic model in the form of p(x, z) = p(x|z)p(z), where p(z) = N(0, I) is a prior distribution over latent variables z and p(x|z) is the likelihood function that generates data x given latent variables z. The output of the encoder network E_γ in a VAE is a pair of d-dimensional vectors µ_γ(x), σ²_γ(x), which are used to construct the variational posterior for the latent variable z. The decoder (generator) network D_ψ tries to reconstruct the input by producing x̂. In particular, the model can be summarised as

z | x ∼ N(µ_γ(x), σ²_γ(x) I), x̂ = D_ψ(z).

Neural network parameters γ and ψ are estimated as maximisers of the evidence lower bound (ELBO) or its extensions (Li and Turner, 2016). The first term of the ELBO is the reconstruction loss, measured by the likelihood quantifying how well x̂ and x match. The second term is a Kullback-Leibler (KL) divergence which ensures that the distribution of z is as similar as possible to the prior, a standard normal. It serves as a regulariser ensuring parsimony in the latent space and thus leads to uncorrelated variables in the latent space. New data can be generated by sampling from the latent space with the trained decoder network. Once the numerically optimal network parameters ψ̂ have been obtained, new realisations can be generated in two steps: first, draw from the standard normal distribution z ∼ N(0, I); second, apply the deterministic transformation D_ψ̂(z). Most commonly, VAEs have been used in the literature to learn and generate observed data.
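The following sketch (ours; it uses the Moore-Penrose pseudo-inverse as the generalised inverse, and the hyperparameter values are arbitrary) shows how BYM draws f_BYM = φ_1 + φ_2, as defined above, could be generated; in the approach introduced next, such prior draws, rather than observed data, become the VAE's training examples:

```python
import numpy as np

def sample_bym(A, tau1, tau2, rng):
    n = len(A)
    D = np.diag(A.sum(axis=1))
    Q2 = tau2 * (D - A)                    # ICAR precision (alpha = 1), singular
    cov2 = np.linalg.pinv(Q2)              # generalised inverse as covariance
    phi1 = rng.normal(0.0, 1.0 / np.sqrt(tau1), size=n)  # i.i.d. component
    phi2 = rng.multivariate_normal(np.zeros(n), cov2)    # ICAR component
    return phi1 + phi2                     # one BYM prior draw, f_BYM

# example: chain of 5 areas
A = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
f_bym = sample_bym(A, tau1=1.0, tau2=1.0, rng=np.random.default_rng(0))
```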
We propose using spatial priors x = f_GP, evaluated at the required spatial configuration (a grid or neighbourhood structure), as training examples instead of observed data. The trained VAE then enables us to compute x̂ = f_VAE. Remarkably, unlike when dealing with observed data, this approach has no issues with the quality or the quantity of the training examples. The amount of training data is unlimited, since we can draw as many GP realisations for training as required. Similarly, there is no issue of data quality, as we can create exact GP draws, free from the noise characteristic of real-life observations. To perform MCMC inference in a spatial model, we replace evaluation of the GP prior f_GP in the linear predictor (5) with the learnt prior f_VAE at the inference stage:

η = Xβ + f_VAE, where f_VAE = D_ψ̂(z), z ∼ N(0, I).

Drawing from the standard normal distribution z ∼ N(0, I) with uncorrelated entries z_i leads to much higher efficiency compared to the highly correlated multivariate normal N(0, Σ) with a dense covariance matrix Σ. As a consequence, computation time also decreases, especially for models where d < p.

One-dimensional GP on regular and irregular grids. In this first example we use a VAE to perform inference on continuous data {y_i}_{i=1}^n over a regular one-dimensional grid. The grid consists of n = 400 points in the (0, 1) interval. Training prior samples are drawn as evaluations of a Gaussian process with zero mean and squared exponential kernel k(h) = σ² exp(−h²/l²). This model is useful for small-area estimation when the covariance matrix is Euclidean distance-based. To allow for hyperparameter learning, we impose a hierarchical structure on the model by using hyperpriors on the variance, σ² ∼ LogNormal(0, 0.1), and the lengthscale, l ∼ InverseGamma(4, 1). This hierarchical structure allows the VAE to be trained on a range of hyperparameter values. The average lengthscale, according to these priors, is around 0.3. Since this is much larger than the distance between two neighbouring points on the grid (0.0025), there is enough redundancy in the data to expect a lower-dimensional embedding. Realisations f_GP are presented in Figure 1 (left). We trained a VAE with two hidden layers of dimensions 35 and 30, respectively, and a bottleneck layer of dimension 10. As an activation function we used the rectified linear unit (Jarrett et al., 2009; Nair and Hinton, 2010; Glorot et al., 2011) for all nodes in the hidden layers. The priors learnt by the VAE are presented in Figure 1 (right). Figure 1 shows that the typical shape of the prior, as well as the mean, has been learnt well. The amount of uncertainty (second moment) displayed by the VAE priors is lower. This is an expected phenomenon, as vanilla VAEs are known to produce blurred and over-smoothed data due to compression; on the other hand, they perform well on denoising tasks (Kovenko and Bogach, 2020), i.e. reconstructing the underlying truth given corrupt data. Empirical covariance matrices of the original and trained VAE priors show similar patterns (Supplement Figure 12). To perform inference, we generate one GP realisation, use it as the ground truth, and simulate observed data by adding i.i.d. noise. We allow the number of observed data points to vary as 0.5% (2 data points), 1% (4 data points), and 1.5% (6 data points) of the total number of grid points and recover the true function. The model used for inference is y ∼ N(f_VAE, s²), where the amount of noise is given the half-normal prior s ∼ N⁺(0.1). Inference results are presented in Figure 2.
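For concreteness, the inference model just described can be sketched in NumPyro as follows. This is our illustrative code, not the authors' implementation: the `decoder` function, its architecture and the parameter names are hypothetical placeholders standing in for the trained network, whose weights are held fixed during MCMC:

```python
import jax
import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

def decoder(z, params):
    # hypothetical MLP with trained, fixed weights (mirroring a 10 -> 30 -> 35 -> 400 decoder)
    h = jnp.maximum(params["W1"] @ z + params["b1"], 0.0)   # relu
    h = jnp.maximum(params["W2"] @ h + params["b2"], 0.0)
    return params["W3"] @ h + params["b3"]

def model(params, obs_idx, y=None, d=10):
    # uncorrelated latent variables replace the correlated GP prior
    z = numpyro.sample("z", dist.Normal(jnp.zeros(d), jnp.ones(d)))
    f_vae = numpyro.deterministic("f_vae", decoder(z, params))
    s = numpyro.sample("s", dist.HalfNormal(0.1))           # noise prior from the text
    numpyro.sample("y", dist.Normal(f_vae[obs_idx], s), obs=y)

# usage, given trained decoder weights `trained_params` and observations at `obs_idx`:
# mcmc = MCMC(NUTS(model), num_warmup=1000, num_samples=1000)
# mcmc.run(jax.random.PRNGKey(0), trained_params, obs_idx, y_obs)
```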
The higher the number of points, the closer the estimated mean is to the ground truth curve. Areas without any data available in their proximity show higher uncertainty than areas informed by closely located data points. Effective sample size (ESS) is an important measure of the efficiency of MCMC sampling (Martino et al., 2017). For example, we have run inference with 1000 warm-up and 1000 posterior MCMC samples for different numbers of data points. The average ESS for the posterior of the function evaluated at the observed points increased together with the number of points, while inference time remained constant. The original GP model displayed the reverse trend: average ESS remained constant, while computation time increased. To verify that our approach is also applicable to irregular grids, we modified the simulation study by placing the grid points irregularly. We use the same architecture to train this model, i.e. 35 and 30 nodes in the hidden layers and 10 latent variables in the bottleneck layer. The original GP priors evaluated on the irregular grid, the learnt VAE priors and the inference results are presented in Supplement Figures 13 and 14.

Two-dimensional GP. A similar set of experiments was performed for two-dimensional GP priors. A regular grid was defined over a unit square with 25 segments along each of the two coordinates, resulting in n = 625 grid cells. Inference results produced using a VAE trained on evaluations of two-dimensional GP priors are presented in Figure 3. As the number of observed points increases, the quality of the mean prediction improves and uncertainty in the surface estimates decreases.

Synthetic CAR example. We partition a rectangle by subdividing it into 10 rows and 15 columns. Unlike the two examples above, where we work with a continuous kernel, here we generate data using the CAR model adjacency matrix A. The parameter α is assigned a Uniform(0.4, 1) prior. This prior is chosen to study whether a VAE can be trained well on CAR samples displaying spatial correlation, i.e. corresponding to values of α away from zero. In this example we demonstrate a technique of training a VAE that preserves explicit estimation of the marginal precision τ. For this VAE, training is performed on the standardised samples φ̃ of the random effect, meaning that the prior is drawn from the distribution φ̃ ∼ N(0, R⁻), R = D − αA. Since φ = (1/√τ) φ̃ ∼ N(0, Q⁻¹), the trained VAE generator f̃_VAE-CAR also needs to be adjusted in the model at the inference stage by the magnitude of the marginal precision τ:

f_VAE-CAR = (1/√τ) f̃_VAE-CAR. (9)

To test the performance of the trained VAE, ground truth data is generated by fixing the value of α at 0.7 and adding a normal i.i.d. noise term with variance 0.5 at each area. This data is then used to fit the original CAR model, as well as the model using a VAE trained on CAR prior samples. Both models use a Uniform(0.01, 1) prior for the variance of the noise term. For both models we run MCMC with 1000 warm-up iterations and 2000 iterations to sample from the posterior. The average ESS of the spatial random effect in the VAE-CAR model is 4650, which took 9 seconds to compute; the achieved mean squared error (MSE) is 0.135. The average ESS in the original CAR model is 3128, which took 73 seconds; the achieved MSE is 0.118. Results of the experiment are presented in Figure 4.
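A hedged sketch of how this rescaling enters the inference model is given below; it reuses the hypothetical `decoder` and imports from the earlier NumPyro sketch, the Gamma hyperprior on τ is an illustrative placeholder, and the Uniform(0.01, 1) noise-variance prior follows the experiment described above:

```python
def vae_car_model(params, y=None, d=30):
    # assumes `decoder`, jnp, numpyro, dist are defined as in the earlier sketch
    tau = numpyro.sample("tau", dist.Gamma(2.0, 2.0))     # illustrative hyperprior on marginal precision
    z = numpyro.sample("z", dist.Normal(jnp.zeros(d), jnp.ones(d)))
    # the VAE was trained on standardised draws, so rescale the decoded field by 1/sqrt(tau), as in (9)
    f = numpyro.deterministic("f_vae_car", decoder(z, params) / jnp.sqrt(tau))
    s2 = numpyro.sample("s2", dist.Uniform(0.01, 1.0))    # noise-variance prior used in the experiment
    numpyro.sample("y", dist.Normal(f, jnp.sqrt(s2)), obs=y)
```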
Scottish lip cancer dataset. The Scottish lip cancer dataset, originally presented by Kemp et al. (1985), has become a benchmark dataset for areal models and has been used to demonstrate the performance of numerous spatial methods. The dataset consists of the observed and expected numbers of cases (y and E, respectively) in 56 counties of Scotland, as well as a covariate measuring the proportion of the population engaged in agriculture, fishing, or forestry (aff). The covariate is related to exposure to sunlight, which is a risk factor for lip cancer. We model the count data y as following a Poisson distribution with the log-Normal rate λ distributed according to the BYM model:

y_i ∼ Poisson(E_i λ_i), log(λ_i) = b_0 + b_aff · aff_i + f_BYM,i.

The VAE is trained on the spatial random effect BYM priors f_BYM = φ_1 + φ_2 to obtain the f_VAE-BYM representation. We can use any parametrisation of BYM, as we only rely on its generative properties (this is one of the advantages of our approach). We notice that a model with only an i.i.d. random effect φ_1 already produces a relatively good fit (see Supplement, Figure 15(a)); it is only the remaining discrepancies that the spatially-structured random effect φ_2 needs to explain. We account for this observation in our priors for τ_1 and τ_2, as well as in the dimension of the latent variable z: as there is only a moderate spatial signal, there is no redundancy in the data for spatial effect estimation. To provide good quality VAE-BYM priors, we opt not to compress the spatial prior and choose the dimension of z to be equal to the number of counties. We train a network with one hidden layer of 56 nodes and use the exponential linear unit (Clevert et al., 2015), or elu, activation function. For optimisation, we use the variational Rényi bound, which extends traditional variational inference to Rényi α-divergences, as proposed by Li and Turner (2016); by using α = 0 we opt for an importance weighted autoencoder (Burda et al., 2015).
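A minimal NumPyro sketch of this lip cancer model, again reusing the hypothetical `decoder` from above; the fixed-effect priors shown are our illustrative choices, not taken from the paper:

```python
def lip_cancer_model(params, E, aff, y=None, d=56):
    # assumes `decoder`, jnp, numpyro, dist as in the earlier sketches
    b0 = numpyro.sample("b0", dist.Normal(0.0, 1.0))        # illustrative priors for fixed effects
    b_aff = numpyro.sample("b_aff", dist.Normal(0.0, 1.0))
    z = numpyro.sample("z", dist.Normal(jnp.zeros(d), jnp.ones(d)))
    f = numpyro.deterministic("f_vae_bym", decoder(z, params))
    log_rate = jnp.log(E) + b0 + b_aff * aff + f            # expected counts E enter as an offset
    numpyro.sample("y", dist.Poisson(jnp.exp(log_rate)), obs=y)
```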
We performed two assessments to evaluate whether the VAE-BYM produces similar inference results to the original BYM model. First, we used both models for inference on the entire dataset to compare results for mapping (the most typical task in epidemiology and policy-informing work), and second, we performed 5-fold cross-validation. Posterior predictive distributions of the rates λ_BYM and λ_VAE-BYM, obtained by the models where f_BYM and f_VAE-BYM have been used to capture the spatial random effect, are very close to each other: Figure 5 displays the two distributions, where each is represented by its point estimate (mean) and 95% Bayesian credible interval (BCI). Uncertainty intervals produced by the two models are remarkably close for most of the counties. Figure 6 demonstrates very good agreement in the point estimates. Figure 7 presents the obtained maps: models with f_BYM and f_VAE-BYM produce very close spatial patterns. The average ESS of the spatial effects in the BYM model is ∼150, and in the VAE-BYM model it is ∼1030. MCMC elapsed time shows the same trend: 402s and 12s for BYM and VAE-BYM, respectively.

Figure 4: Inference results on the synthetic CAR example: data is generated by adding a noise term to the ground truth, and MCMC inference is performed using both the original CAR and the VAE-CAR models.

Figure 5: Posterior predictive distributions of the rates produced by models with f_BYM and f_VAE-BYM random effects, respectively. The distributions are represented by the point estimates (mean) and uncertainty intervals (95% BCI) for each county. There is a very good agreement between the two models.

To perform cross-validation, we created a 5-fold split of the data. To measure performance, we used the MSE between the continuous predicted rate λ and the observed count y. The mean MSE of the BYM model across five runs was 426, with a standard deviation of 131, and the mean MSE of the VAE-BYM model was 414, with a standard deviation of 171. The average ESS of the VAE-BYM random effect was ∼3850, and the average ESS of the BYM random effect was ∼630. Inference times were 3s on average (0.2s standard deviation) for the VAE-BYM runs, and 33s on average (3s standard deviation) for the BYM runs. These experiments confirm the consistency of our observations: even when the dimension of the latent variable z is the same as the number of counties, there is a benefit to using VAE-BYM over BYM: it achieves comparable performance while displaying much higher ESS and shorter inference times.

An alternative approach to training a VAE for non-Gaussian likelihoods. Our previously described approach, where we calculate the linear predictor and then fit the model to the data using a link function, follows the long-standing tradition in statistics driven by the interest in measuring associations between predictors and an outcome. In the previous two examples we trained the VAE directly on the GP draws (f_GP) and used the Gaussian distribution to compute the reconstruction loss p(x|x̂). When the outcome data is not Gaussian, or not even continuous, there is an alternative way of training a VAE. Consider, for instance, count data which we would like to model using the Poisson distribution. Instead of using f_GP draws as training data for the VAE, we can directly use simulated counts y arising from the Poisson distribution with rate λ = exp(f_GP). The encoder now maps these counts into the latent variable z, as before, and the decoder is enhanced with a probabilistic step: it maps z into a positive rate λ̂ and calculates the reconstruction loss as p(y|λ̂), y ∼ Pois(λ̂). Results of such a model are shown in Figure 8. We generated training data y for the VAE over a regular grid with n = 100 points according to the distribution y ∼ Pois(exp(f_GP)). The hyperparameters of the GP were identical to our one-dimensional GP example presented above. The VAE architecture included two hidden layers with the elu activation function, and an exponential transform was applied to the final layer of the decoder to calculate the predicted rate λ̂.

HIV prevalence in Zimbabwe. We consider household survey data from the 2015-2016 Population-based HIV Impact Assessment survey in Zimbabwe (Sachathep et al., 2021). The observed positive cases y_i among all observed cases n_i in each area B_i are modelled according to the Binomial distribution:

y_i ∼ Binomial(n_i, θ_i).

The parameter θ_i is the "probability of success" of the binomial distribution and serves as the estimate of HIV prevalence. The linear predictor, logit(θ_i) = b_0 + φ_i, i = 1, ..., N, consists of the intercept b_0, common to all areas, and an area-specific spatial random effect φ_i. The VAE is trained on the spatial random effect CAR priors to obtain the VAE-CAR representation; we train it on the standardised draws of the spatial random effect, as presented in formula (9). The total number of observed areas is 63, and we trained a VAE with a latent dimension of 50 to achieve a reduction in the dimensionality of the latent variable. Figure 9 presents the obtained maps: models with f_CAR and f_VAE-CAR produce very close spatial patterns.
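A sketch of this binomial model in NumPyro, under the same assumptions as the earlier fragments (hypothetical `decoder`, illustrative priors), with the decoded standardised CAR effect rescaled by 1/√τ as in (9):

```python
def hiv_model(params, n_trials, y=None, d=50):
    # assumes `decoder`, jnp, numpyro, dist as in the earlier sketches
    b0 = numpyro.sample("b0", dist.Normal(0.0, 1.0))        # illustrative prior for the intercept
    tau = numpyro.sample("tau", dist.Gamma(2.0, 2.0))       # illustrative hyperprior
    z = numpyro.sample("z", dist.Normal(jnp.zeros(d), jnp.ones(d)))
    phi = decoder(z, params) / jnp.sqrt(tau)                # standardised VAE-CAR effect, rescaled
    numpyro.sample("y", dist.Binomial(total_count=n_trials, logits=b0 + phi), obs=y)
```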
We used 2000 MCMC iterations to perform inference using both the original CAR and the VAE-CAR models. The average ESS of the spatial effects in the CAR model is ∼120, with a computation time of 13 seconds. In the VAE-CAR model the average ESS is higher, at ∼2600, and took only 4 seconds to compute. The ESS achieved by the VAE-CAR model is about 20 times higher than that of the CAR model, and also higher than the actual number of posterior samples, which indicates extremely efficient sampling. Traceplots obtained for both models are presented in Figure 10 and demonstrate (together with Supplement Figure 17) that posterior samples produced by the VAE-CAR model display very little auto-correlation, which makes high ESS and fast inference possible. Posterior samples produced by the original CAR model, on the contrary, show high auto-correlation (Supplement Figure 18), leading to low ESS and longer computation time.

COVID-19 incidence in the UK. We consider the projected number of COVID-19 infections in the UK at Lower Tier Local Authority (LTLA) level (Mishra et al., 2020b). These estimates are available from a public website and are based on the Scott et al. (2020) package. The model behind the package relies on a self-renewal equation and does not take spatial correlations into account. Here we demonstrate how spatial modelling using the proposed approach can be performed on the incidence data. The number of infections y_i in each LTLA during the period between the 21st of January and the 5th of February 2022 is modelled via the Poisson distribution according to the model

y_i ∼ Poisson(λ_i), log(λ_i) = b_0 + φ_i.

As above, we have chosen the representation of the spatial CAR random effect in its standardised form φ̃ in order to allow for explicit inference of the marginal precision τ when using VAE-based inference. To simplify modelling, we have removed all singletons (islands without any neighbouring areas) from the shapefile. We trained the VAE on the standardised draws of the spatial random effect. The total number of modelled LTLAs is 372, and we chose the dimension of the latent space to be 300. Using this example, we demonstrate a technique where several VAEs using different priors for the spatial hyperparameter α can be pretrained and then used for model selection, a step enabled by the fast inference times of VAE-based models. We trained five VAEs using the hyperpriors α ∼ U(0, 0.2), U(0.2, 0.4), U(0.4, 0.6), U(0.6, 0.8) and U(0.8, 0.99), respectively. Each of the resulting VAE priors was used to fit a model. Model selection was performed based on the widely applicable information criterion (WAIC; Watanabe and Opper, 2010) using the arviz package (Kumar et al., 2019). The best model was the one trained with α ∼ U(0.8, 0.99). To obtain smooth maps, we used 2000 MCMC iterations and performed inference using both the original CAR and the VAE-CAR models with priors α ∼ Uniform(0.8, 0.99) and τ ∼ Gamma(6, 2). The resulting maps are presented in Figure 11: models with f_CAR and f_VAE-CAR produce very close spatial patterns.
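The model-selection step can be carried out with arviz roughly as follows (a sketch; `fits` is a hypothetical dictionary mapping a label for each pretrained VAE to its fitted NumPyro MCMC object):

```python
import arviz as az

# convert each fitted run to an InferenceData object (log-likelihoods of the
# observed site are extracted automatically), then rank the candidates by WAIC
idatas = {label: az.from_numpyro(mcmc) for label, mcmc in fits.items()}
print(az.compare(idatas, ic="waic"))
```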
In this paper we have proposed a novel application of VAEs to learn spatial GP priors and enable small-area estimation. Such an approach leverages the power of deep learning to fuel inference for well-established statistical models. Uncorrelated latent parameters of the VAE make subsequent Bayesian inference with MCMC highly efficient at exploring the posterior distribution. An advantage of the proposed approach, compared to traditional VAE applications, is that there are no limitations on either the quality or the quantity of the training data, as any number of training samples can be generated by drawing from the noise-free spatial priors. In addition, there is no need to retain the whole training dataset, as every batch can be generated on the fly. Our method is beneficial even if the latent space dimension is the same as the number of data points, unlike some other approximation methods which rely on information redundancy in the data. As the HIV prevalence in Zimbabwe example shows, by reducing the autocorrelation between posterior MCMC samples, we can obtain drastic gains in both ESS and computation speed.

Figure 11: MCMC inference on the projected COVID-19 incidence in the UK at the LTLA level. The leftmost plot displays the raw data, the middle plot shows the mean of the posterior predictive distribution using the CAR spatial random effect f_CAR, and the rightmost plot shows the mean of the posterior predictive distribution obtained from the model with the f_VAE-CAR spatial random effect.

A limitation of the approach is that a trained VAE can only be expected to reproduce the class of priors it was trained on: if a VAE has been trained on realisations with short lengthscales, it is unreasonable to expect good results for long lengthscales. VAE training losses which capture distributional properties of the training samples have the potential to resolve this issue and constitute future work. The over-smoothing property of vanilla VAEs can be accounted for at the inference stage via priors allowing for wider uncertainty ranges. There is also the upfront cost of training a VAE, including the choice of architecture. Finally, recovering interpretable hyperparameters of the spatial priors might be of scientific interest, and this is more challenging in the VAE approach than in the traditional framework. In some cases this difficulty can be alleviated: for instance, in the HIV prevalence example we have demonstrated how the marginal precision or variance can be isolated from the VAE training to retain direct inference of this parameter. The statistical models of areal data described above assign the same characteristics to each location within every area B_i. This assumption is unrealistic for many settings, as heterogeneity might be present within each B_i whenever the size of the area is non-negligible. Areal data can be viewed as an aggregation of point data, and a series of approaches has recently begun to emerge to address this issue (Johnson et al., 2019; Arambepola et al., 2022). Hence, it is reasonable to use a data augmentation step to sample from the posterior distribution of the exact locations, and then to aggregate the results. Future directions of research include substituting the adjacency-based models presented above with continuous kernel approaches and training a VAE on them. The application of VAEs to small-area estimation has potential for far-reaching societal impact. If a set of different GP priors, such as CAR, ICAR, BYM and others, is used to pretrain several VAEs over the same spatial configuration, the resulting decoders can then be applied via the proposed inference scheme to rapidly solve real-life problems. Once the VAE training has been performed, users will only need access to the decoder parameters and can otherwise perform inference as usual; the expensive step of training a VAE would not be required at the practitioner's end.
In case of an epidemic emergency, for instance, this would enable faster turnaround times in informing policy by estimating crucial quantities such as disease incidence or prevalence.

Author contributions. ES: formal analysis, methodology, software, visualization, writing of original draft, review and editing; YX: formal analysis, software, visualization, review and editing; AH: data curation, methodology, review and editing; TR: data curation, review and editing; SB: conceptualization, methodology, supervision, review and editing; SM: conceptualization, methodology, supervision, review and editing; SF: conceptualization, funding acquisition, methodology, supervision, review and editing.

One-dimensional synthetic example: regular grid. Figure 12: GP and VAE covariance matrices computed empirically using 100 draws from the GP and VAE-GP priors.

Results of our approach on an irregular one-dimensional grid: Figure 13(a) shows samples of priors from the original GP evaluated on the irregular grid, Figure 13(b) shows priors learnt by the VAE, and Figure 14 shows inference results for different numbers of observed data points.

Here we present results concerning the Scottish lip cancer dataset, produced by models with i.i.d., BYM and VAE-BYM random effects. Figure 15 shows posterior predictive distributions produced by the three models when all of the available data was used to fit the models. We observe that the i.i.d. model already achieves a relatively good fit. The BYM and VAE-BYM models are able to capture the remaining spatial dependence and are very similar to each other.

Here we present results of the study using HIV prevalence data in Zimbabwe: the posterior density of the spatial random effect for one area during MCMC inference (Figure 16), autocorrelations of MCMC samples of a spatial random effect modelled via the VAE-CAR model (Figure 17), and autocorrelations of MCMC samples of a spatial random effect modelled via the CAR model (Figure 18).

References
A simulation study of disaggregation regression for spatial disease mapping
Spatial interaction and the statistical analysis of lattice systems
Bayesian image restoration, with two applications in spatial statistics
Importance weighted autoencoders
Stan: A probabilistic programming language
Bayesian spatial analysis and disease mapping: tools to enhance planning and implementation of a schistosomiasis control programme in Tanzania
Fast and accurate deep network learning by exponential linear units (ELUs)
Spatial smoothing in Bayesian models: a comparison of weights matrix specifications and their impact on inference
Estimating the effects of non-pharmaceutical interventions on COVID-19 in Europe
GP-VAE: Deep probabilistic time series imputation
Turing: A language for flexible probabilistic inference
Proper multivariate conditional autoregressive models for spatial data analysis
Deep sparse rectifier neural networks
Reducing the dimensionality of data with neural networks
Stochastic variational inference
The generalised inverse. The Mathematical Gazette
What is the best multi-stage architecture for object recognition?
A spatially discrete approximation to log-Gaussian Cox processes for modelling aggregated disease count data
Atlas of cancer in Scotland, 1975-1980: incidence and epidemiological perspective
Auto-encoding variational Bayes
A comprehensive study of autoencoders, applications related to images
ArviZ: a unified library for exploratory analysis of Bayesian models in Python
Constrained graph variational autoencoders for molecule design
Effective sample size for importance sampling based on discrepancy measures
Bayesian computing with INLA: new features
Expectation propagation for approximate Bayesian inference
πVAE: Encoding stochastic process priors with variational autoencoders
A COVID-19 model for local authorities of the United Kingdom
Bayesian hierarchical spatial models: implementing the Besag-York-Mollié model in Stan
Rectified linear units improve restricted Boltzmann machines
Composable effects for flexible and accelerated probabilistic programming in NumPyro
Small area estimation
An intuitive Bayesian spatial model for disease mapping that accounts for scaling. Statistical Methods in Medical Research
Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations
Population-based HIV Impact Assessments survey methods, response, and quality in Zimbabwe, Malawi, and Zambia
Probabilistic programming in Python using PyMC3
epidemia: Modeling of epidemics using hierarchical Bayesian models
Measuring the sensitivity of Gaussian processes to kernel choice
Variational learning of inducing variables in sparse Gaussian processes
Asymptotic equivalence of Bayes cross-validation and widely applicable information criterion in singular learning theory
Gaussian processes for machine learning

Acknowledgements. We thank Jeffrey Eaton for his useful comments on the manuscript. Code is available at https://github.com/elizavetasemenova/priorVAE