key: cord-0171506-44bw2x8h
authors: Gramatica, Marco; Congdon, Peter; Liverani, Silvia
title: Bayesian modelling for spatially misaligned health areal data: a multiple membership approach
date: 2020-04-11
journal: nan
DOI: nan
sha: c3faef24e5c84f17ccf490f05d78775737739829
doc_id: 171506
cord_uid: 44bw2x8h

Diabetes prevalence is on the rise in the UK, and for public health strategy, estimation of relative disease risk and subsequent mapping is important. We consider an application to London data on diabetes prevalence and mortality. In order to improve the estimation of relative risks we analyse jointly prevalence and mortality data to ensure borrowing strength over the two outcomes. The available data involves two spatial frameworks, areas (middle level super output areas, MSOAs), and general practices (GPs) recruiting patients from several areas. This raises a spatial misalignment issue that we deal with by employing the multiple membership principle. Specifically we translate area spatial effects to explain GP practice prevalence according to proportions of GP populations resident in different areas. A sparse implementation in Stan of both the MCAR and GMCAR allows the comparison of these bivariate priors as well as exploring the different implications for the mapping patterns for both outcomes. The necessary causal precedence of diabetes prevalence over mortality allows a specific conditionality assumption in the GMCAR, not always present in the context of disease mapping.

When dealing with socio-economic or health data it is common for individuals to be classified according to nested hierarchical structures. When it is necessary to account for level specific effects, hierarchical models are better suited than standard regression techniques (Fielding and Goldstein, 2006) . However, the most common setup of such models requires lower level units to belong to only one class in the hierarchy. When this perfect nesting is absent, as in cross classification (e.g. students classified by area of residence and school) an alternative modelling strategy is needed. Issues of multiple membership may then occur under one (or more) crossed categories (Fielding and Goldstein, 2006) . As an educational example, this structure can occur for pupils who attended more than one school during their educational career; when modelling random effects for a particular outcome, it might be useful to account for the different impacts that different schools might have had on the individual student outcome. This is where multiple membership (MM) strategies become an alternative to a classical hierarchical model: by choosing appropriate weights and averaging the random effects, e.g. over different schools attended, it is possible to account for the different memberships of a specific unit at a certain level (Browne et al., 2001) . A mixed structure is also exemplified by patients that are classified by both their area of residence and the medical general practitioner (GP) practice they are registered at.

In this paper, the main problem under consideration is to jointly estimate relative risk of diabetes mortality and prevalence where data for the former comes from census Middle Layer Super Output Areas (MSOA), while the latter is collected for GP registers. Generally, in the context of areal disease mapping it is common to model outcomes (e.g. mortality, prevalence) with a generalized linear model (GLM) (Lawson, 2018) using area counts as observations and conditional autoregressive priors (CAR, (Besag et al., 1991) see section 2) to account for spatially structured residuals. Additionally, expected disease counts or disease prevalence, would be included as offsets in the model, to account for different population structures among the areas, and known risk factors as covariates. This method can be applied to two areal outcomes leading to a bivariate model. However, this modelling strategy is based on the two outcomes sharing the same areal framework. This is not the case at hand, as residents of a specific MSOA can be registered at a GP practice outside it, making it impossible to attribute all the patients of a practice to the area it is in.

The dataset used here comprises 130 GP practices in three Clinical Commissioning Groups (CCGs) in outer northern east London (ONEL). MSOAs were included (in ONEL and beyond) if their residents were among the largest contributors to the selected GP populations. This procedure results in 95 MSOAs, 88 of which are actually in ONEL. With this data we aim to analyse prevalence and mortality risk for MSOAs using prevalence information for GP practices.

To the best of the authors' knowledge, there are no previous Bayesian studies of the spatial causal relationship between diabetes prevalence and mortality. One aim of the study is to examine the strength of this relationship (if any) at a small area scale. This relationship is attenuated at individual patient level because diabetes as a disease can be followed as a cause by death from many other conditions, thus making more difficult to list it as cause of death. Consequently, McEwen et al. (2006) reports that of nearly 12,000 diabetic patients in the US, diabetes was recorded on a minority (39%) of death certificates, and as the underlying cause of death for only 10% of decedents with diabetes. There is also the issue of accurate recording: many studies report that diabetes is an underreported cause of death. Stokes and Preston (2017) report that the proportion of deaths with diabetes assigned as the underlying cause of death (3.3-3.7%) severely understates the contribution of diabetes to mortality in the United States.

Dealing with multiple membership in this example is an illustration of spatial misalignment (Banerjee et al., 2014) . In order to deal with the misalignment problem, present in the study here, our approach involves multiple membership (MM) modelling of prevalence data, making MSOA spatial random effects for prevalence indirectly identified by GP prevalence data. Additionally, since causal precedence is evident given that prevalence necessarily precedes mortality, a specific modelling approach will then be evaluated, the GMCAR from Jin et al. (2005) , in order to embed the causal information in the estimation process.

Therefore, the main contribution of this paper is methodological: the incorporation of the multiple membership approach into a spatial prior. Previous applications of multiple membership have been in crossed multilevel applications (e.g. educational careers for pupils within schools) with unstructured priors. Another significant contribution is computational, as to the best of our knowledge this is the first implementation of a bivariate CAR spatial prior into Rstan (Carpenter et al., 2017; Stan Development Team, 2018) , a probabilistic programming platform which does full Bayesian inference using Hamiltonian Monte Carlo (HMC). Previous Rstan implementations or CAR priors have been univariate (Morris et al., 2019; Joseph, 2016) . Lastly, there is novelty in the application of the methods to the specific disease studied, and in showing how spatial priors can be applied to distinct health outcomes (e.g. prevalence, mortality) where causal precedence is clear.

The paper is organised as follows. We briefly described the motivating dataset in Section 1.1. In Section 2 we review conditional autoregressive priors for modelling of bivariate count data and in Section 3 we incorporate the multiple membership approach into a spatial prior. We discuss the implementation of this approach in Stan in Section 3.1. A simulation study is included in Section 4 and the results on our motivating dataset on diabetes prevalence and mortality are presented in Section 5.

Data on diabetes mortality is available from census Middle Layer Super Output Areas (MSOA), while data on diabetes prevalence is available in GP registers. With this data we aim to analyse prevalence and mortality risk for MSOAs using prevalence information for GP practices.

The dataset comprises all 130 GP practices in three Clinical Commissioning Groups (CCGs) in outer northern east London (ONEL): Barking and Dagenham, Havering and Red-bridge. MSOAs were included if their residents were among the largest contributors to the selected GPs populations up to the 90th percentile. This procedure results in 95 MSOAs, 88 of which are actually in ONEL.

The misalignement must be modelled carefully as it is significant: on average 43.4 MSOAs contribute to the population of a single GP (median 41 and range 14 to 80), and the average contribution of each MSOA to the population of a single GP is 2.3% (median 0.19 %) ranging between 0.0059% to 96 % (see Figure 1 as an example for a GP practice).

Mortality data by MSOA include death counts from 2013 to 2017, consequently October 2015 was chosen as midpoint for retrieving prevalence data, the number of registered patients by GP and the cross reference for MSOA and GP populations. In addition to that, two covariates recorded at the MSOA level have been included, namely the proportion of South Asian residents from the 2011 census 1 and the 2015 index of multiple deprivation (IMD). The former variable has been included due to the higher prevalence of diabetes among the South Asian population (Hall et al., 2010) , while the latter is a control variable for socio-economic status.

For the purpose of estimating areal prevalence and mortality, it is necessary to take into account population structure, therefore expected counts for each area are computed using national rates for mortality and prevalence, together with the number of residents by age. This gives

where i = 1, . . . , n j is the area, j = 1, 2 the outcome, k = 1, . . . , m are the population age groups, ω kj are the national mortality and prevalence rates for diabetes broken down by age group together with N k ij the corresponding population (Blangiardo and Cameletti, 2015) .

Generalised linear models (GLM) with spatially correlated random effects are commonly used in spatial data analysis. We consider a spatial GLM with Gaussian CAR random effects where a region is partitioned in n non-overlapping areas. For each area i = 1, . . . , n we can observe a random variable representing the outcome of interest Y i and a set of p non-random covariates x i = (x i1 , . . . , x ip ). Define E i as known offset values, and ρ i as the relative disease risk (Blangiardo and Cameletti, 2015) . The methods in this paper apply to all GLMs but, without loss of generality, in this paper we will model count data using the Negative Binomial distribution with mean µ i for each observation, a common overdispersion parameter ψ and a logarithmic link function. The Negative Binomial allows explicit modelling of overdispersion, when the variance in the outcome is larger than the average outcome (Coly et al., 2019) .

We can then write the GLM with the CAR term φ i , defined below, as:

To model explicitly the overdispersion, we set

where

The set of areal units on which data are recorded can form a regular lattice or differ largely in both shape and size. In either case such data typically exhibit spatial autocorrelation, with observations from areal units close together tending to have similar values. By assuming a Gaussian Markov random Field (GMRF) on the lattice (Rue and Held, 2005) it is possible to define φ i through the following full conditionals,

where n i is the number of neighbours for area i, b ij is an element of matrix B defined below, τ −1 i = (τ /n i ) −1 and α, a propriety parameter, is usually interpreted as a smoothing parameter. Thanks to Brook's Lemma (Rue and Held, 2005) it is guaranteed that the joint distribution is properly defined and it is given by

with m i is the number of neighbours for area i and W is the adjacency matrix such that (W ) ii = 0 and (W ) ij = 1 if i and j are neighbours. This setup can give rise to different prior choices that require different conditions to achieve properness (Congdon, 2014) . Building from (Mardia, 1988) we can specify a p-dimensional multivariate spatial model. However, general conditions for positive definiteness and invertibility for the covariance matrix are difficult to assess. Therefore, we use the following formulation by Gelfand and Vounatsou (2003) 

where Λ is a p×p matrix that controls the variance and covariance between outcomes in the same area and is therefore required to be symmetric positive definite. This model will be denoted as MCAR(α, Λ). An MCAR prior for spatial random effects allows estimates of relative risks to borrow strength from each other. This is a reasonable assumption for our motivating example, as diabetes prevalence and mortality are correlated. The MCAR on spatial effects can capture such relationship and improve the estimate for both outcomes.

When considering prevalence and mortality for the same disease, given the necessary precedence of prevalence on mortality, it is possible to draw a causal link between the two phenomena.

In order to embed that information in the model we can use the GMCAR (Jin et al., 2005) , a specification of the previously mentioned MCAR for the bivariate case, that is based on estimating the spatial effects for one outcome conditionally on the other, as follows.

with |α 1 | < 1 and |α 2 | < 1

In our context, φ 1 |φ 2 would be the spatial random effect for mortality given the spatial structure of prevalence, while φ 2 would be the spatial structure for prevalence. The parameter α 1 is the smoothing parameter associated with the conditional distribution of φ 1 |φ 2 , α 2 is similar for the marginal distribution of φ 2 , and τ 1 and τ 2 scale the precision of φ 1 |φ 2 and φ 2 , respectively. As the conditional expected value is linear in η 0 and η 1 , these two parameters model the linear relationship between the two diseases risks. The parameter η 0 represents an area specific spatial dependence, while η 1 refers to the neighbour effect between the two outcomes. We will refer to this model as GMCAR(α 1 , α 2 , η 0 , η 1 , τ 1 , τ 2 ). A particularly interesting feature of model (7) is that for the bivariate case by setting η 1 = 0 and α 1 = α 2 we obtain the MCAR(α, Λ) of equation (6).

The issue of misalignment arises from the discrepancy in the way mortality and prevalence data are collected. Firstly, patients from different and non neighbouring MSOAs can register at the same GP (i.e. the mapping from MSOAs to GPs is not injective) and, secondly, individuals from the same MSOA can register at different GPs. Thus, we propose to address this type of misalignment by treating it as a multiple membership problem (MM) (Browne et al., 2001) . We obtain practice specific weights based on the relative contribution of each MSOA (in terms of population affiliate) to each practice. Relevant weights are for those n j MSOAs that contribute to 90% or more of the population of the GP practice.

For each practice different MSOAs contribute with a different weight, so the idea is that the prevalence relative risk for each GP is then computed as an average of MSOA level risks. More explicitly, we observe that for each GP j there are n j contributing MSOAs. The full model becomes

where the spatial random effects are modelled as either

and priors are defined for parameters β, α 1 , α 2 , γ 1 , γ 2 , η 0 , η 1 , τ 1 , τ 2 , ψ 1 , ψ 2 . The index i ∈ j indicates that MSOA i contributes to GP j with its corresponding weight w i|j . Such weights are determined by the share of patients living in MSOA i and registered at GP j. The covariates coefficients β j = (β j1 , . . . , β jP ) with j = 1, 2 indexing the outcome (1 for mortality and 2 for prevalence) and p = 1, . . . , P the covariate.

To the best of our knowledge, this is the first implementation of GMCAR and MCAR models in RStan (Carpenter et al., 2017) . Moreover, we also implement the multiple membership approach within those two models. The code is available on github at https://github.com/ silvialiverani/GMCARMM. All models can be derived from the GMCAR model with multiple membership by setting the relevant parameters to zero: we can obtain an MCAR model by setting the parameter η 1 = 0 and we can exclude covariates by setting β equal to zero. 

Priors

We implemented a QR parametrisation (section 1.2 of Stan Development Team (2018)) for the estimation of the parameters β, which exploits the fact that any design matrix x can be decomposed using the thin QR decomposition into an orthogonal matrix Q and an uppertriangular matrix R, i.e. x = QR. This parametrisation performs much better in practice, often both in terms of wall time and in terms of effective sample size, as it makes it easier for the Monte Carlo algorithm to move in the space when an informative prior on the location of β is not available. Moreover, before obtaining the QR decomposition we have also implemented a min-max normalisation of the covariates, which does not affect the distribution of β but does affect γ.

Sampling for both covariance matrices of the CAR component was coded using unit precision, i.e. setting τ 1 = τ 2 = 1 in equation (7), and rescaling the φ 1 and φ 2 . Additionally, since calculating the determinant for those covariance matrices would be extremely computationally expensive, Joseph (2016) implemented an eigenvalue decomposition in Stan, as suggested by Jin et al. (2005) , reducing computational burden significantly.

We used flat priors for the parameters β while for the other parameters we used independent, proper, weakly informative prior distributions, as recommended by Gelman et al. (2013) . A list of the prior distributions is on the left hand side of Table 1 .

Convergence was assessed with theR diagnostic (Gelman and Rubin, 1992) , and accordingly satisfactory convergence is achieved for values below 1.01 (Vehtari et al., 2019) .

To choose among competing models, we used the deviance information criterion, or DIC (Spiegelhalter et al., 2002) . This criterion is based on the posterior distribution of the deviance statistics. For simplicity of notation, we will refer to the parameters of the Negative Binomial with the vector θ = (µ 1 , . . . , µ n , ψ) so we can write its saturated deviance as:

We compute both the average posterior deviance D(θ) and the deviance of averages of parameters' posterior D(θ). For the former, when all aspects of the model are assumed true, the approximation D(θ) ≈ n holds (Spiegelhalter et al., 2002) , where n is the number of observations. Consequently, this can also be used as a measure of fit. Computing the posterior deviances allows to easily compute also the effective number of parameters, p D = D(θ) − D(θ). Moreover, lack of fit of the data with respect to the posterior predictive distribution can be measured by the tail-area probability (TAP), or p-value, of the test quantity (Gelman et al., 2013) . For each observation i = 1, . . . , n and each outcome we can estimate:

where the y rep i simulated at each HMC iteration form the likelihood conditional on the current sampled parameters. The probabilities are estimated by the proportion of times the events y rep i < y i or y rep i = y i occur throughout the chain. In case of an adequate model the p i 's are expected to concentrate around 0.5 and away from the tails of the distribution, so by evaluating the tails of the estimated TAP's we can see how frequently the replicated values over (or under) predict the observed ones. Specifically we consider the proportion of TAP's that higher than 0.95 and lower than 0.05, as well as higher than 0.90 and lower than 0.1.

We also evaluate approximate leave-one-out cross-validation (LOO) (Vehtari et al., 2017) , comparing models (say A and B) by computing mean and standard error of the difference in elpd :

To validate our implementation we coded the GMCAR and MCAR models with multiple membership in BUGS and compared it with the Rstan runs. The BUGS (Bayesian inference Using Gibbs Sampling) project (Lunn et al., 2000) is concerned with flexible software for the Bayesian analysis of complex statistical models using Markov chain Monte Carlo (MCMC) methods. Altough BUGS is a very popular software for Bayesian hierarchical modelling, it is important to note that it is no longer maintained. We compared our Rstan implementation of the models above to an implementation in R2OpenBUGS (Sturtz et al., 2005) .

We generated two datasets: one with covariates and the other without, both using GMCAR to simulate spatially varying effects. Hence, we estimated all four models with both BUGS and Stan. We chose realistic parameter values in our simulation, in line with the posterior means of the results on real data in Section 5, as well as the true ONEL spatial lattice of 95 MSOAs, the E i j offsets and true covariates mentioned in Section 1.1 to generate new observations for mortality and prevalence.

Direct implementation of the Negative Binomial, as parametrised in equation (3), is not available in BUGS. Therefore we implemented it using the equivalent formulation via a Poisson-Gamma mixture (Hilbe, 2011) . Table 1 shows the priors and parameters for both implementations. Both Half-Normal and QR reparametrization are not available in BUGS so for the precision parameters τ j we used weak Gamma priors and for covariates coefficients β very vague Normal priors.

The results shown in Tables 2 and 3 were obtained by running 4 chains each with 50% warm-up and 6000 iterations. Are these the details for the Stan or BUGS implementation? Give details for both. Also, mention in writing any additional convergence check that you have done, even if not shown in the tables. Add also comments on the computational performance: although these runs had different number of iterations, how long did they take? What computer power did you use? The results show that RStan reports stronger convergence, i.e.R closer to 1. All 95% posterior credible intervals contain the true values, but in general RStan estimates smaller intervals.

As detailed in Section 1.1, we aim to analyse prevalence and mortality risk for MSOAs using prevalence information for GP practices. The dataset comprises 130 GP practices and the 95 MSOAs whose residents were among the largest contributors to the selected GPs populations.

To analyse the dataset we ran 4 chains each with 6000 iterations of HMC and 50% warmup using RStan. The models we considered are the GMCARMM and the MCARMM both including and excluding covariates. Table 4 reports the estimated posterior parameters. Table 4 reports summary statistics of the marginal posterior samples for the parameters of the models.

Tables 5 and 6 compare model fit using the DIC and LOO criteria. It can be seen that there are relatively small differences on these criteria between comparable models (i.e. models with covariates and models without covariates). The average posterior deviances are approximately close to the number of observations, suggesting a good fit of the model to the data. We compared LOO for GMCARMM and MCARMM models using differences in elpd as described in Section 3.2. This shows that the GMCARMM mortality model is preferable to the MCARMM one, showing a difference of 2.5 (on the elpd scale) and a standard error of 1.2. Also, the addition of Table 7 do not show significant differences in leave-one-out cross validation prediction. Table 5 and 6 show that the estimated posterior mean values of TAP never exceed 0.63 and are never below 0.31, as expected with adequately fitting models. Additionally, in Table  8 we can clearly see that it is very unlikely for the replicated values across all models to be sampled systematically over or under the dataset values. Figures 2 and 3 display densities for the replicated data at each iteration with the actual data.

All models fit the data adequately, but there are important differences in the results when covariates are included in the models. Consider the bridging parameters η 0 and η 1 . The former accounts for within area correlation between spatial effects for different outcomes, and the latter represents between area association (i.e a form of spatial lag effect). Only η 0 for the MCAR without covariates shows an entirely positive 95% credible interval (0.04, 1.05). The results in Table 7 show a strong within area association between the two outcomes. For the GMCAR models, η 1 for the GMCAR without covariates is close to significance with a 90% credible interval spanning between -0.01 and 0.35. So the association between outcomes remains positive but shows up as via a neighbourhood feedback effect with regards to surrounding areas.

Posterior samples for the coefficients β on the covariates, show a strong positive impact, as expected, for both the proportion of South Asian population and the deprivation index IMD, in accounting for both prevalence and mortality variations. Covariate effects are stronger on prevalence variations. The inclusion of such covariates generally improves fit, the exception being the GMCAR model for mortality.

With regards to measures of spatial dependence (the α parameters) these are considerably affected by the presence or not of covariates, especially when adding covariates improves fit. For prevalence, inclusion of covariates leads to reduced DIC and LOO-IC and the α parameters are correspondingly reduced; the covariates account for spatial dependence formerly in the residuals. For mortality impacts of covariates are significant (albeit smaller than those for prevalence) but fit is not improved when they are included. Figure 4 shows the posterior means for the estimated mortality relative rates, i.e. ρ 1 of equation (8). Figure 5 plots the posterior means of the estimated relative risks for prevalence without considering multiple membership, i.e. γ 2 + Xβ 2 + φ 2 as in equation (8). Both maps compare the two CAR models with and without the inclusion of covariates, and it is evident that different models give significantly different results. 

No previous studies have investigated the bivariate spatial relationship between diabetes mortality and prevalence for small areas. As mentioned above, recording of diabetes deaths data can be problematic, with possible undercounting. Despite these issues, the models applied in the above analysis show strong impacts of acknowledged area risk factors (area deprivation and ethnic population mix) on both outcomes. In models not including covariates, spatial association and dependence parameters are mostly in line with broader epidemiological knowledge: e.g. a positive association between the two outcomes. The analysis is complicated by the fact that prevalence is not observed directly for areas but for a different set of spatial units: general practitioner catchment areas. We have used a form of multiple membership principle to transfer information between these spatial frameworks. Posterior checks confirm that this approach is successful.

Space varying coefficient models for small area data

Hierarchical modeling and analysis for spatial data

Bayesian image restoration, with two applications in spatial statistics

Spatial and Spatio-temporal Bayesian Models with R -INLA

Multiple membership multiple classification (MMMC) models

Stan : A Probabilistic Programming Language

Hierarchical Modeling and Analysis for Spatial Data

Bayesian hierarchical models for disease mapping applied to contagious pathologies. bioRxiv

Applied Bayesian modelling

Identifiability and convergence issues for Markov chain Monte Carlo fitting of spatial models

Cross-Classified and Multiple Membership Structures in Multilevel Models: An Introduction and Review

Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference

Space-varying regression models: Specifications and simulation

Proper multivariate conditional autoregressive models for spatial data analysis

Bayesian data analysis

Inference from iterative simulation using multiple sequences

A multivariate CAR model for improving the estimation of relative risks

Fat Oxidation, Fitness and Skeletal Muscle Expression of Oxidative/Lipid Metabolism Genes in South Asians: Implications for Insulin Resistance?

Negative binomial regression

Generalized hierarchical multivariate CAR models for areal data

Exact sparse CAR models in Stan

Bayesian Disease Mapping: Hierarchical Modeling in Spatial Epidemiology

Bayesian 2-Stage Space-Time Mixture Modeling With Spatial Misalignment of the Exposure in Small Area Health Data

A comparison of conditional autoregressive models used in Bayesian disease mapping

WinBUGS -A Bayesian modelling framework: Concepts, structure, and extensibility

Review and empirical comparison of joint mapping of multiple diseases

Multi-dimensional multivariate Gaussian Markov random fields with application to image processing

Generalized Linear Models, Second Edition, Chapman & Hall/CRC

Diabetes reporting as a cause of death: Results from the translating research into action for diabetes (TRIAD) study

Bayesian hierarchical spatial models: Implementing the Besag York Mollié model in stan

Gaussian Markov random fields : theory and applications

On the Mode of Communication of Cholera

Bayesian measures of model complexity and fit

Stan Modeling Language Users Guide and Reference Manual

Deaths Attributable to Diabetes in the United States: Comparison of Data Sources and Estimation Approaches

R2WinBUGS: A package for running Win-BUGS from R

Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC

Ranknormalization, folding, and localization: An improved R-hat for assessing convergence of MCMC

Health-exposure modeling and the ecological fallacy