key: cord-0513714-t0gqi9t5
authors: Reich, Brian J; Yang, Shu; Guan, Yawen; Giffin, Andrew B; Miller, Matthew J; Rappold, Ana G
title: A review of spatial causal inference methods for environmental and epidemiological applications
date: 2020-07-06
journal: nan
DOI: nan
sha: b33ee8462d226ad4e7f1a9a1bcf6f41c6a52c22c
doc_id: 513714
cord_uid: t0gqi9t5

The scientific rigor and computational methods of causal inference have had great impacts on many disciplines, but have only recently begun to take hold in spatial applications. Spatial casual inference poses analytic challenges due to complex correlation structures and interference between the treatment at one location and the outcomes at others. In this paper, we review the current literature on spatial causal inference and identify areas of future work. We first discuss methods that exploit spatial structure to account for unmeasured confounding variables. We then discuss causal analysis in the presence of spatial interference including several common assumptions used to reduce the complexity of the interference patterns under consideration. These methods are extended to the spatiotemporal case where we compare and contrast the potential outcomes framework with Granger causality, and to geostatistical analyses involving spatial random fields of treatments and responses. The methods are introduced in the context of observational environmental and epidemiological studies, and are compared using both a simulation study and analysis of the effect of ambient air pollution on COVID-19 mortality rate. Code to implement many of the methods using the popular Bayesian software OpenBUGS is provided.

Large-scale environmental and epidemiological studies often use spatially-referenced data to examine the effect of treatments or exposures on a health endpoint. Examples include studying the effect of interventions on the spread of an infectious disease, pesticide application on cancer rates, and lead exposure on childhood development. While standard analyses of spatial data simply estimate correlations, the ultimate goal of this research is to establish causal relationships (e.g., Bind, 2019) to inform decision making. Therefore, developing statistical methods to establish causal relationships when data show spatial and temporal variation is invaluable to environmental science and epidemiology.

A rich literature on the theory and methods for causal inference for independent data has emerged (Bind, 2019; Hernn and Robins, 2020) , but progress for spatial applications has been slow due to several analytic challenges. First, randomization is often infeasible due to logistical or ethical concerns and so studies rely on observational data. Second, exposure and response variables exhibit spatial correlation complicating statistical modeling and computation. Third, the treatment at one location may influence the outcomes at nearby locations, a phenomenon known as spillover or interference. These features of spatial applications violate the assumptions of standard causal inference methods and require new theory and computational tools.

Despite these challenges, major advances in spatial causal inference have been made in recent years. In this paper, we review the recent progress on spatial causal inference, evaluate and compare current methods, and suggest areas of future work. We first review methods to adjust for missing spatial confounding variables (Hodges and Reich, 2010) . Most causal inference methods for observational data rely on an assumption of no missing confounding variables (i.e., unmeasured variables correlated with both the treatment and response). However, if the missing confounding variables have prominent spatial patterns, methods have been developed to mitigate the bias caused by their omission. These methods include case-control matching (e.g., Jarner et al., 2002) , neighborhood adjustments by spatial smoothing (e.g., Schnell and Papadogeorgou, 2019) and propensity-score methods (e.g. Davis et al., 2019) . We review these methods and conduct a simulation study to compare their precision for estimating a causal treatment effect in the presence of a missing spatial confounding variable. A subset of the methods are applied to a study of the effect of ambient air pollution on the COVID-19 mortality rate.

A second major challenge in spatial causal inference is interference, where the treatment applied at one location affects the outcomes at other locations. For example, an intervention to reduce the emissions from a power plant would affect the air quality at the power plant, but also locations downwind. Capturing these spillover effects requires new definitions of the estimands of interest and new spatial models for the causal effects. In full generality, allowing the treatment at a site to affect the outcomes at all other sites results in an intractable estimation problem. Therefore, assumptions are required to limit the form and spatial extent of interference. We review several models for spatial interference including partial (e.g. Zigler et al., 2012) and network (e.g. Tchetgen et al., 2017) interference. We also discuss recent methods that combine mechanistic and spatial statistical models to anchor the causal analysis to scientific theory.

We begin reviewing these methods using cross-sectional data at a single time point, and then extend these methods to the spatiotemporal data. We discuss adapting spatial methods to the spatiotemporal setting, and methods specific to the temporal case such as difference-in-difference methods (e.g. Delgado and Florax, 2015) that exploit changes over time to estimate causal effects.

We also compare and contrast causal methods based on the potential outcomes framework (Rubin, 1974) with Granger causality (Granger, 1969) , which is defined specifically for processes that evolve over time. We also discuss extensions of spatial methods for areal data defined at a finite number of regions (e.g., geopolitical units) to point-referenced (geostatistical) data in which case the treatment and response variables can be modeled as continuous random fields over an uncountable number of spatial locations. This requires new definitions of causal effects, new methods for matching observations for case-control studies, and new models for missing spatial confounding variables and spillover effects. The paper concludes with a summary of the current literature and discussion of open problems in this rapidly-advancing field.

To ensure privacy, public health data are often made available only after aggregation to administrative or geopolitical regions. For areal data of this nature, we adapt the notation that Y ij , A ij and X ij = (X ij1 , ..., X ijp ) are the response, treatment and potential confounding variables (with X ij1 = 1 for the intercept) for observation j ∈ {1, ..., n i } in region i ∈ {1, ..., N } for a total of n = N i=1 n i observations. The confounding variable in X ij can include both covariates specific to observation j within region i or summaries of the region i common to all n i observations in the region. In addition to these observed variables, we allow for an unobserved confounding variable U i in region i, which is assumed to be a purely spatial term and thus the same for all observations in a region.

Example 1: As a concrete example, consider an environmental epidemiology study where Y ij is the birth weight of the j th baby born in zip code i and A ij = 1 if the average ambient air pollution concentration in the mother's zip code exceeds a high threshold and A ij = 0 otherwise.

We may adjust for known confounding variables by including the mother's age and family income in X ij , and describe the mother's environment by including the median income and measurable environmental factors such the average concentration of other known pollutants in region i in X ij .

In this scenario, the missing spatial confounder variable U i might be a second pollutant unknown to the researchers. The second pollutant qualifies as a missing spatial confounder if it has a strong spatial pattern, is associated with low birth weight, and is correlated with the pollutant of interest, perhaps via a common source. Failing to account for this missing spatial confounder, either because its importance is unknown or data are unavailable, may inadvertently attribute the effects of the unknown pollutant to the pollutant of interest, biasing the estimator.

In this section we review spatial models for unknown processes such as U = (U 1 , ..., U N ) T (Section 2.1) and causal inference methods that would apply if U were observed (Section 2.2). The remainder of the section is dedicated to methods that attempt to control for the missing confounder variable by exploiting its spatial structure.

Consider the spatial regression model

where β is the treatment effect of interest, γ determines the effects of the confounding variables, U i is the spatial random effect for region i and ε ij iid ∼ Normal(0, τ 2 ). A common approach (Banerjee et al., 2014) for areal data is to model the unobserved spatial effects using a conditionally autoregressive (CAR) model (also known as a Gaussian Markov random field model). The CAR model specifies spatial dependence in terms of the adjacencies between the regions. The full conditional distribution of the random effect for one region given all other random effects is

, whereŪ i is the mean of U at the m i regions adjacent to region i, and ρ ∈ (0, 1) and σ > 0 are spatial covariance parameters. These full conditional distributions define a multivariate normal distribution (Appendix A.2) for U, which we denote as

The spatial regression model in (1) where U is modelled as a spatial process often gives very different estimates of covariate effects than the non-spatial model that excludes U, especially when the treatment variable exhibits a strong spatial pattern (Reich et al., 2006; Paciorek, 2010; Hodges and Reich, 2010) . However, simply accounting for spatial correlation does not resolve spatial confounding. For example, Appendix A.1 describes a scenario where the bias of the posteriormean estimator for β depends on the strength of dependence between the treatment variable and the unmeasured confounding variable, but is the same whether the residuals are assumed to be independent or spatially correlated. The bias of this approach is confirmed in our simulation study (Section 2.8) when data are generated with correlation between U and the treatment and response variables. This calls for methods that explicitly adjust for missing spatial confounders by blocking the dependence of U on either the treatment or response variable.

In this section we temporarily assume that U i is observed (and thus treated the same way as X ij ) to facilitate a review of standard non-spatial causal inference methods. We begin with the potential outcomes framework (Rubin, 1974) . Assume that the treatment A ij is binary and that each unit has two potential outcomes, Y ij (0) and Y ij (1), which represent the outcomes if the unit j in region i is given treatment A ij = 0 or A ij = 1, respectively. Our goal is to estimate the average treatment effect (ATE),

where the expectation is taken with respect to both X ij and {Y ij (0), Y ij (1)}. The fundamental problem is that only one of the two potential outcomes can be observed (Holland, 1986) rendering the other as counterfactual. Therefore, assumptions are required to ensure the ATE can be identified.

This notion of potential outcomes implicitly encodes the Stable Unit Treatment Values Assumption (SUTVA; Rubin, 1978) .

Assumption 1 (SUTVA). There is no interference and a single version of treatment.

SUTVA is violated under interference where Y ij depends not only on A ij , but also on the treatment of other units. For instance, the birth weight of a baby in Example 1 could be influenced by the air pollution concentration both in the mother's zip code (A ij ) but also in other zip codes that the mother frequents. In this case, the potential outcomes are not determined by A ij alone, and we would need to introduce a different potential outcome for each combination of the treatment variables in the mother's vicinity (see Section 3).

An example of multiple versions of treatment might be if birth weight actually depends not only on whether the air pollution exceeds a high threshold, but also a second extremely high threshold.

In this case, A ij actually has three levels (low, high and extremely high) and there should be three potential outcomes. An analysis that collapses the two high categories into a single group with A ij = 1 would violate SUTVA by having multiple versions of the treatment. Violation of this assumption could be rectified by assuming A ij has three categories and thus each unit has three potential outcomes.

While SUTVA links treatments to potential outcomes, the consistency assumption is needed to further link the potential outcomes to the observations.

Assumption 2 (Consistency). The observed response is the potential outcome determined by the

In addition to these assumptions about the treatment and response variables, a standard assumption that permits unbiased estimation of the ATE is the no missing confounder variables other than the observed covariates X ij and the latent spatial confounder U i . We term this assumption as the latent ignorability assumption:

Assumption 3 (Latent ignorability). The potential outcomes {Y ij (0), Y ij (1)} and treatments A ij are independent given X ij and U i .

Since U is generally a latent (i.e., unknown) variable in the spatial setting, this assumption presumes that there exists some variable U that blocks dependence between the treatment variable and potential outcomes; if U is observed then this is the usual assumption that there are no unmeasured confounding variables. This assumption implies that the confounding variables {X ij , U i } are sufficient to adjust for correlation between the observed treatment and response that is due to non-randomized treatment allocation and not an actual causal effect. This requirement highlights the importance of careful evaluation of the system under study to ensure that all relevant variables are considered in X ij .

The final assumption deals with the distribution of observed treatment variables, i.e., the propensity score. The propensity score is the probability of the treatment assignments, Prob{A ij = 1 |

Under Assumption 3, the propensity score becomes

Assumption 4 is the standard positivity assumption on the propensity score:

Assumption 4 (Positivity). Both e(X ij , U i ) and 1 − e(X ij , U i ) are positive for all X ij and U i .

This assumption implies that both A ij = 0 and A ij = 1 are possible under the treatment allocation mechanism, which is necessary to estimate the ATE in (2) which averages over the expected potential outcome under both treatments.

Under Assumption 3 the propensity score is a function of known variables X ij and U i and can thus be estimated without knowledge of unobservable counterfactual responses. However, Assumptions 1-3 are difficult or impossible to verify empirically, and thus a causal inference requires scrutinizing the study design and the processes of interest to justify that these assumptions hold.

One of the main contributions of causal inference is to state explicitly the assumptions needed for an estimator to have a casual interpretation, and thus guide a discussion of a study's results.

Assumptions 1-4 underlie many non-spatial causal estimation procedures such as (augmented) inverse probability weighting (e.g., Rosenbaum and Rubin, 1983a; Robins and Greenland, 1994; Bang and Robins, 2005; Cao et al., 2009) , and matching (e.g., Rosenbaum, 1989; Heckman et al., 1997; Hirano et al., 2003; Hansen, 2004; Rubin, 2006; Abadie and Imbens, 2006; Stuart, 2010; Abadie and Imbens, 2016) . To fix ideas, we focus on the simplest approach of the linear model in

(1) where U i is observed and thus not given a spatial model. Spatial analyses often rely on parametric models because the lack independent replications in a region complicates non-parametric methods. The parametric model in (1) makes the additional assumptions of linearity and normality, but gives valid causal inference under the assumed model and Assumptions 1-4. In other words, the regression coefficient β can be interpreted as the ATE, δ. Therefore, if U i is observed and these assumptions hold, then the estimate of β from a standard least squares analysis has a causal interpretation. In the remainder of this section we discuss methods to deal with unknown U.

While most of the methods we discuss control for confounding at the analysis stage, a case-control study controls for confounding at the design stage. In a case-control analysis of a binary response variable (i.e., Y ij ∈ {0, 1}), each case (Y ij = 1) is matched with one or more controls (Y ij = 0) that are drawn from the same underlying population at risk. When applying this study design, investigators sample controls to resemble cases with respect to all factors that may determine the disease status except for the exposure of interest. As discussed below, this design removes the need to adjust for the matching factors at the analysis stage. Matching variables can be specific to the individual, such as age or education level. Partial control for spatial variation of risk can be achieved by matching on confounding factors that vary spatially such as the region's median income. To adjust for unmeasured spatial confounders, controls can be matched based on their proximity to the cases (Jarner et al., 2002) . Assuming there is replication within region (n i > 1) and treatment varies within region (A ij = A il for some j and l) then matching individuals in the same region is an effective means of adjusting for spatial confounding.

Matched case-control data are most often analyzed using conditional logistic regression. Assume each case Y ij = 1 is paired with a single control Y kl = 0. Under the spatial logistic regression

To account for variability within each pair (strata), a random intercept z ij is added so the likelihood contribution of the pair is

Since the covariates appear in the likelihood only through the difference X ij − X kl , the effect of covariates used for matching cannot be estimated and these covariates can be removed from the model. Similarly, if cases are paired with observations from the same region (i.e., i = k), then the spatial random effects U do not appear in the likelihood and a non-spatial analysis is sufficient.

Thus, while the matched case-control analysis is an excellent means of controlling for confounders, its drawbacks include discarding data and not being able to estimate all covariate effects and spatial variation in risk.

Pairing observations in the same region can also be applied for continuous responses. For a continuous response there is no natural definition of a case or control, but regressing the difference between the responses in the same region removes spatial confounding. For example, under the linear model in (1) the model for the difference between responses in the same region is

where˜ i is independent error. Again, differencing eliminates the latent variable U i , and thus the differences can be analyzed with non-spatial methods. This approach relies on a parametric linear model, but the concept of reducing bias by pairing observations in the same location can also be applied using weighting based on the propensity score model (He, 2018) .

In (4), modelling the difference between observations in the same region eliminated the unmeasured confounders. In cases without replication and a missing confounder that varies smoothly across space, its effect can be reduced by removing large-scale spatial trends from the response, the treatment, or both. Removing large-scale trends isolates local variation in the response, which is arguably less prone to spatial confounding than large-scale variation. In this section we review several methods that have been proposed for removing large-trends in spatial regression.

For simplicity, assume there are no replications within each region and temporarily drop the replication subscript by defining Y i1 = Y i , X i1 = X i and A i1 = A i . Rather than specifying the regression on the response, the Simultaneous Autoregressive (SAR) model first subtracts regional

whereȲ i ,Ā i andX i are the means of the response, treatment and covariates at the m i regions adjacent to region i, φ is an unknown parameter and ε i iid ∼ Normal(0, σ 2 ). Taking differences reduces the effect of missing confounding variables that are constant across neighboring regions.

In vector form, (5) can be expressed as Y = Aβ + Xγ + ε where the spatial covariance of ε is given in Appendix A.2. Wall (2004) compares differences in covariance implied by the SAR and CAR models. Wall (2004) finds the models produce similar regression coefficient estimates despite sometimes large differences in covariances between regions.

Rather than simply subtracting the mean of neighboring sites, spatial trends can be removed by joint spatial modeling of the treatment and the missing spatial confounder. Consider the spatial regression model in (1) for U and A that allows each process to have a different range of spatial correlation and permits correlation between U and A. The confounding bias is mitigated by fitting a joint model

where the form of B i (A) and the spatial covariance of e i1 and e i2 are given in Appendix A.3. As noted by Schnell and Papadogeorgou (2019) and was also suggested by Paciorek (2010) , if the spatial scale of treatment is larger or about the same as the unmeasured confounder, the confounding bias cannot be mitigated.

Propensity scores are used in a wide range of causal inference methods. Assuming a binary treatment variable, the propensity score for observation j in region i is Prob(A ij = 1) = e ij . In a standard analysis the propensity scores are modeled as a function of the known covariates X ij and the estimated propensity scores are used to alleviate the imbalance of the covariates between treatment groups. Here we face the additional challenge that the propensity scores may depend on the unobserved spatial process, U i .

For example, consider the simple hierarchical model that includes the unobserved spatial process in the propensity score,

where V i accounts for spatial patterns in treatment allocation not accounted for by the covariates or the missing confounder U i . To emphasize the effect of the propensity score on the response model,

The shared spatial random effect v i adjusts for the missing confounder by absorbing signal in the response that can be explain by spatial trends in the treatment allocation. The spatial random effects can be assigned priors u = (u 1 , ...,

. Fitting this joint model for the treatment and response processes is straightforward using hierarchical Bayesian methods.

A concern with this model is that some of its many parametric assumptions could be violated, invalidating inference. Another issue is that of so-called "feedback", which in this context refers to information in the response influencing the posterior of the propensity scores (e.g., Zigler et al., 2013; Zigler, 2016; Saarela et al., 2016) . Eliminating this feedback can be done by fitting the model in two stages, i.e., first fitting the model for the treatment indicators in (10) to obtain an estimate of v and then fitting (9) with v fixed at its first-stage estimate. Other possible remedies include "cutting feedback" in the steps of the MCMC algorithm (Lunn et al., 2009; McCandless et al., 2010) or post-hoc reweighting of the posterior distribution (Saarela et al., 2015; Davis et al., 2019) . These methods are discussed below.

Referring to the joint model in (9)-(10), if the propensity score e ij were known and logit(e ij )

were included as a known confounder in X ij , then latent ignorability (Assumption 3) would hold and the resulting estimate of β would have a causal interpretation. Of course, the exact propensity is unknown and must be estimated. Letê ij be a first-stage propensity-score estimator, e.g., as estimated by fitting the spatial logistic regression model in (10). The estimated propensity scores can be included in the mean of the response model to account for spatial confounding. The propensity score can be added to the response model as,

where f is the logit function or more generally a non-linear function estimated by, say, smoothing splines. Given the inclusion of the propensity score, it can now be assumed that U i and A ij are conditionally independent. Assuming the model assumptions hold and the propensity score estimate is accurate, then β has a causal interpretation.

Alternatively, the propensity score estimates can be used to define strata, i.e.,

where 0 = T 1 < T 2 < .... < T L+1 = 1 define the propensity-score strata, S l encodes the unmeasured confounder effect for strata l and U i and A ij are conditionally independent. Although the strata are defined irrespective of spatial information, the spatial random effect U i accounts for spatial dependence.

This joint modeling framework can be extended to continuous treatment variables by replacing

This method could be fit as a joint model or in two stages where first a Gaussian spatial model for A ij is fit and estimates of e ij are used as generalized propensity scores (Hirano and Imbens, 2004 ) in the response model as in (11) or (12). Generally, this model-based framework can be adapted to more complex settings as long as a model with reasonable fidelity to the data generating process can be determined and justified.

As an alternative to model-based causal adjustment, Davis et al. (2019) use imputation of potential outcomes and propensity-score weighting. They first estimate propensity scoresê ij using a spatial regression such as (10). Then in a second stage, they fit the response model in (1), which excludes the propensity score. Rather than use the estimate of β from this analysis, they post-process the model output to remove confounding bias. They estimate the causal effect using concepts from augmented inverse probability weighting (Rosenbaum and Rubin, 1983; Bang and Robins, 2005; Cao et al., 2009) 

An instrumental variable (IV) Z i is widely used to deal with unmeasured confounding. An valid IV must (a) be associated with the treatment A i , (b) not be related to the unmeasured confounder U i , and (c) not be directly affect the outcome. Figure 1 illustrates the dependence structure of the random variables. As an example, suppose A i is a region's air pollution level and Y i is the region's asthma rate. A potential instrumental variable is the region's traffic density, Z i . As traffic is a major source of air pollution, it is clear that A i and Z i are correlated, and it can be argued that traffic density is unrelated to asthma rate other than via air quality.

The classic causal analysis with IVs is a two-stage least squares regression, The treatment is first regressed onto the IV, and then the fitted values from this first-stage regression as used as the treatment variable in the response model. That is, if the first-stage regression givesÂ i = Z is the instrumental variable, A is the treatment, Y is the outcome, X is the observed confounder, and U is the unobserved confounder.

This confines the treatment variable to the span of the instrumental variable, and thus to a space orthogonal to the missing confounding variable. If a valid IV can be identified then this provides a simpler means of estimating average treatment effect instead of adjusting for missing confounders than propensity scores.

Some caution has to be exercised when interpreting causal estimates based on IVs. In the observational setting, as in traffic instrument example, the investigators do not have the ability to enforce treatment (PM) based on treatment assignment (traffic). Although traffic is a major source of variation in PM, other sources can play a role which leads to differences between intended and observed treatments among units and potentially to the heterogeneity of responses (power plants, wildfires, etc). In randomized treatment-control examples, this equates to the lack of full compliance between treatment assignment and the intake of drug. The implication is that the ATE is estimated only among those whose PM variation is explained by variation in the instrumental variable, referred to as the local average treatment effect (LATE) or complier average treatment effect (CATE). Imbens and Angrist (1994) provide the criteria under which the LATE/CATE represents the ATE.

Spatial consideration can be made in both stages of the model. Consider a continuous treatment variable and the joint model

iid ∼ Normal(0, τ 2 1 ) and 2ij

iid ∼ Normal(0, τ 2 2 ). In (14), A ij in the response model in (1) is replaced by Z ij α 1 in the instrumental variable regression.

Spatial random effects are included in both stages of the model to provide more efficient estimators of the regression coefficients and valid uncertainty quantification. This model closely resembles the joint propensity score model in (7)-(8) except that only the signal in A ij than can be explained by the IV enters the response model.

The two models in (14)-(15) can be fit simultaneously, although feedback effects must be considered as in the propensity score methods of Section 2.5. Alternatively, the method can be fit in two stages. The first stage is a spatial regression of A i onto Z i in (15) and X i gives an estimate of α 1 . In the second stage spatial regression of the response, Z iα1 is used as the treatment variable.

An important difference between the classical and this spatial IV approach is that in the spatial version the fitted values will not be strictly orthogonal to the errors U i . A potential remedy is the use of restricted spatial regression (Reich et al., 2006; Hodges and Reich, 2010; Hughes and Haran, 2013; Hanks et al., 2015) . Thaden and Kneib (2018) propose to adjust for spatial confounding using structural equation modelling (SEM). They introduce binary indicator variables for each spatial location in both the models for the treatment and response variables. Therefore, although motivated using SEMs, they arrive at a similar model to the joint model in (9)-(10). They argue that independent priors for the random effects (u i and v i in (9)-(10)) more effectively resolve spatial confounding than spatial priors.

Treating the random effects as independent requires replication within region, which is not always available. However, when there is sufficient replication within regions, independent priors are preferable to spatial models because they are less constrained and thus more completely block spatial confounding.

In this section we conduct a simulation study to compare methods for adjusting for an unmeasured confounding variable. We examine how the methods compare with different levels of spatial correlation in the treatment and confounding variable, and robustness to model misspecification.

Data generation: We generate data from the model

where the spatial terms are drawn from the model U ∼ CAR(ρ U , 2), V ∼ CAR(ρ V , 2) and the transformation function g is given below. The correlation structure is determined by three parameters: ρ U and ρ V control the range of spatial dependence and φ controls the strength of spatial confounding. For simplicity we exclude known confounders X i to isolate the effects of spatial confounding. The first four scenarios have g(

to study the performance of the joint model when it is correctly specified. Setting the CAR depen- We generated 100 datasets on a 30 × 30 square grid of regions with rook neighbors and β = φ = 0.5. For each dataset we fit the following models.

• NS: Non-spatial least squares,

• NS+P: Non-spatial least squares with a spline function of the propensity score,

• S+P: Spatial CAR regression with a spline function of the spatial propensity score,

• S+AIPW: Spatial CAR regression with post-hoc IDW debiasing step, i.e., model S with post-processing as in (13) • Joint: Joint model in (9)-(10)

• Cut: Joint model with feedback cut as in McCandless et al. (2010) In these modelsê i is computed using the spatial logistic regression in (10) and f is a B-spline basis expansion with five degrees of freedom. The priors for all models are U ∼ CAR(ρ U , σ U ), In this simulation the most effective methods are the spatial model with propensity score adjustment (S+P) and the full joint model (Joint). This is not surprising in the first four scenarios because the joint model was used to generate the data. In these cases the joint model appears to have less bias than the two-stage spatial propensity score model, but both methods are similar. These models are misspecified in the final two scenarios, but still outperform the other methods. Surely more extreme scenarios where these methods fail to deliver reliable inference can be devised, but these results suggest some robustness to model assumptions.

The strength of the spatial correlation in the treatment allocation process appears to be more predictive of reliable performance than model misspecification. In scenarios (b) and (d) with ρ U = Figure 2 : Simulation study results. The boxplots summarize the sampling distribution of the causal estimates across datasets and the solid line at 0.5 is the true value. The scenarios vary by the spatial dependence parameter of the confounder (ρ u ) and treatment (ρ v ) variables, and whether the joint model is misspecified. The competing methods are defined in Section 2.8. The empirical coverage of 95% credible intervals for the causal effect are given above the model labels. 0.9 all of the methods are biased and have low coverage. In these cases the spatial model of the treatment allocation process has low predictive power and thus all subsequent causal adjustments are ineffective. In these cases the unmeasured confounder cannot be explained by known covariates or spatial patterns, and there is simply no structure that can be exploited to remove its effect.

2.9 Effect of PM 2.5 exposure on COVID-19 mortality

To illustrate the spatial confounder adjustment methods, we reanalyze the data provided by Wu The known confounder variables in X i include p = 15 measures of the county's demographic, socio-economic and climate conditions (see Table 2 of Wu et al. (2020) for a complete list). Some covariates (number of hospital beds, BMI and smoking rate) have a high proportion of missing values. Rather than removing the counties with missing value, which would complicate the spatial adjacency structure, we remove the covariates with missing value. Removing these covariates does not greatly affect the effect estimates (as discussed below).

Because the dataset is large and the treatment is continuous we consider only the non-spatial ("NS") and spatial ("S") models and these models with a two-stage propensity score adjustment ("NS+P" and "S+P"). The response model is Y i ∼ Poisson(N i λ i ) where N i is the county's population and λ i is the mortality rate. Wu et al. (2020) use a quasi-Poisson model with state-level random effects; we use county-level random effects and allow these random effects to account for over-dispersion. Specifically, the mortality rate is modeled as

where U ∼ CAR(ρ u , σ u ),ê i is the estimated generalized propensity score (Hirano and Imbens, 2004 ), and f is a B-spline basis with 5 degrees of freedom. The generalized propensity score is the fitted negative log-likelihood (ignoring constants)ê i = (A i − X iα −V i ) 2 , whereα and Figure 4 : Causal effect estimate for the COVID-19/PM 2.5 analysis. Posterior distribution of the log relative risk of an increase of 1 µg/m 3 in long-term average PM 2.5 (β) on a county's COVID-19 mortality rate. The four models are defined by whether they are non-spatial ("NS") or spatial ("S") and whether they include a spatial propensity score ("+ P"). 

The priors are the same as in Section 2.8. The non-spatial models set ρ u = 0 (the county-level random effect remain in the model to account for overdispersion) and the methods without a propensity score set f (ê i ) = 0.

The posterior distributions of β under these four models are plotted in Figure 4 . The spatial models give smaller posterior mean and larger posterior variance than the non-spatial models.

Including the generalized propensity score leads to a slightly higher effect estimate for both the spatial and non-spatial analyses. The results are generally similar to those in Wu et al. (2020) who found an 8% increase in COVID-19 related mortality for a unit increase in long-term average PM 2.5 . Therefore, this analysis does not detect a missing spatial confounder that dramatically affects the causal effect estimate.

Interference (also called spillover) occurs when the treatment received by one unit can affect the outcomes of other units. The ubiquitous no interference assumption in Section 2.2 was first discussed in Cox (1958) , where it was referred to as "no interaction between units" (Hernn and Robins, 2020) . In the subsequent literature, it is often simply referenced as part of SUTVA. Despite a variety of data and treatments exhibiting interference, methods that account for interference have only recently begun to proliferate in the statistics literature, in part because interference significantly complicates the potential outcomes approach and requires additional assumptions about the form of the interference.

In this section we review the challenges associated with accounting for interference, and the current literature on this topic. In Section 3.1 we give a general formulation of potential outcomes in the presence of interference, and define several quantities of interest under this framework.

The remainder of the section discusses different assumptions about the nature of interference and subsequent estimation methods.

In the potential outcomes framework in Section 2.2 with binary treatment and no interference, there are two potential outcomes defined for each unit: Y ij (0) and Y ij (1). Allowing for general treatment interference entails considering 2 n potential outcomes, each corresponding to a different combination of treatments received by all units. As a result, the estimands under interference are more complicated because they require considering treatment that could be applied to multiple units. Therefore, defining the potential outcomes and estimands requires additional notation. We distinguish between the treatment applied to unit (i, j) in the observed dataset, A ij , and a hypothetical treatment that could be applied to unit (i, j), denoted a ij . To describe potential outcomes under interference we denote the treatments that could be applied to all n units as a = {a ij ; i = 1, ..., N ; j = 1, ..., n i }, and the collection of the n − 1 treatments excluding a ij as a −ij . The potential outcome for each unit is then written as Y ij (a ij , a −ij ), where the first term is the treatment received by unit (i, j) and the second term are the treatments received by other units.

only on the treatment assigned to unit (i, j). Rather, several treatment effects are needed to provide a comprehensive summary. Halloran and Struchiner (1991, 1995) and Hudgens and Halloran (2008) describe four key estimands assuming binary treatments. The direct effect (DE) is

The direct effect compares the difference potential outcomes for unit (i, j) with treatments A ij = 1 versus A ij = 0 and holding all other treatments fixed at a −ij . Unlike (2), there is not a single direct effect, as (18) 

The indirect effect is also called the spillover effect because it compares the difference between potential outcomes for two combinations of treatments for the other units, a −ij and a −ij , to an untreated unit with a ij = 0 to quantify how much of the other treatment effects spill over to observation (i, j). The direct and indirect effects can be combined using either the total (TE) or overall effects(OE):

These effects are similar, except that the total effect always compares a ij = 1 versus a ij = 0

whereas the overall effect allows the local treatment to be the same for a and a .

If these effects can be estimated, then the user can interrogate the fitted model by selecting any scenarios defined by a and a . For example, in the context of Example 1, the direct effect might be computed by fixing the air pollution status of all other units a −ij at their current value to determine the effect of a local action that changes the air pollution concentration in the mother's zip code but does not affect other zip codes. For the indirect effect we might fix all the treatment variables at their observed values except set the air pollution variable for the zip codes neighboring a mother's zip code to one in a −ij versus zero in a −ij to determine the impact of changing the air pollution in zip codes where the mother spends some time outdoors. The sum of these two effects is the total effect of changing the air pollution status of all zip codes in the mother's home range (her zip code and those the mother frequents). This total effect equals the overall effect of setting a = 1 for the mother's home range, a = 0 for the mother's home range, and both a and a equal to the current value for all other zip codes.

While measures such as DE ij (a −ij ) are useful for understanding the implications of individual actions on local outcomes, assessing the overall impact of the treatment requires averaging over units and potential actions. Rather than weight all potential actions equally, they can be assigned probabilities, Prob(a =ã) = ψ(ã). The probability mass function ψ is called the treatment policy.

For example, the policy-averaged expected counterfactual outcome under treatment a ij = a for

where the sum is over all 2 n−1 possible values of a −ij and Prob(a −ij |a ij = a) is determined by the policy, ψ. The policy-averaged direct effect for unit (i, j) is thenȲ ij (1, ψ) −Ȳ ij (0, ψ), and the spatial average direct effect is

Policy-averaged indirect, total and overall effects have similar forms.

In the context of the environmental epidemiology study described in Example 1, a simple policy is to assume that the a ij are independent over units with Prob(a ij = 1) = p and compute (21) for several values of p to understand the direct effect. A policy more tailored to anticipating short-term effects of interventions in a given region is to assume that the a ij are independent over units with Prob(a ij = 1) = p a if the current value of the treatment in unit (i, j) is A ij = a. Under this policy, a zip code currently below the threshold is converted to exceed the threshold with probability p 0 , and a zip code currently above the threshold is converted to below the threshold with probability 1 − p 1 . The policy-averaged direct, indirect and total effects can be approximated via Monte Carlo simulation for a range of p 0 and p 1 to evaluate the overall effects of a campaign to reduce air pollution.

While these summaries are well defined for any potential outcome model, estimation is virtually impossible without simplifying assumptions. In the remainder of this section we discuss several methods that exploit the spatial structure of the units to simplify the interference pattern.

These methods are summarized in Figure 5 .

Partial interference, a term coined in Sobel (2006) , or clustered interference, assumes that the units can be partitioned into groups so that interference can occur only between observations in the same group. In Example 1, partial interference might be evoked if it is reasonable to partition the zip codes into cities, and that birth weight is dependent only on the air pollution concentration in the mother's city, and not air pollution in other cities. A further parametric assumption might be that the potential outcome is a function only of the air pollution concentration in the mother's zip code and the proportion of her city's zip codes that exceed the threshold excluding zipcode i, denoted byã i . A linear model with these assumptions is

where β 1 and β 2 entail the direct and indirect effects, respectively. This parametric model and assumptions analogous to Assumptions 1, 3 and 4 that A is independent of all potential outcomes given the n vectors X ij and that φ(a) > 0 for all a endows the parametric model

with a causal interpretation. Of course, this model relies on strong assumptions that are difficult to verify, and thus a more flexible approach may be desirable.

There is an extensive literature that explores and expands on non-spatial partial interference (Halloran and Struchiner, 1991, 1995; Halloran, 2012; Tchetgen and VanderWeele, 2012; Vander-Weele et al., 2014; Liu et al., 2016; Barkley et al., 2017; Baird et al., 2018; Papadogeorgou et al., 2019) . Zigler et al. (2012) assume partial interference in a spatial analysis of the health effects of environmental regulations, with clusters of sites defined by their attainment status. Perez-Heydrich et al. (2014) and Zigler and Papadogeorgou (2018) assume partial interference for groups defined by spatial proximity. Zigler and Papadogeorgou (2018) deal with additional complications that arise when the spatial resolutions of the treatment and response differ. gives both a reasonable constraint for estimation and also allows for treatment effects to propagate through the network. In a further generalization of the spatial network interference assumption, Giffin et al. (2020) use the distance between units themselves, rather than a network approximation, to develop a generalized propensity score method to balance the spillover effect,Ā i . A is the treatment, Y is the outcome, and X is the observed confounder.

X X X X A A A A Y Y Y Y (a) No interference X X X X A A A A Y Y Y Y (b) General interference X X X X A A A A Y Y Y Y (c) Partial interference X X X X A A A A Y Y Y Y (d) Spatial network interference

Partial and network interference make assumptions that are conducive to a statistical analysis, such as the simple spillover effect in (23), but are likely crude representations of reality. Mechanistic methods that encode scientific understanding of the physical processes of interest offer increased fidelity to the true interference structure. Mechanistic models are indispensable in environmental attribution studies. For example, climate models play a central role in the Intergovernmental Panel on Climate Change's conclusion that human activities likely caused the majority of the observed increase in global mean surface temperature from 1951 to 2010 (Bindoff et al., 2013) . As reviewed by Hegerl and Zwiers (2011) , unlike purely statistical models that are limited to scenarios observed in the data, mechanistic models can be run under counterfactual scenarios that have not, or could not, be observed. This provides a key link to the potential outcomes framework in Section 3.1.

While mechanistic models can be used to estimate direct effects, they are more critical in the presence of interference because they can rule out many of the massive number of potential spillover paths, greatly reducing the complexity of the problem. Despite these strengths, mechanistic models are only approximations, and thus need to be calibrated and validated using observed data. Most relevant for our purposes is the recent work that combines mechanistic modelling with spatial statistical methods to estimate causal effects. 

Data collected over space and time are more informative about causal relationships than crosssectional data, because they afford the opportunity to observe variables coevolve. This reduces the potential for spurious associations. For example, if a treatment is applied in the course of the study, comparing a site's responses before and after the treatment can control for missing spatial confounding variables assuming they and their effects are time-invariant. This narrows the search for potential confounding variables to those with a similar pattern as the treatments over both space and time.

To describe spatiotemporal methods, we adopt new notation to accommodate the temporal dimension. For simplicity, we assume areal spatial units, discrete time steps, and that each region i ∈ {1, ..., N } has a single observation at each time step t ∈ {1, ..., T }. We denote the response, treatment, known and unknown confounding variables as Y it , A it , X it and U it , respectively. The potential outcomes framework and assumptions in Section 2.2 apply with the time step t replacing the replication number j. Similarly, many of the spatial methods in Section 2 such as matching (Section 2.3), neighborhood adjustments (Section 2.4), propensity score methods (Section 2.5) and

the instrumental variable approach (Section 2.6) apply for spatiotemporal data by viewing time as a third spatial dimension, with a different degree of correlation in this third dimension. Janes et al. (2007) propose a method to test for unmeasured spatial confounders using spatiotemporal data. LettingĀ t denote the average of A it at time t, their approach can be adapted to our setting via the model

where X it includes smooth functions of t to account for missing temporally-varying confounders.

In this model, η 1 and η 2 measure global and local effects of treatment, respectively, and they argue that if the estimated values of η 1 and η 2 are equal and non-zero then this represents an average causal effect of A it on Y it , and that a large difference between the estimated η 1 and η 2 suggests there may be a missing spatial confounder.

Difference-in-difference (DID) estimators (Ashenfelter and Card, 1985) aim to quantify the treatment effect on the increase in the mean response over time. For simplicity we assume a binary treatment variable and two time steps (T = 2). If the treatment at the both time steps is a i1 = a i2 = a,

. Therefore, δ i (0) is the increase over time in the absence of treatment, and δ i (1) − δ i (0) is the increase that can be attributed to treatment. The DID average treatment effect is then

which is analogous to (2) except that the outcomes are changes over time. Assume the potential outcomes follow the model Y it (a) = β 1 a + β 2 t + β 3 ta + X it γ + U it + ε it . Under Assumptions 1-4, the observed outcome model follows the induced linear model

Moreover, β 3 = δ DID has a causal interpretation.

To render Assumptions 1-3 plausible, it is important to include information on a rich enough set of time-varying confounders in X it that affect both A it and Y it . In the spatiotemporal settings, the time-varying confounders X it include the observed information on the past treatments and outcomes. Delgado and Florax (2015) extend the spatial DIDs by assuming Markov interference where treatment effects only impact neighbors. This gives the model

whereĀ it is the mean of A it over the m i neighbors of region i at time step t. The neighborhood coefficients β 4 and β 5 can be viewed either as indirect spillover effects or added terms to adjust for local confounders to give more precise estimates of the direct causal effect, β 3 .

Matched wake analysis combines the DID approach with a spatiotemporal analogue to coarsened exact matching (Schutte and Donnay, 2014) . It was developed in the political science literature for studying responses to whether insurgent violence in Iraq causes civilians to help the US military. In this scenario, insurgent violence leading to civilian casualties is the "treatment" and violence not resulting in casualties is the "control". The response is the act of turning in salvaged unexploded ordinance to the US military, so that it will not be used in an improvised explosive device. The data are divided into sliding spatiotemporal windows called "wakes" and matched. Then a difference-in-differences approach is applied to the matched sample by counting the number of explosives turned in before and after events. A drawback to this method is that in some cases the sliding windows may overlap, which will violate SUTVA.

Granger causality is a fundamentally different concept than the potential outcomes framework. It is defined by temporal relationships and not potential outcomes. In a time series analysis with response Y t , treatment A t , and all other relevant variables at time t, X t , the treatment is said to Granger cause the response if Var(Y t |H t ) > Var(Y t |H t , A 1 , ..., A t−1 ), where the history up to time t is H t = {Y 1 , ..., Y t−1 , X 1 , ..., X t−1 }. In other words, Granger causality implies that given the history of all other variables, knowledge of past treatments reduces predictive uncertainty. If a linear lag L time series model is assumed, Y t = L l=l (A t−l β l + X t−l γ l + Y t−l ρ l ) + ε t , then the treatment is said to Granger cause the response if β l = 0 for any l ∈ {1, ..., L}.

Because this notion of causality is inherently defined for temporal data, extending these methods to the spatiotemporal case is straightforward. The simplest model is the linear no-interference

where U it is correlated over space (e.g., following a CAR or SAR distribution) but independent over time. It is also straightforward to include spillover effects by including spatial averages as covariates, i.e., under a Markov interference assumption the mean of A it−1 over region i's m i neighbors could be added as a covariate.

Granger causality and Rubin causality based on potential outcomes are fundamentally different.

Granger causality is defined in terms of predictive uncertainty, as might be useful to a passive observer of the system trying to maximize predictive power. In contrast, Rubin causality is defined in terms of the effects of an active intervention, as might be performed by a scientist conducting a controlled experiment. Despite their different definitions and objectives, these two approaches share similarities. White and Lu (2010) show that Granger causality is equivalent to Rubin causality for times series data with no missing confounders and valid parametric assumptions. For example, the model in (28) could be motivated by Granger causality or Rubin causality with Assumptions 1-4 and further assumptions (normality, linearity, etc) on the form of the potential outcomes model.

For further discussion of the similarities and differences between types of causality, see Holland (1986) or Eichler (2012) .

Point-referenced, or geostatistical, data are not measurements of a region, but rather taken at a specific point (latitude/longitude). Let s i ∈ R 2 be the spatial location corresponding to observation i ∈ {1, ..., n}. The spatial regression model becomes

where the unknown confounder U (s) is a spatial processes and ε i iid ∼ Normal(0, τ 2 ). This notation allows for replications at sites if, say s i = s j , in which case observations i and j share the spatial term U (s i ) = U (s j ). The covariate vector X i can include spatial covariates such as the elevation at s i and non-spatial covariates such as the time of day the measurement was taken.

Unlike an areal data analysis as in Section 2 where the number of potential sampling locations is finite, a geostatistial analysis must consider an uncountable number of potential sampling locations s ∈ D ⊂ R 2 . We use the bold to denote a process over the entire spatial domain; e.g., U = {U (s) :

s ∈ D}. An unknown spatial process such as U is typically assumed to be a continuous function of s over D and modeled as a Gaussian process with mean zero and isotropic covariance function (i.e., a covariance that depends only on the distance between locations). Although other covariance functions are available (Banerjee et al., 2014) , the simplest choice is the exponential covariance function Cov{U (s i ), U (s j )} = σ 2 exp(−d ij /ρ) where d ij is the distance between s i and s j . We denote this Gaussian process model as U ∼ GP(ρ, σ).

In the most general form, the potential outcomes for observation i depend on the entire spatial field of potential treatments, a = {a(s) : s ∈ D}. Therefore, we define the potential outcome for observation i as Y i (a). In the context of Example 1, a(s) might be the air pollution concentration at spatial location s, as opposed to the average concentration in a zip code. In this geostatistical setting, a mother's exposure to air pollution would integrate the concentration a(s) along the path the mother travels. This could be estimated by a backpack the mother wears that continuously measures her local air pollution concentration. Therefore, changing a(s) for any s in the spatial domain could affect her potential outcome.

The potential outcomes framework simplifies dramatically under the no interference assumption. With a binary treatment, the two potential outcomes for unit i are Y i (0) if a(s i ) = 0 and Y i (1)

if a(s i ) = 1. In this simple case, the potential outcomes concepts, definitions and assumptions introduced in Section 2.2 directly apply to the geostatistical setting. Many of the methods developed to adjust for missing spatial confounders described for areal data can also be applied. For example, all of the propensity score methods in Section 2.5 and instrumental variables methods in Section 2.6 can be adapted for geostatistical data by replacing the CAR model for the missing spatial confounder with a Gaussian process model. Many of the other methods introduced for areal data can also be modified for geostatistical applications, as described in the remainder of this section.

The matching methods described in Section 2.3 that pair observations from the same region can be applied for geostatistical data with replications at spatial locations. Distance adjusted propensity score matching (DAPSm) can be used when there are not replications. This method alters propensity score matching (Rosenbaum and Rubin, 1983a) by using a standardized distance that combines the propensity score difference and geographic distance. The logic is that if unmeasured spatial confounders exist, then observations that are close together will have confounders that are the most alike. Similar to the neighborhood adjustment methods, this method balances treatment and control by including geographic distances as a proxy for the unmeasured confounders in the matching process. The difference for a pair with A i = 1 and A j = 0 is defined as

whereê i andê j are estimated propensity scores, m is the maximum distance between pairs of locations in the study domain and w ∈ [0, 1] is a weight. The authors propose an algorithm to select pairs with small D ij .

Regression discontinuity designs are generally used when treatment assignment is determined by whether the covariate value for a unit exceeds a threshold (Imbens and Lemieux, 2008; Bor et al., 2014; Keele and Titiunik, 2015) , e.g., students are admitted to a college if and only if their SAT score exceeds a threshold. These cases provide a natural experiment if it can be assumed that units slightly above and slightly below the threshold are similar in every way except the treatment assignment, and thus the difference between these groups can be attributed to the causal effect of the treatment. Natural experiments of this form often arise in environmental and epidemiological studies where the variable being thresholded to determine treatment is the spatial location. In the context of Example 1, the treatment might be whether a state is subject to an air pollution regulation, and the objective is to determine if this affects health outcomes. Figure 6 shows a Figure 6 : Illustration of regression discontinuity. The treatment region A is the region above the curve, the points are the sample locations s i with samples with A i = 1 filled and the background color is the mean function A(s)β + U (s) were A(s) indicates that s = (s 1 , s 2 ) ∈ A. (29) with A i = 1 if s i ∈ A and A i = 0 otherwise.

5.4.1 Stochastic partial differential equation modeling Section 2.4 introduces the SAR model that defines the regression of the response onto the treatment after subtracting the means across neighboring regions. The motivation for building a model on the differences is to remove the effects of spatially-smooth confounding variables. The stochastic partial differential equation (SPDE) models of Lindgren et al. (2011) can be viewed as an extension of this idea to the continuous (geostatistical) spatial domain. In the SPDE framework, models are specified on the partial derivatives of the response surface, which is a generalization of the SAR model that can be applied to differentiable functions such as U. Lindgren et al. (2011) show that this approach can be used to approximate Gaussian processes with the Matérn covariance function, and develop approximations that resemble the SAR covariance model.

Defining interference for geostatistical applications requires returning to the general potential outcomes formulation in Section 5.1, where the potential outcome for observation i depends on the entire field of treatments, a, and is denoted as Y i (a). Relating the spatial field a with the scalar potential outcome requires assumptions about the form of interference. A general form of the interference is

where w is a weighting function that determines the spillover effectā i and β 1 and β 2 control the direct and indirect effects, respectively. Given this potential outcome model, the four causal effects (direct, indirect, total and overall) can be defined and interpreted as in Section 3.1 with a −i defined as the surface a excluding a(s i ), or perhaps excluding a for all sites within a small radius of s i .

The form of spillover in (32) encompasses many common interference assumptions. For example, partial/cluster interference can be implemented by fixing w(s i , s) = 0 if sites s i and s are in different groups. A structure resembling Markov/network interference assumes that w(s i , s) = 1/(πr 2 ) if s is within radius r of s i and w(s i , s) = 0 otherwise. This reduces the spillover measurē a i to the average treatment within radius r of s i . If strict bounds on the range of interference cannot be assumed, then the weight function could be a decreasing function of the distance from s i , such as the Gaussian kernel function with w(s i , s) = exp {−0.5(||s − s i ||/φ) 2 } / 2πφ 2 .

Even after reducing the complexity of the model by selecting a simple form for the weighting function, computing the spatial integral in (32) is often impossible because the treatments are only observed at a finite number of locations. One remedy is to use spatial interpolation (Kriging) to im-pute the treatments onto a fine grid of locations covering the spatial domain and then approximate the integrals as sums over the grid points. In this case, uncertainty about the estimated spillover variables should be accounted for using Bayesian or multiple imputation methods.

Given a form of interference and the assumption of no missing confounders, estimation of the direct and indirect effects can proceed with the usual spatial linear model. One approach to accounting for missing spatial confounders is to include spatial propensity score models for both the direct treatment A i and the spillover effectĀ i . The propensity score for A i can be estimated as in the areal case with say a spatial logistic regression to giveê(s i ).

The field of spatial causal inference has seen impressive advances in recent years. There are now methods to address the fundamental problems including accounting for missing spatial confounding variables and modeling spatial interference. However, there are many opportunities for future work that we discuss below, including combining data types, relaxing model assumptions, going beyond mean estimation, and using causal estimates for decision making.

We have discussed methods for areal data (Section 2) and point-referenced/geostatistical data (Section 5) separately, but many analyses require utilizing both types of data. For example, treatments may be defined at point locations (e.g., air pollution concentration) while the response variable is defined regionally (e.g., hospital admission rate by zip code). In spatial statistics this is referred to as the change of support problem (Gotway and Young, 2002; Gelfand et al., 2010) . One approach to combining data with different supports is to conceptualize the areal data as an aggregation of a continuous latent process and then specify geostatistical models such as those presented in Section 5 on the latent process. Extending these methods to the causal inference would require carefully specifying the causal estimand and devising computationally-efficient methods for estimation. Zigler and Papadogeorgou (2018) may provide a template for this work.

Change of support issues also arise when the treatment is a point source, such as an oil spill, power plant or wildland fire. The effect of point source treatment variables can be direct, but their most prominent causal effects will likely be the spillover effects (Section 3) felt by nearby locations. The spillover effects can be modelled as a function of the distance from the response location to the point source or mechanistically using a mathematical dispersion model (Section 3.4). These methods can also be extended to the spatiotemporal setting using spillover effects that decay in space and time (e.g., Kim et al., 2018 Kim et al., , 2019 . Inferential methods that rely on modeling the treatment variables (e.g., propensity scores) could apply a spatial point pattern analysis (Baddeley et al., 2015) , such as an inhomogeneous Poisson process model, to estimate the treatment intensity.

It may also be possible to leverage work on informative sampling Pati et al., 2011) that uses a joint model for the sampling locations and the responses to reduce the effects of systematic bias in the sampling design.

Most of the methods discussed in this review rely on strong parametric assumptions such as linearity and normality. Parametric methods dominate spatial statistics because in the canonical problem with one observation at each spatial location there is insufficient data to relax these assumptions. In contrast, most causal inference methods aim to be robust to model misspecification.

There is a body of work on nonparametric spatial methods (Gelfand et al., 2010; Reich and Fuentes, 2015) that might be used to relax the parametric assumptions in spatial causal inference, but these

ideas have yet to be applied in this context.

We focused only on the average treatment effect, and future work is to extend spatial causal inference to other types of treatment effects. For example, extreme events are often the most impactful in environmental studies, and thus it would be of great interest to extend causal inference ideas to spatial quantile regression (e.g., Reich et al., 2011; Reich, 2012; Lum et al., 2012) or extreme value analysis (e.g., Davison and Huser, 2019) . Another simplification made throughout the review is that the confounder and treatment effects are the same throughout the spatial domain.

A more general approach is a locally-adaptive model with spatially-varying coefficients (Gelfand et al., 2003) , which would be a spatial application of conditional treatment effects.

Ultimately, causal effect estimates can be used to influence decision making. An area of future work is to use these estimates to derive individualized/localized treatment rules. This is complicated in the spatial case by interference between regions that require considering simultaneously assigning the treatments to all regions to achieve optimality. Laber et al. (2018) and Guan et al. (2020) propose a policy-search method for optimal treatment allocation for spatiotemporal problems, but a general theory awaits development.

In Section 2, we define the CAR and SAR models for individual observations, and in this section we provide the induced joint distribution of the spatial process at all N locations. If U ∼ CAR(ρ, σ)

then the joint distribution of U defined by the full conditional distributions given in Section 2.1 is multivariate normal with mean zero and covariance Σ CAR (ρ, σ) = σ 2 (M − ρW) −1 , where M is diagonal with the i th diagonal element m i (the number of regions neighboring region i) and W has (i, k) element equal one if regions i and k are adjacent and zero otherwise.

Similarly, the SAR model in (5) can be solved for Y = (Y 1 , ..., Y N ) T to show that the induced joint distribution is

with the (i, k) element of C is 1/m i if regions i and k are adjacent and 0 otherwise, so that, e.g., 

Large sample properties of matching estimators for average treatment effects

Matching on the estimated propensity score

Estimating average causal effects under general interference, with application to a social network experiment

Using the longitudinal structure of earnings to estimate the effect of training programs

Spatial point patterns: methodology and applications with R

Optimal design of experiments in the presence of interference

Hierarchical modeling and analysis for spatial data

Doubly robust estimation in missing data and causal inference models

Causal inference from observational studies with clustered interference

Causal modeling in environmental health

Detection and attribution of climate change: From global to regional

Regression discontinuity designs in epidemiology: causal inference without randomized trials

Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data

Planning of Experiments

Confronting models with data: The challenges of estimating disease spillover

Addressing geographic confounding through spatial propensity scores: A study of racial disparities in diabetes

Spatial extremes

Difference-in-differences techniques for spatial data: Local autocorrelation and spatial interaction

Geostatistical inference under preferential sampling

Causal inference in time series analysis

Identification and estimation of treatment and interference effects in observational studies on networks

2020+) Bipartite interference and air pollution transport: Estimating health effects of power plant interventions

Handbook of Spatial Statistics

Spatial modeling with spatially varying coefficient processes

Generalized propensity score approach to causal inference with spatial interference

Combining incompatible spatial data

Investigating causal relations by econometric models and cross-spectral methods

A spatiotemporal recommendation engine for malaria control

The minicommunity design to assess indirect effects of vaccination. Epidemiologic methods

Causal inference in infectious diseases

Restricted spatial regression in practice: geostatistical models, confounding, and robustness under model misspecification

Full matching in an observational study of coaching for the sat

Inverse conditional probability weighting with clustered data in causal inference

Matching as an econometric evaluation estimator: Evidence from evaluating a job training programme

Use of models in detection and attribution of climate change

Causal inference: What if

The propensity score with continuous treatments. Applied Bayesian modeling and causal inference from incomplete-data perspectives

Efficient estimation of average treatment effects using the estimated propensity score

Adding spatially-correlated errors can mess up the fixed effect you love

Statistics and causal inference

Toward causal inference with interference

Dimension reduction and alleviation of confounding for spatial generalized linear mixed models

Identification and estimation of local average treatment effects

Regression discontinuity designs: A guide to practice

Trends in air pollution and mortality: an approach to the assessment of unmeasured confounding

Estimation of spatial variation in risk using matched case-control data

Geographic boundaries as regression discontinuities

Causal inference in disease spread across a heterogeneous social system

Modeling stochastic processes in disease spread across a heterogeneous social system

Optimal treatment allocations in space and time for on-line control of an emerging infectious disease

A spatial causal analysis of wildland fire-contributed PM2.5 using numerical model output

An explicit link between Gaussian fields and Gaussian Markov random fields: The stochastic partial differential equation approach

On inverse probability-weighted estimators in the presence of interference

Spatial quantile multiple regression using the asymmetric laplace process

Combining MCMC with sequentialPKPD modelling

Cutting feedback in Bayesian regression adjustment for the propensity score

The importance of scale for spatial-confounding bias and precision of spatial regression estimators

Adjusting for unmeasured spatial confounding with distance adjusted propensity score matching

Causal inference with interfering units for cluster and population level treatment allocation programs

Bayesian geostatistical modelling with informative sampling locations

Assessing effects of cholera vaccination in the presence of interference

Spatiotemporal quantile regression for detecting distributional changes in environmental processes

Spatial Bayesian nonparametric methods

Bayesian spatial quantile regression

Spatial analyses of periodontal data using conditionally autoregressive priors having two classes of neighbor relations

Effects of residual smoothing on the posterior of the fixed effects in disease-mapping models

Adjusting for differential rates of prophylaxis therapy for PCP in high-versus low-dose AZT treatment arms in an AIDS randomized trial

Estimation of regression coefficients when some regressors are not always observed

Optimal matching for observational studies

Assessing sensitivity to an unobserved binary covariate in an observational study with binary outcome

The central role of the propensity score in observational studies for causal effects

Estimating causal effects of treatments in randomized and nonrandomized studies

Bayesian inference for causal effects: The role of randomization

A Bayesian view of doubly robust causal inference

On Bayesian estimation of marginal structural models

Mitigating unobserved spatial confounding bias with mixed models

Matched wake analysis: finding causal relationships in spatiotemporal event data

What do randomized studies of housing mobility demonstrate? causal inference in the face of interference

Matching methods for causal inference: A review and a look forward

On causal inference in the presence of interference

Structural equation models for dealing with spatial confounding

Interference and sensitivity analysis. Statistical science: A

Causal inference under interference in spatial settings: A case study evaluating community policing program in Chicago

A close look at the spatial structure implied by the CAR and SAR models

Granger causality and dynamic structural systems

Exposure to air pollution and COVID-19 mortality in the United States

The central role of Bayes Theorem for joint estimation of causal effects and propensity scores

Estimating causal effects of air quality regulations using principal stratification for spatially correlated multivariate intermediate outcomes

Bipartite causal inference with interference

Model feedback in Bayesian propensity score estimation

This work was partially supported by the National Institutes of Health (R01ES031651-01,R01ES027892-01) and King Abdullah University of Science and Technology (3800.2). The research described in this article has been reviewed by the Center for Public Health and Environmental Assessment, U.S. Environmental Protection Agency and approved for publication. Approval does not signify that the contents necessarily reflect the views and the policies of the Agency, nor does mention of trade names of commercial products constitute endorsement or recommendation for use. The authors declare that they have no conflict of interest.

Consider the true data-generating model Y|A, U ∼ Normal(βA+U, τ 2 I n ), U|A ∼ Normal(φA, Σ 1 ) and A ∼ Normal(0, Σ 2 ). In this model the treatment variable and spatial process are correlated unless φ = 0. If the assumed model is Y|A, U ∼ Normal(βA+U, τ 2 I n ) and U|A ∼ Normal(0, Ω), or equivalently Y|A ∼ Normal(βA, Σ) where Σ = τ 2 I n + Ω, then the generalized least squares