key: cord-0613293-g1ddrfsb authors: Murakami, Daisuke title: Transformation-based generalized spatial regression using the spmoran package: Case study examples date: 2021-09-14 sha: 2a51acc75a23ee94db915a4a5abff057c34fa35b doc_id: 613293 cord_uid: g1ddrfsb

This study presents application examples of generalized spatial regression modeling for count data and continuous non-Gaussian data using the spmoran package (version 0.2.2 onward). Section 2 introduces the model. The subsequent sections demonstrate applications of the model to disease mapping, spatial prediction and uncertainty modeling, and hedonic analysis. The R code used in this vignette is available from https://github.com/dmuraka/spmoran. Another vignette focusing on Gaussian spatial regression modeling is available from the same GitHub page (see also Murakami, 2017).

We consider the following generalized spatial regression model (see Murakami et al., 2021):

$$\varphi(y_i) = \sum_{k=1}^{K} x_{i,k}\,\beta_{i,k} + f_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2), \tag{1}$$

where $\varphi(\cdot)$ is a transformation function normalizing the i-th explained variable $y_i$, $x_{i,k}$ is the k-th explanatory variable, and $\beta_{i,k}$ is a fixed or random coefficient, which may vary spatially and/or non-spatially (the distribution of $\beta_{i,k}$ is omitted from Eq. (1) for simplicity). $f_i$ is a term capturing residual spatial dependence, and $\varepsilon_i$ is a Gaussian noise term. Moran eigenvectors, which are spatial basis functions, are used to model the spatially dependent processes in $\beta_{i,k}$ and $f_i$.
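To make the notion of a Moran eigenvector concrete, the following minimal Python sketch (standard library only) uses a toy ring of 12 zones, which is an assumption made purely for illustration and is unrelated to the data analyzed later. Moran eigenvectors are eigenvectors of the doubly centered proximity matrix MWM, with M = I - 11'/n, and their eigenvalues are proportional to the Moran coefficient of the corresponding map pattern. All function names below are hypothetical helpers and not part of spmoran.

```python
import math

def ring_contiguity(n):
    """Binary contiguity matrix for n zones on a ring (each zone has two neighbours)."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        w[i][(i - 1) % n] = 1.0
        w[i][(i + 1) % n] = 1.0
    return w

def center(v):
    """Subtract the mean, i.e. multiply by M = I - 11'/n."""
    mean = sum(v) / len(v)
    return [x - mean for x in v]

def matvec(a, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]

def apply_mwm(w, v):
    """Apply the doubly centred matrix M W M to v."""
    return center(matvec(w, center(v)))

def moran_coefficient(w, v):
    """Moran coefficient (Moran's I) of the map pattern v under weights w."""
    z = center(v)
    w_total = sum(sum(row) for row in w)
    cross = sum(z_i * wz_i for z_i, wz_i in zip(z, matvec(w, z)))
    return (len(v) / w_total) * cross / sum(z_i * z_i for z_i in z)

n = 12
w = ring_contiguity(n)
smooth = [math.cos(2 * math.pi * i / n) for i in range(n)]  # one slow wave around the ring
rough = [(-1.0) ** i for i in range(n)]                     # neighbours alternate in sign

# On a ring, the cosine wave is an eigenvector of M W M with eigenvalue 2*cos(2*pi/n).
eigenvalue = 2 * math.cos(2 * math.pi / n)
residual = max(abs(a - eigenvalue * b) for a, b in zip(apply_mwm(w, smooth), smooth))

mc_smooth = moran_coefficient(w, smooth)
mc_rough = moran_coefficient(w, rough)
print(f"residual={residual:.1e}, MC(smooth)={mc_smooth:.3f}, MC(rough)={mc_rough:.3f}")
```

In this toy setting the smooth wave has a Moran coefficient of cos(2π/12), roughly 0.87, whereas the alternating pattern attains -1. Eigenvectors with large positive eigenvalues therefore describe smooth, positively autocorrelated map patterns, which is what makes them usable as spatial basis functions.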
The model may be rewritten as follows:

$$y_i = \varphi^{-1}(z_i), \qquad z_i \sim N\Big(\sum_{k=1}^{K} x_{i,k}\,\beta_{i,k} + f_i,\ \sigma^2\Big). \tag{2}$$

Eq. (2) shows that $y_i$ is assumed to follow the distribution obtained by transforming a Gaussian distributed variable $z_i$ with the inverse function $\varphi^{-1}(\cdot)$. By flexibly specifying the transformation function, this model describes a wide variety of non-Gaussian data, including count data. The transformation function is defined by composing D sub-transformation functions:

$$\varphi(y_i) = \varphi_D\big(\varphi_{D-1}(\cdots\varphi_1(y_i)\cdots)\big), \tag{3}$$

where $\varphi_d(\cdot)$ is the d-th sub-transformation function, which depends on a set of parameters $\theta_d$. For continuous explained variables, the spmoran package provides the following specifications for $\varphi(\cdot)$ (see Figure 1):

(a) For non-negative $y_i$, the Box-Cox transformation is available (left of Figure 1).

(b) For non-Gaussian $y_i$ (e.g., with a skewed or fat-tailed distribution), the SAL transformation of Eq. (4) (Rios and Tobar, 2019), which is a non-linear transformation, is iterated D times to accurately normalize $y_i$ (middle of Figure 1):

$$\varphi_d(y_i) = \theta_{d,1} + \theta_{d,2}\,\sinh\big(\theta_{d,3}\,\operatorname{arcsinh}(y_i) - \theta_{d,4}\big), \tag{4}$$

where $\theta_d \in \{\theta_{d,1}, \theta_{d,2}, \theta_{d,3}, \theta_{d,4}\}$.

(c) For non-negative and non-Gaussian $y_i$, the Box-Cox transformation is applied first, and the SAL transformation is then iterated D times to accurately normalize $y_i$ (right of Figure 1).

As illustrated in Figure 2, iterating the SAL transformation converts a wide variety of non-Gaussian data $y_i$ to Gaussian data $\varphi(y_i)$ quite flexibly. Thus, the generalized regression model of Eq. (1) is applicable to a wide variety of non-Gaussian data.

The model of Eq. (1) is also applicable to count data by using a (log-)Gaussian transformation that approximates a count data distribution. In the spmoran package, the following transformations are implemented:

(d) For (over-dispersed) Poisson counts, the log-Gaussian approximation proposed by Murakami and Matsui (2021) is available (left of Figure 3). According to their results, the accuracy of the approximate model is almost the same as that of conventional over-dispersed Poisson regression.
(e) For counts that do not follow the Poisson distribution, the log-Gaussian approximation is applied first to roughly normalize the data, and the SAL transformation is then iterated to identify the most likely distribution (i.e., probability mass function) (right of Figure 3).

In the spmoran package, the transformation function $\varphi(\cdot)$ in Eq. (1) is specified with the nongauss_y function, where y_type specifies the data type ("count" for count variables and "continuous" for continuous variables (default)). The subsequent sections present application examples of the model for count data (Section 3) and continuous data (Sections 4-5).

This section demonstrates count regression modeling for epidemic data, considering spatially varying coefficients, residual spatial dependence, and heterogeneity across years. The estimated model is used mainly for disease mapping and uncertainty modeling. This section uses the sf, rgeos, CARBayesdata, spdep, and spmoran packages.

We employ the pollution-health data (pollutionhealthdata), which is available from the CARBayesdata package. The data consist of respiratory hospitalization, air pollution, and covariate data for Greater Glasgow (2007-2011) across 271 Intermediate Geographies (IGs). The explained variable (y) is the number of hospitalizations due to respiratory disease (observed). The explanatory variables (x) are the average particulate matter concentration (pm10), the percentage of working-age people in receipt of Job Seekers Allowance, a benefit paid to unemployed people looking for work (jsa), and the average property price divided by 100,000 (price). Random effects by year are included to estimate heterogeneity across years (xgroup). In addition, the expected number of hospitalizations based on Scotland-wide respiratory hospitalization rates (expected) is used as an offset variable.
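The role of an offset in disease mapping can be illustrated with a small Python sketch. In a log-linear count model, the offset enters as log(expected), so the remaining linear predictor describes the log of the ratio of observed to expected counts (the standardized incidence ratio). The numbers below are made up for illustration and are not taken from pollutionhealthdata; the helper names are hypothetical, and the sketch does not describe how spmoran handles the offset internally.

```python
import math

def standardized_ratio(observed, expected):
    """Observed-to-expected ratio for each zone."""
    return [o / e for o, e in zip(observed, expected)]

def mean_count(expected, eta):
    """Mean count under a log-linear model with offset log(expected):
    log(mean) = log(expected) + eta."""
    return math.exp(math.log(expected) + eta)

observed = [30, 12, 45]        # hypothetical hospitalization counts
expected = [25.0, 15.0, 30.0]  # hypothetical expected counts (offset)

sir = standardized_ratio(observed, expected)

# A linear predictor of log(1.2) means 20% more cases than expected.
mu = mean_count(expected[0], math.log(1.2))
print(sir, mu)
```

A ratio above one indicates more hospitalizations than expected from the reference rates, so regression coefficients in a model with an offset are read as effects on relative risk rather than on raw counts.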
The explained variable, explanatory variables, group variable, and offset are specified as follows:

A binary contiguity matrix, which is generated from the spatial polygons of the IGs (GGHB.IG), is used for modeling spatial dependence:

As explained above, Moran eigenvectors are used to model spatially dependent processes. The following code generates the eigenvectors from the W matrix:

where cmat specifies a spatial proximity matrix, and s_id specifies the zone ID (the i-th row of cmat and the i-th element of s_id are associated).

This section considers two specifications for y. The former (ng1) assumes that y follows an over-dispersed Poisson distribution. The latter (ng2) assumes a more general distribution and estimates it through the SAL transformation:

The outputs ng1 and ng2 are used as inputs for the resf or resf_vc function. The resf function estimates spatial regression models without spatially varying coefficients (SVCs), while the resf_vc function estimates models with SVCs (see Murakami, 2017). Here, we estimate the following models:

mod1 and mod2 assume constant coefficients, while mod3 and mod4 assume SVCs on x. For the distribution of y, mod1 and mod3 assume an over-dispersed Poisson distribution, while mod2 and mod4 adjust the distribution using the SAL transformation to identify the most likely distribution.

The estimation result for mod3 is shown below. The intercept and the coefficient on price are estimated to vary spatially, while the coefficients on jsa and pm10 are estimated to be constant. As shown at the bottom of the output, the BIC of mod3 is considerably better than the BIC of the NULL model (74.9), which is a log-Gaussian model approximating the conventional Poisson regression:

The estimated group effects are as follows:

While regression coefficients for the transformed y are often difficult to interpret, marginal effect

In addition to the predicted values plotted above, the resf and resf_vc functions return quantiles of the predicted values, which are estimated based on the modeled probability density/mass function.
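The link between a transformation-based model and predictive quantiles can be sketched in a few lines of Python. Because the transformation is strictly increasing, a quantile of the Gaussian latent variable maps to the same quantile of y through the inverse transformation. The sketch below assumes a single sinh-arcsinh-plus-affine layer in the style of Rios and Tobar (2019), with hypothetical parameter values, a hypothetical latent mean and standard deviation, and hypothetical function names. It illustrates the general principle only and is not the implementation used in spmoran.

```python
import math
from statistics import NormalDist

def sal(y, a, b, c, d):
    """One sinh-arcsinh-plus-affine layer: a + b*sinh(c*asinh(y) - d).
    Strictly increasing in y when b > 0 and c > 0."""
    return a + b * math.sinh(c * math.asinh(y) - d)

def sal_inverse(z, a, b, c, d):
    """Inverse of sal(): maps the Gaussian scale back to the data scale."""
    return math.sinh((math.asinh((z - a) / b) + d) / c)

def predictive_quantiles(mu, sigma, probs, inverse):
    """Quantiles of y = inverse(z) with z ~ N(mu, sigma^2)."""
    latent = NormalDist(mu, sigma)
    return [inverse(latent.inv_cdf(p)) for p in probs]

theta = (0.0, 1.0, 0.6, 0.3)  # hypothetical parameters (a, b, c, d)
probs = [0.025, 0.5, 0.975]

# Hypothetical latent prediction: mean 1.0 and standard deviation 0.5.
quantiles = predictive_quantiles(1.0, 0.5, probs, lambda z: sal_inverse(z, *theta))
print(quantiles)
```

The median on the data scale is simply the inverse transformation of the latent mean, and the lower and upper quantiles are generally not symmetric around it, which is why quantiles are more informative than a single standard error for skewed or fat-tailed data.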
The quantiles of the predicted values are displayed as follows:

The quantiles are useful for evaluating uncertainty in disease mapping (see below).

The predicted values can be used for disease mapping. Here, we consider mapping the patterns in 2007. The following code creates a dataset including the observed counts in 2007 (obs), the predicted counts and their standard errors (pred), the estimated varying coefficients (b_est), and the quantiles of the predicted values (pred_qt), and converts the dataset to sf format, which is a spatial data format, for mapping:

Murakami, D. (2017) spmoran: An R package for Moran's eigenvector-based spatial regression analysis.
Murakami, D. and Matsui, T. (2021) Improved log-Gaussian approximation for over-dispersed Poisson regression: application to spatial analysis of COVID-19. ArXiv 2104.13588.
Limitations on low rank approximations for covariance matrices of spatial data.
Geostatistics for large datasets.