key: cord-0994295-jv08rjwm authors: Slater, Justin J.; Brown, Patrick E.; Rosenthal, Jeffrey S.; Mateu, Jorge title: Capturing spatial dependence of COVID-19 case counts with cellphone mobility data date: 2021-09-28 journal: Spat Stat DOI: 10.1016/j.spasta.2021.100540 sha: ad86faab0a3bf1e8b4305afcc9849d9eacd6d41e doc_id: 994295 cord_uid: jv08rjwm Spatial dependence is usually introduced into spatial models using measure of physical proximity. When analyzing COVID-19 case counts, this makes sense as regions that are close together are more likely to have more people moving between them, spreading the disease. However, using the actual number of trips between each region may explain COVID-19 case counts better than physical proximity. In this paper, we investigate the efficacy of using telecommunications-derived mobility data to induce spatial dependence in spatial models applied to two Spanish communities’ COVID-19 case counts. We do this by extending Besag York Mollié (BYM) models to include both a physical adjacency effect, alongside a mobility effect. The mobility effect is given a Gaussian Markov random field prior, with the number of trips between regions as edge weights. We leverage modern parametrizations of BYM models to conclude that the number of people moving between regions better explains variation in COVID-19 case counts than physical proximity data. We suggest that this data should be used in conjunction with physical proximity data when developing spatial models for COVID-19 case counts. Spatial analyses of COVID-19 case data were first published as early as March of 2020 [1] [2] [3] , in an attempt to characterize, predict, and attenuate the severity of the pandemic. Subsequent studies have noted substantial spatial dependence in COVID-19 case counts [4, 5] . This makes sense as regions that are close to each other likely have more people moving between them, spreading the disease to nearby regions. Many groups have attempted to model COVID-19 case counts as a function of climate [6] [7] [8] , healthcare quality [9] , socioeconomic factors [10] and more. More recently, mobility data has become more abundant and popular for modeling COVID-19 transmission. This makes sense because the disease spreads through human contact, meaning that case counts are likely to be a function of the number of people moving around. Such mobility data has been used to model the evolution of the epidemic in Spain [11, 12] , assess the effectiveness of the Spanish lockdown [13] , monitoring the epidemic in Switzerland [14] , identify at-risk populations in France during a lockdown [15] , individual-level infection tracing in China [16] , assess the timing of stay-home orders [17] , and evaluating the effectiveness of social distancing in the United States [18] . This data can be found in many forms, but is commonly found in the form of aggregated areal mobility matrices. If we denote a mobility matrix M , [M ] ij corresponds to the number of trips from region i to region j, and M ii represents the number of trips within region i. These data have been applied in a variety of different models to answer numerous questions, but lack of available methods makes it difficult for researchers to use this data to its full potential. In this paper, we demonstrate a novel method for analyzing this data, whereby the mobility data is used as edge weights in a Gaussian Markov random field (network) model. Previous work using network models have been applied to mobility data in the form of a network compartment model [19] which was used to conduct inference regarding societal inequities, and inform reopening. This work does not aim to make such claims, but rather demonstrate the efficacy of mobility data in modern parametrizations of Besag, York, and Mollié (BYM) models [20] and their extensions. BYM models have been used frequently in the spatial analysis literature due to their effectiveness and computational efficiency. In these models, the spatial component is comprised of Conditional Autoregressive (CAR) [21] models and conventional random effects. This means that the spatial effect of region i depends only on its "neighbours". Neighbours could be defined by any quantity the analyst has access to, but is most often defined by physical adjacency, i.e. if two regions share a common border, they are considered neighbours. Several ICAR/BYM models have been applied to COVID-19 data with neighbours defined in this way [22] [23] [24] . Although these spatial model components based on physical adjacency are powerful and computationally efficient, it makes more sense to use mobility between regions to induce spatial dependence in COVID-19 models because the disease spreads via person-to-person contact. In this paper, we build a BYM model where mobility data is used to induce spatial dependence between regions. Using mobility data within two Communities in Spain, Madrid and Castilla-Leon, we demonstrate the value of mobility data for COVID-19 spatial modeling applications. Furthermore, we extend modern parametrizations of BYM models to account for both physical adjacency and mobility simultaneously, and show that mobility data captures spatial variation in COVID-19 case counts much more accurately than physical adjacency alone. This is a short focused paper with the following plan. Section 2 presents the data and the modeling strategy based on particular parametrizations of BYM models. The results come in Section 3, and the paper ends with a final discussion in Section 4. This paper is focused on two regions in Spain. Castilla-Leon is the largest Community in Spain by area and is located in the northwest part of Spain, with a population of 2.5 million. The Community of Madrid is located in the central part of Spain and has a population of around 6.8 million, and it is home of the capital of the country, Madrid City, with 3.3 million inhabitants. The human mobility data was obtained from Barcelona Supercomputing Center Flow-map dashboard [25] . Trips within Madrid and Castilla-Leon were extracted from over 13 million phone records provided by a Spanish cellphone company. Both passive daily cases data were retrieved from the open data portal of Castilla-Leon [26] and from the Epidemiological Surveillance Network of Madrid [27] . Notice that the movement drops as cases rise, because a lockdown was implemented in response to the increasing severity of the epidemic. In order to avoid this potential "reverse causality" problem, we will only use movement data in the first week of March. Our justification for this is that there is a time lag between when the virus spreads and the resulting COVID cases are confirmed. That is, the "first wave" of the epidemic was likely influenced mostly by the movement that occurred prior to the peak in cases, and less by the movement that occurred during it. Madrid and Castilla-Leon are considered separately throughout this paper. Although they are adjacent, data on movements between the two communities are not available. In Madrid, there is a lot of movement in and around Madrid City, and less movement in the more rural areas. Castilla-Leon shows a less predictable movement pattern, as there 6 J o u r n a l P r e -p r o o f is not a single capital city that accounts for most of the movement. This movement data will be used to induce spatial correlation between regions, as described in Section 2.3. Besag, York, and Mollié (BYM) models [20] are widely used in spatial epidemiology and disease mapping due to their simplicity and computational efficiency. They assume the incidence of disease in region i follows a Poisson distribution where Y i is the number of infected cases in region i, and E i is some form of expected count or offset, which could be the at-risk population, exposure time, etc. The logrelative risk, λ i , is often modeled as where µ is the overall intercept, β is the effect of spatial covariates, φ i is the structured spatial random effect, and θ i is the unstructured spatial random effect which allows for overdispersion in the response. In the spatial formulation of the BYM model, w ij = 1 when regions i and j share a common border, and 0 otherwise. That is, region i's structured spatial effect is only conditionally dependent on its neighbours, given all other regions. The distributions {φ i |φ −i } n i=1 are known as the full conditionals, where φ −i is short hand for the set {φ 1 , φ 2 , ...φ i−1 , φ i+1 , ...φ n }. We can see from (1) that E(φ i |φ −i ) is a weighted average of its neighbours, resulting in spatial smoothing. These full conditionals correspond to the joint distribution of the φ's being a Gaussian Markov random field (GMRF) [28] , with where W is a matrix of weights such that w ij > 0 for i = j and w ii = 0, and σ 2 is a variance parameter to be estimated. D is a diagonal matrix such that D ii = j w ij . This definition ensures that the precision matrix, Q, is both symmetric and positive definite. In addition to the 0-1 weights based on regions being adjacent, other weighting schemes, such as inverse of Euclidean distance between regions, have been used. For a comparison of common weighting schemes, see [29] . When we specify Q in this way, we refer to this as an Intrinsic Autoregressive (ICAR) model for φ. The joint density function has a computationally convenient form with which is sometimes referred to as the pairwise difference formula. Notice that this density is invariant to the addition of a constant to each φ i , leaving the spatial random effects unidentifiable up to a constant. This is typically remedied by imposing the constraint i φ i = 0 [29] . We will now modify this BYM model to account for movement between regions, in addition to physical adjacency. In order to extend the BYM model to allow for spatial correlation based on movement data, a second ICAR term, γ i , with dependence structure governed by the movement data is added to the model. We also retain an adjacency-determined spatial effect φ i in order to infer the relative importance of mobility-based and adjacency-based spatial dependence in determining COVID-19 case counts. The resulting model is where φ i and γ i are the spatial random effects with priors based on the physical data and movement data respectively. The geographically-defined process φ i has weights w ij = 1 if regions i and j share a common border and are 0 otherwise, while the 9 J o u r n a l P r e -p r o o f movement-defined process γ i has weights v ij representing the number of trips between regions i and j. Using mobility as edge weights in network models has shown to be effective in the context of infectious diseases [30] [31] [32] . [30] used mobility weights in an autoregressive term, which allowed the weights matrices to be asymmetric. However, given that our mobility data is being used in a Gaussian prior for a random effect, the precision matrices of φ and γ, Q φ and Q γ , must be symmetric. Therefore we require w ij = w ji and v ij = v ji . While the first equality will always be true, the mobility matrices are not perfectly symmetric, thus symmetry was induced by defining v ij as the sum of the numbers of trips from i to j and from j to i. The GRMF does not account for the movement within a region, so the movement within a region was included in the model as a spatial covariate X i (fixed effect). That is, X i was computed as where v ii /E i is the number of trips per person within a region, and mean(v jj /E j ) and sd(v jj /E j ) are the mean and standard deviations of the trips per person in all other regions. This model was run on both the Madrid and Castilla-Leon data. There are two main drawbacks with the formulations of BYM models presented thus far. Firstly, the interpretation of the parameters σ γ and σ φ depend on the average number of neighbours and the total number of trips for each region, and hence their magnitudes are not comparable [33] . Secondly, σ φ , σ γ , and σ θ are hard to estimate without very careful choices of hyperpriors [34] . We will now address these shortcoming via reparametrizations. In order to solve issues with comparability, interpretability, and estimation, we apply a reparameterization of our model that is inspired by [35] with where ρ φ + ρ γ + ρ θ = 1 and 0 < ρ γ , ρ φ , ρ θ < 1. The priors for σ and ρ are Here, σ 2 is the combined variance of the spatial effects, and the ρ's are mixing parameters, interpreted as the proportion of the combined spatial variance explained by each model component. Note that ρ θ = 1 reduces the spatial component to purely overdispersion, ρ φ = 1 reduces the spatial component of the model to an adjacency ICAR model for the spatial effects, and ρ γ = 1 reduces the spatial component to a mobility ICAR model. Most importantly, if ρ γ > ρ φ then this means that the mobility data better explains variation in COVID-19 case counts than the adjacency data. As long as the spatial weights matrix and the mobility weights matrix are linearly independent, then having both spatial and mobility terms in our model present no issues with identifiability [36] . Finally, s γ and s φ are scaling factors, such that the geometric means of s −1 γ Var(γ i ) and s −1 φ Var(φ i ) are both ≈ 1 for each i, meaning that γ * i and φ * i are the log relative risk contributions from the movement data and physical data respectively [33] . Scaling is absolutely necessary in order to conduct inference on the J o u r n a l P r e -p r o o f Journal Pre-proof ρ's. We compute the scaling factors as follows where Q − is the generalized inverse of the n × n precision matrix [37] . In order to scale the precision matrices of the spatial effects, the generalized inverse for sparse matrices from [38] was used. The diagonal elements, [Q − ] ii , of Q − are referred to as the marginal variances of the structured spatial effects, i.e var(φ i ) = [Q − φ ] ii and var(γ i ) = [Q − γ ] ii . As was the case with the ICAR model in (1), we can derive the full conditionals of the combined spatial effect, τ i = φ * i + γ * i + θ * i , for the model described in Section 2.3 These full conditionals can help provide some intuition as to the mechanism by which this model provides spatial smoothing. As ρ γ → 1, τ i is simply the weighted sum of the other regions, where the weights are the proportion of region i's total movement between each other region. If ρ φ → 1, the conditional mean of τ i reduces to the arithmetic average of the spatial effects of its neighbours. If ρ θ → 1, then the conditional mean shrinks to 0 (remember that ρ φ + ρ γ + ρ θ = 1). Given that ρ θ is positive, the conditional mean is always shrunk towards 0, resulting in spatial smoothing. In practice, the conditional mean will be a weighted average of the estimates smoothed by the movement GMRF, the physical GMRF and 0. It is important to note here that the w ij /s φ and v ij /s γ are relative measures due to the scaling factors. That is, doubling the total amount of movement has no effect on the conditional mean or variance of τ i . This is in contrast to the combined spatial effects in the commonly used Leroux model [34] . Additionally, the variance of τ i |τ −i is lower when region i has a lot of movement or many neighbours, relative to the other regions. for computation purposes (as recommended by the Stan team [40] ). To complete the model, priors for β and µ were N (0, 1). To ensure the robustness of our results, we also ran BYM models using the adjacency data and the movement data separately. That is, for both Madrid and Castilla-Leon, we ran a model where we assumed ρ γ = 0, and a separate model where ρ φ = 0. The results of these four models are presented in Section 3.2. Our code and posterior samples are posted at https://github.com/cghr-toronto/ public/tree/master/covid/spain_public_code. Table 1 shows posterior medians and credible intervals for the mixing parameters for the model with both movement and adjacency spatial effects. For both Madrid and Castilla-Leon, the proportion of spatial variation explained by γ is much higher than that of φ and θ. The posterior probability that ρ γ > ρ φ was 0.997 for Madrid, and 0.998 for Castilla-Leon. However, φ does seem to account for a non-trivial amount of spatial variation in both Madrid and Castilla-Leon. This means that although movement data is likely more explanatory, adjacency data can help with explaining variation in COVID-19 cases. Additionally, there is a substantial amount of spatial variation explained by the unstructured spatial effect for Madrid. This is not the case for Castilla-Leon, as most of the mass of the posterior of ρ θ is near 0. This makes sense given that Madrid has a large metropolitan centre surrounded by a mix of suburbs and rural areas, so there are probably spatial confounders that our model is missing. For a plot of the posterior densities of ρ, see Appendix A. Figures 4a through 4d show the spatial distribution γ * and φ * , plotted using the same colour scale for comparability. We can see that γ's log-relative risks have a lot more spatial variation in both Commmunities. The log-relative risks for φ tend to have smooth spatial gradients, while γ tends to identify clusters of regions as high-risk areas. Table 1 : Posterior medians, and 95% credible intervals for ρ in BYM models using movement and physical (adjacency) data in the same model. As seen in equation 2, the expectation of the combined spatial effects are a weighted average of these spatial effects, and 0 (notice that the numerator can be rewritten as Figures 4e and 4f show the predicted cases per 1000 people per region, showing highly similar patterns to the observed values in Figure 2 . The standard deviation was slightly larger for Castilla-Leon than it was for Madrid. Table 2 shows posterior medians and credible intervals for the ρ parameter from the movement and physical BYM models described in Section 2.5, fit separately to Madrid In this paper, we have demonstrated that there is much value in using mobility data in combination with geographical proximity for defining correlation structures COVID-19 incidence data. We showed that even while using only one week of movement data, we were able to explain the spatial variation in COVID-19 counts better than using the classic BYM model. Additionally, we showed that the model can be re-parametrized so that the means by which smoothing occurs in these mobility models is intuitive. A key limitation of this work is that the models presented in this paper do not serve as individual-level infectious disease models, as correlation is induced by a latent effect rather than direct dependence between the counts. However, this will be a natural extension of this work and would require the addition of many more parameters, including multiple mobility network components at various time points. This will ultimately pose a computational challenge as well. An additional limitation of this work is that the availability and structure of mobility data will vary across data sources, and may only be available in higher income countries. Furthermore, there is selection bias in the movement data, as it only tracks those who actually have a cellphone, which may tend to be younger and more economically advantaged individuals. Given potential differences in quality of this data, its efficacy in spatial models may need to be assessed on a case by case basis. Furthermore, the models presented in this paper may suffer from overfitting. A potential remedy for this would be to put a penalized complexity prior [41] on the mixing 16 J o u r n a l P r e -p r o o f parameters, which may improve inference by shrinking ρ γ (and perhaps ρ φ ) towards 0. An interesting area for future work would be to combine Dirichlet and penalized complexity priors to specify a joint prior for the mixing parameters as described in [42] , which can be implemented using the makemyprior R package [43] . This was deemed unnecessary for this work, as we were mainly interested in comparing ρ γ to ρ φ , and felt that our prior should not favour either one of these terms. Despite these limitations, this work demonstrates the value of mobility data and provides the foundation for various extensions and future work. This data is only becoming more abundant as time passes, and methods that allow for efficient use of this data are essential to model the current epidemic, and any spatial epidemiological application where population movement is likely a predictor of disease. Epidemic features and control of 2019 novel coronavirus pneumonia in Wenzhou, China Mapping the incidence of the covid-19 hotspot in iran -implications for travellers Modelling and predicting the spatio-temporal spread of coronavirus disease 2019 (COVID-19) in Italy Spatial epidemic dynamics of the COVID-19 outbreak in China Spatial inequities in COVID-19 testing, positivity, incidence and mortality in 3 US cities: a longitudinal ecological study Impact of meteorological factors on the COVID-19 transmission: A multicity study in China Impact of temperature on the dynamics of the COVID-19 outbreak in China A spatio-temporal analysis for exploring the effect of temperature on COVID-19 early evolution in Spain Mapping community-level determinants of COVID-19 transmission in nursing homes: A multi-scale approach Socioeconomic factors influencing the spatial spread of COVID-19 in the United States A spatial-temporal model for the evolution of the COVID-19 pandemic in Spain including mobility Human mobility and COVID-19 initial dynamics How effective has the Spanish lockdown been to battle COVID-19? a spatial analysis of the coronavirus propagation across provinces Monitoring the COVID-19 epidemic with nationwide telecommunication data Evaluating the effect of demographic factors, socioeconomic factors, and risk aversion on mobility during the COVID-19 epidemic in France under lockdown: a population-based study The effect of human mobility and control measures on the COVID-19 epidemic in China How timing of stayhome orders and mobility reductions impacted first-wave COVID-19 deaths in US counties Association between mobility patterns and COVID-19 transmission in the usa: a mathematical modelling study Mobility network models of COVID-19 explain inequities and inform reopening Bayesian image restoration, with two applications in spatial statistics Spatial interaction and the statistical analysis of lattice systems Blacks/African Americans are 5 times more likely to develop COVID-19: spatial modeling of New York city zip code-level testing results Population-weighted exposure to air pollution and COVID-19 incidence in Germany Spatial risk factors for pillar 1 COVID-19 case counts and mortality in rural eastern England COVID-19 Flow Maps General Directorate of Information Systems, Quality and Pharmaceutical Provision Epidemiological surveillance network of Madrid Gaussian Markov random fields: theory and applications Spatial smoothing in bayesian models: a comparison of weights matrix specifications and their impact on inference Assessing the impact of a movement network on the spatiotemporal spread of infectious diseases Sheep movement networks and the transmission of infectious diseases Power law approximations of movement network data for modeling infectious disease spread Scaling intrinsic Gaussian Markov random field priors in spatial modelling Estimation of disease rates in small areas: a new mixed model for spatial dependence An intuitive bayesian spatial model for disease mapping that accounts for scaling Bayesian spatial models with a mixture neighborhood structure A note on intrinsic conditional autoregressive models for disconnected graphs Bayesian computing with INLA: a review Stan Modeling Language Users Guide and Reference Manual, version 2.26 Bayesian hierarchical spatial models: Implementing the Besag York Mollié model in Stan Penalising model component complexity: A principled, practical approach to constructing priors Intuitive joint priors for variance parameters makemyprior: Intuitive construction of joint priors for variance parameters in r