key: cord-0913628-s5n75nv1
authors: Python, A.; Bender, A.; Blangiardo, M.; Illian, J. B.; Lin, Y.; Liu, B.; Lucas, T. C. D.; Tan, S.; Wen, Y.; Svanidze, D.; Yin, J.
title: A downscaling approach to compare COVID-19 count data from databases aggregated at different spatial scales
date: 2020-06-20
journal: nan
DOI: 10.1101/2020.06.17.20133959
sha: 83bf898e33011d34ef84f69123ddae1b32d3b163
doc_id: 913628
cord_uid: s5n75nv1

As the COVID-19 pandemic continues to threaten various regions around the world, obtaining accurate and reliable COVID-19 data is crucial for governments and local communities aiming at rigorously assessing the extent and magnitude of the virus spread and deploying efficient interventions. Using data reported between January and February 2020 in China, we compared counts of COVID-19 from near-real time spatially disaggregated data (city-level) with fine-spatial scale predictions from a Bayesian downscaling regression model applied to a reference province-level dataset. The results highlight discrepancies in the counts of coronavirus-infected cases at district level and identify districts that may require further investigation.

: Cases of coronavirus in high-impacted (left panel) and low-impacted (right panel) provinces in China. The plot indicates the number of coronavirus-infected individuals in China in highimpacted (left panel) and low-impacted (right panel) provinces from January to February 2020. Note that data from Macau is included in Guangdong province. Sources (updated April 9, 2020): (left bar) Johns Hopkins University CSSE [7] , (center bar) The Paper [6] , and (right bar) Xu et al. [16] .

cases. Therefore, we compared the number of unique locations and associated observed 92 To further investigate the observed discrepancies among COVID-19 data, we use 93 a downscaling approach to predict at a fine spatial scale the expected cases from data 94 provided by JHU [15] , a major reference for national and subnational data on COVID-95 19, which provides data consistent with daily data from the Chinese centre for dis-96 ease control and prevention and WHO situation reports [15] . We use the R package 97 4 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted June 20, 2020. disaggregation which implements in R a Bayesian approach to carry out spatial dis-98 aggregation modeling, also called downscaling [17] . Similar approaches have been re-99 cently used to estimate fine-scale (grid-cell) risk of disease based on spatially aggregated 100 (polygon-level) data [18] [19] [20] [21] [22] [23] . 101 We apply a downscaling method to estimate cases of COVID-19 and their uncer-102 tainty within fine grid-cells (about 5km spatial resolution) in China from JHU data aggregated at province-level [15] . The model is implemented in template model builder 104 (TMB) [24] ), an R package that provides a suitable framework based on C++ to build and 105 efficiently fit hierarchical Bayesian models, including downscaling models. The model 106 5 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted June 20, 2020. . https://doi.org/10.1101/2020.06.17.20133959 doi: medRxiv preprint predictions are compared with observed values from the two disaggregated data: Xu et al. [16] and the Paper [6] .

Making predictions at a spatial scale finer than the input data is subject to potential 109 validity threats that result from a mismatch between the scale from which the data and 110 the prediction are considered [25] . One issue, known as the "ecological fallacy" has been 111 well documented in geography and spatial epidemiology since the early 1950s [26] . A 112 classical example is the study of Durkheim (Le Suicide, 1897), whose analysis showed 113 that suicide rates in Prussia were highest in provinces that were heavily Protestants and 114 wrongly concluded that stronger social control among Catholics would result in lower 115 suicide rates. The author did not envision that it may have been non-Protestants (primar-116 ily Catholics) who were committing suicide in predominantly Protestant provinces [27] .

Analogously, an observed relationship between average population density and the risk 118 of COVID-19 at province-level may not hold at city-level.

A second issue is associated with the effects of the choice of spatial units on the re-120 sults. This potential validity threat refers to the "modifiable areal unit problem" (MAUP), 121 well known by geographers and epidemiologists [28] . Within a lattice framework which 122 uses polygons defined as the spatial unit, one needs to choose the shape (regular or ir-123 regular polygons) and size of the polygons used to make predictions at a desired spatial 124 scale [29] . To avoid the risk of ecological fallacy and mitigate the effect of the MAUP, the 125 downscaling approach used here takes benefit of the spatial information gathered from 126 fine-scale covariates to model the underlying processes behind the observed data that 127 explain the spatial variations of COVID-19 counts within provinces. 128 We model the count of coronavirus-infected cases y i, j for each 5 km grid-cell j that 129 lies within each Chinese province (polygon i), with N the total number of provinces. A 130 continuous-space data-generating process is discretized into 5 km grid-cells from which 131 data is aggregated and estimates are generated. In this framework, we consider the total 132 count of COVID-19 in a given province y i as the sum of each individual count y i, j in all 133 grid-cells that cover province i. We model the incidence rate, a relevant risk metric of 134 COVID-19 that is defined as the number of new coronavirus-infected during our investi-135 gated period (January-February 2020), divided by the population at the beginning of the 136 period.

The incidence rate λ i, j in pixel j of polygon i is associated via a log link function to a 138 linear predictor η i, j composed of an intercept β 0 , N q covariates x q i, j with coefficients β q , 139 a spatial random field ζ(i, j), and an i.i.d random effect u i associated with each province 6 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)

The copyright holder for this preprint this version posted June 20, 2020. . https://doi.org/10.1101/2020.06.17.20133959 doi: medRxiv preprint in each polygon. The predicted sum of the counts from each grid-cell over a province 145 is weighted by the population size values in the corresponding grid-cell extracted from 146 a population raster (see Fig 3) [30] . The estimated incidence rate corresponds to the ex-147 pected cases divided by the weights, with

the linear predictor through a log function so that one can derive the incidence rate as avoids many of the main issues associated with using fine-scale covariates.

COVID-19 appears to transfer rapidly and easily among humans, with a basic repro-168 ductive number R 0 (also called transmission rate) seemingly between 2 and 3. There is evidence that humidity affects influenza-type virus transmission in both 195 experimental and observational studies [38, 39] , and in population-level studies [40] . High 

To mitigate a potential risk of multicolinearity, we applied a variance-inflation factor 203 function to identify and remove collinear covariates that show high levels of correlation 204 using a stepwise procedure from the R package usdm [42] . The procedure removes one 205 covariate from all pairs of covariates that exhibit Pearson's ρ > 0.75). We found that 206 travel time to the nearest city of more than 50,000 inhabitants (access) [35] and travel 207 time to Wuhan (Hubei province) are highly correlated (ρ > 0.75). We kept travel time 208 to Wuhan since we believe that proximity with Wuhan is a major driver of the spread 209 of COVID-19, especially over the study period (January-February 2020), which corre-210 sponds to the early pandemic which has been first identified in Wuhan.

In equation 1, the spatial structure ζ(i, j) is represented by a zero-mean Gaussian polygon-specific effect, with P(σ u > σ u,max ) = σ u,prob , with σ u,max and σ u,prob that can 224 be defined by the user [44] . 225 8 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted June 20, 2020. 

We carried out an out-of-sample procedure to compare the predictive performance covariates has in general the highest predictive performance. This is expected, since the 239 spread of COVID-19 is likely to be mainly driven by anthropogenic factors [33] . . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted June 20, 2020. . https://doi.org/10.1101/2020.06.17.20133959 doi: medRxiv preprint level), intercept, spatial hyperparameters log ρ and log σ, and the β coefficients associ-244 ated with the covariates. 245 Figure 4 : Parameter estimation of the downscaling model. The graphic shows the mean and 95% credible intervals (y-axis) of the estimated parameters (fixed-effects and random effects) in the model (y-axis). It includes (from bottom to top), the log precision of the i.i.d effects (polygon-level), intercept, spatial hyperparameters log ρ and log σ, and the β coefficients associated with each covariate.

As expected, travel time to Wuhan (95% CI: -0.63 ; -0.22) is negatively associated 246 with COVID-19 incidence rate: the expected COVID-19 incidence rate increases with 247 proximity to Wuhan, where the virus was first identified. We also observed a positive 248 effect of population size (pop) on COVID-19 incidence rate (95% CI: 0.08 ; 0.88), which 249 suggests that areas with larger population size tends to be more prone to COVID-19 250 infections.

There is evidence of an effect from the i.i.d. province-specific random effects (log(τ) 252 (95% CI: -0.82 ; -0.21), which is expected especially in the early stages of the epidemic, 253 where Hubei province had a much higher number of reported cases compared to the 254 other provinces, whose connection with Wuhan are highly variable can can therefore is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted June 20, 2020. . https://doi.org/10.1101/2020.06.17.20133959 doi: medRxiv preprint incidence rate. Future studies based on larger datasets might provide additional infor-260 mation on its potential effect. 

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted June 20, 2020. . https://doi.org/10.1101/2020.06.17.20133959 doi: medRxiv preprint 

Since the reported cases from the disaggregated datasets are reported at city-level, 277 they are likely to include cases that occur in the neighbourhood of the reported locations.

To ease their comparison with the predicted cases from the downscaling approach, we 279 compare the data at district level, which include 340 districts corresponding to one ad- polygon-specific effects (i.i.d random), and parameter tuning on the spatial parameters. 307 We used default penalized complexity (PC) priors the spatial parameters with default 308 13 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted June 20, 2020. . https://doi.org/10.1101/2020.06.17.20133959 doi: medRxiv preprint CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted June 20, 2020. the predictions across a fine spatial grid that covers our study area.

The method proposed in this paper exhibits several shortcomings. Our initial ex-328 ploratory analysis remains descriptive and do not capture the processes behind the gen-329 erated data. As such, it has only allowed us to identify potential data mismatches. How- 15 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted June 20, 2020. . an accurate representation of the uncertainty [23] , which has allowed us to quantify de-342 viations of the counts from the disaggregated databases with regard to the estimated 343 mean and 95% credible intervals of the predicted counts from the downscaling model.

To prevent the risk of overfitting, we built a relatively parsimonious model by selecting 

[11] GISAID, Genomic epidemiology of BetaCoV 20192020 (https://www.gisaid.org/ 434 epiflu-applications/next-betacov-app/) (2020) Accessed on 2020-03-05.

[12] KG Andersen, A Rambaut, WI Lipkin, EC Holmes, RF Garry, The proximal origin of SARS-CoV-2.

Nature Medicine, 1-3 (2020).

[13] AD Usher, WHO launches crowdfund for COVID-19 response. The Lancet 395 (2020).

[14] World Health Organizatin (WHO) (2020). Q&A on coronaviruses (https://www.who.int/ 439 news-room/q-a-detail/q-a-coronaviruses.) (2020) Accessed: 2020-02-01. 18 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted June 20, 2020. This Section provides details on the data extraction and cleaning processes that has 520 been performed to compare and select data on coronavirus. In addition, it provides fur- by John Hopkins University CSSE (JHU) [15] through an interactive web platform, whose 530 data can be downloaded using github [7] . For cases in China, the sources used include Good in-sample performance may be the results of overfitting and are therefore not 557 necessarily informative about the true performance of models. In order to assess the 558 external validity of our model and to obtain more realistic estimations of performance 559 metrics, we performed an out-of-sample cross-validation procedure by taking into ac-560 count the spatial nature of the data and the model specifications [49] [50] [51] . While meth-561 ods to evaluate out-of-sample performance of downscaling approaches have been devel-562 oped (e.g. [20, 52] ), their relevance has been essentially drawn from the results of studies 563 based on simulated data, which may not necessarily apply to our case study. Also, since 564 province-level is the finer spatial resolution of our reference data (JHU data [15] ), a cross-565 validation approach based on spatial blocks [53] would not be feasible.

As a result, we fit the downscaling models using data in all provinces except one 567 (hold-out province), predicted the expected cases in the hold-out province, and reported 568 performance metrics (RMSE, MAE) based on the expected cases from JHU. Through an 569 iterative process, the data from the hold-out province is not used during the fitting pro-570 cess to ensure that the predictive performance of the models is assessed exclusively on 571 new data. We performed the out-of-sample procedure on various model speficications, 572 similarly to the insample procedure (see Table 1 ) except that we did not consider models 573 with i.i.d random polygon-specific effects since we opted for a procedure that remove 574 data from province in an iterative way. In this case, province-specific effects cannot be 575 used to inform the model.

The model specifications (Model spec) are the following: (1) include all covariates,

(2) include only anthropogenic covariates (Socio. cov.), (3) include only environmental 578 covariates (Env. cov.), (4-7) using alternative penalized complexity (PC) priors on the 579 spatial parameters ρ min and σ max. In total, this cross-validation computes predictive 580 performance metrics for a total of 33 provinces multiplied by 7, which corresponds to a 581 total of 231 runs. The results are illustrated in Table 2 .

In Table 2, The results in Table 2 show that the model using only anthropogenic covariates (mod 589 spec.: soc) has a better out-of-sample predictive ability in general, when we include CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted June 20, 2020. Table 2 : Performance metrics computed at province-level from an out-of-sample procedure. Performance metrics computed from an out-of-sample procedure (iteratively remove each of the 33 provinces) to compare different model specifications used to predict COVID-19 counts using a downscaling approach. We compare a total of seven model specifications (without i.i.d random effects): (1) include all covariates (all), (2) include only anthropogenic covariates (soc), (3) include only environmental covariates (env), and models that include all covariates but using alternative penalized complexity (PC) priors on the spatial parameters (rho1,rho2,sigma1,sigma2).

the highest predictive performance. From these results, we select model soc, which has 595 the best predictive performance across the different predictive contexts.

Insample predictions 597 Note that the out-of-sample predictions approach that remove in an iterative way 598 each province (hold-out province) cannot be used to assess the predictive performance of 599 models with province-specific i.i.d random effects. However, random effects can account 600 for important differences we observe among provinces that cannot be informed by the 601 covariates. Therefore, our final selected model is based on the selected specification 602 (model using only anthropogenic covariates (mod spec.: soc)), by including an i.i.d 603 random effects to account for province-specific characteristics.

To assess the insample predictive ability of the model, we compare the predicted 605 incidence rate at province-level with the observed incidence rate computed as the number 606 of reported COVID-19 cases from JHU [15] divided by the population size [30] of the 607 province. For each province, the average incidence rate predicted by the downscaling 608 is given by the sum of the mean predicted incidence rate multiplied by population size 609 for each grid-cell within the province. A brief look at the results in Table 3 indicates 610 that in most province, the predicted incidence rate (per 1,000) is equal to the observed 611 incidence rate (per 1,000) when rounded at 3 digits. We can conclude that the in-sample 612 predictions of the model are of sufficient accuracy.

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)

The copyright holder for this preprint this version posted June 20, 2020. Table 3 : Results of insample predictions of the incidence rates (per 1,000) at province-level. Comparison of the observed incidence rate (per 1,000) at province-level (second row) with the average predicted incidence rate (per 1,000) (third row). The predictions are carried out in a total of 33 Chinese provinces are included.

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)

The copyright holder for this preprint this version posted June 20, 2020. . https://doi.org/10.1101/2020.06.17.20133959 doi: medRxiv preprint

WHO Coronavirus Disease (COVID-19) Dashboard

Assuming a well-specified model, the downscaling model is likely to have provided