key: cord-0770113-iobr5uxe authors: Neyens, Thomas; Faes, Christel; Vranckx, Maren; Pepermans, Koen; Hens, Niel; Van Damme, Pierre; Molenberghs, Geert; Aerts, Jan; Beutels, Philippe title: Can COVID-19 symptoms as reported in a large-scale online survey be used to optimise spatial predictions of COVID-19 incidence risk in Belgium? date: 2020-09-25 journal: Spat Spatiotemporal Epidemiol DOI: 10.1016/j.sste.2020.100379 sha: c433ff98cae176ef21617fe1317119ebb067fe50 doc_id: 770113 cord_uid: iobr5uxe Although COVID-19 has been spreading throughout Belgium since February, 2020, its spatial dynamics in Belgium remain poorly understood, partly due to the limited testing of suspected cases during the epidemic’s early phase. We analyse data of COVID-19 symptoms, as self-reported in a weekly online survey, which is open to all Belgian citizens. We predict symptoms’ incidence using binomial models for spatially discrete data, and we introduce these as a covariate in the spatial analysis of COVID-19 incidence, as reported by the Belgian government during the days following a survey round. The symptoms’ incidence is moderately predictive of the variation in the relative risks based on the confirmed cases; exceedance probability maps of the symptoms’ incidence and confirmed cases’ relative risks overlap partly. We conclude that this framework can be used to detect COVID-19 clusters of substantial sizes, but it necessitates spatial information on finer scales to locate small clusters. Keywords: COVID-19, disease mapping, spatially correlated random effects, integrated nested Laplace approximation, self-reporting 2020 MSC: 00-01, 99-00 Can COVID-19 symptoms as reported in a large-scale online survey be used to optimise spatial predictions of COVID-19 incidence risk in Belgium? Although COVID-19 has been spreading throughout Belgium since February, 2020, its spatial dynamics in Belgium remain poorly understood, partly due to the limited testing of suspected cases during the epidemic's early phase. We analyse data of COVID-19 symptoms, as self-reported in a weekly online sur-5 vey, which is open to all Belgian citizens. We predict symptoms' incidence using binomial models for spatially discrete data, and we introduce these as a covariate in the spatial analysis of COVID-19 incidence, as reported by the Belgian government during the days following a survey round. The symptoms' incidence is moderately predictive of the variation in the relative risks based on the 10 confirmed cases; exceedance probability maps of the symptoms' incidence and confirmed cases' relative risks overlap partly. We conclude that this framework can be used to detect COVID-19 clusters of substantial sizes, but it necessitates spatial information on finer scales to locate small clusters. 15 COVID-19 is a respiratory disease caused by a highly infectious singlestranded RNA corona virus, SARS-CoV-2 (Chen et al. 2020, Wu et al. 2020) . It was first observed in Wuhan, the capital of the Hubei province in the People's Republic of China, in December 2019 (Zhu et al. 2020) . The virus most likely 20 has a zoonotic origin, but human-to-human transmission, which happens mainly via droplets and fomites, combined with a high basic reproductive number, has caused the disease to rapidly spread across continents. It has been declared a global pandemic on March 11, 2020 (WHO 2020 . 25 The first imported COVID-19 case in Belgium was reported on February 4, 2020, in Brussels; this case did not lead to further infections. Due to various further introductions, the disease spread throughout the country. The Belgian government has undertaken several measures to slow down community transmission, the most notable of which has been the implementation of a lockdown 30 of the country on March 18, 2020. Due to limited capacity, only a fraction of suspected Belgian COVID-19 patients has been tested to confirm SARS-CoV-2 infection. These are primarily severe cases, which has complicated the assessment of the true extent of the disease's spatio-temporal spread. 35 The University of Antwerp, in collaboration with Hasselt University and KU Leuven, has designed an ethically approved weekly online COVID-19 survey (https://www.uantwerpen.be/en/projects/corona-study/), which is open to the general Belgian public. A key objective of the survey is to collect information on COVID-19 symptoms from the general public. The weekly number of partic-40 ipants has been large; during its first four rounds, the survey reached 537, 172; 334, 935; 397, 529; and 215, 138 respondents, respectively, with complete residential and personal information to conduct a spatial analysis. However, as the survey may not reach all segments of society equally (Alessi & Martin 2010; Andrews et al. 2003) , it remains unclear whether sampling bias invalidates 45 statistical inference on the spatial dynamics of COVID-19-like symptoms as a proxy for the distribution of COVID-19. Geostatistical models are often applied to analyse and predict disease risk in a population (Diggle & Robeiro 2007) . Using methods for spatially dis-50 crete outcomes (Besag et al. 1991; Leroux et al. 1999; Lawson 2013) , we can predict COVID-19 incidence via crowd-sourced data of symptoms obtained by self-reporting in the online survey. We can use these predictions to optimally model the geographical risk distribution of confirmed COVID-19 cases, as reported by the Belgian government. This additionally allows to investigate routes 55 to develop an early-warning framework aimed at detecting COVID-19 cases by self-reporting citizens, when large-scale testing and tracing of the general public is not feasible, and to supplement information obtained from testing and tracing otherwise. In this study, we fit spatial models to data obtained during the third round of the online survey, conducted on March 31, 2020, and data of confirmed cases, as reported by the Belgian population health institute (Sciensano) between April 7 and April 9, 2020. We use approximate Bayesian estimation methods to spatially analyse self-reported COVID-19 symptoms. We then use mean incidence 65 predictions as a plug-in covariate in a spatial model to analyse confirmed cases. Our aim is to investigate whether symptoms that are reported in an online survey, which is ordinarily subject to sampling bias, are useful to predict the spatial spread of detected COVID-19 disease approximately one week later. The Belgian population health institute (Sciensano) collects daily numbers on confirmed cases in Belgium, an aggregated version of which is made publicly available (https://epistat.wiv-isp.be/covid/). We make use of the raw, publicly 75 unavailable, data of 5183 individuals with known residential, age, and gender information, who were diagnosed with COVID-19 on April 7, April 8, or April 9, 2020 (henceforth, covid data). Fig. 1 depicts the standardized incidence rates, SIR i = O i /E i , with O i and E i the observed number of cases and the internally age-gender standardised expected counts, respectively, for municipality 80 i = 1, . . . , N , with N = 589. We use age groups in the standardisation process, more specifically the age intervals, 0 − 24, 25 − 44, 45 − 64, and +65 years old. The widths of the age intervals are based on considerations related to the online survey data set; more information is provided in Section 2.2. Note that on Jan 1, 2019, a number of Belgian municipalities have been geographically and ad-85 ministratively united, which reduced the total number of Belgian municipalities from 589 to 581. We use the Belgian municipality structure of 2018 to improve spatial resolution, along with demographical information from the same year, which differs only minimally from the demography in 2020. Secondly, we use data on COVID-19 symptoms, as self-reported by participants in the third round of the online COVID-19 survey (March 31, 2020; henceforth, symptoms data), for which all necessary ethical approvals have been obtained. The survey can be filled in by all members of the public and is designed to collect data about spatial trends in COVID-19 symptoms within Belgium, 95 the extent to which members of the public adhere to measures taken by the gov- We fit models for spatially discrete data to the symptoms and covid data, us- For the symptoms data, the model structure is defined by P (Y ij = 1) = expit(α 0 + α 1 single ij + α 2 agecat 1ij + α 3 agecat 2ij + α 4 agecat 3ij + α 5 male ij + α 6 agecat 1ij * male ij + α 7 agecat 2ij * male ij + where single denotes a binary variable taking the value 1 when a participant is 130 the only member of a household and 0 otherwise; agecat 1 , agecat 2 , and agecat 3 are dummy variables that indicate whether participants belong to the age groups 25 − 44, 45 − 64, and +65, respectively, the interval widths of which we have chosen to categorise the data into groups expected to showcase different social behaviour, while maintaining balanced sample sizes among these categories; male = 1 for males, 0 for females; z 1i is a term that corrects for spatially correlated heterogeneity (CH) and/or uncorrelated heterogeneity (UH) at the municipality level. We apply and compare three approaches; (i) the convolution model (Besag et al. 1991) , where z 1i = v 1i + u 1i , with v 1i defined as a normally distributed random effects term to capture UH, CH is accommodated by u 1i , an intrinsic conditional autoregressive (CAR) random effects term, Here, w ik = 1 if areas i and k are adjacent and 0 otherwise; as an alternative, Here, Ω 1 is the precision matrix of an intrinsic CAR process, such as introduced We perform model selection using the Deviance Information Criterion (DIC, (WAIC, Watanabe 2010) goodness-of-fit criteria. We then estimate P (Y i. = 1), the predicted probability of a municipality's inhabitant to experience at least 1 typical COVID-19 symptom, which is corrected for the age, gender, and single households dynamics of the municipality. For the covid data, we fit a spatial Poisson model with a full model structure that is defined as, where R i denotes the relative risk for municipality i. We again use the convolution, Leroux, and log-normal modelling approaches to define z 2i . In the 165 convolution and log-normal models, v 2i and u 2i are defined similarly as v 1i and u 1i in (2)-(5), respectively, but with different separate heterogeneity terms denoted by σ 2 v2 ,μ 2i , σ 2 2i , and σ 2 u2 instead of σ 2 v1 ,μ 1i , σ 2 1i , and σ 2 u1 , respectively. For the Leroux model, we use similar parametrisations as in (6)-(7), but with Σ 2 , σ 2 , λ 2 , and Ω 2 instead of resp. Σ 1 , σ 1 , λ 1 , and Ω 1 . We include P (Y i. = 1), 170 as predicted by (1), as a risk factor in the model, but in its standardised form, which we denote as P (Y i = 1) s . We compare results with those coming from convolution, Leroux, and log-normal models without the symptoms' incidence covariates. Model selection is based on the joint investigation of DIC and WAIC statistics and the effect of P (Y i = 1) s . 175 We use vague priors: N (0, 1000) for all covariate effects, logit(λ) ∼ beta(1, 1) for the control parameters in the Leroux models, and penalised complexity (PC) priors (Simpson et al. 2017) for the precision parameters of the random effects. PC priors for a precision parameter, τ = 1/σ 2 , are defined by two parameters, 180 σ 0 and ξ, such that P (σ > σ 0 ) = ξ. We set σ 0 = 5 and ξ = 0.01, for τ u1 = 1/σ 2 u1 , τ v1 = 1/σ 2 v1 , τ u2 = 1/σ 2 u2 , and τ v2 = 1/σ 2 v2 . A sensitivity analysis for the choice of the prior distribution, where we use Gamma(1, 0.0005), here parametrised with a shape and rate parameter, as a prior for the precision parameters, and logit(λ) ∼ N (0, 10), has been documented in the Appendix (Section 6.1). Note 185 that precision, control, and covariate estimates, along with the maps displaying predictions and exceedance probabilities, remain almost unchanged in the sensitivity analysis. We denote covariate effects as significant, when their associated 95% credible interval does not include 0. Note that the online survey collected residential information at the postal code area level, a subdivision of the municipality level. Since this yielded at least one data record for 1083 out of these 1133 Belgian postal code areas, the symptoms data can be investigated on this finer scale as well. We provide this analysis as an Appendix (Section 6.2). We have chosen to report the data on 195 a coarser spatial scale in the main manuscript due to two main reasons: (i) the gain of working at the postal code area level is limited in the context of predictions, since demographic information at the postal code area level is currently not at our disposal. This means that predicting spatial symptom probabilities necessitates the use of covariate information at the municipality level, which is 200 not optimal; (ii) predictions at the postal code area level need to be aggregated within the municipality level to be used as a covariate in the analysis of the covid data. Hence, working at the municipality level in the symptoms data analysis allows us to provide these predictions in a more direct way. In the analysis of the symptoms data, the convolution and Leroux models provide very similar parameter estimates and goodness-of-fit statistics, while they outperform the log-normal model (Table 1) . This is in line with the observation that most extra-variability can be attributed to unobserved spatial 210 phenomena; in the Leroux model,λ 1 lies close to 1; furthermore, we see that the largest part of the uncorrelated extra-variability in the log-normal model is attributed to spatial dynamics when extending the model to the convolution model. One can argue to fit a CAR model without an UH term as an alternative, but we decide against that approach, since we would then assume that there 215 is no small-scale extra-variability. Note that the unexplained variability in all three models is arguably small, with its estimated standard deviation fluctuating around 0.1. We proceed with the Leroux model, since the convolution model can suffer from identifiability problems in its random effects structure (Leroux et al. 1999) , while noting that the final results remain largely unaffected by this 220 choice. Being single is significantly associated with a lower probability to report at least 1 typical COVID-19 symptom. However, its effect is small. Age and gender have significant interaction effects; we see the largest probabil-225 ity among non-single females between 25 and 44 years old, while the lowest probability is seen in single elderly males. Fig. 3 and 4 show, respectively, P (Y i. = 1), after correcting for demographic variation in age, gender, and household, i.e., singles vs. non-singles, and the exceedance probabilities, Table 2 presents parameter estimates and goodness-of-fit statistics for the covid data analyses. Again, the convolution and Leroux models outperform the log-normal model in terms of DIC and WAIC. These statistics are almost identical for both spatial models that include the symptoms' incidence covariate 235 and for the convolution model without the covariate. Estimates of β 0 and β 1 are similar across the three models. With regards to the symptoms' incidence's usefulness as a predictor for the relative incidence risk of the confirmed cases, we list a number of obser-240 vations: (i) goodness-of-fit statistics point towards improved model fits when the symptoms' predicted incidence is used as a covariate, except for the convolution model ; (ii) parameter estimates in the upper panel of Table 2 indicate a relatively small, but significantly positive association between the symptoms' incidence and the relative incidence risk based on the confirmed cases; (iii) how-245 ever, in all models, the extra-variability's parameter estimates do not increase substantially when the covariate is left out of the linear predictor; (iv) when we predict relative risks from a spatial model, e.g., the Leroux model, without the covariate, the Kendall correlation of those relative risks and P (Y i. = 1) is significantly different from 0, but its point estimate is small ( ρ = 0.2380; p 250 value < 2.2 * 10 −16 ); (v) unlike in the symptoms data analysis, the majority of the extra-variability is attributed to small-scale spatial variation (e.g., for the Leroux model,λ 2 = 0.1921). This is partly due to the fact that the symptoms' incidence, which predominantly shows spatial and little non-spatial dynamics, captures mostly spatial variation in the confirmed cases' incidence risk. The re- From the observations listed above, we learn that (i) P (Y i. = 1) explains a small, yet arguably significant, proportion of variability in the confirmed 260 cases' relative incidence risk. We will denote this as moderately predictive; (ii) P (Y i. = 1) mostly explains spatially correlated variation in the confirmed cases' and is best at pinpointing large disease clusters. This is also seen in Fig. 5 and 6, which depict, respectively, R i and the exceedance probabilities, P ( R i > 1.5), based on the Leroux model with P (Y i. = 1) as a covariate. The region with ele-265 vated predicted incidence of typical COVID-19 symptoms in the central-east of Belgium, situated around the city of Sint-Truiden, in Fig. 4 , overlaps well with the cluster of municipalities in Fig. 6 that has a high probability, i.e., larger than 95%, to have at least a 150% increase in relative incidence risk (Fig. 7) . This region is known, as it had the largest local COVID-19 outbreak in Bel-270 gium. Other smaller outbreaks could however not be predicted by P (Y i. = 1). Increased risk in these locations is likely accounted for in the model by the spatially uncorrelated heterogeneity term. Based on these considerations, we decide to leave the covariate in the model, 275 but note that its explanatory power is limited. We proceed with the Leroux model that includes the covariate, again due to possible identifiability issues in the convolution model. Note that the symptoms that were self-reported to be experienced during the period of March 24 − 30, have similar, moderately predictive, effects on the incidence risk of confirmed cases within a period that 280 spans more days than the period of April 7 − 9 (Table 3) . We find significant results for the effect of P (Y i = 1) s on the incidence risk of confirmed cases for almost all three-day periods throughout March 31 and April 22, as well as a significant effect when analysing all cases that were confirmed between March 31 and April 22 together. Based on the effect size and the credible intervals' 285 widths, the optimal predictive performance is suggested for the period between April 7 and April 9. Although it is uncommon to consider traditional issues related to multiple testing in the context of a Bayesian analysis, we note that for a number of three-day periods, lower limits of 95% credible intervals often lie close to zero, which might reflect spurious correlations. Our study shows that, when using geographical crowd-sourced information on COVID-19 that is obtained by self-reporting in the Big Corona Study, a large-scale online survey study, model-based symptom incidence predictions are 295 capable of explaining a (borderline) significant, yet limited, proportion of the heterogeneity that is seen in the number of confirmed COVID-19 cases, as reported by the government, within 1 to 28 days after these symptoms were experienced. Exceedance probabilities, based on the analysis of the symptoms data, pinpoint an important cluster of elevated COVID-19 risk around the city of 300 Sint-Truiden in central eastern Belgium, which aligns well with a region that has since then received increased attention, due to a number of local outbreaks. However, the symptoms analysis' exceedance probabilities do not detect disease clusters that occur very localised and that manifest themselves in the data as small-scale spatial variation, i.e., spatially uncorrelated overdispersion. Note that we have conducted the same analyses, using symptoms data from rounds 2 and 4 of the online survey, which we document as an appendix (Sec- Figure 6 : Leroux model: exceedance probabilities for the relative risk per municipality, based on data of confirmed cases between April 7 and April 9, 2020, with relative risk threshold = 1.5. tion 6.3). Similarly as in the analysis based on survey 3, the predictive means of the symptoms' incidence (borderline) significantly explain variation in the 310 number of confirmed cases. Note however that for survey round 2, these effects are seen for a more restricted set of three-day periods within a 21-day time span after the day of the respective surveys. As explained in Section 2.1, These weaker predictive performances are likely due to a combination of the overlapping influenza season during round 2 and a lower amount of participants from 315 Wallonia, which may obstruct the detection of all spatial dynamics of COVID-19 symptoms in Belgium. Note that the symptoms data reflect symptoms that were experienced during the week preceding the respective rounds of the survey. This period does not necessarily reflect the moment of the symptoms' onsets, which may have taken place earlier. Future studies will investigate how data of 320 COVID-19 symptoms' onsets can be optimally linked to data of confirmed cases. One limitation of the current modelling framework, is that not all high-risk areas in Fig. 6 could be detected by the symptoms data analysis (Fig. 4) . We believe that this is at least partly due to a considerably large amount of small-scale 325 variation in virus transmission, such that our model is currently best at pinpointing clusters that are occurring on a moderately large scale. This might be explained by the quarantine measures that obstruct typical transmission routes, such that infection mostly occurs very localised (Ganyani et al. 2020) . However, the definition of small-scale spatial variation is study-specific and depends on 330 the spatial resolution of data; in order to develop COVID-19 monitoring tools, analysts will need spatial information on finer scales than the municipality (or postal code area) level. Another reason for the limited predictive capabilities of the symptoms data is that there might still be considerable overlap with other common illnesses that have similar symptoms and occur in the same season, such 335 as influenza or pollen allergy. Because of this, we might currently only detect signals in regions that are severely hit by COVID-19. Considering the objective to use online surveys in large-scale COVID-19 monitoring, this prompts the need for more research into symptoms definitions, particularly when information on their incidence is collected through self-reporting. Furthermore, similarly to 340 the previous conclusions with respect to rounds 2 of the survey, the relatively small sample sizes in the southern part of Belgium for the symptoms data likely hamper the detection of high-incidence areas in the analysis of round 3 as well. This highlights the need for investments in monitoring tools, and promotional campaigns to engage the general public throughout the whole region to partic-345 ipate in these online surveys, in combination with such other measures as, for example, tracing strategies. With respect to demographic heterogeneity in typical COVID-19 symptoms, the analysis results report the lowest symptoms' incidence for elderly, while This study can be improved by investigating the outcomes spatio-temporally. However, the correct extraction of the specific day of the symptoms' onset from the online survey data should be undertaken with care and will be investigated 365 in the future. We have therefore analysed symptoms data that were aggregated in time. Moreover, we plan to develop a joint modelling framework in which we simultaneously model symptoms and confirmed cases, e.g., by extending correlated random-effects models proposed by Neyens et al. (2016) . This will allow us to exploit spatial dependence that is likely to occur between symptoms' in-370 cidence and the confirmed cases' disease risk. This can improve the current two-step approach where we use model-based symptom predictions as a plug-in covariate to model the spatial dynamics of the confirmed cases' disease risk. Furthermore, this modelling framework will allow us to optimally model uncertainty in the symptoms' incidence predictions, which is now left unaccounted for 375 when using these as a covariate in the covid data analysis. Ultimately, our goal is to develop a model that forecasts spatio-temporal dynamics in COVID-19 incidence, based on self-reporting of symptoms in addition to other data sources, such as absenteeism, mobility of individuals, etc., to act as an early warning system for surges in disease risk. The authors declare no conflicts of interest. The covid data, as analysed in this study, are confidential. They are available 395 in an aggregated format on https://epistat.wiv-isp.be/covid/. The symptoms data are confidential, but can be accessed after approval by the Corona Study steering committee. All data were collected in ethically approved studies. In the sensitivity analysis, we refit the Leroux models, using a Gamma(1, 0.0005) prior instead of a PC prior for the precision parameters of the random effects term, and logit(λ) ∼ N (0, 10) instead of logit(λ) ∼ beta(1, 1) for the control parameter. ipality, based on data of confirmed cases between April 7 and April 9, 2020. Figure 11 : Sensitivity analysis -Leroux model: exceedance probabilities for the relative risk per municipality, based on data of confirmed cases between April 7 and April 9, 2020, with relative risk threshold = 1.5. In the postal code area analysis, we analyse the symptoms data with a Leroux For the symptoms data, the model structure is then defined by P (Y lm = 1) = expit(α 0 + α 1 single lm + α 2 agecat 1lm + α 3 agecat 2lm + α 4 agecat 3lm + α 5 male lm + α 6 agecat 1lm * male lm + α 7 agecat 2lm * male lm + α 8 agecat 3lm * male lm + z 1l ) with the same parameter interpretations as presented in the main text. We fit the Leroux model, such that z 1 ∼ M V N (0, Σ 1 ), Σ 1 = σ 1 [(1 − λ 1 )I NP C + λ 1 Ω 1 ) −1 ], again with the same parameter interpretations as given in the main text. The probability of a municipality's inhabitant to experience at least 1 typical COVID-19 symptom is calculated as, with P (Y l. = 1) the predictive mean of the symptoms' incidence for postal code area l. We include P (Y i = 1) again in the linear predictor in its standardised 495 form, P (Y i = 1) s . For the covid data, we fit a Leroux Poisson model with the same parametrisation as presented in the main text, with the Leroux residual term Σ 2 , σ 2 , λ 2 , and Ω 2 instead of resp. Σ 1 , σ 1 , λ 1 , 500 and Ω 1 . We use N (0, 1000) as a prior for all covariate effects, logit(λ) ∼ beta (1, 1) for the control parameters, and penalised complexity (PC) priors (Simpson et al. 2017) for the precision parameters of the random effects, where we again set 505 σ 0 = 5 and ξ = 0.01. Table 5 presents the analysis results, which lie very close to the ones obtained in the analysis in the main text. municipality, based on data of confirmed cases between April 7 and April 9, 2020. The model includes symptoms' incidence predictions, based on the postal code area analysis of the symptoms data, which were aggregated within the municipality level. Figure 15 : Postal code area analysis -Leroux model: exceedance probabilities for the relative risk per municipality, based on data of confirmed cases between April 7 and April 9, 2020, with relative risk threshold = 1.5. The model includes symptoms' incidence predictions, based on the postal code area analysis of the symptoms data, which were aggregated within the municipality level. Bayesian measures of model complexity and fit On fitting spatiotemporal disease mapping models using approximate Bayesian inference Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory director-general-s-opening-remarks-at-the-media-briefing-on-covid A new coronavirus associated with human respiratory disease in China Clinical characteristics and imaging manifestations of the 2019 novel coronavirus disease (COVID-19):A multi-center study in Wenzhou city A novel 465 coronavirus from patients with pneumonia in China Estimation results for β 1 , the effect of P (Y i = 1) s , obtained from the analysis via a Leroux model of 341, 320 respondents in the second round of the online survey, when investigating different time periods of confirmed cases We thank Herman Van Oyen and Toon Braeye from the Belgian population health institute (Sciensano) for reading and commenting on our manuscript.