key: cord-0574933-l18zj7tu authors: Fritz, Cornelius; Nicola, Giacomo De; Rave, Martje; Weigert, Maximilian; Khazaei, Yeganeh; Berger, Ursula; Kuchenhoff, Helmut; Kauermann, Goran title: Statistical modelling of COVID-19 data: Putting Generalised Additive Models to work date: 2022-01-04 journal: nan DOI: nan sha: 93cfc71e82f41410757af986c22c8675273cbb00 doc_id: 574933 cord_uid: l18zj7tu Over the course of the COVID-19 pandemic, Generalised Additive Models (GAMs) have been successfully employed on numerous occasions to obtain vital data-driven insights. In this paper we further substantiate the success story of GAMs, demonstrating their flexibility by focusing on three relevant pandemic-related issues. First, we examine the interdepency among infections in different age groups, concentrating on school children. In this context, we derive the setting under which parameter estimates are independent of the (unknown) case-detection ratio, which plays an important role in COVID-19 surveillance data. Second, we model the incidence of hospitalisations, for which data is only available with a temporal delay. We illustrate how correcting for this reporting delay through a nowcasting procedure can be naturally incorporated into the GAM framework as an offset term. Third, we propose a multinomial model for the weekly occupancy of intensive care units (ICU), where we distinguish between the number of COVID-19 patients, other patients and vacant beds. With these three examples, we aim to showcase the practical and"off-the-shelf"applicability of GAMs to gain new insights from real-world data. From the early stages of the COVID-19 crisis, it became clear that looking at the raw data would only provide an incomplete picture of the situation, and that the application of principled statistical knowledge would be necessary to understand the manifold facets of the disease and its implications (Panovska-Griffiths, 2020; Pearce et al., 2020) . Statistical modelling has played an important role in providing decision-makers with robust, data-driven insights in this context. In this paper, we specifically highlight the versatility and practicality of Generalized Additive Models (GAMs). GAMs constitute a well-known model class, dating back to Hastie and Tibshirani (1987) , who extended classical Generalized Linear Models (Nelder and Wedderburn, 1972) to include non-parametric smooth components. This framework allows the practitioner to model arbitrary target variables that follow a distribution from the exponential family to depend on covariates in a flexible manner. Due to the duality between spline smoothing and normal random effects, mixed models with Gaussian random effects are also encompassed in this model class (Kimeldorf and Wahba, 1970) . One can justifiably claim that the model class is one of the main work-horses in statistical modelling (see Wood, 2017 and Wood, 2020 for a comprehensive overview of the most recent advances). Numerous authors have already used this model class for COVID-19-related data analyses. As research on topics related to COVID-19 is still developing rapidly, a complete survey of applications is impossible; hence, we here only highlight selected applications, sorted according to the topic they investigate. Numerous applications analyse the possibly non-linear and delayed effect of meteorological factors (including, e.g., temperature, humidity, and rainfall) on COVID-19 cases and deaths (see Goswami et al., 2020; Prata et al., 2020; Ward et al., 2020; Xie and Zhu, 2020) . While the results for cold temperatures are consistent across publications in that the risk of dying or being infected increases, the findings for high temperatures diverge between studies from no effects (Xie and Zhu, 2020) to U-shaped effects (Ma et al., 2020) . Logistic regression with a smooth temporal effect, on the other hand, was used to identify adequate risk factors for severe COVID-19 cases in a matched case-control study in Scotland (McKeigue et al., 2020) . In the field of demographic research, Basellini and Camarda (2021) investigate regional differences in mortality during the first infection wave in Italy through a Poisson GAM with Gaussian random effects that account for regional heterogeneities. With fine-grained district-level data, Fritz and Kauermann (2021) present an analysis confirming that mobility and social connectivity affect the spread of COVID-19 in Germany. Wood (2021) shows that UK data strongly suggest that the decline in infections began before the first full lockdown, implying that the measures preceding the lockdown may have been sufficient to bring the epidemic under control. This list of applications illustrates how GAMs have been successfully employed to obtain data-driven insights into the societal and healthcare-related implications of the crisis. We contribute to this success story by focusing on three applications to demonstrate the "off-the-shelf" usability of GAMs. First, we investigate how infections of children influence the infection dynamics in other age groups. In this context, we detail in which setting the unknown case-detection ratio does not affect the (multiplicative) parameter estimates of interest. Second, we show how correcting for a reporting delay through a nowcasting procedure akin to that proposed by Lawless (1994) can be naturally incorporated in a GAM as an offset term. Here, the application case focuses on the reporting delay of hospitalisations. Third, we propose a prediction model for the occupancy of Intensive Care Units (ICU) in hospitals with COVID-19 and non-COVID-19 patients. We thereby provide authorities with interpretable, reliable and robust tools to better manage healthcare resources. The remainder of the paper is organised as follows: Section 2 shortly describes the available data on infections, hospitalisations and ICU capacities that we use in the subsequent analyses, which are presented in Sections 3, 4 and 5, respectively. We conclude the paper in Section 6. For our analyses, we use data from official sources, which we describe below. Note that our applications are limited to Germany although all of our analyses could be extended to other countries given data availability. We pursue all subsequent analyses on the spatial level of German federal districts, which we henceforth refer to as "districts". This spatial unit corresponds to NUTS 3, the third and most fine-grained category of the NUTS European standard (Nomenclature of Territorial Units for Statistics). We refer to Annex A for a graphical depiction of the spatial resolution of the data. For investigating infection dynamics across different age groups, we use data provided by the Bavarian Health and Food Safety Authority (Landesamt für Gesundheit und Lebensmittelsicherheit, LGL). This statewide register includes, the registration date for all COVID-19 infections reported in Bavaria, as well as information on the patent's age and gender. Infection data for Germany is also published daily by the RKI (Robert Koch Institute, 2021), the German federal government agency and scientific institute responsible for health reporting and disease control. Due to privacy protection, the RKI groups patients in broad age categories, which inhibits the analysis of the group of school children. As this is necessary for our first application in Section 3.3, we restrict the analysis to Bavarian data and use LGL data where not stated otherwise. In addition, the LGL dataset includes information on the hospitalisation status of each patient, which is not included in the RKI data, i.e. whether or not a case has been hospitalised and the date of hospitalisation, if this had occurred. We determine the date on which a hospitalised case is reported to the health authorities by matching the cases across the downloads available on different dates. This is necessary in order to derive the reporting delay for each hospitalisation, which is of interest in Section 4. Intensive care unit occupancy Data on the daily occupancy of ICU beds in Germany, on the other hand, is made publicly available by the German Interdisciplinary Association for ICU Medicine and Emergency Medicine (Deutsche interdisziplinäre Vereinigung für Intensiv-und Notfallmedizin, (DIVI e.V., 2021)). Using this dataset we obtain information on the number of high and low care ICU-beds occupied by patients infected with COVID-19 and patients not infected with COVID-19. As a third category, there are also the vacant beds. In contrast to the infection data, no information is available on the age or gender composition of the occupied beds. In conjunction with the data sources described above, we use demographic data on the German population at the administrative district level, provided by the German Federal Statistical Office (DESTATIS). Since the raw numbers on infections and hospitalisations are strongly influenced by the number of people living in a particular district, we use this population data to transform the absolute infection and hospitalisations to incidence rates. In general, we use the term incidence rates to refer to infection incidence rates, and hospitalisation incidence rates when writing about hospitalisations. While we effectively model the incidence rate in Section 3 and the hospitalisation incidence rate in Section 4, we incorporate the incidence rate per 100.000 inhabitants as a regressor in Section 5. A central focus during the COVID-19 pandemic is to identify the main transmission patterns of the infection dynamics and their driving factors. In this context, the role of children in schools for the general incidence poses an important question with many socio-economic and psychological implications to it (see Andrew et al., 2020; Luijten et al., 2021) . Since findings from previous influenza epidemics have tended to identify the younger population, children aged between 5 and 17, as the key "drivers" of the disease (Worby et al., 2015) , the German government ordered school closures throughout the course of the pandemic between spring 2020 and 2021 to contain the pandemic. However, whether these measures were necessary or effective in the case of COVID-19 is still subject to current research (e.g. Perra, 2021) . In particular, several studies investigated the global effect of infections among school children, but a general conclusion could not be drawn (see Flasche and Edmunds, 2021; Hippich et al., 2021; Hoch et al., 2021; Im Kampe et al., 2020) . In general, we would like to remark that in many studies the main goal was to arrive at conclusions about the susceptibility, severity, and transmissibility of COVID-19 for children (Gaythorpe et al., 2021) . On the other hand, we are here primarily interested in quantifying how the incidences of children are associated with the incidences in other age groups. Therefore, we want to assess whether children are key "drivers" of the pandemic. Our analysis is based on aggregated data on the macro level, as opposed to the data on the individual level, which is needed to answer hypotheses , e.g., about the susceptibility of a particular child. To tackle this problem from a statistical point of view, we propose to analyse the infection data using a time-series approach (Fokianos and Kedem, 2004) . Let therefore Y w,r,a denote the number of reported infections in week w in district r and age group a. For simplicity, we assume independent developments among the districts and let Y w,r,a depend on the incidences in all age groups from the previous week w − 1. Put differently, we include Y w−1,r = (Y w−1,r,1 , . . . , Y w−1,r,A ) as covariates, where 1, ..., A indexes all A considered age groups. Among the components of Y w,r we then postulate independence conditional on Y w−1,r . For illustration, Figure 1 depicts the assumed dependence structure. As for the distributional assumption, we make use of a negative binomial distribution with mean structure E(Y w,r,a |Y w−1,r ) = exp{η w,r,a + o r,a } where o r,a serves as offset and η gives the linear predictor. To be specific, we set o r,a = log(x pop,r,a ), where x pop,r,a is the time-constant population size in district r and age group a. Note that we implicitly model the incidences by incorporating this offset term, since the incidences I w,r,a relate to the counts through Y w,r,a = I w,r,a x pop,r,a . The linear predictor is now defined as where θ w serves as week-specific intercept and δ is a small constant, which is included for numerical stability to cope with zero infections. We set δ to 1 in the calculation but omit the term subsequently for a less cluttered notation. Model (1) has the important methodological advantage of being able to cope with an unknown case-detection ratio, which is inevitable if there are under-reported cases. This is a key problem in COVID-19 surveillance as not all infections are reported (Li et al., 2020) ; hence the case-detection ratio (CDR) is typically less than one. Various approaches have been pursued to quantify the number of unreported cases, e.g., by estimating the proportion of current infections which are not detected by PCR tests . For demonstration, assume thatỸ w,r,a are the detected infections in week w in district r for age group a, while Y w,r,a are still the true infections. ApparentlyỸ w,r,a ≤ Y w,r,a holds if we assume under-reporting. We assume multiplicative under-reporting and denote with 0 < R w,r,a ≤ 1 the multiplicative CDR in district r in age group a and set with R w,d = (R w,r,1 , ..., R w,r,A ) the joint CDRs for all A available age groups. In this setting, we observeỸ w,r,a = R w,r,a Y w,r,a infections in the corresponding week w, district r, and age group a from the Y w,r,a true infections. Apparently, integrity for Y w,r,a is not guaranteed with (3), which we could, ... ... however, impose by rounding. We further assume that R w,r,a and Y w,r,a are independent of each other, conditional on the previous week's data. We further assume that R w,r,a are independent random draws for the different districts, thus the case-detection ratio may vary between the districts. Assuming further an i.i.d. setting such that E(R w,r,a ) = π w,a yields for model (1) under (3): E Ỹ w,r,a |Ỹ w−1,r = E Rw,R w−1 E Yw R w,r,a Y w,r,a |Ỹ w−1,r , R w,r,a , R w−1,r = E Rw,R w−1 R w,r,a E Yt (Y w,r,a |Y w−1,r ) = π w,a E R w−1 exp{η w−1,r,a } exp{o r,a } where for clarity we include the random variable as an index in the notation of the expectation. Note that whereη w−1,r,a = K k=1 log(Ỹ w−1,r,k )θ a,k andθ w = θ w + log E R w−1 exp K k=1 log(R w−1,r,k )θ a,k . Hence, combining (4) and (5) shows that if we fit the model (2) to the observed data, which are affected by unreported cases, we obtain the same autoregressive coefficients θ a,k for k = 1, ..., A as for the model trained with the true (unknown) infection numbers. All effects related to undetected cases accumulate in the intercept, which is of no particular interest in this context. In summary, if we assume that the CDR does not depend on the number of infections but might be different between age groups and different weeks, we obtain valid estimates for the autoregressive coefficients even if (multiplicative) under-reporting is present. While the independence assumptions made are generally questionable, it is reasonable to assume these for a short time interval. Note that a similar argument holds for an additive CDR under epidemiological models proposed by Meyer and Held (2017) and Held et al. (2005) . We can now investigate the infection dynamics between different age groups to answer the question brought up at the beginning of Section 3.1. Since the age groups provided by the RKI are too coarse for this purpose, we rely on the data provided by the LGL for Bavaria. For this dataset, we have the age for each recorded case, which, in turn, enables us to define customised age groups. To be specific, we define the age groups of the younger population in line with the proposal of the WHO and UNICEF (2020): 0-4, 5-11, 12-20, 21-39, 40-65, +65. For this analysis, we estimate model (1) with data on infections which were registered between March 1 and March 31, 2021. The data was downloaded in May 2021; hence reporting delays should have no relevant impact on the analysis. We employ model (1) separately for all five analysed age groups to assess how all age groups affect each other. The fitted autoregressive coefficients θ a,k are visualised in Figure 2 including their 95% confidence intervals. The partition of the x-axis refers to index a, while index k, the influence of the other age groups, is indicated by the different colours and drawn from left (5-11) to right (65+). For instance, the label "Model 5-11" shows all interpretable effects where the target variable is the incidence of people aged between 5 and 11. Note that the only interpretative results of our model concern the dependencies between the age groups. Thus we omit the weekly intercept estimates from (2) in Figure 2 , which lose all interpretative power in the context of under-reporting as argued in Section 3.2. In general, we observe that the autoregressive effects for the own age group, i.e. a = k (drawn as triangles in Figure 2 ) are among the essential predictors in all age-groupspecific models. Regarding the effects between age groups, the association of 5-11-yearolds (yellow, most left coefficient) with all other age groups is relatively small and, in most cases, not significant. In contrast, the age groups of working people aged between 21-39 (blue, middle) and 40-65 years (green, second right) have the highest relative effect on the incidences for all age groups (except for the autoregressive coefficients). For instance, we see that the effects of the children and adolescents (5-11 and 12-20 years) on the incidences of 21-39 and 40-65-year-olds, albeit sometimes being significantly different from 0, affect the prediction far less than the incidences of the working population. In this respect, the results confirm previous analyses concluding that increasing incidences in children and adolescents are weakly associated with the incidences of other age groups. Vice versa, we find empirical evidence that people between 21 and 65 are the main drivers of infection dynamics. The results do not come without limitations. First of all, note that the data is observational, not experimental. Hence, we can only draw associative and not causal conclusions with the current-week incidences for calendar weeks 9-12 in 2020 stratified by age group. from the data without additional assumptions. Moreover, we rely on the given assumptions on the under-reporting. Still, rerunning the analyses for other weeks, shown in the Supplementary Material, yielded similar results, supporting the robustness of our approach and findings. A relevant number of COVID-19 infections lead to hospitalisations, and the incidence of patients hospitalised in relation to COVID-19 is of paramount importance to policymakers for several reasons. Firstly, hospitalised cases are most likely to result in very severe illnesses and deaths, the minimisation of which is generally the primary aim of healthcare management efforts. In addition, knowing the number of hospitalised patients is crucial to adequately assess the current state of the healthcare system. Finally, while the number of detected infections depends considerably on testing strategy and capacity, the number of hospitalisations provides a more precise picture of the current situation. For these reasons, hospitalisation incidence has been deemed increasingly more relevant by scientists and decision-makers over the course of the pandemic, and finally became the central indicator for pandemic management in Germany from September 2021, complementing the incidence of reported infections. The central problem in calculating the hospitalisation incidence with current data is that hospitalisations are often reported with a delay. Such late registrations occur along reporting chains (from local authorities to central registers), but also due to data validity checking at different levels. Visual proof of the degree of this phenomenon is given in Figure 3, which depicts the empirical distribution function of the time (in days) between the date on which a patient is admitted into a Bavarian hospital and the date on which the hospitalisation is included in the central Bavarian register. In 2021, only 11.6% of hospitalised cases in Bavaria are known the day after admission, and about two thirds of them (66.4%) are reported within seven days. Moreover, the duration tends to be slightly shorter for patients younger than 60 than older patients. Modelling and interpreting current data with only partially observed hospitalisation incidences can lead to biased estimates and misleading conclusions, especially if one is interested in the temporal dynamics. To correct for such reporting delays, we utilise "nowcasting" techniques, loosely defined as "[t]he problem of predicting the present, the very near future, and the very recent past" (p. 193, Bańbura et al., 2012) . Related methods have been extensively treated in the statistical literature (see, e.g., Höhle and An Der Heiden, 2014; Lawless, 1994) and successfully applied to fatalities and infections data during the current health crisis (De Nicola et al., 2020; Günther et al., 2020; . In contrast to these approaches, we focus here on modelling the hospitalisation incidences, correcting for delayed reporting through a nowcasting procedure based on the work of . We denote by R t,r,g the hospitalisation incidence on day t for district r and age/gender group g, while the absolute count of hospitalisations in the same cohort is defined by H t,r,g . Naturally, those two quantities related to one another through R t,r,g = H t,r,g x pop,r,g . To account for the delayed registration of hospitalisations in H t,r,g when modelling R t,r,g , we pursue a two-step approach consisting of a nowcasting and a modelling step. In the former step, we nowcast the hospitalisations that are expected but not yet reported, while in the latter step we model R t,r,g as a function of several covariates, which will allow us Figure 4 : Illustration of the data setting for d max = 6. N t,d indicates hospitalisations reported with a specific delay d, while C t,d denotes all those reported with delay up to d. H t denotes the final number of hospitalised cases regardless of the delay with which they were reported, that is with a delay up to the maximum possible, d max . to gain insights into the geographic and sociodemographic drivers of the pandemic. We describe the two steps below. In this first step, we estimate the final number of hospitalised patients on day t, denoted by H t , factoring in the expected reporting delay. Note that, while we have data available at the district level, at this stage we aggregate hospitalisations across Bavaria due to the sparsity of the data. If we are performing the analysis on day T , we can compute the cumulative hospitalisation counts C t,d = d l=1 N t,l , where N t,d is the number of hospitalisations on day t reported with delay d, for every t ∈ {1, ..., T } and d ∈ {1, ..., T − t}. Assuming a maximal reporting delay of d max days, we denote the complete distribution of delayed registrations of cases with hospitalisation on day t by N t = (N t,1 , ..., N t,dmax ) ∈ N dmax with dmax d=1 N t,d = H t . We graphically demonstrate how N t,d , C t,d , and H t relate to one another in Figure 4 . By design, N t follows a multinomial distribution: where π t = (P(D t = 1; t), ..., P(D t = d max ; t)) are the proportions of hospitalisations on day t with a specific delay, and D t is a random variable describing the reporting delay of a single hospitalisation which occurred at time t. For this application, we do not directly model those probabilities but instead opt for a variant of the sequential multinomial model proposed by Tutz (1991) . In particular, we define the conditional probabilities through conditional on covariates x t . It follows that the cumulative distribution function of D can be written as: (1 − p t (d|x t )). Combining (7) and (8) allows us to model the delay distribution with incomplete data. We do this separately for two age groups, which we denote by an additional index a. This leads to the model with the structural assumption where θ 0 is the intercept, s 1 (t) = θ 1 t + L l=1 α l · (t − 28l) + is the piece-wise linear time effect, s 2 (d) the smooth duration effect, s 3 (d) a varying smooth duration effect for the age group 60+, and x t,d are additional covariates depending on t and the delay d, i.e. a weekday effect for t and t + d. From Figure 4 , one can also derive that the proportion of H t,a included in C t,a,d can be comprehended as the probability that a hospitalisation on day t in age group a has a reporting delay smaller than or equal to d, i.e. F t,a (d|x t,a ). Assuming independence of H t,a from D t,a then yields: meaning that the expected number of patients from age group a hospitalised on day t can finally be obtained as In summary, we can fit the logistic regression model given by (10) with the available data on hospitalisations. Based on this model, we exploit (12) to obtain an estimate for the expected number of hospitalisations from age group a on day t. Uncertainty intervals for the estimated nowcasts can then be obtained e.g. through a parametric bootstrapping approach relying on the asymptotic multivariate normal distribution of the estimated model coefficients. In the second step, we propose a model for the expected value of R t,r,g , the hospitalisation incidence on day t in district r and age/gender group g, conditional on covariates x t,r,g . To be specific we set E(R t,r,g |x t,r,g ) = exp{θ 0 + θ age x age,g + θ gender x gender,g + θ gender:age x age,g x gender,g + θ weekday x weekday,t + s 1 (t) + s 2 (x Lon,r , x Lat,r ) + u r } = exp{η t,r,g }, where the linear predictor η t,r,g includes, in addition to the intercept θ 0 , effects for the age/gender groups through the main and interaction effects θ age , θ gender and θ gender:age . Additionally, we include dummy effects θ weekday for each day of the week to account for potentially different hospitalisation rates over the course of the week. Furthermore, the hospitalisation incidences are allowed to vary over time through the smooth term s 1 (t). Finally to account for spatial heterogeneity, we add a smooth spatial effect of each district's average longitude and latitude s 2 (r) and a Gaussian random effect to capture random deviations from this smooth effect, i.e. u r ∼ N(0, τ 2 ) with τ 2 ∈ R + . Note that, on any given day t > T −d max , we do not yet observe the final hospitalisation counts H t,r,g , but only the ones already reported at this time, that is C t,r,g,T −t , indicating the cumulative observations on day t in district r reported with a delay of up to d days for age/gender group g. The age/gender group indexed by g extends the coarse (binary) age categorisation a used in Section 4.1, which only differentiates between cases younger and older than 60 years. Exploiting (12) and the definition (6) of the incidence leads to the final model E(R t,r,g |x t,r,g ) = E(C t,r,g,T −d |x t,r,g ) x pop,r,g F t,g (T − t|x t,g ) . We thereby set C t,r,g, Rearranging (14) shows that modelling the count variable C t,r,g,T −d with the offset term log(x pop,r,g F t,g (T − t|x t,g )) is equivalent to modelling R t,r,g as in (13), since E(C t,r,g,T −t |x t,r,g ) = exp {η t,r,g + log(x pop,r,g F t,g (T − t|x t,g ))} = µ t,r,g holds. In practice we thereby replace the unknown quantities in the offset through their estimates derived in the previous section. In other words, the delayed reporting is easily accommodated through an (estimated) offset in the model using only the reported data C t,r,g,T −t . With this "trick" We can now finalise the model, where we here make use of a negative binomial model to account for possible overdispersion, that is: with µ t,r,g parametrised as in (15) and (13), and the dispersion parameter σ 2 is estimated from the data. As an additional note, we point out that accounting for late registrations works analogously for any model within the endemic-epidemic framework originating in Held et al. (2005) . The only difference to the approach presented here is that the exact functional form of the expected value must be adequately accounted for. For instance, if µ t,r,g consists of the sum of non-negative endemic and epidemic terms, one should incorporate the offset in both terms. For the application, we focus on the first two months of the fourth wave of the pandemic in Bavaria, which began towards the end of September 2021. In particular, we consider hospitalisations between September 24th and November 18th, using data reported as of November 19th, 2021. We set d max = 40 days to be the maximum possible duration between hospitalisation and its reporting in the central Bavarian register. We derive this choice from the empirical delay distribution in Figure 3 , proving that since the beginning of 2021, around 94% of the hospitalisations have been reported within 40 days of their occurrence. We have no information on the date of hospital admission for about 9.6% of all hospitalisations related to COVID infections that were reported between September 24th and November 19th. For those cases, we replace the date of hospitalisation with the respective COVID-19 infection date as reported by the local health authorities. For brevity, we only present a comparison of the nowcasted and raw hospitalisation counts for the nowcasting model and the age/gender group-specific and spatial effects of the hospitalisation model. We refer to the Supplementary Material for additional results. Figure 5 maps the raw and corrected rolling weekly sums of hospitalisation counts accompanied by the 95% confidence intervals for the whole population as well as separately for the two age groups under consideration. While reported numbers indicate a relatively stable or even slightly decreasing development over the last two weeks observed, the nowcast reveals a continuous upward trend since the beginning of October. Comparing both age-stratified populations, the increase for those over 60 years (the more vulnerable) is steeper. These results emphasise the need to adjust reported hospitalisation counts, as they tend to systematically underestimate the number of recently occurred hospitalisations, which can lead to inaccurate conclusions about the current situation. Turning to the results of the hospitalisation model proposed in Section 4.2, the estimated coefficients for all age and gender combinations can be seen in Figure 7 . Those estimates reveal considerably lower hospitalisation rates for people younger than 35 than all other age groups. We generally observe a positive correlation between age and risk of hospitalisation for both genders, i.e. older people are more likely to be hospitalised. The only exception to this intuitive finding is men over 80 years, whose expected hospitalisation rates are slightly lower than men aged 60 to 79. Statistically significant differences between men and women are visible across all age groups. While women in the youngest and oldest age group tend to have a (slightly) higher hospitalisation rate than men, the opposite holds for the other groups. Figure 7 depicts the random and smooth spatial effects (on the log-scale). The smooth effect in Figure 7 (a) pictures a clear spatial pattern, with generally higher hospitalisation rates in the eastern parts of Bavaria and lower rates in the north-western districts. This structure reflects the pandemic situation in Bavaria in autumn 2021, where we observed the most severe dynamics in the respective districts. Districts with unexpectedly high or low hospitalisation rates (when compared to their neighbouring areas) can be located on the map of the district-specific random intercepts in Figure 7 (b). Contrary to its role as a hotspot during the second wave in autumn 2020, the district with the lowest random effect is Berchtesgadener Land. Understanding the infections of COVID-19 is essential when aiming to attain insights into the disease's spread. However, the analysis of infections alone can only tell part of the story. In particular, focusing only on infections disregards information on the severity of the disease, which is crucial in understanding and anticipating potential strains on the health care system. This information, among other data, can be captured by the ICU occupancy, which is the focus of our third application case. We consider the occupancy of ICUs where, as described in Section 2, beds are categorised into the number of vacant beds (Z w,r,1 ), number of beds occupied by patients infected with COVID-19 (Z w,r,2 ), and number of beds occupied by patients not infected with COVID-19 (Z w,r,3 ). Further, we denote by Z w,r = (Z w,r,1 , Z w,r,2 , Z w,r,3 ) the vector of length three expressing the average number of ICU-bed occupancy in week w and district r. The canonical GAM for this type of data is a multinomial model; hence the distributional assumption is: where N w,r = 3 j=1 Z w,r, j is the known number of available beds in district r and week w and π w,r = (π w,r,1 , π w,r,2 , π w,r,3 ) defines the proportion of occupied beds in all respective categories. One advantage of this multinomial approach is that we implicitly account for displacement effects commonly observed for ICU occupancy data. Over time, as the number of beds occupied by patients infected with COVID-19 rise, both free beds and beds occupied by patients not infected with COVID-19 decrease almost simultaneously. In particular, the "displacement" may be caused by practices such as rescheduling non-urgent operations or other treatments which would have required an ICU stay, which were already common during the first wave of COVID-19 (Stöß et al., 2020) . These effects lead to negative correlations between the entries in Z w,r , which is naturally accounted for in model (16) as the correlation between arbitrary counts Z w,r,k and Z w,r,l is −N w,d π w,r,k π w,r,l ∀ k, l ∈ {1, 2, 3}, k = l. Taking the number of COVID-19 as the reference category, we effectively parametrise pairwise comparisons via log π w,r, j π w,r,3 = η w,r, j ∀ j = 1, 2, where the linear predictors η w,r, j are functions of covariates labeled as x w,r and defined by: η w,r, j =θ 0, j + θ AR(1), j (Z w−1,r,1 ,Z w−1,r,2 ) + θ I, j log(Y w−1,r + δ )+ s j (x Lon,r , x Lat,r ) + u r, j ∀ j = 1, 2, where θ 0, j is the intercept term. Further, we incorporate an autoregressive component in (18) by including the relative ICU occupancy observed in the previous week as a regressor. We denote the distribution of the different occupancies of the previous week asZ w−1,d = (Z w−1,r,1 , Z w−1,r,2 )/( 3 j=1 Z w−1,r, j ), the respective effect is denoted by θ AR(1), j for the jth linear predictor. We also let (18) depend on the previous week's district-and age-specific infections per 100.000 inhabitants (incidences) denoted by Y w−1,r,a , that are weighted by the coefficient θ I, j ∀ j = 1, 2. To correct for district heterogeneity, we include Gaussian random effects, i.e. u r, j ∼ N(0, τ 2 ) ∀ r ∈ {1, ..., R} ∀ j = 1, 2. For smooth spatial deviations from these random effects, we add a bivariate function s j (·, ·) ∀ j = 1, 2 parametrised by thin-plate splines that take the longitude and latitude of each district as arguments (see Wood, 2003 , for more details). For notational brevity, let θ denote the joint parameter vector of (18) ∀ j = 1, 2. As stated, the multinomial model has the beneficial property of automatically accounting for displacement effects. Note, however, that patients' expected length of stay in intensive care may exceed our time unit of one week, as the average stay of COVID-19 patents is about 13 days (see Vekaria et al., 2021) . This means, that not all beds are completely redistributed at every time point of observation. However, apart from including the previous week's occupancy in the covariates, our proposed model does not adequately account for this stochastic variability. We therefore pursue a Bayesian view and let N w,r be the number of ICU beds in district r in week w. This number is known, and we assume that each week only a fixed but unknown proportion α of beds in the three categories become disposable, where 0 < α < 1. That is to say that αN w,r beds are redistributed among the three categories, where integrity is assumed but not explicitly included in the notation for simplicity. We assume that this new allocation is independent of the previous status of the beds and denote the newly allocated beds with the three-dimensional vector A w,r = (A w,r,1 , A w,r,2 , A w,r,3 ). This setting translates to: Z w,r = (1 − α)Z w−1,r + A w,r . For the newly allocated beds we still assume a multinomial model: with π w,r specified in (18). Note, however, that we do not know α and that no information is provided in the data concerning the length of stay or the number of beds changing their status. To account for that data deficiency, we impose a Dirichlet distribution on the vector π w,r , where the prior information is determined by the available beds, i.e. Combining the prior (20) with the likelihood from (19), leads to the posterior f π (π w,r |A w,d ) ∝ 3 j=1 π A w,r, j +(1−α)Z w−1,r, j w,r, j = 3 j=1 π Zw,r, j w,r, j This, in turn, equals the likelihood resulting from the multinomial model and justifies the use of model (17) even though not all beds are allocated weekly. Nevertheless, the central assumption of independent observations in standard uncertainty quantification in GAMs (Wood, 2006) is violated. To correct for this bias, we substitute the canonical covariance of the estimators with the robust sandwich estimator based on M-estimators defined by: where we set A(θ ) = E − ∂ ∂ θ ∂ θ (θ ) , B(θ ) = Var ∂ ∂ θ (θ ) , and (θ ) is the logarithmic likelihood resulting from (16) or equivalently the logarithm of the posterior of (18). See also Stefanski and Boos (2002) and Zeileis (2006) . We now employ the multinomial logistic regression (16) to ICU data recorded during the third wave between March and June 2021. For the incidence data used in the covariates, we employ the RKI data; hence we set A = 4 and the age groups are: 15-34, 35-59, 60-79 and 80+. Further, we normalise all non-binary covariates: In this way, we facilitate the interpretation of associations and guarantee a meaningful comparison between the covariates. Due to space restrictions, we here only present the linear effects from (18) and refer to the Supplementary Material for the random and smooth estimates. In Figure 8 , we visualise the estimated coefficients including their confidence intervals. The reference category in both pairwise comparisons is COVID-beds; hence we refer to the two models by free vs. COVID beds and non-COVID vs COVID bed. In particular, the coefficients relate to the association between the covariates and the logarithmic odds of a bed not being occupied in comparison to being occupied by a patient with COVID-19, shown with blue dots in Figure 8 . Analogously, the orange triangles in Figure 8 illustrate the estimated association between the covariates and the logarithmic odds of a bed being occupied by a patient not infected with COVID-19 in comparison to a bed being occupied by a patient infected with COVID-19. To demonstrate the uncertainty of each estimate, a 95% confidence interval is added. Keeping the other variables constant, the normalised lagged log-incidences of all age groups generally have a negative effect on the logarithmic odds of both pairwise comparisons. This translates to the finding, that an increase in the incidences leads to a decrease in the proportion of non-COVID and free-beds in when compared to COVID beds. The lagged normalised proportion of free and non-COVID beds is estimated to have a stronger, positive association with the logarithmic odds of both pairwise comparisons. We, therefore, expect a higher number of non-COVID beds in the previous week to be followed by a higher number of non-COVID beds in the next week. The model can be extended to a forecasting model, as shown in the supplementary material. In particular, we demonstrate how forecasting performance changes over the different waves of the pandemic. In principle, we could also incorporate further covariates like district-specific proportions of vaccinated people. Unfortunately, these numbers are not very reliable and require sophisticated cleaning, so we prefer not to present results in this direction here. The COVID-19 pandemic poses numerous complex challenges to scientists from different disciplines. Statisticians and epidemiologists, in particular, face the problem of extracting meaningful information from imperfect, incomplete and rapidly changing data. Generalised additive models are a powerful tool that, if used correctly, can help solving some of these challenges. In this work, we have addressed three such challenges where the utilisation of GAMs provided meaningful insight. 1. We investigated whether children are the main drivers of the pandemic under a timevarying case-detection ratio. 2. We modelled hospitalisation incidences controlling for delayed registrations, thereby providing both up to dates estimates of current hospitalisation numbers as well as insight on the demographic and spatio-temporal drivers of COVID-19. 3. We developed an interpretable predictive tool for ICU bed occupancy that is actively used by the Bavarian government. We achieved all of those results by using GAMs with different methodological extensions. Nevertheless, the use of our proposed models to extract novel information from the data provided is still subject to both data-related and methodological limitations. In general, our data sources are subject to exogenous shocks (e.g. policy changes) that lead to sudden changes in population behaviour and pose a danger to the validity of our results. Regarding the study of infection dynamics of school kids, revised testing policies hinder the long-range comparability of our findings. In the hospitalisation data, the exact date of hospitalisation is missing for about 10% of the hospitalised cases, which we impute by the given registration date of the infection. Furthermore, the records on the ICUbed occupancy do not include intrinsic constraints, as the capacity of beds available to COVID-19 patients does not equate to the capacity of beds available to patients not infected with COVID-19. There are also methodological limitations. First of all, note that the data is observational, not experimental. Additionally, the set of covariates in our model can easily be extended to control for other factors, such as meteorological and socioeconomic ones. We close this work by emphasising that the nowcasting model can also be used as a stand-alone model. In the German COVID-19 Nowcast Hub (Karlsruhe Institute of Technology and Heidelberg Institute for Theoretical Studies (2021)), the described model is used among other nowcasting methods, including the work of Günther et al. (2020) and van de Kassteele et al. (2019) , to estimate hospitalisation counts on the national and federal state level in Germany. Apart from a systematic evaluation of the different approaches, one of the main goals of this project is to combine individual nowcasts to an ensemble nowcast, which may lead to more accurate estimates. Figure 10 : Association of previous week's incidences in different age groups (colour-coded marked) with the current-week incidences for calendar weeks 9-12 2020 stratified by age group. To prove the robustness of the findings shown in Section 3.3 from the main article, we repeat the same analysis with data between 24.1.2021 and 29.2.2021 (calendar weeks 4 to 7). As can be seen in Figure 10 , the estimated coefficients are similar to the ones reported in the principal analysis of Section 3 and result in analogous interpretations. Modelling hospitalisations relies on the availability of hospital admission dates of COVID-19 patients. In cases where no dates are reported (about 9.6% of hospitalised cases reported after September 24th), the reporting date of the infection is used instead. This choice is justified by the seven-day sum of hospitalisations over time, calculated based on both types of dates, respectively. 11 shows that there are no structural differences between both time series considering hospitalisations where both types of dates are available. In the following, we visualise all estimated effects of the nowcasting model described in Section 4.1 of the main paper. We observe the strongest association from the time delay showing a steep decrease especially in the first days after admission to hospital ( Figure 13 ). Comparing the effects for both age groups, only minor differences can be noticed. The piece-wise linear time effect ( Figure 12 ) shows a positive trend over the considered period with a lower slope in the last four weeks. Differences in weekdays are visible for both, the admission date to hospital as well as the date on which a hospitalised case is reported (Figure 14) . Figure 12 : Estimated temporal effect within the time period between September, 24th and November 18th, 2021, accompanied by 95% confidence intervals. Figure 13 : Estimated smooth delay effects between admission to hospital and its reporting for age groups 0-59 (left) and 60+ (right) accompanied by 95% confidence bounds. reporting date of hospitalisation (right) accompanied by 95% confidence intervals. The reference category is taken to be Monday, respectively. In addition to the results described in Section 4.3, Figures 15 and 16 visualise additional effects in the hospitalisation model for which we controlled for. The smooth time indicates steady positive effects comparable to the development of the nowcasted seven day sum of hospitalisation counts illustrated in Figure 5 . The weekday effects reveal a clear pattern over the week, whereas there are rather small differences from Monday to Friday and considerably less hospitalisations occur at the weekend. Figure 16 : Estimated linear weekday effects accompanied by 95% confidence intervals. The reference category is taken to be Monday. Figure 17 : District specific random intercept in estimating the logarithmic odds of an ICU-bed not being occupied vs occupied by a patient infected with covid (a), and in estimating the logarithmic odds of an ICU bed being occupied by a patient not infected with COVID-19 vs a bed being occupied by a patient infected with COVID-19 (b) Figures 17 and 18 depict the estimated district-specific random and smooth effects, respectively. Upon first inspection, there does not necessarily seem to be a clearly visible pattern in the estimated random effects, shown in Figure 17 . However, on a closer look, districts that include towns or even cities seem to have a lower estimated effect, while purely rural areas seem to have a higher area-specific estimated effect on the in both pairwise comparisons of the ICU occupancy. Besides the area-specific associations estimated by our model, there could also be a spatial association in the occupancy throughout Germany. The estimates are given in Figure 18 . Generally, there seems to be a north-south divide, as the smoothed spatial effect appears to be higher in the north and lower in the south, where a belt from seemingly below Dresden stretching through Germany to what seems to be just below Bonn is the partition with an estimated centred effect of around zero. Figure 18 : Estimated centred smooth spatial association with the logarithmic odds of an ICU-bed not being occupied vs occupied by a patient infected with covid (a), and with the logarithmic odds of an ICU bed being occupied by a patient not infected with COVID-19 vs a bed being occupied by a patient infected with Besides interpreting the coefficients, we can use the same model to predict the occupancy of beds in the next week. To show how this works in practice, we carry out one-step-ahead predictions and evaluate the results in a rolling window framework. To obtain forecasts for the occupancy in week w, we train our model with the prior 8 weeks, i.e., w − 9, ..., w − 1, and use the information on week w as a test set. Since we only incorporate covariables that are lagged by one week, this setting is equivalent to providing real forecasts for the week after the end of the observational period. We let w vary between the 27th of September in 2020 (calendar week 40 in 2020) and the 12th of September in 2021 (calendar week 37 in 2021). Within this time frame, we cover two critical infection waves as well as low-infection seasons. In this way, we can assess the behaviour of our predictions in many different realistic scenarios. To measure the goodness of the model, we rely on strictly proper scoring rules (Gneiting and Raftery, 2007) and use the logarithmic score as employed in Held et al. (2017) and defined by the logarithmic probability to observe the occupancy in week w under the multinomial model from (16) and the predicted π w,r from the trained model. We compare the performance of the full model as specified in (18) Figure 19 : The logarithmic score of all candidate models from the one-week-ahead forecasts between October 2020 and September 2021. and spatial effects, without autoregressive effects, and only including an intercept term. To further differentiate between all models, we calculate average scores for each model and employ pairwise permutation tests to compare all sub-models against the predictions under the full model as proposed in Diebolt and Mariano (2002) . Figure 19 shows the results of our one-step-ahead forecasts for all proposed models. Overall, the average score indicates that the full model provides the best performance than all other model specifications. Consistently low p-values from the permutation tests underscore this finding. We can use the intercept model to identify the high and low infection seasons regarding temporal discrepancies between the models. It is also apparent that the model performs poorly during the infection waves when leaving out the infection data, underlining that lagged infections are crucial during these periods. Not including the autoregressive component in the model, on the other hand, seems to mainly impair the logarithmic score during low-infection periods such as in calendar weeks 1 to 25 in 2021. In comparison, the random and spatial effects hardly affect the model's predictive power. However, the logarithmic score is still significantly better when including them. In summary, we can deduce from Figure 19 that the full specification consistently provides the best forecasts for high and low infection periods. Inequalities in children's experiences of home learning during the COVID-19 lockdown in England* The Oxford Handbook of Economic Forecasting Explaining regional differences in mortality during the first wave of COVID-19 in Italy Regional now-and forecasting for data reported with delay: Towards surveillance of COVID-19 infections Comparing predictive accuracy Daily ICU occupancy data for COVID-19 and non-COVID-19 patients The role of schools and school-aged children in SARS-CoV-2 transmission Partial likelihood inference for time series following generalized linear models On the interplay of regional mobility, social connectedness, and the spread of COVID-19 in Germany Children's role in the COVID-19 pandemic: A systematic review of early surveillance data on susceptibility, severity, and transmissibility Strictly proper scoring rules, prediction, and estimation Projections for COVID-19 pandemic in india and effect of temperature and humidity Nowcasting the COVID-19 pandemic in Bavaria Generalized additive models: Some applications A statistical framework for the analysis of multivariate infectious disease surveillance counts Probabilistic forecasting in infectious disease epidemiology: The 13th Armitage lecture A public health antibody screening indicates a marked increase of SARS-CoV-2 exposure rate in children during the second wave Weekly SARS-CoV-2 sentinel surveillance in primary schools, kindergartens, and nurseries Bayesian nowcasting during the STEC O104: H4 outbreak in Germany Surveillance of COVID-19 school outbreaks Nowcasts of the hospitalization incidence in germany (covid-19) A correspondence between bayesian estimation on stochastic processes and smoothing by splines Adjustments for reporting delays and the prediction of occurred but not reported events Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (SARS-CoV2) The impact of lockdown during the COVID-19 pandemic on mental and social health of children and adolescents Effects of temperature variation and humidity on the death of COVID-19 in wuhan, china Rapid epidemiological analysis of comorbidities and treatments as risk factors for COVID-19 in Scotland (REACT-SCOT): A population-based case-control study Incorporating social contact data in spatio-temporal models for infectious disease spread Generalized linear models Can mathematical modelling solve the current COVID-19 crisis? Accurate statistics on covid-19 are essential for policy guidance and decisions Non-pharmaceutical interventions during the COVID-19 pandemic: A review Temperature significantly changes COVID-19 transmission in (sub)tropical cities of brazil. Science of The Total Environment 729 Daily COVID-19 cases data Nowcasting fatal COVID-19 infections on a regional level in Germany A statistical model for the dynamics of COVID-19 infections and their case detection ratio in 2020 The calculus of M-estimation The COVID-19 pandemic: Impact on surgical departments of non-university hospitals Sequential models in categorical regression Nowcasting the number of new symptomatic cases during infectious disease outbreaks using constrained p-spline smoothing Hospital length of stay for COVID-19 patients: Data-driven methods for forward planning The role of climate during the COVID-19 epidemic in new south wales, australia Advice on the use of masks for children in the community in the context of COVID-19: Annex to the advice on the use of masks in the context of COVID-19 Thin plate regression splines On confidence intervals for generalized additive models based on penalized regression splines Generalized additive models: An introduction with R Inference and computation with generalized additive models and their extensions Inferring UK COVID-19 fatal infection trajectories from daily mortality data: Were infections already in decline before the uk lockdowns? On the relative role of different age groups in influenza epidemics Association between ambient temperature and COVID-19 infection in 122 cities from china Object-oriented computation of sandwich estimators We would like to thank Manfred Wildner and Katharina Katz on behalf of the staff of the IfSG Reporting Office of the Bavarian State Office for Health and Food Safety (LGL) for cooperatively providing the data used for sections 3 and 4 and for fruitful discussions on the analysis of the COVID-19 pandemic. Moreover, we would like to thank all COVID-19 Data Analysis Group (CODAG) members at LMU Munich for countless beneficial conversations and Constanze Schmaling for proofreading. The work has been partially supported by the German Federal Ministry of Education and Research (BMBF) under Grant No. 01IS18036A. The authors of this work take full responsibility for its content. We also acknowledge support of the Deutsche Forschungsgemeinschaft (KA 1188/13-1) and the Bavarian Health and Food Safety Authority (LGL). We declare that there is no conflict of interest. We carried out most modelling endeavours presented in this article on the NUTS 3 level, which is shown on the right side of Figure 9 . The only exception is the Nowcasting model from Section 4.1, where we aggregate all data onto the NUTS 1 level in Bavaria. Moreover, NUTS 1 regions, depicted on the left side of Figure 9 , are the federal states in Germany and Bavaria is one of them. In Section 3 and 4, we are only analysing data from Bavaria, while we employ data from complete Germany in Section 5.