key: cord-226245-p0cyzjwf authors: Schneble, Marc; Nicola, Giacomo De; Kauermann, Goran; Berger, Ursula title: Nowcasting fatal COVID-19 infections on a regional level in Germany date: 2020-05-15 journal: nan DOI: nan sha: doc_id: 226245 cord_uid: p0cyzjwf We analyse the temporal and regional structure in mortality rates related to COVID-19 infections. We relate the fatality date of each deceased patient to the corresponding day of registration of the infection, leading to a nowcasting model which allows us to estimate the number of present-day infections that will, at a later date, prove to be fatal. The numbers are broken down to the district level in Germany. Given that death counts generally provide more reliable information on the spread of the disease compared to infection counts, which inevitably depend on testing strategy and capacity, the proposed model and the presented results allow to obtain reliable insight into the current state of the pandemic in Germany. In March 2020, COVID-19 became a global pandemic. From Wuhan, China, the virus spread across the whole world, and with its diffusion, more and more data became available to scientists for analytical purposes. In daily reports, the WHO provides the number of registered infections as well as the daily death toll globally (https://www.who.int/). It is inevitable for the number of registered infections to depend on the testing strategy in each country (see e.g. Cohen and Kupferschmidt, 2020) . This has a direct influence on the number of undetected infections (see e.g. Li et al., 2020) , and first empirical analyses aim to quantify how detected and undetected infections are related (see e.g. Niehus et al., 2020) . Though similar issues with respect to data quality hold for the reported number of fatalities (see e.g. Baud et al., 2020) , the number of deaths can overall be considered a more reliable source of information than the number of registered infections. The results of the "Heinsberg Study" in Germany point in the same direction (Streeck et al., 2020) . A thorough analysis of death counts can in turn generate insights on changes in infections as proposed in Flaxman et al. (2020) (see also Ferguson et al., 2020) . In this paper we pursue the idea of directly modelling registered death counts instead of registered infections. We analyse data from Germany and break down the analyses to a regional level. Such regional view is apparently immensely important, considering the local nature of some of the outbreaks for example in Italy (see e.g. , France (see e.g. Massonnaud et al., 2020) or Spain. The analysis of fatalities has, however, an inevitable time delay, and requires to take the course of the disease into account. A first approach on modelling and analysing the time from illness and onset of symptoms to reporting and further to death is given in Jung et al. (2020) (see also Linton et al., 2020) . Understanding the delay between onset and registration of an infection and, for severe cases, the time between registered infection and death can be of vital importance. Knowledge on those time spans allows us to obtain estimates for the number of infections that are expected to be fatal based on the number of infections registered on the present day. The statistical technique to obtain such estimates is called nowcasting (see e.g. Höhle and an der Heiden, 2014) and traces back to Lawless (1994) . Nowcasting in Covid-19 data analyses is not novel and is for instance used in Günther et al. (2020) for nowcasting daily infection counts, that is to adjust daily reported new infections to include infections which occurred the same day but were not yet reported. We extend this approach to model the delay between the registration date of an infection and its fatal outcome. We therefore analyse the number of fatal cases of Covid-19 infections in Germany using district-level data. The data are provided by the Robert-Koch-Institute (www.rki.de) and give the cumulative number of deaths in different gender and age groups for each of the 412 administrative districts in Germany together with the date of registration of the infection. The data are available in dynamic form through daily downloads of the updated cumulated numbers of deaths. We employ flexible statistical models with smooth components (see e.g. Wood, 2017) assuming a district specific Poisson process. The spatial structure in the death rate is incorporated in two ways. First, we assume a spatial correlation of the number of deaths by including a long-range smooth spatial death intensity. This allows to show that regions of Germany are affected to different extents. On top of this long-range effect we include two types of unstructured region specific effects. An overall region specific effect reflects the situation of a district as a whole, while a short-term effect mirrors region specific variation of fatalities over time and captures local outbreaks as happened in e.g. Heinsberg (North-Rhine-Westphalia) or Tirschenreuth (Bavaria). In addition we include dynamic effects to capture the global changes in the number of fatal infections for Germany over calendar time. This enables us to investigate the impact of certain interventions, such as social distancing, school closure, complete lockdowns and lockdown releases, on the dynamic of the infection and hence on the number of deaths. Modelling infectious diseases is a well developed field in statistics and we refer to Held et al. (2017) for a general overview of the different models. We also refer to the powerful R package surveillance . Since our focus is on analysing the district specific dynamics of fatal infections we here make use of Poisson-based models implemented in the mgcv package in R, which allows to decompose the spatial component in more depth. The paper is organized as follows. In Section 2 we describe the data. Section 3 highlights the results of our analysis. The remaining sections provide the technical material, starting with Section 4 where we motivate the statistical model, which is extended by our nowcasting model in Section 5. Extended results as well as model validation are given in Section 6, while Section 7 concludes the paper. We make use of the COVID-19 dataset provided by the Robert-Koch-Institute for the 412 districts in Germany (which also include the twelve districts of Berlin separately). The data are updated on a daily basis and can be downloaded from the Robert-Koch-Institute's website. We have daily downloads of the data for the time interval from March 27, 2020 until today. The subsequent analysis was conducted on May 14, 2020, and was performed considering only deadly infections with registration dates from March 26, 2020 until May 13, 2020 (the day before the day of analysis). The data contain the newly notified laboratory-confirmed COVID-19 infections and the cumulated number of deaths related to COVID-19 for each district of Germany, classified by gender and age group. Each data entry has a time stamp which corresponds to the registration date of a confirmed Covid-19 infection. This means that the time stamp for a fatal outcome always refers to the registration date and not to the death date. Due to daily downloads of the data we can derive the time point of death (or to be more specific, the time point when the death of a case is included in the database). We obtain the latter by observing a status change from infected to deceased when comparing the data from two consecutive days. The Robert-Koch-Institute collects the data from the district-based health authorities (Gesundheitsämter). Due to different population sizes in the districts and certainly also because of different local situations, some health authorities report the daily numbers to the Robert-Koch-Institute with a delay. This happens in particular over the weekend, a fact that we need to take into account in our model. We refrain from providing general descriptive statistics of the data here, since these numbers can easily be found on the RKI webpage, which also gives a link to a dashboard to visualize the data (see also https://corona.stat.uni-muenchen.de/maps/) Before we discuss our modelling approach in detail, we want to describe our major findings. First, Table 1 shows that age and gender both play a major role when estimating the daily death toll. As is generally known, elderly people exhibit a much higher death rate which is for the age group 80+ around 100 times higher than for people in the age group 35-59. A remarkable difference is also observed between genders, where the expected death rate of females is around 40% (≈ 1 − exp(−0.503)) lower than the death rate for males. Furthermore, we see that significantly less deaths are attributed to infections registered on Sundays compared to weekdays, due to the existing reporting delay during weekends. Our model includes a global smooth time trend representing changes in the death rate since March 26th. This is visualized in Figure 1 . The plotted death rate is scaled to give the expected number of deaths per 100.000 people in an average district for the reference group, i.e. males in the age group 35 -59. Overall, we see a peak in the death rate on April 3rd and a downwards slope till end of April. However, our nowcast reveals that the rate remains constant since beginning of May. Note that this recent development cannot be seen by simply displaying the raw death counts of these days. The nowcasting step inevitably carries statistical uncertainty, which is taken into account in Figure 1 by including best and worst case scenarios. The latter are based on bootstrapped confidence intervals, where details are provided in Section 6.3 later in the paper. Our aim is to investigate spatial variation and regional dynamics. To do so, we combine a global geographic trend for Germany with unstructured region-specific effects, where the latter uncover local behaviour. In Figure 2 we combine these different components and map the fitted nowcasted death counts related to Covid-19 for the different districts of Germany, cumulating over the last seven days before the day of analysis (here May 14, 2020). While in most districts of Germany the death rate is relatively low, some hotspots can be identified. Among those, Traunstein and Rosenheim (in the south-east part of Bavaria) are the most evident, but Greiz and Sonneberg (east and south part of Thuringia) stand out as well, to mention a few. A deeper investigation of the spatial structure is provided in Section 6, where we show the global geographic trend and provide maps that allow to detect new hotspot areas, after correcting for the overall spatial distribution of the infection. On the day of analysis, we do not observe the total counts of deaths for recently registered infections, since not all patients with an ongoing fatal infections have died yet. We therefore nowcast those numbers, i.e. we predict the prospective deaths which can be attributed to all registration dates up to today. This is done on a national level, and the resulting nowcast of fatal infections for Germany is shown in Figure 3 . For example, on May 14, 2020 there are 25 deaths reported where the infection was registered on May 5th (red line on May 5th). We expect this number to increase to about 50 when all deaths due to Covid-19 for this registration date will have been reported (blue line on May 5th). Naturally, the closer a date is to the present, the larger the uncertainty in the nowcast. This is shown by the shaded bands. Details on how the statistical uncertainty has been quantified are provided in Section 5 below. The fit of this model has been incorporated into the district model discussed before, but the nowcast results are interesting in their own right. The curve confirms that the number of fatal infections is decreasing since the beginning of April. Note that the curve also mirrors the "weekend effect" in registration, as less infections are reported on Sundays. Further analyses and a detailed description of the model are given in the following sections. Let Y t,r,g denote the number of daily deaths due to COVID-19 in district/region r and age and gender group g with time point (date of registration) t = 0, . . . , T . Here t = T corresponds to the day of analysis, which is May 14, 2020 and t = 0 corresponds to March 26, 2020. Note that time point t refers to the time point of registration, i.e. the date at which the infection was confirmed. Even though the time point of infection obviously precedes that of death, registration can also occur after death, e.g. when a post mortem test is conducted, or when test results arrive after the patient has passed away. We set the day of death to be equal to the day of registered infections in this case. The majority of fatalities with registered infection at time point t have not yet been observed at time t, as these deaths will occur later. We therefore need a model for nowcasting, which is discussed in the next section. For now we assume all Y t,r,g to be known. We model Y t,r,g as (quasi-)Poisson distributed according to where we specify λ t,r,g through λ t,r,g = exp{(β 0 + age g β age + gender g β gender + weekday t β weekday The linear predictor is composed as follows: • β 0 is the intercept. • β age and β gender are the age and gender related regression coefficients. • β weekday are the weekday-related regression coefficients. • m 1 (t) is an overall smooth time trend, with no prior structure imposed on it. • m 2 (s r ) is a smooth spatial effect, where s r is the geographical centroid of district/region r. • u r0 and u r1 are district/region-specific random effects which are i.i.d. and follow a Normal prior probability model. While u r0 specifies an overall level of in the death rate for district r over the entire observation time, u r1 reveals region specific dynamics by allowing the regional effects to differ for the last 14 days. • pop r,g is the gender and age group-specific population size in district/region r and serves as an offset in our model. We here emphasize that we fit two spatial effects of different types: We model a smooth spatial effect, i.e. m 2 (s r ), which takes the correlation between the death rates of neighbouring districts/regions into account and gives a global overview of the spatial distribution of fatal infections. In addition to that we also have unstructured district/region-specific effects u r = (u r0 , u r1 ) , which capture local behaviour related to single districts only. The district specific effects u r are considered as random with a prior structure for r = 1, . . . , 412. The prior variance matrix Σ u is estimated from the data. The predicted values u r (i.e. the posterior mode) exhibit districts that show unexpectedly high or low death tolls when adjusted for the global spatial structure and for age-and gender-specific population size. Model (1) belongs to the model class of generalized additive mixed model, see e.g. Wood (2017) . The smooth functions are estimated by penalized splines, where the quadratic penalty can be comprehended as a Normal prior (see e.g. Wand, 2003) . The same type of prior structure holds for the region-specific random effects u r . In other words, smooth estimation and random effect estimation can be accommodated in one fitting routine, which is implemented in the R package mgcv. This package has been used to fit the model, so that no extra software implementation was necessary. This demonstrates the practicability of the method. The above model cannot be fitted directly to the available data, since we need to take the course of the disease into account. For a given registration date t, the number of deaths of patients registered as positive on that day, Y t,r,g , may not yet be known, since not all patients with a fatal outcome of the disease have died yet. This requires the implementation of nowcasting. We do this on a national level, and cumulate the numbers over district/region r and gender and age groups g. This allows to drop the corresponding subscripts in the following and we simply notate the cumulated number of deaths with registered infections at day t with Y t . Let N t,d denote the number of deaths reported on day t + d for infections registered on day t. Assuming that the true date of death is at t+d, or at least close to it, we ignore any time delays between time of death and its notification to the health authorities. We call d the duration between the registration date as a Covid-19 patient and the reported day of death, where d = 1, . . . , d max . Here, d max is a fixed reasonable maximum duration, which we set to 30 days (see e.g. Wilson et al., 2020) . The minimum delay is one day. In nowcasting we are interested in the cumulated number of deaths for infections registered on day t, which we define as The total number of deaths with a registered infection at t is apparently unknown at time point t and becomes available only after d max days. In other words, only after d max days we know exactly how many deaths occurred due to an infection which was registered on day t. We define the partial cumulated sum of deaths as On day t = T , when the nowcasting is performed, we are faced with the following data constellation, where NA stands for not (yet) available: We may consider the time span between registered infection and (reported) death as a discrete duration time taking values d = 1, . . . , d max . Let D be the random duration time, which by construction is a multinomial random variable. In principle, for each death we can consider the pairs (D i , t i ) as i.i.d. and we aim to find a suitable regression model for D i given t i , including potential additional covariates x t,d . We make use of the sequential multinomial model (see Agresti, 2010) and define π(d; t, x t,d ) = P (D = d|D ≤ d; t, x t,d ) Let F t (d) denote the corresponding cumulated distribution function of D which relates to probabilities π() through F t (d) = P t (D ≤ d) = P(D ≤ d|D ≤ d + 1) · P (D ≤ d + 1) = (1 − π(d + 1; ·)) · (1 − π(d + 2; ·)) · . . . · (1 − π(d max ; ·)) for d = 1, . . . , d max − 1 and F t (d max ) = 1. The available data on cumulated death counts allow us to estimate the conditional probabilities π(d; ) for d = 2, . . . , d max . In fact, the sequential multinomial model allows to look at binary data such that where • s 1 (t) is an overall smooth time trend over calendar days, • s 2 (d) is a smooth duration effects, capturing the course of the disease, • x t,d are covariates which may be time and duration specific. Assuming that D, the duration between a registered fatal infection and its reported death, is independent of the number of fatal Covid-19 infections, we obtain the relationship Note further that if we model Y t with a quasi-Poisson model as presented in the previous chapter, we have no available observation Y t for time points t > T − d max . Instead, we have observed C t,T −t , which relates to the mean of Y t through (7). Including therefore log F t (T − t) as additional offset in model (2), allows to fit the model as before, but with nowcasted deaths included. That means, instead of λ t,r,g as in (2), the expected death rates are now parametrized by λ t,r,g = λ t,r,g exp(log F t (T −t)), where the latter multiplicative term is included as additional offset in the model. We fit the nowcasting model (5) with parametrization (6). We include a weekday effect for the registration date of the infection with reference category "Monday". The estimates of the fixed linear effects are shown in Table 2 . The fitted smooth effects are shown in Figure 4 , where the top panel shows the effect over calendar time, which is very weak and confirms that the course of the disease hardly varies over time. This shows that the German health care system remained stable over the considered period, and hence survival did not depend on the date on which the infection was notified. The bottom panel of Figure 4 shows the course of the disease as a smooth effect over the time between registration of the infection and death. We see that the probabilities π(d; ·) decrease in d, where this effect is the strongest in the first days after registration. Thus, most of the Covid-19 patients with fatal infections are expected to die not long after their registration date. The effect of d becomes easier to interpret by visualizing the resulting distribution function F t (d). This is shown in Figure 5 for two dates t, i.e.. April 14th and May 13th. The plot also shows how the course of the disease hardly varies over calendar time: In fact, the small differences between the two distribution functions is dominated by the weekday effect, since the red curve is related to a Tuesday while the blue one is from a Wednesday. In Figure 3 above we have shown the nowcasting results along with uncertainty intervals shaded in grey. These were constructed using a bootstrap approach as follows. Given the fitted model, we simulate n = 10 000 times from the asymptotic joint normal distribution (7), where C T −t is the observed partial cumulated sum of deaths at time point T − t. The pointwise lower and upper bounds of the 95% prediction intervals for the nowcast for Y t are then given by the 2.5 and the 97.5 quantiles of the set { Y (i) t , i = 1, . . . , n}, respectively. In Section 3 we presented the fitted death rate, which is the convolution of a smooth spatial effect as well as region specific effects. It is of general interest to disentangle these two spatial components. This is provided by the model. We visualize the fitted global geographic trend Figure 7: Long term Region specific level (left hand side) and short term dynamics (right hand side) of the Covid-19 infections m 2 (·) for Germany in Figure 6 . The plot confirms that up to May 2020 the northern parts of the country are less affected by the disease in comparison to the southern states. The two plots in Figure 7 map the region specific effects, i.e. the predicted long term level of a district u r0 (left hand side) and the predicted short term dynamics u r1 (right hand side). Both plots uncover quite some region-specific variability. In particular, the short term dynamics captured in the right hand side plot (u r1 ) pinpoint districts with unexpectedly high nowcasted death rates in the last two weeks, after correcting for the global geographic trend and the long term effect of the district. Some of the noticeable districts have already been highlighted in Section 3 above, but we can detect further districts, which are less pronounced in Figure 1 . For instance, Steinfurt (in the north-west of North Rhine-Westphalia), Olpe (southern North Rhine-Westphalia) or Gotha (center of Thringen) presently show a high rate of fatal infections. A large number of the registered deaths related to Covid-19 stem from people in the age group 80+. Locally increased numbers are often caused by an outbreak in a retirement home. Such outbreaks apparently have a different effect on the spread of the disease, and the risk of an epidemic infection caused by outbreaks in this age group is limited. Thus, the death rate of people in the age group 80+ could vary differently across districts when compared to regional peaks in the death rate of the rest of the population. In order to respect this, we decompose the district-specific effects u r in (2) into u 80− r = (u 80− r0 , u 80− r1 ) for the age group 80-and u 80+ r = (u 80+ r0 , u 80+ r1 ) for the age group 80+, where the age group 80-consists of the aggregated age groups 15-34, 35-59 and 60-79. We put the same prior assumption on the random effects as we did in (3), but now the variance matrix that needs to be estimated from the data has dimension 4 by 4. The fitted age group-specific random effects are shown in Figure 8 , where the u 80− r are shown in the top panel and the u 80+ r in the bottom panel. Most evidently, the variation of the random effects is much higher in the age group 80+ when compared to the younger age groups, as more districts occur which are coloured dark blue or dark red, respectively. When comparing the district-specific short term dynamics of the last 14 days (u r1 ) in Figure 8 to those in Figure 7 , we recognize that in most of the districts which recently experienced very high death intensities (with respect to the whole period of analysis), these stem from the age group 80+. As mentioned before, this can often be explained by outbreaks in retirement homes. When fitting the mortality model (1) we included the fitted nowcast model as offset parameter. This apparently neglects the estimation variability in the nowcasting model, which we explored via bootstrap as explained in Section 5.3 and visualized in Figure 3 . In order to also incorporate this uncertainty in the fit of the mortality model, we refitted the model using (a) the upper end and (b) the lower end of the prediction intervals shown in Figure 3 . It appears that there is little (and hardly any visible) effect on the spatial components, which is therefore not shown here. But the time trend shown in Figure 1 does change, which is visualized by including the two fitted functions corresponding to the 2.5% and 97.5% quantile of the offset function. We can see that the estimated uncertainty of the nowcast model mostly affects the last ten days, with a strong potential increase in the death rate mirroring a possible worst case scenario. In Figure 9 we show a normal QQ-plot of the Pearson residuals in the nowcasting model. Apart from some observations in the lower tail, the Pearson residuals are distributed very closely to a standard normal distribution when considering the estimate φ = 1.766 of the dispersion parameter in the quasi-poisson model (7). Overall, the model seems to fit to the available data quite well. The paper presents a model to monitor the dynamic behaviour of Covid-19 infections based on death counts. It is important to highlight that the proposed model makes no use of new infection numbers, but only of observed deaths related to Covid-19. This in turn means that the results are less dependent on testing strategies. The nowcasting approach enables us to estimate the number of deaths following a registered infection today, even if the fatal outcome has not occurred yet. Moreover, the district level modelling uncovers hotspots, which are salient exclusively through increased death rates. A differential analysis of the number of current fatal infections on a regional level allows to draw conclusions on the current dynamics of the disease assuming a constant case fatality rate, i.e. a stable proportion of death compared to the true number of infections when adjusting for age and gender. A natural next step would now be to consider the nowcasted deaths in relation to the number of newly registered infections, which is, in contrast, highly dependent on both testing strategy and capacity. We consider this as future research, and the proposed model allows us to explore data in this direction. This might ultimately help us in shedding light on the relationship between registered and undetected infections as well as on the effectiveness of different testing strategies. There are several limitations to this study which we want to address as well. First and utmost, even though death counts are, with respect to cases counts, less dependent on testing strategies, they are not completely independent from them. This applies in particular to the handling of post-mortem tests. We therefore do not claim that our analysis of death counts is completely unaffected by testing strategies. Secondly, a fundamental assumption in the model is the independence between the course of the disease and the number of infections. Overall, if the local health systems have sufficient capacity and triage can be avoided, this assumption seems plausible, but it is difficult or even impossible to prove the assumption formally. Finally, the nowcasting itself is not carried out on a regional level, though the model focuses on regional aspects of the pandemic. While it would be desirable to fit the nowcast model regionally, the limited amount of data simply prevents us from extending the model in this direction. Analysis of Ordinal Categorical Data Real estimates of mortality following covid-19 infection Countries test tactics in 'war' against covid-19 Estimating the number of infections and the impact of non-pharmaceutical interventions on covid-19 in 11 european countries Critical care utilization for the covid-19 outbreak in lombardy, italy: Early experience and forecast during an emergency response Baseline characteristics and outcomes of 1591 patients infected with sars-cov-2 Nowcasting the covid-19 pandemic in bavaria Probabilistic forecasting in infectious disease epidemiology: The 13th armitage lecture Bayesian nowcasting during the stec o104:h4 outbreak in germany Real-time estimation of the risk of death from novel coronavirus (covid-19) infection: Inference using exported cases Adjustment for reporting delays and the prediction of occurred but not reported events Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (sars-cov2) Incubation period and other epidemiological characteristics of 2019 novel coronavirus infections with right truncation: A statistical analysis of publicly available case data Covid-19: Forecasting short term hospital needs in france. medRxiv Spatio-temporal analysis of epidemic phenomena using the r package surveillance Quantifying bias of covid-19 prevalence and severity estimates in wuhan, china that depend on reported cases in international travelers Infection fatality rate of sars-cov-2 infection in a german community with a super-spreading event Smoothing and mixed models Case-fatality risk estimates for covid-19 calculated by using a lag time for fatality Generalized additive models: an introduction with R We want to thank Maximilian Weigert and Andreas Bender for introducing us to the art of producing geographic maps with R. Moreover, we would like to thank all members of the Corona Data Analysis Group (CoDAG) at LMU Munich for fruitful discussions.