key: cord-0943810-ko5h6kup authors: Schneble, Marc; De Nicola, Giacomo; Kauermann, Göran; Berger, Ursula title: A statistical model for the dynamics of COVID‐19 infections and their case detection ratio in 2020 date: 2021-08-10 journal: Biom J DOI: 10.1002/bimj.202100125 sha: a1450a4d022232e9195265bb408d9d2dc3355c1b doc_id: 943810 cord_uid: ko5h6kup
The case detection ratio of coronavirus disease 2019 (COVID‐19) infections varies over time due to changing testing capacities, different testing strategies, and the evolving underlying number of infections itself. This note shows a way of quantifying these dynamics by jointly modeling the reported number of detected COVID‐19 infections with nonfatal and fatal outcomes. The proposed methodology also allows exploring the temporal development of the actual number of infections, both detected and undetected, thereby shedding light on the infection dynamics. We exemplify our approach by analyzing German data from 2020, making use only of data available since the beginning of the pandemic. Our modeling approach can be used to quantify the effect of different testing strategies, visualize the dynamics in the case detection ratio over time, and obtain information about the underlying true infection numbers, thus enabling us to get a clearer picture of the course of the COVID‐19 pandemic in 2020.
rates (Flaxman et al., 2020). Moreover, Aspelund et al. (2020) used Bayes arguments applied to testing data from Iceland to estimate the CDR in the order of 7-11% at the beginning of the pandemic, and in the order of 10-20% after that. The argument is based on relating the number of tests and the share of positive tests. A similar approach has been pursued making use of Canadian data (Benatia et al., 2020). The problem of estimating the true numbers of COVID-19 infections has also been discussed from a purely statistical point of view, where the CDR was related to the fatality ratio (Manski & Molinari, 2020).
A capture-recapture approach to estimate the total number of COVID-19 cases was proposed by Böhning et al. (2020) and Rocchetti et al. (2020), where the latter derive an upper bound for the cumulative number in mid-April for 10 European countries. The ratio of the upper bound and the observed number of cases ranges from around 4 (Greece) to around 8 (France). The capture-recapture method makes use only of publicly available data on COVID-19 cases and deaths, which also holds for the method that we present in this note. Here, we assume that the number of infected can be split into detected and undetected infections. In SIDARTHE models (Giordano et al., 2020), there is an additional distinction into either asymptomatic or symptomatic cases, which we ignore here since the database that we use does not reliably contain these numbers. However, it should be noted that pre- and asymptomatic individuals have a significant impact on the spread of a pandemic disease, especially in the younger population (Stella et al., 2020). Thereby, presymptomatic individuals play a more significant role than asymptomatic ones (Buitrago-Garcia et al., 2020). Nonetheless, the number of asymptomatic cases can reduce the reproduction number of a disease because a background immunity is established, as shown for influenza transmission (Mathews et al., 2007). Overall, underreporting appears to be an overarching problem, which plays a central role when estimating the CDR for COVID-19 (Russell et al., 2020). The importance of assessing the detection ratio and its effect on predictions of future infections has been demonstrated in mathematical simulation studies (Fuhrmann & Barbarossa, 2020). In this context, different national underreporting ratios have been compared (e.g., Rahmandad et al., 2020, or Jagodnik et al., 2020) and a general discussion and survey on assessing the infection fatality ratio (IFR) was conducted (Levin et al., 2020).
In general, it is clear that the CDR changes greatly over time depending on testing strategy and capacities, which vary over time and across different regions. In Germany, the number of tests has increased considerably since the pandemic outbreak in March 2020. The testing strategy has also been adjusted several times: In the beginning, mainly individuals with symptoms were tested, whereas in later phases, a very high number of tests were performed on travelers returning from foreign countries and contact persons of COVID-19-positive individuals. In this note, we explore the dynamics in the CDR using publicly available registry data on COVID-19 infections in Germany from March to December 2020 provided by the Robert-Koch-Institute (RKI). It is important to mention that during the first months of the pandemic in Germany, no mass or systematic testing of the population took place. Our model therefore only makes use of a limited amount of information. We propose to jointly model fatal and nonfatal infections using a dynamic generalized linear mixed model with smooth random effects (see, e.g., Durbán et al., 2005; Durban & Aguilera-Morillo, 2017; Wood, 2017). The major advantage of our approach is that it only relies on the assumption that age-specific COVID-19 fatality ratios, while unknown, have not substantially changed over time. Whether this assumption is valid is currently under discussion (Harris, 2020; Kip et al., 2020), and the possibility of differing fatality ratios in the second wave has been considered as well (Aspelund et al., 2020; Kenyon, 2020). To assess the impact of this assumption on our results, we provide sensitivity analyses and a simulation study in the Supporting Information, which demonstrate that our approach is sufficiently robust if there is no abrupt change in the infection fatality ratio. Overall, our approach allows investigating the following.
First, we explore how the case detection ratio has changed over time, how it varies among different age groups, and if and how it changes in different regions of Germany, depending on infection dynamics and different testing strategies. Second, the model also provides an estimate of the dynamics in the true number of infections, regardless of whether they have been detected or not. All in all, this provides insight into the course of the COVID-19 pandemic, built exclusively on registry data. The remainder of the paper is structured as follows. We describe the data constellation in depth in Section 2, and we propose our model in Section 3. In Section 4, we show the results of our analyses and provide extensive interpretations, whereas Section 5 concludes the paper with some implications and limitations of our study.
We make use of COVID-19 data openly provided by the RKI, the German federal government agency and scientific institute responsible for health reporting, disease control, and prevention in humans (Esri Deutschland GmbH, 2020). The data, exemplified in Table 1, contain cumulated counts of newly registered, laboratory-confirmed COVID-19 cases in Germany for each calendar day stratified by age group (0-4, 5-14, 15-34, 35-59, 60-79, or 80+ years), gender (male/female), and district (412 in total). Furthermore, for all registration dates and strata, the number of deaths associated with COVID-19 transmitted to the RKI by the local health authorities of the respective district is recorded. Note that the date of death is not provided, but for each death, we have the date when the infection was detected and confirmed by a (PCR) test. The database of the RKI is updated every morning with the new numbers transmitted to it from the local health authorities.
Table 1: Illustration of the data structure. To facilitate reproducibility, the original column names used in the RKI dataset are given in brackets below our English notation.
In this study, we only consider data entries with registration dates ranging from calendar week (CW) 10 (beginning of March) to CW 53 (end of December) of the year 2020. For earlier weeks, the number of positive tests was not large enough to draw conclusive results. On the other hand, the German vaccination campaign started at the very end of 2020. As this increasingly reduces the IFR, we only include infections that were registered in 2020. Consequently, the final outcome of almost all of these infections is known today. Moreover, although the data are given at a daily resolution, we aggregate them into weekly data, which renders reporting delays occurring over the weekends and weekly reporting cycles irrelevant to our analysis, leading to more stable results. Since barely any fatalities have been recorded for children aged 14 years and younger, we excluded these age groups from our analysis. To give a first insight into the data at hand, we plot in Figure 1 the raw numbers of cases reported by the official health authorities over time together with the raw number of fatalities stratified by age group. This is shown in the top four plots on a log-scale. Both the number of registered cases and that of fatal cases (indexed by registration date of the infection, and not by day of death) peak in CW 13 for the two younger age groups and in CW 14 for the two oldest age groups, respectively. Over the following weeks, these numbers decrease. The small peak in CW 25 was caused by an outbreak in the district of Gütersloh, which is explored in more depth later on in the paper. From CW 28 onward, we again see an exponential increase of registered cases, whereas the numbers of registered fatal cases only start to rise 7 weeks later, also exponentially. By the end of the year 2020, we see a slight decrease in registered infections.
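The weekly aggregation and the exclusion of the two youngest age groups described above can be sketched as follows. This is a minimal illustration in Python, not the code used for the analysis. The record layout (registration date, age group, cases, deaths), the function name `aggregate_weekly`, and the use of ISO calendar weeks are assumptions made for the example; they are not the RKI column names, and the toy counts are invented.

```python
from collections import defaultdict
from datetime import date

# Age groups excluded from the analysis (barely any recorded fatalities).
EXCLUDED_AGE_GROUPS = {"0-4", "5-14"}

def aggregate_weekly(records, first_week=10, last_week=53, year=2020):
    """Sum daily counts into (ISO week, age group) cells.

    `records` is an iterable of (registration_date, age_group, cases, deaths)
    tuples; this layout is hypothetical and chosen for illustration only.
    """
    totals = defaultdict(lambda: [0, 0])
    for reg_date, age_group, cases, deaths in records:
        iso_year, iso_week, _ = reg_date.isocalendar()
        # Keep only registrations inside the study window.
        if iso_year != year or not (first_week <= iso_week <= last_week):
            continue
        if age_group in EXCLUDED_AGE_GROUPS:
            continue
        cell = totals[(iso_week, age_group)]
        cell[0] += cases
        cell[1] += deaths
    return {key: tuple(value) for key, value in totals.items()}

# Toy input: counts are invented for demonstration, not RKI data.
toy = [
    (date(2020, 3, 2), "80+", 3, 1),    # Monday of ISO week 10
    (date(2020, 3, 8), "80+", 2, 0),    # Sunday of ISO week 10
    (date(2020, 3, 9), "80+", 4, 1),    # Monday of ISO week 11
    (date(2020, 3, 9), "5-14", 5, 0),   # excluded age group
    (date(2020, 2, 24), "80+", 1, 0),   # ISO week 9, before the study window
]
weekly = aggregate_weekly(toy)
```

Summing within a week removes the day-of-week pattern from the counts, which is the reason given above for working at weekly resolution.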
The raw case fatality ratio, calculated as the ratio of fatal cases over total registered cases, stratified by age group, is shown at the bottom of Figure 1. The raw case fatality ratio for the age group 80+ generally dropped from CW 10 onward and fluctuated mostly between 10% and 15% from week 25 onward. However, since CW 40 the case fatality ratio in this age group steadily climbed up to more than 20%. For the age group 60-79, the case fatality ratio peaked in CW 16 and gradually decreased to 2.5%. Here, we also observe a steady increase toward the end of 2020, which results in more than a doubling of the case fatality ratio within 10 weeks. All other age groups exhibit relatively low raw case fatality ratios throughout. Note that the raw data do not contain undetected cases, and therefore cannot provide a complete picture of the actual infection numbers, nor do these plots provide any information about the CDR. In the following, we develop a statistical model that enables us to estimate the relative changes in the CDR and the true infection numbers over time.
Figure 1: Absolute numbers on a log-scale stratified by age group. Bottom figure: Case fatality ratios (= fatal cases / registered cases) stratified by age group.
When describing the dynamics of the COVID-19 pandemic, the number of interest is the true count of newly infected persons in a cohort, which shall be denoted by $I_t$ for week $t = 1, \dots, T$. Note that $I_t$ remains unobservable. However, the number can be decomposed into the number of detected and reported cases $C_t$ and the unknown number of newly infected persons, who have not been tested and remain undetected, which we can call the "dark number," $U_t$. Hence, we have $I_t = C_t + U_t$, and $C_t / I_t$ defines the CDR, which, however, remains unknown due to $U_t$ being unknown. Note that the index $t$ indicates the time point on which the infection took place, which is usually unknown. The infection is eventually detected through a positive test at a later time point $\tilde{t} = t + d$. As $d$ is often unknown, in particular, if the spread of the disease is diffuse, we will conceptually omit $d$ in the following, which means that we set $t$ equal to the registration date when an infection is confirmed through a test. This time point is the registration date described in the previous section. Generally, this approach is justifiable for COVID-19 infections because the range of delay is small compared to the time range of our data analysis (Mallett et al., 2020).
From today's perspective, we have uncensored knowledge on the outcomes of all reported cases $C_t$. That is, we know if they ended fatally or if they recovered. Consequently, the reported cases are composed of recovered (nonfatal) outcomes $R_t$ and fatal outcomes $D_t$, that is, $C_t = R_t + D_t$. Given this, the total number of infected persons splits into $I_t = R_t + D_t + U_t$. The expected number of reported fatal cases $D_t$ as well as the expected number of recovered cases $R_t$ are fractions of the total number of infections $I_t$. This leads to
$$E(D_t \mid I_t) = \delta I_t \quad \text{and} \quad E(R_t \mid I_t) = \rho_t I_t, \tag{1}$$
where $0 < (\delta + \rho_t) < 1$. Here, quantity $\delta$ defines the infection fatality ratio (IFR), whereas $\rho_t$ is the CDR of nonfatal (recovered) infections. Note that these nonfatal infections also include mild and symptom-free cases. Thus, if testing capacities are increased or the testing strategy is changed, $\rho_t$ will change as well, which is incorporated in the notation by time index $t$. In contrast, the IFR $\delta$ will be assumed to remain constant over time. This can be justified by the fact that fatal cases, due to their severeness, are likely to be detected independently of any testing policy. This also includes, to some extent, postmortem tests. With this notation, we obtain the time-dependent case detection ratio $\mathrm{CDR}_t = \delta + \rho_t$. Note that for the dark number, that is, the latent number of undetected infections $U_t$, it holds that $E(U_t \mid I_t) = (1 - \mathrm{CDR}_t) I_t$.
It would, of course, be favorable to estimate the number of undetected infections via estimation of $\delta$ and $\rho_t$. However, when only the reported fatal and nonfatal cases $D_t$ and $R_t$ are known, these two ratios cannot be estimated due to nonidentifiability issues, which we will demonstrate below. Nonetheless, with the data at hand, we are able to estimate the ratio $\rho_t / \delta$. To see this, we rewrite the above model in an equivalent form by defining a binary covariate $x \in \{0, 1\}$ and by specifying the response variable through
$$Y_{t,x} = \begin{cases} D_t & \text{if } x = 0, \\ R_t & \text{if } x = 1. \end{cases}$$
This notational trick allows us to rewrite the above relations (1) as a regression model
$$E(Y_{t,0} \mid I_t) = \exp(a + u_t), \tag{2}$$
$$E(Y_{t,1} \mid I_t) = \exp(b_t + u_t), \tag{3}$$
where $a = \log(\delta)$, $b_t = \log(\rho_t)$, and $u_t = \log(I_t)$. Equations (2) and (3) can, in turn, be summarized into a single regression model formula
$$E(Y_{t,x} \mid I_t) = \exp\bigl(a + x\,(b_t - a) + u_t\bigr). \tag{4}$$
Note that $I_t$ and hence $u_t = \log(I_t)$ remain unobserved. We employ a Bayesian view and model $u_t$ as normally distributed random effects $u_t \sim N(\mu, \sigma^2)$. Still, the parameters in model (4) are not identifiable, because any shift in $u_t$ and a matching negative shift in $a$ and $b_t$, respectively, results in the same model. This demonstrates the identifiability problem, which we have mentioned above. Hence, we are neither able to estimate the fatality ratio $\delta = \exp(a)$ nor the time-dependent ratio $\rho_t = \exp(b_t)$ with the data at hand. However, we can shift $u_t$ by a constant $c$ such that the integral of $\tilde{u}_t = u_t - c$ is equal to zero and define the global intercept $\beta_0 = a + c$, which allows to rewrite (4) in an identifiable form (see Wood, 2017) to obtain the final regression model
$$E(Y_{t,x} \mid \tilde{u}_t, x) = \exp(x\,\theta_t + \beta_0 + \tilde{u}_t) \quad \text{and} \quad \tilde{u}_t \sim N(\tilde{\mu}, \sigma^2) \quad \text{for } t = 1, \dots, T, \tag{5}$$
where $\tilde{\mu} = \mu - c$, $\theta_t = b_t - a$, and $\exp(\theta_t) = \rho_t / \delta$. With this model, we can now explore the dynamics in the CDR. For two different time points $t_1$ and $t_2$, we have using the small $o(\cdot)$ notation
$$\theta_{t_2} - \theta_{t_1} = \log(\rho_{t_2}) - \log(\rho_{t_1}) = \log(\mathrm{CDR}_{t_2} - \delta) - \log(\mathrm{CDR}_{t_1} - \delta) = \log(\mathrm{CDR}_{t_2}) - \log(\mathrm{CDR}_{t_1}) + o(1). \tag{6}$$
The latter approximation in (6) holds as long as the fatality rate is small, which holds for COVID-19. Consequently, $\theta_{t_2} - \theta_{t_1}$ can serve as a proxy for $\log(\mathrm{CDR}_{t_2}) - \log(\mathrm{CDR}_{t_1})$, and $\exp(\theta_{t_2} - \theta_{t_1})$ is a proxy for the relative change in the case detection ratio $\mathrm{CDR}_{t_2} / \mathrm{CDR}_{t_1}$.
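To get a feeling for how close this proxy is, one can compare the change in the identifiable coefficient (the log of the nonfatal detection ratio divided by the IFR) with the exact change in the log CDR. The sketch below is plain Python arithmetic under invented inputs: an IFR of 1% and a nonfatal detection ratio rising from 10% to 30% are assumptions for illustration, not estimates from the paper, and the function names are hypothetical.

```python
import math

def log_ratio_nonfatal_to_fatal(rho, ifr):
    """Identifiable coefficient: log(nonfatal detection ratio / IFR)."""
    return math.log(rho / ifr)

def proxy_vs_exact(rho_1, rho_2, ifr):
    """Return (proxy, exact) log-changes in the CDR between two time points.

    proxy: difference of the identifiable coefficients
    exact: log(CDR_2) - log(CDR_1), with CDR = IFR + rho
    """
    proxy = (log_ratio_nonfatal_to_fatal(rho_2, ifr)
             - log_ratio_nonfatal_to_fatal(rho_1, ifr))
    exact = math.log(ifr + rho_2) - math.log(ifr + rho_1)
    return proxy, exact

# Hypothetical values: IFR of 1%, nonfatal detection ratio from 10% to 30%.
proxy, exact = proxy_vs_exact(rho_1=0.10, rho_2=0.30, ifr=0.01)
gap = abs(proxy - exact)

# The same comparison with a ten times smaller IFR.
proxy_small, exact_small = proxy_vs_exact(rho_1=0.10, rho_2=0.30, ifr=0.001)
gap_small = abs(proxy_small - exact_small)
```

The IFR cancels in the proxy, so it equals the log-change in the nonfatal detection ratio alone. The gap to the exact log-change in the CDR shrinks as the IFR becomes small relative to the detection ratio, which is the condition stated above.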
Based on these considerations, we see that it is necessary to model the dynamics in time more appropriately to derive stable estimates for the CDR. It is natural to assume that changes in the CDR over time do not occur suddenly but gradually. For instance, test capacities are slowly increased and test strategies are gradually changed. To accommodate this in our model (5), we fit the coefficient $\theta_t$ by a smooth function in time leading to a time-varying coefficient model (Hastie & Tibshirani, 1993). We also induce smooth dynamics on the random component, leading to a time-varying random effect (Durban & Aguilera-Morillo, 2017). These modifications lead to an identifiable and dynamic mixed regression model, for which we use a negative-binomial distribution for the response $Y_{t,x}$ with a constant dispersion factor. The entire model can be fitted with standard software: All of our analyses were performed in R (R Core Team, 2013) and the dynamic mixed regression model is fitted using the R-package mgcv (Wood, 2017). We apply this modeling approach using the reported data from CW 10 (beginning of March) up to CW 53 (final week of 2020), stratified by different age groups, to visualize the dynamics in the real infection numbers and the CDR from the beginning of the pandemic up to the beginning of the second wave. To assess the robustness of the approach concerning the assumption of time-constant and age-specific fatality ratios, we also refit the model when subdividing the data into different time frames. The results of this analysis are shown in the Supporting Information. As the IFR depends on age, we fit separate models for each of the relevant age groups defined by the RKI, that is, 15-34, 35-59, 60-79, and 80+ years. The dynamics in the true infection numbers on the log-scale, represented by the fitted smooth dynamic random effects $\tilde{u}_t$, are displayed in Figure 2. These curves mirror the relative change in the actual number of infected (detected and undetected) over time.
Note that the absolute numbers cannot be interpreted on their own due to the mentioned identifiability issues. We therefore shift the curves such that $\tilde{u}_{\mathrm{CW10}} = 0$. We can see that the relative course of the pandemic was very similar across all age groups, where a peak is reached around CW 14. However, the peak for the younger age groups is estimated to be around 1 week earlier than for the older age groups, that is, in CW 13. An explanation for this finding is that the younger age groups have been more affected by the lockdown, which started in Germany in CW 12. Looking at the difference between the maximum $\max_t \tilde{u}_t$ and the minimum of $\tilde{u}_t$ during the summer months, that is, $\min_{20 \le t \le 40} \tilde{u}_t$, we see that this difference increases with age, that is, the relative decline in true infection numbers after the first wave and the relative increase toward the second wave, respectively, was less pronounced in the younger age groups. Also eye-catching is the increase in infections around CW 25 for people below 60 years of age. This is the aforementioned outbreak in the district of Gütersloh, which occurred in an industrial slaughterhouse and has mainly affected people of working age. From CW 35 (end of August), all curves start rising steadily, where the steepest rise is seen for the oldest age group, whereas the rise is flatter for the younger age group. This shows that the second wave of the pandemic had already begun around CW 35. Moreover, Figure 2 shows that in all age groups but the youngest one, the peak of the second wave has surpassed the peak of the first wave. Next, we look at the dynamics in the CDR. Figure 3 shows the fitted time-varying coefficients $\theta_t$ together with corresponding 95% confidence bands. Again, the absolute level is not identifiable, so these curves are normalized such that $\theta_{\mathrm{CW10}} = 0$. Hence, the function values on the exp-scale (right y-axes) give the relative change in the CDR with respect to CW 10.
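The normalization to CW 10 and the reading of the exp-scale axis can be illustrated with a few lines of Python. This is only a sketch of the bookkeeping, not of the model fit: the curve values are invented, and `normalize_to_reference` and `relative_change` are hypothetical helper names.

```python
import math

def normalize_to_reference(curve, reference_week=10):
    """Shift a log-scale curve so that its value in the reference week is 0."""
    offset = curve[reference_week]
    return {week: value - offset for week, value in curve.items()}

def relative_change(normalized_curve):
    """Map a normalized log-scale curve to relative changes on the exp-scale."""
    return {week: math.exp(value) for week, value in normalized_curve.items()}

# Invented log-scale values of a time-varying coefficient for three weeks.
curve = {10: 2.0, 16: 2.0 + math.log(0.5), 33: 2.0 + math.log(4.0)}
relative = relative_change(normalize_to_reference(curve))
```

Because only differences on the log-scale are identifiable, subtracting the CW 10 value loses no information; after exponentiation, the reference week maps to 1 and every other week to a multiple of the CW 10 level.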
The CDR in the age group 80+ has risen monotonically since the beginning of the pandemic up to CW 33, where our model estimates the CDR to be more than four times higher than in mid-March. Note that in later weeks, the CDR among the elderly decreased again to the level of April/May. In contrast, for people aged 60-79, the CDR first dropped by about 70%, reaching its bottom as the pandemic passed its peak in Germany in CW 16. We subsequently see a monotonic increase, with the CDR becoming 1.5 times higher compared to the beginning of the pandemic. However, in this age group, the CDR has been more than halved again from CW 40 up to the end of 2020. The dynamics in the CDR in the population aged 35-59 years are similar to those of the 60-79 year olds: After a drop during March and April (CW 10-CW 16), the CDR increases, in mid-September, to nearly three times what it was in CW 10. For the youngest age group (aged 15-34), we also see a rise in the CDR over time, which seems substantial. However, the confidence bands in this age group are relatively wide because this age group is not as prone to fatal outcomes as older age groups. For the population aged 80 years and older, the CDR had increased until late summer, when it started to stagnate before slightly decreasing again. As the CDR can be at most 100%, and given that the relative change in this age group was about as high as a factor of 4 in CW 33 compared to March, we can conclude that at the beginning of the pandemic, the CDR among the population of 80 years and older could not have been more than 25%. Moreover, considering the relative change in the CDR, we can adjust the numbers from the peak in the first wave to be comparable, for example, to the numbers in week 40. To exemplify this, note that in week 40, the CDR for the age group 80+ was 2.3 times higher than in CW 15, at the peak of the first wave. This ratio results from the plot in Figure 3 in this age group. In CW 15, the number of registered cases per 100,000 in this age group was 80.
However, in week 15, the CDR was much lower than in CW 40, and thus, we would have seen 2.3 ⋅ 80 = 184 cases per 100,000 in the age group 80+ if we had had the same CDR in CW 15 as in CW 40. For the population aged 60-79 years, the CDR between the minimum in CW 16 and its maximum in calendar week 34 changed by a factor of around 5. From this, we can deduce that around the peak of the first wave in Germany, at most 20% of the infections were detected, whereas at least 80% remained unseen. To be able to compare numbers from the first wave to those in autumn, we apply a similar calculation as above. This results in an estimated number of at least 5 ⋅ 17 = 85 cases per 100,000, where only 16 cases per 100,000 have been observed in CW 16. In the age group 35-59, the relative change of the CDR between the minimum in CW 16 and the maximum in CW 36 was as high as a factor of 5 as well. Again, the same calculation shows that the 22 detected infections per 100,000 in week 16 would increase to 5 ⋅ 22 = 110 cases per 100,000 if we had had the same CDR in week 16 as in week 36. A general question in the pandemic is whether extensive testing leads to a high CDR. Applying our model to regional data allows us to investigate this question. The Supporting Information compares separate model fits for the two most populous German states, North-Rhine-Westphalia and Bavaria. The two states implemented different testing strategies over the summer months. Whereas public test stations were opened in Bavaria in summer, particularly at the borders on the motorways, such fine-grained screening of holiday returnees was not pursued in North-Rhine-Westphalia. Our model allows assessing and, in particular, quantifying how such different testing strategies lead to different CDRs in these two regions. The results quantify by how much the dark figure was reduced in connection with the Bavarian testing strategy.
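The two back-of-the-envelope calculations used in this section (the upper bound on the early CDR and the adjustment of first-wave case numbers) reduce to one division and one multiplication. The following Python sketch reproduces the figures quoted in the text for the age groups 80+ and 35-59; the function names are hypothetical, and the inputs are the rounded factors and incidences stated above, not model output.

```python
def max_cdr_at_reference(relative_change_factor):
    """Upper bound on the CDR at the reference time, given that the CDR later
    rose by `relative_change_factor` and can never exceed 100%."""
    if relative_change_factor <= 0:
        raise ValueError("relative change must be positive")
    return min(1.0, 1.0 / relative_change_factor)

def adjusted_incidence(observed_per_100k, relative_change_factor):
    """Cases per 100,000 that would have been registered under the later CDR."""
    return observed_per_100k * relative_change_factor

# Age group 80+: factor of about 4 between March and CW 33.
bound_80_plus = max_cdr_at_reference(4)

# Age group 80+: 80 registered cases per 100,000 in CW 15, factor 2.3 to CW 40.
adjusted_80_plus = adjusted_incidence(80, 2.3)

# Age group 35-59: 22 registered cases per 100,000 in CW 16, factor 5 to CW 36.
adjusted_35_59 = adjusted_incidence(22, 5)
```

The bound follows because a CDR that later grew by a given factor and still stayed at or below 100% cannot have started above the reciprocal of that factor.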
Raw reported case numbers and measures derived from them, such as the case fatality ratio, are prone to changes in testing strategies and test capacities, which also influence the CDR. Comparisons between raw case numbers over time therefore need to be interpreted with care. The case-fatality ratio, calculated from the raw number of reported deaths related to COVID-19 divided by reported cases, is also impaired because deaths occur with a time delay after registration, meaning that deaths registered today correspond to infections that have been reported up to several weeks ago. Our method allows us to uncover relative changes in the CDR over different pandemic phases. Moreover, by shedding light on the number of undetected cases, we can describe the dynamics in the true number of COVID-19 infections for Germany from March 2020 until December 2020. The approach is based on publicly available data on registered cases and does not rely on simulations or additional survey data. We make use of the fact that, for each fatal outcome, the registration date of the infection is included in the data. This allows us to jointly model the number of registered nonfatal cases and that of fatal infections in a dynamic mixed model, leading to an assessment of the dynamics taking place in real infection numbers. Based on the available information on the relative change in the CDR over time, we are able to compare numbers from the first wave of the pandemic in spring with numbers from the second wave in autumn, adjusting for the difference in the proportion of undetected cases. A general limitation of our approach is that it suffers from an identifiability issue and hence does not derive absolute values of the CDR. One may, however, combine our results with findings from seroepidemiological studies, which aim to assess the prevalence of COVID-19 in the general population by screening a representative sample. 
A list of current seroepidemiological studies in Germany is provided by the RKI (Robert-Koch-Institute, 2020). Although these studies provide crucial information on the current situation of the spread of the disease, they can only give a snapshot of the instantaneous situation when the study was conducted. With the knowledge of the dynamics in new infections given by our approach, the findings of such studies can be used to estimate the situation at other time points. For example, we look at the Prospective Covid-19 Cohort Study Munich (KoCo19, Radon et al., 2020). They report a CDR of about 25%, where the survey was run between May and June 2020 in the city of Munich. We can deduce the CDR for October to be about three times higher for the 35-59 age group. More precise calculations would require age-specific numbers in the study as well as a regional refit of our model. A nationwide seroprevalence study was conducted between the beginning of July and mid-August of 2020, which yielded a CDR of around 55% in the adult population (ifo Institut & forsa, 2020). Nonetheless, the authors admit that the fading of COVID-19 antibodies could influence their findings sometime after the infection. A seroprevalence study, which is also nationwide but on a larger scale, is currently being carried out, but the results are not yet available. In principle, however, this demonstrates that the combination of seroepidemiological studies and our approach allows obtaining estimates for absolute numbers of the CDR instead of relative comparisons only. A critical assumption of our model is that we assume the IFR to be constant over time for a given age group and negligibly small compared to the detection ratio of nonfatal cases. The latter is certainly valid for the numbers we looked at. Staerk et al. (2021) show that most of the dynamics in the effective IFR of the German population can be explained by the varying age distribution of COVID-19 cases.
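The way a seroprevalence snapshot can anchor the relative changes, as in the KoCo19 example above, amounts to scaling the survey-based CDR by the model's relative change and capping the result at 100%. The Python sketch below uses the rounded figures from the text (a CDR of about 25% and a factor of about three); the function name `absolute_cdr` is hypothetical, and the calculation ignores the caveats named above, namely that the survey is not age-specific and the model is not refitted regionally.

```python
def absolute_cdr(anchor_cdr, relative_change_since_anchor):
    """Scale a survey-based CDR by the model's relative change, capped at 100%."""
    if not 0.0 < anchor_cdr <= 1.0:
        raise ValueError("anchor CDR must lie in (0, 1]")
    if relative_change_since_anchor <= 0:
        raise ValueError("relative change must be positive")
    return min(1.0, anchor_cdr * relative_change_since_anchor)

# Rounded figures from the text: CDR of about 25% in May/June,
# about three times higher in October for the 35-59 age group.
cdr_october = absolute_cdr(0.25, 3.0)
```

The cap matters whenever the product exceeds one, which would signal that the anchor value and the estimated relative change are not compatible with each other.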
As the age distribution within the RKI age categories varies as well, the IFR within each age group might slightly change over time; this, however, occurs not abruptly but smoothly. The sensitivity analysis, which can be found in the Supporting Information, provides evidence that our assumption of the IFR being constant is, for the most part, fulfilled. With increasing vaccination levels in the population starting from January 2021, the assumption of a constant infection fatality ratio becomes invalid. This eventually prevents the application of our model to later stages of the pandemic.
The authors have declared no conflict of interest.
The data that support the findings of this study are openly available at https://www.arcgis.com/home/item.html?id=f10774f1c63e40168479a1feb6c7ca74.
This article has earned an Open Data badge for making publicly available the digitally-shareable data necessary to reproduce the reported results. The data is available in the Supporting Information section. This article has earned an open data badge "Reproducible Research" for making publicly available the code necessary to reproduce the reported results. The results reported in this article could fully be reproduced.
ORCID: Schneble https://orcid.org/0000-0001-9523-4173
Identification and estimation of undetected COVID-19 cases using testing data from Iceland
Estimates of COVID-19 cases across four Canadian provinces
Estimating the undetected infections in the COVID-19 outbreak by harnessing capture-recapture methods
Occurrence and transmission potential of asymptomatic and presymptomatic SARS-CoV-2 infections: A living systematic review and meta-analysis
On calculating with B-splines
Diagnosing misspecification of the random-effects distribution in mixed models
On the estimation of functional random effects
Simple fitting of subject-specific curves for longitudinal data
Flexible smoothing with B-splines and penalties
Daily COVID-19 case numbers provided by the Robert-Koch-Institute
Estimating the effects of nonpharmaceutical interventions on COVID-19 in Europe
The significance of case detection ratios for predictions on the outcome of an epidemic - A message from mathematical modelers
Modelling the COVID-19 epidemic and implementation of population-wide interventions in Italy
COVID-19 case mortality rates continue to decline in Florida. medRxiv
Varying-coefficient models
Die Deutschen und Corona - Schlussbericht der BMG
Correcting under-reported COVID-19 case numbers: Estimating the true scale of the pandemic
Flattening-the-curve associated with reduced COVID-19 case fatality rates - an ecological analysis of 65 countries
Temporal changes in clinical practice with COVID-19 hospitalized patients: Potential explanations for better in-hospital outcomes
Assessing the age specificity of infection fatality rates for COVID-19: Systematic review, meta-analysis, and public policy implications
Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (SARS-CoV-2)
At what times during infection is SARS-CoV-2 detectable and no longer detectable using RT-PCR-based tests? A systematic review of individual participant data
Estimating the COVID-19 infection rate: Anatomy of an inference problem
A biological model for influenza transmission: pandemic planning implications of asymptomatic infection and immunity
Estimating the asymptomatic proportion of coronavirus disease 2019 (COVID-19) cases on board the Diamond Princess cruise ship
Protocol of a population-based prospective COVID-19 cohort study Munich
Estimating COVID-19 under-reporting across 86 nations: Implications for projections and control
Seroepidemiological studies in the general population
Estimating the size of undetected cases of the COVID-19 outbreak in Europe: An upper bound estimator
Using a delay-adjusted case fatality ratio to estimate under-reporting
Estimating effective infection fatality rates during the course of the COVID-19 pandemic in Germany
The role of asymptomatic individuals in the COVID-19 pandemic via complex networks
The COVID-19 epidemic
Smoothing parameter and model selection for general smooth models
Substantial underestimation of SARS-CoV-2 infection in the United States