On COVID-19 Modelling
Robert Schaback
2020-05-11

Abstract: This contribution analyzes the COVID-19 outbreak by comparably simple mathematical and numerical methods. The final goal is to predict the peak of the epidemic outbreak per country with a reliable technique. This is done by an algorithm motivated by standard SIR models and aligned with the standard data provided by the Johns Hopkins University. To reconstruct data for the unregistered Infected, the algorithm uses current values of the infection fatality rate and a data-driven estimation of a specific form of the recovery rate. All other ingredients are data-driven as well. Various examples of predictions are provided for illustration.

This contribution starts in section 2 with a rather elementary recapitulation of the standard SIR model for epidemics, featuring the central notions like Basic Reproduction Number, Herd Immunity Threshold, and Doubling Time, together with some critical remarks on their abuse in the media. Experts can skip over this completely. Readers interested in the predictions should jump right away to section 5. Section 3 describes the Johns Hopkins data source with its limitations and flaws, and then presents a variation of a SIR model that can be applied directly to the data. It allows one to estimate basic parameters, including the Basic Reproduction Number, but does not work for predictions of peaks of epidemics. To achieve the latter goal, section 4 combines the data-compatible model of section 3 with a SIR model dealing with the unknown Susceptibles and the unregistered Infectious. This needs two extra parameters. The first is the infection fatality rate, as provided e.g. by [11, 10], combined with the case fatality rate that can be deduced from the Johns Hopkins data; the second is a specific form of the recovery rate, dealt with in section 4.4.

Section 2 recalls the central notions of SIR modelling, pointing out certain abuses of these notions. This will not work without calculus, but things were kept as simple as possible. Readers should take the opportunity to brush up their calculus knowledge. Experts should go over to section 3.

The simplest standard "SIR" model of epidemics (e.g. [4], and easily retrievable in the Wikipedia [12]) deals with three variables:
1. Susceptible (S),
2. Infectious (I), and
3. Removed (R).
The Removed cannot infect anybody anymore, being either dead or immune. This is the viewpoint of bacteria or viruses: the difference between death and immunity of subjects is totally irrelevant for them, since they cannot proliferate anymore in both cases. The SIR model cannot say anything about death rates of persons. The Susceptible are not yet infected and not immune, while the Infectious can infect Susceptibles. The three classes S, I, and R are disjoint and add up to a fixed total population count N = S + I + R. All of these are ideally assumed to be smooth functions of time t, and satisfy the differential equations
$$\dot S = -\beta \frac{S}{N} I,\qquad \dot I = \beta \frac{S}{N} I - \gamma I,\qquad \dot R = \gamma I, \qquad (1)$$
where the dot stands for the time derivative, and where $\beta$ and $\gamma$ are positive parameters. The product $\frac{S}{N} I$ models the probability that an Infectious meets a Susceptible. Note that the Removed of the SIR model are not the Recovered of the Johns Hopkins data that we treat later, and the SIR model does not account for the Confirmed counted there. Since $\dot N = \dot S + \dot I + \dot R = 0$, the equation N = S + I + R is kept valid at all times. The term $\beta \frac{S}{N} I$ moves Susceptibles to Infectious, while $\gamma I$ moves Infectious to Removed.
Thus $\beta$ represents an infection rate, while the removal rate $\gamma$ accounts for either healing or fatality after infection, i.e. immunity. Political decisions about reducing contact probabilities will affect $\beta$, while $\gamma$ resembles the balance between the medical aggressivity of the infection and the quality of the health care system.

As long as the Infectious I are positive, the Susceptibles S are decreasing, while the Removed R are increasing. Excluding the trivial case of zero Infectious from now on, the Removed and the Susceptible will be strictly monotonic. Qualitatively, the system is not really dependent on N, because one can multiply N, R, I, and S by arbitrary factors without changing the system. As an aside, one can also go over to the relative quantities $\sigma = S/N$ and $\rho = R/N$ with the two differential equations
$$\frac{d\sigma}{dt} = -\beta\,\sigma\,(1 - \sigma - \rho),\qquad \frac{d\rho}{dt} = \gamma\,(1 - \sigma - \rho),$$
using $I/N = 1 - \sigma - \rho$. Figure 1 shows some simple examples that will be explained in some detail below.

We start by looking at the initial conditions. Since everything is invariant under an additive time shift, we can start at time 0 and consider
$$\dot I(0) = \beta \frac{S(0)}{N} I(0) - \gamma I(0)$$
and see that the Infectious decrease right from the start if
$$\frac{S(0)}{N} < \frac{\gamma}{\beta}, \qquad (2)$$
and this keeps going on, since S must decrease and therefore $\dot I = (\beta \frac{S}{N} - \gamma) I < 0$ persists. There is no outbreak in this case, because there are not enough Susceptibles at start time. The case S(0) = N means that there is no infection at all, and we ignore it, though it is a solution to the system, with I = R = 0 and S = N throughout.

If there is a time instance $t_I$ (maybe $t_I = 0$ above) where the Infectious are positive and do not change, we have
$$0 = \dot I(t_I) = \beta \frac{S(t_I)}{N} I(t_I) - \gamma I(t_I), \qquad (3)$$
i.e.
$$\frac{S(t_I)}{N} = \frac{\gamma}{\beta}. \qquad (4)$$
If $\beta < \gamma$ holds, this situation cannot occur, and I must be decreasing all the time, i.e. the infection dies out. This is what everybody wants. There is no outbreak. In case of $\gamma = \beta$ we go back to the initial situation of the previous section and see that there is no outbreak due to S(0)/N < 1 if there is an infection at all.

The interesting case is $\beta > \gamma$. Then (3) and the strict decay of S show that as soon as t is larger than the peak time $t_I$, the Infectious will decrease due to (4). Therefore the zero of $\dot I$ must be a maximum, i.e. a peak, and it is unique. The Infectious go to zero even in the peak situation. It is one of the most important practical problems in the beginning of an epidemic to predict
• whether there will be a peak at all,
• when the possible peak will come, and
• how many Infectious there will be at the peak.
This can be answered if one has good estimates for $\beta$ and $\gamma$, and we shall deal with this problem in the major part of this paper. In real life it is highly important to avoid the peak situation, and this can only be done by administrative measures that change $\beta$ and $\gamma$ to the situation $\beta < \gamma$. This is what management of epidemics is all about, provided that an epidemic follows the SIR model. We shall see how countries perform.

The quotient
$$R_0 = \frac{\beta}{\gamma}$$
is called the basic reproduction number. If it is not larger than one, there is no outbreak, whatever the initial conditions are. If it is larger than one, there is an outbreak provided that
$$\frac{S(0)}{N} > \frac{\gamma}{\beta} = \frac{1}{R_0} \qquad (5)$$
holds. In that case, there is a time $t_I$ where I reaches a maximum, and (4) holds there. When we discuss an outbreak in what follows, we always assume $R_0 > 1$ and (5). If we later let $R_0$ tend to 1 from above, we also require that S(0) tends to N from below, in order to stay in the outbreak situation. Both $\beta$ and $\gamma$ change under a change of time scale, but the basic reproduction number is invariant. Physically, $\beta$ and $\gamma$ have the dimension time$^{-1}$, but $R_0 = \beta/\gamma$ is dimensionless.
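To make the dynamics concrete, the following minimal sketch integrates the system (1) numerically for several values of $R_0$. The parameter values mirror the test runs of Figure 1 described next; this is an illustrative Python reimplementation, not the author's MATLAB code.

```python
# Minimal sketch: integrate the SIR system (1) for several values of R0.
# Parameters mirror the Figure 1 runs (gamma = 0.1, S(0) = 0.999 N).
import numpy as np
from scipy.integrate import solve_ivp

def sir_rhs(t, y, beta, gamma, N):
    S, I, R = y
    infections = beta * S / N * I   # moves Susceptibles to Infectious
    removals = gamma * I            # moves Infectious to Removed
    return [-infections, infections - removals, removals]

N, gamma = 1.0, 0.1                 # normalized population, fixed gamma
for beta in (0.02, 0.1, 0.2, 0.3):  # R0 = beta/gamma from 0.2 to 3
    sol = solve_ivp(sir_rhs, (0, 500), [0.999 * N, 0.001 * N, 0.0],
                    args=(beta, gamma, N), max_step=1.0)
    I = sol.y[1]
    print(f"R0 = {beta/gamma:.1f}: peak I/N = {I.max()/N:.4f} "
          f"at t = {sol.t[I.argmax()]:.0f}")
```

As in Figure 1, the Infectious die out for every $R_0$, while S and R approach fixed positive levels.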
Figure 1 shows a series of test runs with $S(0) = N \cdot 0.999$ and R(0) = 0, with fixed $\gamma = 0.1$ and $\beta$ varying from 0.02 to 0.3, such that $R_0$ varies from 1/5 to 3. Due to the realistically small $I(0) = 0.001\,N$, one cannot see the decaying cases near startup, but the tails of the blue I curves are decaying examples enforced by the starting value, due to $\frac{S(t)}{N} < \frac{\gamma}{\beta} = \frac{1}{R_0}$ when started at time t. Decreasing $R_0$ flattens the blue curves for I. One can observe that I always dies out, while S and R tend to fixed positive levels. We shall prove this below. From the system, one can also infer that R has an inflection point where I has its maximum, since $\ddot R = \gamma \dot I$. If only R were observable, one could locate the peak of I via the inflection point of R.

Figure 2 shows an artificial case with a large starting value I(0) = N/2, fixed $\gamma = 0.1$ and $\beta$ varying from 0.005 to 0.3, letting $R_0$ vary from 0.05 to 3. In contrast to Figure 1, this example shows cases with small $R_0$ properly. The essence is that the Infectious go down, whether they have a peak or not, and there will always be a portion of Susceptibles. Again, we shall prove this below.

The Herd Immunity Threshold is a number related to the Basic Reproduction Number $R_0$ by following a special scenario. If a population is threatened by an infection with Basic Reproduction Number $R_0$, what is the number of immune persons needed to prevent an outbreak right from the start? We can read this off equation (2) in the ideal situation that $I(0) \approx 0$ and S(0) + R(0) = N, namely
$$\frac{R(0)}{N} > 1 - \frac{\gamma}{\beta} = 1 - \frac{1}{R_0}$$
as the threshold between outbreak and decay. This does not refer to a whole epidemic scenario, nor to an epidemic outbreak. It is a condition to be checked before anything happens, and useless within a developing epidemic, whatever the media say. In the peak situation of (4), the fraction of the Non-Susceptibles at the peak $t_I$ of I is exactly the Herd Immunity Threshold. Thus it is correct to say that if the Immune of a population are below the Herd Immunity Threshold at startup, and if the Basic Reproduction Number is larger than one, the sum of the Immune and the Infectious will rise up to the Herd Immunity Threshold and then the Infectious will decay. This is often stated imprecisely in the media. Furthermore, the Herd Immunity Threshold has nothing to do with the important long-term ratio of Susceptibles to Removed. We shall address this ratio in section 2.11.

The most interesting questions during an outbreak with $R_0 > 1$ are
• at which time $t_I$ will we reach the maximum of the Infectious, and
• what is $I(t_I)$, i.e. how many people will maximally be infectious at that time?
It will turn out that there are no easy direct answers. From (4) we see that at the maximum of I the Susceptibles S have the value
$$S(t_I) = \frac{\gamma}{\beta}\,N = \frac{N}{R_0},$$
i.e. the portion $1/R_0$ of the population is susceptible. From that time on, the Infectious decrease. In terms of R and I, the value
$$\frac{I(t_I) + R(t_I)}{N} = 1 - \frac{1}{R_0} \qquad (6)$$
of the Non-Susceptibles marks the peak of the Infectious at the Herd Immunity Threshold.

"Flattening the curve", as often mentioned in the media, is intended to mean making the maximum of I smaller, but this is not exactly what happens, since the maximum is described by the penultimate equation concerning the Susceptibles, while for $I(t_I)$ we only know (6), yielding that the left-hand side gets smaller if $R_0$ gets closer to one. Politically, this requires either making $\beta$ smaller via reducing contact probabilities or making $\gamma$ larger by improving the health system, or both.
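The peak-related quantities just discussed depend on $R_0$ alone. Here is a small helper, assuming $\sigma(0) \approx 1$ and using the peak formula derived in section 2.13 below:

```python
import numpy as np

def peak_quantities(R0):
    """Peak-related fractions for an outbreak, assuming sigma(0) ~ 1.
    Uses S(t_I)/N = 1/R0, the Herd Immunity Threshold 1 - 1/R0, and the
    maximal Infectious fraction 1 - (1 + ln R0)/R0 from section 2.13."""
    return {"susceptible_at_peak": 1.0 / R0,
            "herd_immunity_threshold": 1.0 - 1.0 / R0,
            "max_infectious_fraction": 1.0 - (1.0 + np.log(R0)) / R0}

for R0 in (1.5, 2.0, 3.0, 4.0):
    print(R0, peak_quantities(R0))   # R0 = 4 gives about 40% infectious
```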
Anyway, "flattening the curve" works by letting R 0 tend to 1 from above, but the basic reproduction number does not directly determine the time t I of the maximum or the value there. We shall improve the above analysis in section 2.13. In the beginning of the outbreak, S/N is near to one, and thereforė I ≈ +β I − γI models an exponential outbreak with exponent β − γ > 0, with a solution If this is done in discrete time steps ∆t, one has The severity of the outbreak is not controlled by R 0 = β /γ, but rather via β − γ. Publishing single values I(t) does not give any information about β − γ. Better is the ratio of two subsequent values and if this gets smaller over time, the outbreak gets less dramatic because β − γ gets smaller. Really useful information about an outbreak does not consist of values and not of increments, but of increments of increments, i.e. some second derivative information. This is what the media rarely provided during the outbreak. Another information used by media during the outbreak is the doubling time, i.e. how many days it takes until daily values double. This is the number n in i.e. it is inversely proportional to β − γ. If political action doubles the "doubling time", if halves β − γ. If politicians do this repeatedly, they never reach β < γ, and they never escape an exponential outbreak if they do this any finite number of times. Extending the doubling time will never prevent a peak, it only postpones it and hopefully flattens it. When presenting a "doubling time", media should always point out that this makes only sense during an exponential outbreak. And it is not related to the basic reproduction number R 0 = β /γ, but rather to the difference β − γ. Media often say that the basic reproduction number R 0 gives the number of persons an average Infectious infects while being infectious. This is a rather mystical statement that needs underpinning. The quantity is a value that has the physical dimension of time. It describes the ratio between current Infectious and current newly Removed, and thus can be seen as the average time needed for an Infectious to get Removed, i.e. it is the average time that an Infectious can infect others. Correspondingly, are the newly Infected, and therefore can be seen as the time it needs for an average Infectious to generate a new Infectious. The ratio β γ S N then gives how many new Infectious can be generated by an Infectious while being infected, but this is only close to R 0 if S ≈ N, i.e. at the start of an outbreak. A correct statement is that R 0 is the average number of infections an Infectious generates while being infectious, but within an unlimited supply of Susceptibles. Besides the peak in case R 0 > 1, it is interesting to know the portions of the population that get either removed (by death or immunity) or get never in contact to the infection. This concerns the long-term behavior of the Removed and the Susceptibles. Figures 1 and 2 If we are at a time t D behind the possible peak at t I , or in a decay situation enforced by starting value, like in (2), we know that I must decrease exponentially to zero. This follows from showing that log I must decrease linearly, or I must decrease exponentially. Thus we get rid of the Infectious in the long run, keeping only Susceptibles and Removed. Surprisingly, this happens independent of how large R 0 is. Dividing the first equation in (1) by the third leads to and when setting σ = S/N and ρ = R/N, we get when assuming R(0) = 0 at startup. 
Since $\rho$ is increasing, it has a limit $0 < \rho_\infty \le 1$ for $t \to \infty$, and in this limit
$$\sigma(\rho_\infty) = \sigma(0)\,e^{-R_0\,\rho_\infty}$$
holds, together with the condition $\rho_\infty + \sigma(\rho_\infty) = 1$, because there are no more Infectious. The equation
$$1 - \rho_\infty = \sigma(0)\,e^{-R_0\,\rho_\infty}$$
has a unique solution in (0, 1), dependent on $\sigma(0) < 1$ and $R_0 = \beta/\gamma$; see Figure 3.

[Figure 3: Solving for $\rho_\infty$ for fixed $\sigma(0) = 0.9$ and varying $R_0$.]

Qualitatively, we can use (8) or (9) in the form
$$R_0\,\rho = \ln\frac{\sigma(0)}{\sigma(\rho)}$$
to see that the ratio of Removed to Susceptibles increases with $R_0$, but there is a logarithm involved.

All of this has some serious implications, if the model is correct for an epidemic situation. First, the Infectious always go to zero, but Susceptibles always remain. This means that a new infection can always arise whenever some infected person enters the sanitized population. The outbreak risk depends on the portion $\sigma_\infty = 1 - \rho_\infty$ of the Susceptibles. This illustrates the importance of vaccination, e.g. against measles or influenza.

The above analysis shows that large values of $R_0$ lead to large relative values of Removed to Susceptible in the limit. The consequence is that systems with large $R_0$ have a dramatic outbreak and lead to a large portion of Removed. This is good news in case the rate of fatalities within the Removed is low, but very bad news otherwise. When politicians try to "flatten the curve" by bringing $R_0$ below 1 from some time on, this will automatically decrease the asymptotic rate of Removed and increase the asymptotic rate of Susceptibles in the population. This is particularly important if the rate of fatalities within the Removed is high, but by the previous argument the risk of re-infection rises due to the larger portion of Susceptibles.

The decay situation (7) implies $\sigma_\infty \le 1/R_0$, and therefore
$$\rho_\infty \ge 1 - \frac{1}{R_0},$$
i.e. the final rate of the Removed is not smaller than the Herd Immunity Threshold. This is good news for possible re-infections, but only if the death rate among the Removed is small enough. In a decay situation like in (7), we get
$$\dot I = \left(\beta\,\sigma - \gamma\right) I \;\to\; \left(\beta\,\sigma_\infty - \gamma\right) I$$
to see that the exponential decay is not ruled by $\beta - \gamma$ as in the outbreak case with $R_0 > 1$, but rather by $-\gamma + \beta\sigma_\infty$. This also holds for large $R_0 = \beta/\gamma$, because $\sigma_\infty$ counteracts. The bell shapes of the peaked I curves are therefore not symmetric with respect to the peak.

If we go back to analyzing the peak of I at $t_I$ for $R_0 > 1$, we know $\sigma(t_I) = 1/R_0$ from (4) and get
$$\rho(t_I) = \frac{\ln(\sigma(0)\,R_0)}{R_0}$$
from (9), leading to
$$\frac{I(t_I)}{N} = 1 - \sigma(t_I) - \rho(t_I) = 1 - \frac{1 + \ln(\sigma(0)\,R_0)}{R_0}.$$
For standard infections that have starting values $\sigma(0) = S(0)/N$ very close to one, the maximal ratio of Infectious is
$$\frac{I(t_I)}{N} \approx 1 - \frac{1 + \ln R_0}{R_0}.$$
Figure 4 shows the behaviour of this function, and this is what "flattening the curve" is all about. A value of $R_0 = 4$ gets a maximum of more than 40% of the population infectious at a single time. If 5% need hospital care, this implies that a country needs hospital beds for 2% of the population. The dotted line leaves the log term out, i.e. it marks the rate of the Non-Susceptibles at the peak, and by (6) the difference between the two curves is the rate $R(t_I)/N$ of the Removed at the peak.

To analyze the peak time $t_I$, we use $\dot I \le (\beta - \gamma) I$ to get the upper bound $I(t) \le I(0)\,e^{(\beta-\gamma)t}$ for the exponential outbreak, which implies a lower bound for $t_I$ of the form
$$t_I \ge \frac{1}{\beta - \gamma}\,\ln\frac{I(t_I)}{I(0)}.$$
This needs improvement.

To get a quantitative result about "flattening the curve", we first evaluate the integral
$$\int_0^\infty I(t)\,dt = \frac{R(\infty) - R(0)}{\gamma} = \frac{N\,\rho_\infty}{\gamma},$$
assuming R(0) = 0, and set it equal to an integral over the constant value at the maximum, i.e. we squeeze the area under the curve into a rectangle of length $b - a$ under the maximal value:
$$(b - a)\,I(t_I) = \frac{N\,\rho_\infty}{\gamma}.$$
If we "flatten the curve" by letting $R_0$ tend to 1 from above, the length $b - a$ of the above rectangle goes to infinity like $1/(R_0 - 1)$, because $\rho_\infty$ tends to zero like $2(R_0 - 1)$ while $I(t_I)/N$ tends to zero like $(R_0 - 1)^2/2$. If there is no peak, e.g.
if $R_0 = \beta/\gamma$ is below 1 either at the beginning or after some political intervention, one can repeat the above argument, starting with the Infectious at some time t and looking at the area under I from t to infinity:
$$\int_t^\infty I(s)\,ds = \frac{N\,(\rho_\infty - \rho(t))}{\gamma}.$$
This needs improvement as well.

Here is a detour that is well-known in the SIR literature. The SIR system can be written as
$$\dot\sigma = -\beta\,\sigma\,\frac{I}{N},\qquad \dot\rho = \gamma\,\frac{I}{N},$$
and in a new time variable $\tau$ with $d\tau = \frac{I}{N}\,dt$, one gets the system
$$\frac{d\sigma}{d\tau} = -\beta\,\sigma,\qquad \frac{d\rho}{d\tau} = \gamma.$$
The beauty of this is that the roles of $\beta$ and $\gamma$ are perfectly split, like the roles of $\sigma$ and $\rho$. In the new timescale, $\rho$ increases linearly and $\sigma$ decreases exponentially. The Basic Reproduction Number then describes the fixed ratio
$$\frac{d\sigma/d\tau}{d\rho/d\tau} = -R_0\,\sigma, \qquad (10)$$
and the result (9) of section 2.11 comes back as
$$\sigma(\tau) = \sigma(0)\,e^{-\beta\tau} = \sigma(0)\,e^{-R_0\,\rho(\tau)}$$
for the case $\rho(0) = 0$. This approach has the disadvantage of concealing the peak within the new timescale, and is useless for peak prediction.

If data for the SIR model were fully available, one could solve for
$$\gamma = \frac{\dot R}{I},\qquad \beta = -\frac{\dot S\,N}{S\,I},\qquad R_0 = \frac{\beta}{\gamma} = -\frac{\dot S\,N}{S\,\dot R}, \qquad (11)$$
and we shall use this in section 3.3. The validity of a SIR model can be tested by checking whether the right-hand sides for $\beta$, $\gamma$ and $R_0$ are roughly constant. If data are sampled locally, e.g. before or after a peak, the above technique should determine the parameters for the global epidemic, i.e. be useful for either prediction or backward testing. However, in pandemics like COVID-19, the parameters $\beta$ and $\gamma$ change over time by administrative action. This means that they should be considered as functions in the above equations, and then their changes may be used for conclusions about the influence of such actions. This is intended when media say that "$R_0$ has changed". From this viewpoint, one can go back to the SIR model and consider $\beta$ and $\gamma$ as "control functions" that just describe the relation between the variables. But the main argument against using (11) is that the data are hardly available. This is the concern of the next section.

Now we want to confront the modelling of the previous section with available data. This is crucial for maneuvering countries through the epidemics [13]. In this text, we work with the COVID-19 data from the CSSEGISandData repository of the Johns Hopkins University [7]. They are the only source that provides comparable data on a worldwide scale. The numbers there are the Confirmed C, the Dead D, and the Recovered R, given as cumulative integer-valued time series in days from Jan. 22nd, 2020. All these values are absolute numbers, not relative to a total population. Note that the unconfirmed cases are not accessible at all, while the Confirmed contain the Dead and the Recovered of earlier days.

At this point, we do not question the integrity of the data, but there are many well-known flaws. In particular, the values for specific days partly belong to previous days, due to delays in the chains of data transmission in different countries. This is why, at some points, we shall apply some conservative smoothing to the data. Finally, there are inconsistencies that possibly need data changes. For an example, consider that COVID-19 cases usually lead to recovery or death within a rather fixed period of $k \approx 15$-$21$ days. But some Johns Hopkins data have fewer new Infected at day n than the sum of new Recovered and Dead at day n + k. And there are countries like Germany that deliver data of Recovered in a very questionable way: the law in Germany did not force authorities to collect data of Recovered. The United Kingdom did not report numbers of Dead and Recovered from places outside the National Health System, e.g. from seniors' retirement homes.
Both strategies have changed somewhat in the meantime, as of early May, but the data still carry these flaws. We might assume that the Dead plus the Recovered of the Johns Hopkins data are the Removed of the SIR model, and that the Infectious I = C - R - D of the Johns Hopkins data are the Infectious of the SIR model. But this is not strictly valid, because registration or confirmation get in the way. On the other hand, one can take the radical viewpoint that facts are not interesting if they do not show up in the Johns Hopkins data. Except for the United Kingdom, the important figures concern COVID-19 casualties that are actually registered as such, others do not count, and serious cases needing hospitalization or leading to death should not go unregistered. If they do in certain countries, using such data will not be of any help, unless other data sources are available. If SIR modelling does not work for the Johns Hopkins data, it is time to modify the SIR technique appropriately, and this will be tried in this section. An important point for what follows is that the data come as daily values. To make this compatible with differential equations, we shall replace derivatives by differences.

To get a first impression of the Johns Hopkins data, Figure 5 shows raw data up to day 109 (May 10th, as of this writing). The presentation is logarithmic, because then linearly increasing or decreasing parts correspond to exponentially increasing or decreasing numbers in the real data. Many presentations in the media are non-logarithmic, and then all exponential outbreaks look similar. The really interesting data are the Infectious I = C - R - D in black that show a peak or not. The other curves are cumulative. The data for other countries tell similar stories and are suppressed.

The media, in particular German TV, present COVID-19 data in a rather debatable way. When mentioning Johns Hopkins data, they provide C, D, and R separately without stating the most important figures, namely I = C - D - R, their change, and the change of their change. When mentioning data of the Infectious from the Robert Koch Institute alongside, they do not say precisely that these are non-cumulative and should be compared to the I = C - R - D data of the Johns Hopkins University. And, in most cases during the outbreak, they did not mention the change of the change. Quite like all other media.

One can see in Figure 5 that Germany and South Korea have passed the peak of the Infectious, while France is roughly at the peak and the United States is still in an exponential outbreak. The early figures, below day 40, are rather useless, but then an exponential outbreak is visible in all cases. This outbreak changes its slope due to political actions, and we shall analyze this later; see [3] for a detailed early analysis of slope changes. There are strange anomalies in the Recovered (green): France seems not to have delivered any data between days 40 and 58, Germany changed its data delivery policy between days 62 and 63, and the UK data for the Recovered are a mess.

It should be noted that the available medical results on the COVID-19 disease often state that Confirmed will die or survive after a more or less fixed number of days, roughly 14 to 21. This would imply that the red curves for the Dead and the green curves for the Recovered should roughly follow the blue curves for the Confirmed with a fixed but measurable delay. This is partially observable, but much less accurately for the Recovered.
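For readers who want to reproduce such plots, the cumulative series can be pulled directly from the repository [7]. The file layout below is an assumption about the repository's structure and may change over time:

```python
# Hedged sketch: fetch C, D, R for one country from the repository [7]
# and form the Infectious I = C - D - R. File names/paths are assumptions.
import pandas as pd

BASE = ("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
        "csse_covid_19_data/csse_covid_19_time_series/")

def jh_series(kind, country):
    df = pd.read_csv(BASE + f"time_series_covid19_{kind}_global.csv")
    # sum the provinces of a country; the date columns start at column 4
    return df[df["Country/Region"] == country].iloc[:, 4:].sum(axis=0)

C = jh_series("confirmed", "Germany")
D = jh_series("deaths", "Germany")
R = jh_series("recovered", "Germany")
I = C - D - R    # the Infectious, the figure the media rarely show
```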
The idea is to define a model that works exclusively with the Johns Hopkins data, but comes close to a SIR model, without being able to use S. Since the SIR model does not distinguish between recoveries and deaths, we set
$$R_{SIR} \Leftrightarrow D_{JH} + R_{JH}$$
and let the Infectious be comparable, i.e.
$$I_{SIR} \Leftrightarrow I_{JH} = C_{JH} - D_{JH} - R_{JH},$$
and we completely omit the Susceptibles. From now on, we shall omit the subscript JH when we use the Johns Hopkins data, but we shall use SIR when we go back to the SIR model. We set
$$b_n = \frac{C_{n+1} - C_n}{I_n},\qquad \gamma_n = \frac{(R+D)_{n+1} - (R+D)_n}{I_n},$$
defining time series $\gamma_n$ and $b_n$ that model $\gamma$ and $b = \beta \cdot S_{SIR}/N$ without knowing $S_{SIR}$. This is equivalent to the model
$$C_{n+1} - C_n = b_n I_n,\qquad I_{n+1} - I_n = b_n I_n - \gamma_n I_n,\qquad (R+D)_{n+1} - (R+D)_n = \gamma_n I_n$$
that maintains C = I + R + D, and we may call it a Johns Hopkins Data Model. It is very close to a SIR model if the time series $b_n$ is not considered to be constant, but just an approximation of $\beta \cdot S_{SIR}/N$. By brute force, one can consider
$$r_n = \frac{b_n}{\gamma_n} \qquad (12)$$
as a data-driven substitute for $\beta/\gamma$.

Then there is a rather simple observation: if $r_n$ is smaller than one, the Infectious decrease. It follows from
$$I_{n+1} - I_n = (b_n - \gamma_n)\,I_n = \gamma_n\,(r_n - 1)\,I_n,$$
but this is visible in the data anyway and not of much help. Since $r_n$ models $R_0\,S_{SIR}/N$, it always underestimates $R_0$. This underestimation gets dramatic when it must be assumed that $S_{SIR}$ gets seriously smaller than N.

At this point, it is not intended to forecast the epidemics. The focus is on extracting parameters from the Johns Hopkins data that relate to a background SIR-type model. Figure 6 shows estimates of $R_0\,S_{SIR}/N$ via $r_n$ for the last four weeks before day 109, May 10th. Except for the United States, all countries were more or less successful in pressing $R_0$ below one. In all cases, $S_{SIR}/N$ is too close to one to have any influence. The variation in $r_n$ is not due to the decrease in $S_{SIR}/N$, but should rather be attributed to political action. As mentioned above, the estimates for $R_0$ by $r_n$ are always too optimistic.

For the figure, the raw Johns Hopkins data were smoothed by a double action of a 1/4, 1/2, 1/4 filter on the logarithms of the data. This smoother keeps constants and linear sections of the logarithm invariant, i.e. it does not change local exponential behavior. This smoothing was not applied to Figure 5. It was by far not strong enough to eliminate the apparent 7-day oscillations that frequently occur in the Johns Hopkins data, see the figure. Data from the Robert Koch Institute in Germany have even stronger 7-day variations.

As long as $r_n$ is roughly constant, the above approach will always model an exponential outbreak or decay, but never a peak, because the difference equations are linear. It can only help the user to tell whether there is a peak ahead or behind, depending on $r_n \approx R_0$ being larger or smaller than 1. If $r_n$ is kept below one, the Confirmed Infectious will not increase, causing no new threats to the health system. Then the S/N factor will not decrease substantially, and a full SIR model is not necessary. On the other hand, if a country manages to keep $r_n$ smaller than some $r = b/\gamma < 1$, the Infectious decrease exponentially, and it is only a question of a limited time until the epidemic fades out. As long as countries keep the $r_n$ clearly below one, e.g. below 1/2, this would mean that $R_0 \approx r_n\,N/S_{SIR}$ stays below one if $S_{SIR} \ge N/2$, i.e. as long as the majority of the population has not been in contact with the SARS-CoV-2 virus. This is good news. But observing a small $r_n$ can conceal a situation with a large $R_0$ if $S_{SIR}/N$ is small. This is one reason why countries need to get a grip on the Susceptibles nationwide.
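In code, the smoothing and the time series $b_n$, $\gamma_n$, $r_n$ of the Johns Hopkins Data Model look roughly as follows; this is a sketch, and the author's MATLAB implementation may differ in details:

```python
import numpy as np

def smooth_log(x, passes=2):
    """Apply the 1/4, 1/2, 1/4 filter twice to the logarithms, which keeps
    linear sections of the log (local exponential behavior) invariant."""
    y = np.log(np.maximum(np.asarray(x, dtype=float), 1.0))
    for _ in range(passes):
        y[1:-1] = 0.25 * y[:-2] + 0.5 * y[1:-1] + 0.25 * y[2:]
    return np.exp(y)

def jh_data_model(C, R, D):
    """Time series b_n, gamma_n and r_n = b_n/gamma_n of (12)."""
    I = C - R - D                                # Confirmed Infectious
    b_n = np.diff(C) / I[:-1]                    # models beta * S_SIR / N
    gamma_n = (np.diff(R) + np.diff(D)) / I[:-1]
    return b_n, gamma_n, b_n / gamma_n           # r_n underestimates R0
```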
It is tempting to use the above technique for prediction in such a way that the $b_n$ and the $\gamma_n$ are fitted to a constant or a linear function, using the values of the fit for running the system into the future. This is very close to extending the logarithmic plots of Figure 5 by lines, using pencil and ruler, and thus hardly interesting. So far, the above argument cannot replace a SIR model; it only interprets the available data. However, monitoring the Johns Hopkins data in the above way will be very useful when it comes to evaluating the effectivity of certain measures taken by politicians. It will be highly interesting to see how the data of Figure 6 continue, in particular when countries relax their contact restrictions.

For cases where one still has to expect $R_0 > 1$, e.g. UK, US and Sweden around day 90 (see Figure 6), the challenge remains to predict a possible peak. Using the estimates from the previous section is questionable, because they concern the subpopulation of Confirmed and systematically underestimate $R_0$. The "real" SIR model will have different parameters, and it needs the Susceptibles to model a peak or to make the $r_n$ estimates realistic. So far, the Johns Hopkins data work in a range where S/N is still close to one, and the Susceptibles are considered to be abundant. But the bad news for countries with $r_n > 1$ is that $r_n$ underestimates $R_0$.

Anyway, if one trusts the above time series as approximations to $\beta$ and $\gamma$, one can run a SIR model, provided one is in the case $R_0 > 1$ and has reasonable starting values. But these are a problem, because the unconfirmed Infectious and the unconfirmed Recovered are not known, even if, for simplicity, one assumes that there are no unconfirmed COVID-19 deaths. For an unrealistic scenario, consider Total Registration, i.e. all Infected are automatically confirmed. Then the Susceptibles in the Johns Hopkins model would be $S_n = N - C_n = N - I_n - R_n - D_n$. Now the estimate for $R_0$ must be corrected to
$$r_n\,\frac{N}{S_n} = r_n\,\frac{N}{N - C_n} = r_n\left(1 + \frac{C_n}{N - C_n}\right),$$
but this change will not be serious during the outbreak. Running a standard SIR model with these parameters and starting values, one gets a crude prediction of the peak in case $R_0 = \beta/\gamma > 1$.

Figure 7 shows results for two cases. The top shows the case of the United States, using data from day 109 (May 10th) and estimating $\beta$ and $\gamma$ from the data of one week before. The peak is predicted at day 472 with a total rate of 33% Infectious. To see how crude the technique is, the second plot shows the case of Germany using data up to day 75, i.e. before the peak; there, the peak is predicted at day 230 with about 16% Infected. At day 75, $R_0$ was estimated at 2.01, but a few days later the estimate went below 1 (Figure 6) by political intervention changing $b_n$ considerably. See Figure 12 for a much better prediction using data only up to day 67.

To get closer to what actually happens, one should combine the data-oriented Johns Hopkins Data Model with a SIR model that accounts for what happens outside of the Confirmed. We introduce the time series $M_n$ for the unregistered Infectious and $H_n$ for the unregistered Recovered, and assume that there are no unregistered COVID-19 deaths. This implies that all deaths occur within the Confirmed, though this is a highly debatable issue. It assumes that persons with serious symptoms get confirmed, and nobody dies of COVID-19 without prior confirmation. The Removed from the viewpoint of a global SIR model including H and M are H + C, and thus the SIR model is
$$S_{n+1} - S_n = -\beta\,\frac{S_n}{N}\,M_n,\qquad M_{n+1} - M_n = \beta\,\frac{S_n}{N}\,M_n - \gamma\,M_n,\qquad (H+C)_{n+1} - (H+C)_n = \gamma\,M_n. \qquad (13)$$
To run this hidden model with constant N = S + M + H + C, one needs initial values and good estimates for $\beta$ and $\gamma$, which are not the ones of the Johns Hopkins Data Model of section 3.3.
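Before the hidden model is coupled to the data, here is a hedged sketch of the crude Total Registration technique behind Figure 7 above: estimate b and $\gamma$ as constants from a one-week backlog, correct b for $S/N$, and run the discrete SIR model forward until the Infectious peak.

```python
import numpy as np

def crude_peak_prediction(C, R, D, N, backlog=7, horizon=600):
    """Crude peak forecast: constant beta, gamma fitted from the last
    `backlog` days, Susceptibles initialized by Total Registration."""
    I = C - R - D
    b = np.mean((np.diff(C) / I[:-1])[-backlog:])
    g = np.mean(((np.diff(R) + np.diff(D)) / I[:-1])[-backlog:])
    beta = b * N / (N - C[-1])     # correct b = beta * S/N for S = N - C
    S, In = N - C[-1], I[-1]
    infectious = []
    for _ in range(horizon):
        new = beta * S / N * In    # daily new infections
        S, In = S - new, In + new - g * In
        infectious.append(In)
    peak = int(np.argmax(infectious))
    return peak, infectious[peak] / N   # days ahead and peak fraction
```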
The Johns Hopkins variables D and R are linked to the hidden model via I = C - R - D. They follow an observable model
$$D_{n+1} - D_n = \gamma_{iCD}\,I_n,\qquad R_{n+1} - R_n = \gamma_{iCR}\,I_n \qquad (14)$$
with instantaneous case death and recovery rates $\gamma_{iCD}$ and $\gamma_{iCR}$ for the Confirmed Infectious. These rates can be estimated separately from the available Johns Hopkins data, and we shall do this below. We call these rates instantaneous, because they artificially attribute the new deaths or recoveries at day n + 1 to the previous day, not to earlier days. In this sense, they are rather artificial, and we shall address this question. They are case rates, because they concern the Confirmed.

The observable model is coupled to the hidden model only by $C_n$. Any data-driven $C_n$ from the observable model can be used to enter the H + C variable of the hidden model, but in an unknown ratio. Conversely, any version of the hidden model produces H + C values that do not determine the C part. Summarizing, there is no way to fit the hidden model to the data without additional assumptions. Various possibilities were tried to connect the Hidden to the Observable. Two will be presented now.

Recall that the parameter $\gamma_{iCD}$ in the observable model (14) relates case fatalities to the confirmed Infectious of the previous day. In contrast to this, the infection fatality rate in the standard literature, denoted by $\gamma_{IF}$ here, relates to the infection directly, independent of the confirmation, and gives the probability to die of COVID-19 after infection with the SARS-CoV-2 virus. It is $\gamma_{IF} = 0.56\%$ by [1] and $0.66\%$ by [11], but specialized for China. Recent data from the Heinsberg study by Streeck et al. [10] give a value of $0.37\%$ for the Heinsberg population in Germany. The idea to use the infection fatality rate for information about the hidden system comes from [2]. The way to use it will depend on how to handle delays, and it turned out to be difficult to use these rates in a stable way.

Let us focus on probabilities to die either after an infection or after confirmation of an infection. The first is the infection fatality rate given in the literature, but what is the comparable case fatality rate $\gamma_{CF}$ when using the Johns Hopkins data? It is clearly not the $\gamma_{iCD}$ in (14), giving the ratio of new deaths at day n + 1 as a fraction of the confirmed Infectious at day n. It makes much more sense to use the fact that COVID-19 diseases end after k days from confirmation with either death or recovery. Let us call this the k-day rule. Suggested values for k range between 14 and 21. Following [9], one can estimate the probability $p_i$ to die on day i after confirmation, and this works in a stable way per country, based only on C and D, not on the unstable R data. In [9] this approach was used to produce R values that comply with the k-day rule, but here we use it for estimating the case fatality. The technique of [9] performs a fit
$$D_{n+1} - D_n \approx \sum_{i=1}^{k} q_i\,(C_{n-i+1} - C_{n-i}),\qquad q_i = p_i \prod_{j<i}(1 - p_j), \qquad (15)$$
i.e. it assigns all new deaths at day n + 1 to new confirmations on previous days in a hopefully consistent way, minimizing the error in the above formula under variation of the probabilities $p_i$. If the $p_i$ are known for days 1 to k, the case fatality rate is
$$\gamma_{CF} = \sum_{i=1}^{k} q_i = 1 - \prod_{i=1}^{k}(1 - p_i).$$
This argument can also be seen the other way round: the new Confirmed $C_n - C_{n-1}$ enter into $D_{n+1} - D_n$ with probability $q_1 = p_1$, into $D_{n+2} - D_{n+1}$ with probability $q_2 = p_2(1 - p_1)$, and so on. The rest enters into the new Recovered at day n + k with probability $q_{k+1} = \prod_{i=1}^{k}(1 - p_i)$. Thus the case fatality rate can be expressed as $1 - q_{k+1}$, like above.
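The fit (15) can be set up as a nonnegative least-squares problem directly in the compound probabilities $q_i$; the following sketch sidesteps the product parametrization in the $p_i$, so it is only an approximation of the procedure of [9]:

```python
import numpy as np
from scipy.optimize import nnls

def case_fatality_fit(C, D, k=21, backlog=42):
    """Fit q_i >= 0 in (15): dD[n] ~ sum_{i=1..k} q_i * dC[n-i].
    The case fatality rate gamma_CF is the sum of the fitted q_i.
    Assumes the series are longer than backlog + k days."""
    dC = np.diff(np.asarray(C, dtype=float))
    dD = np.diff(np.asarray(D, dtype=float))
    days = range(len(dD) - backlog, len(dD))
    A = np.array([[dC[n - i] for i in range(1, k + 1)] for n in days])
    rhs = np.array([dD[n] for n in days])
    q, _residual = nnls(A, rhs)     # nonnegativity by construction
    return q.sum()    # gamma_CF estimate, cf. the first column of Table 1
```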
At this point, there is a hidden assumption. Persons that are new to the Confirmed at day n are neither dead nor recovered. The change $C_{n+1} - C_n$ to the Confirmed is therefore understood as the number of newly registered infections. Otherwise, one might replace $C_{n-i+1} - C_{n-i}$ by $I_{n-i+1} - I_{n-i}$ in (15), but this would connect a cumulative function to a non-cumulative function. Furthermore, this would use the unsafe data of the Recovered.

In fact, equation (15) is unexpectedly reliable, provided one looks at the sum of the fitted probabilities. This follows from series of experiments that we do not document fully here. In [9], data for 2k days backwards were used for the estimation, and results did not change much when more or less data were used or when k was modified. Here, the range $7 \le k \le 21$ was tested, with backlogs of up to 50 days from day 109 (as of this writing). See Figure 8 below for an example. Larger k lead to somewhat larger fatality rates, because the method has more leeway to assign deaths to confirmations, but the increase is rather small when k is increased beyond 14. It is remarkable that k = 7 suffices for a rather good fit for US, UK, Sweden, and Italy. In contrast to other countries, this means that 7 days after confirmation are enough to assign deaths consistently to previous confirmations, indicating that a large portion of confirmations concern rather serious cases. In general, the fit gets better when the backlog is not too large, avoiding the use of outdated data, and the resulting rate does not change much when backlogs are shortened. Roughly, the rule of a backlog of 2k days works well, together with k = 21, to allow enough leeway to ensure a small fitting error. Note that all variations of k and the backlog have a rather small effect on the sum of the $p_i$, while the $p_i$ themselves vary strongly.

See the first column of Table 1 for estimates of case fatality rates for different countries, calculated on day 109 (May 10th). These depend strongly on the strategy for confirmation. In particular, they are high when only serious cases are confirmed, e.g. cases that need hospital care. If many more people are tested, confirmations will contain plenty of much less serious cases, and then the case fatality rates are low. The values were entered manually after inspection of a plot of the rates as functions of k and the backlog. See Figure 8 for how the value 0.143 for Italy was determined.

The instantaneous case death rate $\gamma_{iCD}$ of (14) comes out around 0.004 for Germany by direct inspection of the Johns Hopkins data via the time series
$$\frac{D_{n+1} - D_n}{I_n} \approx \gamma_{iCD}, \qquad (16)$$
while the case fatality rate of Table 1 is about 0.045. The deaths have to be attributed to different days using the k-day rule; they cannot easily be assigned to the previous day without making the rate smaller.

If the case fatality rates $\gamma_{CF}$ of Table 1 are used with the infection fatality rate $\gamma_{IF} = 0.006$, one should obtain an estimate of the total Infectious. If the formula (15) is written as
$$D_{n+1} - D_n \approx \sum_{i=1}^{k} \tilde q_i\,(S_{n-i} - S_{n-i+1}) \qquad (17)$$
in terms of the previous new infections $S_{n-i} - S_{n-i+1}$ with infection fatality probabilities $\tilde q_i$, one should maintain
$$\sum_{i=1}^{k} \tilde q_i = \gamma_{IF},$$
and this works by setting
$$\tilde q_i = q_i\,\frac{\gamma_{IF}}{\gamma_{CF}}$$
in general, without using the unstable $p_i$. The quotient $\gamma_{IF}/\gamma_{CF}$ can be called the detection rate, stating the fraction of Infectious that enters confirmation. See the second column of Table 1. Note that all of this depends on good estimates of the infection fatality rate, and the new value by [10] will roughly double the detection rate for Germany.
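As a worked example with the numbers quoted above for Germany ($\gamma_{CF} \approx 0.045$ from Table 1 and $\gamma_{IF} = 0.006$); the daily increment is an illustrative assumption:

```python
gamma_IF = 0.006                  # infection fatality rate, literature value
gamma_CF = 0.045                  # case fatality rate for Germany, Table 1
detection = gamma_IF / gamma_CF   # detection rate, about 13%
new_confirmed = 1000              # illustrative daily increment C_{n+1}-C_n
new_infections = new_confirmed / detection
print(detection, new_infections)  # ~0.133 and ~7500 estimated new infections
```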
All of this is comparable to the findings of [2] and uses the basic idea from there, but is executed with a somewhat different technique. A simple way to understand the quotient $\gamma_{IF}/\gamma_{CF}$ as a detection rate is to ask for the probability p(C) of Confirmation. If the probability to die after Confirmation is $\gamma_{CF}$, and if there are no deaths outside confirmation, then
$$\gamma_{IF} = p(C)\,\gamma_{CF}$$
and
$$p(C) = \frac{\gamma_{IF}}{\gamma_{CF}}.$$
It is tempting to replace S by M in (17), but this would make M cumulative.

Under political changes of the parameters $\beta$ and $\gamma$, the estimation of the Case Fatality Rate should be made locally, not globally. Using the experience of [9] and section 4.3.1, we shall do this using a fixed k = 21 for the k-day rule and data for a fixed backlog of 2k days.

We need another parameter to make the model work. There are many choices, and after some failures we selected the constant $\gamma_{iIR}$ in a model equation
$$H_{n+1} - H_n = \gamma_{iIR}\,M_n.$$
Following what was mentioned about instantaneous rates in section 4.2, this is an instantaneous Infection Recovery Rate, relating the new unregistered Recovered to the unregistered Infectious of the day before. A good value of $\gamma_{iIR}$ could come out of an experiment that produces time series for M and H, i.e. for the unregistered Infectious and the unregistered Recovered. Then the instantaneous Infection Recovery rate $\gamma_{iIR}$ could be obtained directly by
$$\gamma_{iIR} \approx \frac{H_{n+1} - H_n}{M_n}.$$
The Infection Recovery rate $\gamma_{IR} = 1 - \gamma_{IF}$ does not help much, because we need an instantaneous rate that has no interpretation as a probability. Without additional input, we can look at the instantaneous Case Recovery rate $\gamma_{iCR}$ of (14), obtained from the time series
$$\frac{R_{n+1} - R_n}{I_n} \approx \gamma_{iCR}, \qquad (18)$$
which is available from the Johns Hopkins data and comes out experimentally to be rather stable. The rate $\gamma_{iIR}$ must be larger, because we are now not in the subpopulation of the Confirmed, and nobody can die without first going into the population of the Confirmed. As long as no better data are available, we use the formula
$$\gamma_{iIR} = \gamma_{iCR}\,\frac{\gamma_{IR}}{\gamma_{CR}} = \gamma_{IR}\,\frac{\gamma_{iCR}}{\gamma_{CR}},\qquad \gamma_{CR} = 1 - \gamma_{CF}, \qquad (19)$$
which accounts for two things:
1. the value $\gamma_{iCR}$ is increased by the ratio $\gamma_{IR}/\gamma_{CR}$ of the Recovered probabilities for the Infected and the Confirmed,
2. the value $\gamma_{IR}$ is multiplied by the factor $\gamma_{iCR}/\gamma_{CR}$ for the transition to instantaneous rates, and this factor is the transition factor for the Confirmed Recovered.
The above strategy is debatable and may be the weakest point of this approach. However, others turned out to be worse, mainly due to instability of results. In (19) the rate $\gamma_{IR}$ is fixed, and the rate $\gamma_{CR}$ is determined locally via section 4.3.4. The rate $\gamma_{iCR}$ follows from the time series (18) as in (14). This works for countries that provide useful data for the Recovered. In that case, and in others to follow below, we can take the time series itself as long as we have data. For prediction, we estimate the constant from the time series using a fixed backlog of m days from the current day. Since many data have a weekly oscillation, due to data not being properly delivered during weekends, the backlog should not be less than 7. But for certain countries, like the United Kingdom, the data for the Recovered are useless. In such cases, we employ the technique of [9] to estimate the Recovered using the k-day rule and a backlog of 2k days, like in section 4.3.4 for the case fatality rates.

We now have everything to run the hidden model, but we do it first for days that have available Johns Hopkins data. This leads to estimations of S, M, and H from the observed data of the Johns Hopkins source. With the parameters from above, we use the new relations
$$S_{n+1} - S_n = -\frac{\gamma_{CF}}{\gamma_{IF}}\,(C_{n+1} - C_n),\qquad H_{n+1} - H_n = \gamma_{iIR}\,M_n. \qquad (20)$$
The first equation is a priori and determines S.
One could run it over the whole range of available data if the parameter $\gamma_{CF}$ were fixed, like $\gamma_{IF}$. But since we estimate it via section 4.3, resulting in Table 1, we should use section 4.3.4 to calculate $\gamma_{CF}$ locally. The second equation is a posteriori and lets H follow from M, but we postpone it. The third model equation in (13) will be handled by defining $\gamma_n$ by brute force via
$$\gamma_n = \frac{(H+C)_{n+1} - (H+C)_n}{M_n}. \qquad (21)$$
We then set up the second model equation for M as
$$M_{n+1} - M_n = (S_n - S_{n+1}) - \gamma_n\,M_n, \qquad (22)$$
which can be solved if an initial value is prescribed. The first model equation in (13) is used to define $\beta_n$ by
$$\beta_n = \frac{(S_n - S_{n+1})\,N}{S_n\,M_n}. \qquad (23)$$
The model is run by executing (20), (21), (22), and (23) for all days with available data.

Since the populations are large, the starting values for S are not important. The starting value for H is irrelevant for H itself, because only differences enter, but it determines the starting value for M due to the balance equation. Anyway, it turns out experimentally that the starting values do not matter if the model is started early. The hidden model (13) depends much more strongly on C than on the starting values. See Figure 12 for an example.

Along with the calculation of S, M, and H, we run the calculation of the time series $\beta_n$ and $\gamma_n$ using (23) and (21). These yield estimates for the parameters of the full SIR model that replace the earlier time series from the Johns Hopkins Data Model in section 3.3. The figures to follow in section 5.2 show the original Johns Hopkins data together with the hidden variables S, M, and H that are calculated by the above technique. Note that the only ingredients besides the Johns Hopkins data are the number k for the k-day rule, the Infection Fatality Rate $\gamma_{IF}$ from the literature, and the backlog m for the estimation of constants from time series.

To let the combined model predict the future, or to check what it would have predicted if used at an earlier day, we take the completed model of the previous sections up to a day n and use the values $S_n$, $M_n$, $H_n$, $C_n$, $I_n$, $R_n$ and $D_n$ for starting the prediction. With the variable HC := H + C, we use the recursion
$$\begin{aligned}
S_{n+1} &= S_n - \beta\,\frac{S_n}{N}\,M_n,\\
M_{n+1} &= M_n + \beta\,\frac{S_n}{N}\,M_n - \gamma\,M_n,\\
HC_{n+1} &= HC_n + \gamma\,M_n,\\
C_{n+1} &= C_n + \frac{\gamma_{IF}}{\gamma_{CF}}\,(S_n - S_{n+1}),\\
R_{n+1} &= R_n + \gamma_{iCR}\,I_n,\\
D_{n+1} &= D_n + \gamma_{iCD}\,I_n,\\
I_{n+1} &= (1 - \gamma_{iCR} - \gamma_{iCD})\,I_n + (C_{n+1} - C_n).
\end{aligned} \qquad (24)$$
This needs fixed values of $\beta$ and $\gamma$ that we estimate from the time series for $\beta_n$ and $\gamma_n$ by using a backlog of m days, following section 4.5.2. The instantaneous rates $\gamma_{iCR}$ and $\gamma_{iCD}$ can be calculated via their time series, as in (18) and (16), using the same backlog. We do this at the starting point of the prediction, and then the model runs in what can be called a "no political change" mode. Examples will follow in section 5.2.

The first part of the full model (24) runs as a standard SIR model for the variables S, M and H + C, and inherits the properties of these as described in section 2. It does not use the $\gamma_{iIR}$ parameter of the second equation in (20), and it uses the first equation the other way round, now determining C from S, not S from C. The bracket $1 - \gamma_{iCR} - \gamma_{iCD}$ is positive if $\gamma_{iCR} + \gamma_{iCD} < 1$, which is enough for practical purposes.

Figure 9 shows predictions on day 109, May 10th, for Germany, Sweden, US and UK, from the top. The plots for countries behind their peak are rather similar to the one for Germany. The other three countries are selected because they still have to face their peak if no action is taken to change the parameters. The estimated $R_0$ values are 0.59, 1.37, 3.34, and 1.20, respectively. Note that these are not directly comparable to Figure 6, because they are constants fitted to a backlog of a week, using (21) and (23) instead of (12). The black and magenta curves are the estimated M and H values, while the S values are hardly visible at the top.
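Summarizing sections 4.5 and 4.6 in code: a hedged sketch of the backward reconstruction of S, M, H via (20)-(23) and of the forward recursion (24); the exact update order in the author's MATLAB programs may differ.

```python
import numpy as np

def reconstruct_hidden(C, N, detection, gamma_iIR, H0=0.0):
    """Estimate S, M, H from the Confirmed C via the relations (20);
    `detection` is gamma_IF/gamma_CF; the balance N = S+M+H+C defines M."""
    T = len(C)
    S, M, H = np.empty(T), np.empty(T), np.empty(T)
    S[0] = N - C[0] / detection       # cumulative infections = C/detection
    H[0] = H0                         # starting value, largely irrelevant
    M[0] = N - S[0] - H[0] - C[0]
    for n in range(T - 1):
        S[n+1] = S[n] - (C[n+1] - C[n]) / detection  # first relation in (20)
        H[n+1] = H[n] + gamma_iIR * M[n]             # second relation in (20)
        M[n+1] = N - S[n+1] - H[n+1] - C[n+1]        # balance, cf. (22)
    beta_n = (S[:-1] - S[1:]) * N / (S[:-1] * M[:-1])  # (23)
    gamma_n = np.diff(H + C) / M[:-1]                  # (21)
    return S, M, H, beta_n, gamma_n

def predict(S, M, HC, C, R, D, I, beta, gamma, detection,
            g_iCR, g_iCD, days):
    """Forward recursion (24) in 'no political change' mode."""
    N = S + M + HC
    path = []
    for _ in range(days):
        dS = beta * S / N * M             # daily new infections
        S, M, HC = S - dS, M + dS - gamma * M, HC + gamma * M
        dC = detection * dS               # first relation of (20), reversed
        C, R, D = C + dC, R + g_iCR * I, D + g_iCD * I
        I = (1 - g_iCR - g_iCD) * I + dC  # bracket positive in practice
        path.append((S, M, HC, C, I, R, D))
    return path
```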
The hidden M and H in black and magenta follow roughly the observable I and C in blue and cyan, but with a factor due to the detection rate, which is different between countries; see Table 1.

To evaluate the predictions, one should go back and start the predictions at earlier days, to compare with what happened later. Figure 11 shows overplots of predictions for days 81, 95, and 109, each a fortnight apart. The starting points of the predictions are marked by asterisks again. Now each prediction has slightly different estimates for S, M, and H due to the different available data. Recall that the determination of these variables is done while there are Johns Hopkins data available, following section 4.5, and will be dependent on the data-driven estimations described there. In particular, the case fatality rates and detection rates of Table 1 change with the starting point of the prediction, and they determine S, M, and H backwards; see the figure for Sweden.

All test runs were made with the infection fatality rate $\gamma_{IF} = 0.005$, the delay k = 21 for estimating case fatalities, and a backlog of 7 days when estimating constants out of recent values of time series. The choice $\gamma_{IF} = 0.005$ lies somewhere between $0.56\%$ from [1], $0.66\%$ from [11], and $0.37\%$ from [10]. New information on infection fatality rates should be included as soon as it is available. The Johns Hopkins data were smoothed by a double application of the 1/4, 1/2, 1/4 filter on the logarithms, as for Figure 6. For UK and Sweden, the data for the confirmed Recovered R were replaced by the 21-day rule estimation via [9] with a backlog of 42 days. The original data were too messy to be useful, unfortunately.

For an early case in Germany, Figure 12 shows the prediction based on data of day 67, March 27th. On the side, the figure contains a wide variation of the starting value H, with M = N - S - H - C following from the balance equation, by multipliers between 1/32 and 32. This has hardly any effect on the results. The peak of about 20 million Infected is predicted at day 111, May 12th, with roughly a million Confirmed and about 20000 casualties at the peak and about 100,000 finally. Note that the real death count is about 7500 on May 11th, and the prediction of that day, in Figure 9, targets a final count of below 10000. The parameter changes by political measures turned out to be rather effective, as in many countries that applied similar strategies. But since parts of the population want to go back to their previous lifestyle, all of this is endangered, and the figures should be monitored carefully.

Of course, all of this makes sense only under the assumption that reality follows the model, in spite of all attempts to design a model that follows reality. So far, the model presented here seems to be useful, combining theory and practically available data. It is data-driven to a very large extent, using only the infection fatality rate from outside for prediction, and the approximation (19) for calibration. On the downside, there is quite a number of shortcomings:
• The estimation of the hidden epidemics is calibrated to the latest data, but it needs changes as soon as new information on the hidden infections comes in.
• There may be better ways of estimating the hidden part of the epidemics. However, it will be easy to adapt the model to other parameter choices. If time series for the unknown variables become available, the model can easily be adapted to being data-driven by the new data.
• The treatment of delays is unsatisfactory.
In particular, infected persons get infectious immediately, and the k-day rule is not followed at all places in the model. But the rule is violated in the data as well [9].
• There is no stochastics involved, except for simple things like estimating constants by least squares, or for certain probabilistic arguments on the side, e.g. in section 4.3.1. But it is not at all clear whether there are enough data to do a proper probabilistic analysis.
• As long as there is no probabilistic analysis, there should be more simulations under perturbations of the data and the parameters. A few were included, e.g. for section 4.3.1 and Figures 8 and 12, and a large number was performed in the background when preparing the paper, making sure that results are stable, but there are never too many test simulations.
• Other models were not considered, e.g. the classical ones with delays [5, 6].
• Under certain circumstances, epidemics do not show an exponential outbreak, in particular if they hit only locally and meet a prepared population. See Figure 13 for the COVID-19 cases in Göttingen and vicinity.
• Estimates for the peak time in section 2.13 need improvement.
• The same holds for the underpinning of "flattening the curve" in section 2.14.

MATLAB programs are available on request.

References
[1] Modellierung von Beispielszenarien der SARS-CoV-2-Epidemie 2020 in Deutschland.
[2] Average detection rate of SARS-CoV-2 infections is estimated around six percent.
[3] Inferring COVID-19 spreading rates and potential change points for case number forecasts.
[4] The mathematics of infectious diseases.
[5] A problem in the theory of epidemics I.
[6] A problem in the theory of epidemics II.
[7] COVID-19 repository at GitHub, CSSEGISandData.
[8] SARS-CoV-2 Steckbrief zur Coronavirus-Krankheit-2019 (COVID-19).
[9] Modelling Recovered Cases and Death Probabilities for the COVID-19 Outbreak.
[10] Preliminary results from the Heinsberg outbreak, cited after Göttinger Tageblatt.
[11] Estimates of the severity of coronavirus disease 2019: a model-based analysis. www.thelancet.com/infection, published online.
[12] Wikipedia, entry on SIR-type epidemic models.
[13] Bloß raus hier! Article in the weekly journal DIE ZEIT.