key: cord-0547112-5j5r6dd5
authors: Hult, Henrik; Favero, Martina
title: Estimates of the proportion of SARS-CoV-2 infected individuals in Sweden
date: 2020-05-25
journal: nan
DOI: nan
sha: 293b43baab3290c7e39d7f3cd0693281a6898cc2
doc_id: 547112
cord_uid: 5j5r6dd5

In this paper a Bayesian SEIR model is studied to estimate the proportion of the population infected with SARS-CoV-2, the virus responsible for COVID-19. To capture heterogeneity in the population and the effect of interventions to reduce the rate of epidemic spread, the model uses a time-varying contact rate, whose logarithm has a Gaussian process prior. A Poisson point process is used to model the occurrence of deaths due to COVID-19 and the model is calibrated using data of daily death counts in combination with a snapshot of the the proportion of individuals with an active infection, performed in Stockholm in late March. The methodology is applied to regions in Sweden. The results show that the estimated proportion of the population who has been infected is around 13.5% in Stockholm, by 2020-05-15, and ranges between 2.5% - 15.6% in the other investigated regions. In Stockholm where the peak of daily death counts is likely behind us, parameter uncertainty does not heavily influence the expected daily number of deaths, nor the expected cumulative number of deaths. It does, however, impact the estimated cumulative number of infected individuals. In the other regions, where random sampling of the number of active infections is not available, parameter sharing is used to improve estimates, but the parameter uncertainty remains substantial.

To understand the spread of the novel coronavirus, SARS-CoV-2, at an aggregate level it is possible to model the dynamic evolution of the epidemic using standard epidemic models. Such models include the (stochastic) Reed-Frost model and more general Markov chain models, or the corresponding (deterministic) law of large numbers limits such as the general epidemic model, see [3] . There is an extensive literature on extensions of the standard epidemic models incorporating various degrees of heterogeneity in the population, e.g. age groups, demographic information, spatial dependence, etc. These additional characteristics make the models more realistic. For instance, it is possible to evaluate the effect of various intervention strategies. More complex models also involve additional parameters that need to be estimated, contributing to a higher degree of parameter uncertainty.

A problem when calibrating, even the standard epidemic models, to COVID-19 data is that there are few reliable sources on the number of infected individuals. Publicly available sources provide data on the number of positive tests, the number of hospitalizations, the number of ICU admission and the number of deaths due to COVID-19. In some cases, small random samples of an active infection may be available. For example, the Swedish Folkhälsomyndigheten performed such a test in Stockholm with about 700 subjects in early April 2020. Moreover, there is still no consensus in the literature on the value of important parameters such as the basic reproduction number R 0 and the infection fatality rate.

A useful approach to incorporate the parameter uncertainty in the models is to consider a Bayesian framework. In the Bayesian approach parameter uncertainty is quantified by prior distributions over the unknown parameters. The impact of observed data, in the form of a likelihood, yields, via Bayes' theorem, the posterior distribution, which quantifies the effects of parameter uncertainty. The posterior can be used to construct estimates on the number of infected individuals, predictions on the future occurrence of infections and deaths, as well as uncertainties in such estimates.

In this paper an SEIR epidemic model with time-varying contact rate will be used to model the evolution of the number of susceptible (S), exposed (E), infected (I), and recovered (R) individuals. A time varying contact rate is used to capture heterogeneity in the population, which causes the rate of the spread of the epidemic to vary as the virus spreads through the population. Moreover, the time varying contact rate allows modeling the effect of interventions aimed at reducing the rate of epidemic spread. A Poisson point process is introduced to model the occurrence and time of deaths. Random samples of tests for active infections are treated as binomial trials where the success probability is the proportion of the population in the infectious state.

The methods are illustrated on regional data of daily COVID-19 deaths in Sweden. It is demonstrated that, by combining the information in the observed number of deaths and random samples of active infections, fairly precise estimates on the number of infected individuals can be given. By assuming that some parameters are identical in several regions, estimates for regions outside Stockholm can also be provided, albeit with greater uncertainty.

Our approach is inspired by [4] where the authors considers a Bayesian approach to model an influenza outbreak. The main extensions include the introduction of the Poisson point process to model the occurrence of deaths, the addition of random sampling to test for infection, and an extension to multiple regions. To evaluate the posterior distribution we employ Markov chain Monte Carlo (MCMC) sampling. Samples from the posterior are obtained using the Hamiltonian Monte Carlo algorithm, NUTS, by [9] , implemented in the software Stan, which is an open source software for MCMC.

To model the spread of the epidemic we consider the deterministic SEIR model [1, 5] , which is a simple deterministic model describing the evolution of the number of susceptible, exposed, infected, and recovered individuals in a large homogeneous population with N individuals. The epidemic is modeled by {(S t , E t , I t ), t ≥ 0}, where S t , E t and I t represent the number of susceptible, exposed and infected individuals at time t, respectively. The total number of recovered and deceased individuals at time t ≥ 0 is always given by N − S t − E t − I t . The epidemic starts from a state S 0 , E 0 , I 0 with S 0 + E 0 + I 0 = N , and proceeds by updating,

The parameters are the contact rate β > 0, the rate ν of transition from the exposed to the infected state and the recovery rate γ > 0. Note that I t represents the number of individuals with an active infection at time t, whereas N − S t is the cumulative number of individuals who have been exposed, and possibly infected, recovered or deceased, up until time t.

In the context of the COVID-19 epidemic the contact rate cannot be assumed to be constant, primarily due to interventions implemented in the early stage of the epidemic. Moreover, as the SEIR model describes the evolution at an aggregate level, a time varying contact rate may be used to capture inhomogeneities in the population. If, for example, the epidemic is initiated in a rural area the contact rate may be rather low, but as the epidemic reaches major cities the contact rate will be higher. The resulting SEIR model with time varying contact rate is given by

Clearly, one needs to put some restriction on the amount of variation of the contact rate. In this paper a Gaussian process prior will be used on the log contact rate, which restricts the amount of variation in time, but is sufficiently flexible to capture the reduction in contact rate after the interventions.

When observations on the number of infected and recovered individuals are available, the model (1) can be fitted to these observations. In the context of COVID-19, observations on the number of infected and recovered individuals are unavailable. There are many symptomatic individuals who are not tested and potentially a large pool of asymptomatic individuals. In this paper we will rely on the number of registered deaths due to COVID-19 to calibrate the model. In addition we will incorporate the test results from a random sample that provides a snapshot on the number of individuals with an active infection.

To model the occurrence of deaths due to COVID-19 we consider the following Poisson point process representation. We refer to [10] for details on Poisson point processes. Let f denote the infection fatality rate, that is, the probability that an infected individual eventually dies from the infection. Consider the number of individuals that enters the infected state on day t, that is, νE t . Each such infected individual has probability f to eventually die from the infection. Conditional on death due to the infection, the time from infection until death is assumed independent of everything else and follows a probability distribution with probability mass function p s D . Each individual that dies may be represented as a point (t, τ ) in E := {(t, τ ) ∈ N 2 : τ ≥ t}, where t denotes the time of entry to the infected state and τ the time of death of the individual. The number of deaths at time τ can then be computed by counting the number of points on the line ∪ τ t=0 (t, τ ). The number of deaths, and the corresponding time of infection and time of death is conveniently modelled by a Poisson point process on E. Let ξ be a Poisson point process on E with intensity

We may interpret a point at (t, τ ) of the Poisson point process as the time of infection, t, and the time of death, τ , of an individual who dies from the infection.

The number of deaths D τ that occurs at time τ is then given by summing up all the points of the point process on the row corresponding to τ , ξ(∪ τ t=0 (t, τ )). Since the rows are disjoint this implies that D 0 , D 1 , . . . are independent with each D τ having a Poisson distribution with parameter

Throughout this paper p s D is the probability mass function of a negative binomial distribution with mean s D . More precisely, a parametrization of the negative binomial distribution with parameters r, s D will be used, where

The value r = 3 will be used throughout as this fits well with the distribution of observed duration from symptoms to death in the study by [15] .

In this section we provide the assumptions on the prior distributions and derive the expression of the likelihood of the model. Note that λ τ is a function of all the parameters of the model, θ = ({β t }, ν, γ, s 0 , f, s D ). The parameters, their interpretation and prior distribution are summarized in Table 1 . Actually, since the contact rate is positive, a Gaussian process (GP) prior will be used for the natural logarithm of the contact rate, denoted log-GP in the sequel. The Gaussian process has a constant mean µ and a squared-exponential covariance kernel k with parameters α, ρ, δ such that

To compute the likelihood the observed number of daily deaths, d 0 , d 1 , . . . , d T , Table 1 . Specification of the parameters and prior distributions.

will be used, in combination with a random sample of n 0 tests for active infection, performed at a time t 0 , when such test result is available. The number of individuals Z with positive test result has a Bin(n 0 , I t0 /N ) distribution. The full likelihood is given by:

The joint prior is the product of the marginal priors and leads, by Bayes's theorem, to the posterior,

The expected number of daily deaths λ τ , the cumulative number of deaths and the cumulative number of infected individuals N −S t are all functions of θ and their distribution can therefore be inferred from the posterior p Θ|D,Z . By sampling from p Θ|D and iterating the dynamics (1) estimates of these quantities may be obtained along with the effects of parameter uncertainty. Moreover, predictions on the future development of the above mentioned quantities can be obtained by extrapolating the contact rate into the future.

As the posterior distribution is unavailable in explicit form it is necessary to employ Monte Carlo methods. In the next section Markov chain Monte Carlo methods are briefly described to sample from the posterior. 4.1. Multiple regions. The SEIR model (1) and the derivation of the likelihood (3) considers a single region. In the context of multiple regions it may be reasonable to assume that some parameters are identical. For example, when considering multiple regions of Sweden below it will be assumed that the rate, ν, from exposed to infected, the recovery rate, γ, the infection fatality rate, f , and the duration, s D are identical in all regions. It is tempting to include interaction terms between the regions as infected individuals from one region may travel to another region and cause new infections. In this paper, it will be assumed that each region has its own time varying contact rate that incorporates fluctuations in new infections due to import cases from other regions.

The likelihood from multiple regions is simply the product of the marginal likelihood for the individual regions and the prior is the product of the marginal priors for each parameter. Thus, for two regions the prior will be the product of two Gaussian process priors for the respective log contact rates for the two regions and the product of the marginal priors for the remaining parameters.

Markov chain Monte Carlo (MCMC) methods in Bayesian analysis aims at sampling from the posterior distribution. This is non-trivial because the marginal distribution of the data, which acts as normalizing constant of the posterior is practically impossible to compute. In MCMC algorithms the posterior is represented as a target distribution. The algorithms rely on the construction of a Markov chain whose invariant distribution is the target distribution. Standard MCMC methods are based on acceptance-rejection steps, where random proposals are accepted or rejected with a probability that does not require knowledge of the normalizing constant, e.g., Metropolis-Hasting and Gibbs sampling [13, 8, 6] . When the target distribution is complex and multi-modal, standard methods may lead to poor mixing of the Markov chain and slow convergence to the target distribution.

To overcome slow mixing of the Markov chain gradient-based sampling can be applied, which adapt the proposal distribution based on gradients of the target, see e.g. [2] . In this paper we will employ a Hamiltonian Monte Carlo sampler, the No-U-turn sampler (NUTS) by [9] in combination with automatic differentiation to numerically approximate the gradients [7] , which is implemented in the open source software Stan.

In this section the estimates of the number of infected individuals and predictions on the evolution of the number of deaths and number of infected individuals are provided for ten regions of Sweden. The epidemic is considered to start on 2020-03-01 and interventions in Sweden began on 2020-03-16. The joint prior distribution is the product of the marginal priors, and the hyper-parameters are specified in Table  2 . The choices of hyper-parameter values are made in line with existing literature on the COVID-19 epidemic. As a general principle we have used informative priors on the parameters ν, γ, and s D , whereas the priors on the time-varying contact rate {β t } and the fatality rate f are uninformative. Folkhälsomyndigheten reports that the incubation period is usually around 5 days, which corresponds to 1/ν ≈ 5. Similarly the expected time to recovery is around 14 days, 1/γ ≈ 14. The overall infection fatality rate f is estimated to be in the range 0.003 − 0.013, see [14, 15] . However, since the infection fatality rate is a very important parameter we have used an uninformative prior and simply use a uniform prior, Beta(1, 1). The expected duration from symptoms to death is around 16 days, see [15] .

Samples from the posterior are obtained using the NUTS-sampler with a burn-in period of 500 samples and 5 000 samples after burn-in.

Mean Prior 95%-C.I. individuals performed between 2020-03-27 and 2020-04-03 1 . It showed that 18 individuals carried the SARS-CoV-2 virus. These results are included in the analysis as a binomial sample of size n 0 = 707 and success probability I t0 /N where the test date, t 0 , is assumed to be 2020-03-30.

A summary of the marginal posterior distributions is provided in Table 3 . The posterior distribution of the time varying contact rate is illustrated in Figure 1 . Note that although there is great uncertainty about the initial contact rate, the model clearly picks up the reduction in contact rate after the interventions began on 2020-03-16. The contact rate is gradually reduced around the time of intervention and then remains at a low level. This slow reduction of the contact is, however, not due to stiffness of the Gaussian process kernel. We have experimented with a sharp break-point in the contact rate at the time of intervention, but it did not provide more accurate results. On the contrary, the data suggests that the reduction of the contact rate is slow. The contact rate is estimated until 2020-05-01. After this date the posterior is unreliable. This is because many of the deaths of individuals who are infected after 2020-05- 01 have not yet been observed. For this reason, the contact rate is only estimated until 2020-05-01.

To perform estimates and predictions on the future number of daily and cumulative infections and deaths, the contact rate has been extrapolated from its value on 2020-05-01. The posterior distribution suggests that the contact rate is constant, at a low rate, since roughly 2020-04-07, which motivates extrapolation into the future, assuming that the interventions remains at the present level. After 2020-05-01 the contact rate is extrapolated, by assuming it will remain constant. Figure 2 (top left) shows the observed daily number of deaths (black dots) along with the posterior median (dark red) and 95% credibility interval (red) for the expected number of daily deaths. Figure 2 (top right) shows the observed cumulative number of deaths (black dots) along with the posterior median (dark red) and 95% credibility interval (red) for the expected cumulative number of deaths.

We observe that the parameter uncertainty does not substantially impact the expected number of daily deaths and the peak of the daily number of deaths appears to have occurred by mid April. Similarly, the expected cumulative number of deaths in Stockholm is likely to terminate slightly above 2000.

We emphasize that this is the expected number of deaths, λ τ . Since we are considering a Poisson distribution for the number of daily deaths an approximate 95%-prediction interval would be λ τ ± 2 √ λ τ , where λ τ is the Poisson parameter on day τ .

Note from the observed number of daily deaths that the empirical distribution of daily deaths appear to be overdispersed, the variance is substantially larger than the mean. This is likely due to reporting of the data. The data presented at https://c19.se/ does not correct the reporting of death dates in hindsight. A comparison at the national level with data provided by Folkhälsomyndigheten shows that the official records of the daily number of deaths for Sweden does not appear to be overdispersed. Nevertheless, even after smoothing the data from https://c19.se/ by a moving average over a few days, the results of the simulations remain essentially the same. Figure 2 (bottom left/right) shows the posterior median (dark red) and 95% credibility interval (red) for the daily/cumulative number of infected individuals. Although the parameter uncertainty has significant impact on the cumulative number of infected individuals, some conclusions are still possible. As of mid May, the cumulative number of infected individuals has almost reached its terminal value and the spread of the epidemic has slowed down significantly. The estimated cumulative number of infected individuals is 13.5% of the population in Stockholm. The estimated number of infected individuals by 2020-04-11 is 10.7%, showing that these results are well in line with the reports of the anti-body test performed at KTH 2 , which indicated that 10% of the population in Stockholm had developed anti-bodies against the SARS-CoV-2 virus by the first weeks of April.

We emphasize that the estimate of the cumulative number of infected individuals in Stockholm relies heavily on the inclusion of results from the random sampling performed by Folkhälsomyndigheten in late March, early April. Without this crucial piece of information similar models to the one analyzed here may provide a significantly higher estimate on the cumulative number of infected. Table 3 . Marginal posterior median and credibility intervals for Region Stockholm.

6.2. Summary of the results for ten regions of Sweden. In this section estimates of the cumulative number of infected individuals are provided for the following regions of Sweden:

(1) Stockholm (population: 2.34 · 10 6 ) (2) Västra Götaland (population: 1.71 · 10 6 ) (3)Östergötland (population: 1.36 · 10 6 ) (4)Örebro (population: 3.02 · 10 5 ) (5) Skåne (population: 1.36 · 10 6 ) (6) Jönköping (population: 3.56 · 10 5 ) (7) Sörmland (population: 2.95 · 10 5 ) (8) Västmanland (population: 2.74 · 10 5 ) (9) Uppsala (population: 3.76 · 10 5 ) (10) Dalarna (population: 2.87 · 10 5 ) The daily death counts for the regions of Sweden until 2020-05-15 are obtained from the webpage: https://c19.se/. There is no random testing providing information on the proportion of infected individuals outside Region Stockholm. To estimate the contact rate and the cumulative number of infected individuals in regions outside Stockholm, we have implemented the multi-region model pairwise, with two regions in each MCMC simulation, where one region is Stockholm and the other region is from the list above. It is assumed that the parameters ν, γ, f, and s D are identical in both regions, but the time varying contact rate and the initial proportion of susceptible individuals are different between the regions. The posterior of the contact rates for the different regions are provided in Figure Table 4 along with 95% credibility intervals. Overall the proportions are low, far from herd immunity.

Prop Observed daily number of deaths (black dots), the posterior median (dark red) and 95% credibility interval for the expected daily number of deaths. Top right: Observed cumulative number of deaths (black dots), the posterior median (dark red) and 95% credibility interval for the expected cumulative number of deaths. Bottom left: The posterior median (dark red) and 95% credibility interval for the daily number of infected individuals. Bottom right: The posterior median (dark red) and 95% credibility interval for the cumulative number of infected individuals.

Infectious Diseases of Humans: Dynamics and Control

The geometric foundations of Hamiltonian Monte Carlo

Basic estimation-prediction techniques for COVID-19, and a prediction for Stockholm

Contemporary statistical inference for infectious disease models using Stan

Mathematical tools for understanding infectious disease dynamics

Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images

Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation

Monte Carlo sampling methods using Markov Chains and their applications

The No-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo

Random Measures

Early transmission dynamics in Wuhan, China, of novel Coronavirus Infected Pneumonia

Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (SARS-CoV2)

Equation of state calculations by fast computing machines

Estimates of the severity of coronavirus disease 2019: a model-based analysis The Lancet Infections Diseases

Estimating clinical severity of COVID-19 from the transmission dynamics in Wuhan, China