key: cord-0980585-cq6mivr9
authors: Ganyani, Tapiwa; Kremer, Cecile; Chen, Dongxuan; Torneri, Andrea; Faes, Christel; Wallinga, Jacco; Hens, Niel
title: Estimating the generation interval for COVID-19 based on symptom onset data
date: 2020-03-08
journal: nan
DOI: 10.1101/2020.03.05.20031815
sha: af266fac8970a7960e96630a67d91bec5dda0335
doc_id: 980585
cord_uid: cq6mivr9

Background: Estimating key infectious disease parameters from the COVID-19 outbreak is essential for modelling studies and guiding intervention strategies. Whereas different estimates for the incubation period distribution and the serial interval distribution have been reported, estimates of the generation interval for COVID-19 have not been provided. Methods: We used outbreak data from clusters in Singapore and Tianjin, China to estimate the generation interval from symptom onset data while acknowledging uncertainty about the incubation period distribution and the underlying transmission network. From those estimates we obtained the proportions of pre-symptomatic transmission and the reproduction numbers. Results: The mean generation interval was 5.20 (95%CI 3.78-6.78) days for Singapore and 3.95 (95%CI 3.01-4.91) days for Tianjin, China when relying on a previously reported incubation period with mean 5.2 and SD 2.8 days. The proportion of pre-symptomatic transmission was 48% (95%CI 32-67%) for Singapore and 62% (95%CI 50-76%) for Tianjin, China. Estimates of the reproduction number based on the generation interval distribution were slightly higher than those based on the serial interval distribution. Conclusions: Estimating generation and serial interval distributions from outbreak data requires careful investigation of the underlying transmission network. Detailed contact tracing information is essential for correctly estimating these quantities.

To plan intervention strategies aimed at bringing disease outbreaks such as the COVID-19 outbreak under control, as well as to monitor such outbreaks, public health officials depend on insights about key disease transmission parameters, which are typically obtained from mathematical or statistical modelling. Examples of key parameters include the reproduction number (average number of infections caused by an infectious individual) and the distributions of the generation interval (time between infection events in an infector-infectee pair), serial interval (time between symptom onsets in an infector-infectee pair), and incubation period (time between moment of infection and symptom onset) [1]. Estimates of the reproduction number together with the generation interval distribution can provide insight into the speed with which a disease will spread. Estimates of the incubation period distribution, on the other hand, can help determine appropriate quarantine periods.

As soon as line lists were made available, statistical and mathematical modelling was used to quantify these key epidemiological parameters. Li et al. [2] estimated the basic reproduction number (using a renewal equation) to be 2.2 (95% CI 1.4-3.9), the serial interval distribution to have a mean of 7.5 days (95% CI 5.5-19) based on 6 observations, and the incubation period distribution to have a mean of 5.2 days (95% CI 4.1-7.0) based on 10 observations. Other studies estimated the incubation period distribution to have a mean of 6.4 days (95% CI 5.6-7.7) [3], a median of 5 days (95% CI 4.0-5.8) [4], a mean of 5.2 days (range 1.8-12.4 days) [5], and a mean of 4.8 days (range 2-11 days) [6].

When the incubation period does not change over the course of the epidemic, the serial and generation interval distributions are expected to have equal means but different variances [7]. It has recently been shown that ignoring the difference between the serial and generation interval can lead to biased estimates of the reproduction number. More specifically, when the serial interval distribution has larger variance than the generation interval distribution, using the serial interval as a proxy for the generation interval will lead to an underestimation of the basic reproduction number R. When R is underestimated, this may lead to prevention policies that are insufficient to stop disease spread [7].

The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. medRxiv preprint doi: https://doi.org/10.1101/2020.03.05.20031815

The best-known method to estimate the serial interval distribution from line list data is the likelihood-based estimation method proposed by Wallinga and Teunis [8]. In 2012, Hens et al. [9] proposed using the Expectation-Maximisation (EM) algorithm to estimate the generation interval distribution from incomplete line list data, building on the method of Wallinga and Teunis [8] and allowing auxiliary information to be used in assigning potential infector-infectee pairs. Te Beest et al. [10] used a Markov chain Monte Carlo (MCMC) approach as an alternative to the EM algorithm, to facilitate taking into account uncertainty related to the dates of symptom onset. In this paper, we use an MCMC approach to estimate, next to the serial interval distribution, the generation interval distribution upon specification of the incubation period distribution. We compare the impact of differences amongst previous estimates of the incubation period distribution for COVID-19 and analyse data on clusters of confirmed cases from Singapore (January 21 to February 26) and Tianjin, China (January 14 to February 27).

The data used in this paper consist of symptom onset dates and cluster information for confirmed cases in Singapore and Tianjin, China.

As of February 26th, 91 confirmed COVID-19 cases had been reported in Singapore. Detailed information on age, sex, known travel history, time of symptom onset, and known contacts is available for 54 of these cases (link: https://www.moh.gov.sg/newshighlights/, last accessed February 26th). For cases with no infector information available, it is assumed that they could have been infected by any other case within the same cluster. Cases known to be Chinese/Wuhan nationals or known to have been in close contact with a Chinese/Wuhan national are labeled as index cases. All other cases are assumed to have been infected locally.

As of February 27th, 135 confirmed cases had been reported by the Tianjin municipal health commission. Data on these cases are available in official daily reports (link: http://wsjk.tj.gov.cn/col/col87/index.html, last accessed February 27th) and include age, sex, relationship to other known cases, and travel history to risk areas in and outside Hubei province, China. In these data, 114 cases can be traced to one of 16 clusters. The largest cluster, consisting of 45 cases, can be traced to a shopping mall in Baodi district.

Through contact investigations, potential transmission links were identified for cases who had close contacts. Travel history information was used to identify some individuals as import cases. For cases with no infector information available, it is assumed that they could have been infected by any other case within the same cluster. 

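Assuming the incubation period is independent of the infection time, the serial interval of individual $i$, $Z_i$, can be written as the sum of the generation interval of individual $i$, $X_i$, and the difference $Y_i$ between the incubation period of individual $i$, $\delta_i$, and the incubation period of its infector $v(i)$, $\delta_{v(i)}$, so that its distribution is a convolution [7], i.e.,

$$Z_i = X_i + Y_i, \qquad Y_i = \delta_i - \delta_{v(i)}.$$

The random variables $X_i$ and $\delta_i$ are positive and are both assumed to be independent and identically distributed, i.e., $X_i \sim f(x; \Theta_1)$ and $\delta_i \sim k(\delta; \Theta_2)$, so that $Y_i \sim g(y; \Theta_2)$.

Because $Y_i$ is the difference of two identically distributed incubation periods, it has mean zero. It follows that $X_i$ and $Z_i$ have the same mean and that the latter has a larger variance and can be negative.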
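The statement that the generation and serial intervals share a mean, while the serial interval is more variable and can be negative, can be checked with a short simulation. The Python sketch below is illustrative only: the generation interval parameters are hypothetical, and the incubation period uses the gamma parameters (shape 3.45, rate 0.66) that the paper fixes later in the Methods.

```python
import random
import statistics

random.seed(1)

def rgamma(shape, rate):
    """Draw from a gamma distribution in shape/rate form (stdlib expects scale = 1/rate)."""
    return random.gammavariate(shape, 1.0 / rate)

# Hypothetical generation interval parameters, chosen only for illustration (mean 5 days).
GI_SHAPE, GI_RATE = 9.0, 1.8
# Incubation period parameters quoted in the paper (mean 5.2, SD 2.8 days).
INC_SHAPE, INC_RATE = 3.45, 0.66

n = 20_000
# Generation intervals X_i.
x = [rgamma(GI_SHAPE, GI_RATE) for _ in range(n)]
# Y_i = delta_i - delta_v(i): difference of two independent incubation periods.
y = [rgamma(INC_SHAPE, INC_RATE) - rgamma(INC_SHAPE, INC_RATE) for _ in range(n)]
# Serial intervals Z_i = X_i + Y_i.
z = [xi + yi for xi, yi in zip(x, y)]

mean_x, mean_z = statistics.fmean(x), statistics.fmean(z)
var_x, var_z = statistics.variance(x), statistics.variance(z)
share_negative = sum(zi < 0 for zi in z) / n

print(f"mean X = {mean_x:.2f}, mean Z = {mean_z:.2f}")
print(f"var  X = {var_x:.2f}, var  Z = {var_z:.2f}")
print(f"share of negative serial intervals = {share_negative:.3f}")
```

The two means agree up to simulation noise, whereas the variance of Z exceeds that of X by roughly twice the incubation period variance.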

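The observed serial interval, $z_i$, can be expressed in terms of the latent variables as

$$z_i = x_i + y_i.$$

The density function $h(\cdot)$ of the serial interval is given by [11],

$$h(z; \Theta_1, \Theta_2) = \int_{-\infty}^{\infty} f(z - y; \Theta_1)\, g(y; \Theta_2)\, dy.$$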

In general, $h(z; \Theta_1, \Theta_2)$ and $g(y; \Theta_2)$ have no closed form for arbitrary choices of $f(x; \Theta_1)$ and $k(\delta; \Theta_2)$. Monte Carlo methods [12] can be used to estimate $h(z; \Theta_1, \Theta_2)$ as follows,

$$\hat{h}(z; \Theta_1, \Theta_2) = \frac{1}{J} \sum_{j=1}^{J} f(z - y_j; \Theta_1),$$

where $y_j$ is the $j$th Monte Carlo sample drawn from $g(y; \Theta_2)$ and $J$ is the number of samples. When all infector-infectee pairs are observed, the likelihood function is given by,

$$L(\Theta \mid z_i, v(i)) = \prod_{i=1}^{N} h(z_i; \Theta_1, \Theta_2),$$

where $\Theta = (\Theta_1, \Theta_2)$ and $N$ is the number of infector-infectee pairs.
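A minimal Python sketch of this Monte Carlo estimate and the resulting log-likelihood is given below. It assumes gamma densities for f and k (as the paper does later on), and the observed serial intervals and generation interval parameters are hypothetical placeholders. The authors' own implementation is in R and may differ in detail.

```python
import math
import random

random.seed(2)

def gamma_pdf(x, shape, rate):
    """Gamma density in shape/rate form; zero for non-positive x."""
    if x <= 0:
        return 0.0
    log_pdf = (shape * math.log(rate) - math.lgamma(shape)
               + (shape - 1) * math.log(x) - rate * x)
    return math.exp(log_pdf)

def draw_y(n, inc_shape=3.45, inc_rate=0.66):
    """Draw Y = delta_i - delta_v(i), the difference of two independent incubation periods."""
    scale = 1.0 / inc_rate
    return [random.gammavariate(inc_shape, scale) - random.gammavariate(inc_shape, scale)
            for _ in range(n)]

def h_hat(z, gi_shape, gi_rate, y_samples):
    """Monte Carlo estimate of h(z): the average of f(z - y_j) over the draws y_j."""
    return sum(gamma_pdf(z - yj, gi_shape, gi_rate) for yj in y_samples) / len(y_samples)

def log_likelihood(serial_intervals, gi_shape, gi_rate, y_samples):
    """Log-likelihood when every infector-infectee pair is observed."""
    return sum(math.log(h_hat(z, gi_shape, gi_rate, y_samples)) for z in serial_intervals)

y_samples = draw_y(5_000)
observed = [3.0, 5.0, -1.0, 7.0, 4.0]   # hypothetical serial intervals in days
print(log_likelihood(observed, gi_shape=9.0, gi_rate=1.8, y_samples=y_samples))
```

Reusing the same draws `y_samples` for every evaluation keeps the estimated likelihood surface smooth in the generation interval parameters, which is convenient when it is evaluated repeatedly inside a sampler.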

To account for uncertainty in the transmission links, we resort to a Bayesian framework in which missing links are imputed [10] (see Subsection 2.3). The likelihood function is then given by $L(\Theta, v(i)_{\text{missing}} \mid z_i, v(i))$.

We use the Bayesian method described in te Beest et al. [10] for parameter estimation. This method proceeds in two steps. The first step updates the missing links $v(i)_{\text{missing}}$ and the second step updates the parameter vector $\Theta_1$, i.e., the parameters of the generation interval distribution. We assume that both the generation interval and the incubation period are gamma distributed, i.e., $f(x; \Theta_1) \equiv \Gamma(\alpha_1, \beta_1)$ and $k(\delta; \Theta_2) \equiv \Gamma(\alpha_2, \beta_2)$.

The parameter vector $\Theta_2$ is fixed to $(\alpha_2 = 3.45, \beta_2 = 0.66)$, corresponding to an incubation period with a mean of 5.2 and standard deviation (SD) of 2.8 days [5].
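The stated mean and SD are reproduced when the two gamma parameters are read as shape and rate, which is an assumption made here because the text does not name the parametrisation. A quick check in Python:

```python
import math

def gamma_mean_sd(shape, rate):
    """Mean and standard deviation of a gamma distribution in shape/rate form."""
    return shape / rate, math.sqrt(shape) / rate

mean, sd = gamma_mean_sd(3.45, 0.66)
print(f"mean = {mean:.2f} days, SD = {sd:.2f} days")   # mean = 5.23 days, SD = 2.81 days
```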

Minimally informative uniform priors are assigned to the parameters of the generation interval distribution, i.e., $\alpha_1 \sim U(0, 30)$ and $\beta_1 \sim U(0, 20)$. For cases with multiple potential infectors, the possible links $v(i)_{\text{missing}}$ are assigned equal prior probabilities. The missing links are updated using an independence sampler, whereas $\Theta_1$ is updated using a random-walk Metropolis-Hastings algorithm with a uniform proposal distribution [12].

We evaluate the posterior distribution using 3 000 000 iterations, of which the first 500 000 are discarded as burn-in. Thinning is applied by taking every 200th iteration. The serial interval distribution is obtained by simulating 1 000 000 draws from $h(z; \Theta_1, \Theta_2)$.
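The two-step scheme can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the authors' R implementation. The line list, candidate infectors, starting values, and proposal widths are hypothetical, and the chain is far shorter than the one described above.

```python
import math
import random

random.seed(3)

def gamma_pdf(x, shape, rate):
    """Gamma density in shape/rate form; zero for non-positive x."""
    if x <= 0:
        return 0.0
    return math.exp(shape * math.log(rate) - math.lgamma(shape)
                    + (shape - 1) * math.log(x) - rate * x)

# Draws of Y = delta_i - delta_v(i) under the fixed incubation period (shape 3.45, rate 0.66).
Y = [random.gammavariate(3.45, 1 / 0.66) - random.gammavariate(3.45, 1 / 0.66)
     for _ in range(200)]

def log_h(z, theta):
    """Log of the Monte Carlo estimate of the serial interval density h(z)."""
    dens = sum(gamma_pdf(z - y, *theta) for y in Y) / len(Y)
    return math.log(dens) if dens > 0 else -math.inf

def accept(log_ratio):
    """Metropolis acceptance step for a given log acceptance ratio."""
    return random.random() < math.exp(min(0.0, log_ratio))

# Hypothetical line list: symptom onset day per case and candidate infectors per case.
onset = {"A": 0, "B": 4, "C": 6, "D": 9, "E": 11}
candidates = {"B": ["A"], "C": ["A", "B"], "D": ["B", "C"], "E": ["C", "D"]}

def run_chain(n_iter, burn_in, thin, step=(1.0, 0.3)):
    theta = (5.0, 1.0)                                  # starting (alpha_1, beta_1)
    links = {i: c[0] for i, c in candidates.items()}    # starting infector per case
    ll = {i: log_h(onset[i] - onset[v], theta) for i, v in links.items()}
    kept = []
    for it in range(n_iter):
        # Step 1: independence sampler for the links; proposals come from the
        # equal-probability prior, so the acceptance ratio is the likelihood ratio.
        for i, cands in candidates.items():
            v_new = random.choice(cands)
            ll_new = log_h(onset[i] - onset[v_new], theta)
            if accept(ll_new - ll[i]):
                links[i], ll[i] = v_new, ll_new
        # Step 2: random-walk Metropolis-Hastings for Theta_1 with a uniform proposal.
        prop = tuple(t + random.uniform(-s, s) for t, s in zip(theta, step))
        if 0 < prop[0] < 30 and 0 < prop[1] < 20:       # support of the uniform priors
            ll_prop = {i: log_h(onset[i] - onset[v], prop) for i, v in links.items()}
            if accept(sum(ll_prop.values()) - sum(ll.values())):
                theta, ll = prop, ll_prop
        # Keep every `thin`-th draw after burn-in.
        if it >= burn_in and (it - burn_in) % thin == 0:
            kept.append(theta)
    return kept

samples = run_chain(n_iter=300, burn_in=100, thin=5)
gi_means = [a / b for a, b in samples]                  # draws of the mean generation interval
print(len(samples), sum(gi_means) / len(gi_means))
```

With the settings reported in the text, the same burn-in and thinning rule would retain (3 000 000 - 500 000) / 200 = 12 500 draws.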

All analyses were performed using R; datasets and code are available on GitHub (https://github.com/cecilekremer/COVID19).


The proportion of pre-symptomatic transmission is calculated as $p = P(X_i < \delta_{v(i)})$, i.e., pre-symptomatic transmission occurs when the generation interval is shorter than the incubation period of the infector. This proportion was obtained by simulating values from the estimated generation interval and incubation period distributions (assuming a mean incubation time of 5.2 days [5]).
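As an illustration of this simulation, the Python sketch below estimates p for a single pair of generation interval parameters. It assumes a gamma generation interval matched to the Singapore point estimates reported in the paper (mean 5.20, SD 1.72 days) and the incubation period parameters (shape 3.45, rate 0.66). Plugging in point estimates is a simplification, so the result need not coincide with the posterior summaries reported in the paper.

```python
import random

random.seed(4)

def presymptomatic_share(gi_shape, gi_rate, inc_shape, inc_rate, n=50_000):
    """Estimate p = P(X < delta): generation interval shorter than the infector's incubation period."""
    hits = 0
    for _ in range(n):
        x = random.gammavariate(gi_shape, 1.0 / gi_rate)        # generation interval
        delta = random.gammavariate(inc_shape, 1.0 / inc_rate)  # infector's incubation period
        hits += x < delta
    return hits / n

# Moment matching: shape = (mean / SD)^2 and rate = mean / SD^2.
gi_mean, gi_sd = 5.20, 1.72
gi_shape, gi_rate = (gi_mean / gi_sd) ** 2, gi_mean / gi_sd ** 2

p = presymptomatic_share(gi_shape, gi_rate, inc_shape=3.45, inc_rate=0.66)
print(f"p = {p:.2f}")
```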

The reproduction number is calculated as $R = e^{r\mu - \frac{1}{2} r^2 \sigma^2}$, where $r$ denotes the exponential growth rate estimated from the initial phase of the outbreak, and $\mu$ and $\sigma^2$ are the mean and variance of the generation or serial interval distribution [13].
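A small Python example shows how the formula behaves. It uses the growth rate of 0.04 per day and the generation interval mean and SD reported for Singapore elsewhere in the paper. The wider SD of 4.0 days is a hypothetical value, included only to show the direction of the effect when a more variable interval is substituted.

```python
import math

def reproduction_number(r, mean, sd):
    """R = exp(r * mu - 0.5 * r^2 * sigma^2) for growth rate r and interval mean/SD."""
    return math.exp(r * mean - 0.5 * r ** 2 * sd ** 2)

r = 0.04                                             # growth rate reported for Singapore
R_gi = reproduction_number(r, mean=5.20, sd=1.72)    # generation interval point estimates
R_wide = reproduction_number(r, mean=5.20, sd=4.0)   # same mean, hypothetical larger SD

print(f"R with SD 1.72 days: {R_gi:.3f}")
print(f"R with SD 4.00 days: {R_wide:.3f}")
```

For a fixed mean and growth rate, a larger variance lowers R, which is why substituting the more variable serial interval for the generation interval leads to underestimation.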

Confidence intervals for p and R are calculated by evaluating p and R at each iteration of the converged MCMC chain, i.e., at each mean/variance pair of the posterior generation/serial interval distribution. The 95% confidence intervals are given by the 2.5% and 97.5% quantiles of the resulting distributions.
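The quantile-based interval can be sketched in Python as below. The posterior draws are synthetic stand-ins generated for illustration only. They are not the paper's MCMC output, and the growth rate of 0.04 is the Singapore value quoted elsewhere in the paper.

```python
import math
import random
import statistics

random.seed(5)

def reproduction_number(r, mean, var):
    """R = exp(r * mu - 0.5 * r^2 * sigma^2)."""
    return math.exp(r * mean - 0.5 * r ** 2 * var)

# Synthetic stand-in for retained MCMC draws of (mean, variance); NOT the paper's posterior.
draws = [(random.gauss(5.2, 0.7), random.uniform(1.0, 9.0)) for _ in range(5_000)]

r = 0.04
R_draws = [reproduction_number(r, m, v) for m, v in draws]   # R evaluated at each draw

# With n=40, statistics.quantiles returns the 2.5%, 5%, ..., 97.5% cut points.
cuts = statistics.quantiles(R_draws, n=40)
lower, upper = cuts[0], cuts[-1]
print(f"median R = {statistics.median(R_draws):.2f}, 95% interval = ({lower:.2f}, {upper:.2f})")
```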

As sensitivity analyses, we investigate the robustness of our estimates of the generation interval distribution to the choice of different incubation period distributions. In particular, we fix $\Theta_2$ to $(\alpha_2 = 7.74, \beta_2 = 1.21)$ and $(\alpha_2 = 4.36, \beta_2 = 0.91)$, corresponding to an incubation period with mean and SD (6.4, 2.3) days [3] and (4.8, 2.6) days [6], respectively.
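The correspondence between gamma parameters and moments can be checked in both directions with the Python sketch below, which assumes a shape/rate parametrisation. Under that reading the first pair reproduces (6.4, 2.3) days. The second pair reproduces the mean of 4.8 days but gives an SD of about 2.3 rather than 2.6 days, so that value may be worth checking against [6].

```python
import math

def gamma_from_mean_sd(mean, sd):
    """Shape and rate of the gamma distribution with the given mean and SD."""
    return (mean / sd) ** 2, mean / sd ** 2

def gamma_mean_sd(shape, rate):
    """Mean and SD of a gamma distribution in shape/rate form."""
    return shape / rate, math.sqrt(shape) / rate

# Parameter pairs as quoted in the text, keyed by the cited reference.
quoted = {"[3]": (7.74, 1.21), "[6]": (4.36, 0.91)}
implied = {label: gamma_mean_sd(shape, rate) for label, (shape, rate) in quoted.items()}

for label, (mean, sd) in implied.items():
    print(f"{label}: mean = {mean:.2f} days, SD = {sd:.2f} days")

# Going the other way: parameters implied by a mean of 6.4 and SD of 2.3 days.
print(gamma_from_mean_sd(6.4, 2.3))
```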

In our main (i.e., baseline) analyses, missing serial intervals were only allowed to be positive, i.e., the symptom onset time of the infector has to occur before that of the infectee. However, given that pre-symptomatic transmission is possible, this can be deemed an unrealistic assumption. Therefore, we assess the impact of allowing for negative serial intervals on our estimates of the generation interval distribution.
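The difference between the two scenarios comes down to which cases are admitted as potential infectors. The Python sketch below uses a hypothetical line list. The record layout and the function name `candidate_infectors` are illustrative and are not taken from the authors' code.

```python
from collections import defaultdict

# Hypothetical case records: (case id, cluster, symptom onset day, is index case).
cases = [
    ("c1", "church", 1, True),
    ("c2", "church", 4, False),
    ("c3", "church", 3, False),
    ("c4", "mall", 2, True),
    ("c5", "mall", 5, False),
]

def candidate_infectors(cases, allow_negative=False):
    """Map each locally infected case to the cases in its cluster that may have infected it.

    With allow_negative=False (baseline), a candidate must have an earlier symptom
    onset, so every possible serial interval is positive.
    """
    by_cluster = defaultdict(list)
    for case in cases:
        by_cluster[case[1]].append(case)
    result = {}
    for cid, cluster, onset, is_index in cases:
        if is_index:
            continue  # assumption: index cases are not assigned an infector here
        result[cid] = [other for other, _, other_onset, _ in by_cluster[cluster]
                       if other != cid and (allow_negative or other_onset < onset)]
    return result

print(candidate_infectors(cases))
print(candidate_infectors(cases, allow_negative=True))
```

In this toy line list, case `c3` can only have been infected by `c1` in the baseline scenario, whereas allowing negative serial intervals adds `c2`, whose symptoms started later, as a second candidate.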

To further assess the robustness of the estimated generation interval distribution, for each dataset, we fit the model to data from the largest cluster. In the Tianjin dataset, the largest cluster is the shopping mall cluster consisting of 45 cases. In the Singapore dataset this is the Grace Assembly of God cluster consisting of 25 cases.

3 Results

Table 1 shows parameter estimates of the generation and serial interval distributions for each dataset, assuming an incubation period with mean 5.2 and SD 2.8 days. The mean generation time is estimated to be 5.20 days (95%CI 3.78-6.78) for the Singapore data, and 3.95 days (95%CI 3.01-4.91) for the Tianjin data. As expected, the estimated means of the generation interval and serial interval distributions are approximately equal but the latter has a larger variance.

it is estimated to be 2.57 days. Table 5 shows the proportions of pre-symptomatic transmission and reproduction numbers for each dataset. Pre-symptomatic transmission is higher when allowing for negative serial intervals for cases with no known infector. The reproduction number is lower when estimated using the serial interval compared to when using the generation interval.

We estimated the generation time to have a mean of 5.20 (95%CI 3.78-6.78) days and a standard deviation of 1.72 (95%CI 0.91-3.93) days for the Singapore data, and a mean of 3.95 (95%CI 3.01-4.91) days for the Tianjin data.

As expected, the proportion of pre-symptomatic transmission increases from 48% (95%CI 32-67%) in the baseline scenario to 66% (95%CI 45-84%) when allowing for negative serial intervals, for the Singapore data, and from 62% (95%CI 50-76%) to 77% (95%CI 65-87%) for the Tianjin data. Hence, a substantial proportion of transmission appears to occur before symptom onset, which is an important point to consider when planning intervention strategies. We also estimated $R_0$, solely to illustrate the bias that occurs when using the serial interval as a proxy for the generation interval [7]. Whereas the impact was limited for our analyses, estimates based on the generation interval are larger and should be preferred to inform intervention policies. Indeed, as expected, the reproduction number was underestimated when using the serial interval distribution, which is more variable than the generation interval distribution.

Tindale et al. [14] recently estimated the mean serial interval for COVID-19 to be 4.56 (95%CI 2.69-6.42) days for Singapore and 4.22 (95%CI 3.43-5.01) days for Tianjin. Although these estimates are different from the ones we report, they fall within the uncertainty ranges we obtained. An important advantage of our method is that we are able to infer the generation interval distribution while allowing serial intervals to be negative. Our estimates of R are smaller than the ones reported by Tindale et al. [14], because we use a different estimate of the growth rate r (0.04 for Singapore and 0.12 for Tianjin, as obtained from the initial exponential growth phase in each dataset, compared to 0.15 used by [14]).

Another advantage of our method is that we can derive a proper variance estimate for the generation interval, rather than the overly large variance estimate obtained when the serial interval is used as a proxy for the generation interval. Furthermore, in theory we do not need to condition on the order of symptom onset times. However, when the data do not provide sufficient information on the directionality of transmission, this lack of auxiliary information may cause problems for estimation.

This study does have some limitations. First, we rely on previous estimates for the incubation period. However, sensitivity analyses show that changing the incubation period distribution does not have a big impact on our estimates of the generation interval distribution. Second, we do not account for incomplete reporting or possible changes in reporting over the course of the epidemic. Third, we do not acknowledge changes in contact processes and thus behavioral change, which could shape realised generation interval distributions as well as serial interval distributions (unpublished work). Fourth, we do not account for contraction of the generation interval because of depletion of susceptibles. Future work should take into account these shortcomings.

Infection control for the COVID-19 epidemic relies on case-based measures such as finding cases and tracing contacts. A variable that determines how effective these case-based measures are is the proportion of pre-symptomatic transmission. Our estimates of this proportion are high, ranging from 0.48 to 0.77. This implies that case finding and contact tracing will be considerably less effective in preventing COVID-19 infections than in preventing SARS or MERS infections, where pre-symptomatic transmission did not play an important role (see e.g. [15]). It is unlikely that these measures alone will suffice to control the COVID-19 epidemic. Additional measures, such as social distancing, are required.


[1] Handbook of infectious disease data analysis

[2] Early transmission dynamics in Wuhan, China, of novel coronavirus-infected pneumonia

[3] Incubation period of 2019 novel coronavirus (2019-nCoV) infections among travellers from Wuhan. Eurosurveillance 25

[4] Epidemiological characteristics of novel coronavirus infection: A statistical analysis of publicly available case data

[5] Evolving epidemiology of novel coronavirus diseases 2019 and possible interruption of local transmission outside Hubei Province in China: a descriptive and modeling study

[6] Transmission dynamics of 2019 novel coronavirus (2019-nCoV)

[7] Estimation in emerging epidemics: biases and remedies

[8] Different Epidemic Curves for Severe Acute Respiratory Syndrome Reveal Similar Impacts of Control Measures

[9] Robust Reconstruction and Analysis of Outbreak Data: Influenza A(H1N1)v Transmission in a School-based Population

[10] Estimating the generation interval of influenza A (H1N1) in a range of social settings

[11] Introduction to the Theory of Statistics. McGraw-Hill

[12] Monte Carlo Statistical Methods

[13] How generation intervals shape the relationship between growth rates and reproductive numbers

[14] Transmission interval estimates suggest pre-symptomatic spread of COVID-19

[15] Public Health Interventions and SARS Spread