key: cord-0773839-yaii6e16 authors: Kuk, Anthony Y. C.; Ma, Stefan title: The estimation of SARS incubation distribution from serial interval data using a convolution likelihood date: 2005-07-12 journal: Stat Med DOI: 10.1002/sim.2123 sha: e94098c7b681cd52ba8606d9d3f281ef87cbc434 doc_id: 773839 cord_uid: yaii6e16 The incubation period of SARS is the time between infection of disease and onset of symptoms. Knowledge about the distribution of incubation times is crucial in determining the length of quarantine period and is an important parameter in modelling the spread and control of SARS. As the exact time of infection is unknown for most patients, the incubation time cannot be determined. What is observable is the serial interval which is the time from the onset of symptoms in an index case to the onset of symptoms in a subsequent case infected by the index case. By constructing a convolution likelihood based on the serial interval data, we are able to estimate the incubation distribution which is assumed to be Weibull, and justifications are given to support this choice over other distributions. The method is applied to data provided by the Ministry of Health of Singapore and the results justify the choice of a ten‐day quarantine period. The indirect estimate obtained using the method of convolution likelihood is validated by means of comparison with a direct estimate obtained directly from a subset of patients for whom the incubation time can be ascertained. Despite its name, the proposed indirect estimate is actually more precise than the direct estimate because serial interval data are recorded for almost all patients, whereas exact incubation times can be determined for only a small subset. It is possible to obtain an even more efficient estimate by using the combined data but the improvement is not substantial. Copyright © 2005 John Wiley & Sons, Ltd. Severe acute respiratory syndrome (SARS) is an illness caused by a coronavirus, called SARSassociated coronavirus (SARS-CoV), see References [1] [2] [3] . As a disease, SARS has a high case-fatality rate. According to the WHO, a total of 8422 people worldwide became sick with SARS during the 2003 outbreak. Of these, 916 died. Up to now, there is no deÿnitive cure or vaccine for SARS. As a result, health o cials have to rely on isolation and quarantine as the two main preventive measures. The people to be isolated include the suspected, probable and conÿrmed SARS cases so as to prevent further transmission of the disease. As people infected with SARS have to go through an incubation period before they become symptomatic, it is also important to quarantine people who may have been exposed to SARS-CoV, for example, the contacts of cases, so that they can be isolated as soon as they show possible signs of the disease. Obviously, the suitable length of the quarantine period is directly related to the length of the incubation period. The distribution of SARS incubation period is also an important input parameter in a mathematical model to describe the dynamics and spread of the disease and for assessing the potential impact of interventions. Incubation period is the time between infection of a disease and onset of symptoms. Estimation of the SARS incubation distribution has been hampered by the fact that the dates of infection cannot be determined exactly for the majority of patients. As a result, Donnelly et al. [4] have to restrict their attention to a small subset of 57 SARS patients in Hong Kong with short and deÿned periods of exposure to known SARS cases. In their own words, the resulting estimate is 'based on a limited number of observations to date, and has high variance and may re ect biases in reporting, di erent routes of transmission, or varying infectious doses of the virus'. Meltzer [5] attempts to deal with the problem that many patients have multiple possible incubation periods by using a simulation approach whereby the incubation period of each patient is simulated uniformly from his=her set of possible values. It is claimed that the simulation approach can be used to give frequency distribution and conÿdence intervals for the various quantities of interest. The argument, is, however, awed from a statistical point of view. This is because the use of simulations only accounts for the Monte Carlo variation of simulations treating the observed sample as ÿxed but not the sampling variability that takes into account the randomness of samples. To be precise, suppose there are n patients in the sample and the incubation period of patient i could only be determined up to m i possible values t i1 ; : : : ; t imi . The estimate obtained from Meltzer's simulations, say,F is just an approximation of the distributionF that assigns probability mass 1=nm i to t ij ; i = 1; : : : ; n, j = 1; : : : ; m i . What has been accounted for is the variability ofF as an approximation ofF but not the variability ofF. In fact, given thatF can be computed, there is no need to do any simulations at all. While the exact date of infection and hence the incubation period cannot be determined for most patients, what is observable is the serial interval which is deÿned in Reference [6] as the time from onset of symptoms in an index case to the onset of symptoms in a subsequent case infected by the index patient. By constructing a convolution likelihood based on the serial interval data in Singapore, we are able to estimate the incubation distribution parametrically. To be speciÿc, the Weibull distribution is selected as a suitable distribution for SARS incubation from a four-parameter family of generalized F distributions (see Sections 2.2.7 and 3.9.1 of Reference [7] ) that contains the log-normal, log-logistic, gamma, Weibull and reciprocal Weibull distributions as special cases. Based on the resulting estimate, the choice of a 10-day incubation period in Singapore is well justiÿed. The validity of the convolution likelihood estimate is demonstrated by means of comparison with a direct estimate obtained directly from a subset of patients for whom the incubation time can be ascertained. Whereas the direct estimate is only based on 50 incubation times, the convolution likelihood estimate is based on 198 serial intervals and hence is more precise. It is also possible to construct a maximum likelihood estimate based on the ascertained incubation times for the 50 patients plus the serial times for the remaining 148 patients. A serial interval is the time from onset of symptoms in an index case to the onset of symptoms in a subsequent case infected by the index case. During the 2003 outbreak in Singapore, a total of 206 probable SARS cases were diagnosed using the initial case deÿnition issued by WHO in March 2003, and 32 additional cases were conÿrmed later by laboratory test. Serial intervals can be determined for 198 patients. Index these 198 patients by i = 1; : : : ; 198. Let s i be the observed serial time between onset of symptoms in patient i and onset of symptoms in his=her infector. The observed serial time s i for patient i can be decomposed as where u i is the time between the date of onset of symptoms in the index case and the unobserved date of infection of patient i, and t i the incubation time between infection and onset of symptoms for patient i. Note that we can only observe the sum s i , but not the components u i and t i . Assuming that the two components are statistically independent, the density of the sum s i is given by the convolution of the densities of u i and t i , and the incubation times t i are assumed to be independent and identically distributed according to a density f T (t). Furthermore, let d i be the duration between onset of symptoms and isolation for the spreader who transmitted the disease to patient i. We assume that a person infected with SARS will not transmit the disease to others before onset of symptoms and after isolation. Under this assumption, the time d i deÿned earlier can be interpreted as the duration of infectiousness of the person who transmitted SARS to patient i. This is a reasonable assumption as there is no compelling evidence so far of transmission from asymptomatic persons [6] and isolation measure, if strictly adhered to, should be e ective in preventing transmission. To obtain the density of u i that is to be convoluted with that of t i , we condition on the duration of infectiousness d i of the spreaders. Assuming constant infectiousness throughout the duration d i , the conditional density f Ui (u i | d i ) of the infection time u i is uniformly distributed between 0 and d i . We have also assumed independence between the incubation times t i of patients and the duration times of their infectors so that the conditional distribution of t i given d i is still given by f T (t). Combining, the density of s i = u i + t i given the duration of infectiousness d i , is given by the convolution A simpler way to derive (1) proposed by a referee is to think in terms of interval censoring of the incubation times t i . Since s i = u i + t i , we have t i = s i − u i , that is to say, the incubation time t i is the di erence between the serial interval s i and the infection time u i . Now the infection time u i must be within the period of infectiousness, that is, between 0 and d i , it follows that the incubation time But of course the incubation time cannot be negative and so we can conclude further that t i ∈ (0; s i ) if s i ¡d i . In the absence of further knowledge about the time of infection, which is equivalent to the uniform distribution assumption for u i , the likelihood contribution from , depending on whether s i ¡d i or s i ¿d i . This argument also leads to (1), apart from the factor 1=d i , which does not depend on parameters and hence makes no di erence in maximum likelihood estimation. We follow the convolution approach since it is arguably more rigorous than the censored data approach and the ÿrst line of (1) is general enough to include the case of non-uniform f U (u). It follows from (1) that the density of the serial interval s i has closed form if the cumulative distribution function F T (t) of the incubation times t i has closed form. A popular choice for The popularity of Weibull distribution in survival analysis stems from the fact that it is exible in shape, allows both decreasing and increasing hazard over time, and it is the only distribution that is compatible with both the proportional hazards and accelerated failure time regression models. Substituting (2) into (1), we get based on the serial interval data s, we obtain the estimatesˆ (s) = 5:44,ÿ(s) = 1:91. The associated standard errors are 0.27 and 0.16 which are obtained by the usual method of inverting the information matrix. The estimated incubation distribution is shown in Figure 1 . Under the Weibull assumption, the mean incubation time is The standard error 0.235 ofˆ is obtained from the variance-covariance matrix ofˆ andÿ using the delta method. It follows that an approximate 95 per cent conÿdence interval for is (4.37, 5.29) days. Turning to percentiles, the 95th percentile of the incubation time is given by = (− log(0:05)) 1=ÿ and estimated byˆ =ˆ (− log(0:05)) 1=ÿ = 9:66 days (SE = 0:50) as shown in Figure 1 . A one-sided 95 per cent conÿdence interval is (0, 10.49) days. Thus we are 95 per cent conÿdent that the 95th percentile of the incubation distribution is at most 10.49 A one-sided 95 per cent conÿdence interval for this probability is (0.937, 1.0). In other words, we are 95 per cent conÿdent that at least 93.7 per cent of SARS patients have incubation time of at most 10 days. This suggests that a 10-day quarantine period should be su cient, bearing in mind that some patients have started incubation before they get quarantined. Other common choices of F T (t) include the log-normal and log-logistic distribution. The convolution density of s i is again given by (1) and parameter estimates can be obtained by maximizing the convolution likelihood. Figure 2 displays the estimated incubation distribution for various choices of the functional form of F T (t). As expected, there is not much di erence between the log-normal and log-logistic estimates, but they are both quite di erent from the Weibull estimate. We will consider model validation in the next section to provide support for the Weibull model. With strenuous e ort and intensive contact tracing, the Ministry of Health of Singapore was able to more or less determine the dates of infection and hence ascertain the incubation times for a subset D of 50 patients. Since the incubation times t i are directly observable for the patients in D, we can estimate the incubation distribution directly by maximizing the likelihood based on t i , for i ∈ D, to obtain the estimatesˆ (t D ) = 5:80 andÿ(t D ) = 2:59. This maximization can be implemented by, for example, the SAS procedure LIFEREG. The histogram of the 50 incubation times and the ÿtted Weibull distribution are shown in Figure 3 . The Pearson goodness of ÿt statistic for ÿtting a Weibull distribution to the histogram is 5.03 on 7 degrees of freedom (p value = 0:66) and so the Weibull model cannot be rejected. A good way to check the validity of the convolution likelihood method proposed in the last section is to compare the direct estimates with the convolution likelihood estimates based on the same subset of patients. To be precise, we are comparing the direct esti- based on serial interval data. Note that the product is taken over the same subset D as in (4) instead of over all patients as done in (3). In terms of standard error, it comes as no surprise that estimating the incubation distribution directly using t D = {t i ; i ∈ D} is better than estimating it indirectly from the serial interval data s D = {s i ; i ∈ D} through convolution likelihood. In terms of the actual values of the estimates, there is not much di erence. As can be seen in Figure 4 , there is good visual agreement between the direct and indirect Weibull estimates. The standardized di erences between the indirect and direct estimates of and ÿ are 0.639 and 0.454, respectively which are not statistically signiÿcant. This lends support to the appropriateness of the Weibull assumption and the validity of the convolution likelihood method, as both are necessary for the indirect estimate to be close to the direct estimate. In contrast, there is a much more discernable visual di erence in Figure 5 between the direct and indirect log-normal estimate. The standardized di erences of the parameter estimates are now 0.731 and −1:353. Figure 5 suggests that the log-normal distribution is a less appropriate choice for describing SARS incubation in Singapore. The results for log-logistic distribution are similar and will not be reported here to save space. It is also possible to estimate the parameters and ÿ based on the ascertained incubation times for the 50 patients in D plus the serial times for the remaining 148 patients by maximizing the combined likelihood to get the combined data estimatesˆ (t D ; s D c ) = 5:50,ÿ(t D ; s D c ) = 2:02. By making use of the additional data s D c = {s i ; i = ∈ D}, the combined data estimatesˆ (t D ; s D c ),ÿ(t D ; s D c ) are expected to be more e cient than the estimatesˆ (t D ),ÿ(t D ) based on t D = {t 1 ; i ∈ D} only, and this is re ected by the smaller standard errors reported in Table I From Table I , we can also compare the estimatesˆ (t D ),ÿ(t D ) based on the 50 cases with ascertained incubation times with the estimatesˆ (s D c ),ÿ(s D c ) based on the remaining 148 cases. Assume t i ∼Weibull ( 1 ; ÿ 1 ) for i ∈ D with corresponding mean 1 = 1 (1 + ÿ −1 1 ) and t i ∼Weibull ( 2 ; ÿ 2 ) with mean 2 = 2 (1 + ÿ −1 2 ) for i = ∈ D. A 95 per cent conÿdence interval for 1 − 2 is 5:15 − 4:68 ± 1:96 √ 0:35 2 + 0:28 2 = (−0:41; 1:35) which contains zero and so the two mean estimates are not signiÿcantly di erent from one another. A more omnibus test of H 0 : 1 = 2 ; ÿ 1 = ÿ 2 versus H 1 : 1 = 2 ; ÿ 1 = ÿ 2 is the likelihood ratio test W = 2{431:805 − (30:786 + 397:504)} = 7:03 on two degrees of freedom. Thus there is some evidence that there could be some di erence between the two underlying distributions but their means appear to be compatible. As commented by Donnelly et al. [4] , the ascertained incubation times are subject to 'biases in reporting, di erent routes of transmission, or varying infectious doses of the virus', and so the distribution of the ascertained incubation times may be somewhat di erent from the actual incubation distribution. This is another reason why the estimatesˆ (s D ; s D c );ÿ(s D ; s D c ) based on serial interval data alone reported in Section 2 might be more reliable, but as explained in the last paragraph, the results do not change much if we useˆ (t D ; s D c );ÿ(t D ; s D c ) for this data set because of the small proportion of ascertained cases. Another way to compare the Weibull, log-normal and log-logistic models is to embed them all within a wider family. A convenient choice is the four-parameter family of generalized F distributions (see Sections 2.2.7 of Reference [7] ). More speciÿcally, we assume that log T = Á + W , where W is distributed like the logarithm of an F variate with 2m 1 and 2m 2 degrees of freedom. It is known that (m 1 ; m 2 ) = (1; 1) corresponds to a log-logistic distribution for T , m 2 = ∞ corresponds to the generalized gamma distribution which further reduces to the gamma distribution if = 1. A Weibull distribution for T corresponds to an extreme value distribution for W and this corresponds to (m 1 ; m 2 ) = (1; ∞). The log-normal distribution is obtained by letting (m 1 ; m 2 ) → (∞; ∞). To get around the di culty of inÿnite degrees of freedom, Prentice [8] proposed the re-parameterization (see also Section 3.9.1 of Reference [7] ) so that the log-logistic model corresponds to p = 1, q = 0 . The log-normal, Weibull, reciprocal Weibull and generalized gamma distributions all belong to the three-parameter subfamily with p = 0 and q = 0; 1; −1 and ¿0, respectively. When the three-parameter family is ÿtted to the subset of 50 incubation times, the log-likelihood function of q, with Á and proÿled out, is given in Figure 6 . It can be seen that the maximum likelihood estimate of q is around 0.8 which is quite close to the q value of 1 for the Weibull model. By inverting the likelihood ratio test, an approximate 95 per cent conÿdence interval for q is roughly from 0.06 to 1.83. Thus the Weibull distribution is compatible with the observed data but not so for the lognormal distribution. If the three-parameter family is ÿtted to the combined data consisting of the observed incubation times for the 50 cases as well as the serial intervals for the remaining 148 cases, the maximum likelihood estimate of q is 0.57. Due to the use of additional data, the 95 per cent conÿdence interval of q that results from test inversion is tighter and ranges from 0.07 to 1.15, which is again compatible with the Weibull but not the log-normal model. We have made the simplifying assumption that individuals are equally infectious throughout the period from the onset of symptoms to isolation. A consequence of this assumption is that given the duration of infectiousness of an index case, the times of infection of those infected by him=her are uniformly distributed. In other words, we are assuming that u i =d i are uniformly distributed in the unit interval. A Pearson chi-square test of the goodness of ÿt of the uniform distribution is conducted using the subset of patients for whom the infection times u i , and hence also u i =d i , can be determined. Partitioning the unit interval into 5 equal sub-intervals, the value of the chi-square statistic is 4.22 on 4 degrees of freedom which is not statistically signiÿcant (p value = 0:38). We do not expect the uniform assumption to be strictly true, as the clinicians we have spoken to believe that the viral load levels, and hence the infectiousness, of SARS patients will typically reach its peak in a number of days after onset of symptoms and then decrease. Nevertheless, we have obtained fairly reasonable estimate of the incubation distribution by using the uniform density f Ui (u i | d i ) = 1=d i to construct the convolution density of s i = u i + t i . The validity of the resulting convolution likelihood estimate has also been established through comparison with a direct estimate, see Figure 4 . In principle, we could assume that infections are transmitted by a symptomatic person according to a Poisson process with intensity function (u). It follows that given that there is an infection during the duration d, the infection time is distributed according to the normalized density rather than the uniform distribution. Unlike the case of uniformly distributed u i , the convo- in this general case has no closed form even though numerical maximization of the convolution likelihood might be possible. Given that the uniform distribution assumption seems to be producing sensible estimates, it is not clear what are the beneÿts of using a more general distribution for u i and whether these beneÿts will outweigh the computational di culties involved. Other researchers [9] have also found constant infectiousness a useful and convenient working assumption. A plausible conjecture is that the distribution of incubation times is age dependent and the proposed approach is exible enough to incorporate this. However, a scatter plot of incubation time versus age for the subset data suggests that there is not much relationship between the two, see Figure 7 . The Pearson correlation of 0.29 is also quite weak. For these reasons, we will not explore this possibility any further. In conclusion, we have proposed in this paper a convolution method for estimating the SARS incubation distribution from serial interval data. For the SARS data in Singapore, the Weibull model seems to be more appropriate than other distributions like the log-normal and log-logistic. The validity of the indirect estimate obtained from serial interval data is established by way of comparison with a direct estimate based on the ascertained incubation times for a subset of patients. An advantage of the maximum convolution likelihood estimates of and ÿ is that they are based on 198 serial intervals (SE = 0:27 and 0.16) and hence more precise than the direct estimates which are based on only 50 incubation times (SE = 0:38 and 0.32). It is possible to obtain a combined estimate that makes use of the incubation times for the 50 patients in the ascertained subset and the serial intervals for the remaining 148 cases. However, the resulting estimate is not that much di erent from that based on the full set of 198 serial intervals. The proposed method is general enough to allow non-uniform distribution for infection times and age dependence for incubation times but it appears that there is no strong reason for doing these. Our calculation shows that the use of a 10-day quarantine period in Singapore is safe. Inevitably, some of the individuals quarantined as a precautionary measure actually do not have SARS. During the 2003 outbreak in Singapore, 7863 contacts were served home quarantine orders while a further 4331 were put on daily telephone surveillance for 10 days [10] . Out of all these, only 58 persons turned out to have SARS. The next challenge is how to be more discriminatory in identifying contacts who are at risk, so as to reduce the number of people to be placed under quarantine, without loss of e ectiveness of the measure. Coronavirus as a possible cause of severe acute respiratory syndrome A novel coronavirus associated with severe acute respiratory syndrome Identiÿcation of a novel coronavirus in patients with severe acute respiratory syndrome Epidemiological determinants of spread of casual agent of severe acute respiratory syndrome in Hong Kong Multiple contact dates and SARS incubation periods Transmission dynamics and control of severe acute respiratory syndrome The Statistical Analysis of Failure Time Data Discrimination among some parametric models Transmission dynamics of the etiological agent of SARS in Hong Kong: impact of public health interventions Public health response: a view from Singapore The authors would like to thank the referees for their helpful comments and particularly to ÿrst referee for pointing out the connection with interval censoring. Preliminary results of this paper have been presented at a meeting of the Consortium on the Mathematical Modelling of Infectious Diseases, Ministry of Health, Singapore, and the authors would like to thank committee members for their helpful suggestions and comments. The ÿrst author would also like to thank the Ministry for allowing him access to the SARS data.