key: cord-0646193-klh8uc7t
authors: Kenah, Eben
title: Contact intervals, survival analysis of epidemic data, and estimation of R_0
date: 2009-12-17
journal: nan
DOI: nan
sha: 8c9278cab82a53f220533b8c1f320e1bfdcb39a6
doc_id: 646193
cord_uid: klh8uc7t

We argue that the time from the onset of infectiousness to infectious contact, which we call the contact interval, is a better basis for inference in epidemic data than the generation or serial interval. Since contact intervals can be right-censored, survival analysis is the natural approach to estimation. Estimates of the contact interval distribution can be used to estimate R_0 in both mass-action and network-based models.

Infectious disease remains one of the greatest threats to human health and commerce, and the analysis of epidemic data is one of the most important applications of statistics in public health. Some of the most important questions involve the basic reproductive number, R_0, the number of secondary infections caused by a typical infectious person in the early stages of an epidemic (Diekmann and Heesterbeek, 2000).

Higher values of R_0 indicate that an epidemic will be larger and harder to control. The effects of interventions and the depletion of the susceptible population can be captured with the effective reproductive number R(t), which is the number of secondary infections caused by a typical person infected at time t.

© The Author 2009. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org.

The generation interval of an infectious disease is the time between the infection of a secondary case and the infection of his or her infector. The serial interval is the time between the symptom onset of a secondary case and the symptom onset of his or her infector. The generation and serial interval distributions are often considered characteristic features of an infectious disease (Fine, 2003). For a given R_0, a shorter mean serial or generation interval implies faster spread of the epidemic.

Usually, generation intervals are times between unobserved events. Serial intervals, which are times between observed events, are often used instead. Recent analyses of several past, emerging, and potentially emerging infectious diseases have been based on serial interval distributions, including the 1918 influenza (Mills et al., 2004), Severe Acute Respiratory Syndrome (SARS) (Lipsitch et al., 2003; Wallinga and Teunis, 2004), pandemic influenza A(H1N1) (Fraser et al., 2009; McBryde et al., 2009), and avian influenza (Ferguson et al., 2005, 2006). Three methods form the basis of these applications. With a measurement of the exponential growth rate at the beginning of an epidemic and a known serial interval distribution, R_0 can be estimated via the Lotka-Euler equation (Diekmann and Heesterbeek, 2000; Svensson, 2007; Wallinga and Lipsitch, 2007; Roberts and Heesterbeek, 2007). Two other methods use the time series of symptom onset times, assuming that all infections are symptomatic and observed. Wallinga and Teunis (2004) estimate R(t) given a known serial interval distribution. Their approach has been adapted by other researchers (Cauchemez et al., 2006), often supplemented with serial-interval observations from contact-tracing data. White and Pagano (2008) jointly estimate R_0 and the serial interval distribution using a branching-process approximation to the initial spread of infection, assuming the number of secondary cases generated by each infectious person has a Poisson distribution with mean R_0.
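The Lotka-Euler approach mentioned above can be illustrated with a minimal sketch. It assumes the standard relation R_0 = 1/M(−r), where r is the exponential growth rate and M is the moment generating function of the generation (or serial) interval distribution. The function names and the three interval families are illustrative choices, not part of the original analysis.

```python
import math


def r0_exponential_interval(r, mean_interval):
    """R_0 = 1 / M(-r) for an exponential interval: 1 + r * mean."""
    return 1.0 + r * mean_interval


def r0_gamma_interval(r, shape, scale):
    """R_0 = 1 / M(-r) for a gamma interval, whose MGF is (1 - scale*s)^(-shape)."""
    return (1.0 + r * scale) ** shape


def r0_fixed_interval(r, interval):
    """R_0 = 1 / M(-r) for a constant interval, whose MGF is exp(s * interval)."""
    return math.exp(r * interval)


if __name__ == "__main__":
    r = 0.2  # hypothetical growth rate per day
    print(r0_exponential_interval(r, 3.0))
    print(r0_gamma_interval(r, shape=2.0, scale=1.5))
    print(r0_fixed_interval(r, 3.0))
```

For the same growth rate and mean interval, the implied R_0 depends on the shape of the interval distribution, which is why these estimators need the distribution to be known in advance.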

There are several problems with estimators based on generation or serial intervals in the context of an emerging infection. The Lotka-Euler and Wallinga-Teunis estimators rely on a previously known generation or serial interval distribution. The Wallinga-Teunis and White-Pagano estimators assume that all serial intervals are independent and identically distributed, which occurs only if the incubation and infectious periods are constant. All three of these estimators assume a stable serial interval distribution, which limits their use to the early spread of infection. When multiple infectious persons compete to infect a given susceptible, the infector is the one who first makes infectious contact. Thus, the mean generation and serial intervals contract as the prevalence of infection increases either locally (e.g., within households) or globally (Svensson, 2007; Kenah et al., 2008).

In this paper, we outline an alternative analysis of epidemic data that applies methods from survival analysis to contact intervals. Informally, the contact interval from an infectious person i to a susceptible person j is the time between the onset of infectiousness in i and the first infectious contact from i to j, where we define infectious contact to be a contact sufficient to infect a susceptible individual. This interval will be right-censored if j is infected by someone else prior to infectious contact from i or if i recovers from infection before making infectious contact with j. The contact interval is similar to the generation interval, except that its definition is not limited to contacts that actually cause infection and it begins with the onset of infectiousness rather than infection.

Here, we focus on the analysis of completely-observed "Susceptible-Exposed-Infectious-Recovered" (SEIR) epidemics. The SEIR framework applies to acute, immunizing diseases that spread from person to person, such as measles, influenza, smallpox, and polio. We also assume that the epidemic is completely observed, so all cases are detected and their times of infection, onset of infectiousness, and recovery are observed. Most epidemics are only partially observed, so we plan to explore the analysis of more realistic data sets in future papers. However, partial observation is best viewed as a missing-data problem, which requires that the methods for complete data be established first.

In Section 2, we define a general stochastic SEIR epidemic model and show that survival likelihoods for a vector θ of contact interval distribution parameters have score processes that are zero-mean martingales at the true parameter θ 0 . In Section 3, we show how estimates of the contact interval distribution can be used to estimate R 0 in mass-action and network-based models. In Section 4, we evaluate the performance of these methods in simulated epidemic data and show that assumptions about the underlying contact process play a crucial role in accurate statistical inference. In Section 5, we discuss the advantages and limitations of survival methods in epidemic data analysis.

In this section, we show that the score processes from survival likelihoods for epidemic data can be written as stochastic integrals with respect to zero-mean martingales. We develop our methods in three stages.

First, we describe the underlying stochastic SEIR model and the observed data. Second, we imagine that we observe who-infected-whom and derive counting-process martingales for an ordered pair ij and for a fixed susceptible j. Finally, we consider the situation where we do not observe who-infected-whom and derive counting-process martingales for a fixed susceptible j and for the complete observed data. Our sources for the underlying theory are Kalbfleisch and Prentice (2002) and Serfling (1980) .

Consider a stochastic "Susceptible-Exposed-Infectious-Removed" (SEIR) model in a closed population of n individuals assigned indices 1, . . . , n. Each person i moves from S to E at his or her infection time t i , with t i = ∞ if i is never infected. After infection, i begins a latent period of length ε i during which he or she is infected but not infectious. At time t i + ε i , i moves from E to I, beginning an infectious period of length ι i . At time t i + r i , i recovers from infection and moves from I to R, where the recovery period r i = ε i + ι i is the total time between infection and removal. Once in R, i can no longer infect other persons or be infected. The latent period is a nonnegative random variable, the infectious and recovery periods are strictly positive random variables, and the recovery period is finite with probability one.

After becoming infectious at time t_i + ε_i, person i makes infectious contact with person j ≠ i at their infectious contact time t_ij = t_i + ε_i + τ*_ij, where the infectious contact interval τ*_ij is a strictly positive random variable with τ*_ij = ∞ if infectious contact never occurs. Since infectious contact must occur while i is infectious or never, τ*_ij ∈ (0, ι_i] or τ*_ij = ∞. We define infectious contact to be sufficient to cause infection in a susceptible person, so t_j ≤ t_ij with equality if and only if j is susceptible at time t_ij.

An epidemic begins with one or more persons infected from outside the population, which we call imported infections. For simplicity, we assume that epidemics begin with one or more imported infections at time t = 0 and there are no other imported infections.

Contact intervals For each ordered pair ij, let C ij = 1 if infectious contact from i to j is possible and C ij = 0 otherwise. We assume that the infectious contact interval τ * ij is generated in the following way:

A contact interval τ_ij is drawn from a distribution with hazard function λ_ij(τ). If τ_ij ≤ ι_i and C_ij = 1, then τ*_ij = τ_ij. Otherwise, τ*_ij = ∞.

In this paper, we assume all contact intervals have an absolutely continuous distribution and, for a fixed i or a fixed j, the contact intervals τ_ij, i ≠ j, are independent.
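The generation of an infectious contact interval can be sketched in a few lines. The sketch assumes the rule implied by the constraint τ*_ij ∈ (0, ι_i] or τ*_ij = ∞: a drawn contact interval counts only if contact is possible (C_ij = 1) and the interval ends within the infectious period of i. The exponential hazard and all names are illustrative assumptions.

```python
import math
import random


def infectious_contact_interval(tau, iota, c_ij):
    """Return tau*_ij: tau if contact is possible and occurs while i is infectious, else infinity."""
    if c_ij == 1 and tau <= iota:
        return tau
    return math.inf


def simulate_intervals(rate, iota, n_pairs, seed=0):
    """Draw exponential contact intervals (hypothetical hazard `rate`) for n_pairs susceptibles."""
    rng = random.Random(seed)
    draws = [rng.expovariate(rate) for _ in range(n_pairs)]
    return [infectious_contact_interval(tau, iota, 1) for tau in draws]


if __name__ == "__main__":
    intervals = simulate_intervals(rate=0.5, iota=1.0, n_pairs=10)
    # Infinite entries correspond to contact intervals censored by the recovery of i.
    print(intervals)
```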

Susceptibility and infectiousness processes Let S_i(t) = 1_{t ≤ t_i} and I_i(t) = 1_{t ∈ (t_i + ε_i, t_i + r_i]} be the susceptibility and infectiousness processes, respectively, for person i, where 1_X = 1 if X is true and zero otherwise. As defined, both processes are left-continuous and infectious contact from i to j is possible at time t only if C_ij I_i(t) S_j(t) = 1.

Complete observed data Our population has size n, and m represents the number of infections we observe. Observation begins at time t = 0 and ends at time t = T. During this period, we observe the times of all S → E (infection), E → I (onset of infectiousness), and I → R (recovery) transitions that occur in the population. For all ordered pairs ij, we observe C_ij and any covariates X_ij needed to specify λ_ij(τ) up to an unknown parameter vector θ with true value θ_0.

Choose an ordered pair ij and let N_ij(t) = 1_{t ≥ t_i + ε_i + τ_ij} count the number of infectious contacts from i to j on or before time t. We count only the first infectious contact because j is infected on or before that time. Consider the filtration

We assume that N ij (0) = 0 and λ ij (τ ) is predictable with respect to H ij t , so

is a zero-mean martingale with respect to H^{ij}_t. Now suppose λ_ij(τ) is specified up to a parameter vector θ with true value θ_0, so λ_ij(τ) = λ_ij(τ; θ_0). If the pair ij is observed from time 0 until time T, the corresponding log likelihood is

If ln λ ij (τ ; θ) is differentiable with respect to θ and we can interchange the order of differentiation and integration, the score process for data in the time interval

Therefore, U ij (θ 0 , t) is a zero-mean martingale because it is the integral of a predictable process with

Now fix j and assume there exist covariates X_ij such that λ_ij(τ; θ) = λ(τ; θ, X_ij) for all i ≠ j.

Since the contact intervals τ ij are independent for a fixed j and absolutely continuous, the M ij (θ 0 , τ ) from equation (2.1) are orthogonal zero-mean martingales with respect to the filtration

The total score process for j is

and U ·j (θ 0 , t) is a zero-mean martingale with respect to H ·j t because it is a sum of zero-mean martingales.

The score process in equation (2.3) is that of a survival likelihood where the t ij are failure times and C ij I i (t)S j (t) = 1 indicates risk of infectious contact in the ordered pair ij. At the earliest infectious contact, the contact intervals in all remaining pairs at risk are right-censored, which is a type II independent censoring mechanism (Kalbfleisch and Prentice, 2002) .


In the previous section, U ·j (θ, t) is adapted only if we observe which of the N ij (t) jumps first, which is equivalent to observing the infector of person j. Now suppose that we observe the infection time of j but not which person i was the infector. This is equivalent to observing

which counts the first infectious contact received by j. The corresponding filtration is

and the corresponding zero-mean counting process martingale is

We can no longer calculate U ·j (θ, t) as defined in equation (2.3), but we can calculate its conditional expectation given H ·j t . For each ij, define the expected score process

(2.5)

Given that N ·j jumps at time t, the probability that the jump occurred in N ij is

Thus,

and equation (2.5) can be rewritten

Therefore, the expected score process for person j is

which is the score process of the log likelihood

Finally, consider the filtration

. . , n generated by the complete data described at the end of Section 2.1. Since we assume that the τ_ij, j ≠ i, are independent for a fixed i and absolutely continuous, the M_·j(θ_0, t) from equation (2.4) are orthogonal zero-mean martingales with respect to H_t. The total expected score process is

which is the score process for the log likelihood

is a zero-mean martingale with respect to H t because it is a sum of zero-mean martingales. The maximum likelihood estimate (MLE) for θ is the solution to the equation U (θ, T ) = 0.
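For a constant hazard, solving the score equation reduces to a ratio of events to exposure time, which makes a convenient worked example. The sketch below assumes the usual counting-process form of the log likelihood when the infector is not observed: the log of the summed hazards over at-risk infectors at each infection time, minus the cumulative hazard accumulated over all pairs with C_ij I_i(t) S_j(t) = 1. Under that assumption and an exponential contact interval with rate β, the MLE is the number of within-population infections divided by the total time at risk. The data layout and function names are hypothetical.

```python
import math


def pair_exposure(onset_i, recovery_i, t_j, T):
    """Time in (0, T] during which i is infectious and j is still susceptible."""
    end = min(recovery_i, t_j, T)
    return max(0.0, end - onset_i)


def exponential_rate_mle(people, edges, T, imported):
    """Closed-form MLE of a constant contact hazard.

    people:   dict id -> (infection time, onset of infectiousness, recovery time),
              with math.inf for events that never occur
    edges:    iterable of ordered pairs (i, j) with C_ij = 1
    imported: set of ids infected from outside the population
    """
    total_exposure = 0.0
    for i, j in edges:
        _, onset_i, recovery_i = people[i]
        t_j = people[j][0]
        total_exposure += pair_exposure(onset_i, recovery_i, t_j, T)
    events = sum(1 for k, (t, _, _) in people.items()
                 if k not in imported and t <= T)
    return events / total_exposure


if __name__ == "__main__":
    inf = math.inf
    people = {0: (0.0, 1.0, 3.0), 1: (2.0, 3.0, 5.0), 2: (inf, inf, inf)}
    edges = [(i, j) for i in people for j in people if i != j]
    print(exponential_rate_mle(people, edges, T=4.0, imported={0}))
```

Non-constant hazards such as the Weibull need numerical maximization, but the bookkeeping of who is at risk from whom, and for how long, is the same.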


In this section, we show that the variance of U (θ 0 , t) can be estimated using its predictable and optional variation processes, which are unbiased estimators of the Fisher information from the survival likelihood.

We then use the Lindeberg-Feller Central Limit Theorem to give a heuristic justification for standard maximum likelihood estimation with epidemic data. Throughout this section, we assume that λ(τ; θ, X) has a bounded second derivative in θ and that integration and differentiation can be interchanged.

Taking the derivative of U ·j (θ, t) with respect to θ in equation (2.6) leads to

Setting θ = θ 0 makes the first term the integral of a predictable process with respect to a zero-mean martingale. Therefore,

so the predictable variation process ⟨U_·j(θ_0)⟩(t) is an unbiased estimator of Var[U_·j(θ_0, t)]. By equation (2.8) and orthogonality of the U_·j(θ_0, t), the total predictable variation process

To show that the same result holds for the optional variation process, rearrange equation (2.6) to get

Taking the derivative with respect to θ yields

Setting θ = θ 0 makes the second term the integral of a predictable process with respect to a zero-mean martingale. Therefore,

Imagine a series of epidemics in larger and larger populations, and assume that the final sizes of the epidemics become infinite as the population size n → ∞. For any fixed T, the number of infections will not become infinite as n → ∞, which makes it difficult to apply the Martingale Central Limit Theorem to U(θ_0, T). Instead, imagine that we observe m_n infections in a population of size n between time 0 and time T_n, with m_n → ∞ as n → ∞. Let U_n(θ, T_n) be the corresponding total expected score process, and let θ̂_n be the corresponding MLE. If the Lindeberg condition holds for the triangular array U_·1(θ_0, T_n), . . . , U_·n(θ_0, T_n), then

in distribution as n → ∞ by the Lindeberg-Feller Central Limit Theorem (Serfling, 1980) . Heuristically, this justifies the use of maximum likelihood methods such as Wald, score, and likelihood ratio tests.

The contact interval distribution can be used to estimate R_0 in both network-based and mass-action models. For simplicity, we assume that the hazard of infectious contact does not depend on covariates. Thus, λ_ij(τ; θ) = λ(τ; θ) for all ij and the results in this section apply to homogeneous populations. For mass-action models, we describe an asymptotic likelihood that is valid for the initial spread of disease.

In a network-based model, transmission takes place across the edges of a contact network, so we have C_ij = 1 if and only if there is an edge leading from i to j in the contact network. Here, we will assume that contact networks are undirected, so C_ij = C_ji for all i and j. In a network-based model, R_0 depends on the structure of the contact network. The most tractable models are those on configuration-model networks, which are maximally random except for their degree distribution (Molloy and Reed, 1995, 1998; Newman et al., 2002). More formally, let D be a nonnegative discrete random variable with finite mean and variance. To construct a configuration-model network with n nodes, assign each node i = 1, . . . , n a degree d_i randomly sampled from the distribution of D and attach d_i stubs (half-edges) to it. Then connect the stubs at random, erasing one stub if necessary so the sum of the degrees is even. As n → ∞, the probability of multiple edges between two nodes or a loop from a node to itself goes to zero.
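The stub-matching construction described above can be sketched as follows. This is a minimal illustration using only the standard library, not the NetworkX routine used for the simulations, and the function name is hypothetical. In small networks the result may contain self-loops or repeated edges, which become rare as n grows.

```python
import random


def configuration_model(degrees, seed=0):
    """Pair stubs uniformly at random; returns a list of undirected edges (u, v)."""
    rng = random.Random(seed)
    # One stub (half-edge) per unit of degree.
    stubs = [node for node, d in enumerate(degrees) for _ in range(d)]
    if len(stubs) % 2 == 1:
        # Erase one randomly chosen stub so that the stubs can be paired.
        stubs.pop(rng.randrange(len(stubs)))
    rng.shuffle(stubs)
    return [(stubs[k], stubs[k + 1]) for k in range(0, len(stubs), 2)]


if __name__ == "__main__":
    print(configuration_model([3, 2, 2, 1], seed=1))
```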

In these networks, there is a straightforward definition of R_0 (Andersson, 1998; Newman, 2002; Kenah and Robins, 2007). In the early stages of transmission, an infected node of degree d has d − 1 edges across which infection can be transmitted. The probability of transmitting infection across each of these edges is 1 − exp(−Λ(ι; θ_0)), where ι is the infectious period and Λ(t; θ) = ∫_0^t λ(u; θ) du. Since the probability of reaching a node by following edges is proportional to the degree of the node, the mean number of secondary infections generated by a typical infected node in the early stages of an epidemic is

where the first expectation is taken over the distribution of the infectious period ι.
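The argument above can be turned into a small numerical sketch. It assumes that R_0 is the product of the mean transmission probability E[1 − exp(−Λ(ι))] and the mean number of onward edges of a node reached by following an edge, E[D(D − 1)]/E[D]. The plug-in use of sample means and the function name are illustrative assumptions.

```python
import math


def network_r0(infectious_periods, degrees, cum_hazard):
    """Plug-in R_0 for a configuration-model network.

    infectious_periods: sample of infectious periods
    degrees:            sample from the degree distribution of the network
    cum_hazard:         function t -> Lambda(t), the cumulative contact hazard
    """
    transmissibility = sum(1.0 - math.exp(-cum_hazard(iota))
                           for iota in infectious_periods) / len(infectious_periods)
    mean_d = sum(degrees) / len(degrees)
    mean_d2 = sum(d * d for d in degrees) / len(degrees)
    # A node reached along an edge has degree distribution proportional to d,
    # so its expected number of remaining edges is E[D^2]/E[D] - 1.
    return transmissibility * (mean_d2 / mean_d - 1.0)


if __name__ == "__main__":
    beta = 0.8  # hypothetical exponential contact hazard
    print(network_r0([1.0, 1.0, 1.0], [2, 4, 6, 8], lambda t: beta * t))
```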

Network-based likelihood In a network-based model, the likelihood ℓ(θ) in equation (2.9) depends only on data about individuals who are either infected before time T or connected to an infected person in the contact network. In principle, these people could be identified through surveillance and contact tracing.

Since E[D^2]/E[D] is the expected degree of persons who are infected by transmission within the population, it can be estimated by calculating the mean degree of persons who are infected.

In a mass-action model, individuals form no stable social bonds and interact like gas molecules. Thus, C_ij = 1 for all ij but the hazard of infectious contact is inversely proportional to the population size. If λ_n(τ; θ) is the hazard function for the contact interval distribution in a population of size n, λ_n(τ; θ) = λ_0(τ; θ)/(n − 1) for a baseline hazard function λ_0(τ; θ) with corresponding cumulative hazard function Λ_0(τ; θ). As before, these functions are specified up to an unknown parameter vector θ with true value θ_0.
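The 1/(n − 1) scaling can be checked numerically. With a per-pair cumulative hazard of Λ_0(ι)/(n − 1), each of the other n − 1 persons receives infectious contact with probability 1 − exp(−Λ_0(ι)/(n − 1)), so the expected number of contacts is (n − 1) times that probability. The sketch below, with a hypothetical function name, shows this quantity approaching Λ_0(ι) as n grows.

```python
import math


def expected_contacts(n, baseline_cum_hazard):
    """Expected number of persons contacted by one case in a population of size n.

    baseline_cum_hazard is Lambda_0(iota), the baseline cumulative hazard
    evaluated at the infectious period of the case.
    """
    # -expm1(-x) computes 1 - exp(-x) accurately for small x.
    p_contact = -math.expm1(-baseline_cum_hazard / (n - 1))
    return (n - 1) * p_contact


if __name__ == "__main__":
    for n in (11, 101, 10001, 1000001):
        print(n, expected_contacts(n, 2.0))
```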

The baseline hazard and cumulative hazard functions of a mass-action model have useful interpretations in terms of R 0 and the time course of infectiousness in the limit as n → ∞. Given an infectious period ι, the expected number of infectious contacts made is

Given that i makes infectious contact with j and has infectious period ι, the probability density function of the infectious contact interval from i to j is (1/(n − 1)) λ_0(τ; θ) exp(−Λ_0(τ; θ_0)/(n − 1))

Let m be the total number of infections observed before time T. If m ≪ n, an approximate likelihood that depends only on information about infected persons can be written in terms of λ_0(τ; θ). Expanding equation (2.9) in terms of λ_0(τ; θ), we get

(3.4)

All summands in the first term are zero except for those j with t j T . The second term is not a function of θ and can be ignored. The third term can be split into terms from j who get infected on or before time T and from those who remain uninfected at time T :

where x ∧ y = min(x, y). Since the first term is less than or equal to

for a fixed m as n → ∞. This asymptotic likelihood depends only on information about infected people.

In principle, these people could be identified through surveillance.

In this section, we first look at the performance of the methods from Sections 2 and 3 in simulated epidemic data sets from mass-action and network-based models. We then illustrate the use of our methods with an analysis of two epidemic curves from the early spread of influenza A(H1N1) in Mexico.

In this section, we look at the performance of the methods from Sections 2 and 3 in simulated epidemic data sets. In all models, we used data from the first m = 1,000 infections in a population of size n = 100,000. For each infected person i, we recorded the infection time t_i, the onset of infectiousness t_i + ε_i, and the recovery time t_i + r_i. In network-based models, the degree d_i and the indices of all neighbors of i were also recorded. All outbreaks started with a single imported infection at time 0. Since we are interested primarily in the analysis of emerging epidemics, outbreaks that terminated with a final size less than 1,000 were discarded. If an epidemic model was run 100 times without producing an epidemic final size of at least 1,000, it was discarded and another model was generated. For all simulations, R_0 was constrained to be between 1.01 and 16, a range that covers almost all known epidemic diseases.

In network-based models, the contact networks were undirected Erdős-Rényi random graphs (Newman et al., 2006) with an expected degree chosen from the discrete uniform distribution on {2, . . . , 16}.

A new contact network was constructed for each simulation.

Four scenarios were considered within each class of model: exponential or Weibull (baseline) contact interval distributions with constant or exponentially-distributed infectious periods. All infectious period distributions had mean one. The exponential distribution has the hazard function λ(τ; β) = β for all τ > 0, where β > 0 is the rate parameter. The Weibull distribution has the hazard function λ(τ; α, β) = αβ(βτ)^(α−1) for all τ > 0, where α > 0 is the shape parameter and β > 0 is the rate parameter. Note that the exponential distribution is a Weibull distribution with α = 1.
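The Weibull parameterization above can be made concrete with a short sketch. The cumulative hazard (βτ)^α follows from integrating the stated hazard, and contact intervals can be drawn by inverting the survival function exp(−(βτ)^α). The function names are illustrative, and this is not the simulation code used in the paper.

```python
import math
import random


def weibull_hazard(tau, alpha, beta):
    """Hazard alpha * beta * (beta * tau)^(alpha - 1) for tau > 0."""
    return alpha * beta * (beta * tau) ** (alpha - 1.0)


def weibull_cum_hazard(tau, alpha, beta):
    """Cumulative hazard (beta * tau)^alpha, the integral of the hazard from 0 to tau."""
    return (beta * tau) ** alpha


def sample_weibull(alpha, beta, size, seed=0):
    """Inverse-transform sampling: solve exp(-(beta * tau)^alpha) = u for tau."""
    rng = random.Random(seed)
    samples = []
    for _ in range(size):
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
        samples.append((-math.log(u)) ** (1.0 / alpha) / beta)
    return samples


if __name__ == "__main__":
    # With alpha = 1 the hazard is constant and equal to beta.
    print(weibull_hazard(0.3, 1.0, 0.7), weibull_hazard(2.0, 1.0, 0.7))
    print(sample_weibull(2.0, 1.0, size=5, seed=1))
```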

Parameter estimates For network-based models, we used the likelihood in equation (2.9) to estimate the parameters of the contact interval distribution. For mass-action models, we used the asymptotic likelihood in equation (3.5) to estimate the parameters of the baseline contact interval distribution. Maximum likelihood estimates were obtained using the mle function in the R library stats4. Confidence intervals for each parameter were calculated using the confint function, which inverts the one-parameter likelihood ratio chi-squared test using a profile likelihood.

R_0 estimates For network-based models, R_0 was estimated using equation (3.1). For mass-action models, R_0 was estimated using equation (3.2). We calculated bootstrap percentile confidence intervals by sampling contact interval distribution parameters from their approximate joint normal distribution and combining each sample with a bootstrap sample of the observed infectious periods (and, for network-based models, observed degrees in the contact network). The 95% confidence interval was defined by the 2.5% and 97.5% quantiles of the point estimates from 10,000 samples.
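The resampling scheme described above can be sketched for the simplest case. The sketch assumes a mass-action model with an exponential contact interval, for which the plug-in estimate is taken to be the rate parameter times the mean infectious period. The standard error, the quantile indexing rule, and all names are illustrative assumptions, not the code used for the paper.

```python
import random
import statistics


def bootstrap_r0_interval(beta_hat, se_beta, infectious_periods, n_boot=10000, seed=0):
    """Percentile interval combining a parametric draw of beta with resampled periods."""
    rng = random.Random(seed)
    m = len(infectious_periods)
    draws = []
    for _ in range(n_boot):
        # Parametric bootstrap: beta* from the approximate normal distribution of the MLE.
        beta_star = rng.gauss(beta_hat, se_beta)
        # Nonparametric bootstrap: resample the observed infectious periods with replacement.
        iota_star = [rng.choice(infectious_periods) for _ in range(m)]
        draws.append(beta_star * statistics.fmean(iota_star))
    draws.sort()
    lower = draws[int(0.025 * n_boot)]
    upper = draws[int(0.975 * n_boot) - 1]
    return lower, upper


if __name__ == "__main__":
    periods = [0.4, 1.3, 0.9, 2.1, 0.7, 1.0, 0.6, 1.5]  # hypothetical data
    print(bootstrap_r0_interval(2.0, 0.1, periods, n_boot=2000, seed=1))
```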

Implementation Simulations were implemented in Python 2.6 (www.python.org) using the SciPy 0.7 package (Jones et al., 2009). Analyses were performed in R 2.10 (R Development Core Team, 2009) via the RPy 2.0 package (Moreira and Warnes, 2009). Contact networks were generated using the NetworkX 0.99 package (Hagberg et al., 2008). Sampling from multivariate normal distributions was done using the Cholesky decomposition of the covariance matrix (Rizzo, 2008). The simulation code is included as supplementary material (http://www.biostatistics.oxfordjournals.org).
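Sampling from a multivariate normal distribution through a Cholesky factor can be sketched without external packages. If the covariance matrix equals L L^T with L lower triangular and z is a vector of independent standard normal draws, then mean + L z has the required distribution. The implementation below is a minimal illustration with hypothetical names, not the code used in the paper.

```python
import math
import random


def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive-definite matrix a."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L


def sample_mvn(mean, cov, seed=0):
    """One draw from N(mean, cov), computed as mean + L z."""
    rng = random.Random(seed)
    L = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in mean]
    return [m + sum(L[i][k] * z[k] for k in range(i + 1))
            for i, m in enumerate(mean)]


if __name__ == "__main__":
    cov = [[0.04, 0.01], [0.01, 0.09]]  # hypothetical covariance of (alpha, beta) estimates
    print(sample_mvn([1.5, 2.0], cov, seed=1))
```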

For mass-action models with exponential contact intervals, R_0 = β for both fixed and exponentially-distributed infectious periods. Let β̂ denote the MLE of the rate parameter β, and let ι_k denote the infectious period of the k-th infection observed. Our point estimate of R_0 is

where β* is a parametric bootstrap sample from the approximate normal distribution of β̂ and ι*_1, . . . , ι*_m is a bootstrap sample from the observed ι_1, . . . , ι_m.

For mass-action models with Weibull contact intervals, R_0 = β^α for a fixed infectious period and R_0 = β^α Γ(α + 1) for exponentially-distributed infectious periods. In both cases,

where α̂ is the shape parameter MLE and β̂ is the rate parameter MLE. A bootstrap sample of R_0 is

where (α*, β*) is a sample from the approximate joint normal distribution of (α̂, β̂).
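The two closed forms for mass-action models with Weibull contact intervals follow from averaging the baseline cumulative hazard (βι)^α over the infectious period, with all infectious period distributions having mean one as in the simulations. A short sketch with a hypothetical function name:

```python
import math


def mass_action_r0_weibull(alpha, beta, infectious_period="fixed"):
    """True R_0 under a Weibull baseline contact interval and mean-one infectious periods."""
    if infectious_period == "fixed":
        # iota = 1, so E[(beta * iota)^alpha] = beta^alpha.
        return beta ** alpha
    if infectious_period == "exponential":
        # iota ~ Exp(1), so E[iota^alpha] = Gamma(alpha + 1).
        return beta ** alpha * math.gamma(alpha + 1.0)
    raise ValueError("infectious_period must be 'fixed' or 'exponential'")


if __name__ == "__main__":
    # With alpha = 1 both cases reduce to the exponential result R_0 = beta.
    print(mass_action_r0_weibull(1.0, 2.5, "fixed"),
          mass_action_r0_weibull(1.0, 2.5, "exponential"))
    print(mass_action_r0_weibull(2.0, 1.5, "exponential"))
```

Because α appears as an exponent, modest errors in the shape estimate can translate into large errors in the R_0 estimate.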

Results Table 1 shows the coverage probabilities achieved in 1,000 simulations and exact binomial 95% confidence intervals for the true coverage probabilities in each of the four types of mass-action model. Figure 1 shows a scatterplot of R̂_0 versus R_0 for models with exponential contact interval and infectious period distributions. Figure 2 shows a scatterplot of estimated versus true ln(R_0) for models with Weibull contact interval distributions and exponential infectious period distributions. For these models, estimates of R_0 are right-skewed because of the exponent α̂ in equation (4.3); this is reduced by taking logarithms.

Similar results were obtained in models with a fixed infectious period.


Let ι_k and d_k denote the infectious period and degree, respectively, of the k-th infection observed. In a contact network with n nodes, let D̄ be the mean degree and let

For network-based models with exponential contact intervals, R_0 = (1 − exp(−β)) D̄ for a fixed infectious period and R_0 = [β/(β + 1)] D̄ for exponentially-distributed infectious periods. In both cases,

where β* is a sample from the approximate normal distribution of β̂ and (ι*_1, d*_1), . . . , (ι*_m, d*_m) is a bootstrap sample from (ι_1, d_1), . . . , (ι_m, d_m).
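The closed forms for network-based models with exponential contact intervals can be written out directly. The sketch assumes, as in the simulations, Erdős-Rényi contact networks and infectious periods with mean one. For an Exp(1) infectious period, the mean transmission probability E[1 − exp(−βι)] equals β/(β + 1). The function name is hypothetical.

```python
import math


def network_r0_exponential(beta, mean_degree, infectious_period="fixed"):
    """True R_0 on an Erdős-Rényi network with an exponential contact interval."""
    if infectious_period == "fixed":
        transmissibility = 1.0 - math.exp(-beta)  # iota = 1
    elif infectious_period == "exponential":
        transmissibility = beta / (beta + 1.0)    # iota ~ Exp(1)
    else:
        raise ValueError("infectious_period must be 'fixed' or 'exponential'")
    return transmissibility * mean_degree


if __name__ == "__main__":
    print(network_r0_exponential(1.0, 4.0, "fixed"))
    print(network_r0_exponential(1.0, 4.0, "exponential"))
```

Unlike the mass-action case, R_0 here is bounded above by the mean degree no matter how large the contact hazard becomes.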

For network-based models with Weibull contact intervals, R_0 = (1 − exp(−β^α)) D̄ for a fixed infectious period and

for exponentially-distributed infectious periods. In both cases,

where (α*, β*) is a sample from the approximate joint normal distribution of (α̂, β̂) and (ι*_1, d*_1), . . . , (ι*_m, d*_m) is a bootstrap sample from (ι_1, d_1), . . . , (ι_m, d_m).

Results Table 2 shows the coverage probabilities achieved in 1,000 simulations and exact binomial 95% confidence intervals for the true coverage probability in each of the four types of network-based model. Figure 3 shows a scatterplot of the estimated versus true R_0 for models with exponential contact interval and infectious period distributions. Figure 4 shows a scatterplot of the estimated versus true R_0 for models with Weibull contact interval distributions and exponential infectious period distributions. Similar results were obtained in models with a fixed infectious period.

To look at the effect of assumptions about the contact process on statistical inference during an epidemic, we applied the mass-action likelihoods to data generated by the network-based models, ignoring all information about the contact network. Table 2 shows the coverage probabilities achieved in 1,000 simulations and exact binomial 95% confidence intervals for the true coverage probabilities for mass-action estimates applied to network-based models. The '+' signs in Figures 3 and 4 show the mass action estimates of R 0 versus the true R 0 in network-based models with exponential infectious periods. Many of these points fall above the top edge of each graph. Similar results were obtained in models with a fixed infectious period.

To show the practicability of methods based on contact intervals as well as the importance of data that is uncollected or unreported in emerging epidemics, we attempted to estimate R 0 based on two epidemic curves published at the beginning of the influenza A(H1N1) pandemic in Mexico. The first epidemic curve contains suspected cases in the village of Vera Cruz between March 9 and March 20 (Fraser et al., 2009 ).

The second epidemic curve contains lab-confirmed cases in Mexico City between April 13 and April 24 (Ministry of Health, 2009) . In both analyses, we assumed a latent period (between infection and the onset of infectiousness) of one day and an incubation period (between infection and onset of symptoms) of two days. With no data on links between cases or the duration of illness in each case, we assumed mass-action with a constant infectious period. Confidence intervals are generated as in the simulations.

Assuming an exponential contact interval distribution, we get R̂_0 = 1.95 (1.63, 2.33) for Vera Cruz and R̂_0 = 2.31 (2.15, 2.48) for Mexico City. These are high but consistent with some early estimates (Fraser et al., 2009; Yang et al., 2009). Assuming a Weibull contact interval distribution, we get R̂_0 = 3.08 (2.55, 3.65) for Vera Cruz and R̂_0 = 4.37 (4.06, 4.70) for Mexico City; in both cases, the null hypothesis of an exponential contact interval distribution is strongly rejected (likelihood ratio p-value < .001). The estimates are also sensitive to the assumed infectious period. Assuming a five-day infectious period and a Weibull contact interval distribution, we get R̂_0 = 3.53 (2.79, 4.30) for Vera Cruz and R̂_0 = 7.14 (6.63, 7.66) for Mexico City. Subsequent experience shows that these R_0 estimates are far too high. This bias is consistent with the results obtained above when applying mass-action estimates to simulated data generated by network-based models. Since most influenza transmission takes place in households, workplaces, and schools (Yang et al., 2009), the true underlying transmission model is probably closer to a network-based model than a mass-action model. Data on the duration of illness and, more importantly, on the social links between cases would allow better point and interval estimates of R_0. The estimates could also be improved with incomplete-data methods that took into account the discreteness of the data and allowed variability in latent, incubation, and infectious periods.

The results of the simulations confirm that standard maximum-likelihood methods can be applied successfully to survival likelihoods written in terms of the contact interval distribution. In the mass-action models, performance deteriorated noticeably in moving from exponential to Weibull contact interval distributions, possibly because U (θ 0 , T ) was closer to a normal distribution in the simpler models. No such deterioration was noticeable in the network-based models, possibly due to the addition of contact-tracing information. Our methods were deliberately simple: all point estimates were plug-in estimators and all confidence intervals were based on normal approximations for the joint distributions of the MLEs. More sophisticated methods, such as Bayesian methods, might produce point estimates and confidence intervals whose performance is even better. The methods here would adapt quite well to a Bayesian analysis, and we believe that a Bayesian framework is the most natural setting for the development of methods to analyze partially-observed epidemics.

Methods based on contact intervals can incorporate a much greater variety of transmission models than methods based on generation or serial intervals, which usually assume mass-action. The simulation results presented above show that this flexibility is essential for accurate statistical inference during an epidemic. The mass-action estimates failed spectacularly when applied to data generated by network-based models. The point estimates were severely biased upward, and all 95% confidence intervals had coverage probabilities below 85%, with most below 25%.

The methods and simulation results in this paper have important implications for data collection during an emerging epidemic. First, they require information on the onset and duration of infectiousness. For an acute infectious disease, the onset and duration of illness may provide a useful proxy, especially if there is some knowledge of the incubation period and the pattern of pathogen shedding. Second, they show the potential value of data about close contacts of cases, whether or not they are infected. Methods based on generation and serial intervals do not require such data, but this apparent advantage comes at a tremendous cost in terms of the flexibility and validity of the subsequent analysis. They are essentially missing-data methods with no complete-data counterparts, and they almost certainly understate the true data requirements for accurate estimation of R_0.

Limitations

The SEIR framework limits our methods to acute, immunizing diseases that spread person-to-person. They do not apply to many diseases of public health importance, such as tuberculosis, meningococcal or pneumococcal diseases, foodborne or waterborne diseases, or HIV/AIDS. Most (though not all) emerging infections fit into the SEIR framework, and almost all methods currently used to analyze data from emerging epidemics make this assumption. We also assumed that all times of infection, onset of infectiousness, and recovery are observed. This is clearly unsatisfactory, but the development of incomplete-data methods must be based on complete-data methods. In Section 2, we assumed that the contact interval τ_ij is independent of the infectious period ι_i of i. This simplified the likelihoods, but it is probably unrealistic. This problem could be addressed by including ι_i as a covariate in X_ij or by using multivariate survival methods. In Section 3, we assumed that the population is homogeneous. This simplified the estimation of R_0, but it is also unrealistic. In a heterogeneous population, estimates of R_0 would have to account for the distribution of relevant covariates in the population.

Despite these limitations, methods based on contact intervals and survival analysis have the potential to become important tools in infectious disease epidemiology. The purpose of this paper was to introduce survival analysis based on contact intervals as a useful complete-data method, and we have done so in the simplest setting possible. These methods can be seen as descendants of methods based on generation and serial intervals, but they are more flexible and more explicit about assumptions and data requirements.

[Figure caption, start missing] ... and exponential infectious period distributions. Estimates are nearly unbiased at low R_0, but biased upward at high R_0. Similar results were obtained in models with a fixed infectious period (not shown). The smoothed mean was produced with the R command lowess. One simulation that produced an estimated R_0 of 3730.8 (1089.9, 12,245.8) was excluded from the graph; it had a true R_0 = 10.5.

participants in the workshop "Design and Analysis of Infectious Disease Studies"

"Linking transmission models and data analysis in infectious disease epidemiology". Office space and administrative support were provided by the Fred Hutchinson Cancer Research Center.
