key: cord-0307713-lwwzacy2
authors: Kenah, Eben
title: Nonparametric survival analysis of epidemic data
date: 2011-04-21
journal: nan
DOI: 10.1111/j.1467-9868.2012.01042.x
sha: 8ef81fee8eb8b394eb6d6530ad81144a46a1af32
doc_id: 307713
cord_uid: lwwzacy2

This paper develops nonparametric methods for the survival analysis of epidemic data based on contact intervals. The contact interval from person i to person j is the time between the onset of infectiousness in i and infectious contact from i to j, where we define infectious contact as a contact sufficient to infect a susceptible individual. We show that the Nelson-Aalen estimator produces an unbiased estimate of the contact interval cumulative hazard function when who-infects-whom is observed. When who-infects-whom is not observed, we average the Nelson-Aalen estimates from all transmission networks consistent with the observed data using an EM algorithm. This converges to a nonparametric MLE of the contact interval cumulative hazard function that we call the marginal Nelson-Aalen estimate. We study the behavior of these methods in simulations and use them to analyze household surveillance data from the 2009 influenza A(H1N1) pandemic. In an appendix, we show that these methods extend chain-binomial models to continuous time.

Infectious diseases remain one of the greatest threats to human health and commerce, and the analysis of epidemic data is one of the most important applications of statistics in public health. If person i infects person j with a given disease, the generation interval is the time between the infection of i and the infection of j. The serial interval, which is often used as a proxy for the generation interval, is the time between the onset of symptoms in i and the onset of symptoms in j. Data from several recent and historical epidemics have been analyzed using methods based on generation or serial interval distributions, including the 1918 influenza (Mills et al., 2004) , severe acute respiratory syndrome (SARS) (Lipsitch et al., 2003; Wallinga and Teunis, 2004) , pandemic influenza A(H1N1) McBryde et al., 2009; Yang et al., 2009) , and avian influenza (Ferguson et al., 2005 (Ferguson et al., , 2006 .

Though the generation and serial interval distributions are often considered invariant features of an infectious disease (Fine, 2003) , these distributions can change systematically over the course of an epidemic (Svensson, 2007; Kenah et al., 2008) . When a susceptible person is exposed to multiple infectious people, his or her infector must be the first person to make infectious contact. Thus, the mean generation and serial intervals contract as the prevalence of infection increases. When transmission is rapid within groups of close contacts such as households or schools, this can occur even when the global prevalence of infection remains low (Kenah et al., 2008) .

Statistical methods for infectious disease data that use generation or serial interval distributions are based on the Lotka-Euler equation (Wallinga and Lipsitch, 2007; Roberts and Heesterbeek, 2007) or a branching-process approximation to the early spread of disease (Wallinga and Teunis, 2004; White and Pagano, 2008) . In both of these approaches, epidemic spread is modeled as a process that creates a population rather than a process of percolation through an existing population. Thus, these methods fail to account for the effects of exposure to multiple infectors and ignore information contributed by uninfected person-time. Since they assume generation intervals are independent and identically distributed, they implicitly assume a constant latent period (infection to onset of infectiousness) and a constant infectious period. When serial intervals are used as a proxy for generation intervals, they implicitly assume that the incubation period (infection to onset of symptoms) is also constant. Kenah (2010) outlined an alternative analysis of epidemic data based on contact intervals. Informally, the contact interval from an infectious person i to a susceptible person j is the time between the onset of infectiousness in i and the first infectious contact from i to j, where we define infectious contact to be a contact sufficient to infect a susceptible individual. The contact interval is similar to the generation interval, but it is defined for all possible infector-infectee pairs, not just those in which infection is actually transmitted, and it begins with the onset of infectiousness, not infection. The contact interval from i to j will be right-censored if j is infected by k = i prior to infectious contact from i or if i recovers before making infectious contact with j. This censoring is accounted for using methods from survival analysis. Statistical methods based on contact intervals avoid many of the problems with methods based on generation and serial intervals. They can incorporate a greater variety of transmission models, they use information contributed by uninfected person-time, and they allow clearer expression of analytical assumptions.

In epidemic modeling, it is common to specify infectious contact within an underlying contact process between individuals in the population. While a person i is infectious, he or she has some probability of transmitting infection to a susceptible person j each time a contact from i to j is made (e.g., Rhodes et al., 1996) . Models based on contact intervals define an infectious contact as a contact sufficient to infect a susceptible individual, ignore all contacts made by person i outside his or her infectious period, and ignore all infectious contacts from i to j after the first. This relaxes the assumption that the contact process is unaffected by infection and limits our attention to those contacts relevant to disease transmission. Any model specified in terms of an underlying contact process can be specified equivalently in terms of contact intervals. This paper outlines the nonparametric survival analysis of epidemic data based on contact intervals. The rest of Section 1 defines the general stochastic Susceptible-Exposed-Infectious-Removed (SEIR) model, describes our observed data, and reviews the parametric survival analysis of epidemic data. Section 2 shows that the Nelson-Aalen estimator produces an unbiased estimate of the cumulative hazard function of the contact interval distribution when who-infected-whom is observed. When who-infected-whom is not observed, the Nelson-Aalen estimates from all transmission networks consistent with the observed data are averaged to get a marginal Nelson-Aalen estimator. Since the transmission network probabilities are unknown, an expectation-maximization (EM) algorithm is used to iteratively reweight the possible transmission networks, producing a series of marginal Nelson-Aalen estimates that converges to a nonparametric maximum likelihood estimate (MLE) of the contact interval cumulative hazard function. Section 3 explores the performance of these estimators in simulations. Section 4 uses them to analyze household surveillance data from the 2009 influenza A(H1N1) pandemic. Section 5 discusses the limitations, advantages, and future development of these methods. Appendix A shows how our methods can be used when who-infected-whom is partially observed, and Appendix B shows how our methods generalize chain-binomial models to continuous time.

Consider a closed population of n individuals assigned indices 1 . . . n. Each individual is in one of four possible states: susceptible (S), exposed (E), infectious (I), or removed (R). Person i moves from S to E at his or her infection time t i , with t i = ∞ if i is never infected. After infection i has a latent period of length ε i , during which he or she is infected but not infectious. At time t i + ε i , i moves from E to I, beginning an infectious period of length ι i . At time t i + r i , where r i = ε i + ι i is the recovery period, i moves from I to R. Once in R, i can no longer infect others or be infected. The latent period is a nonnegative random variable, the infectious and recovery periods are strictly positive random variables, and the recovery period is finite with probability one.

An epidemic begins with one or more persons infected from outside the population, which we call imported infections. For simplicity, we assume that epidemics begin with one or more imported infections at time 0 and there are no other imported infections.

After becoming infectious at time t i + ε i , person i makes infectious contact with j = i at time t ij = t i + ε i + τ * ij , where the infectious contact interval τ * ij is a strictly positive random variable with τ * ij = ∞ if infectious contact never occurs. Since infectious contact must occur while i is infectious or never, τ * ij ∈ (0, ι j ] or τ * ij = ∞. We define infectious contact to be sufficient to cause infection in a susceptible person, so t j ≤ t ij .

For each ordered pair ij, let C ij = 1 if infectious contact from i to j is possible and C ij = 0 otherwise. We assume that the infectious contact interval τ * ij is generated in the following way: A contact interval τ ij is drawn from a distribution with hazard function λ ij (τ ). If τ ij ≤ ι i and C ij = 1, then τ * ij = τ ij . Otherwise, τ * ij = ∞. In this paper, we assume that all contact intervals have a continuous distribution and, for a fixed i or a fixed j, the contact intervals τ ij , j = i, are independent.

The parametric methods in Kenah (2010) are derived using counting processes and martingales, which are described in Kalbfleisch and Prentice (2002) and Aalen et al. (2009) . Let S i (t) = 1 t≤ti and I i (t) = 1 t∈(ti+εi,ti+ri] be the susceptibility and infectiousness processes for person i, where 1 X = 1 if X is true and zero otherwise. These processes are left-continuous and infectious contact from i to j is possible at time t only if C ij I i (t)S j (t) = 1. Following Wallinga and Teunis (2004) , let v j denote the index of the person who infected person j, with v j = 0 for imported infections and v j = ∞ for persons not infected at or before time T . The transmission network is the directed network with an edge from v j to j for each j such that t j ≤ T . It can be represented by a vector v = (v 1 , . . . , v n ). Let V j = {i : C ij I i (t j ) = 1} denote the set of persons who could have infected person j, which we call this the infectious set of person j. Let V denote the set of all possible v consistent with the observed data.

Our population has size n, and we observe the times of all S → E (infection), E → I (onset of infectiousness), and I → R (removal) transitions in the population between time 0 and time T . For all ordered pairs ij such that i is infected, we observe C ij . Let m denote the total number of infections we observe. Our goal is to estimate the contact interval distribution via its cumulative hazard function.

We begin by reviewing the parametric survival analysis of epidemic data (Kenah, 2010) , where we assume that λ ij (τ ) comes from a family of hazard functions indexed by a parameter vector θ with a unique θ 0 such that λ ij (τ ) = λ(τ ; θ 0 ) for all ij. To guarantee that the maximum likelihood estimates are consistent and asymptotically normal, we assume that, for all τ , λ(τ ; θ) and ln λ(τ ; θ) have continuous third derivatives with respect to θ in an open neighborhood of θ 0 .

1.2.1. Likelihood when who-infects-whom is observed Let N ij (t) = 1 t≥tij count the first infectious contact from i to j. By definition, N ij (t) is right-continuous with left-hand limits (cadlag). We count only the first infectious contact because j is infected at or before t ij . Since λ ij (τ ) = λ(τ ; θ 0 ) and N ij (0) = 0, the process

is a zero-mean martingale when θ = θ 0 . Since S j (u)1 u≤T is predictable, the stochastic integral

is a zero-mean martingale when θ = θ 0 . The corresponding counting process is

which counts infectious contacts from i to j while j is susceptible and before time T . The corresponding log likelihood is

which has the score process

Since it is the integral of a predictable process with respect to a zero-mean martingale, U ij (t; θ 0 ) is a zero-mean martingale. If t j ≤ T , this likelihood requires that we observe whether or not i made infectious contact with j at time t j . Now fix j. If we observe infectious contacts in all ordered pairs ij from time 0 until time T , the log likelihood is

with the score process

U ·j (t; θ 0 ) is a zero-mean martingale because it is a sum of zero-mean martingales. If t j ≤ T , the likelihood in equation (6) requires that we observe which counting process N ij jumped at t j , which is equivalent to observing which i infected j. The complete-data log likelihood when we observe who-infects-whom is ℓ(θ) = n j=1 ℓ ·j (θ) with the score process U (t; θ) = n j=1 U ·j (t; θ). Clearly, U (t; θ 0 ) is a zero-mean martingale.

Now assume that we do not observe who infected person j. The corresponding counting process is

and the process M ·j (t; θ) = i =j M ij (t; θ) is a zero-mean martinale when θ = θ 0 . Expanding M ·j (t; θ), we get

where

is the total hazard of infectious contact with j at time t given θ. When person j is observed from time 0 to time T , the log likelihood is

with the score process

U ·j (t; θ 0 ) is a zero-mean martingale because it is the integral of a predictable process with respect to M ·j (t; θ 0 ). The log likelihood for the complete observed data when we do not observe who-infected-whom is ℓ(θ) = n j=1 ℓ ·j (θ) with score process U (t; θ) = n j=1 U ·j (t; θ), which is a zero-mean martingale when θ = θ 0 .

In this section, we extend the Nelson-Aalen estimator (Altshuler, 1970; Nelson, 1972; Aalen, 1978) to derive a nonparametric estimators of the cumulative hazard function of the contact interval distribution. We continue to assume that the contact interval distribution is continuous, but we drop the dependence of the hazard function on a parameter vector θ, so it is simply λ(τ ). Let

denote the cumulative hazard function. We call τ the infectiousness age. To derive parametric estimators of the contact interval distribution, we used counting processes and martingales defined in absolute time. To derive nonparametric estimators, we use counting processes and martingales defined in infectiousness age. Whereas absolute time processes are defined for all ordered pairs ij, infectiousness age processes are defined only for those ij in which t i < ∞. In our notation, we use the argument t for absolute time and τ for infectiousness age. Martingales and counting processes in infectiousness age are given an asterisk superscript.

We show that the Nelson-Aalen estimator generates an unbiased estimate of Λ(τ ) when who-infects-whom is observed. When who-infects-whom is not observed, we average the Nelson-Aalen estimates from all v ∈ V to get a marginal Nelson-Aalen estimator. Since the probability of each v ∈ V is unknown, we use an EM algorithm (Dempster et al., 1977) to iteratively reweight each possible v, obtaining a sequence of marginal Nelson-Aalen estimates that converges to a nonparametric MLE of Λ(τ ). Finally, we show how these estimates can be approximated in mass-action SEIR models using only data about infected persons. In Appendix A, we show how these methods adapt to a situation where the transmission network is partially observed.

First, we assume that we observe who infected whom. For each person i such that t i +ε i ≤ T , let N * ij (τ ) = 1 τ ≥τij count the first infectious contact from i to j at or before infectiousness age τ in person i, and let I * i (τ ) = 1 τ ∈(0,ιi] indicate whether i remains infectious at infectiousness age τ . By definition, N * ij (τ ) is cadlag and

is a zero-mean martingale. Now let S * ij (τ ) = S j (t i + ε i + τ ) indicate the susceptibility of person j at infectiousness age τ of person i and let T i = T − t i − ε i denote the time between the onset of infectiousness in i and the end of observation. Since S * ij (τ )1 τ ≤Ti is predictable,

is a zero-mean martingale corresponding to the counting process

which counts infectious contacts from i to j prior to time T while j remains susceptible. Now fix j. Since it is a sum of zero-mean martingales,

is a zero-mean martingale, where

is a counting process that jumps at the infectiousness age at which v j (the infector of j) makes infectious contact with j and

denotes the number of persons who could have infected j at infectiousness age τ . Note that Y ·j (τ ) is left-continuous and decreasing in τ .

Finally, consider the combined observations of all j. Since it is a sum of zero-mean martingales,

is a zero-mean martingale, where

counts the number of observed infectious contacts with susceptible individuals occuring at infectiousness age ≤ τ and

denotes the number of contact intervals of length ≥ τ that were observed. Like each Y ·j (τ ), Y (τ ) is left-continuous and decreasing in τ .

We are now ready to show that the standard Nelson-Aalen estimator will produce an unbiased estimate of Λ(τ ). LetΛ

be the Nelson-Aalen estimate of Λ(τ ) and let T = max{τ : Y (τ ) > 0}. Then

is a zero-mean martingale, where τ ∧ T = min(τ, T ) . Therefore,Λ(τ ) is an unbiased estimate of Λ(τ ) for all τ ∈ [0, T ]. For these τ , a variance estimate forΛ(τ ) − Λ(τ ) is given by its optional variation procesŝ

Using the martingale central limit theorem and a log transformation, we get the approximate pointwise 1 − α confidence limitŝ

where Φ is the cumulative distribution function of the standard normal distribution. Note that the point estimateΛ(τ ) in equation (23) is valid for an arbitrary contact interval distribution but the variance estimateσ 2 (τ ) in equation (25) and the approximate confidence interval in equation (26) assume a continuous contact interval distribution.

Now assume that we do not observe who-infected-whom. We can no longer calculate the Nelson-Aalen estimate in equation (23) because we do not know which of the contact intervals are censored. However,

by the law of iterated expectation, so we can still obtain an unbiased estimate of Λ(τ ).

LetΛ v (τ ) denote the value ofΛ(τ ) that we would have calculated had we observed the transmission network v. Then

is an unbiased estimate of Λ(τ ) for each τ ∈ [0, T ]. For each v ∈ V, Pr(v|observed data) can be calculated if we know λ(τ ) (Kenah et al., 2008) . For each person j with t j < ∞, the probability that j was infected by person i ∈ V j is

and the probability of a given transmission network

Note that equations (29) and (30) assume a continuous contact interval distribution, which ensures that simultaneous infectious contacts have probability zero. With a known or estimated λ(τ ), it is easy (but unnecessary) to calculate a marginal Nelson-Aalen estimate. We outline this calculation and then use it as the basis of an EM algorithm that converges to a nonparametric MLE of Λ(τ ). First, let

which is a cadlag step function with a jump of size

By equations (28) through (30), the marginal Nelson-Aalen estimate given λ(τ ) is

If λ(τ ) is the true contact interval hazard function, this is an unbiased estimate of Λ(τ ) for each τ ∈ [0, T ]. The whole point of calculating Λ(τ ) is that λ(τ ) is unknown, but equation (33) can be used in the following EM algorithm:

(a) Begin with an initial λ (0) (τ ). This could be a parametric estimate or a constant. Use this to calculate an initial marginal Nelson-Aalen estimate Λ(τ |λ (0) ). (b) Use Λ(τ |λ (k) ) to calculate a new hazard function estimate λ (k+1) (τ ). (c) Update the probabilities in equation (29) using λ (k+1) (τ ). (d) Use the updated probabilities to calculate Λ(τ |λ (k+1) ) using equation (33). (e) Repeat the smoothing step (b), the E-step (c) and the M-step (d) until the sequence Λ(τ |λ (k) ) converges.

The limit of this sequence is the marginal Nelson-Aalen estimate, which we denote Λ(τ ).

To show that this is truly an EM algorithm, we must prove that Λ(τ |λ (k) ) maximizes the expected log likelihood given λ (k) (τ ). To do this, we adapt the argument of Kaplan and Meier (1958) that their product-limit estimate of the survival function is a nonparametric MLE. Let τ 1 , . . . , τ M denote the distinct infectiousness ages at which infectious contacts might have been made, and let S(τ ) denote an arbitrary survival function for the contact interval distribution. The conditional probability of failure (i.e., infectious contact) at infectiousness age τ j given survival (i.e., no infectious contact) until τ j is

The j th term of G(S) is maximized by

so the survival function that maximizes G(S) is

where π represents the product integral described in Kalbfleisch and Prentice (2002) and Aalen et al. (2009) . The Nelson-Aalen estimate corresponding to the Kaplan-Meier estimate in equation (38) is

Therefore, each iteration of the proposed EM algorithm maximizes the expected log likelihood, so the marginal Nelson-Aalen estimate is a nonparametric MLE of Λ(τ ). The corresponding marginal Kaplan-Meier estimator is a nonparametric MLE of S(τ ). The variance σ 2 (τ ) of Λ(τ ) can be estimated using the conditional variance formula. Conditioning on the transmission network v, we get

whereΛ v (τ ) is the Nelson-Aalen estimate from equation (23) andσ 2 v (τ ) is the variance estimate from equation (25) that we would have calculated had we observed the transmission network v. Let λ(τ ) denote the hazard function estimate corresponding to Λ(τ ), and let N * (τ ) = N * τ λ . The first term of equation (40) reduces to

and the second term reduces to

where N * ·j (τ ) = N * ·j τ λ . Therefore,

Using the martingale central limit theorem and a log transformation, we get the approximate pointwise 1 − α confidence limits

Obtaining λ (k+1) (τ ) from Λ(τ |λ (k) ) in step (b) of the EM algorithm is nontrivial. In general, the increments of Λ(τ |λ (k) ) cannot be used directly as λ (k+1) (τ ). To see this, consider a person j with t j ≤ T . For each person i ∈ V j , let p (k)

ij denote the estimated probability that i infected j in the k th iteration of the EM algorithm, with p (0) ij denoting the initial estimate. In the first iteration, we get p

Since Y (τ ) is decreasing in τ , the EM algorithm would converge to a marginal Nelson-Aalen estimate where the i ∈ V j with the longest τ ij is assigned probability one of being v j . This problem arises because equation (29) assumes a continuous contact interval distribution, which ensures that simultaneous infectious contacts have zero probability. Since the contact interval distribution is continuous, the intervals between the increments of Λ(τ |λ (k) ) contain information about λ(τ ) that is ignored if these increments are used directly as an estimate of λ(τ ). Thus, obtaining λ (k+1) (τ ) from Λ(τ |λ (k) ) requires some form of smoothing. The algorithm depends on the smoothed estimate of λ(τ ) only to calculate the probabilities p ij in equation (29). In our experience, cubic smoothing splines and kernel smoothers produce nearly identical results, so the algorithm seems insensitive to reasonable choices among smoothers.

In a mass-action model, C ij = 1 for all ordered pairs ij but the hazard of infectious contact is inversely proportional to the population size n. More specifically, if Λ n (τ ) is the cumulative hazard function of the contact interval distribution in a population of size n, then

where we call Λ * (τ ) the normalized cumulative hazard function of the contact interval distribution. With knowledge of n, the methods of the previous two sections could be used to estimate Λ n (τ ) and Λ * (τ ). In Kenah (2010) , it was shown that the full parametric likelihoods for mass-action SEIR models can be approximated by likelihoods that depend only on data about infected individuals. Here, we derive approximate nonparametric estimates of Λ * (τ ) that depend only on data about infected individuals. For a fixed number m of infections, these approximations become exact in the limit as n → ∞. Thus, they can be used to analyze data from the early stages of an epidemic, when m ≪ n.

Let Y * (τ ) = i I * i (τ )1 τ ≤Ti denote the number of infected persons who remain infectious and under observation at infectiousness age τ . By equations (19) and (22),

Since

LetΛ n (τ ) be the Nelson-Aalen estimate of Λ(τ ) obtained in a population of size n when who-infects-whom is observed, and let

Since n ≫ m, Y (u) > 0 if and only if Y * (u) > 0. Thus,

by equations (23) and (48). Therefore,Λ * (τ ) → (n−1)Λ n (τ ) for a fixed m as n → ∞. Sincê Λ(τ ) is a consistent estimator of Λ n (τ ), this implies thatΛ * (τ ) is a consistent estimator of Λ * (τ ) if we take limits first as n → ∞ and then as m → ∞. Similarly,

is a consistent estimate of the variance ofΛ * (τ ). Using the martingale central limit theorem and a log transformation, we get the asymptotic 1 − α confidence limitŝ

where we take limits first as n → ∞ and then as m → ∞.

When who-infects-whom is not observed, we have the approximate marginal Nelson-Aalen estimate

Since the probabilities p ij in equation (29) depend on λ (k+1) (τ ) only up to a multiplicative constant, they can be estimated by smoothing the normalized cumulative hazard function Λ * (τ |λ (k) * ). Thus, Λ * (τ |λ * ) can replace Λ(τ |λ) in the EM algorithm from Section 2.2. Let Λ * (τ ) denote the marginal Nelson-Aalen estimate to which this algorithm converges, let λ * (τ ) denote the corresponding normalized hazard function, and let N * (τ ) = N * τ λ * . The variance of Λ * (τ ) is

where N * ·j (τ ) = N * ·j τ λ * . The asymptotic 1 − α confidence limits are

To evaluate the performance of the methods from Section 2, we conducted simulations of network-based and mass-action epidemics. In each simulation, we used data on the first 1,000 infections from an epidemic in a population of 100,000 to calculate the Nelson-Aalen and marginal Nelson-Aalen estimates of Λ(τ ) and the Kaplan-Meier and marginal Kaplan-Meier estimates of the corresponding survival function. For each of these estimators, we looked at whether the confidence interval contained the true value of the estimated function at the 5 th , 10 th , 25 th , 50 th , 75 th , 90 th , and 95 th percentiles of all possible (i.e., censored and uncensored) contact intervals. After 1,000 simulations, we calculated the coverage probability for each estimator at each quantile. For each coverage probability, we calculated an exact 95% confidence interval. All models had a latent period of zero and an exponential infectious period with mean one. The smoothing step of the EM algorithm was done via an inverse-variance weighted cubic smoothing spline with a smoothing parameter chosen by generalized cross-validation (Wegman and Wright, 1983) . Convergence of the EM algorithm was monitored by calculating the mean of the absolute values of the differences between the current and previous marginal Nelson-Aalen estimates at the 5 th , 10 th , 15 th ,. . . , 85 th , 90 th , and 95 th percentiles of the possible contact intervals. Informally, we call this the "L1 difference". All EM algorithms began by assuming that all possible transmission networks were equally likely, which is equivalent to assuming an exponential contact interval distribution. We ran the algorithm for a minimum of 5 iterations, and it was continued until an L1 difference less than a specified tolerance was achieved or the 50 th iteration was completed.

All epidemic models were written in Python (www.python.org) using the NumPy and SciPy packages (www.scipy.org). For network-based models, our contact networks were generated using the NetworkX package (networkx.lanl.gov). Statistical analysis was conducted in R (www.r-project.org) via the RPy2 package (rpy.sourceforge.net). The code for the models and estimators is available as Online Supplementary Information.

In all network-based simulations, the contact network was a Watts-Strogatz small-world network (Watts and Strogatz, 1998) , which mimics the high clustering and low diameter of real human contact networks. Starting with a ring of 100,000 nodes, each node was connected to its 10 nearest neighbors. Each edge was then rewired to a randomly chosen node with probability .1. Thus, 34.9% of nodes are connected only to their ten nearest neighbors, 38.7% are have one long-distance edge, 19.4% have two long-distance edges, 5.7% have three long-distance edges, and 1.3% have four or more long-distance edges. This network structure gave us high clustering, so infected nodes typically had several possible infectors. A new contact network was built for each simulation, so the results do not reflect the idiosyncrasies of any particular realization of the network.

We used Weibull(.5, 1), exponential(1), and Weibull(2, 1) contact interval distributions, where Weibull(s, r) denotes a Weibull distribution with shape parameter s and rate parameter r. The corresponding cumulative hazard functions are:

Λ(τ ) = τ for the exponential(1) distribution, and (57) Λ(τ ) = τ 2 for the Weibull(2, 1) distribution.

Note that the exponential(1) distribution is the same as a Weibull(1, 1) distribution. The hazard of infectious contact decreases throughout the infectious period in the Weibull(.5, 1) case, remains constant in the exponential(1) case, and increases throughout the infectious period in the Weibull(2, 1) case.

The L1 tolerance for the EM algorithm was set to .0005. This was approximately the minimum tolerance consistently achieved by the EM algorithm before the L1 differences became a chaotic sequence of small numbers.

The EM algorithm for the marginal Nelson-Aalen estimate converged easily in all simulations. For models with exponential(1) contact intervals, the desired tolerance was achieved within 5 iterations for all models. For Weibull(.5, 1) contact intervals, the maximum number of iterations was 6. For the Weibull(2, 1) contact intervals, the maximum number was 8. Table 1 shows the coverage probabilities and exact 95% confidence intervals for the Nelson-Aalen, marginal Nelson-Aalen, Kaplan-Meier, and marginal Kaplan-Meier estimators in models with Weibull(.5, 1) and Weibull(2, 1) contact interval distributions. Coverage probabilities for the Nelson-Aalen and Kaplan-Meier estimators was close to .95 across the range of available data for both models. Coverage probabilities for the marginal Nelson-Aalen and Kaplan-Meier estimators was slightly lower but above .90 in all cases. Coverage probabilities for the model with exponential contact intervals were above .93 for all estimators and data quantiles (not shown). Figures 1 and 2 show good agreement between the estimated and true cumulative hazard and survival functions for the contact interval distribution across the range of available data. The true cumulative hazard function (top) and survival function (bottom) are indicated by dashed lines. The Nelson-Aalen and Kaplan-Meier estimates (left) use information on who-infected-whom. The marginal Nelson-Aalen and Kaplan-Meier estimates (right) do not use this information. The point clouds are the point estimates at the 5 th , 10 th , 25 th , 50 th , 75 th , 90 th , and 95 th percentiles of the possible contact intervals in each of 1,000 simulations. The clouds for the 5 th and 10 th and the 90 th and 95 th percentiles often run together, leaving five separate clouds of points.

For mass-action models, we used Weibull(.5, 5), exponential(2), and Weibull(2, 1) normalized contact interval distributions. The corresponding cumulative hazard functions are: Λ * (τ ) = √ 5τ for the Weibull(.5, 5) distribution (59) Λ * (τ ) = τ for the exponential(1) distribution, and

Λ * (τ ) = τ 2 for the Weibull(2, 1) distribution.

In a mass-action model, a Weibull(.5, 1) distribution for the normalized contact interval distribution produces R 0 = .89, so no epidemics occur. Changing the rate parameter to 5 produces R 0 = 1.98, which is very close to the R 0 = 2 of the other two models. These are R 0 estimates assuming a network with no clustering (Kenah, 2010) , which places an upper bound on the true R 0 in networks with clustering. The hazard of infectious contact decreases throughout the infectious period in the Weibull(.5, 5) case, remains constant in the exponential(1) case, and increases throughout the infectious period in the Weibull(2, 1) case. The L1 tolerance for the EM algorithm was set to .005. This was approximately the minimum tolerance consistently achieved by the EM algorithm before the L1 differences became a chaotic sequence of small numbers.

The EM algorithm for the marginal Nelson-Aalen estimate took longer to converge for mass-action models than for network-based models, despite the greater tolerance. For models with exponential contact intervals, the maximum number of iterations required was 19. For Weibull(2, 1) contact intervals, the maximum was 29. For Weibull(.5, 5) contact intervals, the maximum was 32. Table 2 shows coverage probabilities and exact 95% confidence intervals for the Nelson-Aalen, marginal Nelson-Aalen, Kaplan-Meier, and marginal Kaplan-Meier estimators in models with Weibull(.5, 5) and Weibull(2, 1) normalized contact interval distributions. Coverage probabilities for the Nelson-Aalen and Kaplan-Meier estimators were close to .95 across the range of available data for both models. In contrast, the marginal Nelson-Aalen and Kaplan-Meier estimators were spectacular failures. In the model with exponential normalized contact interval distributions, coverage probabilities for Nelson-Aalen and Kaplan-Meier estimators were all above .93 and coverage probabilities for the marginal Nelson-Aalen and Kaplan-Meier estimators were all above .99. Figures 3 and 4 show good agreement between the estimated and true survival and cumulative hazard functions when who-infected-whom is observed (left) but poor agreement when who-infected-whom is not observed (right). The dashed lines and clouds of points have the same interpretation as in Figures 1 and 2 .

The number of possible infectors for each infection is much, much larger in a massaction model than in a network-based model. It seems likely that the EM algorithm cannot usefully assign probabilities to the possible infectors of each infected person because the asymptotic approximation to the marginal Nelson-Aalen estimate is not sufficiently precise and because of limited numerical precision. In the case of exponential contact intervals, the initial step of the EM algorithm correctly guesses that all possible infectors are equally likely to have been the source of infection, so the EM algorithm converges. The large coverage probabilities suggest that the variance approximation from equation (54) overestimates the true variance.

In this section, we use the marginal Nelson-Aalen estimator to analyze pandemic influenza A(H1N1) (pH1N1) surveillance data collected by the Los Angeles County Department of Public Health between April 22 and May 19, 2009. The data was collected according to the following protocol :

(a) Individuals who presented to healthcare providers or the county health department with acute febrile respiratory illness (AFRI, defined as fever ≥ 100 • F plus cough, sore throat, or runny nose) had nasopharyngeal swabs and aspirates tested for pandemic influenza A(H1N1). Those who tested positive are the index cases. (b) The index cases were interviewed by telephone and asked to report AFRI episodes among household contacts, including the dates of illness onset. Additional AFRI episodes among household contacts were ascertained during follow-up interviews 14 days after the illness onset of the index case. All cases of AFRI in the household after 10 days prior to symptom onset in the index case were assumed to be pH1N1 cases.

The earliest pH1N1 case in each household is the primary case; this is usually (but not always) the index case.

There were 58 households with a total of 299 members. In these households, there were 62 primary cases (four households had co-primary cases) and 35 secondary cases. In 51 of the 58 households, the index case was a primary case. This is a good example of household surveillance data from the early stages of an epidemic. As in the simulations, our goal is to estimate the cumulative hazard function of the contact interval distribution. We compare the marginal Nelson-Aalen estimate with parametric estimates obtained using the methods from Kenah (2010) . We use the corresponding marginal Kaplan-Meier estimate to estimate the probability that an infected person makes infectious contact with a given household member during his or her infectious period, which we call the household infectious contact probability.

Our natural history assumptions are adapted from Yang et al. (2009) . In the primary analysis, we assumed an incubation period of 2 days, a latent period of 0 days, and an infectious period of 6 days. This means that a person i with onset of symptoms on day t sym i is infected on day t i = t sym i − 2, has onset of infectiousness on day t i , and recovers from infectiousness on day t i + 6. In discrete time, the first day on which i is able to infect other persons is the day after his or her onset of infectiousness, which is t i +1 = t sym i −1 under our assumptions. If day 0 is the infection time, then we have the first secondary transmissions on day 1, the onset of symptoms on day 2, and recovery on day 6. In a sensitivity analysis, we vary the incubation period from 1 day to 3 days, the infectious period from 5 days to 7 days, and the latent period from 0 days to 1 day. Figure 5 shows the marginal Nelson-Aalen estimate of the cumulative hazard function of the contact interval distribution, along with approximate 95% confidence limits and parametric estimates assuming exponential and Weibull contact interval distributions. A cumulative hazard estimate assuming a gamma contact interval distribution (not shown) was almost exactly the same as that assuming a Weibull distribution.

The exponential, Weibull, and marginal Nelson-Aalen point estimates of the cumulative hazard at 6 days post-infection are .0687, .0692, and .0729, respectively. The corresponding survival probabilities are .9334, .9331, and .9296. Thus, all three estimates indicate a household infectious contact probability of .07 over the course of the infectious period. The approximate 95% confidence interval for the marginal Nelson-Aalen estimate is (.0503, .1056). The corresponding 95% confidence interval for the marginal Kaplan-Meier estimate of the survival probability is (.8997, .9509). Therefore, our nonparametric estimate of the household infectious contact probability is .07 (.05, .10).

While the nonparametric and parametric estimates agree on the household infectious contact probability, they differ in their estimates of distribution of infectiousness over time. The exponential estimate inherently predicts a constant hazard of infectious contact over the entire infectious period. The Weibull estimate also predicts little variation in the hazard of infectious contact over the infectious period, with slightly lower infectiousness near the beginning and slightly higher infectiousness near the end. The marginal Nelson-Aalen curve has much larger jumps on days 1, 2, and 3 following infection than on days 5 and 6, which places the highest infectiousness on the day prior to, the day of, and the day after the onset of symptoms. According to this estimate, only 12% of infectious contacts occur > 2 days after the onset of symptoms and only 6% occur > 3 days after symptom onset. Figure 6 shows the results of a sensitivity analysis in which we vary the assumed (a) incubation period, (b) latent period, and (c) infectious period. In panel (d), we vary all three of these intervals simultaneously to generate the greatest possible variation. The results are insensitive to the incubation period and infectious period but sensitive to the latent period. However, there is good evidence that little or no transmission occurs two or more days prior to the onset of symptoms (Donnelly et al., 2011) .

Our nonparametric estimate of .07 (.05, .10) for the household infectious contact probability is very similar to the .06 (.03, .11) estimated using data collected by Public Health-Seattle and King County during April-May, 2009 . This estimate was obtained using a chain-binomial model to estimate the probability of infectious contact with a given household member on each day of the infectious period and then calculating the probability that infectious contact occurs at least once, making it directly comparable with our estimate. The relationship between contact intervals and chain binomial models is explained in Appendix B.

It is more difficult to compare our estimated household infectious contact probability with estimates of the household secondary attack rate (SAR), which are usually obtained by calculating the proportion of all susceptible members of study households who have the onset of a given set of symptoms within a given time period relative to symptom onset in the index case. Among these estimates for pH1N1 are: .145 (.129, .164) for influenzalike illness (ILI) within 7 days after the index case symptom onset (Carcione et al., 2011) , .13 for acute respiratory infection (ARI) and .9 for ILI within 1-9 days after the index case symptom onset (Morgan et al., 2010) , .113 (.088, .137) for ILI up to 23 days after index case symptom onset (France et al., 2010) , and .13 for ARI and .10 for ILI 7 days before or after the index case symptom onset . We did the following simulations to see if our estimated household infectious contact probability of .07 is consistent with these estimates:

(a) For each of the 58 households in the LA data, we simulated an epidemic using its observed number of index cases and an household infectious contact probability of .07. (b) After the epidemics in all households completed, we calculated the proportion of all household contacts who were infected.

Repeating this 10, 000 times, we got a mean household SAR of .12 with a bootstrap percentile 95% confidence interval of (.06, .19), which places us solidly within the range of previous household SAR estimates. Our nonparametric estimate of the distribution of infectiousness over time is consistent with estimates of the serial interval distribution for pH1N1 by Donnelly et al. (2011) , who found high infectiousness on days 0, 1, and 2 following symptom onset with only 18% of transmission occurring > 2 days after the onset of symptoms and only 5% occurring > 3 days after symptom onset. A similar pattern of high infectiousness early in the infectious period was found by Cauchemez et al. (2009) , who also used a serial interval distribution. Though theoretical and practical problems with the serial interval distribution were noted in Section 1, the similarity between our results and those of Donnelly et al. (2011) and Cauchemez et al. (2009) suggests that the marginal Nelson-Aalen estimate captured an important feature of pH1N1 transmission that was missed by the parametric contact interval distribution estimates.

The marginal Nelson-Aalen estimator suffers many of the same limitations as the parametric estimators in Kenah (2010) and some of its own. In roughly decreasing order of difficulty, these include:

(a) The SEIR framework is limited to acute, immunizing infectious diseases that spread person-to-person, so our methods do not apply directly to tuberculosis, HIV/AIDS, many foodborne and waterborne diseases, pneumococcal and meningococcal diseases, and other infectious diseases of major public health importance. (b) We have assumed that contact intervals are independent of infectious periods, which is unrealistic. Both are affected by the interaction between the host and the pathogen. (c) We have assumed that infection times, times of onset of infectiousness, and times of recovery from infectiousness are observed, which is unrealistic. Usually, only symptom onset times are observed. In the analysis of the Los Angeles County household data in Section 4, we were forced to assume constant latent, incubation, and infectious periods because we had only symptom onset times. The implicit assumption of constant latent, incubation, and infectious periods in methods based on generation and serial intervals was one of our main criticisms in Section 1. Unfortunately, our analysis merely made these same assumptions explicit. (d) We assumed that the contact interval distribution is identically distributed for all ordered pairs ij such that C ij = 1, which is unrealistic. For instance, age was found to have an effect on pH1N1 transmission in most of the papers cited in Section 4 France et al., 2010; Morgan et al., 2010; Carcione et al., 2011; . The effects of covariates on the transmission of disease is a central concern of infectious disease epidemiology.

(e) There are several technical issues that need further exploration. A more rigorous theoretical analysis of convergence of the marginal Nelson-Aalen estimator, including the smoothing step, is needed to better understand the conditions under which it will work. Convergence criteria need to be more carefully examined, confidence intervals for small samples need to be developed, and more general stopping times for the end of observation need to be considered.

Nonetheless, survival analysis based on contact intervals is a powerful framework for the development of statistical methods in infectious disease epidemiology. These methods have important theoretical and practical advantages over methods based on generation or serial intervals, including greater flexibility in the choice of an underlying transmission model, greater clarity about data requirements and analytical assumptions, the exploitation of information contributed by uninfected person-time, and validity throughout the course of an epidemic. The theory and methods of survival analysis offer many possibilities for overcoming the limitations listed above.

The arguments for convergence of the Nelson-Aalen estimator given in Section 2 are largely heuristic, but the theoretical analysis required to address limitation (e) is complicated by the smoothing step. In simulations, the performance of the marginal Nelson-Aalen estimator did not appear sensitive to the smoothing method. However, a smoother designed specifically for hazard functions, such as that of Müller and Wang (1994) , may improve performance under certain conditions. Another technical limitation is that the variance estimates based on equation (40) treat the estimated transmission network probabilities as known, ignoring a component of the variance that may be important in small samples. These and other technical issues need to be resolved with more rigorous theoretical analysis and further simulation studies.

Limitation (d) can be addressed by incorporating the marginal Nelson-Aalen estimate into a semiparametric regression model for the effects of covariates on the hazard of infectious contact. Suppose the hazard of infectious contact from person i to person j at infectiousness age τ of i is

where β is a parameter vector and X ij is a vector of infectiousness covariates for i, susceptibility covariates for j, and pairwise covariates for ij. The methods of Section 2 could be extended to nonparametrically estimate λ 0 (τ ) by adding a step to the EM algorithm that maximized the expected log likelihood over β. This would yield a semiparametric estimator of the effects of covariates on infectiousness and susceptibility that is similar in spirit to the Cox proportional hazards model. In infectious disease epidemiology, these covariate effects are usually a higher priority than the distribution of infectiousness over time, so the marginal Nelson-Aalen estimator may find its most important applications behind the scenes of this regression model. Limitation (c) could be addressed by adopting a semiparametric Bayesian framework (Sinha and Dey, 1997) or using a profile sampler (Lee et al., 2005) . A semiparametric Bayesian approach would use a prior process instead of a marginal Nelson-Aalen estimator. A profile sampler would take advantage of the fact that the marginal Nelson-Aalen estimate is an efficient numerical method of obtaining a likelihood maximized over all possible contact interval distributions for a given set of infection times, infectiousness onset times, and recovery times.

Limitation (b) could be addressed by using multivariate survival methods to estimate the joint distribution of the contact interval and the infectious period, and limitation (a) could be addressed by allowing individuals to experience multiple infection events or infection events of different types (e.g., new infection, carriage, relapse).

Survival analysis based on contact intervals is a promising approach to the statistical analysis of infectious disease data. The marginal Nelson-Aalen estimator extends a beautiful and powerful set of nonparametric methods from standard survival analysis to infectious disease epidemiology. The simulations in Section 3 show that these methods are reliable, and the data analysis in Section 4 shows that the nonparametric methods in this paper are an important extension of the parametric methods in Kenah (2010) . Each of the ingredients of this approach have been applied to infectious diseases before: counting processes and martingales by Becker (1989) and Rhodes et al. (1996) , the EM algorithm by Becker (1997) , and summation over possible transmission networks by Wallinga and Teunis (2004) . The marginal Nelson-Aalen estimator combines these in a novel and useful way, correcting approaches based on generation and serial intervals, generalizing the chain binomial model, and pointing the way to more flexible and practical statistical methods for infectious disease epidemiology. Wallinga, J. and P. Teunis (2004) 

In a partially-observed transmission network, we observe the infectors of a subset of all infected persons. By taking V j = v j for all j such that the infector is observed, equations (28) through (33) define a marginal Nelson-Aalen estimator that takes this additional information into account. For each j with an observed v j , N * ·j (τ ) = N vj j (τ ). The variance estimate in equation (43) can be used to obtain confidence intervals via equation (44).

In this appendix, we show that the methods of Section 2 reproduce classical chain-binomial models when the contact interval distribution is discrete. In the case of a continuous contact interval distribution (Kenah et al., 2008; Kenah, 2010) , the likelihood when who-infectedwhom is not observed is

where

, this is equivalent to the integrated likelihood maximized by the EM algorithm in Section 2.

When the contact interval and latent period distributions are discrete, we must account for the possibility of tied infectious contact times. Instead of each infected person j having a single infector, the source of his or her infection can be any nonempty subset V j ⊆ V j . Therefore, the hazard term for j in the likelihood becomes

where

is the conditional probability of infectious contact at τ given no infectious contact before τ . Equation (64) contains all terms in

so the likelihood for a discrete contact interval distribution can be written

Since 1 − λ(τ ) can be interpreted as an infection escape probability, the likelihood in equation (68) is precisely the same as that of a chain-binomial model (Rampey et al., 1992) . Therefore, chain-binomial models can be defined in terms of contact interval distributions, and the methods of this paper can be seen as an extension of chain-binomial models to continuous time. 

Time since onset of infection (τ) Survival Fig. 3 . Estimates of the cumulative hazard and survival functions for mass-action models with a Weibull(.5, 5) normalized contact interval distribution using the methods from Section 2.4, with (left) and without (right) data on who-infected-whom. The true cumulative hazard function (top) and survival function (bottom) are indicated with dashed lines. The point clouds are point estimates at the 5 th , 10 th , 25 th , 50 th , 75 th , 90 th , and 95 th percentiles of the possible contact intervals in each of 1,000 simulations. Note the deviation of the marginal Nelson-Aalen and Kaplan-Meier estimates from the true functions. For comparison, we have parametric estimates of the cumulative hazard function assuming exponential and Weibull contact interval distributions. The assumed latent period for all analyses was 2 days, so symptom onset occurs on day 2 after the onset of infectiousness. Note the large jumps on days 1, 2, and 3 following the onset of infectiousness and the small jumps on days 5 and 6. 

Days since onset of infectiousness Cumulative hazard 2, 0, 6 days 2, 1, 5 days 1, 0, 5 days Fig. 6 . Sensitivity analyses for the nonparametric analysis of the LA county household data. There is little sensitivity to the assumed incubation and infectious periods. There is greater sensitivity to the assumed latent period, but there is good evidence for transmission on the day prior to symptom onset and little evidence of significant transmission ≥ 2 days prior to symptom onset Donnelly et al., 2011) .

Nonparametric inference for a family of counting processes

Survival and Event History Analysis: A Process Point of View. Statistics for Biology and Health

Theory for the measurement of competing risks in animal experiments

Analysis of Infectious Disease Data. Monographs on Statistics and Applied Probability

Uses of the EM algorithm in the analysis of data on HIV/AIDS and other diseases

Secondary attack rate of pandemic influenza A(H1N1) 2009 in Western Australia

Household transmission of 2009 pandemic influenza A(H1N1) virus in the United States

Maximum likelihood from incomplete data via the EM algorithm

Serial intervals and the temporal distribution of secondary infections within households of 2009 pandemic influenza A(H1N1): implications for influenza control recommendations

Strategies for containing an emerging influenza pandemic in Southeast Asia

Strategies for mitigating an influenza pandemic

The interval between successive cases of an infectious disease

Household transmission of 2009 influenza A(H1N1) virus after a schoolbased outbreak

Pandemic potential of a strain of influenza A (H1N1): Early findings

The Statistical Analysis of Failure Time Data

Nonparametric estimation from incomplete observations

Contact intervals, survival analysis of epidemic data, and estimation of R 0

Generation interval contraction and epidemic data analysis

The profile sampler

Transmission dynamics and control of severe acute respiratory syndrome

Early transmission characteristics of influenza A(H1N1)v in Australia: Victorian State

Transmissibility of 1918 pandemic influenza

Household transmission of pandemic (H1N1)

Hazard rate estimation under random censoring with varying kernels and bandwidth

Theory and applications of hazard plotting for censored failure data

A discrete-time model for the statistical analysis of infectious disease incidence data

Counting process models for infectious disease data: Distinguishing exposure to infection from susceptibility

Model-consistent estimation of the basic reproduction number from the incidence of an emerging infection

Semiparametric Bayesian analysis of survival data

The effect of age on transmission of 2009 pandemic influenza A(H1N1) in a camp and associated households

Accounting for unobserved immunity and asymptomatic infection in the early household transmission of the pandemic influenza A (H1N1)

A note on generation times in epidemic models

How generation intervals shape the relationship between growth rates and reproductive numbers