key: cord-0127993-1vtzahag
authors: Polyakov, Pavel; Breban, Romulus
title: Bayesian monitoring of emerging infectious diseases
date: 2016-03-07
journal: nan
DOI: nan
sha: f3755a2f5ff6b3027b24baecfec0d1e412e1367b
doc_id: 127993
cord_uid: 1vtzahag

We define data analyses to monitor a change in R, the average number of secondary cases caused by a typical infected individual. The input dataset consists of incident cases partitioned into outbreaks, each initiated from a single index case. We split of the input dataset into two successive subsets, to evaluate two successive R values, according to the Bayesian paradigm. We used the Bayes factor between the model with two different R values and that with a single R value to justify that the change in R is statistically significant. We validated our approach using simulated data, generated using known R. In particular, we found that claiming two distinct R values may depend significantly on the number of outbreaks. We then reanalyzed data previously studied by Jansen et al. [Jansen et al. Science 301 (5634), 804], concerning the effective reproduction number for measles in the UK, during 1995-2002. Our analyses showed that the 1995-2002 dataset should be divided into two separate subsets for the periods 1995-1998 and 1999-2002. In contrast, Jansen et al. take this splitting point as input of their analysis. Our estimated effective reproduction numbers R are in good agreement with those found by Jansen et al. In conclusion, our methodology for detecting temporal changes in R using outbreak-size data worked satisfactorily with both simulated and real-world data. The methodology may be used for updating R in real time, as surveillance outbreak data become available.

We define data analyses to monitor a change in ! R , the average number of secondary cases caused by a typical infected individual. The input dataset consists of incident cases partitioned into outbreaks, each initiated from a single index case. We split of the input dataset into two successive subsets, to evaluate two successive values, according to the Bayesian paradigm. We used the Bayes factor between the model with two different ! R values and that with a single ! R value to justify that the change in ! R is statistically significant. We validated our approach using simulated data, generated using known ! R . In particular, we found that claiming two distinct values may depend significantly on the number of outbreaks. We then reanalyzed data previously studied by The methodology may be used for updating ! R in real time, as surveillance outbreak data become available.

! R

Incidence and prevalence are standard epidemiological indicators, monitored to understand disease dynamic within society [1] . In the case of infectious diseases, it is customary to measure how far an epidemic is from eradication by calculating yet another epidemiological parameter, the basic reproduction number, denoted by ! ! R 0 (e.g., [2] , pp. 4--5) . By definition, ! ! R 0 represents the average number of secondary cases caused by a typical infectious individual in a fully susceptible population. If ! ! R 0 is larger than 1, then an outbreak becomes an epidemic;

otherwise, it goes extinct. If the population is not fully susceptible, then one calculates the effective reproduction number, denoted by ! R [3, 4] . However, neither ! ! R 0 nor ! R is regularly monitored by public health authorities (e.g., [5] , pp.

! ! R 0 depends on a number of factors pertaining to the pathogen-host biology, such as pathogen transmissibility and the natural history of disease. In addition, ! ! R 0 may depend on factors pertaining to host sociology, such as population density and social awareness about epidemics. All these factors may change with time.

Changes in ! ! R 0 may signal that the pathogen has become more transmissible, virulent or persistent in the population. They may also signal societal re-organization in response to the epidemic dynamic. Particularly important are changes in ! R , which, in addition to changes in ! ! R 0 , may also reflect changes in the susceptibility of the population. Monitoring changes in ! R may thus be instrumental in determining the success of public health interventions such as mass vaccination [3, 6, 7 ].

An epidemiological situation that would benefit from and/or ! R monitoring is that of a zoonotic pathogen repeatedly introduced in a population where it undergoes subcritical transmission. As the pathogen explores more and more hosts, the opportunity for mutations increases, with increasing chance for pandemic strains to occur [8] . Possible applications of this scenario may be the cases of the pre-pandemic severe acute respiratory syndrome [9] and Middle East respiratory syndrome [10] .

Another application of and/or ! R monitoring is to assessing the success of mass vaccination by surveying disease outbreaks before and after a major epidemiological event, such as implementation of mass vaccination [4] and loss of confidence in vaccination programs [6] . Surveillance and contact tracing provides a means of monitoring outbreaks by permanent registration of new cases. For example, monkeypox [11] and non-zoonotic measles [3] are potential candidates for eradication through vaccination. Monkeypox remains worrisome for the possibility that repeated introductions in the human population may yield novel strains of human poxes [12] . Following vaccine licensing in 1963, measles

incidence in the United States decreased by more than 95% [13] . Nevertheless, recent ! R analysis [7] shows the potential for measles to re-emerge and emphasizes the importance of continued surveillance. In this work, we discuss monitoring ! R using outbreak size surveillance data; only minor modifications are needed to apply our methodology to monitoring ! ! R 0 of emerging infectious diseases.

There exist two major approaches to estimate ! R from outbreak size data. In the first approach, ! R is estimated using the maximum likelihood method. Blumberg et al. [7, 14] used Galton-Watson branching processes to construct the likelihood of observed data consisting of outbreak sizes. By maximizing the likelihood function, they calculated ! R for monkeypox in the Democratic Republic of Congo (1980) (1981) (1982) (1983) (1984) and measles in the United States (2001-2011). Jansen et al. [6] used continuous-time Markov chains to construct the likelihood of observed data.

By maximizing the likelihood function, they calculated for measles outbreaks in the United Kingdom (1995) (1996) (1997) (1998) (1999) (2000) (2001) (2002) , advocating for an increase in due to loss of confidence in the vaccination program.

The second approach is based on Bayesian inference, which requires a prior distribution describing the current knowledge on the parameter of interest (e.g., ! R ) and the likelihood of observing the data set according to a model of choice [15] . As a result, one computes a posterior distribution, representing an upgrade of the prior, according to the data. Hence Bayesian inference follows closely the ! R ! R principle of surveillance and learning processes.

Farrington et al. [3] proposed two alternative ways to constructing the posterior probability of ! R based on (1) outbreak size and (2) 

A patient of an outbreak directly infects a number of individuals, so-called secondary cases. In successive generations, each secondary case continues to spread the pathogen, causing secondary cases by themselves. The network of who has infected whom has a tree structure, which, for emerging or re-emerging diseases, may be described using the theory of branching processes [18] . We assume that the data set, denoted by Τ , is an array of ! N sizes of trees (i.e., outbreak sizes) ordered by the infection time of the index patient. We analyze the transmission process in terms of effective reproduction number, ! R , which may change with time due to biological (e.g., pathogen transmissibility and virulence) and/or sociological (e.g., frequency of contact between individuals) factors.

We first briefly present how to compute a single value of ! R from an epidemiological dataset, in the Bayesian paradigm. To express lack of knowledge about , we use a non-informative improper prior ! ! π(R) = 1 ! R is uniformly distributed from zero to infinity. We construct the likelihood ! ! L(Τ|R) that the dataset Τ is observed, assuming that the effective reproduction number is ! R .

Given that the detected trees have sizes ! ! n 1 ,...,n N , ordered according to the date of infection of index patients, we have [14, 18] ! !

is the probability that the single transmission tree ! i has size ! n i [18] . The posterior distribution ! !π (R|Τ) of ! R for the observed dataset Τ is calculated according to Bayes' rule [3] !

which yields

The posterior distribution ! !π (R|Τ) was used to calculate the average effective reproduction number ! R and its 95% credible interval (CI).

The quality of ! R statistics depends not only on the number of trees but also on their sizes. In short, we summarize the amount of data using the concept of information. According to the definition (e.g., [15] , pp. 32), information is given by the logarithm of the probability to observe the given dataset Τ . In our case, the information, denoted by ! I Τ , is given by the natural logarithm of the likelihood

so that information is measured in natural units (i.e., nat). Hence, information is presented as the sum of ! N contributions, one for each tree in Τ (i.e., an extensive quantity over the number of trees). For a numerical estimate of ! I Τ corresponding to the dataset, we used the average of (i.e., ! R ) obtained from the Bayesian framework.

Monitoring the pandemic risk of emerging infectious diseases, one extracts at least two time-ordered ! R values from the same dataset Τ . Hence, the dataset Τ To justify the choice of extracting two ! R values from Τ , we proceed as follows.

We denote by Η Τ the model with a single ! R estimate and by ! ! Η a,b the model with two estimates. Given ! ! R a,b , the likelihood for the dataset ! Τ a ∪ Τ b is given by

f., Eq. 1 for ! ! L(Τ|R) ]. We evaluate whether ! ! Η a,b is more plausible than Η Τ by calculating the Bayes factor [19] , using the corresponding likelihoods. When our initial beliefs are a priori equally probable, ! ! pr(Η a,b ) = pr(Η Τ )

, the Bayes factor

! R ! R expresses how well the observed data were predicted by ! ! Η a,b , compared to Η Τ ;

i.e., the higher the value of ! ! B(Τ a ), the more is justified to extract two ! R values rather than one from Τ . Kaas et al. [19] provided (1) Let ! Τ a consist of the first ! i trees in Τ and ! ! Τ b = Τ \ Τ a ;

(2) Estimate the pair of posterior distributions ! !π (R a,b |Τ a,b ) using Eq. 3;

Numerical integration of the normalization constant was performed using the trapeze rule.

as an estimate of the parameters .

(4) Calculate using Eq. 5.

as the best estimates of effective reproduction

numbers on ! ! Τ a,b * and we denote them by * ; otherwise, we calculate a single ! R value on Τ using Eq. 3.

Synthetic data consisting of arrays of sizes of transmission trees were simulated, assuming that the number of cases caused by each infected individual is a

Poisson deviate with average ! R . Tree sizes are random integers given by a distribution ! ! p(n i ,R) , obtained from the probability-generation function for the Galton-Watson branching processes according to Pintman [18] . For the Poisson offspring distribution, we obtain

The synthetic dataset, consisting of sizes of ! N transmission trees with known ! R , was used to validate our methodology of estimating ! ! R a,b . The analytical approach according to Eq. 3 yields

where ! n is the average size of the trees in Τ . The corresponding standard

becomes negligible when ! ! N >> 1 . Eq. 7 gives ! ! R = 1 if Τ consists of a single tree.

For large tree number in Τ , the third term goes to zero and Eq. 7 becomes identical to the formula derived from the maximum likelihood ! ! L(Τ|R) method.

Using Eq. 7, we estimate the change of ! R in real time, as new epidemiological data become available. Suppose we add a new tree of size ! n to Τ . The

shows that, for a sufficiently large tree number in the dataset ! ! N >> 1 , the sign of the change in ! R is determined by the difference between the size ! n of the newly added tree and ! n .

We now proceed with the discussion of our simulations. In the first example, we generated a homogeneous dataset Τ , consisting of 100 trees with ! R equal to 0.6. Figure 1 (a) shows ! ! 2lnB(Τ a ) as a function of information (c.f. Eq. 4) in the first subset ! Τ a . The observed maximum is ~3, indicating weak justification for calculating two ! R values. Hence, we concluded that this dataset should be assigned a single ! R estimate.

In the second example, we assumed a stepwise increase of ! R , modeling adaptation of the pathogen to human-to-human transmission. We generated a non-homogeneous synthetic dataset where 50 trees were generated with ! ! R a = 0.6 , while the remaining 50 trees were generated with ! ! R b = 0.85 . The parameter ! ! 2lnB(Τ a ) (c.f. Fig. 1(b) ) reaches its maximum at the 55 th tree.

Contrary to the previous example, the observed maximum of ! ! 2lnB(Τ a ) is ~14, suggesting strong justification for calculating ! ! R a,b . The best ! ! R a,b estimates, denoted by ! ! R a,b* , have non-overlapping error bars (95% CI) and are in satisfactory agreement with the numerical choices for the parameters when Fig. 1(c) ).

In order to clarify the minimal size of the initial dataset required to get reliable estimations of ! ! R a,b , we performed evaluations of ! ! R a,b* for statistically independent synthetic datasets Τ , containing from 2 to 200 trees. For each size of Τ , we averaged 120 independent realizations to alleviate stochastic effects. The situation is quite different for non-homogeneous datasets. As an example, we generated a set of trees Τ with ! R equal to 0.6 and 0.85 for the first and second halves, respectively. The results are presented in Fig. 2(b superior. For small sizes of Τ (c.f. Fig. 2(c) ), average estimates of ! ! R a,b display strong fluctuations. Starting with ~50 trees per dataset , ! R estimates are neater. With further increasing the size of Τ , ! R estimates are more grouped around the exact values and the error bars (95% CI) are smaller.

We applied our method of evaluating two ! R values from an epidemiological dataset. We aimed to reproduce recent results obtained by Jansen et al. [6] , concerning the transmission dynamics of measles in the UK during 1995-2002.

Although measles-elimination programs were set worldwide, measles eradication

has not yet been achieved. In the late nineties, the safety of a combined measles-mumps-rubella (MMR) vaccine became controversial, which resulted in decreased uptake of the MMR vaccine subsequent to 1998. As a consequence, measles outbreaks increased in sizes.

Jansen et al. [6] calculated two effective reproduction numbers ! R for the UK, regarding the periods 1995-1998 and 1999-2002, respectively. They found that ! R increased significantly, from 0.47 to 0.82. The epidemiological data consisted of measles cases grouped into outbreaks of size 2 or more. We accounted for the left censoring of the outbreak-size data by renormalizing the probability of observing outbreaks as ! ! p(n i ,R)/ (1− p(1,R) ) , where Eq. 6

provided the probability model. The results are presented in Fig. 3 . The parameter ! ! 2lnB(Τ a ) shows a marked maximum (c.f., Fig. 3(a) 

We propose a method to monitor the effective reproduction number of infectious diseases with sub-threshold transmission. The method may apply to alert and surveillance systems of diseases emerging and/or re-emerging from natural reservoirs. In this case, monitoring an increase in and/or ! R may be used to

determine the implementation of public health intervention. The method may also be used to asses the effectiveness of public-health programs designed for disease elimination. In this case, evaluating ! R before and after implementing intervention may confirm the performance of the public-health program.

We validated our method using synthetic data, of which we presented a few simulations. Lastly, we reanalyzed data previously studied by Jansen et al. [6] , in Eq. 6.

Our model provides monitoring of ! R in the regime of subcritical transmission (i.e., ! R <1); other methods may be used for estimating in the regime of supercritical transmission [21] . The quality of ! R estimation in our model depends on the number of outbreaks; see Fig. 3c . Therefore, the number of changes in that can be estimated from a dataset depends on the number of outbreaks. We addressed the case of detecting a single change in by solving a onedimensional maximization problem for the Bayes factor (c.f., Eq. 5). There is no coincidence that (global) maximization problems are divided between oneand multi-dimensional, the first class of problems being significantly easier.

However, a number of numerical algorithms are readily available to address maximization in several dimensions [22] . In our case, such algorithms would have to face additional difficulties, inherent to the stochastic nature of the data modeling.

In conclusion, our work proposes a novel method to monitor changes in the effective reproductive number from an epidemiological dataset consisting solely of outbreak sizes. Figure Fig3 .tiff

Mathematical tools for understanding infectious disease dynamics

Branching process models for surveillance of infectious diseases controlled by mass vaccination

Estimation of effective reproduction numbers for infectious diseases using serological survey data

Anderson RM, May RM, authors. Infectious diseases of humans

Measles outbreak in a population with declining vaccine uptake

Identifying postelimination trends for the introduction and transmissibility of measles in the United States

The role of evolution in the emergence of infectious diseases

Severe acute respiratory syndrome coronavirus--like virus in Chinese horseshoe bats

Spread, circulation, and evolution of the Middle East respiratory syndrome coronavirus

The transmission potential of monkeypox virus in human populations

Major increase in human monkeypox incidence 30 years after smallpox vaccination campaigns cease in the Democratic Republic of Congo

Epidemiology and Prevention of Vaccine--Preventable Diseases

Inference of R0 and transmission heterogeneity from the size distribution of shuttering chains

Information theory, inference, and learning algorithms

Bayesian estimation of the offspring mean in branching processes: Application to infectious disease data

Decision -theoretic estimation of the offspring mean in mortal branching processes

Enumerations of trees and forests related to branching processes and random walks

Bayes factors

The Borel--Tanner distribution

Perspectives on the basic reproductive ratio

Vetterling WT, Flannery BP, authors. Numerical recipes. The art of scientific computing

We thank Prof V.A.A. Jansen (Royal Holloway University of London) for the epidemiological data. P.P. is grateful for financing in the form of a postdoctoral scholarship from Agence Nationale de la Recherche (Labex Integrative Biology of Emerging Infectious Diseases).