key: cord-0002521-xmjzgqq7 authors: Ming, Rui-Xing; Liu, Ji-Ming; W. Cheung, William K.; Wan, Xiang title: Stochastic modelling of infectious diseases for heterogeneous populations date: 2016-12-22 journal: Infect Dis Poverty DOI: 10.1186/s40249-016-0199-5 sha: 23fb9140a9430c3d5f0638bf3d3cac251dd74fc7 doc_id: 2521 cord_uid: xmjzgqq7 BACKGROUND: Infectious diseases such as SARS and H1N1 can significantly impact people’s lives and cause severe social and economic damages. Recent outbreaks have stressed the urgency of effective research on the dynamics of infectious disease spread. However, it is difficult to predict when and where outbreaks may emerge and how infectious diseases spread because many factors affect their transmission, and some of them may be unknown. METHODS: One feasible means to promptly detect an outbreak and track the progress of disease spread is to implement surveillance systems in regional or national health and medical centres. The accumulated surveillance data, including temporal, spatial, clinical, and demographic information can provide valuable information that can be exploited to better understand and model the dynamics of infectious disease spread. The aim of this work is to develop and empirically evaluate a stochastic model that allows the investigation of transmission patterns of infectious diseases in heterogeneous populations. RESULTS: We test the proposed model on simulation data and apply it to the surveillance data from the 2009 H1N1 pandemic in Hong Kong. In the simulation experiment, our model achieves high accuracy in parameter estimation (less than 10.0 % mean absolute percentage error). In terms of the forward prediction of case incidence, the mean absolute percentage errors are 17.3 % for the simulation experiment and 20.0 % for the experiment on the real surveillance data. CONCLUSION: We propose a stochastic model to study the dynamics of infectious disease spread in heterogeneous populations from temporal-spatial surveillance data. The proposed model is evaluated using both simulated data and the real data from the 2009 H1N1 epidemic in Hong Kong and achieves acceptable prediction accuracy. We believe that our model can provide valuable insights for public health authorities to predict the effect of disease spread and analyse its underlying factors and to guide new control efforts. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s40249-016-0199-5) contains supplementary material, which is available to authorized users. tracking the number (S) of people susceptible to the disease, the number (I) of people infected with the disease, and the number (R) of people who have recovered from the disease. Three assumptions are made: (1) the total population N = S(t) + I(t) + R(t) is fixed at any time t; (2) those who have recovered from the disease are forever immune; and (3) those who have not had the disease are equally susceptible, and the probability of their contracting the disease at time t is proportional to the product of S(t) and I(t). Based on these assumptions, the SIR model defines a set of three ordinary differential equations for S(t), I(t), and R(t): Here, β ≥ 0 is the effective transmission rate and k ≥ 0 is the recovery rate. Because the SIR-based models are well presented in the literature, herein, we omit a verbose introduction of these models. Readers with an interest in such a topic can find the details in [5] [6] [7] . The SIR-based models and its variants have proven to be quite useful in the study of the spread dynamics of infectious diseases [8] [9] [10] . In [11] [12] [13] , the progression of disease spread is characterized by tracking the number of S t with a chain binomial model. The number of susceptible members S t+ t ( t represents the infectious period of the disease and is always chosen to be 1/k) at time t + t is a binomial random variable that depends on S t and I t α, S t+ t ∼ Bin(S t , 1 − I t α), which provides a recursive relationship between S t+ t and S t and produces a formal stochastic process. However, the power of these models is mainly limited to uniform and homogeneous populations or populations with infinite size and homogeneous interactions. In many cases, the actual spread of infectious diseases occurs in a diverse or dispersed population. To study the spread of infectious diseases in heterogeneous populations, people usually divide a population into subpopulations that differ from each other. Sub-populations can be determined on the basis of social, cultural, economic, demographic, and geographic factors. Next, besides the dynamics of the internal spread within a subpopulation, the transmission dynamics between subpopulations should also be considered in the study of epidemic spreading. Network-based epidemic modelling represents a popular approach for heterogeneous populations in which the nodes in the network correspond to sub-populations, and the links indicate the neighboring relationships. Many network-based models have been proposed, including patch models [14] [15] [16] , distance-transmission models [17] , and multi-group models [18, 19] . However, these models require knowledge of every individual (or host) and all relationships between individuals, which may be not achievable due to information privacy-related restrictions and the high cost of subject recruitment. To overcome the difficulties of collecting data, researchers have investigated several types of computer-generated networks in the context of disease spread in population-scale studies [20] [21] [22] [23] [24] . Grassberger first studied the dynamics of infectious diseases that propagate on regular networks using the percolation theory [25] . Recent studies have revealed that many real-world networks, including social networks in which infectious diseases propagate, are either smallworld [26] or scale-free [27] rather than regular or random, as thought previously [28] . Because the underlying structures of networks will influence the effect that the dynamics of epidemics will have on them, researchers, such as Pastor-Satorras and Vespignani, have made many contributions to critical value analysis of typical epidemics on different types of complex network [23, 24, 29] . On the basis of the mean-field theory, they found that compared with homogeneous networks, scale-free networks are fragile to the invasion of infectious diseases, computer viruses, or any other type of negative epidemics. Epidemics have also been studied in various disciplines. Sociologists are concerned with the diffusion of rumors or innovation on social networks [30] ; economists have studied viral marketing and recommendation strategies by considering both cascading dynamics and the network effects of vital nodes [31] ; and computer scientists are interested in how some topics can quickly cascade in virtual blog spaces and how their propagation trends [32, 33] . Although network-based studies have contributed to the modelling of disease and/or information dynamics, some models make a strong assumption that the structures of underlying networks over which epidemics spread are known beforehand. In the real world, however, the structures of underlying diffusion networks are not known directly. Many others assume the availability of information about the interactions occurring between individuals [34] [35] [36] [37] that are often not valid in the context of disease spread. What may be obtained is only the time at which particular sub-populations become infected, but not how they become infected, nor how they affect their neighboring areas. Moreover, the underlying structures of networks will greatly influence the dynamics of infectious disease spread. Since the emergence of the H1N1 influenza pandemic in April 2009, its underlying dynamics have been of great public health interest, and many approaches for its study have been proposed [14, [38] [39] [40] [41] . Most of them are based on the classic SIR model. For example, Birrell et al. [40] provided an age structure-based compartmental model with a Bayesian synthesis of multiple evidence sources to reveal substantial changes in contact patterns throughout the epidemic. Besides of the compartmental models, other mathematical models are also used to describe the transmission dynamics [3, [42] [43] [44] [45] [46] [47] . The chain binomial model was used to calculate the household secondary attack rates to measure the transmissibility of the 2009 H1N1 influenza pandemic by Lessler et al. [44] and Klick et al. [45] . Yang et al. [46] constructed a model based on chains of infections and used the infection hazard function and survival function to study the 2009 H1N1 influenza pandemic. Ferguson et al. [3] and Cauchemez et al. [42, 43] incorporated other factors, such as household risk, within-school risk, and community risk, in the study of infection spread and found out that younger age groups under 19 years old were more susceptible than older age groups. Jin et al. [47] formulated an epidemic model of influenza A based on networks and calculated the basic reproduction number and studied the effects of various immunization schemes. However, this work required that the individual contact pattern be provided. Nonetheless, none of the aforementioned approaches takes spatial heterogeneity into consideration in the study of disease spread. Recently, an outbreak of Ebola virus disease (EVD) swept across parts of West Africa from March 2014 to April 2015. By June 10, 2015, WHO had reported 27, 237 confirmed, probable, or suspected cases in three countries with 11, 158 deaths [48] . This epidemic received extensive research attention on its dynamics of spread [49] [50] [51] [52] [53] [54] [55] [56] [57] (for further references in the review article [58] ). To name a few, Chowell et al. found that district-level Ebola virus disease outbreaks in West Africa follow polynomial-based growth in time instead of the exponential growth that describes the progress of many infectious disease epidemics [52] . Fisman et al. used a simple, two parameter mathematical model to characterize epidemic growth patterns in the 2014 Ebola outbreak [53] . Webb et al. proposed a variant of the classic SIR model with three extra groups, incubating, contaminated and isolated, which can provide a more accurate prediction for the future incidences [56] . Carroll et al. used a deep sequencing approach to gain insight into the evolution of the Ebola virus (EBOV) in Guinea from the ongoing West African outbreak. The viral sequence data can be combined with epidemiological information to retrospectively test the effectiveness of control measures, and provides an unprecedented window into the evolution of an ongoing outbreak of viral haemorrhagic fever [57] . To accurately predict when and where outbreaks will occur, a feasible means is to deploy manual or electronic surveillance systems through regional or national public health and medical organizations [59] . Most of the surveillance data accumulated from such systems contains temporal, spatial, clinical, and demographic information. For instance, Telehealth Ontario is a teletriage helpline that is available free to all Ontario residents, which allows those with suspected infections to connect with experts who can assess their symptoms. The records of such calls provide valuable information on which individual from where was possibly infected and by which type of disease at what time. In this paper, we address the problem of modelling disease spread dynamics in heterogeneous populations from temporal-spatial surveillance data. We analyse the role of heterogeneity in a stochastic epidemic model on a two-dimensional lattice. Within a particular sub-population, the speed of spread is controlled by a single parameter, the transmissibility of the pathogen between individuals. Between sub-populations, the transmissibility becomes a random variable drawn from a probability distribution. Our work differs from existing studies in some fundamental ways, in light of the unique nature of infectious disease diffusion dynamics. Our results have practical implications for the analysis of disease control strategies in realistic heterogeneous epidemic systems. In this work, we propose a stochastic model to study the dynamics of infectious disease spread in heterogeneous populations from temporal-spatial surveillance data. We divide the whole population into m sub-populations on the basis of geographic regions. In the following, we use S i (t), I i (t), and R i (t) to denote the number of susceptible, infected, and recovered people, respectively, at time t in region i, i = 1, 2, · · · , m and t ∈[ 0, T]. Classic SIR-based modelling of infectious diseases assumes that the population is well-mixed. To take the role of heterogeneity into consideration, we use an alternative approach to model the dynamics of infectious disease spread. First, the classic SIR model (Eq. (1)) studies the change in the numbers of peoples in the three groups. In reality, the change in the number of the infected people is the major concern of society. Second, in many epidemics or pandemics such as H1N1 and SARS, the number of infected people I i (t) is relatively small compared to the whole subpopulation S i (t). Therefore, we may consider S i (t) as a constant to simplify the modelling of the change in the number of infected people I(t), for which we propose the following stochastic differential equation: where α is a parameter that measures the auto-recovery rate of one particular infectious disease, which is usually considered as a constant among sub-populations, δ i is the parameter that measures the different disease transmissibility in different subpopulations, σ i > 0 is the diffusion parameter that measures the disease spread from neighbors, and B i (t) is a standard Brownian motion. It is worth noting that we assume the parameter δ i = 0 for technical purposes, and the results in the case of δ i = 0 can be achieved with δ i → 0. Comparing our model in Eq. (2) with the classic model in Eq. (1), we can see that they both capture the situation in which the change in the number of infected people has a positive relationship with the total number of infected people, which means that the more infected people there are, the more people will get infected. There are two key differences between these two models: first, the key factor (βS i (t) − k) associated with the disease spread in Eq. (1) is replaced with a single parameter δ i in Eq. (2), which can be used to analyse the role of heterogeneity in the disease spread; and second, Eq. (2) takes the neighboring relationships into consideration to study the dynamics of the disease spread among different sub-populations. By Ito formula, the solution of Eq. (2) is given by Notice that for any fixed t, Thus, for any fixed t, I i (t) is a normal random variable with and There are three cases of being interested for parameter α: In this case, E[ I i (t)] tends to infinity as t goes to infinity, which implies that all people in that region will be infected if the time is long enough. In this case, the pandemic or epidemic will reach a state of equilibrium. In this case, E[ I i (t)] will reach 0 at some time t =t and go to negative infinity as t goes to infinity, which implies the pandemic or epidemic will end at timet. To estimate the parameters in our proposed stochastic model from the surveillance data, we need to divide the (2) is rewritten as It is easy to see that Then the transition density of the process {I i (t); t ≥ 0} is Hence, the likelihood function is given by Consequently, the log-likelihood function is Let We have the estimator of θ i as follows: It is obviously to see that α is not a bona fide estimator of α, because only the information of {I i (t); 0 ≤ t ≤ T} is used to estimate α. A good estimator should pool all the information {I i (t); 0 ≤ t ≤ T} (i = 1, 2, · · · , m). There are two ways to find the pool estimator. The first way is to approximate α by pooling all α i as follows: But the issue in Eq. (23) is that m must be very large in order to achieve the accurate estimate of α. In this work, we choose the second way, which is the maximum likelihood estimation. To do so, we need to assume that the processes {I i (t); 0 ≤ t ≤ T} (i = 1, 2, · · · , m) are mutually independent. Then the log-likelihood function of {I i (t); 0 ≤ t ≤ T} (i = 1, 2, · · · , m) is given by The maximum likelihood estimate is where α i is defined in Eq. (24) . In this section, we illustrate the performance of our proposed model using both simulated and real data. In the simulation study, we examine the performance of our proposed model with respect to the accuracy of parameter estimation and the forward prediction of the case incidence. First, we generate data using various parameters by the following steps: 1 Set m = 4 (the number of sub-populations) and T = 100 (the number of time slots). These two numbers are randomly selected. using Eq. (7) and Three parameters, α, δ i , and σ i , in Eq. (2) will be estimated from the simulated data. We conduct 100 replicates by repeating Step 2-4 and compare the estimated ones, α,δ i , andσ i , with the ground truth values in terms of the mean absolute percentage error (MAPE) defined as: The mean absolute percentage errors (MAPEs) for E α , E δ , and E σ are 10.0 %, 6.0 %, and 10.0 %, respectively. We plot the distribution of the estimated errors for 100 replicates forα,δ i , andσ i in Fig. 1. From Fig. 1 , we can see that both the estimates ofα andσ i have small variations. The variation of the estimate ofδ i is slightly larger but is still acceptable and is due to the uncertainty embedded in the stochastic process. We also use the estimated values of the parameters to generate the data and compare it with the simulated data using the ground truth values of the parameters. The correlation between them is 0.96. We randomly select one replicate and show the comparison results in Fig. 2 . Basically, we can use the estimated parameters to accurately recover the ground truth data. Next, we conduct an experiment to test the prediction accuracy of our model. Let us consider a sequence of data [ 1] , · · · , I ij [ s] as the training data and predict the data points I ij (t) for s < t ≤ T. s = 80 is chosen in this experiment. The MAPE of the prediction is defined as . The MAPE of the prediction is 17.3 %, which indicates that our model can achieve around 82.7 % accuracy in terms of the prediction. Again, we randomly select one replicate and show the prediction results in Fig. 3 . In the case study, we apply our model to the surveillance data from the 2009 H1N1 pandemic in Hong Kong. We acquired the time series data of the daily number of confirmed H1N1 cases with symptom onset from May 1, 2009 to May 23, 2010. The database includes 36 547 confirmed cases with demographic information on location, age, and sex along with the laboratory confirmation dates. The epidemic curve of confirmed H1N1 cases (see Fig. 4 ) reaches its peak at the end of September 2009, after which the intervention procedure comes into effect and the curve goes down. We use the data up to September 30, 2009 , which comprises 27 898 cases (more than 2/3 of all cases). Hong Kong is geographically divided by 18 political areas (districts). Each district Fig. 2 The comparison of original data and estimated data for four regions. The x-axis represents the time in days. The y-axis represents the total number of confirmed cases. The The original data is generated using the ground truth values of parameters while the estimated data is generated with estimated values of parameters. The correlation between them is 0.96 Fig. 3 The prediction performance of our method using simulation data for four regions.The x-axis represents the time in days. The y-axis represents the total number of confirmed cases. The data for the first 80 days are used for training. The data for the last 20 days are used for testing. The mean absolute percentage error in testing is 17.3 % is considered as one sub-population in our proposed model. The time interval t(k)(k = 0, 1, · · · , n − 1) of H1N1 is set as 1 day. The total number of days is 100. Figure 5 gives the effect of the different components for the 18 political areas in Hong Kong. From Fig. 5 , we can find that the effect of δ, which measures the internal disease spread within each district, varies less than the effect of σ , which measures the external disease spread between districts. In general, the speed of internal disease spread is closely connected with the population density and the external disease spread pattern is associated with the pattern of people's daily travel. It is well known that Hong Kong has the highest population density in the world, and most districts are densely populated. However, it possesses a heavy heterogeneous traffic pattern, and there is intensive traffic between districts every day. Therefore, the imported infections for each district account for a critical factor in the disease spread, whereas the internal effects only play a very small role in the progression of disease spread. We also use the H1N1 data to test the prediction accuracy of our model. The MAPEs for all districts are shown in Fig. 6 and Table 1 . The average prediction error is 20.0 %. We notice that the prediction error for the district "TSUEN WAN" is very high because the number of daily infections in this district changes suddenly during the epidemic period. Figure 7 shows the epidemic curves of the three regions with the lowest incidence rate. We can observe that between time slot 34 and 42, there is a sudden rise for the "TSUEN WAN" district. Such a change significantly affects the parameter estimation and thereby the prediction accuracy for the district "TSUEN WAN". Although the incidence rates of the other two districts also low, their epidemic curves are relatively smooth in comparison with that of "TSUEN WAN", indicating that the prediction accuracies of these two districts are higher than that of the "TSUEN WAN" district. Epidemic modelling offers a practical means for policy makers to evaluate the potential effects of intervention strategies. To do so, the accuracy of epidemic modelling with respect to the real-world disease transmission dynamics is essential and remains a challenging task due to the inaccessibility of many factors that affect the spread patterns of infectious diseases. In particular, Fig. 6 The prediction errors for 18 districts using the real H1N1 data. The mean absolute percentage error is 20.0 % heterogeneity should be taken into consideration when modelling the disease spread in non-random mixing populations. Many methods have been proposed to deal with heterogeneity in the study of epidemic dynamics, mostly using network-based epidemic models in which nodes correspond to spatial locations with reported incidences over time, and the directional links indicate the probability of disease transmission from one node to another over time. However, it is very challenging to determine the network topology. Many studies have used a geographical topology whereas others have used a mobility network inferred from the public transportation network or other sources. How to verify the inferred network topology is another challenging issue because the true epidemic network topology is unknown, and it may vary for different types of infectious diseases for the same population. Furthermore, the neighborhood effect estimation is non-trivial; it involves many parameters (a polynomial of the number of nodes) and requires a large amount of data to avoid overfitting. Such data may not always be available for the inference of network topology. Therefore, in this work, we propose an alternate approach to investigate the spatial heterogeneity from temporalspatial surveillance data without the inference of network topology. Our proposed model possesses several merits over the previous works. First, it quantifies the role of the heterogeneity in the analysis of the spread dynamics of infectious diseases in heterogeneous populations. Second, parameter estimation can be computed very quickly. Therefore, the prediction and the corresponding intervention policies can be implemented without delay in an outbreak of infectious disease. We apply our model on both the simulated data and the real data from the 2009 H1N1 epidemic in Hong Kong and achieve acceptable prediction accuracy. Based on the study of disease diffusion, the model proposed in this work can be extended to study other propagation patterns such as the Internet and World Wide Web, through which individuals form multiple communities in which information can propagate in a manner similar to that of infectious disease. We believe that our work makes theoretical and empirical contributions in many areas. There are some limitations in our proposed stochastic model. First, it does not consider the epidemic network topology. However, how to infer such networks is another challenging task. To the best of our knowledge, the best way to do so is to use the contact data among some infected patients to verify the results, but such data are not always available and can be difficult to collect due to many issues (e.g., privacy). This issue may be addressed by using other types of data, such as daily commute data extracted from social networks. Second, our proposed model achieves a prediction accuracy of only around 80 %. We need to further improve it to allow its full use in real applications. Third, the proposed model is only suitable for the situation in which the susceptible population (or sub-population) maintains a relatively constant size and structure in a region. However, if the number of infected people in an epidemic is large or asymptomatic infection plays a central role (e.g., the malaria epidemic in Africa), the population factor should be taken into consideration in the model. Moreover, for a highly spatially heterogeneous outbreak (e.g., the Ebola epidemic) in which cases may seem to disappear due to reduced transmission in one area while growth may continue or rise in new locales, our proposed model may have problems in capturing these opposite dynamics in different regions. Fourth, because the proposed model is based on the classic SIR model, it only works in the situation in which the number of infected people grows exponentially. We will investigate resolutions to these limitations in our future work. Fig. 7 The epidemic curve of three districts with the lowest incidence rates. The x-axis represents the time staring from May 1, 2009 to May 23, 2010. Between time slot 34 and 42, there is a sudden rise for the "TSUEN WAN" district. Such a change significantly affects the parameter estimation and thereby the prediction accuracy for the district "TSUEN WAN" World Health Organization Pandemic (H1N1) As swine flu circles globe, scientists grapple with basic questions Strategies for mitigating an influenza pandemic Mitigation strategies for pandemic influenza in the United States The Mathematical Theory of Infectious Diseases and Its Applications. High Wycombe: Charles Griffin and Company Ltd Infectious Diseases of Humans Mathematical Epidemiology of Infectious Diseases: Model Building Global stability for the SEIR model in epidemiology Bifurcation analysis of periodic SEIR and SIR epidemic models Qualitative analyses of communicable disease models Statistical studies of infectious disease incidence Estimation and inference of R0 of an infectious pathogen by a removal method An introduction to stochastic epidemic models Delaying the international spread of pandemic influenza Forecast and control of epidemics in a globalized world Will travel restrictions control the international spread of pandemic influenza Dynamics of the 2001 UK foot and mouth epidemic: stochastic dispersal in a heterogeneous landscape Strategies for containing an emerging influenza pandemic in Southeast Asia Containing pandemic influenza at the source Epidemic dynamics and endemic states in complex networks Epidemics and immunization in scale-free networks Small world effect in an epidemiological model Percolation and epidemics in a two-dimensional small world Absence of epidemic threshold in scale-free networks with degree correlations On the critical behavior of the general epidemic process and dynamical percolation Collective dynamics of small-world networks Emergence of scaling in random networks On the evolution of random graphs Epidemics and immunization in scale-free networks Diffusion of Innovations The dynamics of viral marketing On the bursty evolution of Blogspace Patterns of cascading behavior in large blog graphs A high-resolution human contact network for infectious disease transmission High-resolution measurements of face-to-face contact patterns in a primary school Epidemics on interconnected networks Cross-reactive antibody responses to the 2009 pandemic H1N1 influenza virus Transmission dynamics of the etiological agent of SARS in Hong Kong: impact of public health interventions Transmissibility of 1918 pandemic influenza Bayesian modelling to unmask and predict influenza A/H1N1pdm dynamics in London Antiviral treatment for pandemic influenza: Assessing potential repercussions using a seasonally forced SIR model Estimating the impact of school closure on influenza transmission from sentinel data Household transmission of 2009 pandemic influenza A (H1N1) virus in the United States Outbreak of 2009 pandemic influenza a (h1n1) at a New York city school Transmissibility of seasonal and pandemic influenza in a cohort of households in Hong Kong The transmissibility and control of pandemic influenza a (H1N1) virus Modelling and analysis of influenza a (H1N1) on networks World Health Organization. Ebola Situation Reports Temporal changes in Ebola transmission in sierra leone and implications for control requirements: a real-time modelling study Modelling the effect of early detection of Ebola The basic reproductive number of Ebola and the effects of public health measures: the cases of Congo and Uganda The Western Africa Ebola virus disease epidemic exhibits both global exponential and local polynomial growth rates Early epidemic dynamics of the West African 2014 Ebola outbreak: estimates derived with a simple two-parameter model Assessing the international spreading risk associated with the 2014 West African Ebola outbreak Estimating the reproduction number of Ebola virus (EBOV) during the 2014 outbreak in West Africa A model of the 2014 Ebola epidemic in West Africa with contact tracing Temporal and spatial analysis of the 2014-2015 Ebola virus outbreak in West Africa Transmission dynamics and control of Ebola virus disease (EVD): a review Real-time surveillance for respiratory disease outbreaks Additional file 1: Multilingual abstracts in the five official working languages of the United Nations. (PDF 598 kb) Authors' contributions RXM, JML, and XW conceived and designed the experiments. RXM implemented the software. JML and XW analysed the data. All authors were involved in the manuscript preparation. All authors read and approved the final manuscript. The authors declare that they have no competing interests. • We accept pre-submission inquiries • Our selector tool helps you to find the most relevant journal Submit your next manuscript to BioMed Central and we will help you at every step: