key: cord-0187428-wj7ik9qv authors: Small, Michael; Shi, Pengliang; Tse, Chi Kong title: Plausible models for propagation of the SARS virus date: 2003-12-19 journal: nan DOI: nan sha: 113e00e55679d3ad398acfba07ab35530d076154 doc_id: 187428 cord_uid: wj7ik9qv Using daily infection data for Hong Kong we explore the validity of a variety of models of disease propagation when applied to the SARS epidemic. Surrogate data methods show that simple random models are insufficient and that the standard epidemic susceptible-infected-removed model does not fully account for the underlying variability in the observed data. As an alternative, we consider a more complex small world network model and show that such a structure can be applied to reliably produce simulations quantitative similar to the true data. The small world network model not only captures the apparently random fluctuation in the reported data, but can also reproduces mini-outbreaks such as those caused by so-called ``super-spreaders'' and in the Hong Kong housing estate Amoy Gardens. The redundantly named Severe Acute Respiratory Syndrome (SARS) [1] appeared in the Guangdong province of China in November 2002 and spread to the Hong Kong SAR * . From Hong Kong, the virus spread throughout the world: largely due to the airline (small world) network hub in Hong Kong. The economic and social effects of the SARS virus have been subject of much attention in the popular media [2] and have been widely reported. Less well reported, but no less unusual is the effect this disease had on academic activities, including the cancellation of the 2003 International Symposium on Nonlinear Theory and its Applications. In this report we analyse the SARS daily infection data for Hong Kong and test the validity of three distinct types of models: (i) stochastic models generated from surrogate data; (ii) standard susceptibleinfected-removed (SIR) models; and (iii) small world network models. We find that the small world network models are the only models capable of reproducing the quantitative behaviour of the SARS epidemic in Hong Kong. Moreover, these models also exhibit many features characteristic of this epidemic: a small fraction of individuals show a very great propensity to infect others (the "super-spreaders"); and, propagation of the virus For the purpose of comparison, the numbers on the lower plot are normalised by the maximum daily infection tally. The maximum daily infection tally for Taiwan, Singapore and China was 26, 13 and 203. The data for Taiwan and Singapore has been revised based on case information after the epidemic, this has not been possible for the China data. within certain physical locations led to a large number of infections (the outbreaks in Prince of Wales Hospital and Amoy Gardens Housing Estate). Figure 1 shows the daily infection rate for SARS in Hong Kong between 15 February 2003 and 15 June 2003. Compared to the data for other territories (lower panel of Fig. 1 ), the data for Hong Kong appears to be more complete (compared to Taiwan), more accurate (compared to the Chinese mainland * * ) and over a longer time period (compared to Singapore). The data for Hong Kong is shown as two time series, the original daily number of infections, as recorded by the Hong Kong Department of Health, and revised figures released after a more detailed analysis of the true infection path [3] . The revised data offers a more accurate picture of the true daily number of infections, but this data uses information not available at the time of the outbreak and we therefore do not analyse that data in this report. For the current analysis, let x t denote the number of infections reported in the Hong Kong media (based on statistics published by the Hong Kong department of Health) for day t and we start with x 0 = 0. Hence t is the number of days since 11 March 2003 (March 12 is the first day of recorded infections in Hong Kong, infections prior to this date were only identified in the revised data). For notational convenience we denote the entire time series as The remainder of this paper describes the analysis of the original data in Fig. 1 (a) with three different model structures. In the next section we describe the application of surrogate data techniques to test the Hong Kong SARS time series against the hypothesis that inter-day variability in the reported incidence of SARS is random. In Section 3 we assess the accuracy of the SIR model and in Section 4 we describe the small world network model. The simplest model of disease propagation is that of a random walk such that x 0 = x N = 0 and x t ≥ 0 for t = 0, . . . , N . The duration of the disease, N , is essentially a random variable. The inter-day variation is random and, according to the Central Limit Theorem, the disease will eventually die out. Our first task is to transform the data in Fig. 1 to have zero mean and be approximately independent and identically distributed (i.i.d.). We do this by taking the difference of successive values or taking the log-ratio of successive values Both {z (1) t } t and {z (1) t } t are depicted in Fig. 2 . We note that because there is an underlying pattern in {|x t |} t we expect that {z (2) t } t will offer a time series closer to i.i.d than {z (1) t } t . We now apply the method of surrogate data [4] , [5] to determine whether the transformed data in Fig. 2 is statistically distinct from i.i.d. noise. Note that this is equivalent to testing whether the raw time series {|x t |} t is a random walk. To test for i.i.d. noise the method of surrogate data specifies that one should generate an ensemble of surrogate data sets. Each surrogate data set is a realisation of an i.i.d. process, but is otherwise "like" the original data. From this ensemble of surrogates one estimates some statistical quantity and compares this distribution of statistic values to the statistic value for the true data. If the true data has a statistical value typical of the surrogates then the underlying hypothesis (in this case, that the data is i.i.d.) may not be rejected. If the data is atypical of the surrogate distribution then the hypothesis may be rejected. To generate suitable surrogates, we wish to destroy temporal correlation but preserve the data's probability density function. The simplest way to do this is by randomly shuffling either without [4] or with [6] replacement. The choice of suitable test statistic is also important [7] . In our past experience we have advocated correlation dimension [7] or other dynamic invariants [5] . However, for such extremely short time series this is inappropriate. We attempted to use nonlinear complexity [8] as a measure, but found that this index exhibited extremely low sensitivity. We instead settle on the normalised covariance between x t and x t+1 One further advantage of this simple measure is that it is sensitive to linear dependence: exactly the property we wish to detect. Unfortunately, this also means that testing more complex linear surrogate hypotheses will not be possible. Hence, as suggested by Takens [9] we prefer to test the model residuals against the i.i.d. hypothesis. The results of Fig. 3 clearly indicate that the data of Fig. 2 is not consistent with an i.i.d. noise source. Therefore the original data ( Fig. 1 ) may not be modelled as a simple random walk. It would be logical to continue this analysis with more sophisticated surrogate algorithms [4] , [10] . However, for this short time series results of more sophisticated algorithms are not conclusive. Furthermore, as we have already noted, it would not be possible to apply the covariance as a discriminating statistic in these cases, as the surrogate algorithms explicitly preserve autocorrelation. Instead, in the following section we consider a simple deterministic model of disease propagation. The standard Susceptible-Infected-Removed (SIR) model of disease propagation has a long history and is considered in detail in any standard text on mathematical biology [11] . In difference equation form, the relationship between the three populations at time t may be expressed as where S t , I t , R t are the number (or proportion) of individual that are susceptible to infection, infected and "removed" respectively. Not that, by removed, we mean individuals that have been infected but are no longer infectious or susceptible. Such individuals may be cured (and immune), quarantined, or dead. The parameters r and a are the infection rate and the removal rate respectively. Using the SIR data we obtain a maximum likelihood estimate (i.e. minimum least squares fit) of the parameters a and r for the Hong Kong SARS time series. The time series in Fig. 1 is the number of new infections reported each day. This corresponds to the term rS t I t in Eqn. (1) . The numerical fitting procedure often estimates (incorrectly) a monotonically increasing of decreasing function. To prevent these solutions, we add leading and trailing zeros to the data prior to fitting. This helps to ensure that the optimal model exhibits a single extremum -not at either endpoint. An alternative to fitting rS t I t to the observed data would be to also consider the daily fatality and recovery data for Hong Kong as an approximation for ∆R t . One could then fit ∆I t and ∆R t directly from the observed data. In Fig. 4 (a) we see the time course of the best model (Eqn. (1)) fitted from this data. Clearly, this deterministic model does not capture the random variation observed in the data. One way of considering this random variation is that the SIR model provides an estimate of the rate of infection (i.e. I t+1 = (rS t + (1 − a))I t ), but that this estimate is subject to random fluctuations. It may be better therefore to perturb the number of new infections by some unknown quantity ǫ t . The modified, stochastic SIR equations then read A similar random term could be added to the rate of removal, however we do not consider this problem here. Moreover, arbitrarily complex noise inputs could be devised that would eventually reproduce features of the observed data. Our concern is only whether these features can be observed from the SIR model, or immediate generalisations of this model. We found that simulations with ǫ t ∼ N (0, σ 2 ) for a constant σ 2 did not produce realistic simulations. The problem is that the magnitude of the random fluctuation should be proportional to the number of infected The normalised covariance of xt and x t+1 , is plotted for the data along with 1000 independent model simulations (this is a similar calculation to that depicted in Fig. 3 (b) ). The statistics in panel (a) and (b) differ. In panel (a) the normalise covariance of the data is estimated, in panel (b) the normalised covariance of the log-ratio of the data is computed. In both cases the hypothesis that the SIR model is a good model is soundly rejected (p ≥ 0.95). individuals. Few infected individuals are unlikely to have a great variation in effect, but many such individuals will. In Fig. 4 (b) we show five simulations with the noise term ǫ t ∼ N (0, (I t σ) 2 ). To ensure that the model is physically realistic (i.e. non-negative populations) we also restrict ǫ t > −(rS t + (1 − a))I t . We found that σ ≈ 0.25 gave quantitatively reasonable results and this is the value employed in Fig. 4 . To quantitatively evaluate the effectiveness of the SIR type models (Eqn. (2)) we apply a variant on the method of surrogate data [7] . However, instead of generating constrained realisations consistent with the null hypothesis † , we generate multiple realisations of the model and compare these realisations to the data [9] . For the model described by Eqn. (2) we generate 1000 simulations and for each simulation compute < x t x t+1 > and < z t z t+1 > where z t = log xt xt+1 . That is, we use two distinct test statistics. Results for both † See [12] for a description of this rather exclusive nomenclature. Informally this means we do not force the surrogates to preserve specific properties of the data. This means that we are unable to reject all SIR models with this test, only the SIR models that are fit to this data. statistics are plotted in Fig. 5 show a clear distinction between data and surrogates. Since this model is the optimal fit to the data and the distribution of covariance estimates do not vary greatly with change in model parameters † † , we therefore conclude that models of the form Eqn. (2) are not adequate to describe the underlying dynamics. One widely accepted reason for the failure of the SIR models is that the parameters a and r do not remain constant during the time course of an epidemic. Particularly when a disease is previously unknown and public health practice changes significantly in response to the spread of the disease. This is certainly consistent with the situation for SARS in Hong Kong. We therefore modify the SIR model of Eqn. (1) by estimating the temporal dependence of a and r. We apply a sliding window of width 10 days and re-estimate a and r for every day. Figure 6 depicts the results of this computation. The values a t and r t are based on the best fit estimate of Eqn. (1) to the data (x t , x t+1 , x t+2 , . . . , x t+10 ). While the simulations in Fig. 6 (b) does have the correct shape, if not the correct magnitude, the values of a and r depicted in Fig. 6 (a) illustrate the problem with this approach. The parameter values are sensitively dependent on the observed data and do not reflect reality. In a separate publication we modify this approach by considering longer windows and have shown the utility of the ratio r/a for estimating the effectiveness of various governmental control strategies [13] . A combination of the random fluctuations approach of Eqn. (2) and the variability in a and r may produce more realistic simulations. However, we feel that this would represent a significant overparameterisation of this modelling problem. For the current investigation we conclude that a general SIR model is unable to capture the underly- † † That is, the statistic is approximately pivotal [7] . ing dynamics of the system. More exotic SIR inspired models, or more contrived noise inputs, may be able to reproduce the observed behaviour. Indeed, we expect this should be the case when the system becomes complex enough. But this is beyond the focus of the current communication and is in general a less interesting problem. In the following section we consider a computationally more extensive model that is inspired by the physical situation and provides a more realistic alternative. The final model structure we examine is computationally more complex, but also more realistic. In contrast to potentially exotic modified SIR models, the model we introduce here is inspired by our understanding of the physical situation. Small world networks (SWN) [14] have recently become immensely popular and received significant attention in a wide variety of fields [15] . One system which has provided evidence of small world structure is the network of social interaction between individuals (the so-called "six degrees of separation"). It is therefore natural to consider propagation of disease epidemics between nodes of a SWN. In the following SWN we suppose that each node is either susceptible to disease (S), infected (I) or removed (R). However, unlike the standard SIR structure, nodes of the SWN can only infect those that they are acquainted with. We arrange the nodes in a rectangular grid (for Hong Kong, we use a square with side length 2700). On a given day, each node can infect n 1 of its four immediate neighbours (horizontally and vertically) with constant probability p 1 . Each node also has a random number of long range acquaintances. The number of long range acquaintances follows an exponential distribution with an expected value of n 2 and the acquaintances are chosen at random but fixed for each node. On each day, each of the long range acquaintances may be infected with some constant probability p 2 . Finally, on each day, every infected node has probability r 1 of becoming removed. As with the SIR model, the only possible transitions are S → I → R. For our first simulations we simplify this structure by setting r 1 = 0. To avoid eventual complete contamination we vary (i.e. decrease) p 1 and p 2 with time according to the scheme in Fig. 7 (b) . The values depicted in Fig. 7 are selected only because they yield a model with has a realistic behaviour. Figure 7 (a) compares five random realisations of ∆S t to the observed data. We see that the quantitative features are reproduced nicely. In Fig. 8 (a) we repeat the surrogate calculations of the previous sections for 1000 realisations of the process depicted in Fig. 7 . An examination of the variation in the estimated covariance shows that the data and model simulations are in very close agreement. We are there- fore unable to reject the model behaviour as a plausible model of propagation of SARS. This good agreement is due to a combination of long range correlation present in the model data and the same "burstiness" that is also observed in the true data. In Fig. 7 we see that both model and simulations exhibit similar large spikes when one (or several) so-called "super-spreaders" are encountered at random. However, this model is not entirely satisfactory. The disease epidemic is controlled artificial by suitable selection of p 1,2 (t). Furthermore, no nodes ever become "removed". To remedy this we set p 1 (t) = p 1 = 0.05, p 2 (t) = p 2 = 0.005, and r 1 = 0.1. These values are selected after a process of trial and error, but do roughly correspond to the experience of SARS in Hong Kong. The probability p 1 is the daily probability of infected close family members or neighbours, p 2 is the daily probability of infecting more casual acquaintances. Finally, we allow for an exponential distribution of the number of neighbours with an expected value of 6. Hence the system truly is a scale free network and the existence of "super-spreaders" is explicitly modelled by this random variable. The time series produced by this simulation are similar to those of the restricted SWN in Fig. 7 . The major differences are the inclusion of the removed category and that the infection probabilities do not change with time. A constant value for p 1,2 may not be entirely realistic, but is is a statistical necessity to avoid over-fitting in such a complex stochastic model. A consequence of these constant values of p 1,2 is that often the disease will "die-out" immediately, or, it will run out of control and infect the entire community. In simulations, both situations are common, but so to is the eventual control of the epidemic: without changing the infection control policy! We do not show typical trajectories of this final model because they are similar to those shown in Fig. 7 , only with a slightly larger variability. In Fig. 8 we generate 1000 simulations of length greater than 90 (i.e. simulations that immediately die out are rejected) but no longer than 120 (longer simulations were truncated) of this SWN system (n 1 = 4, E(n 2 ) = 10, p 1 = 0.25, p 2 = 0.001, and r 1 = 0.45) † and compare the covariance of ∆S t to the true data. In each case we are unable to reject the underlying SWN model. A slight increase in variability (when compared to the previous model with artificially variable p 1,2 ) is observed in this second structure and we believe that this is due to the lower level of "control" on the parameter values and the inclusion of both simulations that die out, and those that run out of control in the analysis. † Simulations were repeated with a variety of other values and a wide variety of dynamics behaviour. Most notably the parameter values n1 = 4, E(n2) = 6, p1 = 0.05, p2 = 0.005, and r1 = 0.1 produced surrogate results similar to those in Fig. 8 . These two sets of parameter values where selected as they reflected the authors' understandings of the physical environment. The daily reported SARS figures for Hong Kong have been considered. We find that the data is not consistent with a random walk or a noisy version of the standard SIR model. However, two SWN structures do exhibit dynamics indistinguishable from the true data. To accurately simulate the spread of disease in society, including features observed during the SARS epidemic (such as rapid, extensive and localised outbreaks), one needs to consider complex SWN type structures. The power of the SIR model lies in its ability to model large scale spread of a contagion, the SWN structure described here is much better at simulating the localised variability. This localised variability, inherent to the model simulations may allow administrators and forecasters to estimate possible variation in their predictions of disease dynamics (i.e. to provide error bars on their projections and evaluate the likelihood of various scenarios). Such techniques are certain to be useful when planning and implementing control strategies. In a related publication, we provide a more detailed analysis of the potential scale-free structure of various SWN models [13] . Furthermore, in this paper we propose the variable parameter SWN model is actually inspired by governmental public health practice [13] . In this current we our primary goal is to show that SWN models are able to capture the underlying dynamics of the SARS epidemic better than either simple stochastic or SIR models. Consensus document on the epidemiology of sars Various Authors South China Morning Post, Virtually every day since Report of the severe acute respiratory syndrome expert committee Testing for nonlinearity in time series: The method of surrogate data Detecting determinism in time series: The method of surrogate data Determinism in financial time series Correlation dimension: A pivotal statistic for non-constrained realizations of composite hypotheses in surrogate data analysis Computing complexity with audio coding schemes Detecting nonlinearities in stationary time series A surrogate test for pseudo-periodic time series data Mathematical Biology Constrained-realization Monte-Carlo method for hypothesis testing Modelling of SARS for Hong Kong Collective dynamics of 'smallworld' networks Six Degrees: The science of a connected age This research is supported by Hong Kong University Grants Council CERG number PolyU 5235/03E.