key: cord-0906918-x7y81ss1
authors: Dearlove, Bethany; Wilson, Daniel J.
title: Coalescent inference for infectious disease: meta-analysis of hepatitis C
date: 2013-03-19
journal: Philos Trans R Soc Lond B Biol Sci
DOI: 10.1098/rstb.2012.0314
sha: 05513568bb7b58bb7df33023160838ac885a6144
doc_id: 906918
cord_uid: x7y81ss1

Genetic analysis of pathogen genomes is a powerful approach to investigating the population dynamics and epidemic history of infectious diseases. However, the theoretical underpinnings of the most widely used coalescent methods have been questioned, casting doubt on their interpretation. The aim of this study is to develop robust population genetic inference for compartmental models in epidemiology. Using a general approach based on the theory of metapopulations, we derive coalescent models under susceptible–infectious (SI), susceptible–infectious–susceptible (SIS) and susceptible–infectious–recovered (SIR) dynamics. We show that exponential and logistic growth models are equivalent to SI and SIS models, respectively, when co-infection is negligible. Implementing SI, SIS and SIR models in BEAST, we conduct a meta-analysis of hepatitis C epidemics, and show that we can directly estimate the basic reproductive number ($R_0$) and prevalence under SIR dynamics. We find that differences in genetic diversity between epidemics can be explained by differences in underlying epidemiology (age of the epidemic and local population density) and viral subtype. Model comparison reveals SIR dynamics in three globally restricted epidemics, but most are better fit by the simpler SI dynamics. In summary, metapopulation models provide a general and practical framework for integrating epidemiology and population genetics for the purposes of joint inference.

During an ongoing outbreak, understanding the epidemiological dynamics and predicting the likely course of the outbreak are time-critical tasks essential for informing intervention [1, 2]. If systematic monitoring is in place, key parameters such as $R_0$, the basic reproductive number [1], can be estimated directly, as in the case of the foot and mouth disease outbreak among British cattle in 2001 [3] and the outbreaks of severe acute respiratory syndrome in Asia in 2002 and 2003 [4]. Genetic analysis provides a window into the epidemic history of a pathogen that can complement epidemiological analysis, as in the case of the H1N1 influenza A pandemic in 2009 [5, 6], or take its place in the absence of reliable surveillance data. The ability to sequence pathogen genomes in real time, for example during the 2010 cholera outbreak in Haiti [7], foretells an increasingly important role for genetic analysis during outbreak response.

Genetic analysis is a well-established tool for revealing the epidemic history of pathogen populations [8, 9]. It commonly involves the post hoc interpretation of an evolutionary tree constructed from genetic sequences. Relationships between isolates may reveal the order of transmission events [10, 11], whereas the shape of the tree is informative about overarching dynamics [12]. However, more powerful approaches explicitly integrate genetic and epidemiological models. For example, coalescent methods, which can be used to infer historical changes in population size [13-15], have been applied to pathogen populations to infer historical changes in prevalence. By modelling changes in prevalence using the susceptible-infectious-susceptible (SIS) model, epidemiological parameters such as the intrinsic growth rate of the epidemic have been estimated directly [16]. Early applications of the coalescent approach shed new light on the epidemic behaviour of the hepatitis C virus (HCV) [16], and the pathogen has continued to attract intense research attention owing to its medical importance and amenability to genetic analysis. HCV is a major cause of liver disease, including cirrhosis and liver cancer. Estimated to infect 160 million people around the world [17], it is implicated in 350 000 deaths per year [18]. Sharing contaminated needles and transfusion of infected blood products are thought to be the main routes of transmission [19]. HCV is an enormously diverse RNA virus, comprising six major types with varying geographical distributions [20, 21]. Coalescent inference has been used to date the origin of HCV in different countries [16, 22-28], providing a historical context for the emergence of epidemics and providing quantitative support for the roles of iatrogenic transmission [22] and drug use [29].

The advent of population-level whole genome sequencing has revealed previously unfathomed diversity in pathogenic bacteria [30], leading to wider interest in integrated approaches to genetics and epidemiology beyond rapidly evolving viruses such as HCV. However, theoretical work has shown that although the central assumption of coalescent approaches, that effective population size is proportional to prevalence, is valid at dynamic equilibrium [31], it does not hold more generally [32, 33]. In this study, we derive a new framework for population genetic inference of epidemiological dynamics based on a metapopulation model of pathogen populations. Using coalescent results for metapopulations [34, 35], we expose the assumptions implicit to coalescent approaches and explore the limits of genetic inference. We implement SI, SIS and SIR models in BEAST [36], and conduct a meta-analysis investigating the epidemiological processes that underlie differences in genetic diversity between HCV epidemics.

Metapopulations (literally populations of populations [37, 38] ) have been used to account for heterogeneity in pathogen species caused by strain structure or host structure [39, 40] . However, pathogen populations are metapopulations in a more fundamental sense, because the population is an aggregate of the many isolated subpopulations colonizing individual hosts (figure 1).

The key feature of a metapopulation that distinguishes it from other structured populations is the extinction of individual demes (i.e. subpopulations) and their re-colonization by other demes [41] . In pathogens, demes correspond to hosts, colonization corresponds to infection of an uninfected host (what we call primary infection) and extinction corresponds to clearance of infection. Migration to a colonized deme corresponds to secondary infection of an infected host. To make a concrete population genetics model, additional assumptions are required [34, 35, 41] , principally that (i) upon primary infection the infecting genotypes come from a single host, and (ii) the carrying capacity is immediately attained within the newly infected host.

Among the advantages of using the metapopulation model is the wealth of understanding of metapopulation dynamics [37, 38, 41-43]. In a series of papers, Wakeley [44-46] developed coalescent approximations for structured populations, including metapopulations [34, 35], based on the assumption that the number of colonized demes is large. The main result from his work is that under disparate, complex models of population structure, the genealogy of individuals sampled from different demes is well approximated by a standard coalescent process whose effective population size is a function of the demographic parameters. This puts inference for metapopulations on a practical footing [36], and the assumption that the number of infected hosts is large is consistent with the deterministic compartmental models commonly used in epidemiology.

Compartmental models are important tools for modelling infectious disease dynamics [1]. In a simple SI model, the proportions of all hosts that are susceptible (S) and infectious (I) are modelled using differential equations. Usually, the total rate of primary infection is assumed to depend on the number of susceptible and infectious individuals and a transmission coefficient ($\beta_1$). This is known as strong proportionate mixing [1]. In the SIS model, infected individuals clear infection and return to the susceptible class at rate $\gamma$. In the SIR model, individuals that recover from infection instead become immune. These three models have different dynamics, with the SIR model producing the classical epidemic expansion and burn out (figure 1).

rstb.royalsocietypublishing.org Phil Trans R Soc B 368: 20120314

Figure 1. Metapopulations and epidemiological dynamics. (a) Pathogen populations are metapopulations because they exist as an aggregate of isolated subpopulations within individual hosts. We refer to infection of susceptible hosts as primary infection, and subsequent infection events as secondary infection. We use compartmental models from epidemiology to model the dynamics of the metapopulation. (b) The SI, SIS and SIR models are simple compartmental models. Changes in the proportions of susceptible (S), infected (I) and recovered (R) hosts are modelled using differential equations. In all three models, the proportion of infected hosts is assumed to increase at rate $\beta_1 SI$, where $\beta_1$ is the primary transmission coefficient. In the SIS model, hosts clear infection and return to the susceptible class at rate $\gamma$. In the SIR model, hosts that clear infection recover and are no longer susceptible. (c) The models predict different epidemiological dynamics. In the SI model, the whole population is eventually infected. In the SIS model, a dynamic equilibrium is reached. In the SIR model, the epidemic peaks and burns out as the supply of susceptible hosts is exhausted.
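The three compartmental models described above can be sketched numerically. The following is a minimal illustration, not the implementation used in the study: the function names, the fixed-step Euler scheme and the parameter values (`beta1 = 0.5`, `gamma = 0.25`, initial prevalence `1e-4`) are all assumptions chosen for illustration.

```python
def derivatives(model, s, i, beta1, gamma=0.0):
    """Return (dS/dt, dI/dt) for host proportions s and i under 'SI', 'SIS' or 'SIR'."""
    if model not in ("SI", "SIS", "SIR"):
        raise ValueError("model must be 'SI', 'SIS' or 'SIR'")
    infection = beta1 * s * i                         # primary infection at rate beta1*S*I
    clearance = 0.0 if model == "SI" else gamma * i   # clearance at rate gamma*I
    returning = clearance if model == "SIS" else 0.0  # only SIS hosts become susceptible again
    return -infection + returning, infection - clearance


def simulate(model, beta1, gamma=0.0, i0=1e-4, t_end=200.0, dt=0.01):
    """Fixed-step Euler integration forward in time; returns a list of (t, S, I)."""
    s, i = 1.0 - i0, i0
    path = [(0.0, s, i)]
    for step in range(1, int(round(t_end / dt)) + 1):
        ds, di = derivatives(model, s, i, beta1, gamma)
        s, i = s + dt * ds, i + dt * di
        path.append((step * dt, s, i))
    return path


# Hypothetical parameter values, chosen only for illustration.
for name in ("SI", "SIS", "SIR"):
    path = simulate(name, beta1=0.5, gamma=0.25)
    peak = max(i for _, _, i in path)
    print(name, "peak I = %.3f" % peak, "final I = %.3f" % path[-1][2])
```

With these values the SI trajectory should approach complete infection, the SIS trajectory should settle at an intermediate prevalence, and the SIR trajectory should rise, peak and decline, which is the qualitative contrast described in the text.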

Initially, when infection is rare and susceptible hosts are plentiful, the epidemic increases exponentially with rate $r_0$, the intrinsic growth rate. In the SI model, $r_0 = \beta_1$, and in the SIS and SIR models, $r_0 = \beta_1 - \gamma$. During this exponential phase, the transmission rate per infection is $\beta_1$, but it slows as susceptible hosts are exhausted. The clearance rate $\gamma$ corresponds to the inverse of the average duration of infection. An important quantity is the basic reproductive number $R_0$, defined as the total number of infections caused by an index case in a totally susceptible population [1]. In the SIS and SIR models, $R_0 = \beta_1/\gamma$. In the SIS model, $R_0$ determines the equilibrium prevalence, whereas it determines the peak prevalence in the SIR model.
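The relationships in this paragraph can be written as a few helper functions. This is a sketch under stated assumptions: the function names are hypothetical, and the SIS equilibrium prevalence is not given explicitly in the text but follows from setting dI/dt = 0 in the SIS equations with S = 1 - I.

```python
def intrinsic_growth_rate(model, beta1, gamma=0.0):
    """r0 = beta1 under SI; r0 = beta1 - gamma under SIS and SIR."""
    if model == "SI":
        return beta1
    if model in ("SIS", "SIR"):
        return beta1 - gamma
    raise ValueError("model must be 'SI', 'SIS' or 'SIR'")


def basic_reproductive_number(beta1, gamma):
    """R0 = beta1 / gamma (SIS and SIR models only; gamma must be positive)."""
    if gamma <= 0:
        raise ValueError("R0 requires a positive clearance rate")
    return beta1 / gamma


def sis_equilibrium_prevalence(beta1, gamma):
    """Solving beta1*(1 - I)*I - gamma*I = 0 for I > 0 gives I* = 1 - gamma/beta1 = 1 - 1/R0."""
    return max(0.0, 1.0 - gamma / beta1)


# Hypothetical values for illustration only.
beta1, gamma = 0.5, 0.25
print(intrinsic_growth_rate("SIS", beta1, gamma))
print(basic_reproductive_number(beta1, gamma))
print(sis_equilibrium_prevalence(beta1, gamma))
```

Because the SI model has no clearance, `basic_reproductive_number` deliberately rejects a zero clearance rate rather than returning an infinite value.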

Compartmental models can be elaborated endlessly. However, the only extension to the basic models we make is to consider the dynamics of secondary infection. Assuming strong proportionate mixing, it follows that the total rate of secondary infection depends on the square of the number of infectious individuals and a transmission coefficient ($\beta_2$). Although this is important for the metapopulation model, our treatment of secondary infection does not change the dynamics of the epidemiological models. As noted, the use of deterministic differential equations to model epidemic dynamics implies the number of infected hosts is large. Although this cannot hold in the early stages of the epidemic, experience suggests these models are nevertheless useful for epidemiological inference [3-5].
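As a small illustration of how primary and secondary infection rates scale with prevalence under strong proportionate mixing, the sketch below computes total and per-infection rates from host proportions. The function name, the dictionary keys and the numerical values are hypothetical; only the functional forms (primary infection proportional to S times I, secondary infection proportional to I squared) come from the text.

```python
def infection_rates(s, i, beta1, beta2):
    """Total and per-infection rates of primary and secondary infection.

    s, i   -- proportions of susceptible and infectious hosts
    beta1  -- primary transmission coefficient
    beta2  -- secondary transmission coefficient
    """
    return {
        "primary_total": beta1 * s * i,       # infections of susceptible hosts
        "secondary_total": beta2 * i * i,     # infections of already infected hosts
        "primary_per_infection": beta1 * s,   # falls as susceptibles are used up
        "secondary_per_infection": beta2 * i, # rises with prevalence
    }


# Hypothetical values for illustration only.
print(infection_rates(s=0.75, i=0.25, beta1=0.4, beta2=0.2))
```

The per-infection rates make the prevalence dependence visible: as the epidemic grows, each infection causes fewer primary infections and more secondary ones.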

The key parameter in a coalescent model is $N_e$, the effective population size, because it determines the coalescence rate, which in turn determines relatedness within the sample [15]. In the metapopulation model described earlier, the many-demes limit [34, 35] gives the effective population size as

In these equations, $D$ is the number of infected hosts, $e_0$ is the rate of primary transmission per infection, $m$ is the rate of secondary transmission per infection, $N_P$ is the pathogen population size within a host and $k$ is the number of genotypes transmitted during primary infection. $F$ is the inbreeding coefficient, which is the probability that two individuals sampled within the same host are descended from the same transmission event. See table S1 in the electronic supplementary material for all parameter definitions. Assuming strong proportionate mixing, the rates of primary and secondary transmission per infection are $e_0 = \beta_1 S$ and $m = \beta_2 I$, respectively, which yields

where $N_H$ is the total number of hosts. Equations (3.1) and (3.2) resolve the apparently conflicting observations that (i) $N_e$ is proportional to prevalence at dynamic equilibrium [31], but (ii) changes in prevalence do not necessarily induce a linear change in $N_e$ [33] because the rates of primary and secondary transmission per infection and the inbreeding coefficient depend, in general, on prevalence. This is true under assumptions of both strong and weak proportionate mixing. For further explanation of the determinants of effective population size in the metapopulation, see electronic supplementary material, figure S1.

Equations (3.1) and (3.2) are consistent with the results of a simpler model [33], which assumes co-infection is negligible ($\beta_2 = 0$). Because this assumption will often be reasonable, and because it reduces the number of parameters to be estimated, we embrace it in the rest of what follows. The SI and SIS models can be solved in closed form (see §5 and equations (5.1) and (5.2)), so it is possible to write down the effective population size under these models. For the SI model, the effective population size simplifies to

which is an exponential growth curve with parameters $N_0$, the effective population size at present, and $r_0$, the intrinsic growth rate. Time is measured from the present ($t = 0$) back into the past ($t > 0$). For the SIS model, the effective population size simplifies to

which is a logistic growth curve with parameters $N_0$, $r_0$ and $t_{50} = -\log(r_0/(\gamma(1 - S_0)) - 1)/r_0$, the time at which $N_e$ reached half its maximum. Equations (3.3) and (3.4) show that the exponential and logistic growth curves, which are commonly used in coalescent analyses of pathogen effective population size [14, 23, 29], arise from simple SI and SIS models under the assumptions of strong proportionate mixing and no co-infection. However, the growth curves describing changes in $N_e$ are simpler than the underlying growth curves that describe changes in prevalence, and have one fewer parameter. Consequently, there is no one-to-one correspondence between the coalescent parameters and the epidemiological parameters, meaning that the epidemiological parameters cannot be fully identified from genetic analysis alone. An independent estimate of one of the epidemiological parameters (e.g. rate of clearance of infection or present-day prevalence) is required to reconstruct historical changes in prevalence. In this respect, our results differ from Pybus et al. [16], but we agree with their key result that the intrinsic growth rate ($r_0$) in an SIS model can be estimated by modelling changes in $N_e$ using a logistic growth curve. We also agree that to estimate the basic reproductive number $R_0$, an independent estimate of one of the epidemiological parameters is needed.

(c) Coalescent SIR model

Equations for the epidemiological dynamics in the SIR model cannot be solved analytically, but can be solved numerically using computational techniques [47]. Unlike the simpler models, there is no confounding of epidemiological parameters, meaning that, in principle, all the parameters of the epidemiological model (see table S1, electronic supplementary material) can be estimated from genetic data alone. Consequently, $R_0$ can also be estimated, in principle, directly from genetic data.
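Returning to the SI case discussed earlier, the sketch below illustrates why an exponential curve for the effective population size can arise even though SI prevalence itself saturates. It rests on an assumption that should be read as such: that, with no co-infection and a constant transmission coefficient, the effective population size is proportional to the ratio I/S, as in the simpler model the text cites. The function names and parameter values are hypothetical.

```python
import math


def si_prevalence(t_forward, beta1, i_start):
    """Closed-form (logistic) proportion infected under SI, t_forward after the start."""
    odds = (i_start / (1.0 - i_start)) * math.exp(beta1 * t_forward)
    return odds / (1.0 + odds)


def si_odds(t_forward, beta1, i_start):
    """Ratio I/S under SI; grows exactly exponentially at rate beta1."""
    i = si_prevalence(t_forward, beta1, i_start)
    return i / (1.0 - i)


def exponential_ne(t_back, n0, r0):
    """Exponential curve: value at time t_back before the present (t_back >= 0)."""
    return n0 * math.exp(-r0 * t_back)


# Hypothetical values: prevalence saturates, but I/S keeps growing exponentially.
beta1, i_start = 0.3, 0.01
for t in (0.0, 10.0, 20.0, 30.0):
    print(t, round(si_prevalence(t, beta1, i_start), 4), round(si_odds(t, beta1, i_start), 4))
```

Under that assumption, a curve of the form `exponential_ne` with `r0 = beta1` describes the effective population size looking backwards from the present, which also shows why two coalescent parameters cannot recover every epidemiological parameter.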
We found that model comparison and parameter estimation using BEAST were aided by the following re-parameterization: $N_0$, the effective population size at present; $r_0 = \beta_1 - \gamma$, the intrinsic growth rate; $\gamma$, the rate of clearance; and $t_{peak}$, the time since the epidemic peaked, which must be calculated numerically.
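To illustrate what calculating the epidemic peak numerically can involve, the sketch below integrates the SIR equations forward with a classical Runge-Kutta scheme and stops when the susceptible proportion crosses $\gamma/\beta_1$, the point at which prevalence stops increasing. This is an assumption-laden illustration, not the numerical method of [47] or of the BEAST implementation: the function names, step size, initial prevalence and parameter values are all hypothetical.

```python
def sir_rhs(s, i, beta1, gamma):
    """Right-hand side of the SIR equations for proportions S and I."""
    return -beta1 * s * i, beta1 * s * i - gamma * i


def rk4_step(s, i, dt, beta1, gamma):
    """One classical fourth-order Runge-Kutta step."""
    k1 = sir_rhs(s, i, beta1, gamma)
    k2 = sir_rhs(s + 0.5 * dt * k1[0], i + 0.5 * dt * k1[1], beta1, gamma)
    k3 = sir_rhs(s + 0.5 * dt * k2[0], i + 0.5 * dt * k2[1], beta1, gamma)
    k4 = sir_rhs(s + dt * k3[0], i + dt * k3[1], beta1, gamma)
    s_new = s + dt * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6.0
    i_new = i + dt * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6.0
    return s_new, i_new


def epidemic_peak(beta1, gamma, i0=1e-4, dt=0.01, t_max=1000.0):
    """Return (time of peak, peak prevalence), or None if prevalence never rises."""
    if beta1 * (1.0 - i0) <= gamma:
        return None                      # dI/dt <= 0 from the start
    threshold = gamma / beta1            # dI/dt = 0 when S = gamma / beta1
    s, i, t = 1.0 - i0, i0, 0.0
    while t < t_max:
        s_new, i_new = rk4_step(s, i, dt, beta1, gamma)
        if s_new <= threshold:
            frac = (s - threshold) / (s - s_new)   # linear interpolation within the step
            return t + frac * dt, i + frac * (i_new - i)
        s, i, t = s_new, i_new, t + dt
    return None


# Hypothetical values for illustration only.
print(epidemic_peak(beta1=0.5, gamma=0.25))
```

Given a sampling time measured on the same forward clock, the time since the peak would be the sampling time minus the first element of the returned pair.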

To investigate the practical value of our approach for estimating epidemiological parameters, reconstructing epidemic history and explaining variation in genetic diversity between epidemics, we conducted a meta-analysis of HCV, one of the most intensively studied pathogens in the context of joint evolutionary-epidemiological inference. We conducted a literature search for HCV datasets with well-described sampling frames and readily available metadata. Initially, we identified 28 datasets for which subtype, sampling location, prevalence and NS5B gene sequences were available [22, 23, 25, 29, 48-59]. However, we excluded those with small sample size (fewer than 20 sequences) and evidence of recombination (see the electronic supplementary material, table S2). Recombination is problematic for coalescent inference [60] and provides evidence of co-infection, which our method assumes is absent. In total, 18 datasets satisfied our inclusion criteria (see the electronic supplementary material, dataset S1). Figure 2 shows the geographical distribution of the HCV datasets and a genealogy based on a global alignment of all sequences, with the subtypes indicated. Subtypes formed distinct monophyletic groups, but the ancestral histories of datasets within the same subtype were shared to varying degrees. We fitted our coalescent SI, SIS and SIR models to each dataset separately while bearing in mind this overlap. For the meta-analysis, we estimated $N_0$ (the effective population size at the time of sampling) and $r_0$ (the intrinsic growth rate) using a model-averaging approach that assumed equal prior probability of each scenario (SI, SIS and SIR).
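The model-averaging idea can be sketched generically. The text does not describe how model probabilities were computed in BEAST, so the following should be read as the textbook calculation under equal prior model probabilities, with hypothetical function names and made-up log marginal likelihoods and per-model estimates.

```python
import math


def posterior_model_probabilities(log_marginal_likelihoods):
    """Posterior probability of each model under equal prior probabilities.

    With equal priors, each posterior probability is proportional to the
    model's marginal likelihood. Subtracting the maximum avoids underflow.
    """
    top = max(log_marginal_likelihoods.values())
    weights = {m: math.exp(v - top) for m, v in log_marginal_likelihoods.items()}
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}


def model_average(estimates, probabilities):
    """Average per-model estimates, weighting each by its model probability."""
    return sum(probabilities[m] * estimates[m] for m in estimates)


# Made-up inputs for illustration only.
log_ml = {"SI": -1000.0, "SIS": -1000.0, "SIR": -1000.0 + math.log(2.0)}
pp = posterior_model_probabilities(log_ml)
r0_by_model = {"SI": 0.1, "SIS": 0.2, "SIR": 0.3}
print(pp)
print(model_average(r0_by_model, pp))
```

Weighting point estimates in this way is one simple option; averaging over posterior samples drawn from each model in proportion to its probability is another, and the text does not say which was used.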

We used linear regression to explore the epidemiological determinants of genetic diversity between epidemics. We measured genetic diversity using $\pi$, the mean number of nucleotide differences between HCV sequences in the same dataset. Diversity varied considerably, ranging from $\pi$ = 20.3 to $\pi$ = 84.3 per kilobase (see the electronic supplementary material, table S2). We found that the strongest predictor of diversity was the age of the most recent common ancestor ($T_{MRCA}$), followed by population density and subtype (figure 3). Table 1 shows the regression coefficients and p-values, although the latter must be viewed with a degree of caution owing to pseudo-replication within subtypes. The overall predictive power of the regression was very high ($R^2$ = 98.9%). Epidemics with older $T_{MRCA}$ had substantially higher diversity as would be expected, whereas increased population density predicted a reduction in diversity. Of the subtypes represented by multiple datasets, 1b had highest diversity and 6a had lowest diversity after correcting for the effects of $T_{MRCA}$ and population density. Surprisingly, there was no significant relationship between diversity and intrinsic growth rate, $r_0$, after taking into account other factors. This would be explained by rapid epidemic growth across the datasets, resulting in star-shaped genealogies.
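The diversity measure described above can be computed directly from an alignment. The sketch below is illustrative only: the function name is hypothetical, it treats any character mismatch as a difference, and it does not attempt to reproduce how gaps or ambiguous bases were handled in the study.

```python
from itertools import combinations


def pairwise_diversity_per_kb(sequences):
    """Mean number of differences between pairs of aligned sequences, per 1000 sites."""
    if len(sequences) < 2:
        raise ValueError("at least two sequences are required")
    length = len(sequences[0])
    if length == 0 or any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same non-zero length")
    pairs = list(combinations(sequences, 2))
    total_differences = sum(
        sum(a != b for a, b in zip(x, y)) for x, y in pairs
    )
    mean_differences = total_differences / len(pairs)
    return 1000.0 * mean_differences / length


# Toy alignment for illustration only.
toy = ["AAAA", "AAAT", "AATT"]
print(pairwise_diversity_per_kb(toy))
```

In the toy alignment the three pairs differ at 1, 2 and 1 sites, so the mean is 4/3 differences over 4 sites, which is then scaled to a per-kilobase value.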

Reconstructing historical changes in $N_e$ revealed that most datasets exhibited strong exponential growth, consistent with the SI model (figure 4). For each dataset, we calculated the posterior probability (PP) of the SI, SIS and SIR models, and a model of endemic infection that implies a constant effective population size (see the electronic supplementary material, table S3). The endemic model was rejected outright for every dataset (PP $\leq$ 0.002). In 13 cases, the SI model was clearly preferred (PP = 0.62-0.99). In the subtype 1a dataset from Belgium, SI dynamics were most probable (PP = 0.44), but there was also support for the SIS (PP = 0.36) and SIR models (PP = 0.20). Only in one example (subtype 3a in Belgium) was the SIS model most probable (PP = 0.88). The preference for the simpler SI dynamics in most of the datasets is evidence that these epidemics have neither reached dynamic equilibrium, as in the SIS model, nor begun to burn out, as in the SIR model. All the epidemics except one (subtype 4a in Egypt) appear to have emerged during the

In three datasets, the SIR model was preferred over the others: subtype 2c in Argentina, 6a in Hong Kong and 6f in Thailand. Only in the case of the SIR model can all the epidemiological parameters be estimated directly from genetic data alone. Consequently, we were able to estimate $R_0$ and reconstruct historical changes in prevalence for these three epidemics. Because the total number of hosts is a parameter, we were able to obtain separate estimates for prevalence (as a proportion) and the total number of infected hosts.

HCV-2c is generally uncommon but in the Córdoba province of Argentina it is the dominant subtype, found in 50 per cent of cases or more [54, 58]. From 1880 to 1920, the central regions of Argentina, of which Córdoba is part, received an influx of European migration, mainly from Italy where subtype 2c is also common [54]. The PP of SIR dynamics in HCV-2c in Córdoba was 53.8 per cent, with the SIS model next most likely (PP = 45.4%). We reconstructed historical changes in the number of infected individuals and prevalence under the SIR model (figure 5). The $T_{MRCA}$ was dated to between 1915 and 1936. Initially, the epidemic grew exponentially with a doubling time ($\log(2)/r_0$) between 3.6 and 6.7 years (see the electronic supplementary material, table S3). We estimated that the epidemic peaked some time between 1969 and 2002 and has fallen since.

Subtype 6a is common in Hong Kong, accounting for 23.6 per cent of all HCV infections and 58.5 per cent of HCV infections in intravenous drug users [61]. It is a relatively recent epidemic [55]. The rarity of HCV-6a in China led to the suggestion that HCV-6a was introduced from Vietnam, where it is dominant, during peaks of immigration around 1979 and 1992 [61]. SIR dynamics were most probable in this dataset (PP = 71.0%), but there was also some support for the SIS model (PP = 28.7%). We dated the $T_{MRCA}$ to between 1952 and 1962, following which the number of infections grew rapidly with a doubling time between 0.7 and 3.8 years. We estimated that the number of HCV-6a infections in Hong Kong peaked in 1986, with a broad 95 per cent credible interval of 1963-1993.
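The doubling times quoted above are related to the intrinsic growth rate by log(2)/r0, so the reported intervals can be converted back into growth rates. The sketch below does only that conversion; the function names are hypothetical, and the intervals are the ones stated in the text.

```python
import math


def doubling_time(r0):
    """Doubling time during exponential growth, log(2) / r0 (r0 must be positive)."""
    if r0 <= 0:
        raise ValueError("doubling time is defined only for positive growth rates")
    return math.log(2.0) / r0


def growth_rate(doubling_time_years):
    """Intrinsic growth rate implied by a doubling time, log(2) / doubling time."""
    if doubling_time_years <= 0:
        raise ValueError("doubling time must be positive")
    return math.log(2.0) / doubling_time_years


# Doubling-time intervals reported in the text (years).
reported = {"HCV-2c, Cordoba": (3.6, 6.7), "HCV-6a, Hong Kong": (0.7, 3.8)}
for name, (shortest, longest) in reported.items():
    low, high = growth_rate(longest), growth_rate(shortest)
    print("%s: r0 between %.3f and %.3f per year" % (name, low, high))
```

Note that the longest doubling time corresponds to the smallest growth rate, so the interval endpoints swap when converting.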

The many subtypes of HCV type 6 are distributed throughout Asia, but HCV-6f appears to be restricted to Thailand, where it is the most common form (56%) [48]

[Figure panels; time axis 1900-2000]

[56, 58]. These estimates are indicated in figure 5 by the intersection of the red lines. In all three cases, prevalence estimated by independent epidemiological investigation fell within the 95 per cent credible interval of prevalence reconstructed from genetic data.

Using a metapopulation model of pathogen populations, we have developed a new approach for integrated genetic and epidemiological inference. We derived a formula for the effective population size in a pathogen population that reconciles previous results [8, 31, 33] and provides rationale for widely used genetic analyses. Specifically, we showed that using exponential and logistic growth curves to analyse historical changes in pathogen effective population size is equivalent to assuming underlying SI and SIS dynamics when co-infection is absent.

Using BEAST to implement our models, we conducted a meta-analysis of 18 HCV datasets from across the world. As expected, we found the age of the MRCA to be the strongest predictor of the diversity of an epidemic. Surprisingly, however, there was no relationship between intrinsic growth rate and diversity after accounting for age of the MRCA, population density and subtype. This observation is consistent with rapid growth during the exponential phase of the epidemics. Under rapid growth, the MRCA is only marginally younger than the epidemic. Therefore, it follows that HCV diversity can be used as a rough guide to the age of an epidemic.

We found evidence for SIR dynamics in three datasets: subtype 2c in Argentina, 6a in Hong Kong and 6f in Thailand. Using the coalescent SIR model, we were able to directly estimate the basic reproductive number and historical changes in prevalence and in the absolute number of infected hosts in these epidemics. We obtained similar estimates of $R_0$ in the three epidemics (1.2-1.4), although there was substantial uncertainty. This value is considerably lower than previous estimates, largely because the duration of the infectious period that we estimated (1.2-1.6 years) was substantially shorter than the 10-30 years that have previously been supposed [16]. Estimating short infectious periods for hepatitis C is surprising in view of the nature of the disease, which is chronic in 80 per cent of people and has lifelong infectivity [17, 18]. One possible interpretation could be that the majority of transmission occurs shortly after infection. However, the broad 95 per cent credible intervals were consistent with infectious periods up to 27, 14 and 40 years, respectively.
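The link between a short estimated infectious period and a low $R_0$ can be made explicit from the definitions of the intrinsic growth rate and basic reproductive number in the SIS and SIR models. Writing $d = 1/\gamma$ for the mean duration of infection (a symbol introduced here only for illustration):

```latex
R_0 \;=\; \frac{\beta_1}{\gamma}
    \;=\; \frac{r_0 + \gamma}{\gamma}
    \;=\; 1 + \frac{r_0}{\gamma}
    \;=\; 1 + r_0\, d ,
\qquad d = \frac{1}{\gamma}.
```

For a fixed intrinsic growth rate, $R_0$ therefore rises linearly with the duration of infection. As a purely illustrative example with a hypothetical $r_0 = 0.2$ per year, $d = 1.5$ years gives $R_0 = 1.3$, whereas $d = 20$ years gives $R_0 = 5$.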

There may also be an element of ascertainment bias to this result because we can infer only SIR dynamics and $R_0$ once an epidemic has passed its peak, which is likely to occur sooner when $R_0$ is smaller. However, the three epidemics exhibiting SIR dynamics shared features in common other than $R_0$. All three were globally rare but locally dominant subtypes. The Argentinean and Hong Kong epidemics appear to have been introduced originally by migration [54, 61], while both the Hong Kong and Thai epidemics emerged relatively recently. Dynamical modelling shows that the number of infectious individuals falls when the number of susceptible individuals becomes exhausted. Why this should occur more quickly in these epidemics than the global subtype 1a and 1b outbreaks is unclear, but may depend on mode of transmission, the behaviour of risk groups, local competition between subtypes and virological differences.

[Figure panel: 2c Argentina; time axis 1900-2000]

Our approach has a number of assumptions and limitations, chief among which is the assumption that the number of infected hosts is large. Although this assumption is consistent with the use of deterministic compartmental models, it cannot possibly be true at the beginning of the epidemic.

There are a number of promising avenues for incorporating stochasticity into combined genetic and epidemiological models. Particle Markov Chain Monte Carlo (MCMC) has been developed to fit stochastic, nonlinear dynamics to gene genealogies, although currently the genealogy is assumed to be known [62]. Branching processes have been used as an alternative to the coalescent; however, the approach is currently limited to simple birth-death processes [63, 64]. Stochastic demography is readily incorporated into the coalescent [65], and this will be an area of further investigation.

Random mixing is a common assumption in compartmental models of epidemiological dynamics that is difficult to justify empirically. Theoretical work shows that variance in network connectivity substantially affects epidemiological dynamics and hence genetic diversity [31, 66-68]. There is hope that such variability can be handled using a more general formulation of the metapopulation model than was needed here [34], in which different classes of hosts, such as super shedders, are explicitly modelled. Another of our assumptions, that co-infection is absent, is likely to prove more difficult to overcome. When there is co-infection, recombination can occur. We found evidence of recombination in some HCV datasets, which we excluded from further analysis. Although attempts have been made to incorporate recombination into population genetic inference [69], these methods are generally computationally prohibitive.

There are a number of other extensions to our approach that we have left for future research. Changes in the size of the host population are readily incorporated into our model, and this might prove fruitful for inference if independent data are available to disentangle the effects of host and pathogen population dynamics, for instance by coupling an analysis of host and pathogen genetic diversity in BEAST. When there is no more than a single pathogen sequence per host, as we assumed here, longitudinal sampling is straightforward to account for using the standard technique [70] as implemented in BEAST, with no adjustments necessary to the model. When there are multiple pathogen sequences per host, the genealogy of the metapopulation is conceptually divided into the scattering and collecting phases [34], which correspond informally to within- and between-host evolution, respectively. New apparatus would be required for inference in this situation.

For our analyses, we used a simple HKY85 substitution model [71], ignoring heterogeneity in the molecular clock rate between sites, codon positions and branches of the tree. However, detailed analyses suggest that such heterogeneity does occur in HCV [26, 72]. One of the benefits of implementing our approach in BEAST is that this complexity can be readily incorporated in future analyses. There has been considerable variation in the estimates of the molecular clock rate in HCV [72]. We assumed a clock rate of $0.58 \times 10^{-3}$ substitutions per site per year, which was estimated for the NS5B gene [73], and was previously applied to a number of the datasets we analysed. However, there is evidence to suggest that the rate may be closer to $1.0 \times 10^{-3}$ per site per year [26, 72]. The effect of underestimating the clock rate would be to systematically overestimate the age of events during the epidemic history, placing them too far in the past, while overlooking uncertainty and heterogeneity in the clock rate will cause the credible intervals for some of our parameters and dates to be anti-conservative.
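The direction and rough size of the clock-rate effect can be illustrated with a first-order rescaling. The sketch assumes a strict clock and holds the genetic distance (substitutions per site) fixed, so that an age scales inversely with the rate; a full reanalysis would be needed for real estimates. The function name and the example age are hypothetical, while the two rates are those quoted in the text.

```python
def rescale_age(age_years, assumed_rate, alternative_rate):
    """Age implied under alternative_rate, keeping substitutions per site fixed.

    distance = age * rate, so age_alternative = age * assumed_rate / alternative_rate.
    """
    if assumed_rate <= 0 or alternative_rate <= 0:
        raise ValueError("clock rates must be positive")
    return age_years * assumed_rate / alternative_rate


RATE_USED = 0.58e-3          # substitutions per site per year, as assumed in the text
RATE_ALTERNATIVE = 1.0e-3    # the faster rate mentioned in the text

hypothetical_age = 80.0      # years before sampling; illustrative value only
print(rescale_age(hypothetical_age, RATE_USED, RATE_ALTERNATIVE))
```

Under this approximation every age shrinks by the same factor (the ratio of the two rates) if the faster rate is correct, which is the sense in which a rate that is too low places events too far in the past.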

One of the important points our work demonstrates is that there are limits to what may be inferred about epidemiological dynamics from genetic data. For example, 13 of the 18 datasets were best fit by the simplest, SI model. Although this model contains none of the biological complexity inherent to HCV epidemiology, on statistical grounds, there was no support for even modest elaborations of the SIS or SIR models. The SI, SIS and SIR models may be caricatures of true epidemiological dynamics, but they capture key features of epidemic processes, including exponential, plateau and burn-out phases. In this study, we directly compared the goodness-of-fit of endemic, SI, SIS and SIR models. In practice, a useful approach might be to include the non-parametric Bayesian skyline plot [74] in the model comparison [72] . This would allow rejection of the parametric models if none adequately described the population history of the sample. In such a case, the Bayesian skyline plot might help motivate and direct the construction of new, more realistic, parametric models via our metapopulation approach.

Another limitation of genetic inference, revealed by our theoretical results and in agreement with previous work [16], is that $R_0$ cannot be directly estimated from genetic data in the coalescent SIS model because, although the intrinsic growth rate ($r_0$) is well identified, the transmission coefficient ($\beta_1$) and rate of loss of infection ($\gamma$) cannot be disentangled. In stochastic models, $\beta_1$ and $\gamma$ and therefore $R_0$ can, in principle, be deconfounded, but if deterministic models are any guide, precise estimates cannot be expected unless additional information is available concerning, for example, the rate of clearance or prevalence. Fortunately, $r_0$ will often be a convenient proxy for $R_0$ because it exhibits the same threshold behaviour: when $r_0 \geq 0$ (equivalently, $R_0 \geq 1$), the infection persists in the population and when $r_0 < 0$ (equivalently, $R_0 < 1$), the epidemic dies out. The intrinsic growth rate is well identified from genetic data during the exponential growth period of the epidemic, in contrast to $R_0$, which is not even well defined under the SI model.

Based on comparisons to independent estimates, the SIR model appeared to provide good predictions of prevalence (figure 5). However, we saw that only once an epidemic had peaked could the SIR model be fitted (figure 4). This has repercussions for the utility of genetic analysis for predicting an outbreak in real time. Although the intrinsic growth rate can be estimated during the exponential growth phase of the epidemic, it is not sufficient to predict the course of the epidemic. Independent estimates of quantities such as the duration of infection and point prevalence would be needed for prediction. Consequently, the role of genetic analysis in real-time prediction of outbreaks will be to complement, but not replace, epidemiological approaches.

The metapopulation analogy provides a firm grounding for combining population genetics and epidemiology. We have shown how it can be used to derive coalescent models with underlying SI, SIS and SIR dynamics that are readily used for practical analysis. With richer genetic data, it will become possible to detect microevolution on epidemiological timescales in many more pathogen species [30]. Joint genetic and epidemiological inference is a fertile area for research, and the machinery underlying our metapopulation approach [34] provides building blocks for arbitrary elaboration on the basic pattern we explored here.

To obtain the effective population size for the metapopulation model, we adapted the results of Wakeley & Aliacar [34] and Wakeley [35], assuming haploidy and the propagule-pool model [41] for colonization (equation (3.1)). To model changes in metapopulation dynamics over time, we used simple SI, SIS and SIR compartmental models (figure 1). For parameter estimation, we made the simplifying assumption that co-infection is negligible. In the case of the SI and SIS models, we were able to obtain analytical solutions for the effective population size using the following closed-form solutions for the proportion of susceptible hosts, $S$, as a function of time. For the SI model,

$$S = \frac{S_0}{S_0 + (1 - S_0)e^{-\beta_1 t}}, \qquad I = 1 - S. \tag{5.1}$$

For the SIS model,

$$S = \frac{\beta_1 S_0 - \gamma + \gamma(1 - S_0)e^{-(\beta_1 - \gamma)t}}{\beta_1 S_0 - \gamma + \beta_1(1 - S_0)e^{-(\beta_1 - \gamma)t}}, \qquad I = 1 - S. \tag{5.2}$$
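As a minimal sketch, the closed-form SIS solution in equation (5.2) can be evaluated directly. The code reads $t$ as time before the present, which is how the formula behaves ($S = S_0$ at $t = 0$, and $S$ approaches 1 as $t$ grows when $\beta_1 > \gamma$). It treats the SI model as the $\gamma = 0$ special case of equation (5.2); that shortcut and the function names are assumptions of this illustration.

```python
import math


def susceptible_sis(t, beta1, gamma, s0):
    """Proportion susceptible at time t before the present, from equation (5.2)."""
    decay = (1.0 - s0) * math.exp(-(beta1 - gamma) * t)
    numerator = beta1 * s0 - gamma + gamma * decay
    denominator = beta1 * s0 - gamma + beta1 * decay
    return numerator / denominator


def susceptible_si(t, beta1, s0):
    """SI case, taken here as equation (5.2) with gamma set to zero."""
    return susceptible_sis(t, beta1, 0.0, s0)


def infected(s):
    """Proportion infected for the SI and SIS models, I = 1 - S."""
    return 1.0 - s


if __name__ == "__main__":
    # Hypothetical parameter values, for illustration only.
    for t in (0.0, 1.0, 5.0):
        s = susceptible_sis(t, beta1=2.0, gamma=1.0, s0=0.7)
        print(t, s, infected(s))
```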

All parameter definitions are summarized in the electronic supplementary material, table S1. For the SIR model, a solution for $S$ cannot be obtained analytically. However, assuming that the number of recovered individuals is initially zero gives the relationship

$$I = 1 - S + \frac{\gamma}{\beta_1}\log(S). \tag{5.3}$$

This simplifies the system of differential equations in the SIR model to a single ordinary differential equation that can be solved numerically:

$$\frac{dS}{dt} = \beta_1 S(1 - S) + \gamma S\log(S). \tag{5.4}$$
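The reduction to equation (5.4) can be written out step by step. The sketch below assumes the usual SIR equations for proportions in forward time $\tau$, with $S \approx 1$ and $I \approx 0$ at the start of the epidemic, and takes $t$ to run backwards from the present so that $d/dt = -d/d\tau$; the symbol $\tau$ is notation introduced here.

```latex
\begin{align*}
\frac{dS}{d\tau} &= -\beta_1 S I, \qquad
\frac{dI}{d\tau} = \beta_1 S I - \gamma I
\;\Longrightarrow\;
\frac{dI}{dS} = -1 + \frac{\gamma}{\beta_1 S}, \\
I &= 1 - S + \frac{\gamma}{\beta_1}\log(S)
\qquad \text{(integrating from } S = 1,\ I = 0\text{)}, \\
\frac{dS}{dt} &= -\frac{dS}{d\tau} = \beta_1 S I
= \beta_1 S(1 - S) + \gamma S \log(S).
\end{align*}
```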

In the coalescent with demographic growth, the pairwise coalescence rate is the inverse of the effective population size, and calculation of the probability density of a genealogy under the coalescent model requires the calculation of the integrated coalescence rate [13]:

$$\Lambda(t) = \int_0^t \frac{1}{N_e(u)}\,du \tag{5.5}$$

(elsewhere we suppress the dependency on time to avoid cluttered notation). Assuming no co-infection ($\beta_2 = 0$), we can write this integral as a differential equation

$$\frac{d\Lambda}{dt} = \frac{1}{N_e} = \frac{\left(1 - S_0 + \gamma\log(S_0)/\beta_1\right)S}{N_0 S_0\left(1 - S + \gamma\log(S)/\beta_1\right)}. \tag{5.6}$$

Because the effective population size is dependent on $S$, equations (5.4) and (5.6) define a system of differential equations to be solved together. We implemented this as an extension to BEAST [36] in Java using a fifth-order Cash–Karp Runge–Kutta method with adaptive stepsize control [47]. We also reimplemented the logistic growth function in BEAST because our parametrization for the SIS model uses $N_0$, the effective population size at the present, rather than the carrying capacity. Example XML code and details of the Bayesian analysis are provided in the electronic supplementary material, text S1.
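The coupled system can be sketched in a few lines. This is not the BEAST extension described above: it swaps the adaptive fifth-order Cash–Karp scheme for a plain fixed-step fourth-order Runge–Kutta integrator to stay short, and the function names and parameter values are hypothetical. It integrates equations (5.4) and (5.6) from the present ($t = 0$, $S = S_0$, $\Lambda = 0$) backwards in time.

```python
import math


def sir_rhs(s, beta1, gamma, s0, n0):
    """Right-hand sides of equations (5.4) and (5.6) at susceptible proportion s."""
    infected = 1.0 - s + gamma * math.log(s) / beta1
    infected0 = 1.0 - s0 + gamma * math.log(s0) / beta1
    ds_dt = beta1 * s * infected  # equals beta1*S*(1 - S) + gamma*S*log(S)
    dlam_dt = infected0 * s / (n0 * s0 * infected)  # pairwise rate 1 / N_e
    return ds_dt, dlam_dt


def integrate_sir_coalescent(beta1, gamma, s0, n0, t_end, steps):
    """Fixed-step RK4 for (S, Lambda) from t = 0 (the present) back to t_end."""
    dt = t_end / steps
    s, lam = s0, 0.0
    times, s_path, lam_path = [0.0], [s], [lam]
    for k in range(steps):
        # The system is autonomous and Lambda does not feed back into S,
        # so each stage only needs the intermediate value of S.
        k1s, k1l = sir_rhs(s, beta1, gamma, s0, n0)
        k2s, k2l = sir_rhs(s + 0.5 * dt * k1s, beta1, gamma, s0, n0)
        k3s, k3l = sir_rhs(s + 0.5 * dt * k2s, beta1, gamma, s0, n0)
        k4s, k4l = sir_rhs(s + dt * k3s, beta1, gamma, s0, n0)
        s += dt * (k1s + 2.0 * k2s + 2.0 * k3s + k4s) / 6.0
        lam += dt * (k1l + 2.0 * k2l + 2.0 * k3l + k4l) / 6.0
        times.append((k + 1) * dt)
        s_path.append(s)
        lam_path.append(lam)
    return times, s_path, lam_path


if __name__ == "__main__":
    # Hypothetical values; they must give a positive infected proportion at t = 0.
    ts, ss, lams = integrate_sir_coalescent(2.0, 1.0, 0.6, 100.0, 5.0, 1000)
    print(ss[-1], lams[-1])
```

A fixed step is adequate only while the infected proportion stays well above zero; further back in time the coalescence rate grows rapidly, which is where adaptive stepsize control becomes useful.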

We searched the literature for HCV datasets with well-described sampling frames for which subtype, sampling location, prevalence and NS5B gene sequences were available. We initially identified 28 datasets, but excluded 10 that, on further investigation, had small sample size (fewer than 20 sequences), evidence of recombination or questionable sampling. To test for recombination, we used a simple permutation test based on the correlation between physical distance and three measures of linkage disequilibrium ($r^2$, $|D'|$ and G4), implemented as part of OMEGAMAP [75]. We excluded a dataset if the null hypothesis of no recombination was rejected at the 5 per cent level by any of the three tests. This is not unduly conservative because of the similarity between the measures of linkage disequilibrium. Details of all 28 datasets are available in the electronic supplementary material, text S2. We performed multiple sequence alignment using the GENEIOUS alignment tool [76], both to produce a global alignment of all sequences and to align sequences within the same dataset where an alignment was not available. All the alignments that we analysed are available in the electronic supplementary material, dataset S1.
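The logic of a permutation test of this kind can be sketched as follows. This is not the OMEGAMAP implementation: it assumes that site positions are shuffled to generate the null distribution, that the test is one-sided (recombination is expected to make linkage disequilibrium decline with distance), and that a matrix of pairwise linkage disequilibrium values has already been computed. All names are hypothetical.

```python
import random


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def distance_ld_correlation(positions, ld):
    """Correlation between physical distance and LD over all pairs of sites."""
    distances, values = [], []
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            distances.append(abs(positions[i] - positions[j]))
            values.append(ld[i][j])
    return pearson(distances, values)


def permutation_p_value(positions, ld, n_perm=999, seed=1):
    """One-sided p-value: how often shuffled positions give a correlation
    at least as negative as the observed one."""
    rng = random.Random(seed)
    observed = distance_ld_correlation(positions, ld)
    shuffled = list(positions)
    as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if distance_ld_correlation(shuffled, ld) <= observed:
            as_extreme += 1
    return (as_extreme + 1) / (n_perm + 1)
```

With this sketch, a dataset would be flagged when the returned p-value falls below 0.05 for any of the linkage disequilibrium measures supplied as `ld`.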

For each of the 18 datasets that met our inclusion criteria, we calculated mean pairwise genetic diversity ($\pi$) and collated data on subtype, prevalence, host population size and population density (see the electronic supplementary material, text S2). We obtained point estimates of $T_{\mathrm{MRCA}}$, $N_0$ and $r_0$ averaged over models. We used multiple regression to explore the effect of these covariates on $\pi$. In the final model, we included all statistically significant covariates and $r_0$, as we had strong prior interest in the inferred regression coefficient for this covariate.
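As an illustration of the diversity statistic, mean pairwise genetic diversity can be computed from an alignment as below. The sketch assumes the statistic is the uncorrected proportion of differing sites averaged over all pairs of sequences, with gapped sites skipped; the text does not specify these details, and the function names are hypothetical.

```python
from itertools import combinations


def pairwise_difference(seq_a, seq_b):
    """Proportion of aligned, ungapped sites at which two sequences differ."""
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue  # skip sites with a gap in either sequence
        compared += 1
        if a != b:
            differing += 1
    return differing / compared if compared else 0.0


def mean_pairwise_diversity(sequences):
    """Average of pairwise_difference over all unordered pairs of sequences."""
    pairs = list(combinations(sequences, 2))
    return sum(pairwise_difference(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy alignment, for illustration only.
    print(mean_pairwise_diversity(["ACGT", "ACGA", "TCGA"]))
```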

Infectious diseases of humans: dynamics and control

Strategies for mitigating an influenza pandemic

The foot-and-mouth epidemic in Great Britain: pattern of spread and impact of interventions

Transmission dynamics and control of severe acute respiratory syndrome

Pandemic potential of a strain of influenza A (H1N1): early findings

Origins and evolutionary genomics of the 2009 swine-origin H1N1 influenza A epidemic

The origin of the Haitian cholera outbreak strain

Germs, genomes and genealogies

Evolutionary analysis of the dynamics of viral infectious disease

Transmission pathways of foot-and-mouth disease virus in the United Kingdom in

Parallel bacterial evolution within multiple patients identifies candidate pathogenicity genes

Unifying the epidemiological and evolutionary dynamics of pathogens

Pairwise comparisons of mitochondrial DNA sequences in stable and exponentially growing populations

An integrated framework for the inference of viral population history from reconstructed genealogies

Coalescent theory

The epidemic behavior of the hepatitis C virus

Evolving epidemiology of hepatitis C virus

Global epidemiology of hepatitis C virus infection

The origin of hepatitis C virus genotypes

Genetic diversity and evolution of hepatitis C virus: 15 years on

The epidemiology and iatrogenic transmission of hepatitis C virus in Egypt: a Bayesian coalescent approach

Viral gene sequences reveal the variable history of hepatitis C virus infection among countries

Investigating the origin and spread of hepatitis C virus genotype 5a

Population genetic history of hepatitis C virus 1b infection in China

The global spread of hepatitis C virus 1a and 1b: a phylodynamic and phylogeographic analysis

Variable epidemic histories of hepatitis C virus genotype 2 infection in West Africa and Cameroon

Genetic history of hepatitis C virus in East Asia

The hepatitis C virus epidemic among injecting drug users

Insights from genomics into bacterial pathogen populations

Rates of coalescence for common epidemiological models at equilibrium

Phylodynamics of infectious disease epidemics

Viral phylodynamics and the search for an 'effective number of infections'

Gene genealogies in a metapopulation

Metapopulation models for historical inference

Bayesian phylogenetics with BEAUti and the BEAST 1.7

Evolution in changing environments; some theoretical explorations

Some demographic and genetic consequences of environmental heterogeneity for biological control

Superinfection, metapopulation dynamics, and the evolution of diversity

Temporally structured metapopulation dynamics and persistence of influenza A H3N2 virus in humans

Gene flow and genetic drift in a species subject to frequent local extinctions. Theor

Breeding structure of populations in relation to speciation

Neutral genetic diversity in a metapopulation with recurrent local extinction and recolonization

Segregating sites in Wright's island model. Theor

Nonequilibrium migration in human history

The coalescent in an island model of population subdivision with variation among demes. Theor

Numerical recipes in C++

Geographic distribution of hepatitis C virus genotype 6 subtypes in Thailand

New trends of HCV infection in China revealed by genetic analysis of viral sequences determined from first-time volunteer blood donors

Evolutionary history of hepatitis C virus genotype 5a in France, a multicenter ANRS study

Tracing hepatitis C and Delta viruses to estimate their contribution in HCC rates in Mongolia

The epidemic history of hepatitis C among injecting drug users in Flanders

High prevalence of hepatitis C virus genotype 6 in Vietnam. Asian Pac

Phylodynamics of hepatitis C virus subtype 2c in the province of Córdoba

Molecular tracing of the global hepatitis C virus epidemic predicts regional patterns of hepatocellular carcinoma mortality

World Health Organization. 1999 Hepatitis C: global prevalence (update)

World population data sheet

¿Por qué el virus de le hepatitis C en Cruz del Eje?

The changing epidemiology of hepatitis C virus infection in Europe

Consequences of recombination on traditional phylogenetic analysis

A possible geographic origin of endemic hepatitis C virus 6a in Hong Kong: evidences for the association with Vietnamese immigration

Inference for nonlinear epidemiological models using genealogies and time series

Inferring epidemic contact structure from phylogenetic trees

Estimating the basic reproductive number from viral sequence data

Ancestral inference on gene trees under selection. Theor

Pathogen genetic variation in small-world host contact structures

Genetic diversity in the SIR model of pathogen evolution

SIR dynamics in random networks with heterogeneous connectivity

Unifying vertical and nonvertical evolution: a stochastic ARG-based framework

Coalescent approaches to HIV-1 population genetics

Dating of the human-ape splitting by a molecular clock of mitochondrial DNA

The mode and tempo of hepatitis C virus evolution within and among hosts

A comparison of the molecular clock of hepatitis C virus in the United States and Japan predicts that hepatocellular carcinoma incidence in the United States will increase over the next two decades

Bayesian coalescent inference of past population dynamics from molecular sequences

Estimating diversifying selection and functional constraint in the presence of recombination